Re: Timestamp as a key good practice?

Jean-Marc Spaggiari Sat, 16 Jun 2012 07:42:59 -0700

Let's imagine the timestamp is "123456789".

If I salt it with a letter from 'a' to 'z', then it will always be split
between a few RegionServers. I will have keys like "t123456789". The issue is
that I will have to do 26 queries to be able to find all the entries.
I will need to query from A000000000 to Axxxxxxxxx, then the same for B,
and so on.
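
To make the cost of salting concrete, here is a minimal sketch in plain Python, not the HBase client API. A sorted in-memory list stands in for the table, and the names `salted_key`, `range_scan` and `scan_older_than` are made up for the illustration. Deriving the salt from the file ID is also an assumption; any stable choice of letter would do.

```python
import bisect
import heapq
import string

SALTS = string.ascii_lowercase  # 26 one-letter prefixes, 'a'..'z'
WIDTH = 9                       # zero-padded so string order matches numeric order


def salted_key(timestamp: int, file_id: str) -> str:
    """Build a row key such as 't123456789' (hypothetical salt: derived from the file ID)."""
    salt = SALTS[sum(file_id.encode()) % len(SALTS)]
    return f"{salt}{timestamp:0{WIDTH}d}"


def range_scan(rows, start, stop):
    """Stand-in for one table scan over [start, stop) on rows sorted by key."""
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (stop,))
    return rows[lo:hi]


def scan_older_than(rows, max_timestamp):
    """Run one range scan per salt, then merge the partial results by timestamp."""
    per_salt = []
    for salt in SALTS:
        start = f"{salt}{0:0{WIDTH}d}"
        stop = f"{salt}{max_timestamp:0{WIDTH}d}"
        per_salt.append(range_scan(rows, start, stop))
    # Each partial result is already sorted; compare on the key without its salt.
    return list(heapq.merge(*per_salt, key=lambda row: row[0][1:]))


# Rows are (row_key, file_id) pairs kept in key order, as a table scan would see them.
rows = sorted(
    (salted_key(ts, fid), fid)
    for ts, fid in [(300, "f1"), (100, "f2"), (200, "f3"), (900, "f4")]
)
oldest_first = scan_older_than(rows, 500)
```

Every query pays for 26 scans plus a merge on the client side, no matter how few rows actually match.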


So what's worse? Am I better off dealing with the hotspotting? Salting the
key myself? Or what if I use something like HBaseWD?

JM

2012/6/16, Michel Segel <[email protected]>:
> You can't salt the key in the second table.
> By salting the key, you lose the ability to do range scans, which is what
> you want to do.
>
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]>
> wrote:
>
>> Thanks all for your comments and suggestions. Regarding the
>> hotspotting I will try to salt the key in the 2nd table and see the
>> results.
>>
>> Yesterday I finished installing my 4-server cluster with old machines.
>> It's slow, but it's working. So I will do some testing.
>>
>> You are recommending modifying the timestamp to be to the second or
>> minute and having more entries per row. Is that because it's better to
>> have more columns than rows? Or is it more because that will allow
>> a more "square" pattern (lots of rows, lots of columns), which is
>> more efficient?
>>
>> JM
>>
>> 2012/6/15, Michael Segel <[email protected]>:
>>> Thought about this a little bit more...
>>>
>>> You will want two tables for a solution.
>>>
>>> 1 Table is  Key: Unique ID
>>>                    Column: FilePath            Value: Full Path to file
>>>                    Column: Last Update time    Value: timestamp
>>>
>>> 2 Table is Key: Last Update time    (The timestamp)
>>>                    Column 1-N: Unique ID       Value: Full Path to the file
>>>
>>> Now if you want to get fancy, in Table 1, you could use the timestamp on
>>> the column File Path to hold the last update time.
>>> But it's probably easier for you to start by keeping the data as a separate
>>> column and ignore the timestamps on the columns for now.
>>>
>>> Note the following:
>>>
>>> 1) I used the notation Column 1-N to reflect that for a given timestamp you
>>> may or may not have multiple files that were updated. (You weren't specific
>>> as to the scale.)
>>> This is a good example of HBase's column-oriented approach, where you may or
>>> may not have a column. It doesn't matter. :-) You could also modify the
>>> timestamp to be to the second or minute and have more entries per row. It
>>> doesn't matter. You insert based on timestamp:columnName, value, so you will
>>> add a column to this table.
>>>
>>> 2) First prove that the logic works. You insert/update table 1 to capture
>>> the ID of the file and its last update time. You then delete the old
>>> timestamp entry in table 2, then insert the new entry in table 2.
>>>
>>> 3) You store Table 2 in ascending order. Then when you want to find your
>>> last 500 entries, you do a start scan at 0x000 and then limit the scan to
>>> 500 rows. Note that you may or may not have multiple entries, so as you walk
>>> through the result set, you count the number of columns and stop when you
>>> have 500 columns, regardless of the number of rows you've processed.
>>>
>>> This should solve your problem and be pretty efficient.
>>> You can then work out the Coprocessors and add it to the solution to be even
>>> more efficient.
>>>
>>>
>>> With respect to 'hot-spotting', it can't be helped. You could hash your unique
>>> ID in table 1; this will reduce the potential of a hotspot as the table
>>> splits.
>>> On table 2, because you have temporal data and you want to efficiently scan
>>> a small portion of the table based on size, you will always scan the first
>>> block. However, as data rolls off and compression occurs, you will probably
>>> have to do some cleanup. I'm not sure how HBase handles splits that no
>>> longer contain data. When you compress an empty split, does it go away?
>>>
>>> By switching to coprocessors, you now limit the update accessors to the
>>> second table so you should still have pretty good performance.
>>>
>>> You may also want to look at Asynchronous HBase, however I don't know how
>>> well it will work with Coprocessors or if you want to perform async
>>> operations in this specific use case.
>>>
>>> Good luck, HTH...
>>>
>>> -Mike
>>>
>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> For now this is more a proof of concept than a production application.
>>>> And if it works, it should grow a lot, and the database at the
>>>> end will easily be over 1B rows. Each individual server will have to
>>>> send its own information to one centralized server, which will insert
>>>> that into a database. That's why it needs to be very quick and that's
>>>> why I'm looking in HBase's direction. I tried some relational
>>>> databases with 4M rows in the table, but the insert time is too slow
>>>> when I have to insert entries in bulk. Also, the ability of HBase
>>>> to keep only the cells with values will save a lot of
>>>> disk space (future projects).
>>>>
>>>> I'm not yet used to HBase and there are still many things I need to
>>>> understand, but until I'm able to create a solution and test it, I will
>>>> continue to read, learn and try that way. Then at the end I will be
>>>> able to compare the 2 options I have (HBase or relational) and decide
>>>> based on the results.
>>>>
>>>> So yes, your reply helped because it's giving me a way to achieve this
>>>> goal (using co-processors). I don't know yet how this part works,
>>>> so I will dig into the documentation for it.
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>>
>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>> Jean-Marc,
>>>>>
>>>>> You do realize that this really isn't a good use case for HBase, assuming
>>>>> that what you are describing is a standalone system.
>>>>> It would be easier and better if you just used a simple relational
>>>>> database.
>>>>>
>>>>> Then you would have your table with an ID, and a secondary index on the
>>>>> timestamp.
>>>>> Retrieve the data in ascending order by timestamp and take the top 500 off
>>>>> the list.
>>>>>
>>>>> If you insist on using HBase, yes you will have to have a secondary table.
>>>>> Then using co-processors...
>>>>> When you update the row in your base table, you
>>>>> then get() the row in your index by timestamp, removing the column for that
>>>>> rowid.
>>>>> Add the new column to the timestamp row.
>>>>>
>>>>> As you put it.
>>>>>
>>>>> Now you can just do a partial scan on your index. Because your index table
>>>>> is so small... you shouldn't worry about hotspots.
>>>>> You may just want to rebuild your index every so often...
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Thanks for your feedback. Here are more details to describe what I'm
>>>>>> trying to achieve.
>>>>>>
>>>>>> My goal is to store information about files into the database. I need
>>>>>> to check the oldest files in the database to refresh the information.
>>>>>>
>>>>>> The key is an 8-byte ID of the server name in the network hosting the
>>>>>> file + MD5 of the file path. Total is a 24-byte key.
>>>>>>
>>>>>> So each time I look at a file and gather the information, I update its
>>>>>> row in the database based on the key, including a "last_update" field.
>>>>>> I can calculate this key for any file on the drives.
>>>>>>
>>>>>> In order to know which files I need to check in the network, I need to
>>>>>> scan the table by the "last_update" field. So the idea is to build another
>>>>>> table which contains the last_update as a key and the file IDs in
>>>>>> columns. (Here is the hotspotting.)
>>>>>>
>>>>>> Each time I work on a file, I will have to update the main table by ID
>>>>>> and remove the cell from the second table (the index) and put it back
>>>>>> with the new "last_update" key.
>>>>>>
>>>>>> I'm mainly doing 3 operations in the database.
>>>>>> 1) I retrieve a list of 500 files which need to be updated
>>>>>> 2) I update the information for those 500 files (bulk update)
>>>>>> 3) I load new file references to be checked.
>>>>>>
>>>>>> For 2 and 3, I use the main table with the file ID as the key. The
>>>>>> distribution is almost perfect because I'm using a hash. The prefix is
>>>>>> the server ID but it's not always going to the same server since it's
>>>>>> done by last_update. But this allows quick access to the list of
>>>>>> files from one server.
>>>>>> For 1, I was planning to build this second table with the
>>>>>> "last_update" as the key.
>>>>>>
>>>>>> Regarding the frequency, it really depends on the activity on the
>>>>>> network, but it should be "often". The faster the database updates
>>>>>> are, the more up to date I will be able to keep it.
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>> Actually I think you should revisit your key design....
>>>>>>>
>>>>>>> Look at your access path to the data for each of the types of queries you
>>>>>>> are going to run.
>>>>>>> From your post:
>>>>>>> "I have a table with a uniq key, a file path and a "last update"
>>>>>>> field.
>>>>>>>>>> I can easily find back the file with the ID and find when it has
>>>>>>>>>> been
>>>>>>>>>> updated.
>>>>>>>>>>
>>>>>>>>>> But what I need too is to find the files not updated for more
>>>>>>>>>> than
>>>>>>>>>> a
>>>>>>>>>> certain period of time.
>>>>>>> "
>>>>>>> So your primary query is going to be against the key.
>>>>>>> Not sure if you meant to say that your key was a composite key or not...
>>>>>>> sounds like your key is just the unique key and the rest are columns in the
>>>>>>> table.
>>>>>>>
>>>>>>> The secondary query or path to the data is to find data where the files were
>>>>>>> not updated for more than a period of time.
>>>>>>>
>>>>>>> If you make your key temporal, that is adding time as a component of your
>>>>>>> key, you will end up creating new rows of data while the old row still
>>>>>>> exists.
>>>>>>> Not a good side effect.
>>>>>>>
>>>>>>> The other nasty side effect of using time as your key is that you not only
>>>>>>> have the potential for hot spotting, but that you also have the nasty side
>>>>>>> effect of creating splits that will never grow.
>>>>>>>
>>>>>>> How often are you going to ask to see the files where they were not updated
>>>>>>> in the last couple of days/minutes? If it's infrequent, then you really
>>>>>>> shouldn't care if you have to do a complete table scan.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>
>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that
>>>>>>>> now.
>>>>>>>>
>>>>>>>> Need to read here at the bottom:
>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>> and here:
>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up
>>>>>>>>> often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>
>>>>>>>>> Otis
>>>>>>>>> ----
>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>>> http://sematext.com/spm
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>
>>>>>>>>>> I watched Lars George's video about HBase and read the documentation,
>>>>>>>>>> and it says that it's not a good idea to have the timestamp as a
>>>>>>>>>> key because that will always load the same region until the timestamp
>>>>>>>>>> reaches a certain value and moves to the next region (hotspotting).
>>>>>>>>>>
>>>>>>>>>> I have a table with a unique key, a file path and a "last update" field.
>>>>>>>>>> I can easily look up the file by its ID and find when it has been
>>>>>>>>>> updated.
>>>>>>>>>>
>>>>>>>>>> But what I also need is to find the files not updated for more than a
>>>>>>>>>> certain period of time.
>>>>>>>>>>
>>>>>>>>>> If I want to retrieve that from this single table, I will have to do a
>>>>>>>>>> full scan of the table, which might take a while.
>>>>>>>>>>
>>>>>>>>>> So I thought of building a table to reference that (a kind of secondary
>>>>>>>>>> index). The key is the "last update", with one column family, and each
>>>>>>>>>> column will have the ID of the file with dummy content.
>>>>>>>>>>
>>>>>>>>>> When a file is updated, I remove its cell from this table, and
>>>>>>>>>> introduce a new cell with the new timestamp as the key.
>>>>>>>>>>
>>>>>>>>>> And so on.
>>>>>>>>>>
>>>>>>>>>> With this schema, I can find the files by ID very quickly and I
>>>>>>>>>> can
>>>>>>>>>> find the files which need to be updated pretty quickly too. But
>>>>>>>>>> it's
>>>>>>>>>> hotspotting one region.
>>>>>>>>>>
>>>>>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>> 2) Salting.
>>>>>>>>>> 3) Key field swap/promotion
>>>>>>>>>> 4) Randomization.
>>>>>>>>>>
>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>>
>>>>>>>>>> I can do salting, like prefixing the timestamp with a number between 0
>>>>>>>>>> and 9. That will distribute the load over 10 servers. To find all
>>>>>>>>>> the files with a timestamp below a specific value, I will need to run
>>>>>>>>>> 10 requests instead of one. But when the load becomes too big for
>>>>>>>>>> 10 servers, I will have to prefix with a byte between 0 and 99, which
>>>>>>>>>> means 100 requests? And the more regions I have, the more requests
>>>>>>>>>> I will have to do. Is that really a good approach?
>>>>>>>>>>
>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes from
>>>>>>>>>> the path before the timestamp, but the issue will remain the same.
>>>>>>>>>>
>>>>>>>>>> I looked at randomization, and I can't do that. Otherwise I will have no
>>>>>>>>>> way to retrieve the information I'm looking for.
>>>>>>>>>>
>>>>>>>>>> So the question is: is there a good way to store the data to retrieve
>>>>>>>>>> them based on the date?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> JM
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
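
For reference, the index-table logic from Mike's notes 2) and 3) quoted above can be sketched as follows. This is only an in-memory illustration in plain Python, not HBase client code: a dict of timestamp -> {file ID: path} stands in for Table 2, and the function names `update_index` and `oldest_entries` are hypothetical.

```python
def update_index(index, file_id, path, old_ts, new_ts):
    """Move one file's column from its old timestamp row to its new one."""
    if old_ts is not None:
        row = index.get(old_ts, {})
        row.pop(file_id, None)
        if not row:
            index.pop(old_ts, None)  # drop rows that no longer hold any column
    index.setdefault(new_ts, {})[file_id] = path


def oldest_entries(index, limit=500):
    """Walk rows in ascending timestamp order and stop after `limit` columns."""
    found = []
    for ts in sorted(index):
        for file_id, path in index[ts].items():
            found.append((ts, file_id, path))
            if len(found) == limit:
                return found
    return found


index = {}
update_index(index, "a", "/x/a", None, 10)  # first time file "a" is seen
update_index(index, "b", "/x/b", None, 10)  # same timestamp: second column in the row
update_index(index, "c", "/x/c", None, 5)
update_index(index, "a", "/x/a", 10, 20)    # "a" refreshed: moves from row 10 to row 20
batch = oldest_entries(index, limit=2)
```

The limit counts columns rather than rows, since one timestamp row may hold several files.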
