You can't salt the key in the second table.
By salting the key, you lose the ability to do range scans, which is what you 
want to do.



Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> 
wrote:

> Thanks all for your comments and suggestions. Regarding the
> hotspotting I will try to salt the key in the 2nd table and see the
> results.
> 
> Yesterday I finished installing my 4-server cluster with old
> machines. It's slow, but it's working. So I will do some testing.
> 
> You are recommending modifying the timestamp to be to the second or
> minute and having more entries per row. Is that because it's better to
> have more columns than rows? Or is it more because that will allow a
> more "squared" pattern (lots of rows, lots of columns), which is
> more efficient?
> 
> JM
> 
> 2012/6/15, Michael Segel <[email protected]>:
>> Thought about this a little bit more...
>> 
>> You will want two tables for a solution.
>> 
>> Table 1:  Key: Unique ID
>>           Column: FilePath            Value: full path to the file
>>           Column: LastUpdateTime      Value: timestamp
>> 
>> Table 2:  Key: Last update time (the timestamp)
>>           Column 1-N: Unique ID       Value: full path to the file
>> 
>> Now if you want to get fancy, in Table 1 you could use the timestamp on
>> the FilePath column to hold the last update time.
>> But it's probably easier for you to start by keeping the data as a separate
>> column and ignore the timestamps on the columns for now.
>> 
>> Note the following:
>> 
>> 1) I used the notation Column 1-N to reflect that for a given timestamp you
>> may or may not have multiple files that were updated. (You weren't specific
>> as to the scale.)
>> This is a good example of HBase's column-oriented approach, where you may or
>> may not have a column. It doesn't matter. :-) You could also modify the
>> timestamp to be to the second or minute and have more entries per row. It
>> doesn't matter. You insert based on timestamp:columnName, value, so you will
>> add a column to this table.
>> 
>> 2) First prove that the logic works. You insert/update table 1 to capture
>> the ID of the file and its last update time. You then delete the old
>> timestamp entry in table 2, then insert the new entry in table 2.
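
The step-2 flow — write the file's row in table 1, then move its entry in the index table — can be sketched with in-memory dicts standing in for the two HBase tables. The dict layout and function name are illustrative, not the HBase API; real code would issue Put/Delete calls against the two tables:

```python
# In-memory stand-ins for the two tables described above.
table1 = {}  # unique file ID    -> {"path": ..., "last_update": ...}
table2 = {}  # last-update time  -> {unique file ID: full path}

def record_update(file_id, path, ts):
    """Insert/update table 1, then move the entry in table 2 (the index)."""
    old = table1.get(file_id)
    if old is not None:
        # Delete the old timestamp entry in table 2.
        cols = table2.get(old["last_update"])
        if cols is not None:
            cols.pop(file_id, None)
            if not cols:               # row has no columns left
                del table2[old["last_update"]]
    table1[file_id] = {"path": path, "last_update": ts}
    # Insert the new entry in table 2, keyed by timestamp.
    table2.setdefault(ts, {})[file_id] = path

record_update("id-1", "/data/a.txt", 100)
record_update("id-1", "/data/a.txt", 200)  # moves the index entry from 100 to 200
```

The delete-then-insert on table2 is the part a coprocessor would later take over.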
>> 
>> 3) You store Table 2 in ascending order. Then, when you want to find your
>> last 500 entries, you start a scan at 0x000 and limit the scan to
>> 500 rows. Note that you may have multiple entries per row, so as you walk
>> through the result set, you count the number of columns and stop when you
>> have 500 columns, regardless of the number of rows you've processed.
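
Step 3 — walking the index in ascending key order and stopping once 500 column entries have been collected, however many rows that takes — can be sketched like this (a sorted dict stands in for the HBase scan; the names are illustrative):

```python
def oldest_n(index, n):
    """Walk rows in ascending timestamp order, collecting columns
    (file IDs) until n entries are gathered, however many rows that takes."""
    out = []
    for ts in sorted(index):          # ascending row-key order
        for file_id in index[ts]:     # 1..N columns per row
            out.append((ts, file_id))
            if len(out) == n:
                return out
    return out

index = {100: {"a": "/a", "b": "/b"}, 200: {"c": "/c"}, 300: {"d": "/d"}}
print(oldest_n(index, 3))  # → [(100, 'a'), (100, 'b'), (200, 'c')]
```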
>> 
>> This should solve your problem and be pretty efficient.
>> You can then work out the Coprocessors and add it to the solution to be even
>> more efficient.
>> 
>> 
>> With respect to 'hot-spotting', it can't be helped. You could hash your unique
>> ID in table 1; this will reduce the potential for a hotspot as the table
>> splits.
>> On table 2, because you have temporal data and you want to efficiently scan
>> a small portion of the table based on size, you will always scan the first
>> block. However, as data rolls off and compaction occurs, you will probably
>> have to do some cleanup. I'm not sure how HBase handles regions that no
>> longer contain data. When you compact an empty region, does it go away?
>> 
>> By switching to coprocessors, you limit the update accesses to the
>> second table, so you should still have pretty good performance.
>> 
>> You may also want to look at Asynchronous HBase; however, I don't know how
>> well it will work with Coprocessors, or whether you want to perform async
>> operations in this specific use case.
>> 
>> Good luck, HTH...
>> 
>> -Mike
>> 
>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>> 
>>> Hi Michael,
>>> 
>>> For now this is more a proof of concept than a production application.
>>> And if it's working, it should grow a lot, and the database at the
>>> end will easily be over 1B rows. Each individual server will have to
>>> send its own information to one centralized server, which will insert
>>> it into a database. That's why it needs to be very quick and that's
>>> why I'm looking in HBase's direction. I tried with some relational
>>> databases with 4M rows in the table, but the insert time is too slow
>>> when I have to introduce entries in bulk. Also, the ability for HBase
>>> to keep only the cells with values will allow saving a lot of
>>> disk space (future projects).
>>> 
>>> I'm not yet used to HBase and there are still many things I need to
>>> understand, but until I'm able to create a solution and test it, I will
>>> continue to read, learn and try that way. Then at the end I will be
>>> able to compare the 2 options I have (HBase or relational) and decide
>>> based on the results.
>>> 
>>> So yes, your reply helped, because it's giving me a way to achieve this
>>> goal (using coprocessors). I don't know yet how this part works,
>>> so I will dig into the documentation for it.
>>> 
>>> Thanks,
>>> 
>>> JM
>>> 
>>> 2012/6/14, Michael Segel <[email protected]>:
>>>> Jean-Marc,
>>>> 
>>>> You do realize that this really isn't a good use case for HBase,
>>>> assuming that what you are describing is a standalone system.
>>>> It would be easier and better if you just used a simple relational
>>>> database.
>>>> 
>>>> Then you would have your table with an ID, and a secondary index on the
>>>> timestamp.
>>>> Retrieve the data in ascending order by timestamp and take the top 500
>>>> off the list.
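
The relational approach Mike describes — an ID-keyed table, a secondary index on the timestamp, and a top-500-by-ascending-timestamp query — can be sketched with SQLite (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id TEXT PRIMARY KEY, path TEXT, last_update INTEGER)")
# The secondary index on the timestamp makes the ORDER BY ... LIMIT cheap.
conn.execute("CREATE INDEX idx_last_update ON files (last_update)")
conn.executemany("INSERT INTO files VALUES (?, ?, ?)",
                 [("a", "/a", 300), ("b", "/b", 100), ("c", "/c", 200)])
oldest = [r[0] for r in conn.execute(
    "SELECT id FROM files ORDER BY last_update ASC LIMIT 500")]
print(oldest)  # → ['b', 'c', 'a']
```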
>>>> 
>>>> If you insist on using HBase, yes you will have to have a secondary
>>>> table.
>>>> Then using co-processors...
>>>> When you update the row in your base table, you
>>>> then get() the row in your index by timestamp, removing the column for
>>>> that
>>>> rowid.
>>>> Add the new column to the timestamp row.
>>>> 
>>>> As you put it.
>>>> 
>>>> Now you can just do a partial scan on your index. Because your index
>>>> table
>>>> is so small... you shouldn't worry about hotspots.
>>>> You may just want to rebuild your index every so often...
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> Thanks for your feedback. Here are more details to describe what I'm
>>>>> trying to achieve.
>>>>> 
>>>>> My goal is to store information about files into the database. I need
>>>>> to check the oldest files in the database to refresh the information.
>>>>> 
>>>>> The key is an 8-byte ID of the server name in the network hosting the
>>>>> file + the 16-byte MD5 of the file path. The total is a 24-byte key.
>>>>> 
>>>>> So each time I look at a file and gather the information, I update its
>>>>> row in the database based on the key, including a "last_update" field.
>>>>> I can calculate this key for any file on the drives.
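
The 24-byte key described above — an 8-byte server-name ID followed by the 16-byte MD5 digest of the file path — can be built like this (the big-endian integer encoding of the server ID is an assumption; the thread only says the ID is 8 bytes):

```python
import hashlib
import struct

def make_key(server_id, file_path):
    # 8-byte big-endian server ID + 16-byte MD5 digest of the path = 24 bytes.
    return struct.pack(">Q", server_id) + hashlib.md5(file_path.encode("utf-8")).digest()

key = make_key(42, "/data/logs/app.log")
assert len(key) == 24  # 8 + 16 bytes
```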
>>>>> 
>>>>> In order to know which files I need to check in the network, I need to
>>>>> scan the table by the "last_update" field. So the idea is to build another
>>>>> table which contains the last_update as the key and the file IDs in
>>>>> columns. (Here is the hotspotting.)
>>>>> 
>>>>> Each time I work on a file, I will have to update the main table by ID
>>>>> and remove the cell from the second table (the index) and put it back
>>>>> with the new "last_update" key.
>>>>> 
>>>>> I'm mainly doing 3 operations in the database.
>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>> 2) I update the information for those 500 files (bulk update).
>>>>> 3) I load new file references to be checked.
>>>>> 
>>>>> For 2 and 3, I use the main table with the file ID as the key. The
>>>>> distribution is almost perfect because I'm using a hash. The prefix is
>>>>> the server ID, but the load is not always going to the same server,
>>>>> since files are processed by last_update. And this allows quick access
>>>>> to the list of files from one server.
>>>>> For 1, I planned to build this second table with
>>>>> "last_update" as the key.
>>>>> 
>>>>> Regarding the frequency, it really depends on the activity on the
>>>>> network, but it should be "often". The faster the database updates
>>>>> are, the more up to date I will be able to keep it.
>>>>> 
>>>>> JM
>>>>> 
>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>> Actually I think you should revisit your key design....
>>>>>> 
>>>>>> Look at your access path to the data for each of the types of queries
>>>>>> you
>>>>>> are going to run.
>>>>>> From your post:
>>>>>> "I have a table with a uniq key, a file path and a "last update"
>>>>>> field.
>>>>>>>>> I can easily find back the file with the ID and find when it has
>>>>>>>>> been
>>>>>>>>> updated.
>>>>>>>>> 
>>>>>>>>> But what I need too is to find the files not updated for more than
>>>>>>>>> a
>>>>>>>>> certain period of time.
>>>>>> "
>>>>>> So your primary query is going to be against the key.
>>>>>> Not sure if you meant to say that your key was a composite key or
>>>>>> not... it sounds like your key is just the unique key and the rest are
>>>>>> columns in the table.
>>>>>> 
>>>>>> The secondary query or path to the data is to find data where the
>>>>>> files
>>>>>> were
>>>>>> not updated for more than a period of time.
>>>>>> 
>>>>>> If you make your key temporal, that is, adding time as a component of
>>>>>> your key, you will end up creating new rows of data while the old row
>>>>>> still exists.
>>>>>> Not a good side effect.
>>>>>> 
>>>>>> The other nasty side effect of using time as your key is that you not
>>>>>> only have the potential for hotspotting, you also end up creating
>>>>>> regions that will never grow.
>>>>>> 
>>>>>> How often are you going to ask to see the files that were not updated
>>>>>> in the last couple of days/minutes? If it's infrequent, then you really
>>>>>> shouldn't care if you have to do a complete table scan.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>> 
>>>>>>> Wow! This is exactly what I was looking for. So I will read all of
>>>>>>> that
>>>>>>> now.
>>>>>>> 
>>>>>>> Need to read here at the bottom: https://github.com/sematext/HBaseWD
>>>>>>> and here:
>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> JM
>>>>>>> 
>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes
>>>>>>>> up
>>>>>>>> often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>> 
>>>>>>>> Otis
>>>>>>>> ----
>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>> http://sematext.com/spm
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> ________________________________
>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>> 
>>>>>>>>> I watched Lars George's video about HBase and read the
>>>>>>>>> documentation, and it says that it's not a good idea to have the
>>>>>>>>> timestamp as a key, because that will always load the same region
>>>>>>>>> until the timestamp reaches a certain value and moves to the next
>>>>>>>>> region (hotspotting).
>>>>>>>>> 
>>>>>>>>> I have a table with a unique key, a file path and a "last update"
>>>>>>>>> field.
>>>>>>>>> I can easily find the file with the ID and find when it has
>>>>>>>>> been updated.
>>>>>>>>> 
>>>>>>>>> But what I need too is to find the files not updated for more
>>>>>>>>> than a certain period of time.
>>>>>>>>> 
>>>>>>>>> If I want to retrieve that from this single table, I will have to
>>>>>>>>> do a full scan of the table, which might take a while.
>>>>>>>>> 
>>>>>>>>> So I thought of building a table to reference that (a kind of
>>>>>>>>> secondary index). The key is the "last update", with one column
>>>>>>>>> family, and each column will have the ID of the file with a dummy
>>>>>>>>> content.
>>>>>>>>> 
>>>>>>>>> When a file is updated, I remove its cell from this table, and
>>>>>>>>> introduce a new cell with the new timestamp as the key.
>>>>>>>>> 
>>>>>>>>> And so on.
>>>>>>>>> 
>>>>>>>>> With this schema, I can find the files by ID very quickly, and I can
>>>>>>>>> find the files which need to be updated pretty quickly too. But it's
>>>>>>>>> hotspotting one region.
>>>>>>>>> 
>>>>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>>>>> 1) Hotspotting.
>>>>>>>>> 2) Salting.
>>>>>>>>> 3) Key field swap/promotion
>>>>>>>>> 4) Randomization.
>>>>>>>>> 
>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>> 
>>>>>>>>> I can do salting, like prefixing the timestamp with a number between
>>>>>>>>> 0 and 9. That will distribute the load over 10 servers. To find all
>>>>>>>>> the files with a timestamp below a specific value, I will need to
>>>>>>>>> run 10 requests instead of one. But when the load becomes too big
>>>>>>>>> for 10 servers, will I have to prefix with a number between 0 and
>>>>>>>>> 99, which means 100 requests? And the more regions I have, the more
>>>>>>>>> requests I will have to do. Is that really a good approach?
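
The salting scheme discussed here — prefixing the timestamp key with a bucket number, so one logical range scan fans out into one scan per bucket — can be sketched as follows. The bucket count, key layout, and bucket-picking function are illustrative; libraries like HBaseWD wrap exactly this pattern:

```python
N_BUCKETS = 10

def salted_key(ts, file_id):
    # Pick the bucket from the file ID so writes spread over N_BUCKETS
    # regions; the timestamp stays in the key, so order holds per bucket.
    bucket = sum(file_id.encode()) % N_BUCKETS
    return "%d-%010d-%s" % (bucket, ts, file_id)

def scan_before(rows, max_ts):
    """One logical 'timestamp < max_ts' query becomes N_BUCKETS scans,
    one per salt prefix, whose results are merged."""
    hits = []
    for bucket in range(N_BUCKETS):
        prefix = "%d-" % bucket
        for key in sorted(rows):      # stand-in for one prefix-bounded scan
            if key.startswith(prefix) and int(key.split("-")[1]) < max_ts:
                hits.append(key)
    return hits

rows = {salted_key(ts, fid): None for ts, fid in [(50, "f0"), (150, "f1"), (250, "f2")]}
print(len(scan_before(rows, 200)))  # → 2
```

This makes the trade-off concrete: writes no longer pile onto one region, but every time-range read costs one scan per bucket.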
>>>>>>>>> 
>>>>>>>>> Key field swap is close to salting. I can add the first few bytes
>>>>>>>>> of the path before the timestamp, but the issue will remain the
>>>>>>>>> same.
>>>>>>>>> 
>>>>>>>>> I looked at randomization, and I can't do that; otherwise I will
>>>>>>>>> have no way to retrieve the information I'm looking for.
>>>>>>>>> 
>>>>>>>>> So the question is: is there a good way to store the data so I can
>>>>>>>>> retrieve them based on the date?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> JM
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
