Wow! This is exactly what I was looking for. So I will read all of that now.
Need to read here at the bottom: https://github.com/sematext/HBaseWD
and here: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

Thanks,

JM

2012/6/14, Otis Gospodnetic <[email protected]>:
> JM, have a look at https://github.com/sematext/HBaseWD (this comes up
> often.... Doug, maybe you could add it to the Ref Guide?)
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase -
> http://sematext.com/spm
>
>>________________________________
>> From: Jean-Marc Spaggiari <[email protected]>
>>To: [email protected]
>>Sent: Wednesday, June 13, 2012 12:16 PM
>>Subject: Timestamp as a key good practice?
>>
>>I watched Lars George's video about HBase and read the documentation,
>>and it says that it's not a good idea to have the timestamp as a
>>key because that will always load the same region until the timestamp
>>reaches a certain value and moves to the next region (hotspotting).
>>
>>I have a table with a unique key, a file path and a "last update" field.
>>I can easily look up the file by its ID and find when it was last
>>updated.
>>
>>But what I also need is to find the files not updated for more than a
>>certain period of time.
>>
>>If I want to retrieve that from this single table, I will have to do a
>>full scan of the table, which might take a while.
>>
>>So I thought of building a table to reference that (a kind of secondary
>>index). The key is the "last update", with one CF, and each column will
>>have the ID of the file with dummy content.
>>
>>When a file is updated, I remove its cell from this table and
>>introduce a new cell with the new timestamp as the key.
>>
>>And so on.
>>
>>With this schema, I can find the files by ID very quickly and I can
>>find the files which need to be updated pretty quickly too. But it's
>>hotspotting one region.
>>
>>From the video (0:45:10) I can see 4 situations.
>>1) Hotspotting.
>>2) Salting.
>>3) Key field swap/promotion.
>>4) Randomization.
>>
>>I need to avoid hotspotting, so I looked at the 3 other options.
>>
>>I can do salting, like prefixing the timestamp with a number between 0
>>and 9. That will distribute the load over 10 servers. To find all
>>the files with a timestamp below a specific value, I will need to run
>>10 requests instead of one. But when the load becomes too big for
>>10 servers, I will have to prefix with a number between 0 and 99? Which
>>means 100 requests? And the more regions I have, the more requests
>>I will have to do. Is that really a good approach?
>>
>>Key field swap is close to salting. I can add the first few bytes from
>>the path before the timestamp, but the issue will remain the same.
>>
>>I looked at randomization, and I can't do that. Otherwise I will have no
>>way to retrieve the information I'm looking for.
>>
>>So the question is: is there a good way to store the data so I can
>>retrieve it based on the date?
>>
>>Thanks,
>>
>>JM
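
For reference, here is a minimal sketch of the salting option from the quoted question: prefix the timestamp with a bucket number, then run one range scan per bucket to find everything older than a cutoff. It is plain Python over a sorted list standing in for the index table, so no HBase client calls are involved. The names (salted_key, scan_ranges, stale_files), the key layout, and picking the bucket from a hash of the file ID are all assumptions made for illustration, not something taken from the thread or from HBaseWD.

```python
import bisect
import hashlib

NUM_BUCKETS = 10  # assumption: roughly one salt bucket per region server


def salted_key(timestamp, file_id, buckets=NUM_BUCKETS):
    """Build a row key of the form <bucket>-<zero-padded timestamp>-<file id>."""
    # Assumption: the bucket is derived from the file ID, so a given file
    # always lands in the same bucket regardless of its timestamp.
    digest = hashlib.md5(file_id.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{bucket:02d}-{timestamp:013d}-{file_id}"


def scan_ranges(max_timestamp, buckets=NUM_BUCKETS):
    """Return one (start, stop) row range per bucket; stop is exclusive."""
    return [
        (f"{b:02d}-", f"{b:02d}-{max_timestamp:013d}")
        for b in range(buckets)
    ]


def stale_files(sorted_keys, max_timestamp, buckets=NUM_BUCKETS):
    """Find (timestamp, file_id) pairs with timestamp < max_timestamp.

    sorted_keys is a sorted list of row keys, standing in for the table.
    One range lookup is done per bucket, then the results are merged.
    """
    results = []
    for start, stop in scan_ranges(max_timestamp, buckets):
        lo = bisect.bisect_left(sorted_keys, start)
        hi = bisect.bisect_left(sorted_keys, stop)
        for key in sorted_keys[lo:hi]:
            _, ts, file_id = key.split("-", 2)
            results.append((int(ts), file_id))
    results.sort()  # restore global timestamp order across buckets
    return results


if __name__ == "__main__":
    updates = [(100, "a"), (200, "b"), (300, "c"), (150, "d")]
    table = sorted(salted_key(ts, fid) for ts, fid in updates)
    print(stale_files(table, 200))
```

The number of range scans equals the number of buckets, which is the trade-off raised in the question: going from 10 to 100 buckets spreads writes further but means 100 scans per lookup. In a real deployment those scans could presumably run in parallel.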
