You can't salt the key in the second table. By salting the key, you lose the ability to do a single ordered range scan over the timestamps, and that range scan is exactly what you want to do on that table.
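To make the point concrete, here is a minimal sketch. It is only a toy model: a sorted Python list stands in for the sorted row keys of the second table, and it does not use the HBase API. The bucket count, the key layout and the helper names (NUM_BUCKETS, row_key, range_scan) are assumptions made up for the illustration. Without a salt, one scan returns the oldest entries in time order. With a salt, the same entries need one scan per bucket plus a merge on the client.

```python
import bisect
import heapq

NUM_BUCKETS = 4  # hypothetical salt bucket count, chosen for illustration


def row_key(ts, salt=None):
    """Build a row key: zero-padded timestamp, optionally with a salt prefix."""
    base = f"{ts:010d}"
    return base if salt is None else f"{salt}-{base}"


def range_scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), as a scan over a sorted table would."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


timestamps = list(range(100, 120))

# Unsalted: a single scan returns the oldest entries, already in time order.
plain = sorted(row_key(ts) for ts in timestamps)
oldest_plain = range_scan(plain, row_key(0), row_key(105))

# Salted: the same entries are spread over NUM_BUCKETS separate key ranges.
salted = sorted(row_key(ts, ts % NUM_BUCKETS) for ts in timestamps)
per_bucket = [
    range_scan(salted, row_key(0, b), row_key(105, b))
    for b in range(NUM_BUCKETS)
]

# One scan per bucket, then a client-side merge to restore time order.
oldest_salted = list(heapq.merge(*per_bucket, key=lambda k: k.split("-")[1]))
```

The number of scans grows with the number of buckets, which is the cost JM already noticed in his first mail.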
Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:

> Thanks all for your comments and suggestions. Regarding the
> hotspotting I will try to salt the key in the 2nd table and see the
> results.
>
> Yesterday I finished to install my 4 servers cluster with old machine.
> It's slow, but it's working. So I will do some testing.
>
> You are recommending to modify the timestamp to be to the second or
> minute and have more entries per row. Is that because it's better to
> have more columns than rows? Or it's more because that will allow to
> have a more "squared" pattern (lot of rows, lot of columns) which is
> more efficient?
>
> JM
>
> 2012/6/15, Michael Segel <[email protected]>:
>> Thought about this a little bit more...
>>
>> You will want two tables for a solution.
>>
>> 1 Table is Key: Unique ID
>>            Column: FilePath            Value: Full Path to file
>>            Column: Last Update time    Value: timestamp
>>
>> 2 Table is Key: Last Update time (The timestamp)
>>            Column 1-N: Unique ID       Value: Full Path to the file
>>
>> Now if you want to get fancy, in Table 1, you could use the time stamp on
>> the column File Path to hold the last update time.
>> But it's probably easier for you to start by keeping the data as a separate
>> column and ignore the Timestamps on the columns for now.
>>
>> Note the following:
>>
>> 1) I used the notation Column 1-N to reflect that for a given timestamp you
>> may or may not have multiple files that were updated. (You weren't specific
>> as to the scale)
>> This is a good example of HBase's column oriented approach where you may or
>> may not have a column. It doesn't matter. :-) You could also modify the
>> timestamp to be to the second or minute and have more entries per row. It
>> doesn't matter. You insert based on timestamp:columnName, value, so you will
>> add a column to this table.
>>
>> 2) First prove that the logic works.
>> You insert/update table 1 to capture
>> the ID of the file and its last update time. You then delete the old
>> timestamp entry in table 2, then insert new entry in table 2.
>>
>> 3) You store Table 2 in ascending order. Then when you want to find your
>> last 500 entries, you do a start scan at 0x000 and then limit the scan to
>> 500 rows. Note that you may or may not have multiple entries so as you walk
>> through the result set, you count the number of columns and stop when you
>> have 500 columns, regardless of the number of rows you've processed.
>>
>> This should solve your problem and be pretty efficient.
>> You can then work out the Coprocessors and add it to the solution to be even
>> more efficient.
>>
>> With respect to 'hot-spotting', can't be helped. You could hash your unique
>> ID in table 1, this will reduce the potential of a hotspot as the table
>> splits.
>> On table 2, because you have temporal data and you want to efficiently scan
>> a small portion of the table based on size, you will always scan the first
>> block, however as data rolls off and compression occurs, you will probably
>> have to do some cleanup. I'm not sure how HBase handles splits that no
>> longer contain data. When you compress an empty split, does it go away?
>>
>> By switching to coprocessors, you now limit the update accessors to the
>> second table so you should still have pretty good performance.
>>
>> You may also want to look at Asynchronous HBase, however I don't know how
>> well it will work with Coprocessors or if you want to perform async
>> operations in this specific use case.
>>
>> Good luck, HTH...
>>
>> -Mike
>>
>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>
>>> Hi Michael,
>>>
>>> For now this is more a proof of concept than a production application.
>>> And if it's working, it should be growing a lot and database at the
>>> end will easily be over 1B rows.
>>> Each individual server will have to
>>> send its own information to one centralized server which will insert
>>> that into a database. That's why it need to be very quick and that's
>>> why I'm looking in HBase's direction. I tried with some relational
>>> databases with 4M rows in the table but the insert time is too slow
>>> when I have to introduce entries in bulk. Also, the ability for HBase
>>> to keep only the cells with values will allow to save a lot on the
>>> disk space (future projects).
>>>
>>> I'm not yet used with HBase and there is still many things I need to
>>> understand but until I'm able to create a solution and test it, I will
>>> continue to read, learn and try that way. Then at the end I will be
>>> able to compare the 2 options I have (HBase or relational) and decide
>>> based on the results.
>>>
>>> So yes, your reply helped because it's giving me a way to achieve this
>>> goal (using co-processors). I don't know yet how this part is working,
>>> so I will dig the documentation for it.
>>>
>>> Thanks,
>>>
>>> JM
>>>
>>> 2012/6/14, Michael Segel <[email protected]>:
>>>> Jean-Marc,
>>>>
>>>> You do realize that this really isn't a good use case for HBase, assuming
>>>> that what you are describing is a stand alone system.
>>>> It would be easier and better if you just used a simple relational
>>>> database.
>>>>
>>>> Then you would have your table w an ID, and a secondary index on the
>>>> timestamp.
>>>> Retrieve the data in Ascending order by timestamp and take the top 500 off
>>>> the list.
>>>>
>>>> If you insist on using HBase, yes you will have to have a secondary table.
>>>> Then using co-processors...
>>>> When you update the row in your base table, you
>>>> then get() the row in your index by timestamp, removing the column for that
>>>> rowid.
>>>> Add the new column to the timestamp row.
>>>>
>>>> As you put it.
>>>>
>>>> Now you can just do a partial scan on your index.
>>>> Because your index table
>>>> is so small... you shouldn't worry about hotspots.
>>>> You may just want to rebuild your index every so often...
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Thanks for your feedback. Here are more details to describe what I'm
>>>>> trying to achieve.
>>>>>
>>>>> My goal is to store information about files into the database. I need
>>>>> to check the oldest files in the database to refresh the information.
>>>>>
>>>>> The key is an 8 bytes ID of the server name in the network hosting the
>>>>> file + MD5 of the file path. Total is a 24 bytes key.
>>>>>
>>>>> So each time I look at a file and gather the information, I update its
>>>>> row in the database based on the key including a "last_update" field.
>>>>> I can calculate this key for any file in the drives.
>>>>>
>>>>> In order to know which file I need to check in the network, I need to
>>>>> scan the table by "last_update" field. So the idea is to build another
>>>>> table which contain the last_update as a key and the files IDs in
>>>>> columns. (Here is the hotspotting)
>>>>>
>>>>> Each time I work on a file, I will have to update the main table by ID
>>>>> and remove the cell from the second table (the index) and put it back
>>>>> with the new "last_update" key.
>>>>>
>>>>> I'm mainly doing 3 operations in the database.
>>>>> 1) I retrieve a list of 500 files which need to be update
>>>>> 2) I update the information for those 500 files (bulk update)
>>>>> 3) I load new files references to be checked.
>>>>>
>>>>> For 2 and 3, I use the main table with the file ID as the key. The
>>>>> distribution is almost perfect because I'm using hash. The prefix is
>>>>> the server ID but it's not always going to the same server since it's
>>>>> done by last_update. But this allow a quick access to the list of
>>>>> files from one server.
>>>>> For 1, I have expected to build this second table with the
>>>>> "last_update" as the key.
>>>>>
>>>>> Regarding the frequency, it really depends on the activities on the
>>>>> network, but it should be "often". The faster the database update
>>>>> will be, the more up to date I will be able to keep it.
>>>>>
>>>>> JM
>>>>>
>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>> Actually I think you should revisit your key design....
>>>>>>
>>>>>> Look at your access path to the data for each of the types of queries you
>>>>>> are going to run.
>>>>>> From your post:
>>>>>> "I have a table with a unique key, a file path and a "last update" field.
>>>>>>>>> I can easily find back the file with the ID and find when it has been
>>>>>>>>> updated.
>>>>>>>>>
>>>>>>>>> But what I need too is to find the files not updated for more than a
>>>>>>>>> certain period of time.
>>>>>> "
>>>>>> So your primary query is going to be against the key.
>>>>>> Not sure if you meant to say that your key was a composite key or not...
>>>>>> sounds like your key is just the unique key and the rest are columns in
>>>>>> the table.
>>>>>>
>>>>>> The secondary query or path to the data is to find data where the files
>>>>>> were not updated for more than a period of time.
>>>>>>
>>>>>> If you make your key temporal, that is adding time as a component of your
>>>>>> key, you will end up creating new rows of data while the old row still
>>>>>> exists.
>>>>>> Not a good side effect.
>>>>>>
>>>>>> The other nasty side effect of using time as your key is that you not only
>>>>>> have the potential for hot spotting, but that you also have the nasty side
>>>>>> effect of creating splits that will never grow.
>>>>>>
>>>>>> How often are you going to ask to see the files where they were not updated
>>>>>> in the last couple of days/minutes?
>>>>>> If it's infrequent, then you really
>>>>>> should care if you have to do a complete table scan.
>>>>>>
>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>
>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that
>>>>>>> now.
>>>>>>>
>>>>>>> Need to read here at the bottom: https://github.com/sematext/HBaseWD
>>>>>>> and here:
>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> JM
>>>>>>>
>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up
>>>>>>>> often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>
>>>>>>>> Otis
>>>>>>>> ----
>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>> http://sematext.com/spm
>>>>>>>>
>>>>>>>>> ________________________________
>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>> To: [email protected]
>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>
>>>>>>>>> I watched Lars George's video about HBase and read the documentation
>>>>>>>>> and it's saying that it's not a good idea to have the timestamp as a
>>>>>>>>> key because that will always load the same region until the timestamp
>>>>>>>>> reach a certain value and move to the next region (hotspotting).
>>>>>>>>>
>>>>>>>>> I have a table with a unique key, a file path and a "last update" field.
>>>>>>>>> I can easily find back the file with the ID and find when it has been
>>>>>>>>> updated.
>>>>>>>>>
>>>>>>>>> But what I need too is to find the files not updated for more than a
>>>>>>>>> certain period of time.
>>>>>>>>>
>>>>>>>>> If I want to retrieve that from this single table, I will have to do a
>>>>>>>>> full parsing of the table. Which might take a while.
>>>>>>>>>
>>>>>>>>> So I thought of building a table to reference that (kind of secondary
>>>>>>>>> index). The key is the "last update", one FC and each column will have
>>>>>>>>> the ID of the file with a dummy content.
>>>>>>>>>
>>>>>>>>> When a file is updated, I remove its cell from this table, and
>>>>>>>>> introduce a new cell with the new timestamp as the key.
>>>>>>>>>
>>>>>>>>> And so on.
>>>>>>>>>
>>>>>>>>> With this schema, I can find the files by ID very quickly and I can
>>>>>>>>> find the files which need to be updated pretty quickly too. But it's
>>>>>>>>> hotspotting one region.
>>>>>>>>>
>>>>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>>>>> 1) Hotspotting.
>>>>>>>>> 2) Salting.
>>>>>>>>> 3) Key field swap/promotion
>>>>>>>>> 4) Randomization.
>>>>>>>>>
>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>
>>>>>>>>> I can do salting. Like prefix the timestamp with a number between 0
>>>>>>>>> and 9. So that will distribute the load over 10 servers. To find all
>>>>>>>>> the files with a timestamp below a specific value, I will need to run
>>>>>>>>> 10 requests instead of one. But when the load will become too big for
>>>>>>>>> 10 servers, I will have to prefix by a byte between 0 and 99? Which
>>>>>>>>> mean 100 request? And the more regions I will have, the more requests
>>>>>>>>> I will have to do. Is that really a good approach?
>>>>>>>>>
>>>>>>>>> Key field swap is close to salting. I can add the first few bytes from
>>>>>>>>> the path before the timestamp, but the issue will remain the same.
>>>>>>>>>
>>>>>>>>> I looked at randomization, and I can't do that. Else I will have no
>>>>>>>>> way to retrieve the information I'm looking for.
>>>>>>>>>
>>>>>>>>> So the question is. Is there a good way to store the data to retrieve
>>>>>>>>> them based on the date?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> JM
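
For reference, the two-table scheme from the quoted 2012/6/15 message could be sketched roughly as follows. This is an in-memory stand-in only, not HBase client code: plain dictionaries play the role of the two tables, and the class name, method names and file paths (FileIndex, update, oldest, /srv/...) are hypothetical. It shows the update path (drop the old timestamp entry in table 2, then write both tables) and the read path (walk table 2 in ascending key order and stop after a given number of columns, regardless of rows).

```python
from collections import defaultdict


class FileIndex:
    """Toy in-memory stand-in for the two tables (not the HBase API)."""

    def __init__(self):
        self.table1 = {}                 # file_id -> (path, last_update)
        self.table2 = defaultdict(dict)  # last_update -> {file_id: path}

    def update(self, file_id, path, last_update):
        old = self.table1.get(file_id)
        if old is not None:
            old_ts = old[1]
            # Remove the file's column from its old timestamp row in table 2.
            self.table2[old_ts].pop(file_id, None)
            if not self.table2[old_ts]:
                del self.table2[old_ts]  # drop rows that became empty
        # Write table 1, then add the column to the new timestamp row.
        self.table1[file_id] = (path, last_update)
        self.table2[last_update][file_id] = path

    def oldest(self, limit):
        """Walk table 2 in ascending key order, counting columns, not rows."""
        result = []
        for ts in sorted(self.table2):
            for file_id, path in self.table2[ts].items():
                result.append((ts, file_id, path))
                if len(result) == limit:
                    return result
        return result


idx = FileIndex()
idx.update("a", "/srv/a.txt", 10)
idx.update("b", "/srv/b.txt", 10)
idx.update("c", "/srv/c.txt", 20)
idx.update("a", "/srv/a.txt", 30)  # "a" moves from row 10 to row 30
two_oldest = idx.oldest(2)
```

In a real table the sort in oldest() would not be needed, since rows are already stored in key order; it is only there because a Python dict is not sorted by key.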
