Let's imagine the timestamp is "123456789". If I salt it with a letter from 'a' to 'z', then it will always be split between a few RegionServers. I will have something like "t123456789". The issue is that I will have to do 26 queries to be able to find all the entries. I will need to query from a000000000 to axxxxxxxxx, then the same for b, and so on.
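To make the cost concrete, here is a rough sketch in plain Python (no HBase client involved) of how the salted keys and the 26 scan ranges could be built. The names `salted_key` and `scan_ranges`, the 9-digit padding, and the choice of deriving the salt letter from a CRC of the file ID are all assumptions made for illustration only.

```python
import string
import zlib

SALTS = string.ascii_lowercase  # 'a'..'z', i.e. 26 buckets
WIDTH = 9                       # digits in the example timestamp "123456789"


def salted_key(timestamp, file_id):
    """Prefix the zero-padded timestamp with one salt letter.

    Picking the letter from a CRC of the file ID is an assumption of this
    sketch; a random letter would spread the writes in the same way.
    """
    salt = SALTS[zlib.crc32(file_id) % len(SALTS)]
    return f"{salt}{timestamp:0{WIDTH}d}"


def scan_ranges(upper_timestamp):
    """Return one (start_row, stop_row) pair per salt letter."""
    return [
        (f"{salt}{0:0{WIDTH}d}", f"{salt}{upper_timestamp:0{WIDTH}d}")
        for salt in SALTS
    ]


ranges = scan_ranges(123456789)
print(len(ranges))  # 26
print(ranges[0])    # ('a000000000', 'a123456789')
```

Each of the 26 ranges is sorted only within its own salt bucket, so the results still have to be merged and re-sorted by timestamp on the client side.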
So what's worse? Am I better off dealing with the hotspotting? Salting the key myself? Or what if I use something like HBaseWD?

JM

2012/6/16, Michel Segel <[email protected]>:
> You can't salt the key in the second table.
> By salting the key, you lose the ability to do range scans, which is what you want to do.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>
>> Thanks all for your comments and suggestions. Regarding the hotspotting, I will try to salt the key in the 2nd table and see the results.
>>
>> Yesterday I finished installing my 4-server cluster with old machines. It's slow, but it's working. So I will do some testing.
>>
>> You are recommending to modify the timestamp to be to the second or minute and have more entries per row. Is that because it's better to have more columns than rows? Or is it more because that will allow a more "squared" pattern (lots of rows, lots of columns), which is more efficient?
>>
>> JM
>>
>> 2012/6/15, Michael Segel <[email protected]>:
>>> Thought about this a little bit more...
>>>
>>> You will want two tables for a solution.
>>>
>>> 1 Table is Key: Unique ID
>>>     Column: FilePath          Value: Full Path to file
>>>     Column: Last Update time  Value: timestamp
>>>
>>> 2 Table is Key: Last Update time (The timestamp)
>>>     Column 1-N: Unique ID     Value: Full Path to the file
>>>
>>> Now if you want to get fancy, in Table 1, you could use the time stamp on the column File Path to hold the last update time.
>>> But it's probably easier for you to start by keeping the data as a separate column and ignore the Timestamps on the columns for now.
>>>
>>> Note the following:
>>>
>>> 1) I used the notation Column 1-N to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale)
>>> This is a good example of HBase's column oriented approach where you may or may not have a column. It doesn't matter. :-) You could also modify the timestamp to be to the second or minute and have more entries per row. It doesn't matter. You insert based on timestamp:columnName, value, so you will add a column to this table.
>>>
>>> 2) First prove that the logic works. You insert/update table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in table 2, then insert new entry in table 2.
>>>
>>> 3) You store Table 2 in ascending order. Then when you want to find your last 500 entries, you do a start scan at 0x000 and then limit the scan to 500 rows. Note that you may or may not have multiple entries so as you walk through the result set, you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed.
>>>
>>> This should solve your problem and be pretty efficient.
>>> You can then work out the Coprocessors and add it to the solution to be even more efficient.
>>>
>>> With respect to 'hot-spotting', can't be helped. You could hash your unique ID in table 1, this will reduce the potential of a hotspot as the table splits.
>>> On table 2, because you have temporal data and you want to efficiently scan a small portion of the table based on size, you will always scan the first block, however as data rolls off and compression occurs, you will probably have to do some cleanup. I'm not sure how HBase handles splits that no longer contain data. When you compress an empty split, does it go away?
>>>
>>> By switching to coprocessors, you now limit the update accessors to the second table so you should still have pretty good performance.
>>>
>>> You may also want to look at Asynchronous HBase, however I don't know how well it will work with Coprocessors or if you want to perform async operations in this specific use case.
>>>
>>> Good luck, HTH...
>>>
>>> -Mike
>>>
>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> For now this is more a proof of concept than a production application. And if it's working, it should be growing a lot and the database at the end will easily be over 1B rows. Each individual server will have to send its own information to one centralized server which will insert that into a database. That's why it needs to be very quick and that's why I'm looking in HBase's direction. I tried with some relational databases with 4M rows in the table but the insert time is too slow when I have to introduce entries in bulk. Also, the ability for HBase to keep only the cells with values will allow me to save a lot on disk space (future projects).
>>>>
>>>> I'm not yet used to HBase and there are still many things I need to understand, but until I'm able to create a solution and test it, I will continue to read, learn and try that way. Then at the end I will be able to compare the 2 options I have (HBase or relational) and decide based on the results.
>>>>
>>>> So yes, your reply helped because it's giving me a way to achieve this goal (using co-processors). I don't know yet how this part is working, so I will dig into the documentation for it.
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>>
>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>> Jean-Marc,
>>>>>
>>>>> You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a stand alone system.
>>>>> It would be easier and better if you just used a simple relational database.
>>>>>
>>>>> Then you would have your table with an ID, and a secondary index on the timestamp.
>>>>> Retrieve the data in Ascending order by timestamp and take the top 500 off the list.
>>>>>
>>>>> If you insist on using HBase, yes you will have to have a secondary table.
>>>>> Then using co-processors...
>>>>> When you update the row in your base table, you then get() the row in your index by timestamp, removing the column for that rowid.
>>>>> Add the new column to the timestamp row.
>>>>>
>>>>> As you put it.
>>>>>
>>>>> Now you can just do a partial scan on your index. Because your index table is so small... you shouldn't worry about hotspots.
>>>>> You may just want to rebuild your index every so often...
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Thanks for your feedback. Here are more details to describe what I'm trying to achieve.
>>>>>>
>>>>>> My goal is to store information about files into the database. I need to check the oldest files in the database to refresh the information.
>>>>>>
>>>>>> The key is an 8-byte ID of the server name in the network hosting the file + MD5 of the file path. Total is a 24-byte key.
>>>>>>
>>>>>> So each time I look at a file and gather the information, I update its row in the database based on the key, including a "last_update" field. I can calculate this key for any file in the drives.
>>>>>>
>>>>>> In order to know which file I need to check in the network, I need to scan the table by "last_update" field. So the idea is to build another table which contains the last_update as a key and the file IDs in columns. (Here is the hotspotting)
>>>>>>
>>>>>> Each time I work on a file, I will have to update the main table by ID and remove the cell from the second table (the index) and put it back with the new "last_update" key.
>>>>>>
>>>>>> I'm mainly doing 3 operations in the database.
>>>>>> 1) I retrieve a list of 500 files which need to be updated
>>>>>> 2) I update the information for those 500 files (bulk update)
>>>>>> 3) I load new file references to be checked.
>>>>>>
>>>>>> For 2 and 3, I use the main table with the file ID as the key. The distribution is almost perfect because I'm using a hash. The prefix is the server ID but it's not always going to the same server since it's done by last_update. But this allows quick access to the list of files from one server.
>>>>>> For 1, I expected to build this second table with the "last_update" as the key.
>>>>>>
>>>>>> Regarding the frequency, it really depends on the activities on the network, but it should be "often". The faster the database update will be, the more up to date I will be able to keep it.
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>> Actually I think you should revisit your key design....
>>>>>>>
>>>>>>> Look at your access path to the data for each of the types of queries you are going to run.
>>>>>>> From your post:
>>>>>>> "I have a table with a unique key, a file path and a "last update" field.
>>>>>>>>>> I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>
>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time.
>>>>>>> "
>>>>>>> So your primary query is going to be against the key.
>>>>>>> Not sure if you meant to say that your key was a composite key or not... sounds like your key is just the unique key and the rest are columns in the table.
>>>>>>>
>>>>>>> The secondary query or path to the data is to find data where the files were not updated for more than a period of time.
>>>>>>>
>>>>>>> If you make your key temporal, that is adding time as a component of your key, you will end up creating new rows of data while the old row still exists.
>>>>>>> Not a good side effect.
>>>>>>>
>>>>>>> The other nasty side effect of using time as your key is that you not only have the potential for hot spotting, but that you also have the nasty side effect of creating splits that will never grow.
>>>>>>>
>>>>>>> How often are you going to ask to see the files where they were not updated in the last couple of days/minutes? If it's infrequent, then you really should care if you have to do a complete table scan.
>>>>>>>
>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>
>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that now.
>>>>>>>>
>>>>>>>> Need to read here at the bottom:
>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>> and here:
>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>
>>>>>>>>> Otis
>>>>>>>>> ----
>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>> To: [email protected]
>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>
>>>>>>>>>> I watched Lars George's video about HBase and read the documentation and it's saying that it's not a good idea to have the timestamp as a key because that will always load the same region until the timestamp reaches a certain value and moves to the next region (hotspotting).
>>>>>>>>>>
>>>>>>>>>> I have a table with a unique key, a file path and a "last update" field. I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>
>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time.
>>>>>>>>>>
>>>>>>>>>> If I want to retrieve that from this single table, I will have to do a full scan of the table. Which might take a while.
>>>>>>>>>>
>>>>>>>>>> So I thought of building a table to reference that (kind of secondary index). The key is the "last update", one column family, and each column will have the ID of the file with a dummy content.
>>>>>>>>>>
>>>>>>>>>> When a file is updated, I remove its cell from this table, and introduce a new cell with the new timestamp as the key.
>>>>>>>>>>
>>>>>>>>>> And so on.
>>>>>>>>>>
>>>>>>>>>> With this schema, I can find the files by ID very quickly and I can find the files which need to be updated pretty quickly too. But it's hotspotting one region.
>>>>>>>>>>
>>>>>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>> 2) Salting.
>>>>>>>>>> 3) Key field swap/promotion
>>>>>>>>>> 4) Randomization.
>>>>>>>>>>
>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>>
>>>>>>>>>> I can do salting. Like prefix the timestamp with a number between 0 and 9. So that will distribute the load over 10 servers. To find all the files with a timestamp below a specific value, I will need to run 10 requests instead of one. But when the load becomes too big for 10 servers, I will have to prefix by a byte between 0 and 99? Which means 100 requests? And the more regions I will have, the more requests I will have to do. Is that really a good approach?
>>>>>>>>>>
>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes from the path before the timestamp, but the issue will remain the same.
>>>>>>>>>>
>>>>>>>>>> I looked at randomization, and I can't do that. Else I will have no way to retrieve the information I'm looking for.
>>>>>>>>>>
>>>>>>>>>> So the question is: is there a good way to store the data to retrieve them based on the date?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> JM
