Just to add from my experience: yes, hotspotting is bad, but so are devops headaches. A reasonable machine can handle 3,000-4,000 puts a second with ease, and a simple timerange scan can give you the records you need. I have my doubts you will be hitting those volumes anytime soon. A simple setup will get your PoC running, and then you can scale when you need to scale.
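Rob's "simple timerange scan" relies on one property: with the timestamp as (a prefix of) the row key, rows sort in time order, so a time-range query is just a contiguous slice of the sorted key space. A minimal Python sketch of that idea, with no cluster involved; the in-memory `SortedTable` below is a hypothetical stand-in for an HBase table, not HBase API:

```python
import bisect

class SortedTable:
    """Toy stand-in for a table whose rows are sorted by row key,
    which is the property a timerange scan relies on."""
    def __init__(self):
        self.keys = []   # sorted row keys
        self.rows = {}   # key -> value

    def put(self, key, value):
        if key not in self.rows:
            bisect.insort(self.keys, key)   # keep keys sorted on insert
        self.rows[key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

# Zero-padded timestamps so lexicographic order == time order.
table = SortedTable()
for ts, path in [(1339670000, "/a"), (1339680000, "/b"), (1339690000, "/c")]:
    table.put("%010d" % ts, path)

# "Files touched between t1 and t2" is one contiguous range scan.
hits = table.scan("%010d" % 1339675000, "%010d" % 1339695000)
```

This is exactly why the same layout hotspots on writes: every new put lands at the tail of the sorted key space, i.e. in one region.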
Rob On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[email protected]> wrote: > Jean-Marc, > > You indicated that you didn't want to do full table scans when you want to > find out which files hadn't been touched since X time has passed. > (X could be months, weeks, days, hours, etc ...) > > So here's the thing. > First, I am not convinced that you will have hot spotting. > Second, you end up having to now do 26 scans instead of one. Then you need > to join the result set. > > Not really a good solution if you think about it. > > Oh and I don't believe that you will be hitting a single region, although > you may hit a region hard. > (Your second table's key is on the timestamp of the last update to the > file. If the file hadn't been touched in a week, there's the probability > that at scale, it won't be in the same region as a file that had recently > been touched. ) > > I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be > applied to a subset of problems. > (Think round-robin partitioning in a RDBMS. DB2 was big on this.) > > HTH > > -Mike > > > > On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote: > > > Let's imagine the timestamp is "123456789". > > > > If I salt it with letters from 'a' to 'z' then it will always be split > > between a few RegionServers. I will have keys like "t123456789". The issue is > > that I will have to do 26 queries to be able to find all the entries. > > I will need to query from A000000000 to Axxxxxxxxx, then the same for B, > > and so on. > > > > So what's worse? Am I better off dealing with the hotspotting? Salting the > > key myself? Or what if I use something like HBaseWD? > > > > JM > > > > 2012/6/16, Michel Segel <[email protected]>: > >> You can't salt the key in the second table. > >> By salting the key, you lose the ability to do range scans, which is > what > >> you want to do. > >> > >> > >> > >> Sent from a remote device. Please excuse any typos... 
> >> > >> Mike Segel > >> > >> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari < > [email protected]> > >> wrote: > >> > >>> Thanks all for your comments and suggestions. Regarding the > >>> hotspotting, I will try to salt the key in the 2nd table and see the > >>> results. > >>> > >>> Yesterday I finished installing my 4-server cluster with old machines. > >>> It's slow, but it's working. So I will do some testing. > >>> > >>> You are recommending modifying the timestamp to be to the second or > >>> minute and having more entries per row. Is that because it's better to > >>> have more columns than rows? Or is it more because that will allow a > >>> more "squared" pattern (lots of rows, lots of columns), which is > >>> more efficient? > >>> > >>> JM > >>> > >>> 2012/6/15, Michael Segel <[email protected]>: > >>>> Thought about this a little bit more... > >>>> > >>>> You will want two tables for a solution. > >>>> > >>>> Table 1 is Key: Unique ID > >>>> Column: FilePath Value: Full Path to file > >>>> Column: Last Update time Value: timestamp > >>>> > >>>> Table 2 is Key: Last Update time (The timestamp) > >>>> Column 1-N: Unique ID Value: Full Path to > >>>> the > >>>> file > >>>> > >>>> Now if you want to get fancy, in Table 1, you could use the timestamp > >>>> on > >>>> the column File Path to hold the last update time. > >>>> But it's probably easier for you to start by keeping the data as a > >>>> separate > >>>> column and ignore the Timestamps on the columns for now. > >>>> > >>>> Note the following: > >>>> > >>>> 1) I used the notation Column 1-N to reflect that for a given > timestamp > >>>> you > >>>> may or may not have multiple files that were updated. (You weren't > >>>> specific > >>>> as to the scale.) > >>>> This is a good example of HBase's column-oriented approach where you > may > >>>> or > >>>> may not have a column. It doesn't matter. 
:-) You could also modify > the > >>>> timestamp to be to the second or minute and have more entries per row. > >>>> It > >>>> doesn't matter. You insert based on timestamp:columnName, value, so > you > >>>> will > >>>> add a column to this table. > >>>> > >>>> 2) First prove that the logic works. You insert/update table 1 to > >>>> capture > >>>> the ID of the file and its last update time. You then delete the old > >>>> timestamp entry in table 2, then insert the new entry in table 2. > >>>> > >>>> 3) You store Table 2 in ascending order. Then when you want to find your > >>>> oldest 500 entries, you do a start scan at 0x000 and then limit the scan > >>>> to > >>>> 500 rows. Note that you may or may not have multiple entries, so as you > >>>> walk > >>>> through the result set, you count the number of columns and stop when > >>>> you > >>>> have 500 columns, regardless of the number of rows you've processed. > >>>> > >>>> This should solve your problem and be pretty efficient. > >>>> You can then work out the Coprocessors and add them to the solution to > be > >>>> even > >>>> more efficient. > >>>> > >>>> With respect to 'hot-spotting', it can't be helped. You could hash your > >>>> unique > >>>> ID in table 1; this will reduce the potential of a hotspot as the > table > >>>> splits. > >>>> On table 2, because you have temporal data and you want to efficiently > >>>> scan > >>>> a small portion of the table based on size, you will always scan the > >>>> first > >>>> block, however as data rolls off and compaction occurs, you will > >>>> probably > >>>> have to do some cleanup. I'm not sure how HBase handles regions that > no > >>>> longer contain data. When you compact an empty region, does it go > away? > >>>> > >>>> By switching to coprocessors, you now limit the update accesses to > the > >>>> second table, so you should still have pretty good performance. 
> >>>> > >>>> You may also want to look at Asynchronous HBase, however I don't know > >>>> how > >>>> well it will work with Coprocessors or if you want to perform async > >>>> operations in this specific use case. > >>>> > >>>> Good luck, HTH... > >>>> > >>>> -Mike > >>>> > >>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote: > >>>> > >>>>> Hi Michael, > >>>>> > >>>>> For now this is more a proof of concept than a production > application. > >>>>> And if it's working, it should grow a lot, and the database at the > >>>>> end will easily be over 1B rows. Each individual server will have to > >>>>> send its own information to one centralized server which will insert > >>>>> it into a database. That's why it needs to be very quick and that's > >>>>> why I'm looking in HBase's direction. I tried with some relational > >>>>> databases with 4M rows in the table, but the insert time is too slow > >>>>> when I have to introduce entries in bulk. Also, the ability for HBase > >>>>> to keep only the cells with values will allow me to save a lot of > >>>>> disk space (future projects). > >>>>> > >>>>> I'm not yet used to HBase and there are still many things I need to > >>>>> understand, but until I'm able to create a solution and test it, I > will > >>>>> continue to read, learn and try that way. Then at the end I will be > >>>>> able to compare the 2 options I have (HBase or relational) and decide > >>>>> based on the results. > >>>>> > >>>>> So yes, your reply helped because it's giving me a way to achieve > this > >>>>> goal (using co-processors). I don't know yet how this part works, > >>>>> so I will dig into the documentation for it. > >>>>> > >>>>> Thanks, > >>>>> > >>>>> JM > >>>>> > >>>>> 2012/6/14, Michael Segel <[email protected]>: > >>>>>> Jean-Marc, > >>>>>> > >>>>>> You do realize that this really isn't a good use case for HBase, > >>>>>> assuming > >>>>>> that what you are describing is a stand-alone system. 
> >>>>>> It would be easier and better if you just used a simple relational > >>>>>> database. > >>>>>> > >>>>>> Then you would have your table with an ID, and a secondary index on the > >>>>>> timestamp. > >>>>>> Retrieve the data in ascending order by timestamp and take the top > 500 > >>>>>> off > >>>>>> the list. > >>>>>> > >>>>>> If you insist on using HBase, yes, you will have to have a secondary > >>>>>> table. > >>>>>> Then using co-processors... > >>>>>> When you update the row in your base table, you > >>>>>> then get() the row in your index by timestamp, removing the column > for > >>>>>> that > >>>>>> rowid. > >>>>>> Add the new column to the timestamp row. > >>>>>> > >>>>>> As you put it. > >>>>>> > >>>>>> Now you can just do a partial scan on your index. Because your index > >>>>>> table > >>>>>> is so small... you shouldn't worry about hotspots. > >>>>>> You may just want to rebuild your index every so often... > >>>>>> > >>>>>> HTH > >>>>>> > >>>>>> -Mike > >>>>>> > >>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote: > >>>>>> > >>>>>>> Hi Michael, > >>>>>>> > >>>>>>> Thanks for your feedback. Here are more details to describe what > I'm > >>>>>>> trying to achieve. > >>>>>>> > >>>>>>> My goal is to store information about files in the database. I > need > >>>>>>> to check the oldest files in the database to refresh the > information. > >>>>>>> > >>>>>>> The key is an 8-byte ID of the server name in the network hosting > >>>>>>> the > >>>>>>> file + the MD5 of the file path. The total is a 24-byte key. > >>>>>>> > >>>>>>> So each time I look at a file and gather the information, I update > >>>>>>> its > >>>>>>> row in the database based on the key, including a "last_update" > field. > >>>>>>> I can calculate this key for any file on the drives. > >>>>>>> > >>>>>>> In order to know which files I need to check in the network, I need > to > >>>>>>> scan the table by the "last_update" field. 
So the idea is to build > >>>>>>> another > >>>>>>> table which contains the last_update as a key and the file IDs in > >>>>>>> columns. (Here is the hotspotting.) > >>>>>>> > >>>>>>> Each time I work on a file, I will have to update the main table by > >>>>>>> ID, > >>>>>>> remove the cell from the second table (the index), and put it > back > >>>>>>> with the new "last_update" key. > >>>>>>> > >>>>>>> I'm mainly doing 3 operations in the database. > >>>>>>> 1) I retrieve a list of 500 files which need to be updated. > >>>>>>> 2) I update the information for those 500 files (bulk update). > >>>>>>> 3) I load new file references to be checked. > >>>>>>> > >>>>>>> For 2 and 3, I use the main table with the file ID as the key. The > >>>>>>> distribution is almost perfect because I'm using a hash. The prefix > is > >>>>>>> the server ID, but it's not always going to the same server, since > it's > >>>>>>> driven by last_update. But this allows quick access to the list of > >>>>>>> files from one server. > >>>>>>> For 1, I expected to build this second table with the > >>>>>>> "last_update" as the key. > >>>>>>> > >>>>>>> Regarding the frequency, it really depends on the activity on the > >>>>>>> network, but it should be "often". The faster the database update > >>>>>>> is, the more up to date I will be able to keep it. > >>>>>>> > >>>>>>> JM > >>>>>>> > >>>>>>> 2012/6/14, Michael Segel <[email protected]>: > >>>>>>>> Actually I think you should revisit your key design... > >>>>>>>> > >>>>>>>> Look at your access path to the data for each of the types of > >>>>>>>> queries > >>>>>>>> you > >>>>>>>> are going to run. > >>>>>>>> From your post: > >>>>>>>> "I have a table with a unique key, a file path and a "last update" > >>>>>>>> field. > >>>>>>>>>>> I can easily find the file with the ID and find when it > has > >>>>>>>>>>> been > >>>>>>>>>>> updated. 
> >>>>>>>>>>> > >>>>>>>>>>> But what I need too is to find the files not updated for more > >>>>>>>>>>> than > >>>>>>>>>>> a > >>>>>>>>>>> certain period of time. > >>>>>>>> " > >>>>>>>> So your primary query is going to be against the key. > >>>>>>>> Not sure if you meant to say that your key was a composite key or > >>>>>>>> not... > >>>>>>>> it sounds like your key is just the unique key and the rest are > columns > >>>>>>>> in > >>>>>>>> the > >>>>>>>> table. > >>>>>>>> > >>>>>>>> The secondary query or path to the data is to find data where the > >>>>>>>> files > >>>>>>>> were > >>>>>>>> not updated for more than a period of time. > >>>>>>>> > >>>>>>>> If you make your key temporal, that is, adding time as a component > of > >>>>>>>> your > >>>>>>>> key, you will end up creating new rows of data while the old row > >>>>>>>> still > >>>>>>>> exists. > >>>>>>>> Not a good side effect. > >>>>>>>> > >>>>>>>> The other nasty side effect of using time as your key is that you > >>>>>>>> not > >>>>>>>> only > >>>>>>>> have the potential for hot spotting, but you also have the > >>>>>>>> nasty > >>>>>>>> side > >>>>>>>> effect of creating splits that will never grow. > >>>>>>>> > >>>>>>>> How often are you going to ask to see the files that were > not > >>>>>>>> updated > >>>>>>>> in the last couple of days/minutes? If it's infrequent, then you > >>>>>>>> really > >>>>>>>> shouldn't care if you have to do a complete table scan. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote: > >>>>>>>> > >>>>>>>>> Wow! This is exactly what I was looking for. So I will read all > of > >>>>>>>>> that > >>>>>>>>> now. 
> >>>>>>>>> > >>>>>>>>> Need to read here at the bottom: > >>>>>>>>> https://github.com/sematext/HBaseWD > >>>>>>>>> and here: > >>>>>>>>> > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> > >>>>>>>>> JM > >>>>>>>>> > >>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>: > >>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this > comes > >>>>>>>>>> up > >>>>>>>>>> often... Doug, maybe you could add it to the Ref Guide?) > >>>>>>>>>> > >>>>>>>>>> Otis > >>>>>>>>>> ---- > >>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase - > >>>>>>>>>> http://sematext.com/spm > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> ________________________________ > >>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]> > >>>>>>>>>>> To: [email protected] > >>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM > >>>>>>>>>>> Subject: Timestamp as a key good practice? > >>>>>>>>>>> > >>>>>>>>>>> I watched Lars George's video about HBase and read the > >>>>>>>>>>> documentation, > >>>>>>>>>>> and it's saying that it's not a good idea to have the timestamp > >>>>>>>>>>> as > >>>>>>>>>>> a > >>>>>>>>>>> key because that will always load the same region until the > >>>>>>>>>>> timestamp > >>>>>>>>>>> reaches a certain value and moves to the next region > (hotspotting). > >>>>>>>>>>> > >>>>>>>>>>> I have a table with a unique key, a file path and a "last update" > >>>>>>>>>>> field. > >>>>>>>>>>> I can easily find the file with the ID and find when it > has > >>>>>>>>>>> been > >>>>>>>>>>> updated. > >>>>>>>>>>> > >>>>>>>>>>> But what I need too is to find the files not updated for more > >>>>>>>>>>> than > >>>>>>>>>>> a > >>>>>>>>>>> certain period of time. > >>>>>>>>>>> > >>>>>>>>>>> If I want to retrieve that from this single table, I will have > to > >>>>>>>>>>> do > >>>>>>>>>>> a > >>>>>>>>>>> full scan of the table. 
Which might take a while. > >>>>>>>>>>> > >>>>>>>>>>> So I thought of building a table to reference that (a kind of > >>>>>>>>>>> secondary > >>>>>>>>>>> index). The key is the "last update", one CF, and each column > will > >>>>>>>>>>> have > >>>>>>>>>>> the ID of the file with dummy content. > >>>>>>>>>>> > >>>>>>>>>>> When a file is updated, I remove its cell from this table and > >>>>>>>>>>> introduce a new cell with the new timestamp as the key. > >>>>>>>>>>> > >>>>>>>>>>> And so on. > >>>>>>>>>>> > >>>>>>>>>>> With this schema, I can find the files by ID very quickly and I > >>>>>>>>>>> can > >>>>>>>>>>> find the files which need to be updated pretty quickly too. But > >>>>>>>>>>> it's > >>>>>>>>>>> hotspotting one region. > >>>>>>>>>>> > >>>>>>>>>>> From the video (0:45:10) I can see 4 situations. > >>>>>>>>>>> 1) Hotspotting. > >>>>>>>>>>> 2) Salting. > >>>>>>>>>>> 3) Key field swap/promotion. > >>>>>>>>>>> 4) Randomization. > >>>>>>>>>>> > >>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other > options. > >>>>>>>>>>> > >>>>>>>>>>> I can do salting, like prefixing the timestamp with a number > between > >>>>>>>>>>> 0 > >>>>>>>>>>> and 9. That will distribute the load over 10 servers. To find > >>>>>>>>>>> all > >>>>>>>>>>> the files with a timestamp below a specific value, I will need > to > >>>>>>>>>>> run > >>>>>>>>>>> 10 requests instead of one. But when the load becomes too big > for > >>>>>>>>>>> 10 servers, will I have to prefix with a number between 0 and 99? > >>>>>>>>>>> Which > >>>>>>>>>>> means 100 requests? And the more regions I have, the more > >>>>>>>>>>> requests > >>>>>>>>>>> I will have to do. Is that really a good approach? > >>>>>>>>>>> > >>>>>>>>>>> Key field swap is close to salting. I can add the first few > bytes > >>>>>>>>>>> of > >>>>>>>>>>> the path before the timestamp, but the issue will remain the > >>>>>>>>>>> same. > >>>>>>>>>>> > >>>>>>>>>>> I looked at randomization, and I can't do that. 
Else I would have > no > >>>>>>>>>>> way to retrieve the information I'm looking for. > >>>>>>>>>>> > >>>>>>>>>>> So the question is: is there a good way to store the data to > >>>>>>>>>>> retrieve > >>>>>>>>>>> it based on the date? > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> > >>>>>>>>>>> JM > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>> > >> > > > >
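The two-table design Mike proposes in the thread above (Table 1 keyed by file ID, Table 2 keyed by last-update timestamp with one column per file; an update deletes the old index cell and writes a new one; the "oldest 500" query scans Table 2 ascending, counting columns rather than rows) can be mocked up in plain Python to check the bookkeeping before writing any HBase code. The dictionaries below are hypothetical stand-ins for the two tables, not HBase API:

```python
files = {}    # Table 1: file_id -> {"path": ..., "last_update": ...}
by_time = {}  # Table 2: timestamp -> {file_id: path} (one column per file)

def touch(file_id, path, ts):
    """Record an update: rewrite Table 1, then move the file's cell in
    Table 2 from its old timestamp row to the new one."""
    old = files.get(file_id)
    if old is not None:
        cols = by_time[old["last_update"]]
        del cols[file_id]                  # delete the stale index cell
        if not cols:                       # drop the row if now empty
            del by_time[old["last_update"]]
    files[file_id] = {"path": path, "last_update": ts}
    by_time.setdefault(ts, {})[file_id] = path

def oldest(n):
    """Scan Table 2 in ascending key order, counting columns (not rows)
    until n file IDs have been collected -- Mike's point 3."""
    out = []
    for ts in sorted(by_time):
        for file_id in by_time[ts]:
            out.append(file_id)
            if len(out) == n:
                return out
    return out

touch("f1", "/a", 100)
touch("f2", "/b", 100)   # two files in the same timestamp row
touch("f3", "/c", 200)
touch("f1", "/a", 300)   # f1 re-touched: its cell moves out of row 100
```

In a real deployment the `touch` bookkeeping is what Mike suggests moving into a coprocessor, so a single client put on Table 1 drives both writes.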

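JM's salting scheme (prefixing the timestamp key with one of N buckets, then issuing N range scans and merging the results) can be sketched the same way. The bucket count, key layout, and in-memory "regions" below are illustrative assumptions, not HBase API; `heapq.merge` reassembles global time order from the per-bucket scans:

```python
import heapq

NBUCKETS = 10  # JM's 0-9 example; 26 for 'a'-'z'

# One sorted "region" per salt bucket, holding (timestamp_key, file_id).
regions = {b: [] for b in range(NBUCKETS)}

def put(ts, file_id):
    # Bucket chosen by hashing the file ID, so writes for any instant
    # spread over NBUCKETS regions instead of hotspotting one.
    b = hash(file_id) % NBUCKETS
    regions[b].append(("%010d" % ts, file_id))
    regions[b].sort()

def scan_before(ts_limit):
    """Fan out one range scan per bucket, then merge the per-bucket
    results (each already sorted) back into global timestamp order."""
    limit = "%010d" % ts_limit
    partials = ([r for r in regions[b] if r[0] < limit]
                for b in range(NBUCKETS))
    return [fid for _, fid in heapq.merge(*partials)]

put(100, "f1")
put(300, "f3")
put(200, "f2")
older = scan_before(250)   # everything not touched since ts 250
```

This makes the trade-off Mike objects to concrete: the write load spreads, but every time-range query costs NBUCKETS scans plus a merge.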