Hi Mike, Hi Rob, Thanks for your replies and advice. It seems I'm now due for some implementation. I'm reading Lars' book first, and when I'm done I will start coding.
I already have my Zookeeper/Hadoop/HBase running and, based on the first pages I read, I already know it's not well done since I have put a DataNode and a Zookeeper server on ALL the servers ;) So: more reading for me for the next few days, and then I will start. Thanks again! JM 2012/6/16, Rob Verkuylen <[email protected]>:
> Just to add from my experience:
>
> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
> can handle 3,000-4,000 puts a second with ease, and a simple timerange scan
> can give you the records you need. I doubt you will be hitting these
> volumes anytime soon. A simple setup will get you your PoC, and then you
> can scale when you need to scale.
>
> Rob
>
> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
> <[email protected]> wrote:
>
>> Jean-Marc,
>>
>> You indicated that you didn't want to do full table scans when you want
>> to find out which files hadn't been touched since X time has passed.
>> (X could be months, weeks, days, hours, etc.)
>>
>> So here's the thing.
>> First, I am not convinced that you will have hotspotting.
>> Second, you end up having to do 26 scans instead of one. Then you need
>> to join the result sets.
>>
>> Not really a good solution if you think about it.
>>
>> Oh, and I don't believe that you will be hitting a single region,
>> although you may hit a region hard.
>> (Your second table's key is the timestamp of the last update to the
>> file. If the file hasn't been touched in a week, at scale it probably
>> won't be in the same region as a file that was recently touched.)
>>
>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only
>> be applied to a subset of problems.
>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>
>> HTH
>>
>> -Mike
>>
>>
>>
>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>
>> > Let's imagine the timestamp is "123456789".
>> >
>> > If I salt it with a letter from 'a' to 'z' then it will always be split
>> > between a few RegionServers. I will have something like "t123456789".
>> > The issue is that I will have to do 26 queries to be able to find all
>> > the entries. I will need to query from A000000000 to Axxxxxxxxx, then
>> > the same for B, and so on.
>> >
>> > So what's worse? Am I better off dealing with the hotspotting? Salting
>> > the key myself? Or what if I use something like HBaseWD?
>> >
>> > JM
>> >
>> > 2012/6/16, Michel Segel <[email protected]>:
>> >> You can't salt the key in the second table.
>> >> By salting the key, you lose the ability to do range scans, which is
>> >> what you want to do.
>> >>
>> >>
>> >>
>> >> Sent from a remote device. Please excuse any typos...
>> >>
>> >> Mike Segel
>> >>
>> >> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari
>> >> <[email protected]> wrote:
>> >>
>> >>> Thanks all for your comments and suggestions. Regarding the
>> >>> hotspotting, I will try to salt the key in the 2nd table and see the
>> >>> results.
>> >>>
>> >>> Yesterday I finished installing my 4-server cluster with old
>> >>> machines. It's slow, but it's working. So I will do some testing.
>> >>>
>> >>> You are recommending to round the timestamp to the second or minute
>> >>> and have more entries per row. Is that because it's better to have
>> >>> more columns than rows? Or is it more because that will give a more
>> >>> "squared" pattern (lots of rows, lots of columns), which is more
>> >>> efficient?
>> >>>
>> >>> JM
>> >>>
>> >>> 2012/6/15, Michael Segel <[email protected]>:
>> >>>> Thought about this a little bit more...
>> >>>>
>> >>>> You will want two tables for a solution.
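The salting scheme discussed above can be sketched in a few lines. This is an illustrative model only (plain strings standing in for HBase byte keys; the helper names are hypothetical): a one-letter salt derived from a hash of the file ID is prepended to the timestamp, and a range query then becomes 26 separate scans, one per salt.

```python
import hashlib

SALTS = "abcdefghijklmnopqrstuvwxyz"

def salted_key(timestamp: str, file_id: str) -> str:
    """Prefix the timestamp with a salt chosen by hashing the file ID,
    so writes spread across up to 26 regions instead of one."""
    bucket = int(hashlib.md5(file_id.encode()).hexdigest(), 16) % len(SALTS)
    return SALTS[bucket] + timestamp

def scan_ranges(start_ts: str, stop_ts: str):
    """One (start, stop) scan range per salt: 26 scans instead of one."""
    return [(s + start_ts, s + stop_ts) for s in SALTS]

key = salted_key("123456789", "server1:/var/log/messages")
ranges = scan_ranges("000000000", "123456789")
```

This makes the trade-off concrete: the write load is spread, but every "files older than X" query fans out into `len(SALTS)` range scans whose results must be joined client-side.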
>> >>>>
>> >>>> Table 1 Key: Unique ID
>> >>>>   Column: FilePath          Value: Full path to the file
>> >>>>   Column: Last Update Time  Value: timestamp
>> >>>>
>> >>>> Table 2 Key: Last Update Time (the timestamp)
>> >>>>   Column 1-N: Unique ID     Value: Full path to the file
>> >>>>
>> >>>> Now if you want to get fancy, in Table 1 you could use the timestamp
>> >>>> on the FilePath column to hold the last update time.
>> >>>> But it's probably easier for you to start by keeping the data as a
>> >>>> separate column and ignore the timestamps on the columns for now.
>> >>>>
>> >>>> Note the following:
>> >>>>
>> >>>> 1) I used the notation Column 1-N to reflect that for a given
>> >>>> timestamp you may or may not have multiple files that were updated.
>> >>>> (You weren't specific as to the scale.)
>> >>>> This is a good example of HBase's column-oriented approach, where you
>> >>>> may or may not have a column. It doesn't matter. :-) You could also
>> >>>> round the timestamp to the second or minute and have more entries per
>> >>>> row. It doesn't matter. You insert based on timestamp:columnName,
>> >>>> value, so you will add a column to this table.
>> >>>>
>> >>>> 2) First prove that the logic works. You insert/update Table 1 to
>> >>>> capture the ID of the file and its last update time. You then delete
>> >>>> the old timestamp entry in Table 2, then insert the new entry in
>> >>>> Table 2.
>> >>>>
>> >>>> 3) You store Table 2 in ascending order. Then when you want to find
>> >>>> your last 500 entries, you start a scan at 0x000 and limit the scan
>> >>>> to 500 rows. Note that you may or may not have multiple entries, so
>> >>>> as you walk through the result set, you count the number of columns
>> >>>> and stop when you have 500 columns, regardless of the number of rows
>> >>>> you've processed.
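Step 3 above, walking the ascending scan and counting columns rather than rows, can be sketched as follows. A plain in-memory list stands in for the HBase scan result here; the function name is hypothetical.

```python
def oldest_file_ids(rows, limit=500):
    """rows: iterable of (timestamp_key, {file_id: path}) pairs in
    ascending timestamp order, one pair per row of the index table.
    Collect file IDs column by column and stop at `limit` columns,
    however many rows that takes."""
    out = []
    for _ts, columns in rows:
        for file_id in columns:
            out.append(file_id)
            if len(out) == limit:
                return out
    return out

# Simulated scan: 300 timestamp rows, 3 file columns each (900 columns).
rows = [("t%09d" % ts, {"id%d-%d" % (ts, i): "/f" for i in range(3)})
        for ts in range(300)]
ids = oldest_file_ids(rows, limit=500)
```

Because one timestamp row may hold several file columns, the stop condition must be on the running column count, not on the row count, exactly as described above.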
>> >>>>
>> >>>> This should solve your problem and be pretty efficient.
>> >>>> You can then work out the coprocessors and add them to the solution
>> >>>> to be even more efficient.
>> >>>>
>> >>>>
>> >>>> With respect to 'hot-spotting', it can't be helped. You could hash
>> >>>> your unique ID in Table 1; this will reduce the potential for a
>> >>>> hotspot as the table splits.
>> >>>> On Table 2, because you have temporal data and you want to
>> >>>> efficiently scan a small portion of the table based on size, you will
>> >>>> always scan the first block; however, as data rolls off and
>> >>>> compaction occurs, you will probably have to do some cleanup. I'm not
>> >>>> sure how HBase handles regions that no longer contain data. When you
>> >>>> compact an empty region, does it go away?
>> >>>>
>> >>>> By switching to coprocessors, you limit the update accesses to the
>> >>>> second table, so you should still have pretty good performance.
>> >>>>
>> >>>> You may also want to look at Asynchronous HBase; however, I don't
>> >>>> know how well it will work with coprocessors or whether you want to
>> >>>> perform async operations in this specific use case.
>> >>>>
>> >>>> Good luck, HTH...
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>> >>>>
>> >>>>> Hi Michael,
>> >>>>>
>> >>>>> For now this is more a proof of concept than a production
>> >>>>> application. And if it works, it should grow a lot, and the database
>> >>>>> at the end will easily be over 1B rows. Each individual server will
>> >>>>> have to send its own information to one centralized server which
>> >>>>> will insert it into a database. That's why it needs to be very quick
>> >>>>> and that's why I'm looking in HBase's direction.
I tried some relational
>> >>>>> databases with 4M rows in the table, but the insert time is too slow
>> >>>>> when I have to introduce entries in bulk. Also, the ability of HBase
>> >>>>> to keep only the cells with values will allow me to save a lot of
>> >>>>> disk space (future projects).
>> >>>>>
>> >>>>> I'm not yet used to HBase and there are still many things I need to
>> >>>>> understand, but until I'm able to create a solution and test it, I
>> >>>>> will continue to read, learn and try that way. Then at the end I
>> >>>>> will be able to compare the 2 options I have (HBase or relational)
>> >>>>> and decide based on the results.
>> >>>>>
>> >>>>> So yes, your reply helped because it's giving me a way to achieve
>> >>>>> this goal (using coprocessors). I don't know yet how this part
>> >>>>> works, so I will dig into the documentation for it.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> JM
>> >>>>>
>> >>>>> 2012/6/14, Michael Segel <[email protected]>:
>> >>>>>> Jean-Marc,
>> >>>>>>
>> >>>>>> You do realize that this really isn't a good use case for HBase,
>> >>>>>> assuming that what you are describing is a stand-alone system.
>> >>>>>> It would be easier and better if you just used a simple relational
>> >>>>>> database.
>> >>>>>>
>> >>>>>> Then you would have your table with an ID, and a secondary index
>> >>>>>> on the timestamp.
>> >>>>>> Retrieve the data in ascending order by timestamp and take the top
>> >>>>>> 500 off the list.
>> >>>>>>
>> >>>>>> If you insist on using HBase, yes, you will have to have a
>> >>>>>> secondary table.
>> >>>>>> Then, using coprocessors...
>> >>>>>> When you update the row in your base table, you then get() the row
>> >>>>>> in your index by timestamp, removing the column for that rowid.
>> >>>>>> Add the new column to the timestamp row.
>> >>>>>>
>> >>>>>> As you put it.
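The index-maintenance sequence Mike describes (remove the file's column from the old timestamp row, then add it under the new timestamp) can be sketched with plain dicts standing in for the two tables. This is only a model of the logic; a real implementation would issue Get/Delete/Put calls against HBase, e.g. from a coprocessor hook, and all names here are hypothetical.

```python
main_table = {}   # file_id -> {"path": ..., "last_update": ts}
index_table = {}  # ts -> {file_id: path}  (the secondary index)

def touch_file(file_id, path, new_ts):
    """Record a fresh scan of `file_id`: update the main table and move
    the file's column in the index from its old timestamp row to new_ts."""
    old = main_table.get(file_id)
    if old is not None:
        # Remove the column for this file from the old timestamp row.
        row = index_table.get(old["last_update"], {})
        row.pop(file_id, None)
        if not row:  # drop the row entirely once its last column is gone
            index_table.pop(old["last_update"], None)
    main_table[file_id] = {"path": path, "last_update": new_ts}
    index_table.setdefault(new_ts, {})[file_id] = path

touch_file("id1", "/etc/hosts", 100)
touch_file("id1", "/etc/hosts", 200)  # re-scan: index entry moves 100 -> 200
```

The important property is that each update leaves exactly one index entry per file, so a scan of `index_table` in key order always reflects the current last-update times.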
>> >>>>>>
>> >>>>>> Now you can just do a partial scan on your index. Because your
>> >>>>>> index table is so small... you shouldn't worry about hotspots.
>> >>>>>> You may just want to rebuild your index every so often...
>> >>>>>>
>> >>>>>> HTH
>> >>>>>>
>> >>>>>> -Mike
>> >>>>>>
>> >>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>> >>>>>>
>> >>>>>>> Hi Michael,
>> >>>>>>>
>> >>>>>>> Thanks for your feedback. Here are more details to describe what
>> >>>>>>> I'm trying to achieve.
>> >>>>>>>
>> >>>>>>> My goal is to store information about files in the database. I
>> >>>>>>> need to check the oldest files in the database to refresh the
>> >>>>>>> information.
>> >>>>>>>
>> >>>>>>> The key is an 8-byte ID of the server in the network hosting the
>> >>>>>>> file + the MD5 of the file path. The total is a 24-byte key.
>> >>>>>>>
>> >>>>>>> So each time I look at a file and gather the information, I
>> >>>>>>> update its row in the database based on the key, including a
>> >>>>>>> "last_update" field.
>> >>>>>>> I can calculate this key for any file on the drives.
>> >>>>>>>
>> >>>>>>> In order to know which file I need to check in the network, I
>> >>>>>>> need to scan the table by the "last_update" field. So the idea is
>> >>>>>>> to build another table which contains the last_update as a key
>> >>>>>>> and the file IDs in columns. (Here is the hotspotting.)
>> >>>>>>>
>> >>>>>>> Each time I work on a file, I will have to update the main table
>> >>>>>>> by ID and remove the cell from the second table (the index) and
>> >>>>>>> put it back with the new "last_update" key.
>> >>>>>>>
>> >>>>>>> I'm mainly doing 3 operations in the database.
>> >>>>>>> 1) I retrieve a list of 500 files which need to be updated
>> >>>>>>> 2) I update the information for those 500 files (bulk update)
>> >>>>>>> 3) I load new file references to be checked.
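The 24-byte row key JM describes (8-byte server ID + 16-byte MD5 of the file path) could be built as below. The thread doesn't say how the 8-byte server ID is derived, so taking the first 8 bytes of an MD5 of the server name is purely an assumption for illustration.

```python
import hashlib

def row_key(server_name: str, file_path: str) -> bytes:
    """24-byte main-table key: 8-byte server ID + 16-byte MD5(path).
    The server-ID derivation is a hypothetical stand-in."""
    server_id = hashlib.md5(server_name.encode()).digest()[:8]  # 8 bytes
    path_hash = hashlib.md5(file_path.encode()).digest()        # 16 bytes
    return server_id + path_hash                                # 24 bytes

k = row_key("fileserver-01", "/home/jm/data/report.pdf")
```

Because the path hash dominates the key, rows for one server spread evenly under its 8-byte prefix, which matches JM's observation below that the main table's distribution is "almost perfect" while still allowing a prefix scan of one server's files.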
>> >>>>>>>
>> >>>>>>> For 2 and 3, I use the main table with the file ID as the key.
>> >>>>>>> The distribution is almost perfect because I'm using a hash. The
>> >>>>>>> prefix is the server ID, but access is not always going to the
>> >>>>>>> same server since it's driven by last_update. But this allows
>> >>>>>>> quick access to the list of files from one server.
>> >>>>>>> For 1, I expected to build this second table with the
>> >>>>>>> "last_update" as the key.
>> >>>>>>>
>> >>>>>>> Regarding the frequency, it really depends on the activity on the
>> >>>>>>> network, but it should be "often". The faster the database update
>> >>>>>>> is, the more up to date I will be able to keep it.
>> >>>>>>>
>> >>>>>>> JM
>> >>>>>>>
>> >>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>> >>>>>>>> Actually, I think you should revisit your key design....
>> >>>>>>>>
>> >>>>>>>> Look at your access path to the data for each of the types of
>> >>>>>>>> queries you are going to run.
>> >>>>>>>> From your post:
>> >>>>>>>> "I have a table with a unique key, a file path and a "last
>> >>>>>>>> update" field.
>> >>>>>>>>>>> I can easily find the file back with the ID and find when it
>> >>>>>>>>>>> has been updated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> But what I need too is to find the files not updated for more
>> >>>>>>>>>>> than a certain period of time.
>> >>>>>>>> "
>> >>>>>>>> So your primary query is going to be against the key.
>> >>>>>>>> Not sure if you meant to say that your key was a composite key
>> >>>>>>>> or not... it sounds like your key is just the unique key and the
>> >>>>>>>> rest are columns in the table.
>> >>>>>>>>
>> >>>>>>>> The secondary query or path to the data is to find data where
>> >>>>>>>> the files were not updated for more than a period of time.
>> >>>>>>>>
>> >>>>>>>> If you make your key temporal, that is, adding time as a
>> >>>>>>>> component of your key, you will end up creating new rows of data
>> >>>>>>>> while the old row still exists.
>> >>>>>>>> Not a good side effect.
>> >>>>>>>>
>> >>>>>>>> The other nasty side effect of using time as your key is that
>> >>>>>>>> you not only have the potential for hotspotting, but you also
>> >>>>>>>> have the nasty side effect of creating splits that will never
>> >>>>>>>> grow.
>> >>>>>>>>
>> >>>>>>>> How often are you going to ask to see the files that were not
>> >>>>>>>> updated in the last couple of days/minutes? If it's infrequent,
>> >>>>>>>> then you really shouldn't care if you have to do a complete
>> >>>>>>>> table scan.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>> >>>>>>>>
>> >>>>>>>>> Wow! This is exactly what I was looking for. So I will read all
>> >>>>>>>>> of that now.
>> >>>>>>>>>
>> >>>>>>>>> Need to read here at the bottom:
>> >>>>>>>>> https://github.com/sematext/HBaseWD
>> >>>>>>>>> and here:
>> >>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>>
>> >>>>>>>>> JM
>> >>>>>>>>>
>> >>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>> >>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this
>> >>>>>>>>>> comes up often.... Doug, maybe you could add it to the Ref
>> >>>>>>>>>> Guide?)
>> >>>>>>>>>>
>> >>>>>>>>>> Otis
>> >>>>>>>>>> ----
>> >>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>> >>>>>>>>>> http://sematext.com/spm
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> ________________________________
>> >>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>> >>>>>>>>>>> To: [email protected]
>> >>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>> >>>>>>>>>>> Subject: Timestamp as a key good practice?
>> >>>>>>>>>>>
>> >>>>>>>>>>> I watched Lars George's video about HBase and read the
>> >>>>>>>>>>> documentation, and it says that it's not a good idea to have
>> >>>>>>>>>>> the timestamp as a key because that will always load the same
>> >>>>>>>>>>> region until the timestamp reaches a certain value and moves
>> >>>>>>>>>>> to the next region (hotspotting).
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have a table with a unique key, a file path and a "last
>> >>>>>>>>>>> update" field.
>> >>>>>>>>>>> I can easily find the file back with the ID and find when it
>> >>>>>>>>>>> has been updated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> But what I need too is to find the files not updated for more
>> >>>>>>>>>>> than a certain period of time.
>> >>>>>>>>>>>
>> >>>>>>>>>>> If I want to retrieve that from this single table, I will
>> >>>>>>>>>>> have to do a full parsing of the table. Which might take a
>> >>>>>>>>>>> while.
>> >>>>>>>>>>>
>> >>>>>>>>>>> So I thought of building a table to reference that (a kind of
>> >>>>>>>>>>> secondary index). The key is the "last update", one CF, and
>> >>>>>>>>>>> each column will have the ID of the file with a dummy
>> >>>>>>>>>>> content.
>> >>>>>>>>>>>
>> >>>>>>>>>>> When a file is updated, I remove its cell from this table,
>> >>>>>>>>>>> and introduce a new cell with the new timestamp as the key.
>> >>>>>>>>>>>
>> >>>>>>>>>>> And so on.
>> >>>>>>>>>>>
>> >>>>>>>>>>> With this schema, I can find the files by ID very quickly and
>> >>>>>>>>>>> I can find the files which need to be updated pretty quickly
>> >>>>>>>>>>> too. But it's hotspotting one region.
>> >>>>>>>>>>>
>> >>>>>>>>>>> From the video (0:45:10) I can see 4 options.
>> >>>>>>>>>>> 1) Hotspotting.
>> >>>>>>>>>>> 2) Salting.
>> >>>>>>>>>>> 3) Key field swap/promotion.
>> >>>>>>>>>>> 4) Randomization.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other
>> >>>>>>>>>>> options.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I can do salting. Like prefixing the timestamp with a number
>> >>>>>>>>>>> between 0 and 9. That will distribute the load over 10
>> >>>>>>>>>>> servers. To find all the files with a timestamp below a
>> >>>>>>>>>>> specific value, I will need to run 10 requests instead of
>> >>>>>>>>>>> one. But when the load becomes too big for 10 servers, will I
>> >>>>>>>>>>> have to prefix with a number between 0 and 99? Which means
>> >>>>>>>>>>> 100 requests? And the more regions I have, the more requests
>> >>>>>>>>>>> I will have to do. Is that really a good approach?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Key field swap is close to salting. I can add the first few
>> >>>>>>>>>>> bytes from the path before the timestamp, but the issue will
>> >>>>>>>>>>> remain the same.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I looked at randomization, and I can't do that. Otherwise I
>> >>>>>>>>>>> will have no way to retrieve the information I'm looking for.
>> >>>>>>>>>>>
>> >>>>>>>>>>> So the question is: is there a good way to store the data so
>> >>>>>>>>>>> I can retrieve it based on the date?
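The client-side join that salting forces (one scan per bucket, results merged back into timestamp order) can be sketched like this. The in-memory lists stand in for per-bucket scan results; this is only a model of the fan-out/merge pattern JM is asking about, not an HBase client call.

```python
import heapq

def merged_scan(bucket_results):
    """bucket_results: one list of (timestamp, file_id) pairs per salt
    bucket, each already ascending (as a range scan would return them).
    Merge them lazily into a single ascending-timestamp stream."""
    return list(heapq.merge(*bucket_results, key=lambda kv: kv[0]))

# Three buckets' worth of scan results for "files older than X".
buckets = [
    [(100, "a1"), (300, "a2")],
    [(50, "b1"), (200, "b2")],
    [(150, "c1")],
]
rows = merged_scan(buckets)
```

The merge itself is cheap (each input is already sorted), but the cost JM worries about is real: the number of scans grows with the number of buckets, so raising the salt range from 10 to 100 multiplies the request count tenfold.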
>> >>>>>>>>>>> >> >>>>>>>>>>> Thanks, >> >>>>>>>>>>> >> >>>>>>>>>>> JM >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>>> >> >>> >> >> >> > >> >> >
