If you have a really small cluster... you can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node (the Secondary NameNode too). Then you have data nodes that run the DN, TT, and RS daemons.
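Sketched as a simple role map (the hostnames and the Python representation are purely illustrative, not any actual Hadoop configuration format):

```python
# Illustrative layout for a very small cluster: all master daemons
# colocated on one node, worker daemons on the rest.
# Hostnames are made up for this sketch.
MASTER_ROLES = {"HMaster", "JobTracker", "NameNode", "SecondaryNameNode", "ZooKeeper"}
WORKER_ROLES = {"DataNode", "TaskTracker", "RegionServer"}

cluster = {
    "node1": MASTER_ROLES,
    "node2": WORKER_ROLES,
    "node3": WORKER_ROLES,
    "node4": WORKER_ROLES,
}

def masters_colocated(cluster):
    """True if exactly one node carries all the master daemons."""
    return sum(1 for roles in cluster.values() if MASTER_ROLES <= roles) == 1
```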
That would solve any ZK/RS problems.

On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:

> Hi Mike, Hi Rob,
>
> Thanks for your replies and advice. It seems I'm now due for some
> implementation. I'm reading Lars' book first, and when I'm done I will
> start with the coding.
>
> I already have my ZooKeeper/Hadoop/HBase running, and based on the
> first pages I read, I already know it's not well done, since I have put
> a DataNode and a ZooKeeper server on ALL the servers ;) So: more
> reading for me for the next few days, and then I will start.
>
> Thanks again!
>
> JM
>
> 2012/6/16, Rob Verkuylen <[email protected]>:
>> Just to add from my experience:
>>
>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
>> can handle 3-4,000 puts a second with ease, and a simple time-range scan
>> can give you the records you need. I have my doubts you will be hitting
>> those numbers anytime soon. A simple setup will get your PoC done; then
>> scale when you need to scale.
>>
>> Rob
>>
>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
>> <[email protected]> wrote:
>>
>>> Jean-Marc,
>>>
>>> You indicated that you didn't want to do full table scans when you want
>>> to find out which files haven't been touched since X time has passed.
>>> (X could be months, weeks, days, hours, etc.)
>>>
>>> So here's the thing.
>>> First, I am not convinced that you will have hotspotting.
>>> Second, you end up having to do 26 scans instead of one. Then you need
>>> to join the result sets.
>>>
>>> Not really a good solution if you think about it.
>>>
>>> Oh, and I don't believe that you will be hitting a single region,
>>> although you may hit a region hard.
>>> (Your second table's key is the timestamp of the last update to the
>>> file. If the file hasn't been touched in a week, at scale it probably
>>> won't be in the same region as a file that was recently touched.)
>>>
>>> I wouldn't recommend HBaseWD.
>>> It's cute, it's not novel, and it can only be applied to a subset of
>>> problems.
>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>
>>>> Let's imagine the timestamp is "123456789".
>>>>
>>>> If I salt it with a letter from 'a' to 'z' then it will always be split
>>>> between a few RegionServers. I will have keys like "t123456789". The
>>>> issue is that I will have to do 26 queries to be able to find all the
>>>> entries. I will need to query from A000000000 to Axxxxxxxxx, then the
>>>> same for B, and so on.
>>>>
>>>> So what's worse? Am I better off dealing with the hotspotting? Salting
>>>> the key myself? Or what if I use something like HBaseWD?
>>>>
>>>> JM
>>>>
>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>> You can't salt the key in the second table.
>>>>> By salting the key, you lose the ability to do range scans, which is
>>>>> what you want to do.
>>>>>
>>>>> Sent from a remote device. Please excuse any typos...
>>>>>
>>>>> Mike Segel
>>>>>
>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>> hotspotting, I will try to salt the key in the 2nd table and see the
>>>>>> results.
>>>>>>
>>>>>> Yesterday I finished installing my 4-server cluster with old
>>>>>> machines. It's slow, but it's working. So I will do some testing.
>>>>>>
>>>>>> You are recommending to modify the timestamp to be to the second or
>>>>>> minute and have more entries per row. Is that because it's better to
>>>>>> have more columns than rows? Or is it more because that will allow a
>>>>>> more "squared" pattern (lots of rows, lots of columns), which is
>>>>>> more efficient?
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>> Thought about this a little bit more...
>>>>>>>
>>>>>>> You will want two tables for a solution.
>>>>>>>
>>>>>>> Table 1: Key: Unique ID
>>>>>>>          Column: File Path          Value: full path to the file
>>>>>>>          Column: Last Update Time   Value: timestamp
>>>>>>>
>>>>>>> Table 2: Key: Last Update Time (the timestamp)
>>>>>>>          Columns 1-N: Unique ID     Value: full path to the file
>>>>>>>
>>>>>>> Now if you want to get fancy, in Table 1 you could use the timestamp
>>>>>>> on the File Path column to hold the last update time.
>>>>>>> But it's probably easier for you to start by keeping the data as a
>>>>>>> separate column and ignore the timestamps on the columns for now.
>>>>>>>
>>>>>>> Note the following:
>>>>>>>
>>>>>>> 1) I used the notation Columns 1-N to reflect that for a given
>>>>>>> timestamp you may or may not have multiple files that were updated.
>>>>>>> (You weren't specific as to the scale.)
>>>>>>> This is a good example of HBase's column-oriented approach, where
>>>>>>> you may or may not have a column. It doesn't matter. :-) You could
>>>>>>> also modify the timestamp to be to the second or minute and have
>>>>>>> more entries per row. It doesn't matter. You insert based on
>>>>>>> timestamp:columnName, value, so you will add a column to this table.
>>>>>>>
>>>>>>> 2) First prove that the logic works. You insert/update Table 1 to
>>>>>>> capture the ID of the file and its last update time. You then
>>>>>>> delete the old timestamp entry in Table 2, then insert the new
>>>>>>> entry in Table 2.
>>>>>>>
>>>>>>> 3) You store Table 2 in ascending order. Then when you want to find
>>>>>>> your last 500 entries, you do a start scan at 0x000 and then limit
>>>>>>> the scan to 500 rows.
>>>>>>> Note that you may or may not have multiple entries per row, so as
>>>>>>> you walk through the result set, you count the number of columns
>>>>>>> and stop when you have 500 columns, regardless of the number of
>>>>>>> rows you've processed.
>>>>>>>
>>>>>>> This should solve your problem and be pretty efficient.
>>>>>>> You can then work out the coprocessors and add them to the solution
>>>>>>> to be even more efficient.
>>>>>>>
>>>>>>> With respect to 'hot-spotting', it can't be helped. You could hash
>>>>>>> your unique ID in Table 1; this will reduce the potential of a
>>>>>>> hotspot as the table splits.
>>>>>>> On Table 2, because you have temporal data and you want to
>>>>>>> efficiently scan a small portion of the table based on size, you
>>>>>>> will always scan the first block. However, as data rolls off and
>>>>>>> compaction occurs, you will probably have to do some cleanup. I'm
>>>>>>> not sure how HBase handles splits that no longer contain data. When
>>>>>>> you compact an empty split, does it go away?
>>>>>>>
>>>>>>> By switching to coprocessors, you now limit the update accesses to
>>>>>>> the second table, so you should still have pretty good performance.
>>>>>>>
>>>>>>> You may also want to look at Asynchronous HBase (asynchbase);
>>>>>>> however, I don't know how well it will work with coprocessors, or
>>>>>>> if you want to perform async operations in this specific use case.
>>>>>>>
>>>>>>> Good luck, HTH...
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>
>>>>>>>> Hi Michael,
>>>>>>>>
>>>>>>>> For now this is more a proof of concept than a production
>>>>>>>> application. And if it works, it should grow a lot, and the
>>>>>>>> database at the end will easily be over 1B rows.
>>>>>>>> Each individual server will have to send its own information to
>>>>>>>> one centralized server, which will insert it into a database.
>>>>>>>> That's why it needs to be very quick, and that's why I'm looking
>>>>>>>> in HBase's direction. I tried with some relational databases with
>>>>>>>> 4M rows in the table, but the insert time is too slow when I have
>>>>>>>> to introduce entries in bulk. Also, the ability of HBase to keep
>>>>>>>> only the cells with values will allow me to save a lot of disk
>>>>>>>> space (future projects).
>>>>>>>>
>>>>>>>> I'm not yet used to HBase and there are still many things I need
>>>>>>>> to understand, but until I'm able to create a solution and test
>>>>>>>> it, I will continue to read, learn and try that way. Then at the
>>>>>>>> end I will be able to compare the 2 options I have (HBase or
>>>>>>>> relational) and decide based on the results.
>>>>>>>>
>>>>>>>> So yes, your reply helped, because it's giving me a way to achieve
>>>>>>>> this goal (using co-processors). I don't know yet how this part
>>>>>>>> works, so I will dig into the documentation for it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>> Jean-Marc,
>>>>>>>>>
>>>>>>>>> You do realize that this really isn't a good use case for HBase,
>>>>>>>>> assuming that what you are describing is a stand-alone system.
>>>>>>>>> It would be easier and better if you just used a simple
>>>>>>>>> relational database.
>>>>>>>>>
>>>>>>>>> Then you would have your table with an ID, and a secondary index
>>>>>>>>> on the timestamp.
>>>>>>>>> Retrieve the data in ascending order by timestamp and take the
>>>>>>>>> top 500 off the list.
>>>>>>>>>
>>>>>>>>> If you insist on using HBase, yes, you will have to have a
>>>>>>>>> secondary table.
>>>>>>>>> Then using co-processors...
>>>>>>>>> When you update the row in your base table, you then get() the
>>>>>>>>> row in your index by timestamp, removing the column for that
>>>>>>>>> rowid.
>>>>>>>>> Add the new column to the timestamp row.
>>>>>>>>>
>>>>>>>>> As you put it.
>>>>>>>>>
>>>>>>>>> Now you can just do a partial scan on your index. Because your
>>>>>>>>> index table is so small... you shouldn't worry about hotspots.
>>>>>>>>> You may just want to rebuild your index every so often...
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>
>>>>>>>>>> Hi Michael,
>>>>>>>>>>
>>>>>>>>>> Thanks for your feedback. Here are more details to describe what
>>>>>>>>>> I'm trying to achieve.
>>>>>>>>>>
>>>>>>>>>> My goal is to store information about files in the database. I
>>>>>>>>>> need to check the oldest files in the database to refresh their
>>>>>>>>>> information.
>>>>>>>>>>
>>>>>>>>>> The key is an 8-byte ID of the server name in the network
>>>>>>>>>> hosting the file + the MD5 of the file path. The total is a
>>>>>>>>>> 24-byte key.
>>>>>>>>>>
>>>>>>>>>> So each time I look at a file and gather its information, I
>>>>>>>>>> update its row in the database based on the key, including a
>>>>>>>>>> "last_update" field. I can calculate this key for any file in
>>>>>>>>>> the drives.
>>>>>>>>>>
>>>>>>>>>> In order to know which file I need to check in the network, I
>>>>>>>>>> need to scan the table by the "last_update" field. So the idea
>>>>>>>>>> is to build another table which contains the last_update as a
>>>>>>>>>> key and the file IDs in columns. (Here is where the hotspotting
>>>>>>>>>> comes in.)
>>>>>>>>>>
>>>>>>>>>> Each time I work on a file, I will have to update the main table
>>>>>>>>>> by ID and remove the cell from the second table (the index) and
>>>>>>>>>> put it back with the new "last_update" key.
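The two-table maintenance described above (delete the column under the old timestamp row, re-add it under the new one, then scan the index in ascending order counting columns rather than rows) can be sketched with plain Python dicts standing in for the tables. This is a toy in-memory model, not the HBase client API or a real coprocessor; all names are illustrative:

```python
# Table 1: file_id -> {"path": ..., "last_update": ...}
# Table 2 (the index): last_update -> {file_id: path}
main_table = {}
index_table = {}

def touch_file(file_id, path, new_ts):
    """Update a file's last_update and keep the timestamp index in sync."""
    old = main_table.get(file_id)
    if old is not None:
        # Remove the column for this rowid from the old timestamp row.
        row = index_table.get(old["last_update"], {})
        row.pop(file_id, None)
        if not row:
            index_table.pop(old["last_update"], None)
    main_table[file_id] = {"path": path, "last_update": new_ts}
    # Add the column under the new timestamp row.
    index_table.setdefault(new_ts, {})[file_id] = path

def oldest_files(limit):
    """Walk the index in ascending timestamp order, counting columns
    (not rows) until `limit` entries are collected."""
    out = []
    for ts in sorted(index_table):
        for file_id, path in index_table[ts].items():
            out.append((ts, file_id, path))
            if len(out) == limit:
                return out
    return out
```

In HBase terms, `touch_file` is the delete-then-put pair on the index table and `oldest_files` is the ascending scan capped at N columns.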
>>>>>>>>>>
>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>>>>>>> 2) I update the information for those 500 files (bulk update).
>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>
>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key.
>>>>>>>>>> The distribution is almost perfect because I'm using a hash. The
>>>>>>>>>> prefix is the server ID, but it's not always going to the same
>>>>>>>>>> server since it's done by last_update. But this allows quick
>>>>>>>>>> access to the list of files from one server.
>>>>>>>>>> For 1, I expected to build this second table with the
>>>>>>>>>> "last_update" as the key.
>>>>>>>>>>
>>>>>>>>>> Regarding the frequency, it really depends on the activity on
>>>>>>>>>> the network, but it should be "often". The faster the database
>>>>>>>>>> updates are, the more up to date I will be able to keep it.
>>>>>>>>>>
>>>>>>>>>> JM
>>>>>>>>>>
>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>> Actually, I think you should revisit your key design...
>>>>>>>>>>>
>>>>>>>>>>> Look at your access path to the data for each of the types of
>>>>>>>>>>> queries you are going to run.
>>>>>>>>>>> From your post:
>>>>>>>>>>> "I have a table with a unique key, a file path and a "last
>>>>>>>>>>> update" field.
>>>>>>>>>>>>>> I can easily find the file back with the ID and find when it
>>>>>>>>>>>>>> has been updated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But what I need too is to find the files not updated for
>>>>>>>>>>>>>> more than a certain period of time.
>>>>>>>>>>> "
>>>>>>>>>>> So your primary query is going to be against the key.
>>>>>>>>>>> Not sure if you meant to say that your key was a composite key
>>>>>>>>>>> or not...
>>>>>>>>>>> It sounds like your key is just the unique key and the rest
>>>>>>>>>>> are columns in the table.
>>>>>>>>>>>
>>>>>>>>>>> The secondary query, or path to the data, is to find data where
>>>>>>>>>>> the files were not updated for more than a period of time.
>>>>>>>>>>>
>>>>>>>>>>> If you make your key temporal, that is, adding time as a
>>>>>>>>>>> component of your key, you will end up creating new rows of
>>>>>>>>>>> data while the old row still exists.
>>>>>>>>>>> Not a good side effect.
>>>>>>>>>>>
>>>>>>>>>>> The other nasty side effect of using time as your key is that
>>>>>>>>>>> you not only have the potential for hotspotting, but you also
>>>>>>>>>>> have the nasty side effect of creating splits that will never
>>>>>>>>>>> grow.
>>>>>>>>>>>
>>>>>>>>>>> How often are you going to ask to see the files that were not
>>>>>>>>>>> updated in the last couple of days/minutes? If it's infrequent,
>>>>>>>>>>> then you shouldn't really care if you have to do a complete
>>>>>>>>>>> table scan.
>>>>>>>>>>>
>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read
>>>>>>>>>>>> all of that now.
>>>>>>>>>>>>
>>>>>>>>>>>> Need to read here at the bottom:
>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>> and here:
>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> JM
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this
>>>>>>>>>>>>> comes up often...
>>>>>>>>>>>>> Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Otis
>>>>>>>>>>>>> ----
>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>>>>>>> http://sematext.com/spm
>>>>>>>>>>>>>
>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the
>>>>>>>>>>>>>> documentation, and it says that it's not a good idea to have
>>>>>>>>>>>>>> the timestamp as a key, because that will always load the
>>>>>>>>>>>>>> same region until the timestamp reaches a certain value and
>>>>>>>>>>>>>> moves to the next region (hotspotting).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a table with a unique key, a file path and a "last
>>>>>>>>>>>>>> update" field.
>>>>>>>>>>>>>> I can easily find the file back with the ID and find when it
>>>>>>>>>>>>>> has been updated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But what I need too is to find the files not updated for
>>>>>>>>>>>>>> more than a certain period of time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will
>>>>>>>>>>>>>> have to do a full parsing of the table, which might take a
>>>>>>>>>>>>>> while.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind
>>>>>>>>>>>>>> of secondary index). The key is the "last update", with one
>>>>>>>>>>>>>> CF, and each column will have the ID of the file with a
>>>>>>>>>>>>>> dummy content.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table
>>>>>>>>>>>>>> and introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly,
>>>>>>>>>>>>>> and I can find the files which need to be updated pretty
>>>>>>>>>>>>>> quickly too. But it's hotspotting one region.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 options:
>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>> 3) Key field swap/promotion.
>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other
>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can do salting, like prefixing the timestamp with a number
>>>>>>>>>>>>>> between 0 and 9. That will distribute the load over 10
>>>>>>>>>>>>>> servers. To find all the files with a timestamp below a
>>>>>>>>>>>>>> specific value, I will need to run 10 requests instead of
>>>>>>>>>>>>>> one. But when the load becomes too big for 10 servers, I
>>>>>>>>>>>>>> will have to prefix with a number between 0 and 99? Which
>>>>>>>>>>>>>> means 100 requests? And the more regions I have, the more
>>>>>>>>>>>>>> requests I will have to do. Is that really a good approach?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few
>>>>>>>>>>>>>> bytes of the path before the timestamp, but the issue will
>>>>>>>>>>>>>> remain the same.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I looked at randomization, and I can't do that. Otherwise I
>>>>>>>>>>>>>> will have no way to retrieve the information I'm looking
>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So the question is:
>>>>>>>>>>>>>> Is there a good way to store the data so that I can retrieve
>>>>>>>>>>>>>> it based on the date?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JM

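The salting trade-off debated throughout the thread (an 'a'-'z' prefix spreads the writes, but turns one range scan into 26 partial scans whose results must be merged back into timestamp order) can be sketched like this. Sorted Python lists stand in for the per-salt key ranges; this is a toy model of the idea, not HBaseWD or the HBase client:

```python
import heapq
import string

# Toy model: one sorted list of (salted_key, file_id) pairs per salt
# prefix, standing in for the key range a region would hold.
SALTS = string.ascii_lowercase  # 'a'..'z', as in JM's example

def salt_for(file_id):
    """Derive a stable salt from the file id so re-salting is repeatable."""
    return SALTS[sum(map(ord, file_id)) % len(SALTS)]

def put(regions, ts, file_id):
    """Row key is salt + zero-padded timestamp, e.g. 't0123456789'."""
    s = salt_for(file_id)
    regions.setdefault(s, []).append(("%s%010d" % (s, ts), file_id))
    regions[s].sort()

def scan_before(regions, cutoff_ts):
    """Find everything older than the cutoff: one range scan *per salt*,
    then a merge of the 26 partial results back into timestamp order."""
    partials = []
    for s in SALTS:
        hi = "%s%010d" % (s, cutoff_ts)
        # Strip the salt so the merge compares bare timestamps.
        partials.append([(k[1:], f) for (k, f) in regions.get(s, []) if k < hi])
    return [f for (_ts, f) in heapq.merge(*partials)]
```

The write path touches only one salt bucket per put (no single hot range), while the read path pays the fan-out cost Mike objects to: 26 scans plus a merge, and the fan-out grows if the salt space ever has to grow.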