Am I better off running it on 1, or on 3? I just want to do some testing for now, but I have performance issues: it's taking 20 seconds to do 1000 gets with the current configuration... I'm tracking the issues down. I think the network is one of them, so I will address it this week. But for ZK, can I keep it on only one server for now, or will it be more efficient if I configure it on 3?

Thanks,

JM
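For reference, a minimal sketch of the two zoo.cfg shapes in question (hostnames and paths are placeholders). A standalone ZooKeeper omits the server.N lines entirely; a config that lists exactly one server.N entry is what produces the "only one server specified (ignoring)" error quoted below:

    # standalone: no server.N entries at all
    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181

    # three-member ensemble: the same file, plus all three peers, on every ZK node
    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=10
    syncLimit=5
    server.1=pc1.example.com:2888:3888
    server.2=pc3.example.com:2888:3888
    server.3=pc4.example.com:2888:3888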
2012/6/26, Jean-Daniel Cryans <[email protected]>:
> A quorum with 2 members is worse than 1, so don't put a ZK on PC2. The exception you are seeing is ZK trying to get a quorum with 1 machine, but that doesn't make sense, so instead it should revert to a standalone server and still work.
>
> J-D
>
> On Fri, Jun 22, 2012 at 7:20 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>> Hum... Seems that it's not working that way:
>>
>> ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one server specified (ignoring)
>>
>> So most probably the secondary should look exactly like the master, but I'm not 100% sure...
>>
>> 2012/6/22, Jean-Marc Spaggiari <[email protected]>:
>>> Ok. So if I understand correctly, I need:
>>> PC1 => HMaster (HBase), JobTracker (Hadoop), NameNode (Hadoop), and ZooKeeper (ZK)
>>> PC2 => Secondary NameNode (Hadoop)
>>> PC3 to x => DataNode (Hadoop), TaskTracker (Hadoop), RegionServer (HBase)
>>>
>>> For PC2, should I run ZooKeeper, JobTracker and a master too? Can I have 2 masters? Or do I run just the secondary name node?
>>>
>>> 2012/6/21, Michael Segel <[email protected]>:
>>>> If you have a really small cluster...
>>>> You can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node. (Secondary too.)
>>>> Then you have data nodes that run DN, TT, and RS.
>>>>
>>>> That would solve any ZK/RS problems.
>>>>
>>>> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>>>>
>>>>> Hi Mike, Hi Rob,
>>>>>
>>>>> Thanks for your replies and advice. Seems that now I'm due for some implementation. I'm reading Lars' book first, and when I'm done I will start with the coding.
>>>>>
>>>>> I already have my ZooKeeper/Hadoop/HBase running, and based on the first pages I read I already know it's not well done, since I have put a DataNode and a ZooKeeper server on ALL the servers ;) So, more reading for me for the next few days, and then I will start.
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> JM
>>>>>
>>>>> 2012/6/16, Rob Verkuylen <[email protected]>:
>>>>>> Just to add from my experience:
>>>>>>
>>>>>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine can handle 3,000-4,000 puts a second with ease, and a simple timerange scan can give you the records you need. I have my doubts you will be hitting those volumes anytime soon. A simple setup will get you your PoC, and then you can scale when you need to scale.
>>>>>>
>>>>>> Rob
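A minimal sketch of the timerange scan Rob mentions above, written against a hypothetical "files" table with the 0.94-era HBase Java client. Note that server-side this still walks the table; the timerange only restricts which cells come back:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class StaleFileScan {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "files");          // hypothetical table name
        try {
          long oneWeekAgo = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
          Scan scan = new Scan();
          scan.setTimeRange(0, oneWeekAgo);                // only cells older than a week
          scan.setCaching(100);                            // batch rows per RPC
          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result r : scanner) {
              System.out.println(r);                       // process each "stale" row
            }
          } finally {
            scanner.close();
          }
        } finally {
          table.close();
        }
      }
    }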
>>>>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[email protected]> wrote:
>>>>>>
>>>>>>> Jean-Marc,
>>>>>>>
>>>>>>> You indicated that you didn't want to do full table scans when you want to find out which files haven't been touched since X time has passed. (X could be months, weeks, days, hours, etc.)
>>>>>>>
>>>>>>> So here's the thing.
>>>>>>> First, I am not convinced that you will have hotspotting.
>>>>>>> Second, you end up having to do 26 scans instead of one, and then you need to join the result sets.
>>>>>>>
>>>>>>> Not really a good solution if you think about it.
>>>>>>>
>>>>>>> Oh, and I don't believe that you will be hitting a single region, although you may hit a region hard. (Your second table's key is the timestamp of the last update to the file. If the file hasn't been touched in a week, the probability is that at scale it won't be in the same region as a file that was recently touched.)
>>>>>>>
>>>>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be applied to a subset of problems. (Think round-robin partitioning in an RDBMS; DB2 was big on this.)
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>>>>
>>>>>>>> Let's imagine the timestamp is "123456789".
>>>>>>>>
>>>>>>>> If I salt it with a letter from 'a' to 'z' then it will always be split between a few RegionServers; I will have keys like "t123456789". The issue is that I will have to do 26 queries to be able to find all the entries: I will need to query from a000000000 to axxxxxxxxx, then the same for 'b', and so on.
>>>>>>>>
>>>>>>>> So what's worse? Am I better off dealing with the hotspotting? Salting the key myself? Or what if I use something like HBaseWD?
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>>>>>> You can't salt the key in the second table. By salting the key, you lose the ability to do range scans, which is what you want to do.
>>>>>>>>>
>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>
>>>>>>>>> Mike Segel
>>>>>>>>>
>>>>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks all for your comments and suggestions. Regarding the hotspotting, I will try to salt the key in the 2nd table and see the results.
>>>>>>>>>>
>>>>>>>>>> Yesterday I finished installing my 4-server cluster with old machines. It's slow, but it's working, so I will do some testing.
>>>>>>>>>>
>>>>>>>>>> You are recommending modifying the timestamp to be to the second or the minute and having more entries per row. Is that because it's better to have more columns than rows? Or is it more because that allows a more "squared" pattern (lots of rows, lots of columns), which is more efficient?
>>>>>>>>>>
>>>>>>>>>> JM
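For illustration, a rough sketch of the 'a' to 'z' salted read JM describes earlier in the thread: one scan per salt bucket, merged client-side. The index table and key layout are assumptions, and the loop makes Mike's objection concrete: it is exactly 26 scans whose results still need joining (and re-sorting) afterwards:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedIndexScan {
      // Fetch every index row whose timestamp part is below `cutoff`,
      // across all 26 salt buckets. Keys look like "t123456789".
      public static List<Result> scanAllBuckets(HTable index, String cutoff)
          throws IOException {
        List<Result> merged = new ArrayList<Result>();
        for (char salt = 'a'; salt <= 'z'; salt++) {
          Scan scan = new Scan(Bytes.toBytes(String.valueOf(salt)),  // bucket start
                               Bytes.toBytes(salt + cutoff));        // bucket end (exclusive)
          ResultScanner scanner = index.getScanner(scan);
          try {
            for (Result r : scanner) {
              merged.add(r);           // the client-side "join" of 26 result sets
            }
          } finally {
            scanner.close();
          }
        }
        return merged;                 // note: no longer globally sorted by time
      }
    }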
>>>>>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>>>>>> Thought about this a little bit more...
>>>>>>>>>>>
>>>>>>>>>>> You will want two tables for a solution.
>>>>>>>>>>>
>>>>>>>>>>> Table 1: Key: unique ID
>>>>>>>>>>>          Column: File Path         Value: full path to the file
>>>>>>>>>>>          Column: Last Update Time  Value: timestamp
>>>>>>>>>>>
>>>>>>>>>>> Table 2: Key: Last Update Time (the timestamp)
>>>>>>>>>>>          Columns 1-N: unique ID    Value: full path to the file
>>>>>>>>>>>
>>>>>>>>>>> Now if you want to get fancy, in Table 1 you could use the timestamp on the File Path column to hold the last update time. But it's probably easier for you to start by keeping the data as a separate column and ignore the timestamps on the columns for now.
>>>>>>>>>>>
>>>>>>>>>>> Note the following:
>>>>>>>>>>>
>>>>>>>>>>> 1) I used the notation "Columns 1-N" to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale.) This is a good example of HBase's column-oriented approach, where you may or may not have a column; it doesn't matter. :-) You could also modify the timestamp to be to the second or the minute and have more entries per row. It doesn't matter: you insert based on timestamp:columnName, value, so you will add a column to this table.
>>>>>>>>>>>
>>>>>>>>>>> 2) First prove that the logic works. You insert/update Table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in Table 2, then insert the new entry in Table 2.
>>>>>>>>>>>
>>>>>>>>>>> 3) You store Table 2 in ascending order. Then, when you want to find your last 500 entries, you start a scan at 0x000 and limit the scan to 500 rows. Note that you may or may not have multiple entries per row, so as you walk through the result set you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed.
>>>>>>>>>>>
>>>>>>>>>>> This should solve your problem and be pretty efficient. You can then work out the coprocessors and add them to the solution to be even more efficient.
>>>>>>>>>>>
>>>>>>>>>>> With respect to hotspotting, it can't be helped. You could hash your unique ID in Table 1; this will reduce the potential for a hotspot as the table splits. On Table 2, because you have temporal data and you want to efficiently scan a small portion of the table, you will always scan the first block. However, as data rolls off and compaction occurs, you will probably have to do some cleanup. I'm not sure how HBase handles regions that no longer contain data: when you compact an empty region, does it go away?
>>>>>>>>>>>
>>>>>>>>>>> By switching to coprocessors, you limit the update accesses to the second table, so you should still have pretty good performance.
>>>>>>>>>>>
>>>>>>>>>>> You may also want to look at asynchronous HBase; however, I don't know how well it will work with coprocessors, or whether you want to perform async operations in this specific use case.
>>>>>>>>>>>
>>>>>>>>>>> Good luck, HTH...
>>>>>>>>>>>
>>>>>>>>>>> -Mike
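A minimal sketch of the Table 2 read from point 3 above, assuming the 0.94-era Java client; the family name "f" and the 500 limit are the example's assumptions. Because the row key is the timestamp, a plain scan walks the index oldest-first, and we count columns rather than rows:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OldestFiles {
      private static final byte[] FAMILY = Bytes.toBytes("f");   // assumed family

      // Return up to `limit` file IDs, oldest first. A single timestamp row
      // may hold many file-ID columns, so the limit is applied per column.
      public static List<byte[]> oldest(HTable index, int limit) throws IOException {
        List<byte[]> fileIds = new ArrayList<byte[]>(limit);
        Scan scan = new Scan();                 // no start row: begin at the oldest key
        scan.setCaching(100);                   // fetch rows in batches per RPC
        ResultScanner scanner = index.getScanner(scan);
        try {
          for (Result row : scanner) {
            NavigableMap<byte[], byte[]> cols = row.getFamilyMap(FAMILY);
            if (cols == null) {
              continue;                         // no cells in this family
            }
            for (Map.Entry<byte[], byte[]> col : cols.entrySet()) {
              fileIds.add(col.getKey());        // qualifier = the file's unique ID
              if (fileIds.size() >= limit) {
                return fileIds;                 // stop at `limit` columns, mid-row if needed
              }
            }
          }
        } finally {
          scanner.close();
        }
        return fileIds;
      }
    }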
>>>>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> For now this is more a proof of concept than a production application. And if it works, it should grow a lot, and the database at the end will easily be over 1B rows. Each individual server will have to send its own information to one centralized server, which will insert it into a database. That's why it needs to be very quick, and that's why I'm looking in HBase's direction. I tried some relational databases with 4M rows in the table, but the insert time is too slow when I have to introduce entries in bulk. Also, HBase's ability to keep only the cells with values will save a lot of disk space (future projects).
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not yet used to HBase, and there are still many things I need to understand, but until I'm able to create a solution and test it, I will continue to read, learn and try that way. Then at the end I will be able to compare the 2 options I have (HBase or relational) and decide based on the results.
>>>>>>>>>>>>
>>>>>>>>>>>> So yes, your reply helped, because it gives me a way to achieve this goal (using coprocessors). I don't know yet how this part works, so I will dig into the documentation for it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> JM
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>> Jean-Marc,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a standalone system. It would be easier and better if you just used a simple relational database.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then you would have your table with an ID, and a secondary index on the timestamp. Retrieve the data in ascending order by timestamp and take the top 500 off the list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you insist on using HBase, yes, you will have to have a secondary table. Then, using coprocessors: when you update the row in your base table, you then get() the row in your index by timestamp, removing the column for that rowid, and add the new column to the timestamp row.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you put it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now you can just do a partial scan on your index. Because your index table is so small... you shouldn't worry about hotspots. You may just want to rebuild your index every so often...
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Mike
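A rough sketch of the index move described above: delete the file's cell from the old timestamp row, then put it under the new timestamp row. The table layout and names are assumptions (row key = big-endian 8-byte timestamp, qualifier = file ID), and a client-side version like this ignores the races that a coprocessor-based or periodically rebuilt index would have to consider:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexMaintenance {
      private static final byte[] FAMILY = Bytes.toBytes("f");   // assumed family

      // Move fileId's index entry from the oldTs row to the newTs row.
      // Big-endian long keys mean an ascending scan reads oldest-first.
      public static void touch(HTable index, byte[] fileId, long oldTs, long newTs)
          throws IOException {
        Delete del = new Delete(Bytes.toBytes(oldTs));
        del.deleteColumns(FAMILY, fileId);      // drop the stale cell (all versions)
        index.delete(del);

        Put put = new Put(Bytes.toBytes(newTs));
        put.add(FAMILY, fileId, new byte[0]);   // dummy value; the path would work too
        index.put(put);
      }
    }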
>>>>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your feedback. Here are more details describing what I'm trying to achieve.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My goal is to store information about files in the database. I need to check the oldest files in the database to refresh their information.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The key is an 8-byte ID of the server on the network hosting the file + the MD5 of the file path. The total is a 24-byte key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So each time I look at a file and gather its information, I update its row in the database based on the key, including a "last_update" field. I can calculate this key for any file on the drives.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to know which files I need to check on the network, I need to scan the table by the "last_update" field. So the idea is to build another table which contains the last_update as the key and the file IDs in columns. (Here is the hotspotting.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Each time I work on a file, I will have to update the main table by ID, remove the cell from the second table (the index), and put it back with the new "last_update" key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>>>>>>>>>>> 2) I update the information for those 500 files (bulk update).
>>>>>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key. The distribution is almost perfect because I'm using a hash. The prefix is the server ID, but the load is not always going to the same server, since the work is driven by last_update. And this allows quick access to the list of files from one server. For 1, I expected to build this second table with the "last_update" as the key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the frequency, it really depends on the activity on the network, but it should be "often". The faster the database updates are, the more up to date I will be able to keep it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JM
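For illustration, the 24-byte composite key JM describes, sketched in Java (the helper name is invented): 8 bytes of server ID followed by the 16-byte MD5 of the file path. Hashing the path is what gives the main table its near-uniform distribution:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FileKey {
      // 8-byte server ID + 16-byte MD5 of the file path = 24-byte row key.
      public static byte[] rowKey(long serverId, String filePath) {
        try {
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          byte[] pathHash = md5.digest(Bytes.toBytes(filePath)); // 16 bytes
          return Bytes.add(Bytes.toBytes(serverId), pathHash);   // 8 + 16 bytes
        } catch (NoSuchAlgorithmException e) {
          throw new RuntimeException("MD5 not available", e);    // effectively never
        }
      }
    }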
>>>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>>>> Actually, I think you should revisit your key design....
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Look at your access path to the data for each of the types of queries you are going to run. From your post:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "I have a table with a unique key, a file path and a 'last update' field. I can easily find back the file with the ID and find when it has been updated. But what I need too is to find the files not updated for more than a certain period of time."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So your primary query is going to be against the key. Not sure if you meant to say that your key was a composite key or not... It sounds like your key is just the unique key and the rest are columns in the table.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The secondary query, or path to the data, is to find the files that were not updated for more than a period of time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you make your key temporal, that is, add time as a component of your key, you will end up creating new rows of data while the old row still exists. Not a good side effect.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other nasty side effect of using time as your key is that you not only have the potential for hotspotting, but you also end up creating regions that will never grow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How often are you going to ask to see the files that were not updated in the last couple of days/minutes? If it's infrequent, then you shouldn't really care if you have to do a complete table scan.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that now.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Need to read here, at the bottom:
>>>>>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>>>>> and here:
>>>>>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up often... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the documentation, and they say that it's not a good idea to have the timestamp as a key, because that will always load the same region until the timestamp reaches a certain value and moves to the next region (hotspotting).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have a table with a unique key, a file path and a "last update" field.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will have to do a full parse of the table, which might take a while.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind of secondary index). The key is the "last update", there is one column family, and each column will have the ID of the file with a dummy content.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table and introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly, and I can find the files which need to be updated pretty quickly too. But it's hotspotting one region.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 options:
>>>>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>>>>> 3) Key field swap/promotion.
>>>>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can do salting, like prefixing the timestamp with a number between 0 and 9, so that the load is distributed over 10 servers. To find all the files with a timestamp below a specific value, I will need to run 10 requests instead of one. But when the load becomes too big for 10 servers, will I have to prefix with a number between 0 and 99, which means 100 requests? The more regions I have, the more requests I will have to do. Is that really a good approach?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes of the path before the timestamp, but the issue will remain the same.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I looked at randomization, and I can't do that; otherwise I will have no way to retrieve the information I'm looking for.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So the question is: is there a good way to store the data so that I can retrieve it based on the date?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JM
