We use https://github.com/sematext/HBaseWD, and I just learned that people at Amazon.com are using it and are happy with it, so it may work for you, too.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
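HBaseWD's core idea is to prepend a small, bounded bucket prefix to the original row key on write and to fan scans out over all buckets on read. A minimal Java sketch of that idea follows; the class and method names here are illustrative, not the library's actual API:

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    // Illustrative sketch of bucket-prefixing (not HBaseWD's real API).
    public class PrefixedKeySketch {
        static final int BUCKETS = 32; // small, fixed number of prefixes

        // Write path: derive a one-byte bucket from the original key and prepend it.
        static byte[] toDistributedKey(byte[] originalKey) {
            byte bucket = (byte) Math.abs(Arrays.hashCode(originalKey) % BUCKETS);
            return ByteBuffer.allocate(1 + originalKey.length)
                    .put(bucket).put(originalKey).array();
        }

        // Read path: a range scan has to be issued once per bucket and the results merged.
        static byte[][] allDistributedKeys(byte[] originalKey) {
            byte[][] keys = new byte[BUCKETS][];
            for (int b = 0; b < BUCKETS; b++) {
                keys[b] = ByteBuffer.allocate(1 + originalKey.length)
                        .put((byte) b).put(originalKey).array();
            }
            return keys;
        }
    }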
On Wed, Nov 20, 2013 at 1:00 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:

> Thanks for clearing that out.
> I'm using your message to ping anyone who can assist, since it appears this use case must be common to a lot of people.
>
> Thanks!
>
> On Wednesday, November 20, 2013, Himanshu Vashishtha wrote:
>
>> Re: "The 32-file limit makes HBase go into stress mode and flush all regions contained in those 32 WAL files."
>>
>> Pardon, I haven't read all your data points/details thoroughly, but the above statement is not true. Rather, it looks at the oldest WAL file and flushes those regions which would free that WAL file.
>>
>> But I agree that, in general, with this kind of workload we should handle WAL files more intelligently and free up those WAL files which don't have any dependency (that is, all their entries are already flushed) when archiving. We do that in trunk, but not in any released version.
>>
>> On Sat, Nov 16, 2013 at 11:16 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>
>> > First, I forgot to mention that <customerId> in our case is MD5(<customerId>).
>> > In our case, we have so much data flowing in that we end up with a region per <customerId><bucket> pretty quickly, and even that gets split into different regions by specific date ranges (timestamp).
>> >
>> > We're not witnessing a hotspot issue. I built some scripts in Java and awk and saw that 66% of our customers use more than one RS.
>> >
>> > We have two main serious issues: a primary and a secondary one.
>> >
>> > Our primary issue is slow regions vs. fast regions. First, recall that, as I detailed before, a region represents a specific <customerId><bucket>. Some customers get 50x more data than other customers within a given time frame (2 hrs - 1 day). So on a single RS we have regions getting 10 write requests per hour vs. regions getting 50k write requests per hour. The region mapped to a slow-filling customer id never reaches the 256 MB flush limit and hence isn't flushed, while the regions mapped to a fast-filling customer id flush very quickly, since they fill up very quickly.
>> > Let's say the 1st WAL file contains a put for a slow-filling customerId, and a fast-filling customerId fills up the rest of that file. After 20-30 seconds the file gets rolled, and another file fills up with the fast-filling customerId. After a while we reach 32 WAL files. The 1st file was never deleted, since its region wasn't flushed. The 32-file limit makes HBase go into stress mode and flush all regions contained in those 32 WAL files. In our case, we saw it flush 111 regions. Many of the resulting store files are 3 KB - 3 MB in size, so our compaction queue starts filling up with store files that need to be compacted.
>> > At the end of the road, the RS dies.
>> >
>> > Our secondary issue is that of empty regions: we get to a situation where a region is mapped to a specific <customerId>, <bucket>, and date range (1/7 - 3/7). Then, when we are in August (we have the TTL set to 30 days), those regions become empty and will never be filled again.
>> > We assume this somehow wreaks havoc on the load balancer, and MSLAB also probably steals 1-2 GB of memory for those empty regions.
>> >
>> > Thanks!
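To make the layout Asaf describes concrete, here is a minimal Java sketch of the key construction under discussion (assuming MD5 of the customer id as the first component and a one-byte bucket; the helper names are illustrative). Because every write for a given <customerId><bucket> lands in one region, a slow customer can pin the oldest WAL file exactly as described; the 32-file cap and the 256 MB flush threshold presumably correspond to the hbase.regionserver.maxlogs and hbase.hregion.memstore.flush.size settings in that era's configuration.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Illustrative sketch of the current key layout:
    // <MD5(customerId)><bucket><timestampInMs><uniqueId>
    public class CurrentKeySketch {
        static byte[] build(String customerId, byte bucket, long timestampMs, byte[] uniqueId)
                throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(customerId.getBytes(StandardCharsets.UTF_8)); // 16 bytes
            return ByteBuffer.allocate(16 + 1 + 8 + uniqueId.length)
                    .put(md5).put(bucket).putLong(timestampMs).put(uniqueId)
                    .array();
        }
    }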
>> >
>> > On Sat, Nov 16, 2013 at 7:25 PM, Mike Axiak <m...@axiak.net> wrote:
>> >
>> > > Hi,
>> > >
>> > > One new key pattern that we're starting to use is a salt based on a shard. For example, let's take your key:
>> > >
>> > > <customerId><bucket><timestampInMs><uniqueId>
>> > >
>> > > Consider a shard between 0 and 15 inclusive. We determine this with:
>> > >
>> > > <shard> = abs(hash32(uniqueId) % 16)
>> > >
>> > > We can then define a salt based on customerId and the shard:
>> > >
>> > > <salt> = hash32(<shard><customerId>)
>> > >
>> > > So the new key becomes:
>> > >
>> > > <salt><customerId><timestampInMs><uniqueId>
>> > >
>> > > This will distribute the data for a given customer across the N shards that you pick, while keeping a deterministic function for a given row key (so long as the number of shards you pick is fixed; otherwise you have to migrate the data). Placing the bucket after the customerId doesn't help distribute a single customer's data at all. Furthermore, by using a separate hash (instead of just <shard><customerId>), you're guaranteeing that new data will appear in a somewhat random location (i.e., solving the problem of adding a bunch of new data for a new customer).
>> > >
>> > > I have a key simulation script in Python that I can start tweaking and share with people if they'd like.
>> > >
>> > > Hope this helps,
>> > > Mike
>> > >
>> > > On Sat, Nov 16, 2013 at 1:16 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > >
>> > > > bq. all regions of that customer
>> > > >
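A minimal Java sketch of the salted key Mike describes, assuming hash32 is any stable 32-bit hash (Arrays.hashCode is used here purely for brevity), 16 shards, and illustrative serialization widths:

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    // Illustrative sketch of <salt><customerId><timestampInMs><uniqueId>.
    public class SaltedKeySketch {
        static final int NUM_SHARDS = 16; // must stay fixed, or existing data has to be migrated

        static byte[] build(byte[] customerId, long timestampMs, byte[] uniqueId) {
            // <shard> = abs(hash32(uniqueId) % 16)
            int shard = Math.abs(Arrays.hashCode(uniqueId) % NUM_SHARDS);
            // <salt> = hash32(<shard><customerId>)
            byte[] saltInput = ByteBuffer.allocate(4 + customerId.length)
                    .putInt(shard).put(customerId).array();
            int salt = Arrays.hashCode(saltInput);
            // <salt><customerId><timestampInMs><uniqueId>
            return ByteBuffer.allocate(4 + customerId.length + 8 + uniqueId.length)
                    .putInt(salt).put(customerId).putLong(timestampMs).put(uniqueId)
                    .array();
        }
    }

Note that the shard itself is never stored; it only feeds the salt, so reads for a single customer still have to fan out over the (at most 16) possible salt values for that customer and merge the results.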