Leon, have a look at HBaseWD to solve key problems: https://github.com/sematext/HBaseWD#readme
Here is a post about it that includes some figures, performance graphs, and code:
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

>________________________________
> From: Leon Mergen <[email protected]>
>To: [email protected]
>Sent: Tuesday, May 15, 2012 10:53 AM
>Subject: Re: Splits and MapReduce
>
>Hello Himanish,
>
>Thanks for the advice. It looks like they are using a compound key of a
>"metric id" in addition to the timestamp:
>
>http://opentsdb.net/schema.html
>
>This sounds like a good solution for their use case but, unfortunately, we
>have a lot of MapReduce jobs which *only* filter based on the timestamp,
>and thus would result in a big table scan. However, I did find this little
>gem:
>
>https://bugzilla.mozilla.org/show_bug.cgi?id=566340
>
>It looks like the Mozilla Socorro project ran into a similar issue, and
>they have chosen to use a salt for their row keys: prefix the timestamp
>with the first digit of an OOID to ensure a certain amount of parallelism
>when writing.
>
>What are the thoughts of the experts here about this solution?
>
>Regards,
>
>Leon Mergen
>
>On Tue, May 15, 2012 at 4:28 PM, Himanish Kushary <[email protected]> wrote:
>
>> Hi,
>>
>> You could take a look into *OpenTSDB*. I think they are addressing some
>> of the issues that you mention here.
>>
>> Thanks
>>
>> On Tue, May 15, 2012 at 10:09 AM, Leon Mergen <[email protected]> wrote:
>>
>> > Hello all,
>> >
>> > We are currently evaluating HBase as a possible way to store our log
>> > data in a structured way, and I want to verify a few things I was not
>> > able to find online.
>> > Specifically, what we are trying to achieve:
>> >
>> > * be able to quickly search for logs within a specific time range;
>> > * limit the number of maps in our MapReduce jobs to only those areas
>> >   we're interested in.
>> >
>> > As I understand it, there is a tradeoff:
>> >
>> > * if you use a timestamp as a split key, a single region server can
>> >   become a hotspot. This is bad when writing data at a high load;
>> > * if we do not have the timestamp as the first key of the split keys, a
>> >   MapReduce job will have to do a TableScan and have a huge number of
>> >   maps.
>> >
>> > Is there a known solution / workaround for this problem that people
>> > have used? Since our timespan queries are usually limited based on
>> > days, we were considering adding a new table for each day, but that
>> > looked like a bit of an ugly hack.
>> >
>> > Any ideas / suggestions about this?
>> >
>> > Regards,
>> >
>> > Leon Mergen
>> >
>>
>> --
>> Thanks & Regards
>> Himanish
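
For readers of the archive, here is a rough sketch of the salted-key idea discussed in this thread. It is plain Python with no HBase client involved, and it is not the HBaseWD or Socorro implementation. The bucket count, the key layout, and the names salted_key and scan_ranges are all assumptions made up for illustration. The point it tries to show: writes spread over several key prefixes, and a time-range query becomes one bounded scan per prefix instead of a full table scan.

```python
import zlib
from typing import List, Tuple

NUM_BUCKETS = 8  # hypothetical bucket count; not a value from the thread


def salted_key(timestamp_ms: int, record_id: str) -> bytes:
    """Build a row key laid out as <bucket byte><8-byte timestamp><record id>."""
    # Derive the bucket from the record id so writes at the same instant
    # land on different key prefixes (and so, ideally, different regions).
    bucket = zlib.crc32(record_id.encode("utf-8")) % NUM_BUCKETS
    # A fixed-width big-endian timestamp keeps keys time-ordered per bucket.
    return bytes([bucket]) + timestamp_ms.to_bytes(8, "big") + record_id.encode("utf-8")


def scan_ranges(start_ms: int, stop_ms: int) -> List[Tuple[bytes, bytes]]:
    """Return one (start_row, stop_row) pair per bucket covering [start_ms, stop_ms)."""
    return [
        (bytes([b]) + start_ms.to_bytes(8, "big"),
         bytes([b]) + stop_ms.to_bytes(8, "big"))
        for b in range(NUM_BUCKETS)
    ]
```

The cost of this layout is that a time-range read needs NUM_BUCKETS scans whose results have to be merged, so the bucket count trades write parallelism against read fan-out.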
