Splits and MapReduce

Leon Mergen Tue, 15 May 2012 07:10:52 -0700

Hello all,

We are currently evaluating HBase as a possible way to store our log data
in a structured way, and I want to verify a few things I was not able to
find online. Specifically, we are trying to achieve the following:


 * be able to quickly search for logs within a specific time range;
 * limit the number of map tasks in our MapReduce jobs to only those parts
of the table we're interested in.

As I understand it, there is a tradeoff:

* if the timestamp is the leading part of the row key (and therefore of the
split keys), all new writes land in the same region, so a single region
server can become a hotspot. This is bad when writing data at a high load;
* if the timestamp is not the leading part of the row key, a MapReduce job
will have to do a full table scan and will need a huge number of map tasks.
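
To make the tradeoff concrete, here is a minimal sketch in plain Python. It
does not use HBase at all; it only models byte-ordered row keys and region
boundaries. The host names, the four-region pre-split, the base timestamp and
all function names are assumptions made up for illustration, and I am assuming
the usual behaviour of one map task per region touched by the scan.

```python
import bisect

DAY = 86_400
BASE = 1_337_040_000  # arbitrary epoch second, assumed for illustration


def region_for(key, split_points):
    """Index of the region whose key range contains `key` (start key inclusive)."""
    return bisect.bisect_right(split_points, key)


def ts_first_key(ts, host):
    # Timestamp leads: rows sort by time, so new rows always sort last.
    return ts.to_bytes(8, "big") + host.encode()


def host_first_key(ts, host):
    # Host leads: rows for one point in time are scattered over the key space.
    return host.encode() + ts.to_bytes(8, "big")


# Hypothetical layout: 16 hosts, and four regions under either key design.
HOSTS = [f"host{i:02d}" for i in range(16)]
TS_SPLITS = [(BASE + d * DAY).to_bytes(8, "big") for d in (1, 2, 3)]
HOST_SPLITS = [b"host04", b"host08", b"host12"]


def regions_hit(rows, make_key, splits):
    """Set of regions that hold at least one of the given rows."""
    return {region_for(make_key(ts, host), splits) for ts, host in rows}


# One log row per host per hour over four days.
all_rows = [(BASE + h * 3600, host) for h in range(96) for host in HOSTS]
# Rows being written "now" (the most recent hour).
live_writes = [r for r in all_rows if r[0] >= BASE + 95 * 3600]
# Rows a query for the second day has to read.
day_two = [r for r in all_rows if BASE + DAY <= r[0] < BASE + 2 * DAY]

ts_write_regions = regions_hit(live_writes, ts_first_key, TS_SPLITS)
ts_query_regions = regions_hit(day_two, ts_first_key, TS_SPLITS)
host_write_regions = regions_hit(live_writes, host_first_key, HOST_SPLITS)
host_query_regions = regions_hit(day_two, host_first_key, HOST_SPLITS)
```

In this toy model the timestamp-first layout sends every current write to one
region but lets the one-day query touch one region, while the host-first
layout spreads the writes over all four regions and forces the same query to
visit all four as well.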

Is there a known solution or workaround for this problem that people have
used? Since our time-range queries usually cover whole days, we were
considering creating a new table for each day, but that looks like a bit of
an ugly hack.
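
For reference, the table-per-day idea would come down to something like the
following sketch. The `logs_` prefix and the `YYYYMMDD` suffix are just a
naming scheme I am assuming here, and `tables_for_range` is a hypothetical
helper, not anything HBase provides.

```python
from datetime import date, timedelta


def tables_for_range(first_day, last_day, prefix="logs_"):
    """Names of the per-day tables covering first_day..last_day, inclusive."""
    days = (last_day - first_day).days
    return [
        f"{prefix}{(first_day + timedelta(days=n)):%Y%m%d}"
        for n in range(days + 1)
    ]


# A three-day query would have to run over three separate tables.
wanted = tables_for_range(date(2012, 5, 14), date(2012, 5, 16))
```

Each job would then be pointed at only the tables in `wanted`, which keeps the
scan small but moves the time partitioning out of the row key and into table
management.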

Any ideas or suggestions about this?

Regards,

Leon Mergen
