Re: Question about HBaseSourceTarget

Gabriel Reid Tue, 17 Mar 2015 09:15:56 -0700

Hi Nithin,

This is a long-standing issue in Crunch (I think it's been present since
Crunch was originally open-sourced). I'd love to get this fixed somehow,
although it seems to not be trivial to do -- it can be difficult to
accurately estimate the size of data that will come from an HBase table,
especially considering that filters and selections of a subset of columns
can be done on an HBase table.

One short-term way of working around this is to add a simple identity
function directly after the HBaseSourceTarget that implements the
scaleFactor method to manipulate the calculated size of the HBase data, but
this is just another hack.
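
To make the effect of that workaround concrete, here is a rough sketch in
plain Java. It is not Crunch code: the class, method, and constant names are
hypothetical, and the ceiling-division rule for deriving a reducer count from
a size estimate is an assumption for illustration, not the planner's actual
formula. It only shows how multiplying the fixed estimate of about 1 GB by a
scale factor would change the resulting number of reducers.

```java
// Simplified model only; the real planner logic in Crunch may differ.
final class ReducerEstimate {

    // Fixed size the planner is described as assuming for an HBase source (about 1 GB).
    static final long ASSUMED_HBASE_SIZE_BYTES = 1_000_000_000L;

    // Stand-in for the crunch.bytes.per.reduce.task setting (about 1 GB in this thread).
    static final long BYTES_PER_REDUCE_TASK = 1_000_000_000L;

    // Applies a scale factor to a size estimate, as a function placed
    // directly after the source would do.
    static long scaledSize(long sizeBytes, float scaleFactor) {
        return (long) (sizeBytes * (double) scaleFactor);
    }

    // Assumed rule: one reducer per bytesPerReduceTask of input, rounded up,
    // and never fewer than one reducer.
    static long reducers(long sizeBytes, long bytesPerReduceTask) {
        long n = (sizeBytes + bytesPerReduceTask - 1) / bytesPerReduceTask;
        return Math.max(1L, n);
    }

    public static void main(String[] args) {
        long unscaled = reducers(ASSUMED_HBASE_SIZE_BYTES, BYTES_PER_REDUCE_TASK);
        long scaled = reducers(scaledSize(ASSUMED_HBASE_SIZE_BYTES, 25.0f), BYTES_PER_REDUCE_TASK);
        System.out.println("reducers without scaling: " + unscaled);
        System.out.println("reducers with scale factor 25: " + scaled);
    }
}
```

The drawback is visible in the sketch as well: the scale factor is a constant
chosen by hand, so it has to be re-tuned whenever the scan or the table
changes.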

Maybe the better solution would be to estimate the size of the HBase table
based on its size on HDFS when using the HBaseFrom.table(String) method,
and to overload the HBaseFrom.table(String, Scan) method to take an
additional long value: the estimated byte size (or perhaps scale factor)
of the table content that is expected to be returned from the given Scan.

Any thoughts on either of these?

- Gabriel

On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan <[email protected]> wrote:

> Hello,
> I came across an unusual behavior while using HBaseSourceTarget. Suppose I
> have a job (from MRPipeline) that reads from HBase using HBaseSourceTarget
> and passes all the data to a reduce phase; the number of reducers set by
> the planner will be equal to 1. The reason is [1]: it looks like the
> planner assumes there is only about 1 GB of data read from the
> source, and sets the number of reducers accordingly. However, let's say my
> HBase scan returns very little data or a huge amount of data. The planner
> still assigns 1 reducer (crunch.bytes.per.reduce.task=1Gb). What's more
> interesting is that, if there are dependent jobs, the planner will set their
> number of reducers based on the initially determined size from the HBase
> source.
>
> As a fix for the above problem, I can set the number of reducers on the
> groupByKey(), but that does not offer much flexibility when dealing with
> data of varying sizes. The other option is to have a map-only job
> that reads from HBase and writes to HDFS, followed by a run(). The next job
> will then determine the size correctly, since FileSourceImpl calculates the
> size on disk.
>
> I noticed the comment on HBaseSourceTarget, and was wondering whether there
> are any plans to implement it.
>
> [1]
>
> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>
> Thanks
> Nithin
>
