I'm also open to suggestions here. This has annoyed me for some time (as Gabriel pointed out), but I don't have a good fix for it.

On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid <[email protected]> wrote:

> Hi Nithin,
>
> This is a long-standing issue in Crunch (I think it's been present since
> Crunch was originally open-sourced). I'd love to get this fixed somehow,
> although it seems to not be trivial to do -- it can be difficult to
> accurately estimate the size of data that will come from an HBase table,
> especially considering that filters and selections of a subset of columns
> can be done on an HBase table.
>
> One short-term way of working around this is to add a simple identity
> function directly after the HBaseSourceTarget that implements the
> scaleFactor method to manipulate the calculated size of the HBase data, but
> this is just another hack.
>
> Maybe the better solution would be to estimate the size of the HBase table
> based on its size on HDFS when using the HBaseFrom.table(String) method,
> and then also overload the HBaseFrom.table(String, Scan) method to also
> take a long value which is the estimated byte size (or perhaps scale
> factor) of the table content that is expected to be returned from the given
> Scan.
>
> Any thoughts on either of these?
>
> - Gabriel
>
>
> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan <[email protected]> wrote:
>
>> Hello,
>> I came across an unusual behavior while using HBaseSourceTarget. Suppose I
>> have a job (from MRPipeline) that reads from HBase using HBaseSourceTarget
>> and passes all the data to a reduce phase. The number of reducers set by
>> the planner will be equal to 1. The reason is [1]. So, it looks like the
>> planner assumes there is only about 1 GB of data read from the source,
>> and sets the number of reducers accordingly. However, let's say my
>> HBase scan is returning very little data or huge amounts of data. The
>> planner still assigns 1 reducer (crunch.bytes.per.reduce.task=1Gb).
>> What's more interesting is that, if there are dependent jobs, the planner
>> will set the number of reducers based on the initially determined size
>> from the HBase source.
>>
>> As a fix for the above problem, I can set the number of reducers on the
>> groupByKey(), but that does not offer much flexibility when dealing with
>> data of varying sizes. The other option is to have a map-only job
>> that reads from HBase and writes to HDFS, followed by a run(). The next job
>> will then determine the size correctly, since FileSourceImpl calculates
>> the size on disk.
>>
>> I noticed the comment on HBaseSourceTarget, and was wondering if there was
>> anything planned to have it implemented.
>>
>> [1]
>> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>>
>> Thanks
>> Nithin
>>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
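
To make the arithmetic behind the scaleFactor workaround in the quoted message concrete, here is a rough sketch. It is a simplified model and not Crunch's planner code: the class and method names are hypothetical, the 1,000,000,000-byte values are assumptions based on the "1Gb" figures mentioned in the thread, and the ceiling division is only an illustration of how a size estimate could translate into a reducer count.

```java
// Simplified model only: NOT Crunch's planner code. All names are hypothetical.
final class ReducerEstimate {
    // Assumed value for crunch.bytes.per.reduce.task ("1Gb" in the thread).
    static final long BYTES_PER_REDUCE_TASK = 1_000_000_000L;

    // Assumed fixed size reported for an HBase source (about 1 GB per the thread).
    static final long ASSUMED_HBASE_SOURCE_BYTES = 1_000_000_000L;

    // Size estimate after a function with the given scale factor is applied.
    static long scaledSize(long sourceBytes, double scaleFactor) {
        return (long) Math.ceil(sourceBytes * scaleFactor);
    }

    // Illustrative reducer count: ceiling division, never fewer than one reducer.
    static int reducersFor(long estimatedBytes, long bytesPerReduceTask) {
        long n = (estimatedBytes + bytesPerReduceTask - 1) / bytesPerReduceTask;
        return (int) Math.max(1L, n);
    }

    public static void main(String[] args) {
        // No scaling: the fixed source estimate yields a single reducer.
        long unscaled = scaledSize(ASSUMED_HBASE_SOURCE_BYTES, 1.0);
        System.out.println(reducersFor(unscaled, BYTES_PER_REDUCE_TASK));

        // An identity function reporting a scale factor of 8 inflates the
        // estimate, so this model asks for more reducers.
        long scaled = scaledSize(ASSUMED_HBASE_SOURCE_BYTES, 8.0);
        System.out.println(reducersFor(scaled, BYTES_PER_REDUCE_TASK));
    }
}
```

In this model the scale factor has to be chosen by hand for each scan, which is why the workaround is described as a hack: the estimate is still not derived from the actual data returned by HBase.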
