Re: HFileInputFormat for MapReduce

2012-02-10 Thread Tim Robertson
Is HIVE involved? Or is it just raw scan compared to TFIF? No Hive. Is this a MR scan or just a shell serial scan (or is it still PE?)? We are using PE scan to try and standardize as much as possible. You want to get this scan speed up only? You are not interested in figuring how to

Re: HFileInputFormat for MapReduce

2012-02-10 Thread Stack
On Fri, Feb 10, 2012 at 3:21 AM, Tim Robertson timrobertson...@gmail.com wrote: We are using PE scan to try and standardize as much as possible. Fair enough. Since CDH3u3 is ongoing as I type, I'm not sure on the regions (50 regions on 3 RS with the PE TestTable). Why are you not sure?

HFileInputFormat for MapReduce

2012-02-09 Thread Tim Robertson
Hi all, Can anyone elaborate on the pitfalls or implications of running MapReduce using an HFileInputFormat extending FileInputFormat? I'm sure scanning goes through the RS for good reasons (guessing handling splits, locking, RS monitoring etc) but can it ever be safe to run MR over HFiles

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Amandeep Khurana
of running MapReduce using an HFileInputFormat extending FileInputFormat? I'm sure scanning goes through the RS for good reasons (guessing handling splits, locking, RS monitoring etc) but can it ever be safe to run MR over HFiles directly? E.g. for scenarios like a region split, would the MR

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Tim Robertson
) Amandeep On Feb 9, 2012, at 12:19 AM, Tim Robertson timrobertson...@gmail.com wrote: Hi all, Can anyone elaborate on the pitfalls or implications of running MapReduce using an HFileInputFormat extending FileInputFormat? I'm sure scanning goes through the RS for good reasons (guessing

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Stack
On Thu, Feb 9, 2012 at 12:55 AM, Tim Robertson timrobertson...@gmail.com wrote: From the limitations you mention, 1) and 2) we can live with, but 3) could be why my quick tests are already giving incorrect record counts. That sounds like a show stopper straight away right? So Tim, you are

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Tim Robertson
Hey Stack, We see the difference between a scan and TextFileInputFormat of the same data as csv being 10x slower. This is what prompted me to look at MR using an HFIF just out of curiosity. Cheers, Tim On Thu, Feb 9, 2012 at 7:32 PM, Stack st...@duboce.net wrote: On Thu, Feb 9, 2012 at

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Amandeep Khurana
From the limitations you mention, 1) and 2) we can live with, but 3) could be why my quick tests are already giving incorrect record counts. That sounds like a show stopper straight away right? One option for us would be HBase for the primary store for random access, and periodic (e.g. 12

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Bruce Bian
I also encountered this issue when comparing Hive+HBase with Hive+HDFS (native Hive tables). After some tuning (ensuring data locality, using scan cache, appropriate number of mappers per node, etc.), Hive+HBase is around 4~5X slower. I guess the two main reasons are: 1) HFile repeats keys for each K/V

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Stack
On Thu, Feb 9, 2012 at 3:00 PM, Tim Robertson timrobertson...@gmail.com wrote: Hey Stack, We see the difference between a scan and TextFileInputFormat of the same data as csv being 10x slower. This is what prompted me to look at MR using an HFIF just out of curiosity. Is HIVE involved? Or