I'd appreciate it if you could help test. PIG-1270-2.patch should apply directly to 0.9.0.

Daniel

On Sun, Sep 11, 2011 at 11:25 PM, Rajesh Balamohan <[email protected]> wrote:

> Thanks Daniel for the comments.
>
> >> PIG-1270 is to solve 2, but performance test does not show improvement
>
> This puts a restriction on the PigRecordReader itself and prevents mappers
> from reading more data. Isn't supposed to increase the performance?. What
> was the datasize you used? If this patch is compatible with 0.9, I can try
> it on my cluster.
>
> On Mon, Sep 12, 2011 at 11:14 AM, Daniel Dai <[email protected]> wrote:
>
> > Two ways to optimize:
> > 1. Launching less maps
> > 2. For each map, stop earlier
> >
> > PIG-1270 is to solve 2, but performance test does not show improvement. For
> > 1, in extreme case, such as 2T data only contains 100 records, launching all
> > maps is necessary. Pig currently does not probe the input data before
> > launching map-reduce jobs. Maybe we can launch fewer maps as initial guess
> > and launch all maps if guess fail. Thoughts?
> >
> > Daniel
> >
> > On Sun, Sep 11, 2011 at 10:13 PM, Rajesh Balamohan <[email protected]> wrote:
> >
> > > I have a large data set (> 2 TB) and I tried scanning 100 records from it.
> > >
> > > a = load '/usr/largedata/' using PigStorage(',');
> > > b = limit a 100;
> > > dump b;
> > >
> > > >>>>
> > > 2011-09-11 21:56:34,262 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
> > > 2011-09-11 21:56:34,414 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> > > 2011-09-11 21:56:34,483 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> > > 2011-09-11 21:56:34,484 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> > > >>>>
> > >
> > > This ends up launching a MR job with 20,000+ Maps and a single reducer.
> > >
> > > Is it possible for PIG to analyze such cases and realistically scan only
> > > 100 rows (rather than scanning the entire data and emitting 100 rows?).
> > >
> > > This is on PIG 0.9.
> > >
> > > --
> > > ~Rajesh.B
>
> --
> ~Rajesh.B
