Re: LIMIT optimization

Daniel Dai Sun, 11 Sep 2011 23:48:59 -0700

I'd appreciate it if you could help test. PIG-1270-2.patch should apply
directly to 0.9.0.


Daniel

On Sun, Sep 11, 2011 at 11:25 PM, Rajesh Balamohan
<[email protected]> wrote:
>
> Thanks Daniel for the comments.
>
>
> >> PIG-1270 is to solve 2, but performance test does not show improvement
>
> This puts a restriction on the PigRecordReader itself and prevents mappers
> from reading more data. Isn't that supposed to improve performance? What
> data size did you use? If this patch is compatible with 0.9, I can try
> it on my cluster.
>
> On Mon, Sep 12, 2011 at 11:14 AM, Daniel Dai <[email protected]> wrote:
>
> > Two ways to optimize:
> > 1. Launch fewer maps
> > 2. For each map, stop earlier
> >
> > PIG-1270 is to solve 2, but performance test does not show improvement. For
> > 1, in an extreme case, such as 2 TB of data containing only 100 records,
> > launching all maps is necessary. Pig currently does not probe the input data
> > before launching map-reduce jobs. Maybe we can launch fewer maps as an
> > initial guess and launch all maps if the guess fails. Thoughts?
> >
> > Daniel
> >
> > On Sun, Sep 11, 2011 at 10:13 PM, Rajesh Balamohan <
> > [email protected]> wrote:
> >
> > > > I have a large data set (> 2 TB) and I tried scanning 100 records from
> > > > it.
> > >
> > > a = load '/usr/largedata/' using PigStorage(',');
> > > b = limit a 100;
> > > dump b;
> > >
> > > >>>>
> > > 2011-09-11 21:56:34,262 [main] INFO
> > >  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> > > script: LIMIT
> > > 2011-09-11 21:56:34,414 [main] INFO
> > >  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> > -
> > > File concatenation threshold: 100 optimistic? false
> > > 2011-09-11 21:56:34,483 [main] INFO
> > >
> > >
> >  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> > > - MR plan size before optimization: 1
> > > 2011-09-11 21:56:34,484 [main] INFO
> > >
> > >
> >  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> > > - MR plan size after optimization: 1
> > > >>>>
> > >
> > > > This ends up launching an MR job with 20,000+ maps and a single reducer.
> > > >
> > > > Is it possible for Pig to analyze such cases and realistically scan only
> > > > 100 rows (rather than scanning the entire data set and emitting 100 rows)?
> > > >
> > > > This is on Pig 0.9.
> > >
> > > --
> > > ~Rajesh.B
> > >
> >
>
>
>
> --
> ~Rajesh.B
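
The two strategies discussed in the thread (launch fewer maps as an initial guess, and stop each map early) can be illustrated with a small stand-alone sketch. This is not Pig code and not what PIG-1270 implements: the function name `limited_scan`, the `initial_maps` and `growth` parameters, and the modelling of input splits as plain Python iterables are all assumptions made for illustration only.

```python
from itertools import islice
from typing import Iterable, List, Sequence, Tuple


def limited_scan(
    splits: Sequence[Iterable[str]],
    limit: int,
    initial_maps: int = 1,   # hypothetical: size of the first guess
    growth: int = 4,         # hypothetical: how fast each retry widens
) -> Tuple[List[str], int]:
    """Return up to `limit` records and the number of splits launched.

    Each element of `splits` stands in for one map task's input.
    """
    if initial_maps < 1 or growth < 1:
        raise ValueError("initial_maps and growth must be >= 1")

    records: List[str] = []
    start = 0
    batch = initial_maps
    # Strategy 1: launch a small batch of "maps" first and widen the
    # batch only if the previous guess did not yield enough records.
    while start < len(splits) and len(records) < limit:
        for split in splits[start:start + batch]:
            remaining = limit - len(records)
            if remaining == 0:
                break
            # Strategy 2: each "map" stops reading as soon as the
            # overall limit is satisfied.
            records.extend(islice(split, remaining))
        start += batch
        batch *= growth
    # A whole batch counts as launched even if the limit was reached
    # partway through it, since real map tasks would start together.
    return records, min(start, len(splits))


if __name__ == "__main__":
    dense = [[f"s{i}r{j}" for j in range(10)] for i in range(1000)]
    rows, launched = limited_scan(dense, 100)
    print(len(rows), "records from", launched, "of", len(dense), "splits")
```

With dense input the sketch touches only a small fraction of the splits. In the extreme case raised above, where almost every split is empty, the batches keep growing until every split has been launched, so the fallback costs extra rounds rather than saving work.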
