Re: Pig multiple groupby problem

Deepak Tiwari Fri, 28 Sep 2012 16:15:51 -0700

Sure. I will deploy it today and run it again. I usually check the job
conf file for verification, but I will send you the log files.


Thanks very much for the help.

Regards,

Deepak

On Fri, Sep 28, 2012 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Can you check if your mapper logs said anything about in-map aggregation
> being turned off?
> In fact, the whole log of one of the mappers might help (POPartialAgg
> prints some helpful stats).
>
>
> On Fri, Sep 28, 2012 at 3:27 PM, Deepak Tiwari <[email protected]>
> wrote:
>
> > Yeah, I believe pig.exec.mapPartAgg was true, but I think minReduction was
> > 10 or something. I will double-check this and try again. If we accept
> > reduced accuracy and go with a Bloom filter, should I follow the approach
> > described at
> > http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/Bloom.html ?
> > Sorry, I am a bit hazy on this...
> >
> > On Fri, Sep 28, 2012 at 3:12 PM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> >
> > > When you tried 2888, did you have pig.exec.mapPartAgg set to true,
> > > and pig.exec.mapPartAgg.minReduction set to a low value (2 or 3)?
> > >
> > > You said you applied the patch -- what version are you currently
> > > running?
> > >
> > > Other approaches are also probabilistic, so if you need exact counts, no
> > > dice. I was thinking Bloom filters or HyperLogLog.
> > >
> > > D
> > >
> > > On Fri, Sep 28, 2012 at 2:40 PM, Deepak Tiwari <[email protected]>
> > > wrote:
> > >
> > > > Hi Dmitriy
> > > >
> > > > I did try 2888 (I checked out a fresh copy of trunk and applied the
> > > > patch) and unfortunately it did not make much difference for me. You
> > > > mentioned other distinct counting approaches. Could you please give me
> > > > more details and any hints on implementing them?
> > > >
> > > > Regards,
> > > >
> > > > Deepak.
> > > >
> > > > On Wed, Aug 29, 2012 at 1:05 PM, Deepak Tiwari <[email protected]>
> > > > wrote:
> > > >
> > > > > Thanks Dmitriy.
> > > > >
> > > > > 1) Yup, exact distinct counts are required, since it is finance
> > > > > reporting. (I had actually thought about a Bloom filter, but since
> > > > > we need exact counts it might not be applicable.)
> > > > > 2) Oh, I think PIG-2888 was filed recently; it didn't come up in my
> > > > > earlier search. Sure, I will apply the patch and see if that makes
> > > > > any difference.
> > > > >
> > > > > Thanks very much for responding.
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy <[email protected]>
> > > > > wrote:
> > > > >
> > > > >> Couple of ideas:
> > > > >>
> > > > >> 1) Do you need exact distinct counts? There are approximate distinct
> > > > >> counting approaches that may be appropriate and much more efficient.
> > > > >> 2) Can you try with PIG-2888?
> > > > >>
> > > > >> On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[email protected]>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > I am processing a huge dataset and need to aggregate data on
> > > > >> > multiple levels (columns).
> > > > >> >
> > > > >> > For example: A, B, C, D, E, F, CalculateDistinctOnValue1,
> > > > >> > CalculateDistinctOnValue2, Sum(value3)
> > > > >> >
> > > > >> > I have tried two approaches. In the first, I read the file once
> > > > >> > and generate a group-by on each level,
> > > > >> >
> > > > >> > for example group by (A,B), group by (A,B,C).
> > > > >> >
> > > > >> > I have to do a distinct inside the foreach, which is taking too
> > > > >> > much time, mostly because of skew. (I have enabled multiquery.)
> > > > >> >
> > > > >> > In the second approach, I tried creating 8 separate scripts, one
> > > > >> > per group-by, but that takes more or less the same time and is
> > > > >> > not very efficient. Could someone please suggest another way?
> > > > >> >
> > > > >> > Thanks in advance.
> > > > >> >
> > > > >> >
> > > > >> > Deepak
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
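
For readers following the thread, the settings Dmitriy asks about and the
nested-distinct group-by Deepak describes might look roughly like the Pig
Latin below. This is only a sketch under stated assumptions, not the script
from the thread: the input and output paths, relation names, field names, and
types are all hypothetical. Only the two property names, the suggested low
minReduction value (2 or 3), and the group-by levels (A,B) and (A,B,C) come
from the messages above.

```pig
-- Settings discussed in the thread: enable in-map partial aggregation and
-- lower the reduction threshold from the value of 10 that Deepak recalls.
SET pig.exec.mapPartAgg true;
SET pig.exec.mapPartAgg.minReduction 3;

-- Hypothetical input path and schema, for illustration only.
raw = LOAD 'input/records' USING PigStorage('\t')
      AS (a:chararray, b:chararray, c:chararray,
          value1:chararray, value2:chararray, value3:long);

-- Level 1: group by (A,B), with exact distinct counts via nested DISTINCT.
by_ab = GROUP raw BY (a, b);
agg_ab = FOREACH by_ab {
    d1 = DISTINCT raw.value1;
    d2 = DISTINCT raw.value2;
    GENERATE FLATTEN(group) AS (a, b),
             COUNT(d1) AS distinct_value1,
             COUNT(d2) AS distinct_value2,
             SUM(raw.value3) AS sum_value3;
};

-- Level 2: group by (A,B,C), same aggregates.
by_abc = GROUP raw BY (a, b, c);
agg_abc = FOREACH by_abc {
    d1 = DISTINCT raw.value1;
    d2 = DISTINCT raw.value2;
    GENERATE FLATTEN(group) AS (a, b, c),
             COUNT(d1) AS distinct_value1,
             COUNT(d2) AS distinct_value2,
             SUM(raw.value3) AS sum_value3;
};

-- Hypothetical output paths.
STORE agg_ab INTO 'output/by_ab';
STORE agg_abc INTO 'output/by_abc';
```

With multiquery enabled, both STORE statements are planned together, so the
input is read once, which matches the first approach described in the
original message. Whether the two SET lines help depends on how much the
group keys actually reduce the map output; as Dmitriy notes, the mapper logs
are the place to check whether in-map aggregation stayed on.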
