Sure. I will deploy it today and run it again. I usually check the job conf file for verification, but I will send you the log files.
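For reference, a minimal sketch of how the two properties discussed in this thread could be set at the top of the Pig script. The value 3 is only an illustration of the "2 or 3" suggestion quoted below, and the rest of the script is omitted; the job conf file should show whether the values actually took effect.

```pig
-- Enable in-map partial aggregation (the property named in the thread).
set pig.exec.mapPartAgg true;

-- Illustrative value only: the thread suggests a low value such as 2 or 3.
set pig.exec.mapPartAgg.minReduction 3;
```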
Thanks very much for help.

Regards,
Deepak

On Fri, Sep 28, 2012 at 3:58 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Can you check if your mapper logs said anything about in-map aggregation
> being turned off?
> In fact, the whole log of one of the mappers might help (POPartialAgg
> prints some helpful stats).
>
> On Fri, Sep 28, 2012 at 3:27 PM, Deepak Tiwari <[email protected]> wrote:
>
> > Yeah I believe pig.exec.mapPartAgg was true but I think minReduction was 10
> > or something. I will double check this and try that again. So If accuracy
> > is compromised and Bloomfilter is chosen, should I follow the approach
> > described at
> > http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/Bloom.html
> > .. sorry I am bit hazy over here...
> >
> > On Fri, Sep 28, 2012 at 3:12 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> > > When you tried 2888, did you have pig.exec.mapPartAgg set to true,
> > > and pig.exec.mapPartAgg.minReduction set to a low value (2 or 3)?
> > >
> > > You said you applied the patch -- what version are you currently running?
> > >
> > > Other approaches are also probabilistic so if you need exact counts, no
> > > dice.. I was thinking bloom filters or hyper log log.
> > >
> > > D
> > >
> > > On Fri, Sep 28, 2012 at 2:40 PM, Deepak Tiwari <[email protected]> wrote:
> > >
> > > > Hi Dmitriy
> > > >
> > > > I did try 2888 ( I checked out new from trunk and applied the patch ) and
> > > > unfortunately it was not making much difference for me. You have mentioned
> > > > other distinct counting approaches. Could you please give me more details
> > > > and any hints to implement those.
> > > >
> > > > Regards,
> > > >
> > > > Deepak.
> > > >
> > > > On Wed, Aug 29, 2012 at 1:05 PM, Deepak Tiwari <[email protected]> wrote:
> > > >
> > > > > Thanks Dmitry.
> > > > >
> > > > > 1) yup. exact distinct counts are required, since it is finance reporting.
> > > > > ( I actually had thought about bloom filter but since we need exact count
> > > > > it might not be applicable )
> > > > > 2) Oh I think Pig 2888 recently filed, it didnt come in my search
> > > > > previously. Sure I will apply the patch and see if that makes any
> > > > > difference..
> > > > >
> > > > > Thanks very much for responding....
> > > > >
> > > > > On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > >
> > > > >> Couple of ideas:
> > > > >>
> > > > >> 1) do you need exact distinct counts? There are approximate distinct
> > > > >> counting approaches that may be appropriate an much more efficient.
> > > > >> 2) can you try with pig-2888?
> > > > >>
> > > > >> On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[email protected]> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > I am processing huge dataset and need to aggregate data using on multiple
> > > > >> > levels ( columns ).
> > > > >> >
> > > > >> > for example A,B,C,D,E,F, CalculateDistinctinctOnValue1,
> > > > >> > CalculateDistinctinctOnValue2, Sum(value3)
> > > > >> >
> > > > >> > I have tried two approaches in one I am reading the file one time and
> > > > >> > generating groupby on each level
> > > > >> >
> > > > >> > for example group by (A,B), group by (A,B,C)
> > > > >> >
> > > > >> > Since I have to do distinct inside foreach which is taking too much time,
> > > > >> > mostly because of skew. ( I have enabled multiquery)
> > > > >> >
> > > > >> > In another approach I have tried creating 8 separate scripts to process
> > > > >> > each group by too, but that is taking more or less the same time and not a
> > > > >> > very efficient one. Could someone please suggest any other way..
> > > > >> >
> > > > >> > Thanks in advance.
> > > > >> >
> > > > >> > Deepak
