Thanks Dmitriy.

1) Yes, exact distinct counts are required, since this is finance reporting. (I had actually thought about a Bloom filter, but since we need exact counts it might not be applicable.)
2) Oh, I think PIG-2888 was filed recently; it didn't come up in my earlier search. Sure, I will apply the patch and see if that makes any difference.
Thanks very much for responding.

On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Couple of ideas:
>
> 1) Do you need exact distinct counts? There are approximate distinct
> counting approaches that may be appropriate and much more efficient.
> 2) Can you try with pig-2888?
>
> On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[email protected]> wrote:
>
> > Hi,
> >
> > I am processing a huge dataset and need to aggregate data on multiple
> > levels (columns).
> >
> > For example: A,B,C,D,E,F, CalculateDistinctinctOnValue1,
> > CalculateDistinctinctOnValue2, Sum(value3)
> >
> > I have tried two approaches. In one, I read the file once and
> > generate a group-by for each level,
> >
> > for example group by (A,B), group by (A,B,C).
> >
> > I have to do the distinct inside a foreach, which is taking too much time,
> > mostly because of skew. (I have enabled multiquery.)
> >
> > In the other approach I tried creating 8 separate scripts to process
> > each group-by, but that takes more or less the same time and is not a
> > very efficient one. Could someone please suggest any other way?
> >
> > Thanks in advance.
> >
> > Deepak
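
For reference, the computation described in the original message (exact distinct counts on two values plus a sum, at several grouping levels, from a single read of the data) can be sketched in plain Python. This only illustrates the semantics. It is not Pig code, it does not address the skew problem, and it is not what PIG-2888 does. The column names value1, value2, value3 and the function name aggregate are hypothetical, chosen for the example.

```python
from collections import defaultdict


def aggregate(rows, levels):
    """Exact distinct counts and a sum per group, for several grouping levels.

    rows   -- iterable of dicts with the grouping columns plus the
              hypothetical keys "value1", "value2" and "value3"
    levels -- list of tuples of column names, e.g. [("A", "B"), ("A", "B", "C")]

    Returns {level: {group_key: (distinct_value1, distinct_value2, sum_value3)}}.
    """
    # One accumulator per level; each group keeps exact sets, so memory grows
    # with the number of distinct values in the group (the cost of exactness).
    acc = {
        level: defaultdict(lambda: {"v1": set(), "v2": set(), "sum3": 0})
        for level in levels
    }

    # Single pass over the input: every row updates every grouping level.
    for row in rows:
        for level in levels:
            key = tuple(row[col] for col in level)
            slot = acc[level][key]
            slot["v1"].add(row["value1"])
            slot["v2"].add(row["value2"])
            slot["sum3"] += row["value3"]

    # Collapse the sets into counts for the final result.
    return {
        level: {
            key: (len(s["v1"]), len(s["v2"]), s["sum3"])
            for key, s in groups.items()
        }
        for level, groups in acc.items()
    }
```

The sets are what make the counts exact. A heavily skewed key means one very large set, which is the same pressure point as the distinct inside the foreach.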
