Couple of ideas:

1) Do you need exact distinct counts? There are approximate distinct-counting approaches that may be appropriate and much more efficient.
2) Can you try with pig-2888?
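
To make idea 1 concrete, here is a minimal sketch of one approximate distinct-counting technique, a k-minimum-values (KMV) sketch, written in plain Python rather than Pig. It is only an illustration and is not what pig-2888 does; `KMVSketch`, `distinct_per_group`, and the column names in the usage note are made up for this example. The point is that each group keeps a small, fixed-size summary instead of the full set of values, and summaries can be merged, so a skewed key no longer forces one reducer to hold every distinct value.

```python
import hashlib
import heapq
from collections import defaultdict

HASH_SPACE = 2 ** 64  # hashes are treated as 64-bit unsigned integers


class KMVSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(value):
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        self._add_hash(self._hash(value))

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        """Fold another sketch into this one (e.g. a partial per-mapper result)."""
        for h in other._members:
            self._add_hash(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: count is exact
        kth_smallest = max(-self._heap[0], 1)
        return (self.k - 1) * HASH_SPACE // kth_smallest


def distinct_per_group(rows, key_columns, value_column, k=1024):
    """Approximate COUNT(DISTINCT value_column) for each combination of key_columns."""
    sketches = defaultdict(lambda: KMVSketch(k))
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        sketches[key].add(row[value_column])
    return {key: sketch.estimate() for key, sketch in sketches.items()}
```

With rows represented as dicts, `distinct_per_group(rows, ("A", "B"), "value1")` would give the (A,B) level and `distinct_per_group(rows, ("A", "B", "C"), "value1")` the next one. The relative error shrinks roughly as 1/sqrt(k), so k trades memory per group against accuracy.
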
On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[email protected]> wrote:

> Hi,
>
> I am processing huge dataset and need to aggregate data using on multiple
> levels ( columns ).
>
> for example A,B,C,D,E,F, CalculateDistinctinctOnValue1,
> CalculateDistinctinctOnValue2, Sum(value3)
>
> I have tried two approaches in one I am reading the file one time and
> generating groupby on each level
>
> for example group by (A,B), group by (A,B,C)
>
> Since I have to do distinct inside foreach which is taking too much time,
> mostly because of skew. ( I have enabled multiquery)
>
> In another approach I have tried creating 8 separate scripts to process
> each group by too, but that is taking more or less the same time and not a
> very efficient one. Could someone please suggest any other way..
>
> Thanks in advance.
>
>
> Deepak
