Hi Yang,

A couple of points:

Grouping z with ALL creates exactly one input group for the reducers. Since there is only one group, adding more reducers doesn't help.

Pig does have accumulator and algebraic UDFs, but SIZE is not one of them, because SIZE also accepts data types other than bags (you can't split the computation of the SIZE of a chararray, for example).

Since you're using it on a bag, the builtin UDF COUNT (http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/builtin/COUNT.html) is a much more scalable approach. It does part of the aggregation in the combiner, so it scales much, much better.
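As a rough sketch, your two lines with COUNT swapped in for SIZE would look something like the following. The LOAD line, the file name 'input.txt', and the schema are placeholders I made up so the snippet stands on its own; substitute however you actually build z.

```pig
-- Hypothetical input: replace with however z is really produced.
z = LOAD 'input.txt' AS (line:chararray);

-- Same GROUP ALL as before: still a single group, so still one reducer.
y = GROUP z ALL;

-- COUNT is algebraic, so partial counts can be computed in the combiner
-- and the reducer only has to add them up.
-- Caveat: COUNT skips tuples whose first field is null; COUNT_STAR
-- counts those tuples as well.
x = FOREACH y GENERATE COUNT(z);

DUMP x;
```

The job should still end up with one reducer because of the GROUP ALL, but that reducer only sums a handful of partial counts instead of receiving every tuple in z. If z can contain tuples with a null first field and you want them counted the way SIZE would, COUNT_STAR is probably the closer match.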
Thanks,
Mark

On Thu, Apr 11, 2013 at 3:13 PM, Yang <[email protected]> wrote:

> I set default_parallel=15
>
> but when I did a
>
> y = group z ALL;
> x = foreach y generate SIZE(z);
>
> the 2 lines generate a MR job with only 1 reducer.
>
>
> I guess it's because SIZE() needs to count all the groups. but don't we
> have the sort of cumulative/additive UDFs ?
>
>
> it would be faster if we could parallelize SIZE()
>
> thanks
> Yang
>
