Also, look into the TOP UDF instead of doing the nested LIMIT. It can potentially be a lot faster and is cleaner, IMHO.
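
Something along these lines, as a rough sketch only. The input path, the schema, and the choice of column 1 (ts) as the ordering column are assumptions for illustration, and `records` stands in for your `alias` relation. Note that TOP keeps the tuple with the largest value in the chosen column, whereas LIMIT 1 keeps an arbitrary one, so pick a column where that is acceptable. The PARALLEL value is just an example number.

```pig
-- Hypothetical input path and schema; substitute your own.
records = LOAD 'input' AS (key:chararray, ts:long, payload:chararray);

-- PARALLEL explicitly requests a number of reducers for this GROUP.
grp = GROUP records BY key PARALLEL 10;

-- TOP(n, column, bag) keeps the n tuples with the largest value in the
-- given 0-based column. Here: one record per key, the one with the largest ts.
deduped = FOREACH grp GENERATE FLATTEN(TOP(1, 1, records));
```

As far as I know, TOP is algebraic, so Pig can apply it in the combiner, which is where the speedup would come from.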

2013/5/19 Norbert Burger <norbert.bur...@gmail.com>

> Take a look at the PARALLEL clause:
>
> http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause
>
> On Fri, May 17, 2013 at 10:48 AM, Vincent Barat <vincent.ba...@gmail.com> wrote:
>
> > Hi,
> >
> > I use this request to remove duplicated entries from a set of input files
> > (I cannot use DISTINCT since some fields can be different)
> >
> > grp = GROUP alias BY key;
> > alias = FOREACH grp {
> >   record = LIMIT alias 1;
> >   GENERATE FLATTEN(record) AS ... :
> > }
> >
> > It appears that this request always generates 1 reducer (I use 0 as
> > default nb of reducer to let PIG decide) whatever the size of my input data.
> >
> > Is it a normal behavior ? How can I improve my request time by using
> > several reducers ?
> >
> > Thanks a lot for your help.