Re: Nb of reduce tasks when GROUPing

Vincent Barat Wed, 22 May 2013 06:30:23 -0700

I tested TOP: for my use case, TOP is actually slower than LIMIT 1 (which seems logical anyway).
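
For reference, the TOP variant of the deduplication step could look roughly like the sketch below. It assumes the same relation names as the original script (alias, grp, key); the output name deduped and the sort column index 0 are arbitrary choices for illustration only. TOP has to compare tuples on the given column, whereas LIMIT 1 can simply take the first tuple of each group, which is consistent with TOP being slower here.

```pig
-- Sketch only: keep one record per key using the built-in TOP(topN, column, relation).
-- Column index 0 is an assumed sort column, not something from the original script.
grp = GROUP alias BY key;
deduped = FOREACH grp GENERATE FLATTEN(TOP(1, 0, alias));
```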


On 21/05/13 19:23, Norbert Burger wrote:

As Jonathan mentioned, TOP should obviate this particular use case.  But
for future examples, the parameters
pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max
might be useful:


https://issues.apache.org/jira/browse/PIG-1249
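
As a rough illustration, both properties can be set at the top of a script. When no PARALLEL clause or default parallelism is given, Pig estimates the reducer count as approximately min(pig.exec.reducers.max, total input bytes / pig.exec.reducers.bytes.per.reducer), with defaults of about 1 GB per reducer and a maximum of 999. With those defaults, an input smaller than 1 GB gets a single reducer. The values below are assumptions chosen only for the example.

```pig
-- Assumed value: target roughly 256 MB of input per reducer (default is ~1 GB).
SET pig.exec.reducers.bytes.per.reducer 268435456;
-- Assumed value: never use more than 50 reducers (default is 999).
SET pig.exec.reducers.max 50;
```

With these settings, a 2 GB input would be estimated at about 8 reducers.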

Norbert

On Tue, May 21, 2013 at 9:23 AM, Vincent Barat <vincent.ba...@gmail.com> wrote:

Thanks for your reply.

My goal is actually to AVOID using PARALLEL, to let Pig guess a good
number of reducers by itself.
Usually it works well for me, so I don't understand why it does not
in this case.

On 19/05/13 15:37, Norbert Burger wrote:

  Take a look at the PARALLEL clause:

http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause
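
A minimal sketch of the clause applied to the GROUP in question, reusing the relation names from the original script. The value 10 is an arbitrary assumption, not a recommendation.

```pig
-- Request 10 reducers for the job produced by this GROUP only (10 is an assumed value).
grp = GROUP alias BY key PARALLEL 10;
```

PARALLEL applies only to the operator it is attached to, so other reduce-side operators in the script keep their own settings.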

On Fri, May 17, 2013 at 10:48 AM, Vincent Barat <vincent.ba...@gmail.com>
wrote:

  Hi,

I use this script to remove duplicate entries from a set of input files
(I cannot use DISTINCT since some fields can differ between duplicates):

grp = GROUP alias BY key;
alias = FOREACH grp {
    record = LIMIT alias 1;
    GENERATE FLATTEN(record) AS ... ;
}

It appears that this script always runs with 1 reducer, whatever the
size of my input data (I set the default number of reducers to 0 to let
Pig decide).

Is this normal behavior? How can I speed up the script by using
several reducers?

Thanks a lot for your help.
