Hi Thejas,

Thank you for your reply!

I agree that accumulator mode can be used if you only use built-in UDFs. :)

I had already come across the points you mention. In my script,
PAGE_COUNT is an eval function that implements only the Accumulator
interface. I also checked the source code of the built-in UDFs COUNT and
SUM: both implement Algebraic as well as Accumulator. In theory, Pig
should use accumulator mode in this situation.
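
To make clear what I mean by "accumulator is the only interface", here is a
minimal sketch of the contract as I understand it. It is not my real UDF and
it does not use Pig's classes: BatchAccumulator is a hypothetical stand-in
for org.apache.pig.Accumulator, and a List<String> stands in for the chunk of
the bag that Pig would pass as a Tuple. The accumCalls field only mimics the
ACCUM_CALL counter I mentioned.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a stand-in for the accumulate/getValue/cleanup contract.
// Nothing here is Pig's actual API; all names are hypothetical.
final class AccumulatorSketch {

    /** Hypothetical mirror of the Accumulator contract. */
    interface BatchAccumulator<T> {
        void accumulate(List<String> batch); // called once per chunk of the bag
        T getValue();                        // called once after the last chunk
        void cleanup();                      // resets state before the next key
    }

    /** Counts non-null pages without holding the whole bag in memory. */
    static final class PageCount implements BatchAccumulator<Long> {
        private long count = 0;
        long accumCalls = 0; // mimics an ACCUM_CALL counter

        @Override
        public void accumulate(List<String> batch) {
            accumCalls++;
            for (String page : batch) {
                if (page != null) {
                    count++;
                }
            }
        }

        @Override
        public Long getValue() {
            return count;
        }

        @Override
        public void cleanup() {
            count = 0;
        }
    }

    public static void main(String[] args) {
        PageCount udf = new PageCount();
        // Two chunks of the same group's bag, delivered separately.
        udf.accumulate(Arrays.asList("/a", "/b"));
        udf.accumulate(Arrays.asList("/c", null));
        System.out.println(udf.getValue() + " pages in "
                + udf.accumCalls + " accumulate calls");
        udf.cleanup();
    }
}
```

If accumulate is never invoked, a counter like accumCalls stays at zero,
which is the symptom I see with the real UDF.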

I have also noticed that when my Pig script is very simple, for example
when the nested FOREACH does not extract many tuples/fields, PAGE_COUNT is
sometimes called in accumulator mode, but not in all situations.

I'm really curious whether there are other situations that can prevent the
accumulator call, or do I need to manually turn on some optimization for Pig
to use the accumulator? I cannot find many relevant resources online...

Best,
Yen

On Thu, Mar 15, 2012 at 9:34 PM, Thejas Nair <[email protected]> wrote:

> Hi Yen,
> Does the function also implement Algebraic ? In that case it might end up
> using the algebraic interface of the udf.
> If your foreach statement has functions that don't implement Accumulator
> interface, then reduce task won't run in accumulative mode. This is because
> you are anyway going to load the whole bag into memory.
>
> If the query is using accumulator mode, you would see this log message -
> INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AccumulatorOptimizer
> - Reducer is to run in accumulative mode.
>
>
> I tried modifying your query to -
>
>
> stats = FOREACH grpd {
>     pages = records.page;
>     GENERATE group.$1 AS host, group.$0 AS domain,
>              COUNT(pages) AS page_count:long;
> };
>
> and ran it disabling the combiner -
> bin/pig -Dpig.exec.nocombiner=true -x local -e 'explain -script
> /tmp/t.pig;'
>
> I was able to verify, using the above log message, that it would run in
> accumulator mode.
>
> Thanks,
> Thejas
>
>
>
>
>
> On 3/13/12 12:22 PM, Yen SYU wrote:
>
>> Hi Jon,
>>
>> Thanks for your response! I use Pig 0.9.1-snapshot.
>>
>> I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired.
>> Also tried to remove generic type in accumulator but it did not help. :(
>>
>> Is it easy for you to fire accumulator?
>>
>> Yen
>>
>> On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney <[email protected]>
>> wrote:
>>
>>  What version of pig are you using?
>>>
>>> just as an experiment in the simple case, can you try doing
>>>
>>> GENERATE flatten(group) as (domain,host), ...(the rest)...
>>>
>>> shouldn't make a difference, but I think I remember that in some older
>>> versions it did
>>>
>>> 2012/3/13 Yen SYU<[email protected]>
>>>
>>>  Hi all,
>>>>
>>>> I just tested a very simple Pig script, as follows:
>>>>
>>>> records = LOAD '$input' AS (hash:chararray, domain:chararray,
>>>> host:chararray, page:chararray, freq:int);
>>>> grpd = GROUP records BY (domain, host);
>>>> stats = FOREACH grpd {
>>>>     hashes = records.hash;
>>>>     uniq_hashes = DISTINCT hashes;
>>>>     pages = records.page;
>>>>     GENERATE group.$1 AS host, group.$0 AS domain,
>>>>              COUNT(uniq_hashes) AS hash_total:long,
>>>>              PAGE_COUNT(pages) AS page_count:long,
>>>>              SUM(records.freq) AS freq:long;
>>>> };
>>>> STORE stats INTO '$output';
>>>>
>>>> where PAGE_COUNT is a custom UDF implementing Accumulator. I added
>>>> EXEC_CALL and ACCUM_CALL counters to this UDF, and it looks like the
>>>> accumulate method is never called. I even tried removing all the other
>>>> built-in UDFs and keeping the nested FOREACH as simple as:
>>>>
>>>> stats = FOREACH grpd {
>>>>     pages = records.page;
>>>>     GENERATE group.$1 AS host, group.$0 AS domain,
>>>>              PAGE_COUNT(pages) AS page_count:long;
>>>> };
>>>>
>>>> Any idea what's going on behind the scenes?
>>>>
>>>> Thanks,
>>>> Yen
>>>>
>>>>
>>>
>>
>
