Hi all,
I just tested a very simple Pig script, as follows:
records = LOAD '$input' AS (hash:chararray, domain:chararray,
host:chararray, page:chararray, freq:int);
grpd = GROUP records BY (domain, host);
stats = FOREACH grpd {
    hashes = records.hash;
    uniq_hashes = DISTINCT hashes;
    pages = records.page;
    GENERATE group.$1 AS host, group.$0 AS domain,
        COUNT(uniq_hashes) AS hash_total:long,
        PAGE_COUNT(pages) AS page_count:long,
        SUM(records.freq) AS freq:long;
};
STORE stats INTO '$output';
where PAGE_COUNT is a custom UDF that implements the Accumulator
interface. I added EXEC_CALL and ACCUM_CALL counters to this UDF, and it
looks like the accumulate method is never called. This is still the case
even after I removed all the built-in UDFs and reduced the nested FOREACH
to something as simple as:
stats = FOREACH grpd {
    pages = records.page;
    GENERATE group.$1 AS host, group.$0 AS domain,
        PAGE_COUNT(pages) AS page_count:long;
};
Does anyone have an idea what's going on behind the scenes?
Thanks,
Yen