Distributed accumulator functions

Benjamin Smedberg Mon, 13 Aug 2012 09:06:13 -0700

I'm a new-ish Pig user querying data on an HBase cluster. I have a question about accumulator-style functions.

When writing an accumulator-style UDF, is all of the data shipped to a single machine before it is reduced/accumulated? For example, if I were going to re-implement SUM as a UDF, it seems to me that it would be more efficient to run SUM on each map node, and then do a sum-of-sums when reducing. Is there a way to write a UDF which supports this style of accumulation/aggregation?
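
To make the idea concrete, here is a rough sketch in plain Java of the evaluation order I have in mind. It does not use any Pig API; the class and method names (SumOfSums, partialSum, combine) are made up for illustration, and the two lists just stand in for the input splits seen by two map nodes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Pig's UDF API.
final class SumOfSums {
    // "Map side": each node reduces its own split to a single partial sum.
    static long partialSum(List<Long> split) {
        long sum = 0L;
        for (long v : split) {
            sum += v;
        }
        return sum;
    }

    // "Reduce side": only the per-node partial sums are shipped and added up.
    static long combine(List<Long> partials) {
        return partialSum(partials);
    }

    public static void main(String[] args) {
        List<Long> nodeA = Arrays.asList(1L, 2L, 3L); // stand-in for one map node's data
        List<Long> nodeB = Arrays.asList(4L, 5L);     // stand-in for another map node's data
        long total = combine(Arrays.asList(partialSum(nodeA), partialSum(nodeB)));
        System.out.println(total); // prints 15
    }
}
```

The point is that only one number per map node crosses the network, rather than every input row.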

Also, is PigStorage compatible with the quoting expected by Excel tab-delimited files? AIUI that would require wrapping values in double quotes ("value\tvalue") and escaping any embedded double quotes. If this isn't the native PigStorage format, is there a storage module already written which supports Excel-tab output?
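
For reference, this is roughly the output format I mean, sketched in plain Java. It assumes Excel escapes an embedded double quote by doubling it, which is my understanding rather than something I have verified against a spec. The names (ExcelTabQuote, quote, joinRow) are hypothetical and nothing here is a Pig or PigStorage API.

```java
// Hypothetical illustration only; not part of Pig or PigStorage.
final class ExcelTabQuote {
    // Wrap a value in double quotes, doubling any embedded double quotes
    // (assumed Excel convention).
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join quoted fields with tabs to form one output row.
    static String joinRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append('\t');
            }
            row.append(quote(fields[i]));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // The first field contains a tab, the second an embedded double quote.
        System.out.println(joinRow("value\tvalue", "say \"hi\""));
    }
}
```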


--BDS
