I'm a new-ish Pig user querying data on an HBase cluster. I have a
question about accumulator-style functions.
When writing an accumulator-style UDF, is all of the data shipped to a
single machine before it is reduced/accumulated? For example, if I were
going to re-implement SUM as a UDF, it seems to me that it would
be more efficient to run SUM on each map node, and then do a sum-of-sums
when reducing. Is there a way to write a UDF which supports this style
of accumulation/aggregation?
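
To make the idea concrete, here is a rough sketch of the two-stage sum I have in mind. It is plain Java and deliberately does not use the Pig UDF API; the class and method names (SumOfSums, partialSum, combine) are made up for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not a Pig UDF, just the shape of the computation.
final class SumOfSums {
    // Map-side step: reduce the values seen on one node to a single partial sum.
    static long partialSum(List<Long> localValues) {
        long sum = 0;
        for (long v : localValues) {
            sum += v;
        }
        return sum;
    }

    // Reduce-side step: combine the per-node partial sums into the final total.
    // For SUM this is the same operation applied to the partial results.
    static long combine(List<Long> partialSums) {
        return partialSum(partialSums);
    }

    public static void main(String[] args) {
        long nodeA = partialSum(Arrays.asList(1L, 2L, 3L)); // values seen on one map node
        long nodeB = partialSum(Arrays.asList(4L, 5L));     // values seen on another
        System.out.println(combine(Arrays.asList(nodeA, nodeB))); // prints 15
    }
}
```

Only one number per map node would need to cross the network here, rather than every input value.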
Also, is PigStorage compatible with the quoting expected by Excel
tab-delimited files? AIUI that would require wrapping any value that
contains a tab in double quotes, e.g. "value\tvalue", and escaping
embedded double quotes. If this isn't the native
PigStorage format, is there a storage module already written which
supports excel-tab output?
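
For reference, this is roughly the output quoting I am after, sketched in plain Java. It reflects my understanding of the Excel tab-delimited convention (quote a field if it contains a tab, newline, or double quote, and double any embedded quotes), which may not be exactly right. ExcelTabQuote, quoteField, and formatRow are hypothetical names, not part of Pig.

```java
// Hypothetical illustration only: not a Pig storage function.
final class ExcelTabQuote {
    // Quote a single field if it contains a tab, line break, or double quote.
    // Embedded double quotes are escaped by doubling them (assumed convention).
    static String quoteField(String value) {
        boolean needsQuotes = value.indexOf('\t') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join the quoted fields with tabs to form one output row.
    static String formatRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append('\t');
            }
            row.append(quoteField(fields[i]));
        }
        return row.toString();
    }
}
```

For example, formatRow("plain", "a\tb") would keep the first field as-is and wrap the second in double quotes so the embedded tab is not read as a column separator.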
--BDS
- Distributed accumulator functions Benjamin Smedberg