I'm a new-ish Pig user querying data on an HBase cluster. I have a question about accumulator-style functions.

When writing an accumulator-style UDF, is all of the data shipped to a single machine before it is reduced/accumulated? For example, if I were to re-implement SUM as a UDF, it seems to me that it would be more efficient to run SUM on each map node and then do a sum-of-sums when reducing. Is there a way to write a UDF that supports this style of accumulation/aggregation?
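
To make the idea concrete, here is roughly what I mean by sum-of-sums, as a plain-Java sketch. This is not the Pig UDF API; the class name, the method names, and the three-way split of the input are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this does not use any Pig classes.
class SumOfSums {
    // "Map side": each node sums only its local slice of the data.
    static long partialSum(List<Long> localValues) {
        long total = 0;
        for (long v : localValues) {
            total += v;
        }
        return total;
    }

    // "Reduce side": combine the per-node partial sums into the final result.
    static long combine(List<Long> partialSums) {
        return partialSum(partialSums);
    }

    public static void main(String[] args) {
        // Hypothetical split of the input across three map nodes.
        long a = partialSum(Arrays.asList(1L, 2L, 3L));
        long b = partialSum(Arrays.asList(4L, 5L));
        long c = partialSum(Arrays.asList(6L));
        // Only three numbers cross the network instead of six.
        System.out.println(combine(Arrays.asList(a, b, c)));
    }
}
```

The point is that only one partial result per node would need to reach the reducer, rather than every input value.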

Also, is PigStorage compatible with the quoting that Excel expects in tab-delimited files? AIUI, that would require wrapping any value that contains a tab in double quotes (e.g. "value\tvalue") and escaping embedded double quotes. If this isn't the native PigStorage format, is there a storage module already written that supports Excel-tab output?
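
In case it helps pin down what I'm asking for, this is the quoting I have in mind, again as a plain-Java sketch rather than anything from Pig. The rules here are my own assumption about what Excel expects (quote a field if it contains a tab, a double quote, or a line break, and double any embedded double quotes); the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the output format I am after; not an existing Pig storage class.
class ExcelTabQuoting {
    // Assumed rule: quote only fields that contain a tab, quote, or line break.
    static String quoteField(String field) {
        boolean needsQuotes = field.contains("\t") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Assumed escaping: an embedded double quote is written twice.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the quoted fields of one record with tabs.
    static String formatRow(List<String> fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                row.append('\t');
            }
            row.append(quoteField(fields.get(i)));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRow(Arrays.asList("plain", "value\tvalue", "say \"hi\"")));
    }
}
```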

--BDS
