On Aug 13, 2012, at 9:05 AM, Benjamin Smedberg wrote:

> I'm a new-ish pig user querying data on an hbase cluster. I have a question 
> about accumulator-style functions.
> 
> When writing an accumulator-style UDF, is all of the data shipped to a single 
> machine before it is reduced/accumulated? For example, if I were doing to 
> write re-implement SUM as a UDF, it seems to me that it would be more 
> efficient to run SUM on each map node, and then do a sum-of-sums when 
> reducing. Is there a way to write a UDF which supports this style of 
> accumulation/aggregation?

How many reducers are involved in an operation is independent of the type of 
UDF you use.  The number of reducers is determined by the parallelism you 
declare in your script (via the parallel clause in your group statement or via 
a set default parallelism statement in your script) or by the default Pig 
chooses.  

As to whether it is more efficient to do a sum of sums, it certainly is. For 
those types of operations you should use an algebraic UDF rather than an 
accumulative.  Algebraic UDFs have an initial (map), intermediate (combiner), 
and final (reducer) steps.  Accumulative UDFs are for operations that cannot be 
distributed but that only need to see the data stream once.  An example would 
be cumulative sums, where you want to return not just a final sum but a list of 
the sums as you went along.  This is order dependent and thus can't be done 
until you've collected all the values for a given key.

> 
> Also, is PigStorage compatible with the quoting expected by excel 
> tab-delimited files? AIUI that would require quoting the values with 
> "value\tvalue" and escaping double quotes. If this isn't the native 
> PigStorage format, is there a storage module already written which supports 
> excel-tab output?

PigStorage doesn't support escaping.  I am not aware of a storage function 
focussed on excel CSV format, but others may be.

Alan.

> 
> --BDS
> 

Reply via email to