Hi, Currently the Algebraic interface allows a UDF writer to have an Initial, Intermediate and Final class (each of which should implement EvalFunc). The idea is that the UDF can be called in stages - Initial.exec() in the map, Intermediate.exec() in the combiner and Final.exec() in the Reduce. The UDF (say COUNT) which implements Algebraic, also extends EvalFunc. This means that it has an exec() method. Currently Pig calls this exec() method at the top level when the UDF is not "combinable". When it is "combinable", Pig currently calls Initial.exec() in the combine and Final.exec() in the Reduce. I will be changing the "combinable" case to call Initial.exec() in the map, Intermediate.exec() in the combine and Final.exec() in the reduce as part of https://issues.apache.org/jira/browse/PIG-563.
There are two options for the Non combinable case: 1) The way it is described above - top level UDF's exec() is called when combiner is not used and if combiner is used, Initial.exec() is called in the map, Intermediate.exec() in the combine and Final.exec() in the reduce. * Pros: a. Initial.exec() can be optimized with the knowledge that it is only called in the map. For example, in UDFs like COUNT, since Initial.exec() is always going to be called in map, the implementation can be optimized to simply emit Integer 1. * Cons: a. UDF writer has to potentially write two different code paths - one where UDF.exec() computes the result completely in the reduce() and another where Initial.exec(), Intermediate.exec() and Final.exec() compute the result in parts in the map, combine and reduce respectively. 2) If a UDF implements Algebraic, Pig will have to guarantee that Initial.exec() will be called and later Final.exec() will be called. If the UDF is combinable, these will be called from map and reduce respectively and Intermediate.exec() will be called from the combine. If the UDF is NOT combinable, Initial.exec() will be called first in the reduce, then its output will be put in a bag and supplied to a call of the Final.exec(). In both the cases the top level exec() of the UDF will never be called. * Pros: a. The guarantee that Initial.exec() and Final.exec() are called in both combinable and non combinable cases. * Cons: a. The UDF writer has to give a dummy implementation for UDF.exec() to satisfy the EvalFunc interface though UDF.exec() is never called. b. UDF writer should make sure the Initial.exec() and Final.exec() work in both the combinable and non combinable cases. c. There are performance penalties - in the combinable case, the Initial.exec() cannot be optimized since there is no guarantee that it is always called in the map. In the non combinable case, the call to Initial.exec() will contain all input and hence the result can be computed in that call itself. Despite this, Pig will have to take the result of Initial.exec(), put it in a bag and call Final.exec() which can be highly inefficient. I would vote for option 1 since it is much better from a performance angle. Please provide Comments/Suggestions on the proposal. Thanks, Pradeep