Hey Narlin,

DoFns are similar to the Mapper and Reducer classes that you would write in classic MapReduce jobs; they don't spawn MapReduce jobs themselves. The Crunch planner analyzes the overall DAG of DoFns and groupByKey, union, and combineValues operations and compiles it into one or more MapReduce jobs, where each DoFn is assigned to one of the Mappers or Reducers in those jobs. Crunch has its own Mapper and Reducer implementations (named CrunchMapper and CrunchReducer, naturally) that are responsible for executing the DoFns assigned to each phase of the job.
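
To make the "planner assigns DoFns to phases" idea concrete, here is a toy sketch in plain Java. It is not Crunch code and not how the real planner is implemented: ToyPlanner, ToyJob, and the stage names are all made up for illustration, and the placement rule is a simplifying assumption (DoFns before the first groupByKey run map-side, DoFns after a groupByKey run reduce-side, and each further groupByKey starts a new job).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these names are hypothetical and are not part of Crunch.
class ToyPlanner {
    // Marker for a shuffle boundary; every other stage name stands for a DoFn.
    static final String GROUP_BY_KEY = "groupByKey";

    // One toy MapReduce job: the DoFns that run map-side and reduce-side.
    static final class ToyJob {
        final List<String> mapSide = new ArrayList<>();
        final List<String> reduceSide = new ArrayList<>();

        @Override
        public String toString() {
            return "map=" + mapSide + " reduce=" + reduceSide;
        }
    }

    // Walks a linear chain of stages and assigns each DoFn to a phase of a job.
    // Assumed rule: a DoFn after a shuffle runs in that job's reduce phase,
    // and a second shuffle closes the current job and opens the next one.
    static List<ToyJob> plan(List<String> stages) {
        List<ToyJob> jobs = new ArrayList<>();
        ToyJob current = new ToyJob();
        boolean shuffled = false;
        for (String stage : stages) {
            if (stage.equals(GROUP_BY_KEY)) {
                if (shuffled) {
                    jobs.add(current);
                    current = new ToyJob();
                }
                shuffled = true;
            } else if (shuffled) {
                current.reduceSide.add(stage);
            } else {
                current.mapSide.add(stage);
            }
        }
        jobs.add(current);
        return jobs;
    }

    public static void main(String[] args) {
        List<ToyJob> jobs = plan(Arrays.asList(
            "splitWords", "toPairs", GROUP_BY_KEY, "sumCounts", GROUP_BY_KEY, "topN"));
        for (int i = 0; i < jobs.size(); i++) {
            System.out.println("job " + (i + 1) + ": " + jobs.get(i));
        }
    }
}
```

In this toy, the six stages compile into two jobs: the first runs splitWords and toPairs map-side and sumCounts reduce-side, and the second has an empty map phase and runs topN reduce-side. None of the DoFns starts a job on its own; they are only slotted into the phases the planner creates.
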
In general, you should not need to use mapper and reducer classes when you use Crunch, although if you have legacy Mapper and Reducer classes that you would like to use in conjunction with the DoFns in a Crunch pipeline, there is a collection of methods in org.apache.crunch.lib.MapReduce in Crunch 0.7.0 that will wrap a given Mapper or Reducer class inside of a DoFn.

Hope that helps.

Best,
Josh

On Wed, Aug 14, 2013 at 12:59 PM, Narlin M <[email protected]> wrote:

> I have just recently started using Crunch, having been recommended to use
> it instead of writing plain map reduce jobs. As I was going through the
> crunch documentation, some questions came to my mind. Am I correct in
> saying that the DoFn family of functions will internally spawn map-reduce
> jobs, so there is no need to write separate mapper or reducer classes? If
> so, I agree that this will abstract some of the lower level details from
> the programmer, but at the same time, does it not lower the programmer's
> control over the processing logic?
>
> Also, will there be situations when separate mapper / reducer classes will
> be required in addition to the DoFn functions?
>
> Thanks.
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
