Re: Crunch DoFn vs Mapper/reducer

Josh Wills Wed, 14 Aug 2013 15:51:41 -0700

Hey Narlin,

DoFns are similar to the Mapper and Reducer classes that you would write in
classic MapReduce jobs; they don't spawn MapReduce jobs themselves. The
Crunch planner will analyze the overall DAG of DoFns, groupByKeys, unions,
and combineValues operations and compile the DAG into one or more MapReduce
jobs, where each of the DoFns will be assigned to one of the Mappers or
Reducers in those jobs. Crunch has its own Mapper and Reducer
implementations (named CrunchMapper and CrunchReducer, naturally) that are
responsible for executing the DoFns that are assigned to each phase of the
job.
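
To make the idea concrete, here is a minimal sketch in plain Java of how two
DoFn-style functions can be fused and run inside a single map phase. It is a
toy model, not Crunch code: ToyDoFn, fuse, runMapPhase, and FusedPhaseSketch
are hypothetical names invented for this illustration, and the real planner
does considerably more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Toy model only: these are hypothetical names, not Crunch classes.
final class FusedPhaseSketch {

    /** Stand-in for a DoFn: consumes one input and emits zero or more outputs. */
    interface ToyDoFn<S, T> {
        void process(S input, Consumer<T> emitter);
    }

    /** Chains two functions so a single task runs both, with no job in between. */
    static <S, T, U> ToyDoFn<S, U> fuse(ToyDoFn<S, T> first, ToyDoFn<T, U> second) {
        return (input, emitter) ->
            first.process(input, mid -> second.process(mid, emitter));
    }

    /** Stand-in for one map phase: runs the fused chain over every record. */
    static <S, T> List<T> runMapPhase(List<S> records, ToyDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S record : records) {
            fn.process(record, out::add);
        }
        return out;
    }

    static List<String> demo() {
        // First function: split each line into words.
        ToyDoFn<String, String> splitWords = (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.accept(word);
                }
            }
        };
        // Second function: normalize each word to lower case.
        ToyDoFn<String, String> lowerCase =
            (word, emitter) -> emitter.accept(word.toLowerCase(Locale.ROOT));
        return runMapPhase(Arrays.asList("Hello Crunch", "hello DAG"),
                           fuse(splitWords, lowerCase));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [hello, crunch, hello, dag]
    }
}
```

Neither function starts a job on its own; both run inside the one phase they
were assigned to, which is the role CrunchMapper and CrunchReducer play for
real DoFns.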

In general, you should not need to write Mapper and Reducer classes when you
use Crunch. If you have legacy Mapper and Reducer classes that you would
like to use alongside the DoFns in a Crunch pipeline, Crunch 0.7.0 provides
a collection of methods in org.apache.crunch.lib.Mapreduce that wrap a given
Mapper or Reducer class inside a DoFn.
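
The wrapping idea can be sketched as a simple adapter. The code below is plain
Java and deliberately does not use the Hadoop or Crunch types or reproduce the
library's method signatures; ToyLegacyMapper, ToyDoFn, wrap, and
LegacyMapperAdapterSketch are hypothetical names that only show the shape of
the adaptation.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model only: none of these types are Hadoop or Crunch classes.
final class LegacyMapperAdapterSketch {

    /** Stand-in for a legacy Mapper: takes a key/value pair, writes to a context. */
    interface ToyLegacyMapper<K, V, OUT> {
        void map(K key, V value, Consumer<OUT> context);
    }

    /** Stand-in for a DoFn: consumes one input and emits zero or more outputs. */
    interface ToyDoFn<S, T> {
        void process(S input, Consumer<T> emitter);
    }

    /** Wraps the legacy mapper so it can sit in a chain of DoFn-style functions. */
    static <K, V, OUT> ToyDoFn<Map.Entry<K, V>, OUT> wrap(ToyLegacyMapper<K, V, OUT> mapper) {
        // The emitter doubles as the mapper's output context.
        return (entry, emitter) -> mapper.map(entry.getKey(), entry.getValue(), emitter);
    }

    static List<String> demo() {
        // A "legacy" mapper keyed by byte offset, as a text input would supply.
        ToyLegacyMapper<Long, String, String> legacy =
            (offset, line, context) -> context.accept(offset + ":" + line.trim());
        ToyDoFn<Map.Entry<Long, String>, String> asDoFn = wrap(legacy);

        List<String> out = new ArrayList<>();
        asDoFn.process(new AbstractMap.SimpleEntry<>(0L, " first line "), out::add);
        asDoFn.process(new AbstractMap.SimpleEntry<>(12L, "second line"), out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [0:first line, 12:second line]
    }
}
```

The existing mapper logic stays untouched; only a thin adapter is needed for
it to take part in the same chain as the other functions.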

Hope that helps.

Best,
Josh

On Wed, Aug 14, 2013 at 12:59 PM, Narlin M <[email protected]> wrote:

> I have just recently started using Crunch, having been recommended to use
> it instead of writing plain MapReduce jobs. As I was going through the
> Crunch documentation, some questions came to mind. Am I correct in
> saying that the DoFn family of functions will internally spawn map-reduce
> jobs, so there is no need to write separate mapper or reducer classes? If
> so, I agree that this will abstract some of the lower level details from
> the programmer, but at the same time, does it not lower the programmer's
> control over the processing logic?
>
> Also, will there be situations when separate mapper / reducer classes will
> be required in addition to the DoFn functions?
>
> Thanks.
>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
