You could create a CustomOutputCommitter and, in its commitJob() method, simply read the part-* files and write them back out as a single aggregated file.
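Roughly something like this (an untested sketch against the new org.apache.hadoop.mapreduce API, assuming a version where OutputCommitter has commitJob(); the merged file name "merged" and the class names are just placeholders):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Committer that, after the normal commit, concatenates every part-*
// file in the output directory into one file named "merged" (placeholder).
public class CustomOutputCommitter extends FileOutputCommitter {

  private final Path outputPath;

  public CustomOutputCommitter(Path outputPath, TaskAttemptContext context)
      throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  @Override
  public void commitJob(JobContext context) throws IOException {
    super.commitJob(context); // promote the task outputs into place first

    FileSystem fs = outputPath.getFileSystem(context.getConfiguration());
    FSDataOutputStream out = fs.create(new Path(outputPath, "merged"));
    try {
      for (FileStatus stat : fs.listStatus(outputPath)) {
        if (stat.getPath().getName().startsWith("part-")) {
          FSDataInputStream in = fs.open(stat.getPath());
          try {
            // copy the whole part file; 'false' leaves 'out' open
            IOUtils.copyBytes(in, out, context.getConfiguration(), false);
          } finally {
            in.close();
          }
        }
      }
    } finally {
      out.close();
    }
  }
}

// Matching output format: identical to TextOutputFormat except that it
// hands back the committer above.
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public synchronized OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException {
    return new CustomOutputCommitter(getOutputPath(context), context);
  }
}

Since commitJob() runs once, after all the reducers have finished, the merge sees every part file.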
This requires making a CustomOutputFormat class that returns the CustomOutputCommitter (as sketched above) and then setting that via job.setOutputFormatClass(CustomOutputFormat.class). See these classes:

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputCommitter.html
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/FileOutputFormat.html
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class)

- Robert

On Thu, Jan 3, 2013 at 3:11 PM, Pavel Hančar <[email protected]> wrote:
> Hello,
> I'd like to use more than one reduce task with Hadoop Streaming, and I'd
> like to have only one result. Is it possible? Or should I run one more job
> to merge the results? And is it the same with non-streaming jobs? Below
> you can see I get 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob: map 100% reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete: job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
> Pavel Hančar
