What do you mean by a "final reduce"? Not all jobs require the final output to be singular: the reduce phase is designed to work on a per-partition basis (which is also why the output files are named part-*). A job has exactly one reduce phase, in which the reducers all work independently and complete. If you need the results assembled together in the order of the partitions created, rely on the solutions already suggested, such as a second step of hadoop fs -getmerge, or the same call made from a custom FileOutputCommitter, etc.
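For example, a minimal sketch (the output paths and the /tmp/sum helper below are placeholders for illustration, not part of your job):

# Merge all part files of the finished job, in partition order,
# into a single local file:
$ hadoop fs -getmerge 1gb.wc /tmp/1gb.wc.merged

# Or run a second streaming pass with one reducer, so HDFS itself
# ends up with exactly one part file. For your wc -c job, the
# second reducer only has to sum the per-partition counts:
$ cat /tmp/sum
#!/bin/bash
awk '{ s += $1 } END { print s }'
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=1 \
    -mapper /bin/cat -reducer /tmp/sum -file /tmp/sum \
    -input 1gb.wc -output 1gb.wc.total

As Vinod noted, the second pass with a single reducer only makes sense when the first job's output is small.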
On Fri, Jan 4, 2013 at 2:05 PM, Pavel Hančar <[email protected]> wrote:
> Hello,
> thank you for the answer. Exactly: I want the parallelism but a single
> final output. What do you mean by "another stage"? I thought I should set
> mapred.reduce.tasks large enough and Hadoop would run the reducers in as
> many rounds as is optimal. But that isn't the case.
> When I tried to run the classical WordCount example and set this via
> JobConf.setNumReduceTasks(int n), it seemed to me I had the final output
> (there were no word duplicates for the normal words -- only some for
> strange words). So why doesn't Hadoop run the final reduce in my simple
> streaming example?
> Thank you,
> Pavel Hančar
>
> 2013/1/4 Vinod Kumar Vavilapalli <[email protected]>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the
>> way to go. If not, a second stage won't help. What exactly are your
>> objectives?
>>
>> Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>> Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming and I'd
>> like to have only one result. Is it possible? Or should I run one more
>> job to merge the result? And is it the same with non-streaming jobs?
>> Below you see I get 5 results for mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob: map 100% reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>> Pavel Hančar
>>
>>

--
Harsh J
