Hello, thank you for the answer. Exactly: I want the parallelism but a single final output. What do you mean by "another stage"? I thought I should set mapred.reduce.tasks large enough and Hadoop would run the reducers in however many rounds is optimal, but that doesn't seem to be the case. When I ran the classic WordCount example and set the number of reducers with JobConf.setNumReduceTasks(int n), it looked as if I got a single final output (there were no duplicates for normal words -- only for some strange words). So why doesn't Hadoop run a final reduce in my simple streaming example?

Thank you,
  Pavel Hančar
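P.S. If by "another stage" you mean a second streaming job with a single reducer over the first job's output, I imagine it would look roughly like the sketch below (untested; /tmp/sum is a hypothetical helper script analogous to my /tmp/wcc, and 1gb.wc is the output directory of the first job):

$ hadoop jar \
    /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
    -D mapred.reduce.tasks=1 \
    -mapper /bin/cat -reducer /tmp/sum -file /tmp/sum \
    -input 1gb.wc -output 1gb.wc.total

where /tmp/sum would contain

#!/bin/bash
# add up the partial byte counts, one per reducer of the first job
awk '{ s += $1 } END { print s }'

With mapred.reduce.tasks=1 all five partial counts would end up at the same reducer, so 1gb.wc.total should hold a single number.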
2013/1/4 Vinod Kumar Vavilapalli <[email protected]>

> Is it that you want the parallelism but a single final output? Assuming
> your first job's reducers generate a small output, another stage is the
> way to go. If not, a second stage won't help. What exactly are your
> objectives?
>
> Thanks,
> +Vinod
>
> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>
> Hello,
> I'd like to use more than one reduce task with Hadoop Streaming, and I'd
> like to have only one result. Is it possible? Or should I run one more
> job to merge the result? And is it the same with non-streaming jobs?
> Below you can see I get 5 results for mapred.reduce.tasks=5.
>
> $ hadoop jar
> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> ...
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
> job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
>
> where /tmp/wcc contains
>
> #!/bin/bash
> wc -c
>
> Thanks for any answer,
>   Pavel Hančar
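P.P.S. For an output as small as the five counts quoted above, I suppose the merge could also be done on the client side, without a second job:

$ hadoop dfs -cat 1gb.wc/part-* | awk '{ s += $1 } END { print s }'

which should print the single total (here 472173052 + 165736187 + 201719914 + 184376668 + 163872819 = 1187878640, i.e. roughly the 1 GB of input).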
