Hi Bejoy,

Thank you for your idea.
The hadoop patch I mentioned means this merge happens during the output
writing process.

Regards!
Chen

On Jan 3, 2013 11:25 PM, <[email protected]> wrote:

> Hi Chen,
>
> You do have an option in hadoop to achieve this if you want the merged
> file in LFS:
>
> 1) Run your job with n reducers, and you'll have n files in the output
> dir.
>
> 2) Issue a hadoop fs -getmerge command to get the files in the output
> dir merged into a single file in LFS.
> (In recent versions use 'hdfs dfs -getmerge'.)
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From:* Chen He <[email protected]>
> *Date:* Thu, 3 Jan 2013 22:55:36 -0600
> *To:* <[email protected]>
> *ReplyTo:* [email protected]
> *Subject:* Re: more reduce tasks
>
> Sounds like you want more reducers to reduce the execution time but only
> want a single output file. Is this what you want?
>
> You can use as many reducers as you want (the number may not be optimal)
> when you run your job. Once the job is done, write a small Perl, Python,
> or shell script to concatenate those part-* files.
>
> If you do not want to write your own script and would rather have Hadoop
> automatically generate a single file, that may need some patches to
> current Hadoop. I am not sure whether they are ready yet.
>
> On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
> [email protected]> wrote:
>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the
>> way to go. If not, a second stage won't help. What exactly are your
>> objectives?
>>
>> Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>> Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming, and
>> I'd like to have only one result. Is it possible? Or should I run one
>> more job to merge the results? And is it the same with non-streaming
>> jobs? Below you can see I get 5 result files for mapred.reduce.tasks=5.
>>
>> $ hadoop jar \
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc \
>> -file /tmp/wcc -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> ...
>> 13/01/03 22:00:03 INFO streaming.StreamJob: map 100% reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete: job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains:
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>> Pavel Hančar
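For reference, a minimal sketch of the getmerge route Bejoy describes
above, applied to Pavel's output directory 1gb.wc (the local destination
/tmp/1gb.wc.txt is just an illustrative name):

$ # pull every part-* file from the HDFS output dir into one local file
$ hadoop fs -getmerge 1gb.wc /tmp/1gb.wc.txt
$ # on newer releases the equivalent is: hdfs dfs -getmerge 1gb.wc /tmp/1gb.wc.txt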
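The do-it-yourself concatenation Chen suggests can also stay inside HDFS
as a one-liner; a sketch, assuming the merged copy should land at
1gb.wc.merged (a hypothetical path) and that this release's put command
accepts '-' for reading from stdin:

$ # stream all part files through cat and write them back as one HDFS file
$ hadoop fs -cat '1gb.wc/part-*' | hadoop fs -put - 1gb.wc.merged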
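And the "one more job" Pavel asks about is essentially the same streaming
invocation pointed at the first job's output with a single reducer; a
sketch against the same CDH3 jar (the output name 1gb.wc.single is
illustrative):

$ hadoop jar \
    /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
    -D mapred.reduce.tasks=1 \
    -mapper /bin/cat -reducer /bin/cat \
    -input 1gb.wc -output 1gb.wc.single

Note that /bin/cat as the identity reducer only merges the five counts
into one file; producing a single summed total would need a summing
reducer in place of cat.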
