Thanks for the help. I just implemented it as suggested: I process the new file and then join it with the previous results. But can I modify the original document so that it holds the updated counts plus the new word counts? My inputs are step1_word_count_output.txt + new_raw_input, and the output I want should be saved in step1_word_count_output.txt. Which is to say, I just want to have one output file.

On Wed, Jan 16, 2013 at 7:30 PM, <[email protected]> wrote:

> Hi Jamal
>
> You can use Distributed Cache only if the file to be distributed is small.
> Mapreduce should be dealing with larger datasets so you should expect the
> output file to get larger.
>
> In a simple, straightforward manner, you can get the second data set
> processed and then merge the first output with the second output; you can
> use KeyValueTextInputFormat to load the outputs into the second MR job.
>
> Else you can use multiple inputs here and process the new input file into
> 'word 1' and the previous output file as 'word $count' in the mapper and do
> its aggregation in the reducer.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * jamal sasha <[email protected]>
> *Date: *Wed, 16 Jan 2013 18:54:04 -0800
> *To: *[email protected]<[email protected]>; <[email protected]>
> *ReplyTo: * [email protected]
> *Subject: *Re: modifying existing wordcount example
>
> Hi,
> Thanks for giving your thoughts.
> I was reading some libraries in hadoop.. and i feel like distributed cache
> might help me.
> but i picked up hadoop very recently (and along with it java as well) and i am
> not able to think of how to actually code :(
>
>
> On Wed, Jan 16, 2013 at 6:13 PM, Chris Embree <[email protected]> wrote:
>
>> Can you instead copy input1 and input2 together?
>>
>> Or process both files on the second pass?
>>
>> Otherwise, you'll have to read in output file, load the values and start
>> your map/red job.
>>
>> Probably someone else will have a better answer. :)
>>
>>
>> On Wed, Jan 16, 2013 at 9:07 PM, jamal sasha <[email protected]> wrote:
>>
>>> Hi,
>>> In the wordcount example:
>>> http://hadoop.apache.org/docs/r0.17.0/mapred_tutorial.html
>>> Lets say I run the above example and save the output.
>>> But lets say that I have now a new input file. What I want to do is..
>>> basically again do the wordcount but basically modifying the previous
>>> counts.
>>> For example..
>>> sample_input1.txt //foo bar foo bar bar bar
>>> After first run:
>>> 1) foo 2
>>> 2) bar 4
>>>
>>> Save it in output1.txt
>>>
>>> Now sample_input2.txt //bar hello world
>>> Now the result I am looking for is:
>>> 1) foo 2
>>> 2) bar 5
>>> 3) hello 1
>>> 4) world 1
>>>
>>> How do i achieve this in map reduce?
>>>
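
The multiple-inputs idea from the thread (treat each new word as 'word 1', treat each line of the previous output as 'word $count', and sum per word in the reducer) can be sketched roughly as below. This is plain Java with no Hadoop dependency so it runs standalone; the class name IncrementalWordCount and the method names are made up for illustration, and it assumes the previous output has one whitespace-separated "word count" pair per line. In a real job the two map functions would presumably become two Mapper classes and the summing would live in the Reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical standalone sketch: simulates the map and reduce logic in memory.
class IncrementalWordCount {

    // "Mapper" for the previous output: a line "word count" contributes (word, count).
    static void mapPreviousOutput(String line, Map<String, Long> totals) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            return; // assumption: silently skip blank or malformed lines
        }
        totals.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
    }

    // "Mapper" for the new raw input: every word contributes (word, 1).
    static void mapRawInput(String line, Map<String, Long> totals) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                totals.merge(word, 1L, Long::sum);
            }
        }
    }

    // "Reducer": Map.merge with Long::sum adds up all values seen for a word.
    static Map<String, Long> merge(List<String> previousOutput, List<String> newRawInput) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : previousOutput) {
            mapPreviousOutput(line, totals);
        }
        for (String line : newRawInput) {
            mapRawInput(line, totals);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Sample data taken from the example in this thread.
        List<String> previousOutput = Arrays.asList("foo 2", "bar 4");
        List<String> newRawInput = Arrays.asList("bar hello world");
        for (Map.Entry<String, Long> e : merge(previousOutput, newRawInput).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

On the single-output-file question: Hadoop's FileOutputFormat will not write into an output directory that already exists, so one likely arrangement is to write the merged counts to a new directory and then replace the old output with it once the job succeeds, rather than updating step1_word_count_output.txt in place.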
