Hi, I'm trying to figure out how to incrementally add to an existing output directory using MapReduce.
I cannot specify the exact output path, because the input data is sorted into categories and then written to different directories based on its contents (token=AAAA or token=BBBB in the examples below).

As an example: when using MultipleOutputs, and provided that outDir does not exist yet, the following works:

    hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir

The result is:

    outDir/token=AAAA/dt=2013-05-03/
    outDir/token=BBBB/dt=2013-05-03/

However, the following fails with a FileAlreadyExistsException because outDir already exists, even though the input is new:

    hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir

What I would expect is that it adds:

    outDir/token=AAAA/dt=2013-05-04/
    outDir/token=BBBB/dt=2013-05-04/

Another possibility would be the following hack, but it does not seem very elegant:

    hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir

and then copy the results from tempOutDir to outDir.

Is there a better way to incrementally add to an existing Hadoop output directory?
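For reference, the merge step of the temp-directory workaround could look roughly like the sketch below. It is only an illustration under some assumptions: it uses java.nio on a local filesystem rather than HDFS (on a cluster the same walk would presumably go through Hadoop's FileSystem API), and the class name MergeOutput and method name mergeInto are made up for this example. It skips files whose names start with an underscore, such as the _SUCCESS marker, so that repeated merges do not collide on them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: merges a finished temporary
// output tree into an existing output directory, keeping relative paths.
final class MergeOutput {

    static void mergeInto(Path tempOut, Path finalOut) throws IOException {
        // Collect all regular files first so the tree is not modified
        // while it is still being walked.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(tempOut)) {
            files = walk
                .filter(Files::isRegularFile)
                // Assumption: job markers such as _SUCCESS are not wanted.
                .filter(p -> !p.getFileName().toString().startsWith("_"))
                .collect(Collectors.toList());
        }

        for (Path src : files) {
            // e.g. tempOutDir/token=AAAA/dt=2013-05-04/part-r-00000
            //   -> outDir/token=AAAA/dt=2013-05-04/part-r-00000
            Path dest = finalOut.resolve(tempOut.relativize(src));
            Files.createDirectories(dest.getParent());
            // No REPLACE_EXISTING option: an existing target makes this
            // throw instead of silently overwriting earlier output.
            Files.move(src, dest);
        }
    }
}
```

Calling MergeOutput.mergeInto(Paths.get("tempOutDir"), Paths.get("outDir")) after the job finishes would add the new dt=... partitions next to the existing ones, but it still feels like a workaround rather than a proper solution.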
