Incrementally adding to existing output directory

Max Lebedev Tue, 16 Jul 2013 11:06:04 -0700

Hi

I'm trying to figure out how to incrementally add to an existing output
directory using MapReduce.


I cannot specify the exact output path, as data in the input is sorted into
categories and then written to different directories based on its contents
(in the examples below, token=AAAA or token=BBBB).

As an example:

When using MultipleOutputs, and provided that outDir does not exist yet, the
following will work:

hadoop jar myMR.jar
--input-path=inputDir/dt=2013-05-03/* --output-path=outDir

The result will be:

outDir/token=AAAA/dt=2013-05-03/

outDir/token=BBBB/dt=2013-05-03/

However, the following will fail because outDir already exists, even though
the inputs are new:

hadoop jar myMR.jar  --input-path=inputDir/dt=2013-05-04/*
--output-path=outDir

This throws a FileAlreadyExistsException.

What I would expect is that it adds

outDir/token=AAAA/dt=2013-05-04/

outDir/token=BBBB/dt=2013-05-04/

Another possibility would be the following hack but it does not seem to be
very elegant:

hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/*
--output-path=tempOutDir

and then copy the results from tempOutDir into outDir.
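
To make the copy step of this workaround concrete, here is a rough sketch of
what it might look like. It is only an illustration under several
assumptions: it uses java.nio.file on a local filesystem rather than HDFS
(on HDFS the move would go through the Hadoop FileSystem API instead), the
class and method names MergeOutput and mergePartitions are hypothetical, and
it assumes the output layout is always token=.../dt=... as in the examples
above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: merges tempOutDir/token=X/dt=Y into outDir/token=X/dt=Y.
public class MergeOutput {

    // Lists the sub-directories of dir whose names match the glob pattern.
    private static List<Path> listDirs(Path dir, String glob) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                if (Files.isDirectory(p)) {
                    result.add(p);
                }
            }
        }
        return result;
    }

    // Moves every dt=... partition under tempOutDir into the matching token=...
    // directory under outDir. Returns the number of partitions moved.
    // Refuses to overwrite a partition that already exists in outDir.
    public static int mergePartitions(Path tempOutDir, Path outDir) throws IOException {
        int moved = 0;
        for (Path tokenDir : listDirs(tempOutDir, "token=*")) {
            Path targetTokenDir = outDir.resolve(tokenDir.getFileName().toString());
            Files.createDirectories(targetTokenDir);
            for (Path dateDir : listDirs(tokenDir, "dt=*")) {
                Path target = targetTokenDir.resolve(dateDir.getFileName().toString());
                if (Files.exists(target)) {
                    throw new IOException("Partition already exists: " + target);
                }
                // A plain move is a rename when source and target are on the
                // same file store; across file stores a non-empty directory
                // would need a recursive copy instead.
                Files.move(dateDir, target);
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        int moved = mergePartitions(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Moved " + moved + " partition directories");
    }
}
```

Run as "java MergeOutput tempOutDir outDir" after the job finishes. Existing
partitions such as outDir/token=AAAA/dt=2013-05-03/ are left untouched, and
the merge stops rather than overwriting a dt= directory that is already
present.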

Is there a better way to incrementally add to an existing Hadoop
output directory?
