RE: Incrementally adding to existing output directory

Devaraj k Wed, 17 Jul 2013 19:52:36 -0700

It seems the job is not picking up your CustomOutputFormat. You need to set 
the custom output format class for your Job using the 
org.apache.hadoop.mapred.JobConf.setOutputFormat(Class<? extends OutputFormat> 
theClass) API.


If no OutputFormat is set for the Job, it defaults to TextOutputFormat, which 
extends FileOutputFormat. That is why the exception below still comes from 
FileOutputFormat.
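
To illustrate the point above, here is a minimal, self-contained toy model in 
plain Java. It does not use Hadoop: the classes Conf, FileFormat, TextFormat 
and CustomFormat are hypothetical stand-ins for the job configuration, 
FileOutputFormat, TextOutputFormat and CustomOutputFormat. It shows that 
calling the inherited static setOutputPath through the subclass name only 
records the path, so the default format's check still runs until the class 
itself is registered. Since the stack trace goes through 
org.apache.hadoop.mapreduce.Job, the corresponding call in the new API would 
presumably be job.setOutputFormatClass(CustomOutputFormat.class).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: every class here is a hypothetical stand-in, not Hadoop's.
public class OutputFormatSelectionDemo {

    /** Stand-in for a job configuration. */
    static final class Conf {
        final Map<String, String> props = new HashMap<>();
        Class<? extends FileFormat> outputFormat; // null until explicitly set
    }

    /** Stand-in for FileOutputFormat: refuses an existing output directory. */
    static class FileFormat {
        // Static helper: it only records the path in the configuration.
        static void setOutputPath(Conf conf, String dir) {
            conf.props.put("output.dir", dir);
        }

        void checkOutputSpecs(Conf conf, Set<String> existingDirs) {
            String dir = conf.props.get("output.dir");
            if (existingDirs.contains(dir)) {
                throw new IllegalStateException(
                        "Output directory " + dir + " already exists");
            }
        }
    }

    /** Stand-in for the default TextOutputFormat, which inherits the check. */
    static class TextFormat extends FileFormat { }

    /** Stand-in for CustomOutputFormat: the existence check is removed. */
    static class CustomFormat extends FileFormat {
        @Override
        void checkOutputSpecs(Conf conf, Set<String> existingDirs) {
            // Intentionally accepts an existing directory.
        }
    }

    static FileFormat instantiate(Class<? extends FileFormat> cls) {
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + cls, e);
        }
    }

    /** Mimics job submission: use the configured format, else the default. */
    static String submit(Conf conf, Set<String> existingDirs) {
        Class<? extends FileFormat> cls =
                conf.outputFormat != null ? conf.outputFormat : TextFormat.class;
        FileFormat format = instantiate(cls);
        try {
            format.checkOutputSpecs(conf, existingDirs);
            return "submitted with " + cls.getSimpleName();
        } catch (IllegalStateException e) {
            return "rejected by " + cls.getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>();
        existing.add("outDir");

        Conf conf = new Conf();
        // The inherited static, called via the subclass, does not select it.
        CustomFormat.setOutputPath(conf, "outDir");
        System.out.println(submit(conf, existing));

        // Only registering the class changes which check runs.
        conf.outputFormat = CustomFormat.class;
        System.out.println(submit(conf, existing));
    }
}
```

In this toy model the first submission is rejected by TextFormat and the 
second one is accepted with CustomFormat, which mirrors the behaviour 
described above.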


Thanks
Devaraj k

From: Max Lebedev [mailto:[email protected]]
Sent: 18 July 2013 01:03
To: [email protected]
Subject: Re: Incrementally adding to existing output directory

Hi Devaraj,

Thank you very much for your help. I've created a CustomOutputFormat which is 
almost identical to FileOutputFormat as seen 
here<http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java>
except I've removed line 125 which throws the FileAlreadyExistsException. 
However, when I try to run my code, I get this error:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory outDir already exists
            at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:396)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
            at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
            at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
            ...
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

In my source code, I've changed "FileOutputFormat.setOutputPath" to 
"CustomOutputFormat.setOutputPath"

Is it the case that FileOutputFormat.checkOutputSpecs is happening somewhere 
else, or have I done something wrong?
I also don't quite understand your suggestion about MultipleOutputs. Would you 
mind elaborating?

Thanks,
Max Lebedev

On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <[email protected]> wrote:
Hi Max,

  It can be done by customizing the output format class for your Job according 
to your expectations. You could refer to the 
OutputFormat.checkOutputSpecs(JobContext context) method, which checks the 
output specification. You can override this in your custom OutputFormat. You 
can also see the MultipleOutputs class for implementation details on how it 
could be done.
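
A minimal sketch of such an override, assuming the new 
org.apache.hadoop.mapreduce API and assuming text output is wanted (hence 
extending TextOutputFormat rather than copying FileOutputFormat). The class 
name CustomOutputFormat is the one used elsewhere in this thread; keeping the 
"output directory not set" check is an illustrative choice. The real 
FileOutputFormat.checkOutputSpecs may do additional work in some versions 
(for example security token setup), which is omitted here. It needs the Hadoop 
libraries on the classpath to compile.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * Sketch: a TextOutputFormat that tolerates an existing output directory.
 * Assumption: only the "already exists" check should be skipped.
 */
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        // Still insist that an output path has been configured.
        Path outDir = getOutputPath(job);
        if (outDir == null) {
            throw new InvalidJobConfException("Output directory not set.");
        }
        // Deliberately no FileSystem.exists(outDir) check here, so the job
        // is not rejected with FileAlreadyExistsException at submission.
    }
}
```

The job would still have to select this class, for example with 
job.setOutputFormatClass(CustomOutputFormat.class). Writing into an existing 
directory may collide with files of the same name from earlier runs, so 
distinct subdirectories per run are advisable.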

Thanks
Devaraj k

From: Max Lebedev [mailto:[email protected]]
Sent: 16 July 2013 23:33
To: [email protected]
Subject: Incrementally adding to existing output directory

Hi
I'm trying to figure out how to incrementally add to an existing output 
directory using MapReduce.
I cannot specify the exact output path, as data in the input is sorted into 
categories and then written to different directories based on the contents (in 
the examples below, token=AAAA or token=BBBB).
As an example:
When using MultipleOutputs, and provided that outDir does not exist yet, the 
following will work:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
The result will be:
outDir/token=AAAA/dt=2013-05-03/
outDir/token=BBBB/dt=2013-05-03/
However, the following will fail because outDir already exists, even though I 
am only adding new inputs:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
This throws FileAlreadyExistsException.
What I would expect is that it adds
outDir/token=AAAA/dt=2013-05-04/
outDir/token=BBBB/dt=2013-05-04/
Another possibility would be the following hack, but it does not seem very 
elegant:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir
then copy from tempOutDir to outDir
Is there a better way to address incrementally adding to an existing hadoop 
output directory?
