Hi Devaraj,

Thanks for the advice. That did the trick.
Thanks,
Max Lebedev

On Wed, Jul 17, 2013 at 10:51 PM, Devaraj k <[email protected]> wrote:

> It seems the job is not picking up the CustomOutputFormat. You need to
> set the custom output format class for your job using the
> org.apache.hadoop.mapred.JobConf.setOutputFormat(Class<? extends
> OutputFormat> theClass) API.
>
> If we don't set an OutputFormat for the job, it defaults to
> TextOutputFormat, which internally extends FileOutputFormat; that is why
> the exception below shows FileOutputFormat still being used.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:[email protected]]
> *Sent:* 18 July 2013 01:03
> *To:* [email protected]
> *Subject:* Re: Incrementally adding to existing output directory
>
> Hi Devaraj,
>
> Thank you very much for your help. I've created a CustomOutputFormat
> which is almost identical to the FileOutputFormat seen here:
> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java
> except I've removed line 125, which throws the FileAlreadyExistsException.
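The fix Devaraj describes, registering the custom format so the job stops falling back to TextOutputFormat, can be sketched roughly as follows. Since Max's stack trace goes through the newer org.apache.hadoop.mapreduce API, this sketch uses Job.setOutputFormatClass; the old-API equivalent is JobConf.setOutputFormat. The Driver class name and the argument paths are illustrative, not from the thread.

```java
// Hypothetical driver showing where the custom format is registered.
// CustomOutputFormat is the user-defined class discussed in this thread.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "incremental-output");
        job.setJarByClass(Driver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Static setOutputPath is inherited from FileOutputFormat.
        CustomOutputFormat.setOutputPath(job, new Path(args[1]));

        // Without this call the job keeps the default TextOutputFormat, so
        // the stock FileOutputFormat.checkOutputSpecs still runs and
        // rejects an existing output directory.
        job.setOutputFormatClass(CustomOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```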
> However, when I try to run my code, I get this error:
>
> Exception in thread "main"
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> outDir already exists
>         at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>         ...
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> In my source code, I've changed "FileOutputFormat.setOutputPath" to
> "CustomOutputFormat.setOutputPath".
>
> Is FileOutputFormat.checkOutputSpecs being called from somewhere else,
> or have I done something wrong? I also don't quite understand your
> suggestion about MultipleOutputs. Would you mind elaborating?
>
> Thanks,
> Max Lebedev
>
> On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <[email protected]> wrote:
>
> Hi Max,
>
> It can be done by customizing the output format class for your job
> according to your expectations. You could refer to the
> OutputFormat.checkOutputSpecs(JobContext context) method, which checks
> the output specification; you can override this in your custom
> OutputFormat.
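The override Devaraj suggests might look like the sketch below, written against the 0.20-era mapreduce API that Max linked. Extending TextOutputFormat instead of copying FileOutputFormat wholesale is an assumption on my part; the class and message strings are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// A sketch of a custom output format that skips the existing-directory
// check instead of duplicating all of FileOutputFormat.
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        // Keep the "is an output path set at all?" validation, but
        // deliberately omit the FileAlreadyExistsException the parent
        // class throws when the directory already exists.
        Path outDir = getOutputPath(job);
        if (outDir == null) {
            throw new IOException("Output directory not set.");
        }
    }
}
```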
> You can also see the MultipleOutputs class for implementation details
> on how this could be done.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:[email protected]]
> *Sent:* 16 July 2013 23:33
> *To:* [email protected]
> *Subject:* Incrementally adding to existing output directory
>
> Hi,
>
> I'm trying to figure out how to incrementally add to an existing output
> directory using MapReduce.
>
> I cannot specify the exact output path, because data in the input is
> sorted into categories and then written to different directories based
> on the contents (in the examples below, token=AAAA or token=BBBB).
>
> As an example, when using MultipleOutputs, and provided that outDir does
> not exist yet, the following will work:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
>
> The result will be:
>
> outDir/token=AAAA/dt=2013-05-03/
> outDir/token=BBBB/dt=2013-05-03/
>
> However, the following will fail because outDir already exists, even
> though I am processing new inputs:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
>
> It will throw a FileAlreadyExistsException. What I would expect is that
> it adds:
>
> outDir/token=AAAA/dt=2013-05-04/
> outDir/token=BBBB/dt=2013-05-04/
>
> Another possibility would be the following hack, but it does not seem
> very elegant:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir
>
> then copy from tempOutDir to outDir.
>
> Is there a better way to incrementally add to an existing Hadoop output
> directory?
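For what it's worth, the temp-directory workaround mentioned in the original question can at least be automated with the HDFS FileSystem API rather than a manual copy. This is a rough sketch only; the tempOutDir/outDir names and the token=*/dt=* layout are taken from the examples in the thread, and the MergeOutput class is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Move each token=*/dt=* subtree from the temporary job output into the
// existing outDir, then remove the temporary directory.
public class MergeOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tempOut = new Path("tempOutDir");
        Path finalOut = new Path("outDir");

        for (FileStatus token : fs.listStatus(tempOut)) {
            Path dest = new Path(finalOut, token.getPath().getName());
            fs.mkdirs(dest); // no-op if token=... already exists in outDir
            for (FileStatus dt : fs.listStatus(token.getPath())) {
                // Rename is a cheap metadata operation within one HDFS
                // filesystem, unlike an actual copy.
                fs.rename(dt.getPath(), new Path(dest, dt.getPath().getName()));
            }
        }
        fs.delete(tempOut, true);
    }
}
```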
