Hi Devaraj,

Thanks for the advice. That did the trick.
Thanks,
Max Lebedev

On Wed, Jul 17, 2013 at 10:51 PM, Devaraj k <[email protected]> wrote:

> It seems the job is not picking up the CustomOutputFormat. You need to
> set the custom output format class for your job using the
> org.apache.hadoop.mapred.JobConf.setOutputFormat(Class<? extends
> OutputFormat> theClass) API.
>
> If we don't set an OutputFormat for the job, it defaults to
> TextOutputFormat, which internally extends FileOutputFormat; that is why
> the exception below shows FileOutputFormat still being used.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:[email protected]]
> *Sent:* 18 July 2013 01:03
> *To:* [email protected]
> *Subject:* Re: Incrementally adding to existing output directory
>
> Hi Devaraj,
>
> Thank you very much for your help. I've created a CustomOutputFormat
> which is almost identical to the FileOutputFormat seen here:
> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java
> except I've removed line 125, which throws the FileAlreadyExistsException.
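The fix Devaraj describes, registering the custom format so the job stops falling back to TextOutputFormat, can be sketched roughly as follows. Since Max's stack trace goes through the newer org.apache.hadoop.mapreduce API, this sketch uses Job.setOutputFormatClass; the old-API equivalent is JobConf.setOutputFormat. The Driver class name and the argument paths are illustrative, not from the thread.

```java
// Hypothetical driver showing where the custom format is registered.
// CustomOutputFormat is the user-defined class discussed in this thread.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "incremental-output");
        job.setJarByClass(Driver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Static setOutputPath is inherited from FileOutputFormat.
        CustomOutputFormat.setOutputPath(job, new Path(args[1]));

        // Without this call the job keeps the default TextOutputFormat, so
        // the stock FileOutputFormat.checkOutputSpecs still runs and
        // rejects an existing output directory.
        job.setOutputFormatClass(CustomOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```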
> However, when I try to run my code, I get this error:
>
> Exception in thread "main"
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> outDir already exists
>         at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>         ...
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> In my source code, I've changed "FileOutputFormat.setOutputPath" to
> "CustomOutputFormat.setOutputPath".
>
> Is FileOutputFormat.checkOutputSpecs being called from somewhere else,
> or have I done something wrong? I also don't quite understand your
> suggestion about MultipleOutputs. Would you mind elaborating?
>
> Thanks,
> Max Lebedev
>
> On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <[email protected]> wrote:
>
> Hi Max,
>
> It can be done by customizing the output format class for your job
> according to your expectations. You could refer to the
> OutputFormat.checkOutputSpecs(JobContext context) method, which checks
> the output specification; you can override this in your custom
> OutputFormat.
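The override Devaraj suggests might look like the sketch below, written against the 0.20-era mapreduce API that Max linked. Extending TextOutputFormat instead of copying FileOutputFormat wholesale is an assumption on my part; the class and message strings are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// A sketch of a custom output format that skips the existing-directory
// check instead of duplicating all of FileOutputFormat.
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        // Keep the "is an output path set at all?" validation, but
        // deliberately omit the FileAlreadyExistsException the parent
        // class throws when the directory already exists.
        Path outDir = getOutputPath(job);
        if (outDir == null) {
            throw new IOException("Output directory not set.");
        }
    }
}
```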
> You can also see the MultipleOutputs class for implementation details
> on how this could be done.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:[email protected]]
> *Sent:* 16 July 2013 23:33
> *To:* [email protected]
> *Subject:* Incrementally adding to existing output directory
>
> Hi,
>
> I'm trying to figure out how to incrementally add to an existing output
> directory using MapReduce.
>
> I cannot specify the exact output path, because data in the input is
> sorted into categories and then written to different directories based
> on the contents (in the examples below, token=AAAA or token=BBBB).
>
> As an example, when using MultipleOutputs, and provided that outDir does
> not exist yet, the following will work:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
>
> The result will be:
>
> outDir/token=AAAA/dt=2013-05-03/
> outDir/token=BBBB/dt=2013-05-03/
>
> However, the following will fail because outDir already exists, even
> though I am processing new inputs:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
>
> It will throw a FileAlreadyExistsException. What I would expect is that
> it adds:
>
> outDir/token=AAAA/dt=2013-05-04/
> outDir/token=BBBB/dt=2013-05-04/
>
> Another possibility would be the following hack, but it does not seem
> very elegant:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=tempOutDir
>
> then copy from tempOutDir to outDir.
>
> Is there a better way to incrementally add to an existing Hadoop output
> directory?
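For what it's worth, the temp-directory workaround mentioned in the original question can at least be automated with the HDFS FileSystem API rather than a manual copy. This is a rough sketch only; the tempOutDir/outDir names and the token=*/dt=* layout are taken from the examples in the thread, and the MergeOutput class is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Move each token=*/dt=* subtree from the temporary job output into the
// existing outDir, then remove the temporary directory.
public class MergeOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path tempOut = new Path("tempOutDir");
        Path finalOut = new Path("outDir");

        for (FileStatus token : fs.listStatus(tempOut)) {
            Path dest = new Path(finalOut, token.getPath().getName());
            fs.mkdirs(dest); // no-op if token=... already exists in outDir
            for (FileStatus dt : fs.listStatus(token.getPath())) {
                // Rename is a cheap metadata operation within one HDFS
                // filesystem, unlike an actual copy.
                fs.rename(dt.getPath(), new Path(dest, dt.getPath().getName()));
            }
        }
        fs.delete(tempOut, true);
    }
}
```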
