Hi,

Hadoop MapReduce version: 2.2.0
We are using MultipleOutputs to write multiple output files from the mapper (a map-only job, no reducer). The requirement is that MultipleOutputs write to a directory other than the job's default output directory, so we used the following MultipleOutputs method to write to a different directory:

    public <K, V> void write(String namedOutput, K key, V value, String baseOutputPath)

Now, if a map task runs for a long time, then (because speculative execution is enabled) Hadoop starts a parallel attempt of the same task to finish it early. Both attempts then try to write to the same file in the same directory; the second attempt fails with a "File already exists" exception, and so does the job.

After analyzing this, we found that, unlike the default context writer, *the MultipleOutputs API does not create any temporary directory*. It starts writing directly into the output directory. The reason is that the FileOutputCommitter used by the default context writer (and hence by the Application Master) is separate from the MultipleOutputs writer, so in the MultipleOutputs case none of the FileOutputCommitter methods ever get called for these files.

So is this a known issue or the default behavior? And what is the solution to this problem?
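For reference, here is a minimal sketch of the mapper-side usage described above (the class name, the named output "events", and the absolute output path are illustrative placeholders, not our actual job):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiOutputMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The write(namedOutput, key, value, baseOutputPath) overload quoted
        // above. With an absolute baseOutputPath, the file ends up directly
        // in the target directory rather than in a task-attempt temporary
        // directory (per the behavior described above), so two speculative
        // attempts of the same task collide on the same file.
        mos.write("events", NullWritable.get(), value, "/data/other/dir/events");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

// In the driver, the named output has to be declared, e.g.:
// MultipleOutputs.addNamedOutput(job, "events", TextOutputFormat.class,
//                                NullWritable.class, Text.class);

Under speculative execution, two attempts of the same task both run this map() and race to create the identical file under /data/other/dir.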
Regards,
Ashish