Thanks Gopal for the reply. Below is how I use MultipleOutputs in my reducer code:

multipleOutputs.write(namedOutput, NullWritable.get(), value, path);

where the path changes for different payloads, partitioning the data into the correct directory that Hive reads later. Can you please help me identify the commitPending() -> canCommit() flow in the case of MultipleOutputs? As I understand it, MultipleOutputs creates multiple writers from the given OutputFormat but does not use an OutputCommitter.

On Fri, Aug 18, 2017 at 12:42 AM, Gopal Vijayaraghavan <[email protected]> wrote:

>
> > Here the files with the same name will be overwritten by the retry
> > attempt and it will guarantee correct result from a successful job.
>
> I think your patch might fix your problem, but it fails silently when two
> processes try to write the same file, which isn't supposed to happen (but
> you'll end up introducing the possibility, without any errors).
>
> The MultipleOutputs should be safe to use without an overwrite, because
> the operations involve a commitPending() -> canCommit() step, which
> resolves race conditions between the speculated tasks.
>
> Unless you're using the broken S3 committer, I think that cannot happen -
> if it is causing trouble for some reason, you might want to explain and I
> can help with the MR job.
>
> The directory renames happen from Attempt -> Task -> Job, so a failed
> attempt should not be able to get a file into the final output in any way.
>
> Cheers,
> Gopal
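To make the commitPending() -> canCommit() step above concrete, here is a minimal pure-Java sketch of the arbitration idea (this is my own illustration with invented names, not the actual Hadoop application-master code, and it omits the umbilical RPC layer): each finished attempt asks for permission to commit, and an atomic per-task grant ensures exactly one of two speculated attempts wins the race.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch: the "arbiter" records, per task, which attempt
// has been granted permission to commit. The first attempt that asks
// after signalling commitPending() wins; the speculative duplicate is
// denied, so its attempt directory is never promoted to task output.
class CommitArbiter {
    private final ConcurrentMap<String, String> grantedAttempt =
            new ConcurrentHashMap<>();

    // Answers an attempt's canCommit() query. putIfAbsent() makes the
    // grant atomic, so at most one attempt per task is ever told "yes";
    // the winner asking again still gets "yes" (idempotent retries).
    boolean canCommit(String taskId, String attemptId) {
        String winner = grantedAttempt.putIfAbsent(taskId, attemptId);
        return winner == null || winner.equals(attemptId);
    }
}
```

For example, if attempt_0 and attempt_1 of the same task both finish, whichever calls canCommit() first gets true and renames its attempt directory up to the task level; the other gets false and its output is discarded, which is why the same-named files never need an overwrite in the non-failure path.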
