Thanks Gopal for the reply. Below is how I use MultipleOutputs in my reducer code:

multipleOutputs.write(namedOutput, NullWritable.get(), value, path);

where the path changes for different payloads, partitioning the data into the correct directory that Hive reads later. Can you please help me identify the commitPending() -> canCommit() flow in the case of MultipleOutputs? As I understand it, MultipleOutputs creates multiple writers from the given OutputFormat but does not use an OutputCommitter.

On Fri, Aug 18, 2017 at 12:42 AM, Gopal Vijayaraghavan <[email protected]> wrote:

>
> > Here the files with the same name will be overwritten by the retry
> > attempt and it will guarantee correct result from a successful job.
>
> I think your patch might fix your problem, but it fails silently when two
> processes try to write the same file, which isn't supposed to happen (but
> you'll end up introducing the possibility, without any errors).
>
> The MultipleOutputs should be safe to use without an overwrite, because
> the operations involve a commitPending() -> canCommit() step, which
> resolves race conditions between the speculated tasks.
>
> Unless you're using the broken S3 committer, I think that cannot happen -
> if it is causing trouble for some reason, you might want to explain and I
> can help with the MR job.
>
> The directory renames happen from Attempt -> Task -> Job, so a failed
> attempt should not be able to get a file into the final output in any way.
>
> Cheers,
> Gopal
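To make the commitPending() -> canCommit() step above concrete, here is a minimal pure-Java sketch of the arbitration idea (this is my own illustration with invented names, not the actual Hadoop application-master code, and it omits the umbilical RPC layer): each finished attempt asks for permission to commit, and an atomic per-task grant ensures exactly one of two speculated attempts wins the race.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch: the "arbiter" records, per task, which attempt
// has been granted permission to commit. The first attempt that asks
// after signalling commitPending() wins; the speculative duplicate is
// denied, so its attempt directory is never promoted to task output.
class CommitArbiter {
    private final ConcurrentMap<String, String> grantedAttempt =
            new ConcurrentHashMap<>();

    // Answers an attempt's canCommit() query. putIfAbsent() makes the
    // grant atomic, so at most one attempt per task is ever told "yes";
    // the winner asking again still gets "yes" (idempotent retries).
    boolean canCommit(String taskId, String attemptId) {
        String winner = grantedAttempt.putIfAbsent(taskId, attemptId);
        return winner == null || winner.equals(attemptId);
    }
}
```

For example, if attempt_0 and attempt_1 of the same task both finish, whichever calls canCommit() first gets true and renames its attempt directory up to the task level; the other gets false and its output is discarded, which is why the same-named files never need an overwrite in the non-failure path.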
