Hi Alan,

I believe this is expected behavior with EMR and S3. A file path must not already exist in S3 before a task commits to it; in your case that's bucket n2ygk, path reduced.1/useful, file part-m-00009. To mitigate hanging tasks, EMR may spawn duplicate attempts of the same task (distinguished by a trailing _0, _1, etc.), and this turns into a race between those attempts (_0, _1, etc.) committing to the same bucket/path in S3: a killed attempt that had already started writing part-m-00009 leaves that file behind, and the retry then fails on "File already exists".
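If the resubmitted job keeps tripping over that leftover part file, one thing you could try is clearing the output path at the top of the script with rmf, which doesn't complain if the path isn't there. This is a rough sketch only: the output path is taken from your error message, while the LOAD/FILTER lines are made-up placeholders for your real aliases.

    -- clear any partial output left behind by a killed attempt (path from your error message)
    rmf s3n://n2ygk/reduced.1/useful;

    -- placeholder pipeline; substitute your real aliases
    raw = LOAD 's3n://n2ygk/input' AS (line:chararray);
    useful = FILTER raw BY line IS NOT NULL;
    STORE useful INTO 's3n://n2ygk/reduced.1/useful';

Note this only helps when you resubmit the whole script; it doesn't do anything about the attempt-level race itself.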
In addition, you may also want to adjust the task timeout from the default 600s: raise it so long-running tasks time out less often, or lower it to fail faster (I think the lower bound is 60000 ms). I've had jobs which required a *two hour* timeout in order to succeed. This can be done with a bootstrap action, e.g.:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args -m,mapred.task.timeout=2400000

As for cleaning up the intermediate steps, I'm not sure. You could try adding EXEC <https://pig.apache.org/docs/r0.11.1/cmds.html#exec> breakpoints before the problem blocks (rough sketch at the bottom of this mail), but that weakens Pig's job chaining and the execution time will grow.

Hope this helps.

-Dan

On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[email protected]> wrote:

> Is this expected behavior or improper error recovery:
>
> *Task attempt_201306130117_0001_m_000009_0 failed to report status for
> 602 seconds. Killing!*
>
> This was then followed by the retries of the task failing due to the
> existence of the S3 output file that the dead task had started writing:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable
> to setup the store function.*
> *...*
> *Caused by: java.io.IOException: File already
> exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
>
> Seems like this is exactly the kind of task restart that should "just
> work" if the garbage from the failed task were properly cleaned up.
>
> Is there a way to tell Pig to just clobber output files?
>
> Is there a technique for checkpointing Pig scripts so that I don't have
> to keep resubmitting this job and losing hours of work? I was even doing
> "STORE" of intermediate aliases so I could restart later, but the job
> failure causes the intermediate files to be deleted from S3.
>
> Thanks.
> /a
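P.S. Here's roughly what I mean by an EXEC breakpoint, as a sketch only: the aliases and paths below are made up, and the syntax is from memory, so please double-check against the docs linked above. As I recall, a bare exec forces everything defined above it to run to completion before Pig plans the rest of the script, which is exactly what weakens the job chaining.

    -- first block: the work you want to survive a later failure
    raw = LOAD 's3n://n2ygk/input' AS (line:chararray);
    useful = FILTER raw BY line IS NOT NULL;
    STORE useful INTO 's3n://n2ygk/checkpoint/useful';

    exec;  -- breakpoint: the STORE above must finish before anything below starts

    -- second block: reload the checkpoint so a failure here doesn't redo the work above
    useful2 = LOAD 's3n://n2ygk/checkpoint/useful' AS (line:chararray);
    STORE useful2 INTO 's3n://n2ygk/reduced.1/useful';

Since the first block's STORE has already committed by the time the second block runs, you should be able to restart from the checkpoint rather than from the original input if the second block fails.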
