One thing I've done regarding timeouts is to insert prints to STDERR more often in my UDF. If I recall correctly, this takes care of the timeout problem.
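
For example, a minimal sketch (the UDF name NoisyUpper and the 10,000-record
interval are made up; the periodic print to STDERR is the point):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: does its real work in exec(), but also prints a
    // heartbeat to STDERR every 10,000 records so a long-running task keeps
    // producing visible output in its stderr log.
    public class NoisyUpper extends EvalFunc<String> {
        private long seen = 0;

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            if (++seen % 10000 == 0) {
                System.err.println("NoisyUpper: processed " + seen + " records");
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

I think you can also nudge the framework directly with
PigStatusReporter.getInstance().progress() from inside exec(), but the STDERR
prints alone have been enough for me.
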
On Thu, Jun 13, 2013 at 11:37 AM, Alan Crosswell <[email protected]> wrote:

> Thanks for the suggestion, Cheolsoo.
> /a
>
> On Thu, Jun 13, 2013 at 2:18 PM, Cheolsoo Park <[email protected]> wrote:
>
> > Hi Alan,
> >
> > >> Seems like this is exactly the kind of task restart that should "just
> > >> work" if the garbage from the failed task were properly cleaned up.
> >
> > Unfortunately, this is not the case because of S3 eventual consistency.
> > Even though the failed task cleans up its files on S3, the delete is not
> > immediately propagated, so the next task attempt may still see them and
> > fail. As far as I know, EMR Pig/S3 integration is not as good as EMR
> > Hive/S3 integration, so you will have to handle S3 eventual consistency
> > yourself in Pig.
> >
> > One workaround is to write a StoreFunc that stages data to HDFS until the
> > task completes and then copies it to S3 at the commit-task step. This
> > will minimize the number of S3 eventual consistency issues you see.
> >
> > Thanks,
> > Cheolsoo
> >
> > On Thu, Jun 13, 2013 at 7:40 AM, Alan Crosswell <[email protected]> wrote:
> >
> > > The file did not exist until the first task attempt created it before
> > > it was killed. As such, the subsequent task attempts were guaranteed to
> > > fail, since the killed task's output file had not been cleaned up. So
> > > when I launched the Pig script, there was no file in the way.
> > >
> > > I'll take a look at upping the timeout.
> > >
> > > Thanks.
> > >
> > > On Thu, Jun 13, 2013 at 9:57 AM, Dan DeCapria, CivicScience <
> > > [email protected]> wrote:
> > >
> > > > Hi Alan,
> > > >
> > > > I believe this is expected behavior with respect to EMR and S3. A
> > > > duplicate file path cannot exist in S3 prior to commit; in your case
> > > > it looks like bucket: n2ygk, path: reduced.1/useful/part-m-00009*. On
> > > > EMR, to mitigate hanging tasks, a given job may spawn duplicate tasks
> > > > (referenced by a trailing _0, _1, etc.). This then becomes a race
> > > > condition between those duplicate tasks committing to the same
> > > > bucket/path in S3.
> > > >
> > > > In addition, you may also consider increasing the task timeout from
> > > > 600s to something higher (or lowering it to time out sooner; I think
> > > > the lowest bound is 60000 ms). I've had jobs which required a *two
> > > > hour* timeout in order to succeed. This can be done with a bootstrap
> > > > action, e.g.:
> > > >
> > > >   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
> > > >   --args -m,mapred.task.timeout=2400000
> > > >
> > > > As for cleaning up intermediate steps, I'm not sure. You could try
> > > > placing EXEC <https://pig.apache.org/docs/r0.11.1/cmds.html#exec>
> > > > breakpoints before the problem blocks, but this will weaken Pig's job
> > > > chaining and increase execution time.
> > > >
> > > > Hope this helps.
> > > >
> > > > -Dan
> > > >
> > > > On Wed, Jun 12, 2013 at 11:21 PM, Alan Crosswell <[email protected]>
> > > > wrote:
> > > >
> > > > > Is this expected behavior or improper error recovery:
> > > > >
> > > > > *Task attempt_201306130117_0001_m_000009_0 failed to report status
> > > > > for 602 seconds. Killing!*
> > > > >
> > > > > This was then followed by the retries of the task failing due to
> > > > > the existence of the S3 output file that the dead task had started
> > > > > writing:
> > > > >
> > > > > *org.apache.pig.backend.executionengine.ExecException: ERROR 2081:
> > > > > Unable to setup the store function.*
> > > > > *...*
> > > > > *Caused by: java.io.IOException: File already
> > > > > exists:s3n://n2ygk/reduced.1/useful/part-m-00009*
> > > > >
> > > > > Seems like this is exactly the kind of task restart that should
> > > > > "just work" if the garbage from the failed task were properly
> > > > > cleaned up.
> > > > >
> > > > > Is there a way to tell Pig to just clobber output files?
> > > > >
> > > > > Is there a technique for checkpointing Pig scripts so that I don't
> > > > > have to keep resubmitting this job and losing hours of work? I was
> > > > > even doing "STORE" of intermediate aliases so I could restart
> > > > > later, but the job failure causes the intermediate files to be
> > > > > deleted from S3.
> > > > >
> > > > > Thanks.
> > > > > /a

--
Russell Jurney  twitter.com/rjurney  [email protected]  datasyndrome.com
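
P.S. On Cheolsoo's suggestion (quoted above) of staging output on HDFS and
only copying it to S3 once the task commits: here is a rough sketch of just
the copy step, with made-up paths and class name. A real StoreFunc would hook
this into its output committer's commitTask()/commitJob(), which is more
involved; this only shows the HDFS-to-S3 copy itself.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: part files are written to an HDFS staging
    // directory first; only after the task has finished successfully is the
    // staged output copied to S3, so a killed attempt never leaves a partial
    // part file behind on S3.
    public class StageThenCopyToS3 {

        public static void copyStagedOutput(Configuration conf,
                                            String hdfsStagingDir,
                                            String s3Target) throws IOException {
            Path src = new Path(hdfsStagingDir); // e.g. hdfs:///tmp/staging/useful
            Path dst = new Path(s3Target);       // e.g. s3n://n2ygk/reduced.1/useful
            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem dstFs = dst.getFileSystem(conf);

            // Copy rather than move so the HDFS copy survives as a checkpoint;
            // pass deleteSource=true instead if you want HDFS cleaned up.
            if (!FileUtil.copy(srcFs, src, dstFs, dst, false, conf)) {
                throw new IOException("Copy of " + hdfsStagingDir
                        + " to " + s3Target + " failed");
            }
        }
    }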
