Re: Any reason a bunch of nearly-identical jobs would suddenly stop working?

Kris Coward Wed, 09 Mar 2011 14:29:47 -0800

Also, reading some uncompressed data off the same cluster using
PigStorage shows a failure to even read the data in the first place :|
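
For triage, here is a minimal Python sketch that pulls only the ERROR entries out of Pig console output like that quoted below. The function name `extract_errors` and the regular expression are illustrative assumptions about the log line layout (unwrapped, one entry per line), not anything Pig itself provides. The console output only says that the job failed, so the underlying cause would most likely still have to be found in the task logs.

```python
import re

# Assumed layout of one unwrapped Pig console line:
#   <date> <time>,<millis> [<thread>] <LEVEL> <logger> - <message>
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[([^\]]+)\] (\w+)\s+(\S+) - (.*)$"
)


def extract_errors(lines):
    """Return (timestamp, logger, message) for each ERROR-level line."""
    errors = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(3) == "ERROR":
            timestamp, _thread, _level, logger, message = match.groups()
            errors.append((timestamp, logger, message))
    return errors


if __name__ == "__main__":
    # Sample lines taken from the console output in this thread,
    # with the mail client's line wrapping undone.
    sample = [
        "2011-03-08 21:50:17,612 [main] INFO org.apache.pig.backend.hadoop."
        "executionengine.mapReduceLayer.MapReduceLauncher - 100% complete",
        "2011-03-08 21:50:17,612 [main] ERROR org.apache.pig.backend.hadoop."
        "executionengine.mapReduceLayer.MapReduceLauncher - "
        "1 map reduce job(s) failed!",
        "2011-03-08 21:50:17,668 [main] ERROR org.apache.pig.tools.grunt.Grunt"
        " - ERROR 1066: Unable to open iterator for alias apa",
    ]
    for timestamp, logger, message in extract_errors(sample):
        print(timestamp, logger.rsplit(".", 1)[-1], message)
```

Lines that do not match the assumed layout (stack-trace frames, wrapped continuations) are skipped rather than guessed at.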


-K

On Tue, Mar 08, 2011 at 09:24:18PM -0500, Kris Coward wrote:
> 
> None of the nodes have more than 20% utilization on any of their disks;
> so it must be the cluster figuring that it can get away with this sort
> of thing when the sysadmin's not around to set it straight.. clearly a
> cluster of redundant/load-sharing sysadmins is also needed :)
> 
> -K
> 
> On Tue, Mar 08, 2011 at 03:24:50PM -0800, Dmitriy Ryaboy wrote:
> > Check task logs. I am guessing you ran out of either hdfs or local disk on
> > the nodes.
> > 
> > Also, never let your sysadmin go on vacation, that's what makes things
> > break! :)
> > 
> > D
> > 
> > On Tue, Mar 8, 2011 at 2:53 PM, Kris Coward <[email protected]> wrote:
> > 
> > >
> > > So I queued up a batch of jobs last night to run overnight (and into the
> > > > day a bit, owing to a bottleneck on the scheduler the way that things
> > > are currently implemented), made sure they were running correctly, went
> > > to sleep, and when I woke up in the morning, they were failing all over
> > > the place.
> > >
> > > > Since each of these jobs was basically the same pig script being run with
> > > > a different set of parameters, I tried re-running it with the
> > > parameters that it had run (successfully) with the night before, and it
> > > also failed. So I started whittling away at steps to try and find the
> > > origin of the failure, until I was even getting a failure loading the
> > > initial data, and dumping it out. Basically, I've reduced things to a
> > > matter of
> > >
> > > apa = LOAD
> > > '/rawfiles/08556ecf5c6841d59eb702e9762e649a/{1296432000,1296435600,1296439200,1296442800,1296446400,1296450000,1296453600,1296457200,1296460800,1296464400,1296468000,1296471600,1296475200,1296478800,1296482400,1296486000,1296489600,1296493200,1296496800,1296500400,1296504000,1296507600,1296511200,1296514800}/*/apa'
> > > USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader(',') AS
> > > (timestamp:long, type:chararray, appkey:chararray, uid:chararray,
> > > uniq:chararray, shortUniq:chararray, profUid:chararray, addr:chararray,
> > > ref:chararray);
> > > dump apa;
> > >
> > > and after getting all the happy messages from the loader like:
> > >
> > > 2011-03-08 21:48:46,454 [Thread-12] INFO
> > > com.twitter.elephantbird.pig.load.LzoBaseLoadFunc - Got 117 LZO slices in
> > > total.
> > > 2011-03-08 21:48:48,044 [main] INFO
> > > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - 0% complete
> > > 2011-03-08 21:50:17,612 [main] INFO
> > > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - 100% complete
> > >
> > > It went straight to:
> > >
> > > 2011-03-08 21:50:17,612 [main] ERROR
> > > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - 1 map reduce job(s) failed!
> > > 2011-03-08 21:50:17,662 [main] ERROR
> > > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - Failed to produce result in:
> > > "hdfs://master.hadoop:9000/tmp/temp-2121884028/tmp-268519128"
> > > 2011-03-08 21:50:17,664 [main] INFO
> > > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - Failed!
> > > 2011-03-08 21:50:17,668 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > > ERROR 1066: Unable to open iterator for alias apa
> > > Details at logfile: /home/kris/pig_1299620898192.log
> > >
> > > And looking at the stack trace in the logfile, I've got:
> > >
> > > Pig Stack Trace
> > > ---------------
> > > ERROR 1066: Unable to open iterator for alias apa
> > >
> > > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to
> > > open iterator for alias apa
> > >        at org.apache.pig.PigServer.openIterator(PigServer.java:482)
> > >        at
> > > org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
> > >        at
> > > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
> > >        at
> > > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
> > >        at
> > > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> > >        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
> > >        at org.apache.pig.Main.main(Main.java:352)
> > > Caused by: java.io.IOException: Job terminated with anomalous status 
> > > FAILED
> > >        at org.apache.pig.PigServer.openIterator(PigServer.java:476)
> > >        ... 6 more
> > >
> > > ================================================================================
> > >
> > > My sysadmin's off on vacation for the week, but left information on the
> > > scripts to restart the cluster, so I tried that, and the problem is
> > > still persisting, so I was hoping someone here might have an idea what's
> > > wrong (and how to fix it).
> > >
> > > Thanks,
> > > Kris
> > >
> > > --
> > > Kris Coward                                     http://unripe.melon.org/
> > > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> > >

-- 
Kris Coward                                     http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
