Ah, I see: you were running in yarn-cluster mode, so the logs are the same. Glad you figured it out.
2014-10-02 10:31 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:

> So, I actually figured it out, and it's all my fault. I had an older
> version of Spark on the datanodes and was passing in
> spark.executor.extraClassPath to pick it up. It was a holdover from some
> initial work before I got everything working right. Once I removed that,
> it picked up the Spark JAR from HDFS instead and ran without issue.
>
> Sorry for the false alarm.
>
> The AM container logs were what I had pasted in the original email, btw.
>
> Greg
>
> From: Andrew Or <and...@databricks.com>
> Date: Thursday, October 2, 2014 12:24 PM
> To: Greg <greg.h...@rackspace.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: weird YARN errors on new Spark on Yarn cluster
>
> Hi Greg,
>
> Have you looked at the AM container logs? (You may already know this,
> but) you can get these through the RM web UI or through:
>
> yarn logs -applicationId <your app ID>
>
> If an AM throws an exception then the executors may not be started
> properly.
>
> -Andrew
>
>
> 2014-10-02 9:47 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:
>
>> I haven't run into this until today. I spun up a fresh cluster to do
>> some more testing, and it seems that every single executor fails
>> because it can't connect to the driver. This is in the YARN logs:
>>
>> 14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend:
>> Connecting to driver:
>> akka.tcp://sparkDriver@GATEWAY-1:60855/user/CoarseGrainedScheduler
>> 14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver
>> Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] ->
>> [akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.
>>
>> And this is what shows up from the driver:
>>
>> 14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered
>> executor:
>> Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113]
>> with ID 2
>> 14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to
>> /rack/node8da83a04def73517bf437e95aeefa2469b1daf14
>> 14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2
>> disconnected, so removing it
>>
>> It doesn't appear to be a networking issue. Networking works in both
>> directions and there's no firewall blocking ports. Googling the issue,
>> it sounds like the most common problem is overallocation of memory, but
>> I'm not doing that. I've got these settings for a 3-node cluster with
>> 128 GB per node:
>>
>> spark.executor.instances 17
>> spark.executor.memory 12424m
>> spark.yarn.executor.memoryOverhead 3549
>>
>> That makes 6 * 15973 = 95838 MB per node, which is well beneath the
>> 128 GB limit.
>>
>> Frankly, I'm stumped. It worked fine when I spun up a cluster last
>> week, but now it doesn't. The logs give me no indication as to what the
>> problem actually is. Any pointers to where else I might look?
>>
>> Thanks in advance.
>>
>> Greg
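A note for anyone who hits the same symptoms: spark.executor.extraClassPath
entries are prepended to the executor classpath, so a stale Spark build left
on the datanodes can shadow the assembly JAR that YARN localizes from HDFS,
and a version mismatch like that can surface as exactly the driver/executor
disassociation seen above. A minimal before/after sketch, with hypothetical
paths (the actual paths are not given in the thread):

    # spark-defaults.conf
    # The leftover setting (hypothetical path to the old build):
    spark.executor.extraClassPath  /opt/old-spark/lib/spark-assembly.jar

    # With that line removed, executors use the assembly YARN ships from
    # HDFS, e.g. via the Spark 1.x setting (hypothetical location):
    spark.yarn.jar  hdfs:///apps/spark/spark-assembly-1.1.0-hadoop2.4.0.jar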
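For completeness, a concrete form of the yarn logs command Andrew gives,
using a made-up application ID (real IDs follow the pattern
application_<cluster-timestamp>_<sequence> and are shown in the RM web UI):

    yarn logs -applicationId application_1412268568123_0042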
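On the memory math: the per-node arithmetic in the original mail checks out,
as the worked numbers below show, but YARN schedules containers against
yarn.nodemanager.resource.memory-mb on each NodeManager rather than physical
RAM, so that setting is the ceiling worth double-checking:

    per-executor container:  12424 MB + 3549 MB = 15973 MB
    executors per node:      17 executors / 3 nodes ~= 6
    per-node total:          6 * 15973 MB = 95838 MB (< 131072 MB = 128 GB)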