Thanks, Marcelo. I'm not seeing anything in the logs that clearly explains
what's causing this to break.

One interesting point we just discovered: if we run the driver and the
slave (worker) on the same host, the job runs; but if we run the driver on
a separate host, it does not.
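
That makes me suspect the executors can't reach back to the driver when the
driver is on a different host. One thing we're going to try (just a sketch,
not our actual code; the hostnames and port are copied from the logs below)
is pinning the address and port the driver advertises to executors:

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: fix the address/port the driver advertises so that executors
  // launched on other hosts can connect back to the CoarseGrainedScheduler.
  val conf = new SparkConf()
    .setAppName("Spark Pi")
    .setMaster("spark://ip-10-202-9-195.ec2.internal:7077")
    .set("spark.driver.host", "ip-10-202-11-191.ec2.internal") // must resolve from the worker hosts
    .set("spark.driver.port", "47740")                         // fixed port, easier to open in security groups
  val sc = new SparkContext(conf)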

Anyway, this is all I see on the worker:

14/07/16 19:32:27 INFO Worker: Asked to launch executor
app-20140716193227-0000/0 for Spark Pi

14/07/16 19:32:27 WARN CommandUtils: SPARK_JAVA_OPTS was set on the worker.
It is deprecated in Spark 1.0.

14/07/16 19:32:27 WARN CommandUtils: Set SPARK_LOCAL_DIRS for node-specific
storage locations.

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

14/07/16 19:32:27 INFO ExecutorRunner: Launch command: "/cask/jdk/bin/java"
"-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
"-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler"
"0" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140716193227-0000"


And on the driver I see this:

14/07/16 19:32:26 INFO SparkContext: Added JAR
file:/cask/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar at
http://10.202.11.191:39642/jars/spark-examples-1.0.0-hadoop2.2.0.jar with
timestamp 1405539146752

14/07/16 19:32:26 INFO AppClient$ClientActor: Connecting to master
spark://ip-10-202-9-195.ec2.internal:7077...

14/07/16 19:32:26 INFO SparkContext: Starting job: reduce at
SparkPi.scala:35

14/07/16 19:32:26 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35)
with 2 output partitions (allowLocal=false)

14/07/16 19:32:26 INFO DAGScheduler: Final stage: Stage 0(reduce at
SparkPi.scala:35)

14/07/16 19:32:26 INFO DAGScheduler: Parents of final stage: List()

14/07/16 19:32:26 INFO DAGScheduler: Missing parents: List()

14/07/16 19:32:26 DEBUG DAGScheduler: submitStage(Stage 0)

14/07/16 19:32:26 DEBUG DAGScheduler: missing: List()

14/07/16 19:32:26 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at
map at SparkPi.scala:31), which has no missing parents

14/07/16 19:32:26 DEBUG DAGScheduler: submitMissingTasks(Stage 0)

14/07/16 19:32:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage
0 (MappedRDD[1] at map at SparkPi.scala:31)

14/07/16 19:32:26 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(0,
0), ResultTask(0, 1))

14/07/16 19:32:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

14/07/16 19:32:27 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0

14/07/16 19:32:27 DEBUG TaskSetManager: Valid locality levels for TaskSet
0.0: ANY

14/07/16 19:32:27 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0,
runningTasks: 0

14/07/16 19:32:27 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20140716193227-0000

14/07/16 19:32:27 INFO AppClient$ClientActor: Executor added:
app-20140716193227-0000/0 on
worker-20140716193059-ip-10-202-8-45.ec2.internal-7101
(ip-10-202-8-45.ec2.internal:7101) with 8 cores

14/07/16 19:32:27 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140716193227-0000/0 on hostPort ip-10-202-8-45.ec2.internal:7101 with
8 cores, 512.0 MB RAM

14/07/16 19:32:27 INFO AppClient$ClientActor: Executor updated:
app-20140716193227-0000/0 is now RUNNING


If I wait long enough and see several "Initial job has not accepted any
resources" messages on the driver, this shows up in the worker:

14/07/16 19:34:09 INFO Worker: Executor app-20140716193227-0000/0 finished
with state FAILED message Command exited with code 1 exitStatus 1

14/07/16 19:34:09 INFO Worker: Asked to launch executor
app-20140716193227-0000/1 for Spark Pi

14/07/16 19:34:09 WARN CommandUtils: SPARK_JAVA_OPTS was set on the worker.
It is deprecated in Spark 1.0.

14/07/16 19:34:09 WARN CommandUtils: Set SPARK_LOCAL_DIRS for node-specific
storage locations.

14/07/16 19:34:09 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.202.8.45%3A46568-2#593829151]
was not delivered. [1] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

14/07/16 19:34:10 INFO ExecutorRunner: Launch command: "/cask/jdk/bin/java"
"-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
"-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler"
"1" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140716193227-0000"


Matt


On Tue, Jul 15, 2014 at 5:47 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Have you looked at the slave machine to see if the process has
> actually launched? If it has, have you tried peeking into its log
> file?
>
> (That error is printed whenever the executors fail to report back to
> the driver. Insufficient resources to launch the executor is the most
> common cause of that, but not the only one.)
>
> On Tue, Jul 15, 2014 at 2:43 PM, Matt Work Coarr
> <mattcoarr.w...@gmail.com> wrote:
> > Hello spark folks,
> >
> > I have a simple Spark cluster setup, but I can't get jobs to run on it.
> > I am using the standalone mode.
> >
> > One master, one slave.  Both machines have 32GB of RAM and 8 cores.
> >
> > The slave is setup with one worker that has 8 cores and 24GB memory
> > allocated.
> >
> > My application requires 2 cores and 5GB of memory.
> >
> > However, I'm getting the following error:
> >
> > WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> > your cluster UI to ensure that workers are registered and have sufficient
> > memory
> >
> >
> > What else should I check for?
> >
> > This is a simplified setup (the real cluster has 20 nodes).  In this
> > simplified setup I am running the master and the slave manually.  The
> > master's web page shows the worker and it shows the application and the
> > memory/core requirements match what I mentioned above.
> >
> > I also tried running the SparkPi example via bin/run-example and get the
> > same result.  It requires 8 cores and 512MB of memory, which is also
> > clearly within the limits of the available worker.
> >
> > Any ideas would be greatly appreciated!!
> >
> > Matt
>
>
>
> --
> Marcelo
>
