Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-21 Thread Matt Work Coarr
I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 1-65535.  I'm not
sure if there's a more specific range I could have used, but so far,
everything is running!
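
In case it helps anyone else hitting this, the rule we added was roughly
equivalent to the following (the group ID and subnet CIDR here are
placeholders, not our actual values):

  aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxx \
    --protocol tcp \
    --port 1-65535 \
    --cidr 10.202.0.0/16

A tighter alternative would be to authorize the security group itself as
the source (--source-group) instead of a whole subnet CIDR.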

Thanks for all the responses Marcelo and Andrew!!

Matt


On Thu, Jul 17, 2014 at 9:10 PM, Andrew Or and...@databricks.com wrote:

 Hi Matt,

 The security group shouldn't be an issue; the ports listed in
 `spark_ec2.py` are only for communication with the outside world.

 How did you launch your application? I notice you did not launch your
 driver from your Master node. What happens if you did? Another thing is
 that there seems to be some inconsistency or missing pieces in the logs you
 posted. After an executor says driver disassociated, what happens in the
 driver logs? Is an exception thrown or something?

 It would be useful if you could also post your conf/spark-env.sh.

 Andrew


 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin van...@cloudera.com:

 Hi Matt,

 I'm not very familiar with setup on ec2; the closest I can point you
 at is to look at the launch_cluster in ec2/spark_ec2.py, where the
 ports seem to be configured.


 On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
 mattcoarr.w...@gmail.com wrote:
  Thanks Marcelo!  This is a huge help!!
 
  Looking at the executor logs (in a vanilla spark install, I'm finding
 them
  in $SPARK_HOME/work/*)...
 
  It launches the executor, but it looks like the
 CoarseGrainedExecutorBackend
  is having trouble talking to the driver (exactly what you said!!!).
 
  Do you know the range of random ports that is used for the
  executor-to-driver communication?  Is that range adjustable?  Is there a
  config setting or environment variable for it?
 
  I manually set up my EC2 security group to include all the ports that the
  spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
  groups.  They included (for those listed above 1):
  1
  50060
  50070
  50075
  60060
  60070
  60075
 
  Obviously I'll need to make some adjustments to my EC2 security group!
  Just
  need to figure out exactly what should be in there.  To keep things
 simple,
  I just have one security group for the master, slaves, and the driver
  machine.
 
  In listing the port ranges in my current security group I looked at the
  ports that spark_ec2.py sets up as well as the ports listed in the
 spark
  standalone mode documentation page under configuring ports for network
  security:
 
  http://spark.apache.org/docs/latest/spark-standalone.html
 
 
  Here are the relevant fragments from the executor log:
 
  Spark Executor Command: /cask/jdk/bin/java -cp
  ::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar
  -XX:MaxPermSize=128m -Dspark.akka.frameSize=100 -Dspark.akka.frameSize=100
  -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
  0 ip-10-202-8-45.ec2.internal 8
  akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
  app-20140717195146-
 
  
 
  ...
 
  14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the
 custom-built
  native-hadoop library...
 
  14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop
 with
  error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
 
  14/07/17 19:51:47 DEBUG NativeCodeLoader:
 
 java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
 
  14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
  library for your platform... using builtin-java classes where applicable
 
  14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling
 back
  to shell based
 
  14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
 mapping
  impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
 
  14/07/17 19:51:48 DEBUG Groups: Group mapping
  impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
  cacheTimeout=30
 
  14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
 
  ...
 
 
  14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
  akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
 
  14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
  akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
 
  14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
  akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
 
  14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
  [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
  [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
  Shutting down.
 
 
  Thanks a bunch!
  Matt

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
 Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
 what's causing this to break.

 One interesting point that we just discovered is that if we run the driver
 and the slave (worker) on the same host, the job runs, but if we run the
 driver on a separate host, it does not.

When I mentioned the executor log, I meant the log of the process launched
by the worker, not the worker's own log. In my CDH-based Spark install, those
end up in /var/run/spark/work.

If you look at your worker log, you'll see it's launching the executor
process. So there should be something there.

Since you say it works when both are run in the same node, that
probably points to some communication issue, since the executor needs
to connect back to the driver. Check to see if you don't have any
firewalls blocking the ports Spark tries to use. (That's one of the
non-resource-related cases that will cause that message.)
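
For example, once you know which host and port the executor is trying to
reach (they show up in the executor's launch command as
akka.tcp://spark@<driver-host>:<driver-port>/...), a quick reachability
check from the worker node is something like:

  nc -vz <driver-host> <driver-port>

A timeout or "connection refused" there usually means a firewall or
security-group rule is in the way. (The nc check is just a sketch; the
host and port placeholders come from your own logs.)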

-- 
Marcelo


Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Matt Work Coarr
Thanks Marcelo!  This is a huge help!!

Looking at the executor logs (in a vanilla spark install, I'm finding them
in $SPARK_HOME/work/*)...

It launches the executor, but it looks like the
CoarseGrainedExecutorBackend is having trouble talking to the driver
(exactly what you said!!!).

Do you know the range of random ports that is used for the
executor-to-driver communication?  Is that range adjustable?  Is there a
config setting or environment variable for it?

I manually set up my EC2 security group to include all the ports that the
spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
groups.  They included (for those listed above 1):
1
50060
50070
50075
60060
60070
60075

Obviously I'll need to make some adjustments to my EC2 security group!
 Just need to figure out exactly what should be in there.  To keep things
simple, I just have one security group for the master, slaves, and the
driver machine.

In listing the port ranges in my current security group I looked at the
ports that spark_ec2.py sets up as well as the ports listed in the spark
standalone mode documentation page under configuring ports for network
security:

http://spark.apache.org/docs/latest/spark-standalone.html
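
(For what it's worth, that page also describes pinning some of the
otherwise-random ports via Spark properties, roughly like the following in
conf/spark-defaults.conf. I haven't verified which of these properties
exist in 1.0.0, so treat this as a sketch:

  spark.driver.port        51000
  spark.blockManager.port  51010
)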


Here are the relevant fragments from the executor log:

Spark Executor Command: /cask/jdk/bin/java -cp
::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar
-XX:MaxPermSize=128m -Dspark.akka.frameSize=100 -Dspark.akka.frameSize=100
-Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
0 ip-10-202-8-45.ec2.internal 8
akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
app-20140717195146-


...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path

14/07/17 19:51:47 DEBUG NativeCodeLoader:
java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
to shell based

14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping

14/07/17 19:51:48 DEBUG Groups: Group mapping
impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
cacheTimeout=30

14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user

...


14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler

14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
[akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
Shutting down.


Thanks a bunch!
Matt


On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin van...@cloudera.com wrote:

 When I mentioned the executor log, I meant the log of the process launched
 by the worker, not the worker's own log. In my CDH-based Spark install, those
 end up in /var/run/spark/work.

 If you look at your worker log, you'll see it's launching the executor
 process. So there should be something there.

 Since you say it works when both are run in the same node, that
 probably points to some communication issue, since the executor needs
 to connect back to the driver. Check to see if you don't have any
 firewalls blocking the ports Spark tries to use. (That's one of the
 non-resource-related cases that will cause that message.)



Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
Hi Matt,

I'm not very familiar with setup on ec2; the closest I can point you
at is to look at the launch_cluster in ec2/spark_ec2.py, where the
ports seem to be configured.
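
A quick way to see exactly which rules that script creates is to grep for
the authorize calls, e.g.:

  grep -n "authorize" ec2/spark_ec2.py

(Just a suggestion for locating the relevant lines; the call sites move
around between releases.)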


On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
 Thanks Marcelo!  This is a huge help!!

 Looking at the executor logs (in a vanilla spark install, I'm finding them
 in $SPARK_HOME/work/*)...

 It launches the executor, but it looks like the CoarseGrainedExecutorBackend
 is having trouble talking to the driver (exactly what you said!!!).

 Do you know the range of random ports that is used for the
 executor-to-driver communication?  Is that range adjustable?  Is there a
 config setting or environment variable for it?

 I manually set up my EC2 security group to include all the ports that the
 spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
 groups.  They included (for those listed above 1):
 1
 50060
 50070
 50075
 60060
 60070
 60075

 Obviously I'll need to make some adjustments to my EC2 security group!  Just
 need to figure out exactly what should be in there.  To keep things simple,
 I just have one security group for the master, slaves, and the driver
 machine.

 In listing the port ranges in my current security group I looked at the
 ports that spark_ec2.py sets up as well as the ports listed in the spark
 standalone mode documentation page under configuring ports for network
 security:

 http://spark.apache.org/docs/latest/spark-standalone.html


 Here are the relevant fragments from the executor log:

 Spark Executor Command: /cask/jdk/bin/java -cp
 ::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar
 -XX:MaxPermSize=128m -Dspark.akka.frameSize=100 -Dspark.akka.frameSize=100
 -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
 0 ip-10-202-8-45.ec2.internal 8
 akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
 app-20140717195146-

 

 ...

 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
 native-hadoop library...

 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
 error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path

 14/07/17 19:51:47 DEBUG NativeCodeLoader:
 java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
 to shell based

 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping
 impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping

 14/07/17 19:51:48 DEBUG Groups: Group mapping
 impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
 cacheTimeout=30

 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user

 ...


 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
 akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler

 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
 akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
 akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
 [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
 [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
 Shutting down.


 Thanks a bunch!
 Matt


 On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin van...@cloudera.com wrote:

 When I mentioned the executor log, I meant the log of the process launched
 by the worker, not the worker's own log. In my CDH-based Spark install, those
 end up in /var/run/spark/work.

 If you look at your worker log, you'll see it's launching the executor
 process. So there should be something there.

 Since you say it works when both are run in the same node, that
 probably points to some communication issue, since the executor needs
 to connect back to the driver. Check to see if you don't have any
 firewalls blocking the ports Spark tries to use. (That's one of the
 non-resource-related cases that will cause that message.)



-- 
Marcelo


Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Andrew Or
Hi Matt,

The security group shouldn't be an issue; the ports listed in
`spark_ec2.py` are only for communication with the outside world.

How did you launch your application? I notice you did not launch your
driver from your Master node. What happens if you did? Another thing is
that there seems to be some inconsistency or missing pieces in the logs you
posted. After an executor says driver disassociated, what happens in the
driver logs? Is an exception thrown or something?
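
(For example, something along these lines, run from the master node itself
and using the master URL from your logs; adjust the class and jar to
whatever you are actually submitting:

  ./bin/spark-submit \
    --master spark://ip-10-202-9-195.ec2.internal:7077 \
    --class org.apache.spark.examples.SparkPi \
    lib/spark-examples-1.0.0-hadoop2.2.0.jar 100
)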

It would be useful if you could also post your conf/spark-env.sh.
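
(If you'd rather not post the whole file, the network-related entries are
the most interesting ones; a minimal sketch of what I mean, with
placeholder values:

  SPARK_MASTER_IP=ip-10-202-9-195.ec2.internal
  SPARK_LOCAL_IP=<this host's private IP>
  SPARK_PUBLIC_DNS=<this host's public DNS, if any>
)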

Andrew


2014-07-17 14:11 GMT-07:00 Marcelo Vanzin van...@cloudera.com:

 Hi Matt,

 I'm not very familiar with setup on ec2; the closest I can point you
 at is to look at the launch_cluster in ec2/spark_ec2.py, where the
 ports seem to be configured.


 On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
 mattcoarr.w...@gmail.com wrote:
  Thanks Marcelo!  This is a huge help!!
 
  Looking at the executor logs (in a vanilla spark install, I'm finding
 them
  in $SPARK_HOME/work/*)...
 
  It launches the executor, but it looks like the
 CoarseGrainedExecutorBackend
  is having trouble talking to the driver (exactly what you said!!!).
 
  Do you know the range of random ports that is used for the
  executor-to-driver communication?  Is that range adjustable?  Is there a
  config setting or environment variable for it?
 
  I manually set up my EC2 security group to include all the ports that the
  spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
  groups.  They included (for those listed above 1):
  1
  50060
  50070
  50075
  60060
  60070
  60075
 
  Obviously I'll need to make some adjustments to my EC2 security group!
  Just
  need to figure out exactly what should be in there.  To keep things
 simple,
  I just have one security group for the master, slaves, and the driver
  machine.
 
  In listing the port ranges in my current security group I looked at the
  ports that spark_ec2.py sets up as well as the ports listed in the spark
  standalone mode documentation page under configuring ports for network
  security:
 
  http://spark.apache.org/docs/latest/spark-standalone.html
 
 
  Here are the relevant fragments from the executor log:
 
  Spark Executor Command: /cask/jdk/bin/java -cp
  ::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar
  -XX:MaxPermSize=128m -Dspark.akka.frameSize=100 -Dspark.akka.frameSize=100
  -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
  0 ip-10-202-8-45.ec2.internal 8
  akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
  app-20140717195146-
 
  
 
  ...
 
  14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
  native-hadoop library...
 
  14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop
 with
  error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
 
  14/07/17 19:51:47 DEBUG NativeCodeLoader:
 
 java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
 
  14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
  library for your platform... using builtin-java classes where applicable
 
  14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling
 back
  to shell based
 
  14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group
 mapping
  impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
 
  14/07/17 19:51:48 DEBUG Groups: Group mapping
  impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
  cacheTimeout=30
 
  14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
 
  ...
 
 
  14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
  akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
 
  14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
  akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
 
  14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
  akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
 
  14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
  [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
  [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
  Shutting down.
 
 
  Thanks a bunch!
  Matt
 
 
  On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
 
  When I mentioned the executor log, I meant the log of the process launched
  by the worker, not the worker's own log. In my CDH-based Spark install, those
  end up in /var/run/spark/work.
 
  If you look at your worker log, you'll see it's launching the executor
  process. So there should be something there.
 

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-16 Thread Matt Work Coarr
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
what's causing this to break.

One interesting point that we just discovered is that if we run the driver
and the slave (worker) on the same host, the job runs, but if we run the
driver on a separate host, it does not.

Anyways, this is all I see on the worker:

14/07/16 19:32:27 INFO Worker: Asked to launch executor
app-20140716193227-/0 for Spark Pi

14/07/16 19:32:27 WARN CommandUtils: SPARK_JAVA_OPTS was set on the worker.
It is deprecated in Spark 1.0.

14/07/16 19:32:27 WARN CommandUtils: Set SPARK_LOCAL_DIRS for node-specific
storage locations.

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

14/07/16 19:32:27 INFO ExecutorRunner: Launch command: /cask/jdk/bin/java
-cp
::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar
-XX:MaxPermSize=128m -Dspark.akka.frameSize=100
-Dspark.akka.frameSize=100 -Xms512M -Xmx512M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler
0 ip-10-202-8-45.ec2.internal 8
akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
app-20140716193227-


And on the driver I see this:

14/07/16 19:32:26 INFO SparkContext: Added JAR
file:/cask/spark/lib/spark-examples-1.0.0-hadoop2.2.0.jar at
http://10.202.11.191:39642/jars/spark-examples-1.0.0-hadoop2.2.0.jar with
timestamp 1405539146752

14/07/16 19:32:26 INFO AppClient$ClientActor: Connecting to master
spark://ip-10-202-9-195.ec2.internal:7077...

14/07/16 19:32:26 INFO SparkContext: Starting job: reduce at
SparkPi.scala:35

14/07/16 19:32:26 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35)
with 2 output partitions (allowLocal=false)

14/07/16 19:32:26 INFO DAGScheduler: Final stage: Stage 0(reduce at
SparkPi.scala:35)

14/07/16 19:32:26 INFO DAGScheduler: Parents of final stage: List()

14/07/16 19:32:26 INFO DAGScheduler: Missing parents: List()

14/07/16 19:32:26 DEBUG DAGScheduler: submitStage(Stage 0)

14/07/16 19:32:26 DEBUG DAGScheduler: missing: List()

14/07/16 19:32:26 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at
map at SparkPi.scala:31), which has no missing parents

14/07/16 19:32:26 DEBUG DAGScheduler: submitMissingTasks(Stage 0)

14/07/16 19:32:26 INFO DAGScheduler: Submitting 2 missing tasks from Stage
0 (MappedRDD[1] at map at SparkPi.scala:31)

14/07/16 19:32:26 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(0,
0), ResultTask(0, 1))

14/07/16 19:32:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

14/07/16 19:32:27 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0

14/07/16 19:32:27 DEBUG TaskSetManager: Valid locality levels for TaskSet
0.0: ANY

14/07/16 19:32:27 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0,
runningTasks: 0

14/07/16 19:32:27 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20140716193227-

14/07/16 19:32:27 INFO AppClient$ClientActor: Executor added:
app-20140716193227-/0 on
worker-20140716193059-ip-10-202-8-45.ec2.internal-7101
(ip-10-202-8-45.ec2.internal:7101) with 8 cores

14/07/16 19:32:27 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140716193227-/0 on hostPort ip-10-202-8-45.ec2.internal:7101 with
8 cores, 512.0 MB RAM

14/07/16 19:32:27 INFO AppClient$ClientActor: Executor updated:
app-20140716193227-/0 is now RUNNING


If I wait long enough and see several "Initial job has not accepted any
resources" messages on the driver, this shows up in the worker:

14/07/16 19:34:09 INFO Worker: Executor app-20140716193227-/0 finished
with state FAILED message Command exited with code 1 exitStatus 1

14/07/16 19:34:09 INFO Worker: Asked to launch executor
app-20140716193227-/1 for Spark Pi

14/07/16 19:34:09 WARN CommandUtils: SPARK_JAVA_OPTS was set on the worker.
It is deprecated in Spark 1.0.

14/07/16 19:34:09 WARN CommandUtils: Set SPARK_LOCAL_DIRS for node-specific
storage locations.

14/07/16 19:34:09 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.202.8.45%3A46568-2#593829151]
was not delivered. [1] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with

can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Matt Work Coarr
Hello spark folks,

I have a simple spark cluster setup, but I can't get jobs to run on it.  I
am using standalone mode.

One master, one slave.  Both machines have 32GB ram and 8 cores.

The slave is set up with one worker that has 8 cores and 24GB of memory
allocated.

My application requires 2 cores and 5GB of memory.
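
(That is, the equivalent of settings along these lines, however they happen
to be specified in the app; the property names here are just the standard
ones, not copied from my actual config:

  spark.cores.max        2
  spark.executor.memory  5g
)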

However, I'm getting the following error:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
memory

What else should I check for?

This is a simplified setup (the real cluster has 20 nodes).  In this
simplified setup I am running the master and the slave manually.  The
master's web page shows the worker and the application, and the memory/core
requirements match what I mentioned above.

I also tried running the SparkPi example via bin/run-example and get the
same result.  It requires 8 cores and 512MB of memory, which is also
clearly within the limits of the available worker.
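
(Roughly like this; the exact invocation may differ depending on how
run-example picks up the master URL in your version:

  MASTER=spark://<master-host>:7077 ./bin/run-example SparkPi 10
)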

Any ideas would be greatly appreciated!!

Matt


Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Marcelo Vanzin
Have you looked at the slave machine to see if the process has
actually launched? If it has, have you tried peeking into its log
file?

(That error is printed whenever the executors fail to report back to
the driver. Insufficient resources to launch the executor is the most
common cause of that, but not the only one.)
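
For example, on the slave something along these lines would show whether
the executor process actually came up and where its output went
(standalone-mode layout; adjust paths for your install):

  ps aux | grep CoarseGrainedExecutorBackend
  ls $SPARK_HOME/work/*/*/
  tail -n 50 $SPARK_HOME/work/*/*/stderr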

On Tue, Jul 15, 2014 at 2:43 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
 Hello spark folks,

 I have a simple spark cluster setup, but I can't get jobs to run on it.  I am
 using standalone mode.

 One master, one slave.  Both machines have 32GB ram and 8 cores.

 The slave is set up with one worker that has 8 cores and 24GB of memory
 allocated.

 My application requires 2 cores and 5GB of memory.

 However, I'm getting the following error:

 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
 your cluster UI to ensure that workers are registered and have sufficient
 memory


 What else should I check for?

 This is a simplified setup (the real cluster has 20 nodes).  In this
 simplified setup I am running the master and the slave manually.  The
 master's web page shows the worker and the application, and the memory/core
 requirements match what I mentioned above.

 I also tried running the SparkPi example via bin/run-example and get the
 same result.  It requires 8 cores and 512MB of memory, which is also clearly
 within the limits of the available worker.

 Any ideas would be greatly appreciated!!

 Matt



-- 
Marcelo