I've been trying for several days now to get a Spark application running in standalone mode, as described here:
http://spark.apache.org/docs/latest/spark-standalone.html

I'm using pyspark, so I've been following the example here: http://spark.apache.org/docs/0.9.1/quick-start.html#a-standalone-app-in-python

I've run Spark successfully in "local" mode using bin/pyspark, or even just by setting the SPARK_HOME environment variable and the proper PYTHONPATH, then starting up Python 2.7, importing pyspark, and creating a SparkContext object. Running in any kind of cluster mode seems to be the problem.

The StandAlone.py program in the example just reads a file and counts lines. My SparkConf looks like this:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    #conf.setMaster("spark://192.168.0.9:7077")
    conf.setMaster("spark://myhostname.domain.com:7077")
    conf.setAppName("My application")
    conf.set("spark.executor.memory", "1g")

I tried a couple of configurations:

    Config 1 ("All on one"): master is localhost, slave is localhost
    Config 2 ("Separate master and slave"): master is localhost, slave is another host

I've tried a few different machines:

    Machine 1: Mac OS 10.9 with the CDH5 Hadoop distribution, compiled with the SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 option
    Machines 2, 3: CentOS 6.4 with the CDH5 Hadoop distribution, compiled with the SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 option
    Machine 4: CentOS 6.4 with Hadoop 1.0.4 (default Spark compilation)

Here are the results I've had:

    Config 1 on Machine 1: Success
    Config 1 on Machine 2: Fail
    Config 2 on Machines 2, 3: Fail
    Config 1 on Machine 4: Fail
    Config 2 on Machines 1, 4: Fail

In the case of failure, the error is always the same:

    akka.tcp://sp...@node4.myhostname.domain.com:43717 got disassociated, removing it.
    akka.tcp://sp...@node4.myhostname.domain.com:43717 got disassociated, removing it.
    Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
    Actor[akka://sparkMaster/deadLetters] to
    Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.2.55%3A42546-2#-1875068764]
    was not delivered. [1] dead letters encountered. This logging can be turned off or
    adjusted with configuration settings 'akka.log-dead-letters' and
    'akka.log-dead-letters-during-shutdown'.
    AssociationError [akka.tcp://sparkMaster@node4:7077] ->
    [akka.tcp://sp...@node4.myhostname.domain.com:43717]: Error [Association failed with
    [akka.tcp://sp...@node4.myhostname.domain.com:43717]] [
    akka.remote.EndpointAssociationException: Association failed with
    [akka.tcp://sp...@node4.myhostname.domain.com:43717]
    Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
    Connection refused: node4.myhostname.domain.com/10.0.2.55:43717

It will then repeat this line for a while:

    parentName: , name: TaskSet_0, runningTasks: 0

and then print out this message:

    Initial job has not accepted any resources; check your cluster UI to ensure that
    workers are registered and have sufficient memory

I have turned the verbosity up to DEBUG in every log4j.properties I can find. There are no firewalls or blocked ports on the internal network.

In all configurations on all machines, when I run sbin/start-master.sh and sbin/start-slaves.sh, the respective log files always show the correct info ("I have been elected leader! New state: ALIVE" or "Successfully registered with master spark://blah-blah:7077"). The very nice UIs (on port 8080 for the master, port 8081 for the slaves) always show that everything is in order: the master host shows the workers, and the workers acknowledge they have registered with the master.
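For completeness, here is roughly the whole StandAlone.py I'm submitting (a minimal sketch; the input path is a placeholder, since the real script just reads a file and counts its lines):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.setMaster("spark://myhostname.domain.com:7077")
    conf.setAppName("My application")
    conf.set("spark.executor.memory", "1g")

    sc = SparkContext(conf=conf)

    # Read a text file and count its lines; the path is a placeholder.
    lines = sc.textFile("/path/to/some/file.txt")
    print "Line count: %d" % lines.count()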
When attempting to get 'Config 1' running on any of the machines, I've put both 'localhost' and the actual fully qualified domain name of the host in conf/slaves; the results are the same.

In the one case where things are working, I see messages like this in the log:

    Remoting started; listening on addresses :[akka.tcp://sparkExecutor@192.168.0.9:59049]
    Remoting now listens on addresses: [akka.tcp://sparkExecutor@192.168.0.9:59049]
    Connecting to driver: akka.tcp://spark@192.168.0.9:59032/user/CoarseGrainedScheduler
    Connecting to worker akka.tcp://sparkWorker@192.168.0.9:59005/user/Worker
    Successfully connected to akka.tcp://sparkWorker@192.168.0.9:59005/user/Worker
    Successfully registered with driver

I've tried many different variables in my spark-env.sh. Currently, in the one case that works, I set:

    STANDALONE_SPARK_MASTER_HOST=`hostname -f`

but that's about it (setting that in the failure cases does not make them work).

So to me, it seems like the messages from Akka are not getting to the workers. Any idea why this is?

Thanks for the help!

-T.J.
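P.S. When I say there are no blocked ports, I've been checking reachability by hand with something like this (a sketch; the host and port are placeholders taken from the "Connection refused" line above, and since the driver port is ephemeral it has to be read off a live failure):

    import socket

    # Placeholders: substitute the host/port from the "Connection refused"
    # line in the master log.
    host = "node4.myhostname.domain.com"
    port = 43717

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, port))
        print "Connected OK to %s:%d" % (host, port)
    except socket.error as e:
        print "Connection failed: %s" % e
    finally:
        s.close()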