I've been trying for several days now to get a Spark application running in standalone mode, as described here:
http://spark.apache.org/docs/latest/spark-standalone.html

I'm using pyspark, so I've been following the example here: http://spark.apache.org/docs/0.9.1/quick-start.html#a-standalone-app-in-python

I've run Spark successfully in "local" mode using bin/pyspark, or even just by setting the SPARK_HOME environment variable and the proper PYTHONPATH, then starting up Python 2.7, importing pyspark, and creating a SparkContext object. Running in any kind of cluster mode seems to be the problem.

The StandAlone.py program in the example just reads a file and counts lines. My SparkConf looks like this:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    #conf.setMaster("spark://192.168.0.9:7077")
    conf.setMaster("spark://myhostname.domain.com:7077")
    conf.setAppName("My application")
    conf.set("spark.executor.memory", "1g")

I tried a couple of configurations:

    Config 1 ("All on one"): master is localhost, slave is localhost
    Config 2 ("Separate master and slave"): master is localhost, slave is another host

I've tried a few different machines:

    Machine 1: Mac OS 10.9 with the CDH5 Hadoop distribution, compiled with the SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 option
    Machines 2, 3: CentOS 6.4 with the CDH5 Hadoop distribution, compiled with the SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 option
    Machine 4: CentOS 6.4 with Hadoop 1.0.4 (default Spark compilation)

Here are the results I've had:

    Config 1 on Machine 1: Success
    Config 1 on Machine 2: Fail
    Config 2 on Machines 2, 3: Fail
    Config 1 on Machine 4: Fail
    Config 2 on Machines 1, 4: Fail

In the case of failure, the error is always the same:

    akka.tcp://sp...@node4.myhostname.domain.com:43717 got disassociated, removing it.
    akka.tcp://sp...@node4.myhostname.domain.com:43717 got disassociated, removing it.
    Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
    Actor[akka://sparkMaster/deadLetters] to
    Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.2.55%3A42546-2#-1875068764]
    was not delivered. [1] dead letters encountered. This logging can be turned off or
    adjusted with configuration settings 'akka.log-dead-letters' and
    'akka.log-dead-letters-during-shutdown'.
    AssociationError [akka.tcp://sparkMaster@node4:7077] ->
    [akka.tcp://sp...@node4.myhostname.domain.com:43717]: Error [Association failed with
    [akka.tcp://sp...@node4.myhostname.domain.com:43717]] [
    akka.remote.EndpointAssociationException: Association failed with
    [akka.tcp://sp...@node4.myhostname.domain.com:43717]
    Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
    Connection refused: node4.myhostname.domain.com/10.0.2.55:43717

It will then repeat this line for a while:

    parentName: , name: TaskSet_0, runningTasks: 0

and then print out this message:

    Initial job has not accepted any resources; check your cluster UI to ensure that
    workers are registered and have sufficient memory

I have turned the verbosity up to DEBUG in every log4j.properties I can find. There are no firewalls or blocked ports on the internal network.

In all configurations on all machines, when I run sbin/start-master.sh and sbin/start-slaves.sh, the respective log files always show the correct info ("I have been elected leader! New state: ALIVE" or "Successfully registered with master spark://blah-blah:7077"). The very nice UIs (on port 8080 for the master, port 8081 for the slaves) always show that everything is in order: the master host shows the workers, and the workers acknowledge they have registered with the master.
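For completeness, here is roughly the whole StandAlone.py I'm submitting (a minimal sketch; the input path is a placeholder, since the real script just reads a file and counts its lines):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.setMaster("spark://myhostname.domain.com:7077")
    conf.setAppName("My application")
    conf.set("spark.executor.memory", "1g")

    sc = SparkContext(conf=conf)

    # Read a text file and count its lines; the path is a placeholder.
    lines = sc.textFile("/path/to/some/file.txt")
    print "Line count: %d" % lines.count()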
When attempting to get 'Config 1' running on any of the machines, I've put both 'localhost' and the actual fully qualified domain name of the host in conf/slaves; the results are the same.

In the one case where things are working, I see messages like this in the log:

    Remoting started; listening on addresses :[akka.tcp://sparkExecutor@192.168.0.9:59049]
    Remoting now listens on addresses: [akka.tcp://sparkExecutor@192.168.0.9:59049]
    Connecting to driver: akka.tcp://spark@192.168.0.9:59032/user/CoarseGrainedScheduler
    Connecting to worker akka.tcp://sparkWorker@192.168.0.9:59005/user/Worker
    Successfully connected to akka.tcp://sparkWorker@192.168.0.9:59005/user/Worker
    Successfully registered with driver

I've tried many different variables in my spark-env.sh. Currently, in the one case that works, I set:

    STANDALONE_SPARK_MASTER_HOST=`hostname -f`

but that's about it (setting that in the failure cases does not make them work).

So to me, it seems like the messages from Akka are not getting to the workers. Any idea why this is?

Thanks for the help!

-T.J.
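P.S. When I say there are no blocked ports, I've been checking reachability by hand with something like this (a sketch; the host and port are placeholders taken from the "Connection refused" line above, and since the driver port is ephemeral it has to be read off a live failure):

    import socket

    # Placeholders: substitute the host/port from the "Connection refused"
    # line in the master log.
    host = "node4.myhostname.domain.com"
    port = 43717

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, port))
        print "Connected OK to %s:%d" % (host, port)
    except socket.error as e:
        print "Connection failed: %s" % e
    finally:
        s.close()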