Same problem here. I am using 0.9.0 on EC2. All worker nodes died at the same time, a few minutes after they started. Setting SPARK_MASTER_IP does not help. Any suggestions are appreciated.
Here is the master log:

Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path=/root/hadoop-native/ -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip ZZZZZZ.ZZZZZ.ZZZZZZ.ZZZZ --port 7077 --webui-port 8080
========================================

log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/02/07 09:07:05 INFO Master: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/02/07 09:07:05 INFO Master: Starting Spark master at spark://master.spark.XXXXXX.info:7077
14/02/07 09:07:05 INFO MasterWebUI: Started Master web UI at http://master:8080
14/02/07 09:07:05 INFO Master: I have been elected leader! New state: ALIVE
14/02/07 09:07:07 INFO Master: Registering worker master.spark.XXXXXX.info:35973 with 16 cores, 57.5 GB RAM
14/02/07 09:07:07 INFO Master: Registering worker master.spark.XXXXXX.info:42106 with 16 cores, 57.5 GB RAM
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:42106 got disassociated, removing it.
14/02/07 09:10:01 INFO Master: Removing worker worker-20140207090706-ip-XX-XXX-XXX-XX.us-west-2.compute.internal-42106 on ip-XX-XXX-XXX-XX.us-west-2.compute.internal:42106
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:42106 got disassociated, removing it.
14/02/07 09:10:01 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40XX.XXX.XXX.XX%3A56933-1#1326712555] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:35973 got disassociated, removing it.
14/02/07 09:10:01 INFO Master: Removing worker worker-20140207090706-ip-YY-YYY-YYY-YYY.us-west-2.compute.internal-35973 on ip-YY-YYY-YYY-YYY.us-west-2.compute.internal:35973
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:35973 got disassociated, removing it.
14/02/07 09:10:01 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40YY.YYY.YYY.YYY%3A58332-2#-579824674] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:35973 got disassociated, removing it.
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:35973]: Error [Association failed with [akka.tcp://[email protected]:35973]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:35973]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-YY-YYY-YYY-YYY.us-west-2.compute.internal/YY.YYY.YYY.YYY:35973
]
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:42106 got disassociated, removing it.
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:42106]: Error [Association failed with [akka.tcp://[email protected]:42106]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:42106]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-XX-XXX-XXX-XX.us-west-2.compute.internal/XX.XXX.XXX.XX:42106
]
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:35973 got disassociated, removing it.
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:42106 got disassociated, removing it.
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:35973 got disassociated, removing it.
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:35973]: Error [Association failed with [akka.tcp://[email protected]:35973]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:35973]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-YY-YYY-YYY-YYY.us-west-2.compute.internal/YY.YYY.YYY.YYY:35973
]
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:42106]: Error [Association failed with [akka.tcp://[email protected]:42106]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:42106]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-XX-XXX-XXX-XX.us-west-2.compute.internal/XX.XXX.XXX.XX:42106
]
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:35973]: Error [Association failed with [akka.tcp://[email protected]:35973]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:35973]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-YY-YYY-YYY-YYY.us-west-2.compute.internal/YY.YYY.YYY.YYY:35973
]
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://[email protected]:7077] -> [akka.tcp://[email protected]:42106]: Error [Association failed with [akka.tcp://[email protected]:42106]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://[email protected]:42106]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-XX-XXX-XXX-XX.us-west-2.compute.internal/XX.XXX.XXX.XX:42106
]
14/02/07 09:10:01 INFO Master: akka.tcp://[email protected]:42106 got disassociated, removing it.


2014-02-07 Sourav Chandra <[email protected]>:

> What is the output of 'host s1.machine.org' if you execute it from your
> worker machine?
>
> ping may work, but if this does not, it implies that no DNS entry is
> present for this machine (s1.machine.org).
>
> Two alternatives could be:
> - add a DNS entry
> - start the master with the SPARK_MASTER_IP=<master ip address> env variable set
>
> Thanks,
> Sourav
>
>
> On Fri, Feb 7, 2014 at 12:39 PM, Pillis W <[email protected]> wrote:
>
>> I have a "Connection Refused" error on the first worker (standalone
>> cluster - no YARN, Mesos). No firewalls, and the master and worker nodes
>> can ping each other.
>>
>> Master process started manually. It is running and I can see the Web UI at 8080.
>>
>> Using "spark-0.9.0-incubating-bin-hadoop2.tgz"
>>
>> ===============================================
>> spark-0.9.0-incubating-bin-hadoop2]$ ./bin/spark-class
>> org.apache.spark.deploy.worker.Worker spark://s1.machine.org:7077
>> 14/02/07 07:00:58 INFO Utils: Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 14/02/07 07:00:58 WARN Utils: Your hostname, s2.machine.org resolves to
>> a loopback address: 127.0.0.1; using 192.168.64.122 instead (on interface
>> eth1)
>> 14/02/07 07:00:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
>> another address
>> 14/02/07 07:00:59 INFO Slf4jLogger: Slf4jLogger started
>> 14/02/07 07:00:59 INFO Remoting: Starting remoting
>> 14/02/07 07:00:59 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkWorker@s2:49614]
>> 14/02/07 07:00:59 INFO Worker: Starting Spark worker s2:49614 with 1
>> cores, 853.0 MB RAM
>> 14/02/07 07:00:59 INFO Worker: Spark home:
>> /home/vagrant/spark-0.9.0-incubating-bin-hadoop2
>> 14/02/07 07:00:59 INFO WorkerWebUI: Started Worker web UI at
>> http://s2:8081
>> 14/02/07 07:00:59 INFO Worker: Connecting to master
>> spark://s1.machine.org:7077...
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef:
>> Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from
>> Actor[akka://sparkWorker/user/Worker#607746123] to
>> Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters
>> encountered. This logging can be turned off or adjusted with configuration
>> settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>>
>> ...
>>
>> 14/02/07 07:01:59 ERROR Worker: All masters are unresponsive! Giving up.
>> ===============================================
>>
>
> --
> Sourav Chandra
> Senior Software Engineer
> Livestream
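
For anyone who wants to script the check Sourav suggests in the quoted message (does the master's hostname resolve from the worker, and is the master port reachable there?), here is a minimal Python sketch. The function names are made up for illustration and are not part of Spark. Port 7077 is the master port shown in the logs. Note that it uses the system resolver, which also reads /etc/hosts, so it can succeed in cases where 'host' (DNS only) fails. A "Connection refused" with a running master can also mean the master is listening on a different address than the one the worker resolves, so the sketch reports loopback results separately.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses hostname resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def port_open(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def check_master(hostname, port=7077):
    """Resolve the master's hostname, then try the master port on each address.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    addresses = resolve_ipv4(hostname)
    return {
        "addresses": addresses,
        "loopback": [a for a in addresses if a.startswith("127.")],
        "reachable": [a for a in addresses if port_open(a, port)],
    }
```

Run on the worker, the call for the setup in the quoted report would be check_master("s1.machine.org"). An empty "addresses" list points at name resolution; addresses with an empty "reachable" list suggest the master is not listening on that address, or that something is blocking the port.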
