Same problem here. I am using 0.9.0 on EC2.
All worker nodes died at the same time, several minutes after they started.
Setting SPARK_MASTER_IP did not help.
Any suggestion is appreciated.

Here's the master log:

Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp
:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop1.0.4.jar
-Dspark.akka.logLifecycleEvents=true
-Djava.library.path=/root/hadoop-native/ -Xms512m -Xmx512m
org.apache.spark.deploy.master.Master --ip ZZZZZZ.ZZZZZ.ZZZZZZ.ZZZZ --port
7077 --webui-port 8080
========================================

log4j:WARN No appenders could be found for logger
(akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
14/02/07 09:07:05 INFO Master: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/02/07 09:07:05 INFO Master: Starting Spark master at spark://
master.spark.XXXXXX.info:7077
14/02/07 09:07:05 INFO MasterWebUI: Started Master web UI at
http://master:8080
14/02/07 09:07:05 INFO Master: I have been elected leader! New state: ALIVE
14/02/07 09:07:07 INFO Master: Registering worker
master.spark.XXXXXX.info:35973 with 16 cores, 57.5 GB RAM
14/02/07 09:07:07 INFO Master: Registering worker
master.spark.XXXXXX.info:42106 with 16 cores, 57.5 GB RAM
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:42106
got disassociated, removing it.
14/02/07 09:10:01 INFO Master: Removing worker
worker-20140207090706-ip-XX-XXX-XXX-XX.us-west-2.compute.internal-42106 on
ip-XX-XXX-XXX-XX.us-west-2.compute.internal:42106
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:42106
got disassociated, removing it.
14/02/07 09:10:01 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40XX.XXX.XXX.XX%3A56933-1#1326712555]
was not delivered. [1] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:35973
got disassociated, removing it.
14/02/07 09:10:01 INFO Master: Removing worker
worker-20140207090706-ip-YY-YYY-YYY-YYY.us-west-2.compute.internal-35973 on
ip-YY-YYY-YYY-YYY.us-west-2.compute.internal:35973
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:35973
got disassociated, removing it.
14/02/07 09:10:01 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40YY.YYY.YYY.YYY%3A58332-2#-579824674]
was not delivered. [2] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:35973
got disassociated, removing it.
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://
[email protected]:7077] ->
[akka.tcp://[email protected]:35973]:
Error [Association failed with
[akka.tcp://[email protected]:35973]]
[
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://[email protected]:35973]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused:
ip-YY-YYY-YYY-YYY.us-west-2.compute.internal/YY.YYY.YYY.YYY:35973
]
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:42106
got disassociated, removing it.
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://
[email protected]:7077] ->
[akka.tcp://[email protected]:42106]:
Error [Association failed with
[akka.tcp://[email protected]:42106]]
[
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://[email protected]:42106]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused:
ip-XX-XXX-XXX-XX.us-west-2.compute.internal/XX.XXX.XXX.XX:42106
]
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:35973
got disassociated, removing it.
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:42106
got disassociated, removing it.
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:35973
got disassociated, removing it.
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://
[email protected]:7077] ->
[akka.tcp://[email protected]:35973]:
Error [Association failed with
[akka.tcp://[email protected]:35973]]
[
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://[email protected]:35973]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused:
ip-YY-YYY-YYY-YYY.us-west-2.compute.internal/YY.YYY.YYY.YYY:35973
]
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://
[email protected]:7077] ->
[akka.tcp://[email protected]:42106]:
Error [Association failed with
[akka.tcp://[email protected]:42106]]
[
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://[email protected]:42106]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused:
ip-XX-XXX-XXX-XX.us-west-2.compute.internal/XX.XXX.XXX.XX:42106
]
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://
[email protected]:7077] ->
[akka.tcp://[email protected]:35973]:
Error [Association failed with
[akka.tcp://[email protected]:35973]]
[
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://[email protected]:35973]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused:
ip-YY-YYY-YYY-YYY.us-west-2.compute.internal/YY.YYY.YYY.YYY:35973
]
14/02/07 09:10:01 ERROR EndpointWriter: AssociationError [akka.tcp://
[email protected]:7077] ->
[akka.tcp://[email protected]:42106]:
Error [Association failed with
[akka.tcp://[email protected]:42106]]
[
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://[email protected]:42106]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused:
ip-XX-XXX-XXX-XX.us-west-2.compute.internal/XX.XXX.XXX.XX:42106
]
14/02/07 09:10:01 INFO Master:
akka.tcp://[email protected]:42106
got disassociated, removing it.



2014-02-07 Sourav Chandra <[email protected]>:

> What is the output of 'host s1.machine.org' if you execute it from your
> worker machine?
>
> ping may work, but if 'host' does not, it implies no DNS entry is present
> for this machine (s1.machine.org).
>
> Two alternatives could be:
>  - add a DNS entry
>  - start the master with the SPARK_MASTER_IP=<master ip address> env variable set
>
> Thanks,
> Sourav
>
>
> On Fri, Feb 7, 2014 at 12:39 PM, Pillis W <[email protected]> wrote:
>
>> I have a "Connection Refused" error on the first worker (standalone
>> cluster - no YARN, Mesos). There are no firewalls, and the master and
>> worker nodes can ping each other.
>>
>> The master process was started manually. It is running and I can see the Web UI on port 8080.
>>
>> Using "spark-0.9.0-incubating-bin-hadoop2.tgz"
>>
>> ===============================================
>> spark-0.9.0-incubating-bin-hadoop2]$ ./bin/spark-class
>> org.apache.spark.deploy.worker.Worker  spark://s1.machine.org:7077
>> 14/02/07 07:00:58 INFO Utils: Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 14/02/07 07:00:58 WARN Utils: Your hostname, s2.machine.org resolves to
>> a loopback address: 127.0.0.1; using 192.168.64.122 instead (on interface
>> eth1)
>> 14/02/07 07:00:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
>> another address
>> 14/02/07 07:00:59 INFO Slf4jLogger: Slf4jLogger started
>> 14/02/07 07:00:59 INFO Remoting: Starting remoting
>> 14/02/07 07:00:59 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkWorker@s2:49614]
>> 14/02/07 07:00:59 INFO Worker: Starting Spark worker s2:49614 with 1
>> cores, 853.0 MB RAM
>> 14/02/07 07:00:59 INFO Worker: Spark home:
>> /home/vagrant/spark-0.9.0-incubating-bin-hadoop2
>> 14/02/07 07:00:59 INFO WorkerWebUI: Started Worker web UI at
>> http://s2:8081
>> 14/02/07 07:00:59 INFO Worker: Connecting to master
>> spark://s1.machine.org:7077...
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 ERROR EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@s2:49614] -> [akka.tcp://
>> [email protected]:7077]: Error [Association failed with
>> [akka.tcp://[email protected]:7077]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://[email protected]:7077]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: s1.machine.org/192.168.64.121:7077
>> ]
>> 14/02/07 07:00:59 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef:
>> Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from
>> Actor[akka://sparkWorker/user/Worker#607746123] to
>> Actor[akka://sparkWorker/deadLetters] was not delivered. [1] dead letters
>> encountered. This logging can be turned off or adjusted with configuration
>> settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>>
>> ...
>>
>> 14/02/07 07:01:59 ERROR Worker: All masters are unresponsive! Giving up.
>> ===============================================
>>
>
>
>
>
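
The DNS and connectivity check suggested in the quoted reply can be sketched in Python as below. This is only a rough sketch, not something from Spark itself: `check_master` is a hypothetical helper name, and the host and port in the usage comment are the ones from the worker log in this thread. It resolves the master's hostname the way the worker would, then tries a plain TCP connection to the master port. A successful ping does not prove either step works. One possible cause of "Connection refused" with no firewall involved is that the master is listening on a different address (for example a loopback address) than the one the name resolves to.

```python
import socket


def check_master(host, port=7077, timeout=3.0):
    """Resolve `host`, then try a TCP connection to `port`.

    Returns (resolved_ip, status). resolved_ip is None when the
    name cannot be resolved at all.
    """
    try:
        # Name resolution as seen from this machine (DNS or /etc/hosts).
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return None, "dns lookup failed: %s" % exc

    try:
        # A plain TCP connect; nothing Spark- or Akka-specific is sent.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, "ok"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on ip:port.
        return ip, "connection refused"
    except OSError as exc:
        # Timeouts, unreachable network, and similar failures.
        return ip, "connect failed: %s" % exc


# Example, to be run on the worker machine:
#   print(check_master("s1.machine.org", 7077))
```

If the status is "connection refused" while the master process is running, it may be worth comparing the resolved IP against the address the master reports in its own log when it starts.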
