Here is an example after git-cloning the latest 1.4.0-SNAPSHOT. The first three runs (FINISHED) were successful and connected quickly. The fourth run (ALIVE) is failing on connection/association.
URL: spark://mellyrn.local:7077
REST URL: spark://mellyrn.local:6066 (cluster mode)
Workers: 1
Cores: 8 Total, 0 Used
Memory: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 3 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Workers
Worker Id                             Address         State  Cores       Memory
worker-20150527122155-10.0.0.3-60847  10.0.0.3:60847  ALIVE  8 (0 Used)  15.0 GB (0.0 B Used)

Running Applications
Application ID  Name  Cores  Memory per Node  Submitted Time  User  State  Duration

Completed Applications
Application ID           Name                                     Cores  Memory per Node  Submitted Time       User   State     Duration
app-20150527125945-0002  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:59:45  steve  FINISHED  7 s
app-20150527124403-0001  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:44:03  steve  FINISHED  6 s
app-20150527123822-0000  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:38:22  steve  FINISHED  6 s

2015-05-27 11:42 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> Thanks Yana,
>
> My current experience here is that after running some small spark-submit
> based tests, the Master once again stopped being reachable, with no change
> in the test setup. I restarted the Master/Worker and it is still not
> reachable.
>
> What might the variables be here under which association with the
> Master/Worker stops succeeding?
>
> For reference, here are the Master/Worker processes:
>
> 501 34465 1 0 11:35AM ??      0:06.50 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
> 501 34361 1 0 11:35AM ttys018 0:07.08 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077 --webui-port 8080
>
> 15/05/27 11:36:37 INFO SparkUI: Started SparkUI at http://25.101.19.24:4040
> 15/05/27 11:36:37 INFO SparkContext: Added JAR file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with timestamp 1432751797662
> 15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
> 15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
> 15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
> 15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
> 15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
> 15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
> 15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
> 15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
> 15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
> 15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
> 15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
> 1
>
> Even when successful, the time for the Master to come up has a
> surprisingly high variance. I am running on a single machine that has
> plenty of RAM. Note that this was one problem before the present series:
> if RAM is tight, the failure modes can be unpredictable. But RAM is not
> an issue now: plenty is available for both the Master and the Worker.
>
> Within the same hour, starting and stopping maybe a dozen times, the
> startup time for the Master ranged from a few seconds to several minutes.
>
> 2015-05-20 7:39 GMT-07:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>
>> But if I'm reading his email correctly, he's saying that:
>>
>> 1. The master and slave are on the same box (so network hiccups are an
>> unlikely culprit).
>> 2. The failures are intermittent -- i.e. the program works for a while,
>> then the worker gets disassociated...
>>
>> Is it possible that the master restarted? We used to have problems like
>> this where we'd restart the master process, and it wouldn't be listening
>> on 7077 for some time, but the worker process kept trying to connect,
>> and by the time the master was up, the worker had given up...
>>
>> On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>
>>> Check whether the name can be resolved in the /etc/hosts file (or DNS)
>>> of the worker.
>>>
>>> (The same, btw, applies to the node where you run the driver app -- all
>>> other nodes must be able to resolve its name.)
>>>
>>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>>> *Sent:* Wednesday, May 20, 2015 10:07 AM
>>> *To:* user
>>> *Subject:* Intermittent difficulties for Worker to contact Master on
>>> same machine in standalone
>>>
>>> What conditions would cause the following delays / failure for a
>>> standalone machine/cluster to have the Worker contact the Master?
>>>
>>> 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.0.0.3:8081
>>> 15/05/20 02:02:53 INFO Worker: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>> 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>> 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt # 1)
>>> ..
>>> ..
>>> 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt # 3)
>>> 15/05/20 02:03:26 INFO Worker: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>> 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
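One workaround for the race described in this thread (master process restarted but not yet listening on 7077 while the worker or driver retries and eventually gives up) is to have the launch script poll the master's TCP port before starting anything that depends on it. Below is a minimal sketch, not anything from Spark itself; the host/port values are assumptions matching this thread, and `wait_for_master` is a hypothetical helper name:

```python
import socket
import time

def wait_for_master(host, port, timeout_s=120.0, interval_s=2.0):
    """Poll until a TCP connect to host:port succeeds or timeout_s elapses.

    Returns True once the port accepts a connection (i.e. something is
    listening), False if the deadline passes first. A connection-refused
    error here is exactly the "Connection refused: mellyrn.local/...:7077"
    symptom in the logs above: nothing is bound to the port yet.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)  # not listening yet; back off and retry
    return False
```

A launch script could then call, e.g., `wait_for_master("mellyrn.local", 7077)` after `start-master.sh` and only run `start-worker.sh`/`spark-submit` once it returns True. Note this only addresses the startup race; it does not help if the hostname itself cannot be resolved, which is the separate /etc/hosts/DNS check suggested earlier in the thread.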