Here is an example after git-cloning the latest 1.4.0-SNAPSHOT. The first three runs (FINISHED) were successful and connected quickly. The fourth run (ALIVE) is failing on connection/association.
URL: spark://mellyrn.local:7077
REST URL: spark://mellyrn.local:6066 (cluster mode)
Workers: 1
Cores: 8 Total, 0 Used
Memory: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 3 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Workers
Worker Id                             Address         State  Cores       Memory
worker-20150527122155-10.0.0.3-60847  10.0.0.3:60847  ALIVE  8 (0 Used)  15.0 GB (0.0 B Used)

Running Applications
Application ID  Name  Cores  Memory per Node  Submitted Time  User  State  Duration

Completed Applications
Application ID           Name                                     Cores  Memory per Node  Submitted Time       User   State     Duration
app-20150527125945-0002  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:59:45  steve  FINISHED  7 s
app-20150527124403-0001  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:44:03  steve  FINISHED  6 s
app-20150527123822-0000  TestRunner: power-iteration-clustering   8      512.0 MB         2015/05/27 12:38:22  steve  FINISHED  6 s

2015-05-27 11:42 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> Thanks Yana,
>
> My current experience here is that after running some small spark-submit
> based tests, the Master once again stopped being reachable, with no change
> in the test setup. I restarted the Master/Worker and it is still not
> reachable.
>
> What might the variables be here under which association with the
> Master/Worker stops succeeding?
>
> For reference, here are the Master/Worker processes:
>
> 501 34465 1 0 11:35AM ??      0:06.50 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.worker.Worker spark://mellyrn.local:7077
> 501 34361 1 0 11:35AM ttys018 0:07.08 /Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home/bin/java -cp <classpath..> -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.master.Master --ip mellyrn.local --port 7077 --webui-port 8080
>
> 15/05/27 11:36:37 INFO SparkUI: Started SparkUI at http://25.101.19.24:4040
> 15/05/27 11:36:37 INFO SparkContext: Added JAR file:/shared/spark-perf/mllib-tests/target/mllib-perf-tests-assembly.jar at http://25.101.19.24:60329/jars/mllib-perf-tests-assembly.jar with timestamp 1432751797662
> 15/05/27 11:36:37 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
> 15/05/27 11:36:37 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
> 15/05/27 11:36:37 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
> 15/05/27 11:36:57 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
> 15/05/27 11:36:57 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
> 15/05/27 11:36:57 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
> 15/05/27 11:37:17 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
> 15/05/27 11:37:17 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@mellyrn.local:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@mellyrn.local:7077
> 15/05/27 11:37:17 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/25.101.19.24:7077
> 15/05/27 11:37:37 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
> 15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
> 1
>
> Even when successful, the time for the Master to come up has a
> surprisingly high variance. I am running on a single machine that has
> plenty of RAM. Note that this was one problem before the present series:
> if RAM is tight, the failure modes can be unpredictable. But RAM is not
> an issue now: plenty is available for both the Master and the Worker.
>
> Within the same hour, starting and stopping maybe a dozen times, the
> startup time for the Master ranged from a few seconds to several minutes.
>
> 2015-05-20 7:39 GMT-07:00 Yana Kadiyska <yana.kadiy...@gmail.com>:
>
>> But if I'm reading his email correctly, he's saying that:
>>
>> 1. The master and slave are on the same box (so network hiccups are an
>> unlikely culprit).
>> 2. The failures are intermittent -- i.e. the program works for a while,
>> then the worker gets disassociated...
>>
>> Is it possible that the master restarted? We used to have problems like
>> this where we'd restart the master process, and it wouldn't be listening
>> on 7077 for some time, but the worker process kept trying to connect,
>> and by the time the master was up, the worker had given up...
>>
>> On Wed, May 20, 2015 at 5:16 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>>
>>> Check whether the name can be resolved in the /etc/hosts file (or DNS)
>>> of the worker.
>>>
>>> (The same, btw, applies to the node where you run the driver app -- all
>>> other nodes must be able to resolve its name.)
>>>
>>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>>> *Sent:* Wednesday, May 20, 2015 10:07 AM
>>> *To:* user
>>> *Subject:* Intermittent difficulties for Worker to contact Master on
>>> same machine in standalone
>>>
>>> What conditions would cause the following delays / failure for a
>>> standalone machine/cluster to have the Worker contact the Master?
>>>
>>> 15/05/20 02:02:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.0.0.3:8081
>>> 15/05/20 02:02:53 INFO Worker: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>> 15/05/20 02:02:53 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
>>> 15/05/20 02:03:04 INFO Worker: Retrying connection to master (attempt # 1)
>>> ..
>>> ..
>>> 15/05/20 02:03:26 INFO Worker: Retrying connection to master (attempt # 3)
>>> 15/05/20 02:03:26 INFO Worker: Connecting to master akka.tcp://sparkMaster@mellyrn.local:7077/user/Master...
>>> 15/05/20 02:03:26 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@mellyrn.local:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: mellyrn.local/10.0.0.3:7077
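One workaround for the race described in this thread (master process restarted but not yet listening on 7077 while the worker or driver retries and eventually gives up) is to have the launch script poll the master's TCP port before starting anything that depends on it. Below is a minimal sketch, not anything from Spark itself; the host/port values are assumptions matching this thread, and `wait_for_master` is a hypothetical helper name:

```python
import socket
import time

def wait_for_master(host, port, timeout_s=120.0, interval_s=2.0):
    """Poll until a TCP connect to host:port succeeds or timeout_s elapses.

    Returns True once the port accepts a connection (i.e. something is
    listening), False if the deadline passes first. A connection-refused
    error here is exactly the "Connection refused: mellyrn.local/...:7077"
    symptom in the logs above: nothing is bound to the port yet.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)  # not listening yet; back off and retry
    return False
```

A launch script could then call, e.g., `wait_for_master("mellyrn.local", 7077)` after `start-master.sh` and only run `start-worker.sh`/`spark-submit` once it returns True. Note this only addresses the startup race; it does not help if the hostname itself cannot be resolved, which is the separate /etc/hosts/DNS check suggested earlier in the thread.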