I have deployed a Spark cluster in standalone mode with 3 machines:

  node1 / 192.168.1.2 -> master
  node2 / 192.168.1.3 -> worker (20 cores, 12 GB)
  node3 / 192.168.1.4 -> worker (20 cores, 12 GB)
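For context, the driver side of the job is set up along these lines. This is a minimal sketch for Spark 0.9.0; the app name and the explicit spark.executor.memory setting are my reconstruction rather than the exact code of CondelCalc.scala:

  import org.apache.spark.{SparkConf, SparkContext}

  object CondelCalc {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setMaster("spark://node1:7077")       // standalone master on node1
        .setAppName("CondelCalc")              // assumed app name
        .set("spark.executor.memory", "256m")  // the job only needs ~256m per executor
      val sc = new SparkContext(conf)
      // ... RDD transformations, ending in saveAsNewAPIHadoopFile(...) at line 146
      sc.stop()
    }
  }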
The web interface shows the workers correctly. When I launch the Scala job (which only requires 256m of memory), these are the driver logs:

  14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 55 tasks
  14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
  14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
  14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
  14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
  14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
  14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
  14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
  14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
  14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run saveAsNewAPIHadoopFile at CondelCalc.scala:146
  Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
          at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
          ...

The logs generated by the master and the two workers are attached, but I found something odd in the master log:

  14/03/05 23:37:43 INFO master.Master: Registering worker *node1:57297* with 20 cores, 12.0 GB RAM
  14/03/05 23:37:43 INFO master.Master: Registering worker *node1:34188* with 20 cores, 12.0 GB RAM

That is, it registers the two workers as node1:57297 and node1:34188 instead of node3 and node2, respectively.
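If I read the 0.9.0 sources correctly, a worker registers under the name returned by Utils.localHostName(), which ultimately comes down to java.net.InetAddress resolution on the worker itself. So here is a quick sanity check one could run on node2 and node3; the hard-coded address below is node2's and is only for illustration:

  import java.net.InetAddress

  // Prints what this JVM resolves for the local host.
  // Run once on node2 and once on node3 (adjusting the address).
  object HostCheck {
    def main(args: Array[String]): Unit = {
      val local = InetAddress.getLocalHost
      println(s"local hostname = ${local.getHostName}")     // expected: node2 (or node3)
      println(s"local address  = ${local.getHostAddress}")  // expected: 192.168.1.3 (or .4)

      // Reverse lookup of the address the worker actually listens on;
      // if this prints node1, it would explain the master log above.
      val addr = InetAddress.getByName("192.168.1.3")       // node2's address
      println(s"reverse lookup = ${addr.getCanonicalHostName}")
    }
  }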
Here is the relevant state on each node:

  $ cat /etc/hosts
  ...
  192.168.1.2 node1
  192.168.1.3 node2
  192.168.1.4 node3
  ...

  $ nslookup node2
  Server:   192.168.1.1
  Address:  192.168.1.1#53
  Name:     node2.cluster.local
  Address:  192.168.1.3

  $ nslookup node3
  Server:   192.168.1.1
  Address:  192.168.1.1#53
  Name:     node3.cluster.local
  Address:  192.168.1.4

  $ ssh node1 "ps aux | grep spark"
  cperez 17023 1.4 0.1 4691944 154532 pts/3 Sl 23:37 0:15 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port 8080

  $ ssh node2 "ps aux | grep spark"
  cperez 17511 2.7 0.1 4625248 156304 ? Sl 23:37 0:07 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077

  $ ssh node2 "netstat -lptun | grep 17511"
  tcp   0   0   :::8081                     :::*   LISTEN   17511/java
  tcp   0   0   ::ffff:192.168.1.3:34188    :::*   LISTEN   17511/java

  $ ssh node3 "ps aux | grep spark"
  cperez 7543 1.9 0.1 4625248 158600 ? Sl 23:37 0:09 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077

  $ ssh node3 "netstat -lptun | grep 7543"
  tcp   0   0   :::8081                     :::*   LISTEN   7543/java
  tcp   0   0   ::ffff:192.168.1.4:57297    :::*   LISTEN   7543/java

So the workers do listen on their own addresses (192.168.1.3:34188 and 192.168.1.4:57297), yet the master registers both of them under node1.

I am completely stuck on this; any help would be greatly appreciated. Many thanks in advance.

Christian
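P.S. If it helps to narrow things down: the standalone docs mention a SPARK_LOCAL_IP variable, described as the IP address of the machine to bind to, and if I understand Utils.localHostName() correctly it would also pin the name a worker registers under. Would setting it in conf/spark-env.sh on each worker, e.g. on node2:

  SPARK_LOCAL_IP=192.168.1.3

be the right workaround, or would that just mask a DNS problem?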
Attachments:
  spark-cperez-org.apache.spark.deploy.master.Master-1-node1.out
  spark-cperez-org.apache.spark.deploy.worker.Worker-1-node2.out
  spark-cperez-org.apache.spark.deploy.worker.Worker-1-node3.out