Can you see the Spark web UI? Is it running? (It would be at masterurl:8080.) If so, what is the master URL shown there?

    MASTER=spark://<URL>:<PORT> ./bin/spark-shell

should work.
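For example, with the cluster described below (and assuming the UI at http://node1:8080 reports spark://node1:7077 as its master URL, as the driver logs suggest), that would be:

    MASTER=spark://node1:7077 ./bin/spark-shell

If the shell gets executors, the cluster itself is reachable and the problem is more likely in the job's configuration.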
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Thu, Mar 6, 2014 at 2:22 PM, Christian <chri...@gmail.com> wrote:

> Hello, has anyone run into this problem before? I am sorry to insist, but I
> cannot guess what is happening. Should I ask on the dev mailing list? Many
> thanks in advance.
>
> On 05/03/2014 23:57, "Christian" <chri...@gmail.com> wrote:
>
>> I have deployed a Spark cluster in standalone mode with 3 machines:
>>
>> node1/192.168.1.2 -> master
>> node2/192.168.1.3 -> worker, 20 cores, 12g
>> node3/192.168.1.4 -> worker, 20 cores, 12g
>>
>> The web interface shows the workers correctly.
>>
>> When I launch the Scala job (which only requires 256m of memory), these
>> are the logs:
>>
>> 14/03/05 23:24:06 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 55 tasks
>> 14/03/05 23:24:21 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>> 14/03/05 23:24:23 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
>> 14/03/05 23:24:36 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>> 14/03/05 23:24:43 INFO client.AppClient$ClientActor: Connecting to master spark://node1:7077...
>> 14/03/05 23:24:51 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>> 14/03/05 23:25:03 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
>> 14/03/05 23:25:03 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
>> 14/03/05 23:25:03 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 0.0 from pool
>> 14/03/05 23:25:03 INFO scheduler.DAGScheduler: Failed to run saveAsNewAPIHadoopFile at CondelCalc.scala:146
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>>         ...
>>
>> The logs generated by the master and the 2 workers are attached, but I
>> found something weird in the master logs:
>>
>> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:57297* with 20 cores, 12.0 GB RAM
>> 14/03/05 23:37:43 INFO master.Master: Registering worker *node1:34188* with 20 cores, 12.0 GB RAM
>>
>> It reports that the two workers are node1:57297 and node1:34188 instead
>> of node3 and node2, respectively.
>>
>> $ cat /etc/hosts
>> ...
>> 192.168.1.2 node1
>> 192.168.1.3 node2
>> 192.168.1.4 node3
>> ...
>>
>> $ nslookup node2
>> Server:   192.168.1.1
>> Address:  192.168.1.1#53
>>
>> Name:     node2.cluster.local
>> Address:  192.168.1.3
>>
>> $ nslookup node3
>> Server:   192.168.1.1
>> Address:  192.168.1.1#53
>>
>> Name:     node3.cluster.local
>> Address:  192.168.1.4
>>
>> $ ssh node1 "ps aux | grep spark"
>> cperez 17023 1.4 0.1 4691944 154532 pts/3 Sl 23:37 0:15 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip node1 --port 7077 --webui-port 8080
>>
>> $ ssh node2 "ps aux | grep spark"
>> cperez 17511 2.7 0.1 4625248 156304 ? Sl 23:37 0:07 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077
>>
>> $ ssh node2 "netstat -lptun | grep 17511"
>> tcp   0   0 :::8081                    :::*   LISTEN   17511/java
>> tcp   0   0 ::ffff:192.168.1.3:34188   :::*   LISTEN   17511/java
>>
>> $ ssh node3 "ps aux | grep spark"
>> cperez 7543 1.9 0.1 4625248 158600 ? Sl 23:37 0:09 /data/users/cperez/opt/jdk/bin/java -cp :/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/conf:/data/users/cperez/opt/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar:/data/users/cperez/opt/hadoop-2.2.0/etc/hadoop -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://node1:7077
>>
>> $ ssh node3 "netstat -lptun | grep 7543"
>> tcp   0   0 :::8081                    :::*   LISTEN   7543/java
>> tcp   0   0 ::ffff:192.168.1.4:57297   :::*   LISTEN   7543/java
>>
>> I am completely blocked at this, any help would be very helpful to me.
>> Many thanks in advance.
>> Christian
>>
>
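One thing that may be worth checking, given those registration lines (a sketch of a possible knob, not a confirmed fix): in standalone mode the address a worker binds to and registers under can be pinned explicitly in conf/spark-env.sh on each worker, instead of letting it be resolved from the hostname, e.g.

    # conf/spark-env.sh on node2 (use 192.168.1.4 on node3) -- only relevant if the
    # node1:57297 / node1:34188 registrations come from the workers resolving their
    # own address incorrectly; restart the workers afterwards so they re-register
    SPARK_LOCAL_IP=192.168.1.3

This is only a guess based on the logs above; the web UI master URL check is the first thing to try.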