It seems that your driver (which I'm assuming you launched on the master
node) can now connect to the Master, but your executors cannot. Did you
make sure that all nodes have the same conf/spark-defaults.conf,
conf/spark-env.sh, and conf/slaves? It would be good if you could post the
stderr from the executor logs here; they are located on each worker node under
$SPARK_HOME/work.
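
For example, on one of the worker machines, something like the following
should show why the executors are exiting (the app ID is the one from your
spark-shell output below; the executor directory numbers and exact layout may
vary):

ls $SPARK_HOME/work/app-20140708100139-0000/
cat $SPARK_HOME/work/app-20140708100139-0000/*/stderr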

(As of Spark 1.0, we recommend that you set these through the spark-submit
arguments, e.g.

bin/spark-submit --master spark://pzxnvm2018.x.y.name.org:7077
--executor-memory 4g --executor-cores 3 --class <your main class> <your
application jar> <application arguments ...>)
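
If you prefer keeping these in a config file, roughly the same settings can go
into conf/spark-defaults.conf on the machine you launch from. This is only a
sketch with values taken from the command above; note that in standalone mode
the total number of cores an application takes is usually capped with
spark.cores.max, and the 9 here is just an illustrative value (e.g. 3 cores on
each of 3 workers):

spark.master            spark://pzxnvm2018.x.y.name.org:7077
spark.executor.memory   4g
spark.cores.max         9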


2014-07-08 10:12 GMT-07:00 Sameer Tilak <ssti...@live.com>:

> Hi Akhil et al.,
> I made the following changes:
>
> In spark-env.sh I added the following three entries (standalone mode)
>
> export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org
> export SPARK_WORKER_MEMORY=4G
> export SPARK_WORKER_CORES=3
>
> I then use the start-master and start-slaves commands to start the services.
> Another thing that I have noticed is that the number of cores I specified
> is not used: 2022 shows up with only 1 core, and 2023 and 2024 show up with
> 4 cores.
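>
> For reference, a minimal sketch of the start commands mentioned above,
> assuming the standard scripts shipped under sbin/ (exact paths may differ
> on your install):
>
> sbin/start-master.sh
> sbin/start-slaves.sh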
>
> In the Web UI:
> URL: spark://pzxnvm2018.x.y.name.org:7077
>
> I run the spark shell command from pzxnvm2018.
>
> /etc/hosts on my master node has the following entry:
> master-ip  pzxnvm2018.x.y.name.org pzxnvm2018
>
> /etc/hosts on a worker node has the following entry:
> worker-ip        pzxnvm2023.x.y.name.org pzxnvm2023
>
>
> However, on my master node log file I still see this:
>
> ERROR EndpointWriter: AssociationError [akka.tcp://
> sparkmas...@pzxnvm2018.x.y.name.org:7077] -> 
> [akka.tcp://spark@localhost:43569]:
> Error [Association failed with [akka.tcp://spark@localhost:43569]]
>
> My spark-shell has the following output:
>
>
> scala> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Connected to
> Spark cluster with app ID app-20140708100139-0000
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/0 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/0 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/1 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/1 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:39 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/2 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:39 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/2 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/0 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/1 is now RUNNING
> 14/07/08 10:01:40 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/2 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/0 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/0 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/3 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/3 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/3 is now RUNNING
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/1 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/1 removed: Command exited with code 1
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/4 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
> 14/07/08 10:01:42 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/4 on hostPort pzxnvm2023.x.y.name.org:38294 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:42 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/4 is now RUNNING
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/2 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/2 removed: Command exited with code 1
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/5 on
> worker-20140708095559-pzxnvm2022.x.y.name.org-41826 (
> pzxnvm2022.dcld.pldc.kp.org:41826) with 1 cores
> 14/07/08 10:01:43 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/5 on hostPort pzxnvm2022.x.y.name.org:41826 with
> 1 cores, 512.0 MB RAM
> 14/07/08 10:01:43 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/5 is now RUNNING
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/3 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/3 removed: Command exited with code 1
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/6 on
> worker-20140708095558-pzxnvm2024.x.y.name.org-50218 (
> pzxnvm2024.dcld.pldc.kp.org:50218) with 4 cores
> 14/07/08 10:01:44 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20140708100139-0000/6 on hostPort pzxnvm2024.x.y.name.org:50218 with
> 4 cores, 512.0 MB RAM
> 14/07/08 10:01:44 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/6 is now RUNNING
> 14/07/08 10:01:45 INFO AppClient$ClientActor: Executor updated:
> app-20140708100139-0000/4 is now FAILED (Command exited with code 1)
> 14/07/08 10:01:45 INFO SparkDeploySchedulerBackend: Executor
> app-20140708100139-0000/4 removed: Command exited with code 1
> 14/07/08 10:01:45 INFO AppClient$ClientActor: Executor added:
> app-20140708100139-0000/7 on
> worker-20140708095559-pzxnvm2023.x.y.name.org-38294 (
> pzxnvm2023.dcld.pldc.kp.org:38294) with 4 cores
>
>
> ------------------------------
> Date: Tue, 8 Jul 2014 12:29:21 +0530
> Subject: Re: Spark: All masters are unresponsive!
> From: ak...@sigmoidanalytics.com
> To: user@spark.apache.org
>
>
> Are you sure your master URL is spark://pzxnvm2018:7077?
>
> You can look it up in the Web UI (usually http://pzxnvm2018:8080), in the
> top left corner. Also make sure you are able to telnet to pzxnvm2018 on port
> 7077 from the machines where you are running the spark shell.
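>
> For example, from one of those machines (nc -z works as an alternative if
> telnet is not installed):
>
> telnet pzxnvm2018 7077
> nc -zv pzxnvm2018 7077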
>
> Thanks
> Best Regards
>
>
> On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak <ssti...@live.com> wrote:
>
> Hi All,
>
> I am having a few issues with stability and scheduling. When I use spark
> shell to submit my application, I get the following error message and the
> spark shell crashes. I have a small 4-node cluster for a PoC. I tried both
> manual and script-based cluster setup, and I also tried using the FQDN to
> specify the master node, but no luck.
>
> 14/07/07 23:44:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage
> 1 (MappedRDD[6] at map at JaccardScore.scala:83)
> 14/07/07 23:44:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:0 as TID 1 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:0 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO TaskSetManager: Starting task 1.0:1 as TID 2 on
> executor localhost: localhost (PROCESS_LOCAL)
> 14/07/07 23:44:35 INFO TaskSetManager: Serialized task 1.0:1 as 2322 bytes
> in 0 ms
> 14/07/07 23:44:35 INFO Executor: Running task ID 1
> 14/07/07 23:44:35 INFO Executor: Running task ID 2
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO BlockManager: Found block broadcast_1 locally
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:0+97239389
> 14/07/07 23:44:35 INFO HadoopRDD: Input split:
> hdfs://pzxnvm2018:54310/data/sameer_7-2-2014_3mm_sentences.tsv:97239389+97239390
> 14/07/07 23:44:54 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:14 INFO AppClient$ClientActor: Connecting to master
> spark://pzxnvm2018:7077...
> 14/07/07 23:45:35 ERROR SparkDeploySchedulerBackend: Application has been
> killed. Reason: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 ERROR TaskSchedulerImpl: Exiting due to error from
> cluster scheduler: All masters are unresponsive! Giving up.
> 14/07/07 23:45:35 WARN HadoopRDD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
> at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
>  at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:2135)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>  at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
> at
> org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
>  at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:208)
> at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>  at
> org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:193)
> at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>  at
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:113)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> 14/07/07 23:45:35 ERROR Executor: Exception in task ID 2
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:264)
>  at org.apache.hadoop.hdfs.DFSClient.access$1100(DFSClient.java:74)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2213)
>  at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>  at
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
> at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:198)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:181)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>  at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>  at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>  at java.lang.Thread.run(Thread.java:722)
>
>
>
