Hi,

We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark 1.0.0-rc9 to get past some PySpark bugs, but now we are hitting seemingly random crashes in almost every job. Local runs work fine, but the same code with the same data set fails on the Mesos cluster. A rough sketch of the job is below, followed by the error we get.
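For reference, the failing script is essentially a key-count job that ends in saveAsTextFile. This is only a sketch of the same shape, not the real tag_prefixes.py; the input path and the prefix-extraction logic are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="tag_prefixes")

    # Count occurrences per tag prefix and write the result out.
    # (Input path and the split on "/" are illustrative placeholders.)
    tags = sc.textFile("tags.data")
    tag_prefix_counts = (tags
                         .map(lambda line: (line.split("/")[0], 1))
                         .reduceByKey(lambda a, b: a + b))
    tag_prefix_counts.saveAsTextFile("tag_prefix_counts.data")  # this is the call that fails on Mesos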
14/05/22 15:03:34 ERROR DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error EOF reached before Python server acknowledged; shutting down SparkContext
14/05/22 15:03:34 INFO DAGScheduler: Failed to run saveAsTextFile at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "tag_prefixes.py", line 58, in <module>
    tag_prefix_counts.saveAsTextFile('tag_prefix_counts.data')
  File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/pyspark/rdd.py", line 910, in saveAsTextFile
    keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
  File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError14/05/22 15:03:34 INFO TaskSchedulerImpl: Cancelling stage 0
: An error occurred while calling o44.saveAsTextFile.
: org.apache.spark.SparkException: Job 0 cancelled as part of cancellation of all jobs
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
    at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)

This looks similar to https://issues.apache.org/jira/browse/SPARK-1749, except that our code isn't "bad".

We are also seeing lots of Mesos(?) warnings like this, which we didn't see with previous Mesos & Spark versions:

W0522 14:51:19.045565 10497 sched.cpp:901] Attempting to launch task 869 with an unknown offer 20140516-155535-170164746-5050-22001-112345

There aren't any related errors in the Mesos slave logs; on the contrary, they report the tasks as finished without problems. Scala code seems to run fine, so I don't think this is an issue with our Mesos installation.

Any ideas what might be wrong? Or is this a bug in Spark?

-Perttu