Hi,

We have a small Mesos (0.18.1) cluster with 4 nodes. We upgraded to Spark
1.0.0-rc9 to overcome some PySpark bugs, but now we are experiencing
random crashes with almost every job. Local jobs run fine, but the same code
with the same data set on the Mesos cluster leads to errors like:

14/05/22 15:03:34 ERROR DAGSchedulerActorSupervisor: eventProcesserActor
failed due to the error EOF reached before Python server acknowledged;
shutting down SparkContext
14/05/22 15:03:34 INFO DAGScheduler: Failed to run saveAsTextFile at
NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "tag_prefixes.py", line 58, in <module>
    tag_prefix_counts.saveAsTextFile('tag_prefix_counts.data')
  File "/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/pyspark/rdd.py", line
910, in saveAsTextFile
    keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
  File
"/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 537, in __call__
  File
"/srv/spark/spark-1.0.0-bin-2.0.5-alpha/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError14/05/22 15:03:34 INFO TaskSchedulerImpl:
Cancelling stage 0
: An error occurred while calling o44.saveAsTextFile.
: org.apache.spark.SparkException: Job 0 cancelled as part of cancellation
of all jobs
        at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at
org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
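
For reference, the script boils down to roughly the following (a minimal
sketch; apart from the tag_prefix_counts name and the output path, which
appear in the traceback, the input path and the prefix logic are just
placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="tag_prefixes")

    # Placeholder input; the real job reads our tag data set.
    tags = sc.textFile("tags.data")

    # Count occurrences per tag prefix (illustrative logic only).
    tag_prefix_counts = (tags
                         .map(lambda tag: (tag[:2], 1))
                         .reduceByKey(lambda a, b: a + b))

    # This is the call that fails on the cluster (line 58 in the traceback)
    # but completes fine when run locally.
    tag_prefix_counts.saveAsTextFile('tag_prefix_counts.data')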


This looks similar to https://issues.apache.org/jira/browse/SPARK-1749,
except that our code isn't "bad". Furthermore, we are seeing
lots of Mesos(?) warnings like this:

W0522 14:51:19.045565 10497 sched.cpp:901] Attempting to launch task 869
with an unknown offer 20140516-155535-170164746-5050-22001-112345

We didn't see these with previous Mesos & Spark versions. There aren't any
related errors in the Mesos slave logs; instead, they report the jobs as done
without problems. Scala code seems to run fine, so I suppose this isn't an
issue with our Mesos installation.
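
In case it matters, the only thing that differs between the local and cluster
runs is the master the context is pointed at, roughly like this (the master
host and executor URI below are placeholders, not our actual values):

    from pyspark import SparkConf, SparkContext

    # Local run (works fine):
    #   conf = SparkConf().setMaster("local[4]").setAppName("tag_prefixes")

    # Mesos run (crashes as above); master host and spark.executor.uri
    # are placeholders for our actual setup.
    conf = (SparkConf()
            .setMaster("mesos://mesos-master:5050")
            .setAppName("tag_prefixes")
            .set("spark.executor.uri",
                 "hdfs://namenode/spark/spark-1.0.0-bin-2.0.5-alpha.tgz"))
    sc = SparkContext(conf=conf)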

Any ideas what might be wrong? Or is this a bug in Spark?


-Perttu
