I’m in the PySpark shell and I’m trying to do this:

a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json',
                minPartitions=sc.defaultParallelism * 3).cache()
a.map(lambda x: len(x)).max()
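
For reference, here is roughly the same job as a standalone script (just a sketch; the explicit SparkContext setup and app name are my additions, and the S3 path is the same placeholder as above):

from pyspark import SparkContext

# same setup as the shell session above, minus the shell's implicit sc
sc = SparkContext(appName="max-line-length")

a = sc.textFile(
    's3n://path-to-handful-of-very-large-files-totalling-1tb/*.json',
    minPartitions=sc.defaultParallelism * 3
).cache()

# the first action on the cached RDD is where the job dies
print a.map(lambda x: len(x)).max()

sc.stop()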

My job dies with the following:

14/07/30 01:46:28 WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 142, in
_read_with_length
    length = read_int(stream)
  File "/root/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/30 01:46:29 ERROR TaskSchedulerImpl: Lost executor 19 on ip-10-190-171-217.ec2.internal: remote Akka client disassociated

How do I debug this? I’m using Spark 1.0.2-rc1 deployed to EC2.

Nick