So if I try this again but in the Scala shell (as opposed to the Python
one), this is what I get:

scala> val a = sc.textFile("s3n://some-path/*.json", minPartitions=sc.defaultParallelism * 3).cache()
a: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> a.map(_.length).max
14/07/31 20:09:04 WARN LoadSnappy: Snappy native library is available
14/07/31 20:10:41 WARN TaskSetManager: Lost TID 22 (task 0.0:22)
14/07/31 20:10:41 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
    at java.nio.CharBuffer.toString(CharBuffer.java:1201)
    at org.apache.hadoop.io.Text.decode(Text.java:350)
    at org.apache.hadoop.io.Text.decode(Text.java:327)
    at org.apache.hadoop.io.Text.toString(Text.java:254)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:458)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:458)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/31 20:10:42 ERROR TaskSchedulerImpl: Lost executor 19 on ip-10-13-142-142.ec2.internal: OutOfMemoryError

So I guess I need to fiddle with some memory configs? I’m surprised that
just checking input line length could trigger this.
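
For reference, here is a sketch of what I'm thinking of trying first, assuming it's the cache() that is blowing the heap rather than the length computation itself: for RDDs, cache() is persist(StorageLevel.MEMORY_ONLY), which tries to hold every partition on-heap as deserialized Java strings. (The path and partition count below are just the ones from my session above.)

import org.apache.spark.storage.StorageLevel

// Same job, but persist serialized with spill-to-disk instead of cache(),
// so partitions that don't fit in memory go to local disk rather than
// piling up on the heap until the GC gives out.
val a = sc.textFile("s3n://some-path/*.json",
  minPartitions = sc.defaultParallelism * 3)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

a.map(_.length).max

Failing that, I suppose the next knob to turn is raising spark.executor.memory.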

Nick
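
P.S. For anyone else who hits this: to follow Davies's log4j suggestion below, I believe the way to get more detail out of the JVM is to raise the root log level, assuming you are starting from the stock template that ships with Spark (under /root/spark on these EC2 clusters):

$ cp conf/log4j.properties.template conf/log4j.properties
# then edit conf/log4j.properties and change the root logger from INFO to:
log4j.rootCategory=DEBUG, console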


On Wed, Jul 30, 2014 at 8:58 PM, Davies Liu <dav...@databricks.com> wrote:

> The exception in Python means that the worker tried to read a command from
> the JVM, but it reached the end of the socket (the socket had been closed).
> So it's possible that another exception happened inside the JVM.
>
> Could you change the log level in log4j, then check whether there is any
> problem inside the JVM?
>
> Davies
>
> On Wed, Jul 30, 2014 at 9:12 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > Any clues? This looks like a bug, but I can't report it without more
> > precise information.
> >
> >
> > On Tue, Jul 29, 2014 at 9:56 PM, Nick Chammas <nicholas.cham...@gmail.com>
> > wrote:
> >>
> >> I’m in the PySpark shell and I’m trying to do this:
> >>
> >> a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json',
> >>     minPartitions=sc.defaultParallelism * 3).cache()
> >> a.map(lambda x: len(x)).max()
> >>
> >> My job dies with the following:
> >>
> >> 14/07/30 01:46:28 WARN TaskSetManager: Loss was due to
> >> org.apache.spark.api.python.PythonException
> >> org.apache.spark.api.python.PythonException: Traceback (most recent call
> >> last):
> >>   File "/root/spark/python/pyspark/worker.py", line 73, in main
> >>     command = pickleSer._read_with_length(infile)
> >>   File "/root/spark/python/pyspark/serializers.py", line 142, in
> >> _read_with_length
> >>     length = read_int(stream)
> >>   File "/root/spark/python/pyspark/serializers.py", line 337, in read_int
> >>     raise EOFError
> >> EOFError
> >>
> >>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
> >>     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
> >>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
> >>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> >>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
> >>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>     at java.lang.Thread.run(Thread.java:745)
> >> 14/07/30 01:46:29 ERROR TaskSchedulerImpl: Lost executor 19 on
> >> ip-10-190-171-217.ec2.internal: remote Akka client disassociated
> >>
> >> How do I debug this? I’m using 1.0.2-rc1 deployed to EC2.
> >>
> >> Nick
> >>
> >>
> >
> >
>
