I've only tried 0.9, in which I ran into the `stdin writer to Python finished early` error so frequently that I wasn't able to load even a 1GB file. Let me know if I can provide any other info!
On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We'll try to look into these; it seems like a serious error.
>
> Matei
>
> On Mar 27, 2014, at 7:27 PM, Jim Blomo <jim.bl...@gmail.com> wrote:
>
>> Thanks, Matei. I am running "Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4" from GitHub on 2014-03-18.
>>
>> I tried batchSizes of 512, 10, and 1, and each got me further, but none succeeded.
>>
>> I can get this to work -- with manual intervention -- if I omit `parsed.persist(StorageLevel.MEMORY_AND_DISK)` and set batchSize=1. Five of the 175 executors hung, and I had to kill the Python process to get things going again. The only indication of this in the logs was `INFO python.PythonRDD: stdin writer to Python finished early`.
>>
>> With batchSize=1 and persist, a new memory error came up in several tasks before the app failed:
>>
>> 14/03/28 01:51:15 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOfRange(Arrays.java:2694)
>>     at java.lang.String.<init>(String.java:203)
>>     at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
>>     at java.nio.CharBuffer.toString(CharBuffer.java:1201)
>>     at org.apache.hadoop.io.Text.decode(Text.java:350)
>>     at org.apache.hadoop.io.Text.decode(Text.java:327)
>>     at org.apache.hadoop.io.Text.toString(Text.java:254)
>>     at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
>>     at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:349)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>     at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:242)
>>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
>>
>> There are other exceptions, but I think they all stem from the above, e.g. org.apache.spark.SparkException: Error sending message to BlockManagerMaster.
>>
>> Let me know if there are other settings I should try, or if I should try a newer snapshot.
>>
>> Thanks again!
>>
>>
>> On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> Hey Jim,
>>>
>>> In Spark 0.9 we added a "batchSize" parameter to PySpark that makes it group multiple objects together before passing them between Java and Python, but it may be too high by default. Try passing batchSize=10 to your SparkContext constructor to lower it (the default is 1024), or even batchSize=1 to match earlier versions.
>>>
>>> Matei
>>>
>>> On Mar 21, 2014, at 6:18 PM, Jim Blomo <jim.bl...@gmail.com> wrote:
>>>
>>>> Hi all, I'm wondering if there are any settings I can use to reduce the memory needed by the PythonRDD when computing simple stats. I am getting OutOfMemoryError exceptions while calculating count() on big, but not absurd, records. It seems like PythonRDD is trying to keep too many of these records in memory, when all that is needed is to stream through them and count. Any tips for getting through this workload?
>>>>
>>>> Code:
>>>> session = sc.textFile('s3://...json.gz')  # ~54GB of compressed data
>>>>
>>>> # the biggest individual text line is ~3MB
>>>> parsed = session.map(lambda l: l.split("\t", 1)).map(lambda (y, s): (loads(y), loads(s)))
>>>> parsed.persist(StorageLevel.MEMORY_AND_DISK)
>>>>
>>>> parsed.count()
>>>> # will never finish: executor.Executor: Uncaught exception will FAIL all executors
>>>>
>>>> Incidentally, the whole app appears to be killed, but this error is not propagated to the shell.
>>>>
>>>> Cluster:
>>>> 15 m2.xlarges (17GB memory, 17GB swap, spark.executor.memory=10GB)
>>>>
>>>> Exception:
>>>> java.lang.OutOfMemoryError: Java heap space
>>>>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:132)
>>>>     at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:120)
>>>>     at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:113)
>>>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>     at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:113)
>>>>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>>>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>>>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:94)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>>>>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
>>>
>
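For reference, here is a minimal sketch of the two workarounds discussed above: lowering PySpark's batchSize through the SparkContext constructor and running count() without persist(). It assumes `loads` is `json.loads`; the master URL, app name, and input path are placeholders, not values from this thread.

    from json import loads
    from pyspark import SparkContext

    # batchSize controls how many records PySpark groups together before
    # shipping them between the JVM and the Python worker; 10 (or 1) keeps
    # each batch small, as suggested above. Master URL and app name are
    # placeholders.
    sc = SparkContext("spark://master:7077", "json-count", batchSize=10)

    session = sc.textFile("s3://bucket/path/*.json.gz")  # placeholder path
    parsed = session.map(lambda l: l.split("\t", 1)) \
                    .map(lambda p: (loads(p[0]), loads(p[1])))

    # Skipping persist(StorageLevel.MEMORY_AND_DISK) lets count() stream
    # through the records instead of buffering them in the cache, which is
    # the configuration reported as (mostly) working earlier in the thread.
    parsed.count()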