Re: pySpark memory usage

2014-05-15 Thread Matei Zaharia
Cool, that’s good to hear. We’d also like to add spilling in Python itself, or at least make it exit with a good message if it can’t do it.

Matei

On May 14, 2014, at 10:47 AM, Jim Blomo wrote:
> That worked amazingly well, thank you Matei! Numbers that worked for
> me were 400 for the textFil…

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
That worked amazingly well, thank you Matei! Numbers that worked for me were 400 for the textFile()s, 1500 for the join()s.

On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia wrote:
> Hey Jim, unfortunately external spilling is not implemented in Python right
> now. While it would be possible to up…

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
Should add that I had to tweak the numbers a bit to keep them above the swapping threshold, but below the "Too many open files" error (`ulimit -n` is 32768).

On Wed, May 14, 2014 at 10:47 AM, Jim Blomo wrote:
> That worked amazingly well, thank you Matei! Numbers that worked for
> me were 400 for the textFil…
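
[Editor's note, not from the thread: the limit Jim is bumping into can also be checked from the Python side with only the standard library; a small sketch.]

```python
import resource

# The soft limit is what the process actually runs into; the hard limit is
# the ceiling that `ulimit -n` can raise it to without root privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d, hard=%d" % (soft, hard))
```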

Re: pySpark memory usage

2014-05-12 Thread Matei Zaharia
Hey Jim, unfortunately external spilling is not implemented in Python right now. While it would be possible to update combineByKey to do smarter stuff here, one simple workaround you can try is to launch more map tasks (or more reduce tasks). To set the minimum number of map tasks, you can pass…
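
[Editor's note: a minimal sketch of what that workaround looks like in PySpark code, not taken from the thread. The master URL, input path, and parsing lambda are placeholders; the partition counts are the ones Jim reported above (400 for textFile(), 1500 for join()).]

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-tuning-sketch")

# More input splits mean smaller partitions, so each Python worker holds
# less data in memory at once. 400 is the value Jim reported for textFile().
records = sc.textFile("s3n://bucket/input", 400)  # hypothetical path

# Key each record; this parsing logic is a stand-in.
parsed = records.map(lambda line: (line.split("\t")[0], line))

# Passing the shuffle task count positionally sidesteps the parameter rename
# across releases (numSplits in older PySpark, numPartitions later).
joined = parsed.join(parsed, 1500)
print(joined.count())
```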

Re: pySpark memory usage

2014-05-12 Thread Jim Blomo
Thanks, Aaron, this looks like a good solution! Will be trying it out shortly.

I noticed that the S3 exception seems to occur more frequently when the box is swapping. Why is the box swapping? combineByKey seems to make the assumption that it can fit an entire partition in memory when doing the…

Re: pySpark memory usage

2014-05-04 Thread Aaron Davidson
I'd just like to update this thread by pointing to the PR based on our initial design: https://github.com/apache/spark/pull/640

This solution is a little more general and avoids catching IOException altogether. Long live exception propagation!

On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell wr…

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
Hey Jim, this IOException thing is a general issue that we need to fix, and your observation is spot-on. There is actually a JIRA for it that I created a few days ago: https://issues.apache.org/jira/browse/SPARK-1579

Aaron is assigned to that one but not actively working on it, so we'd welcome a P…

Re: pySpark memory usage

2014-04-28 Thread Jim Blomo
FYI, it looks like this "stdin writer to Python finished early" error was caused by a break in the connection to S3, from which the data was being pulled. A recent commit to PythonRDD noted that t…

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
This dataset is uncompressed text at ~54GB. stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343).

On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia wrote:
> Okay, thanks. Do you have any info on how large your records and data file
> are? I'd like to repro…
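
[Editor's note: output in that shape is what PySpark's RDD.stats() prints (a StatCounter). A sketch of how per-record length statistics like these might be gathered, with placeholder master URL and path:]

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "record-stats-sketch")

# One number per record (its length); RDD.stats() then returns a StatCounter
# with count, mean, stdev, max, and min computed in a single pass.
lengths = sc.textFile("s3n://bucket/input").map(lambda line: len(line))
print(lengths.stats())
```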

Re: pySpark memory usage

2014-04-09 Thread Matei Zaharia
Okay, thanks. Do you have any info on how large your records and data file are? I’d like to reproduce and fix this.

Matei

On Apr 9, 2014, at 3:52 PM, Jim Blomo wrote:
> Hi Matei, thanks for working with me to find these issues.
>
> To summarize, the issues I've seen are:
> 0.9.0:
> - https://…

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are:

0.9.0:
- https://issues.apache.org/jira/browse/SPARK-1323

SNAPSHOT 2014-03-18:
- When persist() is used and batchSize=1, java.lang.OutOfMemoryError: Java heap space. To me this indicates a memory leak…

Re: pySpark memory usage

2014-04-03 Thread Matei Zaharia
Cool, thanks for the update. Have you tried running a branch with this fix (e.g. branch-0.9, or the 0.9.1 release candidate)? Also, what memory leak issue are you referring to? Is it separate from this? (Couldn’t find it earlier in the thread.)

To turn on debug logging, copy conf/log4j.properti…
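
[Editor's note: the truncated instructions match Spark's usual logging setup; a sketch of the recipe, with the caveat that the exact logger line can differ by release:]

```
# From the Spark directory, start from the shipped template:
cp conf/log4j.properties.template conf/log4j.properties
# Then, in conf/log4j.properties, raise the root logger level, e.g.:
log4j.rootCategory=DEBUG, console
```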

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I think the problem I ran into in 0.9 is covered in https://issues.apache.org/jira/browse/SPARK-1323

When I kill the python process, the stacktrace I get indicates that this happens at initialization. It looks like the initial write to the Python process does not go through, and then the iterato…

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
I've only tried 0.9, in which I ran into the `stdin writer to Python finished early` error so frequently that I wasn't able to load even a 1GB file. Let me know if I can provide any other info!

On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia wrote:
> I see, did this also fail with previous versions of Spark…

Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll try to look into these; it seems like a serious error.

Matei

On Mar 27, 2014, at 7:27 PM, Jim Blomo wrote:
> Thanks, Matei. I am running "Spark 1.0.0-SNAPSHOT built for Hadoop
> 1.0.4" from GitHub on 2014-03-18.
>
> I…

Re: pySpark memory usage

2014-03-27 Thread Jim Blomo
Thanks, Matei. I am running "Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4" from GitHub on 2014-03-18.

I tried batchSizes of 512, 10, and 1, and each got me further, but none have succeeded. I can get this to work -- with manual interventions -- if I omit `parsed.persist(StorageLevel.MEMORY_AND_DISK…

Re: pySpark memory usage

2014-03-24 Thread Matei Zaharia
Hey Jim, in Spark 0.9 we added a “batchSize” parameter to PySpark that makes it group multiple objects together before passing them between Java and Python, but this may be too high by default. Try passing batchSize=10 to your SparkContext constructor to lower it (the default is 1024). Or even…
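
[Editor's note: a minimal sketch of that suggestion, assuming the 0.9-era PySpark constructor where batchSize is accepted as a keyword argument. The master URL and input path are placeholders; the persist() call mirrors the `parsed.persist(StorageLevel.MEMORY_AND_DISK...)` line from Jim's message above.]

```python
from pyspark import SparkContext, StorageLevel

# A smaller batchSize means fewer objects serialized per Java<->Python
# transfer, which lowers peak memory in the Python workers.
sc = SparkContext("local[4]", "batch-size-sketch", batchSize=10)

parsed = sc.textFile("s3n://bucket/input")  # hypothetical path
parsed.persist(StorageLevel.MEMORY_AND_DISK)
print(parsed.count())
```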