Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Hi, I'm having trouble compiling a snapshot; any advice would be appreciated. I get the error below when compiling either master or branch-1.1. The key error is, I believe, `[ERROR] File name too long`, but I don't understand what it refers to. Thanks! ./make-distribution.sh --tgz

Re: Command exited with code 137

2014-06-13 Thread Jim Blomo
I've seen these caused by the OOM killer. I recommend checking /var/log/syslog to see if it was activated due to lack of system memory. On Thu, Jun 12, 2014 at 11:45 PM, libl 271592...@qq.com wrote: I use standalone mode to submit tasks, but I often get an error. The stacktrace as 2014-06-12
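A minimal sketch of the check described above, assuming a Debian/Ubuntu-style /var/log/syslog (exit code 137 is 128 + SIGKILL, the signal the OOM killer sends); the helper name is illustrative:

```python
import re

def find_oom_kills(log_path="/var/log/syslog"):
    # OOM-killer activity typically shows up as "Out of memory", "oom-killer",
    # or "Killed process" lines; the exact wording varies by kernel/distro.
    pattern = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)
    with open(log_path) as log:
        return [line.rstrip() for line in log if pattern.search(line)]

if __name__ == "__main__":
    for line in find_oom_kills():
        print(line)
```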

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
I should add that I had to tweak the numbers a bit to stay above the swap threshold but below the `Too many open files` error (`ulimit -n` is 32768). On Wed, May 14, 2014 at 10:47 AM, Jim Blomo jim.bl...@gmail.com wrote: That worked amazingly well, thank you Matei! Numbers that worked for me were 400
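A minimal sketch of where a knob like this lives in pySpark, assuming the numbers being tweaked are the PythonRDD serialization batch sizes discussed elsewhere in this thread; the master URL, path, and the value 400 are illustrative, not the exact settings from the thread:

```python
from pyspark import SparkContext

# batchSize controls how many Python objects are pickled together before being
# handed to the JVM; smaller batches lower peak memory per task at the cost of
# more serialization calls.
sc = SparkContext("spark://master:7077", "memory-tuning-sketch", batchSize=400)
rdd = sc.textFile("hdfs:///data/events")
print(rdd.count())
```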

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
)). Matei On May 12, 2014, at 5:47 PM, Jim Blomo jim.bl...@gmail.com wrote: Thanks, Aaron, this looks like a good solution! I will be trying it out shortly. I noticed that the S3 exceptions seem to occur more frequently when the box is swapping. Why is the box swapping? combineByKey seems
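For readers unfamiliar with the operation being discussed, a minimal, self-contained combineByKey sketch computing a per-key average; the data is illustrative and unrelated to the thread:

```python
from pyspark import SparkContext

sc = SparkContext(appName="combine-sketch")
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 5)])
sums_counts = pairs.combineByKey(
    lambda v: (v, 1),                         # createCombiner: start (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue: fold a value into the pair
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners: merge partial pairs
)
averages = sums_counts.mapValues(lambda p: p[0] / float(p[1]))
print(averages.collect())
```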

Re: pySpark memory usage

2014-05-12 Thread Jim Blomo
there. The current proposal in the JIRA is somewhat complicated... - Patrick On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl...@gmail.com wrote: FYI, it looks like this `stdin writer to Python finished early` error was caused by a break in the connection to S3, from which the data was being

Re: Spark - ready for prime time?

2014-04-13 Thread Jim Blomo
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often
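For reference, a minimal pySpark sketch of the persistence level being described; the path and app name are illustrative:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-sketch")
rdd = sc.textFile("hdfs:///data/large-input")
# MEMORY_AND_DISK: partitions that don't fit in memory spill to disk
# instead of being recomputed on each action.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())
```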

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
are you giving to your executors, and does it show that much in the web UI? Matei On Mar 29, 2014, at 10:44 PM, Jim Blomo jim.bl...@gmail.com wrote: I think the problem I ran into in 0.9 is covered in https://issues.apache.org/jira/browse/SPARK-1323. When I kill the python process
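For context on the question above, a hedged sketch of one way executor memory is set from pySpark; the 4g value is illustrative, not the setting from the thread:

```python
from pyspark import SparkConf, SparkContext

# The per-executor value set here should match what the web UI
# reports for each executor.
conf = (SparkConf()
        .setAppName("executor-memory-sketch")
        .set("spark.executor.memory", "4g"))
sc = SparkContext(conf=conf)
```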

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
are? I'd like to reproduce and fix this. Matei On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote: Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are: 0.9.0: - https://issues.apache.org/jira/browse/SPARK-1323 SNAPSHOT 2014-03-18

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
with previous versions of Spark (0.9 or 0.8)? We'll try to look into these; it seems like a serious error. Matei On Mar 27, 2014, at 7:27 PM, Jim Blomo jim.bl...@gmail.com wrote: Thanks, Matei. I am running Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4 from GitHub on 2014-03-18. I tried batchSizes

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
, Mar 29, 2014 at 3:17 PM, Jim Blomo jim.bl...@gmail.com wrote: I've only tried 0.9, in which I ran into the `stdin writer to Python finished early` error so frequently that I wasn't able to load even a 1GB file. Let me know if I can provide any other info! On Thu, Mar 27, 2014 at 8:48 PM, Matei Zaharia

pySpark memory usage

2014-03-21 Thread Jim Blomo
Hi all, I'm wondering if there are any settings I can use to reduce the memory needed by the PythonRDD when computing simple stats. I am getting OutOfMemoryError exceptions while calculating count() on big, but not absurd, records. It seems like PythonRDD is trying to keep too many of these
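Not from the thread itself, but a hedged sketch of two knobs that commonly reduce PythonRDD memory pressure for a simple count(): a smaller pickle batch size and more input partitions. The path and the values 100 and 1000 are illustrative:

```python
from pyspark import SparkContext

# Smaller pickle batches and more input partitions both reduce the amount of
# data a single PythonRDD task holds in memory at once.
sc = SparkContext(appName="count-sketch", batchSize=100)
records = sc.textFile("s3n://bucket/big-records", 1000)  # 2nd arg: minimum partitions
print(records.count())
```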