Re: Pyspark Memory Woes

2014-03-12 Thread Aaron Olson
Hi Sandy,

We are, yes. I strongly suspect we're not partitioning our data properly, but maybe 1.5G is simply too small for our workload. I'll bump the executor memory and see if we get better results. It seems we should be setting it to (SPARK_WORKER_MEMORY + pyspark memory) / # of concurrent app
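
For what it's worth, here is a minimal sketch of what "bumping the executor memory" can look like from the pyspark side using a 0.9-style SparkConf; the 3g figure and the application name are placeholders, not values from this thread:

    from pyspark import SparkConf, SparkContext

    # Placeholder values for illustration only; size these for your own cluster.
    conf = (SparkConf()
            .setAppName("relational-modeling")         # hypothetical app name
            .set("spark.executor.memory", "3g")        # bumped up from 1500m
            .set("spark.default.parallelism", "1024"))
    sc = SparkContext(conf=conf)

Setting it per application this way avoids touching the cluster-wide SPARK_JAVA_OPTS for a single experiment.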

Re: Pyspark Memory Woes

2014-03-11 Thread Sandy Ryza
Are you aware that you get an executor (and the 1.5GB) per machine, not per core?

On Tue, Mar 11, 2014 at 12:52 PM, Aaron Olson wrote:
> Hi Sandy,
>
> We're configuring that with the JAVA_OPTS environment variable in
> $SPARK_HOME/spark-worker-env.sh like this:
>
> # JAVA OPTS
> export SPARK_JA
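
To put rough numbers on that point, using the figures quoted elsewhere in the thread (1.5G heap, 24 or 32 cores per node); tasks don't actually split the heap evenly, so this is only a back-of-envelope view:

    # One executor heap per machine, shared by every concurrently running task on it.
    executor_heap_mb = 1500     # spark.executor.memory
    concurrent_tasks = 24       # cores per node (24 or 32 in this cluster)
    print(executor_heap_mb / float(concurrent_tasks))   # roughly 62 MB of heap per running task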

Re: Pyspark Memory Woes

2014-03-11 Thread Aaron Olson
Hi Sandy,

We're configuring that with the JAVA_OPTS environment variable in $SPARK_HOME/spark-worker-env.sh like this:

# JAVA OPTS
export SPARK_JAVA_OPTS="-Dspark.ui.port=0 -Dspark.default.parallelism=1024 -Dspark.cores.max=256 -Dspark.executor.memory=1500m -Dspark.worker.timeout=500 -Dspark.akka

Re: Pyspark Memory Woes

2014-03-11 Thread Sandy Ryza
Hi Aaron,

When you say "Java heap space is 1.5G per worker, 24 or 32 cores across 46 nodes. It seems like we should have more than enough to do this comfortably.", how are you configuring this?

-Sandy

On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson wrote:
> Dear Sparkians,
>
> We are working on

Pyspark Memory Woes

2014-03-11 Thread Aaron Olson
Dear Sparkians,

We are working on a system to do relational modeling on top of Spark, all done in pyspark. We've been learning a lot about Spark internals along the way, but we're currently running into memory issues and wondering how best to profile them so we can fix them. Here are our symptoms:

- We're oper
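
The symptom list is cut off in the archive, but given the partitioning suspicion mentioned upthread, one cheap first profiling step is to check how evenly records are spread across partitions. A hedged sketch, where "rdd" stands in for whichever RDD is blowing up:

    # Count records per partition to spot skew; a single oversized partition
    # can exhaust a 1.5G executor heap on its own.
    sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    print(min(sizes), max(sizes), sum(sizes) / float(len(sizes)))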