Hi there -- I'm not sure I understand your problem: is it that Spark used *less* memory than the 2 GB? That out-of-memory message seems to be from your operating system's OOM killer, so maybe there were other things using RAM on that machine, or maybe Linux is configured to kill processes quickly when memory gets full.
When you're running PySpark, the underlying Spark process is unlikely to use a ton of memory unless you cache data, because it just pipes the data through to Python. However, it does launch one Python process per core, and those may be using a fair amount of RAM. If you'd like to decrease the memory usage per process, change the reduceByKey(add) call in wordcount.py to use more reduce tasks by passing a second argument (for example, reduceByKey(add, 20) will have it use 20 parallel tasks). Likewise, you can set a "minimum number of tasks" value on the textFile call; it's 1 by default, but you can increase it to, say, 100 to make sure there are at least 100 map tasks. Both changes make the load per task smaller. (There's a short sketch of both changes after the quoted message below.)

Matei

On Oct 15, 2013, at 7:29 AM, eshishki <[email protected]> wrote:

> Hello,
>
> I set up spark-0.8.0-incubating-bin-cdh4 on a 5-node cluster.
>
> I limited SPARK_WORKER_MEMORY to 2g, and there are 4 cores per node, so I
> expected total memory consumption by Spark to be 512 MB + 2 GB.
> The Spark web UI shows: Memory: 10.0 GB Total, 0.0 B Used
>
> Then I tried to run the simple wordcount.py from the examples on an HDFS file
> whose size is 11 GB.
> Spark launched 4 workers per node and did not limit its total size to 2 GB --
> top showed RES consumption of about 750 MB, and then:
>
> Out of memory: Kill process 26336 (python) score 97 or sacrifice child
> Killed process 26336, UID 500, (python) total-vm:969696kB, anon-rss:782976kB,
> file-rss:196kB
>
> and in the logs:
>
> INFO cluster.ClusterTaskSetManager: Loss was due to
> org.apache.spark.SparkException
> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:167)
>     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:173)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:116)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>     at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:193)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
>     at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
>     at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
>
> So I could not finish the job. Yes, Spark resubmitted the tasks, but they kept
> getting OOM-killed.
>
> Against a smaller file Spark was doing fine.
>
> So the question is: why does Spark not limit its memory accordingly, and how
> can I analyze files larger than RAM with it?
>
> Thanks.
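P.S. Here is a minimal sketch of what those two changes might look like in wordcount.py, assuming the standard example layout from the 0.8 distribution; the master URL, input path, and the specific values of 100 splits and 20 reduce tasks are just illustrative, so adjust them to your cluster:

# Sketch of wordcount.py with smaller per-task load (illustrative values).
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(sys.argv[1], "PythonWordCount")

    # Ask for at least 100 input splits so each map task reads a smaller
    # chunk of the 11 GB file.
    lines = sc.textFile(sys.argv[2], 100)

    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add, 20)  # 20 reduce tasks instead of the default

    for (word, count) in counts.collect():
        print("%s: %i" % (word, count))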
