You can actually look at the code here: https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/python/pyspark/rdd.py#L1825
The _memory_limit function returns the amount of memory you set with spark.python.worker.memory, and that limit is used for groupBy and similar aggregation operations. (A short configuration sketch is included after the quoted message below.)

Thanks
Best Regards

On Fri, Oct 23, 2015 at 11:46 PM, Connor Zanin <cnnr...@udel.edu> wrote:

> Hi all,
>
> I am running a simple word count job on a cluster of 4 nodes (24 cores
> per node). I am varying two parameters in the configuration:
> spark.python.worker.memory and the number of partitions in the RDD. My
> job is written in Python.
>
> I am observing a discontinuity in the job's run time when
> spark.python.worker.memory is increased past a threshold. Unfortunately,
> I am having trouble understanding exactly what this parameter does
> inside Spark and how it changes Spark's behavior to create this
> discontinuity.
>
> The documentation describes this parameter as "Amount of memory to use
> per python worker process during aggregation," but I find this vague
> (or I do not know enough Spark terminology to understand what it means).
>
> I have been pointed to the source code in the past, specifically the
> shuffle.py file where _spill() appears.
>
> Can anyone explain how this parameter behaves, or point me to more
> descriptive documentation? Thanks!
>
> --
> Regards,
>
> Connor Zanin
> Computer Science
> University of Delaware
>
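For anyone following the thread, here is a minimal sketch of how this setting is typically applied from a PySpark word-count job. The app name, memory value, partition count, and input/output paths are placeholders, not taken from Connor's setup; the comments reflect my reading of pyspark/shuffle.py (the merger keeps aggregation data in memory up to this limit and spills to disk via _spill() once it is exceeded), so treat them as a sketch of the mechanism rather than the exact internals.

```python
from pyspark import SparkConf, SparkContext

# Placeholder configuration -- the app name and memory value are illustrative.
conf = (SparkConf()
        .setAppName("wordcount-memory-sweep")
        # Per-Python-worker memory used during aggregation (e.g. groupByKey,
        # reduceByKey). Once a worker's in-memory aggregation exceeds this
        # amount, it starts spilling to disk (the _spill() path in shuffle.py).
        .set("spark.python.worker.memory", "512m"))

sc = SparkContext(conf=conf)

# Simple word count; the number of partitions is the second parameter being
# varied in the original experiment.
num_partitions = 96  # placeholder value
counts = (sc.textFile("hdfs:///path/to/input")       # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b, numPartitions=num_partitions))

counts.saveAsTextFile("hdfs:///path/to/output")      # placeholder path
sc.stop()
```

Whether the run-time discontinuity lines up with the point where spilling starts or stops is only a guess on my part, but sweeping spark.python.worker.memory with a job like the one above and watching the executor logs for spill messages should make it visible.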