Hi all,

I am running a simple word count job on a cluster of 4 nodes (24 cores per
node). I am varying two parameters in the configuration,
spark.python.worker.memory and the number of partitions in the RDD. My job
is written in Python.
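
For concreteness, here is a stripped-down sketch of how I am setting these
in the driver (the input/output paths, the "512m" value, and the partition
count are just placeholders; I sweep over the memory and partition settings
between runs):

    from pyspark import SparkConf, SparkContext

    # Set spark.python.worker.memory on the application's configuration.
    conf = (SparkConf()
            .setAppName("wordcount")
            .set("spark.python.worker.memory", "512m"))
    sc = SparkContext(conf=conf)

    num_partitions = 96  # e.g. 4 nodes * 24 cores

    counts = (sc.textFile("hdfs:///path/to/input", minPartitions=num_partitions)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b, numPartitions=num_partitions))

    counts.saveAsTextFile("hdfs:///path/to/output")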

I am observing a discontinuity in the run time of the job when
spark.python.worker.memory is increased past a threshold. Unfortunately, I
am having trouble understanding exactly what this parameter does internally
and how it changes Spark's behavior to create this discontinuity.

The documentation describes this parameter as "Amount of memory to use per
python worker process during aggregation," but I find this vague (or I do
not know enough Spark terminology to understand what it means).

I have previously been pointed to the source code, specifically the
shuffle.py file where _spill() is defined.

Can anyone explain how this parameter behaves or point me to more
descriptive documentation? Thanks!

-- 
Regards,

Connor Zanin
Computer Science
University of Delaware
