Hi everyone,

I'm using Spark on machines where I can't change the maximum number of open
files. As a result, I'm limiting the number of reducers to 500. I'm also
only using a single machine that has 32 cores and emulating a cluster by
running 4 worker daemons with 8 cores (maximum) each.

What I'm noticing is that as I allocate more cores to my Spark job, the job
does not complete much faster, and the CPUs aren't heavily utilized. When I
attempt to use all 32 cores, mpstat shows the CPUs sitting 70%-80% idle.

My job consists primarily of reduceByKey and groupByKey operations, with
flatMaps, maps, and filters in between.

I was wondering: *what exactly does the number of reducers do?* I would have
expected that, with 32 cores, setting the number of reducers to 500 would
produce more than enough tasks to keep all 32 cores busy. Or am I
misunderstanding the concept of what a reducer is?
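
To make my expectation concrete, here is the rough arithmetic I have in
mind. It's only a sketch: it assumes each core runs one reduce task at a
time and that all tasks take about the same time, neither of which I have
verified. The helper name task_waves is made up for illustration and is
not part of Spark.

```python
import math
from typing import Tuple


def task_waves(num_tasks: int, num_cores: int) -> Tuple[int, int]:
    """Return (number of waves, tasks in the final wave).

    Assumption (mine, not Spark's documented behavior): every core runs
    exactly one task at a time and all tasks have equal duration, so the
    tasks are processed in successive "waves" of at most num_cores tasks.
    """
    if num_tasks <= 0 or num_cores <= 0:
        raise ValueError("num_tasks and num_cores must be positive")
    waves = math.ceil(num_tasks / num_cores)
    # Tasks left over for the final wave after all the full waves.
    last_wave = num_tasks - (waves - 1) * num_cores
    return waves, last_wave


if __name__ == "__main__":
    waves, last_wave = task_waves(500, 32)
    print(f"{waves} waves, {last_wave} tasks in the last wave")
```

Under those assumptions, 500 reducers on 32 cores works out to 16 waves,
with only the last wave (20 tasks) leaving some cores idle. That is why I
expected close to full utilization rather than 70%-80% idle.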

Thanks,

-Matt Ceah
