Hi everyone,

I'm using Spark on machines where I can't change the maximum number of open files, so I'm limiting the number of reducers to 500. I'm also running everything on a single machine with 32 cores, emulating a cluster by running 4 worker daemons with up to 8 cores each.
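In case it helps, my setup is roughly equivalent to the following sketch; the app name and the per-worker memory are placeholders, and I'm using the local-cluster master URL as an approximation of my 4 workers:

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch of my setup. local-cluster[workers, coresPerWorker,
    // memoryPerWorkerMB] spins up worker daemons on a single machine;
    // the app name and the 8192 MB per worker below are made up.
    val conf = new SparkConf()
      .setMaster("local-cluster[4,8,8192]")
      .setAppName("my-job")
    val sc = new SparkContext(conf)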
What I'm noticing is that as I allocate more cores to my job, it doesn't complete much more quickly, and the CPUs aren't heavily utilized: when I use all 32 cores, mpstat shows 70-80% idle. The job consists primarily of reduceByKey and groupByKey operations, with flatMaps, maps, and filters in between (a made-up sketch of the pipeline is in the P.S. below).

So I was wondering: *what exactly does the number of reducers do?* My understanding is that with 500 reducers there should be far more tasks than 32 cores, so all of them should stay busy. Or am I misunderstanding what a reducer is?

Thanks,
-Matt Ceah
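P.S. In case the job structure matters, here's a minimal made-up sketch with the same shape as my pipeline. The input path, parsing, and keys are all placeholders; the 500 passed as the second argument to reduceByKey/groupByKey is what I've been calling the number of reducers:

    // Hypothetical pipeline with the same shape as my job. The second
    // argument to reduceByKey/groupByKey sets the number of partitions
    // (and hence tasks) on the reduce side of the shuffle.
    val records = sc.textFile("hdfs://...")        // placeholder path
      .flatMap(line => line.split("\t"))
      .filter(_.nonEmpty)
      .map(field => (field, 1L))

    val counts  = records.reduceByKey(_ + _, 500)  // 500 reduce partitions
    val grouped = records.groupByKey(500)          // likewise for groupByKey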