Hi Shahab,

Are you running Spark in Local, Standalone, YARN, or Mesos mode?
If you're running in Standalone/YARN/Mesos mode, then the .count() action is indeed automatically parallelized across multiple executors. When you run .count() on an RDD, Spark distributes tasks to the executors, each of which performs a local count on its local partitions; the tasks then send their sub-counts back to the driver for the final aggregation. This sounds like the kind of behavior you're looking for.

In Local mode, however, everything runs in a single JVM (the driver plus executor), so there's no parallelization across executors.

On Thu, Oct 30, 2014 at 10:25 AM, shahab <shahab.mok...@gmail.com> wrote:
> Hi,
>
> I noticed that the "count" (of an RDD) in many of my queries is the most
> time-consuming one, as it runs in the "driver" process rather than being done by
> parallel worker nodes.
>
> Is there any way to perform "count" in parallel, or at least parallelize
> it as much as possible?
>
> best,
> /Shahab
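To illustrate the pattern described above (each executor counts its own partitions locally, and the driver only sums the sub-counts), here is a rough sketch in plain Python using multiprocessing in place of Spark executors. This is not Spark code, just an assumed analogy: the worker processes stand in for executors, and the final sum stands in for the driver-side aggregation.

```python
# Sketch: per-partition local counts, aggregated by the "driver".
# multiprocessing.Pool plays the role of Spark executors here.
from multiprocessing import Pool


def count_partition(partition):
    # Local count on one partition, analogous to what each Spark task does.
    return len(partition)


if __name__ == "__main__":
    data = list(range(10))
    # Split the data into 4 partitions, as Spark would across executors.
    partitions = [data[i::4] for i in range(4)]
    with Pool(4) as pool:
        sub_counts = pool.map(count_partition, partitions)
    # "Driver-side" final aggregation of the sub-counts.
    total = sum(sub_counts)
    print(total)  # prints 10
```

The key point is that only small integers (the sub-counts) travel back to the driver, not the data itself, which is why .count() scales with the number of executors rather than running serially in the driver.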