By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it, as that could make a big speed improvement when the same RDD is counted or queried more than once.
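For example, a minimal sketch (the RDD name and input path are just placeholders, and sc is assumed to be the SparkContext, as in spark-shell):

    // Cache the RDD so later actions reuse the in-memory partitions
    // instead of recomputing the whole lineage each time.
    val events = sc.textFile("hdfs:///path/to/input")  // placeholder input path
    events.cache()                                     // mark the RDD for in-memory storage
    val total = events.count()                         // first action materializes the cache
    val errors = events.filter(_.contains("ERROR")).count()  // later actions hit the cache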
On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui <[email protected]> wrote:

> Hi Shahab,
>
> Are you running Spark in Local, Standalone, YARN or Mesos mode?
>
> If you're running in Standalone/YARN/Mesos, then the .count() action is
> indeed automatically parallelized across multiple Executors.
>
> When you run a .count() on an RDD, it actually distributes tasks to
> different executors that each do a local count on a local partition, and
> then all the tasks send their sub-counts back to the driver for final
> aggregation. This sounds like the kind of behavior you're looking for.
>
> However, in Local mode, everything runs in a single JVM (the driver +
> executor), so there's no parallelization across Executors.
>
>
> On Thu, Oct 30, 2014 at 10:25 AM, shahab <[email protected]> wrote:
>
>> Hi,
>>
>> I noticed that the "count" (of RDD) in many of my queries is the most
>> time-consuming one, as it runs in the "driver" process rather than being
>> done by parallel worker nodes.
>>
>> Is there any way to perform "count" in parallel, or at least parallelize
>> it as much as possible?
>>
>> best,
>> /Shahab
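For what it's worth, the parallel counting Sameer describes above is roughly equivalent to the following sketch (rdd is a placeholder name; this is not Spark's literal implementation of count()):

    // Each task counts its own partition locally on an executor...
    val subCounts = rdd.mapPartitions(iter => Iterator(iter.size.toLong))
    // ...and the driver only sums the per-partition sub-counts.
    val total = subCounts.collect().sum

Only that final sum happens in the driver; the heavy per-partition work runs on the executors (unless you are in Local mode, as noted above).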
