Hi Shahab,

Are you running Spark in Local, Standalone, YARN or Mesos mode?

If you're running in Standalone/YARN/Mesos, then the .count() action is
indeed automatically parallelized across multiple Executors.

When you run .count() on an RDD, Spark distributes one task per partition to
the executors; each task counts its own partition locally, and the
sub-counts are sent back to the driver, which sums them into the final
result. That sounds like exactly the behavior you're looking for.

However, in Local mode everything runs in a single JVM (driver and executor
together), so there's no parallelization across Executors. Note that
"local[N]" will still run up to N tasks concurrently on separate threads
within that one JVM, which may be all the parallelism you need on a single
machine.
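For reference, the master URL you pass at submit time is what determines
the mode (the host/port below is a placeholder for your own cluster):

```shell
# Local mode, single thread (no task parallelism):
spark-submit --master local my_app.py

# Local mode with 4 worker threads (tasks run in parallel within one JVM):
spark-submit --master "local[4]" my_app.py

# Standalone cluster (tasks parallelized across Executors):
spark-submit --master spark://host:7077 my_app.py
```

You can check which mode you're actually in on the "Environment" tab of
the Spark web UI (the spark.master property).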



On Thu, Oct 30, 2014 at 10:25 AM, shahab <shahab.mok...@gmail.com> wrote:

> Hi,
>
> I noticed that the "count" (of RDD) in many of my queries is the most
> time-consuming one, as it seems to run in the "driver" process rather
> than being done by parallel worker nodes.
>
> Is there any way to perform "count" in parallel, or at least parallelize
> it as much as possible?
>
> best,
> /Shahab
>
