By the way, in case you haven't done so already, do try to .cache() the RDD
before running .count() on it; if the RDD is used more than once, that can
make a big speed improvement.
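
For example, something like this in the spark-shell (just a sketch, assuming
the built-in SparkContext `sc` and a made-up input path):

    val rdd = sc.textFile("hdfs:///some/path")  // hypothetical path
    rdd.cache()           // mark the RDD for in-memory caching

    val n1 = rdd.count()  // first action reads the data and populates the cache
    val n2 = rdd.count()  // later actions reuse the cached partitions and run much faster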



On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui <[email protected]>
wrote:

> Hi Shahab,
>
> Are you running Spark in Local, Standalone, YARN or Mesos mode?
>
> If you're running in Standalone/YARN/Mesos, then the .count() action is
> indeed automatically parallelized across multiple Executors.
>
> When you run a .count() on an RDD, Spark actually distributes tasks to
> different executors; each task does a local count on its own partition, and
> then the tasks send their sub-counts back to the driver for the final
> aggregation. This sounds like the kind of behavior you're looking for.
>
> However, in Local mode, everything runs in a single JVM (the driver +
> executor), so there's no parallelization across Executors.
>
>
>
> On Thu, Oct 30, 2014 at 10:25 AM, shahab <[email protected]> wrote:
>
>> Hi,
>>
>> I noticed that the "count" (of an RDD) in many of my queries is the most
>> time-consuming step, as it seems to run in the "driver" process rather than
>> being done in parallel by the worker nodes.
>>
>> Is there any way to perform "count" in parallel, or at least parallelize
>> it as much as possible?
>>
>> best,
>> /Shahab
>>
>
>

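To illustrate the count mechanics Sameer describes above: a .count() is
roughly equivalent to counting each partition locally and summing the
sub-counts. A rough sketch (not the actual implementation), assuming an
existing RDD `rdd`:

    // one local count per partition, computed on the executors
    val subCounts = rdd.mapPartitions(iter => Iterator(iter.size.toLong))
    // sub-counts combined into the final total (what .count() returns)
    val total = subCounts.reduce(_ + _)
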