Hey Sameer,

Wouldn't local[x] run count in parallel across each of the x threads?
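For what it's worth, the per-partition counting Sameer describes can be sketched in plain Python (this is just a conceptual stand-in, not the Spark API; the thread pool mimics the x worker threads of local[x]):

```python
from concurrent.futures import ThreadPoolExecutor

def count_rdd(partitions, num_threads=4):
    # Each "task" counts its own partition locally, like an executor
    # (or a local[x] worker thread) would...
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sub_counts = list(pool.map(lambda part: sum(1 for _ in part), partitions))
    # ...and the driver aggregates the sub-counts into the final total.
    return sum(sub_counts)

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(count_rdd(partitions))  # → 9
```

So the expensive per-element work happens in parallel; only the tiny sub-count integers are summed at the "driver".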

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui <[email protected]>
wrote:

> Hi Shahab,
>
> Are you running Spark in Local, Standalone, YARN or Mesos mode?
>
> If you're running in Standalone/YARN/Mesos, then the .count() action is
> indeed automatically parallelized across multiple Executors.
>
> When you run a .count() on an RDD, Spark distributes tasks to the
> executors; each task does a local count on its own partition, and then
> all the tasks send their sub-counts back to the driver for final
> aggregation. This sounds like the kind of behavior you're looking for.
>
> However, in Local mode, everything runs in a single JVM (the driver +
> executor), so there's no parallelization across Executors.
>
>
>
> On Thu, Oct 30, 2014 at 10:25 AM, shahab <[email protected]> wrote:
>
>> Hi,
>>
>> I noticed that the "count" (of an RDD) in many of my queries is the most
>> time-consuming step, as it seems to run in the "driver" process rather
>> than being done in parallel by the worker nodes.
>>
>> Is there any way to perform "count" in parallel, or at least parallelize
>> it as much as possible?
>>
>> best,
>> /Shahab
>>
>
>
