Yes, these datasets are in HDFS.

Earlier that task completed in 25 mins.
Now it's 15 + 20.
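
For context, this is the shape of the change (a minimal sketch; details and
NULL_VALUE are the ones from the snippet further down the thread, and 500 is
just an illustrative partition count, not a recommendation):

// Repartitioning before the filter/map adds one shuffle stage, but the
// downstream transformations then run with the higher task count.
val viEvents = details
  .repartition(500)                                  // extra stage: the shuffle
  .filter(_.get(14).asInstanceOf[Long] != NULL_VALUE)
  .map { vi => (vi.get(14).asInstanceOf[Long], vi) } // runs as 500 tasks

So the two stages observed (repartition, then filter/map) are expected: the
shuffle is the price paid for the increased parallelism afterwards.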

On Wed, Jun 24, 2015 at 3:16 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>   Yes, it will introduce a shuffle stage in order to perform the
> repartitioning. So it’s more useful if you’re planning to do many
> downstream transformations for which you need the increased parallelism.
>
>  Is this a dataset from HDFS?
>
>   From: "ÐΞ€ρ@Ҝ (๏̯͡๏)"
> Date: Wednesday, June 24, 2015 at 6:11 PM
> To: Silvio Fiorito
> Cc: user
> Subject: Re: how to increase parallelism ?
>
>   What that did was run a repartition with 174 tasks
>
>  repartition with 174 tasks
> AND
> actual .filter.map stage with 500 tasks
>
>  It actually doubled the stages.
>
>
>
> On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>>   Hi Deepak,
>>
>>  Parallelism is controlled by the number of partitions. In this case,
>> how many partitions does the details RDD have? Likely 170.
>>
>>  You can check by running “details.partitions.length”. If you want to
>> increase parallelism you can do so by repartitioning, increasing the number
>> of partitions: “details.repartition(xxxx)”
>>
>>  Thanks,
>> Silvio
>>
>>   From: "ÐΞ€ρ@Ҝ (๏̯͡๏)"
>> Date: Wednesday, June 24, 2015 at 1:57 PM
>> To: user
>> Subject: how to increase parallelism ?
>>
>>   I have a filter.map that triggers 170 tasks. How can I increase it?
>>
>>  Code:
>>
>> val viEvents = details
>>   .filter(_.get(14).asInstanceOf[Long] != NULL_VALUE)
>>   .map { vi => (vi.get(14).asInstanceOf[Long], vi) }
>>
>>
>>  Deepak
>>
>>
>
>
>  --
>  Deepak
>
>


-- 
Deepak
