I did a repartition to 10000 (hardcoded) before the keyBy and it now finishes
in 1.2 minutes.
The questions remain open, because I don't want to hardcode the parallelism;
what I would like is something closer to the sketch below.
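A minimal sketch of what I mean, assuming `records` is the RDD produced by the
previous stage and `r.id` is the key (both are placeholder names):

// Derive the target partition count at runtime instead of hardcoding 10000.
// `records` and `r.id` are placeholders for my actual RDD and key.
val targetPartitions = math.max(records.getNumPartitions, sc.defaultParallelism)

val keyed = records
  .repartition(targetPartitions)  // was .repartition(10000)
  .keyBy(r => r.id)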

On Fri, Feb 8, 2019 at 12:50, Pedro Tuero (tuerope...@gmail.com)
wrote:

> 128 is the default parallelism defined for the cluster.
> The question now is why the keyBy operation is using the default parallelism
> instead of the number of partitions of the RDD created by the previous step
> (5580).
> Any clues?
>
> On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (
> tuerope...@gmail.com) wrote:
>
>> Hi,
>> I am running a job in Spark (using AWS EMR) and some stages are taking a
>> lot longer with Spark 2.4 than with Spark 2.3.1:
>>
>> Spark 2.4:
>> [image: image.png]
>>
>> Spark 2.3.1:
>> [image: image.png]
>>
>> With Spark 2.4, the keyBy operation takes more than 10x what it took with
>> Spark 2.3.1.
>> It seems to be related to the number of tasks / partitions.
>>
>> Questions:
>> - Isn't the number of tasks of a job supposed to be related to the number
>> of partitions of the RDD left by the previous job? Did that change in
>> version 2.4?
>> - Which tools / configuration may I try to reduce this severe performance
>> regression?
>>
>> Thanks.
>> Pedro.
>>
>
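Regarding the questions quoted above, another thing I plan to try is giving
the shuffle an explicit partitioner sized from the parent RDD, so the number
of reduce tasks does not depend on spark.default.parallelism at all. A rough
sketch, again with placeholder names (the groupByKey stands in for whatever
operation actually triggers the shuffle after the keyBy):

import org.apache.spark.HashPartitioner

// Size the shuffle explicitly from the parent RDD so it cannot fall back
// to spark.default.parallelism (128 on this cluster).
val keyed = records.keyBy(r => r.id)
val grouped = keyed.groupByKey(new HashPartitioner(records.getNumPartitions))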
