I did a repartition to 10000 (hardcoded) before the keyBy and it now ends in 1.2 minutes. The questions remain open, because I don't want to hardcode the parallelism.
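In case it helps, a minimal sketch of that workaround; the input path and the key-extraction function are placeholders, not the real job, and the second variant is just one possible way to avoid the literal, by reading the upstream partition count and passing it explicitly to the shuffle that follows keyBy:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: input path and key function are hypothetical.
object KeyByRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("keyBy-repartition-sketch"))

    val records = sc.textFile("s3://my-bucket/input/")      // hypothetical input

    // Workaround as applied: force a fixed partition count before keyBy.
    val keyed = records
      .repartition(10000)                                   // hardcoded today
      .keyBy(line => line.split(",")(0))                    // hypothetical key

    // One way to drop the literal: derive the target from the upstream RDD
    // and pass it explicitly to the shuffle that follows keyBy.
    val target = records.getNumPartitions
    val counted = records
      .keyBy(line => line.split(",")(0))
      .mapValues(_ => 1L)
      .reduceByKey(_ + _, target)                           // explicit numPartitions

    println(s"keyed: ${keyed.getNumPartitions}, counted: ${counted.getNumPartitions}")
    sc.stop()
  }
}
```

Passing numPartitions explicitly sidesteps whatever default the partitioner falls back to, at the cost of wiring the value through the job.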
On Fri, Feb 8, 2019 at 12:50, Pedro Tuero (tuerope...@gmail.com) wrote:

> 128 is the default parallelism defined for the cluster.
> The question now is why the keyBy operation is using the default parallelism
> instead of the number of partitions of the RDD created by the previous step
> (5580).
> Any clues?
>
> On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (tuerope...@gmail.com) wrote:
>
>> Hi,
>> I am running a job in Spark (on AWS EMR) and some stages are taking a
>> lot longer with Spark 2.4 than with Spark 2.3.1:
>>
>> Spark 2.4:
>> [image: image.png]
>>
>> Spark 2.3.1:
>> [image: image.png]
>>
>> With Spark 2.4, the keyBy operation takes more than 10x what it took with
>> Spark 2.3.1.
>> It seems to be related to the number of tasks / partitions.
>>
>> Questions:
>> - Isn't the number of tasks of a job supposed to be related to the number
>> of partitions of the RDD left by the previous job? Did that change in
>> version 2.4?
>> - Which tools / configuration may I try to reduce this severe performance
>> regression?
>>
>> Thanks.
>> Pedro.
>>
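For the configuration question in the quoted message, one thing that may be worth trying is pinning spark.default.parallelism so shuffles that get no explicit partition count do not fall back to the cluster default of 128. This is only a sketch under an assumption about the cause, not a confirmed fix; the value 5580 is just the partition count mentioned above, and on EMR the same setting can also be passed as --conf spark.default.parallelism=5580 at spark-submit time instead of in code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pin the default parallelism instead of repartitioning by hand.
object DefaultParallelismSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("default-parallelism-sketch")
      .set("spark.default.parallelism", "5580")   // value taken from the thread
    val sc = new SparkContext(conf)

    // Toy pipeline: shuffles that receive no explicit numPartitions argument
    // pick up the configured default instead of the cluster's 128.
    val pairs = sc.parallelize(1 to 1000000).keyBy(_ % 100)
    val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
    println(s"reduceByKey partitions: ${counts.getNumPartitions}")

    sc.stop()
  }
}
```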