Hi, I am running a job in Spark (on AWS EMR) and some stages are taking a lot longer with Spark 2.4 than they did with Spark 2.3.1:
Spark 2.4: [image: image.png]
Spark 2.3.1: [image: image.png]

With Spark 2.4, the keyBy operation takes more than 10x as long as it did with Spark 2.3.1. It seems to be related to the number of tasks / partitions.

Questions:
- Isn't the number of tasks of a stage supposed to match the number of partitions of the RDD produced by the previous stage? Did that change in version 2.4?
- Which tools or configuration settings could I try in order to reduce this severe performance regression?

Thanks.
Pedro.
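P.S. In case it helps, here is a minimal sketch of the kind of thing I am checking (the input path, key function, partition counts, and class name below are placeholders, not my actual job): it prints the partition count before and after keyBy, and shows the kind of explicit repartitioning / parallelism setting I could try as a workaround.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object KeyByPartitionsCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("keyBy-partitions-check")
      // Candidate knob: pin the default parallelism instead of relying on EMR defaults.
      .config("spark.default.parallelism", "200")
      .getOrCreate()
    val sc = spark.sparkContext

    // Placeholder input path and key function.
    val records = sc.textFile("s3://my-bucket/input/")
    println(s"partitions before keyBy: ${records.getNumPartitions}")

    val keyed = records.keyBy(line => line.split(",")(0))
    println(s"partitions after keyBy:  ${keyed.getNumPartitions}")

    // Possible workaround: force a known partition count with an explicit
    // partitioner so the downstream stage runs a predictable number of tasks.
    val repartitioned = keyed.partitionBy(new HashPartitioner(200))
    println(s"partitions after partitionBy: ${repartitioned.getNumPartitions}")

    repartitioned.count()
    spark.stop()
  }
}
```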