Hi, I have a simple piece of code that does a groupBy, an agg count, a sort, etc. This code takes about 5 minutes to finish on Spark 3.1.x. However, the same code, with the same dataset and the same SparkSession (configs), finishes within a minute on Spark 3.0.2. That is a difference of more than 5x.
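The actual code and dataset are in the GitHub issue linked below; purely as an illustration of the kind of pipeline described above, a groupBy / count / sort job looks roughly like this (the DataFrame `df` and the column name "token" are hypothetical, not taken from the issue):

```scala
import org.apache.spark.sql.functions.{count, desc}

// Hypothetical sketch of the groupBy / agg count / sort shape described above.
// "df" and the "token" column are placeholders for the data in the linked issue.
val counts = df
  .groupBy("token")
  .agg(count("*").as("cnt"))
  .orderBy(desc("cnt"))

counts.show()
```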
My SparkSession (the same settings are used when passed via --conf):

val spark: SparkSession = SparkSession
  .builder()
  .appName("test")
  .master("local[*]")
  .config("spark.driver.memory", "16G")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.kryoserializer.buffer.max", "200M")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

Environments in which I tested both 3.1.1 and 3.0.2:
- IntelliJ
- spark-shell
- pyspark shell
- pure Python with PyPI pyspark

The code, dataset, and initial report for reproducibility:
https://github.com/JohnSnowLabs/spark-nlp/issues/2739#issuecomment-815635930

I have observed that in Spark 3.1.1, only 2 tasks do the majority of the processing; the work is not evenly distributed as one would expect for a 12-partition DataFrame:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png>

However, without changing a single line of code or anything in the environment, Spark 3.0.2 distributes the tasks evenly at the same point and everything runs in parallel:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png>

Is there a new feature or a new config in Spark 3.1.1 that causes this unbalanced task execution, which wasn't there in Spark 2.4.x and 3.0.x? (I have read the migration guide but could not find anything relevant: https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31)
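One way to quantify the skew visible in the first screenshot is to count the rows that land in each partition after the shuffle. A minimal diagnostic sketch, assuming `df` is the 12-partition DataFrame from the issue (the variable name is an assumption):

```scala
// Count rows per partition to check whether the shuffle output is balanced.
// If only 2 of the 12 partitions hold most of the rows, the stage will be
// dominated by 2 long-running tasks, matching the Spark 3.1.1 screenshot.
val perPartition = df.rdd
  .mapPartitionsWithIndex { case (idx, rows) => Iterator((idx, rows.size)) }
  .collect()

perPartition.foreach { case (idx, n) => println(s"partition $idx: $n rows") }
```

Running this under both Spark versions would show whether the slowdown comes from data skew in the shuffled partitions or from something else (e.g. scheduling).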