Hi Mich,

Thanks for the reply.
I have tried to minimize as much as possible the effect of other factors between pyspark==3.0.2 and pyspark==3.1.1, including not reading CSV or gz files and just reading the Parquet. Here is the code, purely in PySpark (nothing else included); it finishes in 47 seconds on pyspark 3.1.1 and in 15 seconds on pyspark 3.0.2, so the performance hit is still very large:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .getOrCreate()

Toys = spark.read \
    .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)

top50k.show()

This is a local test: two conda environments, one for pyspark==3.0.2 and one for pyspark==3.1.1, with the same dataset, same code, and same session settings. I think this is a very easy way to reproduce the issue without involving any third-party libraries.

The two screenshots pinpoint the issue: 3.0.2 runs 12 tasks in parallel, while 3.1.1 also launches 12 tasks but 10 of them finish immediately and the other 2 keep processing. (Also, CPU usage is at full capacity on 3.0.2 but very low on 3.1.1.) Something is different in Spark/PySpark 3.1.1; I am not sure whether it is about the partitions, the groupBy, the limit, or just a conf being enabled or disabled by default in 3.1.1 that results in these performance differences.
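One way to rule out a silent conf change would be to dump the effective partitioning and SQL confs in both environments and diff the output. This is only a sketch of what could be checked (reusing the spark and Toys names from the code above), not something I have verified as the cause:

# Run this in BOTH conda environments (3.0.2 and 3.1.1) and diff the output.

# how many partitions the scan/repartition really produced
print("input partitions:", Toys.rdd.getNumPartitions())

# how many partitions the groupBy shuffle will use, and whether adaptive
# query execution (which can coalesce small post-shuffle partitions) is on
print("shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions"))
print("AQE enabled:", spark.conf.get("spark.sql.adaptive.enabled"))

# dump all non-internal SQL confs with their current values for diffing
for row in spark.sql("SET -v").collect():
    print(row["key"], "=", row["value"])

Diffing the two outputs should at least show whether the shuffle side of the groupBy is configured differently between 3.0.2 and 3.1.1.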