Hi Mich,

Thanks for the reply.

I have tried to minimize as much as possible the effect of other factors
between pyspark==3.0.2 and pyspark==3.1.1, including skipping the CSV and
gzip reads and only reading the Parquet.

Here is the code, purely in PySpark (nothing else included). It finishes in
47 seconds on pyspark 3.1.1 and in 15 seconds on pyspark 3.0.2, so the
performance hit is still very large:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, explode
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000m") \
    .getOrCreate()

# read the Parquet dataset and spread it over 12 partitions
Toys = spark.read.parquet('./toys-cleaned').repartition(12)

# tokenize the review text
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

# explode tokens into one word per row
all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by word, count, sort descending, and limit to the top 50k
top50k = all_words.groupBy("word") \
    .agg(count("*").alias("total")) \
    .sort(col("total").desc()) \
    .limit(50000)

top50k.show()

This is a local test: just two conda environments, one for pyspark==3.0.2
and one for pyspark==3.1.1, with the same dataset, the same code, and the
same session configuration. I think this is a very easy way to reproduce the
issue without involving any third-party libraries. The two screenshots
actually pinpoint the issue: 3.0.2 runs 12 tasks in parallel, while 3.1.1
also launches 12 tasks but 10 of them finish immediately and the remaining 2
keep processing. Also, CPU usage in 3.0.2 is fully saturated, while in 3.1.1
it is minimal.
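In case it helps anyone reproducing this, one way to check whether the input
really is balanced across the 12 partitions (rather than the skew appearing
only after the shuffle) would be a quick per-partition row count on the Toys
DataFrame from the snippet above; this is only a diagnostic sketch, not part
of the benchmark:

# diagnostic sketch: count how many rows land in each of the 12 partitions
per_partition = Toys.rdd \
    .mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))]) \
    .collect()
for idx, n in sorted(per_partition):
    print(f"partition {idx}: {n} rows")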

Something is different in Spark/PySpark 3.1.1. I am not sure whether it is
related to the partitioning, the groupBy, the limit, or simply a
configuration that is enabled or disabled by default in 3.1.1, but it
results in these performance differences.
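If it is a changed default configuration, one way to narrow it down might be
to dump the effective SQL configs from each environment and diff the two
files; this is only a rough sketch (the output file names are arbitrary):

# dump all SQL configs, including defaults, so the two environments can be diffed;
# "SET -v" returns key / value / meaning columns
confs = spark.sql("SET -v").select("key", "value").collect()
with open("confs-3.1.1.txt", "w") as f:  # use a different file name per environment
    for row in sorted(confs, key=lambda r: r.key):
        f.write(f"{row.key}={row.value}\n")
# then compare the two dumps, e.g. `diff confs-3.0.2.txt confs-3.1.1.txt`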


