Hi,

I have a simple piece of code that does a groupBy, an agg count, a sort, etc.
It finishes within 5 minutes on Spark 3.1.x. However, the same code, same
dataset, and same SparkSession (configs) on Spark 3.0.2 finishes within a
minute. That is over a 5x difference.

My SparkSession (same when it is used with --conf):

val spark: SparkSession = SparkSession
    .builder()
    .appName("test")
    .master("local[*]")
    .config("spark.driver.memory", "16G")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.kryoserializer.buffer.max", "200M")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
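To rule out config drift between the two versions, the effective runtime
configs can be dumped on each and compared with a plain diff (a quick sketch;
`spark` is the session built above):

```scala
// Print every effective runtime config, sorted by key, so the
// output from 3.0.2 and 3.1.1 can be compared with `diff`.
spark.conf.getAll.toSeq
  .sortBy(_._1)
  .foreach { case (k, v) => println(s"$k=$v") }
```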

Environments which I tested both 3.1.1 and 3.0.2:
- Intellij
- spark-shell
- pyspark shell
- pure Python with PyPI pyspark

The code, dataset, and initial report for reproducibility:
https://github.com/JohnSnowLabs/spark-nlp/issues/2739#issuecomment-815635930
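For context, the code is roughly of this shape (a simplified sketch with a
hypothetical column name "token" and DataFrame `df`; the exact code and
dataset are in the linked issue):

```scala
import org.apache.spark.sql.functions.{col, count}

// Group, count, and sort descending by the count -- the pattern
// whose runtime differs so much between 3.0.2 and 3.1.1.
val result = df
  .groupBy("token")
  .agg(count("*").as("cnt"))
  .orderBy(col("cnt").desc)
result.show()
```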

I have observed that in Spark 3.1.1 only 2 tasks do the majority of the
processing, and the work is not evenly distributed as one would expect in a
12-partition DataFrame:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png>
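The skew can be quantified by counting rows per partition with
`spark_partition_id` (a sketch; `df` stands for the 12-partition DataFrame):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Count how many rows land in each of the 12 partitions; heavy
// skew shows up as a few partitions holding most of the rows.
df.groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy("partition_id")
  .show()
```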

However, without any change in any line of code or environment, Spark 3.0.2
will evenly distribute the tasks at the same time and everything runs in
parallel:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png>

Is there a new feature or a new config in Spark 3.1.1 that causes this
unbalanced task execution, which wasn't there before in Spark 2.4.x and
3.0.x? (I have read the migration guide but could not find anything
relevant:
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31)



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
