Javier created SPARK-45745:
------------------------------

             Summary: Extremely slow execution of sum of columns in Spark 3.4.1
                 Key: SPARK-45745
                 URL: https://issues.apache.org/jira/browse/SPARK-45745
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.4.1
            Reporter: Javier


We are in the process of upgrading some PySpark jobs from Spark 3.1.2 to Spark 
3.4.1, and some code that previously ran fine now effectively never finishes, 
even for small dataframes.

We have reduced the problematic piece of code to the minimal PySpark example 
below, which reproduces the issue:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# createDataFrame is available directly on the SparkSession
sql_context = SparkSession.builder.getOrCreate()

n_cols = 50
# 5 identical rows, each with columns col0..col49 holding the values 0..49
data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
df_data = sql_context.createDataFrame(data)

# Python's builtin sum() left-folds the columns into a nested chain of "+" expressions
df_data = df_data.withColumn(
    "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
)
df_data.show(10, False) {code}
This code runs fine with Spark 3.1.2, but with 3.4.1 the computation time 
explodes once `n_cols` grows beyond roughly 25 columns. A colleague suggested 
it could be related to the 22-element limit on tuples in Scala 2.13, since 25 
columns is suspiciously close to that. Is there any known defect in the 
logical plan optimization in 3.4.1? Or is this kind of operation (a row-wise 
sum of many columns) supposed to be implemented differently?
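
For reference, a workaround we are experimenting with (a sketch on our side, 
not a confirmed fix): folding the values through an array column with 
`F.aggregate` instead of chaining binary `+` expressions, which may avoid the 
deeply nested expression tree:
{code:python}
from pyspark.sql import functions as F

# Hypothetical workaround, not part of the original report: collect the
# columns into an array and fold it with a higher-order function.
df_data = df_data.withColumn(
    "col_sum",
    F.aggregate(
        F.array(*[F.col(f"col{i}") for i in range(n_cols)]),
        F.lit(0).cast("long"),  # accumulator type must match the element type
        lambda acc, x: acc + x,
    ),
)
{code}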


