[ https://issues.apache.org/jira/browse/SPARK-44912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763455#comment-17763455 ]

Bruce Robbins commented on SPARK-44912:
---------------------------------------

It looks like this was fixed by SPARK-45071. Your issue was reported earlier, 
but was somehow missed.
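
For anyone pinned to an affected 3.4.x release, one possible workaround (a sketch only, not from this report) is to replace the chain of N-1 binary "+" expressions with a single higher-order-function expression over an array, which sidesteps the deeply nested Add tree. This assumes PySpark >= 3.1 for F.aggregate, reuses the df/columns from the reproducer quoted below, and the column name col_sum_hof is just illustrative:

{code:python}
import pyspark.sql.functions as F

# Hypothetical workaround sketch: sum the row values with one aggregate()
# over an array column instead of N-1 nested "+" expressions.
# Assumes `df` and `columns` are defined as in the reproducer below.
df = df.withColumn(
    "col_sum_hof",
    F.aggregate(
        F.array(*[F.col(c) for c in columns]),  # pack the N columns into one array
        F.lit(0).cast("long"),                  # running total, typed to match the bigint elements
        lambda acc, x: acc + x,                 # add each element to the accumulator
    ),
)
{code}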

> Spark 3.4 multi-column sum slows with many columns
> --------------------------------------------------
>
>                 Key: SPARK-44912
>                 URL: https://issues.apache.org/jira/browse/SPARK-44912
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: Brady Bickel
>            Priority: Major
>
> The code below is a minimal reproducible example of an issue I discovered 
> with PySpark 3.4.x. I want to sum the values of multiple columns and put the 
> per-row sum of those columns into a new column. This code works and returns 
> in a reasonable amount of time in PySpark 3.3.x, but becomes extremely slow in 
> PySpark 3.4.x as the number of columns grows. See below for an execution 
> timing summary as N varies.
> {code:python}
> import pyspark.sql.functions as F
> import random
> import string
> from functools import reduce
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.getOrCreate()
> # generate a DataFrame of N columns by M rows with random 8-character column 
> # names and random integers in [-5, 10]
> N = 30
> M = 100
> columns = [''.join(random.choices(string.ascii_uppercase +
>                                   string.digits, k=8))
>            for _ in range(N)]
> data = [tuple([random.randint(-5,10) for _ in range(N)])
>         for _ in range(M)]
> df = spark.sparkContext.parallelize(data).toDF(columns)
> # 3 ways to add a sum column, all of them slow for high N in spark 3.4
> df = df.withColumn("col_sum1", sum(df[col] for col in columns))
> df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
> df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code}
> Timing results for Spark 3.3:
> ||N||Execution time (s)||
> |5|0.514|
> |10|0.248|
> |15|0.327|
> |20|0.403|
> |25|0.279|
> |30|0.322|
> |50|0.430|
> Timing results for Spark 3.4:
> ||N||Execution time (s)||
> |5|0.379|
> |10|0.318|
> |15|0.405|
> |20|1.32|
> |25|28.8|
> |30|448|
> |50|>10000 (did not finish)|
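
The report does not show how the numbers above were gathered; a minimal sketch of one way such a measurement could be reproduced (the harness below is hypothetical, assuming df, columns, and N from the snippet above; any action that forces the plan to be analyzed and executed, such as collect(), exposes the slowdown):

{code:python}
import time

# Hypothetical timing harness (not part of the original report):
# measure how long it takes to build and run the summed plan for a given N.
start = time.perf_counter()
summed = df.withColumn("col_sum1", sum(df[col] for col in columns))
summed.collect()  # force analysis and execution of the plan
print(f"N={N}: {time.perf_counter() - start:.3f}s")
{code}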


