[jira] [Resolved] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns

2023-09-11 Thread Brady Bickel (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-44912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brady Bickel resolved SPARK-44912.
--
Resolution: Fixed

Verified that a build containing the fix for the linked issue resolved the problem.

> Spark 3.4 multi-column sum slows with many columns
> --
>
> Key: SPARK-44912
> URL: https://issues.apache.org/jira/browse/SPARK-44912
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Brady Bickel
>Priority: Major
>
> The code below is a minimal reproducible example of an issue I discovered
> with PySpark 3.4.x. I want to sum the values of multiple columns and put the
> sum of those columns (per row) into a new column. This code works and returns
> in a reasonable amount of time in PySpark 3.3.x, but is extremely slow in
> PySpark 3.4.x as the number of columns grows. See below for an execution
> timing summary as N varies.
> {code:python}
> import pyspark.sql.functions as F
> import random
> import string
> from functools import reduce
> from operator import add
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.getOrCreate()
>
> # Generate a DataFrame of N columns by M rows, with random 8-character
> # column names and random integers in [-5, 10].
> N = 30
> M = 100
> columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
>            for _ in range(N)]
> data = [tuple(random.randint(-5, 10) for _ in range(N)) for _ in range(M)]
>
> df = spark.sparkContext.parallelize(data).toDF(columns)
>
> # Three ways to add a sum column; all of them are slow for large N in Spark 3.4.
> df = df.withColumn("col_sum1", sum(df[col] for col in columns))
> df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
> df = df.withColumn("col_sum3", F.expr("+".join(columns)))
> {code}
> Timing results for Spark 3.3:
> ||N||Execution Time (s)||
> |5|0.514|
> |10|0.248|
> |15|0.327|
> |20|0.403|
> |25|0.279|
> |30|0.322|
> |50|0.430|
> Timing results for Spark 3.4:
> ||N||Execution Time (s)||
> |5|0.379|
> |10|0.318|
> |15|0.405|
> |20|1.32|
> |25|28.8|
> |30|448|
> |50|>1 (did not finish)|





[jira] [Created] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns

2023-08-22 Thread Brady Bickel (Jira)
Brady Bickel created SPARK-44912:


 Summary: Spark 3.4 multi-column sum slows with many columns
 Key: SPARK-44912
 URL: https://issues.apache.org/jira/browse/SPARK-44912
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.4.1, 3.4.0
Reporter: Brady Bickel


The code below is a minimal reproducible example of an issue I discovered with
PySpark 3.4.x. I want to sum the values of multiple columns and put the sum of
those columns (per row) into a new column. This code works and returns in a
reasonable amount of time in PySpark 3.3.x, but is extremely slow in PySpark
3.4.x as the number of columns grows. See below for an execution timing summary
as N varies.
{code:python}
import pyspark.sql.functions as F
import random
import string
from functools import reduce
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate a DataFrame of N columns by M rows, with random 8-character
# column names and random integers in [-5, 10].
N = 30
M = 100
columns = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
           for _ in range(N)]
data = [tuple(random.randint(-5, 10) for _ in range(N)) for _ in range(M)]

df = spark.sparkContext.parallelize(data).toDF(columns)

# Three ways to add a sum column; all of them are slow for large N in Spark 3.4.
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns)))
{code}
Timing results for Spark 3.3:
||N||Execution Time (s)||
|5|0.514|
|10|0.248|
|15|0.327|
|20|0.403|
|25|0.279|
|30|0.322|
|50|0.430|

Timing results for Spark 3.4:
||N||Execution Time (s)||
|5|0.379|
|10|0.318|
|15|0.405|
|20|1.32|
|25|28.8|
|30|448|
|50|>1 (did not finish)|
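
A possible workaround sketch (illustrative only, and not verified against this regression): fold over an array column with a higher-order function, which keeps the expression tree shallow instead of chaining N-1 nested additions:
{code:python}
# Illustrative workaround: sum via F.aggregate over an array column,
# avoiding a deeply nested chain of Add expressions. The accumulator is
# cast to bigint so its type matches the merge result for long columns.
df = df.withColumn(
    "col_sum4",
    F.aggregate(F.array(*columns), F.lit(0).cast("bigint"),
                lambda acc, x: acc + x),
)
{code}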


