[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang resolved SPARK-45071.
---------------------------------
    Fix Version/s: 3.5.0
                   4.0.0
                   3.4.2
       Resolution: Fixed

Issue resolved by pull request 42804
[https://github.com/apache/spark/pull/42804]

> Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-45071
>                 URL: https://issues.apache.org/jira/browse/SPARK-45071
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: zzzzming95
>            Assignee: zzzzming95
>            Priority: Major
>             Fix For: 3.5.0, 4.0.0, 3.4.2
>
> Since `BinaryArithmetic#dataType` recursively recomputes the data type of every child node, the driver becomes very slow when many columns are combined in one expression. For example, consider the following code:
> {code:scala}
> import spark.implicits._
> import scala.util.Random
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.{expr, sum}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val N = 30
> val M = 100
> val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
> val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
> val schema = StructType(columns.map(StructField(_, IntegerType)))
> val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
> val df = spark.createDataFrame(rdd, schema)
> val colExprs = columns.map(sum(_))
>
> // Generate a new column that adds up the other 30 columns.
> df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
> {code}
> On Spark 3.4 this code takes the driver a few minutes to execute, while on Spark 3.2 it takes only a few seconds.
> Related issue: SPARK-39316
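
A minimal sketch of where the cost shows up, added for illustration and not part of the original report: it builds the same left-deep `c1 + c2 + ... + c30` tree that `expr(columns.mkString(" + "))` produces, directly from Catalyst expressions. The column names and the count of 30 are illustrative assumptions mirroring the benchmark above.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Expression}
import org.apache.spark.sql.types.IntegerType

// Thirty integer attributes standing in for the randomly named columns above.
val attrs: Seq[Expression] =
  (1 to 30).map(i => AttributeReference(s"c$i", IntegerType)())

// Left-deep nesting ((c1 + c2) + c3) + ... + c30, the shape the SQL parser
// builds for "a + b + c + ...".
val nested: Expression = attrs.reduceLeft((l, r) => Add(l, r))

// Each BinaryArithmetic#dataType call re-derives the data types of its whole
// subtree, and the analyzer invokes dataType on every node, so the total work
// grows roughly quadratically with the number of columns.
println(nested.dataType)   // IntegerType
{code}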