[ https://issues.apache.org/jira/browse/SPARK-45071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang resolved SPARK-45071.
---------------------------------
    Fix Version/s: 3.5.0
                   4.0.0
                   3.4.2
       Resolution: Fixed

Issue resolved by pull request 42804
[https://github.com/apache/spark/pull/42804]

> Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-45071
>                 URL: https://issues.apache.org/jira/browse/SPARK-45071
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: zzzzming95
>            Assignee: zzzzming95
>            Priority: Major
>             Fix For: 3.5.0, 4.0.0, 3.4.2
>
> Since `BinaryArithmetic#dataType` recursively recomputes the data type of every child node, the driver becomes very slow when many columns are combined in one expression. For example, consider the following code:
> {code:scala}
> import spark.implicits._
> import scala.util.Random
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.{expr, sum}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val N = 30
> val M = 100
> val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
> val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))
> val schema = StructType(columns.map(StructField(_, IntegerType)))
> val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
> val df = spark.createDataFrame(rdd, schema)
> val colExprs = columns.map(sum(_))
>
> // Generate a new column that adds up the other 30 columns.
> df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
> {code}
> On Spark 3.4 this code takes the driver a few minutes to execute, while on Spark 3.2 it takes only a few seconds.
> Related issue: SPARK-39316
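
A minimal sketch of where the cost shows up, added for illustration and not part of the original report: it builds the same left-deep `c1 + c2 + ... + c30` tree that `expr(columns.mkString(" + "))` produces, directly from Catalyst expressions. The column names and the count of 30 are illustrative assumptions mirroring the benchmark above.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Expression}
import org.apache.spark.sql.types.IntegerType

// Thirty integer attributes standing in for the randomly named columns above.
val attrs: Seq[Expression] =
  (1 to 30).map(i => AttributeReference(s"c$i", IntegerType)())

// Left-deep nesting ((c1 + c2) + c3) + ... + c30, the shape the SQL parser
// builds for "a + b + c + ...".
val nested: Expression = attrs.reduceLeft((l, r) => Add(l, r))

// Each BinaryArithmetic#dataType call re-derives the data types of its whole
// subtree, and the analyzer invokes dataType on every node, so the total work
// grows roughly quadratically with the number of columns.
println(nested.dataType)   // IntegerType
{code}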