Dear Spark users,

I'm experiencing an unusual issue with Spark 3.4.x.
When creating a new column as the sum of several existing columns, the time
taken roughly doubles each time another column is added to the sum. The
operation itself needs very few resources, so I suspect a problem somewhere
in the parsing/analysis engine.

This phenomenon does not occur in 3.3.x or earlier versions. I've attached a
simple example below.


//example code (spark-shell)
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Single-row DataFrame with 100 integer columns c1..c100, all set to 1.
val schema = StructType((1 to 100).map(x => StructField(s"c$x", IntegerType)))
val data = Row.fromSeq(Seq.fill(100)(1))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(data)), schema)

// Add a column that sums the first 31 columns, then time the action.
val sumCols = (1 to 31).map(x => s"c$x").toList
spark.time { df.withColumn("sumofcols", sumCols.map(col).reduce(_ + _)).count() }
//=========
Time taken: 288213 ms
res13: Long = 1L

With Spark 3.3.2, the last line takes about 150 ms.
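
If it helps to see the growth, the same measurement can be repeated for
increasing column counts (a small sketch of my own, reusing df and col from
the example above; the column counts are arbitrary and exact timings will
differ per machine):

// Illustrative loop: time the same sum while varying the number of columns,
// to show how the cost scales on 3.4.x versus 3.3.2.
for (n <- Seq(20, 22, 24, 26, 28, 30)) {
  val cols = (1 to n).map(x => col(s"c$x"))
  print(s"n=$n  ")
  spark.time { df.withColumn("sumofcols", cols.reduce(_ + _)).count() }
}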

Is this a known problem?

Regards,

Dukhyun Ko
