You can take a look at the code that Spark generates:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug.codegenString
import org.apache.spark.sql.functions._

// In spark-shell this session already exists as `spark`.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq("A", "b", "c").toDF("col")
data.write.mode("overwrite").parquet("/tmp/data")
val df = spark.read.parquet("/tmp/data")

// Built-in functions: the concat is compiled into the generated code.
val df1 = df
  .withColumn("valueconcat", concat(col(data.columns.head), lit(" "), lit("concat")))
  .select("valueconcat")
println(codegenString(df1.queryExecution.executedPlan))

// Typed map: each row is deserialized, transformed, and serialized back.
val df2 = df.map(e => s"$e concat")
println(codegenString(df2.queryExecution.executedPlan))

It shows that df1 internally uses org.apache.spark.unsafe.types.UTF8String#concat, whereas df2 pays for the deserialization/serialization introduced by the map function. In most cases, using Spark's native functions is the most effective option in terms of performance.
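If you want to see the difference beyond the generated code, a rough micro-benchmark along the following lines can make it concrete. This is only a sketch, reusing the `spark` session from the snippet above: the `time` helper is a made-up convenience, and plain wall-clock timing ignores JVM warm-up, so treat any numbers as indicative at best.

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical helper (not part of Spark): crude wall-clock timing.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// A larger Dataset[String]; its single column gets the default name "value".
val ds = spark.range(1000000).map(_.toString)

// Native functions: the data stays in Spark's internal row format end to end.
time("native concat")(ds.select(concat(col("value"), lit(" concat"))).count())

// Typed map: adds DeserializeToObject / SerializeFromObject around the lambda.
time("typed map")(ds.map(e => s"$e concat").count())

As a side note on your follow-up question: because the single column of a Dataset[String] is named "value" by default, col("value") should work as a slightly more direct alternative to col(data.columns.head).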
On Sat, Apr 4, 2020 at 2:07 PM <em...@yeikel.com> wrote:

> Dear Community,
>
> Recently, I had to solve the following problem: "for every entry of a
> Dataset[String], concatenate a constant value". To solve it, I used
> built-in functions:
>
> val data = Seq("A","b","c").toDS
>
> scala> data.withColumn("valueconcat", concat(col(data.columns.head), lit(" "), lit("concat"))).select("valueconcat").explain()
> == Physical Plan ==
> LocalTableScan [valueconcat#161]
>
> As an alternative, a much simpler version of the program is to use map,
> but it adds a serialization step that does not seem to be present in the
> version above:
>
> scala> data.map(e => s"$e concat").explain
> == Physical Plan ==
> *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#92]
> +- *(1) MapElements <function1>, obj#91: java.lang.String
>    +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String
>       +- LocalTableScan [value#12]
>
> Is this over-optimization or is this the right way to go?
>
> As a follow-up, is there a better API to get the one and only column
> available in a Dataset[String] when using built-in functions?
> "col(data.columns.head)" works, but it is not ideal.
>
> Thanks!

--
Sent from my iPhone

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org