[ https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26510. ---------------------------------- Resolution: Cannot Reproduce This is fixed in the current master. It should be great if the JIRA fixing that problem is identified, and backported if applicable. > Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using > 'createOrReplaceTempView' > -------------------------------------------------------------------------------------------------- > > Key: SPARK-26510 > URL: https://issues.apache.org/jira/browse/SPARK-26510 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.3.0 > Reporter: Hagai Attias > Priority: Major > > It seems that there's a change of behaviour between 1.6 and 2.3 when caching > a Dataframe and saving it as a temp table. In 1.6, the following code > executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) > executes it twice. > > {code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid} > > val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_)) > val schema = StructType(StructField("num", IntegerType) :: Nil) > val df1 = sql.createDataFrame(rdd, schema) > df1.registerTempTable("data_table") > sql.udf.register("printUDF", (x:Int) => {print(x) > x > }) > val df2 = sql.sql("select printUDF(num) result from data_table").cache() > df2.collect() //execute cache > df2.registerTempTable("cached_table") > val df3 = sql.sql("select result from cached_table") > df3.collect() > {code} > {code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid} > > val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_)) > val schema = StructType(StructField("num", IntegerType) :: Nil) > val df1 = session.createDataFrame(rdd, schema) > df1.createOrReplaceTempView("data_table") > session.udf.register("printUDF", (x:Int) => {print(x) > x > }) > val df2 = session.sql("select printUDF(num) result from data_table").cache() > df2.collect() //execute cache > df2.createOrReplaceTempView("cached_table") > val df3 = session.sql("select result from cached_table") > df3.collect() > {code} > > 1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe > twice. > Managed to mitigate by skipping the temporary table and selecting directly > from the cached dataframe, but was wondering if that is an expected behavior > / known issue. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org