[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151431#comment-17151431 ] Apache Spark commented on SPARK-29358: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28996 > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151349#comment-17151349 ] Syedhamjath commented on SPARK-29358: - I also came across same issue, if the data frame does not have a column. It makes sense to add column with null or keep the self data frame as fixed ignore additional columns from the parameter data frame based on additional parameter. I'm getting confused when it throw an error message, it should not throw an error on missing column or additional columns. I vote for this issue, if Spark doesn't solve I don't think anything can solve this problem. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107775#comment-17107775 ] Andreas Neumann commented on SPARK-29358: - I would like to put a vote for this feature * It makes life so much easier when you have multiple inputs with slightly varying schema, which is quite common for data that evolved over time. * The work-around described at the top where you explicitly add the missing columns is really cumbersome if the schema is large. * With the approach of an extra argument the compatibility concerns should be lifted. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073951#comment-17073951 ] Michael Armbrust commented on SPARK-29358: -- Sure, but it is very easy to make this not a behavior change. Add an optional boolean parameter, {{allowMissingColumns}} (or something) that defaults to {{false}}. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072300#comment-17072300 ] L. C. Hsieh commented on SPARK-29358: - > users might rely on the exception when the schema is mismatched This is what I concerned as behavior change. Maybe it is minor. I'm also good with this. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072294#comment-17072294 ] Hyukjin Kwon commented on SPARK-29358: -- I am good with adding this behaviour if other committers think it's useful. Yeah, it's really extending rather than breaking. My concern was: - users might rely on the exception when the schema is mismatched - workarounds might be possible with other combinations of APIs After rethinking about this, the former could be guarded by a parameter or switch. The latter concern would need to extend the functionality of {{StructType.merge}} so I am not so against it. Let me reopen the ticket. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071968#comment-17071968 ] Michael Armbrust commented on SPARK-29358: -- I think we should reconsider closing this as won't fix: - I think the semantics of this operation make sense. We already have this with writing JSON or parquet data. It is just a really inefficient way to accomplish the end goal. - I don't think it is a problem to move "away from SQL union". This is a clearly named, different operation. IMO this one makes *more* sense than SQL union. It is much more likely that columns with the same name are semantically equivalent than columns at the same ordinal with different names. - We are not breaking the behavior of unionByName. Currently it throws an exception in these cases. We are making more data transformations possible, but anything that was working before will continue to work. You could add a boolean flag if you were really concerned, but I think I would skip that. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948800#comment-16948800 ] Mukul Murthy commented on SPARK-29358: -- That would be a start to make us not have to do #1, but #2 is still annoying. Serializing and deserializing the data just to merge schemas is clunky, and transforming each DataFrame's schema is annoying enough to use when you don't have StructTypes and nested columns. When you add those into the picture, it gets even messier. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948101#comment-16948101 ] Hyukjin Kwon commented on SPARK-29358: -- {{StructType}} has {{merge}} implementation although it's private. How about we mark it public after refining it? > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947982#comment-16947982 ] Mukul Murthy commented on SPARK-29358: -- [~hyukjin.kwon], I disagree that the workaround is pretty easy. For the trivial example where you know what the schema are, it is pretty easy. For more complicated cases , especially where you don't know what the schema are, you have to: 1. Compute the merged schema using some complex logic. Check [https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/schema/SchemaUtils.scala#L662] for one correct implementation of this logic. 2. Write both data out to JSON, union those, and read it back with the merged schema, OR loop and transform each source data into the correct target schema. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946809#comment-16946809 ] Hyukjin Kwon commented on SPARK-29358: -- Given that workaround is pretty easy, I wouldn't need the new API. Seems not a strong reason to me (e.g. there's no way to work around or considerable codes are required). > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946069#comment-16946069 ] Mukul Murthy commented on SPARK-29358: -- I agree that it should not change the current behavior of unionByName. I'm proposing either a new optional parameter to unionByName or a new API entirely. Being similar to SQL is nice, but I think in cases where there are common problems that people have to get around (multiple DataFrames with different schemas, merging the schemas, and withColumn'ing all the missing ones from each DataFrame), it's nice to have the option to solve these in Spark instead of making users do it. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944927#comment-16944927 ] L. C. Hsieh commented on SPARK-29358: - And I also concern that it is more far away from SQL union. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944899#comment-16944899 ] L. C. Hsieh commented on SPARK-29358: - My concern is it breaks current behavior of unionByName. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org