subject:"\[jira\] \[Commented\] \(SPARK\-29358\) Make unionByName optionally fill missing columns with nulls"

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-07-04 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151431#comment-17151431
 ] 

Apache Spark commented on SPARK-29358:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28996

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-07-04 Thread Syedhamjath (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151349#comment-17151349
 ] 

Syedhamjath commented on SPARK-29358:
-

I also came across same issue, if the data frame does not have a column. It 
makes sense to add column with null or keep the self data frame as fixed ignore 
additional columns from the parameter data frame based on additional parameter.

I'm getting confused when it throw an error message, it should not throw an 
error on missing column or additional columns.

I vote for this issue, if Spark doesn't solve I don't think anything can solve 
this problem.

 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-05-14 Thread Andreas Neumann (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107775#comment-17107775
 ] 

Andreas Neumann commented on SPARK-29358:
-

I would like to put a vote for this feature
 * It makes life so much easier when you have multiple inputs with slightly 
varying schema, which is quite common for data that evolved over time.
 * The work-around described at the top where you explicitly add the missing 
columns is really cumbersome if the schema is large.
 * With the approach of an extra argument the compatibility concerns should be 
lifted. 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-04-02 Thread Michael Armbrust (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073951#comment-17073951
 ] 

Michael Armbrust commented on SPARK-29358:
--

Sure, but it is very easy to make this not a behavior change.  Add an optional 
boolean parameter, {{allowMissingColumns}} (or something) that defaults to 
{{false}}.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread L. C. Hsieh (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072300#comment-17072300
 ] 

L. C. Hsieh commented on SPARK-29358:
-

> users might rely on the exception when the schema is mismatched

This is what I concerned as behavior change. Maybe it is minor. I'm also good 
with this. 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072294#comment-17072294
 ] 

Hyukjin Kwon commented on SPARK-29358:
--

I am good with adding this behaviour if other committers think it's useful. 
Yeah, it's really extending rather than breaking.
My concern was:
 - users might rely on the exception when the schema is mismatched
 - workarounds might be possible with other combinations of APIs

After rethinking about this, the former could be guarded by a parameter or 
switch. The latter concern would need to extend the functionality of 
{{StructType.merge}} so I am not so against it.
Let me reopen the ticket.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Michael Armbrust (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071968#comment-17071968
 ] 

Michael Armbrust commented on SPARK-29358:
--

I think we should reconsider closing this as won't fix:
 - I think the semantics of this operation make sense. We already have this 
with writing JSON or parquet data. It is just a really inefficient way to 
accomplish the end goal.
 - I don't think it is a problem to move "away from SQL union". This is a 
clearly named, different operation. IMO this one makes *more* sense than SQL 
union. It is much more likely that columns with the same name are semantically 
equivalent than columns at the same ordinal with different names.
 - We are not breaking the behavior of unionByName. Currently it throws an 
exception in these cases. We are making more data transformations possible, but 
anything that was working before will continue to work. You could add a boolean 
flag if you were really concerned, but I think I would skip that.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-10 Thread Mukul Murthy (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948800#comment-16948800
 ] 

Mukul Murthy commented on SPARK-29358:
--

That would be a start to make us not have to do #1, but #2 is still annoying. 
Serializing and deserializing the data just to merge schemas is clunky, and 
transforming each DataFrame's schema is annoying enough to use when you don't 
have StructTypes and nested columns. When you add those into the picture, it 
gets even messier.

 

 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-09 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948101#comment-16948101
 ] 

Hyukjin Kwon commented on SPARK-29358:
--

{{StructType}} has {{merge}} implementation although it's private. How about we 
mark it public after refining it?

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-09 Thread Mukul Murthy (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947982#comment-16947982
 ] 

Mukul Murthy commented on SPARK-29358:
--

[~hyukjin.kwon], I disagree that the workaround is pretty easy. For the trivial 
example where you know what the schema are, it is pretty easy. For more 
complicated cases , especially where you don't know what the schema are, you 
have to:

1. Compute the merged schema using some complex logic. Check 
[https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/schema/SchemaUtils.scala#L662]
 for one correct implementation of this logic.

2. Write both data out to JSON, union those, and read it back with the merged 
schema, OR loop and transform each source data into the correct target schema. 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-08 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946809#comment-16946809
 ] 

Hyukjin Kwon commented on SPARK-29358:
--

Given that workaround is pretty easy, I wouldn't need the new API. Seems not a 
strong reason to me (e.g. there's no way to work around or considerable codes 
are required).

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-07 Thread Mukul Murthy (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946069#comment-16946069
 ] 

Mukul Murthy commented on SPARK-29358:
--

I agree that it should not change the current behavior of unionByName. I'm 
proposing either a new optional parameter to unionByName or a new API entirely.

Being similar to SQL is nice, but I think in cases where there are common 
problems that people have to get around (multiple DataFrames with different 
schemas, merging the schemas, and withColumn'ing all the missing ones from each 
DataFrame), it's nice to have the option to solve these in Spark instead of 
making users do it.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-04 Thread L. C. Hsieh (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944927#comment-16944927
 ] 

L. C. Hsieh commented on SPARK-29358:
-

And I also concern that it is more far away from SQL union.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-04 Thread L. C. Hsieh (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944899#comment-16944899
 ] 

L. C. Hsieh commented on SPARK-29358:
-

My concern is it breaks current behavior of unionByName. 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

14 matches

Site Navigation

Mail list logo

Footer information