Mukul Murthy created SPARK-29358:
------------------------------------

             Summary: Make unionByName optionally fill missing columns with nulls
                 Key: SPARK-29358
                 URL: https://issues.apache.org/jira/browse/SPARK-29358
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Mukul Murthy


Currently, unionByName requires two DataFrames to have the same set of columns 
(even though the order can be different). It would be good to add either an 
option to unionByName or a new type of union which fills in missing columns 
with nulls. 
{code:java}
val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")
df1.unionByName(df2)
{code}
This currently throws 
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
{code}
Ideally, there would be a way to make this return a DataFrame containing:
{code:java}
+----+----+
|   x|   y|
+----+----+
|   1|null|
|   2|null|
|   3|null|
|null|   a|
|null|   b|
|null|   c|
+----+----+
{code}
Currently, the workaround is to add each missing column to both DataFrames as 
a null literal before calling unionByName, but this is clunky:
{code:java}
df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
{code}
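The same workaround can be generalized so the missing columns do not have to be listed by hand. The following is only a sketch of the requested behavior, not an existing Spark API: the helper name unionByNameFillingNulls and the object UnionFillingNulls are hypothetical, and it assumes column names are compared exactly and do not collide when case is ignored. It casts each null to the type the column has in the other DataFrame rather than relying on type coercion during the union.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object UnionFillingNulls {
  // Hypothetical helper (not part of Spark 2.4.4): union by name, filling
  // columns that exist on only one side with nulls.
  def unionByNameFillingNulls(left: DataFrame, right: DataFrame): DataFrame = {
    // Append to `df`, as typed null columns, every field of `other` it lacks.
    // Assumption: names are matched exactly (case-sensitive comparison).
    def addMissing(df: DataFrame, other: DataFrame): DataFrame = {
      val present = df.columns.toSet
      other.schema.fields
        .filterNot(f => present.contains(f.name))
        .foldLeft(df)((acc, f) => acc.withColumn(f.name, lit(null).cast(f.dataType)))
    }
    // Both sides now have the same set of columns, so unionByName can resolve them.
    addMissing(left, right).unionByName(addMissing(right, left))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-29358 sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("x")
    val df2 = Seq("a", "b", "c").toDF("y")
    unionByNameFillingNulls(df1, df2).show()

    spark.stop()
  }
}
{code}
With df1 and df2 from the example above, this should yield the six-row DataFrame shown earlier, with columns x and y. A built-in option would still be preferable, since every caller would otherwise need to carry a helper like this.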


