[jira] [Updated] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-06 Thread Wenchen Fan (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25313:

Fix Version/s: 2.3.2

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the written schema is id instead of ID, so the 
> last query shows nothing from tbl2.
> With debug logging enabled, we can see that the output name changes from ID 
> to id, and that the outputColumns in InsertIntoHadoopFsRelationCommand are 
> then changed in RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from 
> `Seq[Attribute]` to `Seq[String]` so that the optimizer cannot replace 
> their names.
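
The proposed change can be illustrated with a toy model in plain Scala. This is only a sketch under stated assumptions, not Spark's actual classes: `Attribute`, `WriteWithAttributes`, `WriteWithNames`, `OutputSchemaSketch`, and `rewrite` are hypothetical names, and `rewrite` is a simplified stand-in for an optimizer rule that swaps an aliased attribute for the underlying one.

```scala
// Toy model only: these types are hypothetical and much simpler than Spark's.
final case class Attribute(name: String, exprId: Int)

// Command that stores output columns as attribute references.
final case class WriteWithAttributes(query: Seq[Attribute], outputColumns: Seq[Attribute])

// Command that captures the output column names as plain strings.
final case class WriteWithNames(query: Seq[Attribute], outputColumnNames: Seq[String])

object OutputSchemaSketch {
  // Stand-in for an optimizer rule: every attribute whose exprId has a known
  // underlying attribute is replaced by it, so the aliased name is lost.
  def rewrite(attrs: Seq[Attribute], underlying: Map[Int, Attribute]): Seq[Attribute] =
    attrs.map(a => underlying.getOrElse(a.exprId, a))

  // Attribute references inside the command are rewritten along with the query.
  def optimize(w: WriteWithAttributes, underlying: Map[Int, Attribute]): WriteWithAttributes =
    w.copy(
      query = rewrite(w.query, underlying),
      outputColumns = rewrite(w.outputColumns, underlying))

  // Plain strings are not expressions, so the rule has nothing to rewrite.
  def optimize(w: WriteWithNames, underlying: Map[Int, Attribute]): WriteWithNames =
    w.copy(query = rewrite(w.query, underlying))
}
```

With `Attribute("ID", 1)` as the aliased column and `Map(1 -> Attribute("id", 1))` as the underlying mapping, the attribute-based command reports `id` after `optimize`, while the string-based command still reports `ID`.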



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Gengliang Wang (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-25313:
---
Description: 
In the following example:

val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()

The output column name in the written schema is id instead of ID, so the last 
query shows nothing from tbl2.
With debug logging enabled, we can see that the output name changes from ID to 
id, and that the outputColumns in InsertIntoHadoopFsRelationCommand are then 
changed in RemoveRedundantAliases.

To guarantee correctness, we should change the output columns from 
`Seq[Attribute]` to `Seq[String]` so that the optimizer cannot replace their 
names.

  was:
Let's see the following example:

val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()

The output column name in the written schema is id instead of ID, so the last 
query shows nothing from tbl2.
With debug logging enabled, we can see that the output name changes from ID to 
id, and that the outputColumns in InsertIntoHadoopFsRelationCommand are then 
changed in RemoveRedundantAliases.

To guarantee correctness, we should change the output columns from 
`Seq[Attribute]` to `Seq[String]` so that the optimizer cannot replace their 
names.


> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the written schema is id instead of ID, so the 
> last query shows nothing from tbl2.
> With debug logging enabled, we can see that the output name changes from ID 
> to id, and that the outputColumns in InsertIntoHadoopFsRelationCommand are 
> then changed in RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from 
> `Seq[Attribute]` to `Seq[String]` so that the optimizer cannot replace 
> their names.
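
The symptom in the example (id written where ID was declared) is a mismatch that differs only by case. A minimal sketch of a check for that condition, in plain Scala, might look like the following. `SchemaNameCheck` and `caseOnlyMismatches` are hypothetical helper names, not part of Spark.

```scala
// Hypothetical helper, not a Spark API: it only compares lists of column names.
object SchemaNameCheck {
  // Returns (expected, actual) pairs that are equal ignoring case
  // but are not exactly the same string.
  def caseOnlyMismatches(expected: Seq[String], actual: Seq[String]): Seq[(String, String)] =
    for {
      e <- expected
      a <- actual
      if e.equalsIgnoreCase(a) && e != a
    } yield (e, a)
}
```

In a Spark session, one could presumably pass `spark.table("tbl2").schema.fieldNames` as the expected names and `spark.read.parquet(location).schema.fieldNames` as the actual names. For the example above, the declared `ID` and the written `id` would be reported as a pair.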


