[ https://issues.apache.org/jira/browse/SPARK-23493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoju Wu updated SPARK-23493:
------------------------------
Description:

insertInto only works when the partitionBy key columns come last:

    val data = Seq(
      (7, "test1", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    import spark.implicits._
    val table = "default.tbl"
    spark
      .createDataset(data)
      .toDF("col1", "col2", "col3")
      .write
      .partitionBy("col1")
      .saveAsTable(table)

    val data2 = Seq(
      (7, "test2", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    spark
      .createDataset(data2)
      .toDF("col1", "col2", "col3")
      .write
      .insertInto(table)

    sql("select * from " + table).show()

    +---------+----+----+
    |     col2|col3|col1|
    +---------+----+----+
    |test#test| 0.0|   8|
    |    test1| 1.0|   7|
    |    test3| 0.0|   9|
    |        8|null|   0|
    |        9|null|   0|
    |        7|null|   1|
    +---------+----+----+

With the same table re-created as above, inserting via SQL shows the same issue:

    sql("insert into " + table + " values(7,'test2',1.0)")
    sql("select * from " + table).show()

    +---------+----+----+
    |     col2|col3|col1|
    +---------+----+----+
    |test#test| 0.0|   8|
    |    test1| 1.0|   7|
    |    test3| 0.0|   9|
    |        7|null|   1|
    +---------+----+----+

No exception is thrown, since I call insertInto alone rather than together with partitionBy; the data are simply inserted incorrectly. The issue is related to column order. If I instead partition by col3, the last column, it works:
    val data = Seq(
      (7, "test1", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    import spark.implicits._
    val table = "default.tbl"
    spark
      .createDataset(data)
      .toDF("col1", "col2", "col3")
      .write
      .partitionBy("col3")
      .saveAsTable(table)

    val data2 = Seq(
      (7, "test2", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    spark
      .createDataset(data2)
      .toDF("col1", "col2", "col3")
      .write
      .insertInto(table)

    sql("select * from " + table).show()

    +----+---------+----+
    |col1|     col2|col3|
    +----+---------+----+
    |   8|test#test| 0.0|
    |   9|    test3| 0.0|
    |   8|test#test| 0.0|
    |   9|    test3| 0.0|
    |   7|    test1| 1.0|
    |   7|    test2| 1.0|
    +----+---------+----+
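The output above is consistent with insertInto resolving columns by position rather than by name: saveAsTable with partitionBy stores the partition column last in the table schema (col2, col3, col1), so the writer's (col1, col2, col3) is mapped onto it positionally, and values that fail the implicit cast come out as null. A minimal plain-Scala sketch of that mapping follows (no Spark required; the positional-resolution reading is inferred from the reported output, not stated in this ticket):

```scala
object PositionalInsertDemo {
  // Table schema as stored by saveAsTable after partitionBy("col1"):
  // partition columns are moved to the end.
  val tableSchema: Seq[String] = Seq("col2", "col3", "col1")

  // Column order of the DataFrame handed to insertInto.
  val writerSchema: Seq[String] = Seq("col1", "col2", "col3")

  // insertInto feeds writer column i into table column i (by position,
  // ignoring names) -- which reproduces the mismatch reported above.
  def positionalMapping: Seq[(String, String)] =
    tableSchema.zip(writerSchema)

  def main(args: Array[String]): Unit =
    // Each pair reads: (table column, writer column that fills it).
    positionalMapping.foreach { case (t, w) => println(s"$t <- $w") }
    // prints: col2 <- col1, col3 <- col2, col1 <- col3
}
```

With real Spark, a common workaround is to reorder the writer's columns to the table's stored schema before calling insertInto, e.g. df.select(spark.table(table).columns.map(col): _*).write.insertInto(table) (assuming the standard Spark 2.x DataFrame API).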
> insert-into depends on columns order, otherwise incorrect data inserted
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23493
>                 URL: https://issues.apache.org/jira/browse/SPARK-23493
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Xiaoju Wu
>            Priority: Minor
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)