[ https://issues.apache.org/jira/browse/SPARK-23493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoju Wu updated SPARK-23493:
------------------------------
Description:

insertInto only works when the partitionBy key columns come last:

    val data = Seq(
      (7, "test1", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    import spark.implicits._
    val table = "default.tbl"
    spark
      .createDataset(data)
      .toDF("col1", "col2", "col3")
      .write
      .partitionBy("col1")
      .saveAsTable(table)

    val data2 = Seq(
      (7, "test2", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    spark
      .createDataset(data2)
      .toDF("col1", "col2", "col3")
      .write
      .insertInto(table)

    sql("select * from " + table).show()

    +---------+----+----+
    |     col2|col3|col1|
    +---------+----+----+
    |test#test| 0.0|   8|
    |    test1| 1.0|   7|
    |    test3| 0.0|   9|
    |        8|null|   0|
    |        9|null|   0|
    |        7|null|   1|
    +---------+----+----+

With the same table re-created as above, inserting via SQL shows the same issue:

    sql("insert into " + table + " values(7,'test2',1.0)")
    sql("select * from " + table).show()

    +---------+----+----+
    |     col2|col3|col1|
    +---------+----+----+
    |test#test| 0.0|   8|
    |    test1| 1.0|   7|
    |    test3| 0.0|   9|
    |        7|null|   1|
    +---------+----+----+

No exception is thrown, since I call insertInto alone rather than together with partitionBy; the data are simply inserted incorrectly. The issue is related to column order. If I instead partition by col3, the last column, it works:
    val data = Seq(
      (7, "test1", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    import spark.implicits._
    val table = "default.tbl"
    spark
      .createDataset(data)
      .toDF("col1", "col2", "col3")
      .write
      .partitionBy("col3")
      .saveAsTable(table)

    val data2 = Seq(
      (7, "test2", 1.0),
      (8, "test#test", 0.0),
      (9, "test3", 0.0)
    )
    spark
      .createDataset(data2)
      .toDF("col1", "col2", "col3")
      .write
      .insertInto(table)

    sql("select * from " + table).show()

    +----+---------+----+
    |col1|     col2|col3|
    +----+---------+----+
    |   8|test#test| 0.0|
    |   9|    test3| 0.0|
    |   8|test#test| 0.0|
    |   9|    test3| 0.0|
    |   7|    test1| 1.0|
    |   7|    test2| 1.0|
    +----+---------+----+
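The output above is consistent with insertInto resolving columns by position rather than by name: saveAsTable with partitionBy stores the partition column last in the table schema (col2, col3, col1), so the writer's (col1, col2, col3) is mapped onto it positionally, and values that fail the implicit cast come out as null. A minimal plain-Scala sketch of that mapping follows (no Spark required; the positional-resolution reading is inferred from the reported output, not stated in this ticket):

```scala
object PositionalInsertDemo {
  // Table schema as stored by saveAsTable after partitionBy("col1"):
  // partition columns are moved to the end.
  val tableSchema: Seq[String] = Seq("col2", "col3", "col1")

  // Column order of the DataFrame handed to insertInto.
  val writerSchema: Seq[String] = Seq("col1", "col2", "col3")

  // insertInto feeds writer column i into table column i (by position,
  // ignoring names) -- which reproduces the mismatch reported above.
  def positionalMapping: Seq[(String, String)] =
    tableSchema.zip(writerSchema)

  def main(args: Array[String]): Unit =
    // Each pair reads: (table column, writer column that fills it).
    positionalMapping.foreach { case (t, w) => println(s"$t <- $w") }
    // prints: col2 <- col1, col3 <- col2, col1 <- col3
}
```

With real Spark, a common workaround is to reorder the writer's columns to the table's stored schema before calling insertInto, e.g. df.select(spark.table(table).columns.map(col): _*).write.insertInto(table) (assuming the standard Spark 2.x DataFrame API).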
> insert-into depends on columns order, otherwise incorrect data inserted
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23493
>                 URL: https://issues.apache.org/jira/browse/SPARK-23493
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Xiaoju Wu
>            Priority: Minor
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)