[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-07-10 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r201487246
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
--- End diff --

Thank you for the review, @gatorsmile . I'll update it like that.

For the write operation, we cannot specify a schema the way we do on the read 
path with `.schema`. Spark either writes new files into the directory 
additionally or overwrites the directory.
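[Editor's note] The write-path behavior described above can be sketched as a small standalone model. This is not Spark code: `resolveWrite` and its file-set arguments are hypothetical, and only the mode names mirror Spark's `org.apache.spark.sql.SaveMode` values.

```scala
// Illustrative model of how a DataFrame writer treats an existing output
// directory under each save mode. The mode names mirror Spark's SaveMode;
// everything else here is a hypothetical sketch, not Spark's implementation.
sealed trait SaveMode
case object ErrorIfExists extends SaveMode
case object Append extends SaveMode
case object Overwrite extends SaveMode
case object Ignore extends SaveMode

def resolveWrite(existing: Set[String], newFiles: Set[String], mode: SaveMode): Set[String] =
  mode match {
    case ErrorIfExists if existing.nonEmpty =>
      throw new IllegalStateException("path already exists")
    case Append                      => existing ++ newFiles // add files alongside old ones
    case Overwrite                   => newFiles             // replace directory contents
    case Ignore if existing.nonEmpty => existing             // leave directory untouched
    case _                           => newFiles             // empty directory: just write
  }
```

The point of the comment above is that there is no per-column schema control on this path: a write either adds whole new files or replaces the directory.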


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-07-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r201170482
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
--- End diff --

I still want to avoid using `schema evolution` in the doc or tests. `Schema 
Projection` might be better. More importantly, you have to clarify that this 
only covers the read path.

What is the behavior in the write path when the physical and data schemas 
are different?


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176933664
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
+
+Users can control schema evolution in several ways. For example, a new file can have an
+additional column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except the
+`text` data source support this. Note that the `text` data source always has a fixed
+single-string-column schema.
+
+
+
+
+val df1 = Seq("a", "b").toDF("col1")
+val df2 = df1.withColumn("col2", lit("x"))
+
+df1.write.save("/tmp/evolved_data/part=1")
+df2.write.save("/tmp/evolved_data/part=2")
+
+spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
++----+----+----+
+|col1|col2|part|
++----+----+----+
+|   a|   x|   2|
+|   b|   x|   2|
+|   a|null|   1|
+|   b|null|   1|
++----+----+----+
+
+
+
+
+The following schema evolutions are supported in the `csv`, `json`, `orc`, and `parquet`
+file-based data sources.
+
+  1. Add a column
+  2. Remove a column
+  3. Change a column position
+  4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` -> `double`)
--- End diff --

Yep. `Upcast`s are safe. This PR doesn't aim to cover or guarantee unsafe 
casting at this stage. Although these are straightforward `upcast`s, not all 
Spark file-based data sources seem to support them (based on the test cases). 
This PR is trying to set a clear boundary and to clarify what has been missed.
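[Editor's note] The safe-upcast rule discussed here (widening only, never narrowing) can be captured in a few lines. A minimal sketch; `isSafeUpcast` and the string type names are illustrative, not a Spark API.

```scala
// Safe upcasts follow the widening chains byte -> short -> int -> long and
// float -> double: moving rightward along a chain never loses data, while
// moving leftward (a downcast) may. This is a standalone model of that rule.
val upcastChains: Seq[Seq[String]] = Seq(
  Seq("byte", "short", "int", "long"),
  Seq("float", "double"))

def isSafeUpcast(from: String, to: String): Boolean =
  from == to || upcastChains.exists { chain =>
    val i = chain.indexOf(from)
    val j = chain.indexOf(to)
    i >= 0 && j >= 0 && i <= j // both on the same chain, widening direction
  }
```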


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176933560
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
+
+Users can control schema evolution in several ways. For example, a new file can have an
+additional column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except the
+`text` data source support this. Note that the `text` data source always has a fixed
+single-string-column schema.
+
+
+
+
+val df1 = Seq("a", "b").toDF("col1")
+val df2 = df1.withColumn("col2", lit("x"))
+
+df1.write.save("/tmp/evolved_data/part=1")
+df2.write.save("/tmp/evolved_data/part=2")
+
+spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
++----+----+----+
+|col1|col2|part|
++----+----+----+
+|   a|   x|   2|
+|   b|   x|   2|
+|   a|null|   1|
+|   b|null|   1|
++----+----+----+
+
+
+
+
+The following schema evolutions are supported in the `csv`, `json`, `orc`, and `parquet`
+file-based data sources.
+
+  1. Add a column
+  2. Remove a column
+  3. Change a column position
--- End diff --

Correct, we need to clarify that partition columns are always at the end.
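[Editor's note] The "partition columns are always at the end" rule can be modeled without Spark. A hypothetical sketch: `resolvedSchema` and its column-name arguments are illustrative, but the ordering rule matches the behavior described in this thread.

```scala
// Model of Spark's resolved read schema for a partitioned directory: data
// columns keep their order, and partition columns are always appended at the
// end, regardless of where a user places them in the requested schema.
def resolvedSchema(dataColumns: Seq[String], partitionColumns: Seq[String]): Seq[String] =
  dataColumns.filterNot(partitionColumns.contains) ++ partitionColumns
```

This is why "change a column position" only applies among the data columns; `select * from tab` always shows partition columns last.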


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176933506
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
+
+Users can control schema evolution in several ways. For example, a new file can have an
+additional column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except the
+`text` data source support this. Note that the `text` data source always has a fixed
+single-string-column schema.
+
+
+
+
+val df1 = Seq("a", "b").toDF("col1")
+val df2 = df1.withColumn("col2", lit("x"))
+
+df1.write.save("/tmp/evolved_data/part=1")
+df2.write.save("/tmp/evolved_data/part=2")
+
+spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
++----+----+----+
+|col1|col2|part|
++----+----+----+
+|   a|   x|   2|
+|   b|   x|   2|
+|   a|null|   1|
+|   b|null|   1|
++----+----+----+
+
+
+
+
+The following schema evolutions are supported in the `csv`, `json`, `orc`, and `parquet`
+file-based data sources.
+
+  1. Add a column
+  2. Remove a column
--- End diff --

Right. The test case doesn't aim to cover those cases so far.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176933493
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
--- End diff --

Thank you so much for the review, @gatorsmile . I waited for this moment. :)
I agree with all of your comments. The main reason for those limitations is 
that Spark file-based data sources don't have the capability to manage 
multi-version schemas or column default values. In fact, that is beyond the 
role of Spark data sources. Thus, this PR is trying to add test coverage for 
the as-is capability in order to prevent future regressions and to lay a 
foundation we can trust and build on later. I don't think this is worthy of 
documentation at the beginning. It's a start.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176931126
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
--- End diff --

Based on the current behavior, we do not support schema evolution. Schema 
evolution is a well-defined term. It sounds like this PR is trying to test the 
behavior when users provide a schema that does not exactly match the 
physical schema. That is different from the definition of schema evolution.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176931027
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
+
+Users can control schema evolution in several ways. For example, a new file can have an
+additional column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except the
+`text` data source support this. Note that the `text` data source always has a fixed
+single-string-column schema.
+
+
+
+
+val df1 = Seq("a", "b").toDF("col1")
+val df2 = df1.withColumn("col2", lit("x"))
+
+df1.write.save("/tmp/evolved_data/part=1")
+df2.write.save("/tmp/evolved_data/part=2")
+
+spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
++----+----+----+
+|col1|col2|part|
++----+----+----+
+|   a|   x|   2|
+|   b|   x|   2|
+|   a|null|   1|
+|   b|null|   1|
++----+----+----+
+
+
+
+
+The following schema evolutions are supported in the `csv`, `json`, `orc`, and `parquet`
+file-based data sources.
+
+  1. Add a column
+  2. Remove a column
+  3. Change a column position
+  4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` -> `double`)
--- End diff --

These are just upcasts.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176931012
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
+
+Users can control schema evolution in several ways. For example, a new file can have an
+additional column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except the
+`text` data source support this. Note that the `text` data source always has a fixed
+single-string-column schema.
+
+
+
+
+val df1 = Seq("a", "b").toDF("col1")
+val df2 = df1.withColumn("col2", lit("x"))
+
+df1.write.save("/tmp/evolved_data/part=1")
+df2.write.save("/tmp/evolved_data/part=2")
+
+spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
++----+----+----+
+|col1|col2|part|
++----+----+----+
+|   a|   x|   2|
+|   b|   x|   2|
+|   a|null|   1|
+|   b|null|   1|
++----+----+----+
+
+
+
+
+The following schema evolutions are supported in the `csv`, `json`, `orc`, and `parquet`
+file-based data sources.
+
+  1. Add a column
+  2. Remove a column
+  3. Change a column position
--- End diff --

Do we support it? When people issue `select * from tab`, we automatically 
reorder the partition columns to the end of the schema.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176930962
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
+
+Users can control schema evolution in several ways. For example, a new file can have an
+additional column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except the
+`text` data source support this. Note that the `text` data source always has a fixed
+single-string-column schema.
+
+
+
+
+val df1 = Seq("a", "b").toDF("col1")
+val df2 = df1.withColumn("col2", lit("x"))
+
+df1.write.save("/tmp/evolved_data/part=1")
+df2.write.save("/tmp/evolved_data/part=2")
+
+spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show
++----+----+----+
+|col1|col2|part|
++----+----+----+
+|   a|   x|   2|
+|   b|   x|   2|
+|   a|null|   1|
+|   b|null|   1|
++----+----+----+
+
+
+
+
+The following schema evolutions are supported in the `csv`, `json`, `orc`, and `parquet`
+file-based data sources.
+
+  1. Add a column
+  2. Remove a column
--- End diff --

In the SQL standard, when we remove a column, all its data is removed. 
However, we do not support that. Users can still see the data after they add 
a column with the same name as the one they removed previously.
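[Editor's note] The behavior described here follows from projecting a requested schema over files whose physical columns are never deleted. A standalone model, not Spark code; `readWithSchema` and the row maps are hypothetical.

```scala
// Model: a file keeps every column it was written with. A read projects the
// requested schema over each row, so a "removed" column is merely hidden, and
// re-adding a column of the same name exposes the old stored values again.
def readWithSchema(rows: Seq[Map[String, String]],
                   schema: Seq[String]): Seq[Seq[Option[String]]] =
  rows.map(row => schema.map(row.get)) // missing columns surface as None/null
```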


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r176834608
  
--- Diff: docs/sql-programming-guide.md ---
@@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data 
source options. For examp
 when `path/to/table/gender=male` is the path of the data and
 users set `basePath` to `path/to/table/`, `gender` will be a partitioning 
column.
 
+### Schema Evolution
--- End diff --

@gatorsmile . I rebased onto master and added this.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-03-19 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r175579537
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---
@@ -0,0 +1,406 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in 
file-based data sources.
+ *
+ *   1. Add a column
+ *   2. Remove a column
+ *   3. Change a column position
+ *   4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data 
type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like 
the followings.
+ *
+ *   | File Format | Coverage   | Note                                                  |
+ *   | ----------- | ---------- | ----------------------------------------------------- |
+ *   | TEXT        | N/A        | Schema consists of a single string column.            |
+ *   | CSV         | 1, 2, 4    |                                                       |
+ *   | JSON        | 1, 2, 3, 4 |                                                       |
+ *   | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ *   | PARQUET     | 1, 2, 3    |                                                       |
+ *
+ * This aims to provide an explicit test coverage for schema evolution on 
file-based data sources.
+ * Since a file format has its own coverage of schema evolution, we need a 
test suite
+ * for each file-based data source with corresponding supported test case 
traits.
+ *
+ * The following is a hierarchy of test traits.
+ *
+ *   SchemaEvolutionTest
+ * -> AddColumnEvolutionTest
+ * -> RemoveColumnEvolutionTest
+ * -> ChangePositionEvolutionTest
+ * -> BooleanTypeEvolutionTest
+ * -> IntegralTypeEvolutionTest
+ * -> ToDoubleTypeEvolutionTest
+ * -> ToDecimalTypeEvolutionTest
+ */
+
+trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with 
SharedSQLContext {
+  val format: String
+  val options: Map[String, String] = Map.empty[String, String]
+}
+
+/**
+ * Add column (Case 1).
+ * This test suite assumes that the missing column should be `null`.
+ */
+trait AddColumnEvolutionTest extends SchemaEvolutionTest {
+  import testImplicits._
+
+  test("append column at the end") {
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+
+  val df1 = Seq("a", "b").toDF("col1")
+  val df2 = df1.withColumn("col2", lit("x"))
+  val df3 = df2.withColumn("col3", lit("y"))
+
+  val dir1 = s"$path${File.separator}part=one"
+  val dir2 = s"$path${File.separator}part=two"
+  val dir3 = s"$path${File.separator}part=three"
+
+  df1.write.format(format).options(options).save(dir1)
+  df2.write.format(format).options(options).save(dir2)
+  df3.write.format(format).options(options).save(dir3)
+
+  val df = spark.read
+.schema(df3.schema)
--- End diff --

@gatorsmile . Please see this. This is not about **schema inference**.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162837578
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---
@@ -0,0 +1,406 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in 
file-based data sources.
+ *
+ *   1. Add a column
+ *   2. Remove a column
+ *   3. Change a column position
+ *   4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data 
type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like 
the followings.
+ *
+ *   | File Format | Coverage   | Note                                                  |
+ *   | ----------- | ---------- | ----------------------------------------------------- |
+ *   | TEXT        | N/A        | Schema consists of a single string column.            |
+ *   | CSV         | 1, 2, 4    |                                                       |
+ *   | JSON        | 1, 2, 3, 4 |                                                       |
+ *   | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ *   | PARQUET     | 1, 2, 3    |                                                       |
--- End diff --

Ohaaa, the schema is explicitly set here. Sorry, I missed it.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162835707
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---
@@ -0,0 +1,406 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in 
file-based data sources.
+ *
+ *   1. Add a column
+ *   2. Remove a column
+ *   3. Change a column position
+ *   4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data 
type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like 
the followings.
+ *
+ *   | File Format | Coverage   | Note                                                  |
+ *   | ----------- | ---------- | ----------------------------------------------------- |
+ *   | TEXT        | N/A        | Schema consists of a single string column.            |
+ *   | CSV         | 1, 2, 4    |                                                       |
+ *   | JSON        | 1, 2, 3, 4 |                                                       |
+ *   | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ *   | PARQUET     | 1, 2, 3    |                                                       |
--- End diff --

Correct, and this is not about schema merging.
The final correct schema is given by users (or Hive).
In this PR, all schemas are given by users, but for Hive tables, we use the 
Hive Metastore schema.


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162835448
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---
@@ -0,0 +1,406 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in 
file-based data sources.
+ *
+ *   1. Add a column
+ *   2. Remove a column
+ *   3. Change a column position
+ *   4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data 
type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like 
the followings.
+ *
+ *   | File Format | Coverage   | Note                                                  |
+ *   | ----------- | ---------- | ----------------------------------------------------- |
+ *   | TEXT        | N/A        | Schema consists of a single string column.            |
+ *   | CSV         | 1, 2, 4    |                                                       |
+ *   | JSON        | 1, 2, 3, 4 |                                                       |
+ *   | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ *   | PARQUET     | 1, 2, 3    |                                                       |
--- End diff --

@dongjoon-hyun, how do we guarantee schema changes in Parquet and ORC?

I thought we (roughly) randomly pick a file, read its footer, and then 
use it. So, I was thinking we don't properly support this. It makes sense for 
Parquet with `mergeSchema`, though.

I think it's not guaranteed in CSV either, because we rely on the 
header from one file.
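[Editor's note] The contrast drawn here, one file's footer versus `mergeSchema`, can be sketched as a schema-merging model. This is a rough standalone model of Parquet's `mergeSchema` behavior on field names only; `mergeSchemas` is hypothetical, and real merging also reconciles field types.

```scala
// Without mergeSchema: the schema comes from one (effectively arbitrary) file's
// footer. With mergeSchema: the schema is the ordered union of all files'
// fields, so columns that appear in only some files are still visible.
def footerOfOneFile(fileSchemas: Seq[Seq[String]]): Seq[String] =
  fileSchemas.head // whichever file happens to be picked

def mergeSchemas(fileSchemas: Seq[Seq[String]]): Seq[String] =
  fileSchemas.foldLeft(Seq.empty[String]) { (merged, schema) =>
    merged ++ schema.filterNot(merged.contains)
  }
```

Supplying an explicit schema with `.schema(...)`, as the tests in this PR do, sidesteps both paths entirely.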


---




[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162786270
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---
@@ -0,0 +1,436 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the followings are supported in 
file-based data sources.
+ *
+ *   1. Add a column
+ *   2. Remove a column
+ *   3. Change a column position
+ *   4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data 
type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverages like 
the followings.
+ *
+ *   | File Format | Coverage   | Note                                                  |
+ *   | ----------- | ---------- | ----------------------------------------------------- |
+ *   | TEXT        | N/A        | Schema consists of a single string column.            |
+ *   | CSV         | 1, 2, 4    |                                                       |
+ *   | JSON        | 1, 2, 3, 4 |                                                       |
+ *   | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ *   | PARQUET     | 1, 2, 3    |                                                       |
+ *
+ * This aims to provide an explicit test coverage for schema evolution on 
file-based data sources.
+ * Since a file format has its own coverage of schema evolution, we need a 
test suite
+ * for each file-based data source with corresponding supported test case 
traits.
+ *
+ * The following is a hierarchy of test traits.
+ *
+ *   SchemaEvolutionTest
+ * -> AddColumnEvolutionTest
+ * -> RemoveColumnEvolutionTest
+ * -> ChangePositionEvolutionTest
+ * -> BooleanTypeEvolutionTest
+ * -> IntegralTypeEvolutionTest
+ * -> ToDoubleTypeEvolutionTest
+ * -> ToDecimalTypeEvolutionTest
+ */
+
+trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with 
SharedSQLContext {
+  val format: String
+  val options: Map[String, String] = Map.empty[String, String]
+}
+
+/**
+ * Add column.
+ * This test suite assumes that the missing column should be `null`.
+ */
+trait AddColumnEvolutionTest extends SchemaEvolutionTest {
+  import testImplicits._
+
+  test("append column at the end") {
+withTempDir { dir =>
+  val path = dir.getCanonicalPath
+
+  val df1 = Seq("a", "b").toDF("col1")
+  val df2 = df1.withColumn("col2", lit("x"))
+  val df3 = df2.withColumn("col3", lit("y"))
+
+  val dir1 = s"$path${File.separator}part=one"
+  val dir2 = s"$path${File.separator}part=two"
+  val dir3 = s"$path${File.separator}part=three"
+
+  df1.write.mode("overwrite").format(format).options(options).save(dir1)
+  df2.write.mode("overwrite").format(format).options(options).save(dir2)
+  df3.write.mode("overwrite").format(format).options(options).save(dir3)
+
+  val df = spark.read
+.schema(df3.schema)
+.format(format)
+.options(options)
+.load(path)
+
+  checkAnswer(df, Seq(
+Row("a", null, null, "one"),
+Row("b", null, null, "one"),
+Row("a", "x", null, "two"),
+Row("b", "x", null, "two"),
+Row("a", "x", "y", "three"),
+Row("b", "x", "y", "three")))
+}
+ 

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162781551
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---
@@ -0,0 +1,436 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
+
+/**
+ * Schema can evolve in several ways and the following are supported in file-based data sources.
+ *
+ *   1. Add a column
+ *   2. Remove a column
+ *   3. Change a column position
+ *   4. Change a column type
+ *
+ * Here, we consider safe evolution without data loss. For example, data type evolution should be
+ * from small types to larger types like `int`-to-`long`, not vice versa.
+ *
+ * So far, file-based data sources have schema evolution coverage like the following.
+ *
+ *   | File Format | Coverage   | Note                                                  |
+ *   | ----------- | ---------- | ----------------------------------------------------- |
+ *   | TEXT        | N/A        | Schema consists of a single string column.            |
+ *   | CSV         | 1, 2, 4    |                                                       |
+ *   | JSON        | 1, 2, 3, 4 |                                                       |
+ *   | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
+ *   | PARQUET     | 1, 2, 3    |                                                       |
+ *
+ * This aims to provide explicit test coverage for schema evolution on file-based data sources.
+ * Since each file format has its own coverage of schema evolution, we need a test suite
+ * for each file-based data source with corresponding supported test case traits.
+ *
+ * The following is a hierarchy of test traits.
+ *
+ *   SchemaEvolutionTest
+ * -> AddColumnEvolutionTest
+ * -> RemoveColumnEvolutionTest
+ * -> ChangePositionEvolutionTest
+ * -> BooleanTypeEvolutionTest
+ * -> IntegralTypeEvolutionTest
+ * -> ToDoubleTypeEvolutionTest
+ * -> ToDecimalTypeEvolutionTest
+ */
+
+trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
+  val format: String
+  val options: Map[String, String] = Map.empty[String, String]
+}
+
+/**
+ * Add column.
--- End diff --

Shall we reference the number given above in this comment, like `(case 1.)`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162781286
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162781325
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20208#discussion_r162781308
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala
 ---

[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...

2018-01-09 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/20208

[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based 
data sources

## What changes were proposed in this pull request?

A schema can evolve in several ways, and the following are already supported in file-based data sources.

   1. Add a column
   2. Remove a column
   3. Change a column position
   4. Change a column type

This issue aims to guarantee users backward-compatible schema evolution coverage on file-based data sources and to prevent future regressions by *adding schema evolution test suites explicitly*.

Here, we consider safe evolution without data loss. For example, data type 
evolution should be from small types to larger types like `int`-to-`long`, not 
vice versa.

As of today, in the master branch, file-based data sources have schema evolution coverage like the following.

File Format | Coverage   | Note
----------- | ---------- | -----------------------------------------------------
TEXT        | N/A        | Schema consists of a single string column.
CSV         | 1, 2, 4    |
JSON        | 1, 2, 3, 4 |
ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage.
PARQUET     | 1, 2, 3    |

## How was this patch tested?

Pass the jenkins with newly added test suites.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-SCHEMA-EVOLUTION

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20208.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20208


commit 499801e7fdd545ac5918dd5f7a9294db2d5373be
Author: Dongjoon Hyun 
Date:   2018-01-07T00:02:09Z

[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based 
data sources




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org