[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r201487246 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp when `path/to/table/gender=male` is the path of the data and users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. +### Schema Evolution --- End diff -- Thank you for the review, @gatorsmile . I'll update it like that. For the write operation, we cannot specify a schema as we can on the read path with `.schema`. Spark either writes new files into the directory additionally or overwrites the directory. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
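The write-path behavior described in this comment can be sketched in plain Spark code. This is a minimal, hypothetical sketch (the `/tmp/projection_demo` path and the `json` format choice are assumptions, not from the PR): `DataFrameWriter` has no `.schema(...)` method, so each write persists the DataFrame's own schema, and the directory is either appended to or overwritten.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WritePathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("write-path").getOrCreate()
    import spark.implicits._

    val path = "/tmp/projection_demo" // hypothetical scratch directory

    val df1 = Seq("a", "b").toDF("col1")
    val df2 = df1.withColumn("col2", lit("x"))

    // No writer-side .schema(): each DataFrame is persisted with its own schema.
    df1.write.mode("overwrite").json(path) // replaces the directory contents
    df2.write.mode("append").json(path)    // adds files carrying a different physical schema

    // On read, a user-supplied schema projects both generations of files;
    // rows from the old files get null for the missing column.
    spark.read.schema("col1 string, col2 string").json(path).show()

    spark.stop()
  }
}
```

The point of the sketch is the asymmetry the comment describes: schema control exists only on the read side, while the write side only chooses between appending and overwriting.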
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r201170482 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp when `path/to/table/gender=male` is the path of the data and users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. +### Schema Evolution --- End diff -- I still want to avoid using `schema evolution` in the doc or tests. `Schema Projection` might be better. More importantly, you have to clarify that this only covers the read path. What is the behavior in the write path when the physical and data schemas are different?
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176933664 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp when `path/to/table/gender=male` is the path of the data and users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. +### Schema Evolution + +Users can control schema evolution in several ways. For example, new file can have additional +new column. All file-based data sources (`csv`, `json`, `orc`, and `parquet`) except `text` +data source supports this. Note that `text` data source always has a fixed single string column +schema. + + + + +val df1 = Seq("a", "b").toDF("col1") +val df2 = df1.withColumn("col2", lit("x")) + +df1.write.save("/tmp/evolved_data/part=1") +df2.write.save("/tmp/evolved_data/part=2") + +spark.read.schema("col1 string, col2 string").load("/tmp/evolved_data").show +++++ +|col1|col2|part| +++++ +| a| x| 2| +| b| x| 2| +| a|null| 1| +| b|null| 1| +++++ + + + + +The following schema evolutions are supported in `csv`, `json`, `orc`, and `parquet` file-based +data sources. + + 1. Add a column + 2. Remove a column + 3. Change a column position + 4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` -> `double`) --- End diff -- Yep. `Upcast`s are safe. This PR doesn't aim to cover or guarantee unsafe casting at this stage. Although these are straightforward `upcast`s, not all Spark file-based data sources seem to support them (based on the test cases). This PR is trying to set a clear boundary and to clarify those missing pieces.
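The safe upcasts discussed here (`byte` -> `short` -> `int` -> `long`, `float` -> `double`) happen on the read path when the user-supplied schema is wider than the physical one. A minimal sketch, assuming a local SparkSession and a hypothetical `/tmp/upcast_demo` path, using the `json` source (listed in the thread's coverage table as supporting type changes):

```scala
import org.apache.spark.sql.SparkSession

object UpcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("upcast").getOrCreate()
    import spark.implicits._

    val path = "/tmp/upcast_demo" // hypothetical scratch directory
    Seq(1, 2, 3).toDF("id").write.mode("overwrite").json(path) // written as int-sized numbers

    // Request a wider type on read: int -> long is a lossless upcast.
    val df = spark.read.schema("id long").json(path)
    assert(df.schema("id").dataType.typeName == "long")
    assert(df.collect().map(_.getLong(0)).sorted.sameElements(Array(1L, 2L, 3L)))

    spark.stop()
  }
}
```

The reverse direction (e.g. reading `long` data with an `int` schema) is the unsafe downcast the PR deliberately leaves out of scope.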
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176933560 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ ... + 3. Change a column position --- End diff -- Correct, we need to clarify that partition columns are always at the end.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176933506 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ ... + 2. Remove a column --- End diff -- Right. The test case doesn't aim to cover those cases so far.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176933493 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp when `path/to/table/gender=male` is the path of the data and users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. +### Schema Evolution --- End diff -- Thank you so much for the review, @gatorsmile . I waited for this moment. :) I agree with all of your comments. The main reason for those limitations is that Spark file-based data sources don't have the capability to manage multi-version schemas or column default values. In fact, that is beyond the role of Spark data sources. Thus, this PR is trying to add test coverage for the AS-IS capability in order to prevent future regressions and to lay a foundation we can trust and build on later. I don't think this is worth documenting at the beginning. It's a start.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176931126 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp when `path/to/table/gender=male` is the path of the data and users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. +### Schema Evolution --- End diff -- Based on the current behavior, we do not support schema evolution. Schema evolution is a well-defined term. It sounds like this PR is trying to test the behaviors when users provide a schema that does not exactly match the physical schema. This is different from the definition of schema evolution.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176931027 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ ... + 4. Change a column type (`byte` -> `short` -> `int` -> `long`, `float` -> `double`) --- End diff -- These are just upcasts.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176931012 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ ... + 3. Change a column position --- End diff -- Do we support it? When people issue `select * from tab`, we automatically reorder the partition columns to the end of the schema.
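The reordering gatorsmile describes can be observed directly. A minimal sketch, assuming a local SparkSession and a hypothetical `/tmp/part_order_demo` path: the partition column is recovered from the directory layout on read and appended after the data columns, regardless of where it sat in the original DataFrame.

```scala
import org.apache.spark.sql.SparkSession

object PartitionColumnOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("part-order").getOrCreate()
    import spark.implicits._

    val path = "/tmp/part_order_demo" // hypothetical scratch directory

    // col2 is the second column here, but it becomes a partition directory on disk.
    Seq(("a", 1), ("b", 2)).toDF("col1", "col2")
      .write.mode("overwrite").partitionBy("col2").json(path)

    // Partition discovery appends col2 after the data columns on read.
    val df = spark.read.json(path)
    assert(df.columns.last == "col2")

    spark.stop()
  }
}
```

This is why "change a column position" cannot be fully guaranteed for partition columns: their position is normalized to the end of the schema by partition discovery.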
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176930962 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ ... + 2. Remove a column --- End diff -- In the SQL standard, when we remove a column, all its data is removed. However, we do not support that. Users can still see the data if they later add a column with the same name as the one they removed.
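gatorsmile's point is that "removing" a column is only a read-side projection: the data stays in the files and reappears if a column with the same name is added back. A minimal sketch of that behavior, assuming a local SparkSession and a hypothetical `/tmp/remove_col_demo` path:

```scala
import org.apache.spark.sql.SparkSession

object RemoveColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("remove-col").getOrCreate()
    import spark.implicits._

    val path = "/tmp/remove_col_demo" // hypothetical scratch directory
    Seq(("a", "x"), ("b", "y")).toDF("col1", "col2").write.mode("overwrite").json(path)

    // "Removing" col2 is only a projection: the bytes stay in the files.
    val narrow = spark.read.schema("col1 string").json(path)
    assert(narrow.columns.sameElements(Array("col1")))

    // Re-adding a column with the same name surfaces the old data again,
    // unlike a SQL-standard DROP COLUMN, which would discard it.
    val wide = spark.read.schema("col1 string, col2 string").json(path)
    assert(wide.collect().exists(r => !r.isNullAt(1) && r.getString(1) == "x"))

    spark.stop()
  }
}
```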
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r176834608 --- Diff: docs/sql-programming-guide.md --- @@ -815,6 +815,54 @@ should start with, they can set `basePath` in the data source options. For examp when `path/to/table/gender=male` is the path of the data and users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. +### Schema Evolution --- End diff -- @gatorsmile . I rebased to the master and added this.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r175579537 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala --- @@ -0,0 +1,406 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.File + +import org.apache.spark.sql.{QueryTest, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils} + +/** + * Schema can evolve in several ways and the followings are supported in file-based data sources. + * + * 1. Add a column + * 2. Remove a column + * 3. Change a column position + * 4. Change a column type + * + * Here, we consider safe evolution without data loss. For example, data type evolution should be + * from small types to larger types like `int`-to-`long`, not vice versa. + * + * So far, file-based data sources have schema evolution coverages like the followings. + * + * | File Format | Coverage | Note | + * | | | -- | + * | TEXT | N/A | Schema consists of a single string column. |
| + * | CSV | 1, 2, 4 | | + * | JSON | 1, 2, 3, 4 | | + * | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. | + * | PARQUET | 1, 2, 3 | | + * + * This aims to provide an explicit test coverage for schema evolution on file-based data sources. + * Since a file format has its own coverage of schema evolution, we need a test suite + * for each file-based data source with corresponding supported test case traits. + * + * The following is a hierarchy of test traits. + * + * SchemaEvolutionTest + * -> AddColumnEvolutionTest + * -> RemoveColumnEvolutionTest + * -> ChangePositionEvolutionTest + * -> BooleanTypeEvolutionTest + * -> IntegralTypeEvolutionTest + * -> ToDoubleTypeEvolutionTest + * -> ToDecimalTypeEvolutionTest + */ + +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext { + val format: String + val options: Map[String, String] = Map.empty[String, String] +} + +/** + * Add column (Case 1). + * This test suite assumes that the missing column should be `null`. + */ +trait AddColumnEvolutionTest extends SchemaEvolutionTest { + import testImplicits._ + + test("append column at the end") { +withTempPath { dir => + val path = dir.getCanonicalPath + + val df1 = Seq("a", "b").toDF("col1") + val df2 = df1.withColumn("col2", lit("x")) + val df3 = df2.withColumn("col3", lit("y")) + + val dir1 = s"$path${File.separator}part=one" + val dir2 = s"$path${File.separator}part=two" + val dir3 = s"$path${File.separator}part=three" + + df1.write.format(format).options(options).save(dir1) + df2.write.format(format).options(options).save(dir2) + df3.write.format(format).options(options).save(dir3) + + val df = spark.read +.schema(df3.schema) --- End diff -- @gatorsmile . Please see this. This is not about **schema inference**.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162837578 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala --- @@ -0,0 +1,406 @@ ... + * | TEXT | N/A | Schema consists of a single string column. | + * | CSV | 1, 2, 4 | | + * | JSON | 1, 2, 3, 4 | | + * | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. | + * | PARQUET | 1, 2, 3 | | --- End diff -- Ohaaa, the schema is explicitly set here. Sorry, I missed it.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162835707 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala --- @@ -0,0 +1,406 @@ ... + * | TEXT | N/A | Schema consists of a single string column. | + * | CSV | 1, 2, 4 | | + * | JSON | 1, 2, 3, 4 | | + * | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. | + * | PARQUET | 1, 2, 3 | | --- End diff -- Correct, and this is not about schema merging. The final correct schema is given by users (or Hive). In this PR, all schemas are given by users, but for Hive tables, we use the Hive Metastore schema.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162835448 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala --- @@ -0,0 +1,406 @@ ... + * | TEXT | N/A | Schema consists of a single string column. | + * | CSV | 1, 2, 4 | | + * | JSON | 1, 2, 3, 4 | | + * | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. | + * | PARQUET | 1, 2, 3 | | --- End diff -- @dongjoon-hyun, how do we guarantee schema changes in Parquet and ORC? I thought we (roughly) randomly pick a file, read its footer, and then use it. So I was thinking we don't properly support this. It makes sense for Parquet with `mergeSchema`, though. I think it's not even guaranteed in CSV either, because we will rely on the header from one file.
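The two alternatives HyukjinKwon contrasts, footer merging versus an explicit user schema, can be sketched side by side. This is a minimal, hypothetical example (the `/tmp/merge_schema_demo` path is an assumption): without help, Parquet derives its schema from file footers, but `mergeSchema` or an explicit `.schema(...)` makes the result deterministic across files with different physical schemas.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object MergeSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("merge-schema").getOrCreate()
    import spark.implicits._

    val path = "/tmp/merge_schema_demo" // hypothetical scratch directory
    val df1 = Seq("a", "b").toDF("col1")
    val df2 = df1.withColumn("col2", lit("x"))
    df1.write.mode("overwrite").parquet(s"$path/part=1")
    df2.write.mode("overwrite").parquet(s"$path/part=2")

    // Footer merging: Parquet reconciles the schemas of all files.
    val merged = spark.read.option("mergeSchema", "true").parquet(path)
    assert(merged.columns.toSet == Set("col1", "col2", "part"))

    // Explicit user schema: no dependence on which footer gets sampled.
    val explicit = spark.read.schema("col1 string, col2 string").parquet(path)
    assert(explicit.columns.toSet == Set("col1", "col2", "part"))

    spark.stop()
  }
}
```

The PR's tests take the second route, which is why the guarantee holds without `mergeSchema`: the final schema is always supplied by the user.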
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20208#discussion_r162786270 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala --- @@ -0,0 +1,436 @@ ... +/** + * Add column. + * This test suite assumes that the missing column should be `null`. + */ +trait AddColumnEvolutionTest extends SchemaEvolutionTest { + import testImplicits._ + + test("append column at the end") { +withTempDir { dir => + val path = dir.getCanonicalPath + + val df1 = Seq("a", "b").toDF("col1") + val df2 = df1.withColumn("col2", lit("x")) + val df3 = df2.withColumn("col3", lit("y")) + + val dir1 = s"$path${File.separator}part=one" + val dir2 = s"$path${File.separator}part=two" + val dir3 = s"$path${File.separator}part=three" + + df1.write.mode("overwrite").format(format).options(options).save(dir1) + df2.write.mode("overwrite").format(format).options(options).save(dir2) + df3.write.mode("overwrite").format(format).options(options).save(dir3) + + val df = spark.read +.schema(df3.schema) +.format(format) +.options(options) +.load(path) + + checkAnswer(df, Seq( +Row("a", null, null, "one"), +Row("b", null, null, "one"), +Row("a", "x", null, "two"), +Row("b", "x", null, "two"), +Row("a", "x", "y", "three"), +Row("b", "x", "y", "three"))) +} +
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781551

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + * 1. Add a column
    + * 2. Remove a column
    + * 3. Change a column position
    + * 4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + * | File Format | Coverage   | Note                                                  |
    + * | ----------- | ---------- | ----------------------------------------------------- |
    + * | TEXT        | N/A        | Schema consists of a single string column.            |
    + * | CSV         | 1, 2, 4    |                                                       |
    + * | JSON        | 1, 2, 3, 4 |                                                       |
    + * | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
    + * | PARQUET     | 1, 2, 3    |                                                       |
    + *
    + * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + * SchemaEvolutionTest
    + *   -> AddColumnEvolutionTest
    + *   -> RemoveColumnEvolutionTest
    + *   -> ChangePositionEvolutionTest
    + *   -> BooleanTypeEvolutionTest
    + *   -> IntegralTypeEvolutionTest
    + *   -> ToDoubleTypeEvolutionTest
    + *   -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.

--- End diff --

Shall we leave the number given above in this comment, like `(case 1.)`?

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
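The reviewer discussion above turns on which type changes count as safe (case 4). A minimal sketch of that widening rule, written here in plain Python for illustration — the names `WIDENING_CHAINS` and `is_safe_upcast` are my own, not Spark APIs:

```python
# Sketch of the "safe upcast" rule the scaladoc states: a type may only
# widen along these chains, never narrow and never cross chains.
WIDENING_CHAINS = [
    ["byte", "short", "int", "long"],
    ["float", "double"],
]

def is_safe_upcast(from_type: str, to_type: str) -> bool:
    """True when data written as `from_type` can be read back with a newer
    schema declaring `to_type` without losing information."""
    if from_type == to_type:
        return True
    for chain in WIDENING_CHAINS:
        if from_type in chain and to_type in chain:
            return chain.index(from_type) < chain.index(to_type)
    return False

print(is_safe_upcast("int", "long"))   # widening within a chain: allowed
print(is_safe_upcast("long", "int"))   # narrowing: not allowed
```

Note that under the document's stated boundary, a change like `int` to `double` is not listed, so the sketch rejects it too — the PR deliberately draws the line at the two chains above.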
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781286

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaEvolutionTest.scala ---
    @@ -0,0 +1,436 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.File
    +
    +import org.apache.spark.sql.{QueryTest, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
    +
    +/**
    + * Schema can evolve in several ways and the followings are supported in file-based data sources.
    + *
    + * 1. Add a column
    + * 2. Remove a column
    + * 3. Change a column position
    + * 4. Change a column type
    + *
    + * Here, we consider safe evolution without data loss. For example, data type evolution should be
    + * from small types to larger types like `int`-to-`long`, not vice versa.
    + *
    + * So far, file-based data sources have schema evolution coverages like the followings.
    + *
    + * | File Format | Coverage   | Note                                                  |
    + * | ----------- | ---------- | ----------------------------------------------------- |
    + * | TEXT        | N/A        | Schema consists of a single string column.            |
    + * | CSV         | 1, 2, 4    |                                                       |
    + * | JSON        | 1, 2, 3, 4 |                                                       |
    + * | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
    + * | PARQUET     | 1, 2, 3    |                                                       |
    + *
    + * This aims to provide an explicit test coverage for schema evolution on file-based data sources.
    + * Since a file format has its own coverage of schema evolution, we need a test suite
    + * for each file-based data source with corresponding supported test case traits.
    + *
    + * The following is a hierarchy of test traits.
    + *
    + * SchemaEvolutionTest
    + *   -> AddColumnEvolutionTest
    + *   -> RemoveColumnEvolutionTest
    + *   -> ChangePositionEvolutionTest
    + *   -> BooleanTypeEvolutionTest
    + *   -> IntegralTypeEvolutionTest
    + *   -> ToDoubleTypeEvolutionTest
    + *   -> ToDecimalTypeEvolutionTest
    + */
    +
    +trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext {
    +  val format: String
    +  val options: Map[String, String] = Map.empty[String, String]
    +}
    +
    +/**
    + * Add column.
    + * This test suite assumes that the missing column should be `null`.
    + */
    +trait AddColumnEvolutionTest extends SchemaEvolutionTest {
    +  import testImplicits._
    +
    +  test("append column at the end") {
    +    withTempDir { dir =>
    +      val path = dir.getCanonicalPath
    +
    +      val df1 = Seq("a", "b").toDF("col1")
    +      val df2 = df1.withColumn("col2", lit("x"))
    +      val df3 = df2.withColumn("col3", lit("y"))
    +
    +      val dir1 = s"$path${File.separator}part=one"
    +      val dir2 = s"$path${File.separator}part=two"
    +      val dir3 = s"$path${File.separator}part=three"
    +
    +      df1.write.mode("overwrite").format(format).options(options).save(dir1)
    +      df2.write.mode("overwrite").format(format).options(options).save(dir2)
    +      df3.write.mode("overwrite").format(format).options(options).save(dir3)
    +
    +      val df = spark.read
    +        .schema(df3.schema)
    +        .format(format)
    +        .options(options)
    +        .load(path)
    +
    +      checkAnswer(df, Seq(
    +        Row("a", null, null, "one"),
    +        Row("b", null, null, "one"),
    +        Row("a", "x", null, "two"),
    +        Row("b", "x", null, "two"),
    +        Row("a", "x", "y", "three"),
    +        Row("b", "x", "y", "three")))
    +    }
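The expectation in `AddColumnEvolutionTest` above — columns absent from files written with an older schema read back as `null` — can be pictured with a small stand-in. This is a hypothetical Python model (`read_with_schema` is my own name, not a Spark API), not how Spark actually reads files:

```python
def read_with_schema(rows, columns):
    """Sketch: project each raw row onto the requested column list,
    filling columns the older file never wrote with None (Spark's null)."""
    return [tuple(row.get(c) for c in columns) for row in rows]

part_one = [{"col1": "a"}, {"col1": "b"}]                            # written by df1
part_two = [{"col1": "a", "col2": "x"}, {"col1": "b", "col2": "x"}]  # written by df2

merged = read_with_schema(part_one + part_two, ["col1", "col2"])
print(merged)  # [('a', None), ('b', None), ('a', 'x'), ('b', 'x')]
```

This mirrors what `checkAnswer` verifies: the newer, wider read schema is applied uniformly, and the old partition's missing `col2` surfaces as nulls rather than an error.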
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781325
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20208#discussion_r162781308
[GitHub] spark pull request #20208: [SPARK-23007][SQL][TEST] Add schema evolution tes...
GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/20208

    [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources

    ## What changes were proposed in this pull request?

    A schema can evolve in several ways and the followings are already supported in file-based data sources.

    1. Add a column
    2. Remove a column
    3. Change a column position
    4. Change a column type

    This issue aims to guarantee users a backward-compatible schema evolution coverage on file-based data sources and to prevent future regressions by *adding schema evolution test suites explicitly*.

    Here, we consider safe evolution without data loss. For example, data type evolution should be from small types to larger types like `int`-to-`long`, not vice versa.

    As of today, in the master branch, file-based data sources have schema evolution coverages like the followings.

    File Format | Coverage   | Note
    ----------- | ---------- | -----------------------------------------------------
    TEXT        | N/A        | Schema consists of a single string column.
    CSV         | 1, 2, 4    |
    JSON        | 1, 2, 3, 4 |
    ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage.
    PARQUET     | 1, 2, 3    |

    ## How was this patch tested?

    Pass the jenkins with newly added test suites.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-SCHEMA-EVOLUTION

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20208.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20208

----
commit 499801e7fdd545ac5918dd5f7a9294db2d5373be
Author: Dongjoon Hyun
Date:   2018-01-07T00:02:09Z

    [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based data sources

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
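Case 3 in the PR description (change a column position) works because file-based readers resolve columns by name rather than by physical position. A hypothetical Python sketch of that by-name resolution (`read_row` is my own name, not a Spark API):

```python
def read_row(physical_names, values, wanted):
    """Sketch: resolve columns by name, so a file physically laid out as
    (col2, col1) still answers a (col1, col2) read schema correctly."""
    by_name = dict(zip(physical_names, values))
    # Names absent from the file fall back to None, as in the add-column case.
    return tuple(by_name.get(c) for c in wanted)

# File written with columns in the order (col2, col1):
print(read_row(["col2", "col1"], ["x", "a"], ["col1", "col2"]))  # ('a', 'x')
```

Position-based matching would have returned `('x', 'a')` here; by-name matching is what lets reordered files and the user-supplied read schema coexist.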