[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22467
[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22467#discussion_r219356330

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala ---

@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * A suite of tests for the Parquet support through the data sources API.
+ */
+class HiveParquetSourceSuite extends ParquetPartitioningTest {
+  import testImplicits._
+  import spark._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    dropTables("partitioned_parquet",
+      "partitioned_parquet_with_key",
+      "partitioned_parquet_with_complextypes",
+      "partitioned_parquet_with_key_and_complextypes",
+      "normal_parquet")
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDir.toURI}'
+      )

--- End diff --

Since you are refactoring the code, also fix the code style issues.
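The comment does not spell out which style issues are meant. As a minimal sketch, assuming the target is the `sql( s"""` spacing and drifting indentation in the DDL block above, the view creation could be written with `stripMargin`; the harness object, session setup, and path below are hypothetical stand-ins, not code from the PR:

import org.apache.spark.sql.SparkSession

object TempViewStyleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    // Hypothetical stand-in for partitionedTableDir.toURI; assumes Parquet
    // data already exists at this location.
    val path = "/tmp/partitioned_parquet"

    // stripMargin keeps the interpolated SQL flush with the margin markers,
    // avoiding the `sql( s"""` spacing and ad-hoc indentation flagged above.
    spark.sql(
      s"""
         |CREATE TEMPORARY VIEW partitioned_parquet
         |USING org.apache.spark.sql.parquet
         |OPTIONS (
         |  path '$path'
         |)
       """.stripMargin)

    spark.stop()
  }
}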
[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22467#discussion_r218835749

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetPartitioningTest.scala ---

@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.util.Utils
+
+// The data where the partitioning key exists only in the directory structure.
+case class ParquetData(intField: Int, stringField: String)
+// The data that also includes the partitioning key
+case class ParquetDataWithKey(p: Int, intField: Int, stringField: String)
+
+case class StructContainer(intStructField: Int, stringStructField: String)
+
+case class ParquetDataWithComplexTypes(
+    intField: Int,
+    stringField: String,
+    structField: StructContainer,
+    arrayField: Seq[Int])
+
+case class ParquetDataWithKeyAndComplexTypes(
+    p: Int,
+    intField: Int,
+    stringField: String,
+    structField: StructContainer,
+    arrayField: Seq[Int])
+
+/**
+ * A collection of tests for parquet data with various forms of partitioning.
+ */
+abstract class ParquetPartitioningTest extends QueryTest with SQLTestUtils with TestHiveSingleton {
+  import testImplicits._
+
+  var partitionedTableDir: File = null
+  var normalTableDir: File = null
+  var partitionedTableDirWithKey: File = null
+  var partitionedTableDirWithComplexTypes: File = null
+  var partitionedTableDirWithKeyAndComplexTypes: File = null
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    partitionedTableDir = Utils.createTempDir()
+    normalTableDir = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDir, s"p=$p")
+      sparkContext.makeRDD(1 to 10)
+        .map(i => ParquetData(i, s"part-$p"))
+        .toDF()
+        .write.parquet(partDir.getCanonicalPath)
+    }
+
+    sparkContext
+      .makeRDD(1 to 10)
+      .map(i => ParquetData(i, s"part-1"))
+      .toDF()
+      .write.parquet(new File(normalTableDir, "normal").getCanonicalPath)
+
+    partitionedTableDirWithKey = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithKey, s"p=$p")
+      sparkContext.makeRDD(1 to 10)
+        .map(i => ParquetDataWithKey(p, i, s"part-$p"))
+        .toDF()
+        .write.parquet(partDir.getCanonicalPath)
+    }
+
+    partitionedTableDirWithKeyAndComplexTypes = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithKeyAndComplexTypes, s"p=$p")
+      sparkContext.makeRDD(1 to 10).map { i =>
+        ParquetDataWithKeyAndComplexTypes(
+          p, i, s"part-$p", StructContainer(i, f"${i}_string"), 1 to i)
+      }.toDF().write.parquet(partDir.getCanonicalPath)
+    }
+
+    partitionedTableDirWithComplexTypes = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithComplexTypes, s"p=$p")
+      sparkContext.makeRDD(1 to 10).map { i =>
+        ParquetDataWithComplexTypes(i, s"part-$p", StructContainer(i, f"${i}_string"), 1 to i)
+      }.toDF().write.parquet(partDir.getCanonicalPath)
+    }
+  }
+
+  override protected def afterAll(): Unit = {
+    try {
+      partitionedTableDir.delete()
+      normalTableDir.delete()
+      partitionedTableDirWithKey.delete()
+      partitionedTableDirWithComplexTypes.delete()
+      partitionedTableDirWithKeyAndComplexTypes.delete()
+    } finally {
+      super.afterAll()
+    }
+  }
+
+  /**
+   * Drop
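The fixture above writes each partition into a `p=<value>` subdirectory, so the partitioning key for `ParquetData` exists only in the directory structure. A minimal standalone sketch of the same layout, showing that Spark's partition discovery recovers `p` as a column when reading the base directory (the object name, paths, and session setup are illustrative, not from the PR):

import java.io.File

import org.apache.spark.sql.SparkSession

object PartitionDiscoverySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Write three partitions laid out as <base>/p=1, <base>/p=2, <base>/p=3.
    val base = new File(System.getProperty("java.io.tmpdir"), "partitioned_parquet_demo")
    (1 to 3).foreach { p =>
      val partDir = new File(base, s"p=$p")
      (1 to 10).map(i => (i, s"part-$p"))
        .toDF("intField", "stringField")
        .write.mode("overwrite").parquet(partDir.getCanonicalPath)
    }

    // Reading the base directory triggers partition discovery: the directory
    // names become an extra `p` column alongside intField and stringField.
    val df = spark.read.parquet(base.getCanonicalPath)
    df.printSchema()
    df.where($"p" === 2).show()

    spark.stop()
  }
}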
[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22467#discussion_r218844704

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala ---

@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * A suite of tests for the Parquet support through the data sources API.
+ */
+class HiveParquetSourceSuite extends ParquetPartitioningTest {
+  import testImplicits._
+  import spark._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    dropTables("partitioned_parquet",
+      "partitioned_parquet_with_key",
+      "partitioned_parquet_with_complextypes",
+      "partitioned_parquet_with_key_and_complextypes",
+      "normal_parquet")
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDir.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKey.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW normal_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${new File(partitionedTableDir, "p=1").toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key_and_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKeyAndComplexTypes.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithComplexTypes.toURI}'
+      )
+    """)
+  }
+
+  test("SPARK-6016 make sure to use the latest footers") {
+    sql("drop table if exists spark_6016_fix")
+
+    // Create a DataFrame with two partitions. So, the created table will have two parquet files.
+    val df1 = (1 to 10).map(Tuple1(_)).toDF("a").coalesce(2)
+    df1.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
+    checkAnswer(
+      sql("select * from spark_6016_fix"),
+      (1 to 10).map(i => Row(i))
+    )
+
+    // Create a DataFrame with four partitions. So, the created table will have four parquet files.
+    val df2 = (1 to 10).map(Tuple1(_)).toDF("b").coalesce(4)
+    df2.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
+    // For the bug of SPARK-6016, we are caching two outdated footers for df1. Then,
+    // since the new table has four parquet files, we are trying to read new footers from two files
+    // and then merge metadata in footers of these four (two outdated ones and two latest one),
+    // which will cause an error.
+    checkAnswer(
+      sql("select * from spark_6016_fix"),
+      (1 to 10).map(i => Row(i))
+    )
+
+    sql("drop table spark_6016_fix")
+  }
+
+  test("SPARK-8811: compatibility with array of struct in Hive") {
+    withTempPath { dir =>
+      withTable("array_of_struct") {
+        val conf = Seq(
+          HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false",
+          SQLConf.PARQUET_BINARY_AS_STRING.key -> "true",
+          SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> "false")
+
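A minimal standalone sketch of the SPARK-6016 scenario described in the comments above: overwrite a Parquet table so the file count changes from two to four, then read it back. Before the fix, stale cached footers from the first write were merged with the new files' footers and the second read failed. The harness object and session setup are illustrative; only the table name matches the test:

import org.apache.spark.sql.{SaveMode, SparkSession}

object FooterRefreshSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    import spark.implicits._

    // First write: coalesce(2) produces a table backed by two parquet files.
    (1 to 10).toDF("a").coalesce(2)
      .write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
    assert(spark.table("spark_6016_fix").count() == 10)

    // Overwrite: coalesce(4) replaces it with four parquet files. This second
    // read is the step that broke when outdated footers were still cached.
    (1 to 10).toDF("b").coalesce(4)
      .write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
    assert(spark.table("spark_6016_fix").count() == 10)

    spark.sql("DROP TABLE spark_6016_fix")
    spark.stop()
  }
}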
[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22467#discussion_r218846178

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala ---

@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * A suite of tests for the Parquet support through the data sources API.
+ */
+class HiveParquetSourceSuite extends ParquetPartitioningTest {
+  import testImplicits._
+  import spark._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    dropTables("partitioned_parquet",
+      "partitioned_parquet_with_key",
+      "partitioned_parquet_with_complextypes",
+      "partitioned_parquet_with_key_and_complextypes",
+      "normal_parquet")
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDir.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKey.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW normal_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${new File(partitionedTableDir, "p=1").toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key_and_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKeyAndComplexTypes.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithComplexTypes.toURI}'
+      )
+    """)
+  }
+
+  test("SPARK-6016 make sure to use the latest footers") {
+    sql("drop table if exists spark_6016_fix")

--- End diff --

This case can be wrapped with `withTable("spark_6016_fix")`
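For context, `withTable` from `SQLTestUtils` runs the test body and drops the named tables in a `finally` block, so cleanup happens even when an assertion throws, and the explicit `drop table if exists` / `drop table` bookkeeping disappears from the test. A simplified stand-in for the helper and its use; the object below is hypothetical, while the real helper lives in `org.apache.spark.sql.test.SQLTestUtils`:

import org.apache.spark.sql.{SaveMode, SparkSession}

object WithTableSketch {
  val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
  import spark.implicits._

  // Simplified stand-in for SQLTestUtils.withTable: guarantee the drop even
  // if the body fails.
  def withTable(tableNames: String*)(f: => Unit): Unit = {
    try f finally {
      tableNames.foreach(name => spark.sql(s"DROP TABLE IF EXISTS $name"))
    }
  }

  def main(args: Array[String]): Unit = {
    withTable("spark_6016_fix") {
      (1 to 10).map(Tuple1(_)).toDF("a").coalesce(2)
        .write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
      assert(spark.table("spark_6016_fix").count() == 10)
    } // table dropped here, no manual cleanup needed in the test body
  }
}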
[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...
GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/22467

    [SPARK-25465][TEST] Refactor Parquet test suites in project Hive

## What changes were proposed in this pull request?

Currently the file [parquetSuites.scala](https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala) is hard to recognize by name. When I tried to find the test suites for the built-in Parquet conversions for Hive serde, I could only locate [HiveParquetSuite](https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala) in the first few minutes. This PR revises the file name and the test suite names.

## How was this patch tested?

Unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark refactor_parquet_suites

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22467.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22467

commit 8c67e0b25746444c583ea4a8ece0f5aaa88afd25
Author: Gengliang Wang
Date: 2018-09-19T11:50:17Z

    refactor parquetSuites