[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22467


[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22467#discussion_r219356330
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * A suite of tests for the Parquet support through the data sources API.
+ */
+class HiveParquetSourceSuite extends ParquetPartitioningTest {
+  import testImplicits._
+  import spark._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    dropTables("partitioned_parquet",
+      "partitioned_parquet_with_key",
+      "partitioned_parquet_with_complextypes",
+      "partitioned_parquet_with_key_and_complextypes",
+      "normal_parquet")
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDir.toURI}'
+      )
--- End diff --

Since you are refactoring the code, also fix the code style issues. 
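
For instance, one sketch of such a style fix (my illustration, not code from this PR) is to drop the stray space after `sql(` and use `stripMargin` so the indentation of the embedded SQL is explicit; `partitionedTableDir` comes from the suite's fixture:

    // Style-fixed version of the call quoted above (a suggestion only).
    sql(
      s"""
         |CREATE TEMPORARY VIEW partitioned_parquet
         |USING org.apache.spark.sql.parquet
         |OPTIONS (
         |  path '${partitionedTableDir.toURI}'
         |)
       """.stripMargin)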



[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-19 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22467#discussion_r218835749
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetPartitioningTest.scala ---
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.util.Utils
+
+// The data where the partitioning key exists only in the directory structure.
+case class ParquetData(intField: Int, stringField: String)
+// The data that also includes the partitioning key
+case class ParquetDataWithKey(p: Int, intField: Int, stringField: String)
+
+case class StructContainer(intStructField: Int, stringStructField: String)
+
+case class ParquetDataWithComplexTypes(
+    intField: Int,
+    stringField: String,
+    structField: StructContainer,
+    arrayField: Seq[Int])
+
+case class ParquetDataWithKeyAndComplexTypes(
+    p: Int,
+    intField: Int,
+    stringField: String,
+    structField: StructContainer,
+    arrayField: Seq[Int])
+
+/**
+ * A collection of tests for parquet data with various forms of partitioning.
+ */
+abstract class ParquetPartitioningTest extends QueryTest with SQLTestUtils with TestHiveSingleton {
+  import testImplicits._
+
+  var partitionedTableDir: File = null
+  var normalTableDir: File = null
+  var partitionedTableDirWithKey: File = null
+  var partitionedTableDirWithComplexTypes: File = null
+  var partitionedTableDirWithKeyAndComplexTypes: File = null
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    partitionedTableDir = Utils.createTempDir()
+    normalTableDir = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDir, s"p=$p")
+      sparkContext.makeRDD(1 to 10)
+        .map(i => ParquetData(i, s"part-$p"))
+        .toDF()
+        .write.parquet(partDir.getCanonicalPath)
+    }
+
+    sparkContext
+      .makeRDD(1 to 10)
+      .map(i => ParquetData(i, s"part-1"))
+      .toDF()
+      .write.parquet(new File(normalTableDir, "normal").getCanonicalPath)
+
+    partitionedTableDirWithKey = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithKey, s"p=$p")
+      sparkContext.makeRDD(1 to 10)
+        .map(i => ParquetDataWithKey(p, i, s"part-$p"))
+        .toDF()
+        .write.parquet(partDir.getCanonicalPath)
+    }
+
+    partitionedTableDirWithKeyAndComplexTypes = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithKeyAndComplexTypes, s"p=$p")
+      sparkContext.makeRDD(1 to 10).map { i =>
+        ParquetDataWithKeyAndComplexTypes(
+          p, i, s"part-$p", StructContainer(i, f"${i}_string"), 1 to i)
+      }.toDF().write.parquet(partDir.getCanonicalPath)
+    }
+
+    partitionedTableDirWithComplexTypes = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithComplexTypes, s"p=$p")
+      sparkContext.makeRDD(1 to 10).map { i =>
+        ParquetDataWithComplexTypes(i, s"part-$p", StructContainer(i, f"${i}_string"), 1 to i)
+      }.toDF().write.parquet(partDir.getCanonicalPath)
+    }
+  }
+
+  override protected def afterAll(): Unit = {
+    try {
+      partitionedTableDir.delete()
+      normalTableDir.delete()
+      partitionedTableDirWithKey.delete()
+      partitionedTableDirWithComplexTypes.delete()
+      partitionedTableDirWithKeyAndComplexTypes.delete()
+    } finally {
+      super.afterAll()
+    }
+  }
+
+  /**
+   * Drop 

[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-19 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22467#discussion_r218844704
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * A suite of tests for the Parquet support through the data sources API.
+ */
+class HiveParquetSourceSuite extends ParquetPartitioningTest {
+  import testImplicits._
+  import spark._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    dropTables("partitioned_parquet",
+      "partitioned_parquet_with_key",
+      "partitioned_parquet_with_complextypes",
+      "partitioned_parquet_with_key_and_complextypes",
+      "normal_parquet")
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDir.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKey.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW normal_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${new File(partitionedTableDir, "p=1").toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key_and_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKeyAndComplexTypes.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithComplexTypes.toURI}'
+      )
+    """)
+  }
+
+  test("SPARK-6016 make sure to use the latest footers") {
+sql("drop table if exists spark_6016_fix")
+
+// Create a DataFrame with two partitions. So, the created table will 
have two parquet files.
+val df1 = (1 to 10).map(Tuple1(_)).toDF("a").coalesce(2)
+
df1.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
+checkAnswer(
+  sql("select * from spark_6016_fix"),
+  (1 to 10).map(i => Row(i))
+)
+
+// Create a DataFrame with four partitions. So, the created table will 
have four parquet files.
+val df2 = (1 to 10).map(Tuple1(_)).toDF("b").coalesce(4)
+
df2.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
+// For the bug of SPARK-6016, we are caching two outdated footers for 
df1. Then,
+// since the new table has four parquet files, we are trying to read 
new footers from two files
+// and then merge metadata in footers of these four (two outdated ones 
and two latest one),
+// which will cause an error.
+checkAnswer(
+  sql("select * from spark_6016_fix"),
+  (1 to 10).map(i => Row(i))
+)
+
+sql("drop table spark_6016_fix")
+  }
+
+  test("SPARK-8811: compatibility with array of struct in Hive") {
+withTempPath { dir =>
+  withTable("array_of_struct") {
+val conf = Seq(
+  HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false",
+  SQLConf.PARQUET_BINARY_AS_STRING.key -> "true",
+  SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> "false")
+
+

[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-19 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22467#discussion_r218846178
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * A suite of tests for the Parquet support through the data sources API.
+ */
+class HiveParquetSourceSuite extends ParquetPartitioningTest {
+  import testImplicits._
+  import spark._
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    dropTables("partitioned_parquet",
+      "partitioned_parquet_with_key",
+      "partitioned_parquet_with_complextypes",
+      "partitioned_parquet_with_key_and_complextypes",
+      "normal_parquet")
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDir.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKey.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW normal_parquet
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${new File(partitionedTableDir, "p=1").toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_key_and_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithKeyAndComplexTypes.toURI}'
+      )
+    """)
+
+    sql( s"""
+      CREATE TEMPORARY VIEW partitioned_parquet_with_complextypes
+      USING org.apache.spark.sql.parquet
+      OPTIONS (
+        path '${partitionedTableDirWithComplexTypes.toURI}'
+      )
+    """)
+  }
+
+  test("SPARK-6016 make sure to use the latest footers") {
+sql("drop table if exists spark_6016_fix")
--- End diff --

This case can be wrapped with `withTable("spark_6016_fix")`
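
A minimal sketch of that suggestion (`withTable` comes from SQLTestUtils and drops the table at the end of the block even if an assertion fails, so both explicit drop statements go away):

    test("SPARK-6016 make sure to use the latest footers") {
      withTable("spark_6016_fix") {
        val df1 = (1 to 10).map(Tuple1(_)).toDF("a").coalesce(2)
        df1.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("spark_6016_fix")
        checkAnswer(sql("select * from spark_6016_fix"), (1 to 10).map(i => Row(i)))
        // ... rest of the test body unchanged ...
      }
    }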



[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-19 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22467#discussion_r218835682
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetPartitioningTest.scala ---
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.util.Utils
+
+// The data where the partitioning key exists only in the directory structure.
+case class ParquetData(intField: Int, stringField: String)
+// The data that also includes the partitioning key
+case class ParquetDataWithKey(p: Int, intField: Int, stringField: String)
+
+case class StructContainer(intStructField: Int, stringStructField: String)
+
+case class ParquetDataWithComplexTypes(
+    intField: Int,
+    stringField: String,
+    structField: StructContainer,
+    arrayField: Seq[Int])
+
+case class ParquetDataWithKeyAndComplexTypes(
+    p: Int,
+    intField: Int,
+    stringField: String,
+    structField: StructContainer,
+    arrayField: Seq[Int])
+
+/**
+ * A collection of tests for parquet data with various forms of partitioning.
+ */
+abstract class ParquetPartitioningTest extends QueryTest with SQLTestUtils with TestHiveSingleton {
+  import testImplicits._
+
+  var partitionedTableDir: File = null
+  var normalTableDir: File = null
+  var partitionedTableDirWithKey: File = null
+  var partitionedTableDirWithComplexTypes: File = null
+  var partitionedTableDirWithKeyAndComplexTypes: File = null
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    partitionedTableDir = Utils.createTempDir()
+    normalTableDir = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDir, s"p=$p")
+      sparkContext.makeRDD(1 to 10)
+        .map(i => ParquetData(i, s"part-$p"))
+        .toDF()
+        .write.parquet(partDir.getCanonicalPath)
+    }
+
+    sparkContext
+      .makeRDD(1 to 10)
+      .map(i => ParquetData(i, s"part-1"))
+      .toDF()
+      .write.parquet(new File(normalTableDir, "normal").getCanonicalPath)
+
+    partitionedTableDirWithKey = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithKey, s"p=$p")
+      sparkContext.makeRDD(1 to 10)
+        .map(i => ParquetDataWithKey(p, i, s"part-$p"))
+        .toDF()
+        .write.parquet(partDir.getCanonicalPath)
+    }
+
+    partitionedTableDirWithKeyAndComplexTypes = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithKeyAndComplexTypes, s"p=$p")
+      sparkContext.makeRDD(1 to 10).map { i =>
+        ParquetDataWithKeyAndComplexTypes(
+          p, i, s"part-$p", StructContainer(i, f"${i}_string"), 1 to i)
+      }.toDF().write.parquet(partDir.getCanonicalPath)
+    }
+
+    partitionedTableDirWithComplexTypes = Utils.createTempDir()
+
+    (1 to 10).foreach { p =>
+      val partDir = new File(partitionedTableDirWithComplexTypes, s"p=$p")
+      sparkContext.makeRDD(1 to 10).map { i =>
+        ParquetDataWithComplexTypes(i, s"part-$p", StructContainer(i, f"${i}_string"), 1 to i)
+      }.toDF().write.parquet(partDir.getCanonicalPath)
+    }
+  }
+
+  override protected def afterAll(): Unit = {
+    try {
+      partitionedTableDir.delete()
+      normalTableDir.delete()
+      partitionedTableDirWithKey.delete()
+      partitionedTableDirWithComplexTypes.delete()
+      partitionedTableDirWithKeyAndComplexTypes.delete()
+    } finally {
+      super.afterAll()
+    }
+  }
+
+  /**
+   * Drop 

[GitHub] spark pull request #22467: [SPARK-25465][TEST] Refactor Parquet test suites ...

2018-09-19 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/22467

[SPARK-25465][TEST] Refactor Parquet test suites in project Hive

## What changes were proposed in this pull request?

Currently, the file name [parquetSuites.scala](https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala) is not recognizable. When I tried to find the test suites for the built-in Parquet conversions of Hive serde tables, all I could find in the first few minutes was [HiveParquetSuite](https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala).

This PR revises the file name and the test suite names.
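
A sketch of the resulting layout, inferred from the diffs quoted in this thread (the split is the PR's; the summary comments are mine):

    // sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetPartitioningTest.scala
    // Shared fixture: writes partitioned Parquet data into temp dirs in beforeAll().
    abstract class ParquetPartitioningTest extends QueryTest
      with SQLTestUtils with TestHiveSingleton

    // sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala
    // Tests Parquet support through the data sources API, on top of that fixture.
    class HiveParquetSourceSuite extends ParquetPartitioningTest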

## How was this patch tested?

Unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark refactor_parquet_suites

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22467.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22467


commit 8c67e0b25746444c583ea4a8ece0f5aaa88afd25
Author: Gengliang Wang 
Date:   2018-09-19T11:50:17Z

refactor parquetSuites



