[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r67582175 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { cq.awaitTermination(2000L) } + + test("prevent all column partitioning") { +withTempDir { dir => + val path = dir.getCanonicalPath + intercept[AnalysisException] { + spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path) + } + intercept[AnalysisException] { + spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) --- End diff -- I am trying to add more edge cases in the ongoing PR for improving the test coverage of `DataFrameReader` and `DataFrameWriter`. Will include this too. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r67581319 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { cq.awaitTermination(2000L) } + + test("prevent all column partitioning") { +withTempDir { dir => + val path = dir.getCanonicalPath + intercept[AnalysisException] { + spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path) + } + intercept[AnalysisException] { + spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) --- End diff -- Ah, I see what you mean. I'm going to exclude `ORC` cases in #13730 . So, for the `OrcSuite`, you can do that. If then, I'll really appreciate it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r67580384 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { cq.awaitTermination(2000L) } + + test("prevent all column partitioning") { +withTempDir { dir => + val path = dir.getCanonicalPath + intercept[AnalysisException] { + spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path) + } + intercept[AnalysisException] { + spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) --- End diff -- I am fine if you want to fix it. You can put it in `OrcSuite`. Then, running the following command to verify whether it passes or not: > build/sbt -Phive "hive/test-only org.apache.spark.sql.hive.orc.OrcSourceSuite" ```scala test("prevent all column partitioning") { withTempDir { dir => val path = dir.getCanonicalPath val e = intercept[AnalysisException] { spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) }.getMessage assert(e.contains("Cannot use all columns for partition columns")) } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r67578803 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { cq.awaitTermination(2000L) } + + test("prevent all column partitioning") { +withTempDir { dir => + val path = dir.getCanonicalPath + intercept[AnalysisException] { + spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path) + } + intercept[AnalysisException] { + spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) --- End diff -- Oh, I mean there is already ongoing PR. And, thank you for advice! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r67578679 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { cq.awaitTermination(2000L) } + + test("prevent all column partitioning") { +withTempDir { dir => + val path = dir.getCanonicalPath + intercept[AnalysisException] { + spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path) + } + intercept[AnalysisException] { + spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) --- End diff -- Yep. That will be fixed in https://github.com/apache/spark/pull/13730 . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r67578077 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { cq.awaitTermination(2000L) } + + test("prevent all column partitioning") { +withTempDir { dir => + val path = dir.getCanonicalPath + intercept[AnalysisException] { + spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path) + } + intercept[AnalysisException] { + spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path) --- End diff -- This test case is wrong, right? This exception message will be like `The ORC data source must be used with Hive support enabled`. To test this, we need to move this to another suite. Will fix it in my ongoing PR. My suggestion is to always verify the error message, if possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13486 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r66349438 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PartitioningUtilsSuite.scala --- @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.test.SharedSQLContext + +class PartitioningUtilsSuite extends SharedSQLContext { --- End diff -- Sure. No problem. I'll put them into `DataFrameReaderWriterSuite`, too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r66349371 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -339,7 +339,7 @@ private[sql] object PartitioningUtils { private val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) - def validatePartitionColumnDataTypes( + def validatePartitionColumnDataTypesAndCount( --- End diff -- Thank you for review, @marmbrus . That sounds better. I'll update that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r66347756 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PartitioningUtilsSuite.scala --- @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.test.SharedSQLContext + +class PartitioningUtilsSuite extends SharedSQLContext { --- End diff -- What about putting this in `DataFrameReaderWriterSuite`. This is testing more than just that utility class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r66347539 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -339,7 +339,7 @@ private[sql] object PartitioningUtils { private val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) - def validatePartitionColumnDataTypes( + def validatePartitionColumnDataTypesAndCount( --- End diff -- how about just `validatePartitionColumns`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r65799702 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -350,6 +350,10 @@ private[sql] object PartitioningUtils { case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column") } } + +if (partitionColumns.size == schema.fields.size) { + throw new AnalysisException(s"Cannot use all columns for partition columns") +} } --- End diff -- Then, let's change it. :) Since `PartitionUtils` is `private[sql]`, it's safe to be changed. I'll update this PR. Thank you for your review and idea! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user wangyang1992 commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r65799660 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -350,6 +350,10 @@ private[sql] object PartitioningUtils { case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column") } } + +if (partitionColumns.size == schema.fields.size) { + throw new AnalysisException(s"Cannot use all columns for partition columns") +} } --- End diff -- Yeah, I think it's better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r65799585 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -350,6 +350,10 @@ private[sql] object PartitioningUtils { case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column") } } + +if (partitionColumns.size == schema.fields.size) { + throw new AnalysisException(s"Cannot use all columns for partition columns") +} } --- End diff -- Thank you for attention, @wangyang1992 . Good point! Maybe, `validatePartitionColumnDataTypes` -> `validatePartitionColumnDataTypesAndCount` ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
Github user wangyang1992 commented on a diff in the pull request: https://github.com/apache/spark/pull/13486#discussion_r65799422 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -350,6 +350,10 @@ private[sql] object PartitioningUtils { case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column") } } + +if (partitionColumns.size == schema.fields.size) { + throw new AnalysisException(s"Cannot use all columns for partition columns") +} } --- End diff -- One little concern. If it is added here, should the method name be changed? After all it will do more than validating data types after the change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/13486 [SPARK-15743][SQL] Prevent saving with all-column partitioning ## What changes were proposed in this pull request? When saving datasets on storage, `partitionBy` provides an easy way to construct the directory structure. However, if a user choose all columns as partition columns, some exceptions occurs. - **ORC with all column partitioning**: `AnalysisException` on **future read** due to schema inference failure. ``` scala> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data") scala> spark.read.format("orc").load("/tmp/data").collect() org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /tmp/data. It must be specified manually; ``` - **Parquet with all-column partitioning**: `InvalidSchemaException` on **write execution** due to Parquet limitation. ``` scala> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data") [Stage 0:> (0 + 8) / 8]16/06/02 16:51:17 ERROR Utils: Aborting task org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema ... (lots of error messages) ``` Although some formats like JSON support all-column partitioning without any problem, it seems not a good idea to make lots of empty directories. This PR prevents saving with all-column partitioning by consistently raising `AnalysisException` before saving. ## How was this patch tested? Newly added `PartitioningUtilsSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-15743 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13486.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13486 commit bb97467dba96604d26d45763f4115152640ff189 Author: Dongjoon HyunDate: 2016-06-02T23:14:50Z [SPARK-15743][SQL] Prevent saving with all-column partitioning --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org