[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r67582175
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala ---
@@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter {
 
     cq.awaitTermination(2000L)
   }
+
+  test("prevent all column partitioning") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+      intercept[AnalysisException] {
+        spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
+      }
+      intercept[AnalysisException] {
+        spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
--- End diff --

I am trying to add more edge cases in my ongoing PR improving the test 
coverage of `DataFrameReader` and `DataFrameWriter`. I will include this too. 
Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r67581319
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala ---
@@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter {
 
     cq.awaitTermination(2000L)
   }
+
+  test("prevent all column partitioning") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+      intercept[AnalysisException] {
+        spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
+      }
+      intercept[AnalysisException] {
+        spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
--- End diff --

Ah, I see what you mean. I'm going to exclude the `ORC` cases in #13730.
So you can do that in `OrcSuite`; if you do, I'd really appreciate it.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r67580384
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala ---
@@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter {
 
     cq.awaitTermination(2000L)
   }
+
+  test("prevent all column partitioning") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+      intercept[AnalysisException] {
+        spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
+      }
+      intercept[AnalysisException] {
+        spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
--- End diff --

I am fine if you want to fix it. You can put it in `OrcSuite`, then run the 
following command to verify that it passes:
> build/sbt -Phive "hive/test-only org.apache.spark.sql.hive.orc.OrcSourceSuite"

```scala
  test("prevent all column partitioning") {
    withTempDir { dir =>
      val path = dir.getCanonicalPath
      val e = intercept[AnalysisException] {
        spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
      }.getMessage
      assert(e.contains("Cannot use all columns for partition columns"))
    }
  }
```





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r67578803
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala ---
@@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter {
 
     cq.awaitTermination(2000L)
   }
+
+  test("prevent all column partitioning") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+      intercept[AnalysisException] {
+        spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
+      }
+      intercept[AnalysisException] {
+        spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
--- End diff --

Oh, I mean there is already an ongoing PR.
And thank you for the advice!





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-17 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r67578679
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala ---
@@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter {
 
     cq.awaitTermination(2000L)
   }
+
+  test("prevent all column partitioning") {
+    withTempDir { dir =>
+      val path = dir.getCanonicalPath
+      intercept[AnalysisException] {
+        spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
+      }
+      intercept[AnalysisException] {
+        spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
--- End diff --

Yep. That will be fixed in https://github.com/apache/spark/pull/13730.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r67578077
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala
 ---
@@ -572,4 +572,16 @@ class DataFrameReaderWriterSuite extends StreamTest 
with BeforeAndAfter {
 
 cq.awaitTermination(2000L)
   }
+
+  test("prevent all column partitioning") {
+withTempDir { dir =>
+  val path = dir.getCanonicalPath
+  intercept[AnalysisException] {
+
spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
+  }
+  intercept[AnalysisException] {
+
spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
--- End diff --

This test case is wrong, right? The exception message here will be something 
like `The ORC data source must be used with Hive support enabled`. To test 
this, we need to move it to another suite.

I will fix it in my ongoing PR. My suggestion is to always verify the error 
message, if possible.
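To illustrate that suggestion outside of Spark, here is a minimal sketch with a hypothetical `validate_partition_columns` standing in for the code under test (it is not Spark's API): asserting on the message catches tests that pass for the wrong reason.

```python
# Hypothetical stand-in for the validation under test (not Spark's API).
def validate_partition_columns(schema_fields, partition_columns):
    """Raise if every column of the schema is used for partitioning."""
    if len(partition_columns) == len(schema_fields):
        raise ValueError("Cannot use all columns for partition columns")

def test_prevent_all_column_partitioning():
    try:
        validate_partition_columns(["id"], ["id"])
    except ValueError as e:
        # Asserting on the message distinguishes this failure from an
        # unrelated error that happens to raise the same exception type.
        assert "Cannot use all columns" in str(e)
    else:
        raise AssertionError("expected ValueError")

test_prevent_all_column_partitioning()
```

A test that only checks the exception type would still pass if, say, an unrelated configuration error raised the same type first; checking the message pins down the cause.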





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13486





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r66349438
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PartitioningUtilsSuite.scala ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.test.SharedSQLContext
+
+class PartitioningUtilsSuite extends SharedSQLContext {
--- End diff --

Sure. No problem. I'll put them into `DataFrameReaderWriterSuite`, too.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r66349371
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -339,7 +339,7 @@ private[sql] object PartitioningUtils {
   private val upCastingOrder: Seq[DataType] =
     Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
 
-  def validatePartitionColumnDataTypes(
+  def validatePartitionColumnDataTypesAndCount(
--- End diff --

Thank you for review, @marmbrus .
That sounds better. I'll update that.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-08 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r66347756
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PartitioningUtilsSuite.scala ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.test.SharedSQLContext
+
+class PartitioningUtilsSuite extends SharedSQLContext {
--- End diff --

What about putting this in `DataFrameReaderWriterSuite`? This is testing 
more than just that utility class.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-08 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r66347539
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -339,7 +339,7 @@ private[sql] object PartitioningUtils {
   private val upCastingOrder: Seq[DataType] =
     Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
 
-  def validatePartitionColumnDataTypes(
+  def validatePartitionColumnDataTypesAndCount(
--- End diff --

How about just `validatePartitionColumns`?





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r65799702
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -350,6 +350,10 @@ private[sql] object PartitioningUtils {
         case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column")
       }
     }
+
+    if (partitionColumns.size == schema.fields.size) {
+      throw new AnalysisException(s"Cannot use all columns for partition columns")
+    }
   }
--- End diff --

Then, let's change it. :)
Since `PartitioningUtils` is `private[sql]`, it's safe to change.
I'll update this PR. Thank you for your review and idea!





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-04 Thread wangyang1992
Github user wangyang1992 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r65799660
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -350,6 +350,10 @@ private[sql] object PartitioningUtils {
         case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column")
       }
     }
+
+    if (partitionColumns.size == schema.fields.size) {
+      throw new AnalysisException(s"Cannot use all columns for partition columns")
+    }
   }
--- End diff --

Yeah, I think it's better.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r65799585
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -350,6 +350,10 @@ private[sql] object PartitioningUtils {
         case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column")
       }
     }
+
+    if (partitionColumns.size == schema.fields.size) {
+      throw new AnalysisException(s"Cannot use all columns for partition columns")
+    }
   }
--- End diff --

Thank you for the attention, @wangyang1992. Good point!
Maybe `validatePartitionColumnDataTypes` -> 
`validatePartitionColumnDataTypesAndCount`?





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-04 Thread wangyang1992
Github user wangyang1992 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13486#discussion_r65799422
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -350,6 +350,10 @@ private[sql] object PartitioningUtils {
         case _ => throw new AnalysisException(s"Cannot use ${field.dataType} for partition column")
       }
     }
+
+    if (partitionColumns.size == schema.fields.size) {
+      throw new AnalysisException(s"Cannot use all columns for partition columns")
+    }
   }
--- End diff --

One little concern: if it is added here, should the method name be changed? 
After all, it will do more than validate data types after the change.





[GitHub] spark pull request #13486: [SPARK-15743][SQL] Prevent saving with all-column...

2016-06-02 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/13486

[SPARK-15743][SQL] Prevent saving with all-column partitioning

## What changes were proposed in this pull request?

When saving datasets to storage, `partitionBy` provides an easy way to 
construct the directory structure. However, if a user chooses all columns as 
partition columns, exceptions occur.

- **ORC with all-column partitioning**: `AnalysisException` on a **future 
read** due to schema inference failure.
 ```
scala> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data")

scala> spark.read.format("orc").load("/tmp/data").collect()
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /tmp/data. It must be specified manually;
```

- **Parquet with all-column partitioning**: `InvalidSchemaException` on 
**write execution** due to a Parquet limitation.
 ```
scala> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data")
16/06/02 16:51:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema
... (lots of error messages)
```

Although some formats like JSON support all-column partitioning without any 
problem, creating lots of empty directories does not seem like a good idea.

This PR prevents saving with all-column partitioning by consistently 
raising an `AnalysisException` before saving.
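The guard itself boils down to a size comparison between the schema and the partition spec. A minimal Python sketch of the idea follows; the plain lists and the `check_partitioning` name are assumptions of this sketch, standing in for Spark's schema and partition columns, not Spark's actual API.

```python
# Hypothetical stand-in for the check added to PartitioningUtils:
# reject a save where every column is used as a partition column,
# since that leaves no data columns to write.
def check_partitioning(schema_fields, partition_columns):
    if len(partition_columns) == len(schema_fields):
        raise ValueError("Cannot use all columns for partition columns")

# Partitioning a one-column dataset by that column is rejected up front:
try:
    check_partitioning(schema_fields=["id"], partition_columns=["id"])
except ValueError as e:
    print(e)  # -> Cannot use all columns for partition columns

# Leaving at least one data column is accepted (no exception):
check_partitioning(schema_fields=["id", "value"], partition_columns=["id"])
```

Failing fast before the write keeps the behavior consistent across formats, instead of Parquet failing mid-write and ORC failing only on a later read.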

## How was this patch tested?

Newly added `PartitioningUtilsSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-15743

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13486.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13486


commit bb97467dba96604d26d45763f4115152640ff189
Author: Dongjoon Hyun 
Date:   2016-06-02T23:14:50Z

[SPARK-15743][SQL] Prevent saving with all-column partitioning



