[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-12-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23186


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r238008395
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -345,15 +346,18 @@ object PartitioningUtils {
*/
   def resolvePartitions(
   pathsWithPartitionValues: Seq[(Path, PartitionValues)],
+  caseSensitive: Boolean,
   timeZone: TimeZone): Seq[PartitionValues] = {
 if (pathsWithPartitionValues.isEmpty) {
   Seq.empty
 } else {
-  // TODO: Selective case sensitivity.
-  val distinctPartColNames =
-
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+  val distinctPartColNames = if (caseSensitive) {
+pathsWithPartitionValues.map(_._2.columnNames)
+  } else {
+pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase()))
+  }
   assert(
-distinctPartColNames.size == 1,
+distinctPartColNames.distinct.size == 1,
 listConflictingPartitionColumns(pathsWithPartitionValues))
--- End diff --

yes I see, thanks for the kind explanation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r237990120
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -65,6 +65,34 @@ class FileIndexSuite extends SharedSQLContext {
 }
   }
 
+  test("SPARK-26230: if case sensitive, validate partitions with original 
column names") {
+withTempDir { dir =>
+  val partitionDirectory = new File(dir, s"a=1")
--- End diff --

nit. Unnecessary `s`: `s"a=1"` -> `"a=1"`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r237963856
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -345,15 +346,18 @@ object PartitioningUtils {
*/
   def resolvePartitions(
   pathsWithPartitionValues: Seq[(Path, PartitionValues)],
+  caseSensitive: Boolean,
   timeZone: TimeZone): Seq[PartitionValues] = {
 if (pathsWithPartitionValues.isEmpty) {
   Seq.empty
 } else {
-  // TODO: Selective case sensitivity.
-  val distinctPartColNames =
-
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+  val distinctPartColNames = if (caseSensitive) {
+pathsWithPartitionValues.map(_._2.columnNames)
+  } else {
+pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase()))
+  }
   assert(
-distinctPartColNames.size == 1,
+distinctPartColNames.distinct.size == 1,
 listConflictingPartitionColumns(pathsWithPartitionValues))
--- End diff --

The method `listConflictingPartitionColumns` also shows the suspicious 
paths.
If case sensitive, the method works fine. 
If case insensitive, it will list all column names without any 
transformation. e.g. 
```
Partition column name list #0: a
Partition column name list #1: A
Partition column name list #2: B
```
I can fix the method listConflictingPartitionColumns. But seems a bit 
trivial, we will have to display the original column names instead of  
transforming all to lower case .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r237889346
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -345,15 +346,18 @@ object PartitioningUtils {
*/
   def resolvePartitions(
   pathsWithPartitionValues: Seq[(Path, PartitionValues)],
+  caseSensitive: Boolean,
   timeZone: TimeZone): Seq[PartitionValues] = {
 if (pathsWithPartitionValues.isEmpty) {
   Seq.empty
 } else {
-  // TODO: Selective case sensitivity.
-  val distinctPartColNames =
-
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+  val distinctPartColNames = if (caseSensitive) {
+pathsWithPartitionValues.map(_._2.columnNames)
+  } else {
+pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase()))
+  }
   assert(
-distinctPartColNames.size == 1,
+distinctPartColNames.distinct.size == 1,
 listConflictingPartitionColumns(pathsWithPartitionValues))
--- End diff --

why don't we use `distinctPartColNames` as parameter here? Moreover, is 
that method working fine according to case sensitivity?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r237889521
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -65,6 +65,34 @@ class FileIndexSuite extends SharedSQLContext {
 }
   }
 
+  test("SPARK-26230: if case sensitive, validate partitions with original 
column names") {
+withTempDir { dir =>
+  val partitionDirectory = new File(dir, s"a=1")
+  partitionDirectory.mkdir()
+  val file = new File(partitionDirectory, "text.txt")
+  stringToFile(file, "text")
+  val partitionDirectory2 = new File(dir, s"A=2")
+  partitionDirectory2.mkdir()
+  val file2 = new File(partitionDirectory2, "text.txt")
+  stringToFile(file2, "text")
+  val path = new Path(dir.getCanonicalPath)
+
+  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
+val fileIndex = new InMemoryFileIndex(spark, Seq(path), Map.empty, 
None)
+val partitionValues = 
fileIndex.partitionSpec().partitions.map(_.values)
+assert(partitionValues.length == 2)
+  }
+
+  withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
+val msg = intercept[AssertionError] {
+  val fileIndex = new InMemoryFileIndex(spark, Seq(path), 
Map.empty, None)
+  fileIndex.partitionSpec()
+}.getMessage
+assert(msg.contains("Conflicting partition column names detected"))
--- End diff --

can we ensure that the message contains the right partitions?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r237888926
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -345,15 +346,18 @@ object PartitioningUtils {
*/
   def resolvePartitions(
   pathsWithPartitionValues: Seq[(Path, PartitionValues)],
+  caseSensitive: Boolean,
   timeZone: TimeZone): Seq[PartitionValues] = {
 if (pathsWithPartitionValues.isEmpty) {
   Seq.empty
 } else {
-  // TODO: Selective case sensitivity.
-  val distinctPartColNames =
-
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+  val distinctPartColNames = if (caseSensitive) {
--- End diff --

nit: maybe rename as there is no distinct anymore?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23186

[SPARK-26230][SQL]FileIndex: if case sensitive, validate partitions with 
original column names

## What changes were proposed in this pull request?

Partition column name is required to be unique under the same directory. 
The following paths are invalid partitioned directory:
```
hdfs://host:9000/path/a=1
hdfs://host:9000/path/b=2
```

If case sensitive, the following paths should be invalid too:
```
hdfs://host:9000/path/a=1
hdfs://host:9000/path/A=2
```
Since column 'a' and 'A' are different, and it is wrong to use either one 
as the column name in partition schema.

Also, there is a `TODO` comment in the code.

This PR is to resolve the problem.

## How was this patch tested?

Add unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark SPARK-26230

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23186.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23186


commit 6d052130051a21b9aa7c3ffce56a556bee129a5e
Author: Gengliang Wang 
Date:   2018-11-30T09:23:32Z

 if case sensitive, validate partitions with original column names




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org