[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23186 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r238008395 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -345,15 +346,18 @@ object PartitioningUtils { */ def resolvePartitions( pathsWithPartitionValues: Seq[(Path, PartitionValues)], + caseSensitive: Boolean, timeZone: TimeZone): Seq[PartitionValues] = { if (pathsWithPartitionValues.isEmpty) { Seq.empty } else { - // TODO: Selective case sensitivity. - val distinctPartColNames = - pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct + val distinctPartColNames = if (caseSensitive) { +pathsWithPartitionValues.map(_._2.columnNames) + } else { +pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())) + } assert( -distinctPartColNames.size == 1, +distinctPartColNames.distinct.size == 1, listConflictingPartitionColumns(pathsWithPartitionValues)) --- End diff -- yes I see, thanks for the kind explanation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r237990120 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala --- @@ -65,6 +65,34 @@ class FileIndexSuite extends SharedSQLContext { } } + test("SPARK-26230: if case sensitive, validate partitions with original column names") { +withTempDir { dir => + val partitionDirectory = new File(dir, s"a=1") --- End diff -- nit. Unnecessary `s`: `s"a=1"` -> `"a=1"`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r237963856 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -345,15 +346,18 @@ object PartitioningUtils { */ def resolvePartitions( pathsWithPartitionValues: Seq[(Path, PartitionValues)], + caseSensitive: Boolean, timeZone: TimeZone): Seq[PartitionValues] = { if (pathsWithPartitionValues.isEmpty) { Seq.empty } else { - // TODO: Selective case sensitivity. - val distinctPartColNames = - pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct + val distinctPartColNames = if (caseSensitive) { +pathsWithPartitionValues.map(_._2.columnNames) + } else { +pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())) + } assert( -distinctPartColNames.size == 1, +distinctPartColNames.distinct.size == 1, listConflictingPartitionColumns(pathsWithPartitionValues)) --- End diff -- The method `listConflictingPartitionColumns` also shows the suspicious paths. If case sensitive, the method works fine. If case insensitive, it will list all column names without any transformation. e.g. ``` Partition column name list #0: a Partition column name list #1: A Partition column name list #2: B ``` I can fix the method listConflictingPartitionColumns. But seems a bit trivial, we will have to display the original column names instead of transforming all to lower case . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r237889346 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -345,15 +346,18 @@ object PartitioningUtils { */ def resolvePartitions( pathsWithPartitionValues: Seq[(Path, PartitionValues)], + caseSensitive: Boolean, timeZone: TimeZone): Seq[PartitionValues] = { if (pathsWithPartitionValues.isEmpty) { Seq.empty } else { - // TODO: Selective case sensitivity. - val distinctPartColNames = - pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct + val distinctPartColNames = if (caseSensitive) { +pathsWithPartitionValues.map(_._2.columnNames) + } else { +pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())) + } assert( -distinctPartColNames.size == 1, +distinctPartColNames.distinct.size == 1, listConflictingPartitionColumns(pathsWithPartitionValues)) --- End diff -- why don't we use `distinctPartColNames` as parameter here? Moreover, is that method working fine according to case sensitivity? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r237889521 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala --- @@ -65,6 +65,34 @@ class FileIndexSuite extends SharedSQLContext { } } + test("SPARK-26230: if case sensitive, validate partitions with original column names") { +withTempDir { dir => + val partitionDirectory = new File(dir, s"a=1") + partitionDirectory.mkdir() + val file = new File(partitionDirectory, "text.txt") + stringToFile(file, "text") + val partitionDirectory2 = new File(dir, s"A=2") + partitionDirectory2.mkdir() + val file2 = new File(partitionDirectory2, "text.txt") + stringToFile(file2, "text") + val path = new Path(dir.getCanonicalPath) + + withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") { +val fileIndex = new InMemoryFileIndex(spark, Seq(path), Map.empty, None) +val partitionValues = fileIndex.partitionSpec().partitions.map(_.values) +assert(partitionValues.length == 2) + } + + withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") { +val msg = intercept[AssertionError] { + val fileIndex = new InMemoryFileIndex(spark, Seq(path), Map.empty, None) + fileIndex.partitionSpec() +}.getMessage +assert(msg.contains("Conflicting partition column names detected")) --- End diff -- can we ensure that the message contains the right partitions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r237888926 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -345,15 +346,18 @@ object PartitioningUtils { */ def resolvePartitions( pathsWithPartitionValues: Seq[(Path, PartitionValues)], + caseSensitive: Boolean, timeZone: TimeZone): Seq[PartitionValues] = { if (pathsWithPartitionValues.isEmpty) { Seq.empty } else { - // TODO: Selective case sensitivity. - val distinctPartColNames = - pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct + val distinctPartColNames = if (caseSensitive) { --- End diff -- nit: maybe rename as there is no distinct anymore? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23186 [SPARK-26230][SQL]FileIndex: if case sensitive, validate partitions with original column names ## What changes were proposed in this pull request? Partition column name is required to be unique under the same directory. The following paths are invalid partitioned directory: ``` hdfs://host:9000/path/a=1 hdfs://host:9000/path/b=2 ``` If case sensitive, the following paths should be invalid too: ``` hdfs://host:9000/path/a=1 hdfs://host:9000/path/A=2 ``` Since column 'a' and 'A' are different, and it is wrong to use either one as the column name in partition schema. Also, there is a `TODO` comment in the code. This PR is to resolve the problem. ## How was this patch tested? Add unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark SPARK-26230 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23186.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23186 commit 6d052130051a21b9aa7c3ffce56a556bee129a5e Author: Gengliang Wang Date: 2018-11-30T09:23:32Z if case sensitive, validate partitions with original column names --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org