[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17176

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...
Github user barrenlake commented on a diff in the pull request: https://github.com/apache/spark/pull/17176#discussion_r154575331

Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

```diff
@@ -159,36 +159,11 @@ class HadoopTableReader(
   def verifyPartitionPath(
       partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
       Map[HivePartition, Class[_ <: Deserializer]] = {
-    if (!sparkSession.sessionState.conf.verifyPartitionPath) {
-      partitionToDeserializer
-    } else {
-      var existPathSet = collection.mutable.Set[String]()
-      var pathPatternSet = collection.mutable.Set[String]()
-      partitionToDeserializer.filter {
-        case (partition, partDeserializer) =>
-          def updateExistPathSetByPathPattern(pathPatternStr: String) {
-            val pathPattern = new Path(pathPatternStr)
-            val fs = pathPattern.getFileSystem(hadoopConf)
-            val matches = fs.globStatus(pathPattern)
-            matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
-          }
-          // convert /demo/data/year/month/day to /demo/data/*/*/*/
-          def getPathPatternByPath(parNum: Int, tempPath: Path): String = {
-            var path = tempPath
-            for (i <- (1 to parNum)) path = path.getParent
-            val tails = (1 to parNum).map(_ => "*").mkString("/", "/", "/")
-            path.toString + tails
-          }
-
-          val partPath = partition.getDataLocation
-          val partNum = Utilities.getPartitionDesc(partition).getPartSpec.size();
-          var pathPatternStr = getPathPatternByPath(partNum, partPath)
-          if (!pathPatternSet.contains(pathPatternStr)) {
-            pathPatternSet += pathPatternStr
-            updateExistPathSetByPathPattern(pathPatternStr)
-          }
-          existPathSet.contains(partPath.toString)
-      }
+    partitionToDeserializer.filter {
+      case (partition, partDeserializer) =>
+        val partPath = partition.getDataLocation
+        val fs = partPath.getFileSystem(hadoopConf)
+        fs.exists(partPath)
```

--- End diff ---

Each partition sending an RPC request to the NameNode can result in poor performance.
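barrenlake's concern above is about call volume: the simplified check issues one `fs.exists` (one NameNode RPC on HDFS) per partition, while the removed code resolved many partitions with a single glob. Below is a minimal sketch of that trade-off, using `java.nio` on a local temp directory as a stand-in for HDFS; the object, the `fsCalls` counter, and the partition layout are illustrative inventions, not Spark code.

```scala
import java.nio.file.{Files, Path}

object RpcCountSketch {
  // Counter standing in for filesystem RPCs: every existence probe or
  // directory listing bumps it once.
  var fsCalls = 0

  def exists(p: Path): Boolean = { fsCalls += 1; Files.exists(p) }
  def listNames(dir: Path): Set[String] = { fsCalls += 1; dir.toFile.list().toSet }

  def main(args: Array[String]): Unit = {
    val table = Files.createTempDirectory("demo_table")
    val partitions = Seq("year=2017", "year=2018", "year=2019").map(n => table.resolve(n))
    // Materialize only two of the three registered partitions on disk.
    partitions.take(2).foreach(p => Files.createDirectory(p))

    // Per-partition probing: one call per partition, N calls for N partitions.
    fsCalls = 0
    val perPartition = partitions.filter(p => exists(p))
    val perPartitionCalls = fsCalls

    // Batched alternative: one listing of the table dir answers all three.
    fsCalls = 0
    val names = listNames(table)
    val batched = partitions.filter(p => names.contains(p.getFileName.toString))
    val batchedCalls = fsCalls

    assert(perPartition == batched)                     // same surviving partitions
    assert(perPartitionCalls == 3 && batchedCalls == 1) // N calls vs. a single call
  }
}
```

Either strategy finds the same two live partitions; the difference is purely in how many round trips the metadata service sees, which is what makes the per-partition version costly on tables with many partitions.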
[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...
Github user windpiger commented on a diff in the pull request: https://github.com/apache/spark/pull/17176#discussion_r104660965

Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

```diff
@@ -159,36 +159,11 @@ class HadoopTableReader(
   def verifyPartitionPath(
       partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
       Map[HivePartition, Class[_ <: Deserializer]] = {
-    if (!sparkSession.sessionState.conf.verifyPartitionPath) {
```

--- End diff ---

After https://github.com/apache/spark/pull/17187, reading a Hive table that does not use `stored by` will no longer go through `HiveTableScanExec`. This function also has a bug when the partition location is a custom path:
1. it still filters every partition path in the parameter `partitionToDeserializer`, and
2. it scans paths that do not belong to the table. For example, with custom path `/root/a` and partitionSpec `b=1/c=2`, `getPathPatternByPath` leads to a scan of `/`.
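The `/root/a` failure mode windpiger describes can be reproduced with a sketch of the original helper's logic. This is an illustrative re-creation, with `java.nio.file.Path` standing in for Hadoop's `Path` (whose `getParent` behaves the same way here); the doubled slash in the result is just string concatenation, and a Hadoop glob over it would still start from `/`.

```scala
import java.nio.file.Paths

object PathPatternSketch {
  // Re-creation of the removed helper: strip one path component per
  // partition column, then append a "*" glob for each stripped component.
  def getPathPatternByPath(parNum: Int, tempPath: java.nio.file.Path): String = {
    var path = tempPath
    for (_ <- 1 to parNum) path = path.getParent
    val tails = (1 to parNum).map(_ => "*").mkString("/", "/", "/")
    path.toString + tails
  }

  def main(args: Array[String]): Unit = {
    // Default layout: the glob stays inside the table directory.
    assert(getPathPatternByPath(2, Paths.get("/demo/data/b=1/c=2")) == "/demo/data/*/*/")

    // Custom location /root/a for partitionSpec b=1/c=2: stripping two
    // components climbs out of the table entirely and the glob covers
    // the filesystem root.
    assert(getPathPatternByPath(2, Paths.get("/root/a")) == "//*/*/")
  }
}
```

The second assertion is the bug in miniature: the helper assumes the location's last `parNum` components are the partition directories, which an `ALTER TABLE ... SET LOCATION` breaks.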
[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...
Github user windpiger commented on a diff in the pull request: https://github.com/apache/spark/pull/17176#discussion_r104400143

Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

```diff
@@ -159,36 +159,37 @@ class HadoopTableReader(
   def verifyPartitionPath(
       partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
       Map[HivePartition, Class[_ <: Deserializer]] = {
-    if (!sparkSession.sessionState.conf.verifyPartitionPath) {
-      partitionToDeserializer
-    } else {
-      var existPathSet = collection.mutable.Set[String]()
-      var pathPatternSet = collection.mutable.Set[String]()
-      partitionToDeserializer.filter {
-        case (partition, partDeserializer) =>
-          def updateExistPathSetByPathPattern(pathPatternStr: String) {
-            val pathPattern = new Path(pathPatternStr)
-            val fs = pathPattern.getFileSystem(hadoopConf)
-            val matches = fs.globStatus(pathPattern)
-            matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
-          }
-          // convert /demo/data/year/month/day to /demo/data/*/*/*/
-          def getPathPatternByPath(parNum: Int, tempPath: Path): String = {
+    var existPathSet = collection.mutable.Set[String]()
+    var pathPatternSet = collection.mutable.Set[String]()
+    partitionToDeserializer.filter {
+      case (partition, partDeserializer) =>
+        def updateExistPathSetByPathPattern(pathPatternStr: String) {
+          val pathPattern = new Path(pathPatternStr)
+          val fs = pathPattern.getFileSystem(hadoopConf)
+          val matches = fs.globStatus(pathPattern)
+          matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
+        }
+        // convert /demo/data/year/month/day to /demo/data/*/*/*/
+        def getPathPatternByPath(parNum: Int, tempPath: Path, partitionName: String): String = {
+          // if the partition path does not end with partition name, we should not
```

--- End diff ---

If the partition location has been altered to another location, we should not apply this pattern, or we will list files matched by the pattern that do not belong to the partition.
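The guard hinted at by the new `partitionName` parameter in the diff above might look like the following sketch. This is a hypothetical helper of our own naming (`endsWithPartitionName`), not the actual patch: it only decides whether the glob optimization is safe, falling back to a direct existence check for altered locations.

```scala
object PartitionNameGuardSketch {
  // Only build a glob pattern when the location ends with the canonical
  // partition sub-path (the "partition name", e.g. "b=1/c=2"); a location
  // altered via ALTER TABLE ... SET LOCATION will not match and should get
  // a direct existence check instead.
  def endsWithPartitionName(location: String, partitionName: String): Boolean =
    location.stripSuffix("/").endsWith("/" + partitionName)

  def main(args: Array[String]): Unit = {
    // Default layout: safe to substitute the glob pattern.
    assert(endsWithPartitionName("/demo/data/b=1/c=2", "b=1/c=2"))
    // Custom location: pattern building would escape the table, so skip it.
    assert(!endsWithPartitionName("/root/a", "b=1/c=2"))
  }
}
```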
[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...
GitHub user windpiger opened a pull request: https://github.com/apache/spark/pull/17176

[SPARK-19833][SQL] remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return empty when the location does not exist

## What changes were proposed in this pull request?

SPARK-5068 introduced a SQLConf, spark.sql.hive.verifyPartitionPath; when set to true, it prevents task failures when a partition location does not exist in the filesystem. This situation should always return an empty result rather than fail the task, so this conf is removed here.

## How was this patch tested?

Modified an existing test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/windpiger/spark removeHiveVerfiyPath

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17176.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17176

commit 95aa9317b228220961074c04df06e1d08d2d8556
Author: windpiger
Date: 2017-03-06T09:16:05Z

    [SPARK-19833][SQL] remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return empty when the location does not exist
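A small model of the semantics the PR proposes: a partition whose directory is missing is filtered out before the scan and contributes zero rows, instead of failing the task mid-read. Everything here is a local stand-in for Hive/HDFS; `readPartition` and the directory layout are illustrative assumptions, not Spark code.

```scala
import java.nio.file.{Files, Path}

object EmptyPartitionSketch {
  // Hypothetical stand-in for reading a partition's records: here, just the
  // file names inside the partition directory.
  def readPartition(p: Path): Seq[String] =
    Files.list(p).toArray.map(_.toString).toSeq

  // The proposed shape: drop partitions whose location is missing, then read.
  def scan(partitions: Seq[Path]): Seq[String] =
    partitions.filter(p => Files.exists(p)).flatMap(readPartition)

  def main(args: Array[String]): Unit = {
    val table = Files.createTempDirectory("tbl")
    val kept = table.resolve("year=2017")
    Files.createDirectory(kept)
    Files.createFile(kept.resolve("part-00000"))
    val dropped = table.resolve("year=2018") // registered but never created on disk

    // Without the filter, readPartition(dropped) would throw
    // NoSuchFileException; with it, the missing partition yields no rows.
    val rows = scan(Seq(kept, dropped))
    assert(rows.size == 1)
  }
}
```

This mirrors the behavior change in the PR title: a query touching a vanished partition directory quietly returns the rows from the partitions that still exist.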