[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...

2018-11-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17176


---




[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...

2017-12-03 Thread barrenlake
Github user barrenlake commented on a diff in the pull request:

https://github.com/apache/spark/pull/17176#discussion_r154575331
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -159,36 +159,11 @@ class HadoopTableReader(
     def verifyPartitionPath(
         partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
         Map[HivePartition, Class[_ <: Deserializer]] = {
-      if (!sparkSession.sessionState.conf.verifyPartitionPath) {
-        partitionToDeserializer
-      } else {
-        var existPathSet = collection.mutable.Set[String]()
-        var pathPatternSet = collection.mutable.Set[String]()
-        partitionToDeserializer.filter {
-          case (partition, partDeserializer) =>
-            def updateExistPathSetByPathPattern(pathPatternStr: String) {
-              val pathPattern = new Path(pathPatternStr)
-              val fs = pathPattern.getFileSystem(hadoopConf)
-              val matches = fs.globStatus(pathPattern)
-              matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
-            }
-            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
-            def getPathPatternByPath(parNum: Int, tempPath: Path): String = {
-              var path = tempPath
-              for (i <- (1 to parNum)) path = path.getParent
-              val tails = (1 to parNum).map(_ => "*").mkString("/", "/", "/")
-              path.toString + tails
-            }
-
-            val partPath = partition.getDataLocation
-            val partNum = Utilities.getPartitionDesc(partition).getPartSpec.size();
-            var pathPatternStr = getPathPatternByPath(partNum, partPath)
-            if (!pathPatternSet.contains(pathPatternStr)) {
-              pathPatternSet += pathPatternStr
-              updateExistPathSetByPathPattern(pathPatternStr)
-            }
-            existPathSet.contains(partPath.toString)
-        }
+      partitionToDeserializer.filter {
+        case (partition, partDeserializer) =>
+          val partPath = partition.getDataLocation
+          val fs = partPath.getFileSystem(hadoopConf)
+          fs.exists(partPath)
--- End diff --

Issuing one RPC request to the NameNode per partition can result in poor 
performance.
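
The concern, sketched below in Scala: the simplified code issues one `fs.exists` RPC per partition, while the removed code batched the checks into a single `globStatus` call per path pattern. The helper names (`checkPerPartition`, `checkByGlob`) are illustrative, not from the PR.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// N partitions => N fs.exists calls => N NameNode round trips.
def checkPerPartition(partPaths: Seq[Path], hadoopConf: Configuration): Seq[Path] =
  partPaths.filter(p => p.getFileSystem(hadoopConf).exists(p))

// A single globStatus call lists every matching partition directory at once.
def checkByGlob(pattern: Path, partPaths: Seq[Path], hadoopConf: Configuration): Seq[Path] = {
  val fs = pattern.getFileSystem(hadoopConf)
  // globStatus may return null when nothing matches, so guard with Option
  val existing = Option(fs.globStatus(pattern))
    .getOrElse(Array.empty[FileStatus])
    .map(_.getPath.toString)
    .toSet
  partPaths.filter(p => existing.contains(p.toString))
}
```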


---




[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...

2017-03-07 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/17176#discussion_r104660965
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -159,36 +159,11 @@ class HadoopTableReader(
     def verifyPartitionPath(
         partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
         Map[HivePartition, Class[_ <: Deserializer]] = {
-      if (!sparkSession.sessionState.conf.verifyPartitionPath) {
--- End diff --

After PR https://github.com/apache/spark/pull/17187, reading a Hive table 
that does not use `stored by` will no longer go through `HiveTableScanExec`.

This function also has a bug when a partition uses a custom path:
1. it will still filter all of the partition paths in the parameter 
`partitionToDeserializer`,
2. it will scan paths that do not belong to the table. For example, if the 
custom path is `/root/a` and the partitionSpec is `b=1/c=2`, 
`getPathPatternByPath` will lead to a scan of `/` (see the sketch below).
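
To make the second point concrete, here is a small standalone reproduction of how `getPathPatternByPath` (as defined before this PR, see the diff above) escapes the table directory; the paths and values are the illustrative ones from this comment:

```scala
import org.apache.hadoop.fs.Path

// the helper as it exists before this PR
def getPathPatternByPath(parNum: Int, tempPath: Path): String = {
  var path = tempPath
  for (_ <- 1 to parNum) path = path.getParent  // one level up per partition column
  val tails = (1 to parNum).map(_ => "*").mkString("/", "/", "/")
  path.toString + tails
}

// partitionSpec b=1/c=2 has two partition columns, but the partition uses the
// custom location /root/a, which does not end in .../b=1/c=2
println(getPathPatternByPath(2, new Path("/root/a")))
// prints "//*/*/" -- a glob two levels under the filesystem root, far outside the table
```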


---



[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...

2017-03-06 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/17176#discussion_r104400143
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -159,36 +159,37 @@ class HadoopTableReader(
     def verifyPartitionPath(
         partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
         Map[HivePartition, Class[_ <: Deserializer]] = {
-      if (!sparkSession.sessionState.conf.verifyPartitionPath) {
-        partitionToDeserializer
-      } else {
-        var existPathSet = collection.mutable.Set[String]()
-        var pathPatternSet = collection.mutable.Set[String]()
-        partitionToDeserializer.filter {
-          case (partition, partDeserializer) =>
-            def updateExistPathSetByPathPattern(pathPatternStr: String) {
-              val pathPattern = new Path(pathPatternStr)
-              val fs = pathPattern.getFileSystem(hadoopConf)
-              val matches = fs.globStatus(pathPattern)
-              matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
-            }
-            // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
-            def getPathPatternByPath(parNum: Int, tempPath: Path): String = {
+      var existPathSet = collection.mutable.Set[String]()
+      var pathPatternSet = collection.mutable.Set[String]()
+      partitionToDeserializer.filter {
+        case (partition, partDeserializer) =>
+          def updateExistPathSetByPathPattern(pathPatternStr: String) {
+            val pathPattern = new Path(pathPatternStr)
+            val fs = pathPattern.getFileSystem(hadoopConf)
+            val matches = fs.globStatus(pathPattern)
+            matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
+          }
+          // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+          def getPathPatternByPath(parNum: Int, tempPath: Path, partitionName: String): String = {
+            // if the partition path does not end with partition name, we should not
--- End diff --

If the partition location has been altered to some other location, we should 
not apply this pattern; otherwise we will list files that do not belong to the 
partition.
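
A minimal sketch of that guard, assuming an `Option`-returning shape; the helper name `safePathPattern` is an illustration, not the PR's final code:

```scala
import org.apache.hadoop.fs.Path

// Only derive a glob pattern when the partition directory still ends with the
// expected partition name (e.g. "b=1/c=2"); an altered or custom location is
// signalled with None so the caller can check that exact path directly instead.
def safePathPattern(parNum: Int, partPath: Path, partitionName: String): Option[String] = {
  if (!partPath.toString.endsWith(partitionName)) {
    None  // globbing parNum levels up from a custom location would escape the table dir
  } else {
    var path = partPath
    for (_ <- 1 to parNum) path = path.getParent
    val tails = (1 to parNum).map(_ => "*").mkString("/", "/", "/")
    Some(path.toString + tails)
  }
}
```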


---



[GitHub] spark pull request #17176: [SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PART...

2017-03-06 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17176

[SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return 
empty when the location does not exist

## What changes were proposed in this pull request?

In SPARK-5068, we introduced the SQLConf `spark.sql.hive.verifyPartitionPath`. 
When it is set to true, it avoids task failures when a partition location does 
not exist in the filesystem.

This situation should always return an empty result instead of failing the 
task, so here we remove the conf.
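
For context, a minimal sketch of the scenario this targets; the table name and warehouse layout are assumptions for illustration:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("CREATE TABLE demo (a INT) PARTITIONED BY (p INT) STORED AS TEXTFILE")
spark.sql("INSERT INTO demo PARTITION (p = 1) VALUES (1)")

// simulate an external process deleting the partition directory while the
// partition itself stays registered in the Hive metastore
val partDir = new Path(spark.conf.get("spark.sql.warehouse.dir") + "/demo/p=1")
partDir.getFileSystem(spark.sparkContext.hadoopConfiguration).delete(partDir, true)

spark.sql("SELECT * FROM demo").show()  // expected: empty result, not a failed task
```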

## How was this patch tested?
Modified a test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark removeHiveVerfiyPath

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17176


commit 95aa9317b228220961074c04df06e1d08d2d8556
Author: windpiger 
Date:   2017-03-06T09:16:05Z

[SPARK-19833][SQL]remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return 
empty when the location does not exist




---