[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19868


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r182077103
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }
 
+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
--- End diff --

nit: we usually write `withSQLConf(SQLConf.HIVE_VERIFY_PARTITION_PATH.key -> "false")`


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r182075854
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with 
SQLTestUtils with TestHiveSingl
 }
   }
 
+  test("Replace spark.sql.hive.verifyPartitionPath by 
spark.files.ignoreMissingFiles") {
+withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+  sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+  val testData = sparkContext.parallelize(
+(1 to 10).map(i => TestData(i, i.toString))).toDF()
+  testData.createOrReplaceTempView("testData")
+
+  val tmpDir = Files.createTempDir()
+  // create the table for test
+  sql(s"CREATE TABLE table_with_partition(key int,value string) " +
+  s"PARTITIONED by (ds string) location '${tmpDir.toURI}' ")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='1') 
" +
+  "SELECT key,value FROM testData")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='2') 
" +
+  "SELECT key,value FROM testData")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='3') 
" +
+  "SELECT key,value FROM testData")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='4') 
" +
+  "SELECT key,value FROM testData")
+
+  // test for the exist path
+  checkAnswer(sql("select key,value from table_with_partition"),
+testData.toDF.collect ++ testData.toDF.collect
--- End diff --

`testData.union(testData).union(testData).union(testData)`
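Applied to the test, the suggestion would read roughly as follows (a sketch: four unions, matching the four partitions the test inserts the same rows into):

```
checkAnswer(
  sql("select key,value from table_with_partition"),
  testData.union(testData).union(testData).union(testData))
```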


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r182075379
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }
 
+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
+
+      val tmpDir = Files.createTempDir()
+      // create the table for test
+      sql(s"CREATE TABLE table_with_partition(key int,value string) " +
--- End diff --

we should call `withTable`


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r182075222
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }
 
+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
+
+      val tmpDir = Files.createTempDir()
--- End diff --

we should call `withTempDir`


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-17 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r182075129
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }
 
+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
--- End diff --

we should call `withTempView`
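Taken together with the `withTable` and `withTempDir` comments above, the `SQLTestUtils` helpers nest naturally. A rough sketch of how the test body could be wrapped (assuming the usual helper signatures):

```
withTempDir { tmpDir =>
  withTempView("testData") {
    withTable("table_with_partition") {
      // create the temp view and the partitioned table here, then run the assertions;
      // the view, the table and the directory are cleaned up when the block exits
    }
  }
}
```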


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181939704
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
   e)
 finished = true
 null
+  case e: FileNotFoundException if ignoreMissingFiles =>
--- End diff --

Yes, I should change this.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181939661
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
 try {
   finished = !reader.next(key, value)
 } catch {
+  case e: FileNotFoundException if ignoreMissingFiles =>
+logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+finished = true
+null
--- End diff --

I removed this.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181788258
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
 try {
   finished = !reader.next(key, value)
 } catch {
+  case e: FileNotFoundException if ignoreMissingFiles =>
+logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+finished = true
+null
--- End diff --

the return value is not read by anyone
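In other words, the handler can simply end after setting the flag. A sketch of the trimmed case:

```
case e: FileNotFoundException if ignoreMissingFiles =>
  logWarning(s"Skipped missing file: ${split.inputSplit}", e)
  finished = true
  // no trailing `null` -- nothing consumes the value of this case
```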


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181788194
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
   e)
 finished = true
 null
+  case e: FileNotFoundException if ignoreMissingFiles =>
--- End diff --

shall we also apply https://github.com/apache/spark/pull/19868/files#diff-83eb37f7b0ebed3c14ccb7bff0d577c2R273 here?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181787994
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -218,6 +232,10 @@ class NewHadoopRDD[K, V](
 s"Skipped the rest content in the corrupted file: 
${split.serializableHadoopSplit}",
 e)
   finished = true
+case e: FileNotFoundException if ignoreMissingFiles =>
+  logWarning(s"Skipped missing file: 
${split.serializableHadoopSplit}", e)
+  finished = true
+  null
--- End diff --

ditto


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181787857
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
 try {
   finished = !reader.next(key, value)
 } catch {
+  case e: FileNotFoundException if ignoreMissingFiles =>
+logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+finished = true
+null
--- End diff --

the return value is not read by anyone


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181718520
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -279,6 +293,10 @@ class HadoopRDD[K, V](
   case e: IOException if ignoreCorruptFiles =>
 logWarning(s"Skipped the rest content in the corrupted file: 
${split.inputSplit}", e)
 finished = true
+  case e: FileNotFoundException if ignoreMissingFiles =>
+logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+finished = true
+null
--- End diff --

Same logic as in `FileScanRDD` -- `finished = true`, so `NextIterator.hasNext` returns false.


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/NextIterator.scala#L31
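For readers following the link, a rough paraphrase (not the verbatim Spark source) of how `NextIterator` turns the `finished` flag into iterator termination:

```
// Paraphrased contract: getNext() sets `finished = true` when there is nothing
// more to return (e.g. after swallowing a FileNotFoundException), and hasNext
// simply reflects that flag after trying to fetch the next value.
override def hasNext: Boolean = {
  if (!finished && !gotNext) {
    nextValue = getNext()
    gotNext = true
  }
  !finished
}
```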


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-16 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181716009
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
-    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
-    val inputSplits = if (ignoreEmptySplits) {
-      allInputSplits.filter(_.getLength > 0)
-    } else {
-      allInputSplits
-    }
-    val array = new Array[Partition](inputSplits.size)
-    for (i <- 0 until inputSplits.size) {
-      array(i) = new HadoopPartition(id, i, inputSplits(i))
+    try {
+      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
+      val inputSplits = if (ignoreEmptySplits) {
+        allInputSplits.filter(_.getLength > 0)
+      } else {
+        allInputSplits
+      }
+      val array = new Array[Partition](inputSplits.size)
+      for (i <- 0 until inputSplits.size) {
+        array(i) = new HadoopPartition(id, i, inputSplits(i))
+      }
+      array
+    } catch {
+      case e: InvalidInputException if ignoreMissingFiles =>
--- End diff --

Yes, when data dir doesn't exist.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181607994
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }
 
+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
--- End diff --

we should add some documentation for `spark.sql.hive.verifyPartitionPath`, saying it's replaced by `spark.files.ignoreMissingFiles` and will be removed in future releases.
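A sketch of what that note could look like, assuming the config is declared with the usual `buildConf` builder (the exact wording is only a suggestion):

```
val HIVE_VERIFY_PARTITION_PATH = buildConf("spark.sql.hive.verifyPartitionPath")
  .doc("When true, check all the partition paths under the table's root directory " +
    "when reading data stored in HDFS. This configuration will be deprecated in " +
    "future releases and replaced by spark.files.ignoreMissingFiles.")  // suggested wording only
  .booleanConf
  .createWithDefault(false)
```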


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181607863
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -279,6 +293,10 @@ class HadoopRDD[K, V](
   case e: IOException if ignoreCorruptFiles =>
 logWarning(s"Skipped the rest content in the corrupted file: 
${split.inputSplit}", e)
 finished = true
+  case e: FileNotFoundException if ignoreMissingFiles =>
+logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+finished = true
+null
--- End diff --

why return `null` here?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181607797
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -260,6 +270,10 @@ class HadoopRDD[K, V](
 logWarning(s"Skipped the rest content in the corrupted file: 
${split.inputSplit}", e)
 finished = true
 null
+  case e: FileNotFoundException if ignoreMissingFiles =>
--- End diff --

`FileNotFoundException` extends `IOException`
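Presumably the point is case ordering: because `FileNotFoundException` is a subclass of `IOException`, an `ignoreCorruptFiles` case listed first would swallow missing-file errors as well. A sketch of keeping the two concerns separate:

```
catch {
  // more specific exception type first, so a missing file is not reported as a corrupt one
  case e: FileNotFoundException if ignoreMissingFiles =>
    logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    finished = true
  case e: IOException if ignoreCorruptFiles =>
    logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
    finished = true
}
```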


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181607746
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
-    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
-    val inputSplits = if (ignoreEmptySplits) {
-      allInputSplits.filter(_.getLength > 0)
-    } else {
-      allInputSplits
-    }
-    val array = new Array[Partition](inputSplits.size)
-    for (i <- 0 until inputSplits.size) {
-      array(i) = new HadoopPartition(id, i, inputSplits(i))
+    try {
+      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
+      val inputSplits = if (ignoreEmptySplits) {
+        allInputSplits.filter(_.getLength > 0)
+      } else {
+        allInputSplits
+      }
+      val array = new Array[Partition](inputSplits.size)
+      for (i <- 0 until inputSplits.size) {
+        array(i) = new HadoopPartition(id, i, inputSplits(i))
+      }
+      array
+    } catch {
+      case e: InvalidInputException if ignoreMissingFiles =>
--- End diff --

when will this happen? the root path doesn't exist?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-15 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181571713
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
-    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
-    val inputSplits = if (ignoreEmptySplits) {
-      allInputSplits.filter(_.getLength > 0)
-    } else {
-      allInputSplits
-    }
-    val array = new Array[Partition](inputSplits.size)
-    for (i <- 0 until inputSplits.size) {
-      array(i) = new HadoopPartition(id, i, inputSplits(i))
+    try {
--- End diff --

Put the original logic in a try-catch.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-15 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181571746
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -124,17 +126,25 @@ class NewHadoopRDD[K, V](
         configurable.setConf(_conf)
       case _ =>
     }
-    val allRowSplits = inputFormat.getSplits(new JobContextImpl(_conf, jobId)).asScala
-    val rawSplits = if (ignoreEmptySplits) {
-      allRowSplits.filter(_.getLength > 0)
-    } else {
-      allRowSplits
-    }
-    val result = new Array[Partition](rawSplits.size)
-    for (i <- 0 until rawSplits.size) {
-      result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
+    try {
--- End diff --

Put the original logic in a try-catch.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-13 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181538066
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

Sure, that would be great. Is there an existing PR/JIRA working on that? If not, I can make the change :)


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181371382
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

can we make `spark.sql.files.ignoreMissingFiles` work for Hive scan? It seems a better solution than this check.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-13 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181294385
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

I'm not sure. `spark.sql.hive.verifyPartitionPath` only applies to Hive table scans, and `spark.sql.files.ignoreMissingFiles` only applies to datasource scans, doesn't it?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181272430
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

Actually, do we still need this check now that we have something like `spark.sql.files.ignoreMissingFiles`?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-12 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r181014951
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

@cloud-fan @jiangxb1987
Thanks a lot for the review.
> Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here, but it seems that would require a more complicated approach.

Yes, true. In this change I only optimize the case where the user specifies a value for every partition column, which is very common in production -- our users almost always write queries like `select xxx from yyy where year=yy and month=mm and day=dd`.

I'm not sure which option you'd prefer: leave the current logic as it is (at least the code is very simple now)? Implement a more complicated approach that defends against as many cases as possible? Or build on this PR and cover the most common cases?

Thanks again for the review :)


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-11 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r180962121
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

If the partition columns are `A/B/C/D`, then unless you specify a value for `A`, you have to check that level. The same goes for `B`/`C`/`D`.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-11 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r180961987
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here, but it seems that would require a more complicated approach.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r180741999
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

how about `select * from test where A='00'`? Shall we check one more level?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r180742133
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

and more trickily, `select * from test where B='01'`?


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-11 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r180733361
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

@cloud-fan
Thanks a lot for the review.
As I mentioned in the description, "/demo/data/year/month/day" is no longer converted to "/demo/data/*/*/*/"; instead it's converted to "/demo/data/year/month/*/" as the path pattern. When this path pattern is passed to `updateExistPathSetByPathPattern`, far fewer matching paths are returned.

---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2018-04-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19868#discussion_r180713699
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -176,12 +176,13 @@ class HadoopTableReader(
       val matches = fs.globStatus(pathPattern)
       matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
     }
-    // convert  /demo/data/year/month/day  to  /demo/data/*/*/*/
+    // convert  /demo/data/year/month/day  to  /demo/data/year/month/*/
--- End diff --

This is pretty old logic. Can you explain what's going on here and why your change works? It would help other people understand your change quickly.


---




[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...

2017-12-02 Thread jinxing64
GitHub user jinxing64 opened a pull request:

https://github.com/apache/spark/pull/19868

[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

## What changes were proposed in this pull request?

In the current code, Spark scans all partition paths when spark.sql.hive.verifyPartitionPath=true, e.g. for a table like the one below:
```
CREATE TABLE `test`(
  `id` int,
  `age` int,
  `name` string)
PARTITIONED BY (
  `A` string,
  `B` string)
load data local inpath '/tmp/data0' into table test partition(A='00', B='00')
load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
load data local inpath '/tmp/data2' into table test partition(A='10', B='10')
load data local inpath '/tmp/data3' into table test partition(A='11', B='11')
```
If I query with SQL like "select * from test where A='00' and B='00'", the current code still scans all partition paths, including '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10' and '/data/A=11/B=11'. This costs a lot of time and memory.

This PR proposes to avoid iterating all partition paths: when generating the path pattern, keep the partition values the query specifies and only wildcard the rest, i.e. convert /demo/data/year/month/day to /demo/data/year/month/*/ instead of /demo/data/*/*/*/.

## How was this patch tested?

Manually tested.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jinxing64/spark SPARK-22676

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19868


commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing 
Date:   2017-12-03T07:07:12Z

[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true




---
