[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17079


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103373620
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -178,6 +178,33 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
+  val catalog =
+new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
+
+  assert(catalog.leafDirPaths.isEmpty)
+  assert(catalog.leafFilePaths.isEmpty)
--- End diff --

Move these two asserts after `stringToFile`





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103373646
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -178,6 +178,33 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
+  val catalog =
+new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
--- End diff --

Nit: indentation is off in the above three lines.





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103357633
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+None, fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
+
+  assert(catalog.leafDirPaths.isEmpty)
+  assert(catalog.leafFilePaths.isEmpty)
+
+  val file = new File(dir, "text.txt")
+  stringToFile(file, "text")
+
+  catalog.refresh()
+
+  assert(catalog.leafFilePaths.size == 1)
+  assert(catalog.leafFilePaths.head.toString.stripSuffix("/") ==
+s"file:${file.getAbsolutePath.stripSuffix("/")}")
--- End diff --

OK, let me modify it. Thanks!





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103350023
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+None, fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
+
+  assert(catalog.leafDirPaths.isEmpty)
+  assert(catalog.leafFilePaths.isEmpty)
+
+  val file = new File(dir, "text.txt")
+  stringToFile(file, "text")
+
+  catalog.refresh()
+
+  assert(catalog.leafFilePaths.size == 1)
+  assert(catalog.leafFilePaths.head.toString.stripSuffix("/") ==
+s"file:${file.getAbsolutePath.stripSuffix("/")}")
--- End diff --

this looks hacky, can you turn them into `Path` and compare?
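A sketch of what the reviewer is suggesting (hypothetical rewrite, reusing the `catalog` and `file` values from the quoted test): build the expected value as a Hadoop `Path` and compare `Path` objects directly, rather than stripping suffixes from strings.

```scala
import org.apache.hadoop.fs.Path

// Hypothetical replacement for the string-based assertion. The Path
// constructor normalizes the "file:" scheme and trailing slashes, so
// no manual stripSuffix is needed before comparing.
val expected = new Path(s"file:${file.getAbsolutePath}")

assert(catalog.leafFilePaths.size == 1)
assert(catalog.leafFilePaths.head == expected)
```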





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103349639
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+None, fileStatusCache) {
--- End diff --

nit:
```
val catalog =
  new XXX(...) {
def xxx
  }
```





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong o...

2017-02-26 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17079

[SPARK-19748][SQL] refresh function has a wrong order to do cache invalidation and regenerate the in-memory state for InMemoryFileIndex with FileStatusCache

## What changes were proposed in this pull request?

If we refresh an InMemoryFileIndex that uses a FileStatusCache, it first uses the FileStatusCache to regenerate cachedLeafFiles etc., and only then calls FileStatusCache.invalidateAll.

The order of these two actions is wrong: the regeneration still reads stale entries from the cache, so the refresh does not take effect.

```
  override def refresh(): Unit = {
    refresh0()
    fileStatusCache.invalidateAll()
  }

  private def refresh0(): Unit = {
    val files = listLeafFiles(rootPaths)
    cachedLeafFiles =
      new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
    cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
    cachedPartitionSpec = null
  }
```
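The fix implied by this description is presumably just to swap the two calls, so the cache is invalidated before the in-memory state is rebuilt (a sketch, not necessarily the exact patch):

```scala
  override def refresh(): Unit = {
    // Invalidate the FileStatusCache first, so that refresh0() (which lists
    // files through the cache) rebuilds cachedLeafFiles and
    // cachedLeafDirToChildrenFiles from fresh file system listings rather
    // than from stale cached entries.
    fileStatusCache.invalidateAll()
    refresh0()
  }
```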
## How was this patch tested?

A unit test was added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark fixInMemoryFileIndexRefresh

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17079


commit 552955a5293bc22da1cd644ffde90b883fc560e8
Author: windpiger 
Date:   2017-02-27T07:46:18Z

[SPARK-19748][SQL] refresh function has a wrong order to do cache invalidation and regenerate the in-memory state for InMemoryFileIndex with FileStatusCache

commit fd3bb21597809409e7f33796589c9178744063c5
Author: windpiger 
Date:   2017-02-27T07:53:46Z

modify the test case



