[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17079

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17079#discussion_r103373620

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
```
@@ -178,6 +178,33 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog2.allFiles().nonEmpty)
     }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+    withTempDir { dir =>
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      val dirPath = new Path(dir.getAbsolutePath)
+      val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
+      val catalog =
+        new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, fileStatusCache) {
+          def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+          def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+        }
+
+      assert(catalog.leafDirPaths.isEmpty)
+      assert(catalog.leafFilePaths.isEmpty)
```
--- End diff --

Move these two asserts after `stringToFile`
[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/17079#discussion_r103373646

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
```
@@ -178,6 +178,33 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog2.allFiles().nonEmpty)
     }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+    withTempDir { dir =>
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      val dirPath = new Path(dir.getAbsolutePath)
+      val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
+      val catalog =
+        new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, fileStatusCache) {
+          def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+          def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+        }
```
--- End diff --

Nit: indentation issues in the above three lines.
[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...
Github user windpiger commented on a diff in the pull request: https://github.com/apache/spark/pull/17079#discussion_r103357633

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
```
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog2.allFiles().nonEmpty)
     }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+    withTempDir { dir =>
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      val dirPath = new Path(dir.getAbsolutePath)
+      val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+        None, fileStatusCache) {
+        def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+        def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+      }
+
+      assert(catalog.leafDirPaths.isEmpty)
+      assert(catalog.leafFilePaths.isEmpty)
+
+      val file = new File(dir, "text.txt")
+      stringToFile(file, "text")
+
+      catalog.refresh()
+
+      assert(catalog.leafFilePaths.size == 1)
+      assert(catalog.leafFilePaths.head.toString.stripSuffix("/") ==
+        s"file:${file.getAbsolutePath.stripSuffix("/")}")
```
--- End diff --

ok, let me modify~ thanks~
[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17079#discussion_r103350023

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
```
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog2.allFiles().nonEmpty)
     }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+    withTempDir { dir =>
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      val dirPath = new Path(dir.getAbsolutePath)
+      val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+        None, fileStatusCache) {
+        def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+        def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+      }
+
+      assert(catalog.leafDirPaths.isEmpty)
+      assert(catalog.leafFilePaths.isEmpty)
+
+      val file = new File(dir, "text.txt")
+      stringToFile(file, "text")
+
+      catalog.refresh()
+
+      assert(catalog.leafFilePaths.size == 1)
+      assert(catalog.leafFilePaths.head.toString.stripSuffix("/") ==
+        s"file:${file.getAbsolutePath.stripSuffix("/")}")
```
--- End diff --

this looks hacky, can you turn them into `Path` and compare?
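cloud-fan's point — compare path objects rather than hand-normalized strings — can be sketched with plain JDK types. This is a hedged illustration, not the PR's actual fix: `java.nio.file.Path` stands in for Hadoop's `Path`, and the concrete file names are made up for the example.

```scala
import java.nio.file.{Path, Paths}

// Stripping "file:" prefixes and trailing slashes from strings (as in the
// assert under review) is brittle. Path objects compare structurally after
// normalization, so redundant "." segments and separators don't matter.
val expected: Path = Paths.get("/tmp/spark-test/text.txt")
val actual: Path = Paths.get("/tmp/spark-test/./text.txt").normalize()

assert(actual == expected)
```

The same idea with Hadoop's `org.apache.hadoop.fs.Path` would wrap both sides in `new Path(...)` and rely on its URI-based equality instead of suffix-stripped strings.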
[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17079#discussion_r103349639

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
```
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog2.allFiles().nonEmpty)
     }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+    withTempDir { dir =>
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      val dirPath = new Path(dir.getAbsolutePath)
+      val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+        None, fileStatusCache) {
```
--- End diff --

nit:
```
val catalog = new XXX(...) {
  def xxx
}
```
[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong o...
GitHub user windpiger opened a pull request: https://github.com/apache/spark/pull/17079

[SPARK-19748][SQL] refresh function has a wrong order to do cache invalidate and regenerate the in-memory var for InMemoryFileIndex with FileStatusCache

## What changes were proposed in this pull request?

When we refresh an InMemoryFileIndex that has a FileStatusCache, `refresh()` first uses the FileStatusCache to regenerate `cachedLeafFiles` etc., and only then calls `FileStatusCache.invalidateAll()`. This order is wrong: the re-listing is served from the still-valid cache, so the refresh does not take effect.

```
override def refresh(): Unit = {
  refresh0()
  fileStatusCache.invalidateAll()
}

private def refresh0(): Unit = {
  val files = listLeafFiles(rootPaths)
  cachedLeafFiles =
    new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
  cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
  cachedPartitionSpec = null
}
```

## How was this patch tested?

unit test added

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/windpiger/spark fixInMemoryFileIndexRefresh

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17079.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17079

commit 552955a5293bc22da1cd644ffde90b883fc560e8
Author: windpiger
Date: 2017-02-27T07:46:18Z

    [SPARK-19748][SQL] refresh function has a wrong order to do cache invalidate and regenerate the in-memory var for InMemoryFileIndex with FileStatusCache

commit fd3bb21597809409e7f33796589c9178744063c5
Author: windpiger
Date: 2017-02-27T07:53:46Z

    modify the test case
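The ordering bug described in the pull request can be reproduced with a minimal, self-contained sketch. `SimpleCache` and `SimpleIndex` are hypothetical stand-ins for `FileStatusCache` and `InMemoryFileIndex` (not Spark's actual classes), and the in-memory map stands in for the file system:

```scala
import scala.collection.mutable

// Hypothetical stand-in for FileStatusCache: caches directory listings.
class SimpleCache {
  private val cached = mutable.Map[String, Seq[String]]()
  def getOrList(path: String, list: String => Seq[String]): Seq[String] =
    cached.getOrElseUpdate(path, list(path))
  def invalidateAll(): Unit = cached.clear()
}

// Hypothetical stand-in for InMemoryFileIndex: keeps an in-memory file list.
class SimpleIndex(root: String, fs: mutable.Map[String, Seq[String]], cache: SimpleCache) {
  var leafFiles: Seq[String] = Seq.empty

  private def refresh0(): Unit =
    leafFiles = cache.getOrList(root, p => fs.getOrElse(p, Seq.empty))

  // Buggy order (before SPARK-19748): re-list first, invalidate after,
  // so refresh0() is still served from the stale cache entry.
  def refreshBuggy(): Unit = { refresh0(); cache.invalidateAll() }

  // Fixed order: drop stale entries first, then re-list from the file system.
  def refreshFixed(): Unit = { cache.invalidateAll(); refresh0() }

  refresh0() // initial listing, which populates the cache
}

val fs = mutable.Map("/t" -> Seq.empty[String])
val cache = new SimpleCache
val index = new SimpleIndex("/t", fs, cache)

fs("/t") = Seq("text.txt")  // a new file appears on "disk"
index.refreshBuggy()
println(index.leafFiles)    // List(): the stale cache answered the re-listing
index.refreshFixed()
println(index.leafFiles)    // List(text.txt)
```

This mirrors the unit test in the PR: after the buggy refresh the index still reports no files, while invalidating before re-listing picks up the newly created file.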