[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95939337

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Let's revert it first; we should think about caching and refreshing more thoroughly later.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95937423

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

@cloud-fan @ericl @mallman For non-partitioned parquet/orc tables, we convert them to data source tables, so `InsertIntoHiveTable` is never called for them. I know it is a little confusing, but I am fine with keeping it unchanged.
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95929082

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

I think the reason we call `refreshTable` here is not related to the file status cache. We are refreshing because the metadata cache is out of date.
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95717498

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -609,6 +609,98 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }

+  test("Explicitly added partitions should be readable after load") {
+    withTable("test_added_partitions") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "partition").getCanonicalPath
--- End diff --

Why don't we just use `src` as the partition path?
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95717399

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -609,6 +609,98 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }

+  test("Explicitly added partitions should be readable after load") {
+    withTable("test_added_partitions") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "partition").getCanonicalPath
+        spark.range(2).selectExpr("cast(id as string)").toDF("a").write
+          .mode("overwrite")
+          .parquet(newPartitionDir)
+
+        sql(
+          """
+            |CREATE TABLE test_added_partitions (a STRING)
+            |PARTITIONED BY (b INT)
+            |STORED AS PARQUET
+          """.stripMargin)
+
+        // Create partition without data files and check whether it can be read
+        sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
+        // This table fetch is to fill the cache with zero leaf files
+        checkAnswer(spark.table("test_added_partitions"), Seq.empty)
+
+        sql(
+          s"""
+            |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
+            |INTO TABLE test_added_partitions PARTITION(b='1')
+          """.stripMargin)
+
+        checkAnswer(
+          spark.table("test_added_partitions"),
+          Seq(("0", 1), ("1", 1)).toDF("a", "b"))
+      }
+    }
+  }
+
+  test("Non-partitioned table readable after load") {
+    withTable("tab") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "data").getCanonicalPath
+        spark.range(2).selectExpr("cast(id as string)").toDF("a").write
+          .mode("overwrite")
+          .parquet(newPartitionDir)
+
+        sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
+
+        // This table fetch is to fill the cache with zero leaf files
+        checkAnswer(spark.table("tab"), Seq.empty)
+
+        sql(
+          s"""
+            |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
+            |INTO TABLE tab
+          """.stripMargin)
+
+        checkAnswer(spark.table("tab"), Seq(Row("0"), Row("1")))
+      }
+    }
+  }
+
+  test("Partitioned table readable after insert") {
--- End diff --

This and the next test are not needed if we revert https://github.com/apache/spark/pull/16500/files#diff-d579db9a8f27e0bbef37720ab14ec3f6L395
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95717322

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -609,6 +609,98 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }

+  test("Explicitly added partitions should be readable after load") {
+    withTable("test_added_partitions") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "partition").getCanonicalPath
+        spark.range(2).selectExpr("cast(id as string)").toDF("a").write
+          .mode("overwrite")
+          .parquet(newPartitionDir)
+
+        sql(
+          """
+            |CREATE TABLE test_added_partitions (a STRING)
+            |PARTITIONED BY (b INT)
+            |STORED AS PARQUET
+          """.stripMargin)
+
+        // Create partition without data files and check whether it can be read
+        sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
+        // This table fetch is to fill the cache with zero leaf files
+        checkAnswer(spark.table("test_added_partitions"), Seq.empty)
+
+        sql(
+          s"""
+            |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
+            |INTO TABLE test_added_partitions PARTITION(b='1')
+          """.stripMargin)
+
+        checkAnswer(
+          spark.table("test_added_partitions"),
+          Seq(("0", 1), ("1", 1)).toDF("a", "b"))
+      }
+    }
+  }
+
+  test("Non-partitioned table readable after load") {
+    withTable("tab") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "data").getCanonicalPath
--- End diff --

Why don't we just use `src` as the table path?
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95717157

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Yea, I think so, but I don't think it's worth avoiding this `refreshTable` call and adding a lot of comments to explain it.
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user ericl commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95705420

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Is this because Hive serde tables do not use the file status cache?
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95206030

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Why is it safe to restrict this call to the case where `partition.nonEmpty`?
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16500#discussion_r95068993

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Actually, we can further limit the calls to `refreshTable`, for example by checking whether the format is `parquet` or `orc`.
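The narrowing suggested in the comment above could look roughly like the sketch below. This is hypothetical, not the merged code, and reading the format from `table.catalogTable.storage.serde` is an assumption about where that information lives:

```scala
// Hypothetical sketch: refresh only when the table may have been converted
// to a data source relation on the read path (parquet/orc serdes).
val serde = table.catalogTable.storage.serde.getOrElse("").toLowerCase
val mayUseMetadataCache = serde.contains("parquet") || serde.contains("orc")
if (partition.nonEmpty && mayUseMetadataCache) {
  sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
}
```

As the later comments note, the review ultimately leaned toward reverting the condition entirely rather than refining it, since an extra `refreshTable` call is cheap relative to the comments needed to justify skipping it.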
[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/16500

[SPARK-19120] [SPARK-19121] Refresh Metadata Cache After Load Partitioned Hive Tables

### What changes were proposed in this pull request?

```Scala
sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")

// This table fetch is to fill the cache with zero leaf files
spark.table("tab").show()

sql(
  s"""
     |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
     |INTO TABLE tab
   """.stripMargin)

spark.table("tab").show()
```

In the above example, the returned result is empty after the table load. The metadata cache can be out of date after loading new data into the table, because loading/inserting does not update the cache.

So far, the metadata cache is only used for data source tables. Thus, only the `parquet` and `orc` formats face this issue, because Hive tables are converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on. This PR refreshes the metadata cache after processing the `LOAD DATA` command.

In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables on the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writes to partitioned parquet/orc tables still use `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading an out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note, it does not need to refresh the cache for non-partitioned parquet/orc tables, because they never go through `InsertIntoHiveTable`.

### How was this patch tested?

Added a test case in parquetSuites.scala

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark refreshInsertIntoHiveTable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16500.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16500

commit ea91cb077e4ea307020dc5b3b4ffe9a0d2a4dc88
Author: gatorsmile
Date: 2017-01-07T18:59:52Z

    fix.

commit b7013c2853bf993d82b88c4a605ce921d4593ebe
Author: gatorsmile
Date: 2017-01-07T22:55:25Z

    fix.

commit 27fab56bdac74dac2d7dbd36db4c240d35c89dac
Author: gatorsmile
Date: 2017-01-08T00:06:47Z

    fix.

commit 0f70e912402e118a79f48db38c1697baf6905cde
Author: gatorsmile
Date: 2017-01-08T00:33:20Z

    more test cases.
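For readers on versions without this fix, the stale-cache scenario described in the PR can be worked around by refreshing the table manually after `LOAD DATA`. A minimal sketch, assuming a Hive-enabled `SparkSession`; the `dataDir` path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("refresh-after-load")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
spark.table("tab").show()          // fills the metadata cache with zero leaf files

val dataDir = "/tmp/new_partition" // illustrative: a directory containing parquet files
spark.sql(s"LOAD DATA LOCAL INPATH '$dataDir' OVERWRITE INTO TABLE tab")

// Without the fix in this PR, the cached file listing is stale here.
// Invalidate it manually before reading again.
spark.catalog.refreshTable("tab")
spark.table("tab").show()
```

The SQL statement `REFRESH TABLE tab` is the equivalent of the `spark.catalog.refreshTable("tab")` call for users working purely in SQL.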