[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95939337
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Let's revert it first; we should think about caching and refreshing more 
thoroughly later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95937423
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

@cloud-fan @ericl @mallman For non-partitioned parquet/orc tables, we 
convert them to data source tables. Thus, they will not call 
`InsertIntoHiveTable`. 

I know it is a little bit confusing, but I am fine with keeping it unchanged.





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95929082
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

I think the reason we call `refreshTable` here is not related to the 
file status cache. We are refreshing the cache because the metadata cache is 
out of date. 





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95717498
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -609,6 +609,98 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }
 
+  test("Explicitly added partitions should be readable after load") {
+    withTable("test_added_partitions") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "partition").getCanonicalPath
--- End diff --

why don't we just use `src` as the partition path?





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95717399
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -609,6 +609,98 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }
 
+  test("Explicitly added partitions should be readable after load") {
+    withTable("test_added_partitions") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "partition").getCanonicalPath
+        spark.range(2).selectExpr("cast(id as string)").toDF("a").write
+          .mode("overwrite")
+          .parquet(newPartitionDir)
+
+        sql(
+          """
+            |CREATE TABLE test_added_partitions (a STRING)
+            |PARTITIONED BY (b INT)
+            |STORED AS PARQUET
+          """.stripMargin)
+
+        // Create partition without data files and check whether it can be read
+        sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
+        // This table fetch is to fill the cache with zero leaf files
+        checkAnswer(spark.table("test_added_partitions"), Seq.empty)
+
+        sql(
+          s"""
+             |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
+             |INTO TABLE test_added_partitions PARTITION(b='1')
+           """.stripMargin)
+
+        checkAnswer(
+          spark.table("test_added_partitions"),
+          Seq(("0", 1), ("1", 1)).toDF("a", "b"))
+      }
+    }
+  }
+
+  test("Non-partitioned table readable after load") {
+    withTable("tab") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "data").getCanonicalPath
+        spark.range(2).selectExpr("cast(id as string)").toDF("a").write
+          .mode("overwrite")
+          .parquet(newPartitionDir)
+
+        sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
+
+        // This table fetch is to fill the cache with zero leaf files
+        checkAnswer(spark.table("tab"), Seq.empty)
+
+        sql(
+          s"""
+             |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
+             |INTO TABLE tab
+           """.stripMargin)
+
+        checkAnswer(spark.table("tab"), Seq(Row("0"), Row("1")))
+      }
+    }
+  }
+
+  test("Partitioned table readable after insert") {
--- End diff --

this and the next test are not needed if we revert 
https://github.com/apache/spark/pull/16500/files#diff-d579db9a8f27e0bbef37720ab14ec3f6L395





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95717322
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -609,6 +609,98 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }
 
+  test("Explicitly added partitions should be readable after load") {
+    withTable("test_added_partitions") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "partition").getCanonicalPath
+        spark.range(2).selectExpr("cast(id as string)").toDF("a").write
+          .mode("overwrite")
+          .parquet(newPartitionDir)
+
+        sql(
+          """
+            |CREATE TABLE test_added_partitions (a STRING)
+            |PARTITIONED BY (b INT)
+            |STORED AS PARQUET
+          """.stripMargin)
+
+        // Create partition without data files and check whether it can be read
+        sql(s"ALTER TABLE test_added_partitions ADD PARTITION (b='1')")
+        // This table fetch is to fill the cache with zero leaf files
+        checkAnswer(spark.table("test_added_partitions"), Seq.empty)
+
+        sql(
+          s"""
+             |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
+             |INTO TABLE test_added_partitions PARTITION(b='1')
+           """.stripMargin)
+
+        checkAnswer(
+          spark.table("test_added_partitions"),
+          Seq(("0", 1), ("1", 1)).toDF("a", "b"))
+      }
+    }
+  }
+
+  test("Non-partitioned table readable after load") {
+    withTable("tab") {
+      withTempDir { src =>
+        val newPartitionDir = new File(src, "data").getCanonicalPath
--- End diff --

why don't we just use `src` as the table path?





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95717157
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Yea, I think so, but I don't think it's worth avoiding this `refreshTable` 
call and adding a lot of comments to explain it.





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-11 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95705420
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Is this because hive serde tables do not use the file status cache?





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95206030
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Why is it safe to restrict this call to the case where `partition.nonEmpty`?





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16500#discussion_r95068993
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -392,7 +392,9 @@ case class InsertIntoHiveTable(
 
     // Invalidate the cache.
     sqlContext.sharedState.cacheManager.invalidateCache(table)
-    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    if (partition.nonEmpty) {
+      sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)
+    }
--- End diff --

Actually, we can further limit the calls to `refreshTable`, for example by 
checking whether the format is `parquet` or `orc`.
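As a rough illustration of that suggestion, the check could look like the sketch below. This is a hedged toy model, not Spark's actual API: `RefreshPolicy`, `shouldRefresh`, and the plain format strings are all made up for this example (Spark's real check would inspect the table's serde/provider).

```scala
// Hypothetical policy: refresh the metadata cache only when the write goes
// through InsertIntoHiveTable (i.e. the table is partitioned) AND the format
// is one the read path serves from the metadata cache (parquet/orc).
object RefreshPolicy {
  private val cachedFormats = Set("parquet", "orc")

  def shouldRefresh(format: String, isPartitioned: Boolean): Boolean =
    cachedFormats.contains(format.toLowerCase) && isPartitioned

  def main(args: Array[String]): Unit = {
    assert(shouldRefresh("parquet", isPartitioned = true))
    // Non-partitioned parquet is handled by the data source write path.
    assert(!shouldRefresh("parquet", isPartitioned = false))
    // Plain Hive formats never touch the metadata cache.
    assert(!shouldRefresh("textfile", isPartitioned = true))
    println("ok")
  }
}
```

The trade-off discussed in the thread is that such a check saves a cheap `refreshTable` call at the cost of extra comments explaining why it is safe.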





[GitHub] spark pull request #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cach...

2017-01-07 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/16500

[SPARK-19120] [SPARK-19121] Refresh Metadata Cache After Load Partitioned 
Hive Tables

### What changes were proposed in this pull request?
```Scala
sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")

// This table fetch is to fill the cache with zero leaf files
spark.table("tab").show()

sql(
  s"""
 |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
 |INTO TABLE tab
   """.stripMargin)

spark.table("tab").show()
```

In the above example, the returned result is empty after the table is loaded. 
The metadata cache can be out of date after loading new data into the table, 
because loading/inserting does not update the cache. So far, the metadata cache 
is only used for data source tables. Thus, only the `parquet` and `orc` formats 
face this issue, because Hive tables are converted to data source tables when 
`spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` 
is on. 

This PR is to refresh the metadata cache after processing the `LOAD DATA` 
command.

In addition, Spark SQL does not convert **partitioned** Hive tables 
(orc/parquet) to data source tables in the write path, but the read path uses 
the metadata cache for both **partitioned** and non-partitioned Hive tables 
(orc/parquet). That means writing partitioned parquet/orc tables still uses 
`InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid 
reading an out-of-date cache, `InsertIntoHiveTable` needs to refresh the 
metadata cache for partitioned tables. Note that it does not need to refresh 
the cache for non-partitioned parquet/orc tables, because they do not go 
through `InsertIntoHiveTable` at all. 
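
The stale-read behavior described above can be sketched with a toy model, which is not Spark's internals: `MetadataCacheSketch`, `listFiles`, and `refreshTable` here are simplified stand-ins for a catalog cache that memoizes a table's file listing.

```scala
// Toy model of the bug and fix: a cache memoizes a table's file listing, so a
// LOAD DATA that adds files is invisible until the entry is explicitly dropped.
object MetadataCacheSketch {
  import scala.collection.mutable

  final class MetadataCache {
    private val cached = mutable.Map.empty[String, Seq[String]]
    // Memoize the (expensive) file listing per table; `scan` runs only on miss.
    def listFiles(table: String, scan: => Seq[String]): Seq[String] =
      cached.getOrElseUpdate(table, scan)
    // The fix: drop the cached listing so the next read re-scans the directory.
    def refreshTable(table: String): Unit = cached.remove(table)
  }

  def main(args: Array[String]): Unit = {
    val files = mutable.Buffer.empty[String]        // the table's "directory"
    val cache = new MetadataCache

    cache.listFiles("tab", files.toSeq)             // first read caches zero files
    files += "part-00000.parquet"                   // LOAD DATA adds a file
    val stale = cache.listFiles("tab", files.toSeq) // still the cached empty listing
    cache.refreshTable("tab")                       // what this PR adds
    val fresh = cache.listFiles("tab", files.toSeq)

    assert(stale.isEmpty)
    assert(fresh == Seq("part-00000.parquet"))
    println("ok")
  }
}
```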

### How was this patch tested?
Added a test case in parquetSuites.scala

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark refreshInsertIntoHiveTable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16500.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16500


commit ea91cb077e4ea307020dc5b3b4ffe9a0d2a4dc88
Author: gatorsmile 
Date:   2017-01-07T18:59:52Z

fix.

commit b7013c2853bf993d82b88c4a605ce921d4593ebe
Author: gatorsmile 
Date:   2017-01-07T22:55:25Z

fix.

commit 27fab56bdac74dac2d7dbd36db4c240d35c89dac
Author: gatorsmile 
Date:   2017-01-08T00:06:47Z

fix.

commit 0f70e912402e118a79f48db38c1697baf6905cde
Author: gatorsmile 
Date:   2017-01-08T00:33:20Z

more test cases.



