[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-29 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r461788374



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##
@@ -226,4 +227,16 @@ object StaticSQLConf {
   .version("3.0.0")
   .intConf
   .createWithDefault(100)
+
+  val METADATA_CACHE_TTL_SECONDS = buildStaticConf("spark.sql.metadataCacheTTLSeconds")
+    .doc("Time-to-live (TTL) value for the metadata caches: partition file metadata cache and " +
+      "session catalog cache. This configuration only has an effect when this value having " +
+      "a positive value (> 0). It also requires setting " +
+      s"${StaticSQLConf.CATALOG_IMPLEMENTATION} to `hive`, setting " +
+      s"${SQLConf.HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE} > 0 and setting " +

Review comment:
   That was done in https://github.com/apache/spark/pull/29194 :) 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r456209080



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,30 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val previousValue = SQLConf.get.getConf(StaticSQLConf.METADATA_CACHE_TTL)
+    try {
+      SQLConf.get.setConf(StaticSQLConf.METADATA_CACHE_TTL, 1L)

Review comment:
   added a comment in 
https://github.com/apache/spark/pull/28852/commits/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
##
@@ -226,4 +227,16 @@ object StaticSQLConf {
   .version("3.0.0")
   .intConf
   .createWithDefault(100)
+
+  val METADATA_CACHE_TTL = buildStaticConf("spark.sql.metadataCacheTTL")

Review comment:
   Updated in 
https://github.com/apache/spark/pull/28852/commits/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-07-16 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r455915039



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,30 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val previousValue = SQLConf.get.getConf(StaticSQLConf.METADATA_CACHE_TTL)
+    try {
+      SQLConf.get.setConf(StaticSQLConf.METADATA_CACHE_TTL, 1L)

Review comment:
   unfortunately `withSQLConf` doesn't work with `StaticSQLConf`
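
   For context, `withSQLConf` rejects static entries, so the test saves and restores the value manually. A minimal stdlib sketch of that save-and-restore pattern (the `confs` map below is a hypothetical stand-in for `SQLConf`, not Spark's API):

```scala
object StaticConfDemo {
  // Hypothetical stand-in for a global static conf store (illustrative only).
  val confs = scala.collection.mutable.Map[String, Long]("metadataCacheTTL" -> -1L)

  /** Set a conf, run the body, and always restore the previous value. */
  def withStaticConf[T](key: String, value: Long)(body: => T): T = {
    val previous = confs(key)
    confs(key) = value
    try body
    finally confs(key) = previous  // restored even if the body throws
  }

  def main(args: Array[String]): Unit = {
    val seen = withStaticConf("metadataCacheTTL", 1L) { confs("metadataCacheTTL") }
    println(seen)                       // 1: value visible inside the block
    println(confs("metadataCacheTTL"))  // -1: restored after the block
  }
}
```

   The `finally` clause is what makes the pattern safe: the previous value comes back even when an assertion in the body fails.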








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-29 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r447112657



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment:
   I've decided to move the test from `HiveMetadataCacheSuite` to 
`SessionCatalogSuite` in 
https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf:
   
   - It also covers the in-memory catalog, which gives the test a bigger scope.
   - It's easier to test, since SQLConf can be provided directly in the test.
   
   I'm going to resolve this, but feel free to leave more comments on the new test.








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-29 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r447110496



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -2656,6 +2656,16 @@ object SQLConf {
   .checkValue(_ > 0, "The difference must be positive.")
   .createWithDefault(4)
 
+  val METADATA_CACHE_TTL = buildConf("spark.sql.metadataCacheTTL")
+    .doc("Time-to-live (TTL) value for the metadata caches: partition file metadata cache and " +
+      "session catalog cache. This configuration only has an effect when this value having " +
+      "a positive value. It also requires setting `hive` to " +
+      s"${StaticSQLConf.CATALOG_IMPLEMENTATION} to be applied to the partition file " +

Review comment:
   Updated in 
https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-29 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r447110589



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,25 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    withSQLConf(SQLConf.METADATA_CACHE_TTL.key -> "1") {
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      fileStatusCache.putLeafFiles(path, files.toArray)
+
+      // Exactly 3 files are cached.
+      assert(fileStatusCache.getLeafFiles(path).get.length === 3)
+      // Wait until the cache expiration.
+      eventually(timeout(2.seconds)) {

Review comment:
   Increased to 3 seconds in 
https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-29 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r447110243



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
##
@@ -135,7 +136,16 @@ class SessionCatalog(
 
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
 val cacheSize = conf.tableRelationCacheSize
-    CacheBuilder.newBuilder().maximumSize(cacheSize).build[QualifiedTableName, LogicalPlan]()
+    val cacheTTL = conf.metadataCacheTTL
+
+    var builder = CacheBuilder.newBuilder()
+      .maximumSize(cacheSize)
+
+    if (cacheTTL > 0) {

Review comment:
   Updated in 
https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf,
 now it says: ` This configuration only has an effect when this value having a 
positive value (> 0)`, I think it's good enough.
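
   The "> 0 enables, otherwise no expiry" semantics can be sketched with a tiny stdlib cache. The `TtlCache` below is a hypothetical stand-in for the Guava `CacheBuilder.expireAfterWrite` wiring Spark actually uses; the class name and the injectable clock are illustrative only:

```scala
import scala.collection.mutable

// Hypothetical stand-in for the Guava cache: entries written more than
// ttlMillis ago are treated as absent; a non-positive TTL disables expiry.
class TtlCache[K, V](ttlMillis: Long, now: () => Long = () => System.currentTimeMillis()) {
  private val entries = mutable.Map[K, (V, Long)]()

  def put(key: K, value: V): Unit = entries(key) = (value, now())

  def get(key: K): Option[V] = entries.get(key).collect {
    case (value, written) if ttlMillis <= 0 || now() - written < ttlMillis => value
  }
}

object TtlCacheDemo {
  def main(args: Array[String]): Unit = {
    var clock = 0L  // injectable clock keeps the demo deterministic
    val cache = new TtlCache[String, Int](ttlMillis = 1000L, now = () => clock)
    cache.put("table", 42)
    clock = 500L
    println(cache.get("table"))  // Some(42): still within the TTL
    clock = 1500L
    println(cache.get("table"))  // None: expired

    val noTtl = new TtlCache[String, Int](ttlMillis = 0L, now = () => clock)
    noTtl.put("table", 7)
    clock = 1000000L
    println(noTtl.get("table"))  // Some(7): TTL <= 0 means never expire
  }
}
```

   Treating zero and negative values as "disabled" matches the doc text above: the configuration only takes effect for a positive value.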








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-29 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r447109798



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-26 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r446422811



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment:
   I see that `withSQLConf ` is overridden here 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala#L59-L68
 and it works with `StaticSQLConf`: 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala#L59-L68
   
   It also works If I implement something similar in my test, but it still 
feels hacky. Do you think we can go with this plan? In this case, the test in 
the `HiveMetadataCacheSuite` should be in its own suite / class.








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445898815



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment:
   @maropu hmm, how do I use `withSQLConf` with `StaticSQLConf`? It doesn't allow it: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala#L50








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445885114



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
##
@@ -135,7 +136,16 @@ class SessionCatalog(
 
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
 val cacheSize = conf.tableRelationCacheSize
-    CacheBuilder.newBuilder().maximumSize(cacheSize).build[QualifiedTableName, LogicalPlan]()
+    val cacheTTL = conf.metadataCacheTTL
+
+    var builder = CacheBuilder.newBuilder()
+      .maximumSize(cacheSize)
+
+    if (cacheTTL > 0) {

Review comment:
   Yes, I think it makes sense: 0 seconds == no TTL








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445884849



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment:
   Apparently `withSparkSession` (in its current implementation) is not good enough: it terminates the Spark session, and the next tests fail https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124516/console








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445736215



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment:
   So, `withSQLConf` doesn't work here. It seems like it overrides the config, but it doesn't affect the already initialized `SessionCatalog` (from the previous tests; it does work if you run it as a single test).
   
   `withSQLConf` works in `FileIndexSuite` because it uses `SharedSparkSession` 
as base class, but `TestHiveSingleton` is used as a base here. 
   
   Any suggestions? I can:
   
   -  keep it as is
   -  move it to its own separate suite, either in the same file or a separate 
file
   -  try to use a different base for the whole suite








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445733152



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/ec18c335cd5d2e038766f65b4109e170290afb99

##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/ec18c335cd5d2e038766f65b4109e170290afb99








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445726584



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b

##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>
+      withTable("test_ttl") {
+        withTempDir { dir =>
+          spark.sql(s"""
+            |create external table test_ttl (id long)
+            |partitioned by (f1 int, f2 int)
+            |stored as parquet
+            |location "${dir.toURI}"
+          """.stripMargin)
+
+          val tableIdentifier = TableIdentifier("test_ttl", Some("default"))
+
+          // First, make sure the test table is not cached.
+          assert(getCachedDataSourceTable(tableIdentifier) === null)
+          // This query will make the table cached.
+          spark.sql("select * from test_ttl")
+          assert(getCachedDataSourceTable(tableIdentifier) !== null)
+          // Wait until the cache expiration.
+          Thread.sleep(1500L) // 1.5 seconds > 1 second.
+          // And the cache is gone.
+          assert(getCachedDataSourceTable(tableIdentifier) === null)
+        }
+      }
+    }
+  }
+
+  private def getCachedDataSourceTable(table: TableIdentifier)
+      (implicit spark: SparkSession): LogicalPlan = {

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b

##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala
##
@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
   for (pruningEnabled <- Seq(true, false)) {
 testCaching(pruningEnabled)
   }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>
+      withTable("test_ttl") {
+        withTempDir { dir =>
+          spark.sql(s"""
+            |create external table test_ttl (id long)

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445726445



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      fileStatusCache.putLeafFiles(path, files.toArray)
+
+      // Exactly 3 files are cached.
+      assert(fileStatusCache.getLeafFiles(path).get.length === 3)
+      // Wait until the cache expiration.

Review comment:
   Great point, totally forgot about `eventually`! Fixed in 
https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b
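
   The real test relies on scalatest's `org.scalatest.concurrent.Eventually` rather than a fixed `Thread.sleep`. The retry idea behind it can be sketched with the stdlib alone (`EventuallyDemo` below is an illustrative stand-in, not the scalatest implementation):

```scala
object EventuallyDemo {
  /** Retry `check` until it succeeds or `timeoutMillis` elapses, then rethrow the last failure. */
  def eventually[T](timeoutMillis: Long, intervalMillis: Long = 50L)(check: => T): T = {
    val deadline = System.currentTimeMillis() + timeoutMillis
    while (true) {
      try return check
      catch {
        case e: Throwable =>
          if (System.currentTimeMillis() >= deadline) throw e  // out of time: surface the failure
          Thread.sleep(intervalMillis)                         // otherwise back off and retry
      }
    }
    throw new IllegalStateException("unreachable")
  }

  def main(args: Array[String]): Unit = {
    var attempts = 0
    // The block fails twice, then succeeds; eventually keeps retrying until it does.
    val result = eventually(timeoutMillis = 2000L) {
      attempts += 1
      if (attempts < 3) sys.error(s"attempt $attempts failed")
      attempts
    }
    println(result)  // 3
  }
}
```

   Compared to a fixed sleep, this finishes as soon as the condition holds and only fails once the whole timeout budget is exhausted, which is why a slightly generous timeout (2-3 seconds here) is cheap.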

##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      fileStatusCache.putLeafFiles(path, files.toArray)
+
+      // Exactly 3 files are cached.
+      assert(fileStatusCache.getLeafFiles(path).get.length === 3)
+      // Wait until the cache expiration.
+      Thread.sleep(1500) // 1.5 seconds > 1 second
+      // And the cache is gone.
+      assert(fileStatusCache.getLeafFiles(path) === None)

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445726191



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -2656,6 +2656,17 @@ object SQLConf {
   .checkValue(_ > 0, "The difference must be positive.")
   .createWithDefault(4)
 
+  val METADATA_CACHE_TTL =
+    buildConf("spark.sql.metadataCacheTTL")

Review comment:
   Fixed in 
https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-25 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445720751



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
##
@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
 val fileIndex = new TestInMemoryFileIndex(spark, path)
 assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")

Review comment:
   In this test we don't need any files or directories to be created; it's just metadata. So I believe `withTempDir` is not required.








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-24 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445234340



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -835,6 +835,15 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+    buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment:
   @maropu right, updated








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-24 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r445226247



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -835,6 +835,15 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment:
   @maropu @gatorsmile I've changed the config option to 
`spark.sql.metadataCacheTTL` and applied it to `FileStatusCache` as well as 
`SessionCatalog` in 
https://github.com/apache/spark/pull/28852/commits/28da5cfd39dba6a8319dd1cdfe39e51ed5cbdea5.
 What do you think? 
   
   I'm happy to update `spark.sql.hive.filesourcePartitionFileCacheSize` (drop the `hive` prefix), but that's clearly out of scope here; I could follow up with another PR after this one is resolved.
   
   Thanks again.
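
   For context, the expire-after-write behaviour that a TTL'd metadata cache relies on (Spark delegates it to Guava's `CacheBuilder.expireAfterWrite` in `FileStatusCache`) can be sketched in a few lines of plain Scala. The class name, the manual clock, and the lazy eviction on read are illustrative assumptions, not Spark's actual implementation:

```scala
import scala.collection.mutable

// Minimal sketch of expire-after-write (TTL) cache semantics.
// An injectable clock makes the behaviour testable without sleeping.
class TtlCache[K, V](ttlMillis: Long, now: () => Long) {
  private case class Entry(value: V, writtenAt: Long)
  private val entries = mutable.Map.empty[K, Entry]

  def put(key: K, value: V): Unit =
    entries(key) = Entry(value, now())

  // A hit counts only while the entry is younger than the TTL;
  // stale entries are evicted lazily on read.
  def get(key: K): Option[V] =
    entries.get(key) match {
      case Some(e) if now() - e.writtenAt < ttlMillis => Some(e.value)
      case Some(_)                                    => entries.remove(key); None
      case None                                       => None
    }
}

var clock = 0L
val cache = new TtlCache[String, Int](ttlMillis = 100, now = () => clock)
cache.put("a", 1)
val freshHit  = cache.get("a") // within TTL: hit
clock = 150
val staleMiss = cache.get("a") // past TTL: evicted
```

   The manual clock is the main design choice worth copying into a real test: it keeps TTL assertions deterministic instead of depending on wall-clock sleeps.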








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-19 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r442962034



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -835,6 +835,15 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment:
   @maropu got it, will proceed with dropping `hive` from these two configs.
   
   @gatorsmile could you elaborate a bit? Are you just referring to dropping 
the `hive` prefix or is there something else I'm missing? Thanks








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-18 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r442357968



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -835,6 +835,15 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")
+  .doc("When positive, it's used as a TTL (time-to-live) value for the 
partition file " +
+"metadata cache. This conf only has an effect when hive filesource 
partition management " +
+"is enabled.")

Review comment:
   Updated, thanks








[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache

2020-06-18 Thread GitBox


sap1ens commented on a change in pull request #28852:
URL: https://github.com/apache/spark/pull/28852#discussion_r442351399



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -835,6 +835,15 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment:
   So, `spark.sql.hive.filesourcePartitionFileCacheTTL` simply mirrors the 
existing `spark.sql.hive.filesourcePartitionFileCacheSize` config. I think it 
makes sense for consistency, since they're applied to the same cache.
   
   I don't mind dropping `hive`, but in that case it also makes sense to drop it from `spark.sql.hive.filesourcePartitionFileCacheSize`, and that would be a backwards-incompatible change.
   
   What do you think?




