[GitHub] [spark] sap1ens commented on a change in pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r461788374

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala

@@ -226,4 +227,16 @@ object StaticSQLConf {
       .version("3.0.0")
       .intConf
       .createWithDefault(100)
+
+  val METADATA_CACHE_TTL_SECONDS = buildStaticConf("spark.sql.metadataCacheTTLSeconds")
+    .doc("Time-to-live (TTL) value for the metadata caches: partition file metadata cache and " +
+      "session catalog cache. This configuration only has an effect when this value having " +
+      "a positive value (> 0). It also requires setting " +
+      s"${StaticSQLConf.CATALOG_IMPLEMENTATION} to `hive`, setting " +
+      s"${SQLConf.HIVE_FILESOURCE_PARTITION_FILE_CACHE_SIZE} > 0 and setting " +

Review comment: That was done in https://github.com/apache/spark/pull/29194 :)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r456209080

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,30 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val previousValue = SQLConf.get.getConf(StaticSQLConf.METADATA_CACHE_TTL)
+    try {
+      SQLConf.get.setConf(StaticSQLConf.METADATA_CACHE_TTL, 1L)

Review comment: added a comment in https://github.com/apache/spark/pull/28852/commits/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala

@@ -226,4 +227,16 @@ object StaticSQLConf {
       .version("3.0.0")
       .intConf
       .createWithDefault(100)
+
+  val METADATA_CACHE_TTL = buildStaticConf("spark.sql.metadataCacheTTL")

Review comment: Updated in https://github.com/apache/spark/pull/28852/commits/3e761dcd790b9c30e5cee7bffe916dfc2c82b7a5
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r455915039

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,30 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val previousValue = SQLConf.get.getConf(StaticSQLConf.METADATA_CACHE_TTL)
+    try {
+      SQLConf.get.setConf(StaticSQLConf.METADATA_CACHE_TTL, 1L)

Review comment: unfortunately `withSQLConf` doesn't work with `StaticSQLConf`
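[Editorial note] The diff above works around the unusable `withSQLConf` helper by capturing the previous value and restoring it manually. A minimal sketch of that save/override/restore pattern, in plain Java rather than Spark's Scala (the `Settings` class and `withOverride` helper are hypothetical stand-ins for a global conf object, not a Spark API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a process-global, statically configured store
// (like StaticSQLConf): the test saves the previous value, sets the override,
// runs its body, and restores the old value in a finally block so later
// tests observe the original configuration even if the body throws.
final class Settings {
    static final Map<String, String> CONF = new ConcurrentHashMap<>();

    static String withOverride(String key, String value, Supplier<String> body) {
        String previous = CONF.get(key);   // save
        CONF.put(key, value);              // override
        try {
            return body.get();
        } finally {
            // restore, distinguishing "was unset" from "had a value"
            if (previous == null) {
                CONF.remove(key);
            } else {
                CONF.put(key, previous);
            }
        }
    }
}
```

The `finally` block is the important part: it is what a `withSQLConf`-style helper normally provides, and what the hand-rolled `try` in the test above has to replicate.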
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r447112657

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment: I've decided to move the test from `HiveMetadataCacheSuite` to `SessionCatalogSuite` in https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf:
- It does cover the in-memory catalog as well, which means bigger scope for the test.
- It's easier to test it since SQLConf can be provided in the test.

I'm going to resolve this, but feel free to leave more comments on the new test.
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r447110496

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -2656,6 +2656,16 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
+
+  val METADATA_CACHE_TTL = buildConf("spark.sql.metadataCacheTTL")
+    .doc("Time-to-live (TTL) value for the metadata caches: partition file metadata cache and " +
+      "session catalog cache. This configuration only has an effect when this value having " +
+      "a positive value. It also requires setting `hive` to " +
+      s"${StaticSQLConf.CATALOG_IMPLEMENTATION} to be applied to the partition file " +

Review comment: Updated in https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r447110589

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,25 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    withSQLConf(SQLConf.METADATA_CACHE_TTL.key -> "1") {
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      fileStatusCache.putLeafFiles(path, files.toArray)
+
+      // Exactly 3 files are cached.
+      assert(fileStatusCache.getLeafFiles(path).get.length === 3)
+      // Wait until the cache expiration.
+      eventually(timeout(2.seconds)) {

Review comment: Increased to 3 seconds in https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r447110243

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

@@ -135,7 +136,16 @@ class SessionCatalog(
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
     val cacheSize = conf.tableRelationCacheSize
-    CacheBuilder.newBuilder().maximumSize(cacheSize).build[QualifiedTableName, LogicalPlan]()
+    val cacheTTL = conf.metadataCacheTTL
+
+    var builder = CacheBuilder.newBuilder()
+      .maximumSize(cacheSize)
+
+    if (cacheTTL > 0) {

Review comment: Updated in https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf, now it says: `This configuration only has an effect when this value having a positive value (> 0)`, I think it's good enough.
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r447109798

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/18feeb02b11407b9eea1457139a91cb40ddc40cf
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r446422811

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment: I see that `withSQLConf` is overridden here https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala#L59-L68 and it works with `StaticSQLConf`: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala#L59-L68

It also works if I implement something similar in my test, but it still feels hacky. Do you think we can go with this plan? In this case, the test in the `HiveMetadataCacheSuite` should be in its own suite / class.
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445898815

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment: @maropu hmm, how do I use `withSQLConf` with `StaticSQLConf`? It doesn't allow it: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/SQLHelper.scala#L50
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445885114

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

@@ -135,7 +136,16 @@ class SessionCatalog(
   private val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
     val cacheSize = conf.tableRelationCacheSize
-    CacheBuilder.newBuilder().maximumSize(cacheSize).build[QualifiedTableName, LogicalPlan]()
+    val cacheTTL = conf.metadataCacheTTL
+
+    var builder = CacheBuilder.newBuilder()
+      .maximumSize(cacheSize)
+
+    if (cacheTTL > 0) {

Review comment: Yes, I think it makes sense: 0 seconds == no TTL
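[Editorial note] The "0 seconds == no TTL" semantics discussed above can be illustrated without Spark. The diff conditionally adds expiration to a Guava `CacheBuilder` only when the TTL is positive; the hand-rolled map below (not Spark's code, just a sketch with a fake clock for determinism) mimics that same rule: a non-positive TTL means entries never expire.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal expire-after-write cache sketch. Only a positive TTL triggers
// expiration, matching the `if (cacheTTL > 0)` guard in the diff above.
// A fake clock (advanced manually) replaces wall-clock time for illustration.
final class TtlCache<K, V> {
    private final long ttlNanos;                              // <= 0 means "never expire"
    private final Map<K, Long> writeTime = new ConcurrentHashMap<>();
    private final Map<K, V> values = new ConcurrentHashMap<>();
    private long nowNanos = 0;                                // fake clock

    TtlCache(long ttlSeconds) {
        this.ttlNanos = ttlSeconds * 1_000_000_000L;
    }

    void advanceClock(long seconds) {
        nowNanos += seconds * 1_000_000_000L;
    }

    void put(K key, V value) {
        values.put(key, value);
        writeTime.put(key, nowNanos);                         // record write time
    }

    V getIfPresent(K key) {
        Long written = writeTime.get(key);
        if (written == null) {
            return null;
        }
        if (ttlNanos > 0 && nowNanos - written >= ttlNanos) { // expired entry
            values.remove(key);
            writeTime.remove(key);
            return null;
        }
        return values.get(key);
    }
}
```

In the actual PR the same effect is obtained by calling Guava's `expireAfterWrite` on the builder only inside the `cacheTTL > 0` branch, so the default of 0 leaves the cache unbounded in time.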
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445884849

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment: Apparently `withSparkSession` (current implementation) is not good enough, it terminates the Spark session and the next tests fail https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124516/console
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445736215

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>

Review comment: So, `withSQLConf` doesn't work here. It seems like it overrides the config, but it doesn't affect already initialized `SessionCatalog` (from the previous tests; it does work if you run it as a single test). `withSQLConf` works in `FileIndexSuite` because it uses `SharedSparkSession` as base class, but `TestHiveSingleton` is used as a base here.

Any suggestions? I can:
- keep it as is
- move it to its own separate suite, either in the same file or a separate file
- try to use a different base for the whole suite
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445733152

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/ec18c335cd5d2e038766f65b4109e170290afb99

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/ec18c335cd5d2e038766f65b4109e170290afb99
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445726584

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>
+      withTable("test_ttl") {
+        withTempDir { dir =>
+          spark.sql(s"""
+            |create external table test_ttl (id long)
+            |partitioned by (f1 int, f2 int)
+            |stored as parquet
+            |location "${dir.toURI}"
+          """.stripMargin)
+
+          val tableIdentifier = TableIdentifier("test_ttl", Some("default"))
+
+          // First, make sure the test table is not cached.
+          assert(getCachedDataSourceTable(tableIdentifier) === null)
+          // This query will make the table cached.
+          spark.sql("select * from test_ttl")
+          assert(getCachedDataSourceTable(tableIdentifier) !== null)
+          // Wait until the cache expiration.
+          Thread.sleep(1500L)  // 1.5 seconds > 1 second.
+          // And the cache is gone.
+          assert(getCachedDataSourceTable(tableIdentifier) === null)
+        }
+      }
+    }
+  }
+
+  private def getCachedDataSourceTable(table: TableIdentifier)
+      (implicit spark: SparkSession): LogicalPlan = {

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b

File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetadataCacheSuite.scala

@@ -126,4 +129,39 @@ class HiveMetadataCacheSuite extends QueryTest with SQLTestUtils with TestHiveSi
     for (pruningEnabled <- Seq(true, false)) {
       testCaching(pruningEnabled)
     }
+
+  test("cache TTL") {
+    val sparkConfWithTTl = new SparkConf().set(SQLConf.METADATA_CACHE_TTL.key, "1")
+    val newSession = SparkSession.builder.config(sparkConfWithTTl).getOrCreate().cloneSession()
+
+    withSparkSession(newSession) { implicit spark =>
+      withTable("test_ttl") {
+        withTempDir { dir =>
+          spark.sql(s"""
+            |create external table test_ttl (id long)

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445726445

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      fileStatusCache.putLeafFiles(path, files.toArray)
+
+      // Exactly 3 files are cached.
+      assert(fileStatusCache.getLeafFiles(path).get.length === 3)
+      // Wait until the cache expiration.

Review comment: Great point, totally forgot about `eventually`! Fixed in https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")
+      val files = (1 to 3).map(_ => new FileStatus())
+
+      FileStatusCache.resetForTesting()
+      val fileStatusCache = FileStatusCache.getOrCreate(spark)
+      fileStatusCache.putLeafFiles(path, files.toArray)
+
+      // Exactly 3 files are cached.
+      assert(fileStatusCache.getLeafFiles(path).get.length === 3)
+      // Wait until the cache expiration.
+      Thread.sleep(1500)  // 1.5 seconds > 1 second
+      // And the cache is gone.
+      assert(fileStatusCache.getLeafFiles(path) === None)

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b
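[Editorial note] The fix referenced above replaces a fixed `Thread.sleep` with ScalaTest's `eventually(timeout(...))`, which polls the assertion instead of waiting a hard-coded interval and so tolerates timing jitter on slow CI machines. A minimal sketch of the same polling idea in plain Java (the `Eventually` class is a hypothetical stand-in, not the ScalaTest API):

```java
import java.util.function.BooleanSupplier;

// Poll a condition until it holds or a timeout elapses. Compared to a fixed
// sleep, this succeeds as soon as the condition becomes true and only fails
// after the full timeout, which makes timing-sensitive tests less flaky.
final class Eventually {
    static void eventually(long timeoutMillis, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError(
                    "condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(10);  // small poll interval between retries
        }
    }
}
```

In the test above, the equivalent usage is wrapping `assert(fileStatusCache.getLeafFiles(path) === None)` in the polling loop with a timeout comfortably larger than the 1-second TTL.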
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445726191

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -2656,6 +2656,17 @@ object SQLConf {
       .checkValue(_ > 0, "The difference must be positive.")
       .createWithDefault(4)
+
+  val METADATA_CACHE_TTL =
+    buildConf("spark.sql.metadataCacheTTL")

Review comment: Fixed in https://github.com/apache/spark/pull/28852/commits/7d94f943496fb463f18cd445ccbde3f5225c8e7b
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445720751

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

@@ -488,6 +489,26 @@ class FileIndexSuite extends SharedSparkSession {
     val fileIndex = new TestInMemoryFileIndex(spark, path)
     assert(fileIndex.leafFileStatuses.toSeq == statuses)
   }
+
+  test("expire FileStatusCache if TTL is configured") {
+    val sparkConfWithTTl = sparkConf.set(SQLConf.METADATA_CACHE_TTL.key, "1")
+
+    withSparkSession(SparkSession.builder.config(sparkConfWithTTl).getOrCreate()) { spark =>
+      val path = new Path("/tmp", "abc")

Review comment: In this test we don't need any files or directories to be created; it's just metadata. So I believe `withTempDir` is not required.
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445234340

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -835,6 +835,15 @@ object SQLConf {
       .longConf
       .createWithDefault(250 * 1024 * 1024)
+
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+    buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment: @maropu right, updated
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r445226247

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -835,6 +835,15 @@ object SQLConf {
       .longConf
       .createWithDefault(250 * 1024 * 1024)
+
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+    buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment: @maropu @gatorsmile I've changed the config option to `spark.sql.metadataCacheTTL` and applied it to `FileStatusCache` as well as `SessionCatalog` in https://github.com/apache/spark/pull/28852/commits/28da5cfd39dba6a8319dd1cdfe39e51ed5cbdea5. What do you think?

I'm happy to update `spark.sql.hive.filesourcePartitionFileCacheSize` (drop the `hive` prefix), but it's clearly out of scope here; I could follow up with another PR after this one is resolved. Thanks again.
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r442962034

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -835,6 +835,15 @@ object SQLConf {
       .longConf
       .createWithDefault(250 * 1024 * 1024)
+
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+    buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment: @maropu got it, will proceed with dropping `hive` from these two configs. @gatorsmile could you elaborate a bit? Are you just referring to dropping the `hive` prefix or is there something else I'm missing? Thanks
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r442357968

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -835,6 +835,15 @@ object SQLConf {
       .longConf
       .createWithDefault(250 * 1024 * 1024)
+
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+    buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")
+      .doc("When positive, it's used as a TTL (time-to-live) value for the partition file " +
+        "metadata cache. This conf only has an effect when hive filesource partition management " +
+        "is enabled.")

Review comment: Updated, thanks
sap1ens commented on a change in pull request #28852: URL: https://github.com/apache/spark/pull/28852#discussion_r442351399

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -835,6 +835,15 @@ object SQLConf {
       .longConf
       .createWithDefault(250 * 1024 * 1024)
+
+  val HIVE_FILESOURCE_PARTITION_FILE_CACHE_TTL =
+    buildConf("spark.sql.hive.filesourcePartitionFileCacheTTL")

Review comment: So, `spark.sql.hive.filesourcePartitionFileCacheTTL` simply mirrors the existing `spark.sql.hive.filesourcePartitionFileCacheSize` config. I think it makes sense for consistency, since they're applied to the same cache.

I don't mind dropping `hive`, but in this case it also makes sense to drop it from `spark.sql.hive.filesourcePartitionFileCacheSize`, and that's going to be a backwards-incompatible change. What do you think?