[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22743 Yes, you are right. If a data source table's stats are empty, `DetermineTableStats` doesn't set stats for it, so it's only a problem for Hive tables. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22743

> Datasource table will not cache in tableRelationCache.

I don't think so. Spark caches data source tables in `FindDataSourceTable`.
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22743

Datasource table will not cache in [tableRelationCache](https://github.com/apache/spark/blob/01c3dfab158d40653f8ce5d96f57220297545d5b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L134). For Hive tables, this only occurs when the table stats are empty and `spark.sql.hive.convertMetastoreParquet` is enabled (the default value). When we then read this table, Spark will [convertToLogicalRelation](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L116) and [cache it](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L207).

Empty stats occur in at least 2 situations:

1. The table is created as a Hive table with `spark.sql.hive.convertMetastoreParquet` enabled (the default value) and `spark.sql.statistics.size.autoUpdate.enabled` disabled (the default value), and then data is inserted.
2. The table is managed by Hive and stats were never gathered.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22743 Why is it only a problem for Hive tables?
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22743

This happens once a table's `LogicalRelation` has been cached: changing `spark.sql.statistics.fallBackToHdfs` or `spark.sql.defaultSizeInBytes` afterwards has no effect on the stats, because Spark always uses the stats already cached in the `LogicalRelation`. This is an example:

```scala
import org.apache.spark.sql.catalyst.QualifiedTableName
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.execution.datasources.LogicalRelation

spark.sql("CREATE TABLE t1 (c1 bigint) STORED AS PARQUET")
spark.sql("INSERT INTO TABLE t1 VALUES (1)")
spark.sql("REFRESH TABLE t1")

val catalog = spark.sessionState.catalog
val qualifiedTableName = QualifiedTableName(catalog.getCurrentDatabase, "t1")

// First read: the relation is cached with the default size.
spark.sql("SELECT * from t1").collect()
val cachedRelation = catalog.getCachedTable(qualifiedTableName)
cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
// res4: BigInt = 9223372036854775807

// Change the config and read again: the cached stats are still used.
spark.sql("set spark.sql.statistics.fallBackToHdfs=true")
spark.sql("SELECT * from t1").collect()
val cachedRelation = catalog.getCachedTable(qualifiedTableName)
cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
// res7: BigInt = 9223372036854775807
// It should be computed from the file system, but it is still 9223372036854775807.

// Refresh the table and read again.
spark.sql("REFRESH TABLE t1")
spark.sql("SELECT * from t1").collect()
val cachedRelation = catalog.getCachedTable(qualifiedTableName)
cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
// res10: BigInt = 708
// After refreshing the table, the value is correct.
```
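The caching behavior in the example above can be illustrated without a Spark session. What follows is a minimal sketch in plain Scala, not Spark's implementation: `ToyConf`, `ToyCatalog`, and `ToyStatsDemo` are hypothetical names, and the 708-byte file size is simply copied from the output shown in the example. It assumes that stats are filled in once at resolution time and that a cache hit never consults the configuration again.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the two session settings discussed; not a Spark class.
final case class ToyConf(
    var fallBackToHdfs: Boolean = false,
    var defaultSizeInBytes: BigInt = BigInt(Long.MaxValue))

// Toy catalog: computes a table's size once and caches it by table name.
final class ToyCatalog(conf: ToyConf, sizeOnDisk: Map[String, BigInt]) {
  private val relationCache = mutable.Map.empty[String, BigInt]

  // Fill in missing stats at resolution time: use the file size if
  // fallback is enabled, otherwise the configured default.
  private def determineStats(table: String): BigInt =
    if (conf.fallBackToHdfs) sizeOnDisk(table) else conf.defaultSizeInBytes

  // A cache hit returns the stored value; the conf is not read again.
  def lookup(table: String): BigInt =
    relationCache.getOrElseUpdate(table, determineStats(table))

  // Analogue of REFRESH TABLE: drop the entry so stats are recomputed.
  def refresh(table: String): Unit = relationCache.remove(table)
}

object ToyStatsDemo {
  def run(): Seq[BigInt] = {
    val conf = ToyConf()
    val catalog = new ToyCatalog(conf, Map("t1" -> BigInt(708)))
    val first = catalog.lookup("t1")   // default size, now cached
    conf.fallBackToHdfs = true         // config change after caching
    val second = catalog.lookup("t1")  // still the cached default
    catalog.refresh("t1")
    val third = catalog.lookup("t1")   // recomputed from the toy file size
    Seq(first, second, third)
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

In this toy model the second lookup returns the stale default because the cache key is only the table name. Either invalidating the entry (as `refresh` does) or not storing config-derived stats in the cached value would make the config change visible.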
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22743 can you explain more about how this happens?
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22743 cc @cloud-fan
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97522/ Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22743

**[Test build #97522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97522/testReport)** for PR 22743 at commit [`206743c`](https://github.com/apache/spark/commit/206743cef96e536783a315785739af16f845f5c1).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22743 **[Test build #97522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97522/testReport)** for PR 22743 at commit [`206743c`](https://github.com/apache/spark/commit/206743cef96e536783a315785739af16f845f5c1).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4079/ Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97517/ Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Merged build finished. Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22743

**[Test build #97517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97517/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22743 **[Test build #97517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97517/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4075/ Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Merged build finished. Test PASSed.
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22743 retest this please
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97515/ Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22743

**[Test build #97515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97515/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22743 Merged build finished. Test FAILed.