[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @gatorsmile , so excited to see this change merged. Thank you for your careful detailed reviews. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 ping @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @gatorsmile , thank you for reviewing this PR. I'm exciting that it is getting very close to be merged. It is not obvious to me why would you prefer to follow Hive and restrict the user to partitions spec which contains partition columns from the head of the list of all partition columns specified in the table schema and disallow arbitrary subsets. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r132493566 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -256,6 +257,201 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +def createPartition(ds: String, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + createPartition("2010-01-01", "SELECT '1', 'A' from src") + createPartition("2010-01-02", "SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src") + createPartition("2010-01-03", "SELECT '1', 'A' from src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assert(queryStats("2010-01-01").rowCount === None) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount === None) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS") + + assert(queryStats("2010-01-01").rowCount.get === 500) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount.get === 2*500) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) +} + } + + test("analyze a set of partitions") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String, hr: String): Option[CatalogStatistics] = { + val tableId = TableIdentifier(tableName) + val partition = +spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, "hr" -> hr)) + partition.stats +} + +def assertPartitionStats( +ds: String, +hr: String, +rowCount: Option[BigInt], +sizeInBytes: BigInt): Unit = { + val stats = queryStats(ds, hr).get + assert(stats.rowCount === rowCount) + assert(stats.sizeInBytes === sizeInBytes) +} + +def createPartition(ds: String, hr: Int, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING, hr INT)") + + createPartition("2010-01-01", 10, "SELECT '1', 'A' from src") + createPartition("2010-01-01", 11, "SELECT '1', 'A' from src") + createPartition("2010-01-02", 10, "SELECT '1', 'A' from src") + createPartition("2010-01-02", 11, +"SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000) + assert(queryStats("2010-01-02", "10") === None) + assert(queryStats("2010-01-02", "11") === None) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "11", rowC
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r132490843 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -107,6 +109,7 @@ case class CatalogTablePartition( if (parameters.nonEmpty) { map.put("Partition Parameters", s"{${parameters.map(p => p._1 + "=" + p._2).mkString(", ")}}") } +stats.foreach(s => map.put("Partition Statistics", s.simpleString)) --- End diff -- Indeed. It works like you said it would. :-) Adding a test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @gatorsmile, thanks for review. Comments addressed. Questions answered. Tests as passing. Would you take another look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r131957187 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -256,6 +257,201 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +def createPartition(ds: String, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + createPartition("2010-01-01", "SELECT '1', 'A' from src") + createPartition("2010-01-02", "SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src") + createPartition("2010-01-03", "SELECT '1', 'A' from src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assert(queryStats("2010-01-01").rowCount === None) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount === None) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS") + + assert(queryStats("2010-01-01").rowCount.get === 500) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount.get === 2*500) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) +} + } + + test("analyze a set of partitions") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String, hr: String): Option[CatalogStatistics] = { + val tableId = TableIdentifier(tableName) + val partition = +spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, "hr" -> hr)) + partition.stats +} + +def assertPartitionStats( +ds: String, +hr: String, +rowCount: Option[BigInt], +sizeInBytes: BigInt): Unit = { + val stats = queryStats(ds, hr).get + assert(stats.rowCount === rowCount) + assert(stats.sizeInBytes === sizeInBytes) +} + +def createPartition(ds: String, hr: Int, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING, hr INT)") + + createPartition("2010-01-01", 10, "SELECT '1', 'A' from src") + createPartition("2010-01-01", 11, "SELECT '1', 'A' from src") + createPartition("2010-01-02", 10, "SELECT '1', 'A' from src") + createPartition("2010-01-02", 11, +"SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000) + assert(queryStats("2010-01-02", "10") === None) + assert(queryStats("2010-01-02", "11") === None) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assertPartitionStats("2010-01-01", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-01", "11", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "10", rowCount = None, sizeInBytes = 2000) + assertPartitionStats("2010-01-02", "11", rowC
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r131957033 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, UnresolvedAttribute} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType} +import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec +import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Literal} + +/** + * Analyzes a given set of partitions to generate per-partition statistics, which will be used in + * query optimizations. + * + * When `partitionSpec` is empty, statistics for all partitions are collected and stored in + * Metastore. + * + * When `partitionSpec` mentions only some of the partition columns, all partitions with + * matching values for specified columns are processed. + * + * If `partitionSpec` mentions unknown partition column, an `AnalysisException` is raised. + * + * By default, total number of rows and total size in bytes is calculated. When `noscan` + * is `false`, only total size in bytes is computed. + */ +case class AnalyzePartitionCommand( +tableIdent: TableIdentifier, +partitionSpec: Map[String, Option[String]], +noscan: Boolean = true) extends RunnableCommand { + + private def validatePartitionSpec(table: CatalogTable): Option[TablePartitionSpec] = { +val partitionColumnNames = table.partitionColumnNames.toSet +val invalidColumnNames = partitionSpec.keys.filterNot(partitionColumnNames.contains(_)) +if (invalidColumnNames.nonEmpty) { + val tableId = table.identifier + throw new AnalysisException(s"Partition specification for table '${tableId.table}' " + +s"in database '${tableId.database.get}' refers to unknown partition column(s): " + +invalidColumnNames.mkString(",")) +} + +val filteredSpec = partitionSpec.filter(_._2.isDefined) --- End diff -- @gatorsmile , this PR doesn't impose this requirement. Hive docs also doesn't list it: https://cwiki.apache.org/confluence/display/Hive/StatsDev --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r131953998 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -107,6 +109,7 @@ case class CatalogTablePartition( if (parameters.nonEmpty) { map.put("Partition Parameters", s"{${parameters.map(p => p._1 + "=" + p._2).mkString(", ")}}") } +stats.foreach(s => map.put("Partition Statistics", s.simpleString)) --- End diff -- @gatorsmile , this code doesn't make partition stats appear in the output of DESC command. It only adds stats into to CatalogTablePartition.toString output (similar to CatalogTable.toString). Do you still want me to add some tests for this functionality? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @gatorsmile , I believe I addressed all your comments and I'm seeing that tests are passing now. Thanks for pointers to -1 and <=0 changes. Tests were failing before because I missed these changes. Let me know if anything else needs more work or if this is now good to merge. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 [ @gatorsmile ] Is that possible the partition-level row counts is larger than the table-level row counts after running this new command? I think so. If table contents have changed since table-level stats were collected and a new very large partition was added or existing one replaced, collecting partition-level stats for that one partition will result in a partition row-count exceeding table row-count. I think you are pointing out to a larger issue of staleness of statistics which results from table modifications. My thinking is that ANALYZE TABLE commands provide the necessary primitives to collect statistics, but a separate facility is needed to manage the process of collecting these stats. At a very basic level a user would be required to manually run necessary ANALYZE TABLE commands when table content is changed, but for larger deployments an automated solution could be devised. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r130408151 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala --- @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.command + +import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, UnresolvedAttribute} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType} +import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec +import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Literal} + +/** + * Analyzes a given set of partitions to generate per-partition statistics, which will be used in + * query optimizations. + * + * When `partitionSpec` is empty, statistics for all partitions are collected and stored in + * Metastore. + * + * When `partitionSpec` mentions only some of the partition columns, all partitions with + * matching values for specified columns are processed. + * + * If `partitionSpec` mentions unknown partition column, an `AnalysisException` is raised. + * + * By default, total number of rows and total size in bytes is calculated. When `noscan` + * is `false`, only total size in bytes is computed. + */ +case class AnalyzePartitionCommand( +tableIdent: TableIdentifier, +partitionSpec: Map[String, Option[String]], +noscan: Boolean = true) extends RunnableCommand { + + private def validatePartitionSpec(table: CatalogTable): Option[TablePartitionSpec] = { +val partitionColumnNames = table.partitionColumnNames.toSet +val invalidColumnNames = partitionSpec.keys.filterNot(partitionColumnNames.contains(_)) +if (invalidColumnNames.nonEmpty) { + val tableId = table.identifier + throw new AnalysisException(s"Partition specification for table '${tableId.table}' " + +s"in database '${tableId.database}' refers to unknown partition column(s): " + +invalidColumnNames.mkString(",")) +} + +val filteredSpec = partitionSpec.filter(_._2.isDefined) +if (filteredSpec.isEmpty) { + None +} else { + Some(filteredSpec.mapValues(_.get)) +} + } + + override def run(sparkSession: SparkSession): Seq[Row] = { +val sessionState = sparkSession.sessionState +val db = tableIdent.database.getOrElse(sessionState.catalog.getCurrentDatabase) +val tableIdentWithDB = TableIdentifier(tableIdent.table, Some(db)) +val tableMeta = sessionState.catalog.getTableMetadata(tableIdentWithDB) +if (tableMeta.tableType == CatalogTableType.VIEW) { + throw new AnalysisException("ANALYZE TABLE is not supported on views.") +} + +val partitionValueSpec = validatePartitionSpec(tableMeta) + +val partitions = sessionState.catalog.listPartitions(tableMeta.identifier, partitionValueSpec) + +if (partitions.isEmpty) { + if (partitionValueSpec.isDefined) { +throw new NoSuchPartitionException(db, tableIdent.table, partitionValueSpec.get) + } else { +// the user requested to analyze all partitions for a table which has no partitions +// return normally, since there is nothing to do +return Seq.empty[Row] + } +} + +// Compute statistics for individual partitions +val rowCounts: Map[TablePartitionSpec, BigInt] = + if (noscan) { +Map.empty + } else { +calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) + } + +// Update the metastore if newly computed statistics are different fr
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r130404020 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -90,30 +90,40 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { } /** - * Create an [[AnalyzeTableCommand]] command or an [[AnalyzeColumnCommand]] command. - * Example SQL for analyzing table : + * Create an [[AnalyzeTableCommand]] command, or an [[AnalyzePartitionCommand]] + * or an [[AnalyzeColumnCommand]] command. + * Example SQL for analyzing table or a set of partitions : * {{{ - * ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN]; + * ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], partcol2[=val2], ...)] + * COMPUTE STATISTICS [NOSCAN]; * }}} + * * Example SQL for analyzing columns : * {{{ - * ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2; + * ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS column1, column2; * }}} */ override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = withOrigin(ctx) { -if (ctx.partitionSpec != null) { - logWarning(s"Partition specification is ignored: ${ctx.partitionSpec.getText}") +if (ctx.identifier != null && +ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") { + throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`", ctx) } -if (ctx.identifier != null) { - if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") { -throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`", ctx) + +val table = visitTableIdentifier(ctx.tableIdentifier) +if (ctx.identifierSeq() == null) { + if (ctx.partitionSpec != null) { +AnalyzePartitionCommand(table, visitPartitionSpec(ctx.partitionSpec), + noscan = ctx.identifier != null) + } else { +AnalyzeTableCommand(table, noscan = ctx.identifier != null) } - AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier)) -} else if (ctx.identifierSeq() == null) { - AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier), noscan = false) } else { + if (ctx.partitionSpec != null) { +logWarning("Partition specification is ignored when collecting column statistics: " + + ctx.partitionSpec.getText) --- End diff -- This is the existing behavoir. ANALYZE TABLE command simply ignores PARTITION clause. This PR is adding support for PARTITION clause for COMPUTE STATISTICS, but keeps COMPUTE STATISTICS FOR COLUMNS as is. I'm planning to add partition support to FOR COLUMN In a follow-up PR. Changing this code to throw an exception in this PR will *break* existing uses if folks are relying on a WARN. Would it be OK to keep this functionality unchanged in the PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r130402344 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala --- @@ -112,4 +112,29 @@ object CommandUtils extends Logging { size } + def compareAndGetNewStats( + oldStats: Option[CatalogStatistics], + newTotalSize: BigInt, + newRowCount: Option[BigInt]): Option[CatalogStatistics] = { +val oldTotalSize = oldStats.map(_.sizeInBytes.toLong).getOrElse(0L) +val oldRowCount = oldStats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L) +var newStats: Option[CatalogStatistics] = None +if (newTotalSize > 0 && newTotalSize != oldTotalSize) { --- End diff -- Indeed. I fixed this to use -1 and >=0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @gatorsmile, I was on vacation and today is my first day back. I'm planning to work through the feedback on this PR today. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @wzhfy, @gatorsmile, I just noticed that Jenkins reported test failures. I'll look into these first week of August and update this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 Jenkins, retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @wzhfy, @gatorsmile, I'm on vacation till the end of July. I hope this PR is good to merge now, but if not, I'll address further comments first week of August. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @wzhfy, I fixed the case of PARTITION (ds, hr) to collect stats for *all* partitions instead of for a table as a whole. Rebased. Addressed other comments. Please, take another look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r126725750 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -181,6 +182,151 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + private val SELECT_FROM_SRC = "SELECT '1', 'A' from src" + + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +def createPartition(ds: String, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + createPartition("2010-01-01", SELECT_FROM_SRC) + createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL $SELECT_FROM_SRC") + createPartition("2010-01-03", SELECT_FROM_SRC) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assert(queryStats("2010-01-01").rowCount === None) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount === None) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS") + + assert(queryStats("2010-01-01").rowCount.get === 500) + assert(queryStats("2010-01-01").sizeInBytes === 2000) + + assert(queryStats("2010-01-02").rowCount.get === 2*500) + assert(queryStats("2010-01-02").sizeInBytes === 2*2000) +} + } + + test("analyze a set of partitions") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String, hr: String): Option[CatalogStatistics] = { + val tableId = TableIdentifier(tableName) + val partition = +spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, "hr" -> hr)) + partition.stats +} + +def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: BigInt): Unit = { + val stats = queryStats(ds, hr).get + assert(stats.rowCount === Some(rowCount)) + assert(stats.sizeInBytes === sizeInBytes) +} + +def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: BigInt): Unit = { + val stats = queryStats(ds, hr).get + assert(stats.rowCount === None) + assert(stats.sizeInBytes === sizeInBytes) +} + +def createPartition(ds: String, hr: Int, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING, hr INT)") + + createPartition("2010-01-01", 10, SELECT_FROM_SRC) + createPartition("2010-01-01", 11, SELECT_FROM_SRC) + createPartition("2010-01-02", 10, SELECT_FROM_SRC) + createPartition("2010-01-02", 11, s"$SELECT_FROM_SRC UNION ALL $SELECT_FROM_SRC") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") + + assertSizeInBytesStats("2010-01-01", "10", 2000) + assertSizeInBytesStats("2010-01-01", "11", 2000) + assert(queryStats("2010-01-02", "10") === None) + assert(queryStats("2010-01-02", "11") === None) + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") + + assertSizeInBytesStats("2010-01-01", "10", 2000) + assertSizeInBytesStats("2010-01-01", "11", 2000) + assertSizeInBytesStats("2010-01-02", "10", 2000) + assertSizeInBytesStats("2010-01-02", "11", 2*2000) --- End diff -- I'
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r126724336 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -181,6 +182,151 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + private val SELECT_FROM_SRC = "SELECT '1', 'A' from src" + + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +def createPartition(ds: String, query: String): Unit = { + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query") +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + createPartition("2010-01-01", SELECT_FROM_SRC) + createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL $SELECT_FROM_SRC") + createPartition("2010-01-03", SELECT_FROM_SRC) --- End diff -- @wzhfy, I'm using multiple partitions to protect against bugs like the following: - ANALYZE PARTITION collects stats for the whole table instead of a single partition (hence, test table needs to have 2+ partitions) - ANALYZE PARTITION collects stats for the *first* partition instead of the partition specified in the SQL command (hence, need to analyze a couple of different partitions) Using "analyze single partition" for the test name I was hoping to communicate the intent of testing analyzing of a single partition as opposed to a set of partitions. Does this make sense? Do you still want me to reduce the test to a single-partition table? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r126452142 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -93,30 +93,50 @@ class SparkSqlAstBuilder extends AstBuilder { } /** - * Create an [[AnalyzeTableCommand]] command or an [[AnalyzeColumnCommand]] command. - * Example SQL for analyzing table : + * Create an [[AnalyzeTableCommand]] command, or an [[AnalyzePartitionCommand]] + * or an [[AnalyzeColumnCommand]] command. + * Example SQL for analyzing table or a set of partitions : * {{{ - * ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN]; + * ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], partcol2[=val2], ...)] + * COMPUTE STATISTICS [NOSCAN]; * }}} + * * Example SQL for analyzing columns : * {{{ - * ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2; + * ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS column1, column2; * }}} */ override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = withOrigin(ctx) { -if (ctx.partitionSpec != null) { - logWarning(s"Partition specification is ignored: ${ctx.partitionSpec.getText}") +if (ctx.identifier != null && +ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") { + throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`", ctx) } -if (ctx.identifier != null) { - if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") { -throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`", ctx) + +val partitionSpec = + if (ctx.partitionSpec != null) { +val filteredSpec = visitPartitionSpec(ctx.partitionSpec).filter(_._2.isDefined) +if (filteredSpec.isEmpty) { + None +} else { + Some(filteredSpec.mapValues(_.get)) +} + } else { +None + } + +val table = visitTableIdentifier(ctx.tableIdentifier) +if (ctx.identifierSeq() == null) { + if (partitionSpec.isDefined) { +AnalyzePartitionCommand(table, partitionSpec.get, noscan = ctx.identifier != null) + } else { +AnalyzeTableCommand(table, noscan = ctx.identifier != null) --- End diff -- @wzhfy , good catch. Thank you! I'll fix this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r125772533 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -201,6 +202,193 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') SELECT * FROM src") + sql( +s""" + |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02') + |SELECT * FROM src + |UNION ALL + |SELECT * FROM src + """.stripMargin) + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') SELECT * FROM src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS").collect() + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS").collect() + + assert(queryStats("2010-01-01").rowCount.get === 500) + assert(queryStats("2010-01-01").sizeInBytes === 5812) + + assert(queryStats("2010-01-02").rowCount.get === 2*500) + assert(queryStats("2010-01-02").sizeInBytes === 2*5812) +} + } + + test("analyze single partition noscan") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') SELECT * FROM src") + sql( +s""" + |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02') + |SELECT * FROM src + |UNION ALL + |SELECT * FROM src + """.stripMargin) + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') SELECT * FROM src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") +.collect() + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") +.collect() + + assert(queryStats("2010-01-01").rowCount === None) + assert(queryStats("2010-01-01").sizeInBytes === 5812) + + assert(queryStats("2010-01-02").rowCount === None) + assert(queryStats("2010-01-02").sizeInBytes === 2*5812) +} + } + + test("analyze a set of partitions") { --- End diff -- @wzhfy, given that these cases are covered in SparkSqlParserSuite, is it still necessary to cover them again in StatisticsSuite? Partition columns without values are removed at parsing stage so that AnalyzePartitionCommand always receives partition column along with a value. re: (ds, hr='10') - https://cwiki.apache.org/confluence/display/Hive/StatsDev suggests that it is allowed; this PR supports this syntax. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @wzhfy, thank you for review. I created AnalyzePartitionCommand class, modified the logic to update all partitions in a single call to Metastore, combined noscan and non-noscan test cases. I believe I addressed all comments except for requests to make queryStats and assertStats common functions. After merging test cases, each of these functions is used only in one test, hence, there is no copy-pasta anymore. Would you take another look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r125763666 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -201,6 +202,193 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("analyze single partition") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') SELECT * FROM src") + sql( +s""" + |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02') + |SELECT * FROM src + |UNION ALL + |SELECT * FROM src + """.stripMargin) + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') SELECT * FROM src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS").collect() + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS").collect() + + assert(queryStats("2010-01-01").rowCount.get === 500) + assert(queryStats("2010-01-01").sizeInBytes === 5812) + + assert(queryStats("2010-01-02").rowCount.get === 2*500) + assert(queryStats("2010-01-02").sizeInBytes === 2*5812) +} + } + + test("analyze single partition noscan") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): CatalogStatistics = { + val partition = + spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats.get +} + +withTable(tableName) { + sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') SELECT * FROM src") + sql( +s""" + |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02') + |SELECT * FROM src + |UNION ALL + |SELECT * FROM src + """.stripMargin) + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') SELECT * FROM src") + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN") +.collect() + + sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE STATISTICS NOSCAN") +.collect() + + assert(queryStats("2010-01-01").rowCount === None) + assert(queryStats("2010-01-01").sizeInBytes === 5812) + + assert(queryStats("2010-01-02").rowCount === None) + assert(queryStats("2010-01-02").sizeInBytes === 2*5812) +} + } + + test("analyze a set of partitions") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String, hr: String): Option[CatalogStatistics] = { + val tableId = TableIdentifier(tableName) + val partition = +spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, "hr" -> hr)) + partition.stats +} + +def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: BigInt): Unit = { + val stats = queryStats(ds, hr).get + assert(stats.rowCount === Some(rowCount)) + assert(stats.sizeInBytes === sizeInBytes) +} + +def assertNoStats(ds: String, hr: String): Unit = { + assert(queryStats(ds, hr) === None) +} + +def createPartition(ds: String, hr: Int, query: String): Unit = { --- End diff -- I created this method to make the test code more concise. For example, this method helps avoid multi-line methods calls like this: sql( s""" |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02', hr=11) |SELECT * FROM src |UNION ALL |SELECT * FROM src """.stripMargin) becomes createPartition("2010-01-02", 11, "SELECT * FROM SRC UNION ALL SELECT * FROM SRC") --- If your
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r125688658 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -986,6 +987,7 @@ private[hive] object HiveClientImpl { tpart.setTableName(ht.getTableName) tpart.setValues(partValues.asJava) tpart.setSd(storageDesc) +tpart.setParameters(mutable.Map(p.parameters.toSeq: _*).asJava) --- End diff -- p.parameters.asJava produces as immutable map and causes java.lang.UnsupportedOperationException: remove; --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @wzhfy, @gatorsmile, I updated PR to add full support for partition partition specs. This version supports a subset of partition columns with or without values specified. I'm going away for the long weekend tomorrow and will be back next Wed (July 5). I may not be able to respond to comments or provide updates to this PR until then. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18421 @wzhfy, @gatorsmile, I updated this PR to support partial partition specs where values are defined only for a subset of partition columns. For example, if table has 2 partition columns ds and hr, both PARTITION (ds='2010-01-01', hr=10) and PARTITION (hr=10) specs are valid. I did not provide support for partition specs where partition column is mentioned without any value. E.g. PARTITION (ds, hr=10) spec is still not allowed. I didn't implement this because I realized that I don't understand why this syntax is useful. My understanding is that (ds, hr=10) is equivalent to (hr=10). Do you think it is important to implement this syntax? If so, would you help me understand why? Is this because we want to support HQL at the same level as Hive to allow seamless transition from Hive to Spark? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r124871491 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala --- @@ -239,18 +239,20 @@ class SparkSqlParserSuite extends AnalysisTest { AnalyzeTableCommand(TableIdentifier("t"), noscan = false)) assertEqual("analyze table t compute statistics noscan", AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) -assertEqual("analyze table t partition (a) compute statistics nOscAn", +assertEqual("analyze table t compute statistics nOscAn", AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) -// Partitions specified - we currently parse them but don't do anything with it +// Partitions specified assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS", - AnalyzeTableCommand(TableIdentifier("t"), noscan = false)) + AnalyzeTableCommand(TableIdentifier("t"), noscan = false, +partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11" assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS noscan", - AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) -assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS", - AnalyzeTableCommand(TableIdentifier("t"), noscan = false)) -assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS noscan", - AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) + AnalyzeTableCommand(TableIdentifier("t"), noscan = true, +partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11" +intercept("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS", --- End diff -- @wzhfy, I expect syntax change to be quite small and incremental. Currently, it is necessary to specify all partition columns along with values. The change will be to allow only a subset of partition columns and allow partition columns without values. `PARTITION (partcol1=val1,...) -> PARTITION (partcol1[=val1],...)` That said, I want to try to allow partial partition specs in this PR. Let me spend some time on it and report back my findings. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r124546295 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -95,25 +95,32 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { * {{{ * ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN]; * }}} + * Example SQL for analyzing a single partition : + * {{{ + * ANALYZE TABLE table PARTITION (key=value,..) COMPUTE STATISTICS [NOSCAN]; --- End diff -- I'm seeing that visitNonOptionalPartitionSpec detects unset partition columns and throws an exception: Found an empty partition key '$key'. Is this sufficient or do you have something else in mind? /** * Create a partition specification map without optional values. */ protected def visitNonOptionalPartitionSpec( ctx: PartitionSpecContext): Map[String, String] = withOrigin(ctx) { visitPartitionSpec(ctx).map { case (key, None) => throw new ParseException(s"Found an empty partition key '$key'.", ctx) case (key, Some(value)) => key -> value } } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18421#discussion_r124448971 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala --- @@ -239,18 +239,20 @@ class SparkSqlParserSuite extends AnalysisTest { AnalyzeTableCommand(TableIdentifier("t"), noscan = false)) assertEqual("analyze table t compute statistics noscan", AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) -assertEqual("analyze table t partition (a) compute statistics nOscAn", +assertEqual("analyze table t compute statistics nOscAn", AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) -// Partitions specified - we currently parse them but don't do anything with it +// Partitions specified assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS", - AnalyzeTableCommand(TableIdentifier("t"), noscan = false)) + AnalyzeTableCommand(TableIdentifier("t"), noscan = false, +partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11" assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS noscan", - AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) -assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS", - AnalyzeTableCommand(TableIdentifier("t"), noscan = false)) -assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS noscan", - AnalyzeTableCommand(TableIdentifier("t"), noscan = true)) + AnalyzeTableCommand(TableIdentifier("t"), noscan = true, +partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11" +intercept("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS", --- End diff -- I see. Thanks for pointing that out. Currently, this PR supports only exact partition spec. It doesn't support partial partition specs describing the document above. My preference would be to keep it simple for this PR and support only exact spec and add support for partial specs in a follow up PR. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...
GitHub user mbasmanova opened a pull request: https://github.com/apache/spark/pull/18421 [SPARK-21213][SQL] Support collecting partition-level statistics: row⦠â¦Count and sizeInBytes ## What changes were proposed in this pull request? Added support for ANALYZE TABLE table PARTITION (key=value,..) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for the specified partition. Calculated statistics are stores in Hive Metastore as user-defined properties attached to the partition object. Property names are the same as the ones used to store table-level row count and size in bytes. Specifically, spark.sql.statistics.totalSize and spark.sql.statistics.numRows. ## How was this patch tested? Added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mbasmanova/spark mbasmanova-analyze-partition Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18421.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18421 commit e18f144a2c6b318a285fbb184bb5c78408f06039 Author: Masha Basmanova <mbasman...@fb.com> Date: 2017-06-12T20:36:28Z [SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 Folks, thanks for merging this PR. @gatorsmile, re: a request to backport to 2.1. Is this something that happens automatically? If not, could you point me to some instructions on how to do that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @gatorsmile, thank you for review. All the comments should be addressed now. Please, check it out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123242117 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- Ok. I'll simplify the test and remove withTempPaths function. Since I need to make sure that partition locations are not under table location, I'll include ALTER TABLE $tableName SET LOCATION '$path' after creating table with partitions in the default location. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @shaneknapp, I can't think of why PySpark unit tests failed other than flakiness of the tests or infrastructure. Any chance you could trigger a re-run? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 Thank you, @shaneknapp . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @gatorsmile, thanks for clarifying. I added a test to analyze a table with only some of partitions being visible. Would you take another look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @gatorsmile, re: additional test, how can I create a table with a mix of visible and invisible partitions? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122651779 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- This is how I wrote the test initially. In this case all partitions are located under the same top-level directory, but table-level location is somewhere else. I modified the test to use different paths for each partition as well as the table to address some of the earlier comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122649017 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils } /** + * Creates the requested number of temporary path (without creating the actual file/directory), --- End diff -- I wanted to create a multi-path version of an existing withTempPath function. That function returns a *valid*, but non-existent path. The comments and path.delete() part of the implementation came from there. /** * Generates a temporary path without creating the actual file/directory, then pass it to `f`. If * a file/directory is created there by `f`, it will be delete after `f` returns. * * @todo Probably this method should be moved to a more general place */ protected def withTempPath(f: File => Unit): Unit = { val path = Utils.createTempDir() path.delete() try f(path) finally Utils.deleteRecursively(path) } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @wzhfy, thank you careful review and helpful suggestions. I think I incorporated all of your comments. Would you take another look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122505126 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +withTempPaths(4) { (paths) => paths match { --- End diff -- I need a path for the table and one more path for each of the three partitions. 1 + 3 = 4. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @wzhfy , @tejasapatil , I updated description of this PR, enhanced the test and renamed a function I missed before in Tejas's comment. How does it look now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @rxin, could you help unblock Jenkins for this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r14357 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +sql( + """ +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02') +|SELECT * FROM src + """.stripMargin).collect() +sql( --- End diff -- This is a fine suggestion, but I feel it falls outside of the scope of this change. This particular fix concerns only the case when table-level location is not a parent directory of a partition-level location. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...
Github user mbasmanova commented on the issue: https://github.com/apache/spark/pull/18309 @wzhfy, @tejasapatil, thanks for your feedback. I think I addressed all of your comments and added a test. Please, take another look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
GitHub user mbasmanova opened a pull request: https://github.com/apache/spark/pull/18309 [SPARK-21079] [SQL] Calculate total size of a partition table as a sum of individual partitions ## What changes were proposed in this pull request? When calculating total size of a partitioned table, use storage URIs associated with individual partitions to identify the files which make up the table. CC: @wzhfy ## How was this patch tested? Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mbasmanova/spark mbasmanova-analyze-part-table Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18309.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18309 commit a1dbdd6f56e500586b399565a7f837800039bfb3 Author: Masha Basmanova <mbasman...@fb.com> Date: 2017-06-15T00:24:47Z [SPARK-21079] [SQL] Calculate total size of a partition table as a sum of individual partitions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org