[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-18 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@gatorsmile, so excited to see this change merged. Thank you for your 
careful, detailed reviews.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-15 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
ping @gatorsmile





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-11 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@gatorsmile, thank you for reviewing this PR. I'm excited that it is 
getting very close to being merged.

It is not obvious to me why you would prefer to follow Hive and restrict 
the user to partition specs that contain partition columns from the head of 
the list of partition columns in the table schema, while disallowing 
arbitrary subsets.
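
For concreteness, a minimal sketch of the two kinds of specs for a table 
partitioned by (ds, hr), using the sql helper from the test suite (the 
table name is illustrative):

    // Prefix spec: ds is the first partition column in the schema --
    // allowed under Hive's restriction.
    sql("ANALYZE TABLE logs PARTITION (ds='2010-01-01') COMPUTE STATISTICS")

    // Arbitrary subset: hr without ds -- rejected under a prefix-only rule,
    // although nothing in the implementation seems to require that.
    sql("ANALYZE TABLE logs PARTITION (hr=10) COMPUTE STATISTICS")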





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-10 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r132493566
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -256,6 +257,201 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", "SELECT '1', 'A' from src")
+  createPartition("2010-01-02", "SELECT '1', 'A' from src UNION ALL 
SELECT '1', 'A' from src")
+  createPartition("2010-01-03", "SELECT '1', 'A' from src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertPartitionStats(
+ds: String,
+hr: String,
+rowCount: Option[BigInt],
+sizeInBytes: BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === rowCount)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, "SELECT '1', 'A' from src")
+  createPartition("2010-01-01", 11, "SELECT '1', 'A' from src")
+  createPartition("2010-01-02", 10, "SELECT '1', 'A' from src")
+  createPartition("2010-01-02", 11,
+"SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertPartitionStats("2010-01-01", "10", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-01", "11", rowCount = None, 
sizeInBytes = 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertPartitionStats("2010-01-01", "10", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-01", "11", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-02", "10", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-02", "11", rowC

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-10 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r132490843
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -107,6 +109,7 @@ case class CatalogTablePartition(
 if (parameters.nonEmpty) {
   map.put("Partition Parameters", s"{${parameters.map(p => p._1 + "=" 
+ p._2).mkString(", ")}}")
 }
+stats.foreach(s => map.put("Partition Statistics", s.simpleString))
--- End diff --

Indeed. It works like you said it would. :-) Adding a test.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-08-08 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@gatorsmile, thanks for the review. Comments addressed. Questions answered. 
Tests are passing. Would you take another look?





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-08 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r131957187
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -256,6 +257,201 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", "SELECT '1', 'A' from src")
+  createPartition("2010-01-02", "SELECT '1', 'A' from src UNION ALL 
SELECT '1', 'A' from src")
+  createPartition("2010-01-03", "SELECT '1', 'A' from src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertPartitionStats(
+ds: String,
+hr: String,
+rowCount: Option[BigInt],
+sizeInBytes: BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === rowCount)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, "SELECT '1', 'A' from src")
+  createPartition("2010-01-01", 11, "SELECT '1', 'A' from src")
+  createPartition("2010-01-02", 10, "SELECT '1', 'A' from src")
+  createPartition("2010-01-02", 11,
+"SELECT '1', 'A' from src UNION ALL SELECT '1', 'A' from src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertPartitionStats("2010-01-01", "10", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-01", "11", rowCount = None, 
sizeInBytes = 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertPartitionStats("2010-01-01", "10", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-01", "11", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-02", "10", rowCount = None, 
sizeInBytes = 2000)
+  assertPartitionStats("2010-01-02", "11", rowC

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-08 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r131957033
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala
 ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, 
UnresolvedAttribute}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, 
CatalogTableType}
+import 
org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
+import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Literal}
+
+/**
+ * Analyzes a given set of partitions to generate per-partition 
statistics, which will be used in
+ * query optimizations.
+ *
+ * When `partitionSpec` is empty, statistics for all partitions are 
collected and stored in
+ * Metastore.
+ *
+ * When `partitionSpec` mentions only some of the partition columns, all
+ * partitions with matching values for the specified columns are processed.
+ *
+ * If `partitionSpec` mentions an unknown partition column, an
+ * `AnalysisException` is raised.
+ *
+ * By default, total number of rows and total size in bytes are calculated.
+ * When `noscan` is `true`, only total size in bytes is computed.
+ */
+case class AnalyzePartitionCommand(
+tableIdent: TableIdentifier,
+partitionSpec: Map[String, Option[String]],
+noscan: Boolean = true) extends RunnableCommand {
+
+  private def validatePartitionSpec(table: CatalogTable): 
Option[TablePartitionSpec] = {
+val partitionColumnNames = table.partitionColumnNames.toSet
+val invalidColumnNames = 
partitionSpec.keys.filterNot(partitionColumnNames.contains(_))
+if (invalidColumnNames.nonEmpty) {
+  val tableId = table.identifier
+  throw new AnalysisException(s"Partition specification for table 
'${tableId.table}' " +
+s"in database '${tableId.database.get}' refers to unknown 
partition column(s): " +
+invalidColumnNames.mkString(","))
+}
+
+val filteredSpec = partitionSpec.filter(_._2.isDefined)
--- End diff --

@gatorsmile, this PR doesn't impose this requirement. The Hive docs also 
don't list it: https://cwiki.apache.org/confluence/display/Hive/StatsDev





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-08-08 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r131953998
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -107,6 +109,7 @@ case class CatalogTablePartition(
 if (parameters.nonEmpty) {
   map.put("Partition Parameters", s"{${parameters.map(p => p._1 + "=" 
+ p._2).mkString(", ")}}")
 }
+stats.foreach(s => map.put("Partition Statistics", s.simpleString))
--- End diff --

@gatorsmile, this code doesn't make partition stats appear in the output 
of the DESC command. It only adds stats info to the 
CatalogTablePartition.toString output (similar to CatalogTable.toString). Do 
you still want me to add some tests for this functionality?
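
A minimal sketch of the pattern under discussion, using a simplified 
stand-in for CatalogStatistics (the types below are illustrative, not the 
real API):

    import scala.collection.mutable

    case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt]) {
      def simpleString: String =
        s"$sizeInBytes bytes" + rowCount.map(r => s", $r rows").getOrElse("")
    }

    val stats: Option[Stats] = Some(Stats(BigInt(2000), Some(BigInt(500))))
    val map = mutable.LinkedHashMap[String, String]()
    // Option.foreach adds the entry only when stats are present, so the
    // toString output simply omits the line for partitions without stats.
    stats.foreach(s => map.put("Partition Statistics", s.simpleString))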





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-31 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@gatorsmile, I believe I addressed all your comments, and I'm seeing that 
tests are passing now. Thanks for the pointers to the -1 and <=0 changes. 
Tests were failing before because I missed these changes. Let me know if 
anything else needs more work or if this is now good to merge.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-31 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
[ @gatorsmile ] Is it possible that the partition-level row count is larger 
than the table-level row count after running this new command?

I think so. If table contents have changed since table-level stats were 
collected, and a very large new partition was added or an existing one 
replaced, collecting partition-level stats for that one partition can result 
in a partition row count exceeding the table row count.

I think you are pointing to a larger issue: statistics becoming stale as 
tables are modified. My thinking is that the ANALYZE TABLE commands provide 
the necessary primitives for collecting statistics, but a separate facility 
is needed to manage the collection process. At the most basic level, a user 
would manually re-run the necessary ANALYZE TABLE commands whenever table 
content changes; for larger deployments, an automated solution could be 
devised.
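
For concreteness, a minimal sketch of the staleness scenario (table and 
source names are illustrative):

    // Table-level stats collected while the table is small, e.g. rowCount = 500.
    sql("ANALYZE TABLE logs COMPUTE STATISTICS")

    // Later, a very large partition is added...
    sql("INSERT INTO TABLE logs PARTITION (ds='2017-08-01') SELECT * FROM big_src")

    // ...and only that partition is re-analyzed.
    sql("ANALYZE TABLE logs PARTITION (ds='2017-08-01') COMPUTE STATISTICS")

    // The partition's row count can now exceed the stale table-level row
    // count until ANALYZE TABLE is re-run at the table level.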





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-31 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r130408151
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, 
UnresolvedAttribute}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, 
CatalogTableType}
+import 
org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
+import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Literal}
+
+/**
+ * Analyzes a given set of partitions to generate per-partition 
statistics, which will be used in
+ * query optimizations.
+ *
+ * When `partitionSpec` is empty, statistics for all partitions are 
collected and stored in
+ * Metastore.
+ *
+ * When `partitionSpec` mentions only some of the partition columns, all
+ * partitions with matching values for the specified columns are processed.
+ *
+ * If `partitionSpec` mentions an unknown partition column, an
+ * `AnalysisException` is raised.
+ *
+ * By default, total number of rows and total size in bytes are calculated.
+ * When `noscan` is `true`, only total size in bytes is computed.
+ */
+case class AnalyzePartitionCommand(
+tableIdent: TableIdentifier,
+partitionSpec: Map[String, Option[String]],
+noscan: Boolean = true) extends RunnableCommand {
+
+  private def validatePartitionSpec(table: CatalogTable): 
Option[TablePartitionSpec] = {
+val partitionColumnNames = table.partitionColumnNames.toSet
+val invalidColumnNames = 
partitionSpec.keys.filterNot(partitionColumnNames.contains(_))
+if (invalidColumnNames.nonEmpty) {
+  val tableId = table.identifier
+  throw new AnalysisException(s"Partition specification for table 
'${tableId.table}' " +
+s"in database '${tableId.database}' refers to unknown partition 
column(s): " +
+invalidColumnNames.mkString(","))
+}
+
+val filteredSpec = partitionSpec.filter(_._2.isDefined)
+if (filteredSpec.isEmpty) {
+  None
+} else {
+  Some(filteredSpec.mapValues(_.get))
+}
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val db = 
tableIdent.database.getOrElse(sessionState.catalog.getCurrentDatabase)
+val tableIdentWithDB = TableIdentifier(tableIdent.table, Some(db))
+val tableMeta = sessionState.catalog.getTableMetadata(tableIdentWithDB)
+if (tableMeta.tableType == CatalogTableType.VIEW) {
+  throw new AnalysisException("ANALYZE TABLE is not supported on 
views.")
+}
+
+val partitionValueSpec = validatePartitionSpec(tableMeta)
+
+val partitions = 
sessionState.catalog.listPartitions(tableMeta.identifier, partitionValueSpec)
+
+if (partitions.isEmpty) {
+  if (partitionValueSpec.isDefined) {
+throw new NoSuchPartitionException(db, tableIdent.table, 
partitionValueSpec.get)
+  } else {
+// the user requested to analyze all partitions for a table which 
has no partitions
+// return normally, since there is nothing to do
+return Seq.empty[Row]
+  }
+}
+
+// Compute statistics for individual partitions
+val rowCounts: Map[TablePartitionSpec, BigInt] =
+  if (noscan) {
+Map.empty
+  } else {
+calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+  }
+
+// Update the metastore if newly computed statistics are different 
fr

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-31 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r130404020
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -90,30 +90,40 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
   }
 
   /**
-   * Create an [[AnalyzeTableCommand]] command or an 
[[AnalyzeColumnCommand]] command.
-   * Example SQL for analyzing table :
+   * Create an [[AnalyzeTableCommand]] command, or an 
[[AnalyzePartitionCommand]]
+   * or an [[AnalyzeColumnCommand]] command.
+   * Example SQL for analyzing table or a set of partitions :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
+   *   ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], 
partcol2[=val2], ...)]
+   *   COMPUTE STATISTICS [NOSCAN];
* }}}
+   *
* Example SQL for analyzing columns :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2;
+   *   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS 
column1, column2;
* }}}
*/
   override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = 
withOrigin(ctx) {
-if (ctx.partitionSpec != null) {
-  logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
+if (ctx.identifier != null &&
+ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+  throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
 }
-if (ctx.identifier != null) {
-  if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
-throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
+
+val table = visitTableIdentifier(ctx.tableIdentifier)
+if (ctx.identifierSeq() == null) {
+  if (ctx.partitionSpec != null) {
+AnalyzePartitionCommand(table, 
visitPartitionSpec(ctx.partitionSpec),
+  noscan = ctx.identifier != null)
+  } else {
+AnalyzeTableCommand(table, noscan = ctx.identifier != null)
   }
-  AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier))
-} else if (ctx.identifierSeq() == null) {
-  AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier), 
noscan = false)
 } else {
+  if (ctx.partitionSpec != null) {
+logWarning("Partition specification is ignored when collecting 
column statistics: " +
+  ctx.partitionSpec.getText)
--- End diff --

This is the existing behavior: the ANALYZE TABLE command simply ignores the 
PARTITION clause. This PR adds support for the PARTITION clause for COMPUTE 
STATISTICS, but keeps COMPUTE STATISTICS FOR COLUMNS as is. I'm planning to 
add partition support to FOR COLUMNS in a follow-up PR.

Changing this code to throw an exception in this PR would *break* existing 
uses if folks are relying on a WARN.

Would it be OK to keep this functionality unchanged in the PR?
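
To illustrate the behavior being preserved (table and column names are 
illustrative):

    // New in this PR: partition-level stats via the PARTITION clause.
    sql("ANALYZE TABLE logs PARTITION (ds='2017-01-01') COMPUTE STATISTICS")

    // Unchanged by this PR: for column stats the PARTITION clause is ignored
    // with a warning, and the columns are analyzed across the whole table.
    sql("ANALYZE TABLE logs PARTITION (ds='2017-01-01') COMPUTE STATISTICS FOR COLUMNS key")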





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-31 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r130402344
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
 ---
@@ -112,4 +112,29 @@ object CommandUtils extends Logging {
 size
   }
 
+  def compareAndGetNewStats(
+  oldStats: Option[CatalogStatistics],
+  newTotalSize: BigInt,
+  newRowCount: Option[BigInt]): Option[CatalogStatistics] = {
+val oldTotalSize = oldStats.map(_.sizeInBytes.toLong).getOrElse(0L)
+val oldRowCount = 
oldStats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L)
+var newStats: Option[CatalogStatistics] = None
+if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
--- End diff --

Indeed. I fixed this to use -1 and >=0.
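
A sketch of the fixed comparison, assuming the declarations shown in the 
diff above (only the sentinel and the size check change; row-count handling 
follows the same pattern):

    val oldTotalSize = oldStats.map(_.sizeInBytes.toLong).getOrElse(-1L)
    val oldRowCount = oldStats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L)
    var newStats: Option[CatalogStatistics] = None
    // A size of 0 is valid (an empty table or partition); only the -1
    // sentinel means "no previous value", hence >= 0 instead of > 0.
    if (newTotalSize >= 0 && newTotalSize != oldTotalSize) {
      newStats = Some(CatalogStatistics(sizeInBytes = newTotalSize))
    }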





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-31 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@gatorsmile, I was on vacation and today is my first day back. I'm planning 
to work through the feedback on this PR today.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-18 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@wzhfy, @gatorsmile, I just noticed that Jenkins reported test failures. 
I'll look into them the first week of August and update this PR.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-18 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
Jenkins, retest this please





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-18 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@wzhfy, @gatorsmile, I'm on vacation till the end of July. I hope this PR 
is good to merge now, but if not, I'll address further comments the first 
week of August.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-11 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@wzhfy, I fixed the case of PARTITION (ds, hr) to collect stats for *all* 
partitions instead of for the table as a whole. Rebased. Addressed other 
comments. Please take another look.
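
In other words, after the fix a spec that lists partition columns without 
values analyzes each matching partition separately, e.g. (table name is 
illustrative):

    // Collects per-partition stats for every (ds, hr) partition,
    // not a single table-level statistic.
    sql("ANALYZE TABLE logs PARTITION (ds, hr) COMPUTE STATISTICS")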





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-11 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126725750
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === None)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-01", 11, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 11, s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
--- End diff --

I'

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-11 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126724336
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
--- End diff --

@wzhfy, I'm using multiple partitions to protect against bugs like the 
following:

- ANALYZE PARTITION collects stats for the whole table instead of a single 
partition (hence, the test table needs 2+ partitions)
- ANALYZE PARTITION collects stats for the *first* partition instead of the 
partition specified in the SQL command (hence, the test needs to analyze a 
couple of different partitions)

By naming the test "analyze single partition" I was hoping to communicate 
the intent: testing analysis of a single partition, as opposed to a set of 
partitions. Does this make sense? Do you still want me to reduce the test to 
a single-partition table?





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-10 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126452142
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -93,30 +93,50 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   /**
-   * Create an [[AnalyzeTableCommand]] command or an 
[[AnalyzeColumnCommand]] command.
-   * Example SQL for analyzing table :
+   * Create an [[AnalyzeTableCommand]] command, or an 
[[AnalyzePartitionCommand]]
+   * or an [[AnalyzeColumnCommand]] command.
+   * Example SQL for analyzing table or a set of partitions :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
+   *   ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], 
partcol2[=val2], ...)]
+   *   COMPUTE STATISTICS [NOSCAN];
* }}}
+   *
* Example SQL for analyzing columns :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2;
+   *   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS 
column1, column2;
* }}}
*/
   override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = 
withOrigin(ctx) {
-if (ctx.partitionSpec != null) {
-  logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
+if (ctx.identifier != null &&
+ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+  throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
 }
-if (ctx.identifier != null) {
-  if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
-throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
+
+val partitionSpec =
+  if (ctx.partitionSpec != null) {
+val filteredSpec = 
visitPartitionSpec(ctx.partitionSpec).filter(_._2.isDefined)
+if (filteredSpec.isEmpty) {
+  None
+} else {
+  Some(filteredSpec.mapValues(_.get))
+}
+  } else {
+None
+  }
+
+val table = visitTableIdentifier(ctx.tableIdentifier)
+if (ctx.identifierSeq() == null) {
+  if (partitionSpec.isDefined) {
+AnalyzePartitionCommand(table, partitionSpec.get, noscan = 
ctx.identifier != null)
+  } else {
+AnalyzeTableCommand(table, noscan = ctx.identifier != null)
--- End diff --

@wzhfy , good catch. Thank you! I'll fix this.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-05 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r125772533
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -201,6 +202,193 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') 
SELECT * FROM src")
+  sql(
+s"""
+   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
+   |SELECT * FROM src
+   |UNION ALL
+   |SELECT * FROM src
+ """.stripMargin)
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') 
SELECT * FROM src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS").collect()
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS").collect()
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 5812)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*5812)
+}
+  }
+
+  test("analyze single partition noscan") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') 
SELECT * FROM src")
+  sql(
+s"""
+   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
+   |SELECT * FROM src
+   |UNION ALL
+   |SELECT * FROM src
+ """.stripMargin)
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') 
SELECT * FROM src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+.collect()
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+.collect()
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 5812)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*5812)
+}
+  }
+
+  test("analyze a set of partitions") {
--- End diff --

@wzhfy, given that these cases are covered in SparkSqlParserSuite, is it 
still necessary to cover them again in StatisticsSuite? Partition columns 
without values are removed at the parsing stage, so AnalyzePartitionCommand 
always receives each partition column along with a value.

re: (ds, hr='10') - 
https://cwiki.apache.org/confluence/display/Hive/StatsDev suggests that it is 
allowed; this PR supports this syntax.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-07-05 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@wzhfy, thank you for the review. I created an AnalyzePartitionCommand 
class, modified the logic to update all partitions in a single call to the 
Metastore, and combined the noscan and non-noscan test cases. I believe I 
addressed all comments except for the requests to make queryStats and 
assertStats common functions. After merging the test cases, each of these 
functions is used in only one test, hence there is no copy-paste anymore. 
Would you take another look?





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-05 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r125763666
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -201,6 +202,193 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') 
SELECT * FROM src")
+  sql(
+s"""
+   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
+   |SELECT * FROM src
+   |UNION ALL
+   |SELECT * FROM src
+ """.stripMargin)
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') 
SELECT * FROM src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS").collect()
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS").collect()
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 5812)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*5812)
+}
+  }
+
+  test("analyze single partition noscan") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') 
SELECT * FROM src")
+  sql(
+s"""
+   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
+   |SELECT * FROM src
+   |UNION ALL
+   |SELECT * FROM src
+ """.stripMargin)
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') 
SELECT * FROM src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+.collect()
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+.collect()
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 5812)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*5812)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertNoStats(ds: String, hr: String): Unit = {
+  assert(queryStats(ds, hr) === None)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
--- End diff --

I created this method to make the test code more concise. For example, it 
helps avoid multi-line method calls like this:

sql(
s"""
   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02', hr=11)
   |SELECT * FROM src
   |UNION ALL
   |SELECT * FROM src
 """.stripMargin)

becomes 

createPartition("2010-01-02", 11, "SELECT * FROM SRC UNION ALL SELECT * FROM SRC")



[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-05 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r125688658
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -986,6 +987,7 @@ private[hive] object HiveClientImpl {
 tpart.setTableName(ht.getTableName)
 tpart.setValues(partValues.asJava)
 tpart.setSd(storageDesc)
+tpart.setParameters(mutable.Map(p.parameters.toSeq: _*).asJava)
--- End diff --

p.parameters.asJava produces an immutable map and causes 
java.lang.UnsupportedOperationException: remove.
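
A minimal standalone sketch of the conversion issue (map contents are 
illustrative):

    import scala.collection.JavaConverters._
    import scala.collection.mutable

    val params: Map[String, String] = Map("numRows" -> "500")

    // .asJava on an immutable Scala map yields a read-only java.util.Map;
    // Hive later calling remove(...) on it throws UnsupportedOperationException.
    val readOnly: java.util.Map[String, String] = params.asJava

    // Copying into a mutable map first, as in the diff above, yields a
    // java.util.Map that supports mutation.
    val writable: java.util.Map[String, String] = mutable.Map(params.toSeq: _*).asJava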





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-06-29 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@wzhfy, @gatorsmile, I updated the PR to add full support for partial 
partition specs. This version supports a subset of partition columns, with or 
without values specified. I'm going away for the long weekend tomorrow and 
will be back next Wed (July 5). I may not be able to respond to comments or 
provide updates to this PR until then.





[GitHub] spark issue #18421: [SPARK-21213][SQL] Support collecting partition-level st...

2017-06-29 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18421
  
@wzhfy, @gatorsmile, I updated this PR to support partial partition specs, 
where values are defined only for a subset of partition columns. For example, 
if a table has two partition columns, ds and hr, both PARTITION 
(ds='2010-01-01', hr=10) and PARTITION (hr=10) specs are valid. I did not add 
support for partition specs where a partition column is mentioned without any 
value, e.g. the PARTITION (ds, hr=10) spec is still not allowed. I didn't 
implement this because I realized that I don't understand why this syntax is 
useful. My understanding is that (ds, hr=10) is equivalent to (hr=10). Do you 
think it is important to implement this syntax? If so, would you help me 
understand why? Is it because we want to support HQL at the same level as 
Hive, to allow a seamless transition from Hive to Spark?
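
To summarize the spec variants discussed here, for a table partitioned by 
(ds, hr) (table name is illustrative):

    // Full spec -- a single partition.
    sql("ANALYZE TABLE logs PARTITION (ds='2010-01-01', hr=10) COMPUTE STATISTICS")

    // Partial spec -- every partition with hr=10, regardless of ds.
    sql("ANALYZE TABLE logs PARTITION (hr=10) COMPUTE STATISTICS")

    // Column without a value -- not supported at this point in the PR;
    // if my understanding is right, it would be equivalent to the partial
    // spec above:
    // sql("ANALYZE TABLE logs PARTITION (ds, hr=10) COMPUTE STATISTICS")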





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-06-29 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r124871491
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala
 ---
@@ -239,18 +239,20 @@ class SparkSqlParserSuite extends AnalysisTest {
   AnalyzeTableCommand(TableIdentifier("t"), noscan = false))
 assertEqual("analyze table t compute statistics noscan",
   AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
-assertEqual("analyze table t partition (a) compute statistics nOscAn",
+assertEqual("analyze table t compute statistics nOscAn",
   AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
 
-// Partitions specified - we currently parse them but don't do 
anything with it
+// Partitions specified
 assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE 
STATISTICS",
-  AnalyzeTableCommand(TableIdentifier("t"), noscan = false))
+  AnalyzeTableCommand(TableIdentifier("t"), noscan = false,
+partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11"
 assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE 
STATISTICS noscan",
-  AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
-assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS",
-  AnalyzeTableCommand(TableIdentifier("t"), noscan = false))
-assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS 
noscan",
-  AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
+  AnalyzeTableCommand(TableIdentifier("t"), noscan = true,
+partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11"
+intercept("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS",
--- End diff --

@wzhfy, I expect the syntax change to be quite small and incremental. 
Currently, it is necessary to specify all partition columns along with their 
values. The change will be to allow only a subset of partition columns, and 
to allow partition columns without values.

`PARTITION (partcol1=val1,...) -> PARTITION (partcol1[=val1],...)`

That said, I want to try to allow partial partition specs in this PR. Let 
me spend some time on it and report back my findings.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-06-28 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r124546295
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -95,25 +95,32 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) {
    * {{{
    *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
    * }}}
+   * Example SQL for analyzing a single partition :
+   * {{{
+   *   ANALYZE TABLE table PARTITION (key=value,..) COMPUTE STATISTICS [NOSCAN];
--- End diff --

I'm seeing that visitNonOptionalPartitionSpec detects unset partition 
columns and throws an exception: "Found an empty partition key '$key'." Is this 
sufficient, or do you have something else in mind?

/**
 * Create a partition specification map without optional values.
 */
protected def visitNonOptionalPartitionSpec(
    ctx: PartitionSpecContext): Map[String, String] = withOrigin(ctx) {
  visitPartitionSpec(ctx).map {
    case (key, None) =>
      throw new ParseException(s"Found an empty partition key '$key'.", ctx)
    case (key, Some(value)) => key -> value
  }
}
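So a valueless partition column would fail fast at parse time. A minimal sketch 
of the expected behavior, assuming a ScalaTest context (the message is taken 
from the code above; the table name is hypothetical):

// Hypothetical test: the parser rejects a partition column without a value.
val e = intercept[org.apache.spark.sql.catalyst.parser.ParseException] {
  spark.sql("ANALYZE TABLE t PARTITION (ds, hr=11) COMPUTE STATISTICS")
}
assert(e.getMessage.contains("Found an empty partition key 'ds'"))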






[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-06-27 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r124448971
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala ---
@@ -239,18 +239,20 @@ class SparkSqlParserSuite extends AnalysisTest {
       AnalyzeTableCommand(TableIdentifier("t"), noscan = false))
     assertEqual("analyze table t compute statistics noscan",
       AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
-    assertEqual("analyze table t partition (a) compute statistics nOscAn",
+    assertEqual("analyze table t compute statistics nOscAn",
       AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
 
-    // Partitions specified - we currently parse them but don't do anything with it
+    // Partitions specified
     assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS",
-      AnalyzeTableCommand(TableIdentifier("t"), noscan = false))
+      AnalyzeTableCommand(TableIdentifier("t"), noscan = false,
+        partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11"))))
     assertEqual("ANALYZE TABLE t PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS noscan",
-      AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
-    assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS",
-      AnalyzeTableCommand(TableIdentifier("t"), noscan = false))
-    assertEqual("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS noscan",
-      AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
+      AnalyzeTableCommand(TableIdentifier("t"), noscan = true,
+        partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11"))))
+    intercept("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS",
--- End diff --

I see. Thanks for pointing that out. Currently, this PR supports only an exact 
partition spec; it doesn't support the partial partition specs described in the 
document above. My preference would be to keep this PR simple and support only 
the exact spec, then add support for partial specs in a follow-up PR. What do 
you think?





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-06-26 Thread mbasmanova
GitHub user mbasmanova opened a pull request:

https://github.com/apache/spark/pull/18421

[SPARK-21213][SQL] Support collecting partition-level statistics: row…

…Count and sizeInBytes

## What changes were proposed in this pull request?

Added support for the ANALYZE TABLE table PARTITION (key=value,..) COMPUTE 
STATISTICS [NOSCAN] SQL command, which calculates the total number of rows and 
the size in bytes for the specified partition. The calculated statistics are 
stored in the Hive Metastore as user-defined properties attached to the 
partition object. The property names are the same as the ones used to store the 
table-level row count and size in bytes, namely spark.sql.statistics.totalSize 
and spark.sql.statistics.numRows.
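A minimal usage sketch, assuming a hypothetical table sales partitioned by ds:

// Analyze one partition; with NOSCAN only the size in bytes is collected.
spark.sql("ANALYZE TABLE sales PARTITION (ds='2017-01-01') COMPUTE STATISTICS")

// The statistics should then be visible on the partition, for example via:
spark.sql("DESCRIBE EXTENDED sales PARTITION (ds='2017-01-01')").show(truncate = false)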

## How was this patch tested?

Added tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mbasmanova/spark mbasmanova-analyze-partition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18421


commit e18f144a2c6b318a285fbb184bb5c78408f06039
Author: Masha Basmanova <mbasman...@fb.com>
Date:   2017-06-12T20:36:28Z

[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount 
and sizeInBytes







[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-26 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
Folks, thanks for merging this PR. 

@gatorsmile, re: a request to backport to 2.1. Is this something that 
happens automatically? If not, could you point me to some instructions on how 
to do that? 





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-21 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@gatorsmile, thank you for review. All the comments should be addressed 
now. Please, check it out.





[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123242117
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of individual partitions") {
+    def queryTotalSize(tableName: String): BigInt =
+      spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+    val tableName = "analyzeTable_part"
+    withTable(tableName) {
+      withTempPaths(4) {
+        case tablePath :: partitionPaths =>
+          sql(
+            s"""
+               |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING)
+               |LOCATION '${tablePath}'
+             """.stripMargin).collect()
+
+          val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+          partitionDates.zip(partitionPaths).foreach {
+            case (ds, path) =>
+              sql(
+                s"""
+                   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+                   |LOCATION '${path.toString}'
+                """.stripMargin).collect()
+              sql(
+                s"""
+                   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+                   |SELECT * FROM src
+                """.stripMargin).collect()
+          }
+
+          sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+          assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

Ok, I'll simplify the test and remove the withTempPaths function. Since I need 
to make sure that the partition locations are not under the table location, 
I'll include ALTER TABLE $tableName SET LOCATION '$path' after creating the 
table with its partitions in the default location.
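A rough sketch of the simplified test shape (illustrative only, assuming the 
suite's existing withTable and withTempDir helpers):

// Create the table and a partition in the default warehouse location, then
// move the table-level location elsewhere so that it is no longer a parent
// directory of the partition locations.
withTable("analyzeTable_part") {
  withTempDir { path =>
    sql("CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING)")
    sql("INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') SELECT * FROM src")
    sql(s"ALTER TABLE analyzeTable_part SET LOCATION '${path.toURI}'")
    sql("ANALYZE TABLE analyzeTable_part COMPUTE STATISTICS noscan")
  }
}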





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-20 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@shaneknapp, I can't think of a reason why the PySpark unit tests failed other 
than flakiness in the tests or the infrastructure. Any chance you could trigger a re-run?





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-20 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
Thank you, @shaneknapp .





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-20 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@gatorsmile, thanks for clarifying. I added a test that analyzes a table with 
only some of its partitions visible. Would you take another look?





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-19 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@gatorsmile, re: additional test, how can I create a table with a mix of 
visible and invisible partitions?





[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-19 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122651779
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of individual partitions") {
+    def queryTotalSize(tableName: String): BigInt =
+      spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+    val tableName = "analyzeTable_part"
+    withTable(tableName) {
+      withTempPaths(4) {
+        case tablePath :: partitionPaths =>
+          sql(
+            s"""
+               |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING)
+               |LOCATION '${tablePath}'
+             """.stripMargin).collect()
+
+          val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+          partitionDates.zip(partitionPaths).foreach {
+            case (ds, path) =>
+              sql(
+                s"""
+                   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+                   |LOCATION '${path.toString}'
+                """.stripMargin).collect()
+              sql(
+                s"""
+                   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+                   |SELECT * FROM src
+                """.stripMargin).collect()
+          }
+
+          sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+          assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

This is how I wrote the test initially. In that case all partitions are 
located under the same top-level directory, but the table-level location is 
somewhere else. I modified the test to use a different path for each partition, 
as well as for the table, to address some of the earlier comments.





[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-19 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122649017
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Creates the requested number of temporary path (without creating the actual file/directory),
--- End diff --

I wanted to create a multi-path version of the existing withTempPath 
function. That function returns a *valid* but non-existent path. The comments 
and the path.delete() part of the implementation came from there.

/**
 * Generates a temporary path without creating the actual file/directory, then pass it to `f`. If
 * a file/directory is created there by `f`, it will be delete after `f` returns.
 *
 * @todo Probably this method should be moved to a more general place
 */
protected def withTempPath(f: File => Unit): Unit = {
  val path = Utils.createTempDir()
  path.delete()
  try f(path) finally Utils.deleteRecursively(path)
}
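For reference, a sketch of the multi-path variant discussed here (a 
hypothetical reconstruction, not the code as merged):

import java.io.File
import org.apache.spark.util.Utils

/**
 * Generates `numPaths` temporary paths without creating the actual files/directories,
 * then passes them to `f`; whatever `f` creates at those paths is deleted afterwards.
 */
protected def withTempPaths(numPaths: Int)(f: Seq[File] => Unit): Unit = {
  val paths = Seq.fill(numPaths) {
    val path = Utils.createTempDir()
    path.delete() // keep each path valid but non-existent, as withTempPath does
    path
  }
  try f(paths) finally paths.foreach(Utils.deleteRecursively)
}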





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-16 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@wzhfy, thank you for the careful review and helpful suggestions. I think I 
incorporated all of your comments. Would you take another look?





[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122505126
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of individual partitions") {
+    def queryTotalSize(tableName: String): BigInt =
+      spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+    withTempPaths(4) { (paths) => paths match {
--- End diff --

I need a path for the table and one more path for each of the three 
partitions. 1 + 3 = 4.





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-15 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@wzhfy , @tejasapatil , I updated the description of this PR, enhanced the 
test, and renamed a function I had missed earlier from Tejas's comment. How does it look now?





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-15 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@rxin, could you help unblock Jenkins for this PR?





[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread mbasmanova
Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r14357
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of individual partitions") {
+    def queryTotalSize(tableName: String): BigInt =
+      spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+    sql(
+      """
+        |CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING)
+      """.stripMargin).collect()
+    sql(
+      """
+        |INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01')
+        |SELECT * FROM src
+      """.stripMargin).collect()
+    sql(
+      """
+        |INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02')
+        |SELECT * FROM src
+      """.stripMargin).collect()
+    sql(
--- End diff --

This is a fine suggestion, but I feel it falls outside the scope of this 
change. This particular fix concerns only the case where the table-level 
location is not a parent directory of the partition-level locations.





[GitHub] spark issue #18309: [SPARK-21079] [SQL] Calculate total size of a partition ...

2017-06-15 Thread mbasmanova
Github user mbasmanova commented on the issue:

https://github.com/apache/spark/pull/18309
  
@wzhfy, @tejasapatil, thanks for your feedback. I think I addressed all of 
your comments and added a test. Please take another look.





[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-14 Thread mbasmanova
GitHub user mbasmanova opened a pull request:

https://github.com/apache/spark/pull/18309

[SPARK-21079] [SQL] Calculate total size of a partition table as a sum of 
individual partitions

## What changes were proposed in this pull request?

When calculating the total size of a partitioned table, use the storage URIs 
associated with the individual partitions to identify the files that make up 
the table.
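A conceptual sketch of the fix (sizeOfPath is a hypothetical helper; the real 
logic lives in Spark's Hive statistics code):

import java.net.URI
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition

// Hypothetical helper: recursively sum the file sizes under a location.
def sizeOfPath(location: URI): Long = ???

// Total table size: sum over each partition's own storage location, rather
// than relying on whatever happens to sit under the table-level location.
def totalSizeOfPartitionedTable(parts: Seq[CatalogTablePartition]): Long =
  parts.flatMap(_.storage.locationUri).map(sizeOfPath).sum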

CC: @wzhfy

## How was this patch tested?

Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and 
verified that sizeInBytes is calculated correctly. Before this change, the size 
would be zero.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mbasmanova/spark mbasmanova-analyze-part-table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18309.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18309


commit a1dbdd6f56e500586b399565a7f837800039bfb3
Author: Masha Basmanova <mbasman...@fb.com>
Date:   2017-06-15T00:24:47Z

[SPARK-21079] [SQL] Calculate total size of a partition table as a sum of 
individual partitions



