[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235370#comment-15235370 ] Mick Davies commented on SPARK-6910: Hi, We are seeing something similar, but in our case subsequent queries are still expensive. Looking at HiveMetastoreCatalog.lookupRelation (we are using 1.5, but 1.6 looks the same) we seem to create a new MetastoreRelation for each query. Part of the analysis phase tries to convert this to a ParquetRelation using convertToParquetRelation which always calls metastoreRelation.getHiveQlPartitions() which gets all partition information. So every query incurs the cost of retrieving all partition info. We don't understand how the code can use the cachedDataSourceTables effectively in the circumstances just described. We changed HiveMetastoreCatalog.lookupRelation to use cache even if Hive table property "spark.sql.sources.provider" is unset which caused subsequent queries to use cached relation and therfore run more quickly. Eg, changed {code} if (table.properties.get("spark.sql.sources.provider").isDefined) {code} to {code} if (cachedDataSourceTables.getIfPresent(QualifiedTableName(databaseName, tblName).toLowerCase) != null || table.properties.get("spark.sql.sources.provider").isDefined) {code} Are we doing something wrong? > Support for pushing predicates down to metastore for partition pruning > -- > > Key: SPARK-6910 > URL: https://issues.apache.org/jira/browse/SPARK-6910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheolsoo Park >Priority: Critical > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957194#comment-14957194 ] Cheolsoo Park commented on SPARK-6910: -- You're right that 2nd query is faster because the table/partition metadata is cached. Particularly, if you set {{spark.sql.hive.metastorePartitionPruning}} false (by default, false), Spark will cache metadata for all the partitions and any query against the same table will run faster even with a different predicate. See relevant code [here|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L830-L839]. > Support for pushing predicates down to metastore for partition pruning > -- > > Key: SPARK-6910 > URL: https://issues.apache.org/jira/browse/SPARK-6910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheolsoo Park >Priority: Critical > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956508#comment-14956508 ] qian, chen commented on SPARK-6910: --- I'm using spark-sql (spark version 1.5.1 && hadoop 2.4.0) and found a very interesting thing: in spark-sql shell: at first I ran this, it took about 3 minutes select * from table1 where date='20151010' and hour='12' and name='x' limit 5; Time taken: 164.502 seconds then I ran this, it only took 10s. date, hour and name are partition columns in this hive table. this table has >4000 partitions select * from table1 where date='20151010' and hour='13' limit 5; Time taken: 10.881 seconds is it because that the first time I need to download all partition information from hive metastore? the second query is faster because all partitions are cached in memory now? > Support for pushing predicates down to metastore for partition pruning > -- > > Key: SPARK-6910 > URL: https://issues.apache.org/jira/browse/SPARK-6910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheolsoo Park >Priority: Critical > Fix For: 1.5.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632373#comment-14632373 ] Apache Spark commented on SPARK-6910: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/7492 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628273#comment-14628273 ] Apache Spark commented on SPARK-6910: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/7421 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627409#comment-14627409 ] Apache Spark commented on SPARK-6910: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7409 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613468#comment-14613468 ] Apache Spark commented on SPARK-6910: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/7216 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603625#comment-14603625 ] Olivier Toupin commented on SPARK-6910: --- Anyone interested. I posted a workaround to this. It doesn't actually prune anything, but fix the latency issues. Tested in pre-production. Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603606#comment-14603606 ] Apache Spark commented on SPARK-6910: - User 'oliviertoupin' has created a pull request for this issue: https://github.com/apache/spark/pull/7049 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594706#comment-14594706 ] Apache Spark commented on SPARK-6910: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/6921 Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557628#comment-14557628 ] Cheolsoo Park commented on SPARK-6910: -- I don't know what's the best way to fix this, but here is how I work around it at work. *Fortunately*, I have an internal metadata service (similar to HCat) at work that takes a string representation of filter expression and returns a list of partitions. So I can simply push down predicates in Analyzer, convert them to a string, and call my metadata service. This approach only involves isolated changes to Analyzer and Catalog. *Unfortunately*, I couldn't make my solution fully work with Hive metastore. Even though Hive metastore client api has a similar method called {{getPartitionsByFilter()}}, it only works with string type partition keys, which is not so practical in real world. Although this issue is no longer blocking me, I am curious how it can be fixed with Hive metastore. The following are code snippets that are not specific to my internal metastore service. Feel free to use it if it helps anyhow- # I added predicate pushdown in Analyzer.ReloveRelations. {code} object ResolveRelations extends Rule[LogicalPlan] { def getTable(u: UnresolvedRelation, predicates: Seq[Expression] = Nil): LogicalPlan = { try { catalog.lookupRelation(u.tableIdentifier, u.alias, predicates) // push down predicates into catalog } catch { case _: NoSuchTableException = u.failAnalysis(sno such table ${u.tableName}) } } def apply(plan: LogicalPlan): LogicalPlan = plan transform { case i @ InsertIntoTable(u: UnresolvedRelation, _, _, _, _) = i.copy(table = EliminateSubQueries(getTable(u))) case f @ Filter(condition, u: UnresolvedRelation) = { // find any filter operator that has a unresolved relation child. f match { case FilteredOperation(predicates, _) = // extract conditions from filter operator f.copy(condition, getTable(u, predicates)) } } case u: UnresolvedRelation = getTable(u) } } {code} # In HiveMetastoreCatalog, I extract partition columns from predicates and build a string that represents the filter expression. {code} val partitionFilter: String = if (predicates.nonEmpty) { val partitionKeys: Set[String] = franklinTable.partition_keys().toSet val (pruningPredicates, _) = predicates.partition { predicate = val refs: Set[String] = predicate.references.map(_.name).toSet refs.nonEmpty refs.subsetOf(partitionKeys) predicate.isInstanceOf[BinaryComparison] } pruningPredicates.foldLeft() { (str, expr) = toFranklinExpression(str, expr, fieldType) // convert predicates to a string representation } } else { } {code} Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531325#comment-14531325 ] Ashwin Shankar commented on SPARK-6910: --- +1, this is causing pretty bad user experience for us too. Hope we can get this in soon, thanks ! Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529639#comment-14529639 ] Yana Kadiyska commented on SPARK-6910: -- Can you please provide an update on this -- status/maybe a target release? Spark-6904 was closed as a duplicate of this but this seems like a critical bug? We have a pretty large metastore (partitions are per month per customer with a few years of data) and Shark works OK but I cannot take advantage of the new cool versions of Spark until the Metastore interaction improves. Any advice on a workaround would also be great... I opened https://issues.apache.org/jira/browse/SPARK-6984 which is probably a dup of this. Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529666#comment-14529666 ] Michael Armbrust commented on SPARK-6910: - Target is Spark 1.5. In Spark 1.3 a workaround is to do something like: {{sqlContext.table(tableName).registerTempTable(...)}} which caches the list of partitions in memory on the driver. The initial pull is expensive but it is much faster after that. Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org