[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-78385690 @saucam if this is going to stay open, mind tagging it with [SQL] so it gets sorted properly? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-69716627 Can we have some kind of hint mechanism in the query itself, if the user knows the subquery is small? Then perhaps we can change the plan accordingly?
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-69110615 Also, if we use statistics, we would still need to figure out how to make the calculation lazy.
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-69110525

The problem with a hybrid approach is: how do you choose between them? If we had good statistics I would agree with you that we should use those to decide. However, we don't at this point. Given that, I think a safe option that may be slower but never OOMs the driver is better than aiming for optimal performance in the small data cases.

Also, were you testing with broadcast left semi join enabled and statistics for your tables? Avoiding a shuffle there might speed things up.
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68853317

Hi Michael, thanks for the feedback.

1. Yes, it does not handle correlated queries. It definitely makes more sense to convert correlated queries to joins, but for uncorrelated queries I think the join approach is too slow when the tables are large and the user is querying a small subset. For example, with two tables of ~45 million rows each, where the subquery returns only 90 rows:

select * from Y1 where Y1.id in (select Y2.id from Y2 where Y2.id < 90);

takes about 12 seconds to run with this approach on a single machine (--executor-memory 16G --driver-memory 8G). Following the join approach, the query is rewritten to:

select * from Y1 left semi join (select Y2.id as sqc0 from Y2 where id < 90) subquery on Y1.id = subquery.sqc0;

which takes 660 seconds to run on the same machine.

2. This approach can handle arbitrary nesting of subqueries:

select * from Y1 where Y1.id in (select Y2.id where Y2.timestamp in (select Y3.timestamp limit 20))

Can we take some hybrid approach of the two?
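A hybrid approach along these lines would need some way to decide between the two plans. A conceptual sketch of such a decision rule, in plain Python rather than the PR's Scala (the `threshold` knob is purely illustrative, not an actual Spark setting):

```python
# Hypothetical hybrid strategy: use the collect-to-driver in-set plan only
# when the subquery result is estimated to be small, and fall back to a
# left semi join otherwise. Both the threshold value and the plan names
# are illustrative, not taken from the PR.

def choose_plan(estimated_rows, threshold=10_000):
    """Return which physical strategy a hybrid planner might pick."""
    return "in-set" if estimated_rows <= threshold else "semi-join"
```

With the numbers from the example above, a 90-row subquery result would pick the in-set plan, while a 45-million-row one would pick the semi join. The hard part, as noted in the review, is getting a trustworthy `estimated_rows` in the first place.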
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68833871

This is simpler, but it has several disadvantages compared to the other approach:
- The InSet is collected to the driver and thus could cause OOMs when large
- I don't think that it handles correlated subqueries
- The `execute()` involves eager evaluation and breaks RDD lineage

For these reasons I think we should stick to extending the approach taken by the other PR.
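The memory trade-off behind the first bullet can be illustrated with a plain-Python sketch (not Spark code): the in-set strategy materializes every subquery value in one process, while a semi-join-style strategy hash-partitions the values so no single location ever needs the whole set.

```python
# Conceptual contrast of the two strategies (plain Python, not Spark).

def in_set_filter(rows, subquery_values):
    # Collect all subquery values into one hash set ("on the driver").
    # Memory use grows with the subquery result -- the OOM risk noted above.
    value_set = set(subquery_values)
    return [r for r in rows if r["id"] in value_set]

def semi_join_filter(rows, subquery_values, n_partitions=4):
    # Hash-partition the subquery values so each bucket holds only a
    # fraction of them, as a shuffle-based semi join would.
    buckets = [set() for _ in range(n_partitions)]
    for v in subquery_values:
        buckets[hash(v) % n_partitions].add(v)
    return [r for r in rows if r["id"] in buckets[hash(r["id"]) % n_partitions]]
```

Both return the same rows; the difference is where the subquery's values live while the filter runs, which is exactly the safety-versus-speed question being debated here.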
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68672348 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25046/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68672345

[Test build #25046 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25046/consoleFull) for PR 3888 at commit [`4019e0d`](https://github.com/apache/spark/commit/4019e0d6e0bf31a123f2817eb964562891211635).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class InSubquery(value: Expression) extends Predicate`
  * `case class DynamicFilter(condition: Expression, left: LogicalPlan, right: LogicalPlan)`
  * `case class DynamicFilter(condition: Expression, left: SparkPlan, right: SparkPlan)`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68669185 [Test build #25046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25046/consoleFull) for PR 3888 at commit [`4019e0d`](https://github.com/apache/spark/commit/4019e0d6e0bf31a123f2817eb964562891211635). * This patch merges cleanly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68669153 Jenkins, test this please.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68626506 Can one of the admins verify this patch?
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68626500 Hi @marmbrus, can you please take a look and suggest changes? I have tested a few queries, and this approach looks simpler than the already existing PR.
GitHub user saucam opened a pull request: https://github.com/apache/spark/pull/3888 SPARK-4226: Add support for subqueries in where in clause

This PR adds support for subqueries in the WHERE ... IN clause by adding a dynamic filter class that first computes the values list from the subquery, then builds a hash set from it and uses that as input to the InSet class.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/saucam/spark subquery_where_clause
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3888.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #3888

commit 4019e0d6e0bf31a123f2817eb964562891211635
Author: Yash Datta
Date: 2015-01-04T09:06:55Z
SPARK-4226: Add support for subqueries in where in clause
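The dynamic-filter mechanism described above can be sketched in plain Python (not the PR's Scala `DynamicFilter`/`InSubquery` classes; the function and parameter names here are illustrative): evaluate the subquery first, build a hash set from its result, then use that set as the value list of the IN predicate when scanning the outer table.

```python
# Conceptual sketch of the PR's dynamic-filter idea, in plain Python.

def dynamic_filter(outer_rows, run_subquery, column):
    # Step 1: eagerly evaluate the subquery and materialize its values
    # into a hash set (this is the part the review flags as eager and
    # driver-side).
    value_set = set(run_subquery())
    # Step 2: use the set as an in-set predicate over the outer table.
    return [row for row in outer_rows if row[column] in value_set]
```

For example, a query like `select * from Y1 where Y1.id in (select Y2.id from Y2 where Y2.id < 90)` would pass the outer table's rows as `outer_rows` and a callable evaluating the inner query as `run_subquery`.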