[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-03-11 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-78385690
  
@saucam if this is going to stay open, mind tagging it with [SQL] so it 
gets sorted properly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-13 Thread saucam
Github user saucam commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-69716627
  
Can we have some kind of hint mechanism in the query itself, in case the user 
knows the subquery is small? Then perhaps we could change the plan accordingly?






[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-07 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-69110615
  
Also, if we use statistics, we would still need to figure out how to make the 
calculation lazy.





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-07 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-69110525
  
The problem with a hybrid approach is: how do you choose between them?  If 
we had good statistics I would agree with you that we should use those to 
decide.  However, we don't at this point.  Given that, I think a safe option 
that may be slower but never OOMs the driver is better than aiming for optimal 
performance in the small data cases.

Also, were you testing with broadcast left semi join enabled and statistics 
for your tables?  Avoiding a shuffle there might speed things up.
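
The decision problem raised here can be sketched in a language-agnostic way. The sketch below is purely illustrative and is not Spark's planner API; the function name and threshold (`plan_in_subquery`, `BROADCAST_THRESHOLD`) are hypothetical:

```python
# Hypothetical sketch: choosing a subquery execution strategy from a size
# estimate, analogous to what a statistics-driven planner might do.
# All names here are illustrative only, not Spark internals.

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, an assumed cutoff

def plan_in_subquery(estimated_subquery_bytes):
    """Pick a strategy for `x IN (subquery)` based on an estimated size.

    Returns "collect-to-driver" (fast, but can OOM the driver if the
    estimate is wrong) or "left-semi-join" (slower, but bounded memory).
    """
    if estimated_subquery_bytes is None:
        # No statistics available: fall back to the safe plan, as argued above.
        return "left-semi-join"
    if estimated_subquery_bytes < BROADCAST_THRESHOLD:
        return "collect-to-driver"
    return "left-semi-join"

print(plan_in_subquery(None))    # no stats -> safe plan
print(plan_in_subquery(4096))    # small estimate -> collect
print(plan_in_subquery(10**9))   # large estimate -> semi join
```

Without statistics every query hits the first branch, which is exactly the point being made: the safe plan wins by default.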





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-06 Thread saucam
Github user saucam commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68853317
  
Hi Michael, 

Thanks for the feedback.

1. Yes, it does not handle correlated queries. It definitely makes more 
sense to convert correlated queries to joins, but for uncorrelated queries, 
I think the join approach is too slow when the tables are large and the user 
is querying a small subset:

e.g., two tables with ~45 million rows each; the subquery returns only 90 rows:
  
select * from Y1 where Y1.id in (select Y2.id from Y2 where Y2.id < 90);

takes about 12 seconds to run with this approach on a single machine 
(--executor-memory 16G --driver-memory 8G).

Following the join approach, the query is rewritten to:

select * from Y1 left semi join (select Y2.id as sqc0 from Y2 where id < 
90) subquery on Y1.id = subquery.sqc0;

which takes 660 seconds to run on the same machine.

2. This approach can handle arbitrary nesting of subqueries:

select * from Y1 where Y1.id in (select Y2.id where Y2.timestamp in (select 
Y3.timestamp limit 20))

Can we take some hybrid approach combining the two?
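
For intuition only, the two strategies being compared can be mimicked with plain Python collections. This is not Spark code; `Y1` and `Y2` are toy stand-ins for the tables above, and the asymptotic costs (hash-set probe vs. per-row scan) stand in for the distributed hash lookup vs. shuffle join:

```python
# Illustrative only: the IN-subquery strategy evaluates the subquery once,
# builds a hash set, and probes it per outer row; the naive semi-join
# strategy matches each outer row against the subquery's rows.
Y1 = [{"id": i} for i in range(200)]
Y2 = [{"id": i} for i in range(0, 400, 2)]  # even ids only

# Strategy 1: subquery -> hash set -> O(1) membership test (the "InSet" idea).
subquery_values = {row["id"] for row in Y2 if row["id"] < 90}
in_set_result = [row for row in Y1 if row["id"] in subquery_values]

# Strategy 2: left semi join on Y1.id = subquery.sqc0, here as a per-row scan.
subquery = [{"sqc0": row["id"]} for row in Y2 if row["id"] < 90]
semi_join_result = [
    row for row in Y1
    if any(row["id"] == s["sqc0"] for s in subquery)
]

assert in_set_result == semi_join_result  # same answer, different cost profile
```

Both strategies return the same rows; the performance gap in the numbers above comes from how the match is executed, not from what it computes.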





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-05 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68833871
  
This is simpler, but it has several disadvantages to the other approach:
 - The InSet is collected to the driver and thus could cause OOMs when large
 - I don't think it handles correlated subqueries
 - The `execute()` involves eager evaluation and breaks RDD lineage

For these reasons I think we should stick to extending the approach taken 
by the other PR.





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68672348
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25046/
Test PASSed.





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68672345
  
  [Test build #25046 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25046/consoleFull)
 for   PR 3888 at commit 
[`4019e0d`](https://github.com/apache/spark/commit/4019e0d6e0bf31a123f2817eb964562891211635).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class InSubquery(value: Expression) extends Predicate `
  * `case class DynamicFilter(condition: Expression, left: LogicalPlan, 
right: LogicalPlan)`
  * `case class DynamicFilter(condition: Expression, left: SparkPlan, 
right: SparkPlan)`






[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68669185
  
  [Test build #25046 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25046/consoleFull)
 for   PR 3888 at commit 
[`4019e0d`](https://github.com/apache/spark/commit/4019e0d6e0bf31a123f2817eb964562891211635).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68669153
  
Jenkins, test this please.





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68626506
  
Can one of the admins verify this patch?





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread saucam
Github user saucam commented on the pull request:

https://github.com/apache/spark/pull/3888#issuecomment-68626500
  
Hi @marmbrus, can you please take a look and suggest changes? I have tested 
a few queries, and this approach looks simpler than the already existing PR.





[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...

2015-01-04 Thread saucam
GitHub user saucam opened a pull request:

https://github.com/apache/spark/pull/3888

SPARK-4226: Add support for subqueries in where in clause

This PR adds support for subqueries in the WHERE ... IN clause by adding a 
dynamic filter class that first computes the values list from the subquery, 
then builds a hash set from it and uses that as the input to the InSet class.
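
The mechanism described can be sketched with plain Python. The `dynamic_filter` function below is an illustrative mock of the idea, not the Scala `DynamicFilter` class in the patch:

```python
# Illustrative mock of the dynamic-filter idea: run the subquery plan first,
# collect its values into a hash set, then use that set as the input to an
# InSet-style predicate over the outer relation. Names are hypothetical.

def dynamic_filter(outer_rows, subquery_rows, key):
    # Step 1: compute the values list from the subquery (eager evaluation).
    values = {row[key] for row in subquery_rows}
    # Step 2: filter the outer relation with a hash-set membership test.
    return [row for row in outer_rows if row[key] in values]

outer = [{"id": 1}, {"id": 2}, {"id": 3}]
inner = [{"id": 2}, {"id": 3}, {"id": 4}]
print(dynamic_filter(outer, inner, "id"))  # keeps the rows whose id is in inner
```

Note that step 1 materializes the subquery's values up front, which is the eager-evaluation and driver-memory concern raised in the review comments.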
  

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/saucam/spark subquery_where_clause

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3888.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3888


commit 4019e0d6e0bf31a123f2817eb964562891211635
Author: Yash Datta 
Date:   2015-01-04T09:06:55Z

SPARK-4226: Add support for subqueries in where in clause



