Re: Correlated subqueries in the DataFrame API

2018-04-27 Thread Nicholas Chammas
What about exposing transforms that make it easy to coerce data to what the method needs? Instead of passing a dataframe, you’d pass df.toSet to isin. Assuming toSet returns a local list, wouldn’t that have the problem of not being able to handle extremely large lists? In contrast, I believe SQL’s
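
The trade-off raised in this message (a locally collected list of keys versus a membership test evaluated inside the engine) can be sketched outside Spark. The example below is only an illustration under stated assumptions: it uses Python's standard-library sqlite3 as a stand-in for the query engine, and the tables `orders` and `vip` are hypothetical. It does not show the Spark DataFrame API or any proposed change to it.

```python
import sqlite3

# In-memory database standing in for the query engine (assumption: not Spark).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE vip (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 30), (4, 10);
    INSERT INTO vip VALUES (10), (30);
""")

# Approach 1: collect the keys into a local set, then filter on the client.
# Every key must fit in local memory, which is the concern with huge lists.
local_keys = {row[0] for row in conn.execute("SELECT customer_id FROM vip")}
via_local = sorted(
    order_id
    for order_id, customer_id in conn.execute("SELECT id, customer_id FROM orders")
    if customer_id in local_keys
)

# Approach 2: express the membership test as an IN subquery, so the keys
# never leave the engine and it is free to choose how to evaluate the test.
via_subquery = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM orders "
        "WHERE customer_id IN (SELECT customer_id FROM vip) "
        "ORDER BY id"
    )
]

print(via_local, via_subquery)
```

Both approaches select the same rows here; they differ in where the set of keys has to live while the filter runs.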

Re: Correlated subqueries in the DataFrame API

2018-04-19 Thread Reynold Xin
Perhaps we can just have a function that turns a DataFrame into a Column? That'd work for both the correlated and uncorrelated cases, although in the correlated case we'd need to turn off eager analysis (otherwise there is no way to construct a valid DataFrame). On Thu, Apr 19, 2018 at 4:08 PM, Ryan B

Re: Correlated subqueries in the DataFrame API

2018-04-19 Thread Ryan Blue
Nick, thanks for raising this. It looks useful to have something in the DF API that behaves like sub-queries, but I’m not sure that passing a DF works. Making every method accept a DF that may contain matching data seems like it puts a lot of work on the API — which now has to accept a DF all over

Correlated subqueries in the DataFrame API

2018-04-09 Thread Nicholas Chammas
I just submitted SPARK-23945 but wanted to double-check here to make sure I didn't miss something fundamental. Correlated subqueries are tracked at a high level in SPARK-18455, but it's not clea
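
For readers unfamiliar with the term, a correlated subquery is one whose inner query refers to a column of the outer query's current row. The sketch below is a minimal illustration in plain SQL, run through Python's standard-library sqlite3 rather than Spark. The `employees` table and its columns are hypothetical, and nothing here reflects the API discussed in SPARK-23945.

```python
import sqlite3

# In-memory database used purely for illustration (assumption: not Spark).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ann', 'eng', 120), ('bob', 'eng', 100),
        ('cat', 'ops', 90),  ('dan', 'ops', 70);
""")

# The inner query references e.dept from the outer row, which is what makes
# it correlated: conceptually it is evaluated once per outer row.
above_dept_avg = [
    row[0]
    for row in conn.execute("""
        SELECT e.name
        FROM employees AS e
        WHERE e.salary > (
            SELECT AVG(i.salary) FROM employees AS i WHERE i.dept = e.dept
        )
        ORDER BY e.name
    """)
]

print(above_dept_avg)
```

The query keeps each employee whose salary is above the average of their own department, something an uncorrelated subquery cannot express without an explicit join or grouping step.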