Re: custom joins on dataframe

2017-07-24 Thread Jörn Franke
It might be faster if you add the column with the hash result before the join to the dataframe and then do simply a normal join on that column > On 22. Jul 2017, at 17:39, Stephen Fletcher > wrote: > > Normally a family of joins (left, right outter, inner) are

Re: custom joins on dataframe

2017-07-23 Thread Michael Armbrust
> > left.join(right, my_fuzzy_udf (left("cola"),right("cola"))) > While this could work, the problem will be that we'll have to check every possible combination of tuples from left and right using your UDF. It would be best if you could somehow partition the problem so that we could reduce the

Re: custom joins on dataframe

2017-07-22 Thread Sumedh Wale
The Dataset.join(right: Dataset[_], joinExprs: Column) API can use any arbitrary expression so you can use UDF for join. The problem with all non-equality joins is that they use BroadcastNestedLoopJoin or equivalent, that is an (M X N) nested-loop which will be unusable for medium/large

custom joins on dataframe

2017-07-22 Thread Stephen Fletcher
Normally a family of joins (left, right outter, inner) are performed on two dataframes using columns for the comparison ie left("acol") === ight("acol") . the comparison operator of the "left" dataframe does something internally and produces a column that i assume is used by the join. What I want