It might be faster if you add the column with the hash result before the join
to the dataframe and then do simply a normal join on that column
> On 22. Jul 2017, at 17:39, Stephen Fletcher
> wrote:
>
> Normally a family of joins (left, right outter, inner) are
>
> left.join(right, my_fuzzy_udf (left("cola"),right("cola")))
>
While this could work, the problem will be that we'll have to check every
possible combination of tuples from left and right using your UDF. It
would be best if you could somehow partition the problem so that we could
reduce the
The Dataset.join(right: Dataset[_], joinExprs: Column) API can use any
arbitrary expression so you can use UDF for join.
The problem with all non-equality joins is that they use
BroadcastNestedLoopJoin or equivalent, that is an (M X N) nested-loop
which will be unusable for medium/large
Normally a family of joins (left, right outter, inner) are performed on two
dataframes using columns for the comparison ie left("acol") ===
ight("acol") . the comparison operator of the "left" dataframe does
something internally and produces a column that i assume is used by the
join.
What I want