It might be faster to add a column holding the hash result to each dataframe 
before the join, and then simply do a normal join on that column.
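
A rough sketch of what I mean, in case it helps. The key function here 
(fuzzyKey) and the column names match_key, left_id and right_id are made up 
for illustration; fuzzyKey is only a stand-in for whatever hash you end up 
using, and the idea only works if values that should match produce the same key.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object HashJoinSketch {
  // Hypothetical stand-in for the "hash": lower-case the value and drop
  // everything except letters and digits, so near-identical strings collide.
  def fuzzyKey(s: String): String =
    if (s == null) null else s.toLowerCase.replaceAll("[^a-z0-9]", "")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hash-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data, invented for this example.
    val left  = Seq(("Acme Corp.", 1), ("Globex", 2)).toDF("cola", "left_id")
    val right = Seq(("acme corp", 10), ("Initech", 20)).toDF("cola", "right_id")

    val keyUdf = udf(fuzzyKey _)

    // Compute the key once per row on each side, before the join.
    val leftKeyed  = left.withColumn("match_key", keyUdf(col("cola")))
    val rightKeyed = right.withColumn("match_key", keyUdf(col("cola")))

    // Plain equi-join on the precomputed key column.
    val joined = leftKeyed.join(rightKeyed, Seq("match_key"), "inner")
    joined.select("match_key", "left_id", "right_id").show()

    spark.stop()
  }
}
```

Because the join condition is then a plain equality on match_key, Spark can 
plan it like any other equi-join instead of evaluating a UDF for every pair 
of rows.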

> On 22. Jul 2017, at 17:39, Stephen Fletcher <stephen.fletc...@gmail.com> 
> wrote:
> 
> Normally a family of joins (left, right outer, inner) is performed on two 
> dataframes using columns for the comparison, i.e. left("acol") === right("acol"). 
> The comparison operator of the "left" dataframe does something internally 
> and produces a column that I assume is used by the join.
> 
> What I want is to create my own comparison operation (I have a case where I 
> want to use some fuzzy matching between rows, and if they fall within some 
> threshold we allow the join to happen)
> 
> so it would look something like
> 
> left.join(right, my_fuzzy_udf (left("cola"),right("cola")))
> 
> Where my_fuzzy_udf is my defined UDF. My main concern is the column that 
> would have to be output: what would its value be, i.e. what would the function 
> need to return that the UDF subsystem would then turn into a column to be 
> evaluated by the join?
> 
> 
> Thanks in advance for any advice
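
On the return value: as far as I know the join condition is just a Column of 
boolean type, so a UDF whose function returns Boolean should be usable directly 
as the condition. A minimal sketch, where withinThreshold, the threshold value 
and the numeric sample data are assumptions made up for illustration (a real 
fuzzy match would plug in its own similarity measure):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object FuzzyJoinSketch {
  // Assumed threshold, chosen only for this example.
  val threshold = 0.5

  // Hypothetical fuzzy comparison: true when the two values are close enough.
  def withinThreshold(a: Double, b: Double): Boolean =
    math.abs(a - b) <= threshold

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fuzzy-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data, invented for this example.
    val left  = Seq(1.0, 5.0).toDF("cola")
    val right = Seq(1.2, 9.0).toDF("cola")

    // Wrapping a (Double, Double) => Boolean function gives a UDF that,
    // applied to two columns, yields a boolean Column.
    val my_fuzzy_udf = udf(withinThreshold _)

    // The boolean Column is passed as the join expression.
    val joined = left.join(right, my_fuzzy_udf(left("cola"), right("cola")))
    joined.show()

    spark.stop()
  }
}
```

Keep in mind that Spark cannot see inside the UDF, so it has no equality key 
to hash or sort on and will most likely have to evaluate the condition for 
every pair of rows. That is why a precomputed key column can be much faster.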

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
