It might be faster to add a column holding the hash result to each dataframe before the join, and then simply do a normal join on that column.
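
A rough sketch of what I mean, assuming the fuzzy match can be reduced to equality on some precomputed key. Spark's built-in soundex function stands in for "the hash" here purely for illustration (your matching logic will likely need something else), and the names match_key, left_cola, right_cola and joinOnKey are made up for this example:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, soundex}

object PrecomputedKeyJoin {

  // Adds a key column to each side, then joins on plain equality of that key.
  // Assumption: both inputs have a string column named "cola".
  def joinOnKey(left: DataFrame, right: DataFrame): DataFrame = {
    // soundex is only a stand-in for whatever hash suits the fuzzy match
    val leftKeyed  = left.withColumn("match_key", soundex(col("cola")))
    val rightKeyed = right.withColumn("match_key", soundex(col("cola")))

    leftKeyed.as("l")
      .join(rightKeyed.as("r"), col("l.match_key") === col("r.match_key"))
      .select(col("l.cola").as("left_cola"), col("r.cola").as("right_cola"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("precomputed-key-join")
      .getOrCreate()
    import spark.implicits._

    val left  = Seq("Robert", "Alice").toDF("cola")
    val right = Seq("Rupert", "Bob").toDF("cola")

    // Rows pair up only where the precomputed keys are equal
    joinOnKey(left, right).show()

    spark.stop()
  }
}
```

Because the join condition is a plain equality on a column, Spark can treat it as an ordinary equi-join rather than evaluating a UDF for every pair of rows. The trade-off is that this only works if the threshold-based match can be expressed as "same key"; if not, the UDF in the join condition would need to return a Boolean.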

> On 22. Jul 2017, at 17:39, Stephen Fletcher <stephen.fletc...@gmail.com> wrote:
>
> Normally a family of joins (left, right outer, inner) is performed on two
> dataframes using columns for the comparison, i.e. left("acol") === right("acol").
> The comparison operator of the "left" dataframe does something internally
> and produces a column that I assume is used by the join.
>
> What I want is to create my own comparison operation (I have a case where I
> want to use some fuzzy matching between rows, and if they fall within some
> threshold we allow the join to happen),
>
> so it would look something like
>
> left.join(right, my_fuzzy_udf(left("cola"), right("cola")))
>
> where my_fuzzy_udf is my defined UDF. My main concern is the column that
> would have to be output: what would its value be, i.e. what would the function
> need to return that the UDF subsystem would then turn into a column to be
> evaluated by the join?
>
> Thanks in advance for any advice