The Dataset.join(right: Dataset[_], joinExprs: Column) API accepts an
arbitrary boolean expression, so you can use a UDF as the join condition.
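As a minimal sketch (toy data; equalsIgnoreCase stands in for a real
fuzzy-similarity check, and the names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-join").getOrCreate()
import spark.implicits._

val left  = Seq("kitten", "sitting").toDF("cola")
val right = Seq("kitten", "MITTEN").toDF("cola")

// The UDF must evaluate to Boolean: true means the row pair joins.
// equalsIgnoreCase is a toy stand-in for a real fuzzy/threshold match.
val myFuzzyUdf = udf { (a: String, b: String) => a.equalsIgnoreCase(b) }

val joined = left.join(right, myFuzzyUdf(left("cola"), right("cola")))
joined.show()
```

Note the UDF is opaque to the optimizer, which is exactly why the planner
cannot pick an efficient join strategy for it.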
The problem with all non-equi joins is that they fall back to
BroadcastNestedLoopJoin or an equivalent, i.e. an (M x N) nested loop,
which is unusable for medium/large tables. At least one of the tables
must be small for this to perform acceptably. For example, if one table
has 100M rows after filtering and the other 1M, the nested-loop join has
to evaluate 100 trillion row pairs, which will take very long under
normal circumstances; but if one side is much smaller after filtering,
say a few thousand rows, it can be fine.
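You can see this fallback in the physical plan. A small self-contained
check (toy tables and a deliberately non-equi predicate, chosen only for
illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val big   = spark.range(1000).toDF("id")
val small = spark.range(10).toDF("id2")

// A non-equi predicate: Spark cannot extract hashable/sortable join keys
// from it, so the planner falls back to BroadcastNestedLoopJoin
// (broadcasting the small side) or, with no broadcastable side, a
// cartesian product.
val joined = big.join(small, big("id") % 7 > small("id2"))
val plan   = joined.queryExecution.executedPlan.toString
println(plan) // expect to see BroadcastNestedLoopJoin here
```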
What you probably need for large tables is to implement your own
optimized join operator, using a join structure that can perform the
join efficiently without nested loops (i.e. some specialized structure
for efficient fuzzy joins, such as locality-sensitive hashing). It is
possible to do that using internal Spark APIs, but it is not easy, and
you have to implement an efficient join structure first. Alternatively,
an existing library might work for you (e.g.
https://github.com/soundcloud/cosine-lsh-join-spark?).
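For Jaccard-style similarity, Spark ML's built-in MinHashLSH offers an
approxSimilarityJoin that avoids the full M x N scan. A sketch with toy
sparse binary vectors; the threshold 0.8 and the column names are
placeholder choices:

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (2, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (3, 1.0)))),
  (3, Vectors.sparse(6, Seq((3, 1.0), (4, 1.0), (5, 1.0))))
)).toDF("id", "features")

val mh = new MinHashLSH()
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")
val model = mh.fit(dfA)

// Candidate pairs whose approximate Jaccard distance is below the
// threshold; LSH bucketing prunes most pairs before comparison.
val matches = model.approxSimilarityJoin(dfA, dfB, 0.8, "jaccardDist")
matches.show()
```

The result is approximate (pairs can be missed depending on the number
of hash tables), which is the usual trade-off for sub-quadratic fuzzy
joins.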
--
Sumedh Wale
SnappyData (http://www.snappydata.io)
On Saturday 22 July 2017 09:09 PM, Stephen Fletcher wrote:
Normally a family of joins (left, right outer, inner) is performed on
two dataframes using columns for the comparison, i.e. left("acol") ===
right("acol"). The comparison operator of the "left" dataframe does
something internally and produces a Column that I assume is used by
the join.
What I want is to create my own comparison operation (I have a case
where I want to use some fuzzy matching between rows, and if they fall
within some threshold we allow the join to happen),
so it would look something like
left.join(right, my_fuzzy_udf(left("cola"), right("cola")))
where my_fuzzy_udf is a UDF I define. My main concern is the output
column: what would its value be, i.e. what would the function need to
return that the UDF subsystem would then turn into a Column to be
evaluated by the join?
Thanks in advance for any advice