The Dataset.join(right: Dataset[_], joinExprs: Column) API accepts an
arbitrary boolean expression, so you can use a UDF as the join condition.
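As a minimal sketch (toy data; equalsIgnoreCase stands in for a real
fuzzy-similarity check, and the names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-join").getOrCreate()
import spark.implicits._

val left  = Seq("kitten", "sitting").toDF("cola")
val right = Seq("kitten", "MITTEN").toDF("cola")

// The UDF must evaluate to Boolean: true means the row pair joins.
// equalsIgnoreCase is a toy stand-in for a real fuzzy/threshold match.
val myFuzzyUdf = udf { (a: String, b: String) => a.equalsIgnoreCase(b) }

val joined = left.join(right, myFuzzyUdf(left("cola"), right("cola")))
joined.show()
```

Note the UDF is opaque to the optimizer, which is exactly why the planner
cannot pick an efficient join strategy for it.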
The problem with all non-equi joins is that they fall back to
BroadcastNestedLoopJoin or an equivalent, i.e. an (M x N) nested loop,
which is unusable for medium/large tables. At least one of the tables
must be small for this to perform acceptably. For example, if one table
has 100M rows after filtering and the other 1M, the nested-loop join has
to evaluate 100 trillion row pairs, which will take very long under
normal circumstances; but if one side is much smaller after filtering,
say a few thousand rows, it can be fine.
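You can see this fallback in the physical plan. A small self-contained
check (toy tables and a deliberately non-equi predicate, chosen only for
illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val big   = spark.range(1000).toDF("id")
val small = spark.range(10).toDF("id2")

// A non-equi predicate: Spark cannot extract hashable/sortable join keys
// from it, so the planner falls back to BroadcastNestedLoopJoin
// (broadcasting the small side) or, with no broadcastable side, a
// cartesian product.
val joined = big.join(small, big("id") % 7 > small("id2"))
val plan   = joined.queryExecution.executedPlan.toString
println(plan) // expect to see BroadcastNestedLoopJoin here
```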
What you probably need for large tables is to implement your own
optimized join operator, using a join structure that can perform the
join efficiently without nested loops (i.e. some specialized structure
for efficient fuzzy joins, such as locality-sensitive hashing). It is
possible to do that using internal Spark APIs, but it is not easy, and
you have to implement an efficient join structure first. Alternatively,
an existing library might work for you (e.g.
https://github.com/soundcloud/cosine-lsh-join-spark?).
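For Jaccard-style similarity, Spark ML's built-in MinHashLSH offers an
approxSimilarityJoin that avoids the full M x N scan. A sketch with toy
sparse binary vectors; the threshold 0.8 and the column names are
placeholder choices:

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (2, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (3, 1.0)))),
  (3, Vectors.sparse(6, Seq((3, 1.0), (4, 1.0), (5, 1.0))))
)).toDF("id", "features")

val mh = new MinHashLSH()
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")
val model = mh.fit(dfA)

// Candidate pairs whose approximate Jaccard distance is below the
// threshold; LSH bucketing prunes most pairs before comparison.
val matches = model.approxSimilarityJoin(dfA, dfB, 0.8, "jaccardDist")
matches.show()
```

The result is approximate (pairs can be missed depending on the number
of hash tables), which is the usual trade-off for sub-quadratic fuzzy
joins.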
--
Sumedh Wale
SnappyData (http://www.snappydata.io)
On Saturday 22 July 2017 09:09 PM, Stephen Fletcher wrote:
Normally a family of joins (left, right outer, inner) is performed on
two dataframes using columns for the comparison, i.e. left("acol") ===
right("acol"). The comparison operator of the "left" dataframe does
something internally and produces a Column that I assume is used by
the join.
What I want is to create my own comparison operation (I have a case
where I want to use some fuzzy matching between rows, and if they fall
within some threshold we allow the join to happen),
so it would look something like
left.join(right, my_fuzzy_udf(left("cola"), right("cola")))
where my_fuzzy_udf is a UDF I define. My main concern is the output
column: what would its value be, i.e. what would the function need to
return that the UDF subsystem would then turn into a Column to be
evaluated by the join?
Thanks in advance for any advice