I agree that there won't be a generic solution for these kinds of cases.
Without a CBO (cost-based optimizer) in Spark or the Hadoop ecosystem in the
near future, maybe Spark DataFrame/SQL should support more hints from the end
user, since in these cases end users will be smart enough to tell the engine
the correct way to do it.
Didn't the relational DBs follow exactly the same path? RBO -> RBO + Hints -> CBO?
Yong

Date: Thu, 31 Mar 2016 16:07:14 +0530
Subject: Re: SPARK-13900 - Join with simple OR conditions take too long
From: hemant9...@gmail.com
To: ashokkumar.rajend...@gmail.com
CC: user@spark.apache.org

Hi Ashok,

That's interesting. 

As I understand it, a nested loop join (which produces m x n rows) is performed
on tables A and B, and then each row is evaluated to see if any of the
conditions is met. You are asking that Spark should instead do a
BroadcastHashJoin on each of the equality conditions in parallel and then union
the results, as you are doing in a different query.
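
To make the two strategies concrete, here is a minimal plain-Python sketch (not
Spark code, and not what Spark's planner actually does). Rows are dicts, and the
function names, table contents and column names are all made up for
illustration. It assumes the join keys contain no nulls, and it de-duplicates
the union by row position so that a pair matching several conditions appears
only once, as it would in the OR join.

```python
from collections import defaultdict


def nested_loop_or_join(left, right, key_pairs):
    """Visit all m x n row pairs and keep those where any equality holds."""
    return [
        (l, r)
        for l in left
        for r in right
        if any(l[lk] == r[rk] for lk, rk in key_pairs)
    ]


def union_of_hash_joins(left, right, key_pairs):
    """Run one hash join per equality condition, then union without duplicates."""
    matched = set()  # (left index, right index) pairs found so far
    for lk, rk in key_pairs:
        # Build a hash table on the right side for this condition's key.
        index = defaultdict(list)
        for j, r in enumerate(right):
            index[r[rk]].append(j)
        # Probe the hash table with every left row.
        for i, l in enumerate(left):
            for j in index.get(l[lk], ()):
                matched.add((i, j))
    # Sorting by position reproduces the nested loop's output order.
    return [(left[i], right[j]) for i, j in sorted(matched)]


# Hypothetical data: join A and B on (A.a = B.x OR A.b = B.y).
A = [{"id": 1, "a": 10, "b": 20}, {"id": 2, "a": 11, "b": 21}, {"id": 3, "a": 12, "b": 22}]
B = [{"x": 10, "y": 20}, {"x": 99, "y": 21}, {"x": 12, "y": 0}]
conditions = [("a", "x"), ("b", "y")]

nested_result = nested_loop_or_join(A, B, conditions)
union_result = union_of_hash_joins(A, B, conditions)
```

Both functions return the same pairs here. The first A row matches the first B
row on both conditions, which is why the union has to drop duplicates.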

If we leave aside parallelism for a moment, theoretically, the time taken for a
nested loop join would vary little as the number of conditions increases,
while the time taken for the solution that you are suggesting would increase
linearly with the number of conditions. So, when there are too many conditions,
the nested loop join would be faster than the solution that you suggest. Now the
question is, how should Spark decide when to do what?
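
A rough way to picture that trade-off is the toy cost model below, in plain
Python. The formulas are simplifying assumptions for illustration only, not
Spark's actual costing: the nested loop is charged one unit per row pair, and
each hash join is charged one build pass plus one probe pass. Output size,
broadcast and shuffle costs, and the final union are all ignored.

```python
def nested_loop_cost(m, n, k):
    # Assumption: cost is dominated by visiting all m * n pairs, so the
    # number of conditions k is ignored.
    return m * n


def union_hash_cost(m, n, k):
    # Assumption: each of the k conditions needs one build pass over n rows
    # and one probe pass over m rows.
    return k * (m + n)


def break_even_conditions(m, n):
    # Smallest k at which the union of hash joins costs at least as much as
    # the nested loop under this model (ceiling of m * n / (m + n)).
    return -(-(m * n) // (m + n))


# Hypothetical table sizes.
m, n = 1000, 100
k_star = break_even_conditions(m, n)
```

Under these assumptions the union of hash joins stays cheaper until the number
of conditions reaches k_star. A real optimizer would need table statistics to
make this call, which is the point of the question above.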

Hemant Bhanawat
www.snappydata.io 


On Thu, Mar 31, 2016 at 2:28 PM, ashokkumar rajendran 
<ashokkumar.rajend...@gmail.com> wrote:
Hi,

I have filed ticket SPARK-13900. There was an initial reply from a developer,
but I did not get any further reply on it. How can we do multiple hash joins
together for joins based on OR conditions? Could someone please guide us on how
we can fix this?
Regards
Ashok


                                          
