Why would changing the order of the join make such a big difference?

I will try the repartition, although it does not make sense to me why 
repartitioning should help, since the data itself is so small!

Regards,
Raghu

> On Jan 27, 2016, at 20:08, Cheng, Hao <hao.ch...@intel.com> wrote:
> 
> Another possibility is the parallelism? It is probably 1 or some other 
> small value, since the input data size is not that big.
>  
> If that is the case, you can probably try something like:
>  
> df1.repartition(10).registerTempTable("hospitals");
> df2.repartition(10).registerTempTable("counties");
> …
> And then do the join.
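> The effect can be sketched with a toy cost model (pure Python, not Spark; the 
> row counts come from this thread, and the check-every-pair assumption reflects 
> that a join condition wrapped in an opaque UDF cannot become a hash join):

```python
# Toy cost model (pure Python, not Spark): a join whose condition is an
# opaque UDF cannot be turned into a hash join, so in the worst case
# every (county, hospital) pair is checked.
counties = 2000      # rows in the full counties table
hospitals = 6500     # rows in the hospitals table (~6.5k points)
predicate_calls = counties * hospitals

# With parallelism 1, a single task performs every check serially;
# after repartition(10), roughly 10 tasks share the same work.
calls_per_task_serial = predicate_calls
calls_per_task_repartitioned = predicate_calls // 10
print(predicate_calls, calls_per_task_repartitioned)
```

> So even with "small" input data, the pair count is large, and one partition 
> means one task grinding through all of it serially.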
>  
>  
> From: Raghu Ganti [mailto:raghuki...@gmail.com] 
> Sent: Thursday, January 28, 2016 3:06 AM
> To: Ted Yu; Дмитро Попович
> Cc: user
> Subject: Re: Spark SQL joins taking too long
>  
> The problem is with the way the Spark query plan is being created, IMO. What 
> was happening before is that the order of the tables mattered: when the larger 
> table was given first, the query took a very long time (~53 min to complete). I 
> changed the order of the tables to put the smaller one first (including 
> replacing the one-row table with the full one) and 
> modified the query to look like this:
> 
> SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess' 
> AND ST_Intersects(c.shape, h.location)
> 
> With the above query, things worked like a charm (<1 min to finish the entire 
> execution, joining 3141 polygons with 6.5k points).
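> For what it's worth, the speedup is consistent with the filter being applied 
> before the expensive predicate: restricting counties to 'Dutchess' first cuts 
> the UDF evaluations from |counties| x |hospitals| down to 1 x |hospitals|. A 
> minimal sketch of that effect (pure Python with made-up, scaled-down stand-ins 
> for the two tables and for ST_Intersects; not Spark code, and contrived so the 
> 8 matching hospitals mirror the count reported in the original message):

```python
# Hypothetical, scaled-down stand-ins for the two tables; the geometries
# are fake integers, much smaller than the real 3141 x 6.5k data.
counties = [{"NAME": f"county{i}", "shape": i} for i in range(99)]
counties.append({"NAME": "Dutchess", "shape": -1})          # 100 rows total
hospitals = [{"name": f"h{i}", "location": -1 if i < 8 else i}
             for i in range(650)]                           # 650 rows

def intersects(shape, location):
    # Stand-in for ST_Intersects: just compares the fake geometries.
    return shape == location

# Join first, filter later: the predicate runs on every row pair.
calls_join_first, matches_join_first = 0, []
for c in counties:
    for h in hospitals:
        calls_join_first += 1
        if intersects(c["shape"], h["location"]) and c["NAME"] == "Dutchess":
            matches_join_first.append(h["name"])

# Filter first (what the rewritten query allows): far fewer predicate calls.
calls_filter_first, matches_filter_first = 0, []
for c in [c for c in counties if c["NAME"] == "Dutchess"]:
    for h in hospitals:
        calls_filter_first += 1
        if intersects(c["shape"], h["location"]):
            matches_filter_first.append(h["name"])

print(calls_join_first, calls_filter_first, len(matches_filter_first))
```

> Same result either way, but two orders of magnitude fewer predicate calls when 
> the filter goes first.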
> 
> Do let me know if you need more info in order to pinpoint the issue.
> 
> Regards,
> Raghu
>  
> On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> What's the type of shape column ?
>  
> Can you disclose what SomeUDF does (by showing the code) ?
>  
> Cheers
>  
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran <raghuki...@gmail.com> wrote:
> Hi,
> 
> I create two tables: counties, with just one row (it actually has 2k
> rows, but I used only one), and hospitals, which has 6k rows. The
> join I use is as follows; it takes way too long to run and has
> never finished successfully (even after nearly 10 min). The following is
> what I have:
> 
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); //has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
> counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); //this takes too long!
> 
> // whereas the following, which is the exact equivalent of the above, finishes
> // very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE
> SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); // gives me the correct answer of 8
> 
> Any suggestions on what I can do to optimize and debug this piece of code?
> 
> Regards,
> Raghu
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
>  
>  
