Re: Spark SQL joins taking too long

2016-01-27 Thread Raghu Ganti
Why would changing the order of the join make such a big difference?

I will try the repartition, although, it does not make sense to me why 
repartitioning should help, since the data itself is so small!

Regards,
Raghu

> On Jan 27, 2016, at 20:08, Cheng, Hao  wrote:
> 
> Another possibility is the parallelism? It may be 1 or some other small 
> value, since the input data size is not that big.
>  
> If that is the case, you can try something like:
>  
> Df1.repartition(10).registerTempTable("hospitals");
> Df2.repartition(10).registerTempTable("counties");
> …
> And then do the join.
>  
>  
> From: Raghu Ganti [mailto:raghuki...@gmail.com] 
> Sent: Thursday, January 28, 2016 3:06 AM
> To: Ted Yu; Дмитро Попович
> Cc: user
> Subject: Re: Spark SQL joins taking too long
>  
> The problem, IMO, is with the way the Spark query plan is being created: 
> the order of the tables mattered, and when the larger table was given 
> first, the query took a very long time (~53 mins to complete). I changed 
> the order of the tables to put the smaller one first (and also replaced 
> the one-row counties table with the full one) and modified the query to 
> look like this:
> 
> SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess' 
> AND ST_Intersects(c.shape, h.location)
> 
> With the above query, things worked like a charm (<1min to finish the entire 
> execution and join on 3141 polygons with 6.5k points).
> 
> Do let me know if you need more info in order to pinpoint the issue.
> 
> Regards,
> Raghu
>  
> On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu  wrote:
> What's the type of shape column ?
>  
> Can you disclose what SomeUDF does (by showing the code) ?
>  
> Cheers
>  
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:
> Hi,
> 
> I create two tables, one counties with just one row (it actually has 2k
> rows, but I used only one) and another hospitals, which has 6k rows. The
> join command I use is as follows, which takes way too long to run and has
> never finished successfully (even after nearly 10mins). The following is
> what I have:
> 
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); //has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
> counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); //this takes too long!
> 
> //whereas the following which is the exact equivalent of the above gets done
> very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE
> SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); //gives me the correct answer of 8
> 
> Any suggestions on what I can do to optimize and debug this piece of code?
> 
> Regards,
> Raghu
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
>  
>  


RE: Spark SQL joins taking too long

2016-01-27 Thread Cheng, Hao
Another possibility is the parallelism? It may be 1 or some other small 
value, since the input data size is not that big.

If that is the case, you can try something like:

Df1.repartition(10).registerTempTable("hospitals");
Df2.repartition(10).registerTempTable("counties");
…
And then do the join.
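
To illustrate the suggestion above: if the plan evaluates the UDF over every row pair, the number of partitions determines how many pairs a single task must scan serially. A small stand-alone Java sketch of that arithmetic (the 3141-polygon and ~6.5k-point row counts are the figures quoted in this thread; this is hypothetical illustration code, not Spark code):

```java
// Stand-alone sketch: how many row pairs the busiest task has to scan,
// as a function of the number of partitions.
public class ParallelismSketch {
    static long pairsPerTask(long pairs, int partitions) {
        // ceiling division: the largest share under an even split
        return (pairs + partitions - 1) / partitions;
    }

    public static void main(String[] args) {
        long pairs = 3141L * 6500L; // pairs a UDF-only (nested-loop) join must test
        System.out.println(pairsPerTask(pairs, 1));  // parallelism 1: one task does all the work
        System.out.println(pairsPerTask(pairs, 10)); // after repartition(10)
    }
}
```

With a single partition one task evaluates all ~20.4M pairs; repartition(10) spreads them over ten concurrent tasks.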


From: Raghu Ganti [mailto:raghuki...@gmail.com]
Sent: Thursday, January 28, 2016 3:06 AM
To: Ted Yu; Дмитро Попович
Cc: user
Subject: Re: Spark SQL joins taking too long

The problem, IMO, is with the way the Spark query plan is being created: 
the order of the tables mattered, and when the larger table was given 
first, the query took a very long time (~53 mins to complete). I changed 
the order of the tables to put the smaller one first (and also replaced 
the one-row counties table with the full one) and modified the query to 
look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME 
= 'Dutchess' AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1 min to finish the entire 
execution and join 3141 polygons with 6.5k points).

Do let me know if you need more info in order to pinpoint the issue.

Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:
What's the type of shape column ?

Can you disclose what SomeUDF does (by showing the code) ?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran <raghuki...@gmail.com> wrote:
Hi,

I create two tables, one counties with just one row (it actually has 2k
rows, but I used only one) and another hospitals, which has 6k rows. The
join command I use is as follows, which takes way too long to run and has
never finished successfully (even after nearly 10mins). The following is
what I have:

DataFrame df1 = ...
df1.registerTempTable("hospitals");
DataFrame df2 = ...
df2.registerTempTable("counties"); //has only one row right now
DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
counties c ON SomeUDF(c.shape, h.location)");
long count = joinDf.count(); //this takes too long!

//whereas the following which is the exact equivalent of the above gets done
very quickly!
DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE
SomeUDF('c.shape as string', h.location)");
long count = joinDf.count(); //gives me the correct answer of 8

Any suggestions on what I can do to optimize and debug this piece of code?

Regards,
Raghu







Re: Spark SQL joins taking too long

2016-01-27 Thread Raghu Ganti
The problem, IMO, is with the way the Spark query plan is being created:
the order of the tables mattered, and when the larger table was given
first, the query took a very long time (~53 mins to complete). I changed
the order of the tables to put the smaller one first (and also replaced
the one-row counties table with the full one) and modified the query to
look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME =
'Dutchess' AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1min to finish the
entire execution and join on 3141 polygons with 6.5k points).
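
This behavior is consistent with the ON clause being a bare UDF call: with no equality condition, Spark cannot plan a hash or sort-merge equi-join and likely falls back to a nested-loop (cartesian) evaluation, running the UDF once per row pair. A stand-alone Java sketch of the arithmetic (row counts taken from this thread; hypothetical illustration code, not Spark code):

```java
// Back-of-the-envelope cost of a UDF-only join: the predicate must be
// evaluated on every (county, hospital) pair, so filtering the county
// side down to one row first shrinks the work by several orders of
// magnitude.
public class JoinCostSketch {
    static long udfCalls(long leftRows, long rightRows) {
        return leftRows * rightRows; // a nested-loop join tests every pair
    }

    public static void main(String[] args) {
        System.out.println(udfCalls(3141, 6500)); // all counties vs all hospitals
        System.out.println(udfCalls(1, 6500));    // after WHERE c.NAME = 'Dutchess'
    }
}
```

Filtering one side first cuts the pair count from ~20.4M to 6.5k, which is in line with the ~53 min vs <1 min difference reported here.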

Do let me know if you need more info in order to pinpoint the issue.

Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu  wrote:

> What's the type of shape column ?
>
> Can you disclose what SomeUDF does (by showing the code) ?
>
> Cheers
>
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:
>
>> Hi,
>>
>> I create two tables, one counties with just one row (it actually has 2k
>> rows, but I used only one) and another hospitals, which has 6k rows. The
>> join command I use is as follows, which takes way too long to run and has
>> never finished successfully (even after nearly 10mins). The following is
>> what I have:
>>
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); //has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h
>> JOIN
>> counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); //this takes too long!
>>
>> //whereas the following which is the exact equivalent of the above gets
>> done
>> very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE
>> SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); //gives me the correct answer of 8
>>
>> Any suggestions on what I can do to optimize and debug this piece of code?
>>
>> Regards,
>> Raghu
>>
>>
>>
>>
>>
>


Re: Spark SQL joins taking too long

2016-01-26 Thread Raghu Ganti
Yes, SomeUDF is Contains; shape is a UDT that maps a custom geometry type 
to the SQL binary type.

The custom geometry type is a Java class. Please let me know if you need 
further info.

Regards
Raghu
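
For context, here is a minimal stand-alone sketch of what a point-in-polygon Contains predicate computes, using the standard ray-casting test. This is a hypothetical stand-in for illustration, not the actual UDT/UDF code from this thread:

```java
// Ray-casting point-in-polygon test: cast a horizontal ray from the
// query point and count how many polygon edges it crosses; an odd count
// means the point is inside.
public class ContainsSketch {
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            // Edge (j -> i) straddles the ray's y level, and the crossing
            // point lies to the right of (px, py): toggle the parity.
            if ((ys[i] > py) != (ys[j] > py)
                    && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
                inside = !inside;
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        double[] xs = {0, 4, 4, 0}; // a square "county" polygon
        double[] ys = {0, 0, 4, 4};
        System.out.println(contains(xs, ys, 2, 2)); // point inside
        System.out.println(contains(xs, ys, 5, 5)); // point outside
    }
}
```

Note that however the predicate is implemented, it still runs per row pair when used as the only join condition.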

> On Jan 26, 2016, at 17:13, Ted Yu  wrote:
> 
> What's the type of shape column ?
> 
> Can you disclose what SomeUDF does (by showing the code) ?
> 
> Cheers
> 
>> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:
>> Hi,
>> 
>> I create two tables, one counties with just one row (it actually has 2k
>> rows, but I used only one) and another hospitals, which has 6k rows. The
>> join command I use is as follows, which takes way too long to run and has
>> never finished successfully (even after nearly 10mins). The following is
>> what I have:
>> 
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); //has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
>> counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); //this takes too long!
>> 
>> //whereas the following which is the exact equivalent of the above gets done
>> very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE
>> SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); //gives me the correct answer of 8
>> 
>> Any suggestions on what I can do to optimize and debug this piece of code?
>> 
>> Regards,
>> Raghu
>> 
>> 
>> 
> 


Re: Spark SQL joins taking too long

2016-01-26 Thread Ted Yu
What's the type of shape column ?

Can you disclose what SomeUDF does (by showing the code) ?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran  wrote:

> Hi,
>
> I create two tables, one counties with just one row (it actually has 2k
> rows, but I used only one) and another hospitals, which has 6k rows. The
> join command I use is as follows, which takes way too long to run and has
> never finished successfully (even after nearly 10mins). The following is
> what I have:
>
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); //has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
> counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); //this takes too long!
>
> //whereas the following which is the exact equivalent of the above gets
> done
> very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE
> SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); //gives me the correct answer of 8
>
> Any suggestions on what I can do to optimize and debug this piece of code?
>
> Regards,
> Raghu
>
>
>
>
>