Re: Spark SQL joins taking too long
Why would changing the order of the join make such a big difference? I will
try the repartition, although it does not make sense to me why repartitioning
should help, since the data itself is so small!

Regards,
Raghu

> On Jan 27, 2016, at 20:08, Cheng, Hao wrote:
>
> Another possibility is the parallelism? It is probably 1 or some other
> small value, since the input data size is not that big.
>
> If that is the case, you can try something like:
>
> df1.repartition(10).registerTempTable("hospitals");
> df2.repartition(10).registerTempTable("counties");
> ...
>
> And then do the join.
>
>
> From: Raghu Ganti [mailto:raghuki...@gmail.com]
> Sent: Thursday, January 28, 2016 3:06 AM
> To: Ted Yu; Дмитро Попович
> Cc: user
> Subject: Re: Spark SQL joins taking too long
>
> The problem is with the way the Spark query plan is being created, IMO.
> What was happening before is that the order of the tables mattered: when
> the larger table was given first, the query took a very long time (~53 mins
> to complete). I changed the order of the tables, putting the smaller one
> first (and replaced the one-row table with the full one), and modified the
> query to look like this:
>
> SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess'
> AND ST_Intersects(c.shape, h.location)
>
> With the above query, things worked like a charm (<1 min to finish the
> entire execution and join 3141 polygons with 6.5k points).
>
> Do let me know if you need more info in order to pinpoint the issue.
>
> Regards,
> Raghu
>
> On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu wrote:
>
> What's the type of the shape column?
>
> Can you disclose what SomeUDF does (by showing the code)?
>
> Cheers
>
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran wrote:
>
> Hi,
>
> I create two tables: one, counties, with just one row (it actually has 2k
> rows, but I used only one), and another, hospitals, which has 6k rows. The
> join command I use is as follows; it takes way too long to run and has
> never finished successfully (even after nearly 10 mins). The following is
> what I have:
>
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); // has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
> counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); // this takes too long!
>
> // whereas the following, which is the exact equivalent of the above, gets
> // done very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
> SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); // gives me the correct answer of 8
>
> Any suggestions on what I can do to optimize and debug this piece of code?
>
> Regards,
> Raghu
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
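Before repartitioning, Raghu could confirm Cheng Hao's parallelism hypothesis by inspecting how many partitions each DataFrame actually has. A sketch against the Spark 1.x Java API used in this thread (untested here; it assumes a live SQLContext and the `df1`/`df2` DataFrames from the original question, so it will not run standalone):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CheckParallelism {
    // Sketch: if both partition counts print as 1, the join and every
    // per-row UDF call run in a single task, no matter how small the data.
    static void check(SQLContext sqlCtx, DataFrame df1, DataFrame df2) {
        System.out.println("hospitals partitions: "
            + df1.javaRDD().partitions().size());
        System.out.println("counties partitions: "
            + df2.javaRDD().partitions().size());
        System.out.println("default parallelism: "
            + sqlCtx.sparkContext().defaultParallelism());
    }
}
```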
RE: Spark SQL joins taking too long
Another possibility is the parallelism? It is probably 1 or some other small
value, since the input data size is not that big.

If that is the case, you can try something like:

df1.repartition(10).registerTempTable("hospitals");
df2.repartition(10).registerTempTable("counties");
...

And then do the join.


From: Raghu Ganti [mailto:raghuki...@gmail.com]
Sent: Thursday, January 28, 2016 3:06 AM
To: Ted Yu; Дмитро Попович
Cc: user
Subject: Re: Spark SQL joins taking too long

The problem is with the way the Spark query plan is being created, IMO. What
was happening before is that the order of the tables mattered: when the
larger table was given first, the query took a very long time (~53 mins to
complete). I changed the order of the tables, putting the smaller one first
(and replaced the one-row table with the full one), and modified the query to
look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess'
AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1 min to finish the entire
execution and join 3141 polygons with 6.5k points).

Do let me know if you need more info in order to pinpoint the issue.

Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:

What's the type of the shape column?

Can you disclose what SomeUDF does (by showing the code)?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran <raghuki...@gmail.com> wrote:

Hi,

I create two tables: one, counties, with just one row (it actually has 2k
rows, but I used only one), and another, hospitals, which has 6k rows. The
join command I use is as follows; it takes way too long to run and has never
finished successfully (even after nearly 10 mins). The following is what I
have:

DataFrame df1 = ...
df1.registerTempTable("hospitals");
DataFrame df2 = ...
df2.registerTempTable("counties"); // has only one row right now
DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
counties c ON SomeUDF(c.shape, h.location)");
long count = joinDf.count(); // this takes too long!

// whereas the following, which is the exact equivalent of the above, gets
// done very quickly!
DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
SomeUDF('c.shape as string', h.location)");
long count = joinDf.count(); // gives me the correct answer of 8

Any suggestions on what I can do to optimize and debug this piece of code?

Regards,
Raghu
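Spelled out in full, the suggestion above might look like the following (a sketch against the Spark 1.x Java API, untested here; `sqlCtx`, `df1`, and `df2` are the objects from the original question, and 10 is only an illustrative partition count, not a recommendation):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class RepartitionSketch {
    static DataFrame join(SQLContext sqlCtx, DataFrame df1, DataFrame df2) {
        // Spread each table over 10 partitions before registering it, so
        // the join (and every per-row SomeUDF call) can run as 10 parallel
        // tasks instead of one.
        df1.repartition(10).registerTempTable("hospitals");
        df2.repartition(10).registerTempTable("counties");

        return sqlCtx.sql(
            "SELECT h.name, c.name FROM hospitals h "
            + "JOIN counties c ON SomeUDF(c.shape, h.location)");
    }
}
```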
Re: Spark SQL joins taking too long
The problem is with the way the Spark query plan is being created, IMO. What
was happening before is that the order of the tables mattered: when the
larger table was given first, the query took a very long time (~53 mins to
complete). I changed the order of the tables, putting the smaller one first
(and replaced the one-row table with the full one), and modified the query to
look like this:

SELECT c.NAME, h.name FROM counties c, hospitals h WHERE c.NAME = 'Dutchess'
AND ST_Intersects(c.shape, h.location)

With the above query, things worked like a charm (<1 min to finish the entire
execution and join 3141 polygons with 6.5k points).

Do let me know if you need more info in order to pinpoint the issue.

Regards,
Raghu

On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu wrote:

> What's the type of the shape column?
>
> Can you disclose what SomeUDF does (by showing the code)?
>
> Cheers
>
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran wrote:
>
>> Hi,
>>
>> I create two tables: one, counties, with just one row (it actually has 2k
>> rows, but I used only one), and another, hospitals, which has 6k rows. The
>> join command I use is as follows; it takes way too long to run and has
>> never finished successfully (even after nearly 10 mins). The following is
>> what I have:
>>
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); // has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
>> counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); // this takes too long!
>>
>> // whereas the following, which is the exact equivalent of the above, gets
>> // done very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
>> SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); // gives me the correct answer of 8
>>
>> Any suggestions on what I can do to optimize and debug this piece of code?
>>
>> Regards,
>> Raghu
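The speedup reported above is consistent with simple arithmetic: with no equi-join keys, every (county, hospital) pair has to be fed to the spatial predicate, so filtering counties down to one row first removes almost all of the work. A plain-Java back-of-the-envelope check, using the row counts quoted in this thread (3141 polygons, ~6.5k points):

```java
public class PairCount {
    public static void main(String[] args) {
        long counties = 3141;   // polygons in the full counties table
        long hospitals = 6500;  // points in the hospitals table

        // Without a selective filter, a UDF-only join condition must
        // evaluate the predicate on the full cross product:
        long unfiltered = counties * hospitals;

        // With WHERE c.NAME = 'Dutchess' applied first, only one county
        // survives, so one predicate call per hospital remains:
        long filtered = 1 * hospitals;

        System.out.println(unfiltered + " pairs vs " + filtered + " pairs");
        // prints "20416500 pairs vs 6500 pairs"
    }
}
```

That is a factor of 3141 fewer predicate evaluations, which by itself accounts for most of the ~53 min vs <1 min difference.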
Re: Spark SQL joins taking too long
Yes, the SomeUDF is Contains. shape is a UDT that maps a custom geometry type
to the SQL binary type; the custom geometry type is a Java class. Please let
me know if you need further info.

Regards,
Raghu

> On Jan 26, 2016, at 17:13, Ted Yu wrote:
>
> What's the type of the shape column?
>
> Can you disclose what SomeUDF does (by showing the code)?
>
> Cheers
>
>> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran wrote:
>>
>> Hi,
>>
>> I create two tables: one, counties, with just one row (it actually has 2k
>> rows, but I used only one), and another, hospitals, which has 6k rows. The
>> join command I use is as follows; it takes way too long to run and has
>> never finished successfully (even after nearly 10 mins). The following is
>> what I have:
>>
>> DataFrame df1 = ...
>> df1.registerTempTable("hospitals");
>> DataFrame df2 = ...
>> df2.registerTempTable("counties"); // has only one row right now
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
>> counties c ON SomeUDF(c.shape, h.location)");
>> long count = joinDf.count(); // this takes too long!
>>
>> // whereas the following, which is the exact equivalent of the above, gets
>> // done very quickly!
>> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
>> SomeUDF('c.shape as string', h.location)");
>> long count = joinDf.count(); // gives me the correct answer of 8
>>
>> Any suggestions on what I can do to optimize and debug this piece of code?
>>
>> Regards,
>> Raghu
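The Contains code itself is never shown in the thread. For readers who want a concrete picture of what such a point-in-polygon predicate computes per row (and hence why calling it millions of times is expensive), here is a minimal ray-casting check in plain Java. This is a hypothetical illustration only, not Raghu's actual UDT or UDF:

```java
public class ContainsSketch {
    // Ray-casting point-in-polygon test: count how many polygon edges a
    // horizontal ray from (px, py) crosses; an odd count means "inside".
    // xs/ys hold the polygon's vertex coordinates in order.
    public static boolean contains(double[] xs, double[] ys,
                                   double px, double py) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            boolean crosses = (ys[i] > py) != (ys[j] > py)
                && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i])
                        + xs[i];
            if (crosses) inside = !inside;
        }
        return inside;
    }

    public static void main(String[] args) {
        // Unit square as the "county shape"
        double[] xs = {0, 1, 1, 0};
        double[] ys = {0, 0, 1, 1};
        System.out.println(contains(xs, ys, 0.5, 0.5)); // prints "true"
        System.out.println(contains(xs, ys, 2.0, 0.5)); // prints "false"
    }
}
```

Note that the loop is linear in the number of polygon vertices, so a single Contains call against a detailed county boundary is already non-trivial; doing it for every (county, hospital) pair multiplies that cost.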
Re: Spark SQL joins taking too long
What's the type of the shape column?

Can you disclose what SomeUDF does (by showing the code)?

Cheers

On Tue, Jan 26, 2016 at 12:41 PM, raghukiran wrote:

> Hi,
>
> I create two tables: one, counties, with just one row (it actually has 2k
> rows, but I used only one), and another, hospitals, which has 6k rows. The
> join command I use is as follows; it takes way too long to run and has
> never finished successfully (even after nearly 10 mins). The following is
> what I have:
>
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); // has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN
> counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); // this takes too long!
>
> // whereas the following, which is the exact equivalent of the above, gets
> // done very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals WHERE
> SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); // gives me the correct answer of 8
>
> Any suggestions on what I can do to optimize and debug this piece of code?
>
> Regards,
> Raghu
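One way to see why the original join is slow is to print the physical plan before triggering the count. A sketch against the Spark 1.x DataFrame API (untested here; with no equi-join keys in the ON clause, the planner typically has to fall back to a Cartesian/nested-loop style join and apply the UDF as a filter over every row pair):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ExplainSketch {
    static void diagnose(SQLContext sqlCtx) {
        DataFrame joinDf = sqlCtx.sql(
            "SELECT h.name, c.name FROM hospitals h "
            + "JOIN counties c ON SomeUDF(c.shape, h.location)");

        // Prints the logical and physical plans; a UDF-only join condition
        // gives Spark no keys to hash or sort on, so expect a
        // CartesianProduct (or BroadcastNestedLoopJoin) node here.
        joinDf.explain(true);
    }
}
```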