Why would changing the order of the join make such a big difference? I will try the repartition, although it does not make sense to me why repartitioning should help, since the data itself is so small!
Regards,
Raghu

> On Jan 27, 2016, at 20:08, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> Another possibility is the parallelism? It is probably 1 or some other
> small value, since the input data size is not that big.
>
> If that is the case, you can try something like:
>
> df1.repartition(10).registerTempTable("hospitals");
> df2.repartition(10).registerTempTable("counties");
> ...
> and then do the join.
>
>
> From: Raghu Ganti [mailto:raghuki...@gmail.com]
> Sent: Thursday, January 28, 2016 3:06 AM
> To: Ted Yu; Дмитро Попович
> Cc: user
> Subject: Re: Spark SQL joins taking too long
>
> The problem is with the way the Spark query plan is being created, IMO.
> What was happening before is that the order of the tables mattered: when
> the larger table was given first, the query took a very long time (~53 min
> to complete). I changed the order of the tables to put the smaller one
> first (including replacing the one-row table with the full one) and
> modified the query to look like this:
>
> SELECT c.NAME, h.name FROM counties c, hospitals h
> WHERE c.NAME = 'Dutchess' AND ST_Intersects(c.shape, h.location)
>
> With the above query, things worked like a charm (<1 min to finish the
> entire execution and join 3141 polygons with 6.5k points).
>
> Do let me know if you need more info in order to pinpoint the issue.
>
> Regards,
> Raghu
>
> On Tue, Jan 26, 2016 at 5:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> What's the type of the shape column?
>
> Can you disclose what SomeUDF does (by showing the code)?
>
> Cheers
>
> On Tue, Jan 26, 2016 at 12:41 PM, raghukiran <raghuki...@gmail.com> wrote:
> Hi,
>
> I create two tables: one, counties, with just one row (it actually has 2k
> rows, but I used only one), and another, hospitals, which has 6k rows. The
> join command I use is as follows; it takes way too long to run and has
> never finished successfully (even after nearly 10 min). The following is
> what I have:
>
> DataFrame df1 = ...
> df1.registerTempTable("hospitals");
> DataFrame df2 = ...
> df2.registerTempTable("counties"); // has only one row right now
> DataFrame joinDf = sqlCtx.sql("SELECT h.name, c.name FROM hospitals h JOIN counties c ON SomeUDF(c.shape, h.location)");
> long count = joinDf.count(); // this takes too long!
>
> // whereas the following, which is the exact equivalent of the above,
> // gets done very quickly!
> DataFrame joinDf = sqlCtx.sql("SELECT h.name FROM hospitals h WHERE SomeUDF('c.shape as string', h.location)");
> long count = joinDf.count(); // gives the correct answer of 8
>
> Any suggestions on what I can do to optimize and debug this piece of code?
>
> Regards,
> Raghu
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
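[Editor's note for readers of this archived thread] The likely cause of the slowdown is that a join whose only condition is an opaque UDF cannot be planned as a hash or sort-merge join, so Spark falls back to a nested-loop (Cartesian-style) plan that calls the UDF once per row pair. The following is a minimal, Spark-free Python sketch of that difference; the table sizes are scaled down from the thread's ~2k counties and ~6k hospitals, and the call counting is purely illustrative, not Spark's actual planner:

```python
# Illustrative comparison: a UDF-only join condition forces a nested
# loop over all pairs, while an equality condition allows hashing one
# side and probing with the other.

def nested_loop_join(left, right, predicate):
    """Evaluate the predicate for every (l, r) pair -- roughly what a
    Cartesian/nested-loop join plan must do with an opaque UDF."""
    calls = 0
    out = []
    for l in left:
        for r in right:
            calls += 1
            if predicate(l, r):
                out.append((l, r))
    return out, calls

def hash_join(left, right, key):
    """Build a hash table on one side and probe with the other --
    possible only when the join condition is an equality on columns."""
    table = {}
    for l in left:
        table.setdefault(key(l), []).append(l)
    probes = 0
    out = []
    for r in right:
        probes += 1
        for l in table.get(key(r), []):
            out.append((l, r))
    return out, probes

# Scaled down from the thread's ~2,000 counties x ~6,000 hospitals.
counties = list(range(200))
hospitals = list(range(600))

nl_out, udf_calls = nested_loop_join(counties, hospitals,
                                     lambda c, h: c == h % 200)
hj_out, probes = hash_join(counties, hospitals, lambda v: v % 200)

print(udf_calls)  # 120000 predicate calls for the UDF-style join
print(probes)     # 600 probes for the equi-join
```

This is also consistent with why the rewritten query helped: the `c.NAME = 'Dutchess'` filter shrinks one side of the nested loop to a single row before the pair-wise UDF evaluation starts.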
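Separately, spatial joins like the `ST_Intersects` one in this thread are often sped up by a cheap bounding-box prefilter that rejects most pairs before the expensive exact geometry test runs. The sketch below is a generic, hedged illustration in plain Python; it is not Spark's API or the thread's `SomeUDF`, and the exact test is a textbook ray-casting point-in-polygon check:

```python
# Hedged sketch of a standard spatial-join optimization: test a cheap
# axis-aligned bounding box before the expensive exact geometry check.

def bbox(polygon):
    """Axis-aligned bounding box of a polygon given as (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(polygon, px, py):
    """Exact (and comparatively expensive) ray-casting containment test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            # x coordinate where this edge crosses the horizontal ray
            xcross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < xcross:
                inside = not inside
    return inside

def join_points_to_polygons(polygons, points):
    boxes = [bbox(p) for p in polygons]
    exact_calls = 0
    matches = []
    for poly, (xmin, ymin, xmax, ymax) in zip(polygons, boxes):
        for px, py in points:
            if not (xmin <= px <= xmax and ymin <= py <= ymax):
                continue  # cheap rejection; skip the expensive test
            exact_calls += 1
            if point_in_polygon(poly, px, py):
                matches.append((tuple(poly), (px, py)))
    return matches, exact_calls

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
triangle = [(10, 10), (14, 10), (12, 14)]
points = [(2, 2), (12, 11), (20, 20), (3, 9)]

matches, exact_calls = join_points_to_polygons([square, triangle], points)
print(len(matches))   # 2 points fall inside a polygon
print(exact_calls)    # only 2 exact tests instead of 2 * 4 = 8
```

Whether the real `SomeUDF`/`ST_Intersects` implementation already does such a prefilter is not stated in the thread; if it does not, that would compound the cost of the nested-loop plan.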