Hi,

I created two tables: counties, with just one row (it actually has 2k
rows, but I am using only one for now), and hospitals, which has 6k rows.
The join I run is below; it takes far too long and has never finished
successfully (even after nearly 10 minutes):

DataFrame df1 = ...
df1.registerTempTable("hospitals");
DataFrame df2 = ...
df2.registerTempTable("counties"); // has only one row right now
DataFrame joinDf = sqlCtx.sql(
    "SELECT h.name, c.name FROM hospitals h JOIN counties c " +
    "ON SomeUDF(c.shape, h.location)");
long count = joinDf.count(); // this takes too long!

// Whereas the following, which should be equivalent (the single county's
// shape passed in as a literal instead of joined), finishes quickly:
DataFrame joinDf2 = sqlCtx.sql(
    "SELECT h.name FROM hospitals h WHERE " +
    "SomeUDF('c.shape as string', h.location)");
long count2 = joinDf2.count(); // gives me the correct answer of 8

Any suggestions on what I can do to optimize and debug this piece of code?

Regards,
Raghu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-joins-taking-too-long-tp26078.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
