Thanks Jeff,
I registered the three data frames as temporary tables and performed the SQL query directly on them. I had to convert the oracle NUMBER and NUMBER(n,m) columns to TO_CHAR() at the query level to avoid the overflows. I think the fact that we can read data from JDBC databases and use Spark in-memory capabilities to do heterogeneous queries involving different tables on different databases is potentially very useful val sqlContext = new org.apache.spark.sql.SQLContext(sc) // val s = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb", "dbtable" -> "(SELECT to_char(AMOUNT_SOLD) AS AMOUNT_SOLD, to_char(TIME_ID) AS TIME_ID, to_char(CHANNEL_ID) AS CHANNEL_ID FROM sh.sales)", "user" -> "sh", "password" -> "xxx")) // val c = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb", "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)", "user" -> "sh", "password" -> "xxx")) val t = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb", "dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM sh.times)", "user" -> "sh", "password" -> "xxx")) // // Registar three data frames as temporary tables // s.registerTempTable("t_s") c.registerTempTable("t_c") t.registerTempTable("t_t") // var sqltext : String = "" sqltext = """ SELECT rs.Month, rs.SalesChannel, round(TotalSales,2) FROM ( SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, SUM(t_s.AMOUNT_SOLD) AS TotalSales FROM t_s, t_t, t_c WHERE t_s.TIME_ID = t_t.TIME_ID AND t_s.CHANNEL_ID = t_c.CHANNEL_ID GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC ) rs LIMIT 10 """ sqlContext.sql(sqltext).collect.foreach(println) Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility. From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: 14 February 2016 01:39 To: Mich Talebzadeh <m...@peridale.co.uk> Cc: user@spark.apache.org Subject: Re: Joining three tables with data frames What do you mean "does not work" ? What's the error message ? BTW would it be simpler that register the 3 data frames as temporary table and then use the sql query you used before in hive and oracle ? On Sun, Feb 14, 2016 at 9:28 AM, Mich Talebzadeh <m...@peridale.co.uk <mailto:m...@peridale.co.uk> > wrote: Hi, I have created DFs on three Oracle tables. The join in Hive and Oracle are pretty simple SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales FROM sales s, times t, channels c WHERE s.time_id = t.time_id AND s.channel_id = c.channel_id GROUP BY t.calendar_month_desc, c.channel_desc ; I try to do this using Data Framess import org.apache.spark.sql.functions._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) // val s = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb", "dbtable" -> "(sh.sales)", "user" -> "sh", "password" -> "xxxxx")) // val c = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb", "dbtable" -> "(sh.channels)", "user" -> "sh", "password" -> "xxxxx")) val t = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb", "dbtable" -> "(sh.times)", "user" -> "sh", "password" -> "xxxxx")) // val sc = s.join(c, s.col("CHANNEL_ID") === c.col("CHANNEL_ID")) val st = s.join(t, s.col("TIME_ID") === t.col("TIME_ID")) val rs = sc.join(st) rs.groupBy($"calendar_month_desc",$"channel_desc").agg(sum($"amount_sold")) The las result set (rs) does not work. Since data is imported then I assume that the columns for joins need to be defined in data frame for each table rather than importing the whole columns. Thanks, Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility. -- Best Regards Jeff Zhang