RE: Joining three tables with data frames

Mich Talebzadeh Sun, 14 Feb 2016 06:25:41 -0800

Thanks Jeff,


I registered the three data frames as temporary tables and performed the SQL 
query directly on them. I had to convert the oracle NUMBER and NUMBER(n,m) 
columns to TO_CHAR() at the query level to avoid the overflows.

 

I think the fact that we can read data from JDBC databases and use Spark 
in-memory capabilities to do heterogeneous queries involving different tables 
on different databases is potentially very useful 

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 

//

val s = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(AMOUNT_SOLD) AS AMOUNT_SOLD, to_char(TIME_ID) AS 
TIME_ID, to_char(CHANNEL_ID) AS CHANNEL_ID FROM sh.sales)",

"user" -> "sh",

"password" -> "xxx"))

//

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM 
sh.channels)",

"user" -> "sh",

"password" -> "xxx"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(TIME_ID) AS TIME_ID, CALENDAR_MONTH_DESC FROM 
sh.times)",

"user" -> "sh",

"password" -> "xxx"))

//

// Registar three data frames as temporary tables

//

s.registerTempTable("t_s")

c.registerTempTable("t_c")

t.registerTempTable("t_t")

//

var sqltext : String = ""

sqltext = """

SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)

FROM

(

SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, 
SUM(t_s.AMOUNT_SOLD) AS TotalSales

FROM t_s, t_t, t_c

WHERE t_s.TIME_ID = t_t.TIME_ID

AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID

GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

) rs

LIMIT 10

"""

sqlContext.sql(sqltext).collect.foreach(println)

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: Jeff Zhang [mailto:zjf...@gmail.com] 
Sent: 14 February 2016 01:39
To: Mich Talebzadeh <m...@peridale.co.uk>
Cc: user@spark.apache.org
Subject: Re: Joining three tables with data frames

 

What do you mean "does not work" ? What's the error message ? BTW would it be 
simpler that register the 3 data frames as temporary table and then use the sql 
query you used before in hive and oracle ?

 

On Sun, Feb 14, 2016 at 9:28 AM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi,

 

I have created DFs on three Oracle tables.

 

The join in Hive and Oracle are pretty simple

 

SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales

FROM sales s, times t, channels c

WHERE s.time_id = t.time_id

AND   s.channel_id = c.channel_id

GROUP BY t.calendar_month_desc, c.channel_desc

;

 

I try to do this using Data Framess

 

 

import org.apache.spark.sql.functions._

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

//

val s = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(sh.sales)",

"user" -> "sh",

"password" -> "xxxxx"))

//

val c = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(sh.channels)",

"user" -> "sh",

"password" -> "xxxxx"))

 

val t = sqlContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(sh.times)",

"user" -> "sh",

"password" -> "xxxxx"))

//

val sc = s.join(c, s.col("CHANNEL_ID") === c.col("CHANNEL_ID"))

val st = s.join(t, s.col("TIME_ID") === t.col("TIME_ID"))

 

val rs = sc.join(st)

 

rs.groupBy($"calendar_month_desc",$"channel_desc").agg(sum($"amount_sold"))

 

The las result set (rs) does not work. 

 

Since data is imported then I assume that the columns for joins need to be 
defined in data frame for each table rather than importing the whole columns. 

 

Thanks,

 

 

Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 





 

-- 

Best Regards

Jeff Zhang

RE: Joining three tables with data frames

Reply via email to