Re: convert SQL multiple Join in Spark

2016-03-03 Thread Mich Talebzadeh
It is absolutely best to use SQL here, even in the spark-shell.

Look at this example using SQL:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("use oraclehadoop")

println("\ncreating data set at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

// Build the aggregate once and register it as a temp table for reuse
val rs = HiveContext.sql(
"""
SELECT t.calendar_month_desc
     , c.channel_desc
     , SUM(s.amount_sold) AS TotalSales
FROM sales s
INNER JOIN times t
ON s.time_id = t.time_id
INNER JOIN channels c
ON s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
""")
rs.registerTempTable("tmp")

println("\nfirst query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("""
SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
FROM tmp
ORDER BY MONTH, CHANNEL LIMIT 5
""").collect.foreach(println)

println("\nsecond query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("""
SELECT channel_desc AS CHANNEL, MAX(TotalSales) AS SALES
FROM tmp
GROUP BY channel_desc
ORDER BY SALES DESC LIMIT 5
""").collect.foreach(println)

println("\nFinished at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

Now the same code using functional programming (the DataFrame API):

import org.apache.spark.sql.functions._

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

println("\nStarted at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HiveContext.sql("use oraclehadoop")

// Project only the columns each table contributes to the join
val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID")
val c = HiveContext.table("channels").select("CHANNEL_ID", "CHANNEL_DESC")
val t = HiveContext.table("times").select("TIME_ID", "CALENDAR_MONTH_DESC")

println("\ncreating data set at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

val rs = s.join(t, "time_id")
          .join(c, "channel_id")
          .groupBy("calendar_month_desc", "channel_desc")
          .agg(sum("amount_sold").as("TotalSales"))

println("\nfirst query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

rs.orderBy("calendar_month_desc", "channel_desc").take(5).foreach(println)

println("\nsecond query at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

rs.groupBy("channel_desc")
  .agg(max("TotalSales").as("SALES"))
  .sort(desc("SALES"))
  .take(5).foreach(println)

println("\nFinished at")
HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)

HTH


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 3 March 2016 at 22:08, Mohammed Guller wrote:

> Why not use Spark SQL?
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Vikash Kumar [mailto:vikashsp...@gmail.com]
> *Sent:* Wednesday, March 2, 2016 8:29 PM
> *To:* user@spark.apache.org
> *Subject:* convert SQL multiple Join in Spark
>
>
>
> I have to write or convert the SQL query below into Spark/Scala. Can anybody
> suggest how to implement this in Spark?
>
> SELECT a.PERSON_ID as RETAINED_PERSON_ID,
>        a.PERSON_ID,
>        a.PERSONTYPE,
>        'y' as HOLDOUT,
>        d.LOCATION,
>        b.HHID,
>        a.AGE_OUTPUT as AGE,
>        a.FIRST_NAME,
>        d.STARTDATE,
>        d.ENDDATE,
>        'Not In Campaign' as HH_TYPE
> FROM PERSON_MASTER_VIEW a
> INNER JOIN PERSON_ADDRESS_HH_KEYS b
>   ON a.PERSON_ID = b.PERSON_ID
> LEFT JOIN #Holdouts c
>   ON a.PERSON_ID = c.RETAINED_PERSON_ID
> INNER JOIN #Holdouts d
>   ON b.HHID = d.HHID
> WHERE c.RETAINED_PERSON_ID IS NULL
>   AND a.PERSONTYPE IS NOT NULL
> GROUP BY a.PERSON_ID, a.PERSONTYPE, b.HHID,
>          a.AGE_OUTPUT, a.FIRST_NAME, d.LOCATION, d.STARTDATE, d.ENDDATE

RE: convert SQL multiple Join in Spark

2016-03-03 Thread Mohammed Guller
Why not use Spark SQL?

Mohammed
Author: Big Data Analytics with Spark

From: Vikash Kumar [mailto:vikashsp...@gmail.com]
Sent: Wednesday, March 2, 2016 8:29 PM
To: user@spark.apache.org
Subject: convert SQL multiple Join in Spark


I have to write or convert the SQL query below into Spark/Scala. Can anybody
suggest how to implement this in Spark?

SELECT a.PERSON_ID as RETAINED_PERSON_ID,
       a.PERSON_ID,
       a.PERSONTYPE,
       'y' as HOLDOUT,
       d.LOCATION,
       b.HHID,
       a.AGE_OUTPUT as AGE,
       a.FIRST_NAME,
       d.STARTDATE,
       d.ENDDATE,
       'Not In Campaign' as HH_TYPE
FROM PERSON_MASTER_VIEW a
INNER JOIN PERSON_ADDRESS_HH_KEYS b
  ON a.PERSON_ID = b.PERSON_ID
LEFT JOIN #Holdouts c
  ON a.PERSON_ID = c.RETAINED_PERSON_ID
INNER JOIN #Holdouts d
  ON b.HHID = d.HHID
WHERE c.RETAINED_PERSON_ID IS NULL
  AND a.PERSONTYPE IS NOT NULL
GROUP BY a.PERSON_ID, a.PERSONTYPE, b.HHID,
         a.AGE_OUTPUT, a.FIRST_NAME, d.LOCATION, d.STARTDATE, d.ENDDATE
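
For completeness, a rough DataFrame translation of this query might look like the sketch below. It is not a definitive implementation: it assumes the sources are visible to Spark as tables (the #Holdouts temp table would first have to be materialised; it is registered here under the illustrative name "holdouts"), it uses the spark-shell's sqlContext, and it emulates the LEFT JOIN ... WHERE ... IS NULL anti-join with a left outer join plus a null filter. The trailing GROUP BY carries no aggregates, so distinct() stands in for it.

import org.apache.spark.sql.functions._

// Hypothetical table names -- adjust to however these sources are registered
val a = sqlContext.table("PERSON_MASTER_VIEW")
val b = sqlContext.table("PERSON_ADDRESS_HH_KEYS")
val c = sqlContext.table("holdouts").select("RETAINED_PERSON_ID")
val d = sqlContext.table("holdouts")
          .select("HHID", "LOCATION", "STARTDATE", "ENDDATE")

val result = a
  .join(b, a("PERSON_ID") === b("PERSON_ID"))                        // INNER JOIN
  .join(c, a("PERSON_ID") === c("RETAINED_PERSON_ID"), "left_outer") // LEFT JOIN
  .join(d, b("HHID") === d("HHID"))                                  // INNER JOIN
  .where(c("RETAINED_PERSON_ID").isNull && a("PERSONTYPE").isNotNull)
  .select(a("PERSON_ID").as("RETAINED_PERSON_ID"),
          a("PERSON_ID"),
          a("PERSONTYPE"),
          lit("y").as("HOLDOUT"),
          d("LOCATION"),
          b("HHID"),
          a("AGE_OUTPUT").as("AGE"),
          a("FIRST_NAME"),
          d("STARTDATE"),
          d("ENDDATE"),
          lit("Not In Campaign").as("HH_TYPE"))
  .distinct()   // stands in for the GROUP BY with no aggregates

result.show(5)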