Hi Vijay,

If the dimension tables are of reasonable size and frequently updated, you
can use *Spark SQL* to read them directly from your MySQL database through
JDBC and join them with the fact table stored in Hive.
In general these days Spark SQL handles this well. Your fact table still
remains immutable in Hive, while the dimension tables are read as-is from
the RDBMS at query time, so you always see their current state. You can
also parallelize the JDBC reads to speed things up (see the sketch after
the query below). Some dimension tables are updated frequently or on a
recurring schedule, as in our case.
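Since in your case the dimension tables live in MySQL, the same load() call
works against a MySQL JDBC URL (the MySQL connector jar needs to be on the
classpath). A minimal sketch; the host, database, table and column names
below are hypothetical:

val d = HiveContext.load("jdbc",
  Map("url" -> "jdbc:mysql://your_host:3306/your_db",            // hypothetical MySQL JDBC URL
      "dbtable" -> "(SELECT dim_key, dim_desc FROM your_dim) d", // MySQL needs an alias on a subquery
      "user" -> _username,
      "password" -> _password))

The Oracle example from our own setup follows below.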

//
// Get the FACT table from Hive
// (assumes a HiveContext instance, e.g. created with new org.apache.spark.sql.hive.HiveContext(sc))
//
val s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID, ..... FROM HiveDB.sales")
//
// Get the dimension tables from the RDBMS (in this case Oracle) through JDBC
//
val c = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT CHANNEL_ID, CHANNEL_DESC, ..... FROM OracleDB.channels)",
      "user" -> _username,
      "password" -> _password))
val t = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT TIME_ID, CALENDAR_MONTH_DESC, ...... FROM OracleDB.times)",
      "user" -> _username,
      "password" -> _password))
//
// Register the three DataFrames as temporary tables using registerTempTable
//
s.registerTempTable("t_s")
c.registerTempTable("t_c")
t.registerTempTable("t_t")

You can then run SQL against these temp tables much as you would in Hive;
HiveContext supports HiveQL syntax here.

var sqltext = ""
sqltext = """
SELECT rs.Month, rs.SalesChannel, round(TotalSales,2) As Sales, ........
FROM
(
SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
SUM(t_s.AMOUNT_SOLD) AS TotalSales
FROM t_s, t_t, t_c
...) rs
"""
spark.sql(sqltext).collect
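
On the point about parallelizing the reads from the RDBMS: the JDBC source
will split the read across multiple connections if you give it a numeric
partitioning column and its bounds. A minimal sketch against the channels
dimension; the column choice, bounds and partition count here are
illustrative assumptions:

val cPar = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT CHANNEL_ID, CHANNEL_DESC, ..... FROM OracleDB.channels)",
      "partitionColumn" -> "CHANNEL_ID",   // numeric column to split on (illustrative)
      "lowerBound" -> "1",                 // assumed minimum of CHANNEL_ID
      "upperBound" -> "1000",              // assumed maximum of CHANNEL_ID
      "numPartitions" -> "4",              // number of parallel JDBC connections
      "user" -> _username,
      "password" -> _password))

Each partition issues its own ranged query on the partition column, so pick
bounds that roughly match the column's real minimum and maximum.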


HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 23 September 2016 at 11:46, Vijay Ramachandran <vi...@linkedin.com>
wrote:

>
> On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> What is the use case for UPSERT in Hive. The functionality does not exist
>> but there are other solutions.
>>
>> Are we talking about a set of dimension tables with primary keys hat need
>> to be updated (existing rows) or inserted (new rows)?
>>
>
>
> Hi Mich.
>
> Exactly, I'm looking at dimension tables.
>
> thanks,
>
>
