Hi Vijay,

If your dimension tables are of reasonable size and are frequently updated, then you can deploy *Spark SQL* to get the data directly from your MySQL tables through JDBC and do your join against the fact table stored in Hive. In general these days one can do better with Spark SQL. Your fact table still remains immutable in Hive, and you can parallelize the reads from the RDBMS to speed up the query while seeing the dimension data as it currently stands. Some dimension tables are updated frequently or on a recurring schedule, as in our case.
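As an aside on parallelizing those reads, below is a minimal sketch of a partitioned JDBC load using the standard Spark JDBC source options partitionColumn, lowerBound, upperBound and numPartitions. The dimension table, the numeric key DIM_ID and the bound values are hypothetical and purely for illustration; the real example against the Oracle schema follows further down.

// Partitioned JDBC read: Spark opens numPartitions connections, each pulling
// one slice of DIM_ID between lowerBound and upperBound (values assumed here).
val d = HiveContext.read
  .format("jdbc")
  .options(Map(
    "url"             -> _ORACLEserver,
    "dbtable"         -> "(SELECT DIM_ID, DIM_DESC FROM OracleDB.some_dimension)",
    "partitionColumn" -> "DIM_ID",    // must be numeric for range partitioning
    "lowerBound"      -> "1",         // assumed minimum DIM_ID
    "upperBound"      -> "1000000",   // assumed maximum DIM_ID
    "numPartitions"   -> "4",         // four concurrent JDBC connections
    "user"            -> _username,
    "password"        -> _password))
  .load()
d.registerTempTable("t_d")

Spark pushes a WHERE predicate on DIM_ID down to the RDBMS for each partition, so the executors fetch their slices concurrently rather than through a single connection.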
//
// Get the FACT table from Hive
//
var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID, ..... FROM HiveDB.sales")
//
// Get the dimension tables from the RDBMS (in this case Oracle) through JDBC
//
val c = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT CHANNEL_ID, CHANNEL_DESC, ..... FROM OracleDB.channels)",
      "user" -> _username,
      "password" -> _password))
val t = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT TIME_ID, CALENDAR_MONTH_DESC, ...... FROM OracleDB.times)",
      "user" -> _username,
      "password" -> _password))
//
// Register the three DataFrames as temporary tables using the registerTempTable call
//
s.registerTempTable("t_s")
c.registerTempTable("t_c")
t.registerTempTable("t_t")

Then you can run SQL against those temporary tables much as you would in Hive; the HiveContext in Spark maps to HiveQL here.

var sqltext = ""
sqltext = """
SELECT rs.Month, rs.SalesChannel, round(TotalSales,2) AS Sales, ........
FROM
(
  SELECT t_t.CALENDAR_MONTH_DESC AS Month,
         t_c.CHANNEL_DESC AS SalesChannel,
         SUM(t_s.AMOUNT_SOLD) AS TotalSales
  FROM t_s, t_t, t_c
  ...
) rs
"""
HiveContext.sql(sqltext).collect

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 23 September 2016 at 11:46, Vijay Ramachandran <vi...@linkedin.com> wrote:

>
> On Fri, Sep 23, 2016 at 3:47 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> What is the use case for UPSERT in Hive? The functionality does not exist,
>> but there are other solutions.
>>
>> Are we talking about a set of dimension tables with primary keys that need
>> to be updated (existing rows) or inserted (new rows)?
>>
>
> Hi Mich.
>
> Exactly, I'm looking at dimension tables.
>
> thanks,
>