One thing to be aware is that you better convert Oracle NUMBER and NUMBER(m,n) columns to varchar (--> TO_CHAR()) at source as Spark will throw overflow errors.
It is better to user TO_CHAR() in Oracle rather than writing UDF in Spark. UDFs in any language are slower compared to the generic functions. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility. From: Gaurav Agarwal [mailto:gaurav130...@gmail.com] Sent: 16 February 2016 15:52 To: Mich Talebzadeh <m...@peridale.co.uk> Cc: user <user@spark.apache.org> Subject: Re: Stored proc with spark Thanks I will try with the options On Feb 16, 2016 9:15 PM, "Mich Talebzadeh" <m...@peridale.co.uk <mailto:m...@peridale.co.uk> > wrote: You can use JDBC to oracle to get that data from a given table. What Oracle stored procedure does anyway? How many tables are involved? JDBC is pretty neat. In example below I use JDBC to load two Dimension tables from Oracle in Spark shell and read the FACT table of 100 million rows from Hive val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) println ("\nStarted at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) // var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb" var _username : String = "sh" var _password : String = "xx" // /Get the FACT table from Hive // var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM oraclehadoop.sales") //Get Oracle tables via JDBC val c = HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)", "user" -> _username, "password" -> _password)) val t = HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT TIME_ID AS TIME_ID, CALENDAR_MONTH_DESC FROM sh.times)", "user" -> _username, "password" -> _password)) // Registar three data frames as temporary tables using registerTempTable() call s.registerTempTable("t_s") c.registerTempTable("t_c") t.registerTempTable("t_t") // var sqltext : String = "" sqltext = """ SELECT rs.Month, rs.SalesChannel, round(TotalSales,2) FROM ( SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, SUM(t_s.AMOUNT_SOLD) AS TotalSales FROM t_s, t_t, t_c WHERE t_s.TIME_ID = t_t.TIME_ID AND t_s.CHANNEL_ID = t_c.CHANNEL_ID GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC ) rs LIMIT 10 """ HiveContext.sql(sqltext).collect.foreach(println) println ("\nFinished at"); HiveContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) sys.exit() HTH -- Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Cloud Technology partners Ltd, its subsidiaries nor their employees accept any responsibility. On 16/02/2016 09:04, Gaurav Agarwal wrote: Hi Can I load the data into spark from oracle storedproc Thanks