Hi, I did some investigation on this and created a DataFrame on top of the underlying view in an Oracle database.
Let's assume that our Oracle view is just a normal view, as opposed to a materialized view, something like below, where both sales and costs are FACT tables:

CREATE OR REPLACE FORCE EDITIONABLE VIEW "SH"."PROFITS"
  ("CHANNEL_ID", "CUST_ID", "PROD_ID", "PROMO_ID", "TIME_ID",
   "UNIT_COST", "UNIT_PRICE", "AMOUNT_SOLD", "QUANTITY_SOLD", "TOTAL_COST") AS
SELECT
       s.channel_id,
       s.cust_id,
       s.prod_id,
       s.promo_id,
       s.time_id,
       c.unit_cost,
       c.unit_price,
       s.amount_sold,
       s.quantity_sold,
       c.unit_cost * s.quantity_sold TOTAL_COST
FROM costs c, sales s
WHERE c.prod_id = s.prod_id
  AND c.time_id = s.time_id
  AND c.channel_id = s.channel_id
  AND c.promo_id = s.promo_id;

So it is a pretty simple view with a join on the sales and costs tables. You would typically access this view in Spark with:

scala> val df = spark.read.format("jdbc").options(
     |   Map("url" -> _ORACLEserver,
     |       "dbtable" -> "(SELECT * FROM sh.profits)", // dbtable can be a view or any valid SQL
     |       "user" -> _username,
     |       "password" -> _password)).load
df: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(38,10), CUST_ID: decimal(38,10) ... 8 more fields]

scala> df.printSchema()
root
 |-- CHANNEL_ID: decimal(38,10) (nullable = true)
 |-- CUST_ID: decimal(38,10) (nullable = true)
 |-- PROD_ID: decimal(38,10) (nullable = true)
 |-- PROMO_ID: decimal(38,10) (nullable = true)
 |-- TIME_ID: timestamp (nullable = true)
 |-- UNIT_COST: decimal(10,2) (nullable = true)
 |-- UNIT_PRICE: decimal(10,2) (nullable = true)
 |-- AMOUNT_SOLD: decimal(10,2) (nullable = true)
 |-- QUANTITY_SOLD: decimal(10,2) (nullable = true)
 |-- TOTAL_COST: decimal(38,10) (nullable = true)

If you run this, Spark will fetch the result set from Oracle: the join and its optimisation happen within Oracle itself, and only the results are returned to Spark. However, if you create DataFrames on top of the underlying Oracle tables themselves (here the costs and sales tables) and run the SQL in Spark itself, you can get a more performant result:

scala> val sales = spark.read.format("jdbc").options(
     |   Map("url" -> _ORACLEserver,
     |       "dbtable" -> "(SELECT * FROM sh.sales)",
     |       "user" -> _username,
     |       "password" -> _password)).load
sales: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), CUST_ID: decimal(38,10) ... 5 more fields]

scala> sales.createOrReplaceTempView("sales")

scala> val costs = spark.read.format("jdbc").options(
     |   Map("url" -> _ORACLEserver,
     |       "dbtable" -> "(SELECT * FROM sh.costs)",
     |       "user" -> _username,
     |       "password" -> _password)).load
costs: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), TIME_ID: timestamp ... 4 more fields]

scala> costs.createOrReplaceTempView("costs")

scala> var sqltext = """
     | SELECT
     |        s.channel_id,
     |        s.cust_id,
     |        s.prod_id,
     |        s.promo_id,
     |        s.time_id,
     |        c.unit_cost,
     |        c.unit_price,
     |        s.amount_sold,
     |        s.quantity_sold,
     |        c.unit_cost * s.quantity_sold TOTAL_COST
     | FROM costs c, sales s
     | WHERE c.prod_id = s.prod_id
     |   AND c.time_id = s.time_id
     |   AND c.channel_id = s.channel_id
     |   AND c.promo_id = s.promo_id
     | """

Then you can look at what the Spark optimiser is doing:

scala> spark.sql(sqltext).explain()
== Physical Plan ==
*(5) Project [channel_id#27, cust_id#25, prod_id#24, promo_id#28, time_id#26, unit_cost#42, unit_price#43, amount_sold#30, quantity_sold#29, CheckOverflow((promote_precision(unit_cost#42) * promote_precision(quantity_sold#29)), DecimalType(21,4), true) AS TOTAL_COST#50]
+- *(5) SortMergeJoin [prod_id#38, time_id#39, channel_id#41, promo_id#40], [prod_id#24, time_id#26, channel_id#27, promo_id#28], Inner
   :- *(2) Sort [prod_id#38 ASC NULLS FIRST, time_id#39 ASC NULLS FIRST, channel_id#41 ASC NULLS FIRST, promo_id#40 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(prod_id#38, time_id#39, channel_id#41, promo_id#40, 200), ENSURE_REQUIREMENTS, [id=#37]
   :     +- *(1) Scan JDBCRelation((SELECT * FROM sh.costs)) [numPartitions=1] [PROD_ID#38,TIME_ID#39,PROMO_ID#40,CHANNEL_ID#41,UNIT_COST#42,UNIT_PRICE#43] PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID), *IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema: struct<PROD_ID:decimal(38,10),TIME_ID:timestamp,PROMO_ID:decimal(38,10),CHANNEL_ID:decimal(38,10)...
   +- *(4) Sort [prod_id#24 ASC NULLS FIRST, time_id#26 ASC NULLS FIRST, channel_id#27 ASC NULLS FIRST, promo_id#28 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(prod_id#24, time_id#26, channel_id#27, promo_id#28, 200), ENSURE_REQUIREMENTS, [id=#43]
         +- *(3) Scan JDBCRelation((SELECT * FROM sh.sales)) [numPartitions=1] [PROD_ID#24,CUST_ID#25,TIME_ID#26,CHANNEL_ID#27,PROMO_ID#28,QUANTITY_SOLD#29,AMOUNT_SOLD#30] PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID), *IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema: struct<PROD_ID:decimal(38,10),CUST_ID:decimal(38,10),TIME_ID:timestamp,CHANNEL_ID:decimal(38,10),...

So in summary: leave the data where it is (in Oracle), read the underlying tables into Spark, and do your SQL within Spark. Your mileage may vary, as some Oracle SQL may not be possible in Spark SQL, etc.
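One caveat worth noting in the plan above: both JDBC scans show numPartitions=1, so each table is pulled over a single connection. For large FACT tables you can parallelise the read with Spark's standard JDBC partitioning options. A minimal sketch, assuming PROD_ID is a suitable numeric key; the bounds and partition count below are placeholders you would derive from your own data (e.g. MIN/MAX of the column in Oracle):

// Sketch only: parallel JDBC read of sh.sales over several connections.
// partitionColumn must be a numeric, date or timestamp column, and the
// bounds here are hypothetical -- in practice take them from
//   SELECT MIN(prod_id), MAX(prod_id) FROM sh.sales
val salesParallel = spark.read.format("jdbc")
  .option("url", _ORACLEserver)
  .option("dbtable", "(SELECT * FROM sh.sales)")
  .option("user", _username)
  .option("password", _password)
  .option("partitionColumn", "PROD_ID")
  .option("lowerBound", "1")      // hypothetical lower bound
  .option("upperBound", "200")    // hypothetical upper bound
  .option("numPartitions", "8")   // 8 concurrent JDBC connections
  .load()

Spark turns this into eight range-bounded queries against Oracle, one per partition, which usually makes the initial fetch of a big table much faster than the single-connection default. The same applies to the costs read.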
HTH

On Mon, 22 Mar 2021 at 05:38, Gaurav Singh <gaurav61...@gmail.com> wrote:

> Hi Team,
>
> We have lots of complex Oracle views (containing multiple tables, joins,
> analytical and aggregate functions, subqueries, etc.) and we are wondering
> if Spark can help us execute those views faster.
>
> Also, we want to know if those complex views can be implemented using
> Spark SQL?
>
> Thanks and regards,
> Gaurav Singh
> +91 8600852256