Hi,

I did some investigation on this and created a DataFrame on top of the
underlying view in an Oracle database.

Let's assume that our Oracle view is just a normal view, as opposed to a
materialized view, something like the one below, where both sales and costs
are FACT tables:

CREATE OR REPLACE FORCE EDITIONABLE VIEW "SH"."PROFITS" ("CHANNEL_ID",
"CUST_ID", "PROD_ID", "PROMO_ID", "TIME_ID", "UNIT_COST", "UNIT_PRICE",
"AMOUNT_SOLD", "QUANTITY_SOLD", "TOTAL_COST") AS
  SELECT
        s.channel_id,
        s.cust_id,
        s.prod_id,
        s.promo_id,
        s.time_id,
        c.unit_cost,
        c.unit_price,
        s.amount_sold,
        s.quantity_sold,
        c.unit_cost * s.quantity_sold TOTAL_COST
 FROM   costs c, sales s
 WHERE c.prod_id = s.prod_id
   AND c.time_id = s.time_id
   AND c.channel_id = s.channel_id
   AND c.promo_id = s.promo_id;

So it is a pretty simple view with a join on the sales and costs tables.

You typically access this view in Spark with

scala> val df = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.profits)",  // dbtable could
be on a view or any valid sql
     |        "user" -> _username,
     |        "password" -> _password)).load
df: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(38,10), CUST_ID:
decimal(38,10) ... 8 more fields]

scala>

scala> df.printSchema()
root
 |-- CHANNEL_ID: decimal(38,10) (nullable = true)
 |-- CUST_ID: decimal(38,10) (nullable = true)
 |-- PROD_ID: decimal(38,10) (nullable = true)
 |-- PROMO_ID: decimal(38,10) (nullable = true)
 |-- TIME_ID: timestamp (nullable = true)
 |-- UNIT_COST: decimal(10,2) (nullable = true)
 |-- UNIT_PRICE: decimal(10,2) (nullable = true)
 |-- AMOUNT_SOLD: decimal(10,2) (nullable = true)
 |-- QUANTITY_SOLD: decimal(10,2) (nullable = true)
 |-- TOTAL_COST: decimal(38,10) (nullable = true)

If you run this, Spark simply fetches the result set from Oracle: the
optimisation happens within Oracle itself and the results are returned to
Spark.
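
If you want to confirm this, a minimal check (using the df defined above) is
to look at the plan Spark produces for the view-based DataFrame; since all
the work is delegated to Oracle, it should amount to little more than a
single JDBC scan of the subquery:

df.explain()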

However, if you use the underlying Oracle tables themselves (i.e. create
DataFrames on top of them, here the costs and sales tables) and run the SQL
in Spark itself, then the join is done by Spark and you will get a more
performant result.

scala> val sales = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.sales)",
     |        "user" -> _username,
     |        "password" -> _password)).load
sales: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), CUST_ID:
decimal(38,10) ... 5 more fields]

scala> sales.createOrReplaceTempView("sales")

scala> val costs = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.costs)",
     |        "user" -> _username,
     |        "password" -> _password)).load
costs: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), TIME_ID:
timestamp ... 4 more fields]

scala> costs.createOrReplaceTempView("costs")

scala> var sqltext = """
     | SELECT
     |         s.channel_id,
     |         s.cust_id,
     |         s.prod_id,
     |         s.promo_id,
     |         s.time_id,
     |         c.unit_cost,
     |         c.unit_price,
     |         s.amount_sold,
     |         s.quantity_sold,
     |         c.unit_cost * s.quantity_sold TOTAL_COST
     |  FROM   costs c, sales s
     |  WHERE c.prod_id = s.prod_id
     |    AND c.time_id = s.time_id
     |    AND c.channel_id = s.channel_id
     |    AND c.promo_id = s.promo_id
     | """
sqltext: String =
"
SELECT
        s.channel_id,
        s.cust_id,
        s.prod_id,
        s.promo_id,
        s.time_id,
        c.unit_cost,
        c.unit_price,
        s.amount_sold,
        s.quantity_sold,
        c.unit_cost * s.quantity_sold TOTAL_COST
 FROM   costs c, sales s
 WHERE c.prod_id = s.prod_id
   AND c.time_id = s.time_id
   AND c.channel_id = s.channel_id
   AND c.promo_id = s.promo_id
"

Then you can look at what the Spark optimiser is doing:

scala> spark.sql(sqltext).explain()
== Physical Plan ==
*(5) Project [channel_id#27, cust_id#25, prod_id#24, promo_id#28,
time_id#26, unit_cost#42, unit_price#43, amount_sold#30, quantity_sold#29,
CheckOverflow((promote_precision(unit_cost#42) *
promote_precision(quantity_sold#29)), DecimalType(21,4), true) AS
TOTAL_COST#50]
+- *(5) SortMergeJoin [prod_id#38, time_id#39, channel_id#41, promo_id#40],
[prod_id#24, time_id#26, channel_id#27, promo_id#28], Inner
   :- *(2) Sort [prod_id#38 ASC NULLS FIRST, time_id#39 ASC NULLS FIRST,
channel_id#41 ASC NULLS FIRST, promo_id#40 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(prod_id#38, time_id#39, channel_id#41,
promo_id#40, 200), ENSURE_REQUIREMENTS, [id=#37]
   :     +- *(1) Scan JDBCRelation((SELECT * FROM sh.costs))
[numPartitions=1]
[PROD_ID#38,TIME_ID#39,PROMO_ID#40,CHANNEL_ID#41,UNIT_COST#42,UNIT_PRICE#43]
PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID),
*IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema:
struct<PROD_ID:decimal(38,10),TIME_ID:timestamp,PROMO_ID:decimal(38,10),CHANNEL_ID:decimal(38,10)...
   +- *(4) Sort [prod_id#24 ASC NULLS FIRST, time_id#26 ASC NULLS FIRST,
channel_id#27 ASC NULLS FIRST, promo_id#28 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(prod_id#24, time_id#26, channel_id#27,
promo_id#28, 200), ENSURE_REQUIREMENTS, [id=#43]
         +- *(3) Scan JDBCRelation((SELECT * FROM sh.sales))
[numPartitions=1]
[PROD_ID#24,CUST_ID#25,TIME_ID#26,CHANNEL_ID#27,PROMO_ID#28,QUANTITY_SOLD#29,AMOUNT_SOLD#30]
PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID),
*IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema:
struct<PROD_ID:decimal(38,10),CUST_ID:decimal(38,10),TIME_ID:timestamp,CHANNEL_ID:decimal(38,10),...
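
Note also that both JDBC scans above show numPartitions=1, i.e. each table
is pulled over a single connection. If the tables are large you can
parallelise the reads with the standard Spark JDBC partitioning options. A
minimal sketch, where the choice of PROD_ID as partition column and the
bounds are purely illustrative (use a real numeric/date column and its
actual min/max):

val salesPartitioned = spark.read.format("jdbc").options(
       Map("url" -> _ORACLEserver,
       "dbtable" -> "(SELECT * FROM sh.sales)",
       "partitionColumn" -> "PROD_ID",   // must be a numeric, date or timestamp column
       "lowerBound" -> "1",              // illustrative bounds only
       "upperBound" -> "1000",
       "numPartitions" -> "8",           // 8 parallel connections to Oracle
       "fetchsize" -> "10000",           // rows fetched per round trip
       "user" -> _username,
       "password" -> _password)).load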


So in summary, leave the data where it is (in Oracle) and do your SQL
within Spark. Your mileage may vary, as some Oracle SQL code may not be
possible in Spark SQL, etc.
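
For completeness, the same join can also be expressed with the DataFrame API
rather than Spark SQL. A minimal sketch, using the sales and costs
DataFrames defined above (column names are upper case as read from Oracle):

import org.apache.spark.sql.functions.col

val profits = sales
  .join(costs, Seq("PROD_ID", "TIME_ID", "CHANNEL_ID", "PROMO_ID"))  // inner join by default
  .withColumn("TOTAL_COST", col("UNIT_COST") * col("QUANTITY_SOLD"))
  .select("CHANNEL_ID", "CUST_ID", "PROD_ID", "PROMO_ID", "TIME_ID",
          "UNIT_COST", "UNIT_PRICE", "AMOUNT_SOLD", "QUANTITY_SOLD", "TOTAL_COST")

profits.explain()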


HTH


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 Mar 2021 at 05:38, Gaurav Singh <gaurav61...@gmail.com> wrote:

> Hi Team,
>
> We have lots of complex oracle views ( containing multiple tables, joins,
> analytical and  aggregate functions, sub queries etc) and we are wondering
> if Spark can help us execute those views faster.
>
> Also we want to know if those complex views can be implemented using Spark
> SQL?
>
> Thanks and regards,
> Gaurav Singh
> +91 8600852256
>
>
