Have you considered Presto with an Oracle connector?

From: Teemu Heikkilä <te...@emblica.fi.INVALID>
Date: Thursday, April 4, 2019 at 12:28 PM
To: Prasad Bhalerao <prasadbhalerao1...@gmail.com>
Cc: Jason Nerothin <jasonnerot...@gmail.com>, user <user@spark.apache.org>
Subject: <External>Re: reporting use case

Based on your answers, I would consider using the update stream to update the 
actual snapshots, i.e. by joining the data.

Of course, how to get the data into Spark depends on how the update stream has 
been implemented.
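
For illustration, a minimal batch-merge sketch of that idea, assuming the update 
stream can be materialized as a DataFrame of changed rows that shares the 
snapshot's schema, keyed by a hypothetical primary-key column "id" (all names 
here are placeholders, not taken from your setup):

    import org.apache.spark.sql.DataFrame

    // Replace the changed rows in the snapshot with their latest versions.
    def applyUpdates(snapshot: DataFrame, updates: DataFrame): DataFrame = {
      snapshot
        .join(updates, Seq("id"), "left_anti")  // keep only rows that did not change
        .unionByName(updates)                   // append the new/updated rows
    }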

Could you tell us a little bit more about that?
- Teemu


On 4 Apr 2019, at 22.23, Prasad Bhalerao 
<prasadbhalerao1...@gmail.com> wrote:

Hi,

I can create a view on these tables, but the thing is I am going to need almost 
every column from them, and I have faced issues with Oracle views on such large 
tables when joins are involved. Somehow Oracle used to choose a suboptimal 
execution plan.

Can you please tell me how creating a view will help in this scenario?

Can you please tell me if I am thinking in the right direction?

I have two challenges:
1) First, to load 2-4 TB of data into Spark very quickly (see the read sketch 
just after this list).
2) Then, to keep this data updated in Spark whenever DB updates are done.
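
For the first challenge, a partitioned JDBC read is the usual way to pull a 
large table in parallel. This is only a sketch under assumptions: an existing 
SparkSession named spark, the Oracle JDBC driver on the classpath, a numeric 
and roughly evenly distributed key column, and placeholder connection details 
and table names:

    val snapshot = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")  // placeholder URL
      .option("dbtable", "REPORT_FACTS")                         // placeholder table
      .option("user", "db_user")
      .option("password", "db_password")
      .option("partitionColumn", "ID")   // numeric column used to split the read
      .option("lowerBound", "1")
      .option("upperBound", "1500000000")
      .option("numPartitions", "200")    // 200 concurrent reads against Oracle
      .option("fetchsize", "10000")      // larger JDBC fetch size, fewer round trips
      .load()

    snapshot.persist()  // keep the loaded snapshot around for reuse across reports

With BLOB columns in the rows, the network transfer will still dominate, so 
however the read is split, the first full load of 2-4 TB will take a while.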

Thanks,
Prasad

On Fri, Apr 5, 2019 at 12:35 AM Jason Nerothin 
<jasonnerot...@gmail.com> wrote:
Hi Prasad,

Could you create an Oracle-side view that captures only the relevant records 
and then use the Spark JDBC connector to load the view into Spark?
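
Something along these lines, as a sketch only; the view name REPORT_VIEW, the 
connection details, and the existing SparkSession spark are all assumptions:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "db_user")
    props.setProperty("password", "db_password")
    props.setProperty("fetchsize", "10000")

    // Spark runs its SELECT against the view on the Oracle side
    // and exposes the result as a DataFrame.
    val reportDf = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/SERVICE",
      "REPORT_VIEW",
      props)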

On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao 
<prasadbhalerao1...@gmail.com> wrote:
Hi,

I am exploring Spark for my reporting application.
My use case is as follows...
I have 4-5 Oracle tables which contain more than 1.5 billion rows. These 
tables are updated very frequently every day. I don't have the choice to change 
the database technology, so this data is going to remain in Oracle only.
To generate one report, on average 15-50 million rows have to be fetched from 
the Oracle tables. These rows contain some BLOB columns. Most of the time is 
spent fetching that many rows from the DB over the network; the data processing 
itself is not that complex. Currently these reports take around 3-8 hours to 
complete. I am trying to speed up this report generation process.

Can I use Spark as a caching layer in this case, to avoid fetching data from 
Oracle over the network every time? I am thinking of submitting a Spark job for 
each report request and using Spark SQL to fetch the data, then processing it 
and writing it to a file. I am trying to get a kind of data locality in this case.
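
As a rough illustration of that idea, assuming a snapshot DataFrame already 
loaded from Oracle (for example via a partitioned JDBC read like the one 
sketched earlier in this thread), with made-up column names and output path:

    // Register the cached snapshot and run the report with Spark SQL.
    snapshot.createOrReplaceTempView("report_facts")

    val report = spark.sql(
      """SELECT customer_id, SUM(amount) AS total_amount
        |FROM report_facts
        |WHERE report_date >= to_date('2019-01-01')
        |GROUP BY customer_id""".stripMargin)

    report.write.mode("overwrite").csv("/reports/customer_totals")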

Whenever data is updated in the Oracle tables, can I refresh the data in Spark 
storage? I can get the update feed using messaging technology.
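
One possible wiring for that feed, purely as a hedged sketch: assuming Spark 
2.4+, the spark-sql-kafka-0-10 package, and that the updates arrive on a Kafka 
topic (Kafka, the topic name, and the message layout are assumptions, not 
something stated here), Structured Streaming's foreachBatch could apply each 
micro-batch of changes to the cached snapshot with a merge like the one 
sketched earlier:

    import org.apache.spark.sql.DataFrame

    // Called once per micro-batch of change records.
    def mergeBatch(batch: DataFrame, batchId: Long): Unit = {
      // Parse the changed keys/rows from `batch`, re-fetch the affected rows from
      // Oracle if the feed only carries keys, and fold them into the cached
      // snapshot (e.g. the left_anti join + unionByName merge shown above).
    }

    val updates = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder brokers
      .option("subscribe", "oracle_change_feed")         // placeholder topic
      .load()

    updates.writeStream
      .foreachBatch(mergeBatch _)
      .start()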

Can someone from the community help me with this?
Suggestions are welcome.


Thanks,
Prasad





--
Thanks,
Jason
