Hi,

I am new to Spark and NoSQL databases.

So please correct me if I am wrong.

Since I will be accessing many columns (almost 20-30) of a row, I will have
to go with a row-based format instead of a column-based one, right?
Maybe I can use Avro in this case. Does Spark go well with Avro? I will
do my research on this, but please let me know your opinion.
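
For example, I was imagining something like this for reading Avro data in
Spark (just a rough sketch based on the spark-avro docs, not tested; the
package coordinates, path and column names below are placeholders):

  import org.apache.spark.sql.SparkSession

  // Sketch: read Avro files with the external spark-avro package, e.g.
  //   spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
  val spark = SparkSession.builder().appName("avro-read").getOrCreate()

  val rows = spark.read
    .format("avro")                          // provided by spark-avro
    .load("/data/reporting/snapshot.avro")   // placeholder path

  rows.select("account_id", "status").show(5)  // placeholder column names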

Thanks,
Prasad

On Fri 5 Apr, 2019, 1:09 AM Teemu Heikkilä <te...@emblica.fi> wrote:

> So basically you could have a base dump/snapshot of the full database - or
> all the required data - stored in HDFS or a similar system as partitioned
> files (i.e. ORC/Parquet).
>
> Then you use the change stream after the dump and join it with the snapshot
> - similarly to what your database is doing.
> After that you can build the aggregates and reports from that table.
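>
> For example, roughly like this (just a sketch, not a tested implementation;
> the paths and the key column "id" are placeholders, and it assumes the
> change stream has already been reduced to the latest row per key with the
> same schema as the snapshot):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder().appName("snapshot-merge").getOrCreate()
>
>   // Base snapshot stored as partitioned Parquet (placeholder path)
>   val snapshot = spark.read.parquet("hdfs:///data/base_snapshot")
>
>   // Latest change per key, derived from the change stream (placeholder path)
>   val updates = spark.read.parquet("hdfs:///data/changes_latest")
>
>   // "Merge": keep snapshot rows that were not updated, then add the updated rows
>   val merged = snapshot
>     .join(updates.select("id"), Seq("id"), "left_anti")
>     .unionByName(updates)
>
>   merged.write.mode("overwrite").parquet("hdfs:///data/snapshot_updated")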
>
> - T
>
> On 4 Apr 2019, at 22.35, Prasad Bhalerao <prasadbhalerao1...@gmail.com>
> wrote:
>
> I did not understand this: "update actual snapshots, i.e. by joining the
> data".
>
>
> There is another microservice which updates these Oracle tables. I can
> have this microservice send the update data feed on Kafka topics.
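>
> For consuming that feed I had something like this in mind (just a rough
> sketch from the Structured Streaming docs, not tested; it assumes the
> spark-sql-kafka package is available, and the broker, topic and paths are
> placeholders):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder().appName("update-feed").getOrCreate()
>
>   // Read the update feed published by the microservice
>   val updates = spark.readStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "broker1:9092")
>     .option("subscribe", "oracle-updates")
>     .load()
>     .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>
>   // For now just land the raw changes; parsing and merging would come later
>   updates.writeStream
>     .format("parquet")
>     .option("path", "hdfs:///data/changes_raw")
>     .option("checkpointLocation", "hdfs:///data/changes_checkpoint")
>     .start()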
>
> Thanks,
> Prasad
>
> On Fri 5 Apr, 2019, 12:57 AM Teemu Heikkilä <te...@emblica.fi> wrote:
>
>> Based on your answers, I would consider using the update stream to update
>> the actual snapshots, i.e. by joining the data.
>>
>> Of course, how to get the data into Spark now depends on how the update
>> stream has been implemented.
>>
>> Could you tell a little bit more about that?
>> - Teemu
>>
>> On 4 Apr 2019, at 22.23, Prasad Bhalerao <prasadbhalerao1...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I can create a view on these tables, but the thing is I am going to need
>> almost every column from these tables, and I have faced issues with Oracle
>> views on such large tables involving joins. Somehow Oracle used to choose
>> a suboptimal execution plan.
>>
>> Can you please tell me how creating views will help in this scenario?
>>
>> Can you please tell me if I am thinking in the right direction?
>>
>> I have two challenges:
>> 1) First, to load 2-4 TB of data into Spark very quickly.
>> 2) Then, to keep this data updated in Spark whenever DB updates are done.
>>
>> Thanks,
>> Prasad
>>
>> On Fri, Apr 5, 2019 at 12:35 AM Jason Nerothin <jasonnerot...@gmail.com>
>> wrote:
>>
>>> Hi Prasad,
>>>
>>> Could you create an Oracle-side view that captures only the relevant
>>> records and then use the Spark JDBC connector to load the view into Spark?
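>>>
>>> Something along these lines, perhaps (just a sketch, not tested; the JDBC
>>> URL, credentials, view name and partition column are placeholders, and the
>>> Oracle JDBC driver would need to be on the classpath):
>>>
>>>   import org.apache.spark.sql.SparkSession
>>>
>>>   val spark = SparkSession.builder().appName("oracle-view-load").getOrCreate()
>>>
>>>   // Parallel JDBC read of the Oracle-side view
>>>   val df = spark.read
>>>     .format("jdbc")
>>>     .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
>>>     .option("dbtable", "REPORTING_VIEW")
>>>     .option("user", "report_user")
>>>     .option("password", "****")
>>>     .option("partitionColumn", "ID")   // numeric column used to split the read
>>>     .option("lowerBound", "1")
>>>     .option("upperBound", "1500000000")
>>>     .option("numPartitions", "32")     // 32 parallel connections to Oracle
>>>     .load()
>>>
>>>   df.write.mode("overwrite").parquet("hdfs:///data/base_snapshot")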
>>>
>>> On Thu, Apr 4, 2019 at 1:48 PM Prasad Bhalerao <
>>> prasadbhalerao1...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am exploring Spark for my reporting application.
>>>> My use case is as follows...
>>>> I have 4-5 Oracle tables which contain more than 1.5 billion rows.
>>>> These tables are updated very frequently every day. I don't have a choice
>>>> to change the database technology, so this data is going to remain in Oracle only.
>>>> To generate one report, on average 15-50 million rows have to be
>>>> fetched from the Oracle tables. These rows contain some BLOB columns. Most of
>>>> the time is spent fetching this many rows from the DB over the network.
>>>> The data processing is not that complex. Currently these reports take around
>>>> 3-8 hours to complete. I am trying to speed up this report generation process.
>>>>
>>>> Can I use Spark as a caching layer in this case to avoid fetching data
>>>> from Oracle over the network every time? I am thinking of submitting a Spark
>>>> job for each report request and using Spark SQL to fetch the data, then
>>>> process it and write it to a file. I am trying to use a kind of data locality
>>>> in this case.
>>>>
>>>> Whenever data is updated in the Oracle tables, can I refresh the data in
>>>> Spark storage? I can get the update feed using messaging technology.
>>>>
>>>> Can someone from the community help me with this?
>>>> Suggestions are welcome.
>>>>
>>>>
>>>> Thanks,
>>>> Prasad
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Jason
>>>
>>
>>
>
