Hi

We have a situation where we are ingesting a high-volume change stream
coming from an Oracle table.
The Requirement:
Whenever there is a change in the Oracle table, a CDC process writes the
change to a Kafka or Event Hub stream, and the stream is consumed by a
Spark streaming application.

The Problem:
Because of some challenges on the Oracle side, we have observed that
commits in Oracle happen in big bursts, regularly over a couple of million
records, especially for delete transactions. Hence, the stream consumed by
the Spark app is not evenly distributed.

The Question:

a) Is there any special care that should be taken when writing this kind
of Spark app?
b) Would it be better to go with a Spark batch job that runs every hour or
so? In that case we could use the Event Hub archival process to write data
to storage every 5 minutes and then consume from HDFS/storage every hour.
c) Other than a CDC tool, is there any Spark package that can actually
listen to the Oracle change stream? That is, can we use Spark as the CDC
tool itself?

-- 
Best Regards,
Ayan Guha
