Hi all,

Thanks a lot for your contributions in bringing us new technologies.

I don't want to waste your time, so before writing I googled and searched
Stack Overflow and the mailing list archive with the keywords "streaming" and
"jdbc", but I was not able to find a solution to my use case. I hope I can
get some clarification from you.

The use case is quite straightforward: I need to harvest a relational
database via JDBC, do something with the data, and store the results in
Kafka. I am stuck at the first step, and the difficulties are as follows:

1. The database is too large to ingest with one thread.
2. The database is dynamic, and time-series data comes in constantly.

The ideal workflow, then, is for multiple workers to process partitions of
the data incrementally, one time window at a time. For example, processing
starts from the earliest data, with each batch covering one hour. If the
ingestion rate is faster than the rate at which new data arrives, the entire
database will eventually be harvested, and the workers will then "tail" the
database for new data, at which point the processing becomes real time. A
rough sketch of what I mean follows.
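
Here is a minimal sketch of the driver-side loop I have in mind, assuming a
table "events" with a timestamp column "ts"; the table and column names, the
connection details, and the one-hour step are all placeholders, not tested
code:

    import org.apache.spark.sql.SparkSession

    val spark  = SparkSession.builder().appName("jdbc-harvest").getOrCreate()
    val url    = "jdbc:postgresql://host:5432/db"  // placeholder connection
    val stepMs = 60 * 60 * 1000L                   // one-hour batches

    // In the real job the lower bound would come from SELECT min(ts).
    var lo = java.sql.Timestamp.valueOf("2017-01-01 00:00:00").getTime

    while (true) {
      val hi = lo + stepMs
      if (hi > System.currentTimeMillis()) {
        // Caught up with "now": wait for the current window to close,
        // i.e. the job starts tailing the table for new data.
        Thread.sleep(60 * 1000L)
      } else {
        val batch = spark.read
          .format("jdbc")
          .option("url", url)
          .option("dbtable",
            s"(SELECT * FROM events WHERE ts >= to_timestamp(${lo / 1000}) " +
            s"AND ts < to_timestamp(${hi / 1000})) AS w")
          .load()
        // ... transform `batch`, then write the result to Kafka ...
        lo = hi  // advance to the next one-hour window
      }
    }

This works, but it processes one window at a time on the driver, which brings
me to the parallelism question below.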

With Spark SQL I can ingest data from a JDBC source with partitions divided
by time windows, but how can I increment the time windows dynamically during
execution? Assume two workers are ingesting the data for 2017-01-01 and
2017-01-02; the one that finishes first should get the next task, for
2017-01-03. I have not been able to find a way to increment those values
while the job is running.
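
Concretely, this is the partitioned read I am using today (a sketch, again
with placeholder table/column names and connection details); each predicate
in the array becomes one JDBC partition, i.e. one task a worker can pick up:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-windows").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "reader")      // placeholder credentials
    props.setProperty("password", "secret")

    // One predicate per one-hour window; each becomes one partition/task.
    val predicates = Array(
      "ts >= '2017-01-01 00:00:00' AND ts < '2017-01-01 01:00:00'",
      "ts >= '2017-01-01 01:00:00' AND ts < '2017-01-01 02:00:00'"
    )

    val df = spark.read.jdbc(
      "jdbc:postgresql://host:5432/db",      // placeholder URL
      "events",                              // placeholder table
      predicates,
      props
    )

As far as I can tell, the predicates array is fixed when the read is planned,
which is exactly why I cannot grow the set of windows mid-job.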

Then I looked into Structured Streaming. It looks much more promising because
it supports window operations based on event time, which could be the
solution to my use case. However, in the documentation and code examples I
did not find anything about streaming data out of a growing database. Is
there anything I can read to achieve my goal?
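
For completeness, this is the kind of event-time windowing I am referring to,
following the pattern in the programming guide. Since I could not find a
built-in JDBC streaming source, the sketch uses the built-in "rate" test
source as a stand-in, and the watermark and window sizes are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark =
      SparkSession.builder().appName("event-time-windows").getOrCreate()
    import spark.implicits._

    // Stand-in source: the "rate" test source emits (timestamp, value)
    // rows; in my use case this would be the growing database instead.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    val counts = events
      .withWatermark("timestamp", "10 minutes")   // tolerate late rows
      .groupBy(window($"timestamp", "1 hour"))    // one-hour event-time windows
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()

If I could plug a JDBC-backed source into this pipeline, it would match my
use case exactly.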

Any suggestions are highly appreciated. Thank you very much, and have a nice
day.

Best regards,
Yang