Hi,

I’m currently doing some tests with Structured Streaming and I’m wondering how 
I can join the streaming dataset with a more-or-less static dataset (from a 
JDBC source).
By more-or-less static I mean a dataset that does not change very often and 
could be cached by Spark for a while. Joining with a static dataset is 
possible, but the static dataset is re-read on every micro-batch, which 
increases the batch duration.
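
To make the setup concrete, this is roughly what I’m doing (a minimal sketch; 
the Kafka topic, JDBC connection details, and column names are just 
placeholders, not my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

// Streaming input, e.g. from Kafka (topic and brokers are placeholders)
val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

// More-or-less static reference data from JDBC (placeholder connection details)
val refDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/mydb")
  .option("dbtable", "reference_table")
  .option("user", "user")
  .option("password", "secret")
  .load()
  .cache()                                   // would like Spark to keep this around

// Stream-static join, evaluated for every micro-batch
val joined = streamDF.join(refDF, Seq("id"), "left")

val query = joined.writeStream
  .format("console")
  .outputMode("append")
  .start()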
With ‘traditional’ Spark Streaming (DStreams) I use a counter and refresh the 
dataset (via unpersist() and cache()) when the counter hits a certain 
threshold. I admit it’s not a state-of-the-art solution, but it works. With 
Structured Streaming I was not able to get this mechanism working; it looks 
like the code between the input and the sink only runs once (at query start), 
not on every batch…
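
For reference, this is the kind of refresh logic I mean on the DStream side 
(again a rough sketch with placeholder names and sinks, not the actual job):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder().appName("dstream-refresh").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

// Re-reads the reference data from JDBC (placeholder connection details)
def loadReference(): DataFrame =
  spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db:5432/mydb")
    .option("dbtable", "reference_table")
    .option("user", "user")
    .option("password", "secret")
    .load()

var refDF = loadReference().cache()
var batchCounter = 0
val refreshEvery = 20                        // refresh threshold in batches

val lines = ssc.socketTextStream("localhost", 9999)   // example input

lines.foreachRDD { rdd =>
  batchCounter += 1
  if (batchCounter >= refreshEvery) {
    refDF.unpersist()                        // drop the stale copy
    refDF = loadReference().cache()          // re-read and cache again
    batchCounter = 0
  }
  // join the micro-batch with the cached reference data
  import spark.implicits._
  val batchDF = rdd.toDF("payload")
  batchDF.join(refDF, batchDF("payload") === refDF("id"), "left")
    .show()                                  // stand-in for the real sink
}

ssc.start()
ssc.awaitTermination()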

Is there a way to cache external datasets, use them in consecutive batches 
(joining them with newly arriving streaming data, performing operations and 
sinking the results), and refresh those external datasets only after a 
specified number of batches?
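
In other words, I’m after something roughly like the following (only a sketch 
of the idea, assuming the foreachBatch sink available in newer Spark versions 
and hypothetical helper/table names; I have not actually got this working):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Keeps the JDBC reference data cached across micro-batches and re-reads it
// only after a configurable number of batches (hypothetical helper).
object RefreshableReference {
  @volatile private var cached: Option[DataFrame] = None
  private var batchesSinceRefresh = 0
  val refreshEvery = 20

  def get(spark: SparkSession): DataFrame = synchronized {
    if (cached.isEmpty || batchesSinceRefresh >= refreshEvery) {
      cached.foreach(_.unpersist())          // drop the stale copy
      cached = Some(
        spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db:5432/mydb")
          .option("dbtable", "reference_table")
          .load()
          .cache())
      batchesSinceRefresh = 0
    }
    batchesSinceRefresh += 1
    cached.get
  }
}

val query = streamDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val refDF = RefreshableReference.get(batchDF.sparkSession)
    batchDF.join(refDF, Seq("id"), "left")
      .write.mode("append").parquet("/tmp/output")   // stand-in sink
  }
  .start()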

Regards,
Chris
