Hi,

I'm currently running some tests with Structured Streaming and I'm wondering how I can join the streaming dataset with a more-or-less static dataset (from a JDBC source). By "more-or-less static" I mean a dataset that doesn't change very often and could be cached by Spark for a while. Joining with a static dataset is possible, but the static dataset is re-read on every micro-batch, which increases the batch duration.

With 'traditional' (non-structured) Spark Streaming I use a counter and refresh the dataset (via unpersist() and cache()) when the counter hits a certain threshold. I admit it's not a state-of-the-art solution, but it works. With Structured Streaming I was not able to get this mechanism working; it looks like the code between the input and the sinks only runs once…
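To make that concrete, here is roughly what the DStream version of the refresh looks like. The JDBC URL, table and column names, the socket source and the threshold are all placeholders for this sketch, not my real job (and the reference table is assumed to have an "id" column to join on):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReferenceRefreshExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dstream-reference-refresh").getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

    // Load the more-or-less static reference data from JDBC and cache it.
    // URL, table and column names are placeholders.
    def loadReferenceData(): DataFrame =
      spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/refdb")
        .option("dbtable", "reference_table")
        .load()
        .cache()

    var referenceData = loadReferenceData()
    var batchCounter = 0
    val refreshEveryNBatches = 10 // arbitrary threshold

    // Placeholder streaming source; the real job reads from Kafka.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // When the counter hits the threshold, drop the cached copy and re-read it.
      batchCounter += 1
      if (batchCounter >= refreshEveryNBatches) {
        referenceData.unpersist()
        referenceData = loadReferenceData()
        batchCounter = 0
      }

      // Join the incoming micro-batch with the cached reference data and sink it.
      val batch = rdd.map(_.split(",")).map(a => (a(0), a(1))).toDF("id", "payload")
      batch.join(referenceData, "id").show() // placeholder sink
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```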
Is there a way in Structured Streaming to cache an external dataset, use it across consecutive batches (joining it with the new incoming streaming data, performing operations and sinking the results), and refresh it only after a specified number of batches?

Regards,
Chris
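P.S. For completeness, this is roughly the Structured Streaming shape I'm starting from, again with placeholder JDBC/Kafka options and column names, and assuming the reference table has an "id" column to join on. Everything between the source and the sink is defined once before start(), which is why I don't see where a per-N-batches refresh could hook in:

```scala
import org.apache.spark.sql.SparkSession

object StructuredJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

    // Static side, defined once before the query starts (placeholder JDBC options).
    // Without cache() it is re-read on every micro-batch; with cache() it is never refreshed.
    val referenceData = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/refdb")
      .option("dbtable", "reference_table")
      .load()

    // Streaming side (placeholder Kafka source; key/value cast to strings).
    val stream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

    // Stream-static join; the whole plan from source to sink is fixed here.
    val query = stream.join(referenceData, "id")
      .writeStream
      .format("console") // placeholder sink
      .start()

    query.awaitTermination()
  }
}
```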