We have a structured streaming job that processes a stream of events. It
needs to perform aggregation while maintaining state, for which we are
using flatMapGroupsWithState.

It also needs to load some domain data that needs to be refreshed
periodically. To refresh domain data, we are using a solution of query
restart that Tathagata suggested in this thread:
http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3CCA%2BAHuKn%2BvSEWkJD%3DbSSt6G5bDZDaS6wmN%2Bfwmn4jTm1X1nDAPA%40mail.gmail.com%3E


This works for the domain data refresh, however, on query restart, the
state maintained in  flatMapGroupsWithState is flushed.
Is there a way to retain the state on query refresh?

One way I am thinking is to split the job into two jobs to separate the
concerns of domain data refresh and state based processing. Does this make
sense? Are there other thoughts on solving this?

Thanks,
Ashutosh Joshi

Reply via email to