I am investigating Flink. I am considering a relatively simple use case -- I want to ingest streams of events that are essentially timestamped state changes. These events may look something like:
```
{ sourceId: 111, state: OPEN, timestamp: <date/time> }
```

I want to apply various processing to these state change events, the output of which can be used for analytics. For example:

1. average time spent in state, by state
2. sources with the longest (or shortest) time spent in the OPEN state

The time spent in each state may be days or even weeks, yet all the examples I have seen of similar logic involve windows on the order of 15 minutes. Since the time spent in a state may far exceed these window sizes, I'm wondering what the best approach would be.

One thought from reading the docs was to use `every` to operate on the entire stream. But it seems like this will take longer and longer to run as the event stream grows, so it is not an ideal solution. Or does Flink apply some clever optimization to avoid the potential performance issue?

Another thought was to split the event stream into multiple streams by source, each of which would contain a small (and limited) amount of data. This would make processing each stream simpler, but since there can be thousands of sources, it would result in a lot of streams to handle and persist (probably in Kafka). This does not seem ideal either. (A rough sketch of the per-source idea is in the P.S. below.)

It seems like this should be simple, but I'm struggling to see how to solve it elegantly.

Regards,
Raman
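P.S. To make the per-source idea concrete, here is a minimal sketch of the kind of keyed processing I have in mind: key the stream by sourceId and, whenever a new state-change event arrives, emit how long that source spent in its previous state. All class and field names are placeholders I made up for illustration, not code from a real project.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class StateDurations {

    // Input event: { sourceId, state, timestamp (epoch millis) } -- placeholder POJO.
    public static class StateChangeEvent {
        public long sourceId;
        public String state;
        public long timestamp;
    }

    // Output record: how long a source spent in a given state.
    public static class StateDuration {
        public long sourceId;
        public String state;
        public long durationMillis;

        public StateDuration() {}

        public StateDuration(long sourceId, String state, long durationMillis) {
            this.sourceId = sourceId;
            this.state = state;
            this.durationMillis = durationMillis;
        }
    }

    public static class DurationFunction
            extends KeyedProcessFunction<Long, StateChangeEvent, StateDuration> {

        // Last state-change event seen for this sourceId (Flink scopes this per key).
        private transient ValueState<StateChangeEvent> lastChange;

        @Override
        public void open(Configuration parameters) {
            lastChange = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("lastChange", StateChangeEvent.class));
        }

        @Override
        public void processElement(StateChangeEvent event,
                                   Context ctx,
                                   Collector<StateDuration> out) throws Exception {
            StateChangeEvent previous = lastChange.value();
            if (previous != null) {
                // Time spent in the previous state, even if it spans days or weeks.
                out.collect(new StateDuration(
                        event.sourceId,
                        previous.state,
                        event.timestamp - previous.timestamp));
            }
            lastChange.update(event);
        }
    }
}
```

I would apply it with something like `events.keyBy(e -> e.sourceId).process(new StateDurations.DurationFunction())`. This is roughly what I mean by "splitting by source", though I'm not sure whether keyed state like this is the idiomatic way to handle long-lived states in Flink, which is really my question.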