I am investigating Flink. I am considering a relatively simple use case -- I want to ingest streams of events that are essentially timestamped state changes. These events may look something like:
```
{ sourceId: 111, state: OPEN, timestamp: <date/time> }
```

I want to apply various processing to these state change events, the output of which can be used for analytics. For example:

1. average time spent in state, by state
2. sources with the longest (or shortest) time spent in the OPEN state

The time spent in each state may be days or even weeks, yet all the examples I have seen of similar logic involve windows on the order of 15 minutes. Since the time spent in a state may far exceed these window sizes, I'm wondering what the best approach would be.

One thought from reading the docs was to use `every` to operate on the entire stream. But it seems like this will take longer and longer to run as the event stream grows, so it is not an ideal solution. Or does Flink apply some clever optimization to avoid the potential performance issue?

Another thought was to split the event stream into multiple streams by source, each of which would contain a small (and limited) amount of data. This would make processing each stream simpler, but since there can be thousands of sources, it would result in a lot of streams to handle and persist (probably in Kafka). This does not seem ideal either. (A rough sketch of the per-source idea is in the P.S. below.)

It seems like this should be simple, but I'm struggling to see how to solve it elegantly.

Regards,
Raman
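P.S. To make the per-source idea concrete, here is a minimal sketch of the kind of keyed processing I have in mind: key the stream by sourceId and, whenever a new state-change event arrives, emit how long that source spent in its previous state. All class and field names are placeholders I made up for illustration, not code from a real project.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class StateDurations {

    // Input event: { sourceId, state, timestamp (epoch millis) } -- placeholder POJO.
    public static class StateChangeEvent {
        public long sourceId;
        public String state;
        public long timestamp;
    }

    // Output record: how long a source spent in a given state.
    public static class StateDuration {
        public long sourceId;
        public String state;
        public long durationMillis;

        public StateDuration() {}

        public StateDuration(long sourceId, String state, long durationMillis) {
            this.sourceId = sourceId;
            this.state = state;
            this.durationMillis = durationMillis;
        }
    }

    public static class DurationFunction
            extends KeyedProcessFunction<Long, StateChangeEvent, StateDuration> {

        // Last state-change event seen for this sourceId (Flink scopes this per key).
        private transient ValueState<StateChangeEvent> lastChange;

        @Override
        public void open(Configuration parameters) {
            lastChange = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("lastChange", StateChangeEvent.class));
        }

        @Override
        public void processElement(StateChangeEvent event,
                                   Context ctx,
                                   Collector<StateDuration> out) throws Exception {
            StateChangeEvent previous = lastChange.value();
            if (previous != null) {
                // Time spent in the previous state, even if it spans days or weeks.
                out.collect(new StateDuration(
                        event.sourceId,
                        previous.state,
                        event.timestamp - previous.timestamp));
            }
            lastChange.update(event);
        }
    }
}
```

I would apply it with something like `events.keyBy(e -> e.sourceId).process(new StateDurations.DurationFunction())`. This is roughly what I mean by "splitting by source", though I'm not sure whether keyed state like this is the idiomatic way to handle long-lived states in Flink, which is really my question.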