I am working on a new application to perform real-time anomaly detection
using Flink. I am relatively new to Flink, but I already have one fairly
basic, low-throughput application in production. This next one will be
more complex and much higher throughput.

My query is about handling late arriving data. For this application, the
source data is like this:

   - zip files, each containing a single JSON file, are pushed to an S3
   bucket by the many servers I need to monitor (N servers spread across M
   pools)
   - S3 then publishes an SQS event for each new zip file that arrives
   - I have a Flink source that reads each event and pulls the S3 file as
   it arrives, stream-unzips it, deserializes the JSON, flat-maps its
   contents into a DataStream, and then processes the data in 60-second
   tumbling windows

Each file can contain up to 300 seconds' worth of metrics or 1000
time-series records, whichever limit is reached first. When a server is
handling a lot of connections, the file grows quickly and is closed as
soon as the 1000-sample threshold is hit. When a server's traffic is low,
it emits the file when the 300-second elapsed-time threshold is reached,
regardless of how many samples are in the file (so in this case the file
will have 0 <= samples <= 1000).
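The file-close policy above can be sketched in a few lines; this is only an illustration of the rule described, and the names (should_close, MAX_SAMPLES, MAX_AGE_SECONDS) are my own, not from any real API:

```python
# Assumed thresholds from the description: a server closes and uploads
# its metrics file when it holds 1000 samples OR 300 seconds have
# elapsed since the file was opened, whichever comes first.
MAX_SAMPLES = 1000
MAX_AGE_SECONDS = 300

def should_close(sample_count: int, age_seconds: float) -> bool:
    """Return True when the file must be closed and shipped to S3."""
    return sample_count >= MAX_SAMPLES or age_seconds >= MAX_AGE_SECONDS
```

This is what makes upload latency variable: a busy server hits the sample cap every 10-20 seconds, while a quiet one always waits the full 300 seconds.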

It is this pattern that I am struggling with. I need to aggregate these
metrics in one-minute tumbling windows. However, I may have to wait 300
seconds for a slow-traffic server's file to be uploaded to S3, while
files from higher-traffic servers are showing up every 10 or 20 seconds
(each holding the 1000 samples that triggered the file closure and
upload). Any of these files can contain a portion of the data that falls
into the same 60-second window. The data from all of these servers needs
to be aggregated together, because the servers belong to pools: I have to
group the records by POOL, not just by individual server.
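To make the grouping concrete, here is a plain-Python model (not PyFlink; the record shape (pool_id, event_ts, value) is an assumption for illustration) of keying by pool and assigning records to epoch-aligned 60-second tumbling windows, the same alignment Flink's TumblingEventTimeWindows use:

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(event_ts: float) -> int:
    # Tumbling windows aligned to the epoch: 0-60, 60-120, ...
    return int(event_ts) - int(event_ts) % WINDOW_SECONDS

def group_by_pool_and_window(records):
    """records: iterable of (pool_id, event_ts, value) tuples.

    Returns {(pool_id, window_start): [values]} so that data from many
    servers (and many files) in the same pool lands in the same bucket.
    """
    groups = defaultdict(list)
    for pool_id, event_ts, value in records:
        groups[(pool_id, window_start(event_ts))].append(value)
    return dict(groups)
```

The point is that two files uploaded minutes apart can still contribute values to the same (pool, window) key, which is exactly why allowed lateness matters here.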

So, my questions given that context are:

   1. Is a 60-second tumbling window with 300 seconds of allowed lateness
   (where the lateness is a multiple of the window size rather than a
   fraction of it) a fairly common pattern that is typically implemented in
   Flink? My sense is that this IS a reasonable problem that Flink can
   handle.
   2. I believe that with such an approach there will potentially be 5-6
   60-second windows in flight per grouping key to accommodate the allowed
   lateness, which means significantly more state for the cluster to hold
   and therefore more resources at a given cluster scale. Do I have that
   assumption correct?
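The window count in question 2 is just back-of-the-envelope arithmetic: a window's state is retained until the watermark passes the window end plus the allowed lateness, so per key you carry roughly lateness/window retained windows plus the one currently open. A sketch (my own helper, not a Flink API):

```python
import math

def max_windows_in_flight(window_seconds: int, lateness_seconds: int) -> int:
    # State for a closed window is kept until the watermark passes
    # window_end + allowed_lateness, so about lateness/window extra
    # windows are retained per key, plus the currently open one.
    return math.ceil(lateness_seconds / window_seconds) + 1
```

With a 60-second window and 300 seconds of lateness this gives 6 windows per key, matching the 5-6 estimate above.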

I just want to make sure I am considering the right tool for this job and
would appreciate any input.
