Hi,
Thanks for the answers!
Michal,
Your approach seems most appropriate for my case as it dedups *and* handles
late records. Your point on losing messages in the map upon restart after a
failure, is very valid. One way of handling this is to have checkpoints at
window-level.
Roughly speaking,
I drafted an implementation outline in kafka-streams to address the
problem of sliding-window reordering (to cater for late messages within
the time window), it also caters for de-duplication:
It is not latest and greatest; however, here is an Akka Streams GraphStage
implementation for deduplication:
https://squbs.readthedocs.io/en/latest/deduplicate/. All happens in
memory, so you need to watch for memory growing and potentially pass a
custom registry that self cleans after a while.
Hi,
I'm looking for the latest and greatest techniques and thoughts in stream
deduplication and would love to know if anyone here has done this at scale.
Specifically, I'm looking for deduping that also handles late-arriving
messages.
In the past few days of my search, I've mostly come across