I am currently looking into how Beam could support a live data collection platform: certain data is collected in real time and sent to Kafka, and we want to use Beam to calculate statistics and derived events from it.
One important requirement is handling amendment and deletion events. For example, we might receive an event that someone has performed an action, and from this we'd calculate how many of these actions they have taken in total. We'd also build calculations on top of that, for example top-10 rankings by these counts, or arbitrarily many layers of calculations beyond that. But some time later (anywhere from a few seconds to a week) we may receive an amendment to that action, indicating it was actually taken by a different person or from a different location. We then need Beam to recalculate all of our downstream stats, i.e. the counts need to change and the rankings need to be adjusted.

We could potentially solve this with Beam's stateful operators, storing all actions in state and always recalculating the stats from that state. But we're concerned this approach won't scale when events are frequent and we have to retain a large number of them.

An alternative is the differential dataflow approach described here: https://docs.rs/differential-dataflow/0.11.0/differential_dataflow/. They explain: "Once you have defined a differential dataflow computation, you may then add records to or remove records from its inputs; the system will automatically update the computation's outputs with the appropriate corresponding additions and removals, and report these changes to you."

Is anything like this achievable with Beam? Thanks!
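For concreteness, here is a rough plain-Python sketch (not Beam code; the names `apply_deltas` and `top_n` are made up for illustration) of the semantics we're after: an amendment arrives as a retraction (-1) plus an addition (+1), and the layered aggregations update incrementally instead of being recomputed from all stored events.

```python
from collections import defaultdict

def apply_deltas(counts, deltas):
    """Apply (user, +1/-1) deltas to per-user action counts in place."""
    for user, weight in deltas:
        counts[user] += weight
    return counts

def top_n(counts, n):
    """A derived calculation layered on the counts: top-n ranking."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

counts = defaultdict(int)

# Initial action events: alice acts twice, bob once.
apply_deltas(counts, [("alice", +1), ("alice", +1), ("bob", +1)])
assert top_n(counts, 2) == [("alice", 2), ("bob", 1)]

# Later an amendment arrives: one of alice's actions was really bob's.
# It is expressed as a retraction for alice and an addition for bob,
# and both the counts and the ranking adjust automatically.
apply_deltas(counts, [("alice", -1), ("bob", +1)])
assert top_n(counts, 2) == [("bob", 2), ("alice", 1)]
```

The question is essentially whether Beam can propagate those +1/-1 changes through a pipeline for us the way differential dataflow does.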
