I am currently looking into how Beam can support a live data collection 
platform. We want to collect certain data in real-time. This data will be sent 
to Kafka and we want to use Beam to calculate statistics and derived events 
from it.

An important thing we need to be able to handle is amendment or deletion 
events. For example, we may get an event that someone has performed an action 
and from this we'd calculate how many of these actions they had taken in total. 
We'd also build calculations on top of that, for example top 10 rankings by 
these counts, or arbitrarily many layers of calculations beyond that. But 
sometime later (this could be a few seconds or a week) we receive an amendment 
event to that action. This indicates that the action was taken by a different 
person or from a different location. We then need Beam to recalculate all of 
our downstream stats i.e. the counts need to be changed and rankings need to be 
adjusted.

We could potentially solve this problem by using Beam's stateful operators to 
store all actions in state and then we always calculate the stats from this 
state. But we're concerned this approach won't scale well when we have lots of 
events and we need to store lots of events that happen often.

An alternative is the differential dataflow approach described here: 
https://docs.rs/differential-dataflow/0.11.0/differential_dataflow/. They 
explain:

"Once you have defined a differential dataflow computation, you may then add 
records to or remove records from its inputs; the system will automatically 
update the computation's outputs with the appropriate corresponding additions 
and removals, and report these changes to you."

Is anything like this achievable with Beam? Thanks!

Reply via email to