Re: Best approach for recalculating statistics based on amended or deleted events?

Reza Rokni Sun, 02 Feb 2020 17:04:18 -0800

Hi,

So https://issues.apache.org/jira/browse/BEAM-91 would be nice... but not
there sadly.

Not ideal but some thoughts:

With regards to state, the time scale ( for example you mentioned a week )
could be problematic if the events per key is large. Could you create an
aggregation value for your events, or do all events need to be available
when doing your recalculation? So a count would be ok I assume to keep as 1
hour aggregates ( 1 being arbitrary of course ). Things like unique are
only possible if you are ok with < 100% accuracy by using algorithms like
HyperLogLog and state sketches. Beam Docs HLL
<https://beam.apache.org/releases/javadoc/2.18.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
.

The other option is that you keep x min / hours / days  in state but also
push out information to a backing database, for the longer periods. So if
redaction is for 3 weeks ago, then in the Stateful DoFn read that keys data
from a backing database and recomput. Assuming most updates are within the
smaller timeframes, then they should be dealt with by the data in the State
( cache hit ) rather than off state ( cache miss ) .

Cheers

Reza

On Thu, 30 Jan 2020 at 22:31, Stephen Young <[email protected]>
wrote:

> I am currently looking into how Beam can support a live data collection
> platform. We want to collect certain data in real-time. This data will be
> sent to Kafka and we want to use Beam to calculate statistics and derived
> events from it.
>
> An important thing we need to be able to handle is amendment or deletion
> events. For example, we may get an event that someone has performed an
> action and from this we'd calculate how many of these actions they had
> taken in total. We'd also build calculations on top of that, for example
> top 10 rankings by these counts, or arbitrarily many layers of calculations
> beyond that. But sometime later (this could be a few seconds or a week) we
> receive an amendment event to that action. This indicates that the action
> was taken by a different person or from a different location. We then need
> Beam to recalculate all of our downstream stats i.e. the counts need to be
> changed and rankings need to be adjusted.
>
> We could potentially solve this problem by using Beam's stateful operators
> to store all actions in state and then we always calculate the stats from
> this state. But we're concerned this approach won't scale well when we have
> lots of events and we need to store lots of events that happen often.
>
> An alternative is the differential dataflow approach described here:
> https://docs.rs/differential-dataflow/0.11.0/differential_dataflow/. They
> explain:
>
> "Once you have defined a differential dataflow computation, you may then
> add records to or remove records from its inputs; the system will
> automatically update the computation's outputs with the appropriate
> corresponding additions and removals, and report these changes to you."
>
> Is anything like this achievable with Beam? Thanks!
>

Re: Best approach for recalculating statistics based on amended or deleted events?

Reply via email to