Thank you Reza. That was very helpful!

On 2020/02/03 01:03:18, Reza Rokni <[email protected]> wrote: 
> Hi,
> 
> So https://issues.apache.org/jira/browse/BEAM-91 would be nice... but not
> there sadly.
> 
> Not ideal but some thoughts:
> 
> With regards to state, the time scale ( for example you mentioned a week )
> could be problematic if the events per key is large. Could you create an
> aggregation value for your events, or do all events need to be available
> when doing your recalculation? So a count would be ok I assume to keep as 1
> hour aggregates ( 1 being arbitrary of course ). Things like unique are
> only possible if you are ok with < 100% accuracy by using algorithms like
> HyperLogLog and state sketches. Beam Docs HLL
> <https://beam.apache.org/releases/javadoc/2.18.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
> .
> 
> The other option is that you keep x min / hours / days  in state but also
> push out information to a backing database, for the longer periods. So if
> redaction is for 3 weeks ago, then in the Stateful DoFn read that keys data
> from a backing database and recomput. Assuming most updates are within the
> smaller timeframes, then they should be dealt with by the data in the State
> ( cache hit ) rather than off state ( cache miss ) .
> 
> Cheers
> 
> Reza
> 
> On Thu, 30 Jan 2020 at 22:31, Stephen Young <[email protected]>
> wrote:
> 
> > I am currently looking into how Beam can support a live data collection
> > platform. We want to collect certain data in real-time. This data will be
> > sent to Kafka and we want to use Beam to calculate statistics and derived
> > events from it.
> >
> > An important thing we need to be able to handle is amendment or deletion
> > events. For example, we may get an event that someone has performed an
> > action and from this we'd calculate how many of these actions they had
> > taken in total. We'd also build calculations on top of that, for example
> > top 10 rankings by these counts, or arbitrarily many layers of calculations
> > beyond that. But sometime later (this could be a few seconds or a week) we
> > receive an amendment event to that action. This indicates that the action
> > was taken by a different person or from a different location. We then need
> > Beam to recalculate all of our downstream stats i.e. the counts need to be
> > changed and rankings need to be adjusted.
> >
> > We could potentially solve this problem by using Beam's stateful operators
> > to store all actions in state and then we always calculate the stats from
> > this state. But we're concerned this approach won't scale well when we have
> > lots of events and we need to store lots of events that happen often.
> >
> > An alternative is the differential dataflow approach described here:
> > https://docs.rs/differential-dataflow/0.11.0/differential_dataflow/. They
> > explain:
> >
> > "Once you have defined a differential dataflow computation, you may then
> > add records to or remove records from its inputs; the system will
> > automatically update the computation's outputs with the appropriate
> > corresponding additions and removals, and report these changes to you."
> >
> > Is anything like this achievable with Beam? Thanks!
> >
> 

Reply via email to