Thank you Reza. That was very helpful!
On 2020/02/03 01:03:18, Reza Rokni <[email protected]> wrote: > Hi, > > So https://issues.apache.org/jira/browse/BEAM-91 would be nice... but not > there sadly. > > Not ideal but some thoughts: > > With regards to state, the time scale ( for example you mentioned a week ) > could be problematic if the events per key is large. Could you create an > aggregation value for your events, or do all events need to be available > when doing your recalculation? So a count would be ok I assume to keep as 1 > hour aggregates ( 1 being arbitrary of course ). Things like unique are > only possible if you are ok with < 100% accuracy by using algorithms like > HyperLogLog and state sketches. Beam Docs HLL > <https://beam.apache.org/releases/javadoc/2.18.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html> > . > > The other option is that you keep x min / hours / days in state but also > push out information to a backing database, for the longer periods. So if > redaction is for 3 weeks ago, then in the Stateful DoFn read that keys data > from a backing database and recomput. Assuming most updates are within the > smaller timeframes, then they should be dealt with by the data in the State > ( cache hit ) rather than off state ( cache miss ) . > > Cheers > > Reza > > On Thu, 30 Jan 2020 at 22:31, Stephen Young <[email protected]> > wrote: > > > I am currently looking into how Beam can support a live data collection > > platform. We want to collect certain data in real-time. This data will be > > sent to Kafka and we want to use Beam to calculate statistics and derived > > events from it. > > > > An important thing we need to be able to handle is amendment or deletion > > events. For example, we may get an event that someone has performed an > > action and from this we'd calculate how many of these actions they had > > taken in total. We'd also build calculations on top of that, for example > > top 10 rankings by these counts, or arbitrarily many layers of calculations > > beyond that. But sometime later (this could be a few seconds or a week) we > > receive an amendment event to that action. This indicates that the action > > was taken by a different person or from a different location. We then need > > Beam to recalculate all of our downstream stats i.e. the counts need to be > > changed and rankings need to be adjusted. > > > > We could potentially solve this problem by using Beam's stateful operators > > to store all actions in state and then we always calculate the stats from > > this state. But we're concerned this approach won't scale well when we have > > lots of events and we need to store lots of events that happen often. > > > > An alternative is the differential dataflow approach described here: > > https://docs.rs/differential-dataflow/0.11.0/differential_dataflow/. They > > explain: > > > > "Once you have defined a differential dataflow computation, you may then > > add records to or remove records from its inputs; the system will > > automatically update the computation's outputs with the appropriate > > corresponding additions and removals, and report these changes to you." > > > > Is anything like this achievable with Beam? Thanks! > > >
