Using a bounded (batch-style) pipeline, you should be able to group all
events by user, ignore windowing completely, and produce whatever
information you need, since you'll have a global view of each user's
events. This scales well because a user's data is only held until it is
processed and can then be garbage collected. Building this pipeline
should be quick, and it should help you answer the question of when data
becomes irrelevant for a user.
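As a sketch of that per-user logic, here is what could run downstream of a GroupByKey in the batch pipeline, written as plain Python (the function name, event tuple shape, and output fields are illustrative, not a fixed Beam API):

```python
# Per-user attribution: given all of one user's events, attribute each
# share to the most recent prior purchase and carry running totals.
# This is the body you'd apply to each (user, events) group after a
# GroupByKey; plain Python, names are my own.

def attribute_shares(events):
    """events: iterable of (timestamp, kind, amount) for one user,
    where kind is 'purchase' or 'share'. Returns one attribution
    record per share."""
    results = []
    last_purchase = None
    total_spend = 0
    purchase_count = 0
    for ts, kind, amount in sorted(events):
        if kind == 'purchase':
            last_purchase = ts
            total_spend += amount
            purchase_count += 1
        elif kind == 'share' and last_purchase is not None:
            results.append({
                'share_ts': ts,
                'influenced_by': last_purchase,
                'total_prior_purchases': total_spend,
                'purchase_count': purchase_count,
            })
    return results

# The T0..T5 example from the question below:
events = [
    (0, 'purchase', 10), (1, 'purchase', 20), (2, 'purchase', 30),
    (3, 'share', 0), (4, 'purchase', 40), (5, 'share', 0),
]
print(attribute_shares(events))
```

On the T0..T5 example this attributes Share 1 (T3) to the purchase at T2 with 3 prior purchases, and Share 2 (T5) to the purchase at T4, matching the desired output.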
Using what you learn there, you should be able to build a pipeline over
an unbounded data source. You can use a very long window (such as a
calendar month/year, or a session window with a gap duration of many
months) and an early-firing trigger to produce intermediate output. If
you store the purchase data in state
(see https://beam.apache.org/blog/2017/02/13/stateful-processing.html),
you can correlate both kinds of events as they arrive. The windows will
close based on the windowing strategy you choose, and at that point you
can emit a final record saying that the user has become inactive, if you
so choose.
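A minimal sketch of that stateful pattern, again in plain Python: per-user state holds the most recent purchase and running totals, and each arriving share reads the state to emit an attribution immediately. In Beam this state would live in a StateSpec inside a stateful DoFn (per the blog post above); the class and field names here are my own, and this sketch assumes events arrive in order.

```python
# Simulated per-key state for one user. In a Beam stateful DoFn these
# fields would be state cells keyed by user rather than a Python object.
class PerUserState:
    def __init__(self):
        self.last_purchase = None
        self.total_spend = 0
        self.purchase_count = 0

def process_event(state, ts, kind, amount):
    """Handle one event for one user. A purchase updates state and
    returns None; a share reads state and returns an attribution."""
    if kind == 'purchase':
        state.last_purchase = ts
        state.total_spend += amount
        state.purchase_count += 1
        return None
    if kind == 'share' and state.last_purchase is not None:
        return {'share_ts': ts,
                'influenced_by': state.last_purchase,
                'total_prior_purchases': state.total_spend,
                'purchase_count': state.purchase_count}
    return None

state = PerUserState()
stream = [(0, 'purchase', 10), (1, 'purchase', 20), (2, 'purchase', 30),
          (3, 'share', 0), (4, 'purchase', 40), (5, 'share', 0)]
out = [r for e in stream for r in [process_event(state, *e)] if r]
print(out)
```

Note the in-order assumption: in a real streaming pipeline you would also have to decide how to handle late or out-of-order events, which is where the windowing strategy and triggers come in.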
On Mon, Aug 28, 2017 at 7:38 AM, Edward Hartwell Goose
wrote:
> Hi,
>
> I'm starting to learn about Apache Beam, and I'm curious whether our data
> sets fit into the model.
>
> We have a set of events occurring which we record by User, broadly
> simplified down to purchases and shares. In its simplest form: someone
> buying something and someone posting it on Facebook at some point
> afterwards.
>
> The events could occur potentially weeks apart - e.g. I purchase something
> today, 2 weeks later I have a good experience with the product and then
> share it on Facebook.
>
> I'd like to be able to identify the "influencing" event that triggered the
> share, which is most likely to be the most recent event prior to that
> share. For instance:
>
> T0: Purchase 1
> T1: Purchase 2
> T2: Purchase 3
> T3: Share 1
> T4: Purchase 4
> T5: Share 2
>
> I believe that the events T0 and T1 are likely to be influencing T3, but
> I'd like to broadly attribute T3 to T2, and ideally pass it to some sort of
> Combiner to be added to other data. Perhaps something like this at a first
> pass:
>
> User X, Event T3, Influenced by Purchase 3 at T2
> User X, Event T5, Influenced by Purchase 4 at T4
>
> I'd read/understood that if the window was long (e.g. > 24 hours) a lot of
> data has to be stored and held up and that causes problems. I'd be happy to
> have a cutoff of somewhere in the region of a few months, but certainly
> longer than 24 hours.
>
> For extra bonus points, I'd like to be able to say something like this too:
> User X, Event T3, Total Prior Purchases = £X, Total Number of Purchases = 3
>
> Is it possible to do that with Beam? Or is there an alternative way of
> solving that problem?
>
> If it's relevant, I'd most likely be using the batch processing model to
> start, and our dataset size is ~30-50 million users with around 100 million
> events (i.e. most users generate a small number of events).
>
> Thanks,
> Ed
>