Hi,

I'm starting to learn about Apache Beam, and I'm curious whether our data
sets fit into the model.

We record a set of events per user, broadly simplified down to
purchases and shares. In its simplest form: someone buys something and
then posts about it on Facebook at some point afterwards.

The events could occur weeks apart - e.g. I purchase something today,
have a good experience with the product, and then share it on Facebook
two weeks later.

I'd like to be able to identify the "influencing" event that triggered the
share, which is most likely to be the most recent event prior to that
share. For instance:

T0: Purchase 1
T1: Purchase 2
T2: Purchase 3
T3: Share 1
T4: Purchase 4
T5: Share 2

I believe that events T0 and T1 are also likely to have influenced T3,
but I'd like to broadly attribute T3 to T2 (the most recent prior
purchase), and ideally pass that attribution to some sort of Combiner
to be added to other data. Perhaps something like this at a first pass:

User X, Event T3, Influenced by Purchase 3 at T2
User X, Event T5, Influenced by Purchase 4 at T4
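
To make that concrete, here's a rough sketch of the kind of first pass
I have in mind, using the Python SDK in batch mode: group all events
per user, sort them by time, and walk through them keeping the most
recent purchase. The (user, timestamp, kind, amount) shape and the
sample values are just made up to mirror the T0-T5 example above - I
don't know whether this is idiomatic Beam.

import apache_beam as beam

# Hypothetical input mirroring the T0..T5 example above:
# (user, timestamp, kind, amount) - the fields are illustrative only.
SAMPLE_EVENTS = [
    ('user_x', 0, 'purchase', 10.0),   # T0: Purchase 1
    ('user_x', 1, 'purchase', 20.0),   # T1: Purchase 2
    ('user_x', 2, 'purchase', 5.0),    # T2: Purchase 3
    ('user_x', 3, 'share', None),      # T3: Share 1
    ('user_x', 4, 'purchase', 15.0),   # T4: Purchase 4
    ('user_x', 5, 'share', None),      # T5: Share 2
]


def attribute_shares(element):
    """For one user, pair each share with the most recent prior purchase."""
    user, events = element
    last_purchase = None
    for timestamp, kind, amount in sorted(events):
        if kind == 'purchase':
            last_purchase = (timestamp, amount)
        elif kind == 'share' and last_purchase is not None:
            yield (user, timestamp, last_purchase)


with beam.Pipeline() as pipeline:
    (pipeline
     | 'CreateEvents' >> beam.Create(SAMPLE_EVENTS)
     | 'KeyByUser' >> beam.Map(lambda e: (e[0], e[1:]))
     | 'GroupPerUser' >> beam.GroupByKey()
     | 'AttributeShares' >> beam.FlatMap(attribute_shares)
     | 'Print' >> beam.Map(print))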

I'd read that if the window is long (e.g. > 24 hours) a lot of data
has to be buffered, and that this can cause problems. I'd be happy with
a cutoff somewhere in the region of a few months, but it would
certainly need to be longer than 24 hours.
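
To make that cutoff concrete, I imagine it looking something like a
long per-user session gap, roughly along these lines (the 90-day gap
and the event shape are purely illustrative - I have no idea yet
whether this is the right tool):

import apache_beam as beam
from apache_beam.transforms import window

# Events as (user, event_time_seconds, kind) - illustrative fields only.
NINETY_DAYS = 90 * 24 * 60 * 60

with beam.Pipeline() as pipeline:
    (pipeline
     | 'CreateEvents' >> beam.Create([('user_x', 0, 'purchase')])
     # Use the event's own time as its Beam timestamp.
     | 'StampEventTime' >> beam.Map(
         lambda e: window.TimestampedValue(e, e[1]))
     # Assign session windows with a ~3-month inactivity gap.
     | 'LongSessions' >> beam.WindowInto(window.Sessions(NINETY_DAYS)))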

For extra bonus points, I'd like to be able to say something like this too:
User X, Event T3, Total Prior Purchases = £X, Total Number of Purchases = 3
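
Assuming the same per-user walk as in the sketch above, I imagine that
would just be a small variant of the attribution function, keeping
running totals alongside the "most recent purchase" state - something
like:

def attribute_with_totals(element):
    """As above, but also emit running totals of prior purchases."""
    user, events = element
    last_purchase = None
    total_value = 0.0
    purchase_count = 0
    for timestamp, kind, amount in sorted(events):
        if kind == 'purchase':
            last_purchase = (timestamp, amount)
            total_value += amount
            purchase_count += 1
        elif kind == 'share' and last_purchase is not None:
            yield (user, timestamp, last_purchase, total_value, purchase_count)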

Is it possible to do that with Beam? Or is there an alternative way of
solving that problem?

If it's relevant, I'd most likely be using the batch processing model to
start, and our dataset size is ~30-50 million users with around 100 million
events (i.e. most users generate a small number of events).

Thanks,
Ed
