Hi, I'm starting to learn about Apache Beam, and I'm curious whether our data sets fit into the model.
We have a set of events that we record per user, broadly simplified down to purchases and shares: in its simplest form, someone buying something and then posting about it on Facebook at some point afterwards. The events could occur weeks apart - e.g. I purchase something today, two weeks later I have a good experience with the product and then share it on Facebook. I'd like to be able to identify the "influencing" event that triggered the share, which is most likely to be the most recent event prior to that share. For instance:

T0: Purchase 1
T1: Purchase 2
T2: Purchase 3
T3: Share 1
T4: Purchase 4
T5: Share 2

The events at T0 and T1 may well also be influencing T3, but I'd like to broadly attribute T3 to T2 (the most recent prior purchase), and ideally pass the result to some sort of Combiner to be added to other data. Perhaps something like this at a first pass:

User X, Event T3, Influenced by Purchase 3 at T2
User X, Event T5, Influenced by Purchase 4 at T4

I'd read/understood that if the window is long (e.g. > 24 hours), a lot of data has to be buffered and held back, and that causes problems. I'd be happy with a cutoff somewhere in the region of a few months, but it certainly needs to be longer than 24 hours.

For extra bonus points, I'd like to be able to say something like this too:

User X, Event T3, Total Prior Purchases = £X, Total Number of Purchases = 3

Is it possible to do that with Beam? Or is there an alternative way of solving the problem? If it's relevant, I'd most likely start with the batch processing model, and our dataset is ~30-50 million users with around 100 million events (i.e. most users generate a small number of events). A rough sketch of the per-user logic I'm imagining is in the P.S. below.

Thanks,
Ed
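P.S. To make the per-user logic concrete, here's a rough sketch of the kind of batch pipeline I have in mind, using the Python SDK. The in-memory Create source, the Event tuple and the field names are just placeholders for whatever our real source and schema end up being; the part I care about is the GroupByKey per user followed by an in-memory sort, which I'm hoping is fine given that most users only generate a handful of events.

import apache_beam as beam
from collections import namedtuple

# Placeholder record type; kind is "purchase" or "share".
Event = namedtuple("Event", ["user", "timestamp", "kind", "amount"])

def attribute_shares(kv):
    """For one user, attribute each share to the most recent prior purchase,
    carrying along running totals of the purchases seen so far."""
    user, events = kv
    events = sorted(events, key=lambda e: e.timestamp)
    last_purchase = None
    purchase_count = 0
    purchase_total = 0.0
    for e in events:
        if e.kind == "purchase":
            last_purchase = e
            purchase_count += 1
            purchase_total += e.amount
        elif e.kind == "share" and last_purchase is not None:
            yield {
                "user": user,
                "share_ts": e.timestamp,
                "influencer_ts": last_purchase.timestamp,
                "total_prior_purchases": purchase_total,
                "number_of_prior_purchases": purchase_count,
            }

with beam.Pipeline() as p:  # DirectRunner by default; fine for a sketch
    (p
     | "ReadEvents" >> beam.Create([          # placeholder for the real source
           Event("X", 0, "purchase", 10.0),
           Event("X", 1, "purchase", 20.0),
           Event("X", 2, "purchase", 5.0),
           Event("X", 3, "share", 0.0),
           Event("X", 4, "purchase", 7.5),
           Event("X", 5, "share", 0.0),
       ])
     | "KeyByUser" >> beam.Map(lambda e: (e.user, e))
     | "GroupByUser" >> beam.GroupByKey()
     | "AttributeShares" >> beam.FlatMap(attribute_shares)
     | "Print" >> beam.Map(print))

Running that over the toy data above prints attribution rows for T3 (influenced by the purchase at T2, with 3 prior purchases totalling 35.0) and T5 (influenced by the purchase at T4, with 4 prior purchases totalling 42.5), which is the shape of output I'd then want to combine further downstream.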
