Need help with designing a beam pipeline for data enrichment

Johannes Frey Thu, 31 Mar 2022 02:58:02 -0700

Hi Everybody,

I'm currently facing a problem and I'm not sure how to design a
pipeline for it using Apache Beam.
I'm batch processing data, about 300k entries per day. After doing
some aggregations, about 60k entries remain.


The issue that I'm facing now is that the entries from that batch may
be related to entries already processed at some time in the past and
if they are, I would need to fetch the already processed record from
the past and merge it with the new record.

To make matters worse, the "window" of that relationship might be
several years, so I can't just sideload the last few days' worth of
data and catch all the relationships. I would need to load all the
already processed entries on each batch run, which does not seem to
be a good idea ;-)

I also don't think that issuing 60k queries, one per new entry, to
fetch the relevant related entry from the past is a good idea. I could
try to "window" (batch) them though, grouping them by, let's say, 100
entries, and fire one query to fetch the 100 old entries for the
current 100 processed entries... that would at least reduce the number
of queries from 60k to 60k/100 = 600.
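
To make the batching idea concrete, here is a rough sketch in plain
Python (deliberately not Beam code, so it runs on its own). The names
`fetch_old_entries`, the `key` field and the in-memory `store` dict are
made up for illustration; in reality `fetch_old_entries` would be one
bulk query against wherever the processed entries live. In Beam I would
guess the same logic sits in a DoFn behind a batching transform such as
`BatchElements` or `GroupIntoBatches`, but that part is an assumption.

```python
from itertools import islice


def batched(entries, size):
    """Yield consecutive lists of at most `size` entries."""
    it = iter(entries)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def fetch_old_entries(store, keys):
    """Hypothetical stand-in for ONE bulk lookup (e.g. WHERE key IN (...)).

    `store` is just a dict here; a real version would query a database.
    """
    return {k: store[k] for k in keys if k in store}


def enrich(new_entries, store, batch_size=100):
    """Merge each new entry with its old record, one query per batch."""
    merged = []
    query_count = 0
    for batch in batched(new_entries, batch_size):
        old = fetch_old_entries(store, [e["key"] for e in batch])
        query_count += 1
        for entry in batch:
            previous = old.get(entry["key"], {})
            # Fields from the new entry win over fields of the old record.
            merged.append({**previous, **entry})
    return merged, query_count
```

With 60k new entries and `batch_size=100` this issues 600 lookups
instead of 60k. The merge rule (new fields overwrite old ones) is only a
placeholder for whatever merge is actually needed.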

Are there any other good ways to solve issues like that? I would
imagine that those situations should be quite common. Maybe there are
some best practices around this issue.

It's basically enriching already processed entries with information
from new entries.

Would be great if someone could point me in the right direction or
give me some more keywords that I can google.

Thanks and regards
Jo
