Hi there, I'm currently evaluating Apache Beam as a stream processing engine for my current project and hope you can help me out with some architectural questions. Let me first describe what I'm trying to do:
I have lots of IoT sensor data coming in from an Azure EventHub. For each message, I can uniquely identify the device it came from. For most of the devices, I have a processing "job description" which always has a format like this: "If the event timestamp from that device is in [2017-01-01; 2017-12-31], slice that interval into 60-minute 'buckets' and compute an average aggregation over each bucket. When? At the end of the current bucket window in processing time, as well as for each (indefinitely) late arrival, write the aggregation result out to my time series database."

Note that thanks to my time series database, I don't need to keep state for very late arrivals; I can instead do the (possibly expensive) late-arrival re-aggregation with a query against the time series database. Thus, I think of using Beam in a kind of "trigger-only" mode for very late arrivals, so that I don't keep too much state, holding state only within a watermark of a few windows (into which 99.999% of the event data will fall). Sketch 1 in the P.S. below shows how I would express this in the API.

My problem here is that not only can the interval change for each job definition, but the aggregation period may also change arbitrarily. I can have device jobs with 13-minute aggregation windows, with 17-minute aggregation windows, and basically with any user-set aggregation period.

My questions are the following:

1. Am I right that I should define one large job which takes in all events from the EventHub, groups them by my device key, and does a processing job per device?

2. Am I capable of defining flexible window sizes per "device task"? Currently this looks not possible to me, as the window size seems to be static in a job definition.

2.a) If that is not possible, am I able to work around the issue, e.g. by defining a separate job per distinct aggregation period? Will I face major performance issues with that? (Sketch 2 in the P.S. shows the only in-pipeline workaround I could come up with.)

3. What would be the best practice to store and query my infrequently changing "device job definitions"? I can have millions of jobs, but only a very few of them will change each day. I plan to store those definitions in a SQL database. Should I make changes to them directly accessible to the pipeline via a side input (sketch 3 in the P.S.)? Or should I put them in some sort of cache like Redis?

I am looking forward to hearing from you.

Best regards
Theo
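P.S. To make the questions more concrete, here are rough sketches of what I currently have in mind. All names, types, and parameters in them are made up by me, not taken from any real codebase.

Sketch 1 (the 60-minute bucket job): a minimal sketch assuming a PCollection<KV<String, Double>> keyed by device id; the method name, the Double value type, and the 2-day allowed lateness are placeholders. As far as I can tell, Beam requires a finite allowed lateness, so the truly indefinite late arrivals would be exactly the ones I handle via re-aggregation in the time series database.

    import org.apache.beam.sdk.transforms.Mean;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    static PCollection<KV<String, Double>> hourlyDeviceAverages(
        PCollection<KV<String, Double>> deviceReadings) {
      return deviceReadings
          .apply(Window.<KV<String, Double>>into(
                  FixedWindows.of(Duration.standardMinutes(60)))
              // Fire once when the watermark passes the end of the bucket,
              // then once per late element ("trigger-only" for late data).
              .triggering(AfterWatermark.pastEndOfWindow()
                  .withLateFirings(AfterPane.elementCountAtLeast(1)))
              // Placeholder bound; anything later than this would be
              // re-aggregated from the time series database, not Beam state.
              .withAllowedLateness(Duration.standardDays(2))
              // Late panes then contain only the late elements; the consumer
              // of each firing writes/merges into the time series database.
              .discardingFiredPanes())
          .apply(Mean.perKey());
    }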
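Sketch 2 (question 2): the only workaround I could come up with for per-device window sizes is a custom WindowFn, since AssignContext exposes the element, so window assignment could read the aggregation period from the element itself. Again just a sketch; the Reading type and its periodMinutes field are invented for illustration.

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
    import org.apache.beam.sdk.transforms.windowing.NonMergingWindowFn;
    import org.apache.beam.sdk.transforms.windowing.WindowFn;
    import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    // Invented element type: each reading carries its device's period,
    // looked up from the job definition before windowing.
    class Reading {
      String deviceId;
      int periodMinutes;
      double value;
    }

    class PerDeviceFixedWindows extends NonMergingWindowFn<Reading, IntervalWindow> {
      @Override
      public Collection<IntervalWindow> assignWindows(AssignContext c) {
        long period = Duration.standardMinutes(c.element().periodMinutes).getMillis();
        // Align the window start to a multiple of the device's period.
        long start = c.timestamp().getMillis()
            - Math.floorMod(c.timestamp().getMillis(), period);
        return Collections.singletonList(
            new IntervalWindow(new Instant(start), new Instant(start + period)));
      }

      @Override
      public boolean isCompatible(WindowFn<?, ?> other) {
        return other instanceof PerDeviceFixedWindows;
      }

      @Override
      public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
      }

      @Override
      public WindowMappingFn<IntervalWindow> getDefaultWindowMappingFn() {
        throw new UnsupportedOperationException("not meant for side inputs");
      }
    }

If I understand the grouping semantics correctly, a downstream GroupByKey groups by key and window together, so devices with different periods would never be mixed, since they are different keys anyway. Is that a sane direction, or am I abusing the API?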
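Sketch 3 (question 3): what I mean by making the definitions accessible via side input is a slowly updating side input that periodically re-reads the SQL table and republishes a map view. Here loadJobDefinitionsFromSql, JobSpec, and the 10-minute refresh interval are all placeholders of mine.

    import java.util.Collections;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.Latest;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.joda.time.Duration;

    // Placeholder for whatever a parsed job definition looks like.
    class JobSpec {}

    // Placeholder: in reality this would query the SQL database.
    static Map<String, JobSpec> loadJobDefinitionsFromSql() {
      return Collections.emptyMap();
    }

    static PCollectionView<Map<String, JobSpec>> jobDefinitionsView(Pipeline p) {
      return p
          // Tick periodically; each tick triggers a fresh read of the table.
          .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(10)))
          .apply(ParDo.of(new DoFn<Long, Map<String, JobSpec>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              c.output(loadJobDefinitionsFromSql());
            }
          }))
          .apply(Window.<Map<String, JobSpec>>into(new GlobalWindows())
              .triggering(Repeatedly.forever(
                  AfterProcessingTime.pastFirstElementInPane()))
              .discardingFiredPanes())
          // Keep only the newest map and expose it as a side input view.
          .apply(Latest.globally())
          .apply(View.asSingleton());
    }

The main pipeline would then attach this via ParDo#withSideInputs and read it with ProcessContext#sideInput. My worry is whether this scales to millions of definitions in a single map, which is why I also ask about Redis.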
