If I’m not mistaken, you could create a single PCollection from the Pub/Sub read operation and then apply three different windowing strategies in different “chains” of the graph, e.g.:
    PCollection<PubsubMessage> msgs = p.apply(PubsubIO.readMessages().fromSubscription(…));
    msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1)))).apply(allMyTransforms);
    msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5)))).apply(allMyTransforms);
    msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(60)))).apply(allMyTransforms);

Similarly, this could be done with a loop if preferred; a fuller sketch, including how to vary the BigTable column family per window, follows below the quoted message.

On Tue, Oct 4, 2022 at 14:15 Yi En Ong <[email protected]> wrote:

> Hi,
>
> I am trying to optimize my Apache Beam pipeline on Google Cloud Platform
> Dataflow, and I would really appreciate your help and advice.
>
> Background information: I am trying to read data from PubSub messages and
> aggregate them based on 3 time windows: 1 min, 5 min, and 60 min. These
> aggregations consist of summing, averaging, finding the maximum or
> minimum, etc. For example, for all data collected from 12:00 to 12:01, I
> want to aggregate it and write the output into BigTable's 1-min column
> family. And for all data collected from 12:00 to 12:05, I want to
> similarly aggregate it and write the output into BigTable's 5-min column
> family. The same goes for 60 min.
>
> The current approach I took is to have 3 separate Dataflow jobs (i.e. 3
> separate Beam pipelines), each one having a different window duration
> (1 min, 5 min, and 60 min). See
> https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html.
> The outputs of all 3 Dataflow jobs are written to the same BigTable, but
> to different column families. Other than that, the function and
> aggregations of the data are the same for the 3 jobs.
>
> However, this seems to be very computationally and cost inefficient, as
> the 3 jobs are essentially doing the same work, with the only differences
> being the window duration and the output column family.
>
> Some challenges and limitations we faced were that, from the
> documentation, it seems we are unable to create multiple windows of
> different periods in a single Dataflow job. Also, when we write the final
> data into BigTable, we have to define the table, column family, column,
> and rowkey. And unfortunately, the column family is a fixed property
> (i.e. it cannot be redefined or changed given the window period).
>
> Hence, I am writing to ask if there is a way to use only 1 Dataflow job
> that fulfils the objective of this project, which is to aggregate data on
> different window periods and write the results to different column
> families of the same BigTable.
>
> Thank you
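Regarding the fixed column family: with BigtableIO.write() the family name is carried on each SetCell mutation rather than configured on the sink, so a single job can write each windowed branch to its own family in the same table. Below is a minimal sketch of what the single-job layout might look like. Note the placeholders: the subscription and project/instance/table IDs, the "rowkey,value" payload format in ParseMessage, and Sum.doublesPerKey() standing in for your real aggregations are all illustrative assumptions, not part of your pipeline.

    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import com.google.bigtable.v2.Mutation;
    import com.google.protobuf.ByteString;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.IterableCoder;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
    import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class MultiWindowPipeline {

      // Placeholder parse step: assumes each Pub/Sub payload is "rowkey,value".
      static class ParseMessage extends DoFn<PubsubMessage, KV<String, Double>> {
        @ProcessElement
        public void process(ProcessContext c) {
          String[] parts =
              new String(c.element().getPayload(), StandardCharsets.UTF_8).split(",");
          c.output(KV.of(parts[0], Double.parseDouble(parts[1])));
        }
      }

      // Turns an aggregated (rowkey, value) pair into a SetCell mutation whose
      // column family is chosen per windowed branch, not per job.
      static class ToBigtableMutations
          extends DoFn<KV<String, Double>, KV<ByteString, Iterable<Mutation>>> {
        private final String columnFamily;

        ToBigtableMutations(String columnFamily) {
          this.columnFamily = columnFamily;
        }

        @ProcessElement
        public void process(ProcessContext c) {
          Mutation m = Mutation.newBuilder()
              .setSetCell(Mutation.SetCell.newBuilder()
                  .setFamilyName(columnFamily)
                  .setColumnQualifier(ByteString.copyFromUtf8("value"))
                  .setValue(ByteString.copyFromUtf8(
                      String.valueOf(c.element().getValue())))
                  .setTimestampMicros(c.timestamp().getMillis() * 1000))
              .build();
          c.output(KV.of(ByteString.copyFromUtf8(c.element().getKey()),
              Collections.singletonList(m)));
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Read and parse once; all three branches share this PCollection.
        PCollection<KV<String, Double>> values = p
            .apply("ReadPubsub", PubsubIO.readMessages()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            .apply("Parse", ParDo.of(new ParseMessage()));

        // One branch per window duration; only the window size and the target
        // column family differ between branches.
        for (int minutes : new int[] {1, 5, 60}) {
          String family = minutes + "min";  // e.g. families "1min", "5min", "60min"
          values
              .apply(family + "Window",
                  Window.into(FixedWindows.of(Duration.standardMinutes(minutes))))
              .apply(family + "Aggregate", Sum.doublesPerKey())  // stand-in aggregation
              .apply(family + "ToMutations", ParDo.of(new ToBigtableMutations(family)))
              // Explicit coder, in case inference can't find one for the proto type.
              .setCoder(KvCoder.of(ByteStringCoder.of(),
                  IterableCoder.of(ProtoCoder.of(Mutation.class))))
              .apply(family + "Write", BigtableIO.write()
                  .withProjectId("my-project")
                  .withInstanceId("my-instance")
                  .withTableId("my-table"));
        }

        p.run();
      }
    }

Each branch gets a unique transform-name prefix so the three applications of the same steps don't collide, and the one read/parse stage is shared, which is what saves the duplicated work of the three separate jobs.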
