Hi all!!

I want to create a job that reads data from Pub/Sub and then joins it with
another collection (a side input) read from a BigQuery table (actually the
schemas). The thing I'm not 100% sure about is what the best way would be to
update that side input, since those schemas may change at any given point in
time. I thought of using GenerateSequence + FixedWindows, but I was
wondering whether that is the best approach, as it feels a bit "hacky" to me.

The idea would be something like this:

val schemas = inputPipeline.context.customInput("Schemas",
  GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)))
  .withFixedWindows(Duration.standardMinutes(1))
  .flatMap { _ =>
    log.info("Retrieving schemas...")
    Seq("Table1", "Table2").map(n => (n, getBQSchema(table, n)))
  }.asMapSideInput
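
To make the intent concrete, here is a minimal sketch of what each tick of the
GenerateSequence would compute, with a stubbed getBQSchema (the real one would
call the BigQuery API; the names and return type here are just illustrative):

```scala
// Sketch only: models the flatMap body from the pipeline above.
// On every GenerateSequence tick, one (name, schema) pair is emitted
// per table, and the pairs are later viewed as a Map side input.
object SchemaRefresh {

  // Hypothetical stand-in for the real BigQuery schema lookup.
  def getBQSchema(dataset: String, table: String): String =
    s"schema-of-$dataset.$table" // placeholder payload

  // What one window tick produces: a fresh name -> schema map.
  def refresh(dataset: String, tables: Seq[String]): Map[String, String] =
    tables.map(n => (n, getBQSchema(dataset, n))).toMap
}
```

So each fixed window would replace the previous map wholesale, which is why
the schemas can change between ticks without restarting the job.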

In case this makes sense, I have another question: how would this behave
with many workers? Will each worker actually retrieve the schemas, or will
the fetch be run by one worker and then "broadcast" to the others?

Thanks!
