Hi all!!
I want to create a job that reads data from Pub/Sub and then joins it with
another collection (a side input) read from a BigQuery table (the table
schemas, actually). The thing I'm not 100% sure about is the best way to
update that side input, since those schemas may change at any point in
time. I thought of using GenerateSequence + FixedWindows, but I was
wondering whether that is really the best approach, as it seems a bit
"hacky" to me. The idea would be something like this:
val schemas = inputPipeline.context.customInput("Schemas",
    GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)))
  .withFixedWindows(Duration.standardMinutes(1))
  .flatMap { _ =>
    log.info("Retrieving schemas...")
    Seq("Table1", "Table2").map(n => (n, getBQSchema(n)))
  }
  .asMapSideInput
In case this makes sense, I have another question: how would this behave
with many workers? Will each worker actually retrieve the schemas, or will
the fetch run on one worker and then be "broadcast" to the rest?
Thanks!