Hi,
I am using the Python SDK with Dataflow as my runner. I want to implement a
streaming pipeline that continuously monitors a GCS bucket for incoming files
and, depending on which regex the file name matches, runs a set of transforms
and writes the final output to Parquet, one output per file read.
Apologies if this is basic: I am new to Beam, and especially to streaming. I
have done the above in batch mode, but we are limited by Dataflow allowing
only 100 jobs per 24 hours.
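To make the routing step concrete, this is roughly what I mean by choosing
transforms based on the file name. It is only a sketch: the patterns, the
labels and the route_for helper are made-up names for illustration, not
anything from Beam.

```python
import re

# Hypothetical routing table: each regex maps to a label that identifies
# which set of transforms should handle the file.
ROUTES = [
    (re.compile(r".*/orders_\d{8}\.csv$"), "orders"),
    (re.compile(r".*/customers_\d{8}\.csv$"), "customers"),
]


def route_for(path):
    """Return the label of the first pattern matching `path`, or None."""
    for pattern, label in ROUTES:
        if pattern.match(path):
            return label
    return None
```

Files that match no pattern return None, so they could be dropped or sent to
a separate dead-letter output.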
I was looking into a couple of options:
1. Have a Beam pipeline running in streaming mode, listening to a Pub/Sub
topic. Once a file lands in GCS, a message is published. I am planning to use
the WriteToFiles transform, but it seems to have a limitation: "it
currently does not have support for multiple trigger firings on the same
window."
* So what windowing strategy and trigger should I use?
* Which transform should I use? There are two ReadFromPubSub transforms,
one in the io.gcp subpackage and another in external.gcp.
2. Use the
TextIO.Read<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html>
and the
watchForNewFiles<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition->
from the Java SDK within a Python pipeline, as I understand there is some
support for cross-language transforms?
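For option 1, this is how I imagine turning each Pub/Sub message into a file
path before the routing step. It is a sketch that assumes the bucket
notifications use the JSON payload format, where the message body is the
object resource with "bucket" and "name" fields. The function name is my
own, and I am assuming the pipeline would hand it the raw message bytes.

```python
import json


def gcs_path_from_notification(payload):
    """Build a gs:// path from a GCS notification payload (bytes).

    Assumes the JSON payload format, i.e. the body is the object resource
    and carries at least the "bucket" and "name" fields.
    """
    obj = json.loads(payload.decode("utf-8"))
    return "gs://{}/{}".format(obj["bucket"], obj["name"])
```

In the pipeline I would expect to apply this in a Map step right after
reading from the topic.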
Regards,
Tarek