On Fri, Jun 12, 2020 at 12:52 AM TAREK ALSALEH <[email protected]> wrote:
> Hi, > > I am using the Python SDK with Dataflow as my runner. I am looking at > implementing a streaming pipeline that will continuously monitor a GCS > bucket for incoming files and depending on the regex of the file, launch a > set of transforms and write the final output back to parquet for each file > I read. > > Excuse me as I am new to beam and specially the streaming bit as I have > done the above in batch mode but we are limited by dataflow allowing only > 100 jobs per 24 hours. > > I was looking into a couple of options: > > 1. Have a beam pipeline running in streaming mode listening to a > pubsub topic. Once a file lands in GCS a message is published. I am > planning to use the WriteToFiles transform but it seems that there is > a limitation where :"it currently does not have support for multiple > trigger firings on the same window." > 1. So what windowing strategy and trigger should I use? > 2. Which transform should I use since there are two ReadFromPubSub > transforms one in the io.gcp Subpackages and another one in the > external.gcp? > > > 1. Using the TextIO.Read > > <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html> > and > the watchForNewFiles > > <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition-> > from > the Java SDK within a python pipeline as I understand there is some support > for cross-language transforms? > > Sounds like watchForNewFiles transform is exactly what you are looking for but we don't have that for Python SDK yet. Also above transforms haven't been tested with cross-language transforms framework yet and we don't have cross-language Python wrappers for these yet. Probably best solution for Python SDK today will be to use some sort of a GCS to Cloud Pub/Sub mapping to publish events regarding new files and read these files using a Beam pipeline that reads from Cloud Pub/Sub. For example following (I haven't tested this). https://cloud.google.com/storage/docs/pubsub-notifications For Dataflow use Pub/Sub connector in io/gcp submodule. Thanks, Cham > 1. > > > Regards, > Tarek >
