On Fri, Jun 12, 2020 at 12:52 AM TAREK ALSALEH <[email protected]>
wrote:

> Hi,
>
> I am using the Python SDK with Dataflow as my runner. I am looking at
> implementing a streaming pipeline that will continuously monitor a GCS
> bucket for incoming files and depending on the regex of the file, launch a
> set of transforms and write the final output back to parquet for each file
> I read.
>
> Excuse me as I am new to beam and specially the streaming bit as I have
> done the above in batch mode but we are limited by dataflow allowing only
> 100 jobs per 24 hours.
>
> I was looking into a couple of options:
>
>    1. Have a beam pipeline running in streaming mode listening to a
>    pubsub topic. Once a file lands in GCS a message is published. I am
>    planning to use the WriteToFiles transform but it seems that there is
>    a limitation where :"it currently does not have support for multiple
>    trigger firings on the same window."
>       1. So what windowing strategy and trigger should I use?
>       2. Which transform should I use since there are two ReadFromPubSub
>       transforms one in the io.gcp Subpackages and another one in the
>       external.gcp?
>
>
>    1. Using the TextIO.Read
>    
> <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html>
>  and
>    the watchForNewFiles
>    
> <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition->
>  from
>    the Java SDK within a python pipeline as I understand there is some support
>    for cross-language transforms?
>
>
Sounds like watchForNewFiles transform is exactly what you are looking for
but we don't have that for Python SDK yet.
Also above transforms haven't been tested with cross-language transforms
framework yet and we don't have cross-language Python wrappers for these
yet.

Probably best solution for Python SDK today will be to use some sort of a
GCS to Cloud Pub/Sub mapping to publish events regarding new files and read
these files using  a Beam pipeline that reads from Cloud Pub/Sub. For
example following (I haven't tested this).
https://cloud.google.com/storage/docs/pubsub-notifications

For Dataflow use Pub/Sub connector in io/gcp submodule.

Thanks,
Cham


>    1.
>
>
> Regards,
> Tarek
>

Reply via email to