Continuous Read pipeline

TAREK ALSALEH Fri, 12 Jun 2020 00:52:24 -0700

Hi,

I am using the Python SDK with Dataflow as my runner. I am looking at 
implementing a streaming pipeline that will continuously monitor a GCS bucket 
for incoming files and depending on the regex of the file, launch a set of 
transforms and write the final output back to parquet for each file I read.


Excuse me as I am new to beam and specially the streaming bit as I have done 
the above in batch mode but we are limited by dataflow allowing only 100 jobs 
per 24 hours.

I was looking into a couple of options:

  1.  Have a beam pipeline running in streaming mode listening to a pubsub 
topic. Once a file lands in GCS a message is published. I am planning to use 
the WriteToFiles transform but it seems that there is a limitation where :"it 
currently does not have support for multiple trigger firings on the same 
window."
     *   So what windowing strategy and trigger should I use?
     *   Which transform should I use since there are two ReadFromPubSub 
transforms one in the io.gcp Subpackages and another one in the external.gcp?
  2.  Using the 
TextIO.Read<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html>
 and the 
watchForNewFiles<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition->
 from the Java SDK within a python pipeline as I understand there is some 
support for cross-language transforms?

Regards,
Tarek

Continuous Read pipeline

Reply via email to