>From my understanding, you need the Pipeline for mainly two things: 1. Marking the start of any processing flows (it serves as the PBegin "PCollection") so any sources that follows it will run. 2. Running / executing / deploying the pipeline -- this happens automatically with the context manager in your example, but otherwise you can run pipeline.run() to get the same effect.
On Mon, Jul 12, 2021 at 10:04 AM <[email protected]> wrote: > Hi, > > > When using the python sdk I'm a little confused as to when the pipeline > object is actually needed. I gather one needs it initially to create a > pcollection, just because this is when I most often see it consistently > used ex: > > > with beam.Pipeline() as pipeline: > > dict_pc = ( > > pipeline > > | beam.io.fileio.MatchFiles("./*.csv") > > | 'Read matched files' >> beam.io.fileio.ReadMatches() > > | 'Get CSV data as a dict' >> beam.FlatMap(my_csv_reader) > > ) > > > > # do stuff with dict_pc and other operations > > > But beyond this when do one need the pipeline object? It seems like the > transforms expect a pcollection and output a pcollection so I'm confused > and not finding documentation that addresses this. thank you. > > >
