Re: When to use Pipeline object

Alex Koay Sun, 11 Jul 2021 22:50:48 -0700

From my understanding, you need the Pipeline mainly for two things:
1. Marking the start of any processing flow (it serves as the PBegin
"PCollection"), so that any sources applied to it will run.
2. Running / executing / deploying the pipeline -- this happens
automatically with the context manager in your example, but otherwise you
can call pipeline.run() to get the same effect.


On Mon, Jul 12, 2021 at 10:04 AM <[email protected]> wrote:

> Hi,
>
>
> When using the Python SDK I'm a little confused as to when the pipeline
> object is actually needed. I gather one needs it initially to create a
> pcollection, just because this is where I most consistently see it
> used, e.g.:
>
>
> with beam.Pipeline() as pipeline:
>     dict_pc = (
>         pipeline
>         | beam.io.fileio.MatchFiles("./*.csv")
>         | 'Read matched files' >> beam.io.fileio.ReadMatches()
>         | 'Get CSV data as a dict' >> beam.FlatMap(my_csv_reader)
>     )
>
>     # do stuff with dict_pc and other operations
>
>
> But beyond this, when does one need the pipeline object?  It seems like the
> transforms expect a pcollection and output a pcollection, so I'm confused
> and not finding documentation that addresses this.  Thank you.
>
>
>
