I think what you need here is a composite PTransform, named something like ReadFromRedshift, whose expand method applies your two DoFns as chained ParDos.
A good example to follow here might be this code in the wordcount_debugging example:
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_debugging.py#L89>

Thanks,
Sourabh

On Thu, Jun 1, 2017 at 2:37 PM Dmitry Demeshchuk <[email protected]> wrote:

> Hi folks,
>
> I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0
> apparently isn't supported by Dataflow yet, but I'm trying to understand
> the 2.0.0 API better too.
>
> I've been trying to find a way of combining two or more DoFns into a
> single one, so that one doesn't have to repeat the same pattern over and
> over again.
>
> Specifically, my use case is getting data out of Redshift via the UNLOAD
> command <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>:
>
> 1. Connect to Redshift via the Postgres protocol and run the UNLOAD.
> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>    converting them into a PCollection.
>
> It's worth noting that Redshift generates multiple files, usually at
> least 10 or so; the exact number may depend on the number of cores of
> the Redshift instance, some settings, etc. Reading these files in
> parallel sounds like a good idea.
>
> So, it feels like this is just a combination of two FlatMaps:
>
> 1. SQL query -> list of S3 files
> 2. List of S3 files -> rows of data
>
> I could just create two DoFns for that and make people combine them, but
> that feels like overkill. Instead, one should just call ReadFromRedshift
> and not really care about what exactly happens under the hood.
>
> Plus, it feels like the ability to take somewhat complex pieces of the
> execution graph and encapsulate them into a single transform would be a
> nice capability.
>
> Are there any officially recommended ways to do that?
>
> Thank you.
>
> --
> Best regards,
> Dmitry Demeshchuk.
