I think what you need here is a composite PTransform, named something like ReadFromRedshift, whose expand method applies your two DoFns as chained ParDos.
A good example to follow here might be this code in the wordcount_debugging example:
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_debugging.py#L89>

Thanks,
Sourabh

On Thu, Jun 1, 2017 at 2:37 PM Dmitry Demeshchuk <[email protected]> wrote:

> Hi folks,
>
> I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0
> apparently isn't supported by Dataflow yet, but I'm trying to understand
> the 2.0.0 API better too.
>
> I've been trying to find a way of combining two or more DoFns into a
> single one, so that one doesn't have to repeat the same pattern over and
> over again.
>
> Specifically, my use case is getting data out of Redshift via the UNLOAD
> command <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>:
>
> 1. Connect to Redshift via the Postgres protocol and run the UNLOAD.
> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>    converting them into a PCollection.
>
> It's worth noting that Redshift generates multiple files, usually at
> least 10 or so; the exact number may depend on the number of cores of
> the Redshift instance, some settings, etc. Reading these files in
> parallel sounds like a good idea.
>
> So, it feels like this is just a combination of two FlatMaps:
>
> 1. SQL query -> list of S3 files
> 2. List of S3 files -> rows of data
>
> I could just create two DoFns for that and make people combine them, but
> that feels like overkill. Instead, one should just call ReadFromRedshift
> and not really care about what exactly happens under the hood.
>
> Plus, it feels like the ability to take somewhat complex pieces of the
> execution graph and encapsulate them into a single transform would be a
> nice capability.
>
> Are there any officially recommended ways to do that?
>
> Thank you.
>
> --
> Best regards,
> Dmitry Demeshchuk.
