On Thu, Jun 1, 2017 at 2:56 PM Dmitry Demeshchuk <[email protected]> wrote:
> Haha, thanks, Sourabh, you beat me to it :)
>
> On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <[email protected]>
> wrote:
>
>> Looks like the expand method should do the trick, similar to how it's
>> done in GroupByKey?
>>
>> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>>
>> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <[email protected]>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'm currently playing with the Python SDK, primarily 0.6.0, since 2.0.0
>>> is apparently not supported by Dataflow, but trying to understand the 2.0.0
>>> API better too.
>>>

I think Dataflow supports the 2.0.0 release. Did you find some documentation
that says otherwise?

- Cham

> I've been trying to find a way of combining two or more DoFn's into a
>>> single one, so that one doesn't have to repeat the same pattern over and
>>> over again.
>>>
>>> Specifically, my use case is getting data out of Redshift via the
>>> "UNLOAD" command:
>>>
>>> 1. Connect to Redshift via Postgres protocol and do the unload
>>> <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>>> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>>> converting them into a PCollection.
>>>
>>> It's worth noting here that Redshift generates multiple files, usually
>>> at least 10 or so; the exact number may depend on the number of cores of
>>> the Redshift instance, some settings, etc. Reading these files in parallel
>>> sounds like a good idea.
>>>
>>> So, it feels like this is just a combination of two FlatMaps:
>>> 1. SQL query -> list of S3 files
>>> 2. List of S3 files -> rows of data
>>>
>>> I could just create two DoFns for that and make people combine them, but
>>> that feels like overkill. Instead, one should just call ReadFromRedshift
>>> and not really care about what exactly happens under the hood.
>>>
>>> Plus, it just feels like the ability to take somewhat complex pieces
>>> of the execution graph and encapsulate them into a DoFn would be a nice
>>> capability.
>>>
>>> Are there any officially recommended ways to do that?
>>>
>>> Thank you.
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>
