Is it possible that you didn't install the GCP components when installing Beam? You have to do the following to install Beam with support for Dataflow:
pip install apache-beam[gcp]

Please file a JIRA if you find any issues.

Thanks,
Cham

On Thu, Jun 1, 2017 at 3:12 PM Dmitry Demeshchuk <[email protected]> wrote:

> I may be wrong on that, indeed.
>
> Originally, I couldn't even run the regular WordCount on version 2.0.0; it
> was coming down to some Beam-specific errors, and my reaction was "okay,
> this is probably too early, I'll go back to 0.6.0 for now".
>
> Also, when reading the code I sometimes see things like "this is meant
> only for DirectRunner" and such, so the degree of support of 2.0.0 by
> Dataflow is a bit unclear to me.
>
> On Thu, Jun 1, 2017 at 2:59 PM, Chamikara Jayalath <[email protected]>
> wrote:
>
>> On Thu, Jun 1, 2017 at 2:56 PM Dmitry Demeshchuk <[email protected]>
>> wrote:
>>
>>> Haha, thanks, Sourabh, you beat me to it :)
>>>
>>> On Thu, Jun 1, 2017 at 2:55 PM, Dmitry Demeshchuk <[email protected]>
>>> wrote:
>>>
>>>> Looks like the expand method should do the trick, similar to how it's
>>>> done in GroupByKey?
>>>>
>>>> https://github.com/apache/beam/blob/dc4acfdd1bb30a07a9c48849f88a67f60bc8ff08/sdks/python/apache_beam/transforms/core.py#L1104
>>>>
>>>> On Thu, Jun 1, 2017 at 2:37 PM, Dmitry Demeshchuk <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I'm currently playing with the Python SDK, primarily 0.6.0, since
>>>>> 2.0.0 is apparently not supported by Dataflow, but I'm trying to
>>>>> understand the 2.0.0 API better too.
>>>>>
>> I think Dataflow supports the 2.0.0 release. Did you find some
>> documentation that says otherwise?
>>
>> - Cham
>>
>>>>> I've been trying to find a way of combining two or more DoFns into a
>>>>> single one, so that one doesn't have to repeat the same pattern over
>>>>> and over again.
>>>>>
>>>>> Specifically, my use case is getting data out of Redshift via the
>>>>> "UNLOAD" command:
>>>>>
>>>>> 1. Connect to Redshift via the Postgres protocol and do the unload
>>>>>    <http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>.
>>>>> 2. Connect to S3 and fetch the files that Redshift unloaded there,
>>>>>    converting them into a PCollection.
>>>>>
>>>>> It's worth noting here that Redshift generates multiple files, usually
>>>>> at least 10 or so; the exact number may depend on the number of cores
>>>>> of the Redshift instance, some settings, etc. Reading these files in
>>>>> parallel sounds like a good idea.
>>>>>
>>>>> So, it feels like this is just a combination of two FlatMaps:
>>>>> 1. SQL query -> list of S3 files
>>>>> 2. List of S3 files -> rows of data
>>>>>
>>>>> I could just create two DoFns for that and make people combine them,
>>>>> but that feels like overkill. Instead, one should just call
>>>>> ReadFromRedshift and not really care about what exactly happens under
>>>>> the hood.
>>>>>
>>>>> Plus, it just feels like the ability to take somewhat complex pieces
>>>>> of the execution graph and encapsulate them into a DoFn would be a
>>>>> nice capability.
>>>>>
>>>>> Are there any officially recommended ways to do that?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Dmitry Demeshchuk.
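For reference, below is a minimal sketch of the expand-based composite transform the thread converges on, modelled on how GroupByKey is built in the linked source. run_unload and read_s3_file are hypothetical helpers (not part of Beam or any AWS library) standing in for the Postgres UNLOAD call and the S3 read:

import apache_beam as beam


# Hypothetical helpers, not part of Beam: stand-ins for the Postgres
# UNLOAD call and the S3 read described in the thread.
def run_unload(query):
    """Runs UNLOAD against Redshift; yields the S3 paths it wrote."""
    raise NotImplementedError


def read_s3_file(path):
    """Reads one unloaded S3 file; yields its rows."""
    raise NotImplementedError


class ReadFromRedshift(beam.PTransform):
    """Composite transform: UNLOAD to S3, then read the files in parallel."""

    def __init__(self, query):
        super(ReadFromRedshift, self).__init__()
        self.query = query

    def expand(self, pbegin):
        return (
            pbegin
            # A one-element seed so the UNLOAD statement runs exactly once.
            | 'Seed' >> beam.Create([self.query])
            # FlatMap 1: SQL query -> list of S3 files.
            | 'Unload' >> beam.FlatMap(run_unload)
            # FlatMap 2: S3 files -> rows of data, read in parallel.
            | 'ReadFiles' >> beam.FlatMap(read_s3_file))

Callers then just write p | ReadFromRedshift('SELECT ...') and don't need to care what happens under the hood, while the runner still sees both FlatMaps and can parallelize the per-file reads.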
