Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-02-12 Thread Hannah Jiang
Thanks everyone for supporting it.

Yes, it's very slow to get tickets resolved by infra. I propose a minor
improvement to reduce interactions with infra.

So far, we have granted maintainer permission (read & write) to release
managers' personal accounts. This step needs help from infra to add new
members to the group for every new release manager.
To avoid this, I propose that we create a new account for release
purposes only and share it with release managers. The new account will have
read & write permissions to all Apache Beam docker repositories. The password
will be shared on an as-needed basis, and we can change it
periodically if needed, which is in our control. Are there any concerns
I am not aware of with the shared-account approach?

Thanks,
Hannah


On Thu, Jan 16, 2020 at 10:41 AM Kenneth Knowles  wrote:

> +1 very nice explanation
>
> On Wed, Jan 15, 2020 at 1:57 PM Ahmet Altay  wrote:
>
>> +1 - Thank you for driving this!
>>
>> On Wed, Jan 15, 2020 at 1:55 PM Thomas Weise  wrote:
>>
>>> +1 for the namespace proposal.
>>>
>>> It is similar to github repos. Top-level is the org, then single level
>>> for repo (beam-abc, beam-xzy, ..)
>>>
>>>
>>>
>>> On Wed, Jan 15, 2020 at 1:45 PM Robert Bradshaw 
>>> wrote:
>>>
 Various tags of the same image should at least logically be the same
 thing, so I agree that we should not be trying to share a single
 repository in that way. A full suite of apache/beam-{image_desc}
 repositories, if apache is fine with that, seems like the best
 approach.

 On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver 
 wrote:
 >
 > +1, agree that moving current image name to tags is a non-starter.
 Thanks for driving this Hannah. Let us know what they say about repo
 creation.
 >
 > On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:
 >>
 >> SG +1
 >>
 >> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang <
 hannahji...@google.com> wrote:
 >>>
 >>> I have done some research about images released under apache
 namespace at docker hub, and here is my proposal.
 >>>
 >>> Currently, we are using apachebeam as our namespace and each image
 has its own repository. Version number is used to tag the images.
 >>> ie: apachebeam/python2.7_sdk:2.19.0,
 apachebeam/flink1.9_job_server:2.19.0
 >>>
 >>> Now we are migrating to apache namespace and docker hub doesn't
 support nested repository names, so we cannot use
 apache/beam/{image-desc}:{version}.
 >>> Instead, I propose to use apache/beam-{image_desc}:{version} as our
 repository name.
 >>> ie: apache/beam-python2.7_sdk:2.19.0,
 apache/beam-flink1.9_job_server:2.19.0
>>> => When a user searches for apache/beam at docker hub, it will list
all the repositories we deployed with apache/beam-, so there is no concern
that users will miss some released images.
>>> => Repository names give users insight into which repositories
they should use.
>>> => A downside of this approach is that we need to create a new
repository whenever we release a new image; the time and effort needed for this
is still to be determined. I am contacting the apache docker hub management team.
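
For illustration, a minimal sketch of the proposed naming scheme (the image
list and release helper here are hypothetical, not Beam's actual tooling):

images = ['python2.7_sdk', 'flink1.9_job_server']
version = '2.19.0'
for image_desc in images:
    # Docker hub has no nested repositories, so the image description is
    # folded into the repository name under the apache namespace.
    print('apache/beam-%s:%s' % (image_desc, version))
# -> apache/beam-python2.7_sdk:2.19.0
# -> apache/beam-flink1.9_job_server:2.19.0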
 >>>
>>> I have considered using beam as the repository name and moving the image
name and version to tags (ie: apache/beam:python3.7_sdk_2.19.0), which
means putting all images into a single repository; however, this approach has
some downsides.
>>> => When a user searches for apache/beam, only one repository is
returned. Users need to use tags to identify which images they should use.
Since we release images with new tags for each version, this will overwhelm
users and give them the impression that the images are not organized
well. It's also difficult to know what kinds of images we deployed.
>>> => With both the image name and version encoded in tags, the code is a
little bit more complicated to maintain.
>>> => There is no correct answer as to which image the latest tag should
point to.
 >>>
 >>> Are there any concerns with this proposal?
 >>>
 >>> Thanks,
 >>> Hannah
 >>>
 >>>
 >>>
 >>>
 >>> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay 
 wrote:
 
 
 
  On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay 
 wrote:
 >
 >
 >
 > On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka 
 wrote:
 >>
>> Also curious to know if apache provides any infra support for
projects under the Apache umbrella, and what quota limits they might have.
 
 
Maybe Hannah can ask via an infra ticket?
 
 >>
 >>
 >> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw <
 rober...@google.com> wrote:
 >>>
 >>> One downside is that, unlike many of these projects, we release
 a
 >>> dozen or so 

Re: Poor Python 3.x performance on Dataflow?

2020-02-12 Thread Valentyn Tymofieiev
To close the loop here, the regression reported here is not specific to
Beam or Dataflow. The difference in performance is caused by a 'regression'
in the deprecated numpy random number generator, which we use to generate
synthetic input for the load test pipeline.  Since new releases of numpy
don't support Python 2, our Py2 tests use a different, older numpy
version where that generator happens to perform faster.

You can follow BEAM-9085 for further details.
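
For reference, a minimal sketch of the two numpy APIs involved (the payload
size and timing harness are illustrative; the actual load test code differs):

import timeit
import numpy as np

# Legacy RandomState API -- the deprecated generator mentioned above.
legacy = np.random.RandomState(42)
# Newer Generator API (numpy >= 1.17), unaffected by the regression.
modern = np.random.default_rng(42)

# Time generating synthetic 1 KB payloads, as a load test input might.
print(timeit.timeit(lambda: legacy.bytes(1000), number=10000))
print(timeit.timeit(lambda: modern.bytes(1000), number=10000))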

On Fri, Jan 10, 2020 at 9:26 AM Valentyn Tymofieiev 
wrote:

> Thanks, Kamil. I self-assigned the issue, but if anyone else is
> interested, feel free to take a look in parallel and post your findings on
> the Jira.
>
> On Fri, Jan 10, 2020 at 4:29 AM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
>
>> Our first Python3 performance test has just been implemented and we have
>> just started gathering results. Here [1] you can find dashboards with a
>> side-by-side comparison.
>> I also opened a Jira ticket to investigate the difference [2]. Anyone,
>> please feel free to assign it to yourself.
>>
>> Thanks,
>> Kamil
>>
>> [1]
>> https://apache-beam-testing.appspot.com/explore?dashboard=5678187241537536
>> [2] https://issues.apache.org/jira/browse/BEAM-9085
>>
>> On Mon, Dec 9, 2019 at 8:38 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> For now we should run Py3 and Py2 tests alongside each other to get a
>>> side-by-side comparison. I suggest we open a Jira ticket to investigate the
>>> difference in performance. We have limited performance test coverage on
>>> Python 3 in Beam, so more Py3 tests would help a lot here, thanks for
>>> adding them.
>>>
>>> On Fri, Dec 6, 2019 at 9:43 AM Robert Bradshaw 
>>> wrote:
>>>
This is very surprising--I would expect the times to be quite similar. Do
 you have profiles for where the (difference in) time is spent? With
 differences like these, I wonder if there are issues with container
 setup (e.g. some things not being installed or cached) for Python 3.

 On Fri, Dec 6, 2019 at 9:06 AM Kamil Wasilewski
  wrote:
 >
 > Hi all,
 >
 > Python 2.7 won't be maintained past 2020 and that's why we want to
 migrate all Python performance tests in Beam from Python 2.7 to Python 3.7.
However, I was surprised to see that after switching Dataflow tests to
 Python 3.x they are a few times slower. For example, the same ParDo test
 that takes approx. 8 minutes to run on Python 2.7 needs approx. 21 minutes
 on Python 3.x. You can find all the results I gathered and the setup here.
 >
> Do you know of any possible reason for this? This issue makes it
 impossible to do the migration, because of the limited resources on Jenkins
 (almost every job would be aborted).
 >
 > Thanks,
 > Kamil

>>>


Re: Labels on PR

2020-02-12 Thread Alex Van Boxel
What exactly do you mean by github grep... where is it an issue? I find
it useful for searching here:

[image: Screen Shot 2020-02-13 at 06.11.33.png]

OK, you get some false positives, but then the color coding helps. You
can't search on a category, so this looks like the only alternative. I was
even thinking of adding more text to the descriptions, as it could help new
contributors identify something they could help with.

It's also nice when you hover over the label.

So, could you pinpoint exactly where you see a problem?

 _/
_/ Alex Van Boxel


On Wed, Feb 12, 2020 at 10:22 PM Ismaël Mejía  wrote:

> Alex, would you consider removing the descriptions from the labels? It
> seems that
> github greps not only the text of the label but also the text of the
> description,
> producing false positives. E.g. if I search the label `io` it resolves not
> only
> all the IOs but also results like `core`, because it matches the description
> `core-constructIOn-java`, and also `extensIOns`, making the point of having
> general categories not really useful.
>
> On Wed, Feb 12, 2020 at 3:01 PM Ismaël Mejía  wrote:
>
>> The prefix is just extra characters that make readability worse. Notice
>> that the full category (e.g. ios/runners/etc) will match because we have
>> an
>> extra tag equivalent to the prefix, so filtering is still possible. You
>> rarely
>> need to filter on more than one criterion; that's why github does not
>> allow it
>> (and the reason to have the extra per-category labels).
>>
>> The only issue I can see is a possible name collision in the future, but
>> until that
>> happens I think this is a reasonable tradeoff.
>>
>> Excellent idea to unify the colors for the categories, +1!
>>
>> On Wed, Feb 12, 2020 at 2:33 PM Alex Van Boxel  wrote:
>>
>>> Ismael, I saw that you removed the prefix. I still want to have some
>>> grouping between the subtypes, so I color coded them.
>>>
>>> I also added all the labels in the file. We now have 62 labels.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Wed, Feb 12, 2020 at 12:03 PM Ismaël Mejía  wrote:
>>>
Forgot to mention, older PRs will look unclassified; it's up to you guys
if you
want to label them manually. All new PRs will be automatically labeled.

 On Wed, Feb 12, 2020 at 12:02 PM Ismaël Mejía 
 wrote:

> For info Alex's PR to support the autolabeler was merged today and INFRA
> enabled the plugin.
> I created an artificial PR to check it was autolabeled correctly.
> It is working perfectly now.
> Thanks Alex !
>
> On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw 
> wrote:
>
>> +1 to finding the right balance.
>>
>> I do think per-runner makes sense, rather than a general "runners."
>> IOs might make sense as well. Not sure about all the extensions-* I'd
>> leave those out for now.
>>
>> On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía 
>> wrote:
>> >
>> > > So I propose going simple with a limited set of labels. Later on
>> we can refine. Don't forget that those labels are only useful during the
>> life-cycle of a PR.
>> >
>> > Labels are handy for quick filtering and finding PRs we care about
>> for example
>> > to review.
>> >
>> > I agree with the feeling that we should not go to the extremes, but
>> what is
>> > requested in the PR rarely would produce more than 5 labels per
>> PR.  For example
>> > if a PR changes KafkaIO and something in the CI it will produce
>> "java io kafka
>> > infra", a pure change on Flink runer will produce "runners flink"
>> >
>> > 100% agreed with not having many labels and keeping them short, but
>> the current
>> > classification lacks detail, e.g. few people care about some
>> general categories
>> > "runners" or "io", but maintainers may care about their specific
>> categories like
>> > "spark" or "kafka" so I don't think that this extra level of detail
>> is
>> > inappropriate and in the end it will only add one extra label per
>> matching path.
>> >
>> > Let's give it a try; if it is too excessive we can take the opposite
>> path and reduce it.
>> >
>> > Ismaël
>> >
>> >
>> > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel 
>> wrote:
>> >>
>> >> I'm wondering if we're not taking it too far with those detailed
>> labels. It's like going from nothing to super detailed. The simplest
>> use-case
>> hasn't proven itself in practice yet.
>> >>
>> >> So I propose going simple with a limited set of labels. Later on
>> we can refine. Don't forget that those labels are only useful during the
>> life-cycle of a PR.
>> >>
>> >>  _/
>> >> _/ Alex Van Boxel
>> >>
>> >>
>> >> On Tue, Feb 11, 2020 at 9:46 AM Ismaël Mejía 
>> wrote:
>> >>>
>> >>> Let some comments too, let's keep the discussion on refinements
>> in the PR.
>> >>>
>> >>> On 

Re: Python2.7 Beam End-of-Life Date

2020-02-12 Thread Ahmet Altay
On Wed, Feb 12, 2020 at 1:29 AM Ismaël Mejía  wrote:

> I am with Chad on this, we should probably extend it a bit more. Even if it
> makes us struggle a bit, at least we have some workarounds, as Robert
> suggests,
> and as Chad said there are still many people playing the python 3 catchup
> game,
> so it's worth supporting those users.
>

> But maybe it is worth evaluating the current state later in the year.
>

I would suggest re-evaluating this again within the next 3 months. We need
to balance user pain, contributor pain, and our ability to
continuously test with python 2 in a shifting environment.


> In the
> meantime can someone please update our Roadmap on the website with this
> info and
> where we are with Python 3 support (it looks out of date)?
> https://beam.apache.org/roadmap/
>

I made a minor change to update that page (
https://github.com/apache/beam/pull/10848). A more comprehensive update to
that page and linked (
https://beam.apache.org/roadmap/python-sdk/#python-3-support) would still
be welcome.


>
> - Ismaël
>
>
> On Tue, Feb 4, 2020 at 10:49 PM Robert Bradshaw 
> wrote:
>
>>  On Tue, Feb 4, 2020 at 12:12 PM Chad Dombrova  wrote:
>> >>
>> >>  Not to mention that all the nice work for the type hints will have to
>> be redone for 3.x.
>> >
>> > Note that there's a tool for automatically converting type comments to
>> annotations: https://github.com/ilevkivskyi/com2ann
>> >
>> > So don't let that part bother you.
>>
>> +1, I wouldn't worry about what can be easily automated.
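
As a rough illustration of what that conversion automates (hand-written
example, not actual com2ann output):

from typing import Iterable, Optional

# Before: a PEP 484 type comment, the only form legal on Python 2.
def process(element, timestamp=None):
    # type: (bytes, Optional[int]) -> Iterable[str]
    return [element.decode('utf-8')]

# After: the equivalent Python 3 annotations.
def process(element: bytes, timestamp: Optional[int] = None) -> Iterable[str]:
    return [element.decode('utf-8')]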
>>
>> > I'm curious what other features you'd like to be using in the Beam
>> source that you cannot now.
>>
>> I hit things occasionally, e.g. I just ran into wanting keyword-only
>> arguments the other day.
>>
>> >> It seems the faster we drop support the better.
>> >
>> >
>> > I've already gone over my position on this, but a refresher for those
>> who care:  some of the key vendors that support my industry will not offer
>> python3-compatible versions of their software until the 4th quarter of
>> 2020.  If Beam switches to python3-only before that point we may be forced
>> to stop contributing features (note: I'm the guy who added the type hints
>> :).   Every month you can give us would be greatly appreciated.
>>
>> As another data point, we're still 80/20 on Py2/Py3 for downloads at
>> PyPi [1] (which I've heard should be taken with a grain of salt, but
>> likely isn't totally off). IMHO that ratio needs to be way higher for
>> Python 3 to consider dropping Python 2. It's pretty noisy, but say it
>> doubles every 3 months; that would put us at least at mid-year before we
>> hit a cross-over point. On the other hand Q4 2020 is probably a
>> stretch.
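
A back-of-the-envelope check of that doubling estimate (illustrative numbers
only):

# Py3 share of downloads, assuming ~20% today and doubling every 3 months.
share, months = 0.20, 0
while share < 0.5:  # cross-over: Py3 overtakes Py2
    share *= 2
    months += 3
print(months, share)  # -> 6 months (roughly mid-year), at 0.8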
>>
>> We could consider whether it needs to be an all-or-nothing thing as
>> well. E.g. perhaps some features could be Python 3 only sooner than
>> the whole codebase. (This would have to be well justified.) Another
>> mitigation is that it is possible to mix Python 2 and Python 3 in the
>> same pipeline with portability, so if there's a library that you need
>> for one DoFn it doesn't mean you have to hold back your whole
>> pipeline.
>>
>> - Robert
>>
>> [1] https://pypistats.org/packages/apache-beam , and that 20% may just
>> be a spike.
>>
>


Re: daily dataflow job failing today

2020-02-12 Thread Ahmet Altay
On Wed, Feb 12, 2020 at 12:54 PM Ismaël Mejía  wrote:

> Independently of the bug in the dependency release, the fact that the Beam
> Python
> SDK does not pin fixed dependency versions is error-prone. We may
> continue to have this kind of problem until we fix this (with other
> dependencies too). In the Java SDK we do not accept this type of dynamic
> dependency version, and python should probably follow this practice to avoid
> issues like the present one.
>
> Why don't we just do:
>
> 'avro-python3==1.9.1',
>
> instead of the current:
>
> 'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',
>

I agree this is error prone. Your argument for pinning makes sense and I
agree with it.

I can argue for not pinning and bounding with major version ranges. This
gives flexibility to users to mix other third party libraries that share
common dependencies with Beam. Our expectation is that dependencies follow
semantic versioning and do not introduce breaking changes unless there is a
major version change. A good example of this is Beam's dependency on
"pytz>=2018.3". It is a simple wrapper around a time zone file. Latest
version of the dependency is 2019.3, it is updated a few times a year. Beam
users do not have to update Beam just to be able to use a later version of
it since Beam does not pin it.

There is also a middle ground, where we can pin certain dependencies if we
are not confident about their releases, and allow ranges for the rest of the
dependencies. In general, we are currently following this practice.
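
A minimal sketch of the three styles discussed in this thread, in setup.py
form (the specs below are illustrative alternatives, not Beam's actual list):

REQUIRED_PACKAGES = [
    # Exact pin: immune to bad upstream releases, but users must wait for a
    # new Beam release to pick up any dependency upgrade.
    'avro-python3==1.9.1',
    # Bounded range with an exclusion: flexible, but exposed to future bad
    # releases until each one is explicitly excluded.
    'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',
    # Lower bound only: maximum flexibility, relying on the dependency to
    # honor semantic versioning (the pytz example above).
    'pytz>=2018.3',
]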


>
>
> On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:
>
>> Related: we have dependencies on avro, avro-python3, and fastavro.
>> fastavro supports both python 2 and 3. Could we reduce this dependency list
>> and depend only on fastavro? If we need avro and avro-python3 for the
>> purposes of testing only, we can move them to test-only dependencies.
>>
>> +Chamikara Jayalath , because I vaguely remember
>> him working on this.
>>
>> The reason I am calling for this is that the impact of bad dependency releases
>> is high. All previously released Beam versions will be impacted. Reducing
>> the dependency list will reduce the risk.
>>
>> Ahmet
>>
>> On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay  wrote:
>>
>>> Thank you Valentyn!
>>>
>>> On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Yes, otherwise all Python tests will continue to fail until Avro comes
 up with a new release. Sent: https://github.com/apache/beam/pull/10844

 On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay  wrote:

> Should we update Beam's setup.py to skip this avro-python3 version?
>
> On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz <
> alan.krumh...@betterup.co> wrote:
>
>> makes sense. I'll add this workaround for now.
>> Thanks so much for your help!
>>
>> On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Alan, Dataflow workers preinstall Beam SDK dependencies, including
>>> (a working version) of avro-python3. So after reading your email once
>>> again, I think in your case you were not able to install Beam SDK 
>>> locally.
>>> So a workaround for you would be to `pip install avro-python3==1.9.1` or
>>> `pip install pycodestyle`  before installing Beam, until AVRO-2737
>>> is resolved.
>>>
>>>
>>> On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
 it received attention.

 On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Opened https://issues.apache.org/jira/browse/AVRO-2738
>
> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Here's a short repro:
>>
>> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
>> root@04b45a100d16:/# pip install avro-python3
>> Collecting avro-python3
>>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
>> ERROR: Command errored out with exit status 1:
>>  command: /usr/local/bin/python -c 'import sys, setuptools,
>> tokenize; sys.argv[0] =
>> '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
>> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
>> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
>> '"'"'\n'"'"');f.close();exec(compile(code, __file__, 
>> '"'"'exec'"'"'))'
>> egg_info --egg-base 
>> /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
>> Complete output (5 lines):
>> Traceback (most recent call last):

Re: Google's support framework for community-led meetups

2020-02-12 Thread Austin Bennett
Hi Maria,

This might be useful in user@ as well?

Happy to walk you through editing the webpage and submitting a PR.  Then it's
up to an appropriate committer whether to approve.  Write me off-list and we
can find time.  A less focused walkthrough (including way more than you
need just for that, in case you want to get an independent start):
https://www.youtube.com/watch?v=PtPslSdAPcM (I will re-offer a more polished
version at the next summit).

Some questions for clarity:
* What is an Amplifier, and how is it distinguished from a contributor?
* Do you have some examples of Google Open Source contributing
content?  Is that different from what Google and its engineers do?  Or
is that just the branding of it?
* When talking about company and/or Google Open Source sponsorships, are we
talking about money (ex: Meetup sometimes solicits companies to
pay for an advertising slot in a venue that can gather bodies in a room --
the same as conference sponsorships)?  Or other/open?

Overall this is helpful, speaking as someone trying to help build
local communities in a number of cities.

Cheers,
Austin


On Wed, Feb 12, 2020 at 11:40 AM María Cruz  wrote:

> Hi everyone,
>
> I have been working closely with Gris Cuevas to develop a framework to
> increase transparency on how Google Open Source supports community-led
> events.  Below you will find a table that identifies different event goals,
> community roles, and the type of support Google Open Source can offer in
> each case.
>
> If you would like to request support from Google for your meetup, please
> follow the meetup support request process.
> Can someone help me put this information, along with the support framework
> below, under the Community tab [1] of the Apache Beam website?
>
> We look forward to continuing to partner with our community organizers to
> help Beam grow even more! If you have any questions, please let me know.
>
> Support framework
>
> Community Roles
>
> - Project contributor AND community amplifier:
>   - Contributes code, documentation, project management, etc, to the project
>   - Organizes meetups
>   - Engages other contributors in becoming organizers
> - Project contributor:
>   - Contributes code, documentation, project management, etc, to the project
>   - Organizes meetups
> - Company that wants to co-sponsor a meetup:
>   - A company that is an independent entity from Google that expressed
>     interest in co-sponsoring a meetup
>
> Framework
>
> Community Goal: Raise awareness about project / increase reach of project /
> maximize exposure. E.g. launch a new Meetup series in a new city.
> - Project contributor AND amplifier: up to USD 3,000 a year in swag;
>   content; speakers
> - Project contributor: up to USD 1,500 a year in swag; help to find a
>   meetup sponsor; content
> - A company that wants to co-sponsor a meetup: connect to local community
>   members to organize meetup; content
>
> Community Goal: Grow project adoption. E.g. create awareness of a new
> release and help people adopt it.
> - Project contributor AND amplifier: add to leadership network; support in
>   finding sponsors
> - Project contributor, or a company that wants to co-sponsor a meetup:
>   partner to present use case; connect to community member in low-adoption
>   region / city
>
> Community Goal: Grow project contributions. E.g. run a hackathon to
> contribute to Beam website localization.
> - Project contributor AND amplifier: develop workshop content; swag or GCP
>   credits for top contributors
> - Project contributor: develop workshop content
> - A company that wants to co-sponsor a meetup: develop workshop content;
>   connect to expert to deliver workshop
>
> Community Goal: Expand professional network for project / connect project
> features to other projects. E.g. working session at a conference.
> - Project contributor AND amplifier: connect to related meetups; create a
>   list of potential presenters from other companies
> - Project contributor: connect to related meetups
> - A company that wants to co-sponsor a meetup: connect to local expert for
>   presentations
>
>  --
>
> María
>
> [1] https://beam.apache.org/community/
>
>


Re: FnAPI proto backwards compatibility

2020-02-12 Thread Kenneth Knowles
On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw 
wrote:

> On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
> >
> > We can always detect on the runner/SDK side whether there is an unknown
> field[1] within a payload and fail to process it but this is painful in two
> situations:
> > 1) It doesn't provide for a good error message since you can't say what
> the purpose of the field is. With a capability URN, the runner/SDK could
> say which URN it doesn't understand.
> > 2) It doesn't allow for the addition of fields which don't impact
> semantics of execution. For example, if the display data feature was being
> developed, a runner could ignore it and still execute the pipeline
> correctly.
>
> Yeah, I don't think proto reflection is a flexible enough tool to do
> this well either.
>
> > If we think this to be common enough, we can add capabilities list to
> the PTransform so each PTransform can do this and has a natural way of
> being extended for additions which are forwards compatible. The alternative
> to having capabilities on PTransform (and other constructs) is that we
> would have a new URN when the specification of the transform changes. For
> forwards compatible changes, each SDK/runner would map older versions of
> the URN onto the latest and internally treat it as the latest version but
> always downgrade it to the version the other party expects when
> communicating with it. Backwards incompatible changes would always require
> a new URN which capabilities at the PTransform level would not help with.
>
> As you point out, stateful+splittable may not be a particularly useful
> combination, but as another example, we have
> (backwards-incompatible-when-introduced) markers on DoFn as to whether
> it requires finalization, stable inputs, and now time sorting. I don't
> think we should have a new URN for each combination.
>

Agree with this. I don't think stateful, splittable, and "plain" ParDo are
comparable to these. Each is an entirely different computational paradigm:
per-element independent processing, per-key-and-window linear processing,
and per-element-and-restriction splittable processing. Most relevant IMO is
the nature of the parallelism. If you added state to splittable processing,
it would still be splittable processing. Just as Combine and ParDo can
share the SideInput specification, it is easy to share relevant
sub-structures like state declarations. But it is a fair point that the
ability to split can be ignored and run as a plain-old ParDo. It brings up
the question of whether a runner that doesn't know SDFs should have to
reject them or should be allowed to run them poorly.

It isn't a huge deal. Three different top-level URNs versus three different
sub-URNs will achieve the same result in the end if we get this
"capability" thing in place.

Kenn


>
> >> > I do think that splittable ParDo and stateful ParDo should have
> separate PTransform URNs since they are different paradigms than "vanilla"
> ParDo.
> >>
> >> Here I disagree. What about one that is both splittable and stateful?
> Would one have a fourth URN for that? If/when another flavor of DoFn comes
> out, would we then want 8 distinct URNs? (SplitableParDo in particular can
> be executed as a normal ParDo as long as the output is bounded.)
> >
> > I agree that you could have stateful and splittable dofns where the
> element is the key and you share state and timers across restrictions. No
> runner is capable of executing this efficiently.
> >
> >> >> > On the SDK requirements side: the constructing SDK owns the
> Environment proto completely, so it is in a position to ensure the involved
> docker images support the necessary features.
> >> >>
> >> >> Yes.
> >
> >
> > I believe capabilities do exist on a Pipeline and it informs runners
> about new types of fields to be aware of either within Components or on the
> Pipeline object itself but for this discussion it makes sense that an
> environment would store most "capabilities" related to execution.
> >
> >> [snip]
> >
> > As for the proto clean-ups, the scope is to cover almost all things
> needed for execution now and to follow up with optional transforms,
> payloads, and coders later, which would exclude job management APIs and
> artifact staging. A formal enumeration would be useful here. Also, we
> should provide formal guidance about adding new fields, adding new types of
> transforms, new types of proto messages, ... (best to describe this on a
> case by case basis as to how people are trying to modify the protos and
> evolve this guidance over time).
>
> What we need is the ability for (1) runners to reject future pipelines
> they cannot faithfully execute and (2) runners to be able to take
> advantage of advanced features/protocols when interacting with those
> SDKs that understand them while avoiding them for older (or newer)
> SDKs that don't. Let's call (1) (hard) requirements and (2) (optional)
> capabilities.
>
> Where possible, I think this is best expressed 

Re: Labels on PR

2020-02-12 Thread Ismaël Mejía
Alex, would you consider removing the descriptions from the labels? It seems
that
github greps not only the text of the label but also the text of the
description,
producing false positives. E.g. if I search the label `io` it resolves not
only
all the IOs but also results like `core`, because it matches the description
`core-constructIOn-java`, and also `extensIOns`, making the point of having
general categories not really useful.

On Wed, Feb 12, 2020 at 3:01 PM Ismaël Mejía  wrote:

> The prefix is just extra characters that make readability worse. Notice
> that the full category (e.g. ios/runners/etc) will match because we have an
> extra tag equivalent to the prefix, so filtering is still possible. You
> rarely
> need to filter on more than one criterion; that's why github does not
> allow it
> (and the reason to have the extra per-category labels).
>
> The only issue I can see is a possible name collision in the future, but
> until that
> happens I think this is a reasonable tradeoff.
>
> Excellent idea to unify the colors for the categories, +1!
>
> On Wed, Feb 12, 2020 at 2:33 PM Alex Van Boxel  wrote:
>
>> Ismael, I saw that you removed the prefix. I still want to have some
>> grouping between the subtypes, so I color coded them.
>>
>> I also added all the labels in the file. We now have 62 labels.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Wed, Feb 12, 2020 at 12:03 PM Ismaël Mejía  wrote:
>>
>>> Forgot to mention, older PRs will look unclassified; it's up to you guys if
>>> you
>>> want to label them manually. All new PRs will be automatically labeled.
>>>
>>> On Wed, Feb 12, 2020 at 12:02 PM Ismaël Mejía  wrote:
>>>
For info Alex's PR to support the autolabeler was merged today and INFRA
 enabled the plugin.
 I created an artificial PR to check it was autolabeled correctly.
 It is working perfectly now.
 Thanks Alex !

 On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw 
 wrote:

> +1 to finding the right balance.
>
> I do think per-runner makes sense, rather than a general "runners."
> IOs might make sense as well. Not sure about all the extensions-* I'd
> leave those out for now.
>
> On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía 
> wrote:
> >
> > > So I propose going simple with a limited set of labels. Later on
> we can refine. Don't forget that those labels are only useful during the
> life-cycle of a PR.
> >
> > Labels are handy for quick filtering and finding PRs we care about
> for example
> > to review.
> >
> > I agree with the feeling that we should not go to the extremes, but
> what is
> > requested in the PR rarely would produce more than 5 labels per PR.
> For example
> > if a PR changes KafkaIO and something in the CI it will produce
> "java io kafka
> > infra", a pure change on Flink runer will produce "runners flink"
> >
> > 100% agreed with not having many labels and keeping them short, but
> the current
> > classification lacks detail, e.g. few people care about some general
> categories
> > "runners" or "io", but maintainers may care about their specific
> categories like
> > "spark" or "kafka" so I don't think that this extra level of detail
> is
> > inappropriate and in the end it will only add one extra label per
> matching path.
> >
> > Let's give it a try; if it is too excessive we can take the opposite
> path and reduce it.
> >
> > Ismaël
> >
> >
> > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel 
> wrote:
> >>
> >> I'm wondering if we're not taking it too far with those detailed
> labels. It's like going from nothing to super detailed. The simplest
> use-case
> hasn't proven itself in practice yet.
> >>
> >> So I propose going simple with a limited set of labels. Later on we
> can refine. Don't forget that those labels are only useful during the
> life-cycle of a PR.
> >>
> >>  _/
> >> _/ Alex Van Boxel
> >>
> >>
> >> On Tue, Feb 11, 2020 at 9:46 AM Ismaël Mejía 
> wrote:
> >>>
> >>> Let some comments too, let's keep the discussion on refinements in
> the PR.
> >>>
> >>> On Tue, Feb 11, 2020 at 9:13 AM jincheng sun <
> sunjincheng...@gmail.com> wrote:
> 
>  I left comments on PR, the main suggestion is that we may need a
> discussion about what kind of labels should be added. I would like to share
> my thoughts as follows:
> 
>  I think we need to add labels according to some rules. For
> example, the easiest way is to add labels by languages, java / python / go
> etc. But this kind of help is very limited, so we need to subdivide some
> labels, such as by components. Currently we have more than 70 components,
> each component is configured with labels, and it seems cumbersome. So we
> should have some rules for dividing labels, which can play the 

Re: daily dataflow job failing today

2020-02-12 Thread Ismaël Mejía
Independently of the bug in the dependency release, the fact that the Beam
Python
SDK does not pin fixed dependency versions is error-prone. We may
continue to have this kind of problem until we fix this (with other
dependencies too). In the Java SDK we do not accept this type of dynamic
dependency version, and python should probably follow this practice to avoid
issues like the present one.

Why don't we just do:

'avro-python3==1.9.1',

instead of the current:

'avro-python3>=1.8.1,!=1.9.2,<2.0.0; python_version >= "3.0"',


On Wed, Feb 12, 2020 at 9:14 PM Ahmet Altay  wrote:

> Related: we have dependencies on avro, avro-python3, and fastavro.
> fastavro supports both python 2 and 3. Could we reduce this dependency list
> and depend only on fastavro? If we need avro and avro-python3 for the
> purposes of testing only, we can move them to test-only dependencies.
>
> +Chamikara Jayalath , because I vaguely remember
> him working on this.
>
> The reason I am calling for this is that the impact of bad dependency releases
> is high. All previously released Beam versions will be impacted. Reducing
> the dependency list will reduce the risk.
>
> Ahmet
>
> On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay  wrote:
>
>> Thank you Valentyn!
>>
>> On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Yes, otherwise all Python tests will continue to fail until Avro comes
>>> up with a new release. Sent: https://github.com/apache/beam/pull/10844
>>>
>>> On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay  wrote:
>>>
 Should we update Beam's setup.py to skip this avro-python3 version?

 On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz <
 alan.krumh...@betterup.co> wrote:

> makes sense. I'll add this workaround for now.
> Thanks so much for your help!
>
> On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Alan, Dataflow workers preinstall Beam SDK dependencies, including (a
>> working version) of avro-python3. So after reading your email once 
>> again, I
>> think in your case you were not able to install Beam SDK locally. So a
>> workaround for you would be to `pip install avro-python3==1.9.1` or `pip
>> install pycodestyle`  before installing Beam, until AVRO-2737 is 
>> resolved.
>>
>>
>> On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
>>> it received attention.
>>>
>>> On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Opened https://issues.apache.org/jira/browse/AVRO-2738

 On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Here's a short repro:
>
> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
> root@04b45a100d16:/# pip install avro-python3
> Collecting avro-python3
>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
> ERROR: Command errored out with exit status 1:
>  command: /usr/local/bin/python -c 'import sys, setuptools,
> tokenize; sys.argv[0] =
> '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
> egg_info --egg-base 
> /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
> Complete output (5 lines):
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line
> 41, in 
> import pycodestyle
> ModuleNotFoundError: No module named 'pycodestyle'
> 
> ERROR: Command errored out with exit status 1: python setup.py
> egg_info Check the logs for full command output.
> root@04b45a100d16:/#
>
>
>
>
>
>
>
>
>
> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Yes, it is a bug in the recent Avro release. We should report it
>> to the Avro maintainers. The workaround is to downgrade avro-python3 
>> to
>> 1.9.1, for example via requirements.txt.
>>
>> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz <
>> sniem...@apache.org> wrote:
>>
>>> avro-python3 1.9.2 was released on pypi 4 hours ago, and
>>> added pycodestyle as a dependency, probably related?

Re: daily dataflow job failing today

2020-02-12 Thread Ahmet Altay
Related: we have dependencies on avro, avro-python3, and fastavro. fastavro
supports both python 2 and 3. Could we reduce this dependency list and
depend only on fastavro? If we need avro and avro-python3 for the purposes
of testing only, we can move them to test-only dependencies.

+Chamikara Jayalath , because I vaguely remember him
working on this.

The reason I am calling for this is that the impact of bad dependency releases
is high. All previously released Beam versions will be impacted. Reducing
the dependency list will reduce the risk.

Ahmet

On Wed, Feb 12, 2020 at 12:02 PM Ahmet Altay  wrote:

> Thank you Valentyn!
>
> On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev 
> wrote:
>
>> Yes, otherwise all Python tests will continue to fail until Avro comes up
>> with a new release. Sent: https://github.com/apache/beam/pull/10844
>>
>> On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay  wrote:
>>
>>> Should we update Beam's setup.py to skip this avro-python3 version?
>>>
>>> On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz <
>>> alan.krumh...@betterup.co> wrote:
>>>
 makes sense. I'll add this workaround for now.
 Thanks so much for your help!

 On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Alan, Dataflow workers preinstall Beam SDK dependencies, including (a
> working version) of avro-python3. So after reading your email once again, 
> I
> think in your case you were not able to install Beam SDK locally. So a
> workaround for you would be to `pip install avro-python3==1.9.1` or `pip
> install pycodestyle`  before installing Beam, until AVRO-2737 is resolved.
>
>
> On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
>> it received attention.
>>
>> On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Opened https://issues.apache.org/jira/browse/AVRO-2738
>>>
>>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Here's a short repro:

 :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
 root@04b45a100d16:/# pip install avro-python3
 Collecting avro-python3
   Downloading avro-python3-1.9.2.tar.gz (37 kB)
 ERROR: Command errored out with exit status 1:
  command: /usr/local/bin/python -c 'import sys, setuptools,
 tokenize; sys.argv[0] =
 '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
 __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
 '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
 '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
 egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
 Complete output (5 lines):
 Traceback (most recent call last):
   File "", line 1, in 
   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line
 41, in 
 import pycodestyle
 ModuleNotFoundError: No module named 'pycodestyle'
 
 ERROR: Command errored out with exit status 1: python setup.py
 egg_info Check the logs for full command output.
 root@04b45a100d16:/#









 On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Yes, it is a bug in the recent Avro release. We should report it
> to the Avro maintainers. The workaround is to downgrade avro-python3 
> to
> 1.9.1, for example via requirements.txt.
>
> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz <
> sniem...@apache.org> wrote:
>
>> avro-python3 1.9.2 was released on pypi 4 hours ago, and
>> added pycodestyle as a dependency, probably related?
>>
>> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik 
>> wrote:
>>
>>> +dev 
>>>
>>> There was recently an update to add autoformatting to the Python
>>> SDK[1].
>>>
>>> I'm seeing this during testing of a PR as well.
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>>>
>>> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
>>> alan.krumh...@betterup.co> wrote:
>>>
Some more information on this, as I still can't get it fixed

 This job is triggered using the 

Re: FnAPI proto backwards compatibility

2020-02-12 Thread Robert Bradshaw
On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik  wrote:
>
> We can always detect on the runner/SDK side whether there is an unknown 
> field[1] within a payload and fail to process it but this is painful in two 
> situations:
> 1) It doesn't provide for a good error message since you can't say what the 
> purpose of the field is. With a capability URN, the runner/SDK could say 
> which URN it doesn't understand.
> 2) It doesn't allow for the addition of fields which don't impact semantics 
> of execution. For example, if the display data feature was being developed, a 
> runner could ignore it and still execute the pipeline correctly.

Yeah, I don't think proto reflection is a flexible enough tool to do
this well either.

> If we think this to be common enough, we can add capabilities list to the 
> PTransform so each PTransform can do this and has a natural way of being 
> extended for additions which are forwards compatible. The alternative to 
> having capabilities on PTransform (and other constructs) is that we would 
> have a new URN when the specification of the transform changes. For forwards 
> compatible changes, each SDK/runner would map older versions of the URN onto 
> the latest and internally treat it as the latest version but always downgrade 
> it to the version the other party expects when communicating with it. 
> Backwards incompatible changes would always require a new URN which 
> capabilities at the PTransform level would not help with.

As you point out, stateful+splittable may not be a particularly useful
combination, but as another example, we have
(backwards-incompatible-when-introduced) markers on DoFn as to whether
it requires finalization, stable inputs, and now time sorting. I don't
think we should have a new URN for each combination.

>> > I do think that splittable ParDo and stateful ParDo should have separate 
>> > PTransform URNs since they are different paradigms than "vanilla" ParDo.
>>
>> Here I disagree. What about one that is both splittable and stateful? Would 
>> one have a fourth URN for that? If/when another flavor of DoFn comes out, 
>> would we then want 8 distinct URNs? (SplitableParDo in particular can be 
>> executed as a normal ParDo as long as the output is bounded.)
>
> I agree that you could have stateful and splittable dofns where the element 
> is the key and you share state and timers across restrictions. No runner is 
> capable of executing this efficiently.
>
>> >> > On the SDK requirements side: the constructing SDK owns the Environment 
>> >> > proto completely, so it is in a position to ensure the involved docker 
>> >> > images support the necessary features.
>> >>
>> >> Yes.
>
>
> I believe capabilities do exist on a Pipeline and it informs runners about 
> new types of fields to be aware of either within Components or on the 
> Pipeline object itself but for this discussion it makes sense that an 
> environment would store most "capabilities" related to execution.
>
>> [snip]
>
> As for the proto clean-ups, the scope is to cover almost all things needed 
> for execution now and to follow up with optional transforms, payloads, and
> coders later, which would exclude job management APIs and artifact staging. A
> formal enumeration would be useful here. Also, we should provide formal 
> guidance about adding new fields, adding new types of transforms, new types 
> of proto messages, ... (best to describe this on a case by case basis as to 
> how people are trying to modify the protos and evolve this guidance over 
> time).

What we need is the ability for (1) runners to reject future pipelines
they cannot faithfully execute and (2) runners to be able to take
advantage of advanced features/protocols when interacting with those
SDKs that understand them while avoiding them for older (or newer)
SDKs that don't. Let's call (1) (hard) requirements and (2) (optional)
capabilities.

Where possible, I think this is best expressed inherently in the set
of transform (and possibly other component) URNs. For example, when an
SDK uses a combine_per_key composite, that's a signal that it
understands the various related combine_* transforms. Similarly, a
pipeline with a test_stream URN would be rejected by runners not
recognizing/supporting this primitive. However, this is not always
possible, e.g. for (1) we have the aforementioned boolean flags on
ParDo and for (2) we have features like large iterable and progress
support.

For (1) we have to enumerate now everywhere a runner must look, as far
into the future as we want to remain backwards compatible. This is why
I suggested putting something on the pipeline itself, but we could
(likely in addition) add it to Transform and/or ParDoPayload if we
think that'd be useful now. (Note that a future pipeline-level
requirement could be "inspect (previously non-existent) requirements
field attached to objects of type X.")
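
A hypothetical sketch of this split, to make it concrete (proto field names
and URNs below are illustrative only, not the actual Beam model protos):

SUPPORTED_REQUIREMENTS = {
    'beam:requirement:pardo:stable_input:v1',
    'beam:requirement:pardo:time_sorted_input:v1',
}

def validate(pipeline_proto):
    # (1) Hard requirements: reject any pipeline we cannot faithfully execute.
    for urn in pipeline_proto.requirements:
        if urn not in SUPPORTED_REQUIREMENTS:
            raise ValueError('Unknown pipeline requirement: %s' % urn)

def supports_large_iterables(environment_proto):
    # (2) Optional capabilities: use an advanced protocol only when the SDK
    # environment advertises it.
    return 'beam:protocol:large_iterables:v1' in environment_proto.capabilities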

For (2) I think adding a capabilities field to the environment for now
makes the most sense, and as 

Re: daily dataflow job failing today

2020-02-12 Thread Ahmet Altay
Thank you Valentyn!

On Wed, Feb 12, 2020 at 11:32 AM Valentyn Tymofieiev 
wrote:

> Yes, otherwise all Python tests will continue to fail until Avro comes up
> with a new release. Sent: https://github.com/apache/beam/pull/10844
>
> On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay  wrote:
>
>> Should we update Beam's setup.py to skip this avro-python3 version?
>>
>> On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz 
>> wrote:
>>
>>> makes sense. I'll add this workaround for now.
>>> Thanks so much for your help!
>>>
>>> On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Alan, Dataflow workers preinstall Beam SDK dependencies, including (a
 working version) of avro-python3. So after reading your email once again, I
 think in your case you were not able to install Beam SDK locally. So a
 workaround for you would be to `pip install avro-python3==1.9.1` or `pip
 install pycodestyle`  before installing Beam, until AVRO-2737 is resolved.


 On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
> it received attention.
>
> On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Opened https://issues.apache.org/jira/browse/AVRO-2738
>>
>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Here's a short repro:
>>>
>>> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
>>> root@04b45a100d16:/# pip install avro-python3
>>> Collecting avro-python3
>>>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
>>> ERROR: Command errored out with exit status 1:
>>>  command: /usr/local/bin/python -c 'import sys, setuptools,
>>> tokenize; sys.argv[0] =
>>> '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
>>> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
>>> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
>>> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
>>> egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>>>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
>>> Complete output (5 lines):
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line
>>> 41, in 
>>> import pycodestyle
>>> ModuleNotFoundError: No module named 'pycodestyle'
>>> 
>>> ERROR: Command errored out with exit status 1: python setup.py
>>> egg_info Check the logs for full command output.
>>> root@04b45a100d16:/#
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Yes, it is a bug in the recent Avro release. We should report it
 to the Avro maintainers. The workaround is to downgrade avro-python3 to
 1.9.1, for example via requirements.txt.

 On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
 wrote:

> avro-python3 1.9.2 was released on pypi 4 hours ago, and
> added pycodestyle as a dependency, probably related?
>
> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik 
> wrote:
>
>> +dev 
>>
>> There was recently an update to add autoformatting to the Python
>> SDK[1].
>>
>> I'm seeing this during testing of a PR as well.
>>
>> 1:
>> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>>
>> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
>> alan.krumh...@betterup.co> wrote:
>>
>>> Some more information on this, as I still can't get it fixed
>>>
>>> This job is triggered using the beam[gcp] python sdk from a
>>> KubeFlow Pipelines component which runs on top of docker image:
>>> tensorflow/tensorflow:1.13.1-py3
>>>
>>> I just checked and that image hasn't been updated recently. I
>>> also redeployed my pipeline to another (older) deployment of KFP 
>>> and it
>>> gives me the same error (which tells me this isn't an internal KFP 
>>> problem)
>>>
>>> The exact same pipeline/code running on the exact same image has
>>> been running fine for days. Did anything changed on the 
>>> beam/dataflow side
>>> since yesterday morning?
>>>
>>> Thanks for your help! This is a production pipeline that is not
>>> running for us :(
>>>
>>>

Google's support framework for community-led meetups

2020-02-12 Thread María Cruz
Hi everyone,

I have been working closely with Gris Cuevas to develop a framework to
increase transparency on how Google Open Source supports community-led
events.  Below you will find a table that identifies different event goals,
community roles, and the type of support Google Open Source can offer in
each case.

If you would like to request support from Google for your meetup, please
follow the meetup support request process.
Can someone help me put this information, along with the support framework
below, under the Community tab [1] of the Apache Beam website?

We look forward to continuing to partner with our community organizers to
help Beam grow even more! If you have any questions, please let me know.

Support framework

Community Roles

- Project contributor AND community amplifier:
  - Contributes code, documentation, project management, etc, to the project
  - Organizes meetups
  - Engages other contributors in becoming organizers
- Project contributor:
  - Contributes code, documentation, project management, etc, to the project
  - Organizes meetups
- Company that wants to co-sponsor a meetup:
  - A company that is an independent entity from Google that expressed
    interest in co-sponsoring a meetup

Framework

Community Goal: Raise awareness about project / increase reach of project /
maximize exposure. E.g. launch a new Meetup series in a new city.
- Project contributor AND amplifier: up to USD 3,000 a year in swag;
  content; speakers
- Project contributor: up to USD 1,500 a year in swag; help to find a
  meetup sponsor; content
- A company that wants to co-sponsor a meetup: connect to local community
  members to organize meetup; content

Community Goal: Grow project adoption. E.g. create awareness of a new
release and help people adopt it.
- Project contributor AND amplifier: add to leadership network; support in
  finding sponsors
- Project contributor, or a company that wants to co-sponsor a meetup:
  partner to present use case; connect to community member in low-adoption
  region / city

Community Goal: Grow project contributions. E.g. run a hackathon to
contribute to Beam website localization.
- Project contributor AND amplifier: develop workshop content; swag or GCP
  credits for top contributors
- Project contributor: develop workshop content
- A company that wants to co-sponsor a meetup: develop workshop content;
  connect to expert to deliver workshop

Community Goal: Expand professional network for project / connect project
features to other projects. E.g. working session at a conference.
- Project contributor AND amplifier: connect to related meetups; create a
  list of potential presenters from other companies
- Project contributor: connect to related meetups
- A company that wants to co-sponsor a meetup: connect to local expert for
  presentations

 --

María

[1] https://beam.apache.org/community/


Re: Request to be added to maintainers in Jira.

2020-02-12 Thread Luke Cwik
What is your JIRA id?

Also, note that there is an ongoing issue that prevents many people from
running tests themselves on their PRs [1] and requires asking on the dev@
mailing list for someone with the appropriate set of permissions to launch
the tests for you.

1: https://issues.apache.org/jira/browse/INFRA-19670

On Wed, Feb 12, 2020 at 11:16 AM Liu Wang  wrote:

> Hi Beam developers,
>
> I have been working on adding Beam Python tests since last November. It is
> inconvenient for me right now since I can't run tests, comment on open
> issues, ask or answer questions on the forum.
> For example, I have a PR that may fix BEAM-9003, but I can't run the test
> or see the test results, and I can't comment on issue BEAM-9003.
> I'd appreciate it if you could add me to the maintainers in Jira.
>
> Thanks,
> Liu Wang
>


Re: daily dataflow job failing today

2020-02-12 Thread Valentyn Tymofieiev
Yes, otherwise all Python tests will continue to fail until Avro comes up
with a new release. Sent: https://github.com/apache/beam/pull/10844

On Wed, Feb 12, 2020 at 11:08 AM Ahmet Altay  wrote:

> Should we update Beam's setup.py to skip this avro-python3 version?
>
> On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz 
> wrote:
>
>> makes sense. I'll add this workaround for now.
>> Thanks so much for your help!
>>
>> On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Alan, Dataflow workers preinstall Beam SDK dependencies, including a
>>> working version of avro-python3. So after reading your email once again, I
>>> think in your case you were not able to install Beam SDK locally. So a
>>> workaround for you would be to `pip install avro-python3==1.9.1` or `pip
>>> install pycodestyle`  before installing Beam, until AVRO-2737 is resolved.
>>>
>>>
>>> On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
 it received attention.

 On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Opened https://issues.apache.org/jira/browse/AVRO-2738
>
> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Here's a short repro:
>>
>> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
>> root@04b45a100d16:/# pip install avro-python3
>> Collecting avro-python3
>>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
>> ERROR: Command errored out with exit status 1:
>>  command: /usr/local/bin/python -c 'import sys, setuptools,
>> tokenize; sys.argv[0] =
>> '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
>> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
>> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
>> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
>> egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
>> Complete output (5 lines):
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line
>> 41, in 
>> import pycodestyle
>> ModuleNotFoundError: No module named 'pycodestyle'
>> 
>> ERROR: Command errored out with exit status 1: python setup.py
>> egg_info Check the logs for full command output.
>> root@04b45a100d16:/#
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Yes, it is a bug in the recent Avro release. We should report it
>>> to the Avro maintainers. The workaround is to downgrade avro-python3 to
>>> 1.9.1, for example via requirements.txt.
>>>
>>> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
>>> wrote:
>>>
 avro-python3 1.9.2 was released on pypi 4 hours ago, and
 added pycodestyle as a dependency, probably related?

 On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:

> +dev 
>
> There was recently an update to add autoformatting to the Python
> SDK[1].
>
> I'm seeing this during testing of a PR as well.
>
> 1:
> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>
> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
> alan.krumh...@betterup.co> wrote:
>
>> Some more information on this, as I still can't get it fixed
>>
>> This job is triggered using the beam[gcp] python sdk from a
>> KubeFlow Pipelines component which runs on top of docker image:
>> tensorflow/tensorflow:1.13.1-py3
>>
>> I just checked and that image hasn't been updated recently. I
>> also redeployed my pipeline to another (older) deployment of KFP and 
>> it
>> gives me the same error (which tells me this isn't an internal KFP 
>> problem)
>>
>> The exact same pipeline/code running on the exact same image has
>> been running fine for days. Did anything change on the
>> beam/dataflow side since yesterday morning?
>>
>> Thanks for your help! This is a production pipeline that is not
>> running for us :(
>>
>>
>>
>> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
>> alan.krumh...@betterup.co> wrote:
>>
>>> Hi, I have a scheduled daily job that I have been running fine
>>> in dataflow for days now.

Request to be added to maintainers in Jira.

2020-02-12 Thread Liu Wang
Hi Beam developers,

I have been working on adding Beam Python tests since last November. It is
inconvenient for me right now since I can't run tests, comment on open
issues, ask or answer questions on the forum.
For example, I have a PR that may fix BEAM-9003, but I can't run the test
or see the test results, and I can't comment on issue BEAM-9003.
I'd appreciate it if you could add me to the maintainers in Jira.

Thanks,
Liu Wang


Re: FnAPI proto backwards compatibility

2020-02-12 Thread Luke Cwik
On Wed, Feb 12, 2020 at 7:57 AM Robert Bradshaw  wrote:

> On Tue, Feb 11, 2020 at 7:25 PM Kenneth Knowles  wrote:
> >
> > On Tue, Feb 11, 2020 at 8:38 AM Robert Bradshaw 
> wrote:
> >>
> >> On Mon, Feb 10, 2020 at 7:35 PM Kenneth Knowles 
> wrote:
> >> >
> >> > On the runner requirements side: if you have such a list at the
> pipeline level, it is an opportunity for the list to be inconsistent with
> the contents of the pipeline. For example, if a DoFn is marked "requires
> stable input" but not listed at the pipeline level, then the runner may run
> it without ensuring it requires stable input.
> >>
> >> Yes. Listing this feature at the top level, if used, would be part of
> >> the contract. The problem here that we're trying to solve is that the
> >> runner wouldn't know about the field used to mark a DoFn as "requires
> >> stable input." Another alternative would be to make this kind of ParDo
> >> a different URN, but that would result in a cross product of URNs for
> >> all supported features.
> >
> >
> >>
> >> Rather than attaching it to the pipeline object, we could attach it to
> >> the transform. (But if there are ever extensions that don't belong to
> >> transforms, we'd be out of luck. It'd be even worse to attach it to
> >> the ParDoPayload, as then we'd need one on CombinePayload, etc. just
> >> in case.) This is why I was leaning towards just putting it at the
> >> top.
> >>
> >> I agree about the potential for incompatibility. As much as possible
> >> I'd rather extend things in a way that would be intrinsically rejected
> >> by a non-comprehending runner. But I'm not sure how to do that when
> >> introducing new constraints for existing components like this. But I'm
> >> open to other suggestions.
> >
> >
> > I was waiting for Luke to mention something he suggested offline: that
> we make this set of fields a list of URNs and require a runner to fail if
> there are any that it does not understand. That should do it for
> DoFn-granularity features. It makes sense - proto is designed to
> ignore/propagate unknown bits. We want to fail on unknown bits.
>
> I agree this would be superior for bools like requires_time_sorted_input
> and requests_finalization. Would it be worth making this a map for those
> features that have attached data such that it could not be forgotten? (E.g.
> rather than state_specs being a top-level field, it would be a value for
> the requires-state URN.) Should we move to this pattern for existing
> requirements (like the aforementioned state) or just future ones? Was the
> parameters field an attempt in this direction?
>
> I still think we need something top-level lest we not be able to modify
> anything but ParDo, but putting it on ParDo as well could be natural.
>

We can always detect on the runner/SDK side whether there is an unknown
field[1] within a payload and fail to process it, but this is painful in two
situations:
1) It doesn't provide for a good error message since you can't say what the
purpose of the field is. With a capability URN, the runner/SDK could say
which URN it doesn't understand.
2) It doesn't allow for the addition of fields which don't impact semantics
of execution. For example, if the display data feature was being developed,
a runner could ignore it and still execute the pipeline correctly.

If we think this is common enough, we can add a capabilities list to the
PTransform so each PTransform can do this and has a natural way of being
extended for additions which are forwards compatible.
having capabilities on PTransform (and other constructs) is that we would
have a new URN when the specification of the transform changes. For
forwards compatible changes, each SDK/runner would map older versions of
the URN onto the latest and internally treat it as the latest version but
always downgrade it to the version the other party expects when
communicating with it. Backwards incompatible changes would always require
a new URN which capabilities at the PTransform level would not help with.
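
To make the fail-on-unknown-bits behavior concrete, here is a minimal sketch
(illustrative only: a requirements list of URNs on the pipeline proto is the
proposal under discussion, and both the field and these URNs are hypothetical):

# Hypothetical runner-side check; neither the field nor these URNs exist yet.
SUPPORTED_REQUIREMENTS = {
    'beam:requirement:pardo:stable_input:v1',
    'beam:requirement:pardo:splittable_dofn:v1',
}

def validate_requirements(pipeline_proto):
    unknown = set(pipeline_proto.requirements) - SUPPORTED_REQUIREMENTS
    if unknown:
        # Fail with a meaningful message instead of silently ignoring
        # unknown bits, which proto would otherwise propagate.
        raise ValueError(
            'Pipeline requires features this runner does not understand: %s'
            % sorted(unknown))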

> I do think that splittable ParDo and stateful ParDo should have separate
> PTransform URNs since they are different paradigms than "vanilla" ParDo.
>
> Here I disagree. What about one that is both splittable and stateful?
> Would one have a fourth URN for that? If/when another flavor of DoFn comes
> out, would we then want 8 distinct URNs? (SplittableParDo in particular can
> be executed as a normal ParDo as long as the output is bounded.)
>

I agree that you could have stateful and splittable dofns where the element
is the key and you share state and timers across restrictions. No runner is
capable of executing this efficiently.


> >> > On the SDK requirements side: the constructing SDK owns the
> Environment proto completely, so it is in a position to ensure the involved
> docker images support the necessary features.
> >>
> >> Yes.
>

I believe capabilities do exist on a Pipeline, and they inform runners about
new types of fields to be aware of.

Re: daily dataflow job failing today

2020-02-12 Thread Ahmet Altay
Should we update Beam's setup.py to skip this avro-python3 version?

On Wed, Feb 12, 2020 at 10:57 AM Alan Krumholz 
wrote:

> Makes sense. I'll add this workaround for now.
> Thanks so much for your help!
>
> On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev 
> wrote:
>
>> Alan, Dataflow workers preinstall Beam SDK dependencies, including a
>> working version of avro-python3. So after reading your email once again, I
>> think in your case you were not able to install Beam SDK locally. So a
>> workaround for you would be to `pip install avro-python3==1.9.1` or `pip
>> install pycodestyle`  before installing Beam, until AVRO-2737 is resolved.
>>
>>
>> On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
>>> it received attention.
>>>
>>> On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Opened https://issues.apache.org/jira/browse/AVRO-2738

 On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Here's a short repro:
>
> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
> root@04b45a100d16:/# pip install avro-python3
> Collecting avro-python3
>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
> ERROR: Command errored out with exit status 1:
>  command: /usr/local/bin/python -c 'import sys, setuptools,
> tokenize; sys.argv[0] =
> '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
> egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
> Complete output (5 lines):
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line 41,
> in 
> import pycodestyle
> ModuleNotFoundError: No module named 'pycodestyle'
> 
> ERROR: Command errored out with exit status 1: python setup.py
> egg_info Check the logs for full command output.
> root@04b45a100d16:/#
>
>
>
>
>
>
>
>
>
> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Yes, it is a bug in the recent Avro release. We should report it
>> to the Avro maintainers. The workaround is to downgrade avro-python3 to
>> 1.9.1, for example via requirements.txt.
>>
>> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
>> wrote:
>>
>>> avro-python3 1.9.2 was released on pypi 4 hours ago, and
>>> added pycodestyle as a dependency, probably related?
>>>
>>> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:
>>>
 +dev 

 There was recently an update to add autoformatting to the Python
 SDK[1].

 I'm seeing this during testing of a PR as well.

 1:
 https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E

 On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
 alan.krumh...@betterup.co> wrote:

> Some more information on this, as I still can't get it fixed
>
> This job is triggered using the beam[gcp] python sdk from a
> KubeFlow Pipelines component which runs on top of docker image:
> tensorflow/tensorflow:1.13.1-py3
>
> I just checked and that image hasn't been updated recently. I also
> redeployed my pipeline to another (older) deployment of KFP and it 
> gives me
> the same error (which tells me this isn't an internal KFP problem)
>
> The exact same pipeline/code running on the exact same image has
> been running fine for days. Did anything change on the beam/dataflow
> side since yesterday morning?
>
> Thanks for your help! This is a production pipeline that is not
> running for us :(
>
>
>
> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
> alan.krumh...@betterup.co> wrote:
>
>> Hi, I have a scheduled daily job that I have been running fine in
>> dataflow for days now.
>> We haven't changed anything in this code, but this morning's run
>> failed (it couldn't even spin up the job).
>> The job submits a setup.py file (that also hasn't changed), but
>> maybe that is causing the problem? (based on the error I'm getting)
>>
>> Anyone else having the same issue? or know how to fix it?

Re: daily dataflow job failing today

2020-02-12 Thread Alan Krumholz
Makes sense. I'll add this workaround for now.
Thanks so much for your help!

On Wed, Feb 12, 2020 at 10:33 AM Valentyn Tymofieiev 
wrote:

> Alan, Dataflow workers preinstall Beam SDK dependencies, including a
> working version of avro-python3. So after reading your email once again, I
> think in your case you were not able to install Beam SDK locally. So a
> workaround for you would be to `pip install avro-python3==1.9.1` or `pip
> install pycodestyle`  before installing Beam, until AVRO-2737 is resolved.
>
>
> On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev 
> wrote:
>
>> Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
>> it received attention.
>>
>> On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Opened https://issues.apache.org/jira/browse/AVRO-2738
>>>
>>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Here's a short repro:

 :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
 root@04b45a100d16:/# pip install avro-python3
 Collecting avro-python3
   Downloading avro-python3-1.9.2.tar.gz (37 kB)
 ERROR: Command errored out with exit status 1:
  command: /usr/local/bin/python -c 'import sys, setuptools,
 tokenize; sys.argv[0] =
 '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
 __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
 '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
 '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
 egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
 Complete output (5 lines):
 Traceback (most recent call last):
   File "", line 1, in 
   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line 41,
 in 
 import pycodestyle
 ModuleNotFoundError: No module named 'pycodestyle'
 
 ERROR: Command errored out with exit status 1: python setup.py egg_info
 Check the logs for full command output.
 root@04b45a100d16:/#









 On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Yes, it is a bug in the recent Avro release. We should report it
> to the Avro maintainers. The workaround is to downgrade avro-python3 to
> 1.9.1, for example via requirements.txt.
>
> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
> wrote:
>
>> avro-python3 1.9.2 was released on pypi 4 hours ago, and
>> added pycodestyle as a dependency, probably related?
>>
>> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:
>>
>>> +dev 
>>>
>>> There was recently an update to add autoformatting to the Python
>>> SDK[1].
>>>
>>> I'm seeing this during testing of a PR as well.
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>>>
>>> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
>>> alan.krumh...@betterup.co> wrote:
>>>
 Some more information on this, as I still can't get it fixed

 This job is triggered using the beam[gcp] python sdk from a
 KubeFlow Pipelines component which runs on top of docker image:
 tensorflow/tensorflow:1.13.1-py3

 I just checked and that image hasn't been updated recently. I also
 redeployed my pipeline to another (older) deployment of KFP and it 
 gives me
 the same error (which tells me this isn't an internal KFP problem)

 The exact same pipeline/code running on the exact same image has
 been running fine for days. Did anything change on the beam/dataflow
 side since yesterday morning?

 Thanks for your help! This is a production pipeline that is not
 running for us :(



 On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
 alan.krumh...@betterup.co> wrote:

> Hi, I have a scheduled daily job that I have been running fine in
> dataflow for days now.
> We haven't changed anything in this code, but this morning's run
> failed (it couldn't even spin up the job).
> The job submits a setup.py file (that also hasn't changed), but
> maybe that is causing the problem? (based on the error I'm getting)
>
> Anyone else having the same issue? or know how to fix it?
> Thanks!
>
> ERROR: Complete output from command python setup.py egg_info:
> 2 ERROR: Traceback (most recent call last):
> 3 File "", line 1, in 
> 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line
> 41, in 

Re: daily dataflow job failing today

2020-02-12 Thread Valentyn Tymofieiev
Alan, Dataflow workers preinstall Beam SDK dependencies, including a
working version of avro-python3. So after reading your email once again, I
think in your case you were not able to install Beam SDK locally. So a
workaround for you would be to `pip install avro-python3==1.9.1` or `pip
install pycodestyle` before installing Beam, until AVRO-2737 is resolved.
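
In a clean environment, that amounts to the following (a sketch, assuming a
plain virtualenv):

pip install avro-python3==1.9.1   # pin the last release that installs cleanly
pip install apache-beam[gcp]
# -- or leave avro unpinned and satisfy its setup-time import first --
pip install pycodestyle
pip install apache-beam[gcp]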


On Wed, Feb 12, 2020 at 10:21 AM Valentyn Tymofieiev 
wrote:

> Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and
> it received attention.
>
> On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev 
> wrote:
>
>> Opened https://issues.apache.org/jira/browse/AVRO-2738
>>
>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Here's a short repro:
>>>
>>> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
>>> root@04b45a100d16:/# pip install avro-python3
>>> Collecting avro-python3
>>>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
>>> ERROR: Command errored out with exit status 1:
>>>  command: /usr/local/bin/python -c 'import sys, setuptools,
>>> tokenize; sys.argv[0] =
>>> '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
>>> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
>>> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
>>> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
>>> egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>>>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
>>> Complete output (5 lines):
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line 41,
>>> in 
>>> import pycodestyle
>>> ModuleNotFoundError: No module named 'pycodestyle'
>>> 
>>> ERROR: Command errored out with exit status 1: python setup.py egg_info
>>> Check the logs for full command output.
>>> root@04b45a100d16:/#
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Yes, it is a bug in the recent Avro release. We should report it to the
 Avro maintainers. The workaround is to downgrade avro-python3 to 1.9.1, for
 example via requirements.txt.

 On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
 wrote:

> avro-python3 1.9.2 was released on pypi 4 hours ago, and
> added pycodestyle as a dependency, probably related?
>
> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:
>
>> +dev 
>>
>> There was recently an update to add autoformatting to the Python
>> SDK[1].
>>
>> I'm seeing this during testing of a PR as well.
>>
>> 1:
>> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>>
>> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
>> alan.krumh...@betterup.co> wrote:
>>
>>> Some more information on this, as I still can't get it fixed
>>>
>>> This job is triggered using the beam[gcp] python sdk from a KubeFlow
>>> Pipelines component which runs on top of docker image:
>>> tensorflow/tensorflow:1.13.1-py3
>>>
>>> I just checked and that image hasn't been updated recently. I also
>>> redeployed my pipeline to another (older) deployment of KFP and it 
>>> gives me
>>> the same error (which tells me this isn't an internal KFP problem)
>>>
>>> The exact same pipeline/code running on the exact same image has
>>> been running fine for days. Did anything change on the beam/dataflow
>>> side since yesterday morning?
>>>
>>> Thanks for your help! This is a production pipeline that is not
>>> running for us :(
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
>>> alan.krumh...@betterup.co> wrote:
>>>
 Hi, I have a scheduled daily job that I have been running fine in
 dataflow for days now.
 We haven't changed anything in this code, but this morning's run
 failed (it couldn't even spin up the job).
 The job submits a setup.py file (that also hasn't changed), but
 maybe that is causing the problem? (based on the error I'm getting)

 Anyone else having the same issue? or know how to fix it?
 Thanks!

 ERROR: Complete output from command python setup.py egg_info:
 2 ERROR: Traceback (most recent call last):
 3 File "", line 1, in 
 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41,
 in 
 5 import pycodestyle
 6 ImportError: No module named 'pycodestyle'
 7 
 8ERROR: Command "python setup.py egg_info" failed with error code
 1 in /tmp/pip-install-42zyi89t/avro-python3/
 9 ERROR: 

Re: daily dataflow job failing today

2020-02-12 Thread Valentyn Tymofieiev
Ah, there's already https://issues.apache.org/jira/browse/AVRO-2737 and it
received attention.

On Wed, Feb 12, 2020 at 10:19 AM Valentyn Tymofieiev 
wrote:

> Opened https://issues.apache.org/jira/browse/AVRO-2738
>
> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev 
> wrote:
>
>> Here's a short repro:
>>
>> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
>> root@04b45a100d16:/# pip install avro-python3
>> Collecting avro-python3
>>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
>> ERROR: Command errored out with exit status 1:
>>  command: /usr/local/bin/python -c 'import sys, setuptools, tokenize;
>> sys.argv[0] = '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
>> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
>> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
>> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
>> egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
>> Complete output (5 lines):
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line 41, in
>> 
>> import pycodestyle
>> ModuleNotFoundError: No module named 'pycodestyle'
>> 
>> ERROR: Command errored out with exit status 1: python setup.py egg_info
>> Check the logs for full command output.
>> root@04b45a100d16:/#
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Yes, it is a bug in the recent Avro release. We should report it to the
>>> Avro maintainers. The workaround is to downgrade avro-python3 to 1.9.1, for
>>> example via requirements.txt.
>>>
>>> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
>>> wrote:
>>>
 avro-python3 1.9.2 was released on pypi 4 hours ago, and
 added pycodestyle as a dependency, probably related?

 On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:

> +dev 
>
> There was recently an update to add autoformatting to the Python
> SDK[1].
>
> I'm seeing this during testing of a PR as well.
>
> 1:
> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>
> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
> alan.krumh...@betterup.co> wrote:
>
>> Some more information on this, as I still can't get it fixed
>>
>> This job is triggered using the beam[gcp] python sdk from a KubeFlow
>> Pipelines component which runs on top of docker image:
>> tensorflow/tensorflow:1.13.1-py3
>>
>> I just checked and that image hasn't been updated recently. I also
>> redeployed my pipeline to another (older) deployment of KFP and it gives 
>> me
>> the same error (which tells me this isn't an internal KFP problem)
>>
>> The exact same pipeline/code running on the exact same image has been
>> running fine for days. Did anything change on the beam/dataflow side
>> since yesterday morning?
>>
>> Thanks for your help! This is a production pipeline that is not
>> running for us :(
>>
>>
>>
>> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
>> alan.krumh...@betterup.co> wrote:
>>
>>> Hi, I have a scheduled daily job that I have been running fine in
>>> dataflow for days now.
>>> We haven't changed anything in this code, but this morning's run
>>> failed (it couldn't even spin up the job).
>>> The job submits a setup.py file (that also hasn't changed), but maybe
>>> that is causing the problem? (based on the error I'm getting)
>>>
>>> Anyone else having the same issue? or know how to fix it?
>>> Thanks!
>>>
>>> ERROR: Complete output from command python setup.py egg_info:
>>> 2 ERROR: Traceback (most recent call last):
>>> 3 File "", line 1, in 
>>> 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41,
>>> in 
>>> 5 import pycodestyle
>>> 6 ImportError: No module named 'pycodestyle'
>>> 7 
>>> 8ERROR: Command "python setup.py egg_info" failed with error code 1
>>> in /tmp/pip-install-42zyi89t/avro-python3/
>>> 9 ERROR: Complete output from command python setup.py egg_info:
>>> 10 ERROR: Traceback (most recent call last):
>>> 11 File "", line 1, in 
>>> 12 File "/tmp/pip-install-wrqytf9a/avro-python3/setup.py", line 41,
>>> in 
>>> 13 import pycodestyle
>>> 14 ImportError: No module named 'pycodestyle'
>>> 15 
>>> 16ERROR: Command "python setup.py egg_info" failed with error code
>>> 1 in /tmp/pip-install-wrqytf9a/avro-python3/
>>>
>>


Re: daily dataflow job failing today

2020-02-12 Thread Valentyn Tymofieiev
Opened https://issues.apache.org/jira/browse/AVRO-2738

On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev 
wrote:

> Here's a short repro:
>
> :~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
> root@04b45a100d16:/# pip install avro-python3
> Collecting avro-python3
>   Downloading avro-python3-1.9.2.tar.gz (37 kB)
> ERROR: Command errored out with exit status 1:
>  command: /usr/local/bin/python -c 'import sys, setuptools, tokenize;
> sys.argv[0] = '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
> __file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
> egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
>  cwd: /tmp/pip-install-mmy4vspt/avro-python3/
> Complete output (5 lines):
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line 41, in
> 
> import pycodestyle
> ModuleNotFoundError: No module named 'pycodestyle'
> 
> ERROR: Command errored out with exit status 1: python setup.py egg_info
> Check the logs for full command output.
> root@04b45a100d16:/#
>
>
>
>
>
>
>
>
>
> On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev 
> wrote:
>
>> Yes, it is a bug in the recent Avro release. We should report it to the
>> Avro maintainers. The workaround is to downgrade avro-python3 to 1.9.1, for
>> example via requirements.txt.
>>
>> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
>> wrote:
>>
>>> avro-python3 1.9.2 was released on pypi 4 hours ago, and
>>> added pycodestyle as a dependency, probably related?
>>>
>>> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:
>>>
 +dev 

 There was recently an update to add autoformatting to the Python SDK[1].

 I'm seeing this during testing of a PR as well.

 1:
 https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E

 On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz <
 alan.krumh...@betterup.co> wrote:

> Some more information on this, as I still can't get it fixed
>
> This job is triggered using the beam[gcp] python sdk from a KubeFlow
> Pipelines component which runs on top of docker image:
> tensorflow/tensorflow:1.13.1-py3
>
> I just checked and that image hasn't been updated recently. I also
> redeployed my pipeline to another (older) deployment of KFP and it gives 
> me
> the same error (which tells me this isn't an internal KFP problem)
>
> The exact same pipeline/code running on the exact same image has been
> running fine for days. Did anything change on the beam/dataflow side
> since yesterday morning?
>
> Thanks for your help! This is a production pipeline that is not
> running for us :(
>
>
>
> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
> alan.krumh...@betterup.co> wrote:
>
>> Hi, I have a scheduled daily job that I have been running fine in
>> dataflow for days now.
>> We haven't changed anything in this code, but this morning's run failed
>> (it couldn't even spin up the job).
>> The job submits a setup.py file (that also hasn't changed), but
>> maybe that is causing the problem? (based on the error I'm getting)
>>
>> Anyone else having the same issue? or know how to fix it?
>> Thanks!
>>
>> ERROR: Complete output from command python setup.py egg_info:
>> 2 ERROR: Traceback (most recent call last):
>> 3 File "", line 1, in 
>> 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41,
>> in 
>> 5 import pycodestyle
>> 6 ImportError: No module named 'pycodestyle'
>> 7 
>> 8ERROR: Command "python setup.py egg_info" failed with error code 1
>> in /tmp/pip-install-42zyi89t/avro-python3/
>> 9 ERROR: Complete output from command python setup.py egg_info:
>> 10 ERROR: Traceback (most recent call last):
>> 11 File "", line 1, in 
>> 12 File "/tmp/pip-install-wrqytf9a/avro-python3/setup.py", line 41,
>> in 
>> 13 import pycodestyle
>> 14 ImportError: No module named 'pycodestyle'
>> 15 
>> 16ERROR: Command "python setup.py egg_info" failed with error code 1
>> in /tmp/pip-install-wrqytf9a/avro-python3/
>>
>


Re: daily dataflow job failing today

2020-02-12 Thread Valentyn Tymofieiev
Here's a short repro:

:~$ docker run -it --entrypoint=/bin/bash python:3.7-stretch
root@04b45a100d16:/# pip install avro-python3
Collecting avro-python3
  Downloading avro-python3-1.9.2.tar.gz (37 kB)
ERROR: Command errored out with exit status 1:
 command: /usr/local/bin/python -c 'import sys, setuptools, tokenize;
sys.argv[0] = '"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';
__file__='"'"'/tmp/pip-install-mmy4vspt/avro-python3/setup.py'"'"';f=getattr(tokenize,
'"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
'"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
egg_info --egg-base /tmp/pip-install-mmy4vspt/avro-python3/pip-egg-info
 cwd: /tmp/pip-install-mmy4vspt/avro-python3/
Complete output (5 lines):
Traceback (most recent call last):
  File "", line 1, in 
  File "/tmp/pip-install-mmy4vspt/avro-python3/setup.py", line 41, in

import pycodestyle
ModuleNotFoundError: No module named 'pycodestyle'

ERROR: Command errored out with exit status 1: python setup.py egg_info
Check the logs for full command output.
root@04b45a100d16:/#









On Wed, Feb 12, 2020 at 10:14 AM Valentyn Tymofieiev 
wrote:

> Yes, it is a bug in the recent Avro release. We should report it to the
> Avro maintainers. The workaround is to downgrade avro-python3 to 1.9.1, for
> example via requirements.txt.
>
> On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz 
> wrote:
>
>> avro-python3 1.9.2 was released on pypi 4 hours ago, and
>> added pycodestyle as a dependency, probably related?
>>
>> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:
>>
>>> +dev 
>>>
>>> There was recently an update to add autoformatting to the Python SDK[1].
>>>
>>> I'm seeing this during testing of a PR as well.
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>>>
>>> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz 
>>> wrote:
>>>
 Some more information on this, as I still can't get it fixed

 This job is triggered using the beam[gcp] python sdk from a KubeFlow
 Pipelines component which runs on top of docker image:
 tensorflow/tensorflow:1.13.1-py3

 I just checked and that image hasn't been updated recently. I also
 redeployed my pipeline to another (older) deployment of KFP and it gives me
 the same error (which tells me this isn't an internal KFP problem)

 The exact same pipeline/code running on the exact same image has been
 running fine for days. Did anything change on the beam/dataflow side since
 yesterday morning?

 Thanks for your help! This is a production pipeline that is not running
 for us :(



 On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz <
 alan.krumh...@betterup.co> wrote:

> Hi, I have a scheduled daily job that I have been running fine in
> dataflow for days now.
> We haven't changed anything in this code, but this morning's run failed
> (it couldn't even spin up the job).
> The job submits a setup.py file (that also hasn't changed), but maybe
> that is causing the problem? (based on the error I'm getting)
>
> Anyone else having the same issue? or know how to fix it?
> Thanks!
>
> ERROR: Complete output from command python setup.py egg_info:
> 2 ERROR: Traceback (most recent call last):
> 3 File "", line 1, in 
> 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41, in
> 
> 5 import pycodestyle
> 6 ImportError: No module named 'pycodestyle'
> 7 
> 8ERROR: Command "python setup.py egg_info" failed with error code 1
> in /tmp/pip-install-42zyi89t/avro-python3/
> 9 ERROR: Complete output from command python setup.py egg_info:
> 10 ERROR: Traceback (most recent call last):
> 11 File "", line 1, in 
> 12 File "/tmp/pip-install-wrqytf9a/avro-python3/setup.py", line 41,
> in 
> 13 import pycodestyle
> 14 ImportError: No module named 'pycodestyle'
> 15 
> 16ERROR: Command "python setup.py egg_info" failed with error code 1
> in /tmp/pip-install-wrqytf9a/avro-python3/
>



Re: daily dataflow job failing today

2020-02-12 Thread Valentyn Tymofieiev
Yes, it is a bug in the recent Avro release. We should report it to the
Avro maintainers. The workaround is to downgrade avro-python3 to 1.9.1, for
example via requirements.txt.

On Wed, Feb 12, 2020 at 10:06 AM Steve Niemitz  wrote:

> avro-python3 1.9.2 was released on pypi 4 hours ago, and added pycodestyle
> as a dependency, probably related?
>
> On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:
>
>> +dev 
>>
>> There was recently an update to add autoformatting to the Python SDK[1].
>>
>> I'm seeing this during testing of a PR as well.
>>
>> 1:
>> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>>
>> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz 
>> wrote:
>>
>>> Some more information on this, as I still can't get it fixed
>>>
>>> This job is triggered using the beam[gcp] python sdk from a KubeFlow
>>> Pipelines component which runs on top of docker image:
>>> tensorflow/tensorflow:1.13.1-py3
>>>
>>> I just checked and that image hasn't been updated recently. I also
>>> redeployed my pipeline to another (older) deployment of KFP and it gives me
>>> the same error (which tells me this isn't an internal KFP problem)
>>>
>>> The exact same pipeline/code running on the exact same image has been
>>> running fine for days. Did anything change on the beam/dataflow side since
>>> yesterday morning?
>>>
>>> Thanks for your help! This is a production pipeline that is not running
>>> for us :(
>>>
>>>
>>>
>>> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz 
>>> wrote:
>>>
 Hi, I have a scheduled daily job that I have been running fine in
 dataflow for days now.
 We haven't changed anything in this code, but this morning's run failed
 (it couldn't even spin up the job).
 The job submits a setup.py file (that also hasn't changed), but maybe that
 is causing the problem? (based on the error I'm getting)

 Anyone else having the same issue? or know how to fix it?
 Thanks!

 ERROR: Complete output from command python setup.py egg_info:
 2 ERROR: Traceback (most recent call last):
 3 File "", line 1, in 
 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41, in
 
 5 import pycodestyle
 6 ImportError: No module named 'pycodestyle'
 7 
 8ERROR: Command "python setup.py egg_info" failed with error code 1 in
 /tmp/pip-install-42zyi89t/avro-python3/
 9 ERROR: Complete output from command python setup.py egg_info:
 10 ERROR: Traceback (most recent call last):
 11 File "", line 1, in 
 12 File "/tmp/pip-install-wrqytf9a/avro-python3/setup.py", line 41, in
 
 13 import pycodestyle
 14 ImportError: No module named 'pycodestyle'
 15 
 16ERROR: Command "python setup.py egg_info" failed with error code 1
 in /tmp/pip-install-wrqytf9a/avro-python3/

>>>


Re: daily dataflow job failing today

2020-02-12 Thread Steve Niemitz
avro-python3 1.9.2 was released on pypi 4 hours ago, and added pycodestyle
as a dependency, probably related?
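
That would match the traceback: line 41 of the 1.9.2 setup.py imports
pycodestyle at module level, so the import runs during `pip install`, before
the declared dependency can be installed. Roughly (reconstructed from the
traceback above, not copied from the actual avro-python3 source):

# setup.py -- reconstructed sketch of the failing pattern
import pycodestyle  # line 41: ModuleNotFoundError at egg_info time

# A build-time-only tool would normally be guarded, e.g.:
try:
    import pycodestyle
except ImportError:
    pycodestyle = None  # skip style checks when the tool is absent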

On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik  wrote:

> +dev 
>
> There was recently an update to add autoformatting to the Python SDK[1].
>
> I'm seeing this during testing of a PR as well.
>
> 1:
> https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E
>
> On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz 
> wrote:
>
>> Some more information on this, as I still can't get it fixed
>>
>> This job is triggered using the beam[gcp] python sdk from a KubeFlow
>> Pipelines component which runs on top of docker image:
>> tensorflow/tensorflow:1.13.1-py3
>>
>> I just checked and that image hasn't been updated recently. I also
>> redeployed my pipeline to another (older) deployment of KFP and it gives me
>> the same error (which tells me this isn't an internal KFP problem)
>>
>> The exact same pipeline/code running on the exact same image has been
>> running fine for days. Did anything change on the beam/dataflow side since
>> yesterday morning?
>>
>> Thanks for your help! This is a production pipeline that is not running
>> for us :(
>>
>>
>>
>> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz 
>> wrote:
>>
>>> Hi, I have a scheduled daily job that I have been running fine in
>>> dataflow for days now.
>>> We haven't changed anything in this code, but this morning's run failed
>>> (it couldn't even spin up the job).
>>> The job submits a setup.py file (that also hasn't changed), but maybe that
>>> is causing the problem? (based on the error I'm getting)
>>>
>>> Anyone else having the same issue? or know how to fix it?
>>> Thanks!
>>>
>>> ERROR: Complete output from command python setup.py egg_info:
>>> 2 ERROR: Traceback (most recent call last):
>>> 3 File "", line 1, in 
>>> 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41, in
>>> 
>>> 5 import pycodestyle
>>> 6 ImportError: No module named 'pycodestyle'
>>> 7 
>>> 8ERROR: Command "python setup.py egg_info" failed with error code 1 in
>>> /tmp/pip-install-42zyi89t/avro-python3/
>>> 9 ERROR: Complete output from command python setup.py egg_info:
>>> 10 ERROR: Traceback (most recent call last):
>>> 11 File "", line 1, in 
>>> 12 File "/tmp/pip-install-wrqytf9a/avro-python3/setup.py", line 41, in
>>> 
>>> 13 import pycodestyle
>>> 14 ImportError: No module named 'pycodestyle'
>>> 15 
>>> 16ERROR: Command "python setup.py egg_info" failed with error code 1 in
>>> /tmp/pip-install-wrqytf9a/avro-python3/
>>>
>>


Re: daily dataflow job failing today

2020-02-12 Thread Luke Cwik
+dev 

There was recently an update to add autoformatting to the Python SDK[1].

I'm seeing this during testing of a PR as well.

1:
https://lists.apache.org/thread.html/448bb5c2d73fbd74eec7aacb5f28fa2f9d791784c2e53a2e3325627a%40%3Cdev.beam.apache.org%3E

On Wed, Feb 12, 2020 at 9:57 AM Alan Krumholz 
wrote:

> Some more information on this, as I still can't get it fixed
>
> This job is triggered using the beam[gcp] python sdk from a KubeFlow
> Pipelines component which runs on top of docker image:
> tensorflow/tensorflow:1.13.1-py3
>
> I just checked and that image hasn't been updated recently. I also
> redeployed my pipeline to another (older) deployment of KFP and it gives me
> the same error (which tells me this isn't an internal KFP problem)
>
> The exact same pipeline/code running on the exact same image has been
> running fine for days. Did anything change on the beam/dataflow side since
> yesterday morning?
>
> Thanks for your help! This is a production pipeline that is not running
> for us :(
>
>
>
> On Wed, Feb 12, 2020 at 7:21 AM Alan Krumholz 
> wrote:
>
>> Hi, I have a scheduled daily job that I have been running fine in
>> dataflow for days now.
>> We haven't changed anything in this code, but this morning's run failed
>> (it couldn't even spin up the job).
>> The job submits a setup.py file (that also hasn't changed), but maybe that
>> is causing the problem? (based on the error I'm getting)
>>
>> Anyone else having the same issue? or know how to fix it?
>> Thanks!
>>
>> ERROR: Complete output from command python setup.py egg_info:
>> 2 ERROR: Traceback (most recent call last):
>> 3 File "", line 1, in 
>> 4 File "/tmp/pip-install-42zyi89t/avro-python3/setup.py", line 41, in
>> 
>> 5 import pycodestyle
>> 6 ImportError: No module named 'pycodestyle'
>> 7 
>> 8ERROR: Command "python setup.py egg_info" failed with error code 1 in
>> /tmp/pip-install-42zyi89t/avro-python3/
>> 9 ERROR: Complete output from command python setup.py egg_info:
>> 10 ERROR: Traceback (most recent call last):
>> 11 File "", line 1, in 
>> 12 File "/tmp/pip-install-wrqytf9a/avro-python3/setup.py", line 41, in
>> 
>> 13 import pycodestyle
>> 14 ImportError: No module named 'pycodestyle'
>> 15 
>> 16ERROR: Command "python setup.py egg_info" failed with error code 1 in
>> /tmp/pip-install-wrqytf9a/avro-python3/
>>
>


Re: [PROPOSAL] Preparing for Beam 2.20.0 release

2020-02-12 Thread Ahmet Altay
+1. Thank you.

On Tue, Feb 11, 2020 at 11:01 PM Rui Wang  wrote:

> Hi all,
>
> The next (2.20.0) release branch cut is scheduled for 02/26, according to
> the calendar.
> I would like to volunteer myself to do this release.
> The plan is to cut the branch on that date, and cherrypick release-blocking
> fixes afterwards if any.
>
> Any unresolved release blocking JIRA issues for 2.20.0 should have their
> "Fix Version/s" marked as "2.20.0".
>
> Any comments or objections?
>
>
> -Rui
>


Re: Cross-language pipelines status

2020-02-12 Thread Chamikara Jayalath
On Wed, Feb 12, 2020 at 8:10 AM Alexey Romanenko 
wrote:

>
> AFAIK, there's no official guide for cross-language pipelines. But there
>> are examples and test cases you can use as reference such as:
>>
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
>>
>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
>>
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
>>
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py
>>
>
> I'm trying to work with tech writers to add more documentation related to
> cross-language (in a few months). But any help related to documenting what
> we have now is greatly appreciated.
>
>
> That would be great since the information is currently a bit scattered
> over different places. I’d be happy to help with any examples and their
> tests, which I hope I’ll have after a while.
>

Great.


>
>> The runner and SDK support is in a working state, I would say, but not
>> many IOs expose their cross-language interface yet (you can easily write
>> cross-language configuration for any Python transforms yourself, though).
>>
>
> I should mention here the test suites for portable Flink and Spark that
> Heejong added recently :)
>
>
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/
>
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/
>
>
> Nice! Looks like my question above about cross-language support in Spark
> runner was redundant.
>
>
>
>>
>>
>>> - Is the information here
>>> https://beam.apache.org/roadmap/connectors-multi-sdk/ up-to-date? Are
>>> there any other entry points you can recommend?
>>>
>>
>> I think it's up-to-date.
>>
>
> Mostly up to date. Testing status is more complete now and we are
> actively working on getting the dependencies story correct and adding
> support for DataflowRunner.
>
>
> Are there any “umbrella" Jiras regarding cross-language support that I can
> track?
>

I don't think we have an umbrella JIRA currently. I can create one and
mention it in the roadmap.
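
For anyone looking for the general shape, invoking an external transform from
Python against a running expansion service looks roughly like this (a sketch
based on expansion_service_test.py; the URN and the localhost:8097 endpoint
assume the test expansion service from that file is running):

import apache_beam as beam
from apache_beam.transforms.external import ExternalTransform
from apache_beam.transforms.external import ImplicitSchemaPayloadBuilder

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['a', 'b'])
        # URN and endpoint assume the test expansion service is running.
        | ExternalTransform(
            'beam:transforms:xlang:test:prefix',
            ImplicitSchemaPayloadBuilder({'data': '0'}),
            'localhost:8097')
        | beam.Map(print))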


Re: Cross-language pipelines status

2020-02-12 Thread Alexey Romanenko

> AFAIK, there's no official guide for cross-language pipelines. But there are 
> examples and test cases you can use as reference such as:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py
> 
> I'm trying to work with tech writers to add more documentation related to 
> cross-language (in a few months). But any help related to documenting what we 
> have now is greatly appreciated. 

That would be great since the information is currently a bit scattered over
different places. I’d be happy to help with any examples and their tests,
which I hope I’ll have after a while.

> The runner and SDK support is in a working state, I would say, but not many
> IOs expose their cross-language interface yet (you can easily write
> cross-language configuration for any Python transforms yourself, though).
> 
> I should mention here the test suites for portable Flink and Spark that
> Heejong added recently :)
> 
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/
> 

Nice! Looks like my question above about cross-language support in Spark runner 
was redundant.

> - Is the information here https://beam.apache.org/roadmap/connectors-multi-sdk/
> up-to-date? Are there any other entry points you can recommend?
> 
> I think it's up-to-date.
> 
> Mostly up to date. Testing status is more complete now and we are actively
> working on getting the dependencies story correct and adding support for
> DataflowRunner.

Are there any “umbrella" Jiras regarding cross-language support that I can 
track?




Re: FnAPI proto backwards compatibility

2020-02-12 Thread Robert Bradshaw
On Tue, Feb 11, 2020 at 7:25 PM Kenneth Knowles  wrote:
>
> On Tue, Feb 11, 2020 at 8:38 AM Robert Bradshaw 
wrote:
>>
>> On Mon, Feb 10, 2020 at 7:35 PM Kenneth Knowles  wrote:
>> >
>> > On the runner requirements side: if you have such a list at the
pipeline level, it is an opportunity for the list to be inconsistent with
the contents of the pipeline. For example, if a DoFn is marked "requires
stable input" but not listed at the pipeline level, then the runner may run
it without ensuring it requires stable input.
>>
>> Yes. Listing this feature at the top level, if used, would be part of
>> the contract. The problem here that we're trying to solve is that the
>> runner wouldn't know about the field used to mark a DoFn as "requires
>> stable input." Another alternative would be to make this kind of ParDo
>> a different URN, but that would result in a cross product of URNs for
>> all supported features.
>
>
>>
>> Rather than attaching it to the pipeline object, we could attach it to
>> the transform. (But if there are ever extensions that don't belong to
>> transforms, we'd be out of luck. It'd be even worse to attach it to
>> the ParDoPayload, as then we'd need one on CombinePayload, etc. just
>> in case.) This is why I was leaning towards just putting it at the
>> top.
>>
>> I agree about the potential for incompatibility. As much as possible
>> I'd rather extend things in a way that would be intrinsically rejected
>> by a non-comprehending runner. But I'm not sure how to do that when
>> introducing new constraints for existing components like this. But I'm
>> open to other suggestions.
>
>
> I was waiting for Luke to mention something he suggested offline: that we
make this set of fields a list of URNs and require a runner to fail if
there are any that it does not understand. That should do it for
DoFn-granularity features. It makes sense - proto is designed to
ignore/propagate unknown bits. We want to fail on unknown bits.

I agree this would be superior for bools like requires_time_sorted_input
and requests_finalization. Would it be worth making this a map for those
features that have attached data such that it could not be forgotten? (E.g.
rather than state_specs being a top-level field, it would be a value for
the requires-state URN.) Should we move to this pattern for existing
requirements (like the aforementioned state) or just future ones? Was the
parameters field an attempt in this direction?

I still think we need something top-level lest we not be able to modify
anything but ParDo, but putting it on ParDo as well could be natural.
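
To sketch the map variant suggested above (purely illustrative; these URNs,
payloads, and field shapes are hypothetical, not part of the proto today):

# Requirements as a URN -> payload map, so data-carrying features travel
# with the requirement itself and cannot be silently dropped.
state_specs_bytes = b'...'  # placeholder for serialized state specs
requirements = {
    'beam:requirement:pardo:requires_time_sorted_input:v1': b'',  # flag-like
    'beam:requirement:pardo:state:v1': state_specs_bytes,         # carries data
}
# A runner iterates the keys and fails on any URN it does not understand,
# whether or not the value carries a payload.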

> I do think that splittable ParDo and stateful ParDo should have separate
PTransform URNs since they are different paradigms than "vanilla" ParDo.

Here I disagree. What about one that is both splittable and stateful? Would
one have a fourth URN for that? If/when another flavor of DoFn comes out,
would we then want 8 distinct URNs? (SplittableParDo in particular can be
executed as a normal ParDo as long as the output is bounded.)

>> > On the SDK requirements side: the constructing SDK owns the
Environment proto completely, so it is in a position to ensure the involved
docker images support the necessary features.
>>
>> Yes.
>>
>> > Is it sufficient for each SDK involved in a cross-language expansion
to validate that it understands the inputs? For example if Python sends a
PCollection with a pickle coder to Java as input to an expansion then it
will fail. And conversely if the returned subgraph outputs a PCollection
with a Java custom coder.
>>
>> Yes. It's possible to imagine there could be some negotiation about
>> inserting length prefix coders (e.g. a Count transform could act on
>> any opaque data as long as it can delimit it), but that's still TBD.
>>
>> > More complex use cases that I can imagine all seem futuristic and
unlikely to come to pass (Python passes a pickled DoFn to the Java
expansion service which inserts it into the graph in a way where a
Java-based transform would have to invoke it on every element, etc)
>>
>> Some transforms are configured with UDFs of this form...but we'll
>> cross that bridge when we get to it.
>
>
> Now that I think harder, I know of a TimestampFn that governs the
watermark. Does SDF solve this by allowing a composite IO where the parsing
is done in one language while the watermark is somehow governed by the
other? And then there's writing a SQL UDF in your language of choice...
Anyhow, probably a tangent...

Yeah, it'd be good to support this, someday...

>> > On Mon, Feb 10, 2020 at 5:03 PM Brian Hulette 
wrote:
>> >>
>> >> I like the capabilities/requirements idea. Would these capabilities
be at a level that it would make sense to document in the capabilities
matrix? i.e. could the URNs be the values of "X" Pablo described here [1].
>> >>
>> >> Brian
>> >>
>> >> [1]
https://lists.apache.org/thread.html/e93ac64d484551d61e559e1ba0cf4a15b760e69d74c5b1d0549ff74f%40%3Cdev.beam.apache.org%3E
>> >>
>> >> On Mon, Feb 10, 2020 

Re: Cross-language pipelines status

2020-02-12 Thread Alexey Romanenko
Thank you for the response!

> AFAIK, there's no official guide for cross-language pipelines. But there are 
> examples and test cases you can use as reference such as:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py
> 

In addition, are there any tests/examples showing how to execute an external
Python transform (user code, not only IO transforms) from a Java pipeline?

> - Is this something that can already be used (currently interested in
> Java/Python pipelines), or is the main work still in progress? More precisely,
> I’m more focused on executing some Python code from Java-based pipelines.
> 
> The runner and SDK support is in a working state, I would say, but not many
> IOs expose their cross-language interface yet (you can easily write
> cross-language configuration for any Python transforms yourself, though).

If I understand correctly, every runner needs support for cross-language
transforms in addition to portability support. If so, do you know if someone
is working on adding such support to the Spark runner?




Re: Labels on PR

2020-02-12 Thread Ismaël Mejía
The prefix is just extra characters that makes readability worse. Notice
that the full category (e.g. ios/runners/etc) will still match, because we
have an extra tag equivalent to the prefix, so filtering is still possible.
You rarely need to filter on more than one criterion, which is why GitHub
does not allow it (and the reason to have the extra per-category labels).

The only issue I can see is a possible name collision in the future, but
until that happens I think this is a reasonable tradeoff.

Excellent idea to unify the colors for the categories, +1!

On Wed, Feb 12, 2020 at 2:33 PM Alex Van Boxel  wrote:

> Ismael, I saw that you removed the prefix. I still want to have some
> grouping between the subtypes, so I color coded them.
>
> I also added all the labels in the file. We now have 62 labels.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Wed, Feb 12, 2020 at 12:03 PM Ismaël Mejía  wrote:
>
>> Forgot to mention: older PRs will remain unclassified; it is up to you if
>> you want to label them manually. All new PRs will be automatically labeled.
>>
>> On Wed, Feb 12, 2020 at 12:02 PM Ismaël Mejía  wrote:
>>
>>> For info, Alex's PR to support the autolabeler was merged today and INFRA
>>> enabled the plugin.
>>> I created an artificial PR to check it was autolabeled correctly.
>>> It is working perfectly now.
>>> Thanks Alex !
>>>
>>> On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw 
>>> wrote:
>>>
 +1 to finding the right balance.

 I do think per-runner makes sense, rather than a general "runners."
 IOs might make sense as well. Not sure about all the extensions-*; I'd
 leave those out for now.

 On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía  wrote:
 >
 > > So I propose going simple with a limited set of labels. Later on we
 can refine. Don't forget that those labels are only useful during the
 life-cycle of a PR.
 >
 > Labels are handy for quick filtering and finding PRs we care about
 for example
 > to review.
 >
 > I agree with the feeling that we should not go to the extremes, but
 what is
 > requested in the PR would rarely produce more than 5 labels per PR.
 For example
 > if a PR changes KafkaIO and something in the CI it will produce "java
 io kafka
 > infra", a pure change on Flink runer will produce "runners flink"
 >
 > 100% agreed with not having many labels and keeping them short, but
 the current
 > classification lacks detail, e.g. few people care about some general
 categories
 > "runners" or "io", but maintainers may care about their specific
 categories like
 > "spark" or "kafka" so I don't think that this extra level of detail is
 > inappropriate and in the end it will only add one extra label per
 matching path.
 >
 > Let's give it a try; if it is too excessive we can take the opposite
 path and reduce it.
 >
 > Ismaël
 >
 >
 > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel 
 wrote:
 >>
 >> I'm wondering if we're not taking it too far with those detailed
 labels. It's like going from nothing to super detailed. The simplest use case
 hasn't proven itself in practice yet.
 >>
 >> So I propose going simple with a limited set of labels. Later on we
 can refine. Don't forget that those labels are only useful during the
 life-cycle of a PR.
 >>
 >>  _/
 >> _/ Alex Van Boxel
 >>
 >>
 >> On Tue, Feb 11, 2020 at 9:46 AM Ismaël Mejía 
 wrote:
 >>>
 >>> Left some comments too, let's keep the discussion on refinements in
 the PR.
 >>>
 >>> On Tue, Feb 11, 2020 at 9:13 AM jincheng sun <
 sunjincheng...@gmail.com> wrote:
 
  I left comments on the PR; the main suggestion is that we may need a
 discussion about what kind of labels should be added. I would like to share
 my thoughts as follows:
 
  I think we need to add labels according to some rules. For
 example, the easiest way is to add labels by languages, java / python / go
 etc. But this kind of help is very limited, so we need to subdivide some
 labels, such as by components. Currently we have more than 70 components,
 and configuring a label for each component seems cumbersome. So we
 should have some rules for dividing labels, which can play the role of
 labels without being too cumbersome. Such as:
 
  We can add `extensions` or `extensions-ideas` and `extensions-java`
 for the following components:
 
  - extensions-ideas
  - extensions-java-join-library
  - extensions-java-json
  - extensions-java-protobuf
  - extensions-java-sketching
  - extensions-java-sorter
 
  And it's better to add a label for each Runner as follows:
 
  - runner-apex
  - runner-core
  - runner-dataflow
  - runner-direct
  - runner-flink
  - 

Re: Labels on PR

2020-02-12 Thread Alex Van Boxel
Ismael, I saw that you removed the prefix. I still want to have some
grouping between the subtypes, so I color coded them.

I also added all the labels in the file. We now have 62 labels.

 _/
_/ Alex Van Boxel


On Wed, Feb 12, 2020 at 12:03 PM Ismaël Mejía  wrote:

> Forgot to mention, older PRs will look unclassified; it's up to you if
> you want to label them manually. All new PRs will be automatically labeled.
>
> On Wed, Feb 12, 2020 at 12:02 PM Ismaël Mejía  wrote:
>
>> For info, Alex's PR to support the autolabeler was merged today and INFRA
>> enabled the plugin.
>> I created an artificial PR to check it was autolabeled correctly.
>> It is working perfectly now.
>> Thanks Alex !
>>
>> On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw 
>> wrote:
>>
>>> +1 to finding the right balance.
>>>
>>> I do think per-runner makes sense, rather than a general "runners."
>>> IOs might make sense as well. Not sure about all the extensions-*; I'd
>>> leave those out for now.
>>>
>>> On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía  wrote:
>>> >
>>> > > So I propose going simple with a limited set of labels. Later on we
>>> can refine. Don't forget that those labels are only useful during the
>>> life-cycle of a PR.
>>> >
>>> > Labels are handy for quick filtering and finding PRs we care about for
>>> example
>>> > to review.
>>> >
>>> > I agree with the feeling that we should not go to the extremes, but
>>> what is
>>> > requested in the PR would rarely produce more than 5 labels per PR.
>>> For example
>>> > if a PR changes KafkaIO and something in the CI it will produce "java
>>> io kafka
>>> > infra", a pure change on Flink runer will produce "runners flink"
>>> >
>>> > 100% agreed with not having many labels and keeping them short, but
>>> the current
>>> > classification lacks detail, e.g. few people care about some general
>>> categories
>>> > "runners" or "io", but maintainers may care about their specific
>>> categories like
>>> > "spark" or "kafka" so I don't think that this extra level of detail is
>>> > inappropriate and in the end it will only add one extra label per
>>> matching path.
>>> >
>>> > Let's give it a try; if it is too excessive we can take the opposite
>>> path and reduce it.
>>> >
>>> > Ismaël
>>> >
>>> >
>>> > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel 
>>> wrote:
>>> >>
>>> >> I'm wondering if we're not taking it too far with those detailed
>>> labels. It's like going from nothing to super detailed. The simplest use case
>>> hasn't proven itself in practice yet.
>>> >>
>>> >> So I propose going simple with a limited set of labels. Later on we
>>> can refine. Don't forget that those labels are only useful during the
>>> life-cycle of a PR.
>>> >>
>>> >>  _/
>>> >> _/ Alex Van Boxel
>>> >>
>>> >>
>>> >> On Tue, Feb 11, 2020 at 9:46 AM Ismaël Mejía 
>>> wrote:
>>> >>>
>>> >>> Left some comments too, let's keep the discussion on refinements in
>>> the PR.
>>> >>>
>>> >>> On Tue, Feb 11, 2020 at 9:13 AM jincheng sun <
>>> sunjincheng...@gmail.com> wrote:
>>> 
>>>  I left comments on the PR; the main suggestion is that we may need a
>>> discussion about what kind of labels should be added. I would like to share
>>> my thoughts as follows:
>>> 
>>>  I think we need to add labels according to some rules. For example,
>>> the easiest way is to add labels by languages, java / python / go etc. But
>>> this kind of help is very limited, so we need to subdivide some labels,
>>> such as by components. Currently we have more than 70 components, and
>>> configuring a label for each component seems cumbersome. So we should
>>> have some rules for dividing labels, which can play the role of labels
>>> without being too cumbersome. Such as:
>>> 
>>>  We can add `extensions` or `extensions-ideas` and `extensions-java`
>>> for the following components:
>>> 
>>>  - extensions-ideas
>>>  - extensions-java-join-library
>>>  - extensions-java-json
>>>  - extensions-java-protobuf
>>>  - extensions-java-sketching
>>>  - extensions-java-sorter
>>> 
>>>  And it's better to add a label for each Runner as follows:
>>> 
>>>  - runner-apex
>>>  - runner-core
>>>  - runner-dataflow
>>>  - runner-direct
>>>  - runner-flink
>>>  - runner-jstorm
>>>  - runner-...
>>> 
>>>  So, I think it would be great to collect feedback from the community
>>> on the set of labels needed.
>>> 
>>>  What do you think?
>>> 
>>>  Best,
>>>  Jincheng
>>> 
>>>  Alex Van Boxel  于2020年2月11日周二 下午3:11写道:
>>> >
>>> > I've opened a PR and a ticket with INFRA.
>>> >
>>> > PR: https://github.com/apache/beam/pull/10824
>>> >
>>> >  _/
>>> > _/ Alex Van Boxel
>>> >
>>> >
>>> > On Tue, Feb 11, 2020 at 6:57 AM jincheng sun <
>>> sunjincheng...@gmail.com> wrote:
>>> >>
>>> >> +1. Autolabeler seems really cool and it seems that it's simple
>>> to configure and set up.
>>> >>
>>> >> Best,
>>> 

Re: Labels on PR

2020-02-12 Thread Ismaël Mejía
Forgot to mention, older PRs will look unclassified; it's up to you if you
want to label them manually. All new PRs will be automatically labeled.

On Wed, Feb 12, 2020 at 12:02 PM Ismaël Mejía  wrote:

> For info, Alex's PR to support the autolabeler was merged today and INFRA
> enabled the plugin.
> I created an artificial PR to check it was autolabeled correctly.
> It is working perfectly now.
> Thanks Alex !
>
> On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw 
> wrote:
>
>> +1 to finding the right balance.
>>
>> I do think per-runner makes sense, rather than a general "runners."
>> IOs might make sense as well. Not sure about all the extensions-*; I'd
>> leave those out for now.
>>
>> On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía  wrote:
>> >
>> > > So I propose going simple with a limited set of labels. Later on we
>> can refine. Don't forget that those labels are only useful during the
>> life-cycle of a PR.
>> >
>> > Labels are handy for quick filtering and finding PRs we care about for
>> example
>> > to review.
>> >
>> > I agree with the feeling that we should not go to the extremes, but
>> what is
>> > requested in the PR would rarely produce more than 5 labels per PR.
>> For example
>> > if a PR changes KafkaIO and something in the CI it will produce "java
>> io kafka
>> > infra", a pure change on Flink runer will produce "runners flink"
>> >
>> > 100% agreed with not having many labels and keeping them short, but the
>> current
>> > classification lacks detail, e.g. few people care about some general
>> categories
>> > "runners" or "io", but maintainers may care about their specific
>> categories like
>> > "spark" or "kafka" so I don't think that this extra level of detail is
>> > inappropriate and in the end it will only add one extra label per
>> matching path.
>> >
>> > Let's give it a try; if it is too excessive we can take the opposite path
>> and reduce it.
>> >
>> > Ismaël
>> >
>> >
>> > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel 
>> wrote:
>> >>
>> >> I'm wondering if we're not taking it too far with those detailed
>> labels. It's like going from nothing to super detailed. The simplest use case
>> hasn't proven itself in practice yet.
>> >>
>> >> So I propose going simple with a limited set of labels. Later on we
>> can refine. Don't forget that those labels are only useful during the
>> life-cycle of a PR.
>> >>
>> >>  _/
>> >> _/ Alex Van Boxel
>> >>
>> >>
>> >> On Tue, Feb 11, 2020 at 9:46 AM Ismaël Mejía 
>> wrote:
>> >>>
>> >>> Left some comments too, let's keep the discussion on refinements in
>> the PR.
>> >>>
>> >>> On Tue, Feb 11, 2020 at 9:13 AM jincheng sun <
>> sunjincheng...@gmail.com> wrote:
>> 
>>  I left comments on the PR; the main suggestion is that we may need a
>> discussion about what kind of labels should be added. I would like to share
>> my thoughts as follows:
>> 
>>  I think we need to add labels according to some rules. For example,
>> the easiest way is to add labels by languages, java / python / go etc. But
>> this kind of help is very limited, so we need to subdivide some labels,
>> such as by components. Currently we have more than 70 components, and
>> configuring a label for each component seems cumbersome. So we should
>> have some rules for dividing labels, which can play the role of labels
>> without being too cumbersome. Such as:
>> 
>>  We can add `extensions` or `extensions-ideas` and `extensions-java`
>> for the following components:
>> 
>>  - extensions-ideas
>>  - extensions-java-join-library
>>  - extensions-java-json
>>  - extensions-java-protobuf
>>  - extensions-java-sketching
>>  - extensions-java-sorter
>> 
>>  And it's better to add a label for each Runner as follows:
>> 
>>  - runner-apex
>>  - runner-core
>>  - runner-dataflow
>>  - runner-direct
>>  - runner-flink
>>  - runner-jstorm
>>  - runner-...
>> 
>>  So, I think it would be great to collect feedback from the community
>> on the set of labels needed.
>> 
>>  What do you think?
>> 
>>  Best,
>>  Jincheng
>> 
>>  Alex Van Boxel  于2020年2月11日周二 下午3:11写道:
>> >
>> > I've opened a PR and a ticket with INFRA.
>> >
>> > PR: https://github.com/apache/beam/pull/10824
>> >
>> >  _/
>> > _/ Alex Van Boxel
>> >
>> >
>> > On Tue, Feb 11, 2020 at 6:57 AM jincheng sun <
>> sunjincheng...@gmail.com> wrote:
>> >>
>> >> +1. Autolabeler seems really cool and it seems that it's simple to
>> configure and set up.
>> >>
>> >> Best,
>> >> Jincheng
>> >>
>> >>
>> >>
>> >> Udi Meiri  于2020年2月11日周二 上午2:01写道:
>> >>>
>> >>> Cool!
>> >>>
>> >>> On Mon, Feb 10, 2020 at 9:27 AM Robert Burke 
>> wrote:
>> 
>>  +1 to autolabeling
>> 
>>  On Mon, Feb 10, 2020, 9:21 AM Luke Cwik 
>> wrote:
>> >
>> > Nice
>> >
>> > On Mon, Feb 10, 2020 at 2:52 AM Alex Van 

Re: Labels on PR

2020-02-12 Thread Ismaël Mejía
For info, Alex's PR to support the autolabeler was merged today and INFRA enabled
the plugin.
I created an artificial PR to check it was autolabeled correctly.
It is working perfectly now.
Thanks Alex !

On Tue, Feb 11, 2020 at 5:23 PM Robert Bradshaw  wrote:

> +1 to finding the right balance.
>
> I do think per-runner makes sense, rather than a general "runners."
> IOs might make sense as well. Not sure about all the extensions-*; I'd
> leave those out for now.
>
> On Tue, Feb 11, 2020 at 5:56 AM Ismaël Mejía  wrote:
> >
> > > So I propose going simple with a limited set of labels. Later on we
> can refine. Don't forget that those labels are only useful during the
> life-cycle of a PR.
> >
> > Labels are handy for quick filtering and finding PRs we care about for
> example
> > to review.
> >
> > I agree with the feeling that we should not go to the extremes, but what
> is
> > requested in the PR would rarely produce more than 5 labels per PR.  For
> example
> > if a PR changes KafkaIO and something in the CI it will produce "java io
> kafka
> > infra", a pure change on Flink runer will produce "runners flink"
> >
> > 100% agreed with not having many labels and keeping them short, but the
> current
> > classification lacks detail, e.g. few people care about some general
> categories
> > "runners" or "io", but maintainers may care about their specific
> categories like
> > "spark" or "kafka" so I don't think that this extra level of detail is
> > inappropriate and in the end it will only add one extra label per
> matching path.
> >
> > Let's give it a try; if it is too excessive we can take the opposite path
> and reduce it.
> >
> > Ismaël
> >
> >
> > On Tue, Feb 11, 2020 at 1:04 PM Alex Van Boxel  wrote:
> >>
> >> I'm wondering if we're not taking it too far with those detailed
> labels. It's like going from nothing to super detailed. The simplest use case
> hasn't proven itself in practice yet.
> >>
> >> So I propose going simple with a limited set of labels. Later on we can
> refine. Don't forget that those labels are only useful during the life-cycle
> of a PR.
> >>
> >>  _/
> >> _/ Alex Van Boxel
> >>
> >>
> >> On Tue, Feb 11, 2020 at 9:46 AM Ismaël Mejía  wrote:
> >>>
> >>> Left some comments too, let's keep the discussion on refinements in the
> PR.
> >>>
> >>> On Tue, Feb 11, 2020 at 9:13 AM jincheng sun 
> wrote:
> 
>  I left comments on the PR; the main suggestion is that we may need a
> discussion about what kind of labels should be added. I would like to share
> my thoughts as follows:
> 
>  I think we need to add labels according to some rules. For example,
> the easiest way is to add labels by languages, java / python / go etc. But
> this kind of help is very limited, so we need to subdivide some labels,
> such as by components. Currently we have more than 70 components, and
> configuring a label for each component seems cumbersome. So we should
> have some rules for dividing labels, which can play the role of labels
> without being too cumbersome. Such as:
> 
>  We can add `extensions` or `extensions-ideas` and `extensions-java` for
> the following components:
> 
>  - extensions-ideas
>  - extensions-java-join-library
>  - extensions-java-json
>  - extensions-java-protobuf
>  - extensions-java-sketching
>  - extensions-java-sorter
> 
>  And it's better to add a label for each Runner as follows:
> 
>  - runner-apex
>  - runner-core
>  - runner-dataflow
>  - runner-direct
>  - runner-flink
>  - runner-jstorm
>  - runner-...
> 
>  So, I think it would be great to collect feedback from the community on
> the set of labels needed.
> 
>  What do you think?
> 
>  Best,
>  Jincheng
> 
>  Alex Van Boxel  于2020年2月11日周二 下午3:11写道:
> >
> > I've opened a PR and a ticket with INFRA.
> >
> > PR: https://github.com/apache/beam/pull/10824
> >
> >  _/
> > _/ Alex Van Boxel
> >
> >
> > On Tue, Feb 11, 2020 at 6:57 AM jincheng sun <
> sunjincheng...@gmail.com> wrote:
> >>
> >> +1. Autolabeler seems really cool and it seems that it's simple to
> configure and set up.
> >>
> >> Best,
> >> Jincheng
> >>
> >>
> >>
> >> Udi Meiri  于2020年2月11日周二 上午2:01写道:
> >>>
> >>> Cool!
> >>>
> >>> On Mon, Feb 10, 2020 at 9:27 AM Robert Burke 
> wrote:
> 
>  +1 to autolabeling
> 
>  On Mon, Feb 10, 2020, 9:21 AM Luke Cwik  wrote:
> >
> > Nice
> >
> > On Mon, Feb 10, 2020 at 2:52 AM Alex Van Boxel 
> wrote:
> >>
> >> Ha, cool. I'll have a look at the autolabeler. The infra stuff
> is not something I've looked at... I'll dive into that.
> >>
> >>  _/
> >> _/ Alex Van Boxel
> >>
> >>
> >> On Mon, Feb 10, 2020 at 11:49 AM Ismaël Mejía <
> ieme...@gmail.com> wrote:
> >>>
> >>> +1
> >>>
> 

Re: Dynamic timers now supported!

2020-02-12 Thread Ismaël Mejía
Great to know you got it working on Dataflow easily, Reuven. As a new
feature it
looks great!

Agree with Kenn, it's probably worth opening a new thread to discuss the changes
still
needed to support this in portable runners.

On Mon, Feb 10, 2020 at 8:25 PM Kenneth Knowles  wrote:

> I think the (lack of) portability bit may have been buried in this thread.
> Maybe a new thread about the design for that?
>
> Kenn
>
> On Sun, Feb 9, 2020 at 11:36 AM Reuven Lax  wrote:
>
>> FYI, this is now fixed for Dataflow. I also added better rejection so
>> that runners that don't support this feature will reject the pipeline.
>>
>> On Sat, Feb 8, 2020 at 12:10 AM Reuven Lax  wrote:
>>
>>> I took a look, and I think this was a simple bug. Testing a fix now.
>>>
>>> A larger question is how to support this in the portability layer. Right
>>> now portability assumes that each timer id corresponds to a logical input
>>> PCollection, but that assumption no longer works as we now support a
>>> dynamic set of timers, each with their own id. We could instead model each
>>> timer family as a PColleciton, but the FnApiRunner would need to
>>> dynamically get the timer id in order to invoke it, and today it statically
>>> reads the timer id from the PCollection name.
>>>
>>> Reuven
>>>
>>> On Fri, Feb 7, 2020 at 2:22 PM Reuven Lax  wrote:
>>>
 Thanks for finding this. Hopefully the bug is easy to fix. The tests
 indeed never ran on any runner except for the DirectRunner, which is
 something I should've noticed in the code review.

 Reuven

 On Mon, Feb 3, 2020 at 12:50 AM Ismaël Mejía  wrote:

> I had a discussion with Rehman last week and we discovered that the
> TimersMap
> related tests were not running for all runners because they were not
> tagged as
> part of the ValidatesRunner category. I opened a PR [1] to enable
> this, so
> could someone please help me with the review/merge.
>
> I took a look just out of curiosity and discovered that they are only
> passing for
> the Direct runner and for the classic Flink runner in batch mode. They are
> not
> passing for Dataflow [2][3] and for the Portable Flink runner, so
> it's probably worth
> reopening the issue to investigate/fix.
>
> [1] https://github.com/apache/beam/pull/10747
> [2]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_PR/210/
> [3]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_PR/76/
>
>
> On Sat, Jan 25, 2020 at 1:26 AM Reuven Lax  wrote:
>
>> Yes. For now we exclude the flink runner, but fixing this should be
>> fairly trivial.
>>
>> On Fri, Jan 24, 2020 at 3:35 PM Maximilian Michels 
>> wrote:
>>
>>> The Flink Runner was allowing a timer to be set multiple times before
>>> we
>>> made it comply with the Beam semantics of overwriting past
>>> invocations.
>>> I wouldn't be surprised if the Spark Runner never addressed this.
>>> Flink
>>> and Spark itself allow for a timer to be set to multiple times. In
>>> order
>>> to fix this for Beam, the Flink Runner has to maintain a
>>> checkpointed
>>> map which sits outside of its builtin TimerService.
>>>
>>> As far as I can see, multiple timer families are currently not
>>> supported
>>> in the Flink Runner due to the map not taking the family name into
>>> account. This can be easily fixed though.
>>>
>>> -Max
>>>
>>> On 24.01.20 21:31, Reuven Lax wrote:
>>> > The new timer family is in the portability protos. I think
>>> TimerReceiver
>>> > needs to be updated to set it though (I think a 1-line change).
>>> >
>>> > The TimerInternals class that runners implement today already
>>> handles
>>> > dynamic timers, so most of the work was in the Beam SDK to
>>> provide an
>>> > API that allows users to access this feature.
>>> >
>>> > The main work needed in the runner was to take in account the
>>> timer
>>> > family. Beam semantics say that if a timer is set twice with the
>>> same
>>> > id, then the second timer overwrites the first.  Several runners
>>> > therefore had maps from timer id -> timer. However since the
>>> > timer family scopes the timers, we now allow two timers with the
>>> same id
>>> > as long as the timer families are different. Runners had to be
>>> updated
>>> > to include the timer family id in the map keys.
>>> >
>>> > Surprisingly, the new TimerMap tests seem to pass on Spark
>>> > ValidatesRunner, even though the Spark runner wasn't updated! I
>>> wonder
>>> > if this means that the Spark runner was incorrectly implementing
>>> the
>>> > Beam semantics before, and setTimer was not overwriting timers
>>> with the
>>> > same id?
>>> >
>>> > Reuven
>>> >
>>> > On Fri, Jan 24, 2020 
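
For readers following the API discussion above, here is a minimal,
illustrative sketch of the dynamic timers API in the Java SDK (the DoFn, ids,
and durations below are made up; see the ValidatesRunner TimerMap tests for
real usage):

import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.TimerMap;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;
import org.joda.time.Instant;

class DynamicTimerFn extends DoFn<KV<String, String>, String> {

  // Declares a timer family: individual timers are created at runtime by
  // string id rather than being fixed at pipeline construction time.
  @TimerFamily("actions")
  private final TimerSpec actionsSpec = TimerSpecs.timerMap(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @Timestamp Instant elementTimestamp,
      @TimerFamily("actions") TimerMap actions) {
    // Setting a timer with an id that already exists in this family
    // overwrites it; the same id in a different family is a distinct timer,
    // which is why runners must key timers by (family id, timer id).
    actions.set(element.getValue(), elementTimestamp.plus(Duration.standardMinutes(1)));
  }

  @OnTimerFamily("actions")
  public void onTimer(@TimerId String timerId, OutputReceiver<String> out) {
    out.output("timer fired: " + timerId);
  }
}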

Re: Python2.7 Beam End-of-Life Date

2020-02-12 Thread Ismaël Mejía
I am with Chad on this: we should probably extend it a bit more. Even if it
makes us struggle a bit, at least we have some workarounds, as Robert
suggests, and as Chad said there are still many people playing the Python 3
catch-up game, so it's worth supporting those users.

But maybe it is worth evaluating the current state later in the year. In the
meantime, can someone please update our Roadmap on the website with this info
and where we are with Python 3 support (it looks out of date)?
https://beam.apache.org/roadmap/

- Ismaël


On Tue, Feb 4, 2020 at 10:49 PM Robert Bradshaw  wrote:

>  On Tue, Feb 4, 2020 at 12:12 PM Chad Dombrova  wrote:
> >>
> >>  Not to mention that all the nice work for the type hints will have to
> be redone for 3.x.
> >
> > Note that there's a tool for automatically converting type comments to
> annotations: https://github.com/ilevkivskyi/com2ann
> >
> > So don't let that part bother you.
>
> +1, I wouldn't worry about what can be easily automated.
>
> > I'm curious what other features you'd like to be using in the Beam
> source that you cannot now.
>
> I hit things occasionally, e.g. I just ran into wanting keyword-only
> arguments the other day.
>
> >> It seems the faster we drop support the better.
> >
> >
> > I've already gone over my position on this, but a refresher for those
> who care:  some of the key vendors that support my industry will not offer
> python3-compatible versions of their software until the 4th quarter of
> 2020.  If Beam switches to python3-only before that point we may be forced
> to stop contributing features (note: I'm the guy who added the type hints
> :).   Every month you can give us would be greatly appreciated.
>
> As another data point, we're still 80/20 on Py2/Py3 for downloads at
> PyPi [1] (which I've heard should be taken with a grain of salt, but
> likely isn't totally off). IMHO that ratio needs to be way higher for
> Python 3 to consider dropping Python 2. It's pretty noisy, but say it
> doubles every 3 months that would put us at least mid-year before we
> hit a cross-over point. On the other hand Q4 2020 is probably a
> stretch.
>
> We could consider whether it needs to be an all-or-nothing thing as
> well. E.g. perhaps some features could be Python 3 only sooner than
> the whole codebase. (This would have to be well justified.) Another
> mitigation is that it is possible to mix Python 2 and Python 3 in the
> same pipeline with portability, so if there's a library that you need
> for one DoFn it doesn't mean you have to hold back your whole
> pipeline.
>
> - Robert
>
> [1] https://pypistats.org/packages/apache-beam , and that 20% may just
> be a spike.
>


No space left on apache-beam-jenkins-7

2020-02-12 Thread Michał Walenia
Hi there,
it seems we have an error on one of the Jenkins workers; I created a Jira
to track this. Who can take care of it?
https://issues.apache.org/jira/browse/BEAM-9302

Michal

-- 

Michał Walenia
Polidea | Software Engineer

M: +48 791 432 002
E: michal.wale...@polidea.com

Unique Tech
Check out our projects!