Re: Removing Java Reference Runner code

2019-04-26 Thread Daniel Oliveira
Hey Boyuan,

I think that's a good question. Mikhail's mostly right, that the user
shouldn't need to know how the Python ULR works for their debugging. This
is actually more of an issue with portability itself anyway. Even when I
was coding Java pipelines on the Java ULR, if something went wrong in the
runner it was still really difficult to debug. Hopefully the only people
that will need to do that painful exercise are Beam devs doing development
work on the runners. If an average user is having a problem, the runner's
logs and error messages should be effective enough that the user shouldn't
care what language the runner is using or how it's implemented.

On Fri, Apr 26, 2019 at 12:36 PM Boyuan Zhang wrote:

> Another concern from me is, will it be difficult for a Java person (who is
> developing the Java SDK) to figure out what's going on in the Python ULR when
> debugging?
>
> On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles  wrote:
>
>> Good points. Distilling one single item: can I, today, run the Java SDK's
>> suite of ValidatesRunner command against the Python ULR + Java SDK Harness,
>> in a single Gradle command?
>>
>> Kenn
>>
>> On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin  wrote:
>>
>>> If there are no plans to invest in ULR then it makes sense to remove it.
>>>
>>> Going forward, however, I think we should try to document the higher
>>> level approach we're taking with runners (and portability) now that we have
>>> something working and can reflect on it. For example, couple of things that
>>> are not 100% clear to me:
>>>  - if the focus is on python runner for portability efforts, how does
>>> java SDK (and other languages) tie into this? E.g. how do we run, test,
>>> measure, and develop things (pipelines, aspects of the SDK, runner);
>>>  - what's our approach to developing new features, should we make sure
>>> python runner supports them as early as possible (e.g. schemas and SQL)?
>>>  - java DirectRunner is still there:
>>> - it is still the primary tool for java SDK development purposes,
>>> and as Kenn mentioned in the linked threads it adds value by making sure
>>> users don't rely on implementation details of specific runners. Do we have
>>> a similar story for portable scenarios?
>>> - I assume that extra validations in the DirectRunner have impact on
>>> performance in various ways (potentially non-deterministic). While this
>>> doesn't matter in some cases, it might do in others. Having a local runner
>>> that is (better) optimized for execution would probably make more sense for
>>> perf measurements, integration tests, and maybe even local production jobs.
>>> Is this something potentially worth looking into?
>>>
>>> Regards,
>>> Anton
>>>
>>>
>>> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels 
>>> wrote:
>>>
>>>> Thanks for following up with this. I have mixed feelings to see the
>>>> portable Java DirectRunner go, but I'm in favor of this change because
>>>> it removes a lot of code that we do not really make use of.
>>>>
>>>> -Max
>>>>
>>>> On 26.04.19 02:58, Kenneth Knowles wrote:
>>>> > Thanks for providing all this background on the PR. It is very easy to
>>>> > see where it came from. Definitely nice to have less code and fewer
>>>> > things that can break. Perhaps lazy consensus is enough.
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <danolive...@google.com>
>>>> > wrote:
>>>> >
>>>> > Hey everyone,
>>>> >
>>>> > I made a preliminary PR for removing all the Java Reference Runner
>>>> > code (PR-8380) since I
>>>> > wanted to see if it could be done easily. It seems to be working
>>>> > fine, so I wanted to open up this discussion to make sure people are
>>>> > still in agreement on getting rid of this code and that people don't
>>>> > have any concerns.
>>>> >
>>>> > For those who need additional context about this, this previous thread
>>>> > <https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E>
>>>> > is where we discussed deprecating the Java Reference Runner (in some
>>>> > places it's called the ULR or Universal Local Runner, but it's the
>>>> > same thing). Then there's this thread
>>>> > <https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E>
>>>> > where we discussed removing the code from the repo since it's been
>>>> > deprecated.
>>>> >
>>>> > If no one has any objections to trying to remove the code I'll have
>>>> > someone review the PR I wrote and start a vote to have it merged.
>>>> >
>>>> > Thanks,
>>>> > Daniel Oliveira
>>>> >

>>>


Re: Removing Java Reference Runner code

2019-04-26 Thread Daniel Oliveira
Hey Kenn,

I'm not 100% sure. Robert (+Robert Bradshaw) could
answer your question accurately. Last I checked (about 2 months ago) there
was no such target, but I don't think there's anything preventing one from
being written.

On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles wrote:

> Good points. Distilling one single item: can I, today, run the Java SDK's
> suite of ValidatesRunner command against the Python ULR + Java SDK Harness,
> in a single Gradle command?
>
> Kenn
>
> On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin  wrote:
>
>> If there are no plans to invest in ULR then it makes sense to remove it.
>>
>> Going forward, however, I think we should try to document the higher
>> level approach we're taking with runners (and portability) now that we have
>> something working and can reflect on it. For example, couple of things that
>> are not 100% clear to me:
>>  - if the focus is on python runner for portability efforts, how does
>> java SDK (and other languages) tie into this? E.g. how do we run, test,
>> measure, and develop things (pipelines, aspects of the SDK, runner);
>>  - what's our approach to developing new features, should we make sure
>> python runner supports them as early as possible (e.g. schemas and SQL)?
>>  - java DirectRunner is still there:
>> - it is still the primary tool for java SDK development purposes, and
>> as Kenn mentioned in the linked threads it adds value by making sure users
>> don't rely on implementation details of specific runners. Do we have a
>> similar story for portable scenarios?
>> - I assume that extra validations in the DirectRunner have impact on
>> performance in various ways (potentially non-deterministic). While this
>> doesn't matter in some cases, it might do in others. Having a local runner
>> that is (better) optimized for execution would probably make more sense for
>> perf measurements, integration tests, and maybe even local production jobs.
>> Is this something potentially worth looking into?
>>
>> Regards,
>> Anton
>>
>>
>> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels 
>> wrote:
>>
>>> Thanks for following up with this. I have mixed feelings to see the
>>> portable Java DirectRunner go, but I'm in favor of this change because
>>> it removes a lot of code that we do not really make use of.
>>>
>>> -Max
>>>
>>> On 26.04.19 02:58, Kenneth Knowles wrote:
>>> > Thanks for providing all this background on the PR. It is very easy to
>>> > see where it came from. Definitely nice to have less code and fewer
>>> > things that can break. Perhaps lazy consensus is enough.
>>> >
>>> > Kenn
>>> >
>>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <
>>> danolive...@google.com
>>> > > wrote:
>>> >
>>> > Hey everyone,
>>> >
>>> > I made a preliminary PR for removing all the Java Reference Runner
>>> > code (PR-8380) since I
>>> > wanted to see if it could be done easily. It seems to be working
>>> > fine, so I wanted to open up this discussion to make sure people
>>> are
>>> > still in agreement on getting rid of this code and that people
>>> don't
>>> > have any concerns.
>>> >
>>> > For those who need additional context about this, this previous
>>> > thread
>>> > <
>>> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
>>> >
>>> > is where we discussed deprecating the Java Reference Runner (in
>>> some
>>> > places it's called the ULR or Universal Local Runner, but it's the
>>> > same thing). Then there's this thread
>>> > <
>>> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
>>> >
>>> > where we discussed removing the code from the repo since it's been
>>> > deprecated.
>>> >
>>> > If no one has any objections to trying to remove the code I'll have
>>> > someone review the PR I wrote and start a vote to have it merged.
>>> >
>>> > Thanks,
>>> > Daniel Oliveira
>>> >
>>>
>>


Re: Removing Java Reference Runner code

2019-04-26 Thread Daniel Oliveira
Good questions Anton. I can't give *definitive* answers to any of these,
but I can at least explain how I've been interpreting the move to the
Python version.

> - if the focus is on python runner for portability efforts, how does java
> SDK (and other languages) tie into this? E.g. how do we run, test, measure,
> and develop things (pipelines, aspects of the SDK, runner);


You should be able to run anything that worked with the Java ULR on the
Python one. Thanks to Portability the Runner and SDK can be completely
independent. For example when I was working on the Java ULR I got it
running the Python validatesRunner tests that are currently used to test
the Python ULR. The reverse should hold true. I don't want to get too in
depth on how it and other local portable runners are used, but the short
version is that you would start the runner as a separate process on your
machine and then indicate the runner you're using and the port it's on in
your Pipeline Options.
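
A minimal sketch of that last step, in case it helps. It assumes the runner's job service has already been started separately and is listening on localhost:8099 (the port is only an example), and that the SDK accepts the `--runner=PortableRunner` and `--job_endpoint` options. The helper name `portable_runner_args` is made up for illustration.

```python
def portable_runner_args(host="localhost", port=8099):
    """Build pipeline option flags that point an SDK at a local portable runner.

    The host and port defaults are illustrative assumptions; use whatever
    address the separately started runner process reports.
    """
    if not 0 < port < 65536:
        raise ValueError("port out of range: %r" % (port,))
    return [
        "--runner=PortableRunner",
        "--job_endpoint=%s:%d" % (host, port),
    ]


if __name__ == "__main__":
    # These flags would be handed to the SDK, for example as the argument
    # list used to construct the pipeline options.
    print(portable_runner_args())
```

The pipeline code itself stays the same regardless of which language the runner is written in; only these two options change.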

The main obstacle I see is that recommending a Python runner for people
running Java pipelines is counterintuitive. It would require users to have
Python installed on their machine just to test their Java code which is a
difficult situation to explain.

> - what's our approach to developing new features, should we make sure
> python runner supports them as early as possible (e.g. schemas and SQL)?
>

That was the original hope with the Java ULR, that it would be a good place
to start implementing and iterating on new features without having to
implement them in a more complex runner. Of course we never actually
reached that goal, but we might be able to with the Python ULR since it's
so much further in development.

> - java DirectRunner is still there:
> - it is still the primary tool for java SDK development purposes, and
> as Kenn mentioned in the linked threads it adds value by making sure users
> don't rely on implementation details of specific runners. Do we have a
> similar story for portable scenarios?
>

I think a long-term goal when it comes to portable runners is that we only
have one local runner in one language that all developers use across
multiple SDKs. In that sense, yes: the Python ULR would have a similar
story for all SDKs, though only for portable pipelines.

But we've had differing ideas about this and how far it should go. Like is
this runner supposed to be good for debugging or just running already
validated pipelines? Do we still want non-portable local runners for each
SDK for performance or debug reasons? Questions like that haven't really
been answered. I think in one of the threads I linked to in the OP there
was some discussion about this if you want to see.

> - I assume that extra validations in the DirectRunner have impact on
> performance in various ways (potentially non-deterministic). While this
> doesn't matter in some cases, it might do in others. Having a local runner
> that is (better) optimized for execution would probably make more sense for
> perf measurements, integration tests, and maybe even local production jobs.
> Is this something potentially worth looking into?
>

As I mentioned above, there are no specific plans, so this is mainly
something that's up for community discussion.

My personal opinion is that it's worth looking into, but I think a basic
implementation of portable features is more important first. Once
portability is at the point where it's reached parity with non-portable
pipelines feature-wise, then we can start thinking about having runners
with more niche uses.

On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin wrote:

> If there are no plans to invest in ULR then it makes sense to remove it.
>
> Going forward, however, I think we should try to document the higher level
> approach we're taking with runners (and portability) now that we have
> something working and can reflect on it. For example, couple of things that
> are not 100% clear to me:
>  - if the focus is on python runner for portability efforts, how does java
> SDK (and other languages) tie into this? E.g. how do we run, test, measure,
> and develop things (pipelines, aspects of the SDK, runner);
>  - what's our approach to developing new features, should we make sure
> python runner supports them as early as possible (e.g. schemas and SQL)?
>  - java DirectRunner is still there:
> - it is still the primary tool for java SDK development purposes, and
> as Kenn mentioned in the linked threads it adds value by making sure users
> don't rely on implementation details of specific runners. Do we have a
> similar story for portable scenarios?
> - I assume that extra validations in the DirectRunner have impact on
> performance in various ways (potentially non-deterministic). While this
> doesn't matter in some cases, it might do in others. Having a local runner
> that is (better) optimized for execution would probably make more sense for
> perf measurements, integration tests, and maybe even local production jobs.
> 

Re: Enable security for data channels in portability

2019-04-26 Thread Ankur Goenka
In an offline chat with Hai, it seemed useful for users to be able to provide
custom authentication, such as a secret that is distributed out of band by
the infrastructure and provided via the file system, an RPC to another
service, etc.
gRPC already has some mechanisms for standard and custom authentication[1].
Instrumenting the gRPC channel using a command line option or environment
variable on the worker machines could be useful.

[1] https://grpc.io/docs/guides/auth/
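
As a rough sketch of the file-plus-environment-variable idea (not an existing Beam feature): the worker reads a secret from a file whose path is given by an environment variable and turns it into call metadata. The variable name `BEAM_WORKER_SECRET_FILE`, the helper name, and the bearer-token header layout are all assumptions made for illustration.

```python
import os


def load_call_metadata(environ=os.environ, var="BEAM_WORKER_SECRET_FILE"):
    """Return gRPC-style call metadata carrying a secret read from a file.

    `var` names a hypothetical environment variable (not an existing Beam
    option) pointing at a file placed on the worker by the infrastructure.
    An empty list means no secret is configured, i.e. today's implicit mode.
    """
    path = environ.get(var)
    if not path:
        return []
    with open(path, "r") as f:
        secret = f.read().strip()
    if not secret:
        raise ValueError("secret file %s is empty" % path)
    # Metadata is a sequence of (key, value) pairs with lowercase keys.
    return [("authorization", "Bearer " + secret)]
```

The returned pairs could then be attached to calls through whichever call-credentials mechanism from the gRPC auth guide ends up being chosen; that wiring is deliberately left out here.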

On Fri, Apr 26, 2019 at 4:33 PM Lukasz Cwik wrote:

> The link to the ApiServiceDescriptor is
> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/model/pipeline/src/main/proto/endpoints.proto#L31
>
> On Fri, Apr 26, 2019 at 4:32 PM Lukasz Cwik  wrote:
>
>> I had originally taken a look at this a while ago but not much has
>> progressed since then. The original idea was that the ApiServiceDescriptor
>> would be extended to support secure ways of authentication/communication. I
>> was prototyping with an OAuth2 client credentials grant at the time but
>> dropped it as other things were more important. The only currently
>> supported mode across all SDKs is an implicit authenticated/secure mode
>> where all communication is assumed to already be encrypted/private (e.g.
>> over VPN that is managed externally with trusted services) and hence the
>> gRPC channel itself is insecure and there is no authentication being
>> performed.
>>
>> Even though sdk_worker.py seems like it supports credentials, no one
>> invokes the constructor with credentials enabled as can be seen by this
>> comment by Robert[1].
>>
>> For SSL/TLS support it seems like we need some way to configure a runner
>> to be told to use SSL/TLS (potentially with a custom private key and trust
>> chain). Do you have some suggestions on how we add support for passing
>> around channel/call[2] credentials?
>>
>> 1:
>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
>> 2: https://grpc.io/docs/guides/auth/
>>
>> On Tue, Apr 23, 2019 at 5:06 PM Hai Lu  wrote:
>>
>>> Hi,
>>>
>>> This is Hai from LinkedIn. Daniel and I have been working on
>>> productionizing Samza portable runner. BTW, Daniel didn't mention in his
>>> previous email that he has enabled and validated Python 3 for Samza runner
>>> and it worked smoothly. Kudos to the team!
>>>
>>> Here I have a few security related questions about portability. At
>>> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data exchange.
>>> In the case of portable runner, we're required to secure the data channels
>>> between Java and Python processes as well because our Samza jobs are
>>> running in a multi-tenant environment. While I'm currently working on this
>>> on our internal branch, I do want to keep it clean and consistent with the
>>> master branch.
>>>
>>> My questions are: were there any plans/thoughts around security for
>>> portability? I see that sdk_worker.py does have some codes to create
>>> secured gRPC channels; is anyone actually leveraging those codes? I don't
>>> see on the Java side any work is done, though.
>>>
>>> Thanks,
>>> Hai Lu
>>>
>>


Re: Enable security for data channels in portability

2019-04-26 Thread Lukasz Cwik
I had originally taken a look at this a while ago but not much has
progressed since then. The original idea was that the ApiServiceDescriptor
would be extended to support secure ways of authentication/communication. I
was prototyping with an OAuth2 client credentials grant at the time but
dropped it as other things were more important. The only currently
supported mode across all SDKs is an implicit authenticated/secure mode
where all communication is assumed to already be encrypted/private (e.g.
over VPN that is managed externally with trusted services) and hence the
gRPC channel itself is insecure and there is no authentication being
performed.

Even though sdk_worker.py seems like it supports credentials, no one
invokes the constructor with credentials enabled as can be seen by this
comment by Robert[1].

For SSL/TLS support it seems like we need some way to configure a runner to
be told to use SSL/TLS (potentially with a custom private key and trust
chain). Do you have some suggestions on how we add support for passing
around channel/call[2] credentials?

1:
https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
2: https://grpc.io/docs/guides/auth/

On Tue, Apr 23, 2019 at 5:06 PM Hai Lu wrote:

> Hi,
>
> This is Hai from LinkedIn. Daniel and I have been working on
> productionizing Samza portable runner. BTW, Daniel didn't mention in his
> previous email that he has enabled and validated Python 3 for Samza runner
> and it worked smoothly. Kudos to the team!
>
> Here I have a few security related questions about portability. At
> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data exchange.
> In the case of portable runner, we're required to secure the data channels
> between Java and Python processes as well because our Samza jobs are
> running in a multi-tenant environment. While I'm currently working on this
> on our internal branch, I do want to keep it clean and consistent with the
> master branch.
>
> My questions are: were there any plans/thoughts around security for
> portability? I see that sdk_worker.py does have some codes to create
> secured gRPC channels; is anyone actually leveraging those codes? I don't
> see on the Java side any work is done, though.
>
> Thanks,
> Hai Lu
>


Re: Enable security for data channels in portability

2019-04-26 Thread Lukasz Cwik
The link to the ApiServiceDescriptor is
https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/model/pipeline/src/main/proto/endpoints.proto#L31

On Fri, Apr 26, 2019 at 4:32 PM Lukasz Cwik wrote:

> I had originally taken a look at this a while ago but not much has
> progressed since then. The original idea was that the ApiServiceDescriptor
> would be extended to support secure ways of authentication/communication. I
> was prototyping with an OAuth2 client credentials grant at the time but
> dropped it as other things were more important. The only currently
> supported mode across all SDKs is an implicit authenticated/secure mode
> where all communication is assumed to already be encrypted/private (e.g.
> over VPN that is managed externally with trusted services) and hence the
> gRPC channel itself is insecure and there is no authentication being
> performed.
>
> Even though sdk_worker.py seems like it supports credentials, no one
> invokes the constructor with credentials enabled as can be seen by this
> comment by Robert[1].
>
> For SSL/TLS support it seems like we need some way to configure a runner
> to be told to use SSL/TLS (potentially with a custom private key and trust
> chain). Do you have some suggestions on how we add support for passing
> around channel/call[2] credentials?
>
> 1:
> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
> 2: https://grpc.io/docs/guides/auth/
>
> On Tue, Apr 23, 2019 at 5:06 PM Hai Lu  wrote:
>
>> Hi,
>>
>> This is Hai from LinkedIn. Daniel and I have been working on
>> productionizing Samza portable runner. BTW, Daniel didn't mention in his
>> previous email that he has enabled and validated Python 3 for Samza runner
>> and it worked smoothly. Kudos to the team!
>>
>> Here I have a few security related questions about portability. At
>> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data exchange.
>> In the case of portable runner, we're required to secure the data channels
>> between Java and Python processes as well because our Samza jobs are
>> running in a multi-tenant environment. While I'm currently working on this
>> on our internal branch, I do want to keep it clean and consistent with the
>> master branch.
>>
>> My questions are: were there any plans/thoughts around security for
>> portability? I see that sdk_worker.py does have some codes to create
>> secured gRPC channels; is anyone actually leveraging those codes? I don't
>> see on the Java side any work is done, though.
>>
>> Thanks,
>> Hai Lu
>>
>


Re: Artifact staging in cross-language pipelines

2019-04-26 Thread Lukasz Cwik
We should stick with URN + payload + artifact metadata[1] where the only
mandatory one that all SDKs and expansion services understand is the
"bytes" artifact type. This allows us to add optional URNs for file://,
http://, Maven, PyPi, ... in the future. I would make the artifact staging
service use the same URN + payload mechanism to get compatibility of
artifacts across the different services and also have the artifact staging
service be able to be queried for the list of artifact types it supports.
Finally, we would need to have environments enumerate the artifact types
that they support.

Having everyone have the same "artifact" representation would be beneficial
since:
a) Python environments could install dependencies from a requirements.txt
file (something that the Google Cloud Dataflow Python docker container
allows for today)
b) It provides an extensible and versioned mechanism for SDKs,
environments, and artifact staging/retrieval services to support additional
artifact types
c) Allow for expressing a canonical representation of an artifact like a
Maven package so a runner could merge environments that the runner deems
compatible.

The flow I could see is:
1) (optional) query artifact staging service for supported artifact types
2) SDK asks the expansion service to expand the transform, passing in a list
of artifact types the SDK and artifact staging service support; the expansion
service returns a list of artifact types limited to those supported types +
any supported by the environment
3) SDK converts any artifact types that the artifact staging service or
environment doesn't understand, e.g. pulls down Maven dependencies and
converts them to "bytes" artifacts
4) SDK sends artifacts to artifact staging service
5) Artifact staging service converts any artifacts to types that the
environment understands
6) Environment is started and gets artifacts from the artifact retrieval
service.
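
Step 3 of the flow above could be sketched roughly as follows. This is only an illustration under assumptions: the type names "bytes", "maven" and "pypi" stand in for URNs that have not been defined, and `resolvers` is a hypothetical table of functions that fetch an artifact and return its raw bytes.

```python
from dataclasses import dataclass

BYTES = "bytes"  # the one artifact type every party must understand


@dataclass(frozen=True)
class Artifact:
    urn: str        # artifact type; placeholder names, not real URNs
    payload: bytes  # type-specific payload (raw bytes, coordinates, a URL, ...)


def to_stageable(artifacts, understood, resolvers):
    """Convert artifacts of types not understood downstream into "bytes".

    understood: set of types the staging service or environment accepts.
    resolvers: dict mapping a type to a function payload -> raw bytes
               (e.g. one that would pull down a Maven dependency).
    """
    out = []
    for artifact in artifacts:
        if artifact.urn == BYTES or artifact.urn in understood:
            out.append(artifact)  # can be sent as-is
        else:
            raw = resolvers[artifact.urn](artifact.payload)
            out.append(Artifact(BYTES, raw))
    return out
```

An artifact whose type has no resolver raises a KeyError in this sketch; a real implementation would need a clearer failure mode for that case.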

On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw wrote:

> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels 
> wrote:
> >
> > Good idea to let the client expose an artifact staging service that the
> > ExpansionService could use to stage artifacts. This solves two problems:
> >
> > (1) The Expansion Service not being able to access the Job Server
> > artifact staging service
> > (2) The client not having access to the dependencies returned by the
> > Expansion Server
> >
> > The downside is that it adds an additional indirection. The alternative
> > to let the client handle staging the artifacts returned by the Expansion
> > Server is more transparent and easier to implement.
>
> The other downside is that it may not always be possible for the
> expansion service to connect to the artifact staging service (e.g.
> when constructing a pipeline locally against a remote expansion
> service).
>

Just to make sure, you're saying the expansion service would return all the
artifacts (bytes, urls, ...) as part of the response since the expansion
service wouldn't be able to connect to the SDK that is running locally
either.


> > Ideally, the Expansion Service won't return any dependencies because the
> > environment already contains the required dependencies. We could make it
> > a requirement for the expansion to be performed inside an environment.
> > Then we would already ensure during expansion time that the runtime
> > dependencies are available.
>
> Yes, it's cleanest if the expansion service provides an environment
> without all the dependencies provided. Interesting idea to make this a
> property of the expansion service itself.
>

I had thought this too but an opaque docker container that was built on top
of a base Beam docker container would be very difficult for a runner to
introspect and check to see if it's compatible to allow for fusion across
PTransforms. I think artifacts need to be communicated in their canonical
representation.


> > > In this case, the runner would (as
> > > requested by its configuration) be free to merge environments it
> > > deemed compatible, including swapping out beam-java-X for
> > > beam-java-embedded if it considers itself compatible with the
> > > dependency list.
> >
> > Could you explain how that would work in practice?
>
> Say one has a pipeline with environments
>
> A: beam-java-sdk-2.12-docker
> B: beam-java-sdk-2.12-docker + dep1
> C: beam-java-sdk-2.12-docker + dep2
> D: beam-java-sdk-2.12-docker + dep3
>
> A runner could (conceivably) be intelligent enough to know that dep1
> and dep2 are indeed compatible, and run A, B, and C in a single
> beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
> corresponding fusion and lower overhead benefits). If a certain
> pipeline option is set, it might further note that dep1 and dep2 are
> compatible with its own workers, which are build against sdk-2.12, and
> choose to run these in embedded + dep1 + dep2 environment.
>
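
To make the A/B/C/D example concrete, here is a small sketch of how a runner might group environments. It assumes environments can be described as a base image plus a set of dependency names, and that the runner supplies its own `compatible` predicate; both are simplifications for illustration, not an existing Beam API.

```python
def merge_environments(envs, compatible):
    """Greedily merge environments that share a base image.

    envs: dict mapping name -> (base_image, frozenset of dependency names)
    compatible: callable taking a set of dependencies and returning True if
                the runner judges they can live in one environment.
    Returns a list of (base_image, merged_deps, [names]) tuples.
    """
    merged = []
    for name in sorted(envs):
        base, deps = envs[name]
        for i, (m_base, m_deps, names) in enumerate(merged):
            if m_base == base and compatible(m_deps | deps):
                merged[i] = (m_base, m_deps | deps, names + [name])
                break
        else:
            # No existing group can absorb this environment.
            merged.append((base, frozenset(deps), [name]))
    return merged
```

With a predicate that accepts any subset of {dep1, dep2}, environments A, B and C end up in one group and D stays on its own.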

We have been talking about the expansion service and cross language
transforms a lot lately but I believe it 

Re: [discuss] A tweak to the Python API for SDF?

2019-04-26 Thread Chamikara Jayalath
On Fri, Apr 26, 2019 at 3:43 PM Pablo Estrada wrote:

> Hi all,
> Sorry about the wall of text.
> So, first of all, I thought about this while reviewing a PR by Boyuan with
> an example of an SDF[1]. This is very exciting btw : ).
>
> Anyway... I certainly have a limited view of the whole SDF effort, but I
> think it's worth discussing this particular point about the API before
> finalizing SDF and making it widely available. So here I go:
>
> The Python API for SDF asks users to provide a restriction provider in
> their process function signature. More or less the following:
>
> class MyOwnLittleSDF(beam.DoFn):
>   def process(self, element,
>               restriction_tracker=MyOwnLittleRestrictionProvider()):
>     # My DoFn logic...
>
> This is all fine, but something that I found a little odd is that the
> restriction provider gets replaced at runtime with a restriction tracker:
>
> class MyOwnLittleSDF(beam.DoFn):
>   def process(self, element,
>               restriction_tracker=MyOwnLittleRestrictionProvider()):
>     # This assert succeeds : )
>     assert not isinstance(restriction_tracker,
>                           MyOwnLittleRestrictionProvider)
>
> After thinking a little bit about it, I realized that the default argument
> simply allows us to inform the runner where to find the restriction
> provider; but that the thing that we need at runtime is NOT the restriction
> provider - but rather, the restriction tracker.
>
> A similar pattern occurs with state and timers, where the runner needs to
> know the sort of state, the coder for the values in that state (or the time
> domain for timers); but the runtime parameter is different[2]. For state
> and timers (and window, timestamp, pane, etc.) we provide a pattern where
> users give a default value that is clearly a placeholder:
> beam.DoFn.TimerParam, or beam.DoFn.StateParam.
>

This is the way (new-style) DoFns work in the Python SDK. The SDK (harness)
identifies the meaning of each (potential) argument to a DoFn using
pre-defined default values.
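
A toy illustration of that mechanism, using only the standard library. `RestrictionParam`, `RangeProvider` and `invoke_process` here are stand-ins written for this sketch, not the SDK's actual classes; the point is only that the default value is a placeholder the caller inspects and swaps out before invoking the function.

```python
import inspect


class RestrictionParam(object):
    """Placeholder default value that only records which provider to use."""

    def __init__(self, provider):
        self.provider = provider


class RangeProvider(object):
    """Toy restriction provider that hands out a tracker for a restriction."""

    def create_tracker(self, restriction):
        return ("tracker", restriction)


def invoke_process(fn, element, restriction):
    """Call fn, replacing any RestrictionParam default with a real tracker."""
    kwargs = {}
    for name, param in inspect.signature(fn).parameters.items():
        if isinstance(param.default, RestrictionParam):
            kwargs[name] = param.default.provider.create_tracker(restriction)
    return fn(element, **kwargs)


def process(element, restriction_tracker=RestrictionParam(RangeProvider())):
    # At call time restriction_tracker is the tracker, not the placeholder.
    return element, restriction_tracker
```

Calling `invoke_process(process, "e", (0, 10))` passes a tracker built from the provider, which is the substitution being discussed in this thread.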


>
> In this case, the API is fairly similar, but (at least in my imagination),
> it is much more clear about how the DoFnParam will be replaced with
> something else at runtime. A similar change could be done for SDF:
>
> class MyOwnLittleSDF(beam.DoFn):
>   MY_RESTRICTION = \
>       RestrictionSpec(provider=MyOwnLittleRestrictionProvider())
>
>   def process(
>       self, element,
>       restriction_tracker=beam.DoFn.RestrictionParam(MY_RESTRICTION)):
>     # My DoFn logic..
>


If I understood correctly, what you propose is similar to the existing
solution, but adds an XXXParam parameter for consistency?
I think this is fine and should be a relatively small change. The main point
is that the SDK should be able to find the RestrictionProvider class and use
its methods before passing elements to the DoFn.process() method:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L241



>
> Perhaps it is a good opportunity to consider this, since SDF is still in
> progress.
>
> Some pros:
> - Consistent with other parameters that we pass to DoFn methods
> - A bit more clear about what will happen at runtime
>
> Some cons:
> - SDF developers are "power users", and will have gone through the SDF
> documentation. This point will be clear to them.
> - This may create unnecessary work, and perhaps unintended consequences.
> - I bet there's more
>
> Thoughts?
>
> -P.
>
> [1] https://github.com/apache/beam/pull/8338
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/userstate_test.py#L560-L586
>
>
>
>
>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Ahmet Altay
It is possible to download dependencies with pip to a local directory and
install from there [1]. As a side benefit this is supposed to speed up the
installation process. Since we set up virtualenv multiple times, this could
actually help us in a single run. And if we can keep this cache across test
runs we can reduce flakiness.

[1] https://pip.pypa.io/en/latest/user_guide/#installing-from-local-packages
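
For reference, a small sketch of the two pip invocations this would involve, built as argument lists so they can be passed to `subprocess.check_call`. The function name and the directory and requirements paths are placeholders; the sketch only constructs the commands and does not run them.

```python
import sys


def pip_cache_commands(requirements, cache_dir):
    """Return (download, install) pip command lines using a local directory.

    download: fetch everything named in `requirements` into `cache_dir`.
    install:  install from `cache_dir` only, without contacting PyPI.
    """
    download = [sys.executable, "-m", "pip", "download",
                "--dest", cache_dir, "-r", requirements]
    install = [sys.executable, "-m", "pip", "install",
               "--no-index", "--find-links", cache_dir, "-r", requirements]
    return download, install
```

If the download directory is kept between test runs, only the download step touches the network, which is where the flakiness came from.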

On Fri, Apr 26, 2019 at 3:42 PM Valentyn Tymofieiev wrote:

> We do retry certain inherently flaky tests, for example, see[1]. This
> practice should be used with caution, see discussion [2].
>
> However retrying an individual test would not avoid the flake that Alex
> brought up in this thread, we'd have to retry setupVirtualEnv task that is
> executed once per suite of tests. Retrying just that task is different from
> retrying the whole suite.
>
> [1]
> https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/sdks/python/apache_beam/runners/worker/statesampler_test.py#L41
> [2]
> https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af@%3Cdev.beam.apache.org%3E
>
> On Fri, Apr 26, 2019 at 3:30 PM Udi Meiri  wrote:
>
>> Alex, I changed my mind: I'm okay retrying single tests, just not entire
>> suites of tests (e.g. if precommits take an hour, retrying the run takes up
>> an additional hour on the Jenkins machine).
>> This is more of an issue in Python, where gradle does not (currently)
>> have insight into which tests failed and how to retry just them.
>>
>>
>>
>> On Fri, Apr 26, 2019 at 2:17 PM Alex Amato  wrote:
>>
>>> @Udi Meiri , Is this true if the specific tests are
>>> rerun? I don't think we should rerun all tests.
>>>
>>> On Fri, Apr 26, 2019 at 12:11 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
>>>> Preinstalling dependencies may affect the dependency resolution, and we
>>>> may end up testing a different configuration than a user would have after
>>>> installing beam into a clean environment.
>>>>
>>>> I do think pip uses cache, unless one specifies "--no-cache-dir". By
>>>> default the cache is ~/.cache/pip. Looking up the log message in OP, we can
>>>> see several "Using cached..." log entries. Not sure why futures was not
>>>> fetched from cache or PyPi. Perhaps it is also a pip flake.
>>>>
>>>> I would be against wiping flakes under the rug by rerunning the whole
>>>> suite after an error, but re-rerunning parts of the test environment set
>>>> up, that are prone to environmental flakes, such as setupVirtualEnv seems
>>>> reasonable. I agree with Udi that care should be taken to not overload
>>>> Jenkins (e.g. retries should be limited)
>>>>
>>>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-26 Thread Ahmet Altay
The incremental value of publishing Python artifacts to a separate place,
but not to the actual PyPI listing, will be low. Users can already download RC
artifacts, or even pip install from an HTTP location directly. A user or a
downstream library that wants to test with Beam RCs in their usual way would
still have to get their other dependencies from the regular PyPI listing. That
means they would need to change their setup to test with Beam RCs, which is
the same state as today. There would be some incremental value in putting the
artifacts in more obvious places (e.g. the PyPI test repository), but I would
rather not complicate the release process for this.



On Thu, Apr 25, 2019 at 2:25 PM Kenneth Knowles  wrote:

> Pip is also able to be pointed at any raw hosted directory for the
> install, right? So we could publish RCs or snapshots somewhere with more
> obvious caveats and not interfere with the pypi list of actual releases.
> Much like the Java snapshots are stored in a separate opt-in repository.
>
> Kenn
>
> On Thu, Apr 25, 2019 at 5:39 AM Maximilian Michels  wrote:
>
>> > wouldn't that be in conflict with Apache release policy [1] ?
>> > [1] http://www.apache.org/legal/release-policy.html
>>
>> Indeed, advertising pre-release artifacts is against ASF rules. For
>> example, Flink was asked to remove a link to the Maven snapshot
>> repository from their download page.
>>
>> However, that does not mean we cannot publish Python artifacts. We just
>> have to clearly mark them for developers only and not advertise them
>> alongside with the official releases.
>>
>> -Max
>>
>> On 25.04.19 10:23, Robert Bradshaw wrote:
>> > Don't we push java artifacts to maven repositories as part of the RC
>> > process? And completely unvetted snapshots? (Or is this OK because
>> > they are special opt-in apache-only ones?)
>> >
>> > I am generally in favor of the idea, but would like to avoid increased
>> > toil on the release manager.
>> >
>> > One potential hitch I see is that current release process updates the
>> > versions to x.y.z (no RC or other pre-release indicator in the version
>> > number) whereas pypi (and other systems) typically expect distinct
>> > (recognizable) version numbers for each attempt, and only the actual
>> > final result has the actual final release version.
>> >
>> > On Thu, Apr 25, 2019 at 6:38 AM Ahmet Altay  wrote:
>> >>
>> >> I do not know the answer. I believe this will be similar to sharing the
>> RC artifacts for validation purposes and would not be a formal release by
>> itself. But I am not an expert and I hope others will share their opinions.
>> >>
>> >> I quickly searched pypi for apache projects and found at least airflow
>> [1] and libcloud [2] are publishing rc artifacts to pypi. We can reach out
>> to those communities and learn about their processes.
>> >>
>> >> Ahmet
>> >>
>> >> [1] https://pypi.org/project/apache-airflow/#history
>> >> [2] https://pypi.org/project/apache-libcloud/#history
>> >>
>> >> On Wed, Apr 24, 2019 at 6:15 PM Michael Luckey 
>> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> wouldn't that be in conflict with Apache release policy [1] ?
>> >>>
>> >>> [1] http://www.apache.org/legal/release-policy.html
>> >>>
>> >>> On Thu, Apr 25, 2019 at 1:35 AM Alan Myrvold 
>> wrote:
>> 
>>  Great idea. I like the RC candidates to follow as much as the
>> release artifact process as possible.
>> 
>>  On Wed, Apr 24, 2019 at 3:27 PM Ahmet Altay 
>> wrote:
>> >
>> > To clarify my proposal, I am proposing publishing to the production
>> pypi repository with an rc tag in the version. And in turn allow users to
>> depend on beam's rc version + all the other regular dependencies users
>> would have directly from pypi.
>> >
>> > Publishing to test pypi repo would also be helpful if test pypi
>> repo also mirrors other packages that exist in the production pypi
>> repository.
>> >
>> > On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada 
>> wrote:
>> >>
>> >> I think this is a great idea. A way of doing it for python would
>> be by using the test repository for PyPi[1], and that way we would not have
>> to do an official PyPi release, but still would be able to install it with
>> pip (by passing an extra flag), and test.
>> >>
>> >> In fact, there are some Beam artifacts already in there[2]. At
>> some point I looked into this, but couldn't figure out who has access/the
>> password for it.
>> >
>> >
>> > I also don't know who owns beam package in test pypi repo. Does
>> anybody know?
>> >
>> >>
>> >>
>> >> In short: +1, and I would suggest using the test PyPi repo to
>> avoid publishing to the main PyPi repo.
>> >> Best
>> >> -P.
>> >>
>> >> [1] https://test.pypi.org/
>> >> [2] https://test.pypi.org/project/apache-beam/
>> >>
>> >> On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay 
>> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> What do 

[discuss] A tweak to the Python API for SDF?

2019-04-26 Thread Pablo Estrada
Hi all,
Sorry about the wall of text.
So, first of all, I thought about this while reviewing a PR by Boyuan with
an example of an SDF[1]. This is very exciting btw : ).

Anyway... I certainly have a limited view of the whole SDF effort, but I
think it's worth discussing this particular point about the API before
finalizing SDF and making it widely available. So here I go:

The Python API for SDF asks users to provide a restriction provider in
their process function signature. More or less the following:

class MyOwnLittleSDF(beam.DoFn):
  def process(self, element,
              restriction_tracker=MyOwnLittleRestrictionProvider()):
    # My DoFn logic...

This is all fine, but something that I found a little odd is that the
restriction provider gets replaced at runtime with a restriction tracker:

class MyOwnLittleSDF(beam.DoFn):
  def process(self, element,
              restriction_tracker=MyOwnLittleRestrictionProvider()):
    # This assert succeeds : )
    assert not isinstance(restriction_tracker,
                          MyOwnLittleRestrictionProvider)

After thinking a little bit about it, I realized that the default argument
simply allows us to inform the runner where to find the restriction
provider; but that the thing that we need at runtime is NOT the restriction
provider - but rather, the restriction tracker.
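
To make the mechanics concrete, here is a toy sketch of how a runner could
read the default argument and swap it for a tracker. It does not use Beam at
all: RestrictionProvider, RestrictionTracker and invoke_process below are
hypothetical stand-ins written for this illustration, not the actual SDK
classes or the actual invocation code.

```python
import inspect


class RestrictionProvider:
    """Hypothetical stand-in: knows how to build a tracker for a restriction."""

    def create_tracker(self, restriction):
        return RestrictionTracker(restriction)


class RestrictionTracker:
    """Hypothetical stand-in: tracks progress through one restriction."""

    def __init__(self, restriction):
        self.restriction = restriction


class MyOwnLittleSDF:
    def process(self, element, restriction_tracker=RestrictionProvider()):
        # Report what we actually received, so the swap is observable.
        return type(restriction_tracker).__name__


def invoke_process(dofn, element, restriction):
    """Toy 'runner': replace provider defaults with trackers, then call."""
    kwargs = {}
    for name, param in inspect.signature(dofn.process).parameters.items():
        if isinstance(param.default, RestrictionProvider):
            kwargs[name] = param.default.create_tracker(restriction)
    return dofn.process(element, **kwargs)
```

Calling process directly hands the method the provider from its signature,
while going through invoke_process hands it a tracker. That mismatch between
what the signature says and what arrives is the oddity described above.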

A similar pattern occurs with state and timers, where the runner needs to
know the sort of state, the coder for the values in that state (or the time
domain for timers); but the runtime parameter is different[2]. For state
and timers (and window, timestamp, pane, etc.) we provide a pattern where
users give a default value that is clearly a placeholder:
beam.DoFn.TimerParam, or beam.DoFn.StateParam.

In this case, the API is fairly similar, but (at least in my imagination),
it is much more clear about how the DoFnParam will be replaced with
something else at runtime. A similar change could be done for SDF:

class MyOwnLittleSDF(beam.DoFn):
  MY_RESTRICTION = \
      RestrictionSpec(provider=MyOwnLittleRestrictionProvider())

  def process(
      self, element,
      restriction_tracker=beam.DoFn.RestrictionParam(MY_RESTRICTION)):
    # My DoFn logic..

Perhaps it is a good opportunity to consider this, since SDF is still in
progress.

Some pros:
- Consistent with other parameters that we pass to DoFn methods
- A bit more clear about what will happen at runtime

Some cons:
- SDF developers are "power users", and will have gone through the SDF
documentation. This point will be clear to them.
- This may create unnecessary work, and perhaps unintended consequences.
- I bet there's more

Thoughts?

-P.

[1] https://github.com/apache/beam/pull/8338
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/userstate_test.py#L560-L586
 .


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Valentyn Tymofieiev
We do retry certain inherently flaky tests, for example, see [1]. This
practice should be used with caution, see discussion [2].

However, retrying an individual test would not avoid the flake that Alex
brought up in this thread; we'd have to retry the setupVirtualEnv task, which
is executed once per suite of tests. Retrying just that task is different from
retrying the whole suite.

[1]
https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/sdks/python/apache_beam/runners/worker/statesampler_test.py#L41
[2] (where this was discussed)
https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af@%3Cdev.beam.apache.org%3E
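
A rough sketch of what "retry just that task, with a limit" could look like,
in plain Python. The function name, the default of three attempts and the
delay parameter are assumptions made for illustration; this is not how the
Gradle build is wired today.

```python
import time


def run_with_retries(step, max_attempts=3, delay_seconds=0.0):
    """Run step(); on an exception try again, up to max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the real failure
            time.sleep(delay_seconds)
```

Only the environment setup step would be wrapped this way. The tests
themselves still run once, so a genuine test failure is never hidden by a
retry, and the cap keeps a persistent outage from tying up Jenkins.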

On Fri, Apr 26, 2019 at 3:30 PM Udi Meiri  wrote:

> Alex, I changed my mind: I'm okay retrying single tests, just not entire
> suites of tests (e.g. if precommits take an hour, retrying the run takes up
> an additional hour on the Jenkins machine).
> This is more of an issue in Python, where gradle does not (currently) have
> insight into which tests failed and how to retry just them.
>
>
>
> On Fri, Apr 26, 2019 at 2:17 PM Alex Amato  wrote:
>
>> @Udi Meiri , Is this true if the specific tests are
>> rerun? I don't think we should rerun all tests.
>>
>> On Fri, Apr 26, 2019 at 12:11 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Preinstalling dependencies may affect the dependency resolution, and we
>>> may end up testing a different configuration than a user would have after
>>> installing beam into a clean environment.
>>>
>>> I do think pip uses cache, unless one specifies "--no-cache-dir". By
>>> default the cache is ~/.cache/pip. Looking up the log message in OP, we can
>>> see several "Using cached..." log entries. Not sure why futures was not
>>> fetched from cache or PyPi. Perhaps it is also a pip flake.
>>>
>>> I would be against wiping flakes under the rug by rerunning the whole
>>> suite after an error, but re-rerunning parts of the test environment set
>>> up, that are prone to environmental flakes, such as setupVirtualEnv seems
>>> reasonable. I agree with Udi that care should be taken to not overload
>>> Jenkins (e.g. retries should be limited)
>>>
>>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Udi Meiri
Alex, I changed my mind: I'm okay retrying single tests, just not entire
suites of tests (e.g. if precommits take an hour, retrying the run takes up
an additional hour on the Jenkins machine).
This is more of an issue in Python, where gradle does not (currently) have
insight into which tests failed and how to retry just them.



On Fri, Apr 26, 2019 at 2:17 PM Alex Amato  wrote:

> @Udi Meiri , Is this true if the specific tests are
> rerun? I don't think we should rerun all tests.
>
> On Fri, Apr 26, 2019 at 12:11 PM Valentyn Tymofieiev 
> wrote:
>
>> Preinstalling dependencies may affect the dependency resolution, and we
>> may end up testing a different configuration than a user would have after
>> installing beam into a clean environment.
>>
>> I do think pip uses cache, unless one specifies "--no-cache-dir". By
>> default the cache is ~/.cache/pip. Looking up the log message in OP, we can
>> see several "Using cached..." log entries. Not sure why futures was not
>> fetched from cache or PyPi. Perhaps it is also a pip flake.
>>
>> I would be against wiping flakes under the rug by rerunning the whole
>> suite after an error, but re-rerunning parts of the test environment set
>> up, that are prone to environmental flakes, such as setupVirtualEnv seems
>> reasonable. I agree with Udi that care should be taken to not overload
>> Jenkins (e.g. retries should be limited)
>>
>




Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Ankur Goenka
A link on the downloads page would be useful.

Additionally, to notify users about the next release, shall we add the
expected date of the next cut/release to the SDK binary, so that a reminder is
printed on the console once a day after the cut date?

Something like,
Print "Please check new version of Beam" after June 19th in Beam 2.13.0
Print "Please check new version of Beam" after July 31st in Beam 2.14.0
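
A small sketch of the "once every day past the cut date" check, in plain
Python. The function name and the idea of remembering the last day a reminder
was shown are assumptions; the year in the usage note is taken from the date
of this thread.

```python
import datetime


def should_print_reminder(today, next_cut_date, last_printed=None):
    """True if today is after the cut date and nothing was printed today."""
    if today <= next_cut_date:
        return False
    return last_printed is None or last_printed < today
```

For Beam 2.13.0 the SDK would call this with next_cut_date set to
datetime.date(2019, 6, 19) and print "Please check new version of Beam"
whenever it returns True, then record today's date as last_printed.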

On Fri, Apr 26, 2019 at 1:13 PM Ismaël Mejía  wrote:

> Ah that works, thanks Anton, quite hard to see. Thanks!
>
> Kenneth maybe for awareness a link in the downloads page will be more
> 'visible'.
>
> On Fri, Apr 26, 2019 at 9:32 PM Kenneth Knowles  wrote:
>
>> By the way, that link is referenced by
>> https://beam.apache.org/community/policies/
>>
>> Is there a better way to surface the calendar?
>>
>> Kenn
>>
>> On Fri, Apr 26, 2019 at 12:23 PM Anton Kedin  wrote:
>>
>>> Following Ankur's link I see a "[+]GoogleCalendar" button in the bottom
>>> right corner of the page. Clicking it opens the google calendar and prompts
>>> to add the Beam Calendar (at least in Chrome). Ismael, do you have a
>>> similar button in your case?
>>>
>>> [image: image.png]
>>>
>>> Regards,
>>> Anton
>>>
>>>
>>> On Fri, Apr 26, 2019 at 5:07 AM Ismaël Mejía  wrote:
>>>
>>>> Ankur, do you have the equivalent link that I can use to subscribe to
>>>> that calendar via google calendars?
>>>> The link seems to work only to see the calendar in a webpage.
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, Apr 26, 2019 at 1:42 PM Maximilian Michels
>>>> wrote:
>>>> >
>>>> > Hi Ankur,
>>>> >
>>>> > Sounds good. This will ensure that we stay on track regarding the
>>>> > release cycle.
>>>> >
>>>> > Thanks,
>>>> > Max
>>>> >
>>>> > On 26.04.19 02:59, Ankur Goenka wrote:
>>>> > > Correction, The planned cut date is May 8th.
>>>> > >
>>>> > > On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka wrote:
>>>> > >
>>>> > > Hello Beam community!
>>>> > >
>>>> > > Beam 2.13 release branch cut date is April 8th according to the
>>>> > > release calendar [1]. I would like to volunteer myself to do this
>>>> > > release. I intend to cut the branch as planned on April 8th and
>>>> > > cherrypick fixes if needed.
>>>> > >
>>>> > > If you have releasing blocking issues for 2.13 please mark their
>>>> > > "Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in case
>>>> > > you would like to move any non-blocking issues to that version.
>>>> > >
>>>> > > Does this sound reasonable?
>>>> > >
>>>> > > Thanks,
>>>> > > Ankur
>>>> > >
>>>> > > [1]
>>>> > >
>>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>>>> > >

>>>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Alex Amato
@Udi Meiri, is this true if only the specific tests are rerun? I don't think
we should rerun all tests.

On Fri, Apr 26, 2019 at 12:11 PM Valentyn Tymofieiev 
wrote:

> Preinstalling dependencies may affect the dependency resolution, and we
> may end up testing a different configuration than a user would have after
> installing beam into a clean environment.
>
> I do think pip uses cache, unless one specifies "--no-cache-dir". By
> default the cache is ~/.cache/pip. Looking up the log message in OP, we can
> see several "Using cached..." log entries. Not sure why futures was not
> fetched from cache or PyPi. Perhaps it is also a pip flake.
>
> I would be against wiping flakes under the rug by rerunning the whole
> suite after an error, but re-rerunning parts of the test environment set
> up, that are prone to environmental flakes, such as setupVirtualEnv seems
> reasonable. I agree with Udi that care should be taken to not overload
> Jenkins (e.g. retries should be limited)
>


Re: Sharing plan to support complex equi-join condition in BeamSQL

2019-04-26 Thread Anton Kedin
Thank you for sharing this. This is a great overview. I left a few comments
in the doc.

Regards,
Anton

On Fri, Apr 26, 2019 at 10:12 AM Rui Wang  wrote:

> Hi Community,
>
> TL;DR:
>
> BeamSQL only supports equi-join, and its join condition can only be forms
> of `col_a = col_b` or `col_a = col_b AND ...`. I come up with a doc[1] to
> describe how to support complex equi-join condition in BeamSQL, along with
> JIRAs on each sub task.
>
>
> -Rui
>
> [1]
> https://docs.google.com/document/d/1vDiE4HR5ZdbZypIf1vzyFy9yKmAMWBu6BzBKfP7JhPc/edit?usp=sharing
>


Good command to run before pushing java PRs.

2019-04-26 Thread Alex Amato
I asked about this on the dev list in the past. Just wanted to give an FYI
that some of the command names changed: "findbugsMain" -> "spotbugsMain".

FWIW, I now use this command:
./gradlew spotlessApply && ./gradlew checkstyleMain checkstyleTest javadoc spotbugsMain compileJava compileTestJava
Hope this is useful; I put it on the wiki as well.


[BEAM-7165] FileIOTest.testMatchWatchForNewFiles flakey in java presubmit

2019-04-26 Thread Alex Amato
https://issues.apache.org/jira/browse/BEAM-7165

https://builds.apache.org/job/beam_PreCommit_Java_Commit/5634/testReport/junit/org.apache.beam.sdk.io/FileIOTest/testMatchWatchForNewFiles/

Note: This test was flaky and was fixed in BEAM-6491. I filed this new ticket
since I am not sure if it's the same issue.
Stacktrace

java.lang.AssertionError: FileIO.MatchAll/Reshuffle.ViaRandomKey/Values/Values/Map/ParMultiDo(Anonymous).output:
Expected: iterable with items [, , ] in any order
     but: not matched:
 at org.apache.beam.sdk.testing.PAssert$PAssertionSite.capture(PAssert.java:169)
 at org.apache.beam.sdk.testing.PAssert.that(PAssert.java:393)
 at org.apache.beam.sdk.testing.PAssert.that(PAssert.java:385)
 at org.apache.beam.sdk.io.FileIOTest.testMatchWatchForNewFiles(FileIOTest.java:262)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
 at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:319)
 at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:265)
 at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
 at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:349)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:314)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:312)
 at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:292)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:396)
 at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
 at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
 at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
 at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
 at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
 at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
 at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
 at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
 at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
 at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
 at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
 at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
 at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
 at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:175)
 at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:157)
 at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:404)
 at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
 at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at

Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Ismaël Mejía
Ah that works, thanks Anton, quite hard to see. Thanks!

Kenneth, maybe for awareness a link on the downloads page would be more
'visible'.

On Fri, Apr 26, 2019 at 9:32 PM Kenneth Knowles  wrote:

> By the way, that link is referenced by
> https://beam.apache.org/community/policies/
>
> Is there a better way to surface the calendar?
>
> Kenn
>
> On Fri, Apr 26, 2019 at 12:23 PM Anton Kedin  wrote:
>
>> Following Ankur's link I see a "[+]GoogleCalendar" button in the bottom
>> right corner of the page. Clicking it opens the google calendar and prompts
>> to add the Beam Calendar (at least in Chrome). Ismael, do you have a
>> similar button in your case?
>>
>> [image: image.png]
>>
>> Regards,
>> Anton
>>
>>
>> On Fri, Apr 26, 2019 at 5:07 AM Ismaël Mejía  wrote:
>>
>>> Ankur, do you have the equivalent link that I can use to subscribe to
>>> that calendar via google calendars?
>>> The link seems to work only to see the calendar in a webpage.
>>>
>>> Thanks.
>>>
>>> On Fri, Apr 26, 2019 at 1:42 PM Maximilian Michels 
>>> wrote:
>>> >
>>> > Hi Ankur,
>>> >
>>> > Sounds good. This will ensure that we stay on track regarding the
>>> > release cycle.
>>> >
>>> > Thanks,
>>> > Max
>>> >
>>> > On 26.04.19 02:59, Ankur Goenka wrote:
>>> > > Correction, The planned cut date is May 8th.
>>> > >
>>> > > On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka wrote:
>>> > >
>>> > > Hello Beam community!
>>> > >
>>> > > Beam 2.13 release branch cut date is April 8th according to the
>>> > > release calendar [1]. I would like to volunteer myself to do this
>>> > > release. I intend to cut the branch as planned on April 8th and
>>> > > cherrypick fixes if needed.
>>> > >
>>> > > If you have releasing blocking issues for 2.13 please mark their
>>> > > "Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in
>>> case
>>> > > you would like to move any non-blocking issues to that version.
>>> > >
>>> > > Does this sound reasonable?
>>> > >
>>> > > Thanks,
>>> > > Ankur
>>> > >
>>> > > [1]
>>> > >
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>>> > >
>>>
>>


Re: Removing Java Reference Runner code

2019-04-26 Thread Mikhail Gryzykhin
+1 to removing it overall. We removed all tests for the ULR already, and when
we did that, the tests were red. Removing the code base is a natural next step.

It is a valid point that we should have a way to run portable pipelines
locally with the Python ULR.

I don't believe that a Java person working with the Java SDK should actually
have to debug the worker in most cases. If we have a situation where an SDK
dev has to debug the runner regularly, we should improve runner logging and
error reporting. This can be a great exercise in improving testability, as
well as a good requirement if we want to eventually split the mono-repo.

--Mikhail

On Fri, Apr 26, 2019 at 12:36 PM Boyuan Zhang  wrote:

> Another concern from me is, will it be difficult for a Java person (who
> developing Java SDK) to figure out what's going on in Python ULR when
> debugging?
>
> On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles  wrote:
>
>> Good points. Distilling one single item: can I, today, run the Java SDK's
>> suite of ValidatesRunner command against the Python ULR + Java SDK Harness,
>> in a single Gradle command?
>>
>> Kenn
>>
>> On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin  wrote:
>>
>>> If there is no plans to invest in ULR then it makes sense to remove it.
>>>
>>> Going forward, however, I think we should try to document the higher
>>> level approach we're taking with runners (and portability) now that we have
>>> something working and can reflect on it. For example, couple of things that
>>> are not 100% clear to me:
>>>  - if the focus is on python runner for portability efforts, how does
>>> java SDK (and other languages) tie into this? E.g. how do we run, test,
>>> measure, and develop things (pipelines, aspects of the SDK, runner);
>>>  - what's our approach to developing new features, should we make sure
>>> python runner supports them as early as possible (e.g. schemas and SQL)?
>>>  - java DirectRunner is still there:
>>> - it is still the primary tool for java SDK development purposes,
>>> and as Kenn mentioned in the linked threads it adds value by making sure
>>> users don't rely on implementation details of specific runners. Do we have
>>> a similar story for portable scenarios?
>>> - I assume that extra validations in the DirectRunner have impact on
>>> performance in various ways (potentially non-deterministic). While this
>>> doesn't matter in some cases, it might do in others. Having a local runner
>>> that is (better) optimized for execution would probably make more sense for
>>> perf measurements, integration tests, and maybe even local production jobs.
>>> Is this something potentially worth looking into?
>>>
>>> Regards,
>>> Anton
>>>
>>>
>>> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels 
>>> wrote:
>>>
>>>> Thanks for following up with this. I have mixed feelings to see the
>>>> portable Java DirectRunner go, but I'm in favor of this change because
>>>> it removes a lot of code that we do not really make use of.
>>>>
>>>> -Max
>>>>
>>>> On 26.04.19 02:58, Kenneth Knowles wrote:
>>>> > Thanks for providing all this background on the PR. It is very easy to
>>>> > see where it came from. Definitely nice to have less code and fewer
>>>> > things that can break. Perhaps lazy consensus is enough.
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <danolive...@google.com> wrote:
>>>> >
>>>> > Hey everyone,
>>>> >
>>>> > I made a preliminary PR for removing all the Java Reference Runner
>>>> > code (PR-8380) since I
>>>> > wanted to see if it could be done easily. It seems to be working
>>>> > fine, so I wanted to open up this discussion to make sure people are
>>>> > still in agreement on getting rid of this code and that people don't
>>>> > have any concerns.
>>>> >
>>>> > For those who need additional context about this, this previous
>>>> > thread
>>>> > <https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E>
>>>> > is where we discussed deprecating the Java Reference Runner (in some
>>>> > places it's called the ULR or Universal Local Runner, but it's the
>>>> > same thing). Then there's this thread
>>>> > <https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E>
>>>> > where we discussed removing the code from the repo since it's been
>>>> > deprecated.
>>>> >
>>>> > If no one has any objections to trying to remove the code I'll have
>>>> > someone review the PR I wrote and start a vote to have it merged.
>>>> >
>>>> > Thanks,
>>>> > Daniel Oliveira
>>>> >

>>>


Re: Hazelcast Jet Runner

2019-04-26 Thread Kenneth Knowles
Nice! That is *way* more than the PR I was looking for. I just meant that
you could update the website/ directory. It is fine to keep the runner in
your own repository if you want.

But I think it is great if you want to contribute it to Apache Beam (hence
donate it to the Apache Software Foundation). The benefits include:
low-latency testing, free updates when someone does a refactor. Things to
consider are: subject to ASF / Beam governance, PMC, committers, subject to
Beam's release cadence (and we might exclude from Beam releases for a
little bit). Typically, we have kept runners on a branch until they are
somewhat stable. I don't feel strongly about this for disjoint codebases
that can easily be excluded from releases. We might want to suffix
`-experimental` to the artifacts for some time.

I commented on the PR about the necessary IP clearance steps.

Kenn

On Fri, Apr 26, 2019 at 3:59 AM jo...@hazelcast.com 
wrote:

> Hi Kenn.
>
> It took me a while to migrate our code to the Beam repo, but I finally
> have been able to create the Pull Request you asked for, this is it:
> https://github.com/apache/beam/pull/8410
>
> Looking forward to your feedback!
>
> Best regards,
> Jozsef
>
> On 2019/04/19 20:52:42, Kenneth Knowles  wrote:
> > The ValidatesRunner tests are the best source we have for knowing the
> > capabilities of a runner. Are there instructions for running the tests?
> >
> > Assuming we can check it out, then just open a PR to the website with the
> > current capabilities and caveats. Since it is a big deal and could use
> lots
> > of eyes, I would share the PR link on this thread.
> >
> > Kenn
> >
> > On Thu, Apr 18, 2019 at 11:53 AM Jozsef Bartok 
> wrote:
> >
> > > Hi. We at Hazelcast Jet have been working for a while now to implement
> a
> > > Java Beam Runner (non-portable) based on Hazelcast Jet (
> > > https://jet.hazelcast.org/). The process is still ongoing (
> > > https://github.com/hazelcast/hazelcast-jet-beam-runner), but we are
> > > aiming for a fully functional, reliable Runner which can proudly join
> the
> > > Capability Matrix. For that purpose I would like to ask what’s your
> process
> > > of validating runners? We are already running the @ValidatesRunner
> tests
> > > and the Nexmark test suite, but beyond that what other steps do we
> need to
> > > take to get our Runner to the level it needs to be at?
> > >
> >
>


Re: Removing Java Reference Runner code

2019-04-26 Thread Boyuan Zhang
Another concern from me: will it be difficult for a Java person (who is
developing the Java SDK) to figure out what's going on in the Python ULR when
debugging?

On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles  wrote:

> Good points. Distilling one single item: can I, today, run the Java SDK's
> suite of ValidatesRunner command against the Python ULR + Java SDK Harness,
> in a single Gradle command?
>
> Kenn
>
> On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin  wrote:
>
>> If there is no plans to invest in ULR then it makes sense to remove it.
>>
>> Going forward, however, I think we should try to document the higher
>> level approach we're taking with runners (and portability) now that we have
>> something working and can reflect on it. For example, couple of things that
>> are not 100% clear to me:
>>  - if the focus is on python runner for portability efforts, how does
>> java SDK (and other languages) tie into this? E.g. how do we run, test,
>> measure, and develop things (pipelines, aspects of the SDK, runner);
>>  - what's our approach to developing new features, should we make sure
>> python runner supports them as early as possible (e.g. schemas and SQL)?
>>  - java DirectRunner is still there:
>> - it is still the primary tool for java SDK development purposes, and
>> as Kenn mentioned in the linked threads it adds value by making sure users
>> don't rely on implementation details of specific runners. Do we have a
>> similar story for portable scenarios?
>> - I assume that extra validations in the DirectRunner have impact on
>> performance in various ways (potentially non-deterministic). While this
>> doesn't matter in some cases, it might do in others. Having a local runner
>> that is (better) optimized for execution would probably make more sense for
>> perf measurements, integration tests, and maybe even local production jobs.
>> Is this something potentially worth looking into?
>>
>> Regards,
>> Anton
>>
>>
>> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels 
>> wrote:
>>
>>> Thanks for following up with this. I have mixed feelings to see the
>>> portable Java DirectRunner go, but I'm in favor of this change because
>>> it removes a lot of code that we do not really make use of.
>>>
>>> -Max
>>>
>>> On 26.04.19 02:58, Kenneth Knowles wrote:
>>> > Thanks for providing all this background on the PR. It is very easy to
>>> > see where it came from. Definitely nice to have less code and fewer
>>> > things that can break. Perhaps lazy consensus is enough.
>>> >
>>> > Kenn
>>> >
>>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <
>>> danolive...@google.com
>>> > > wrote:
>>> >
>>> > Hey everyone,
>>> >
>>> > I made a preliminary PR for removing all the Java Reference Runner
>>> > code (PR-8380) since I
>>> > wanted to see if it could be done easily. It seems to be working
>>> > fine, so I wanted to open up this discussion to make sure people
>>> are
>>> > still in agreement on getting rid of this code and that people
>>> don't
>>> > have any concerns.
>>> >
>>> > For those who need additional context about this, this previous
>>> > thread
>>> > <
>>> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
>>> >
>>> > is where we discussed deprecating the Java Reference Runner (in
>>> some
>>> > places it's called the ULR or Universal Local Runner, but it's the
>>> > same thing). Then there's this thread
>>> > <
>>> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
>>> >
>>> > where we discussed removing the code from the repo since it's been
>>> > deprecated.
>>> >
>>> > If no one has any objections to trying to remove the code I'll have
>>> > someone review the PR I wrote and start a vote to have it merged.
>>> >
>>> > Thanks,
>>> > Daniel Oliveira
>>> >
>>>
>>


Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Kenneth Knowles
By the way, that link is referenced by
https://beam.apache.org/community/policies/

Is there a better way to surface the calendar?

Kenn

On Fri, Apr 26, 2019 at 12:23 PM Anton Kedin  wrote:

> Following Ankur's link I see a "[+]GoogleCalendar" button in the bottom
> right corner of the page. Clicking it opens the google calendar and prompts
> to add the Beam Calendar (at least in Chrome). Ismael, do you have a
> similar button in your case?
>
> Regards,
> Anton
>
>
> On Fri, Apr 26, 2019 at 5:07 AM Ismaël Mejía  wrote:
>
>> Ankur, do you have the equivalent link that I can use to subscribe to
>> that calendar via google calendars?
>> The link seems to work only to see the calendar in a webpage.
>>
>> Thanks.
>>
>> On Fri, Apr 26, 2019 at 1:42 PM Maximilian Michels 
>> wrote:
>> >
>> > Hi Ankur,
>> >
>> > Sounds good. This will ensure that we stay on track regarding the
>> > release cycle.
>> >
>> > Thanks,
>> > Max
>> >
>> > On 26.04.19 02:59, Ankur Goenka wrote:
>> > > Correction, The planned cut date is May 8th.
>> > >
>> > > On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka wrote:
>> > >
>> > > Hello Beam community!
>> > >
>> > > Beam 2.13 release branch cut date is April 8th according to the
>> > > release calendar [1]. I would like to volunteer myself to do this
>> > > release. I intend to cut the branch as planned on April 8th and
>> > > cherrypick fixes if needed.
>> > >
>> > > If you have releasing blocking issues for 2.13 please mark their
>> > > "Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in case
>> > > you would like to move any non-blocking issues to that version.
>> > >
>> > > Does this sound reasonable?
>> > >
>> > > Thanks,
>> > > Ankur
>> > >
>> > > [1]
>> > >
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
>> > >
>>
>


Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Anton Kedin
Following Ankur's link I see a "[+]GoogleCalendar" button in the bottom
right corner of the page. Clicking it opens the google calendar and prompts
to add the Beam Calendar (at least in Chrome). Ismael, do you have a
similar button in your case?

Regards,
Anton


On Fri, Apr 26, 2019 at 5:07 AM Ismaël Mejía  wrote:

> Ankur, do you have the equivalent link that I can use to subscribe to
> that calendar via google calendars?
> The link seems to work only to see the calendar in a webpage.
>
> Thanks.
>
> On Fri, Apr 26, 2019 at 1:42 PM Maximilian Michels  wrote:
> >
> > Hi Ankur,
> >
> > Sounds good. This will ensure that we stay on track regarding the
> > release cycle.
> >
> > Thanks,
> > Max
> >
> > On 26.04.19 02:59, Ankur Goenka wrote:
> > > Correction, The planned cut date is May 8th.
> > >
> > > On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka wrote:
> > >
> > > Hello Beam community!
> > >
> > > Beam 2.13 release branch cut date is April 8th according to the
> > > release calendar [1]. I would like to volunteer myself to do this
> > > release. I intend to cut the branch as planned on April 8th and
> > > cherrypick fixes if needed.
> > >
> > > If you have releasing blocking issues for 2.13 please mark their
> > > "Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in case
> > > you would like to move any non-blocking issues to that version.
> > >
> > > Does this sound reasonable?
> > >
> > > Thanks,
> > > Ankur
> > >
> > > [1]
> > >
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
> > >
>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Valentyn Tymofieiev
Preinstalling dependencies may affect the dependency resolution, and we may
end up testing a different configuration than a user would have after
installing beam into a clean environment.

I do think pip uses a cache, unless one specifies "--no-cache-dir". By
default the cache is ~/.cache/pip. Looking at the log in the original post, we
can see several "Using cached..." entries. Not sure why futures was not
fetched from the cache or PyPI. Perhaps it is also a pip flake.

I would be against sweeping flakes under the rug by rerunning the whole suite
after an error, but rerunning the parts of the test environment setup that
are prone to environmental flakes, such as setupVirtualenv, seems
reasonable. I agree with Udi that care should be taken to not overload
Jenkins (e.g. retries should be limited)
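
To make the "limited retries" idea concrete, here is a minimal sketch of a
bounded retry with exponential back-off. It is not taken from the Beam build;
the helper name run_with_retries and its parameters are made up for
illustration, and the command you pass in is whatever setup step tends to
flake.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, base_delay=2.0,
                     run=subprocess.call, sleep=time.sleep):
    """Run cmd, retrying on a non-zero exit code with exponential back-off.

    Returns the number of attempts used. Raises RuntimeError once
    max_attempts is exhausted, so the retries stay bounded and a
    persistently red build does not keep hammering the test machines.
    `run` and `sleep` are injectable only to make the helper testable.
    """
    for attempt in range(1, max_attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < max_attempts:
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("%r failed after %d attempts" % (cmd, max_attempts))


# Hypothetical usage for the flaky step discussed in this thread:
#   run_with_retries(["pip", "install", "futures>=2.2.0"])
```

Only the environment setup step would be wrapped this way; test failures
themselves would still fail the build on the first error.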


Re: Removing Java Reference Runner code

2019-04-26 Thread Kenneth Knowles
Good points. Distilling one single item: can I, today, run the Java SDK's
ValidatesRunner suite against the Python ULR + Java SDK Harness,
in a single Gradle command?

Kenn

On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin  wrote:

> If there is no plans to invest in ULR then it makes sense to remove it.
>
> Going forward, however, I think we should try to document the higher level
> approach we're taking with runners (and portability) now that we have
> something working and can reflect on it. For example, couple of things that
> are not 100% clear to me:
>  - if the focus is on python runner for portability efforts, how does java
> SDK (and other languages) tie into this? E.g. how do we run, test, measure,
> and develop things (pipelines, aspects of the SDK, runner);
>  - what's our approach to developing new features, should we make sure
> python runner supports them as early as possible (e.g. schemas and SQL)?
>  - java DirectRunner is still there:
> - it is still the primary tool for java SDK development purposes, and
> as Kenn mentioned in the linked threads it adds value by making sure users
> don't rely on implementation details of specific runners. Do we have a
> similar story for portable scenarios?
> - I assume that extra validations in the DirectRunner have impact on
> performance in various ways (potentially non-deterministic). While this
> doesn't matter in some cases, it might do in others. Having a local runner
> that is (better) optimized for execution would probably make more sense for
> perf measurements, integration tests, and maybe even local production jobs.
> Is this something potentially worth looking into?
>
> Regards,
> Anton
>
>
> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels  wrote:
>
>> Thanks for following up with this. I have mixed feelings to see the
>> portable Java DirectRunner go, but I'm in favor of this change because
>> it removes a lot of code that we do not really make use of.
>>
>> -Max
>>
>> On 26.04.19 02:58, Kenneth Knowles wrote:
>> > Thanks for providing all this background on the PR. It is very easy to
>> > see where it came from. Definitely nice to have less code and fewer
>> > things that can break. Perhaps lazy consensus is enough.
>> >
>> > Kenn
>> >
>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira wrote:
>> >
>> > Hey everyone,
>> >
>> > I made a preliminary PR for removing all the Java Reference Runner
>> > code (PR-8380) since I
>> > wanted to see if it could be done easily. It seems to be working
>> > fine, so I wanted to open up this discussion to make sure people are
>> > still in agreement on getting rid of this code and that people don't
>> > have any concerns.
>> >
>> > For those who need additional context about this, this previous
>> > thread
>> > <
>> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
>> >
>> > is where we discussed deprecating the Java Reference Runner (in some
>> > places it's called the ULR or Universal Local Runner, but it's the
>> > same thing). Then there's this thread
>> > <
>> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
>> >
>> > where we discussed removing the code from the repo since it's been
>> > deprecated.
>> >
>> > If no one has any objections to trying to remove the code I'll have
>> > someone review the PR I wrote and start a vote to have it merged.
>> >
>> > Thanks,
>> > Daniel Oliveira
>> >
>>
>


Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-26 Thread Kenneth Knowles
To be clear, the intent was always that ValueState would be not usable in
merging pipelines. So no danger of clobbering, but also limited
functionality. Is there a runner that accepts it and clobbers? The whole
idea of the new DoFn is that it is easy to do the construction-time
analysis and reject the invalid pipeline. It is actually runner independent
and I think already implemented in ParDo's validation, no?
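
As a rough illustration of what I mean by construction-time analysis, here is
a sketch in plain Python. It is not Beam's ParDo validation code; StateDecl,
MERGEABLE_KINDS and validate_state are hypothetical names, and the "kind"
strings simply stand in for ValueState, BagState and CombiningState.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateDecl:
    """A declared state cell on a stateful DoFn (hypothetical model)."""
    name: str
    kind: str  # "value", "bag", or "combining"


# State kinds with an obvious merge: concatenation for bags, and the
# CombineFn's own merge for combining state. Value state has neither.
MERGEABLE_KINDS = {"bag", "combining"}


def validate_state(decls, windows_merge):
    """Reject non-mergeable state under merging windows before anything runs.

    The check needs only the DoFn's declarations and the windowing
    strategy, so it does not depend on which runner is used.
    """
    if not windows_merge:
        return
    bad = [d.name for d in decls if d.kind not in MERGEABLE_KINDS]
    if bad:
        raise ValueError(
            "state not supported with merging windows: " + ", ".join(bad))
```

With a check like this, a value-style cell in a session-windowed pipeline
fails when the pipeline is built, rather than clobbering at runtime.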

Kenn

On Fri, Apr 26, 2019 at 10:14 AM Lukasz Cwik  wrote:

> I am in the camp where we should only support merging state (either
> naturally via things like bags or via combiners). I believe that having the
> wrapper that Brian suggests is useful for users. As for the @OnMerge
> method, I believe combiners should have the ability to look at the window
> information and we should treat @OnMerge as syntactic sugar over a combiner
> if the combiner API is too cumbersome.
>
> I believe using combiners can also extend to side inputs and help us deal
> with singleton and map like side inputs when multiple firings occur. I also
> like treating everything like a combiner because it will give us a lot
> reuse of combiner implementations across all the places they could be used
> and will be especially useful when we start exposing APIs related to
> retractions on combiners.
>
> On Fri, Apr 26, 2019 at 9:43 AM Brian Hulette  wrote:
>
>> Yeah the danger with out of order processing concerns me more than the
>> merging as well. As a new Beam user, I immediately gravitated towards
>> ValueState since it was easy to think about and I just assumed there wasn't
>> anything to be concerned about. So it was shocking to learn that there is
>> this dangerous edge-case.
>>
>> What if ValueState were just implemented as a wrapper of CombiningState
>> with a LatestCombineFn and documented as such (and perhaps we encourage
>> users to consider using a CombiningState explicitly if at all possible)?
>>
>> Brian
>>
>>
>>
>> On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles  wrote:
>>> >
>>> > You could use a CombiningState with a CombineFn that returns the
>>> minimum for this case.
>>>
>>> We've also wanted to be able to set data when setting a timer that
>>> would be returned when the timer fires. (It's in the FnAPI, but not
>>> the SDKs yet.)
>>>
>>> The metadata is an interesting usecase, do you have some more specific
>>> examples? Might boil down to not having a rich enough (single) state
>>> type.
>>>
>>> > But I've come to feel there is a mismatch. On the one hand,
>>> ParDo() is a way to drop to a lower level and write logic
>>> that does not fit a more general computational pattern, really taking fine
>>> control. On the other hand, automatically merging state via CombiningState
>>> or BagState is more of a no-knobs higher level of programming. To me there
>>> seems to be a bit of a philosophical conflict.
>>> >
>>> > These days, I feel like an @OnMerge method would be more natural. If
>>> you are using state and timers, you probably often want more direct control
>>> over how state from windows gets merged. And of course we don't even have a
>>> design for timers - you would need some kind of timestamp CombineFn but I
>>> think setting/unsetting timers manually makes more sense. Especially
>>> considering the trickiness around merging windows in the absence of
>>> retractions, you really need this callback, so you can issue retractions
>>> manually for any output your stateful DoFn emitted in windows that no
>>> longer exist.
>>>
>>> I agree we'll probably need an @OnMerge. On the other hand, I like
>>> being able to have good defaults. The high/low level thing is a
>>> continuum (the indexing example falling towards the high end).
>>>
>>> Actually, the merging questions bother me less than how easy it is to
>>> accidentally clobber previous values. It looks so easy (like the
>>> easiest state to use) but is actually the most dangerous. If one wants
>>> this behavior, I would rather an explicit AnyCombineFn or
>>> LatestCombineFn which makes you think about the semantics.
>>>
>>> - Robert
>>>
>>> > On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni  wrote:
>>> >>
>>> >> +1 on the metadata use case.
>>> >>
>>> >> For performance reasons the Timer API does not support a read()
>>> operation, which for the  vast majority of use cases is not a required
>>> feature. In the small set of use cases where it is needed, for example when
>>> you need to set a Timer in EventTime based on the smallest timestamp seen
>>> in the elements within a DoFn, we can make use of a ValueState object to
>>> keep track of the value.
>>> >>
>>> >> On Fri, 26 Apr 2019 at 00:38, Reuven Lax  wrote:
>>> >>>
>>> >>> I see examples of people using ValueState that I think are not
>>> captured by CombiningState. For example, one common one is users who set a
>>> timer and then record the timestamp of that timer in a ValueState. In
>>> general when you store state that is metadata about other 

Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Pablo Estrada
Good points, Kenn : ) - I know Yifan looked into running tests using
Docker. I imagine a testing container could have a set of dependencies
installed, and reduce the number of installations we perform from PyPI.
Best
-P.

On Fri, Apr 26, 2019 at 11:49 AM Kenneth Knowles  wrote:

> Flakiness in Java got a lot better when we put the Maven cache outside the
> wiped build directory. I am not sure about Gradle now... It is obviously
> less hermetic, but these things should be immutable so a cache is
> acceptable.
>
> Is there a way to achieve this for Python? For Maven/Gradle a package
> being downloaded does not imply it is available for import. If downloaded
> == importable then you risk masking errors.
>
> Kenn
>
> On Fri, Apr 26, 2019 at 11:44 AM Udi Meiri  wrote:
>
>> That runs the risk of overloading our test machines when the build goes
>> red.
>>
>> On Fri, Apr 26, 2019 at 11:29 AM Alex Amato  wrote:
>>
>>> It would be ideal to not need manual steps. If known flakey tests can be
>>> auto retried that would be a great improvement.
>>>
>>> On Fri, Apr 26, 2019 at 11:24 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
>>>> We could do something along the lines of retry with a back-off. Note
>>>> that Java tests also have this problem as we sometimes fail to fetch
>>>> packages from Maven Central.
>>>>
>>>> On Fri, Apr 26, 2019 at 11:19 AM Pablo Estrada
>>>> wrote:

> hm no, these are somewhat common. Yes, I think we could have retries
> to try to fix this sort of problem.
>
> Perhaps a mixture of reusing a virtualenv, and having retries when
> creating it?
>
> On Fri, Apr 26, 2019 at 11:15 AM Alex Amato 
> wrote:
>
>> Okay but this occurred on jenkins. So does the machine need an update?
>>
>> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> I think you hit a pypi flake.
>>>
>>> pip install futures>=2.2.0 works fine for me.
>>>
>>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato 
>>> wrote:
>>>
 Would be nice to fix this as it can slow down PRs. I am not sure if 
 this one is fixed on retry yet or not.



 *https://issues.apache.org/jira/browse/BEAM-7164?filter=-2 
 *



 *https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
 *


 *18:05:44* >* Task 
 :beam-sdks-python-test-suites-dataflow:setupVirtualenv**18:05:44* New 
 python executable in 
 /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python2.7*18:05:44*
  Also creating executable in 
 /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python*18:05:44*
  Installing setuptools, pkg_resources, pip, wheel...done.*18:05:44* 
 Running virtualenv with interpreter /usr/bin/python2.7*18:05:44* 
 DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 
 2020. Please upgrade your Python as Python 2.7 won't be maintained 
 after that date. A future version of pip will drop support for Python 
 2.7.*18:05:44* Collecting tox==3.0.0*18:05:44*   Using cached 
 https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl*18:05:44*
  Collecting grpcio-tools==1.3.5*18:05:44*   Using cached 
 https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
  Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)*18:05:44*   Using 
 cached 
 https://files.pythonhosted.org/packages/84/e8/4ddac125b5a0e84ea6ffc93cfccf1e7ee1924e88f53c64e98227f0af2a5f/pluggy-0.9.0-py2.py3-none-any.whl*18:05:44*
  Collecting six (from tox==3.0.0)*18:05:44*   Using cached 
 https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl*18:05:44*
  Collecting virtualenv>=1.11.2 (from tox==3.0.0)*18:05:44*   Using 
 cached 
 https://files.pythonhosted.org/packages/4f/ba/6f9315180501d5ac3e707f19fcb1764c26cc6a9a31af05778f7c2383eadb/virtualenv-16.5.0-py2.py3-none-any.whl*18:05:44*
  Collecting py>=1.4.17 (from tox==3.0.0)*18:05:44*   Using cached 
 https://files.pythonhosted.org/packages/76/bc/394ad449851729244a97857ee14d7cba61ddb268dce3db538ba2f2ba1f0f/py-1.8.0-py2.py3-none-any.whl*18:05:44*
  Collecting grpcio>=1.3.5 (from grpcio-tools==1.3.5)*18:05:44*   Using 
 cached 
 

Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Kenneth Knowles
Flakiness in Java got a lot better when we put the Maven cache outside the
wiped build directory. I am not sure about Gradle now... It is obviously
less hermetic, but these things should be immutable so a cache is
acceptable.

Is there a way to achieve this for Python? For Maven/Gradle a package being
downloaded does not imply it is available for import. If downloaded ==
importable then you risk masking errors.
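
One possible shape for this on the Python side, as a sketch only: keep pip's
download cache in a directory outside the wiped workspace, but still build a
fresh virtualenv on every run. The helper name venv_setup_commands and the
directory layout are assumptions for illustration; `--cache-dir` is pip's
standard option for relocating its cache.

```python
import os


def venv_setup_commands(workspace, cache_dir, requirements):
    """Return the argv lists for a fresh virtualenv with an external pip cache.

    The virtualenv lives under the (wiped) workspace, so nothing is
    importable until it is installed again. Only the downloaded files in
    cache_dir survive between builds.
    """
    venv_dir = os.path.join(workspace, "build", "gradleenv")
    pip = os.path.join(venv_dir, "bin", "pip")
    return [
        # Step 1: create a clean environment inside the workspace.
        ["python", "-m", "virtualenv", venv_dir],
        # Step 2: install, reading and writing downloads in cache_dir.
        [pip, "install", "--cache-dir", cache_dir] + list(requirements),
    ]
```

Because a cached file is only a download and not an installed package, a
missing dependency declaration would still surface as an import error in the
fresh environment.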

Kenn

On Fri, Apr 26, 2019 at 11:44 AM Udi Meiri  wrote:

> That runs the risk of overloading our test machines when the build goes
> red.
>
> On Fri, Apr 26, 2019 at 11:29 AM Alex Amato  wrote:
>
>> It would be ideal to not need manual steps. If known flakey tests can be
>> auto retried that would be a great improvement.
>>
>> On Fri, Apr 26, 2019 at 11:24 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> We could do something along the lines of retry with a back-off. Note
>>> that Java tests also have this problem as we sometimes fail to fetch
>>> packages from Maven Central.
>>>
>>> On Fri, Apr 26, 2019 at 11:19 AM Pablo Estrada 
>>> wrote:
>>>
>>>> hm no, these are somewhat common. Yes, I think we could have retries to
>>>> try to fix this sort of problem.
>>>>
>>>> Perhaps a mixture of reusing a virtualenv, and having retries when
>>>> creating it?
>>>>
>>>> On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:

> Okay but this occurred on jenkins. So does the machine need an update?
>
> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> I think you hit a pypi flake.
>>
>> pip install futures>=2.2.0 works fine for me.
>>
>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato 
>> wrote:
>>
>>> Would be nice to fix this as it can slow down PRs. I am not sure if 
>>> this one is fixed on retry yet or not.
>>>
>>>
>>>
>>> *https://issues.apache.org/jira/browse/BEAM-7164?filter=-2 
>>> *
>>>
>>>
>>>
>>> *https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>>> *
>>>
>>>
>>> *18:05:44* >* Task 
>>> :beam-sdks-python-test-suites-dataflow:setupVirtualenv**18:05:44* New 
>>> python executable in 
>>> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python2.7*18:05:44*
>>>  Also creating executable in 
>>> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python*18:05:44*
>>>  Installing setuptools, pkg_resources, pip, wheel...done.*18:05:44* 
>>> Running virtualenv with interpreter /usr/bin/python2.7*18:05:44* 
>>> DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 
>>> 2020. Please upgrade your Python as Python 2.7 won't be maintained 
>>> after that date. A future version of pip will drop support for Python 
>>> 2.7.*18:05:44* Collecting tox==3.0.0*18:05:44*   Using cached 
>>> https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl*18:05:44*
>>>  Collecting grpcio-tools==1.3.5*18:05:44*   Using cached 
>>> https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
>>>  Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)*18:05:44*   Using 
>>> cached 
>>> https://files.pythonhosted.org/packages/84/e8/4ddac125b5a0e84ea6ffc93cfccf1e7ee1924e88f53c64e98227f0af2a5f/pluggy-0.9.0-py2.py3-none-any.whl*18:05:44*
>>>  Collecting six (from tox==3.0.0)*18:05:44*   Using cached 
>>> https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl*18:05:44*
>>>  Collecting virtualenv>=1.11.2 (from tox==3.0.0)*18:05:44*   Using 
>>> cached 
>>> https://files.pythonhosted.org/packages/4f/ba/6f9315180501d5ac3e707f19fcb1764c26cc6a9a31af05778f7c2383eadb/virtualenv-16.5.0-py2.py3-none-any.whl*18:05:44*
>>>  Collecting py>=1.4.17 (from tox==3.0.0)*18:05:44*   Using cached 
>>> https://files.pythonhosted.org/packages/76/bc/394ad449851729244a97857ee14d7cba61ddb268dce3db538ba2f2ba1f0f/py-1.8.0-py2.py3-none-any.whl*18:05:44*
>>>  Collecting grpcio>=1.3.5 (from grpcio-tools==1.3.5)*18:05:44*   Using 
>>> cached 
>>> https://files.pythonhosted.org/packages/7c/59/4da8df60a74f4af73ede9d92a75ca85c94bc2a109d5f67061496e8d496b2/grpcio-1.20.0-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
>>>  Collecting protobuf>=3.2.0 (from grpcio-tools==1.3.5)*18:05:44*   
>>> Using cached 
>>> https://files.pythonhosted.org/packages/ea/72/5eadea03b06ca1320be2433ef2236155da17806b700efc92677ee99ae119/protobuf-3.7.1-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
>>>  

Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Udi Meiri
That runs the risk of overloading our test machines when the build goes red.

On Fri, Apr 26, 2019 at 11:29 AM Alex Amato  wrote:

> It would be ideal to not need manual steps. If known flakey tests can be
> auto retried that would be a great improvement.
>
> On Fri, Apr 26, 2019 at 11:24 AM Valentyn Tymofieiev 
> wrote:
>
>> We could do something along the lines of retry with a back-off. Note that
>> Java tests also have this problem as we sometimes fail to fetch packages
>> from Maven Central.
>>
>> On Fri, Apr 26, 2019 at 11:19 AM Pablo Estrada 
>> wrote:
>>
>>> hm no, these are somewhat common. Yes, I think we could have retries to
>>> try to fix this sort of problem.
>>>
>>> Perhaps a mixture of reusing a virtualenv, and having retries when
>>> creating it?
>>>
>>> On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:
>>>
>>>> Okay but this occurred on jenkins. So does the machine need an update?
>>>>
>>>> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev <
>>>> valen...@google.com> wrote:

> I think you hit a pypi flake.
>
> pip install futures>=2.2.0 works fine for me.
>
> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>
>> Would be nice to fix this as it can slow down PRs. I am not sure if this 
>> one is fixed on retry yet or not.
>>
>>
>>
>> *https://issues.apache.org/jira/browse/BEAM-7164?filter=-2 
>> *
>>
>>
>>
>> *https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>> *
>>
>>
>> *18:05:44* >* Task 
>> :beam-sdks-python-test-suites-dataflow:setupVirtualenv**18:05:44* New 
>> python executable in 
>> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python2.7*18:05:44*
>>  Also creating executable in 
>> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python*18:05:44*
>>  Installing setuptools, pkg_resources, pip, wheel...done.*18:05:44* 
>> Running virtualenv with interpreter /usr/bin/python2.7*18:05:44* 
>> DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 
>> 2020. Please upgrade your Python as Python 2.7 won't be maintained after 
>> that date. A future version of pip will drop support for Python 
>> 2.7.*18:05:44* Collecting tox==3.0.0*18:05:44*   Using cached 
>> https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl*18:05:44*
>>  Collecting grpcio-tools==1.3.5*18:05:44*   Using cached 
>> https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
>>  Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)*18:05:44*   Using 
>> cached 
>> https://files.pythonhosted.org/packages/84/e8/4ddac125b5a0e84ea6ffc93cfccf1e7ee1924e88f53c64e98227f0af2a5f/pluggy-0.9.0-py2.py3-none-any.whl*18:05:44*
>>  Collecting six (from tox==3.0.0)*18:05:44*   Using cached 
>> https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl*18:05:44*
>>  Collecting virtualenv>=1.11.2 (from tox==3.0.0)*18:05:44*   Using 
>> cached 
>> https://files.pythonhosted.org/packages/4f/ba/6f9315180501d5ac3e707f19fcb1764c26cc6a9a31af05778f7c2383eadb/virtualenv-16.5.0-py2.py3-none-any.whl*18:05:44*
>>  Collecting py>=1.4.17 (from tox==3.0.0)*18:05:44*   Using cached 
>> https://files.pythonhosted.org/packages/76/bc/394ad449851729244a97857ee14d7cba61ddb268dce3db538ba2f2ba1f0f/py-1.8.0-py2.py3-none-any.whl*18:05:44*
>>  Collecting grpcio>=1.3.5 (from grpcio-tools==1.3.5)*18:05:44*   Using 
>> cached 
>> https://files.pythonhosted.org/packages/7c/59/4da8df60a74f4af73ede9d92a75ca85c94bc2a109d5f67061496e8d496b2/grpcio-1.20.0-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
>>  Collecting protobuf>=3.2.0 (from grpcio-tools==1.3.5)*18:05:44*   Using 
>> cached 
>> https://files.pythonhosted.org/packages/ea/72/5eadea03b06ca1320be2433ef2236155da17806b700efc92677ee99ae119/protobuf-3.7.1-cp27-cp27mu-manylinux1_x86_64.whl*18:05:44*
>>  Collecting futures>=2.2.0; python_version < "3.2" (from 
>> grpcio>=1.3.5->grpcio-tools==1.3.5)*18:05:44*   ERROR: Could not find a 
>> version that satisfies the requirement futures>=2.2.0; python_version < 
>> "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5) (from versions: 
>> none)*18:05:44* ERROR: No matching distribution found for 
>> futures>=2.2.0; python_version < "3.2" (from 
>> grpcio>=1.3.5->grpcio-tools==1.3.5)*18:05:46* *18:05:46* >* Task 
>> :beam-sdks-python-test-suites-dataflow:setupVirtualenv* FAILED*18:05:46*
>>

Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Alex Amato
It would be ideal to not need manual steps. If known flaky tests can be
auto-retried, that would be a great improvement.

On Fri, Apr 26, 2019 at 11:24 AM Valentyn Tymofieiev 
wrote:

> We could do something along the lines of retry with a back-off. Note that
> Java tests also have this problem as we sometimes fail to fetch packages
> from Maven Central.
>
> On Fri, Apr 26, 2019 at 11:19 AM Pablo Estrada  wrote:
>
>> hm no, these are somewhat common. Yes, I think we could have retries to
>> try to fix this sort of problem.
>>
>> Perhaps a mixture of reusing a virtualenv, and having retries when
>> creating it?
>>
>> On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:
>>
>>> Okay but this occurred on jenkins. So does the machine need an update?
>>>
>>> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
>>>> I think you hit a pypi flake.
>>>>
>>>> pip install futures>=2.2.0 works fine for me.
>>>>
>>>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>>>>
>>>>> Would be nice to fix this as it can slow down PRs. I am not sure if this
>>>>> one is fixed on retry yet or not.
>>>>>
>>>>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>>>>
>>>>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Valentyn Tymofieiev
We could do something along the lines of retry with a back-off. Note that
Java tests also have this problem as we sometimes fail to fetch packages
from Maven Central.
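
One way the retry with back-off idea could look, as a minimal sketch. The helper name `retry_with_backoff`, its parameters, and the `install_requirements` wrapper are hypothetical and are not part of Beam's Gradle or Jenkins setup; the sketch only assumes the flaky step can be expressed as a Python callable that raises on failure.

```python
import random
import subprocess
import sys
import time


def retry_with_backoff(action, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep, jitter=random.random):
    """Call action() until it succeeds or max_attempts is exhausted.

    The wait doubles after every failure (base_delay, 2x, 4x, ...), is capped
    at max_delay, and gets up to 10% random jitter added so that parallel
    builds do not all retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + jitter() * delay * 0.1)


def install_requirements():
    """Hypothetical flaky step: install the package that failed in the log."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "futures>=2.2.0"])


# Usage (not executed here): retry_with_backoff(install_requirements)
```

The `sleep` and `jitter` parameters exist only so the waiting can be replaced in tests; a build script would leave them at their defaults.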

On Fri, Apr 26, 2019 at 11:19 AM Pablo Estrada  wrote:

> hm no, these are somewhat common. Yes, I think we could have retries to
> try to fix this sort of problem.
>
> Perhaps a mixture of reusing a virtualenv, and having retries when
> creating it?
>
> On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:
>
>> Okay but this occurred on jenkins. So does the machine need an update?
>>
>> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> I think you hit a pypi flake.
>>>
>>> pip install futures>=2.2.0 works fine for me.
>>>
>>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>>>
>>>> Would be nice to fix this as it can slow down PRs. I am not sure if this
>>>> one is fixed on retry yet or not.
>>>>
>>>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>>>
>>>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Valentyn Tymofieiev
No, just re-run the tests. For the record, PyPI error rates are shown at
https://status.python.org/#system-metrics.

On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:

> Okay but this occurred on jenkins. So does the machine need an update?
>
> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev 
> wrote:
>
>> I think you hit a pypi flake.
>>
>> pip install futures>=2.2.0 works fine for me.
>>
>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>>
>>> Would be nice to fix this as it can slow down PRs. I am not sure if this 
>>> one is fixed on retry yet or not.
>>>
>>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>>>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Pablo Estrada
hm no, these are somewhat common. Yes, I think we could have retries to try
to fix this sort of problem.

Perhaps a mixture of reusing a virtualenv, and having retries when creating
it?

On Fri, Apr 26, 2019 at 11:15 AM Alex Amato  wrote:

> Okay but this occurred on jenkins. So does the machine need an update?
>
> On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev 
> wrote:
>
>> I think you hit a pypi flake.
>>
>> pip install futures>=2.2.0 works fine for me.
>>
>> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>>
>>> Would be nice to fix this as it can slow down PRs. I am not sure if this 
>>> one is fixed on retry yet or not.
>>>
>>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>>>


Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-04-26 Thread Ahmet Altay
I agree with both keeping 2.7.x going until a new LTS is declared and
declaring LTS post-release after some use. 2.12 might actually be a good
candidate; with multiple RCs/validations it is presumably well tested. We
can consider that after it gets some real world use.

On Fri, Apr 26, 2019 at 6:29 AM Robert Bradshaw  wrote:

> IIRC, there was some talk on making 2.12 the next LTS, but the
> consensus is to decide on a LTS after having had some experience with
> it, not at or before the release itself.
>
>
> On Fri, Apr 26, 2019 at 3:04 PM Alexey Romanenko
>  wrote:
> >
> > Thanks for working on this, Kenn.
> >
> > Perhaps, I missed this but has it been already discussed/decided what
> will be the next LTS release?
> >
> > On 26 Apr 2019, at 08:02, Kenneth Knowles  wrote:
> >
> > Since it is all trivially reversible if there is some other feeling
> about this thread, I have gone ahead and started the work:
> >
> >  - I made release-2.7.1 branch point to the same commit as release-2.7.0
> so there is something to target PRs
> >  - I have opened the first PR, cherry-picking the set_version script and
> using it to set the version on the branch to 2.7.1:
> https://github.com/apache/beam/pull/8407 (found bug in the new script
> right away :-)
> >
> > Here is the release with list of issues:
> https://issues.apache.org/jira/projects/BEAM/versions/12344458. So anyone
> can grab a ticket and volunteer to open a backport PR to the release-2.7.1
> branch.
> >
> > I don't have a strong opinion about how long we should support the 2.7.x
> line. I am curious about different perspectives on user / vendor needs. I
> have two very basic thoughts: (1) we surely need to keep it going until
> some time after we have another LTS designated, to make sure there is a
> clear path for anyone only using LTS releases and (2) if we decide to end
> support of 2.7.x but then someone volunteers to backport and release, of
> course I would not expect anyone to block them, so it has no maximum
> lifetime, but we just need consensus on a minimum. And of course that
> consensus cannot force anyone to do the work, but is just a resolution of
> the community.
> >
> > Kenn
> >
> > On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré 
> wrote:
> >>
> >> +1 it sounds good to me.
> >>
> >> Thanks !
> >>
> >> Regards
> >> JB
> >>
> >> On 26/04/2019 02:42, Kenneth Knowles wrote:
> >> > Hi all,
> >> >
> >> > Since the release of 2.7.0 we have identified some serious bugs:
> >> >
> >> >  - There are 8 (non-dupe) issues* tagged with Fix Version 2.7.1
> >> >  - 2 are rated "Blocker" (aka P0) but I think the others may be
> underrated
> >> >  - If you know of a critical bug that is not on that list, please file
> >> > an LTS backport ticket for it
> >> >
> >> > If a user is on an old version and wants to move to the LTS, there are
> >> > some real blockers. I propose that we perform a 2.7.1 release
> starting now.
> >> >
> >> > I volunteer to manage the release. What do you think?
> >> >
> >> > Kenn
> >> >
> >> > *Some are "resolved" but this is not accurate as the LTS 2.7.1 branch
> is
> >> > not created yet. I suggest filing a ticket to track just the LTS
> >> > backport when you hit a bug that merits it.
> >> >
> >
> >
>


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Alex Amato
Okay but this occurred on jenkins. So does the machine need an update?

On Fri, Apr 26, 2019 at 10:43 AM Valentyn Tymofieiev 
wrote:

> I think you hit a pypi flake.
>
> pip install futures>=2.2.0 works fine for me.
>
> On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:
>
>> Would be nice to fix this as it can slow down PRs. I am not sure if this one 
>> is fixed on retry yet or not.
>>
>>
>> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>>
>> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>>


:beam-sdks-java-io-hadoop-input-format:test is extremely flaky

2019-04-26 Thread Reuven Lax
I find I usually have to rerun Presubmit multiple times to get a green run,
and this test is one of the biggest culprits (though it's not the only
culprit). Does anyone know what's going on with it?

Reuven


Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Valentyn Tymofieiev
I think you hit a pypi flake.

pip install futures>=2.2.0 works fine for me.

On Fri, Apr 26, 2019 at 9:41 AM Alex Amato  wrote:

> Would be nice to fix this as it can slow down PRs. I am not sure if this one 
> is fixed on retry yet or not.
>
> https://issues.apache.org/jira/browse/BEAM-7164?filter=-2
>
> https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull
>
> 18:05:44 > Task :beam-sdks-python-test-suites-dataflow:setupVirtualenv
> 18:05:44 New python executable in /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python2.7
> 18:05:44 Also creating executable in /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python
> 18:05:44 Installing setuptools, pkg_resources, pip, wheel...done.
> 18:05:44 Running virtualenv with interpreter /usr/bin/python2.7
> 18:05:44 DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
> 18:05:44 Collecting tox==3.0.0
> 18:05:44   Using cached https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl
> 18:05:44 Collecting grpcio-tools==1.3.5
> 18:05:44   Using cached https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl
> 18:05:44 Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)
> 18:05:44   Using cached https://files.pythonhosted.org/packages/84/e8/4ddac125b5a0e84ea6ffc93cfccf1e7ee1924e88f53c64e98227f0af2a5f/pluggy-0.9.0-py2.py3-none-any.whl
> 18:05:44 Collecting six (from tox==3.0.0)
> 18:05:44   Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
> 18:05:44 Collecting virtualenv>=1.11.2 (from tox==3.0.0)
> 18:05:44   Using cached https://files.pythonhosted.org/packages/4f/ba/6f9315180501d5ac3e707f19fcb1764c26cc6a9a31af05778f7c2383eadb/virtualenv-16.5.0-py2.py3-none-any.whl
> 18:05:44 Collecting py>=1.4.17 (from tox==3.0.0)
> 18:05:44   Using cached https://files.pythonhosted.org/packages/76/bc/394ad449851729244a97857ee14d7cba61ddb268dce3db538ba2f2ba1f0f/py-1.8.0-py2.py3-none-any.whl
> 18:05:44 Collecting grpcio>=1.3.5 (from grpcio-tools==1.3.5)
> 18:05:44   Using cached https://files.pythonhosted.org/packages/7c/59/4da8df60a74f4af73ede9d92a75ca85c94bc2a109d5f67061496e8d496b2/grpcio-1.20.0-cp27-cp27mu-manylinux1_x86_64.whl
> 18:05:44 Collecting protobuf>=3.2.0 (from grpcio-tools==1.3.5)
> 18:05:44   Using cached https://files.pythonhosted.org/packages/ea/72/5eadea03b06ca1320be2433ef2236155da17806b700efc92677ee99ae119/protobuf-3.7.1-cp27-cp27mu-manylinux1_x86_64.whl
> 18:05:44 Collecting futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5)
> 18:05:44   ERROR: Could not find a version that satisfies the requirement futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5) (from versions: none)
> 18:05:44 ERROR: No matching distribution found for futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5)
> 18:05:46 > Task :beam-sdks-python-test-suites-dataflow:setupVirtualenv FAILED


Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-26 Thread Lukasz Cwik
I am in the camp where we should only support merging state (either
naturally via things like bags or via combiners). I believe that having the
wrapper that Brian suggests is useful for users. As for the @OnMerge
method, I believe combiners should have the ability to look at the window
information and we should treat @OnMerge as syntactic sugar over a combiner
if the combiner API is too cumbersome.

I believe using combiners can also extend to side inputs and help us deal
with singleton and map like side inputs when multiple firings occur. I also
like treating everything like a combiner because it will give us a lot of
reuse of combiner implementations across all the places they could be used
and will be especially useful when we start exposing APIs related to
retractions on combiners.
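
The wrapper idea discussed in this thread (a ValueState backed by a CombiningState with a latest-value combiner) can be sketched in plain Python. This is only an illustration under stated assumptions: `LatestByTimestamp`, `combine`, and `overwrite_state` are hypothetical names, the class merely mirrors the four-method shape of a Beam CombineFn without depending on the SDK, and elements are assumed to arrive as `(timestamp, value)` pairs.

```python
class LatestByTimestamp:
    """Hypothetical combiner: keeps the value with the greatest timestamp."""

    def create_accumulator(self):
        return None  # accumulator is None or a (timestamp, value) pair

    def add_input(self, accumulator, element):
        # Keep whichever pair carries the later timestamp.
        if accumulator is None or element[0] > accumulator[0]:
            return element
        return accumulator

    def merge_accumulators(self, accumulators):
        # Merging is the same comparison applied across partial results,
        # which is what makes the state safe to merge across windows.
        merged = None
        for accumulator in accumulators:
            if accumulator is not None:
                merged = self.add_input(merged, accumulator)
        return merged

    def extract_output(self, accumulator):
        return None if accumulator is None else accumulator[1]


def combine(fn, elements):
    """Fold (timestamp, value) elements through the combiner."""
    accumulator = fn.create_accumulator()
    for element in elements:
        accumulator = fn.add_input(accumulator, element)
    return fn.extract_output(accumulator)


def overwrite_state(elements):
    """Naive 'write the newest arrival' behaviour, for comparison."""
    state = None
    for _, value in elements:
        state = value  # clobbers the previous value regardless of timestamp
    return state


in_order = [(1, "a"), (2, "b"), (3, "c")]
out_of_order = [(3, "c"), (1, "a"), (2, "b")]
```

With these inputs the overwrite approach returns a different value depending on arrival order, while the combiner returns the value with the latest timestamp either way, which is the clobbering hazard raised earlier in the thread.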

On Fri, Apr 26, 2019 at 9:43 AM Brian Hulette  wrote:

> Yeah the danger with out of order processing concerns me more than the
> merging as well. As a new Beam user, I immediately gravitated towards
> ValueState since it was easy to think about and I just assumed there wasn't
> anything to be concerned about. So it was shocking to learn that there is
> this dangerous edge-case.
>
> What if ValueState were just implemented as a wrapper of CombiningState
> with a LatestCombineFn and documented as such (and perhaps we encourage
> users to consider using a CombiningState explicitly if at all possible)?
>
> Brian
>
>
>
> On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw 
> wrote:
>
>> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles  wrote:
>> >
>> > You could use a CombiningState with a CombineFn that returns the
>> minimum for this case.
>>
>> We've also wanted to be able to set data when setting a timer that
>> would be returned when the timer fires. (It's in the FnAPI, but not
>> the SDKs yet.)
>>
>> The metadata is an interesting usecase, do you have some more specific
>> examples? Might boil down to not having a rich enough (single) state
>> type.
>>
>> > But I've come to feel there is a mismatch. On the one hand,
>> ParDo() is a way to drop to a lower level and write logic
>> that does not fit a more general computational pattern, really taking fine
>> control. On the other hand, automatically merging state via CombiningState
>> or BagState is more of a no-knobs higher level of programming. To me there
>> seems to be a bit of a philosophical conflict.
>> >
>> > These days, I feel like an @OnMerge method would be more natural. If
>> you are using state and timers, you probably often want more direct control
>> over how state from windows gets merged. And of course we don't even have a
>> design for timers - you would need some kind of timestamp CombineFn but I
>> think setting/unsetting timers manually makes more sense. Especially
>> considering the trickiness around merging windows in the absence of
>> retractions, you really need this callback, so you can issue retractions
>> manually for any output your stateful DoFn emitted in windows that no
>> longer exist.
>>
>> I agree we'll probably need an @OnMerge. On the other hand, I like
>> being able to have good defaults. The high/low level thing is a
>> continuum (the indexing example falling towards the high end).
>>
>> Actually, the merging questions bother me less than how easy it is to
>> accidentally clobber previous values. It looks so easy (like the
>> easiest state to use) but is actually the most dangerous. If one wants
>> this behavior, I would rather an explicit AnyCombineFn or
>> LatestCombineFn which makes you think about the semantics.
>>
>> - Robert
>>
>> > On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni  wrote:
>> >>
>> >> +1 on the metadata use case.
>> >>
>> >> For performance reasons the Timer API does not support a read()
>> operation, which for the vast majority of use cases is not a required
>> feature. In the small set of use cases where it is needed, for example when
>> you need to set a Timer in EventTime based on the smallest timestamp seen
>> in the elements within a DoFn, we can make use of a ValueState object to
>> keep track of the value.
>> >>
>> >> On Fri, 26 Apr 2019 at 00:38, Reuven Lax  wrote:
>> >>>
>> >>> I see examples of people using ValueState that I think are not
>> captured by CombiningState. For example, one common one is users who set a
>> timer and then record the timestamp of that timer in a ValueState. In
>> general when you store state that is metadata about other state you store,
>> then ValueState will usually make more sense than CombiningState.
>> >>>
>> >>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette 
>> wrote:
>> 
>>  Currently the Python SDK does not make ValueState available to
>> users. My initial inclination was to go ahead and implement it there to be
>> consistent with Java, but Robert brings up a great point here that
>> ValueState has an inherent race condition for out of order data, and a lot
>> of its use cases can actually be implemented with a CombiningState instead.
>> 
>>  It seems to me that at the 

Sharing plan to support complex equi-join condition in BeamSQL

2019-04-26 Thread Rui Wang
Hi Community,

TL;DR:

BeamSQL only supports equi-joins, and the join condition can only take the
form `col_a = col_b` or `col_a = col_b AND ...`. I came up with a doc [1]
describing how to support complex equi-join conditions in BeamSQL, along with
JIRAs for each subtask.


-Rui

[1]
https://docs.google.com/document/d/1vDiE4HR5ZdbZypIf1vzyFy9yKmAMWBu6BzBKfP7JhPc/edit?usp=sharing


Re: Hello from Hannah Jiang

2019-04-26 Thread Suneel Marthi
Welcome!!

On Fri, Apr 26, 2019 at 12:58 PM Lukasz Cwik  wrote:

> Welcome Hannah.
>
> On Fri, Apr 26, 2019 at 5:00 AM Ismaël Mejía  wrote:
>
>> Welcome to Beam Hannah !
>>
>> On Fri, Apr 26, 2019 at 1:44 PM Maximilian Michels 
>> wrote:
>> >
>> > Awesome. Welcome Hannah!
>> >
>> > Cheers,
>> > Max
>> >
>> > On 26.04.19 05:08, Yifan Zou wrote:
>> > > Welcome!
>> > >
>> > > On Thu, Apr 25, 2019 at 7:34 PM Connell O'Callaghan <conne...@google.com> wrote:
>> > >
>> > > Welcome Hannah!!!
>> > >
>> > > On Thu, Apr 25, 2019, 5:42 PM Reza Rokni wrote:
>> > >
>> > > Welcome!
>> > >
>> > > On Fri, 26 Apr 2019 at 04:36, Hannah Jiang <hannahji...@google.com> wrote:
>> > >
>> > > Thanks Cyrus!
>> > >
>> > > On Thu, Apr 25, 2019 at 1:34 PM Cyrus Maden <cma...@google.com> wrote:
>> > >
>> > > Welcome!!
>> > >
>> > > On Thu, Apr 25, 2019 at 4:30 PM Hannah Jiang <hannahji...@google.com> wrote:
>> > >
>> > > Thank you Robin!
>> > >
>> > > On Thu, Apr 25, 2019 at 1:27 PM Robin Qiu <robi...@google.com> wrote:
>> > >
>> > > Welcome Hannah!
>> > >
>> > > On Thu, Apr 25, 2019 at 1:26 PM Hannah Jiang wrote:
>> > >
>> > > Thanks Kenneth!
>> > >
>> > > On Thu, Apr 25, 2019 at 1:24 PM Kenneth Knowles wrote:
>> > >
>> > > Welcome!
>> > >
>> > > On Thu, Apr 25, 2019 at 12:38 PM Matthias Baetens wrote:
>> > >
>> > > Welcome to the community!
>> > >
>> > > On Thu, Apr 25, 2019, 18:55 Griselda Cuevas wrote:
>> > >
>> > > Welcome Hannah! - Very excited to see you in the Beam community :)
>> > >
>> > > On Tue, 23 Apr 2019 at 12:59, Hannah Jiang <hannahji...@google.com> wrote:
>> > >
>> > > Hi everyone
>> > >
>> > > I joined Google recently and would work on Python portability part.
>> > > I am happy to be part of the community. Looking forward to working
>> > > with all of you together.
>> > >
>> > > I have a minor request, can admin please give me access to JIRA?
>> > >
>> > > Thanks,
>> > > Hannah
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > > This email may be confidential and privileged. If you received
>> > > this communication by mistake, please don't forward it to
>> anyone
>> > > else, please erase all copies and attachments, and please let
>> me
>> > > know that it has gone to the wrong person.
>> > >
>> > > The above terms reflect a potential business arrangement, are
>> > > provided solely as a basis for further discussion, and are not
>> > > intended to be and do not constitute a legally binding
>> > > obligation. No legally binding obligations will be created,
>> > > implied, or inferred until an agreement in final form is
>> > > executed in writing by all parties involved.
>> > >
>>
>


Re: Hello from Hannah Jiang

2019-04-26 Thread Lukasz Cwik
Welcome Hannah.

On Fri, Apr 26, 2019 at 5:00 AM Ismaël Mejía  wrote:

> Welcome to Beam Hannah !
>
> On Fri, Apr 26, 2019 at 1:44 PM Maximilian Michels  wrote:
> >
> > Awesome. Welcome Hannah!
> >
> > Cheers,
> > Max
> >
> > On 26.04.19 05:08, Yifan Zou wrote:
> > > Welcome!
> > >
> > > > On Thu, Apr 25, 2019 at 7:34 PM Connell O'Callaghan <conne...@google.com> wrote:
> > > >
> > > > Welcome Hannah!!!
> > > >
> > > > On Thu, Apr 25, 2019, 5:42 PM Reza Rokni wrote:
> > > >
> > > > Welcome!
> > > >
> > > > On Fri, 26 Apr 2019 at 04:36, Hannah Jiang <hannahji...@google.com> wrote:
> > > >
> > > > Thanks Cyrus!
> > > >
> > > > On Thu, Apr 25, 2019 at 1:34 PM Cyrus Maden <cma...@google.com> wrote:
> > > >
> > > > Welcome!!
> > > >
> > > > On Thu, Apr 25, 2019 at 4:30 PM Hannah Jiang <hannahji...@google.com> wrote:
> > > >
> > > > Thank you Robin!
> > > >
> > > > On Thu, Apr 25, 2019 at 1:27 PM Robin Qiu <robi...@google.com> wrote:
> > > >
> > > > Welcome Hannah!
> > > >
> > > > On Thu, Apr 25, 2019 at 1:26 PM Hannah Jiang wrote:
> > > >
> > > > Thanks Kenneth!
> > > >
> > > > On Thu, Apr 25, 2019 at 1:24 PM Kenneth Knowles wrote:
> > > >
> > > > Welcome!
> > > >
> > > > On Thu, Apr 25, 2019 at 12:38 PM Matthias Baetens wrote:
> > > >
> > > > Welcome to the community!
> > > >
> > > > On Thu, Apr 25, 2019, 18:55 Griselda Cuevas wrote:
> > > >
> > > > Welcome Hannah! - Very excited to see you in the Beam community :)
> > > >
> > > > On Tue, 23 Apr 2019 at 12:59, Hannah Jiang wrote:
> > > >
> > > > Hi everyone
> > > >
> > > > I joined Google recently and would work on Python portability part.
> > > > I am happy to be part of the community. Looking forward to working
> > > > with all of you together.
> > > >
> > > > I have a minor request, can admin please give me access to JIRA?
> > > >
> > > > Thanks,
> > > > Hannah
> > >
> > >
> > >
> > >
> > > --
> > >
> > > This email may be confidential and privileged. If you received
> > > this communication by mistake, please don't forward it to
> anyone
> > > else, please erase all copies and attachments, and please let
> me
> > > know that it has gone to the wrong person.
> > >
> > > The above terms reflect a potential business arrangement, are
> > > provided solely as a basis for further discussion, and are not
> > > intended to be and do not constitute a legally binding
> > > obligation. No legally binding obligations will be created,
> > > implied, or inferred until an agreement in final form is
> > > executed in writing by all parties involved.
> > >
>


Re: Removing Java Reference Runner code

2019-04-26 Thread Anton Kedin
If there are no plans to invest in ULR then it makes sense to remove it.

Going forward, however, I think we should try to document the higher level
approach we're taking with runners (and portability) now that we have
something working and can reflect on it. For example, a couple of things that
are not 100% clear to me:
 - if the focus is on python runner for portability efforts, how does java
SDK (and other languages) tie into this? E.g. how do we run, test, measure,
and develop things (pipelines, aspects of the SDK, runner);
 - what's our approach to developing new features, should we make sure
python runner supports them as early as possible (e.g. schemas and SQL)?
 - java DirectRunner is still there:
    - it is still the primary tool for java SDK development purposes, and
as Kenn mentioned in the linked threads it adds value by making sure users
don't rely on implementation details of specific runners. Do we have a
similar story for portable scenarios?
    - I assume that extra validations in the DirectRunner have impact on
performance in various ways (potentially non-deterministic). While this
doesn't matter in some cases, it might do in others. Having a local runner
that is (better) optimized for execution would probably make more sense for
perf measurements, integration tests, and maybe even local production jobs.
Is this something potentially worth looking into?

Regards,
Anton


On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels  wrote:

> Thanks for following up with this. I have mixed feelings to see the
> portable Java DirectRunner go, but I'm in favor of this change because
> it removes a lot of code that we do not really make use of.
>
> -Max
>
> On 26.04.19 02:58, Kenneth Knowles wrote:
> > Thanks for providing all this background on the PR. It is very easy to
> > see where it came from. Definitely nice to have less code and fewer
> > things that can break. Perhaps lazy consensus is enough.
> >
> > Kenn
> >
> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira wrote:
> >
> > Hey everyone,
> >
> > I made a preliminary PR for removing all the Java Reference Runner
> > code (PR-8380) since I
> > wanted to see if it could be done easily. It seems to be working
> > fine, so I wanted to open up this discussion to make sure people are
> > still in agreement on getting rid of this code and that people don't
> > have any concerns.
> >
> > For those who need additional context about this, this previous
> > thread
> > <https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E>
> > is where we discussed deprecating the Java Reference Runner (in some
> > places it's called the ULR or Universal Local Runner, but it's the
> > same thing). Then there's this thread
> > <https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E>
> > where we discussed removing the code from the repo since it's been
> > deprecated.
> >
> > If no one has any objections to trying to remove the code I'll have
> > someone review the PR I wrote and start a vote to have it merged.
> >
> > Thanks,
> > Daniel Oliveira
> >
>


Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-26 Thread Brian Hulette
Yeah the danger with out of order processing concerns me more than the
merging as well. As a new Beam user, I immediately gravitated towards
ValueState since it was easy to think about and I just assumed there wasn't
anything to be concerned about. So it was shocking to learn that there is
this dangerous edge-case.

What if ValueState were just implemented as a wrapper of CombiningState
with a LatestCombineFn and documented as such (and perhaps we encourage
users to consider using a CombiningState explicitly if at all possible)?
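
To make the difference concrete, here is a minimal plain-Python sketch of the
two behaviours. It deliberately uses no Beam APIs: Timestamped,
LastWriteWinsState, LatestCombiningState and replay are hypothetical names
made up for illustration, and this is not how either SDK implements state. It
only assumes that "latest" means "greatest event timestamp".

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Timestamped:
    """A value paired with its event-time timestamp (hypothetical helper)."""
    timestamp: int
    value: str


class LastWriteWinsState:
    """Mimics ValueState-style semantics: every write replaces the stored value."""

    def __init__(self) -> None:
        self._value: Optional[Timestamped] = None

    def write(self, element: Timestamped) -> None:
        # Clobbers whatever was stored, regardless of event time.
        self._value = element

    def read(self) -> Optional[Timestamped]:
        return self._value


class LatestCombiningState:
    """Mimics a combining state that keeps the element with the greatest timestamp."""

    def __init__(self) -> None:
        self._accumulator: Optional[Timestamped] = None

    def add(self, element: Timestamped) -> None:
        current = self._accumulator
        # Only replace the accumulator when the new element is strictly newer.
        if current is None or element.timestamp > current.timestamp:
            self._accumulator = element

    def read(self) -> Optional[Timestamped]:
        return self._accumulator


def replay(elements: Iterable[Timestamped]) -> Tuple[Optional[Timestamped], Optional[Timestamped]]:
    """Feeds the same arrival sequence to both state flavours and returns both results."""
    value_state = LastWriteWinsState()
    combining_state = LatestCombiningState()
    for element in elements:
        value_state.write(element)
        combining_state.add(element)
    return value_state.read(), combining_state.read()


# Arrival order differs from event-time order: the element stamped 3
# arrives before the element stamped 2.
out_of_order = [Timestamped(1, "a"), Timestamped(3, "c"), Timestamped(2, "b")]
clobbered, latest = replay(out_of_order)
print(clobbered.value)  # b  (the late element overwrote the newer one)
print(latest.value)     # c  (the newest element by event time is kept)
```

With in-order input both flavours agree, which is probably why the problem is
easy to miss when testing locally.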

Brian



On Fri, Apr 26, 2019 at 2:29 AM Robert Bradshaw  wrote:

> On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles  wrote:
> >
> > You could use a CombiningState with a CombineFn that returns the minimum
> for this case.
>
> We've also wanted to be able to set data when setting a timer that
> would be returned when the timer fires. (It's in the FnAPI, but not
> the SDKs yet.)
>
> The metadata is an interesting usecase, do you have some more specific
> examples? Might boil down to not having a rich enough (single) state
> type.
>
> > But I've come to feel there is a mismatch. On the one hand,
> ParDo() is a way to drop to a lower level and write logic
> that does not fit a more general computational pattern, really taking fine
> control. On the other hand, automatically merging state via CombiningState
> or BagState is more of a no-knobs higher level of programming. To me there
> seems to be a bit of a philosophical conflict.
> >
> > These days, I feel like an @OnMerge method would be more natural. If you
> are using state and timers, you probably often want more direct control
> over how state from windows gets merged. And of course we don't even have a
> design for timers - you would need some kind of timestamp CombineFn but I
> think setting/unsetting timers manually makes more sense. Especially
> considering the trickiness around merging windows in the absence of
> retractions, you really need this callback, so you can issue retractions
> manually for any output your stateful DoFn emitted in windows that no
> longer exist.
>
> I agree we'll probably need an @OnMerge. On the other hand, I like
> being able to have good defaults. The high/low level thing is a
> continuum (the indexing example falling towards the high end).
>
> Actually, the merging questions bother me less than how easy it is to
> accidentally clobber previous values. It looks so easy (like the
> easiest state to use) but is actually the most dangerous. If one wants
> this behavior, I would rather an explicit AnyCombineFn or
> LatestCombineFn which makes you think about the semantics.
>
> - Robert
>
> > On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni  wrote:
> >>
> >> +1 on the metadata use case.
> >>
> >> For performance reasons the Timer API does not support a read()
> operation, which for the  vast majority of use cases is not a required
> feature. In the small set of use cases where it is needed, for example when
> you need to set a Timer in EventTime based on the smallest timestamp seen
> in the elements within a DoFn, we can make use of a ValueState object to
> keep track of the value.
> >>
> >> On Fri, 26 Apr 2019 at 00:38, Reuven Lax  wrote:
> >>>
> >>> I see examples of people using ValueState that I think are not
> captured by CombiningState. For example, one common one is users who set a
> timer and then record the timestamp of that timer in a ValueState. In
> general when you store state that is metadata about other state you store,
> then ValueState will usually make more sense than CombiningState.
> >>>
> >>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette 
> wrote:
> 
>  Currently the Python SDK does not make ValueState available to users.
> My initial inclination was to go ahead and implement it there to be
> consistent with Java, but Robert brings up a great point here that
> ValueState has an inherent race condition for out of order data, and a lot
> of its use cases can actually be implemented with a CombiningState instead.
> 
>  It seems to me that at the very least we should discourage the use of
> ValueState by noting the danger in the documentation and preferring
> CombiningState in examples, and perhaps we should go further and deprecate
> it in Java and not implement it in python. Either way I think we should be
> consistent between Java and Python.
> 
>  I'm curious what people think about this, are there use cases that we
> really need to keep ValueState around for?
> 
>  Brian
> 
>  -- Forwarded message -
>  From: Robert Bradshaw 
>  Date: Thu, Apr 25, 2019, 08:31
>  Subject: Re: [docs] Python State & Timers
>  To: dev 
> 
> 
> 
> 
>  On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels 
> wrote:
> >
> > Completely agree that CombiningState is nicer in this example. Users
> may
> > still want to use ValueState when there is nothing to combine.
> 
> 
>  I've always had trouble coming 

[BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-26 Thread Alex Amato
It would be nice to fix this, as it can slow down PRs. I am not sure
whether a retry fixes this one or not.



https://issues.apache.org/jira/browse/BEAM-7164?filter=-2

https://builds.apache.org/job/beam_PreCommit_Python_Commit/6035/consoleFull


18:05:44 > Task :beam-sdks-python-test-suites-dataflow:setupVirtualenv
18:05:44 New python executable in /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python2.7
18:05:44 Also creating executable in /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-410805238/bin/python
18:05:44 Installing setuptools, pkg_resources, pip, wheel...done.
18:05:44 Running virtualenv with interpreter /usr/bin/python2.7
18:05:44 DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
18:05:44 Collecting tox==3.0.0
18:05:44   Using cached https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl
18:05:44 Collecting grpcio-tools==1.3.5
18:05:44   Using cached https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl
18:05:44 Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)
18:05:44   Using cached https://files.pythonhosted.org/packages/84/e8/4ddac125b5a0e84ea6ffc93cfccf1e7ee1924e88f53c64e98227f0af2a5f/pluggy-0.9.0-py2.py3-none-any.whl
18:05:44 Collecting six (from tox==3.0.0)
18:05:44   Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
18:05:44 Collecting virtualenv>=1.11.2 (from tox==3.0.0)
18:05:44   Using cached https://files.pythonhosted.org/packages/4f/ba/6f9315180501d5ac3e707f19fcb1764c26cc6a9a31af05778f7c2383eadb/virtualenv-16.5.0-py2.py3-none-any.whl
18:05:44 Collecting py>=1.4.17 (from tox==3.0.0)
18:05:44   Using cached https://files.pythonhosted.org/packages/76/bc/394ad449851729244a97857ee14d7cba61ddb268dce3db538ba2f2ba1f0f/py-1.8.0-py2.py3-none-any.whl
18:05:44 Collecting grpcio>=1.3.5 (from grpcio-tools==1.3.5)
18:05:44   Using cached https://files.pythonhosted.org/packages/7c/59/4da8df60a74f4af73ede9d92a75ca85c94bc2a109d5f67061496e8d496b2/grpcio-1.20.0-cp27-cp27mu-manylinux1_x86_64.whl
18:05:44 Collecting protobuf>=3.2.0 (from grpcio-tools==1.3.5)
18:05:44   Using cached https://files.pythonhosted.org/packages/ea/72/5eadea03b06ca1320be2433ef2236155da17806b700efc92677ee99ae119/protobuf-3.7.1-cp27-cp27mu-manylinux1_x86_64.whl
18:05:44 Collecting futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5)
18:05:44   ERROR: Could not find a version that satisfies the requirement futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5) (from versions: none)
18:05:44 ERROR: No matching distribution found for futures>=2.2.0; python_version < "3.2" (from grpcio>=1.3.5->grpcio-tools==1.3.5)
18:05:46
18:05:46 > Task :beam-sdks-python-test-suites-dataflow:setupVirtualenv FAILED
18:05:46

 


Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-04-26 Thread Robert Bradshaw
IIRC, there was some talk on making 2.12 the next LTS, but the
consensus is to decide on a LTS after having had some experience with
it, not at or before the release itself.


On Fri, Apr 26, 2019 at 3:04 PM Alexey Romanenko
 wrote:
>
> Thanks for working on this, Kenn.
>
> Perhaps, I missed this but has it been already discussed/decided what will be 
> the next LTS release?
>
> On 26 Apr 2019, at 08:02, Kenneth Knowles  wrote:
>
> Since it is all trivially reversible if there is some other feeling about 
> this thread, I have gone ahead and started the work:
>
>  - I made release-2.7.1 branch point to the same commit as release-2.7.0 so 
> there is something to target PRs
>  - I have opened the first PR, cherry-picking the set_version script and 
> using it to set the version on the branch to 2.7.1: 
> https://github.com/apache/beam/pull/8407 (found bug in the new script right 
> away :-)
>
> Here is the release with list of issues: 
> https://issues.apache.org/jira/projects/BEAM/versions/12344458. So anyone can 
> grab a ticket and volunteer to open a backport PR to the release-2.7.1 branch.
>
> I don't have a strong opinion about how long we should support the 2.7.x 
> line. I am curious about different perspectives on user / vendor needs. I 
> have two very basic thoughts: (1) we surely need to keep it going until some 
> time after we have another LTS designated, to make sure there is a clear path 
> for anyone only using LTS releases and (2) if we decide to end support of 
> 2.7.x but then someone volunteers to backport and release, of course I would 
> not expect anyone to block them, so it has no maximum lifetime, but we just 
> need consensus on a minimum. And of course that consensus cannot force anyone 
> to do the work, but is just a resolution of the community.
>
> Kenn
>
> On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré  
> wrote:
>>
>> +1 it sounds good to me.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On 26/04/2019 02:42, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Since the release of 2.7.0 we have identified some serious bugs:
>> >
>> >  - There are 8 (non-dupe) issues* tagged with Fix Version 2.7.1
>> >  - 2 are rated "Blocker" (aka P0) but I think the others may be underrated
>> >  - If you know of a critical bug that is not on that list, please file
>> > an LTS backport ticket for it
>> >
>> > If a user is on an old version and wants to move to the LTS, there are
>> > some real blockers. I propose that we perform a 2.7.1 release starting now.
>> >
>> > I volunteer to manage the release. What do you think?
>> >
>> > Kenn
>> >
>> > *Some are "resolved" but this is not accurate as the LTS 2.7.1 branch is
>> > not created yet. I suggest filing a ticket to track just the LTS
>> > backport when you hit a bug that merits it.
>> >
>
>


Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-04-26 Thread Alexey Romanenko
Thanks for working on this, Kenn.

Perhaps I missed this, but has it already been discussed or decided what
the next LTS release will be?

> On 26 Apr 2019, at 08:02, Kenneth Knowles  wrote:
> 
> Since it is all trivially reversible if there is some other feeling about 
> this thread, I have gone ahead and started the work:
> 
>  - I made release-2.7.1 branch point to the same commit as release-2.7.0 so 
> there is something to target PRs
>  - I have opened the first PR, cherry-picking the set_version script and 
> using it to set the version on the branch to 2.7.1: 
> https://github.com/apache/beam/pull/8407 
>  (found bug in the new script right 
> away :-)
> 
> Here is the release with list of issues: 
> https://issues.apache.org/jira/projects/BEAM/versions/12344458. So anyone 
> can grab a ticket and volunteer to open a backport PR to the release-2.7.1 
> branch.
> 
> I don't have a strong opinion about how long we should support the 2.7.x 
> line. I am curious about different perspectives on user / vendor needs. I 
> have two very basic thoughts: (1) we surely need to keep it going until some 
> time after we have another LTS designated, to make sure there is a clear path 
> for anyone only using LTS releases and (2) if we decide to end support of 
> 2.7.x but then someone volunteers to backport and release, of course I would 
> not expect anyone to block them, so it has no maximum lifetime, but we just 
> need consensus on a minimum. And of course that consensus cannot force anyone 
> to do the work, but is just a resolution of the community.
> 
> Kenn
> 
> On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré wrote:
> +1 it sounds good to me.
> 
> Thanks !
> 
> Regards
> JB
> 
> On 26/04/2019 02:42, Kenneth Knowles wrote:
> > Hi all,
> > 
> > Since the release of 2.7.0 we have identified some serious bugs:
> > 
> >  - There are 8 (non-dupe) issues* tagged with Fix Version 2.7.1
> >  - 2 are rated "Blocker" (aka P0) but I think the others may be underrated
> >  - If you know of a critical bug that is not on that list, please file
> > an LTS backport ticket for it
> > 
> > If a user is on an old version and wants to move to the LTS, there are
> > some real blockers. I propose that we perform a 2.7.1 release starting now.
> > 
> > I volunteer to manage the release. What do you think?
> > 
> > Kenn
> > 
> > *Some are "resolved" but this is not accurate as the LTS 2.7.1 branch is
> > not created yet. I suggest filing a ticket to track just the LTS
> > backport when you hit a bug that merits it.
> > 



Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Ismaël Mejía
Ankur, do you have the equivalent link that I can use to subscribe to
that calendar via google calendars?
The link seems to work only to see the calendar in a webpage.

Thanks.

On Fri, Apr 26, 2019 at 1:42 PM Maximilian Michels  wrote:
>
> Hi Ankur,
>
> Sounds good. This will ensure that we stay on track regarding the
> release cycle.
>
> Thanks,
> Max
>
> On 26.04.19 02:59, Ankur Goenka wrote:
> > Correction, The planned cut date is May 8th.
> >
> > On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka wrote:
> >
> > Hello Beam community!
> >
> > Beam 2.13 release branch cut date is April 8th according to the
> > release calendar [1]. I would like to volunteer myself to do this
> > release. I intend to cut the branch as planned on April 8th and
> > cherrypick fixes if needed.
> >
> > If you have release-blocking issues for 2.13, please mark their
> > "Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in case
> > you would like to move any non-blocking issues to that version.
> >
> > Does this sound reasonable?
> >
> > Thanks,
> > Ankur
> >
> > [1]
> > 
> > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
> >


Re: Hello from Hannah Jiang

2019-04-26 Thread Ismaël Mejía
Welcome to Beam Hannah !

On Fri, Apr 26, 2019 at 1:44 PM Maximilian Michels  wrote:
>
> Awesome. Welcome Hannah!
>
> Cheers,
> Max
>
> On 26.04.19 05:08, Yifan Zou wrote:
> > Welcome!
> >
> > On Thu, Apr 25, 2019 at 7:34 PM Connell O'Callaghan wrote:
> >
> > Welcome Hannah!!!
> >
> > On Thu, Apr 25, 2019, 5:42 PM Reza Rokni wrote:
> >
> > Welcome!
> >
> > On Fri, 26 Apr 2019 at 04:36, Hannah Jiang <hannahji...@google.com> wrote:
> >
> > Thanks Cyrus!
> >
> > On Thu, Apr 25, 2019 at 1:34 PM Cyrus Maden <cma...@google.com> wrote:
> >
> > Welcome!!
> >
> > On Thu, Apr 25, 2019 at 4:30 PM Hannah Jiang <hannahji...@google.com> wrote:
> >
> > Thank you Robin!
> >
> > On Thu, Apr 25, 2019 at 1:27 PM Robin Qiu <robi...@google.com> wrote:
> >
> > Welcome Hannah!
> >
> > On Thu, Apr 25, 2019 at 1:26 PM Hannah Jiang wrote:
> >
> > Thanks Kenneth!
> >
> > On Thu, Apr 25, 2019 at 1:24 PM Kenneth Knowles wrote:
> >
> > Welcome!
> >
> > On Thu, Apr 25, 2019 at 12:38 PM Matthias Baetens wrote:
> >
> > Welcome to the community!
> >
> > On Thu, Apr 25, 2019, 18:55 Griselda Cuevas wrote:
> >
> > Welcome Hannah! - Very excited to see you in the Beam community :)
> >
> > On Tue, 23 Apr 2019 at 12:59, Hannah Jiang wrote:
> >
> > Hi everyone
> >
> > I joined Google recently and would work on Python portability part.
> > I am happy to be part of the community. Looking forward to working
> > with all of you together.
> >
> > I have a minor request, can admin please give me access to JIRA?
> >
> > Thanks,
> > Hannah
> >
> >
> >
> >
> >


Re: Removing Java Reference Runner code

2019-04-26 Thread Maximilian Michels
Thanks for following up with this. I have mixed feelings to see the 
portable Java DirectRunner go, but I'm in favor of this change because 
it removes a lot of code that we do not really make use of.


-Max

On 26.04.19 02:58, Kenneth Knowles wrote:
Thanks for providing all this background on the PR. It is very easy to 
see where it came from. Definitely nice to have less code and fewer 
things that can break. Perhaps lazy consensus is enough.


Kenn

On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira wrote:


Hey everyone,

I made a preliminary PR for removing all the Java Reference Runner
code (PR-8380) since I
wanted to see if it could be done easily. It seems to be working
fine, so I wanted to open up this discussion to make sure people are
still in agreement on getting rid of this code and that people don't
have any concerns.

For those who need additional context about this, this previous
thread


is where we discussed deprecating the Java Reference Runner (in some
places it's called the ULR or Universal Local Runner, but it's the
same thing). Then there's this thread


where we discussed removing the code from the repo since it's been
deprecated.

If no one has any objections to trying to remove the code I'll have
someone review the PR I wrote and start a vote to have it merged.

Thanks,
Daniel Oliveira



Re: Hello from Hannah Jiang

2019-04-26 Thread Maximilian Michels

Awesome. Welcome Hannah!

Cheers,
Max

On 26.04.19 05:08, Yifan Zou wrote:

Welcome!

On Thu, Apr 25, 2019 at 7:34 PM Connell O'Callaghan wrote:

Welcome Hannah!!!

On Thu, Apr 25, 2019, 5:42 PM Reza Rokni <r...@google.com> wrote:

Welcome!

On Fri, 26 Apr 2019 at 04:36, Hannah Jiang <hannahji...@google.com> wrote:

Thanks Cyrus!

On Thu, Apr 25, 2019 at 1:34 PM Cyrus Maden <cma...@google.com> wrote:

Welcome!!

On Thu, Apr 25, 2019 at 4:30 PM Hannah Jiang <hannahji...@google.com> wrote:

Thank you Robin!

On Thu, Apr 25, 2019 at 1:27 PM Robin Qiu <robi...@google.com> wrote:

Welcome Hannah!

On Thu, Apr 25, 2019 at 1:26 PM Hannah Jiang <hannahji...@google.com> wrote:

Thanks Kenneth!

On Thu, Apr 25, 2019 at 1:24 PM Kenneth Knowles <k...@google.com> wrote:

Welcome!

On Thu, Apr 25, 2019 at 12:38 PM Matthias Baetens <baetensmatth...@gmail.com> wrote:

Welcome to the community!

On Thu, Apr 25, 2019, 18:55 Griselda Cuevas <g...@google.com> wrote:

Welcome Hannah! - Very excited to see you in the Beam community :)

On Tue, 23 Apr 2019 at 12:59, Hannah Jiang <hannahji...@google.com> wrote:

Hi everyone

I joined Google recently and would work on Python portability part.
I am happy to be part of the community. Looking forward to working
with all of you together.

I have a minor request, can admin please give me access to JIRA?

Thanks,
Hannah







Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Maximilian Michels

Hi Ankur,

Sounds good. This will ensure that we stay on track regarding the 
release cycle.


Thanks,
Max

On 26.04.19 02:59, Ankur Goenka wrote:

Correction, The planned cut date is May 8th.

On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka wrote:


Hello Beam community!

Beam 2.13 release branch cut date is April 8th according to the
release calendar [1]. I would like to volunteer myself to do this
release. I intend to cut the branch as planned on April 8th and
cherrypick fixes if needed.

If you have release-blocking issues for 2.13, please mark their
"Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in case
you would like to move any non-blocking issues to that version.

Does this sound reasonable?

Thanks,
Ankur

[1]

https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles



Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-04-26 Thread Maximilian Michels
Sounds good Kenn. Thanks for starting the preparation for the LTS patch 
release.


I assembled a list of Flink issues a while ago which I will have to 
revisit because there have been more critical fixes since.


Thanks,
Max

On 26.04.19 08:02, Kenneth Knowles wrote:
Since it is all trivially reversible if there is some other feeling 
about this thread, I have gone ahead and started the work:


  - I made release-2.7.1 branch point to the same commit as 
release-2.7.0 so there is something to target PRs
  - I have opened the first PR, cherry-picking the set_version script 
and using it to set the version on the branch to 2.7.1: 
https://github.com/apache/beam/pull/8407 (found bug in the new script 
right away :-)


Here is the release with list of issues: 
https://issues.apache.org/jira/projects/BEAM/versions/12344458. So 
anyone can grab a ticket and volunteer to open a backport PR to the 
release-2.7.1 branch.


I don't have a strong opinion about how long we should support the 2.7.x 
line. I am curious about different perspectives on user / vendor needs. 
I have two very basic thoughts: (1) we surely need to keep it going 
until some time after we have another LTS designated, to make sure there 
is a clear path for anyone only using LTS releases and (2) if we decide 
to end support of 2.7.x but then someone volunteers to backport and 
release, of course I would not expect anyone to block them, so it has no 
maximum lifetime, but we just need consensus on a minimum. And of course 
that consensus cannot force anyone to do the work, but is just a 
resolution of the community.


Kenn

On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré wrote:


+1 it sounds good to me.

Thanks !

Regards
JB

On 26/04/2019 02:42, Kenneth Knowles wrote:
 > Hi all,
 >
 > Since the release of 2.7.0 we have identified some serious bugs:
 >
 >  - There are 8 (non-dupe) issues* tagged with Fix Version 2.7.1
 >  - 2 are rated "Blocker" (aka P0) but I think the others may be
 > underrated
 >  - If you know of a critical bug that is not on that list, please
 > file an LTS backport ticket for it
 >
 > If a user is on an old version and wants to move to the LTS,
 > there are some real blockers. I propose that we perform a 2.7.1
 > release starting now.
 >
 > I volunteer to manage the release. What do you think?
 >
 > Kenn
 >
 > *Some are "resolved" but this is not accurate as the LTS 2.7.1
 > branch is not created yet. I suggest filing a ticket to track
 > just the LTS backport when you hit a bug that merits it.
 >



Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-26 Thread Robert Bradshaw
On Fri, Apr 26, 2019 at 6:40 AM Kenneth Knowles  wrote:
>
> You could use a CombiningState with a CombineFn that returns the minimum for 
> this case.

We've also wanted to be able to set data when setting a timer that
would be returned when the timer fires. (It's in the FnAPI, but not
the SDKs yet.)

The metadata is an interesting use case; do you have some more specific
examples? It might boil down to not having a rich enough (single) state
type.

> But I've come to feel there is a mismatch. On the one hand,
> ParDo(<stateful DoFn>) is a way to drop to a lower level and write logic
> that does not fit a more general computational pattern, really taking fine
> control. On the other hand, automatically merging state via CombiningState
> or BagState is more of a no-knobs higher level of programming. To me there
> seems to be a bit of a philosophical conflict.
>
> These days, I feel like an @OnMerge method would be more natural. If you are 
> using state and timers, you probably often want more direct control over how 
> state from windows gets merged. And of course we don't even have a design for 
> timers - you would need some kind of timestamp CombineFn but I think 
> setting/unsetting timers manually makes more sense. Especially considering 
> the trickiness around merging windows in the absence of retractions, you 
> really need this callback, so you can issue retractions manually for any 
> output your stateful DoFn emitted in windows that no longer exist.

I agree we'll probably need an @OnMerge. On the other hand, I like
being able to have good defaults. The high/low level thing is a
continuum (the indexing example falling towards the high end).

Actually, the merging questions bother me less than how easy it is to
accidentally clobber previous values. It looks so easy (like the
easiest state to use) but is actually the most dangerous. If one wants
this behavior, I would rather an explicit AnyCombineFn or
LatestCombineFn which makes you think about the semantics.

- Robert
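
To make the clobbering concern above concrete, here is a minimal sketch in plain Python. It deliberately does not use the Beam API: `ValueCell` and `MinCombiningCell` are hypothetical stand-ins for a ValueState-like cell (each write replaces the previous value) and a CombiningState-like cell backed by a minimum-returning combiner. The scenario, tracking the smallest timestamp seen while elements arrive out of order, is the one mentioned in this thread.

```python
class ValueCell:
    """Hypothetical stand-in for a ValueState-like cell: a write replaces the old value."""

    def __init__(self):
        self._value = None

    def write(self, value):
        self._value = value

    def read(self):
        return self._value


class MinCombiningCell:
    """Hypothetical stand-in for a CombiningState-like cell with a 'min' combiner."""

    def __init__(self):
        self._acc = None

    def add(self, value):
        # Combine with the accumulator instead of overwriting it.
        self._acc = value if self._acc is None else min(self._acc, value)

    def read(self):
        return self._acc


def track_naively(timestamps):
    """Blindly writes each timestamp; the last arrival wins."""
    cell = ValueCell()
    for ts in timestamps:
        cell.write(ts)
    return cell.read()


def track_with_min(timestamps):
    """Adds each timestamp to a min-combining cell; arrival order is irrelevant."""
    cell = MinCombiningCell()
    for ts in timestamps:
        cell.add(ts)
    return cell.read()


# Out-of-order arrival: the earliest timestamp (10) is not the last one written.
arrivals = [30, 10, 20]
print(track_naively(arrivals))   # 20: the earlier minimum was clobbered
print(track_with_min(arrivals))  # 10: the result does not depend on order
```

The naive version looks like the simplest thing to write, which is the danger being described: nothing in the code signals that an earlier value was lost.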

> On Thu, Apr 25, 2019 at 5:49 PM Reza Rokni  wrote:
>>
>> +1 on the metadata use case.
>>
>> For performance reasons the Timer API does not support a read() operation, 
>> which for the vast majority of use cases is not a required feature. In the 
>> small set of use cases where it is needed, for example when you need to set 
>> a Timer in EventTime based on the smallest timestamp seen in the elements 
>> within a DoFn, we can make use of a ValueState object to keep track of the 
>> value.
>>
>> On Fri, 26 Apr 2019 at 00:38, Reuven Lax  wrote:
>>>
>>> I see examples of people using ValueState that I think are not captured by
>>> CombiningState. For example, one common one is users who set a timer and 
>>> then record the timestamp of that timer in a ValueState. In general when 
>>> you store state that is metadata about other state you store, then 
>>> ValueState will usually make more sense than CombiningState.
>>>
>>> On Thu, Apr 25, 2019 at 9:32 AM Brian Hulette  wrote:

 Currently the Python SDK does not make ValueState available to users. My 
 initial inclination was to go ahead and implement it there to be 
 consistent with Java, but Robert brings up a great point here that 
 ValueState has an inherent race condition for out of order data, and a lot 
 of its use cases can actually be implemented with a CombiningState 
 instead.

 It seems to me that at the very least we should discourage the use of 
 ValueState by noting the danger in the documentation and preferring 
 CombiningState in examples, and perhaps we should go further and deprecate 
 it in Java and not implement it in Python. Either way I think we should be 
 consistent between Java and Python.

 I'm curious what people think about this, are there use cases that we 
 really need to keep ValueState around for?

 Brian

 -- Forwarded message -
 From: Robert Bradshaw 
 Date: Thu, Apr 25, 2019, 08:31
 Subject: Re: [docs] Python State & Timers
 To: dev 




 On Thu, Apr 25, 2019, 5:26 PM Maximilian Michels  wrote:
>
> Completely agree that CombiningState is nicer in this example. Users may
> still want to use ValueState when there is nothing to combine.


 I've always had trouble coming up with any good examples of this.

> Also,
> users already know ValueState from the Java SDK.


 Maybe we should deprecate that :)


> On 25.04.19 17:12, Robert Bradshaw wrote:
> > On Thu, Apr 25, 2019 at 4:58 PM Maximilian Michels  
> > wrote:
> >>
> >> I forgot to give an example, just to clarify for others:
> >>
> >>> What was the specific example that was less natural?
> >>
> >> Basically every time we use ListState to express ValueState, e.g.
> >>
> >> next_index, = list(state.read()) or [0]
> >>
> >> Taken from:
> >> 

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-26 Thread Robert Bradshaw
Thanks for all the hard work!

https://dist.apache.org/repos/dist/dev/beam/2.12.0/ seems empty; were
the artifacts already moved?

On Fri, Apr 26, 2019 at 10:31 AM Etienne Chauchot  wrote:
>
> Hi,
> Thanks for all your work and patience Andrew !
>
> PS: as a side note, there were 5 binding votes (I voted +1)
>
> Etienne
>
> On Thursday, 25 April 2019 at 11:16 -0700, Andrew Pilloud wrote:
>
> I reran the Nexmark tests, each runner passed. I compared the numbers
> on the direct runner to the dashboard and they are where they should
> be.
>
> With that, I'm happy to announce that we have unanimously approved this
> release.
>
> There are 8 approving votes, 4 of which are binding:
> * Jean-Baptiste Onofré
> * Lukasz Cwik
> * Maximilian Michels
> * Ahmet Altay
>
> There are no disapproving votes.
>
> Thanks everyone!
>


Re: CVE audit gradle plugin

2019-04-26 Thread Etienne Chauchot
Hi all,
Just to let you know, you can now check for vulnerabilities in libraries by
running gradlew audit --info. It is a separate task that is not in the
dependencies of the build (the normal build will not fail if vulnerabilities
are found). When you run it, it gives an output similar to mvn dependency:tree
with a red vulnerability arrow, and the build fails if vulnerabilities are
found. If there are none, the build succeeds.
For now, there is no more than that, but it can be included in Jenkins. As we
did not agree on something, I did not do the integration.
WDYT?
Etienne

On Wednesday, 24 April 2019 at 15:56 +0200, Etienne Chauchot wrote:
> Hi all,
> FYI I just submitted a PR [1] to add the CVE audit plugin to the build
> as an optional task: gradlew audit --info.
> [1] https://github.com/apache/beam/pull/8388
> Etienne
> On Tuesday, 23 April 2019 at 17:25 +0200, Etienne Chauchot wrote:
> > Hi,
> > should I merge my branch
> > https://github.com/echauchot/beam/tree/cve_audit_plugin to master to
> > include this tool in the build system then? It will not fail the build
> > but add an audit task to it.
> > Etienne
> > On Friday, 19 April 2019 at 10:54 -0700, Lukasz Cwik wrote:
> > >  Common Vulnerabilities and Exposures (CVE)
> > > 
> > > On Fri, Apr 19, 2019 at 10:33 AM Robert Burke  wrote:
> > > > Ah! What's CVE stand for then?
> > > > 
> > > > Re the PR: Sadly, it's more complicated than that, which I'll explain 
> > > > in the PR. Otherwise it would have been
> > > > done already. It's not too bad if the time is put in though.
> > > > On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:
> > > > > Robert, I believe what is being suggested is a tool that integrates 
> > > > > into CVE reports automatically and tells
> > > > > us if we have a dependency with a security issue (not just whether 
> > > > > there is a newer version). Also, there is a
> > > > > sweet draft PR to add Go modules[1].
> > > > > 1: https://github.com/apache/beam/pull/8354
> > > > > On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  
> > > > > wrote:
> > > > > > If we move to Go Modules, the go.mod file specifies direct 
> > > > > > dependencies and versions, and the go.sum file
> > > > > > includes checksums of the full transitive set of dependencies. 
> > > > > > There's likely going to be a tool for
> > > > > > detecting if an update is possible, if one doesn't exist in the go 
> > > > > > tooling already.
> > > > > > On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
> > > > > > > This seems worthwhile IMO.
> > > > > > > Ahmet, Pyup[1] is free for open source projects and has an API 
> > > > > > > that allows for dependency checking. They
> > > > > > > can scan Github repos automatically it seems but it may not be 
> > > > > > > compatible with how Apache permissions with
> > > > > > > Github work. I'm not sure if there is such a thing for Go.
> > > > > > > 
> > > > > > > 1: https://pyup.io/
> > > > > > > 
> > > > > > > On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  
> > > > > > > wrote:
> > > > > > > > > I want to bring this subject back. Any chance we can get this
> > > > > > > > > running in our main repo, maybe on a weekly basis like we do for
> > > > > > > > > the dependency reports? It looks totally worth it.
> > > > > > > > >
> > > > > > > > > On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay wrote:
> > > > > > > > > >
> > > > > > > > > > Thank you, I agree this is very important. Does anyone know a
> > > > > > > > > > similar tool for Python and Go?
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot wrote:
> > > > > > > > > >>
> > > > > > > > > >> Hi guys,
> > > > > > > > > >>
> > > > > > > > > >> I came by this [1] gradle plugin that is a client to the
> > > > > > > > > >> Sonatype OSS Index CVE database.
> > > > > > > > > >>
> > > > > > > > > >> I have set it up here in a branch [2], though the cache is
> > > > > > > > > >> not configured and the number of requests is limited. It can
> > > > > > > > > >> be run with "gradle --info audit"
> > > > > > > > > >>
> > > > > > > > > >> It could be nice to have something like this to track the
> > > > > > > > > >> CVEs in the libs we use. I know we have been spammed by
> > > > > > > > > >> automatic lib upgrade requests in the past, but CVEs are more
> > > > > > > > > >> important IMHO.
> > > > > > > > > >>
> > > > > > > > > >> This plugin is licensed under BSD-3-Clause, which is
> > > > > > > > > >> compatible with the Apache V2 licence [3]
> > > > > > > > > >>
> > > > > > > > > >> WDYT ?
> > > > > > > > > >>
> > > > > > > > > >> Etienne
> > > > > > > > > >>
> 

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-26 Thread Etienne Chauchot
Hi,
Thanks for all your work and patience Andrew !

PS: as a side note, there were 5 binding votes (I voted +1)

Etienne

On Thursday, 25 April 2019 at 11:16 -0700, Andrew Pilloud wrote:
> I reran the Nexmark tests, each runner passed. I compared the numbers
> on the direct runner to the dashboard and they are where they should
> be.
> 
> With that, I'm happy to announce that we have unanimously approved this 
> release.
> 
> There are 8 approving votes, 4 of which are binding:
> * Jean-Baptiste Onofré
> * Lukasz Cwik
> * Maximilian Michels
> * Ahmet Altay
> 
> There are no disapproving votes.
> 
> Thanks everyone!
> 


Re: [PROPOSAL] Prepare for LTS bugfix release 2.7.1

2019-04-26 Thread Kenneth Knowles
Since it is all trivially reversible if there is some other feeling about
this thread, I have gone ahead and started the work:

 - I made release-2.7.1 branch point to the same commit as release-2.7.0 so
there is something to target PRs
 - I have opened the first PR, cherry-picking the set_version script and
using it to set the version on the branch to 2.7.1:
https://github.com/apache/beam/pull/8407 (found bug in the new script right
away :-)

Here is the release with list of issues:
https://issues.apache.org/jira/projects/BEAM/versions/12344458. So anyone
can grab a ticket and volunteer to open a backport PR to the release-2.7.1
branch.

I don't have a strong opinion about how long we should support the 2.7.x
line. I am curious about different perspectives on user / vendor needs. I
have two very basic thoughts: (1) we surely need to keep it going until
some time after we have another LTS designated, to make sure there is a
clear path for anyone only using LTS releases and (2) if we decide to end
support of 2.7.x but then someone volunteers to backport and release, of
course I would not expect anyone to block them, so it has no maximum
lifetime, but we just need consensus on a minimum. And of course that
consensus cannot force anyone to do the work, but is just a resolution of
the community.

Kenn

On Thu, Apr 25, 2019 at 10:29 PM Jean-Baptiste Onofré 
wrote:

> +1 it sounds good to me.
>
> Thanks !
>
> Regards
> JB
>
> On 26/04/2019 02:42, Kenneth Knowles wrote:
> > Hi all,
> >
> > Since the release of 2.7.0 we have identified some serious bugs:
> >
> >  - There are 8 (non-dupe) issues* tagged with Fix Version 2.7.1
> >  - 2 are rated "Blocker" (aka P0) but I think the others may be
> underrated
> >  - If you know of a critical bug that is not on that list, please file
> > an LTS backport ticket for it
> >
> > If a user is on an old version and wants to move to the LTS, there are
> > some real blockers. I propose that we perform a 2.7.1 release starting
> now.
> >
> > I volunteer to manage the release. What do you think?
> >
> > Kenn
> >
> > *Some are "resolved" but this is not accurate as the LTS 2.7.1 branch is
> > not created yet. I suggest filing a ticket to track just the LTS
> > backport when you hit a bug that merits it.
> >
>