Re: contributor permission for Beam Jira

2019-04-24 Thread Ahmet Altay
Done. Thank you for contributing.

On Wed, Apr 24, 2019 at 10:18 AM Yoshiki Obata 
wrote:

> Hello everyone
>
> This is Yoshiki Obata.
> I created ticket BEAM-7137 and plan to fix it.
> I would be glad if someone could add me as a contributor in Jira.
>
> my Jira username is yoshiki.obata
>
> Best regards,
> Yoshiki
>
>
> --
> Yoshiki Obata
> mail: yoshiki.ob...@gmail.com
> gh: https://github.com/lazylynx
>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Ahmet Altay
I do not know the answer. I believe this will be similar to sharing the RC
artifacts for validation purposes and would not be a formal release by
itself. But I am not an expert and I hope others will share their opinions.

I quickly searched pypi for apache projects and found at least airflow [1]
and libcloud [2] are publishing rc artifacts to pypi. We can reach out to
those communities and learn about their processes.

Ahmet

[1] https://pypi.org/project/apache-airflow/#history
[2] https://pypi.org/project/apache-libcloud/#history

On Wed, Apr 24, 2019 at 6:15 PM Michael Luckey  wrote:

> Hi,
>
> wouldn't that be in conflict with Apache release policy [1] ?
>
> [1] http://www.apache.org/legal/release-policy.html
>
> On Thu, Apr 25, 2019 at 1:35 AM Alan Myrvold  wrote:
>
>> Great idea. I'd like the RCs to follow as much of the release artifact
>> process as possible.
>>
>> On Wed, Apr 24, 2019 at 3:27 PM Ahmet Altay  wrote:
>>
>>> To clarify my proposal, I am proposing publishing to the production pypi
>>> repository with an rc tag in the version. And in turn allow users to depend
>>> on beam's rc version + all the other regular dependencies users would have
>>> directly from pypi.
>>>
>>> Publishing to test pypi repo would also be helpful if test pypi repo
>>> also mirrors other packages that exist in the production pypi repository.
>>>
>>> On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada 
>>> wrote:
>>>
 I think this is a great idea. A way of doing it for python would be by
 using the test repository for PyPi[1], and that way we would not have to do
 an official PyPi release, but still would be able to install it with pip
 (by passing an extra flag), and test.

 In fact, there are some Beam artifacts already in there[2]. At some
 point I looked into this, but couldn't figure out who has access/the
 password for it.

>>>
>>> I also don't know who owns beam package in test pypi repo. Does
>>> anybody know?
>>>
>>>

 In short: +1, and I would suggest using the test PyPi repo to avoid
 publishing to the main PyPi repo.
 Best
 -P.

 [1] https://test.pypi.org/
 [2] https://test.pypi.org/project/apache-beam/

 On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:

> Hi all,
>
> What do you think about the idea of publishing pre-release artifacts
> as part of the RC emails?
>
> For Python this would translate into publishing the same artifacts
> from RC email with a version like "2.X.0rcY" to pypi. I do not know, but I
> am guessing we can do a similar thing with Maven central for Java 
> artifacts
> as well.
>
> Advantages would be:
> - Allow end users to validate RCs for their own purposes using the
> same exact process they will normally use.
>  - Enable early adopters to start using RC releases early on in the
> release cycle if that is what they would like to do. This will in turn
> reduce time pressure on some releases. Especially for cases like someone
> needs a release to be finalized for an upcoming event.
>
> There will also be disadvantages, some I could think of:
> - Users could request support for RC artifacts. Hopefully in the form
> of feedback for us to improve the release. But it could also be in the 
> form
> of folks using RC artifacts for production for a long time.
> - It will add toil to the current release process, there will be one
> more step for each RC. I think for python this will be a small step but
> nevertheless it will be additional work.
>
> For an example of this, you can take a look at tensorflow releases.
> For 1.13 there were 3 pre-releases [1].
>
> Ahmet
>
> [1] https://pypi.org/project/tensorflow/#history
>



Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread jincheng sun
Hi Robert,

In addition to the questions described by Dian, I also want to know what
difficult problems Py4j's solution will encounter in adding UDF support, which
you mentioned as follows:

Using something like Py4j is an easy way to get up and running, especially
> for a very faithful API, but the instant one wants to add UDFs one hits a
> cliff of sorts (which is surmountable, but likely a lot harder than having
> gone the above approach).


I would appreciate it if you could share more specific cases.

Thanks,
Jincheng

Dian Fu  wrote on Thu, Apr 25, 2019 at 11:53 AM:

> Thanks everyone for the discussion here.
>
> Regarding the Java/Scala UDFs and the built-in UDFs executing in the
> current Flink way (directly in the JVM, not via RPC), I share the same
> thoughts with Max and Robert and I think it will not be a big problem. From
> the design doc, I guess the main reason to take the Py4J way instead of the
> DAG way at present is that the DAG has some limitations in some scenarios,
> such as interactive programming, which may be a strong requirement for data scientists.
>
> > In addition (and I'll admit this is rather subjective) it seems to me
> one of the primary values of a table-like API in a given language (vs. just
> using (say) plain old SQL itself via a console) is the ability to embed it
> in a larger pipeline, or at least drop in operations that are not (as)
> naturally expressed in the "table way," including existing libraries. In
> other words, a full SDK. The Py4j wrapping doesn't extend itself to such
> integration nearly as easily.
>
>
> Hi Robert, regarding "a larger pipeline", do you mean translating
> table-like API jobs from/to another kind of API job, or embedding
> third-party libraries into table-like API jobs via UDFs? Could you kindly
> explain why this would be a problem for Py4J but not a problem when
> expressing the job with a DAG?
>
> Thanks,
> Dian
>
>
> > On Apr 25, 2019, at 12:16 AM, Robert Bradshaw  wrote:
> >
> > Thanks for the meeting summary, Stephan. Sounds like you covered a lot of
> ground. Some more comments below, adding onto what Max has said.
> >
> > On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  > wrote:
> > >
> > > Hi Stephan,
> > >
> > > This is exciting! Thanks for sharing. The inter-process communication
> > > code looks like the most natural choice as a common ground. To go
> > > further, there are indeed some challenges to solve.
> >
> > It certainly does make sense to share this work, though it does to me
> seem like a rather low level to integrate at.
> >
> > > > => Biggest question is whether the language-independent DAG is
> expressive enough to capture all the expressions that we want to map
> directly to Table API expressions. Currently much is hidden in opaque UDFs.
> Kenn mentioned the structure should be flexible enough to capture more
> expressions transparently.
> > >
> > > Just to add some context how this could be done, there is the concept
> of
> > > a FunctionSpec which is part of a transform in the DAG. FunctionSpec
> > contains a URN and a payload. FunctionSpec can be either (1)
> > > translated by the Runner directly, e.g. map to table API concepts or
> (2)
> > > run a user-defined function with an Environment. It could be feasible
> > > for Flink to choose the direct path, whereas Beam Runners would
> leverage
> > > the more generic approach using UDFs. Granted, compatibility across
> > > Flink and Beam would only work if both of the translation paths yielded
> > > the same semantics.
> >
> > To elaborate a bit on this, Beam DAGs are built up by applying
> Transforms (basically operations) to PCollections (the equivalent of
> dataset/datastream), but the key point here is that these transforms are
> often composite operations that expand out into smaller subtransforms. This
> expansion happens during pipeline construction, but with the recent work on
> cross language pipelines can happen out of process. This is one point of
> extendability. Secondly, and importantly, this composite structure is
> preserved in the DAG, and so a runner is free to ignore the provided
> expansion and supply its own (so long as semantically it produces exactly
> the same output). These composite operations can be identified by arbitrary
> URNs + payloads, and any runner that does not understand them simply uses
> the pre-provided expansion.
> >
> > The existing Flink runner operates on exactly this principle,
> translating URNs for the leaf operations (Map, Flatten, ...) as well as
> some composites it can do better (e.g. Reshard). It is intentionally easy
> to define and add new ones. This actually seems the easier approach (to me
> at least, but that's probably heavily influenced by what I'm familiar with
> vs. what I'm not).
> >
> > As for how well this maps onto the Flink Tables API, part of that
> depends on how much of the API is the operations themselves, and how much
> is concerning configuration/environment/etc. which is harder to talk about
> in an 

Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread Dian Fu
Thanks everyone for the discussion here. 

Regarding the Java/Scala UDFs and the built-in UDFs executing in the current
Flink way (directly in the JVM, not via RPC), I share the same thoughts with Max
and Robert and I think it will not be a big problem. From the design doc, I
guess the main reason to take the Py4J way instead of the DAG way at present is
that the DAG has some limitations in some scenarios, such as interactive
programming, which may be a strong requirement for data scientists.

> In addition (and I'll admit this is rather subjective) it seems to me one of 
> the primary values of a table-like API in a given language (vs. just using 
> (say) plain old SQL itself via a console) is the ability to embed it in a 
> larger pipeline, or at least drop in operations that are not (as) naturally 
> expressed in the "table way," including existing libraries. In other words, a 
> full SDK. The Py4j wrapping doesn't extend itself to such integration nearly 
> as easily. 


Hi Robert, regarding "a larger pipeline", do you mean translating table-like
API jobs from/to another kind of API job, or embedding third-party libraries
into table-like API jobs via UDFs? Could you kindly explain why this would be
a problem for Py4J but not a problem when expressing the job with a DAG?

Thanks,
Dian


> On Apr 25, 2019, at 12:16 AM, Robert Bradshaw  wrote:
> 
> Thanks for the meeting summary, Stephan. Sounds like you covered a lot of
> ground. Some more comments below, adding onto what Max has said. 
> 
> On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  > wrote:
> >
> > Hi Stephan,
> >
> > This is exciting! Thanks for sharing. The inter-process communication
> > code looks like the most natural choice as a common ground. To go
> > further, there are indeed some challenges to solve.
> 
> It certainly does make sense to share this work, though it does to me seem 
> like a rather low level to integrate at. 
> 
> > > => Biggest question is whether the language-independent DAG is expressive 
> > > enough to capture all the expressions that we want to map directly to 
> > > Table API expressions. Currently much is hidden in opaque UDFs. Kenn 
> > > mentioned the structure should be flexible enough to capture more 
> > > expressions transparently.
> >
> > Just to add some context how this could be done, there is the concept of
> > a FunctionSpec which is part of a transform in the DAG. FunctionSpec
> > contains a URN and a payload. FunctionSpec can be either (1)
> > translated by the Runner directly, e.g. map to table API concepts or (2)
> > run a user-defined function with an Environment. It could be feasible
> > for Flink to choose the direct path, whereas Beam Runners would leverage
> > the more generic approach using UDFs. Granted, compatibility across
> > Flink and Beam would only work if both of the translation paths yielded
> > the same semantics.
> 
> To elaborate a bit on this, Beam DAGs are built up by applying Transforms 
> (basically operations) to PCollections (the equivalent of dataset/datastream),
> but the key point here is that these transforms are often composite 
> operations that expand out into smaller subtransforms. This expansion happens 
> during pipeline construction, but with the recent work on cross language 
> pipelines can happen out of process. This is one point of extendability. 
> Secondly, and importantly, this composite structure is preserved in the DAG, 
> and so a runner is free to ignore the provided expansion and supply its own 
> (so long as semantically it produces exactly the same output). These 
> composite operations can be identified by arbitrary URNs + payloads, and any 
> runner that does not understand them simply uses the pre-provided expansion. 
> 
> The existing Flink runner operates on exactly this principle, translating 
> URNs for the leaf operations (Map, Flatten, ...) as well as some composites 
> it can do better (e.g. Reshard). It is intentionally easy to define and add 
> new ones. This actually seems the easier approach (to me at least, but that's 
> probably heavily influenced by what I'm familiar with vs. what I'm not). 
> 
> As for how well this maps onto the Flink Tables API, part of that depends on 
> how much of the API is the operations themselves, and how much is concerning 
> configuration/environment/etc. which is harder to talk about in an agnostic 
> way. 
> 
> Using something like Py4j is an easy way to get up and running, especially for
> a very faithful API, but the instant one wants to add UDFs one hits a cliff 
> of sorts (which is surmountable, but likely a lot harder than having gone the 
> above approach). In addition (and I'll admit this is rather subjective) it 
> seems to me one of the primary values of a table-like API in a given language 
> (vs. just using (say) plain old SQL itself via a console) is the ability to 
> embed it in a larger pipeline, or at least drop in operations that are not 
> (as) naturally 

Re: [docs] Python State & Timers

2019-04-24 Thread Reza Ardeshir Rokni
Pablo, Kenneth and I have a new blog post ready for publication which covers
how to create a "looping timer"; it allows default values to be created in a
window when no incoming elements exist. We just need to clear up a few bits
before publication, but it would be great to have it also include a Python
example, since I wrote it in Java...

Cheers

Reza

On Thu, 25 Apr 2019 at 04:34, Reuven Lax  wrote:

> Well, state is still not implemented for merging windows even for Java
> (though I believe the idea was to disallow ValueState there).
>
> On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw 
> wrote:
>
>> It was unclear what the semantics were for ValueState for merging
>> windows. (It's also a bit weird as it's inherently a race condition
>> wrt element ordering, unlike Bag and CombineState, though you can
>> always implement it as a CombineState that always returns the latest
>> value which is a bit more explicit about the dangers here.)
>>
>> On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette 
>> wrote:
>> >
>> > That's a great idea! I thought about this too after those posts came up
>> on the list recently. I started to look into it, but I noticed that there's
>> actually no implementation of ValueState in userstate. Is there a reason
>> for that? I started to work on a patch to add it but I was just curious if
>> there was some reason it was omitted that I should be aware of.
>> >
>> > We could certainly replicate the example without ValueState by using
>> BagState and clearing it before each write, but it would be nice if we
>> could draw a direct parallel.
>> >
>> > Brian
>> >
>> > On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels 
>> wrote:
>> >>
>> >> > It would probably be pretty easy to add the corresponding code
>> snippets to the docs as well.
>> >>
>> >> It's probably a bit more work because there is no section dedicated to
>> >> state/timer yet in the documentation. Tracked here:
>> >> https://jira.apache.org/jira/browse/BEAM-2472
>> >>
>> >> > I've been going over this topic a bit. I'll add the snippets next
>> week, if that's fine by y'all.
>> >>
>> >> That would be great. The blog posts are a great way to get started with
>> >> state/timers.
>> >>
>> >> Thanks,
>> >> Max
>> >>
>> >> On 11.04.19 20:21, Pablo Estrada wrote:
>> >> > I've been going over this topic a bit. I'll add the snippets next
>> week,
>> >> > if that's fine by y'all.
>> >> > Best
>> >> > -P.
>> >> >
>> >> > On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw > >> > > wrote:
>> >> >
>> >> > That's a great idea! It would probably be pretty easy to add the
>> >> > corresponding code snippets to the docs as well.
>> >> >
>> >> > On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels <
>> m...@apache.org
>> >> > > wrote:
>> >> >  >
>> >> >  > Hi everyone,
>> >> >  >
>> >> >  > The Python SDK still lacks documentation on state and timers.
>> >> >  >
>> >> >  > As a first step, what do you think about updating these two
>> blog
>> >> > posts
>> >> >  > with the corresponding Python code?
>> >> >  >
>> >> >  >
>> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>> >> >  >
>> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>> >> >  >
>> >> >  > Thanks,
>> >> >  > Max
>> >> >
>>
>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Michael Luckey
Hi,

wouldn't that be in conflict with Apache release policy [1] ?

[1] http://www.apache.org/legal/release-policy.html

On Thu, Apr 25, 2019 at 1:35 AM Alan Myrvold  wrote:

> Great idea. I'd like the RCs to follow as much of the release artifact
> process as possible.
>
> On Wed, Apr 24, 2019 at 3:27 PM Ahmet Altay  wrote:
>
>> To clarify my proposal, I am proposing publishing to the production pypi
>> repository with an rc tag in the version. And in turn allow users to depend
>> on beam's rc version + all the other regular dependencies users would have
>> directly from pypi.
>>
>> Publishing to test pypi repo would also be helpful if test pypi repo also
>> mirrors other packages that exist in the production pypi repository.
>>
>> On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada  wrote:
>>
>>> I think this is a great idea. A way of doing it for python would be by
>>> using the test repository for PyPi[1], and that way we would not have to do
>>> an official PyPi release, but still would be able to install it with pip
>>> (by passing an extra flag), and test.
>>>
>>> In fact, there are some Beam artifacts already in there[2]. At some
>>> point I looked into this, but couldn't figure out who has access/the
>>> password for it.
>>>
>>
>> I also don't know who owns beam package in test pypi repo. Does
>> anybody know?
>>
>>
>>>
>>> In short: +1, and I would suggest using the test PyPi repo to avoid
>>> publishing to the main PyPi repo.
>>> Best
>>> -P.
>>>
>>> [1] https://test.pypi.org/
>>> [2] https://test.pypi.org/project/apache-beam/
>>>
>>> On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:
>>>
 Hi all,

 What do you think about the idea of publishing pre-release artifacts as
 part of the RC emails?

 For Python this would translate into publishing the same artifacts from
 RC email with a version like "2.X.0rcY" to pypi. I do not know, but I am
 guessing we can do a similar thing with Maven central for Java artifacts as
 well.

 Advantages would be:
 - Allow end users to validate RCs for their own purposes using the same
 exact process they will normally use.
 - Enable early adopters to start using RC releases early on in the
 release cycle if that is what they would like to do. This will in turn
 reduce time pressure on some releases. Especially for cases like someone
 needs a release to be finalized for an upcoming event.

 There will also be disadvantages, some I could think of:
 - Users could request support for RC artifacts. Hopefully in the form
 of feedback for us to improve the release. But it could also be in the form
 of folks using RC artifacts for production for a long time.
 - It will add toil to the current release process, there will be one
 more step for each RC. I think for python this will be a small step but
 nevertheless it will be additional work.

 For an example of this, you can take a look at tensorflow releases. For
 1.13 there were 3 pre-releases [1].

 Ahmet

 [1] https://pypi.org/project/tensorflow/#history

>>>


Re: [BEAM-3344] Please review & [BEAM-352] Assign me.

2019-04-24 Thread Ahmet Altay
Hi,

I assigned you BEAM-352 and restarted the failing tests on your PR. There
is also a reviewer assigned to your PR.

Ahmet

On Wed, Apr 24, 2019 at 11:26 AM Madhusudhan Reddy Vennapusa <
sudhan...@gmail.com> wrote:

> HI Team,
>
> I worked on [BEAM-3344] and raised a pull request. Though the Java pre-commit
> is failing (due to GCP tests), it's not related to my changes. Could someone
> please review my changes, or let me know if the Java pre-commit is actually
> failing because I missed something?
>
> Also, I would like to start working on [BEAM-352]; could someone please
> assign this issue to me? My user id is *sudhan499*.
>
> Thanks & Regards,
> Madhu
>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Alan Myrvold
Great idea. I'd like the RCs to follow as much of the release artifact
process as possible.

On Wed, Apr 24, 2019 at 3:27 PM Ahmet Altay  wrote:

> To clarify my proposal, I am proposing publishing to the production pypi
> repository with an rc tag in the version. And in turn allow users to depend
> on beam's rc version + all the other regular dependencies users would have
> directly from pypi.
>
> Publishing to test pypi repo would also be helpful if test pypi repo also
> mirrors other packages that exist in the production pypi repository.
>
> On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada  wrote:
>
>> I think this is a great idea. A way of doing it for python would be by
>> using the test repository for PyPi[1], and that way we would not have to do
>> an official PyPi release, but still would be able to install it with pip
>> (by passing an extra flag), and test.
>>
>> In fact, there are some Beam artifacts already in there[2]. At some point
>> I looked into this, but couldn't figure out who has access/the password for
>> it.
>>
>
> I also don't know who owns beam package in test pypi repo. Does
> anybody know?
>
>
>>
>> In short: +1, and I would suggest using the test PyPi repo to avoid
>> publishing to the main PyPi repo.
>> Best
>> -P.
>>
>> [1] https://test.pypi.org/
>> [2] https://test.pypi.org/project/apache-beam/
>>
>> On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:
>>
>>> Hi all,
>>>
>>> What do you think about the idea of publishing pre-release artifacts as
>>> part of the RC emails?
>>>
>>> For Python this would translate into publishing the same artifacts from
>>> RC email with a version like "2.X.0rcY" to pypi. I do not know, but I am
>>> guessing we can do a similar thing with Maven central for Java artifacts as
>>> well.
>>>
>>> Advantages would be:
>>> - Allow end users to validate RCs for their own purposes using the same
>>> exact process they will normally use.
>>>  - Enable early adopters to start using RC releases early on in the
>>> release cycle if that is what they would like to do. This will in turn
>>> reduce time pressure on some releases. Especially for cases like someone
>>> needs a release to be finalized for an upcoming event.
>>>
>>> There will also be disadvantages, some I could think of:
>>> - Users could request support for RC artifacts. Hopefully in the form of
>>> feedback for us to improve the release. But it could also be in the form of
>>> folks using RC artifacts for production for a long time.
>>> - It will add toil to the current release process, there will be one
>>> more step for each RC. I think for python this will be a small step but
>>> nevertheless it will be additional work.
>>>
>>> For an example of this, you can take a look at tensorflow releases. For
>>> 1.13 there were 3 pre-releases [1].
>>>
>>> Ahmet
>>>
>>> [1] https://pypi.org/project/tensorflow/#history
>>>
>>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Ahmet Altay
To clarify my proposal, I am proposing publishing to the production pypi
repository with an rc tag in the version. And in turn allow users to depend
on beam's rc version + all the other regular dependencies users would have
directly from pypi.

Publishing to test pypi repo would also be helpful if test pypi repo also
mirrors other packages that exist in the production pypi repository.

On Wed, Apr 24, 2019 at 3:12 PM Pablo Estrada  wrote:

> I think this is a great idea. A way of doing it for python would be by
> using the test repository for PyPi[1], and that way we would not have to do
> an official PyPi release, but still would be able to install it with pip
> (by passing an extra flag), and test.
>
> In fact, there are some Beam artifacts already in there[2]. At some point
> I looked into this, but couldn't figure out who has access/the password for
> it.
>

I also don't know who owns beam package in test pypi repo. Does
anybody know?


>
> In short: +1, and I would suggest using the test PyPi repo to avoid
> publishing to the main PyPi repo.
> Best
> -P.
>
> [1] https://test.pypi.org/
> [2] https://test.pypi.org/project/apache-beam/
>
> On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:
>
>> Hi all,
>>
>> What do you think about the idea of publishing pre-release artifacts as
>> part of the RC emails?
>>
>> For Python this would translate into publishing the same artifacts from
>> RC email with a version like "2.X.0rcY" to pypi. I do not know, but I am
>> guessing we can do a similar thing with Maven central for Java artifacts as
>> well.
>>
>> Advantages would be:
>> - Allow end users to validate RCs for their own purposes using the same
>> exact process they will normally use.
>>  - Enable early adopters to start using RC releases early on in the
>> release cycle if that is what they would like to do. This will in turn
>> reduce time pressure on some releases. Especially for cases like someone
>> needs a release to be finalized for an upcoming event.
>>
>> There will also be disadvantages, some I could think of:
>> - Users could request support for RC artifacts. Hopefully in the form of
>> feedback for us to improve the release. But it could also be in the form of
>> folks using RC artifacts for production for a long time.
>> - It will add toil to the current release process, there will be one more
>> step for each RC. I think for python this will be a small step but
>> nevertheless it will be additional work.
>>
>> For an example of this, you can take a look at tensorflow releases. For
>> 1.13 there were 3 pre-releases [1].
>>
>> Ahmet
>>
>> [1] https://pypi.org/project/tensorflow/#history
>>
>


Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Pablo Estrada
I think this is a great idea. A way of doing it for python would be by
using the test repository for PyPi[1], and that way we would not have to do
an official PyPi release, but still would be able to install it with pip
(by passing an extra flag), and test.

In fact, there are some Beam artifacts already in there[2]. At some point I
looked into this, but couldn't figure out who has access/the password for
it.

In short: +1, and I would suggest using the test PyPi repo to avoid
publishing to the main PyPi repo.
Best
-P.

[1] https://test.pypi.org/
[2] https://test.pypi.org/project/apache-beam/
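
For illustration, this is roughly what the two install paths discussed in this
thread could look like for an end user (the rc version number below is only an
example and assumes such an artifact has actually been uploaded):

    # Pre-release (rc-tagged) version from the production PyPI repository;
    # a plain `pip install apache-beam` skips pre-releases unless --pre is given.
    pip install apache-beam==2.12.0rc1
    pip install --pre apache-beam

    # Install from the test PyPI repository ("the extra flag"), falling back
    # to production PyPI for the regular dependencies.
    pip install --index-url https://test.pypi.org/simple/ \
        --extra-index-url https://pypi.org/simple/ apache-beam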

On Wed, Apr 24, 2019 at 3:04 PM Ahmet Altay  wrote:

> Hi all,
>
> What do you think about the idea of publishing pre-release artifacts as
> part of the RC emails?
>
> For Python this would translate into publishing the same artifacts from RC
> email with a version like "2.X.0rcY" to pypi. I do not know, but I am
> guessing we can do a similar thing with Maven central for Java artifacts as
> well.
>
> Advantages would be:
> - Allow end users to validate RCs for their own purposes using the same
> exact process they will normally use.
>  - Enable early adopters to start using RC releases early on in the
> release cycle if that is what they would like to do. This will in turn
> reduce time pressure on some releases. Especially for cases like someone
> needs a release to be finalized for an upcoming event.
>
> There will also be disadvantages, some I could think of:
> - Users could request support for RC artifacts. Hopefully in the form of
> feedback for us to improve the release. But it could also be in the form of
> folks using RC artifacts for production for a long time.
> - It will add toil to the current release process, there will be one more
> step for each RC. I think for python this will be a small step but
> nevertheless it will be additional work.
>
> For an example of this, you can take a look at tensorflow releases. For
> 1.13 there were 3 pre-releases [1].
>
> Ahmet
>
> [1] https://pypi.org/project/tensorflow/#history
>


[Discuss] Publishing pre-release artifacts to repositories

2019-04-24 Thread Ahmet Altay
Hi all,

What do you think about the idea of publishing pre-release artifacts as
part of the RC emails?

For Python this would translate into publishing the same artifacts from RC
email with a version like "2.X.0rcY" to pypi. I do not know, but I am
guessing we can do a similar thing with Maven central for Java artifacts as
well.

Advantages would be:
- Allow end users to validate RCs for their own purposes using the same
exact process they will normally use.
 - Enable early adopters to start using RC releases early on in the release
cycle if that is what they would like to do. This will in turn reduce time
pressure on some releases. Especially for cases like someone needs a
release to be finalized for an upcoming event.

There will also be disadvantages, some I could think of:
- Users could request support for RC artifacts. Hopefully in the form of
feedback for us to improve the release. But it could also be in the form of
folks using RC artifacts for production for a long time.
- It will add toil to the current release process, there will be one more
step for each RC. I think for python this will be a small step but
nevertheless it will be additional work.

For an example of this, you can take a look at tensorflow releases. For
1.13 there were 3 pre-releases [1].

Ahmet

[1] https://pypi.org/project/tensorflow/#history


Re: [docs] Python State & Timers

2019-04-24 Thread Reuven Lax
Well, state is still not implemented for merging windows even for Java
(though I believe the idea was to disallow ValueState there).

On Wed, Apr 24, 2019 at 1:11 PM Robert Bradshaw  wrote:

> It was unclear what the semantics were for ValueState for merging
> windows. (It's also a bit weird as it's inherently a race condition
> wrt element ordering, unlike Bag and CombineState, though you can
> always implement it as a CombineState that always returns the latest
> value which is a bit more explicit about the dangers here.)
>
> On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette 
> wrote:
> >
> > That's a great idea! I thought about this too after those posts came up
> on the list recently. I started to look into it, but I noticed that there's
> actually no implementation of ValueState in userstate. Is there a reason
> for that? I started to work on a patch to add it but I was just curious if
> there was some reason it was omitted that I should be aware of.
> >
> > We could certainly replicate the example without ValueState by using
> BagState and clearing it before each write, but it would be nice if we
> could draw a direct parallel.
> >
> > Brian
> >
> > On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels 
> wrote:
> >>
> >> > It would probably be pretty easy to add the corresponding code
> snippets to the docs as well.
> >>
> >> It's probably a bit more work because there is no section dedicated to
> >> state/timer yet in the documentation. Tracked here:
> >> https://jira.apache.org/jira/browse/BEAM-2472
> >>
> >> > I've been going over this topic a bit. I'll add the snippets next
> week, if that's fine by y'all.
> >>
> >> That would be great. The blog posts are a great way to get started with
> >> state/timers.
> >>
> >> Thanks,
> >> Max
> >>
> >> On 11.04.19 20:21, Pablo Estrada wrote:
> >> > I've been going over this topic a bit. I'll add the snippets next
> week,
> >> > if that's fine by y'all.
> >> > Best
> >> > -P.
> >> >
> >> > On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw  >> > > wrote:
> >> >
> >> > That's a great idea! It would probably be pretty easy to add the
> >> > corresponding code snippets to the docs as well.
> >> >
> >> > On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels <
> m...@apache.org
> >> > > wrote:
> >> >  >
> >> >  > Hi everyone,
> >> >  >
> >> >  > The Python SDK still lacks documentation on state and timers.
> >> >  >
> >> >  > As a first step, what do you think about updating these two
> blog
> >> > posts
> >> >  > with the corresponding Python code?
> >> >  >
> >> >  >
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> >> >  > https://beam.apache.org/blog/2017/08/28/timely-processing.html
> >> >  >
> >> >  > Thanks,
> >> >  > Max
> >> >
>


Re: [docs] Python State & Timers

2019-04-24 Thread Robert Bradshaw
It was unclear what the semantics were for ValueState for merging
windows. (It's also a bit weird as it's inherently a race condition
wrt element ordering, unlike Bag and CombineState, though you can
always implement it as a CombineState that always returns the latest
value which is a bit more explicit about the dangers here.)

On Wed, Apr 24, 2019 at 10:08 PM Brian Hulette  wrote:
>
> That's a great idea! I thought about this too after those posts came up on 
> the list recently. I started to look into it, but I noticed that there's 
> actually no implementation of ValueState in userstate. Is there a reason for 
> that? I started to work on a patch to add it but I was just curious if there 
> was some reason it was omitted that I should be aware of.
>
> We could certainly replicate the example without ValueState by using BagState 
> and clearing it before each write, but it would be nice if we could draw a 
> direct parallel.
>
> Brian
>
> On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels  wrote:
>>
>> > It would probably be pretty easy to add the corresponding code snippets to 
>> > the docs as well.
>>
>> It's probably a bit more work because there is no section dedicated to
>> state/timer yet in the documentation. Tracked here:
>> https://jira.apache.org/jira/browse/BEAM-2472
>>
>> > I've been going over this topic a bit. I'll add the snippets next week, if 
>> > that's fine by y'all.
>>
>> That would be great. The blog posts are a great way to get started with
>> state/timers.
>>
>> Thanks,
>> Max
>>
>> On 11.04.19 20:21, Pablo Estrada wrote:
>> > I've been going over this topic a bit. I'll add the snippets next week,
>> > if that's fine by y'all.
>> > Best
>> > -P.
>> >
>> > On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw > > > wrote:
>> >
>> > That's a great idea! It would probably be pretty easy to add the
>> > corresponding code snippets to the docs as well.
>> >
>> > On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels > > > wrote:
>> >  >
>> >  > Hi everyone,
>> >  >
>> >  > The Python SDK still lacks documentation on state and timers.
>> >  >
>> >  > As a first step, what do you think about updating these two blog
>> > posts
>> >  > with the corresponding Python code?
>> >  >
>> >  > https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>> >  > https://beam.apache.org/blog/2017/08/28/timely-processing.html
>> >  >
>> >  > Thanks,
>> >  > Max
>> >


Re: [docs] Python State & Timers

2019-04-24 Thread Brian Hulette
That's a great idea! I thought about this too after those posts came up on
the list recently. I started to look into it, but I noticed that there's
actually no implementation of ValueState in userstate. Is there a reason
for that? I started to work on a patch to add it but I was just curious if
there was some reason it was omitted that I should be aware of.

We could certainly replicate the example without ValueState by using
BagState and clearing it before each write, but it would be nice if we
could draw a direct parallel.

Brian
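
As a rough sketch of the workaround described above (emulating ValueState with
a BagState that is cleared before each write), assuming a keyed input and an
illustrative integer value coder:

    import apache_beam as beam
    from apache_beam.coders import VarIntCoder
    from apache_beam.transforms.userstate import BagStateSpec

    class LastValueDoFn(beam.DoFn):
        # A bag kept at size <= 1 stands in for the missing ValueState.
        LAST_SEEN = BagStateSpec('last_seen', VarIntCoder())

        def process(self, element,
                    last_seen=beam.DoFn.StateParam(LAST_SEEN)):
            key, value = element
            previous = list(last_seen.read())  # [] or a single prior value
            last_seen.clear()                  # "write" semantics: keep one value
            last_seen.add(value)
            yield (key, previous[0] if previous else None, value)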

On Fri, Apr 12, 2019 at 7:05 AM Maximilian Michels  wrote:

> > It would probably be pretty easy to add the corresponding code snippets
> to the docs as well.
>
> It's probably a bit more work because there is no section dedicated to
> state/timer yet in the documentation. Tracked here:
> https://jira.apache.org/jira/browse/BEAM-2472
>
> > I've been going over this topic a bit. I'll add the snippets next week,
> if that's fine by y'all.
>
> That would be great. The blog posts are a great way to get started with
> state/timers.
>
> Thanks,
> Max
>
> On 11.04.19 20:21, Pablo Estrada wrote:
> > I've been going over this topic a bit. I'll add the snippets next week,
> > if that's fine by y'all.
> > Best
> > -P.
> >
> > On Thu, Apr 11, 2019 at 5:27 AM Robert Bradshaw  > > wrote:
> >
> > That's a great idea! It would probably be pretty easy to add the
> > corresponding code snippets to the docs as well.
> >
> > On Thu, Apr 11, 2019 at 2:00 PM Maximilian Michels  > > wrote:
> >  >
> >  > Hi everyone,
> >  >
> >  > The Python SDK still lacks documentation on state and timers.
> >  >
> >  > As a first step, what do you think about updating these two blog
> > posts
> >  > with the corresponding Python code?
> >  >
> >  > https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> >  > https://beam.apache.org/blog/2017/08/28/timely-processing.html
> >  >
> >  > Thanks,
> >  > Max
> >
>


[BEAM-3344] Please review & [BEAM-352] Assign me.

2019-04-24 Thread Madhusudhan Reddy Vennapusa
HI Team,

I worked on [BEAM-3344] and raised a pull request. Though the Java pre-commit
is failing (due to GCP tests), it's not related to my changes. Could someone
please review my changes, or let me know if the Java pre-commit is actually
failing because I missed something?

Also, I would like to start working on [BEAM-352]; could someone please
assign this issue to me? My user id is *sudhan499*.

Thanks & Regards,
Madhu


Re: Integration of python/portable runner tests for Samza runner

2019-04-24 Thread Xinyu Liu
Thanks for the useful pointers! We are looking forward to integrating both
Portable and Python-specific tests for Samza runner. A few questions:

- For portable runner tests: by looking at the portableValidatesRunnerTask in
flink_job_server.gradle, it seems it's the same set of Java tests but using
the portability framework to validate (submit to the job server and run the
portable pipeline in a specific runner). Is my understanding correct?

- For Python tests: it looks like flinkValidatesRunner is using the LOOPBACK SDK
worker type in the tests. Not sure what LOOPBACK does. Is it used for
testing? Currently the Samza portable runner supports the PROCESS worker.

Thanks,
Xinyu
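
For context on the LOOPBACK question: LOOPBACK is an environment type where the
runner connects back to an SDK worker running inside the process that submitted
the pipeline, so it is mostly useful for local testing and debugging (no Docker
container or separate worker process). A minimal sketch of selecting it from
Python, with an assumed job server address:

    from apache_beam.options.pipeline_options import PipelineOptions

    # The localhost:8099 job endpoint is only an assumption; point it at
    # whatever job server the runner under test exposes.
    options = PipelineOptions([
        '--runner=PortableRunner',
        '--job_endpoint=localhost:8099',
        '--environment_type=LOOPBACK',
    ])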



On Wed, Apr 24, 2019 at 2:45 AM Maximilian Michels  wrote:

> > If you are interested in portable python pipeline validation, I think
> > fn_api_runner_test would also help.
>
> Just to note, Ankur mentioned flinkCompatibilityMatrix, that one uses
> fn_api_runner_test with some tooling on top to bring up the test cluster.
>
> On 23.04.19 19:23, Boyuan Zhang wrote:
> > Hi Daniel,
> > If you are interested in portable python pipeline validation, I think
> > fn_api_runner_test
> > <
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py>
>
> > would also help.
> >
> > On Tue, Apr 23, 2019 at 10:19 AM Pablo Estrada  > > wrote:
> >
> > This is cool, Daniel : ) Glad to see the Samza runner moving forward.
> > Best
> > -P.
> >
> > On Tue, Apr 23, 2019 at 2:52 AM Maximilian Michels  > > wrote:
> >
> > Hi Daniel,
> >
> > Note that there is also Portable Validates Runner which runs Java
> > portability tests. I don't know if you have integrated with that
> > one
> > already.
> >
> > Thanks,
> > Max
> >
> > On 23.04.19 02:28, Ankur Goenka wrote:
> >  > Hi Daniel,
> >  >
> >  > We use flinkCompatibilityMatrix [1] to check the Flink
> > compatibility
> >  > with python. This is python equivalent to validatesRunner
> > tests in java
> >  > for portable runners.
> >  > I think we can reuse it for Samza Portable runner with minor
> > refactoring.
> >  >
> >  > [1]
> >  >
> >
> https://github.com/apache/beam/blob/bdb1a713a120a887e71e85c77879dc4446a58541/sdks/python/build.gradle#L305
> >  >
 >  > On Mon, Apr 22, 2019 at 3:21 PM Daniel Chen  wrote:
> >  >
> >  > Hi everyone,
> >  >
> >  > I'm working on improving the validation of the Python
> > portable Samza
> >  > runner. For java, we have the gradle task (
> > :validatesRunner) that
> >  > runs the runner validation tests.
> >  > I am looking for pointers on how to similarly
> > integrate/enable the
> >  > portability and Python tests for the Samza runner.
> >  >
> >  > Any help will be greatly appreciated.
> >  >
> >  > Thanks,
> >  > Daniel
> >  >
> >
>


contributor permission for Beam Jira

2019-04-24 Thread Yoshiki Obata
Hello everyone

This is Yoshiki Obata.
I created ticket BEAM-7137 and plan to fix it.
I would be glad if someone could add me as a contributor in Jira.

my Jira username is yoshiki.obata

Best regards,
Yoshiki


--
Yoshiki Obata
mail: yoshiki.ob...@gmail.com
gh: https://github.com/lazylynx


Re: JDK 11 compatibility testing

2019-04-24 Thread Robert Bradshaw
It seems to me that we can assume that if Beam is running in a Java 11
runtime, any Java 11 features used in the body of a DoFn should just work.
The interesting part will be whether there is anything on the boundary that
changes (e.g. are there changes to type inference rules that make them
stricter and/or smarter, or places where we reach into implementation
details like bytecode generation (with the full permutation of signature
options we support)).

Tests of this, of course, are critical.

On Wed, Apr 24, 2019 at 1:39 PM Michał Walenia 
wrote:

> Hi all,
>
> I’m currently working on enhancing a Beam test suite to check
> compatibility with Java 11 UDFs. As JDK11 introduces several useful
> features, I wanted to turn to the Devlist to gather your opinions on which
> features should be included in the DoFn.
>
> To give you an idea of how the test will be designed, I’m planning to
> create a test pipeline with a custom DoFn which will use JDK11 specific
> features. This test will be compiled with JDK11 and ran using a binary of
> Beam built with JDK8 in order to simulate a situation in which the user
> downloads the Beam from Maven repository and uses it in their project built
> in Java 11.
>
> The features I believe are worth checking are:
>
>    - String manipulation methods:
>        - .repeat
>        - stripTrailing, stripLeading and strip()
>        - isBlank
>        - lines
>    - RegEx asMatchPredicate
>    - Local parameter type inference in lambda expressions
>    - Optional::isEmpty
>    - Collection::toArray
>    - Path API change - Path::of
>
> I don’t think that checking other new features in Java 11 such as flight
> recording or new HTTP client which we probably won’t use in Beam is
> justified, but I’m open to suggestions and discussion.
>
> Which of those new features should be included in the DoFn applied in the
> test?
>
> I will be grateful for any input.
>
> Have a good day
>
> Michal
>
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> We create human & business stories through technology.
> Check out our projects! 
>


Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread Robert Bradshaw
Thanks for the meeting summary, Stephan. Sounds like you covered a lot of
ground. Some more comments below, adding onto what Max has said.

On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  wrote:
>
> Hi Stephan,
>
> This is exciting! Thanks for sharing. The inter-process communication
> code looks like the most natural choice as a common ground. To go
> further, there are indeed some challenges to solve.

It certainly does make sense to share this work, though it does to me seem
like a rather low level to integrate at.

> > => Biggest question is whether the language-independent DAG is
expressive enough to capture all the expressions that we want to map
directly to Table API expressions. Currently much is hidden in opaque UDFs.
Kenn mentioned the structure should be flexible enough to capture more
expressions transparently.
>
> Just to add some context how this could be done, there is the concept of
> a FunctionSpec which is part of a transform in the DAG. FunctionSpec
> contains a URN and a payload. FunctionSpec can be either (1)
> translated by the Runner directly, e.g. map to table API concepts or (2)
> run a user-defined function with an Environment. It could be feasible
> for Flink to choose the direct path, whereas Beam Runners would leverage
> the more generic approach using UDFs. Granted, compatibility across
> Flink and Beam would only work if both of the translation paths yielded
> the same semantics.

To elaborate a bit on this, Beam DAGs are built up by applying Transforms
(basically operations) to PCollections (the equivalent of
dataset/datastream), but the key point here is that these transforms are
often composite operations that expand out into smaller subtransforms. This
expansion happens during pipeline construction, but with the recent work on
cross language pipelines can happen out of process. This is one point of
extendability. Secondly, and importantly, this composite structure is
preserved in the DAG, and so a runner is free to ignore the provided
expansion and supply its own (so long as semantically it produces exactly
the same output). These composite operations can be identified by arbitrary
URNs + payloads, and any runner that does not understand them simply uses
the pre-provided expansion.
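
To make the composite/expansion point concrete, a small illustrative composite
transform in the Python SDK (the transform itself is made up for this example):

    import apache_beam as beam

    class CountPerKey(beam.PTransform):
        """A composite transform that expands into smaller Beam primitives."""
        def expand(self, pcoll):
            return (
                pcoll
                | 'PairWithOne' >> beam.Map(lambda key: (key, 1))
                | 'SumPerKey' >> beam.CombinePerKey(sum))

A runner that recognizes such a composite (by URN + payload) could replace it
with, say, a native Table API aggregation, while any other runner simply
executes the Map + CombinePerKey expansion above.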

The existing Flink runner operates on exactly this principle, translating
URNs for the leaf operations (Map, Flatten, ...) as well as some composites
it can do better (e.g. Reshard). It is intentionally easy to define and add
new ones. This actually seems the easier approach (to me at least, but
that's probably heavily influenced by what I'm familiar with vs. what I'm
not).

As for how well this maps onto the Flink Tables API, part of that depends
on how much of the API is the operations themselves, and how much is
concerning configuration/environment/etc. which is harder to talk about in
an agnostic way.

Using something like Py4j is an easy way to get up and running, especially
for a very faithful API, but the instant one wants to add UDFs one hits a
cliff of sorts (which is surmountable, but likely a lot harder than having
gone the above approach). In addition (and I'll admit this is rather
subjective) it seems to me one of the primary values of a table-like API in
a given language (vs. just using (say) plain old SQL itself via a console)
is the ability to embed it in a larger pipeline, or at least drop in
operations that are not (as) naturally expressed in the "table way,"
including existing libraries. In other words, a full SDK. The Py4j wrapping
doesn't extend itself to such integration nearly as easily.

But I really do understand the desire to not block immediate work (and
value) for a longer term solution.

> >  If the DAG is generic enough to capture the additional information, we
probably still need some standardization, so that all the different
language APIs represent their expressions the same way
>
> I wonder whether that's necessary as a first step. I think it would be
> fine for Flink to have its own way to represent API concepts in the Beam
> DAG which Beam Runners may not be able to understand. We could then
> successively add the capability for these transforms to run with Beam.
>
> >  Similarly, it makes sense to standardize the type system (and type
inference) as far as built-in expressions and their interaction with UDFs
are concerned. The Flink Table API and Blink teams found this to be
essential for a consistent API behavior. This would not prevent all-UDF
programs from still using purely binary/opaque types.
>
> Beam has a set of standard coders which can be used across languages. We
> will have to expand those to play well with Flink's:
>
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tableApi.html#data-types
>
> I think we will need to exchange more ideas to work out a model that
> will work for both Flink and Beam. A regular meeting could be helpful.

+1, I think this would be really good for both this effort and general
collaboration 

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-24 Thread Maximilian Michels
Fully agree that this is an effort that goes beyond changing a type
parameter, but I think we have a chance here to cooperate between the two
projects. I would be happy to help out where I can.


I'm not sure at this point what exactly is feasible for reuse but I 
would imagine the Runner-related code to be useful as well for the 
interaction with the SDK Harness. There are some fundamental differences 
in the model, e.g. how windowing works, which might be challenging to 
work around.


Thanks,
Max

On 24.04.19 12:03, jincheng sun wrote:


Hi Kenn, I think you are right: the Python SDK harness can be shared with
Flink, and we would also need to add some new primitive operations. Regarding
the runner side, I think most of the code in runners/java-fn-execution can be
shared (but it needs some improvements, such as FnDataService); some of it
cannot be shared, such as the job submission code. So we may need to set up a
document to clearly analyze which parts can be shared, which can be shared but
need some changes, and which definitely cannot be shared.


Hi Max, thanks for sharing your opinion. I also prefer using the Beam Fn
service as a library, and I am willing to put more effort into this.
From the view of the current code, abstracting the Fn service into a class
library that other projects can rely on requires a lot of effort from
the Beam community. Turning `WindowedValue` into `T` is just the
beginning of this effort. If the Beam community is willing to abstract
the Fn service into a class library that can be relied upon by
other projects, I can try to draft a document; of course, during this
period I may need a lot of help from you, Kenn, Lukasz, and the Beam
community. (I am a recruit in the Beam community :-))


What do you think?

Regards,
Jincheng

Kenneth Knowles  wrote on Wed, Apr 24, 2019 at 3:32 AM:


It seems to me that the most valuable code to share and keep up with
is the Python/Go/etc SDK harness; they would need to be enhanced
with new primitive operations. So you would want to depend directly
and share the original proto-generated classes too, which Beam
publishes as separate artifacts for Java. Is the runner-side support
code that valuable for direct integration into Flink? I would expect
once you get past trivial wrappers (that you can copy/paste with no
loss) you would hit differences in architecture so you would diverge
anyhow.

Kenn

On Tue, Apr 23, 2019 at 5:32 AM Maximilian Michels  wrote:

Hi Jincheng,

Copying code is a solution for the short term. In the long run
I'd like
the Fn services to be a library not only for the Beam
portability layer
but also for other projects which want to leverage it. We should
thus
make an effort to make it more generic/extensible where
necessary and
feasible.

Since you are investigating reuse of Beam portability in the
context of
Flink, do you think it would make sense to setup a document
where we
collect ideas and challenges?

Thanks,
Max

On 23.04.19 13:00, jincheng sun wrote:
 > Hi Reuven,
 >
 > I think you have provided an optional solution for other communities
 > which want to take advantage of Beam's existing achievements. Thank you
 > very much!
 >
 > I think the Flink community can choose to copy from Beam's
code or
 > choose to rely directly on the beam's class library. The
Flink community
 > also initiated a discussion, more info can be found here
 >


 >
 > The purpose of turning `WindowedValue` into `T` is to make the
 > interface design of Beam more versatile, so that other open source
 > projects have the opportunity to take advantage of Beam's existing
 > achievements. Of course, just changing `WindowedValue` into `T`
 > is not enough for it to be shared by other projects in the form of a
 > class library; we need to do more work. If Beam can provide a class
 > library in the future, other community contributors will also be
 > willing to contribute to the Beam community. This will benefit both
 > the community that wants to take advantage of Beam's existing
 > achievements and the Beam community itself. And thanks to Thomas, who
 > has also made a lot of effort in this regard.
 >
 > Thanks again for your valuable suggestion, and welcome any
feedback!
 >
 > Best,
 > Jincheng
 >
 > Reuven Lax 

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-24 Thread Reuven Lax
The Nexmark dataflow runs don't seem to be triggered by Run Java
PostCommit.

On Wed, Apr 24, 2019 at 1:58 AM Etienne Chauchot 
wrote:

> Reuven,
>
> Nexmark tests are indeed run as PostCommits (each commit on master). I
> guess we have been flooded with jenkins notification emails.
>
> Etienne
>
> On Tuesday, April 23, 2019 at 15:24 -0700, Reuven Lax wrote:
>
> I mistakenly thought that Java PostCommit would run these tests, and I
> merged based on PostCommit passing. That's how the bug got into master.
>
> On Tue, Apr 23, 2019 at 3:21 PM Kenneth Knowles  wrote:
>
> What can we do to make this part of day-to-day workflow instead of finding
> out during release validation? Was this just a failing test that was missed?
>
> Kenn
>
> On Tue, Apr 23, 2019 at 3:02 PM Andrew Pilloud 
> wrote:
>
> It looks like Java Nexmark tests are on the validation sheet but we've
> missed it the last few releases. Thanks for checking it Etienne! Does the
> current release process require everything to be tested before making the
> release final?
>
> I fully agree with you on point 2. All of these issues were in RC1 and
> could have been fixed for RC2.
>
> Andrew
>
> On Tue, Apr 23, 2019 at 2:58 PM Ahmet Altay  wrote:
>
> Thank you Andrew. I will suggest two improvements to the release process:
> 1. We can include benchmarks in the validation sheet ("Apache Beam Release
> Acceptance Criteria"). They are used part of the validation process and we
> can ensure that we check those for each release.
> 2. For RC validation, we can continue to exhaustively validate each RC
> even after the first -1 vote. Otherwise we end up not discovering all
> issues in a given RC and finding them in a successive RC, increasing the
> number of iterations required.
>
>
> On Tue, Apr 23, 2019 at 2:11 PM Andrew Pilloud 
> wrote:
>
> Please consider the vote for RC4 canceled. I'll quickly follow up with a
> new RC.
>
> Thanks for the complete testing everyone!
>
> Andrew
>
> On Tue, Apr 23, 2019 at 2:06 PM Reuven Lax  wrote:
>
> -1
>
> we need to cherry pick pr/8325 and pr/8385 to fix the above issue
>
> On Tue, Apr 23, 2019 at 1:48 PM Andrew Pilloud 
> wrote:
>
> I believe the breakage of Nexmark on Dataflow is
> https://issues.apache.org/jira/browse/BEAM-7002, which went in before the
> release was cut. It looks like this might be a release blocker based on the
> fix: https://github.com/apache/beam/pull/8325.
>
> The degraded performance is after the release is cut, so we should be good
> there.
>
> Andrew
>
> On Tue, Apr 23, 2019 at 8:44 AM Ismaël Mejía  wrote:
>
> Etienne, the RC1 vote happened on 04/03 and there have not been any cherry
> picks on the Spark runner afterwards, so if there is a commit that
> degraded performance around 04/10 it is not part of the release we are
> voting on. Please consider reverting your -1.
>
> However the issue you are reporting looks important, from a quick look
> I am guessing it could be related to BEAM-5775 that was merged on
> 12/04 however the performance regressions started happening since
> 09/04, so it could be unrelated. Maybe it could be due to changes in
> our infrastructure. The change in the workers may be worth tracking, but
> it is definitely not a release blocker, at least for the Spark runner.
>
>
>
>
>
> On Tue, Apr 23, 2019 at 5:12 PM Etienne Chauchot 
> wrote:
> >
> > Hi guys,
> > I will vote -1 (binding) on this RC (although the degradation is before the
> > RC4 cut date). I took a look at the Nexmark graphs for the 3 major runners:
> > - there seem to be functional regressions on Dataflow:
> > https://apache-beam-testing.appspot.com/explore?dashboard=5647201107705856
> > (13 queries fail in batch mode starting on 04/17)
> > - there is a perf degradation (+200%) in the Spark runner starting on 04/10
> > for all the queries:
> > https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> >
> > Sorry Andrew for the added work
> >
> > Etienne
> >
> > On Mon, Apr 22, 2019 at 12:21 -0700, Andrew Pilloud wrote:
> >
> > I signed the wheel files and updated the build process to not require
> > giving Travis Apache credentials. (You should probably change your password
> > if you haven't already.)
> >
> > Andrew
> >
> > On Mon, Apr 22, 2019 at 12:18 PM Ahmet Altay  wrote:
> >
> > +1 (binding)
> >
> > Verified the python 2 wheel files with quick start examples.
> >
> > On Mon, Apr 22, 2019 at 11:26 AM Ahmet Altay  wrote:
> >
> > I built the wheel files. They are in the usual place along with other
> python artifacts. I will test them a bit and update here. Could someone
> else please try the wheel files as well?
> >
> > Andrew, could you sign and hash the wheel files?
> >
> > On Mon, Apr 22, 2019 at 10:11 AM Ahmet Altay  wrote:
> >
> > I verified
> > - signatures and hashes.
> >  - python streaming quickstart guide
> >
> > I would like to verify the wheel files before voting. Please let us know
> when they are ready. Also, if you need help with building wheel files I can
> help/build.
> >
> > Ahmet
> >
> > On Mon, Apr 22, 

Re: CVE audit gradle plugin

2019-04-24 Thread Etienne Chauchot
Hi all,
FYI I just submitted a PR [1] to add the CVE audit plugin to the build
as an optional task: gradlew audit --info.
[1] https://github.com/apache/beam/pull/8388
Etienne
On Tue, Apr 23, 2019 at 17:25 +0200, Etienne Chauchot wrote:
> Hi, should I merge my branch
> https://github.com/echauchot/beam/tree/cve_audit_plugin to master to include
> this tool into the build system then? It will not fail the build but add an
> audit task to it.
> Etienne
> On Fri, Apr 19, 2019 at 10:54 -0700, Lukasz Cwik wrote:
> >  Common Vulnerabilities and Exposures (CVE)
> > 
> > On Fri, Apr 19, 2019 at 10:33 AM Robert Burke  wrote:
> > > Ah! What's CVE stand for then?
> > > 
> > > Re the PR: Sadly, it's more complicated than that, which I'll explain in 
> > > the PR. Otherwise it would have been done
> > > already. It's not too bad if the time is put in though.
> > > On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:
> > > > Robert, I believe what is being suggested is a tool that integrates 
> > > > into CVE reports automatically and tells us
> > > > if we have a dependency with a security issue (not just whether there 
> > > > is a newer version). Also, there is a
> > > > sweet draft PR to add Go modules[1].
> > > > 1: https://github.com/apache/beam/pull/8354
> > > > On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  
> > > > wrote:
> > > > > If we move to Go Modules, the go.mod file specifies direct 
> > > > > dependencies and versions, and the go.sum file
> > > > > includes checksums of the full transitive set of dependencies. 
> > > > > There's likely going to be a tool for detecting
> > > > > if an update is possible, if one doesn't exist in the go tooling 
> > > > > already.
> > > > > On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
> > > > > > This seems worthwhile IMO.
> > > > > > Ahmet, Pyup[1] is free for open source projects and has an API that 
> > > > > > allows for dependency checking. They can
> > > > > > scan Github repos automatically it seems but it may not be 
> > > > > > compatible with how Apache permissions with
> > > > > > Github work. I'm not sure if there is such a thing for Go.
> > > > > > 
> > > > > > 1: https://pyup.io/
> > > > > > 
> > > > > > On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  
> > > > > > wrote:
> > > > > > > I want to bring this subject back. Any chance we can get this
> > > > > > > running in our main repo, maybe on a weekly basis like we do for
> > > > > > > the dependency reports? It looks totally worth it.
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  
> > > > > > > wrote:
> > > > > > > 
> > > > > > > >
> > > > > > > 
> > > > > > > > Thank you, I agree this is very important. Does anyone know a 
> > > > > > > > similar tool for python and go?
> > > > > > > 
> > > > > > > >
> > > > > > > 
> > > > > > > > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
> > > > > > > >  wrote:
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> Hi guys,
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> I came by this [1] gradle plugin that is a client to the 
> > > > > > > >> Sonatype OSS Index CVE database.
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> I have set it up here in a branch [2], though the cache is not 
> > > > > > > >> configured and the number of requests is
> > > > > > > limited. It can be run with "gradle --info audit"
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> It could be nice to have something like this to track the CVEs
> > > > > > > >> in the libs we use. I know we have been spammed by automatic lib
> > > > > > > >> upgrade requests in the past, but CVEs are more important IMHO.
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> This plugin is under the BSD-3-Clause license, which is compatible
> > > > > > > >> with the Apache v2 license [3]
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> WDYT ?
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> Etienne
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
> > > > > > > 
> > > > > > > >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
> > > > > > > 
> > > > > > > >> [3] https://www.apache.org/legal/resolved.html
> > > > > > > 


Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread Maximilian Michels

Hi Stephan,

This is exciting! Thanks for sharing. The inter-process communication 
code looks like the most natural choice as a common ground. To go 
further, there are indeed some challenges to solve.



=> Biggest question is whether the language-independent DAG is expressive 
enough to capture all the expressions that we want to map directly to Table API 
expressions. Currently much is hidden in opaque UDFs. Kenn mentioned the structure 
should be flexible enough to capture more expressions transparently.


Just to add some context on how this could be done: there is the concept of 
a FunctionSpec, which is part of a transform in the DAG. A FunctionSpec 
contains a URN and a payload. A FunctionSpec can either be (1) 
translated by the Runner directly, e.g. mapped to Table API concepts, or (2) 
used to run a user-defined function with an Environment. It could be feasible 
for Flink to choose the direct path, whereas Beam Runners would leverage 
the more generic approach using UDFs. Granted, compatibility across 
Flink and Beam would only work if both of the translation paths yielded 
the same semantics.
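
For illustration, building such a FunctionSpec with the generated Java model 
classes might look roughly like the sketch below (the URN shown is Beam's ParDo 
URN; depending on the Beam version, ByteString may be the vendored protobuf class):

    import org.apache.beam.model.pipeline.v1.RunnerApi;
    import com.google.protobuf.ByteString; // possibly the vendored protobuf class

    // Sketch: a FunctionSpec is just a URN plus an opaque, URN-specific payload.
    static RunnerApi.FunctionSpec parDoSpec(ByteString serializedParDoPayload) {
      return RunnerApi.FunctionSpec.newBuilder()
          .setUrn("beam:transform:pardo:v1")      // identifies the operation
          .setPayload(serializedParDoPayload)     // e.g. a serialized ParDoPayload
          .build();
    }

A runner that understands the URN can translate the transform natively; 
otherwise the payload is handed to the SDK harness in the declared Environment.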



 If the DAG is generic enough to capture the additional information, we 
probably still need some standardization, so that all the different language 
APIs represent their expressions the same way


I wonder whether that's necessary as a first step. I think it would be 
fine for Flink to have its own way to represent API concepts in the Beam 
DAG which Beam Runners may not be able to understand. We could then 
successively add the capability for these transforms to run with Beam.



 Similarly, it makes sense to standardize the type system (and type inference) 
as far as built-in expressions and their interaction with UDFs are concerned. 
The Flink Table API and Blink teams found this to be essential for a consistent 
API behavior. This would not prevent all-UDF programs from still using purely 
binary/opaque types.


Beam has a set of standard coders which can be used across languages. We 
will have to expand those to play well with Flink's data types: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tableApi.html#data-types
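
For context, in the Java SDK these standard coders are ordinary Coder 
implementations with well-known portable URNs; a minimal sketch (URNs noted 
from memory, exceptions propagated via throws):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarLongCoder;

    // StringUtf8Coder corresponds to the portable coder URN
    // beam:coder:string_utf8:v1, so any SDK/runner can decode these bytes.
    static byte[] encodeUtf8(String value) throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      StringUtf8Coder.of().encode(value, out);
      return out.toByteArray();
    }

    // VarLongCoder corresponds to beam:coder:varint:v1.
    static byte[] encodeVarLong(long value) throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      VarLongCoder.of().encode(value, out);
      return out.toByteArray();
    }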


I think we will need to exchange more ideas to work out a model that 
will work for both Flink and Beam. A regular meeting could be helpful.


Thanks,
Max

On 23.04.19 21:23, Stephan Ewen wrote:

Hi all!

Below are my notes on the discussion last week on how to collaborate 
between Beam and Flink.
The discussion was between Tyler, Kenn, Luke, Ahmed, Xiaowei, Shaoxuan, 
Jincheng, and me.


This represents my understanding of the discussion, please augment this 
where I missed something or where your conclusion was different.


Best,
Stephan

===

*Beams Python and Portability Framework*

   - Portability core to Beam
   - Language independent dataflow DAG that is defined via ProtoBuf
   - DAG can be generated from various languages (Java, Python, Go)
   - The DAG describes the pipelines and contains additional parameters 
to describe each operator, and contains artifacts that need to be 
deployed / executed as part of an operator execution.
   - Operators execute in language-specific containers, data is 
exchanged between the language-specific container and the runner 
container (JVM) via gRPC.


*Flink's desiderata for Python API*

   - Python API should mirror Java / Scala Table API
   - All relational expressions that correspond to built-in functions 
should be translated to corresponding expressions in the Table API. That 
way the planner generates Java code for the data types and built-in 
expressions, meaning no Python code is necessary during execution

   - UDFs should be supported and run similarly as in Beam's approach
   - Python programs should be similarly created and submitted/deployed 
as Java / Scala programs (CLI, web, containerized, etc.)


*Consensus to share inter-process communication code*

   - Crucial code for robust setup and high performance data exchange 
across processes
   - The code for the SDK harness, the artifact boostrapping, and the 
data exchange make sense to share.
   - Ongoing discussion whether this can be a dedicated module with slim 
dependencies in Beam


*Potential Long Term Perspective: Share language-independent DAG 
representation*


   - Beam's language independent DAG could become a standard 
representation used in both projects
   - Flink would need a way to receive that DAG, map it to the Table 
API, execute it from there
   - The DAG would need to have a standardized representation of 
functions and expressions that then get mapped to Table API expressions 
to let the planner optimize those and generate Java code for those
   - Similar as UDFs are supported in the Table API, there would be 
additional "external UDFs" that would go through the above mentioned 
inter-process communication layer


   - _Advantages:_
     => Flink and Beam could share more language bindings
     => Flink 

Re: Artifact staging in cross-language pipelines

2019-04-24 Thread Robert Bradshaw
On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels  wrote:
>
> Good idea to let the client expose an artifact staging service that the
> ExpansionService could use to stage artifacts. This solves two problems:
>
> (1) The Expansion Service not being able to access the Job Server
> artifact staging service
> (2) The client not having access to the dependencies returned by the
> Expansion Server
>
> The downside is that it adds an additional indirection. The alternative
> to let the client handle staging the artifacts returned by the Expansion
> Server is more transparent and easier to implement.

The other downside is that it may not always be possible for the
expansion service to connect to the artifact staging service (e.g.
when constructing a pipeline locally against a remote expansion
service).

> Ideally, the Expansion Service won't return any dependencies because the
> environment already contains the required dependencies. We could make it
> a requirement for the expansion to be performed inside an environment.
> Then we would already ensure during expansion time that the runtime
> dependencies are available.

Yes, it's cleanest if the expansion service provides an environment
without all the dependencies provided. Interesting idea to make this a
property of the expansion service itself.

> > In this case, the runner would (as
> > requested by its configuration) be free to merge environments it
> > deemed compatible, including swapping out beam-java-X for
> > beam-java-embedded if it considers itself compatible with the
> > dependency list.
>
> Could you explain how that would work in practice?

Say one has a pipeline with environments

A: beam-java-sdk-2.12-docker
B: beam-java-sdk-2.12-docker + dep1
C: beam-java-sdk-2.12-docker + dep2
D: beam-java-sdk-2.12-docker + dep3

A runner could (conceivably) be intelligent enough to know that dep1
and dep2 are indeed compatible, and run A, B, and C in a single
beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the
corresponding fusion and lower overhead benefits). If a certain
pipeline option is set, it might further note that dep1 and dep2 are
compatible with its own workers, which are build against sdk-2.12, and
choose to run these in embedded + dep1 + dep2 environment.
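
A naive version of that merge decision could be sketched like this (a purely 
hypothetical helper, not existing Beam code; a real implementation would need 
proper version/conflict resolution rather than a simple map comparison):

    import java.util.Map;

    // Hypothetical: dependencies are modeled as "artifact id -> version". Two
    // environments are considered mergeable only if no artifact appears in both
    // with different versions.
    static boolean canMerge(Map<String, String> depsA, Map<String, String> depsB) {
      for (Map.Entry<String, String> dep : depsA.entrySet()) {
        String versionInB = depsB.get(dep.getKey());
        if (versionInB != null && !versionInB.equals(dep.getValue())) {
          return false;
        }
      }
      return true;
    }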


JDK 11 compatibility testing

2019-04-24 Thread Michał Walenia
Hi all,

I’m currently working on enhancing a Beam test suite to check compatibility
with Java 11 UDFs. As JDK11 introduces several useful features, I wanted to
turn to the Devlist to gather your opinions on which features should be
included in the DoFn.

To give you an idea of how the test will be designed, I’m planning to
create a test pipeline with a custom DoFn which will use JDK11-specific
features. This test will be compiled with JDK11 and run using a binary of
Beam built with JDK8 in order to simulate a situation in which the user
downloads Beam from the Maven repository and uses it in their project built
in Java 11.

The features I believe are worth checking are:


   - String manipulation methods:
      - repeat
      - stripTrailing, stripLeading and strip()
      - isBlank
      - lines
   - RegEx asMatchPredicate
   - Local parameter type inference in lambda expressions
   - Optional::isEmpty
   - Collection::toArray
   - Path API change - Path::of


I don’t think that checking other new features in Java 11, such as flight
recording or the new HTTP client, which we probably won’t use in Beam, is
justified, but I’m open to suggestions and discussion.
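
To make that concrete, the test DoFn could exercise several of those features 
at once; a rough sketch of what it might contain (not the final test code):

    import java.util.Optional;
    import org.apache.beam.sdk.transforms.DoFn;

    public class Jdk11FeaturesDoFn extends DoFn<String, String> {
      @ProcessElement
      public void processElement(@Element String word, OutputReceiver<String> out) {
        // isBlank(), strip() and repeat() are String methods added in JDK 11.
        Optional<String> maybe = Optional.of(word).filter(w -> !w.isBlank());
        // Optional::isEmpty and 'var' in lambda parameters are also JDK 11 features.
        if (maybe.isEmpty()) {
          return;
        }
        maybe.map((var w) -> w.strip().repeat(2)).ifPresent(out::output);
      }
    }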

Which of those new features should be included in the DoFn applied in the
test?

I will be grateful for any input.

Have a good day

Michal


-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

We create human & business stories through technology.
Check out our projects! 


Re: Artifact staging in cross-language pipelines

2019-04-24 Thread Maximilian Michels
Good idea to let the client expose an artifact staging service that the 
ExpansionService could use to stage artifacts. This solves two problems:


(1) The Expansion Service not being able to access the Job Server 
artifact staging service
(2) The client not having access to the dependencies returned by the 
Expansion Server


The downside is that it adds an additional indirection. The alternative 
to let the client handle staging the artifacts returned by the Expansion 
Server is more transparent and easier to implement.


Ideally, the Expansion Service won't return any dependencies because the 
environment already contains the required dependencies. We could make it 
a requirement for the expansion to be performed inside an environment. 
Then we would already ensure during expansion time that the runtime 
dependencies are available.



In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.


Could you explain how that would work in practice?

-Max

On 24.04.19 04:11, Heejong Lee wrote:



On Tue, Apr 23, 2019 at 2:07 AM, Robert Bradshaw  wrote:


I've been out, so coming a bit late to the discussion, but here's my
thoughts.

The expansion service absolutely needs to be able to provide the
dependencies for the transform(s) it expands. It seems the default,
foolproof way of doing this is via the environment, which can be a
docker image with all the required dependencies. More than this is an
(arguably important, but possibly messy) optimization.

The standard way to provide artifacts outside of the environment is
via the artifact staging service. Of course, the expansion service may
not have access to the (final) artifact staging service (due to
permissions, locality, or it may not even be started up yet) but the
SDK invoking the expansion service could offer an artifact staging
environment for the SDK to publish artifacts to. However, there are
some difficulties here, in particular avoiding name collisions with
staged artifacts, assigning semantic meaning to the artifacts (e.g.
should jar files get automatically placed in the classpath, or Python
packages recognized and installed at startup). The alternative is
going with a (type, pointer) scheme for naming dependencies; if we go
this route I think we should consider migrating all artifact staging
to this style. I am concerned that the "file" version will be less
than useful for what will become the most convenient expansion
services (namely, hosted and docker image). I am still at a loss,
however, as to how to solve the diamond dependency problem among
dependencies--perhaps the information is there if one walks
maven/pypi/go modules/... but do we expect every runner to know about
every packaging platform? This also wouldn't solve the issue if fat
jars are used as dependencies. The only safe thing to do here is to
force distinct dependency sets to live in different environments,
which could be too conservative.

This all leads me to think that perhaps the environment itself should
be docker image (often one of "vanilla" beam-java-x.y ones) +
dependency list, rather than have the dependency/artifact list as some
kind of data off to the side. In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.


I like this idea to build multiple docker environments on top of a bare 
minimum SDK harness container and allow runners to pick a suitable one 
based on a dependency list.




I agree with Thomas that we'll want to make expansion services, and
the transforms they offer, more discoverable. The whole lifetime cycle
of expansion services is something that has yet to be fully fleshed
out, and may influence some of these decisions.

As for adding --jar_package to the Python SDK, this seems really
specific to calling java-from-python (would we have O(n^2) such
options?) as well as out-of-place for a Python user to specify. I
would really hope we can figure out a more generic solution. If we
need this option in the meantime, let's at least make it clear
(probably in the name) that it's temporary.


Good points. I second that we need a more generic solution than a 
Python-to-Java specific option. I think instead of naming it differently we 
can make --jar_package a secondary option under --experiment in the 
meantime. WDYT?



On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise  wrote:
 >
 > One more suggestion:
 >
 > It would be nice to be able to select the environment for the
  

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-24 Thread jincheng sun
Hi Kenn, I think you are right, the Python SDK harness can be shared with
Flink, and we also need to add some new primitive operations. Regarding the
runner side, I think most of the code in runners/java-fn-execution
can be shared (but needs some improvement, such as FnDataService), while some
of it cannot be shared, such as the job submission code. So, we may need to set
up a document to clearly analyze which parts can be shared, which parts can
be shared but need some changes, and which parts definitely cannot
be shared.

Hi Max, Thanks for sharing your opinion. I also prefer using the Beam Fn
services as a library, and I am willing to put more effort into this.
From the view of the current code, abstracting the Fn services into a class
library that other projects can rely on requires a lot of effort from the
Beam community. Turning `WindowedValue` into `T` is just the beginning of
this effort. If the Beam community is willing to abstract the Fn services
into a class library that can be relied upon by other projects, I can try
to draft a document; of course during this period I may need a lot of help
from you, Kenn, Lukasz, and the Beam community. (I am a new recruit in the Beam
community :-))
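
To make the `WindowedValue` -> `T` idea concrete, the generalized signatures 
could look roughly like the sketch below (illustrative only; the type names 
follow the existing classes in sdks/java/fn-execution, and the exact shape 
would of course need discussion):

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.fn.data.CloseableFnDataReceiver;
    import org.apache.beam.sdk.fn.data.FnDataReceiver;
    import org.apache.beam.sdk.fn.data.InboundDataClient;
    import org.apache.beam.sdk.fn.data.LogicalEndpoint;

    // Sketch of a generalized data service: the element type is a plain T chosen
    // by the caller (Beam would use WindowedValue<T>, Flink could use its own
    // record types), instead of hard-coding WindowedValue<T> in the signatures.
    public interface GenericFnDataService {
      <T> InboundDataClient receive(
          LogicalEndpoint inputLocation, Coder<T> coder, FnDataReceiver<T> listener);

      <T> CloseableFnDataReceiver<T> send(LogicalEndpoint outputLocation, Coder<T> coder);
    }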

What do you think?

Regards,
Jincheng

On Wed, Apr 24, 2019 at 3:32 AM, Kenneth Knowles  wrote:

> It seems to me that the most valuable code to share and keep up with is
> the Python/Go/etc SDK harness; they would need to be enhanced with new
> primitive operations. So you would want to depend directly and share the
> original proto-generated classes too, which Beam publishes as separate
> artifacts for Java. Is the runner-side support code that valuable for
> direct integration into Flink? I would expect once you get past trivial
> wrappers (that you can copy/paste with no loss) you would hit differences
> in architecture so you would diverge anyhow.
>
> Kenn
>
> On Tue, Apr 23, 2019 at 5:32 AM Maximilian Michels  wrote:
>
>> Hi Jincheng,
>>
>> Copying code is a solution for the short term. In the long run I'd like
>> the Fn services to be a library not only for the Beam portability layer
>> but also for other projects which want to leverage it. We should thus
>> make an effort to make it more generic/extensible where necessary and
>> feasible.
>>
>> Since you are investigating reuse of Beam portability in the context of
>> Flink, do you think it would make sense to setup a document where we
>> collect ideas and challenges?
>>
>> Thanks,
>> Max
>>
>> On 23.04.19 13:00, jincheng sun wrote:
>> > Hi Reuven,
>> >
>> > I think you have provided an optional solution for other community
>> which
>> > wants to take advantage of Beam's existing achievements. Thank you very
>> > much!
>> >
>> > I think the Flink community can choose to copy from Beam's code or
>> > choose to rely directly on the beam's class library. The Flink
>> community
>> > also initiated a discussion, more info can be found here
>> > <
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html#a28096
>> >
>> >
>> > The purpose of turning `WindowedValue` into `T` is to make the
>> > interface design of Beam more versatile, so that other open source
>> > projects have the opportunity to take advantage of Beam's existing
>> > achievements. Of course, just changing the `WindowedValue` into `T`
>> > is not enough to be shared by other projects in the form of a class
>> > library, we need to do more efforts. If Beam can provide a class
>> library
>> > in the future, other community contributors will also have the
>> > willingness to contribute to the beam community. This will benefit both
>> > the community that wants to take advantage of Beam's existing
>> > achievements and the Beam community itself. And thanks to Thomas, who
>> > has also made a lot of efforts in this regard.
>> >
>> > Thanks again for your valuable suggestion, and welcome any feedback!
>> >
>> > Best,
>> > Jincheng
>> >
>> > On Tue, Apr 23, 2019 at 1:00 AM, Reuven Lax  wrote:
>> >
>> > One concern here: these interfaces are intended for use within the
>> > Beam project. Beam may decide to make specific changes to them to
>> > support needed functionality in Beam. If they are being reused by
>> > other projects, then those changes risk breaking those other
>> > projects in unexpected ways. I don't think we can guarantee that we
>> > don't do that. If this is useful in Flink, it would be safer to copy
>> > the code IMO rather than to directly depend on it.
>> >
>> > On Mon, Apr 22, 2019 at 12:08 AM jincheng sun
>> >  wrote:
>> >
>> > Hi Kenn,
>> >
>> > Thanks for your reply, and explained the design of WindowValue
>> > clearly!
>> >
>> > At present, the definitions of `FnDataService` and
>> > `BeamFnDataClient` in Data Plane are very clear and universal,
>> > such as: send(...)/receive(...). If it is only 

Re: Integration of python/portable runner tests for Samza runner

2019-04-24 Thread Maximilian Michels
If you are interested in portable python pipeline validation, I think 
fn_api_runner_test would also help.


Just to note, Ankur mentioned flinkCompatibilityMatrix, that one uses 
fn_api_runner_test with some tooling on top to bring up the test cluster.


On 23.04.19 19:23, Boyuan Zhang wrote:

Hi Daniel,
If you are interested in portable python pipeline validation, I think 
fn_api_runner_test would also help.


On Tue, Apr 23, 2019 at 10:19 AM Pablo Estrada  wrote:


This is cool, Daniel : ) Glad to see the Samza runner moving forward.
Best
-P.

On Tue, Apr 23, 2019 at 2:52 AM Maximilian Michels  wrote:

Hi Daniel,

Note that there is also Portable Validates Runner which runs Java
portability tests. I don't know if you have integrated with that
one
already.

Thanks,
Max

On 23.04.19 02:28, Ankur Goenka wrote:
 > Hi Daniel,
 >
 > We use flinkCompatibilityMatrix [1] to check the Flink
compatibility
 > with python. This is python equivalent to validatesRunner
tests in java
 > for portable runners.
 > I think we can reuse it for Samza Portable runner with minor
refactoring.
 >
 > [1]
 >

https://github.com/apache/beam/blob/bdb1a713a120a887e71e85c77879dc4446a58541/sdks/python/build.gradle#L305
 >
 > On Mon, Apr 22, 2019 at 3:21 PM Daniel Chen  wrote:
 >
 >     Hi everyone,
 >
 >     I'm working on improving the validation of the Python
portable Samza
 >     runner. For java, we have the gradle task (
:validatesRunner) that
 >     runs the runner validation tests.
 >     I am looking for pointers on how to similarly
integrate/enable the
 >     portability and Python tests for the Samza runner.
 >
 >     Any help will be greatly appreciated.
 >
 >     Thanks,
 >     Daniel
 >



Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-24 Thread Etienne Chauchot
Reuven, 
Nexmark tests are indeed run as PostCommits (each commit on master). I guess we 
have been flooded with jenkins
notification emails.
Etienne
On Tue, Apr 23, 2019 at 15:24 -0700, Reuven Lax wrote:
> I mistakenly thought that Java PostCommit would run these tests, and I merged 
> based on PostCommit passing. That's how
> the bug got into master.
> On Tue, Apr 23, 2019 at 3:21 PM Kenneth Knowles  wrote:
> > What can we do to make this part of day-to-day workflow instead of finding 
> > out during release validation? Was this
> > just a failing test that was missed?
> > Kenn
> > On Tue, Apr 23, 2019 at 3:02 PM Andrew Pilloud  wrote:
> > > It looks like Java Nexmark tests are on the validation sheet but we've 
> > > missed it the last few releases. Thanks for
> > > checking it Etienne! Does the current release process require everything 
> > > to be tested before making the release
> > > final?
> > > I fully agree with you on point 2. All of these issues were in RC1 and 
> > > could have been fixed for RC2.
> > > 
> > > Andrew
> > > On Tue, Apr 23, 2019 at 2:58 PM Ahmet Altay  wrote:
> > > > Thank you Andrew. I will suggest two improvements to the release process:
> > > > 1. We can include benchmarks in the validation sheet ("Apache Beam
> > > > Release Acceptance Criteria"). They are used as part of the validation
> > > > process and we can ensure that we check those for each release.
> > > > 2. For RC validation, we can continue to exhaustively validate each RC
> > > > even after the first -1 vote. Otherwise we end up not discovering all
> > > > issues in a given RC and finding them in a successive RC, increasing the
> > > > number of iterations required.
> > > > 
> > > > 
> > > > On Tue, Apr 23, 2019 at 2:11 PM Andrew Pilloud  
> > > > wrote:
> > > > > Please consider the vote for RC4 canceled. I'll quickly follow up 
> > > > > with a new RC.
> > > > > Thanks for the complete testing everyone!
> > > > > Andrew
> > > > > On Tue, Apr 23, 2019 at 2:06 PM Reuven Lax  wrote:
> > > > > > -1 
> > > > > > we need to cherry pick pr/8325 and pr/8385 to fix the above issue
> > > > > > On Tue, Apr 23, 2019 at 1:48 PM Andrew Pilloud 
> > > > > >  wrote:
> > > > > > > I believe the breakage of Nexmark on Dataflow is 
> > > > > > > https://issues.apache.org/jira/browse/BEAM-7002, which
> > > > > > > went in before the release was cut. It looks like this might be a 
> > > > > > > release blocker based on the fix: 
> > > > > > > https://github.com/apache/beam/pull/8325.
> > > > > > > 
> > > > > > > The degraded performance is after the release is cut, so we 
> > > > > > > should be good there.
> > > > > > > 
> > > > > > > Andrew
> > > > > > > On Tue, Apr 23, 2019 at 8:44 AM Ismaël Mejía  
> > > > > > > wrote:
> > > > > > > > Etienne RC1 vote happened in 04/03 and there have not been any 
> > > > > > > > cherry
> > > > > > > > 
> > > > > > > > picks on the spark runner afterwards so if there is a commit 
> > > > > > > > that
> > > > > > > > 
> > > > > > > > degraded performance around 04/10 it is not part of the release 
> > > > > > > > we are
> > > > > > > > 
> > > > > > > > voting, so please consider reverting your -1.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > However the issue you are reporting looks important, from a 
> > > > > > > > quick look
> > > > > > > > 
> > > > > > > > I am guessing it could be related to BEAM-5775 that was merged 
> > > > > > > > on
> > > > > > > > 
> > > > > > > > 12/04 however the performance regressions started happening 
> > > > > > > > since
> > > > > > > > 
> > > > > > > > 09/04 so it could be unrelated. Maybe it could be due to 
> > > > > > > > changes in
> > > > > > > > 
> > > > > > > > our infrastructure. Maybe the change in the workers to be 
> > > > > > > > tracked, but
> > > > > > > > 
> > > > > > > > definitely not a release blocker at least for the Spark runner.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Tue, Apr 23, 2019 at 5:12 PM Etienne Chauchot 
> > > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > >
> > > > > > > > 
> > > > > > > > > Hi guys ,
> > > > > > > > 
> > > > > > > > > I will vote -1 (binding) on this RC (although degradation is 
> > > > > > > > > before RC4 cut date). I took a look at
> > > > > > > > Nexmark graphs for the 3 major runners :
> > > > > > > > 
> > > > > > > > > - there seem to have functional regressions on Dataflow: 
> > > > > > > > https://apache-beam-testing.appspot.com/explore?dashboard=5647201107705856
> > > > > > > >  . 13 queries fail in batch
> > > > > > > > mode starting on 04/17
> > > > > > > > 
> > > > > > > > > - there is a perf degradation (+200%) in spark runner 
> > > > > > > > > starting on 04/10 for all the queries: 
> > > > > > > > https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> > > > > > > 

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-24 Thread Etienne Chauchot
Hi, I agree that checking Nexmark should be a mandatory task of the release 
process; I think it is already mentioned in the spreadsheet. Indeed it detects 
both functional and performance regressions, and over the whole Beam model scope.
The only lacking things in Nexmark are with 2 runners:
- Gearpump is not integrated because streaming pipelines (there is no batch 
mode) never end (the timeout in awaitTermination is not respected)
- the Spark runner in streaming mode is not integrated for the same reason

Etienne

On Tue, Apr 23, 2019 at 15:02 -0700, Andrew Pilloud wrote:
> It looks like Java Nexmark tests are on the validation sheet but we've missed 
> it the last few releases. Thanks for
> checking it Etienne! Does the current release process require everything to 
> be tested before making the release final?
> I fully agree with you on point 2. All of these issues were in RC1 and could 
> have been fixed for RC2.
> 
> Andrew
> On Tue, Apr 23, 2019 at 2:58 PM Ahmet Altay  wrote:
> > Thank you Andrew. I will suggest two improvements to the release process:
> > 1. We can include benchmarks in the validation sheet ("Apache Beam Release
> > Acceptance Criteria"). They are used as part of the validation process and we
> > can ensure that we check those for each release.
> > 2. For RC validation, we can continue to exhaustively validate each RC even
> > after the first -1 vote. Otherwise we end up not discovering all issues in a
> > given RC and finding them in a successive RC, increasing the number of
> > iterations required.
> > 
> > 
> > On Tue, Apr 23, 2019 at 2:11 PM Andrew Pilloud  wrote:
> > > Please consider the vote for RC4 canceled. I'll quickly follow up with a 
> > > new RC.
> > > Thanks for the complete testing everyone!
> > > Andrew
> > > On Tue, Apr 23, 2019 at 2:06 PM Reuven Lax  wrote:
> > > > -1 
> > > > we need to cherry pick pr/8325 and pr/8385 to fix the above issue
> > > > On Tue, Apr 23, 2019 at 1:48 PM Andrew Pilloud  
> > > > wrote:
> > > > > I believe the breakage of Nexmark on Dataflow is 
> > > > > https://issues.apache.org/jira/browse/BEAM-7002, which went
> > > > > in before the release was cut. It looks like this might be a release 
> > > > > blocker based on the fix: 
> > > > > https://github.com/apache/beam/pull/8325.
> > > > > 
> > > > > The degraded performance is after the release is cut, so we should be 
> > > > > good there.
> > > > > 
> > > > > Andrew
> > > > > On Tue, Apr 23, 2019 at 8:44 AM Ismaël Mejía  
> > > > > wrote:
> > > > > > Etienne RC1 vote happened in 04/03 and there have not been any 
> > > > > > cherry
> > > > > > 
> > > > > > picks on the spark runner afterwards so if there is a commit that
> > > > > > 
> > > > > > degraded performance around 04/10 it is not part of the release we 
> > > > > > are
> > > > > > 
> > > > > > voting, so please consider reverting your -1.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > However the issue you are reporting looks important, from a quick 
> > > > > > look
> > > > > > 
> > > > > > I am guessing it could be related to BEAM-5775 that was merged on
> > > > > > 
> > > > > > 12/04 however the performance regressions started happening since
> > > > > > 
> > > > > > 09/04 so it could be unrelated. Maybe it could be due to changes in
> > > > > > 
> > > > > > our infrastructure. Maybe the change in the workers to be tracked, 
> > > > > > but
> > > > > > 
> > > > > > definitely not a release blocker at least for the Spark runner.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Tue, Apr 23, 2019 at 5:12 PM Etienne Chauchot 
> > > > > >  wrote:
> > > > > > 
> > > > > > >
> > > > > > 
> > > > > > > Hi guys ,
> > > > > > 
> > > > > > > I will vote -1 (binding) on this RC (although degradation is 
> > > > > > > before RC4 cut date). I took a look at
> > > > > > Nexmark graphs for the 3 major runners :
> > > > > > 
> > > > > > > - there seem to have functional regressions on Dataflow: 
> > > > > > https://apache-beam-testing.appspot.com/explore?dashboard=5647201107705856
> > > > > >  . 13 queries fail in batch mode
> > > > > > starting on 04/17
> > > > > > 
> > > > > > > - there is a perf degradation (+200%) in spark runner starting on 
> > > > > > > 04/10 for all the queries: 
> > > > > > https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> > > > > > 
> > > > > > >
> > > > > > 
> > > > > > > Sorry Andrew for the added work
> > > > > > 
> > > > > > >
> > > > > > 
> > > > > > > Etienne
> > > > > > 
> > > > > > >
> > > > > > 
> > > > > > > On Mon, Apr 22, 2019 at 12:21 -0700, Andrew Pilloud wrote:
> > > > > > 
> > > > > > >
> > > > > > 
> > > > > > > I signed the wheels files and updated the build process to not 
> > > > > > > require giving travis apache credentials.
> > > > > > (You should probably change your password if you haven't already.)
> > > > > > 
> > > > > > >
> > > > > > 
> > > > 

Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread jincheng sun
Hi Stephan,

Thanks for your summary. From my point of view, we are on the same
page about the conclusion of the discussion!

I completely agree that we can divide the support of the Python Table API
into short-term and long-term goals, and that the short-term design
should upgrade smoothly to the long-term goals.
We will also continue to communicate with the Beam community to achieve
the long-term goals.

Our hope is that Flink 1.9 can support the Python Table API, so I am
preparing to create FLIP-38 in the Flink Confluence and to open the
first PR of the Python Table API. Of course, we can continue the discussion in
the Google doc and mail thread for the design points that have not reached
consensus. Does that make sense to you?

Regards,
Jincheng

On Wed, Apr 24, 2019 at 3:24 AM, Stephan Ewen  wrote:

> Hi all!
>
> Below are my notes on the discussion last week on how to collaborate
> between Beam and Flink.
> The discussion was between Tyler, Kenn, Luke, Ahmed, Xiaowei, Shaoxuan,
> Jincheng, and me.
>
> This represents my understanding of the discussion, please augment this
> where I missed something or where your conclusion was different.
>
> Best,
> Stephan
>
> ===
>
> *Beams Python and Portability Framework*
>
>   - Portability core to Beam
>   - Language independent dataflow DAG that is defined via ProtoBuf
>   - DAG can be generated from various languages (Java, Python, Go)
>   - The DAG describes the pipelines and contains additional parameters to
> describe each operator, and contains artifacts that need to be deployed /
> executed as part of an operator execution.
>   - Operators execute in language-specific containers, data is exchanged
> between the language-specific container and the runner container (JVM) via
> gRPC.
>
> *Flink's desiderata for Python API*
>
>   - Python API should mirror Java / Scala Table API
>   - All relational expressions that correspond to built-in functions
> should be translated to corresponding expressions in the Table API. That
> way the planner generated Java code for the data types and built-in
> expressions, meaning no Python code is necessary during execution
>   - UDFs should be supported and run similarly as in Beam's approach
>   - Python programs should be similarly created and submitted/deployed as
> Java / Scala programs (CLI, web, containerized, etc.)
>
> *Consensus to share inter-process communication code*
>
>   - Crucial code for robust setup and high performance data exchange
> across processes
>   - The code for the SDK harness, the artifact boostrapping, and the data
> exchange make sense to share.
>   - Ongoing discussion whether this can be a dedicated module with slim
> dependencies in Beam
>
> *Potential Long Term Perspective: Share language-independent DAG
> representation*
>
>   - Beam's language independent DAG could become a standard representation
> used in both projects
>   - Flink would need a way to receive that DAG, map it to the Table API,
> execute it from there
>   - The DAG would need to have a standardized representation of functions
> and expressions that then get mapped to Table API expressions to let the
> planner optimize those and generate Java code for those
>   - Similar as UDFs are supported in the Table API, there would be
> additional "external UDFs" that would go through the above mentioned
> inter-process communication layer
>
>   - *Advantages:*
> => Flink and Beam could share more language bindings
> => Flink would execute Beam portability programs fast, without
> intermediate abstraction and directly in the JVM for many operators.
>  Abstraction is necessary around UDFs and to bridge between
> serializers / coders, etc.
>
>   - *Open issues:*
> => Biggest question is whether the language-independent DAG is
> expressive enough to capture all the expressions that we want to map
> directly to Table API expressions. Currently much is hidden in opaque UDFs.
> Kenn mentioned the structure should be flexible enough to capture more
> expressions transparently.
>
> => If the DAG is generic enough to capture the additional information,
> we probably still need some standardization, so that all the different
> language APIs represent their expressions the same way
> => Similarly, it makes sense to standardize the type system (and type
> inference) as far as built-in expressions and their interaction with UDFs
> are concerned. The Flink Table API and Blink teams found this to be
> essential for a consistent API behavior. This would not prevent all-UDF
> programs from still using purely binary/opaque types.
>
>  =>  We need to create a Python API that follows the same structure as
> Flink's Table API that produces the language-independent DAG
>
> *Short-term approach in Flink*
>
>   - Goal is to not block Flink's Python effort on the long term approach
> and the necessary design and evolution of the language-independent DAG.
>   - Depending on what the