Re: big data blog

2020-02-07 Thread Kenneth Knowles
Nice! Yes, I think we should promote Beam articles that are insightful from
a longtime contributor.

Etienne - can you add twitter announcements/retweets to the social media
spreadsheet when you write new articles?

Kenn

On Fri, Feb 7, 2020 at 5:44 PM Ahmet Altay  wrote:

> Cool, thank you. Would it make sense to promote Beam related posts on our
> twitter channel?
>
> On Fri, Feb 7, 2020 at 2:47 PM Pablo Estrada  wrote:
>
>> Very nice. Thanks for sharing Etienne!
>>
>> On Fri, Feb 7, 2020 at 2:19 PM Reuven Lax  wrote:
>>
>>> Cool!
>>>
>>> On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot 
>>> wrote:
>>>
 Hi all,

 FYI, I just started a blog around big data technologies and for now it
 is focused on Beam.

 https://echauchot.blogspot.com/

 Feel free to comment, suggest or anything.

 Etienne




Re: Compile error on Java 11 when running :examples:java:test

2020-02-07 Thread Jean-Baptiste Onofre
Hi,

AFAIR I had the same issue on my Linux.

Let me do a new run.

Regards
JB

> Le 7 févr. 2020 à 21:35, Kenneth Knowles  a écrit :
> 
> The expected class file version 53 is for Java 9, I believe. So is the right 
> javac being invoked?
> 
> I hit some issues like this on mac a while back, unrelated to Java 11. 
> Suspected something wonky in Mac's Java setup not working well with the 
> Gradle wrapper. Never resolved them actually. Have been working on linux 
> lately.
> 
> Kenn
> 
> On Fri, Feb 7, 2020 at 11:32 AM Jean-Baptiste Onofré  > wrote:
> Hi
> 
> No jdk 11 is not yet fully supported.
> 
> I?ve started to work on it but it?s not yet ready.
> 
> Regards
> JB
> 
> Le ven. 7 f?vr. 2020 ? 20:20, David Cavazos  > a ?crit :
> Hi Beamers,
> 
> I'm trying to run the tests for the Java examples using Java 11 and there is 
> a compilation error due to an incompatible version.
> 
> I'm using the latest version of master.
> 
> 
> 
> If I downgrade to Java 8, it works. But isn't Java 11 supported?
> 
> Thanks!



Re: Updating releases on Github release page.

2020-02-07 Thread Kenneth Knowles
Previously, GitHub treated every tag a release (why? I don't know). I think
you can remove/edit them now. In addition to adding new ones, let's remove
the ones that are not actually voted on releases.

Kenn

On Fri, Feb 7, 2020 at 5:42 PM Ahmet Altay  wrote:

> I do not believe this is intentional. This step might be missing from the
> release guide.
>
> On Fri, Feb 7, 2020 at 5:07 PM Daniel Oliveira 
> wrote:
>
>> Hey beam devs,
>>
>> I saw a comment on SO that our releases on github (
>> https://github.com/apache/beam/releases) are stuck at 2.16.0. It looks
>> like that's still tagged as the "Latest Release", but the newer releases
>> are actually present in tiny words above it: "... Show 7 newer tags".
>>
>> I wanted to fix this, but I'm not sure if it's intentional, and I have no
>> clue how to do so and am worried about messing something up. Anyone know
>> how to fix it? And do we need to add that step to release instructions for
>> the future?
>>
>
> Thank you. You can use the "Draft a new release" button to add/edit
> releases. We can also delete/edit in the future if needed.
>
>
>>
>> Thanks,
>> Daniel Oliveira
>>
>


Re: [PROPOSAL] Beam Schema Options

2020-02-07 Thread Kenneth Knowles
All fair points. I think it is a good proposal. We already know of existing
and future uses for it.

I don't think my concerns are actually answered by this discussion. Does
this allow/encourage creation of a PCollection that you can't make sense of
(or can't make *good* sense of) without understanding the options? We don't
have to answer that now and maybe it is unanswerable. If we look at proto
options as an example it seems to be mostly OK.

I think the risk is inseparable from how powerful it could be. So it is
worth accepting. If my fears come to pass, it will still mean that Beam is
being useful in new and unexpected ways, so that's not so bad :-)

Would it make sense to write this up as a BIP, to help bootstrap the wiki
page for them?

Kenn

On Fri, Feb 7, 2020 at 2:05 PM Reuven Lax  wrote:

> True - however at some level that's up to the user. We should be diligent
> that we don't implement core functionality this way (so far schema metadata
> has only been used for the fidelity use case above). However if some users
> wants to use it more extensively in their pipeline, that's up to them.
>
> Reuven
>
> On Fri, Feb 7, 2020 at 2:02 PM Kenneth Knowles  wrote:
>
>> It is a good point that it applies to configuring sources and sinks
>> mostly, or external data more generally.
>>
>> What I worry about is that metadata channels like this tend to take over
>> everything, and do it worse than more structured approach.
>>
>> As an exaggerated example which is not actually that far-fetched, schemas
>> could have a single type: bytes. And then the options on the fields would
>> say how to encode/decode the bytes. Now you've created a system where the
>> options are the main event and are actually core to the system. My
>> expectation is that this will almost certainly happen unless we are
>> vigilant about preventing core features from landing as options.
>>
>> Kenn
>>
>> On Fri, Feb 7, 2020 at 1:56 PM Reuven Lax  wrote:
>>
>>> I disagree - I've had several cases where user options on fields are
>>> very useful internally.
>>>
>>> A common rationale is to preserve fidelity. For instance, reading
>>> protos, projecting out a few fields, writing protos back out. You want to
>>> be able to nicely map protos to Beam schemas, but also preserve all the
>>> extra metadata on proto fields. This metadata has to carry through on all
>>> intermediate PCollections and schemas, so it makes sense to put it on the
>>> field.
>>>
>>> Another example: it provides a nice extension point for annotation
>>> extensions to a field. These are often things like Optional or Taint
>>> markings that don't change the semantic interpretation of the type (so
>>> shouldn't be a logical type), but do provide extra information about the
>>> field.
>>>
>>> Reuven
>>>
>>> On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette 
>>> wrote:
>>>
 I'm not sure this belongs directly on schemas. I've had trouble
 reconciling that opinion, since the idea does seem very useful, and in fact
 I'm interested in using it myself. I think I've figured out my concern -
 what I really want is options for a (maybe portable) Table.

 As I indicated in a comment in the doc [1] I still think all of the
 examples you've described only apply to IOs. To be clear, what I mean is
 that all of the examples either
 1) modify the behavior of the external system the IO is interfacing
 with (specify partitioning, indexing, etc..), or
 2) define some transformation that should be done to the data adjacent
 to the IO (after an Input or before an Output) in Beam

 (1) Is the sort of thing you described in the IO section [2] (aside
 from the PubSub example I added, since that's describing a transformation
 to do in Beam)
 I would argue that all of the other examples fall under (2) - data
 validation, adding computed columns, encryption, etc... are things that can
 be done in a transform

 I think we can make an analogy to a traditional database here:
 schema-aware Beam IOs are like Tables in a database, other PCollections are
 like intermediate results in a query. In a database, Tables can be defined
 with some DDL and have schema-level or column-level options that change
 system behavior, but intermediate results have no such capability.


 Another point I think is worth discussing: is there value in making
 these options portable?
 As it's currently defined I'm not sure there is - everything could be
 done within a single SDK. However, portable options on a portable table
 could be very powerful, since it could be used to configure cross-language
 IOs, perhaps with something like
 https://s.apache.org/xlang-table-provider/

 [1]
 https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=I54si4k
 [2]
 

Re: big data blog

2020-02-07 Thread Ahmet Altay
Cool, thank you. Would it make sense to promote Beam related posts on our
twitter channel?

On Fri, Feb 7, 2020 at 2:47 PM Pablo Estrada  wrote:

> Very nice. Thanks for sharing Etienne!
>
> On Fri, Feb 7, 2020 at 2:19 PM Reuven Lax  wrote:
>
>> Cool!
>>
>> On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi all,
>>>
>>> FYI, I just started a blog around big data technologies and for now it
>>> is focused on Beam.
>>>
>>> https://echauchot.blogspot.com/
>>>
>>> Feel free to comment, suggest or anything.
>>>
>>> Etienne
>>>
>>>


Re: Updating releases on Github release page.

2020-02-07 Thread Ahmet Altay
I do not believe this is intentional. This step might be missing from the
release guide.

On Fri, Feb 7, 2020 at 5:07 PM Daniel Oliveira 
wrote:

> Hey beam devs,
>
> I saw a comment on SO that our releases on github (
> https://github.com/apache/beam/releases) are stuck at 2.16.0. It looks
> like that's still tagged as the "Latest Release", but the newer releases
> are actually present in tiny words above it: "... Show 7 newer tags".
>
> I wanted to fix this, but I'm not sure if it's intentional, and I have no
> clue how to do so and am worried about messing something up. Anyone know
> how to fix it? And do we need to add that step to release instructions for
> the future?
>

Thank you. You can use the "Draft a new release" button to add/edit
releases. We can also delete/edit in the future if needed.


>
> Thanks,
> Daniel Oliveira
>


Updating releases on Github release page.

2020-02-07 Thread Daniel Oliveira
Hey beam devs,

I saw a comment on SO that our releases on github (
https://github.com/apache/beam/releases) are stuck at 2.16.0. It looks like
that's still tagged as the "Latest Release", but the newer releases are
actually present in tiny words above it: "... Show 7 newer tags".

I wanted to fix this, but I'm not sure if it's intentional, and I have no
clue how to do so and am worried about messing something up. Anyone know
how to fix it? And do we need to add that step to release instructions for
the future?

Thanks,
Daniel Oliveira


Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Kenneth Knowles
Regarding StatefulDoFnRunner: this fails during pipeline execution, too
late, and as you noted is just a utility that a runner may optionally use.
The change needs to be in the runner's run() method prior to execution
starting. Here is a specific PR that demonstrates the technique:
https://github.com/apache/beam/pull/3420. You could make some generalized
shared code that runner's could share I suppose.

Regarding having a pipeline-level summary of required features: this
introduces an opportunity for inconsistency where the pipeline misses some
needed features. Jan added the needed info the proto so it can be scraped
out easily, though it is still easy for a runner to just not be updated. So
that's a proto design issue. With PTransform URNs if there is a leaf
PTransform with an unknown URN the runner necessarily fails. That is a
better model.

Regarding consensus: it is true that Jan did reach out repeatedly and it
was the community that didn't engage. It is reasonable to move to
implementation eventually. Yet we still need to avoid the code-level issues
so need review. On this PR, to name names, I would consider Reuven or Luke
to be valuable reviewers. Also the email thread suffers from a tragedy of
the commons. You did directly ask for review. But mentions and using
GitHub's "review request" are probably a good way to get the PR actually
onto someone's dashboard.

Kenn

On Fri, Feb 7, 2020 at 1:52 PM Jan Lukavský  wrote:

> I reviewed closely the runners ad it seems to me that:
>
>  - all batch runners that would fail to support the annotation will fail
> already (spark structured streaming, apex) due to missing support for state
> or timers
>
>  - streaming runners must explicitly enable this, _as long as they use
> StatefulDoFnRunner_, which is the case for apex, flink and samza
>
> I will explicitly disable any pipeline with this annotation for:
>
>  - dataflow, jet and gearpump (because I don't see usage of
> StatefulDoFnRunner, although I though there was one, that's my mistake)
>
>  - all batch runners should either support the annotation or fail already
> (due to missing support for state or timers)
>
> Does this proposal solve the issues you see?
>
> Regarding the process of introducing this annotation I tried really hard
> to get to the best consensus I could. The same holds true for getting core
> people involved in the review process (explicitly mentioned in the PR,
> multiple mailing list threads). The PR was opened for discussion for more
> than half a year. But because I agree with you, I proposed the BIP, so that
> we can have a more explicit process for arriving at a consensus for
> features like this. I'd be happy though, if we can get to consensus about
> what to do now (if the steps I wrote above will solve every doubts) and
> have a deeper process for similar features for future cases. As I mentioned
> this feature is already implemented and having open PR into core for nearly
> a year is expensive to keep it in sync with master.
> On 2/7/20 9:31 PM, Kenneth Knowles wrote:
>
> TL;DR I am not suggesting that you must implement this for any runner. I'm
> afraid I do have to propose this change be rolled back before release
> 2.21.0 unless we fix this. I think the fix is easily achieved.
>
> Clarifications inline.
>
> On Fri, Feb 7, 2020 at 11:20 AM Jan Lukavský  wrote:
>
>> Hi Kenn,
>>
>> I think that this approach is not well maintainable and doesn't scale.
>> Main reasons:
>>
>>  a) modifying core has by definition some impact on runners, so modifying
>> core would imply necessity to modify all runners
>>
> My concern is not about all changes to "core" but only changes to the
> model, which should be extraordinarily rare. They must receive extreme
> scrutiny and require a very high level of consensus. It is true that every
> runner needs to either correctly execute or refuse to execute every
> pipeline, to the extent possible. For the case we are talking about it is
> very easy to meet this requirement.
>
>  b) having to implement core feature for all existing runners will make
>> any modification to core prohibitively expensive
>>
> No one is suggesting this. I am saying that you need to write the 1 line
> that I linked to "if (usesRequiresTimeSortedInput) then reject pipeline" so
> the runner fails before it begins processing data, potentially consuming
> non-replayable messages.
>
>
>>  c) even if we accept this, there can be runners that are outside of beam
>> repo (or even closed source!)
>>
> Indeed. And those runners need time to adapt to the new proto fields. I
> did not mention it this time, because the proto is not considered stable.
> But very soon it will be. At that point additions like this will have to be
> fully specified and added to the proto long before they are enabled for
> use. That way all runners can adjust. The proper order is (1) add model
> feature (2) make runners reject it, unsupported (3) add functionality to
> SDK (4) add to some runners and enable.
>
>

Re: big data blog

2020-02-07 Thread Pablo Estrada
Very nice. Thanks for sharing Etienne!

On Fri, Feb 7, 2020 at 2:19 PM Reuven Lax  wrote:

> Cool!
>
> On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot 
> wrote:
>
>> Hi all,
>>
>> FYI, I just started a blog around big data technologies and for now it
>> is focused on Beam.
>>
>> https://echauchot.blogspot.com/
>>
>> Feel free to comment, suggest or anything.
>>
>> Etienne
>>
>>


Re: Dynamic timers now supported!

2020-02-07 Thread Reuven Lax
Thanks for finding this. Hopefully the bug is easy .to fix. The tests
indeed never ran on any runner except for the DirectRunner, which is
something I should've noticed in the code review.

Reuven

On Mon, Feb 3, 2020 at 12:50 AM Ismaël Mejía  wrote:

> I had a discussion with Rehman last week and we discovered that the
> TimersMap
> related tests were not running for all runners because they were not
> tagged as
> part of the ValidatesRunner category. I opened a PR [1] to enable this, so
> please someone help me with the review/merge.
>
> I took a look just for curiosity and discovered that they are only passing
> for
> Direct runner and for the classic Flink runner in batch mode. They are not
> passing for Dataflow [2][3] and for the Portable Flink runner, so probably
> worth
> to reopen the issue to investigate/fix.
>
> [1] https://github.com/apache/beam/pull/10747
> [2]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_PR/210/
> [3]
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_PortabilityApi_Dataflow_PR/76/
>
>
> On Sat, Jan 25, 2020 at 1:26 AM Reuven Lax  wrote:
>
>> Yes. For now we exclude the flink runner, but fixing this should be
>> fairly trivial.
>>
>> On Fri, Jan 24, 2020 at 3:35 PM Maximilian Michels 
>> wrote:
>>
>>> The Flink Runner was allowing to set a timer multiple times before we
>>> made it comply with the Beam semantics of overwriting past invocations.
>>> I wouldn't be surprised if the Spark Runner never addressed this. Flink
>>> and Spark itself allow for a timer to be set to multiple times. In order
>>> to fix this for Beam, the Flink Runner has to maintain a checkpointed
>>> map which sits outside of its builtin TimerService.
>>>
>>> As far as I can see, multiple timer families are currently not supported
>>> in the Flink Runner due to the map not taking the family name into
>>> account. This can be easily fixed though.
>>>
>>> -Max
>>>
>>> On 24.01.20 21:31, Reuven Lax wrote:
>>> > The new timer family is in the portability protos. I think
>>> TimerReceiver
>>> > needs to be updated to set it though (I think a 1-line change).
>>> >
>>> > The TimerInternals class that runners implement today already handles
>>> > dynamic timers, so most of the work was in the Beam SDK  to provide an
>>> > API that allows users to access this feature.
>>> >
>>> > The main work needed in the runner was to take in account the timer
>>> > family. Beam semantics say that if a timer is set twice with the same
>>> > id, then the second timer overwrites the first.  Several runners
>>> > therefore had maps from timer id -> timer. However since the
>>> > timer family scopes the timers, we now allow two timers with the same
>>> id
>>> > as long as the timer families are different. Runners had to be updated
>>> > to include the timer family id in the map keys.
>>> >
>>> > Surprisingly, the new TimerMap tests seem to pass on Spark
>>> > ValidatesRunner, even though the Spark runner wasn't updated! I wonder
>>> > if this means that the Spark runner was incorrectly implementing the
>>> > Beam semantics before, and setTimer was not overwriting timers with
>>> the
>>> > same id?
>>> >
>>> > Reuven
>>> >
>>> > On Fri, Jan 24, 2020 at 7:31 AM Ismaël Mejía >> > > wrote:
>>> >
>>> > This looks great, thanks for the contribution Rehman!
>>> >
>>> > I have some questions (note I have not looked at the code at all).
>>> >
>>> > - Is this working for both portable and non portable runners?
>>> > - What do other runners need to implement to support this (e.g.
>>> Spark)?
>>> >
>>> > Maybe worth to add this to the website Compatibility Matrix.
>>> >
>>> > Regards,
>>> > Ismaël
>>> >
>>> >
>>> > On Fri, Jan 24, 2020 at 8:42 AM Rehman Murad Ali
>>> > >> > > wrote:
>>> >
>>> > Thank you Reuven for the guidance throughout the development
>>> > process. I am delighted to contribute my two cents to the Beam
>>> > project.
>>> >
>>> > Looking forward to more active contributions.
>>> >
>>> > *
>>> > *
>>> >
>>> > *Thanks & Regards*
>>> >
>>> >
>>> >
>>> > *Rehman Murad Ali*
>>> > Software Engineer
>>> > Mobile: +92 3452076766 <+92%20345%202076766>
>>> 
>>> > Skype: rehman.muradali
>>> >
>>> >
>>> >
>>> > On Thu, Jan 23, 2020 at 11:09 PM Reuven Lax >> > > wrote:
>>> >
>>> > Thanks to a lot of hard work by Rehman, Beam now supports
>>> > dynamic timers. As a reminder, this was discussed on the
>>> dev
>>> > list some time back.
>>> >
>>> > As background, previously one had to statically declare all
>>> > timers in your code. So if you wanted to have two timers,
>>> > you needed to create two timer variables and two callbacks
>>> -
>>> > one for each timer. A 

Re: big data blog

2020-02-07 Thread Reuven Lax
Cool!

On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot 
wrote:

> Hi all,
>
> FYI, I just started a blog around big data technologies and for now it
> is focused on Beam.
>
> https://echauchot.blogspot.com/
>
> Feel free to comment, suggest or anything.
>
> Etienne
>
>


Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Jan Lukavský

Hi Robert,

thanks for this insight. I think that this sort of uncovered additional 
question - I'm not saying that I follow every thread in dev@, but I 
didn't notice anything about "trying to stabilize the protos", which is 
again where I think these big milestones probably should be defined in 
front in a form of BIP, BEP or whatever. I will try to get more time to 
invest into this (creating the cwiki, BIP template and some basic first 
BIP to have something to iterate on).


Jan

On 2/7/20 10:51 PM, Robert Bradshaw wrote:

There are two separable concerns here.

(1) The @RequiresTimeSortedInput feature itself. This is a subtle
feature needed for certain pipelines, and if anything Jan has gone the
extra mile discussing, documenting, and designing this and trying to
reach consensus. I feel like there has been a failure in the community
(myself included) to either fully accept it, or come up with sound
reasons to reject it, in a timely manner. (This is one of the things I
hope BEPs could address.) The feature seems similar in spirit to
@RequiresStableInputs which I also find a bit icky but can't think of
a way around. (My ideal implementation for both would be to express
this in terms of a naive implementation that could be swapped out by
more advanced runners...) That being said, I don't think we should
block on this forever.

(2) Especially as we're trying to stabilize the protos, how can one
safely add constraints like this such that runners will reject rather
than execute pipelines with unsatisfied constraints? For SDKs, we're
thinking about adding the notion of capabilities (as a list, or
possibly mapping, of URNs that get attached to an environment. Perhaps
a pipeline could likewise have a set of requirements for those "new"
features that augments what can be inferred by looking at the set of
transform URNs. In this case, @RequiresTimeSortedInput would be such a
requirement attached to any pipeline using this feature, and its
contract would be to look at (and respect) certain bits on the DoFns,
and a runner must reject any pipeline with unknown requirements. (If
it understands a requirement, it could reject it based on its ability
to satisfy the contract as it is actually used in the pipeline).

On Fri, Feb 7, 2020 at 12:31 PM Kenneth Knowles  wrote:

TL;DR I am not suggesting that you must implement this for any runner. I'm 
afraid I do have to propose this change be rolled back before release 2.21.0 
unless we fix this. I think the fix is easily achieved.

Clarifications inline.

On Fri, Feb 7, 2020 at 11:20 AM Jan Lukavský  wrote:

Hi Kenn,

I think that this approach is not well maintainable and doesn't scale. Main 
reasons:

  a) modifying core has by definition some impact on runners, so modifying core 
would imply necessity to modify all runners

My concern is not about all changes to "core" but only changes to the model, 
which should be extraordinarily rare. They must receive extreme scrutiny and require a 
very high level of consensus. It is true that every runner needs to either correctly 
execute or refuse to execute every pipeline, to the extent possible. For the case we are 
talking about it is very easy to meet this requirement.


  b) having to implement core feature for all existing runners will make any 
modification to core prohibitively expensive

No one is suggesting this. I am saying that you need to write the 1 line that I linked to 
"if (usesRequiresTimeSortedInput) then reject pipeline" so the runner fails 
before it begins processing data, potentially consuming non-replayable messages.


  c) even if we accept this, there can be runners that are outside of beam repo 
(or even closed source!)

Indeed. And those runners need time to adapt to the new proto fields. I did not 
mention it this time, because the proto is not considered stable. But very soon 
it will be. At that point additions like this will have to be fully specified 
and added to the proto long before they are enabled for use. That way all 
runners can adjust. The proper order is (1) add model feature (2) make runners 
reject it, unsupported (3) add functionality to SDK (4) add to some runners and 
enable.


Therefore I think, that the correct and scalable approach would be to split 
this into several pieces:

  1) define pipeline requirements (this is pretty much similar to how we 
currently scope @Category(ValidatesRunner.class) tests

  2) let pipeline infer it's requirements prior to being translated via runner

  3) runner can check the set of required features and their support and reject 
the pipeline if some feature is missing

This is exactly what happens today, but was not included in your change. The 
pipeline proto (or the Java pipeline object) clearly contain all the needed 
information. Whether pipeline summarizes it or the runner implements a trivial 
PipelineVisitor is not important.


This could even replace the annotations used in validates runner tests, because 
each runner would simply 

Re: [PROPOSAL] Beam Schema Options

2020-02-07 Thread Reuven Lax
True - however at some level that's up to the user. We should be diligent
that we don't implement core functionality this way (so far schema metadata
has only been used for the fidelity use case above). However if some users
wants to use it more extensively in their pipeline, that's up to them.

Reuven

On Fri, Feb 7, 2020 at 2:02 PM Kenneth Knowles  wrote:

> It is a good point that it applies to configuring sources and sinks
> mostly, or external data more generally.
>
> What I worry about is that metadata channels like this tend to take over
> everything, and do it worse than more structured approach.
>
> As an exaggerated example which is not actually that far-fetched, schemas
> could have a single type: bytes. And then the options on the fields would
> say how to encode/decode the bytes. Now you've created a system where the
> options are the main event and are actually core to the system. My
> expectation is that this will almost certainly happen unless we are
> vigilant about preventing core features from landing as options.
>
> Kenn
>
> On Fri, Feb 7, 2020 at 1:56 PM Reuven Lax  wrote:
>
>> I disagree - I've had several cases where user options on fields are very
>> useful internally.
>>
>> A common rationale is to preserve fidelity. For instance, reading protos,
>> projecting out a few fields, writing protos back out. You want to be able
>> to nicely map protos to Beam schemas, but also preserve all the extra
>> metadata on proto fields. This metadata has to carry through on all
>> intermediate PCollections and schemas, so it makes sense to put it on the
>> field.
>>
>> Another example: it provides a nice extension point for annotation
>> extensions to a field. These are often things like Optional or Taint
>> markings that don't change the semantic interpretation of the type (so
>> shouldn't be a logical type), but do provide extra information about the
>> field.
>>
>> Reuven
>>
>> On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette  wrote:
>>
>>> I'm not sure this belongs directly on schemas. I've had trouble
>>> reconciling that opinion, since the idea does seem very useful, and in fact
>>> I'm interested in using it myself. I think I've figured out my concern -
>>> what I really want is options for a (maybe portable) Table.
>>>
>>> As I indicated in a comment in the doc [1] I still think all of the
>>> examples you've described only apply to IOs. To be clear, what I mean is
>>> that all of the examples either
>>> 1) modify the behavior of the external system the IO is interfacing with
>>> (specify partitioning, indexing, etc..), or
>>> 2) define some transformation that should be done to the data adjacent
>>> to the IO (after an Input or before an Output) in Beam
>>>
>>> (1) Is the sort of thing you described in the IO section [2] (aside from
>>> the PubSub example I added, since that's describing a transformation to do
>>> in Beam)
>>> I would argue that all of the other examples fall under (2) - data
>>> validation, adding computed columns, encryption, etc... are things that can
>>> be done in a transform
>>>
>>> I think we can make an analogy to a traditional database here:
>>> schema-aware Beam IOs are like Tables in a database, other PCollections are
>>> like intermediate results in a query. In a database, Tables can be defined
>>> with some DDL and have schema-level or column-level options that change
>>> system behavior, but intermediate results have no such capability.
>>>
>>>
>>> Another point I think is worth discussing: is there value in making
>>> these options portable?
>>> As it's currently defined I'm not sure there is - everything could be
>>> done within a single SDK. However, portable options on a portable table
>>> could be very powerful, since it could be used to configure cross-language
>>> IOs, perhaps with something like
>>> https://s.apache.org/xlang-table-provider/
>>>
>>> [1]
>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=I54si4k
>>> [2]
>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>>>
>>> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel  wrote:
>>>
 I would appreciate if someone would look at the following PR and get it
 to master:

 https://github.com/apache/beam/pull/10413#

 a lot of work needs to follow, but if we have the base already on
 master the next layers can follow. As a reminder, this is the base 
 proposal:

 https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing

 I've also looked for prior work, and saw that Spark actually has
 something comparable:

 https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html

 but when the options are finished it will be far more powerful as it is
 not limited on fields.

  _/
 _/ Alex Van Boxel


 On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles 

Re: Unable to run ParDoTests from CLI

2020-02-07 Thread Reuven Lax
FYI, this is documented here

https://cwiki.apache.org/confluence/display/BEAM/Contribution+Testing+Guide#ContributionTestingGuide-HowtorunJavaNeedsRunnertests


On Fri, Feb 7, 2020 at 6:29 AM Ismaël Mejía  wrote:

> Use
>
> ./gradlew :runners:direct-java:needsRunner --tests "*ParDoTest\$TimerTests"
>
> For ValidatesRunner for example:
> /gradlew :runners:direct-java:validatesRunner --tests
> "*ParDoTest\$TimerFamily*"
>
> Credit to Brian who helped me because I was struggling with the same issue
> last week.
>
>
> On Fri, Feb 7, 2020 at 3:19 PM Rehman Murad Ali <
> rehman.murad...@venturedive.com> wrote:
>
>> Hello Community,
>>
>> I have been trying to run test cases from CLI. ParDoTest.java has some
>> inner classes with test functions (for example TimerTest). This is the
>> command I have used to run the test:
>>
>> ./gradlew runners:direct-java:needsRunnerTests --tests
>> "org.apache.beam.sdk.transforms.ParDoTest$TimerTests"
>>
>> Here is the error message:
>> [image: image.png]
>>
>>
>> I need assistance regarding this matter.
>>
>>
>> *Thanks & Regards*
>>
>>
>>
>> *Rehman Murad Ali*
>> Software Engineer
>> Mobile: +92 3452076766 <+92%20345%202076766>
>> Skype: rehman.muradali
>>
>


Re: [PROPOSAL] Beam Schema Options

2020-02-07 Thread Kenneth Knowles
It is a good point that it applies to configuring sources and sinks mostly,
or external data more generally.

What I worry about is that metadata channels like this tend to take over
everything, and do it worse than more structured approach.

As an exaggerated example which is not actually that far-fetched, schemas
could have a single type: bytes. And then the options on the fields would
say how to encode/decode the bytes. Now you've created a system where the
options are the main event and are actually core to the system. My
expectation is that this will almost certainly happen unless we are
vigilant about preventing core features from landing as options.

Kenn

On Fri, Feb 7, 2020 at 1:56 PM Reuven Lax  wrote:

> I disagree - I've had several cases where user options on fields are very
> useful internally.
>
> A common rationale is to preserve fidelity. For instance, reading protos,
> projecting out a few fields, writing protos back out. You want to be able
> to nicely map protos to Beam schemas, but also preserve all the extra
> metadata on proto fields. This metadata has to carry through on all
> intermediate PCollections and schemas, so it makes sense to put it on the
> field.
>
> Another example: it provides a nice extension point for annotation
> extensions to a field. These are often things like Optional or Taint
> markings that don't change the semantic interpretation of the type (so
> shouldn't be a logical type), but do provide extra information about the
> field.
>
> Reuven
>
> On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette  wrote:
>
>> I'm not sure this belongs directly on schemas. I've had trouble
>> reconciling that opinion, since the idea does seem very useful, and in fact
>> I'm interested in using it myself. I think I've figured out my concern -
>> what I really want is options for a (maybe portable) Table.
>>
>> As I indicated in a comment in the doc [1] I still think all of the
>> examples you've described only apply to IOs. To be clear, what I mean is
>> that all of the examples either
>> 1) modify the behavior of the external system the IO is interfacing with
>> (specify partitioning, indexing, etc..), or
>> 2) define some transformation that should be done to the data adjacent to
>> the IO (after an Input or before an Output) in Beam
>>
>> (1) Is the sort of thing you described in the IO section [2] (aside from
>> the PubSub example I added, since that's describing a transformation to do
>> in Beam)
>> I would argue that all of the other examples fall under (2) - data
>> validation, adding computed columns, encryption, etc... are things that can
>> be done in a transform
>>
>> I think we can make an analogy to a traditional database here:
>> schema-aware Beam IOs are like Tables in a database, other PCollections are
>> like intermediate results in a query. In a database, Tables can be defined
>> with some DDL and have schema-level or column-level options that change
>> system behavior, but intermediate results have no such capability.
>>
>>
>> Another point I think is worth discussing: is there value in making these
>> options portable?
>> As it's currently defined I'm not sure there is - everything could be
>> done within a single SDK. However, portable options on a portable table
>> could be very powerful, since it could be used to configure cross-language
>> IOs, perhaps with something like
>> https://s.apache.org/xlang-table-provider/
>>
>> [1]
>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=I54si4k
>> [2]
>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>>
>> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel  wrote:
>>
>>> I would appreciate if someone would look at the following PR and get it
>>> to master:
>>>
>>> https://github.com/apache/beam/pull/10413#
>>>
>>> a lot of work needs to follow, but if we have the base already on master
>>> the next layers can follow. As a reminder, this is the base proposal:
>>>
>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>
>>> I've also looked for prior work, and saw that Spark actually has
>>> something comparable:
>>>
>>> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>>>
>>> but when the options are finished it will be far more powerful as it is
>>> not limited on fields.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles  wrote:
>>>
 Using schema types for the metadata values is a nice touch.

 Are the options expected to be common across many fields? Perhaps the
 name should be a URN to make it clear to be careful about collisions? (just
 a synonym for "name" in practice, but with different connotation)

 I generally like this... but the examples (all but one) are weird
 things that I don't really understand how they are done or who is
 responsible for them.

Re: [PROPOSAL] Beam Schema Options

2020-02-07 Thread Reuven Lax
I disagree - I've had several cases where user options on fields are very
useful internally.

A common rationale is to preserve fidelity. For instance, reading protos,
projecting out a few fields, writing protos back out. You want to be able
to nicely map protos to Beam schemas, but also preserve all the extra
metadata on proto fields. This metadata has to carry through on all
intermediate PCollections and schemas, so it makes sense to put it on the
field.

Another example: it provides a nice extension point for annotation
extensions to a field. These are often things like Optional or Taint
markings that don't change the semantic interpretation of the type (so
shouldn't be a logical type), but do provide extra information about the
field.

Reuven

On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette  wrote:

> I'm not sure this belongs directly on schemas. I've had trouble
> reconciling that opinion, since the idea does seem very useful, and in fact
> I'm interested in using it myself. I think I've figured out my concern -
> what I really want is options for a (maybe portable) Table.
>
> As I indicated in a comment in the doc [1] I still think all of the
> examples you've described only apply to IOs. To be clear, what I mean is
> that all of the examples either
> 1) modify the behavior of the external system the IO is interfacing with
> (specify partitioning, indexing, etc..), or
> 2) define some transformation that should be done to the data adjacent to
> the IO (after an Input or before an Output) in Beam
>
> (1) Is the sort of thing you described in the IO section [2] (aside from
> the PubSub example I added, since that's describing a transformation to do
> in Beam)
> I would argue that all of the other examples fall under (2) - data
> validation, adding computed columns, encryption, etc... are things that can
> be done in a transform
>
> I think we can make an analogy to a traditional database here:
> schema-aware Beam IOs are like Tables in a database, other PCollections are
> like intermediate results in a query. In a database, Tables can be defined
> with some DDL and have schema-level or column-level options that change
> system behavior, but intermediate results have no such capability.
>
>
> Another point I think is worth discussing: is there value in making these
> options portable?
> As it's currently defined I'm not sure there is - everything could be done
> within a single SDK. However, portable options on a portable table could be
> very powerful, since it could be used to configure cross-language IOs,
> perhaps with something like https://s.apache.org/xlang-table-provider/
>
> [1]
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=I54si4k
> [2]
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>
> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel  wrote:
>
>> I would appreciate if someone would look at the following PR and get it
>> to master:
>>
>> https://github.com/apache/beam/pull/10413#
>>
>> a lot of work needs to follow, but if we have the base already on master
>> the next layers can follow. As a reminder, this is the base proposal:
>>
>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>
>> I've also looked for prior work, and saw that Spark actually has
>> something comparable:
>>
>> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>>
>> but when the options are finished it will be far more powerful as it is
>> not limited on fields.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles  wrote:
>>
>>> Using schema types for the metadata values is a nice touch.
>>>
>>> Are the options expected to be common across many fields? Perhaps the
>>> name should be a URN to make it clear to be careful about collisions? (just
>>> a synonym for "name" in practice, but with different connotation)
>>>
>>> I generally like this... but the examples (all but one) are weird things
>>> that I don't really understand how they are done or who is responsible for
>>> them.
>>>
>>> One way to go is this: if options are maybe not understood by all
>>> consumers, then they can't really change behavior. Kind of like how URN and
>>> payload on a composite transform can be ignored and just the expansion used.
>>>
>>> Kenn
>>>
>>> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel  wrote:
>>>
 Hi everyone,

 I'm proud to announce my first real proposal. The proposal describes
 Beam Schema Options. This is an extension to the Schema API to add typed
 meta data to to Rows, Field and Logical Types:


 https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing

 To give you some context where this proposal comes from: We've been
 using dynamic meta driven pipelines for a while, but till now in an
 awkward and hacky way (see 

Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Jan Lukavský

I reviewed closely the runners ad it seems to me that:

 - all batch runners that would fail to support the annotation will 
fail already (spark structured streaming, apex) due to missing support 
for state or timers


 - streaming runners must explicitly enable this, _as long as they use 
StatefulDoFnRunner_, which is the case for apex, flink and samza


I will explicitly disable any pipeline with this annotation for:

 - dataflow, jet and gearpump (because I don't see usage of 
StatefulDoFnRunner, although I though there was one, that's my mistake)


 - all batch runners should either support the annotation or fail 
already (due to missing support for state or timers)


Does this proposal solve the issues you see?

Regarding the process of introducing this annotation I tried really hard 
to get to the best consensus I could. The same holds true for getting 
core people involved in the review process (explicitly mentioned in the 
PR, multiple mailing list threads). The PR was opened for discussion for 
more than half a year. But because I agree with you, I proposed the BIP, 
so that we can have a more explicit process for arriving at a consensus 
for features like this. I'd be happy though, if we can get to consensus 
about what to do now (if the steps I wrote above will solve every 
doubts) and have a deeper process for similar features for future cases. 
As I mentioned this feature is already implemented and having open PR 
into core for nearly a year is expensive to keep it in sync with master.


On 2/7/20 9:31 PM, Kenneth Knowles wrote:
TL;DR I am not suggesting that you must implement this for any runner. 
I'm afraid I do have to propose this change be rolled back before 
release 2.21.0 unless we fix this. I think the fix is easily achieved.


Clarifications inline.

On Fri, Feb 7, 2020 at 11:20 AM Jan Lukavský > wrote:


Hi Kenn,

I think that this approach is not well maintainable and doesn't
scale. Main reasons:

 a) modifying core has by definition some impact on runners, so
modifying core would imply necessity to modify all runners

My concern is not about all changes to "core" but only changes to the 
model, which should be extraordinarily rare. They must receive extreme 
scrutiny and require a very high level of consensus. It is true that 
every runner needs to either correctly execute or refuse to execute 
every pipeline, to the extent possible. For the case we are talking 
about it is very easy to meet this requirement.


 b) having to implement core feature for all existing runners will
make any modification to core prohibitively expensive

No one is suggesting this. I am saying that you need to write the 1 
line that I linked to "if (usesRequiresTimeSortedInput) then reject 
pipeline" so the runner fails before it begins processing data, 
potentially consuming non-replayable messages.


 c) even if we accept this, there can be runners that are outside
of beam repo (or even closed source!)

Indeed. And those runners need time to adapt to the new proto fields. 
I did not mention it this time, because the proto is not considered 
stable. But very soon it will be. At that point additions like this 
will have to be fully specified and added to the proto long before 
they are enabled for use. That way all runners can adjust. The proper 
order is (1) add model feature (2) make runners reject it, unsupported 
(3) add functionality to SDK (4) add to some runners and enable.


Therefore I think, that the correct and scalable approach would be
to split this into several pieces:

 1) define pipeline requirements (this is pretty much similar to
how we currently scope @Category(ValidatesRunner.class) tests

 2) let pipeline infer it's requirements prior to being translated
via runner

 3) runner can check the set of required features and their
support and reject the pipeline if some feature is missing

This is exactly what happens today, but was not included in your 
change. The pipeline proto (or the Java pipeline object) clearly 
contain all the needed information. Whether pipeline summarizes it or 
the runner implements a trivial PipelineVisitor is not important.


This could even replace the annotations used in validates runner
tests, because each runner would simply execute all tests it has
enough features to run.

What you have described is exactly what happens today.

But as I mentioned - this is pretty much deep change. I don't know
how to safely do this for current runners, but to actually
implement the feature (it seems to be to me nearly equally
complicated to fail pipeline in batch case and to actually
implement the sorting).

Indeed. This feature hasn't really got consensus. The proposal thread 
[1] never really concluded affirmatively [1]. The [VOTE] thread 
indicates a clear *lack* of consensus, with all people who weighed in 
asking to raise awareness and build 

Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Robert Bradshaw
There are two separable concerns here.

(1) The @RequiresTimeSortedInput feature itself. This is a subtle
feature needed for certain pipelines, and if anything Jan has gone the
extra mile discussing, documenting, and designing this and trying to
reach consensus. I feel like there has been a failure in the community
(myself included) to either fully accept it, or come up with sound
reasons to reject it, in a timely manner. (This is one of the things I
hope BEPs could address.) The feature seems similar in spirit to
@RequiresStableInputs which I also find a bit icky but can't think of
a way around. (My ideal implementation for both would be to express
this in terms of a naive implementation that could be swapped out by
more advanced runners...) That being said, I don't think we should
block on this forever.

(2) Especially as we're trying to stabilize the protos, how can one
safely add constraints like this such that runners will reject rather
than execute pipelines with unsatisfied constraints? For SDKs, we're
thinking about adding the notion of capabilities (as a list, or
possibly mapping, of URNs that get attached to an environment. Perhaps
a pipeline could likewise have a set of requirements for those "new"
features that augments what can be inferred by looking at the set of
transform URNs. In this case, @RequiresTimeSortedInput would be such a
requirement attached to any pipeline using this feature, and its
contract would be to look at (and respect) certain bits on the DoFns,
and a runner must reject any pipeline with unknown requirements. (If
it understands a requirement, it could reject it based on its ability
to satisfy the contract as it is actually used in the pipeline).

On Fri, Feb 7, 2020 at 12:31 PM Kenneth Knowles  wrote:
>
> TL;DR I am not suggesting that you must implement this for any runner. I'm 
> afraid I do have to propose this change be rolled back before release 2.21.0 
> unless we fix this. I think the fix is easily achieved.
>
> Clarifications inline.
>
> On Fri, Feb 7, 2020 at 11:20 AM Jan Lukavský  wrote:
>>
>> Hi Kenn,
>>
>> I think that this approach is not well maintainable and doesn't scale. Main 
>> reasons:
>>
>>  a) modifying core has by definition some impact on runners, so modifying 
>> core would imply necessity to modify all runners
>
> My concern is not about all changes to "core" but only changes to the model, 
> which should be extraordinarily rare. They must receive extreme scrutiny and 
> require a very high level of consensus. It is true that every runner needs to 
> either correctly execute or refuse to execute every pipeline, to the extent 
> possible. For the case we are talking about it is very easy to meet this 
> requirement.
>
>>  b) having to implement core feature for all existing runners will make any 
>> modification to core prohibitively expensive
>
> No one is suggesting this. I am saying that you need to write the 1 line that 
> I linked to "if (usesRequiresTimeSortedInput) then reject pipeline" so the 
> runner fails before it begins processing data, potentially consuming 
> non-replayable messages.
>
>>
>>  c) even if we accept this, there can be runners that are outside of beam 
>> repo (or even closed source!)
>
> Indeed. And those runners need time to adapt to the new proto fields. I did 
> not mention it this time, because the proto is not considered stable. But 
> very soon it will be. At that point additions like this will have to be fully 
> specified and added to the proto long before they are enabled for use. That 
> way all runners can adjust. The proper order is (1) add model feature (2) 
> make runners reject it, unsupported (3) add functionality to SDK (4) add to 
> some runners and enable.
>
>>
>> Therefore I think, that the correct and scalable approach would be to split 
>> this into several pieces:
>>
>>  1) define pipeline requirements (this is pretty much similar to how we 
>> currently scope @Category(ValidatesRunner.class) tests
>>
>>  2) let pipeline infer it's requirements prior to being translated via runner
>>
>>  3) runner can check the set of required features and their support and 
>> reject the pipeline if some feature is missing
>
> This is exactly what happens today, but was not included in your change. The 
> pipeline proto (or the Java pipeline object) clearly contain all the needed 
> information. Whether pipeline summarizes it or the runner implements a 
> trivial PipelineVisitor is not important.
>
>> This could even replace the annotations used in validates runner tests, 
>> because each runner would simply execute all tests it has enough features to 
>> run.
>
> What you have described is exactly what happens today.
>
>>
>> But as I mentioned - this is pretty much deep change. I don't know how to 
>> safely do this for current runners, but to actually implement the feature 
>> (it seems to be to me nearly equally complicated to fail pipeline in batch 
>> case and to actually implement the sorting).
>
> Indeed. 

Re: [DISCUSS] Autoformat python code with Black

2020-02-07 Thread Udi Meiri
Chad: yes. I also noticed that it's not running on the Jenkins lint
precommit job.

On Fri, Feb 7, 2020 at 12:59 PM David Yan  wrote:

> Thank you Robert.
>
> https://github.com/google/yapf/issues/530 has been open for 2 years, but
> we will use `yapf: disable` and `yapf: enable` as a workaround for now.
>
> David
>
> On Fri, Feb 7, 2020 at 12:29 PM Robert Bradshaw 
> wrote:
>
>> Yeah, that's a lot worse. This looks like
>> https://github.com/google/yapf/issues/530 . In the meantime,
>> https://pypi.org/project/yapf/#potentially-frequently-asked-questions
>>
>> On Fri, Feb 7, 2020 at 12:17 PM David Yan  wrote:
>> >
>> > Hi, I just tried out the yapf formatter and I noticed that sometimes
>> it's making the original code a lot less readable.
>> > In the below example, - is the original, + is after running the yapf
>> formatter. Looks like the problem is with the method chaining pattern.
>> > How feasible is it to have yapf identify such a pattern and format it
>> better?
>> > Before this can be fixed, Is it possible to have a directive in the
>> code comment to bypass yapf?
>> >
>> > Thanks!
>> >
>> > -test_stream = (TestStream()
>> > -   .advance_watermark_to(0)
>> > -   .add_elements(['a', 'b', 'c'])
>> > -   .advance_processing_time(1)
>> > -   .advance_processing_time(1)
>> > -   .advance_processing_time(1)
>> > -   .advance_processing_time(1)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(5)
>> > -   .add_elements(['1', '2', '3'])
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(6)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(7)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(8)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(9)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(10)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(11)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(12)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(13)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(14)
>> > -   .advance_processing_time(1)
>> > -   .advance_watermark_to(15)
>> > -   .advance_processing_time(1)
>> > -   )
>> > +test_stream = (
>> > +TestStream().advance_watermark_to(0).add_elements(
>> > +['a', 'b',
>> 'c']).advance_processing_time(1).advance_processing_time(
>> > +
>> 1).advance_processing_time(1).advance_processing_time(1).
>> > +
>> advance_processing_time(1).advance_watermark_to(5).add_elements(
>> > +['1', '2',
>> '3']).advance_processing_time(1).advance_watermark_to(
>> > +6).advance_processing_time(1).advance_watermark_to(
>> > +7).advance_processing_time(1).advance_watermark_to(
>> > +
>> 8).advance_processing_time(1).advance_watermark_to(9).
>> > +advance_processing_time(1).advance_watermark_to(
>> > +10).advance_processing_time(1).advance_watermark_to(
>> > +11).advance_processing_time(1).advance_watermark_to(
>> > +
>> 12).advance_processing_time(1).advance_watermark_to(
>> > +
>> 13).advance_processing_time(1).advance_watermark_to(
>> > +
>> 14).advance_processing_time(1).advance_watermark_to(
>> > +15).advance_processing_time(1))
>> >
>> > On Thu, Feb 6, 2020 at 1:50 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> Thanks!
>> >>
>> >> On Thu, Feb 6, 2020 at 1:29 PM Ismaël Mejía  wrote:
>> >>>
>> >>> Thanks Kamil and Michał for taking care of this.
>> >>> Excellent job!
>> >>>
>> >>> On Thu, Feb 6, 2020 at 1:45 PM Kamil Wasilewski <
>> kamil.wasilew...@polidea.com> wrote:
>> 
>>  Thanks to everyone involved in the discussion.
>> 
>>  I've taken a look at the first 50 recently updated Pull Requests.
>> Only few of them were affected. I hope it wouldn't be too hard to fix them.
>> 
>>  In any case, here you can find instructions on how to run formatter:
>> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips (section
>> "Formatting").
>> 
>>  On Thu, Feb 6, 2020 at 12:42 PM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>> >
>> > Hi,
>> > the PR is merged, all checks were green :)
>> > Enjoy prettier Python!
>> >
>> > On Thu, Feb 6, 2020 at 11:11 AM Ismaël Mejía 
>> wrote:
>> >>
>> >> Agree no need for vote for this because the consensus is clear and
>> the sole
>> >> impact I can think of are 

Re: [DISCUSS] Autoformat python code with Black

2020-02-07 Thread David Yan
Thank you Robert.

https://github.com/google/yapf/issues/530 has been open for 2 years, but we
will use `yapf: disable` and `yapf: enable` as a workaround for now.

David

On Fri, Feb 7, 2020 at 12:29 PM Robert Bradshaw  wrote:

> Yeah, that's a lot worse. This looks like
> https://github.com/google/yapf/issues/530 . In the meantime,
> https://pypi.org/project/yapf/#potentially-frequently-asked-questions
>
> On Fri, Feb 7, 2020 at 12:17 PM David Yan  wrote:
> >
> > Hi, I just tried out the yapf formatter and I noticed that sometimes
> it's making the original code a lot less readable.
> > In the below example, - is the original, + is after running the yapf
> formatter. Looks like the problem is with the method chaining pattern.
> > How feasible is it to have yapf identify such a pattern and format it
> better?
> > Before this can be fixed, Is it possible to have a directive in the code
> comment to bypass yapf?
> >
> > Thanks!
> >
> > -test_stream = (TestStream()
> > -   .advance_watermark_to(0)
> > -   .add_elements(['a', 'b', 'c'])
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(5)
> > -   .add_elements(['1', '2', '3'])
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(6)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(7)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(8)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(9)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(10)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(11)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(12)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(13)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(14)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(15)
> > -   .advance_processing_time(1)
> > -   )
> > +test_stream = (
> > +TestStream().advance_watermark_to(0).add_elements(
> > +['a', 'b',
> 'c']).advance_processing_time(1).advance_processing_time(
> > +
> 1).advance_processing_time(1).advance_processing_time(1).
> > +advance_processing_time(1).advance_watermark_to(5).add_elements(
> > +['1', '2',
> '3']).advance_processing_time(1).advance_watermark_to(
> > +6).advance_processing_time(1).advance_watermark_to(
> > +7).advance_processing_time(1).advance_watermark_to(
> > +
> 8).advance_processing_time(1).advance_watermark_to(9).
> > +advance_processing_time(1).advance_watermark_to(
> > +10).advance_processing_time(1).advance_watermark_to(
> > +11).advance_processing_time(1).advance_watermark_to(
> > +12).advance_processing_time(1).advance_watermark_to(
> > +
> 13).advance_processing_time(1).advance_watermark_to(
> > +
> 14).advance_processing_time(1).advance_watermark_to(
> > +15).advance_processing_time(1))
> >
> > On Thu, Feb 6, 2020 at 1:50 PM Robert Bradshaw 
> wrote:
> >>
> >> Thanks!
> >>
> >> On Thu, Feb 6, 2020 at 1:29 PM Ismaël Mejía  wrote:
> >>>
> >>> Thanks Kamil and Michał for taking care of this.
> >>> Excellent job!
> >>>
> >>> On Thu, Feb 6, 2020 at 1:45 PM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
> 
>  Thanks to everyone involved in the discussion.
> 
>  I've taken a look at the first 50 recently updated Pull Requests.
> Only few of them were affected. I hope it wouldn't be too hard to fix them.
> 
>  In any case, here you can find instructions on how to run formatter:
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips (section
> "Formatting").
> 
>  On Thu, Feb 6, 2020 at 12:42 PM Michał Walenia <
> michal.wale...@polidea.com> wrote:
> >
> > Hi,
> > the PR is merged, all checks were green :)
> > Enjoy prettier Python!
> >
> > On Thu, Feb 6, 2020 at 11:11 AM Ismaël Mejía 
> wrote:
> >>
> >> Agree no need for vote for this because the consensus is clear and
> the sole
> >> impact I can think of are pending PRs that will be broken. In the
> Java case
> >> what we did was to just notice every PR that was affected by the
> change.
> >> And clearly document how to validate and autoformat the code.
> >>
> >> So the 

Re: Compile error on Java 11 when running :examples:java:test

2020-02-07 Thread Kenneth Knowles
The expected class file version 53 is for Java 9, I believe. So is the
right javac being invoked?

I hit some issues like this on mac a while back, unrelated to Java 11.
Suspected something wonky in Mac's Java setup not working well with the
Gradle wrapper. Never resolved them actually. Have been working on linux
lately.

Kenn

On Fri, Feb 7, 2020 at 11:32 AM Jean-Baptiste Onofré 
wrote:

> Hi
>
> No, jdk 11 is not yet fully supported.
>
> I've started to work on it but it's not yet ready.
>
> Regards
> JB
>
> On Fri, Feb 7, 2020 at 20:20 David Cavazos  wrote:
>
>
>> Hi Beamers,
>>
>> I'm trying to run the tests for the Java examples using Java 11 and there
>> is a compilation error due to an incompatible version.
>>
>> I'm using the latest version of master.
>>
>> [image: image.png]
>>
>> If I downgrade to Java 8, it works. But isn't Java 11 supported?
>>
>> Thanks!
>>
>


Re: [DISCUSS] Autoformat python code with Black

2020-02-07 Thread Chad Dombrova
I have a PR I'm working on to allow users to easily set up yapf to run on
pre-commit. Is that something that interests people?
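
To sketch the idea (a hypothetical standalone git hook, not the actual PR -
the PR itself may work differently):

#!/usr/bin/env python
# Hypothetical pre-commit hook sketch: run yapf in place on the staged
# Python files, then re-stage them so the formatted version is committed.
import subprocess

staged = subprocess.check_output(
    ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM']
).decode().splitlines()
py_files = [f for f in staged if f.endswith('.py')]
if py_files:
    subprocess.check_call(['yapf', '--in-place'] + py_files)
    subprocess.check_call(['git', 'add'] + py_files)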

-chad



On Fri, Feb 7, 2020 at 12:29 PM Robert Bradshaw  wrote:

> Yeah, that's a lot worse. This looks like
> https://github.com/google/yapf/issues/530 . In the meantime,
> https://pypi.org/project/yapf/#potentially-frequently-asked-questions
>
> On Fri, Feb 7, 2020 at 12:17 PM David Yan  wrote:
> >
> > Hi, I just tried out the yapf formatter and I noticed that sometimes
> it's making the original code a lot less readable.
> > In the below example, - is the original, + is after running the yapf
> formatter. Looks like the problem is with the method chaining pattern.
> > How feasible is it to have yapf identify such a pattern and format it
> better?
> > Before this can be fixed, Is it possible to have a directive in the code
> comment to bypass yapf?
> >
> > Thanks!
> >
> > -test_stream = (TestStream()
> > -   .advance_watermark_to(0)
> > -   .add_elements(['a', 'b', 'c'])
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(5)
> > -   .add_elements(['1', '2', '3'])
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(6)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(7)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(8)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(9)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(10)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(11)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(12)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(13)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(14)
> > -   .advance_processing_time(1)
> > -   .advance_watermark_to(15)
> > -   .advance_processing_time(1)
> > -   )
> > +test_stream = (
> > +TestStream().advance_watermark_to(0).add_elements(
> > +['a', 'b',
> 'c']).advance_processing_time(1).advance_processing_time(
> > +
> 1).advance_processing_time(1).advance_processing_time(1).
> > +advance_processing_time(1).advance_watermark_to(5).add_elements(
> > +['1', '2',
> '3']).advance_processing_time(1).advance_watermark_to(
> > +6).advance_processing_time(1).advance_watermark_to(
> > +7).advance_processing_time(1).advance_watermark_to(
> > +
> 8).advance_processing_time(1).advance_watermark_to(9).
> > +advance_processing_time(1).advance_watermark_to(
> > +10).advance_processing_time(1).advance_watermark_to(
> > +11).advance_processing_time(1).advance_watermark_to(
> > +12).advance_processing_time(1).advance_watermark_to(
> > +
> 13).advance_processing_time(1).advance_watermark_to(
> > +
> 14).advance_processing_time(1).advance_watermark_to(
> > +15).advance_processing_time(1))
> >
> > On Thu, Feb 6, 2020 at 1:50 PM Robert Bradshaw 
> wrote:
> >>
> >> Thanks!
> >>
> >> On Thu, Feb 6, 2020 at 1:29 PM Ismaël Mejía  wrote:
> >>>
> >>> Thanks Kamil and Michał for taking care of this.
> >>> Excellent job!
> >>>
> >>> On Thu, Feb 6, 2020 at 1:45 PM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
> 
>  Thanks to everyone involved in the discussion.
> 
>  I've taken a look at the first 50 recently updated Pull Requests.
> Only few of them were affected. I hope it wouldn't be too hard to fix them.
> 
>  In any case, here you can find instructions on how to run formatter:
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips (section
> "Formatting").
> 
>  On Thu, Feb 6, 2020 at 12:42 PM Michał Walenia <
> michal.wale...@polidea.com> wrote:
> >
> > Hi,
> > the PR is merged, all checks were green :)
> > Enjoy prettier Python!
> >
> > On Thu, Feb 6, 2020 at 11:11 AM Ismaël Mejía 
> wrote:
> >>
> >> Agree no need for vote for this because the consensus is clear and
> the sole
> >> impact I can think of are pending PRs that will be broken. In the
> Java case
> >> what we did was to just notice every PR that was affected by the
> change.
> >> And clearly document how to validate and autoformat the code.
> >>
> >> So the earlier the better, go go autoformat!
> 

Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Kenneth Knowles
TL;DR I am not suggesting that you must implement this for any runner. I'm
afraid I do have to propose this change be rolled back before release
2.21.0 unless we fix this. I think the fix is easily achieved.

Clarifications inline.

On Fri, Feb 7, 2020 at 11:20 AM Jan Lukavský  wrote:

> Hi Kenn,
>
> I think that this approach is not easily maintainable and doesn't scale.
> Main reasons:
>
>  a) modifying core has by definition some impact on runners, so modifying
> core would imply necessity to modify all runners
>
My concern is not about all changes to "core" but only changes to the
model, which should be extraordinarily rare. They must receive extreme
scrutiny and require a very high level of consensus. It is true that every
runner needs to either correctly execute or refuse to execute every
pipeline, to the extent possible. For the case we are talking about it is
very easy to meet this requirement.

>  b) having to implement a core feature for all existing runners will make
> any modification to core prohibitively expensive
>
No one is suggesting this. I am saying that you need to write the 1 line
that I linked to "if (usesRequiresTimeSortedInput) then reject pipeline" so
the runner fails before it begins processing data, potentially consuming
non-replayable messages.


>  c) even if we accept this, there can be runners that are outside of beam
> repo (or even closed source!)
>
Indeed. And those runners need time to adapt to the new proto fields. I did
not mention it this time, because the proto is not considered stable. But
very soon it will be. At that point additions like this will have to be
fully specified and added to the proto long before they are enabled for
use. That way all runners can adjust. The proper order is (1) add model
feature (2) make runners reject it, unsupported (3) add functionality to
SDK (4) add to some runners and enable.


> Therefore I think that the correct and scalable approach would be to
> split this into several pieces:
>
>  1) define pipeline requirements (this is similar to how we
> currently scope @Category(ValidatesRunner.class) tests)
>
>  2) let the pipeline infer its requirements prior to being translated by
> the runner
>
>  3) the runner can check the set of required features against its
> supported features and reject the pipeline if some feature is missing
>
This is exactly what happens today, but it was not included in your change.
The pipeline proto (or the Java pipeline object) clearly contains all the
needed information. Whether the pipeline summarizes it or the runner
implements a trivial PipelineVisitor is not important.
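
For concreteness, a minimal sketch of such a visitor in the Python SDK's
terms (the feature marker here is hypothetical, purely for illustration):

from apache_beam.pipeline import PipelineVisitor

class RejectUnsupportedFeatures(PipelineVisitor):
  """Sketch: walk the pipeline before execution and fail fast on any
  transform that uses a feature this runner cannot execute."""

  def visit_transform(self, applied_ptransform):
    # 'requires_time_sorted_input' is a made-up marker attribute here.
    if getattr(applied_ptransform.transform,
               'requires_time_sorted_input', False):
      raise NotImplementedError(
          'This runner does not support @RequiresTimeSortedInput: %s' %
          applied_ptransform.full_label)

# Usage sketch: pipeline.visit(RejectUnsupportedFeatures())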

> This could even replace the annotations used in validates runner tests,
> because each runner would simply execute all tests it has enough features
> to run.
>
What you have described is exactly what happens today.


> But as I mentioned - this is a pretty deep change. I don't know how to
> safely do this for current runners other than to actually implement the
> feature (it seems to me nearly equally complicated to fail the pipeline in
> the batch case and to actually implement the sorting).
>
Indeed. This feature hasn't really got consensus. The proposal thread
never really concluded affirmatively [1]. The [VOTE] thread [2] indicates a
clear *lack* of consensus, with all people who weighed in asking to raise
awareness and build more support and consensus. Robert made the good point
that if it is (a) useful and (b) not easy for users to do themselves, then
we should consider it, even if most people here are not interested in the
feature. So that is the closest thing to approval that this feature has.
But getting more people interested and on board would get better feedback
and achieve a better result for our users.

And as a final note, the PR was not reviewed by the core people who built
out state & timers, nor those who built out DoFn annotation systems, nor
any runner author, nor those working on the Beam model protos. You really
should have gotten most of these people involved. They would likely have
caught the issues described here.

The specific action that I am proposing is to implement the one-liner
described above in all runners. It might be best to roll back and proceed
with steps 1-4 I outlined above, so we can be sure things are proceeding well.

Kenn

[1]
https://lists.apache.org/thread.html/b91f96121d37bf16403acbd88bc264cf16e40ddb636f0435276e89aa%40%3Cdev.beam.apache.org%3E
[2]
https://lists.apache.org/thread.html/91b87940ba7736f9f1021928271a0090f8a0096e5e3f9e52de89acf2%40%3Cdev.beam.apache.org%3E

> It would be super cool if anyone would be interested in implementing this
> in runners that don't currently support it. A side note - currently the
> annotation is not supported by all streaming runners due to missing
> guarantees for timer ordering (which can lead to data loss). I think I
> have found a solution to this, see [1], but I'd like to be 100% sure,
> before enabling the support (I'm not sure what is the impact of mis-ordered
> timers on output 

Re: [DISCUSS] Autoformat python code with Black

2020-02-07 Thread Robert Bradshaw
Yeah, that's a lot worse. This looks like
https://github.com/google/yapf/issues/530 . In the meantime,
https://pypi.org/project/yapf/#potentially-frequently-asked-questions

On Fri, Feb 7, 2020 at 12:17 PM David Yan  wrote:
>
> Hi, I just tried out the yapf formatter and I noticed that sometimes it's 
> making the original code a lot less readable.
> In the below example, - is the original, + is after running the yapf 
> formatter. Looks like the problem is with the method chaining pattern.
> How feasible is it to have yapf identify such a pattern and format it better?
> Before this can be fixed, Is it possible to have a directive in the code 
> comment to bypass yapf?
>
> Thanks!
>
> -test_stream = (TestStream()
> -   .advance_watermark_to(0)
> -   .add_elements(['a', 'b', 'c'])
> -   .advance_processing_time(1)
> -   .advance_processing_time(1)
> -   .advance_processing_time(1)
> -   .advance_processing_time(1)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(5)
> -   .add_elements(['1', '2', '3'])
> -   .advance_processing_time(1)
> -   .advance_watermark_to(6)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(7)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(8)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(9)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(10)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(11)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(12)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(13)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(14)
> -   .advance_processing_time(1)
> -   .advance_watermark_to(15)
> -   .advance_processing_time(1)
> -   )
> +test_stream = (
> +TestStream().advance_watermark_to(0).add_elements(
> +['a', 'b', 'c']).advance_processing_time(1).advance_processing_time(
> +1).advance_processing_time(1).advance_processing_time(1).
> +advance_processing_time(1).advance_watermark_to(5).add_elements(
> +['1', '2', '3']).advance_processing_time(1).advance_watermark_to(
> +6).advance_processing_time(1).advance_watermark_to(
> +7).advance_processing_time(1).advance_watermark_to(
> +8).advance_processing_time(1).advance_watermark_to(9).
> +advance_processing_time(1).advance_watermark_to(
> +10).advance_processing_time(1).advance_watermark_to(
> +11).advance_processing_time(1).advance_watermark_to(
> +12).advance_processing_time(1).advance_watermark_to(
> +13).advance_processing_time(1).advance_watermark_to(
> +14).advance_processing_time(1).advance_watermark_to(
> +15).advance_processing_time(1))
>
> On Thu, Feb 6, 2020 at 1:50 PM Robert Bradshaw  wrote:
>>
>> Thanks!
>>
>> On Thu, Feb 6, 2020 at 1:29 PM Ismaël Mejía  wrote:
>>>
>>> Thanks Kamil and Michał for taking care of this.
>>> Excellent job!
>>>
>>> On Thu, Feb 6, 2020 at 1:45 PM Kamil Wasilewski 
>>>  wrote:

 Thanks to everyone involved in the discussion.

 I've taken a look at the first 50 recently updated Pull Requests. Only few 
 of them were affected. I hope it wouldn't be too hard to fix them.

 In any case, here you can find instructions on how to run formatter: 
 https://cwiki.apache.org/confluence/display/BEAM/Python+Tips (section 
 "Formatting").

 On Thu, Feb 6, 2020 at 12:42 PM Michał Walenia 
  wrote:
>
> Hi,
> the PR is merged, all checks were green :)
> Enjoy prettier Python!
>
> On Thu, Feb 6, 2020 at 11:11 AM Ismaël Mejía  wrote:
>>
>> Agree no need for vote for this because the consensus is clear and the 
>> sole
>> impact I can think of are pending PRs that will be broken. In the Java 
>> case
>> what we did was to just notice every PR that was affected by the change.
>> And clearly document how to validate and autoformat the code.
>>
>> So the earlier the better, go go autoformat!
>>
>> On Thu, Feb 6, 2020 at 1:38 AM Robert Bradshaw  
>> wrote:
>>>
>>> No, perhaps not. I agree there's consensus, just wondering what the
>>> next steps should be to get this in. (The presubmits look like they're
>>> all passing, with the exception of some breakage in java that should
>>> be 

Re: [DISCUSS] Autoformat python code with Black

2020-02-07 Thread David Yan
Hi, I just tried out the yapf formatter and I noticed that sometimes it's
making the original code a lot less readable.
In the below example, - is the original, + is after running the yapf
formatter. Looks like the problem is with the method chaining pattern.
How feasible is it to have yapf identify such a pattern and format it
better?
Before this can be fixed, is it possible to have a directive in the code
comment to bypass yapf?

Thanks!

-test_stream = (TestStream()
-   .advance_watermark_to(0)
-   .add_elements(['a', 'b', 'c'])
-   .advance_processing_time(1)
-   .advance_processing_time(1)
-   .advance_processing_time(1)
-   .advance_processing_time(1)
-   .advance_processing_time(1)
-   .advance_watermark_to(5)
-   .add_elements(['1', '2', '3'])
-   .advance_processing_time(1)
-   .advance_watermark_to(6)
-   .advance_processing_time(1)
-   .advance_watermark_to(7)
-   .advance_processing_time(1)
-   .advance_watermark_to(8)
-   .advance_processing_time(1)
-   .advance_watermark_to(9)
-   .advance_processing_time(1)
-   .advance_watermark_to(10)
-   .advance_processing_time(1)
-   .advance_watermark_to(11)
-   .advance_processing_time(1)
-   .advance_watermark_to(12)
-   .advance_processing_time(1)
-   .advance_watermark_to(13)
-   .advance_processing_time(1)
-   .advance_watermark_to(14)
-   .advance_processing_time(1)
-   .advance_watermark_to(15)
-   .advance_processing_time(1)
-   )
+test_stream = (
+TestStream().advance_watermark_to(0).add_elements(
+['a', 'b', 'c']).advance_processing_time(1).advance_processing_time(
+1).advance_processing_time(1).advance_processing_time(1).
+advance_processing_time(1).advance_watermark_to(5).add_elements(
+['1', '2', '3']).advance_processing_time(1).advance_watermark_to(
+6).advance_processing_time(1).advance_watermark_to(
+7).advance_processing_time(1).advance_watermark_to(
+8).advance_processing_time(1).advance_watermark_to(9).
+advance_processing_time(1).advance_watermark_to(
+10).advance_processing_time(1).advance_watermark_to(
+11).advance_processing_time(1).advance_watermark_to(
+12).advance_processing_time(1).advance_watermark_to(
+13).advance_processing_time(1).advance_watermark_to(
+14).advance_processing_time(1).advance_watermark_to(
+15).advance_processing_time(1))

On Thu, Feb 6, 2020 at 1:50 PM Robert Bradshaw  wrote:

> Thanks!
>
> On Thu, Feb 6, 2020 at 1:29 PM Ismaël Mejía  wrote:
>
>> Thanks Kamil and Michał for taking care of this.
>> Excellent job!
>>
>> On Thu, Feb 6, 2020 at 1:45 PM Kamil Wasilewski <
>> kamil.wasilew...@polidea.com> wrote:
>>
>>> Thanks to everyone involved in the discussion.
>>>
>>> I've taken a look at the first 50 recently updated Pull Requests. Only
>>> few of them were affected. I hope it wouldn't be too hard to fix them.
>>>
>>> In any case, here you can find instructions on how to run formatter:
>>> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips (section
>>> "Formatting").
>>>
>>> On Thu, Feb 6, 2020 at 12:42 PM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Hi,
 the PR is merged, all checks were green :)
 Enjoy prettier Python!

 On Thu, Feb 6, 2020 at 11:11 AM Ismaël Mejía  wrote:

> Agree no need for vote for this because the consensus is clear and the
> sole
> impact I can think of are pending PRs that will be broken. In the Java
> case
> what we did was to just notice every PR that was affected by the
> change.
> And clearly document how to validate and autoformat the code.
>
> So the earlier the better, go go autoformat!
>
> On Thu, Feb 6, 2020 at 1:38 AM Robert Bradshaw 
> wrote:
>
>> No, perhaps not. I agree there's consensus, just wondering what the
>> next steps should be to get this in. (The presubmits look like they're
>> all passing, with the exception of some breakage in java that should
>> be completely unrelated. Of course there's already merge conflicts...)
>>
>> On Wed, Feb 5, 2020 at 3:55 PM Ahmet Altay  wrote:
>> >
>> > Do we need a formal vote? There is consensus on this thread and on
>> the PR.
>> >
>> > On Wed, Feb 5, 2020 at 3:37 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> The PR is looking good. Should we call a vote?
>> >>
>> >> On Mon, Jan 27, 2020 

Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Jan Lukavský

And as a quick summary, a pipeline with @RequiresTimeSortedInput will:

 a) work well on streaming pipelines run on direct java and
non-portable flink, and fail on every other streaming runner


 b) work well on batch non-portable flink, legacy spark and batch dataflow

 c) from what I can tell, the code path seems to be supported for batch
portable flink as well (if the portable batch flink runner uses mostly the
same code path, which it seems it should)


If I'm not mistaken, the only actual issues seem to be with the new
spark runner and with all remaining streaming runners (dataflow,
portable flink, jet, samza). The streaming case can probably be solved
easily in one shot (as mentioned in the previous email).


Jan

On 2/7/20 8:20 PM, Jan Lukavský wrote:


Hi Kenn,

I think that this approach is not easily maintainable and doesn't scale.
Main reasons:


 a) modifying core has by definition some impact on runners, so 
modifying core would imply necessity to modify all runners


 b) having to implement a core feature for all existing runners will
make any modification to core prohibitively expensive


 c) even if we accept this, there can be runners that are outside of 
beam repo (or even closed source!)


Therefore I think that the correct and scalable approach would be to
split this into several pieces:


 1) define pipeline requirements (this is similar to how
we currently scope @Category(ValidatesRunner.class) tests)


 2) let the pipeline infer its requirements prior to being translated by
the runner


 3) the runner can check the set of required features against its
supported features and reject the pipeline if some feature is missing


This could even replace the annotations used in validates runner 
tests, because each runner would simply execute all tests it has 
enough features to run.


But as I mentioned - this is a pretty deep change. I don't know how
to safely do this for current runners other than to actually implement the
feature (it seems to me nearly equally complicated to fail the
pipeline in the batch case and to actually implement the sorting). It
would be super cool if anyone would be interested in implementing this
in runners that don't currently support it. A side note - currently
the annotation is not supported by all streaming runners due to
missing guarantees for timer ordering (which can lead to data loss).
I think I have found a solution to this, see [1], but I'd like to be
100% sure before enabling the support (I'm not sure what the
impact of mis-ordered timers on output timestamps is, and so on, and so
forth).


Jan

[1] 
https://github.com/apache/beam/pull/10795/files#diff-11a02ba72f437b89e35f7ad37102dfd1R209


On 2/7/20 7:53 PM, Kenneth Knowles wrote:
I see. It is good to see that the pipeline will at least fail. 
However, the expected approach here is that the pipeline is rejected
prior to execution. That is a primary reason for our 
annotation-driven API style; it allows much better "static" analysis 
by a runner, so we don't have to wait and fail late. Here is an 
example: 
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1940 



Kenn

On Thu, Feb 6, 2020 at 11:03 PM Jan Lukavský  wrote:


Hi Kenn,

> that should not be the case. Care was taken to fail a streaming
> pipeline which needs this ability when the runner doesn't support
> it [1]. It is true, however, that a batch pipeline will not
fail, because there is no generic (runner agnostic) way of
supporting this transform in batch case (which is why the
annotation was needed). Failing batch pipelines in this case
would mean runners have to understand this annotation, which is
pretty much close to implementing this feature as a whole.

This applies generally to any core functionality, it might take
some time before runners fully support this. I don't know how to
solve it, maybe add record to capability matrix? I can imagine a
fully generic solution (runners might publish their capabilities
and pipeline might be validated against these capabilities at
pipeline build time), but that is obviously out of scope of the
annotation.

Jan

[1]

https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/DoFnRunners.java#L150

On 2/7/20 1:01 AM, Kenneth Knowles wrote:

There is a major problem with this merge: the runners that do
not support it do not reject pipelines that need this feature.
They will silently produce the wrong answer, causing data loss.

Kenn

On Thu, Feb 6, 2020 at 3:24 AM Jan Lukavský  wrote:

Hi,

the PR was merged to master and a few follow-up issues, were
created,
mainly [1] and [2]. I didn't find any reference to
SortedMapState in
JIRA, is there any tracking issue for that 

Re: Compile error on Java 11 when running :examples:java:test

2020-02-07 Thread Jean-Baptiste Onofré
Hi,

No, jdk 11 is not yet fully supported.

I've started to work on it but it's not yet ready.

Regards
JB

On Fri, Feb 7, 2020 at 20:20 David Cavazos wrote:

> Hi Beamers,
>
> I'm trying to run the tests for the Java examples using Java 11 and there
> is a compilation error due to an incompatible version.
>
> I'm using the latest version of master.
>
> If I downgrade to Java 8, it works. But isn't Java 11 supported?
>
> Thanks!

Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Jan Lukavský

Hi Kenn,

I think that this approach is not easily maintainable and doesn't scale.
Main reasons:


 a) modifying core has by definition some impact on runners, so 
modifying core would imply necessity to modify all runners


 b) having to implement a core feature for all existing runners will make
any modification to core prohibitively expensive


 c) even if we accept this, there can be runners that are outside of 
beam repo (or even closed source!)


Therefore I think that the correct and scalable approach would be to
split this into several pieces:


 1) define pipeline requirements (this is similar to how we
currently scope @Category(ValidatesRunner.class) tests)


 2) let the pipeline infer its requirements prior to being translated by
the runner


 3) the runner can check the set of required features against its
supported features and reject the pipeline if some feature is missing


This could even replace the annotations used in validates runner tests, 
because each runner would simply execute all tests it has enough 
features to run.
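
A minimal sketch of what step 3) could look like (pure illustration in
Python; the feature names are hypothetical):

# The runner declares what it supports; the pipeline reports what it needs.
SUPPORTED_FEATURES = {'stateful_processing', 'event_time_timers'}

def validate(required_features):
  missing = set(required_features) - SUPPORTED_FEATURES
  if missing:
    raise ValueError(
        'Pipeline requires unsupported features: %s' % sorted(missing))

# validate({'stateful_processing', 'requires_time_sorted_input'})
# -> ValueError: Pipeline requires unsupported features:
#    ['requires_time_sorted_input']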


But as I mentioned - this is a pretty deep change. I don't know how
to safely do this for current runners other than to actually implement the
feature (it seems to me nearly equally complicated to fail the pipeline
in the batch case and to actually implement the sorting). It would
be super cool if anyone would be interested in implementing this in
runners that don't currently support it. A side note - currently the
annotation is not supported by all streaming runners due to missing
guarantees for timer ordering (which can lead to data loss). I think I
have found a solution to this, see [1], but I'd like to be 100% sure
before enabling the support (I'm not sure what the impact of
mis-ordered timers on output timestamps is, and so on, and so forth).


Jan

[1] 
https://github.com/apache/beam/pull/10795/files#diff-11a02ba72f437b89e35f7ad37102dfd1R209


On 2/7/20 7:53 PM, Kenneth Knowles wrote:
I see. It is good to see that the pipeline will at least fail. 
However, the expected approach here is that the pipeline is rejected
prior to execution. That is a primary reason for our annotation-driven 
API style; it allows much better "static" analysis by a runner, so we 
don't have to wait and fail late. Here is an example: 
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1940 



Kenn

On Thu, Feb 6, 2020 at 11:03 PM Jan Lukavský  wrote:


Hi Kenn,

> that should not be the case. Care was taken to fail a streaming
> pipeline which needs this ability when the runner doesn't support
> it [1]. It is true, however, that a batch pipeline will not
fail, because there is no generic (runner agnostic) way of
supporting this transform in batch case (which is why the
annotation was needed). Failing batch pipelines in this case would
mean runners have to understand this annotation, which is pretty
much close to implementing this feature as a whole.

This applies generally to any core functionality, it might take
some time before runners fully support this. I don't know how to
solve it, maybe add record to capability matrix? I can imagine a
fully generic solution (runners might publish their capabilities
and pipeline might be validated against these capabilities at
pipeline build time), but that is obviously out of scope of the
annotation.

Jan

[1]

https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/DoFnRunners.java#L150

On 2/7/20 1:01 AM, Kenneth Knowles wrote:

There is a major problem with this merge: the runners that do not
support it do not reject pipelines that need this feature. They
will silently produce the wrong answer, causing data loss.

Kenn

On Thu, Feb 6, 2020 at 3:24 AM Jan Lukavský  wrote:

Hi,

the PR was merged to master and a few follow-up issues, were
created,
mainly [1] and [2]. I didn't find any reference to
SortedMapState in
JIRA, is there any tracking issue for that that I can link
to? I also
added link to design document here [3].

[1] https://issues.apache.org/jira/browse/BEAM-9256

[2] https://issues.apache.org/jira/browse/BEAM-9257

[3]
https://cwiki.apache.org/confluence/display/BEAM/Design+Documents

On 1/30/20 1:39 PM, Jan Lukavský wrote:
> Hi,
>
> PR [1] (issue [2]) went though code review, and according
to [3] seems
> to me to be ready for merge. Current state of the
implementation is
> that it is supported only for direct runner, legacy flink
runner
> (batch and streaming) and legacy spark (batch). It could be
supported
> by all other (streaming) runners using 

Re: Unable to run ParDoTests from CLI

2020-02-07 Thread Rehman Murad Ali
Thanks, Ismaël.



*Rehman Murad Ali*
Software Engineer
Mobile: +92 3452076766
Skype: rehman.muradali


On Fri, Feb 7, 2020 at 7:29 PM Ismaël Mejía  wrote:

> Use
>
> ./gradlew :runners:direct-java:needsRunner --tests "*ParDoTest\$TimerTests"
>
> For ValidatesRunner for example:
> ./gradlew :runners:direct-java:validatesRunner --tests
> "*ParDoTest\$TimerFamily*"
>
> Credit to Brian who helped me because I was struggling with the same issue
> last week.
>
>
> On Fri, Feb 7, 2020 at 3:19 PM Rehman Murad Ali <
> rehman.murad...@venturedive.com> wrote:
>
>> Hello Community,
>>
>> I have been trying to run test cases from CLI. ParDoTest.java has some
>> inner classes with test functions (for example TimerTest). This is the
>> command I have used to run the test:
>>
>> ./gradlew runners:direct-java:needsRunnerTests --tests
>> "org.apache.beam.sdk.transforms.ParDoTest$TimerTests"
>>
>> Here is the error message:
>> [image: image.png]
>>
>>
>> I need assistance regarding this matter.
>>
>>
>> *Thanks & Regards*
>>
>>
>>
>> *Rehman Murad Ali*
>> Software Engineer
>> Mobile: +92 3452076766
>> Skype: rehman.muradali
>>
>


Re: [BEAM-8550] @RequiresTimeSortedInput ready for merge to master

2020-02-07 Thread Kenneth Knowles
I see. It is good to see that the pipeline will at least fail. However, the
expected approach here is that the pipeline is rejected prior to execution.
That is a primary reason for our annotation-driven API style; it allows
much better "static" analysis by a runner, so we don't have to wait and
fail late. Here is an example:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1940

Kenn

On Thu, Feb 6, 2020 at 11:03 PM Jan Lukavský  wrote:

> Hi Kenn,
>
> that should not be the case. Care was taken to fail a streaming pipeline
> which needs this ability when the runner doesn't support it [1]. It is
> true, however, that a batch pipeline will not fail, because there is no
> generic (runner agnostic) way of supporting this transform in batch case
> (which is why the annotation was needed). Failing batch pipelines in this
> case would mean runners have to understand this annotation, which is pretty
> much close to implementing this feature as a whole.
>
> This applies generally to any core functionality, it might take some time
> before runners fully support this. I don't know how to solve it, maybe add
> record to capability matrix? I can imagine a fully generic solution
> (runners might publish their capabilities and pipeline might be validated
> against these capabilities at pipeline build time), but that is obviously
> out of scope of the annotation.
>
> Jan
>
> [1]
> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/DoFnRunners.java#L150
> On 2/7/20 1:01 AM, Kenneth Knowles wrote:
>
> There is a major problem with this merge: the runners that do not support
> it do not reject pipelines that need this feature. They will silently
> produce the wrong answer, causing data loss.
>
> Kenn
>
> On Thu, Feb 6, 2020 at 3:24 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> the PR was merged to master and a few follow-up issues, were created,
>> mainly [1] and [2]. I didn't find any reference to SortedMapState in
>> JIRA, is there any tracking issue for that that I can link to? I also
>> added link to design document here [3].
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-9256
>>
>> [2] https://issues.apache.org/jira/browse/BEAM-9257
>>
>> [3] https://cwiki.apache.org/confluence/display/BEAM/Design+Documents
>>
>> On 1/30/20 1:39 PM, Jan Lukavský wrote:
>> > Hi,
>> >
>> > PR [1] (issue [2]) went though code review, and according to [3] seems
>> > to me to be ready for merge. Current state of the implementation is
>> > that it is supported only for direct runner, legacy flink runner
>> > (batch and streaming) and legacy spark (batch). It could be supported
>> > by all other (streaming) runners using StatefulDoFnRunner, provided
>> > the runner can make guarantees about ordering of timer firings (which
>> > is unfortunately the case only for legacy flink and direct runner, at
>> > least for now - related issues are mentioned multiple times on other
>> > threads). Implementation for other batch runners should be as
>> > straightforward as adding sorting by event timestamp before stateful
>> > dofn (in case where the runner doesn't sort already - e.g. Dataflow -
>> > in which case the annotation can be simply ignored - hence support for
>> > batch Dataflow seems to be a no-op).
>> >
>> > There has been some slight controversy about this feature, but current
>> > feature proposing and implementing guidelines do not cover how to
>> > resolve those, so I'm using this opportunity to let the community
>> > know, that there is a plan to merge this feature, unless there is some
>> > veto (please provide specific reasons for that in that case). The plan
>> > is to merge this in the second part of next week, unless there is a
>> veto.
>> >
>> > Thanks,
>> >
>> >  Jan
>> >
>> > [1] https://github.com/apache/beam/pull/8774
>> >
>> > [2] https://issues.apache.org/jira/browse/BEAM-8550
>> >
>> > [3] https://beam.apache.org/contribute/committer-guide/
>> >
>>
>


Re: Jenkins outage

2020-02-07 Thread Yifan Zou
Cleaned the beam7, it should be okay.

On Fri, Feb 7, 2020 at 9:42 AM Yifan Zou  wrote:

> I'll look into the 'no space' issue.
>
> On Fri, Feb 7, 2020 at 7:14 AM Ismaël Mejía  wrote:
>
>> mmm apache-beam-jenkins-14 also has issues:
>>
>> *16:08:47* ERROR: Error cloning remote repo 'origin'*16:08:47* 
>> hudson.plugins.git.GitException: Could not init 
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Go_VR_Spark_PR/src
>>
>> https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark_PR/6/console
>>
>>
>> On Fri, Feb 7, 2020 at 3:14 PM Ismaël Mejía  wrote:
>>
>>> I am getting "java.lang.IllegalStateException: java.io.IOException: No
>>> space left on device" on  apache-beam-jenkins-7
>>> Can somebody please clean the space. Thanks.
>>> https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark_PR/26/
>>>
>>> On Fri, Feb 7, 2020 at 2:19 PM Ismaël Mejía  wrote:
>>>
 Thanks for taking care of this issue with INFRA Michał.
 Everything back to normal!


 On Fri, Feb 7, 2020 at 11:34 AM Michał Walenia <
 michal.wale...@polidea.com> wrote:

> Everything looks fine now, the jobs are triggering correctly again
>
> On Fri, Feb 7, 2020 at 10:06 AM Michał Walenia <
> michal.wale...@polidea.com> wrote:
>
>> Hi there,
>> it seems that our Jenkins is experiencing some issues and the jobs
>> are getting stuck in the queue despite the executors being idle.
>> Here's the JIRA issue for this:
>> https://issues.apache.org/jira/browse/INFRA-19830
>> Let's hope it will be resolved soon.
>>
>> --
>>
>> Michał Walenia
>> Polidea  | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! 
>>
>
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>



Re: Jenkins outage

2020-02-07 Thread Yifan Zou
I'll look into the 'no space' issue.

On Fri, Feb 7, 2020 at 7:14 AM Ismaël Mejía  wrote:

> mmm apache-beam-jenkins-14 also has issues:
>
> *16:08:47* ERROR: Error cloning remote repo 'origin'*16:08:47* 
> hudson.plugins.git.GitException: Could not init 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Go_VR_Spark_PR/src
>
> https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark_PR/6/console
>
>
> On Fri, Feb 7, 2020 at 3:14 PM Ismaël Mejía  wrote:
>
>> I am getting "java.lang.IllegalStateException: java.io.IOException: No
>> space left on device" on  apache-beam-jenkins-7
>> Can somebody please clean the space. Thanks.
>> https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark_PR/26/
>>
>> On Fri, Feb 7, 2020 at 2:19 PM Ismaël Mejía  wrote:
>>
>>> Thanks for taking care of this issue with INFRA Michał.
>>> Everything back to normal!
>>>
>>>
>>> On Fri, Feb 7, 2020 at 11:34 AM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Everything looks fine now, the jobs are triggering correctly again

 On Fri, Feb 7, 2020 at 10:06 AM Michał Walenia <
 michal.wale...@polidea.com> wrote:

> Hi there,
> it seems that our Jenkins is experiencing some issues and the jobs are
> getting stuck in the queue despite the executors being idle.
> Here's the JIRA issue for this:
> https://issues.apache.org/jira/browse/INFRA-19830
> Let's hope it will be resolved soon.
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


 --

 Michał Walenia
 Polidea  | Software Engineer

 M: +48 791 432 002 <+48791432002>
 E: michal.wale...@polidea.com

 Unique Tech
 Check out our projects! 

>>>


Re: Executing the runner validation tests for the Twister2 runner

2020-02-07 Thread Pulasthi Supun Wickramasinghe
Hi Kenn

Thanks for the information. I will add the information accordingly and
update the community.

Best Regards,
Pulasthi

On Wed, Jan 29, 2020 at 8:28 AM Kenneth Knowles  wrote:

> In my opinion it is fine to add the documentation after the runner is
> added. I do think we should have input from more members of the community
> about accepting the donation. Since there is time here are places where you
> should add information about the runner:
>
> https://beam.apache.org/documentation/runners/capability-matrix/ source
> data at
> https://github.com/apache/beam/blob/master/website/src/_data/capability-matrix.yml
>
> https://beam.apache.org/documentation/runners/twister2 (new page - see
> the pages for other runners)
>
> https://beam.apache.org/get-started/quickstart-java/ has some snippets
> with toggles per runner
>
> https://beam.apache.org/roadmap/ has the roadmaps for different runners.
> For a new runner especially this could be helpful for users.
>
> Kenn
>
> On Sun, Jan 12, 2020 at 9:36 AM Pulasthi Supun Wickramasinghe <
> pulasthi...@gmail.com> wrote:
>
>> Hi Kenn,
>>
>> Is there any documentation that needs to accompany the new runner in the
>> pull request or is the documentation added after the pull request is
>> approved?.  I would be great if you can point me in the right direction
>> regarding this.
>>
>> Best Regards,
>> Pulasthi
>>
>> On Mon, Jan 6, 2020 at 9:56 PM Pulasthi Supun Wickramasinghe <
>> pulasthi...@gmail.com> wrote:
>>
>>> Hi Kenn,
>>>
>>>
>>>
>>> On Mon, Jan 6, 2020 at 9:09 PM Kenneth Knowles  wrote:
>>>


 On Mon, Jan 6, 2020 at 8:30 AM Pulasthi Supun Wickramasinghe <
 pulasthi...@gmail.com> wrote:

> Hi Kenn,
>
> I was able to solve the problem mentioned above, I am currently
> running the "ValidatesRunner" tests, I have around 4-5 tests that are
> failing that I should be able to fix in a couple of days. I wanted to 
> check
> the next steps I would need to take after all the "ValidatesRunner" tests
> are passing. I assume that the runner does not need to pass all the
> "NeedsRunner" tests.
>

 That's correct. You don't need to run the NeedsRunner tests. Those are
 tests of the core SDK's functionality, not the runner. The annotation is a
 workaround for the false cycle in deps "SDK tests" -> "direct runner" ->
 "SDK", which to maven looks like a cyclic dependency. You should run on the
 ValidatesRunner tests. It is fine to also disable some of them, either by
 excluding categories or test classes, or adding new categories to exclude.


> The runner is only implemented for the batch mode at the moment
> because the higher-level API's for streaming on Twister2 are still being
> finalized. Once that work is done we will add streaming support for the
> runner as well.
>

 Nice! Batch-only is perfectly fine for a runner. You should be able to
 detect and reject pipelines that the runner cannot execute.

>>>
>>> I will make sure that the capability is there.
>>>
>>>
 I re-read the old thread, but I may have missed the answer to a
 fundamental question. Just to get it clear on the mailing list: are you
 intending to submit the runner's code to Apache Beam and ask the community
 to maintain it?

>>> To answer the original question: the issue was with forwarding
>>> exceptions that happen during execution, since Twister2 has a
>>> distributed execution model. I added the ability on the Twister2 side
>>> so that the Twister2 job submitting client will receive a job state object
>>> that contains any exceptions thrown during runtime once the job is
>>> completed.
>>>
>>> And about maintaining the twister2 runner, we would like to submit the
>>> runner to the beam codebase but the Twister2 team will maintain and update
>>> it continuously, in that case, we would become part of the Beam community I
>>> suppose. And any contributions from other members of the community are more
>>> than welcome. I hope that answers your question.
>>>
>>> Best Regards,
>>> Pulasthi
>>>
>>>
 Kenn


 Best Regards,
> Pulasthi
>
> On Thu, Dec 12, 2019 at 11:27 AM Pulasthi Supun Wickramasinghe <
> pulasthi...@gmail.com> wrote:
>
>> Hi Kenn
>>
>> We are still working on aspects like automated job monitoring so
>> currently do not have those capabilities built-in. I discussed with the
>> Twister2 team on a way we can forward failure information from the 
>> workers
>> to the Jobmaster which would be a solution to this problem. It might 
>> take a
>> little time to develop and test. I will update you after looking into 
>> that
>> solution in a little more detail.
>>
>> Best Regards,
>> Pulasthi
>>
>> On Wed, Dec 11, 2019 at 10:51 PM Kenneth Knowles 
>> wrote:
>>
>>> I dug in to Twister2 a little bit to understand the question better,
>>> checking 

Re: Tests not triggering

2020-02-07 Thread Andrew Pilloud
I saw similar things yesterday. I reran the stuck/missing tests about an
hour ago with 'retest this please' and they worked.

Andrew

On Fri, Feb 7, 2020 at 9:25 AM Reuven Lax  wrote:

> Is Jenkins wedged again? I have PRs where the tests have been
> pending for over 10 hours.
>
> Reuven
>


Tests not triggering

2020-02-07 Thread Reuven Lax
Is Jenkins wedged again? I have PRs where the tests have been
pending for over 10 hours.

Reuven


Re: Time precision in Python

2020-02-07 Thread Robert Bradshaw
I meant issues with windows firing before their time (i.e. before the
watermark passes the end of the window).
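
To make the truncation concern concrete, a pure-Python sketch (hypothetical
values, not Beam API):

def window_start(t, width=500):
  # Assign t to a fixed-width window (same time units as t).
  return t - (t % width)

t1_us, t2_us = 1500250, 1500750        # two distinct event times, in micros
assert t1_us // 1000 == t2_us // 1000  # identical once truncated to millis

assert window_start(t1_us) != window_start(t2_us)  # distinct 500us windows
# After truncating to millis, the two events can no longer be told apart,
# so sub-millisecond window identification is lost:
assert window_start(t1_us // 1000) == window_start(t2_us // 1000)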

On Thu, Feb 6, 2020 at 8:42 PM Kenneth Knowles  wrote:
>
> What is an out of order window?
>
> On Thu, Feb 6, 2020 at 3:09 PM Sam Rohde  wrote:
>>
>> Gotcha, I was just surprised by the precision loss. Thanks!
>>
>> On Thu, Feb 6, 2020 at 1:50 PM Robert Bradshaw  wrote:
>>>
>>> Yes, the inconsistency of timestamp granularity is something that
>>> hasn't yet been resolved (see previous messages on this list). As long
>>> as we round consistently, it won't result in out-of-order windows, but
>>> it may result in timestamp truncation and (for sub-millisecond
>>> windows) even incorrect window identification.
>>>
>>> On Thu, Feb 6, 2020 at 1:42 PM Sam Rohde  wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I saw that in the Python SDK we encode WindowedValues and Timestamps as 
>>> > millis, whereas the TIME_GRANULARITY is defined as 1us. Why do we do 
>>> > this? Won't this cause problems using the FnApiRunner as windows might 
>>> > fire out of order or something else?
>>> >
>>> > Thanks,
>>> > Sam


Re: [PROPOSAL] Beam Schema Options

2020-02-07 Thread Brian Hulette
Messed up my own short-link. It's https://s.apache.org/xlang-table-provider

On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette  wrote:

> I'm not sure this belongs directly on schemas. I've had trouble
> reconciling that opinion, since the idea does seem very useful, and in fact
> I'm interested in using it myself. I think I've figured out my concern -
> what I really want is options for a (maybe portable) Table.
>
> As I indicated in a comment in the doc [1] I still think all of the
> examples you've described only apply to IOs. To be clear, what I mean is
> that all of the examples either
> 1) modify the behavior of the external system the IO is interfacing with
> (specify partitioning, indexing, etc..), or
> 2) define some transformation that should be done to the data adjacent to
> the IO (after an Input or before an Output) in Beam
>
> (1) Is the sort of thing you described in the IO section [2] (aside from
> the PubSub example I added, since that's describing a transformation to do
> in Beam)
> I would argue that all of the other examples fall under (2) - data
> validation, adding computed columns, encryption, etc... are things that can
> be done in a transform
>
> I think we can make an analogy to a traditional database here:
> schema-aware Beam IOs are like Tables in a database, other PCollections are
> like intermediate results in a query. In a database, Tables can be defined
> with some DDL and have schema-level or column-level options that change
> system behavior, but intermediate results have no such capability.
>
>
> Another point I think is worth discussing: is there value in making these
> options portable?
> As it's currently defined I'm not sure there is - everything could be done
> within a single SDK. However, portable options on a portable table could be
> very powerful, since it could be used to configure cross-language IOs,
> perhaps with something like https://s.apache.org/xlang-table-provider/
>
> [1]
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=I54si4k
> [2]
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>
> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel  wrote:
>
>> I would appreciate if someone would look at the following PR and get it
>> to master:
>>
>> https://github.com/apache/beam/pull/10413#
>>
>> a lot of work needs to follow, but if we have the base already on master
>> the next layers can follow. As a reminder, this is the base proposal:
>>
>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>
>> I've also looked for prior work, and saw that Spark actually has
>> something comparable:
>>
>> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>>
>> but when the options are finished it will be far more powerful as it is
>> not limited on fields.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles  wrote:
>>
>>> Using schema types for the metadata values is a nice touch.
>>>
>>> Are the options expected to be common across many fields? Perhaps the
>>> name should be a URN to make it clear to be careful about collisions? (just
>>> a synonym for "name" in practice, but with different connotation)
>>>
>>> I generally like this... but the examples (all but one) are weird things
>>> that I don't really understand how they are done or who is responsible for
>>> them.
>>>
>>> One way to go is this: if options are maybe not understood by all
>>> consumers, then they can't really change behavior. Kind of like how URN and
>>> payload on a composite transform can be ignored and just the expansion used.
>>>
>>> Kenn
>>>
>>> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel  wrote:
>>>
 Hi everyone,

 I'm proud to announce my first real proposal. The proposal describes
 Beam Schema Options. This is an extension to the Schema API to add typed
 metadata to Rows, Fields and Logical Types:


 https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing

 To give you some context where this proposal comes from: We've been
 using dynamic meta driven pipelines for a while, but till now in an
 awkward and hacky way (see my talks at the previous Beam Summits). This
 proposal would bring a good way to work with metadata on the metadata :-).

 The proposal points to 2 pull requests with the implementation, one for
 the API another for translating proto options to beam options.

 Thanks to Brian Hulette and Reuven Lax for the initial feedback. All
 feedback is welcome.

  _/
 _/ Alex Van Boxel

>>>


Re: [PROPOSAL] Beam Schema Options

2020-02-07 Thread Brian Hulette
I'm not sure this belongs directly on schemas. I've had trouble reconciling
that opinion, since the idea does seem very useful, and in fact I'm
interested in using it myself. I think I've figured out my concern - what I
really want is options for a (maybe portable) Table.

As I indicated in a comment in the doc [1] I still think all of the
examples you've described only apply to IOs. To be clear, what I mean is
that all of the examples either
1) modify the behavior of the external system the IO is interfacing with
(specify partitioning, indexing, etc..), or
2) define some transformation that should be done to the data adjacent to
the IO (after an Input or before an Output) in Beam

(1) Is the sort of thing you described in the IO section [2] (aside from
the PubSub example I added, since that's describing a transformation to do
in Beam)
I would argue that all of the other examples fall under (2) - data
validation, adding computed columns, encryption, etc... are things that can
be done in a transform

I think we can make an analogy to a traditional database here: schema-aware
Beam IOs are like Tables in a database, other PCollections are like
intermediate results in a query. In a database, Tables can be defined with
some DDL and have schema-level or column-level options that change
system behavior, but intermediate results have no such capability.
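
A toy sketch of that analogy (entirely hypothetical names - the real
proposal is the Java API in the PR above):

from dataclasses import dataclass, field

@dataclass
class FieldWithOptions:
  # Sketch of a schema field carrying typed column-level options,
  # analogous to column options in a database DDL statement.
  name: str
  field_type: str
  options: dict = field(default_factory=dict)

user_id = FieldWithOptions(
    'user_id', 'STRING',
    options={'database.indexed': True, 'pii.encrypt': 'aes256'})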


Another point I think is worth discussing: is there value in making these
options portable?
As it's currently defined I'm not sure there is - everything could be done
within a single SDK. However, portable options on a portable table could be
very powerful, since it could be used to configure cross-language IOs,
perhaps with something like https://s.apache.org/xlang-table-provider/

[1]
https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=I54si4k
[2]
https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt

On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel  wrote:

> I would appreciate if someone would look at the following PR and get it to
> master:
>
> https://github.com/apache/beam/pull/10413#
>
> a lot of work needs to follow, but if we have the base already on master
> the next layers can follow. As a reminder, this is the base proposal:
>
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>
> I've also looked for prior work, and saw that Spark actually has something
> comparable:
>
> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>
> but when the options are finished it will be far more powerful as it is
> not limited on fields.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles  wrote:
>
>> Using schema types for the metadata values is a nice touch.
>>
>> Are the options expected to be common across many fields? Perhaps the
>> name should be a URN to make it clear to be careful about collisions? (just
>> a synonym for "name" in practice, but with different connotation)
>>
>> I generally like this... but the examples (all but one) are weird things
>> that I don't really understand how they are done or who is responsible for
>> them.
>>
>> One way to go is this: if options are maybe not understood by all
>> consumers, then they can't really change behavior. Kind of like how URN and
>> payload on a composite transform can be ignored and just the expansion used.
>>
>> Kenn
>>
>> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel  wrote:
>>
>>> Hi everyone,
>>>
>>> I'm proud to announce my first real proposal. The proposal describes
>>> Beam Schema Options. This is an extension to the Schema API to add typed
>>> metadata to Rows, Fields, and Logical Types:
>>>
>>>
>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>
>>> To give you some context on where this proposal comes from: we've been
>>> using dynamic, metadata-driven pipelines for a while, but until now in an
>>> awkward and hacky way (see my talks at the previous Beam Summits). This
>>> proposal would bring a proper way to work with metadata on the metadata :-).
>>>
>>> The proposal points to two pull requests with the implementation: one for
>>> the API, another for translating proto options to Beam options.
>>>
>>> Thanks to Brian Hulette and Reuven Lax for the initial feedback. All
>>> feedback is welcome.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>
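
To make the discussion concrete, here is a minimal sketch of what
field-level options could look like in the Java SDK. The method names
(Schema.Options.builder(), setOption(), withOptions(), getOptions()) are
assumptions that follow the shape of the API proposed in PR #10413; the
merged API may differ.

import org.apache.beam.sdk.schemas.Schema;

public class SchemaOptionsSketch {
  public static void main(String[] args) {
    // Hypothetical: attach a typed, column-level option to a single field.
    Schema.Options sensitive =
        Schema.Options.builder()
            .setOption("encrypt", Schema.FieldType.BOOLEAN, true)
            .build();

    Schema schema =
        Schema.builder()
            .addStringField("userId")
            .addField(
                Schema.Field.of("email", Schema.FieldType.STRING)
                    .withOptions(sensitive))
            .build();

    // A consumer that understands the option can act on it; one that does
    // not can simply ignore it, in line with the point above about options
    // not being required to change behavior for all consumers.
    Boolean encrypt = schema.getField("email").getOptions().getValue("encrypt");
    if (Boolean.TRUE.equals(encrypt)) {
      // ... apply column-level encryption before writing
    }
  }
}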


big data blog

2020-02-07 Thread Etienne Chauchot

Hi all,

FYI, I just started a blog around big data technologies and for now it 
is focused on Beam.


https://echauchot.blogspot.com/

Feel free to comment, suggest or anything.

Etienne



Re: Jenkins jobs not running for my PR 10438

2020-02-07 Thread Ismaël Mejía
done

On Fri, Feb 7, 2020 at 4:05 PM Tomo Suzuki  wrote:

> Hi Beam committers,
>
> I would appreciate it if you could run precommit checks for
> https://github.com/apache/beam/pull/10769
> with the following 6 commands:
>
> Run Java PostCommit
> Run Java HadoopFormatIO Performance Test
> Run BigQueryIO Streaming Performance Test Java
> Run Dataflow ValidatesRunner
> Run Spark ValidatesRunner
> Run SQL Postcommit
>
> Regards,
> Tomo
>


Re: Jenkins outage

2020-02-07 Thread Ismaël Mejía
Hmm, apache-beam-jenkins-14 also has issues:

16:08:47 ERROR: Error cloning remote repo 'origin'
16:08:47 hudson.plugins.git.GitException: Could not init
/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Go_VR_Spark_PR/src

https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark_PR/6/console


On Fri, Feb 7, 2020 at 3:14 PM Ismaël Mejía  wrote:

> I am getting "java.lang.IllegalStateException: java.io.IOException: No
> space left on device" on apache-beam-jenkins-7.
> Can somebody please clean up the disk space? Thanks.
> https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark_PR/26/
>
> On Fri, Feb 7, 2020 at 2:19 PM Ismaël Mejía  wrote:
>
>> Thanks for taking care of this issue with INFRA Michał.
>> Everything back to normal!
>>
>>
>> On Fri, Feb 7, 2020 at 11:34 AM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>>
>>> Everything looks fine now, the jobs are triggering correctly again
>>>
>>> On Fri, Feb 7, 2020 at 10:06 AM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Hi there,
 it seems that our Jenkins is experiencing some issues and the jobs are
 getting stuck in the queue despite the executors being idle.
 Here's the JIRA issue for this:
 https://issues.apache.org/jira/browse/INFRA-19830
 Let's hope it will be resolved soon.

 --

 Michał Walenia
 Polidea  | Software Engineer

 M: +48 791 432 002 <+48791432002>
 E: michal.wale...@polidea.com

 Unique Tech
 Check out our projects! 

>>>
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>


Re: Jenkins jobs not running for my PR 10438

2020-02-07 Thread Tomo Suzuki
Hi Beam committers,

I would appreciate it if you could run precommit checks for
https://github.com/apache/beam/pull/10769
with the following 6 commands:

Run Java PostCommit
Run Java HadoopFormatIO Performance Test
Run BigQueryIO Streaming Performance Test Java
Run Dataflow ValidatesRunner
Run Spark ValidatesRunner
Run SQL Postcommit

Regards,
Tomo


Re: Unable to run ParDoTests from CLI

2020-02-07 Thread Ismaël Mejía
Use

./gradlew :runners:direct-java:needsRunner --tests "*ParDoTest\$TimerTests"

For ValidatesRunner, for example:
./gradlew :runners:direct-java:validatesRunner --tests
"*ParDoTest\$TimerFamily*"

Credit to Brian who helped me because I was struggling with the same issue
last week.
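
The likely culprit in the original command is shell quoting: inner test
classes have binary names like ParDoTest$TimerTests, and inside double
quotes bash still expands $TimerTests (to nothing) before Gradle ever sees
the filter. Escaping the dollar sign with \$ as above avoids that; single
quotes should work too, e.g. (an untested sketch):

./gradlew :runners:direct-java:needsRunner --tests '*ParDoTest$TimerTests'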


On Fri, Feb 7, 2020 at 3:19 PM Rehman Murad Ali <
rehman.murad...@venturedive.com> wrote:

> Hello Community,
>
> I have been trying to run test cases from the CLI. ParDoTest.java has some
> inner classes with test functions (for example TimerTest). This is the
> command I have used to run the test:
>
> ./gradlew runners:direct-java:needsRunnerTests --tests
> "org.apache.beam.sdk.transforms.ParDoTest$TimerTests"
>
> Here is the error message:
> [image: image.png]
>
>
> I need assistance regarding this matter.
>
>
> *Thanks & Regards*
>
>
>
> *Rehman Murad Ali*
> Software Engineer
> Mobile: +92 3452076766
> Skype: rehman.muradali
>


Unable to run ParDoTests from CLI

2020-02-07 Thread Rehman Murad Ali
Hello Community,

I have been trying to run test cases from the CLI. ParDoTest.java has some
inner classes with test functions (for example TimerTest). This is the
command I have used to run the test:

./gradlew runners:direct-java:needsRunnerTests --tests
"org.apache.beam.sdk.transforms.ParDoTest$TimerTests"

Here is the error message:
[image: image.png]


I need assistance regarding this matter.


*Thanks & Regards*



*Rehman Murad Ali*
Software Engineer
Mobile: +92 3452076766
Skype: rehman.muradali


Re: Jenkins outage

2020-02-07 Thread Ismaël Mejía
I am getting "java.lang.IllegalStateException: java.io.IOException: No
space left on device" on apache-beam-jenkins-7.
Can somebody please clean up the disk space? Thanks.
https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark_PR/26/
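
If anyone with shell access to the worker picks this up, something like the
following should show what is filling the disk (standard GNU coreutils; the
/home/jenkins path is a guess based on the workspace layout seen in other
build logs):

du -hx --max-depth=2 /home/jenkins | sort -rh | head -20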

On Fri, Feb 7, 2020 at 2:19 PM Ismaël Mejía  wrote:

> Thanks for taking care of this issue with INFRA Michał.
> Everything back to normal!
>
>
> On Fri, Feb 7, 2020 at 11:34 AM Michał Walenia 
> wrote:
>
>> Everything looks fine now, the jobs are triggering correctly again
>>
>> On Fri, Feb 7, 2020 at 10:06 AM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>>
>>> Hi there,
>>> it seems that our Jenkins is experiencing some issues and the jobs are
>>> getting stuck in the queue despite the executors being idle.
>>> Here's the JIRA issue for this:
>>> https://issues.apache.org/jira/browse/INFRA-19830
>>> Let's hope it will be resolved soon.
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>
>>
>> --
>>
>> Michał Walenia
>> Polidea  | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! 
>>
>


Re: Jenkins outage

2020-02-07 Thread Ismaël Mejía
Thanks for taking care of this issue with INFRA Michał.
Everything back to normal!


On Fri, Feb 7, 2020 at 11:34 AM Michał Walenia 
wrote:

> Everything looks fine now, the jobs are triggering correctly again
>
> On Fri, Feb 7, 2020 at 10:06 AM Michał Walenia 
> wrote:
>
>> Hi there,
>> it seems that our Jenkins is experiencing some issues and the jobs are
>> getting stuck in the queue despite the executors being idle.
>> Here's the JIRA issue for this:
>> https://issues.apache.org/jira/browse/INFRA-19830
>> Let's hope it will be resolved soon.
>>
>> --
>>
>> Michał Walenia
>> Polidea  | Software Engineer
>>
>> M: +48 791 432 002 <+48791432002>
>> E: michal.wale...@polidea.com
>>
>> Unique Tech
>> Check out our projects! 
>>
>
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


SplittableDoFn with Flink fails at checkpointing larger files (200MB)

2020-02-07 Thread marek-simunek

Hi,


I am using FileIO to continuously watch a folder for new files to
process. The problem is that when Flink starts reading a 200MB file (around
3M elements) and a checkpoint starts at the same time, the checkpoint never
finishes until the WHOLE file is processed.

Minimal example:
https://github.com/seznam/beam/blob/simunek/failingCheckpoint/examples/java/src/main/java/org/apache/beam/examples/CheckpointFailingExample.java

My theory about what could be wrong, from my understanding:
the CheckpointMark in this case starts from Create.ofProvider and is then
propagated to downstream operators, where it queues behind all of the
splits, which means all splits have to be read before the operator can
checkpoint successfully. The problem is even bigger when there are more
files: then we need to wait for all files to be processed before a
checkpoint can succeed.

1. Are my assumptions correct?
2. Is there some way to improve the behavior of SplittableDoFn (or the
subsequent reading from BoundedSource) so that Flink can better propagate
the checkpoint barrier?

For now my fix is to read smaller files (30MB) one by one, but it's not very
future-proof.

Versions:
Beam 2.17
Flink 1.9

Please correct my poor understanding of checkpointing with Beam and Flink;
it would be wonderful if you had some advice on what to improve or where
to look.
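
For context, the reading side of a pipeline like the one described above is
sketched below. This is reconstructed from the description, not copied from
the linked example; the filepattern and poll interval are illustrative. Each
matched file is then read by a splittable read, which is presumably where
the checkpoint barrier ends up queued behind the outstanding splits.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ContinuousFileReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Watch a folder and read every file that shows up.
    PCollection<String> lines =
        pipeline
            .apply(FileIO.match()
                .filepattern("/path/to/watched/dir/*")
                .continuously(
                    Duration.standardSeconds(30), Watch.Growth.never()))
            .apply(FileIO.readMatches())
            .apply(TextIO.readFiles());

    pipeline.run();
  }
}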

Re: Jenkins outage

2020-02-07 Thread Michał Walenia
Everything looks fine now, the jobs are triggering correctly again

On Fri, Feb 7, 2020 at 10:06 AM Michał Walenia 
wrote:

> Hi there,
> it seems that our Jenkins is experiencing some issues and the jobs are
> getting stuck in the queue despite the executors being idle.
> Here's the JIRA issue for this:
> https://issues.apache.org/jira/browse/INFRA-19830
> Let's hope it will be resolved soon.
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 


Jenkins outage

2020-02-07 Thread Michał Walenia
Hi there,
it seems that our Jenkins is experiencing some issues and the jobs are
getting stuck in the queue despite the executors being idle.
Here's the JIRA issue for this:
https://issues.apache.org/jira/browse/INFRA-19830
Let's hope it will be resolved soon.

-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects!