Re: Jenkins jobs not running for my PR 10438

2020-01-13 Thread Yifan Zou
done.

On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki  wrote:

> Hi Beam committers,
>
> Four Jenkins jobs did not report back for this PR
> https://github.com/apache/beam/pull/10554 .
> Can somebody trigger them?
>
> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud 
> wrote:
> >
> > Done.
> >
> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki  wrote:
> >>
> >> Hi Beam developers,
> >>
> >> I'd appreciate it if a committer could trigger the precommit build for
> >> https://github.com/apache/beam/pull/10554.
> >>
> >> In addition to normal precommit checks, I want the following:
> >> Run Java PostCommit
> >> Run Java HadoopFormatIO Performance Test
> >> Run BigQueryIO Streaming Performance Test Java
> >> Run Dataflow ValidatesRunner
> >> Run Spark ValidatesRunner
> >> Run SQL Postcommit
> >>
> >> Regards,
> >> Tomo
>
>
>
> --
> Regards,
> Tomo
>


Beam Dependency Check Report (2020-01-13)

2020-01-13 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

Dependency Name        | Current Version | Latest Version | Current Version Release Date | Latest Release Date | JIRA Issue
cachetools             | 3.1.1           | 4.0.0          | 2019-12-23                   | 2019-12-23          | BEAM-9017
google-cloud-bigquery  | 1.17.1          | 1.23.1         | 2019-09-23                   | 2019-12-23          | BEAM-5537
google-cloud-datastore | 1.7.4           | 1.10.0         | 2019-05-27                   | 2019-10-21          | BEAM-8443
httplib2               | 0.12.0          | 0.15.0         | 2018-12-10                   | 2019-12-23          | BEAM-9018
mock                   | 2.0.0           | 3.0.5          | 2019-05-20                   | 2019-05-20          | BEAM-7369
oauth2client           | 3.0.0           | 4.1.3          | 2018-12-10                   | 2018-12-10          | BEAM-6089
pytest                 | 4.6.9           | 5.3.2          | 2020-01-06                   | 2019-12-23          | BEAM-8606
Sphinx                 | 1.8.5           | 2.3.1          | 2019-05-20                   | 2019-12-23          | BEAM-7370
tenacity               | 5.1.5           | 6.0.0          | 2019-11-11                   | 2019-11-11          | BEAM-8607

High Priority Dependency Updates Of Beam Java SDK:

Dependency Name | Current Version | Latest Version | Current Version Release Date | Latest Release Date | JIRA Issue
com.alibaba:fastjson | 1.2.49 | 1.2.62 | 2018-08-04 | 2019-10-07 | BEAM-8632
com.datastax.cassandra:cassandra-driver-core | 3.8.0 | 4.0.0 | 2019-10-29 | 2019-03-18 | BEAM-8674
com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC4 | 2018-03-20 | 2019-04-14 | BEAM-5809
com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.20.0 | 0.27.0 | 2019-02-11 | 2019-10-21 | BEAM-6645
com.github.spotbugs:spotbugs | 3.1.12 | 4.0.0-beta4 | 2019-03-01 | 2019-09-18 | BEAM-7792
com.github.spotbugs:spotbugs-annotations | 3.1.12 | 4.0.0-beta4 | 2019-03-01 | 2019-09-18 | BEAM-6951
com.google.api:gax-grpc | 1.38.0 | 1.53.0 | 2019-02-05 | 2020-01-09 | BEAM-8676
com.google.api.grpc:grpc-google-cloud-datacatalog-v1beta1 | 0.27.0-alpha | 0.31.0-alpha | 2019-10-03 | 2020-01-09 | BEAM-8853
com.google.api.grpc:grpc-google-common-protos | 1.12.0 | 1.17.0 | 2018-06-29 | 2019-10-04 | BEAM-8633
com.google.api.grpc:proto-google-cloud-bigtable-v2 | 0.44.0 | 1.9.1 | 2019-01-23 | 2020-01-10 | BEAM-8679
com.google.api.grpc:proto-google-cloud-datacatalog-v1beta1 | 0.27.0-alpha | 0.31.0-alpha | 2019-10-03 | 2020-01-09 | BEAM-8854
com.google.api.grpc:proto-google-cloud-datastore-v1 | 0.44.0 | 0.85.0 | 2019-01-23 | 2019-12-05 | BEAM-8680
com.google.api.grpc:proto-google-cloud-spanner-admin-database-v1 | 1.6.0 | 1.47.0 | 2019-01-23 | 2019-12-05 | BEAM-8682
com.google.api.grpc:proto-google-common-protos | 1.12.0 | 1.17.0 | 2018-06-29 | 2019-10-04 | BEAM-6899
com.google.apis:google-api-services-bigquery | v2-rev20181221-1.28.0 | v2-rev20190917-1.30.3 | 2019-01-17 | 2019-10-09 | BEAM-8684
com.google.apis:google-api-services-clouddebugger | v2-rev20181114-1.28.0 | v2-rev20191003-1.30.3 | 2019-01-17 | 2019-10-19 | BEAM-8750
com.google.apis:google-api-services-cloudresourcemanager | v1-rev20181015-1.28.0 | v2-rev20191206-1.30.3 | 2019-01-17 | 2019-12-17 | BEAM-8751
com.google.apis:google-api-services-dataflow | v1b3-rev20190927-1.28.0 | v1beta3-rev12-1.20.0 | 2019-10-11 | 2015-04-29 | BEAM-8752
com.google.apis:google-api-services-pubsub | v1-rev2019-1.28.0 | v1-rev20191203-1.30.3 | 2019-11-26 | 2019-12-18 | BEAM-8753
com.google.apis:google-api-services-storage | v1-rev20181109-1.28.0 | v1-rev20191011-1.30.3 | 2019-01-18 | 2019-10-30 | BEAM-8754
com.google.cloud:google-cloud-bigquery | 1.28.0 | 1.103.0 | 2018-04-27 | 2020-01-08 | BEAM-8687
com.google.cloud:google-cloud-bigquerystorage | 0.79.0-alpha | 0.120.1-beta | 2019-01-23 | 2020-01-08 | BEAM-8755
com.google.cloud:google-cloud-core | 1.61.0 | 1.92.2 | 2019-01-23 | 2020-01-09 | BEAM-8756
com.google.cloud:google-cloud-core-grpc | 1.61.0 | 1.92.2 | 2019-01-23 | 2020-01-09 | BEAM-8757
com.google.cloud:google-cloud-spanner | 1.6.0 | 1.47.0 | 2019-01-23 | 2019-12-05 | BEAM-8758
com.google.cloud.bigdataoss:gcsio | 1.9.16 | 2.0.0 | 2019-02-25 |

Jenkins job execution policy

2020-01-13 Thread Michał Walenia
Hi,
I wanted to decouple the conversation about solutions to the issue from job
execution requests.
We have 131 open PRs right now and 64 committers with job running
privileges. From what I counted, more than 80 of those PRs are not authored
by committers.
I think that having committers answer testing and retesting requests is a
temporary solution and a permanent one should be decided upon soon. While
it's an inconvenience for contributors familiar with the workings of the
project and the community, newcomers might be put off by the fact that the
tests don't run automatically on their pull requests (this is an industry
standard which IMO should be upheld in Beam as well). The barrier of finding
a committer who is active and willing to trigger their tests can make entry
to the project more difficult.

I believe that the solution proposed by Kenneth in the Jira thread
https://issues.apache.org/jira/browse/INFRA-19670 is viable. The questions
are: do we want to implement this idea and what needs to be done to do it?

Regards
Michał

-- 

Michał Walenia
Polidea | Software Engineer

M: +48 791 432 002
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 


Re: outputWithTimestamp

2020-01-13 Thread Aaron Dixon
Reuven, thank you much for your help and the clarity here, it's very
helpful..

Per your solution #2 -- This approach makes sense, seems semantically
right, and something I'll explore when timer.withOutputTimestamp(t) is
released. Just for clarity, there is no other way in Beam
(mid-pipeline/post-Source) for me to effect a hold on the watermark today
until this API is released, correct?

On Mon, Jan 13, 2020 at 1:22 AM Reuven Lax  wrote:

> Semantically though, since you want the CalendarWindow aggregation to be
> based on login timestamps, the watermark should be tracking the login
> timestamps. The watermark is a way for the CalendarWindow to know that as
> far as the system knows, there will be no more events that fall into that
> window. You say that long sessions are holding back the watermark, but
> that's exactly because those long sessions mean that there is still data
> pending for that CalendarWindow, so it is still incomplete! The above
> techniques might appear to solve this, but do so at the expense of somewhat
> randomly causing data to be late or worse dropped.
>
> There are a couple of ways I would address this:
>
> 1. The simplest would be to allow the watermark to track the login window,
> but put a trigger on the CalendarWindow (e.g. trigger every 10 seconds).
> That way whenever the trigger fires you can update the results so far for
> that window. This means that the majority of sessions that are complete can
> be output without needing to wait for the long sessions, yet the window
> will remain open waiting for those long sessions to complete.
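
A minimal sketch of this first option (the element type T, the 10-second
cadence, and the zero allowed lateness are illustrative placeholders):

    .apply(Window.<T>configure()
        .triggering(AfterWatermark.pastEndOfWindow()
            // Early firings roughly every 10 seconds of processing time
            // keep results flowing while the window stays open for the
            // long sessions.
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10))))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO));
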
>
> 2. Another possibility is to explicitly identify those extra-long
> sessions, and handle them differently. This I think is a better solution
> than the above timestampSkew solution, because it's deterministic: you know
> exactly which sessions you are handling differently. I would do this by
> using the state+timers API to calculate the sessions, instead of the
> sessions WindowFn. When a session is overly long, then you can stop setting
> the watermark hold for the login time, essentially removing that long
> session from the watermark calculation.
>
> One possibility for how to handle the long sessions "differently" would
> still involve using withAllowedTimestampSkew. This still risks losing some
> of these (if the skew ever happens to be larger than the static value you
> set, you won't be able to output the session). However now you know
> you're limiting the skewed output to only those specific long sessions
> you've chosen, which is much better than emitting all records with skew and
> hoping that things work out.
>
> Reuven
>
> On Sun, Jan 12, 2020 at 12:07 PM Aaron Dixon  wrote:
>
>> Reuven thanks -- I understand each point although I'm trying to grapple
>> with your concerns expressed in #3; they don't seem avoidable even w/o the
>> allowedSkew feature.
>>
>> Considering your response I see a revision to my solution that omits
>> using the allowed skew configuration but as far as I can tell still has the
>> concerns from #3 (i.e., difficulty in reasoning about which events may be
>> dropped.)
>>
>> My pipeline using the skew config looks like this:
>>
>> (1) CustomSessionWindow
>> emits -> (user, login, logout) @ 
>> (2) ParDo
>> -> re-emits same tuple but w/ *login* timestamp
>> (requires custom allowed-skew)
>> (3) CalendarWindow
>> -> > timestamp>
>>
>> Instead, I can write a CustomCalendarWindow that places the tuple element
>> in the right window based on the *login* timestamp, avoiding the need for
>> the middle/skewing ParDo:
>>
>> (1) CustomSessionWindow
>> -> (user, login, logout) @ 
>> (2) CustomCalendarWindow
>> -> <*explicitly* places element in window based on the **login**
>> timestamp>
>>
>> So the use of the ParDo was simply a way to avoid having to write a
>> custom window; it essentially ensures the CalendarWindow windows based on
>> login time.
>>
>> But I don't see how your concerns in #3 are obviated by this revision.
>> Elements going in to the calendar window may be already late...this is
>> something that any (multi-stage) Beam pipeline has to contend with, even
>> without the deprecated allowedSkew facility, no?
>>
>> In other words both of these pipelines are semantically, behaviorally
>> identical. The former just had the benefit of not requiring a custom window
>> implementation.
>>
>>
>>
>>
>>
>>
>> On Sun, Jan 12, 2020 at 12:12 PM Reuven Lax  wrote:
>>
>>> A few comments:
>>>
>>> 1. Yes, this already works on Dataflow (at Beam head). Flink support is
>>> pending at pr/10534.
>>>
>>> 2. Just to make sure we're on the same page: getAllowedTimestampSkew is
>>> _not_ about outputting behind the watermark. Rather it's about outputting a
>>> timestamp that's less than the current input timestamp. If for example the
>>> watermark is 12:00 and the current input element has a timestamp of 11:00
>>> (because it's late), then you can output an element at 11:00 with no need
>>> to set this parameter. It 

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Reuven Lax
I don't think that should be the case. Also SchemaCoder will automatically
set the UUID for such logical types.

On Mon, Jan 13, 2020 at 8:24 AM Alex Van Boxel  wrote:

> OK, I've rechecked everything and eventually found the problem. The
> problem is when you use a LogicalType backed by a Row; then the UUID
> needs to be set to make it work. (This is the case for Proto-based
> Timestamps.) I'll create a fix.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Jan 13, 2020 at 8:36 AM Reuven Lax  wrote:
>
>> Can you elucidate? All BeamSQL pipelines use schemas and I believe those
>> tests are working just fine on the Dataflow runner. In addition, there are a
>> number of ValidatesRunner schema-aware pipelines that are running regularly
>> on the Dataflow runner.
>>
>> On Sun, Jan 12, 2020 at 1:43 AM Alex Van Boxel  wrote:
>>
>>> Hey all,
>>>
>>> Has anyone tried master with a *schema aware pipeline* on Dataflow? I'm
>>> testing some PRs to see if they run on Dataflow (as they are working on
>>> Direct) but they got:
>>>
>>>
>>> Workflow failed. Causes: The Dataflow job appears to be stuck because no
>>> worker activity has been seen in the last 1h. You can get help with Cloud
>>> Dataflow at
>>>
>>> Because I got this, I wanted to see if master (without my changes) also
>>> has the same behaviour.
>>>
>>> It's easy to simulate: just read from BigQuery with:
>>>
>>> BigQueryIO.readTableRowsWithSchema()
>>>
>>> it works with the classic: readTableRows().
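
A sketch of the two reads being compared (the table reference and pipeline
variable p are placeholders):

    // Schema-aware read: attaches a Beam schema to each TableRow, which
    // exercises SchemaCoder/RowCoderGenerator on the workers.
    PCollection<TableRow> withSchema =
        p.apply(BigQueryIO.readTableRowsWithSchema()
            .from("my-project:my_dataset.my_table"));

    // Classic read: plain TableRows with no schema attached; this path works.
    PCollection<TableRow> classic =
        p.apply(BigQueryIO.readTableRows()
            .from("my-project:my_dataset.my_table"));
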
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>


Re: Ask about beam pull requests

2020-01-13 Thread Pablo Estrada
... what that means is that you can tag me on github, and I'll take a look,
yes : ) I'm 'pabloem'.

On Mon, Jan 13, 2020 at 9:59 AM Pablo Estrada  wrote:

> I reviewed the first PR, so I'm happy to review others.
>
> On Mon, Jan 13, 2020 at 9:42 AM Robert Bradshaw 
> wrote:
>
>> One thing you could do is look at the history [1] of the file and see if
>> there are any possible candidates (e.g. Apache Beam committers [2]).
>>
>> [1]
>> https://github.com/ocworld/beam/blame/259f6174ce52e6317a5b4fe7ed3a126153d3/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py
>> [2] https://projects.apache.org/committee.html?beam
>>
>> If nothing obvious comes up, feel free to create the pull requests and
>> ask on this list if anyone would be able to review (giving some
>> context, e.g. adding some options to the S3 filesystem in Python).
>>
>> On Mon, Jan 13, 2020 at 9:31 AM Keunhyun Oh  wrote:
>> >
>> > I want to make a pull request for BEAM-9094 (
>> https://issues.apache.org/jira/browse/BEAM-9094)
>> >
>> > My tree is
>> https://github.com/ocworld/beam/tree/BEAM-9094-add-aws-s3-options
>> >
>> > When trying to create a pull request, I need to assign
>> reviewers.
>> >
>> > Who can review my request?
>> >
>>
>


Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Reuven Lax
SchemaCoder today recursively sets UUIDs for all schemas, including logical
types, in setSchemaIds. Is it possible that your changes modified that
logic somehow?

On Mon, Jan 13, 2020 at 9:39 AM Alex Van Boxel  wrote:

> This is the stacktrace:
>
>
> java.lang.IllegalStateException at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
> at
> org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380)
> at
> org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:371)
> at
> org.apache.beam.sdk.coders.RowCoderGenerator.createComponentCoders(RowCoderGenerator.java:337)
> at
> org.apache.beam.sdk.coders.RowCoderGenerator.generate(RowCoderGenerator.java:140)
> at
> org.apache.beam.sdk.schemas.SchemaCoder.getDelegateCoder(SchemaCoder.java:159)
> at org.apache.beam.sdk.schemas.SchemaCoder.toString(SchemaCoder.java:204)
> at java.lang.String.valueOf(String.java:2994) at
> java.lang.StringBuilder.append(StringBuilder.java:131) at
> org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
> at
> org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
> at
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:623)
> at
> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:539)
> at
> org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:400)
> at
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:125)
> at
> org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:64)
> at
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
> at
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
> at
> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
> at
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
> at
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
> at
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
> at
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
> at
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
> at
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
> at
> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
>
> The failing check is the LogicalType case. Setting the UUID to a fixed
> value makes it run on Dataflow. Note that the pipeline works perfectly on
> DirectRunner.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Jan 13, 2020 at 6:06 PM Reuven Lax  wrote:
>
>> I don't think that should be the case. Also SchemaCoder will
>> automatically set the UUID for such logical types.
>>
>> On Mon, Jan 13, 2020 at 8:24 AM Alex Van Boxel  wrote:
>>
>> OK, I've rechecked everything and eventually found the problem. The
>> problem is when you use a LogicalType backed by a Row; then the UUID
>> needs to be set to make it work. (This is the case for Proto-based
>> Timestamps.) I'll create a fix.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Mon, Jan 13, 2020 at 8:36 AM Reuven Lax  wrote:
>>>
 Can you elucidate? All BeamSQL pipelines use schemas and I believe
 those tests are working just fine on the Dataflow runner. In addition, there
 are a number of ValidatesRunner schema-aware pipelines that are running
 regularly on the Dataflow runner.

 On Sun, Jan 12, 2020 at 1:43 AM Alex Van Boxel 
 wrote:

> Hey all,
>
> Has anyone tried master with a *schema aware pipeline* on Dataflow? I'm
> testing some PRs to see if they run on Dataflow (as they are working on
> Direct) but they got:
>
>
> Workflow failed. Causes: The Dataflow job appears to be stuck because
> no worker activity has been seen in the last 1h. You can get help with
> Cloud Dataflow at
>
> Because I got this, I wanted to see if master (without my changes) also
> has the same behaviour.
>
> It's easy to simulate: just read from BigQuery with:
>

Re: Jenkins jobs not running for my PR 10438

2020-01-13 Thread Tomo Suzuki
Thanks Yifan (but Java Precommit is still missing).
Can somebody run "Run Java PreCommit" on
https://github.com/apache/beam/pull/10554?


On Mon, Jan 13, 2020 at 2:59 AM Yifan Zou  wrote:
>
> done.
>
> On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki  wrote:
>>
>> Hi Beam committers,
>>
>> Four Jenkins jobs did not report back for this PR
>> https://github.com/apache/beam/pull/10554 .
>> Can somebody trigger them?
>>
>> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud  wrote:
>> >
>> > Done.
>> >
>> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki  wrote:
>> >>
>> >> Hi Beam developers,
>> >>
>> >> I'd appreciate it if a committer could trigger the precommit build for
>> >> https://github.com/apache/beam/pull/10554.
>> >>
>> >> In addition to normal precommit checks, I want the following:
>> >> Run Java PostCommit
>> >> Run Java HadoopFormatIO Performance Test
>> >> Run BigQueryIO Streaming Performance Test Java
>> >> Run Dataflow ValidatesRunner
>> >> Run Spark ValidatesRunner
>> >> Run SQL Postcommit
>> >>
>> >> Regards,
>> >> Tomo
>>
>>
>>
>> --
>> Regards,
>> Tomo



-- 
Regards,
Tomo


Re: Beam Summit North America 2019 - recordings

2020-01-13 Thread Pablo Estrada
Thanks Matthias!

On Sun, Jan 12, 2020 at 7:51 AM Matthias Baetens 
wrote:

> Hi everyone,
>
> It's our pleasure to share the recordings from the Beam Summit North
> America 2019.
> Please find them in the YouTube playlist on the Apache Beam channel.
>
> Best regards,
> Matthias
>


[RESULT] [VOTE] Vendored Dependencies Release

2020-01-13 Thread Luke Cwik
[RESULT] [VOTE] Vendored Dependencies Release

I'm happy to announce that we have unanimously approved this release.

There are 6 approving votes, 4 of which are binding:
* Luke Cwik
* Pablo Estrada
* Ahmet Altay
* Kenneth Knowles

There are no disapproving votes.

Thanks everyone!

On 2020/01/11 03:00:14, Kai Jiang  wrote: 
> +1 (non-binding)
> 
> On Thu, Jan 9, 2020 at 8:48 PM jincheng sun 
> wrote:
> 
> > +1, checked the list as follows:
> >  - verified the hash and signature
> >  - verified that there is no linkage errors
> >  - verified that the content of the pom is expected: the shaded
> > dependencies are not exposed, the scope of the logging dependencies are
> > runtime, etc.
> >
> > Best,
> > Jincheng
> >
> > Kenneth Knowles wrote on Fri, Jan 10, 2020 at 12:29:
> >
> >> +1
> >>
> >> On Thu, Jan 9, 2020 at 4:03 PM Ahmet Altay  wrote:
> >>
> >>> +1
> >>>
> >>> On Thu, Jan 9, 2020 at 2:04 PM Pablo Estrada  wrote:
> >>>
>  +1
> 
>  verified sha1 and md5 hashes.
> 
>  On Thu, Jan 9, 2020 at 10:28 AM Luke Cwik  wrote:
> 
> > +1
> >
> > I validated that no classes appeared outside of the
> > org.apache.beam.vendor.grpc.v1p26p0 namespace and I also validated that 
> > the
> > linkage checker listed no potential linkage errors.
> >
> > On Thu, Jan 9, 2020 at 10:25 AM Luke Cwik  wrote:
> >
> >> Please review the release of the following artifacts that we vendor:
> >>  * beam-vendor-grpc-1_26_0
> >>
> >> Hi everyone,
> >> Please review and vote on the release candidate #1 for the version
> >> 0.1, as follows:
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific comments)
> >>
> >>
> >> The complete staging area is available for your review, which
> >> includes:
> >> * the official Apache source release to be deployed to
> >> dist.apache.org [1], which is signed with the key with
> >> fingerprint EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
> >> * all artifacts to be deployed to the Maven Central Repository [3],
> >> * commit hash "e60d49bdf1ed85e8f3efa1da784227f381a9e085" [4],
> >>
> >> The vote will be open for at least 72 hours. It is adopted by
> >> majority approval, with at least 3 PMC affirmative votes.
> >>
> >> Thanks,
> >> Release Manager
> >>
> >> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
> >> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> >> [3]
> >> https://repository.apache.org/content/repositories/orgapachebeam-1089/
> >> [4]
> >> https://github.com/apache/beam/commit/e60d49bdf1ed85e8f3efa1da784227f381a9e085
> >>
> > --
> >
> > Best,
> > Jincheng
> > -
> > Twitter: https://twitter.com/sunjincheng121
> > -
> >
> 


Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
This is the stacktrace:


java.lang.IllegalStateException at
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
at
org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380)
at
org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:371)
at
org.apache.beam.sdk.coders.RowCoderGenerator.createComponentCoders(RowCoderGenerator.java:337)
at
org.apache.beam.sdk.coders.RowCoderGenerator.generate(RowCoderGenerator.java:140)
at
org.apache.beam.sdk.schemas.SchemaCoder.getDelegateCoder(SchemaCoder.java:159)
at org.apache.beam.sdk.schemas.SchemaCoder.toString(SchemaCoder.java:204)
at java.lang.String.valueOf(String.java:2994) at
java.lang.StringBuilder.append(StringBuilder.java:131) at
org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
at
org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
at
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:623)
at
org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:539)
at
org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:400)
at
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:125)
at
org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:64)
at
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
at
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
at
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
at
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
at
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
at
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
at
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
at
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
at
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
at
org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

The failing check is the LogicalType case. Setting the UUID to a fixed value
makes it run on Dataflow. Note that the pipeline works perfectly on
DirectRunner.

 _/
_/ Alex Van Boxel


On Mon, Jan 13, 2020 at 6:06 PM Reuven Lax  wrote:

> I don't think that should be the case. Also SchemaCoder will automatically
> set the UUID for such logical types.
>
> On Mon, Jan 13, 2020 at 8:24 AM Alex Van Boxel  wrote:
>
>> OK, I've rechecked everything and eventually found the problem. The
>> problem is when you use a LogicalType backed by a Row; then the UUID
>> needs to be set to make it work. (This is the case for Proto-based
>> Timestamps.) I'll create a fix.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Mon, Jan 13, 2020 at 8:36 AM Reuven Lax  wrote:
>>
>>> Can you elucidate? All BeamSQL pipelines use schemas and I believe those
>>> tests are working just fine on the Dataflow runner. In addition, there are a
>>> number of ValidatesRunner schema-aware pipelines that are running regularly
>>> on the Dataflow runner.
>>>
>>> On Sun, Jan 12, 2020 at 1:43 AM Alex Van Boxel  wrote:
>>>
 Hey all,

 Has anyone tried master with a *schema aware pipeline* on Dataflow? I'm
 testing some PRs to see if they run on Dataflow (as they are working on
 Direct) but they got:


 Workflow failed. Causes: The Dataflow job appears to be stuck because
 no worker activity has been seen in the last 1h. You can get help with
 Cloud Dataflow at

 Because I got this, I wanted to see if master (without my changes) also
 has the same behaviour.

 It's easy to simulate: just read from BigQuery with:

 BigQueryIO.readTableRowsWithSchema()

 it works with the classic: readTableRows().

  _/
 _/ Alex Van Boxel

>>>


Re: outputWithTimestamp

2020-01-13 Thread Reuven Lax
Correct. This API is merged into Beam, so should be included in the next
Beam release.
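
For reference, a minimal sketch of holding the watermark with this API once
it is available (Event, Session, and getLoginTime are hypothetical
stand-ins, and the 30-minute timeout is illustrative; assumes the
Timer.withOutputTimestamp signature as merged at head):

    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    class SessionFn extends DoFn<KV<String, Event>, Session> {
      @TimerId("sessionEnd")
      private final TimerSpec endSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(ProcessContext c, @TimerId("sessionEnd") Timer end) {
        Instant loginTime = c.element().getValue().getLoginTime();
        // Fire once the session is judged over, but hold the output
        // watermark at the login timestamp until the timer fires.
        end.withOutputTimestamp(loginTime)
            .set(c.timestamp().plus(Duration.standardMinutes(30)));
      }

      @OnTimer("sessionEnd")
      public void onSessionEnd(OnTimerContext c) {
        // Output emitted here can carry the held (login) timestamp, since
        // the downstream watermark could not pass loginTime while the
        // timer was pending.
      }
    }

The held output timestamp keeps the downstream CalendarWindow's watermark
from advancing past the oldest open login, without any allowed-skew tricks.
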

On Mon, Jan 13, 2020 at 4:00 AM Aaron Dixon  wrote:

> Reuven, thank you much for your help and the clarity here, it's very
> helpful..
>
> Per your solution #2 -- This approach makes sense, seems semantically
> right, and something I'll explore when timer.withOutputTimestamp(t) is
> released. Just for clarity, there is no other way in Beam
> (mid-pipeline/post-Source) for me to effect a hold on the watermark today
> until this API is released, correct?
>
> On Mon, Jan 13, 2020 at 1:22 AM Reuven Lax  wrote:
>
>> Semantically though, since you want the CalendarWindow aggregation to be
>> based on login timestamps, the watermark should be tracking the login
>> timestamps. The watermark is a way for the CalendarWindow to know that as
>> far as the system knows, there will be no more events that fall into that
>> window. You say that long sessions are holding back the watermark, but
>> that's exactly because those long sessions mean that there is still data
>> pending for that CalendarWindow, so it is still incomplete! The above
>> techniques might appear to solve this, but do so at the expense of somewhat
>> randomly causing data to be late or worse dropped.
>>
>> There are a couple of ways I would address this:
>>
>> 1. The simplest would be to allow the watermark to track the login
>> window, but put a trigger on the CalendarWindow (e.g. trigger every 10
>> seconds). That way whenever the trigger fires you can update the results so
>> far for that window. This means that the majority of sessions that are
>> complete can be output without needing to wait for the long sessions, yet
>> the window will remain open waiting for those long sessions to complete.
>>
>> 2. Another possibility is to explicitly identify those extra-long
>> sessions, and handle them differently. This I think is a better solution
>> than the above timestampSkew solution, because it's deterministic: you know
>> exactly which sessions you are handling differently. I would do this by
>> using the state+timers API to calculate the sessions, instead of the
>> sessions WindowFn. When a session is overly long, then you can stop setting
>> the watermark hold for the login time, essentially removing that long
>> session from the watermark calculation.
>>
>> One possibility for how to handle the long sessions "differently" would
>> still involve using withAllowedTimestampSkew. This still risks losing some
>> of these (if the skew ever happens to be larger than the static value you
>> set, you won't be able to output the session). However now you know
>> you're limiting the skewed output to only those specific long sessions
>> you've chosen, which is much better than emitting all records with skew and
>> hoping that things work out.
>>
>> Reuven
>>
>> On Sun, Jan 12, 2020 at 12:07 PM Aaron Dixon  wrote:
>>
>>> Reuven thanks -- I understand each point although I'm trying to grapple
>>> with your concerns expressed in #3; they don't seem avoidable even w/o the
>>> allowedSkew feature.
>>>
>>> Considering your response I see a revision to my solution that omits
>>> using the allowed skew configuration but as far as I can tell still has the
>>> concerns from #3 (i.e., difficulty in reasoning about which events may be
>>> dropped.)
>>>
>>> My pipeline using the skew config looks like this:
>>>
>>> (1) CustomSessionWindow
>>> emits -> (user, login, logout) @ 
>>> (2) ParDo
>>> -> re-emits same tuple but w/ *login* timestamp
>>> (requires custom allowed-skew)
>>> (3) CalendarWindow
>>> -> >> timestamp>
>>>
>>> Instead, I can write a CustomCalendarWindow that places the tuple
>>> element in the right window based on the *login* timestamp, avoiding the
>>> need for the middle/skewing ParDo:
>>>
>>> (1) CustomSessionWindow
>>> -> (user, login, logout) @ 
>>> (2) CustomCalendarWindow
>>> -> <*explicitly* places element in window based on the **login**
>>> timestamp>
>>>
>>> So the use of the ParDo was simply a way to avoid having to write a
>>> custom window; it essentially ensures the CalendarWindow windows based on
>>> login time.
>>>
>>> But I don't see how your concerns in #3 are obviated by this revision.
>>> Elements going in to the calendar window may be already late...this is
>>> something that any (multi-stage) Beam pipeline has to contend with, even
>>> without the deprecated allowedSkew facility, no?
>>>
>>> In other words both of these pipelines are semantically, behaviorally
>>> identical. The former just had the benefit of not requiring a custom window
>>> implementation.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jan 12, 2020 at 12:12 PM Reuven Lax  wrote:
>>>
 A few comments:

 1. Yes, this already works on Dataflow (at Beam head). Flink support is
 pending at pr/10534.

 2. Just to make sure we're on the same page: getAllowedTimestampSkew is
 _not_ about outputting behind the watermark. Rather it's about outputting a
 

Ask about beam pull requests

2020-01-13 Thread Keunhyun Oh
I want to make a pull request for BEAM-9094 (
https://issues.apache.org/jira/browse/BEAM-9094)

My tree is https://github.com/ocworld/beam/tree/BEAM-9094-add-aws-s3-options

When trying to create a pull request, I need to assign
reviewers.

Who can review my request?


Re: Ask about beam pull requests

2020-01-13 Thread Robert Bradshaw
One thing you could do is look at the history [1] of the file and see if
there are any possible candidates (e.g. Apache Beam committers [2]).

[1] 
https://github.com/ocworld/beam/blame/259f6174ce52e6317a5b4fe7ed3a126153d3/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py
[2] https://projects.apache.org/committee.html?beam

If nothing obvious comes up, feel free to create the pull requests and
ask on this list if anyone would be able to review (giving some
context, e.g. adding some options to the S3 filesystem in Python).

On Mon, Jan 13, 2020 at 9:31 AM Keunhyun Oh  wrote:
>
> I want to make a pull request for BEAM-9094
> (https://issues.apache.org/jira/browse/BEAM-9094)
>
> My tree is https://github.com/ocworld/beam/tree/BEAM-9094-add-aws-s3-options
>
> When trying to create a pull request, I need to assign
> reviewers.
>
> Who can review my request?
>


Failing Java PostCommit for Dataflow runner

2020-01-13 Thread Kirill Kozlov
Hello everyone!

I have noticed that Jenkins tests for Dataflow runner [1] are failing with
a runtime exception. It looks like the issue originated here [2] (failed
Dataflow job [3]).
We should look into fixing it.

Failing test:
:runners:google-cloud-dataflow-java:validatesRunnerLegacyWorkerTest »
org.apache.beam.sdk.transforms.ParDoTest$TimerTests » testOutputTimestamp
(29.723s)

Exception thrown:

java.lang.RuntimeException: Workflow failed. Causes: Unknown streaming
source: test_stream


[1]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/
[2]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/4157/
[3]
https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2020-01-08_14_36_07-180707589785292440?project=apache-beam-testing


Re: Jenkins jobs not running for my PR 10438

2020-01-13 Thread Ismaël Mejía
done

On Mon, Jan 13, 2020 at 2:39 PM Yoshiki Obata 
wrote:

> Hi Beam committers
>
> It would be appreciated if anyone could trigger the Python precommit job
> on this PR:
> https://github.com/apache/beam/pull/10141
>
> Regards,
> Yoshiki
>


Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
OK, I've rechecked everything and eventually found the problem. The problem
is when you use a LogicalType backed by a Row; then the UUID needs to be
set to make it work. (This is the case for Proto-based Timestamps.) I'll
create a fix.

 _/
_/ Alex Van Boxel


On Mon, Jan 13, 2020 at 8:36 AM Reuven Lax  wrote:

> Can you elucidate? All BeamSQL pipelines use schemas and I believe those
> tests are working just fine on the Dataflow runner. In addition, there are a
> number of ValidatesRunner schema-aware pipelines that are running regularly
> on the Dataflow runner.
>
> On Sun, Jan 12, 2020 at 1:43 AM Alex Van Boxel  wrote:
>
>> Hey all,
>>
>> Has anyone tried master with a *schema aware pipeline* on Dataflow? I'm
>> testing some PRs to see if they run on Dataflow (as they are working on
>> Direct) but they got:
>>
>>
>> Workflow failed. Causes: The Dataflow job appears to be stuck because no
>> worker activity has been seen in the last 1h. You can get help with Cloud
>> Dataflow at
>>
>> Because I got this, I wanted to see if master (without my changes) also
>> has the same behaviour.
>>
>> It's easy to simulate: just read from BigQuery with:
>>
>> BigQueryIO.readTableRowsWithSchema()
>>
>> it works with the classic: readTableRows().
>>
>>  _/
>> _/ Alex Van Boxel
>>
>


Re: Jenkins jobs not running for my PR 10438

2020-01-13 Thread Yoshiki Obata
Hi Beam committers

It would be appreciated if anyone could trigger the Python precommit job on this PR:
https://github.com/apache/beam/pull/10141

Regards,
Yoshiki


NYC ? (or more generally East Coast)

2020-01-13 Thread Austin Bennett
Hi Devs and Users,

We are looking for speakers for future Meetups and Events.  Who is
building cool things with Beam?  We are looking at hosting a Meetup at
Spotify in February, and ideally keep some meetups going throughout
the year.  For this to occur, we need to hear about what people are
working on!  Even if only a small/lightning talk, etc, do reach out!
Let's figure something out.

Cheers,
Austin

P.S.  https://www.meetup.com/New-York-Apache-Beam/

P.P.S.  We also have budding communities in DC and Boston, will
eventually write in separate threads on those.


Re: Failing Java PostCommit for Dataflow runner

2020-01-13 Thread Luke Cwik
This is being tracked in BEAM-9083

On Mon, Jan 13, 2020 at 11:23 AM Boyuan Zhang  wrote:

> Thanks Kirill! I'm going to look into it.
>
> On Mon, Jan 13, 2020 at 11:18 AM Kirill Kozlov 
> wrote:
>
>> Hello everyone!
>>
>> I have noticed that Jenkins tests for Dataflow runner [1] are failing
>> with a runtime exception. It looks like the issue originated here [2]
>> (failed Dataflow job [3]).
>> We should look into fixing it.
>>
>> Failing test:
>> :runners:google-cloud-dataflow-java:validatesRunnerLegacyWorkerTest »
>> org.apache.beam.sdk.transforms.ParDoTest$TimerTests » testOutputTimestamp
>> (29.723s)
>>
>> Exception thrown:
>>
>> java.lang.RuntimeException: Workflow failed. Causes: Unknown streaming 
>> source: test_stream
>>
>>
>> [1]
>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/
>> [2]
>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/4157/
>> [3]
>> https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2020-01-08_14_36_07-180707589785292440?project=apache-beam-testing
>>
>


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Rui Wang
If it indeed happened as you have described, I will be very interested in
the expected behaviour.

Something I remember from before: the trigger condition being met just gives
the runner/engine "permission" to fire, but the runner/engine may not fire
immediately. I don't know, though, whether the engine/runner guarantees to
fire.



-Rui

On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:

> I have the following trigger:
>
> .apply(Window.configure()
>     .triggering(AfterWatermark.pastEndOfWindow()
>         .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
>     .accumulatingFiredPanes()
>     .withAllowedLateness(Duration.ZERO));
>
> But in Dataflow I notice that I never get an ON_TIME firing for my window
> -- I only see early firing for elements, and then nothing.
>
> My assumption is that AfterWatermark should give me a last, on-time pane
> under this configuration when the watermark surpasses the window's end.
>
> Is my expectation correct?
>


Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
So I think the following happens:

   1. the schema tree is initialized at construction time. The tree gets
   serialized and sent to the workers
   2. the workers deserialize the tree, but as the Timestamp logical type
   has a *static* schema, that schema will be *re-initialized without the
   UUID* (as it was never serialized)
   3. this is why setting a fixed UUID at static initialization works

So the solutions are (see the sketch below):

   - as tested, set a fixed UUID
   - make the schema not static in the logical type
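
A minimal sketch of the first option (the field names and the UUID value
are arbitrary placeholders; assumes Schema.setUUID from the Beam Java SDK):

    private static final Schema TIMESTAMP_BASE_SCHEMA =
        Schema.builder()
            .addInt64Field("seconds")
            .addInt32Field("nanos")
            .build();

    static {
      // Runs again on every worker at class load, so the re-initialized
      // static schema carries the same UUID as the serialized schema tree.
      TIMESTAMP_BASE_SCHEMA.setUUID(
          UUID.fromString("5c3f2a1e-0d4b-4c6a-9b7e-2f1d3e4a5b6c"));
    }
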

 _/
_/ Alex Van Boxel


On Mon, Jan 13, 2020 at 8:08 PM Reuven Lax  wrote:

> SchemaCoder today recursively sets UUIDs for all schemas, including
> logical types, in setSchemaIds. Is it possible that your changes modified
> that logic somehow?
>
> On Mon, Jan 13, 2020 at 9:39 AM Alex Van Boxel  wrote:
>
>> This is the stacktrace:
>>
>>
>> java.lang.IllegalStateException at
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
>> at
>> org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380)
>> at
>> org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:371)
>> at
>> org.apache.beam.sdk.coders.RowCoderGenerator.createComponentCoders(RowCoderGenerator.java:337)
>> at
>> org.apache.beam.sdk.coders.RowCoderGenerator.generate(RowCoderGenerator.java:140)
>> at
>> org.apache.beam.sdk.schemas.SchemaCoder.getDelegateCoder(SchemaCoder.java:159)
>> at org.apache.beam.sdk.schemas.SchemaCoder.toString(SchemaCoder.java:204)
>> at java.lang.String.valueOf(String.java:2994) at
>> java.lang.StringBuilder.append(StringBuilder.java:131) at
>> org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
>> at
>> org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
>> at
>> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:623)
>> at
>> org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:539)
>> at
>> org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:400)
>> at
>> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:125)
>> at
>> org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:64)
>> at
>> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
>> at
>> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
>> at
>> org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>> at
>> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>> at
>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
>> at
>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
>> at
>> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
>> at
>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>> at
>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>> at
>> org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:748)
>>
>> The failing check is the LogicalType case. Setting the UUID to a fixed
>> value makes it run on Dataflow. Note that the pipeline works perfectly on
>> DirectRunner.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Mon, Jan 13, 2020 at 6:06 PM Reuven Lax  wrote:
>>
>>> I don't think that should be the case. Also SchemaCoder will
>>> automatically set the UUID for such logical types.
>>>
>>> On Mon, Jan 13, 2020 at 8:24 AM Alex Van Boxel  wrote:
>>>
 OK, I've rechecked everything and eventually found the problem. The
 problem is when you use a LogicalType backed by a Row; then the UUID
 needs to be set to make it work. (This is the case for Proto-based
 Timestamps.) I'll create a fix.

  _/
 _/ Alex Van Boxel


 On Mon, Jan 13, 2020 at 8:36 AM Reuven Lax  wrote:

> Can you elucidate? All BeamSQL pipelines use schemas and I believe
> those tests are working just fine on the Dataflow runner. In addition,
> there
> are a number of ValidatesRunner schema-aware pipelines that 

No AfterWatermark firings in Dataflow

2020-01-13 Thread Aaron Dixon
I have the following trigger:

.apply(Window.configure()
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
    .accumulatingFiredPanes()
    .withAllowedLateness(Duration.ZERO));

But in Dataflow I notice that I never get an ON_TIME firing for my window
-- I only see early firing for elements, and then nothing.

My assumption is that AfterWatermark should give me a last, on-time pane
under this configuration when the watermark surpasses the window's end.

Is my expectation correct?


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Luke Cwik
I would have expected an empty on time pane since the default on time
behavior is FIRE_ALWAYS.
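
A sketch of making that on-time behavior explicit in the window
configuration (element type T is a placeholder; assumes Window.OnTimeBehavior
from the Beam Java SDK):

    .apply(Window.<T>configure()
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO)
        // FIRE_ALWAYS (the default) emits an ON_TIME pane even when no new
        // elements arrived; FIRE_IF_NON_EMPTY suppresses empty on-time panes.
        .withOnTimeBehavior(Window.OnTimeBehavior.FIRE_ALWAYS));
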

On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:

> Can anyone confirm?
>
> This is intermittent. Some windows (sparse ones, it seems) don't get an
> ON_TIME firing after the watermark. Is this a bug, or is there a reason not
> to expect ON_TIME firings for every window?
>
> On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>
>> If it indeed happened as you have described, I will be very interested in
>> the expected behaviour.
>>
>> Something I remember from before: the trigger condition being met just
>> gives the runner/engine "permission" to fire, but the runner/engine may
>> not fire immediately. I don't know, though, whether the engine/runner
>> guarantees to fire.
>>
>>
>>
>> -Rui
>>
>> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:
>>
>>> I have the following trigger:
>>>
>>> .apply(Window.configure()
>>>     .triggering(AfterWatermark.pastEndOfWindow()
>>>         .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
>>>     .accumulatingFiredPanes()
>>>     .withAllowedLateness(Duration.ZERO));
>>>
>>> But in Dataflow I notice that I never get an ON_TIME firing for my
>>> window -- I only see early firing for elements, and then nothing.
>>>
>>> My assumption is that AfterWatermark should give me a last, on-time pane
>>> under this configuration when the watermark surpasses the window's end.
>>>
>>> Is my expectation correct?
>>>
>>


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Aaron Dixon
Yes. I'm using calendar day-based windows and the watermark is completely
caught up to today; the calendar window ended several days ago. I got EARLY
panes for each element but never an ON_TIME pane.

On Mon, Jan 13, 2020 at 4:16 PM Luke Cwik  wrote:

> Is the watermark advancing past the end of the window?
>
> On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon  wrote:
>
>> The window is not empty fwiw; it has elements; I get an early firing pane
>> for the window but well after the watermark passes there is no ON_TIME
>> pane. Would this be a bug in Dataflow? Seems fundamental, so I'm concerned
>> perhaps the Beam spec doesn't obligate ON_TIME firings?
>>
>>
>>
>>
>>
>> On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:
>>
>>> I would have expected an empty on time pane since the default on time
>>> behavior is FIRE_ALWAYS.
>>>
>>> On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:
>>>
 Can anyone confirm?

 This is intermittent. Some windows (sparse ones, it seems) don't get an
 ON_TIME firing after the watermark. Is this a bug, or is there a reason not
 to expect ON_TIME firings for every window?

 On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:

> If it indeed happened as you have described, I will be very interested
> in the expected behaviour.
>
> Something I remember from before: the trigger condition being met just
> gives the runner/engine "permission" to fire, but the runner/engine may not
> fire immediately. I don't know, though, whether the engine/runner
> guarantees to fire.
>
>
>
> -Rui
>
> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:
>
>> I have the following trigger:
>>
>> .apply(Window.configure()
>>     .triggering(AfterWatermark.pastEndOfWindow()
>>         .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
>>     .accumulatingFiredPanes()
>>     .withAllowedLateness(Duration.ZERO));
>>
>> But in Dataflow I notice that I never get an ON_TIME firing for my
>> window -- I only see early firing for elements, and then nothing.
>>
>> My assumption is that AfterWatermark should give me a last, on-time
>> pane under this configuration when the watermark surpasses the window's 
>> end.
>>
>> Is my expectation correct?
>>
>


Re: BigQueryUtils improvements for Avro Bytes / Timestamp (millis)

2020-01-13 Thread Ryan Berti
Hello,

I've got a PR that affects the java gcp component, specifically
BigQueryUtils. Can anyone help me with a review? I've tagged the owner of
the component on the PR but haven't heard anything for a week, so I figured
I'd send an e-mail to this list.

https://issues.apache.org/jira/browse/BEAM-9051
https://github.com/apache/beam/pull/10540

Thanks!
Ryan

On Thu, Jan 2, 2020 at 6:15 PM Ryan Berti  wrote:

> Hello,
>
> I just wanted to send a quick note to the mailing list letting you know I
> opened two tickets:
>
> https://issues.apache.org/jira/browse/BEAM-9051
> https://issues.apache.org/jira/browse/BEAM-9052
>
> In both cases, I encountered the lack of functionality after using an
> external library to generate Avro GenericRecords from Scala case classes.
> These records could be written via ParquetIO without any issues, but when
> converting them to Beam Rows, and then BigQuery TableRows (via AvroUtils
> and BigQueryUtils), I found that the records were rejected by BigQuery.
> I'll set up some unit tests and open PRs in the next week or so to add this
> functionality.
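
A sketch of the conversion chain being described (record and schema
construction elided; genericRecord and beamSchema are placeholders; assumes
AvroUtils.toBeamRowStrict and BigQueryUtils.toTableRow from the Beam Java SDK):

    // Avro GenericRecord (from the external library) -> Beam Row -> TableRow.
    Row row = AvroUtils.toBeamRowStrict(genericRecord, beamSchema);
    TableRow tableRow = BigQueryUtils.toTableRow(row);
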
>
> Thanks!
> Ryan
>


Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Brian Hulette
I guess these are the first logical types we've defined with a base type of
row. It does seem reasonable that a static schema for a logical type could
have some fixed id, but it feels odd to have a fixed UUID; it would be nice
if we could give the schema some meaningful static identifier.

I think Reuven was investigating adding support for "named" schemas in
order to add support for recursive schemas, but ran into some issues. Maybe
something like that is what we need here?

On Mon, Jan 13, 2020 at 12:26 PM Alex Van Boxel  wrote:

> Fix in this PR:
>
> [BEAM-9113] Fix serialization proto logical types
> https://github.com/apache/beam/pull/10569
>
> or we all agree to *promote* the logical types to top-level logical types
> (as described in the design document, see ticket):
>
> [BEAM-9037] Instant and duration as logical type
> https://github.com/apache/beam/pull/10486
>
>
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Jan 13, 2020 at 8:40 PM Alex Van Boxel  wrote:
>
>> So I think the following happens:
>>
>>    1. the schema tree is initialized at construction time. The tree gets
>>    serialized and sent to the workers
>>    2. the workers deserialize the tree, but as the Timestamp logical
>>    type has a *static* schema, that schema will be *re-initialized
>>    without the UUID* (as it was never serialized)
>>    3. this is why setting a fixed UUID at static initialization works
>>
>> So the solutions are:
>>
>>    - as tested, set a fixed UUID
>>    - make the schema not static in the logical type
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Mon, Jan 13, 2020 at 8:08 PM Reuven Lax  wrote:
>>
>>> SchemaCoder today recursively sets UUIDs for all schemas, including
>>> logical types, in setSchemaIds. Is it possible that your changes modified
>>> that logic somehow?
>>>
>>> On Mon, Jan 13, 2020 at 9:39 AM Alex Van Boxel  wrote:
>>>
 This is the stacktrace:


 java.lang.IllegalStateException at
 org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
 at
 org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380)
 at
 org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:371)
 at
 org.apache.beam.sdk.coders.RowCoderGenerator.createComponentCoders(RowCoderGenerator.java:337)
 at
 org.apache.beam.sdk.coders.RowCoderGenerator.generate(RowCoderGenerator.java:140)
 at
 org.apache.beam.sdk.schemas.SchemaCoder.getDelegateCoder(SchemaCoder.java:159)
 at org.apache.beam.sdk.schemas.SchemaCoder.toString(SchemaCoder.java:204)
 at java.lang.String.valueOf(String.java:2994) at
 java.lang.StringBuilder.append(StringBuilder.java:131) at
 org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
 at
 org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
 at
 org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:623)
 at
 org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:539)
 at
 org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:400)
 at
 org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:125)
 at
 org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:64)
 at
 org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
 at
 org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
 at
 org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
 at
 org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
 at
 org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
 at
 org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
 at
 org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
 at
 org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
 at
 org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
 at
 org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at

Re: NYC ? (or more generally East Coast)

2020-01-13 Thread Suneel Marthi
I can do talks in either DC or NYC meetups.  I can coordinate with
CapitalOne to see if they would be willing to host the DC meetup.

On Mon, Jan 13, 2020 at 4:02 PM Austin Bennett 
wrote:

> Hi Devs and Users,
>
> We are looking for speakers for future Meetups and Events.  Who is
> building cool things with Beam?  We are looking at hosting a Meetup at
> Spotify in February, and ideally keep some meetups going throughout
> the year.  For this to occur, we need to hear about what people are
> working on!  Even if only a small/lightning talk, etc, do reach out!
> Let's figure something out.
>
> Cheers,
> Austin
>
> P.S.  https://www.meetup.com/New-York-Apache-Beam/
>
> P.P.S.  We also have budding communities in DC and Boston, will
> eventually write in separate threads on those.
>


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Aaron Dixon
The window is not empty fwiw; it has elements; I get an early firing pane
for the window but well after the watermark passes there is no ON_TIME
pane. Would this be a bug in Dataflow? Seems fundamental, so I'm concerned
perhaps the Beam spec doesn't obligate ON_TIME firings?





On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:

> I would have expected an empty on time pane since the default on time
> behavior is FIRE_ALWAYS.
>
> On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:
>
>> Can anyone confirm?
>>
>> This is intermittent. Some windows (sparse ones, it seems) don't get an
>> ON_TIME firing after the watermark. Is this a bug, or is there a reason not
>> to expect ON_TIME firings for every window?
>>
>> On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>>
>>> If it indeed happened as you have described, I will be very interested
>>> in the expected behaviour.
>>>
> >>> Something I remembered before: the trigger condition being met just gives
> >>> the runner/engine "permission" to fire, but the runner/engine may not fire
> >>> immediately. But I don't know if the engine/runner will guarantee to fire.
>>>
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:
>>>
 I have the following trigger:

 .apply(Window
   .configure()
   .triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterPane
 .elementCountAtLeast(1)))
   .accumulatingFiredPanes()
   .withAllowedLateness(Duration.ZERO)

 But in Dataflow I notice that I never get an ON_TIME firing for my
 window -- I only see early firing for elements, and then nothing.

 My assumption is that AfterWatermark should give me a last, on-time
 pane under this configuration when the watermark surpasses the window's 
 end.

 Is my expectation correct?

>>>


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Luke Cwik
Is the watermark advancing past the end of the window?

On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon  wrote:

> The window is not empty fwiw; it has elements; I get an early firing pane
> for the window but well after the watermark passes there is no ON_TIME
> pane. Would this be a bug in Dataflow? Seems fundamental, so I'm concerned
> perhaps the Beam spec doesn't obligate ON_TIME firings?
>
>
>
>
>
> On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:
>
>> I would have expected an empty on time pane since the default on time
>> behavior is FIRE_ALWAYS.
>>
>> On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:
>>
>>> Can anyone confirm?
>>>
>>> This is intermittent. Some (it seems, sparse) windows don't get an
>>> ON_TIME firing after watermark. Is this a bug or is there a reason to not
>>> expect ON_TIME firings for every window?
>>>
>>> On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>>>
 If it indeed happened as you have described, I will be very interested
 in the expected behaviour.

 Something I remembered before: the trigger condition being met just gives
 the runner/engine "permission" to fire, but the runner/engine may not fire
 immediately. But I don't know if the engine/runner will guarantee to fire.



 -Rui

 On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:

> I have the following trigger:
>
> .apply(Window
>   .configure()
>   .triggering(AfterWatermark
>.pastEndOfWindow()
>.withEarlyFirings(AfterPane
> .elementCountAtLeast(1)))
>   .accumulatingFiredPanes()
>   .withAllowedLateness(Duration.ZERO)
>
> But in Dataflow I notice that I never get an ON_TIME firing for my
> window -- I only see early firing for elements, and then nothing.
>
> My assumption is that AfterWatermark should give me a last, on-time
> pane under this configuration when the watermark surpasses the window's 
> end.
>
> Is my expectation correct?
>



Re: Failing Java PostCommit for Dataflow runner

2020-01-13 Thread Kirill Kozlov
Thanks for taking care of this!

On Mon, Jan 13, 2020 at 2:00 PM Boyuan Zhang  wrote:

> This problem is addressed by PR10564. Now all affected tests are back to
> green.
>
> On Mon, Jan 13, 2020 at 1:11 PM Luke Cwik  wrote:
>
>> This is being tracked in BEAM-9083
>>
>> On Mon, Jan 13, 2020 at 11:23 AM Boyuan Zhang  wrote:
>>
>>> Thanks Kirill! I'm going to look into it.
>>>
>>> On Mon, Jan 13, 2020 at 11:18 AM Kirill Kozlov 
>>> wrote:
>>>
 Hello everyone!

 I have noticed that Jenkins tests for Dataflow runner [1] are failing
 with a runtime exception. It looks like the issue originated here [2],
 failed Dataflow job [3].
 We should look into fixing it.

 Failing test:
 :runners:google-cloud-dataflow-java:validatesRunnerLegacyWorkerTest »
 org.apache.beam.sdk.transforms.ParDoTest$TimerTests » testOutputTimestamp
 (29.723s)

 Exception thrown:

 java.lang.RuntimeException: Workflow failed. Causes: Unknown streaming 
 source: test_stream


 [1]
 https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/
 [2]
 https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/4157/
 [3]
 https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2020-01-08_14_36_07-180707589785292440?project=apache-beam-testing

>>>


[PROPOSAL] Leveraging SQL TableProviders for Cross-Language IOs

2020-01-13 Thread Brian Hulette
Hi everyone,
I have a proposal that I think can unify two problem sets:
  1) adding more IOs for Beam SQL, and
  2) making more (Row-based) Java IOs available in Python as cross-language
transforms

The basic idea is to create a single cross-language transform that exposes
all Beam SQL IOs via the TableProvider interface. A design document is
available here: https://s.apache.org/xlang-table-provider

Please take a look and let me know what you think. Thanks!
Brian


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Aaron Dixon
Any confirmation on this from anyone? Whether per Beam spec, runners are
obligated to send ON_TIME panes for AfterWatermark triggers? I'm stuck
because this seems fundamental, so it's hard to imagine this is a Dataflow
bug, but OTOH it's also hard to imagine that trigger specs like
AfterWatermark are "optional"... ?

On Mon, Jan 13, 2020 at 4:18 PM Aaron Dixon  wrote:

> Yes. Using calendar day-based windows and watermark is completely caught
> up to today ... calendar window ends several days ago. I got EARLY panes
> for each element but never ON_TIME pane.
>
> On Mon, Jan 13, 2020 at 4:16 PM Luke Cwik  wrote:
>
>> Is the watermark advancing past the end of the window?
>>
>> On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon  wrote:
>>
>>> The window is not empty fwiw; it has elements; I get an early firing
>>> pane for the window but well after the watermark passes there is no ON_TIME
>>> pane. Would this be a bug in Dataflow? Seems fundamental, so I'm concerned
>>> perhaps the Beam spec doesn't obligate ON_TIME firings?
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:
>>>
 I would have expected an empty on time pane since the default on time
 behavior is FIRE_ALWAYS.

 On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:

> Can anyone confirm?
>
> This is intermittent. Some (it seems, sparse) windows don't get an
> ON_TIME firing after watermark. Is this a bug or is there a reason to not
> expect ON_TIME firings for every window?
>
> On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>
>> If it indeed happened as you have described, I will be very
>> interested in the expected behaviour.
>>
>> Something I remembered before: the trigger condition being met just gives
>> the runner/engine "permission" to fire, but the runner/engine may not fire
>> immediately. But I don't know if the engine/runner will guarantee to fire.
>>
>>
>>
>> -Rui
>>
>> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon 
>> wrote:
>>
>>> I have the following trigger:
>>>
>>> .apply(Window
>>>   .configure()
>>>   .triggering(AfterWatermark
>>>.pastEndOfWindow()
>>>.withEarlyFirings(AfterPane
>>> .elementCountAtLeast(1)))
>>>   .accumulatingFiredPanes()
>>>   .withAllowedLateness(Duration.ZERO)
>>>
>>> But in Dataflow I notice that I never get an ON_TIME firing for my
>>> window -- I only see early firing for elements, and then nothing.
>>>
>>> My assumption is that AfterWatermark should give me a last, on-time
>>> pane under this configuration when the watermark surpasses the window's 
>>> end.
>>>
>>> Is my expectation correct?
>>>
>>


Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
Fix in this PR:

[BEAM-9113] Fix serialization proto logical types
https://github.com/apache/beam/pull/10569

or we all agree to *promote* the logical types to top-level logical types
(as described in the design document, see ticket):

[BEAM-9037] Instant and duration as logical type
https://github.com/apache/beam/pull/10486



 _/
_/ Alex Van Boxel


On Mon, Jan 13, 2020 at 8:40 PM Alex Van Boxel  wrote:

> So I think the following happens:
>
>1. the schema tree is initialized at construction time. The tree gets
>serialized and sent to the workers
>2. the workers deserialize the tree, but as the Timestamp logical type
>has a *static* schema, the schema will be *re-initialized
>without the UUID* (as it was never serialized)
>3. this is why setting a fixed UUID at static initialization works
>
> So the solution is either
>
>- as tested, set a fixed UUID, or
>- make the schema not static in the logical type
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Jan 13, 2020 at 8:08 PM Reuven Lax  wrote:
>
>> SchemaCoder today recursively sets UUIDs for all schemas, including
>> logical types, in setSchemaIds. Is it possible that your changes modified
>> that logic somehow?
>>
>> On Mon, Jan 13, 2020 at 9:39 AM Alex Van Boxel  wrote:
>>
>>> This is the stacktrace:
>>>
>>>
>>> java.lang.IllegalStateException
>>> at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
>>> at org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380)
>>> at org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:371)
>>> at org.apache.beam.sdk.coders.RowCoderGenerator.createComponentCoders(RowCoderGenerator.java:337)
>>> at org.apache.beam.sdk.coders.RowCoderGenerator.generate(RowCoderGenerator.java:140)
>>> at org.apache.beam.sdk.schemas.SchemaCoder.getDelegateCoder(SchemaCoder.java:159)
>>> at org.apache.beam.sdk.schemas.SchemaCoder.toString(SchemaCoder.java:204)
>>> at java.lang.String.valueOf(String.java:2994)
>>> at java.lang.StringBuilder.append(StringBuilder.java:131)
>>> at org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
>>> at org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
>>> at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:623)
>>> at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:539)
>>> at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:400)
>>> at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:125)
>>> at org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:64)
>>> at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
>>> at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
>>> at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
>>> at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
>>> at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
>>> at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
>>> at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
>>> at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
>>> at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
>>> at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>> The case highlighted in red in the original message (the checkState in
>>> RowCoderGenerator.getCoder) is the LogicalType one. Setting the UUID to a
>>> fixed value makes it run on Dataflow. Note that the pipeline works
>>> perfectly on DirectRunner.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Mon, Jan 13, 2020 at 6:06 PM Reuven Lax  wrote:
>>>
 I don't think that should be the case. Also SchemaCoder will
 automatically set the UUID for such logical types.

 On Mon, Jan 13, 2020 at 8:24 AM Alex Van Boxel 
 wrote:

> OK, I've rechecked everything and eventually found the 

Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Aaron Dixon
Can anyone confirm?

This is intermittent. Some (it seems, sparse) windows don't get an ON_TIME
firing after watermark. Is this a bug or is there a reason to not expect
ON_TIME firings for every window?

On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:

> If it indeed happened as you have described, I will be very interested in
> the expected behaviour.
>
> Something I remembered before: the trigger condition being met just gives the
> runner/engine "permission" to fire, but the runner/engine may not fire
> immediately. But I don't know if the engine/runner will guarantee to fire.
>
>
>
> -Rui
>
> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:
>
>> I have the following trigger:
>>
>> .apply(Window
>>   .configure()
>>   .triggering(AfterWatermark
>>.pastEndOfWindow()
>>.withEarlyFirings(AfterPane
>> .elementCountAtLeast(1)))
>>   .accumulatingFiredPanes()
>>   .withAllowedLateness(Duration.ZERO)
>>
>> But in Dataflow I notice that I never get an ON_TIME firing for my window
>> -- I only see early firing for elements, and then nothing.
>>
>> My assumption is that AfterWatermark should give me a last, on-time pane
>> under this configuration when the watermark surpasses the window's end.
>>
>> Is my expectation correct?
>>
>


Re: Python IO Connector

2020-01-13 Thread Brian Hulette
Regarding cross-language and Beam rows (and SQL!) - I have a PR up [1] that
adds an example script for using Beam's SqlTransform in Python by
leveraging the portable row coder. Unfortunately I got stalled figuring out
how to build/stage the Java artifacts for the SQL extensions so it hasn't
been merged yet.

I think a cross-language JdbcIO would be quite similar, except it's in core
so there's no issue with additional jars. JdbcIO already has a ReadRows
transform that can produce a PCollection<Row>; we would just need to add an
ExternalTransformBuilder and ExternalTransformRegistrar implementation for
that transform. PubsubIO [2] has a good example of this.

[1] https://github.com/apache/beam/pull/10055
[2]
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L720
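
To make that concrete, here is a rough sketch of what registering
JdbcIO.ReadRows for cross-language use might look like, modeled on the
PubsubIO registrar in [2]. Treat it as illustrative only: the URN, the Config
class, and its field names are invented here, not a settled API (imports and
the exact builder signatures follow the PubsubIO example in [2]):

@AutoService(ExternalTransformRegistrar.class)
public class JdbcReadRowsRegistrar implements ExternalTransformRegistrar {

  @Override
  public Map<String, Class<? extends ExternalTransformBuilder>> knownBuilders() {
    // Hypothetical URN the Python SDK would use to request this transform.
    return ImmutableMap.of(
        "beam:external:java:jdbc:read_rows:v1", ReadRowsBuilder.class);
  }

  // Hypothetical configuration POJO populated from the external SDK's payload.
  public static class Config {
    String driverClassName;
    String jdbcUrl;
    String query;
  }

  public static class ReadRowsBuilder
      implements ExternalTransformBuilder<Config, PBegin, PCollection<Row>> {
    @Override
    public PTransform<PBegin, PCollection<Row>> buildExternal(Config config) {
      // Delegate to the existing JdbcIO.ReadRows transform, which already
      // produces schema'd Rows that the portable row coder can carry across
      // the language boundary.
      return JdbcIO.readRows()
          .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
              config.driverClassName, config.jdbcUrl))
          .withQuery(config.query);
    }
  }
}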

On Tue, Jan 7, 2020 at 4:49 AM Lucas Magalhães <
lucas.magalh...@paralelocs.com.br> wrote:

> Hi Peter.
>
> Why don't you use this external library?
> https://pypi.org/project/beam-nuggets/   They already use SQLAlchemy and
> is pretty easy to use.
>
>
> On Mon, Jan 6, 2020 at 10:17 PM Luke Cwik  wrote:
>
>> Eugene, the JdbcIO output should be updated to support Beam's schema
>> format which would allow for "rows" to cross the language boundaries.
>>
>> If the connector is easy to write and maintain then it makes sense for
>> native. Maybe the Python version will have an easier time to support
>> splitting and hence could overtake the Java implementation in useful
>> features.
>>
>> On Mon, Jan 6, 2020 at 3:55 PM  wrote:
>>
>>> Apache Airflow went for the DB API approach as well, and it seems to
>>> have worked well for them. We will likely need to add extras_require for
>>> each database engine Python package though, which adds some complexity, but
>>> not a lot.
>>>
>>> On Jan 6, 2020, at 6:12 PM, Eugene Kirpichov  wrote:
>>>
>>> Agreed with above, it seems prudent to develop a pure-Python connector
>>> for something as common as interacting with a database. It's likely easier
>>> to achieve an idiomatic API, familiar to non-Beam Python SQL users, within
>>> pure Python.
>>>
>>> Developing a cross-language connector here might be plain impossible,
>>> because rows read from a database are (at least in JDBC) not encodable -
>>> they require a user's callback to translate to an encodable user type, and
>>> the callback can't be in Python because then you have to encode its input
>>> before giving it to Python. Same holds for the write transform.
>>>
>>> Not sure about sqlalchemy though, maybe use plain DB-API
>>> https://www.python.org/dev/peps/pep-0249/ instead? Seems like the
>>> Python one is more friendly than JDBC in the sense that it actually returns
>>> rows as tuples of simple data types.
>>>
>>> On Mon, Jan 6, 2020 at 1:42 PM Robert Bradshaw 
>>> wrote:
>>>
 On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath 
 wrote:

> Regarding cross-language transforms, we need to add better
> documentation, but for now you'll have to go with existing examples and
> tests. For example,
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>
> Note that cross-language transforms feature is currently only
> available for Flink Runner. Dataflow support is in development.
>

 I think it works with all non-Dataflow runners, with the exception of
 the Java and Go Direct runners. (It does work with the Python direct
 runner.)


> I'm fine with developing this natively for Python as well. AFAIK Java
> JDBC IO connector is not a super-complicated connector and it should be
> fine to make relatively easy to maintain and widely usable connectors
> available in multiple SDKs.
>

 Yes, a case can certainly be made for having native connectors for
 particular common/simple sources. (We certainly don't call cross-language
 to read text files for example.)


>
> Thanks,
> Cham
>
>
> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik  wrote:
>
>> +Chamikara Jayalath  +Heejong Lee
>> 
>>
>> On Mon, Jan 6, 2020 at 10:20 AM  wrote:
>>
>>> How do I go about doing that? From the docs, it appears cross
>>> language transforms are
>>> currently undocumented.
>>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>>> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
>>>
>>> What about using a cross language transform between Python and the
>>> already existing Java JdbcIO transform?
>>>
>>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann 
>>> wrote:
>>>
 I’d like to develop the Python SDK’s SQL IO connector. I was
 thinking it would be easiest to use sqlalchemy to achieve maximum 
 database
 engine support, but I 

Re: Jenkins jobs not running for my PR 10438

2020-01-13 Thread Tomo Suzuki
Thank you, Mark and Ismaël.

On Mon, Jan 13, 2020 at 2:34 PM Mark Liu  wrote:
>
> done
>
> On Mon, Jan 13, 2020 at 8:03 AM Tomo Suzuki  wrote:
>>
>> Thanks Yifan (but Java Precommit is still missing).
>> Can somebody run "Run Java PreCommit" on
>> https://github.com/apache/beam/pull/10554?
>>
>>
>> On Mon, Jan 13, 2020 at 2:59 AM Yifan Zou  wrote:
>> >
>> > done.
>> >
>> > On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki  wrote:
>> >>
>> >> Hi Beam committers,
>> >>
>> >> Four Jenkins jobs did not report back for this PR
>> >> https://github.com/apache/beam/pull/10554 .
>> >> Can somebody trigger them?
>> >>
>> >> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud  
>> >> wrote:
>> >> >
>> >> > Done.
>> >> >
>> >> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki  wrote:
>> >> >>
>> >> >> Hi Beam developers,
>> >> >>
>> >> >> I'd appreciate it if a committer could trigger a precommit build for
>> >> >> https://github.com/apache/beam/pull/10554.
>> >> >>
>> >> >> In addition to the normal precommit checks, I want the following:
>> >> >> Run Java PostCommit
>> >> >> Run Java HadoopFormatIO Performance Test
>> >> >> Run BigQueryIO Streaming Performance Test Java
>> >> >> Run Dataflow ValidatesRunner
>> >> >> Run Spark ValidatesRunner
>> >> >> Run SQL Postcommit
>> >> >>
>> >> >> Regards,
>> >> >> Tomo
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Tomo
>>
>>
>>
>> --
>> Regards,
>> Tomo



-- 
Regards,
Tomo


Re: Jenkins jobs not running for my PR 10438

2020-01-13 Thread Mark Liu
done

On Mon, Jan 13, 2020 at 8:03 AM Tomo Suzuki  wrote:

> Thanks Yifan (but Java Precommit is still missing).
> Can somebody run "Run Java PreCommit" on
> https://github.com/apache/beam/pull/10554?
>
>
> On Mon, Jan 13, 2020 at 2:59 AM Yifan Zou  wrote:
> >
> > done.
> >
> > On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki  wrote:
> >>
> >> Hi Beam committers,
> >>
> >> Four Jenkins jobs did not report back for this PR
> >> https://github.com/apache/beam/pull/10554 .
> >> Can somebody trigger them?
> >>
> >> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud 
> wrote:
> >> >
> >> > Done.
> >> >
> >> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki 
> wrote:
> >> >>
> >> >> Hi Beam developers,
> >> >>
> >> >> I'd appreciate it if a committer could trigger a precommit build for
> >> >> https://github.com/apache/beam/pull/10554.
> >> >>
> >> >> In addition to the normal precommit checks, I want the following:
> >> >> Run Java PostCommit
> >> >> Run Java HadoopFormatIO Performance Test
> >> >> Run BigQueryIO Streaming Performance Test Java
> >> >> Run Dataflow ValidatesRunner
> >> >> Run Spark ValidatesRunner
> >> >> Run SQL Postcommit
> >> >>
> >> >> Regards,
> >> >> Tomo
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Tomo
>
>
>
> --
> Regards,
> Tomo
>


Re: Failing Java PostCommit for Dataflow runner

2020-01-13 Thread Boyuan Zhang
This problem is addressed by PR10564. Now all affected tests are back to
green.

On Mon, Jan 13, 2020 at 1:11 PM Luke Cwik  wrote:

> This is being tracked in BEAM-9083
>
> On Mon, Jan 13, 2020 at 11:23 AM Boyuan Zhang  wrote:
>
>> Thanks Kirill! I'm going to look into it.
>>
>> On Mon, Jan 13, 2020 at 11:18 AM Kirill Kozlov 
>> wrote:
>>
>>> Hello everyone!
>>>
>>> I have noticed that Jenkins tests for Dataflow runner [1] are failing
>>> with a runtime exception. It looks like the issue originated here [2],
>>> failed Dataflow job [3].
>>> We should look into fixing it.
>>>
>>> Failing test:
>>> :runners:google-cloud-dataflow-java:validatesRunnerLegacyWorkerTest »
>>> org.apache.beam.sdk.transforms.ParDoTest$TimerTests » testOutputTimestamp
>>> (29.723s)
>>>
>>> Exception thrown:
>>>
>>> java.lang.RuntimeException: Workflow failed. Causes: Unknown streaming 
>>> source: test_stream
>>>
>>>
>>> [1]
>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/
>>> [2]
>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/4157/
>>> [3]
>>> https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2020-01-08_14_36_07-180707589785292440?project=apache-beam-testing
>>>
>>


Re: Go SplittableDoFn prototype and proposed changes

2020-01-13 Thread Luke Cwik
Thanks for the update and I agree with the points that you have made.

On Fri, Jan 10, 2020 at 5:58 PM Robert Burke  wrote:

> Thank you for sharing Daniel!
>
> Resolving SplittableDoFns for the Go SDK even just as far as initial
> splitting will take the SDK that much closer to exiting its experimental
> status.
>
> It's especially exciting seeing this work on Flink and on the Python
> direct runner!
>
> On Fri, Jan 10, 2020, 5:36 PM Daniel Oliveira 
> wrote:
>
>> Hey Beam devs,
>>
>> So several months ago I posted my Go SDF proposal and got a lot of good
>> feedback (thread
>> ,
>> doc ). Since then I've been working on
>> implementing it and I've got an initial prototype ready to show off! It
>> works with initial splitting on Flink, and has a decently documented API.
>> In the second part of the email I'll also be proposing changes to the
>> original doc, based on my experience working on this prototype.
>>
>> To be clear, this is *not* ready to officially go into Beam yet; the API
>> is still likely to go through changes. Rather, I'm showing this off to show
>> that progress is being made on SDFs, and to provide some context to the
>> changes I'll be proposing below.
>>
>> Here's a link to the repo and branch so you can download it, and a link
>> to the changes specifically:
>> Repo: https://github.com/youngoli/beam/tree/gosdf
>> Changes:
>> https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831
>>
>> If you give it a try and have any thoughts, please let me know! I'm open
>> to any and all feedback.
>>
>> ==
>>
>> Proposed Changes
>> Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
>> history.)
>>
>> For anyone reading this who hasn't already read the doc above, I suggest
>> reading it first, since I'll be referring to concepts from it.
>>
>> After working on the prototype I've changed my mind on the original
>> decisions to go with an interface approach and a combined restriction +
>> tracker. But I don't want to go all in and create another doc with a
>> detailed proposal, so I've laid out a brief summary of the changes to get
>> some initial feedback before I go ahead and start working on these changes
>> in detail. Please let me know what you think!
>>
>> *1. Change from native Go interfaces to dynamic reflection-based API.*
>>
>> Instead of the native Go interfaces (SplittableDoFn, RProvider, and
>> RTracker) described in the doc and implemented in the prototype, use the
>> same dynamic approach that the Go SDK already uses for DoFns: Use the
>> reflection system to examine the names and signatures of methods in the
>> user's DoFn, RProvider, and RTracker.
>>
>> Original approach reasoning:
>>
>>- Simpler, so faster to implement and less bug-prone.
>>- The extra burden on the user to keep types consistent is ok since
>>most users of SDFs are more advanced
>>
>> Change reasoning:
>>
>>- In the prototype, I found interfaces to require too much extra
>>boilerplate, which added more complexity than expected. (Examples: constant
>>casting, ...)
>>- More consistent API: Inconsistency between regular DoFns (dynamic)
>>and SDF API (interfaces) was jarring and unintuitive when implementing 
>> SDFs
>>as a user.
>>
>> Implementation: Full details are up for discussion, but the goal is to
>> make the RProvider and  RTracker interfaces dynamic, so we can replace all
>> instances of interface{} in the methods with the actual element types
>> (i.e. fake generics). Also uses of the RProvider and RTracker interfaces in
>> signatures can be replaced with the implementations of those
>> providers/trackers. This will require a good amount of additional work in
>> the DoFn validation codebase and the code generator. Plus a fair amount of
>> additional user code validation will be needed and more testing since the
>> new code is more complex.
>>
>> *2. Separate the restriction tracker and restriction.*
>>
>> Currently the API has the restriction combined with the tracker. In most
>> other SDKs and within the SDF model, the two are usually separate concepts,
>> and this change is to follow that approach and split the two.
>>
>> Original approach reasoning:
>>
>>- It was considered simpler to avoid another level of type casting in
>>the API with the interface approach.
>>
>> Change reasoning:
>>
>>- We are no longer going with the interface approach. With "fake
>>generics", it is simpler to keep the two concepts separate.
>>- Requiring users to specify custom coders in order to only encode
>>the restriction and not the tracker ended up adding additional complexity
>>anyway.
>>
>> Implementation: In the API, have the restriction tracker initialized with
>> a restriction object accessible via a getter. The restriction itself 

Re: master on Dataflow with schema aware PCollections stuck

2020-01-13 Thread Alex Van Boxel
It's indeed the first logical type with a Row base type. The UUID is
generated from the name of the class, but to do it in code (from a string)
you need to create bytes from the string, then a UUID.
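
A minimal sketch of that string -> bytes -> UUID dance, assuming a logical
type that keeps a static Row-based schema (the class name, field names, and
identifier string here are illustrative, not the actual Beam code):

import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.beam.sdk.schemas.Schema;

class ProtoTimestampSchemas {
  // The static schema that was being re-created on workers without a UUID.
  static final Schema BASE_SCHEMA =
      Schema.builder().addInt64Field("seconds").addInt32Field("nanos").build();

  static {
    // Derive a deterministic UUID from a fixed string, so the schema built
    // at pipeline construction time and the one re-initialized on each
    // worker end up with the same id.
    BASE_SCHEMA.setUUID(UUID.nameUUIDFromBytes(
        "beam:logical_type:proto_timestamp".getBytes(StandardCharsets.UTF_8)));
  }
}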

 _/
_/ Alex Van Boxel


On Mon, Jan 13, 2020 at 10:40 PM Brian Hulette  wrote:

> I guess these are the first logical types we've defined with a base type
> of row. It does seem reasonable that a static schema for a logical type
> could have some fixed id, but it feels odd to have a fixed UUID, it would
> be nice if we could give the schema some meaningful static identifier.
>
> I think Reuven was investigating adding support for "named" schemas in
> order to add support for recursive schemas, but ran into some issues. Maybe
> something like that is what we need here?
>
> On Mon, Jan 13, 2020 at 12:26 PM Alex Van Boxel  wrote:
>
>> Fix in this PR:
>>
>> [BEAM-9113] Fix serialization proto logical types
>> https://github.com/apache/beam/pull/10569
>>
>> or we all agree to *promote* the logical types to top-level logical
>> types (as described in the design document, see ticket):
>>
>> [BEAM-9037] Instant and duration as logical type
>> https://github.com/apache/beam/pull/10486
>>
>>
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Mon, Jan 13, 2020 at 8:40 PM Alex Van Boxel  wrote:
>>
>>> So I think the following happens:
>>>
>>>1. the schema tree is initialized at construction time. The tree gets
>>>serialized and sent to the workers
>>>2. the workers deserialize the tree, but as the Timestamp logical
>>>type has a *static* schema, the schema will be *re-initialized
>>>without the UUID* (as it was never serialized)
>>>3. this is why setting a fixed UUID at static initialization works
>>>
>>> So the solution is either
>>>
>>>- as tested, set a fixed UUID, or
>>>- make the schema not static in the logical type
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Mon, Jan 13, 2020 at 8:08 PM Reuven Lax  wrote:
>>>
 SchemaCoder today recursively sets UUIDs for all schemas, including
 logical types, in setSchemaIds. Is it possible that your changes modified
 that logic somehow?

 On Mon, Jan 13, 2020 at 9:39 AM Alex Van Boxel 
 wrote:

> This is the stacktrace:
>
>
> java.lang.IllegalStateException
> at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:491)
> at org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:380)
> at org.apache.beam.sdk.coders.RowCoderGenerator.getCoder(RowCoderGenerator.java:371)
> at org.apache.beam.sdk.coders.RowCoderGenerator.createComponentCoders(RowCoderGenerator.java:337)
> at org.apache.beam.sdk.coders.RowCoderGenerator.generate(RowCoderGenerator.java:140)
> at org.apache.beam.sdk.schemas.SchemaCoder.getDelegateCoder(SchemaCoder.java:159)
> at org.apache.beam.sdk.schemas.SchemaCoder.toString(SchemaCoder.java:204)
> at java.lang.String.valueOf(String.java:2994)
> at java.lang.StringBuilder.append(StringBuilder.java:131)
> at org.apache.beam.sdk.coders.Coder.getEncodedElementByteSize(Coder.java:300)
> at org.apache.beam.sdk.coders.Coder.registerByteSizeObserver(Coder.java:291)
> at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:623)
> at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.registerByteSizeObserver(WindowedValue.java:539)
> at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$ElementByteSizeObservableCoder.registerByteSizeObserver(IntrinsicMapTaskExecutorFactory.java:400)
> at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputObjectAndByteCounter.update(OutputObjectAndByteCounter.java:125)
> at org.apache.beam.runners.dataflow.worker.DataflowOutputCounter.update(DataflowOutputCounter.java:64)
> at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:43)
> at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
> at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
> at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
> at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
> at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
> at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
> at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
> at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
> at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)

[VOTE] Release 2.18.0, release candidate #1

2020-01-13 Thread Udi Meiri
Hi everyone,
Please review and vote on the release candidate #3 for the version 1.2.3,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 8961 F3EF 8E79 6688 4067
 87CF 587B 049C 36DA AFE6 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v1.2.3-RC3" [5],
* website pull request listing the release [6], publishing the API
reference manual [7], and the blog post [8].
* Java artifacts were built with Maven MAVEN_VERSION and OpenJDK/Oracle JDK
JDK_VERSION.
TODO: do these versions matter, and are they stamped into the artifacts?
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.18.0 release to help with validation
[9].
* Docker images published to Docker Hub [10].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346383=12319527
[2] https://dist.apache.org/repos/dist/dev/beam/2.18.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1090/
[5] https://github.com/apache/beam/tree/v2.18.0-RC1
[6] https://github.com/apache/beam/pull/10574
[7] https://github.com/apache/beam-site/pull/595
[8] https://github.com/apache/beam/pull/10575
[9]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1178617819
[10] https://hub.docker.com/u/apachebeam


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Udi Meiri
The most important gain would be compatibility with Google internal code.
TLDR: I don't expect non-Googlers to fix pytype issues in Beam, nor would
they have access to internal code that is validated against pytype with
Beam.

Pytype seems to detect attribute errors that mypy has not, so it acts as a
kind-of linter in this case.
Examples:
https://github.com/apache/beam/pull/10528/files#diff-0cb34b4622b0b7d7256d28b1ee1d52fc
https://github.com/apache/beam/pull/10528/files#diff-7e4ad8c086414399957cdbea711ebd36
https://github.com/apache/beam/pull/10528/files#diff-d5c3f4f603204c5c5917d89e90dba53d
(it also makes pytype more strict in a sense)

Pytype has bugs. I've filed/found these so far:
https://github.com/google/pytype/issues/491 (isinstance bug)
https://github.com/google/pytype/issues/445 (@overload bug)
https://github.com/google/pytype/issues/488 (builtins.object != object bug)
https://github.com/google/pytype/issues/485 (# type: ignore[...] mypy
compatibility issue)
https://github.com/google/pytype/issues/480 (alternate comment syntax -
wontfix)


On Mon, Jan 13, 2020 at 3:49 PM Kyle Weaver  wrote:

> Udi, what would we gain by using pytype?
>
> Also, has anyone tried running pytype against Beam? If it's not too much
> trouble, it might be helpful to diff the pytype and mypy results to get a
> feel for exactly how big the discrepancy is.
>
> On Mon, Jan 13, 2020 at 3:26 PM Kenneth Knowles  wrote:
>
>> Looking at this from the outside, it seems like mypy is the obvious
>> choice. Also running pytype could potentially be informative in some cases
>> but only if there is a specific gap. What about maintenance/governance of
>> the two projects?
>>
>> Kenn
>>
>> On Sun, Jan 12, 2020 at 7:48 PM Chad Dombrova  wrote:
>>
>>> Hi folks,
>>> I agree with Robert that we need to wait and see before making any
>>> decisions, but I do have some opinions about the probable/desired outcome.
>>>
>>> I haven't used pytype, but my experience working with mypy over the past
>>> few years -- and following various issues and peps related to it and typing
>>> in general -- has taught me there's still a lot of room for interpretation
>>> and thus variation between type checkers.
>>>
>>> Here's a simple example: ignoring errors.  Both tools support ignoring
>>> errors using a `type: ignore` comment, but only mypy (to my knowledge)
>>> supports specifying an error type so that only that error is suppressed,
>>> e.g. `type: ignore[error-code-here]`.   There's even room for differences
>>> with regard to the line number where the error is emitted and thus where
>>> the ignore comment must be placed (end of statement, site of open paren,
>>> site of close paren, etc).  I know this because mypy has actually made
>>> adjustments to this once or twice over the years, which necessitated moving
>>> existing ignore comments.  So just imagine having to ignore the same error
>>> separately for each type checker.  It's not the end of the world, but it's
>>> ugly and frustrating.
>>>
>>> As a user, it can be quite challenging to solve certain typing issues,
>>> and there's a fairly steep learning curve –  I wouldn't want to burden
>>> users with *two* type checkers, each with its own idiosyncrasies.  That
>>> said, a linter that doesn't actually prevent merges when an error occurs
>>> will be ignored by users and quickly become less-than-useful.  Post-commit
>>> would not be a good idea for all the reasons that a post-commit lint check
>>> would be annoying (user's will trip it often and feel
>>> surprised/blind-sided).
>>>
>>> In the little exposure that I've had with pytype it seems to lag behind
>>> mypy in terms of features, especially wrt typing-related peps (it never
>>> fully supported pep484 multi-line type comments and it still doesn't
>>> support pep561, I see no mention of pep589/TypedDict in the docs, but then
>>> again they are *incredibly* light).  I've gotten mypy completely
>>> passing, and I know it very well, so I'm pretty biased towards making it
>>> the one and only type checker that generates pre-commit errors.  I see
>>> little advantage to most end users in supporting pytype, except y'know,
>>> Google has kind of an important presence in Apache Beam project  :)
>>>
>>> Some quick pypi download figures to back that up:
>>>
>>> Downloads last month:
>>> pytype: 24,864
>>> mypy: 1,502,582
>>>
>>> So to sum up this email in a sentence: running mypy in pre-commit checks
>>> would be beneficial, but making pytype also pass would be a burden with
>>> little benefit for the majority of users.
>>>
>>> But as I said at the outset, before we make any decisions we should get
>>> at least one type checker in place and start getting feedback, because
>>> we're still in the land of conjecture.
>>>
>>> Hopefully I'll have all my typing changes merged in the next month or
>>> two, at which point we can discuss enabling it as part of the pre-commit
>>> lint job.
>>>
>>> -chad
>>>
>>>
>>>
>>>
>>> On Tue, Jan 7, 2020 at 7:02 PM Udi Meiri  

Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Ahmet Altay
> The most important gain would be compatibility with Google internal code.
I would like to clarify this. This refers to users of Beam who by default
are using pytype as part of the toolchain. Even though they are internal to
a single company and not vocal on Beam, they still represent a large
group of distinct teams using Beam.

On Mon, Jan 13, 2020 at 5:51 PM Robert Bradshaw  wrote:

> On Mon, Jan 13, 2020 at 5:34 PM Chad Dombrova  wrote:
> >>
> >> Pytype seems to detect attribute errors that mypy has not, so it acts
> as a kind-of linter in this case.
> >> Examples:
> >>
> https://github.com/apache/beam/pull/10528/files#diff-0cb34b4622b0b7d7256d28b1ee1d52fc
> >>
> https://github.com/apache/beam/pull/10528/files#diff-7e4ad8c086414399957cdbea711ebd36
> >>
> https://github.com/apache/beam/pull/10528/files#diff-d5c3f4f603204c5c5917d89e90dba53d
> >> (it also makes pytype more strict in a sense)
> >
> > Note that mypy is still not fully passing on master so it's unclear from
> those diffs exactly how the two tools differ.  Many of the fixes you've
> made for pytype look familiar to me from mypy, but my fixes may not be
> merged yet.  For example, mypy also does not support @total_ordering, but
> my fix for that is still pending.
>
> Since we have a workaround to ignore pytype for now, it seems
> to make the most sense to focus on getting mypy working completely
> before focusing on that. In the long term, we could look into making
> pytype a post-commit which would at least be a useful signal for those
> caring about it, and only make it a precommit if the diff between what
> it requires and mypy requires is quite small.
>

The workaround to ignore pytype is making the experience for type users
worse than before. Previously, they could have type checked their pipelines
with pytype and pytype would have attempted to fill in with best effort
type information. Now, pytype is instructed to not provide this information.

I agree with focusing one mypy for now, but I would propose soon after, or
in parallel if it will be different folks, to work on pytype and enable it
as a first class citizen similar to mypy. If there will be a large delta
between the two then we can decide on what to do next.


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Kenneth Knowles
On my phone, so I can't grab the jira so easily, but quickly: EARLY panes
are "race condition equivalent" to ON_TIME panes. The early panes consume
all the pending elements then the on time pane is "empty". This is WAI if
it is what is causing it. You need to explicitly set
Window.configure().fireAlways()*. I know this is counterintuitive in
accumulating mode, where the empty pane is not the identity element.

Kenn

*I don't recall if this is the default or not, and also because on phone it
is slow to look up. From your experience I think not default.

On Mon, Jan 13, 2020, 15:03 Aaron Dixon  wrote:

> Any confirmation on this from anyone? Whether per Beam spec, runners are
> obligated to send ON_TIME panes for AfterWatermark triggers? I'm stuck
> because this seems fundamental, so it's hard to imagine this is a Dataflow
> bug, but OTOH it's also hard to imagine that trigger specs like
> AfterWatermark are "optional"... ?
>
> On Mon, Jan 13, 2020 at 4:18 PM Aaron Dixon  wrote:
>
>> Yes. Using calendar day-based windows and watermark is completely caught
>> up to today ... calendar window ends several days ago. I got EARLY panes
>> for each element but never ON_TIME pane.
>>
>> On Mon, Jan 13, 2020 at 4:16 PM Luke Cwik  wrote:
>>
>>> Is the watermark advancing past the end of the window?
>>>
>>> On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon  wrote:
>>>
 The window is not empty fwiw; it has elements; I get an early firing
 pane for the window but well after the watermark passes there is no ON_TIME
 pane. Would this be a bug in Dataflow? Seems fundamental, so I'm concerned
 perhaps the Beam spec doesn't obligate ON_TIME firings?





 On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:

> I would have expected an empty on time pane since the default on time
> behavior is FIRE_ALWAYS.
>
> On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:
>
>> Can anyone confirm?
>>
>> This is intermittent. Some (it seems, sparse) windows don't get an
>> ON_TIME firing after watermark. Is this a bug or is there a reason to not
>> expect ON_TIME firings for every window?
>>
>> On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>>
>>> If it indeed happened as you have described, I will be very
>>> interested in the expected behaviour.
>>>
>>> Something I remembered before: the trigger condition being met just
>>> gives the runner/engine "permission" to fire, but the runner/engine may not
>>> fire immediately. But I don't know if the engine/runner will guarantee to
>>> fire.
>>>
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon 
>>> wrote:
>>>
 I have the following trigger:

 .apply(Window
   .configure()
   .triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterPane
 .elementCountAtLeast(1)))
   .accumulatingFiredPanes()
   .withAllowedLateness(Duration.ZERO)

 But in Dataflow I notice that I never get an ON_TIME firing for my
 window -- I only see early firing for elements, and then nothing.

 My assumption is that AfterWatermark should give me a last, on-time
 pane under this configuration when the watermark surpasses the 
 window's end.

 Is my expectation correct?

>>>


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Chad Dombrova
>
> Pytype seems to detect attribute errors that mypy has not, so it acts as a
> kind-of linter in this case.
> Examples:
>
> https://github.com/apache/beam/pull/10528/files#diff-0cb34b4622b0b7d7256d28b1ee1d52fc
>
> https://github.com/apache/beam/pull/10528/files#diff-7e4ad8c086414399957cdbea711ebd36
>
> https://github.com/apache/beam/pull/10528/files#diff-d5c3f4f603204c5c5917d89e90dba53d
> (it also makes pytype more strict in a sense)
>

Note that mypy is still not fully passing on master so it's unclear from
those diffs exactly how the two tools differ.  Many of the fixes you've
made for pytype look familiar to me from mypy, but my fixes may not be
merged yet.  For example, mypy also does not support @total_ordering, but
my fix for that is still pending.


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Chad Dombrova
>
> > I agree with focusing one mypy for now, but I would propose soon after,
> or in parallel if it will be different folks, to work on pytype and enable
> it as a first class citizen similar to mypy. If there will be a large delta
> between the two then we can decide on what to do next.
>
> If there is a large delta, I wonder if what is needed to provide
> sufficient typing on the public API of beam (needed for users) could
> be a much smaller subset than that required for fully
> documenting/typechecking the internals (which is also likely to be
> where more of the tricky bits are).
>

Let's separate our two use cases for type checking:

1) developers working on the Beam source
2) developers using Beam

For #1, there's little to gain from running a second type checker, and as
discussed, it's a burden, so mypy is the clear winner here.

For #2, there's actually no need to expose all of the internals of Beam to
pytype, and as Robert points out, this is where the "tricky bits" reside.
We could use mypy's stubgen tool to create .pyi files of just the API
interface -- no internals -- as a separate apache_beam_stubs package.  Then
pytype users (or users of other type checkers) could optionally install
this and point pytype at it.  This would be a bit more automatic for end
users if pytype supported pep561[1], which is all about the packaging and
discovery of typed packages, including stub-only packages like I'm
proposing.

[1] https://github.com/google/pytype/issues/151

-chad


Re: [VOTE] Release 2.18.0, release candidate #1

2020-01-13 Thread Valentyn Tymofieiev
There are some issues in this message: part of it is still a
template (1.2.3, TODO, MAVEN_VERSION).
Before I noticed these issues, I ran a few Batch and Streaming Python 3.7
pipelines using Direct and Dataflow runners, and they all succeeded.

On Mon, Jan 13, 2020 at 4:09 PM Udi Meiri  wrote:

> Hi everyone,
> Please review and vote on the release candidate #3 for the version 1.2.3,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint 8961 F3EF 8E79 6688 4067
>  87CF 587B 049C 36DA AFE6 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v1.2.3-RC3" [5],
> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> * Java artifacts were built with Maven MAVEN_VERSION and OpenJDK/Oracle
> JDK JDK_VERSION.
> TODO: do these versions matter, and are they stamped into the artifacts?
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.18.0 release to help with validation
> [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346383=12319527
> [2] https://dist.apache.org/repos/dist/dev/beam/2.18.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1090/
> [5] https://github.com/apache/beam/tree/v2.18.0-RC1
> [6] https://github.com/apache/beam/pull/10574
> [7] https://github.com/apache/beam-site/pull/595
> [8] https://github.com/apache/beam/pull/10575
> [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1178617819
> [10] https://hub.docker.com/u/apachebeam
>


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Luke Cwik
I would rather we focus on doing well with one type checker and it seems
that mypy is significantly more popular than pytype, so it's more natural for
users. I would support pytype if it covered more PEPs and was the newer and
upcoming thing but that doesn't seem to be the case.

On Sun, Jan 12, 2020 at 7:48 PM Chad Dombrova  wrote:

> Hi folks,
> I agree with Robert that we need to wait and see before making any
> decisions, but I do have some opinions about the probable/desired outcome.
>
> I haven't used pytype, but my experience working with mypy over the past
> few years -- and following various issues and peps related to it and typing
> in general -- has taught me there's still a lot of room for interpretation
> and thus variation between type checkers.
>
> Here's a simple example: ignoring errors.  Both tools support ignoring
> errors using a `type: ignore` comment, but only mypy (to my knowledge)
> supports specifying an error type so that only that error is suppressed,
> e.g. `type: ignore[error-code-here]`.   There's even room for differences
> with regard to the line number where the error is emitted and thus where
> the ignore comment must be placed (end of statement, site of open paren,
> site of close paren, etc).  I know this because mypy has actually made
> adjustments to this once or twice over the years, which necessitated moving
> existing ignore comments.  So just imagine having to ignore the same error
> separately for each type checker.  It's not the end of the world, but it's
> ugly and frustrating.
>
> As a user, it can be quite challenging to solve certain typing issues, and
> there's a fairly steep learning curve –  I wouldn't want to burden users
> with *two* type checkers, each with its own idiosyncrasies.  That said, a
> linter that doesn't actually prevent merges when an error occurs will be
> ignored by users and quickly become less-than-useful.  Post-commit would
> not be a good idea for all the reasons that a post-commit lint check would
> be annoying (users will trip it often and feel surprised/blind-sided).
>
> In the little exposure that I've had with pytype it seems to lag behind
> mypy in terms of features, especially wrt typing-related peps (it never
> fully supported pep484 multi-line type comments and it still doesn't
> support pep561, I see no mention of pep589/TypedDict in the docs, but then
> again they are *incredibly* light).  I've gotten mypy completely passing,
> and I know it very well, so I'm pretty biased towards making it the one and
> only type checker that generates pre-commit errors.  I see little advantage
> to most end users in supporting pytype, except y'know, Google has kind of
> an important presence in Apache Beam project  :)
>
> Some quick pypi download figures to back that up:
>
> Downloads last month:
> pytype: 24,864
> mypy: 1,502,582
>
> So to sum up this email in a sentence: running mypy in pre-commit checks
> would be beneficial, but making pytype also pass would be a burden with
> little benefit for the majority of users.
>
> But as I said at the outset, before we make any decisions we should get at
> least one type checker in place and start getting feedback, because we're
> still in the land of conjecture.
>
> Hopefully I'll have all my typing changes merged in the next month or two,
> at which point we can discuss enabling it as part of the pre-commit lint
> job.
>
> -chad
>
>
>
>
> On Tue, Jan 7, 2020 at 7:02 PM Udi Meiri  wrote:
>
>> Hi,
>> We recently added mypy to the Jenkins Lint job for PRs (currently ignores
>> errors). Mypy is a static type checker.
>>
>> There's a JIRA for adding another static type checker named pytype
>> https://issues.apache.org/jira/browse/BEAM-9064
>>
>> I wanted to ask the community their thoughts on this. (see JIRA issue
>> comments as well)
>>
>> - Should PRs have to pass more than 1 static type checker? (in pre-commit
>> tests)
>> - If not, should the remaining type checkers be run as a post-commit
>> tests?
>> - How much effort should be put into supporting more than 1 type checker?
>> (i.e. making sure that they all pass)
>>
>>
>>
>>


Re: Jenkins job execution policy

2020-01-13 Thread Luke Cwik
I'm for going back to the status quo where anyone's PR ran the tests
automatically or to the suggestion where users marked as contributors had
their tests run automatically (with the documentation update about how link
your github/jira accounts).

On Mon, Jan 13, 2020 at 2:45 AM Michał Walenia 
wrote:

> Hi,
> I wanted to decouple the conversation about solutions to the issue from
> job execution requests.
> We have 131 open PRs right now and 64 committers with job running
> privileges. From what I counted, more than 80 of those PRs are not authored
> by committers.
> I think that having committers answer testing and retesting requests is a
> temporary solution and a permanent one should be decided upon soon. While
> it's an inconvenience for contributors familiar with the workings of the
> project and the community, newcomers might be put off by the fact that the
> tests don't run automatically on their pull requests (this is an industry
> standard which IMO should be upheld also in Beam). The barrier of finding
> one of committers who is active and willing to trigger their tests can make
> the entry to the project more difficult.
>
> I believe that the solution proposed by Kenneth in the Jira thread
> https://issues.apache.org/jira/browse/INFRA-19670 is viable. The
> questions are: do we want to implement this idea and what needs to be done
> to do it?
>
> Regards
> Michał
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Aaron Dixon
Kenn, thank you! There is OnTimeBehavior (default FIRE_ALWAYS) and
ClosingBehavior (default FIRE_IF_NON_EMPTY). Given that OnTimeBehavior is
always-fire, shouldn't I see empty ON_TIME panes?

Since my lateness config is 0, I'm going to try ClosingBehavior =
FIRE_ALWAYS and see if I can rely on .isLast() to pick out the last pane
downstream. But curious if given that the OnTimeBehavior default is ALWAYS,
shouldn't I be seeing on-time panes in my current config?
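
For reference, here is a minimal sketch of the same windowing configuration
with both behaviors spelled out explicitly (whether this yields the missing
ON_TIME/final pane on Dataflow is exactly the open question in this thread):

.apply(Window
  .configure()
  .triggering(AfterWatermark
    .pastEndOfWindow()
    .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
  .accumulatingFiredPanes()
  // The documented default: emit an ON_TIME pane even if early firings
  // already consumed all pending elements (the pane may then be empty).
  .withOnTimeBehavior(Window.OnTimeBehavior.FIRE_ALWAYS)
  // Non-default: also emit a final pane when the window closes, so a
  // downstream DoFn can key off c.pane().isLast().
  .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS))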



On Mon, Jan 13, 2020 at 6:45 PM Kenneth Knowles  wrote:

> On my phone, so I can't grab the jira so easily, but quickly: EARLY panes
> are "race condition equivalent" to ON_TIME panes. The early panes consume
> all the pending elements then the on time pane is "empty". This is WAI if
> it is what is causing it. You need to explicitly set
> Window.configure().fireAlways()*. I know this is counterintuitive in
> accumulating mode, where the empty pane is not the identity element.
>
> Kenn
>
> *I don't recall if this is the default or not, and also because on phone
> it is slow to look up. From your experience I think not default.
>
> On Mon, Jan 13, 2020, 15:03 Aaron Dixon  wrote:
>
>> Any confirmation on this from anyone? Whether per Beam spec, runners are
>> obligated to send ON_TIME panes for AfterWatermark triggers? I'm stuck
>> because this seems fundamental, so it's hard to imagine this is a Dataflow
>> bug, but OTOH it's also hard to imagine that trigger specs like
>> AfterWatermark are "optional"... ?
>>
>> On Mon, Jan 13, 2020 at 4:18 PM Aaron Dixon  wrote:
>>
>>> Yes. I'm using calendar day-based windows and the watermark is completely
>>> caught up to today ... the calendar window ended several days ago. I got
>>> EARLY panes for each element but never an ON_TIME pane.
>>>
>>> On Mon, Jan 13, 2020 at 4:16 PM Luke Cwik  wrote:
>>>
 Is the watermark advancing past the end of the window?

 On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon  wrote:

> The window is not empty fwiw; it has elements; I get an early firing
> pane for the window but well after the watermark passes there is no 
> ON_TIME
> pane. Would this be a bug in Dataflow? Seems fundamental, so I'm concerned
> perhaps the Beam spec doesn't obligate ON_TIME firings?
>
>
>
>
>
> On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:
>
>> I would have expected an empty on time pane since the default on time
>> behavior is FIRE_ALWAYS.
>>
>> On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon 
>> wrote:
>>
>>> Can anyone confirm?
>>>
>>> This is intermittent. Some (it seems, sparse) windows don't get an
>>> ON_TIME firing after watermark. Is this a bug or is there a reason to 
>>> not
>>> expect ON_TIME firings for every window?
>>>
>>> On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>>>
 If it indeed happened as you have described, I will be very
 interested in the expected behaviour.

 Something I remembered from before: when the trigger condition is met, the
 runner/engine just gets "permission" to fire, but it may not fire
 immediately. And I don't know whether the engine/runner guarantees to fire
 at all.



 -Rui

 On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon 
 wrote:

> I have the following trigger:
>
> .apply(Window
>     .configure()
>     .triggering(AfterWatermark
>         .pastEndOfWindow()
>         .withEarlyFirings(AfterPane
>             .elementCountAtLeast(1)))
>     .accumulatingFiredPanes()
>     .withAllowedLateness(Duration.ZERO))
>
> But in Dataflow I notice that I never get an ON_TIME firing for my
> window -- I only see early firing for elements, and then nothing.
>
> My assumption is that AfterWatermark should give me a last,
> on-time pane under this configuration when the watermark surpasses the
> window's end.
>
> Is my expectation correct?
>



Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Kenneth Knowles
Looking at this from the outside, it seems like mypy is the obvious choice.
Also running pytype could potentially be informative in some cases but only
if there is a specific gap. What about maintenance/governance of the two
projects?

Kenn

On Sun, Jan 12, 2020 at 7:48 PM Chad Dombrova  wrote:

> Hi folks,
> I agree with Robert that we need to wait and see before making any
> decisions, but I do have some opinions about the probable/desired outcome.
>
> I haven't used pytype, but my experience working with mypy over the past
> few years -- and following various issues and peps related to it and typing
> in general -- has taught me there's still a lot of room for interpretation
> and thus variation between type checkers.
>
> Here's a simple example: ignoring errors.  Both tools support ignoring
> errors using a `type: ignore` comment, but only mypy (to my knowledge)
> supports specifying an error type so that only that error is suppressed,
> e.g. `type: ignore[error-code-here]`.   There's even room for differences
> with regard to the line number where the error is emitted and thus where
> the ignore comment must be placed (end of statement, site of open paren,
> site of close paren, etc).  I know this because mypy has actually made
> adjustments to this once or twice over the years, which necessitated moving
> existing ignore comments.  So just imagine having to ignore the same error
> separately for each type checker.  It's not the end of the world, but it's
> ugly and frustrating.
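>
> For illustration, a minimal sketch (fetch_config and parse_flags are
> made-up placeholders; only the ignore comments matter):
>
> value = fetch_config()  # type: ignore[attr-defined]
> # mypy only: suppresses just the named error code on this line
> flags = parse_flags()  # type: ignore
> # bare ignore: suppresses every error on the line -- the least common
> # denominator that, to my knowledge, both tools accept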
>
> As a user, it can be quite challenging to solve certain typing issues, and
> there's a fairly steep learning curve – I wouldn't want to burden users
> with *two* type checkers, each with its own idiosyncrasies.  That said, a
> linter that doesn't actually prevent merges when an error occurs will be
> ignored by users and quickly become less-than-useful.  Post-commit would
> not be a good idea for all the reasons that a post-commit lint check would
> be annoying (users will trip it often and feel surprised/blind-sided).
>
> In the little exposure that I've had with pytype it seems to lag behind
> mypy in terms of features, especially wrt typing-related peps (it never
> fully supported pep484 multi-line type comments and it still doesn't
> support pep561, I see no mention of pep589/TypedDict in the docs, but then
> again they are *incredibly* light).  I've gotten mypy completely passing,
> and I know it very well, so I'm pretty biased towards making it the one and
> only type checker that generates pre-commit errors.  I see little advantage
> to most end users in supporting pytype, except, y'know, Google has kind of
> an important presence in the Apache Beam project  :)
>
> Some quick pypi download figures to back that up:
>
> Downloads last month:
> pytype: 24,864
> mypy: 1,502,582
>
> So to sum up this email in a sentence: running mypy in pre-commit checks
> would be beneficial, but making pytype also pass would be a burden with
> little benefit for the majority of users.
>
> But as I said at the outset, before we make any decisions we should get at
> least one type checker in place and start getting feedback, because we're
> still in the land of conjecture.
>
> Hopefully I'll have all my typing changes merged in the next month or two,
> at which point we can discuss enabling it as part of the pre-commit lint
> job.
>
> -chad
>
>
>
>
> On Tue, Jan 7, 2020 at 7:02 PM Udi Meiri  wrote:
>
>> Hi,
>> We recently added mypy to the Jenkins Lint job for PRs (currently ignores
>> errors). Mypy is a static type checker.
>>
>> There's a JIRA for adding another static type checker named pytype
>> https://issues.apache.org/jira/browse/BEAM-9064
>>
>> I wanted to ask the community their thoughts on this. (see JIRA issue
>> comments as well)
>>
>> - Should PRs have to pass more than 1 static type checker? (in pre-commit
>> tests)
>> - If not, should the remaining type checkers be run as post-commit
>> tests?
>> - How much effort should be put into supporting more than 1 type checker?
>> (i.e. making sure that they all pass)
>>
>>
>>
>>


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Kyle Weaver
Udi, what would we gain by using pytype?

Also, has anyone tried running pytype against Beam? If it's not too much
trouble, it might be helpful to diff the pytype and mypy results to get a
feel for exactly how big the discrepancy is.
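
Something as rough as this might be enough for a first pass (a sketch only:
it assumes both tools are installed and take a directory argument, and since
the two tools' message formats differ, a raw line diff only gives a coarse
feel for the gap):

import subprocess

def checker_lines(cmd):
    # Run a type checker and collect its diagnostic output lines.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return set(proc.stdout.splitlines())

mypy_out = checker_lines(["mypy", "apache_beam"])
pytype_out = checker_lines(["pytype", "apache_beam"])
print("only mypy flags:", len(mypy_out - pytype_out))
print("only pytype flags:", len(pytype_out - mypy_out))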

On Mon, Jan 13, 2020 at 3:26 PM Kenneth Knowles  wrote:

> Looking at this from the outside, it seems like mypy is the obvious
> choice. Also running pytype could potentially be informative in some cases
> but only if there is a specific gap. What about maintenance/governance of
> the two projects?
>
> Kenn
>
> On Sun, Jan 12, 2020 at 7:48 PM Chad Dombrova  wrote:
>
>> Hi folks,
>> I agree with Robert that we need to wait and see before making any
>> decisions, but I do have some opinions about the probable/desired outcome.
>>
>> I haven't used pytype, but my experience working with mypy over the past
>> few years -- and following various issues and peps related to it and typing
>> in general -- has taught me there's still a lot of room for interpretation
>> and thus variation between type checkers.
>>
>> Here's a simple example: ignoring errors.  Both tools support ignoring
>> errors using a `type: ignore` comment, but only mypy (to my knowledge)
>> supports specifying an error type so that only that error is suppressed,
>> e.g. `type: ignore[error-code-here]`.   There's even room for differences
>> with regard to the line number where the error is emitted and thus where
>> the ignore comment must be placed (end of statement, site of open paren,
>> site of close paren, etc).  I know this because mypy has actually made
>> adjustments to this once or twice over the years, which necessitated moving
>> existing ignore comments.  So just imagine having to ignore the same error
>> separately for each type checker.  It's not the end of the world, but it's
>> ugly and frustrating.
>>
>> As a user, it can be quite challenging to solve certain typing issues,
>> and there's a fairly steep learning curve – I wouldn't want to burden
>> users with *two* type checkers, each with its own idiosyncrasies.  That
>> said, a linter that doesn't actually prevent merges when an error occurs
>> will be ignored by users and quickly become less-than-useful.  Post-commit
>> would not be a good idea for all the reasons that a post-commit lint check
>> would be annoying (users will trip it often and feel
>> surprised/blind-sided).
>>
>> In the little exposure that I've had with pytype it seems to lag behind
>> mypy in terms of features, especially wrt typing-related peps (it never
>> fully supported pep484 multi-line type comments and it still doesn't
>> support pep561, I see no mention of pep589/TypedDict in the docs, but then
>> again they are *incredibly* light).  I've gotten mypy completely
>> passing, and I know it very well, so I'm pretty biased towards making it
>> the one and only type checker that generates pre-commit errors.  I see
>> little advantage to most end users in supporting pytype, except, y'know,
>> Google has kind of an important presence in the Apache Beam project  :)
>>
>> Some quick pypi download figures to back that up:
>>
>> Downloads last month:
>> pytype: 24,864
>> mypy: 1,502,582
>>
>> So to sum up this email in a sentence: running mypy in pre-commit checks
>> would be beneficial, but making pytype also pass would be a burden with
>> little benefit for the majority of users.
>>
>> But as I said at the outset, before we make any decisions we should get
>> at least one type checker in place and start getting feedback, because
>> we're still in the land of conjecture.
>>
>> Hopefully I'll have all my typing changes merged in the next month or
>> two, at which point we can discuss enabling it as part of the pre-commit
>> lint job.
>>
>> -chad
>>
>>
>>
>>
>> On Tue, Jan 7, 2020 at 7:02 PM Udi Meiri  wrote:
>>
>>> Hi,
>>> We recently added mypy to the Jenkins Lint job for PRs (currently
>>> ignores errors). Mypy is a static type checker.
>>>
>>> There's a JIRA for adding another static type checker named pytype
>>> https://issues.apache.org/jira/browse/BEAM-9064
>>>
>>> I wanted to ask the community their thoughts on this. (see JIRA issue
>>> comments as well)
>>>
>>> - Should PRs have to pass more than 1 static type checker? (in
>>> pre-commit tests)
>>> - If not, should the remaining type checkers be run as post-commit
>>> tests?
>>> - How much effort should be put into supporting more than 1 type
>>> checker? (i.e. making sure that they all pass)
>>>
>>>
>>>
>>>


Re: [PROPOSAL] Leveraging SQL TableProviders for Cross-Language IOs

2020-01-13 Thread Chamikara Jayalath
Thanks Brian. Added some comments.

On Mon, Jan 13, 2020 at 2:25 PM Brian Hulette  wrote:

> Hi everyone,
> I have a proposal that I think can unify two problem sets:
>   1) adding more IOs for Beam SQL, and
>   2) making more (Row-based) Java IOs available in Python as
> cross-language transforms
>
> The basic idea is to create a single cross-language transform that exposes
> all Beam SQL IOs via the TableProvider interface. A design document is
> available here: https://s.apache.org/xlang-table-provider
>
> Please take a look and let me know what you think. Thanks!
> Brian
>


Re: [DISCUSS] Python static type checkers

2020-01-13 Thread Robert Bradshaw
On Mon, Jan 13, 2020 at 5:34 PM Chad Dombrova  wrote:
>>
>> Pytype seems to detect attribute errors that mypy does not, so it acts as a
>> kind of linter in this case.
>> Examples:
>> https://github.com/apache/beam/pull/10528/files#diff-0cb34b4622b0b7d7256d28b1ee1d52fc
>> https://github.com/apache/beam/pull/10528/files#diff-7e4ad8c086414399957cdbea711ebd36
>> https://github.com/apache/beam/pull/10528/files#diff-d5c3f4f603204c5c5917d89e90dba53d
>> (it also makes pytype more strict in a sense)
>
> Note that mypy is still not fully passing on master so it's unclear from 
> those diffs exactly how the two tools differ.  Many of the fixes you've made 
> for pytype look familiar to me from mypy, but my fixes may not be merged yet. 
>  For example, mypy also does not support @total_ordering, but my fix for that 
> is still pending.

As we have a workaround to ignore pytype for now, it makes the most
sense to focus on getting mypy passing completely before turning to
pytype. In the long term, we could look into making pytype a
post-commit job, which would at least be a useful signal for those who
care about it, and only make it a precommit if the diff between what
it requires and what mypy requires is quite small.


Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Kenneth Knowles
This sounds like a bug, as described.

Here's the logic, shared by all runners:
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/ReduceFnRunner.java#L958

Regarding "race condition equivalent" I mean that when you have an early
trigger set up there is a benign race condition that determines the
presence or absence of an on time pane. So you should never count on it
unless you insist on it.

Regarding Robert's point, I agree that if we try to interpret triggers as an
expression of human intent then it seems that AfterWatermark variants
should all output an ON_TIME pane. But it is really the whole windowing
strategy by which the user specifies their intent, and IMO there are a
couple of reasons to keep it that way. Foremost, it becomes hard/impossible
to reason about systems where there are layers built upon the design
philosophy of guessing intent. So it is better to focus on clear boundaries;
in this case triggers govern whether the runner is permitted to produce
non-lifecycle outputs, while separate flags control the lifecycle outputs.
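
To make that boundary concrete, downstream logic reads the pane metadata
rather than reasoning about the trigger. A hypothetical sketch (the DoFn
body is made up, not from the SDK docs):

@ProcessElement
public void process(ProcessContext c) {
  PaneInfo pane = c.pane();
  PaneInfo.Timing timing = pane.getTiming(); // EARLY, ON_TIME, LATE, or UNKNOWN
  boolean isFinal = pane.isLast();           // true only on the closing pane
  // Triggers govern whether panes such as the EARLY ones exist at all, while
  // OnTimeBehavior and ClosingBehavior govern whether the ON_TIME and final
  // (lifecycle) panes fire even when they are empty.
}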

You can imagine a small DSL where the combination of trigger, on-time
behavior, and closing behavior is expressed together. That is more or less
what WindowingStrategy is trying to get at. I also think the space of
triggers is sufficiently arbitrary and unexplored that I don't want to
couple them to things that are more obviously fundamental, like on-time
behavior, lateness, and closing behavior. Previously, triggers were a
user-defined state machine, which I think is overly general and too
inefficient in a portable setting. But our particular set of triggers has
cruft you don't need and also sometimes lacks things you do need. If/when
we invest in and move toward sink triggers (aka specify business logic and
let the runner figure out the rest), that seems like a good time to re-open
the design space.

Kenn

On Mon, Jan 13, 2020 at 7:10 PM Aaron Dixon  wrote:

> Using ClosingBehavior I was able to see all windows get final panes fired
> (w/ isLast=true).
>
> Still unsure why OnTimeBehavior=ALWAYS wasn't enough to ensure an on-time
> pane for all windows. Though I'm not clear on the meaning of the 'race
> condition equivalency' you mentioned, Kenneth. I am very interested in
> understanding how this is supposed to work in general - but am very happy
> that I was able to get final panes by setting
> ClosingBehavior=ALWAYS ... thank you for that pro-tip!!
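>
> For anyone following along, the downstream filter is roughly this (a
> sketch; Result is a placeholder element type):
>
> .apply(ParDo.of(new DoFn<Result, Result>() {
>   @ProcessElement
>   public void process(ProcessContext c) {
>     // Keep only the final pane, which ClosingBehavior.FIRE_ALWAYS guarantees.
>     if (c.pane().isLast()) {
>       c.output(c.element());
>     }
>   }
> }))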
>
> On Mon, Jan 13, 2020 at 7:57 PM Robert Bradshaw 
> wrote:
>
>> I think AfterWatermark in particular should *always* produce an ON_TIME
>> pane, regardless of whether there were early panes. (It's less clear
>> with non-watermark triggers like after count or processing time.) This
>> makes it feel like the on time behavior is a property of the trigger,
>> not the windowing strategy.
>>
>> On Mon, Jan 13, 2020 at 5:06 PM Aaron Dixon  wrote:
>> >
>> > Kenn, thank you! There is OnTimeBehavior (default FIRE_ALWAYS) and
>> ClosingBehavior (default FIRE_IF_NON_EMPTY). Given that OnTimeBehavior is
>> always-fire, shouldn't I see empty ON_TIME panes?
>> >
>> > Since my lateness config is 0, I'm going to try ClosingBehavior =
>> FIRE_ALWAYS and see if I can rely on .isLast() to pick out the last pane
>> downstream. But I'm curious: given that the OnTimeBehavior default is ALWAYS,
>> shouldn't I be seeing on-time panes in my current config?
>> >
>> >
>> >
>> > On Mon, Jan 13, 2020 at 6:45 PM Kenneth Knowles 
>> wrote:
>> >>
>> >> On my phone, so I can't grab the jira so easily, but quickly: EARLY
>> panes are "race condition equivalent" to ON_TIME panes. The early panes
>> consume all the pending elements then the on time pane is "empty". This is
>> WAI if it is what is causing it. You need to explicitly set
>> Window.configure().fireAlways()*. I know this is counterintuitive in
>> accumulating mode, where the empty pane is not the identity element.
>> >>
>> >> Kenn
>> >>
>> >> *I don't recall if this is the default or not, and also because on
>> phone it is slow to look up. From your experience I think not default.
>> >>
>> >> On Mon, Jan 13, 2020, 15:03 Aaron Dixon  wrote:
>> >>>
>> >>> Any confirmation on this from anyone? Whether per Beam spec, runners
>> are obligated to send ON_TIME panes for AfterWatermark triggers? I'm stuck
>> because this seems fundamental, so it's hard to imagine this is a Dataflow
>> bug, but OTOH it's also hard to imagine that trigger specs like
>> AfterWatermark are "optional"... ?
>> >>>
>> >>> On Mon, Jan 13, 2020 at 4:18 PM Aaron Dixon 
>> wrote:
>> 
>>  Yes. I'm using calendar day-based windows and the watermark is completely
>> caught up to today ... the calendar window ended several days ago. I got
>> EARLY panes for each element but never an ON_TIME pane.
>> 
>>  On Mon, Jan 13, 2020 at 4:16 PM Luke Cwik  wrote:
>> >
>> > Is the watermark advancing past the end of the window?
>> >
>> > On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon 
>> wrote:
>> >>
>> >> The window is not empty 

Re: No AfterWatermark firings in Dataflow

2020-01-13 Thread Robert Bradshaw
I think AfterWatermark in particular should *always* produce an ON_TIME
pane, regardless of whether there were early panes. (It's less clear
with non-watermark triggers like after count or processing time.) This
makes it feel like the on time behavior is a property of the trigger,
not the windowing strategy.

On Mon, Jan 13, 2020 at 5:06 PM Aaron Dixon  wrote:
>
> Kenn, thank you! There is OnTimeBehavior (default FIRE_ALWAYS) and 
> ClosingBehavior (default FIRE_IF_NON_EMPTY). Given that OnTimeBehavior is 
> always-fire, shouldn't I see empty ON_TIME panes?
>
> Since my lateness config is 0, I'm going to try ClosingBehavior = FIRE_ALWAYS 
> and see if I can rely on .isLast() to pick out the last pane downstream. But
> I'm curious: given that the OnTimeBehavior default is ALWAYS, shouldn't I be
> seeing on-time panes in my current config?
>
>
>
> On Mon, Jan 13, 2020 at 6:45 PM Kenneth Knowles  wrote:
>>
>> On my phone, so I can't grab the jira so easily, but quickly: EARLY panes 
>> are "race condition equivalent" to ON_TIME panes. The early panes consume 
>> all the pending elements then the on time pane is "empty". This is WAI if it 
>> is what is causing it. You need to explicitly set 
>> Window.configure().fireAlways()*. I know this is counterintuitive in 
>> accumulating mode, where the empty pane is not the identity element.
>>
>> Kenn
>>
>> *I don't recall if this is the default or not, and also because on phone it 
>> is slow to look up. From your experience I think not default.
>>
>> On Mon, Jan 13, 2020, 15:03 Aaron Dixon  wrote:
>>>
>>> Any confirmation on this from anyone? Whether per Beam spec, runners are 
>>> obligated to send ON_TIME panes for AfterWatermark triggers? I'm stuck 
>>> because this seems fundamental, so it's hard to imagine this is a Dataflow 
>>> bug, but OTOH it's also hard to imagine that trigger specs like 
>>> AfterWatermark are "optional"... ?
>>>
>>> On Mon, Jan 13, 2020 at 4:18 PM Aaron Dixon  wrote:

 Yes. I'm using calendar day-based windows and the watermark is completely
 caught up to today ... the calendar window ended several days ago. I got
 EARLY panes for each element but never an ON_TIME pane.

 On Mon, Jan 13, 2020 at 4:16 PM Luke Cwik  wrote:
>
> Is the watermark advancing past the end of the window?
>
> On Mon, Jan 13, 2020 at 2:02 PM Aaron Dixon  wrote:
>>
>> The window is not empty fwiw; it has elements; I get an early firing 
>> pane for the window but well after the watermark passes there is no 
>> ON_TIME pane. Would this be a bug in Dataflow? Seems fundamental, so I'm 
>> concerned perhaps the Beam spec doesn't obligate ON_TIME firings?
>>
>>
>>
>>
>>
>> On Mon, Jan 13, 2020 at 3:58 PM Luke Cwik  wrote:
>>>
>>> I would have expected an empty on time pane since the default on time 
>>> behavior is FIRE_ALWAYS.
>>>
>>> On Mon, Jan 13, 2020 at 1:54 PM Aaron Dixon  wrote:

 Can anyone confirm?

 This is intermittent. Some (it seems, sparse) windows don't get an 
 ON_TIME firing after watermark. Is this a bug or is there a reason to 
 not expect ON_TIME firings for every window?

 On Mon, Jan 13, 2020 at 3:47 PM Rui Wang  wrote:
>
> If it indeed happened as you have described, I will be very 
> interested in the expected behaviour.
>
> Something I remembered from before: when the trigger condition is met, the
> runner/engine just gets "permission" to fire, but it may not fire
> immediately. And I don't know whether the engine/runner guarantees to fire
> at all.
>
>
>
> -Rui
>
> On Mon, Jan 13, 2020 at 1:43 PM Aaron Dixon  wrote:
>>
>> I have the following trigger:
>>
>> .apply(Window
>>     .configure()
>>     .triggering(AfterWatermark
>>         .pastEndOfWindow()
>>         .withEarlyFirings(AfterPane
>>             .elementCountAtLeast(1)))
>>     .accumulatingFiredPanes()
>>     .withAllowedLateness(Duration.ZERO))
>>
>> But in Dataflow I notice that I never get an ON_TIME firing for my 
>> window -- I only see early firing for elements, and then nothing.
>>
>> My assumption is that AfterWatermark should give me a last, on-time 
>> pane under this configuration when the watermark surpasses the 
>> window's end.
>>
>> Is my expectation correct?