Re: Executing the runner validation tests for the Twister2 runner

2020-01-06 Thread Pulasthi Supun Wickramasinghe
Hi Kenn,



On Mon, Jan 6, 2020 at 9:09 PM Kenneth Knowles  wrote:

>
>
> On Mon, Jan 6, 2020 at 8:30 AM Pulasthi Supun Wickramasinghe <
> pulasthi...@gmail.com> wrote:
>
>> Hi Kenn,
>>
>> I was able to solve the problem mentioned above. I am currently running
>> the "ValidatesRunner" tests; around 4-5 tests are failing, which
>> I should be able to fix in a couple of days. I wanted to check the next
>> steps I would need to take after all the "ValidatesRunner" tests are
>> passing. I assume that the runner does not need to pass all the
>> "NeedsRunner" tests.
>>
>
> That's correct. You don't need to run the NeedsRunner tests. Those are
> tests of the core SDK's functionality, not the runner. The annotation is a
> workaround for the false cycle in deps "SDK tests" -> "direct runner" ->
> "SDK", which to maven looks like a cyclic dependency. You should run on the
> ValidatesRunner tests. It is fine to also disable some of them, either by
> excluding categories or test classes, or adding new categories to exclude.
>
>
>> The runner is only implemented for the batch mode at the moment because
>> the higher-level API's for streaming on Twister2 are still being finalized.
>> Once that work is done we will add streaming support for the runner as well.
>>
>
> Nice! Batch-only is perfectly fine for a runner. You should be able to
> detect and reject pipelines that the runner cannot execute.
>

I will make sure that the capability is there.


> I re-read the old thread, but I may have missed the answer to a
> fundamental question. Just to get it clear on the mailing list: are you
> intending to submit the runner's code to Apache Beam and ask the community
> to maintain it?
>
To answer the original question: the issue was with forwarding
exceptions that happen during execution, since Twister2 has a
distributed model for execution. I added the ability on the Twister2 side
so that the Twister2 job-submitting client will receive a job state object
that contains any exceptions thrown during runtime once the job is
completed.

And about maintaining the Twister2 runner: we would like to submit the
runner to the Beam codebase, and the Twister2 team will maintain and update
it continuously; in that case, we would become part of the Beam community, I
suppose. Any contributions from other members of the community are more
than welcome. I hope that answers your question.

Best Regards,
Pulasthi


> Kenn
>
>
> Best Regards,
>> Pulasthi
>>
>> On Thu, Dec 12, 2019 at 11:27 AM Pulasthi Supun Wickramasinghe <
>> pulasthi...@gmail.com> wrote:
>>
>>> Hi Kenn
>>>
>>> We are still working on aspects like automated job monitoring so
>>> currently do not have those capabilities built-in. I discussed with the
>>> Twister2 team on a way we can forward failure information from the workers
>>> to the Jobmaster which would be a solution to this problem. It might take a
>>> little time to develop and test. I will update you after looking into that
>>> solution in a little more detail.
>>>
>>> Best Regards,
>>> Pulasthi
>>>
>>> On Wed, Dec 11, 2019 at 10:51 PM Kenneth Knowles 
>>> wrote:
>>>
 I dug in to Twister2 a little bit to understand the question better,
 checking how the various resource managers / launchers are plumbed.

 How would a user set up automated monitoring for a job? If that is
 scraping the logs, then it seems unfortunate for users, but I think the
 Beam runner would naturally use whatever a user might use.

 Kenn

 On Wed, Dec 11, 2019 at 10:45 AM Pulasthi Supun Wickramasinghe <
 pulasthi...@gmail.com> wrote:

> Hi Dev's
>
> I have been making some progress on the Twister2 runner for the beam
> that I mentioned before on the mailing list. The runner is able to run the
> wordcount example and produce correct results. So I am currently trying to
> run the runner validation tests.
>
> From what I understood looking at a couple of examples, tests are
> validated based on the exceptions that are thrown (or not) during test
> runtime. However, in Twister2 the job submission client currently does not
> get failure information such as exceptions back once the job is submitted.
> These are, however, recorded in the worker log files.
>
> So in order to validate the tests for Twister2 I would have to parse
> the worker logfile and check what exceptions are in the logs. Would that
> be an acceptable solution for the validation tests?
>
> Best Regards,
> Pulasthi
>
>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>

>>>
>>> --
>>> Pulasthi S. Wickramasinghe
>>> PhD Candidate  | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, 

Re: Jenkins jobs not running for my PR 10438

2020-01-06 Thread Kai Jiang
According to this comment, it might be a Jenkins bug.
Meanwhile, I opened an infra ticket at
https://issues.apache.org/jira/browse/INFRA-19670 for Beam.

On Mon, Jan 6, 2020 at 12:01 PM Andrew Pilloud  wrote:

> "Run precommits" seems to work sometimes:
> https://github.com/apache/beam/pull/10455
>
> Has anyone opened a ticket with apache infra?
>
> On Mon, Jan 6, 2020 at 4:39 AM Rehman Murad Ali <
> rehman.murad...@venturedive.com> wrote:
>
>> +1:  https://github.com/apache/beam/pull/10506
>>
>> any solution yet?
>>
>>
>>
>> *Thanks & Regards*
>>
>>
>> 
>>
>> *Rehman Murad Ali*
>> Software Engineer
>> Mobile: +92 3452076766
>> Skype: rehman.muradali
>>
>>
>> On Sat, Jan 4, 2020 at 6:10 AM Heejong Lee  wrote:
>>
>>> +1: https://github.com/apache/beam/pull/10051
>>>
>>> force-pushing again. retest this please. nothing works :(
>>>
>>> On Fri, Jan 3, 2020 at 12:55 AM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Hi,
 I'm also affected by this - I touched my PRs opened before the holiday
 break and no jobs were triggered. Do we know what breaks Jenkins/fixes it
 when stuff like this happens?
 Happy new year,
 Michal

 On Fri, Jan 3, 2020 at 1:42 AM Kai Jiang  wrote:

> Thanks Alan for checking this out! I closed PR 9903 and reopened it in
> pull/10493. It seems the new
> PR still did not trigger Jenkins jobs.
>
> On Thu, Jan 2, 2020 at 2:55 PM Alan Myrvold 
> wrote:
>
>> Oh, the PR 9903 run is quite old; I don't see a recent one yet.
>>
>> On Thu, Jan 2, 2020 at 2:48 PM Alan Myrvold 
>> wrote:
>>
>>> For PR 10427, I see
>>> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1593/
>>> For PR 9903, I see
>>> https://builds.apache.org/job/beam_PostCommit_Java_Nexmark_Flink_PR/22/
>>>
>>> Maybe the PR status is not being updated when the jobs run?
>>>
>>>
>>> On Thu, Jan 2, 2020 at 2:37 PM Kai Jiang  wrote:
>>>
 same for https://github.com/apache/beam/pull/9903 as well

 On Thu, Jan 2, 2020 at 1:40 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Seems like Jenkins tests are not being triggered for this PR as
> well: https://github.com/apache/beam/pull/10427
>
> On Fri, Dec 20, 2019 at 2:16 PM Tomo Suzuki 
> wrote:
>
>> Jenkins started working. Thank you for whoever fixed it.
>>
>> On Fri, Dec 20, 2019 at 1:42 PM Boyuan Zhang 
>> wrote:
>> >
>> > Same here. Even the phrase trigger doesn't work.
>> >
>> > On Fri, Dec 20, 2019 at 10:16 AM Luke Cwik 
>> wrote:
>> >>
>> >> I'm also affected by this.
>> >>
>> >> On Fri, Dec 20, 2019 at 10:13 AM Tomo Suzuki <
>> suzt...@google.com> wrote:
>> >>>
>> >>> Hi Beam developers,
>> >>>
>> >>> Does anybody know why my PR does not trigger Jenkins jobs
>> today?
>> >>> https://github.com/apache/beam/pull/10438
>> >>>
>> >>> --
>> >>> Regards,
>> >>> Tomo
>>
>>
>>
>> --
>> Regards,
>> Tomo
>>
>

 --

 Michał Walenia
 Polidea  | Software Engineer

 M: +48 791 432 002
 E: michal.wale...@polidea.com

 Unique Tech
 Check out our projects! 

>>>


Re: Executing the runner validation tests for the Twister2 runner

2020-01-06 Thread Kenneth Knowles
On Mon, Jan 6, 2020 at 8:30 AM Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:

> Hi Kenn,
>
> I was able to solve the problem mentioned above. I am currently running
> the "ValidatesRunner" tests; around 4-5 tests are failing, which
> I should be able to fix in a couple of days. I wanted to check the next
> steps I would need to take after all the "ValidatesRunner" tests are
> passing. I assume that the runner does not need to pass all the
> "NeedsRunner" tests.
>

That's correct. You don't need to run the NeedsRunner tests. Those are
tests of the core SDK's functionality, not the runner. The annotation is a
workaround for the false cycle in deps "SDK tests" -> "direct runner" ->
"SDK", which to maven looks like a cyclic dependency. You should run on the
ValidatesRunner tests. It is fine to also disable some of them, either by
excluding categories or test classes, or adding new categories to exclude.
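In Beam these test categories are JUnit category annotations (for example
org.apache.beam.sdk.testing.ValidatesRunner and the various Uses* categories),
and each runner module typically wires the include/exclude lists into its
Gradle validatesRunner task. As a rough plain-JUnit sketch of the same idea
(the suite class name is made up; the categories and ParDoTest are real Beam
classes):

import org.junit.experimental.categories.Categories;
import org.junit.experimental.categories.Categories.ExcludeCategory;
import org.junit.experimental.categories.Categories.IncludeCategory;
import org.junit.runner.RunWith;
import org.junit.runners.Suite.SuiteClasses;

// Run only tests tagged ValidatesRunner, minus categories this runner
// cannot support yet (TestStream in this illustrative example).
@RunWith(Categories.class)
@IncludeCategory(org.apache.beam.sdk.testing.ValidatesRunner.class)
@ExcludeCategory(org.apache.beam.sdk.testing.UsesTestStream.class)
@SuiteClasses({org.apache.beam.sdk.transforms.ParDoTest.class})
public class Twister2ValidatesRunnerSuite {}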


> The runner is only implemented for the batch mode at the moment because
> the higher-level API's for streaming on Twister2 are still being finalized.
> Once that work is done we will add streaming support for the runner as well.
>

Nice! Batch-only is perfectly fine for a runner. You should be able to
detect and reject pipelines that the runner cannot execute.
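One way to do that (a minimal sketch; the class name and error message are
illustrative, not taken from the Twister2 runner, but the visitor API is the
real Beam one) is to walk the pipeline before translation and fail fast on
anything unbounded:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.Pipeline.PipelineVisitor;
import org.apache.beam.sdk.runners.TransformHierarchy;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PValue;

// Rejects streaming pipelines: a batch-only runner fails fast if any
// PCollection in the graph is unbounded.
class BoundednessChecker extends PipelineVisitor.Defaults {
  @Override
  public void visitValue(PValue value, TransformHierarchy.Node producer) {
    if (value instanceof PCollection
        && ((PCollection<?>) value).isBounded() == PCollection.IsBounded.UNBOUNDED) {
      throw new UnsupportedOperationException(
          "This runner only supports bounded (batch) pipelines; "
              + producer.getFullName()
              + " produces an unbounded PCollection.");
    }
  }

  static void check(Pipeline pipeline) {
    pipeline.traverseTopologically(new BoundednessChecker());
  }
}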

I re-read the old thread, but I may have missed the answer to a fundamental
question. Just to get it clear on the mailing list: are you intending to
submit the runner's code to Apache Beam and ask the community to maintain
it?

Kenn


Best Regards,
> Pulasthi
>
> On Thu, Dec 12, 2019 at 11:27 AM Pulasthi Supun Wickramasinghe <
> pulasthi...@gmail.com> wrote:
>
>> Hi Kenn
>>
>> We are still working on aspects like automated job monitoring so
>> currently do not have those capabilities built-in. I discussed with the
>> Twister2 team on a way we can forward failure information from the workers
>> to the Jobmaster which would be a solution to this problem. It might take a
>> little time to develop and test. I will update you after looking into that
>> solution in a little more detail.
>>
>> Best Regards,
>> Pulasthi
>>
>> On Wed, Dec 11, 2019 at 10:51 PM Kenneth Knowles  wrote:
>>
>>> I dug in to Twister2 a little bit to understand the question better,
>>> checking how the various resource managers / launchers are plumbed.
>>>
>>> How would a user set up automated monitoring for a job? If that is
>>> scraping the logs, then it seems unfortunate for users, but I think the
>>> Beam runner would naturally use whatever a user might use.
>>>
>>> Kenn
>>>
>>> On Wed, Dec 11, 2019 at 10:45 AM Pulasthi Supun Wickramasinghe <
>>> pulasthi...@gmail.com> wrote:
>>>
 Hi Dev's

 I have been making some progress on the Twister2 runner for the beam
 that I mentioned before on the mailing list. The runner is able to run the
 wordcount example and produce correct results. So I am currently trying to
 run the runner validation tests.

 From what I understood looking at a couple of examples, tests are
 validated based on the exceptions that are thrown (or not) during test
 runtime. However, in Twister2 the job submission client currently does not
 get failure information such as exceptions back once the job is submitted.
 These are, however, recorded in the worker log files.

 So in order to validate the tests for Twister2 I would have to parse
 the worker logfile and check what exceptions are in the logs. Would that be
 an acceptable solution for the validation tests?

 Best Regards,
 Pulasthi




 --
 Pulasthi S. Wickramasinghe
 PhD Candidate  | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 cell: 224-386-9035

>>>
>>
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035
>>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>


Re: Python IO Connector

2020-01-06 Thread Luke Cwik
Eugene, the JdbcIO output should be updated to support Beam's schema format
which would allow for "rows" to cross the language boundaries.
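For context, a minimal sketch of how the Java JdbcIO read looks today (driver,
URL, query and column names are placeholders): the user-supplied RowMapper
callback is what turns each ResultSet row into an encodable type, and it is
exactly that callback which is awkward to provide from another SDK, which is
why a schema/Row-based output would help.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class JdbcReadExample {
  // The RowMapper lambda below is the user callback discussed in this thread.
  static PCollection<KV<String, Integer>> readScores(Pipeline pipeline) {
    return pipeline.apply(
        JdbcIO.<KV<String, Integer>>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                    "org.postgresql.Driver", "jdbc:postgresql://host:5432/db"))
            .withQuery("SELECT name, score FROM results")
            .withRowMapper(rs -> KV.of(rs.getString("name"), rs.getInt("score")))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of())));
  }
}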

If the connector is easy to write and maintain, then a native implementation
makes sense. Maybe the Python version will have an easier time supporting
splitting and hence could overtake the Java implementation in useful
features.

On Mon, Jan 6, 2020 at 3:55 PM  wrote:

> Apache Airflow went for the DB API approach as well, and it seems to
> have worked well for them. We will likely need to add extra_requires for
> each database engine Python package though, which adds some complexity but
> not a lot.
>
> On Jan 6, 2020, at 6:12 PM, Eugene Kirpichov  wrote:
>
> Agreed with above, it seems prudent to develop a pure-Python connector for
> something as common as interacting with a database. It's likely easier to
> achieve an idiomatic API, familiar to non-Beam Python SQL users, within
> pure Python.
>
> Developing a cross-language connector here might be plain impossible,
> because rows read from a database are (at least in JDBC) not encodable -
> they require a user's callback to translate to an encodable user type, and
> the callback can't be in Python because then you have to encode its input
> before giving it to Python. Same holds for the write transform.
>
> Not sure about sqlalchemy though, maybe use plain DB-API
> https://www.python.org/dev/peps/pep-0249/ instead? Seems like the Python
> one is more friendly than JDBC in the sense that it actually returns rows
> as tuples of simple data types.
>
> On Mon, Jan 6, 2020 at 1:42 PM Robert Bradshaw 
> wrote:
>
>> On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath 
>> wrote:
>>
>>> Regarding cross-language transforms, we need to add better
>>> documentation, but for now you'll have to go with existing examples and
>>> tests. For example,
>>>
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>>>
>>> Note that cross-language transforms feature is currently only available
>>> for Flink Runner. Dataflow support is in development.
>>>
>>
>> I think it works with all non-Dataflow runners, with the exception of the
>> Java and Go Direct runners. (It does work with the Python direct runner.)
>>
>>
>>> I'm fine with developing this natively for Python as well. AFAIK Java
>>> JDBC IO connector is not a super-complicated connector and it should be
>>> fine to make relatively easy to maintain and widely usable connectors
>>> available in multiple SDKs.
>>>
>>
>> Yes, a case can certainly be made for having native connectors for
>> particular common/simple sources. (We certainly don't call cross-language
>> to read text files for example.)
>>
>>
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik  wrote:
>>>
 +Chamikara Jayalath  +Heejong Lee
 

 On Mon, Jan 6, 2020 at 10:20 AM  wrote:

> How do I go about doing that? From the docs, it appears cross language
> transforms are
> currently undocumented.
> https://beam.apache.org/roadmap/connectors-multi-sdk/
> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
>
> What about using a cross language transform between Python and the
> already existing Java JdbcIO transform?
>
> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann 
> wrote:
>
>> I’d like to develop the Python SDK’s SQL IO connector. I was thinking
>> it would be easiest to use sqlalchemy to achieve maximum database engine
>> support, but I suppose I could also create an ABC for databases that 
>> follow
>> the DB API and create subclasses for each database engine that override a
>> connect method. What are your thoughts on the best way to do this?
>>
>


Re: Dropping late data in DirectRunner

2020-01-06 Thread Kenneth Knowles
This thread has a lot in it, so I am just top-posting.

 - Stateful DoFn is a windowed operation; state is per-window. When the
window expires, any further inputs are dropped.
 - "Late" is not synonymous with out-of-order. It doesn't really have an
independent meaning.
   - For a GBK/Combine "late" means "not included prior to the on-time
output", and "droppable" means "arriving after window expiry".
   - For Stateful DoFn there is no real meaning to "late" except if one is
talking about "droppable", which still means "arriving after window
expiry". A user may have a special timer where they flip a flag and treat
elements after the timer differently.

I think the definition of when data is droppable is very simple. We
explicitly moved to this definition, away from the "out of order == late",
because it is more robust and simpler to think about. Users saw lots of
confusing behavior when we had "out of order by allowed lateness ==
droppable" logic.

Kenn

On Mon, Jan 6, 2020 at 1:42 AM Jan Lukavský  wrote:

> > Generally the watermark update can overtake elements, because runners
> can explicitly ignore late data in the watermark calculation (for good
> reason - those elements are already late, so no need to hold up the
> watermark advancing any more).
> This seems not to affect the decision of _not late_ vs. _late_, does it? If an
> element is late and gets ignored from the watermark calculation (whatever that
> includes in this context), then the watermark cannot move past elements
> that were marked as _not late_ and thus nothing can make them _late_.
>
> > For GBK on-time data simply means the first pane marked as on time. For
> state+timers I don't think it makes sense for Beam to define on-time v.s.
> late, rather I think the user can come up with their own definition
> depending on their use case. For example, if you are buffering data into
> BagState and setting a timer to process it, it would be logical to say that
> any element that was buffered before the timer expired is on time, and any
> data that showed up after the timer fired is late. This would roughly
> correspond to what GBK does, and the answer would be very similar to simply
> comparing against the watermark (as the timers fire when the watermark
> advances).
>
> Yes, I'd say that stateful DoFns don't have a (well-defined) concept of
> pane, because that is related to the concept of a trigger, and that is a concept
> of GBK (or windowed operations in general). The only semantic meaning of a
> window in a stateful DoFn is that it "scopes" state.
>
> This discussion might have got a little off the original question, so I'll
> try to rephrase it:
>
> Should a stateful DoFn drop *all* late data, not just data that arrives after
> the window boundary + allowed lateness? Some arguments why I think it should:
>  * in windowed operations (GBK), it is correct to drop data on window
> boundaries only, because time (as seen by the user) effectively hops only on
> these discrete time points
>  * in a stateful DoFn, on the other hand, time moves "smoothly" (yes, with some
> granularity, millisecond, nanosecond, whatever, and with watermark updates
> only, but still)
>  * this could be viewed as: dropping late data immediately as time
> (again, from the user's perspective) moves (rather than on some more or less artificial
> boundary having only little semantic meaning) is consistent with both of the
> above properties
>
> The negative side effect of this would be, that more data could be
> dropped, but ... isn't this what defines allowed lateness? I don't want to
> discuss the implications on user pipelines of such a change (and if we can
> or cannot do it), just trying to build some theoretical understanding of
> the problem as a whole. The decision if any change could / should be made
> can be done afterwards.
>
> Thanks,
>  Jan
>
> On 1/4/20 10:35 PM, Reuven Lax wrote:
>
>
>
> On Sat, Jan 4, 2020 at 12:13 PM Jan Lukavský  wrote:
>
>> > Yes, but invariants should hold. If I add a ParDo that drops late
>> elements (or, more commonly,diverts the late elements  to a different
>> PCollection), then the result of that ParDo should _never_ introduce and
>> more late data. This cannot be guaranteed simply with watermark checks. The
>> ParDo may decide that the element was not late, but by the time it outputs
>> the element the watermark may have advanced, causing the element to
>> actually be late.
>>
>> This is actually very interesting. The question is - if I decide about
>> lateness based on output watermark of a PTransform, is it still the case,
>> that in downstream operator(s) the element could be changed from "not late"
>> to "late"? Provided the output watermark is updated synchronously based on
>> input data (which should be) and watermark update cannot "overtake"
>> elements, I think that the downstream decision should not be changed, so
>> the invariant should hold. Or am I missing something?
>>
>
> Generally the watermark update can overtake elements, because runners  can
> explicitly 

Re: RabbitMQ and CheckpointMark feasibility

2020-01-06 Thread Daniel Robert

Alright, a bit late but this took me a while.

Thanks for all the input so far. I have rewritten much of the RabbitMq 
IO connector and have it ready to go in a draft pr: 
https://github.com/apache/beam/pull/10509


This should incorporate a lot of what's been discussed here, in terms of 
watermarking, serialization, error handling, etc. It also 
clarifies/cleans up a lot of very confusing documentation/api settings 
pertaining to using 'queues vs exchanges' and adds clarifying 
documentation on various valid AMQP paradigms.


Watermarking/timestamp management is mostly stolen from KafkaIO and 
modified as appropriate.


This also does a lot to improve resource management in terms of 
Connection and Channel usage, largely modeled after JdbcIO's 
ConnectionHandlerProvider concept.


I'm not entirely sure how best to proceed from here, hence the email. 
It's a huge PR, but it has no specific backing ticket (it should), and 
historically there haven't been many eyes on RabbitMq PRs.


Thanks,
-Danny

On 11/14/19 4:13 PM, Jan Lukavský wrote:


On 11/14/19 9:50 PM, Daniel Robert wrote:

Alright, thanks everybody. I'm really appreciative of the 
conversation here. I think I see where my disconnect is and how this 
might all work together for me. There are some bugs in the current 
rabbit implementation that I think have confused my understanding of 
the intended semantics. I'm coming around to seeing how such a system 
with rabbit's restrictions can work properly in Beam (I'd totally 
forgotten about 'dedupe' support in Beam) but I want to clarify some 
implementation questions after pulling everyone's notes together.


RabbitMQ reader should not bother accepting an existing 
CheckpointMark in its constructor (in 'ack-based' systems this is 
unnecessary per Eugene's original reply). It should construct its own 
CheckpointMark at construction time and use it throughout its lifecycle.


At some point later, the CheckpointMark will be 'finalized'. If this 
CheckpointMark has been Serialized (via Coder or otherwise) or its 
underlying connection has been severed, this step will fail. This 
would mean at some point the messages are redelivered to Beam on some 
other Reader, so no data loss. If it has not been serialized, the 
acks will take place just fine, even if much later.


If the system is using processing-time as event-time, however, the 
redelivery of these messages would effectively change the ordering 
and potentially the window they arrived in. I *believe* that Beam 
deduping seems to be managed per-window so if 'finalizeCheckpoint' is 
attempted (and fails) would these messages appear in a new window?


This is very likely to happen with any source, if it would assign 
something like *now* to event time. That is ill-defined, and if the 
source cannot provide some retry-persistent estimate of the real 
event time, then I'd suggest forcing the user to specify a UDF to 
extract the event time from the payload. Everything else would probably 
break (at least if any timestamp-related windowing were used in 
the pipeline).


Perhaps my questions are now:
- how should a CheckpointMark communicate failure to Beam

An exception thrown should fail the checkpoint and therefore retry 
everything from the last checkpoint.
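As a hypothetical sketch of what that looks like for an ack-based source (the
class and field names are illustrative, not the actual RabbitMqIO code; the
RabbitMQ client and UnboundedSource.CheckpointMark APIs are real): the reader
records delivery tags as it emits messages, and finalizeCheckpoint acks them,
throwing if the channel has gone away, so the checkpoint fails and the messages
are redelivered.

import com.rabbitmq.client.Channel;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;

class RabbitCheckpointMark implements UnboundedSource.CheckpointMark {
  // Not serializable; if the mark has been encoded/decoded, the channel is
  // gone and finalizeCheckpoint below fails, triggering redelivery.
  private final transient Channel channel;
  private final List<Long> deliveryTags = new ArrayList<>();

  RabbitCheckpointMark(Channel channel) {
    this.channel = channel;
  }

  void addDeliveryTag(long tag) {
    deliveryTags.add(tag);
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    if (channel == null || !channel.isOpen()) {
      throw new IOException("Channel no longer open; messages will be redelivered");
    }
    for (long tag : deliveryTags) {
      channel.basicAck(tag, false); // ack each message individually
    }
    deliveryTags.clear();
  }
}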


- how does Beam handle a CheckpointMark.finalizeCheckpoint failure, 
if the API dictates such a thing?



See above.


- is there a provision that would need to be made for processing-time 
sources that can fail a checkpointmark.finalizeCheckpoint call? (I'm 
nervous redelivered messages would appear in another window)


You are nervous for a reason. :) I strongly believe a processing-time 
source should be considered an anti-pattern, at least in situations where 
there is any time manipulation downstream (time-windows, stateful 
processing, ...).


- What is the relationship lifecycle-wise between a CheckpointMark 
and a Reader? My understanding is a CheckpointMark may outlive a 
Reader, is that correct?


Definitely. But the same instance bound to the lifecycle of the reader 
would be used to finalizeCheckpoint (if that ever happens).


Thanks for bearing with me everyone. It feels a bit unfortunate my 
first foray into beam is reliant on this rabbit connector but I'm 
learning a lot and I'm very grateful for the help. PRs pending once I 
get this all straightened out in my head.


-Danny

On 11/14/19 2:35 PM, Eugene Kirpichov wrote:

Hi Daniel,


On Wed, Nov 13, 2019 at 8:26 PM Daniel Robert > wrote:


I believe I've nailed down a situation that happens in practice
that causes Beam and Rabbit to be incompatible. It seems that
runners can and do make assumptions about the serializability
(via Coder) of a CheckpointMark.

To start, these are the semantics of RabbitMQ:

- the client establishes a connection to the server
- client opens a channel on the connection
- messages are either pulled or pushed to the client from the

Re: Python IO Connector

2020-01-06 Thread Robert Bradshaw
On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath 
wrote:

> Regarding cross-language transforms, we need to add better documentation,
> but for now you'll have to go with existing examples and tests. For example,
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>
> Note that cross-language transforms feature is currently only available
> for Flink Runner. Dataflow support is in development.
>

I think it works with all non-Dataflow runners, with the exception of the
Java and Go Direct runners. (It does work with the Python direct runner.)


> I'm fine with developing this natively for Python as well. AFAIK Java JDBC
> IO connector is not a super-complicated connector and it should be fine to
> make relatively easy to maintain and widely usable connectors available in
> multiple SDKs.
>

Yes, a case can certainly be made for having native connectors for
particular common/simple sources. (We certainly don't call cross-language
to read text files for example.)


>
> Thanks,
> Cham
>
>
> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik  wrote:
>
>> +Chamikara Jayalath  +Heejong Lee
>> 
>>
>> On Mon, Jan 6, 2020 at 10:20 AM  wrote:
>>
>>> How do I go about doing that? From the docs, it appears cross language
>>> transforms are
>>> currently undocumented.
>>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>>> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
>>>
>>> What about using a cross language transform between Python and the
>>> already existing Java JdbcIO transform?
>>>
>>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann  wrote:
>>>
 I’d like to develop the Python SDK’s SQL IO connector. I was thinking
 it would be easiest to use sqlalchemy to achieve maximum database engine
 support, but I suppose I could also create an ABC for databases that follow
 the DB API and create subclasses for each database engine that override a
 connect method. What are your thoughts on the best way to do this?

>>>


Re: Python IO Connector

2020-01-06 Thread Chamikara Jayalath
Regarding cross-language transforms, we need to add better documentation,
but for now you'll have to go with existing examples and tests. For example,

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py

Note that cross-language transforms feature is currently only available for
Flink Runner. Dataflow support is in development.

I'm fine with developing this natively for Python as well. AFAIK Java JDBC
IO connector is not a super-complicated connector and it should be fine to
make relatively easy to maintain and widely usable connectors available in
multiple SDKs.

Thanks,
Cham


On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik  wrote:

> +Chamikara Jayalath  +Heejong Lee
> 
>
> On Mon, Jan 6, 2020 at 10:20 AM  wrote:
>
>> How do I go about doing that? From the docs, it appears cross language
>> transforms are
>> currently undocumented.
>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
>>
>> What about using a cross language transform between Python and the
>> already existing Java JdbcIO transform?
>>
>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann  wrote:
>>
>>> I’d like to develop the Python SDK’s SQL IO connector. I was thinking it
>>> would be easiest to use sqlalchemy to achieve maximum database engine
>>> support, but I suppose I could also create an ABC for databases that follow
>>> the DB API and create subclasses for each database engine that override a
>>> connect method. What are your thoughts on the best way to do this?
>>>
>>


[RESULT][VOTE] Release 2.17.0, release candidate #2

2020-01-06 Thread Mikhail Gryzykhin
Hi all,

I'm happy to announce that we have approved this release.

There are 5 approving votes, 4 of which are binding (in order):
* Ahmet (al...@google.com);
* Luke (lc...@google.com);
* Reuven (re...@google.com);
* Robert (rober...@google.com);

There are no disapproving votes.

Thanks everyone!

Next step is to finalize the release (merge the docs/website/blog PRs,
publish artifacts). Please let me know if you have any questions.

Regards,
--Mikhail


Re: PTransform serialization question

2020-01-06 Thread Alexey Romanenko
Thank you for clarification, Luke.

> On 6 Jan 2020, at 20:03, Luke Cwik  wrote:
> 
> Anything that is reachable by the DoFn/CombineFn/*Fn needs to be 
> serializable. [1] is saying that it is common to have an anonymous inner 
> class for a DoFn which because of its serialization capture will get the 
> encompassing class which is typically a PTransform. If you are careful about 
> reachability, you can decide to not mark lots of things as serializable and 
> this is good because it decreases the size of the serialized *Fn blob as well.
> 
> The [2] javadoc could be clarified that PTransform class supports 
> serialization but is only serialized when part of the serialization capture 
> of a DoFn/CombineFn/*Fn and otherwise will never be serialized.

Yes, I think it would be more clear in this sense. 


> 
> On Mon, Jan 6, 2020 at 10:19 AM Alexey Romanenko  > wrote:
> Hello all,
> 
> I found myself that I’m a bit confused with Serialization requirements for 
> Beam transforms and I want to precise something.
> 
> Here [1] it's clearly mentioned that “DoFn, PTransform, CombineFn and other 
> instances will be serialized”. Since the most of Beam IO Read/Write 
> transforms is based on PTransform, then it means that all internal members of 
> them should be serializable too or declared as transient/static. 
> 
> In the same time, Javadoc of PTransform says [2] that “PTransform doesn't 
> actually support serialization, despite implementing Serializable. PTransform 
> is marked Serializable solely because it is common for an anonymous DoFn, 
> instance to be created within an apply() method of a composite PTransform”. 
> And, on the other hand, “DoFn passed to a ParDo transform must be 
> Serializable” [3] So, DoFn must be really serializable, PTransform is not 
> necessary.
> 
> So, does it mean that the members (that are mostly AutoValue generated) of 
> Read/Write PTransforms are free to be serializable or not if they don’t use 
> anonymous DoFn's? For example, they are needed only for configuration on 
> driver. However, if these members are used in DoFn or in other user defined 
> objects further, when they will be involved on workers, then they must be 
> serializable in any way.  Is it correct assumption?
> 
> Yes
>  
> 
> [1] https://beam.apache.org/contribute/ptransform-style-guide/#serialization 
> 
> [2] 
> https://github.com/apache/beam/blob/42dbb5d9c9fbf45676088a32f862101f03fa76fb/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L116
>  
> 
> [3] 
> https://github.com/apache/beam/blob/e2bb239f0418f1c4949227ba3f51a5f4eb7235df/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L282
>  
> 


Re: [VOTE] Release 2.17.0, release candidate #2

2020-01-06 Thread Mikhail Gryzykhin
Hi all,

I'm happy to announce that we have approved this release.

There are 5 approving votes, 4 of which are binding (in order):
* Ahmet (al...@google.com);
* Luke (lc...@google.com);
* Reuven (re...@google.com);
* Robert (rober...@google.com);

There are no disapproving votes.

Thanks everyone!

Next step is to finalize the release (merge the docs/website/blog PRs,
publish artifacts). Please let me know if you have any questions.

Regards,
--Mikhail

On Mon, Jan 6, 2020 at 10:55 AM Robert Bradshaw  wrote:

> Thanks. That's the right one. The signatures (and everything else) all
> look good now.
>
> Changing my vote to a +1.
>
> On Mon, Jan 6, 2020 at 9:13 AM Mikhail Gryzykhin 
> wrote:
>
>> KEYS files should be fixed now.
>>
>> On Mon, Jan 6, 2020 at 8:29 AM Robert Bradshaw 
>> wrote:
>>
>>> Yes, please update KEYS to have the correct key. (If you've never used
>>> the other one you could just remove it.)
>>>
>>> On Mon, Jan 6, 2020, 6:46 AM Mikhail Gryzykhin 
>>> wrote:
>>>
 I see. Seems that the wrong key is imported into KEYS file. And header
 is incorrect.

 --Mikhail

 On Mon, Jan 6, 2020 at 6:16 AM Mikhail Gryzykhin 
 wrote:

> Hi Robert,
>
> I redownloaded binaries from
> https://dist.apache.org/repos/dist/dev/beam/2.17.0/ and ran
>
> gpg --verify apache-beam-2.17.0-source-release.zip.asc
> gpg: assuming signed data in 'apache-beam-2.17.0-source-release.zip'
> gpg: Signature made Mon 16 Dec 2019 09:17:23 PM PST
> gpg: using RSA key
> 53F72D4EEEF306D97736FE1065ABB07A8965E788
> gpg: Good signature from "Mikhail Gryzykhin "
> [ultimate]
>
> Signature is valid with key 53F72D4EEEF306D97736FE1065ABB07A8965E788.
> The key you received is different. Which binaries did you get that
> signature from?
>
> --Mikhail
>
> On Thu, Jan 2, 2020 at 4:53 PM Robert Bradshaw 
> wrote:
>
>> (Other than that everything looks fine.)
>>
>> On Thu, Jan 2, 2020 at 4:44 PM Robert Bradshaw 
>> wrote:
>> >
>> > -1
>> >
>> > I'm having trouble verifying the signatures on the release
>> artifacts.
>> > When I try to import the key from
>> > https://dist.apache.org/repos/dist/release/beam/KEYS I get
>> >
>> > pub   rsa4096 2019-10-22 [SC]
>> >   79552F5C2FD869A08E097F96841855FB73AFFC7F
>> > uid   [ unknown] Mikhail Gryzykhin (mikhail) <
>> mikh...@apache.org>
>> > sub   rsa4096 2019-10-22 [E]
>> >
>> > which is not the key that these artifacts were signed with.
>> >
>> >
>> > On Thu, Jan 2, 2020 at 4:23 PM Reuven Lax  wrote:
>> > >
>> > > +1
>> > >
>> > > On Thu, Jan 2, 2020 at 3:02 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>> > >>
>> > >> +1. Validated Batch and Streaming quickstarts on Python 3.7
>> (using wheels) and Batch Mobile Gaming examples (user score, hourly team
>> score) on Dataflow.
>> > >>
>> > >> On Thu, Jan 2, 2020 at 11:23 AM Ahmet Altay 
>> wrote:
>> > >>>
>> > >>> This vote needs at least one more PMC vote before it can be
>> finalized. Could you please validate and vote?
>> > >>>
>> > >>> On Mon, Dec 23, 2019 at 9:44 AM Luke Cwik 
>> wrote:
>> > 
>> >  +1, I validated the Java quickstarts for the runners and the
>> issues I have brought up have been moved to a future release.
>> > 
>> >  On Fri, Dec 20, 2019 at 8:09 PM Ahmet Altay 
>> wrote:
>> > >
>> > > +1, I validated the python2 quick starts using wheels. Thank
>> you for pushing the release this far.
>> > >
>> > > On Thu, Dec 19, 2019 at 1:27 PM Kenneth Knowles <
>> k...@google.com> wrote:
>> > >>
>> > >> I verified the Java quickstart on Dataflow manually.
>> > >>
>> > >> Kenn
>> > >>
>> > >> On Wed, Dec 18, 2019 at 5:58 PM jincheng sun <
>> sunjincheng...@gmail.com> wrote:
>> > >>>
>> > >>> Thanks for drive this release Mikhail !
>> > >>>
>> > >>> I have found there is an incorrect release version for
>> release notes in PR[1], also left a question in PR[2].
>> > >>>
>> > >>> But I do not think it's the blocker of the release :)
>> > >>>
>> > >>> Best,
>> > >>> Jincheng
>> > >>>
>> > >>> [1] https://github.com/apache/beam/pull/10401
>> > >>> [2] https://github.com/apache/beam/pull/10402
>> > >>>
>> > >>>
> > >>> Ahmet Altay  wrote on Thu, Dec 19, 2019 at 3:31 AM:
>> > 
>> >  I validated python quickstarts with python 2. Wheels file
>> are missing but they work otherwise. Once the wheel files are added I 
>> will
>> add my vote.
>> > 
>> >  On Wed, Dec 18, 2019 at 10:00 AM Luke Cwik <
>> lc...@google.com> wrote:
>> > >

Re: Request for review of PR [Beam-8564]

2020-01-06 Thread Luke Cwik
Have you had a chance to update the PR?

On Mon, Dec 30, 2019 at 5:00 AM Amogh Tiwari  wrote:

> Hi Luke,
>
> We have gone through shevek/lzo-java, but we chose to go with
> airlift/aircompressor for the following reasons:
>
> 1) shevek/lzo-java internally uses JNI (.c and .h files), hence the
> GPL licence, and that would leave us with the only choice of including this as an
> optional dependency
>
> 2) performance of airlift/aircompressor was much better compared to
> shevek/lzo-java in terms of compression ratio and time taken for
> compression/decompression
>
> 3) airlift/aircompressor is in pure Java and is under the Apache licence
>
> Therefore, we'd prefer to go with adding the current implementation as
> optional. We'd require your inputs on the same as we are unsure on where we
> are supposed to keep the required files and how the final directory
> structure would like. We have an idea and we'll update the current PR
> accordingly.
>
> Please do guide us on this.
>
>
> Regards,
>
> Amogh Tiwari
>
> On Wed, Dec 18, 2019 at 4:42 AM Luke Cwik  wrote:
>
>> Sorry for the long delay (was on vacation).
>>
>> Using org.apache.hadoop isn't part of the Apache Beam Core module but is
>> a dependency for those who depend on the Apache Beam Hadoop module. So I
>> don't think swapping the com.facebook.presto.hadoop version for the
>> org.apache.hadoop version will address Ismael's concern about including
>> hadoop dependencies as part of the core.
>>
>> I looked at shevek/lzo-java[1] and I think its our best choice since it
>> is:
>> * pure Java
>> * GPLv3 (would require marking the dependency optional and telling users
>> to add it explicitly which we have done in the past as well)
>> * small (<100kb)
>> * small dependency tree (commons-logging & findbugs annotations if we
>> only depend on lzo-core)
>> * performance (github page claims 500mb/s compression and 800mb/s
>> decompression)
>>
>> Alternatively we can make the LZO compression an extension module (with
>> the facebook dependency) and use a registrar to have it loaded dynamically.
>>
>> 1: https://github.com/shevek/lzo-java
>>
>> On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari  wrote:
>>
>>> While studying the code, we found that the airlift/aircompressor
>>> library only requires some classes which are also present in the Apache Hadoop
>>> common package. Therefore, we are now thinking that we could make changes in
>>> the airlift/aircompressor package, remove the
>>> com.facebook.presto.hadoop dependency, and use the existing org.apache.hadoop
>>> package, which is
>>> already included in Beam. This will solve both #2 and #3, as the transitive
>>> dependency will be removed and the size will also be reduced by almost
>>> ~20 MB.
>>>
>>> But if we use this approach, we will have to manually change the util
>>> whenever any changes are made to the airlift library.
>>>
>>> On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik  wrote:
>>>
 Going with the Registrar/ServiceLoader route would allow for
 alternative providers for the same compression algorithms so if they don't
 like one they can always contribute a different one.
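A purely hypothetical sketch of that route (none of these names exist in Beam
today; it just mirrors how FileSystems and CoderProviders are already
discovered via ServiceLoader), so an LZO provider could live in its own
optional module and be picked up only when it is on the classpath:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ServiceLoader;

// Hypothetical extension point an optional module would implement.
public interface CompressorRegistrar {
  String getName(); // e.g. "lzo"

  OutputStream compress(OutputStream raw) throws IOException;

  InputStream decompress(InputStream raw) throws IOException;
}

// Hypothetical lookup side: core code only sees whatever providers are
// registered on the classpath.
class Compressors {
  static CompressorRegistrar forName(String name) {
    for (CompressorRegistrar registrar : ServiceLoader.load(CompressorRegistrar.class)) {
      if (registrar.getName().equalsIgnoreCase(name)) {
        return registrar;
      }
    }
    throw new IllegalArgumentException("No compressor registered for " + name);
  }
}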

 On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía  wrote:

> (1) seems not to be the issue because it is Apache licensed.
> (2) and (3) are the big issues, because it requires a provided huge
> uber jar that essentially leaks Hadoop classes into core SDK [1] so it is
> definitely concerning.
>
> We discussed at some point during the PR that added ZStandard support
> about creating some sort of Registrar for compression algorithms [2] but 
> we
> decided to not go ahead because we could achieve that for the zstd case 
> via
> the optional dependencies of commons-compress. Maybe it is time to
> reconsider if such mechanism is worth. For example for users that may not
> care about having the hadoop leakage to be able to use LZO.
>
> Refs.
> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
> [2] https://issues.apache.org/jira/browse/BEAM-6422
>
>
>
>
> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw 
> wrote:
>
>> Is there a way to wrap this up as an optional dependency with multiple
>> possible providers, if there's no good library satisfying all of the
>> conditions (in particular (1))?
>>
>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik  wrote:
>> >
>> > I was hoping that someone in the community would provide some
>> alternatives since there are quite a few implementations.
>> >
>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari 
>> wrote:
>> >>
>> >> Hi Luke,
>> >>
>> >> I agree with your thoughts and observations. But,
>> airlift:aircompressor is the only implementation of LZO in pure java. 
>> That
>> straight away solves #5.
>> >> The other implementations that I found either have licensing
>> issues (since 

Re: PTransform serialization question

2020-01-06 Thread Luke Cwik
Anything that is reachable by the DoFn/CombineFn/*Fn needs to be
serializable. [1] is saying that it is common to have an anonymous inner
class for a DoFn which, because of its serialization capture, will pull in the
enclosing class, which is typically a PTransform. If you are careful
about reachability, you can decide not to mark lots of things as
serializable, and this is good because it decreases the size of the
serialized *Fn blob as well.

The [2] javadoc could be clarified that PTransform class supports
serialization but is only serialized when part of the serialization capture
of a DoFn/CombineFn/*Fn and otherwise will never be serialized.
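A small sketch of the two patterns (class names are mine, purely for
illustration). The commented-out anonymous DoFn would drag the whole enclosing
TagWords transform into the serialized blob; the static DoFn captures only the
one String it needs, so the transform's other members never have to be
serializable:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

class TagWords extends PTransform<PCollection<String>, PCollection<String>> {
  private final String tag; // used only at pipeline-construction time

  TagWords(String tag) {
    this.tag = tag;
  }

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    // Anonymous inner class: its serialization capture includes this TagWords
    // instance and everything reachable from it.
    // return input.apply(ParDo.of(new DoFn<String, String>() {
    //   @ProcessElement
    //   public void process(ProcessContext c) { c.output(tag + c.element()); }
    // }));

    // Careful about reachability: hand the DoFn only the pieces it needs.
    return input.apply(ParDo.of(new TagFn(tag)));
  }

  private static class TagFn extends DoFn<String, String> {
    private final String tag;

    TagFn(String tag) {
      this.tag = tag;
    }

    @ProcessElement
    public void process(ProcessContext c) {
      c.output(tag + c.element());
    }
  }
}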

On Mon, Jan 6, 2020 at 10:19 AM Alexey Romanenko 
wrote:

> Hello all,
>
> I found myself that I’m a bit confused with Serialization requirements for
> Beam transforms and I want to precise something.
>
> Here [1] it's clearly mentioned that “*DoFn, PTransform, CombineFn and
> other instances will be serialized*”. Since the most of Beam IO
> Read/Write transforms is based on PTransform, then it means that all
> internal members of them should be serializable too or declared as
> transient/static.
>
> In the same time, Javadoc of PTransform says [2] that “*PTransform doesn't
> actually support serialization, despite implementing
> Serializable. PTransform is marked Serializable solely because it is common
> for an anonymous **DoFn, instance to be created within an apply() method
> of a composite **PTransform*”. And, on the other hand, “*DoFn passed to a
> ParDo transform must be Serializable*” [3] So, DoFn must be really
> serializable, PTransform is not necessary.
>
> So, does it mean that the members (that are mostly AutoValue generated) of
> Read/Write PTransforms are free to be serializable or not if they don’t use
> anonymous DoFn's? For example, they are needed only for configuration on
> driver. However, if these members are used in DoFn or in other user defined
> objects further, when they will be involved on workers, then they must be
> serializable in any way.  Is it correct assumption?
>

Yes


>
> [1]
> https://beam.apache.org/contribute/ptransform-style-guide/#serialization
> [2]
> https://github.com/apache/beam/blob/42dbb5d9c9fbf45676088a32f862101f03fa76fb/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L116
> [3]
> https://github.com/apache/beam/blob/e2bb239f0418f1c4949227ba3f51a5f4eb7235df/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L282
>
>


Re: Python IO Connector

2020-01-06 Thread Luke Cwik
+Chamikara Jayalath  +Heejong Lee 


On Mon, Jan 6, 2020 at 10:20 AM  wrote:

> How do I go about doing that? From the docs, it appears cross language
> transforms are
> currently undocumented.
> https://beam.apache.org/roadmap/connectors-multi-sdk/
> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
>
> What about using a cross language transform between Python and the already
> existing Java JdbcIO transform?
>
> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann  wrote:
>
>> I’d like to develop the Python SDK’s SQL IO connector. I was thinking it
>> would be easiest to use sqlalchemy to achieve maximum database engine
>> support, but I suppose I could also create an ABC for databases that follow
>> the DB API and create subclasses for each database engine that override a
>> connect method. What are your thoughts on the best way to do this?
>>
>


Re: [VOTE] Release 2.17.0, release candidate #2

2020-01-06 Thread Robert Bradshaw
Thanks. That's the right one. The signatures (and everything else) all look
good now.

Changing my vote to a +1.

On Mon, Jan 6, 2020 at 9:13 AM Mikhail Gryzykhin  wrote:

> KEYS files should be fixed now.
>
> On Mon, Jan 6, 2020 at 8:29 AM Robert Bradshaw 
> wrote:
>
>> Yes, please update KEYS to have the correct key. (If you've never used
>> the other one you could just remove it.)
>>
>> On Mon, Jan 6, 2020, 6:46 AM Mikhail Gryzykhin  wrote:
>>
>>> I see. Seems that the wrong key is imported into KEYS file. And header
>>> is incorrect.
>>>
>>> --Mikhail
>>>
>>> On Mon, Jan 6, 2020 at 6:16 AM Mikhail Gryzykhin 
>>> wrote:
>>>
 Hi Robert,

 I redownloaded binaries from
 https://dist.apache.org/repos/dist/dev/beam/2.17.0/ and ran

 gpg --verify apache-beam-2.17.0-source-release.zip.asc
 gpg: assuming signed data in 'apache-beam-2.17.0-source-release.zip'
 gpg: Signature made Mon 16 Dec 2019 09:17:23 PM PST
 gpg: using RSA key
 53F72D4EEEF306D97736FE1065ABB07A8965E788
 gpg: Good signature from "Mikhail Gryzykhin "
 [ultimate]

 Signature is valid with key 53F72D4EEEF306D97736FE1065ABB07A8965E788.
 The key you received is different. Which binaries did you get that
 signature from?

 --Mikhail

 On Thu, Jan 2, 2020 at 4:53 PM Robert Bradshaw 
 wrote:

> (Other than that everything looks fine.)
>
> On Thu, Jan 2, 2020 at 4:44 PM Robert Bradshaw 
> wrote:
> >
> > -1
> >
> > I'm having trouble verifying the signatures on the release artifacts.
> > When I try to import the key from
> > https://dist.apache.org/repos/dist/release/beam/KEYS I get
> >
> > pub   rsa4096 2019-10-22 [SC]
> >   79552F5C2FD869A08E097F96841855FB73AFFC7F
> > uid   [ unknown] Mikhail Gryzykhin (mikhail) <
> mikh...@apache.org>
> > sub   rsa4096 2019-10-22 [E]
> >
> > which is not the key that these artifacts were signed with.
> >
> >
> > On Thu, Jan 2, 2020 at 4:23 PM Reuven Lax  wrote:
> > >
> > > +1
> > >
> > > On Thu, Jan 2, 2020 at 3:02 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> > >>
> > >> +1. Validated Batch and Streaming quickstarts on Python 3.7
> (using wheels) and Batch Mobile Gaming examples (user score, hourly team
> score) on Dataflow.
> > >>
> > >> On Thu, Jan 2, 2020 at 11:23 AM Ahmet Altay 
> wrote:
> > >>>
> > >>> This vote needs at least one more PMC vote before it can be
> finalized. Could you please validate and vote?
> > >>>
> > >>> On Mon, Dec 23, 2019 at 9:44 AM Luke Cwik 
> wrote:
> > 
> >  +1, I validated the Java quickstarts for the runners and the
> issues I have brought up have been moved to a future release.
> > 
> >  On Fri, Dec 20, 2019 at 8:09 PM Ahmet Altay 
> wrote:
> > >
> > > +1, I validated the python2 quick starts using wheels. Thank
> you for pushing the release this far.
> > >
> > > On Thu, Dec 19, 2019 at 1:27 PM Kenneth Knowles <
> k...@google.com> wrote:
> > >>
> > >> I verified the Java quickstart on Dataflow manually.
> > >>
> > >> Kenn
> > >>
> > >> On Wed, Dec 18, 2019 at 5:58 PM jincheng sun <
> sunjincheng...@gmail.com> wrote:
> > >>>
> > >>> Thanks for drive this release Mikhail !
> > >>>
> > >>> I have found there is an incorrect release version for
> release notes in PR[1], also left a question in PR[2].
> > >>>
> > >>> But I do not think it's the blocker of the release :)
> > >>>
> > >>> Best,
> > >>> Jincheng
> > >>>
> > >>> [1] https://github.com/apache/beam/pull/10401
> > >>> [2] https://github.com/apache/beam/pull/10402
> > >>>
> > >>>
> > >>> Ahmet Altay  wrote on Thu, Dec 19, 2019 at 3:31 AM:
> > 
> >  I validated python quickstarts with python 2. Wheels file
> are missing but they work otherwise. Once the wheel files are added I will
> add my vote.
> > 
> >  On Wed, Dec 18, 2019 at 10:00 AM Luke Cwik <
> lc...@google.com> wrote:
> > >
> > > I verified the release and ran the quickstarts and found
> that release 2.16 broke Apache Nemo runner which is also an issue for
> 2.17.0 RC #2. It is caused by a backwards incompatible change in
> ParDo.MultiOutput where getSideInputs return value was changed from List 
> to
> Map as part of https://github.com/apache/beam/pull/9275. I filed
> https://issues.apache.org/jira/browse/BEAM-8989 to track the issue.
> > >
> > > Should we re-add the method back in 2.17.0 renaming the
> newly added method to something else and also patch 2.16.0 with a minor
> change including the same fix (breaking 

Re: [Proposal] Slowly Changing Dimensions support in Beam

2020-01-06 Thread Mikhail Gryzykhin
I've narrowed down the topic. This does not include any of the Dataflow-specific
parts and is general for all runners. Please see the updated proposal document.

Changes:
* Changed title
* Narrowed topic to slowly changing dimensions support only.

This should simplify discussion and make it easier to come to conclusion.

Looking for comments on:
* API for the new feature.
   * Currently there's a single comment with an alternative approach.
* General idea review.

Regards,
--Mikhail

On Wed, Dec 18, 2019 at 3:14 PM Kenneth Knowles  wrote:

> I do think that the implementation concerns around larger side inputs are
> relevant to most runners. Ideally there would be no model change necessary.
> Triggers are harder and bring in consistency concerns, which are even more
> likely to be relevant to all runners.
>
> Kenn
>
> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik  wrote:
>
>> Most of the doc is about how to support distributed side inputs in
>> Dataflow. It doesn't really cover how the Beam model's triggers (accumulating,
>> discarding, retraction) impact what the "contents" of a
>> PCollection are over time, nor how this proposal for a limited set of side input
>> shapes can work to support larger side inputs in Dataflow.
>>
>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský  wrote:
>>
>>> Hi Mikhail,
>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>
>>> inline
>>>
>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský  wrote:
>>>
 Hi,

 I actually thought that the proposal refers to Dataflow only. If this
 is supposed to be general, can we remove the Dataflow/Windmill specific
 parts and replace them with generic ones?

>>>  I'll look into rephrasing doc to keep Dataflow/Windmill as example.
>>>
>>> Cool, thanks!
>>>
>>> I'd have two more questions:

  a) the proposal is named "Slowly changing", why is the rate of change
 essential to the proposal? Once running on event time, that should not
 matter, or what am I missing?

>>> Within this proposal, it is suggested to make a full snapshot of the data on
>>> every re-read. This is generally expensive, and setting the time event to a short
>>> interval might cause issues. Otherwise it is not essential.
>>>
>>> Understood. This relates to table-stream duality, where the requirements
>>> might relax once you don't have to convert table to stream by re-reading
>>> it, but by being able to retrieve updates as you go (example would be
>>> reading directly from kafka or any other "commit log" abstraction).
>>>
>>>  b) The description says: 'User wants to solve a stream enrichment
 problem. In brief request sounds like: ”I want to enrich each event in this
 stream by corresponding data from given table.”'. That is understandable,
 but would it be better to enable the user to express this intent directly
 (via Join operation)? The actual implementation might be runner (and
 input!) specific. The analogy is that when doing group-by-key operation,
 runner can choose hash grouping or sort-merge grouping, but that is not
 (directly) expressed in user code. I'm not saying that we should not have
 low-level transforms, just asking if it would be better to leave this
 decision to the runner (at least in some cases). It might be the case that
 we want to make core SDK as low level as possible (and as reasonable), I
 just want to make sure that that is really the intent.

>>> The idea is to add a basic operation with as small a change as possible to
>>> the current API.
>>> Ultimate goal is to have a Join/GBK operator that will choose proper
>>> strategy. However, I don't think that we have proper tools and view of how
>>> to choose best strategy at hand as of yet.
>>>
>>> OK, cool. That is where I would find it very much useful to have some
>>> sort of "goals", that we are targeting. I agree that there are some pieces
>>> missing in the puzzle as of now. But it would be good to know what these
>>> pieces are and what needs to be done to fulfill our goals. But this is
>>> probably not related to discussion of this proposal, but more related to
>>> the concept of BIP or similar.
>>>
>>> Thanks for the explanation.
>>>
>>> Thanks for the proposal!

 Jan
 On 12/17/19 12:01 AM, Kenneth Knowles wrote:

 I want to highlight that this design works for definitely more runners
 than just Dataflow. I see two pieces of it that I want to bring onto the
 thread:

 1. A new kind of "unbounded source" which is a periodic refresh of a
 bounded source, and use that as a side input. Each main input element has a
 window that maps to a specific refresh of the side input.
 2. Distributed map side inputs: supporting very large lookup tables,
 but with consistency challenges. Even the part about "windmill API"
 probably applies to other runners

 So I hope the title and "Objective" section do not cause people to stop
 reading.
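For reference, item 1 corresponds to the "slowly updating side input"
workaround that exists today. A sketch of that pattern (readTableSnapshot()
is a placeholder for whatever bounded read of the dimension table the user
needs):

import java.util.Collections;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

class SlowlyChangingSideInput {
  // A periodic impulse re-reads the lookup table and exposes the latest
  // snapshot as a side input view that main-input elements can join against.
  static PCollectionView<Map<String, String>> build(Pipeline pipeline) {
    return pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            c.output(readTableSnapshot());
          }
        }))
        .apply(Window.<Map<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(View.asSingleton());
  }

  // Placeholder: replace with the actual bounded read of the dimension table.
  private static Map<String, String> readTableSnapshot() {
    return Collections.singletonMap("key", "value");
  }
}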


Re: Python IO Connector

2020-01-06 Thread pbd281
How do I go about doing that? From the docs, it appears cross language 
transforms are
currently undocumented. https://beam.apache.org/roadmap/connectors-multi-sdk/
> On Jan 6, 2020, at 12:55 PM, Luke Cwik  wrote:
> 
> What about using a cross language transform between Python and the already 
> existing Java JdbcIO transform?
> 
>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann  wrote:
>> I’d like to develop the Python SDK’s SQL IO connector. I was thinking it 
>> would be easiest to use sqlalchemy to achieve maximum database engine 
>> support, but I suppose I could also create an ABC for databases that follow 
>> the DB API and create subclasses for each database engine that override a 
>> connect method. What are your thoughts on the best way to do this?


PTransform serialization question

2020-01-06 Thread Alexey Romanenko
Hello all,

I found myself a bit confused with the serialization requirements for Beam 
transforms and I want to clarify something.

Here [1] it's clearly mentioned that “DoFn, PTransform, CombineFn and other 
instances will be serialized”. Since most of the Beam IO Read/Write transforms 
are based on PTransform, this means that all of their internal members should 
be serializable too, or declared as transient/static.

In the same time, Javadoc of PTransform says [2] that “PTransform doesn't 
actually support serialization, despite implementing Serializable. PTransform 
is marked Serializable solely because it is common for an anonymous DoFn 
instance to be created within an apply() method of a composite PTransform”. 
And, on the other hand, “DoFn passed to a ParDo transform must be Serializable” 
[3]. So a DoFn must really be serializable, while a PTransform does not have to be.

So, does it mean that the members (that are mostly AutoValue generated) of 
Read/Write PTransforms are free to be serializable or not if they don’t use 
anonymous DoFn's? For example, they are needed only for configuration on 
driver. However, if these members are used in DoFn or in other user defined 
objects further, when they will be involved on workers, then they must be 
serializable in any way.  Is it correct assumption?

[1] https://beam.apache.org/contribute/ptransform-style-guide/#serialization 

[2] 
https://github.com/apache/beam/blob/42dbb5d9c9fbf45676088a32f862101f03fa76fb/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java#L116
[3] 
https://github.com/apache/beam/blob/e2bb239f0418f1c4949227ba3f51a5f4eb7235df/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L282
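
For illustration only, here is a minimal, hedged sketch of the situation being
asked about: a composite transform whose non-serializable member is used only at
construction time and is marked transient, while the value captured by its DoFn
is serializable. The class name, field names and the use of java.sql.Connection
are made up for this example and are not taken from any existing Beam IO.

import java.sql.Connection;

import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical sketch: which members of a Read-style PTransform must be Serializable?
public class MyJdbcRead extends PTransform<PBegin, PCollection<String>> {

  // Used only while expanding the pipeline on the driver. java.sql.Connection is
  // not Serializable, so it is marked transient and is simply skipped if this
  // transform gets serialized (e.g. because the anonymous DoFn below captures
  // MyJdbcRead.this).
  private final transient Connection connection;

  // Plain serializable configuration; this value does travel to the workers as
  // part of the serialized DoFn.
  private final String query;

  public MyJdbcRead(Connection connection, String query) {
    this.connection = connection;
    this.query = query;
  }

  @Override
  public PCollection<String> expand(PBegin input) {
    // The anonymous DoFn is exactly why PTransform implements Serializable: it
    // holds a reference to the enclosing MyJdbcRead instance, so that instance is
    // serialized along with the DoFn, and only its non-transient members (here,
    // `query`) must be serializable.
    return input
        .apply(Create.of("seed"))
        .apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(query + " / " + c.element());
          }
        }));
  }
}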



Re: Python IO Connector

2020-01-06 Thread Luke Cwik
What about using a cross language transform between Python and the already
existing Java JdbcIO transform?

On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann  wrote:

> I’d like to develop the Python SDK’s SQL IO connector. I was thinking it
> would be easiest to use sqlalchemy to achieve maximum database engine
> support, but I suppose I could also create an ABC for databases that follow
> the DB API and create subclasses for each database engine that override a
> connect method. What are your thoughts on the best way to do this?
>


Re: [VOTE] Release 2.17.0, release candidate #2

2020-01-06 Thread Mikhail Gryzykhin
KEYS files should be fixed now.

On Mon, Jan 6, 2020 at 8:29 AM Robert Bradshaw  wrote:

> Yes, please update KEYS to have the correct key. (If you've never used the
> other one you could just remove it.)
>
> On Mon, Jan 6, 2020, 6:46 AM Mikhail Gryzykhin  wrote:
>
>> I see. Seems that the wrong key is imported into KEYS file. And header is
>> incorrect.
>>
>> --Mikhail
>>
>> On Mon, Jan 6, 2020 at 6:16 AM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi Robert,
>>>
>>> I redownloaded binaries from
>>> https://dist.apache.org/repos/dist/dev/beam/2.17.0/ and ran
>>>
>>> gpg --verify apache-beam-2.17.0-source-release.zip.asc
>>> gpg: assuming signed data in 'apache-beam-2.17.0-source-release.zip'
>>> gpg: Signature made Mon 16 Dec 2019 09:17:23 PM PST
>>> gpg: using RSA key 53F72D4EEEF306D97736FE1065ABB07A8965E788
>>> gpg: Good signature from "Mikhail Gryzykhin "
>>> [ultimate]
>>>
>>> Signature is valid with key 53F72D4EEEF306D97736FE1065ABB07A8965E788.
>>> The key you received is different. Which binaries did you get that
>>> signature from?
>>>
>>> --Mikhail
>>>
>>> On Thu, Jan 2, 2020 at 4:53 PM Robert Bradshaw 
>>> wrote:
>>>
 (Other than that everything looks fine.)

 On Thu, Jan 2, 2020 at 4:44 PM Robert Bradshaw 
 wrote:
 >
 > -1
 >
 > I'm having trouble verifying the signatures on the release artifacts.
 > When I try to import the key from
 > https://dist.apache.org/repos/dist/release/beam/KEYS I get
 >
 > pub   rsa4096 2019-10-22 [SC]
 >   79552F5C2FD869A08E097F96841855FB73AFFC7F
 > uid   [ unknown] Mikhail Gryzykhin (mikhail) <
 mikh...@apache.org>
 > sub   rsa4096 2019-10-22 [E]
 >
 > which is not the key that these artifacts were signed with.
 >
 >
 > On Thu, Jan 2, 2020 at 4:23 PM Reuven Lax  wrote:
 > >
 > > +1
 > >
 > > On Thu, Jan 2, 2020 at 3:02 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:
 > >>
 > >> +1. Validated Batch and Streaming quickstarts on Python 3.7 (using
 wheels) and Batch Mobile Gaming examples (user score, hourly team score) on
 Dataflow.
 > >>
 > >> On Thu, Jan 2, 2020 at 11:23 AM Ahmet Altay 
 wrote:
 > >>>
 > >>> This vote needs at least one more PMC vote before it can be
 finalized. Could you please validate and vote?
 > >>>
 > >>> On Mon, Dec 23, 2019 at 9:44 AM Luke Cwik 
 wrote:
 > 
 >  +1, I validated the Java quickstarts for the runners and the
 issues I have brought up have been moved to a future release.
 > 
 >  On Fri, Dec 20, 2019 at 8:09 PM Ahmet Altay 
 wrote:
 > >
 > > +1, I validated the python2 quick starts using wheels. Thank
 you for pushing the release this far.
 > >
 > > On Thu, Dec 19, 2019 at 1:27 PM Kenneth Knowles 
 wrote:
 > >>
 > >> I verified the Java quickstart on Dataflow manually.
 > >>
 > >> Kenn
 > >>
 > >> On Wed, Dec 18, 2019 at 5:58 PM jincheng sun <
 sunjincheng...@gmail.com> wrote:
 > >>>
 > >>> Thanks for drive this release Mikhail !
 > >>>
 > >>> I have found there is an incorrect release version for
 release notes in PR[1], also left a question in PR[2].
 > >>>
 > >>> But I do not think it's the blocker of the release :)
 > >>>
 > >>> Best,
 > >>> Jincheng
 > >>>
 > >>> [1] https://github.com/apache/beam/pull/10401
 > >>> [2] https://github.com/apache/beam/pull/10402
 > >>>
 > >>>
 > >>> Ahmet Altay  于2019年12月19日周四 上午3:31写道:
 > 
 >  I validated python quickstarts with python 2. Wheels file
 are missing but they work otherwise. Once the wheel files are added I will
 add my vote.
 > 
 >  On Wed, Dec 18, 2019 at 10:00 AM Luke Cwik 
 wrote:
 > >
 > > I verified the release and ran the quickstarts and found
 that release 2.16 broke Apache Nemo runner which is also an issue for
 2.17.0 RC #2. It is caused by a backwards incompatible change in
 ParDo.MultiOutput where getSideInputs return value was changed from List to
 Map as part of https://github.com/apache/beam/pull/9275. I filed
 https://issues.apache.org/jira/browse/BEAM-8989 to track the issue.
 > >
 > > Should we re-add the method back in 2.17.0 renaming the
 newly added method to something else and also patch 2.16.0 with a minor
 change including the same fix (breaking 2.16.0 users who picked up the new
 method) or leave as is?
 > 
 > 
 >  I suggest not fixing this for 2.17, because the issue
 already exists in 2.16 and there are two releases in parallel and it would
 be fine to fix this for 2.18 or 2.19.
 > 
 >  +Reuven Lax, who merged the 

Re: Proposed Jira and PR to change error messaging for Python SDK filesystem module

2020-01-06 Thread Luke Cwik
Thanks for the contribution.

On Mon, Jan 6, 2020 at 6:29 AM David Sabater Dinter 
wrote:

> Hi everyone!
> Happy New Year, personally coming with some resolutions like trying to
> contribute more often to Apache projects I love. :)
> Just wanted to mention that I created a new Jira
>  and PR
>  to improve the error
> messaging in the Python SDK when not specifying the correct filesystem
> (typically happening when installing apache_beam without the [gcp]
> dependency).
> Let me know your comments and approve/merge the trivial, but I think
> helpful, change if feasible.
>
>
> Thanks.
>
>
>


Re: Flaky Java warning/error inventory (cannot find symbol)

2020-01-06 Thread Tomo Suzuki
Hi Alex,

(I also feel frustrated when Java precommit checks sometimes fail due to
connection errors. I appreciate that the Beam project makes it easy to re-run
them via "Run Java Precommit".)

I dug into the builds but found no clear answer to your question. The
"cannot find symbol" error [1] comes from
:sdks:java:io:google-cloud-platform:javadoc. An excerpt from the
consoleFull log [2]:

09:50:23 > Task :sdks:java:io:google-cloud-platform:javadoc
09:50:23 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java:700:
error: cannot find symbol
09:50:23   public abstract static class TypedRead extends
PTransform> {
09:50:23 ^
09:50:23   symbol:   class PTransform
09:50:23   location: class BigQueryIO
09:50:23 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java:700:
error: cannot find symbol
09:50:23   public abstract static class TypedRead extends
PTransform> {
09:50:23^
09:50:23   symbol:   class PBegin
09:50:23   location: class BigQueryIO
09:50:23 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java:700:
error: cannot find symbol
09:50:23   public abstract static class TypedRead extends
PTransform> {
09:50:23
 ^
09:50:23   symbol:   class PCollection
09:50:23   location: class BigQueryIO
09:50:23 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java:912:
error: cannot find symbol
09:50:23   extends PTransform,
SpannerWriteResult> {
09:50:23   ^
09:50:23   symbol:   class PTransform
09:50:23   location: class SpannerIO
09:50:23 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java:912:
error: cannot find symbol
09:50:23   extends PTransform,
SpannerWriteResult> {
09:50:23  ^
09:50:23   symbol:   class PCollection
09:50:23   location: class SpannerIO
09:50:24 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java:452:
warning - Tag @link: can't find withQueryPriority(
09:50:24  TypedRead.QueryPriority) in
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead
09:50:25


I don't have an explanation for why the javadoc task could not find these
classes at that time. As this is only javadoc, I stopped digging further.
Can anybody think of a cause for "cannot find symbol" in the
javadoc task?

[1]: 
https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1598/java/packageName.-780910515/
[2]: https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1598/consoleFull

Regards,
Tomo

On Sun, Jan 5, 2020 at 7:41 PM Alex Van Boxel  wrote:
>
> Hey Tomo,
>
> so this is one of mine (no change, just a retrigger):
> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1599/
> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1598/
>
> but here is a list of other PR's with the same issue:
> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1595/
> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1594/
> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1588/
>
>
>
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Jan 6, 2020 at 12:56 AM Tomo Suzuki  wrote:
>>
>> I’m interested in that problem. A “cannot find symbol” problem is usually 
>> caused by incompatible dependencies; there should be no randomness.
>>
>> Would you share the two URLs of the Java Precommit Jenkins jobs that you 
>> think failed for the same cause?
>>
>> Regards,
>> Tomo
>>
>> On Sun, Jan 5, 2020 at 05:48 Alex Van Boxel  wrote:
>>>
>>> I've noticed a pattern of fatal warnings/errors, and they are always the 
>>> same. If you want to check for yourself, just use this link and go to a few 
>>> previous builds... it's always the same tests that are failing:
>>>
>>> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1596/java/new/
>>>
>>> This is an inventory of the failing tests, always with the same error: 
>>> cannot find symbol
>>>
>>> RowJson.java:81 org.apache.beam.sdk.util
>>> BigQueryIO.java:700 org.apache.beam.sdk.io.gcp.bigquery
>>> SpannerIO.java:912 org.apache.beam.sdk.io.gcp.spanner
>>> BigQueryIO.java:700 org.apache.beam.sdk.io.gcp.bigquery
>>> SpannerIO.java:912 org.apache.beam.sdk.io.gcp.spanner
>>> Watch.java:170 org.apache.beam.sdk.transforms
>>>
>>> For my latest PR, I first got the RowJson error, and after no code change 
>>> and re-running "Run Java PreCommit" I got the BigQueryIO one.
>>>
>>> Anyone 

Re: [VOTE] Release 2.17.0, release candidate #2

2020-01-06 Thread Mikhail Gryzykhin
I see. It seems that the wrong key is imported into the KEYS file, and the
header is incorrect.

--Mikhail

On Mon, Jan 6, 2020 at 6:16 AM Mikhail Gryzykhin  wrote:

> Hi Robert,
>
> I redownloaded binaries from
> https://dist.apache.org/repos/dist/dev/beam/2.17.0/ and ran
>
> gpg --verify apache-beam-2.17.0-source-release.zip.asc
> gpg: assuming signed data in 'apache-beam-2.17.0-source-release.zip'
> gpg: Signature made Mon 16 Dec 2019 09:17:23 PM PST
> gpg: using RSA key 53F72D4EEEF306D97736FE1065ABB07A8965E788
> gpg: Good signature from "Mikhail Gryzykhin "
> [ultimate]
>
> Signature is valid with key 53F72D4EEEF306D97736FE1065ABB07A8965E788. The
> key you received is different. Which binaries did you get that signature
> from?
>
> --Mikhail
>
> On Thu, Jan 2, 2020 at 4:53 PM Robert Bradshaw 
> wrote:
>
>> (Other than that everything looks fine.)
>>
>> On Thu, Jan 2, 2020 at 4:44 PM Robert Bradshaw 
>> wrote:
>> >
>> > -1
>> >
>> > I'm having trouble verifying the signatures on the release artifacts.
>> > When I try to import the key from
>> > https://dist.apache.org/repos/dist/release/beam/KEYS I get
>> >
>> > pub   rsa4096 2019-10-22 [SC]
>> >   79552F5C2FD869A08E097F96841855FB73AFFC7F
>> > uid   [ unknown] Mikhail Gryzykhin (mikhail) <
>> mikh...@apache.org>
>> > sub   rsa4096 2019-10-22 [E]
>> >
>> > which is not the key that these artifacts were signed with.
>> >
>> >
>> > On Thu, Jan 2, 2020 at 4:23 PM Reuven Lax  wrote:
>> > >
>> > > +1
>> > >
>> > > On Thu, Jan 2, 2020 at 3:02 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>> > >>
>> > >> +1. Validated Batch and Streaming quickstarts on Python 3.7 (using
>> wheels) and Batch Mobile Gaming examples (user score, hourly team score) on
>> Dataflow.
>> > >>
>> > >> On Thu, Jan 2, 2020 at 11:23 AM Ahmet Altay 
>> wrote:
>> > >>>
>> > >>> This vote needs at least one more PMC vote before it can be
>> finalized. Could you please validate and vote?
>> > >>>
>> > >>> On Mon, Dec 23, 2019 at 9:44 AM Luke Cwik  wrote:
>> > 
>> >  +1, I validated the Java quickstarts for the runners and the
>> issues I have brought up have been moved to a future release.
>> > 
>> >  On Fri, Dec 20, 2019 at 8:09 PM Ahmet Altay 
>> wrote:
>> > >
>> > > +1, I validated the python2 quick starts using wheels. Thank you
>> for pushing the release this far.
>> > >
>> > > On Thu, Dec 19, 2019 at 1:27 PM Kenneth Knowles 
>> wrote:
>> > >>
>> > >> I verified the Java quickstart on Dataflow manually.
>> > >>
>> > >> Kenn
>> > >>
>> > >> On Wed, Dec 18, 2019 at 5:58 PM jincheng sun <
>> sunjincheng...@gmail.com> wrote:
>> > >>>
>> > >>> Thanks for drive this release Mikhail !
>> > >>>
>> > >>> I have found there is an incorrect release version for release
>> notes in PR[1], also left a question in PR[2].
>> > >>>
>> > >>> But I do not think it's the blocker of the release :)
>> > >>>
>> > >>> Best,
>> > >>> Jincheng
>> > >>>
>> > >>> [1] https://github.com/apache/beam/pull/10401
>> > >>> [2] https://github.com/apache/beam/pull/10402
>> > >>>
>> > >>>
>> > >>> Ahmet Altay  于2019年12月19日周四 上午3:31写道:
>> > 
>> >  I validated python quickstarts with python 2. Wheels file are
>> missing but they work otherwise. Once the wheel files are added I will add
>> my vote.
>> > 
>> >  On Wed, Dec 18, 2019 at 10:00 AM Luke Cwik 
>> wrote:
>> > >
>> > > I verified the release and ran the quickstarts and found that
>> release 2.16 broke Apache Nemo runner which is also an issue for 2.17.0 RC
>> #2. It is caused by a backwards incompatible change in ParDo.MultiOutput
>> where getSideInputs return value was changed from List to Map as part of
>> https://github.com/apache/beam/pull/9275. I filed
>> https://issues.apache.org/jira/browse/BEAM-8989 to track the issue.
>> > >
>> > > Should we re-add the method back in 2.17.0 renaming the newly
>> added method to something else and also patch 2.16.0 with a minor change
>> including the same fix (breaking 2.16.0 users who picked up the new method)
>> or leave as is?
>> > 
>> > 
>> >  I suggest not fixing this for 2.17, because the issue already
>> exists in 2.16 and there are two releases in parallel and it would be fine
>> to fix this for 2.18 or 2.19.
>> > 
>> >  +Reuven Lax, who merged the mentioned PR.
>> > 
>> > >
>> > >
>> > > On Tue, Dec 17, 2019 at 12:13 PM Mikhail Gryzykhin <
>> mig...@google.com> wrote:
>> > >>
>> > >> Hi everyone,
>> > >>
>> > >>
>> > >> Please review and vote on the release candidate #2 for the
>> version 2.17.0, as follows:
>> > >>
>> > >> [ ] +1, Approve the release
>> > >>
>> > >> [ ] -1, Do not approve the release (please provide specific
>> comments)
>> > >>
>> > >>
>> > 

Proposed Jira and PR to change error messaging for Python SDK filesystem module

2020-01-06 Thread David Sabater Dinter
Hi everyone!
Happy New Year! Personally, I'm coming in with some resolutions, like trying to
contribute more often to the Apache projects I love. :)
Just wanted to mention that I created a new Jira
 and PR
 to improve the error messaging
in the Python SDK when not specifying the correct filesystem (typically
happening when installing apache_beam without the [gcp] dependency).
Let me know your comments and, if feasible, approve/merge this trivial but, I
think, helpful change.


Thanks.


Re: [VOTE] Release 2.17.0, release candidate #2

2020-01-06 Thread Mikhail Gryzykhin
Hi Robert,

I redownloaded binaries from
https://dist.apache.org/repos/dist/dev/beam/2.17.0/ and ran

gpg --verify apache-beam-2.17.0-source-release.zip.asc
gpg: assuming signed data in 'apache-beam-2.17.0-source-release.zip'
gpg: Signature made Mon 16 Dec 2019 09:17:23 PM PST
gpg: using RSA key 53F72D4EEEF306D97736FE1065ABB07A8965E788
gpg: Good signature from "Mikhail Gryzykhin " [ultimate]

Signature is valid with key 53F72D4EEEF306D97736FE1065ABB07A8965E788. The
key you received is different. Which binaries did you get that signature
from?

--Mikhail

On Thu, Jan 2, 2020 at 4:53 PM Robert Bradshaw  wrote:

> (Other than that everything looks fine.)
>
> On Thu, Jan 2, 2020 at 4:44 PM Robert Bradshaw 
> wrote:
> >
> > -1
> >
> > I'm having trouble verifying the signatures on the release artifacts.
> > When I try to import the key from
> > https://dist.apache.org/repos/dist/release/beam/KEYS I get
> >
> > pub   rsa4096 2019-10-22 [SC]
> >   79552F5C2FD869A08E097F96841855FB73AFFC7F
> > uid   [ unknown] Mikhail Gryzykhin (mikhail)  >
> > sub   rsa4096 2019-10-22 [E]
> >
> > which is not the key that these artifacts were signed with.
> >
> >
> > On Thu, Jan 2, 2020 at 4:23 PM Reuven Lax  wrote:
> > >
> > > +1
> > >
> > > On Thu, Jan 2, 2020 at 3:02 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> > >>
> > >> +1. Validated Batch and Streaming quickstarts on Python 3.7 (using
> wheels) and Batch Mobile Gaming examples (user score, hourly team score) on
> Dataflow.
> > >>
> > >> On Thu, Jan 2, 2020 at 11:23 AM Ahmet Altay  wrote:
> > >>>
> > >>> This vote needs at least one more PMC vote before it can be
> finalized. Could you please validate and vote?
> > >>>
> > >>> On Mon, Dec 23, 2019 at 9:44 AM Luke Cwik  wrote:
> > 
> >  +1, I validated the Java quickstarts for the runners and the issues
> I have brought up have been moved to a future release.
> > 
> >  On Fri, Dec 20, 2019 at 8:09 PM Ahmet Altay 
> wrote:
> > >
> > > +1, I validated the python2 quick starts using wheels. Thank you
> for pushing the release this far.
> > >
> > > On Thu, Dec 19, 2019 at 1:27 PM Kenneth Knowles 
> wrote:
> > >>
> > >> I verified the Java quickstart on Dataflow manually.
> > >>
> > >> Kenn
> > >>
> > >> On Wed, Dec 18, 2019 at 5:58 PM jincheng sun <
> sunjincheng...@gmail.com> wrote:
> > >>>
> > >>> Thanks for drive this release Mikhail !
> > >>>
> > >>> I have found there is an incorrect release version for release
> notes in PR[1], also left a question in PR[2].
> > >>>
> > >>> But I do not think it's the blocker of the release :)
> > >>>
> > >>> Best,
> > >>> Jincheng
> > >>>
> > >>> [1] https://github.com/apache/beam/pull/10401
> > >>> [2] https://github.com/apache/beam/pull/10402
> > >>>
> > >>>
> > >>> Ahmet Altay  于2019年12月19日周四 上午3:31写道:
> > 
> >  I validated python quickstarts with python 2. Wheels file are
> missing but they work otherwise. Once the wheel files are added I will add
> my vote.
> > 
> >  On Wed, Dec 18, 2019 at 10:00 AM Luke Cwik 
> wrote:
> > >
> > > I verified the release and ran the quickstarts and found that
> release 2.16 broke Apache Nemo runner which is also an issue for 2.17.0 RC
> #2. It is caused by a backwards incompatible change in ParDo.MultiOutput
> where getSideInputs return value was changed from List to Map as part of
> https://github.com/apache/beam/pull/9275. I filed
> https://issues.apache.org/jira/browse/BEAM-8989 to track the issue.
> > >
> > > Should we re-add the method back in 2.17.0 renaming the newly
> added method to something else and also patch 2.16.0 with a minor change
> including the same fix (breaking 2.16.0 users who picked up the new method)
> or leave as is?
> > 
> > 
> >  I suggest not fixing this for 2.17, because the issue already
> exists in 2.16 and there are two releases in parallel and it would be fine
> to fix this for 2.18 or 2.19.
> > 
> >  +Reuven Lax, who merged the mentioned PR.
> > 
> > >
> > >
> > > On Tue, Dec 17, 2019 at 12:13 PM Mikhail Gryzykhin <
> mig...@google.com> wrote:
> > >>
> > >> Hi everyone,
> > >>
> > >>
> > >> Please review and vote on the release candidate #2 for the
> version 2.17.0, as follows:
> > >>
> > >> [ ] +1, Approve the release
> > >>
> > >> [ ] -1, Do not approve the release (please provide specific
> comments)
> > >>
> > >>
> > >>
> > >> The complete staging area is available for your review, which
> includes:
> > >>
> > >> * JIRA release notes [1],
> > >>
> > >> * the official Apache source release to be deployed to
> dist.apache.org [2], which is signed with the key with fingerprint
> 

Beam Dependency Check Report (2020-01-06)

2020-01-06 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK
(dependency: current version -> latest version; release date of current, release date of latest; JIRA issue):

  cachetools:             3.1.1  -> 4.0.0   (2019-12-23, 2019-12-23)  BEAM-9017
  google-cloud-bigquery:  1.17.1 -> 1.23.1  (2019-09-23, 2019-12-23)  BEAM-5537
  google-cloud-datastore: 1.7.4  -> 1.10.0  (2019-05-27, 2019-10-21)  BEAM-8443
  httplib2:               0.12.0 -> 0.15.0  (2018-12-10, 2019-12-23)  BEAM-9018
  mock:                   2.0.0  -> 3.0.5   (2019-05-20, 2019-05-20)  BEAM-7369
  oauth2client:           3.0.0  -> 4.1.3   (2018-12-10, 2018-12-10)  BEAM-6089
  pytest:                 4.6.9  -> 5.3.2   (None,       2019-12-23)  BEAM-8606
  Sphinx:                 1.8.5  -> 2.3.1   (2019-05-20, 2019-12-23)  BEAM-7370
  tenacity:               5.1.5  -> 6.0.0   (2019-11-11, 2019-11-11)  BEAM-8607
High Priority Dependency Updates Of Beam Java SDK
(dependency: current version -> latest version; release date of current, release date of latest; JIRA issue):

  com.alibaba:fastjson: 1.2.49 -> 1.2.62 (2018-08-04, 2019-10-07)  BEAM-8632
  com.datastax.cassandra:cassandra-driver-core: 3.6.0 -> 4.0.0 (2018-08-29, 2019-03-18)  BEAM-8674
  com.datastax.cassandra:cassandra-driver-mapping: 3.6.0 -> 3.8.0 (2018-08-29, 2019-10-29)  BEAM-8749
  com.esotericsoftware:kryo: 4.0.2 -> 5.0.0-RC4 (2018-03-20, 2019-04-14)  BEAM-5809
  com.esotericsoftware.kryo:kryo: 2.21 -> 2.24.0 (2013-02-27, 2014-05-04)  BEAM-5574
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin: 0.20.0 -> 0.27.0 (2019-02-11, 2019-10-21)  BEAM-6645
  com.github.spotbugs:spotbugs: 3.1.12 -> 4.0.0-beta4 (2019-03-01, 2019-09-18)  BEAM-7792
  com.github.spotbugs:spotbugs-annotations: 3.1.12 -> 4.0.0-beta4 (2019-03-01, 2019-09-18)  BEAM-6951
  com.google.api:gax-grpc: 1.38.0 -> 1.52.0 (2019-02-05, 2019-12-13)  BEAM-8676
  com.google.api.grpc:grpc-google-cloud-datacatalog-v1beta1: 0.27.0-alpha -> 0.30.0-alpha (2019-10-03, 2019-11-20)  BEAM-8853
  com.google.api.grpc:grpc-google-cloud-pubsub-v1: 1.43.0 -> 1.84.0 (2019-01-23, 2019-12-04)  BEAM-8677
  com.google.api.grpc:grpc-google-common-protos: 1.12.0 -> 1.17.0 (2018-06-29, 2019-10-04)  BEAM-8633
  com.google.api.grpc:proto-google-cloud-bigtable-v2: 0.44.0 -> 1.8.0 (2019-01-23, 2019-12-17)  BEAM-8679
  com.google.api.grpc:proto-google-cloud-datacatalog-v1beta1: 0.27.0-alpha -> 0.30.0-alpha (2019-10-03, 2019-11-20)  BEAM-8854
  com.google.api.grpc:proto-google-cloud-datastore-v1: 0.44.0 -> 0.85.0 (2019-01-23, 2019-12-05)  BEAM-8680
  com.google.api.grpc:proto-google-cloud-pubsub-v1: 1.43.0 -> 1.84.0 (2019-01-23, 2019-12-04)  BEAM-8681
  com.google.api.grpc:proto-google-cloud-spanner-admin-database-v1: 1.6.0 -> 1.47.0 (2019-01-23, 2019-12-05)  BEAM-8682
  com.google.api.grpc:proto-google-common-protos: 1.12.0 -> 1.17.0 (2018-06-29, 2019-10-04)  BEAM-6899
  com.google.apis:google-api-services-bigquery: v2-rev20181221-1.28.0 -> v2-rev20190917-1.30.3 (2019-01-17, 2019-10-09)  BEAM-8684
  com.google.apis:google-api-services-clouddebugger: v2-rev20181114-1.28.0 -> v2-rev20191003-1.30.3 (2019-01-17, 2019-10-19)  BEAM-8750
  com.google.apis:google-api-services-cloudresourcemanager: v1-rev20181015-1.28.0 -> v2-rev20191206-1.30.3 (2019-01-17, 2019-12-17)  BEAM-8751
  com.google.apis:google-api-services-dataflow: v1b3-rev20190927-1.28.0 -> v1beta3-rev12-1.20.0 (2019-10-11, 2015-04-29)  BEAM-8752
  com.google.apis:google-api-services-pubsub: v1-rev20181213-1.28.0 -> v1-rev20191203-1.30.3 (2019-01-18, 2019-12-18)  BEAM-8753
  com.google.apis:google-api-services-storage: v1-rev20181109-1.28.0 -> v1-rev20191011-1.30.3 (2019-01-18, 2019-10-30)  BEAM-8754
  com.google.auth:google-auth-library-credentials: 0.13.0 -> 0.19.0 (2019-01-17, 2019-12-13)  BEAM-6478
  com.google.auth:google-auth-library-oauth2-http: 0.12.0 -> 0.19.0 (2018-11-14, 2019-12-13)  BEAM-8685
  com.google.cloud:google-cloud-bigquery: 1.28.0 ->

Re: Dropping late data in DirectRunner

2020-01-06 Thread Jan Lukavský
> Generally the watermark update can overtake elements, because 
runners  can explicitly ignore late data in the watermark calculation 
(for good reason - those elements are already late, so no need to hold 
up the watermark advancing any more).


This seems not to affect the decision of _not late_ vs. _late_, does it? 
If an element is late and gets ignored in the watermark calculation (whatever 
that includes in this context), then the watermark cannot move past elements 
that were marked as _not late_, and thus nothing can make 
them _late_.


> For GBK on-time data simply means the first pane marked as on time. 
For state+timers I don't think it makes sense for Beam to define on-time 
v.s. late, rather I think the user can come up with their own definition 
depending on their use case. For example, if you are buffering data into 
BagState and setting a timer to process it, it would be logical to say 
that any element that was buffered before the timer expired is on time, 
and any data that showed up after the timer fired is late. This would 
roughly correspond to what GBK does, and the answer would be very 
similar to simply comparing against the watermark (as the timers fire 
when the watermark advances).


Yes, I'd say that stateful DoFns don't have a (well-defined) concept of a 
pane, because that is tied to the concept of a trigger, which is a concept of 
GBK (or windowed operations in general). The only semantic meaning of a window 
in a stateful DoFn is that it "scopes" state.
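
(For illustration only: a minimal, hedged sketch of the BagState-plus-timer 
pattern Reuven describes above. The class name BufferUntilTimerFn and the 
choice to simply drop post-timer elements are assumptions made for this 
example, not anything prescribed by Beam or by this thread.)

import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

class BufferUntilTimerFn extends DoFn<KV<String, String>, String> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag(StringUtf8Coder.of());

  @StateId("timerFired")
  private final StateSpec<ValueState<Boolean>> timerFiredSpec = StateSpecs.value(BooleanCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext c,
      BoundedWindow window,
      @StateId("buffer") BagState<String> buffer,
      @StateId("timerFired") ValueState<Boolean> timerFired,
      @TimerId("flush") Timer flushTimer) {
    if (Boolean.TRUE.equals(timerFired.read())) {
      // The timer has already fired, so by this user-level definition the element
      // is "late". Here it is dropped; it could just as well be diverted to
      // another output.
      return;
    }
    buffer.add(c.element().getValue());
    // Fire when the watermark passes the end of the window (event time).
    flushTimer.set(window.maxTimestamp());
  }

  @OnTimer("flush")
  public void onFlush(
      OnTimerContext c,
      @StateId("buffer") BagState<String> buffer,
      @StateId("timerFired") ValueState<Boolean> timerFired) {
    for (String value : buffer.read()) {
      c.output(value);  // the "on time" output, roughly analogous to GBK's on-time pane
    }
    buffer.clear();
    timerFired.write(true);
  }
}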


This discussion might have drifted a little from the original question, so 
I'll try to rephrase it:


Should a stateful DoFn drop *all* late data, not just data that arrives 
after the window boundary + allowed lateness? Some arguments why I think it 
should:
 * in windowed operations (GBK), it is correct to drop data only on window 
boundaries, because time (as seen by the user) effectively hops only at 
these discrete points
 * in a stateful DoFn, on the other hand, time moves "smoothly" (yes, with 
some granularity - millisecond, nanosecond, whatever - and only with watermark 
updates, but still)
 * this could be viewed as saying that dropping late data immediately as time 
(again, from the user's perspective) moves, rather than at some more or less 
artificial boundary with little semantic meaning, is consistent with both of 
the above properties (a rough sketch contrasting the two policies follows below)


The negative side effect of this would be that more data could be dropped, 
but ... isn't that exactly what allowed lateness defines? I don't want to 
discuss the implications of such a change on user pipelines (or whether we 
can make it at all); I'm just trying to build some theoretical understanding 
of the problem as a whole. The decision whether any change could or should be 
made can come afterwards.
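
For concreteness, a hedged sketch of how the two dropping policies above might 
be expressed. The class and method names are made up for this comparison and do 
not correspond to actual runner code:

import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Duration;
import org.joda.time.Instant;

class LatenessPolicies {
  // GBK-style rule: an element can only be dropped once its window has fully
  // expired, i.e. the input watermark has passed end-of-window + allowed lateness.
  static boolean droppableByWindowExpiry(
      BoundedWindow window, Instant inputWatermark, Duration allowedLateness) {
    return inputWatermark.isAfter(window.maxTimestamp().plus(allowedLateness));
  }

  // Stricter rule discussed above for stateful DoFns: an element is droppable as
  // soon as its own timestamp falls behind the watermark by more than the allowed
  // lateness, independent of any window boundary.
  static boolean droppableByElementLateness(
      Instant elementTimestamp, Instant inputWatermark, Duration allowedLateness) {
    return elementTimestamp.isBefore(inputWatermark.minus(allowedLateness));
  }
}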


Thanks,
 Jan

On 1/4/20 10:35 PM, Reuven Lax wrote:



On Sat, Jan 4, 2020 at 12:13 PM Jan Lukavský <je...@seznam.cz> wrote:


> Yes, but invariants should hold. If I add a ParDo that drops
late elements (or, more commonly, diverts the late elements to a
different PCollection), then the result of that ParDo should
_never_ introduce any more late data. This cannot be guaranteed
simply with watermark checks. The ParDo may decide that the
element was not late, but by the time it outputs the element the
watermark may have advanced, causing the element to actually be late.

This is actually very interesting. The question is - if I decide
about lateness based on the output watermark of a PTransform, is it
still the case that in downstream operator(s) the element could
be changed from "not late" to "late"? Provided the output
watermark is updated synchronously based on input data (which it
should be) and the watermark update cannot "overtake" elements, I
think that the downstream decision should not change, so the
invariant should hold. Or am I missing something?


Generally the watermark update can overtake elements, because runners  
can explicitly ignore late data in the watermark calculation (for good 
reason - those elements are already late, so no need to hold up the 
watermark advancing any more).


For GBK on-time data simply means the first pane marked as on time. 
For state+timers I don't think it makes sense for Beam to define 
on-time v.s. late, rather I think the user can come up with their own 
definition depending on their use case. For example, if you are 
buffering data into BagState and setting a timer to process it, it 
would be logical to say that any element that was buffered before the 
timer expired is on time, and any data that showed up after the timer 
fired is late. This would roughly correspond to what GBK does, and the 
answer would be very similar to simply comparing against the watermark 
(as the timers fire when the watermark advances).


Reuven

On 1/4/20 8:11 PM, Reuven Lax wrote:



On Sat, Jan 4, 2020 at 11:03 AM Jan Lukavský <je...@seznam.cz> wrote:

On 1/4/20 6:14 PM, Reuven Lax wrote:

There is a very good reason