Re: [DISCUSS] Adding GroupByKeyAndSort

2019-04-30 Thread Kenneth Knowles
On Wed, Apr 17, 2019 at 7:48 AM Viliam Durina wrote: > > Combine.perKey ... certainly is standardized / well-defined > > Is there any document where it's defined? > At the user level, here: https://beam.apache.org/documentation/programming-guide/#combine There are a few places that define it.

Re: Scope of windows?

2019-04-30 Thread Kenneth Knowles
-user@ since this is pretty far afield On Tue, Apr 30, 2019 at 4:22 PM Robert Bradshaw wrote: > In the original version of the dataflow model, windowing was not > annotated on each PCollection, rather it was inferred based on tracing > up the graph to the latest WindowInto operation. This

Re: Structured streaming based spark runner.

2019-04-30 Thread Kenneth Knowles
Very cool. Took a look. On Tue, Apr 30, 2019 at 6:23 PM Ankur Goenka wrote: > Exciting! Thanks Etienne for sharing the design and progress. > > On Tue, Apr 30, 2019 at 10:11 AM Etienne Chauchot > wrote: > >> Hi guys, >> As part of the ongoing work on spark runner POC based on structured >>

Re: Structured streaming based spark runner.

2019-04-30 Thread Ankur Goenka
Exciting! Thanks Etienne for sharing the design and progress. On Tue, Apr 30, 2019 at 10:11 AM Etienne Chauchot wrote: > Hi guys, > As part of the ongoing work on spark runner POC based on structured > streaming framework, I sketched up a design doc [1] to share context and > design principles.

Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-30 Thread Kenneth Knowles
Anything where the state evolves serially but arbitrarily - the toy example is the integer counter in my blog post - needs ValueState. You can't do it with AnyCombineFn. And I think LatestCombineFn is dangerous, especially when it comes to CombiningState. ValueState is more explicit, and I still
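
The integer-counter case mentioned above can be sketched in plain Python. This is a toy model under stated assumptions, not the Beam API: `ValueCell` and `assign_indices` are hypothetical names standing in for a per-key value state and a stateful processing step.

```python
class ValueCell:
    """Hypothetical stand-in for a per-key value state: one readable, writable slot."""

    def __init__(self, default=0):
        self._value = default

    def read(self):
        return self._value

    def write(self, value):
        self._value = value


def assign_indices(elements):
    """Give each element of a single key a serial index via read-modify-write."""
    cell = ValueCell(default=0)
    out = []
    for element in elements:
        index = cell.read()      # the current value is needed to produce output
        out.append((index, element))
        cell.write(index + 1)    # the next state depends on the previous state
    return out
```

Each step reads the current value before writing the next one, which is the serial evolution the message describes; the sketch makes no claim about how any SDK implements it.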

Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-30 Thread Robert Bradshaw
On Wed, May 1, 2019 at 1:55 AM Brian Hulette wrote: > > Reza - you're definitely not derailing, that's exactly what I was looking for! > > I've actually recently encountered an additional use case where I'd like to > use ValueState in the Python SDK. I'm experimenting with an ArrowBatchingDoFn

Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-04-30 Thread Brian Hulette
Reza - you're definitely not derailing, that's exactly what I was looking for! I've actually recently encountered an additional use case where I'd like to use ValueState in the Python SDK. I'm experimenting with an ArrowBatchingDoFn that uses state and timers to batch up python dictionaries into

Re: Removing Java Reference Runner code

2019-04-30 Thread Daniel Oliveira
It sounds like no one has any objections specifically to removing this code. I'll get someone to review the PR and I'll start a vote to merge it as soon as it's approved. On Mon, Apr 29, 2019 at 3:39 AM Robert Bradshaw wrote: > I'd imagine that most users will continue to debug their pipelines

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-30 Thread Ahmet Altay
Michael, Max, and other folks who are concerned about compatibility with the Apache release policy: does the information in this thread sufficiently address your concerns? Especially the part where the rc artifacts will be protected by a flag (i.e. --pre) from general consumption. On Tue, Apr

Re: Scope of windows?

2019-04-30 Thread Robert Bradshaw
In the original version of the dataflow model, windowing was not annotated on each PCollection, rather it was inferred based on tracing up the graph to the latest WindowInto operation. This tracing logic was put in the SDK for simplicity. I agree that there is room for a variety of SDK/DSL
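
As a rough illustration of "tracing up the graph to the latest WindowInto operation", here is a minimal sketch. The `Node` class, its fields, and the `GlobalWindows` default string are assumptions made for the example, not the actual SDK data structures; multi-input nodes are simplified by following only the first input.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical pipeline-graph node; `windowing` is set only on WindowInto steps."""
    name: str
    inputs: List["Node"] = field(default_factory=list)
    windowing: Optional[str] = None


def infer_windowing(node: Node, default: str = "GlobalWindows") -> str:
    """Walk upstream along the first input until a WindowInto-like node is found."""
    current: Optional[Node] = node
    while current is not None:
        if current.windowing is not None:
            return current.windowing
        current = current.inputs[0] if current.inputs else None
    return default


# Usage: Read -> WindowInto -> ParDo
read = Node("Read")
win = Node("WindowInto", [read], windowing="FixedWindows(60)")
pardo = Node("ParDo", [win])
```

In this model `infer_windowing(pardo)` finds the windowing set on `win`, while `infer_windowing(read)` falls back to the default because nothing upstream sets one.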

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-30 Thread Robert Bradshaw
On Tue, Apr 30, 2019 at 6:11 PM Ahmet Altay wrote: > > This conversation got quite Python centric. Is there a similar need for Java? I think Java is already covered. Go is a different story (but there even versioning and releasing is being worked out). > On Tue, Apr 30, 2019 at 4:54 AM Robert

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-30 Thread Lukasz Cwik
The Java SDK uses the ASF managed Nexus repository. There is a snapshot one (where we publish nightly builds) and also a release one (where we place our release candidates). Once the release candidate is approved the Nexus repository has a way to publish it making it an official release. More

Re: Artifact staging in cross-language pipelines

2019-04-30 Thread Lukasz Cwik
Agree on adding the 5.5 and the resolution of conflicts/duplicates could be done by either the runner or the artifact staging service. On Tue, Apr 30, 2019 at 10:03 AM Chamikara Jayalath wrote: > > On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik wrote: > >> We should stick with URN + payload +

Re: Pipeline options validation

2019-04-30 Thread Lukasz Cwik
Being not serializable is not an issue since we use a JSON representation anyways for all PipelineOptions. Would just register a JSON mapper for Optional if one doesn't exist already. On Tue, Apr 30, 2019 at 12:52 PM Anton Kedin wrote: > Java8 Optional is not serializable. I think this may be a

Re: Pipeline options validation

2019-04-30 Thread Anton Kedin
Java8 Optional is not serializable. I think this may be a blocker. Or not? Regards, Anton On Tue, Apr 30, 2019 at 12:18 PM Lukasz Cwik wrote: > The migration to requiring @Nullable on methods that could take/return > null didn't update PipelineOptions contract and its validation to respect >

Re: Pipeline options validation

2019-04-30 Thread Lukasz Cwik
The migration to requiring @Nullable on methods that could take/return null didn't update PipelineOptions contract and its validation to respect it. We could start using Optional but can't enforce requiring @Nullable since it is likely backwards incompatible and would break people's current usage

Re: :beam-sdks-java-io-hadoop-input-format:test is extremely flaky

2019-04-30 Thread Valentyn Tymofieiev
BeamFnControlServiceTest is being worked on in https://issues.apache.org/jira/browse/BEAM-5709. On Mon, Apr 29, 2019 at 2:01 PM Reuven Lax wrote: > yeah, that testClientConnecting test is also extremely flaky. > > On Mon, Apr 29, 2019 at 6:50 AM Jean-Baptiste Onofré > wrote: > >> Agree, +1 >>

Re: [BEAM-7164] Python precommit failing on Java PRs. dataflow:setupVirtualenv

2019-04-30 Thread Alex Amato
Thanks, updated the JIRA with a link to this thread and a note of what could be done. On Mon, Apr 29, 2019 at 10:29 AM Udi Meiri wrote: > Pip has a --cache-dir which should be safe with concurrent writes. > > On Fri, Apr 26, 2019 at 3:59 PM Ahmet Altay wrote: > >> It is possible to download

Structured streaming based spark runner.

2019-04-30 Thread Etienne Chauchot
Hi guys, As part of the ongoing work on spark runner POC based on structured streaming framework, I sketched up a design doc [1] to share context and design principles. Feel free to comment. [1] https://s.apache.org/spark-structured-streaming-runner Etienne

Re: Artifact staging in cross-language pipelines

2019-04-30 Thread Chamikara Jayalath
On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik wrote: > We should stick with URN + payload + artifact metadata[1] where the only > mandatory one that all SDKs and expansion services understand is the > "bytes" artifact type. This allows us to add optional URNs for file://, > http://, Maven, PyPI,

Re: Pipeline options validation

2019-04-30 Thread Ning Wang
Interesting to know it needs to be an object. Thanks. I will try it. Agree with Kenneth though that Option might be more expected as a user. On Mon, Apr 29, 2019 at 7:16 PM Kenneth Knowles wrote: > Does it make use of the @Nullable annotation or just assume any object > reference could be

Re: Enable security for data channels in portability

2019-04-30 Thread Hai Lu
One thing to clarify is that we do not use docker. I don't have too much experience with docker; I assume docker itself already has network isolation, and that's why it was never necessary to enable security in portable runner before? For us because we simply use processes, we need this extra

Re: Scope of windows?

2019-04-30 Thread Kenneth Knowles
+dev@ since this has taken a turn in that direction SDK/DSL consistency is nice. But each SDK/DSL being the best thing it can be is more important IMO. I'm including DSLs to be clear that this is a construction issue having little/nothing to do with SDK in the sense of the per-run-time

Re: Custom shardingFn for FileIO

2019-04-30 Thread Jozef Vilcek
Hm, what would be the scenario? Have version A running with original random sharding and then start version B where I change sharding to some custom function? So I have to enable the pipeline to digest old keys from GBK restored state and also work with new keys produced to GBK going forward? On

Re: Custom shardingFn for FileIO

2019-04-30 Thread Reuven Lax
Initial thought on PR: we usually try to limit changing coders in these types of transforms to better support runners that allow in-place updates of pipelines. Can this be done without changing the coder? On Tue, Apr 30, 2019 at 8:21 AM Jozef Vilcek wrote: > I have created a PR for enhancing

Re: Custom shardingFn for FileIO

2019-04-30 Thread Jozef Vilcek
I have created a PR for enhancing WriteFiles for custom sharding function. https://github.com/apache/beam/pull/8438 If this sort of change looks good, then the next step would be to use it in the Flink runner transform override. Let me know what you think. On Fri, Apr 26, 2019 at 9:24 AM Jozef Vilcek

Re: Beam Summit at ApacheCon

2019-04-30 Thread Austin Bennett
Hi Users and Devs, The CfP deadline approaches. Do submit your technical and/or use case talks, etc etc. Feel free to reach out if you have any questions. Cheers, Austin On Tue, Apr 23, 2019 at 2:49 AM Maximilian Michels wrote: > Hi Austin, > > Thanks for the heads-up! I just want to

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-04-30 Thread Jozef Vilcek
All right, I can test it out if I can. How do I deploy a pipeline on the portable Flink runner? Should I follow this to be able to do it? https://beam.apache.org/documentation/runners/flink/ On Tue, Apr 30, 2019 at 4:05 PM Reuven Lax wrote: > In that case, Robert's point is quite valid. The old Flink

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-04-30 Thread Reuven Lax
In that case, Robert's point is quite valid. The old Flink runner I believe had no knowledge of fusion, which was known to make it extremely slow. A lot of work went into making the portable runner fusion aware, so we don't need to round trip through coders on every ParDo. Reuven On Tue, Apr 30,
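
To make the cost of round-tripping through coders on every ParDo concrete, here is a toy sketch. It is not runner code: `pickle` stands in for a coder, and `run_unfused` and `run_fused` are hypothetical names for executing a chain of steps without and with fusion.

```python
import pickle


def run_unfused(steps, element):
    """Encode and decode the element before every step (one round trip per step)."""
    round_trips = 0
    for step in steps:
        element = pickle.loads(pickle.dumps(element))  # stand-in for a coder round trip
        round_trips += 1
        element = step(element)
    return element, round_trips


def run_fused(steps, element):
    """Decode once at the stage boundary, then call the steps directly in sequence."""
    element = pickle.loads(pickle.dumps(element))
    for step in steps:
        element = step(element)
    return element, 1


# Usage: three chained steps applied to one element.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
```

Both functions compute the same result for the same input; the fused version simply pays for serialization once per stage instead of once per step.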

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-04-30 Thread Jozef Vilcek
It was not a portable Flink runner. Thanks all for the thoughts, I will create JIRAs, as suggested, with my findings and send them out On Tue, Apr 30, 2019 at 11:34 AM Reuven Lax wrote: > Jozef did you use the portable Flink runner or the old one? > > Reuven > > On Tue, Apr 30, 2019 at 1:03 AM

Re: [Discuss] Publishing pre-release artifacts to repositories

2019-04-30 Thread Robert Bradshaw
If we can, by the Apache guidelines, post RCs to pypi, that is definitely the way to go. (Note that test.pypi is for developing against the pypi interface, not for pushing anything real.) The caveat about naming these with rcN in the version number still applies (that's how pypi guards them against
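
The role of the rcN suffix can be sketched as follows. This is a simplified approximation, not pip's resolver or a full PEP 440 parser: `is_pre_release` and `pick_latest` are hypothetical helpers, the version strings in the usage lines are made up for illustration, and the candidate list is assumed to be sorted oldest to newest.

```python
import re

# Simplified pattern: release digits followed by an a/b/rc pre-release segment.
# This is a rough approximation of PEP 440, not a full parser.
_PRE_RELEASE = re.compile(r"^\d+(\.\d+)*(a|b|rc)\d+$")


def is_pre_release(version: str) -> bool:
    return _PRE_RELEASE.match(version) is not None


def pick_latest(versions, allow_pre=False):
    """Return the last candidate in the list, skipping pre-releases unless allowed."""
    candidates = [v for v in versions if allow_pre or not is_pre_release(v)]
    return candidates[-1] if candidates else None


# Usage with made-up version numbers, sorted oldest to newest.
versions = ["2.12.0", "2.13.0rc1"]
```

With `allow_pre=False` the rc entry is skipped, which mirrors the idea that an installer only considers release candidates when explicitly asked (for pip, via `--pre`).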

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-04-30 Thread Reuven Lax
Jozef did you use the portable Flink runner or the old one? Reuven On Tue, Apr 30, 2019 at 1:03 AM Robert Bradshaw wrote: > Thanks for starting this investigation. As mentioned, most of the work > to date has been on feature parity, not performance parity, but we're > at the point that the

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-04-30 Thread Robert Bradshaw
Thanks for starting this investigation. As mentioned, most of the work to date has been on feature parity, not performance parity, but we're at the point that the latter should be tackled as well. Even if there is a slight overhead (and there's talk about integrating more deeply with the Flume DAG