Build failed in Jenkins: beam_Release_NightlySnapshot #670

2018-01-30 Thread Apache Jenkins Server
See Changes: [kedin] [SQL] Refactor Variance [kedin] [Nexmark][SQL] Implement sql query 3 [chamikara] Updates PTransform overriding to create a new AppliedPTransform object

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Hi, yeah, it sounds good to me. I will create the Jira to track this and start a PoC on the Composite. Thanks ! Regards JB On 01/30/2018 10:40 PM, Reuven Lax wrote: > Did we actually reach consensus here? :) > > On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau

Jenkins build is still unstable: beam_Release_NightlySnapshot #669

2018-01-30 Thread Apache Jenkins Server
See

How to get split location from HadoopInputFormatBoundedSource

2018-01-30 Thread JangHo Seo
Hello Beam dev, I'm working on a distributed data processing engine that supports Beam dataflow programs, and I'm investigating how to take split locations into consideration when scheduling 'read' tasks for an HDFS source. Is there any way to get split location information from

Re: Filesystems.copy and .rename behavior

2018-01-30 Thread Reuven Lax
I think the idea was to ignore "already exists" errors. The reason being that any step in Beam can be executed multiple times, including the rename step. If the rename step gets run twice, the second run should succeed vacuously. On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri
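
A minimal sketch of the retry-safe rename described above, assuming a local filesystem rather than HDFS. `idempotent_rename` is a hypothetical helper, not a Beam API; it treats "source already gone, destination present" as a completed earlier attempt, which is one possible reading of "succeed vacuously".

```python
import os
import tempfile


def idempotent_rename(src: str, dst: str) -> None:
    """Rename src to dst, succeeding vacuously if a prior attempt already did."""
    try:
        os.rename(src, dst)
    except FileNotFoundError:
        # src is gone. If dst exists, assume an earlier run of this step
        # already completed the rename; otherwise the failure is genuine.
        if not os.path.exists(dst):
            raise


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "part-0.tmp")
    dst = os.path.join(tmp, "part-0")
    with open(src, "w") as f:
        f.write("data")
    idempotent_rename(src, dst)  # first attempt performs the rename
    idempotent_rename(src, dst)  # retried attempt is a no-op, not an error
```

Note that this only swallows the one case that a retry can explain; a rename whose source and destination are both missing still raises.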

Filesystems.copy and .rename behavior

2018-01-30 Thread Udi Meiri
Hi, I've been working on HDFS code for the Python SDK and I've noticed some behaviors which are surprising. I wanted to know if these behaviors are known and intended. 1. When renaming files during finalize_write, rename errors are ignored

Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Well, I guess it was a wording issue more than anything else. That said, it is not true for all runners, so it may still need some more love later, but I don't have a solution for it yet. I just wondered if a better way to solve it already existed. On Jan 30, 2018 at 22:36, "Reuven Lax" wrote

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Reuven Lax
Did we actually reach consensus here? :) On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau wrote: > Not sure how it fits in terms of API yet but +1 for the high level view. > Makes perfect sense. > > On Jan 30, 2018 at 21:41, "Jean-Baptiste Onofré" wrote

Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Hmm, this starts to smell like the old question "how to enforce runner constraints without enforcing too much" :(. Anyway, that is enough for me on this topic. Thanks for the clarification and reminders, guys. On Jan 30, 2018 at 22:29, "Reuven Lax" wrote: > Where the split points

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
Not sure how it fits in terms of API yet but +1 for the high level view. Makes perfect sense. On Jan 30, 2018 at 21:41, "Jean-Baptiste Onofré" wrote: > Hi Robert, > > Good point and idea for the Composite transform. It would apply nicely on > all transforms based on

Re: untyped pipeline API?

2018-01-30 Thread Reuven Lax
Where the split points are depends on the runner. Runners are free to split at any point (and often do to prevent cycles from appearing in the graph). On Tue, Jan 30, 2018 at 1:27 PM, Romain Manni-Bucau wrote: > I kind of agree on all of that and brings me to the

Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
I kind of agree with all of that, and it brings me to the interesting point of this topic: why are coders enforced so strictly if they are not used most of the time (a flat processor chain, to caricature it)? Shouldn't this be relaxed a bit and enforced only at split or shuffle points? On Jan 30, 2018 at 22:09, "Ben

Re: untyped pipeline API?

2018-01-30 Thread Ben Chambers
It sounds like in your specific case you're saying that the same encoding can be viewed by the Java type system two different ways. For instance, if you have an object Person that is convertible to JSON using Jackson, then that JSON encoding can be viewed as either a Person or a Map
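
The "one encoding, two views" situation can be sketched as follows, using Python's standard `json` module in place of Jackson. `Person` and the three helper functions are hypothetical names for illustration; none of this is a Beam coder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Person:
    name: str
    id: int


def encode_person(person: Person) -> bytes:
    """Serialize a Person to UTF-8 JSON bytes."""
    return json.dumps(asdict(person)).encode("utf-8")


def decode_as_person(data: bytes) -> Person:
    """View the JSON bytes as a typed Person."""
    return Person(**json.loads(data))


def decode_as_map(data: bytes) -> dict:
    """View the very same bytes as a plain mapping."""
    return json.loads(data)


payload = encode_person(Person(name="Ada", id=1))
person_view = decode_as_person(payload)
map_view = decode_as_map(payload)
```

Both decoders accept the same bytes, which is the compatibility between coders that the thread is about; the type system only sees two unrelated element types.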

Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-30 Thread Ben Chambers
I think I agree with this, but wanted to point out a few things: 1. High-level DSLs may target the IL directly, rather than going through the high-level PL libraries. This would allow them to make more direct use of the capabilities of the IL. 2. I agree that the portability work is basically

Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-30 Thread Ben Chambers
On Tue, Jan 30, 2018 at 11:25 AM Kenneth Knowles wrote: > I've got some thoughts :-) > > Here is how I see the direction(s): > > - Requirements to be relevant: known scale, SQL, retractions (required > for correct answers) > - Core value-add: portability! I don't know that

Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-30 Thread Robert Bradshaw
On Tue, Jan 30, 2018 at 11:44 AM, Kenneth Knowles wrote: > (just dev@) > > *Low-level IL* > I wanted to comment more on the common intermediate layer idea of Ben's. > This is an awesome idea but I'm not sure it is Beam so much as Tez or Onyx. > I imagine most runners have some

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Hi Robert, Good point and idea for the Composite transform. It would apply nicely on all transforms based on composite. I also agree that the hint is more on the transform than the PCollection itself. Thanks ! Regards JB On 30/01/2018 21:26, Robert Bradshaw wrote: Many hints make more

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Reuven Lax
My point was that hints are added on transforms by default, so you can simply add them to originalTransform. The AddHint transform was for the case where you want the hint on the PCollection itself; it provides a way to do so, while keeping the PCollection immutable. On Tue, Jan 30, 2018 at 11:59

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Robert Bradshaw
Many hints make more sense for PTransforms (the computation itself) than for PCollections. In addition, when we want properties attached to PCollections of themselves, it often makes sense to let these be provided by the producing PTransform (e.g. coders and schemas are often functions of the

Re: untyped pipeline API?

2018-01-30 Thread Kenneth Knowles
Ah, this is a point that Robert brings up quite often: one reason we put coders on PCollections instead of doing that work in PTransforms is that the runner (plus SDK harness) can automatically only serialize when necessary. So the default in Beam is that the thing you want to happen is already
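
A toy illustration of "only serialize when necessary", under the assumption that fused stages hand objects to each other in memory and that a coder is exercised only when elements cross a boundary such as a shuffle. `CountingCoder`, `run_fused`, and `shuffle_boundary` are hypothetical names, and `pickle` merely stands in for a real coder.

```python
import pickle


class CountingCoder:
    """Stand-in coder that counts how often it is asked to encode."""

    def __init__(self):
        self.encodes = 0

    def encode(self, value):
        self.encodes += 1
        return pickle.dumps(value)

    def decode(self, data):
        return pickle.loads(data)


def run_fused(stages, elements):
    """Apply stages back to back in memory; no coder is involved."""
    for stage in stages:
        elements = [stage(x) for x in elements]
    return elements


def shuffle_boundary(elements, coder):
    """Simulate a materialization point: every element is encoded once."""
    return [coder.decode(coder.encode(x)) for x in elements]


coder = CountingCoder()
data = [1, 2, 3]
before = run_fused([lambda x: x + 1, lambda x: x * 2], data)  # fused, no encoding
shuffled = shuffle_boundary(before, coder)                    # one encode per element
after = run_fused([lambda x: x - 1], shuffled)                # fused again
```

In this sketch the coder runs once per element at the boundary, regardless of how many stages sit on either side of it.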

Re: untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Indeed, I'll take a stupid example to make it shorter. I have a source emitting Person objects ({name:...,id:...}) serialized with Jackson as JSON. Then my pipeline processes them with a DoFn taking a Map. Here I set the coder to read JSON as a map. However a Map

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Kenneth Knowles
It seems like most of these use cases are hints on a PTransform and not a PCollection, no? CPU, memory, expected parallelism, etc are. Then you could just have: pc.apply(WithHints(myTransform, )) For a PCollection hints that might make sense are bits like total size, element size, and

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Yes, agreed, it sounds like Create.of() but it is actually adding hints to the collection. So maybe AddHints.on(collection, hint1, ...) is clearer. Regards JB On 30/01/2018 21:08, Romain Manni-Bucau wrote: I think so too but `pc.apply(AddHints.of(hint1, hint2, hint3))` is a bit ambiguous for me

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
I think so too but `pc.apply(AddHints.of(hint1, hint2, hint3))` is a bit ambiguous for me (is it affecting the previous collection?) Maybe AddHints.on(collection, hint1, hint2, ...) is an acceptable compromise? Less fluent but not ambiguous (based on the same pattern as views). Romain

Re: untyped pipeline API?

2018-01-30 Thread Kenneth Knowles
I'm not sure I understand your question. Can you explain more? On Tue, Jan 30, 2018 at 11:50 AM, Romain Manni-Bucau wrote: > Hi guys, > > just encountered an issue with the pipeline API and wondered if you > thought about it. > > It can happen the Coders are compatible

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
Hmm, can work for pipeline hints but for transform hints we would need: p.apply(AddHint.of(.).wrap(originalTransform)) Would work for me too.

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Great idea for AddHints.of() ! What would be the resulting PCollection ? Just a PCollection of hints or the pc elements + hints ? Regards JB On 30/01/2018 20:52, Reuven Lax wrote: I think adding hints for runners is reasonable, though hints should always be assumed to be optional - they

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Reuven Lax
I think adding hints for runners is reasonable, though hints should always be assumed to be optional - they shouldn't change semantics of the program (otherwise you destroy the portability promise of Beam). However there are many types of hints that some runners might find useful (e.g. this step
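
The rule that hints must stay optional can be sketched with a toy model. `HintedTransform` and `ToyRunner` are hypothetical and unrelated to any actual Beam class; the `batch_size` and `spark.persist` keys are used only as example hint names. The point is that a runner may use a hint it recognizes to change how work is executed, but the output is the same with or without it.

```python
class HintedTransform:
    """Hypothetical wrapper pairing a per-element function with hints."""

    def __init__(self, fn, hints=None):
        self.fn = fn
        self.hints = dict(hints or {})


class ToyRunner:
    """Toy runner: honors hints it recognizes and silently ignores the rest."""

    def __init__(self, known_hints=()):
        self.known_hints = set(known_hints)

    def run(self, transform, elements):
        used = {k: v for k, v in transform.hints.items() if k in self.known_hints}
        # The hint changes only how elements are batched, never the result.
        batch = int(used.get("batch_size", 1))
        out = []
        for start in range(0, len(elements), batch):
            out.extend(transform.fn(x) for x in elements[start:start + batch])
        return out


double = HintedTransform(lambda x: x * 2, {"batch_size": 2, "spark.persist": True})
with_hint = ToyRunner(known_hints={"batch_size"}).run(double, [1, 2, 3])
without_hint = ToyRunner().run(double, [1, 2, 3])
```

Both runners produce the same elements, so moving the pipeline between them does not change its semantics.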

untyped pipeline API?

2018-01-30 Thread Romain Manni-Bucau
Hi guys, I just encountered an issue with the pipeline API and wondered if you had thought about it. It can happen that Coders are compatible with each other. A simple example: a text coder like JSON or XML will be able to read text. However with the pipeline API you can't support this directly and

Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-30 Thread Kenneth Knowles
(just dev@) *Low-level IL* I wanted to comment more on the common intermediate layer idea of Ben's. This is an awesome idea but I'm not sure it is Beam so much as Tez or Onyx. I imagine most runners have some such representation internally. Our layers in the stack with a low-level IL: 6.

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Maybe I should have started the discussion on the user mailing list: it would be great to have user feedback on this, even if I got your points. Sometimes, I have the feeling that whatever we are proposing and discussing, it doesn't go anywhere. At some point, to attract more people, we have

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-30 Thread Romain Manni-Bucau
I get the issue but I don't get the last part. Concretely we can support any lib by just removing the exception in the close, no? What would be the issue? No additional wrapper, no lib integration issue.

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
2018-01-30 19:52 GMT+01:00 Kenneth Knowles : > I generally like having certain "escape hatches" that are well designed > and limited in scope, and anything that turns out to be important becomes > first-class. But this one I don't really like because the use cases belong >

Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-30 Thread Kenneth Knowles
I've got some thoughts :-) Here is how I see the direction(s): - Requirements to be relevant: known scale, SQL, retractions (required for correct answers) - Core value-add: portability! I don't know that there is any other project ambitiously trying to run Python and Go on "every" data

Re: [DISCUSS] State of the project: Culture and governance

2018-01-30 Thread Lukasz Cwik
I have to -1 reductions in the code review quality bar as this leads to test problems, which leads to CI issues, which leads to gaps in coverage and then to delayed, bad and broken releases. +1 on converting Google docs to either markdown or including them on the website since it is a valuable

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Kenneth Knowles
I generally like having certain "escape hatches" that are well designed and limited in scope, and anything that turns out to be important becomes first-class. But this one I don't really like because the use cases belong elsewhere. Of course, they creep so you should assume they will be unbounded

Re: Should we have a predictable test run order?

2018-01-30 Thread Reuven Lax
To expand on what Robert says, many other things in our test framework are randomized. e.g. PCollection elements are shuffled randomly, bundle sizes are determined randomly, etc. All of this should be repeatable if there's a failure. The test should print the seed used to generate the random

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-30 Thread Lukasz Cwik
It's common in the code base that input and output streams are passed around and the caller is responsible for closing them, not the callee. The UnownedInputStream is to guard against libraries that are poorly behaved and assume they get ownership of the stream when it is given to them. In the code:
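
The trade-off in this thread (fail on close versus swallow it) can be sketched with a small wrapper. `UnownedStream` and `greedy_library_read` are hypothetical names for illustration and do not reproduce the Java `UnownedInputStream`; the `strict` flag is an assumption added here to show both behaviors side by side.

```python
import io


class UnownedStream:
    """Hypothetical guard around a stream that the callee does not own."""

    def __init__(self, wrapped, strict=True):
        self._wrapped = wrapped
        self._strict = strict

    def read(self, size=-1):
        return self._wrapped.read(size)

    def close(self):
        if self._strict:
            # Strict mode surfaces the ownership bug in the callee.
            raise io.UnsupportedOperation("callee must not close an unowned stream")
        # Lenient mode swallows the call; the real owner closes the stream later.


def greedy_library_read(stream):
    """Simulates a library that wrongly assumes it owns the stream."""
    data = stream.read()
    stream.close()
    return data


raw = io.BytesIO(b"payload")
lenient_result = greedy_library_read(UnownedStream(raw, strict=False))
still_open = not raw.closed  # the caller keeps control of the lifetime
```

In strict mode the same call raises, which flags the badly behaved library; in lenient mode the library works unmodified and the caller still closes `raw` itself.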

Re: Should we have a predictable test run order?

2018-01-30 Thread Robert Bradshaw
Agreed, any leakage of state between tests is a bug, and giving things a deterministic order just hides these bugs. I'd be in favor of enforcing random ordering (with a published seed for reproducibility of course). On Tue, Jan 30, 2018 at 9:21 AM, Lukasz Cwik wrote: > The order

why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-30 Thread Romain Manni-Bucau
Hi guys, All is in the subject ;) The rationale is to support any I/O library and not fail when the close is encapsulated. Any blocker to swallowing this close call?

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Romain Manni-Bucau
Lukasz, the point is that you have the choice to either bring all specificities into the main API, which makes most of the API unusable or unimplemented, or the opposite: not support anything. Introducing hints will allow some runners to get some features eagerly - or just some very specific

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Lukasz Cwik
There have been suggestions in the past for Dataflow 1.x to extend PipelineOptions to be usable per PTransform. So when you apply a PTransform you can also provide a set of options that apply to it and all subtransforms contained within. This is the closest suggestion to what you're describing that

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Good point Luke: in that case, the hint will be ignored by the runner if the hint is not meant for it. The hint can be generic (not specific to a runner). It could be interesting for the schema support or IOs, not specific to a runner. What do you mean by gathering PTransforms/PCollections and

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Lukasz Cwik
If the hint is required to run the person's pipeline well, how do you expect that the person will be able to migrate their pipeline to another runner? A lot of hints like "spark.persist" are really the user trying to tell us something about the PCollection, like it is very small. I would prefer if

Re: Should we have a predictable test run order?

2018-01-30 Thread Lukasz Cwik
The order should be random to ferret out issues but the test order seed should be printed and configurable so it allows replaying a test run because you can specify the order in which it should execute. I don't like having a strict order since it hides poorly written tests and people have a
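
A minimal sketch of a random but replayable test order, assuming tests are identified by name. `ordered_tests` is a hypothetical helper, not a surefire or Gradle feature: it picks a seed when none is given, reports it, and reproduces the same order when the seed is passed back in.

```python
import random


def ordered_tests(tests, seed=None):
    """Return (seed, order): a shuffled test order that the seed can replay."""
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    order = sorted(tests)               # normalize so the seed alone decides
    random.Random(seed).shuffle(order)  # private generator, global state untouched
    return seed, order


names = ["CoderTest", "SqlVarianceTest", "WindowTest", "FileSystemsTest"]
seed, order = ordered_tests(names)
print(f"test order seed: {seed}")            # publish the seed with the run
replay_seed, replay = ordered_tests(names, seed)  # same seed, same order
```

Sorting before shuffling means the replay does not depend on the order in which the test names were discovered.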

Re: Should we have a predictable test run order?

2018-01-30 Thread Kenneth Knowles
What was the problem in this case? On Tue, Jan 30, 2018 at 9:12 AM, Romain Manni-Bucau wrote: > What I was used to do is to capture the output when I identified some of > these cases. Once it is reproduced I grep the "Running" lines from > surefire. This gives me a

Re: Should we have a predictable test run order?

2018-01-30 Thread Romain Manni-Bucau
What I used to do is capture the output when I identified some of these cases. Once it is reproduced I grep the "Running" lines from surefire. This gives me a reproducible order. Then with a kind of dichotomy you can find the "previous" test making your test fail and you can configure
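
The dichotomy described above can be sketched as a bisection over the tests that ran before the failing one. `find_polluter` is a hypothetical helper; it assumes that the list is non-empty, that exactly one earlier test causes the failure, and that `fails_after(subset)` runs the given subset followed by the failing test and reports whether it still fails.

```python
def find_polluter(preceding, fails_after):
    """Bisect `preceding` to find the single test that breaks the victim."""
    candidates = list(preceding)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fails_after(half):
            candidates = half                    # culprit is in the first half
        else:
            candidates = candidates[len(half):]  # culprit is in the remainder
    return candidates[0]


# Simulated run: the victim fails whenever "LeakyTest" ran before it.
ran_before = ["ATest", "BTest", "LeakyTest", "CTest", "DTest"]
culprit = find_polluter(ran_before, lambda subset: "LeakyTest" in subset)
```

Each step halves the candidate list, so the number of reruns grows with the logarithm of the number of preceding tests rather than linearly.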

Re: Should we have a predictable test run order?

2018-01-30 Thread Jean-Baptiste Onofré
Hi Dan, good catch !!! I think it makes sense to have a predictable order. It's worth doing, because even if we don't detect some test failures, I think it would mostly be the test itself (like missing resource cleanup or so), not that the test is covering itself. +1 on the proposal.

Re: Should we have a predictable test run order?

2018-01-30 Thread Daniel Kulp
The biggest problem with random is that if a test fails due to an interaction, you have no way to reproduce it. You could re-run with random 10 times and it might not fail again. Thus, what good did it do to even flag the failure? At least with alphabetical and reverse alphabetical, if a

Re: Should we have a predictable test run order?

2018-01-30 Thread Kenneth Knowles
I agree with Romain & your last couple sentences. I would prefer to expose faulty tests. Have you filed a JIRA with your diagnosis about the interaction? Standard caveat is that we are very close to being able to switch basic dev to gradle, with just some complex IT configs and release still using

Re: Reminder: merge commit messages should *not* reference ephemeral branches

2018-01-30 Thread Jean-Baptiste Onofré
Thanks Kenn for the reminder ! Very useful Regards JB On 30/01/2018 17:33, Kenneth Knowles wrote: Hi all, In the history I see merge commits like: "Merge pull request #12345 from some_user/random-branch-name" This is GitHub's default, but it is quite silly. The branch name is not

Re: Should we have a predictable test run order?

2018-01-30 Thread Romain Manni-Bucau
Hi Daniel, As a quick fix it sounds good, but doesn't it hide a leak or issue (in test setup or in main code)? Long story short: using a random order can help find bugs faster instead of hiding them and discovering them randomly when adding a new test. That said, good point to have it configurable

Should we have a predictable test run order?

2018-01-30 Thread Daniel Kulp
I spent a couple hours this morning trying to figure out why two of the SQL tests are failing on my machine, but not for Jenkins or for JB. Not knowing anything about the SQL stuff, it was very hard to debug and it wouldn’t fail within Eclipse or even if I ran that individual test from the

Reminder: merge commit messages should *not* reference ephemeral branches

2018-01-30 Thread Kenneth Knowles
Hi all, In the history I see merge commits like: "Merge pull request #12345 from some_user/random-branch-name" This is GitHub's default, but it is quite silly. The branch name is not informative and also not important. It is also subject to deletion any time so you can't follow it to learn

Build failed in Jenkins: beam_PostRelease_NightlySnapshot #11

2018-01-30 Thread Apache Jenkins Server
See Changes: [joey.baruch] Add javadoc to ConsoleIO [Pablo] Tracking of time spent reading side inputs, and bytes read in Dataflow. [Pablo] Fix lint issues [Pablo] Fixing counter names [Pablo]

[DISCUSSION] Add hint/option on PCollection

2018-01-30 Thread Jean-Baptiste Onofré
Hi, As part of the discussion about schema, Romain mentioned hints. I think it's worth having an explanation about that, especially since it could be wider than schema. Today, to give information to the runner, we use PipelineOptions. The runner can use these options, and apply for all inner

Re: Schema-Aware PCollections revisited

2018-01-30 Thread Jean-Baptiste Onofré
Hi, I think we should avoid mixing two things in the discussion (and so the document): 1. The element of the collection and the schema itself are two different things. By essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the

Jenkins build became unstable: beam_Release_NightlySnapshot #668

2018-01-30 Thread Apache Jenkins Server
See