Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Yep, but it adds one more step to get things working in Beam; don't forget you have already gone through about 10 steps by the time you end up on coders. My view was that skipping the first close was a win-win for Beam and users, but technically you are right, users can always do it themselves. On Feb 1, 2018 07:16, "Lukasz Cwik"

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
Hi all, just a quick reminder about the vote process: 1. Any vote can be changed during the vote period. A -1 vote has to be argued (especially if there's no change to make in the project codebase). 2. For the convenience of the release manager, please state whether your vote is binding or non-binding

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Lukasz Cwik
I'm not sure what you mean by "it closes the door", since as the caller of the library you can create a wrapper filter input stream that ignores close calls, effectively overriding what happens in the UnownedInputStream. On Wed, Jan 31, 2018 at 10:08 PM, Romain Manni-Bucau
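The wrapper described in this message can be sketched as follows. This is only an illustration: the class name `CloseIgnoringInputStream` is hypothetical and is not part of Beam; it relies solely on the JDK's `java.io.FilterInputStream`.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper (not a Beam class): forwards reads, ignores close(). */
final class CloseIgnoringInputStream extends FilterInputStream {
  CloseIgnoringInputStream(InputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() throws IOException {
    // Intentionally a no-op: the delegate is owned by someone else,
    // so a library that closes this wrapper never reaches the delegate.
  }
}
```

A caller would pass `new CloseIgnoringInputStream(in)` to the library that assumes ownership, and keep using `in` afterwards.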

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Jean-Baptiste Onofré
+1, it has been on my TODO list for a while, waiting for the Java 8 update. Regards JB On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote: > Why not drop Guava from the whole Beam codebase? With Java 8 it is quite easy to > do and avoids a bunch of conflicts. Did it in 2 projects with quite a good > result. >

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
On Feb 1, 2018 03:10, "Lukasz Cwik" wrote: Yes, people will write bad coders, which is why this is there. No, I don't think swallowing one close is what we want. In the case where you want to pass an input/output stream to a library that incorrectly assumes ownership, the

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Romain Manni-Bucau
@ismael: any vote can be changed from -1 to +1 (or +-0) without additional delay. On Feb 1, 2018 03:15, "Lukasz Cwik" wrote: > Note that a user reported TextIO being broken on Flink. > Thread is here: https://lists.apache.org/thread.html/ >

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Lukasz Cwik
That is an even better idea. A lot of Guava constructs have been superseded by stuff that was added to Java 8. On Wed, Jan 31, 2018 at 9:56 PM, Romain Manni-Bucau wrote: > Why not drop Guava from the whole Beam codebase? With Java 8 it is quite easy > to do and avoids a
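To make the point about superseded Guava constructs concrete, here is a small sketch of a few common replacements. The class and method names are hypothetical; the Guava calls named in the comments are shown only for comparison and are not used by the code.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical helper showing JDK 8 equivalents of some Guava idioms. */
final class Java8InsteadOfGuava {
  // Guava: Joiner.on(",").join(parts)
  static String join(List<String> parts) {
    return String.join(",", parts);
  }

  // Guava: com.google.common.base.Optional.fromNullable(value).or(fallback)
  static String orDefault(String value, String fallback) {
    return Optional.ofNullable(value).orElse(fallback);
  }

  // Guava: Preconditions.checkNotNull(value, message)
  static <T> T checkNotNull(T value, String message) {
    return Objects.requireNonNull(value, message);
  }

  // Guava: FluentIterable.from(words).filter(...).transform(...).toList()
  static List<Integer> lengthsOfNonEmpty(List<String> words) {
    return words.stream()
        .filter(w -> !w.isEmpty())
        .map(String::length)
        .collect(Collectors.toList());
  }
}
```

Not every Guava type has a JDK counterpart (immutable collections and multimaps, for example, do not in Java 8), so a full removal would still need case-by-case review.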

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Romain Manni-Bucau
Why not drop Guava from the whole Beam codebase? With Java 8 it is quite easy to do and avoids a bunch of conflicts. Did it in 2 projects with quite a good result. On Feb 1, 2018 06:50, "Lukasz Cwik" wrote: > Make sure to include the guava version in the artifact name so

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Lukasz Cwik
Make sure to include the guava version in the artifact name so that we can have multiple vendored versions. On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles wrote: > I didn't have time for this, but it just bit me. We definitely have Guava > on the API surface of runner support

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-01-31 Thread Kenneth Knowles
I didn't have time for this, but it just bit me. We definitely have Guava on the API surface of runner support code in ways that get incompatibly shaded. I will probably start "1a" by making a shaded library org.apache.beam:vendored-guava and starting to use it. It sounds like there is generally

Re: Can Window PTransform drop tuples that violate allowed lateness?

2018-01-31 Thread Kenneth Knowles
On Mon, Jan 22, 2018 at 11:42 AM, Shen Li wrote: > Hi Kenn, > > Thanks for the explanation. > > > So now elements are droppable if they belong to an expired window. > > Say I have two consecutive window transforms with FixedWindows WindowFn > (just an example, most likely

FOSDEM mini office hour?

2018-01-31 Thread Holden Karau
Hi BEAM Friends, If any folks are around for FOSDEM this year, I was planning on doing a coffee office hour on the last day after my talks. Maybe like 6pm? I'm also going to see if any Spark folks are around and interested :) Cheers,

Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Kenneth Knowles
Filed https://issues.apache.org/jira/browse/BEAM-3587 On Wed, Jan 31, 2018 at 6:31 PM, Kenneth Knowles wrote: > Yes, a Jira ticket will help to keep track of things. > > I have a guess that this has to do with switching to running from a > portable pipeline representation, and

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is unsafe because it will skip (src, dst) pairs where neither exists? (it only checks whether src exists) For GCS, we can construct a safe retryable rename() operation, assuming that copy() and delete() are atomic for a single file or
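The distinction drawn here (a missing source is acceptable on retry only if the destination already exists) can be sketched as below. This is an assumption-laden illustration: it uses the local filesystem through `java.nio.file` as a stand-in, the class `RetryableRename` is hypothetical, and it is not Beam's `FileSystems` implementation or a GCS client.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of a retry-tolerant rename built from copy + delete. */
final class RetryableRename {
  static void rename(Path src, Path dst) throws IOException {
    if (Files.exists(src)) {
      // First attempt, or a retry after a failure before the delete: redo the copy.
      Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
      Files.deleteIfExists(src);
    } else if (!Files.exists(dst)) {
      // Neither file exists: this is a real error and must not be ignored.
      throw new NoSuchFileException(
          src.toString(), dst.toString(), "neither source nor destination exists");
    }
    // Otherwise src is gone and dst is present: an earlier attempt already completed.
  }
}
```

The sketch cannot tell whether an existing destination was written by this run or by an earlier one, which is the concern raised elsewhere in this thread about ignoring versus overwriting.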

Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Kenneth Knowles
Yes, a Jira ticket will help to keep track of things. I have a guess that this has to do with switching to running from a portable pipeline representation, and it looks like there's a non-composite transform with an empty URN and it threw a bad error message. We can try to root cause but may also

Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Thomas Pelletier
Let me know if there is anything I can do to help! Should I file a jira ticket? On Wed 31 Jan 2018 at 18:14, Lukasz Cwik wrote: > Thanks for the feedback, we are currently in the middle of releasing 2.3.0 > from pretty close to what is on Apache Beam master so your issue

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Lukasz Cwik
Note that a user reported TextIO being broken on Flink. Thread is here: https://lists.apache.org/thread.html/47b16c94032392782505415e010970fd2a9480891c55c2f7b5de92bd@%3Cuser.beam.apache.org%3E Can someone confirm/refute? On Wed, Jan 31, 2018 at 3:36 PM, Konstantinos Katsiapis <

Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Lukasz Cwik
Thanks for the feedback, we are currently in the middle of releasing 2.3.0 from pretty close to what is on Apache Beam master so your issue should be investigated before release. On Wed, Jan 31, 2018 at 5:36 PM, Thomas Pelletier wrote: > Hi, > > I'm trying to run a

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Raghu Angadi
On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov wrote: > As far as I know, the current implementation of file sinks is the only > reason why the flag IGNORE_MISSING for copying even exists - there's no > other compelling reason to justify it. We implement "rename" as

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Eugene Kirpichov
I agree that using an atomic rename operation is even better. I'm mainly opposed to having a copy option that ignores missing files, and to our implementation of rename using that option, because it's unsafe. Unfortunately GCS doesn't have an atomic rename, so I'm not sure what's the best way to

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Chamikara Jayalath
Agree with what Robert said. We have a rename() operation in the FileSystem abstraction and some file-systems might be able to implement this more efficiently than copy+delete. Also note that the same issue could arise in any other usage of rename operation. So I agree that a retry-tolerant

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
I agree that overwriting is more in line with user expectations. I believe that the sink should not ignore errors from the filesystem layer. Instead, the FileSystem API should be more well defined. Examples: rename() and copy() should overwrite existing files at the destination, copy() should have

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Ahmet Altay
This will require a change in the Beam code, because image names are hardcoded into the code (Python) and configuration (Java). RC1 as it is will not work correctly with Cloud Dataflow. On Wed, Jan 31, 2018 at 2:08 PM, Reuven Lax wrote: > Hopefully we can validate soon. I believe

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Reuven Lax
Hopefully we can validate soon. I believe some of the delays are because of integrating major changes done over the last week (e.g. Java 8 migration). On Wed, Jan 31, 2018 at 2:04 PM, Ismaël Mejía wrote: > What is the common procedure in cases like this ? Because it doesn't >

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Ismaël Mejía
What is the common procedure in cases like this? Because it doesn't seem that it needs a re-vote, just an extra day or two for validation. Any ideas, JB? On Wed, Jan 31, 2018 at 10:41 PM, Alan Myrvold wrote: > Yes, it is a dataflow step. Happy to test this again when they

Re: Tracking Sickbayed tests in Jira

2018-01-31 Thread Romain Manni-Bucau
If it helps, I'm using this on another project:
# to find @Ignore tests
$ find . -name '*.java' | xargs grep -n1 @Ignore
# to find test classes (not methods)
$ find . -name '*.java' | xargs grep @Ignore | sed 's#:.*##' | sort -u
# to find modules with @Ignore
$ find . -name '*.java' | xargs grep

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Raghu Angadi
Original mail mentions that output from second run of word_count is ignored. That does not seem as safe as ignoring error from a second attempt of a step. How do we know second run didn't run on different output? Overwriting seems more accurate than ignoring. Does handling this error at sink level

Tracking Sickbayed tests in Jira

2018-01-31 Thread Thomas Groh
Hey everyone; I've realized that although we tend to tag any test we suppress (due to consistent flakiness) in the codebase, and file an associated JIRA issue with the failure, we don't have any centralized way to track tests that we're currently suppressing. To try and get more visibility into

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Alan Myrvold
Yes, it is a dataflow step. Happy to test this again when they are available. On Wed, Jan 31, 2018 at 1:39 PM, Jean-Baptiste Onofré wrote: > OK, I think I understood ;) > > So it's not "directly" related to Beam itself (it's more a Dataflow step > to perform). > > @Alan, I

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
OK, I think I understood ;) So it's not "directly" related to Beam itself (it's more a Dataflow step to perform). @Alan, I think it's better to test first and then cast the vote. These kinds of tests are valuable for validating the release and make sense. But vote should represent the state of

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Reuven Lax
It's just a step that needs to be performed before the new release works on Dataflow. Alan is saying that we've been unable to validate Dataflow so far, as worker images are not yet built. Hopefully they'll be built soon, and we'll be able to validate. On Wed, Jan 31, 2018 at 1:31 PM,

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
Hi Alan, is it related to a change in the codebase or in a dependency/related project? I mean: is it something we have to fix/change in Beam? Just curious as I'm not sure what you mean by "worker images" ;) Thanks! Regards JB On 31/01/2018 22:18, Alan Myrvold wrote: -1 (for now, hope to

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Alan Myrvold
-1 (for now, hope to change this) Dataflow runner jobs are failing for me with 2.3.0 RC1, for both Java and Python. This is not an issue with the 2.3.0 RC1 SDK; we (Google) need to release worker images. I have assigned these bugs to myself, and will be working to help get these workers

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
If you need help on the json part I'm happy to help. To give a few hints on what is very doable: we can add an avro module to johnzon (asf json{p,b} impl) to back jsonp by avro (guess it will be one of the first to be asked) for instance. Romain Manni-Bucau @rmannibucau

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Yeah, another round of refactoring is due to move the rename via copy+delete logic up to the file-based sink level. On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath wrote: > Good point. There's always the chance of step that performs final rename > being retried. So we'll

Re: Should we have a predictable test run order?

2018-01-31 Thread Ismaël Mejía
Is the conclusion of this thread that we should make the test execution order random? Remember that currently it uses the default order, which is filesystem-based as Dan mentioned, and that this produces some minor inconsistencies between mac/linux. It is going to be interesting to see how much extra

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
Hmm, it is a hint semantically or it is deducible from the transform. Doing the union of both you cover all cases. Then how it is forwarded from the transform to the runtime is in runner API not the user (pipeline) API so I'm not sure I see the case you reference where it has a semantic API. Can

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Romain Manni-Bucau
Hmm, here we are the ones owning the call since it is in a coder, no? Do we assume people will implement coders badly? In this particular case we can assume close() will be called by a framework, I think. What about swallowing one close() and failing on the second? Romain Manni-Bucau @rmannibucau
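The "swallow one close(), fail on the second" idea floated in this message could look roughly like the sketch below. It is purely illustrative: the class `FirstCloseTolerantInputStream` is hypothetical, it does not describe how `UnownedInputStream` behaves, and the proposal itself was not accepted in the replies.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stream: tolerates the first close(), rejects any later one. */
final class FirstCloseTolerantInputStream extends FilterInputStream {
  private boolean closeSeen = false;

  FirstCloseTolerantInputStream(InputStream delegate) {
    super(delegate);
  }

  @Override
  public void close() throws IOException {
    if (closeSeen) {
      throw new IOException("close() called more than once on a stream the caller does not own");
    }
    // Swallow the first close; the delegate stays open for its real owner.
    closeSeen = true;
  }
}
```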

Re: why org.apache.beam.sdk.util.UnownedInputStream fails on close instead of ignoring it

2018-01-31 Thread Lukasz Cwik
Because people write code like:
myMethod(InputStream in) {
  InputStream child = new InputStream(in);
  child.close();
}
InputStream in = new FileInputStream(... path ...);
myMethod(in);
myMethod(in);
An exception will be thrown when the second myMethod call occurs. Unfortunately not everyone
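A runnable version of the pattern in this message might look like the following sketch. The assumptions are mine: `BufferedInputStream` stands in for the generic wrapping stream written as `new InputStream(in)` in the message, the method reads one byte so the failure is observable, and the names `SharedStreamDemo`, `readOne`, and `secondCallFails` are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo of a callee closing a stream it does not own. */
final class SharedStreamDemo {
  /** Reads one byte through a wrapper, then closes the wrapper and, with it, `in`. */
  static int readOne(InputStream in) throws IOException {
    try (InputStream child = new BufferedInputStream(in)) {
      return child.read();
    }
  }

  /** Returns true if the second call on the same stream fails with an IOException. */
  static boolean secondCallFails() throws IOException {
    Path tmp = Files.createTempFile("shared-stream", ".bin");
    try {
      Files.write(tmp, new byte[] {1, 2});
      InputStream in = new FileInputStream(tmp.toFile());
      readOne(in); // closes `in` as a side effect
      try {
        readOne(in); // the underlying file stream is already closed
        return false;
      } catch (IOException expected) {
        return true;
      }
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```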

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
I don't think "hint" is the right API, as schema is not a hint (it has semantic meaning). However I think the API for schema should look similar to any "hint" API. On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau wrote: > > > Le 31 janv. 2018 20:16, "Reuven Lax"

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
On Jan 31, 2018 20:16, "Reuven Lax" wrote: As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Chamikara Jayalath
Good point. There's always the chance of the step that performs the final rename being retried. So we'll have to ignore this error at the sink level. We don't necessarily have to do this at the FileSystem level though. I think the proper behavior might be to raise an error for the rename at the FileSystem

Re: ***UNCHECKED*** Re: Samza Runner

2018-01-31 Thread xinyu liu
Thanks Kenneth for merging the Samza BEAM runner to the feature branch! We will work on the other items (docs, example, capability matrix ..) to get it to the master. Thanks, Xinyu On Fri, Jan 26, 2018 at 9:28 AM, Kenneth Knowles wrote: > Regarding merging directly to master, I

Re: drop scala....version from artifact ;)

2018-01-31 Thread Jean-Baptiste Onofré
Hi Romain, AFAIR only Flink runner uses scala version in the artifactId. +1 for me. Regards JB On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote: > Hi guys > > since beam supports a single version of runners why not dropping the scala > version from the artifactId? > > ATM upgrades are

drop scala....version from artifact ;)

2018-01-31 Thread Romain Manni-Bucau
Hi guys, since Beam supports a single version of runners, why not drop the Scala version from the artifactId? ATM upgrades are painful because you upgrade the Beam version + runner artifactIds. wdyt? Romain Manni-Bucau @rmannibucau | Blog

Re: [DISCUSS] State of the project

2018-01-31 Thread Kenneth Knowles
On Wed, Jan 31, 2018 at 3:40 AM, Etienne Chauchot wrote: > Thanks Kenn and Luke for your comments. > > WDYT about my proposal (below) to add methods to the runner api to > enhance the coherence between the runners? > If I understand your point, I think I agree but maybe

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
Thanks Kenn, I prepared the list of tasks I did. I will complete it once the release is out. Regards JB On 01/31/2018 03:07 PM, Kenneth Knowles wrote: > I've cloned the release validation spreadsheet: > > https://s.apache.org/beam-2.3.0-release-validation > > If you plan to perform a manual

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Kenneth Knowles
I've cloned the release validation spreadsheet: https://s.apache.org/beam-2.3.0-release-validation If you plan to perform a manual validation task, please sign up so multiple people don't waste effort. Alan & JB, as far as your pairing up to automate more, anything manual on this sheet

Re: [VOTE] Release 2.3.0, release candidate #1

2018-01-31 Thread Jean-Baptiste Onofré
+1 (binding) Casting my own +1 ;) Regards JB On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote: > Hi everyone, > > Please review and vote on the release candidate #1 for the version 2.3.0, as > follows: > > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide

Re: IO plans?

2018-01-31 Thread Jean-Baptiste Onofré
Agreed! That's why I would like to move forward for 2.4.0 on those ones. Regards JB On 01/31/2018 02:44 PM, Romain Manni-Bucau wrote: > Thanks JB, this is great news since they are highly used IO in the industry > and > really awaited now. > > > Romain Manni-Bucau > @rmannibucau

Re: IO plans?

2018-01-31 Thread Romain Manni-Bucau
Thanks JB, this is great news since they are highly used IO in the industry and really awaited now. Romain Manni-Bucau @rmannibucau | Blog | Old Blog | Github

Re: IO plans?

2018-01-31 Thread Jean-Baptiste Onofré
By the way, short term: ParquetIO, RabbitMqIO are coming (PRs already open). Regards JB On 01/31/2018 02:41 PM, Jean-Baptiste Onofré wrote: > Hi Romain, > > I have some IOs locally and some ideas: > > - ExecIO (it has been proposed as PR but declined) > - ConsoleIO (generic) > - SocketIO > -

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Jean-Baptiste Onofré
Hi Reuven, it's also what I did in JdbcIO for the statement or column mapper. That's fair enough. Regards JB On 01/31/2018 01:35 PM, Reuven Lax wrote: > > > On Wed, Jan 31, 2018 at 1:34 AM, Jean-Baptiste Onofré > wrote: > > Hi Ismaël, > >

Re: [DISCUSSION] Runner agnostic metrics extractor?

2018-01-31 Thread Etienne Chauchot
Hi all, Just to let you know that I have just submitted the PR [1]: This PR adds a MetricsPusher discussed in this [2] document in scenario 3.b. It merges and pushes beam metrics at a configurable (via pipelineOptions) frequency to a configurable sink. By default the sink is a DummySink also

IO plans?

2018-01-31 Thread Romain Manni-Bucau
Hi guys, is there a plan for future IO and some tracking somewhere? I particularly wonder if there are plans for an HTTP IO and common server IO like SFTP, SSH, etc... Romain Manni-Bucau @rmannibucau | Blog | Old Blog

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Reuven Lax
On Wed, Jan 31, 2018 at 1:34 AM, Jean-Baptiste Onofré wrote: > Hi Ismaël, > > I agree that hint should not change the output of PTransforms. > > However, let me illustrate why I think hint could be interesting: > > - I agree with what you are saying about the runners: they

Re: [DISCUSS] State of the project

2018-01-31 Thread Etienne Chauchot
Thanks Kenn and Luke for your comments. WDYT about my proposal (below) to add methods to the runner api to enhance the coherence between the runners? WDYT about my other proposal (below) of trying to avoid having validates runner tests that are specific to a runner like we have now?

Re: [DISCUSS] State of the project: Culture and governance

2018-01-31 Thread Jean-Baptiste Onofré
Sorry, but I fail to see the difference between Beam and other Apache projects. It's not a github project: Beam follows the rules defined by the Apache Software Foundation. Beam website/contribution guide or whatever should just point to Apache website resources. It's clearly explained and

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Jean-Baptiste Onofré
Hi Ismaël, I agree that hint should not change the output of PTransforms. However, let me illustrate why I think hint could be interesting: - I agree with what you are saying about the runners: they should be smart. However, to be smart enough, the runner could use some statements provided by

Re: [DISCUSS] State of the project: Culture and governance

2018-01-31 Thread Ismaël Mejía
Some extra comments inlined: JB, > 1. The Apache Beam website contains a link to the Apache Code of Conduct (on > the dropdown menu under the feather icon ;)). Maybe we can also add a link to > the contribution guide as it's really important, especially for people who > target the

Build failed in Jenkins: beam_Release_NightlySnapshot #671

2018-01-31 Thread Apache Jenkins Server
See -- [...truncated 4.17 MB...] 2018-01-31T08:34:33.754 [INFO] --- mvn-golang-wrapper:2.1.6:build (go-build) @ beam-runners-gcp-gcsproxy --- 2018-01-31T08:34:33.755 [INFO]

Re: [DISCUSSION] Add hint/option on PCollection

2018-01-31 Thread Ismaël Mejía
This is a subject we have already discussed in the past. It was part of the discussion on ‘data locality’ for the runners on top of HDFS. At that time the argument for ‘hints’ was that a transform could send hints to the runners so they properly allocate the readers, improving its execution. This