Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Kenneth Knowles
This is a genius way to involve everyone who lands on the site! My first PR is about to open... :-) Kenn On Wed, Oct 24, 2018 at 8:47 PM Jean-Baptiste Onofré wrote: > Sweet !! > > Thanks ! > > Regards > JB > > On 24/10/2018 23:24, Alan Myrvold wrote: > > To make small documentation changes easi

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Jean-Baptiste Onofré
Sweet !! Thanks ! Regards JB On 24/10/2018 23:24, Alan Myrvold wrote: > To make small documentation changes easier, there is now an Edit button > at the top right of the pages on https://beam.apache.org. This button > opens the source .md file on the master branch of the beam repository in > the

Re: KafkaIO - Deadletter output

2018-10-24 Thread Kenneth Knowles
Forgive me if this is naive or missing something, but here are my thoughts on these alternatives: (0) Timestamp has to be pulled out in the source to control the watermark. Luke's point is imortant. (1) If bad records get min_timestamp, and they occur infrequently enough, then watermark will adva

Re: [SQL] Investigation of missing/wrong session_end implementation in BeamSQL

2018-10-24 Thread Kenneth Knowles
This is some very cool digging, especially the forays into neighboring Apache projects. We (and Flink) are clearly pushing at the edges of what the original Calcite design foresaw. The naive insertion of Beam/Flink style "group by window(s)" into SQL is showing a bit of wear. Kenn On Tue, Oct 23,

Re: [DISCUSS] Beam public roadmap

2018-10-24 Thread Kenneth Knowles
OK. I have taken everyone's feedback into account. Preview at http://apache-beam-website-pull-requests.storage.googleapis.com/6718/roadmap/index.html Summary: - Rephrased the highlights to be more dignified - Filled out everything I could think of to get specific roadmaps started - Moved porta

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
That depends on the users pipeline and how watermark advancement of the source may impact elements becoming droppably late if they are emitted with the minimum timestamp. On Wed, Oct 24, 2018 at 4:42 PM Raghu Angadi wrote: > I see. > > What I meant was to return min_timestamp for bad records in

Re: [PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Chamikara Jayalath
Thanks Heejong. Added some comments. +1 for summarizing the doc in the email thread. - Cham On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay wrote: > Thank you Heejong. Could you also share a summary of the design document > (major points/decisions) in the mailing list? > > On Wed, Oct 24, 2018 at 4

Re: [PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Ahmet Altay
Thank you Heejong. Could you also share a summary of the design document (major points/decisions) in the mailing list? On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee wrote: > Hi, > > I'm working on BEAM-: Parquet IO for Python SDK. > > Issue: https://issues.apache.org/jira/browse/BEAM- > D

Re: KafkaIO - Deadletter output

2018-10-24 Thread Raghu Angadi
I see. What I meant was to return min_timestamp for bad records in the timestamp handler passed to KafkaIO itself, and correct timestamp for parsable records. That should work too, right? On Wed, Oct 24, 2018 at 4:21 PM Lukasz Cwik wrote: > Yes, that would be fine. > > The user could then use a

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
Yes, that would be fine. The user could then use a ParDo which outputs to a DLQ for things it can't parse the timestamp for and use outputWithTimestamp[1] for everything else. 1: https://beam.apache.org/releases/javadoc/2.7.0/org/apache/beam/sdk/transforms/DoFn.WindowedContext.html#outputWithTime

[PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Heejong Lee
Hi, I'm working on BEAM-: Parquet IO for Python SDK. Issue: https://issues.apache.org/jira/browse/BEAM- Design doc: https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg WIP PR: https://github.com/apache/beam/pull/6763 Any feedback is appreciated. Thanks!

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Ankur Goenka
Great addition to the website 👍 On Wed, Oct 24, 2018 at 2:51 PM Ruoyun Huang wrote: > Looks awesome! > > On Wed, Oct 24, 2018 at 2:24 PM Alan Myrvold wrote: > >> To make small documentation changes easier, there is now an Edit button >> at the top right of the pages on https://beam.apache.org.

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Ruoyun Huang
Looks awesome! On Wed, Oct 24, 2018 at 2:24 PM Alan Myrvold wrote: > To make small documentation changes easier, there is now an Edit button at > the top right of the pages on https://beam.apache.org. This button opens > the source .md file on the master branch of the beam repository in the > gi

[PROPOSAL] Bundle Finalization

2018-10-24 Thread Lukasz Cwik
I have been working on the protocol for splitting/checkpointing of bundles for usage with SplittableDoFn but in the mean time wanted to share a proposal for bundle finalization[1]. Bundle finalization is used to solve a problem where integration with external systems which require acknowledgement

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Udi Meiri
Love it! On Wed, Oct 24, 2018 at 2:26 PM Ahmet Altay wrote: > Really cool! Thank you! > > On Wed, Oct 24, 2018 at 2:24 PM, Alan Myrvold wrote: > >> To make small documentation changes easier, there is now an Edit button >> at the top right of the pages on https://beam.apache.org. This button >>

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Charles Chen
This is great! Thanks! On Wed, Oct 24, 2018 at 2:26 PM Ahmet Altay wrote: > Really cool! Thank you! > > On Wed, Oct 24, 2018 at 2:24 PM, Alan Myrvold wrote: > >> To make small documentation changes easier, there is now an Edit button >> at the top right of the pages on https://beam.apache.org.

Re: New Edit button on beam.apache.org pages

2018-10-24 Thread Ahmet Altay
Really cool! Thank you! On Wed, Oct 24, 2018 at 2:24 PM, Alan Myrvold wrote: > To make small documentation changes easier, there is now an Edit button at > the top right of the pages on https://beam.apache.org. This button opens > the source .md file on the master branch of the beam repository i

New Edit button on beam.apache.org pages

2018-10-24 Thread Alan Myrvold
To make small documentation changes easier, there is now an Edit button at the top right of the pages on https://beam.apache.org. This button opens the source .md file on the master branch of the beam repository in the github web editor. After making changes you can create a pull request to ask to

Re: KafkaIO - Deadletter output

2018-10-24 Thread Raghu Angadi
Thanks. So returning min timestamp is OK, right (assuming application fine is with what it means)? On Wed, Oct 24, 2018 at 1:17 PM Lukasz Cwik wrote: > All records in Apache Beam have a timestamp. The default timestamp is the > min timestamp defined here: > https://github.com/apache/beam/blob/2

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
All records in Apache Beam have a timestamp. The default timestamp is the min timestamp defined here: https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/BoundedWindow.java#L48 On Wed, Oct 24, 2018 at 1

Jenkins build is back to normal : beam_SeedJob_Standalone #1807

2018-10-24 Thread Apache Jenkins Server
See

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-24 Thread Lukasz Cwik
On Wed, Oct 24, 2018 at 11:31 AM Kenneth Knowles wrote: > OK. I just opened https://github.com/apache/beam/pull/6809 to push Guava > through. I made some comments there, and also I agree with Luke that full > version string makes sense. For this purpose it seems easy and fine to do a > search/rep

Re: KafkaIO - Deadletter output

2018-10-24 Thread Raghu Angadi
On Wed, Oct 24, 2018 at 12:33 PM Lukasz Cwik wrote: > You would have to return min timestamp for all records otherwise the > watermark may have advanced and you would be outputting records that are > droppably late. > That would be fine I guess. What’s the timestamp for a record that doesn’t hav

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
You would have to return min timestamp for all records otherwise the watermark may have advanced and you would be outputting records that are droppably late. On Wed, Oct 24, 2018 at 12:25 PM Raghu Angadi wrote: > To be clear, returning min_timestamp for unparsable records shound not > affect the

Re: KafkaIO - Deadletter output

2018-10-24 Thread Raghu Angadi
To be clear, returning min_timestamp for unparsable records shound not affect the watermark. On Wed, Oct 24, 2018 at 10:32 AM Raghu Angadi wrote: > How about returning min_timestamp? The would be dropped or redirected by > the ParDo after that. > Btw, TimestampPolicyFactory.withTimestampFn() is

Re: KafkaIO - Deadletter output

2018-10-24 Thread Juan Carlos Garcia
As Raghu said, Just apply a regular ParDo and return a PCollectionTuple afert that you can extract your Success Records (TupleTag) and your DeadLetter records(TupleTag) and do whatever you want with them. Raghu Angadi schrieb am Mi., 24. Okt. 2018, 05:18: > User can read serialized bytes from

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-24 Thread Kenneth Knowles
OK. I just opened https://github.com/apache/beam/pull/6809 to push Guava through. I made some comments there, and also I agree with Luke that full version string makes sense. For this purpose it seems easy and fine to do a search/replace to swap 20.0 for 20.1, and compatibility between them should

Re: KafkaIO - Deadletter output

2018-10-24 Thread Raghu Angadi
How about returning min_timestamp? The would be dropped or redirected by the ParDo after that. Btw, TimestampPolicyFactory.withTimestampFn() is not a public API, is this pipeline defined under kafkaio package? On Wed, Oct 24, 2018 at 10:19 AM Lukasz Cwik wrote: > In this case, the user is attemp

Re: KafkaIO - Deadletter output

2018-10-24 Thread Lukasz Cwik
In this case, the user is attempting to handle errors when parsing the timestamp. The timestamp controls the watermark for the UnboundedSource, how would they control the watermark in a downstream ParDo? On Wed, Oct 24, 2018 at 9:48 AM Raghu Angadi wrote: > On Wed, Oct 24, 2018 at 7:19 AM Chamik

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-24 Thread Lukasz Cwik
It looks like we are agreeing to make each vendored dependency self contained and have all their own internal dependencies packaged. For example, gRPC and all its transitive dependencies would use org.apache.beam.vendored.grpc.vYYY and Calcite and all its transitive dependencies would use org.apach

Re: Follow up ideas, to simplify creating MonitoringInfos.

2018-10-24 Thread Alex Amato
Okay. That makes sense. Using runtime validation and protos is what I was thinking as well. I'll include you as a reviewer in my PRs. As for the choice of a builder/constructor/factory. If we go with factory methods/constructor then we will need a method for each metric type (intCounter, latestInt

Re: KafkaIO - Deadletter output

2018-10-24 Thread Raghu Angadi
On Wed, Oct 24, 2018 at 7:19 AM Chamikara Jayalath wrote: > Ah nice. Yeah, if user can return full bytes instead of applying a > function that would result in an exception, this can be extracted by a > ParDo down the line. > KafkaIO does return bytes, and I think most sources should, unless the

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-24 Thread Maximilian Michels
I've run WordCount using Quickstart with the FlinkRunner (locally and against a Flink cluster). Would give a +1 but waiting what Kenn finds. -Max On 23.10.18 07:11, Ahmet Altay wrote: On Mon, Oct 22, 2018 at 10:06 PM, Kenneth Knowles > wrote: You two did so muc

Jenkins build is back to normal : beam_SeedJob #2854

2018-10-24 Thread Apache Jenkins Server
See

Build failed in Jenkins: beam_SeedJob_Standalone #1806

2018-10-24 Thread Apache Jenkins Server
See Changes: [thw] [BEAM-5848] Fix coder for streaming impulse source. -- Started by timer [EnvInject] - Loading node environment variables. Building remotely on be

Re: KafkaIO - Deadletter output

2018-10-24 Thread Chamikara Jayalath
Ah nice. Yeah, if user can return full bytes instead of applying a function that would result in an exception, this can be extracted by a ParDo down the line. On Tue, Oct 23, 2018 at 11:14 PM Juan Carlos Garcia wrote: > As Raghu said, > > Just apply a regular ParDo and return a PCollectionTuple

Re: Unbalanced FileIO writes on Flink

2018-10-24 Thread Maximilian Michels
The FlinkRunner uses a hash function (MurmurHash) on each key which places keys somewhere in the hash space. The hash space (2^32) is split among the partitions (5 in your case). Given enough keys, the chance increases they are equally spread. This should be similar to what the other Runners d

Re: Follow up ideas, to simplify creating MonitoringInfos.

2018-10-24 Thread Robert Bradshaw
Thanks for bringing this to the list; it's a good question. I think the difficulty comes from trying to statically define a lists of possibilities that should instead be runtime values. E.g. we currently we're up to about a dozen distinct types, and having a setter for each is both verbose and not

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-24 Thread Maximilian Michels
Would also keep it simple and optimize for the JAR size only if necessary. On 24.10.18 00:06, Kenneth Knowles wrote: I think it makes sense for each vendored dependency to be self-contained as much as possible. It should keep it fairly simple. Things that cross their API surface cannot be hidde

Re: Data Preprocessing in Beam

2018-10-24 Thread Maximilian Michels
Welcome Alejandro! Interesting work. The sketching extension looks like a good place for your algorithms. -Max On 23.10.18 19:05, Lukasz Cwik wrote: Arnoud Fournier (afourn...@talend.com ) started by adding a library to support sketching (https://github.com/apache

Build failed in Jenkins: beam_SeedJob #2853

2018-10-24 Thread Apache Jenkins Server
See -- Started by timer [EnvInject] - Loading node environment variables. Building remotely on beam13 (beam) in workspace > git rev-parse --

Build failed in Jenkins: beam_SeedJob_Standalone #1805

2018-10-24 Thread Apache Jenkins Server
See -- Started by timer [EnvInject] - Loading node environment variables. Building remotely on beam2 (beam) in workspace

Re: Unbalanced FileIO writes on Flink

2018-10-24 Thread Jozef Vilcek
So if I run 5 workers with 50 shards, I end up with: Duration Bytes received Records received 2m 39s 900 MB 465,525 2m 39s1.76 GB 930,720 2m 39s 789 MB 407,315 2m 39s1.32 GB 698,262 2m 39s 788 MB

Jenkins build is back to normal : beam_Release_Gradle_NightlySnapshot #217

2018-10-24 Thread Apache Jenkins Server
See

Re: [PROPOSAL] allow the users to anticipate the support of features in the targeted runner.

2018-10-24 Thread Etienne Chauchot
Hi guys, To sum up what we said, I just opened this ticket:https://issues.apache.org/jira/browse/BEAM-5849 Etienne Le jeudi 18 octobre 2018 à 12:44 +0200, Maximilian Michels a écrit : > Plugins for IDEs would be amazing because they could provide feedback already > during pipeline construction, b

Re: Unbalanced FileIO writes on Flink

2018-10-24 Thread Reuven Lax
withNumShards(5) generates 5 random shards. It turns out that statistically when you generate 5 random shards and you have 5 works, the probability is reasonably high that some workers will get more than one shard (and as a result not all workers will participate). Are you able to set the number of

Re: [ANNOUNCE] New committers, October 2018

2018-10-24 Thread Etienne Chauchot
Congrats and welcome !EtienneLe vendredi 19 octobre 2018 à 07:09 -0700, Kenneth Knowles a écrit : > Hi all, > Hot on the tail of the summer announcement comes our pre-Hallowe'en > celebration. > > Please join me and the rest of the Beam PMC in welcoming the following new > committers: > > - X

Re: Unbalanced FileIO writes on Flink

2018-10-24 Thread Jozef Vilcek
cc (dev) I tried to run the example with FlinkRunner in batch mode and received again bad data spread among the workers. When I tried to remove number of shards for batch mode in above example, pipeline crashed before launch Caused by: java.lang.IllegalStateException: Inputs to Flatten had incom