Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Ismaël Mejía
From the Apache point of view nothing impedes anyone from doing intermediate releases for non-LTS releases; the only thing needed is someone willing to do the release and the due vote process. I don’t know, however, how we will decide this; we are exactly in the middle of the release cycle and in 3

Re: Fixing equality of Rows

2018-10-29 Thread Rui Wang
I might misunderstand what portability is in Beam. If the portability is designed as each SDK has its own representation of something and after that it's converted to portable representation, then wrapping byte[] into an object is fine. -Rui On Mon, Oct 29, 2018 at 11:26 AM Gleb Kanterov wrote:

error with DirectRunner

2018-10-29 Thread Allie Chen
Hi, I have a project that started failing with DirectRunner, but works well using DataflowRunner (last working version is 2.4). The error message I received is: line 1088, in run_stage pipeline_components.pcollections[actual_pcoll_id].coder_id]] KeyError: u'ref_Coder_WindowedValueCoder_1' I

Re: Edit access to the Apache Beam Confluence Wiki?

2018-10-29 Thread Thomas Weise
You should be all set now. On Mon, Oct 29, 2018 at 1:49 PM Alan Myrvold wrote: > Can I get edit access to the Apache Beam Confluence Wiki, > https://cwiki.apache.org/confluence/display/BEAM ? > > I'd like to move some FAQ around contributing to the wiki. > > Thanks > Alan >

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Ahmet Altay
On Mon, Oct 29, 2018 at 12:40 PM, Ismaël Mejía wrote: > From the Apache point of view nothing impedes anyone from doing > intermediate releases for non LTS releases, only needed thing is > someone willing to do the release and the due vote process. > Agreed. I was not suggesting not doing a

Edit access to the Apache Beam Confluence Wiki?

2018-10-29 Thread Alan Myrvold
Can I get edit access to the Apache Beam Confluence Wiki, https://cwiki.apache.org/confluence/display/BEAM ? I'd like to move some FAQ around contributing to the wiki. Thanks Alan

Re: Follow up ideas, to simplify creating MonitoringInfos.

2018-10-29 Thread Alex Amato
Hi Robert and community, :) I was starting to code this up, but I wasn't sure exactly how to do some of the proto syntax. Would you mind taking a look at this doc and letting me know if you know how to

Beam Dependency Check Report (2018-10-30)

2018-10-29 Thread Apache Jenkins Server
High Priority Dependency Updates Of Beam Python SDK: Dependency Name Current Version Latest Version Release Date Of the Current Used Version Release Date Of The Latest Release JIRA Issue future 0.16.0 0.17.0 2016-10-27

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Etienne Chauchot
Hey, I would vote -0; here is the explanation: I took a look at the Nexmark dashboards for output size and performance for all the runners in all the modes around the date of the release cut to search for regressions. I noted a regression in the performance of the Spark runner. Query4, Query6,

Re: Beam Community Metrics

2018-10-29 Thread Maximilian Michels
Hi Scott, Thanks for sharing the progress. The test metrics are super helpful. I'm particularly looking forward to the PR metrics which could be useful for improving interaction within the community and with new contributors. -Max On 26.10.18 07:36, Scott Wegner wrote: I want to summarize

Re: Growing Beam -- A call for ideas? What is missing? What would be good to see?

2018-10-29 Thread Maximilian Michels
Hi Austin, Great initiative. I think there are already some materials out there but they are not consolidated: Cookbook with examples: https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/cookbook An interactive tutorial would be a great addition,

Re: Python profiling

2018-10-29 Thread Maximilian Michels
This looks very helpful for debugging performance of portable pipelines. Great work! Enabling local directories for Flink or other portable Runners would be useful for debugging, e.g. per https://issues.apache.org/jira/browse/BEAM-5440 On 26.10.18 18:08, Robert Bradshaw wrote: Now that

Re: Growing Beam -- A call for ideas? What is missing? What would be good to see?

2018-10-29 Thread Gleb Kanterov
I'm a scio contributor, and I have a lot of experience with Scala. However, I would advise against using Scala. There are several problems with maintaining Scala libraries: - have to build different artifacts for each Scala version - artifacts have dependencies on the Scala standard library - it

Re: Unbalanced FileIO writes on Flink

2018-10-29 Thread Maximilian Michels
I was suggesting a transform like reshuffle that can avoid the actual reshuffle if the data is already well distributed How do we know if the data is already well-distributed? Can't we simply give the user control over the shuffling behavior? and also provides some kind of unique key

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Etienne Chauchot
Oops, just saw that Kenn already mentioned the perf degradation on the Spark runner around 10/05. Sorry for the repetition. Nevertheless, IMHO, it would still be worth checking PR #6181. Etienne On Monday, 29 October 2018 at 10:42 +0100, Etienne Chauchot wrote: > Hey, I would vote -0; here is

Re: Python profiling

2018-10-29 Thread Manu Zhang
Cool! Can we document it somewhere so that other Runners can pick it up later? Thanks, Manu Zhang On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels wrote: > This looks very helpful for debugging performance of portable pipelines. > Great work! > > Enabling local directories for Flink or

Re: error with DirectRunner

2018-10-29 Thread Udi Meiri
This looks like a FnApiRunner bug. When I override use_fnapi_runner = False in direct_runner.py the pipeline works. It seems like either the side-input to _copy_number or the Flatten operation is the culprit. On Mon, Oct 29, 2018 at 2:37 PM Allie Chen wrote: > Hi, > > I have a project that

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Kenneth Knowles
I didn't isolate it to a cause and commit, so that is extremely useful to know. To bring some details on thread: query 4: a single aggregation in sliding windows query 8: a single join with no other interesting logic query 9 (prefix of query 6*): find the winning bid for each auction query 6:

Beam Dependency Check Report (2018-10-29)

2018-10-29 Thread Apache Jenkins Server
High Priority Dependency Updates Of Beam Python SDK: Dependency Name Current Version Latest Version Release Date Of the Current Used Version Release Date Of The Latest Release JIRA Issue google-cloud-pubsub 0.35.4 0.38.0

Re: Fixing equality of Rows

2018-10-29 Thread Kenneth Knowles
I'll summarize my input to the discussion. It is rather high level. But IMO: - even though schemas are part of Beam Java today, I think they should become part of portability when ready - so each type in a schema needs a language-independent & encoding-independent notion of domain of values and

Re: New Edit button on beam.apache.org pages

2018-10-29 Thread Etienne Chauchot
Cool! Thanks, Etienne On Wednesday, 24 October 2018 at 14:24 -0700, Alan Myrvold wrote: > To make small documentation changes easier, there is now an Edit button at > the top right of the pages on https://beam.apache.org. This button opens the source .md file on the master branch of the > beam

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Kenneth Knowles
I think definitely open a cherry pick PR to a 2.8.x branch. I think we must not corrupt maven central, so if it is published to users this has to be 2.8.1. Ahmet - we are to this point, right? Kenn On Mon, Oct 29, 2018 at 8:40 AM Ismaël Mejía wrote: > First thanks Etienne and Kenn for noting

Re: New Edit button on beam.apache.org pages

2018-10-29 Thread Holden Karau
So awesome :) On Mon, Oct 29, 2018, 7:50 AM Etienne Chauchot wrote: > Cool! > Thanks > Etienne > On Wednesday, 24 October 2018 at 14:24 -0700, Alan Myrvold wrote: > > To make small documentation changes easier, there is now an Edit button at > the top right of the pages on https://beam.apache.org. This

Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
With adding BYTES type, we broke equality. `RowCoder#consistentWithEquals` is always true, but this property doesn't hold for exotic types such as `Map`, `List`. The root cause is `byte[]`, where `equals` is implemented as reference equality instead of structural. Before we jump into solution
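
The root cause named above can be reproduced with the JDK alone. The following is a minimal sketch, not Beam code: the class and method names are hypothetical and chosen only for illustration. It shows that `byte[]` inherits reference equality from `Object#equals`, that `Arrays.equals` compares contents, and that a `List` of arrays inherits the reference-equality behaviour of its elements.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Beam.
public class ByteArrayEquality {

  // byte[] inherits Object#equals, which compares references, not contents.
  public static boolean referenceEquals(byte[] a, byte[] b) {
    return a.equals(b);
  }

  // Arrays.equals compares length and then each element.
  public static boolean structuralEquals(byte[] a, byte[] b) {
    return Arrays.equals(a, b);
  }

  // List#equals delegates to the elements' equals, so a list of byte[]
  // is only equal to another list holding the very same array instances.
  public static boolean listEquals(byte[] a, byte[] b) {
    List<byte[]> left = Collections.singletonList(a);
    List<byte[]> right = Collections.singletonList(b);
    return left.equals(right);
  }

  public static void main(String[] args) {
    byte[] x = {1, 2, 3};
    byte[] y = {1, 2, 3};
    System.out.println("reference equals:  " + referenceEquals(x, y));
    System.out.println("structural equals: " + structuralEquals(x, y));
    System.out.println("list equals:       " + listEquals(x, y));
  }
}
```

Any value type that stores a raw `byte[]` field and relies on the default `equals` of its fields would show the same behaviour.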

Re: Fixing equality of Rows

2018-10-29 Thread Rui Wang
It seems to me that only Map's equality check cannot be solved by deepEquals, because keys cannot be looked up correctly in a Map. If we cannot find a useful use case for Map, we could reject it in Schema and still keep byte[]. Option 3 needs to find a wrapper of byte[] that is language-independent
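
To make the Map lookup problem concrete, here is a small JDK-only sketch. It is an illustration under assumptions, not the wrapper proposed in the thread: `java.nio.ByteBuffer` stands in for "some wrapper with content-based `equals` and `hashCode`", and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Beam.
public class ByteArrayMapKeys {

  // With raw byte[] keys, HashMap uses identity hashCode and reference equals,
  // so probing with a different array of the same contents finds nothing.
  public static String lookupRaw(byte[] stored, byte[] probe) {
    Map<byte[], String> map = new HashMap<>();
    map.put(stored, "value");
    return map.get(probe);
  }

  // ByteBuffer defines equals and hashCode over its remaining contents,
  // so a wrapped key can be found by any buffer with the same bytes.
  public static String lookupWrapped(byte[] stored, byte[] probe) {
    Map<ByteBuffer, String> map = new HashMap<>();
    map.put(ByteBuffer.wrap(stored), "value");
    return map.get(ByteBuffer.wrap(probe));
  }

  public static void main(String[] args) {
    byte[] stored = {1, 2, 3};
    byte[] probe = {1, 2, 3};
    System.out.println("raw key lookup:     " + lookupRaw(stored, probe));
    System.out.println("wrapped key lookup: " + lookupWrapped(stored, probe));
  }
}
```

A Java-side wrapper like this only addresses in-memory equality; whether it satisfies the language-independence requirement discussed in the thread is a separate question.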

Re: 2 tier input

2018-10-29 Thread Lukasz Cwik
Yes this will change. Apache Beam has been working towards a general solution to make all IO connectors become modular[1]. This would allow you to read from an arbitrary number of sources chaining the output from one to the next. 1: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html

Re: Fixing equality of Rows

2018-10-29 Thread Lukasz Cwik
I believe Kenn is spot on. The focus of the issue is too narrow, as you're talking about the short-term problem related to Map. Schemas are very similar to coders, and coders have been solving this problem by delegating to the underlying component coder to figure out whether two things are equal. You

Re: Fixing equality of Rows

2018-10-29 Thread Anton Kedin
About these specific use cases, how useful is it to support Map and List? These seem pretty exotic (maybe they aren't) and I wonder whether it would make sense to just reject them until we have a solid design. And wouldn't the same problems arise even without RowCoder? Is the path in that case to

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Ahmet Altay
On Mon, Oct 29, 2018 at 8:55 AM, Kenneth Knowles wrote: > I think definitely open a cherry pick PR to a 2.8.x branch. I think we > must not corrupt maven central, so if it is published to users this has to > be 2.8.1. Ahmet - we are to this point, right? > Yes, if someone is willing to make a

Re: [PROPOSAL] Bundle splitting (https://s.apache.org/beam-checkpoint-and-split-bundles)

2018-10-29 Thread Lukasz Cwik
Sorry all, I messed up the link/reference numbers. Here is the same e-mail with the reference numbers fixed. I build off of the work performed by Eugene et al. within Breaking the fusion barrier[2] and propose[1] a way of how to support splitting of bundles (primarily for SplittableDoFn) within

Re: 2 tier input

2018-10-29 Thread Chaim Turkel
Both solutions mean that I cannot use the Beam IO classes that would give me the distribution; I would have to get the data myself using a ParDo method. Is this something that will change in the future? I understand that Spark has a push-down method that will pass the filter to the next level of

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-29 Thread Ismaël Mejía
Mmm, 2.8.0 is already in Maven Central, so it is probably worth discussing whether other backports are needed too. On Mon, Oct 29, 2018 at 4:55 PM Kenneth Knowles wrote: > > I think definitely open a cherry pick PR to a 2.8.x branch. I think we must > not corrupt maven central, so if it is published to

Re: Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
There is an indirect connection to RowCoder because `MapCoder` isn't deterministic, therefore, this doesn't hold: > - also each type (hence Row type) should have portable encoding(s) that respect this equality so shuffling is consistent I think it's a requirement only for rows we want to

FileSystems should retrieve lastModified time

2018-10-29 Thread Jeff Klukas
I just wrote up a JIRA issue proposing that FileSystem implementations retrieve lastModified time of the files they list: https://issues.apache.org/jira/browse/BEAM-5910 Any immediate concerns? I'm not intimately familiar with HDFS, but I'm otherwise confident that GCS, S3, and local filesystems

Re: FileSystems should retrieve lastModified time

2018-10-29 Thread Chamikara Jayalath
+1 for adding last modified time to MatchResult.Metadata. Sounds like a useful change that will enable additional use-cases. - Cham On Mon, Oct 29, 2018 at 11:08 AM Jeff Klukas wrote: > I just wrote up a JIRA issue proposing that FileSystem implementations > retrieve lastModified time of the

Re: Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
Rui, I'm not completely sure I understand why it isn't possible to find suitable encoding for portability. As I understand, the only requirement is deterministic encoding consistent with equality, so existing representation of BYTES will work (VarInt followed by bytes). In my understanding, it's
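
The "VarInt followed by bytes" representation mentioned above can be sketched in a few lines. This is a hand-written illustration assuming a base-128 varint (7 payload bits per byte, low-order group first, high bit as continuation flag); it is not Beam's coder implementation, and the class name is hypothetical. The point it demonstrates is that the encoding depends only on the array's contents, so arrays that are structurally equal encode to identical bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration class; not Beam's actual BYTES coder.
public class LengthPrefixedBytes {

  // Encodes the array as: varint(length) followed by the raw bytes.
  public static byte[] encode(byte[] value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int length = value.length;
    // Emit 7 bits at a time, setting the high bit while more groups remain.
    while ((length & ~0x7F) != 0) {
      out.write((length & 0x7F) | 0x80);
      length >>>= 7;
    }
    out.write(length);
    out.write(value, 0, value.length);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] x = {1, 2, 3};
    byte[] y = {1, 2, 3};
    // Distinct array instances with equal contents yield equal encodings.
    System.out.println(Arrays.equals(encode(x), encode(y)));
    System.out.println(Arrays.toString(encode(x)));
  }
}
```

Comparing encoded forms with `Arrays.equals` is one way to obtain an equality that is consistent with a deterministic encoding, at the cost of encoding both values.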