Re: [python] ReadFromPubSub broken in Flink

2019-07-15 Thread Robert Bradshaw
On Mon, Jul 15, 2019 at 5:42 AM Chamikara Jayalath wrote: > > On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova wrote: >> >> Hi Chamikara, why not make this part of the pipeline options? does it really need to vary from transform to transform? >>> >>> It's possible for the same

[PROPOSAL] Standardize Gradle structure in Python SDK

2019-07-15 Thread Frederik Bode
Hi Mark & others, +1 on using this structure. I don't see any other alternative to gradle as some of the Python tasks have Java tasks as a dependency. You can't debug that using just `python nosetests... or tox`. Parallelizing such tasks requires different projects (and I don't think gradle

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Claire McGinty
Thanks Robert! We'd definitely like to be able to re-use existing I/O components--for example the Writer /FileBasedReader

Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Ismaël Mejía
Just wanted to bring back the conversation on this subject. A quick abstract of the discussion so far: We are trying to agree in the best approach for return types in Write transforms towards some sort of ‘homogenization’ in IOs. At the moment we mostly agree that the best approach for return

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Neville Li
Hi Robert, I agree, it'd be nice to reuse FileIO logic of different file types. But given the current code structure of FileIO & scope of the change, I feel it's better left for future refactor PRs. Some thoughts: - SMB file operation is simple single file sequential reads/writes, which already

Re: Circular dependencies between DataflowRunner and google cloud IO

2019-07-15 Thread Lukasz Cwik
When do you mean by separate package? Most of our perf tests are integrated with PerfkitBenchmarker such as our JDBC benchmark[1] and Python BigQuery benchmark[2] and point to a specific IT class/method that contains the benchmarks. Note that there was a recent discussion about moving away from

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Robert Bradshaw
I agree that generalizing the existing FileIO may not be the right path forward, and I'd only make their innards public with great care. (Would this be used like like SmbSink(MyFileIO.sink(parameters).getWriter[Factory]())?) SMB is a bit unique that the source and sink are much more coupled than

Re: Circular dependencies between DataflowRunner and google cloud IO

2019-07-15 Thread Michał Walenia
Thanks for the information. I don't think that the task you mentioned fits my case - I want to run a performance test on a real service and running it with a battery of other tests doesn't make much sense to me. It's similar to the case of other integration tests that are excluded from the task

Beam Dependency Check Report (2019-07-15)

2019-07-15 Thread Apache Jenkins Server
High Priority Dependency Updates Of Beam Python SDK: Dependency Name Current Version Latest Version Release Date Of the Current Used Version Release Date Of The Latest Release JIRA Issue google-cloud-core 0.29.1 1.0.2

pubsub -> IO

2019-07-15 Thread Chaim Turkel
Hi, I am looking to write a pipeline that read from a mongo collection. I would like to listen to a pubsub that will have a object that will tell me which collection and which time frame. Is there a way to do this? Chaim -- Loans are funded by FinWise Bank, a Utah-chartered bank located

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Robert Bradshaw
I left some comments on the doc. I think the general idea is sound, but one thing that worries me is the introduction of a parallel set of IOs that mirrors the (existing) FileIOs. I would suggest either (1) incorporate this functionality into the generic FileIO infrastructure, or let it be

Re: PostCommit tests broken

2019-07-15 Thread Katarzyna Kucharczyk
Hi, this is something we also observed in Python LoadTests. I created another issue on Friday issue [BEAM-7733] because I didn't see the one you shared, Valentyn. Whole problem seems to be a bit similar to this ([BEAM-5529])

Re: Bucketed histogram metrics in beam. Anyone currently looking into this?

2019-07-15 Thread Alex Amato
Thanks Steve, is your fork available for me to see? Would you mind linking me to the PRs you introduced to add the histogram support to the dataflow worker On Fri, Jul 12, 2019 at 11:52 AM Steve Niemitz wrote: > I've been doing some experiments in my own fork of the Dataflow worker > using

Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Kenneth Knowles
To me (6) PCollection feels like an obvious choice, and is actually the same as (4) at the core, as I think has pretty much been said. Basically, the type actually describes what is going on. If there's one or more PCollections doesn't really matter. All of the other options seem strange to me,

Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Lukasz Cwik
In the POutput case (4), does that mean we will have to compute all those outputs in the transform even if they aren't used? If yes, I prefer (6) because it allows for the transform structure to be modified to either produce these additional outputs only if they will be consumed instead of having

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Eugene Kirpichov
Quick note: I didn't look through the document, but please do not build on either FileBasedSink or FileBasedReader. They are both remnants of the old, non-composable IO world; and in fact much of the composable IO work emerged from frustration with their limitations and recognizing that many other

Re: PostCommit tests broken

2019-07-15 Thread Valentyn Tymofieiev
Thanks Kasia. Below are the most frequent flakes I have encountered in postcommit failures, but there may be more. Let's track resolution in respective Jira's and avoid re-creating duplicates. https://issues.apache.org/jira/browse/BEAM-7527 - ModuleNotFoundError (No module named

Re: [PROPOSAL] Standardize Gradle structure in Python SDK

2019-07-15 Thread Kenneth Knowles
Gradle comments inline On Mon, Jul 15, 2019 at 2:30 AM Frederik Bode wrote: > Hi Mark & others, > > +1 on using this structure. I don't see any other alternative to gradle as > some of the Python tasks have Java tasks as > a dependency. You can't debug that using just `python nosetests... or >

Re: PostCommit tests broken

2019-07-15 Thread Pablo Estrada
Hello all, I've discussed this with Valentyn, but I'm looking at the BigQuery failures. Best -P. On Mon, Jul 15, 2019 at 2:40 PM Valentyn Tymofieiev wrote: > Thanks Kasia. > > Below are the most frequent flakes I have encountered in postcommit > failures, but there may be more. Let's track

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Pablo Estrada
@Thomas do you think this is a problem of documentation, or a missing feature? We did not add support for it without cython because the cost of locking and checking every 200ms in Python would be too high - that's why this is only implemented in the optimized Cython codepath. I think it makes

Re: [2.14.0] Release Progress Update

2019-07-15 Thread Ahmet Altay
Anton, any updates on this release? Do you need help? On Fri, Jun 28, 2019 at 11:42 AM Anton Kedin wrote: > I have been running validation builds (had some hickups with that), > everything looks mostly good, except failures in `:beam-test-tools` and > `:io:aws`. Now I will start cherry-picking

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Kyle Weaver
Pablo, what about setting a lower sampling rate? Or would that lead to poor results? Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +1650203 On Mon, Jul 15, 2019 at 4:44 PM Pablo Estrada wrote: > @Thomas do you think this is a problem of documentation, or a

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Alex Amato
Perhaps no metric at all should be returned, instead of 0, which is an incorrect value. Also, is there a reason to have state_sampler_slow at all then, if its not intended to be implemented? On Mon, Jul 15, 2019 at 5:03 PM Kyle Weaver wrote: > Pablo, what about setting a lower sampling rate?

[REQUEST] Python Tests (pre/post-commits) Status

2019-07-15 Thread Udi Meiri
Hi, I've been trying to merge several Python PRs in the past weeks, but Jenkins pre- and post-commit jobs have been red all this time due to various reasons. I have a proposal in mind to help deal with this, but it can't happen without cooperation from a majority of committers. The request from

Re: [VOTE] Vendored Dependencies Release

2019-07-15 Thread Pablo Estrada
+1 verified hashes and signatures On Fri, Jul 12, 2019 at 9:40 AM Kai Jiang wrote: > +1 (non-binding) > > On Thu, Jul 11, 2019 at 8:27 PM Lukasz Cwik wrote: > >> Please review the release of the following artifacts that we vendor: >> * beam-vendor-grpc_1_21_0 >> * beam-vendor-guava-26_0-jre

Re: [VOTE] Vendored Dependencies Release

2019-07-15 Thread Lukasz Cwik
+1 On Mon, Jul 15, 2019 at 8:14 PM Pablo Estrada wrote: > +1 > verified hashes and signatures > > On Fri, Jul 12, 2019 at 9:40 AM Kai Jiang wrote: > >> +1 (non-binding) >> >> On Thu, Jul 11, 2019 at 8:27 PM Lukasz Cwik wrote: >> >>> Please review the release of the following artifacts that we

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Pablo Estrada
State sampler is the only state provider for the Python SDK. This means that the Metrics module relies on it to attribute metrics to each step; and the logging module also uses it to attribute logs to each step. statesampler_slow does not implement the actual sampling, but it does implement the

Docker Run Options in SDK Container

2019-07-15 Thread Sam Bourne
Hello Beam devs, I’ve opened a PR (https://github.com/apache/beam/pull/8982) to support passing options/flags to the docker run command executed as part of the portable environment workflow. I’m in need of providing specific volumes and possibly other docker run options as I refine our custom

Re: [REQUEST] Python Tests (pre/post-commits) Status

2019-07-15 Thread Kenneth Knowles
Huge +1 Once tests are a little bit red most of the time, they get ignored. I'd be in favor of automation to make it extra hard to merge when things are red. For example, making it so you cannot use the GitHub UI to merge when master is red, but of course you can do a git push not using the UI.

Re: [DISCUSS] Reconciling ValueState in Java and Python (was: [docs] Python State & Timers)

2019-07-15 Thread Rakesh Kumar
Brian, I just want to follow up. Let me know if you are working on this. Otherwise, I can implement ReadModifyWriteState. On Mon, May 6, 2019 at 4:52 PM Reza Rokni wrote: > When used as metadata I think the ReadModifyWrite naming is very accurate > for the majority of cases. > > The only case

Write-through-cache in State logic

2019-07-15 Thread Rakesh Kumar
Hi, I have been using python sdk for the application and also using BagState in production. I was wondering whether state logic has any write-through-cache implemented or not. If we are sending every read and write request through network then it comes with a performance cost. We can avoid

Using the BigQuery Storage API

2019-07-15 Thread john desmond
Hello, I am trying to follow the tutorial for reading data from a Big Query table using java, specifically using the BigQuery Storage API. I am using the tutorial found here: https://beam.apache.org/documentation/io/built-in/google-bigquery/ The tutorial suggests using the 2.13.0 version of

Re: Write-through-cache in State logic

2019-07-15 Thread Reuven Lax
It's runner dependent. Some runners (e.g. the Dataflow runner) do have such a cache, though I think it's currently has a cap for large bags. Reuven On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar wrote: > Hi, > > I have been using python sdk for the application and also using BagState > in

Re: [REQUEST] Python Tests (pre/post-commits) Status

2019-07-15 Thread Kenneth Knowles
I just merged a PR for the website despite a bit of redness in the tests. Somewhat happily, almost every cell of the PR matrix was green (yay!?). The only red was https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/ which appears to have been red for weeks, with

Re: PR#6675 Updates

2019-07-15 Thread Rui Wang
Hi community, I have helped review PR and it looks good to me now. Is there a committer who could help review/merge it? -Rui On Mon, Jul 8, 2019 at 11:19 AM Rui Wang wrote: > Thanks for the PR! I left some comments related to schema types. > > > -Rui

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Thomas Weise
That's great, but I think the JIRA needs to remain open since w/o Cython the metric still doesn't work. It would however be helpful to add a comment regarding your findings. On Mon, Jul 15, 2019 at 1:46 PM Rakesh Kumar wrote: > > Installing cython in the application environment fixed the

Re: pubsub -> IO

2019-07-15 Thread Rui Wang
+u...@beam.apache.org -Rui On Mon, Jul 15, 2019 at 6:55 AM Chaim Turkel wrote: > Hi, > I am looking to write a pipeline that read from a mongo collection. > I would like to listen to a pubsub that will have a object that will > tell me which collection and which time frame. > Is there

Re: ParDo Execution Time stat is always 0

2019-07-15 Thread Rakesh Kumar
Installing cython in the application environment fixed the issue. Now I am able to see the operator metrics ({organization_specific_prefix} .operator.beam-metric-pardo_execution_time-process_bundle_ msecs-v1.gauge.mean) Thanks Ankur for looking into it and providing support. I am going to close

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Gleb Kanterov
I share the same concern with Robert regarding re-implementing parts of IO. At the same time, in the past, I worked on internal libraries that try to re-use code from existing IO, and it's hardly possible because it feels like it wasn't designed for re-use. There are a lot of classes that are

Re: python precommits failing at head

2019-07-15 Thread Yifan Zou
We saw similar timeouts of the python precommit and it usually break the Jenkins build workers. I've run the precommit manually several times. It sometimes stuck at :sdks:python:docs and consumes 80G+ memory. Our build VMs eventually ran out of memory (104G memory available in total) then