Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Dan Halperin
On Thu, Nov 8, 2018 at 2:12 PM Udi Meiri wrote: >> Both options risk delaying worker shutdown if the executor's shutdown() >> hasn't been called, which is I guess why the executor in GcsOptions.java >> creates daemon threads. > My guess (and it really is a guess at this point) is that

Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Dan Halperin
Hey Udi, Thanks for the commit comment. I'll try to dump any (old) mental context I have left. We were trying to find the right point in a space of: * enough parallelism to speed things up - more

[VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-08 Thread Ahmet Altay
Hi all, Please review the following statement: "2.7.0 branch will be marked as the long-term-support (LTS) release branch. This branch will be supported for a window of 6 months starting from the day it is marked as an LTS branch. Beam community will decide on which issues will be backported and

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-11-08 Thread Ahmet Altay
There were no new updates, I will start a vote based on the latest proposal. On Mon, Nov 5, 2018 at 10:19 AM, Ahmet Altay wrote: > +1 to starting with 2.7 branch and supporting it for 6 months. I think we > should start the support window of 6 months from the day we agree to do > this. That way

Re: How to use "PortableRunner" in Python SDK?

2018-11-08 Thread Ruoyun Huang
Thanks Maximilian! I am working on migrating the existing PortableRunner to use the Java ULR (Link to Notes). If this issue is non-trivial to solve, I would vote for removing this default behavior as part of the

Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-08 Thread Rui Wang
https://github.com/apache/beam/pull/6991 I am using java.time.Instant as the internal representation to replace Joda time for DateTime field in the PR. java.time.Instant uses a *long* to save seconds-after-epoch and an *int* to save nanoseconds-of-second. Therefore 64 bits are fully used for
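A minimal sketch of the representation described above, using only the standard java.time API. The class name `InstantPrecisionSketch`, the helper `fromParts`, and the sample values are hypothetical and are not taken from the PR; the sketch only illustrates that an Instant carries a long seconds part plus an int nanoseconds part, and that sub-millisecond digits are dropped when converting to epoch milliseconds (the precision Joda time works with).

```java
import java.time.Instant;

public class InstantPrecisionSketch {
    // Hypothetical helper: builds an Instant from its two stored components.
    public static Instant fromParts(long epochSecond, int nanoOfSecond) {
        return Instant.ofEpochSecond(epochSecond, nanoOfSecond);
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration only.
        Instant ts = fromParts(1541635200L, 123456789);

        // The two components are stored and returned separately.
        System.out.println(ts.getEpochSecond()); // long seconds since the epoch
        System.out.println(ts.getNano());        // int nanoseconds within that second

        // Converting to milliseconds truncates everything below one millisecond.
        System.out.println(ts.toEpochMilli());
    }
}
```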

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Robert Bradshaw
I'd assume you're compiling the code with Cython as well? (If you're using the default containers, that should be fine.) On Fri, Nov 9, 2018 at 12:09 AM Robert Bradshaw wrote: > > Very cool to hear of this progress on Samza! > > Python protocol buffers are extraordinarily slow (lots of

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Robert Bradshaw
Very cool to hear of this progress on Samza! Python protocol buffers are extraordinarily slow (lots of reflection, attributes lookups, and bit fiddling for serialization/deserialization that is certainly not Python's strong point). Each bundle processed involves multiple protos being constructed

[Call for items] November Beam Newsletter

2018-11-08 Thread Rose Nguyen
Hi Beamers: Time to sync with the community on all the awesome stuff we've been doing! *Add the highlights from October to now (or planned events and talks) that you want to share by 11/14 11:59 p.m. PDT.* We will collect the notes via Google docs but send out the final version directly to the

Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Udi Meiri
My thought was to use 1 executor per GcsUtil instance (or 1 per process as you suggest), with a possible performance hit since I don't know how often these batch copy and remove operations are called. The other option is to leave things as they mostly are, and only remove the call to

Re: BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Lukasz Cwik
Not certain, it looks like we should have been caching the executor within the GcsUtil as a static instance instead of creating one each time. Could have been missed during code review / slow code changes over time. GcsUtil is not well "loved". On Thu, Nov 8, 2018 at 11:00 AM Udi Meiri wrote: >
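The idea of caching one executor as a static instance, rather than creating one per call, could look roughly like the sketch below. This is not the GcsUtil code: the class name `SharedExecutorSketch`, the accessor `sharedExecutor`, the thread name prefix, and the pool size of 4 are all assumptions made for illustration. Daemon threads are used so that a missing shutdown() call does not keep the JVM alive, which is the concern raised elsewhere in this thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedExecutorSketch {
    private static final AtomicInteger THREAD_COUNTER = new AtomicInteger();

    // Daemon threads do not block JVM exit if shutdown() is never called.
    private static final ThreadFactory DAEMON_FACTORY = runnable -> {
        Thread t = new Thread(runnable, "shared-pool-" + THREAD_COUNTER.incrementAndGet());
        t.setDaemon(true);
        return t;
    };

    // Created once per process; the pool size is an arbitrary illustrative value.
    private static final ExecutorService SHARED =
        Executors.newFixedThreadPool(4, DAEMON_FACTORY);

    // Hypothetical accessor: every caller reuses the same pool.
    public static ExecutorService sharedExecutor() {
        return SHARED;
    }

    public static void main(String[] args) throws Exception {
        // Both calls return the same instance, so no new pool is allocated.
        System.out.println(sharedExecutor() == sharedExecutor());
        // Work submitted by any caller runs on the shared threads.
        System.out.println(sharedExecutor().submit(() -> 21 * 2).get());
    }
}
```

The trade-off is that all callers compete for the same fixed set of threads, so the pool size would need to be chosen with the expected batch copy and remove load in mind.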

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-11-08 Thread Lukasz Cwik
The purpose of the spec would be to provide the names, type and descriptions of the options. We don't need anything beyond the JSON types (string, number, bool, object, list) because the only ambiguity we get is how do we parse command line string into the JSON type (and that ambiguity is actually

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Thomas Weise
We have been doing some end to end testing with Python and Flink (streaming). You could take a look at the following and possibly replicate it for your work: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/flink/flink_streaming_impulse.py We found that in order to get

Re: Can't define a pytype alias from Beam's PCollection type.

2018-11-08 Thread Robert Bradshaw
On Wed, Nov 7, 2018 at 10:30 PM Zach Moshe wrote: > > (Adding the public Beam-dev group) > > On Wed, Nov 7, 2018 at 2:26 PM Zach Moshe wrote: >> >> Hi, >> I've noticed that `beam.core.pvalue.PCollection` doesn't support a >> `__getitem__()` that returns a `GenericMeta` type (like regular types

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-11-08 Thread Robert Bradshaw
There are two questions here: (A) What do we do in the short term? I think adding every runner option to every SDK is not sustainable (n*m work, assuming every SDK knows about every runner), and having a patchwork of options that were added as one-offs to SDKs is not desirable either. Furthermore,

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Lukasz Cwik
This benchmark[1] shows that Python is getting about 19mb/s. Yes, running more python sdk_worker processes will improve performance since Python is limited to a single CPU core. [1] https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5652536396611584=490377658=1286539696 On

BEAM-6018: memory leak in thread pool instantiation

2018-11-08 Thread Udi Meiri
Hi, I've identified a memory leak when GcsUtil.java instantiates a ThreadPoolExecutor (https://issues.apache.org/jira/browse/BEAM-6018). The code uses the getExitingExecutorService

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Xinyu Liu
Looking at the gRPC dashboard published by the benchmark[1], it seems the streaming ping-pong operations per second for gRPC in Python is around 2k ~ 3k qps. This seems quite low compared to gRPC performance in other languages, e.g. 600k qps for Java and Go. Is it expected to run multiple

Re: How to use "PortableRunner" in Python SDK?

2018-11-08 Thread Maximilian Michels
In the long run, we should get rid of the Docker-inside-Docker approach, which was only intended for testing anyways. It would be cleaner to start the SDK harness container alongside the JobServer container. Short term, I think it should be easy to either fix the permissions of the mounted