Re: Bootstrapping Beam's Job Server

2018-08-21 Thread Henning Rohde
Agree with Luke. Perhaps something simple, prescriptive yet flexible, such
as custom command line (defined in the environment proto) rooted at the
base of the provided artifacts and either passed the same arguments or
defined in the container contract or made available through substitution.
That way, all the restrictions/assumptions of the execution environment
become implicit and runner/deployment dependent.
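As a rough illustration of that idea (every class and placeholder name below is hypothetical, not an existing Beam API), such a process environment could reduce to a command relative to the staged artifacts plus arguments with runner-side substitution:

    // Hypothetical sketch only: a command line rooted at the staged-artifact
    // directory, with runner-provided values substituted into the arguments.
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    class ProcessEnvironmentSpec {
      String command;        // relative to the artifact directory, e.g. "bin/worker"
      List<String> args;     // may contain placeholders such as ${control_endpoint}

      Process launch(File artifactDir, String workerId, String controlEndpoint)
          throws java.io.IOException {
        List<String> cmd = new ArrayList<>();
        cmd.add(new File(artifactDir, command).getAbsolutePath());
        for (String arg : args) {
          cmd.add(arg.replace("${worker_id}", workerId)
                     .replace("${control_endpoint}", controlEndpoint));
        }
        // All other assumptions (architecture, installed dependencies, networking)
        // stay implicit and runner/deployment dependent, as noted above.
        return new ProcessBuilder(cmd).directory(artifactDir).inheritIO().start();
      }
    }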


On Tue, Aug 21, 2018 at 2:12 PM Lukasz Cwik  wrote:

> I believe supporting a simple Process environment makes sense. It would be
> best if we didn't make the Process route solve all the problems that Docker
> solves for us. In my opinion we should limit the Process route to assume
> that the execution environment:
> * has all dependencies and libraries installed
> * is of a compatible machine architecture
> * doesn't require special networking rules to be setup
>
> Any other suggestions for reasonable limits on a Process environment?
>
> On Tue, Aug 21, 2018 at 2:53 AM Ismaël Mejía  wrote:
>
>> It is also worth mentioning that, apart from the testing/development use
>> case, there is also the case of supporting people running on Hadoop
>> distributions. There are two extra reasons to want a process-based
>> version: (1) some Hadoop distributions run on machines with really old
>> kernels where Docker support is limited or nonexistent (yes, some of
>> those run on kernel 2.6!) and (2) ops people may be reluctant to take on the
>> additional operational overhead of enabling Docker in their clusters.
>> On Tue, Aug 21, 2018 at 11:50 AM Maximilian Michels 
>> wrote:
>> >
>> > Thanks Henning and Thomas. It looks like
>> >
>> > a) we want to keep the Docker Job Server Docker container and rely on
>> > spinning up "sibling" SDK harness containers via the Docker socket. This
>> > should require few changes to the Runner code.
>> >
>> > b) have the InProcess SDK harness as an alternative way to running user
>> > code. This can be done independently of a).
>> >
>> > Thomas, let's sync today on the InProcess SDK harness. I've created a
>> > JIRA issue: https://issues.apache.org/jira/browse/BEAM-5187
>> >
>> > Cheers,
>> > Max
>> >
>> > On 21.08.18 00:35, Thomas Weise wrote:
>> > > The original objective was to make test/development easier (which I
>> > > think is super important for user experience with portable runner).
>> > >
>> > >  From first hand experience I can confirm that dealing with Flink
>> > > clusters and Docker containers for local setup is a significant hurdle
>> > > for Python developers.
>> > >
>> > > To simplify using Flink in embedded mode, the (direct) process based
>> SDK
>> > > harness would be a good option, especially when it can be linked to
>> the
>> > > same virtualenv that developers have already setup, eliminating extra
>> > > packaging/deployment steps.
>> > >
>> > > Max, I would be interested to sync up on what your thoughts are
>> > > regarding that option since you mention you also started to work on it
>> > > (see previous discussion [1], not sure if there is a JIRA for it yet).
>> > > Internally we are planning to use a direct SDK harness process instead
>> > > of Docker containers. For our specific needs it will work equally
>> > > well
>> > > for development and production, including future plans to deploy Flink
>> > > TMs via Kubernetes.
>> > >
>> > > Thanks,
>> > > Thomas
>> > >
>> > > [1]
>> > >
>> https://lists.apache.org/thread.html/d8b81e9f74f77d74c8b883cda80fa48efdcaf6ac2ad313c4fe68795a@%3Cdev.beam.apache.org%3E
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels > > > > wrote:
>> > >
>> > > Thanks for your suggestions. Please see below.
>> > >
>> > >  > Option 3) would be to map in the docker binary and socket to
>> allow
>> > >  > the containerized Flink job server to start "sibling"
>> containers on
>> > >  > the host.
>> > >
>> > > Do you mean packaging Docker inside the Job Server container and
>> > > mounting /var/run/docker.sock from the host inside the container?
>> That
>> > > looks like a bit of a hack but for testing it could be fine.
>> > >
>> > >  > notably, if the runner supports auto-scaling or similar
>> non-trivial
>> > >  > configurations, that would be difficult to manage from the SDK
>> side.
>> > >
>> > > You're right, it would be unfortunate if the SDK would have to
>> deal with
>> > > spinning up SDK harness/backend containers. For non-trivial
>> > > configurations it would probably require an extended protocol.
>> > >
>> > >  > Option 4) We are also thinking about adding process based
>> SDKHarness.
>> > >  > This will avoid docker in docker scenario.
>> > >
>> > > Actually, I had started implementing a process-based SDK harness
>> but
>> > > figured it might be impractical because it doubles the execution
>> path
>> > > for UDF code and potentially doesn't work with custom
>> dependencies.
>> > >
>> > >  > 

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-21 Thread Lukasz Cwik
Henning, can you clarify what you mean by sending non-executable bundles
to the SDK harness and how it is useful for Flink?

On Tue, Aug 21, 2018 at 2:01 PM Henning Rohde  wrote:

> I think it will be useful to the runner to know upfront what the
> fundamental threading capabilities are for the SDK harness (say, "fixed",
> "linear", "dynamic", ..) so that the runner can upfront make a good static
> decision on #harnesses and how many resources they should each have. It's
> wasteful to give the Foo SDK a whole many-core machine with TBs of memory,
> if it can only support a single bundle at a time. I think this is also in
> line with what Thomas and Luke are suggesting.
>
> However, it still seems to me to be a semantically problematic idea to
> send non-executable bundles to the SDK harness. I understand it's useful
> for Flink, but is that really the best path forward?
>
>
>
> On Mon, Aug 20, 2018 at 5:44 PM Ankur Goenka  wrote:
>
>> That's right.
>> To add to it: we added multi-threading to Python streaming as a single
>> thread is suboptimal for the streaming use case.
>> Shall we move towards a conclusion on the SDK bundle processing upper
>> bound?
>>
>> On Mon, Aug 20, 2018 at 1:54 PM Lukasz Cwik  wrote:
>>
>>> Ankur, I can see where you are going with your argument. I believe there
>>> is certain information which is static and won't change at pipeline
>>> creation time (such as the Python SDK being most efficient doing one bundle
>>> at a time) and some information which is best determined at runtime, like
>>> memory and CPU limits, or worker count.
>>>
>>> On Mon, Aug 20, 2018 at 1:47 PM Ankur Goenka  wrote:
>>>
 I would prefer to keep it dynamic as it can be changed by the
 infrastructure or the pipeline author.
 For example, in the case of Python, the number of concurrent bundles can be
 changed by setting the pipeline option worker_count, and for Java it can be
 computed based on the CPUs on the machine.
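A small sketch of that Java-side computation (illustrative only; the override parameter just mirrors the worker_count idea above and is not an existing Java option):

    // Sketch: default the number of concurrent bundles to the machine's CPU count,
    // with an optional override (name borrowed from the Python worker_count idea).
    class BundleConcurrency {
      static int concurrentBundles(Integer workerCountOverride) {
        if (workerCountOverride != null && workerCountOverride > 0) {
          return workerCountOverride;
        }
        return Runtime.getRuntime().availableProcessors();
      }
    }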

 For the Flink runner, we can use the worker_count parameter for now to
 increase the parallelism. And we can have one container for each mapPartition
 task on Flink while reusing containers, as container creation is expensive,
 especially for Python where it installs a bunch of dependencies. There is one
 caveat though: I have seen machines crash because of too many simultaneous
 container creations. We can rate-limit container creation in the code to
 avoid this.

 Thanks,
 Ankur

 On Mon, Aug 20, 2018 at 9:20 AM Lukasz Cwik  wrote:

> +1 on making the resources part of a proto. Based upon what Henning
> linked to, the provisioning API seems like an appropriate place to provide
> this information.
>
> Thomas, I believe the environment proto is the best place to add
> information that a runner may want to know about upfront during pipeline
> creation. I wouldn't stick this into PipelineOptions for the long
> term.
> If you have time to capture these thoughts and update the environment
> proto, I would suggest going down that path. Otherwise anything short term
> like PipelineOptions will do.
>
> On Sun, Aug 19, 2018 at 5:41 PM Thomas Weise  wrote:
>
>> For SDKs where the upper limit is constant and known upfront, why not
>> communicate this along with the other harness resource info as part of 
>> the
>> job submission?
>>
>> Regarding use of GRPC headers: Why not make this explicit in the
>> proto instead?
>>
>> WRT runner dictating resource constraints: The runner actually may
>> also not have that information. It would need to be supplied as part of 
>> the
>> pipeline options? The cluster resource manager needs to allocate 
>> resources
>> for both, the runner and the SDK harness(es).
>>
>> Finally, what can be done to unblock the Flink runner / Python until
>> solution discussed here is in place? An extra runner option for SDK
>> singleton on/off?
>>
>>
>> On Sat, Aug 18, 2018 at 1:34 AM Ankur Goenka 
>> wrote:
>>
>>> Sounds good to me.
>>> GRPC Header of the control channel seems to be a good place to add
>>> upper bound information.
>>> Added jiras:
>>> https://issues.apache.org/jira/browse/BEAM-5166
>>> https://issues.apache.org/jira/browse/BEAM-5167
>>>
>>> On Fri, Aug 17, 2018 at 10:51 PM Henning Rohde 
>>> wrote:
>>>
 Regarding resources: the runner can currently dictate the
 mem/cpu/disk resources that the harness is allowed to use via the
 provisioning api. The SDK harness need not -- and should not -- 
 speculate
 on what else might be running on the machine:


 https://github.com/apache/beam/blob/0e14965707b5d48a3de7fa69f09d88ef0aa48c09/model/fn-execution/src/main/proto/beam_provision_api.proto#L69

 A realistic startup-time computation in the SDK harness would be
 something 

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-21 Thread Henning Rohde
I think it will be useful to the runner to know upfront what the
fundamental threading capabilities are for the SDK harness (say, "fixed",
"linear", "dynamic", ..) so that the runner can upfront make a good static
decision on #harnesses and how many resources they should each have. It's
wasteful to give the Foo SDK a whole many-core machine with TBs of memory,
if it can only support a single bundle at a time. I think this is also in
line with what Thomas and Luke are suggesting.
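To make that concrete, here is a small sketch of how a runner could turn such a declared capability into a static sizing decision (the enum and the numbers are purely illustrative, not part of any proposed API):

    // Illustrative only: a coarse threading capability declared by the SDK harness,
    // and a runner-side decision on how many harnesses to start per machine.
    enum ThreadingCapability { SINGLE_BUNDLE, FIXED, LINEAR, DYNAMIC }

    class HarnessPlanner {
      static int harnessesPerMachine(ThreadingCapability capability, int cores) {
        switch (capability) {
          case SINGLE_BUNDLE:
            return cores;   // e.g. one single-threaded harness per core, small memory each
          case FIXED:
          case LINEAR:
          case DYNAMIC:
          default:
            return 1;       // one harness that can scale its own threads to the machine
        }
      }
    }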

However, it still seems to me to be a semantically problematic idea to send
non-executable bundles to the SDK harness. I understand it's useful for
Flink, but is that really the best path forward?



On Mon, Aug 20, 2018 at 5:44 PM Ankur Goenka  wrote:

> That's right.
> To add to it: we added multi-threading to Python streaming as a single
> thread is suboptimal for the streaming use case.
> Shall we move towards a conclusion on the SDK bundle processing upper
> bound?
>
> On Mon, Aug 20, 2018 at 1:54 PM Lukasz Cwik  wrote:
>
>> Ankur, I can see where you are going with your argument. I believe there
>> is certain information which is static and won't change at pipeline
>> creation time (such as the Python SDK being most efficient doing one bundle
>> at a time) and some information which is best determined at runtime, like
>> memory and CPU limits, or worker count.
>>
>> On Mon, Aug 20, 2018 at 1:47 PM Ankur Goenka  wrote:
>>
>>> I would prefer to keep it dynamic as it can be changed by the
>>> infrastructure or the pipeline author.
>>> For example, in the case of Python, the number of concurrent bundles can be
>>> changed by setting the pipeline option worker_count, and for Java it can be
>>> computed based on the CPUs on the machine.
>>>
>>> For the Flink runner, we can use the worker_count parameter for now to
>>> increase the parallelism. And we can have one container for each mapPartition
>>> task on Flink while reusing containers, as container creation is expensive,
>>> especially for Python where it installs a bunch of dependencies. There is one
>>> caveat though: I have seen machines crash because of too many simultaneous
>>> container creations. We can rate-limit container creation in the code to
>>> avoid this.
>>>
>>> Thanks,
>>> Ankur
>>>
>>> On Mon, Aug 20, 2018 at 9:20 AM Lukasz Cwik  wrote:
>>>
 +1 on making the resources part of a proto. Based upon what Henning
 linked to, the provisioning API seems like an appropriate place to provide
 this information.

 Thomas, I believe the environment proto is the best place to add
 information that a runner may want to know about upfront during pipeline
 creation. I wouldn't stick this into PipelineOptions for the long
 term.
 If you have time to capture these thoughts and update the environment
 proto, I would suggest going down that path. Otherwise anything short term
 like PipelineOptions will do.

 On Sun, Aug 19, 2018 at 5:41 PM Thomas Weise  wrote:

> For SDKs where the upper limit is constant and known upfront, why not
> communicate this along with the other harness resource info as part of the
> job submission?
>
> Regarding use of GRPC headers: Why not make this explicit in the proto
> instead?
>
> WRT runner dictating resource constraints: The runner actually may
> also not have that information. It would need to be supplied as part of 
> the
> pipeline options? The cluster resource manager needs to allocate resources
> for both, the runner and the SDK harness(es).
>
> Finally, what can be done to unblock the Flink runner / Python until
> solution discussed here is in place? An extra runner option for SDK
> singleton on/off?
>
>
> On Sat, Aug 18, 2018 at 1:34 AM Ankur Goenka 
> wrote:
>
>> Sounds good to me.
>> GRPC Header of the control channel seems to be a good place to add
>> upper bound information.
>> Added jiras:
>> https://issues.apache.org/jira/browse/BEAM-5166
>> https://issues.apache.org/jira/browse/BEAM-5167
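For illustration, attaching such a header with the io.grpc Java API could look like the sketch below (the header name is invented here; Beam does not define such a key today):

    // Sketch: carry the bundle upper bound as a gRPC header on the control channel.
    import io.grpc.Metadata;

    class ControlChannelHeaders {
      // Hypothetical header name, for illustration only.
      static final Metadata.Key<String> MAX_BUNDLES =
          Metadata.Key.of("beam-max-concurrent-bundles", Metadata.ASCII_STRING_MARSHALLER);

      static Metadata forLimit(int maxConcurrentBundles) {
        Metadata headers = new Metadata();
        headers.put(MAX_BUNDLES, Integer.toString(maxConcurrentBundles));
        // Attach to a channel/stub, e.g. with io.grpc.stub.MetadataUtils.newAttachHeadersInterceptor.
        return headers;
      }
    }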
>>
>> On Fri, Aug 17, 2018 at 10:51 PM Henning Rohde 
>> wrote:
>>
>>> Regarding resources: the runner can currently dictate the
>>> mem/cpu/disk resources that the harness is allowed to use via the
>>> provisioning api. The SDK harness need not -- and should not -- 
>>> speculate
>>> on what else might be running on the machine:
>>>
>>>
>>> https://github.com/apache/beam/blob/0e14965707b5d48a3de7fa69f09d88ef0aa48c09/model/fn-execution/src/main/proto/beam_provision_api.proto#L69
>>>
>>> A realistic startup-time computation in the SDK harness would be
 something simple like: max(1, min(cpu*100, mem_mb/10)), say, and use at most
 that number of threads. Or just hardcode it to 300. Or use a user-provided value.
 Whatever the value is, it is the maximum number of bundles in flight allowed at
 any given time and needs to be communicated to the runner via 
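Written out, that heuristic is just the following (hypothetical helper; the CPU and memory figures would come from the provisioning API resources linked above):

    // Sketch of the startup-time bound: max(1, min(cpu*100, mem_mb/10)),
    // with a user-provided value taking precedence when present.
    class BundleBound {
      static int maxBundlesInFlight(int cpus, long memMb, Integer userOverride) {
        if (userOverride != null) {
          return userOverride;
        }
        return (int) Math.max(1, Math.min(cpus * 100L, memMb / 10));
      }
    }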

Re: Beam application upgrade on Flink crashes

2018-08-21 Thread Stephan Ewen
Flink 1.7 will change the way the "restore serializer" is handled, which
should make it much easier to handle such cases.
In particular, breaking the Java class version format will not be an issue anymore.

That should help make it easier to give the Beam-on-Flink runner cross-version
compatibility.


On Mon, Aug 20, 2018 at 6:46 PM, Maximilian Michels  wrote:

> AFAIK the serializer used here is the CoderTypeSerializer which may not
> be recoverable because of changes to the contained Coder
> (TaggedKvCoder). It doesn't currently have a serialVersionUID, so even
> small changes could break serialization backwards-compatibility.
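For reference, pinning the serialized identity of a Serializable class is a one-line change (generic illustration, not the actual CoderTypeSerializer source):

    // Generic example: without an explicit serialVersionUID, the JVM derives one from
    // the class shape, so small, otherwise-compatible edits break Java deserialization.
    import java.io.Serializable;

    class SomeTypeSerializer implements Serializable {
      private static final long serialVersionUID = 1L;
    }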
>
> As of now Beam doesn't offer the same upgrade guarantees as Flink [1].
> This should be improved for the next release.
>
> Thanks,
> Max
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table
>
> On 20.08.18 17:46, Stephan Ewen wrote:
> > Hi Jozef!
> >
> > When restoring state, the serializer that created the state must still
> > be available, so the state can be read.
> > It looks like some serializer classes were removed between Beam versions
> > (or changed in an incompatible manner).
> >
> > Backwards compatibility of an operator implementation needs cooperation
> > from the operator. Within Flink itself, when we change the way an
> > operator uses state, we keep the old codepath and classes in a
> > "backwards compatibility restore" that takes the old state and brings it
> > into the shape of the new state.
> >
> > I am not deeply into the details of how Beam and the Flink runner implement
> > their use of state, but it looks like this part is not present, which could
> > mean that savepoints taken from Beam applications are not backwards
> > compatible.
> >
> >
> > On Mon, Aug 20, 2018 at 4:03 PM, Jozef Vilcek  > > wrote:
> >
> > Hello,
> >
> > I am attempting to upgrade a Beam app from 2.5.0 running on Flink
> > 1.4.0 to Beam 2.6.0 running on Flink 1.5.0. I am not aware of any
> > state migration changes needed for Flink 1.4.0 -> 1.5.0, so I am just
> > starting a new app with updated libs from a Flink savepoint captured
> > by the previous version of the app.
> >
> > There is no change in topology. The job is accepted without error by
> > the new cluster, which suggests that all operators are matched with
> > state based on IDs. However, the app runs only a few seconds and then
> > crashes with:
> >
> > java.lang.Exception: Exception while creating StreamOperatorStateContext.
> >   at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
> >   at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:227)
> >   at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:730)
> >   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:295)
> >   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
> >   at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.flink.util.FlinkException: Could not restore operator state backend for DoFnOperator_43996aa2908fa46bb50160f751f8cc09_(1/100) from any of the 1 provided restore options.
> >   at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> >   at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:240)
> >   at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:139)
> >   ... 5 more
> > Caused by: java.io.IOException: Unable to restore operator state [bundle-buffer-tag]. The previous serializer of the operator state must be present; the serializer could have been removed from the classpath, or its implementation have changed and could not be loaded. This is a temporary restriction that will be fixed in future versions.
> >   at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:514)
> >   at org.apache.flink.runtime.state.DefaultOperatorStateBackend.restore(DefaultOperatorStateBackend.java:63)
> >   at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> >   at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> >   ... 7 more
> >
> >
> > Does this mean anything to anyone? Am I doing anything wrong or did
> > FlinkRunner change in some way? The mentioned "bundle-buffer-tag"
> > seems to be too deep inside the runner internals for me to reach.
> >
> > Any help is much appreciated.

Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Henning Rohde
By "enum" in quotes, I meant the usual open URN style pattern not an actual
enum. Sorry if that wasn't clear.

On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik  wrote:

> I would model the environment to be more free-form than enums so that we
> have forward-looking extensibility, and would suggest following the same
> pattern we use on PTransforms (using a URN and a URN-specific payload).
> Note that in this case we may want to support a list of supported
> environments (e.g. java, docker, python, ...).
>
> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde  wrote:
>
>> One thing to consider is something we've talked about in the past: it might make
>> sense to extend the environment proto and have the SDK be explicit about
>> which kinds of environment it supports:
>>
>>
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>
>> This choice might impact what files are staged or what not. Some SDKs,
>> such as Go, provide a compiled binary and _need_ to know what the target
>> architecture is. Running on a mac with and without docker, say, requires a
>> different worker in each case. If we add an "enum", we can also easily add
>> the external idea where the SDK/user starts the SDK harnesses instead of
>> the runner. Each runner may not support all types of environments.
>>
>> Henning
>>
>> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels 
>> wrote:
>>
>>> For reference, here is corresponding JIRA issue for this thread:
>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>
>>> On 16.08.18 11:15, Maximilian Michels wrote:
>>> > Makes sense to have an option to run the SDK harness in a
>>> non-dockerized
>>> > environment.
>>> >
>>> > I'm in the process of creating a Docker entry point for Flink's
>>> > JobServer[1]. I suppose you would also prefer to execute that one
>>> > standalone. We should make sure this is also an option.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>> >
>>> > On 16.08.18 07:42, Thomas Weise wrote:
>>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>>> >> into the Docker container would need to be pre-installed in the host
>>> >> environment. In the case of Python SDK, that could simply mean a
>>> >> (frozen) virtual environment that was setup when the host was
>>> >> provisioned - the SDK harness process(es) will then just utilize that.
>>> >> Of course this flavor of SDK harness execution could also be useful in
>>> >> the local development environment, where right now someone who already
>>> >> has the Python environment needs to also install Docker and update a
>>> >> container to launch a Python SDK pipeline on the Flink runner.
>>> >>
>>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
>>> danolive...@google.com
>>> >> > wrote:
>>> >>
>>> >>  I just want to clarify that I understand this correctly since I'm
>>> >>  not that familiar with the details behind all these execution
>>> >>  environments yet. Is the proposal to create a new
>>> JobBundleFactory
>>> >>  that instead of using Docker to create the environment that the
>>> new
>>> >>  processes will execute in, this JobBundleFactory would execute
>>> the
>>> >>  new processes directly in the host environment? So in practice
>>> if I
>>> >>  ran a pipeline with this JobBundleFactory the SDK Harness and
>>> Runner
>>> >>  Harness would both be executing directly on my machine and would
>>> >>  depend on me having the dependencies already present on my
>>> machine?
>>> >>
>>> >>  On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka >> >>  > wrote:
>>> >>
>>> >>  Thanks for starting the discussion. I will be happy to help.
>>> >>  I agree, we should have a pluggable SDKHarness environment factory.
>>> >>  We can register multiple environment factories using a service
>>> >>  registry and use the PipelineOptions to pick the right one on a
>>> >>  per-job basis.
>>> >>
>>> >>  There are a couple of things which need to be set up before
>>> >>  launching the process.
>>> >>
>>> >>* Setting up the environment as done in boot.go [4]
>>> >>* Retrieving and putting the artifacts in the right
>>> location.
>>> >>
>>> >>  You can probably leverage boot.go code to setup the
>>> environment.
>>> >>
>>> >>  Also, it will be useful to enumerate pros and cons of
>>> different
>>> >>  Environments to help users choose the right one.
>>> >>
>>> >>
>>> >>  On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise >> >>  > wrote:
>>> >>
>>> >>  Hi,
>>> >>
>>> >>  Currently the portable Flink runner only works with SDK
>>> >>  Docker containers for execution (DockerJobBundleFactory,
>>> >>  besides an in-process (embedded) factory option for
>>> 

Re: Bug or confusing python code? Are these the same element count metrics?

2018-08-21 Thread Robert Bradshaw
On Tue, Aug 21, 2018 at 2:05 AM Alex Amato  wrote:
>
> I discovered something while trying to update test_progress_metrics in 
> fn_api_runner_tests.py to inspect the returned MonitoringInfos in addition to 
> the already returned metrics format.
>
> This metric appears to be added twice using the None tag (but overwrites a 
> previous one). I am not sure if it's intentional or not. Please let me know if 
> this is intentionally overwriting what is supposed to be the same metric, or 
> if something might be wrong here.
>
> See the use of element count metrics:
>
> Here the metric is added using the self.tagged_receivers tags in the
> DoOperation to add the element count metric. This can be the value 'None'
> Here the ONLY_OUTPUT tag is used and overridden later.
>
> Then fix_output_tags in bundle_processor.py assigns the tag, which in this
> case is None again

There is a TODO about plumbing the names to the right place that could
avoid the use of ONLY_OUTPUT altogether. (Given that all but one
operation has multiple outputs, maybe that's not worth it though.)

> When the second instance of the metric is added it gets overwritten in the 
> output_element_counts (because it uses the same key). Is it intentional to 
> overwrite the metric?

Is this because ONLY_OUTPUT is added generically (when there's one
output) and then the subclass uses self.tagged_receivers to do it
"properly" later?

> I discovered that the metric was created twice, because I am not using a map
> of tags; I am just adding another entry when the metric is added as a
> monitoring_info a second time.

We should probably clarify what the contract is here. I think I was
assuming a "create or return" semantics when you get a metric for a
fully qualified name.

> So if this is intentional, then I need to make my code do the equivalent 
> thing, and check that there is already a MonitoringInfo for the metric and 
> update its value (or assert it is the same value).
>
> Also, is it intentional to use None as a tag name here? Seems like an odd 
> choice.

None (kind of like Java's NULL) is used when an output name is not
provided. The protos only accept strings here though, hence 'None'.


Re: Travis apache credentials

2018-08-21 Thread Robert Bradshaw
I was imagining the signing itself would still be manual. (Frankly, I
would feel odd having travis sign them for me...)
On Mon, Aug 20, 2018 at 10:20 PM Lukasz Cwik  wrote:
>
> If you can't get an answer quickly, it's best to read the Apache policy on 
> release signing: http://www.apache.org/dev/release-signing.html
>
> On Mon, Aug 20, 2018 at 10:16 AM Pablo Estrada  wrote:
>>
>> This would mean that released artifacts are signed with two different keys 
>> (wheels with travis / jars and others with release manager's). Is this 
>> consistent with Apache policy?
>> Just checking : )
>> -P.
>>
>> On Mon, Aug 20, 2018 at 9:47 AM Boyuan Zhang  wrote:
>>>
>>> Hey Robert,
>>>
>>> I think your idea would be possible if following things are possible:
>>> 1. Link beam-wheels into travis-cli. I'm not sure who has the right 
>>> permission to perform this operation.
>>> 2. We have a common svn credential or one of the beam committers would like 
>>> to put his(or her) credential into beam-wheels.
>>>
>>> If we can configure these 2 things, then new commit pushed into beam-wheels 
>>> can trigger a new build.
>>>
>>> Boyuan
>>>
>>> On Mon, Aug 20, 2018 at 2:46 AM Robert Bradshaw  wrote:

 Boyuan set up a nice repository for building Python wheels at
 https://github.com/apache/beam-wheels . Does anyone know if it would
 be possible to get SVN credentials for travis so every user wouldn't
 have to fork the repository and put their own in?


Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Lukasz Cwik
I would model the environment to be more free-form than enums so that we
have forward-looking extensibility, and would suggest following the same
pattern we use on PTransforms (using a URN and a URN-specific payload).
Note that in this case we may want to support a list of supported
environments (e.g. java, docker, python, ...).
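As a rough sketch of that pattern (the URN strings and the class are invented for illustration, not actual Beam model code):

    // Illustration of the open URN + payload pattern applied to environments.
    import java.util.Set;

    class EnvironmentSpec {
      String urn;       // e.g. "beam:env:docker:v1" or "beam:env:process:v1" (made-up URNs)
      byte[] payload;   // URN-specific payload: an image name, a command line, etc.

      boolean supportedBy(Set<String> runnerSupportedUrns) {
        return runnerSupportedUrns.contains(urn);   // runners reject environments they cannot run
      }
    }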

On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde  wrote:

> One thing to consider is something we've talked about in the past: it might make
> sense to extend the environment proto and have the SDK be explicit about
> which kinds of environment it supports:
>
>
> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>
> This choice might impact what files are staged or what not. Some SDKs,
> such as Go, provide a compiled binary and _need_ to know what the target
> architecture is. Running on a mac with and without docker, say, requires a
> different worker in each case. If we add an "enum", we can also easily add
> the external idea where the SDK/user starts the SDK harnesses instead of
> the runner. Each runner may not support all types of environments.
>
> Henning
>
> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels  wrote:
>
>> For reference, here is corresponding JIRA issue for this thread:
>> https://issues.apache.org/jira/browse/BEAM-5187
>>
>> On 16.08.18 11:15, Maximilian Michels wrote:
>> > Makes sense to have an option to run the SDK harness in a non-dockerized
>> > environment.
>> >
>> > I'm in the process of creating a Docker entry point for Flink's
>> > JobServer[1]. I suppose you would also prefer to execute that one
>> > standalone. We should make sure this is also an option.
>> >
>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>> >
>> > On 16.08.18 07:42, Thomas Weise wrote:
>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>> >> into the Docker container would need to be pre-installed in the host
>> >> environment. In the case of Python SDK, that could simply mean a
>> >> (frozen) virtual environment that was setup when the host was
>> >> provisioned - the SDK harness process(es) will then just utilize that.
>> >> Of course this flavor of SDK harness execution could also be useful in
>> >> the local development environment, where right now someone who already
>> >> has the Python environment needs to also install Docker and update a
>> >> container to launch a Python SDK pipeline on the Flink runner.
>> >>
>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
>> danolive...@google.com
>> >> > wrote:
>> >>
>> >>  I just want to clarify that I understand this correctly since I'm
>> >>  not that familiar with the details behind all these execution
>> >>  environments yet. Is the proposal to create a new JobBundleFactory
>> >>  that instead of using Docker to create the environment that the
>> new
>> >>  processes will execute in, this JobBundleFactory would execute the
>> >>  new processes directly in the host environment? So in practice if
>> I
>> >>  ran a pipeline with this JobBundleFactory the SDK Harness and
>> Runner
>> >>  Harness would both be executing directly on my machine and would
>> >>  depend on me having the dependencies already present on my
>> machine?
>> >>
>> >>  On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka > >>  > wrote:
>> >>
>> >>  Thanks for starting the discussion. I will be happy to help.
>> >>  I agree, we should have a pluggable SDKHarness environment factory.
>> >>  We can register multiple environment factories using a service
>> >>  registry and use the PipelineOptions to pick the right one on a
>> >>  per-job basis.
>> >>
>> >>  There are a couple of things which need to be set up before
>> >>  launching the process.
>> >>
>> >>* Setting up the environment as done in boot.go [4]
>> >>* Retrieving and putting the artifacts in the right
>> location.
>> >>
>> >>  You can probably leverage boot.go code to setup the
>> environment.
>> >>
>> >>  Also, it will be useful to enumerate pros and cons of
>> different
>> >>  Environments to help users choose the right one.
>> >>
>> >>
>> >>  On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise > >>  > wrote:
>> >>
>> >>  Hi,
>> >>
>> >>  Currently the portable Flink runner only works with SDK
>> >>  Docker containers for execution (DockerJobBundleFactory,
>> >>  besides an in-process (embedded) factory option for
>> testing
>> >>  [1]). I'm considering adding another out of process
>> >>  JobBundleFactory implementation that directly forks the
>> >>  processes on the task manager host, eliminating the need
>> for
>> >>  Docker. This would work reasonably well 

Re: Beam Summit London 2018

2018-08-21 Thread javier ramirez
Hi,

What'd be the duration of the talks? So I can scope the contents of my
proposal.

Looking forward to the summit!

J

On Tue, 21 Aug 2018, 14:47 Pascal Gula,  wrote:

> Hi Matthias,
> we (Peat / Plantix) might be interested in submitting a talk and I would
> like to know if we can get access to the list of already submitted talk titles
> to avoid submitting on a similar topic!
> Cheers,
> Pascal
>
> On Tue, Aug 21, 2018 at 1:59 PM, Matthias Baetens <
> baetensmatth...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We are happy to invite you to the first Beam Summit in London.
>>
>> The summit will be held in London at Level39
>>  on *October 1 and 2.*
>> You can register to attend for free on the Eventbrite page
>> 
>> .
>>
>> If you are interested in talking, please check our CfP form
>>  and submit a talk!
>>
>> If you or your company is interested in helping out or sponsoring the
>> summit (to keep it free), you can check out the sponsor booklet
>> 
>> .
>>
>> We will soon launch a blogpost with more details and announce the agenda
>> closer to date.
>>
>> Thanks to everyone who helped make this happen, looking forward to
>> welcoming you all in London!
>>
>> The Events & Meetups Group
>>
>> --
>>
>>
>
>
>
> --
>
> Pascal Gula
> Senior Data Engineer / Scientist
> +49 (0)176 34232684 | www.plantix.net 
>  PEAT GmbH
> Kastanienallee 4
> 10435 Berlin // Germany
>  Download 
> the App! 
>
>


Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Henning Rohde
One thing to consider is something we've talked about in the past: it might make
sense to extend the environment proto and have the SDK be explicit about
which kinds of environment it supports:


https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969

This choice might impact what files are staged or what not. Some SDKs, such
as Go, provide a compiled binary and _need_ to know what the target
architecture is. Running on a mac with and without docker, say, requires a
different worker in each case. If we add an "enum", we can also easily add
the external idea where the SDK/user starts the SDK harnesses instead of
the runner. Each runner may not support all types of environments.

Henning

On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels  wrote:

> For reference, here is corresponding JIRA issue for this thread:
> https://issues.apache.org/jira/browse/BEAM-5187
>
> On 16.08.18 11:15, Maximilian Michels wrote:
> > Makes sense to have an option to run the SDK harness in a non-dockerized
> > environment.
> >
> > I'm in the process of creating a Docker entry point for Flink's
> > JobServer[1]. I suppose you would also prefer to execute that one
> > standalone. We should make sure this is also an option.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-4130
> >
> > On 16.08.18 07:42, Thomas Weise wrote:
> >> Yes, that's the proposal. Everything that would otherwise be packaged
> >> into the Docker container would need to be pre-installed in the host
> >> environment. In the case of Python SDK, that could simply mean a
> >> (frozen) virtual environment that was setup when the host was
> >> provisioned - the SDK harness process(es) will then just utilize that.
> >> Of course this flavor of SDK harness execution could also be useful in
> >> the local development environment, where right now someone who already
> >> has the Python environment needs to also install Docker and update a
> >> container to launch a Python SDK pipeline on the Flink runner.
> >>
> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
> danolive...@google.com
> >> > wrote:
> >>
> >>  I just want to clarify that I understand this correctly since I'm
> >>  not that familiar with the details behind all these execution
> >>  environments yet. Is the proposal to create a new JobBundleFactory
> >>  that instead of using Docker to create the environment that the new
> >>  processes will execute in, this JobBundleFactory would execute the
> >>  new processes directly in the host environment? So in practice if I
> >>  ran a pipeline with this JobBundleFactory the SDK Harness and
> Runner
> >>  Harness would both be executing directly on my machine and would
> >>  depend on me having the dependencies already present on my machine?
> >>
> >>  On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka  >>  > wrote:
> >>
> >>  Thanks for starting the discussion. I will be happy to help.
> >>  I agree, we should have a pluggable SDKHarness environment factory.
> >>  We can register multiple environment factories using a service
> >>  registry and use the PipelineOptions to pick the right one on a
> >>  per-job basis.
> >>
> >>  There are a couple of things which need to be set up before
> >>  launching the process.
> >>
> >>* Setting up the environment as done in boot.go [4]
> >>* Retrieving and putting the artifacts in the right location.
> >>
> >>  You can probably leverage boot.go code to setup the
> environment.
> >>
> >>  Also, it will be useful to enumerate pros and cons of different
> >>  Environments to help users choose the right one.
> >>
> >>
> >>  On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise  >>  > wrote:
> >>
> >>  Hi,
> >>
> >>  Currently the portable Flink runner only works with SDK
> >>  Docker containers for execution (DockerJobBundleFactory,
> >>  besides an in-process (embedded) factory option for testing
> >>  [1]). I'm considering adding another out of process
> >>  JobBundleFactory implementation that directly forks the
> >>  processes on the task manager host, eliminating the need
> for
> >>  Docker. This would work reasonably well in environments
> >>  where the dependencies (in this case Python) can easily be
> >>  tied into the host deployment (also within an application
> >>  specific Kubernetes pod).
> >>
> >>  There was already some discussion about alternative
> >>  JobBundleFactory implementation in [2]. There is also a
> JIRA
> >>  to make the bundle factory pluggable [3], pending
> >>  availability of runner level options.
> >>
> >>  

Re: Beam Summit London 2018

2018-08-21 Thread Pascal Gula
Hi Matthias,
we (Peat / Plantix) might be interested in submitting a talk and I would
like to know if we can get access to the list of already submitted talk titles
to avoid submitting on a similar topic!
Cheers,
Pascal

On Tue, Aug 21, 2018 at 1:59 PM, Matthias Baetens  wrote:

> Hi everyone,
>
> We are happy to invite you to the first Beam Summit in London.
>
> The summit will be held in London at Level39
>  on *October 1 and 2.*
> You can register to attend for free on the Eventbrite page
> 
> .
>
> If you are interested in talking, please check our CfP form
>  and submit a talk!
>
> If you or your company is interested in helping out or sponsoring the
> summit (to keep it free), you can check out the sponsor booklet
> 
> .
>
> We will soon launch a blogpost with more details and announce the agenda
> closer to date.
>
> Thanks to everyone who helped make this happen, looking forward to
> welcoming you all in London!
>
> The Events & Meetups Group
>
> --
>
>



-- 

Pascal Gula
Senior Data Engineer / Scientist
+49 (0)176 34232684 | www.plantix.net 
 PEAT GmbH
Kastanienallee 4
10435 Berlin // Germany
 Download
the App! 


Beam Summit London 2018

2018-08-21 Thread Matthias Baetens
Hi everyone,

We are happy to invite you to the first Beam Summit in London.

The summit will be held in London at Level39
 on *October 1 and 2.*
You can register to attend for free on the Eventbrite page
.

If you are interested in talking, please check our CfP form
 and submit a talk!

If you or your company is interested in helping out or sponsoring the
summit (to keep it free), you can check out the sponsor booklet

.

We will soon launch a blogpost with more details and announce the agenda
closer to date.

Thanks to everyone who helped make this happen, looking forward to
welcoming you all in London!

The Events & Meetups Group

--


Re: Bootstrapping Beam's Job Server

2018-08-21 Thread Ismaël Mejía
It is also worth mentioning that, apart from the testing/development use
case, there is also the case of supporting people running on Hadoop
distributions. There are two extra reasons to want a process-based
version: (1) some Hadoop distributions run on machines with really old
kernels where Docker support is limited or nonexistent (yes, some of
those run on kernel 2.6!) and (2) ops people may be reluctant to take on the
additional operational overhead of enabling Docker in their clusters.
On Tue, Aug 21, 2018 at 11:50 AM Maximilian Michels  wrote:
>
> Thanks Henning and Thomas. It looks like
>
> a) we want to keep the Docker Job Server Docker container and rely on
> spinning up "sibling" SDK harness containers via the Docker socket. This
> should require few changes to the Runner code.
>
> b) have the InProcess SDK harness as an alternative way to running user
> code. This can be done independently of a).
>
> Thomas, let's sync today on the InProcess SDK harness. I've created a
> JIRA issue: https://issues.apache.org/jira/browse/BEAM-5187
>
> Cheers,
> Max
>
> On 21.08.18 00:35, Thomas Weise wrote:
> > The original objective was to make test/development easier (which I
> > think is super important for user experience with portable runner).
> >
> >  From first hand experience I can confirm that dealing with Flink
> > clusters and Docker containers for local setup is a significant hurdle
> > for Python developers.
> >
> > To simplify using Flink in embedded mode, the (direct) process based SDK
> > harness would be a good option, especially when it can be linked to the
> > same virtualenv that developers have already setup, eliminating extra
> > packaging/deployment steps.
> >
> > Max, I would be interested to sync up on what your thoughts are
> > regarding that option since you mention you also started to work on it
> > (see previous discussion [1], not sure if there is a JIRA for it yet).
> > Internally we are planning to use a direct SDK harness process instead
> > of Docker containers. For our specific needs it will work equally well
> > for development and production, including future plans to deploy Flink
> > TMs via Kubernetes.
> >
> > Thanks,
> > Thomas
> >
> > [1]
> > https://lists.apache.org/thread.html/d8b81e9f74f77d74c8b883cda80fa48efdcaf6ac2ad313c4fe68795a@%3Cdev.beam.apache.org%3E
> >
> >
> >
> >
> >
> >
> > On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels  > > wrote:
> >
> > Thanks for your suggestions. Please see below.
> >
> >  > Option 3) would be to map in the docker binary and socket to allow
> >  > the containerized Flink job server to start "sibling" containers on
> >  > the host.
> >
> > Do you mean packaging Docker inside the Job Server container and
> > mounting /var/run/docker.sock from the host inside the container? That
> > looks like a bit of a hack but for testing it could be fine.
> >
> >  > notably, if the runner supports auto-scaling or similar non-trivial
> >  > configurations, that would be difficult to manage from the SDK side.
> >
> > You're right, it would be unfortunate if the SDK would have to deal with
> > spinning up SDK harness/backend containers. For non-trivial
> > configurations it would probably require an extended protocol.
> >
> >  > Option 4) We are also thinking about adding process based SDKHarness.
> >  > This will avoid docker in docker scenario.
> >
> > Actually, I had started implementing a process-based SDK harness but
> > figured it might be impractical because it doubles the execution path
> > for UDF code and potentially doesn't work with custom dependencies.
> >
> >  > Process based SDKHarness also has other applications and might be
> >  > desirable in some of the production use cases.
> >
> > True. Some users might want something more lightweight.
> >
>
> --
> Max


Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Maximilian Michels
For reference, here is corresponding JIRA issue for this thread: 
https://issues.apache.org/jira/browse/BEAM-5187


On 16.08.18 11:15, Maximilian Michels wrote:

Makes sense to have an option to run the SDK harness in a non-dockerized
environment.

I'm in the process of creating a Docker entry point for Flink's
JobServer[1]. I suppose you would also prefer to execute that one
standalone. We should make sure this is also an option.

[1] https://issues.apache.org/jira/browse/BEAM-4130

On 16.08.18 07:42, Thomas Weise wrote:

Yes, that's the proposal. Everything that would otherwise be packaged
into the Docker container would need to be pre-installed in the host
environment. In the case of Python SDK, that could simply mean a
(frozen) virtual environment that was setup when the host was
provisioned - the SDK harness process(es) will then just utilize that.
Of course this flavor of SDK harness execution could also be useful in
the local development environment, where right now someone who already
has the Python environment needs to also install Docker and update a
container to launch a Python SDK pipeline on the Flink runner.

On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira mailto:danolive...@google.com>> wrote:

 I just want to clarify that I understand this correctly since I'm
 not that familiar with the details behind all these execution
 environments yet. Is the proposal to create a new JobBundleFactory
 that instead of using Docker to create the environment that the new
 processes will execute in, this JobBundleFactory would execute the
 new processes directly in the host environment? So in practice if I
 ran a pipeline with this JobBundleFactory the SDK Harness and Runner
 Harness would both be executing directly on my machine and would
 depend on me having the dependencies already present on my machine?

 On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka mailto:goe...@google.com>> wrote:

 Thanks for starting the discussion. I will be happy to help.
 I agree, we should have a pluggable SDKHarness environment factory.
 We can register multiple environment factories using a service
 registry and use the PipelineOptions to pick the right one on a
 per-job basis.

 There are a couple of things which need to be set up before
 launching the process.

   * Setting up the environment as done in boot.go [4]
   * Retrieving and putting the artifacts in the right location.

 You can probably leverage boot.go code to setup the environment.

 Also, it will be useful to enumerate pros and cons of different
 Environments to help users choose the right one.


 On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise mailto:t...@apache.org>> wrote:

 Hi,

 Currently the portable Flink runner only works with SDK
 Docker containers for execution (DockerJobBundleFactory,
 besides an in-process (embedded) factory option for testing
 [1]). I'm considering adding another out of process
 JobBundleFactory implementation that directly forks the
 processes on the task manager host, eliminating the need for
 Docker. This would work reasonably well in environments
 where the dependencies (in this case Python) can easily be
 tied into the host deployment (also within an application
 specific Kubernetes pod).

 There was already some discussion about alternative
 JobBundleFactory implementation in [2]. There is also a JIRA
 to make the bundle factory pluggable [3], pending
 availability of runner level options.

 For a "ProcessBundleFactory", in addition to the Python
 dependencies the environment would also need to have the Go
 boot executable [4] (or a substitute thereof) to perform the
 harness initialization.
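A rough sketch of such a fork (the command and the environment-variable names below are placeholders, not the actual contract implemented by boot.go):

    // Sketch: fork a worker process on the task manager host, handing it the
    // worker id and the runner's service endpoints via environment variables.
    import java.io.IOException;

    class ProcessFork {
      static Process forkWorker(String workerCommand, String workerId,
          String controlEndpoint, String loggingEndpoint) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(workerCommand);
        pb.environment().put("WORKER_ID", workerId);
        pb.environment().put("CONTROL_ENDPOINT", controlEndpoint);
        pb.environment().put("LOGGING_ENDPOINT", loggingEndpoint);
        pb.inheritIO();   // surface harness logs in the task manager output
        return pb.start();
      }
    }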

 Is anyone else interested in this SDK execution option or
 has already investigated an alternative implementation?

 Thanks,
 Thomas

 [1]
 
https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83

 [2]
 
https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E

 [3] https://issues.apache.org/jira/browse/BEAM-4819

 [4] 
https://github.com/apache/beam/blob/master/sdks/python/container/boot.go



--
Max


Re: Bootstrapping Beam's Job Server

2018-08-21 Thread Maximilian Michels

Thanks Henning and Thomas. It looks like

a) we want to keep the Docker Job Server Docker container and rely on
spinning up "sibling" SDK harness containers via the Docker socket. This 
should require few changes to the Runner code.


b) have the InProcess SDK harness as an alternative way to running user
code. This can be done independently of a).

Thomas, let's sync today on the InProcess SDK harness. I've created a
JIRA issue: https://issues.apache.org/jira/browse/BEAM-5187

Cheers,
Max

On 21.08.18 00:35, Thomas Weise wrote:
The original objective was to make test/development easier (which I 
think is super important for user experience with portable runner).


 From first hand experience I can confirm that dealing with Flink 
clusters and Docker containers for local setup is a significant hurdle 
for Python developers.


To simplify using Flink in embedded mode, the (direct) process based SDK 
harness would be a good option, especially when it can be linked to the 
same virtualenv that developers have already setup, eliminating extra 
packaging/deployment steps.


Max, I would be interested to sync up on what your thoughts are 
regarding that option since you mention you also started to work on it 
(see previous discussion [1], not sure if there is a JIRA for it yet). 
Internally we are planning to use a direct SDK harness process instead 
of Docker containers. For our specific needs it will work equally well 
for development and production, including future plans to deploy Flink 
TMs via Kubernetes.


Thanks,
Thomas

[1] 
https://lists.apache.org/thread.html/d8b81e9f74f77d74c8b883cda80fa48efdcaf6ac2ad313c4fe68795a@%3Cdev.beam.apache.org%3E







On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels > wrote:


Thanks for your suggestions. Please see below.

 > Option 3) would be to map in the docker binary and socket to allow
 > the containerized Flink job server to start "sibling" containers on
 > the host.

Do you mean packaging Docker inside the Job Server container and
mounting /var/run/docker.sock from the host inside the container? That
looks like a bit of a hack but for testing it could be fine.

 > notably, if the runner supports auto-scaling or similar non-trivial
 > configurations, that would be difficult to manage from the SDK side.

You're right, it would be unfortunate if the SDK would have to deal with
spinning up SDK harness/backend containers. For non-trivial
configurations it would probably require an extended protocol.

 > Option 4) We are also thinking about adding process based SDKHarness.
 > This will avoid docker in docker scenario.

Actually, I had started implementing a process-based SDK harness but
figured it might be impractical because it doubles the execution path
for UDF code and potentially doesn't work with custom dependencies.

 > Process based SDKHarness also has other applications and might be
 > desirable in some of the production use cases.

True. Some users might want something more lightweight.



--
Max


Re: Beam Docs Contributor

2018-08-21 Thread Connell O'Callaghan
Welcome Rose!!!

On Tue, Aug 21, 2018 at 12:57 AM Maximilian Michels  wrote:

> That sounds great, Rose. Welcome!
>
> On 21.08.18 09:21, Etienne Chauchot wrote:
> > Welcome Rose !
> >
> > Etienne
> >
> >> On Monday, July 30, 2018 at 10:10 -0700, Thomas Weise wrote:
> >> Welcome Rose, and looking forward to the docs update!
> >>
> >> On Mon, Jul 30, 2018 at 9:15 AM Henning Rohde  >> > wrote:
> >>> Welcome Rose! Great to have you here.
> >>>
> >>> On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía  >>> > wrote:
>  Welcome !
>  Great to see someone new working in this important area for the
> project.
> 
> 
>  On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang   > wrote:
> > Welcome Rose!
> >
> > On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  > > wrote:
> >> Welcome!
> >>
> >> -Rui
> >>
> >> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas  >> > wrote:
> >>> Welcome Rose, very glad to have you in the community :)
> >>>
> >>>
> >>>
> >>> On Fri, 27 Jul 2018 at 16:29, Ahmet Altay  >>> > wrote:
>  Welcome Rose! Looking forward to your contributions.
> 
>  On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen
>  mailto:rtngu...@google.com>> wrote:
> > Hi all:
> >
> > I'm Rose! I've worked on Cloud Dataflow documentation and now
> > I'm starting a project to refresh the Beam docs and improve the
> > onboarding experience. We're planning on splitting up the
> > programming guide into multiple pages, making the docs more
> > accessible for new users. I've got lots of ideas for doc
> > improvements, some of which are motivated by the UX research,
> > and am excited to share them with you all and work on them.
> >
> > I look forward to interacting with everybody in the community.
> > I welcome comments, thoughts, feedback, etc.
> > --
> >
> >
> >
> >
> > Rose Thi Nguyen
> >
> >   Technical Writer
> >
> > (281) 683-6900
> >
> 
>


Re: duplicate key-value elements lost when transferring them as side-inputs

2018-08-21 Thread Tim Robertson
Thanks for this Vaclav

The failing test (1 minute timeout exception) is something we see sometimes
and indicates issues in the build environment or a flaky test. I triggered
another build by leaving a comment in the PR - just fyi, this is something
you can also do in the future.







On Tue, Aug 21, 2018 at 10:57 AM Plajt, Vaclav 
wrote:

> Hi,
>
> looking for reviewer https://github.com/apache/beam/pull/6257
>
>
> And maybe some help with failing test in mqtt IO (timeout).
>
>
> Vaclav
> --
> *From:* Lukasz Cwik 
> *Sent:* Monday, August 20, 2018 6:12:24 PM
> *To:* dev
> *Subject:* Re: duplicate key-value elements lost when transferring them as
> side-inputs
>
> Yes, that is a bug. I filed and assigned
> https://issues.apache.org/jira/browse/BEAM-5184 to you, feel free to
> unassign if you're unable to make progress.
>
> On Mon, Aug 20, 2018 at 1:14 AM Plajt, Vaclav <
> vaclav.pl...@firma.seznam.cz> wrote:
>
>> Hi Beam devs,
>>
>> I'm working on Euphoria DSL, where we implemented `BroadcastHashJoin`
>> using side-inputs. But our test shows some missing data. We use
>> `View.asMultimap()` to get our join-small-side into a view in the form of
>> `PCollectionView<Map<K, Iterable<V>>>`. Then some duplicated key-value pair (the same key and value
>> as some other element) gets lost. That is of course unfortunate behavior
>> when doing joins. I believe that it all nails down to:
>>
>>
>> https://github.com/apache/beam/blob/05fb694f265dda0254d7256e938e508fec9ba098/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionViews.java#L293
>>
>>
>> Where `HashMultimap` is used to gather all the elements into a `Multimap<K, V>`, which does not allow duplicate key-value pairs. Do you also feel
>> this is a bug? And if yes, then we would like to fix it by replacing
>> `HashMultimap` with `ArrayListMultimap`, which allows duplicate
>> key-value pairs.
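A small Guava example showing the difference in question (this is just the library behavior, not Beam code):

    // HashMultimap (a SetMultimap) drops identical key-value pairs;
    // ArrayListMultimap (a ListMultimap) keeps them.
    import com.google.common.collect.ArrayListMultimap;
    import com.google.common.collect.HashMultimap;
    import com.google.common.collect.Multimap;

    class MultimapDuplicates {
      public static void main(String[] args) {
        Multimap<String, Integer> set = HashMultimap.create();
        set.put("k", 1);
        set.put("k", 1);                      // duplicate pair is silently dropped
        System.out.println(set.size());       // prints 1

        Multimap<String, Integer> list = ArrayListMultimap.create();
        list.put("k", 1);
        list.put("k", 1);                     // duplicate pair is kept
        System.out.println(list.size());      // prints 2
      }
    }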
>>
>>
>> We can think of some workarounds. But we prefer to do the fix, if
>> possible.
>>
>>
>> So what are your opinions? And how should we proceed?
>>
>>
>> Thank you.
>>
>> Vaclav Plajt
>>
>>
>> You should know that this e-mail and its attachments are confidential. If
>> we are negotiating on the conclusion of a transaction, we reserve the right
>> to terminate the negotiations at any time. For fans of legalese—we hereby
>> exclude the provisions of the Civil Code on pre-contractual liability. The
>> rules about who and how may act for the company and what are the signing
>> procedures can be found here
>> .
>>
>


Re: Beam Docs Contributor

2018-08-21 Thread Maximilian Michels
That sounds great, Rose. Welcome!

On 21.08.18 09:21, Etienne Chauchot wrote:
> Welcome Rose !
> 
> Etienne
> 
> On Monday, July 30, 2018 at 10:10 -0700, Thomas Weise wrote:
>> Welcome Rose, and looking forward to the docs update!
>>
>> On Mon, Jul 30, 2018 at 9:15 AM Henning Rohde > > wrote:
>>> Welcome Rose! Great to have you here.
>>>
>>> On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía >> > wrote:
 Welcome !
 Great to see someone new working in this important area for the project.


 On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang >>> > wrote:
> Welcome Rose!
>
> On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  > wrote:
>> Welcome!
>>
>> -Rui
>>
>> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas > > wrote:
>>> Welcome Rose, very glad to have you in the community :)
>>>
>>>
>>>
>>> On Fri, 27 Jul 2018 at 16:29, Ahmet Altay >> > wrote:
 Welcome Rose! Looking forward to your contributions.

 On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen
 mailto:rtngu...@google.com>> wrote:
> Hi all:
>
> I'm Rose! I've worked on Cloud Dataflow documentation and now
> I'm starting a project to refresh the Beam docs and improve the
> onboarding experience. We're planning on splitting up the
> programming guide into multiple pages, making the docs more
> accessible for new users. I've got lots of ideas for doc
> improvements, some of which are motivated by the UX research,
> and am excited to share them with you all and work on them. 
>
> I look forward to interacting with everybody in the community.
> I welcome comments, thoughts, feedback, etc. 
> -- 
>
>   
>   
>
> Rose Thi Nguyen
>
>   Technical Writer
>
> (281) 683-6900
>



Re: Beam Docs Contributor

2018-08-21 Thread Etienne Chauchot
Welcome Rose !
Etienne
On Monday, July 30, 2018 at 10:10 -0700, Thomas Weise wrote:
> Welcome Rose, and looking forward to the docs update!
> On Mon, Jul 30, 2018 at 9:15 AM Henning Rohde  wrote:
> > Welcome Rose! Great to have you here.
> > On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía  wrote:
> > > Welcome !
> > > Great to see someone new working in this important area for the project.
> > > 
> > > 
> > > On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang  wrote:
> > > > Welcome Rose!
> > > > On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  wrote:
> > > > > Welcome!
> > > > > -Rui
> > > > > On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas  
> > > > > wrote:
> > > > > > Welcome Rose, very glad to have you in the community :)
> > > > > > 
> > > > > > 
> > > > > > On Fri, 27 Jul 2018 at 16:29, Ahmet Altay  wrote:
> > > > > > > Welcome Rose! Looking forward to your contributions.
> > > > > > > 
> > > > > > > On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen 
> > > > > > >  wrote:
> > > > > > > > Hi all:
> > > > > > > > I'm Rose! I've worked on Cloud Dataflow documentation and now 
> > > > > > > > I'm starting a project to refresh the Beam
> > > > > > > > docs and improve the onboarding experience. We're planning on 
> > > > > > > > splitting up the programming guide into
> > > > > > > > multiple pages, making the docs more accessible for new users. 
> > > > > > > > I've got lots of ideas for doc
> > > > > > > > improvements, some of which are motivated by the UX research, 
> > > > > > > > and am excited to share them with you all
> > > > > > > > and work on them. 
> > > > > > > > 
> > > > > > > > I look forward to interacting with everybody in the community. 
> > > > > > > > I welcome comments, thoughts, feedback,
> > > > > > > > etc. 
> > > > > > > > -- 
> > > > > > > > 
> > > > > > > >   Rose Thi Nguyen 
> > > > > > > >   Technical Writer 
> > > > > > > >   (281) 683-6900