Re: FOSDEM 2023 is back as in person event

2022-11-01 Thread Maximilian Michels
Great news :)

> I believe it is also free to attend but don't quote me on this.

Indeed, FOSDEM is free. It's spread across the university campus in
Brussels.

On Fri, Oct 21, 2022 at 8:14 PM Ismaël Mejía  wrote:

> Hi Aizhamal,
>
> You might be interested in this thread where the ASF people are also
> discussing FOSDEM participation.
> https://lists.apache.org/thread/kv4fhldmc9mo6v5lwtkwqtwg97l64lx1
>
> It seems the call for devrooms is closed, so maybe it is too late for
> Beam, but we have had talks in the past about Beam as part of the Big
> Data track, so it may be worth participating there.
>
> Best,
> Ismaël
>
> On Mon, Oct 17, 2022 at 9:06 PM Aizhamal Nurmamat kyzy
>  wrote:
> >
> > Hi Beam community!
> >
> > FOSDEM 2023 is back as an in-person event! I
> have heard only great things about the event, where thousands of developers
> get together to talk all about open source!
> >
> > Is anyone from the Beam community planning to attend? The event takes
> place in Brussels on February 4 & 5, 2023. I believe it is also free to
> attend but don't quote me on this.
> >
> > As an open source project we can also have
> > - a stand for free https://fosdem.org/2023/news/2022-09-26-stands-cfp/
> > - a Devroom https://fosdem.org/2023/news/2022-09-29-call_for_devrooms/
> >
> > Anyone interested?
> >
>


Re: Compatibility between Beam v2.23 and Beam v2.26

2021-01-07 Thread Maximilian Michels

Thanks for mentioning me here @Boyuan.

In Beam there is no guarantee that checkpoints work across Beam
releases. Checkpoint compatibility can break for many reasons
(primarily DAG changes and serializer changes). Even though in this case
the serialVersionUID might have guaranteed compatibility, we make
internal changes to Beam all the time. There is currently no process
that we follow to ensure compatibility.


I do want to note that Flink has a serializer migration strategy which 
we currently do not leverage: 
https://github.com/apache/beam/blob/d8966d640549932d7551461ff59fa1085730f768/runners/flink/1.8/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeSerializer.java#L182


However, this requires that in addition to the new serializer, the old 
serializer is kept around. Flink will then migrate the state by reading 
first with the old serializer and then subsequently writing with the new 
one.
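For reference, a minimal sketch of what leveraging that migration strategy could look like, assuming a simplified snapshot class (all names are hypothetical; this is not the actual CoderTypeSerializer code):

```java
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.TypeSerializerSchemaCompatibility;
import org.apache.flink.api.common.typeutils.TypeSerializerSnapshot;
import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;
import org.apache.flink.util.InstantiationUtil;

import java.io.IOException;

// Hypothetical snapshot for a coder-backed serializer. On restore, Flink
// calls resolveSchemaCompatibility(); returning compatibleAfterMigration()
// makes Flink read existing state with the old serializer returned by
// restoreSerializer() and re-write it with the new serializer.
class LegacyAwareSnapshot implements TypeSerializerSnapshot<Object> {

  private TypeSerializer<Object> oldSerializer;

  @Override
  public int getCurrentVersion() {
    return 2;
  }

  @Override
  public void writeSnapshot(DataOutputView out) throws IOException {
    // Persist whatever is needed to reconstruct the old serializer, here by
    // Java-serializing it (simplified; real code should version this).
    byte[] bytes = InstantiationUtil.serializeObject(oldSerializer);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readSnapshot(int version, DataInputView in, ClassLoader loader)
      throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    try {
      oldSerializer = InstantiationUtil.deserializeObject(bytes, loader);
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    }
  }

  @Override
  public TypeSerializer<Object> restoreSerializer() {
    return oldSerializer; // The serializer kept around for reading old state.
  }

  @Override
  public TypeSerializerSchemaCompatibility<Object> resolveSchemaCompatibility(
      TypeSerializer<Object> newSerializer) {
    // Trigger Flink's read-with-old/write-with-new state migration path.
    return TypeSerializerSchemaCompatibility.compatibleAfterMigration();
  }
}
```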


-Max

On 07.01.21 09:43, Jan Lukavský wrote:

Hi Antonio,

can you please create one?

Thanks,

  Jan

On 1/6/21 10:31 PM, Antonio Si wrote:
Thanks for the information. Do we have a jira to track this issue or 
do you want me to create a jira for this?


Thanks.

Antonio.

On 2021/01/06 17:59:47, Kenneth Knowles  wrote:

Agree with Boyuan & Kyle. That PR is the problem, and we probably do not
have adequate testing. We have a cultural understanding of not breaking
encoded data forms, but this is the encoded form of the TypeSerializer,
and actually there are two problems.

1. When you have a serialized object that does not have the
serialVersionUid explicitly set, the UID is generated based on many
details that are irrelevant for binary compatibility. Any Java-serialized
object that is intended for anything other than transient transmission
*must* have a serialVersionUid set and an explicit serialized form. Else
it is completely normal for it to break due to irrelevant changes. The
serialVersionUid has no mechanism for upgrade/downgrade, so you *must*
keep it the same forever, and any versioning or compat scheme exists
within the single serialVersionUid.
2. In this case there was an actual change to the fields of the object
stored, so you need to explicitly add the serialized form and also the
ability to read from prior serialized forms.

I believe explicitly setting the serialVersionUid to the original (and
keeping it that way forever) and adding the ability to decode prior forms
will regain the ability to read the snapshot. But also this seems like
something that would be part of Flink best practice documentation, since
naive use of Java serialization often hits this problem.

Kenn
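
A minimal sketch of the pattern described above (pinning the serialVersionUID and tolerating prior serialized forms), with illustrative names rather than the actual CoderTypeSerializer fields:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

// Illustrative sketch only: pin the UID to the originally released value and
// tolerate streams written before a field was added. Field names are made up.
class VersionedSerializer implements Serializable {
  // Pinned forever; never let the JVM derive it from class details.
  private static final long serialVersionUID = 5241803328188007316L;

  private Object coder;        // present in all released forms
  private boolean fasterCopy;  // added in a later release

  private void readObject(ObjectInputStream in)
      throws IOException, ClassNotFoundException {
    ObjectInputStream.GetField fields = in.readFields();
    coder = fields.get("coder", null);
    // Snapshots from older releases predate this field: fall back to the
    // old behavior instead of failing.
    fasterCopy = fields.defaulted("fasterCopy") ? false : fields.get("fasterCopy", false);
  }
}
```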

On Tue, Jan 5, 2021 at 4:30 PM Kyle Weaver  wrote:


This raises a few related questions from me:

1. Do we claim to support resuming Flink checkpoints made with previous
Beam versions?
2. Does 1. require full binary compatibility between different versions
of runner internals like CoderTypeSerializer?
3. Do we have tests for 1.?



On Tue, Jan 5, 2021 at 4:05 PM Boyuan Zhang  wrote:


https://github.com/apache/beam/pull/13240 seems suspicious to me.

  +Maximilian Michels  Any insights here?

On Tue, Jan 5, 2021 at 8:48 AM Antonio Si wrote:



Hi,

I would like to followup with this question to see if there is a
solution/workaround for this issue.

Thanks.

Antonio.

On 2020/12/19 18:33:48, Antonio Si  wrote:

Hi,

We were using Beam v2.23 and recently, we are testing an upgrade to Beam
v2.26. For Beam v2.26, we are passing --experiments=use_deprecated_read
and --fasterCopy=true.

We run into this exception when we resume our pipeline:

Caused by: java.io.InvalidClassException:
org.apache.beam.runners.flink.translation.types.CoderTypeSerializer;
local class incompatible: stream classdesc serialVersionUID =
5241803328188007316, local class serialVersionUID = 7247319138941746449
   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
   at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1942)
   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1808)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2099)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1625)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:465)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:423)
   at org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil$TypeSerializerSerializationProxy.read(TypeSerializerSerializationUtil.java:301)
   at org.apache.flink.api.common.typeutils.TypeSerializerSerializationUtil.tryReadSerializer(TypeSerializerSerializationUtil.java:116)
   at org.apache.flink.api.common.typeutils.TypeSerializerConfigSnapshot.readSnapshot(TypeSerializerConfigSnapshot.java:113)
   at org.apache.flink.api.common.typeutils.TypeSerializerSnapshot.readVersionedSnap

Re: beam flink-runner distribution implementation

2020-11-20 Thread Maximilian Michels

Hi Richard,

The rationale was to preserve Beam's DistributionResult through the use
of Flink Gauges. Whoever implemented this wasn't fully aware that Flink
Histograms would be a better fit.


Feel free to open a PR. You can mention us here for a review.

Thanks,
Max
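
For illustration, a rough sketch of that direction (hypothetical class, not an actual Beam change): feed each reported distribution value into Flink's Histogram interface, here via the flink-metrics-dropwizard wrapper, so reporters can expose quantiles instead of a single gauge value:

```java
import com.codahale.metrics.Histogram;
import com.codahale.metrics.SlidingWindowReservoir;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.MetricGroup;

// Hypothetical sketch: register a Flink histogram for a Beam distribution
// metric and feed each reported value into it.
class DistributionAsHistogram {
  private final org.apache.flink.metrics.Histogram histogram;

  DistributionAsHistogram(MetricGroup group, String name) {
    // A sliding window of the last 500 samples; the reservoir choice is
    // an assumption, not something Beam prescribes.
    this.histogram = group.histogram(
        name, new DropwizardHistogramWrapper(new Histogram(new SlidingWindowReservoir(500))));
  }

  // Called for each value the pipeline reports to the Beam distribution.
  void update(long value) {
    histogram.update(value);
  }
}
```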

On 20.11.20 03:06, Alex Amato wrote:
Are you referring to a "Flink Gauge" or a "Beam Gauge"? Are you
suggesting packaging it as a "Flink Histogram"? (i.e. a Flink-runner-specific
concept of Histograms) If so, that seems fine and I have no comment
here.


FWIW,
I proposed a "Beam Histogram" metric (bucket counts).
https://s.apache.org/beam-histogram-metrics 



(No runner implements this, and most likely I will not be pursuing this
further, due to a change of priority/interest around the metric I was
interested in using this for.)
I was intending to use it for a specific set of metrics (no plans
to provide a user-defined Histogram Metric API)
https://s.apache.org/beam-gcp-debuggability 



I don't think we should pursue any plans to package "Beam Distributions"
as "Beam Histograms", as a "Beam Histogram" is essentially several
counters (one for each bucket). Changing all usage of beam.distribution
to a "Beam Histogram" would have performance implications, and is not
advised. If at some point "Beam Histograms" are implemented, migrating
the usage of Metrics.distribution to histogram should be done on an
individual basis.






On Thu, Nov 19, 2020 at 5:47 PM Robert Bradshaw wrote:


Gauge certainly seems wrong for DistributionResult. Yes, using a
Histogram would be a welcome PR.

On Thu, Nov 19, 2020 at 12:58 PM Kyle Weaver <kcwea...@google.com> wrote:
 >
 > What are the advantages of using a Histogram instead of a Gauge?
 >
 > Also, check out this design doc for adding histogram metrics to
Beam if you haven't already: http://s.apache.org/beam-metrics-api
(Not sure what the current status is.)
 >
 > On Wed, Nov 18, 2020 at 1:37 PM Richard Moorhead
<richard.moorh...@gmail.com> wrote:
 >>
 >> Beam's DistributionResult is implemented as a Gauge within the
Flink runner. Can someone explain the rationale behind this? Would a
PR to utilize a Histogram be acceptable?



Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-29 Thread Maximilian Michels
Ok then we are on the same page, but I disagree with your conclusion. The reason Flink has to do the deep copy is that it doesn't state that the inputs are immutable and must not be changed. In Beam, the user is not supposed to modify the input collection, and if they do, it's undefined behavior. This is the reason the DirectRunner checks for this, to make sure users are not relying on it.


It's not written anywhere that the input cannot be mutated. A 
DirectRunner test is not a proof. Any runner could add a test which 
proves the opposite. In fact we may have one that checks copying for Flink.


I prefer safety and correctness over performance because I've seen too 
many cases where users shoot themselves in the foot. We should make sure 
that, by default, the user cannot modify the input element. An option to 
disable that is fine.


-Max


Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-28 Thread Maximilian Michels
You are right that Flink serializers do not care to copy for immutable 
Java types, e.g. Long, Integer, String. However, Pojos or other custom 
types can be mutated and Flink does a deep copy in this case.


If you look at the PojoSerializer in Flink, you will see that it does a 
deep copy: 
https://github.com/apache/flink/blob/d13f66be552eac89f45469c199ae036087baa38d/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L228


Also Flink uses Java serialization if the generic Kryo serializer fails: 
https://github.com/apache/flink/blob/d13f66be552eac89f45469c199ae036087baa38d/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/KryoSerializer.java#L251


In Beam we are just wrapping Beam coders, so we do not know whether a
type is mutable or not. This is why we always deep-copy. I'm not sure
that the change to always return the input would be safe. However, we
could add some exceptions for Beam types we are sure cannot be mutated.
Also, a pipeline option is ok if it is opt-in.


-Max
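
To illustrate, a simplified sketch of the two strategies (not the actual CoderTypeSerializer code; the wrapped Beam coder and the set of exempted types are assumptions):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;

// Simplified sketch of how a coder-backed copy() works today, plus the kind
// of exception list for known-immutable types discussed above.
class CopyStrategies<T> {
  private static final Set<Class<?>> KNOWN_IMMUTABLE = new HashSet<>(
      Arrays.asList(String.class, Long.class, Integer.class, Boolean.class, Double.class));

  private final Coder<T> coder;

  CopyStrategies(Coder<T> coder) {
    this.coder = coder;
  }

  T copy(T value) {
    // Cheap path: types that cannot be mutated may be returned as-is.
    if (value == null || KNOWN_IMMUTABLE.contains(value.getClass())) {
      return value;
    }
    // General path: deep copy via an encode/decode roundtrip, since a Beam
    // coder gives us no other way to clone an arbitrary value.
    try {
      return CoderUtils.clone(coder, value);
    } catch (Exception e) {
      throw new RuntimeException("Could not clone value", e);
    }
  }
}
```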

On 28.10.20 11:59, Teodor Spæren wrote:

Hey Max!

Just to make sure we are on the same page:

When you say "Flink's default behavior" do you mean Apache Flink the
project or Beam's Flink Runner? I'm assuming the Flink Runner, since
Apache Flink leaves it up to the TypeSerializer to decide how to copy,
and none of the ones I've seen so far choose to do it through a
serialization-then-deserialization roundtrip.


Is the bug hard to detect? Using the direct runner will warn of this
misuse by default, without any need to change the pipeline itself, as
far as I know? Please correct me if I'm wrong, I don't have much
experience with Beam!


PCollections being immutable does mean that the input element should not
be modified; rather, if a modification is needed, it is up to the user to
copy the input before changing it [1] (3.2.3 Immutability). I think this
is what you are saying, but I just wanted to make sure :)


Also, I think naming the flag anything with object reuse would be 
confusing to users, as flink already has the concept of object reuse 
[2](enableObjectReuse), and there is an option on the runner mentioning 
this already[3](objectReuse). I'm thinking something more along the 
lines of "fasterCopy" or "disableFailsafeCopying".


Best regards,
Teodor Spæren

[1]: 
https://beam.apache.org/documentation/programming-guide/#pcollection-characteristics 

[2]: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/execution_configuration.html 

[3]: 
https://beam.apache.org/documentation/runners/flink/#pipeline-options-for-the-flink-runner 



On Wed, Oct 28, 2020 at 10:31:51AM +0100, Maximilian Michels wrote:

Very good observation @Teodor!

Flink's default behavior is to copy elements by going through a
serialization/deserialization roundtrip. This occurs regardless of
whether operators are "chained" (directly pass on data without going
through the network) or not.


This default was chosen for a reason because users tend to forget that 
you must not modify the input. It can cause very hard to detect "bugs".


PCollections are immutable, but that does not mean transforms are not
allowed to mutate their inputs. Rather, it means that the original element
will not change, because a copy of that element is passed on.


+1 for a flag to enable object reuse
-1 for changing the default for the above reason

Cheers,
Max

On 27.10.20 21:54, Kenneth Knowles wrote:

It seems that many correct things are said on this thread.

1. Elements of a PCollection are immutable. They should be like 
mathematical values.
2. For performance reasons, the author of a DoFn is responsible to 
not mutate input elements and also to not mutate outputs once they 
have been output.

3. The direct runner does extra work to check if your DoFn* is wrong.
4. On a production runner it is expected that serialization only 
occurs when needed for shipping data**


If the FlinkRunner is serializing things that don't have to be 
shipped that seems like a great easy win.


Kenn

*notably CombineFn has an API that is broken; only the first 
accumulator is allowed to be mutated and a runner is responsible for 
cloning it as necessary; it is expected that combining many elements 
will execute by mutating one unaliased accumulator many times
**small caveat that when doing in-memory groupings you need to use 
Coder#structuralValue and group by that, which may serialize but 
hopefully does something smarter


On Tue, Oct 27, 2020 at 8:52 AM Reuven Lax <re...@google.com> wrote:


   Actually I believe that the Beam model does say that input elements
   should be immutable. If I remember correctly, the DirectRunner even
   validates this in unit tests, failing tests if the input elements
   have been mutated.

   On Tue, Oct 27, 2020 at 3:49 AM David Morávek <d...@apache.org> wrote:

   Hi Teodor,

   Thanks for bringing this

Re: Possible 80% reduction in overhead for flink runner, input needed

2020-10-28 Thread Maximilian Michels

Very good observation @Teodor!

Flink's default behavior is to copy elements by going through a
serialization/deserialization roundtrip. This occurs regardless of
whether operators are "chained" (directly pass on data without going
through the network) or not.


This default was chosen for a reason because users tend to forget that 
you must not modify the input. It can cause very hard to detect "bugs".


PCollections are immutable, but that does not mean transforms are not
allowed to mutate their inputs. Rather, it means that the original element will
not change, because a copy of that element is passed on.


+1 for a flag to enable object reuse
-1 for changing the default for the above reason

Cheers,
Max

On 27.10.20 21:54, Kenneth Knowles wrote:

It seems that many correct things are said on this thread.

1. Elements of a PCollection are immutable. They should be like 
mathematical values.
2. For performance reasons, the author of a DoFn is responsible to not 
mutate input elements and also to not mutate outputs once they have been 
output.

3. The direct runner does extra work to check if your DoFn* is wrong.
4. On a production runner it is expected that serialization only occurs 
when needed for shipping data**


If the FlinkRunner is serializing things that don't have to be shipped 
that seems like a great easy win.


Kenn

*notably CombineFn has an API that is broken; only the first accumulator 
is allowed to be mutated and a runner is responsible for cloning it as 
necessary; it is expected that combining many elements will execute by 
mutating one unaliased accumulator many times
**small caveat that when doing in-memory groupings you need to use 
Coder#structuralValue and group by that, which may serialize but 
hopefully does something smarter


On Tue, Oct 27, 2020 at 8:52 AM Reuven Lax <re...@google.com> wrote:


Actually I believe that the Beam model does say that input elements
should be immutable. If I remember correctly, the DirectRunner even
validates this in unit tests, failing tests if the input elements
have been mutated.

On Tue, Oct 27, 2020 at 3:49 AM David Morávek <d...@apache.org> wrote:

Hi Teodor,

Thanks for bringing this up. This is a known, long-standing
"issue". Unfortunately there are a few things we need to consider:

- As you correctly noted, the *Beam model doesn't enforce
immutability* of input / output elements, so this is the price.
- We *can not break* existing pipelines.
- Flink Runner needs to provide the *same guarantees as the Beam
model*.

There are definitely some things we can do here, to make things
faster:

- We can try the similar approach as HadoopIO
(HadoopInputFormatReader#isKnownImmutable), to check for known
immutable types (KV, primitives, protobuf, other known internal
immutable structures).
- *If the type is immutable, we can safely reuse it.* This should
cover most of the performance costs without breaking the
guarantees Beam model provides.
- We can enable registration of custom "immutable" types via
pipeline options? (this may be an unnecessary knob, so this
needs a further discussion)

WDYT?

D.


On Mon, Oct 26, 2020 at 6:37 PM Teodor Spæren
<teodor_spae...@riseup.net> wrote:

Hey!

I'm a student at the University of Oslo, and I'm writing a
master thesis
about the possibility of using Beam to benchmark stream
processing
systems. An important factor in this is the overhead
associated with
using Beam over writing code for the runner directly. [1]
found that
there was a large overhead associated with using Beam, but
did not
investigate where this overhead came from. I've done
benchmarks and
confirmed the findings there, where for simple chains of
identity
operators, Beam is 43x slower than the Flink equivalent.

These are very simple pipelines, with custom sources that
just output a
series of integers. By profiling I've found that most of the
overhead
comes from serializing and deserializing. Specifically, the way
TypeSerializers [2] are implemented in [3], where each object is
serialized and then deserialized between every operator.
Looking into
Looking into
the semantics of Beam, no operator should change the input,
so we don't
need to do a copy here. The function in [3] could
potentially be changed
to a single `return` statement.

Doing this removes 80% of the overhead in my tests. This is
a very
synthetic example, but it's a low hanging fruit and 

Re: Throttling stream outputs per trigger?

2020-10-16 Thread Maximilian Michels
the downstream consumer has these requirements. 


Blocking should normally be avoided at all cost, but if the downstream 
operator has the requirement to only emit a fixed number of messages per 
second, it should enforce this, i.e. block once the maximum number of 
messages for a time period have been reached. This will automatically 
lead to backpressure in Runners like Flink or Dataflow.


-Max
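
For instance, a minimal sketch (hypothetical, not a built-in transform) of a DoFn that enforces such a limit by blocking; note it limits each DoFn instance, so the effective rate is the limit times the parallelism:

```java
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical rate-limiting DoFn: allows at most `permitsPerSecond` elements
// through per second by sleeping once the budget for the current second is
// spent. Blocking here propagates as backpressure in runners like Flink.
class RateLimitFn<T> extends DoFn<T, T> {
  private final long permitsPerSecond;
  private transient long windowStartMillis;
  private transient long emittedInWindow;

  RateLimitFn(long permitsPerSecond) {
    this.permitsPerSecond = permitsPerSecond;
  }

  @ProcessElement
  public void process(ProcessContext ctx) throws InterruptedException {
    long now = System.currentTimeMillis();
    if (now - windowStartMillis >= 1000) {
      // A new one-second window has started; reset the budget.
      windowStartMillis = now;
      emittedInWindow = 0;
    }
    if (emittedInWindow >= permitsPerSecond) {
      // Budget exhausted: block until the next one-second window opens.
      Thread.sleep(1000 - (now - windowStartMillis));
      windowStartMillis = System.currentTimeMillis();
      emittedInWindow = 0;
    }
    emittedInWindow++;
    ctx.output(ctx.element());
  }
}
```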

On 07.10.20 18:30, Luke Cwik wrote:
SplittableDoFns apply to both batch and streaming pipelines. They are 
allowed to produce an unbounded amount of data and can either self 
checkpoint saying they want to resume later or the runner will ask them 
to checkpoint via a split call.


There hasn't been anything concrete on backpressure. There has been work
done on exposing signals[1] related to IO that a runner can then use
intelligently, but throttling isn't one of them yet.


1: 
https://lists.apache.org/thread.html/r7c1bf68bd126f3421019e238363415604505f82aeb28ccaf8b834d0d%40%3Cdev.beam.apache.org%3E 



On Tue, Oct 6, 2020 at 3:51 PM Vincent Marquez
<vincent.marq...@gmail.com> wrote:


Thanks for the response.  Is my understanding correct that
SplittableDoFns are only applicable to Batch pipelines?  I'm
wondering if there's any proposals to address backpressure needs?
/~Vincent/


On Tue, Oct 6, 2020 at 1:37 PM Luke Cwik <lc...@google.com> wrote:

There is no general back pressure mechanism within Apache Beam
(runners should be intelligent about this, but there is currently
no way to say "I'm being throttled", so runners don't know that
throwing more CPUs at a problem won't make it go faster).

You can control how quickly you ingest data for runners that
support splittable DoFns with SDK-initiated checkpoints with
resume delays. A splittable DoFn is able to return
resume().withResumeDelay(Duration.standardSeconds(10)) from
the @ProcessElement method. See Watch[1] for an example.
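
A minimal sketch of such a splittable DoFn (hypothetical names; it emits a fixed range of longs and defers between chunks):

```java
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.joda.time.Duration;

// Hypothetical splittable DoFn: emits a range of longs, but yields control
// after every 500 outputs and asks the runner to resume later.
@DoFn.UnboundedPerElement
class ThrottledEmitFn extends DoFn<byte[], Long> {

  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element byte[] unused) {
    return new OffsetRange(0, 1_000_000);
  }

  @ProcessElement
  public ProcessContinuation process(
      ProcessContext ctx, RestrictionTracker<OffsetRange, Long> tracker) {
    long position = tracker.currentRestriction().getFrom();
    for (int emitted = 0; emitted < 500; emitted++, position++) {
      if (!tracker.tryClaim(position)) {
        return ProcessContinuation.stop(); // Range done or runner-initiated split.
      }
      ctx.output(position);
    }
    // Defer further processing; the runner resumes after the delay.
    return ProcessContinuation.resume().withResumeDelay(Duration.standardSeconds(1));
  }
}
```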

The 2.25.0 release enables more splittable DoFn features on more
runners. I'm working on a blog (initial draft[2], still mostly
empty) to update the old blog from 2017.

1:

https://github.com/apache/beam/blob/9c239ac93b40e911f03bec5da3c58a07fdceb245/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L908


2:

https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE/edit#




On Tue, Oct 6, 2020 at 10:39 AM Vincent Marquez
<vincent.marq...@gmail.com>
wrote:

Hmm, I'm not sure how that will help, I understand how to
batch up the data, but it is the triggering part that I
don't see how to do.  For example, in Spark Structured
Streaming, you can set a time trigger which happens at a
fixed interval all the way up to the source, so the source
can throttle how much data to read even.

Here is my use case more thoroughly explained:

I have a Kafka topic (with multiple partitions) that I'm
reading from, and I need to aggregate batches of up to 500
before sending a single batch off in an RPC call.  However,
the vendor specified a rate limit, so if there are more than
500 unread messages in the topic, I must wait 1 second
before issuing another RPC call. When searching on Stack
Overflow I found this answer:
https://stackoverflow.com/a/57275557/25658 that makes it
seem challenging, but I wasn't sure if things had changed
since then or you had better ideas.

/~Vincent/


On Thu, Oct 1, 2020 at 2:57 PM Luke Cwik <lc...@google.com> wrote:

Look at the GroupIntoBatches[1] transform. It will
buffer "batches" of size X for you.

1:

https://beam.apache.org/documentation/transforms/java/aggregation/groupintobatches/


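A short usage sketch, assuming a keyed input (key and value types are placeholders):

```java
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class BatchExample {
  // Emits per-key batches of up to 500 elements; a downstream DoFn can then
  // make one RPC per batch (and apply any rate limiting there).
  static PCollection<KV<String, Iterable<String>>> batch(
      PCollection<KV<String, String>> input) {
    return input.apply(GroupIntoBatches.<String, String>ofSize(500));
  }
}
```
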

On Thu, Oct 1, 2020 at 2:51 PM Vincent Marquez
<vincent.marq...@gmail.com> wrote:

the downstream consumer has these requirements.

/~Vincent/


On Thu, Oct 1, 2020 at 2:29 PM Luke Cwik
<lc...@google.com> wrote:

Why do you want to only emit X? 

Re: Self-checkpoint Support on Portable Flink

2020-10-14 Thread Maximilian Michels
Duplicates cannot happen because the state of all operators will be 
rolled back to the latest checkpoint, in case of failures.


On 14.10.20 06:31, Reuven Lax wrote:
Does this mean that we have to deal with duplicate messages over the 
back edge? Or will that not happen, since duplicates mean that we rolled 
back a checkpoint.


On Tue, Oct 13, 2020 at 2:59 AM Maximilian Michels <m...@apache.org> wrote:


There would be ways around the lack of checkpointing in cycles, e.g.
buffer and backloop only after checkpointing is complete, similarly how
we implement @RequiresStableInput in the Flink Runner.

-Max

On 07.10.20 04:05, Reuven Lax wrote:
 > It appears that there's a proposal
 > (https://cwiki.apache.org/confluence/display/FLINK/FLIP-16%3A+Loop+Fault+Tolerance)
 > and an abandoned PR to fix this, but AFAICT this remains a limitation of
 > Flink. If Flink can't guarantee processing of records on back edges, I
 > don't think we can use cycles, as we might otherwise lose the residuals.
 >
 > On Tue, Oct 6, 2020 at 6:16 PM Reuven Lax <re...@google.com> wrote:
 >
 >     This is what I was thinking of
 >
 >     "Flink currently only provides processing guarantees for jobs
 >     without iterations. Enabling checkpointing on an iterative job
 >     causes an exception. In order to force checkpointing on an iterative
 >     program the user needs to set a special flag when enabling
 >     checkpointing: env.enableCheckpointing(interval,
 >     CheckpointingMode.EXACTLY_ONCE, force = true).
 >
 >     Please note that records in flight in the loop edges (and the state
 >     changes associated with them) will be lost during failure."
 >
 >     On Tue, Oct 6, 2020 at 5:44 PM Boyuan Zhang <boyu...@google.com> wrote:
 >
 >         Hi Reuven,
 >
 >         As Luke mentioned, at least there are some limitations around
 >         tracking watermark with flink cycles. I'm going to use State +
 >         Timer without flink cycle to support self-checkpoint. For
 >         dynamic split, we can either explore flink cycle approach or
 >         limit depth approach.
 >
 >         On Tue, Oct 6, 2020 at 5:33 PM Reuven Lax <re...@google.com> wrote:
 >
 >             Aren't there some limitations associated with flink cycles?
 >             I seem to remember various features that could not be used.
 >             I'm assuming that watermarks are not supported across
 >             cycles, but is there anything else?
 >
 >             On Tue, Oct 6, 2020 at 7:12 AM Maximilian Michels
 >             <m...@apache.org> wrote:
 >
 >                 Thanks for starting the conversation. The two approaches
 >                 both look good to me. Probably we want to start with
 >                 approach #1 for all Runners to be able to support
 >                 delaying bundles. Flink supports cycles and thus
 >                 approach #2 would also be applicable and could be used
 >                 to implement dynamic splitting.
 >
 >                 -Max
 >
 >                 On 05.10.20 23:13, Luke Cwik wrote:
 >                  > Thanks Boyuan, I left a few comments.
 >                  >
 >                  > On Mon, Oct 5, 2020 at 11:12 AM Boyuan Zhang
 >                  > <boyu...@google.com> wrote:
 >                  >
 >                  >     Hi team,
 >                  >
 >                  >     I'm looking at 

Re: Self-checkpoint Support on Portable Flink

2020-10-13 Thread Maximilian Michels
There would be ways around the lack of checkpointing in cycles, e.g. 
buffer and backloop only after checkpointing is complete, similarly how 
we implement @RequiresStableInput in the Flink Runner.


-Max
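
A simplified sketch of that buffering idea, using Flink's CheckpointListener (the actual @RequiresStableInput implementation is more involved and must also checkpoint the buffer itself):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import org.apache.flink.runtime.state.CheckpointListener;

// Simplified sketch of checkpoint-aligned buffering: elements are held back
// and only emitted once notifyCheckpointComplete() confirms they are part of
// a durable checkpoint. A real implementation must also put the buffer into
// checkpointed state so it survives failures.
abstract class BufferingEmitter<T> implements CheckpointListener {
  private final Queue<T> buffer = new ArrayDeque<>();

  void bufferElement(T element) {
    buffer.add(element);
  }

  @Override
  public void notifyCheckpointComplete(long checkpointId) {
    // Safe to release: the buffered elements were included in the checkpoint.
    T element;
    while ((element = buffer.poll()) != null) {
      emit(element);
    }
  }

  abstract void emit(T element);
}
```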

On 07.10.20 04:05, Reuven Lax wrote:
It appears that there's a proposal
(https://cwiki.apache.org/confluence/display/FLINK/FLIP-16%3A+Loop+Fault+Tolerance)
and an abandoned PR to fix this, but AFAICT this remains a limitation of
Flink. If Flink can't guarantee processing of records on back edges, I
don't think we can use cycles, as we might otherwise lose the residuals.


On Tue, Oct 6, 2020 at 6:16 PM Reuven Lax <re...@google.com> wrote:


This is what I was thinking of

"Flink currently only provides processing guarantees for jobs
without iterations. Enabling checkpointing on an iterative job
causes an exception. In order to force checkpointing on an iterative
program the user needs to set a special flag when enabling
checkpointing: env.enableCheckpointing(interval,
CheckpointingMode.EXACTLY_ONCE, force = true).

Please note that records in flight in the loop edges (and the state
changes associated with them) will be lost during failure."






On Tue, Oct 6, 2020 at 5:44 PM Boyuan Zhang <boyu...@google.com> wrote:

Hi Reuven,

As Luke mentioned, at least there are some limitations around
tracking watermark with flink cycles. I'm going to use State +
Timer without flink cycle to support self-checkpoint. For
dynamic split, we can either explore flink cycle approach or
limit depth approach.

On Tue, Oct 6, 2020 at 5:33 PM Reuven Lax <re...@google.com> wrote:

Aren't there some limitations associated with flink cycles?
I seem to remember various features that could not be used.
I'm assuming that watermarks are not supported across
cycles, but is there anything else?

    On Tue, Oct 6, 2020 at 7:12 AM Maximilian Michels
<m...@apache.org> wrote:

Thanks for starting the conversation. The two approaches
both look good
to me. Probably we want to start with approach #1 for
all Runners to be
able to support delaying bundles. Flink supports cycles
and thus
approach #2 would also be applicable and could be used
to implement
dynamic splitting.

-Max

On 05.10.20 23:13, Luke Cwik wrote:

 > Thanks Boyuan, I left a few comments.
 >
 > On Mon, Oct 5, 2020 at 11:12 AM Boyuan Zhang
 > <boyu...@google.com> wrote:
 >
 >     Hi team,
 >
 >     I'm looking at adding self-checkpoint support to portable Flink
 >     runner (BEAM-10940
 >     <https://issues.apache.org/jira/browse/BEAM-10940>) for both batch
 >     and streaming. I summarized the problem that we want to solve and
 >     proposed 2 potential approaches in this doc
 >     <https://docs.google.com/document/d/1372B7HYxtcUYjZOnOM7OBTfSJ4CyFg_gaPD_NUxWClo/edit?usp=sharing>.
 >
 >     I want to collect feedback on which approach is preferred and
 >     anything that I have not taken into consideration yet but I should.
 >     Many thanks to all your help!
 >
 >     Boyuan
 >



Re: Self-checkpoint Support on Portable Flink

2020-10-06 Thread Maximilian Michels
Thanks for starting the conversation. The two approaches both look good 
to me. Probably we want to start with approach #1 for all Runners to be 
able to support delaying bundles. Flink supports cycles and thus 
approach #2 would also be applicable and could be used to implement 
dynamic splitting.


-Max

On 05.10.20 23:13, Luke Cwik wrote:

Thanks Boyuan, I left a few comments.

On Mon, Oct 5, 2020 at 11:12 AM Boyuan Zhang <boyu...@google.com> wrote:


Hi team,

I'm looking at adding self-checkpoint support to portable Flink
runner (BEAM-10940 <https://issues.apache.org/jira/browse/BEAM-10940>) for both batch
and streaming. I summarized the problem that we want to solve and
proposed 2 potential approaches in this doc
<https://docs.google.com/document/d/1372B7HYxtcUYjZOnOM7OBTfSJ4CyFg_gaPD_NUxWClo/edit?usp=sharing>.

I want to collect feedback on which approach is preferred and
anything that I have not taken into consideration yet but I should.
Many thanks to all your help!

Boyuan



Re: Design rational behind copying via serializing in flink runner

2020-09-07 Thread Maximilian Michels

Hey Teodor,

Copying is the default behavior. This is tunable via the pipeline option 
'objectReuse', i.e. 'objectReuse=true'.


The option is disabled by default because users may not be aware of
object reuse and may recycle objects in their process functions, which will
have unexpected side effects.


Now, for primitive types, the copying is not even necessary and this
could be optimized, similar to what is done in the Flink serializers. Since we
wrap Beam coders and expose them as Flink serializers for the Flink
Runner, we would have to re-add this logic to Beam coders or the Flink
Runner's CoderTypeSerializer.


-Max
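
For example, a minimal sketch of enabling the option programmatically (assuming the standard FlinkPipelineOptions):

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ObjectReuseExample {
  public static void main(String[] args) {
    // Enable object reuse for the Flink Runner: elements are passed between
    // chained operators without the defensive serialize/deserialize copy.
    // Only safe when no user function mutates inputs or emitted outputs.
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setObjectReuse(true);
    // ... build and run the pipeline with these options ...
  }
}
```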

On 06.09.20 11:01, Teodor Spæren wrote:

Hey Brian!

Sorry for the late reply, this one kind of got lost in my mail client. 
Still trying to figure this mailing list thing out, hehe.


I would like to try to see if a simple return there will speed things
up. I've never built Beam by hand though; is a full build, as
described in [1], required to test this, or can I do a more selective
build of only the Java portion?


Another question is that Flink has object reuse, and it can be turned
on through options to the Flink runner. If everything passed between
operators is immutable anyway, why isn't this option enabled by default?
In this case it would not give a speed up I think, as the method is just
aliased. I haven't fully understood all the aspects of this option in
Flink, so I might just be missing something.


Thanks for input so far :D

[1]: https://beam.apache.org/contribute/
[2]: 
https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/runners/flink/1.8/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeSerializer.java#L93 



On Mon, Aug 31, 2020 at 05:36:56PM -0700, Brian Hulette wrote:

Hi Teodor,
I actually forwarded your message to dev@ before, but I foolishly
removed user@ from the distro, so I think you weren't able to see it. Sorry
about that!

+Lukasz Cwik  replied there [1]. I'll copy it here 
and we

can keep discussing on this thread:

The idea in Beam has always been that objects passed between transforms
are not mutable, and users are required to either pass through their
object or output a new object.

Minimizing the places where Flink performs copying would be best but
without further investigation wouldn't be able to give concrete 
suggestions.



Brian

[1]
https://lists.apache.org/thread.html/r8c22c8b089f9caaac8efef90e62117a1db49af6471ff6bd7cbc5b882%40%3Cdev.beam.apache.org%3E 



On Mon, Aug 31, 2020 at 11:14 AM Teodor Spæren 


wrote:


Hey!

First time posting to a mailing list, hope I did it correctly :)

I'm writing a master thesis at the University of Oslo and right now I'm
looking at the performance overhead of using Beam with the Flink runnner
versus plain Flink.

I've written a simple program: a custom source outputting 0, 1, 2, 3, up
to N, going into a single identity operator and then into a filter which
only matches N and prints that out. This is just to compare performance.

I've been doing some profiling of simple programs and one observation is
the performance difference in the serialization. The hotspot is [1],
which is used multiple places, but one place is [2], which is called
from [3]. As far as I can tell, [1] seems to be implementing copying by
first serializing and then deserializing and there are no way for the
actual types to change this. In flink, you have control over the copy()
method, like in [4] and so for certain types you can just do a simple
return as you do here.

My question is if I've understood the flow correctly so far, and if so,
what the reason for doing it this way is. Is it to avoid demanding that the
type implement some type of cloning? And would it be possible to push
this downward in the stack and allow the coders to define the copy
semantics? I'm willing to do the work here, just want to know if it
would work on an architectural level.

If there is any known overheads of using beam that you would like to
point out, I would love to hear about it.

Best regards,
Teodor Spæren

[1]:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L140 


[2]:
https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/runners/flink/1.8/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeSerializer.java#L85 


[3]:
https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeInformation.java#L85 


[4]:
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/LongSerializer.java#L53 





Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-08-28 Thread Maximilian Michels

Thanks Luke! I've had a pass.

-Max

On 28.08.20 01:22, Luke Cwik wrote:

As an update.

Direct and Twister2 are done.
Samza: is ready for review[1].
Flink: is almost ready for review. [2] lays all the groundwork for the 
migration and [3] finishes the migration (there is a timeout happening 
in FlinkSubmissionTest that I'm trying to figure out).

No further updates on Spark[4] or Jet[5].

@Maximilian Michels <m...@apache.org> or @Thomas Weise
<thomas.we...@gmail.com>, can either of you take a look at the
Flink PRs?
@ke.wu...@icloud.com, since Xinyu delegated
to you, can you take another look at the Samza PR?


1: https://github.com/apache/beam/pull/12617
2: https://github.com/apache/beam/pull/12706
3: https://github.com/apache/beam/pull/12708
4: https://github.com/apache/beam/pull/12603
5: https://github.com/apache/beam/pull/12616

On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
<pulasthi...@gmail.com> wrote:


Hi Luke

Will take a look at this as soon as possible and get back to you.

Best Regards,
Pulasthi

On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com> wrote:

I have made some good progress here and have gotten to the
following state for non-portable runners:

DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded.
Twister2[2]: Ready for review. Supports Read.Bounded, the
current runner doesn't support unbounded pipelines.
Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not
certain about level of unbounded pipeline support coverage since
Spark uses its own tiny suite of tests to get unbounded pipeline
coverage instead of the validates runner set.
Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
needs additional work.
Samza[5]: WIP. Supports Read.Bounded. Not certain about level of
unbounded pipeline support coverage since Samza uses its own
tiny suite of tests to get unbounded pipeline coverage instead
of the validates runner set.
Flink: Unstarted.

@Pulasthi Supun Wickramasinghe <pulasthi...@gmail.com>,
can you help me with the Twister2 PR[2]?
@Ismaël Mejía <ieme...@gmail.com>, is PR[3] the expected
level of support for unbounded pipelines and hence ready for review?
@Jozsef Bartok <jo...@hazelcast.com>, can you help me out
to get support for unbounded splittable DoFn's into Jet[4]?
@Xinyu Liu <xinyuliu...@gmail.com>, is PR[5] the expected
level of support for unbounded pipelines and hence ready for review?

1: https://github.com/apache/beam/pull/12519
2: https://github.com/apache/beam/pull/12594
3: https://github.com/apache/beam/pull/12603
4: https://github.com/apache/beam/pull/12616
5: https://github.com/apache/beam/pull/12617

On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com> wrote:

There shouldn't be any changes required since the wrapper
will smoothly transition the execution to be run as an SDF.
New IOs should strongly prefer to use SDF since it should be
simpler to write and will be more flexible but they can use
the "*Source"-based APIs. Eventually we'll deprecate the
APIs but we will never stop supporting them. Eventually they
should all be migrated to use SDF and if there is another
major Beam version, we'll finally be able to remove them.

On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
<aromanenko@gmail.com>
wrote:

Hi Luke,

Great to hear about such progress on this!

Talking about opt-out for all runners in the future,
will it require any code changes for current
“*Source”-based IOs or the wrappers should completely
smooth this transition?
Do we need to require to create new IOs only based on
SDF or again, the wrappers should help to avoid this?


On 10 Aug 2020, at 22:59, Luke Cwik <lc...@google.com> wrote:

In the past couple of months wrappers[1, 2] have been
added to the Beam Java SDK which can execute
BoundedSource and UnboundedSource as Splittable DoFns.
These have been opt-out for portable pipelines (e.g.
Dataflow runner v2, XLang pipelines on Flink/Spark)
and opt-in using an experiment for all other pipelines.

I would like to start making the non-portable
pipelines starting with the DirectRunner[3] to be
opt-out with the p

Re: [External] Re: Memory Issue When Running Beam On Flink

2020-08-27 Thread Maximilian Michels
If the user chooses to create a window of 10 years, I'd say it is
expected behavior that the state will be kept for that entire duration.


GlobalWindows are different because they represent the default case
where the user does not even use windowing. I think that warrants treating
it differently, especially because cleanup simply cannot be ensured
by the watermark.


It would be possible to combine both approaches, but I'd rather not skip 
the cleanup timer for non-global windows because that could easily 
become the source of another leak. The more pressing issue here is the 
global window, not specific windowing.


-Max

On 26.08.20 10:15, Jan Lukavský wrote:
Window triggering is afaik operation that is specific to GBK. Stateful 
DoFns can have (as shown in the case of deduplication) timers set for 
the GC only, triggering has no effect there. And yes, if we have other 
timers than GC (any user timers), then we have to have GC timer (because 
timers are a form of state).


Imagine a (admittedly artificial) example of deduplication in fixed 
window of 10 years. It would exhibit exactly the same state growth as 
global window (and 10 years is "almost infinite", right? :)).


Jan

On 8/26/20 10:01 AM, Maximilian Michels wrote:
The inefficiency described happens if and only if the following two 
conditions are met:


 a) there are many timers per single window (as otherwise they will 
be negligible)


 b) there are many keys which actually contain no state (as otherwise 
the timer would be negligible wrt the state size) 


Each window has to have a timer set, it is unavoidable for the window 
computation to be triggered accordingly. This happens regardless of 
whether we have state associated with the key/window or not. The 
additional cleanup timer is just a side effect and not a concern in my 
opinion. Since window computation is per-key, there is no way around 
this. I don't think skipping the cleanup timer for non global windows 
without state is a good idea, just to save one cleanup timer, when 
there are already timers created for the window computation.


Now, the global window is different in that respect because we can't 
assume it is going to be triggered for unbounded streams. Thus, it 
makes sense to me to handle it differently by not using triggers but 
cleaning up once a watermark > MAX_TIMESTAMP has been processed.


-Max

On 26.08.20 09:20, Jan Lukavský wrote:

On 8/25/20 9:27 PM, Maximilian Michels wrote:

I agree that this probably solves the described issue in the most 
straightforward way, but special handling for global window feels 
weird, as there is really nothing special about global window wrt 
state cleanup. 


Why is special handling for the global window weird? After all, it 
is a special case because the global window normally will only be 
cleaned up when the application terminates.


The inefficiency described happens if and only if the following two 
conditions are met:


  a) there are many timers per single window (as otherwise they will 
be negligible)


  b) there are many keys which actually contain no state (as 
otherwise the timer would be negligible wrt the state size)


It only happens to be the case that global window is the (by far, 
might be 98% cases) most common case that satisfies these two 
conditions, but there are other cases as well (e.g. long lasting 
fixed window). Discussed options 2) and 3) are systematic in the 
sense that option 2) cancels property a) and option 3) property b). 
Making use of correlation of global window with these two conditions 
to solve the issue is of course possible, but a little unsystematic 
and that's what feels 'weird'. :)




It doesn't change anything wrt migration. The timers that were 
already set remain and keep on contributing to the state size.


That's ok, regular timers for non-global windows need to remain set 
and should be persisted. They will be redistributed when scaling up 
and down.


I'm not sure that's a "problem", rather an inefficiency. But we 
could address it by deleting the timers where they are currently 
set, as mentioned previously.


I had imagined that we don't even set these timers for the global 
window. Thus, there is no need to clean them up.


-Max

On 25.08.20 09:43, Jan Lukavský wrote:
I agree that this probably solves the described issue in the most 
straightforward way, but special handling for global window feels 
weird, as there is really nothing special about global window wrt 
state cleanup. A solution that handles all windows equally would be 
semantically 'cleaner'. If I try to sum up:


  - option 3) seems best, provided that isEmpty() lookup is cheap 
for every state backend (e.g. that we do not hit disk multiple 
times), this option is the best for state size wrt timers in all 
windows


  - option 2) works well for key-aligned windows, also reduces 
state size in all windows


  - option "watermark timer" - solves issue, easily 

Re: [External] Re: Memory Issue When Running Beam On Flink

2020-08-26 Thread Maximilian Michels

The inefficiency described happens if and only if the following two conditions 
are met:

 a) there are many timers per single window (as otherwise they will be 
negligible)

 b) there are many keys which actually contain no state (as otherwise the timer would be negligible wrt the state size) 


Each window has to have a timer set, it is unavoidable for the window 
computation to be triggered accordingly. This happens regardless of 
whether we have state associated with the key/window or not. The 
additional cleanup timer is just a side effect and not a concern in my 
opinion. Since window computation is per-key, there is no way around 
this. I don't think skipping the cleanup timer for non global windows 
without state is a good idea, just to save one cleanup timer, when there 
are already timers created for the window computation.


Now, the global window is different in that respect because we can't 
assume it is going to be triggered for unbounded streams. Thus, it makes 
sense to me to handle it differently by not using triggers but cleaning 
up once a watermark > MAX_TIMESTAMP has been processed.


-Max
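
A simplified sketch of that watermark-driven cleanup on the Flink side (namespace and state descriptor are hypothetical stand-ins for what Beam actually uses):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.runtime.state.KeyedStateBackend;

// Simplified sketch: once the watermark passes the global window's maximum
// timestamp, no further firings are possible, so state for every key can be
// dropped without ever registering per-key cleanup timers.
class GlobalWindowCleanup {
  static <K> void onWatermark(
      long watermarkMillis,
      long globalWindowMaxMillis,
      KeyedStateBackend<K> backend,
      ValueStateDescriptor<byte[]> descriptor) throws Exception {
    if (watermarkMillis > globalWindowMaxMillis) {
      backend.applyToAllKeys(
          "global", // hypothetical namespace; Beam encodes the window here
          StringSerializer.INSTANCE,
          descriptor,
          (key, state) -> state.clear());
    }
  }
}
```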

On 26.08.20 09:20, Jan Lukavský wrote:

On 8/25/20 9:27 PM, Maximilian Michels wrote:

I agree that this probably solves the described issue in the most 
straightforward way, but special handling for global window feels 
weird, as there is really nothing special about global window wrt 
state cleanup. 


Why is special handling for the global window weird? After all, it is 
a special case because the global window normally will only be cleaned 
up when the application terminates.


The inefficiency described happens if and only if the following two 
conditions are met:


  a) there are many timers per single window (as otherwise they will be 
negligible)


  b) there are many keys which actually contain no state (as otherwise 
the timer would be negligible wrt the state size)


It only happens to be the case that global window is the (by far, might 
be 98% cases) most common case that satisfies these two conditions, but 
there are other cases as well (e.g. long lasting fixed window). 
Discussed options 2) and 3) are systematic in the sense that option 2) 
cancels property a) and option 3) property b). Making use of correlation 
of global window with these two conditions to solve the issue is of 
course possible, but a little unsystematic and that's what feels 
'weird'. :)




It doesn't change anything wrt migration. The timers that were 
already set remain and keep on contributing to the state size.


That's ok, regular timers for non-global windows need to remain set 
and should be persisted. They will be redistributed when scaling up 
and down.


I'm not sure that's a "problem", rather an inefficiency. But we could 
address it by deleting the timers where they are currently set, as 
mentioned previously.


I had imagined that we don't even set these timers for the global 
window. Thus, there is no need to clean them up.


-Max

On 25.08.20 09:43, Jan Lukavský wrote:
I agree that this probably solves the described issue in the most 
straightforward way, but special handling for global window feels 
weird, as there is really nothing special about global window wrt 
state cleanup. A solution that handles all windows equally would be 
semantically 'cleaner'. If I try to sum up:


  - option 3) seems best, provided that isEmpty() lookup is cheap for 
every state backend (e.g. that we do not hit disk multiple times), 
this option is the best for state size wrt timers in all windows


  - option 2) works well for key-aligned windows, also reduces state 
size in all windows


  - option "watermark timer" - solves issue, easily implemented, but 
doesn't improve situation for non-global windows


My conclusion would be - use watermark timer as hotfix, if we can 
prove that isEmpty() would be cheap, then use option 3) as final 
solution, otherwise use 2).


WDYT?

On 8/25/20 5:48 AM, Thomas Weise wrote:



On Mon, Aug 24, 2020 at 1:50 PM Maximilian Michels <m...@apache.org> wrote:


    I'd suggest a modified option (2) which does not use a timer to
    perform
    the cleanup (as mentioned, this will cause problems with migrating
    state).


That's a great idea. It's essentially a mix of 1) and 2) for the 
global window only.


It doesn't change anything wrt migration. The timers that 
were already set remain and keep on contributing to the state size.


I'm not sure that's a "problem", rather an inefficiency. But we 
could address it by deleting the timers where they are currently 
set, as mentioned previously.



    Instead, whenever we receive a watermark which closes the global
    window,
    we enumerate all keys and cleanup the associated state.

    This is the cleanest and simplest option.

    -Max

    On 24.08.20 20:47, Thomas Weise wrote:
    >
    > On Mon, Aug 24, 2020 at 11:35 AM Jan Lukavský mailto:je...@seznam.cz>
    > &l

Re: [External] Re: Memory Issue When Running Beam On Flink

2020-08-25 Thread Maximilian Michels
I agree that this probably solves the described issue in the most straightforward way, but special handling for global window feels weird, as there is really nothing special about global window wrt state cleanup. 


Why is special handling for the global window weird? After all, it is a 
special case because the global window normally will only be cleaned up 
when the application terminates.



It doesn't change anything wrt migration. The timers that were already set 
remain and keep on contributing to the state size.


That's ok, regular timers for non-global windows need to remain set and 
should be persisted. They will be redistributed when scaling up and down.



I'm not sure that's a "problem", rather an inefficiency. But we could address 
it by deleting the timers where they are currently set, as mentioned previously.


I had imagined that we don't even set these timers for the global 
window. Thus, there is no need to clean them up.


-Max

On 25.08.20 09:43, Jan Lukavský wrote:
I agree that this probably solves the described issue in the most 
straightforward way, but special handling for global window feels weird, 
as there is really nothing special about global window wrt state 
cleanup. A solution that handles all windows equally would be 
semantically 'cleaner'. If I try to sum up:


  - option 3) seems best, provided that isEmpty() lookup is cheap for 
every state backend (e.g. that we do not hit disk multiple times), this 
option is the best for state size wrt timers in all windows


  - option 2) works well for key-aligned windows, also reduces state 
size in all windows


  - option "watermark timer" - solves issue, easily implemented, but 
doesn't improve situation for non-global windows


My conclusion would be - use watermark timer as hotfix, if we can prove 
that isEmpty() would be cheap, then use option 3) as final solution, 
otherwise use 2).


WDYT?

On 8/25/20 5:48 AM, Thomas Weise wrote:



On Mon, Aug 24, 2020 at 1:50 PM Maximilian Michels <m...@apache.org> wrote:


I'd suggest a modified option (2) which does not use a timer to
perform
the cleanup (as mentioned, this will cause problems with migrating
state).


That's a great idea. It's essentially a mix of 1) and 2) for the 
global window only.


It doesn't change anything wrt migration. The timers that 
were already set remain and keep on contributing to the state size.


I'm not sure that's a "problem", rather an inefficiency. But we could 
address it by deleting the timers where they are currently set, as 
mentioned previously.



Instead, whenever we receive a watermark which closes the global
window,
we enumerate all keys and cleanup the associated state.

This is the cleanest and simplest option.

-Max

On 24.08.20 20:47, Thomas Weise wrote:
>
> On Mon, Aug 24, 2020 at 11:35 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>      > The most general solution would be 3), given it can be
agnostic
>     to window types and does not assume extra runner capabilities.
>
>     Agree, 2) is optimization to that. It might be questionable
if this
>     is premature optimization, but generally querying multiple
states
>     for each clear opeartion to any state might be prohibitive,
mostly
>     when the state would be stored in external database (in case of
>     Flink that would be RocksDB).
>
> For the use case I'm looking at, we are using the heap state
backend. I
> have not checked the RocksDB, but would assume that incremental
cost of
> isEmpty() for other states under the same key is negligible?
>
>      > 3) wouldn't require any state migration.
>
>     Actually, it would, as we would (ideally) like to migrate users'
>     pipelines that already contain timers for the end of global
window,
>     which might not expire ever.
>
> Good catch. This could potentially be addressed by upgrading the
timer
> in the per record path.
>
>     On 8/24/20 7:44 PM, Thomas Weise wrote:
>>
>>     On Fri, Aug 21, 2020 at 12:32 AM Jan Lukavský
>>     <je...@seznam.cz> wrote:
>>
>>         If there are runners, that are unable to efficiently
enumerate
>>         keys in state, then there probably isn't a runner agnostic
>>         solution to this. If we focus on Flink, we can provide
>>         specific implementation of CleanupTimer, which might
then do
>>         anything from the mentioned options. I'd be +1 for
option 2)
>>         for key-aligned windows (all current

Re: [External] Re: Memory Issue When Running Beam On Flink

2020-08-24 Thread Maximilian Michels
We're allocating 11G for taskmanager JVM heap, but it eventually gets
filled up (after couple days) and the cluster ends up in a bad state.
Here's a screenshot of the heap size over the past 24h:
Screen Shot 2020-08-15 at 8.41.48 AM.png

Could it be that the timers never got cleared out, or maybe the pipeline
is creating more timer instances than expected?

On Sat, Aug 15, 2020 at 4:07 AM Maximilian Michels <m...@apache.org> wrote:

Awesome! Thanks a lot for the memory profile. Couple remarks:

a) I can see that there are about 378k keys and each of them sets a timer.
b) Based on the settings for DeduplicatePerKey you posted, you will keep
track of all keys of the last 30 minutes.

Unless you have

Re: [External] Re: Memory Issue When Running Beam On Flink

2020-08-15 Thread Maximilian Michels

Awesome! Thanks a lot for the memory profile. Couple remarks:

a) I can see that there are about 378k keys and each of them sets a timer.
b) Based on the settings for DeduplicatePerKey you posted, you will keep 
track of all keys of the last 30 minutes.


Unless you have much fewer keys, the behavior is to be expected. The 
memory sizes for the timer maps do not look particularly high (~12Mb).
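
For reference, here is a rough Java sketch of the state-and-timers pattern
that DeduplicatePerKey implements (your pipeline uses the Python version;
names here are assumed): one "seen" state cell plus one expiry timer per
key, which is why ~378k live keys translate into ~378k registered timers
for the 30-minute retention window.

import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class DeduplicateFn<K, V> extends DoFn<KV<K, V>, KV<K, V>> {

  @StateId("seen")
  private final StateSpec<ValueState<Boolean>> seenSpec =
      StateSpecs.value(BooleanCoder.of());

  @TimerId("expiry")
  private final TimerSpec expirySpec =
      TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext context,
      @StateId("seen") ValueState<Boolean> seen,
      @TimerId("expiry") Timer expiry) {
    if (seen.read() == null) {
      seen.write(true);
      // Both the state cell and this timer live until the timer fires,
      // 30 minutes after the key was first observed.
      expiry.offset(Duration.standardMinutes(30)).setRelative();
      context.output(context.element());
    }
  }

  @OnTimer("expiry")
  public void onExpiry(@StateId("seen") ValueState<Boolean> seen) {
    seen.clear();
  }
}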


How much memory did you reserve for the task managers?*

-Max

*The image links give me a "504 error".

On 14.08.20 23:29, Catlyn Kong wrote:

Hi!

We're indeed using the rocksdb state backend, so that might be part of 
the reason. Due to some security concerns, we might not be able to 
provide the full heap dump since we have some custom code path. But 
here's a screenshot from JProfiler:

Screen Shot 2020-08-14 at 9.10.07 AM.png
Looks like TimerHeapInternalTimer (initiated in InternalTimerServiceImpl 
<https://github.com/apache/flink/blob/5125b1123dfcfff73b5070401dfccb162959080c/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/InternalTimerServiceImpl.java#L46>) 
isn't getting garbage collected? As David has mentioned the pipeline 
uses DeduplicatePerKey 
<https://beam.apache.org/releases/pydoc/2.22.0/_modules/apache_beam/transforms/deduplicate.html#DeduplicatePerKey> in 
Beam 2.22, ProcessConnectionEventFn is a simple stateless DoFn that just 
does some logging and emits the events. Is there any possibility that 
the timer logic or the way it's used in the dedupe Pardo can cause this 
leak?


Thanks,
Catlyn

On Tue, Aug 11, 2020 at 7:58 AM Maximilian Michels <m...@apache.org> wrote:


Hi!

Looks like a potential leak, caused by your code or by Beam itself.
Would you be able to supply a heap dump from one of the task managers?
That would greatly help debugging this issue.

-Max

On 07.08.20 00:19, David Gogokhiya wrote:
 > Hi,
 >
 > We recently started using Apache Beam version 2.20.0 running on
Flink
 > version 1.9 deployed on kubernetes to process unbounded streams
of data.
 > However, we noticed that the memory consumed by stateful Beam is
 > steadily increasing over time with no drops no matter what the
current
 > bandwidth is. We were wondering if this is expected and if not what
 > would be the best way to resolve it.
 >
 >
 >       More Context
 >
 > We have the following pipeline that consumes messages from the
unbounded
 > stream of data. Later we deduplicate the messages based on unique
 > message id using the deduplicate function
 >

<https://beam.apache.org/releases/pydoc/2.22.0/_modules/apache_beam/transforms/deduplicate.html#DeduplicatePerKey>.

 > Since we are using Beam version 2.20.0, we copied the source code
of the
 > deduplicate function
 >

<https://beam.apache.org/releases/pydoc/2.22.0/_modules/apache_beam/transforms/deduplicate.html#DeduplicatePerKey>from

 > version 2.22.0. After that we unmap the tuple, retrieve the
necessary
 > data from message payload and dump the corresponding data into
the log.
 >
 >
 > Pipeline:
 >
 >
 > Flink configuration:
 >
 >
 > As we mentioned before, we noticed that the memory usage of the
 > jobmanager and taskmanager pod are steadily increasing with no
drops no
 > matter what the current bandwidth is. We tried allocating more
memory
 > but it seems like no matter how much memory we allocate it
eventually
 > reaches its limit and then it tries to restart itself.
 >
 >
 > Sincerely, David
 >
 >



Re: Output timestamp for Python event timers

2020-08-12 Thread Maximilian Michels

Thanks for your suggestions!

It makes sense to complete the work on this feature by exposing it in 
the Python API. We can do this as a next step. (There might be questions 
on how to do that exactly)


For now, I'm concerned with getting the semantics right and unblocking 
users from stalling pipelines.


I wasn't aware that processing timers used the input timestamp as the 
timer output timestamp. I've updated the PR accordingly. Please take a 
look: https://github.com/apache/beam/pull/12531
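
To illustrate the semantics with a Java sketch (the Java SDK already
exposes the feature; this DoFn fragment is hypothetical): the timer
output timestamp, not the fire timestamp, is what holds the output
watermark.

import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;
import org.joda.time.Instant;

class DelayedEmitFn extends DoFn<KV<String, String>, KV<String, String>> {

  @TimerId("emit")
  private final TimerSpec emitSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void processElement(ProcessContext context, @TimerId("emit") Timer emit) {
    Instant fireAt = context.timestamp().plus(Duration.standardHours(1));
    // Defaulting the output timestamp to the fire timestamp (instead of
    // the element's input timestamp) only holds the watermark to fireAt,
    // so downstream transforms keep making progress.
    emit.withOutputTimestamp(fireAt).set(fireAt);
  }

  @OnTimer("emit")
  public void onEmit(OnTimerContext context) {
    // Fires (and can output) at fireAt.
  }
}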


-Max

On 12.08.20 05:03, Luke Cwik wrote:
+1 on what Boyuan said. It is important that the defaults for processing 
time domain differ from the defaults for the event time domain.


On Tue, Aug 11, 2020 at 12:36 PM Yichi Zhang <zyi...@google.com> wrote:


+1 to expose set_output_timestamp and enrich python set timer api.

On Tue, Aug 11, 2020 at 12:01 PM Boyuan Zhang <boyu...@google.com> wrote:

Hi Maximilian,

It makes sense to set hold_timestamp as fire_timestamp when the
fire_timestamp is in the event time domain. Otherwise, the
system may advance the watermark incorrectly.
I think we can do something similar to Java FnApiRunner[1]:

  * Expose set_output_timestamp API to python timer as well
  * If set_output_timestamp is not specified and timer is in
event domain, we can use fire_timestamp as hold_timestamp
  * Otherwise, use input_timestamp as hold_timestamp.

What do you think?

[1]

https://github.com/apache/beam/blob/edb42952f6b0aa99477f5c7baca6d6a0d93deb4f/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L1433-L1493




On Tue, Aug 11, 2020 at 9:00 AM Maximilian Michels <m...@apache.org> wrote:

We ran into problems setting event time timers per-element
in the Python
SDK. Pipeline progress would stall.

Turns out, although the Python SDK does not expose the timer
output
timestamp feature to the user, it sets the timer output
timestamp to the
current input timestamp of an element.

This will lead to holding back the watermark until the timer
fires (the
Flink Runner respects the timer output timestamp when
advancing the
output watermark). We had set the fire timestamp to a
timestamp so far
in the future, that pipeline progress would completely stall
for
downstream transforms, due to the held back watermark.

Considering that this feature is not even exposed to the
user in the
Python SDK, I think we should set the default output
timestamp to the
fire timestamp, and not to the input timestamp. This is also
how timer
work in the Java SDK.

Let me know what you think.

-Max

PR: https://github.com/apache/beam/pull/12531



Output timestamp for Python event timers

2020-08-11 Thread Maximilian Michels
We ran into problems setting event time timers per-element in the Python 
SDK. Pipeline progress would stall.


Turns out, although the Python SDK does not expose the timer output 
timestamp feature to the user, it sets the timer output timestamp to the 
current input timestamp of an element.


This will lead to holding back the watermark until the timer fires (the 
Flink Runner respects the timer output timestamp when advancing the 
output watermark). We had set the fire timestamp to a timestamp so far 
in the future, that pipeline progress would completely stall for 
downstream transforms, due to the held back watermark.


Considering that this feature is not even exposed to the user in the 
Python SDK, I think we should set the default output timestamp to the 
fire timestamp, and not to the input timestamp. This is also how timers
work in the Java SDK.


Let me know what you think.

-Max

PR: https://github.com/apache/beam/pull/12531


Re: Memory Issue When Running Beam On Flink

2020-08-11 Thread Maximilian Michels

Hi!

Looks like a potential leak, caused by your code or by Beam itself. 
Would you be able to supply a heap dump from one of the task managers? 
That would greatly help debugging this issue.


-Max

On 07.08.20 00:19, David Gogokhiya wrote:

Hi,

We recently started using Apache Beam version 2.20.0 running on Flink 
version 1.9 deployed on kubernetes to process unbounded streams of data. 
However, we noticed that the memory consumed by stateful Beam is 
steadily increasing over time with no drops no matter what the current 
bandwidth is. We were wondering if this is expected and if not what 
would be the best way to resolve it.



  More Context

We have the following pipeline that consumes messages from the unbounded 
stream of data. Later we deduplicate the messages based on a unique 
message id using the deduplicate function. 
Since we are using Beam version 2.20.0, we copied the source code of the 
deduplicate function from 
version 2.22.0. After that we unmap the tuple, retrieve the necessary 
data from the message payload and dump the corresponding data into the log.



Pipeline:


Flink configuration:


As we mentioned before, we noticed that the memory usage of the 
jobmanager and taskmanager pod are steadily increasing with no drops no 
matter what the current bandwidth is. We tried allocating more memory 
but it seems like no matter how much memory we allocate it eventually 
reaches its limit and then it tries to restart itself.



Sincerely, David




Re: Failing Python builds & AppEngine application

2020-08-07 Thread Maximilian Michels

Thanks for letting us know, Tyson!

I think in terms of data, everything is being reported to the new 
InfluxDb instance now. Historic data is missing though, which would be 
nice to have. Also, the graphs are still not 100% on par with the old 
application.


Cheers,
Max

On 06.08.20 16:34, Tyson Hamilton wrote:
It was me! I disabled the App on the premise that it only hosted the old 
perf graphs that were replaced with Grafana. Thanks for fixing the issue.


Is there anything else on the app, or is there more migration to Grafana 
required, or cleanup unfinished?



On Thu, Aug 6, 2020, 4:30 AM Damian Gadomski <damian.gadom...@polidea.com> wrote:


Hey,

A strange thing happened a few hours ago. All python builds (e.g.
[1]) started failing because of:

google.api_core.exceptions.NotFound: 404 The project
apache-beam-testing does not exist or it does not contain an active
Cloud Datastore or Cloud Firestore database. Please visit
http://console.cloud.google.com to create a project or
https://console.cloud.google.com/datastore/setup?project=apache-beam-testing
to add a Cloud Datastore or Cloud Firestore database. Note that
Cloud Datastore or Cloud Firestore always have an associated App
Engine app and this app must not be disabled.

I've checked that manually and the same error appeared while
accessing [2]. Seems that we are using Cloud Datastore and indeed
there was a default AppEngine application [3] that was disabled and
therefore Datastore was inactive. I've just enabled back this app
and the Datastore became active again. Hopefully, that will fix the
builds. Based on the app statistics it seems that someone disabled
it around Aug 5, 21:00 UTC.

I saw the discussion on the devlist recently about the performance
monitoring. The app [3] is also serving the metrics on [4].
    CC +Maximilian Michels <m...@apache.org>, +Kamil Wasilewski
<kamil.wasilew...@polidea.com> - as you were involved in the
discussion there regarding [4]. Perhaps you know something more
about this app or at least who may know? :)

[1] https://ci-beam.apache.org/job/beam_PostCommit_Python37/2681/console
[2]
https://console.cloud.google.com/datastore?project=apache-beam-testing
[3]

https://console.cloud.google.com/appengine?project=apache-beam-testing=default
[4] https://apache-beam-testing.appspot.com

Regards,
Damian



Re: Monitoring performance for releases

2020-08-06 Thread Maximilian Michels
Robert, this is not too far off what I'm proposing. We can always create 
JIRA issues for performance regressions and mark them with a Fix 
Version. Especially, the time of the release is a good time to 
re-evaluate whether some gross performance regressions can be detected. 
Of course, if it's a more gradual and less noticeable one, we might 
miss it.


I agree that good performance is a continuous effort, but many times 
there are actual problems which can be dealt with at the time of the 
release. Those might be very hard to fix a couple of releases down the 
line because they are layered by new problems and much harder to find.


-Max

On 03.08.20 22:17, Robert Bradshaw wrote:
I have to admit I still have some qualms about tying detecting and 
fixing performance regressions to the release process (which is 
onerous enough as it is). Instead, I think we'd be better off with a 
separate process to detect and triage performance issues, which, when 
they occur, may merit filing a blocker which will require fixing before 
the release just like any other blocker would. Hopefully this would 
result in issues being detected (and resolved) sooner.


That being said, if a release is known to have performance regressions, 
that should be called out when the RCs are cut, and if not resolved, 
probably as part of the release notes as well.


On Mon, Aug 3, 2020 at 9:40 AM Maximilian Michels <m...@apache.org> wrote:


Here a first version of the updated release guide:
https://github.com/apache/beam/pull/12455

Feel free to comment.

-Max

On 29.07.20 17:27, Maximilian Michels wrote:
 > Thanks! I'm following up with this PR to display the Flink Pardo
 > streaming data: https://github.com/apache/beam/pull/12408
 >
 > Streaming data appears to be missing for Dataflow. We can revise the
 > Jenkins jobs to add those.
 >
 > -Max
 >
 > On 29.07.20 17:01, Tyson Hamilton wrote:
 >> Max,
 >>
 >> The runner dimension are present when hovering over a particular
 >> graph. For some more info, the load test configurations can be
found
 >> here [1]. I didn't get a chance to look into them but there are
tests
 >> for all the runners there, possibly not for every loadtest.
 >>
 >> [1]: https://github.com/apache/beam/tree/master/.test-infra/jenkins
 >>
 >> -Tyson
 >>
 >> On Wed, Jul 29, 2020 at 3:46 AM Maximilian Michels <m...@apache.org> wrote:
 >>
 >>     Looks like the permissions won't be necessary because backup
data
 >> gets
     >>     loaded into the local InfluxDb instance which makes writing
queries
 >>     locally possible.
 >>
 >>     On 29.07.20 12:21, Maximilian Michels wrote:
 >>  > Thanks Michał!
 >>  >
 >>  > It is a bit tricky to verify the exported query works if
I don't
 >>     have
 >>  > access to the data stored in InfluxDb.
 >>  >
 >>  > ==> Could somebody give me permissions to max.mich...@gmail.com for
 >>  > apache-beam-testing such that I can setup a ssh
port-forwarding
 >>     from the
 >>  > InfluxDb pod to my machine? I do have access to see the
pods but
 >>     that is
 >>  > not enough.
 >>  >
 >>  >> I think that the only test data is from Python streaming
tests,
 >>     which
 >>  >> are not implemented right now (check out
 >>  >>
 >>
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python

 >>
 >>
 >>  >> )
 >>  >
 >>  > Additionally, there is an entire dimension missing:
Runners. I'm
 >>  > assuming this data is for Dataflow?
 >>  >
 >>  > -Max
 >>  >
 >>  > On 29.07.20 11:55, Michał Walenia wrote:
 >>  >> Hi there,
 >>  >>
 >>  >>  > Indeed the Python load test data appears to be missing:
 >>  >>  >
 >>  >>
 >>
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python

 >>
 >>
 >>  >>
 >>  >>

Re: Use Coder message for cross-lang ExternalConfigurationPayload?

2020-08-05 Thread Maximilian Michels

+1

The format to store coders is not set in stone, it was a first version 
to make external configuration work. Using the Coder message would be 
better.
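
To make the difference concrete, a small Java sketch (helper name and
schema variable are assumptions on my part) of a row coder expressed as
a full Coder message, which a bare URN list cannot represent because the
schema travels in the FunctionSpec payload:

import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.model.pipeline.v1.SchemaApi;

class RowCoderProtoExample {
  // "schema" is the SchemaApi.Schema proto describing the row's fields,
  // assumed to be built elsewhere.
  static RunnerApi.Coder buildRowCoderProto(SchemaApi.Schema schema) {
    return RunnerApi.Coder.newBuilder()
        .setSpec(
            RunnerApi.FunctionSpec.newBuilder()
                .setUrn("beam:coder:row:v1")
                .setPayload(schema.toByteString()))
        .build();
  }
}

A plain list of coder URNs has nowhere to put that payload.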


As for using Schema to store the configuration, could somebody fill me 
in how that would work?


-Max

On 04.08.20 02:01, Brian Hulette wrote:
I've opened BEAM-10571 [1] for this, and I'm most of the way to an 
implementation now. Aiming to have it done before the 2.24.0 cut since 
it will be the last release with python 2 support.


[1] https://issues.apache.org/jira/browse/BEAM-10571

On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath wrote:




On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw <rober...@google.com> wrote:

On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette <bhule...@google.com> wrote:
 >
 > Ah yes I'm +1 for that approach too - it would let us
leverage all the schema-inference already in the Java SDK for
translating configuration objects which would be great.
 > Things on the Python side would be trickier as schemas don't
formally support all the types you can use in the PayloadBuilder
implementations [1] yet, just NamedTuple. For now we could just
make the PayloadBuilder implementations generate Rows without
making that translation available for use in PCollections.


This will be a good opportunity to add some sort of a minimal Python
type to Beam schema mapping :)


Yes, though eventually it might be nice to support all of these
various types as schema'd PCollection elements as well.

 > Do we need to worry about update compatibility for
ExternalConfigurationPayload?

Technically, each URN defines its own payload, and the fact that we've
settled on ExternalConfigurationPayload is a convention. On a
practical note, we haven't declared these protos stable yet. (I
would
like to do so before we drop support for Python 2, as external
transforms are a possible escape hatch and the first strong
motivation
to have external transforms that span Beam versions).


+1


 > [1]

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
 >
 > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
mailto:rober...@google.com>> wrote:
 >>
 >> I would be in favor of just using a schema to store the entire
 >> configuration. The reason we went with what we have to day
is that we
 >> didn't have cross language schemas yet.
 >>
 >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
mailto:bhule...@google.com>> wrote:
 >> >
 >> > Hi everyone,
 >> > I noticed that currently the ExternalConfigurationPayload
uses a list of coder URNs to represent the coder that was used
to serialize each configuration field [1]. This seems acceptable
at first blush, but there's one notable issue: it has no place
to store a payload for the coder. Most standard coders don't use
a payload so it's not a problem, but row coder does use a
payload to store its schema, which means it can't be used in an
ExternalConfigurationPayload today.
 >> >
 >> > Is there a reason not to just use the Coder message [2] in
ExternalConfigurationPayload instead of a list of coder URNs?
That would work with row coder, and it would also make it easier
to re-use logic for translating Pipeline protos.
 >> >
 >> > I'd be happy to make this change, but I wanted to ask on
dev@ in case there's something I'm missing here.
 >> >
 >> > Brian
 >> >
 >> > [1]

https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
 >> > [2]

https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555



Re: Monitoring performance for releases

2020-08-03 Thread Maximilian Michels
Here a first version of the updated release guide: 
https://github.com/apache/beam/pull/12455


Feel free to comment.

-Max

On 29.07.20 17:27, Maximilian Michels wrote:
Thanks! I'm following up with this PR to display the Flink Pardo 
streaming data: https://github.com/apache/beam/pull/12408


Streaming data appears to be missing for Dataflow. We can revise the 
Jenkins jobs to add those.


-Max

On 29.07.20 17:01, Tyson Hamilton wrote:

Max,

The runner dimension is present when hovering over a particular 
graph. For some more info, the load test configurations can be found 
here [1]. I didn't get a chance to look into them but there are tests 
for all the runners there, possibly not for every loadtest.


[1]: https://github.com/apache/beam/tree/master/.test-infra/jenkins

-Tyson

On Wed, Jul 29, 2020 at 3:46 AM Maximilian Michels <m...@apache.org> wrote:


    Looks like the permissions won't be necessary because backup data 
gets

    loaded into the local InfluxDb instance which makes writing queries
    locally possible.

    On 29.07.20 12:21, Maximilian Michels wrote:
 > Thanks Michał!
 >
 > It is a bit tricky to verify the exported query works if I don't
    have
 > access to the data stored in InfluxDb.
 >
 > ==> Could somebody give me permissions to max.mich...@gmail.com
    <mailto:max.mich...@gmail.com> for
 > apache-beam-testing such that I can setup a ssh port-forwarding
    from the
 > InfluxDb pod to my machine? I do have access to see the pods but
    that is
 > not enough.
 >
 >> I think that the only test data is from Python streaming tests,
    which
 >> are not implemented right now (check out
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 



 >> )
 >
 > Additionally, there is an entire dimension missing: Runners. I'm
 > assuming this data is for Dataflow?
 >
 > -Max
 >
 > On 29.07.20 11:55, Michał Walenia wrote:
 >> Hi there,
 >>
 >>  > Indeed the Python load test data appears to be missing:
 >>  >
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python 



 >>
 >>
 >> I think that the only test data is from Python streaming tests,
    which
 >> are not implemented right now (check out
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 



 >> )
 >>
 >> As for updating the dashboards, the manual for doing this is 
here:

 >>

https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics#CommunityMetrics-UpdatingDashboards 



 >>
 >>
 >> I hope this helps,
 >>
 >> Michal
 >>
 >> On Mon, Jul 27, 2020 at 4:31 PM Maximilian Michels
    mailto:m...@apache.org>
 >> <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >>
 >>     Indeed the Python load test data appears to be missing:
 >>
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python 



 >>
 >>
 >>     How do we typically modify the dashboards?
 >>
 >>     It looks like we need to edit this json file:
 >>
 >>

https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81 



 >>
 >>
 >>     I found some documentation on the deployment:
 >>
 >>

https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring

 >>
 >>
 >>     +1 for alerting or weekly emails including performance
    numbers for
 >>     fixed
 >>     intervals (1d, 1w, 1m, previous release).
 >>
 >>     +1 for linking the dashboards in the release guide to allow
    for a
 >>     comparison as part of the release process.
 >>
 >>     As a first step, consolidating all the data seems like the 
most

 >>     pressing
 >>     problem to solve.
 >>
 >>     @Kamil I could need some advice regarding how to proceed
    updating the
 >>     dashboards.
 >>
 >>     -Max
 >>
 >>     On 22.07.20 20:20, Robert Bradshaw wrote:
 >>  > On Tue, Jul 21, 2020 at 9:58 AM Thomas Weise
    mailto:t...@apache.org>
 >>     <mailto:t...@apache.org <mailto:t...@apache.org>>
 >>  > <mailto:t...@apache.org <mailto:t...@apache.org>
    <mailto:t...@apache.org <mailto:t

Re: Making reporting bugs/feature request easier

2020-07-30 Thread Maximilian Michels
+1 for making it easier to report bugs. Very often we have users report a 
bug and then they have to rely on a developer to report the issue 
because they don't want to bother creating a JIRA account (which is 
understandable). If you have a JIRA account it is easy because you don't 
need any additional permissions to create a ticket.


We would need to tag such externally created issues and triage them from 
time to time. That requires a regular review of some sort. We just have 
to put something in place that will remind us, e.g. a weekly email.


Generally, I'm against using more than one bug tracker. There is nothing 
inherently wrong with JIRA and we have our processes mapped there. 
However, if Github Issues would lower the barrier significantly, I 
suppose we could use it as an Inbox, but JIRA should remain the source 
of truth.


-Max

On 30.07.20 04:12, Kenneth Knowles wrote:



On Wed, Jul 29, 2020 at 11:08 AM Robert Bradshaw wrote:


+1 to a simple link that fills in most of the fields in JIRA, though
this doesn't solve the issue of having to sign up just to submit a
bug report. Just using the users list isn't a bad idea either--we
could easily create a script that ensures all threads that have a
message like "we should file a JIRA for this" are followed up with a
message like "JIRA filed at ...". (That doesn't mean it won't
languish on the tracker.)

I think it's worth seriously considering just using Github's issue
tracker, since that's where our users are. Is there anything
we actually use in JIRA that'd be missing?



Pretty big question. Just noting to start that Apache projects certainly 
can and do use GitHub issues. Here is a quick inventory of things that 
are used in a meaningful way:


  - Priorities (with GitHub Issues I think you roll your own with labels)
  - Issue types (with GitHub Issues I think you roll your own with labels)
  - Distinct "Triage Needed" state (also labels; anything lacking the 
"triaged" label)
  - Distinguishing "Open" and "In Progress" (also labels? can use 
Projects/Milestones - I forget exactly which - to keep a kanban-ish status)
  - Our new automation: "stale-assigned" and subsequent unassign; 
"stale-P2" and subsequent downgrade

  - Fix Version for associating fixes with releases
  - Affect Version, while not used much, is still helpful to have
  - Components, since our repo is really a mini mono repo. Again, labels.
  - Kanban boards (milestones/projects maybe kinda)
  - Reports (not really same level, but maybe OK?)

Fairly recently I worked on a project that tried to use GitHub Issues 
and Projects and Milestones and whatnot and it was OK but not great. 
Jira's complexity is largely essential / not really complex but just 
visually busy. The two are not really even comparable offerings. There 
may be third party integrations that add some of what you'd want.


Kenn


On Wed, Jul 29, 2020 at 10:30 AM Kenneth Knowles <k...@apache.org> wrote:

Very good points. We want the barrier to filing a bug report to
be as low as possible. Jira adds complexity of a sign up process
and a complex interface, and also users don't know what the
fields mean (issue type, priority, component, tag, fix version,
etc). So it currently really doesn't make sense to just point
users at Jira and expect them to figure it out.

As for using user@beam for bug reports:

  - In practice, we have to edit most of the fields and improve
the bug description anyhow, so it might be no extra work for an
experienced contributor to file the bug based on the user's email.
  - Also in practice, we already do this. So it is just a matter
of pointing users that way.

One downside is that there is not really tracking of resolution
of an email thread, so unless it gets filed as a Jira it may sit
unresolved.

Another option we could consider: I think we could have a
"report a bug / feature request" link that fills in most fields
and gives the user a simplified view (like GitHub issue view
where it is just title & body). It could end up that these Jiras
get ignored just as easily as a user@beam thread.

You can always have a link like that and it could point to
whatever the current choice is, like
"mailto:u...@beam.apache.org "
though I think mailto links are out of fashion these days.

Kenn

On Fri, Jul 24, 2020 at 2:50 PM Griselda Cuevas <g...@apache.org> wrote:

Hi folks,

I recently made a few Jira boards [1][2] to help triage and
organize Beam's backlog.


Something that came to my mind is that reporting bugs and
feature requests using Jira might be imposing a high barrier
for users who don't have an 

Re: Program and registration for Beam Digital Summit

2020-07-29 Thread Maximilian Michels
Thanks Pedro! Great to see the program! This is going to be an exciting 
event.


Forwarding to the dev mailing list, in case people didn't see this here.

-Max

On 29.07.20 20:25, Pedro Galvan wrote:

Hello!

Just a quick message to let everybody know that we have published the 
program for the Beam Digital Summit. It is available at 
https://2020.beamsummit.org/program


With more than 30 talks and workshops covering all the scope from 
introductory sessions to advanced scenarios and use cases, we hope that 
everybody will find useful content at Beam Digital Summit.


Beam Digital Summit will broadcast through the Crowdcast platform. It is 
a free event but you need to register. Please visit 
https://www.crowdcast.io/e/beamsummit 
 to register.



--
*Pedro Galvan*
Beam Digital Summit Team


Re: Monitoring performance for releases

2020-07-29 Thread Maximilian Michels
Thanks! I'm following up with this PR to display the Flink Pardo 
streaming data: https://github.com/apache/beam/pull/12408


Streaming data appears to be missing for Dataflow. We can revise the 
Jenkins jobs to add those.


-Max

On 29.07.20 17:01, Tyson Hamilton wrote:

Max,

The runner dimension is present when hovering over a particular graph. 
For some more info, the load test configurations can be found here [1]. 
I didn't get a chance to look into them but there are tests for all the 
runners there, possibly not for every loadtest.


[1]: https://github.com/apache/beam/tree/master/.test-infra/jenkins

-Tyson

On Wed, Jul 29, 2020 at 3:46 AM Maximilian Michels <m...@apache.org> wrote:


Looks like the permissions won't be necessary because backup data gets
loaded into the local InfluxDb instance which makes writing queries
locally possible.

On 29.07.20 12:21, Maximilian Michels wrote:
 > Thanks Michał!
 >
 > It is a bit tricky to verify the exported query works if I don't
have
 > access to the data stored in InfluxDb.
 >
 > ==> Could somebody give me permissions to max.mich...@gmail.com for
 > apache-beam-testing such that I can setup a ssh port-forwarding
from the
 > InfluxDb pod to my machine? I do have access to see the pods but
that is
 > not enough.
 >
 >> I think that the only test data is from Python streaming tests,
which
 >> are not implemented right now (check out
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python

 >> )
 >
 > Additionally, there is an entire dimension missing: Runners. I'm
 > assuming this data is for Dataflow?
 >
 > -Max
 >
 > On 29.07.20 11:55, Michał Walenia wrote:
 >> Hi there,
 >>
 >>  > Indeed the Python load test data appears to be missing:
 >>  >
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python

 >>
 >>
 >> I think that the only test data is from Python streaming tests,
which
 >> are not implemented right now (check out
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python

 >> )
 >>
 >> As for updating the dashboards, the manual for doing this is here:
 >>

https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics#CommunityMetrics-UpdatingDashboards

 >>
 >>
 >> I hope this helps,
 >>
 >> Michal
 >>
 >> On Mon, Jul 27, 2020 at 4:31 PM Maximilian Michels
mailto:m...@apache.org>
 >> <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
 >>
 >>     Indeed the Python load test data appears to be missing:
 >>
 >>

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python

 >>
 >>
 >>     How do we typically modify the dashboards?
 >>
 >>     It looks like we need to edit this json file:
 >>
 >>

https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81

 >>
 >>
 >>     I found some documentation on the deployment:
 >>
 >>
https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring
 >>
 >>
 >>     +1 for alerting or weekly emails including performance
numbers for
 >>     fixed
 >>     intervals (1d, 1w, 1m, previous release).
 >>
 >>     +1 for linking the dashboards in the release guide to allow
for a
 >>     comparison as part of the release process.
 >>
 >>     As a first step, consolidating all the data seems like the most
 >>     pressing
 >>     problem to solve.
 >>
 >>     @Kamil I could need some advice regarding how to proceed
updating the
 >>     dashboards.
 >>
 >>     -Max
 >>
 >>     On 22.07.20 20:20, Robert Bradshaw wrote:
 >>  > On Tue, Jul 21, 2020 at 9:58 AM Thomas Weise
mailto:t...@apache.org>
 >>     <mailto:t...@apache.org <mailto:t...@apache.org>>
 >>  > <mailto:t...@apache.org <mailto:t...@apache.org>
<mailto:t...@apache.org <mailto:t...@apache.org>>>> wrote:
 >>  >
 >>  >     It appears that there is coverage missing in the Grafana
 >>     dashboards
 >>    

Re: Monitoring performance for releases

2020-07-29 Thread Maximilian Michels
Looks like the permissions won't be necessary because backup data gets 
loaded into the local InfluxDb instance which makes writing queries 
locally possible.


On 29.07.20 12:21, Maximilian Michels wrote:

Thanks Michał!

It is a bit tricky to verify the exported query works if I don't have 
access to the data stored in InfluxDb.


==> Could somebody give me permissions to max.mich...@gmail.com for 
apache-beam-testing such that I can setup a ssh port-forwarding from the 
InfluxDb pod to my machine? I do have access to see the pods but that is 
not enough.


I think that the only test data is from Python streaming tests, which 
are not implemented right now (check out 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 
)


Additionally, there is an entire dimension missing: Runners. I'm 
assuming this data is for Dataflow?


-Max

On 29.07.20 11:55, Michał Walenia wrote:

Hi there,

 > Indeed the Python load test data appears to be missing:
 > 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python 



I think that the only test data is from Python streaming tests, which 
are not implemented right now (check out 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 
)


As for updating the dashboards, the manual for doing this is here: 
https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics#CommunityMetrics-UpdatingDashboards 



I hope this helps,

Michal

On Mon, Jul 27, 2020 at 4:31 PM Maximilian Michels <m...@apache.org> wrote:


    Indeed the Python load test data appears to be missing:

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python 



    How do we typically modify the dashboards?

    It looks like we need to edit this json file:

https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81 



    I found some documentation on the deployment:

https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring



    +1 for alerting or weekly emails including performance numbers for
    fixed
    intervals (1d, 1w, 1m, previous release).

    +1 for linking the dashboards in the release guide to allow for a
    comparison as part of the release process.

    As a first step, consolidating all the data seems like the most
    pressing
    problem to solve.

    @Kamil I could need some advice regarding how to proceed updating the
    dashboards.

    -Max

    On 22.07.20 20:20, Robert Bradshaw wrote:
 > On Tue, Jul 21, 2020 at 9:58 AM Thomas Weise mailto:t...@apache.org>
 > <mailto:t...@apache.org <mailto:t...@apache.org>>> wrote:
 >
 >     It appears that there is coverage missing in the Grafana
    dashboards
 >     (it could also be that I just don't find it).
 >
 >     For example:
 >

https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056 


 >
 >     The GBK and ParDo tests have a selection for {batch,
    streaming} and
 >     SDK. No coverage for streaming and python? There is also no
    runner
 >     option currently.
 >
 >     We have seen repeated regressions with streaming, Python,
    Flink. The
 >     test has been contributed. It would be great if the results
    can be
 >     covered as part of release verification.
 >
 >
 > Even better would be if we can use these dashboards (plus
    alerting or
 > similar?) to find issues before release verification. It's much
    easier
 > to fix things earlier.
 >
 >
 >     Thomas
 >
 >
 >
 >     On Tue, Jul 21, 2020 at 7:55 AM Kamil Wasilewski
 >     mailto:kamil.wasilew...@polidea.com>
    <mailto:kamil.wasilew...@polidea.com
    <mailto:kamil.wasilew...@polidea.com>>>
 >     wrote:
 >
 >             The prerequisite is that we have all the stats in one
    place.
 >             They seem
 >             to be scattered across 
http://metrics.beam.apache.org and

 > https://apache-beam-testing.appspot.com.
 >
 >             Would it be possible to consolidate the two, i.e. 
use the

 >             Grafana-based
 >             dashboard to load the legacy stats?
 >
 >
 >         I'm pretty sure that all dashboards have been moved to
 > http://metrics.beam.apache.org. Let me know if I missed
 >         something during the migration.
 >
 >         I think we should turn off
 > https://apache-beam-testing.appspot.com in the near future. New
 >         Grafana-based dashboards have been working seamlessly for
    some
 >         time now and there's no point in ma

Re: Monitoring performance for releases

2020-07-29 Thread Maximilian Michels

Thanks Michał!

It is a bit tricky to verify the exported query works if I don't have 
access to the data stored in InfluxDb.


==> Could somebody give me permissions to max.mich...@gmail.com for 
apache-beam-testing such that I can set up an ssh port-forwarding from the 
InfluxDb pod to my machine? I do have access to see the pods but that is 
not enough.



I think that the only test data is from Python streaming tests, which are not 
implemented right now (check out 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python
 )


Additionally, there is an entire dimension missing: Runners. I'm 
assuming this data is for Dataflow?


-Max

On 29.07.20 11:55, Michał Walenia wrote:

Hi there,

 > Indeed the Python load test data appears to be missing:
 > 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python


I think that the only test data is from Python streaming tests, which 
are not implemented right now (check out 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=batch=python 
)


As for updating the dashboards, the manual for doing this is here: 
https://cwiki.apache.org/confluence/display/BEAM/Community+Metrics#CommunityMetrics-UpdatingDashboards


I hope this helps,

Michal

On Mon, Jul 27, 2020 at 4:31 PM Maximilian Michels <m...@apache.org> wrote:


Indeed the Python load test data appears to be missing:

http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python

How do we typically modify the dashboards?

It looks like we need to edit this json file:

https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81

I found some documentation on the deployment:
https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring


+1 for alerting or weekly emails including performance numbers for
fixed
intervals (1d, 1w, 1m, previous release).

+1 for linking the dashboards in the release guide to allow for a
comparison as part of the release process.

As a first step, consolidating all the data seems like the most
pressing
problem to solve.

@Kamil I could need some advice regarding how to proceed updating the
dashboards.

-Max

On 22.07.20 20:20, Robert Bradshaw wrote:
 > On Tue, Jul 21, 2020 at 9:58 AM Thomas Weise mailto:t...@apache.org>
 > <mailto:t...@apache.org <mailto:t...@apache.org>>> wrote:
 >
 >     It appears that there is coverage missing in the Grafana
dashboards
 >     (it could also be that I just don't find it).
 >
 >     For example:
 >
https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056
 >
 >     The GBK and ParDo tests have a selection for {batch,
streaming} and
 >     SDK. No coverage for streaming and python? There is also no
runner
 >     option currently.
 >
 >     We have seen repeated regressions with streaming, Python,
Flink. The
 >     test has been contributed. It would be great if the results
can be
 >     covered as part of release verification.
 >
 >
 > Even better would be if we can use these dashboards (plus
alerting or
 > similar?) to find issues before release verification. It's much
easier
 > to fix things earlier.
 >
 >
 >     Thomas
 >
 >
 >
 >     On Tue, Jul 21, 2020 at 7:55 AM Kamil Wasilewski
 >     mailto:kamil.wasilew...@polidea.com>
<mailto:kamil.wasilew...@polidea.com
<mailto:kamil.wasilew...@polidea.com>>>
 >     wrote:
 >
 >             The prerequisite is that we have all the stats in one
place.
 >             They seem
 >             to be scattered across http://metrics.beam.apache.org and
 > https://apache-beam-testing.appspot.com.
 >
 >             Would it be possible to consolidate the two, i.e. use the
 >             Grafana-based
 >             dashboard to load the legacy stats?
 >
 >
 >         I'm pretty sure that all dashboards have been moved to
 > http://metrics.beam.apache.org. Let me know if I missed
 >         something during the migration.
 >
 >         I think we should turn off
 > https://apache-beam-testing.appspot.com in the near future. New
 >         Grafana-based dashboards have been working seamlessly for
some
 >         time now and there's no point in maintaining the older
solution.
 >         We'd also avoid ambiguity in where the stats should be
looked for.
 >
 >         Kamil
 >
 >         On Tue, Jul 21, 

Re: Use concrete instances of ExternalTransformBuilder in ExternalTransformRegistrar?

2020-07-28 Thread Maximilian Michels

Replacing

  Class<? extends ExternalTransformBuilder>

with

  ExternalTransformBuilder

sounds reasonable to me. Looks like an oversight that we introduced the 
unnecessary class indirection.
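
A backwards-compatible sketch of the change (the method name
knownBuilderInstances and the default implementation are my assumption,
not settled API):

import java.util.HashMap;
import java.util.Map;

public interface ExternalTransformRegistrar {

  /** The existing method, kept for compatibility. */
  Map<String, Class<? extends ExternalTransformBuilder<?, ?, ?>>> knownBuilders();

  /** New method returning concrete builder instances; the default
   * instantiates the classes from the old method via reflection. */
  default Map<String, ExternalTransformBuilder<?, ?, ?>> knownBuilderInstances() {
    Map<String, ExternalTransformBuilder<?, ?, ?>> instances = new HashMap<>();
    knownBuilders()
        .forEach(
            (urn, clazz) -> {
              try {
                instances.put(urn, clazz.getDeclaredConstructor().newInstance());
              } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
              }
            });
    return instances;
  }
}

SchemaIOBuilder could then wrap a SchemaIOProvider instance and be
registered once per provider discovered via ServiceLoader.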


-Max

On 27.07.20 20:45, Chamikara Jayalath wrote:
Brian's suggestion makes sense to me. I don't know of a specific reason 
why we chose the Class type in the registrar instead of 
instance types. +Maximilian Michels <m...@apache.org> +Robert 
Bradshaw <rober...@google.com> may have more context.


Thanks,
Cham

On Mon, Jul 27, 2020 at 10:48 AM Kenneth Knowles <k...@apache.org> wrote:




On Mon, Jul 27, 2020 at 10:47 AM Kenneth Knowles <k...@apache.org> wrote:

On Sun, Jul 26, 2020 at 8:50 PM Kenneth Knowles <k...@apache.org> wrote:

Rawtypes are a legacy compatibility feature that breaks type
checking (and further analyses)


Noting for the benefit of the thread that this is not
hypothetical. Fixing the rawtypes in this API surfaced
nullability issues according to spotbugs.


Additionally notable that Spotbugs operates on post-compile
bytecode, not source.

Kenn


Kenn



and harms readability. They should be forbidden in new code.
Class literals for generic types are quite inconvenient for
this, especially when placed in a heterogeneous map using
wildcard parameters [1].

So making either the change Brian proposes or something
similar is desirable, to avoid forcing inconvenience on
users of the API, and to just simplify and clarify it.

Kenn

[1]

https://github.com/apache/beam/pull/12376/files#diff-2fa38a7f8d24217f1f7bde0f5c7dbb40R495

Kenn

On Fri, Jul 24, 2020 at 11:04 AM Brian Hulette
mailto:bhule...@google.com>> wrote:

Hi all,
I've been working with +Scott Lukas
<slu...@google.com> on using the new schema io
interfaces [1] in cross-language. This means adding a
general-purpose ExternalTransformRegistrar [2,3] that
will register all SchemaIOProvider implementations via
ServiceLoader.

We've run into an issue though -
ExternalTransformRegistrar is supposed to return a
`Map<String, Class<? extends ExternalTransformBuilder>>`. This makes it very
challenging (impossible?) for us to create a
general-purpose ExternalTransformBuilder that defers to
SchemaIOProvider. Ideally we would instead return a
Map<String, ExternalTransformBuilder> (i.e. a concrete
instance rather than a class object), so that we could
just register different instances of a class like:

class SchemaIOBuilder extends ExternalTransformBuilder<ConfigT, InputT, OutputT> {
   private SchemaIOProvider provider;
   PTransform<InputT, OutputT> buildExternal(ConfigT configuration) {
     // Use provider to create PTransform
   }
}

I think it would be possible to change the
ExternalTransformRegistrar interface so it has a single
method, Map<String, ExternalTransformBuilder>
knownBuilders(). It could even be done in a
backwards-compatible way if we keep the old method and
provide a default implementation of the new method that
builds instances.

However, I'm curious if there's some strong reason for
using Class as the
return type for knownBuilders that I'm missing. Does
anyone know why we chose that?

Thanks,
Brian

[1] https://s.apache.org/beam-schema-io
[2]

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ExternalTransformRegistrar.java
[3]

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ExternalTransformBuilder.java



Re: Monitoring performance for releases

2020-07-27 Thread Maximilian Michels
Indeed the Python load test data appears to be missing: 
http://metrics.beam.apache.org/d/MOi-kf3Zk/pardo-load-tests?orgId=1=streaming=python


How do we typically modify the dashboards?

It looks like we need to edit this json file: 
https://github.com/apache/beam/blob/8d460db620d2ff1257b0e092218294df15b409a1/.test-infra/metrics/grafana/dashboards/perftests_metrics/ParDo_Load_Tests.json#L81


I found some documentation on the deployment: 
https://cwiki.apache.org/confluence/display/BEAM/Test+Results+Monitoring



+1 for alerting or weekly emails including performance numbers for fixed 
intervals (1d, 1w, 1m, previous release).


+1 for linking the dashboards in the release guide to allow for a 
comparison as part of the release process.


As a first step, consolidating all the data seems like the most pressing 
problem to solve.


@Kamil I could need some advice regarding how to proceed updating the 
dashboards.


-Max

On 22.07.20 20:20, Robert Bradshaw wrote:
On Tue, Jul 21, 2020 at 9:58 AM Thomas Weise <t...@apache.org> wrote:


It appears that there is coverage missing in the Grafana dashboards
(it could also be that I just don't find it).

For example:
https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056

The GBK and ParDo tests have a selection for {batch, streaming} and
SDK. No coverage for streaming and python? There is also no runner
option currently.

We have seen repeated regressions with streaming, Python, Flink. The
test has been contributed. It would be great if the results can be
covered as part of release verification.


Even better would be if we can use these dashboards (plus alerting or 
similar?) to find issues before release verification. It's much easier 
to fix things earlier.



Thomas



On Tue, Jul 21, 2020 at 7:55 AM Kamil Wasilewski
mailto:kamil.wasilew...@polidea.com>>
wrote:

The prerequisite is that we have all the stats in one place.
They seem
to be scattered across http://metrics.beam.apache.org and
https://apache-beam-testing.appspot.com.

Would it be possible to consolidate the two, i.e. use the
Grafana-based
dashboard to load the legacy stats?


I'm pretty sure that all dashboards have been moved to
http://metrics.beam.apache.org. Let me know if I missed
something during the migration.

I think we should turn off
https://apache-beam-testing.appspot.com in the near future. New
Grafana-based dashboards have been working seamlessly for some
time now and there's no point in maintaining the older solution.
We'd also avoid ambiguity in where the stats should be looked for.

Kamil

On Tue, Jul 21, 2020 at 4:17 PM Maximilian Michels <m...@apache.org> wrote:

 > It doesn't support https. I had to add an exception to
the HTTPS Everywhere extension for "metrics.beam.apache.org
<http://metrics.beam.apache.org>".

*facepalm* Thanks Udi! It would always hang on me because I
use HTTPS
Everywhere.

 > To be explicit, I am supporting the idea of reviewing the
release guide but not changing the release process for the
already in-progress release.

I consider the release guide immutable for the process of a
release.
Thus, a change to the release guide can only affect new
upcoming
releases, not an in-process release.

 > +1 and I think we can also evaluate whether flaky tests
should be reviewed as release blockers or not. Some flaky
tests would be hiding real issues our users could face.

Flaky tests are also worth to take into account when
releasing, but a
little harder to find because may just happen to pass during
building
the release. It is possible though if we strictly capture
flaky tests
via JIRA and mark them with the Fix Version for the release.

 > We keep accumulating dashboards and
 > tests that few people care about, so it is probably worth
that we use
 > them or get a way to alert us of regressions during the
release cycle
 > to catch this even before the RCs.

+1 The release guide should be explicit about which
performance test
results to evaluate.

The prerequisite is that we have all the stats in one place.
They seem
to be scattered across http://metrics.beam.apache.org and
https://apache-beam-testing.appspot.com.

Would it be possible to consolidate the two, i.e. use the
Gr

Re: [VOTE] Make Apache Beam 2.24.0 the final release supporting Python 2.

2020-07-24 Thread Maximilian Michels

+1

On 24.07.20 18:54, Pablo Estrada wrote:

+1 - thank you Valentyn!
-P.

On Thu, Jul 23, 2020 at 1:29 PM Chamikara Jayalath wrote:


+1

On Thu, Jul 23, 2020 at 1:15 PM Brian Hulette <bhule...@google.com> wrote:

+1

On Thu, Jul 23, 2020 at 1:05 PM Robert Bradshaw <rober...@google.com> wrote:

[X] +1: Remove Python 2 support in Apache Beam 2.25.0.

According to our six-week release cadence, 2.24.0 (the last
release to support Python 2) will be cut mid-August, and the
first release not supporting Python 2 would be expected to
land sometime in October. This seems a reasonable timeline
to me.


On Thu, Jul 23, 2020 at 12:53 PM Valentyn Tymofieiev <valen...@google.com> wrote:

Hi everyone,

Please vote whether to make Apache Beam 2.24.0 the final
release supporting Python 2 as follows.

[ ] +1: Remove Python 2 support in Apache Beam 2.25.0.
[ ] -1: Continue to support Python 2 in Apache Beam, and
reconsider at a later date.

The Beam community has pledged to sunset Python 2
support at some point in 2020[1,2]. A recent
discussion[3] on dev@  proposes to outline a specific
version after which Beam developers no longer have to
maintain Py2 support, which is a motivation for this vote.

If this vote is approved we will announce Apache Beam
2.24.0 as our final release to support Python 2 and
discontinue Python 2 support starting from 2.25.0
(inclusive).

This is a procedural vote [4] that will follow the
majority approval rules and will be open for at least 72
hours.

Thanks,
Valentyn

[1]

https://lists.apache.org/thread.html/634f7346b607e779622d0437ed0eca783f474dea8976adf41556845b%40%3Cdev.beam.apache.org%3E
[2] https://python3statement.org/
[3]

https://lists.apache.org/thread.html/r0d5c309a7e3107854f4892ccfeb1a17c0cec25dfce188678ab8df072%40%3Cdev.beam.apache.org%3E
[4] https://www.apache.org/foundation/voting.html




Re: [BROKEN] Please add "Fix Version" when resolving or closing Jiras

2020-07-23 Thread Maximilian Michels

I can close/resolve but it will show "Resolution: Unresolved" (screenshot omitted).

On 23.07.20 02:30, Brian Hulette wrote:
Is setting the Resolution broken as well? I realized I've been closing 
jiras with Resolution "Unresolved" and I can't actually change it to 
"Fixed".


On Tue, Jul 21, 2020 at 7:19 AM Maximilian Michels <m...@apache.org> wrote:


Also, a friendly reminder to always close the JIRA issue after
merging a
fix. It's easy to forget.

On 20.07.20 21:04, Kenneth Knowles wrote:
> Hi all,
>
> In working on our Jira automation, I've messed up our Jira
workflow. It
> will no longer prompt you to fill in "Fix Version" when you
resolve or
> close an issue. I will be working with infra to restore this. In
the
> meantime, please try to remember to add a Fix Version to each
issue that
> you close, so that we get automated detailed release notes.
>
> Kenn



Re: Errorprone plugin fails for release branches <2.22.0

2020-07-22 Thread Maximilian Michels
On the SparkRunner page, we advise users to download Beam sources and build JobService. So I think it would be better just to add a note there about this issue with old branches. 


Why is that? Don't we publish the Spark job server jar?

-Max

On 21.07.20 18:20, Alexey Romanenko wrote:
On the SparkRunner page, we advise users to download Beam sources and 
build JobService. So I think it would be better just to add a note there 
about this issue with old branches.


On 20 Jul 2020, at 22:29, Kenneth Knowles <k...@apache.org> wrote:


I think it is fine to fix it in branches. I do not see too much value 
in fixing it except in branches you know you are going to use.


The "Downloads" page is for users and only mentioned the voted source 
releases, maven central, and pypi. There is nothing to do with GitHub 
or ongoing branches there. I don't think un-published cherrypicks to 
branches matter to users. Did you mean some other place?


Kenn

On Mon, Jul 20, 2020 at 9:44 AM Alexey Romanenko <aromanenko@gmail.com> wrote:


Then, would it be ok to fix it in branches (the question is how
many branches we should fix?) with additional commit and mention
that on “Downloads" page?


On 8 Jul 2020, at 21:24, Kenneth Knowles <k...@apache.org> wrote:



On Wed, Jul 8, 2020 at 12:07 PM Kyle Weaver <kcwea...@google.com> wrote:

> To fix on previous release branches, we would need to make
a new release, is it not? Since hashes would change..

Would it be alright to patch the release branches on Github
and leave the released source as-is? Github release branches
themselves aren't release artifacts, so I think it should be
okay to patch them without making a new release.


Yea. There are tags for the exact hashes that RCs were built
from. The release branch is fine to get new commits, and then if
anyone wants to build a patch release they will get those commits.

Kenn

On Wed, Jul 8, 2020 at 11:59 AM Pablo Estrada <pabl...@google.com> wrote:

Ah that's annoying that a dependency would be removed
from maven. I thought that was not meant to happen? This
must be an issue happening for many other projects...
Why is errorprone a dependency anyway?

To fix on previous release branches, we would need to
make a new release, is it not? Since hashes would change..

On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko <aromanenko@gmail.com> wrote:

Hi Max,

I’m +1 for back porting as well but that seems quite
complicated since we distribute release source code
from https://archive.apache.org/
Perhaps, we should just warn users about this issue
and how to workaround it.

Any other ideas?

> On 8 Jul 2020, at 11:46, Maximilian Michels <m...@apache.org> wrote:
>
> Hi Alexey,
>
> I also came across this issue when building a
custom Beam version. I applied the same fix
(https://github.com/apache/beam/pull/11527) which you
have mentioned.
>
> It appears that the Maven dependencies changed or
are no longer available which causes the missing
class files.
>
> +1 for backporting the fix to the release branches.
>
> Cheers,
> Max
>
> On 08.07.20 11:36, Alexey Romanenko wrote:
>> Hello,
>> Some days ago I noticed that I can’t build the
project from old release branches . For example, I
wanted to build and run Spark Job Server from
“release-2.20.0” branch and it failed:
>> ./gradlew :runners:spark:job-server:runShadow
—stacktrace
>> * Exception is:
>> org.gradle.api.tasks.TaskExecutionException:
Execution failed for task ':model:pipeline:compileJava’.
>> …
>> Caused by: org.gradle.internal.UncheckedException:
java.lang.ClassNotFoundException:
com.google.errorprone.ErrorProneCompiler$Builder
>> …
>> I experienced the same issue for “release-2.19.0”
and  “release-2.21.0” branches, I didn’t check older
branches but seems it’s a global issue for
“net.ltgt.gradle:gradle-errorprone-plugin:0.0.13".
>> 

Re: Monitoring performance for releases

2020-07-21 Thread Maximilian Michels

It doesn't support https. I had to add an exception to the HTTPS Everywhere extension for 
"metrics.beam.apache.org".


*facepalm* Thanks Udi! It would always hang on me because I use HTTPS 
Everywhere.



To be explicit, I am supporting the idea of reviewing the release guide but not 
changing the release process for the already in-progress release.


I consider the release guide immutable for the process of a release. 
Thus, a change to the release guide can only affect new upcoming 
releases, not an in-process release.



+1 and I think we can also evaluate whether flaky tests should be reviewed as 
release blockers or not. Some flaky tests would be hiding real issues our users 
could face.


Flaky tests are also worth taking into account when releasing, but a
little harder to find because they may just happen to pass while building
the release. It is possible though if we strictly capture flaky tests
via JIRA and mark them with the Fix Version for the release.



We keep accumulating dashboards and
tests that few people care about, so it is probably worthwhile that we use
them or get a way to alert us of regressions during the release cycle
to catch these even before the RCs.


+1 The release guide should be explicit about which performance test 
results to evaluate.


The prerequisite is that we have all the stats in one place. They seem 
to be scattered across http://metrics.beam.apache.org and 
https://apache-beam-testing.appspot.com.


Would it be possible to consolidate the two, i.e. use the Grafana-based 
dashboard to load the legacy stats?


For the evaluation during the release process, I suggest using a 
standardized set of performance tests for all runners, e.g.:


- Nexmark
- ParDo (Classic/Portable)
- GroupByKey
- IO


-Max

On 21.07.20 01:23, Ahmet Altay wrote:


On Mon, Jul 20, 2020 at 3:07 PM Ismaël Mejía <mailto:ieme...@gmail.com>> wrote:


+1

This is not in the release guide and we should probably re evaluate if
this should be a release blocking reason.
Of course exceptionally a performance regression could be motivated by
a correctness fix or a worth refactor, so we should consider this.


+1 and I think we can also evaluate whether flaky tests should be 
reviewed as release blockers or not. Some flaky tests would be hiding 
real issues our users could face.


To be explicit, I am supporting the idea of reviewing the release guide 
but not changing the release process for the already in-progress release.



We have been tracking and fixing performance regressions multiple
times, found simply by checking the Nexmark tests, including on the
ongoing 2.23.0 release, so the value is there. Nexmark does not yet cover
Python and portable runners, so we are probably still missing many
issues, and it is worth working on this. In any case we should probably
decide what validations matter. We keep accumulating dashboards and
tests that few people care about, so it is probably worthwhile that we use
them or get a way to alert us of regressions during the release cycle
to catch these even before the RCs.


I agree. And if we cannot use dashboards/tests in a meaningful way, IMO 
we can remove them. There is not much value in maintaining them if they do 
not provide important signals.



On Fri, Jul 10, 2020 at 9:30 PM Udi Meiri mailto:eh...@google.com>> wrote:
 >
 > On Thu, Jul 9, 2020 at 12:48 PM Maximilian Michels
mailto:m...@apache.org>> wrote:
 >>
 >> Not yet, I just learned about the migration to a new frontend,
including
 >> a new backend (InfluxDB instead of BigQuery).
 >>
 >> >  - Are the metrics available on metrics.beam.apache.org
<http://metrics.beam.apache.org>?
 >>
 >> Is http://metrics.beam.apache.org online? I was never able to
access it.
 >
 >
 > It doesn't support https. I had to add an exception to the HTTPS
Everywhere extension for "metrics.beam.apache.org
<http://metrics.beam.apache.org>".
 >
 >>
 >>
 >> >  - What is the feature delta between usinig
metrics.beam.apache.org <http://metrics.beam.apache.org> (much
better UI) and using apache-beam-testing.appspot.com
<http://apache-beam-testing.appspot.com>?
 >>
 >> AFAIK it is an ongoing migration and the delta appears to be high.
 >>
 >> >  - Can we notice regressions faster than release cadence?
 >>
 >> Absolutely! A report with the latest numbers including
statistics about
 >> the growth of metrics would be useful.
 >>
 >> >  - Can we get automated alerts?
 >>
 >> I think we could set up a Jenkins job to do this.
 >>
 >> -Max
 >>
 >> On 09.07.20 20:26, Kenneth Kno

Re: [BROKEN] Please add "Fix Version" when resolving or closing Jiras

2020-07-21 Thread Maximilian Michels
Also, a friendly reminder to always close the JIRA issue after merging a 
fix. It's easy to forget.


On 20.07.20 21:04, Kenneth Knowles wrote:

Hi all,

In working on our Jira automation, I've messed up our Jira workflow. It 
will no longer prompt you to fill in "Fix Version" when you resolve or 
close an issue. I will be working with infra to restore this. In the 
meantime, please try to remember to add a Fix Version to each issue that 
you close, so that we get automated detailed release notes.


Kenn


Re: Errorprone plugin fails for release branches <2.22.0

2020-07-21 Thread Maximilian Michels

We can't change releases, as they are voted and signed.

+1 for updating the branches. That will give people the option to build 
recent Beam versions.


Note that it is not uncommon for source files to stop compiling. It's 
often the case when build tools or the underlying platform change. 
Although unfortunate, it is not in our interest to maintain old Beam 
versions, but since fixing the mentioned issue is easy, it makes sense 
in this particular instance.


-Max

On 20.07.20 22:29, Kenneth Knowles wrote:
I think it is fine to fix it in branches. I do not see too much value in 
fixing it except in branches you know you are going to use.


The "Downloads" page is for users and only mentioned the voted source 
releases, maven central, and pypi. There is nothing to do with GitHub or 
ongoing branches there. I don't think un-published cherrypicks to 
branches matter to users. Did you mean some other place?


Kenn

On Mon, Jul 20, 2020 at 9:44 AM Alexey Romanenko 
mailto:aromanenko@gmail.com>> wrote:


Then, would it be ok to fix it in branches (the question is how many
branches we should fix?) with an additional commit and mention that on
the “Downloads" page?


On 8 Jul 2020, at 21:24, Kenneth Knowles mailto:k...@apache.org>> wrote:



On Wed, Jul 8, 2020 at 12:07 PM Kyle Weaver mailto:kcwea...@google.com>> wrote:

> To fix on previous release branches, we would need to make a
new release, is it not? Since hashes would change..

Would it be alright to patch the release branches on Github
and leave the released source as-is? Github release branches
themselves aren't release artifacts, so I think it should be
okay to patch them without making a new release.


Yea. There are tags for the exact hashes that RCs were built from.
The release branch is fine to get new commits, and then if anyone
wants to build a patch release they will get those commits.

Kenn

On Wed, Jul 8, 2020 at 11:59 AM Pablo Estrada
mailto:pabl...@google.com>> wrote:

Ah that's annoying that a dependency would be removed from
maven. I thought that was not meant to happen? This must
be an issue happening for many other projects...
Why is errorprone a dependency anyway?

To fix on previous release branches, we would need to make
a new release, is it not? Since hashes would change..

On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko
mailto:aromanenko@gmail.com>> wrote:

Hi Max,

I’m +1 for backporting as well but that seems quite
complicated since we distribute release source code
from https://archive.apache.org/
Perhaps, we should just warn users about this issue
and how to work around it.

Any other ideas?

        > On 8 Jul 2020, at 11:46, Maximilian Michels
mailto:m...@apache.org>> wrote:
>
> Hi Alexey,
>
> I also came across this issue when building a custom
Beam version. I applied the same fix
(https://github.com/apache/beam/pull/11527) which you
have mentioned.
>
> It appears that the Maven dependencies changed or
are no longer available which causes the missing class
files.
>
> +1 for backporting the fix to the release branches.
>
> Cheers,
> Max
>
> On 08.07.20 11:36, Alexey Romanenko wrote:
>> Hello,
>> Some days ago I noticed that I can’t build the
project from old release branches . For example, I
wanted to build and run Spark Job Server from
“release-2.20.0” branch and it failed:
>> ./gradlew :runners:spark:job-server:runShadow
—stacktrace
>> * Exception is:
>> org.gradle.api.tasks.TaskExecutionException:
Execution failed for task ':model:pipeline:compileJava’.
>> …
>> Caused by: org.gradle.internal.UncheckedException:
java.lang.ClassNotFoundException:
com.google.errorprone.ErrorProneCompiler$Builder
>> …
>> I experienced the same issue for “release-2.19.0”
and  “release-2.21.0” branches, I didn’t check older
branches but seems it’s a global issue for
“net.ltgt.gradle:gradle-errorprone-plugin:0.0.13".
>> This is already known issu

Re: No space left on device - beam-jenkins 1 and 7

2020-07-20 Thread Maximilian Michels
+1 for scheduling it via a cron job if it won't lead to test failures 
while running. Not a Jenkins expert but maybe there is the notion of 
running exclusively while no other tasks are running?


-Max

On 17.07.20 21:49, Tyson Hamilton wrote:

FYI there was a job introduced to do this in Jenkins: beam_Clean_tmp_directory

Currently it needs to be run manually. I'm seeing some out of disk related 
errors in precommit tests currently, perhaps we should schedule this job with 
cron?


On 2020/03/11 19:31:13, Heejong Lee  wrote:

Still seeing no space left on device errors on jenkins-7 (for example:
https://builds.apache.org/job/beam_PreCommit_PythonLint_Commit/2754/)


On Fri, Mar 6, 2020 at 7:11 PM Alan Myrvold  wrote:


Did a one time cleanup of tmp files owned by jenkins older than 3 days.
Agree that we need a longer term solution.

Passing recent tests on all executors except jenkins-12, which has not
scheduled recent builds for the past 13 days. Not scheduling:
https://builds.apache.org/computer/apache-beam-jenkins-12/builds

Recent passing builds:
https://builds.apache.org/computer/apache-beam-jenkins-1/builds

https://builds.apache.org/computer/apache-beam-jenkins-2/builds

https://builds.apache.org/computer/apache-beam-jenkins-3/builds

https://builds.apache.org/computer/apache-beam-jenkins-4/builds

https://builds.apache.org/computer/apache-beam-jenkins-5/builds

https://builds.apache.org/computer/apache-beam-jenkins-6/builds

https://builds.apache.org/computer/apache-beam-jenkins-7/builds

https://builds.apache.org/computer/apache-beam-jenkins-8/builds

https://builds.apache.org/computer/apache-beam-jenkins-9/builds

https://builds.apache.org/computer/apache-beam-jenkins-10/builds

https://builds.apache.org/computer/apache-beam-jenkins-11/builds

https://builds.apache.org/computer/apache-beam-jenkins-13/builds

https://builds.apache.org/computer/apache-beam-jenkins-14/builds

https://builds.apache.org/computer/apache-beam-jenkins-15/builds

https://builds.apache.org/computer/apache-beam-jenkins-16/builds


On Fri, Mar 6, 2020 at 11:54 AM Ahmet Altay  wrote:


+Alan Myrvold  is doing a one time cleanup. I agree
that we need to have a solution to automate this task or address the root
cause of the buildup.

On Thu, Mar 5, 2020 at 2:47 AM Michał Walenia 
wrote:


Hi there,
it seems we have a problem with Jenkins workers again. Nodes 1 and 7
both fail jobs with "No space left on device".
Who is the best person to contact in these cases (someone with access
permissions to the workers).

I also noticed that such errors are becoming more and more frequent
recently and I'd like to discuss how this can be remedied. Can a cleanup
task be automated on Jenkins somehow?

Regards
Michal

--

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 







Re: [VOTE] Release 2.23.0, release candidate #1

2020-07-20 Thread Maximilian Michels
@Valentyn: Thank you for your transparency in the release process and 
for considering pending cherry-pick requests. No blockers from my side.


-Max

On 18.07.20 01:11, Ahmet Altay wrote:
Thank you Valentyn. Being a release manager is difficult. It requires 
balancing between stability, following the process, regressions, 
timelines. Thank you for following the process, thank you for asking the 
right questions, thank you for doing the release.



On Fri, Jul 17, 2020 at 3:59 PM Robert Bradshaw > wrote:


Thank you, Valentyn!

On Fri, Jul 17, 2020 at 3:25 PM Chamikara Jayalath
mailto:chamik...@google.com>> wrote:
 >
 >
 >
 > On Fri, Jul 17, 2020 at 3:01 PM Valentyn Tymofieiev
mailto:valen...@google.com>> wrote:
 >>
 >> As a general rule, fixes pertaining to new functionality are not
a good candidate for a cherry-pick.
 >>
 >> A case for an exception can be made for polishing features
related to major wide announcements with a hard deadline, which
appears to be the case for xlang on Dataflow.
 >>
 >> I will prepare an RC2 with xlang fixes and consider other
low-risk additions from issues that were brought to my attention.
 >
 >
 > Thanks Valentyn.
 >
 >>
 >>
 >> Thanks
 >>
 >>
 >> On Fri, Jul 17, 2020 at 10:36 AM Chamikara Jayalath
mailto:chamik...@google.com>> wrote:
 >>>
 >>>
 >>>
 >>> On Fri, Jul 17, 2020 at 10:01 AM Robert Bradshaw
mailto:rober...@google.com>> wrote:
 
  Taking a step back, the goal of avoiding cherry-picks is to reduce
  risk and increase the velocity of our releases, as otherwise the
  release manager gets inundated by a never ending list of features
  people want to get in that puts the releases further and further
  behind (increasing the desire to get features in in a vicious
cycle).
  On the flip side, the reason we have a release process with
candidates
  and voting (as opposed to just declaring a commit id every N
weeks to
  be "the release") is to give us the flexibility to achieve a
level of
  quality and polish that may not ever occur in HEAD itself.
 
  With regards to this specific cross-langauge fix, the
motivation is
  that those working on it at Google want to widely publish this
feature
  as newly available on Dataflow. The question to answer here
(Cham) is
  whether this bug is debilitating enough that were it not to be
in the
  release we would want to hold off advertising this (and related)
  features until the next release. (In my understanding, it
would result
  in a poor enough user experience that it is.)
 >>>
 >>>
 >>> Yes, I think we will have to either hold off on widely
publishing the feature or list this as a potential issue that will
be fixed in the next release for anybody who tries cross-language
pipelines and runs into this.
 >>> Note that we are getting in a Python Kafka example [1]. So
users will potentially try this out anyways.
 >>>
 >>> [1] https://github.com/apache/beam/pull/12188
 >>>
 >>>
 
 
  On the other hand, there's the question of the cost of getting
this
  fix into the release. The change is simple and well contained,
so I
  think the risk is low (and, in particular, the cost to include
it here
  is low enough that it's worth the value provided above).
 
  Looking at the other proposals,
  https://github.com/apache/beam/pull/12196 also seems to meet
this bar
  (there are possible xlang correctness issues at play here), as
does
  https://github.com/apache/beam/pull/12175 (mostly due to its
  simplicity and the fact that doing it later would be a backwards
  compatible change). I'm on the fence about
  https://github.com/apache/beam/pull/12171 (if an RC2 is in the
works
  anyway), and IMHO the others are less compelling as having to
be done
  now.
 >>>
 >>>
 >>> +1
 >>>
 
 
  (On the question of a point release, IMHO anything worth
considering
  for an x.y.1 release definitely meets the bar for inclusion
into an RC
  of an ongoing release.)
 
  - Robert
 
 
  On Thu, Jul 16, 2020 at 8:00 PM Chamikara Jayalath
mailto:chamik...@google.com>> wrote:
  >
  >
  >
  > On Thu, Jul 16, 2020 at 7:46 PM Chamikara Jayalath
mailto:chamik...@google.com>> wrote:
  >>
  >>
  >>
  >> On Thu, Jul 16, 2020 at 7:28 PM Valentyn Tymofieiev
mailto:valen...@google.com>> wrote:
  >>>
  >>>
  >>>
  >>> On Thu, Jul 16, 

Re:

2020-07-10 Thread Maximilian Michels

Welcome Emily! Looking forward to your questions.

Cheers,
Max

On 08.07.20 20:07, Emily Ye wrote:

Greetings, dev@beam! Just wanted to introduce myself - I'm a SWE at Google who 
will be contributing to Beam going forward. I'm pretty new to the data 
processing space but I'm excited to learn, and will probably be asking lots of 
questions here. Looking forward to getting to know the community!

- Emily

  



Re: Monitoring performance for releases

2020-07-09 Thread Maximilian Michels
Not yet, I just learned about the migration to a new frontend, including 
a new backend (InfluxDB instead of BigQuery).



 - Are the metrics available on metrics.beam.apache.org?


Is http://metrics.beam.apache.org online? I was never able to access it.


 - What is the feature delta between usinig metrics.beam.apache.org (much 
better UI) and using apache-beam-testing.appspot.com?


AFAIK it is an ongoing migration and the delta appears to be high.


 - Can we notice regressions faster than release cadence?


Absolutely! A report with the latest numbers including statistics about 
the growth of metrics would be useful.



 - Can we get automated alerts?


I think we could set up a Jenkins job to do this.

-Max

On 09.07.20 20:26, Kenneth Knowles wrote:

Questions:

  - Are the metrics available on metrics.beam.apache.org 
<http://metrics.beam.apache.org>?
  - What is the feature delta between usinig metrics.beam.apache.org 
<http://metrics.beam.apache.org> (much better UI) and using 
apache-beam-testing.appspot.com <http://apache-beam-testing.appspot.com>?

  - Can we notice regressions faster than release cadence?
  - Can we get automated alerts?

Kenn

On Thu, Jul 9, 2020 at 10:21 AM Maximilian Michels <mailto:m...@apache.org>> wrote:


Hi,

We recently saw an increase in latency migrating from Beam 2.18.0 to
2.21.0 (Python SDK with Flink Runner). This proved very hard to debug,
and it looks like each version in between the two led to
increased latency.

This is not the first time we saw issues when migrating, another
time we
had a decline in checkpointing performance and thus added a
checkpointing test [1] and dashboard [2] (see checkpointing widget).

That makes me wonder if we should monitor performance (throughput /
latency) for basic use cases as part of the release testing. Currently,
our release guide [3] mentions running examples but not evaluating the
performance. I think it would be good practice to check relevant charts
with performance measurements as part of of the release process. The
release guide should reflect that.

WDYT?

-Max

PS: Of course, this requires tests and metrics to be available. This PR
adds latency measurements to the load tests [4].


[1] https://github.com/apache/beam/pull/11558
[2]
https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056
[3] https://beam.apache.org/contribute/release-guide/
[4] https://github.com/apache/beam/pull/12065



Monitoring performance for releases

2020-07-09 Thread Maximilian Michels

Hi,

We recently saw an increase in latency migrating from Beam 2.18.0 to 
2.21.0 (Python SDK with Flink Runner). This proved very hard to debug, 
and it looks like each version in between the two led to 
increased latency.


This is not the first time we saw issues when migrating, another time we 
had a decline in checkpointing performance and thus added a 
checkpointing test [1] and dashboard [2] (see checkpointing widget).


That makes me wonder if we should monitor performance (throughput / 
latency) for basic use cases as part of the release testing. Currently, 
our release guide [3] mentions running examples but not evaluating the 
performance. I think it would be good practice to check relevant charts 
with performance measurements as part of the release process. The 
release guide should reflect that.


WDYT?

-Max

PS: Of course, this requires tests and metrics to be available. This PR 
adds latency measurements to the load tests [4].



[1] https://github.com/apache/beam/pull/11558
[2] 
https://apache-beam-testing.appspot.com/explore?dashboard=5751884853805056

[3] https://beam.apache.org/contribute/release-guide/
[4] https://github.com/apache/beam/pull/12065


Re: RequiresStableInput on Spark runner

2020-07-08 Thread Maximilian Michels
Correct, for batch we rely on re-running the entire job which will 
produce stable input within each run.


For streaming, the Flink Runner buffers all input to a 
@RequiresStableInput DoFn until a checkpoint is complete; only then does 
it process the buffered data. Dataflow effectively does the same by going 
through the Shuffle service, which produces a consistent result.
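
For batch, where runners lack native support, the workaround discussed
further down in this thread is to place a Reshuffle in front of the
side-effecting DoFn so that its input gets materialized first. A minimal
Python sketch, assuming a hypothetical non-idempotent
WriteToExternalSystem DoFn (whether Reshuffle actually yields stable
input still depends on the runner):

  import apache_beam as beam

  class WriteToExternalSystem(beam.DoFn):
    # Hypothetical DoFn with side effects that must not observe
    # replayed, non-deterministic input on retries.
    def process(self, element):
      yield element

  with beam.Pipeline() as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Reshuffle()  # materializes input before the side effect
     | beam.ParDo(WriteToExternalSystem()))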


-Max

On 08.07.20 11:08, Jozef Vilcek wrote:

My last question was more towards the graph translation for batch mode.

Should DoFn with @RequiresStableInput be translated/expanded in some 
specific way (e.g. DoFn -> Reshuffle + DoFn) or is it not needed for batch?
Most runners fail in the presence of @RequiresStableInput for both batch 
and streaming. I cannot find a failure for Flink and Dataflow, but at the 
same time, I cannot find what those runners do with such a DoFn.


On Tue, Jul 7, 2020 at 9:18 PM Kenneth Knowles > wrote:


I hope someone who knows better than me can respond.

A long time ago, the SparkRunner added a call to materialize() at
every GroupByKey. This was to mimic Dataflow, since so many of the
initial IO transforms relied on using shuffle to create stable inputs.

The overall goal is to be able to remove these extra calls to
materialize() and only include them when @RequiresStableInput.

The intermediate state is to analyze whether input is already stable
from materialize() and add another materialize() only if it is not
stable.

I don't know the current state of the SparkRunner. This may already
have changed.

Kenn

On Thu, Jul 2, 2020 at 10:24 PM Jozef Vilcek mailto:jozo.vil...@gmail.com>> wrote:

I was trying to look for references on how other runners handle
@RequiresStableInput for batch cases, however I was not able to
find any.
In Flink I can see added support for streaming case and in
Dataflow I see that support for the feature was turned off
https://github.com/apache/beam/pull/8065

It seems to me that @RequiresStableInput is ignored for the
batch case and the runner relies on being able to recompute the
whole job in the worst case scenario.
Is this assumption correct?
Could I just change SparkRunner to crash on @RequiresStableInput
annotation for streaming mode and ignore it in batch?



On Wed, Jul 1, 2020 at 10:27 AM Jozef Vilcek
mailto:jozo.vil...@gmail.com>> wrote:

We have a component which we use in streaming and batch
jobs. Streaming we run on FlinkRunner and batch on
SparkRunner. Recently we needed to add @RequiresStableInput
to taht component because of streaming use-case. But now
batch case crash on SparkRunner with

Caused by: java.lang.UnsupportedOperationException: Spark runner 
currently doesn't support @RequiresStableInput annotation.
at 
org.apache.beam.runners.core.construction.UnsupportedOverrideFactory.getReplacementTransform(UnsupportedOverrideFactory.java:58)
at 
org.apache.beam.sdk.Pipeline.applyReplacement(Pipeline.java:556)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:292)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
at 
org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:168)
at 
org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:90)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
at 
com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:42)
at 
com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:35)
at scala.util.Try$.apply(Try.scala:192)
at 
com.dp.pipeline.driver.PipelineDriver$class.main(PipelineDriver.scala:35)


We are using Beam 2.19.0. Is the @RequiresStableInput
problematic to support for both streaming and batch
use-case? What are the options here?
https://issues.apache.org/jira/browse/BEAM-5358



Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Maximilian Michels

Hi Alexey,

I also came across this issue when building a custom Beam version. I 
applied the same fix (https://github.com/apache/beam/pull/11527) which 
you have mentioned.


It appears that the Maven dependencies changed or are no longer 
available which causes the missing class files.


+1 for backporting the fix to the release branches.

Cheers,
Max

On 08.07.20 11:36, Alexey Romanenko wrote:

Hello,

Some days ago I noticed that I can’t build the project from old release 
branches . For example, I wanted to build and run Spark Job Server from 
“release-2.20.0” branch and it failed:


./gradlew :runners:spark:job-server:runShadow —stacktrace

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task 
':model:pipeline:compileJava’.

…
Caused by: org.gradle.internal.UncheckedException: 
java.lang.ClassNotFoundException: 
com.google.errorprone.ErrorProneCompiler$Builder

…


I experienced the same issue for “release-2.19.0” and  “release-2.21.0” 
branches, I didn’t check older branches but seems it’s a global issue 
for “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13".


This is already known issue and it was fixed for 2.22.0 [1] a while ago. 
By applying a fix from [2] on top of previous branch, for example, 
“release-2.20.0” branch I’ve managed to build it. Though, the problem 
for old branches (<2.22.0) is still there - it’s not possible to build 
them right after checkout without applying the fix.


So, there are two questions:

1. Is anyone aware why the old static version of 
gradle-errorprone-plugin fails for the branches that were successfully 
built before?
2. Do we have to fix it for release branches <2.22.0 (either cherry-pick 
the fix for 2.22.0 or somehow else if it’s possible)?


[1] https://issues.apache.org/jira/browse/BEAM-10263
[2] https://github.com/apache/beam/pull/11527



Re: Error in FlinkRunnerTest.test_external_transforms

2020-06-30 Thread Maximilian Michels

Is this a flake? I don't see this in master atm.

-Max

On 27.06.20 02:59, Alex Amato wrote:

Hi,

I was wondering if this is something wrong with my PR 
 or an issue in master.

Thanks for your help.

Seeing this in my PR's presubmit
https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Commit/5382/

Logs 



==
ERROR: test_external_transforms (__main__.FlinkRunnerTest)
--
Traceback (most recent call last):
 Timed out after 60 seconds. 
   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Commit/src/sdks/python/apache_beam/runners/portability/flink_runner_test.py",
 line 204, in test_external_transforms

 assert_that(res, equal_to([i for i in range(1, 10)]))
# Thread: 
   File "apache_beam/pipeline.py", line 547, in __exit__
 self.run().wait_until_finish()

# Thread: 
   File "apache_beam/runners/portability/portable_runner.py", line 543, in 
wait_until_finish
 self._observe_state(message_thread)
   File "apache_beam/runners/portability/portable_runner.py", line 552, in 
_observe_state

 for state_response in self._state_stream:
# Thread: <_Worker(Thread-110, started daemon 140197924693760)>
   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Commit/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_channel.py",
 line 413, in next
 return self._next()

   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Commit/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_channel.py",
 line 697, in _next
# Thread: <_MainThread(MainThread, started 140200366741248)>
 _common.wait(self._state.condition.wait, _response_ready)
   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Commit/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_common.py",
 line 138, in wait
 _wait_once(wait_fn, MAXIMUM_WAIT_TIMEOUT, spin_cb)

   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Commit/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_common.py",
 line 103, in _wait_once
 wait_fn(timeout=timeout)
# Thread: 
   File "/usr/lib/python2.7/threading.py", line 359, in wait
 _sleep(delay)
   File "apache_beam/runners/portability/portable_runner_test.py", line 82, in 
handler
 raise BaseException(msg)
BaseException: Timed out after 60 seconds.


# Thread: <_Worker(Thread-18, started daemon 140198537066240)>

# Thread: 

--
# Thread: <_Worker(Thread-19, started daemon 140198528673536)>

Ran 82 tests in 461.409s

FAILED (errors=1, skipped=15)



Re: Is there an easy way to figure out why my build failed?

2020-06-30 Thread Maximilian Michels

Hi Alex,

Fully agree with you that it can be hard to find the cause for a failing 
build. You basically need to know the exact keyword to grep for. The 
reason is that Jenkins does not understand all build logs to display the 
error directly in the UI.


I often do the following for large logs:

  $ curl 
https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/12017/consoleText 
| less


Then I can use '/' to search in the log quickly without my browser 
slowing down.


In the linked build log, I searched for ' FAILED':

  09:18:26 > Task :sdks:java:io:rabbitmq:test FAILED
  09:18:26
  09:18:26 FAILURE: Build failed with an exception.
  09:18:26
  09:18:26 * What went wrong:
  09:18:26 Execution failed for task ':sdks:java:io:rabbitmq:test'.
  09:18:26 > Process 'Gradle Test Executor 110' finished with non-zero 
 exit value 143
  09:18:26   This problem might be caused by incorrect test process 
configuration.
  09:18:26   Please refer to the test execution section in the User 
Manual at 
https://docs.gradle.org/5.2.1/userguide/java_testing.html#sec:test_execution


Now, it appears that the rabbitmq tests are timing out but I'm not sure 
the issue is with rabbitmq because I'm also seeing:


  Build timed out (after 120 minutes). Marking the build as aborted.
  Build was aborted
  Recording test results

So maybe some other test slowed down the build and when it reached 
rabbitmq it was killed. That can probably be tested by running the build 
multiple times.


-Max

On 30.06.20 19:47, Alex Amato wrote:
Often I see the build failing, but on the next page there are no 
warnings and no errors.


Then when you dive into the full log, it slows down the browser and 
there is no obvious ctrl-f keyword to find the error ("error" yields 
over 100 results, and the error isn't always at the bottom). Is there a 
faster/better way to do it?


There is a log about the build timing out, but I don't really know what 
timed out or where to look next.


Is 120 min a long enough time? Did something recently happen? If so, can 
we increase the timeout until we debug the regression?


https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/12017/

https://issues.apache.org/jira/browse/BEAM-10390

Thanks, I would appreciate any ideas :)
Alex


Re: [ANNOUNCE] New committer: Aizhamal Nurmamat kyzy

2020-06-30 Thread Maximilian Michels

Congrats Aizhamal!

On 30.06.20 17:34, Jan Lukavský wrote:

Congratulations Aizhamal!

On 6/30/20 1:35 PM, Alexey Romanenko wrote:

Congratulations!
Well deserved and thank you for your hard work, Aizhamal!

On 30 Jun 2020, at 13:31, Reza Rokni > wrote:


Congratulations !

On Tue, Jun 30, 2020 at 2:46 PM Michał Walenia 
mailto:michal.wale...@polidea.com>> wrote:


Congratulations, Aizhamal! :)

On Tue, Jun 30, 2020 at 8:41 AM Tobiasz Kędzierski
mailto:tobiasz.kedzier...@polidea.com>> wrote:

Congratulations Aizhamal! :)

On Mon, Jun 29, 2020 at 11:50 PM Austin Bennett
mailto:whatwouldausti...@gmail.com>> wrote:

Congratulations, @Aizhamal Nurmamat kyzy
 !

On Mon, Jun 29, 2020 at 2:32 PM Valentyn Tymofieiev
mailto:valen...@google.com>> wrote:

Congratulations and big thank you for all the hard
work on Beam, Aizhamal!

On Mon, Jun 29, 2020 at 9:56 AM Kenneth Knowles
mailto:k...@apache.org>> wrote:

Please join me and the rest of the Beam PMC in
welcoming a new committer: Aizhamal Nurmamat kyzy

Over the last 15 months or so, Aizhamal has
driven many efforts in the Beam community and
contributed to others. Aizhamal started by
helping with the Beam newsletter [1] then
continued by contributing to meetup planning [2]
[3] and Beam Summit planning [4]. Aizhamal
created Beam's system for managing social media
[5] and contributed many tweets, coordinated the
vote and design of Beam's mascot [6] [7], drove
migration of Beam's site to a more i18n-friendly
infrastructure [8], kept on top of Beam's
enrollment in Season of Docs [9], and even
organized remote Beam Webinars during the
pandemic [10].

In consideration of Aizhamal's contributions, the
Beam PMC trusts her with
the responsibilities of a Beam committer [11].

Thank you, Aizhamal, for your contributions and
looking forward to many more!

Kenn, on behalf of the Apache Beam PMC

[1]

https://lists.apache.org/thread.html/447ae9fdf580ad88522aabc8a0f3703c51acd8885578bb422389a4b0%40%3Cdev.beam.apache.org%3E
[2]

https://lists.apache.org/thread.html/ebeeae53a64dca8bb491e26b8254d247226e6d770e33dbc9428202df%40%3Cdev.beam.apache.org%3E
[3]

https://lists.apache.org/thread.html/rc31d3d57b39e6cf12ea3b6da0e884f198f8cbef9a73f6a50199e0e13%40%3Cdev.beam.apache.org%3E
[4]

https://lists.apache.org/thread.html/99815d5cd047e302b0ef4b918f2f6db091b8edcf430fb62e4eeb1060%40%3Cdev.beam.apache.org%3E
[5]

https://lists.apache.org/thread.html/babceeb52624fd4dd129c259db8ee9017cb68cba069b68fca7480c41%40%3Cdev.beam.apache.org%3E
[6]

https://lists.apache.org/thread.html/60aa4b149136e6aa4643749731f4b5a041ae4952e7b7e57654888bed%40%3Cdev.beam.apache.org%3E
[7]

https://lists.apache.org/thread.html/r872ba2860319cbb5ca20de953c43ed7d750155ca805cfce3b70085b0%40%3Cdev.beam.apache.org%3E

[8]

https://lists.apache.org/thread.html/rfab4cc1411318c3f4667bee051df68f37be11846ada877f3576c41a9%40%3Cdev.beam.apache.org%3E

[9]

https://lists.apache.org/thread.html/r4df2e596751e263a83300818776fbb57cb1e84171c474a9fd016ec10%40%3Cdev.beam.apache.org%3E
[10]

https://lists.apache.org/thread.html/r81b93d700fedf3012b9f02f56b5d693ac4c1aac1568edf9e0767b15f%40%3Cuser.beam.apache.org%3E
[11]

https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer



-- 
Michał Walenia

Polidea  | Software Engineer

M: +48 791 432 002 
E: michal.wale...@polidea.com 

Unique Tech
Check out our projects! 





Re: Individual Parallelism support for Flink Runner

2020-06-29 Thread Maximilian Michels
We could allow parameterizing transforms by using transform identifiers 
from the pipeline, e.g.



  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  options = ['--parameterize=MyTransform;parallelism=5']
  with beam.Pipeline(options=PipelineOptions(options)) as p:
    p | beam.Create([1, 2, 3]) | 'MyTransform' >> beam.ParDo(..)


Those hints should always be optional, such that a pipeline continues to 
run on all runners.
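
For illustration, a sketch of how such a hint could be declared as a
custom pipeline option in the Python SDK (the flag and its semantics are
hypothetical; a runner would still have to interpret them):

  from apache_beam.options.pipeline_options import PipelineOptions

  class ParameterizeOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
      # Hypothetical hint; runners that don't understand it ignore it.
      parser.add_argument(
          '--parameterize', action='append', default=[],
          help="Transform-scoped hints, e.g. 'MyTransform;parallelism=5'")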


-Max

On 28.06.20 14:30, Reuven Lax wrote:
However such a parameter would be specific to a single transform, 
whereas maxNumWorkers is a global parameter today.


On Sat, Jun 27, 2020 at 10:31 PM Daniel Collins > wrote:


I could imagine for example, a 'parallelismHint' field in the base
parameters that could be set to maxNumWorkers when running on
dataflow or an equivalent parameter when running on flink. It would
be useful to get a default value for the sharding in the Reshuffle
changes here https://github.com/apache/beam/pull/11919, but more
generally to have some decent guess on how to best shard work. Then
it would be runner-agnostic; you could set it to something like
numCpus on the local runner for instance.

On Sat, Jun 27, 2020 at 2:04 AM Reuven Lax mailto:re...@google.com>> wrote:

It's an interesting question - this parameter is clearly very
runner specific (e.g. it would be meaningless for the Dataflow
runner, where parallelism is not a static constant). How should
we go about passing runner-specific options per transform?

On Fri, Jun 26, 2020 at 1:14 PM Akshay Iyangar
mailto:aiyan...@godaddy.com>> wrote:

Hi beam community,


So I had brought this issue in our slack channel but I guess
this warrants a deeper discussion and if we do go about what
is the POA for it.


So basically currently for Flink Runner we don’t support
operator-level parallelism, which native Flink provides OOTB.
So I was wondering what the community feels about having
some way to pass parallelism for individual operators, especially
for some of the existing IOs.


Wanted to know what people think of this.


Thanks 

Akshay I



Re: Remove EOL'd Runners

2020-06-09 Thread Maximilian Michels
Thanks for the heads-up, Tyson! It's a sensible decision to remove
unsupported runners.

-Max

On 09.06.20 16:51, Tyson Hamilton wrote:
> Hi All,
> 
> As part of the Fixit [1] I'd like to remove EOL'd runners, Apex and Gearpump, 
> as described in BEAM- [2]. This will be a big PR I think and didn't want 
> anyone to be surprised. There is already some agreement in the linked Jira 
> issue. If there are no objections I'll get started later today or tomorrow.
> 
> -Tyson
> 
> 
> [1]: 
> https://lists.apache.org/thread.html/r9ddc77a8fee58ad02f68e2d9a7f054aab3e55717cc88ad1d5bc49311%40%3Cdev.beam.apache.org%3E
> [2]: https://issues.apache.org/jira/browse/BEAM-
> 


Re: SQL Windowing

2020-05-28 Thread Maximilian Michels
Thanks for the quick reply Brian! I've filed a JIRA for option (a):
https://jira.apache.org/jira/browse/BEAM-10143

Makes sense to define DATETIME as a logical type. I'll check out your
PR. We could work around this for now by doing a cast, e.g.:

  TUMBLE(CAST(f_timestamp AS DATETIME), INTERVAL '30' MINUTE)

Note that we may have to do a more sophisticated cast to convert the
Python micros into a DATETIME.
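
For reference, a rough end-to-end sketch of option (b) combined with the
cast workaround (untested; the field names and the use of beam.Row and
Timestamp.micros are assumptions, and as noted above the micros-to-DATETIME
conversion may need to be more sophisticated):

  import apache_beam as beam
  from apache_beam.transforms.sql import SqlTransform

  class AttachTimestampField(beam.DoFn):
    # Expose each element's event timestamp as a schema field so that
    # the SQL query can window on it.
    def process(self, event, timestamp=beam.DoFn.TimestampParam):
      yield beam.Row(field=event.field, f_timestamp=timestamp.micros)

  aggregated = (
      events  # a schema'd PCollection with a 'field' attribute
      | beam.ParDo(AttachTimestampField())
      | SqlTransform("""
          SELECT field, COUNT(field) AS cnt
          FROM PCOLLECTION
          GROUP BY field,
                   TUMBLE(CAST(f_timestamp AS DATETIME), INTERVAL '30' MINUTE)
        """))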

-Max

On 28.05.20 19:18, Brian Hulette wrote:
> Hey Max,
> Thanks for kicking the tires on SqlTransform in Python :)
> 
> We don't have any tests of windowing and Sql in Python yet, so I'm not
> that surprised you're running into issues here. Portable schemas don't
> support the DATETIME type, because we decided not to define it as one of
> the atomic types [1] and hope to add support via a logical type instead
> (see BEAM-7554 [2]). This was the motivation for the MillisInstant PR I
> put up, and the ongoing discussion [3].
> Regardless, that should only be an obstacle for option (b), where you'd
> need to have a DATETIME in the input and/or output PCollection of the
> SqlTransform. In theory option (a) should be possible, so I'd consider
> that a bug - can you file a jira for it?
> 
> Brian
> 
> [1] 
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto#L58
> [2] https://issues.apache.org/jira/browse/BEAM-7554
> [3] 
> https://lists.apache.org/thread.html/r2e05355b74fb5b8149af78ade1e3539ec08371a9a4b2b9e45737e6be%40%3Cdev.beam.apache.org%3E
> 
> On Thu, May 28, 2020 at 9:45 AM Maximilian Michels  <mailto:m...@apache.org>> wrote:
> 
> Hi,
> 
> I'm using the SqlTransform as an external transform from within a Python
> pipeline. The SQL docs [1] mention that you can either (a) window the
> input or (b) window in the SQL query.
> 
> Option (a):
> 
>   input
>       | "Window" >> beam.WindowInto(window.FixedWindows(30))
>       | "Aggregate" >>
>       SqlTransform("""Select field, count(field) from PCOLLECTION
>                       WHERE ...
>                       GROUP BY field
>                    """)
> 
> This results in an exception:
> 
>   Caused by: java.lang.ClassCastException:
>   org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast
>   to org.apache.beam.sdk.transforms.windowing.GlobalWindow
> 
> => Is this a bug?
> 
> 
> Let's try Option (b):
> 
>   input
>       | "Aggregate & Window" >>
>       SqlTransform("""Select field, count(field) from PCOLLECTION
>                       WHERE ...
>                       GROUP BY field,
>                                TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
>                    """)
> 
> The issue that I'm facing here is that the timestamp is already assigned
> to my values but is not exposed as a field. So I need to use a DoFn to
> extract the timestamp as a new field:
> 
>   class GetTimestamp(beam.DoFn):
>     def process(self, event, timestamp=beam.DoFn.TimestampParam):
>       yield TimestampedRow(..., timestamp)
> 
>   input
>       | "Extract timestamp" >>
>       beam.ParDo(GetTimestamp())
>       | "Aggregate & Window" >>
>       SqlTransform("""Select field, count(field) from PCOLLECTION
>                       WHERE ...
>                       GROUP BY field,
>                                TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
>                    """)
> 
> => It would be very convenient if there was a reserved field name which
> would point to the timestamp of an element. Maybe there is?
> 
> 
> -Max
> 
> 
> [1]
> 
> https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/
> 


SQL Windowing

2020-05-28 Thread Maximilian Michels
Hi,

I'm using the SqlTransform as an external transform from within a Python
pipeline. The SQL docs [1] mention that you can either (a) window the
input or (b) window in the SQL query.

Option (a):

  input
  | "Window >> beam.WindowInto(window.FixedWindows(30))
  | "Aggregate" >>
  SqlTransform("""Select field, count(field) from PCOLLECTION
  WHERE ...
  GROUP BY field
   """)

This results in an exception:

  Caused by: java.lang.ClassCastException:
  org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast
  to org.apache.beam.sdk.transforms.windowing.GlobalWindow

=> Is this a bug?


Let's try Option (b):

  input
  | "Aggregate & Window" >>
  SqlTransform("""Select field, count(field) from PCOLLECTION
  WHERE ...
  GROUP BY field,
   TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
   """)

The issue that I'm facing here is that the timestamp is already assigned
to my values but is not exposed as a field. So I need to use a DoFn to
extract the timestamp as a new field:

  class GetTimestamp(beam.DoFn):
def process(self, event, timestamp=beam.DoFn.TimestampParam):
  yield TimestampedRow(..., timestamp)

  input
  | "Extract timestamp" >>
  beam.ParDo(GetTimestamp())
  | "Aggregate & Window" >>
  SqlTransform("""Select field, count(field) from PCOLLECTION
  WHERE ...
  GROUP BY field,
   TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
   """)

=> It would be very convenient if there was a reserved field name which
would point to the timestamp of an element. Maybe there is?


-Max


[1]
https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/


Re: What's the purpose of version=2.20.0-RC2 in gradle.properties?

2020-05-28 Thread Maximilian Michels
> I would expect the release branch to have the next -SNAPSHOT version (not the 
> case currently):

Why would the release branch have the next version? It is created for
the sole purpose of releasing the current version. For example, the
release branch for 2.21.0 would have the version 2.21.0-SNAPSHOT. If we
were to release 2.21.1 or 2.22.0, we would create a new branch where the
same logic applies.

The release branch having a -SNAPSHOT version makes perfect sense
because it is a snapshot of what is going to be released (still subject
to changes). Contrary to what I said before, I don't think we should
remove the snapshot suffix from the release branch.

However, as pointed out, the source release and its tag should have a
non-snapshot version.

-Max

On 27.05.20 05:02, Thomas Weise wrote:
> 
> 
> I think the "set_version.sh" script could be called in the release
> scripts to remove the -SNAPSHOT suffix on the release branch.
> 
> 
> I would expect the release branch to have the next -SNAPSHOT version
> (not the case currently):
> 
> https://github.com/apache/beam/blob/release-2.20.0/gradle.properties#L26
> 
> Release tag and the source archive should have the actually released
> version (not -RC):
> 
> https://github.com/apache/beam/blob/v2.20.0/gradle.properties#L26
> 
> 
>  
> 
> Btw, in case you haven't seen it, here is our release guide:
> https://beam.apache.org/contribute/release-guide/
> 
> -Max
> 
> On 26.05.20 19:02, Jacek Laskowski wrote:
> > Hi Max,
> >
> >> I think you bring up a good point, for the sake of release build
> > reproducibility, we may want to remove the snapshot suffix for the
> > source release.
> >
> > Wish I could be as clear as yourself with this. Yes, that's what I've
> > been bothered about. Is there a JIRA issue for this already? I've
> never
> > been good at releases but certainly could help a bit here and there
> > since I'm interested in having reproducible builds (from the tags).
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > "The Internals Of" Online Books <https://books.japila.pl/>
> > Follow me on https://twitter.com/jaceklaskowski
> >
> > <https://twitter.com/jaceklaskowski>
> >
> >
> > On Tue, May 26, 2020 at 5:37 PM Maximilian Michels  <mailto:m...@apache.org>
> > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
> >
> >     If you really want to work with the source code, I'd recommend
> using the
> >     released source code:
> >     https://beam.apache.org/get-started/downloads/#releases
> >
> >     Even there the version in gradle.properties says
> x.y.z-SNAPSHOT. You may
> >     want to remove the -SNAPSHOT suffix. I understand that this is
> confusing
> >     but that's how our release tooling currently works; it removes the
> >     snapshot suffix during publishing the artifacts.
> >
> >     I think you bring up a good point, for the sake of release build
> >     reproducibility, we may want to remove the snapshot suffix for the
> >     source release.
> >
> >     Best,
> >     Max
> >
> >     On 26.05.20 17:20, Kyle Weaver wrote:
> >     >> When we release the version, the RC suffix is dropped.
> >     >
> >     > I think this might not actually be true, at least for the
> git tag,
> >     since
> >     > we just copy the tag from the accepted RC without changing
> anything.
> >     > However, it might not matter because RC2 artifacts should be
> identical
> >     > to the final release artifacts.
> >     >
> >     >> In other words, how to check out the sources of Beam 2.20.0
> and build
> >     > them to get the released artifacts?
> >     >
> >     > As Max said, we build and publish artifacts (Jars, Docker
> containers,
> >     > Python wheels, etc.) for each release, so it usually isn't
> >     necessary to
> >     > build them oneself unless you are testing on head or other
> >     unreleased code.
> >     >
> >     > On Tue, May 26, 2020 at 6:02 AM Jacek Laskowski
> mailto:ja...@japila.pl>
> >     <mailto:ja...@japila.pl <mailto:ja...@japila.pl>>
> >     > <mailto:ja...@japila.pl &l

Re: What's the purpose of version=2.20.0-RC2 in gradle.properties?

2020-05-26 Thread Maximilian Michels
Don't think so. Feel free to create one.

We already have a script which updates the version to a non-snapshot
version:
https://github.com/apache/beam/blob/master/release/src/main/scripts/set_version.sh

However, it seems that this is merely a variant of the script which we
use to cut the release branch:
https://github.com/apache/beam/blob/master/release/src/main/scripts/cut_release_branch.sh

I think the "set_version.sh" script could be called in the release
scripts to remove the -SNAPSHOT suffix on the release branch.

Btw, in case you haven't seen it, here is our release guide:
https://beam.apache.org/contribute/release-guide/

-Max

On 26.05.20 19:02, Jacek Laskowski wrote:
> Hi Max,
> 
>> I think you bring up a good point, for the sake of release build
> reproducibility, we may want to remove the snapshot suffix for the
> source release.
> 
> Wish I could be as clear as yourself with this. Yes, that's what I've
> been bothered about. Is there a JIRA issue for this already? I've never
> been good at releases but certainly could help a bit here and there
> since I'm interested in having reproducible builds (from the tags).
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
> 
> <https://twitter.com/jaceklaskowski>
> 
> 
> On Tue, May 26, 2020 at 5:37 PM Maximilian Michels  <mailto:m...@apache.org>> wrote:
> 
> If you really want to work with the source code, I'd recommend using the
> released source code:
> https://beam.apache.org/get-started/downloads/#releases
> 
> Even there the version in gradle.properties says x.y.z-SNAPSHOT. You may
> want to remove the -SNAPSHOT suffix. I understand that this is confusing
> but that's how our release tooling currently works; it removes the
> snapshot suffix during publishing the artifacts.
> 
> I think you bring up a good point, for the sake of release build
> reproducibility, we may want to remove the snapshot suffix for the
> source release.
> 
> Best,
> Max
> 
> On 26.05.20 17:20, Kyle Weaver wrote:
> >> When we release the version, the RC suffix is dropped.
> >
> > I think this might not actually be true, at least for the git tag,
> since
> > we just copy the tag from the accepted RC without changing anything.
> > However, it might not matter because RC2 artifacts should be identical
> > to the final release artifacts.
> >
> >> In other words, how to check out the sources of Beam 2.20.0 and build
> > them to get the released artifacts?
> >
> > As Max said, we build and publish artifacts (Jars, Docker containers,
> > Python wheels, etc.) for each release, so it usually isn't
> necessary to
> > build them oneself unless you are testing on head or other
> unreleased code.
> >
> > On Tue, May 26, 2020 at 6:02 AM Jacek Laskowski  <mailto:ja...@japila.pl>
> > <mailto:ja...@japila.pl <mailto:ja...@japila.pl>>> wrote:
> >
> >     Hi Max,
> >
> >     > You probably want to work with the release artifacts, instead of
> >     cloning
> >     > the development branch.
> >
> >     I'm not sure I understand.
> >
> >     I did the following to work with the sources of v2.20.0. Am
> >     I missing something?
> >
> >     git fetch --all --tags --prune
> >     git checkout -b v2.20.0 v2.20.0
> >
> >     The last commit on the branch
> >     is 9f0cb649d39ee6236ea27f111acb4b66591a80ec that matches the repo.
> >
> >   
>  
> https://github.com/apache/beam/commit/9f0cb649d39ee6236ea27f111acb4b66591a80ec
> >
> >     commit 9f0cb649d39ee6236ea27f111acb4b66591a80ec (HEAD -> v2.20.0,
> >     tag: v2.20.0-RC2, tag: v2.20.0)
> >     Author: amaliujia  <mailto:ruw...@google.com> <mailto:ruw...@google.com
> <mailto:ruw...@google.com>>>
> >     Date:   Wed Apr 8 14:38:47 2020 -0700
> >
> >         [Gradle Release Plugin] - pre tag commit:  'v2.20.0-RC2'.
> >
> >      gradle.properties | 2 +-
> >      1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >     That commit introduced the RC2:
> >
> >     -version=2.20.0-SNAPSHOT
> >     +version=2.20.0-RC2
> >
> >     Why is there no 2.20.0 only commit? One that wo

Re: What's the purpose of version=2.20.0-RC2 in gradle.properties?

2020-05-26 Thread Maximilian Michels
If you really want to work with the source code, I'd recommend using the
released source code:
https://beam.apache.org/get-started/downloads/#releases

Even there the version in gradle.properties says x.y.z-SNAPSHOT. You may
want to remove the -SNAPSHOT suffix. I understand that this is confusing
but that's how our release tooling currently works; it removes the
snapshot suffix during publishing the artifacts.

I think you bring up a good point, for the sake of release build
reproducibility, we may want to remove the snapshot suffix for the
source release.

Best,
Max

On 26.05.20 17:20, Kyle Weaver wrote:
>> When we release the version, the RC suffix is dropped.
> 
> I think this might not actually be true, at least for the git tag, since
> we just copy the tag from the accepted RC without changing anything.
> However, it might not matter because RC2 artifacts should be identical
> to the final release artifacts.
> 
>> In other words, how to check out the sources of Beam 2.20.0 and build
> them to get the released artifacts?
> 
> As Max said, we build and publish artifacts (Jars, Docker containers,
> Python wheels, etc.) for each release, so it usually isn't necessary to
> build them oneself unless you are testing on head or other unreleased code.
> 
> On Tue, May 26, 2020 at 6:02 AM Jacek Laskowski  <mailto:ja...@japila.pl>> wrote:
> 
> Hi Max,
> 
> > You probably want to work with the release artifacts, instead of
> cloning
> > the development branch.
> 
> I'm not sure I understand.
> 
> I did the following to work with the sources of v2.20.0. Am
> I missing something?
> 
> git fetch --all --tags --prune
> git checkout -b v2.20.0 v2.20.0
> 
> The last commit on the branch
> is 9f0cb649d39ee6236ea27f111acb4b66591a80ec that matches the repo.
> 
> 
> https://github.com/apache/beam/commit/9f0cb649d39ee6236ea27f111acb4b66591a80ec
> 
> commit 9f0cb649d39ee6236ea27f111acb4b66591a80ec (HEAD -> v2.20.0,
> tag: v2.20.0-RC2, tag: v2.20.0)
> Author: amaliujia mailto:ruw...@google.com>>
> Date:   Wed Apr 8 14:38:47 2020 -0700
> 
>     [Gradle Release Plugin] - pre tag commit:  'v2.20.0-RC2'.
> 
>  gradle.properties | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> That commit introduced the RC2:
> 
> -version=2.20.0-SNAPSHOT
> +version=2.20.0-RC2
> 
> Why is there no 2.20.0 only commit? One that would be like this for
> Spark 2.4.5 [1] or Kafka 2.5.0 [2]?
> 
> [1] 
> https://github.com/apache/spark/commit/cee4ecbb16917fa85f02c635925e2687400aa56b
> [2] 
> https://github.com/apache/kafka/commit/66563e712b0b9f84f673b262f2fb87c03110084d
> 
> In other words, how to check out the sources of Beam 2.20.0 and
> build them to get the released artifacts?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
> 
> <https://twitter.com/jaceklaskowski>
> 
> 
> On Mon, May 25, 2020 at 12:00 PM Maximilian Michels  <mailto:m...@apache.org>> wrote:
> 
> Hi Jacek,
> 
> The Gradle property is the source of truth for the Beam version.
> When we
> release the version, the RC suffix is dropped.
> 
> The use of snapshot versions is normal during the development
> process.
> You probably want to work with the release artifacts, instead of
> cloning
> the development branch.
> 
> -Max
> 
> On 24.05.20 12:45, Jacek Laskowski wrote:
> > Hi,
> >
> > I git cloned https://github.com/apache/beam/tree/v2.20.0 and
> > found version=2.20.0-RC2 in gradle.properties. What's the
> purpose of the
> > version property?
> >
> > (The main reason I'm asking is that I try to find out why
> gradle / IDEA
> > attaches 2.20.0-SNAPSHOT dependencies to projects. How is that
> possible
> > that any of the two would ever consider SNAPSHOT as a dependency?)
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > "The Internals Of" Online Books <https://books.japila.pl/>
> > Follow me on https://twitter.com/jaceklaskowski
> >
> > <https://twitter.com/jaceklaskowski>
> 


[BEAM-10054] Pipeline stalls with DirectRunner

2020-05-26 Thread Maximilian Michels
Could somebody familiar with the Python SDK take a look at this problem?
It manifests in the Direct Runner stalling execution.

Tests are passing but I'm unsure about the context of the commit which
introduced the change (linked in the PR):
https://github.com/apache/beam/pull/11777

Thanks,
Max


Re: What's the purpose of version=2.20.0-RC2 in gradle.properties?

2020-05-25 Thread Maximilian Michels
Hi Jacek,

The Gradle property is the source of truth for the Beam version. When we
release the version, the RC suffix is dropped.

The use of snapshot versions is normal during the development process.
You probably want to work with the release artifacts, instead of cloning
the development branch.

-Max

On 24.05.20 12:45, Jacek Laskowski wrote:
> Hi,
> 
> I git cloned https://github.com/apache/beam/tree/v2.20.0 and
> found version=2.20.0-RC2 in gradle.properties. What's the purpose of the
> version property?
> 
> (The main reason I'm asking is that I try to find out why gradle / IDEA
> attaches 2.20.0-SNAPSHOT dependencies to projects. How is that possible
> that any of the two would ever consider SNAPSHOT as a dependency?)
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
> 
> 


Re: Event Calendar?

2020-05-21 Thread Maximilian Michels
Would it make sense to combine it with the Apache Beam release calendar?

https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles

On 20.05.20 19:27, Tyson Hamilton wrote:
> +1 a calendar would be nice. 
> 
> On Tue, May 19, 2020 at 3:51 PM Austin Bennett
> mailto:whatwouldausti...@gmail.com>> wrote:
> 
> Hi All,
> 
> As we now have more frequent and more accessible (digital) events,
> I am wondering whether others see value in adding a calendar to the
> website?
> 
> Perhaps related, is it worth
> updating https://beam.apache.org/community/in-person/ <- to
> something that isn't 'in-person' since doing things in-person is
> perhaps (hopefully not completely) a vestige of the past.  
> 
> Cheers,
> Austin
> 


Re: New Dates For Beam Summit Digital 2020

2020-05-21 Thread Maximilian Michels
Thanks Matthias!

We realized we want this to be a much more community-driven process.
That's why we are planning to be more transparent to give everyone a
chance to get involved in the summit. Given that we now have more time,
this will be much more feasible.

Cheers,
Max

On 18.05.20 20:00, Matthias Baetens wrote:
> Dear Beam community,
> 
> A few weeks ago, we announced the dates for the Beam Digital Summit and
> we know the community received this news with excitement. This is a
> great opportunity to create and share content about streaming analytics
> and the solutions that teams around the world have created using Apache
> Beam and its ecosystem. 
> 
> We have chosen August 24-28th as the new dates. While this has been a
> difficult decision, we think it’s the right decision to ensure we
> produce the best possible event. We encourage you to send your talk
> proposals, anything from use cases, lightning talks, or workshop ideas.
> 
> Based on this change, the CFP will remain open until June 15th. We would
> love to hear about what you are doing with Beam, how to improve it, and
> how to strengthen our community.
> 
> We thank you for your understanding! See you soon!
> 
> -Griselda Cuevas, Brittany Hermann, Maximilian Michels, Austin Bennett,
> Matthias Baetens, Alex Van Boxel
> 


Re: Transparency to Beam Digital Summit Planning

2020-05-21 Thread Maximilian Michels
+1 for making the notes publicly available. This list is free to join by
anyone.

On 21.05.20 00:17, Austin Bennett wrote:
> Should the link/meeting notes be publicly available?  Not just available
> to individuals plus all of @google?  
> 
> 
> 
> On Wed, May 20, 2020 at 2:06 PM Brittany Hermann wrote:
> 
> Hi folks,
> 
> I wanted to provide a few different ways of transparency to you
> during the planning of the Beam Digital Summit. 
> 
> 1) *Beam Summit Status Reports:* I will be sending out weekly Beam
> Summit Status Reports which will include the goals, attendees,
> topics discussed, and decisions made every Wednesday. 
> 
> 2) *Community Guests on Committee Planning Calls:* We would like to
> invite you to join as a guest to these planning calls. This would
> allow for observation of the planning process and to see if there
> are ways for future collaboration on promotions, etc. for the event.
> If you are interested in joining the first bi-weekly meeting
> starting next week, please reach out to me and I will send the
> invite with call-in information directly to you. 
> 
> In the meantime, I have attached this week's Beam Summit Status
> report below. 
> 
> 
> https://docs.google.com/document/d/1_jLhKvW5MTtkHOZDJyzCTSLUDiD4RjlJmU35rXV-3n0/edit?usp=sharing
> 
> Have a great rest of your week! 
> 
> -- 
> 
>   
> 
> Brittany Hermann
> 
> Open Source Program Manager (Provided by Adecco Staffing)
> 
> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
> 
> 
> 
> 


Re: Running NexMark Tests

2020-05-19 Thread Maximilian Michels
Looks like an accidental change to me. Running with either version, 1.9
or 1.10, works, but the documentation should be changed back to the latest version.

Do you mind creating a PR?

Thanks,
Max

On 19.05.20 13:02, Sruthi Sree Kumar wrote:
> On the documentation, the version of Flink runner is changed to 1.9
> which was 1.10(latest)
> before 
> https://github.com/apache/beam/commit/1d2700818474c008eaa324ac1b5c49c9d2857298#diff-0e75160f4b09a1a300671557930589d9.
> 
> Is this an accidental change or is there any particular reason for this
> downgrade of version?
> 
> Regards,
> Sruthi
> 
> On Tue, May 12, 2020 at 7:21 PM Maximilian Michels <m...@apache.org> wrote:
> 
> A heads-up if anybody else sees this, we have removed the flag:
> https://jira.apache.org/jira/browse/BEAM-9900
> 
> Further contributions are very welcome :)
> 
> -Max
> 
> On 11.05.20 17:05, Sruthi Sree Kumar wrote:
> > I have opened a PR with the documentation change.
> > https://github.com/apache/beam/pull/11662
> >
> > Regards,
> > Sruthi
> >
> > On 2020/04/21 20:22:17, Ismaël Mejía <ieme...@gmail.com> wrote:
> >> You need to instruct the Flink runner to shut down the source,
> >> otherwise it will stay waiting.
> >> You can do this by adding the extra
> >> argument`--shutdownSourcesOnFinalWatermark=true`
> >> And if that works and you want to open a PR to update our
> >> documentation that would be greatly appreciated.
> >>
> >> Regards,
> >> Ismaël
> >>
> >>
> >> On Tue, Apr 21, 2020 at 10:04 PM Sruthi Sree Kumar
> >> <sruthisreekumar2...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I am trying to run nexmark queries using flink runner streaming.
> Followed the documentation and used the command
> >>> ./gradlew :sdks:java:testing:nexmark:run \
> >>>
> >>>     -Pnexmark.runner=":runners:flink:1.10" \
> >>>     -Pnexmark.args="
> >>>         --runner=FlinkRunner
> >>>         --suite=SMOKE
> >>>         --streamTimeout=60
> >>>         --streaming=true
> >>>         --manageResources=false
> >>>         --monitorJobs=true
> >>>         --flinkMaster=[local]"
> >>>
> >>>
> >>> But after the events are read from the source, there is no
> further progress and the job is always stuck at 99%. Is there any
> configuration that I am missing?
> >>>
> >>> Regards,
> >>> Sruthi
> >>
> 


Re: TextIO. Writing late files

2020-05-19 Thread Maximilian Michels
> This is still confusing to me - why would the messages be dropped as late in 
> this case?

Since you previously mentioned that the bug is due to the pane info
missing, I just pointed out that the WriteFiles logic is expected to
drop the pane info.

@Jose Would it make sense to file a JIRA and summarize all the findings
here?

@Jozef What you describe in
https://www.mail-archive.com/dev@beam.apache.org/msg20186.html is
expected because Flink does not do a GroupByKey on Reshuffle but just
redistributes the elements.

Thanks,
Max

On 18.05.20 21:59, Jose Manuel wrote:
> Hi Reuven, 
> 
> I can try to explain what I think is happening.
> 
> - There is a source which is reading data entries and updating the
> watermark.
> - Then, data entries are grouped and stored in files. 
> - The window information of these data entries is used to emit
> filenames (the data entries' window and timestamp; the PaneInfo is empty).
> - When a second window is applied to the filenames, if the allowed lateness
> is zero or lower than the time spent in the previous reading/writing, the
> filenames are discarded as late.
> 
> I guess, the key is in 
> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/LateDataDroppingDoFnRunner.java#L168
> 
> My assumption is that the global watermark (or source watermark, I am not sure
> about the name) is used to evaluate the filenames, which are in an
> already emitted window.
> 
> Thanks
> Jose
> 
> 
> On Mon, May 18, 2020 at 18:37, Reuven Lax (<re...@google.com>) wrote:
> 
> This is still confusing to me - why would the messages be dropped as
> late in this case?
> 
> On Mon, May 18, 2020 at 6:14 AM Maximilian Michels <m...@apache.org> wrote:
> 
> All runners which use the Beam reference implementation drop the
> PaneInfo for
> WriteFilesResult#getPerDestinationOutputFilenames(). That's
> why we can observe this behavior not only in Flink but also Spark.
> 
> The WriteFilesResult is returned here:
> 
> https://github.com/apache/beam/blob/d773f8ca7a4d63d01472b5eaef8b67157d60f40e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L363
> 
> GatherBundlesPerWindow will discard the pane information because all
> buffered elements are emitted in the FinishBundle method which
> always
> has a NO_FIRING (unknown) pane info:
> 
> https://github.com/apache/beam/blob/d773f8ca7a4d63d01472b5eaef8b67157d60f40e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L895
> 
> So this seems expected behavior. We would need to preserve the
> panes in
> the Multimap buffer.
> 
> -Max
> 
> On 15.05.20 18:34, Reuven Lax wrote:
> > Lateness should never be introduced inside a pipeline -
> generally late
> > data can only come from a source.  If data was not dropped as late
> > earlier in the pipeline, it should not be dropped after the
> file write.
> > I suspect that this is a bug in how the Flink runner handles the
> > Reshuffle transform, but I'm not sure what the exact bug is.
> >
> > Reuven
> >
> > On Fri, May 15, 2020 at 2:23 AM Jozef Vilcek <jozo.vil...@gmail.com>
> > wrote:
> >
> >     Hi Jose,
> >
> >     thank you for putting the effort to get example which
> >     demonstrate your problem. 
> >
> >     You are using a streaming pipeline and it seems that
> watermark in
> >     downstream already advanced further, so when your File
> pane arrives,
> >     it is already late. Since you define that lateness is not
> tolerated,
> >     it is dropped.
> >     I myself never had requirement to specify zero allowed
> lateness for
> >     streaming. It feels dangerous. Do you have a specific use
> case?
> >     Also, in many cases, after windowed files are written, I
> usually
> >     collect them into global window and specify a different
> triggering
> >     policy for collecting them. Both cases are why I never
> came across
> >     this situation.
> >
> >     I do not have an explanation if it is a bug or not. I
> would guess
> >     that watermark can advance further, e.g. because 

Re: TextIO. Writing late files

2020-05-18 Thread Maximilian Michels
All runners which use the Beam reference implementation drop the
PaneInfo for WriteFilesResult#getPerDestinationOutputFilenames(). That's
why we can observe this behavior not only in Flink but also Spark.

The WriteFilesResult is returned here:
https://github.com/apache/beam/blob/d773f8ca7a4d63d01472b5eaef8b67157d60f40e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L363

GatherBundlesPerWindow will discard the pane information because all
buffered elements are emitted in the FinishBundle method which always
has a NO_FIRING (unknown) pane info:
https://github.com/apache/beam/blob/d773f8ca7a4d63d01472b5eaef8b67157d60f40e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L895

So this seems expected behavior. We would need to preserve the panes in
the Multimap buffer.
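
For illustration, a minimal hypothetical sketch (not the actual WriteFiles
code) of the buffering pattern described above. Output produced in
@FinishBundle is not associated with any input element; only a timestamp
and a window can be attached to it, so downstream consumers observe an
unknown (NO_FIRING) pane:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.joda.time.Instant;

    class BufferingFn extends DoFn<String, String> {
      private static class Buffered {
        final String value;
        final Instant timestamp;
        final BoundedWindow window;

        Buffered(String value, Instant timestamp, BoundedWindow window) {
          this.value = value;
          this.timestamp = timestamp;
          this.window = window;
        }
      }

      private transient List<Buffered> buffer;

      @StartBundle
      public void startBundle() {
        buffer = new ArrayList<>();
      }

      @ProcessElement
      public void process(ProcessContext ctx, BoundedWindow window) {
        // The element's pane (ctx.pane()) cannot be carried over to the
        // finishBundle output below; only timestamp and window survive.
        buffer.add(new Buffered(ctx.element(), ctx.timestamp(), window));
      }

      @FinishBundle
      public void finishBundle(FinishBundleContext ctx) {
        for (Buffered b : buffer) {
          // FinishBundleContext#output takes only a timestamp and a window;
          // there is no way to specify a pane here.
          ctx.output(b.value, b.timestamp, b.window);
        }
      }
    }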

-Max

On 15.05.20 18:34, Reuven Lax wrote:
> Lateness should never be introduced inside a pipeline - generally late
> data can only come from a source.  If data was not dropped as late
> earlier in the pipeline, it should not be dropped after the file write.
> I suspect that this is a bug in how the Flink runner handles the
> Reshuffle transform, but I'm not sure what the exact bug is.
> 
> Reuven
> 
> On Fri, May 15, 2020 at 2:23 AM Jozef Vilcek wrote:
> 
> Hi Jose,
> 
> thank you for putting in the effort to produce an example which
> demonstrates your problem.
> 
> You are using a streaming pipeline and it seems that watermark in
> downstream already advanced further, so when your File pane arrives,
> it is already late. Since you define that lateness is not tolerated,
> it is dropped.
> I myself never had a requirement to specify zero allowed lateness for
> streaming. It feels dangerous. Do you have a specific use case?
> Also, in many cases, after windowed files are written, I usually
> collect them into global window and specify a different triggering
> policy for collecting them. Both cases are why I never came across
> this situation.
> 
> I do not have an explanation for whether it is a bug or not. I would guess
> that the watermark can advance further, e.g. because elements can be
> processed in arbitrary order. Not saying this is the case.
> It needs someone with a better understanding of how watermark advancement
> is / should be handled within pipelines.
> 
> 
> P.S.: you can add `.withTimestampFn()` to your generate sequence, to
> get more stable timing, which is also easier to reason about:
> 
> Dropping element at 1970-01-01T00:00:19.999Z for key
> ... window:[1970-01-01T00:00:15.000Z..1970-01-01T00:00:20.000Z)
> since too far behind inputWatermark:1970-01-01T00:00:24.000Z;
> outputWatermark:1970-01-01T00:00:24
> .000Z
> 
>            instead of
> 
> Dropping element at 2020-05-15T08:52:34.999Z for key ...
> window:[2020-05-15T08:52:30.000Z..2020-05-15T08:52:35.000Z) since
> too far behind inputWatermark:2020-05-15T08:52:39.318Z;
> outputWatermark:2020-05-15T08:52:39.318Z
> 
> 
> 
> 
> In my
> 
> 
> 
> On Thu, May 14, 2020 at 10:47 AM Jose Manuel wrote:
> 
> Hi again, 
> 
> I have simplify the example to reproduce the data loss. The
> scenario is the following:
> 
> - TextIO write files. 
> - getPerDestinationOutputFilenames emits file names 
> - File names are processed by an aggregator (combine, distinct,
> groupByKey...) with a window **without allowed lateness**
> - File names are discarded as late
> 
> Here you can see the data loss in the picture
> in 
> https://github.com/kiuby88/windowing-textio/blob/master/README.md#showing-data-loss
> 
> Please, follow README to run the pipeline and find log traces
> that say data are dropped as late.
> Remember, you can run the pipeline with another
> window's  lateness values (check README.md)
> 
> Kby.
> 
> On Tue, May 12, 2020 at 17:16, Jose Manuel
> (<kiuby88@gmail.com>) wrote:
> 
> Hi,
> 
> I would like to clarify that while TextIO is writing, all
> data are in the files (shards). The loss happens when file
> names emitted by getPerDestinationOutputFilenames are
> processed by a window.
> 
> I have created a pipeline to reproduce the scenario in which
> some filenames are lost after the
> getPerDestinationOutputFilenames. Please, note I tried to
> simplify the code as much as possible, but the scenario is
> not easy to reproduce.
> 
> Please check this project
> https://github.com/kiuby88/windowing-textio
> Check readme to build and run
> (https://github.com/kiuby88/windowing-textio#build-and-run)
> Project contains only a class with the
>  

Re: Running NexMark Tests

2020-05-12 Thread Maximilian Michels
A heads-up if anybody else sees this, we have removed the flag:
https://jira.apache.org/jira/browse/BEAM-9900

Further contributions are very welcome :)

-Max

On 11.05.20 17:05, Sruthi Sree Kumar wrote:
> I have opened a PR with the documentation change.
> https://github.com/apache/beam/pull/11662
> 
> Regards,
> Sruthi
> 
> On 2020/04/21 20:22:17, Ismaël Mejía wrote:
>> You need to instruct the Flink runner to shut down the source,
>> otherwise it will stay waiting.
>> You can do this by adding the extra
>> argument`--shutdownSourcesOnFinalWatermark=true`
>> And if that works and you want to open a PR to update our
>> documentation that would be greatly appreciated.
>>
>> Regards,
>> Ismaël
>>
>>
>> On Tue, Apr 21, 2020 at 10:04 PM Sruthi Sree Kumar
>>  wrote:
>>>
>>> Hello,
>>>
>>> I am trying to run nexmark queries using flink runner streaming. Followed 
>>> the documentation and used the command
>>> ./gradlew :sdks:java:testing:nexmark:run \
>>>
>>> -Pnexmark.runner=":runners:flink:1.10" \
>>> -Pnexmark.args="
>>> --runner=FlinkRunner
>>> --suite=SMOKE
>>> --streamTimeout=60
>>> --streaming=true
>>> --manageResources=false
>>> --monitorJobs=true
>>> --flinkMaster=[local]"
>>>
>>>
>>> But after the events are read from the source, there is no further progress 
>>> and the job is always stuck at 99%. Is there any configuration that I am 
>>> missing?
>>>
>>> Regards,
>>> Sruthi
>>


Re: Beam 2.21 release update

2020-05-11 Thread Maximilian Michels
FYI I've created this issue and marked it as a blocker:
https://jira.apache.org/jira/browse/BEAM-9947

Essentially, the timer encoding is broken for all non-standard key
coders. The fix can be found here: https://github.com/apache/beam/pull/11658

-Max

On 08.05.20 18:53, Udi Meiri wrote:
> +Chad Dombrova, who added _find_protoc_gen_mypy.
> 
> I'm guessing that the code
> in _install_grpcio_tools_and_generate_proto_files creates a kind of
> virtualenv, but it only works well for staging Python modules and not
> binaries like protoc-gen-mypy.
> (I assume there's a reason why it doesn't invoke virtualenv, probably
> since the list of things setup.py can expect to be installed is very
> minimal (setuptools).)
> 
> One solution would be to make these setup.py dependencies explicit in
> pyproject.toml, such that pip installs them before running
> setup.py: https://pip.pypa.io/en/stable/reference/pip/#pep-517-and-518-support
> It would help when using tools like pip ("pip wheel"), but I'm not sure
> what the alternative for "python setup.py sdist" is.
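
A minimal sketch of the PEP 518 approach described above (the build
backend and package list here are illustrative assumptions, not Beam's
actual configuration; mypy-protobuf is the package that provides
protoc-gen-mypy):

    [build-system]
    requires = ["setuptools", "wheel", "grpcio-tools", "mypy-protobuf"]
    build-backend = "setuptools.build_meta"

With such a section in place, pip builds the package in an isolated
environment that has these dependencies installed before setup.py runs.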
> 
> 
> On Thu, May 7, 2020 at 10:40 PM Thomas Weise wrote:
> 
> No additional stacktraces. Full error output below.
> 
> It's not clear what is going wrong.
> 
> There isn't any exception from the subprocess execution since the
> "WARNING:root:Installing grpcio-tools took 305.39 seconds." is printed.
> 
> Also, the time it takes to perform the install is equivalent to
> successfully running the pip command.
> 
> I will report back if I find anything else. Currently doing the
> explicit install via pip install -r sdks/python/build-requirements.txt
> 
> Thanks,
> Thomas
> 
> WARNING:root:Installing grpcio-tools took 269.27 seconds.
> INFO:gen_protos:Regenerating Python proto definitions (no output files).
> Process Process-1:
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in
> _bootstrap
>     self.run()
>   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
>     self._target(*self._args, **self._kwargs)
>   File
> "/src/streamingplatform/beam-release/beam/sdks/python/gen_protos.py", line
> 378, in _install_grpcio_tools_and_generate_proto_files
>     generate_proto_files(force=force)
>   File
> "/src/streamingplatform/beam-release/beam/sdks/python/gen_protos.py", line
> 315, in generate_proto_files
>     protoc_gen_mypy = _find_protoc_gen_mypy()
>   File
> "/src/streamingplatform/beam-release/beam/sdks/python/gen_protos.py", line
> 233, in _find_protoc_gen_mypy
>     (fname, ', '.join(search_paths)))
> RuntimeError: Could not find protoc-gen-mypy in
> /code/venvs/venv2/bin, /code/venvs/venv2/bin, /code/venvs/venv3/bin,
> /usr/local/sbin, /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin
> Traceback (most recent call last):
>   File "setup.py", line 311, in 
>     'mypy': generate_protos_first(mypy),
>   File
> 
> "/code/venvs/venv2/local/lib/python2.7/site-packages/setuptools/__init__.py",
> line 129, in setup
>     return distutils.core.setup(**attrs)
>   File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
>     dist.run_commands()
>   File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
>     self.run_command(cmd)
>   File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
>     cmd_obj.run()
>   File
> 
> "/code/venvs/venv2/local/lib/python2.7/site-packages/wheel/bdist_wheel.py",
> line 204, in run
>     self.run_command('build')
>   File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
>     self.distribution.run_command(command)
>   File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
>     cmd_obj.run()
>   File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
>     self.run_command(cmd_name)
>   File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
>     self.distribution.run_command(command)
>   File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
>     cmd_obj.run()
>   File "setup.py", line 235, in run
>     gen_protos.generate_proto_files()
>   File
> "/src/streamingplatform/beam-release/beam/sdks/python/gen_protos.py", line
> 310, in generate_proto_files
>     raise ValueError("Proto generation failed (see log for details).")
> ValueError: Proto generation failed (see log for details).
> 
> 
> On Thu, May 7, 2020 at 2:25 PM Udi Meiri wrote:
> 
> It's hard to say without more details what's going on. Ahmet
> you're right that it installs build-requirements.txt and retries
> calling generate_proto_files().
> 
> Thomas, were there additional stacktraces? (after a 

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-05-05 Thread Maximilian Michels
Hey Eleanore,

The change will be part of the 2.21.0 release.

-Max

On 04.05.20 19:14, Eleanore Jin wrote:
> Hi Max, 
> 
> Thanks for the information. I saw this PR is already merged; I just
> wonder, is it backported to the affected versions already
> (i.e. 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0)? Or I have
> to wait for the 2.20.1 release? 
> 
> Thanks a lot!
> Eleanore
> 
> On Wed, Apr 22, 2020 at 2:31 AM Maximilian Michels <m...@apache.org> wrote:
> 
> Hi Eleanore,
> 
> Exactly-once is not affected but the pipeline can fail to checkpoint
> after the maximum number of state cells have been reached. We are
> working on a fix [1].
> 
> Cheers,
> Max
> 
> [1] https://github.com/apache/beam/pull/11478
> 
> On 22.04.20 07:19, Eleanore Jin wrote:
> > Hi Maxi, 
> >
> > I assume this will impact the Exactly Once Semantics that beam
> provided
> > as in the KafkaExactlyOnceSink, the processElement method is also
> > annotated with @RequiresStableInput?
> >
> > Thanks a lot!
> > Eleanore
> >
> > On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels
> > <m...@apache.org> wrote:
> >
> >     Hi Stephen,
> >
> >     Thanks for reporting the issue! David, good catch!
> >
> >     I think we have to resort to only using a single state cell for
> >     buffering on checkpoints, instead of using a new one for every
> >     checkpoint. I was under the assumption that, if the state cell was
> >     cleared, it would not be checkpointed but that does not seem to be
> >     the case.
> >
> >     Thanks,
> >     Max
> >
> >     On 21.04.20 09:29, David Morávek wrote:
> >     > Hi Stephen,
> >     >
> >     > nice catch and awesome report! ;) This definitely needs a
> proper fix.
> >     > I've created a new JIRA to track the issue and will try to
> resolve it
> >     > soon as this seems critical to me.
> >     >
> >     > https://issues.apache.org/jira/browse/BEAM-9794
> >     >
> >     > Thanks,
> >     > D.
> >     >
> >     > On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel
> >     > <stephenpate...@gmail.com> wrote:
> >     >
> >     >     I was able to reproduce this in a unit test:
> >     >
> >     >         @Test
> >     >         public void test() throws InterruptedException, ExecutionException {
> >     >           FlinkPipelineOptions options =
> >     >               PipelineOptionsFactory.as(FlinkPipelineOptions.class);
> >     >           options.setCheckpointingInterval(10L);
> >     >           options.setParallelism(1);
> >     >           options.setStreaming(true);
> >     >           options.setRunner(FlinkRunner.class);
> >     >           options.setFlinkMaster("[local]");
> >     >           options.setStateBackend(new MemoryStateBackend(Integer.MAX_VALUE));
> >     >           Pipeline pipeline = Pipeline.create(options);
> >     >           pipeline
> >     >               .apply(Create.of((Void) null))
> >     >               .apply(
> >     >                   ParDo.of(
> >     >                       new DoFn<Void, Void>() {
> >     >                         private static final long serialVersionUID = 1L;
> >     >
> >     >                         @RequiresStableInput
> 

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
On 30.04.20 21:48, Hannah Jiang wrote:
> The --info flag was passed to the docker image build commands in the
> PythonDocker Precommit to capture more logs. Without it, errors from
> the Dockerfile steps are not printed to the console.

Thanks for the info (pun intended).

On 30.04.20 21:48, Hannah Jiang wrote:
> Indeed, I can see the no space left on device in the following but
> not in the log above:
> 
> The --info flag was passed to the docker image build commands in the
> PythonDocker Precommit to capture more logs. Without it, errors from
> the Dockerfile steps are not printed to the console.
> 
> On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri <eh...@google.com> wrote:
> 
> I checked node 8 and it had over 40GB space available. Does your job
> require more than that?
> 
> Long term, I'm thinking we could clean up workspaces for successful
> jobs. This should free up additional space (I guess at least 100GB).
> https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin
> to clean workspaces at job start.
> 
> 
> On Thu, Apr 30, 2020, 07:33 Maximilian Michels <m...@apache.org> wrote:
> 
> It's working again, probably because it's running on a different
> machine now.
> 
> Who can check the disk space of the Jenkins hosts?
> 
> Thanks,
> Max
> 
> On 30.04.20 11:55, Maximilian Michels wrote:
> > Sorry, I meant to include the Jenkins log:
> >
> 
> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> >
> > Thanks for investigating Hannah! Indeed, I can see the no
> space left on
> > device in the following but not in the log above:
> >
> 
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> >
> > I'm going to try running the build again. Do you think we
> could add more
> > storage to our Jenkins hosts or delete old build data?
> >
> > Thanks,
> > Max
> >
> > On 30.04.20 08:43, Hannah Jiang wrote:
> >> Max, I found a link from your PR and noticed below errors.
> This would be
> >> the true error.
> >>
> >> *07:57:03* >*Task :sdks:python:container:py37:docker*
> >> *07:57:03*  [91mERROR: Could not install packages due to an
> EnvironmentError: [Errno 28] No space left on device
> >> *07:57:03*
> >> *07:57:03*  [0m
> >> *07:57:03* >*Task :sdks:python:container:py35:docker*
> >> *07:57:03*  [91mERROR: Could not install packages due to an
> EnvironmentError: [Errno 28] No space left on device
> >>
> >>
> >>
> >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang
> mailto:hannahji...@google.com>
> >> <mailto:hannahji...@google.com
> <mailto:hannahji...@google.com>>> wrote:
> >>
> >>     There is a PythonDocker Precommit test running for PRs
> with Python
> >>     changes. It seems running well.[1]
> >>     Max, can you please give me a link so I can check more
> details? Do
> >>     other images with different Python versions fail as well?
> >>
> >>   
>  1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> >>
> >>
> >>     On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay
>     mailto:al...@google.com>
> >>     <mailto:al...@google.com <mailto:al...@google.com>>> wrote:
> >>
> >>         +Valentyn Tymofieiev <mailto:valen...@google.com
> <mailto:valen...@google.com>> +Hannah Jiang
> >>         <mailto:hannahji...@google.com
> <mailto:hannahji...@google.com>> -- in case they have relevant
> >>         information.
> >>
> >>         On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels
> >>         mailto:m...@apache.org>
> <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
> >>
> >>             Hi,
> >>
> >>             has anyone noticed the Python 3.7 Docker
> container fails to
> >>             build? I
> >>             haven't b

Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
Is the issue that the workspace grows over time? Couldn't we delete it
daily to ensure it does not grow too much? Always deleting it on
successful runs may be too costly because we have to recreate the
workspace every time.

Logs are stored separately. I suppose they could also add up over time.

On 30.04.20 21:48, Hannah Jiang wrote:
> Indeed, I can see the no space left on device in the following but
> not in the log above:
> 
> The --info flag was passed to the docker image build commands in the
> PythonDocker Precommit to capture more logs. Without it, errors from
> the Dockerfile steps are not printed to the console.
> 
> On Thu, Apr 30, 2020 at 11:19 AM Udi Meiri <eh...@google.com> wrote:
> 
> I checked node 8 and it had over 40GB space available. Does your job
> require more than that?
> 
> Long term, I'm thinking we could clean up workspaces for successful
> jobs. This should free up additional space (I guess at least 100GB).
> https://plugins.jenkins.io/ws-cleanup/ - we already use this plugin
> to clean workspaces at job start.
> 
> 
> On Thu, Apr 30, 2020, 07:33 Maximilian Michels <m...@apache.org> wrote:
> 
> It's working again, probably because it's running on a different
> machine now.
> 
> Who can check the disk space of the Jenkins hosts?
> 
> Thanks,
> Max
> 
> On 30.04.20 11:55, Maximilian Michels wrote:
> > Sorry, I meant to include the Jenkins log:
> >
> 
> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> >
> > Thanks for investigating Hannah! Indeed, I can see the no
> space left on
> > device in the following but not in the log above:
> >
> 
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> >
> > I'm going to try running the build again. Do you think we
> could add more
> > storage to our Jenkins hosts or delete old build data?
> >
> > Thanks,
> > Max
> >
> > On 30.04.20 08:43, Hannah Jiang wrote:
> >> Max, I found a link from your PR and noticed below errors.
> This would be
> >> the true error.
> >>
> >> *07:57:03* >*Task :sdks:python:container:py37:docker*
> >> *07:57:03*  [91mERROR: Could not install packages due to an
> EnvironmentError: [Errno 28] No space left on device
> >> *07:57:03*
> >> *07:57:03*  [0m
> >> *07:57:03* >*Task :sdks:python:container:py35:docker*
> >> *07:57:03*  [91mERROR: Could not install packages due to an
> EnvironmentError: [Errno 28] No space left on device
> >>
> >>
> >>
> >> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang
> mailto:hannahji...@google.com>
> >> <mailto:hannahji...@google.com
> <mailto:hannahji...@google.com>>> wrote:
> >>
> >>     There is a PythonDocker Precommit test running for PRs
> with Python
> >>     changes. It seems running well.[1]
> >>     Max, can you please give me a link so I can check more
> details? Do
> >>     other images with different Python versions fail as well?
> >>
> >>   
>  1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> >>
> >>
> >>     On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay
> mailto:al...@google.com>
> >>     <mailto:al...@google.com <mailto:al...@google.com>>> wrote:
> >>
> >>         +Valentyn Tymofieiev <mailto:valen...@google.com
> <mailto:valen...@google.com>> +Hannah Jiang
> >>         <mailto:hannahji...@google.com
> <mailto:hannahji...@google.com>> -- in case they have relevant
> >>         information.
> >>
> >>         On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels
> >>         mailto:m...@apache.org>
> <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
> >>
> >>             Hi,
> >>
> >>             has anyone noticed the Python 3.7 Docker
> container fails to
> >>             build? I
> >>   

"DNS resolution failed"

2020-04-30 Thread Maximilian Michels
Hi,

Is anyone familiar with this GRPC error? The build logs are full of it.
Also getting it on my machine when I run tests:

23:17:02 ERROR:apache_beam.runners.worker.data_plane:Failed to read inputs in 
the data plane.
23:17:02 Traceback (most recent call last):
23:17:02   File "apache_beam/runners/worker/data_plane.py", line 528, in 
_read_inputs
23:17:02 for elements in elements_iterator:
23:17:02   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Phrase/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_channel.py",
 line 413, in next
23:17:02 return self._next()
23:17:02   File 
"/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python2_PVR_Flink_Phrase/src/build/gradleenv/1866363813/local/lib/python2.7/site-packages/grpc/_channel.py",
 line 689, in _next
23:17:02 raise self
23:17:02 _MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that 
terminated with:
23:17:02status = StatusCode.UNAVAILABLE
23:17:02details = "DNS resolution failed"
23:17:02debug_error_string = 
"{"created":"@1588108621.907750662","description":"Failed to pick 
subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3981,"referenced_errors":[{"created":"@1588108621.907745000","description":"Resolver
 transient 
failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":214,"referenced_errors":[{"created":"@1588108621.907743049","description":"DNS
 resolution 
failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":357,"grpc_status":14,"referenced_errors":[{"created":"@1588108621.907719737","description":"C-ares
 status is not ARES_SUCCESS: Misformatted domain 
name","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":244,"referenced_errors":[{"created":"@1588108621.907691960","description":"C-ares
 status is not ARES_SUCCESS: Misformatted domain 
name","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":244}]}]}]}]}"

https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Phrase/158/console

Looks like a recent regression. Tracked here:
https://jira.apache.org/jira/browse/BEAM-9851

Thanks,
Max



Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
It's working again, probably because it's running on a different
machine now.

Who can check the disk space of the Jenkins hosts?

Thanks,
Max

On 30.04.20 11:55, Maximilian Michels wrote:
> Sorry, I meant to include the Jenkins log:
> https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console
> 
> Thanks for investigating Hannah! Indeed, I can see the no space left on
> device in the following but not in the log above:
> https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console
> 
> I'm going to try running the build again. Do you think we could add more
> storage to our Jenkins hosts or delete old build data?
> 
> Thanks,
> Max
> 
> On 30.04.20 08:43, Hannah Jiang wrote:
>> Max, I found a link from your PR and noticed below errors. This would be
>> the true error.
>>
>> *07:57:03* >*Task :sdks:python:container:py37:docker*
>> *07:57:03*  [91mERROR: Could not install packages due to an 
>> EnvironmentError: [Errno 28] No space left on device
>> *07:57:03*
>> *07:57:03*  [0m
>> *07:57:03* >*Task :sdks:python:container:py35:docker*
>> *07:57:03*  [91mERROR: Could not install packages due to an 
>> EnvironmentError: [Errno 28] No space left on device
>>
>>
>>
>> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang <hannahji...@google.com> wrote:
>>
>> There is a PythonDocker Precommit test running for PRs with Python
>> changes. It seems running well.[1]
>> Max, can you please give me a link so I can check more details? Do
>> other images with different Python versions fail as well?
>>
>> 1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
>>
>>
>> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay <al...@google.com> wrote:
>>
>> +Valentyn Tymofieiev <valen...@google.com> +Hannah Jiang
>> <hannahji...@google.com> -- in case they have relevant
>> information.
>>
>> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels
>> <m...@apache.org> wrote:
>>
>> Hi,
>>
>> has anyone noticed the Python 3.7 Docker container fails to
>> build? I
>> haven't been able to build the Python 3.7 container, neither
>> locally nor
>> on Jenkins.
>>
>> I get:
>>
>> 17:48:10 > Task :sdks:python:container:py37:docker
>> 17:49:36 The command '/bin/sh -c pip install -r
>> /tmp/base_image_requirements.txt && python -c "from
>> google.protobuf.internal import api_implementation; assert
>> api_implementation._default_implementation_type == 'cpp'; print
>> ('Verified fast protobuf used.')" && rm -rf
>> /root/.cache/pip' returned a
>> non-zero code: 1
>> 17:49:36
>> 17:49:36 > Task :sdks:python:container:py37:docker FAILED
>>
>>
>> Cheers,
>> Max
>>


Re: Python 3.7 docker container fails to build

2020-04-30 Thread Maximilian Michels
Sorry, I meant to include the Jenkins log:
https://builds.apache.org/job/beam_LoadTests_Python_ParDo_Flink_Streaming_PR/5/console

Thanks for investigating, Hannah! Indeed, I can see the "no space left on
device" error in the following log but not in the one above:
https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/473/console

I'm going to try running the build again. Do you think we could add more
storage to our Jenkins hosts or delete old build data?

Thanks,
Max

On 30.04.20 08:43, Hannah Jiang wrote:
> Max, I found a link from your PR and noticed the errors below. This would be
> the true error.
> 
> *07:57:03* >*Task :sdks:python:container:py37:docker*
> *07:57:03*  [91mERROR: Could not install packages due to an EnvironmentError: 
> [Errno 28] No space left on device
> *07:57:03*
> *07:57:03*  [0m
> *07:57:03* >*Task :sdks:python:container:py35:docker*
> *07:57:03*  [91mERROR: Could not install packages due to an EnvironmentError: 
> [Errno 28] No space left on device
> 
> 
> 
> On Wed, Apr 29, 2020 at 5:59 PM Hannah Jiang <hannahji...@google.com> wrote:
> 
> There is a PythonDocker Precommit test running for PRs with Python
> changes. It seems running well.[1]
> Max, can you please give me a link so I can check more details? Do
> other images with different Python versions fail as well?
> 
> 1. https://builds.apache.org/job/beam_PreCommit_PythonDocker_Commit/
> 
> 
> On Wed, Apr 29, 2020 at 2:44 PM Ahmet Altay <al...@google.com> wrote:
> 
> +Valentyn Tymofieiev <valen...@google.com> +Hannah Jiang
> <hannahji...@google.com> -- in case they have relevant
> information.
> 
> On Wed, Apr 29, 2020 at 12:35 PM Maximilian Michels
> <m...@apache.org> wrote:
> 
> Hi,
> 
> has anyone noticed the Python 3.7 Docker container fails to
> build? I
> haven't been able to build the Python 3.7 container, neither
> locally nor
> on Jenkins.
> 
> I get:
> 
> 17:48:10 > Task :sdks:python:container:py37:docker
> 17:49:36 The command '/bin/sh -c pip install -r
> /tmp/base_image_requirements.txt && python -c "from
> google.protobuf.internal import api_implementation; assert
> api_implementation._default_implementation_type == 'cpp'; print
> ('Verified fast protobuf used.')" && rm -rf
> /root/.cache/pip' returned a
> non-zero code: 1
> 17:49:36
> 17:49:36 > Task :sdks:python:container:py37:docker FAILED
> 
> 
> Cheers,
> Max
> 


Python 3.7 docker container fails to build

2020-04-29 Thread Maximilian Michels
Hi,

has anyone noticed the Python 3.7 Docker container fails to build? I
haven't been able to build the Python 3.7 container, neither locally nor
on Jenkins.

I get:

17:48:10 > Task :sdks:python:container:py37:docker
17:49:36 The command '/bin/sh -c pip install -r
/tmp/base_image_requirements.txt && python -c "from
google.protobuf.internal import api_implementation; assert
api_implementation._default_implementation_type == 'cpp'; print
('Verified fast protobuf used.')" && rm -rf /root/.cache/pip' returned a
non-zero code: 1
17:49:36
17:49:36 > Task :sdks:python:container:py37:docker FAILED


Cheers,
Max



Re: JIRA Committer Permissions

2020-04-28 Thread Maximilian Michels
FWIW committer and contributor roles in Beam's JIRA have practically
identical permissions[1]. The only difference is the "Set Issue Security"
and "Manage Watchers" permissions, both of which I have never used before.

Thanks for updating the wiki page!

-Max

[1]
https://jira.apache.org/jira/plugins/servlet/project-config/BEAM/permissions

On 28.04.20 06:09, Kenneth Knowles wrote:
> I think it would be very valuable to have a committer onboarding guide,
> with info for both the committer and steps for PMC to take. I think the
> wiki is the right place for it...
> 
> (two seconds of checking later)
> 
> It
> exists! 
> https://cwiki.apache.org/confluence/display/BEAM/Committer+onboarding+guide
> 
> By the time you read this, I hope to have added the links that Luke
> suggested. We do need to remember to send this guide to new committers.
> And potentially announce changes that existing committers may not have
> followed.
> 
> Kenn
> 
> On Mon, Apr 27, 2020 at 7:23 PM Luke Cwik wrote:
> 
> The Beam committer guide is about reviewing code and the "become a
> committer" is more about what we look for and not the process.
> 
> Since this is common for all ASF projects, I suspect an ASF page may
> have this documented or should be updated to have this covered as
> well but didn't see it on the ASF new committers resources page[1]
> or on the developers & contributors overview[2].
> 
> 1: https://www.apache.org/dev/new-committers-guide.html
> 2: https://www.apache.org/dev/index.html
> 
> On Mon, Apr 27, 2020 at 3:32 PM Udi Meiri wrote:
> 
> Should this step be added to our new committer guide?
> 
> On Fri, Apr 24, 2020 at 6:21 PM Luke Cwik wrote:
> 
> I noticed that several committers only had contributor level
> permissions and I went and updated your account permissions
> for the Beam project to be committer level. Feel free to let
> me know If you run into any issues.
> 
> There were about ~25 accounts like this.
> 


Github PR links in JIRA

2020-04-27 Thread Maximilian Michels
Hi everyone,

Did anyone notice that GitHub PRs tagged with a JIRA issue
("[BEAM-XXX]") do not automatically get linked anymore in JIRA?

Does anyone know how that stuff works?

Thanks,
Max


Re: [RESULT][VOTE] Accept the Firefly design donation as Beam Mascot - Deadline Mon April 6

2020-04-26 Thread Maximilian Michels
Hey Maria,

I can testify :)

Cheers,
Max

On 23.04.20 20:49, María Cruz wrote:
> Hi everyone!
> It is amazing to see how this process developed to collaboratively
> create Apache Beam's mascot. Thank you to everyone who got involved! 
> I would like to write a blogpost for the Beam website, and I wanted to
> ask you: would anyone like to offer their testimony about the process of
> creating the Beam mascot, and what this means to you? Everyone's
> testimony is welcome! If you witnessed the development of a mascot for
> another open source project, even better =) 
> 
> Please feel free to express interest on this thread, and I'll reach out
> to you off-list. 
> 
> Thanks, 
> 
> María
> 
> On Fri, Apr 17, 2020 at 6:19 AM Jeff Klukas <jklu...@mozilla.com> wrote:
> 
> I personally like the sound of "Datum" as a name. I also like the
> idea of not assigning them a gender.
> 
> As a counterpoint on the naming side, one of the slide decks
> provided while iterating on the design mentions:
> 
> > Mascot can change colors when it is “full of data” or has a “batch
> of data” to process.  Yellow is supercharged and ready to process!
> 
> Based on that, I'd argue that the mascot maps to the concept of a
> bundle in the beam execution model and we should consider a name
> that's a play on "bundle" or perhaps a play on "checkpoint".
> 
> On Thu, Apr 16, 2020 at 3:44 PM Julian Bruno <juliangbr...@gmail.com> wrote:
> 
> Hi all,
> 
> While working on the design of our Mascot
> Some ideas showed up and I wish to share them.
> In regard to Alex Van Boxel's question about the name of our Mascot.
>  
> I was thinking about this yesterday night and feel it could be a
> great idea to name the Mascot "*Data*" or "*Datum*". Both names
> sound cute and make sense to me. I prefer the later. Datum means
> a single piece of information. The Mascot is the first piece of
> information and its job is to collect batches of data and
> process it. Datum is in charge of linking information together.
> 
> In addition, our Mascot should have no gender, rendering it
> accessible to all users.
> 
> Beam as a name for the mascot is pretty straightforward, but I
> think there are many things carrying that same name already.
> 
> What do you think?
> 
> Looking forward to hearing your feedback. Names are important
> and I feel it can expand the personality and create a cool
> background for our Mascot.
> 
> Cheers!
> 
> Julian 
> 
> On Mon, Apr 13, 2020, 3:40 PM Kyle Weaver <kcwea...@google.com> wrote:
> 
> Beam Firefly is fine with me (I guess people tend to forget
> mascot names anyway). But if anyone comes up with something
> particularly cute/clever we can consider it.
> 
> On Mon, Apr 13, 2020 at 6:33 PM Aizhamal Nurmamat kyzy
> <aizha...@apache.org> wrote:
> 
> @Alex, Beam Firefly?
> 
> On Thu, Apr 9, 2020 at 10:57 PM Alex Van Boxel
> <a...@vanboxel.be> wrote:
> 
> We forgot something      
> 
> ...
> 
> ...
> 
> it/she/he needs a *name*!
> 
> 
>  _/
> _/ Alex Van Boxel
> 
> 
> On Fri, Apr 10, 2020 at 6:19 AM Kenneth Knowles
> <k...@apache.org> wrote:
> 
> Looking forward to the guide. I enjoy doing
> (bad) drawings as a way to relax. And I want
> them to be properly on brand :-)
> 
> Kenn
> 
> On Thu, Apr 9, 2020 at 10:35 AM Maximilian
> Michels <m...@apache.org>
> wrote:
> 
> Awesome. What a milestone! The mascot is a
> real eye catcher. Thank you
> Julian and Aizhamal for making it happen.
> 
> On 06.04.20 22:05, Aizhamal Nurmamat kyzy wrote:
> > I am happy to announce that this vote has
> passed, with 13 approving +1
> > votes, 5 of which are binding PMC votes.
> >
>

Re: [ANNOUNCE] Beam 2.20.0 Released

2020-04-24 Thread Maximilian Michels
Thanks Rui for getting this one out!

-Max

On 24.04.20 15:03, Jan Lukavský wrote:
> Hi Rui,
> 
> thanks for making this release! Is it possible we are missing the git tag
> for this release? I cannot find it.
> 
> Thanks,
> 
>  Jan
> 
> On 4/16/20 8:47 PM, Rui Wang wrote:
>> Note that due to an infrastructure bug, the website change failed to
>> publish. But 2.20.0 artifacts are available to use right now.
>>
>>
>>
>> -Rui
>>
>> On Thu, Apr 16, 2020 at 11:45 AM Rui Wang > > wrote:
>>
>> The Apache Beam team is pleased
>> to announce the release of version 2.20.0.
>>
>> Apache Beam is an open source unified programming model to define and
>> execute data processing pipelines, including ETL, batch and stream
>> (continuous) processing. See https://beam.apache.org
>> 
>>
>> You can download the release here:
>>
>>     https://beam.apache.org/get-started/downloads/ 
>>
>> This release includes bug fixes, features, and improvements
>> detailed on
>> the Beam
>> blog: https://beam.apache.org/blog/2020/04/15/beam-2.20.0.html
>>
>> Thanks to everyone who contributed to this release, and we hope
>> you enjoy
>> using Beam 2.20.0.
>> -- Rui Wang, on behalf of The Apache Beam team
>>


Re: Beam Digital Summit 2020 -- JUNE 2020!

2020-04-22 Thread Maximilian Michels
 Looking forward to this!

Cheers,
Max

On 22.04.20 21:09, Austin Bennett wrote:
> Hi All,
> 
> We are excited to announce the Beam Digital Summit 2020!
> 
> This will occur for partial days during the week of 15-19 June.
> 
> CfP is open and found: https://sessionize.com/beam-digital-summit-2020/
> 
> CfP closes on 20 May 2020.  Do not hesitate to reach out to the
> organizers with any questions.  
> 
> See you there (online)!
> Austin, on behalf of the Beam Summit Steering Committee 
> 


Re: [Python] Setting a timer from a timer callback

2020-04-22 Thread Maximilian Michels
Attempting to fix this here, if somebody could have a look:
https://github.com/apache/beam/pull/11492

On 22.04.20 17:10, Maximilian Michels wrote:
> Hi,
> 
> I'm trying to set a timer from a timer callback in the Python SDK:
> 
> class MyFn(beam.DoFn):
>   timer_spec = userstate.TimerSpec('timer', userstate.TimeDomain.WATERMARK)
> 
>   def process(self, element, timer=beam.DoFn.TimerParam(timer_spec)):
> self.key = element[0]
> timer.set(0)
> 
>   @userstate.on_timer(timer_spec)
>   def process_timer(self, timer=beam.DoFn.TimerParam(timer_spec)):
> timer.set(0)
> 
> This yields the following Python stack trace:
> 
> INFO:apache_beam.utils.subprocess_server:Caused by:
> java.lang.RuntimeException: Error received from SDK harness for
> instruction 4: Traceback (most recent call last):
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/sdk_worker.py", line 245, in _execute
> INFO:apache_beam.utils.subprocess_server: response = task()
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/sdk_worker.py", line 302, in 
> INFO:apache_beam.utils.subprocess_server: lambda:
> self.create_worker().do_instruction(request), request)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/sdk_worker.py", line 471, in do_instruction
> INFO:apache_beam.utils.subprocess_server: getattr(request,
> request_type), request.instruction_id)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/sdk_worker.py", line 506, in process_bundle
> INFO:apache_beam.utils.subprocess_server:
> bundle_processor.process_bundle(instruction_id))
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/bundle_processor.py", line 910, in
> process_bundle
> INFO:apache_beam.utils.subprocess_server: element.timer_family_id,
> timer_data)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/operations.py", line 688, in process_timer
> INFO:apache_beam.utils.subprocess_server: timer_data.fire_timestamp)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/common.py", line 990, in process_user_timer
> INFO:apache_beam.utils.subprocess_server: self._reraise_augmented(exn)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/common.py", line 1043, in _reraise_augmented
> INFO:apache_beam.utils.subprocess_server: raise_with_traceback(new_exn)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/common.py", line 988, in process_user_timer
> INFO:apache_beam.utils.subprocess_server:
> self.do_fn_invoker.invoke_user_timer(timer_spec, key, window, timestamp)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/common.py", line 517, in invoke_user_timer
> INFO:apache_beam.utils.subprocess_server: self.user_state_context, key,
> window, timestamp))
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/common.py", line 1093, in process_outputs
> INFO:apache_beam.utils.subprocess_server: for result in results:
> INFO:apache_beam.utils.subprocess_server: File
> "/Users/max/Dev/beam/sdks/python/apache_beam/testing/load_tests/pardo_test.py",
> line 185, in process_timer
> INFO:apache_beam.utils.subprocess_server: timer.set(0)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/runners/worker/bundle_processor.py", line 589, in set
> INFO:apache_beam.utils.subprocess_server:
> self._timer_coder_impl.encode_to_stream(timer, self._output_stream, True)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/coders/coder_impl.py", line 651, in encode_to_stream
> INFO:apache_beam.utils.subprocess_server: value.hold_timestamp, out, True)
> INFO:apache_beam.utils.subprocess_server: File
> "apache_beam/coders/coder_impl.py", line 608, in encode_to_stream
> INFO:apache_beam.utils.subprocess_server: millis = value.micros // 1000
> INFO:apache_beam.utils.subprocess_server:AttributeError: 'NoneType'
> object has no attribute 'micros' [while running 'GenerateLoad']
> 
> Looking at the code base, I'm not sure we have tests for timer output
> timestamps. Am I missing something?
> 
> -Max
> 


[Python] Setting a timer from a timer callback

2020-04-22 Thread Maximilian Michels
Hi,

I'm trying to set a timer from a timer callback in the Python SDK:

import apache_beam as beam
from apache_beam.transforms import userstate

class MyFn(beam.DoFn):
  timer_spec = userstate.TimerSpec('timer', userstate.TimeDomain.WATERMARK)

  def process(self, element, timer=beam.DoFn.TimerParam(timer_spec)):
self.key = element[0]
timer.set(0)

  @userstate.on_timer(timer_spec)
  def process_timer(self, timer=beam.DoFn.TimerParam(timer_spec)):
timer.set(0)

This yields the following Python stack trace:

INFO:apache_beam.utils.subprocess_server:Caused by:
java.lang.RuntimeException: Error received from SDK harness for
instruction 4: Traceback (most recent call last):
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/sdk_worker.py", line 245, in _execute
INFO:apache_beam.utils.subprocess_server: response = task()
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/sdk_worker.py", line 302, in 
INFO:apache_beam.utils.subprocess_server: lambda:
self.create_worker().do_instruction(request), request)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/sdk_worker.py", line 471, in do_instruction
INFO:apache_beam.utils.subprocess_server: getattr(request,
request_type), request.instruction_id)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/sdk_worker.py", line 506, in process_bundle
INFO:apache_beam.utils.subprocess_server:
bundle_processor.process_bundle(instruction_id))
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/bundle_processor.py", line 910, in
process_bundle
INFO:apache_beam.utils.subprocess_server: element.timer_family_id,
timer_data)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/operations.py", line 688, in process_timer
INFO:apache_beam.utils.subprocess_server: timer_data.fire_timestamp)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/common.py", line 990, in process_user_timer
INFO:apache_beam.utils.subprocess_server: self._reraise_augmented(exn)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/common.py", line 1043, in _reraise_augmented
INFO:apache_beam.utils.subprocess_server: raise_with_traceback(new_exn)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/common.py", line 988, in process_user_timer
INFO:apache_beam.utils.subprocess_server:
self.do_fn_invoker.invoke_user_timer(timer_spec, key, window, timestamp)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/common.py", line 517, in invoke_user_timer
INFO:apache_beam.utils.subprocess_server: self.user_state_context, key,
window, timestamp))
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/common.py", line 1093, in process_outputs
INFO:apache_beam.utils.subprocess_server: for result in results:
INFO:apache_beam.utils.subprocess_server: File
"/Users/max/Dev/beam/sdks/python/apache_beam/testing/load_tests/pardo_test.py",
line 185, in process_timer
INFO:apache_beam.utils.subprocess_server: timer.set(0)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/runners/worker/bundle_processor.py", line 589, in set
INFO:apache_beam.utils.subprocess_server:
self._timer_coder_impl.encode_to_stream(timer, self._output_stream, True)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/coders/coder_impl.py", line 651, in encode_to_stream
INFO:apache_beam.utils.subprocess_server: value.hold_timestamp, out, True)
INFO:apache_beam.utils.subprocess_server: File
"apache_beam/coders/coder_impl.py", line 608, in encode_to_stream
INFO:apache_beam.utils.subprocess_server: millis = value.micros // 1000
INFO:apache_beam.utils.subprocess_server:AttributeError: 'NoneType'
object has no attribute 'micros' [while running 'GenerateLoad']

Looking at the code base, I'm not sure we have tests for timer output
timestamps. Am I missing something?

-Max



Re: Running NexMark Tests

2020-04-22 Thread Maximilian Michels
The flag is needed when checkpointing is enabled because Flink is unable
to create a new checkpoint when not all operators are running.

By default, operators shut down when all input has been read. That will
trigger sending out the maximum (final) watermark at the sources. The
flag name is a bit confusing in this regard because shutting down the
sources triggers sending out the watermark, not the other way around.

-Max
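
For reference, the streaming Nexmark invocation from the report below, with only the shutdown flag added, would look roughly like this:

```
./gradlew :sdks:java:testing:nexmark:run \
    -Pnexmark.runner=":runners:flink:1.10" \
    -Pnexmark.args="
        --runner=FlinkRunner
        --suite=SMOKE
        --streamTimeout=60
        --streaming=true
        --manageResources=false
        --monitorJobs=true
        --shutdownSourcesOnFinalWatermark=true
        --flinkMaster=[local]"
```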

On 22.04.20 06:26, Kenneth Knowles wrote:
> We should always want to shut down sources on final watermark. All
> incoming data should be dropped anyhow.
> 
> Kenn
> 
> On Tue, Apr 21, 2020 at 1:34 PM Luke Cwik wrote:
> 
> +dev
> 
> When would we not want --shutdownSourcesOnFinalWatermark=true ?
> 
> On Tue, Apr 21, 2020 at 1:22 PM Ismaël Mejía wrote:
> 
> You need to instruct the Flink runner to shut down the source,
> otherwise it will stay waiting.
> You can do this by adding the extra
> argument `--shutdownSourcesOnFinalWatermark=true`.
> And if that works and you want to open a PR to update our
> documentation, that would be greatly appreciated.
> 
> Regards,
> Ismaël
> 
> 
> On Tue, Apr 21, 2020 at 10:04 PM Sruthi Sree Kumar wrote:
> >
> > Hello,
> >
> > I am trying to run nexmark queries using flink runner
> streaming. Followed the documentation and used the command
> > ./gradlew :sdks:java:testing:nexmark:run \
> >
> >     -Pnexmark.runner=":runners:flink:1.10" \
> >     -Pnexmark.args="
> >         --runner=FlinkRunner
> >         --suite=SMOKE
> >         --streamTimeout=60
> >         --streaming=true
> >         --manageResources=false
> >         --monitorJobs=true
> >         --flinkMaster=[local]"
> >
> >
> > But after the events are read from the source, there is no
> further progress and the job is always stuck at 99%. Is there
> any configuration that I am missing?
> >
> > Regards,
> > Sruthi
> 


Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-22 Thread Maximilian Michels
Hi Eleanore,

Exactly-once is not affected but the pipeline can fail to checkpoint
after the maximum number of state cells have been reached. We are
working on a fix [1].

Cheers,
Max

[1] https://github.com/apache/beam/pull/11478
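
To make the fix direction concrete, here is a toy model of the single-buffer idea — plain Java standing in for Flink state, with illustrative names, not Beam's actual internals. All in-flight elements live in one buffer keyed by checkpoint id, so no new state cell is created per checkpoint:

```
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model: one buffer for all pending checkpoints instead of one state cell each. */
public class SingleCellBuffer {

  // Checkpoint id -> elements that may only be processed once that checkpoint completes.
  private final TreeMap<Long, List<String>> buffered = new TreeMap<>();

  void bufferElement(long pendingCheckpointId, String element) {
    buffered.computeIfAbsent(pendingCheckpointId, id -> new ArrayList<>()).add(element);
  }

  /** Returns all elements whose input became stable with this checkpoint and drops them. */
  List<String> onCheckpointComplete(long completedCheckpointId) {
    List<String> stable = new ArrayList<>();
    Map<Long, List<String>> done = buffered.headMap(completedCheckpointId, true);
    for (List<String> elements : done.values()) {
      stable.addAll(elements);
    }
    done.clear(); // the view clears the entries in the backing map, too
    return stable;
  }
}
```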

On 22.04.20 07:19, Eleanore Jin wrote:
> Hi Maxi, 
> 
> I assume this will impact the Exactly Once Semantics that beam provided
> as in the KafkaExactlyOnceSink, the processElement method is also
> annotated with @RequiresStableInput?
> 
> Thanks a lot!
> Eleanore
> 
> On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels <m...@apache.org> wrote:
> 
> Hi Stephen,
> 
> Thanks for reporting the issue! David, good catch!
> 
> I think we have to resort to only using a single state cell for
> buffering on checkpoints, instead of using a new one for every
> checkpoint. I was under the assumption that, if the state cell was
> cleared, it would not be checkpointed but that does not seem to be
> the case.
> 
> Thanks,
> Max
> 
> On 21.04.20 09:29, David Morávek wrote:
> > Hi Stephen,
> >
> > nice catch and awesome report! ;) This definitely needs a proper fix.
> > I've created a new JIRA to track the issue and will try to resolve it
> > soon as this seems critical to me.
> >
> > https://issues.apache.org/jira/browse/BEAM-9794
> >
> > Thanks,
> > D.
> >
> > On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel <stephenpate...@gmail.com> wrote:
> >
> >     I was able to reproduce this in a unit test:
> >
> >     @Test
> >     public void test() throws InterruptedException, ExecutionException {
> >       FlinkPipelineOptions options =
> >           PipelineOptionsFactory.as(FlinkPipelineOptions.class);
> >       options.setCheckpointingInterval(10L);
> >       options.setParallelism(1);
> >       options.setStreaming(true);
> >       options.setRunner(FlinkRunner.class);
> >       options.setFlinkMaster("[local]");
> >       options.setStateBackend(new MemoryStateBackend(Integer.MAX_VALUE));
> >
> >       Pipeline pipeline = Pipeline.create(options);
> >       pipeline
> >           .apply(Create.of((Void) null))
> >           .apply(
> >               ParDo.of(
> >                   new DoFn<Void, Void>() {
> >                     private static final long serialVersionUID = 1L;
> >
> >                     @RequiresStableInput
> >                     @ProcessElement
> >                     public void processElement() {}
> >                   }));
> >       pipeline.run();
> >     }
> >
> >
> >     It took a while to get to checkpoint 32,767, but eventually it did,
> >     and it failed with the same error I listed above.
> >
> >     On Thu, Apr 16, 2020 at 11:26 AM Stephen Patel
> >     <stephenpate...@gmail.com> wrote:
> >
> >         I have a Beam Pipeline (2.14) running on Flink (1.8.0,
> >         emr-5.26.0) that uses the RequiresStableInput feature.
> >
> >         Currently it's configured to checkpoint once a minute, and
> after
> >         around 32000-33000 checkpoints, it fails with: 
> >
> >             2020-04-15 13:15:02,920 INFO
> >           
>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
> >               - Triggering checkpoint 32701 @ 1586956502911 for job
> >             9953424f21e240112dd23ab4f8320b60.
> >             2020-04-15 13:15:05,762 INFO
> >           
>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
> >               - Completed checkpoint 32701 for job
> >             9953424f21e240112dd23ab4f8320b60 (795385496 

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread Maximilian Michels
Hi Stephen,

Thanks for reporting the issue! David, good catch!

I think we have to resort to only using a single state cell for
buffering on checkpoints, instead of using a new one for every
checkpoint. I was under the assumption that, if the state cell was
cleared, it would not be checkpointed but that does not seem to be the case.

Thanks,
Max

On 21.04.20 09:29, David Morávek wrote:
> Hi Stephen,
> 
> nice catch and awesome report! ;) This definitely needs a proper fix.
> I've created a new JIRA to track the issue and will try to resolve it
> soon as this seems critical to me.
> 
> https://issues.apache.org/jira/browse/BEAM-9794
> 
> Thanks,
> D.
> 
> On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel wrote:
> 
> I was able to reproduce this in a unit test:
> 
> @Test
> public void test() throws InterruptedException, ExecutionException {
>   FlinkPipelineOptions options =
>       PipelineOptionsFactory.as(FlinkPipelineOptions.class);
>   options.setCheckpointingInterval(10L);
>   options.setParallelism(1);
>   options.setStreaming(true);
>   options.setRunner(FlinkRunner.class);
>   options.setFlinkMaster("[local]");
>   options.setStateBackend(new MemoryStateBackend(Integer.MAX_VALUE));
>
>   Pipeline pipeline = Pipeline.create(options);
>   pipeline
>       .apply(Create.of((Void) null))
>       .apply(
>           ParDo.of(
>               new DoFn<Void, Void>() {
>                 private static final long serialVersionUID = 1L;
>
>                 @RequiresStableInput
>                 @ProcessElement
>                 public void processElement() {}
>               }));
>   pipeline.run();
> }
> 
> 
> It took a while to get to checkpoint 32,767, but eventually it did,
> and it failed with the same error I listed above.
> 
> On Thu, Apr 16, 2020 at 11:26 AM Stephen Patel <stephenpate...@gmail.com> wrote:
> 
> I have a Beam Pipeline (2.14) running on Flink (1.8.0,
> emr-5.26.0) that uses the RequiresStableInput feature.
> 
> Currently it's configured to checkpoint once a minute, and after
> around 32000-33000 checkpoints, it fails with: 
> 
> 2020-04-15 13:15:02,920 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
>   - Triggering checkpoint 32701 @ 1586956502911 for job
> 9953424f21e240112dd23ab4f8320b60.
> 2020-04-15 13:15:05,762 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
>   - Completed checkpoint 32701 for job
> 9953424f21e240112dd23ab4f8320b60 (795385496 bytes in 2667 ms).
> 2020-04-15 13:16:02,919 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  
>   - Triggering checkpoint 32702 @ 1586956562911 for job
> 9953424f21e240112dd23ab4f8320b60.
> 2020-04-15 13:16:03,147 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph    
>    -  (1/2)
> (f4737add01961f8b42b8eb4e791b83ba) switched from RUNNING to
> FAILED.
> AsynchronousException{java.lang.Exception: Could not
> materialize checkpoint 32702 for operator 
> (1/2).}
> at
> 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
> at
> 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
> at
> 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
> at
> 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize
> checkpoint 32702 for operator  (1/2).
> at
> 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
> ... 6 more
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.IllegalArgumentException
> at 

Re: [Proposal] Requesting PMC approval to start planning for Beam Summits 2019

2020-04-15 Thread Maximilian Michels
The linked Google Doc is not accessible to everybody. I've made a copy,
so let's continue any discussion in there:
https://docs.google.com/document/d/1OddPOvP36mTTWEXV0DWtyS3MgfXyWOS3YXiZGeLWfSI/

Thanks,
Max

On 15.04.20 16:44, Maximilian Michels wrote:
> I've sent a request for approval to trademark with @private and the
> committee in CC.
> 
> -Max
> 
> On 15.04.20 05:58, Kenneth Knowles wrote:
>> Yes, here are links to the process [1] and contact info [2]. The proposal
>> already is in very good shape and answers the needed questions. Send to
>> tradema...@apache.org and CC
>> priv...@beam.apache.org. It is good
>> that you are sending this early before dates are firm. That is helpful.
>> You should also check http://community.apache.org/calendars/.
>>
>> Kenn
>>
>> [1] https://www.apache.org/foundation/marks/events#approval
>> [2] https://www.apache.org/foundation/marks/contact#events
>>
>> On Tue, Apr 14, 2020 at 11:56 AM Ahmet Altay <al...@google.com> wrote:
>>
>> Thank you for this proposal and shifting this to a digital event.
>>
>> Kenn, do you know what formal approval is required?
>>
>> On Fri, Apr 10, 2020 at 8:17 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>> Looks good to me. We'll have to see what ASF is doing with their
>> own events. We haven't gotten to dates, but I wonder if the
>> usual conflict avoidance need not apply, since digital events
>> don't require travel so are less of an imposition, plus have
>> lots of viewership after the live event is done.
>>
>> On Fri, Apr 10, 2020 at 11:13 AM Brittany Hermann
>> mailto:herma...@google.com>> wrote:
>>
>> Happy Friday! 
>>
>> I just wanted to follow up with the PMC regarding the 2020
>> Digital Beam Summit proposal. 
>>
>> Please let me know if you have any questions! 
>>
>> On Wed, Apr 1, 2020 at 6:22 PM Brittany Hermann
>> mailto:herma...@google.com>> wrote:
>>
>> My apologies all, I meant to say 2020 in the subject line. 
>>
>> On Wed, Apr 1, 2020 at 5:16 PM Tomo Suzuki
>> mailto:suzt...@google.com>> wrote:
>>
>> 2020 in email subject?
>>
>>
>>
>> -- 
>>
>>  
>>
>> Brittany Hermann
>>
>> Open Source Program Manager (Provided by Adecco Staffing)
>>
>> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
>> 
>>
>>
>>
>>
>> -- 
>>
>>  
>>
>> Brittany Hermann
>>
>> Open Source Program Manager (Provided by Adecco Staffing)
>>
>> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
>> 
>>
>>


Re: [Proposal] Requesting PMC approval to start planning for Beam Summits 2019

2020-04-15 Thread Maximilian Michels
I've sent a request for approval to trademark with @private and the
committee in CC.

-Max

On 15.04.20 05:58, Kenneth Knowles wrote:
> Yes, here are links to the process [1] and contact info [2]. The proposal
> already is in very good shape and answers the needed questions. Send to
> tradema...@apache.org  and CC
> priv...@beam.apache.org . It is good
> that you are sending this early before dates are firm. That is helpful.
> You should also check http://community.apache.org/calendars/.
> 
> Kenn
> 
> [1] https://www.apache.org/foundation/marks/events#approval
> [2] https://www.apache.org/foundation/marks/contact#events
> 
> On Tue, Apr 14, 2020 at 11:56 AM Ahmet Altay  > wrote:
> 
> Thank you for this proposal and shifting this to a digital event.
> 
> Kenn, do you know what formal approval is required?
> 
> On Fri, Apr 10, 2020 at 8:17 PM Kenneth Knowles  > wrote:
> 
> Looks good to me. We'll have to see what ASF is doing with their
> own events. We haven't gotten to dates, but I wonder if the
> usual conflict avoidance need not apply, since digital events
> don't require travel so are less of an imposition, plus have
> lots of viewership after the live event is done.
> 
> On Fri, Apr 10, 2020 at 11:13 AM Brittany Hermann
> mailto:herma...@google.com>> wrote:
> 
> Happy Friday! 
> 
> I just wanted to follow up with the PMC regarding the 2020
> Digital Beam Summit proposal. 
> 
> Please let me know if you have any questions! 
> 
> On Wed, Apr 1, 2020 at 6:22 PM Brittany Hermann
> mailto:herma...@google.com>> wrote:
> 
> My apologies all, I meant to say 2020 in the subject line. 
> 
> On Wed, Apr 1, 2020 at 5:16 PM Tomo Suzuki
> mailto:suzt...@google.com>> wrote:
> 
> 2020 in email subject?
> 
> 
> 
> -- 
> 
>   
> 
> Brittany Hermann
> 
> Open Source Program Manager (Provided by Adecco Staffing)
> 
> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
> 
> 
> 
> 
> 
> 
> -- 
> 
>   
> 
> Brittany Hermann
> 
> Open Source Program Manager (Provided by Adecco Staffing)
> 
> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
> 
> 
> 
> 


Re: Portable timer loops

2020-04-14 Thread Maximilian Michels
Hey Jan,

Just saw your message since you posted right before I replied. What you
describe is precisely what I was experiencing. I also solved it the same
way, i.e. pushing back a newly set timer to the next bundle. Note that
there is no other way in portability because we can't fire timers once
we have closed the current bundle; we need to close the bundle to
receive all output which includes timers. It's definitely not efficient
but it appears that this behavior is even desired by some runners, e.g.
Dataflow.

Thanks,
Max

On 13.04.20 18:58, Jan Lukavský wrote:
> This is probably related to issue I was having with Direct runner and
> timer ordering. The problem is that there might be multiple timers (for
> given key) inside bundle and that each timer might set another timer. To
> ensure timer ordering, timers must be fired one at a time and when fired
> timer sets timer for time preceding current input watermark, the new
> timer and all remaining timers are pushed back to next bundle. That was
> the simplest yet efficient enough implementation for direct runner (see
> [1]), for different runners might exist better alternatives (e.g. what
> was discussed in [2]).
> 
> Jan
> 
> [1]
> https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/StatefulParDoEvaluatorFactory.java#L253
> 
> [2] https://github.com/apache/beam/pull/9190
> 
> On 4/13/20 6:01 PM, Reuven Lax wrote:
>> I'm not sure I understand the difference - do any "classic" runners
>> add new timers to the bundle? I know that at least the Dataflow runner
>> would end up the new timer in a new bundle.
>>
>> One thing we do need to ensure is that modifications to timers are
>> reflected in a bundle. So if a bundle contains a processElement and a
>> processTimer and the processElement modifies the timer, that should be
>> reflected in timer firing.
>>
>> On Mon, Apr 13, 2020 at 8:53 AM Luke Cwik <lc...@google.com> wrote:
>>
>> In non portable implementations you would have to wait till the
>> element/timer was finished processing before you could process any
>> newly created timers. Timers that are created with the same
>> key+window+timer family overwrite existing timers that have been
>> set which can lead to a timer being overwritten multiple times
>> while an element/timer is being processed and you wouldn't want to
>> process a newly created timer with the incorrect values or process
>> a timer you shouldn't have.
>>
>> In portable implementations, you can only safely say that element
>> processing is done when the bundle completes. There could be value
>> in exposing when an element is done since this could have usage in
>> other parts of the system such as when a large GBK value is done.
>>
>> On Mon, Apr 13, 2020 at 8:06 AM Maximilian Michels <m...@apache.org> wrote:
>>
>> Hi,
>>
>> In the "classic" Java mode one can set timers which in turn
>> can set
>> timers which enables to create a timer loop, e.g.:
>>
>> @ProcessElement
>> public void processElement(
>>     ProcessContext context, @TimerId("timer") Timer timer) {
>>   // Initial timer
>>   timer.withOutputTimestamp(new
>> Instant(0)).set(context.timestamp());
>> }
>>
>> @OnTimer("timer")
>> public void onTimer(
>>     OnTimerContext context,
>>     @TimerId("timer") Timer timer) {
>>   // Trigger again and start endless loop
>>   timer.withOutputTimestamp(new
>> Instant(0)).set(context.timestamp());
>> }
>>
>>
>> In portability, since we are only guaranteed to receive the
>> timers at
>> the end of a bundle when we flush the outputs and have closed the
>> inputs, it looks like this behavior can only be supported by
>> starting a
>> new bundle and executing the deferred timers. This also means
>> to hold
>> back the watermark to allow for these loops. Plus, starting a
>> new bundle
>> comes with some cost.
>>
>> Another possibility would be to allow a direct feedback loop,
>> i.e. when
>> the bundles closes and the timers fire, we can still set new
>> timers.
>>
>> I wonder do we want to allow a timer loop to execute within a
>> bundle?
>>
>> It would be possible to limit the number of iterations to
>> perform during
>> one bundle similar to how runners limit the number of elements
>> or the
>> duration of a bundle.
>>
>> -Max
>>


Re: [Proposal] Requesting PMC approval to start planning for Beam Summits 2020

2020-04-13 Thread Maximilian Michels
I suggest we move forward with a vote on the dev mailing list to
formally approve the proposal and copy ASF legal on the use of the Beam
trademark.

Does that sound good?

@Kenneth We are trying not to conflict with any ASF events. Also, we'll
have half-day sessions which gives us additional flexibility.

-Max

On 11.04.20 05:17, Kenneth Knowles wrote:
> Ah, I just replied on the other thread. Still looks good to me :-)
> 
> On Fri, Apr 10, 2020 at 11:40 AM Pablo Estrada <pabl...@google.com> wrote:
> 
> copying Brittany to the email with the 2020 subject to continue the
> discussion.
> 
> Max / Brittany, what are the steps for the PMC to approve y'all to
> move forward?
> 
> I took a look at the proposal, and it looks great. I don't have any
> questions for now.
> Best
> -P.
> 
> On Thu, Apr 2, 2020 at 1:50 AM Maximilian Michels <m...@apache.org> wrote:
> 
> Hi Brittany,
> 
> Thanks for sharing the proposal. After we organized Beam Summit
> Europe
> and Beam Summit NA in 2019, I'm really looking forward to the
> digital
> 2020 version of the summit!
> 
> tl;dr
> For it to be an official Beam event, the Beam trademark use
> needs to be
> approved by the PMC. We'll have a separate vote thread for that once
> we've answered questions here. We are also open to suggestions
> and ideas
> for the summit.
> 
> Note that the exact schedule is to-be-determined but we are thinking
> about multiple days with half-day morning (PDT) sessions, such that
> there will be enough time to work on other things that day.
> 
> Cheers,
> Max
> 
> PS: Yes, it's 2020 :)
> 
> On 01.04.20 23:56, Brittany Hermann wrote:
> > Dear Project Management Committee,
> >
> >
> > The Beam Summits are community events funded by a group of
> sponsors and
> > organized by a Steering Committee formed by members of the Beam
> > community and who have participated in past editions. I'd like
> to get
> > the following approval:
> >
> >
> > To organize and host the Summit under the name of Beam Summit, i.e. Digital Beam Summit 2020.
> >
> >
> > Approval to host this year’s edition on the following dates:
> >
> >
> >   *
> >
> >     Digital Beam Summit, an online, multi-day event to be
> hosted in a
> >     selected video conference platform, and to be organized in
> the date
> >     range of mid-July to early-August
> >
> >   *
> >
> >     Expected attendees: 200 attendees live
> >
> >
> > The events will provide educational content selected by the
> Steering
> > Committee, and will be a not-for-profit venture. This will be
> a free
> > event. The event will be advertised on various channels,
> including the
> > Apache Beam's and Summit sponsor's social media accounts. 
> >
> >
> > The Organizing Committee will acknowledge the Apache Software
> > Foundation's ownership of the Apache Beam trademark and will add
> > attribution required by the foundation's policy on all marketing
> > channels. The Apache Beam branding will be used in accordance
> with the 
> >
> > foundation's trademark and events policies specifically as
> outlined in
> > Third Party Event Branding Policy. The Organizing Committee
> does not
> > request the ASF to become a Community Partner of the event. 
> >
> >
> > Attached is a full proposal with the event details for your
> > reference[1]. Please feel free to request further information
> if needed. 
> >
> >
> > Kind Regards,
> >
> >
> > Brittany Hermann on behalf of the Beam Summit Steering Committee 
> >
> >
> > [1]
> >
> 
> https://docs.google.com/document/d/1OezrSYARKnbu1pu2KVA5tDRbXKE8T3Q84DY6xXAKXXs/edit?usp=sharing
> >
> >
> >
> > --
> >
> >       
> >
> > Brittany Hermann
> >
> > Open Source Program Manager (Provided by Adecco Staffing)
> >
> > 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
> >
> 
> >
> >
> 


Re: Portable timer loops

2020-04-13 Thread Maximilian Michels
On Mon, Apr 13, 2020 at 8:53 AM Luke Cwik  wrote:
> In non portable implementations you would have to wait till the element/timer 
> was finished processing before you could process any newly created timers. 
> Timers that are created with the same key+window+timer family overwrite 
> existing timers that have been set which can lead to a timer being 
> overwritten multiple times while an element/timer is being processed and you 
> wouldn't want to process a newly created timer with the incorrect values or 
> process a timer you shouldn't have.

Timer overwrites/updates/deletions are a good argument for waiting until
the bundle finishes.

On 13.04.20 18:01, Reuven Lax wrote:
> I'm not sure I understand the difference - do any "classic" runners add new 
> timers to the bundle? I know that at least the Dataflow runner would end up 
> the new timer in a new bundle.

Yes, the "classic" Flink Runner has a timer processing loop which runs
for as long as there are still timers applicable for firing. New timers
can be added and existing ones can be updated all within a bundle.

I wasn't aware that Dataflow only includes new timers in the next bundle.

> One thing we do need to ensure is that modifications to timers are reflected 
> in a bundle. So if a bundle contains a processElement and a processTimer and 
> the processElement modifies the timer, that should be reflected in timer 
> firing.

That is the case for the Flink Runner.

-Max
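
For illustration, here is a toy model of that loop — plain Java with illustrative names, not Beam's or Flink's actual code. Timers set by a firing callback go back into the same queue, so they can fire within the same bundle as long as they are eligible:

```
import java.util.PriorityQueue;
import java.util.function.LongConsumer;

/** Toy model of the classic-runner timer loop described above. */
public class TimerLoop {

  private final PriorityQueue<Long> timers = new PriorityQueue<>();

  void setTimer(long timestamp) {
    timers.add(timestamp);
  }

  /** Fires all timers eligible at the watermark, including ones set by the callback. */
  void advanceWatermarkTo(long watermark, LongConsumer onTimer) {
    while (!timers.isEmpty() && timers.peek() <= watermark) {
      onTimer.accept(timers.poll()); // the callback may call setTimer(...) again
    }
  }
}
```

A portable runner cannot do this within a bundle because new timers only arrive once the bundle is closed, hence the push-back to the next bundle.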

On 13.04.20 18:01, Reuven Lax wrote:
> I'm not sure I understand the difference - do any "classic" runners add
> new timers to the bundle? I know that at least the Dataflow runner would
> end up the new timer in a new bundle.
> 
> One thing we do need to ensure is that modifications to timers are
> reflected in a bundle. So if a bundle contains a processElement and a
> processTimer and the processElement modifies the timer, that should be
> reflected in timer firing.
> 
> On Mon, Apr 13, 2020 at 8:53 AM Luke Cwik <lc...@google.com> wrote:
> 
> In non portable implementations you would have to wait till the
> element/timer was finished processing before you could process any
> newly created timers. Timers that are created with the same
> key+window+timer family overwrite existing timers that have been set
> which can lead to a timer being overwritten multiple times while an
> element/timer is being processed and you wouldn't want to process a
> newly created timer with the incorrect values or process a timer you
> shouldn't have.
> 
> In portable implementations, you can only safely say that element
> processing is done when the bundle completes. There could be value
> in exposing when an element is done since this could have usage in
> other parts of the system such as when a large GBK value is done.
> 
> On Mon, Apr 13, 2020 at 8:06 AM Maximilian Michels <m...@apache.org> wrote:
> 
> Hi,
> 
> In the "classic" Java mode one can set timers which in turn can set
> timers which enables to create a timer loop, e.g.:
> 
> @ProcessElement
> public void processElement(
>     ProcessContext context, @TimerId("timer") Timer timer) {
>   // Initial timer
>   timer.withOutputTimestamp(new
> Instant(0)).set(context.timestamp());
> }
> 
> @OnTimer("timer")
> public void onTimer(
>     OnTimerContext context,
>     @TimerId("timer") Timer timer) {
>   // Trigger again and start endless loop
>   timer.withOutputTimestamp(new
> Instant(0)).set(context.timestamp());
> }
> 
> 
> In portability, since we are only guaranteed to receive the
> timers at
> the end of a bundle when we flush the outputs and have closed the
> inputs, it looks like this behavior can only be supported by
> starting a
> new bundle and executing the deferred timers. This also means to
> hold
> back the watermark to allow for these loops. Plus, starting a
> new bundle
> comes with some cost.
> 
> Another possibility would be to allow a direct feedback loop,
> i.e. when
> the bundles closes and the timers fire, we can still set new timers.
> 
> I wonder do we want to allow a timer loop to execute within a
> bundle?
> 
> It would be possible to limit the number of iterations to
> perform during
> one bundle similar to how runners limit the number of elements
> or the
> duration of a bundle.
> 
> -Max
> 


Portable timer loops

2020-04-13 Thread Maximilian Michels
Hi,

In the "classic" Java mode one can set timers which in turn can set
timers which enables to create a timer loop, e.g.:

// (Assumes the corresponding timer declaration in the DoFn.)
@TimerId("timer")
private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

@ProcessElement
public void processElement(
    ProcessContext context, @TimerId("timer") Timer timer) {
  // Initial timer
  timer.withOutputTimestamp(new Instant(0)).set(context.timestamp());
}

@OnTimer("timer")
public void onTimer(
    OnTimerContext context,
    @TimerId("timer") Timer timer) {
  // Trigger again and start an endless loop
  timer.withOutputTimestamp(new Instant(0)).set(context.timestamp());
}


In portability, since we are only guaranteed to receive the timers at
the end of a bundle, when we flush the outputs and have closed the
inputs, it looks like this behavior can only be supported by starting a
new bundle and executing the deferred timers. This also means holding
back the watermark to allow for these loops. Plus, starting a new bundle
comes with some cost.

Another possibility would be to allow a direct feedback loop, i.e. when
the bundle closes and the timers fire, we can still set new timers.

I wonder, do we want to allow a timer loop to execute within a bundle?

It would be possible to limit the number of iterations to perform during
one bundle, similar to how runners limit the number of elements or the
duration of a bundle.

-Max
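
To make the last point concrete, a variant of the toy timer loop sketched in the reply above, with a per-bundle firing budget (the budget parameter is illustrative):

```
/** Variant of TimerLoop.advanceWatermarkTo (see the reply above) with a firing budget. */
void advanceWatermarkTo(long watermark, LongConsumer onTimer, int maxFiringsPerBundle) {
  int fired = 0;
  while (!timers.isEmpty() && timers.peek() <= watermark && fired < maxFiringsPerBundle) {
    onTimer.accept(timers.poll()); // the callback may set further timers
    fired++;
  }
  // Anything still in `timers` is deferred to the next bundle, holding
  // back the watermark as described above.
}
```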


Re: [RESULT][VOTE] Accept the Firefly design donation as Beam Mascot - Deadline Mon April 6

2020-04-09 Thread Maximilian Michels
Awesome. What a milestone! The mascot is a real eye-catcher. Thank you
Julian and Aizhamal for making it happen.

On 06.04.20 22:05, Aizhamal Nurmamat kyzy wrote:
> I am happy to announce that this vote has passed, with 13 approving +1
> votes, 5 of which are binding PMC votes.
> 
> We have the final design for the Beam Firefly! Yahoo!
> 
> Everyone have a great week!
> 
> 
> 
> On Mon, Apr 6, 2020 at 9:57 AM David Morávek <d...@apache.org> wrote:
> 
> +1 (non-binding)
> 
> On Mon, Apr 6, 2020 at 12:51 PM Reza Rokni <r...@google.com> wrote:
> 
> +1(non-binding)
> 
> On Mon, Apr 6, 2020 at 5:24 PM Alexey Romanenko <aromanenko@gmail.com> wrote:
> 
>     +1 (non-binding).
> 
> > On 3 Apr 2020, at 14:53, Maximilian Michels <m...@apache.org> wrote:
> >
> > +1 (binding)
> >
> > On 03.04.20 10:33, Jan Lukavský wrote:
> >> +1 (non-binding).
> >>
> >> On 4/2/20 9:24 PM, Austin Bennett wrote:
> >>> +1 (nonbinding)
> >>>
> >>> On Thu, Apr 2, 2020 at 12:10 PM Luke Cwik <lc...@google.com> wrote:
> >>>
> >>>    +1 (binding)
> >>>
> >>>        On Thu, Apr 2, 2020 at 11:54 AM Pablo Estrada <pabl...@google.com> wrote:
> >>>
> >>>        +1! (binding)
> >>>
> >>>            On Thu, Apr 2, 2020 at 11:19 AM Alex Van Boxel <a...@vanboxel.be> wrote:
> >>>
> >>>            Thanks for clearing this up Aizhamal.
> >>>
> >>>            +1 (non binding)
> >>>
> >>>            _/
> >>>            _/ Alex Van Boxel
> >>>
> >>>
> >>>                On Thu, Apr 2, 2020 at 8:14 PM Aizhamal Nurmamat kyzy <aizha...@apache.org> wrote:
> >>>
> >>>                Good point, Alex. Actually Julian and I
> have talked
> >>>                about producing this kind of guide. It
> will be
> >>>                delivered as an additional contribution
> in the follow
> >>>                up. We think this will be a derivative of
> the original
> >>>                design, and be done after the original is
> officially
> >>>                accepted.
> >>>
> >>>                With this vote, we want to accept the
> Firefly donation
> >>>                as designed [1], and let Julian produce other
> >>>                artifacts using the official Beam mascot
> later on.
> >>>
> >>>                [1]
> 
> https://docs.google.com/document/d/1zK8Cm8lwZ3ALVFpD1aY7TLCVNwlyTS3PXxTV2qQCAbk/edit?usp=sharing
> >>>
> >>>
> >>>                    On Thu, Apr 2, 2020 at 10:37 AM Alex Van Boxel <a...@vanboxel.be> wrote:
> >>>
> >>>                    I don't want to be a spoiler... but
> this vote
> >>>                    feels like a final deliverable... but
> without a
> >>>                    style guide as Kenn originally
> suggested most of
> >>>                    use will not be able to adapt the
> design. 

Re: [VOTE] Release 2.20.0, release candidate #1

2020-04-08 Thread Maximilian Michels
Just a reminder that for ASF releases we are voting on releasing the
source; the binaries are simply a nice-to-have.

Nevertheless, this is the right call since the majority of users will
use the binaries, not the source. As Steve pointed out, we don't want to
release something broken, even if it is not contained in the source.

We should make sure to notify the release manager when merging
release-critical changes.

-Max

On 07.04.20 01:24, Robert Bradshaw wrote:
> Thanks. It sounds like this is enough of a blocker for me to vote -1
> for RC1 as well. We'll keep an eye out for RC2. 
> 
> (If this is the only change, the Python artifacts are still good. I
> would encourage folks to keep testing RC1 to see if there are any other
> issues, so we can have quick resolution on RC2.)
> 
> On Mon, Apr 6, 2020 at 3:53 PM Rui Wang wrote:
> 
> Ok, I will abort RC1 and go toward RC2 for known issues. Thanks
> everyone who has helped!
> 
> 
> 
> -Rui  
> 
> On Mon, Apr 6, 2020 at 3:28 PM Reuven Lax wrote:
> 
> -1, as that PR does fix a critical bug. The fact that no unit
> test broke before was more a signal that our unit testing was
> deficient in this area.
> 
> My fix for the bug is pr/11226, which did include a unit test
> (which fails without the fix). However it appears that 11252
> forked off just the main code files from my pr, and not the unit
> test. If we're recutting, we should include the unit test as well.
> 
> On Mon, Apr 6, 2020 at 3:11 PM Rui Wang wrote:
> 
> I see. I will also leave the community to decide.
> 
> With the unit tests in [1], the fix becomes sufficient (e.g.
> if the community decides that the fix is critical, I will
> also need to include those tests in the release).
> 
> 
> [1] https://github.com/apache/beam/pull/11226
> 
> 
> -Rui
> 
> 
> On Mon, Apr 6, 2020 at 3:05 PM Steve Niemitz <sniem...@apache.org> wrote:
> 
> My opinion doesn't matter much, since we're just going
> to cherry pick the fix into our fork anyways, but you're
> essentially proposing releasing a build that *WILL*
> cause data loss to anyone who uses processing time timers.
> 
> I'll leave it up to the community to decide, but it
> seems like a pretty big bug.
> 
> Also, fwiw, there is a PR open that adds a test for this
> [1], but it was never merged (it's been open for 12 days).
> 
> [1] https://github.com/apache/beam/pull/11226
> 
> On Mon, Apr 6, 2020 at 5:52 PM Rui Wang <ruw...@google.com> wrote:
> 
> My opinion is, even though that commit was missing,
> no test/validation gave a signal that something
> relevant was broken. Plus that fix didn't include a
> test.  
> 
> I will hesitate to say such a fix is critical for a
> release, unless there is something to test or
> validate it.
> 
> 
> -Rui
> 
> On Mon, Apr 6, 2020 at 2:46 PM Steve Niemitz <sniem...@apache.org> wrote:
> 
> timers are essentially broken without it, so I'd
> say -1
> 
> On Mon, Apr 6, 2020 at 5:45 PM Rui Wang <ruw...@google.com> wrote:
> 
> ok so the source is consistent with the
> binary. What undecided is if missing that
> commit is -1, or that can be marked as a
> known issue in release note.
> 
> 
> -Rui
> 
> On Mon, Apr 6, 2020 at 2:38 PM Steve Niemitz wrote:
> 
> I can confirm that the artifact on maven
> central [1] does not have the change in
> it either, I disassembled it with javap.
> 
> [1]
> 
> https://repository.apache.org/content/repositories/orgapachebeam-1100/org/apache/beam/beam-runners-core-java/2.20.0/beam-runners-core-java-2.20.0.jar
> 
> 
> On Mon, Apr 6, 2020 at 5:28 PM Luke Cwik wrote:
> 
> If the source doesn't represent the
>   

Re: [PROPOSAL] Preparing for Beam 2.21 release

2020-04-06 Thread Maximilian Michels
Sounds good! +1

On 02.04.20 20:11, Luke Cwik wrote:
> Thanks for picking this up.
> 
> On Thu, Apr 2, 2020 at 10:09 AM Kyle Weaver wrote:
> 
> Hi all,
> 
> The next (2.21) release branch cut is scheduled for Apr 8, according
> to the calendar.
> I would like to volunteer myself to do this release.
> The plan is to cut the branch on that date,
> and cherrypick release-blocking fixes afterwards if any.
> 
> Any unresolved release blocking JIRA issues for 2.21 should have
> their "Fix Version/s" marked as "2.21.0".
> 
> Any comments or objections? 
> 
> Kyle
> 


Re: [VOTE] Accept the Firefly design donation as Beam Mascot - Deadline Mon April 6

2020-04-03 Thread Maximilian Michels
+1 (binding)

On 03.04.20 10:33, Jan Lukavský wrote:
> +1 (non-binding).
> 
> On 4/2/20 9:24 PM, Austin Bennett wrote:
>> +1 (nonbinding)
>>
>> On Thu, Apr 2, 2020 at 12:10 PM Luke Cwik wrote:
>>
>> +1 (binding)
>>
>> On Thu, Apr 2, 2020 at 11:54 AM Pablo Estrada wrote:
>>
>> +1! (binding)
>>
>> On Thu, Apr 2, 2020 at 11:19 AM Alex Van Boxel <a...@vanboxel.be> wrote:
>>
>> Thanks for clearing this up Aizhamal.
>>
>> +1 (non binding)
>>
>> _/
>> _/ Alex Van Boxel
>>
>>
>> On Thu, Apr 2, 2020 at 8:14 PM Aizhamal Nurmamat kyzy <aizha...@apache.org> wrote:
>>
>> Good point, Alex. Actually Julian and I have talked
>> about producing this kind of guide. It will be
>> delivered as an additional contribution in the follow
>> up. We think this will be a derivative of the original
>> design, and be done after the original is officially
>> accepted. 
>>
>> With this vote, we want to accept the Firefly donation
>> as designed [1], and let Julian produce other
>> artifacts using the official Beam mascot later on.
>>
>> [1] 
>> https://docs.google.com/document/d/1zK8Cm8lwZ3ALVFpD1aY7TLCVNwlyTS3PXxTV2qQCAbk/edit?usp=sharing
>>
>>
>> On Thu, Apr 2, 2020 at 10:37 AM Alex Van Boxel <a...@vanboxel.be> wrote:
>>
>> I don't want to be a spoiler... but this vote
>> feels like a final deliverable... but without a
>> style guide as Kenn originally suggested most of
>> use will not be able to adapt the design. This
>> would include:
>>
>>   * frontal view
>>   * side view
>>   * back view
>>
>> actually different posses so we can mix and match.
>> Without this it will never reach the potential of
>> the Go gopher or gRPC Pancakes.
>>
>> Note this is *not* a negative vote but I'm afraid
>> that the use without a guide will be fairly
>> limited as most of use are not designers. Just a
>> concern.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Thu, Apr 2, 2020 at 7:27 PM Andrew Pilloud <apill...@apache.org> wrote:
>>
>> +1, Accept the donation of the Firefly design
>> as Beam Mascot
>>
>> On Thu, Apr 2, 2020 at 10:19 AM Julian Bruno wrote:
>>
>> Hello Apache Beam Community, 
>>
>> Please vote on the acceptance of the final
>> design of the Firefly as Beam's mascot
>> [1]. Please share your input no later than
>> Monday, April 6, at noon Pacific Time. 
>>
>>
>> [ ] +1, Accept the donation of the Firefly
>> design as Beam Mascot
>>
>> [ ] -1, Decline the donation of the
>> Firefly design as Beam Mascot
>>
>>
>> Vote is adopted by at least 3 PMC +1
>> approval votes, with no PMC -1 disapproval
>>
>> votes. Non-PMC votes are still encouraged.
>>
>> PMC voters, please help by indicating your
>> vote as "(binding)"
>>
>>
>> The vote and input phase will be open
>> until Monday, April 6, at 12 pm Pacific Time.
>>
>>
>> Thank you very much for your feedback and
>> ideas,
>>
>> Julian
>>
>>
>> [1]
>> 
>> https://docs.google.com/document/d/1zK8Cm8lwZ3ALVFpD1aY7TLCVNwlyTS3PXxTV2qQCAbk/edit?usp=sharing
>>   
>>
>> -- 
>> Julian Bruno // Visual Artist & Graphic
>> Designer
>>  (510) 367-0551  /
>> SF Bay Area, CA
>> www.instagram.com/julbro.art
>> 
>>


Re: [Proposal] Requesting PMC approval to start planning for Beam Summits 2020

2020-04-02 Thread Maximilian Michels
Hi Brittany,

Thanks for sharing the proposal. After we organized Beam Summit Europe
and Beam Summit NA in 2019, I'm really looking forward to the digital
2020 version of the summit!

tl;dr
For it to be an official Beam event, the Beam trademark use needs to be
approved by the PMC. We'll have a separate vote thread for that once
we've answered questions here. We are also open to suggestions and ideas
for the summit.

Note that the exact schedule is to-be-determined but we are thinking
about multiple days with half-day morning (PDT) sessions, such that
there will be enough time to work on other things that day.

Cheers,
Max

PS: Yes, it's 2020 :)

On 01.04.20 23:56, Brittany Hermann wrote:
> Dear Project Management Committee,
> 
> 
> The Beam Summits are community events funded by a group of sponsors and
> organized by a Steering Committee formed by members of the Beam
> community and who have participated in past editions. I'd like to get
> the following approval:
> 
> 
> To organize and host the Summit under the name of Beam Summit, i.e. Digital Beam Summit 2020.
> 
> 
> Approval to host this year’s edition on the following dates:
> 
> 
>   *
> 
> Digital Beam Summit, an online, multi-day event to be hosted in a
> selected video conference platform, and to be organized in the date
> range of mid-July to early-August
> 
>   *
> 
> Expected attendees: 200 attendees live
> 
> 
> The events will provide educational content selected by the Steering
> Committee, and will be a not-for-profit venture. This will be a free
> event. The event will be advertised on various channels, including the
> Apache Beam's and Summit sponsor's social media accounts. 
> 
> 
> The Organizing Committee will acknowledge the Apache Software
> Foundation's ownership of the Apache Beam trademark and will add
> attribution required by the foundation's policy on all marketing
> channels. The Apache Beam branding will be used in accordance with the 
> 
> foundation's trademark and events policies specifically as outlined in
> Third Party Event Branding Policy. The Organizing Committee does not
> request the ASF to become a Community Partner of the event. 
> 
> 
> Attached is a full proposal with the event details for your
> reference[1]. Please feel free to request further information if needed. 
> 
> 
> Kind Regards,
> 
> 
> Brittany Hermann on behalf of the Beam Summit Steering Committee 
> 
> 
> [1]
> https://docs.google.com/document/d/1OezrSYARKnbu1pu2KVA5tDRbXKE8T3Q84DY6xXAKXXs/edit?usp=sharing
> 
> 
> 
> -- 
> 
>   
> 
> Brittany Hermann
> 
> Open Source Program Manager (Provided by Adecco Staffing)
> 
> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
> 
> 
> 


Python SDK performance regression with UnboundedThreadPoolExecutor

2020-04-01 Thread Maximilian Michels
Hi everyone,

We're seeing a dramatic regression in the Python SDK with Flink + Beam
2.18.0 after upgrading from 2.16.0. Note that the Flink version is
unchanged. After bisecting the problem, we found that
https://jira.apache.org/jira/browse/BEAM-8944 is the cause for this.

The dynamic thread creation adds a significant delay to the bundle
completion, which causes checkpoint times to increase 2-3x. I've tried
increasing the thread lifetime but that did not change anything.

Reverting to the static thread pool gives us 2.16.0 performance.

Looking at the code, I don't see something obviously wrong, but the new
locks and the use of threading.Event/threading.Condition could be too
much for the Python interpreter.

-Max
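
For anyone who wants to reproduce this outside a pipeline, here is a rough micro-benchmark sketch (not Beam's code; the import path is the one BEAM-8944 introduced in 2.18, to the best of my knowledge) that approximates the many-short-tasks-per-bundle pattern:

```
import time
from concurrent.futures import ThreadPoolExecutor

from apache_beam.utils.thread_pool_executor import UnboundedThreadPoolExecutor


def work():
  return sum(range(1000))


def bench(executor, rounds=200, tasks=50):
  # Each round submits a burst of short tasks and waits for all of them,
  # mimicking the work done at bundle completion.
  start = time.perf_counter()
  for _ in range(rounds):
    futures = [executor.submit(work) for _ in range(tasks)]
    for future in futures:
      future.result()
  return time.perf_counter() - start


print('static pool:   ', bench(ThreadPoolExecutor(max_workers=12)))
print('unbounded pool:', bench(UnboundedThreadPoolExecutor()))
```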


Re: [VOTE + INPUT] Beam Mascot Designs, 3rd iteration - Deadline Wednesday, April 1

2020-03-31 Thread Maximilian Michels
Hi Julian!

Perfect, thanks for incorporating all the suggestions.

> 1. Do you prefer stripes or no stripes?

No stripes.

Cheers,
Max

On 31.03.20 08:11, Alex Van Boxel wrote:
> Nooo stripes
> 
>  _/
> _/ Alex Van Boxel
> 
> 
> On Tue, Mar 31, 2020 at 6:06 AM Joshua B. Harrison <josh.harri...@gmail.com> wrote:
> 
> 
> 
> On Mon, Mar 30, 2020 at 8:09 PM Julian Bruno wrote:
> 
> Hello Apache Beam Community,
> 
> We need a third input from the community to finish the design.
> Please share your input no later than Wednesday, April 1st, at
> noon Pacific Time. Below you will find a link to the
> presentation of the work process and we are eager to know what
> you think of the current design [1].
> 
> 
> Our question to you:
> 
> 
> 1. Do you prefer stripes or no stripes?
> 
> I prefer the logo with stripes. 
> 
> 
> Please reply inline, so it is clear what exactly you are
> referring to. The vote and input phase will be open until
> Wednesday, April 1st, at 12 pm Pacific Time. We will incorporate
> the feedback into the final design iteration of the mascot. 
> 
> 
> Please find below the attached source file (.SVG) and a
> High-Quality Image (.PNG).
> 
> 
> Thank you,
> 
> 
> -- 
> Julian Bruno // Visual Artist & Graphic Designer
>  (510) 367-0551 / SF Bay Area, CA
> www.instagram.com/julbro.art 
> 
> [1]
> 
>  3/30 - Mascot Weekly Update
> 
> 
> 
> 
> 
> ᐧ
> 
> 
> 
> -- 
> Joshua Harrison |  Software Engineer | joshharri...@gmail.com
>  | 404-433-0242
> 


Re: [VOTE + INPUT] Beam Mascot Designs, 2nd iteration - Deadline Friday, March 27

2020-03-26 Thread Maximilian Michels
Thanks for the update Daniel!

> 1. Do you prefer red or black colored line art?

Black.

> 2. Do you have any additional feedback about the mascot's shape or features?

Great improvement! The tail looks a bit sharp to me. I like the new
shape but I'd prefer if it looked less like a sting.

Cheers,
Max

On 26.03.20 20:07, Alex Amato wrote:
> 1. Do you prefer red or black colored line art?
> 
> Black
> 
> 
> 
> On Thu, Mar 26, 2020 at 11:07 AM Pablo Estrada wrote:
> 
> 1. I am slightly inclined for black lines
> 2. I have two pieces of feedback:
> - It feels like the head and the eyes on this design are less
> elongated. I think I liked the more oval-like eyes from previous
> designs. (And the head also became less oval-like?, maybe?)
> - I like the white hand tips and stripes in the body from slide 22
> of your previous deck. These lines are very easy to draw but I think
> they make the Firefly less flat.
> 
> That's it from me!
> Thanks Julian, it's looking really good.
> Best
> -P.
> 
> On Wed, Mar 25, 2020 at 10:00 PM Kenneth Knowles wrote:
> 
> I assume that when this bug moves fast the tail will leave a
> cool light trail.
> 
> Kenn
> 
> On Wed, Mar 25, 2020 at 5:45 PM Daniel Oliveira <danolive...@google.com> wrote:
> 
> 1. Do you prefer red or black colored line art?
> 
> 
> Red.
>  
> 
> 2. Do you have any additional feedback about the
> mascot's shape or features?
> 
> 
> Love the new tail and new shadows.
> 
> I like the wings better with color, but they still feel a
> bit dull to me. I feel they would be improved by having more
> vibrant colors near the tips, and possibly by going with
> more yellow-ish colors closer to the Beam logo. Compare with
> the wings from slide 10 of your previous deck,
> which I like much better. Having the more vibrant color near
> the tips of the wings also pairs well with the new tail,
> which does the same thing with its yellow light.
> 
> On Wed, Mar 25, 2020 at 12:11 PM Julian Bruno <juliangbr...@gmail.com> wrote:
> 
> Hello Apache Beam Community, 
> 
> 
> Together with Aizhamal and her team, we have been
> working on the design of the Apache Beam mascot.
> 
> We now need input from the community to continue moving
> forward with the design. Please share your input no
> later than Friday, March 27, at noon Pacific Time. Below
> you will find a link to the presentation of the work
> process and we are eager to know what you think of the
> current design [1].
> 
> 
> Our questions to you:
> 
> 
> 1. Do you prefer red or black colored line art?
> 
> 2. Do you have any additional feedback about the
> mascot's shape or features?
> 
> 
> Please reply inline, so it is clear what exactly you are
> referring to. The vote and input phase will be open
> until Friday, March 27, at 12 pm Pacific Time. We will
> incorporate the feedback to the next design iteration of
> the mascot. 
> 
> 
> Thank you,
> 
> 
> Julian Bruno // Visual Artist & Graphic Designer
>  (510) 367-0551  / SF Bay Area, CA
> www.instagram.com/julbro.art
> 
> 
> [1]
> 
>  Mascot Weekly Update - 3/25
> 
> 
> 
> 
> 
> ᐧ
> 


Re: [PROPOSAL] Preparing for Beam 2.20.0 release

2020-03-23 Thread Maximilian Michels
Just mentioning that we have discovered
https://issues.apache.org/jira/browse/BEAM-9566 and
https://issues.apache.org/jira/browse/BEAM-9573 which are blockers for
the release. I'm currently fixing those. PR will be out later today.

-Max

On 28.02.20 20:11, Rui Wang wrote:
> Release branch 2.20.0 is already cut. 
> 
> Currently there should be only one blocking
> Jira: https://issues.apache.org/jira/browse/BEAM-9288
> 
> But there is a newly added
> Jira: https://issues.apache.org/jira/browse/BEAM-9322
> 
> 
> I will coordinate on those two Jira.
> 
> 
> 
> -Rui
> 
> 
> On Thu, Feb 27, 2020 at 3:18 PM Rui Wang wrote:
> 
> Hi community,
> 
> Just fyi:
> 
> The 2.20.0 release branch should be cut yesterday (02/26) per
> schedule. However as our python precommit was broken so I didn't cut
> the branch.
> 
> I am closely working with PR [1] owner to fix the python precommit.
> Once the fix is in, I will cut the release branch immediately.
> 
> 
> [1]: https://github.com/apache/beam/pull/10982
> 
> 
> -Rui
> 
> On Thu, Feb 20, 2020 at 7:06 AM Ismaël Mejía wrote:
> 
> Not yet, up to last check nobody is tackling it, it is still
> unassigned. Let's
> not forget that the fix of this one requires an extra release of
> the grpc
> vendored dependency (the source of the issue).
> 
> And yes this is a release blocker for the open source runners
> because people
> tend to package their projects with the respective runners in a
> jar and this is
> breaking at the moment.
> 
> Kenn changed the priority of BEAM-9252 from Blocker to Critical
> to follow the
> conventions in [1], and from those definitions  'most critical
> bugs should
> block release'.
> 
> [1] https://beam.apache.org/contribute/jira-priorities/
> 
> On Thu, Feb 20, 2020 at 3:42 AM Ahmet Altay wrote:
> 
> Curions, was there a resolution on BEAM-9252? Would it be a
> release blocker?
> 
> On Fri, Feb 14, 2020 at 12:42 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> 
> Thanks Rui for volunteering and for keeping the release
> pace!
> 
> Since we are discussing the next release, I would like
> to highlight that nobody
> apparently is working on this blocker issue:
> 
> BEAM-9252 Problem shading Beam pipeline with Beam
> 2.20.0-SNAPSHOT
> https://issues.apache.org/jira/browse/BEAM-9252
> 
> This is a regression introduced by the move to vendored
> gRPC 1.26.0 and it
> probably will require an extra vendored gRPC release so
> better to give it
> some priority.
> 
> 
> On Wed, Feb 12, 2020 at 6:48 PM Ahmet Altay <al...@google.com> wrote:
> 
> +1. Thank you.
> 
> On Tue, Feb 11, 2020 at 11:01 PM Rui Wang <ruw...@google.com> wrote:
> 
> Hi all,
> 
> The next (2.20.0) release branch cut is
> scheduled for 02/26, according to the calendar.
> I would like to volunteer myself to do this release.
> The plan is to cut the branch on that date,
> and cherrypick release-blocking fixes afterwards
> if any.
> 
> Any unresolved release blocking JIRA issues for
> 2.20.0 should have their "Fix Version/s" marked
> as "2.20.0".
> 
> Any comments or objections? 
> 
> 
> -Rui
> 


Re: [Interactive Runner] now available on master

2020-03-19 Thread Maximilian Michels
Great work! This will also be super handy for demoing Beam. Looking
forward to playing around with this :)

On 19.03.20 00:52, Kenneth Knowles wrote:
> Nice!
> 
> On Wed, Mar 18, 2020 at 2:58 PM Ahmet Altay wrote:
> 
> Great to see this progress! :)
> 
> On Wed, Mar 18, 2020 at 2:57 PM Reza Rokni wrote:
> 
> Awesome !
> 
> On Thu, 19 Mar 2020, 05:38 Sam Rohde wrote:
> 
> Hi All!
> 
>  
> 
> I am happy to announce that an improved Interactive Runner
> is now available on master. This Python runner allows for
> the interactive development of Beam pipelines in a notebook
> (and IPython) environment.
> 
>  
> 
> The runner still has some bugs that need to be fixed as well
> as some refactoring, but it is in a good enough shape to
> start using it.
> 
>  
> 
> Here are the new things you can do with the Interactive Runner:
> 
>   *
> 
> Create and execute pipelines within a REPL
> 
>   *
> 
> Visualize elements as the pipeline is running
> 
>   *
> 
> Materialize PCollections to DataFrames
> 
>   *
> 
> Record unbounded sources for deterministic replay
> 
>   *
> 
> Replay cached unbounded sources including watermark
> advancements
> 
> The code lives in sdks/python/apache_beam/runners/interactive
> and example notebooks are
> in sdks/python/apache_beam/runners/interactive/examples.
> 
>  
> 
> To install, use `pip install -e .[interactive]` in your
> /sdks/python directory.
> 
> To run, here’s a quick example:
> 
> ```
> 
> import apache_beam as beam
> 
> from apache_beam.runners.interactive.interactive_runner
> import InteractiveRunner
> 
> import apache_beam.runners.interactive.interactive_beam as ib
> 
>  
> 
> p = beam.Pipeline(InteractiveRunner())
> 
> words = p | beam.Create(['To', 'be', 'or', 'not', 'to', 'be'])
> 
> counts = words | 'count' >> beam.combiners.Count.PerElement()
> 
>  
> 
> # Shows a dynamically updating display of the PCollection
> elements
> 
> ib.show(counts)
> 
>  
> 
> # We can now visualize the data using standard pandas
> operations.
> 
> df = ib.collect(counts)
> 
> print(df.info())
> 
> print(df.describe())
> 
>  
> 
> # Plot the top-10 counted words
> 
> df = df.sort_values(by=1, ascending=False)
> 
> df.head(n=10).plot(x=0, y=1)
> 
> ```
> 
>  
> 
> Currently, Batch is supported on any runner. Streaming is
> only supported on the DirectRunner (non-FnAPI).
> 
>  
> 
> I would like to thank the great work of Sindy (@sindyli) and
> Harsh (@ananvay) for the initial implementation,
> 
> David Yan (@davidyan) who led the project, Ning (@ningk) and
> myself (@srohde) for the implementation and design, and
> Ahmet (@altay), Daniel (@millsd), Pablo (@pabloem), and
> Robert (@robertwb) who all contributed a lot of their time
> to help with the design and code reviews.
> 
>  
> 
> It was a team effort and we wouldn't have been able to
> complete it without the help of everyone involved.
> 
>  
> 
> Regards,
> 
> Sam
> 
> 


Re: Jenkins jobs not running for my PR 10438

2020-03-18 Thread Maximilian Michels
Sure. Done.

On 18.03.20 12:24, Rehman Murad Ali wrote:
> Hello Committers,
> 
> I would appreciate it if you could run the jobs for this PR:
> https://github.com/apache/beam/pull/11154
> 
> Thanks
> 
> *Rehman Murad Ali*
> Software Engineer
> Mobile: +92 3452076766
> Skype: rehman.muradali
> 
> 
> 
> On Wed, Mar 11, 2020 at 12:24 AM Ahmet Altay wrote:
> 
> Done.
> 
> On Tue, Mar 10, 2020 at 12:21 PM Tomo Suzuki wrote:
> 
> Hi Beam committers,
> 
> Would you trigger precommit checks
> for https://github.com/apache/beam/pull/11095 with the following
> 6 commands?
> Run Java PostCommit
> Run Java HadoopFormatIO Performance Test
> Run BigQueryIO Streaming Performance Test Java
> Run Dataflow ValidatesRunner
> Run Spark ValidatesRunner
> Run SQL Postcommit
> 
> Regards,
> Tomo
> 


Re: [DISCUSS] Drop support for Flink 1.7

2020-03-12 Thread Maximilian Michels
This is now done. We first added Flink 1.10 and then removed 1.7
support. This will be reflected in Beam 2.21.0.

Thanks Jincheng for your initiative. It was a pleasure working with you
on the PRs.

Cheers,
Max

On 12.03.20 02:33, jincheng sun wrote:
> Hi all,
> I would like to drop the Flink 1.7 support soon, as Flink 1.10 is already
> supported as of this
> commit: https://github.com/apache/beam/commit/f91b390c8bbab4afe14734c1266da51dcc7558c9
> 
> Best,
> Jincheng
> 
> 
> 
On Mon, Feb 24, 2020 at 11:22 AM, jincheng sun wrote:
> 
> Thanks for all of your feedback. Will drop the 1.7 support
> after https://github.com/apache/beam/pull/10945 has been merged.
> 
> Best,
> Jincheng
> 
> 
On Wed, Feb 19, 2020 at 7:08 PM, David Morávek wrote:
> 
> +1 for dropping 1.7, once we have 1.10 support ready
> 
> D.
> 
> On Tue, Feb 18, 2020 at 7:01 PM Jan wrote:
> 
> Hi Ismael,
> yes, sure. The proposal would be to have snapshot dependency
> in the feature branch. The snapshot must be changed to
> release before merge to master.
> Jan
> 
> On 18. 2. 2020 at 17:55, Ismaël Mejía <ieme...@gmail.com> wrote:
> 
> Just to be sure, you mean Flink 1.11.0-SNAPSHOT ONLY in
> the next branch dependency?
> We should not have any SNAPSHOT dependency from other
> projects in Beam.
> 
> On Tue, Feb 18, 2020 at 5:34 PM  wrote:
> 
> Hi Jincheng,
> I think there should be a "process" for this. I
> would propose to:
>  a) create a new branch with support for the new
> (snapshot) Flink - currently that would mean Flink 1.11
>  b) as part of this branch, drop support for all
> versions up to N - 3
> I think that dropping all versions and adding the
> new version should be atomic, otherwise we risk
> releasing a Beam version with less than three supported
> Flink versions.
> I'd suggest starting with the 1.10 branch support and
> including the drop of 1.7 in that branch. Once 1.10
> gets merged, we should create 1.11 with a snapshot
> dependency to be able to keep up with the release
> cadence of Flink.
> WDYT?
>  Jan
> 
> On Tue, Feb 18, 2020 at 3:34 PM jincheng sun wrote:
> 
> Hi folks,
> 
> Apache Flink 1.10 has been released [1], and we would
> like to add a Flink 1.10 build target and make the Flink
> Runner compatible with Flink 1.10 [2]. Following the
> update policy for Apache Flink releases [3], I would
> suggest that the Apache Beam community maintain at most
> three versions of the Flink runner, i.e. the three
> versions 1.8/1.9/1.10 after the Flink 1.10 build target
> is added to the Flink runner.
> 
> Keeping Flink runner 1.7 around holds back upgrades of
> the Flink runner 1.8.x and 1.9.x code because the Flink
> 1.7 code is too old; more detail can be found in [4].
> So we need to drop support for Flink runner 1.7 as soon
> as possible.
> 
> This discussion is also CC'd to @user, since the change
> will affect our users. I would appreciate it if you
> could review the PR [5].
> 
> Welcome any feedback!
> 
> Best,
> Jincheng
> 
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Apache-Flink-1-10-0-released-td37564.html
> [2] https://issues.apache.org/jira/browse/BEAM-9295
> [3]
> https://flink.apache.org/downloads.html#update-policy-for-old-releases
> [4] https://issues.apache.org/jira/browse/BEAM-9299
> [5] https://github.com/apache/beam/pull/10884
> 
> 
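The retention policy proposed above - keep build targets for at most the three most recent Flink minor versions - is easy to state as code; here is a minimal illustrative sketch, where the helper name and version list are made up for this example and are not part of the Beam build:

```python
# Illustrative sketch of the "support the three most recent Flink
# minor versions" policy discussed above.
def supported_flink_targets(all_targets, keep=3):
    """Return the `keep` newest versions, ordered oldest to newest."""
    ordered = sorted(all_targets, key=lambda v: tuple(map(int, v.split('.'))))
    return ordered[-keep:]

# Adding the 1.10 build target pushes 1.7 out of the support window:
print(supported_flink_targets(['1.7', '1.8', '1.9', '1.10']))
# -> ['1.8', '1.9', '1.10']
```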

Re: [EXTERNAL] Re: Java Build broken

2020-03-05 Thread Maximilian Michels
Good find, Thomas! It looks like it is for testing releases because they 
are staged to this repository. IMHO there is no need for it to be 
enabled by default.


-Max

On 04.03.20 23:06, Thomas Weise wrote:
I ran into this problem today and found that removing
https://oss.sonatype.org/content/repositories/staging/ from buildSrc/src/main/groovy/org/apache/beam/gradle/Repositories.groovy
also resolves the issue.


Is it possible that a flaky repository can poison the Gradle cache? Do
we need this repository entry at all?
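One quick way to confirm that the staging repository is the problem is to probe it directly for one of the failing artifacts; a minimal sketch using only the Python standard library, where the artifact coordinates are the error_prone ones from further down this thread and the comparison repository URL is an assumption of this sketch:

```python
import urllib.error
import urllib.request

# Probe the same artifact path on both repositories to see which one 404s.
ARTIFACT = ('com/google/errorprone/error_prone_check_api/'
            '2.3.4/error_prone_check_api-2.3.4.pom')
REPOS = [
    'https://oss.sonatype.org/content/repositories/staging/',
    'https://repo.maven.apache.org/maven2/',
]

for repo in REPOS:
    url = repo + ARTIFACT
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(resp.status, url)
    except urllib.error.HTTPError as e:
        print(e.code, url)
```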



On Tue, Mar 3, 2020 at 7:06 AM Pulasthi Supun Wickramasinghe wrote:


Thanks, that seems to have fixed the issue.

Best Regards,
Pulasthi

On Tue, Mar 3, 2020 at 5:47 AM Kamil Wasilewski wrote:

I had the same problem; it seems that removing Gradle's cache
(`rm -rf ~/.gradle/caches`) solved the issue.

On Tue, Feb 25, 2020 at 4:33 PM Pulasthi Supun Wickramasinghe wrote:

Hi Stefan,

Yes, I am also still getting this error on my local setup.
Strangely, though, I am not getting it on my laptop. I
tried manually installing the missing 'error_prone'
dependencies into Maven but then got some other error.
Might this be some kind of cache issue?

Best Regards,
Pulasthi

On Tue, Feb 25, 2020 at 5:38 AM Stefan Djelekar wrote:

Hi all,

No, this is not yet fixed.

@Pulasthi do you still get the same error?

@Maximilian I don't have any overrides.

It looks like the localhost build references
https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_check_api/2.3.4/
instead of
https://mvnrepository.com/artifact/com.google.errorprone/error_prone_check_api/2.3.4
and the first link returns 404.

Can you please advise?

All the best,

Stefan

*From:* Pulasthi Supun Wickramasinghe
*Sent:* Tuesday, February 18, 2020 5:11 PM
*To:* dev <dev@beam.apache.org>
*Cc:* Stefan Djelekar
*Subject:* [EXTERNAL] Re: Java Build broken

Hi All,

Was this issue resolved? I started to get the same error
on my local build suddenly.

Best Regards,

Pulasthi

    On Thu, Jan 23, 2020 at 10:17 AM Maximilian Michels wrote:

Do you have any overrides in your
~/.m2/settings.xml? The artifacts
should be found as part of Maven Central, e.g.

https://mvnrepository.com/artifact/com.google.errorprone/error_prone_check_api

Cheers,
Max
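To check quickly for such overrides, here is a minimal diagnostic sketch; it assumes the standard settings.xml namespace, and a settings file without that namespace declaration would need the tags matched unqualified:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Illustrative diagnostic: list any <mirror> or <repository> URLs
# configured in the local Maven settings, since such overrides can
# redirect artifact resolution away from Maven Central.
SETTINGS = Path.home() / '.m2' / 'settings.xml'
NS = {'m': 'http://maven.apache.org/SETTINGS/1.0.0'}

if SETTINGS.exists():
    root = ET.parse(SETTINGS).getroot()
    for tag in ('mirror', 'repository'):
        for node in root.iter('{%s}%s' % (NS['m'], tag)):
            url = node.find('m:url', NS)
            print(tag, url.text if url is not None else '(no url)')
else:
    print('no', SETTINGS, 'found')
```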

On 23.01.20 11:11, Stefan Djelekar wrote:
 > Hi guys,
 >
 > It's been days now since the build for the Java SDK broke.
 >
 > Namely, the pipeline is successful on Jenkins, but it fails on my
 > localhost with an error in the task model:pipeline:compileJava. As
 > I've seen, the last couple of builds were served from the cache, so
 > maybe that is the reason why it's green. I confirmed the same thing
 > happened to other devs as well.
 >
 > 22:49:34 > Task :model:pipeline:compileJava FROM-CACHE
 >
 > It looks like it's related to a version mismatch of the
 > com.google.errorprone library. Can someone please take a look, as
 > this is a blocker for localhost development?
 >
 > Cheers,
 >
 > *Stefan Đelekar*
 >
 > Software Engineer
