Re: Key encodings for state requests

2019-11-12 Thread jincheng sun
Thanks for sharing your thoughts, which helped me gain a deeper
understanding of the design of the FnAPI. It makes much more sense to me now.

Great thanks Robert !

Best,
Jincheng


Robert Bradshaw wrote on Tue, Nov 12, 2019 at 2:10 AM:

> On Fri, Nov 8, 2019 at 10:04 PM jincheng sun 
> wrote:
> >
> > > Let us first define what are "standard coders". Usually it should be
> the coders defined in the Proto. However, personally I think the coders
> defined in the Java ModelCoders [1] seems more appropriate. The reason is
> that for a coder which has already appeared in Proto and still not added to
> the Java ModelCoders, it's always replaced by the runner with
> LengthPrefixCoder[ByteArrayCoder] and so the SDK harness will decode the
> data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
> That's to say, that coder still does not take effect in the SDK harness.
> Only when the coder is added in ModelCoders, it's 'known' and will take
> effect.
> >
> > To correct my earlier point: a coder which is not contained in the Java
> ModelCoders is replaced with LengthPrefixCoder[ByteArrayCoder] at the runner
> side and LengthPrefixCoder[CustomCoder] at the SDK harness side.
> >
> > The point here is that the runner determines whether it knows the coder
> according to the coders defined in the Java ModelCoders, not the coders
> defined in the proto file. So if taking option 3, the non-standard coders
> which will be wrapped with LengthPrefixCoder should also be determined by
> the coders defined in the Java ModelCoders, not the coders defined in the
> proto file.
>
> Yes.
>
> Both as a matter of principle and pragmatics, it'd be good to avoid
> having anything about the model defined only in Java files.
>
> Also, when we say "the runner" we cannot assume it's written in Java.
> While many Java OSS runners share these libraries, the Universal Local
> Runner is written in Python. Dataflow is written (primarily) in C++.
> My hope is that the FnAPI will be stable enough that one can even run
> multiple versions of the Java SDK with the same runner. What matters
> is that (1) if the same URN is used, all runners/SDKs agree on the
> encoding (2) there are certain coders (Windowed, LengthPrefixed, and
> KV come to mind) that all Runners/SDKs are required to understand, and
> (3) runners properly coerce coders they do not understand into coders
> that they do if they need to pull out and act on the bytes. The more
> coders the runner/SDK understands, the less often it needs to do this.
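Robert's point (3), coercing coders the runner does not understand, can be sketched as follows. This is a simplified illustration, not Beam's actual wire format: the real LengthPrefixCoder uses a varint length prefix, while this sketch uses a fixed 4-byte prefix for brevity.

```python
import struct

# A runner that does not understand a coder's URN can still delimit elements
# in a byte stream by wrapping them as LengthPrefix[Bytes]: it prefixes each
# element's opaque bytes with their length, without ever decoding them.
def length_prefix_encode(raw: bytes) -> bytes:
    return struct.pack(">I", len(raw)) + raw  # fixed 4-byte prefix (sketch only)

def length_prefix_decode(buf: bytes, offset: int = 0):
    (n,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + n], start + n  # (element bytes, next offset)

stream = length_prefix_encode(b"elem1") + length_prefix_encode(b"longer-elem2")
e1, off = length_prefix_decode(stream)
e2, _ = length_prefix_decode(stream, off)
print(e1, e2)  # b'elem1' b'longer-elem2'
```

The SDK harness, which does know the wrapped coder's URN, decodes the inner bytes with the actual coder; the runner only ever needs the length prefix.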
>
> > jincheng sun wrote on Sat, Nov 9, 2019 at 12:26:
> >>
> >> Hi Robert Bradshaw,
> >>
> >> Thanks a lot for the explanation. Very interesting topic!
> >>
> >> Let us first define what are "standard coders". Usually it should be
> the coders defined in the Proto. However, personally I think the coders
> defined in the Java ModelCoders [1] seems more appropriate. The reason is
> that for a coder which has already appeared in Proto and still not added to
> the Java ModelCoders, it's always replaced by the runner with
> LengthPrefixCoder[ByteArrayCoder] and so the SDK harness will decode the
> data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
> That's to say, that coder still does not take effect in the SDK harness.
> Only when the coder is added in ModelCoders, it's 'known' and will take
> effect.
> >>
> >> So if we take option 3, the non-standard coders which will be wrapped
> with LengthPrefixCoder should be synced with the coders defined in the Java
> ModelCoders. (From this point of view, option 1 seems cleaner!)
> >>
> >> Please correct me if I missed something. Thanks a lot!
> >>
> >> Best,
> >> Jincheng
> >>
> >> [1]
> https://github.com/apache/beam/blob/01726e9c62313749f9ea7c93063a1178abd1a8db/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoders.java#L59
> >>
> >> Robert Burke wrote on Sat, Nov 9, 2019 at 8:46 AM:
> >>>
> >>> And by "I wasn't clear" I meant "I misread the options".
> >>>
> >>> On Fri, Nov 8, 2019, 4:14 PM Robert Burke  wrote:
> 
>  Reading back, I wasn't clear: the Go SDK does Option (1), putting the
> LP explicitly during encoding [1] for the runner proto, and explicitly
> expects LPs to contain a custom coder URN on decode for execution [2].
> (Modulo an old bug in Dataflow where the urn was empty)
> 
> 
>  [1]
> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L348
>  [2]
> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L219
> 
> 
>  On Fri, Nov 8, 2019, 10:28 AM Robert Bradshaw 
> wrote:
> >
> > On Fri, Nov 8, 2019 at 2:09 AM jincheng sun <
> sunjincheng...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Sorry for my late reply. It seems the conclusion has been reached.
> I just want to share my personal thoughts.
> > >
> > > Generally, both option 1 and 3 make sense to me.
> > >

Make environment_id a top level attribute of PTransform

2019-11-12 Thread Chamikara Jayalath
This was discussed in a JIRA [1], but I don't think it was mentioned on the
dev list.

Not having environment_id as a top level attribute of PTransform [2] makes
it difficult to track the Environment [3] a given PTransform should be
executed in. For example, in Dataflow, we have to fork code in several
places to filter out the Environment from a given PTransform proto.

Making environment_id a top level attribute of PTransform and removing it
from various payload types will make tracking environments easier. Also
code will become less error prone since we don't have to fork for all
possible payload types.
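The forking Cham describes can be sketched with illustrative stand-ins (plain dicts, not the real proto classes, and the URN and field names here are assumptions for illustration):

```python
# Today (assumed layout): environment_id lives inside each payload type,
# so a runner must fork on the URN to find it.
pardo = {"spec": {"urn": "beam:transform:pardo:v1",
                  "payload": {"do_fn": {"environment_id": "env1"}}}}

def environment_of_forked(t):
    urn = t["spec"]["urn"]
    if urn == "beam:transform:pardo:v1":
        return t["spec"]["payload"]["do_fn"]["environment_id"]
    # ...a branch for every other payload type that can carry an environment
    raise NotImplementedError(urn)

# Proposed: with environment_id as a top-level PTransform attribute,
# every case collapses to one field access.
def environment_of_top_level(t):
    return t["environment_id"]

print(environment_of_forked(pardo))  # env1
```

With the top-level field, runners no longer need a branch per payload type, which is the "less error prone" point above.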

Any objections to doing this change ?

Thanks,
Cham

[1] https://issues.apache.org/jira/browse/BEAM-7850
[2]
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99
[3]
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L1021


Cleaning up Approximate Algorithms in Beam

2019-11-12 Thread Reza Rokni
Hi everyone;

TL/DR : Discussion on Beam's various Approximate Distinct Count algorithms.

Today there are several options for Approximate Algorithms in Apache Beam
2.16 with HLLCount being the most recently added. Would like to canvas
opinions here on the possibility of rationalizing these API's by removing
obsolete / less efficient implementations.
The current situation:

There are three options available to users: ApproximateUnique.java,
ApproximateDistinct.java, and HllCount.java.
A quick summary of these API's as I understand them:

HllCount.java:
Marked as @Experimental

PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data streams,
based on the ZetaSketch implementation. For the detailed design of this
class, see https://s.apache.org/hll-in-beam.

ApproximateUnique.java:
Not marked as @Experimental

This API does not expose the ability to create sketches so it's not
suitable for the OLAP use case that HLL++ is geared towards (do
pre-aggregation into sketches to allow interactive query speeds). It's also
less precise for the same amount of memory used: the error bounds in the
doc comments give:

/* The error is about {@code 2 / sqrt(sampleSize)} */

Compared to the default HLLCount sketch size, its error is 10X larger than
the HLL++ error.
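As a quick illustration of that bound (assuming the error formula from the doc comment above), the relative error shrinks only with the square root of the sample size:

```python
import math

# ApproximateUnique's documented error bound: about 2 / sqrt(sampleSize).
def approx_unique_error(sample_size):
    return 2 / math.sqrt(sample_size)

# A 10,000-element sample gives roughly 2% relative error;
# a 400-element sample gives roughly 10%.
print(approx_unique_error(10000))  # 0.02
print(approx_unique_error(400))    # 0.1
```

So halving the error requires quadrupling the sample size (and the memory it occupies), which is the sense in which HLL++ is more memory-efficient for the same precision.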

ApproximateDistinct.java:
Marked as @Experimental

This is a re-implementation of the HLL++ algorithm, based on the paper
published in 2013. It exposes sketches via a HyperLogLogPlusCoder. We have
not run any benchmarks comparing this implementation to HllCount. We also
need to be careful that, if we change any of these APIs, the binary format
of the sketches never changes: there could be users who have stored
sketches produced by ApproximateDistinct, and it will be important to
ensure they do not have a bad experience.


Proposals:

There are two classes of users expected for these algorithms:

1) Users who simply use the transform to estimate the size of their data
set in Beam

2) Users who want to create sketches and store them, either for
interoperability with other systems, or as features to be used in further
data processing.



For use case 1, it is possible to make use of naming which does not expose
the implementation, however for use case 2 it is important for the
implementation to be explicit as sketches produced with one implementation
will not work with other implementations.

ApproximateUnique.java:

This one does not provide sketches and, based on the notes above, is not as
efficient as HllCount. However, it does have a very searchable name and is
likely to be the one users will gravitate to when searching for approximate
count-distinct algorithms without needing the capabilities of sketches.

Ideally we should think about switching the implementation of this
transform to wrap HLLCount. However this could mean changing things in a
way which is not transparent to the end developer.  Although as a result of
the change they would get a better implementation for free on an upgrade :-)

Another option would be to mark this transform as @Deprecated and create a
new transform ApproximateCountDistinct which would wrap HLLCount. The name
is also much clearer.

ApproximateDistinct.java:

This transform does generate sketches as output and, given it's marked as
@Experimental, one option we would have is to create a name which includes
the algorithm implementation details, for example
ApproximateCountDistinctClearSpring.



HllCount.java:

Again we have a few options here, as the name does not include search words
like 

Re: On processing event streams

2019-11-12 Thread Robert Bradshaw
One concern with (1) is that it may not be cheap to do for all
runners. There also seems to be the implication that in batch elements
would be 100% in order but in streaming kind-of-in-order is OK, which
would lead to pipelines being developed/tested against stronger
guarantees than are generally provided in a streaming system. It also
means batch and streaming have different semantics, not just different
runtime characteristics, etc. (Note also that for streaming the
out-of-order limits are essentially unbounded as well, but if you fall
"too far" behind you generally have other problems so in practice it's
OK for a "healthy" pipeline.)

I think (2) is the most consistent, as we can't meaningfully bound the
out-of-orderness in a way that lets us say a particular runner (or mode)
has violated it.
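Jan's option (3) below (a default expansion using state as a buffer) can be sketched in its batch form, under the assumption that all input for a key is seen before the buffer is flushed:

```python
from collections import defaultdict

# Buffer per key, then emit each key's events in event-time order before
# handing them to the stateful logic. Batch-only sketch: it assumes the
# input is bounded, so the buffer can be flushed once at the end.
def time_sorted_per_key(events):
    buf = defaultdict(list)              # key -> [(ts, value), ...]  (state as buffer)
    for key, ts, value in events:
        buf[key].append((ts, value))
    for key, items in buf.items():
        for ts, value in sorted(items):  # flush in timestamp order
            yield key, ts, value

out = list(time_sorted_per_key([("k", 5, "c"), ("k", 1, "a"), ("k", 3, "b")]))
print(out)  # [('k', 1, 'a'), ('k', 3, 'b'), ('k', 5, 'c')]
```

This also makes point (b) below concrete: with unbounded out-of-orderness, the buffer can hold up to the entire input for a key, i.e. O(N) state in the worst case.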

On Tue, Nov 12, 2019 at 1:36 AM Jan Lukavský  wrote:
>
> Hi,
>
> this is follow up of multiple threads covering the topic of how to (in a
> unified way) process event streams. Event streams can be characterized
> by a common property that ordering of events matter. The processing
> (usually) looks something like
>
>unordered stream -> buffer (per key) -> ordered stream -> stateful
> logic (DoFn)
>
> This is perfectly fine and can be solved by current tools Beam offers
> (state & timers), but *only for streaming case*. The batch case is
> essentially broken, because:
>
>   a) out-of-orderness is essentially *unbounded* (as opposed to the input
> being bounded; strangely, that is not a contradiction), while out-of-orderness
> in the streaming case is *bounded*, because the watermark can fall behind
> only a limited amount of time (sooner or later, nobody would actually care
> about results from a streaming pipeline being months or years late, right?)
>
>   b) with unbounded out-of-orderness, the spatial requirements of state
> grow with O(N), worst case, where N is size of the whole input
>
>   c) moreover, many runners restrict the size of state per key to fit in
> memory (spark, flink)
>
> Now, solutions to this problems seem to be:
>
>   1) refine the model guarantees for batch stateful processing, so that
> we limit the out-of-orderness (the source of issues here); the only
> reasonable way to do that is to enforce sorting before all stateful
> dofns in the batch case (perhaps with an opt-out), or
>
>   2) define a way to mark stateful dofn as requiring the sorting (e.g.
> @RequiresTimeSortedInput) - note this has to be done for both batch and
> streaming case, as opposed to 1), or
>
>   3) define a different URN for "ordered stateful dofn", with default
> expansion using state as buffer (for both batch and streaming case) -
> that way this can be overridden in batch runners that can get into
> trouble otherwise (and could be regarded as sort of natural extension of
> the current approach).
>
> I still think that the best solution is 1), for multiple reasons going
> from being internally logically consistent to being practical and easily
> implemented (a few lines of code in flink's case for instance). On the
> other hand, if this is really not what we want to do, then I'd like to
> know the community's opinion on the two other options (or, if there
> maybe is some other option I didn't cover).
>
> Many thanks for opinions and help with fixing what is (sort of) broken
> right now.
>
> Jan
>


Re: [Portability] Turn off artifact staging?

2019-11-12 Thread Robert Bradshaw
FWIW, there are also discussions of adding a preparation phase for sdk
harness (docker) images, such that artifacts could be staged (and
installed, compiled etc.) ahead of time and shipped as part of the sdk
image rather than via a side channel (and on every worker). Anyone not
using these images is probably shipping dependencies in another way
anyways.

On Tue, Nov 12, 2019 at 5:03 PM Robert Bradshaw  wrote:
>
> Certainly there's a lot to be re-thought in terms of artifact staging,
> especially when it comes to cross-language pipelines. I think it would
> make sense to have a special retrieval token for the "empty"
> manifest, which would mean a staging directory would never have to be
> set up if no artifacts happened to be staged.
>
> The UberJar avoids any artifact staging overhead as well.
>
> On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver  wrote:
> >
> > Hi Beamers,
> >
> > We can use artifact staging to make sure SDK workers have access to a 
> > pipeline's dependencies. However, artifact staging is not always necessary. 
> > For example, one can make sure that the environment contains all the 
> > dependencies ahead of time. However, regardless of whether or not artifacts 
> > are used, my understanding is an artifact manifest will be written and read 
> > anyway. For example:
> >
> > INFO AbstractArtifactRetrievalService: GetManifest for 
> > /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
> >
> > This can be a hassle, because users must set up a staging directory that 
> > all workers can access, even if it isn't used aside from the (empty) 
> > manifest [1]. Thomas mentioned that at Lyft they bypass artifact staging 
> > altogether [2]. So I was wondering, do you all think it would be reasonable 
> > or useful to create an "off switch" for artifact staging?
> >
> > Thanks,
> > Kyle
> >
> > [1] 
> > https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
> > [2] 
> > https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715


Re: [Portability] Turn off artifact staging?

2019-11-12 Thread Robert Bradshaw
Certainly there's a lot to be re-thought in terms of artifact staging,
especially when it comes to cross-language pipelines. I think it would
make sense to have a special retrieval token for the "empty"
manifest, which would mean a staging directory would never have to be
set up if no artifacts happened to be staged.

The UberJar avoids any artifact staging overhead as well.

On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver  wrote:
>
> Hi Beamers,
>
> We can use artifact staging to make sure SDK workers have access to a 
> pipeline's dependencies. However, artifact staging is not always necessary. 
> For example, one can make sure that the environment contains all the 
> dependencies ahead of time. However, regardless of whether or not artifacts 
> are used, my understanding is an artifact manifest will be written and read 
> anyway. For example:
>
> INFO AbstractArtifactRetrievalService: GetManifest for 
> /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
>
> This can be a hassle, because users must set up a staging directory that all 
> workers can access, even if it isn't used aside from the (empty) manifest 
> [1]. Thomas mentioned that at Lyft they bypass artifact staging altogether 
> [2]. So I was wondering, do you all think it would be reasonable or useful to 
> create an "off switch" for artifact staging?
>
> Thanks,
> Kyle
>
> [1] 
> https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
> [2] 
> https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715


Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-12 Thread Udi Meiri
+1 for better organization. I would have gone to ApacheCon LV had I known
there was going to be a Beam summit there.

On Tue, Nov 12, 2019 at 9:31 AM Alexey Romanenko 
wrote:

> On 8 Nov 2019, at 11:32, Maximilian Michels  wrote:
> >
> > The dates sound good to me. I agree that the bay area has an advantage
> because of its large tech community. On the other hand, it is a question of
> how we run the event. In Berlin we managed to get about 200 attendees,
> but for the Beam Summit in Las Vegas with ApacheCon the attendance
> was much lower.
>
> I agree with your point Max, and I believe that it would be more efficient
> to run Beam Summit as a "standalone" event (as it was done in London and
> Berlin), which will allow us to attract a mostly
> Beam-oriented/interested/focused audience, compared to running it as part
> of ApacheCon or any other large conference where there are many other
> topics and tracks.
>
> > Should this also be discussed on the user mailing list?
>
> Definitely! Despite the fact that users' opinion is a key point here, it
> will not be so easy to get unbiased statistics on this question.
>
> The time frames are also very important, since holidays in different
> countries (for example, August is traditionally a "vacation month" in
> France and some other European countries) can affect people's availability
> and influence the final number of participants in the end.
>
> >
> > Cheers,
> > Max
> >
> > On 07.11.19 22:50, Alex Van Boxel wrote:
> >> Date-wise, I'm wondering why we should switch the Europe and NA
> ones; this would mean that the Berlin and the new EU summit would be almost
> 1.5 years apart.
> >>  _/
> >> _/ Alex Van Boxel
> >> On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay wrote:
> >>I prefer the bay area for the NA summit. My reasoning is that there is a
> >>critical mass of contributors and users in that location, probably
> >>more than in alternative NA locations. I was not involved with planning
> >>recently and I do not know if there were people who could not attend due
> >>to the location previously. If that is the case, I agree with Elliotte
> >>on looking for other options.
> >>Related to dates: March (Asia) and mid-May (NA) dates are a bit
> >>close. Mid-June for NA might be better to spread events. Other
> >>pieces looks good.
> >>Ahmet
> >>On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold wrote:
> >>The U.S. sadly is not a reliable destination for international
> >>conferences these days. Almost every conference I go to, big and
> >>small, has at least one speaker, sometimes more, who can't get
> into
> >>the country. Canada seems worth considering. Vancouver,
> >>Montreal, and
> >>Toronto are all convenient.
> >>On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas wrote:
> >> >
> >> > Hi Beam Community!
> >> >
> >> > I'd like to kick off a thread to discuss potential dates and
> >>venues for the 2020 Beam Summits.
> >> >
> >> > I did some research on industry conferences happening in 2020
> >>and pre-selected a few ranges as follows:
> >> >
> >> > (2 days) NA between mid-May and mid-June
> >> > (2 days) EU mid October
> >> > (1 day) Asia Mini Summit:  March
> >> >
> >> > I'd like to hear your thoughts on these dates and get
> >>consensus on exact dates as the convo progresses.
> >> >
> >> > For locations these are the options I reviewed:
> >> >
> >> > NA: Austin Texas, Berkeley California, Mexico City.
> >> > Europe: Warsaw, Barcelona, Paris
> >> > Asia: Singapore
> >> >
> >> > Let the discussion begin!
> >> > G (on behalf of the Beam Summit Steering Committee)
> >> >
> >> >
> >> >
> >>-- Elliotte Rusty Harold
> >>elh...@ibiblio.org 
>
>




Re: Behavior of TimestampCombiner?

2019-11-12 Thread Robert Bradshaw
I bet, as with the previous one, this is due to over-eager combiner lifting.

On Tue, Nov 12, 2019 at 4:17 PM Ruoyun Huang  wrote:
>
> Reported a tracking JIRA:  https://issues.apache.org/jira/browse/BEAM-8645
>
> On Tue, Nov 12, 2019 at 9:48 AM Ruoyun Huang  wrote:
>>
>> Thanks for confirming.
>>
>> Since it is unexpected behavior, I shall check JIRA to see whether it is
>> already on the radar; if not, I will create one.
>>
>> On Mon, Nov 11, 2019 at 6:11 PM Robert Bradshaw  wrote:
>>>
>>> The END_OF_WINDOW is indeed 9.99 (or, in Java, 9.999000), but the
>>> results for LATEST and EARLIEST should be 9 and 0 respectively.
>>>
>>> On Mon, Nov 11, 2019 at 5:34 PM Ruoyun Huang  wrote:
>>> >
>>> > Hi, Folks,
>>> >
>>> > I am trying to understand the behavior of TimestampCombiner. I have a 
>>> > test like this:
>>> >
>>> > class TimestampCombinerTest(unittest.TestCase):
>>> >
>>> >   def test_combiner_latest(self):
>>> > """Test TimestampCombiner with LATEST."""
>>> > options = PipelineOptions()
>>> > options.view_as(StandardOptions).streaming = True
>>> > p = TestPipeline(options=options)
>>> >
>>> > main_stream = (p
>>> >| 'main TestStream' >> TestStream()
>>> >.add_elements([window.TimestampedValue(('k', 100), 0)])
>>> >.add_elements([window.TimestampedValue(('k', 400), 9)])
>>> >.advance_watermark_to_infinity()
>>> >| 'main windowInto' >> beam.WindowInto(
>>> >   window.FixedWindows(10),
>>> >   
>>> > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST)
>>> >| 'Combine' >> beam.CombinePerKey(sum))
>>> >
>>> > class RecordFn(beam.DoFn):
>>> >   def process(self,
>>> >   elm=beam.DoFn.ElementParam,
>>> >   ts=beam.DoFn.TimestampParam):
>>> > yield (elm, ts)
>>> >
>>> > records = (main_stream | beam.ParDo(RecordFn()))
>>> >
>>> > expected_window_to_elements = {
>>> > window.IntervalWindow(0, 10): [
>>> > (('k', 500),  Timestamp(9)),
>>> > ],
>>> > }
>>> >
>>> > assert_that(
>>> > records,
>>> > equal_to_per_window(expected_window_to_elements),
>>> > use_global_window=False,
>>> > label='assert per window')
>>> >
>>> > p.run()
>>> >
>>> >
>>> > I expect the result to be following (based on various TimestampCombiner 
>>> > strategy):
>>> > LATEST:(('k', 500), Timestamp(9)),
>>> > EARLIEST:(('k', 500), Timestamp(0)),
>>> > END_OF_WINDOW: (('k', 500), Timestamp(10)),
>>> >
>>> > The above outcome is partially confirmed by Java side test : [1]
>>> >
>>> >
>>> > However, from beam python, the outcome is like this:
>>> > LATEST:(('k', 500), Timestamp(10)),
>>> > EARLIEST:(('k', 500), Timestamp(10)),
>>> > END_OF_WINDOW: (('k', 500), Timestamp(9.)),
>>> >
>>> > What did I miss? what should be the right expected behavior? or this 
>>> > looks like a bug?
>>> >
>>> > [1]: 
>>> > https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/GroupByKeyTest.java#L390
>>> >
>>> > Cheers,
>>> >
>>
>>
>>
>> --
>> 
>> Ruoyun  Huang
>>
>
>
> --
> 
> Ruoyun  Huang
>


[discuss] Using a logger hierarchy in Python

2019-11-12 Thread Pablo Estrada
Hi all,
as of today, the Python SDK uses the root logger wherever we log. This
means that it's impossible to have different logging levels depending on
the section of the code that we want to debug most.

I have been doing some work on the FnApiRunner, and adding logging for it.
I would like to start using a logger hierarchy, and slowly transition the
rest of the project to use per-module loggers.

On each module, we could have a line like so:

_LOGGER = logging.getLogger(__name__)

and simply log everything on that _LOGGER. Is that an acceptable thing to
do for everyone?
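A minimal sketch of what this enables (the module names here are illustrative; real modules would just use `__name__` as proposed):

```python
import logging

# Per-module loggers form a dot-separated hierarchy under the root logger.
runner_log = logging.getLogger("apache_beam.runners.portability.fn_api_runner")
io_log = logging.getLogger("apache_beam.io")

logging.basicConfig(level=logging.WARNING)  # root stays quiet by default

# Turn on DEBUG for just the FnApiRunner subtree; everything else keeps
# inheriting the root's WARNING level.
logging.getLogger("apache_beam.runners.portability").setLevel(logging.DEBUG)

runner_log.debug("bundle started")   # emitted: parent logger is at DEBUG
io_log.debug("read 10 records")      # suppressed: effective level is WARNING
```

With the root logger used everywhere, this kind of targeted verbosity is impossible, which is the motivation above.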

If I see no objections, I will change the FnApiRunner to use a logger like
this, and change other sections of the code as I interact with them.
Best
-P.


Re: Behavior of TimestampCombiner?

2019-11-12 Thread Ruoyun Huang
Reported a tracking JIRA:  https://issues.apache.org/jira/browse/BEAM-8645

On Tue, Nov 12, 2019 at 9:48 AM Ruoyun Huang  wrote:

> Thanks for confirming.
>
> Since it is unexpected behavior, I shall check JIRA to see whether it is
> already on the radar; if not, I will create one.
>
> On Mon, Nov 11, 2019 at 6:11 PM Robert Bradshaw 
> wrote:
>
>> The END_OF_WINDOW is indeed 9.99 (or, in Java, 9.999000), but the
>> results for LATEST and EARLIEST should be 9 and 0 respectively.
>>
>> On Mon, Nov 11, 2019 at 5:34 PM Ruoyun Huang  wrote:
>> >
>> > Hi, Folks,
>> >
>> > I am trying to understand the behavior of TimestampCombiner. I have
>> a test like this:
>> >
>> > class TimestampCombinerTest(unittest.TestCase):
>> >
>> >   def test_combiner_latest(self):
>> > """Test TimestampCombiner with LATEST."""
>> > options = PipelineOptions()
>> > options.view_as(StandardOptions).streaming = True
>> > p = TestPipeline(options=options)
>> >
>> > main_stream = (p
>> >| 'main TestStream' >> TestStream()
>> >.add_elements([window.TimestampedValue(('k', 100),
>> 0)])
>> >.add_elements([window.TimestampedValue(('k', 400),
>> 9)])
>> >.advance_watermark_to_infinity()
>> >| 'main windowInto' >> beam.WindowInto(
>> >   window.FixedWindows(10),
>> >
>>  timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST)
>> >| 'Combine' >> beam.CombinePerKey(sum))
>> >
>> > class RecordFn(beam.DoFn):
>> >   def process(self,
>> >   elm=beam.DoFn.ElementParam,
>> >   ts=beam.DoFn.TimestampParam):
>> > yield (elm, ts)
>> >
>> > records = (main_stream | beam.ParDo(RecordFn()))
>> >
>> > expected_window_to_elements = {
>> > window.IntervalWindow(0, 10): [
>> > (('k', 500),  Timestamp(9)),
>> > ],
>> > }
>> >
>> > assert_that(
>> > records,
>> > equal_to_per_window(expected_window_to_elements),
>> > use_global_window=False,
>> > label='assert per window')
>> >
>> > p.run()
>> >
>> >
>> > I expect the result to be following (based on various TimestampCombiner
>> strategy):
>> > LATEST:(('k', 500), Timestamp(9)),
>> > EARLIEST:(('k', 500), Timestamp(0)),
>> > END_OF_WINDOW: (('k', 500), Timestamp(10)),
>> >
>> > The above outcome is partially confirmed by Java side test : [1]
>> >
>> >
>> > However, from beam python, the outcome is like this:
>> > LATEST:(('k', 500), Timestamp(10)),
>> > EARLIEST:(('k', 500), Timestamp(10)),
>> > END_OF_WINDOW: (('k', 500), Timestamp(9.)),
>> >
>> > What did I miss? what should be the right expected behavior? or this
>> looks like a bug?
>> >
>> > [1]:
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/GroupByKeyTest.java#L390
>> >
>> > Cheers,
>> >
>>
>
>
> --
> 
> Ruoyun  Huang
>
>

-- 

Ruoyun  Huang


Re: Completeness of Beam Java Dependency Check Report

2019-11-12 Thread Yifan Zou
The dependency management tool is back. See the latest report.

On Tue, Nov 12, 2019 at 9:51 AM Yifan Zou  wrote:

> Thanks Tomo. I'll follow up in JIRA.
>
> On Tue, Nov 12, 2019 at 9:44 AM Tomo Suzuki  wrote:
>
>> Yifan,
>> I created a ticket to track this finding:
>> https://issues.apache.org/jira/browse/BEAM-8621 .
>>
>>
>> On Mon, Nov 11, 2019 at 5:08 PM Tomo Suzuki  wrote:
>>
>>> Kenn,
>>>
>>> Thank you for the analysis. Although Guava was randomly picked up, it's
>>> great learning for me to learn how you analyzed other modules using Guava.
>>>
>>> On Mon, Nov 11, 2019 at 4:29 PM Kenneth Knowles  wrote:
>>>
 BeamModulePlugin just contains lists of versions to ease coordination
 across Beam modules, but mostly does not create dependencies. Most of
 Beam's modules only depend on a few things there. For example Guava is not
 a core dependency, but here is where it is actually depended upon:

 $ find . -name build.gradle | xargs grep library.java.guava
 ./sdks/java/core/build.gradle:  shadowTest library.java.guava_testlib
 ./sdks/java/extensions/sql/jdbc/build.gradle:  compile
 library.java.guava
 ./sdks/java/io/google-cloud-platform/build.gradle:  compile
 library.java.guava
 ./sdks/java/io/kinesis/build.gradle:  testCompile
 library.java.guava_testlib

 These results appear to be misleading. Grepping for 'import
 com.google.common', I see this as the actual state of things:

  - GCP connector does not appear to actually depend on Guava in compile
 scope
  - The Beam SQL JDBC driver does not appear to actually depend on Guava
 in compile scope
  - The Dataflow Java worker does depend on Guava at compile scope but
 has incorrect dependencies (and it probably shouldn't)
  - KinesisIO does depend on Guava at compile scope but has incorrect
 dependencies (Kinesis libs have Guava on API surface so it is OK here, but
 should be correctly declared)
  - ZetaSQL translator does depend on Guava at compile scope but has
 incorrect dependencies (ZetaSQL has it on API surface so it is OK here, but
 should be correctly declared)

 We used to have an analysis that prevented this class of error.

 Once the errors are fixed, the guava_version is simply a version that
 we have discovered that seems to work for both Kinesis and ZetaSQL,
 libraries we do not control. Kinesis producer is built against 18.0.
 Kinesis client against 26.0-jre. ZetaSQL against 26.0-android.

 (or maybe I messed up in my analysis)

 Kenn

 On Mon, Nov 11, 2019 at 12:07 PM Tomo Suzuki 
 wrote:

>
> Chamikara and Yifan,
> Thank you for the responses! Looking forward to hearing the
> investigation result.
> In the meantime, I'll explore .test-infra/jenkins/dependency_check
> directory.
>
>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>>
>> --
>> Regards,
>> Tomo
>>
>


Re: Contributor permission for Beam Jira tickets

2019-11-12 Thread amit kumar
Thanks!

On Tue, Nov 12, 2019 at 3:49 PM Kenneth Knowles  wrote:

> Done. Welcome!
>
> On Tue, Nov 12, 2019 at 3:40 PM amit kumar  wrote:
>
>> Hi Beam Devs,
>>
>> I am Amit from Godaddy and I am looking to contribute to Beam.
>> Could you please add me as a contributor. My Id is - amitkumar27
>>
>> Regards,
>> Amit
>>
>> On Wed, Nov 6, 2019 at 9:59 AM amit kumar  wrote:
>>
>>> Hi Beam Devs,
>>>
>>> I am Amit from Godaddy and I am looking to contribute to Beam.
>>> Could you please add me as a contributor and a subscriber to the Dev
>>> mailing list.
>>>
>>> Regards,
>>> Amit
>>>
>>


Re: Contributor permission for Beam Jira tickets

2019-11-12 Thread Kenneth Knowles
Done. Welcome!

On Tue, Nov 12, 2019 at 3:40 PM amit kumar  wrote:

> Hi Beam Devs,
>
> I am Amit from Godaddy and I am looking to contribute to Beam.
> Could you please add me as a contributor. My Id is - amitkumar27
>
> Regards,
> Amit
>
> On Wed, Nov 6, 2019 at 9:59 AM amit kumar  wrote:
>
>> Hi Beam Devs,
>>
>> I am Amit from Godaddy and I am looking to contribute to Beam.
>> Could you please add me as a contributor and a subscriber to the Dev
>> mailing list.
>>
>> Regards,
>> Amit
>>
>


Re: Contributor permission for Beam Jira tickets

2019-11-12 Thread amit kumar
Hi Beam Devs,

I am Amit from Godaddy and I am looking to contribute to Beam.
Could you please add me as a contributor. My Id is - amitkumar27

Regards,
Amit

On Wed, Nov 6, 2019 at 9:59 AM amit kumar  wrote:

> Hi Beam Devs,
>
> I am Amit from Godaddy and I am looking to contribute to Beam.
> Could you please add me as a contributor and a subscriber to the Dev
> mailing list.
>
> Regards,
> Amit
>


Beam Dependency Check Report (2019-11-12)

2019-11-12 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
mock | 2.0.0 | 3.0.5 | 2019-05-20 | 2019-05-20 | BEAM-7369
oauth2client | 3.0.0 | 4.1.3 | 2018-12-10 | 2018-12-10 | BEAM-6089
Sphinx | 1.8.5 | 2.2.1 | 2019-05-20 | 2019-10-28 | BEAM-7370
High Priority Dependency Updates Of Beam Java SDK:

Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC4 | 2018-03-20 | 2019-04-14 | BEAM-5809
com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.20.0 | 0.27.0 | 2019-02-11 | 2019-10-21 | BEAM-6645
com.github.spotbugs:spotbugs | 3.1.12 | 4.0.0-beta4 | 2019-03-01 | 2019-09-18 | BEAM-7792
com.github.spotbugs:spotbugs-annotations | 3.1.12 | 4.0.0-beta4 | 2019-03-01 | 2019-09-18 | BEAM-6951
com.google.api.grpc:proto-google-common-protos | 1.12.0 | 1.17.0 | 2018-06-29 | 2019-10-04 | BEAM-6899
com.google.auth:google-auth-library-credentials | 0.13.0 | 0.18.0 | 2019-01-17 | 2019-10-09 | BEAM-6478
com.google.code.gson:gson | 2.7 | 2.8.6 | 2016-06-14 | 2019-10-04 | BEAM-5558
io.grpc:grpc-auth | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5896
io.grpc:grpc-context | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5897
io.grpc:grpc-core | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5898
io.grpc:grpc-netty | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5899
io.grpc:grpc-protobuf | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5900
io.grpc:grpc-stub | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5901
io.grpc:grpc-testing | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5902
io.grpc:protoc-gen-grpc-java | 1.21.0 | 1.25.0 | 2019-05-22 | 2019-11-05 | BEAM-5903
io.opencensus:opencensus-api | 0.21.0 | 0.24.0 | 2019-05-01 | 2019-08-27 | BEAM-5580
io.opencensus:opencensus-contrib-grpc-metrics | 0.21.0 | 0.24.0 | 2019-05-01 | 2019-08-27 | BEAM-5581
javax.servlet:javax.servlet-api | 3.1.0 | 4.0.1 | 2013-04-25 | 2018-04-20 | BEAM-5750
net.java.dev.javacc:javacc | 4.0 | 7.0.5 | 2006-03-17 | 2019-10-20 | BEAM-5570
net.java.dev.jna:jna | 4.1.0 | 5.5.0 | 2014-03-06 | 2019-10-30 | BEAM-5573
org.apache.hbase:hbase-common | 1.2.6 | 2.2.2 | 2017-05-29 | 2019-10-20 | BEAM-5560
org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.2.2 | 2017-05-29 | 2019-10-20 | BEAM-5561
org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.2.2 | 2017-05-29 | 2019-10-20 | BEAM-5562
org.apache.hbase:hbase-server | 1.2.6 | 2.2.2 | 2017-05-29 | 2019-10-20 | BEAM-5563
org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.2.1 | 2017-05-29 | 2019-09-04 | BEAM-5564
org.apache.hive:hive-cli | 2.1.0 | 3.1.2 | 2016-06-17 | 2019-08-22 | BEAM-5566
org.apache.hive:hive-common | 2.1.0 | 3.1.2 | 2016-06-17 | 2019-08-22 | BEAM-5567
org.apache.hive:hive-exec | 2.1.0 | 3.1.2 | 2016-06-17 | 2019-08-22 | BEAM-5568
org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.2 | 2016-06-17 | 2019-08-22 | BEAM-5569
org.apache.qpid:proton-j | 0.13.1 | 0.33.2 | 2016-07-02 | 2019-08-09 | BEAM-5582
org.apache.solr:solr-core | 5.5.4 | 8.3.0 | 2017-02-14 | 2019-10-31 | BEAM-5589
org.apache.solr:solr-solrj | 5.5.4 | 8.3.0 | 2017-02-14 | 2019-10-31 | BEAM-5590
org.apache.solr:solr-test-framework | 5.5.4 | 8.3.0 | 2017-02-14 | 2019-10-31 | BEAM-5591
org.conscrypt:conscrypt-openjdk | 1.1.3 | 2.2.1 | 2018-06-04 | 2019-08-08 | BEAM-5748
org.eclipse.jetty:jetty-server | 9.2.10.v20150310 | 10.0.0-alpha0 | 2015-03-10 | 2019-07-11 | BEAM-5752
org.eclipse.jetty:jetty-servlet | 9.2.10.v20150310 | 10.0.0-alpha0 | | |

[Portability] Turn off artifact staging?

2019-11-12 Thread Kyle Weaver
Hi Beamers,

We can use artifact staging to make sure SDK workers have access to a
pipeline's dependencies. However, artifact staging is not always necessary;
for example, one can make sure the environment contains all the
dependencies ahead of time. Yet regardless of whether or not artifacts
are used, my understanding is that an artifact manifest will be written and
read anyway. For example:

INFO AbstractArtifactRetrievalService: GetManifest for
/tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts

This can be a hassle, because users must set up a staging directory that
all workers can access, even if it isn't used aside from the (empty)
manifest [1]. Thomas mentioned that at Lyft they bypass artifact staging
altogether [2]. So I was wondering, do you all think it would be reasonable
or useful to create an "off switch" for artifact staging?

Thanks,
Kyle

[1]
https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
[2]
https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715


Re: [spark structured streaming runner] merge to master?

2019-11-12 Thread Kyle Weaver
+1 to "one uber jar to rule them all." As Ryan said, we snuck the portable
runner into master at a much less usable state than the structured
streaming runner, and it didn't seem to cause any issues (although in
retrospect we probably should have tagged it as @Experimental, I wasn't
aware that was a thing at the time).

In general, until we add visible documentation on the website, I suspect
most users will be completely unaware that the new runner is even there.

On Fri, Nov 8, 2019 at 9:50 AM Kenneth Knowles  wrote:

> On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot 
> wrote:
> >
> > Hi guys
> >
> > @Kenn,
> >
> > I just wanted to mention that I did answer your question on
> dependencies here:
> https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E
>
> Ah, sorry! In that case there is no problem at all.
>
>
> > I'm not in favor of having the 2 runners in one jar, the point about
> having 2 jars was to:
> >
> > - avoid making promises to users on a work in progress runner (make it
> explicit with a different jar)
> > - avoid confusion for them (why are there 2 pipeline options? etc)
> >
> > If the community believes that there is no confusion or wrong promises
> with the one jar solution, we could leave the 2 runners in one jar.
> >
> > Maybe we could start a vote on that?
>
> It seems unanimous among others to have one jar. There were some
> suggestions of how to avoid promises and confusion, like Ryan's most recent
> email. Did any of the ideas sound good to you?
>
> Kenn
>
>
> I have no objection to putting the experimental runner alongside the
>> stable, mature runner.  We have some precedence with the portable
>> spark runner, and that's worked out pretty well -- at least, I haven't
>> heard any complaints from confused users!
>>
>> That being said:
>>
>> 1.  It really should be marked @Experimental in the code *and* clearly
>> warned in API (javadoc) and documentation.
>>
>> 2.  Ideally, I'd like to see a warning banner in the logs when it's
>> used, pointing to the stable SparkRunner and/or documentation on the
>> current known issues.
>>
>> All my best, Ryan
>>
>>
>>
>>
>>
>>
>> > regarding jars:
>> >
>> > I don't like 3 jars either.
>> >
>> >
>> > Etienne
>> >
>> > On 31/10/2019 02:06, Kenneth Knowles wrote:
>> >
>> > Very good points. We definitely ship a lot of code/features in very
>> early stages, and there seems to be no problem.
>> >
>> > I intend mostly to leave this judgment to people like you who know
>> better about Spark users.
>> >
>> > But I do think 1 or 2 jars is better than 3. I really don't like "3
>> jars" and I did give two reasons:
>> >
>> > 1. diamond deps where things overlap
>> > 2. figuring out which thing to depend on
>> >
>> > Both are annoying for users. I am not certain if it could lead to a
>> real unsolvable situation. This is just a Java ecosystem problem so I feel
>> qualified to comment.
>> >
>> > I did also ask if there were major dependency differences between the
>> two that could cause problems for users. This question was dropped and no
>> one cares to comment so I assume it is not an issue. So then I favor having
>> just 1 jar with both runners.
>> >
>> > Kenn
>> >
>> > On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía  wrote:
>> >>
>> >> I am still a bit lost about why we are discussing options without
>> giving any
>> >> arguments or reasons for the options? Why is 2 modules better than 3
>> or 3 better
>> >> than 2, or even better, what forces us to have something different
>> than a single
>> >> module?
>> >>
>> >> What are the reasons for wanting to have separate jars? If the issue
>> is that the
>> >> code is unfinished or not passing the tests, the impact for end users
>> is minimal
>> >> because they cannot accidentally end up running the new runner, and if
>> they
>> >> decide to do so we can warn them it is at their own risk and not ready
>> for
>> >> production in the documentation + runner.
>> >>
>> >> If the fear is that new code may end up being intertwined with the
>> classic and
>> >> portable runners and have some side effects. We have the
>> ValidatesRunner +
>> >> Nexmark in the CI to cover this so again I do not see what is the
>> problem that
>> >> requires modules to be separate.
>> >>
>> >> If the issue is being uncomfortable about having in-progress code in
>> released
>> >> artifacts we have been doing this in Beam forever, for example most of
>> the work
>> >> on portability and Schema/SQL, and all of those were still part of
>> artifacts
>> >> long time before they were ready for prime use, so I still don't see
>> why this
>> >> case is different to require different artifacts.
>> >>
>> >> I have the impression we are trying to solve a non-issue by adding a
>> lot of
>> >> artificial complexity (in particular to the users), or am I missing
>> something
>> >> else?
>> >>
>> >> On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles 
>> wrote:
>> >> >
>> >> > Oh, I mean that we 

Re: Test Failure: GcpOptionsTest$CommonTests. testDefaultGcpTempLocationDoesNotExist

2019-11-12 Thread Tomo Suzuki
Kyle,
Great. Thank you for quick response.

On Tue, Nov 12, 2019 at 4:26 PM Kyle Weaver  wrote:

> Hi Tomo, thanks for reporting.
>
> This test passes on my machine and on Jenkins. I'm guessing this test is
> assuming something about the host's gcloud settings, and is overfitting as
> a result. Probably we should mock something so that the test doesn't
> actually need to call gcloud.
>
> I have created a JIRA issue for this and assigned to myself:
> https://issues.apache.org/jira/browse/BEAM-8628
>
> Kyle
>
> On Tue, Nov 12, 2019 at 7:41 AM Tomo Suzuki  wrote:
>
>> Hi Beam developers,
>>
>> I'm trying to build Apache Beam from the source. But GcpOptionsTest fails
>> (error below).
>>
>> Did anybody solve this problem already?
>>
>> I'm using master (c2e58c55)
>>
>> suztomo@suxtomo24:~/beam4$ ./gradlew -p sdks/java check
>> ...
>> FAILURE: Build failed with an exception.
>>
>> * What went wrong:
>> Execution failed for task
>> ':sdks:java:extensions:google-cloud-platform-core:test'.
>> > There were failing tests. See the report at:
>> file:///usr/local/google/home/suztomo/beam4/sdks/java/extensions/google-cloud-platform-core/build/reports/tests/test/index.html
>>
>> The HTML file shows the following stacktrace:
>>
>> java.lang.AssertionError:
>> Expected: (an instance of java.lang.IllegalArgumentException and
>> exception with message a string containing "Error constructing default
>> value for gcpTempLocation: tempLocation is not a valid GCS path" and
>> exception with cause exception with message a string containing "Output
>> path does not exist or is not writeable")
>>  but: exception with cause exception with message a string containing
>> "Output path does not exist or is not writeable" cause message was "Unable
>> to verify that GCS bucket gs://does exists."
>> Stacktrace was: java.lang.IllegalArgumentException: Error constructing
>> default value for gcpTempLocation: tempLocation is not a valid GCS path,
>> gs://does/not/exist.
>>
>>
>> The problem (at the surface) is, in my development
>> environment, gcpOptions.getGcsUtil().bucketAccessible(gcsPath) throws
>> IOException rather than returning false. I want to know whether this is
>> specific to my environment or not.
>>
>> I'm using
>> - 1.8.0_181-google-v7
>> - x86_64 Debian GNU/Linux (Google buid)
>>
>> --
>> Regards,
>> Tomo
>>
>

-- 
Regards,
Tomo


Re: contributor permission for Beam Jira tickets: suztomo

2019-11-12 Thread Tomo Suzuki
Thank you so much

On Tue, Nov 12, 2019 at 4:28 PM Pablo Estrada  wrote:

> Hi Tomo!
> I've added you as contributor. Welcome!
> Best
> -P.
>
> On Tue, Nov 12, 2019 at 11:51 AM Tomo Suzuki  wrote:
>
>> Hi Beam Devs,
>>
>> This is Tomo from Google New York. I'd like to contribute to Beam Java
>> dependencies upgrade. Can someone add me as a contributor for Beam's JIRA
>> issue tracker?
>>
>> GitHub account: suztomo
>> Apache JIRA username: suztomo
>>
>> --
>> Regards,
>> Tomo
>>
>

-- 
Regards,
Tomo


Re: Command for Beam worker on Spark cluster

2019-11-12 Thread Kyle Weaver
Not sure what's causing the error. We should be able to see output from the
process if you set the logging level to DEBUG.

> Some debugging to boot.go and running it manually shows it doesn't return
from "artifact.Materialize" function.

Running boot.go by itself won't work if there is no artifact server running
(which normally Beam will start automatically):
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/artifact/materialize.go#L43
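[For readers following this thread: the pipeline options being discussed can be sketched roughly as below. The boot path and job endpoint are illustrative assumptions, not values confirmed in this thread; adjust them for your cluster.]

```python
import json

# Illustrative values -- the boot binary is the one built by
# `./gradlew :sdks:python:container:build`, as described earlier in the thread.
boot_path = "/opt/beam/sdks/python/container/build/target/launcher/linux_amd64/boot"
job_endpoint = "localhost:8099"

# In PROCESS mode, environment_config is a JSON object naming the command
# that each worker node should run.
environment_config = json.dumps({"command": boot_path})

pipeline_args = [
    "--runner=PortableRunner",
    "--job_endpoint=" + job_endpoint,
    "--environment_type=PROCESS",
    "--environment_config=" + environment_config,
]
print(pipeline_args[-1])
```

The boot executable and a Python install with Beam must exist at that path on every worker node, per Kyle's note above.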

On Thu, Nov 7, 2019 at 10:05 AM Matthew K.  wrote:

> Thanks, but I still have a problem making remote workers on k8s work (it's
> important to point out that I had to create a shared volume between nodes so
> that all have access to the same /tmp, since the Beam runner creates artifact
> staging files on the machine it is running on, and expects workers to read from it).
>
> However, I get this error from executor:
>
>
> INFO AbstractArtifactRetrievalService: GetManifest for
> /tmp/beam-artifact-staging/job_cca3e889-76d9-4c8a-a942-a64ddbd2dd1f/MANIFEST
> INFO AbstractArtifactRetrievalService: GetManifest for
> /tmp/beam-artifact-staging/job_cca3e889-76d9-4c8a-a942-a64ddbd2dd1f/MANIFEST
> -> 0 artifacts
> INFO ProcessEnvironmentFactory: Still waiting for startup of environment
> '/opt/spark/beam/sdks/python/container/build/target/launcher/linux_amd64/boot'
> for worker id 3-1
> ERROR Executor: Exception in task 0.1 in stage 0.0 (TID 2)
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.IllegalStateException: Process died with exit code 1
> at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2050)
>
> (note that job manifest has no artifacts in it)
>
> I can see ports for enpoints (logging, artifact, ...) are open on the
> worker. Some debugging to boot.go and running it manually shows it doesn't
> return from "artifact.Materialize" function.
>
> Any idea what could be wrong in setup?
>
> *Sent:* Wednesday, November 06, 2019 at 5:45 PM
> *From:* "Kyle Weaver" 
> *To:* dev 
> *Subject:* Re: Command for Beam worker on Spark cluster
> > Where can I extract these parameters from?
>
> These parameters should be passed automatically when the process is run
> (note the use of $* in the example script):
> https://github.com/apache/beam/blob/fbc84b61240a3d83d9c19f7ccc17ba22e5d7e2c9/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessEnvironmentFactory.java#L115-L121
>
> > Also, how spark executor can find the port that grpc server is running
> on?
> Not sure which grpc server you mean here.
>
> On Wed, Nov 6, 2019 at 3:32 PM Matthew K.  wrote:
>
>> Thanks, still I need to pass parameters to the boot executable, such as,
>> worker id, control endpoint, logging endpoint, etc.
>>
>> Where can I extract these parameters from? (In apache_beam Python code,
>> those can be extracted from StartWorker request parameters)
>>
>> Also, how spark executor can find the port that grpc server is running on?
>>
>> *Sent:* Wednesday, November 06, 2019 at 5:07 PM
>> *From:* "Kyle Weaver" 
>> *To:* dev 
>> *Subject:* Re: Command for Beam worker on Spark cluster
>> In Docker mode, most everything's taken care of for you, but in process
>> mode you have to do a lot of setup yourself. The command you're looking for
>> is `sdks/python/container/build/target/launcher/linux_amd64/boot`. You will
>> be required to have both that executable (which you can build from source
>> using `./gradlew :sdks:python:container:build`) and a Python installation
>> including Beam and other dependencies on all of your worker machines.
>>
>> The best example I know of is here:
>> https://github.com/apache/beam/blob/cbf8a900819c52940a0edd90f59bf6aec55c817a/sdks/python/test-suites/portable/py2/build.gradle#L146-L165
>>
>> On Wed, Nov 6, 2019 at 2:24 PM Matthew K.  wrote:
>>
>>> Hi all,
>>>
>>> I am trying to run *Python* beam pipeline on a Spark cluster. Since
>>> workers are running on separate nodes, I am using "PROCESS" for
>>> "evironment_type" in pipeline options, but I couldn't find any
>>> documentation on what "command" I should pass to "environment_config"
>>> to run on the worker, so executor can be able to communicate with.
>>>
>>> Can someone help me on that?
>>>
>>


Re: [Discuss] Beam mascot

2019-11-12 Thread Robert Bradshaw
On Tue, Nov 12, 2019 at 1:29 PM Aizhamal Nurmamat kyzy
 wrote:
> 52 and 37 for me. I don't know what 53 is, but I like it too.

Same. What about 37 with the eyes from 52?

> On Tue, Nov 12, 2019 at 9:19 AM Maximilian Michels  wrote:
>>
>> More logos :D
>>
>> (35) - (37), (51), (48), (53) go into the direction of cuttlefish.
>>
>>  From the new ones I like (52) because of the eyes. (53) If we want to
>> move into the direction of a water animal, the small ones are quite
>> recognizable. Also, (23) and (36) are kinda cute.
>>
>> Cheers,
>> Max
>>
>> On 12.11.19 02:09, Robert Bradshaw wrote:
>> > Cuttlefish are cool, but I don't know how recognizable they are, and
>> > they don't scream "fast" or "stream-y" or "parallel processing" to me
>> > (not that that's a requirement...) I like that firefly, nice working
>> > the logo into the trailing beam of light.
>> >
>> > On Mon, Nov 11, 2019 at 5:03 PM Udi Meiri  wrote:
>> >>
>> >> Dumbo octopus anyone? https://youtu.be/DmqikqvLLLw?t=263
>> >>
>> >>
>> >> On Mon, Nov 11, 2019 at 2:06 PM Luke Cwik  wrote:
>> >>>
>> >>> The real answer, what cool schwag can we get based upon the mascot.
>> >>>
>> >>> On Mon, Nov 11, 2019 at 2:04 PM Kenneth Knowles  wrote:
>> 
>>  I'm with Luke on cuttlefish. We can have color changing schwag...
>> 
>>  On Mon, Nov 11, 2019 at 9:57 AM David Cavazos  
>>  wrote:
>> >
>> > I like 9 as well. Not related to anything, but chinchillas are also 
>> > cute.
>> >
>> > On Mon, Nov 11, 2019 at 8:25 AM Luke Cwik  wrote:
>> >>
>> >> 9 and 7 for me (in that order)
>> >>
>> >> On Mon, Nov 11, 2019 at 7:18 AM Maximilian Michels  
>> >> wrote:
>> >>>
>> >>> Here are some sketches from the designer. I've put them all in one 
>> >>> image
>> >>> and added labels to make it easier to refer to them. My favorites are
>> >>> (2) and (9).
>> >>>
>> >>> Cheers,
>> >>> Max
>> >>>
>> >>> On 09.11.19 19:43, Maximilian Michels wrote:
>>  I like that sketch! The designer has also sent me some rough 
>>  sketches,
>>  I'll share these here when I get consent from the designer.
>> 
>>  -Max
>> 
>>  On 09.11.19 19:22, Alex Van Boxel wrote:
>> > +1 for a FireFly. Ok, I can't draw, but it's to make a point ;-)
>> >
>> > Fire2.jpg
>> >
>> >
>> >
>> >_/
>> > _/ Alex Van Boxel
>> >
>> >
>> > On Sat, Nov 9, 2019 at 12:26 AM Kyle Weaver > > > wrote:
>> >
>> >  Re fish: The authors of the Streaming Systems went with 
>> > trout, but
>> >  the book mentioned a missed opportunity to make their cover a 
>> > "robot
>> >  dinosaur with a Scottish accent." Perhaps that idea is worth
>> > revisiting?
>> >
>> >  On Fri, Nov 8, 2019 at 3:20 PM Luke Cwik > >  > wrote:
>> >
>> >  My top suggestion is a cuttlefish.
>> >
>> >  On Thu, Nov 7, 2019 at 10:28 PM Reza Rokni 
>> > > >  > wrote:
>> >
>> >  Salmon... they love streams? :-)
>> >
>> >  On Fri, 8 Nov 2019 at 12:00, Kenneth Knowles
>> >  mailto:k...@apache.org>> wrote:
>> >
>> >  Agree with Aizhamal that it doesn't matter if 
>> > they are
>> >  taken if they are not too close in space to Beam: 
>> > Apache
>> >  projects, big data, log processing, stream 
>> > processing.
>> >  Not a legal opinion, but an aesthetic opinion. So 
>> > I
>> >  would keep Lemur as a possibility. Definitely 
>> > nginx is
>> >  far away from Beam so it seems OK as long as the 
>> > art is
>> >  different.
>> >
>> >  Also FWIW there are many kinds of Lemurs, and also
>> >  related Tarsier, of the only uncontroversial and
>> >  non-extinct infraorder within suborder 
>> > Strepsirrhini. I
>> >  think there's enough room for another mascot with 
>> > big
>> >  glowing eyes :-). The difference in the 
>> > designer's art
>> >  will be more significant than the taxonomy.
>> >
>> >  Kenn
>> >
>> >  On Tue, Nov 5, 2019 at 4:37 PM Aizhamal Nurmamat 
>> > kyzy
>> >  > > > wrote:
>> >
>> >  Aww.. that Hoover beaver is cute. But then 

Re: [Discuss] Beam mascot

2019-11-12 Thread Aizhamal Nurmamat kyzy
52 and 37 for me. I don't know what 53 is, but I like it too.


On Tue, Nov 12, 2019 at 9:19 AM Maximilian Michels  wrote:

> More logos :D
>
> (35) - (37), (51), (48), (53) go into the direction of cuttlefish.
>
>  From the new ones I like (52) because of the eyes. (53) If we want to
> move into the direction of a water animal, the small ones are quite
> recognizable. Also, (23) and (36) are kinda cute.
>
> Cheers,
> Max
>
> On 12.11.19 02:09, Robert Bradshaw wrote:
> > Cuttlefish are cool, but I don't know how recognizable they are, and
> > they don't scream "fast" or "stream-y" or "parallel processing" to me
> > (not that that's a requirement...) I like that firefly, nice working
> > the logo into the trailing beam of light.
> >
> > On Mon, Nov 11, 2019 at 5:03 PM Udi Meiri  wrote:
> >>
> >> Dumbo octopus anyone? https://youtu.be/DmqikqvLLLw?t=263
> >>
> >>
> >> On Mon, Nov 11, 2019 at 2:06 PM Luke Cwik  wrote:
> >>>
> >>> The real answer, what cool schwag can we get based upon the mascot.
> >>>
> >>> On Mon, Nov 11, 2019 at 2:04 PM Kenneth Knowles 
> wrote:
> 
>  I'm with Luke on cuttlefish. We can have color changing schwag...
> 
>  On Mon, Nov 11, 2019 at 9:57 AM David Cavazos 
> wrote:
> >
> > I like 9 as well. Not related to anything, but chinchillas are also
> cute.
> >
> > On Mon, Nov 11, 2019 at 8:25 AM Luke Cwik  wrote:
> >>
> >> 9 and 7 for me (in that order)
> >>
> >> On Mon, Nov 11, 2019 at 7:18 AM Maximilian Michels 
> wrote:
> >>>
> >>> Here are some sketches from the designer. I've put them all in one
> image
> >>> and added labels to make it easier to refer to them. My favorites
> are
> >>> (2) and (9).
> >>>
> >>> Cheers,
> >>> Max
> >>>
> >>> On 09.11.19 19:43, Maximilian Michels wrote:
>  I like that sketch! The designer has also sent me some rough
> sketches,
>  I'll share these here when I get consent from the designer.
> 
>  -Max
> 
>  On 09.11.19 19:22, Alex Van Boxel wrote:
> > +1 for a FireFly. Ok, I can't draw, but it's to make a point ;-)
> >
> > Fire2.jpg
> >
> >
> >
> >_/
> > _/ Alex Van Boxel
> >
> >
> > On Sat, Nov 9, 2019 at 12:26 AM Kyle Weaver  > > wrote:
> >
> >  Re fish: The authors of the Streaming Systems went with
> trout, but
> >  the book mentioned a missed opportunity to make their cover
> a "robot
> >  dinosaur with a Scottish accent." Perhaps that idea is worth
> > revisiting?
> >
> >  On Fri, Nov 8, 2019 at 3:20 PM Luke Cwik  >  > wrote:
> >
> >  My top suggestion is a cuttlefish.
> >
> >  On Thu, Nov 7, 2019 at 10:28 PM Reza Rokni <
> r...@google.com
> >  > wrote:
> >
> >  Salmon... they love streams? :-)
> >
> >  On Fri, 8 Nov 2019 at 12:00, Kenneth Knowles
> >  mailto:k...@apache.org>> wrote:
> >
> >  Agree with Aizhamal that it doesn't matter if
> they are
> >  taken if they are not too close in space to
> Beam: Apache
> >  projects, big data, log processing, stream
> processing.
> >  Not a legal opinion, but an aesthetic opinion.
> So I
> >  would keep Lemur as a possibility. Definitely
> nginx is
> >  far away from Beam so it seems OK as long as
> the art is
> >  different.
> >
> >  Also FWIW there are many kinds of Lemurs, and
> also
> >  related Tarsier, of the only uncontroversial and
> >  non-extinct infraorder within suborder
> Strepsirrhini. I
> >  think there's enough room for another mascot
> with big
> >  glowing eyes :-). The difference in the
> designer's art
> >  will be more significant than the taxonomy.
> >
> >  Kenn
> >
> >  On Tue, Nov 5, 2019 at 4:37 PM Aizhamal
> Nurmamat kyzy
> >   aizha...@apache.org>> wrote:
> >
> >  Aww.. that Hoover beaver is cute. But then
> lemur is
> >  also "taken" [1] and the owl too [2].
> >
> >  Personally, I don't think it matters much
> which
> >  mascots are taken, as long as the project
> is not too
> >  close in the same space as Beam. Also, it's
> good to
> >  just get all ideas out. We 

Re: contributor permission for Beam Jira tickets: suztomo

2019-11-12 Thread Pablo Estrada
Hi Tomo!
I've added you as contributor. Welcome!
Best
-P.

On Tue, Nov 12, 2019 at 11:51 AM Tomo Suzuki  wrote:

> Hi Beam Devs,
>
> This is Tomo from Google New York. I'd like to contribute to Beam Java
> dependencies upgrade. Can someone add me as a contributor for Beam's JIRA
> issue tracker?
>
> GitHub account: suztomo
> Apache JIRA username: suztomo
>
> --
> Regards,
> Tomo
>


Re: Test Failure: GcpOptionsTest$CommonTests. testDefaultGcpTempLocationDoesNotExist

2019-11-12 Thread Kyle Weaver
Hi Tomo, thanks for reporting.

This test passes on my machine and on Jenkins. I'm guessing this test is
assuming something about the host's gcloud settings, and is overfitting as
a result. Probably we should mock something so that the test doesn't
actually need to call gcloud.

I have created a JIRA issue for this and assigned to myself:
https://issues.apache.org/jira/browse/BEAM-8628

Kyle
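[A sketch of the mocking approach Kyle suggests, written in Python with `unittest.mock` for brevity; the actual fix would stub `GcsUtil` in the Java test with Mockito, and every name below is hypothetical, not the real Beam API.]

```python
from unittest import mock


class FakeGcsUtil:
    """Stand-in for a GCS utility whose bucket check hits the network."""

    def bucket_accessible(self, path):
        # In a flaky environment this raises IOError instead of returning
        # False -- exactly the problem Tomo observed.
        raise IOError("network call we want to avoid in unit tests")


def default_gcp_temp_location(gcs_util, temp_location):
    # Mirrors the behavior the test asserts on: an invalid path should
    # produce a deterministic error message.
    if not gcs_util.bucket_accessible(temp_location):
        raise ValueError(
            "tempLocation is not a valid GCS path, %s" % temp_location)
    return temp_location


util = FakeGcsUtil()
# Patch the bucket check so the test never calls gcloud/GCS.
with mock.patch.object(util, "bucket_accessible", return_value=False):
    try:
        default_gcp_temp_location(util, "gs://does/not/exist")
    except ValueError as e:
        print(e)  # -> tempLocation is not a valid GCS path, gs://does/not/exist
```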

On Tue, Nov 12, 2019 at 7:41 AM Tomo Suzuki  wrote:

> Hi Beam developers,
>
> I'm trying to build Apache Beam from the source. But GcpOptionsTest fails
> (error below).
>
> Did anybody solve this problem already?
>
> I'm using master (c2e58c55)
>
> suztomo@suxtomo24:~/beam4$ ./gradlew -p sdks/java check
> ...
> FAILURE: Build failed with an exception.
>
> * What went wrong:
> Execution failed for task
> ':sdks:java:extensions:google-cloud-platform-core:test'.
> > There were failing tests. See the report at:
> file:///usr/local/google/home/suztomo/beam4/sdks/java/extensions/google-cloud-platform-core/build/reports/tests/test/index.html
>
> The HTML file shows the following stacktrace:
>
> java.lang.AssertionError:
> Expected: (an instance of java.lang.IllegalArgumentException and exception
> with message a string containing "Error constructing default value for
> gcpTempLocation: tempLocation is not a valid GCS path" and exception with
> cause exception with message a string containing "Output path does not
> exist or is not writeable")
>  but: exception with cause exception with message a string containing
> "Output path does not exist or is not writeable" cause message was "Unable
> to verify that GCS bucket gs://does exists."
> Stacktrace was: java.lang.IllegalArgumentException: Error constructing
> default value for gcpTempLocation: tempLocation is not a valid GCS path,
> gs://does/not/exist.
>
>
> The problem (at the surface) is, in my development
> environment, gcpOptions.getGcsUtil().bucketAccessible(gcsPath) throws
> IOException rather than returning false. I want to know whether this is
> specific to my environment or not.
>
> I'm using
> - 1.8.0_181-google-v7
> - x86_64 Debian GNU/Linux (Google buid)
>
> --
> Regards,
> Tomo
>


contributor permission for Beam Jira tickets: suztomo

2019-11-12 Thread Tomo Suzuki
Hi Beam Devs,

This is Tomo from Google New York. I'd like to contribute to Beam Java
dependencies upgrade. Can someone add me as a contributor for Beam's JIRA
issue tracker?

GitHub account: suztomo
Apache JIRA username: suztomo

-- 
Regards,
Tomo


Re: Completeness of Beam Java Dependency Check Report

2019-11-12 Thread Yifan Zou
Thanks Tomo. I'll follow up in JIRA.

On Tue, Nov 12, 2019 at 9:44 AM Tomo Suzuki  wrote:

> Yifan,
> I created a ticket to track this finding:
> https://issues.apache.org/jira/browse/BEAM-8621 .
>
>
> On Mon, Nov 11, 2019 at 5:08 PM Tomo Suzuki  wrote:
>
>> Kenn,
>>
>> Thank you for the analysis. Although Guava was randomly picked up, it's
>> great learning for me to learn how you analyzed other modules using Guava.
>>
>> On Mon, Nov 11, 2019 at 4:29 PM Kenneth Knowles  wrote:
>>
>>> BeamModulePlugin just contains lists of versions to ease coordination
>>> across Beam modules, but mostly does not create dependencies. Most of
>>> Beam's modules only depend on a few things there. For example Guava is not
>>> a core dependency, but here is where it is actually depended upon:
>>>
>>> $ find . -name build.gradle | xargs grep library.java.guava
>>> ./sdks/java/core/build.gradle:  shadowTest library.java.guava_testlib
>>> ./sdks/java/extensions/sql/jdbc/build.gradle:  compile library.java.guava
>>> ./sdks/java/io/google-cloud-platform/build.gradle:  compile
>>> library.java.guava
>>> ./sdks/java/io/kinesis/build.gradle:  testCompile
>>> library.java.guava_testlib
>>>
>>> These results appear to be misleading. Grepping for 'import
>>> com.google.common', I see this as the actual state of things:
>>>
>>>  - GCP connector does not appear to actually depend on Guava in compile
>>> scope
>>>  - The Beam SQL JDBC driver does not appear to actually depend on Guava
>>> in compile scope
>>>  - The Dataflow Java worker does depend on Guava at compile scope but
>>> has incorrect dependencies (and it probably shouldn't)
>>>  - KinesisIO does depend on Guava at compile scope but has incorrect
>>> dependencies (Kinesis libs have Guava on API surface so it is OK here, but
>>> should be correctly declared)
>>>  - ZetaSQL translator does depend on Guava at compile scope but has
>>> incorrect dependencies (ZetaSQL has it on API surface so it is OK here, but
>>> should be correctly declared)
>>>
>>> We used to have an analysis that prevented this class of error.
>>>
>>> Once the errors are fixed, the guava_version is simply a version that we
>>> have discovered that seems to work for both Kinesis and ZetaSQL, libraries
>>> we do not control. Kinesis producer is built against 18.0. Kinesis client
>>> against 26.0-jre. ZetaSQL against 26.0-android.
>>>
>>> (or maybe I messed up in my analysis)
>>>
>>> Kenn
>>>
>>> On Mon, Nov 11, 2019 at 12:07 PM Tomo Suzuki  wrote:
>>>

 Chamikara and Yifan,
 Thank you for the responses! Looking forward to hearing the
 investigation result.
 In the meantime, I'll explore .test-infra/jenkins/dependency_check
 directory.


>>
>> --
>> Regards,
>> Tomo
>>
>
>
> --
> Regards,
> Tomo
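[The two checks Kenn describes above -- Gradle declarations versus actual imports -- can be sketched as below. The synthetic file tree exists only to make the example self-contained; in a real Beam checkout you would point `root` at the repository root instead.]

```python
import pathlib
import re
import tempfile

# Synthetic stand-in for a Beam checkout (illustrative only).
root = pathlib.Path(tempfile.mkdtemp())
gradle = root / "sdks" / "java" / "io" / "kinesis" / "build.gradle"
gradle.parent.mkdir(parents=True)
gradle.write_text(
    "dependencies {\n  testCompile library.java.guava_testlib\n}\n")

# Check 1: which modules declare a Guava dependency in Gradle?
declared = [p for p in root.rglob("build.gradle")
            if "library.java.guava" in p.read_text()]

# Check 2: which modules actually import Guava in Java sources?
used = [p for p in root.rglob("*.java")
        if re.search(r"^import com\.google\.common", p.read_text(), re.M)]

# A module present in `declared` but absent from `used` declares a
# dependency it may not actually need -- the mismatch Kenn found.
print(len(declared), len(used))  # -> 1 0
```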
>


Re: Behavior of TimestampCombiner?

2019-11-12 Thread Ruoyun Huang
Thanks for confirming.

Since it is unexpected behavior, I shall check JIRA to see whether it is
already on the radar; if not, I will create a ticket.


On Mon, Nov 11, 2019 at 6:11 PM Robert Bradshaw  wrote:

> The END_OF_WINDOW is indeed 9.99 (or, in Java, 9.999000), but the
> results for LATEST and EARLIEST should be 9 and 0 respectively.
>
> On Mon, Nov 11, 2019 at 5:34 PM Ruoyun Huang  wrote:
> >
> > Hi, Folks,
> >
> > I am trying to understand the behavior of TimestampCombiner. I have
> a test like this:
> >
> > class TimestampCombinerTest(unittest.TestCase):
> >
> >   def test_combiner_latest(self):
> > """Test TimestampCombiner with LATEST."""
> > options = PipelineOptions()
> > options.view_as(StandardOptions).streaming = True
> > p = TestPipeline(options=options)
> >
> > main_stream = (p
> >| 'main TestStream' >> TestStream()
> >.add_elements([window.TimestampedValue(('k', 100),
> 0)])
> >.add_elements([window.TimestampedValue(('k', 400),
> 9)])
> >.advance_watermark_to_infinity()
> >| 'main windowInto' >> beam.WindowInto(
> >   window.FixedWindows(10),
> >   timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST)
> >| 'Combine' >> beam.CombinePerKey(sum))
> >
> > class RecordFn(beam.DoFn):
> >   def process(self,
> >   elm=beam.DoFn.ElementParam,
> >   ts=beam.DoFn.TimestampParam):
> > yield (elm, ts)
> >
> > records = (main_stream | beam.ParDo(RecordFn()))
> >
> > expected_window_to_elements = {
> > window.IntervalWindow(0, 10): [
> > (('k', 500),  Timestamp(9)),
> > ],
> > }
> >
> > assert_that(
> > records,
> > equal_to_per_window(expected_window_to_elements),
> > use_global_window=False,
> > label='assert per window')
> >
> > p.run()
> >
> >
> > I expect the result to be following (based on various TimestampCombiner
> strategy):
> > LATEST:(('k', 500), Timestamp(9)),
> > EARLIEST:(('k', 500), Timestamp(0)),
> > END_OF_WINDOW: (('k', 500), Timestamp(10)),
> >
> > The above outcome is partially confirmed by the Java-side test: [1]
> >
> >
> > However, from beam python, the outcome is like this:
> > LATEST:(('k', 500), Timestamp(10)),
> > EARLIEST:(('k', 500), Timestamp(10)),
> > END_OF_WINDOW: (('k', 500), Timestamp(9.)),
> >
> > What did I miss? What should the expected behavior be? Or does this
> look like a bug?
> >
> > [1]:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/GroupByKeyTest.java#L390
> >
> > Cheers,
> >
>
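
For reference, the expected semantics (matching the numbers Robert gives
above) can be sketched in plain Python. This is an illustration only; the
function below is hypothetical and not part of the Beam API:

```python
# Standalone sketch (plain Python, not the Beam API) of the TimestampCombiner
# semantics under discussion, for element timestamps 0 and 9 falling into
# FixedWindows(10), i.e. the window [0, 10). All names are hypothetical.
def combined_output_timestamp(element_timestamps, window_end, strategy):
    """Return the output timestamp the combined value should carry."""
    if strategy == 'EARLIEST':
        return min(element_timestamps)
    if strategy == 'LATEST':
        return max(element_timestamps)
    if strategy == 'END_OF_WINDOW':
        # Beam represents "end of window" as the window's maximum timestamp,
        # just before the exclusive end (granularity differs per SDK:
        # millis in Java, micros in Python).
        return window_end - 0.000001
    raise ValueError(strategy)

timestamps = [0, 9]  # from TimestampedValue(('k', 100), 0) and (('k', 400), 9)
print(combined_output_timestamp(timestamps, 10, 'EARLIEST'))  # 0
print(combined_output_timestamp(timestamps, 10, 'LATEST'))    # 9
print(combined_output_timestamp(timestamps, 10, 'END_OF_WINDOW'))
```

Under this reading, Ruoyun's observed Python results (LATEST and EARLIEST
both reporting the end of the window) would indeed be a bug.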


-- 

Ruoyun  Huang


Re: Completeness of Beam Java Dependency Check Report

2019-11-12 Thread Tomo Suzuki
Yifan,
I created a ticket to track this finding:
https://issues.apache.org/jira/browse/BEAM-8621 .


On Mon, Nov 11, 2019 at 5:08 PM Tomo Suzuki  wrote:

> Kenn,
>
> Thank you for the analysis. Although Guava was randomly picked up, it's
> great learning for me to learn how you analyzed other modules using Guava.
>
> On Mon, Nov 11, 2019 at 4:29 PM Kenneth Knowles  wrote:
>
>> BeamModulePlugin just contains lists of versions to ease coordination
>> across Beam modules, but mostly does not create dependencies. Most of
>> Beam's modules only depend on a few things there. For example Guava is not
>> a core dependency, but here is where it is actually depended upon:
>>
>> $ find . -name build.gradle | xargs grep library.java.guava
>> ./sdks/java/core/build.gradle:  shadowTest library.java.guava_testlib
>> ./sdks/java/extensions/sql/jdbc/build.gradle:  compile library.java.guava
>> ./sdks/java/io/google-cloud-platform/build.gradle:  compile
>> library.java.guava
>> ./sdks/java/io/kinesis/build.gradle:  testCompile
>> library.java.guava_testlib
>>
>> These results appear to be misleading. Grepping for 'import
>> com.google.common', I see this as the actual state of things:
>>
>>  - GCP connector does not appear to actually depend on Guava in compile
>> scope
>>  - The Beam SQL JDBC driver does not appear to actually depend on Guava
>> in compile scope
>>  - The Dataflow Java worker does depend on Guava at compile scope but has
>> incorrect dependencies (and it probably shouldn't)
>>  - KinesisIO does depend on Guava at compile scope but has incorrect
>> dependencies (Kinesis libs have Guava on API surface so it is OK here, but
>> should be correctly declared)
>>  - ZetaSQL translator does depend on Guava at compile scope but has
>> incorrect dependencies (ZetaSQL has it on API surface so it is OK here, but
>> should be correctly declared)
>>
>> We used to have an analysis that prevented this class of error.
>>
>> Once the errors are fixed, the guava_version is simply a version that we
>> have discovered that seems to work for both Kinesis and ZetaSQL, libraries
>> we do not control. Kinesis producer is built against 18.0. Kinesis client
>> against 26.0-jre. ZetaSQL against 26.0-android.
>>
>> (or maybe I messed up in my analysis)
>>
>> Kenn
>>
>> On Mon, Nov 11, 2019 at 12:07 PM Tomo Suzuki  wrote:
>>
>>>
>>> Chamikara and Yifan,
>>> Thank you for the responses! Looking forward to hearing the
>>> investigation result.
>>> In the meantime, I'll explore .test-infra/jenkins/dependency_check
>>> directory.
>>>
>>>
>
> --
> Regards,
> Tomo
>


-- 
Regards,
Tomo


Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-12 Thread Alexey Romanenko
On 8 Nov 2019, at 11:32, Maximilian Michels  wrote:
> 
> The dates sound good to me. I agree that the bay area has an advantage 
> because of its large tech community. On the other hand, it is a question of 
> how we run the event. For Berlin we managed to get about 200 attendees, but 
> for the Beam Summit in Las Vegas with ApacheCon the attendance was much 
> lower.

I agree with your point, Max, and I believe it would be more efficient to 
run Beam Summit as a "standalone" event (as was done in London and Berlin), 
which will allow us to attract a mostly Beam-oriented/interested/focused 
audience, compared to running it as part of ApacheCon or another large 
conference where there are many other topics and tracks.

> Should this also be discussed on the user mailing list?

Definitely! Although users' opinion is a key point here, it will not be 
easy to get unbiased statistics on this question.

The time frames are also very important, since holidays in different countries 
(for example, August is traditionally a "vacation month" in France and some 
other European countries) can affect people's availability and influence the 
final number of participants.

> 
> Cheers,
> Max
> 
> On 07.11.19 22:50, Alex Van Boxel wrote:
>> Date-wise, I'm wondering why we should switch the Europe and NA ones, 
>> since this would mean that the Berlin summit and the new EU summit would 
>> be almost 1.5 years apart.
>>  _/
>> _/ Alex Van Boxel
>> On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay > > wrote:
>>I prefer the bay area for the NA summit. My reasoning is that there is a
>>critical mass of contributors and users in that location, probably
>>more than in alternative NA locations. I was not involved with planning
>>recently and I do not know if there were people who could not attend
>>previously due to location. If that is the case, I agree with Elliotte
>>on looking for other options.
>>Related to dates: March (Asia) and mid-May (NA) dates are a bit
>>close. Mid-June for NA might be better to spread events. Other
>>pieces looks good.
>>Ahmet
>>On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold
>>mailto:elh...@ibiblio.org>> wrote:
>>The U.S. sadly is not a reliable destination for international
>>conferences these days. Almost every conference I go to, big and
>>small, has at least one speaker, sometimes more, who can't get into
>>the country. Canada seems worth considering. Vancouver,
>>Montreal, and
>>Toronto are all convenient.
>>On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas >> wrote:
>> >
>> > Hi Beam Community!
>> >
>> > I'd like to kick off a thread to discuss potential dates and
>>venues for the 2020 Beam Summits.
>> >
>> > I did some research on industry conferences happening in 2020
>>and pre-selected a few ranges as follows:
>> >
>> > (2 days) NA between mid-May and mid-June
>> > (2 days) EU mid October
>> > (1 day) Asia Mini Summit:  March
>> >
>> > I'd like to hear your thoughts on these dates and get
>>consensus on exact dates as the convo progresses.
>> >
>> > For locations these are the options I reviewed:
>> >
>> > NA: Austin Texas, Berkeley California, Mexico City.
>> > Europe: Warsaw, Barcelona, Paris
>> > Asia: Singapore
>> >
>> > Let the discussion begin!
>> > G (on behalf of the Beam Summit Steering Committee)
>> >
>> >
>> >
>>-- Elliotte Rusty Harold
>>elh...@ibiblio.org 



Re: Is there good way to make Python SDK docs draft accessible?

2019-11-12 Thread Yoshiki Obata
Sorry for the late reply.

I've checked the release process and found that the following workflow
would be a good way to get the docs/scripts reviewed and merged:

1. Create a PR for apache/beam-site with the docs generated by the scripts
(hereinafter PR1). PR1 is intended only for reviewing the docs, so it must
not be merged.
2. Create a PR for apache/beam with the modifications to the
docs-generating scripts (hereinafter PR2).
3. Once PR1 and PR2 are reviewed and confirmed to be OK, close PR1.
4. Merge PR2.

Please let me know if there is a problem.

2019年11月7日(木) 9:17 Valentyn Tymofieiev :


>
> Hi Yoshiki,
>
> Were you able to find the information you need to regenerate the 
> documentation?
>
> Thanks,
> Valentyn
>
> On Tue, Oct 29, 2019 at 8:01 AM Yoshiki Obata  wrote:
>>
>> Thank you for advising, Udi and Ahmet.
>> I'll take a look at the release process.
>>
>> 2019年10月29日(火) 3:47 Ahmet Altay :
>> >
>> > Thank you for doing this. It should be possible to run tox as Udi 
>> > suggested and create a PR for review purposes similar to the release 
>> > process (see: 
>> > https://beam.apache.org/contribute/release-guide/#build-the-pydoc-api-reference)
>> >
>> > /cc +Valentyn Tymofieiev -- This is likely a required item before retiring 
>> > python 2 support.
>> >
>> > Ahmet
>> >
>> > On Mon, Oct 28, 2019 at 11:21 AM Udi Meiri  wrote:
>> >>
>> >> I believe that generating pydoc for the website is still a manual process 
>> >> (unlike the rest of the website?).
>> >> The reviewer will need to manually generate the docs (checkout the PR, 
>> >> run tox -e docs).
>> >>
>> >> On Mon, Oct 28, 2019 at 10:55 AM Yoshiki Obata  
>> >> wrote:
>> >>>
>> >>> Hi all.
>> >>>
>> >>> I'm working on enabling generation of the Python SDK docs with
>> >>> Python 3 [1]. I have modified the scripts, and the generated docs
>> >>> now need to be reviewed by someone.
>> >>>
>> >>> But there seems to be no existing way to upload the generated docs
>> >>> somewhere anyone can access them, unlike the website HTML, which can
>> >>> be uploaded to GCS via a Jenkins job.
>> >>> Would anyone know a good way to make the generated docs accessible
>> >>> to reviewers?
>> >>>
>> >>> [1] https://issues.apache.org/jira/browse/BEAM-7847
>> >>>
>> >>>
>> >>> Best regards,
>> >>> Yoshiki
>> >>>
>> >>>
>> >>> --
>> >>> Yoshiki Obata
>> >>> mail: yoshiki.ob...@gmail.com
>> >>> gh: https://github.com/lazylynx


Test Failure: GcpOptionsTest$CommonTests. testDefaultGcpTempLocationDoesNotExist

2019-11-12 Thread Tomo Suzuki
Hi Beam developers,

I'm trying to build Apache Beam from source, but GcpOptionsTest fails
(error below).

Did anybody solve this problem already?

I'm using master (c2e58c55)

suztomo@suxtomo24:~/beam4$ ./gradlew -p sdks/java check
...
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task
':sdks:java:extensions:google-cloud-platform-core:test'.
> There were failing tests. See the report at:
file:///usr/local/google/home/suztomo/beam4/sdks/java/extensions/google-cloud-platform-core/build/reports/tests/test/index.html

The HTML file shows the following stacktrace:

java.lang.AssertionError:
Expected: (an instance of java.lang.IllegalArgumentException and exception
with message a string containing "Error constructing default value for
gcpTempLocation: tempLocation is not a valid GCS path" and exception with
cause exception with message a string containing "Output path does not
exist or is not writeable")
 but: exception with cause exception with message a string containing
"Output path does not exist or is not writeable" cause message was "Unable
to verify that GCS bucket gs://does exists."
Stacktrace was: java.lang.IllegalArgumentException: Error constructing
default value for gcpTempLocation: tempLocation is not a valid GCS path,
gs://does/not/exist.


The surface-level problem is that, in my development
environment, gcpOptions.getGcsUtil().bucketAccessible(gcsPath) throws an
IOException rather than returning false. I want to know whether this is
specific to my environment.

I'm using
- 1.8.0_181-google-v7
- x86_64 Debian GNU/Linux (Google build)

-- 
Regards,
Tomo


Re: 10,000 Pull Requests

2019-11-12 Thread jincheng sun
Congratulations to the Beam community! Amazing numbers; a very active community!

Best,
Jincheng


Maximilian Michels  于2019年11月8日周五 上午1:39写道:

> Yes! Keep the committer pipeline filled ;)
>
> Reviewing PRs probably remains one of the toughest problems in active
> open-source projects.
>
> On 07.11.19 18:28, Luke Cwik wrote:
> > We need more committers...
> > that review the code.
> >
> > On Wed, Nov 6, 2019 at 6:21 PM Pablo Estrada  > > wrote:
> >
> > iiipe : )
> >
> > On Thu, Nov 7, 2019 at 12:59 AM Kenneth Knowles  > > wrote:
> >
> > Awesome!
> >
> > Number of days between PR #1 and PR #1000: 211
> > Number of days between PR #9000 and PR #10000: 71
> >
> > Kenn
> >
> > On Wed, Nov 6, 2019 at 6:28 AM Łukasz Gajowy  > > wrote:
> >
> > Yay! Nice! :)
> >
> > śr., 6 lis 2019 o 14:38 Maximilian Michels  > > napisał(a):
> >
> > Just wanted to point out, we have crossed the 10,000 PRs
> > mark :)
> >
> > ...and the winner is:
> > https://github.com/apache/beam/pull/1
> >
> > Seriously, I think Beam's culture to promote PRs over
> > direct access to
> > the repository is remarkable. To another 10,000 PRs!
> >
> > Cheers,
> > Max
> >
>


On processing event streams

2019-11-12 Thread Jan Lukavský

Hi,

this is a follow-up to multiple threads covering the topic of how to 
process event streams in a unified way. Event streams share a common 
property: the ordering of events matters. The processing (usually) looks 
something like


  unordered stream -> buffer (per key) -> ordered stream -> stateful 
logic (DoFn)


This is perfectly fine and can be solved by the tools Beam currently offers 
(state & timers), but *only for the streaming case*. The batch case is 
essentially broken, because:


 a) out-of-orderness is essentially *unbounded* (as opposed to the input 
being bounded; strangely, that is not a contradiction), while out-of-orderness 
in the streaming case is *bounded*, because the watermark can fall behind 
only a limited amount of time (sooner or later, nobody would actually care 
about results from a streaming pipeline being months or years late, right?)


 b) with unbounded out-of-orderness, the spatial requirements of state 
grow as O(N) in the worst case, where N is the size of the whole input


 c) moreover, many runners require the state per key to fit in 
memory (Spark, Flink)


Now, the solutions to these problems seem to be:

 1) refine the model guarantees for batch stateful processing so that we 
limit the out-of-orderness (the source of the issues here) - the only 
reasonable way to do that is to enforce sorting before all stateful 
DoFns in the batch case (perhaps with an opt-out), or


 2) define a way to mark a stateful DoFn as requiring sorted input (e.g. 
@RequiresTimeSortedInput) - note this has to be done for both the batch and 
streaming cases, as opposed to 1), or


 3) define a different URN for an "ordered stateful DoFn", with a default 
expansion using state as a buffer (for both the batch and streaming cases) - 
that way it can be overridden in batch runners that would otherwise get 
into trouble (and this could be regarded as a natural extension of 
the current approach).
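
The state-as-buffer expansion mentioned in option 3 can be sketched in
plain Python. All names here are hypothetical; this is an illustration of
the buffering idea, not the Beam API:

```python
from collections import defaultdict

class TimeSortingBuffer:
    """Illustrative sketch of the buffering described above: hold elements
    per key, and release them in timestamp order once the watermark
    guarantees that no earlier element can still arrive."""

    def __init__(self):
        self.buffers = defaultdict(list)  # key -> [(timestamp, value), ...]

    def add(self, key, timestamp, value):
        self.buffers[key].append((timestamp, value))

    def on_watermark(self, watermark):
        """Emit, in timestamp order, everything at or before the watermark."""
        ordered = []
        for key, buf in self.buffers.items():
            buf.sort()
            ready = [e for e in buf if e[0] <= watermark]
            self.buffers[key] = [e for e in buf if e[0] > watermark]
            for ts, value in ready:
                ordered.append((key, ts, value))
        return sorted(ordered, key=lambda e: e[1])

buf = TimeSortingBuffer()
buf.add('k', 5, 'b')
buf.add('k', 1, 'a')
buf.add('k', 9, 'c')
# 'c' stays buffered until the watermark passes 9:
print(buf.on_watermark(6))  # [('k', 1, 'a'), ('k', 5, 'b')]
```

The batch problem described above is visible here: with unbounded
out-of-orderness the watermark provides no early release points, so the
buffer can grow to hold the entire input.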


I still think the best solution is 1), for multiple reasons, ranging from 
internal logical consistency to being practical and easy to implement (a 
few lines of code in Flink's case, for instance). On the other hand, if 
this is really not what we want to do, then I'd like to know the 
community's opinion on the other two options (or whether there is some 
other option I didn't cover).


Many thanks for your opinions and help with fixing what is (sort of) broken 
right now.


Jan



[CANCELLED] [VOTE] @RequiresTimeSortedInput stateful DoFn annotation

2019-11-12 Thread Jan Lukavský
I'm cancelling this due to lack of activity. I will start a follow-up 
thread to find a solution.


On 11/9/19 11:45 AM, Jan Lukavský wrote:

Hi,

I'll try to summarize the mailing list threads to clarify why I think 
this addition is needed (and actually necessary):


 a) there are situations where the order of input events matters 
(obviously, any finite state machine)


 b) in the streaming case, this can be handled by the current machinery 
(e.g. holding elements in state, sorting all elements with a timestamp 
less than the input watermark, dropping latecomers)


 c) in the batch case, this can be handled the same way, but

  i) due to the nature of batch processing, this has extreme 
requirements on the size of state needed to hold the elements 
(in the extreme, that might be the whole input, which might not 
be feasible)


  ii) although it is true that the watermark might (and will) fall behind 
in streaming processing as well, so that similar issues might arise 
there too, it is hardly imaginable that it would fall behind by as much as 
several years (which is absolutely natural in the batch case) - I'm 
talking about regular streaming processing, not kappa-like 
architectures, where this happens as well but causes trouble ([1])


  iii) given that some runners already use sort-merge 
groupings, sorting elements inside groups by timestamp is virtually 
free; the runner just has to know that it should do so
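
Point a) can be illustrated with a tiny finite state machine (a
hypothetical plain-Python example, not Beam code): the same multiset of
events yields different final states depending on their order.

```python
# A login/logout state machine: invalid transitions are ignored, so the
# final state depends on the order in which the events arrive.
def run_fsm(events):
    state = 'LOGGED_OUT'
    transitions = {
        ('LOGGED_OUT', 'login'): 'LOGGED_IN',
        ('LOGGED_IN', 'logout'): 'LOGGED_OUT',
    }
    for event in events:
        state = transitions.get((state, event), state)  # ignore invalid moves
    return state

print(run_fsm(['login', 'logout']))  # LOGGED_OUT
print(run_fsm(['logout', 'login']))  # LOGGED_IN
```

If a runner delivers these two events out of order, the stateful logic
produces a different (wrong) result, which is exactly why the sorting
guarantee matters.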


To keep this focused I don't want to go too far into details, but a 
runner knowing that it should sort by timestamp before a stateful 
ParDo enables additional features that are currently 
unavailable - e.g. shifting event time smoothly as elements 
flow through, rather than from -inf to +inf in one shot. That might have a 
positive effect on timers firing smoothly, and thus, for instance, on 
being able to free state that would otherwise have to be held until the 
end of the computation.


Therefore, I think it is essential for users to be able to tell the runner 
that a particular stateful ParDo depends on the order of input events, so 
that the runner can use the optimizations available in the batch case. The 
streaming case is mostly unaffected, because all the sorting 
can be handled the usual way.


Hope this helps clarify why it would be good to introduce some way 
to mark stateful ParDos as "time sorted".


Cheers,

 Jan

[1] 
https://www.ververica.com/resources/flink-forward-san-francisco-2019/moving-from-lambda-and-kappa-architectures-to-kappa-at-uber


Hope these thoughts help

On 11/8/19 11:35 AM, Jan Lukavský wrote:

Hi Max,

thanks for the comment. I probably should have put links to the discussion 
threads in this vote thread. The relevant ones would be:


 - (a pretty lengthy) discussion about whether sorting by timestamp 
should be part of the model - [1]


 - part of the discussion related to the annotation - [2]

Regarding the open questions in the design document - they are not 
meant to be open questions about the design of the annotation 
itself, and I'll remove them for now, as they are not (directly) related.


Now - the main reason for this vote is that there is actually no clear 
consensus in the ML thread. There are plenty of words like "should", 
"could", "would" and "maybe", so I wanted to be sure there is 
consensus to include this. I have already been running this in production 
for several months, so it is definitely useful for me. :-) But that might 
not be sufficient.


I'd be very happy to answer any more questions.

Thanks,

 Jan

[1] 
https://lists.apache.org/thread.html/4609a1bb1662690d67950e76d2f1108b51327b8feaf9580de659552e@%3Cdev.beam.apache.org%3E


[2] 
https://lists.apache.org/thread.html/dd9bec903102d9fcb4f390dc01513c0921eac1fedd8bcfdac630aaee@%3Cdev.beam.apache.org%3E


On 11/8/19 11:08 AM, Maximilian Michels wrote:

Hi Jan,

Disclaimer: I haven't followed the discussion closely, so I do not 
want to comment on the technical details of the feature here.


From the outside, it looks like there may be open questions. Also, 
we may need more motivation for what we can build with this feature 
or how it will become useful to users.


There are many threads in Beam and I believe we need to carefully 
prioritize the Beam feature set in order to focus on the things that 
provide the most value to our users.


Cheers,
Max

On 07.11.19 15:55, Jan Lukavský wrote:

Hi,
is there anything I can do to make this more attractive? :-) Any 
feedback would be much appreciated.

Many thanks,
  Jan

Dne 5. 11. 2019 14:10 napsal uživatel Jan Lukavský :

    Hi,

    I'd like to open a vote on accepting design document [1] as a 
base for

    implementation of @RequiresTimeSortedInput annotation for stateful
    DoFns. Associated JIRA [2] and PR [3] contains only subset of 
the whole
    functionality (allowed lateness ignored and no possibility to 
specify

    UDF for time - or sequential number - to be extracted from data).
    The PR
    will be subject to independent 

Re: Date/Time Ranges & Protobuf

2019-11-12 Thread Robert Bradshaw
I agree about it being a tagged union in the model (together with
actual_time(...) - epsilon). It's not just a performance hack though,
it's also (as discussed elsewhere) a question of being able to find an
embedding into existing datetime libraries. The real question here is
whether we should limit ourselves to just these ~10,000 years AD, or
find value in being able to process events for the lifetime of the
universe (or, at least, recorded human history). Artificially limiting
in this way would seem surprising to me at least.

On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles  wrote:
>
> The max timestamp, min timestamp, and end of the global window are all 
> performance hacks in my view. Timestamps in beam are really a tagged union:
>
> timestamp ::= min | max | end_of_global | actual_time(... some 
> quantitative timestamp ...)
>
> with the ordering
>
> min < actual_time(...) < end_of_global < max
>
> We chose arbitrary numbers so that we could do simple numeric comparisons and 
> arithmetic.
>
> Kenn
>
> On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik  wrote:
>>
>> While crites@ was investigating using protobuf to represent Apache Beam 
>> timestamps within the TestStreamEvents, he found out that the well known 
>> type google.protobuf.Timestamp doesn't support certain timestamps we were 
>> using in our tests (specifically the max timestamp that Apache Beam 
>> supports).
>>
>> This led me to investigate, and the well known type 
>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to 
>> 9999-12-31T23:59:59.999999999Z, which is much smaller than the timestamp 
>> range that Apache Beam currently supports, -9223372036854775ms to 
>> 9223372036854775ms, which is about 292277BC to 294247AD (it was difficult to 
>> find a time range that represented this).
>>
>> Similarly, google.protobuf.Duration represents any time range over those 
>> ~10,000 years. Google decided to limit the range to be compatible with the 
>> RFC 3339[1] standard, which does simplify many things since it guarantees 
>> that all RFC 3339 time parsing/manipulation libraries are supported.
>>
>> Should we:
>> A) define our own timestamp/duration types to be able to represent the full 
>> time range that Apache Beam can express?
>> B) limit the valid timestamps in Apache Beam to some standard such as RFC 
>> 3339?
>>
>> This discussion is somewhat related to the efforts to support nano 
>> timestamps[2].
>>
>> 1: https://tools.ietf.org/html/rfc3339
>> 2: 
>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
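
Kenn's tagged-union formulation above can be written out directly. The
following is a hypothetical Python sketch; Beam actually encodes the
sentinels as fixed numeric values precisely so that plain numeric
comparison and arithmetic work:

```python
from functools import total_ordering

# Tagged-union timestamp: min < actual_time(t) < end_of_global < max,
# with actual_time values ordered among themselves by t. Names are
# illustrative, not Beam's actual encoding.
@total_ordering
class Ts:
    RANKS = {'min': 0, 'actual': 1, 'end_of_global': 2, 'max': 3}

    def __init__(self, tag, value=None):
        self.tag, self.value = tag, value

    def _key(self):
        # actual_time compares by its quantitative value; the sentinels
        # compare by rank alone.
        return (self.RANKS[self.tag], self.value if self.tag == 'actual' else 0)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

MIN, MAX = Ts('min'), Ts('max')
END_OF_GLOBAL = Ts('end_of_global')
assert MIN < Ts('actual', -10**18) < Ts('actual', 10**18) < END_OF_GLOBAL < MAX
```

Collapsing this union onto a plain number line is the "performance hack":
it trades the explicit tag for arbitrary chosen sentinel values, which is
also what creates the range-mismatch problem with google.protobuf.Timestamp
discussed above.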