Re: Bay Area Apache Beam Kickoff!

2018-11-19 Thread Matthias Baetens
Awesome effort - good luck on the first meetup! :)

On Tue, 20 Nov 2018 at 01:37 Austin Bennett 
wrote:

> We have our first meetup scheduled for December 12th in San Francisco.
>
> Andrew Pilloud, a software engineer at Google and Beam committer, will
> demo the latest feature in Beam SQL: a standalone SQL shell. The talk covers
> why SQL is a good fit for streaming data processing, the technical details
> of the Beam SQL engine, and a peek into our future plans.
>
> Kenn Knowles, a founding PMC Member and incoming PMC Chair for the Apache
> Beam project, as well as a computer scientist and engineer at Google, will
> share about all things Beam: where it is, where it's been, where it's going.
>
> More info:
> https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/
>
> For those in/around town (or that can be) come join in the fun!
>
>
>
>
> --


Re: Portable wordcount on Flink runner broken

2018-11-19 Thread Thomas Weise
Try removing under sdks:

./go/vendor
./python/container/vendor
./go/.gogradle
./python/container/.gogradle


On Mon, Nov 19, 2018 at 12:01 PM Ruoyun Huang  wrote:

> Unfortunately, the flink server still doesn't work consistently on my machine
> yet.  Funny thing is, it did work ONCE (
> :beam-sdks-python:portableWordCount BUILD successful, finished in 18s).
> When I tried again, things were back to hanging with the server printing
> messages like:
>
> """
> [flink-akka.actor.default-dispatcher-25] DEBUG
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received
> slot report from instance 1ad9060bcc87cf5fd19c9a233c15a18f.
> [flink-akka.actor.default-dispatcher-25] DEBUG
> org.apache.flink.runtime.jobmaster.JobMaster - Trigger heartbeat request.
> [flink-akka.actor.default-dispatcher-23] DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat
> request from 006b3653dc7a24471c115d70c4c55fa6.
> [flink-akka.actor.default-dispatcher-25] DEBUG
> org.apache.flink.runtime.jobmaster.JobMaster - Received heartbeat from
> e188c32c-cfa5-4b85-bda9-16ce4742c490.
> ...
> the above repeats forever after 5 minutes.
> """
>
> I am trying to figure out what I did right for that one successful run.
>
> For the step 3 Thomas mentioned, all I did for cleanup is "gradle clean";
> if there is actually more to do, please kindly let me know.
>
>
>
>
> On Mon, Nov 19, 2018 at 6:00 AM Maximilian Michels  wrote:
>
>> Thanks for investigating, Thomas!
>>
>> Ruoyun, does that solve the WordCount problem you were experiencing?
>>
>> -Max
>>
>> On 19.11.18 04:53, Thomas Weise wrote:
>> > With latest master the problem seems fixed. Unfortunately that was
>> first
>> > masked by build and docker issues. But I changed multiple things at
>> once
>> > after getting nowhere (the container build "succeeded" when in fact it
>> > did not):
>> >
>> > * Update to latest docker
>> > * Increase docker disk space after seeing a spurious, non-reproducible
>> > message in one of the build attempts
>> > * Full clean and manually remove Go build residuals from the workspace
>> >
>> > After that I could see Go and container builds execute differently
>> > (longer build time) and the result certainly looks better..
>> >
>> > HTH,
>> > Thomas
>> >
>> >
>> >
>> > On Sun, Nov 18, 2018 at 2:11 PM Ruoyun Huang  wrote:
>> >
>> > I was chasing the same issue (I was using the reference runner job server,
>> > but with the same error message), and had some clues but no conclusion yet.
>> >
>> > By retaining the container instance, the error message says "bad MD5"
>> > (see the other thread [1] I asked in dev last week). My hypothesis,
>> > based on the symptoms, is that the underlying container expects an
>> > MD5 to validate staged files, but the job request from the python SDK does
>> > not send a file hash code.  Hope someone can confirm if that is the
>> > case (I am still trying to understand how come Dataflow does not
>> > have this issue), and if so, the best way to fix it.
>> >
>> >
>> > [1]
>> >
>> https://lists.apache.org/thread.html/b26560087ff88f142e26d66c8a5a9283558c8e55b5edd705b5e53c9c@%3Cdev.beam.apache.org%3E
>> >
>> > On Fri, Nov 16, 2018 at 7:06 PM Thomas Weise  wrote:
>> >
>> > For the last few days, the steps under
>> > https://beam.apache.org/roadmap/portability/#python-on-flink have been
>> > broken.
>> >
>> > The gradle task hangs because the job server isn't able to
>> > launch the docker container.
>> >
>> > ./gradlew :beam-sdks-python:portableWordCount
>> > -PjobEndpoint=localhost:8099
>> >
>> > [CHAIN MapPartition (MapPartition at
>> >
>>  36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0) ->
>> > FlatMap (FlatMap at
>> >
>>  36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0/out.0)
>> > (8/8)] INFO
>> >
>>  org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory
>> > - Still waiting for startup of environment
>> > tweise-docker-apache.bintray.io/beam/python:latest for
>> > worker id 1
>> >
>> > Unfortunately this isn't covered by tests yet. Is anyone aware
>> > what change may have caused this or looking into resolving it?
>> >
>> > Thanks,
>> > Thomas
>> >
>> >
>> >
>> > --
>> > 
>> > Ruoyun  Huang
>> >
>>
>
>
> --
> 
> Ruoyun  Huang
>
>


Bay Area Apache Beam Kickoff!

2018-11-19 Thread Austin Bennett
We have our first meetup scheduled for December 12th in San Francisco.

Andrew Pilloud, a software engineer at Google and Beam committer, will demo
the latest feature in Beam SQL: a standalone SQL shell. The talk covers why
SQL is a good fit for streaming data processing, the technical details of
the Beam SQL engine, and a peek into our future plans.

Kenn Knowles, a founding PMC Member and incoming PMC Chair for the Apache
Beam project, as well as a computer scientist and engineer at Google, will
share about all things Beam: where it is, where it's been, where it's going.

More info:
https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/

For those in/around town (or that can be) come join in the fun!


Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-19 Thread Lukasz Cwik
I also addressed a bunch of PR comments which clarified the
contract/expectations as described in my previous e-mail and the
splitting/backlog reporting/bundle finalization docs.

On Mon, Nov 19, 2018 at 3:19 PM Lukasz Cwik  wrote:

>
>
> On Mon, Nov 19, 2018 at 3:06 PM Lukasz Cwik  wrote:
>
>> Sorry for the late reply.
>>
>> On Thu, Nov 15, 2018 at 8:53 AM Ismaël Mejía  wrote:
>>
>>> Some late comments, and my apologies in advance if some questions look silly,
>>> but the last documents were a lot of info that I have not yet fully
>>> digested.
>>>
>>> I have some questions about the ‘new’ Backlog concept following a
>>> quick look at the PR
>>> https://github.com/apache/beam/pull/6969/files
>>>
>>> 1. Is the Backlog a specific concept for each IO? Or in other words:
>>> ByteKeyRestrictionTracker can be used by HBase and Bigtable, but I am
>>> assuming from what I could understand that the Backlog implementation
>>> will be data store specific. Is this the case, or can it be in some
>>> cases generalized (for example for Filesystems)?
>>>
>>
>> The backlog is tied heavily to the restriction tracker implementation,
>> any data store using the same restriction tracker will provide the same
>> backlog computation. For example, if HBase/Bigtable use the
>> ByteKeyRestrictionTracker then they will use the same backlog calculation.
>> Note that an implementation could subclass a restriction tracker if the
>> data store could provide additional information. For example, the default
>> backlog for a ByteKeyRestrictionTracker over [startKey, endKey) is
>> distance(currentKey, lastKey) where distance is represented as byte array
>> subtraction (which can be wildly inaccurate as the density of data is not
>> well reflected) but if HBase/Bigtable could provide the number of bytes
>> from current key to last key, a better representation could be provided.
>>
>> Other common examples of backlogs would be:
>> * files: backlog = length of file - current byte offset
>> * message queues: backlog = number of outstanding messages
>>
>>
>>>
>>> 2. Since the backlog is a byte[] this means that it is up to the user
>>> to give it a meaning depending on the situation, is this correct? Also
>>> since splitRestriction now has the Backlog as an argument, what do we
>>> expect the person that implements this method in a DoFn to do ideally
>>> with it? Maybe a more concrete example of how things fit for
>>> File/Offset or HBase/Bigtable/ByteKey will be helpful (maybe also for
>>> the BundleFinalizer concept too).
>>>
>>
>> Yes, the restriction tracker/restriction/SplittableDoFn must give the
>> byte[] a meaning. This can have any meaning, but we would like the
>> backlog byte[] representation to be lexicographically comparable (when
>> viewing the byte[] in big endian format, prefixes are smaller, e.g. 001
>> is smaller than 0010) and preferably a linear representation. Note that all
>> restriction trackers of the same type should use the same "space" so that
>> backlogs are comparable across multiple restriction tracker instances.
>>
>> The backlog when provided to splitRestriction should be used to subdivide
>> the restriction into smaller restrictions where each would have the backlog
>> if processed (except for potentially the last).
>>
>> A concrete example would be to represent the remaining bytes to process
>> in a file as a 64 bit big endian integer, let's say that is 500MiB
>> (524288000 bytes) or 00000000 00000000 00000000 00000000 00011111 01000000
>> 00000000 00000000 (note that the trailing zeros are optional and don't
>> impact the calculation). The runner could notice that processing the
>> restriction will take 10 hrs, so it asks the SDF to split at 1/16 segments
>> by shifting the bits over by 4 and asks to split using backlog 00000000
>> 00000000 00000000 00000000 00000001 11110100 00000000 00000000. The SDK is
>> able to convert this request back into 32768000 bytes and returns 16
>> restrictions. Another example would be for a message queue where we have
>> 10000 messages on the queue remaining so the backlog would be 00000000
>> 00000000 00000000 00000000 00000000 00000000 00100111 00010000 when
>> represented as a 64 bit unsigned big endian integer. The runner could ask
>> the SDK to split using a 1/8th backlog of 00000000 00000000 00000000
>> 00000000 00000000 00000000 00000100 11100010 which the SDK would break out
>> into 8 restrictions, the first 7 responsible for reading 1250 messages and
>> stopping while the last restriction would read 1250 messages and then
>> continue to read anything else that has been enqueued.
>>
>> Bundle finalization is unrelated to backlogs but is needed since there is
>> a class of data stores which need an acknowledgement that says "I have
>> successfully received your data and am now responsible for it", such as
>> acking a message from a message queue.
>>
>
> Note that this does bring up the question of whether SDKs should expose
> coders for backlogs since ByteKeyCoder and BigEndianLongCoder exist which
> would cover a good number of scenarios described above.

Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-19 Thread Lukasz Cwik
On Mon, Nov 19, 2018 at 3:06 PM Lukasz Cwik  wrote:

> Sorry for the late reply.
>
> On Thu, Nov 15, 2018 at 8:53 AM Ismaël Mejía  wrote:
>
>> Some late comments, and my apologies in advance if some questions look silly,
>> but the last documents were a lot of info that I have not yet fully
>> digested.
>>
>> I have some questions about the ‘new’ Backlog concept following a
>> quick look at the PR
>> https://github.com/apache/beam/pull/6969/files
>>
>> 1. Is the Backlog a specific concept for each IO? Or in other words:
>> ByteKeyRestrictionTracker can be used by HBase and Bigtable, but I am
>> assuming from what I could understand that the Backlog implementation
>> will be data store specific. Is this the case, or can it be in some
>> cases generalized (for example for Filesystems)?
>>
>
> The backlog is tied heavily to the restriction tracker implementation, any
> data store using the same restriction tracker will provide the same backlog
> computation. For example, if HBase/Bigtable use the
> ByteKeyRestrictionTracker then they will use the same backlog calculation.
> Note that an implementation could subclass a restriction tracker if the
> data store could provide additional information. For example, the default
> backlog for a ByteKeyRestrictionTracker over [startKey, endKey) is
> distance(currentKey, lastKey) where distance is represented as byte array
> subtraction (which can be wildly inaccurate as the density of data is not
> well reflected) but if HBase/Bigtable could provide the number of bytes
> from current key to last key, a better representation could be provided.
>
> Other common examples of backlogs would be:
> * files: backlog = length of file - current byte offset
> * message queues: backlog = number of outstanding messages
>
>
>>
>> 2. Since the backlog is a byte[] this means that it is up to the user
>> to give it a meaning depending on the situation, is this correct? Also
>> since splitRestriction now has the Backlog as an argument, what do we
>> expect the person that implements this method in a DoFn to do ideally
>> with it? Maybe a more concrete example of how things fit for
>> File/Offset or HBase/Bigtable/ByteKey will be helpful (maybe also for
>> the BundleFinalizer concept too).
>>
>
> Yes, the restriction tracker/restriction/SplittableDoFn must give the
> byte[] a meaning. This can have any meaning, but we would like the
> backlog byte[] representation to be lexicographically comparable (when
> viewing the byte[] in big endian format, prefixes are smaller, e.g. 001
> is smaller than 0010) and preferably a linear representation. Note that all
> restriction trackers of the same type should use the same "space" so that
> backlogs are comparable across multiple restriction tracker instances.
>
> The backlog when provided to splitRestriction should be used to subdivide
> the restriction into smaller restrictions where each would have the backlog
> if processed (except for potentially the last).
>
> A concrete example would be to represent the remaining bytes to process in
> a file as a 64 bit big endian integer, let's say that is 500MiB (524288000
> bytes) or 00000000 00000000 00000000 00000000 00011111 01000000 00000000
> 00000000 (note that the trailing zeros are optional and don't impact the
> calculation). The runner could notice that processing the restriction will
> take 10 hrs, so it asks the SDF to split at 1/16 segments by shifting the
> bits over by 4 and asks to split using backlog 00000000 00000000 00000000
> 00000000 00000001 11110100 00000000 00000000. The SDK is able to convert
> this request back into 32768000 bytes and returns 16 restrictions. Another
> example would be for a message queue where we have 10000 messages on the
> queue remaining so the backlog would be 00000000 00000000 00000000 00000000
> 00000000 00000000 00100111 00010000 when represented as a 64 bit unsigned
> big endian integer. The runner could ask the SDK to split using a 1/8th
> backlog of 00000000 00000000 00000000 00000000 00000000 00000000 00000100
> 11100010 which the SDK would break out into 8 restrictions, the first 7
> responsible for reading 1250 messages and stopping while the last
> restriction would read 1250 messages and then continue to read anything
> else that has been enqueued.
>
> Bundle finalization is unrelated to backlogs but is needed since there is
> a class of data stores which need an acknowledgement that says "I have
> successfully received your data and am now responsible for it", such as
> acking a message from a message queue.
>

Note that this does bring up the question of whether SDKs should expose
coders for backlogs since ByteKeyCoder and BigEndianLongCoder exist which
would cover a good number of scenarios described above. This coder doesn't
have to be understood by the runner nor does it have to be part of the
portability APIs (either Runner or Fn API). WDYT?
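
For illustration, a minimal sketch of that idea, assuming the Beam Java SDK
is on the classpath; encoding a numeric backlog with the existing
BigEndianLongCoder yields exactly the lexicographically comparable big
endian byte[] described above:

import java.io.ByteArrayOutputStream;
import org.apache.beam.sdk.coders.BigEndianLongCoder;

class BacklogCoderSketch {
  // Encode a numeric backlog into the 64 bit big endian byte[] form
  // discussed earlier in this thread.
  static byte[] encodeBacklog(long backlog) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BigEndianLongCoder.of().encode(backlog, out);
    return out.toByteArray();
  }
}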


>
>> 3. By default all Restrictions are assumed to be unbounded but there
>> is this new Restrictions.IsBounded method, can’t this behavior be
>> inferred (adapted) from the DoFn UnboundedPerElement/Bounded
>> annotation or are these independent concepts?

Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-19 Thread Lukasz Cwik
Sorry for the late reply.

On Thu, Nov 15, 2018 at 8:53 AM Ismaël Mejía  wrote:

> Some late comments, and my apologies in advance if some questions look silly,
> but the last documents were a lot of info that I have not yet fully
> digested.
>
> I have some questions about the ‘new’ Backlog concept following a
> quick look at the PR
> https://github.com/apache/beam/pull/6969/files
>
> 1. Is the Backlog a specific concept for each IO? Or in other words:
> ByteKeyRestrictionTracker can be used by HBase and Bigtable, but I am
> assuming from what I could understand that the Backlog implementation
> will be data store specific. Is this the case, or can it be in some
> cases generalized (for example for Filesystems)?
>

The backlog is tied heavily to the restriction tracker implementation, any
data store using the same restriction tracker will provide the same backlog
computation. For example, if HBase/Bigtable use the
ByteKeyRestrictionTracker then they will use the same backlog calculation.
Note that an implementation could subclass a restriction tracker if the
data store could provide additional information. For example, the default
backlog for a ByteKeyRestrictionTracker over [startKey, endKey) is
distance(currentKey, lastKey) where distance is represented as byte array
subtraction (which can be wildly inaccurate as the density of data is not
well reflected) but if HBase/Bigtable could provide the number of bytes
from current key to last key, a better representation could be provided.

Other common examples of backlogs would be:
* files: backlog = length of file - current byte offset
* message queues: backlog = number of outstanding messages
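
To make the file case concrete, here is a minimal sketch (the class and
method names below are made up for illustration, not the actual Beam API) of
a tracker over a byte range [start, end) that reports its backlog as a 64
bit big endian integer:

import java.nio.ByteBuffer;

// Sketch only: a tracker over the byte range [start, end) of a file.
class OffsetRangeTrackerSketch {
  private final long end;  // exclusive end offset of the restriction
  private long current;    // last successfully claimed offset

  OffsetRangeTrackerSketch(long start, long end) {
    this.current = start;
    this.end = end;
  }

  // Claiming an offset outside the restriction returns false, telling
  // the caller to stop processing.
  boolean tryClaim(long offset) {
    if (offset >= end) {
      return false;
    }
    current = offset;
    return true;
  }

  // backlog = bytes remaining = end offset - current offset, encoded as
  // a 64 bit big endian integer so backlogs compare lexicographically.
  byte[] getBacklog() {
    return ByteBuffer.allocate(Long.BYTES).putLong(end - current).array();
  }
}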


>
> 2. Since the backlog is a byte[] this means that it is up to the user
> to give it a meaning depending on the situation, is this correct? Also
> since splitRestriction now has the Backlog as an argument, what do we
> expect the person that implements this method in a DoFn to do ideally
> with it? Maybe a more concrete example of how things fit for
> File/Offset or HBase/Bigtable/ByteKey will be helpful (maybe also for
> the BundleFinalizer concept too).
>

Yes, the restriction tracker/restriction/SplittableDoFn must give the
byte[] a meaning. This can have any meaning, but we would like the
backlog byte[] representation to be lexicographically comparable (when
viewing the byte[] in big endian format, prefixes are smaller, e.g. 001
is smaller than 0010) and preferably a linear representation. Note that all
restriction trackers of the same type should use the same "space" so that
backlogs are comparable across multiple restriction tracker instances.

The backlog when provided to splitRestriction should be used to subdivide
the restriction into smaller restrictions where each would have the backlog
if processed (except for potentially the last).
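
As a sketch of that contract (again with made-up names; OffsetRange here is
just an illustrative restriction type, not the proposed API surface):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SplitSketch {
  // Illustrative restriction type: a half-open byte range [from, to).
  static final class OffsetRange {
    final long from;
    final long to;
    OffsetRange(long from, long to) { this.from = from; this.to = to; }
  }

  // Subdivide so that each piece (except possibly the last) covers
  // exactly the requested backlog worth of bytes.
  static List<OffsetRange> splitRestriction(OffsetRange restriction, byte[] backlog) {
    long step = ByteBuffer.wrap(backlog).getLong();
    if (step <= 0) {
      return Collections.singletonList(restriction);  // nothing sensible to split
    }
    List<OffsetRange> pieces = new ArrayList<>();
    for (long start = restriction.from; start < restriction.to; start += step) {
      pieces.add(new OffsetRange(start, Math.min(start + step, restriction.to)));
    }
    return pieces;
  }
}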

A concrete example would be to represent the remaining bytes to process in
a file as a 64 bit big endian integer, let's say that is 500MiB (524288000
bytes) or 00000000 00000000 00000000 00000000 00011111 01000000 00000000
00000000 (note that the trailing zeros are optional and don't impact the
calculation). The runner could notice that processing the restriction will
take 10 hrs, so it asks the SDF to split at 1/16 segments by shifting the
bits over by 4 and asks to split using backlog 00000000 00000000 00000000
00000000 00000001 11110100 00000000 00000000. The SDK is able to convert
this request back into 32768000 bytes and returns 16 restrictions. Another
example would be for a message queue where we have 10000 messages on the
queue remaining so the backlog would be 00000000 00000000 00000000 00000000
00000000 00000000 00100111 00010000 when represented as a 64 bit unsigned
big endian integer. The runner could ask the SDK to split using a 1/8th
backlog of 00000000 00000000 00000000 00000000 00000000 00000000 00000100
11100010 which the SDK would break out into 8 restrictions, the first 7
responsible for reading 1250 messages and stopping while the last
restriction would read 1250 messages and then continue to read anything
else that has been enqueued.
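
The arithmetic in both examples can be checked with plain big endian
encoding; a standalone sketch:

import java.nio.ByteBuffer;

class BacklogMathSketch {
  public static void main(String[] args) {
    // File example: a 500 MiB backlog; a 1/16 split is a right shift by 4.
    long fileBacklog = 500L * 1024 * 1024;  // 524288000
    byte[] encoded = ByteBuffer.allocate(Long.BYTES).putLong(fileBacklog).array();
    long sixteenth = ByteBuffer.wrap(encoded).getLong() >>> 4;
    System.out.println(sixteenth);  // 32768000

    // Message queue example: 10000 outstanding messages, 1/8th backlog.
    System.out.println(10000 / 8);  // 1250
  }
}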

Bundle finalization is unrelated to backlogs but is needed since there is a
class of data stores which need an acknowledgement that says "I have
successfully received your data and am now responsible for it", such as
acking a message from a message queue.
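
A purely hypothetical sketch of what that could look like for a DoFn author
(BundleFinalizer, afterBundleCommit, and QueueClient below are placeholders,
not the proposed API):

import java.util.List;

class AckingDoFnSketch {
  // Placeholder for whatever finalization hook the proposal settles on.
  interface BundleFinalizer {
    void afterBundleCommit(Runnable callback);
  }

  // Placeholder client for an imaginary message queue.
  interface QueueClient {
    void ack(List<String> ackIds);
  }

  void finishBundle(BundleFinalizer finalizer, QueueClient queue, List<String> ackIds) {
    // Only acknowledge once the runner has durably committed the bundle's
    // output, so the messages survive bundle retries.
    finalizer.afterBundleCommit(() -> queue.ack(ackIds));
  }
}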


>
> 3. By default all Restrictions are assumed to be unbounded but there
> is this new Restrictions.IsBounded method, can’t this behavior be
> inferred (adapted) from the DoFn UnboundedPerElement/Bounded
> annotation or are these independent concepts?
>

UnboundedPerElement/BoundedPerElement tells us during pipeline construction
time what type of PCollection we will be creating since we may have a
bounded PCollection go to an UnboundedPerElement DoFn and that will produce
an unbounded PCollection and similarly we could have an unbounded
PCollection go to a BoundedPerElement DoFn and that will produce an
unbounded PCollection. Restrictions.IsBounded is used during 

Re: Questions on [MD5] hash code of staged files

2018-11-19 Thread Ankur Goenka
Hi Ruoyun,

We moved from MD5 to SHA256 hashing, which caused this problem.
The Java and Python code was updated in PR
https://github.com/apache/beam/pull/6583 though the Go code was not updated. Go
caches the generated code, which caused tests to pass. Though I am not sure
why integration tests did not break sooner.
We resolved this issue with https://github.com/apache/beam/pull/7071 .
Let me know if you are still having the same issue.
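
For anyone debugging a similar mismatch, a minimal sketch of the digest both
sides now need to agree on, using the standard java.security API (the staged
file path is just an example):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Base64;

class StagedFileHashSketch {
  public static void main(String[] args) throws Exception {
    // SHA-256 of a staged artifact; an empty or stale expected hash on
    // either side fails validation, as in the error quoted below.
    byte[] contents = Files.readAllBytes(Paths.get("/tmp/staged/pickled_main_session"));
    byte[] digest = MessageDigest.getInstance("SHA-256").digest(contents);
    System.out.println(Base64.getEncoder().encodeToString(digest));
  }
}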

Thanks,
Ankur

On Fri, Nov 16, 2018 at 3:03 PM Ruoyun Huang  wrote:

> Hi, Folks,
>
>  I am running the python SDK PortableRunner, connecting to the Java
> Reference Runner job server. But we couldn't make it work because the docker
> container fails to start due to the error message: "2018/11/16 21:38:55 Failed
> to retrieve staged files: failed to retrieve pickled_main_session in 3
> attempts: bad MD5 for /tmp/staged/pickled_main_session:
> 9g/EU11J0QTfwDVbpHQhAQ==, want ; bad MD5 for
> /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==, want ; bad MD5
> for /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==, want ; bad
> MD5 for /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==,
> want ".  Actual code for this error message is here
> 
> .
>
> The file pickled_main_session is INDEED staged, but for an unknown reason we
> are expecting an empty string as the hash code. My hypothesis is that the
> job request should've included a hash code, but fails to do so on the
> python side, thus leading to an empty string.
>
> If the hypothesis above is correct, then my question is: where should I
> put the code in the python SDK's job request to make it right? A pointer to the
> right place is appreciated.
>
> That being said, I also saw that Ankur's recent PR#7049 updates
> MD5 to SHA256. And in this PR we are not updating anything in Java or
> Python. Therefore it makes me unsure about the hypothesis above. What did
> I miss? (or maybe that is what PR#7049 should've done?)
>
> Suggestions appreciated.
>
> Cheers,
> --
> 
> Ruoyun  Huang
>
>


Re: [DISCUSS] Reverting commits on green post-commit status

2018-11-19 Thread Maximilian Michels
The way I read Thomas' original email is that it's generally not a nice 
sign for a contributor if her work gets reverted. We all come from 
different backgrounds. For some, reverting is just a tool to get the job 
done, for others it might come across as offensive.


I know of communities where reverting is the absolute last resort. Now, 
Beam needs to find its own way. I think there are definite advantages 
to reverting quickly.


In the most obvious case, when our tests are broken and a fix is not 
viable, reverting unblocks other contributors to test their code. I 
think this has been working fine in the past.


In the less obvious case, an external Runner or system is broken due to 
an update in the master. IMHO this does not warrant an immediate revert 
on its own. As already mentioned, there should be some justification for 
a rollback. This is not to make people's life harder but to figure out 
whether the problem can be solved upstream or downstream, or with a 
combination of both.


I think Thomas wanted to address this latter case. It seems like we're 
all more or less on the same page. The core problem is more related to 
communicating reverts in a way that helps contributors to save face and 
the community to work efficiently.


Thanks,
Max

On 19.11.18 10:51, Robert Bradshaw wrote:
If something breaks Beam's post (or especially pre) commit tests, I 
agree that rollback is typically the best option and can be done 
quickly. The situation is totally different if it breaks downstream 
projects, in which case Kenn's three points are good criteria for determining 
if we should roll back, which should not be assumed to be the default option.


I would say the root cause of the problem is insufficient visibility and 
testing. If external-to-beam tests (or production jobs) are broken in 
such a way that rollback is desired, I would say the onus (maybe not a 
hard requirement, but a high bar for exceptions) is on whoever is asking 
for the rollback to create and submit an external test that demonstrates 
the issue. It is their choice whether this is easier than rolling 
forward or otherwise working around the breakage. This seems like the 
only long-term sustainable option and should get us out of this bad 
situation.


(As an aside, the bar for rolling back a runner-specific PR that breaks 
that runner may be lower, though still not automatic as other changes 
may depend on it.)


- Robert

On Sat, Nov 17, 2018 at 7:35 PM Kenneth Knowles  wrote:


Just adapting my PR commentary to this thread:

Our rollback-first policy cannot apply to a change that passes all
of Beam's postcommit tests. It *does* apply to Beam's postcommit
suites for each and every runner; they are equally important in this
regard.

The purpose of rapid rollback without discussion is foremost to
restore test signal and not to disrupt the work of other
contributors, that is why it is OK to roll back before figuring out
if the change was actually bad. If that isn't at stake, the policy
doesn't make sense to apply.

But...

  - We have at least three examples of runners where there are
probably tests outside the Beam repo: Dataflow, Samza runner, and
IBM Streams.
  - We also may have users that try running their production loads
against Beam master branch to learn early whether the next release
will break them.

These are success stories for Beam. We should respect these other
sources of information for what they are: users and vendors giving
us a heads up about changes that will be a problem if we release them.

Often rollback is still a good option but IMO it is no longer
automatically the best option, and may not even be OK. I would say
that the case must be made clearly and *publicly* that

(1) something is actually broken
(2) the revert fixes the problem
(3) revert is the best option

In this scenario there is time to consider. An important and common
case is that a perfectly fine change exposes something already
broken, so the best option may be sickbaying downstream or pinning
their version/commit of Beam until they can fix.

Kenn


On Fri, Nov 16, 2018 at 8:15 PM Ahmet Altay <al...@google.com> wrote:
 >
 > It sounds like we are in agreement that addressing issues sooner
is better. I think reverting is in general the less stressful option
because it allows a solution to be developed in parallel. Even with
that, it is not the only option we have and based on the severity
and the complexity of the problem we can consider other options.
Fixing forward might be feasible in some cases.
 >
 > We can bring issues back to the mailing list. This would be akin
to bringing any issues to the mailing list. I think JIRA is a better
tool for that. These reverts are happening because of an issue, and
JIRA allows informing all involved parties, creates emails to the issues
list for later searching through mailing archives, and creates a record of
things in a structured way with components.

Re: Portable wordcount on Flink runner broken

2018-11-19 Thread Ruoyun Huang
Unfortunately, the flink server still doesn't work consistently on my machine
yet.  Funny thing is, it did work ONCE (
:beam-sdks-python:portableWordCount BUILD successful, finished in 18s).
When I tried again, things were back to hanging with the server printing
messages like:

"""
[flink-akka.actor.default-dispatcher-25] DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received
slot report from instance 1ad9060bcc87cf5fd19c9a233c15a18f.
[flink-akka.actor.default-dispatcher-25] DEBUG
org.apache.flink.runtime.jobmaster.JobMaster - Trigger heartbeat request.
[flink-akka.actor.default-dispatcher-23] DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor - Received heartbeat
request from 006b3653dc7a24471c115d70c4c55fa6.
[flink-akka.actor.default-dispatcher-25] DEBUG
org.apache.flink.runtime.jobmaster.JobMaster - Received heartbeat from
e188c32c-cfa5-4b85-bda9-16ce4742c490.
...
the above repeats forever after 5 minutes.
"""

I am trying to figure out what I did right for that one successful run.


For the step 3 Thomas mentioned, all I did for cleanup is "gradle clean";
if there is actually more to do, please kindly let me know.




On Mon, Nov 19, 2018 at 6:00 AM Maximilian Michels  wrote:

> Thanks for investigating, Thomas!
>
> Ruoyun, does that solve the WordCount problem you were experiencing?
>
> -Max
>
> On 19.11.18 04:53, Thomas Weise wrote:
> > With latest master the problem seems fixed. Unfortunately that was first
> > masked by build and docker issues. But I changed multiple things at once
> > after getting nowhere (the container build "succeeded" when in fact it
> > did not):
> >
> > * Update to latest docker
> > * Increase docker disk space after seeing a spurious, non-reproducible
> > message in one of the build attempts
> > * Full clean and manually remove Go build residuals from the workspace
> >
> > After that I could see Go and container builds execute differently
> > (longer build time) and the result certainly looks better..
> >
> > HTH,
> > Thomas
> >
> >
> >
> > On Sun, Nov 18, 2018 at 2:11 PM Ruoyun Huang  wrote:
> >
> > I was chasing the same issue (I was using the reference runner job server,
> > but with the same error message), and had some clues but no conclusion yet.
> >
> > By retaining the container instance, the error message says "bad MD5"
> > (see the other thread [1] I asked in dev last week). My hypothesis,
> > based on the symptoms, is that the underlying container expects an
> > MD5 to validate staged files, but the job request from the python SDK does
> > not send a file hash code.  Hope someone can confirm if that is the
> > case (I am still trying to understand how come Dataflow does not
> > have this issue), and if so, the best way to fix it.
> >
> >
> > [1]
> >
> https://lists.apache.org/thread.html/b26560087ff88f142e26d66c8a5a9283558c8e55b5edd705b5e53c9c@%3Cdev.beam.apache.org%3E
> >
> > On Fri, Nov 16, 2018 at 7:06 PM Thomas Weise  wrote:
> >
> > For the last few days, the steps under
> > https://beam.apache.org/roadmap/portability/#python-on-flink have been
> > broken.
> >
> > The gradle task hangs because the job server isn't able to
> > launch the docker container.
> >
> > ./gradlew :beam-sdks-python:portableWordCount
> > -PjobEndpoint=localhost:8099
> >
> > [CHAIN MapPartition (MapPartition at
> >
>  36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0) ->
> > FlatMap (FlatMap at
> >
>  36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0/out.0)
> > (8/8)] INFO
> >
>  org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory
> > - Still waiting for startup of environment
> > tweise-docker-apache.bintray.io/beam/python:latest for
> > worker id 1
> >
> > Unfortunately this isn't covered by tests yet. Is anyone aware
> > what change may have caused this or looking into resolving it?
> >
> > Thanks,
> > Thomas
> >
> >
> >
> > --
> > 
> > Ruoyun  Huang
> >
>


-- 

Ruoyun  Huang


Re: E-mail Organization

2018-11-19 Thread Lukasz Cwik
Putting the tags in the subject line is in line with the style of what we
currently do using [DISCUSS], [VOTE], [BEAM-YYY], so I like that. (I forgot
that you can edit the subject after the fact, thanks for pointing that out.)

On Mon, Nov 19, 2018 at 10:04 AM Kenneth Knowles  wrote:

> The traditional thing to do is something like [SQL] or [Portability] in
> the subject. Cannot be added after the email has been sent, but your email
> client can probably do that part, right? If another user changes the
> subject line to add more tags in their reply, I think things like gmail
> will actually keep the thread intact and it will show up in searches too.
> Lots of lightweight options that don't require us to invent anything.
>
> Kenn
>
> On Mon, Nov 19, 2018 at 9:21 AM Suneel Marthi  wrote:
>
>> Kafka uses KIPs
>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>>
>> Flink uses FLIPs
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>
>> So Beam - BIPs 
>>
>> On Mon, Nov 19, 2018 at 10:48 PM Lukasz Cwik  wrote:
>>
>>> dev@beam.apache.org gets a lot of e-mail. I was wondering how other
>>> Apache projects help their contributors focus on design/project discussions
>>> (such as SQL, SplittableDoFn, Portability, Samza, Flink, Testing, ...)?
>>>
>>> I'm looking for a solution that allows people to tag a discussion with
>>> multiple topics, and that tags can be added after the e-mail has been sent
>>> as the discussion may cross multiple topics such as testing and SQL.
>>>
>>> I was initially thinking that we could embed tags like "topic:sql" in
>>> the message body, and if something was part of multiple tags it would be
>>> "topic:sql topic:testing" to make it easy.
>>>
>>> What do you think?
>>>
>>


Re: E-mail Organization

2018-11-19 Thread Kenneth Knowles
The traditional thing to do is something like [SQL] or [Portability] in the
subject. Cannot be added after the email has been sent, but your email
client can probably do that part, right? If another user changes the
subject line to add more tags in their reply, I think things like gmail
will actually keep the thread intact and it will show up in searches too.
Lots of lightweight options that don't require us to invent anything.

Kenn

On Mon, Nov 19, 2018 at 9:21 AM Suneel Marthi  wrote:

> Kafka uses KIPs
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>
> Flink uses FLIPs
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> So Beam - BIPs 
>
> On Mon, Nov 19, 2018 at 10:48 PM Lukasz Cwik  wrote:
>
>> dev@beam.apache.org gets a lot of e-mail. I was wondering how other
>> Apache projects help their contributors focus on design/project discussions
>> (such as SQL, SplittableDoFn, Portability, Samza, Flink, Testing, ...)?
>>
>> I'm looking for a solution that allows people to tag a discussion with
>> multiple topics, and that tags can be added after the e-mail has been sent
>> as the discussion may cross multiple topics such as testing and SQL.
>>
>> I was initially thinking that we could embed tags like "topic:sql" in the
>> message body, and if something was part of multiple tags it would be
>> "topic:sql topic:testing" to make it easy.
>>
>> What do you think?
>>
>


Re: E-mail Organization

2018-11-19 Thread Suneel Marthi
Kafka uses KIPs
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

Flink uses FLIPs
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

So Beam - BIPs 

On Mon, Nov 19, 2018 at 10:48 PM Lukasz Cwik  wrote:

> dev@beam.apache.org gets a lot of e-mail. I was wondering how other
> Apache projects help their contributors focus on design/project discussions
> (such as SQL, SplittableDoFn, Portability, Samza, Flink, Testing, ...)?
>
> I'm looking for a solution that allows people to tag a discussion with
> multiple topics, and that tags can be added after the e-mail has been sent
> as the discussion may cross multiple topics such as testing and SQL.
>
> I was initially thinking that we could embed tags like "topic:sql" in the
> message body, and if something was part of multiple tags it would be
> "topic:sql topic:testing" to make it easy.
>
> What do you think?
>


E-mail Organization

2018-11-19 Thread Lukasz Cwik
dev@beam.apache.org gets a lot of e-mail. I was wondering how other Apache
projects help their contributors focus on design/project discussions (such
as SQL, SplittableDoFn, Portability, Samza, Flink, Testing, ...)?

I'm looking for a solution that allows people to tag a discussion with
multiple topics, and that tags can be added after the e-mail has been sent
as the discussion may cross multiple topics such as testing and SQL.

I was initially thinking that we could embed tags like "topic:sql" in the
message body, and if something was part of multiple tags it would be
"topic:sql topic:testing" to make it easy.

What do you think?


Re: Wiki edit access

2018-11-19 Thread Lukasz Cwik
You've been added.

On Mon, Nov 19, 2018 at 1:45 AM Wout Scheepers <
wout.scheep...@vente-exclusive.com> wrote:

> Sorry, I assumed it would be the same account as the Apache JIRA. I just
> created a new one.
>
> Full name: “Wout Scheepers”
> email: woutscheep...@gmail.com
>
>
>
>
>
> *From: *Lukasz Cwik 
> *Reply-To: *"dev@beam.apache.org" 
> *Date: *Friday, 16 November 2018 at 18:39
> *To: *dev 
> *Subject: *Re: Wiki edit access
>
>
>
> I tried finding your account on cwiki.apache.org but was unable to. What
> is your user id on cwiki.apache.org?
>
>
>
> On Thu, Nov 15, 2018 at 7:51 AM Wout Scheepers <
> wout.scheep...@vente-exclusive.com> wrote:
>
> Can anyone give me edit access for the wiki?
>
>
>
> Thanks,
>
> Wout
>
>


Re: Portable wordcount on Flink runner broken

2018-11-19 Thread Maximilian Michels

Thanks for investigating, Thomas!

Ruoyun, does that solve the WordCount problem you were experiencing?

-Max

On 19.11.18 04:53, Thomas Weise wrote:
With latest master the problem seems fixed. Unfortunately that was first 
masked by build and docker issues. But I changed multiple things at once 
after getting nowhere (the container build "succeeded" when in fact it 
did not):


* Update to latest docker
* Increase docker disk space after seeing a spurious, non-reproducible 
message in one of the build attempts

* Full clean and manually remove Go build residuals from the workspace

After that I could see Go and container builds execute differently 
(longer build time) and the result certainly looks better..


HTH,
Thomas



On Sun, Nov 18, 2018 at 2:11 PM Ruoyun Huang  wrote:


I was chasing the same issue (I was using the reference runner job server,
but with the same error message), and had some clues but no conclusion yet.

By retaining the container instance, the error message says "bad MD5"
(see the other thread [1] I asked in dev last week). My hypothesis,
based on the symptoms, is that the underlying container expects an
MD5 to validate staged files, but the job request from the python SDK does
not send a file hash code.  Hope someone can confirm if that is the
case (I am still trying to understand how come Dataflow does not
have this issue), and if so, the best way to fix it.


[1]

https://lists.apache.org/thread.html/b26560087ff88f142e26d66c8a5a9283558c8e55b5edd705b5e53c9c@%3Cdev.beam.apache.org%3E

On Fri, Nov 16, 2018 at 7:06 PM Thomas Weise <t...@apache.org> wrote:

For the last few days, the steps under
https://beam.apache.org/roadmap/portability/#python-on-flink have been
broken.

The gradle task hangs because the job server isn't able to
launch the docker container.

./gradlew :beam-sdks-python:portableWordCount
-PjobEndpoint=localhost:8099

[CHAIN MapPartition (MapPartition at
36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0) ->
FlatMap (FlatMap at
36write/Write/WriteImpl/DoOnce/Impulse.None/beam:env:docker:v1:0/out.0)
(8/8)] INFO
org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory
- Still waiting for startup of environment
tweise-docker-apache.bintray.io/beam/python:latest for
worker id 1

Unfortunately this isn't covered by tests yet. Is anyone aware
what change may have caused this or looking into resolving it?

Thanks,
Thomas



-- 


Ruoyun  Huang



Beam Dependency Check Report (2018-11-19)

2018-11-19 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

Dependency Name | Current Version | Latest Version | Current Release Date | Latest Release Date | JIRA Issue
future | 0.16.0 | 0.17.1 | 2016-10-27 | 2018-10-31 | BEAM-5968
google-cloud-pubsub | 0.35.4 | 0.38.0 | 2018-06-06 | 2018-09-12 | BEAM-5539
oauth2client | 3.0.0 | 4.1.3 | 2016-07-28 | 2018-09-07 | BEAM-6089
pytz | 2018.4 | 2018.7 | 2018-04-10 | 2018-10-29 | BEAM-5893

High Priority Dependency Updates Of Beam Java SDK:

Dependency Name | Current Version | Latest Version | Current Release Date | Latest Release Date | JIRA Issue
com.rabbitmq:amqp-client | 4.6.0 | 5.5.0 | 2018-03-26 | 2018-10-22 | BEAM-5895
org.apache.rat:apache-rat-tasks | 0.12 | 0.13 | 2016-06-07 | 2018-10-13 | BEAM-6039
com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2014-10-25 | 2017-12-11 | BEAM-5541
com.gradle:build-scan-plugin | 1.13.1 | 2.0.2 | 2018-04-10 | 2018-11-12 | BEAM-5543
org.conscrypt:conscrypt-openjdk | 1.1.3 | 1.4.1 | 2018-06-04 | 2018-11-01 | BEAM-5748
org.elasticsearch:elasticsearch | 6.4.0 | 7.0.0-alpha1 | 2018-08-18 | 2018-11-13 | BEAM-6090
org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 7.0.0-alpha1 | 2016-10-26 | 2018-11-13 | BEAM-5551
org.elasticsearch.client:elasticsearch-rest-client | 6.4.0 | 7.0.0-alpha1 | 2018-08-18 | 2018-11-13 | BEAM-6091
org.elasticsearch.test:framework | 6.4.0 | 7.0.0-alpha1 | 2018-08-18 | 2018-11-13 | BEAM-6092
io.grpc:grpc-auth | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5896
io.grpc:grpc-context | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5897
io.grpc:grpc-core | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5898
io.grpc:grpc-netty | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5899
io.grpc:grpc-protobuf | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5900
io.grpc:grpc-stub | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5901
io.grpc:grpc-testing | 1.13.1 | 1.16.1 | 2018-06-21 | 2018-10-26 | BEAM-5902
com.google.code.gson:gson | 2.7 | 2.8.5 | 2016-06-14 | 2018-05-22 | BEAM-5558
org.apache.hbase:hbase-common | 1.2.6 | 2.1.1 | 2017-05-29 | 2018-10-27 | BEAM-5560
org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.1.1 | 2017-05-29 | 2018-10-27 | BEAM-5561
org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.1.1 | 2017-05-29 | 2018-10-27 | BEAM-5562
org.apache.hbase:hbase-server | 1.2.6 | 2.1.1 | 2017-05-29 | 2018-10-27 | BEAM-5563
org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.1.1 | 2017-05-29 | 2018-10-27 | BEAM-5564
org.apache.hive:hive-cli | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5566
org.apache.hive:hive-common | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5567
org.apache.hive:hive-exec | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5568
org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5569
net.java.dev.javacc:javacc | 4.0 | 7.0.4 | 2006-03-17 | 2018-09-17 | BEAM-5570
javax.servlet:javax.servlet-api | 3.1.0 | 4.0.1 | 2013-04-25 | 2018-04-20 | BEAM-5750
org.eclipse.jetty:jetty-server | 9.2.10.v20150310 | 9.4.14.v20181114 | 2015-03-10 | 2018-11-14 | BEAM-5752
org.eclipse.jetty:jetty-servlet | 9.2.10.v20150310 | 9.4.14.v20181114 | 2015-03-10 | 2018-11-14 | BEAM-5753
net.java.dev.jna:jna | 4.1.0 | 5.1.0 | 2014-03-06 | 2018-11-14 | BEAM-5573
com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC1 | 2018-03-20 | 2018-06-19 | BEAM-5809
com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
org.apache.kudu:kudu-client | 1.4.0 | 1.8.0 | 2017-06-05 | 2018-10-16 | BEAM-5575
io.dropwizard.metrics:metrics-core | 3.1.2 | 4.1.0-rc2 | 2015-04-26 | 2018-05-03 | BEAM-5576
org.mongodb:mongo-java-driver | 3.2.2 | 3.9.0 | 2016-02-15 | 2018-11-07 | BEAM-5577

Re: [DISCUSS] Reverting commits on green post-commit status

2018-11-19 Thread Robert Bradshaw
If something breaks Beam's post (or especially pre) commit tests, I agree
that rollback is typically the best option and can be done quickly. The
situation is totally different if it breaks downstream projects, in which
case Kenn's three points are good criteria for determining if we should
roll back, which should not be assumed to be the default option.

I would say the root cause of the problem is insufficient visibility and
testing. If external-to-beam tests (or production jobs) are broken in such
a way that rollback is desired, I would say the onus (maybe not a hard
requirement, but a high bar for exceptions) is on whoever is asking for the
rollback to create and submit an external test that demonstrates the issue.
It is their choice whether this is easier than rolling forward or otherwise
working around the breakage. This seems like the only long-term sustainable
option and should get us out of this bad situation.

(As an aside, the bar for rolling back a runner-specific PR that breaks that
runner may be lower, though still not automatic as other changes may depend
on it.)

- Robert

On Sat, Nov 17, 2018 at 7:35 PM Kenneth Knowles  wrote:

> Just adapting my PR commentary to this thread:
>
> Our rollback-first policy cannot apply to a change that passes all of
> Beam's postcommit tests. It *does* apply to Beam's postcommit suites for
> each and every runner; they are equally important in this regard.
>
> The purpose of rapid rollback without discussion is foremost to restore
> test signal and not to disrupt the work of other contributors, that is why
> it is OK to roll back before figuring out if the change was actually bad.
> If that isn't at stake, the policy doesn't make sense to apply.
>
> But...
>
>  - We have at least three examples of runners where there are probably
> tests outside the Beam repo: Dataflow, Samza runner, and IBM Streams.
>  - We also may have users that try running their production loads against
> Beam master branch to learn early whether the next release will break them.
>
> These are success stories for Beam. We should respect these other sources
> of information for what they are: users and vendors giving us a heads up
> about changes that will be a problem if we release them.
>
> Often rollback is still a good option but IMO it is no longer
> automatically the best option, and may not even be OK. I would say that the
> case must be made clearly and *publicly* that
>
> (1) something is actually broken
> (2) the revert fixes the problem
> (3) revert is the best option
>
> In this scenario there is time to consider. An important and common case
> is that a perfectly fine change exposes something already broken, so the
> best option may be sickbaying downstream or pinning their version/commit of
> Beam until they can fix.
>
> Kenn
>
>
> On Fri, Nov 16, 2018 at 8:15 PM Ahmet Altay  wrote:
> >
> > It sounds like we are in agreement that addressing issues sooner is
> better. I think reverting is in general the less stressful option because
> it allows a solution to be developed in parallel. Even with that, it is not
> the only option we have and based on the severity and the complexity of the
> problem we can consider other options. Fixing forward might be feasible in
> some cases.
> >
> > We can bring issues back to the mailing list. This would be akin to
> bringing any issues to the mailing list. I think JIRA is a better tool for
> that. These reverts are happening because of an issue, and JIRA allows
> informing all involved parties, creates emails to the issues list for later
> searching through mailing archives, and creates a record of things in
> structured way with components.
> >
> > We could establish general policies about for all reverts to have an
> issue (which we already do because they are regular PRs), including all
> people in the discussion (including the author and reviewers) and follow up
> with new tests to expands Beam's test coverage.
> >
> > On Fri, Nov 16, 2018 at 7:55 PM, Thomas Weise 
> wrote:
> >>
> >>
> >>
> >> On Fri, Nov 16, 2018 at 7:39 PM Ahmet Altay  wrote:
> >>>
> >>> Thank you for bringing this discussion back to the mailing list.
> >>>
> >>> On Fri, Nov 16, 2018 at 6:49 PM, Thomas Weise  wrote:
> 
>  We have observed instances of changes being reverted in master that
> have been authored following the contributor guidelines and pass all tests
> (post commit). While we generally seem to have quite a bit of revert action
> happening [1], this thread is about those instances that are outside of our
> documented policies.
> 
>  For a contributor, it isn't a good experience to see reverts
> (especially not out of the blue) after a PR has been reviewed, all tests
> pass and generally care has been taken to do the right things.
> >>>
> >>>
> >>> I completely agree. Everyone involved needs to have the context about
> why a change is being reverted. A JIRA with information is probably a good
> way to do it, similar to any other issue we 

Re: Wiki edit access

2018-11-19 Thread Wout Scheepers
Sorry, I assumed it would be the same account as the Apache JIRA. I just
created a new one.
Full name: “Wout Scheepers”
email: woutscheep...@gmail.com


From: Lukasz Cwik 
Reply-To: "dev@beam.apache.org" 
Date: Friday, 16 November 2018 at 18:39
To: dev 
Subject: Re: Wiki edit access

I tried finding your account on cwiki.apache.org but was unable to. What is
your user id on cwiki.apache.org?

On Thu, Nov 15, 2018 at 7:51 AM Wout Scheepers
<wout.scheep...@vente-exclusive.com> wrote:
Can anyone give me edit access for the wiki?

Thanks,
Wout