Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Romain Manni-Bucau
Le mer. 12 sept. 2018 00:37, Lukasz Cwik  a écrit :

> I was unaware that users would use multiple versions of Apache Beam on the
> classpath at the same time. In that case I don't believe shading is
> something that will be there number one problem since we don't have a
> stable API surface between internal Apache Beam components.
>

Agree was exactly what I tried to say.


> For users who aren't using multiple Apache Beam packages, I would not
> expect non Apache Beam packages to ever export anything underneath the
> org.apache.beam package namespace.
>

Agree too.


> Also, I did add tooling to our build process to make sure that we only
> release classes underneath the org.apache.beam package namespace with the
> validateShadedJarDoesntLeakNonOrgApacheBeamClasses[1] task.
> 1:
> https://github.com/apache/beam/blob/a3f6f7e3b147f5a65e5b419d9baf24b35750974b/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L751
>
> Romain, I think this is something we could continue outside of the release
> thread. Feel free to start a new thread or follow up on Slack.
>

The point was that Beam is hiding non beam issues with such a delivery
which is a blocker to upgrade. So beam alone is ok but if you add anything
- and since you will likely for any pipeline - then your app is no more in
a workable state while shades are a recommended solution.



> On Tue, Sep 11, 2018 at 2:48 PM Romain Manni-Bucau 
> wrote:
>
>> I understand Lukasz but it makes using shades properly pretty impossible
>> since this warning is not just something you can ignore but something you
>> have to fix since it can hide bugs. I get the "it is ok while you have a
>> single beam version" point but why would you get only beam in your
>> classpath, from the moment you use an IO it is not true anymore so this
>> warning is key to ensure your deployment is under control. In general you
>> accept something which fits the screen (like 20 overlapping classes or so)
>> but having 6600 classes to check is way more than something which would be
>> done just by a quick visual check. It requires you to add tooling on top of
>> it which is not really good overall. Wonder if it wouldn't be better to
>> revert that if it can't be completed short term and reapplied when possible
>> (probably using a working branch).
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>>
>> Le mar. 11 sept. 2018 à 23:41, Lukasz Cwik  a écrit :
>>
>>> Romain, the beam-model-fn-execution-2.7.0.jar,
>>> beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar have
>>> duplicates of the same classes to satisfy their dependencies (gRPC and
>>> protobuf and their transitive dependencies). Producing a separate artifact
>>> is still not done to prevent the message that your describing and other
>>> then size of jars, that message is benign in this case.
>>>
>>> Note that much of our vendoring goal that the community had discussed
>>> and agreed upon is still not unfinished, for example Guava:
>>> https://issues.apache.org/jira/browse/BEAM-3608
>>>
>>>
>>>
>>> On Tue, Sep 11, 2018 at 2:29 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 BTW, did you notice that doing a shade now logs something like:

 [WARNING] beam-model-fn-execution-2.7.0.jar,
 beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar define
 6660 overlapping classes:
 [WARNING]   -
 org.apache.beam.vendor.netty.v4.io.netty.handler.codec.http.HttpClientCodec$1
 [WARNING]   -
 org.apache.beam.vendor.guava.v20.com.google.common.util.concurrent.AggregateFutureState$SafeAtomicHelper
 [WARNING]   -
 org.apache.beam.vendor.netty.v4.io.netty.util.concurrent.DefaultFutureListeners
 [WARNING]   -
 org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.OpenSslSessionContext$1
 [WARNING]   -
 org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.Java9SslUtils$4
 [WARNING]   -
 org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableMultimap$Builder
 [WARNING]   -
 org.apache.beam.vendor.netty.v4.io.netty.handler.codec.spdy.SpdyHeaders
 [WARNING]   -
 org.apache.beam.vendor.protobuf.v3.com.google.protobuf.DescriptorProtos$FieldDescriptorProtoOrBuilder
 [WARNING]   -
 org.apache.beam.vendor.guava.v20.com.google.common.collect.AbstractMultimap
 [WARNING]   -
 org.apache.beam.vendor.guava.v20.com.google.common.io.BaseEncoding$3
 [WARNING]   - 6650 more...

 Looks like the new shading policy impl was merged a bit too fast ;)

 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old 

Re: [Proposal] Creating a reproducible environment for Beam Jenkins Tests

2018-09-11 Thread Yifan Zou
Thanks all. I am struggling with the missing buildscan reports when running
jobs with containers. I believe it is a big disadvantage to use docker if
the buildscan doesn't show up. I will keep updating my progress in this
thread. In the meanwhile, any comments, suggestions and objections are
still welcome.

Regards.
Yifan

On Tue, Sep 11, 2018 at 6:08 AM Alexey Romanenko 
wrote:

> +1 Great feature that should help with complicated error cases.
>
> On 11 Sep 2018, at 03:39, Henning Rohde  wrote:
>
> +1 Nice proposal. It will help eradicate some of the inflexibility and
> frustrations with Jenkins.
>
> On Wed, Sep 5, 2018 at 2:30 PM Yifan Zou  wrote:
>
>> Thank you all for making comments on this and I apologize for the late
>> reply.
>>
>> To clarify the concerns of testing locally, it is still able to run tests
>> without Docker. One of the purposes of this is to create an identical
>> environment as we are running in Jenkins that would be helpful to reproduce
>> strange errors. Contributors could choose starting a container and run
>> tests in there, or just run tests directly.
>>
>>
>>
>> On Wed, Sep 5, 2018 at 6:37 AM Ismaël Mejía  wrote:
>>
>>> BIG +1, the previous work on having docker build images [1] had a
>>> similar goal (to have a reproducible build environment). But this is
>>> even better because we will guarantee the exact same environment in
>>> Jenkins as well as any further improvements. It is important to
>>> document the setup process as part of this (for future maintenance +
>>> local reproducibility).
>>>
>>> Just for clarification this is independent of running the tests
>>> locally without docker, it is more to improve the reproducibility of
>>> the environment we have on jenkins locally for example to address some
>>> weird Heissenbug.
>>>
>>> I just added BEAM-5311 to track the removal of the docker build images
>>> when this is ready (of course if there are no objections to this
>>> proposal).
>>>
>>> [1] https://beam.apache.org/contribute/docker-images/
>>> On Thu, Aug 30, 2018 at 3:58 PM Jean-Baptiste Onofré 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > That's interesting, however, it's really important to still be able to
>>> > easily run test locally, without any VM/Docker required. It should be
>>> > activated by profile or so.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 27/08/2018 19:53, Yifan Zou wrote:
>>> > > Hi,
>>> > >
>>> > > I have a proposal for creating a reproducible environment for Jenkins
>>> > > tests by using docker container. The thing is, the environment
>>> > > configurations on Beam Jenkins slaves are sometimes different from
>>> > > developer's machines. Test failures on Jenkins may not be easy to
>>> > > reproduce locally. Also, it is not convenient for developers to add
>>> or
>>> > > modify underlying tools installed on Jenkins VMs, since they're
>>> managed
>>> > > by Apache Infra. This proposal is aimed to address those problems.
>>> > >
>>> > >
>>> https://docs.google.com/document/d/1y0YuQj_oZXC0uM5-gniG7r9-5gv2uiDhzbtgYYJW48c/edit#heading=h.bg2yi0wbhl9n
>>> > >
>>> > > Any comments are welcome. Thank you.
>>> > >
>>> > > Regards.
>>> > > Yifan
>>> > >
>>> >
>>> > --
>>> > Jean-Baptiste Onofré
>>> > jbono...@apache.org
>>> > http://blog.nanthrax.net
>>> > Talend - http://www.talend.com
>>>
>>
>


Jenkins build is back to normal : beam_Release_Gradle_NightlySnapshot #169

2018-09-11 Thread Apache Jenkins Server
See 




Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Lukasz Cwik
I was unaware that users would use multiple versions of Apache Beam on the
classpath at the same time. In that case I don't believe shading is
something that will be there number one problem since we don't have a
stable API surface between internal Apache Beam components.

For users who aren't using multiple Apache Beam packages, I would not
expect non Apache Beam packages to ever export anything underneath the
org.apache.beam package namespace.

Also, I did add tooling to our build process to make sure that we only
release classes underneath the org.apache.beam package namespace with the
validateShadedJarDoesntLeakNonOrgApacheBeamClasses[1] task.
1:
https://github.com/apache/beam/blob/a3f6f7e3b147f5a65e5b419d9baf24b35750974b/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L751

Romain, I think this is something we could continue outside of the release
thread. Feel free to start a new thread or follow up on Slack.

On Tue, Sep 11, 2018 at 2:48 PM Romain Manni-Bucau 
wrote:

> I understand Lukasz but it makes using shades properly pretty impossible
> since this warning is not just something you can ignore but something you
> have to fix since it can hide bugs. I get the "it is ok while you have a
> single beam version" point but why would you get only beam in your
> classpath, from the moment you use an IO it is not true anymore so this
> warning is key to ensure your deployment is under control. In general you
> accept something which fits the screen (like 20 overlapping classes or so)
> but having 6600 classes to check is way more than something which would be
> done just by a quick visual check. It requires you to add tooling on top of
> it which is not really good overall. Wonder if it wouldn't be better to
> revert that if it can't be completed short term and reapplied when possible
> (probably using a working branch).
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
>
>
> Le mar. 11 sept. 2018 à 23:41, Lukasz Cwik  a écrit :
>
>> Romain, the beam-model-fn-execution-2.7.0.jar,
>> beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar have
>> duplicates of the same classes to satisfy their dependencies (gRPC and
>> protobuf and their transitive dependencies). Producing a separate artifact
>> is still not done to prevent the message that your describing and other
>> then size of jars, that message is benign in this case.
>>
>> Note that much of our vendoring goal that the community had discussed and
>> agreed upon is still not unfinished, for example Guava:
>> https://issues.apache.org/jira/browse/BEAM-3608
>>
>>
>>
>> On Tue, Sep 11, 2018 at 2:29 PM Romain Manni-Bucau 
>> wrote:
>>
>>> BTW, did you notice that doing a shade now logs something like:
>>>
>>> [WARNING] beam-model-fn-execution-2.7.0.jar,
>>> beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar define
>>> 6660 overlapping classes:
>>> [WARNING]   -
>>> org.apache.beam.vendor.netty.v4.io.netty.handler.codec.http.HttpClientCodec$1
>>> [WARNING]   -
>>> org.apache.beam.vendor.guava.v20.com.google.common.util.concurrent.AggregateFutureState$SafeAtomicHelper
>>> [WARNING]   -
>>> org.apache.beam.vendor.netty.v4.io.netty.util.concurrent.DefaultFutureListeners
>>> [WARNING]   -
>>> org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.OpenSslSessionContext$1
>>> [WARNING]   -
>>> org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.Java9SslUtils$4
>>> [WARNING]   -
>>> org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableMultimap$Builder
>>> [WARNING]   -
>>> org.apache.beam.vendor.netty.v4.io.netty.handler.codec.spdy.SpdyHeaders
>>> [WARNING]   -
>>> org.apache.beam.vendor.protobuf.v3.com.google.protobuf.DescriptorProtos$FieldDescriptorProtoOrBuilder
>>> [WARNING]   -
>>> org.apache.beam.vendor.guava.v20.com.google.common.collect.AbstractMultimap
>>> [WARNING]   -
>>> org.apache.beam.vendor.guava.v20.com.google.common.io.BaseEncoding$3
>>> [WARNING]   - 6650 more...
>>>
>>> Looks like the new shading policy impl was merged a bit too fast ;)
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>>  | Book
>>> 
>>>
>>>
>>> Le mar. 11 sept. 2018 à 21:42, Jean-Baptiste Onofré  a
>>> écrit :
>>>
 I'm taking the Spark runner one.

 Regards
 JB

 On 11/09/2018 21:15, Ahmet Altay wrote:
 > Could anyone else help with looking at these issues earlier?
 >
 > On Tue, Sep 

Re: Spotless broken on master

2018-09-11 Thread Andrew Pilloud
I don't think spotless is included in the default test target. Jenkins runs
a more expanded ':javaPreCommit' gradle target.

Andrew

On Tue, Sep 11, 2018 at 2:32 PM Ismaël Mejía  wrote:

> Mmm this is weird, I tested this locally and passed without issue, I
> am wondering how could this happen.
> Thanks anyway for the quick fix.
>


Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Romain Manni-Bucau
I understand Lukasz but it makes using shades properly pretty impossible
since this warning is not just something you can ignore but something you
have to fix since it can hide bugs. I get the "it is ok while you have a
single beam version" point but why would you get only beam in your
classpath, from the moment you use an IO it is not true anymore so this
warning is key to ensure your deployment is under control. In general you
accept something which fits the screen (like 20 overlapping classes or so)
but having 6600 classes to check is way more than something which would be
done just by a quick visual check. It requires you to add tooling on top of
it which is not really good overall. Wonder if it wouldn't be better to
revert that if it can't be completed short term and reapplied when possible
(probably using a working branch).

Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book



Le mar. 11 sept. 2018 à 23:41, Lukasz Cwik  a écrit :

> Romain, the beam-model-fn-execution-2.7.0.jar,
> beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar have
> duplicates of the same classes to satisfy their dependencies (gRPC and
> protobuf and their transitive dependencies). Producing a separate artifact
> is still not done to prevent the message that your describing and other
> then size of jars, that message is benign in this case.
>
> Note that much of our vendoring goal that the community had discussed and
> agreed upon is still not unfinished, for example Guava:
> https://issues.apache.org/jira/browse/BEAM-3608
>
>
>
> On Tue, Sep 11, 2018 at 2:29 PM Romain Manni-Bucau 
> wrote:
>
>> BTW, did you notice that doing a shade now logs something like:
>>
>> [WARNING] beam-model-fn-execution-2.7.0.jar,
>> beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar define
>> 6660 overlapping classes:
>> [WARNING]   -
>> org.apache.beam.vendor.netty.v4.io.netty.handler.codec.http.HttpClientCodec$1
>> [WARNING]   -
>> org.apache.beam.vendor.guava.v20.com.google.common.util.concurrent.AggregateFutureState$SafeAtomicHelper
>> [WARNING]   -
>> org.apache.beam.vendor.netty.v4.io.netty.util.concurrent.DefaultFutureListeners
>> [WARNING]   -
>> org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.OpenSslSessionContext$1
>> [WARNING]   -
>> org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.Java9SslUtils$4
>> [WARNING]   -
>> org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableMultimap$Builder
>> [WARNING]   -
>> org.apache.beam.vendor.netty.v4.io.netty.handler.codec.spdy.SpdyHeaders
>> [WARNING]   -
>> org.apache.beam.vendor.protobuf.v3.com.google.protobuf.DescriptorProtos$FieldDescriptorProtoOrBuilder
>> [WARNING]   -
>> org.apache.beam.vendor.guava.v20.com.google.common.collect.AbstractMultimap
>> [WARNING]   -
>> org.apache.beam.vendor.guava.v20.com.google.common.io.BaseEncoding$3
>> [WARNING]   - 6650 more...
>>
>> Looks like the new shading policy impl was merged a bit too fast ;)
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>>
>> Le mar. 11 sept. 2018 à 21:42, Jean-Baptiste Onofré  a
>> écrit :
>>
>>> I'm taking the Spark runner one.
>>>
>>> Regards
>>> JB
>>>
>>> On 11/09/2018 21:15, Ahmet Altay wrote:
>>> > Could anyone else help with looking at these issues earlier?
>>> >
>>> > On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau
>>> > mailto:rmannibu...@gmail.com>> wrote:
>>> >
>>> > Im running this main [1] through this IT [2]. Was working fine
>>> since
>>> > ~1 year but 2.7.0 broke it. Didnt investigate more but can have a
>>> > look later this month if it helps.
>>> >
>>> > [1]
>>> >
>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>>> > <
>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>>> >
>>> > [2]
>>> >
>>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>>> > <
>>> 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Lukasz Cwik
Romain, the beam-model-fn-execution-2.7.0.jar,
beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar have
duplicates of the same classes to satisfy their dependencies (gRPC and
protobuf and their transitive dependencies). Producing a separate artifact
is still not done to prevent the message that your describing and other
then size of jars, that message is benign in this case.

Note that much of our vendoring goal that the community had discussed and
agreed upon is still not unfinished, for example Guava:
https://issues.apache.org/jira/browse/BEAM-3608



On Tue, Sep 11, 2018 at 2:29 PM Romain Manni-Bucau 
wrote:

> BTW, did you notice that doing a shade now logs something like:
>
> [WARNING] beam-model-fn-execution-2.7.0.jar,
> beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar define
> 6660 overlapping classes:
> [WARNING]   -
> org.apache.beam.vendor.netty.v4.io.netty.handler.codec.http.HttpClientCodec$1
> [WARNING]   -
> org.apache.beam.vendor.guava.v20.com.google.common.util.concurrent.AggregateFutureState$SafeAtomicHelper
> [WARNING]   -
> org.apache.beam.vendor.netty.v4.io.netty.util.concurrent.DefaultFutureListeners
> [WARNING]   -
> org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.OpenSslSessionContext$1
> [WARNING]   -
> org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.Java9SslUtils$4
> [WARNING]   -
> org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableMultimap$Builder
> [WARNING]   -
> org.apache.beam.vendor.netty.v4.io.netty.handler.codec.spdy.SpdyHeaders
> [WARNING]   -
> org.apache.beam.vendor.protobuf.v3.com.google.protobuf.DescriptorProtos$FieldDescriptorProtoOrBuilder
> [WARNING]   -
> org.apache.beam.vendor.guava.v20.com.google.common.collect.AbstractMultimap
> [WARNING]   -
> org.apache.beam.vendor.guava.v20.com.google.common.io.BaseEncoding$3
> [WARNING]   - 6650 more...
>
> Looks like the new shading policy impl was merged a bit too fast ;)
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
>
>
> Le mar. 11 sept. 2018 à 21:42, Jean-Baptiste Onofré  a
> écrit :
>
>> I'm taking the Spark runner one.
>>
>> Regards
>> JB
>>
>> On 11/09/2018 21:15, Ahmet Altay wrote:
>> > Could anyone else help with looking at these issues earlier?
>> >
>> > On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau
>> > mailto:rmannibu...@gmail.com>> wrote:
>> >
>> > Im running this main [1] through this IT [2]. Was working fine since
>> > ~1 year but 2.7.0 broke it. Didnt investigate more but can have a
>> > look later this month if it helps.
>> >
>> > [1]
>> >
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>> > <
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>> >
>> > [2]
>> >
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>> > <
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>> >
>> >
>> > Le mar. 11 sept. 2018 20:54, Charles Chen > > > a écrit :
>> >
>> > Romain: can you give more details on the failure you're
>> > encountering, i.e. how you are performing this validation?
>> >
>> > On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré
>> > mailto:j...@nanthrax.net>> wrote:
>> >
>> > Hi,
>> >
>> > weird, I didn't have it on Beam samples. Let me try to
>> > reproduce and I
>> > will create the Jira.
>> >
>> > Regards
>> > JB
>> >
>> > On 11/09/2018 11:44, Romain Manni-Bucau wrote:
>> >  > -1, seems spark integration is broken (tested with spark
>> > 2.3.1 and 2.2.1):
>> >  >
>> >  > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in
>> > stage 0.0 (TID 0, RMANNIBUCAU, executor 0):
>> > java.lang.ClassCastException: cannot assign instance of
>> > scala.collection.immutable.List$SerializationProxy to
>> > fieldorg.apache.spark.rdd.RDD.org
>> > 
>> > > > > 

Re: Spotless broken on master

2018-09-11 Thread Ismaël Mejía
Mmm this is weird, I tested this locally and passed without issue, I
am wondering how could this happen.
Thanks anyway for the quick fix.


Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Romain Manni-Bucau
BTW, did you notice that doing a shade now logs something like:

[WARNING] beam-model-fn-execution-2.7.0.jar,
beam-model-job-management-2.7.0.jar, beam-model-pipeline-2.7.0.jar define
6660 overlapping classes:
[WARNING]   -
org.apache.beam.vendor.netty.v4.io.netty.handler.codec.http.HttpClientCodec$1
[WARNING]   -
org.apache.beam.vendor.guava.v20.com.google.common.util.concurrent.AggregateFutureState$SafeAtomicHelper
[WARNING]   -
org.apache.beam.vendor.netty.v4.io.netty.util.concurrent.DefaultFutureListeners
[WARNING]   -
org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.OpenSslSessionContext$1
[WARNING]   -
org.apache.beam.vendor.netty.v4.io.netty.handler.ssl.Java9SslUtils$4
[WARNING]   -
org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableMultimap$Builder
[WARNING]   -
org.apache.beam.vendor.netty.v4.io.netty.handler.codec.spdy.SpdyHeaders
[WARNING]   -
org.apache.beam.vendor.protobuf.v3.com.google.protobuf.DescriptorProtos$FieldDescriptorProtoOrBuilder
[WARNING]   -
org.apache.beam.vendor.guava.v20.com.google.common.collect.AbstractMultimap
[WARNING]   -
org.apache.beam.vendor.guava.v20.com.google.common.io.BaseEncoding$3
[WARNING]   - 6650 more...

Looks like the new shading policy impl was merged a bit too fast ;)

Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book



Le mar. 11 sept. 2018 à 21:42, Jean-Baptiste Onofré  a
écrit :

> I'm taking the Spark runner one.
>
> Regards
> JB
>
> On 11/09/2018 21:15, Ahmet Altay wrote:
> > Could anyone else help with looking at these issues earlier?
> >
> > On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau
> > mailto:rmannibu...@gmail.com>> wrote:
> >
> > Im running this main [1] through this IT [2]. Was working fine since
> > ~1 year but 2.7.0 broke it. Didnt investigate more but can have a
> > look later this month if it helps.
> >
> > [1]
> >
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
> > <
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
> >
> > [2]
> >
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
> > <
> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
> >
> >
> > Le mar. 11 sept. 2018 20:54, Charles Chen  > > a écrit :
> >
> > Romain: can you give more details on the failure you're
> > encountering, i.e. how you are performing this validation?
> >
> > On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré
> > mailto:j...@nanthrax.net>> wrote:
> >
> > Hi,
> >
> > weird, I didn't have it on Beam samples. Let me try to
> > reproduce and I
> > will create the Jira.
> >
> > Regards
> > JB
> >
> > On 11/09/2018 11:44, Romain Manni-Bucau wrote:
> >  > -1, seems spark integration is broken (tested with spark
> > 2.3.1 and 2.2.1):
> >  >
> >  > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in
> > stage 0.0 (TID 0, RMANNIBUCAU, executor 0):
> > java.lang.ClassCastException: cannot assign instance of
> > scala.collection.immutable.List$SerializationProxy to
> > fieldorg.apache.spark.rdd.RDD.org
> > 
> >  >  >>$apache$spark$rdd$RDD$$dependencies_
> > of type scala.collection.Seq in instance of
> > org.apache.spark.rdd.MapPartitionsRDD
> >  >   at
> >
>  
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
> >  >
> >  >
> >  > Also the issue Lukasz identified is important even if
> > workarounds can be
> >  > put in place so +1 to fix it as well if possible.
> >  >
> >  > Romain Manni-Bucau
> >  > @rmannibucau  > > | Blog
> >  >  > 

Spotless broken on master

2018-09-11 Thread Andrew Pilloud
Looks like the Java PreCommit is broken due to a commit manually merged to
master. Thanks to Huygaa for finding it in our unstable tests.

Fix is here, I will merge when tests pass:
https://github.com/apache/beam/pull/6364

Andrew


Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Jean-Baptiste Onofré

I'm taking the Spark runner one.

Regards
JB

On 11/09/2018 21:15, Ahmet Altay wrote:

Could anyone else help with looking at these issues earlier?

On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau 
mailto:rmannibu...@gmail.com>> wrote:


Im running this main [1] through this IT [2]. Was working fine since
~1 year but 2.7.0 broke it. Didnt investigate more but can have a
look later this month if it helps.

[1]

https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java


[2]

https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java



Le mar. 11 sept. 2018 20:54, Charles Chen mailto:c...@google.com>> a écrit :

Romain: can you give more details on the failure you're
encountering, i.e. how you are performing this validation?

On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré
mailto:j...@nanthrax.net>> wrote:

Hi,

weird, I didn't have it on Beam samples. Let me try to
reproduce and I
will create the Jira.

Regards
JB

On 11/09/2018 11:44, Romain Manni-Bucau wrote:
 > -1, seems spark integration is broken (tested with spark
2.3.1 and 2.2.1):
 >
 > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in
stage 0.0 (TID 0, RMANNIBUCAU, executor 0):
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to
fieldorg.apache.spark.rdd.RDD.org

>$apache$spark$rdd$RDD$$dependencies_
of type scala.collection.Seq in instance of
org.apache.spark.rdd.MapPartitionsRDD
 >       at

java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
 >
 >
 > Also the issue Lukasz identified is important even if
workarounds can be
 > put in place so +1 to fix it as well if possible.
 >
 > Romain Manni-Bucau
 > @rmannibucau > | Blog
 > > | Old Blog
 > > | Github
 > > | LinkedIn
 > > | Book
 >

>
 >
 >
 > Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik
mailto:lc...@google.com>
 > >> a
écrit :
 >
 >     I found an issue where we are no longer packaging the
pom.xml within
 >     the artifact jars at
META-INF/maven/groupId/artifactId. More details
 >     in https://issues.apache.org/jira/browse/BEAM-5351
. I wouldn't
 >     consider this a blocker but it was an easy fix
 >     (https://github.com/apache/beam/pull/6358
) and users may
rely on the
 >     pom.xml.
 >
 >     Should we recut the release candidate to include this?
 >
 >     On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
 >     mailto:j...@nanthrax.net>
>> wrote:
 >
 >         +1 (binding)
 >
 >         Tested successfully on Beam Samples.
 >
 >         Thanks !
 >
 >         Regards
 >         JB
 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Charles Chen
The SparkRunner validation test (here:
https://beam.apache.org/contribute/release-guide/#run-validation-tests)
passes on my machine.  It looks like we are likely missing test coverage
where Romain is hitting issues.

On Tue, Sep 11, 2018 at 12:15 PM Ahmet Altay  wrote:

> Could anyone else help with looking at these issues earlier?
>
> On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Im running this main [1] through this IT [2]. Was working fine since ~1
>> year but 2.7.0 broke it. Didnt investigate more but can have a look later
>> this month if it helps.
>>
>> [1]
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
>> [2]
>> https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java
>>
>> Le mar. 11 sept. 2018 20:54, Charles Chen  a écrit :
>>
>>> Romain: can you give more details on the failure you're encountering,
>>> i.e. how you are performing this validation?
>>>
>>> On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Hi,

 weird, I didn't have it on Beam samples. Let me try to reproduce and I
 will create the Jira.

 Regards
 JB

 On 11/09/2018 11:44, Romain Manni-Bucau wrote:
 > -1, seems spark integration is broken (tested with spark 2.3.1 and
 2.2.1):
 >
 > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0
 (TID 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot
 assign instance of scala.collection.immutable.List$SerializationProxy to
 fieldorg.apache.spark.rdd.RDD.org 
 $apache$spark$rdd$RDD$$dependencies_
 of type scala.collection.Seq in instance of
 org.apache.spark.rdd.MapPartitionsRDD
 >   at
 java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
 >
 >
 > Also the issue Lukasz identified is important even if workarounds can
 be
 > put in place so +1 to fix it as well if possible.
 >
 > Romain Manni-Bucau
 > @rmannibucau  | Blog
 >  | Old Blog
 >  | Github
 >  | LinkedIn
 >  | Book
 > <
 https://www.packtpub.com/application-development/java-ee-8-high-performance
 >
 >
 >
 > Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik >>> > > a écrit :
 >
 > I found an issue where we are no longer packaging the pom.xml
 within
 > the artifact jars at META-INF/maven/groupId/artifactId. More
 details
 > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
 > consider this a blocker but it was an easy fix
 > (https://github.com/apache/beam/pull/6358) and users may rely on
 the
 > pom.xml.
 >
 > Should we recut the release candidate to include this?
 >
 > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
 > mailto:j...@nanthrax.net>> wrote:
 >
 > +1 (binding)
 >
 > Tested successfully on Beam Samples.
 >
 > Thanks !
 >
 > Regards
 > JB
 >
 > On 07/09/2018 23:56, Charles Chen wrote:
 >  > Hi everyone,
 >  >
 >  > Please review and vote on the release candidate #1 for the
 > version
 >  > 2.7.0, as follows:
 >  > [ ] +1, Approve the release
 >  > [ ] -1, Do not approve the release (please provide specific
 > comments)
 >  >
 >  > The complete staging area is available for your review,
 which
 > includes:
 >  > * JIRA release notes [1],
 >  > * the official Apache source release to be deployed to
 > dist.apache.org 
 >  >  [2], which is signed with the
 key with
 >  > fingerprint 45C60AAAD115F560 [3],
 >  > * all artifacts to be deployed to the Maven Central
 > Repository [4],
 >  > * source code tag "v2.7.0-RC1" [5],
 >  > * website pull request listing the release and publishing
 the API
 >  > reference manual [6].
 >  > * Java artifacts were built with Gradle 4.8 and OpenJDK
 >  > 1.8.0_181-8u181-b13-1~deb9u1-b13.
 >  > * Python artifacts are deployed along with the source
 release
 > to the
 >  > dist.apache.org 
 > 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Ahmet Altay
Could anyone else help with looking at these issues earlier?

On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau 
wrote:

> Im running this main [1] through this IT [2]. Was working fine since ~1
> year but 2.7.0 broke it. Didnt investigate more but can have a look later
> this month if it helps.
>
> [1] https://github.com/Talend/component-runtime/blob/master/
> component-runtime-beam/src/it/serialization-over-cluster/
> src/main/java/org/talend/sdk/component/beam/it/
> clusterserialization/Main.java
> [2] https://github.com/Talend/component-runtime/blob/master/
> component-runtime-beam/src/it/serialization-over-cluster/
> src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.
> java
>
> Le mar. 11 sept. 2018 20:54, Charles Chen  a écrit :
>
>> Romain: can you give more details on the failure you're encountering,
>> i.e. how you are performing this validation?
>>
>> On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi,
>>>
>>> weird, I didn't have it on Beam samples. Let me try to reproduce and I
>>> will create the Jira.
>>>
>>> Regards
>>> JB
>>>
>>> On 11/09/2018 11:44, Romain Manni-Bucau wrote:
>>> > -1, seems spark integration is broken (tested with spark 2.3.1 and
>>> 2.2.1):
>>> >
>>> > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
>>> 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign
>>> instance of scala.collection.immutable.List$SerializationProxy to
>>> fieldorg.apache.spark.rdd.RDD.org >> >$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in
>>> instance of org.apache.spark.rdd.MapPartitionsRDD
>>> >   at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(
>>> ObjectStreamClass.java:2233)
>>> >
>>> >
>>> > Also the issue Lukasz identified is important even if workarounds can
>>> be
>>> > put in place so +1 to fix it as well if possible.
>>> >
>>> > Romain Manni-Bucau
>>> > @rmannibucau  | Blog
>>> >  | Old Blog
>>> >  | Github
>>> >  | LinkedIn
>>> >  | Book
>>> > >> ee-8-high-performance>
>>> >
>>> >
>>> > Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik >> > > a écrit :
>>> >
>>> > I found an issue where we are no longer packaging the pom.xml
>>> within
>>> > the artifact jars at META-INF/maven/groupId/artifactId. More
>>> details
>>> > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
>>> > consider this a blocker but it was an easy fix
>>> > (https://github.com/apache/beam/pull/6358) and users may rely on
>>> the
>>> > pom.xml.
>>> >
>>> > Should we recut the release candidate to include this?
>>> >
>>> > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
>>> > mailto:j...@nanthrax.net>> wrote:
>>> >
>>> > +1 (binding)
>>> >
>>> > Tested successfully on Beam Samples.
>>> >
>>> > Thanks !
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 07/09/2018 23:56, Charles Chen wrote:
>>> >  > Hi everyone,
>>> >  >
>>> >  > Please review and vote on the release candidate #1 for the
>>> > version
>>> >  > 2.7.0, as follows:
>>> >  > [ ] +1, Approve the release
>>> >  > [ ] -1, Do not approve the release (please provide specific
>>> > comments)
>>> >  >
>>> >  > The complete staging area is available for your review,
>>> which
>>> > includes:
>>> >  > * JIRA release notes [1],
>>> >  > * the official Apache source release to be deployed to
>>> > dist.apache.org 
>>> >  >  [2], which is signed with the key
>>> with
>>> >  > fingerprint 45C60AAAD115F560 [3],
>>> >  > * all artifacts to be deployed to the Maven Central
>>> > Repository [4],
>>> >  > * source code tag "v2.7.0-RC1" [5],
>>> >  > * website pull request listing the release and publishing
>>> the API
>>> >  > reference manual [6].
>>> >  > * Java artifacts were built with Gradle 4.8 and OpenJDK
>>> >  > 1.8.0_181-8u181-b13-1~deb9u1-b13.
>>> >  > * Python artifacts are deployed along with the source
>>> release
>>> > to the
>>> >  > dist.apache.org 
>>> >  [2].
>>> >  >
>>> >  > The vote will be open for at least 72 hours. It is adopted
>>> by
>>> > majority
>>> >  > approval, with at least 3 PMC affirmative votes.
>>> >  >
>>> >  > Thanks,
>>> >  > Charles
>>> >  >
>>> >  > [1]
>>> >  >
>>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
>>> 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Romain Manni-Bucau
Im running this main [1] through this IT [2]. Was working fine since ~1
year but 2.7.0 broke it. Didnt investigate more but can have a look later
this month if it helps.

[1]
https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
[2]
https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java

Le mar. 11 sept. 2018 20:54, Charles Chen  a écrit :

> Romain: can you give more details on the failure you're encountering, i.e.
> how you are performing this validation?
>
> On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> weird, I didn't have it on Beam samples. Let me try to reproduce and I
>> will create the Jira.
>>
>> Regards
>> JB
>>
>> On 11/09/2018 11:44, Romain Manni-Bucau wrote:
>> > -1, seems spark integration is broken (tested with spark 2.3.1 and
>> 2.2.1):
>> >
>> > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
>> 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign
>> instance of scala.collection.immutable.List$SerializationProxy to
>> fieldorg.apache.spark.rdd.RDD.org 
>> $apache$spark$rdd$RDD$$dependencies_
>> of type scala.collection.Seq in instance of
>> org.apache.spark.rdd.MapPartitionsRDD
>> >   at
>> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
>> >
>> >
>> > Also the issue Lukasz identified is important even if workarounds can
>> be
>> > put in place so +1 to fix it as well if possible.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau  | Blog
>> >  | Old Blog
>> >  | Github
>> >  | LinkedIn
>> >  | Book
>> > <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> > Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik > > > a écrit :
>> >
>> > I found an issue where we are no longer packaging the pom.xml within
>> > the artifact jars at META-INF/maven/groupId/artifactId. More details
>> > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
>> > consider this a blocker but it was an easy fix
>> > (https://github.com/apache/beam/pull/6358) and users may rely on
>> the
>> > pom.xml.
>> >
>> > Should we recut the release candidate to include this?
>> >
>> > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
>> > mailto:j...@nanthrax.net>> wrote:
>> >
>> > +1 (binding)
>> >
>> > Tested successfully on Beam Samples.
>> >
>> > Thanks !
>> >
>> > Regards
>> > JB
>> >
>> > On 07/09/2018 23:56, Charles Chen wrote:
>> >  > Hi everyone,
>> >  >
>> >  > Please review and vote on the release candidate #1 for the
>> > version
>> >  > 2.7.0, as follows:
>> >  > [ ] +1, Approve the release
>> >  > [ ] -1, Do not approve the release (please provide specific
>> > comments)
>> >  >
>> >  > The complete staging area is available for your review, which
>> > includes:
>> >  > * JIRA release notes [1],
>> >  > * the official Apache source release to be deployed to
>> > dist.apache.org 
>> >  >  [2], which is signed with the key
>> with
>> >  > fingerprint 45C60AAAD115F560 [3],
>> >  > * all artifacts to be deployed to the Maven Central
>> > Repository [4],
>> >  > * source code tag "v2.7.0-RC1" [5],
>> >  > * website pull request listing the release and publishing
>> the API
>> >  > reference manual [6].
>> >  > * Java artifacts were built with Gradle 4.8 and OpenJDK
>> >  > 1.8.0_181-8u181-b13-1~deb9u1-b13.
>> >  > * Python artifacts are deployed along with the source release
>> > to the
>> >  > dist.apache.org 
>> >  [2].
>> >  >
>> >  > The vote will be open for at least 72 hours. It is adopted by
>> > majority
>> >  > approval, with at least 3 PMC affirmative votes.
>> >  >
>> >  > Thanks,
>> >  > Charles
>> >  >
>> >  > [1]
>> >  >
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
>> >  > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>> >  > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>> >  > [4]
>> >
>> https://repository.apache.org/content/repositories/orgapachebeam-1046/
>> >  > [5] 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Charles Chen
Romain: can you give more details on the failure you're encountering, i.e.
how you are performing this validation?

On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> weird, I didn't have it on Beam samples. Let me try to reproduce and I
> will create the Jira.
>
> Regards
> JB
>
> On 11/09/2018 11:44, Romain Manni-Bucau wrote:
> > -1, seems spark integration is broken (tested with spark 2.3.1 and
> 2.2.1):
> >
> > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
> 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign
> instance of scala.collection.immutable.List$SerializationProxy to
> fieldorg.apache.spark.rdd.RDD.org 
> $apache$spark$rdd$RDD$$dependencies_
> of type scala.collection.Seq in instance of
> org.apache.spark.rdd.MapPartitionsRDD
> >   at
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
> >
> >
> > Also the issue Lukasz identified is important even if workarounds can be
> > put in place so +1 to fix it as well if possible.
> >
> > Romain Manni-Bucau
> > @rmannibucau  | Blog
> >  | Old Blog
> >  | Github
> >  | LinkedIn
> >  | Book
> > <
> https://www.packtpub.com/application-development/java-ee-8-high-performance
> >
> >
> >
> > Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik  > > a écrit :
> >
> > I found an issue where we are no longer packaging the pom.xml within
> > the artifact jars at META-INF/maven/groupId/artifactId. More details
> > in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
> > consider this a blocker but it was an easy fix
> > (https://github.com/apache/beam/pull/6358) and users may rely on the
> > pom.xml.
> >
> > Should we recut the release candidate to include this?
> >
> > On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
> > mailto:j...@nanthrax.net>> wrote:
> >
> > +1 (binding)
> >
> > Tested successfully on Beam Samples.
> >
> > Thanks !
> >
> > Regards
> > JB
> >
> > On 07/09/2018 23:56, Charles Chen wrote:
> >  > Hi everyone,
> >  >
> >  > Please review and vote on the release candidate #1 for the
> > version
> >  > 2.7.0, as follows:
> >  > [ ] +1, Approve the release
> >  > [ ] -1, Do not approve the release (please provide specific
> > comments)
> >  >
> >  > The complete staging area is available for your review, which
> > includes:
> >  > * JIRA release notes [1],
> >  > * the official Apache source release to be deployed to
> > dist.apache.org 
> >  >  [2], which is signed with the key
> with
> >  > fingerprint 45C60AAAD115F560 [3],
> >  > * all artifacts to be deployed to the Maven Central
> > Repository [4],
> >  > * source code tag "v2.7.0-RC1" [5],
> >  > * website pull request listing the release and publishing the
> API
> >  > reference manual [6].
> >  > * Java artifacts were built with Gradle 4.8 and OpenJDK
> >  > 1.8.0_181-8u181-b13-1~deb9u1-b13.
> >  > * Python artifacts are deployed along with the source release
> > to the
> >  > dist.apache.org 
> >  [2].
> >  >
> >  > The vote will be open for at least 72 hours. It is adopted by
> > majority
> >  > approval, with at least 3 PMC affirmative votes.
> >  >
> >  > Thanks,
> >  > Charles
> >  >
> >  > [1]
> >  >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
> >  > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
> >  > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
> >  > [4]
> >
> https://repository.apache.org/content/repositories/orgapachebeam-1046/
> >  > [5] https://github.com/apache/beam/tree/v2.7.0-RC1
> >  > [6] https://github.com/apache/beam-site/pull/549
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


JIRA permissions request

2018-09-11 Thread Connell O'Callaghan
Hi dev@,

There are quite a few efforts in flight that have a lot of identified work
that needs a bit of project management to better communicate what is being
worked on and in what order across the community -- Portability framework,
portable runners, and SQL being examples that come to mind. Rafael,
Henning, and I want to work with JIRA's tools to produce (and publish) the
necessary dashboards, reports, and views. We appear to be unable to share
dashboards we create with the entire project due to a lack of permissions.
Can someone explain to us how we can create and then share them? Otherwise,
if it's just a permissions issue, is it possible to be given the necessary
permissions?

Thank you in advance,
- Connell


Re: PTransforms and Fusion

2018-09-11 Thread Henning Rohde
Empty pipelines have neither subtransforms or a spec, which is what I don't
think is useful -- especially given the only usecase (which is really
"nop") would create non-timer loops in the representations. I'd rather have
a well-known nop primitive instead. Even now, for the A example, I don't
think it's unreasonable to add a (well-known) identity transform inside a
normal composite to retain the extrema at either end. It could be ignored
at runtime at no cost.

To clarify my support for A1, native transforms would have a spec and would
be passed through in the shared code even through they're not primitives.


On Tue, Sep 11, 2018 at 12:56 AM Robert Bradshaw 
wrote:

> For (A), it really boils down to the question of what is a legal pipeline.
> A1 takes the position that all empty transforms must be on a whitelist
> (which implies B1, unless we make the whitelist extensible, which starts to
> sound a lot like B3). Presumably if we want to support B2, we cannot remove
> all empty unknown transforms, just those whose outputs are a subset of the
> inputs.
>
> The reason I strongly support A3 is that empty PTransforms are not just
> noise, they are expressions of user intent, and the pipeline graph should
> reflect that as faithfully as possible. This is the whole point of
> composite transforms--one should not be required to expose what is inside
> (even whether it's empty). Consider, for example, an A, B -> C transform
> that mixes A and B in proportions to produce C. In the degenerate case
> where we want 100% for A or 100% from B, it's reasonable to implement this
> by just returning A or B directly. But when, say, visualizing the pipeline
> graph, I don't think it's desirable to have the discontinuity of the
> composite transform suddenly disappearing when the mixing parameter is at
> either extreme.
>
> If a runner cannot handle these empty pipelines (as is the case for those
> relying on the current Java libraries) it is an easy matter for it to drop
> them, but that doesn't mean we should withhold this information (by making
> it illegal and dropping it in every SDK) from a runner (or any other tool)
> that would want to see this information.
>
> - Robert
>
>
> On Tue, Sep 11, 2018 at 4:20 AM Henning Rohde  wrote:
>
>> For A, I am in favor of A1 and A2 as well. It is then up to each SDK to
>> not generate "empty" transforms in the proto representation as we avoid
>> noise as mentioned. The shared Java libraries are also optional and we
>> should not assume every runner will use them. I'm not convinced empty
>> transforms would have value for pipeline structure over what can be
>> accomplished with normal composites. I suspect empty transforms, such as A,
>> B -> B, B, will just be confusion generators.
>>
>> For B, I favor B2 for the reasons Thomas mentions. I also agree with the
>> -1 for B1.
>>
>> On Mon, Sep 10, 2018 at 2:51 PM Thomas Weise  wrote:
>>
>>> For B, note the prior discussion [1].
>>>
>>> B1 and B2 cannot be supported at the same time.
>>>
>>> Native transforms will almost always be customizations. Users do not
>>> create customizations without reason. They would start with what is there
>>> and dig deeper only when needed. Right now there are no streaming
>>> connectors in the Python SDK - should the user not use the SDK? Or is it
>>> better (now and in general) to have the option of a custom connector, even
>>> when it is not portable?
>>>
>>> Regarding portability, IMO it should be up to the user to decide how
>>> much of it is necessary/important. The IO requirements are normally
>>> dictated by the infrastructure. If it has Kafka, Kinesis or any other
>>> source (including those that Beam might never have a connector for), the
>>> user needs to be able to integrate it.
>>>
>>> Overall extensibility is important and will help users adopt Beam. This
>>> has come up in a few other areas (think Docker environments). I think we
>>> need to provide the flexibility and enable, not prevent alternatives and
>>> therefore -1 for B1 (unsurprisingly :).
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw 
>>> wrote:
>>>
 A) I think it's a bug to not handle empty PTransforms (which are useful
 at pipeline construction, and may still have meaning in terms of pipeline
 structure, e.g. for visualization). Note that such transforms, if truly
 composite, can't output any PCollections that do not appear in their inputs
 (which is how we distinguish them from primitives in Python). Thus I'm in
 favor of A3, and as a stopgap we can drop these transforms as part of/just
 before decoding in the Java libraries (rather than in the SDKs during
 encoding as in A2).

 B) I'm also for B1 or B2.


 On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels 
 wrote:

> > A) What should we do with these "empty" 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Jean-Baptiste Onofré

Hi,

weird, I didn't have it on Beam samples. Let me try to reproduce and I 
will create the Jira.


Regards
JB

On 11/09/2018 11:44, Romain Manni-Bucau wrote:

-1, seems spark integration is broken (tested with spark 2.3.1 and 2.2.1):

18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.List$SerializationProxy to 
fieldorg.apache.spark.rdd.RDD.org 
$apache$spark$rdd$RDD$$dependencies_ of 
type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)


Also the issue Lukasz identified is important even if workarounds can be 
put in place so +1 to fix it as well if possible.


Romain Manni-Bucau
@rmannibucau  | Blog 
 | Old Blog 
 | Github 
 | LinkedIn 
 | Book 




Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik > a écrit :


I found an issue where we are no longer packaging the pom.xml within
the artifact jars at META-INF/maven/groupId/artifactId. More details
in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
consider this a blocker but it was an easy fix
(https://github.com/apache/beam/pull/6358) and users may rely on the
pom.xml.

Should we recut the release candidate to include this?

On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
mailto:j...@nanthrax.net>> wrote:

+1 (binding)

Tested successfully on Beam Samples.

Thanks !

Regards
JB

On 07/09/2018 23:56, Charles Chen wrote:
 > Hi everyone,
 >
 > Please review and vote on the release candidate #1 for the
version
 > 2.7.0, as follows:
 > [ ] +1, Approve the release
 > [ ] -1, Do not approve the release (please provide specific
comments)
 >
 > The complete staging area is available for your review, which
includes:
 > * JIRA release notes [1],
 > * the official Apache source release to be deployed to
dist.apache.org 
 >  [2], which is signed with the key with
 > fingerprint 45C60AAAD115F560 [3],
 > * all artifacts to be deployed to the Maven Central
Repository [4],
 > * source code tag "v2.7.0-RC1" [5],
 > * website pull request listing the release and publishing the API
 > reference manual [6].
 > * Java artifacts were built with Gradle 4.8 and OpenJDK
 > 1.8.0_181-8u181-b13-1~deb9u1-b13.
 > * Python artifacts are deployed along with the source release
to the
 > dist.apache.org 
 [2].
 >
 > The vote will be open for at least 72 hours. It is adopted by
majority
 > approval, with at least 3 PMC affirmative votes.
 >
 > Thanks,
 > Charles
 >
 > [1]
 >

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
 > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
 > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
 > [4]
https://repository.apache.org/content/repositories/orgapachebeam-1046/
 > [5] https://github.com/apache/beam/tree/v2.7.0-RC1
 > [6] https://github.com/apache/beam-site/pull/549

-- 
Jean-Baptiste Onofré

jbono...@apache.org 
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Maximilian Michels
Could we still include some fixes for the RC2? I just discovered two 
JIRA issues which were not properly marked with "Fix Version".


https://issues.apache.org/jira/browse/BEAM-5239
https://issues.apache.org/jira/browse/BEAM-5246

They are not show-stoppers, so also fine with me if we don't backport them.

-Max

On 11.09.18 11:44, Romain Manni-Bucau wrote:

-1, seems spark integration is broken (tested with spark 2.3.1 and 2.2.1):

18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.List$SerializationProxy to 
fieldorg.apache.spark.rdd.RDD.org 
$apache$spark$rdd$RDD$$dependencies_ of 
type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)


Also the issue Lukasz identified is important even if workarounds can be 
put in place so +1 to fix it as well if possible.


Romain Manni-Bucau
@rmannibucau  | Blog 
 | Old Blog 
 | Github 
 | LinkedIn 
 | Book 




Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik > a écrit :


I found an issue where we are no longer packaging the pom.xml within
the artifact jars at META-INF/maven/groupId/artifactId. More details
in https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't
consider this a blocker but it was an easy fix
(https://github.com/apache/beam/pull/6358) and users may rely on the
pom.xml.

Should we recut the release candidate to include this?

On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré
mailto:j...@nanthrax.net>> wrote:

+1 (binding)

Tested successfully on Beam Samples.

Thanks !

Regards
JB

On 07/09/2018 23:56, Charles Chen wrote:
 > Hi everyone,
 >
 > Please review and vote on the release candidate #1 for the
version
 > 2.7.0, as follows:
 > [ ] +1, Approve the release
 > [ ] -1, Do not approve the release (please provide specific
comments)
 >
 > The complete staging area is available for your review, which
includes:
 > * JIRA release notes [1],
 > * the official Apache source release to be deployed to
dist.apache.org 
 >  [2], which is signed with the key with
 > fingerprint 45C60AAAD115F560 [3],
 > * all artifacts to be deployed to the Maven Central
Repository [4],
 > * source code tag "v2.7.0-RC1" [5],
 > * website pull request listing the release and publishing the API
 > reference manual [6].
 > * Java artifacts were built with Gradle 4.8 and OpenJDK
 > 1.8.0_181-8u181-b13-1~deb9u1-b13.
 > * Python artifacts are deployed along with the source release
to the
 > dist.apache.org 
 [2].
 >
 > The vote will be open for at least 72 hours. It is adopted by
majority
 > approval, with at least 3 PMC affirmative votes.
 >
 > Thanks,
 > Charles
 >
 > [1]
 >

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
 > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
 > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
 > [4]
https://repository.apache.org/content/repositories/orgapachebeam-1046/
 > [5] https://github.com/apache/beam/tree/v2.7.0-RC1
 > [6] https://github.com/apache/beam-site/pull/549

-- 
Jean-Baptiste Onofré

jbono...@apache.org 
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: [portablility] metrics interrogations

2018-09-11 Thread Robert Bradshaw
On Mon, Sep 10, 2018 at 11:07 AM Etienne Chauchot 
wrote:

> Hi all,
>
> @Luke, @Alex I have a general question related to metrics in the Fn API:
> as the communication between runner harness and SDK harness is done on a
> bundle basis. When the runner harness sends data to the sdk harness to
> execute a transform that contains metrics, does it:
>
>1. send metrics values (for the ones defined in the transform)
>alongside with data and receive an updated value of the metrics from the
>sdk harness when the bundle is finished processing?
>2. or does it send only the data and the sdk harness responds with a
>diff value of the metrics so that the runner can update them in its side?
>
> My bet is option 2. But can you confirm?
>

The runner harness periodically asks for the status of a bundle to which
the runner harness may respond with a current snapshot of metrics. These
metrics are deltas in the sense that only "dirty" metrics need to be
reported (i.e. unreported metrics can be assumed to have their previous
values) but are *not* deltas with respect to values, i.e. the full value is
reported each time. As an example, suppose one were counting red and blue
marbles. The first update may be something like

{ red: 5, blue: 7}

and if two more blue ones were found, a valid update would be

{ blue: 9 }

On bundle completion, the full set of metrics is reported as part of the
same message that declares the bundle complete.



On Tue, Sep 11, 2018 at 11:43 AM Etienne Chauchot 
wrote:

> Le lundi 10 septembre 2018 à 09:42 -0700, Lukasz Cwik a écrit :
>
> Alex is out on vacation for the next 3 weeks.
>
> Alex had proposed the types of metrics[1] but not the exact protocol as to
> what the SDK and runner do. I could envision Alex proposing that the SDK
> harness only sends diffs or dirty metrics in intermediate updates and all
> metrics values in the final update.
> Robert is referring to an integration that happened to an older set of
> messages[2] that preceeded Alex's proposal and that integration with
> Dataflow which is still incomplete works as you described in #2.
>
>
> Thanks Luke and Robert for the confirmation.
>
>
> Robin had recently been considering adding an accessor to DoFns that would
> allow you to get access to the job information from within the pipeline
> (current state, poll for metrics, invoke actions like cancel / drain, ...).
> He wanted it so he could poll for attempted metrics to be able to test
> @RequiresStableInput.
>
> Yes, I remember, I voted +1 to his proposal.
>
> Integrating the MetricsPusher or something like that on the SDK side to be
> able to poll metrics over the job information accessor could be useful.
>
>
> Well, in the design discussion, we decided to host Metrics Pusher as close
> as possible of the actual engine (inside the runner code chosen over the
> sdk code) to allow the runner to send system metrics in the future.
>

+1. The runner harness can then do whatever it wants (e.g. reporting back
to its master, or pushing to another service, or simply dropping them), but
the SDKs only have to follow the FnAPI contract.


>
> 1: https://s.apache.org/beam-fn-api-metrics
> 2:
> https://github.com/apache/beam/blob/9b68f926628d727e917b6a33ccdafcfe693eef6a/model/fn-execution/src/main/proto/beam_fn_api.proto#L410
>
>
> Besides, in his PR Alex talks about deprecated metrics. As he is off, can
> you tell me a little more about them ? What metrics will be deprecated when
> the portability framework is 100% operational on all the runners?
>

Currently, the SDKs return metrics to the FnAPI via the proto found at
https://github.com/apache/beam/blob/release-2.6.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L410
(and specifically user metrics at
https://github.com/apache/beam/blob/release-2.6.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L483
) The new metrics are the nested one-ofs defined at
https://github.com/apache/beam/blob/release-2.6.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L269


>


Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-11 Thread Thomas Weise
I'm in favor of a combination of 2) and 3): New module
"hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify
what it is). Turn existing " hadoop-input-format" into a proxy for new
module for backward compatibility (marked deprecated and removed in next
major version).

I don't think everything "Hadoop" should be merged, purpose and usage is
just too different. As an example, the Hadoop file system abstraction
itself has implementation for multiple other systems and is not limited to
HDFS.

On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko 
wrote:

> Dharmendra,
> For now, you can’t write with Hadoop MapReduce OutputFormat. However, you
> can use FileIO or TextIO to write to HDFS, these IOs support different file
> systems.
>
> On 11 Sep 2018, at 11:11, dharmendra pratap singh <
> dharmendra0...@gmail.com> wrote:
>
> Hello Team,
> Does this mean, as of today we can read from Hadoop FS but can't write to
> Hadoop FS using Beam HDFS API ?
>
> Regards
> Dharmendra
>
> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko 
> wrote:
>
>> Hello everyone,
>>
>> I’d like to discuss the following topic (see below) with community since
>> the optimal solution is not clear for me.
>>
>> There is Java IO module, called “*hadoop-input-format*”, which allows to
>> use MapReduce InputFormat implementations to read data from different
>> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>> According to its name, it has only “Read" and it's missing “Write” part,
>> so, I'm working on “*hadoop-output-format*” to support MapReduce
>> OutputFormat (PR 6306 ). For
>> this I created another module with this name. So, in the end, we will have
>> two different modules “*hadoop-input-format*” and “*hadoop-output-format*”
>> and it looks quite strange for me since, afaik, every existed Java IO, that
>> we have, incapsulates Read and Write parts into one module. Additionally,
>> we have “*hadoop-common*” and *“hadoop-file-system*” as other
>> hadoop-related modules.
>>
>> Now I’m thinking how it will be better to organise all these Hadoop
>> modules better. There are several options in my mind:
>>
>> 1) Add new module “*hadoop-output-format*” and leave all Hadoop modules
>> “as it is”.
>> Pros: no breaking changes, no additional work
>> Cons: not logical for users to have the same IO in two different modules
>> and with different names.
>>
>> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”,
>> keep the other Hadoop modules “as it is”.
>> Pros: to have InputFormat/OutputFormat in one IO module which is logical
>> for users
>> Cons: breaking changes for user code because of module/IO renaming
>>
>> 3) Add new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
>> which will include new “write” functionality and be a proxy for old “
>> *hadoop-input-format*”. In its turn, “*hadoop-input-format*” should
>> become deprecated and be finally moved to common “*hadoop-format*”
>> module in future releases. Keep the other Hadoop modules “as it is”.
>> Pros: finally it will be only one module for hadoop MR format; changes
>> are less painful for user
>> Cons: hidden difficulties of implementation this strategy; a bit
>> confusing for user
>>
>> 4) Add new module “*hadoop*” and move all already existed modules there
>> as submodules (like we have for “*io/google-cloud-platform*”), merge “
>> *hadoop-input-format*” and “*hadoop-output-format*” into one module.
>> Pros: unification of all hadoop-related modules
>> Cons: breaking changes for user code, additional complexity with deps and
>> testing
>>
>> 5) Your suggestion?..
>>
>> My personal preferences are lying between 2 and 3 (if 3 is possible).
>>
>> I’m wondering if there were similar situations in Beam before and how it
>> was finally resolved. If yes then probably we need to do here in similar
>> way.
>> Any suggestions/advices/comments would be very appreciated.
>>
>> Thanks,
>> Alexey
>>
>
>


Re: How to implement repartition.

2018-09-11 Thread Robert Bradshaw
Does Reshuffle do what you want?

On Tue, Sep 11, 2018, 3:46 PM devinduan(段丁瑞)  wrote:

> Hi all:
> I recently start studying the Beam on spark runner.
> I want to implement a method *repartition* similar to Spark
> *rdd.repartition()* , but I can't find a solution.
> Could anyone help me?
> Thanks for your reply.
> devin.
>


How to implement repartition.

2018-09-11 Thread 段丁瑞
Hi all:
I recently start studying the Beam on spark runner.
I want to implement a method repartition similar to Spark rdd.repartition() 
, but I can't find a solution.
Could anyone help me?
Thanks for your reply.
devin.


Re: [Proposal] Creating a reproducible environment for Beam Jenkins Tests

2018-09-11 Thread Alexey Romanenko
+1 Great feature that should help with complicated error cases.

> On 11 Sep 2018, at 03:39, Henning Rohde  wrote:
> 
> +1 Nice proposal. It will help eradicate some of the inflexibility and 
> frustrations with Jenkins.
> 
> On Wed, Sep 5, 2018 at 2:30 PM Yifan Zou  > wrote:
> Thank you all for making comments on this and I apologize for the late reply. 
> 
> To clarify the concerns of testing locally, it is still able to run tests 
> without Docker. One of the purposes of this is to create an identical 
> environment as we are running in Jenkins that would be helpful to reproduce 
> strange errors. Contributors could choose starting a container and run tests 
> in there, or just run tests directly. 
> 
> 
> 
> On Wed, Sep 5, 2018 at 6:37 AM Ismaël Mejía  > wrote:
> BIG +1, the previous work on having docker build images [1] had a
> similar goal (to have a reproducible build environment). But this is
> even better because we will guarantee the exact same environment in
> Jenkins as well as any further improvements. It is important to
> document the setup process as part of this (for future maintenance +
> local reproducibility).
> 
> Just for clarification this is independent of running the tests
> locally without docker, it is more to improve the reproducibility of
> the environment we have on jenkins locally for example to address some
> weird Heissenbug.
> 
> I just added BEAM-5311 to track the removal of the docker build images
> when this is ready (of course if there are no objections to this
> proposal).
> 
> [1] https://beam.apache.org/contribute/docker-images/ 
> 
> On Thu, Aug 30, 2018 at 3:58 PM Jean-Baptiste Onofré  > wrote:
> >
> > Hi,
> >
> > That's interesting, however, it's really important to still be able to
> > easily run test locally, without any VM/Docker required. It should be
> > activated by profile or so.
> >
> > Regards
> > JB
> >
> > On 27/08/2018 19:53, Yifan Zou wrote:
> > > Hi,
> > >
> > > I have a proposal for creating a reproducible environment for Jenkins
> > > tests by using docker container. The thing is, the environment
> > > configurations on Beam Jenkins slaves are sometimes different from
> > > developer's machines. Test failures on Jenkins may not be easy to
> > > reproduce locally. Also, it is not convenient for developers to add or
> > > modify underlying tools installed on Jenkins VMs, since they're managed
> > > by Apache Infra. This proposal is aimed to address those problems.
> > >
> > > https://docs.google.com/document/d/1y0YuQj_oZXC0uM5-gniG7r9-5gv2uiDhzbtgYYJW48c/edit#heading=h.bg2yi0wbhl9n
> > >  
> > > 
> > >
> > > Any comments are welcome. Thank you.
> > >
> > > Regards.
> > > Yifan
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net 
> > Talend - http://www.talend.com 



Re: [PROPOSAL] Test performance of basic Apache Beam operations

2018-09-11 Thread Alexey Romanenko
I agree that we can benefit from having two types of performance tests (low and 
high level) that could complement each other.
Can we detect a regression (if any) automatically and send a report about that? 
Sorry if we already do that for Nexmark.

> On 11 Sep 2018, at 11:29, Etienne Chauchot  wrote:
> 
> Hi Lukasz,
> 
> Well, having low level byte[] based pure performance tests makes sense. And 
> having high level realistic model (Nexmark auction system) makes sense also 
> to avoid testing unrealistic pipelines as you describe.
> 
> Have common code between the 2 seems difficult as both the architecture and 
> the model are different.
> 
> I'm more concerned about having two CI mechanisms to detect 
> functionnal/performance regressions.
> Best
> Etienne
> 
> Le lundi 10 septembre 2018 à 18:33 +0200, Łukasz Gajowy a écrit :
>> In my opinion and as far as I understand Nexmark, there are some benefits to 
>> having both types of tests. The load tests we propose can be very 
>> straightforward and clearly show what is being tested thanks to the fact 
>> that there's no fixed model but very "low level" KV 
>> collections only. They are more flexible in shapes of the pipelines they can 
>> express e.g. fanout_64, without having to think about specific use cases. 
>> 
>> Having both types would allow developers to decide whether they want to 
>> create a new Nexmark query for their specific case or develop a new Load 
>> test (whatever is easier and more fits their case). However, there is a risk 
>> - with KV developer can overemphasize cases that can never 
>> happen in practice, so we need to be careful about the exact configurations 
>> we run. 
>> 
>> Still, I can imagine that there surely will be code that should be common to 
>> both types of tests and we seek ways to not duplicate code.
>> 
>> WDYT? 
>> 
>> Regards, 
>> Łukasz
>> 
>> 
>> 
>> pon., 10 wrz 2018 o 16:36 Etienne Chauchot > > napisał(a):
>>> Hi,
>>> It seems that there is a notable overlap with what Nexmark already does:
>>> Nexmark mesures performance and regression by exercising all the Beam model 
>>> in both batch and streaming modes with several runners. It also computes on 
>>> synthetic data. Also nexmark is already included as PostCommits in the CI 
>>> and dashboards.
>>> 
>>> Shall we merge the two?
>>> 
>>> Best
>>> 
>>> Etienne
>>> 
>>> Le lundi 10 septembre 2018 à 12:56 +0200, Łukasz Gajowy a écrit :
 Hello everyone, 
 
 thank you for all your comments to the proposal. To sum up: 
 
 A set of performance tests exercising Core Beam Transforms (ParDo, 
 GroupByKey, CoGroupByKey, Combine) will be implemented for Java and Python 
 SDKs. Those tests will allow to: 
 measure performance of the transforms on various runners
 exercise the transforms by creating stressful conditions and big loads 
 using Synthetic Source and Synthetic Step API (delays, keeping cpu busy or 
 asleep, processing large keys and values, performing fanout or reiteration 
 of inputs)
 run both in batch and streaming context
 gather various metrics
 notice regressions by comparing data from consequent Jenkins runs  
 Metrics (runtime, consumed bytes, memory usage, split/bundle count) can be 
 gathered during test invocations. We will start with runtime and leverage 
 Metrics API to collect the other metrics in later phases of development. 
 The tests will be fully configurable through pipeline options and it will 
 be possible to run any custom scenarios manually. However, a 
 representative set of testing scenarios will be run periodically using 
 Jenkins.
 
 Regards, 
 Łukasz 
 
 śr., 5 wrz 2018 o 20:31 Rafael Fernandez >>> > napisał(a):
> neat! left a comment or two
> 
> On Mon, Sep 3, 2018 at 3:53 AM Łukasz Gajowy  > wrote:
>> Hi all! 
>> 
>> I'm bumping this (in case you missed it). Any feedback and questions are 
>> welcome!
>> 
>> Best regards, 
>> Łukasz
>> 
>> pon., 13 sie 2018 o 13:51 Jean-Baptiste Onofré > > napisał(a):
>>> Hi Lukasz,
>>> 
>>> Thanks for the update, and the abstract looks promising.
>>> 
>>> Let me take a look on the doc.
>>> 
>>> Regards
>>> JB
>>> 
>>> On 13/08/2018 13:24, Łukasz Gajowy wrote:
>>> > Hi all, 
>>> > 
>>> > since Synthetic Sources API has been introduced in Java and Python 
>>> > SDK,
>>> > it can be used to test some basic Apache Beam operations (i.e.
>>> > GroupByKey, CoGroupByKey Combine, ParDo and ParDo with SideInput) in
>>> > terms of performance. This, in brief, is why we'd like to share the
>>> > below proposal:
>>> > 
>>> > _https://docs.google.com/document/d/1PuIQv4v06eosKKwT76u7S6IP88AnXhTf870Rcj1AHt4/edit?usp=sharing_
>>> >  
>>> > 

Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-11 Thread Alexey Romanenko
Dharmendra,
For now, you can’t write with Hadoop MapReduce OutputFormat. However, you can 
use FileIO or TextIO to write to HDFS, these IOs support different file systems.

> On 11 Sep 2018, at 11:11, dharmendra pratap singh  
> wrote:
> 
> Hello Team,
> Does this mean, as of today we can read from Hadoop FS but can't write to 
> Hadoop FS using Beam HDFS API ?
> 
> Regards
> Dharmendra
> 
> On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko  > wrote:
> Hello everyone,
> 
> I’d like to discuss the following topic (see below) with community since the 
> optimal solution is not clear for me.
> 
> There is Java IO module, called “hadoop-input-format”, which allows to use 
> MapReduce InputFormat implementations to read data from different sources 
> (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). According to 
> its name, it has only “Read" and it's missing “Write” part, so, I'm working 
> on “hadoop-output-format” to support MapReduce OutputFormat (PR 6306 
> ). For this I created another 
> module with this name. So, in the end, we will have two different modules 
> “hadoop-input-format” and “hadoop-output-format” and it looks quite strange 
> for me since, afaik, every existed Java IO, that we have, incapsulates Read 
> and Write parts into one module. Additionally, we have “hadoop-common” and 
> “hadoop-file-system” as other hadoop-related modules. 
> 
> Now I’m thinking how it will be better to organise all these Hadoop modules 
> better. There are several options in my mind: 
> 
> 1) Add new module “hadoop-output-format” and leave all Hadoop modules “as it 
> is”. 
>   Pros: no breaking changes, no additional work 
>   Cons: not logical for users to have the same IO in two different 
> modules and with different names.
> 
> 2) Merge “hadoop-input-format” and “hadoop-output-format” into one module 
> called, say, “hadoop-format” or “hadoop-mapreduce-format”, keep the other 
> Hadoop modules “as it is”.
>   Pros: to have InputFormat/OutputFormat in one IO module which is 
> logical for users
>   Cons: breaking changes for user code because of module/IO renaming 
> 
> 3) Add new module “hadoop-format” (or “hadoop-mapreduce-format”) which will 
> include new “write” functionality and be a proxy for old 
> “hadoop-input-format”. In its turn, “hadoop-input-format” should become 
> deprecated and be finally moved to common “hadoop-format” module in future 
> releases. Keep the other Hadoop modules “as it is”.
>   Pros: finally it will be only one module for hadoop MR format; changes 
> are less painful for user
>   Cons: hidden difficulties of implementation this strategy; a bit 
> confusing for user 
> 
> 4) Add new module “hadoop” and move all already existed modules there as 
> submodules (like we have for “io/google-cloud-platform”), merge 
> “hadoop-input-format” and “hadoop-output-format” into one module. 
>   Pros: unification of all hadoop-related modules
>   Cons: breaking changes for user code, additional complexity with deps 
> and testing
> 
> 5) Your suggestion?..
> 
> My personal preferences are lying between 2 and 3 (if 3 is possible). 
> 
> I’m wondering if there were similar situations in Beam before and how it was 
> finally resolved. If yes then probably we need to do here in similar way.
> Any suggestions/advices/comments would be very appreciated.
> 
> Thanks,
> Alexey



Re: Gradle Races in beam-examples-java, beam-runners-apex

2018-09-11 Thread Maximilian Michels
Do we have inotifywait available on Travis and could set it up to log 
concurrent access to the relevant Jar files?


On 10.09.18 22:41, Lukasz Cwik wrote:
I had originally suggested to use some Linux kernel tooling such as 
inotifywait[1] to watch what is happening.


It is likely that we have some Gradle task which is running something in 
parallel to a different Gradle task when it shouldn't which means that 
the jar file is being changed/corrupted. I believe fixing our Gradle 
task dependency tree wrt to this would solve the problem. This crash 
does not reproduce on my desktop after 20 runs which makes it hard for 
me to test for.


1: https://www.linuxjournal.com/content/linux-filesystem-events-inotify

On Mon, Sep 10, 2018 at 1:15 PM Ryan Williams > wrote:


this continues to be an issue locally (cf. some discussion in #beam
slack)

commands like `./gradlew javaPreCommit` or `./gradlew build`
reliably fail with a range of different


JVM crashes


in a few different tasks, with messages that suggest filing a bug
against the Java compiler

what do we know about the actual race condition that is allowing one
task to attempt to read from a JAR that is being overwritten by
another task? presumably this is just a bug in our Gradle configs?

On Mon, Aug 27, 2018 at 2:28 PM Andrew Pilloud mailto:apill...@google.com>> wrote:

It appears that there is no one working on a fix for the flakes,
so I've merged the change to disable parallel tasks on precommit.

Andrew

On Fri, Aug 24, 2018 at 1:30 PM Andrew Pilloud
mailto:apill...@google.com>> wrote:

I'm seeing failures due to this on 12 of the last 16
PostCommits. Precommits take about 22 minutes run in
parallel, so at a 25% pass rate that puts the expected time
to a good test run at 264 minutes assuming you immediately
restart on each failure. We are looking at 56 minutes for a
precommit that isn't run in parallel:
https://builds.apache.org/job/beam_PreCommit_Java_Phrase/266/ I'd
rather have tests take a little longer then have to monitor
them for several hours.

I've opened a PR: https://github.com/apache/beam/pull/6274

Andrew

On Fri, Aug 24, 2018 at 10:47 AM Lukasz Cwik
mailto:lc...@google.com>> wrote:

I believe it would mitigate the issue but also make the
jobs take much longer to complete.

On Thu, Aug 23, 2018 at 2:44 PM Andrew Pilloud
mailto:apill...@google.com>> wrote:

There seems to be a misconfiguration of gradle that
is causing a high rate of failure for the last
several weeks in building beam-examples-java and
beam-runners-apex. It appears to be some sort of
race condition in building dependencies. Given that
no one has made progress on fixing the root cause,
is this something we could mitigate by running jobs
with `--no-parallel` flag?

https://issues.apache.org/jira/browse/BEAM-5035
https://issues.apache.org/jira/browse/BEAM-5207

Andrew



Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-11 Thread Romain Manni-Bucau
-1, seems spark integration is broken (tested with spark 2.3.1 and 2.2.1):

18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID
0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot
assign instance of scala.collection.immutable.List$SerializationProxy
to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
of type scala.collection.Seq in instance of
org.apache.spark.rdd.MapPartitionsRDD
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)


Also the issue Lukasz identified is important even if workarounds can be
put in place so +1 to fix it as well if possible.

Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book



Le lun. 10 sept. 2018 à 20:48, Lukasz Cwik  a écrit :

> I found an issue where we are no longer packaging the pom.xml within the
> artifact jars at META-INF/maven/groupId/artifactId. More details in
> https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't consider this
> a blocker but it was an easy fix (https://github.com/apache/beam/pull/6358)
> and users may rely on the pom.xml.
>
> Should we recut the release candidate to include this?
>
> On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (binding)
>>
>> Tested successfully on Beam Samples.
>>
>> Thanks !
>>
>> Regards
>> JB
>>
>> On 07/09/2018 23:56, Charles Chen wrote:
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #1 for the version
>> > 2.7.0, as follows:
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific comments)
>> >
>> > The complete staging area is available for your review, which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to dist.apache.org
>> >  [2], which is signed with the key with
>> > fingerprint 45C60AAAD115F560 [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.7.0-RC1" [5],
>> > * website pull request listing the release and publishing the API
>> > reference manual [6].
>> > * Java artifacts were built with Gradle 4.8 and OpenJDK
>> > 1.8.0_181-8u181-b13-1~deb9u1-b13.
>> > * Python artifacts are deployed along with the source release to the
>> > dist.apache.org  [2].
>> >
>> > The vote will be open for at least 72 hours. It is adopted by majority
>> > approval, with at least 3 PMC affirmative votes.
>> >
>> > Thanks,
>> > Charles
>> >
>> > [1]
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
>> > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>> > [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1046/
>> > [5] https://github.com/apache/beam/tree/v2.7.0-RC1
>> > [6] https://github.com/apache/beam-site/pull/549
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: [portablility] metrics interrogations

2018-09-11 Thread Etienne Chauchot
Le lundi 10 septembre 2018 à 09:42 -0700, Lukasz Cwik a écrit :
> Alex is out on vacation for the next 3 weeks.
> Alex had proposed the types of metrics[1] but not the exact protocol as to 
> what the SDK and runner do. I could
> envision Alex proposing that the SDK harness only sends diffs or dirty 
> metrics in intermediate updates and all metrics
> values in the final update.
> Robert is referring to an integration that happened to an older set of 
> messages[2] that preceeded Alex's proposal and
> that integration with Dataflow which is still incomplete works as you 
> described in #2.

Thanks Luke and Robert for the confirmation.
> Robin had recently been considering adding an accessor to DoFns that would 
> allow you to get access to the job
> information from within the pipeline (current state, poll for metrics, invoke 
> actions like cancel / drain, ...). He
> wanted it so he could poll for attempted metrics to be able to test 
> @RequiresStableInput. 
Yes, I remember, I voted +1 to his proposal.
> Integrating the MetricsPusher or something like that on the SDK side to be 
> able to poll metrics over the job
> information accessor could be useful.

Well, in the design discussion, we decided to host Metrics Pusher as close as 
possible of the actual engine (inside the
runner code chosen over the sdk code) to allow the runner to send system 
metrics in the future. 
> 1: https://s.apache.org/beam-fn-api-metrics
> 2: 
> https://github.com/apache/beam/blob/9b68f926628d727e917b6a33ccdafcfe693eef6a/model/fn-execution/src/main/proto/beam
> _fn_api.proto#L410

Besides, in his PR Alex talks about deprecated metrics. As he is off, can you 
tell me a little more about them ? What
metrics will be deprecated when the portability framework is 100% operational 
on all the runners?
ThxEtienne
> 
> On Mon, Sep 10, 2018 at 8:41 AM Robert Burke  wrote:
> > The way I entered them into the Go SDK is #2 (SDK sends diffs per bundle) 
> > and the Java Runner Harness appears to
> > aggregate them correctly from there.
> > On Mon, Sep 10, 2018, 2:07 AM Etienne Chauchot  wrote:
> > > Hi all,
> > > @Luke, @Alex I have a general question related to metrics in the Fn API: 
> > > as the communication between runner
> > > harness and SDK harness is done on a bundle basis. When the runner 
> > > harness sends data to the sdk harness to
> > > execute a transform that contains metrics, does it:
> > > send metrics values (for the ones defined in the transform) alongside 
> > > with data and receive an updated value of
> > > the metrics from the sdk harness when the bundle is finished 
> > > processing?or does it send only the data and the sdk
> > > harness responds with a diff value of the metrics so that the runner can 
> > > update them in its side?My bet is option
> > > 2. But can you confirm?
> > > 
> > > 
> > > Thanks
> > > 
> > > Etienne
> > > Le jeudi 19 juillet 2018 à 15:10 +0200, Etienne Chauchot a écrit :
> > > > Thanks for the confirmations Luke.
> > > > Le mercredi 18 juillet 2018 à 07:56 -0700, Lukasz Cwik a écrit :
> > > > > On Wed, Jul 18, 2018 at 7:01 AM Etienne Chauchot 
> > > > >  wrote:
> > > > > > Hi,
> > > > > > Luke, Alex, I have some portable metrics interrogations, can you 
> > > > > > confirm them ? 
> > > > > > 
> > > > > > 1 - As it is the SDK harness that will run the code of the UDFs, if 
> > > > > > a UDF defines a metric, then the SDK
> > > > > > harness will give updates through GRPC calls to the runner so that 
> > > > > > the runner could update metrics cells,
> > > > > > right?
> > > > > 
> > > > > Yes. 
> > > > > > 2 - Alex, you mentioned in proto and design doc that there will be 
> > > > > > no aggreagation of metrics. But some
> > > > > > runners (spark/flink) rely on accumulators and when they are 
> > > > > > merged, it triggers the merging of the whole
> > > > > > chain to the metric cells. I know that Dataflow does not do the 
> > > > > > same, it uses non agregated metrics and
> > > > > > sends them to an aggregation service. Will there be a change of 
> > > > > > paradigm with portability for runners that
> > > > > > merge themselves ? 
> > > > > 
> > > > > There will be local aggregation of metrics scoped to a bundle; after 
> > > > > the bundle is finished processing they
> > > > > are discarded. This will require some kind of global aggregation 
> > > > > support from a runner, whether that runner
> > > > > does it via accumulators or via an aggregation service is up to the 
> > > > > runner.
> > > > > > 3 - Please confirm that the distinction between attempted and 
> > > > > > committed metrics is not the business of
> > > > > > portable metrics. Indeed, it does not involve communication between 
> > > > > > the runner harness and the SDK harness
> > > > > > as it is a runner only matter. I mean, when a runner commits a 
> > > > > > bundle it just updates its committed metrics
> > > > > > and do not need to inform the SDK harness. But, of course, when the 
> > > > > > user requests 

Re: [PROPOSAL] Test performance of basic Apache Beam operations

2018-09-11 Thread Etienne Chauchot
Hi Lukasz,
Well, having low level byte[] based pure performance tests makes sense. And 
having high level realistic model (Nexmark
auction system) makes sense also to avoid testing unrealistic pipelines as you 
describe.
Have common code between the 2 seems difficult as both the architecture and the 
model are different.
I'm more concerned about having two CI mechanisms to detect 
functionnal/performance regressions. BestEtienne
Le lundi 10 septembre 2018 à 18:33 +0200, Łukasz Gajowy a écrit :
> In my opinion and as far as I understand Nexmark, there are some benefits to 
> having both types of tests. The load
> tests we propose can be very straightforward and clearly show what is being 
> tested thanks to the fact that there's no
> fixed model but very "low level" KV collections only. They 
> are more flexible in shapes of the
> pipelines they can express e.g. fanout_64, without having to think about 
> specific use cases. 
> 
> Having both types would allow developers to decide whether they want to 
> create a new Nexmark query for their specific
> case or develop a new Load test (whatever is easier and more fits their 
> case). However, there is a risk - with
> KV developer can overemphasize cases that can never happen in 
> practice, so we need to be careful about
> the exact configurations we run. 
> 
> Still, I can imagine that there surely will be code that should be common to 
> both types of tests and we seek ways to
> not duplicate code.
> 
> WDYT? 
> 
> Regards, 
> Łukasz
> 
> 
> 
> pon., 10 wrz 2018 o 16:36 Etienne Chauchot  napisał(a):
> > Hi,It seems that there is a notable overlap with what Nexmark already 
> > does:Nexmark mesures performance and
> > regression by exercising  all the Beam model in both batch and streaming 
> > modes with several runners. It also
> > computes on synthetic data. Also nexmark is already included as PostCommits 
> > in the CI and dashboards.
> > Shall we merge the two?
> > Best
> > Etienne
> > Le lundi 10 septembre 2018 à 12:56 +0200, Łukasz Gajowy a écrit :
> > > Hello everyone, 
> > > 
> > > thank you for all your comments to the proposal. To sum up: 
> > > 
> > > A set of performance tests exercising Core Beam Transforms (ParDo, 
> > > GroupByKey, CoGroupByKey, Combine) will be
> > > implemented for Java and Python SDKs. Those tests will allow to: 
> > > measure performance of the transforms on various runners
> > > exercise the transforms by creating stressful conditions and big loads 
> > > using Synthetic Source and Synthetic Step
> > > API (delays, keeping cpu busy or asleep, processing large keys and 
> > > values, performing fanout or reiteration of
> > > inputs)
> > > run both in batch and streaming context
> > > gather various metrics
> > > notice regressions by comparing data from consequent Jenkins runs  
> > > Metrics (runtime, consumed bytes, memory usage, split/bundle count) can 
> > > be gathered during test invocations. We
> > > will start with runtime and leverage Metrics API to collect the other 
> > > metrics in later phases of development. 
> > > The tests will be fully configurable through pipeline options and it will 
> > > be possible to run any custom scenarios
> > > manually. However, a representative set of testing scenarios will be run 
> > > periodically using Jenkins.
> > > 
> > > Regards, 
> > > Łukasz 
> > > 
> > > śr., 5 wrz 2018 o 20:31 Rafael Fernandez  napisał(a):
> > > > neat! left a comment or two
> > > > 
> > > > On Mon, Sep 3, 2018 at 3:53 AM Łukasz Gajowy  wrote:
> > > > > Hi all! 
> > > > > 
> > > > > I'm bumping this (in case you missed it). Any feedback and questions 
> > > > > are welcome!
> > > > > 
> > > > > Best regards, 
> > > > > Łukasz
> > > > > 
> > > > > pon., 13 sie 2018 o 13:51 Jean-Baptiste Onofré  
> > > > > napisał(a):
> > > > > > Hi Lukasz,
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Thanks for the update, and the abstract looks promising.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Let me take a look on the doc.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Regards
> > > > > > 
> > > > > > JB
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On 13/08/2018 13:24, Łukasz Gajowy wrote:
> > > > > > 
> > > > > > > Hi all, 
> > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > > since Synthetic Sources API has been introduced in Java and 
> > > > > > > Python SDK,
> > > > > > 
> > > > > > > it can be used to test some basic Apache Beam operations (i.e.
> > > > > > 
> > > > > > > GroupByKey, CoGroupByKey Combine, ParDo and ParDo with SideInput) 
> > > > > > > in
> > > > > > 
> > > > > > > terms of performance. This, in brief, is why we'd like to share 
> > > > > > > the
> > > > > > 
> > > > > > > below proposal:
> > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > > _https://docs.google.com/document/d/1PuIQv4v06eosKKwT76u7S6IP88AnXhTf870Rcj1AHt4/edit?usp=sharing_
> > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > > Let us know what you think in the document's comments. Thank you 
> > > 

Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-11 Thread dharmendra pratap singh
Hello Team,
Does this mean, as of today we can read from Hadoop FS but can't write to
Hadoop FS using Beam HDFS API ?

Regards
Dharmendra

On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko 
wrote:

> Hello everyone,
>
> I’d like to discuss the following topic (see below) with community since
> the optimal solution is not clear for me.
>
> There is Java IO module, called “*hadoop-input-format*”, which allows to
> use MapReduce InputFormat implementations to read data from different
> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
> According to its name, it has only “Read" and it's missing “Write” part,
> so, I'm working on “*hadoop-output-format*” to support MapReduce
> OutputFormat (PR 6306 ). For
> this I created another module with this name. So, in the end, we will have
> two different modules “*hadoop-input-format*” and “*hadoop-output-format*”
> and it looks quite strange for me since, afaik, every existed Java IO, that
> we have, incapsulates Read and Write parts into one module. Additionally,
> we have “*hadoop-common*” and *“hadoop-file-system*” as other
> hadoop-related modules.
>
> Now I’m thinking how it will be better to organise all these Hadoop
> modules better. There are several options in my mind:
>
> 1) Add new module “*hadoop-output-format*” and leave all Hadoop modules
> “as it is”.
> Pros: no breaking changes, no additional work
> Cons: not logical for users to have the same IO in two different modules
> and with different names.
>
> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”,
> keep the other Hadoop modules “as it is”.
> Pros: to have InputFormat/OutputFormat in one IO module which is logical
> for users
> Cons: breaking changes for user code because of module/IO renaming
>
> 3) Add new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
> which will include new “write” functionality and be a proxy for old “
> *hadoop-input-format*”. In its turn, “*hadoop-input-format*” should
> become deprecated and be finally moved to common “*hadoop-format*” module
> in future releases. Keep the other Hadoop modules “as it is”.
> Pros: finally it will be only one module for hadoop MR format; changes are
> less painful for user
> Cons: hidden difficulties of implementation this strategy; a bit confusing
> for user
>
> 4) Add new module “*hadoop*” and move all already existed modules there
> as submodules (like we have for “*io/google-cloud-platform*”), merge “
> *hadoop-input-format*” and “*hadoop-output-format*” into one module.
> Pros: unification of all hadoop-related modules
> Cons: breaking changes for user code, additional complexity with deps and
> testing
>
> 5) Your suggestion?..
>
> My personal preferences are lying between 2 and 3 (if 3 is possible).
>
> I’m wondering if there were similar situations in Beam before and how it
> was finally resolved. If yes then probably we need to do here in similar
> way.
> Any suggestions/advices/comments would be very appreciated.
>
> Thanks,
> Alexey
>


Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #168

2018-09-11 Thread Apache Jenkins Server
See 


Changes:

[robertwb] Update container versions of NumPy and TensorFlow.

[lcwik] [BEAM-5149] Add support for the Java SDK harness to merge windows.

[lcwik] Address PR comments.

[lcwik] [BEAM-5351] Fix missing pom.xml file in artifact jar.

[robbe.sneyders] remove sys.exc_clear()

--
[...truncated 17.84 MB...]
Skipping task ':beam-vendor-sdks-java-extensions-protobuf:test' as it has no 
source files and no previous output files.
:beam-vendor-sdks-java-extensions-protobuf:test (Thread[Task worker for ':' 
Thread 3,5,main]) completed. Took 0.0 secs.
:beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses
 (Thread[Task worker for ':' Thread 3,5,main]) started.

> Task 
> :beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses
Caching disabled for task 
':beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses':
 Caching has not been enabled for the task
Task 
':beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses'
 is not up-to-date because:
  Task has not declared any outputs despite executing actions.
:beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses
 (Thread[Task worker for ':' Thread 3,5,main]) completed. Took 0.033 secs.
:beam-vendor-sdks-java-extensions-protobuf:check (Thread[Task worker for ':' 
Thread 3,5,main]) started.

> Task :beam-vendor-sdks-java-extensions-protobuf:check
Skipping task ':beam-vendor-sdks-java-extensions-protobuf:check' as it has no 
actions.
:beam-vendor-sdks-java-extensions-protobuf:check (Thread[Task worker for ':' 
Thread 3,5,main]) completed. Took 0.0 secs.
:beam-vendor-sdks-java-extensions-protobuf:build (Thread[Task worker for ':' 
Thread 3,5,main]) started.

> Task :beam-vendor-sdks-java-extensions-protobuf:build
Skipping task ':beam-vendor-sdks-java-extensions-protobuf:build' as it has no 
actions.
:beam-vendor-sdks-java-extensions-protobuf:build (Thread[Task worker for ':' 
Thread 3,5,main]) completed. Took 0.0 secs.
:beam-website:assemble (Thread[Task worker for ':' Thread 3,5,main]) started.

> Task :beam-website:assemble UP-TO-DATE
Skipping task ':beam-website:assemble' as it has no actions.
:beam-website:assemble (Thread[Task worker for ':' Thread 3,5,main]) completed. 
Took 0.0 secs.
:beam-website:setupBuildDir (Thread[Task worker for ':' Thread 3,5,main]) 
started.

> Task :beam-website:setupBuildDir
Build cache key for task ':beam-website:setupBuildDir' is 
7afd3bd0c1a9269d50131165bb8a63a1
Caching disabled for task ':beam-website:setupBuildDir': Caching has not been 
enabled for the task
Task ':beam-website:setupBuildDir' is not up-to-date because:
  No history is available.
:beam-website:setupBuildDir (Thread[Task worker for ':' Thread 3,5,main]) 
completed. Took 0.004 secs.
:beam-website:buildDockerImage (Thread[Task worker for ':' Thread 3,5,main]) 
started.

> Task :beam-website:buildDockerImage
Caching disabled for task ':beam-website:buildDockerImage': Caching has not 
been enabled for the task
Task ':beam-website:buildDockerImage' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command 'docker''. Working directory: 

 Command: docker build -t beam-website .
Successfully started process 'command 'docker''
Sending build context to Docker daemon  24.51MB
Step 1/7 : FROM ruby:2.5
 ---> 88666731c3e1
Step 2/7 : WORKDIR /ruby
 ---> Using cache
 ---> 9b7353f27cb5
Step 3/7 : RUN gem install bundler
 ---> Using cache
 ---> cd46d9b7ccbe
Step 4/7 : ADD Gemfile Gemfile.lock /ruby/
 ---> Using cache
 ---> d50c22e097f2
Step 5/7 : RUN bundle install --deployment --path $GEM_HOME
 ---> Using cache
 ---> e8881d09b465
Step 6/7 : ENV LC_ALL C.UTF-8
 ---> Using cache
 ---> 3787b82c937d
Step 7/7 : CMD sleep 3600
 ---> Using cache
 ---> 1608418b66da
Successfully built 1608418b66da
Successfully tagged beam-website:latest
:beam-website:buildDockerImage (Thread[Task worker for ':' Thread 3,5,main]) 
completed. Took 0.273 secs.
:beam-website:createDockerContainer (Thread[Task worker for ':' Thread 
3,5,main]) started.

> Task :beam-website:createDockerContainer
Caching disabled for task ':beam-website:createDockerContainer': Caching has 
not been enabled for the task
Task ':beam-website:createDockerContainer' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command '/bin/bash''. Working directory: 

 Command: /bin/bash -c docker create -v 
:/repo
 -u $(id -u):$(id -g) beam-website
Successfully 

Re: PTransforms and Fusion

2018-09-11 Thread Robert Bradshaw
For (A), it really boils down to the question of what is a legal pipeline.
A1 takes the position that all empty transforms must be on a whitelist
(which implies B1, unless we make the whitelist extensible, which starts to
sound a lot like B3). Presumably if we want to support B2, we cannot remove
all empty unknown transforms, just those whose outputs are a subset of the
inputs.

The reason I strongly support A3 is that empty PTransforms are not just
noise, they are expressions of user intent, and the pipeline graph should
reflect that as faithfully as possible. This is the whole point of
composite transforms--one should not be required to expose what is inside
(even whether it's empty). Consider, for example, an A, B -> C transform
that mixes A and B in proportions to produce C. In the degenerate case
where we want 100% for A or 100% from B, it's reasonable to implement this
by just returning A or B directly. But when, say, visualizing the pipeline
graph, I don't think it's desirable to have the discontinuity of the
composite transform suddenly disappearing when the mixing parameter is at
either extreme.

If a runner cannot handle these empty pipelines (as is the case for those
relying on the current Java libraries) it is an easy matter for it to drop
them, but that doesn't mean we should withhold this information (by making
it illegal and dropping it in every SDK) from a runner (or any other tool)
that would want to see this information.

- Robert


On Tue, Sep 11, 2018 at 4:20 AM Henning Rohde  wrote:

> For A, I am in favor of A1 and A2 as well. It is then up to each SDK to
> not generate "empty" transforms in the proto representation as we avoid
> noise as mentioned. The shared Java libraries are also optional and we
> should not assume every runner will use them. I'm not convinced empty
> transforms would have value for pipeline structure over what can be
> accomplished with normal composites. I suspect empty transforms, such as A,
> B -> B, B, will just be confusion generators.
>
> For B, I favor B2 for the reasons Thomas mentions. I also agree with the
> -1 for B1.
>
> On Mon, Sep 10, 2018 at 2:51 PM Thomas Weise  wrote:
>
>> For B, note the prior discussion [1].
>>
>> B1 and B2 cannot be supported at the same time.
>>
>> Native transforms will almost always be customizations. Users do not
>> create customizations without reason. They would start with what is there
>> and dig deeper only when needed. Right now there are no streaming
>> connectors in the Python SDK - should the user not use the SDK? Or is it
>> better (now and in general) to have the option of a custom connector, even
>> when it is not portable?
>>
>> Regarding portability, IMO it should be up to the user to decide how much
>> of it is necessary/important. The IO requirements are normally dictated by
>> the infrastructure. If it has Kafka, Kinesis or any other source (including
>> those that Beam might never have a connector for), the user needs to be
>> able to integrate it.
>>
>> Overall extensibility is important and will help users adopt Beam. This
>> has come up in a few other areas (think Docker environments). I think we
>> need to provide the flexibility and enable, not prevent alternatives and
>> therefore -1 for B1 (unsurprisingly :).
>>
>> [1]
>> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E
>>
>>
>> On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw 
>> wrote:
>>
>>> A) I think it's a bug to not handle empty PTransforms (which are useful
>>> at pipeline construction, and may still have meaning in terms of pipeline
>>> structure, e.g. for visualization). Note that such transforms, if truly
>>> composite, can't output any PCollections that do not appear in their inputs
>>> (which is how we distinguish them from primitives in Python). Thus I'm in
>>> favor of A3, and as a stopgap we can drop these transforms as part of/just
>>> before decoding in the Java libraries (rather than in the SDKs during
>>> encoding as in A2).
>>>
>>> B) I'm also for B1 or B2.
>>>
>>>
>>> On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels 
>>> wrote:
>>>
 > A) What should we do with these "empty" PTransforms?

 We can't translate them, so dropping them seems the most reasonable
 choice. Should we throw an error/warning to make the user aware of
 this?
 Otherwise might be unexpected for the user.

 >> A3) Handle the "empty" PTransform case within all of the shared
 libraries.

 What can we do at this point other than dropping them?

 > B) What should we do with "native" PTransforms?

 I support B1 and B2 as well. Non-portable PTransforms should be
 discouraged in the long run. However, the available PTransforms are not
 even consistent across the different SDKs yet (e.g. no streaming
 connectors in Python), so we should continue to provide a way to
 utilize
 the "native" transforms of a Runner.