Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Jean-Baptiste Onofré

Fair enough. +1 especially for the documentation.

Regards
JB

On 04/03/2017 08:48 PM, Aviem Zur wrote:

Upon further inspection there seems to be an issue we may have overlooked:
In cluster mode, some of the runners will have dependencies added directly
to the classpath by the cluster, and since SLF4J can only work with one
binding, the first one in the classpath will be used.

So while what we suggested would work in local mode, the user's chosen
binding and configuration might be ignored in cluster mode, which is
detrimental to what we wanted to accomplish.

So I believe what we should do instead is:

   1. Add better documentation regarding logging in each runner, which
   binding is used, perhaps examples of how to configure logging for that
   runner.
   2. Have direct runner use the most common binding among runners (this
   appears to be log4j which is used by Spark runner, Flink runner and Apex
   runner).
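
For the documentation in point 1, a concrete per-runner example could be as simple as a properties file. A hypothetical sketch (not taken from any existing Beam doc), assuming the runner's SLF4J binding is slf4j-log4j12 as for Spark/Flink/Apex:

```properties
# log4j.properties placed on the application classpath.
# Only honored when slf4j-log4j12 (or log4j itself) is the active binding.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
# Per-package override, e.g. to quiet Beam internals:
log4j.logger.org.apache.beam=WARN
```

In cluster mode the same file typically has to be shipped with the job or merged into the cluster's own log4j configuration, which is exactly why per-runner docs matter.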


On Mon, Apr 3, 2017 at 7:02 PM Aljoscha Krettek  wrote:


Yes, I think we can exclude log4j from the Flink dependencies. It’s
somewhat annoying that they are there in the first place.

The Flink doc has this to say about the topic:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/logging.html

On 3. Apr 2017, at 17:56, Aviem Zur  wrote:


* java.util.logging could be a good choice for the Direct Runner

Yes, this will be great for users (instead of having no logging when using
the direct runner).


* Logging backend could be runner-specific, particularly if it needs to
integrate into some other experience

Good point, let's take a look at the current state of runners:
Direct runner - will use JUL as suggested.
Dataflow runner - looks like there is already no binding (there is a
binding in tests only).
Spark runner - currently uses slf4j-log4j12. It does not require any
specific logger; we can change this to no binding.
Flink runner - uses slf4j-log4j12 transitively from Flink dependencies. I'm
assuming this is not a must and we can default to no binding here.
@aljoscha please confirm.
Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
assuming this is not a must and we can default to no binding here. @thw
please confirm.

It might be a good idea to use a consistent binding in tests (since we'll
use JUL for the direct runner, let this be JUL).
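
If the direct runner does default to JUL, users would control its output with a standard java.util.logging configuration file rather than a log4j.properties. A hypothetical sketch (file name and levels are illustrative), loaded via -Djava.util.logging.config.file=logging.properties:

```properties
# logging.properties - plain java.util.logging configuration
handlers=java.util.logging.ConsoleHandler
.level=INFO
java.util.logging.ConsoleHandler.level=ALL
# Per-logger level, the JUL analogue of a log4j.logger.* entry:
org.apache.beam.level=FINE
```

This difference between bindings (logging.properties vs log4j.properties) is the kind of thing the per-runner documentation would need to spell out.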

On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci  wrote:

+1 on consistency across Beam modules on the logging facade
+1 on enforcing consistency
+1 on clearly documenting how to do logging

Mixed feelings:
* Logging backend could be runner-specific, particularly if it needs to
integrate into some other experience
* java.util.logging could be a good choice for the Direct Runner

On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay 
wrote:


On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss 
wrote:


This is a great idea!

I believe the Python SDK's logging could also be enhanced (a bit
differently): currently we are not instantiating the logger, just using the
class that the logging package provides.
The shortcoming of this approach is that the user cannot set the log level
on a per-module basis, as all log messages end up at the root logger.



+1 to this. The Python SDK needs to expand its logging capabilities. Filed
[1] for this.

Ahmet

[1] https://issues.apache.org/jira/browse/BEAM-1825
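
The per-module limitation Tibor describes is easy to demonstrate with the stdlib alone (the module names here are invented for illustration; this is not actual SDK code):

```python
import logging

# Anti-pattern (what the thread says the SDK currently does): logging through
# the root logger means a single level governs every message in the process.
logging.getLogger().setLevel(logging.WARNING)

# Preferred: each module instantiates its own named logger...
io_log = logging.getLogger("sdk.io")
core_log = logging.getLogger("sdk.core")

# ...so a user can tune verbosity per module without touching the rest:
io_log.setLevel(logging.DEBUG)
core_log.setLevel(logging.ERROR)

print(io_log.isEnabledFor(logging.DEBUG))      # True
print(core_log.isEnabledFor(logging.WARNING))  # False
```

Named loggers also inherit from their parents, so setting a level on `logging.getLogger("sdk")` would govern both at once.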




On 3/22/17, 5:46 AM, "Aviem Zur"  wrote:

   +1 to what JB said.

   Will just have to be documented well, as if we provide no binding there
   will be no logging out of the box unless the user adds a binding.

   On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré 
   wrote:


Hi Aviem,

Good point.

I think, in our dependency set, we should just depend on slf4j-api and let
the user provide the binding he wants (slf4j-log4j12, slf4j-simple,
whatever).

We define a binding only with test scope in our modules.

Regards
JB

On 03/22/2017 04:58 AM, Aviem Zur wrote:

Hi all,

There have been a few reports lately (on JIRA [1] and on Slack) from users
regarding inconsistent loggers used across Beam's modules.

While we use SLF4J, different modules use a different logger behind it
(JUL, log4j, etc.). So when people add a log4j.properties file to their
classpath, for instance, they expect this to affect all of their
dependencies on Beam modules, but it doesn't, and they miss out on some
logs they thought they would see.

I think we should strive for consistency in which logger is used behind
SLF4J, and try to enforce this in our modules.
I for one think it should be slf4j-log4j. However, if performance of
logging is critical we might want to consider logback.

Note: SLF4J will still be the facade for logging across the project. The
only change would be the logger SLF4J delegates to.

Once we have something like this it would also be useful to add
documentation on logging in Beam to the website.

IO ITs: Hosting Docker images

2017-04-03 Thread Stephen Sisk
Summary:

For IO ITs that use data stores that need custom docker images in order to
run, we can't currently use them in a kubernetes cluster (which is where we
host our data stores.) I have a couple options for how to solve this and am
looking for feedback from folks involved in creating IO ITs/opinions on
kubernetes.


Details:

We've discussed in the past that we'll want to allow developers to submit
just a dockerfile, and then we'll use that when creating the data store on
kubernetes. This is the case for ElasticsearchIO and I assume more data
stores in the future will want to do this. It's also looking like it'll be
necessary to use custom docker images for the HadoopInputFormatIO's
cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
image you can use out of the box.

In either case, in order to retrieve a docker image, kubernetes needs a
container registry - it will read the docker images from there. A simple
private container registry doesn't work because kubernetes config files are
static - this means that if local devs try to use the kubernetes files,
they point at the private container registry and they wouldn't be able to
retrieve the images since they don't have access. They'd have to manually
edit the files, which in theory is an option, but I don't consider that to
be acceptable since it feels pretty unfriendly (it is simple, so if we
really don't like the below options we can revisit it.)

Quick summary of the options

===

We can:

* Start using something like k8 helm - this adds more dependencies, adds a
small amount of complexity (this is my recommendation, but only by a little)

* Start pushing images to docker hub - this means they'll be publicly
visible and raises the bar for maintenance of those images

* Host our own public container registry - this means running our own
public service with costs, etc..

Below are detailed discussions of these options. You can skip to the "My
thoughts on this" section if you're not interested in the details.


1. Templated kubernetes images

=

Kubernetes (k8) does not currently have built-in support for parameterizing
scripts - there's an issue open for this [1], but it doesn't seem to be
very active.

There are tools like Kubernetes helm that allow users to specify parameters
when running their kubernetes scripts. They also enable a lot more (they're
probably closer to a package manager like apt-get) - see this
description[3] for an overview.

I'm open to other options besides helm, but it seems to be the officially
supported one.

How the world would look using helm:

* When developing an IO IT, someone (either the developer or one of us)
would need to create a chart (the name for a helm script) - it's
basically another set of config files, but in theory is as simple as a
couple of metadata files plus a templatized version of a regular k8 script.
This should be trivial compared to the task of creating a k8 script.

* When creating an instance of a data store, the developer (or the beam CI
server) would first build the docker image for the data store and push it to
their container registry, then run a command like `helm install -f
mydb.yaml --set imageRepo=1.2.3.4`

* When done running tests/developing/etc… the developer/beam CI server
would run `helm delete -f mydb.yaml`
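
To make the parameterization concrete, the templated piece of such a chart could look roughly like this (a hypothetical sketch of a helm template, not an existing Beam chart; the names are invented):

```yaml
# templates/mydb-pod.yaml - the image registry is a chart parameter
# instead of a hard-coded private registry address.
apiVersion: v1
kind: Pod
metadata:
  name: mydb
spec:
  containers:
    - name: mydb
      image: "{{ .Values.imageRepo }}/mydb:latest"
```

`helm install ... --set imageRepo=1.2.3.4` then substitutes each developer's (or the CI server's) own registry, which is exactly the hole in static k8 config files described above.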

Upsides:

* Something like helm is pretty interesting - we talked about it as an
upside and something we wanted to do when we talked about using kubernetes

* We pick up a set of working kubernetes scripts this way. The full list is
at [2], but some ones that stood out: mongodb, memcached, mysql, postgres,
redis, elasticsearch (incubating), kafka (incubating), zookeeper
(incubating) - this could speed development

Downsides:

* Adds an additional dependency to run our ITs (helm or another k8
templating tool)

* Requires people to build their own images and run a container registry if
they don't already have one (it will not surprise you that there's a docker
image for running the registry [0] - so it's not crazy. :) I *think* this
will probably just be a simple one/two line command once we have it
scripted.

* Helm in particular is kind of heavyweight for what we really need - it
requires running a service in the k8 cluster and adds additional complexity.

* Adds to the complexity of creating a new kubernetes script. Until I've
tried it, I can't really speak to the complexity, but taking a look at the
instructions [4], it doesn't seem too bad.




2. Push images to docker hub

===

This requires that users push images that we want to use to docker hub, and
then our IO ITs will rely on that. I think the developer of the dockerfile
should be responsible for the image - having the beam project responsible
for a publicly available artifact (like the docker images) outside of our
core deliverables doesn't seem like the right move.

We would still retain a copy of the source dockerfiles and could 

Adding logging for RunnableOnService/ValidatesRunner tests

2017-04-03 Thread Pablo Estrada
Hello there,
I'm running RunnableOnService tests on the DirectRunner with 'mvn clean
verify' in runners/direct-java, and I'd like to add some logging to figure
out what's going on in some failures. My questions are:

1. Is there a way to run only a specific test with maven?
2. Is there extra configuration needed to collect logs written during the
test (specifically, logs written from PAssert)?
3. If not, where should I look for these logs? A file? Stdout?

Best!
-P.


Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Ted Yu
+1

> On Apr 3, 2017, at 11:48 AM, Aviem Zur  wrote:
> 
> Upon further inspection there seems to be an issue we may have overlooked:
> In cluster mode, some of the runners will have dependencies added directly
> to the classpath by the cluster, and since SLF4J can only work with one
> binding, the first one in the classpath will be used.
> 
> So while what we suggested would work in local mode, the user's chosen
> binding and configuration might be ignored in cluster mode, which is
> detrimental to what we wanted to accomplish.
> 
> So I believe what we should do instead is:
> 
>   1. Add better documentation regarding logging in each runner, which
>   binding is used, perhaps examples of how to configure logging for that
>   runner.
>   2. Have direct runner use the most common binding among runners (this
>   appears to be log4j which is used by Spark runner, Flink runner and Apex
>   runner).
> 
> 
>> On Mon, Apr 3, 2017 at 7:02 PM Aljoscha Krettek  wrote:
>> 
>> Yes, I think we can exclude log4j from the Flink dependencies. It’s
>> somewhat annoying that they are there in the first place.
>> 
>> The Flink doc has this to say about the topic:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/logging.html
 On 3. Apr 2017, at 17:56, Aviem Zur  wrote:
 
 * java.util.logging could be a good choice for the Direct Runner
>>> Yes, this will be great for users (Instead of having no logging when
>> using
>>> direct runner).
>>> 
 * Logging backend could be runner-specific, particularly if it needs to
 integrate into some other experience
>>> Good point, let's take a look at the current state of runners:
>>> Direct runner - will use JUL as suggested.
>>> Dataflow runner - looks like there is already no binding (There is a
>>> binding in tests only).
>>> Spark runner - currently uses slf4j-log4j12. It does not require any
>> specific
>>> logger, we can change this to no binding.
>>> Flink runner - uses slf4j-log4j12 transitively from Flink dependencies.
>> I'm
>>> assuming this is not a must and we can default to no binding here.
>>> @aljoscha please confirm.
>>> Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
>>> assuming this is not a must and we can default to no binding here. @thw
>>> please confirm.
>>> 
>>> It might be a good idea to use a consistent binding in tests (Since we'll
>>> use JUL for direct runner, let this be JUL).
>>> 
>>> On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci  wrote:
>>> 
>>> +1 on consistency across Beam modules on the logging facade
>>> +1 on enforcing consistency
>>> +1 on clearly documenting how to do logging
>>> 
>>> Mixed feelings:
>>> * Logging backend could be runner-specific, particularly if it needs to
>>> integrate into some other experience
>>> * java.util.logging could be a good choice for the Direct Runner
>>> 
>>> On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay 
>>> wrote:
>>> 
 On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss 
 wrote:
 
> This is a great idea!
> 
> I believe Python-SDK's logging could also be enhanced (a bit
 differently):
> Currently we are not instantiating the logger, just using the class
>> what
> logging package provides.
> Shortcoming of this approach is that the user cannot set the log level
>>> on
> a per module basis as all log messages
> end up in the root level.
 
 +1 to this. Python SDK needs to expand its logging capabilities. Filed
>>> [1]
 for this.
 
 Ahmet
 
 [1] https://issues.apache.org/jira/browse/BEAM-1825
 
 
> 
> On 3/22/17, 5:46 AM, "Aviem Zur"  wrote:
> 
>   +1 to what JB said.
> 
>   Will just have to be documented well as if we provide no binding
 there
> will
>   be no logging out of the box unless the user adds a binding.
> 
>   On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <
 j...@nanthrax.net>
>   wrote:
> 
>> Hi Aviem,
>> 
>> Good point.
>> 
>> I think, in our dependency set, we should just depend on
 slf4j-api
> and
>> let the
>> user provide the binding he wants (slf4j-log4j12, slf4j-simple,
> whatever).
>> 
>> We define a binding only with test scope in our modules.
>> 
>> Regards
>> JB
>> 
>>> On 03/22/2017 04:58 AM, Aviem Zur wrote:
>>> Hi all,
>>> 
>>> There have been a few reports lately (On JIRA [1] and on Slack)
> from
>> users
>>> regarding inconsistent loggers used across Beam's modules.
>>> 
>>> While we use SLF4J, different modules use a different logger
> behind it
>>> (JUL, log4j, etc)
>>> So when people add a log4j.properties file to their classpath
>>> for
>> instance,
>>> they expect this to affect all of their dependencies on Beam

Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Aviem Zur
Upon further inspection there seems to be an issue we may have overlooked:
In cluster mode, some of the runners will have dependencies added directly
to the classpath by the cluster, and since SLF4J can only work with one
binding, the first one in the classpath will be used.

So while what we suggested would work in local mode, the user's chosen
binding and configuration might be ignored in cluster mode, which is
detrimental to what we wanted to accomplish.

So I believe what we should do instead is:

   1. Add better documentation regarding logging in each runner, which
   binding is used, perhaps examples of how to configure logging for that
   runner.
   2. Have direct runner use the most common binding among runners (this
   appears to be log4j which is used by Spark runner, Flink runner and Apex
   runner).


On Mon, Apr 3, 2017 at 7:02 PM Aljoscha Krettek  wrote:

> Yes, I think we can exclude log4j from the Flink dependencies. It’s
> somewhat annoying that they are there in the first place.
>
> The Flink doc has this to say about the topic:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/logging.html
> > On 3. Apr 2017, at 17:56, Aviem Zur  wrote:
> >
> >> * java.util.logging could be a good choice for the Direct Runner
> > Yes, this will be great for users (Instead of having no logging when
> using
> > direct runner).
> >
> >> * Logging backend could be runner-specific, particularly if it needs to
> >> integrate into some other experience
> > Good point, let's take a look at the current state of runners:
> > Direct runner - will use JUL as suggested.
> > Dataflow runner - looks like there is already no binding (There is a
> > binding in tests only).
> > Spark runner - currently uses slf4j-log4j12. It does not require any
> specific
> > logger, we can change this to no binding.
> > Flink runner - uses slf4j-log4j12 transitively from Flink dependencies.
> I'm
> > assuming this is not a must and we can default to no binding here.
> > @aljoscha please confirm.
> > Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
> > assuming this is not a must and we can default to no binding here. @thw
> > please confirm.
> >
> > It might be a good idea to use a consistent binding in tests (Since we'll
> > use JUL for direct runner, let this be JUL).
> >
> > On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci  wrote:
> >
> > +1 on consistency across Beam modules on the logging facade
> > +1 on enforcing consistency
> > +1 on clearly documenting how to do logging
> >
> > Mixed feelings:
> > * Logging backend could be runner-specific, particularly if it needs to
> > integrate into some other experience
> > * java.util.logging could be a good choice for the Direct Runner
> >
> > On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay 
> > wrote:
> >
> >> On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss 
> >> wrote:
> >>
> >>> This is a great idea!
> >>>
> >>> I believe Python-SDK's logging could also be enhanced (a bit
> >> differently):
> >>> Currently we are not instantiating the logger, just using the class
> what
> >>> logging package provides.
> >>> Shortcoming of this approach is that the user cannot set the log level
> > on
> >>> a per module basis as all log messages
> >>> end up in the root level.
> >>>
> >>
> >> +1 to this. Python SDK needs to expand its logging capabilities. Filed
> > [1]
> >> for this.
> >>
> >> Ahmet
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-1825
> >>
> >>
> >>>
> >>> On 3/22/17, 5:46 AM, "Aviem Zur"  wrote:
> >>>
> >>>+1 to what JB said.
> >>>
> >>>Will just have to be documented well as if we provide no binding
> >> there
> >>> will
> >>>be no logging out of the box unless the user adds a binding.
> >>>
> >>>On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <
> >> j...@nanthrax.net>
> >>>wrote:
> >>>
>  Hi Aviem,
> 
>  Good point.
> 
>  I think, in our dependency set, we should just depend on
> >> slf4j-api
> >>> and
>  let the
>  user provide the binding he wants (slf4j-log4j12, slf4j-simple,
> >>> whatever).
> 
>  We define a binding only with test scope in our modules.
> 
>  Regards
>  JB
> 
>  On 03/22/2017 04:58 AM, Aviem Zur wrote:
> > Hi all,
> >
> > There have been a few reports lately (On JIRA [1] and on Slack)
> >>> from
>  users
> > regarding inconsistent loggers used across Beam's modules.
> >
> > While we use SLF4J, different modules use a different logger
> >>> behind it
> > (JUL, log4j, etc)
> > So when people add a log4j.properties file to their classpath
> > for
>  instance,
> > they expect this to affect all of their dependencies on Beam
> >>> modules, but
> > it doesn’t and they miss out on some logs they thought they
> > would
> >>> see.
> >
> > I think we should 

Re: Update of Pei in Alibaba

2017-04-03 Thread Kenneth Knowles
Nice to hear from you again, Pei!

This is awesome news. I'd love to help when you are ready to get it in the
repo and hooked up to our testing infrastructure.

Kenn

On Fri, Mar 31, 2017 at 6:24 PM, Pei HE  wrote:

> Hi all,
> In February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
> And, I want to give an update of things in here.
>
> A colleague and I have been working on JStorm
>  runner. We have a prototype that works
> with WordCount and PAssert. (I am going to start a separate email thread
> about how to get it reviewed and merged in Apache Beam.)
> We also have Spark clusters, and are very interested in using Spark runner.
>
> Last Saturday, I went to China Hadoop Summit, and gave a talk about the
> Apache Beam model. While many companies gave talks on their in-house
> solutions for unified batch and unified SQL, there is also a lot of
> interest in and enthusiasm for Beam.
>
> Looking forward to chat more.
> --
> Pei
>


Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Aljoscha Krettek
Yes, I think we can exclude log4j from the Flink dependencies. It’s somewhat 
annoying that they are there in the first place.

The Flink doc has this to say about the topic: 
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/logging.html
> On 3. Apr 2017, at 17:56, Aviem Zur  wrote:
> 
>> * java.util.logging could be a good choice for the Direct Runner
> Yes, this will be great for users (Instead of having no logging when using
> direct runner).
> 
>> * Logging backend could be runner-specific, particularly if it needs to
>> integrate into some other experience
> Good point, let's take a look at the current state of runners:
> Direct runner - will use JUL as suggested.
> Dataflow runner - looks like there is already no binding (There is a
> binding in tests only).
> Spark runner - currently uses slf4j-log4j12. It does not require any specific
> logger, we can change this to no binding.
> Flink runner - uses slf4j-log4j12 transitively from Flink dependencies. I'm
> assuming this is not a must and we can default to no binding here.
> @aljoscha please confirm.
> Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
> assuming this is not a must and we can default to no binding here. @thw
> please confirm.
> 
> It might be a good idea to use a consistent binding in tests (Since we'll
> use JUL for direct runner, let this be JUL).
> 
> On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci  wrote:
> 
> +1 on consistency across Beam modules on the logging facade
> +1 on enforcing consistency
> +1 on clearly documenting how to do logging
> 
> Mixed feelings:
> * Logging backend could be runner-specific, particularly if it needs to
> integrate into some other experience
> * java.util.logging could be a good choice for the Direct Runner
> 
> On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay 
> wrote:
> 
>> On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss 
>> wrote:
>> 
>>> This is a great idea!
>>> 
>>> I believe Python-SDK's logging could also be enhanced (a bit
>> differently):
>>> Currently we are not instantiating the logger, just using the class what
>>> logging package provides.
>>> Shortcoming of this approach is that the user cannot set the log level
> on
>>> a per module basis as all log messages
>>> end up in the root level.
>>> 
>> 
>> +1 to this. Python SDK needs to expand its logging capabilities. Filed
> [1]
>> for this.
>> 
>> Ahmet
>> 
>> [1] https://issues.apache.org/jira/browse/BEAM-1825
>> 
>> 
>>> 
>>> On 3/22/17, 5:46 AM, "Aviem Zur"  wrote:
>>> 
>>>+1 to what JB said.
>>> 
>>>Will just have to be documented well as if we provide no binding
>> there
>>> will
>>>be no logging out of the box unless the user adds a binding.
>>> 
>>>On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <
>> j...@nanthrax.net>
>>>wrote:
>>> 
 Hi Aviem,
 
 Good point.
 
 I think, in our dependency set, we should just depend on
>> slf4j-api
>>> and
 let the
 user provide the binding he wants (slf4j-log4j12, slf4j-simple,
>>> whatever).
 
 We define a binding only with test scope in our modules.
 
 Regards
 JB
 
 On 03/22/2017 04:58 AM, Aviem Zur wrote:
> Hi all,
> 
> There have been a few reports lately (On JIRA [1] and on Slack)
>>> from
 users
> regarding inconsistent loggers used across Beam's modules.
> 
> While we use SLF4J, different modules use a different logger
>>> behind it
> (JUL, log4j, etc)
> So when people add a log4j.properties file to their classpath
> for
 instance,
> they expect this to affect all of their dependencies on Beam
>>> modules, but
> it doesn’t and they miss out on some logs they thought they
> would
>>> see.
> 
> I think we should strive for consistency in which logger is used
>>> behind
> SLF4J, and try to enforce this in our modules.
> I for one think it should be slf4j-log4j. However, if
> performance
>>> of
> logging is critical we might want to consider logback.
> 
> Note: SLF4J will still be the facade for logging across the
>>> project. The
> only change would be the logger SLF4J delegates to.
> 
> Once we have something like this it would also be useful to add
> documentation on logging in Beam to the website.
> 
> [1] https://issues.apache.org/jira/browse/BEAM-1757
> 
 
 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com
 
>>> 
>>> 
>>> 
>> 



Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Aviem Zur
>* java.util.logging could be a good choice for the Direct Runner
Yes, this will be great for users (Instead of having no logging when using
direct runner).

>* Logging backend could be runner-specific, particularly if it needs to
>integrate into some other experience
Good point, let's take a look at the current state of runners:
Direct runner - will use JUL as suggested.
Dataflow runner - looks like there is already no binding (There is a
binding in tests only).
Spark runner - currently uses slf4j-log4j12. It does not require any specific
logger, we can change this to no binding.
Flink runner - uses slf4j-log4j12 transitively from Flink dependencies. I'm
assuming this is not a must and we can default to no binding here.
@aljoscha please confirm.
Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
assuming this is not a must and we can default to no binding here. @thw
please confirm.

It might be a good idea to use a consistent binding in tests (Since we'll
use JUL for direct runner, let this be JUL).

On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci  wrote:

+1 on consistency across Beam modules on the logging facade
+1 on enforcing consistency
+1 on clearly documenting how to do logging

Mixed feelings:
* Logging backend could be runner-specific, particularly if it needs to
integrate into some other experience
* java.util.logging could be a good choice for the Direct Runner

On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay 
wrote:

> On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss 
> wrote:
>
> > This is a great idea!
> >
> > I believe Python-SDK's logging could also be enhanced (a bit
> differently):
> > Currently we are not instantiating the logger, just using the class what
> > logging package provides.
> > Shortcoming of this approach is that the user cannot set the log level on
> > a per module basis as all log messages
> > end up in the root level.
> >
>
> +1 to this. Python SDK needs to expand its logging capabilities. Filed
[1]
> for this.
>
> Ahmet
>
> [1] https://issues.apache.org/jira/browse/BEAM-1825
>
>
> >
> > On 3/22/17, 5:46 AM, "Aviem Zur"  wrote:
> >
> > +1 to what JB said.
> >
> > Will just have to be documented well as if we provide no binding
> there
> > will
> > be no logging out of the box unless the user adds a binding.
> >
> > On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > wrote:
> >
> > > Hi Aviem,
> > >
> > > Good point.
> > >
> > > I think, in our dependency set, we should just depend on
> slf4j-api
> > and
> > > let the
> > > user provide the binding he wants (slf4j-log4j12, slf4j-simple,
> > whatever).
> > >
> > > We define a binding only with test scope in our modules.
> > >
> > > Regards
> > > JB
> > >
> > > On 03/22/2017 04:58 AM, Aviem Zur wrote:
> > > > Hi all,
> > > >
> > > > There have been a few reports lately (On JIRA [1] and on Slack)
> > from
> > > users
> > > > regarding inconsistent loggers used across Beam's modules.
> > > >
> > > > While we use SLF4J, different modules use a different logger
> > behind it
> > > > (JUL, log4j, etc)
> > > > So when people add a log4j.properties file to their classpath for
> > > instance,
> > > > they expect this to affect all of their dependencies on Beam
> > modules, but
> > > > it doesn’t and they miss out on some logs they thought they would
> > see.
> > > >
> > > > I think we should strive for consistency in which logger is used
> > behind
> > > > SLF4J, and try to enforce this in our modules.
> > > > I for one think it should be slf4j-log4j. However, if performance of
> > > > logging is critical we might want to consider logback.
> > > >
> > > > Note: SLF4J will still be the facade for logging across the
> > project. The
> > > > only change would be the logger SLF4J delegates to.
> > > >
> > > > Once we have something like this it would also be useful to add
> > > > documentation on logging in Beam to the website.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/BEAM-1757
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> >
> >
>


Re: Want to contribute to Beam project

2017-04-03 Thread tarush grover
Thanks Jean. I will take a look at Jira to take upon some tickets.

Regards,
Tarush

On Sun, 2 Apr 2017 at 11:12 AM, Jean-Baptiste Onofré 
wrote:

> Hi Tarush,
>
> welcome aboard !
>
> You can take a look on https://beam.apache.org/contribute/.
>
> Any contribution is valuable (not only code): documentation, etc.
>
> I propose to you to take a look on the Jira, experiment Beam to find new
> features/improvement, and be involved on the mailing list.
>
> Regards
> JB
>
> On 04/01/2017 09:59 PM, tarush grover wrote:
> > Hi Members,
> >
> > Let me introduce myself: I am Tarush Grover, with 3 years working in the
> > big data technologies as a senior software engineer. I find Apache Beam
> > to be an exciting project.
> >
> > I request community members to please involve me in this exciting
> journey.
> > Please guide me to where and how to start so that I can quickly pace with
> > the active development and it would be great if you can assign something
> to
> > me to start.
> >
> > Regards,
> > Tarush
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Update of Pei in Alibaba

2017-04-03 Thread Ismaël Mejía
Thanks Jingsong for answering, and for the StreamScope ref; I am going to
check the paper. The concept of non-global checkpointing sounds super
interesting.

It is nice that you guys are also trying to promote the move to a unified model.

Regards,
Ismaël


On Sun, Apr 2, 2017 at 3:40 PM, JingsongLee  wrote:
> Hi Ismaël,
> We have a streaming computing platform in Alibaba.
> Galaxy is an internal system, so you can't find information about it on Google.
> It is becoming more like StreamScope (you can search it for the paper).
> Non-global-checkpoint makes failure recovery quick and makes streaming
> applications easier to develop and debug.
>
>
> But as far as I know, each engine has its own tradeoffs and its own good
> cases. So we also developed an open source platform, which has Spark, Flink,
> and so on. We hope we can use Apache Beam to unify the user programming
> model. This will make the user learning costs low and the application
> migration costs low. (Not only from batch to streaming, but also conducive
> to migration from one streaming engine to another.)
>
>
> --
> From: Ismaël Mejía  Time: 2017 Apr 2 (Sun) 03:18  To: dev
> Subject: Re: Update of Pei in Alibaba
> Excellent news,
>
> Pei, it would be great to have a new runner. I am curious about how
> different the Storm implementations are among themselves, considering that
> there are already three 'versions': Storm, JStorm, and Heron. I wonder
> if one runner could translate to an API that would cover all of them (of
> course maybe I am super naive; I really don't know much about JStorm or
> Heron and how much they differ from the original Storm).
>
> Jingsong, I am super curious about this Galaxy project; is there any
> public information about it? Is this related to the previous Blink
> Alibaba project? I already looked a bit, but searching "Alibaba galaxy"
> is a recipe for a myriad of telephone sellers :)
>
> Nice to see that you are going to keep contributing to the project Pei.
>
> Regards,
> Ismaël
>
>
>
> On Sat, Apr 1, 2017 at 4:59 PM, Tibor Kiss  wrote:
>> Exciting times, looking forward to try it out!
>>
>> I should mention that Taylor Goetz has also started creating a Beam runner
>> using Storm.
>> His work is available in the storm repo:
>> https://github.com/apache/storm/commits/beam-runner
>> Maybe it's worthwhile to take a peek and see if something is reusable from
>> there.
>>
>> - Tibor
>>
>> On Sat, Apr 1, 2017 at 4:37 AM, JingsongLee  wrote:
>>
>>> Wow, very glad to see that JStorm has also started building a Beam runner.
>>> I am working on Galaxy (another stream processing engine at Alibaba).
>>> I hope that we can work together to promote the use of Apache Beam
>>> in Alibaba and China.
>>>
>>> best,
>>> JingsongLee
>>> --
>>> From: Pei HE
>>> Time: 2017 Apr 1 (Sat) 09:24
>>> To: dev <dev@beam.apache.org>
>>> Subject: Update of Pei in Alibaba
>>> Hi all,
>>> In February, I moved from Seattle to Hangzhou, China, and joined Alibaba.
>>> I want to give an update on things here.
>>>
>>> A colleague and I have been working on a JStorm runner. We have a
>>> prototype that works with WordCount and PAssert. (I am going to start a
>>> separate email thread about how to get it reviewed and merged in Apache
>>> Beam.)
>>> We also have Spark clusters, and are very interested in
>>> using Spark runner.
>>>
>>> Last Saturday, I went to the China Hadoop Summit and gave a talk about
>>> the Apache Beam model. While many companies gave talks about their
>>> in-house solutions for unified batch and unified SQL, there is also a
>>> lot of interest in and enthusiasm for Beam.
>>>
>>> Looking forward to chat more.
>>> --
>>> Pei
>>>
>>>
>>
>>
>> --
>> Kiss Tibor
>>
>> +36 70 275 9863
>> tibor.k...@gmail.com


Re: [PROPOSAL] ORC support

2017-04-03 Thread Tibor Kiss
Thanks for your replies, I've created
https://issues.apache.org/jira/browse/BEAM-1861 to track this effort.

On Sun, Apr 2, 2017 at 7:40 AM, Jean-Baptiste Onofré 
wrote:

> +1
>
> By the way, around the same topic, I'm working on Apache CarbonData
> support (http://carbondata.apache.org/).
>
> Regards
> JB
>
>
> On 04/01/2017 05:31 PM, Tibor Kiss wrote:
>
>> Hello,
>>
>> Recently the Optimized Row Columnar (ORC) file format was spun off from
>> Hive and became a top-level Apache project: https://orc.apache.org/
>>
>> It is similar to Parquet in the sense that it uses a column-major format,
>> but ORC has a more elaborate type system and stores basic statistics about
>> each column.
>>
>> I'd be interested in extending Beam with ORC support if others find it
>> helpful too.
>>
>> What do you think?
>>
>> - Tibor
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>





Re: [DISCUSSION] rename StateSpecs.combiningValue?

2017-04-03 Thread Etienne Chauchot

+1

Etienne


Le 30/03/2017 à 20:48, Kenneth Knowles a écrit :

+1 for a different reason (also: now is the time to revisit names and other
bits of the State API before it is too late :-))

Folks may not have this catalog in their head. The classes / methods are:

ValueState / value(...)
BagState / bag(...)
SetState / set(...)
MapState / map(...)
AccumulatorCombiningState / combiningValue(...)

I propose renaming this last one to CombiningState / combining(...) to
match here and also to match CombineFn.

I propose renaming what is currently CombiningState to something like
GroupingState and probably just moving it to runners-core (as a trivial
wrapper, since it won't be able to remain a superclass), since it is not
useful for users.
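To make the proposed alignment concrete, here is a minimal, self-contained Java sketch. All of the interfaces and the factory method below are stubs invented purely for illustration; they simplify the real Beam SDK signatures (which also carry a key type and an explicit accumulator type):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Stubbed catalog, mirroring the naming scheme discussed above.
interface State {}

interface ValueState<T> extends State {}

interface BagState<T> extends State {}

// Proposed rename: AccumulatorCombiningState -> CombiningState, so the
// interface name matches both the factory method and CombineFn.
interface CombiningState<InputT, OutputT> extends State {
  void add(InputT value);

  OutputT read();
}

class StateSpecs {
  // Proposed: StateSpecs.combining(...) instead of
  // StateSpecs.combiningValue(...), matching the CombiningState name.
  static <T> CombiningState<T, T> combining(T identity, BinaryOperator<T> combineFn) {
    List<T> inputs = new ArrayList<>();
    return new CombiningState<T, T>() {
      public void add(T value) {
        inputs.add(value);
      }

      public T read() {
        // Fold all buffered inputs with the combine function.
        T acc = identity;
        for (T v : inputs) {
          acc = combineFn.apply(acc, v);
        }
        return acc;
      }
    };
  }
}
```

Under this scheme every factory method name is the interface name minus the `State` suffix (`value`, `bag`, `set`, `map`, `combining`), which is the symmetry the proposal restores.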

Kenn

On Thu, Mar 30, 2017 at 9:32 AM, Jean-Baptiste Onofré 
wrote:


+0

Since the StateSpec takes AccumulatorCombiningState, the name already
"specifies" this indirectly.

Regards
JB


On 03/30/2017 04:46 PM, Etienne Chauchot wrote:


Hi all,

Just my two cents, but maybe a public-method rename is warranted
nevertheless...

There are AccumulatorCombiningState and CombiningState interfaces; the
first extends the second.

The factory method StateSpecs.combiningValue returns a
StateSpec<Object, AccumulatorCombiningState<InputT, AccumT, OutputT>>.

The method name seems confusing for a user, who would search for a factory
method returning a CombiningState (but there is none).

Maybe StateSpecs.accumulatorCombiningValue would be a better name?

The same goes for the StateSpecs.*CombiningValue methods, which all return
AccumulatorCombiningState, but maybe the resulting names would be too long.

WDYT?

Etienne



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





Re: [PROPOSAL] @OnWindowExpiration

2017-04-03 Thread Etienne Chauchot

+1

Etienne


Le 28/03/2017 à 22:27, Kenneth Knowles a écrit :

Hi all,

I have a little extension to the stateful DoFn annotations to circulate for
feedback: Allow a method to be annotated with @OnWindowExpiration to
automatically get a callback at some point after the window has expired,
but before the state for the window has been cleared.

Today, a user can pretty easily get the same effect by setting a timer for
the end of the window + allowed lateness in their @ProcessElement calls.
But having just one annotation for it has a couple nice benefits:

1. Some users assume a naive implementation so they are concerned that
setting a timer repeatedly is costly. This eliminates the cause for user
alarm and allows a runner to do a better job in case it didn't already do
it efficiently.

2. Getting the allowed lateness to be available to your @ProcessElement is
a little crufty.

3. Often, if you don't have @OnWindowExpiration, you are leaving behind
state that might contain data that is otherwise lost. So I would even
consider making it mandatory (with some way of indicating state you don't
care about dropping) though that could be annoying.
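For context, here is a minimal sketch of what the annotation side of this could look like. The @OnWindowExpiration annotation and the BufferingFn class below are stubs invented for illustration, not the actual Beam SDK types:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stub of the proposed annotation; the real one would live in the Beam SDK
// alongside @ProcessElement and @OnTimer.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface OnWindowExpiration {}

// Illustrative DoFn-like class. Today the flush would be driven by a timer
// set in @ProcessElement for end-of-window + allowed lateness; under the
// proposal, the runner invokes the annotated method after the window
// expires but before its state is cleared.
class BufferingFn {
  @OnWindowExpiration
  public void flushRemaining() {
    // Emit or flush any buffered state here before the runner clears it.
  }
}
```

A runner can discover the method reflectively (or via generated code) and schedule the callback itself, which is what lets it choose a cheaper mechanism than one timer per element.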

Another interesting moment in a window's lifecycle is @EndOfWindow. This is
not critical for correctness, though.

Thoughts?

Kenn