Re: Slack Request

2018-01-30 Thread Jean-Baptiste Onofré
Invite sent.

Welcome aboard!

Regards
JB

On 01/31/2018 01:56 AM, Luke Zhu wrote:
> Hello,
> 
> Could I get an invite to the Slack group?
> 
> Thanks!

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Slack Request

2018-01-30 Thread Luke Zhu
Hello,

Could I get an invite to the Slack group?

Thanks!


Re: Dependencies and Datastore

2018-01-30 Thread Jacob Marble
Josh, what did you do to work around this?

This suddenly crept up on a production pipeline yesterday, without anything
changing on our side (we do rebuild at every run).

Jacob

On Fri, Dec 8, 2017 at 6:46 PM, Chamikara Jayalath 
wrote:

> Created https://issues.apache.org/jira/browse/BEAM-3321 to update
> the gax-grpc dependency of Beam.
>
> - Cham
>
> On Friday, December 8, 2017 at 6:12:20 PM UTC-8, Harsh Vardhan wrote:
>>
>> +chamikara@
>>
>>
>> On Friday, December 8, 2017 at 1:31:00 AM UTC-8, Joshua Fox wrote:
>>>
>>> I use the Cloud Datastore API to check for Kinds in the Datastore, then use
>>> Dataflow -- now upgrading to Beam -- to copy one Datastore to another.
>>>
>>> After adding beam-sdks-java-io-google-cloud-platform to my pom, I start
>>> getting this when initializing the Cloud Datastore API
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> com/google/api/gax/retrying/ResultRetryAlgorithm
>>> at com.google.cloud.datastore.DatastoreOptions$DefaultDatastore
>>> Factory.create(DatastoreOptions.java:51)
>>> at com.google.cloud.datastore.DatastoreOptions$DefaultDatastore
>>> Factory.create(DatastoreOptions.java:45)
>>> at com.google.cloud.ServiceOptions.getService(ServiceOptions.java:426)
>>>
>>> It is caused by gax dependencies.
>>>
>>> Specifically, before I add beam-sdks-java-io-google-cloud-platform there
>>> is this gax dependency
>>>
>>> +- com.google.cloud:google-cloud-datastore:jar:1.12.0:compile
>>> |  +- com.google.cloud:google-cloud-core:jar:1.12.0:compile
>>> |  |  +- com.google.api:gax:jar:1.15.0:compile
>>>
>>>
>>>
>>> and after I add it there is this
>>>
>>> +- org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:
>>> 2.2.0:compile
>>> |  +- com.google.api:gax-grpc:jar:0.20.0:compile
>>> |  |  +- com.google.api:gax:jar:1.3.1:compile
>>>
>>>
>>> If I add gax 1.15.0 to my pom explicitly, I get
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> com/google/api/gax/retrying/ExceptionRetryAlgorithm
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>> ..
>>> at com.google.cloud.BaseService.(BaseService.java:48)
>>> at com.google.cloud.datastore.DatastoreOptions$DefaultDatastore
>>> Factory.create(DatastoreOptions.java:51)
>>> at com.google.cloud.datastore.DatastoreOptions$DefaultDatastore
>>> Factory.create(DatastoreOptions.java:45)
>>> at com.google.cloud.ServiceOptions.getService(ServiceOptions.java:426)
>>>
>>> Clearly Datastore and Beam should work together. Yet there have been
>>> dependency problems between the two for a while. See this discussion
>>> from a year ago.
>>>
>>> How can I resolve this?
>>>
>>
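
A note on workarounds: this thread does not record a confirmed fix (BEAM-3321
tracks updating Beam's gax-grpc dependency), but with a Maven conflict like the
trees above, the usual first step is to force a single version of the
conflicting artifact. A minimal sketch, assuming gax is the artifact to pin and
that the pinned version is acceptable to both google-cloud-datastore and Beam:

    <!-- Illustrative only: force Maven to resolve one gax version. -->
    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>com.google.api</groupId>
          <artifactId>gax</artifactId>
          <version>1.15.0</version>
        </dependency>
      </dependencies>
    </dependencyManagement>

As Joshua notes above, pinning gax 1.15.0 by itself still fails here, because
code on the classpath expects classes from both the old and the new gax, so
shading the Google Cloud client libraries (maven-shade-plugin with relocations)
or waiting for the Beam dependency update may be needed instead.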


Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Carlos Alonso
Created this SO question:
https://stackoverflow.com/questions/48530496/google-dataflow-custom-metrics-not-showing-on-stackdriver

Thanks Andrea!

On Tue, Jan 30, 2018 at 9:54 PM Andrea Foegler  wrote:

> I would suggest Platform Support or StackOverflow as the best places to
> request Dataflow-specific support.
>
> This could be an issue coordinating between your Stackdriver Account(s)
> and your Cloud project(s).  We can continue to discuss / investigate
> through one of the above forums.
>
> On Tue, Jan 30, 2018 at 12:30 PM, Carlos Alonso 
> wrote:
>
>> Thanks Andrea!!
>>
>> Do you mean using the UserVoice forum?
>>
>> Aside from that, something that could be helpful: when I navigate
>> https://app.google.stackdriver.com/services/dataflow the
>> message I get is this:
>> "You do not have any resources of this type being monitored by
>> Stackdriver." That's weird as well, as if our Cloud Dataflow wasn't
>> properly connected to Stackdriver; but, on the other hand, some metrics are
>> displayed and can be monitored, such as System Lag, Watermark, etc...
>>
>> Thanks!
>>
>> On Tue, Jan 30, 2018 at 9:20 PM Andrea Foegler 
>> wrote:
>>
>>> Hi Carlos -
>>>
>>> This sounds like something we should investigate further.  Since it
>>> appears to be a Dataflow specific question / issue, would you mind posting
>>> or following up in a Dataflow-specific forum or through Google Cloud
>>> Platform Support: https://cloud.google.com/dataflow/support?  Feel free
>>> to mention my name in your contact.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>>
>>> On Tue, Jan 30, 2018 at 10:27 AM, Carlos Alonso 
>>> wrote:
>>>
 Hi Andrea, thank you very much for your response.

 I've followed your directions and only droppedDueToLateness appears.
 The way I'm creating those metrics is:

 Metrics.counter("Ingester", "IngestedMessages").inc()

 I can see those counters on the Custom Counters section on the Google
 Dataflow UI, but nothing appears on Stackdriver...

 Thanks!

 On Tue, Jan 30, 2018 at 7:22 PM Andrea Foegler 
 wrote:

> Hi Carlos -
>
> Custom metrics can be "listed" by going to the Metric Explorer in
> Stackdriver and entering "custom.googleapis.com/dataflow" in the
> filter.
> If that list contains more than 100 different names, new custom
> metrics will not be created.  If this is the case, there should be a message
> in the job log reporting as much.
> (We are working with Stackdriver to improve this experience.)
>
> Also, we do not currently export distribution metrics to Stackdriver
> because we don't yet have a good mechanism to do so.  Gauge metrics are 
> not
> implemented yet and would not appear in either the Dataflow UI or
> Stackdriver.
>
> These are the only explanations I can think of for these metrics to
> not show up.  If neither of these are the case, I'm happy to investigate
> further on a particular instance.
>
> Cheers
> Andrea
>
>
>
> On 2018/01/23 19:59:08, Carlos Alonso  wrote:
> > Hi everyone!!
> >
> > I'm trying to get a deeper view on my dataflow jobs by measuring parts of
> > it using `Metrics.counter|gauge` but I cannot find how to see them on
> > Stackdriver.
> >
> > I have a premium Stackdriver account and I can see those counters under the
> > Custom Counters section on the Dataflow UI.
> >
> > I can see droppedDueToLateness 'custom' counter though on Stackdriver that
> > seems to be created via 'Metrics.counter' as well...
> >
> > What am I missing?
> >
> > Regards
> >
>

>>>
>
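
For reference on how such counters are wired up: a user metric like the
"IngestedMessages" counter above is normally declared as a field of a DoFn and
incremented per element. A minimal sketch (the element type, pass-through
behavior, and class name are assumptions for illustration, not code from this
thread):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;

    class IngestFn extends DoFn<String, String> {
      // Namespace "Ingester", name "IngestedMessages", matching the counter above.
      private final Counter ingested = Metrics.counter("Ingester", "IngestedMessages");

      @ProcessElement
      public void processElement(ProcessContext c) {
        ingested.inc();         // bump the user counter once per element
        c.output(c.element());  // pass the message through unchanged
      }
    }

Counters declared this way show up under Custom Counters in the Dataflow UI;
whether they also reach Stackdriver is the open question in this thread.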


Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Andrea Foegler
I would suggest Platform Support or StackOverflow as the best places to
request Dataflow-specific support.

This could be an issue coordinating between your Stackdriver Account(s) and
your Cloud project(s).  We can continue to discuss / investigate through
one of the above forums.

On Tue, Jan 30, 2018 at 12:30 PM, Carlos Alonso 
wrote:

> Thanks Andrea!!
>
> Do you mean using the UserVoice forum?
>
> Aside from that, something that could be helpful: when I navigate
> https://app.google.stackdriver.com/services/dataflow the
> message I get is this:
> "You do not have any resources of this type being monitored by
> Stackdriver." That's weird as well, as if our Cloud Dataflow wasn't
> properly connected to Stackdriver; but, on the other hand, some metrics are
> displayed and can be monitored, such as System Lag, Watermark, etc...
>
> Thanks!
>
> On Tue, Jan 30, 2018 at 9:20 PM Andrea Foegler  wrote:
>
>> Hi Carlos -
>>
>> This sounds like something we should investigate further.  Since it
>> appears to be a Dataflow specific question / issue, would you mind posting
>> or following up in a Dataflow-specific forum or through Google Cloud
>> Platform Support: https://cloud.google.com/dataflow/support?  Feel free
>> to mention my name in your contact.
>>
>> Cheers
>> Andrea
>>
>>
>>
>> On Tue, Jan 30, 2018 at 10:27 AM, Carlos Alonso 
>> wrote:
>>
>>> Hi Andrea, thank you very much for your response.
>>>
>>> I've followed your directions and only droppedDueToLateness appears.
>>> The way I'm creating those metrics is:
>>>
>>> Metrics.counter("Ingester", "IngestedMessages").inc()
>>>
>>> I can see those counters on the Custom Counters section on the Google
>>> Dataflow UI, but nothing appears on Stackdriver...
>>>
>>> Thanks!
>>>
>>> On Tue, Jan 30, 2018 at 7:22 PM Andrea Foegler 
>>> wrote:
>>>
 Hi Carlos -

 Custom metrics can be "listed" by going to the Metric Explorer in
 Stackdriver and entering "custom.googleapis.com/dataflow" in the
 filter.
 If that list contains more than 100 different names, new custom metrics
 will not be created.  If this is the case, there should be a message in the
 job log reporting as much.
 (We are working with Stackdriver to improve this experience.)

 Also, we do not currently export distribution metrics to Stackdriver
 because we don't yet have a good mechanism to do so.  Gauge metrics are not
 implemented yet and would not appear in either the Dataflow UI or
 Stackdriver.

 These are the only explanations I can think of for these metrics to
 not show up.  If neither of these are the case, I'm happy to investigate
 further on a particular instance.

 Cheers
 Andrea



 On 2018/01/23 19:59:08, Carlos Alonso  wrote:
 > Hi everyone!!
 >
 > I'm trying to get a deeper view on my dataflow jobs by measuring parts of
 > it using `Metrics.counter|gauge` but I cannot find how to see them on
 > Stackdriver.
 >
 > I have a premium Stackdriver account and I can see those counters under the
 > Custom Counters section on the Dataflow UI.
 >
 > I can see droppedDueToLateness 'custom' counter though on Stackdriver that
 > seems to be created via 'Metrics.counter' as well...
 >
 > What am I missing?
 >
 > Regards
 >

>>>
>>


Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-30 Thread Ben Chambers
On Tue, Jan 30, 2018 at 11:25 AM Kenneth Knowles  wrote:

> I've got some thoughts :-)
>
> Here is how I see the direction(s):
>
>  - Requirements to be relevant: known scale, SQL, retractions (required
> for correct answers)
>  - Core value-add: portability! I don't know that there is any other
> project ambitiously trying to run Python and Go on "every" data processing
> engine.
>  - Experiments: SDF and dynamic work rebalancing. Just like event time
> processing, when it matters to users these will become widespread and then
> Beam's runner can easily make the features portable.
>
> So let's do portability really well on all our most active runners. I have
> a radical proposal for how we should think about it:
>
> A portable Beam runner should be defined to be a _service_ hosting the
> Beam job management APIs.
>
> In that sense, we have zero runners today. Even Dataflow is just a service
> hosting its own API with a client-side library for converting a Beam
> pipeline into a Dataflow pipeline. Re-orienting our thinking this way is
> not actually a huge change in code, but emphasizes:
>
>  - our "runners/core" etc should focus on making these services easy
> (Thomas G is doing great work here right now)
>  - a user selecting a runner should be thought of more as just pointing at
> a different endpoint
>  - our testing infrastructure should become much more service-oriented,
> standing these up even for local testing
>  - ditto Luke's point about making a crisp line of SDK/runner
> responsibility
>

+1, I like this perspective -- I think this would be really useful. If this
encompasses more than just running (e.g., getting results/metrics/logs, etc.,
out of the pipelines), then it enables users to treat Beam as a true abstraction
layer on top of the data processing service, and to build their own
infrastructure around Beam rather than specializing.


>
> On Fri, Jan 26, 2018 at 12:58 PM, Lukasz Cwik  wrote:
>
>> 1) Instead of making it easier to write features, I think more users
>> would care about being able to move their pipelines between different
>> runners, and one of the key missing features is dynamic work rebalancing in
>> all runners (except Dataflow).
>> Also, portability is meant to help make a crisp line between what are the
>> responsibilities of the Runner and the SDK which would help make it easier
>> to write features in an SDK and to support features in Runners.
>>
>> 2) To realize portability there are a lot of JIRAs being tracked under
>> the portability label[1] that need addressing to be able to run an existing
>> pipeline in a portable manner before we even get to more advanced features.
>>
>> 1:
>> https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
>>
>> 3) Ben, do you want to design and run a couple of polls (similar to the
>> Java 8 poll) to get feedback from our users based upon the list of major
>> features being developed?
>>
>> 4) Yes, plenty. It would be worthwhile to have someone walk through the
>> open JIRAs and mark them with a label and also summarize what groups they
>> fall under as there are plenty of good ideas there.
>>
>> On Tue, Jan 23, 2018 at 5:25 PM, Robert Bradshaw 
>> wrote:
>>
>>> In terms of features, I think a key thing we should focus on is making
>>> simple things simple. Beam is very powerful, but it doesn't always
>>> make easy things easy. Features like schema'd PCollections could go a
>>> long way here. Also fully fleshing out/smoothing our runner
>>> portability story is part of this too.
>>>
>>> For Beam 3.x we could also reason about whether there's any complexity that
>>> doesn't hold its weight (e.g. side inputs on CombineFns).
>>>
>>> On Mon, Jan 22, 2018 at 9:20 PM, Jean-Baptiste Onofré 
>>> wrote:
>>> > Hi Ben,
>>> >
>>> > about the "technical roadmap", we have a thread about "Beam 3.x
>>> roadmap".
>>> >
>>> > It already provides ideas for points 3 & 4.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 01/22/2018 09:15 PM, Ben Chambers wrote:
>>> >> Thanks Davor for starting the state of the project discussions [1].
>>> >>
>>> >>
>>> >> In this fork of the state of the project discussion, I’d like to start the
>>> >> discussion of the feature roadmap for 2018 (and beyond).
>>> >>
>>> >>
>>> >> To kick off the discussion, I think the features could be divided into
>>> >> several areas, as follows:
>>> >>
>>> >>  1. Enabling Contributions: How do we make it easier to add new features
>>> >> to the supported runners? Can we provide a common intermediate layer below
>>> >> the existing functionality that features are translated to so that runners
>>> >> only need to support the intermediate layer and new features only need to
>>> >> target it? What other ways can we make it easier to contribute to the
>>> >> development of Beam?
>>> >>
>>> >>  2.
>>> >>
>>> >> 

Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Carlos Alonso
Thanks Andrea!!

Do you mean using the UserVoice forum?

Aside from that, something that could be helpful: when I navigate
https://app.google.stackdriver.com/services/dataflow the message
I get is this:
"You do not have any resources of this type being monitored by
Stackdriver." That's weird as well, as if our Cloud Dataflow wasn't
properly connected to Stackdriver; but, on the other hand, some metrics are
displayed and can be monitored, such as System Lag, Watermark, etc...

Thanks!

On Tue, Jan 30, 2018 at 9:20 PM Andrea Foegler  wrote:

> Hi Carlos -
>
> This sounds like something we should investigate further.  Since it
> appears to be a Dataflow specific question / issue, would you mind posting
> or following up in a Dataflow-specific forum or through Google Cloud
> Platform Support: https://cloud.google.com/dataflow/support?  Feel free
> to mention my name in your contact.
>
> Cheers
> Andrea
>
>
>
> On Tue, Jan 30, 2018 at 10:27 AM, Carlos Alonso 
> wrote:
>
>> Hi Andrea, thank you very much for your response.
>>
>> I've followed your directions and only droppedDueToLateness appears.
>> The way I'm creating those metrics is:
>>
>> Metrics.counter("Ingester", "IngestedMessages").inc()
>>
>> I can see those counters on the Custom Counters section on the Google
>> Dataflow UI, but nothing appears on Stackdriver...
>>
>> Thanks!
>>
>> On Tue, Jan 30, 2018 at 7:22 PM Andrea Foegler 
>> wrote:
>>
>>> Hi Carlos -
>>>
>>> Custom metrics can be "listed" by going to the Metric Explorer in
>>> Stackdriver and entering "custom.googleapis.com/dataflow" in the filter.
>>> If that list contains more than 100 different names, new custom metrics
>>> will not be created.  If this is the case, there should be a message in the
>>> job log reporting as much.
>>> (We are working with Stackdriver to improve this experience.)
>>>
>>> Also, we do not currently export distribution metrics to Stackdriver
>>> because we don't yet have a good mechanism to do so.  Gauge metrics are not
>>> implemented yet and would not appear in either the Dataflow UI or
>>> Stackdriver.
>>>
>>> These are the only explanations I can think of for these metrics to not
>>> show up.  If neither of these are the case, I'm happy to investigate
>>> further on a particular instance.
>>>
>>> Cheers
>>> Andrea
>>>
>>>
>>>
>>> On 2018/01/23 19:59:08, Carlos Alonso  wrote:
>>> > Hi everyone!!
>>> >
>>> > I'm trying to get a deeper view on my dataflow jobs by measuring parts of
>>> > it using `Metrics.counter|gauge` but I cannot find how to see them on
>>> > Stackdriver.
>>> >
>>> > I have a premium Stackdriver account and I can see those counters under the
>>> > Custom Counters section on the Dataflow UI.
>>> >
>>> > I can see droppedDueToLateness 'custom' counter though on Stackdriver that
>>> > seems to be created via 'Metrics.counter' as well...
>>> >
>>> > What am I missing?
>>> >
>>> > Regards
>>> >
>>>
>>
>


Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Andrea Foegler
Hi Carlos -

This sounds like something we should investigate further.  Since it appears
to be a Dataflow specific question / issue, would you mind posting or
following up in a Dataflow-specific forum or through Google Cloud Platform
Support: https://cloud.google.com/dataflow/support?  Feel free to mention
my name in your contact.

Cheers
Andrea



On Tue, Jan 30, 2018 at 10:27 AM, Carlos Alonso 
wrote:

> Hi Andrea, thank you very much for your response.
>
> I've followed your directions and only droppedDueToLateness appears.
> The way I'm creating those metrics is:
>
> Metrics.counter("Ingester", "IngestedMessages").inc()
>
> I can see those counters on the Custom Counters section on the Google
> Dataflow UI, but nothing appears on Stackdriver...
>
> Thanks!
>
> On Tue, Jan 30, 2018 at 7:22 PM Andrea Foegler  wrote:
>
>> Hi Carlos -
>>
>> Custom metrics can be "listed" by going to the Metric Explorer in
>> Stackdriver and entering "custom.googleapis.com/dataflow" in the filter.
>> If that list contains more than 100 different names, new custom metrics
>> will not be created.  If this is the case, there should be a message in the
>> job log reporting as much.
>> (We are working with Stackdriver to improve this experience.)
>>
>> Also, we do not currently export distribution metrics to Stackdriver
>> because we don't yet have a good mechanism to do so.  Gauge metrics are not
>> implemented yet and would not appear in either the Dataflow UI or
>> Stackdriver.
>>
>> These are the only explanations I can think of for these metrics to not
>> show up.  If neither of these are the case, I'm happy to investigate
>> further on a particular instance.
>>
>> Cheers
>> Andrea
>>
>>
>>
>> On 2018/01/23 19:59:08, Carlos Alonso  wrote:
>> > Hi everyone!!
>> >
>> > I'm trying to get a deeper view on my dataflow jobs by measuring parts of
>> > it using `Metrics.counter|gauge` but I cannot find how to see them on
>> > Stackdriver.
>> >
>> > I have a premium Stackdriver account and I can see those counters under the
>> > Custom Counters section on the Dataflow UI.
>> >
>> > I can see droppedDueToLateness 'custom' counter though on Stackdriver that
>> > seems to be created via 'Metrics.counter' as well...
>> >
>> > What am I missing?
>> >
>> > Regards
>> >
>>
>


Re: Custom metrics not showing on Stackdriver

2018-01-30 Thread Andrea Foegler
Hi Carlos -

Custom metrics can be "listed" by going to the Metric Explorer in
Stackdriver and entering "custom.googleapis.com/dataflow" in the filter.
If that list contains more than 100 different names, new custom metrics
will not be created.  If this is the case, there should be a message in the
job log reporting as much.
(We are working with Stackdriver to improve this experience.)

Also, we do not currently export distribution metrics to Stackdriver
because we don't yet have a good mechanism to do so.  Gauge metrics are not
implemented yet and would not appear in either the Dataflow UI or
Stackdriver.

These are the only explanations I can think of for these metrics to not
show up.  If neither of these are the case, I'm happy to investigate
further on a particular instance.

Cheers
Andrea


On 2018/01/23 19:59:08, Carlos Alonso  wrote:
> Hi everyone!!
>
> I'm trying to get a deeper view on my dataflow jobs by measuring parts of
> it using `Metrics.counter|gauge` but I cannot find how to see them on
> Stackdriver.
>
> I have a premium Stackdriver account and I can see those counters under the
> Custom Counters section on the Dataflow UI.
>
> I can see droppedDueToLateness 'custom' counter though on Stackdriver that
> seems to be created via 'Metrics.counter' as well...
>
> What am I missing?
>
> Regards
>


Re: Calculate Average over a Window

2018-01-30 Thread Kenneth Knowles
The use of CoGroupByKey is for joining your input stream with the
aggregation. It isn't strictly necessary to use CoGroupByKey to join the
average humidity and average temperature. You could also compute both of
these together by building a compound Combine transform to do both at the
same time. There are two ways:

1. Write your own CombineFn that calculates both averages together. I will
refer you to
https://beam.apache.org/documentation/programming-guide/#combine

2. Use "composed combine" which I think is most documented in javadoc,
here:
https://beam.apache.org/documentation/sdks/javadoc/2.2.0/org/apache/beam/sdk/transforms/CombineFns.html#compose--

I think the second one is probably a good fit and will look roughly like
this:

Combine.perKey(
    CombineFns.compose()
        .with(RawEvent::getHumidity, new MeanFn<>(), humidityTag)
        .with(RawEvent::getTemp, new MeanFn<>(), tempTag)))

Kenn
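
To make the shape of that concrete, here is a slightly fuller sketch of the
composed combine. It is only a sketch: RawEvent and its getters are assumed
types (not from this thread), the input is assumed to be a windowed
PCollection<KV<String, RawEvent>> keyed by deviceID, and Mean.of() is used in
place of the MeanFn placeholder above.

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.CombineFns;
    import org.apache.beam.sdk.transforms.CombineFns.CoCombineResult;
    import org.apache.beam.sdk.transforms.Mean;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;

    // Tags used to pull each average back out of the composed result.
    final TupleTag<Double> avgHumidityTag = new TupleTag<Double>() {};
    final TupleTag<Double> avgTempTag = new TupleTag<Double>() {};

    // events is assumed to be a windowed PCollection<KV<String, RawEvent>>.
    PCollection<KV<String, CoCombineResult>> averages =
        events.apply(
            Combine.perKey(
                CombineFns.compose()
                    .with(
                        new SimpleFunction<RawEvent, Double>() {
                          @Override
                          public Double apply(RawEvent e) {
                            return e.getHumidity();
                          }
                        },
                        Mean.<Double>of(),
                        avgHumidityTag)
                    .with(
                        new SimpleFunction<RawEvent, Double>() {
                          @Override
                          public Double apply(RawEvent e) {
                            return e.getTemp();
                          }
                        },
                        Mean.<Double>of(),
                        avgTempTag)));

    // Per key and window, each average is read back out of the CoCombineResult:
    //   Double avgHumidity = coCombineResult.get(avgHumidityTag);
    //   Double avgTemp = coCombineResult.get(avgTempTag);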

On Tue, Jan 30, 2018 at 4:31 AM, Steiner Patrick <
patr...@steiner-buchholz.de> wrote:

> Thanks for the pointer to 'CoGroupByKey', which makes perfect sense and so
> does the usage of Mean.
>
> Still I have to bother you, as I'm still probably lacking some basic
> understanding on how ideal messages/data-structures are supposed to look
> like in a Beam Pipeline.
>
> This is what I want to happen for each element in my original PCollection
> ( read from PubSub )
>
> original message -> "{ humidity=42.71, temp=20.71}"
> final message -> "deviceID":{timestamp=1516549776,
> humidity=42.71, temp=20.71, avg_hum=, avg_temp=}"
>
> Where I'm stuck is how to calculate the average ( Mean ) for each
> datapoint in my original message?
>
> As you pointed out the usage of 'CoGroupByKey' I assume I have to do
> something like in the following visualization?
>
> https://github.com/PatrickSteiner/Google_Cloud_IoT_Demo/blob/master/pictures/Mean_flow.png
>
> Am I on the right path?
>
>
> Thanks
>
> Patrick
>
> Kenneth Knowles wrote:
>
> It looks like what you want is to join your input stream with the computed
> averages. It might look something like this:
>
> PCollection inputEvents = ...apply(Window.into(
> SlidingWindows))
> PCollection> avgTemps =
> inputEvents.apply(Mean.perKey())
>
> I don't want to recreate all of the docs on this thread, so I will just
> point to CoGroupByKey [1] that you would use to join these on DeviceId;
> the pattern looks something like this, where I've left out lots of
> boilerplate:
>
> PCollection> joined = ...
> PCollection> result =
> joined.apply(ParDo.of())
>
> Kenn
>
> [1] https://beam.apache.org/documentation/programming-guide/#cogroupbykey
>
>
>
>
>
> On Fri, Jan 26, 2018 at 6:23 AM, Steiner Patrick <
> patr...@steiner-buchholz.de> wrote:
>
>> Hi Kenn,
>>
>> thanks again for responding.
>>
>> Let me try to explain better what I'm looking for. For simplicity reason
>> let's take a more simple example.
>>
>> Message 1  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
>> avg_temp=10}"
>> Message 2  "4711" : "{temp=12}"  will become  "4711" : "{temp=12,
>> avg_temp=11}"
>> Message 3  "4711" : "{temp=14}"  will become  "4711" : "{temp=14,
>> avg_temp=12}"
>> Message 4  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
>> avg_temp=11.5}"
>>
>> So for each incoming message I would like to append the current windows
>> "avg_temp" and all this with a sliding window.
>> So if we would say that the window is 2 seconds and we receive one
>> message per second, my sample would change to
>>
>>
>> Message 1  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
>> avg_temp=10}"
>> Message 2  "4711" : "{temp=12}"  will become  "4711" : "{temp=12,
>> avg_temp=11}"
>> Message 3  "4711" : "{temp=14}"  will become  "4711" : "{temp=14,
>> avg_temp=13}"
>> Message 4  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
>> avg_temp=12}"
>>
>> Does this explain what I plan?
>>
>> Thanks again for your help
>>
>> Patrick
>>
>>
>> Kenneth Knowles wrote:
>>
>>
>>
>> On Thu, Jan 25, 2018 at 3:31 AM, Steiner Patrick <
>> patr...@steiner-buchholz.de> wrote:
>>
>>> Hi Kenn, all,
>>>
>>> you are right, sliding windows seems to be exactly what I need. Thanks
>>> for that pointer.
>>>
>>> Where I'm still in need of expert advise is how to structure the data
>>> within my PCollection. From PubSub I do read all data as a JSON-String
>>>
>>> Format 1: "{ humidity=42.71, temp=20.71}"
>>>
>>> currently I'm just extending the JSON-String with the deviceID of the
>>> sender and the timestamp
>>>
>>> Format 2: "{timestamp=1516549776, deviceID=esp8266_D608CF,
>>> humidity=42.71, temp=20.71}"
>>>
>>> I guess it would make sense to take the deviceID as "Key" to a Key/Value
>>> pair, so I can Group by "deviceID"?
>>>
>>> Format 3: "esp8266_D608CF" : "{timestamp=1516549776, humidity=42.71,
>>> 
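
For the other option Kenn describes earlier in this thread -- keeping every
original record and annotating it with the average for its window -- a minimal
CoGroupByKey sketch follows. Again this is an assumption-laden sketch: RawEvent
and EnrichedEvent are placeholder types, events is the windowed
PCollection<KV<String, RawEvent>> keyed by deviceID, and avgTemps is the
PCollection<KV<String, Double>> produced by Mean.perKey() over the same
windows.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.join.CoGbkResult;
    import org.apache.beam.sdk.transforms.join.CoGroupByKey;
    import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;

    // Tags identifying each side of the join.
    final TupleTag<RawEvent> eventTag = new TupleTag<RawEvent>() {};
    final TupleTag<Double> avgTempTag = new TupleTag<Double>() {};

    // Group the raw events and the per-window averages by deviceID.
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(eventTag, events)
            .and(avgTempTag, avgTemps)
            .apply(CoGroupByKey.<String>create());

    // Emit every original event annotated with its window's average.
    PCollection<EnrichedEvent> result =
        joined.apply(
            ParDo.of(
                new DoFn<KV<String, CoGbkResult>, EnrichedEvent>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    String deviceId = c.element().getKey();
                    CoGbkResult row = c.element().getValue();
                    // Mean.perKey() yields one value per key and window.
                    Double avgTemp = row.getOnly(avgTempTag, Double.NaN);
                    for (RawEvent e : row.getAll(eventTag)) {
                      c.output(new EnrichedEvent(deviceId, e, avgTemp));
                    }
                  }
                }));

A real pipeline would also need coders registered (or inferred) for the
placeholder RawEvent and EnrichedEvent types.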

Re: Calculate Average over a Window

2018-01-30 Thread Steiner Patrick
Thanks for the pointer to 'CoGroupByKey', which makes perfect sense and 
so does the usage of Mean.


Still I have to bother you, as I'm still probably lacking some basic 
understanding on how ideal messages/data-structures are supposed to look 
like in a Beam Pipeline.


This is what I want to happen for each element in my original 
PCollection ( read from PubSub )


original message -> "{ humidity=42.71, temp=20.71}"
final message -> "deviceID":{timestamp=1516549776, 
humidity=42.71, temp=20.71, avg_hum=, avg_temp=}"


Where I'm stuck is how to calculate the average ( Mean ) for each 
datapoint in my original message?


As you pointed out the usage of 'CoGroupByKey' I assume I have to do 
something like in the following visualization?


https://github.com/PatrickSteiner/Google_Cloud_IoT_Demo/blob/master/pictures/Mean_flow.png

Am I on the right path?

Thanks

Patrick

Kenneth Knowles wrote:
It looks like what you want is to join your input stream with the 
computed averages. It might look something like this:


PCollection inputEvents = 
...apply(Window.into(SlidingWindows))
PCollection> avgTemps = 
inputEvents.apply(Mean.perKey())


I don't want to recreate all of the docs on this thread, so I will 
just point to CoGroupByKey [1] that you would use to join these on 
DeviceId; the pattern looks something like this, where I've left out
lots of boilerplate:


PCollection> joined = ...
PCollection> result = 
joined.apply(ParDo.of())


Kenn

[1] https://beam.apache.org/documentation/programming-guide/#cogroupbykey





On Fri, Jan 26, 2018 at 6:23 AM, Steiner Patrick 
> wrote:


Hi Kenn,

thanks again for responding.

Let me try to explain better what I'm looking for. For simplicity
reason let's take a more simple example.

Message 1  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
avg_temp=10}"
Message 2  "4711" : "{temp=12}"  will become  "4711" : "{temp=12,
avg_temp=11}"
Message 3  "4711" : "{temp=14}"  will become  "4711" : "{temp=14,
avg_temp=12}"
Message 4  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
avg_temp=11.5}"

So for each incoming message I would like to append the current
windows "avg_temp" and all this with a sliding window.
So if we would say that the window is 2 seconds and we receive one
message per second, my sample would change to


Message 1  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
avg_temp=10}"
Message 2  "4711" : "{temp=12}"  will become  "4711" : "{temp=12,
avg_temp=11}"
Message 3  "4711" : "{temp=14}"  will become  "4711" : "{temp=14,
avg_temp=13}"
Message 4  "4711" : "{temp=10}"  will become  "4711" : "{temp=10,
avg_temp=12}"

Does this explain what I plan?

Thanks again for your help

Patrick


Kenneth Knowles wrote:



On Thu, Jan 25, 2018 at 3:31 AM, Steiner Patrick
> wrote:

Hi Kenn, all,

you are right, sliding windows seems to be exactly what I
need. Thanks for that pointer.

Where I'm still in need of expert advise is how to structure
the data within my PCollection. From PubSub I do read all
data as a JSON-String

Format 1: "{ humidity=42.71, temp=20.71}"

currently I'm just extending the JSON-String with the
deviceID of the sender and the timestamp

Format 2: "{timestamp=1516549776, deviceID=esp8266_D608CF,
humidity=42.71, temp=20.71}"

I guess it would make sense to take the deviceID as "Key" to
a Key/Value pair, so I can Group by "deviceID"?

Format 3: "esp8266_D608CF" : "{timestamp=1516549776,
humidity=42.71, temp=20.71}"


Yup, this is the right basic set up for just about anything
you'll want to do.

What I'm still "missing" is an idea how to apply  "Mean" to
each "humidity" and "temp", so that as a result I can create
something like

Format 4: "esp8266_D608CF" : "{timestamp=1516549776,
humidity=42.71, temp=20.71, avg_hum=,
avg_temp=}"


Do you just want to calculate the averages and have one record
output summarizing the device/window? Or do you want to keep all
the original records and annotate them with the avg for their
window, in other words basically doing the first calculation and
then joining it with your original stream?

Kenn

Happy to take advise or pointer into the right direction.

Thanks

Patrick



Kenneth Knowles wrote:

Hi Patrick,

It sounds like you want to read about event-time windowing: