Re: Apache Beam delta between windows.

Raghu Angadi Thu, 03 May 2018 14:54:26 -0700

>
>     Window of Windows:
>           1: KV<userID,clickEvent> --> groupByKey() --> fixedWindow(1hour)
> --> count() --> KV<userID,clickEventCount>
>           2: KV<userID,clickEventCount> --> groupByKey() -->
> fixedWindow(X*hour) --> ParDo(Iterable<KV<userID,clickEventCount>>)
>     This allows me to get my hands on the collection of
> KV<userID,clickEventCount> that I can iterate over and emit the delta
> between them.
>     However, the larger fixed window seems arbitrary and unnecessary as I
> want to carry the delta indefinitely, not start from 0 every few hours.



You can use 'slidingWindow(2*hours)' in step 2, right?
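
A rough illustration of why that works, as a plain-Java simulation rather than Beam code: with a 2-hour sliding window that advances every hour, each window holds the counts of two adjacent 1-hour windows, so the delta is the newer count minus the older one. The class name, the `deltas` helper and the hour-indexed map are made up for this sketch, and an hour with no count is assumed to be 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java simulation (no Beam dependency) of pairing hourly counts
// with a sliding window of 2 hours that advances every hour.
class SlidingDeltaSketch {

    // Hypothetical helper: countsByHour maps the start hour of each fixed
    // 1-hour window to that window's click count for a single user.
    static List<Long> deltas(TreeMap<Long, Long> countsByHour) {
        List<Long> result = new ArrayList<>();
        if (countsByHour.isEmpty()) {
            return result;
        }
        long first = countsByHour.firstKey();
        long last = countsByHour.lastKey();
        // The sliding window [hour - 1, hour + 1) holds the counts for
        // hour - 1 and hour; an hour with no count is treated as 0.
        for (long hour = first; hour <= last; hour++) {
            long previous = countsByHour.getOrDefault(hour - 1, 0L);
            long current = countsByHour.getOrDefault(hour, 0L);
            result.add(current - previous);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        counts.put(0L, 3L); // T-2
        counts.put(1L, 1L); // T-1
        counts.put(2L, 6L); // T
        System.out.println(deltas(counts)); // prints [3, -2, 5]
    }
}
```

In an actual pipeline each count would also need to carry its window timestamp, so that the two values landing in one sliding window can be put in order.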


On Thu, May 3, 2018 at 11:27 AM Stephan Kotze wrote:

> Hi everyone
>
> I've already posted this question on Stack Overflow at
> https://stackoverflow.com/questions/50161186/apache-beam-delta-between-windows
>
> But I thought it made sense to post it here as well.
>
> So here we go:
>
> I am trying to determine the delta between values calculated in different
> fixed windows and emit them.
>
>                       T-2                         T-1                          T
>          |---------------------------|---------------------------|---------------------------|
> userID   |    clickEventCount = 3    |    clickEventCount = 1    |    clickEventCount = 6    |
>
>                        |                           |                           |
>                        ˅                           ˅                           ˅
> userID     clickEventCountDelta = 3    clickEventCountDelta = -2   clickEventCountDelta = 5
>
>
> Something like this would be grand:
>         1. KV<userID,clickEvent> --> groupByKey() --> fixedWindow(1hour)
> --> count() --> KV<userID,clickEventCount>
>         2. KV<userID,clickEventCount> --> ?(deltaFn) -->
> KV<userId,clickEventCountDelta>
>
>
> I'm struggling a bit to find a solution that is both elegant and scalable,
> so any help would be greatly appreciated.
>
> Some things I've considered/tried:
>     Window of Windows:
>           1: KV<userID,clickEvent> --> groupByKey() --> fixedWindow(1hour)
> --> count() --> KV<userID,clickEventCount>
>           2: KV<userID,clickEventCount> --> groupByKey() -->
> fixedWindow(X*hour) --> ParDo(Iterable<KV<userID,clickEventCount>>)
>
>     This allows me to get my hands on the collection of
> KV<userID,clickEventCount> that I can iterate over and emit the delta
> between them.
>     However, the larger fixed window seems arbitrary and unnecessary as I
> want to carry the delta indefinitely, not start from 0 every few hours.
>
>     Shifting the previous count's timestamp
>         1: Essentially attempting to take the KV<userID,clickEventCount>
> PTransform and emitting an additional KV<userID,clickEventCount> with an
> adjusted timestamp.
>         2: Then grabbing this time-shifted value as a side input to a
> PTransform/DoFn for calculating the delta with the previous period.
>
>     This seemed like a cute idea, but it very quickly became a mess and
> doesn't "feel" like the right approach.
>
>     Using an external cache:
>         1: writing the KV<userID,clickEventCount> + timestamp+window to a
> distributed cache.
>         2: Grabbing the previous value from the cache and computing the
> delta.
>
>     It doesn't seem totally unreasonable to reach back in time to the
> count of the previous window via a cache. However, it "feels" wrong and
> highly inefficient given that I have the data in a PCollection somewhere
> nearby.
>
>     Stateful DoFns
>     Seems reasonable, but is it overkill? They would have to be initialised
> with cache lookbacks anyway, as they are reset when a window closes.
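
On the reset concern: Beam state is scoped per key and window, so if the hourly counts are first re-windowed into the global window, a stateful DoFn keeps its state across them. The sketch below only simulates that bookkeeping in plain Java, with a map standing in for per-key state. The class and method names are hypothetical, and it assumes the counts arrive in window order, which Beam does not guarantee by itself.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for a stateful step: remembers the last count seen
// for each user and emits the difference to the new count.
class PreviousCountSketch {

    // Stands in for per-key state; a user with no earlier count starts at 0.
    private final Map<String, Long> previousByUser = new HashMap<>();

    // Returns the delta to the previously seen count for this user,
    // then stores the new count as the next "previous" value.
    long delta(String userId, long count) {
        long previous = previousByUser.getOrDefault(userId, 0L);
        previousByUser.put(userId, count);
        return count - previous;
    }

    public static void main(String[] args) {
        PreviousCountSketch sketch = new PreviousCountSketch();
        System.out.println(sketch.delta("userID", 3)); // prints 3
        System.out.println(sketch.delta("userID", 1)); // prints -2
        System.out.println(sketch.delta("userID", 6)); // prints 5
    }
}
```
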
>
>     Sliding Window over 2 X Windows:
>     If I create a sliding window 2x the duration of the underlying count
> windows, I could potentially indefinitely emit the delta between every two
> events.
>     This also feels odd, but could be the most elegant solution here?
>
>
> None of the above really feels like the right way to approach deltas, and
> it does appear that the compute model's state doesn't cater for the "feed
> forward" of data in time.
>
> So :)
> I've done quite a bit of searching and I can't seem to find anything
> pointing me clearly in a particular direction.
> I'm probably missing something big here, so any help would be much
> appreciated.
>
>
