Re: How to preserve KeyedDataStream

Aljoscha Krettek Tue, 03 Nov 2015 06:07:57 -0800

Hi,
Where are you storing the results of each window computation? Maybe you 
could also store them from inside a custom WindowFunction, where you just 
count the elements and then store the result.

On the other hand, adding a field with the constant 1 and doing a window 
reduce (à la WordCount) is going to be much more efficient, because we only 
have to keep one element per window (the current reduced tuple) instead of all 
the tuples, as we have to for a fold or WindowFunction. If you want, you can 
also combine a reduce and a WindowFunction:
WindowedStream.apply(ReduceFunction<T> preAggregator, WindowFunction<T, R, K, W> function)

Here, the ReduceFunction does the WordCount-like counting, while in the 
WindowFunction you get the final result and can store it inside your model.
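
To make the efficiency argument concrete, here is a rough plain-Java sketch of 
the idea. It does not use the Flink API at all; the class, the LogEvent type, 
the method name and the "host@windowStart" key format are made up for 
illustration, and it assumes tumbling windows and non-negative timestamps. It 
only shows that a reduce over (host, 1) pairs needs one running value per key 
and window, and that the host stays attached to the count:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical names, no Flink classes involved.
class WindowCountSketch {

    // Hypothetical log record: the host it came from and an event timestamp.
    static final class LogEvent {
        final String host;
        final long timestampMillis;

        LogEvent(String host, long timestampMillis) {
            this.host = host;
            this.timestampMillis = timestampMillis;
        }
    }

    // Counts events per host and tumbling window.
    // Result keys have the form "host@windowStart".
    static Map<String, Long> countPerHostAndWindow(Iterable<LogEvent> events,
                                                   long windowSizeMillis) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (LogEvent e : events) {
            // Start of the tumbling window this event falls into.
            long windowStart = e.timestampMillis - (e.timestampMillis % windowSizeMillis);
            String key = e.host + "@" + windowStart;
            // The "reduce": each event contributes a 1 and only the running
            // sum is kept, never the individual events.
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LogEvent> events = Arrays.asList(
                new LogEvent("host-a", 0L),
                new LogEvent("host-a", 500L),
                new LogEvent("host-b", 700L),
                new LogEvent("host-a", 1200L));
        // Prints one count per (host, window) pair for 1-second windows.
        countPerHostAndWindow(events, 1000L)
                .forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

In the combined variant above, the last step (handing the final count to the 
per-host model) would be the job of the WindowFunction, which sees only the 
single pre-aggregated element.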

Let me know if you need more information.

Cheers,
Aljoscha
> On 03 Nov 2015, at 11:28, Martin Neumann <mneum...@sics.se> wrote:
> 
> Hej,
> 
> I want to do the following thing:
> 1. Split a Stream of incoming Logs by host address. 
> 2. For each key, create time-based windows
> 3. Count the number of items in the window
> 4. Feed it into a statistical model that is maintained for each host
> 
> Since I don't have fields to sum upon, I use a (window) fold function to 
> count the number of elements in the window. (Maybe there is a better way to 
> do this, or it could be part of the primitives)
> My problem now is that I get back a DataStream, so the distribution by key is 
> lost. Is there a way to preserve the distribution by key? Currently I only 
> store the count of elements in the windows, so I cannot simply do keyBy again.
> 
> I could fold into tuples that have the count and also contain the host 
> address but that feels clumsy.
> 
> Any hints are welcome.
> 
> 
> cheers Martin
