Hej,

I want to do the following thing:
1. Split a Stream of incoming Logs by host address.
2. For each Key, create time based windows
3. Count the number of items in the window
4. Feed it into a statistical model that is maintained for each host

Since I don't have fields to sum upon, I use a (window) fold function to
count the number of elements in the window. (Maybe there is a better way to
do this, or it could be part of the primitives)
My problem is now that I get back a DataStream so the distribution by key
is lost. Is there a way to preserve the distribution by key? Currently I
only store the count of element in the windows so I cannot simple do byKey
again.

I could fold into tuples that have the count and also contain the host
address but that feels clumsy.

Any hints are welcome.


cheers Martin

Reply via email to