Hej, I want to do the following thing: 1. Split a Stream of incoming Logs by host address. 2. For each Key, create time based windows 3. Count the number of items in the window 4. Feed it into a statistical model that is maintained for each host
Since I don't have fields to sum upon, I use a (window) fold function to count the number of elements in the window. (Maybe there is a better way to do this, or it could be part of the primitives) My problem is now that I get back a DataStream so the distribution by key is lost. Is there a way to preserve the distribution by key? Currently I only store the count of element in the windows so I cannot simple do byKey again. I could fold into tuples that have the count and also contain the host address but that feels clumsy. Any hints are welcome. cheers Martin