Hi. In the project I'm working on, we have a lot of code that basically:
* consumes normal tuples as they come in, building up some sort of
aggregated representation of what was in those tuples
* then when a tick tuple comes in, it publishes the whole set of data
(e.g., it sends the aggregates to some other bolt for processing, or publishes
to Kafka or Cassandra, whatever)
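Stripped of the Storm plumbing, the pattern looks roughly like the sketch below. The class and method names (CountAggregator, onTuple, onTick) are made up for illustration and are not Storm APIs. In a real bolt, execute() would check something like TupleUtils.isTick(tuple) and call one method or the other, with the tick interval set through Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS in the component configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Storm-free sketch of the aggregate-then-publish pattern.
// All names here are hypothetical, chosen only for illustration.
final class CountAggregator {
    private Map<String, Long> counts = new HashMap<>();

    // Call for every normal tuple: fold it into the running aggregate.
    void onTuple(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Call when a tick tuple arrives: hand back everything collected
    // so far and start a fresh, empty aggregate.
    Map<String, Long> onTick() {
        Map<String, Long> snapshot = counts;
        counts = new HashMap<>();
        return snapshot;
    }
}
```

Whatever onTick() returns is what gets emitted downstream or written to Kafka or Cassandra, all in one go.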
Of course, with the most straightforward implementation, and because the bolts
typically start at more or less the same time, the tick tuples all get
delivered at about the same time. So it's really easy to end up in a
circumstance where some downstream consumer spends 59 seconds out of 60 doing
nothing, then gets completely pounded on for a second, then spends the next 59
seconds doing nothing.
In our use cases, generally we want to do things like aggregate data for 60
seconds, but the aggregates don't all need to line up.
I keep thinking that if there were a way to tell Storm that we want a tick tuple
every 60 seconds, but with a random delay of between 0 and 60 seconds before
the first one is sent, that'd just fix this right up. But I don't see an
obvious way to do that.
Clearly there are ways in which we can take care of this in our code; they just
involve more code. (-:
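To make the "more code" option concrete, here is one possible shape for it, as a sketch only. It assumes the bolt asks for a tick tuple every second instead of every 60, and that each bolt instance picks its own random offset within the 60-tick window and only publishes on the tick that matches. JitteredFlushSchedule and its methods are hypothetical names, not anything Storm provides.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: decides which of the frequent ticks should
// actually trigger a publish. Not a Storm class.
final class JitteredFlushSchedule {
    private final int periodTicks; // ticks between publishes, e.g. 60
    private final int offset;      // tick within each period that publishes
    private long ticksSeen = 0;

    // Picks a random offset in [0, periodTicks) so that instances
    // started at the same moment publish on different ticks.
    JitteredFlushSchedule(int periodTicks) {
        this(periodTicks, ThreadLocalRandom.current().nextInt(periodTicks));
    }

    // Fixed offset, mainly useful for tests.
    JitteredFlushSchedule(int periodTicks, int offset) {
        if (periodTicks <= 0 || offset < 0 || offset >= periodTicks) {
            throw new IllegalArgumentException("need 0 <= offset < periodTicks");
        }
        this.periodTicks = periodTicks;
        this.offset = offset;
    }

    // Call once per tick tuple. Returns true exactly once in every
    // run of periodTicks consecutive calls.
    boolean onTick() {
        boolean publish = (ticksSeen % periodTicks) == offset;
        ticksSeen++;
        return publish;
    }
}
```

The bolt would build one of these per instance (in prepare(), say) and publish its aggregates only when onTick() returns true. The costs are 60 times as many tick tuples, a first window that is shorter than 60 seconds, and the fact that the tick tuples themselves still arrive in lockstep; only the publishing gets spread out.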
It seems like this would be a common use case. Are there better approaches?
Is there some trick that would make it possible to smear the tick tuples out
over time? If you're in this situation, how do you handle it?
I'd love to be missing something easy and obvious.
Thanks!
-Steve