Let me try again, since I feel like what I'm trying to do is exactly the
sort of application Storm/Trident is designed for, yet I can't find a
natural way to implement it.

The goal is very simple: data streams in from a spout (probably a
partitioned spout) in 5-minute bursts, where the burst boundaries come
from a timestamp in the data. The spout knows when each burst has been
completed. I expect each burst to take 2-3 minutes to process. Each
burst is maybe 50 million tuples, so far more than one batch. I'd like
the data to flow from the spout, be grouped and aggregated by various
combinations of fields, and then have the 5-minute aggregates flow on to
a set of threshold filters that compare each aggregate's value to a
threshold and generate alarms as appropriate.
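
To make that concrete, here is a minimal sketch of the shape I have in
mind (the field names, burstSpout, the ThresholdFilter class, and the
100.0 threshold are all made up; Sum is Trident's stock combiner):

    import backtype.storm.tuple.Fields;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFilter;
    import storm.trident.operation.builtin.Sum;
    import storm.trident.tuple.TridentTuple;

    // Keeps only aggregates whose value exceeds a fixed threshold.
    public class ThresholdFilter extends BaseFilter {
        private final double threshold;
        public ThresholdFilter(double threshold) { this.threshold = threshold; }
        @Override
        public boolean isKeep(TridentTuple tuple) {
            return ((Number) tuple.getValue(0)).doubleValue() > threshold;
        }
    }

    TridentTopology topology = new TridentTopology();
    topology.newStream("bursts", burstSpout)
            .groupBy(new Fields("dim1", "dim2"))
            .aggregate(new Fields("metric"), new Sum(), new Fields("total"))
            .each(new Fields("total"), new ThresholdFilter(100.0));
            // ...alarm generation hangs off the filtered stream.

The catch is that aggregate() only accumulates within a single batch,
while a 50-million-tuple burst spans many batches; that is the problem
described below.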

Of course I've read the section on Time Series in the Storm FAQ
(https://github.com/nathanmarz/storm/wiki/FAQ). I'm not trying to write
the 5-minute aggregates to an external database to be queried by some
external process; I'd like them to keep flowing through Trident so the
threshold/alarm processing happens inside the topology.


On Fri, May 2, 2014 at 10:57 AM, Larry Palmer <[email protected]> wrote:

> I'm kind of new to Trident, and I'm looking for some help making
> persistentAggregates accumulate values over a defined set of tuples.
>
> groupBy() followed by a regular (combiner) aggregate accumulates values
> for the duration of a batch, then emits them all onto the output
> stream, which is very convenient.
>
> I'd like to make a persistentAggregate behave similarly, except that it
> aggregates across batches, then outputs all its aggregates onto an output
> stream when some external signal occurs, and then clears its database to
> begin the next aggregation period.
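>
> As a concrete starting point, the vanilla pattern looks roughly like
> this (MemoryMapState.Factory is just a stand-in for a real state
> backend, and the field names are made up):
>
>     TridentState running = stream
>             .groupBy(new Fields("DIMENSION_1", "DIMENSION_2"))
>             .persistentAggregate(new MemoryMapState.Factory(),
>                                  new Fields("METRIC_TO_BE_AGGREGATED"),
>                                  new Sum(),
>                                  new Fields("running_total"));
>     // This accumulates across batches as desired, but there is no
>     // built-in hook to flush everything downstream and clear the
>     // state when the aggregation period ends.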
>
> One approach would be to have the spout put a "Done" flag in each
> tuple, e.g. emit tuples like ("DIMENSION_1", "DIMENSION_2",
> "METRIC_TO_BE_AGGREGATED", "DONE") while the topology does .groupBy(new
> Fields("DIMENSION_1", "DIMENSION_2")), with the final tuple of each
> aggregation period having the Done flag set. Or, maybe better, a final
> batch could be sent consisting of a single tuple with null values for
> all the other columns and the Done flag set, to indicate that all
> previous batches are complete.
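>
> Roughly, in the spout's emitBatch() the tuples would look like this
> (values made up; the declared fields are ("DIMENSION_1", "DIMENSION_2",
> "METRIC_TO_BE_AGGREGATED", "DONE")):
>
>     // Normal data tuple within an aggregation period:
>     collector.emit(new Values("dim1-val", "dim2-val", 42L, false));
>
>     // Marker closing the period: null for everything except DONE.
>     collector.emit(new Values(null, null, null, true));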
>
> The problem with that is that I'm not sure how to get the tuple with
> the "Done" flag to go to all the partitions of the persistentAggregate.
> Is there any way to broadcast a specific tuple to all partitions?
>
> Another approach would be to split a second stream off the spout that
> watches for the "Done" flag and issues a state query when it sees it.
> But it doesn't appear to be possible for a single state query to
> produce more than a single tuple of output: a query function wants to
> be given a list of keys to look up, and it returns the tuples for those
> keys. I don't see a way for a state query to dump the whole DB.
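>
> For reference, the shape I mean is something like this (signalSpout and
> DoneFilter are hypothetical, MapGet is the stock QueryFunction, and
> "running" is the TridentState returned by the persistentAggregate):
>
>     topology.newStream("signals", signalSpout)
>             .each(new Fields("DONE"), new DoneFilter()) // keep DONE tuples only
>             .stateQuery(running,
>                         new Fields("DIMENSION_1", "DIMENSION_2"),
>                         new MapGet(),
>                         new Fields("running_total"));
>     // MapGet returns the stored value for each key tuple it is handed;
>     // nothing here can enumerate every key held in the state.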
>
> Is there any way to do what I'm trying to do?
