Hi,

I'm currently working on building an ETL system using Storm. Approximately
30% of the incoming records have binary attachments which need to be virus
scanned. A single record can have one or more attachments. My initial
thought was to build a topology with two outputs from the spout both of
which eventually feed a downstream bolt. I've attached a simple diagram.
Hopefully it comes through on the list.


The spout would output tuples to the metadata transform on the default
stream. If it came across a record that had attachments, it would also
output one or more additional tuples with the same ID to the virus scan
stream. The diagram is simplified, as the metadata transform might involve
many steps, but it's safe to assume that virus scanning will take
significantly longer than the metadata transform. I would like records
without attachments to keep flowing through the system without being
slowed down by the records that do have attachments.

From looking at the CoordinatedBolt, it looks like it probably does exactly
what I'm looking for, but I'm not sure. It would join the tuples from the
two streams back together and deliver them to the dissem bolt as a batch to
be processed from there. Am I viewing this right or am I completely off
base? I can't find a lot of examples of CoordinatedBolts and there aren't
any real comments in the code explaining what it's doing.

I feel like Direct Groupings might come into play here as well, but the
link from the Documentation Manual page gets a 404, so I was unable to find
more details on that.

If I'm completely off, is there an example implementation that does
something similar to what I'm trying to do? Specifically, is there an
example of something outputting a variable number of tuples that all get
grouped back together somewhere down the line?
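
To make the question concrete, the following is a minimal sketch of the
kind of join being asked about, written in plain Java rather than against
the Storm API. It is not how CoordinatedBolt works. The class and method
names are hypothetical, and it assumes the metadata tuple carries the
number of attachments for its record, which the description above does not
specify. In a topology, both streams would presumably need a fields
grouping on the record ID so that every tuple for one record reaches the
same task.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a per-record join; all names are hypothetical. */
class RecordJoiner {

    /** Everything collected so far for one record ID. */
    static final class Pending {
        String metadata;          // null until the metadata branch delivers
        int expectedScans = -1;   // unknown until the metadata arrives
        final List<String> scanResults = new ArrayList<>();

        boolean isComplete() {
            return metadata != null && scanResults.size() == expectedScans;
        }
    }

    /** A fully joined record, ready for the downstream (dissem) step. */
    static final class Joined {
        final String recordId;
        final String metadata;
        final List<String> scanResults;

        Joined(String recordId, String metadata, List<String> scanResults) {
            this.recordId = recordId;
            this.metadata = metadata;
            this.scanResults = scanResults;
        }
    }

    private final Map<String, Pending> pending = new HashMap<>();

    /**
     * Called for a tuple from the metadata branch. Assumption: the tuple
     * says how many attachments (and so how many scan results) to expect.
     * Returns the joined record if it is now complete, otherwise null.
     */
    Joined onMetadata(String recordId, int attachmentCount, String metadata) {
        Pending p = pending.computeIfAbsent(recordId, k -> new Pending());
        p.metadata = metadata;
        p.expectedScans = attachmentCount;
        return releaseIfComplete(recordId, p);
    }

    /** Called once per attachment, when its virus scan finishes. */
    Joined onScanResult(String recordId, String scanResult) {
        Pending p = pending.computeIfAbsent(recordId, k -> new Pending());
        p.scanResults.add(scanResult);
        return releaseIfComplete(recordId, p);
    }

    private Joined releaseIfComplete(String recordId, Pending p) {
        if (!p.isComplete()) {
            return null;
        }
        pending.remove(recordId);
        return new Joined(recordId, p.metadata, new ArrayList<>(p.scanResults));
    }

    /** Number of records still waiting on at least one tuple. */
    int pendingCount() {
        return pending.size();
    }
}
```

With this shape, a record with zero attachments is released as soon as its
metadata arrives, so it never waits on the virus scanner, while a record
with attachments is held until the last scan result comes in. The sketch
ignores timeouts and failed tuples, which a real bolt would have to handle.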

Thanks!
--Jason
