Hi, I'm currently building an ETL system using Storm. Approximately 30% of the incoming records have binary attachments which need to be virus scanned. A single record can have one or more attachments. My initial thought was to build a topology with two outputs from the spout, both of which eventually feed a downstream bolt. I've attached a simple diagram. Hopefully it comes through on the list.
The spout would output tuples to the metadata transform on the default stream. If it came across a record that had attachments, it would output one or more additional tuples with the same ID to the virus scan stream. Obviously the diagram is simplified, as the metadata transform might involve many steps, but regardless it's safe to assume that the time required for the virus scanner is likely significantly higher than that of the transform stream. I would like records without attachments to keep flowing through the system without being slowed down by records that do have attachments.

From looking at the CoordinatedBolt, it looks like it probably does exactly what I'm looking for, but I'm not sure. It would join the tuples from the two streams back together and deliver them to the dissem bolt as a batch to be processed from there. Am I viewing this right or am I completely off base? I can't find many examples of CoordinatedBolts, and there aren't any real comments in the code explaining what it's doing. I feel like Direct Groupings might come into play here as well, but the link from the Documentation Manual page gets a 404, so I was unable to find more details on that.

If I'm completely off, is there an example implementation that does something similar to what I'm trying to do? Specifically, is there an example of something outputting a variable number of tuples that all get grouped back together somewhere down the line?

Thanks!
--Jason
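
P.S. To make the question more concrete, here is a rough plain-Java sketch of the join behaviour I'd want at the dissem bolt. It deliberately uses no Storm classes, so it is not how CoordinatedBolt works internally. Every name in it (RecordJoiner, Pending, onMetadata, onScanResult) is made up for illustration, and it assumes the spout can put the attachment count on the metadata tuple so the joiner knows how many scan results to wait for.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: joins one metadata tuple with a variable
// number of virus-scan tuples that share the same record ID.
class RecordJoiner {
    // Partial state held for one record ID until all of its parts arrive.
    static final class Pending {
        String metadata;      // transformed metadata, null until it arrives
        int expected = -1;    // attachment count, unknown until metadata arrives
        final List<String> scans = new ArrayList<>();

        boolean complete() {
            return metadata != null && scans.size() == expected;
        }
    }

    private final Map<String, Pending> pending = new HashMap<>();

    // Called for a tuple arriving from the metadata transform stream.
    // Assumption: the spout attached the attachment count to this tuple.
    List<String> onMetadata(String id, int attachmentCount, String metadata) {
        Pending p = pending.computeIfAbsent(id, k -> new Pending());
        p.metadata = metadata;
        p.expected = attachmentCount;
        return emitIfComplete(id, p);
    }

    // Called once per attachment for tuples arriving from the virus scan stream.
    List<String> onScanResult(String id, String scanResult) {
        Pending p = pending.computeIfAbsent(id, k -> new Pending());
        p.scans.add(scanResult);
        return emitIfComplete(id, p);
    }

    // Returns the metadata followed by all scan results once the record is
    // complete, or null while parts are still outstanding.
    private List<String> emitIfComplete(String id, Pending p) {
        if (!p.complete()) {
            return null;
        }
        pending.remove(id);
        List<String> batch = new ArrayList<>();
        batch.add(p.metadata);
        batch.addAll(p.scans);
        return batch;
    }

    // Number of records still waiting on at least one part.
    int pendingCount() {
        return pending.size();
    }
}
```

With this shape, a record with zero attachments is emitted as soon as its metadata tuple arrives, so it never waits on the scanner. In a real topology I assume both input streams would need a fields grouping on the record ID so that all parts of a record reach the same task, and the pending map would need some kind of timeout for records whose scan never finishes.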
