Hi,

I am not familiar with CoordinatedBolt, but it does not seem to do what
you want (it seems to be meant for DRPC):

http://www.pixelmachine.org/storm/2012/01/03/How-CoordinatedBolt-Works.html

To me, it seems that you would like to perform a simple join. For this,
you need to buffer all incoming meta-data tuples that belong to messages
with attachments in Dissem until the join is complete. To detect
completion, Dissem needs to know, for each tuple coming from
meta-data-transform, how many attachment tuples to expect from the virus
scanner. The spout can simply add this count as an attribute. If the
attachment-count attribute is zero, the message can be processed
immediately.
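
A minimal sketch of the buffering logic described above, in plain Java
without any Storm classes. All names (JoinBuffer, onMetadata,
onAttachment) are made up for illustration, and tuples are reduced to
strings. It also assumes that a scan result may arrive before its
meta-data tuple, so both arrival orders are buffered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that Dissem could use internally; not part of Storm.
class JoinBuffer {
    // Per-message state: the meta-data tuple, the expected number of
    // attachment tuples, and the scan results received so far.
    private static final class Pending {
        String metadata;      // null until the meta-data tuple arrives
        int expected = -1;    // unknown until the meta-data tuple arrives
        final List<String> scans = new ArrayList<>();
    }

    private final Map<String, Pending> pending = new HashMap<>();

    // Called for each tuple from meta-data-transform. Returns the joined
    // record if it is complete, or null if attachments are outstanding.
    List<String> onMetadata(String messageId, String metadata, int attachmentCount) {
        Pending p = pending.computeIfAbsent(messageId, k -> new Pending());
        p.metadata = metadata;
        p.expected = attachmentCount;
        return emitIfComplete(messageId, p);
    }

    // Called for each tuple from the virus scanner.
    List<String> onAttachment(String messageId, String scanResult) {
        Pending p = pending.computeIfAbsent(messageId, k -> new Pending());
        p.scans.add(scanResult);
        return emitIfComplete(messageId, p);
    }

    private List<String> emitIfComplete(String messageId, Pending p) {
        if (p.metadata == null || p.scans.size() < p.expected) {
            return null;      // join not complete yet, keep buffering
        }
        pending.remove(messageId);
        List<String> joined = new ArrayList<>();
        joined.add(p.metadata);
        joined.addAll(p.scans);
        return joined;
    }

    // Number of messages still waiting for tuples.
    int pendingCount() {
        return pending.size();
    }
}
```

A message with an attachment count of zero is returned by onMetadata
right away, so it is never held back by slower messages. A real bolt
would also have to decide when to ack the buffered tuples and how to
expire entries whose attachments never arrive; the sketch leaves both
out.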

Does this make sense to you?

However, I don't understand why you want to use direct grouping. Fields
grouping on the message-id attribute should work for you, since it sends
all tuples with the same message-id to the same Dissem task.
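
To illustrate why this is enough for the join, here is a conceptual
model of fields grouping in plain Java. It is only a sketch and not
Storm's actual code; the class and method names are made up. The point
is that the target task depends only on the grouping values, so the
meta-data tuple and all attachment tuples of one message meet in the
same task.

```java
import java.util.Arrays;
import java.util.List;

// Conceptual model only, not Storm's implementation.
class FieldsGroupingSketch {
    // Picks a task index in [0, numTasks) from the grouping field values.
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // Both tuples carry the same message-id, so they get the same task.
        int metaTask = chooseTask(Arrays.<Object>asList("msg-42"), numTasks);
        int scanTask = chooseTask(Arrays.<Object>asList("msg-42"), numTasks);
        System.out.println(metaTask == scanTask);
    }
}
```

In the topology itself, this would mean calling fieldsGrouping with
new Fields("message-id") on both inputs of Dissem, assuming both
upstream components declare a field with that name.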


-Matthias


On 04/29/2015 02:18 AM, Jason Kusar wrote:
> Hi,  
> 
> I'm currently working on building an ETL system using storm.
> Approximately 30% of the incoming records have binary attachments which
> need to be virus scanned. A single record can have one or more
> attachments. My initial thought was to build a topology with two outputs
> from the spout both of which eventually feed a downstream bolt. I've
> attached a simple diagram. Hopefully it comes through on the list.
> 
> 
> The spout would output tuples to the metadata transform on the default
> stream. If it came across a record that had attachments, it would output
> one or more additional tuples with the same ID to the Virus scan stream.
> Obviously the diagram is simplified as the Metadata transform might
> involve many steps, but regardless it's safe to assume that the time
> required for the virus scanner is likely significantly higher than the
> transform stream. I would like for records not having attachments to be
> able to keep flowing through the system without being slowed down by
> those records that do happen to have attachments.  
> 
> From looking at the CoordinatedBolt, it looks like it probably does
> exactly what I'm looking for, but I'm not sure. It would join the tuples
> from the two streams back together and deliver them to the dissem bolt
> as a batch to be processed from there. Am I viewing this right or am I
> completely off base? I can't find a lot of examples of CoordinatedBolts
> and there aren't any real comments in the code explaining what it's doing.
> 
> I feel like Direct Groupings might come into play here as well, but the
> link from the Documentation Manual page gets a 404, so I was unable to
> find more details on that. 
> 
>  If I'm completely off, is there an example implementation that does
> something similar to what I'm trying to do? Specifically, is there an
> example of something outputting a variable number of tuples that all get
> grouped back together somewhere down the line? 
> 
> Thanks!
> --Jason
