Use noneGrouping between the two bolts so the only overhead is a thread context 
switch. Storm on Linux manages these context switches quite well. Unless you are 
already at the stage of optimizing CPU usage, I would not sweat it.

________________________________
From: Hemanth Yamijala <[email protected]>
Sent: Thursday, January 8, 2015 8:27 AM
To: [email protected]
Subject: Re: Storm patterns vis-a-vis external data storage

Itai & Jens,

Thank you for sharing your thoughts. My requirement is what Jens referred to 
as "exporting" data out of my topology.

I can clearly see the benefits of segregating this functionality into a separate 
bolt - e.g. to scale it independently of the processing bolts, or to 
accommodate changes.

The only negative (if it is one) seems to be the increase in the number of runtime 
bolt instances in the topology. I understand that this can be addressed with more 
hardware resources and the horizontal scalability of Storm. Also, it might be 
hard to quantify precisely, given the different scaling requirements of 
processing bolts and I/O-bound bolts. Do you see this as a concern?

Thanks
hemanth

On Wed, Jan 7, 2015 at 9:39 PM, Jens-U. Mozdzen 
<[email protected]> wrote:
Hi Hemanth,

Quoting Hemanth Yamijala <[email protected]>:
Hi all,

I guess it is common to build topologies where message processing in Storm 
results in data that should be stored in external stores like NoSQL DBs or 
message queues like Kafka.

There are two broad approaches to handle this storage:

1) Inline the storage functionality with the processing functionality - i.e. 
the bolt generating the info to be stored also takes care of storing it.
2) Separate out the two and make a downstream bolt responsible for the storage.

Just wanted to see if people on the list think there are advantages that favour 
one approach over the other, and whether there are any pitfalls to watch out for 
in either case.
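Approach (2) can be sketched without any Storm dependency as follows. All class 
and method names here are illustrative stand-ins, not actual Storm API: the 
"processing bolt" emits tuples downstream, and the "storage bolt" is the only 
component that touches the store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal, framework-free sketch of approach (2): the processing step
// emits tuples downstream instead of writing to the store itself.
public class SeparateStorageSketch {

    // Stands in for a Storm tuple: the data the processing bolt emits.
    record Tuple(String key, long value) {}

    // "Processing bolt": pure computation, no storage concerns.
    static class ProcessingBolt {
        private final Consumer<Tuple> downstream;
        ProcessingBolt(Consumer<Tuple> downstream) { this.downstream = downstream; }
        void execute(String rawEvent) {
            // e.g. derive something from the event, then emit it
            // downstream rather than persisting it here
            downstream.accept(new Tuple(rawEvent, rawEvent.length()));
        }
    }

    // "Storage bolt": its only job is to persist incoming tuples,
    // so it can be scaled and changed independently of the processing.
    static class StorageBolt implements Consumer<Tuple> {
        final List<Tuple> store = new ArrayList<>(); // stand-in for a NoSQL DB / Kafka producer
        public void accept(Tuple t) { store.add(t); }
    }
}
```

Approach (1) would simply move the `store.add(...)` call into `ProcessingBolt.execute`, 
coupling the computation to the store's latency and failure modes.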

I'd say: it depends ;) In the case of aggregation bolts that persist their state, 
you may want to limit the memory footprint of each bolt instance. Implementing 
an in-memory cache for the persisted data is pretty helpful there, but it means 
incorporating persistence access per bolt.

OTOH, if you plan to "export" data from your topology (which seems to be the 
main focus of your question), separating calculation and "export" into separate 
bolts seems the natural choice to me - especially when you consider future 
changes (i.e. supporting a different or possibly *additional* export path: 
you can keep the "tuple interface" as it is and simply connect different and/or 
additional export bolts).
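The "keep the tuple interface, swap the export bolts" idea can be sketched like 
this. The sink names are hypothetical; in a real topology each exporter would be 
a bolt subscribed to the same stream:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the calculating side emits one stable tuple shape, and any
// number of export bolts subscribe to it.
public class ExportFanoutSketch {

    record Result(String key, double score) {}   // the stable "tuple interface"

    interface ExportBolt { void export(Result r); }

    static class KafkaExport implements ExportBolt {       // hypothetical sink
        final List<String> sent = new ArrayList<>();
        public void export(Result r) { sent.add(r.key()); }
    }
    static class CassandraExport implements ExportBolt {   // hypothetical sink
        final List<String> written = new ArrayList<>();
        public void export(Result r) { written.add(r.key()); }
    }

    // Adding an *additional* export path is just one more subscriber;
    // the calculation code and the tuple shape stay untouched.
    static void emit(Result r, List<ExportBolt> exporters) {
        for (ExportBolt e : exporters) e.export(r);
    }
}
```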

Regards,
Jens

