Use noneGrouping between the two bolts so that the only overhead is a thread context switch. Storm on Linux manages these context switches quite well; unless you are already at the stage of CPU-usage optimization, I would not sweat it.
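A minimal sketch of the wiring being suggested, assuming Storm's core Java API (TopologyBuilder and its grouping methods); the component names and the spout/bolt classes are hypothetical placeholders, not from this thread, so treat this as a topology-configuration fragment rather than runnable code:

```java
import org.apache.storm.topology.TopologyBuilder;

public class ExportTopology {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();

        // Hypothetical spout feeding the pipeline.
        builder.setSpout("events", new EventSpout(), 1);

        // CPU-bound processing bolt, scaled independently of the export side.
        builder.setBolt("process", new ProcessBolt(), 4)
               .shuffleGrouping("events");

        // I/O-bound export bolt; noneGrouping tells Storm you don't care
        // how tuples are distributed, leaving the transfer as cheap as
        // an in-process handoff between executor threads.
        builder.setBolt("export", new ExportBolt(), 2)
               .noneGrouping("process");

        return builder;
    }
}
```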
________________________________
From: Hemanth Yamijala <[email protected]>
Sent: Thursday, January 8, 2015 8:27 AM
To: [email protected]
Subject: Re: Storm patterns vis-a-vis external data storage

Itai & Jens,

Thank you for sharing your thoughts. My requirement is what Jens has referred to as "exporting" data from my topology. I can clearly see the benefits of segregating this functionality into another bolt - e.g. to scale it independently of the processing bolts, or to accommodate changes. The only negative (if it is one) seems to be the increase in the number of runtime bolt instances in the topology. I understand that this can be addressed with more hardware and Storm's horizontal scalability. Also, it might be hard to quantify precisely, given the different scaling requirements of processing-bound and I/O-bound bolts. Do you see this as a concern?

Thanks
hemanth

On Wed, Jan 7, 2015 at 9:39 PM, Jens-U. Mozdzen <[email protected]> wrote:

Hi Hemanth,

quoting Hemanth Yamijala <[email protected]>:

> Hi all,
>
> I guess it is common to build topologies where message processing in Storm results in data that should be stored in external stores such as NoSQL databases, or in message queues such as Kafka. There are two broad approaches to handling this storage:
>
> 1) Inline the storage functionality with the processing functionality - i.e. the bolt generating the data to be stored also takes care of storing it.
>
> 2) Separate the two and make a downstream bolt responsible for the storage.
>
> I just wanted to see whether people on the list think there are advantages that favour one approach over the other, and whether there are pitfalls to watch for in one case or the other.

I'd say: it depends ;)

In the case of aggregation bolts that persist their state, you may want to limit the memory footprint of each bolt instance. Implementing an in-memory cache for the persisted data is pretty helpful there, but it means incorporating persistence access per bolt.
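The in-memory cache idea above can be sketched in plain Java with a bounded LRU map built on LinkedHashMap's eviction hook; the class name and capacity are illustrative assumptions, not anything from this thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal bounded LRU cache, one way to cap the memory footprint of an
// aggregation bolt that also persists its state. On eviction, a real
// bolt would flush the eldest entry to the external store before
// dropping it from memory.
public class BoltStateCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BoltStateCache(int capacity) {
        super(16, 0.75f, true); // access-order iteration gives LRU semantics
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```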
OTOH, if you plan to "export" data from your topology (which seems to be the main focus of your question), separating calculation and "export" into separate bolts seems a natural choice to me - especially when you consider future changes (i.e. to support different or possibly *additional* export paths, you can keep the "tuple interface" as it is and simply connect different and/or additional export bolts).

Regards,
Jens
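The "keep the tuple interface, swap the export bolts" idea can be sketched in plain Java, independent of Storm; the Exporter interface and ExportDispatcher names are hypothetical illustrations of the design, not part of any Storm API. The processing side only ever sees the stable tuple shape, so additional export paths are added by registering more implementations without touching the calculation code:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// One export path; implementations might write to Kafka, a NoSQL
// store, or anything else that consumes the same tuples.
interface Exporter {
    void export(Map<String, Object> tuple);
}

// Fans each tuple out to every registered export path, mirroring how
// multiple export bolts could subscribe to one processing bolt's stream.
class ExportDispatcher {
    private final List<Exporter> exporters = new CopyOnWriteArrayList<>();

    public void register(Exporter e) {
        exporters.add(e);
    }

    public void dispatch(Map<String, Object> tuple) {
        for (Exporter e : exporters) {
            e.export(tuple); // every path consumes the identical tuple
        }
    }
}
```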
