I thought the Storm documentation says that noneGrouping is currently equivalent to shuffleGrouping. Has this changed? If that is still the case, I would recommend localOrShuffleGrouping instead. When the target bolt has a task in the same worker process, it keeps the data in-process and avoids serialization and network transfer.
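To make the difference concrete, here is a minimal sketch of the routing idea behind localOrShuffleGrouping. It is not Storm's actual implementation: the class name, method name, and integer task ids are all made up for illustration. The rule it shows is the documented one: if the downstream bolt has tasks in the same worker process, shuffle among only those; otherwise fall back to a plain shuffle across all tasks.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; hypothetical names, not Storm code.
final class LocalOrShuffleSketch {

    /**
     * Picks the task that should receive the next tuple.
     *
     * @param allTargetTasks   every task id of the downstream bolt (hypothetical ids)
     * @param localTargetTasks the subset running in the same worker process
     */
    static int chooseTask(List<Integer> allTargetTasks,
                          List<Integer> localTargetTasks,
                          Random random) {
        if (allTargetTasks.isEmpty()) {
            throw new IllegalArgumentException("downstream bolt has no tasks");
        }
        // Prefer in-process tasks: no serialization or network hop is needed.
        // With no local task available, behave like a normal shuffle.
        List<Integer> candidates =
                localTargetTasks.isEmpty() ? allTargetTasks : localTargetTasks;
        return candidates.get(random.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Random random = new Random();
        List<Integer> all = Arrays.asList(1, 2, 3, 4);

        // Tasks 2 and 3 are local, so only 2 or 3 can be chosen.
        System.out.println(chooseTask(all, Arrays.asList(2, 3), random));

        // No local tasks: any of 1..4 can be chosen.
        System.out.println(chooseTask(all, Collections.<Integer>emptyList(), random));
    }
}
```

In a real topology none of this is written by hand. You would call localOrShuffleGrouping with the upstream component id on the declarer returned by TopologyBuilder.setBolt, in place of noneGrouping or shuffleGrouping.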
On Thu, Jan 8, 2015 at 10:34 AM, Itai Frenkel <[email protected]> wrote:

> Use noneGrouping between the two bolts so the only overhead is a thread
> context switch. Storm+Linux manages these context switches pretty
> well. Unless you are already in the stage of CPU usage optimizations, I
> would not sweat about it.
> ------------------------------
> *From:* Hemanth Yamijala <[email protected]>
> *Sent:* Thursday, January 8, 2015 8:27 AM
> *To:* [email protected]
> *Subject:* Re: Storm patterns vis-a-vis external data storage
>
> Itai & Jens,
>
> Thank you for sharing your thoughts. My requirement is what Jens has
> referred to as "export" data from my topology outside.
>
> I can clearly see the benefits of segregating this functionality to
> another bolt - for e.g. to scale it independently of the processing bolts,
> or for accommodating changes.
>
> The only negative (if it is that) seems to be the increase in number of
> runtime bolt instances in the topology. I understand that it can be solved
> with more hardware resources and the horizontal scalability of Storm. Also,
> it might be hard to quantify this precisely, given the different scaling
> requirements for processing and I/O bound bolts. Do you see this as a
> concern ?
>
> Thanks
> hemanth
>
> On Wed, Jan 7, 2015 at 9:39 PM, Jens-U. Mozdzen <[email protected]> wrote:
>
>> Hi Hemanth,
>>
>> Zitat von Hemanth Yamijala <[email protected]>
>>
>>> Hi all,
>>>
>>> I guess it is common to build topologies where message processing in
>>> storm results in data that should be stored in external stores like NoSQL
>>> DBs or message queues like Kafka.
>>>
>>> There are two broad approaches to handle this storage:
>>>
>>> 1) Inline the storage functionality with the processing functionality -
>>> i.e. the bolt generating the info to be stored also takes care of storing
>>> it.
>>> 2) Separate out the two and make a downstream bolt responsible for the
>>> storage.
>>>
>>> Just wanted to see if people on the list think if there are advantages
>>> to favour one approach over the other. Any pitfalls to take care of in one
>>> case over the other.
>>>
>>
>> I'd say: it depends ;) In case of aggregation bolts that persist their
>> states, you may want to limit the memory footprint of each bolt instance.
>> Thus implementing an in-mem cache for persisted data is pretty helpful, but
>> means to incorporate persistence access per-bolt.
>>
>> OTOH, if you plan to "export" data from your topology (which seems to be
>> the main focus of your question), separating calculation and "export" into
>> separate bolts seems a natural choice to me - especially when you consider
>> future changes (i.e. to support a different or possibly *additional* export
>> paths - you can keep the "tuple interface" as it is and simply connect
>> different and/or additional export bolts).
>>
>> Regards,
>> Jens
