Hey Kashyap,

Thanks a bunch for adding your thoughts. It sounds like we are facing a very similar issue, where one tuple from the spout going into Bolt A generates 15-20 tuples in Bolt A's execute method. The difference appears to be that I bin the 15-20 tuples generated by Bolt A. I do that because Bolt A sends its tuples to Bolt B via fieldsGrouping, and the binning has helped my throughput quite a bit.
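For concreteness, the binning pattern described above can be sketched roughly as follows. This is a simplified illustration, not John's actual code: the `Collector` interface is a stand-in for Storm's `OutputCollector`, plain `Object` stands in for Storm's `Tuple`, and the bin size is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Storm's OutputCollector.emit(Collection<Tuple> anchors,
// List<Object> tuple) and ack(Tuple), so the sketch compiles on its own.
interface Collector {
    void emit(List<Object> anchors, List<Object> values);
    void ack(Object inputTuple);
}

// Sketch of the binning idea: buffer the values generated per input tuple,
// then emit the whole bin in one call, anchored on every input tuple that
// contributed to it, and ack those inputs after the batch emit.
class BinningSketch {
    private final int binSize;               // illustrative, e.g. 20
    private final Collector collector;
    private final List<Object> anchors = new ArrayList<>();
    private final List<Object> bin = new ArrayList<>();

    BinningSketch(int binSize, Collector collector) {
        this.binSize = binSize;
        this.collector = collector;
    }

    // Called once per incoming tuple with the 15-20 values it generated.
    void execute(Object inputTuple, List<Object> generated) {
        anchors.add(inputTuple);
        bin.addAll(generated);
        if (bin.size() >= binSize) {
            collector.emit(new ArrayList<>(anchors), new ArrayList<>(bin));
            for (Object anchor : anchors) {
                collector.ack(anchor);
            }
            anchors.clear();
            bin.clear();
        }
    }
}
```

One emit per bin, instead of one per generated tuple, is what reduces the number of hand-offs into the outgoing Disruptor queue.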
I also experimented with the buffers, the LMAX wait strategies, max spout pending, and the number of threads processing incoming/outgoing Netty messages (hoping the latter would speed up the Netty -> LMAX transfer), and none of these made a significant impact on my topology, either.

--John

On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <[email protected]> wrote:

> John, Nick,
> Thanks for broaching this topic. In my case, 1 tuple from the spout
> generates 200 more tuples. I too see the same class listed in VisualVM
> profiling, and I have tried bringing this down: I reduced parallelism
> hints, played with buffers, changed LMAX strategies, and changed max
> spout pending. Nothing seems to have an impact.
>
> Any ideas on what could be done about this?
>
> Thanks
> Kashyap
>
> Hello John,
>
> First off, let us agree on your definition of throughput. Do you define
> throughput as the average number of tuples each of your last bolts
> (sinks) emits per second? If yes, then OK. Otherwise, please provide us
> with more details.
>
> Going back to your BlockingWaitStrategy observation, it most probably
> means that because you are producing a large number of tuples (15-20 per
> input), the outgoing Disruptor queue gets full and the emit() call
> blocks. Also, since you are anchoring tuples (which might mean
> exactly-once semantics), it takes more time to place something in the
> queue in order to guarantee delivery of all tuples to a downstream bolt.
>
> Therefore, it makes sense to see so much time spent in the LMAX
> messaging layer. A good experiment to verify your hypothesis is to not
> anchor tuples and profile your topology again. However, I am not sure
> that you will see a much different percentage, since for every tuple you
> receive, you have at least one call to the Disruptor layer. Maybe in
> your case (if I got it correctly from your description), you should have
> one call every N tuples, where N is the size of your bin in tuples.
> Right?
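For readers trying the same knobs, these are the settings being discussed, roughly as they appear in storm.yaml for the 0.9.x/0.10.x (backtype.storm) line. The values shown are believed to be the usual defaults (max spout pending is normally unset), and are illustrative rather than recommendations:

```yaml
# Disruptor wait strategy; this default (BlockingWaitStrategy) is the class
# dominating the profiles discussed in this thread.
topology.disruptor.wait.strategy: "com.lmax.disruptor.BlockingWaitStrategy"

# Per-executor Disruptor ring-buffer sizes (must be powers of two).
topology.executor.receive.buffer.size: 1024
topology.executor.send.buffer.size: 1024
topology.transfer.buffer.size: 1024

# Cap on un-acked tuples in flight per spout task (unset by default).
topology.max.spout.pending: 1000

# Threads handling incoming/outgoing Netty messages.
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
```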
> I hope I helped with my comments.
>
> Cheers,
> Nick
>
> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I have a large fan-out that I've posted questions about before, with
>> the following new, updated info:
>>
>> 1. An incoming tuple to Bolt A produces 15-20 tuples.
>> 2. Bolt A emits to Bolt B via fieldsGrouping.
>> 3. I cache outgoing tuples in bins within Bolt A and then emit anchored
>> tuples to Bolt B with the OutputCollector method
>> emit(Collection<Tuple> anchors, List<Object> tuple)
>> <http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html#emit(java.util.Collection,%20java.util.List)>.
>> 4. I have throughput where I need it to be if I just receive tuples in
>> Bolt B, ack, and drop them. If I do actual processing in Bolt B,
>> throughput degrades a bunch.
>> 5. I profiled the Bolt B worker yesterday and saw that over 90% of the
>> time is spent in com.lmax.disruptor.BlockingWaitStrategy, irrespective
>> of whether I drop the tuples or process them in Bolt B.
>>
>> I am wondering if the acking of the anchor tuples is what's resulting
>> in so much time spent in the LMAX messaging layer. What do y'all think?
>> Any ideas appreciated as always.
>>
>> Thanks! :)
>>
>> --John
>
>
> --
> Nick R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
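Nick's closing point, that binning should turn one Disruptor hand-off per generated tuple into one hand-off per N tuples, is easy to put numbers on. A toy calculation follows; the fan-out of 200 comes from Kashyap's message, while the bin size of 20 is purely illustrative:

```java
// Toy count of emit/hand-off calls with and without binning.
public class HandoffCount {
    // Unbinned: one emit per generated tuple.
    static long perTuple(long generated) {
        return generated;
    }

    // Binned: one emit per bin of N tuples (rounding up for a partial
    // final bin).
    static long binned(long generated, int binSize) {
        return (generated + binSize - 1) / binSize;
    }

    public static void main(String[] args) {
        // Kashyap's case: 1 spout tuple fans out to 200 tuples.
        System.out.println(perTuple(200));   // 200 hand-offs unbinned
        System.out.println(binned(200, 20)); // 10 hand-offs with bins of 20
    }
}
```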
