Hi Kashyap,

Question: what percentage of time is spent in Kryo deserialization, and how
much in the LMAX Disruptor?
--John

On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <[email protected]> wrote:

> That is right. But for decently well-written code, the Disruptor is
> almost always the CPU hogger. That said, on the issue of emits taking
> time, we found that the size of the emitted object matters. Kryo times
> for serialization and deserialization increase with size.
>
> But does size have a correlation with the Disruptor showing up so
> prominently in profiling?
>
> Thanks
> Kashyap
>
> Kashyap,
>
> It is only to be expected to see the Disruptor dominating CPU time. It
> is the object responsible for sending/receiving tuples (at least when
> tuples are produced by one executor thread for another executor thread
> on the same machine). Therefore, it is expected to see the Disruptor
> taking something like ~80% of the time.
>
> A nice experiment to check my statement above is to create a bolt that,
> for every tuple it receives, performs a random CPU task (like nested for
> loops) and emits a tuple only after receiving X tuples, where X > 1.
> Then I expect that you will see the percentage of CPU time for the
> Disruptor object drop.
>
> Cheers,
> Nick
>
> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <[email protected]>
> wrote:
>
>> John, Nick,
>> Thanks for broaching this topic. In my case, 1 tuple from the spout
>> gives out 200 more tuples. I too see the same class listed in VisualVM
>> profiling, and I tried bringing this down: I reduced parallelism hints,
>> played with buffers, changed LMAX strategies, changed max spout
>> pending... Nothing seems to have an impact.
>>
>> Any ideas on what could be done about this?
>>
>> Thanks
>> Kashyap
>>
>> Hello John,
>>
>> First off, let us agree on your definition of throughput. Do you define
>> throughput as the average number of tuples each of your last bolts
>> (sinks) emits per second? If yes, then OK. Otherwise, please provide us
>> with more details.
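[Editor's note: Kashyap's point above, that Kryo serialization and deserialization times grow with object size, is easy to demonstrate. Kryo is a third-party library, so this sketch uses Java's built-in serialization as a stand-in; the class and method names are illustrative, but the trend (bigger payload means more bytes written and more CPU work) is the same.]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Illustrates that serialized size (and hence serialization cost) grows
// with the emitted object's size. Java's built-in serialization stands in
// for Kryo here, since Kryo is a third-party dependency.
public class SerializationSizeDemo {
    static int serializedSize(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj); // serialize the whole payload
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A small vs. a large tuple payload:
        System.out.println("small payload: " + serializedSize(new int[10]) + " bytes");
        System.out.println("large payload: " + serializedSize(new int[10_000]) + " bytes");
    }
}
```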
>>
>> Going back to the BlockingWaitStrategy observation you have, it (most
>> probably) means that since you are producing a large number of tuples
>> (15-20 tuples per incoming tuple), the outgoing Disruptor queue gets
>> full and the emit() function blocks. Also, since you are anchoring
>> tuples (which might mean exactly-once semantics), it simply takes more
>> time to place something in the queue in order to guarantee delivery of
>> all tuples to a downstream bolt.
>>
>> Therefore, it makes sense to see so much time spent in the LMAX
>> messaging layer. A good experiment to verify your hypothesis is to not
>> anchor tuples and profile your topology again. However, I am not sure
>> that you will see a much different percentage, since for every tuple
>> you receive you have at least one call to the Disruptor layer. Maybe in
>> your case (if I got it correctly from your description), you should
>> have one call every N tuples, where N is the size of your bin in
>> tuples. Right?
>>
>> I hope I helped with my comments.
>>
>> Cheers,
>> Nick
>>
>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]> wrote:
>>
>>> Hi Everyone,
>>>
>>> I have a large fan-out that I've posted questions about before, with
>>> the following new, updated info:
>>>
>>> 1. An incoming tuple to Bolt A produces 15-20 tuples.
>>> 2. Bolt A emits to Bolt B via fieldsGrouping.
>>> 3.
I cache outgoing tuples in bins within Bolt A and then emit anchored
>>> tuples to Bolt B with the OutputCollector *emit
>>> <http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html#emit(java.util.Collection,%20java.util.List)>*
>>> (Collection<Tuple> anchors, List<Object> tuple) method.
>>> 4. I have throughput where I need it to be if I just receive tuples
>>> in Bolt B, ack, and drop them. If I do actual processing in Bolt B,
>>> throughput degrades a lot.
>>> 5. I profiled the Bolt B worker yesterday and saw that over 90% of
>>> the time is spent in com.lmax.disruptor.BlockingWaitStrategy,
>>> irrespective of whether I drop the tuples or process them in Bolt B.
>>>
>>> I am wondering if the acking of the anchor tuples is what's resulting
>>> in so much time spent in the LMAX messaging layer. What do y'all
>>> think? Any ideas appreciated, as always.
>>>
>>> Thanks! :)
>>>
>>> --John
>>
>> --
>> Nick R. Katsipoulakis
>> Department of Computer Science
>> University of Pittsburgh
>
> --
> Nick R. Katsipoulakis
> Department of Computer Science
> University of Pittsburgh
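[Editor's note: Nick's suggestion in the thread above, making one queue call per bin of N tuples instead of one per tuple, can be sketched with a bounded queue standing in for the Disruptor ring buffer. None of the names below are Storm or LMAX API; `put()` blocking on a full queue is where a profiler would attribute time to a blocking wait strategy, and binning divides the number of potentially blocking calls by N.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of "emit once every N tuples": a small bounded queue models the
// outgoing Disruptor queue. put() blocks when the queue is full; grouping
// tuples into bins reduces how often the producer hits that blocking path.
public class BinnedEmitDemo {
    // Sends `total` tuples downstream, `binSize` tuples per queue operation.
    // Returns the number of (potentially blocking) put() calls made.
    // For brevity, assumes `total` is divisible by `binSize`.
    static int emitBinned(int total, int binSize, int queueCapacity) {
        BlockingQueue<List<Integer>> outgoing = new ArrayBlockingQueue<>(queueCapacity);
        Thread consumer = new Thread(() -> {
            try {
                int seen = 0;
                while (seen < total) seen += outgoing.take().size(); // drain bins
            } catch (InterruptedException ignored) {}
        });
        consumer.start();
        int puts = 0;
        try {
            List<Integer> bin = new ArrayList<>();
            for (int i = 0; i < total; i++) {
                bin.add(i);
                if (bin.size() == binSize) {            // flush one full bin
                    outgoing.put(new ArrayList<>(bin)); // blocks if queue is full
                    bin.clear();
                    puts++;
                }
            }
            consumer.join();
        } catch (InterruptedException ignored) {}
        return puts;
    }

    public static void main(String[] args) {
        // 200 tuples per spout tuple, as in Kashyap's topology:
        System.out.println("per-tuple emits: " + emitBinned(200, 1, 4) + " puts");
        System.out.println("binned (N=20):   " + emitBinned(200, 20, 4) + " puts");
    }
}
```

With per-tuple emits every one of the 200 tuples is a separate queue operation, while a bin size of 20 makes only 10; whether that actually moves the profile depends on how much of the blocked time comes from queue pressure versus acker traffic, which is exactly John's open question.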
