I was intending to test Flux and security on the 0.10.0 release. I will also test back pressure along with this. Thanks, Taylor.
But continuing the original discussion, here are two more interesting things I observed:

1. The number of ackers makes a difference in fan-out and fan-in topologies.
2. Determining the number of workers: the FAQ says "There's no great reason to use more than one worker per topology per machine," but what I observed is that, depending on the number of tuples being emitted, increasing the number of workers does result in better performance for a single topology.

Any opinions/observations/comments on this?

Thanks,
Kashyap

On Sun, Jan 31, 2016 at 6:37 AM, John Yost <[email protected]> wrote:

> Hey Taylor,
>
> Cool re: back pressure mechanism--do you have a quick overview and
> corresponding classes to check out?
>
> Also, as I mentioned earlier in this thread, it seems like the STORM-292
> enhancement that's supposed to be in 0.10.1 would also help this
> situation, as the publishing bolt's emit would no longer block if the
> receiving bolt's disruptor queue in one worker is full. If I am reading
> the JIRA ticket correctly, the worker(s) would use additional off-heap
> memory to keep sending tuples. But I don't see this or the umbrella
> STORM-216 ticket in the README for either the 0.9.6 or 0.10.0 releases,
> so I'm not sure this is actually in Storm at this point.
>
> Thanks
>
> --John
>
> On Sat, Jan 30, 2016 at 10:11 PM, Nick R. Katsipoulakis <
> [email protected]> wrote:
>
>> Hello all,
>>
>> There is a back pressure mechanism in v1.0? Other than the max spout
>> pending mechanism? I did not know that, and I will be glad to put it to
>> the test.
>>
>> Nick
>>
>> On Saturday, January 30, 2016, P. Taylor Goetz <[email protected]> wrote:
>>
>>> Interesting conversation.
>>>
>>> The back pressure mechanism in 1.0 should help.
>>>
>>> Do you guys have environments that you could test that in?
>>>
>>> Better yet, do you have code to share?
>>>
>>> -Taylor
>>>
>>> On Jan 30, 2016, at 9:05 PM, [email protected] wrote:
>>>
>>> Hey Kashyap,
>>>
>>> Excellent points, especially regarding compression. I've thought about
>>> trying compression, and your results indicate that it's worth a shot.
>>>
>>> Also, I concur on fields grouping, especially with a dramatic fan-out
>>> followed by a fan-in, which is what I am currently working with.
>>>
>>> I'm sure glad I started this thread today, because both you and Nick
>>> have shared lots of excellent thoughts--much appreciated, and thanks to
>>> you both!
>>>
>>> --John
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 30, 2016, at 7:34 PM, Kashyap Mhaisekar <[email protected]>
>>> wrote:
>>>
>>> John, Nick,
>>> I don't have direct answers, but here is one test I did, based on which
>>> I concluded that tuple size does matter.
>>> My use case was like this: Spout S emits a number *X* (say 1, 100, or
>>> 1024) -> Bolt A (which generates a string of *X* KB and emits it 200
>>> times) -> Bolt C (which just prints the length of the string). All are
>>> shuffle grouped, with no limit on max spout pending.
>>>
>>> As you can see, this is a pretty straightforward topology that does
>>> little more than emit Strings of varying sizes.
>>>
>>> As the size increases, I notice that the throughput (number of acks on
>>> the spout divided by total time taken) decreases. The test was done on
>>> one machine so that the network could be ruled out. The only things in
>>> play here are the LMAX disruptor and Kryo (de)serialization.
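For concreteness, here is a minimal sketch of the test topology Kashyap describes above, written against the 0.10-era backtype.storm API. The class names, parallelism hints, field names, and the X = 100 KB setting are illustrative assumptions, not taken from his actual code:

    import java.util.Arrays;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class SizeTestTopology {

        // Spout S: emits X (the payload size in KB) with a message id,
        // so every emit is anchored and acks can be counted and timed.
        public static class SizeSpout extends BaseRichSpout {
            private final int sizeKb;               // X: 1, 100, 1024, ...
            private transient SpoutOutputCollector collector;
            private long msgId = 0;

            public SizeSpout(int sizeKb) { this.sizeKb = sizeKb; }

            public void open(Map conf, TopologyContext ctx,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                collector.emit(new Values(sizeKb), msgId++);
            }

            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("sizeKb"));
            }
        }

        // Bolt A: builds an X-KB string and emits it 200 times per input.
        public static class FanOutBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                char[] buf = new char[input.getIntegerByField("sizeKb") * 1024];
                Arrays.fill(buf, 'x');
                String payload = new String(buf);
                for (int i = 0; i < 200; i++) {
                    collector.emit(new Values(payload)); // auto-anchored to input
                }
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("payload"));
            }
        }

        // Bolt C: just reads the payload length (forcing deserialization).
        public static class SinkBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println(input.getStringByField("payload").length());
            }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("s", new SizeSpout(100));  // X = 100 KB
            builder.setBolt("a", new FanOutBolt(), 4).shuffleGrouping("s");
            builder.setBolt("c", new SinkBolt(), 4).shuffleGrouping("a");
            new LocalCluster().submitTopology("size-test", new Config(),
                                              builder.createTopology());
        }
    }

Because everything runs shuffle grouped on a single machine, the only moving parts are the disruptor queues and the Kryo (de)serialization of the payload string, which is what makes this a clean test of tuple-size effects.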
>>> Another test: if Bolt C is fields grouped on X, I see that performance
>>> drops much further, probably because all the deserialization is being
>>> done on one instance of the bolt AND because the queues fill up.
>>>
>>> That said, when I compressed the emits from Bolt A (using Snappy
>>> compression), I saw throughput increase drastically. I interpret this
>>> as the reduction in size due to compression improving throughput (see
>>> the Snappy sketch below).
>>>
>>> Unfortunately, I did not check VisualVM at the time.
>>>
>>> Hope this helps.
>>>
>>> Thanks
>>> Kashyap
>>>
>>> On Sat, Jan 30, 2016 at 4:54 PM, John Yost <[email protected]> wrote:
>>>
>>>> Also, I am wondering if this issue is actually fixed in 0.10.0:
>>>> https://issues.apache.org/jira/browse/STORM-292 What do you guys
>>>> think?
>>>>
>>>> --John
>>>>
>>>> On Sat, Jan 30, 2016 at 5:53 PM, John Yost <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Kashyap,
>>>>>
>>>>> Question--what percentage of time is spent in Kryo deserialization,
>>>>> and how much in the LMAX disruptor?
>>>>>
>>>>> --John
>>>>>
>>>>> On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> That is right. But even for decently written code, the disruptor is
>>>>>> almost always the CPU hog. That said, on the issue of emits taking
>>>>>> time, we found that the size of the emitted object matters. Kryo
>>>>>> serialization and deserialization times increase with size.
>>>>>>
>>>>>> But is size correlated with the disruptor showing up so prominently
>>>>>> in profiling?
>>>>>>
>>>>>> Thanks
>>>>>> Kashyap
>>>>>>
>>>>>> Kashyap,
>>>>>>
>>>>>> It is only to be expected that the Disruptor dominates CPU time. It
>>>>>> is the component responsible for sending/receiving tuples (at least
>>>>>> when tuples are produced by one executor thread for another executor
>>>>>> thread on the same machine). Therefore, it is expected that the
>>>>>> Disruptor accounts for something like ~80% of the time.
>>>>>>
>>>>>> A nice experiment to check my statement above is to create a bolt
>>>>>> that, for every tuple it receives, performs an arbitrary CPU task
>>>>>> (like nested for loops) and emits a tuple only after receiving X
>>>>>> tuples, where X > 1. Then I expect you will see the percentage of
>>>>>> CPU time for the Disruptor drop (see the sketch below).
>>>>>>
>>>>>> Cheers,
>>>>>> Nick
>>>>>>
>>>>>> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> John, Nick,
>>>>>>> Thanks for broaching this topic. In my case, one tuple from the
>>>>>>> spout gives out 200 more tuples. I too see the same class listed in
>>>>>>> VisualVM profiling, and I have tried bringing this down: I reduced
>>>>>>> parallelism hints, played with buffers, changed LMAX strategies,
>>>>>>> changed max spout pending... nothing seems to have an impact.
>>>>>>>
>>>>>>> Any ideas on what could be done about this?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Kashyap
>>>>>>>
>>>>>>> Hello John,
>>>>>>>
>>>>>>> First off, let us agree on your definition of throughput. Do you
>>>>>>> define throughput as the average number of tuples each of your last
>>>>>>> bolts (sinks) emits per second? If yes, then OK. Otherwise, please
>>>>>>> provide us with more details.
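For reference, a rough sketch of the compression change Kashyap describes above: compress the payload in Bolt A so that Kryo and the disruptor queues only move a small byte[], and decompress in Bolt C. The use of the org.xerial.snappy library and the field names are assumptions on my part:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    import org.xerial.snappy.Snappy;

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Bolt A variant: Snappy-compress the payload before emitting, so the
    // disruptor queues and Kryo move far fewer bytes per tuple.
    class CompressingFanOutBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            char[] buf = new char[input.getIntegerByField("sizeKb") * 1024];
            Arrays.fill(buf, 'x');
            try {
                byte[] compressed = Snappy.compress(
                        new String(buf).getBytes(StandardCharsets.UTF_8));
                for (int i = 0; i < 200; i++) {
                    collector.emit(new Values(compressed));
                }
            } catch (IOException e) {
                throw new RuntimeException("Snappy compression failed", e);
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("payload"));
        }
    }

    // Bolt C variant: decompress before touching the payload.
    class DecompressingSinkBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            try {
                byte[] raw = Snappy.uncompress(input.getBinaryByField("payload"));
                System.out.println(new String(raw, StandardCharsets.UTF_8).length());
            } catch (IOException e) {
                throw new RuntimeException("Snappy decompression failed", e);
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

One caveat: a synthetic payload of repeated characters compresses almost to nothing, so the drastic gains seen in this test will likely be smaller on real data.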
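And a minimal sketch of the experiment Nick suggests above: burn CPU on every input tuple and emit only once every X tuples, so the Disruptor's share of the profile should drop. The nested-loop bounds and batch size are arbitrary placeholders:

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Does CPU-bound work for every input tuple and emits only once every
    // X inputs (X > 1); with real work dominating, the profiled share of
    // the disruptor wait strategy should drop.
    public class CpuHeavyBatchingBolt extends BaseBasicBolt {
        private final int batchSize;   // Nick's X
        private int seen = 0;
        private long acc = 0;

        public CpuHeavyBatchingBolt(int batchSize) { this.batchSize = batchSize; }

        public void execute(Tuple input, BasicOutputCollector collector) {
            for (int i = 0; i < 10000; i++) {      // arbitrary nested-loop busywork
                for (int j = 0; j < 100; j++) {
                    acc += (long) i * j;
                }
            }
            if (++seen % batchSize == 0) {
                collector.emit(new Values(acc));   // one emit per X inputs
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("result"));
        }
    }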
>>>>>>> Going back to the BlockingWaitStrategy observation you have, it
>>>>>>> (most probably) means that since you are producing a large number
>>>>>>> of tuples (15-20 per input tuple), the outgoing Disruptor queue
>>>>>>> gets full and the emit() function blocks. Also, since you are
>>>>>>> anchoring tuples (which gives you at-least-once semantics), it
>>>>>>> takes more time to place something in the queue, in order to
>>>>>>> guarantee delivery of all tuples to the downstream bolt.
>>>>>>>
>>>>>>> Therefore, it makes sense to see so much time spent in the LMAX
>>>>>>> messaging layer. A good experiment to verify your hypothesis is to
>>>>>>> not anchor tuples and profile your topology again. However, I am
>>>>>>> not sure that you will see a much different percentage, since for
>>>>>>> every tuple you receive, you have at least one call to the
>>>>>>> Disruptor layer. Maybe in your case (if I got it correctly from
>>>>>>> your description), you should have one call every N tuples, where
>>>>>>> N is the size of your bin in tuples. Right?
>>>>>>>
>>>>>>> I hope I helped with my comments.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Nick
>>>>>>>
>>>>>>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I have a large fan-out that I've posted questions about before,
>>>>>>>> with the following new, updated info:
>>>>>>>>
>>>>>>>> 1. An incoming tuple to Bolt A produces 15-20 tuples.
>>>>>>>> 2. Bolt A emits to Bolt B via fieldsGrouping.
>>>>>>>> 3. I cache outgoing tuples in bins within Bolt A and then emit
>>>>>>>> anchored tuples to Bolt B with the OutputCollector
>>>>>>>> emit(Collection<Tuple> anchors, List<Object> tuple) method
>>>>>>>> <http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html#emit(java.util.Collection,%20java.util.List)>
>>>>>>>> (see the sketch below).
>>>>>>>> 4. I have throughput where I need it to be if I just receive
>>>>>>>> tuples in Bolt B, ack, and drop. If I do actual processing in
>>>>>>>> Bolt B, throughput degrades a lot.
>>>>>>>> 5. I profiled the Bolt B worker yesterday and see that over 90%
>>>>>>>> of the time is spent in com.lmax.disruptor.BlockingWaitStrategy,
>>>>>>>> irrespective of whether I drop the tuples or process them in
>>>>>>>> Bolt B.
>>>>>>>>
>>>>>>>> I am wondering if the acking of the anchor tuples is what's
>>>>>>>> resulting in so much time spent in the LMAX messaging layer. What
>>>>>>>> do y'all think? Any ideas appreciated, as always.
>>>>>>>>
>>>>>>>> Thanks! :)
>>>>>>>>
>>>>>>>> --John
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nick R. Katsipoulakis,
>>>>>>> Department of Computer Science
>>>>>>> University of Pittsburgh
>>>>>>
>>>>>> --
>>>>>> Nick R. Katsipoulakis,
>>>>>> Department of Computer Science
>>>>>> University of Pittsburgh
>>
>> --
>> Nick R. Katsipoulakis,
>> Department of Computer Science
>> University of Pittsburgh
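Finally, for readers following John's setup, a minimal sketch of the binning pattern from point 3 of his mail, assuming a BaseRichBolt so that the multi-anchor emit(Collection<Tuple> anchors, List<Object> tuple) overload is available. The bin size, field names, and simple string concatenation are illustrative, not John's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Buffers incoming tuples into a bin, then emits one combined tuple
    // anchored to every input in the bin via the multi-anchor emit overload.
    public class BinningBolt extends BaseRichBolt {
        private static final int BIN_SIZE = 20;    // illustrative
        private transient OutputCollector collector;
        private final List<Tuple> bin = new ArrayList<Tuple>();

        public void prepare(Map conf, TopologyContext ctx, OutputCollector c) {
            this.collector = c;
        }

        public void execute(Tuple input) {
            bin.add(input);
            if (bin.size() >= BIN_SIZE) {
                StringBuilder combined = new StringBuilder();
                for (Tuple t : bin) {
                    combined.append(t.getStringByField("payload"));
                }
                // One emit, BIN_SIZE anchors: the downstream ack of this
                // tuple credits every input tuple's tree.
                collector.emit(bin, new Values(combined.toString()));
                for (Tuple t : bin) {
                    collector.ack(t);              // inputs are done once binned
                }
                bin.clear();
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("batch"));
        }
    }

Note that each downstream ack of the combined tuple has to credit every anchor in the bin, so ack traffic through the disruptor scales with the bin size; that would be consistent with John's hypothesis that acking the anchor tuples is what keeps BlockingWaitStrategy at the top of the profile.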
