Hello all,

Is there a back pressure mechanism in v1.0, other than the max spout pending mechanism? I did not know that, and I will be glad to put it to the test.
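[Editor's note: for anyone wanting to test the two mechanisms side by side, these are the knobs as I understand them from the 1.0 release notes; the backpressure property names are worth double-checking against your exact version.]

```yaml
# Classic flow control: cap on un-acked tuples in flight per spout task.
topology.max.spout.pending: 1000

# New in 1.0: automatic backpressure driven by receive-queue fill levels.
topology.backpressure.enable: true
backpressure.disruptor.high.watermark: 0.9   # throttle spouts above this fill level
backpressure.disruptor.low.watermark: 0.4    # release the throttle below this
```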
Nick

On Saturday, January 30, 2016, P. Taylor Goetz <[email protected]> wrote:

> Interesting conversation.
>
> The back pressure mechanism in 1.0 should help.
>
> Do you guys have environments that you could test that in?
>
> Better yet, do you have code to share?
>
> -Taylor
>
> On Jan 30, 2016, at 9:05 PM, [email protected] wrote:
>
> Hey Kashyap,
>
> Excellent points, especially regarding compression. I've thought about
> trying compression, and your results indicate that it's worth a shot.
>
> Also, I concur on fields grouping, especially with a dramatic fan-out
> followed by a fan-in, which is what I am currently working with.
>
> I'm sure glad I started this thread today, because both you and Nick have
> shared lots of excellent thoughts--much appreciated, and thanks to you both!
>
> --John
>
> Sent from my iPhone
>
> On Jan 30, 2016, at 7:34 PM, Kashyap Mhaisekar <[email protected]> wrote:
>
> John, Nick,
> I don't have direct answers, but here is one test I did, based on which I
> concluded that tuple size does matter.
> My use case was like this:
> Spout S emits a number *X* (say 1, 100, or 1024) -> Bolt A (which
> generates a string of *X* kb and emits it 200 times) -> Bolt C (which
> just prints the length of the string). All are shuffle grouped, with no
> limit on max spout pending.
>
> As you can see, this is a pretty straightforward topology with nothing
> much in it except emitting Strings of varying sizes.
>
> As the size increases, I notice that the throughput (number of acks on the
> spout divided by total time taken) decreases. The test was done on one
> machine so that the network can be ruled out. The only things in play here
> are the LMAX Disruptor and Kryo (de)serialization.
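[Editor's note: Kashyap's size effect is easy to reproduce outside Storm. A minimal sketch, using stdlib java.io serialization as a stand-in for Kryo (assumption: Kryo's cost scales with payload size the same way; the class and method names are illustrative, not Storm APIs).]

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Reproduces Kashyap's observation outside Storm: serialization cost
// grows with tuple size. java.io serialization stands in for Kryo.
class PayloadSizeDemo {

    // Build a payload of roughly `kb` kilobytes, like Bolt A's X-kb string.
    static String payload(int kb) {
        StringBuilder sb = new StringBuilder(kb * 1024);
        for (int i = 0; i < kb * 1024; i++) sb.append('a');
        return sb.toString();
    }

    // Size of the serialized payload in bytes; serialization time tracks
    // this closely, which is what drags throughput down for big tuples.
    static int serializedSize(int kb) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(payload(kb));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static void main(String[] args) {
        System.out.println("1 KB payload   -> " + serializedSize(1) + " bytes serialized");
        System.out.println("100 KB payload -> " + serializedSize(100) + " bytes serialized");
    }
}
```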
> Another test: if Bolt C is field grouped on X, I see that performance
> drops much further, probably because all the deserialization is being done
> on one instance of the bolt AND because the queues fill up.
>
> That said, when I compressed the emits from Bolt A (using Snappy
> compression), I saw that throughput increased drastically. I interpret
> this to mean that the reduction in size due to compression improved
> throughput.
>
> Unfortunately, I did not check VisualVM at the time.
>
> Hope this helps.
>
> Thanks,
> Kashyap
>
> On Sat, Jan 30, 2016 at 4:54 PM, John Yost <[email protected]> wrote:
>
>> Also, I am wondering if this issue is actually fixed in 0.10.0:
>> https://issues.apache.org/jira/browse/STORM-292 What do you guys think?
>>
>> --John
>>
>> On Sat, Jan 30, 2016 at 5:53 PM, John Yost <[email protected]> wrote:
>>
>>> Hi Kashyap,
>>>
>>> Question: what percentage of time is spent in Kryo deserialization, and
>>> how much in the LMAX Disruptor?
>>>
>>> --John
>>>
>>> On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <[email protected]> wrote:
>>>
>>>> That is right. But for decently well-written code, the Disruptor is
>>>> almost always the CPU hogger. That said, on the issue of emits taking
>>>> time, we found that the size of the emitted object matters. Kryo times
>>>> for serialization and deserialization increase with size.
>>>>
>>>> But does size have a correlation with the Disruptor showing up big in
>>>> profiling?
>>>>
>>>> Thanks,
>>>> Kashyap
>>>>
>>>> Kashyap,
>>>>
>>>> It is only to be expected that the Disruptor dominates CPU time. It is
>>>> the object responsible for sending/receiving tuples (at least when you
>>>> have tuples produced by one executor thread for another executor thread
>>>> on the same machine).
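[Editor's note: on Kashyap's Snappy result above, here is a framework-free sketch of why compression helps, using java.util.zip.Deflater as a stdlib stand-in for Snappy. Assumption: Bolt A's generated strings are highly repetitive, so they compress well and the serialized tuple shrinks dramatically; the class and method names are illustrative.]

```java
import java.util.zip.Deflater;

// Shows how much smaller a repetitive payload gets before it is handed
// to the serializer and the Disruptor queue. Deflater stands in for Snappy.
class CompressEmitDemo {

    // Compress a payload and return the compressed length in bytes.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // head-room for incompressible input
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(buf, n, buf.length - n);
        }
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[100 * 1024]; // 100 KB of identical bytes, like "aaaa..."
        System.out.println(payload.length + " bytes raw -> "
                + compressedSize(payload) + " bytes compressed");
    }
}
```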
>>>> Therefore, it is expected that the Disruptor accounts for something
>>>> like ~80% of the time.
>>>>
>>>> A nice experiment to check my statement above is to create a Bolt that,
>>>> for every tuple it receives, performs a random CPU task (like nested
>>>> for loops) and emits a tuple only after receiving X tuples, where
>>>> X > 1. Then I expect that you will see the percentage of CPU time for
>>>> the Disruptor object drop.
>>>>
>>>> Cheers,
>>>> Nick
>>>>
>>>> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <[email protected]> wrote:
>>>>
>>>>> John, Nick,
>>>>> Thanks for broaching this topic. In my case, 1 tuple from the spout
>>>>> gives out 200 more tuples. I too see the same class listed in VisualVM
>>>>> profiling, and I tried bringing this down: I reduced parallelism
>>>>> hints, played with buffers, changed LMAX wait strategies, changed max
>>>>> spout pending... Nothing seems to have an impact.
>>>>>
>>>>> Any ideas on what could be done about this?
>>>>>
>>>>> Thanks,
>>>>> Kashyap
>>>>>
>>>>> Hello John,
>>>>>
>>>>> First off, let us agree on your definition of throughput. Do you
>>>>> define throughput as the average number of tuples each of your last
>>>>> bolts (sinks) emits per second? If yes, then OK. Otherwise, please
>>>>> provide us with more details.
>>>>>
>>>>> Going back to the BlockingWaitStrategy observation you have, it (most
>>>>> probably) means that since you are producing a large number of tuples
>>>>> (15-20 tuples) the outgoing Disruptor queue gets full, and the emit()
>>>>> call blocks. Also, since you are anchoring tuples (which might mean
>>>>> exactly-once semantics), it basically takes more time to place
>>>>> something in the queue, in order to guarantee delivery of all tuples
>>>>> to a downstream bolt.
>>>>>
>>>>> Therefore, it makes sense to see so much time spent in the LMAX
>>>>> messaging layer.
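[Editor's note: Nick's suggested experiment (do per-tuple CPU work, but emit only once every X tuples) can be sketched without Storm at all. All names here are illustrative, not Storm APIs; the Consumer stands in for OutputCollector.emit() and the Disruptor behind it.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Bolt-like class: per-tuple work happens on every call, but the
// (expensive) emit path runs only once every `batchSize` tuples.
class BatchingBolt {
    private final int batchSize;
    private final Consumer<List<String>> emitter;
    private final List<String> pending = new ArrayList<>();
    private int emits = 0;

    BatchingBolt(int batchSize, Consumer<List<String>> emitter) {
        this.batchSize = batchSize;
        this.emitter = emitter;
    }

    // Analogous to Bolt.execute(Tuple): called once per incoming tuple.
    void execute(String tuple) {
        // ...per-tuple CPU work (Nick's "nested for loops") would go here...
        pending.add(tuple);
        if (pending.size() >= batchSize) {
            emitter.accept(new ArrayList<>(pending)); // one emit per batchSize tuples
            pending.clear();
            emits++;
        }
    }

    int emitCount() { return emits; }

    public static void main(String[] args) {
        BatchingBolt bolt = new BatchingBolt(10, batch ->
                System.out.println("emitting batch of " + batch.size()));
        for (int i = 0; i < 100; i++) bolt.execute("tuple-" + i);
        System.out.println("100 tuples in, " + bolt.emitCount() + " emits out");
    }
}
```

If the Disruptor's ~80% share really is per-emit overhead, batching like this should cut its share roughly by a factor of batchSize.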
>>>>> A good experiment to verify your hypothesis is to not anchor tuples
>>>>> and profile your topology again. However, I am not sure that you will
>>>>> see a much different percentage, since for every tuple you are
>>>>> receiving, you have at least one call to the Disruptor layer. Maybe
>>>>> in your case (if I understood your description correctly), you should
>>>>> have one call every N tuples, where N is the size of your bin in
>>>>> tuples. Right?
>>>>>
>>>>> I hope I helped with my comments.
>>>>>
>>>>> Cheers,
>>>>> Nick
>>>>>
>>>>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I have a large fan-out that I've posted questions about before, with
>>>>>> the following new, updated info:
>>>>>>
>>>>>> 1. An incoming tuple to Bolt A produces 15-20 tuples.
>>>>>> 2. Bolt A emits to Bolt B via fieldsGrouping.
>>>>>> 3. I cache outgoing tuples in bins within Bolt A and then emit
>>>>>> anchored tuples to Bolt B with the OutputCollector
>>>>>> emit(Collection<Tuple> anchors, List<Object> tuple) method:
>>>>>> http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html#emit(java.util.Collection,%20java.util.List)
>>>>>> 4. I have throughput where I need it to be if I just receive tuples
>>>>>> in Bolt B, ack, and drop. If I do actual processing in Bolt B,
>>>>>> throughput degrades a bunch.
>>>>>> 5. I profiled the Bolt B worker yesterday and see that over 90% of
>>>>>> the time is spent in com.lmax.disruptor.BlockingWaitStrategy,
>>>>>> irrespective of whether I drop the tuples or process them in Bolt B.
>>>>>>
>>>>>> I am wondering if the acking of the anchor tuples is what's resulting
>>>>>> in so much time spent in the LMAX messaging layer. What do y'all
>>>>>> think? Any ideas appreciated, as always.
>>>>>>
>>>>>> Thanks! :)
>>>>>>
>>>>>> --John
>>>>>
>>>>> --
>>>>> Nick R. Katsipoulakis,
>>>>> Department of Computer Science
>>>>> University of Pittsburgh

--
Nick R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh
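[Editor's note: a stdlib analogy for John's profile (90%+ in BlockingWaitStrategy): when a bounded buffer is full, the producer's CPU profile fills up with waiting rather than useful work. ArrayBlockingQueue stands in for the Disruptor ring buffer; all names are illustrative, not Storm or LMAX APIs.]

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Demonstrates a producer blocked on a full bounded queue, which is
// what a profiler attributes to the wait strategy, not to processing.
class BlockedProducerDemo {

    // Fill a queue of `capacity`, then time how long a producer is stuck
    // trying to add one more element (bounded by timeoutMs).
    static long blockedMillis(int capacity, long timeoutMs) {
        try {
            BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
            for (int i = 0; i < capacity; i++) queue.put(i); // consumer is "busy"
            long start = System.nanoTime();
            queue.offer(-1, timeoutMs, TimeUnit.MILLISECONDS); // emit() blocks here
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("producer blocked for ~" + blockedMillis(4, 200)
                + " ms because the queue was full");
    }
}
```

This is consistent with John's point 4: if Bolt B just acks and drops, its queue drains fast and Bolt A rarely blocks; real processing in Bolt B slows the drain and the wait strategy dominates the upstream profile.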
