I was intending to test Flux and security on the 0.10.0 release. I will also test back pressure along with this. Thanks, Taylor.
But continuing the original discussion, here are two more interesting things I observed:

1. The number of ackers makes a difference in fan-out and fan-in topologies.
2. Determining the number of workers: the FAQ says "There's no great reason to use more than one worker per topology per machine," but what I observed is that, depending on the number of tuples being emitted, increasing the number of workers does result in better performance for a single topology.

Any opinions/observations/comments on this?

Thanks,
Kashyap

On Sun, Jan 31, 2016 at 6:37 AM, John Yost <[email protected]> wrote:

> Hey Taylor,
>
> Cool re: back pressure mechanism--do you have a quick overview and
> corresponding classes to check out?
>
> Also, as I mentioned earlier in this thread, it seems like the STORM-292
> enhancement that's supposed to be in 0.10.1 would also help this
> situation, as the publishing bolt's emit would no longer block if the
> receiving bolt's disruptor queue in one worker is full. If I am reading
> the JIRA ticket correctly, the worker(s) would use additional off-heap
> memory to keep sending tuples. But I don't see this or the umbrella
> STORM-216 ticket in the README for either the 0.9.6 or 0.10.0 releases,
> so I'm not sure this is actually in Storm at this point.
>
> Thanks
>
> --John
>
> On Sat, Jan 30, 2016 at 10:11 PM, Nick R. Katsipoulakis <
> [email protected]> wrote:
>
>> Hello all,
>>
>> There is a back pressure mechanism in v1.0? Other than the max spout
>> pending mechanism? I did not know that, and I will be glad to put it to
>> the test.
>>
>> Nick
>>
>> On Saturday, January 30, 2016, P. Taylor Goetz <[email protected]> wrote:
>>
>>> Interesting conversation.
>>>
>>> The back pressure mechanism in 1.0 should help.
>>>
>>> Do you guys have environments that you could test that in?
>>>
>>> Better yet, do you have code to share?
>>>
>>> -Taylor
>>>
>>> On Jan 30, 2016, at 9:05 PM, [email protected] wrote:
>>>
>>> Hey Kashyap,
>>>
>>> Excellent points, especially regarding compression. I've thought about
>>> trying compression, and your results indicate that it's worth a shot.
>>>
>>> Also, I concur on fields grouping, especially with a dramatic fan-out
>>> followed by a fan-in, which is what I am currently working with.
>>>
>>> I'm sure glad I started this thread today, because both you and Nick
>>> have shared lots of excellent thoughts--much appreciated, and thanks to
>>> you both!
>>>
>>> --John
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 30, 2016, at 7:34 PM, Kashyap Mhaisekar <[email protected]>
>>> wrote:
>>>
>>> John, Nick,
>>> I don't have direct answers, but here is one test I did, based on which
>>> I concluded that tuple size does matter.
>>> My use case was like this: Spout S emits a number *X* (say 1, 100, or
>>> 1024) -> Bolt A (which generates a string of *X* KB and emits it 200
>>> times) -> Bolt C (which just prints the length of the string). All are
>>> shuffle grouped, with no limit on max spout pending.
>>>
>>> As you can see, this is a pretty straightforward topology that does
>>> little more than emit Strings of varying sizes.
>>>
>>> As the size increases, I notice that the throughput (number of acks on
>>> the spout divided by total time taken) decreases. The test was done on
>>> one machine so that the network could be ruled out. The only things in
>>> play here are the LMAX disruptor and Kryo (de)serialization.
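For concreteness, here is a minimal sketch of the test topology Kashyap describes above, written against the 0.10-era backtype.storm API. The class names, parallelism hints, field names, and the X = 100 KB setting are illustrative assumptions, not taken from his actual code:

    import java.util.Arrays;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class SizeTestTopology {

        // Spout S: emits X (the payload size in KB) with a message id,
        // so every emit is anchored and acks can be counted and timed.
        public static class SizeSpout extends BaseRichSpout {
            private final int sizeKb;               // X: 1, 100, 1024, ...
            private transient SpoutOutputCollector collector;
            private long msgId = 0;

            public SizeSpout(int sizeKb) { this.sizeKb = sizeKb; }

            public void open(Map conf, TopologyContext ctx,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                collector.emit(new Values(sizeKb), msgId++);
            }

            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("sizeKb"));
            }
        }

        // Bolt A: builds an X-KB string and emits it 200 times per input.
        public static class FanOutBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                char[] buf = new char[input.getIntegerByField("sizeKb") * 1024];
                Arrays.fill(buf, 'x');
                String payload = new String(buf);
                for (int i = 0; i < 200; i++) {
                    collector.emit(new Values(payload)); // auto-anchored to input
                }
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("payload"));
            }
        }

        // Bolt C: just reads the payload length (forcing deserialization).
        public static class SinkBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println(input.getStringByField("payload").length());
            }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("s", new SizeSpout(100));  // X = 100 KB
            builder.setBolt("a", new FanOutBolt(), 4).shuffleGrouping("s");
            builder.setBolt("c", new SinkBolt(), 4).shuffleGrouping("a");
            new LocalCluster().submitTopology("size-test", new Config(),
                                              builder.createTopology());
        }
    }

Because everything runs shuffle grouped on a single machine, the only moving parts are the disruptor queues and the Kryo (de)serialization of the payload string, which is what makes this a clean test of tuple-size effects.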
>>> Another test: if Bolt C is fields grouped on X, I see that performance
>>> drops much further, probably because all the deserialization is being
>>> done on one instance of the bolt AND because the queues fill up.
>>>
>>> That said, when I compressed the emits from Bolt A (using Snappy
>>> compression), I saw throughput increase drastically. I interpret this
>>> as the reduction in size due to compression improving throughput (see
>>> the Snappy sketch below).
>>>
>>> Unfortunately, I did not check VisualVM at the time.
>>>
>>> Hope this helps.
>>>
>>> Thanks
>>> Kashyap
>>>
>>> On Sat, Jan 30, 2016 at 4:54 PM, John Yost <[email protected]> wrote:
>>>
>>>> Also, I am wondering if this issue is actually fixed in 0.10.0:
>>>> https://issues.apache.org/jira/browse/STORM-292 What do you guys
>>>> think?
>>>>
>>>> --John
>>>>
>>>> On Sat, Jan 30, 2016 at 5:53 PM, John Yost <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Kashyap,
>>>>>
>>>>> Question--what percentage of time is spent in Kryo deserialization,
>>>>> and how much in the LMAX disruptor?
>>>>>
>>>>> --John
>>>>>
>>>>> On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> That is right. But even for decently written code, the disruptor is
>>>>>> almost always the CPU hog. That said, on the issue of emits taking
>>>>>> time, we found that the size of the emitted object matters. Kryo
>>>>>> serialization and deserialization times increase with size.
>>>>>>
>>>>>> But is size correlated with the disruptor showing up so prominently
>>>>>> in profiling?
>>>>>>
>>>>>> Thanks
>>>>>> Kashyap
>>>>>>
>>>>>> Kashyap,
>>>>>>
>>>>>> It is only to be expected that the Disruptor dominates CPU time. It
>>>>>> is the component responsible for sending/receiving tuples (at least
>>>>>> when tuples are produced by one executor thread for another executor
>>>>>> thread on the same machine). Therefore, it is expected that the
>>>>>> Disruptor accounts for something like ~80% of the time.
>>>>>>
>>>>>> A nice experiment to check my statement above is to create a bolt
>>>>>> that, for every tuple it receives, performs an arbitrary CPU task
>>>>>> (like nested for loops) and emits a tuple only after receiving X
>>>>>> tuples, where X > 1. Then I expect you will see the percentage of
>>>>>> CPU time for the Disruptor drop (see the sketch below).
>>>>>>
>>>>>> Cheers,
>>>>>> Nick
>>>>>>
>>>>>> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> John, Nick,
>>>>>>> Thanks for broaching this topic. In my case, one tuple from the
>>>>>>> spout gives out 200 more tuples. I too see the same class listed in
>>>>>>> VisualVM profiling, and I have tried bringing this down: I reduced
>>>>>>> parallelism hints, played with buffers, changed LMAX strategies,
>>>>>>> changed max spout pending... nothing seems to have an impact.
>>>>>>>
>>>>>>> Any ideas on what could be done about this?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Kashyap
>>>>>>>
>>>>>>> Hello John,
>>>>>>>
>>>>>>> First off, let us agree on your definition of throughput. Do you
>>>>>>> define throughput as the average number of tuples each of your last
>>>>>>> bolts (sinks) emits per second? If yes, then OK. Otherwise, please
>>>>>>> provide us with more details.
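For reference, a rough sketch of the compression change Kashyap describes above: compress the payload in Bolt A so that Kryo and the disruptor queues only move a small byte[], and decompress in Bolt C. The use of the org.xerial.snappy library and the field names are assumptions on my part:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    import org.xerial.snappy.Snappy;

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Bolt A variant: Snappy-compress the payload before emitting, so the
    // disruptor queues and Kryo move far fewer bytes per tuple.
    class CompressingFanOutBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            char[] buf = new char[input.getIntegerByField("sizeKb") * 1024];
            Arrays.fill(buf, 'x');
            try {
                byte[] compressed = Snappy.compress(
                        new String(buf).getBytes(StandardCharsets.UTF_8));
                for (int i = 0; i < 200; i++) {
                    collector.emit(new Values(compressed));
                }
            } catch (IOException e) {
                throw new RuntimeException("Snappy compression failed", e);
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("payload"));
        }
    }

    // Bolt C variant: decompress before touching the payload.
    class DecompressingSinkBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            try {
                byte[] raw = Snappy.uncompress(input.getBinaryByField("payload"));
                System.out.println(new String(raw, StandardCharsets.UTF_8).length());
            } catch (IOException e) {
                throw new RuntimeException("Snappy decompression failed", e);
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

One caveat: a synthetic payload of repeated characters compresses almost to nothing, so the drastic gains seen in this test will likely be smaller on real data.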
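And a minimal sketch of the experiment Nick suggests above: burn CPU on every input tuple and emit only once every X tuples, so the Disruptor's share of the profile should drop. The nested-loop bounds and batch size are arbitrary placeholders:

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Does CPU-bound work for every input tuple and emits only once every
    // X inputs (X > 1); with real work dominating, the profiled share of
    // the disruptor wait strategy should drop.
    public class CpuHeavyBatchingBolt extends BaseBasicBolt {
        private final int batchSize;   // Nick's X
        private int seen = 0;
        private long acc = 0;

        public CpuHeavyBatchingBolt(int batchSize) { this.batchSize = batchSize; }

        public void execute(Tuple input, BasicOutputCollector collector) {
            for (int i = 0; i < 10000; i++) {      // arbitrary nested-loop busywork
                for (int j = 0; j < 100; j++) {
                    acc += (long) i * j;
                }
            }
            if (++seen % batchSize == 0) {
                collector.emit(new Values(acc));   // one emit per X inputs
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("result"));
        }
    }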
>>>>>>> Going back to the BlockingWaitStrategy observation you have, it
>>>>>>> (most probably) means that since you are producing a large number
>>>>>>> of tuples (15-20 per input tuple), the outgoing Disruptor queue
>>>>>>> gets full and the emit() function blocks. Also, since you are
>>>>>>> anchoring tuples (which gives you at-least-once semantics), it
>>>>>>> takes more time to place something in the queue, in order to
>>>>>>> guarantee delivery of all tuples to the downstream bolt.
>>>>>>>
>>>>>>> Therefore, it makes sense to see so much time spent in the LMAX
>>>>>>> messaging layer. A good experiment to verify your hypothesis is to
>>>>>>> not anchor tuples and profile your topology again. However, I am
>>>>>>> not sure that you will see a much different percentage, since for
>>>>>>> every tuple you receive, you have at least one call to the
>>>>>>> Disruptor layer. Maybe in your case (if I got it correctly from
>>>>>>> your description), you should have one call every N tuples, where
>>>>>>> N is the size of your bin in tuples. Right?
>>>>>>>
>>>>>>> I hope I helped with my comments.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Nick
>>>>>>>
>>>>>>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I have a large fan-out that I've posted questions about before,
>>>>>>>> with the following new, updated info:
>>>>>>>>
>>>>>>>> 1. An incoming tuple to Bolt A produces 15-20 tuples.
>>>>>>>> 2. Bolt A emits to Bolt B via fieldsGrouping.
>>>>>>>> 3. I cache outgoing tuples in bins within Bolt A and then emit
>>>>>>>> anchored tuples to Bolt B with the OutputCollector
>>>>>>>> emit(Collection<Tuple> anchors, List<Object> tuple) method
>>>>>>>> <http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html#emit(java.util.Collection,%20java.util.List)>
>>>>>>>> (see the sketch below).
>>>>>>>> 4. I have throughput where I need it to be if I just receive
>>>>>>>> tuples in Bolt B, ack, and drop. If I do actual processing in
>>>>>>>> Bolt B, throughput degrades a lot.
>>>>>>>> 5. I profiled the Bolt B worker yesterday and see that over 90%
>>>>>>>> of the time is spent in com.lmax.disruptor.BlockingWaitStrategy,
>>>>>>>> irrespective of whether I drop the tuples or process them in
>>>>>>>> Bolt B.
>>>>>>>>
>>>>>>>> I am wondering if the acking of the anchor tuples is what's
>>>>>>>> resulting in so much time spent in the LMAX messaging layer. What
>>>>>>>> do y'all think? Any ideas appreciated, as always.
>>>>>>>>
>>>>>>>> Thanks! :)
>>>>>>>>
>>>>>>>> --John
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nick R. Katsipoulakis,
>>>>>>> Department of Computer Science
>>>>>>> University of Pittsburgh
>>>>>>
>>>>>> --
>>>>>> Nick R. Katsipoulakis,
>>>>>> Department of Computer Science
>>>>>> University of Pittsburgh
>>
>> --
>> Nick R. Katsipoulakis,
>> Department of Computer Science
>> University of Pittsburgh
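Finally, for readers following John's setup, a minimal sketch of the binning pattern from point 3 of his mail, assuming a BaseRichBolt so that the multi-anchor emit(Collection<Tuple> anchors, List<Object> tuple) overload is available. The bin size, field names, and simple string concatenation are illustrative, not John's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Buffers incoming tuples into a bin, then emits one combined tuple
    // anchored to every input in the bin via the multi-anchor emit overload.
    public class BinningBolt extends BaseRichBolt {
        private static final int BIN_SIZE = 20;    // illustrative
        private transient OutputCollector collector;
        private final List<Tuple> bin = new ArrayList<Tuple>();

        public void prepare(Map conf, TopologyContext ctx, OutputCollector c) {
            this.collector = c;
        }

        public void execute(Tuple input) {
            bin.add(input);
            if (bin.size() >= BIN_SIZE) {
                StringBuilder combined = new StringBuilder();
                for (Tuple t : bin) {
                    combined.append(t.getStringByField("payload"));
                }
                // One emit, BIN_SIZE anchors: the downstream ack of this
                // tuple credits every input tuple's tree.
                collector.emit(bin, new Values(combined.toString()));
                for (Tuple t : bin) {
                    collector.ack(t);              // inputs are done once binned
                }
                bin.clear();
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("batch"));
        }
    }

Note that each downstream ack of the combined tuple has to credit every anchor in the bin, so ack traffic through the disruptor scales with the bin size; that would be consistent with John's hypothesis that acking the anchor tuples is what keeps BlockingWaitStrategy at the top of the profile.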
