John, Nick,
I don't have direct answers, but here is a test I ran that led me to
conclude that tuple size does matter.
My use case was like this -
Spout S emits a number *X* (say 1, 100, or 1024) -> Bolt A (which
generates a string of *X* KB and emits it 200 times) -> Bolt C (which
just prints the length of the string). All streams are shuffle grouped,
with no limit on max spout pending.
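For reference, the payload generation in Bolt A can be sketched like this (a minimal, stdlib-only sketch; the class and method names are my own, and the actual Storm emit is shown only as a comment):

```java
import java.util.Arrays;

public class PayloadGen {
    /** Build a string of sizeKb kilobytes, like Bolt A's X-KB payload. */
    public static String payload(int sizeKb) {
        char[] buf = new char[sizeKb * 1024];
        Arrays.fill(buf, 'x');
        return new String(buf);
    }

    public static void main(String[] args) {
        String p = payload(100);         // X = 100
        System.out.println(p.length());  // 102400 characters
        // In the real topology, Bolt A would then emit p 200 times, e.g.:
        //   collector.emit(inputTuple, new Values(p));
    }
}
```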

As you can see, this is a pretty straightforward topology with really
nothing in it except emitting strings of varying sizes.

As the size increases, I noticed that the throughput (number of acks on
the spout divided by total time taken) decreases. The test was done on one
machine so that the network could be ruled out. The only things in play
here are the LMAX Disruptor and Kryo (de)serialization.
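To illustrate the size effect in isolation (Storm itself uses Kryo; plain java.io serialization is a stdlib stand-in here that shows the same trend), a small sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

public class SerializationCost {
    /** Serialize an object to bytes; a stand-in for Kryo's write path. */
    public static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (int kb : new int[] {1, 100, 1024}) {
            char[] buf = new char[kb * 1024];
            Arrays.fill(buf, 'x');
            String payload = new String(buf);
            long t0 = System.nanoTime();
            byte[] bytes = serialize(payload);
            long micros = (System.nanoTime() - t0) / 1_000;
            // Serialized size (and time) grows with payload size
            System.out.printf("%4d KB payload -> %8d bytes, %6d us%n",
                              kb, bytes.length, micros);
        }
    }
}
```

The exact timings differ from Kryo's, but the linear growth of (de)serialization cost with tuple size is the point.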

Another test - if Bolt C was fields grouped on X, performance dropped
much further, probably because all the deserialization was being done on a
single instance of the bolt AND also because the queues filled up.

That said, when I compressed the emits from Bolt A (using Snappy
compression), throughput increased drastically. I interpret this as the
reduction in size due to compression improving throughput.
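I used Snappy, but as a stdlib-only sketch of the same idea (GZIP standing in for Snappy; class and method names are mine), the compress-before-emit step looks like this:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

public class CompressedEmit {
    /** Compress a payload before emitting; downstream bolts decompress. */
    public static byte[] compress(String payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        char[] buf = new char[100 * 1024];
        Arrays.fill(buf, 'x');
        String payload = new String(buf);
        byte[] packed = compress(payload);
        // The tuple crossing the Disruptor/Kryo path is now far smaller
        System.out.println(payload.length() + " chars -> "
                           + packed.length + " bytes");
    }
}
```

Real payloads will not compress as well as this repetitive test string; the takeaway is only that shrinking the serialized size cuts the per-tuple Kryo and queue cost.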

Unfortunately, I did not check VisualVM at the time.

Hope this helps.

Thanks
Kashyap
On Sat, Jan 30, 2016 at 4:54 PM, John Yost <[email protected]> wrote:

> Also, I am wondering if this issue is actually fixed in 0.10.0:
> https://issues.apache.org/jira/browse/STORM-292  What do you guys think?
>
> --John
>
> On Sat, Jan 30, 2016 at 5:53 PM, John Yost <[email protected]> wrote:
>
>> Hi Kashyap,
>>
>> Question--what percentage of time is spent in Kryo deserialization and
>> how much in LMAX disruptor?
>>
>> --John
>>
>> On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <[email protected]>
>> wrote:
>>
>>> That is right. But for decently well-written code, the Disruptor is almost
>>> always the CPU hog. That said, on the issue of emits taking time, we found
>>> that the size of the emitted object matters. Kryo serialization and
>>> deserialization times increase with size.
>>>
>>> But does size correlate with the Disruptor showing up so prominently in
>>> profiling?
>>>
>>> Thanks
>>> Kashyap
>>> Kashyap,
>>>
>>> It is only to be expected that the Disruptor dominates CPU time. It is the
>>> component responsible for sending/receiving tuples (at least when tuples
>>> are produced by one executor thread for another executor thread on the
>>> same machine). Therefore, seeing the Disruptor take something like ~80% of
>>> the time is expected.
>>>
>>> A nice experiment to check my statement above is to create a bolt that
>>> performs a CPU-bound task (like nested for loops) for every tuple it
>>> receives, and emits a tuple only after receiving X tuples, where X > 1.
>>> Then, I expect you will see the percentage of CPU time for the Disruptor
>>> drop.
>>>
>>> Cheers,
>>> Nick
>>>
>>> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <[email protected]>
>>> wrote:
>>>
>>>> John, Nick
>>>> Thanks for broaching this topic. In my case, 1 tuple from the spout gives
>>>> out 200 more tuples. I too see the same class listed in VisualVM
>>>> profiling... and tried bringing this down... I reduced parallelism hints,
>>>> played with buffers, changed LMAX wait strategies, changed max spout
>>>> pending... Nothing seems to have an impact.
>>>>
>>>> Any ideas on what could be done for this?
>>>>
>>>> Thanks
>>>> Kashyap
>>>> Hello John,
>>>>
>>>> First off, let us agree on your definition of throughput. Do you define
>>>> throughput as the average number of tuples each of your last bolts (sinks)
>>>> emits per second? If yes, then OK. Otherwise, please provide us with more
>>>> details.
>>>>
>>>> Going back to the BlockingWaitStrategy observation you have, it (most
>>>> probably) means that since you are producing a large number of tuples
>>>> (15-20 tuples per input tuple), the outgoing Disruptor queue gets full,
>>>> and the emit() function blocks. Also, since you are anchoring tuples
>>>> (which gives at-least-once semantics), it takes more time to place
>>>> something in the queue, in order to guarantee delivery of all tuples to a
>>>> downstream bolt.
>>>>
>>>> Therefore, it makes sense to see so much time spent in the LMAX
>>>> messaging layer. A good experiment to verify your hypothesis is to not
>>>> anchor tuples and profile your topology again. However, I am not sure that
>>>> you will see a much different percentage, since for every tuple you
>>>> receive, you have at least one call to the Disruptor layer. Maybe in your
>>>> case (if I got it correctly from your description), you should have one
>>>> call every N tuples, where N is the size of your bin in tuples. Right?
>>>>
>>>> I hope I helped with my comments.
>>>>
>>>> Cheers,
>>>> Nick
>>>>
>>>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I have a large fan-out that I've posted questions about before with
>>>>> the following new, updated info:
>>>>>
>>>>> 1. Incoming tuple to Bolt A produces 15-20 tuples
>>>>> 2. Bolt A emits to Bolt B via fieldsGrouping
>>>>> 3. I cache outgoing tuples in bins within Bolt A and then emit anchored
>>>>> tuples to Bolt B with the OutputCollector emit(Collection<Tuple> anchors,
>>>>> List<Object> tuple) method:
>>>>> http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html#emit(java.util.Collection,%20java.util.List)
>>>>> 4. I have throughput where I need it to be if I just receive tuples in
>>>>> Bolt B, ack, and drop them. If I do actual processing in Bolt B,
>>>>> throughput degrades a bunch.
>>>>> 5. I profiled the Bolt B worker yesterday and saw that over 90% of the
>>>>> time is spent in com.lmax.disruptor.BlockingWaitStrategy--irrespective
>>>>> of whether I drop the tuples or process them in Bolt B
>>>>>
>>>>> I am wondering if the acking of the anchor tuples is what's resulting
>>>>> in so much time spent in the LMAX messaging layer.  What do y'all think?
>>>>> Any ideas appreciated as always.
>>>>>
>>>>> Thanks! :)
>>>>>
>>>>> --John
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nick R. Katsipoulakis,
>>>> Department of Computer Science
>>>> University of Pittsburgh
>>>>
>>>
>>>
>>>
>>> --
>>> Nick R. Katsipoulakis,
>>> Department of Computer Science
>>> University of Pittsburgh
>>>
>>
>>
>
