Hi Kashyap,

Question--what percentage of the time is spent in Kryo deserialization and
how much in the LMAX Disruptor?

--John

On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <[email protected]>
wrote:

> That is right. But for decently well-written code, the Disruptor is almost
> always the biggest CPU consumer. That said, on the issue of emits taking
> time, we found that the size of the emitted object matters: Kryo
> serialization and deserialization times increase with size.
>
> But does object size correlate with the Disruptor showing up so
> prominently in profiling?
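[Editor's note: a rough illustration of the point that serialization cost grows with payload size. Kryo is a JVM library, so this Python sketch uses pickle as a stand-in; the absolute numbers are machine-dependent, and only the trend (larger payload, longer round-trip) is the claim.]

```python
import pickle
import time

def time_roundtrip(payload, iterations=200):
    """Average serialize + deserialize time for one payload."""
    start = time.perf_counter()
    for _ in range(iterations):
        blob = pickle.dumps(payload)
        pickle.loads(blob)
    return (time.perf_counter() - start) / iterations

# Payloads of increasing size, standing in for emitted tuples.
small = {"id": 1, "values": list(range(10))}
large = {"id": 1, "values": list(range(100_000))}

t_small = time_roundtrip(small)
t_large = time_roundtrip(large)
print(f"small: {t_small * 1e6:.1f} us/round-trip, large: {t_large * 1e6:.1f} us/round-trip")
```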
>
> Thanks
> Kashyap
> Kashyap,
>
> It is to be expected that the Disruptor dominates CPU time. It is the
> component responsible for sending/receiving tuples (at least when you have
> tuples produced by one executor thread for another executor thread on the
> same machine). Therefore, it is normal to see the Disruptor account for
> something like ~80% of the time.
>
> A nice experiment to check my statement above is to create a Bolt that,
> for every tuple it receives, performs a random CPU task (like nested for
> loops) and emits a tuple only after receiving X tuples, where X > 1. Then
> I expect that you will see the percentage of CPU time for the Disruptor
> drop.
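[Editor's note: the experiment above can be sketched outside Storm. This Python toy (the names are mine, not Storm's API) counts how often the downstream hand-off is invoked; batching emits every X tuples cuts the number of messaging-layer calls by a factor of X, which is why the Disruptor's share of CPU time should drop.]

```python
import queue

def run_bolt(tuples, x, out):
    """Toy bolt: burn some CPU per tuple, hand off one batch every x tuples."""
    emits = 0
    batch = []
    for t in tuples:
        # Stand-in for the "random CPU task (like nested for loops)".
        sum(i * j for i in range(50) for j in range(50))
        batch.append(t)
        if len(batch) == x:
            out.put(list(batch))   # one call into the messaging layer
            emits += 1
            batch.clear()
    return emits

out = queue.Queue()
assert run_bolt(range(1000), 1, out) == 1000   # one hand-off per tuple
assert run_bolt(range(1000), 10, out) == 100   # 10x fewer hand-offs with x = 10
```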
>
> Cheers,
> Nick
>
> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <[email protected]>
> wrote:
>
>> John, Nick,
>> Thanks for broaching this topic. In my case, 1 tuple from the spout gives
>> out 200 more tuples. I too see the same class listed in VisualVM
>> profiling, and I tried bringing this down: I reduced parallelism hints,
>> played with buffer sizes, changed LMAX wait strategies, and changed max
>> spout pending. Nothing seems to have an impact.
>>
>> Any ideas on what could be done about this?
>>
>> Thanks
>> Kashyap
>> Hello John,
>>
>> First off, let us agree on your definition of throughput. Do you define
>> throughput as the average number of tuples each of your last bolts (sinks)
>> emits per second? If yes, then OK. Otherwise, please provide us with more
>> details.
>>
>> Going back to the BlockingWaitStrategy observation you have, it (most
>> probably) means that since you are producing a large number of tuples
>> (15-20 per incoming tuple), the outgoing Disruptor queue gets full and the
>> emit() call blocks. Also, since you are anchoring tuples (which gives you
>> at-least-once semantics), it takes more time to place something in the
>> queue in order to guarantee delivery of all tuples to the downstream
>> bolt.
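[Editor's note: the "emit() blocks when the outgoing queue is full" behavior can be illustrated with a bounded queue. BlockingWaitStrategy itself is LMAX-specific; this Python sketch only shows the back-pressure effect, where a producer feeding a slow consumer spends most of its time blocked in put().]

```python
import queue
import threading
import time

outgoing = queue.Queue(maxsize=4)   # small bound, like a full Disruptor ring buffer

def slow_consumer():
    # Downstream bolt draining slowly, so the producer must wait for space.
    while True:
        item = outgoing.get()
        if item is None:
            return
        time.sleep(0.01)

threading.Thread(target=slow_consumer, daemon=True).start()

start = time.perf_counter()
for i in range(20):
    outgoing.put(i)        # blocks whenever the queue is full
outgoing.put(None)
elapsed = time.perf_counter() - start
print(f"producer spent {elapsed:.2f}s, mostly blocked on put()")
```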
>>
>> Therefore, it makes sense to see so much time spent in the LMAX messaging
>> layer. A good experiment to verify your hypothesis is to not anchor
>> tuples and profile your topology again. However, I am not sure that you
>> will see a much different percentage, since for every tuple you receive
>> you have at least one call into the Disruptor layer. Maybe in your case
>> (if I got it correctly from your description), you should have one call
>> every N tuples, where N is the size of your bin in tuples. Right?
>>
>> I hope I helped with my comments.
>>
>> Cheers,
>> Nick
>>
>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]> wrote:
>>
>>> Hi Everyone,
>>>
>>> I have a large fan-out that I've posted questions about before with the
>>> following new, updated info:
>>>
>>> 1. Incoming tuple to Bolt A produces 15-20 tuples
>>> 2. Bolt A emits to Bolt B via fieldsGrouping
>>> 3. I cache outgoing tuples in bins within Bolt A and then emit anchored
>>> tuples to Bolt B with the OutputCollector emit(Collection<Tuple> anchors,
>>> List<Object> tuple) method
>>> (http://storm.apache.org/apidocs/backtype/storm/task/OutputCollector.html)
>>> 4. I have throughput where I need it to be if I just receive tuples in
>>> Bolt B, ack, and drop them. If I do actual processing in Bolt B,
>>> throughput degrades substantially.
>>> 5. I profiled the Bolt B worker yesterday and saw that over 90% of the
>>> time is spent in com.lmax.disruptor.BlockingWaitStrategy, irrespective
>>> of whether I drop the tuples or process them in Bolt B.
>>>
>>> I am wondering if the acking of the anchor tuples is what's resulting in
>>> so much time spent in the LMAX messaging layer.  What do y'all think?  Any
>>> ideas appreciated as always.
>>>
>>> Thanks! :)
>>>
>>> --John
>>>
>>
>>
>>
>> --
>> Nick R. Katsipoulakis,
>> Department of Computer Science
>> University of Pittsburgh
>>
>
>
>
> --
> Nick R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
>
