Interesting conversation.

The back pressure mechanism in 1.0 should help.

Do you guys have environments that you could test that in?

Better yet, do you have code to share?

-Taylor

> On Jan 30, 2016, at 9:05 PM, [email protected] wrote:
> 
> Hey Kashyap,
> 
> Excellent points, especially regarding compression. I've thought about trying 
> compression, and your results indicate that's worth a shot.
> 
> Also, I concur on fields grouping, especially with a dramatic fan-out 
> followed by a fan-in, which is what I am currently working with.
> 
> Sure glad I started this thread today because both you and Nick have shared 
> lots of excellent thoughts--much appreciated, and thanks to you both!
> 
> --John
> 
> Sent from my iPhone
> 
>> On Jan 30, 2016, at 7:34 PM, Kashyap Mhaisekar <[email protected]> wrote:
>> 
>> John, Nick
>> I don't have direct answers but here is one test I did based on which I 
>> concluded that tuple size does matter.
>> My use case was like this -
>> Spout S emits a number X (say 1, 100, 1024, etc.) -> Bolt A (which 
>> generates a string of X KB and emits it 200 times) -> Bolt C (which 
>> just prints the length of the string). All are shuffle grouped, with no 
>> limit on max spout pending.
>> 
>> As you can see, this is a pretty straightforward topology with really 
>> nothing in it except emitting strings of varying sizes.
>> 
>> As the size increases, I notice that the throughput (number of acks on the 
>> spout divided by total time taken) decreases. The test was done on one machine 
>> so that the network can be ruled out. The only things in play here are the 
>> LMAX disruptor and Kryo (de)serialization.
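Kashyap's observation can be reproduced outside Storm. The sketch below is a minimal, hypothetical microbenchmark that uses plain Java serialization as a stand-in for Kryo (which is not in the standard library); it only illustrates that round-trip (de)serialization cost grows with payload size, not Storm's actual internals.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerializationSizeBench {
    // Serialize and deserialize a payload, returning round-trip time in nanoseconds.
    static long roundTripNanos(String payload) {
        try {
            long start = System.nanoTime();
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(payload);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                ois.readObject();
            }
            return System.nanoTime() - start;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Same payload sizes as in the test above: 1 KB, 100 KB, 1024 KB.
        for (int kb : new int[]{1, 100, 1024}) {
            char[] chars = new char[kb * 1024];
            java.util.Arrays.fill(chars, 'x');
            String payload = new String(chars);
            roundTripNanos(payload); // warm-up pass
            System.out.printf("%5d KB: %d us%n", kb, roundTripNanos(payload) / 1000);
        }
    }
}
```

On a typical JVM the per-tuple cost reported for the largest payload dwarfs the smallest, which is consistent with throughput falling as tuple size grows.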
>> 
>> Another test - if Bolt C was fields grouped on X, then I see that the 
>> performance drops much further, probably because all the deserialization is 
>> being done on one instance of the bolt AND also because the queues fill up.
>> 
>> That said, when I compressed the emits from Bolt A (using Snappy 
>> compression), I saw that the throughput increased drastically. I interpret 
>> this as the reduction in size due to compression improving throughput.
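To illustrate the compression idea: Kashyap used Snappy, which is a third-party library, so this sketch substitutes the JDK's built-in `Deflater` purely to show the pattern of shrinking a payload before it is emitted. The payload here is artificially repetitive, so real-world ratios will differ.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class CompressEmit {
    // Compress a payload before emitting it downstream; returns the compressed bytes.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED); // favor speed, like Snappy
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = new byte[100 * 1024]; // zero-filled, highly compressible
        byte[] compressed = compress(payload);
        System.out.println(payload.length + " -> " + compressed.length + " bytes");
    }
}
```

The receiving bolt would decompress before use; the win comes from smaller objects crossing the serialization and disruptor layers, at the cost of CPU spent compressing.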
>> 
>> Unfortunately, I had not checked VisualVM at the time.
>> 
>> Hope this helps.
>> 
>> Thanks
>> Kashyap
>>> On Sat, Jan 30, 2016 at 4:54 PM, John Yost <[email protected]> wrote:
>>> Also, I am wondering if this issue is actually fixed in 0.10.0: 
>>> https://issues.apache.org/jira/browse/STORM-292  What do you guys think?
>>> 
>>> --John
>>> 
>>>> On Sat, Jan 30, 2016 at 5:53 PM, John Yost <[email protected]> wrote:
>>>> Hi Kashyap,
>>>> 
>>>> Question--what percentage of time is spent in Kryo deserialization and how 
>>>> much in LMAX disruptor?
>>>> 
>>>> --John
>>>> 
>>>>> On Sat, Jan 30, 2016 at 5:18 PM, Kashyap Mhaisekar <[email protected]> 
>>>>> wrote:
>>>>> That is right. But for decently well-written code, the disruptor is almost 
>>>>> always the CPU hog. That said, on the issue of emits taking time, we 
>>>>> found that the size of the emitted object matters. Kryo times for 
>>>>> serialization and deserialization increase with size.
>>>>> 
>>>>> But does size have a correlation with the disruptor showing up so 
>>>>> prominently in profiling?
>>>>> 
>>>>> Thanks
>>>>> Kashyap
>>>>> 
>>>>> Kashyap, 
>>>>> 
>>>>> It is only to be expected that the Disruptor dominates CPU time. It is the 
>>>>> object responsible for sending/receiving tuples (at least when you have 
>>>>> tuples produced by one executor thread for another executor thread on the 
>>>>> same machine). Therefore, it is expected to see the Disruptor taking 
>>>>> something like ~80% of the time. 
>>>>> 
>>>>> A nice experiment to check my statement above is to create a Bolt that 
>>>>> for every tuple it receives, it performs a random CPU task (like nested 
>>>>> for loops) and it emits a tuple only after receiving X number of tuples, 
>>>>> where X > 1. Then, I expect that you will see the percentage of CPU time 
>>>>> for the Disruptor object to drop.
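Nick's experiment can be sketched without Storm at all. The class below is a hypothetical, framework-free stand-in for such a bolt: it burns some CPU per input and forwards a batch only once every X tuples, so calls into the messaging layer drop by a factor of X, which is the effect his experiment is designed to expose.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingBolt {
    private final int batchSize;                         // X: emit once per X received
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> emitted = new ArrayList<>();

    BatchingBolt(int batchSize) { this.batchSize = batchSize; }

    // Simulates execute(): do CPU work per tuple, buffer it, emit once per X.
    void execute(String tuple) {
        busyWork();                                      // stand-in for the "random CPU task"
        buffer.add(tuple);
        if (buffer.size() >= batchSize) {
            emitted.add(new ArrayList<>(buffer));        // one downstream call per batch
            buffer.clear();
        }
    }

    private void busyWork() {
        long acc = 0;
        for (int i = 0; i < 10_000; i++) acc += (long) i * i;
        if (acc == -1) System.out.println(acc);          // keep the loop from being optimized away
    }

    List<List<String>> emitted() { return emitted; }

    public static void main(String[] args) {
        BatchingBolt bolt = new BatchingBolt(10);
        for (int i = 0; i < 100; i++) bolt.execute("tuple-" + i);
        System.out.println("emits: " + bolt.emitted().size()); // 100 in -> 10 out
    }
}
```

With real CPU work between emits, a profiler should attribute correspondingly less time to the Disruptor, which is the drop Nick predicts.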
>>>>> 
>>>>> Cheers,
>>>>> Nick
>>>>> 
>>>>>> On Sat, Jan 30, 2016 at 3:40 PM, Kashyap Mhaisekar <[email protected]> 
>>>>>> wrote:
>>>>>> John, Nick
>>>>>> Thanks for broaching this topic. In my case, 1 tuple from spout gives 
>>>>>> out 200 more tuples. I too see the same class listed in VisualVM 
>>>>>> profiling... and tried bringing this down: I reduced parallelism 
>>>>>> hints, played with buffers, changed LMAX wait strategies, changed max 
>>>>>> spout pending... Nothing seems to have an impact.
>>>>>> 
>>>>>> Any ideas on what could be done for this?
>>>>>> 
>>>>>> Thanks
>>>>>> Kashyap
>>>>>> 
>>>>>> Hello John, 
>>>>>> 
>>>>>> First off, let us agree on your definition of throughput. Do you define 
>>>>>> throughput as the average number of tuples each of your last bolts 
>>>>>> (sinks) emit per second? If yes, then OK. Otherwise, please provide us 
>>>>>> with more details.
>>>>>> 
>>>>>> Going back to the BlockingWaitStrategy observation you have, it (most 
>>>>>> probably) means that since you are producing a large number of tuples 
>>>>>> (15-20 tuples) the outgoing Disruptor queue gets full, and the emit() 
>>>>>> function blocks. Also, since you are anchoring tuples (that might mean 
>>>>>> exactly-once semantics), it basically takes more time to place something 
>>>>>> in the queue, in order to guarantee delivery of all tuples to a 
>>>>>> downstream bolt. 
>>>>>> 
>>>>>> Therefore, it makes sense to see so much time spent in the LMAX 
>>>>>> messaging layer. A good experiment to verify your hypothesis is to not 
>>>>>> anchor tuples, and profile your topology again. However, I am not sure 
>>>>>> that you will see a much different percentage, since for every tuple you 
>>>>>> are receiving, you have at least one call to the Disruptor layer. Maybe 
>>>>>> in your case (if I got it correctly from your description), you should 
>>>>>> have one call every N tuples, where N is the size of your bin in tuples. 
>>>>>> Right?
>>>>>> 
>>>>>> I hope I helped with my comments.
>>>>>> 
>>>>>> Cheers,
>>>>>> Nick
>>>>>> 
>>>>>>> On Sat, Jan 30, 2016 at 12:16 PM, John Yost <[email protected]> 
>>>>>>> wrote:
>>>>>>> Hi Everyone,
>>>>>>> 
>>>>>>> I have a large fan-out that I've posted questions about before with the 
>>>>>>> following new, updated info:
>>>>>>> 
>>>>>>> 1. Incoming tuple to Bolt A produces 15-20 tuples
>>>>>>> 2. Bolt A emits to Bolt B via fieldsGrouping
>>>>>>> 3. I cache outgoing tuples in bins within Bolt A and then emit anchored 
>>>>>>> tuples to Bolt B with the OutputCollector emit(Collection<Tuple> 
>>>>>>> anchors, List<Object> tuple) method
>>>>>>> 4. I have throughput where I need it to be if I just receive tuples in 
>>>>>>> Bolt B, ack, and drop. If I do actual processing in Bolt B, throughput 
>>>>>>> degrades a bunch.
>>>>>>> 5. I profiled the Bolt B worker yesterday and see that over 90% of the 
>>>>>>> time is spent in com.lmax.disruptor.BlockingWaitStrategy--irrespective of 
>>>>>>> whether I drop the tuples or process them in Bolt B
>>>>>>> 
>>>>>>> I am wondering if the acking of the anchor tuples is what's resulting 
>>>>>>> in so much time spent in the LMAX messaging layer.  What do y'all 
>>>>>>> think?  Any ideas appreciated as always.
>>>>>>> 
>>>>>>> Thanks! :)
>>>>>>> 
>>>>>>> --John
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Nick R. Katsipoulakis, 
>>>>>> Department of Computer Science 
>>>>>> University of Pittsburgh
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Nick R. Katsipoulakis, 
>>>>> Department of Computer Science 
>>>>> University of Pittsburgh
>> 
