Thank you for your answer. Let me add some more context about why I asked (it might be that the problem I am facing is not caused by the worker's send thread, and I think it could still be of interest for the community).
The goal of my experiment is to find the maximum throughput of a given topology (described below).

Setup: 1 server with 48 cores. Supervisor (with 1 worker), Nimbus, UI and Zookeeper all running on the server.

Topology: Spout [a spout] --> Op [a bolt] --> Sink [a bolt]. The Spout generates tuples with 3 integer fields (randomly generated for each new tuple), Op simply forwards the tuples, and Sink does nothing. I understand the topology itself might look weird, but I am interested in studying the scalability of a streaming operator with selectivity close to 1 (1 input tuple --> 1 output tuple). A minimal code sketch of this topology, and of the queue-size tuning suggested in the quoted reply, is appended at the end of this message.

Now, given that I have 48 cores but each executor actually maintains 2 threads, I want to try different setups using no more than 24 executors. My understanding is that, for the given topology, the per-tuple processing cost of Spout and Op will be comparable (no acking in place), while the cost of Sink should be negligible. Also, note that all operators maintain metrics that keep track of the throughput and the cost. All the following results are averaged over 3 runs.

The first setup I try is 1_22_1 (1 executor for Spout, 22 for Op and 1 for Sink). My prediction is that this setup is not really optimal, given that the bottleneck will most likely be the Spout, which cannot generate enough tuples to reach the processing capacity of Op. I get a throughput of 250K t/s; the cost of Spout is 0.91 while the cost of Op is 0.15 (Sink is 0.20). As expected, the cost of Spout is significantly higher than the cost of Op.

What happens when I move to 2_21_1? I get a throughput of 300K t/s; the cost of Spout is 0.72 while the cost of Op is 0.43 (Sink is 0.20). As expected, the cost of the Spout decreases, the cost of Op goes up, and so does the throughput.

What happens when I move to other configurations like 3_20_1, 4_19_1, ... up to 10_13_1? I see decreasing costs for the Spout and increasing costs for Op, but the maximum throughput stays at 300K t/s.

My questions are:
1. Why do I get an improvement of only 50K t/s when going from 1 to 2 executors for the Spout?
2. Why does the throughput not increase with more than 2 spouts? (The cost of Sink is still stable at 0.17 in all the experiments.)
3. Is there something I am missing that could be the problem, aside from the worker's send thread being a bottleneck?

Thank you very much for your help; I hope my question is clear.

On 18 August 2015 at 07:54, Kishore Senji <[email protected]> wrote:

> I think it is the same even in local shuffling. It will go to the worker
> receive buffer, from which it gets transferred to the executor disruptor
> queue. So yes, the executor would be able to go as fast as that thread is
> able to move tuples, provided the Bolt takes less time than the copy of the
> messages from the worker receive buffer to the disruptor queue (which more
> often than not is not the case). You can increase the disruptor queue size
> to see if that improves any performance in your scenario.
>
> On Mon, Aug 17, 2015 at 6:21 AM Vincenzo Gulisano <[email protected]> wrote:
>
>> Hi,
>> something that is not clear to me (and I cannot find clear answers to
>> this) is whether the send thread of an executor A is able to directly move
>> a tuple to the input queue of another executor B (within the same worker,
>> of course), or whether all output tuples have to go through the shared
>> transfer queue and are then copied to the input queues by the worker send
>> thread.
>> In such a case, wouldn't such a thread be a bottleneck? A worker would be
>> able to go as fast as this thread is able to move tuples around, wouldn't it?
>>
>> Thanks!
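P.S. For completeness, here is a minimal, self-contained sketch of the topology I am benchmarking (single worker, no ackers, parallelism taken from the command line so that 1_22_1, 2_21_1, etc. can be submitted without recompiling). The class names (RandomIntSpout, ForwardBolt, NoOpSink) are placeholders I picked for this sketch, and the throughput/cost metrics my real operators maintain are omitted here:

import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ThroughputTopology {

    // Spout: emits tuples with 3 randomly generated integer fields
    // (unanchored emits, so there is no acking overhead).
    public static class RandomIntSpout extends BaseRichSpout {
        private SpoutOutputCollector out;
        private Random rnd;

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector out) {
            this.out = out;
            this.rnd = new Random();
        }

        public void nextTuple() {
            out.emit(new Values(rnd.nextInt(), rnd.nextInt(), rnd.nextInt()));
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("a", "b", "c"));
        }
    }

    // Op: forwards every input tuple unchanged (selectivity 1).
    public static class ForwardBolt extends BaseRichBolt {
        private OutputCollector out;

        public void prepare(Map conf, TopologyContext ctx, OutputCollector out) {
            this.out = out;
        }

        public void execute(Tuple t) {
            out.emit(new Values(t.getValue(0), t.getValue(1), t.getValue(2)));
        }

        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("a", "b", "c"));
        }
    }

    // Sink: discards everything.
    public static class NoOpSink extends BaseRichBolt {
        public void prepare(Map conf, TopologyContext ctx, OutputCollector out) { }
        public void execute(Tuple t) { }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) throws Exception {
        int spoutPar = Integer.parseInt(args[0]);  // e.g. 2
        int opPar    = Integer.parseInt(args[1]);  // e.g. 21
        int sinkPar  = Integer.parseInt(args[2]);  // e.g. 1

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomIntSpout(), spoutPar);
        builder.setBolt("op", new ForwardBolt(), opPar).shuffleGrouping("spout");
        builder.setBolt("sink", new NoOpSink(), sinkPar).shuffleGrouping("op");

        Config conf = new Config();
        conf.setNumWorkers(1);  // everything in a single worker, as in the experiment
        conf.setNumAckers(0);   // no acking in place

        StormSubmitter.submitTopology("throughput_" + spoutPar + "_" + opPar + "_" + sinkPar,
                conf, builder.createTopology());
    }
}

I submit each configuration as, e.g., storm jar <my-jar> ThroughputTopology 2 21 1.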

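P.P.S. Regarding the suggestion to increase the disruptor queue size: this is how I intend to try it. The keys below are the per-executor receive/send buffers and the worker transfer buffer; the sizes are just guesses on my part (as far as I understand, they should be powers of two):

import backtype.storm.Config;

public class QueueTuning {
    // Larger internal queues, to test whether the worker/executor queues are
    // the limiting factor. The sizes here are illustrative, not recommendations.
    public static Config withLargerQueues() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384); // executor input queue
        conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);    // executor send queue
        conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 1024);          // worker transfer queue
        return conf;
    }
}

If larger queues move the 300K t/s ceiling, that would point at the internal queues; if they do not, I guess the bottleneck is elsewhere.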