Thank you for your answer. Let me add some more detail about why I asked
this (it may be that the problem I am facing is not caused by the worker's
send thread, but I think it could still be of interest to the community).

The goal of my experiment is to find the maximum throughput of a given
topology (described below).

Setup:
1 server with 48 cores. Supervisor (with 1 worker), Nimbus, UI and
Zookeeper all run on this server.

Topology:
Spout [a spout] --> Op [a bolt] --> Sink [a bolt]

Spout generates tuples with 3 integer fields (randomly generated for each
new tuple).
Op simply forwards the tuples.
Sink does nothing.

I understand the topology itself might look weird, but I am interested in
studying the scalability of a streaming operator with selectivity close to
1 (1 input tuple --> 1 output tuple).
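
To make the components concrete, this is roughly what they look like (a
minimal sketch only; the class and field names are placeholders I use
here, not necessarily the ones in my actual code):

import java.util.Map;
import java.util.Random;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Spout: emits tuples with 3 randomly generated integer fields.
class RandomTripletSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Emitted without a message id, since acking is disabled.
        collector.emit(new Values(random.nextInt(), random.nextInt(), random.nextInt()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("a", "b", "c"));
    }
}

// Op: forwards every tuple unchanged (selectivity 1).
class ForwardBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        collector.emit(new Values(input.getInteger(0), input.getInteger(1), input.getInteger(2)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("a", "b", "c"));
    }
}

// Sink: discards every tuple.
class SinkBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // no-op
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
}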

Now, given that I have 48 cores but each executor actually maintains 2
threads, I want to try different setups using no more than 24 executors. My
understanding is that, for this topology, the per-tuple processing cost of
Spout and Op will be comparable (no acking is in place), while the cost of
Sink should be negligible. Also, note that all operators maintain metrics
that track throughput and cost. All the following results are averaged over
3 runs.
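
Concretely, each setup is wired up roughly like this (again just a sketch,
reusing the placeholder class names from above; I show shuffle grouping
here, but the same point holds with local-or-shuffle grouping):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class ScalabilityTestTopology {
    public static void main(String[] args) throws Exception {
        // The different setups below only change these two numbers.
        int spoutExecutors = 1;
        int opExecutors = 22;

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomTripletSpout(), spoutExecutors);
        builder.setBolt("op", new ForwardBolt(), opExecutors)
               .shuffleGrouping("spout");
        builder.setBolt("sink", new SinkBolt(), 1)
               .shuffleGrouping("op");

        Config conf = new Config();
        conf.setNumWorkers(1);   // single worker on the 48-core machine
        conf.setNumAckers(0);    // acking disabled
        StormSubmitter.submitTopology("scalability-test", conf, builder.createTopology());
    }
}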

The first setup I try is 1_22_1 (1 executor for Spout, 22 for Op and 1 for
Sink). This setup is probably not optimal (that is my prediction), since
the bottleneck will most likely be the Spout, which cannot generate enough
tuples to saturate the processing capacity of Op.
I get a throughput of 250K t/s; the cost of Spout is 0.91 while the cost of
Op is 0.15 (Sink is 0.20).
As expected, the cost of Spout is significantly higher than the cost of Op.
What happens when I move to 2_21_1?
I get a throughput of 300K t/s; the cost of Spout is 0.72 while the cost of
Op is 0.43 (Sink is 0.20).
As expected, the cost of the Spout decreases, the cost of Op goes up, and so
does the throughput.
What happens when I move to other configurations like 3_20_1, 4_19_1, ...
up to 10_13_1?
Well, I see decreasing costs for the Spout and increasing costs for Op, but
the maximum throughput stays at 300K t/s.

My questions are:
Why do I get an improvement of only 50K t/s from 1 to 2 executors for the
Spout?
Why does the throughput not increase with more than 2 spouts? (The cost of
Sink stays stable at 0.17 in all the experiments.)
Is there something I am missing that could be the problem, aside from the
worker's send thread being a bottleneck?
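
For reference, this is how I plan to try your suggestion (quoted below) of
increasing the disruptor queue sizes, assuming I am looking at the right
settings; the sizes must be powers of two and the exact values here are
just an example:

// Added to the Config used when submitting the topology above:
// Executor input/output disruptor queues:
conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
// Worker transfer queue:
conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 16384);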

Thank you very much for your help, I hope my question is clear.

On 18 August 2015 at 07:54, Kishore Senji <[email protected]> wrote:

> I think it is the same even in local shuffling. The tuple will go to the
> worker receive buffer, from which it gets transferred to the executor
> disruptor queue. So yes, the executor would be able to go as fast as that
> thread is able to move tuples, provided the Bolt takes less time than the
> copy of the messages from the worker receive buffer to the disruptor queue
> (which more often than not is not the case). You can increase the disruptor
> queue size to see if that improves performance in your scenario.
>
> On Mon, Aug 17, 2015 at 6:21 AM Vincenzo Gulisano <
> [email protected]> wrote:
>
>> Hi,
>> something that is not clear to me (and I cannot find a clear answer to
>> this) is whether the send thread of an executor A is able to directly move
>> a tuple to the input queue of another executor B (within the same worker,
>> of course), or whether all output tuples have to go through the shared
>> transfer queue and then be copied to the input queues by the worker send
>> thread.
>> In such a case, wouldn't that thread be a bottleneck? A worker would be
>> able to go only as fast as this thread is able to move tuples around,
>> wouldn't it?
>>
>> Thanks!
>>
>
