Let’s see if I can help…
  I do see that the ability to customize the wait strategy via config file was
removed sometime after v0.10.2. So there is no way to do that anymore, but you
can try tweaking topology.disruptor.wait.timeout.millis and see if that helps
your situation.

The Disruptor currently uses TimeoutBlockingWaitStrategy if the timeout is > 0,
or LiteBlockingWaitStrategy if the timeout is 0.
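
For reference, a minimal sketch of setting that timeout per topology from the
Java submission code (the key string is the one above; the class name, the
value, and the commented-out submit call are just illustrative placeholders):

import org.apache.storm.Config;

public class WaitTimeoutSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // 0 selects LiteBlockingWaitStrategy; any value > 0 keeps
        // TimeoutBlockingWaitStrategy with that timeout (in milliseconds).
        conf.put("topology.disruptor.wait.timeout.millis", 1000);
        // StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}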

  Is your primary concern that too much kernel CPU is being used in idle mode
(in the wait strategy), and that this is consequently affecting your ability to
run more topologies on the same box?

-Roshan

From: Roee Shenberg <shenb...@alooma.io>
Reply-To: "user@storm.apache.org" <user@storm.apache.org>
Date: Monday, June 12, 2017 at 10:06 AM
To: "user@storm.apache.org" <user@storm.apache.org>
Subject: Re: High kernel cpu usage when running 10 topologies on the same worker

Some better diagnostics: I got a perf recording (30 seconds) of an idle
topology on a system with 10 idle topologies, using perf-map-agent to create a
.map file for JITted code. The flame graph is attached, and it strengthens my
desire to use a wait strategy that isn't based on condition variables, seeing
as the vast majority of the time was spent in futexes.


On Sun, Jun 11, 2017 at 6:39 PM, Roee Shenberg <shenb...@alooma.io> wrote:
A more practical question: is there a way to replace the Disruptor wait
strategy? I suspect the issue is the ginormous number of threads vying to be
woken up. Running the Disruptor queue with SleepingWaitStrategy seems like it
should alleviate this pain, but it seems the ability to set the wait strategy
was removed in the transition from Clojure to Java in versions 0.10.x->1.0.x.
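
For context, in the standalone LMAX Disruptor library the wait strategy is just
a constructor argument, as in the rough sketch below (plain Disruptor DSL usage
with placeholder names, not anything Storm 1.x exposes; Storm builds its
DisruptorQueue internally):

import com.lmax.disruptor.SleepingWaitStrategy;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class SleepingWaitSketch {
    // Minimal event type for the ring buffer.
    static class LongEvent { long value; }

    public static void main(String[] args) {
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new,              // event factory
                1024,                        // ring buffer size (power of two)
                DaemonThreadFactory.INSTANCE,
                ProducerType.MULTI,
                new SleepingWaitStrategy()); // spin, then yield, then short sleeps;
                                             // producers never signal a condition variable
        disruptor.handleEventsWith((event, sequence, endOfBatch) -> { /* consume */ });
        disruptor.start();
        disruptor.shutdown();
    }
}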

On Sun, Jun 11, 2017 at 3:32 PM, Roee Shenberg <shenb...@alooma.io> wrote:
Our Storm cluster (1.0.2) is running many Trident topologies, each local to a
single worker, with each supervisor having 10 worker slots. Every slot runs a
copy of the same topology with a different configuration; the topology is a
fairly fat Trident topology (~300 threads per topology, totalling >3000 threads
on the machine).

A quick htop showed a grim picture of most CPU time being spent in the kernel:
[inline image: htop screenshot]
(note: running as root inside a Docker container)

Here's an example top summary line:

%Cpu(s): 39.4 us, 51.1 sy,  0.0 ni,  8.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.8 st
This suggests actual kernel time waste, not I/O, IRQs, etc., so I ran
sudo strace -cf -p 2466 to get a feel for what's going on:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 86.84 3489.872073       14442    241646     27003 futex
 10.69  429.437949      271453      1582           epoll_wait
  1.88   75.608000      361761       209       108 restart_syscall
  0.29   11.722287       46889       250           recvfrom
  0.12    4.736911          92     51379           gettimeofday
  0.08    3.173336          12    254162           clock_gettime
  0.06    2.234660        4373       511           poll
...

I don't understand whether threads that are simply blocking are counted (in 
which case this is a worthless measure) or not.

I ran jvmtop to get some runtime data out of one of the topologies (well, a 
few, but they were all roughly the same):


  TID   NAME                                    STATE    CPU  TOTALCPU BLOCKEDBY
    203 Thread-27-$mastercoord-bg0-exe       RUNNABLE 57.55%     8.54%
    414 RMI TCP Connection(3)-172.17.0       RUNNABLE  8.03%     0.01%
     22 disruptor-flush-trigger         TIMED_WAITING  3.79%     4.49%
     51 heartbeat-timer                 TIMED_WAITING  0.80%     1.66%
    328 disruptor-flush-task-pool       TIMED_WAITING  0.61%     0.84%
...

So just about all of the time is spent inside the Trident master coordinator.

My theory is that the ridiculous thread count is causing the kernel to work 
extra-hard on all those futex calls (e.g. when waking a thread blocking on a 
futex).
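
A throwaway way to test that theory in isolation might be something like the
sketch below: nothing Storm-specific, just thousands of threads doing short
timed condition waits, which on Linux are backed by futex. Running strace -cf
against it should show whether futex time alone scales with the thread count
(it only exercises the timeout path, not cross-thread wakeups):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class FutexLoadSketch {
    public static void main(String[] args) {
        int threads = 3000; // roughly the thread count reported on the box
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                ReentrantLock lock = new ReentrantLock();
                Condition cond = lock.newCondition();
                while (!Thread.currentThread().isInterrupted()) {
                    lock.lock();
                    try {
                        // A timed Condition.await is implemented with futex on Linux,
                        // similar to the timeout-based Disruptor wait strategy.
                        cond.await(1, TimeUnit.MILLISECONDS);
                    } catch (InterruptedException e) {
                        return;
                    } finally {
                        lock.unlock();
                    }
                }
            }).start();
        }
    }
}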

I'm very uncertain about this; the best I can say is that the overhead is
related more to the number of topologies than to what each topology is doing
(when running 1 topology on the same worker, CPU use is less than 1/10th of
what it is with 10 topologies), and there are a lot of threads on the system
(>3000).

Any advice, suggestions for additional diagnostics, or ideas as to the cause?
Operationally, we're planning to move to smaller instances with fewer slots per
supervisor to work around this issue, but I'd rather resolve it without
changing our cluster's makeup entirely.

Thanks,

Roee

