Hello,



We have a rather large topology with 1023 bolts, each with
parallelism = 1. When I run the topology on a single worker, it takes 35
minutes (!!) to start up. If I split the topology across two workers, it's 3
minutes. And if it gets split across three workers, the start-up time
drops to ~30 seconds.
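
For context, the topology is wired up roughly like the sketch below; SomeSpout,
SomeBolt and the linear chaining are placeholders for illustration only (our
real components differ), and the only knob varied between the runs above is
setNumWorkers:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class LargeTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // SomeSpout / SomeBolt are placeholders for our real (censored) components.
        builder.setSpout("spout", new SomeSpout(), 1);

        // ~1023 bolts, each with parallelism = 1 (one executor, one task).
        for (int i = 0; i < 1023; i++) {
            builder.setBolt("bolt-" + i, new SomeBolt(), 1)
                   .shuffleGrouping(i == 0 ? "spout" : "bolt-" + (i - 1));
        }

        Config conf = new Config();
        // 1 worker -> ~35 min, 2 workers -> ~3 min, 3 workers -> ~30 s start-up.
        conf.setNumWorkers(1);
        StormSubmitter.submitTopology("large-topology", conf, builder.createTopology());
    }
}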




Obviously the worker would normally get killed by Nimbus/the supervisor much
sooner than after 35 minutes, so supervisor.worker.start.timeout.secs,
nimbus.supervisor.timeout.secs and nimbus.task.launch.secs have been raised to
a few hours.
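
For reference, the storm.yaml overrides on the cluster look roughly like this
(14400 s = 4 h is just an example value, not our exact setting):

# Raised so the slowly-starting worker is not killed or reassigned prematurely.
supervisor.worker.start.timeout.secs: 14400
nimbus.supervisor.timeout.secs: 14400
nimbus.task.launch.secs: 14400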





The bulk of the time is spent like this (log snippet from the init sequence,
spacing mine, bolt names censored):




2018-07-17_18:18:46.430 o.a.s.d.executor [INFO] Loading executor bolt-123:[664 664]
2018-07-17_18:18:46.436 o.a.s.d.executor [INFO] Loaded executor tasks bolt-123:[664 664]
2018-07-17_18:18:46.443 o.a.s.d.executor [INFO] Finished loading executor bolt-123:[664 664]

2018-07-17_18:18:47.471 o.a.s.d.executor [INFO] Loading executor bolt-abc:[582 582]
2018-07-17_18:18:47.472 o.a.s.d.executor [INFO] Loaded executor tasks bolt-abc:[582 582]
2018-07-17_18:18:47.476 o.a.s.d.executor [INFO] Finished loading executor bolt-abc:[582 582]

2018-07-17_18:18:47.883 o.a.s.d.executor [INFO] Loading executor bolt-xyz:[220 220]
2018-07-17_18:18:47.885 o.a.s.d.executor [INFO] Loaded executor tasks bolt-xyz:[220 220]
2018-07-17_18:18:47.893 o.a.s.d.executor [INFO] Finished loading executor bolt-xyz:[220 220]

2018-07-17_18:18:52.346 o.a.s.d.executor [INFO] Loading executor bolt-789:[783 783]
2018-07-17_18:18:52.353 o.a.s.d.executor [INFO] Loaded executor tasks bolt-789:[783 783]
2018-07-17_18:18:52.360 o.a.s.d.executor [INFO] Finished loading executor bolt-789:[783 783]

2018-07-17_18:18:54.154 o.a.s.d.executor [INFO] Loading executor bolt-def:[898 898]
2018-07-17_18:18:54.155 o.a.s.d.executor [INFO] Loaded executor tasks bolt-def:[898 898]
2018-07-17_18:18:54.159 o.a.s.d.executor [INFO] Finished loading executor bolt-def:[898 898]



Please note the _insane_ delays between the individual bolt loads - anywhere
from a few hundred milliseconds to several seconds per executor, which over
~1023 executors adds up to the ~35 minutes observed.




This is reproducible on Storm 1.0.1, 1.0.6 and 1.2.2, with Java 8u141.


uname -a: 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

worker.childopts: "-Xmx96G" (but during init the heap does not grow beyond
2 GB and there is basically no GC activity)
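
In case anyone wants to check the GC side themselves, this is roughly how we
verified it - a sketch only, the log path is a placeholder:

# Standard Java 8 GC logging flags added to the worker JVM options.
worker.childopts: "-Xmx96G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/worker-gc.log"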





I understand that Storm is designed for horizontal scaling, but such poor
vertical scaling with the number of bolts (quadratic? exponential?) seems like
an oversight.

Is there any configuration we could use to improve the situation, e.g. by 
parallelizing the loading procedure? Should I file a Jira?




Thank you,

Petr Janeček
