Hi,

We are using Apache Storm 1.1.3 in our application. Several hours ago Nimbus started restarting some (only some) of our workers over and over. We are now struggling to work out what exactly is happening and what the cause of the problem is.

Here is what we have been able to find out:

1. Nimbus is restarting some workers every 150 seconds, which matches our configured values for 'worker.heartbeat.frequency.secs' and 'task.heartbeat.frequency.secs'.
2. We assumed that, for some reason, the executors (JVMs) were not able to update their heartbeats in ZooKeeper. But when we checked the corresponding znodes, the last update dates of all of them were older than 150 seconds, including those of the topologies that are running smoothly, so by that logic Nimbus should be restarting all of the topologies (a sketch of this check is included after the log excerpts below).
3. Then we realized that we also use Pacemaker, so the stale dates in ZooKeeper may actually be expected. But we found nothing in the Pacemaker logs.
4. We searched the Nimbus, supervisor and worker logs for anything interesting.

For Nimbus we only get a number of entries like:
2020-02-11 20:21:56.087 o.a.s.d.nimbus timer [INFO] Executor enrichment_topology-3-1581414651:[24 24] not alive

For the supervisor logs:

2020-02-18 14:34:52.729 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 150043 topo:enrichment_topology-6-1581951947 worker:32de0a97-4a50-4967-873c-df189eb8ec94 -> RUNNING msInState: 150043 topo:enrichment_topology-6-1581951947 worker:32de0a97-4a50-4967-873c-df189eb8ec94
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE RUNNING
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700: Assignment Changed from LocalAssignment(topology_id:enrichment_topology-6-1581951947, executors:[ExecutorInfo(task_start:51, task_end:51), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35, task_end:35), ExecutorInfo(task_start:19, task_end:19), ExecutorInfo(task_start:67, task_end:67), ExecutorInfo(task_start:115, task_end:115), ExecutorInfo(task_start:83, task_end:83), ExecutorInfo(task_start:99, task_end:99), ExecutorInfo(task_start:85, task_end:85), ExecutorInfo(task_start:53, task_end:53), ExecutorInfo(task_start:101, task_end:101), ExecutorInfo(task_start:69, task_end:69), ExecutorInfo(task_start:21, task_end:21), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:37, task_end:37), ExecutorInfo(task_start:117, task_end:117), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:119, task_end:119), ExecutorInfo(task_start:103, task_end:103), ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:71, task_end:71), ExecutorInfo(task_start:23, task_end:23), ExecutorInfo(task_start:55, task_end:55), ExecutorInfo(task_start:87, task_end:87), ExecutorInfo(task_start:41, task_end:41), ExecutorInfo(task_start:73, task_end:73), ExecutorInfo(task_start:105, task_end:105), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:57, task_end:57), ExecutorInfo(task_start:121, task_end:121), ExecutorInfo(task_start:89, task_end:89), ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:75, task_end:75), ExecutorInfo(task_start:107, task_end:107), ExecutorInfo(task_start:123, task_end:123), ExecutorInfo(task_start:43, task_end:43), ExecutorInfo(task_start:11, task_end:11), ExecutorInfo(task_start:91, task_end:91), ExecutorInfo(task_start:27, task_end:27), ExecutorInfo(task_start:59, task_end:59), ExecutorInfo(task_start:125, task_end:125), ExecutorInfo(task_start:109, task_end:109), ExecutorInfo(task_start:93, task_end:93), ExecutorInfo(task_start:61, task_end:61), ExecutorInfo(task_start:13, task_end:13), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:77, task_end:77), ExecutorInfo(task_start:47, task_end:47), ExecutorInfo(task_start:111, task_end:111), ExecutorInfo(task_start:127, task_end:127), ExecutorInfo(task_start:63, task_end:63), ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:79, task_end:79), ExecutorInfo(task_start:95, task_end:95), ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:81, task_end:81), ExecutorInfo(task_start:97, task_end:97), ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:65, task_end:65), ExecutorInfo(task_start:129, task_end:129), ExecutorInfo(task_start:49, task_end:49), ExecutorInfo(task_start:113, task_end:113), ExecutorInfo(task_start:17, task_end:17)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:storm) to null
2020-02-18 14:34:52.730 o.a.s.d.s.Container SLOT_6700 [INFO] Killing 6654f8f2-6972-4238-af0b-65e196c8fb70:32de0a97-4a50-4967-873c-df189eb8ec94

which also does not tell us what the cause of the problem is.

And for the worker logs:

2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down worker xslt_topology-21-1580129826 6654f8f2-6972-4238-af0b-65e196c8fb70 6703
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Terminating messaging context
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down executors

5. We used jvisualvm to check whether the workers were, for some reason, busy with something else and therefore unable to send heartbeats to ZooKeeper/Pacemaker, but we did not find anything interesting.


Everything without success.
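
For reference, point 2 above was checked roughly like the sketch below. It is only a minimal sketch: the connect string is an example (use the hosts from storm.zookeeper.servers), it assumes the default storm.zookeeper.root of /storm, and with Pacemaker enabled the worker heartbeats may legitimately never be written to these znodes.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WorkerbeatAges {
    public static void main(String[] args) throws Exception {
        // Example connect string -- replace with the hosts from storm.zookeeper.servers.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });
        // Assumes the default storm.zookeeper.root of "/storm".
        String root = "/storm/workerbeats";
        long now = System.currentTimeMillis();
        for (String topology : zk.getChildren(root, false)) {
            for (String nodePort : zk.getChildren(root + "/" + topology, false)) {
                Stat stat = zk.exists(root + "/" + topology + "/" + nodePort, false);
                long ageSecs = (now - stat.getMtime()) / 1000;
                System.out.println(topology + "/" + nodePort + " last written " + ageSecs + "s ago");
            }
        }
        zk.close();
    }
}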

Is there anything more we can do to find out what the reason for this behavior is?



Additional Questions:
1. Are there any other ways to debug and investigate the causes of problems like this?
2. Maybe Pacemaker is the problem here:
    - Where does it store its data?
    - Can we read that data somehow, the way zkCli.sh allows for ZooKeeper?
    - Maybe some of the requests are not reaching Pacemaker; is there a way to verify that?
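
For completeness, this is roughly how we dump the heartbeat-related settings that our daemons actually load, to confirm whether heartbeats are supposed to go to Pacemaker or to ZooKeeper. It is only a minimal sketch and assumes Utils.readStormConfig() plus the pacemaker.* / nimbus.task.timeout.secs keys as we understand them from the Storm documentation; please correct us if any of them do not apply to 1.1.3.

import java.util.Map;
import org.apache.storm.utils.Utils;

public class HeartbeatConfigDump {
    public static void main(String[] args) {
        // Reads defaults.yaml + storm.yaml from the classpath, the same way the daemons do.
        Map conf = Utils.readStormConfig();

        // Keys as we understand them from the Storm configuration / Pacemaker docs.
        String[] keys = {
            "storm.cluster.state.store",      // points at the Pacemaker state factory when Pacemaker is enabled
            "pacemaker.host",
            "pacemaker.port",
            "worker.heartbeat.frequency.secs",
            "task.heartbeat.frequency.secs",
            "nimbus.task.timeout.secs"        // Nimbus treats an executor as not alive after this many seconds without a heartbeat
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}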


Regards,

Pawel
