Hi,

We are using Apache Storm 1.1.3 in our application. Several hours ago Nimbus started restarting some (only some) of our workers over and over. We are now struggling to work out what exactly is happening and what the cause of the problem is.

Here is what we have been able to find out:

1. Nimbus is restarting some workers every 150 seconds, which matches our configured values for 'worker.heartbeat.frequency.secs' and 'task.heartbeat.frequency.secs'.
2. We assumed that, for some reason, the executors (JVMs) were not able to update their heartbeats in ZooKeeper. But when we checked the corresponding znodes, the last update dates of all of them were older than 150 seconds, including those of the topologies that are running smoothly, so by that logic Nimbus should be restarting all of the topologies (a sketch of this check is included after the log excerpts below).
3. Then we realized that we also use Pacemaker, so the stale dates in ZooKeeper may actually be expected. But we found nothing in the Pacemaker logs.
4. We searched the Nimbus, supervisor and worker logs for anything interesting.

For Nimbus we only get a number of entries like:
2020-02-11 20:21:56.087 o.a.s.d.nimbus timer [INFO] Executor enrichment_topology-3-1581414651:[24 24] not alive

For the supervisor logs:

2020-02-18 14:34:52.729 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 150043 topo:enrichment_topology-6-1581951947 worker:32de0a97-4a50-4967-873c-df189eb8ec94 -> RUNNING msInState: 150043 topo:enrichment_topology-6-1581951947 worker:32de0a97-4a50-4967-873c-df189eb8ec94
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [DEBUG] STATE RUNNING
2020-02-18 14:34:52.730 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700: Assignment Changed from LocalAssignment(topology_id:enrichment_topology-6-1581951947, executors:[ExecutorInfo(task_start:51, task_end:51), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35, task_end:35), ExecutorInfo(task_start:19, task_end:19), ExecutorInfo(task_start:67, task_end:67), ExecutorInfo(task_start:115, task_end:115), ExecutorInfo(task_start:83, task_end:83), ExecutorInfo(task_start:99, task_end:99), ExecutorInfo(task_start:85, task_end:85), ExecutorInfo(task_start:53, task_end:53), ExecutorInfo(task_start:101, task_end:101), ExecutorInfo(task_start:69, task_end:69), ExecutorInfo(task_start:21, task_end:21), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:37, task_end:37), ExecutorInfo(task_start:117, task_end:117), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:119, task_end:119), ExecutorInfo(task_start:103, task_end:103), ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:71, task_end:71), ExecutorInfo(task_start:23, task_end:23), ExecutorInfo(task_start:55, task_end:55), ExecutorInfo(task_start:87, task_end:87), ExecutorInfo(task_start:41, task_end:41), ExecutorInfo(task_start:73, task_end:73), ExecutorInfo(task_start:105, task_end:105), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:57, task_end:57), ExecutorInfo(task_start:121, task_end:121), ExecutorInfo(task_start:89, task_end:89), ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:75, task_end:75), ExecutorInfo(task_start:107, task_end:107), ExecutorInfo(task_start:123, task_end:123), ExecutorInfo(task_start:43, task_end:43), ExecutorInfo(task_start:11, task_end:11), ExecutorInfo(task_start:91, task_end:91), ExecutorInfo(task_start:27, task_end:27), ExecutorInfo(task_start:59, task_end:59), ExecutorInfo(task_start:125, task_end:125), ExecutorInfo(task_start:109, task_end:109), ExecutorInfo(task_start:93, task_end:93), ExecutorInfo(task_start:61, task_end:61), ExecutorInfo(task_start:13, task_end:13), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:77, task_end:77), ExecutorInfo(task_start:47, task_end:47), ExecutorInfo(task_start:111, task_end:111), ExecutorInfo(task_start:127, task_end:127), ExecutorInfo(task_start:63, task_end:63), ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:79, task_end:79), ExecutorInfo(task_start:95, task_end:95), ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:81, task_end:81), ExecutorInfo(task_start:97, task_end:97), ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:65, task_end:65), ExecutorInfo(task_start:129, task_end:129), ExecutorInfo(task_start:49, task_end:49), ExecutorInfo(task_start:113, task_end:113), ExecutorInfo(task_start:17, task_end:17)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:storm) to null
2020-02-18 14:34:52.730 o.a.s.d.s.Container SLOT_6700 [INFO] Killing 6654f8f2-6972-4238-af0b-65e196c8fb70:32de0a97-4a50-4967-873c-df189eb8ec94

which also does not tell us what the cause of the problem is.

And for the worker logs:

2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down worker xslt_topology-21-1580129826 6654f8f2-6972-4238-af0b-65e196c8fb70 6703
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Terminating messaging context
2020-02-18 17:19:04.290 o.a.s.d.worker Thread-64 [INFO] Shutting down executors

5. We used jvisualvm to check whether the workers were, for some reason, busy with something else and therefore unable to send heartbeats to ZooKeeper/Pacemaker, but we did not find anything interesting.


Everything without success.
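
For reference, point 2 above was checked roughly like the sketch below. It is only a minimal sketch: the connect string is an example (use the hosts from storm.zookeeper.servers), it assumes the default storm.zookeeper.root of /storm, and with Pacemaker enabled the worker heartbeats may legitimately never be written to these znodes.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WorkerbeatAges {
    public static void main(String[] args) throws Exception {
        // Example connect string -- replace with the hosts from storm.zookeeper.servers.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });
        // Assumes the default storm.zookeeper.root of "/storm".
        String root = "/storm/workerbeats";
        long now = System.currentTimeMillis();
        for (String topology : zk.getChildren(root, false)) {
            for (String nodePort : zk.getChildren(root + "/" + topology, false)) {
                Stat stat = zk.exists(root + "/" + topology + "/" + nodePort, false);
                long ageSecs = (now - stat.getMtime()) / 1000;
                System.out.println(topology + "/" + nodePort + " last written " + ageSecs + "s ago");
            }
        }
        zk.close();
    }
}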

Is there anything more we can do to find out what the reason for this behavior is?



Additional Questions:
1. Are there any other ways to debug and investigate the causes of problems like this?
2. Maybe Pacemaker is the problem here:
    - Where does it store its data?
    - Can we read that data somehow, the way zkCli.sh allows for ZooKeeper?
    - Maybe some of the requests are not reaching Pacemaker; is there a way to verify that?
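
For completeness, this is roughly how we dump the heartbeat-related settings that our daemons actually load, to confirm whether heartbeats are supposed to go to Pacemaker or to ZooKeeper. It is only a minimal sketch and assumes Utils.readStormConfig() plus the pacemaker.* / nimbus.task.timeout.secs keys as we understand them from the Storm documentation; please correct us if any of them do not apply to 1.1.3.

import java.util.Map;
import org.apache.storm.utils.Utils;

public class HeartbeatConfigDump {
    public static void main(String[] args) {
        // Reads defaults.yaml + storm.yaml from the classpath, the same way the daemons do.
        Map conf = Utils.readStormConfig();

        // Keys as we understand them from the Storm configuration / Pacemaker docs.
        String[] keys = {
            "storm.cluster.state.store",      // points at the Pacemaker state factory when Pacemaker is enabled
            "pacemaker.host",
            "pacemaker.port",
            "worker.heartbeat.frequency.secs",
            "task.heartbeat.frequency.secs",
            "nimbus.task.timeout.secs"        // Nimbus treats an executor as not alive after this many seconds without a heartbeat
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}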


Regards,

Pawel
