Upgraded to 4.19.0.1.
The same error, plus some new ones:
Web UI: The given command 'readyForShutdown' either does not exist.
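For reference, a quick check to see whether the 'readyForShutdown' call is
simply missing a role permission after the upgrade (this assumes dynamic
roles are enabled and that the rule would live in cloud.role_permissions;
only a guess, not a confirmed cause):

-- hypothetical check: is a readyForShutdown rule granted to any role?
SELECT role_id, rule, permission
FROM cloud.role_permissions
WHERE rule LIKE '%readyForShutdown%';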
[root@cs1 ~]# log | grep WARN
2024-07-23 20:48:13,896 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Seq
77-731834939447705632: Timed out on null
2024-07-23 20:48:13,896 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Operation timed out:
Commands 731834939447705632 to Host 77 timed out after 172800
2024-07-23 20:48:13,896 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Unable to obtain VM
network statistics.
2024-07-23 20:48:13,930 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Seq
170-4359484439294640163: Timed out on null
2024-07-23 20:48:13,931 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Operation timed out:
Commands 4359484439294640163 to Host 170 timed out after 172800
2024-07-23 20:48:13,931 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Unable to obtain VM
network statistics.
2024-07-23 20:48:19,729 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-9:ctx-8ba2cf7c) (logid:3c35ed8f) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:19,940 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-10:ctx-2fe9992d) (logid:bcbe402f) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:20,072 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-11:ctx-0f428011) (logid:3f311481) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:20,145 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-12:ctx-7cf4ea84) (logid:2936e5ed) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:24,469 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-9:ctx-8ba2cf7c) (logid:3c35ed8f) Unable to connect due to
2024-07-23 20:48:26,003 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-10:ctx-2fe9992d) (logid:bcbe402f) Unable to connect due to
2024-07-23 20:48:26,028 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-12:ctx-7cf4ea84) (logid:2936e5ed) Unable to connect due to
2024-07-23 20:48:26,247 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-11:ctx-0f428011) (logid:3f311481) Unable to connect due to
2024-07-23 20:48:29,635 WARN [c.c.a.m.AgentAttache]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Seq
77-731834939447705633: Timed out on null
2024-07-23 20:48:29,635 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Operation timed out:
Commands 731834939447705633 to Host 77 timed out after 172800
2024-07-23 20:48:29,635 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Unable to obtain VM
statistics.
2024-07-23 20:48:29,676 WARN [c.c.a.m.AgentAttache]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Seq
170-4359484439294640164: Timed out on null
2024-07-23 20:48:29,676 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Operation timed out:
Commands 4359484439294640164 to Host 170 timed out after 172800
2024-07-23 20:48:29,676 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Unable to obtain VM
statistics.
2024-07-23 20:48:29,707 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
18-1364027737139838981: Timed out on null
2024-07-23 20:48:29,707 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 1364027737139838981 to Host 18 timed out after 172800
2024-07-23 20:48:29,707 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
18 statistics.
2024-07-23 20:48:29,707 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 18
2024-07-23 20:48:29,735 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
74-9075316199104970777: Timed out on null
2024-07-23 20:48:29,735 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 9075316199104970777 to Host 74 timed out after 172800
2024-07-23 20:48:29,735 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
74 statistics.
2024-07-23 20:48:29,736 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 74
2024-07-23 20:48:29,765 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
77-731834939447705634: Timed out on null
2024-07-23 20:48:29,765 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 731834939447705634 to Host 77 timed out after 172800
2024-07-23 20:48:29,765 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
77 statistics.
2024-07-23 20:48:29,765 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 77
2024-07-23 20:48:29,792 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
170-4359484439294640165: Timed out on null
2024-07-23 20:48:29,792 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 4359484439294640165 to Host 170 timed out after 172800
2024-07-23 20:48:29,792 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
170 statistics.
2024-07-23 20:48:29,792 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 170
2024-07-23 20:48:31,609 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705635: Timed out on null
2024-07-23 20:48:31,626 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640166: Timed out on null
2024-07-23 20:48:31,641 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970778: Timed out on null
2024-07-23 20:48:31,667 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705636: Timed out on null
2024-07-23 20:48:31,684 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640167: Timed out on null
2024-07-23 20:48:31,700 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970779: Timed out on null
2024-07-23 20:48:31,728 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970780: Timed out on null
2024-07-23 20:48:31,744 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640168: Timed out on null
2024-07-23 20:48:31,761 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705637: Timed out on null
2024-07-23 20:48:31,787 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640169: Timed out on null
2024-07-23 20:48:31,803 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970781: Timed out on null
2024-07-23 20:48:31,820 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705638: Timed out on null
2024-07-23 20:48:31,844 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970782: Timed out on null
2024-07-23 20:48:31,870 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705639: Timed out on null
2024-07-23 20:48:31,886 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640170: Timed out on null
2024-07-23 20:48:44,035 WARN [c.c.a.AlertManagerImpl]
(CapacityChecker:ctx-ad836823) (logid:febec4ab) alertType=[24]
dataCenterId=[9] podId=[null] clusterId=[null] message=[System Alert:
Number of unallocated shared network IPs is low in availability zone
LTC-DC].
2024-07-23 20:45:43,969 ERROR [c.c.a.AlertManagerImpl]
(CapacityChecker:ctx-834ef0f3) (logid:7e9bd4f4) Caught exception in
recalculating capacity
2024-07-23 20:46:19,513 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-5:ctx-4a180df2) (logid:44b1d912) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:19,679 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-6:ctx-4510f484) (logid:e8eaa85e) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:19,846 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-7:ctx-18d4342d) (logid:c1d175a1) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:19,896 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-8:ctx-574e515b) (logid:10346a2e) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:24,407 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-5:ctx-4a180df2) (logid:44b1d912) Monitor
ComputeCapacityListener says there is an error in the connect process
for 248 due to null
2024-07-23 20:46:25,735 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-6:ctx-4510f484) (logid:e8eaa85e) Monitor
ComputeCapacityListener says there is an error in the connect process
for 254 due to null
2024-07-23 20:46:25,817 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-8:ctx-574e515b) (logid:10346a2e) Monitor
ComputeCapacityListener says there is an error in the connect process
for 260 due to null
2024-07-23 20:46:26,029 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-7:ctx-18d4342d) (logid:c1d175a1) Monitor
ComputeCapacityListener says there is an error in the connect process
for 257 due to null
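Side note: the repeated "timed out after 172800" above suggests a
per-command wait of 172800 seconds. Assuming that comes from the global
'wait' setting (only my guess), the current value can be checked with:

-- check the global command wait timeout (value is in seconds)
SELECT name, value FROM cloud.configuration WHERE name = 'wait';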
Janis
On 2024-07-23 16:44, Janis Viklis | Files.fm wrote:
Seems similar issue to this one:
https://issues.apache.org/jira/browse/CLOUDSTACK-8747
Sending Connect to listener: ComputeCapacityListener
Found 5 VMs on host 27
Found 1 VM, not running on host 27
Monitor ComputeCapacityListener says there is an error in the connect
process for 27 due to null
Host 27 is disconnecting with event AgentDisconnected
The next status of agent 27 is Alert, current status is Connecting
Janis
On 2024-07-04 3:48, Nux wrote:
Janis,
No clue, it's been a while since I used XenServer, and you are also on
quite an old version, right? There have been many bugs fixed since 4.13.
Would it be possible to include a much larger fragment from the logs
or the full logs?
Also, have you checked the XCP-ng logs? Anything there? Is XenCenter
showing anything out of the ordinary?
HTH
On 2024-07-03 14:36, Janis Viklis | Files.fm wrote:
If I set a valid management server id, it returns to NULL after the next
host check cycle.
I wonder whether it could somehow be related to total or cluster resources
(but I tried to find and check/change all overprovisioning multipliers).
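For what it's worth, this is roughly how I checked the overprovisioning
settings globally and per cluster (assuming the usual cloud.configuration
and cloud.cluster_details tables, and that the per-cluster detail names
are cpuOvercommitRatio / memoryOvercommitRatio; adjust if your schema
differs):

-- global overprovisioning factors
SELECT name, value FROM cloud.configuration WHERE name LIKE '%overprovisioning%';
-- per-cluster overcommit ratios
SELECT cluster_id, name, value FROM cloud.cluster_details WHERE name LIKE '%OvercommitRatio%';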
2024-07-03 16:30:16,036 DEBUG [c.c.c.CapacityManagerImpl]
(CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 32 VMs on host
248
2024-07-03 16:30:16,039 DEBUG [c.c.c.CapacityManagerImpl]
(CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 0 VMs are
Migrating from host 248
2024-07-03 16:30:16,138 ERROR [c.c.a.AlertManagerImpl]
(CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Caught exception in
recalculating capacity
java.lang.NullPointerException
at
com.cloud.capacity.CapacityManagerImpl.updateCapacityForHost(CapacityManagerImpl.java:677)
at
com.cloud.alert.AlertManagerImpl.recalculateCapacity(AlertManagerImpl.java:279)
at
com.cloud.alert.AlertManagerImpl.checkForAlerts(AlertManagerImpl.java:432)
at
com.cloud.alert.AlertManagerImpl$CapacityChecker.runInContext(AlertManagerImpl.java:422)
at
org.apache.cloudstack.managed.context.ManagedContextTimerTask$1.runInContext(ManagedContextTimerTask.java:30)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at
org.apache.cloudstack.managed.context.ManagedContextTimerTask.run(ManagedContextTimerTask.java:32)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
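Since the NPE is thrown from updateCapacityForHost during capacity
recalculation, one thing that might be worth checking (only a guess on my
part, assuming the per-host capacity rows live in cloud.op_host_capacity)
is whether the affected hosts have missing or odd capacity rows:

-- inspect capacity rows for the four problem hosts
SELECT host_id, capacity_type, capacity_state, used_capacity, reserved_capacity, total_capacity
FROM cloud.op_host_capacity
WHERE host_id IN (248, 254, 257, 260);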
Janis
On 2024-07-03 16:27, Nux wrote:
Hello,
What happens if you update the 4 problematic hosts with a valid
mgmt id?
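Something along these lines (only a sketch, assuming 95534596974 is the
msid of the management server that is currently up; back up the DB first):

-- point the four Alert hosts at the running management server
UPDATE cloud.host SET mgmt_server_id = 95534596974 WHERE id IN (248, 254, 257, 260);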
On 2024-07-03 14:23, Janis Viklis | Files.fm wrote:
mgmt_server_id is NULL just for those 4 hosts; the other hosts are fine.
Looking at the logs, the cs1 management server first starts to connect the
pools:
2024-07-01 16:31:29,617 DEBUG [c.c.s.l.StoragePoolMonitor]
(AgentTaskPool-380:ctx-f411cc14) (logid:284129f8) Host 248 connected,
connecting host to shared pool id 152 and sending storage pool...
------------------------------------------------------------------------
DB Tables: cloud.host and cloud.mshost:
SELECT id, status, Type, mgmt_server_id FROM cloud.host WHERE id IN
(74, 77, 170, 248, 254, 257, 260):

 id  | status | type    | mgmt_server_id
 260 | Alert  | Routing | NULL
 257 | Alert  | Routing | NULL
 254 | Alert  | Routing | NULL
 248 | Alert  | Routing | NULL
 170 | Up     | Routing | 95534596974
  77 | Up     | Routing | 95534596974
  74 | Up     | Routing | 95534596974
cloud.mshost rows (columns appear to be id, msid, runid, name, uuid,
state, version, service_ip, port, last_update, ...):

 179 | 95534596974 | 1720012401793 | localhost      | b34f493a-42c0-47a8-ada4-04be4cdd8c49 | Up   | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-07-03 13:13:47 | 0
 178 | 95536034244 | 1718828790629 | cs2.failiem.lv | 70420423-b362-4335-b083-8ad1342ce485 | Down | 4.13.1.0 | 10.10.10.12 | 9090 | 2024-06-19 20:39:19 | 1
 176 | 95530190206 | 1719663483676 | localhost      | 96a155b6-7041-48ff-9f20-268ea77c5098 | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2024-06-29 12:24:28 | 1
 175 | 95536505104 | 1719666507512 | localhost      | c8e6fefa-7464-4bb7-a379-5eafb55c666d | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-06-29 13:38:00 | 0
 174 | 95534962877 | 1682516323955 | localhost      | 45a057c6-6d50-41a9-bbad-cab370c01832 | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-06-15 08:36:06 | 1
 172 | 95529749065 | 1658756353180 | localhost      | 535277d3-33df-4b2a-9f1d-07f05084d473 | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2024-06-15 07:53:32 | 1
 170 | 95529797928 | 1603725530943 | localhost      | 5892611f-7af8-4686-8818-95ade086e6cf | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2020-11-03 04:05:40 | 1
 167 | 95534560846 | 1658756323907 | localhost      | e7ffd55a-77b7-4848-90de-5b5f10cc4500 | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2023-04-17 09:50:14 | 1
 163 | 95534279505 | 1582559260879 | cs1.failiem.lv | 8c254697-9783-11ea-900f-00163e4db64e | Down | 4.11.1.0 | 10.10.10.11 | 9090 | 2020-05-16 14:07:09 | 1
 161 | 95531601526 | 1582559325515 | cs3.failiem.lv | 8c25457e-9783-11ea-900f-00163e4db64e | Down | 4.11.1.0 | 10.10.10.13 | 9090 | 2020-05-16 14:07:21 | 1
Janis
On 2024-07-03 13:11, Nux wrote:
A shot in the dark, I haven't checked the log files properly.
For these hosts in the disconnected state, if you check them in the DB
cloud.host table (type="Routing", btw), which mgmt_server_id are they
reporting?
Then check the cloud.mshost table and see whether the management server
with that id is in there and marked as Up, etc.
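For example, something along these lines (off the top of my head, so
treat the column list as an assumption and adjust as needed):

-- hosts that are not Up and the management server they last reported to
SELECT id, status, type, mgmt_server_id FROM cloud.host WHERE type = 'Routing' AND status != 'Up';
-- is that management server present and Up?
SELECT id, msid, name, state, version, removed FROM cloud.mshost;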
HTH
On 2024-07-03 06:57, Janis Viklis | Files.fm wrote:
(sorry, some bad formatting in the previous email)
Does anyone have any ideas why this error occurs and how to debug it?
(248 is a host id)
Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null
Janis
On 2024-07-01 21:44, Janis Viklis | Files.fm wrote:
Hi,
Looking for help after 2 weeks: what could be the reason that, suddenly
after restarting the 4.13.1 management server, all 4 Xen (XCP-ng 8.1)
hosts of one Intel cluster disconnect and go into "Alert" state with the
error:
Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null
I haven't been able to find the reason for 2 weeks. The other AMD
XenServer 6.5 cluster is working just fine.
Everything seems OK: the network is working, and I have restarted the
toolstack, both system VMs (SSVM, console VM), and one of the hosts, then
removed it and added it back.
Previously there were 3 management servers behind HAProxy with Galera
MariaDB; I left only one (I also tried an upgrade to 3.14.1, which didn't
help). I can manage the hosts via XenCenter. There are 5 storage pools and
3 secondary storages.
Thanks, hoping for some clues or directions, Janis.
Below is LOG output: