Seems to be a similar issue to this one: https://issues.apache.org/jira/browse/CLOUDSTACK-8747

Sending Connect to listener: ComputeCapacityListener
Found 5 VMs on host 27
Found 1 VM, not running on host 27
Monitor ComputeCapacityListener says there is an error in the connect
process for 27 due to null
Host 27 is disconnecting with event AgentDisconnected
The next status of agent 27 is Alert, current status is Connecting

Janis

On 2024-07-04 3:48, Nux wrote:
Janis,

No clue, it's been a while since I used XenServer, and you are on quite an old version as well, right? There have been many bugs fixed since 4.13.

Would it be possible to include a much larger fragment from the logs or the full logs?

Also, have you checked the XCP logs? Anything there? Is XenCenter showing anything out of the ordinary?

HTH

On 2024-07-03 14:36, Janis Viklis | Files.fm wrote:
If I set a valid management server id, it returns to NULL after the next host check cycle.
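
For reference, the change was roughly of this form (just a sketch, not the exact statement I ran; 95534596974 is the msid of the management server that shows as Up in cloud.mshost, and the ids are the 4 affected hosts):

-- force a valid mgmt_server_id on the affected hosts; it reverts to NULL
-- again after the next host check cycle
UPDATE cloud.host
SET mgmt_server_id = 95534596974
WHERE id IN (248, 254, 257, 260);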

I wonder whether it could somehow be related to total or cluster resources (but I tried to find and check/change all the overprovisioning multipliers).

2024-07-03 16:30:16,036 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 32 VMs on host 248
2024-07-03 16:30:16,039 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 0 VMs are Migrating from host 248
2024-07-03 16:30:16,138 ERROR [c.c.a.AlertManagerImpl] (CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Caught exception in recalculating capacity
java.lang.NullPointerException
        at com.cloud.capacity.CapacityManagerImpl.updateCapacityForHost(CapacityManagerImpl.java:677)
        at com.cloud.alert.AlertManagerImpl.recalculateCapacity(AlertManagerImpl.java:279)
        at com.cloud.alert.AlertManagerImpl.checkForAlerts(AlertManagerImpl.java:432)
        at com.cloud.alert.AlertManagerImpl$CapacityChecker.runInContext(AlertManagerImpl.java:422)
        at org.apache.cloudstack.managed.context.ManagedContextTimerTask$1.runInContext(ManagedContextTimerTask.java:30)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at org.apache.cloudstack.managed.context.ManagedContextTimerTask.run(ManagedContextTimerTask.java:32)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

Janis

On 2024-07-03 16:27, Nux wrote:
Hello,

What happens if you update the 4 problematic hosts with a valid mgmt id?

On 2024-07-03 14:23, Janis Viklis | Files.fm wrote:
mgmt_server_id is NULL just for those 4 hosts; the other hosts are fine.
Looking at the logs, the cs1 management server starts by connecting the
pools first:

2024-07-01 16:31:29,617 DEBUG [c.c.s.l.StoragePoolMonitor]
(AgentTaskPool-380:ctx-f411cc14) (logid:284129f8) Host 248 connected,
connecting host to shared pool id 152 and sending storage pool...

------------------------------------------------------------------------


DB Tables: cloud.host and cloud.mshost:

SELECT id, status, Type, mgmt_server_id FROM cloud.host where ID in
(74,77,170, 248, 254, 257, 260) :

 id  | status | type    | mgmt_server_id
-----+--------+---------+---------------
 260 | Alert  | Routing | NULL
 257 | Alert  | Routing | NULL
 254 | Alert  | Routing | NULL
 248 | Alert  | Routing | NULL
 170 | Up     | Routing | 95534596974
  77 | Up     | Routing | 95534596974
  74 | Up     | Routing | 95534596974

cloud.mshost (id | msid | runid | name | uuid | state | version | service_ip | service_port | last_update | removed | alert_count):

 179 | 95534596974 | 1720012401793 | localhost      | b34f493a-42c0-47a8-ada4-04be4cdd8c49 | Up   | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-07-03 13:13:47 | NULL | 0
 178 | 95536034244 | 1718828790629 | cs2.failiem.lv | 70420423-b362-4335-b083-8ad1342ce485 | Down | 4.13.1.0 | 10.10.10.12 | 9090 | 2024-06-19 20:39:19 | NULL | 1
 176 | 95530190206 | 1719663483676 | localhost      | 96a155b6-7041-48ff-9f20-268ea77c5098 | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2024-06-29 12:24:28 | NULL | 1
 175 | 95536505104 | 1719666507512 | localhost      | c8e6fefa-7464-4bb7-a379-5eafb55c666d | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-06-29 13:38:00 | NULL | 0
 174 | 95534962877 | 1682516323955 | localhost      | 45a057c6-6d50-41a9-bbad-cab370c01832 | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-06-15 08:36:06 | NULL | 1
 172 | 95529749065 | 1658756353180 | localhost      | 535277d3-33df-4b2a-9f1d-07f05084d473 | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2024-06-15 07:53:32 | NULL | 1
 170 | 95529797928 | 1603725530943 | localhost      | 5892611f-7af8-4686-8818-95ade086e6cf | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2020-11-03 04:05:40 | NULL | 1
 167 | 95534560846 | 1658756323907 | localhost      | e7ffd55a-77b7-4848-90de-5b5f10cc4500 | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2023-04-17 09:50:14 | NULL | 1
 163 | 95534279505 | 1582559260879 | cs1.failiem.lv | 8c254697-9783-11ea-900f-00163e4db64e | Down | 4.11.1.0 | 10.10.10.11 | 9090 | 2020-05-16 14:07:09 | NULL | 1
 161 | 95531601526 | 1582559325515 | cs3.failiem.lv | 8c25457e-9783-11ea-900f-00163e4db64e | Down | 4.11.1.0 | 10.10.10.13 | 9090 | 2020-05-16 14:07:21 | NULL | 1

Janis

On 2024-07-03 13:11, Nux wrote:

A shot in the dark, haven't checked the log files properly.
For these hosts in the disconnected state, if you check them in the
DB cloud.host table (type="Routing" btw), which mgmt_server_id are
they reporting?

Then check cloud.mshost table and see whether the management server
with that id is in there and marked as UP etc.
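
Something along these lines (just a sketch, adjust to your setup):

-- which management server do the disconnected Routing hosts point to?
SELECT id, status, type, mgmt_server_id
FROM cloud.host
WHERE type = 'Routing' AND status = 'Alert';

-- is that management server known and Up?
-- (replace <mgmt_server_id> with the value returned above)
SELECT id, msid, name, state, version, last_update
FROM cloud.mshost
WHERE msid = <mgmt_server_id>;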

HTH

On 2024-07-03 06:57, Janis Viklis | Files.fm wrote:
(sorry, some bad formatting in previous email)

Does anyone have any ideas why this error occurs and how to debug
it? (248 is a host id)

Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null

Janis

On 2024-07-01 21:44, Janis Viklis | Files.fm wrote:
Hi,

looking for help after 2 weeks: what could be the reason that,
suddenly after restarting the 4.13.1 management server, all 4 XEN
(xcp-ng 8.1) hosts of one Intel cluster disconnect and go into
"Alert" state with this error:

Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null

I haven't been able to find the reason for 2 weeks. The other AMD
XenServer 6.5 cluster is working just fine.

Everything seems OK: the network is working. I restarted the
toolstack, both system VMs (SSVM, console proxy), and one of the
hosts, then removed it and added it back.

Previously there were 3 management servers behind HAProxy with Galera
MariaDB; I left only one (tried an upgrade to 3.14.1, didn't help). I
can manage the hosts via XenCenter. There are 5 storage pools and 3
secondary ones.

Thanks, hoping for some clues or directions, Janis.

Below is the log output:
