mgmt_server_id is *NULL* just for those 4 hosts; the other hosts are fine. Looking at the logs, the cs1 management server first starts to connect the pools:

2024-07-01 16:31:29,617 DEBUG [c.c.s.l.StoragePoolMonitor] (AgentTaskPool-380:ctx-f411cc14) (logid:284129f8) Host 248 connected, connecting host to shared pool id 152 and sending storage pool...

...

2024-07-01 16:31:29,839 DEBUG [c.c.a.t.Request] (AgentTaskPool-380:ctx-f411cc14) (logid:284129f8) Seq 248-1798343626204381188: Received:  { Ans: , MgmtId: 95534596974, via: 248(xs31.failiem.lv),          Ver: v1, Flags: 10, { ModifyStoragePoolAnswer } }

------------------------------------------------------------------------

DB Tables cloud.host and cloud.mshost:

*SELECT id, status, type, mgmt_server_id FROM cloud.host WHERE id IN (74, 77, 170, 248, 254, 257, 260);*

260     Alert   Routing         
257     Alert   Routing         
254     Alert   Routing         
248     Alert   Routing         
170     Up      Routing         95534596974
77      Up      Routing         95534596974
74      Up      Routing         95534596974

179 95534596974 1720012401793 localhost b34f493a-42c0-47a8-ada4-04be4cdd8c49 Up 4.13.1.0 10.10.10.11 9090 2024-07-03 13:13:47 0
178 95536034244 1718828790629 cs2.failiem.lv 70420423-b362-4335-b083-8ad1342ce485 Down 4.13.1.0 10.10.10.12 9090 2024-06-19 20:39:19 1
176 95530190206 1719663483676 localhost 96a155b6-7041-48ff-9f20-268ea77c5098 Down 4.13.1.0 10.10.10.13 9090 2024-06-29 12:24:28 1
175 95536505104 1719666507512 localhost c8e6fefa-7464-4bb7-a379-5eafb55c666d Down 4.13.1.0 10.10.10.11 9090 2024-06-29 13:38:00 0
174 95534962877 1682516323955 localhost 45a057c6-6d50-41a9-bbad-cab370c01832 Down 4.13.1.0 10.10.10.11 9090 2024-06-15 08:36:06 1
172 95529749065 1658756353180 localhost 535277d3-33df-4b2a-9f1d-07f05084d473 Down 4.13.1.0 10.10.10.13 9090 2024-06-15 07:53:32 1
170 95529797928 1603725530943 localhost 5892611f-7af8-4686-8818-95ade086e6cf Down 4.13.1.0 10.10.10.13 9090 2020-11-03 04:05:40 1
167 95534560846 1658756323907 localhost e7ffd55a-77b7-4848-90de-5b5f10cc4500 Down 4.13.1.0 10.10.10.11 9090 2023-04-17 09:50:14 1
163 95534279505 1582559260879 cs1.failiem.lv 8c254697-9783-11ea-900f-00163e4db64e Down 4.11.1.0 10.10.10.11 9090 2020-05-16 14:07:09 1
161 95531601526 1582559325515 cs3.failiem.lv 8c25457e-9783-11ea-900f-00163e4db64e Down 4.11.1.0 10.10.10.13 9090 2020-05-16 14:07:21 1

Janis


On 2024-07-03 13:11, Nux wrote:
A shot in the dark, haven't checked the log files properly.
For these hosts in the disconnected state, if you check them in the DB cloud.host table (type="Routing" btw), which mgmt_server_id are they reporting?

Then check cloud.mshost table and see whether the management server with that id is in there and marked as UP etc.
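
The two checks above can be combined into a single join. Below is a minimal sketch, not a definitive procedure: it loads a few of the rows quoted in this thread into an in-memory SQLite database and runs the join there. The column names `msid` and `state` for cloud.mshost are assumptions based on the output shown in this thread, and the two tables here are simplified stand-ins for the real ones. Against a real installation only the SELECT would be reused, with the `cloud.` schema prefix, after confirming the column names.

```python
import sqlite3

# Assumption: simplified stand-ins for cloud.host and cloud.mshost that hold
# only the columns discussed in this thread; the real tables have many more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE host (id INTEGER, status TEXT, type TEXT, mgmt_server_id INTEGER);
CREATE TABLE mshost (id INTEGER, msid INTEGER, name TEXT, state TEXT);
""")

# Host rows as quoted in the thread (None stands for SQL NULL).
conn.executemany("INSERT INTO host VALUES (?, ?, ?, ?)", [
    (260, "Alert", "Routing", None),
    (257, "Alert", "Routing", None),
    (254, "Alert", "Routing", None),
    (248, "Alert", "Routing", None),
    (170, "Up", "Routing", 95534596974),
    (77, "Up", "Routing", 95534596974),
    (74, "Up", "Routing", 95534596974),
])

# Two of the management server rows quoted in the thread.
conn.executemany("INSERT INTO mshost VALUES (?, ?, ?, ?)", [
    (179, 95534596974, "localhost", "Up"),
    (178, 95536034244, "cs2.failiem.lv", "Down"),
])

# LEFT JOIN keeps hosts whose mgmt_server_id is NULL or matches no mshost row;
# for those hosts the joined state column comes back as NULL.
QUERY = """
SELECT h.id, h.status, h.mgmt_server_id, m.state
FROM host h
LEFT JOIN mshost m ON m.msid = h.mgmt_server_id
WHERE h.type = 'Routing'
ORDER BY h.id
"""

rows = conn.execute(QUERY).fetchall()

# Hosts that are not owned by a management server currently marked Up.
orphaned = [host_id for host_id, _status, _msid, state in rows if state != "Up"]

for row in rows:
    print(row)
print("hosts without an Up management server:", orphaned)
```

Any host listed at the end either has no mgmt_server_id or points at a management server that is not marked Up.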

HTH

On 2024-07-03 06:57, Janis Viklis | Files.fm wrote:
(sorry, some bad formatting in previous email)

Could anyone have any ideas why this error occurs and how to debug it? (248 is a host id)

Monitor ComputeCapacityListener says there is an error in the connect process for 248 due to null


Janis

On 2024-07-01 21:44, Janis Viklis | Files.fm wrote:
Hi,

I'm looking for help after 2 weeks of trying: what could be the reason that, right after restarting the 4.13.1 management server, all 4 Xen (xcp-ng 8.1) hosts of one Intel cluster disconnect and go into the "Alert" state with this error:

Monitor ComputeCapacityListener says there is an error in the connect process for 248 due to null

I haven't been able to find the reason for 2 weeks. The other cluster (AMD, XenServer 6.5) is working just fine.

Everything seems OK: the network is working. I restarted the toolstack, both system VMs (SSVM, consolev) and one of the hosts, then removed that host and added it back.

Previously there were 3 management servers behind HAProxy with Galera MariaDB; I left only one (I tried an upgrade to 3.14.1, which didn't help). I can manage the hosts via XenCenter. There are 5 storage pools and 3 secondary ones.

Thanks, hoping for some clues or directions, Janis.

Below is LOG output:

