Upgraded to 4.19.0.1.
The same error, plus some new ones:
Web UI: The given command 'readyForShutdown' either does not exist.
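For reference, a quick check to see whether the 'readyForShutdown' call is
simply missing a role permission after the upgrade (this assumes dynamic
roles are enabled and that the rule would live in cloud.role_permissions;
only a guess, not a confirmed cause):

-- hypothetical check: is a readyForShutdown rule granted to any role?
SELECT role_id, rule, permission
FROM cloud.role_permissions
WHERE rule LIKE '%readyForShutdown%';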
[root@cs1 ~]# log | grep WARN
2024-07-23 20:48:13,896 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Seq
77-731834939447705632: Timed out on null
2024-07-23 20:48:13,896 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Operation timed out:
Commands 731834939447705632 to Host 77 timed out after 172800
2024-07-23 20:48:13,896 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Unable to obtain VM
network statistics.
2024-07-23 20:48:13,930 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Seq
170-4359484439294640163: Timed out on null
2024-07-23 20:48:13,931 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Operation timed out:
Commands 4359484439294640163 to Host 170 timed out after 172800
2024-07-23 20:48:13,931 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-1:ctx-0331434c) (logid:1eb04bf4) Unable to obtain VM
network statistics.
2024-07-23 20:48:19,729 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-9:ctx-8ba2cf7c) (logid:3c35ed8f) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:19,940 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-10:ctx-2fe9992d) (logid:bcbe402f) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:20,072 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-11:ctx-0f428011) (logid:3f311481) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:20,145 WARN [c.c.h.x.d.XcpServerDiscoverer]
(AgentTaskPool-12:ctx-7cf4ea84) (logid:2936e5ed) defaulting to
xenserver650 resource for product brand: XCP-ng with product version: 8.1.0
2024-07-23 20:48:24,469 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-9:ctx-8ba2cf7c) (logid:3c35ed8f) Unable to connect due to
2024-07-23 20:48:26,003 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-10:ctx-2fe9992d) (logid:bcbe402f) Unable to connect due to
2024-07-23 20:48:26,028 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-12:ctx-7cf4ea84) (logid:2936e5ed) Unable to connect due to
2024-07-23 20:48:26,247 WARN [c.c.r.ResourceManagerImpl]
(AgentTaskPool-11:ctx-0f428011) (logid:3f311481) Unable to connect due to
2024-07-23 20:48:29,635 WARN [c.c.a.m.AgentAttache]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Seq
77-731834939447705633: Timed out on null
2024-07-23 20:48:29,635 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Operation timed out:
Commands 731834939447705633 to Host 77 timed out after 172800
2024-07-23 20:48:29,635 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Unable to obtain VM
statistics.
2024-07-23 20:48:29,676 WARN [c.c.a.m.AgentAttache]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Seq
170-4359484439294640164: Timed out on null
2024-07-23 20:48:29,676 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Operation timed out:
Commands 4359484439294640164 to Host 170 timed out after 172800
2024-07-23 20:48:29,676 WARN [c.c.v.VirtualMachineManagerImpl]
(StatsCollector-6:ctx-6ae014e7) (logid:e109ae1e) Unable to obtain VM
statistics.
2024-07-23 20:48:29,707 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
18-1364027737139838981: Timed out on null
2024-07-23 20:48:29,707 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 1364027737139838981 to Host 18 timed out after 172800
2024-07-23 20:48:29,707 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
18 statistics.
2024-07-23 20:48:29,707 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 18
2024-07-23 20:48:29,735 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
74-9075316199104970777: Timed out on null
2024-07-23 20:48:29,735 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 9075316199104970777 to Host 74 timed out after 172800
2024-07-23 20:48:29,735 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
74 statistics.
2024-07-23 20:48:29,736 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 74
2024-07-23 20:48:29,765 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
77-731834939447705634: Timed out on null
2024-07-23 20:48:29,765 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 731834939447705634 to Host 77 timed out after 172800
2024-07-23 20:48:29,765 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
77 statistics.
2024-07-23 20:48:29,765 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 77
2024-07-23 20:48:29,792 WARN [c.c.a.m.AgentAttache]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Seq
170-4359484439294640165: Timed out on null
2024-07-23 20:48:29,792 WARN [c.c.a.m.AgentManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Operation timed out:
Commands 4359484439294640165 to Host 170 timed out after 172800
2024-07-23 20:48:29,792 WARN [c.c.r.ResourceManagerImpl]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) Unable to obtain host
170 statistics.
2024-07-23 20:48:29,792 WARN [c.c.s.StatsCollector]
(StatsCollector-1:ctx-2fd01c39) (logid:5a92adb5) The Host stats is null
for host: 170
2024-07-23 20:48:31,609 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705635: Timed out on null
2024-07-23 20:48:31,626 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640166: Timed out on null
2024-07-23 20:48:31,641 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970778: Timed out on null
2024-07-23 20:48:31,667 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705636: Timed out on null
2024-07-23 20:48:31,684 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640167: Timed out on null
2024-07-23 20:48:31,700 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970779: Timed out on null
2024-07-23 20:48:31,728 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970780: Timed out on null
2024-07-23 20:48:31,744 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640168: Timed out on null
2024-07-23 20:48:31,761 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705637: Timed out on null
2024-07-23 20:48:31,787 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640169: Timed out on null
2024-07-23 20:48:31,803 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970781: Timed out on null
2024-07-23 20:48:31,820 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705638: Timed out on null
2024-07-23 20:48:31,844 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
74-9075316199104970782: Timed out on null
2024-07-23 20:48:31,870 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
77-731834939447705639: Timed out on null
2024-07-23 20:48:31,886 WARN [c.c.a.m.AgentAttache]
(StatsCollector-4:ctx-c7c5ff50) (logid:67a9af5e) Seq
170-4359484439294640170: Timed out on null
2024-07-23 20:48:44,035 WARN [c.c.a.AlertManagerImpl]
(CapacityChecker:ctx-ad836823) (logid:febec4ab) alertType=[24]
dataCenterId=[9] podId=[null] clusterId=[null] message=[System Alert:
Number of unallocated shared network IPs is low in availability zone
LTC-DC].
2024-07-23 20:45:43,969 ERROR [c.c.a.AlertManagerImpl]
(CapacityChecker:ctx-834ef0f3) (logid:7e9bd4f4) Caught exception in
recalculating capacity
2024-07-23 20:46:19,513 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-5:ctx-4a180df2) (logid:44b1d912) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:19,679 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-6:ctx-4510f484) (logid:e8eaa85e) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:19,846 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-7:ctx-18d4342d) (logid:c1d175a1) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:19,896 ERROR [c.c.u.s.SshHelper]
(AgentTaskPool-8:ctx-574e515b) (logid:10346a2e) SSH execution of command
xe sm-list | grep "resigning of duplicates" has an error status code in
return. Result output:
2024-07-23 20:46:24,407 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-5:ctx-4a180df2) (logid:44b1d912) Monitor
ComputeCapacityListener says there is an error in the connect process
for 248 due to null
2024-07-23 20:46:25,735 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-6:ctx-4510f484) (logid:e8eaa85e) Monitor
ComputeCapacityListener says there is an error in the connect process
for 254 due to null
2024-07-23 20:46:25,817 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-8:ctx-574e515b) (logid:10346a2e) Monitor
ComputeCapacityListener says there is an error in the connect process
for 260 due to null
2024-07-23 20:46:26,029 ERROR [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-7:ctx-18d4342d) (logid:c1d175a1) Monitor
ComputeCapacityListener says there is an error in the connect process
for 257 due to null
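Side note: the repeated "timed out after 172800" above suggests a
per-command wait of 172800 seconds. Assuming that comes from the global
'wait' setting (only my guess), the current value can be checked with:

-- check the global command wait timeout (value is in seconds)
SELECT name, value FROM cloud.configuration WHERE name = 'wait';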
Janis
On 2024-07-23 16:44, Janis Viklis | Files.fm wrote:
Seems similar issue to this one:
https://issues.apache.org/jira/browse/CLOUDSTACK-8747
Sending Connect to listener: ComputeCapacityListener
Found 5 VMs on host 27
Found 1 VM, not running on host 27
Monitor ComputeCapacityListener says there is an error in the connect
process for 27 due to null
Host 27 is disconnecting with event AgentDisconnected
The next status of agent 27 is Alert, current status is Connecting
Janis
On 2024-07-04 3:48, Nux wrote:
Janis,
No clue, it's been a while since I used XenServer, and you are also on
quite an old version, right? There have been many bugs fixed since 4.13.
Would it be possible to include a much larger fragment from the logs
or the full logs?
Also, have you checked the XCP-ng logs? Anything there? Is XenCenter
showing anything out of the ordinary?
HTH
On 2024-07-03 14:36, Janis Viklis | Files.fm wrote:
If I set a valid management server id, it returns to NULL after the next
host check cycle.
I wonder whether it could somehow be related to total or cluster resources
(but I tried to find and check/change all overprovisioning multipliers).
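For what it's worth, this is roughly how I checked the overprovisioning
settings globally and per cluster (assuming the usual cloud.configuration
and cloud.cluster_details tables, and that the per-cluster detail names
are cpuOvercommitRatio / memoryOvercommitRatio; adjust if your schema
differs):

-- global overprovisioning factors
SELECT name, value FROM cloud.configuration WHERE name LIKE '%overprovisioning%';
-- per-cluster overcommit ratios
SELECT cluster_id, name, value FROM cloud.cluster_details WHERE name LIKE '%OvercommitRatio%';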
2024-07-03 16:30:16,036 DEBUG [c.c.c.CapacityManagerImpl]
(CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 32 VMs on host
248
2024-07-03 16:30:16,039 DEBUG [c.c.c.CapacityManagerImpl]
(CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Found 0 VMs are
Migrating from host 248
2024-07-03 16:30:16,138 ERROR [c.c.a.AlertManagerImpl]
(CapacityChecker:ctx-af9f7c42) (logid:31d432e5) Caught exception in
recalculating capacity
java.lang.NullPointerException
at
com.cloud.capacity.CapacityManagerImpl.updateCapacityForHost(CapacityManagerImpl.java:677)
at
com.cloud.alert.AlertManagerImpl.recalculateCapacity(AlertManagerImpl.java:279)
at
com.cloud.alert.AlertManagerImpl.checkForAlerts(AlertManagerImpl.java:432)
at
com.cloud.alert.AlertManagerImpl$CapacityChecker.runInContext(AlertManagerImpl.java:422)
at
org.apache.cloudstack.managed.context.ManagedContextTimerTask$1.runInContext(ManagedContextTimerTask.java:30)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at
org.apache.cloudstack.managed.context.ManagedContextTimerTask.run(ManagedContextTimerTask.java:32)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
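Since the NPE is thrown from updateCapacityForHost during capacity
recalculation, one thing that might be worth checking (only a guess on my
part, assuming the per-host capacity rows live in cloud.op_host_capacity)
is whether the affected hosts have missing or odd capacity rows:

-- inspect capacity rows for the four problem hosts
SELECT host_id, capacity_type, capacity_state, used_capacity, reserved_capacity, total_capacity
FROM cloud.op_host_capacity
WHERE host_id IN (248, 254, 257, 260);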
Janis
On 2024-07-03 16:27, Nux wrote:
Hello,
What happens if you update the 4 problematic hosts with a valid
mgmt id?
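Something along these lines (only a sketch, assuming 95534596974 is the
msid of the management server that is currently up; back up the DB first):

-- point the four Alert hosts at the running management server
UPDATE cloud.host SET mgmt_server_id = 95534596974 WHERE id IN (248, 254, 257, 260);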
On 2024-07-03 14:23, Janis Viklis | Files.fm wrote:
mgmt_server_id is NULL just for those 4 hosts; the other hosts are fine.
Looking at the logs, the cs1 management server first starts to connect the
pools:
2024-07-01 16:31:29,617 DEBUG [c.c.s.l.StoragePoolMonitor]
(AgentTaskPool-380:ctx-f411cc14) (logid:284129f8) Host 248 connected,
connecting host to shared pool id 152 and sending storage pool...
------------------------------------------------------------------------
DB Tables: cloud.host and cloud.mshost:
SELECT id, status, Type, mgmt_server_id FROM cloud.host WHERE id IN
(74, 77, 170, 248, 254, 257, 260):

 id  | status | type    | mgmt_server_id
 260 | Alert  | Routing | NULL
 257 | Alert  | Routing | NULL
 254 | Alert  | Routing | NULL
 248 | Alert  | Routing | NULL
 170 | Up     | Routing | 95534596974
  77 | Up     | Routing | 95534596974
  74 | Up     | Routing | 95534596974
cloud.mshost rows (columns appear to be id, msid, runid, name, uuid,
state, version, service_ip, port, last_update, ...):

 179 | 95534596974 | 1720012401793 | localhost      | b34f493a-42c0-47a8-ada4-04be4cdd8c49 | Up   | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-07-03 13:13:47 | 0
 178 | 95536034244 | 1718828790629 | cs2.failiem.lv | 70420423-b362-4335-b083-8ad1342ce485 | Down | 4.13.1.0 | 10.10.10.12 | 9090 | 2024-06-19 20:39:19 | 1
 176 | 95530190206 | 1719663483676 | localhost      | 96a155b6-7041-48ff-9f20-268ea77c5098 | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2024-06-29 12:24:28 | 1
 175 | 95536505104 | 1719666507512 | localhost      | c8e6fefa-7464-4bb7-a379-5eafb55c666d | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-06-29 13:38:00 | 0
 174 | 95534962877 | 1682516323955 | localhost      | 45a057c6-6d50-41a9-bbad-cab370c01832 | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2024-06-15 08:36:06 | 1
 172 | 95529749065 | 1658756353180 | localhost      | 535277d3-33df-4b2a-9f1d-07f05084d473 | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2024-06-15 07:53:32 | 1
 170 | 95529797928 | 1603725530943 | localhost      | 5892611f-7af8-4686-8818-95ade086e6cf | Down | 4.13.1.0 | 10.10.10.13 | 9090 | 2020-11-03 04:05:40 | 1
 167 | 95534560846 | 1658756323907 | localhost      | e7ffd55a-77b7-4848-90de-5b5f10cc4500 | Down | 4.13.1.0 | 10.10.10.11 | 9090 | 2023-04-17 09:50:14 | 1
 163 | 95534279505 | 1582559260879 | cs1.failiem.lv | 8c254697-9783-11ea-900f-00163e4db64e | Down | 4.11.1.0 | 10.10.10.11 | 9090 | 2020-05-16 14:07:09 | 1
 161 | 95531601526 | 1582559325515 | cs3.failiem.lv | 8c25457e-9783-11ea-900f-00163e4db64e | Down | 4.11.1.0 | 10.10.10.13 | 9090 | 2020-05-16 14:07:21 | 1
Janis
On 2024-07-03 13:11, Nux wrote:
A shot in the dark, I haven't checked the log files properly.
For these hosts in the disconnected state, if you check them in the DB
cloud.host table (type="Routing", btw), which mgmt_server_id are they
reporting?
Then check the cloud.mshost table and see whether the management server
with that id is in there and marked as Up, etc.
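For example, something along these lines (off the top of my head, so
treat the column list as an assumption and adjust as needed):

-- hosts that are not Up and the management server they last reported to
SELECT id, status, type, mgmt_server_id FROM cloud.host WHERE type = 'Routing' AND status != 'Up';
-- is that management server present and Up?
SELECT id, msid, name, state, version, removed FROM cloud.mshost;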
HTH
On 2024-07-03 06:57, Janis Viklis | Files.fm wrote:
(sorry, some bad formatting in the previous email)
Does anyone have any ideas why this error occurs and how to debug it?
(248 is a host id)
Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null
Janis
On 2024-07-01 21:44, Janis Viklis | Files.fm wrote:
Hi,
Looking for help after 2 weeks: what could be the reason that, suddenly
after restarting the 4.13.1 management server, all 4 Xen (XCP-ng 8.1)
hosts of one Intel cluster disconnect and go into "Alert" state with the
error:
Monitor ComputeCapacityListener says there is an error in the
connect process for 248 due to null
I haven't been able to find the reason for 2 weeks. The other AMD
XenServer 6.5 cluster is working just fine.
Everything seems OK: the network is working, and I have restarted the
toolstack, both system VMs (SSVM, console VM), and one of the hosts, then
removed it and added it back.
Previously there were 3 management servers behind HAProxy with Galera
MariaDB; I left only one (I also tried an upgrade to 3.14.1, which didn't
help). I can manage the hosts via XenCenter. There are 5 storage pools and
3 secondary storages.
Thanks, hoping for some clues or directions, Janis.
Below is LOG output: