ACS: 4.15.0.0 Hypervisor: KVM OS: Ubuntu 20.04 In troubleshooting my last secondary storage issue I came across a repeating warning and error on the secondary storage VM and the compute host respectively. This is causing some issues and I'd like to know if anybody else has come across this or how to resolve it.
Exactly every 60 seconds I receive a lost connection info message followed by a reconnection attempt (which is always successful) which is then followed by a warning in the SSVM's case and an error in the compute host's case. The warning is an exception that is caught by java.lang.NumberFormatException and the error is a libvirt error. This also triggers an alert in the CloudStack UI. Here are the details. First the logs from /var/log/cloud.log inside the SSVM: 2021-07-10 08:48:57,919 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Lost connection to host: bllcloudlb01.harmonywave.cloud. Attempting reconnection while we still have 0 commands in progress. 2021-07-10 08:48:57,945 INFO [utils.nio.NioClient] (Agent-Handler-2:null) NioClient connection closed 2021-07-10 08:48:57,947 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Reconnecting to host:bllcloudlb01.harmonywave.cloud 2021-07-10 08:48:57,948 INFO [utils.nio.NioClient] (Agent-Handler-2:null) Connecting to bllcloudlb01.harmonywave.cloud:8250 2021-07-10 08:48:57,954 INFO [utils.nio.Link] (Agent-Handler-2:null) Conf file found: /usr/local/cloud/systemvm/conf/agent.properties 2021-07-10 08:48:58,664 INFO [utils.nio.NioClient] (Agent-Handler-2:null) SSL: Handshake done 2021-07-10 08:48:58,665 INFO [utils.nio.NioClient] (Agent-Handler-2:null) Connected to bllcloudlb01.harmonywave.cloud:8250 2021-07-10 08:48:58,729 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Proccess agent startup answer, agent id = 0 2021-07-10 08:48:58,731 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Set agent id 0 2021-07-10 08:48:58,736 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Startup Response Received: agent id = 0 2021-07-10 08:48:59,128 INFO [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-2:null) Determined host bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6 2021-07-10 08:48:59,132 INFO [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-2:null) Determined host bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6 2021-07-10 08:49:02,799 INFO [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-2:null) created a thread to start post upload server 2021-07-10 08:49:02,807 INFO [handler.logging.LoggingHandler] (nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] REGISTERED 2021-07-10 08:49:02,808 INFO [handler.logging.LoggingHandler] (nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] BIND(0.0.0.0/0.0.0.0:8210) 2021-07-10 08:49:02,808 INFO [handler.logging.LoggingHandler] (nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] CLOSE() 2021-07-10 08:49:02,809 INFO [handler.logging.LoggingHandler] (nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] UNREGISTERED 2021-07-10 08:49:02,816 INFO [storage.resource.NfsSecondaryStorageResource] (Thread-157:null) shutting down post upload server 2021-07-10 08:49:03,021 INFO [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-1:null) Determined host bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6 2021-07-10 08:49:03,072 INFO [storage.template.DownloadManagerImpl] (agentRequest-Handler-1:null) found 0 volumes[] 2021-07-10 08:49:03,127 INFO [storage.resource.NfsSecondaryStorageResource] (agentRequest-Handler-5:null) Determined host bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6 2021-07-10 08:49:03,257 INFO [storage.template.DownloadManagerImpl] (agentRequest-Handler-5:null) found 3 templates[/mnt/SecStorage/df2ca46d-aee0-302e-9ad0-2e94252341e4/template/tmpl/1/3/21b64e43-bd2e-447e-a488-14b67205e289.qcow2, /mnt/SecStorage/df2ca46d-aee0-302e-9ad0-2e94252341e4/template/tmpl/1/10/746d45d3-426c-4d86-b82c-1d4cd5a25a8c.qcow2, /mnt/SecStorage/df2ca46d-aee0-302e-9ad0-2e94252341e4/template/tmpl/2/230/50035329-d6d9-491d-9aef-e2756bdd02ec.iso] 2021-07-10 08:49:03,269 WARN [storage.template.TemplateLocation] (agentRequest-Handler-5:null) Cleaning up inconsistent information for QCOW2 2021-07-10 08:49:03,271 WARN [cloud.agent.Agent] (agentRequest-Handler-5:null) Caught: java.lang.NumberFormatException: For input string: "" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.base/java.lang.Long.parseLong(Long.java:702) at java.base/java.lang.Long.parseLong(Long.java:817) at com.cloud.storage.template.TemplateLocation.getTemplateInfo(TemplateLocation.java:180) at org.apache.cloudstack.storage.template.DownloadManagerImpl.gatherTemplateInfo(DownloadManagerImpl.java:905) at org.apache.cloudstack.storage.resource.NfsSecondaryStorageResource.execute(NfsSecondaryStorageResource.java:2145) at org.apache.cloudstack.storage.resource.NfsSecondaryStorageResource.executeRequest(NfsSecondaryStorageResource.java:298) at com.cloud.storage.resource.PremiumSecondaryStorageResource.defaultAction(PremiumSecondaryStorageResource.java:64) at com.cloud.storage.resource.PremiumSecondaryStorageResource.executeRequest(PremiumSecondaryStorageResource.java:60) at com.cloud.agent.Agent.processRequest(Agent.java:661) at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1079) at com.cloud.utils.nio.Task.call(Task.java:83) at com.cloud.utils.nio.Task.call(Task.java:29) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) 2021-07-10 08:49:03,316 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing agent ready command, agent id = 16 2021-07-10 08:49:03,317 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) Set agent id 16 2021-07-10 08:49:03,318 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) Ready command is processed for agent id = 16 2021-07-10 08:49:03,482 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) Processing agent ready command, agent id = 16 2021-07-10 08:49:03,482 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) Set agent id 16 2021-07-10 08:49:03,506 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) Processed new management server list: bllcloudlb01.harmonywave.cloud@static 2021-07-10 08:49:03,506 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) Ready command is processed for agent id = 16 2021-07-10 08:49:03,670 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Connected to the host: bllcloudlb01.harmonywave.cloud I also see an alert in the CloudStack UI that seems to correspond to the warning: Name: ALERT.COMPUTE.HOST ID: ea6acb78-3635-4e7f-acb9-ea061154f112 Type: 7 Date: 10 Jul 2021 08:48:57 Description: Host in ALERT state, name: s-41-VM (id:16), availability zone: BL1, pod: POD1 Secondly, the logs from /var/log/cloudstack/agent/agent.log: 2021-07-10 09:04:00,735 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:29f535a2) Lost connection to host: bllcloudlb01.harmonywave.cloud. Attempting reconnection while we still have 0 commands in progress. 2021-07-10 09:04:00,739 INFO [utils.nio.NioClient] (Agent-Handler-2:null) (logid:29f535a2) NioClient connection closed 2021-07-10 09:04:00,739 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:29f535a2) Reconnecting to host:bllcloudlb01.harmonywave.cloud 2021-07-10 09:04:00,739 INFO [utils.nio.NioClient] (Agent-Handler-2:null) (logid:29f535a2) Connecting to bllcloudlb01.harmonywave.cloud:8250 2021-07-10 09:04:00,743 INFO [utils.nio.Link] (Agent-Handler-2:null) (logid:29f535a2) Conf file found: /etc/cloudstack/agent/agent.properties 2021-07-10 09:04:01,249 INFO [utils.nio.NioClient] (Agent-Handler-2:null) (logid:29f535a2) SSL: Handshake done 2021-07-10 09:04:01,250 INFO [utils.nio.NioClient] (Agent-Handler-2:null) (logid:29f535a2) Connected to bllcloudlb01.harmonywave.cloud:8250 2021-07-10 09:04:01,256 INFO [utils.linux.KVMHostInfo] (Agent-Handler-1:null) (logid:29f535a2) Could not read cpuinfo_max_freq, falling back on libvirt 2021-07-10 09:04:01,360 INFO [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:29f535a2) Attempting to create storage pool 6b8fc73b-c066-45ad-bc88-7bb4018bf32f (Filesystem) in libvirt 2021-07-10 09:04:01,361 ERROR [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:29f535a2) Connection with libvirtd is broken: invalid connection pointer in virConnectGetVersion 2021-07-10 09:04:01,364 INFO [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:29f535a2) Found existing defined storage pool 6b8fc73b-c066-45ad-bc88-7bb4018bf32f, using it. 2021-07-10 09:04:01,365 INFO [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:29f535a2) Trying to fetch storage pool 6b8fc73b-c066-45ad-bc88-7bb4018bf32f from libvirt 2021-07-10 09:04:01,396 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:29f535a2) Proccess agent startup answer, agent id = 0 2021-07-10 09:04:01,396 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:29f535a2) Set agent id 0 2021-07-10 09:04:01,397 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:29f535a2) Startup Response Received: agent id = 0 2021-07-10 09:04:01,749 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:629af4a9) Attempting to create storage pool 341bae0e-fd14-39f6-80a7-e964b5afb28c (NetworkFilesystem) in libvirt 2021-07-10 09:04:01,752 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:629af4a9) Found existing defined storage pool 341bae0e-fd14-39f6-80a7-e964b5afb28c, using it. 2021-07-10 09:04:01,752 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:629af4a9) Trying to fetch storage pool 341bae0e-fd14-39f6-80a7-e964b5afb28c from libvirt 2021-07-10 09:04:02,446 INFO [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:629af4a9) Processing agent ready command, agent id = 13 2021-07-10 09:04:02,447 INFO [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:629af4a9) Set agent id 13 2021-07-10 09:04:02,447 INFO [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:629af4a9) Ready command is processed for agent id = 13 2021-07-10 09:04:02,569 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:629af4a9) Processing agent ready command, agent id = 13 2021-07-10 09:04:02,570 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:629af4a9) Set agent id 13 2021-07-10 09:04:02,571 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:629af4a9) Processed new management server list: bllcloudlb01.harmonywave.cloud@static 2021-07-10 09:04:02,571 INFO [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:629af4a9) Ready command is processed for agent id = 13 2021-07-10 09:04:06,251 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:29f535a2) Connected to the host: bllcloudlb01.harmonywave.cloud It also throws an alert in the UI: Name: ALERT.COMPUTE.HOST ID: 7c83e1cc-aefa-4698-90c8-7c616fb3628a Type: 7 Date: 10 Jul 2021 09:04:00 Description: Host disconnected, name: bllcloudcmp02 (id:13), availability zone: BL1, pod: POD1 All of the logs above repeat every 60 seconds. When I was trying to upload my ISO the upload process would fail if it ever coincided with the warning and/or error message. So if the ISO upload took less than 1 minute and was not occurring at the top of the minute then it would successfully upload. However, if it was trying to upload when this warning/error is thrown the progress bar would just stop and a network error message was thrown in the UI. I should note that it is losing connection to my load balancer which is in front of two management servers. Could this be a setting at the load balancer level that is closing the connections? I found a couple links related to the libvirt error [1] [2] but for the most part they seem to say that the message can be ignored or the discussions are not actually centered around the libvirt error. Libvirt is in fact running on the compute host and the systemvm's are running on top of it. They both show a "state" of "running" and the "agent status" of "up" in the UI. Does anybody know why the SSVM and the compute host would constantly disconnect and reconnect? I'm happy to provide any additional information. Any help is appreciated. [1] (https://github.com/apache/cloudstack/issues/4700) [2] (https://github.com/apache/cloudstack/issues/3509) -- Thanks, Joshua Schaeffer