ACS: 4.15.0.0
Hypervisor: KVM
OS: Ubuntu 20.04

In troubleshooting my last secondary storage issue I came across a repeating 
warning and error on the secondary storage VM and the compute host 
respectively. This is causing some issues and I'd like to know if anybody else 
has come across this or how to resolve it.

Exactly every 60 seconds I receive a lost connection info message followed by a 
reconnection attempt (which is always successful) which is then followed by a 
warning in the SSVM's case and an error in the compute host's case. The warning 
is an exception that is caught by java.lang.NumberFormatException and the error 
is a libvirt error. This also triggers an alert in the CloudStack UI. Here are 
the details.

First the logs from /var/log/cloud.log inside the SSVM:

2021-07-10 08:48:57,919 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) Lost 
connection to host: bllcloudlb01.harmonywave.cloud. Attempting reconnection 
while we still have 0 commands in progress.
2021-07-10 08:48:57,945 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
NioClient connection closed
2021-07-10 08:48:57,947 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
Reconnecting to host:bllcloudlb01.harmonywave.cloud
2021-07-10 08:48:57,948 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
Connecting to bllcloudlb01.harmonywave.cloud:8250
2021-07-10 08:48:57,954 INFO  [utils.nio.Link] (Agent-Handler-2:null) Conf file 
found: /usr/local/cloud/systemvm/conf/agent.properties
2021-07-10 08:48:58,664 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) SSL: 
Handshake done
2021-07-10 08:48:58,665 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
Connected to bllcloudlb01.harmonywave.cloud:8250
2021-07-10 08:48:58,729 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
Proccess agent startup answer, agent id = 0
2021-07-10 08:48:58,731 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) Set 
agent id 0
2021-07-10 08:48:58,736 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
Startup Response Received: agent id = 0
2021-07-10 08:48:59,128 INFO  [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-2:null) Determined host 
bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6
2021-07-10 08:48:59,132 INFO  [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-2:null) Determined host 
bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6
2021-07-10 08:49:02,799 INFO  [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-2:null) created a thread to start post upload server
2021-07-10 08:49:02,807 INFO  [handler.logging.LoggingHandler] 
(nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] REGISTERED
2021-07-10 08:49:02,808 INFO  [handler.logging.LoggingHandler] 
(nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] BIND(0.0.0.0/0.0.0.0:8210)
2021-07-10 08:49:02,808 INFO  [handler.logging.LoggingHandler] 
(nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] CLOSE()
2021-07-10 08:49:02,809 INFO  [handler.logging.LoggingHandler] 
(nioEventLoopGroup-310-1:null) [id: 0xab44bf4e] UNREGISTERED
2021-07-10 08:49:02,816 INFO  [storage.resource.NfsSecondaryStorageResource] 
(Thread-157:null) shutting down post upload server
2021-07-10 08:49:03,021 INFO  [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-1:null) Determined host 
bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6
2021-07-10 08:49:03,072 INFO  [storage.template.DownloadManagerImpl] 
(agentRequest-Handler-1:null) found 0 volumes[]
2021-07-10 08:49:03,127 INFO  [storage.resource.NfsSecondaryStorageResource] 
(agentRequest-Handler-5:null) Determined host 
bllcloudstg02-storage.harmonywave.cloud corresponds to IP 10.2.96.6
2021-07-10 08:49:03,257 INFO  [storage.template.DownloadManagerImpl] 
(agentRequest-Handler-5:null) found 3 
templates[/mnt/SecStorage/df2ca46d-aee0-302e-9ad0-2e94252341e4/template/tmpl/1/3/21b64e43-bd2e-447e-a488-14b67205e289.qcow2,
 
/mnt/SecStorage/df2ca46d-aee0-302e-9ad0-2e94252341e4/template/tmpl/1/10/746d45d3-426c-4d86-b82c-1d4cd5a25a8c.qcow2,
 
/mnt/SecStorage/df2ca46d-aee0-302e-9ad0-2e94252341e4/template/tmpl/2/230/50035329-d6d9-491d-9aef-e2756bdd02ec.iso]
2021-07-10 08:49:03,269 WARN  [storage.template.TemplateLocation] 
(agentRequest-Handler-5:null) Cleaning up inconsistent information for QCOW2
2021-07-10 08:49:03,271 WARN  [cloud.agent.Agent] (agentRequest-Handler-5:null) 
Caught:
java.lang.NumberFormatException: For input string: ""
    at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Long.parseLong(Long.java:702)
    at java.base/java.lang.Long.parseLong(Long.java:817)
    at 
com.cloud.storage.template.TemplateLocation.getTemplateInfo(TemplateLocation.java:180)
    at 
org.apache.cloudstack.storage.template.DownloadManagerImpl.gatherTemplateInfo(DownloadManagerImpl.java:905)
    at 
org.apache.cloudstack.storage.resource.NfsSecondaryStorageResource.execute(NfsSecondaryStorageResource.java:2145)
    at 
org.apache.cloudstack.storage.resource.NfsSecondaryStorageResource.executeRequest(NfsSecondaryStorageResource.java:298)
    at 
com.cloud.storage.resource.PremiumSecondaryStorageResource.defaultAction(PremiumSecondaryStorageResource.java:64)
    at 
com.cloud.storage.resource.PremiumSecondaryStorageResource.executeRequest(PremiumSecondaryStorageResource.java:60)
    at com.cloud.agent.Agent.processRequest(Agent.java:661)
    at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1079)
    at com.cloud.utils.nio.Task.call(Task.java:83)
    at com.cloud.utils.nio.Task.call(Task.java:29)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2021-07-10 08:49:03,316 INFO  [cloud.agent.Agent] (agentRequest-Handler-2:null) 
Processing agent ready command, agent id = 16
2021-07-10 08:49:03,317 INFO  [cloud.agent.Agent] (agentRequest-Handler-2:null) 
Set agent id 16
2021-07-10 08:49:03,318 INFO  [cloud.agent.Agent] (agentRequest-Handler-2:null) 
Ready command is processed for agent id = 16
2021-07-10 08:49:03,482 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
Processing agent ready command, agent id = 16
2021-07-10 08:49:03,482 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
Set agent id 16
2021-07-10 08:49:03,506 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
Processed new management server list: bllcloudlb01.harmonywave.cloud@static
2021-07-10 08:49:03,506 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
Ready command is processed for agent id = 16
2021-07-10 08:49:03,670 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
Connected to the host: bllcloudlb01.harmonywave.cloud

I also see an alert in the CloudStack UI that seems to correspond to the 
warning:

Name: ALERT.COMPUTE.HOST
ID: ea6acb78-3635-4e7f-acb9-ea061154f112
Type: 7
Date: 10 Jul 2021 08:48:57
Description: Host in ALERT state, name: s-41-VM (id:16), availability zone: 
BL1, pod: POD1

Secondly, the logs from /var/log/cloudstack/agent/agent.log:

2021-07-10 09:04:00,735 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
(logid:29f535a2) Lost connection to host: bllcloudlb01.harmonywave.cloud. 
Attempting reconnection while we still have 0 commands in progress.
2021-07-10 09:04:00,739 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
(logid:29f535a2) NioClient connection closed
2021-07-10 09:04:00,739 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
(logid:29f535a2) Reconnecting to host:bllcloudlb01.harmonywave.cloud
2021-07-10 09:04:00,739 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
(logid:29f535a2) Connecting to bllcloudlb01.harmonywave.cloud:8250
2021-07-10 09:04:00,743 INFO  [utils.nio.Link] (Agent-Handler-2:null) 
(logid:29f535a2) Conf file found: /etc/cloudstack/agent/agent.properties
2021-07-10 09:04:01,249 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
(logid:29f535a2) SSL: Handshake done
2021-07-10 09:04:01,250 INFO  [utils.nio.NioClient] (Agent-Handler-2:null) 
(logid:29f535a2) Connected to bllcloudlb01.harmonywave.cloud:8250
2021-07-10 09:04:01,256 INFO  [utils.linux.KVMHostInfo] (Agent-Handler-1:null) 
(logid:29f535a2) Could not read cpuinfo_max_freq, falling back on libvirt
2021-07-10 09:04:01,360 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(Agent-Handler-1:null) (logid:29f535a2) Attempting to create storage pool 
6b8fc73b-c066-45ad-bc88-7bb4018bf32f (Filesystem) in libvirt
2021-07-10 09:04:01,361 ERROR [kvm.resource.LibvirtConnection] 
(Agent-Handler-1:null) (logid:29f535a2) Connection with libvirtd is broken: 
invalid connection pointer in virConnectGetVersion
2021-07-10 09:04:01,364 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(Agent-Handler-1:null) (logid:29f535a2) Found existing defined storage pool 
6b8fc73b-c066-45ad-bc88-7bb4018bf32f, using it.
2021-07-10 09:04:01,365 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(Agent-Handler-1:null) (logid:29f535a2) Trying to fetch storage pool 
6b8fc73b-c066-45ad-bc88-7bb4018bf32f from libvirt
2021-07-10 09:04:01,396 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
(logid:29f535a2) Proccess agent startup answer, agent id = 0
2021-07-10 09:04:01,396 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
(logid:29f535a2) Set agent id 0
2021-07-10 09:04:01,397 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
(logid:29f535a2) Startup Response Received: agent id = 0
2021-07-10 09:04:01,749 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(agentRequest-Handler-3:null) (logid:629af4a9) Attempting to create storage 
pool 341bae0e-fd14-39f6-80a7-e964b5afb28c (NetworkFilesystem) in libvirt
2021-07-10 09:04:01,752 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(agentRequest-Handler-3:null) (logid:629af4a9) Found existing defined storage 
pool 341bae0e-fd14-39f6-80a7-e964b5afb28c, using it.
2021-07-10 09:04:01,752 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(agentRequest-Handler-3:null) (logid:629af4a9) Trying to fetch storage pool 
341bae0e-fd14-39f6-80a7-e964b5afb28c from libvirt
2021-07-10 09:04:02,446 INFO  [cloud.agent.Agent] (agentRequest-Handler-3:null) 
(logid:629af4a9) Processing agent ready command, agent id = 13
2021-07-10 09:04:02,447 INFO  [cloud.agent.Agent] (agentRequest-Handler-3:null) 
(logid:629af4a9) Set agent id 13
2021-07-10 09:04:02,447 INFO  [cloud.agent.Agent] (agentRequest-Handler-3:null) 
(logid:629af4a9) Ready command is processed for agent id = 13
2021-07-10 09:04:02,569 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
(logid:629af4a9) Processing agent ready command, agent id = 13
2021-07-10 09:04:02,570 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
(logid:629af4a9) Set agent id 13
2021-07-10 09:04:02,571 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
(logid:629af4a9) Processed new management server list: 
bllcloudlb01.harmonywave.cloud@static
2021-07-10 09:04:02,571 INFO  [cloud.agent.Agent] (agentRequest-Handler-4:null) 
(logid:629af4a9) Ready command is processed for agent id = 13
2021-07-10 09:04:06,251 INFO  [cloud.agent.Agent] (Agent-Handler-2:null) 
(logid:29f535a2) Connected to the host: bllcloudlb01.harmonywave.cloud

It also throws an alert in the UI:

Name: ALERT.COMPUTE.HOST
ID: 7c83e1cc-aefa-4698-90c8-7c616fb3628a
Type: 7
Date: 10 Jul 2021 09:04:00
Description: Host disconnected, name: bllcloudcmp02 (id:13), availability zone: 
BL1, pod: POD1

All of the logs above repeat every 60 seconds. When I was trying to upload my 
ISO the upload process would fail if it ever coincided with the warning and/or 
error message. So if the ISO upload took less than 1 minute and was not 
occurring at the top of the minute then it would successfully upload. However, 
if it was trying to upload when this warning/error is thrown the progress bar 
would just stop and a network error message was thrown in the UI.

I should note that it is losing connection to my load balancer which is in 
front of two management servers. Could this be a setting at the load balancer 
level that is closing the connections?

I found a couple links related to the libvirt error [1] [2] but for the most 
part they seem to say that the message can be ignored or the discussions are 
not actually centered around the libvirt error. Libvirt is in fact running on 
the compute host and the systemvm's are running on top of it. They both show a 
"state" of "running" and the "agent status" of "up" in the UI. Does anybody 
know why the SSVM and the compute host would constantly disconnect and 
reconnect? I'm happy to provide any additional information. Any help is 
appreciated.

[1] (https://github.com/apache/cloudstack/issues/4700)
[2] (https://github.com/apache/cloudstack/issues/3509)

-- 
Thanks,
Joshua Schaeffer

Reply via email to