GitHub user tuanhoangth1603 created a discussion: Failed to create RBD storage 
pool after KVM agent upgrade from 4.20 to 4.22: "org.libvirt.LibvirtException: 
failed to create the RBD IoCTX"

### problem

After upgrading the KVM agent on a compute node from CloudStack 4.20 to 4.22, 
the agent fails to recreate or connect to the existing RBD storage pool. The 
agent log shows a LibvirtException during pool initialization that asks whether 
the RBD pool exists, although the pool does exist on the Ceph cluster. This 
prevents the host from fully reconnecting and handling VM operations (e.g., 
volume attach/detach).

The issue appears to be tied to changes in libvirt (8.0+) or the Ceph client 
libraries after the upgrade, which cause IoCTX creation to fail because of a 
mismatch in temporary secrets or cached state. Notably, a full reboot of the 
compute node resolves the issue immediately and allows the pool and secret to 
be recreated cleanly. However, a reboot means downtime for the VMs running on 
that node, which is unacceptable in production.

### versions

Environment

CloudStack version: Management server upgraded to 4.22.0 (from 4.20.0)
Agent version: KVM agent upgraded from 4.20.0 to 4.22.0 on compute nodes
Hypervisor: KVM 
Primary Storage: Ceph RBD (pool name: cloudstack-zone1; Ceph version: 14)
OS on compute nodes: Ubuntu 20.04


### The steps to reproduce the bug

1. Upgrade the management server to 4.22.
2. Upgrade the KVM agent to 4.22.
3. Check agent.log, which shows: Failed to create RBD storage pool: 
org.libvirt.LibvirtException: failed to create the RBD IoCTX. Does the pool 
'cloudstack-zone1' exist? 

I also ran the following commands on the Ceph cluster, but the error persists:
```
# ceph config set mon auth_expose_insecure_global_id_reclaim false

# ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false

# ceph config set mon auth_allow_insecure_global_id_reclaim false
```

**Expected Behavior**
The agent should successfully redefine the RBD storage pool using the existing 
Ceph configuration (monitors, secrets) without failure, allowing seamless host 
reconnection post-upgrade.

**Actual Behavior**
Agent logs show repeated failures to create the RBD IoCTX, followed by cleanup 
of the libvirt secret. Host status remains "Disconnected" or "Alert" in the UI 
until manual intervention. A full reboot of the compute node resolves the issue 
immediately, but that is not an acceptable fix.
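
For anyone triaging this without a reboot, the read-only check below may help narrow down whether the mismatch is on the libvirt side or the Ceph side. It is only a sketch: the helper name `extract_pool_name` is hypothetical, the sample line is copied from the error quoted above, and the script assumes `virsh` and the `ceph` CLI (with a usable ceph.conf and keyring) are present on the host. It does not change any state.

```shell
#!/bin/sh
# Sketch: pull the pool name out of the libvirt error, then list what libvirt
# and Ceph currently know about it. Read-only; nothing is modified.

# Sample line copied from the report; in practice, grep it from agent.log.
err="Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to create the RBD IoCTX. Does the pool 'cloudstack-zone1' exist?"

# extract_pool_name (hypothetical helper): print the name quoted after
# "Does the pool"; prints nothing if the line does not match.
extract_pool_name() {
    printf '%s\n' "$1" | sed -n "s/.*Does the pool '\([^']*\)' exist.*/\1/p"
}

pool=$(extract_pool_name "$err")
echo "pool named in error: $pool"

# libvirt side: storage pools and secrets currently defined on this host.
if command -v virsh >/dev/null 2>&1; then
    virsh pool-list --all
    virsh secret-list
else
    echo "virsh not found; skipping libvirt checks"
fi

# Ceph side: confirm the pool is visible from this host.
if command -v ceph >/dev/null 2>&1; then
    ceph --connect-timeout 10 osd pool ls | grep -x "$pool" \
        || echo "pool $pool not listed"
else
    echo "ceph CLI not found; skipping Ceph checks"
fi
```

If the pool is listed by Ceph but libvirt still holds a stale pool or secret definition, that would be consistent with the cached-state theory above, though it does not prove it.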



GitHub link: https://github.com/apache/cloudstack/discussions/12154
