GitHub user tatay188 created a discussion: CS4.22 GPU Server Unable to create a VM
### problem

Unable to create VMs; the request eventually fails with a 504 error. The server is recognized as GPU enabled. I am using a regular Ubuntu 22.04 template and a service offering I created; it is the same as for all our other servers but has a GPU, with HA enabled, and I checked the Video option just to test. The agent log shows:

```
2025-12-09 20:33:14,351 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-5:[]) (logid:3e3d3f80) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
2025-12-09 20:34:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:35:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:36:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:37:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:38:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:39:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:40:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:41:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:42:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:05,155 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-3:[]) (logid:7f598e7f) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
```

On the server there are no errors or disconnections. The storage ID for this VM shows:

```
ID 30aea531-8b82-478f-85db-e9991bf193f5
```

I am able to reach the primary storage from the GPU host.
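For reference, the "Trying to fetch storage pool" step from the agent log can be exercised directly against libvirt on the GPU host. This is a minimal sketch, not taken from the agent code; it assumes the libvirt-python bindings are installed and uses the pool UUID from the log above. If the refresh call blocks against the Ceph backend, that would line up with the agent missing its ping window.

```python
# Minimal sketch (assumption: libvirt-python is installed on the GPU host).
# The UUID is the storage pool the agent log keeps trying to fetch.
import libvirt

POOL_UUID = "e76f8956-1a81-3e97-aff6-8dc3f199a48a"

conn = libvirt.open("qemu:///system")
try:
    pool = conn.storagePoolLookupByUUIDString(POOL_UUID)
    print("pool:", pool.name(), "active:", pool.isActive())
    # refresh() re-scans the pool; if the Ceph/RBD backend is slow or hung,
    # this call blocks here instead of returning promptly.
    pool.refresh(0)
    # info() returns [state, capacity, allocation, available] in bytes
    print("info:", pool.info())
finally:
    conn.close()
```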
Apart from this error, after 45 minutes the system keeps spinning on creating the VM ("Launch Instance in progress"):

<img width="1182" height="163" alt="Image" src="https://github.com/user-attachments/assets/8d2a9c9d-ca5c-47a0-9789-1b6353045639" />

Logs from the management server:

```
2025-12-09 20:46:49,858 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) No inactive management server node found
2025-12-09 20:46:49,858 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
2025-12-09 20:46:51,322 DEBUG [o.a.c.h.H.HAManagerBgPollTask] (BackgroundTaskPollManager-4:[ctx-1aad6e7c]) (logid:829826e7) HA health check task is running...
2025-12-09 20:46:51,358 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) No inactive management server node found
2025-12-09 20:46:51,358 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
2025-12-09 20:46:51,678 INFO [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Found the following agents behind on ping: [75]
2025-12-09 20:46:51,683 DEBUG [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Ping timeout for agent Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}, do investigation
2025-12-09 20:46:51,685 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Investigating why host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} has disconnected with event
2025-12-09 20:46:51,687 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Checking if agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) is alive
2025-12-09 20:46:51,689 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Wait time setting on com.cloud.agent.api.CheckHealthCommand is 50 seconds
2025-12-09 20:46:51,690 DEBUG [c.c.a.m.ClusteredAgentAttache] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Routed from 250977680725600
2025-12-09 20:46:51,690 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Sending { Cmd , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-12-09 20:46:51,733 DEBUG [c.c.a.t.Request] (AgentManager-Handler-11:[]) (logid:) Seq 75-1207246175112003675: Processing: { Ans: , MgmtId: 250977680725600, via: 75, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckHealthAnswer":{"result":"true","details":"resource is alive","wait":"0","bypassHostMaintenance":"false"}}] }
2025-12-09 20:46:51,734 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Received: { Ans: , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 10, { CheckHealthAnswer } }
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is alive
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) responded to checkHealthCommand, reporting that agent is Up
2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) The agent from host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up
2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping]
2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START=== SOMEIPADDRESS -- GET jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json&
2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping
2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls:
```
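Since the UI just keeps polling `queryAsyncJobResult` (the calls visible in the log above), the async job result can also be pulled straight from the API to see what is behind the generic error. This is a minimal sketch, assuming admin API/secret keys; the endpoint and keys are placeholders, and the job id is the one from the log:

```python
# Minimal sketch: query the async job result for the stuck deploy job via the
# CloudStack API. ENDPOINT, API_KEY and SECRET are placeholders; the jobid is
# the one the UI is polling in the management-server log.
import base64, hashlib, hmac, json, urllib.parse, urllib.request

ENDPOINT = "http://MGMT_SERVER:8080/client/api"  # placeholder
API_KEY = "YOUR_API_KEY"                         # placeholder
SECRET = "YOUR_SECRET_KEY"                       # placeholder

params = {
    "command": "queryAsyncJobResult",
    "jobid": "22d89170-20e6-4151-a809-552938d734e9",
    "response": "json",
    "apikey": API_KEY,
}

# Standard CloudStack request signing: sort the parameters, URL-encode the values,
# lowercase the whole string, HMAC-SHA1 with the secret key, then base64-encode.
query = "&".join(f"{k}={urllib.parse.quote(str(v), safe='')}" for k, v in sorted(params.items()))
sig = base64.b64encode(
    hmac.new(SECRET.encode(), query.lower().encode(), hashlib.sha1).digest()
).decode()

url = f"{ENDPOINT}?{query}&signature={urllib.parse.quote(sig, safe='')}"
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# jobresult should carry the underlying exception, not just the generic message.
print(json.dumps(result.get("queryasyncjobresultresponse", result), indent=2))
```

The `jobresult` field of that response should show the actual failure rather than the generic orchestration error the UI reports.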
{"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up 2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running 2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping] 2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START=== SOMEIPADDRESS -- GET jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json& 2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping 2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls: ``` I noticed the Isolated network the Virtual router is on another server, I do not have any server tags at the moment. final error: ``` Error Unable to orchestrate the start of VM instance {"instanceName":"i-2-223-VM","uuid":"a12748a3-7519-4732-8445-05dfa96046b7"}. ``` ### versions The versions of ACS, hypervisors, storage, network etc.. ACS 4.22.0 KVM for the GPU and other hosts CEPH RDB primary storage NFS secondary storage VXLAN running same as the other servers. Ubuntu 22.04 as a Hypervisor Ubuntu 22.04 as template - Same template used for other VMs. The GPU is recognized by the system with no problems. ### The steps to reproduce the bug 1. Using a GPU service offering with HA and GPU Display true - we do have disabled the OOB management. 2. Add a simple VM isolated network, using a GPU Offering 1GPU. 3. Everything starts ok, VR is created on a regular CPU server -automatically, Storage is created, Ip addresses allocated 4. Instance creation fails after 35+ minutes. one more screen: <img width="1708" height="465" alt="Image" src="https://github.com/user-attachments/assets/767592b4-a0e8-466f-bf7d-030d307c3287" /> Please Guide us on the proper setting. Thank you ### What to do about it? _No response_ GitHub link: https://github.com/apache/cloudstack/discussions/12222 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
