GitHub user tatay188 created a discussion: CS4.22 GPU Server Unable to create a VM
### problem

Unable to create VMs; the request eventually fails with a 504 error. The server is recognized as GPU enabled. I am using a regular Ubuntu 22.04 template and a service offering I created; it is the same as for all our other servers but has a GPU, with HA enabled, and I checked the Video option just to test. The agent log shows:

```
2025-12-09 20:33:14,351 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-5:[]) (logid:3e3d3f80) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
2025-12-09 20:34:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:35:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:36:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:37:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:38:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:39:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:40:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:41:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:42:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:05,155 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-3:[]) (logid:7f598e7f) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
```

On the server there are no errors or disconnections. The storage ID for this VM shows:

```
ID 30aea531-8b82-478f-85db-e9991bf193f5
```

I am able to reach the primary storage from the GPU host.
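For reference, the "Trying to fetch storage pool" step from the agent log can be exercised directly against libvirt on the GPU host. This is a minimal sketch, not taken from the agent code; it assumes the libvirt-python bindings are installed and uses the pool UUID from the log above. If the refresh call blocks against the Ceph backend, that would line up with the agent missing its ping window.

```python
# Minimal sketch (assumption: libvirt-python is installed on the GPU host).
# The UUID is the storage pool the agent log keeps trying to fetch.
import libvirt

POOL_UUID = "e76f8956-1a81-3e97-aff6-8dc3f199a48a"

conn = libvirt.open("qemu:///system")
try:
    pool = conn.storagePoolLookupByUUIDString(POOL_UUID)
    print("pool:", pool.name(), "active:", pool.isActive())
    # refresh() re-scans the pool; if the Ceph/RBD backend is slow or hung,
    # this call blocks here instead of returning promptly.
    pool.refresh(0)
    # info() returns [state, capacity, allocation, available] in bytes
    print("info:", pool.info())
finally:
    conn.close()
```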
Apart from this error, after 45 minutes the system keeps spinning on creating the VM ("Launch Instance in progress"):

<img width="1182" height="163" alt="Image" src="https://github.com/user-attachments/assets/8d2a9c9d-ca5c-47a0-9789-1b6353045639" />

Logs from the management server:

```
2025-12-09 20:46:49,858 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) No inactive management server node found
2025-12-09 20:46:49,858 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
2025-12-09 20:46:51,322 DEBUG [o.a.c.h.H.HAManagerBgPollTask] (BackgroundTaskPollManager-4:[ctx-1aad6e7c]) (logid:829826e7) HA health check task is running...
2025-12-09 20:46:51,358 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) No inactive management server node found
2025-12-09 20:46:51,358 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
2025-12-09 20:46:51,678 INFO [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Found the following agents behind on ping: [75]
2025-12-09 20:46:51,683 DEBUG [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Ping timeout for agent Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}, do investigation
2025-12-09 20:46:51,685 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Investigating why host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} has disconnected with event
2025-12-09 20:46:51,687 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Checking if agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) is alive
2025-12-09 20:46:51,689 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Wait time setting on com.cloud.agent.api.CheckHealthCommand is 50 seconds
2025-12-09 20:46:51,690 DEBUG [c.c.a.m.ClusteredAgentAttache] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Routed from 250977680725600
2025-12-09 20:46:51,690 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Sending { Cmd , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-12-09 20:46:51,733 DEBUG [c.c.a.t.Request] (AgentManager-Handler-11:[]) (logid:) Seq 75-1207246175112003675: Processing: { Ans: , MgmtId: 250977680725600, via: 75, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckHealthAnswer":{"result":"true","details":"resource is alive","wait":"0","bypassHostMaintenance":"false"}}] }
2025-12-09 20:46:51,734 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Received: { Ans: , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 10, { CheckHealthAnswer } }
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is alive
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) responded to checkHealthCommand, reporting that agent is Up
2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) The agent from host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up
2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping]
2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START=== SOMEIPADDRESS -- GET jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json&
2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping
2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls:
```
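Since the UI just keeps polling `queryAsyncJobResult` (the calls visible in the log above), the async job result can also be pulled straight from the API to see what is behind the generic error. This is a minimal sketch, assuming admin API/secret keys; the endpoint and keys are placeholders, and the job id is the one from the log:

```python
# Minimal sketch: query the async job result for the stuck deploy job via the
# CloudStack API. ENDPOINT, API_KEY and SECRET are placeholders; the jobid is
# the one the UI is polling in the management-server log.
import base64, hashlib, hmac, json, urllib.parse, urllib.request

ENDPOINT = "http://MGMT_SERVER:8080/client/api"  # placeholder
API_KEY = "YOUR_API_KEY"                         # placeholder
SECRET = "YOUR_SECRET_KEY"                       # placeholder

params = {
    "command": "queryAsyncJobResult",
    "jobid": "22d89170-20e6-4151-a809-552938d734e9",
    "response": "json",
    "apikey": API_KEY,
}

# Standard CloudStack request signing: sort the parameters, URL-encode the values,
# lowercase the whole string, HMAC-SHA1 with the secret key, then base64-encode.
query = "&".join(f"{k}={urllib.parse.quote(str(v), safe='')}" for k, v in sorted(params.items()))
sig = base64.b64encode(
    hmac.new(SECRET.encode(), query.lower().encode(), hashlib.sha1).digest()
).decode()

url = f"{ENDPOINT}?{query}&signature={urllib.parse.quote(sig, safe='')}"
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# jobresult should carry the underlying exception, not just the generic message.
print(json.dumps(result.get("queryasyncjobresultresponse", result), indent=2))
```

The `jobresult` field of that response should show the actual failure rather than the generic orchestration error the UI reports.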
{"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up 2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running 2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping] 2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START=== SOMEIPADDRESS -- GET jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json& 2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping 2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls: ``` I noticed the Isolated network the Virtual router is on another server, I do not have any server tags at the moment. final error: ``` Error Unable to orchestrate the start of VM instance {"instanceName":"i-2-223-VM","uuid":"a12748a3-7519-4732-8445-05dfa96046b7"}. ``` ### versions The versions of ACS, hypervisors, storage, network etc.. ACS 4.22.0 KVM for the GPU and other hosts CEPH RDB primary storage NFS secondary storage VXLAN running same as the other servers. Ubuntu 22.04 as a Hypervisor Ubuntu 22.04 as template - Same template used for other VMs. The GPU is recognized by the system with no problems. ### The steps to reproduce the bug 1. Using a GPU service offering with HA and GPU Display true - we do have disabled the OOB management. 2. Add a simple VM isolated network, using a GPU Offering 1GPU. 3. Everything starts ok, VR is created on a regular CPU server -automatically, Storage is created, Ip addresses allocated 4. Instance creation fails after 35+ minutes. one more screen: <img width="1708" height="465" alt="Image" src="https://github.com/user-attachments/assets/767592b4-a0e8-466f-bf7d-030d307c3287" /> Please Guide us on the proper setting. Thank you ### What to do about it? _No response_ GitHub link: https://github.com/apache/cloudstack/discussions/12222 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
