GitHub user rishabhjain1997 added a comment to the discussion: Provisioning the 
control VM failed in the Kubernetes cluster

Hi @anirudh09041 and @Pearl1594. I'm a teammate of @anirudh09041 and here is 
what I've found so far. @Pearl1594 can you please help us with a solution based 
on the RCA below? 

The user-visible error is the async job failure:

```
$ grep "Complete async job-774" /var/log/cloudstack/management/management-server.log

2026-05-05 06:50:56,274 DEBUG ... Complete async job-774, jobStatus: FAILED, resultCode: 530,
result: ... "errortext":"Failed to setup Kubernetes cluster : test-cluster-23 is not in
usable state as the system is unable to access control node VMs of the cluster"
```

KubernetesClusterStartWorker logs show a similar error:

```
2026-05-05 06:50:56,035 ERROR [c.c.k.c.a.KubernetesClusterStartWorker]
Failed to setup Kubernetes cluster : test-cluster-23 is not in usable state
as the system is unable to access control node VMs of the cluster
```

So the real failure is downstream: CKS finishes provisioning the VMs and the
firewall/port-forwarding/load-balancer rules, then calls
KubernetesClusterUtil.isKubernetesClusterServerRunning(), which fails because
the API server never came up.

The kube-apiserver never came up because each VM was stuck in "Starting" for
50 minutes:

```
$ grep -E "i-2-120-VM.*StartCommand|i-2-120-VM.*Start completed" \
      /var/log/cloudstack/management/management-server.log

05:10:40,293 ... Sending { Cmd ... StartCommand ... id=120 ...   <- StartCommand sent
06:00:43,090 ... Start completed for VM ... i-2-120-VM           <- StartAnswer arrived
```
So cluster provisioning sat for ~100 min (50 min × 2 VMs sequentially) just 
waiting for the VMs to be marked Started.
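
As a sanity check on the numbers, the stall can be recomputed from the two
timestamps above (GNU `date` assumed; the time strings are copied from the
management-server.log excerpt):

```shell
start_cmd="05:10:40"   # StartCommand sent
start_ans="06:00:43"   # StartAnswer arrived
t0=$(date -u -d "1970-01-01 $start_cmd" +%s)
t1=$(date -u -d "1970-01-01 $start_ans" +%s)
stall=$((t1 - t0))
echo "stall per VM: ${stall}s (~$((stall / 60)) min); two control VMs sequentially: ~$((2 * stall / 60)) min"
```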

The 50-minute stall is the KVM agent repeatedly retrying patch.sh:

```
$ grep "patch.sh -n i-2-120-VM" /var/log/cloudstack/agent/agent.log

05:15:43 WARN ... Process [2551078] for command [/usr/share/cloudstack-common/scripts/
   vm/hypervisor/kvm/patch.sh -n i-2-120-VM -c template=domP name=test-cluster-23-control
   ...] timed out. Output is [].
05:20:43 WARN ... Process [2567946] ... timed out. Output is [].
05:25:43 WARN ... Process [2588320] ... timed out. Output is [].
05:30:43 WARN ... Process [2605401] ... timed out. Output is [].
05:35:43 WARN ... Process [2621993] ... timed out. Output is [].
05:40:43 WARN ... Process [2638725] ... timed out. Output is [].
05:45:43 WARN ... Process [2655334] ... timed out. Output is [].
05:50:43 WARN ... Process [2672092] ... timed out. Output is [].
05:55:43 WARN ... Process [2688808] ... timed out. Output is [].
06:00:43 WARN ... Process [2705703] ... timed out. Output is [].
```
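
The WARN lines above repeat on a fixed cadence; recovering it from the sample
timestamps (copied from agent.log) confirms the retry interval and ties it to
the 50-minute stall:

```shell
# Seconds between consecutive WARN timestamps; sort -u collapses to one value.
interval=$(printf '%s\n' 05:15:43 05:20:43 05:25:43 05:30:43 05:35:43 \
                         05:40:43 05:45:43 05:50:43 05:55:43 06:00:43 |
  awk -F: '{ t = $1*3600 + $2*60 + $3; if (NR > 1) print t - prev; prev = t }' |
  sort -u)
echo "patch.sh retried every ${interval}s -> 10 attempts x 5 min accounts for the 50-minute stall"
```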
patch.sh times out because qemu-guest-agent isn't responding in the guest:

```
[root@cloudstack ~]# virsh qemu-agent-command i-2-120-VM '{"execute":"guest-ping"}' --timeout 5
error: Guest agent is not responding: QEMU guest agent is not connected
```
Here is the custom template that we’re using:

```
mysql> SELECT name, format, url FROM vm_template WHERE id = 217;
+------------------+--------+-------------------------------------------+
| name             | format | url                                       |
+------------------+--------+-------------------------------------------+
| Ubuntu-CKS-Fresh | QCOW2  | http://10.96.32.32/ubuntu-cks-fresh.qcow2 |
+------------------+--------+-------------------------------------------+
```
That template doesn't seem to have qemu-guest-agent installed (or enabled at
boot), so patch.sh has nothing to talk to, hence the timeouts above.
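
If that's confirmed, one possible fix is to bake the agent into the template
image before re-registering it. The sketch below assumes libguestfs-tools is
available on a build host and that the qcow2 (filename taken from the template
URL above) is local; the exact steps would need to fit our image pipeline:

```shell
# Install and enable qemu-guest-agent inside the template image (offline).
# Assumes the image's package manager can reach a mirror during customization.
virt-customize -a ubuntu-cks-fresh.qcow2 \
    --install qemu-guest-agent \
    --run-command 'systemctl enable qemu-guest-agent'
```

The patched image would then be re-uploaded, registered as a new template, and
the cluster recreated from it.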

patch.sh is also what injects /var/lib/cloud/data/cmdline, so that metadata
never reaches the guest. Meanwhile, cloud-init's retry behaviour inside the
guest is governed by two global settings:

```
mysql> SELECT name, value FROM configuration
       WHERE name IN ('cloud.kubernetes.control.node.install.attempt.wait.duration',
                      'cloud.kubernetes.control.node.install.reattempt.count');
+-------------------------------------------------------------+-------+
| cloud.kubernetes.control.node.install.attempt.wait.duration |    15 |
| cloud.kubernetes.control.node.install.reattempt.count       |   100 |
+-------------------------------------------------------------+-------+
```
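
Those two values multiply out to the guest-side wait window:

```shell
attempts=100   # cloud.kubernetes.control.node.install.reattempt.count
wait_s=15      # cloud.kubernetes.control.node.install.attempt.wait.duration (s)
total=$((attempts * wait_s))
echo "cloud-init waits up to ${total}s (~$((total / 60)) min) for /mnt/k8sdisk/"
```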

Even after patch.sh finally gives up, cloud-init inside the guest can't make
progress: it sits in its own retry loop waiting for the binaries ISO at
/mnt/k8sdisk/ to appear. With the settings above, that loop runs for at most
100 × 15 s = 25 min.

CKS only attaches the binaries ISO at the very end of the workflow (06:50:55
below); by then cloud-init has been dead for over an hour.

```
$ grep -E "Attached binaries ISO|Detached Kubernetes binaries|isKubernetesClusterServerRunning" \
      /var/log/cloudstack/management/management-server.log

06:50:55,955 INFO  ... Attached binaries ISO for VM: ... i-2-120-VM
06:50:56,034 INFO  ... Attached binaries ISO for VM: ... i-2-122-VM
06:50:56,035 ERROR ... Failed to setup Kubernetes cluster ...
06:50:56,145 INFO  ... Detached Kubernetes binaries from VM: ... i-2-120-VM
06:50:56,271 INFO  ... Detached Kubernetes binaries from VM: ... i-2-122-VM
```
And each ISO attach is immediately followed by a detach, well under a second
later.

Summary:

**qemu-guest-agent is missing or disabled in the Ubuntu-CKS-Fresh template, so 
the host's patch.sh has nothing to talk to and times out for 50 minutes per VM 
(~100 min total for both). Without patch.sh, the cmdline metadata is never 
injected, the cks-init module inside cloud-init never runs, and kubeadm-init 
never executes. When CKS eventually calls isKubernetesClusterServerRunning(), 
the API server isn't up, the readiness check fails, and the cluster goes to 
Error.**

@Pearl1594 please let us know how to remediate this issue. Any help would be 
greatly appreciated!

GitHub link: 
https://github.com/apache/cloudstack/discussions/13056#discussioncomment-16832768
