GitHub user btzq edited a comment on the discussion: HA not working in cloudstack 4.19
Hey @Aashiqps, we are facing the same situation as well. Great to have found another Linstor user with a disaggregated architecture.

At this point we have managed to get Volume Snapshots working fine, after the latest fixes from the LINBIT side. But we still have the failover issues, where we can't seem to get all of the Virtual Routers to start up successfully during a node failure (we simulated one by pulling the power from the server). If the Virtual Router can't start up, then the VMs in that network can't start up either. We also tested this using NFS, just to isolate the network issue, and VM HA using NFS works just fine.

When we go through the logs, we can't identify what the problem is. Our latest finding is that, according to the logs, the VR was successfully migrated to the new host and its status transitioned to Running. However, the ACS `HighAvailabilityManager` triggered a stop/reboot action on the router and did not take any action to start it afterwards.

```
2024-07-16 17:31:18,853 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245) (logid:1dc1e938) Run VM work job: com.cloud.vm.VmWorkStop for VM 54572, job origin: 385919
2024-07-16 17:31:18,854 DEBUG [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245 ctx-10b406cf) (logid:1dc1e938) Execute VM work job: com.cloud.vm.VmWorkStop{"cleanup":true,"userId":1,"accountId":1,"vmId":54572,"handlerName":"VirtualMachineManagerImpl"}
2024-07-16 17:31:18,867 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-137:ctx-b555a5f3 job-385919/job-386245 ctx-10b406cf) (logid:1dc1e938) VM instance {"id":54572,"instanceName":"r-54572-VM","type":"DomainRouter","uuid":"a0022aec-996e-490d-8a21-3eccd43c9e0b"} state transited from [Running] to [Stopping] with event [StopRequested].
VM's original host: Host {"id":129,"name":"n2ncloudmy1cp02","type":"Routing","uuid":"3bf16d9d-e561-4e59-b855-7256bee35c6f"}, new host: Host {"id":129,"name":"n2ncloudmy1cp02","type":"Routing","uuid":"3bf16d9d-e561-4e59-b855-7256bee35c6f"}, host before state transition: Host {"id":129,"name":"n2ncloudmy1cp02","type":"Routing","uuid":"3bf16d9d-e561-4e59-b855-7256bee35c6f"}
2024-07-16 17:31:37,947 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-42:ctx-cfc0e084 work-103139) (logid:fe9b0f9a) VM VM instance {"id":54572,"instanceName":"r-54572-VM","type":"DomainRouter","uuid":"a0022aec-996e-490d-8a21-3eccd43c9e0b"} is now no longer on host 129
2024-07-16 17:31:37,947 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-42:ctx-cfc0e084 work-103139) (logid:fe9b0f9a) Completed work HAWork[103139-HA-54572-Stopped-Investigating]. Took 1/10 attempts.
```

The challenge we are having is that we don't see any error related to the storage. There should be a message somewhere that a device can't be accessed or created, but we can't find any.

There are a few things to take note of when using Linstor:

- There's no need to use IPMI OOB; in fact, users are asked not to. This is because only NFS/iSCSI storage is susceptible to split-brain; with Linstor the technology is apparently different, which is why split-brain will not occur.
- The CloudStack Agent needs to be updated so that it does not restart the server.
- The HA strategy with CloudStack + Linstor is to rely solely on VM HA (not Host HA).

More info here:
https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#ch-cloudstack:~:text=video%20here.-,14.9.%20High%20Availability%20and%20LINSTOR%20Volumes%20in%20CloudStack,-The%20CloudStack%20documentation
https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#ch-cloudstack:~:text=14.9.1.%20Explanation%20and%20Reasoning

I'm curious to know your progress and whether you managed to find any solution. Happy to communicate to help each other out.
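For the "do not restart the server" point, a minimal sketch of what we mean on the KVM agent side looks like the following. This assumes the `reboot.host.and.alert.management.on.heartbeat.timeout` property is available in your agent version (it controls whether the agent reboots the host when the storage heartbeat write times out), so please verify it against your own `agent.properties` before relying on it:

```properties
# /etc/cloudstack/agent/agent.properties (on each KVM host)
# Assumed setting: stop the agent from rebooting the host on a
# storage heartbeat timeout, so Linstor/DRBD can handle failover
# instead of the host self-fencing. Verify the property name for
# your CloudStack version.
reboot.host.and.alert.management.on.heartbeat.timeout=false
```

After changing it, restart the agent on that host (e.g. `systemctl restart cloudstack-agent`) for the setting to take effect.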
GitHub link: https://github.com/apache/cloudstack/discussions/9362#discussioncomment-10070156 ---- This is an automatically sent email for users@cloudstack.apache.org. To unsubscribe, please send an email to: users-unsubscr...@cloudstack.apache.org