Re: Can't cancel Maintenance Mode on host
Hi Makrand,

Yes, I’m talking purely about the XS side – in the cases we’ve seen, the XS host may or may not have been in MM in CloudStack, but if you checked in XenCenter they had somehow gone into MM independently of CloudStack. The CloudStack GUI would then error if you tried to take the host in or out of MM. If this is not the case for you then great, because it’s a difficult one to fix.

With regards to https://issues.apache.org/jira/browse/CLOUDSTACK-8210 I couldn’t comment – yes, it looks similar, but as you say, it is for KVM.

Keep in mind rebooting storage will cause XS hosts to self-fence and reboot – check https://github.com/apache/cloudstack/blob/master/scripts/vm/hypervisor/xenserver/xenheartbeat.sh

As a general point – ACS 4.4 and XS 6.2 are obviously relatively old now, so I would recommend you upgrade both. However, you do need to sort your issues out before your upgrade: any issues before the upgrade are likely to be the same or worse after the upgrade, and worst case you end up having to roll back. If you do have a lot of issues and can’t get them fixed, you may be better off building a new 4.9 or 4.11 infrastructure and migrating your workloads across. If you want some tips on upgrading, check this blog post I wrote a year ago: http://www.shapeblue.com/cloudstack-upgrades-best-practices/

Regards,
Dag Sonstebo
Cloud Architect
ShapeBlue

On 23/02/2018, 05:46, "Makrand" wrote:

Hi Dag,

1) Thanks for the reply. I was talking about cancelling MM from CloudStack. No issues taking it in and out of MM at the XenServer/XenCenter level. In the normal scenario, one first puts the host in MM from CloudStack >> then from XenCenter >> do your reboot >> exit MM from XenCenter >> exit MM from CloudStack.

2) When you said rebuild, do you mean eject the host out of the pool and reinstall the OS? Also, I am yet to try deleting the host from CloudStack and adding it back. Should I try that? Do you think it will work?

3) I also found this: https://issues.apache.org/jira/browse/CLOUDSTACK-8210. I know this is for KVM, but we are using CloudStack 4.4.

BTW, on a broader view, this zone has some funky stuff happening. It’s CloudStack 4.4.x and XenServer 6.2. We have noticed that VRs go into reboot loops once we reboot the storage. VMs get stuck on XenServer in the start stages. Sometimes we can’t shut down VMs. Sometimes we can’t migrate VMs between hosts. We have also found dead beef on XenServers (whatever that means – one of our engineers told me). Let me dig up some logs for these things and I will try to share them. I am seriously thinking of reinstalling everything here, but I just need to justify this to senior management.

--
Makrand
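[Editor’s note] The XenServer-side maintenance state Dag describes can be inspected, and a stuck host re-enabled, directly with the `xe` CLI. A minimal sketch, not from the thread – the host UUID is a placeholder (look it up with `xe host-list`), and this assumes you run it on a pool member:

```shell
# Sketch only: inspect and clear XenServer-side maintenance state for one host.
# HOST_UUID is a placeholder - substitute your own (see `xe host-list`).
HOST_UUID="00000000-0000-0000-0000-000000000000"

check_xs_mm() {
  if command -v xe >/dev/null 2>&1; then
    # enabled=false means the host is disabled (what XenCenter shows as MM)
    xe host-param-get uuid="$HOST_UUID" param-name=enabled
    # re-enable the host, i.e. take it out of maintenance on the Xen side
    xe host-enable uuid="$HOST_UUID"
  else
    echo "xe CLI not found - run this on a XenServer host"
  fi
}

check_xs_mm
```

As Dag warns, if the XAPI db on the pool master is inconsistent, re-enabling one host has been seen to put another into MM, so check every host’s `enabled` state afterwards.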
Re: Can't cancel Maintenance Mode on host
Hi Dag,

1) Thanks for the reply. I was talking about cancelling MM from CloudStack. No issues taking it in and out of MM at the XenServer/XenCenter level. In the normal scenario, one first puts the host in MM from CloudStack >> then from XenCenter >> do your reboot >> exit MM from XenCenter >> exit MM from CloudStack.

2) When you said rebuild, do you mean eject the host out of the pool and reinstall the OS? Also, I am yet to try deleting the host from CloudStack and adding it back. Should I try that? Do you think it will work?

3) I also found this: https://issues.apache.org/jira/browse/CLOUDSTACK-8210. I know this is for KVM, but we are using CloudStack 4.4.

BTW, on a broader view, this zone has some funky stuff happening. It’s CloudStack 4.4.x and XenServer 6.2. We have noticed that VRs go into reboot loops once we reboot the storage. VMs get stuck on XenServer in the start stages. Sometimes we can’t shut down VMs. Sometimes we can’t migrate VMs between hosts. We have also found dead beef on XenServers (whatever that means – one of our engineers told me). Let me dig up some logs for these things and I will try to share them. I am seriously thinking of reinstalling everything here, but I just need to justify this to senior management.

--
Makrand

On Thu, Feb 22, 2018 at 6:14 PM, Dag Sonstebo wrote:

> Hi Makrand,
>
> Yes, this rings a bell – first off, I would advise you to tread very
> carefully – this is most likely an issue with the underlying XAPI db on
> your pool master, so there is a risk of further problems.
>
> We have seen this in the past with a couple of clients – and I think we
> found XS servers still in MM in XenCenter (unbeknownst to CloudStack) – but
> we then had some problems getting the hosts out of MM again from the
> Xen side. We have also seen situations where taking one host out of MM in
> XenCenter puts another host into MM, which is odd. I know on one
> occasion we ended up removing / rebuilding / re-adding the stubborn MM host.
> Unfortunately we never found the actual root cause.
>
> Hopefully your issue is something simpler – have you checked that all SRs
> are plugged on all hosts?
>
> Regards,
> Dag Sonstebo
> Cloud Architect
> ShapeBlue
Re: Can't cancel Maintenance Mode on host
Hi Makrand,

Yes, this rings a bell – first off, I would advise you to tread very carefully – this is most likely an issue with the underlying XAPI db on your pool master, so there is a risk of further problems.

We have seen this in the past with a couple of clients – and I think we found XS servers still in MM in XenCenter (unbeknownst to CloudStack) – but we then had some problems getting the hosts out of MM again from the Xen side. We have also seen situations where taking one host out of MM in XenCenter puts another host into MM, which is odd. I know on one occasion we ended up removing / rebuilding / re-adding the stubborn MM host. Unfortunately we never found the actual root cause.

Hopefully your issue is something simpler – have you checked that all SRs are plugged on all hosts?

Regards,
Dag Sonstebo
Cloud Architect
ShapeBlue

On 22/02/2018, 10:32, "Makrand" wrote:

Hi All,

A couple of days back we had an iSCSI issue and all the LUNs were disconnected from the XenServer hosts. After the issue was fixed and all LUNs were back online, we put one of the compute nodes into Maintenance Mode from CloudStack for some BIOS checks. It took longer than usual to go into MM (it was stuck in PrepareForMaintenance), but it went in anyhow. Now, whenever we try to cancel its MM, it just fails: "Command failed due to Internal Server Error."
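[Editor’s note] Dag’s question about SRs being plugged can be answered from the pool master with `xe pbd-list`: a PBD is the per-host attachment of an SR, so any PBD with `currently-attached=false` is an SR that is not plugged on some host. A minimal sketch, not from the thread:

```shell
# Sketch only: list SR attachments (PBDs) that are NOT currently plugged.
# Run on the pool master; any rows in the output mean an SR is unplugged
# on the host shown. An empty result means all SRs are plugged everywhere.
list_unplugged_pbds() {
  if command -v xe >/dev/null 2>&1; then
    xe pbd-list currently-attached=false params=uuid,host-uuid,sr-uuid
  else
    echo "xe CLI not found - run this on a XenServer host"
  fi
}

list_unplugged_pbds
```

An unplugged PBD can usually be re-attached with `xe pbd-plug uuid=<pbd-uuid>`, though after an iSCSI outage like the one described it is worth confirming the LUNs are reachable first.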
Can't cancel Maintenance Mode on host
Hi All,

A couple of days back we had an iSCSI issue and all the LUNs were disconnected from the XenServer hosts. After the issue was fixed and all LUNs were back online, we put one of the compute nodes into Maintenance Mode from CloudStack for some BIOS checks. It took longer than usual to go into MM (it was stuck in PrepareForMaintenance), but it went in anyhow. Now, whenever we try to cancel its MM, it just fails: "Command failed due to Internal Server Error."

The logs indicate the below:

2018-02-16 09:44:24,291 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-27:ctx-1e865550 job-72477) Add job-72477 into job monitoring
2018-02-16 09:44:24,292 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-27:ctx-1e865550 job-72477) Executing AsyncJobVO {id:72477, userId: 2, accountId: 2, instanceType: Host, instanceId: 26, cmd: org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd, cmdInfo: {"id":"4bca233d-0e61-495c-a522-43800fe311fc","response":"json","sessionkey":"ZxtGyco2RuYHil/VnglSOgguw5c\u003d","ctxDetails":"{\"com.cloud.host.Host\":\"4bca233d-0e61-495c-a522-43800fe311fc\"}","cmdEventType":"MAINT.CANCEL","ctxUserId":"2","httpmethod":"GET","_":"1518774059073","uuid":"4bca233d-0e61-495c-a522-43800fe311fc","ctxAccountId":"2","ctxStartEventId":"51924"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 16143068278473, completeMsid: null, lastUpdated: null, lastPolled: null, created: null}
2018-02-16 09:44:24,301 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-27:ctx-1e865550 job-72477) Unexpected exception while executing org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd
java.lang.NullPointerException
    at com.cloud.resource.ResourceManagerImpl.doCancelMaintenance(ResourceManagerImpl.java:2083)
    at com.cloud.resource.ResourceManagerImpl.cancelMaintenance(ResourceManagerImpl.java:2140)
    at com.cloud.resource.ResourceManagerImpl.cancelMaintenance(ResourceManagerImpl.java:1127)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    at com.sun.proxy.$Proxy147.cancelMaintenance(Unknown Source)
    at org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd.execute(CancelMaintenanceCmd.java:102)
    at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:141)
    at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
    at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:503)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
    at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:460)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2018-02-16 09:44:24,305 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-27:ctx-1e865550 job-72477) Complete async job-72477, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530}
2018-02-16 09:44:24,320 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgent-303:ctx-d1ac93ce) Done with process of VM state report. host: 1
2018-02-16 09:44:24,322 DEBUG [c.c.d.De
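[Editor’s note] The NullPointerException in ResourceManagerImpl.doCancelMaintenance suggests the management server’s stored state for host 26 is inconsistent. A hypothetical first diagnostic (not from the thread) is to look at that host row read-only in the management database; this sketch assumes shell access to the management server, the default "cloud" database name, and placeholder credentials:

```shell
# Sketch only: inspect CloudStack's stored view of the host (id 26 per the
# log above). The "cloud" DB name is the default; the -u user is a
# placeholder - use your own management DB credentials. Read-only query.
check_host_row() {
  if [ -f /etc/cloudstack/management/db.properties ]; then
    mysql -u cloud -p cloud -e \
      "SELECT id, name, status, resource_state FROM host WHERE id = 26;"
  else
    echo "db.properties not found - run this on the CloudStack management server"
  fi
}

check_host_row
```

A resource_state stuck in something like PrepareForMaintenance or ErrorInMaintenance would be consistent with the failed CancelMaintenanceCmd, but as Dag advises in his reply, tread carefully and take a DB backup before considering any change.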