Re: Can't cancel Maintenance Mode on host

Dag Sonstebo Fri, 23 Feb 2018 01:21:07 -0800

Hi Makran,

Yes, I’m talking purely about the XS side – in the cases we’ve seen, the XS host may 
or may not have been in MM in CloudStack, but if you checked in XenCenter they 
would have somehow gone into MM independently of CloudStack. The CloudStack GUI 
would then error if you tried to take the host in or out of MM. If this is not the 
case for you then great, because it’s a difficult one to fix.


With regard to https://issues.apache.org/jira/browse/CLOUDSTACK-8210, I 
can’t really comment. Yes, it looks similar, but as you say it is for KVM.

Keep in mind that rebooting storage will cause XS hosts to self-fence and reboot – 
check the heartbeat script:

https://github.com/apache/cloudstack/blob/master/scripts/vm/hypervisor/xenserver/xenheartbeat.sh

As a general point – ACS 4.4 and XS 6.2 are relatively old now, so I 
would recommend you upgrade both. However, you do need to sort your issues out 
before you upgrade: any issues you have beforehand are likely to be the same or 
worse afterwards, and in the worst case you end up having to roll back. If you 
do have a lot of issues and can’t get them fixed, you may be better 
off building a new 4.9 or 4.11 infrastructure and migrating your workloads across.

If you want some tips on upgrading check this blog post I wrote a year ago 
http://www.shapeblue.com/cloudstack-upgrades-best-practices/ 

Regards,
Dag Sonstebo
Cloud Architect
ShapeBlue

On 23/02/2018, 05:46, "Makrand" <[email protected]> wrote:

    Hi Dag,
    
    1) Thanks for the reply. I was talking about canceling MM from CloudStack.
    No issues taking it in and out of MM at the XenServer/XenCenter level. In the
    normal scenario, one first puts the host in MM from CloudStack >> then from
    XenCenter >> do your reboot >> exit MM from XenCenter >> exit from CloudStack.
    
    2) When you said rebuild, did you mean eject the host from the pool and
    reinstall the OS? Also, I have yet to try deleting the host from CloudStack
    and adding it back. Should I try that? Do you think it will work?
    
    3) I also found this:
    https://issues.apache.org/jira/browse/CLOUDSTACK-8210
    I know this is for KVM, but we are using CloudStack 4.4.
    
    BTW, on a broader view, this zone has some funky stuff happening. It's
    CloudStack 4.4.x and XenServer 6.2.
    We have noticed that VRs go into reboot loops once we reboot the storage.
    VMs get stuck on XenServer in the start stages. Sometimes we can't shut down
    VMs. Sometimes we can't migrate VMs between hosts. We have also found dead
    beef on the XenServers (whatever that means... one of our engineers told
    me). Let me dig out some logs for these things and I will try to share them.
    
    I am seriously thinking of reinstalling everything here. But I just need to
    justify this to senior management.
    
    --
    Makrand
    
    
    On Thu, Feb 22, 2018 at 6:14 PM, Dag Sonstebo <[email protected]>
    wrote:
    
    > Hi Makrand,
    >
    > Yes this rings a bell – first off I would advise you to tread very
    > carefully – this is most likely an issue with the underlying XAPI db on
    > your poolmaster, so there is a risk of further problems.
    >
    > We have seen this in the past with a couple of clients – and I think we
    > found XS servers still in MM in XenCenter (unbeknownst to CloudStack) – but
    > we then had some problems getting the hosts out of MM again from the
    > Xen side. We have also seen situations where taking one host out of MM in
    > XenCenter puts another host into MM, which is odd. I know on one
    > occasion we ended up removing / rebuilding / re-adding the stubborn MM host.
    > Unfortunately we never found the actual root cause.
    >
    > Hopefully your issue is something simpler – have you checked that all SRs
    > are plugged on all hosts?
    >
    > Regards,
    > Dag Sonstebo
    > Cloud Architect
    > ShapeBlue
    >
    > On 22/02/2018, 10:32, "Makrand" <[email protected]> wrote:
    >
    >     Hi All,
    >
    >     A couple of days back we had an iSCSI issue and all the LUNs were
    >     disconnected from the XenServer hosts. After the issue was fixed and
    >     all LUNs were back online, we put one of the compute nodes into
    >     Maintenance Mode from CloudStack for some BIOS checks. It took longer
    >     than usual to go into MM (it was stuck in PrepareForMaintenance), but it
    >     got there eventually. Now whenever we try to cancel its MM, it just
    >     fails with: Command failed due to Internal Server Error.
    >
    >     The logs show the following:
    >
    >     2018-02-16 09:44:24,291 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-27:ctx-1e865550 job-72477) Add job-72477 into job monitoring
    >     2018-02-16 09:44:24,292 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-27:ctx-1e865550 job-72477) Executing AsyncJobVO {id:72477, userId: 2, accountId: 2, instanceType: Host, instanceId: 26, cmd: org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd, cmdInfo: {"id":"4bca233d-0e61-495c-a522-43800fe311fc","response":"json","sessionkey":"ZxtGyco2RuYHil/VnglSOgguw5c\u003d","ctxDetails":"{\"com.cloud.host.Host\":\"4bca233d-0e61-495c-a522-43800fe311fc\"}","cmdEventType":"MAINT.CANCEL","ctxUserId":"2","httpmethod":"GET","_":"1518774059073","uuid":"4bca233d-0e61-495c-a522-43800fe311fc","ctxAccountId":"2","ctxStartEventId":"51924"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 16143068278473, completeMsid: null, lastUpdated: null, lastPolled: null, created: null}
    >     2018-02-16 09:44:24,301 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-27:ctx-1e865550 job-72477) Unexpected exception while executing org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd
    >     java.lang.NullPointerException
    >             at com.cloud.resource.ResourceManagerImpl.doCancelMaintenance(ResourceManagerImpl.java:2083)
    >             at com.cloud.resource.ResourceManagerImpl.cancelMaintenance(ResourceManagerImpl.java:2140)
    >             at com.cloud.resource.ResourceManagerImpl.cancelMaintenance(ResourceManagerImpl.java:1127)
    >             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    >             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    >             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    >             at java.lang.reflect.Method.invoke(Method.java:606)
    >             at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
    >             at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
    >             at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    >             at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    >             at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    >             at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    >             at com.sun.proxy.$Proxy147.cancelMaintenance(Unknown Source)
    >             at org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd.execute(CancelMaintenanceCmd.java:102)
    >             at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:141)
    >             at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
    >             at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:503)
    >             at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
    >             at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
    >             at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
    >             at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
    >             at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
    >             at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:460)
    >             at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    >             at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    >             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    >             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    >             at java.lang.Thread.run(Thread.java:745)
    >     2018-02-16 09:44:24,305 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-27:ctx-1e865550 job-72477) Complete async job-72477, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530}
    >     2018-02-16 09:44:24,320 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] (DirectAgent-303:ctx-d1ac93ce) Done with process of VM state report. host: 1
    >     2018-02-16 09:44:24,322 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (CapacityChecker:ctx-038e67bd) MessageBus message: host reserved capacity released for VM: 1, checking if host reservation can be released for host:1
    >     2018-02-16 09:44:24,329 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (CapacityChecker:ctx-038e67bd) Cannot release reservation, Found 7 VMs Running on host 1
    >     2018-02-16 09:44:24,340 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (CapacityChecker:ctx-038e67bd) MessageBus message: host reserved capacity released for VM: 1, checking if host reservation can be released for host:1
    >     2018-02-16 09:44:24,347 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (CapacityChecker:ctx-038e67bd) Cannot release reservation, Found 7 VMs Running on host 1
    >     2018-02-16 09:44:24,352 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-27:ctx-1e865550 job-72477) Done executing org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd for job-72477
    >     2018-02-16 09:44:24,352 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (CapacityChecker:ctx-038e67bd) MessageBus message: host reserved capacity released for VM: 1, checking if host reservation can be released for host:1
    >     2018-02-16 09:44:24,356 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (CapacityChecker:ctx-038e67bd) Cannot release reservation, Found 7 VMs Running on host 1
    >     2018-02-16 09:44:24,357 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-038e67bd) No need to calibrate cpu capacity, host:1 usedCpu: 13500 reservedCpu: 0
    >     2018-02-16 09:44:24,357 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-038e67bd) No need to calibrate memory capacity, host:1 usedMem: 21881683968 reservedMem: 0
    >     2018-02-16 09:44:24,363 INFO  [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-27:ctx-1e865550 job-72477) Remove job-72477 from job monitoring
    >
    >     So far we have tried:
    >     1) Rebooting the compute node multiple times, which did not help.
    >     2) Editing the DB to mark the resource state as Enabled, then forcefully
    >     reconnecting the host. Everything looked OK. We put it back in MM and
    >     tried taking it out again, and it FAILS with the same error.
    >
    >     Not sure what the issue is here. Anyone?
    >
    >
    >
    >     --
    >     Makrand
    >
    >
    >
    > [email protected]
    > www.shapeblue.com
    > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
    > @shapeblue
    >
    >
    >
    >
    


[email protected] 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
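
For anyone triaging a similar failure: the quoted log shows the cancel job ending with "jobStatus: FAILED, resultCode: 530" right after a NullPointerException in ResourceManagerImpl.doCancelMaintenance. Below is a minimal Python sketch for pulling such failed async jobs out of a management server log. The regular expressions are based only on the log lines quoted in this thread, and the function name failed_jobs is made up for illustration, so treat it as a starting point and not as a CloudStack tool.

```python
import re
from typing import Dict, Iterable, List, Optional

# Patterns derived only from the log lines quoted in this thread (an assumption;
# other CloudStack versions may format these lines differently).
JOB_DONE = re.compile(
    r"Complete async job-(?P<job>\d+), jobStatus: (?P<status>\w+), "
    r"resultCode: (?P<code>\d+)"
)
JOB_ERROR = re.compile(r"ERROR .*\bjob-(?P<job>\d+)\) Unexpected exception")
EXCEPTION = re.compile(r"^\s*(?P<exc>[\w.$]+(?:Exception|Error))\b")
FRAME = re.compile(r"^\s*at (?P<frame>\S+\(.*\))")


def failed_jobs(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one record per FAILED async job.

    Each record holds the job id and result code, plus the exception class
    and the top stack frame when an ERROR entry for that job was seen earlier.
    Expects one log entry per line (undo any email line wrapping first).
    """
    details: Dict[str, Dict[str, str]] = {}
    current: Optional[str] = None  # job whose stack trace is being read
    results: List[Dict[str, str]] = []

    for line in lines:
        error = JOB_ERROR.search(line)
        if error:
            current = error.group("job")
            details[current] = {}
            continue

        if current is not None:
            info = details[current]
            exc = EXCEPTION.match(line)
            if exc and "exception" not in info:
                info["exception"] = exc.group("exc")
                continue
            frame = FRAME.match(line)
            if frame:
                # Keep only the first (innermost) frame.
                info.setdefault("top_frame", frame.group("frame"))
                continue
            current = None  # any other line ends the stack trace

        done = JOB_DONE.search(line)
        if done and done.group("status") == "FAILED":
            record = {"job": done.group("job"), "result_code": done.group("code")}
            record.update(details.get(done.group("job"), {}))
            results.append(record)

    return results
```

To use it, open the management server log (commonly /var/log/cloudstack/management/management-server.log, though the path depends on the install) and pass the file object to failed_jobs. For the log quoted above, the expected record is job 72477 with result code 530, java.lang.NullPointerException, and doCancelMaintenance as the top frame.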
