Re: Can't cancel Maintenance Mode on host

2018-02-23 Thread Dag Sonstebo
Hi Makran,

Yes I’m talking purely on the XS side – in the cases we’ve seen the XS host may 
or may not have been in MM in CloudStack, but if you checked in XenCentre they 
would have somehow gone into MM independently of CloudStack. The CloudStack GUI 
would then error if you tried to take the host in/out of MM. If this is not the 
case with you then great, because it’s a difficult one to fix.
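
For reference, the XS-side state can be inspected and cleared with the xe CLI; a rough sketch, with a placeholder host name-label (verify against your own pool before running anything):

```shell
#!/bin/sh
# Sketch: inspect and clear maintenance mode on the XenServer side via xe.
# HOST_NAME is a placeholder; run this on a pool member, ideally the master.
HOST_NAME="xshost01"

if command -v xe >/dev/null 2>&1; then
    # "enabled ( RO): false" means XAPI considers the host in maintenance mode
    xe host-list name-label="$HOST_NAME" params=uuid,name-label,enabled
    HOST_UUID=$(xe host-list name-label="$HOST_NAME" --minimal)
    # host-enable re-enables the host in XAPI, independently of CloudStack
    xe host-enable uuid="$HOST_UUID"
else
    echo "xe CLI not found: run this on a XenServer host"
fi
```

This only clears the XAPI side; CloudStack’s own view of the host state is tracked separately in its database.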

With regards to https://issues.apache.org/jira/browse/CLOUDSTACK-8210 I 
couldn’t comment – yes, it looks similar, but as you say it is for KVM.

Keep in mind rebooting storage will cause XS hosts to self-fence and reboot – 
check 

https://github.com/apache/cloudstack/blob/master/scripts/vm/hypervisor/xenserver/xenheartbeat.sh
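
That heartbeat script runs on each XS host and reboots (self-fences) the host if it can no longer write its heartbeat file on shared storage within the timeout, which is why a storage reboot can take hosts down with it. A quick way to see whether it is running on a host (the process name comes from the script above; nothing else here is guaranteed):

```shell
#!/bin/sh
# Sketch: check for the CloudStack heartbeat watchdog on a XenServer host.
# If this watchdog loses write access to its heartbeat file on shared
# storage for longer than its timeout, it reboots the host.
if pgrep -f xenheartbeat.sh >/dev/null 2>&1; then
    pgrep -fl xenheartbeat.sh      # shows the interval/timeout arguments
    HB_STATE="running"
else
    HB_STATE="not running"
    echo "xenheartbeat.sh not running: not a CloudStack-managed XS host?"
fi
```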

As a general point – ACS 4.4 and XS 6.2 are obviously relatively old now, so I 
would recommend you upgrade both. However – you do need to sort your issues out 
before your upgrade: any issues present before the upgrade are likely to be the 
same or worse afterwards, and in the worst case you end up having to roll back. 
If you do have a lot of issues and can’t get them fixed, you may be better off 
building a new 4.9 or 4.11 infrastructure and migrating your workloads across.

If you want some tips on upgrading, check this blog post I wrote a year ago: 
http://www.shapeblue.com/cloudstack-upgrades-best-practices/ 

Regards,
Dag Sonstebo
Cloud Architect
ShapeBlue

On 23/02/2018, 05:46, "Makrand" wrote:

Re: Can't cancel Maintenance Mode on host

2018-02-22 Thread Makrand
Hi Dag,

1) Thanks for the reply. I was talking about canceling MM from CloudStack.
No issues taking it in and out of MM at the XenServer/XenCenter level. In
the normal scenario, one first puts the host in MM from CloudStack >> then
from XenCenter >> do the reboot >> exit MM from XenCenter >> exit MM from
CloudStack.
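
As an aside, the two CloudStack-side steps of that workflow correspond to the prepareHostForMaintenance and cancelHostMaintenance API calls, which can also be driven from CloudMonkey; a sketch, assuming CloudMonkey is configured against the management server (the host id is the uuid from the error logs in this thread):

```shell
#!/bin/sh
# Sketch: drive host maintenance mode from the CloudStack API via CloudMonkey.
# Assumes cloudmonkey is installed and configured (URL + API/secret keys).
HOST_ID="4bca233d-0e61-495c-a522-43800fe311fc"

if command -v cloudmonkey >/dev/null 2>&1; then
    cloudmonkey prepare hostformaintenance id="$HOST_ID"  # enter MM in CloudStack
    # ... reboot the host, exit XS-side MM in XenCenter ...
    cloudmonkey cancel hostmaintenance id="$HOST_ID"      # exit MM in CloudStack
else
    echo "cloudmonkey not installed (pip install cloudmonkey)"
fi
CM_SKETCH="done"
```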

2) When you said rebuild, do you mean ejecting the host from the pool and
reinstalling the OS? Also, I have yet to try deleting the host from CloudStack
and adding it back. Should I try that? Do you think it will work?

3) I also found this: https://issues.apache.org/jira/browse/CLOUDSTACK-8210.
I know it is for KVM, but we are using CloudStack 4.4.

BTW, on a broader view, this zone has some funky stuff happening. It's
CloudStack 4.4.x and XenServer 6.2. We have noticed that VRs go into reboot
loops once we reboot the storage. VMs get stuck on XenServer in the start
stage. Sometimes we can't shut down VMs. Sometimes we can't migrate VMs
between hosts. We have also found "dead beef" on the XenServers (whatever
that means; one of our engineers told me). Let me dig up some logs for these
things and I will try to share them.

I am seriously thinking of reinstalling everything here, but I need to
justify it to senior management.
--
Makrand


On Thu, Feb 22, 2018 at 6:14 PM, Dag Sonstebo wrote:

Re: Can't cancel Maintenance Mode on host

2018-02-22 Thread Dag Sonstebo
Hi Makrand,

Yes this rings a bell – first off I would advise you to tread very carefully – 
this is most likely an issue with the underlying XAPI database on your pool 
master, so there is a risk of further problems. 

We have seen this in the past with a couple of clients – and I think we found 
XS servers still in MM in XenCentre (unbeknownst to CloudStack) – but we have 
then had some problems getting the hosts out of MM again from the Xen side. We 
have also seen situations where taking one host out of MM in XenCentre puts 
another host into MM, which is odd. I know on one occasion we ended up 
removing / rebuilding / re-adding the stubborn MM host. Unfortunately we never 
found the actual root cause.

Hopefully your issue is something simpler – have you checked that all SRs are 
plugged on all hosts?
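
For anyone following along, that SR check can be scripted with the xe CLI; a rough sketch (xe list filtering as in XS 6.x; replugging is left as a comment because it should be a deliberate step):

```shell
#!/bin/sh
# Sketch: list PBDs that are not currently attached. Every SR should have
# an attached PBD on every host that is supposed to see it.
if command -v xe >/dev/null 2>&1; then
    xe pbd-list currently-attached=false params=uuid,sr-uuid,host-uuid
    # an unplugged PBD can be re-attached with:
    #   xe pbd-plug uuid=<pbd-uuid>
    SR_CHECK="done"
else
    SR_CHECK="skipped"
    echo "xe CLI not found: run this on a XenServer host"
fi
```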

Regards,
Dag Sonstebo
Cloud Architect
ShapeBlue

On 22/02/2018, 10:32, "Makrand" wrote:

Can't cancel Maintenance Mode on host

2018-02-22 Thread Makrand
Hi All,

A couple of days back we had an iSCSI issue and all the LUNs were
disconnected from the XenServer hosts. After the issue was fixed and all
LUNs were back online, we put one of the compute nodes into Maintenance
Mode from CloudStack for some BIOS checks. It took longer than usual to go
into MM (it was stuck in PrepareForMaintenance), but it got there eventually.
Now whenever we try to cancel its MM, it just fails: Command failed due to
Internal Server Error.

The logs indicate the following:

2018-02-16 09:44:24,291 INFO  [o.a.c.f.j.i.AsyncJobMonitor]
(API-Job-Executor-27:ctx-1e865550 job-72477) Add job-72477 into job
monitoring
2018-02-16 09:44:24,292 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(API-Job-Executor-27:ctx-1e865550 job-72477) Executing AsyncJobVO
{id:72477, userId: 2, accountId: 2, instanceType: Host, instanceId: 26, cmd:
org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd, cmdInfo:
{"id":"4bca233d-0e61-495c-a522-43800fe311fc","response":"json","sessionkey":"ZxtGyco2RuYHil/VnglSOgguw5c\u003d","ctxDetails":"{\"com.cloud.host.Host\":\"4bca233d-0e61-495c-a522-43800fe311fc\"}","cmdEventType":"MAINT.CANCEL","ctxUserId":"2","httpmethod":"GET","_":"1518774059073","uuid":"4bca233d-0e61-495c-a522-43800fe311fc","ctxAccountId":"2","ctxStartEventId":"51924"},
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result:
null, initMsid: 16143068278473, completeMsid: null, lastUpdated: null,
lastPolled: null, created: null}
2018-02-16 09:44:24,301 ERROR [c.c.a.ApiAsyncJobDispatcher]
(API-Job-Executor-27:ctx-1e865550 job-72477) Unexpected exception while
executing org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd
java.lang.NullPointerException
at
com.cloud.resource.ResourceManagerImpl.doCancelMaintenance(ResourceManagerImpl.java:2083)
at
com.cloud.resource.ResourceManagerImpl.cancelMaintenance(ResourceManagerImpl.java:2140)
at
com.cloud.resource.ResourceManagerImpl.cancelMaintenance(ResourceManagerImpl.java:1127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
at
org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
at
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
at
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
at
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy147.cancelMaintenance(Unknown Source)
at
org.apache.cloudstack.api.command.admin.host.CancelMaintenanceCmd.execute(CancelMaintenanceCmd.java:102)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:141)
at
com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
at
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:503)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:460)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2018-02-16 09:44:24,305 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl]
(API-Job-Executor-27:ctx-1e865550 job-72477) Complete async job-72477,
jobStatus: FAILED, resultCode: 530, result:
org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530}
2018-02-16 09:44:24,320 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl]
(DirectAgent-303:ctx-d1ac93ce) Done with process of VM state report. host: 1
2018-02-16 09:44:24,322 DEBUG [c.c.d.De
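
The NullPointerException in ResourceManagerImpl.doCancelMaintenance suggests the management server holds inconsistent state for the host. One avenue is to check what the CloudStack database records for it; a hedged sketch against the 4.x `cloud` schema (column names as in that schema; run on the management server and back up before changing anything):

```shell
#!/bin/sh
# Sketch: query to inspect the stuck host's state in the CloudStack DB.
# resource_state is normally Enabled or Maintenance; values like
# PrepareForMaintenance or ErrorInMaintenance indicate a stuck host.
HOST_UUID="4bca233d-0e61-495c-a522-43800fe311fc"
QUERY="SELECT id, name, status, resource_state, mgmt_server_id FROM host WHERE uuid='$HOST_UUID';"
# run on the management server, e.g.:  mysql -u cloud -p cloud -e "$QUERY"
echo "$QUERY"
```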