[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399052#comment-16399052
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-373127181
 
 
   @Slair1 I have no easy way to put this. You have four (4) PRs out and all 
of them fail all Travis runs. It must be that Travis doesn't like your GitHub 
handle, or so. Can you think of something you have or do that might cause 
this? In the same period, other PRs have passed their Travis runs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> VM HA issues
> 
>
> Key: CLOUDSTACK-10246
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10246
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: Management Server
> Affects Versions: 4.11.0.0
> Environment: My setup is CentOS 7 Management server with 3 CentOS 7 KVM HVs, NFS as primary and secondary storages.
> Reporter: Nux
> Priority: Major
>
> VM HA fails to kick in when one of the hypervisors goes down.
> It even fails to restart the system VMs which remain down along with the 
> instances until the affected HV comes back online.
> When I crash or power off the HV the system marks it in the hosts list as 
> "Alert" or "Disconnected" respectively. It should get changed to "Down" after 
> that, but this never happens.
>  
> I have tried various combinations of setups (Adv, Basic), none succeeded.
>  
> My instances use HA enabled offerings.
> Management server DEBUG logs here:
> [http://tmp.nux.ro/CW4-vmhafail-411rc1.txt]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398186#comment-16398186
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-372929689
 
 
   @Slair1 can we also have Marvin tests to cover these fixes?




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397534#comment-16397534
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

Slair1 commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-372794000
 
 
   Yeah, I can work to modularize this. Unfortunately, I don't have the time 
at the moment, but I can later.
   
   On the tests, do you mean unit tests? I've never written a unit test 
before, but I agree it would be good to have them.




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389265#comment-16389265
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-371067952
 
 
   @Slair1 are you going to modularise the handleDisconnectWithInvestigation 
method?




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389012#comment-16389012
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

blueorangutan commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-371019688
 
 
   Trillian test result (tid-2328)
   Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
   Total time taken: 21963 seconds
   Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2474-t2328-kvm-centos7.zip
   Intermitten failure detected: /marvin/tests/smoke/test_iso.py
   Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
   Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
   Smoke tests completed. 52 look OK, 2 have error(s)
   Only failed tests results shown below:
   
   
   Test | Result | Time (s) | Test File
   --- | --- | --- | ---
   test_04_rvpc_privategw_static_routes | `Failure` | 329.18 | test_privategw_acl.py
   test_02_edit_iso | `Failure` | 0.04 | test_iso.py
   test_05_iso_permissions | `Failure` | 0.05 | test_iso.py
   




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388590#comment-16388590
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

blueorangutan commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370944019
 
 
   @borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has 
been kicked to run smoke tests




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388588#comment-16388588
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370943779
 
 
   @blueorangutan test




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388454#comment-16388454
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

blueorangutan commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370913368
 
 
   Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1762




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388397#comment-16388397
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

blueorangutan commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370900817
 
 
   @borisstoyanov a Jenkins job has been kicked to build packages. I'll keep 
you posted as I make progress.




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388393#comment-16388393
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370900571
 
 
   @blueorangutan package




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387949#comment-16387949
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

blueorangutan commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370819917
 
 
   Packaging result: ✔centos6 ✖centos7 ✖debian. JID-1759




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387828#comment-16387828
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370788754
 
 
   @blueorangutan package




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387827#comment-16387827
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370791081
 
 
   @blueorangutan package




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387829#comment-16387829
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

blueorangutan commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370791279
 
 
   @borisstoyanov a Jenkins job has been kicked to build packages. I'll keep 
you posted as I make progress.




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387818#comment-16387818
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370789130
 
 
   @Slair1 which issues are fixed, and do we have Marvin tests for them? If 
not, I think it would be good to add them.




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387813#comment-16387813
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

borisstoyanov commented on issue #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474#issuecomment-370788754
 
 
   @blueorangutan package




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383366#comment-16383366
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host 
HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171791837
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Status determinedState = investigate(attache);
 // if state cannot be determined do nothing and bail out
 if (determinedState == null) {
 if ((System.currentTimeMillis() >> 10) - 
host.getLastPinged() > AlertWait.value()) {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, 
will go to Alert state");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") 
seconds, will go to Alert state");
 determinedState = Status.Alert;
 } else {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined, do nothing");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined, doing nothing");
 return false;
 }
 }
 
 final Status currentStatus = host.getStatus();
-s_logger.info("The agent from host " + hostId + " state 
determined is " + determinedState);
+s_logger.info("Status for " + hostShortDesc + " was " + 
currentStatus + ".  Investigation determined the current state is " + 
determinedState);
 
-if (determinedState == Status.Down) {
-final String message = "Host is down: " + host.getId() + 
"-" + host.getName() + ". Starting HA on the VMs";
-s_logger.error(message);
-if (host.getType() != Host.Type.SecondaryStorage && 
host.getType() != Host.Type.ConsoleProxy) {
-
_alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, 
host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-}
-event = Status.Event.HostDown;
-} else if (determinedState == Status.Up) {
-/* Got ping response from host, bring it back */
-s_logger.info("Agent is determined to be up and running");
+if (determinedState == Status.Up) {
+// Got ping response from host, bring it back
+s_logger.info(hostShortDesc + " is up again");
 agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-return false;
 } else if (determinedState == Status.Disconnected) {
-s_logger.warn("Agent is disconnected but the host is still 
up: 

[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383364#comment-16383364
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host 
HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171791683
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Status determinedState = investigate(attache);
 // if state cannot be determined do nothing and bail out
 if (determinedState == null) {
 if ((System.currentTimeMillis() >> 10) - 
host.getLastPinged() > AlertWait.value()) {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, 
will go to Alert state");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") 
seconds, will go to Alert state");
 determinedState = Status.Alert;
 } else {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined, do nothing");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined, doing nothing");
 return false;
 }
 }
 
 final Status currentStatus = host.getStatus();
-s_logger.info("The agent from host " + hostId + " state 
determined is " + determinedState);
+s_logger.info("Status for " + hostShortDesc + " was " + 
currentStatus + ".  Investigation determined the current state is " + 
determinedState);
 
-if (determinedState == Status.Down) {
-final String message = "Host is down: " + host.getId() + 
"-" + host.getName() + ". Starting HA on the VMs";
-s_logger.error(message);
-if (host.getType() != Host.Type.SecondaryStorage && 
host.getType() != Host.Type.ConsoleProxy) {
-
_alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, 
host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-}
-event = Status.Event.HostDown;
-} else if (determinedState == Status.Up) {
-/* Got ping response from host, bring it back */
-s_logger.info("Agent is determined to be up and running");
+if (determinedState == Status.Up) {
+// Got ping response from host, bring it back
+s_logger.info(hostShortDesc + " is up again");
 agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-return false;
 } else if (determinedState == Status.Disconnected) {
-s_logger.warn("Agent is disconnected but the host is still 
up: 

[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383358#comment-16383358
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host 
HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171790992
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Review comment:
   @DaanHoogland good thought, I didn’t think about that.  However, the 
hostShortDesc does include the hostId as part of it.  So maybe it’s ok?
   
   final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383331#comment-16383331
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix 
Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784133
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Status determinedState = investigate(attache);
 // if state cannot be determined do nothing and bail out
 if (determinedState == null) {
 if ((System.currentTimeMillis() >> 10) - 
host.getLastPinged() > AlertWait.value()) {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, 
will go to Alert state");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") 
seconds, will go to Alert state");
 
 Review comment:
   For the warn message, the above is even more true.




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383334#comment-16383334
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix 
Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171785154
 
 

 ##
 File path: 
server/src/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
 ##
 @@ -340,6 +340,7 @@
 private ScheduledExecutorService _executor;
 private ScheduledExecutorService _checkExecutor;
 private ScheduledExecutorService _networkStatsUpdateExecutor;
+private ExecutorService _routerOobStartExecutor;
 
 Review comment:
   We are trying to get rid of these `_` prefixes, so there is no need to 
adhere to this old convention. In fact, you may want to rename the others as 
well (in a separate commit, for ease of review). This is not a request for 
change, just a suggestion.




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383332#comment-16383332
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix 
Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784702
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Status determinedState = investigate(attache);
 // if state cannot be determined do nothing and bail out
 if (determinedState == null) {
 if ((System.currentTimeMillis() >> 10) - 
host.getLastPinged() > AlertWait.value()) {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, 
will go to Alert state");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") 
seconds, will go to Alert state");
 determinedState = Status.Alert;
 } else {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined, do nothing");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined, doing nothing");
 return false;
 }
 }
 
 final Status currentStatus = host.getStatus();
-s_logger.info("The agent from host " + hostId + " state 
determined is " + determinedState);
+s_logger.info("Status for " + hostShortDesc + " was " + 
currentStatus + ".  Investigation determined the current state is " + 
determinedState);
 
-if (determinedState == Status.Down) {
-final String message = "Host is down: " + host.getId() + 
"-" + host.getName() + ". Starting HA on the VMs";
-s_logger.error(message);
-if (host.getType() != Host.Type.SecondaryStorage && 
host.getType() != Host.Type.ConsoleProxy) {
-
_alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, 
host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-}
-event = Status.Event.HostDown;
-} else if (determinedState == Status.Up) {
-/* Got ping response from host, bring it back */
-s_logger.info("Agent is determined to be up and running");
+if (determinedState == Status.Up) {
+// Got ping response from host, bring it back
+s_logger.info(hostShortDesc + " is up again");
 agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-return false;
 } else if (determinedState == Status.Disconnected) {
-s_logger.warn("Agent is disconnected but the host is still 

[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383335#comment-16383335
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix 
Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784246
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Status determinedState = investigate(attache);
 // if state cannot be determined do nothing and bail out
 if (determinedState == null) {
 if ((System.currentTimeMillis() >> 10) - 
host.getLastPinged() > AlertWait.value()) {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, 
will go to Alert state");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") 
seconds, will go to Alert state");
 determinedState = Status.Alert;
 } else {
-s_logger.warn("Agent " + hostId + " state cannot be 
determined, do nothing");
+s_logger.warn("State for " + hostShortDesc + " could 
not be determined, doing nothing");
 return false;
 }
 }
 
 final Status currentStatus = host.getStatus();
-s_logger.info("The agent from host " + hostId + " state 
determined is " + determinedState);
+s_logger.info("Status for " + hostShortDesc + " was " + 
currentStatus + ".  Investigation determined the current state is " + 
determinedState);
 
-if (determinedState == Status.Down) {
-final String message = "Host is down: " + host.getId() + 
"-" + host.getName() + ". Starting HA on the VMs";
-s_logger.error(message);
-if (host.getType() != Host.Type.SecondaryStorage && 
host.getType() != Host.Type.ConsoleProxy) {
-
_alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, 
host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-}
-event = Status.Event.HostDown;
-} else if (determinedState == Status.Up) {
-/* Got ping response from host, bring it back */
-s_logger.info("Agent is determined to be up and running");
+if (determinedState == Status.Up) {
+// Got ping response from host, bring it back
+s_logger.info(hostShortDesc + " is up again");
 agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-return false;
 } else if (determinedState == Status.Disconnected) {
-s_logger.warn("Agent is disconnected but the host is still 

[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1638#comment-1638
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix 
Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171785409
 
 

 ##
 File path: 
server/src/com/cloud/network/router/VirtualNetworkApplianceManagerImpl.java
 ##
 @@ -2587,7 +2589,13 @@ public boolean postStateTransitionEvent(final 
StateMachine2.Transition


[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383336#comment-16383336
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

DaanHoogland commented on a change in pull request #2474: CLOUDSTACK-10246 Fix 
Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171784055
 
 

 ##
 File path: 
engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
 ##
 @@ -843,72 +846,103 @@ protected boolean 
handleDisconnectWithInvestigation(final AgentAttache attache,
 s_logger.debug("Caught exception while getting agent's next 
status", ne);
 }
 
+// For log and alert purposes later
+final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+final HostPodVO podVO = _podDao.findById(host.getPodId());
+final String hostDesc = "[name: " + host.getName() + " (id:" + 
host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + 
podVO.getName() + "]";
+final String hostShortDesc = "Host " + host.getName() + " (id:" + 
host.getId() + ")";
+
+final ResourceState resourceState = host.getResourceState();
+if (resourceState == ResourceState.Disabled || resourceState == 
ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) 
{
+// If we are in this resourceState, no need to investigate or 
do anything.  AgentMonitor will handle when in these resourceStates
+s_logger.info(hostShortDesc + " has disconnected with event " 
+ event + ",  but is in Resource State of " + resourceState + ", so doing 
nothing");
+return true;
+}
+
 if (nextStatus == Status.Alert) {
-/* OK, we are going to the bad status, let's see what happened 
*/
-s_logger.info("Investigating why host " + hostId + " has 
disconnected with event " + event);
+/* Our next Agent transition state is Alert
+ * Let's see if the host down or why we had this event
+ */
+s_logger.info("Investigating why host " + hostShortDesc + " 
has disconnected with event " + event);
 
 Review comment:
   Good improvement, but although it is only (a comment and) a log statement, 
it is effectively an interface of the system. The ecosystem may query logs for 
this text, no longer find the hostId, and thus no longer be able to take 
mitigating actions. I'd rather see a less destructive change like 'hostId + " (" 
+ hostShortDesc + ") "'.
   
   We may get away with it, but it does require extensive testing by the whole 
community :/.
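
   To make the concern concrete, here is a minimal sketch comparing the message 
as changed in the PR with a variant that keeps the bare host id in its old 
position. The class name, method names, the legacy regular expression, and the 
sample host name and event value are all hypothetical and only for illustration; 
nothing here is taken from CloudStack beyond the hostShortDesc format shown in 
the diff.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HostLogMessageSketch {
    // Hypothetical pattern an external log consumer might have used against
    // the pre-change message ("... why host <numeric id> has ...").
    static final Pattern LEGACY_HOST_ID = Pattern.compile("Investigating why host (\\d+) ");

    // Same format as hostShortDesc in the diff.
    static String hostShortDesc(String hostName, long hostId) {
        return "Host " + hostName + " (id:" + hostId + ")";
    }

    // Message as changed in the PR: the description replaces the bare id.
    static String messageAsInPr(String hostName, long hostId, String event) {
        return "Investigating why host " + hostShortDesc(hostName, hostId)
                + " has disconnected with event " + event;
    }

    // Less destructive variant: bare id first, description in parentheses.
    static String messageKeepingId(String hostName, long hostId, String event) {
        return "Investigating why host " + hostId + " (" + hostShortDesc(hostName, hostId) + ")"
                + " has disconnected with event " + event;
    }

    // Returns the host id found by the legacy pattern, or -1 if it no longer matches.
    static long legacyHostId(String logLine) {
        Matcher m = LEGACY_HOST_ID.matcher(logLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        System.out.println(messageAsInPr("kvm01", 42L, "AgentDisconnected"));
        System.out.println(messageKeepingId("kvm01", 42L, "AgentDisconnected"));
    }
}
```

   Under these assumptions, a consumer matching on the old text still finds the 
id in the second variant, while the first variant requires the consumer to 
switch to parsing the "(id:N)" suffix.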




[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382718#comment-16382718
 ] 

ASF GitHub Bot commented on CLOUDSTACK-10246:
-

Slair1 opened a new pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA 
issues
URL: https://github.com/apache/cloudstack/pull/2474
 
 
   The HA logic just does not work.  VMs with HA enabled would never restart 
after a host failure.  I had to redo most of that logic.  There are comments 
inline with the code, but here is the general updated logic.  Sorry for the 
long notes...
   
   FYI, we are running KVM.
   
   - If the host agent is unreachable, handleDisconnectWithInvestigation() is 
called as always.
   - The investigators are called to see what happened, which is one of the 
following two scenarios.  (If it isn't one of the two below, then the host just 
came back UP, or another status was returned and that is also logged.  But the 
two scenarios below are what needed updating the most.)
   
   **If the investigators find the host is UP, but just the agent is 
unreachable**
   The host is put into DISCONNECTED status.  It stays in this status, and 
the PingTimeouts continue to call handleDisconnectWithoutInvestigation() 
periodically, until the AlertWait config option expires.  If the AlertWait time 
is eventually hit and the investigators are still reporting that the host is 
DISCONNECTED and not DOWN, then we put the host into ALERT state, and it stays 
there until the investigators say the host is UP or DOWN.  If the host goes 
DOWN, then VM HA is initiated.
   
   **If the investigators find the host is DOWN**
   Then VM HA is initiated...
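   
   The two scenarios above can be sketched as a small decision function.  This 
is only an illustration of the described flow, not the code in the PR: the 
enum names, the method names, and the use of seconds for the AlertWait value 
are all assumptions.

```java
// Hypothetical sketch of the flow described above; names are illustrative
// and are not taken from AgentManagerImpl.java.
final class HostDisconnectSketch {
    /** What the investigators report about the host (assumed names). */
    enum Finding { UP, AGENT_UNREACHABLE, DOWN }

    /** Host status recorded by the management server. */
    enum Status { UP, DISCONNECTED, ALERT, DOWN }

    private HostDisconnectSketch() {}

    // Maps an investigator finding to the next host status.
    // secondsUnreachable: how long the agent has been unreachable (assumed unit).
    // alertWaitSeconds:   the AlertWait setting (assumed to be in seconds).
    static Status nextStatus(Finding finding, long secondsUnreachable, long alertWaitSeconds) {
        switch (finding) {
            case UP:
                return Status.UP;
            case DOWN:
                return Status.DOWN;
            case AGENT_UNREACHABLE:
            default:
                // Host is alive but the agent is not: stay DISCONNECTED until
                // AlertWait expires, then escalate to ALERT.
                return secondsUnreachable >= alertWaitSeconds
                        ? Status.ALERT
                        : Status.DISCONNECTED;
        }
    }

    // VM HA is only started once the host is considered DOWN.
    static boolean shouldStartVmHa(Status status) {
        return status == Status.DOWN;
    }
}
```

   In this sketch a host in ALERT never triggers VM HA by itself; it only 
leaves ALERT when a later investigation reports UP or DOWN.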
   
   **VirtualNetworkApplianceManagerImpl.java**
   The file VirtualNetworkApplianceManagerImpl.java is edited for a related VM 
HA problem.  When a host is determined to be DOWN, CloudStack attempts to VM HA 
any affected routers.  The problem is that when the host is determined to be 
DOWN by the code referenced above, the host may not actually be DOWN.  On KVM, 
for example, the host is considered DOWN if the agent is stopped on the KVM 
host for too long.  In that case, the VMs could still be running just fine...  
However, when we think the host is DOWN, VM HA runs on the router, and as part 
of that it cleans up the router and unallocates its 169.x.x.x control IP.  
After the cleanup, it tries to power on the router on another host, and as 
part of that it allocates a NEW 169.x.x.x control IP and writes that to the 
DB.  However, since the router isn't actually down (we just think the host is 
down), the VM HA fails because the vRouter is still running on the problem 
host.
   
   Next, in this example, when the host agent is back online again, it sends a 
power report to the management servers, and the management servers think the 
router was powered on OOB (out of band).  However, the GUI will not show a 
control IP for the vRouter, and the DB will have the NEW control IP it tried to 
allocate during the failed VM HA event.  This leaves us unable to communicate 
with the vRouter.
   
   This PR does a simple check that we can still communicate with the vRouter 
after any OOB power-on occurs.  If we can, then we have the correct control IP 
in the DB and we're good - so we do nothing.  If we can't communicate with the 
vRouter after the OOB power-on, we do a reboot of the vRouter to fix it.
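   
   The check described above can be sketched roughly as follows.  This is not 
the code from VirtualNetworkApplianceManagerImpl.java: the class, the enum, 
and the reachability predicate are hypothetical stand-ins for whatever the 
management server actually uses to talk to the vRouter.

```java
// Hypothetical sketch of the post power-on check; names are illustrative only.
final class RouterOobCheckSketch {
    /** What to do with the vRouter after an out-of-band power-on report. */
    enum Action { NONE, REBOOT }

    private RouterOobCheckSketch() {}

    // controlIpInDb: the control IP currently stored in the DB for the vRouter.
    // canReach:      assumed probe that returns true if the vRouter answers on that IP.
    static Action afterOutOfBandPowerOn(String controlIpInDb,
                                        java.util.function.Predicate<String> canReach) {
        if (controlIpInDb != null && canReach.test(controlIpInDb)) {
            // The DB holds the control IP the vRouter is really using: nothing to do.
            return Action.NONE;
        }
        // The DB and the running vRouter disagree, so a reboot is used to
        // bring the vRouter back in line with the DB.
        return Action.REBOOT;
    }
}
```

   Passing the probe in as a predicate keeps the decision separate from the 
actual network call, which makes the two outcomes easy to exercise in a test.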



[jira] [Commented] (CLOUDSTACK-10246) VM HA issues

2018-02-14 Thread Sean Lair (JIRA)

[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363609#comment-16363609
 ] 

Sean Lair commented on CLOUDSTACK-10246:


I am also having this issue.  There was a change to AgentManagerImpl.java a 
while back that seems to have broken this for KVM.  I'll be working on it this 
week or maybe next.
