Re: [4.11] Management to VR connection issues
On 02/26/2018 12:41 PM, Rohit Yadav wrote:

> - If waiting for ssh and apache2 as part of post-init solves the issue, this
>   would require a new systemvmtemplate as the systemd scripts cannot be changed
>   or take effect during first boot.

Waiting for ssh was not the issue; it was a symptom. The issue was cloud-postinit hanging in p.wait() when there are a lot of iptables rules. But this is already addressed, so it should be fine.

systemctl list-jobs now shows "no pending jobs", so the boot has completed. After that the VR should be accessible by SSH (port 3922) from the management server, right? But it is not. Did you see the changes after a reboot? (Please compare the screenshots of the ip addr output I sent.) After that reboot/network change, SSH works...

> - I think the additional nics always used to show up for vmware, there is a
>   global setting to configure this (extra nics for vmware, probably because
>   older versions did not support dynamic nic addition on vmware vrs).

On 4.5.2, we only see 4 NICs. In 4.11 we see 5 of them. We were just wondering if this could result in an issue. What global setting would that be?

> - For VR timeouts, see logs and check if from management server host you're
>   able to SSH into the VR using the private IP and port 3922. See the
>   troubleshooting wiki:
>   https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM%2C+templates%2C+Secondary+storage+troubleshooting

Yes, after a manual reboot of the VR we can SSH in, as I wrote. Without a reboot of the VR, we get "no route to host". So it seems not even an ARP ping is working.

> - Can you share/check which processes are consuming the RAM, 256MB ram is
>   usually enough for non-redundant VRs. (share output of top or check using
>   htop?). Make sure to use a latest Linux version (any Debian variant such as
>   Debian 8, 9 or Ubuntu 16.04+ may also work). The issue is vCenter/ESXi 6.5
>   for some reason, gives lower RAM compared to 6.0 and 5.5 and has poor support
>   for legacy os. I had faced/found this issue while testing redundant VRs which
>   take more RAM usually than normal VRs.

We are using the ShapeBlue VR template (your template ;)).

There is a man page for the tmpfs settings: https://manpages.debian.org/stretch/initscripts/tmpfs.5.en.html

Unfortunately, only an fstab entry worked for me; setting it in /etc/default/tmpfs didn't. See https://github.com/apache/cloudstack/pull/2468/commits/bd882a8f80763595a89a3b74330500e1965bfda3
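A note on the p.wait() hang mentioned above: Python's subprocess documentation warns that Popen.wait() can deadlock when the child writes to a pipe that nobody is reading, which would fit a hang that only shows up with a very large iptables rule set. The sketch below illustrates that pattern and the usual way around it (communicate()). It is written under that assumption and is not the code from cloud-postinit or from the fix; run_and_capture and noisy_child are hypothetical names, and the "rule" lines are made-up filler.

```python
import subprocess
import sys


def run_and_capture(cmd, timeout=None):
    """Run cmd and return (returncode, stdout, stderr).

    communicate() keeps draining stdout and stderr while the child runs,
    so the child cannot block on a full pipe buffer the way it can when
    the parent only calls wait() on a process started with stdout=PIPE.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Give up on the child, then collect whatever it wrote so far.
        proc.kill()
        out, err = proc.communicate()
    return proc.returncode, out, err


def noisy_child(lines):
    """Hypothetical stand-in for a command that prints a very long rule list."""
    code = "for i in range({}): print('-A INPUT -j ACCEPT', i)".format(lines)
    return [sys.executable, "-c", code]
```

Calling run_and_capture(noisy_child(50000)) returns once the child has exited, even though the output is far larger than a typical pipe buffer.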
Re: [4.11] Management to VR connection issues
Hi Rene,

- On the general issue of slow iptables rule application, I think we need to fix that. Does it help to increase the aggregation timeouts?

- If waiting for ssh and apache2 as part of post-init solves the issue, this would require a new systemvmtemplate, as the systemd scripts cannot be changed or take effect during first boot.

- I think the additional nics always used to show up for VMware. There is a global setting to configure this (extra nics for VMware, probably because older versions did not support dynamic nic addition on VMware VRs).

- For VR timeouts, see the logs and check whether you're able to SSH into the VR from the management server host using the private IP and port 3922. See the troubleshooting wiki: https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM%2C+templates%2C+Secondary+storage+troubleshooting

- Can you share/check which processes are consuming the RAM? 256MB RAM is usually enough for non-redundant VRs (share the output of top or check using htop?). Make sure to use a recent Linux version (any Debian variant such as Debian 8, 9 or Ubuntu 16.04+ may also work). The issue is that vCenter/ESXi 6.5, for some reason, gives lower RAM compared to 6.0 and 5.5 and has poor support for legacy OSes. I had faced/found this issue while testing redundant VRs, which usually take more RAM than normal VRs.

- Rohit
<https://cloudstack.apache.org>

rohit.ya...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue

From: Rene Moser <m...@renemoser.net>
Sent: Monday, February 26, 2018 11:22:27 AM
To: users@cloudstack.apache.org; d...@cloudstack.apache.org
Subject: Re: [4.11] Management to VR connection issues

Hi again

We found the main problem.

[...]
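The reachability check suggested above (connecting from the management server host to the VR's private IP on port 3922) can be scripted. A minimal sketch, assuming that plain TCP reachability is what needs testing; can_connect is a hypothetical helper, not part of CloudStack, and it says nothing about whether SSH authentication would succeed.

```python
import socket


def can_connect(host, port=3922, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "No route to host".
        return False
```

Run on the management server with the VR's private IP as host. A False result before the manual reboot and True afterwards would match the "no route to host" behaviour described in this thread.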
RE: [4.11] Management to VR connection issues
Rene,

Have you checked the OS getting applied on vCenter? A lot of the issues went away once I changed the OS when testing over the weekend.

Kind regards,

Paul Angus

paul.an...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue

-----Original Message-----
From: Rene Moser [mailto:m...@renemoser.net]
Sent: 26 February 2018 10:22
To: users@cloudstack.apache.org; d...@cloudstack.apache.org
Subject: Re: [4.11] Management to VR connection issues

Hi again

We found the main problem.

[...]
Re: [4.11] Management to VR connection issues
Hi again

We found the main problem.

== cloud-postinit hang

With many iptables rules, cloud-postinit hangs for 10 min until it is killed by systemd. As a result, the ssh daemon was not started for 10 min, because it is configured to start after cloud-postinit.

It seems the issue was already fixed by https://github.com/apache/cloudstack/commit/ce67726c6d3db6e7db537e76da6217c5d5f4b10e

== VR still needs manual reboot

However, we still notice adapter changes after a reboot: see the before/after screenshots of "ip addr" in https://photos.app.goo.gl/9XsjOJjLqQ9SRjYV2.

We still need to manually reboot the VR to make the network actually work.

== VR has too many adapters?

The next thing we noticed is that there are many network adapters (NICs) for this non-VPC router (see the vCenter screenshot in https://photos.app.goo.gl/9XsjOJjLqQ9SRjYV2). Adapters 4 and 5 seem unnecessary. Any comments on that?

== VR with 256 MB RAM does not work

The next issue we found is that the VR must have more than 256 MB RAM. Otherwise systemd complains that the daemon cannot be reloaded, because the RAM disk at /run has too little space.

Feb 23 16:24:36 r-413-VM postinit.sh[1089]: Failed to reload daemon: Refusing to reload, not enough space available on /run/systemd. Currently, 8.6M are free, but a safety buffer of 16.0M is enforced.

root@r-413-VM:~# df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            16M  7.2M  8.7M  46% /run

Increasing to 512 MB RAM helped:

root@r-413-VM:~# df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            41M  7.8M   34M  19% /run

Unsure if this can be tuned at the systemd level; we didn't find a way yet.

== VR API command timeouts

When executing commands related to the VR, e.g. restart network or start/stop router, the command does not reach the vCenter API and times out. We are not yet sure why.

== VR minor fixes

Along the way we fixed 2 minor things:

* rsyslogd config syntax issue
* IMHO we should also start apache2 after cloud-postinit

Also see https://github.com/apache/cloudstack/pull/2468

Regards
René
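The /run numbers above can be checked without reading df output by eye. A minimal sketch, assuming the 16.0M in the error message means 16 MiB and that the check is a simple "free space at least as large as the buffer" comparison (the exact comparison systemd uses is not stated in this thread); free_bytes and can_reload are hypothetical helper names.

```python
import os

# The safety buffer named in the "Refusing to reload" message, read as 16 MiB.
SAFETY_BUFFER = 16 * 1024 * 1024


def free_bytes(path="/run"):
    """Free space, in bytes, available on the filesystem that holds path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def can_reload(free, buffer=SAFETY_BUFFER):
    """True if the free space is at least as large as the safety buffer."""
    return free >= buffer
```

With the values from the two df outputs, can_reload(int(8.6 * 1024 * 1024)) is False (the 256 MB VR) and can_reload(34 * 1024 * 1024) is True (the 512 MB VR). On a VR, can_reload(free_bytes("/run")) gives the live answer.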
Re: [4.11] Management to VR connection issues
Hi Rene,

Paul is correct. For the default VMware systemvm I had fixed it here:

<https://github.com/apache/cloudstack/blob/master/engine/schema/src/main/resources/META-INF/db/schema-41000to41100.sql#L403>
https://github.com/apache/cloudstack/blob/4.11/engine/schema/resources/META-INF/db/schema-41000to41100.sql#L403

But the above would have worked only for new installations. For upgraded ones, we'll need to fix the release notes to ask users/admins to select 'Other Linux 64-bit'. Can you try that and share if that works for you?

I also checked: we're still using the 6.0 SDK jars. That needs to be fixed as well.

- Rohit

From: Paul Angus
Sent: Sunday, February 25, 2018 8:57:55 AM
To: d...@cloudstack.apache.org; users@cloudstack.apache.org
Cc: Rohit Yadav
Subject: RE: [4.11] Management to VR connection issues

Hey Rene.

[...]
RE: [4.11] Management to VR connection issues
Hey Rene.

Can you check the OS type that has been applied to your system VM template? I found that mine were coming up as 32-bit Debian 5, making them go REALLY slow, and if there are rules applied to the firewall it takes forever to provision. Switching the guest OS fixed it.

If you use Linux (other 64), which is guestos_id 99, they run properly.

I suspect that the VMware 6.5 mappings are failing when they aren't supported by the 6.0 SDK which we use, but I'll need to get that verified.

I think that we should have an 'ACS SystemVM' guest OS, which we can map to the best-performing guest OS for each hypervisor version.

-----Original Message-----
From: Rene Moser [mailto:m...@renemoser.net]
Sent: 22 February 2018 16:27
To: users@cloudstack.apache.org; d...@cloudstack.apache.org
Subject: Re: [4.11] Management to VR connection issues

On 02/20/2018 08:04 PM, Rohit Yadav wrote:

[...]
Re: [4.11] Management to VR connection issues
On 02/20/2018 08:04 PM, Rohit Yadav wrote:

> Hi Rene,
>
> Thanks for sharing - I've not seen this in test/production environment yet.
> Does it help to destroy the VR and check if the issue persists? Also, is this
> behaviour system-wide for every VR, or VRs of specific networks or topologies
> such as VPCs? Are these VRs redundant in nature?

We have non-redundant VRs, and we haven't looked at VPC routers yet.

The current analysis shows the following:

1. We start the process to upgrade an existing router.

2. The router gets destroyed and re-deployed with the new 4.11 template, as expected.

3. The router OS has started, but the ACS router state stays "starting". When we log in via the console, we see some actions in the cloud.log. At this point, the router is left in this state and gets destroyed after the job timeout.

4. We reboot manually at the OS level. The VR gets rebooted.

5. After the OS has booted, the ACS router state switches to "Running".

6. We can log in by ssh. However, the ACS router still shows "requires upgrade" (but the OS has already booted with the 4.11 template).

7. When we upgrade, the same process happens again (points 1-3). It feels like a deadlock.

Logs: https://transfer.sh/DdTtH/management-server.log.gz

We are continuing our investigations.

Regards
René
Re: [4.11] Management to VR connection issues
Hi Rene,

Thanks for sharing - I've not seen this in a test/production environment yet. Does it help to destroy the VR and check if the issue persists? Also, is this behaviour system-wide for every VR, or only for VRs of specific networks or topologies such as VPCs? Are these VRs redundant in nature?

4.11+ VRs are systemd-enabled and don't reboot after patching, which is a major difference between 4.9 and 4.11 systemvms/VRs. To make this work for VMware when the nics come up, we use a hack (that has been followed since at least 4.6+) to ping the interfaces/gateways:

https://github.com/apache/cloudstack/blob/4.11/systemvm/debian/opt/cloud/bin/setup/common.sh#L335

After nic/mac-address changes/configuration, 4.9 and previous VRs used to reboot (i.e. 4.9 and previous VRs on VMware used to reboot twice, once after patching and once more to reconfigure nic-mac assignments). 4.11+ VRs don't reboot at all but use udevadm for nic/mac/interface configuration:

https://github.com/apache/cloudstack/blob/4.11/systemvm/debian/opt/cloud/bin/setup/router.sh#L62

So you may try two tests and see if either makes any difference with respect to the above-mentioned code: (a) increase the timeout/ping retries, and (b) reboot after the udev/mac-address configuration (which would only require rebuilding the systemvm.iso file and scp-ing it to the secondary storage in your test environment).

Finally, if you can share logs or other details about the test setup and environment, I can help you with some investigations.

- Rohit

From: Rene Moser <m...@renemoser.net>
Sent: Tuesday, February 20, 2018 1:46:02 PM
To: users@cloudstack.apache.org; d...@cloudstack.apache.org
Subject: [4.11] Management to VR connection issues

Hi

We upgraded from 4.9 to 4.11. VMware 6.5.0. (Testing environment).

[...]
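Test (a) above, increasing the timeout/ping retries, amounts to a bounded retry loop around a reachability probe. The real logic lives in the shell script common.sh referenced in this thread and is not reproduced here; the following is only a sketch of the general pattern, with wait_until and ping_once as hypothetical names and the retry count and delay as made-up defaults.

```python
import subprocess
import time


def wait_until(probe, retries=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `retries` times.

    Returns the 1-based attempt number that succeeded, or 0 if none did.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(delay)
    return 0


def ping_once(host):
    """One ICMP probe; assumes a Linux iputils-style ping binary.

    -c 1 sends a single echo request, -W 1 waits at most one second.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, wait_until(lambda: ping_once(gateway_ip), retries=30, delay=2.0), with gateway_ip standing in for the gateway address, would keep probing for roughly a minute before giving up. Raising retries or delay is the knob that test (a) is about.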
[4.11] Management to VR connection issues
Hi

We upgraded from 4.9 to 4.11. VMware 6.5.0. (Testing environment).

The VR upgrade went through, but we noticed that the communication between the management server and the VR is not working properly.

We do not yet fully understand the issue. One thing we noted is that the network configs seem not to be bound to the same interfaces after every reboot. As a result, after one reboot you may be able to connect to the VR by SSH, and after another reboot you can't anymore.

The network name eth0 switched from NIC id 3 to 4 after a reboot.

The VR is kept in the "starting" state. Of course, as a consequence we get many issues related to this: no VM deployments (kept in starting state), VM expunging failures (cleanup fails), and so on.

Has anyone experienced similar issues?

Regards
René
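One way to confirm that interface names move between reboots is to record the name-to-MAC mapping on the VR before and after a reboot and compare the two. A minimal sketch, assuming a Linux guest that exposes /sys/class/net/<name>/address; read_nic_macs and moved_nics are hypothetical helpers, and the MAC addresses in the usage note are made up.

```python
import os


def read_nic_macs(sys_net="/sys/class/net"):
    """Map interface name -> MAC address as exposed by Linux sysfs."""
    macs = {}
    for name in sorted(os.listdir(sys_net)):
        try:
            with open(os.path.join(sys_net, name, "address")) as fh:
                macs[name] = fh.read().strip().lower()
        except OSError:
            # Entries without a readable address file are skipped.
            continue
    return macs


def moved_nics(before, after):
    """Return {mac: (old_name, new_name)} for MACs whose interface name changed."""
    old = {mac: name for name, mac in before.items()}
    new = {mac: name for name, mac in after.items()}
    return {
        mac: (old[mac], new[mac])
        for mac in old.keys() & new.keys()
        if old[mac] != new[mac]
    }
```

Save the result of read_nic_macs() before the reboot (for instance as JSON), load it afterwards, and pass both mappings to moved_nics. With the made-up mappings {"eth0": "02:00:00:00:00:03", "eth1": "02:00:00:00:00:04"} before and the two names swapped after, both MACs are reported as moved.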