Re: [4.11] Management to VR connection issues

2018-02-26 Thread Rene Moser


On 02/26/2018 12:41 PM, Rohit Yadav wrote:

> - If waiting for ssh and apache2 as part of post-init solves the issue, this 
> would require a new systemvmtemplate, because the systemd scripts cannot be 
> changed or take effect during first boot.

Waiting for ssh was not the issue; it was a symptom.

The issue was cloud-postinit hanging in p.wait() when there is a large
number of iptables rules. That is already addressed, so it should be fine.
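
For anyone following along, here is a minimal Python sketch of the general failure mode, not the actual cloud-postinit code. Calling wait() on a child whose stdout is a pipe can block forever once the child writes more than the pipe buffer holds, whereas communicate() drains the pipe while waiting. The helper name run_and_capture and the fake rule-dumping child command are assumptions for illustration only.

```python
import subprocess
import sys

# Hypothetical stand-in for a command that dumps thousands of firewall rules.
# It prints well over 64 KiB, more than a typical Linux pipe buffer holds.
NOISY_CMD = [
    sys.executable,
    "-c",
    "for i in range(5000): print('-A INPUT -s 10.0.%d.%d -j ACCEPT' % (i // 250, i % 250))",
]


def run_and_capture(cmd):
    """Run cmd and return (returncode, output lines) without a pipe deadlock."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # communicate() reads stdout until EOF and then reaps the child.
    # A bare p.wait() here could hang: the child blocks writing to a full
    # pipe while the parent blocks waiting for the child to exit.
    out, _ = p.communicate()
    return p.returncode, out.decode().splitlines()


if __name__ == "__main__":
    rc, lines = run_and_capture(NOISY_CMD)
    print(rc, len(lines))
```

The point of the sketch is only that the amount of output matters: with a handful of rules a bare wait() appears to work, and the hang shows up once the rule set grows.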

systemctl list-jobs now shows "no pending jobs", so the boot has
completed.

After that, the VR should be accessible by SSH (port 3922) from the
management server, right? But it is not.

Did you see the changes after a reboot? (Please compare the screenshots
of the ip addr output I sent.) After that reboot/network change, SSH
works...


> - I think the additional NICs always used to show up for VMware; there is a 
> global setting to configure this (extra NICs for VMware, probably because 
> older versions did not support dynamic NIC addition on VMware VRs).

On 4.5.2, we only see 4 NICs; in 4.11 we see 5 of them. We were just
wondering if this could cause an issue. What global setting would
that be?


> - For VR timeouts, see the logs and check whether you're able to SSH into 
> the VR from the management server host using the private IP and port 3922. 
> See the troubleshooting wiki: 
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM%2C+templates%2C+Secondary+storage+troubleshooting

Yes, after a manual reboot of the VR, we can SSH in, as I wrote. Without
a reboot of the VR, we get "no route to host". So it seems not even an
ARP ping is working.


> - Can you share/check which processes are consuming the RAM? 256 MB RAM is 
> usually enough for non-redundant VRs (share the output of top, or check using 
> htop). Make sure to use a recent Linux version (any Debian variant such as 
> Debian 8, 9 or Ubuntu 16.04+ may also work). The issue is that vCenter/ESXi 
> 6.5, for some reason, gives lower RAM compared to 6.0 and 5.5 and has poor 
> support for legacy OSes. I found this issue while testing redundant VRs, 
> which usually take more RAM than normal VRs.

We are using the ShapeBlue VR template (your template ;))

For the tmpfs size, the man page describes /etc/default/tmpfs:
https://manpages.debian.org/stretch/initscripts/tmpfs.5.en.html

Unfortunately, only an fstab entry worked for me; setting it in
/etc/default/tmpfs didn't.

https://github.com/apache/cloudstack/pull/2468/commits/bd882a8f80763595a89a3b74330500e1965bfda3








Re: [4.11] Management to VR connection issues

2018-02-26 Thread Rohit Yadav
Hi Rene,


- On the general issue of slow iptables rule application, I think we need to 
fix that. Does it help to increase the aggregation timeouts?


- If waiting for ssh and apache2 as part of post-init solves the issue, this 
would require a new systemvmtemplate, because the systemd scripts cannot be 
changed or take effect during first boot.


- I think the additional NICs always used to show up for VMware; there is a 
global setting to configure this (extra NICs for VMware, probably because older 
versions did not support dynamic NIC addition on VMware VRs).


- For VR timeouts, see the logs and check whether you're able to SSH into the 
VR from the management server host using the private IP and port 3922. See the 
troubleshooting wiki: 
https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM%2C+templates%2C+Secondary+storage+troubleshooting
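
As a rough illustration of the reachability check described above, the following Python sketch tries a plain TCP connection to the VR's SSH port from the management server host. Only port 3922 comes from this thread; the function name tcp_port_open and the timeout value are assumptions. It does not authenticate, it only tells you whether the port answers and, if not, which OS error came back (for example "No route to host").

```python
import socket


def tcp_port_open(host, port=3922, timeout_s=3.0):
    """Try a TCP connect; return (True, "connected") or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, "connected"
    except OSError as exc:
        # Covers connection refused, timeouts, unreachable hosts and
        # name resolution failures, which are all OSError subclasses.
        return False, str(exc)
```

Usage would be something like tcp_port_open("<VR private IP>"), run on the management server; the IP is whatever private address the VR has in your environment.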


- Can you share/check which processes are consuming the RAM? 256 MB RAM is 
usually enough for non-redundant VRs (share the output of top, or check using 
htop). Make sure to use a recent Linux version (any Debian variant such as 
Debian 8, 9 or Ubuntu 16.04+ may also work). The issue is that vCenter/ESXi 
6.5, for some reason, gives lower RAM compared to 6.0 and 5.5 and has poor 
support for legacy OSes. I found this issue while testing redundant VRs, which 
usually take more RAM than normal VRs.


- Rohit

<https://cloudstack.apache.org>





rohit.ya...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 



RE: [4.11] Management to VR connection issues

2018-02-26 Thread Paul Angus
Rene,
Have you checked the guest OS type being applied in vCenter?
A lot of the issues went away once I changed the OS when testing over the 
weekend.

Kind regards,

Paul Angus

paul.an...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 





Re: [4.11] Management to VR connection issues

2018-02-26 Thread Rene Moser
Hi again

We found the main problem.

== cloud-postinit hang

With many iptables rules, cloud-postinit hung for 10 min until it was
killed by systemd. As a result, the ssh daemon was not started for
10 min, because it is configured to start after cloud-postinit.

It seems the issue was already fixed by
https://github.com/apache/cloudstack/commit/ce67726c6d3db6e7db537e76da6217c5d5f4b10e

== VR still needs manual reboot

However, we still notice adapter changes after a reboot: see the
before/after screenshots of "ip addr" in
https://photos.app.goo.gl/9XsjOJjLqQ9SRjYV2. We still need to manually
reboot the VR to make the network actually work.

== VR has too many adapters?

Next, we noticed that there are many network adapters (NICs) for this
non-VPC router (see the screenshot of vCenter in
https://photos.app.goo.gl/9XsjOJjLqQ9SRjYV2). Adapters 4 and 5 seem
unnecessary. Any comments on that?

== VR with 256 MB RAM does not work

The next issue we found is that the VR must have more than 256 MB RAM.
Otherwise systemd complains that the daemon cannot be reloaded, because
the RAM disk at /run has too little space.

Feb 23 16:24:36 r-413-VM postinit.sh[1089]: Failed to reload daemon:
Refusing to reload, not enough space available on /run/systemd.
Currently, 8.6M are free, but a safety buffer of 16.0M is enforced.
root@r-413-VM:~# df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            16M  7.2M  8.7M  46% /run

Increasing to 512 MB RAM helped:

root@r-413-VM:~# df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            41M  7.8M   34M  19% /run

We are unsure whether this can be tuned at the systemd level; we haven't
found a way yet.
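
To check other VRs for the same condition, a small Python sketch like the one below could compare the free space on /run against the 16 MiB safety buffer quoted in the log message above. This only approximates the check with os.statvfs; it is not systemd's own code, and the function names are made up for illustration.

```python
import os

# Threshold taken from the log message ("a safety buffer of 16.0M is enforced").
SAFETY_BUFFER_BYTES = 16 * 1024 * 1024


def free_bytes(path="/run"):
    """Return the bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def reload_would_be_refused(free, safety_buffer=SAFETY_BUFFER_BYTES):
    """True if the free space is below the safety buffer quoted in the log."""
    return free < safety_buffer


if __name__ == "__main__":
    free = free_bytes("/run")
    print("free on /run: %.1f MiB, refused: %s"
          % (free / 2**20, reload_would_be_refused(free)))
```

With the numbers from the two df outputs, 8.7M available falls below the buffer and 34M does not, which matches what we saw with 256 MB and 512 MB RAM.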

== VR API Command timeouts

When executing commands related to the VR (e.g. restart network,
start/stop router), the command doesn't reach the vCenter API and times
out. We are not yet sure why.

== VR minor fixes

Along the way, we also fixed two minor things:

* rsyslogd config syntax issue
* IMHO we should start apache2 also after cloud-postinit

Also see https://github.com/apache/cloudstack/pull/2468

Regards
René


Re: [4.11] Management to VR connection issues

2018-02-25 Thread Rohit Yadav
Hi Rene,


Paul is correct; for the default VMware systemvm I had fixed it here:

<https://github.com/apache/cloudstack/blob/master/engine/schema/src/main/resources/META-INF/db/schema-41000to41100.sql#L403>

https://github.com/apache/cloudstack/blob/4.11/engine/schema/resources/META-INF/db/schema-41000to41100.sql#L403


But the above would have worked only for new installations. For upgraded ones, 
we'll need to fix the release notes to ask users/admins to select 'Other Linux 
64-bit'. Can you try that and share if that works for you?


I also checked: we're still using the 6.0 SDK jars. That needs to be fixed as 
well.


- Rohit

<https://cloudstack.apache.org>









RE: [4.11] Management to VR connection issues

2018-02-24 Thread Paul Angus
Hey Rene.

Can you check the OS type that has been applied to your system VM template?
I found that mine were coming up as 32-bit Debian 5, making them go REALLY slow, 
and if there are rules applied to the firewall it takes forever to provision. 
Switching the guest OS fixed it.

If you use Linux (other 64), which is guestos_id 99, they run properly.

I suspect that the VMware 6.5 mappings are failing when they aren't supported 
by the 6.0 SDK which we use, but I'll need to get that verified.



I think that we should have an 'ACS SystemVM' guest OS, which we can map to the 
best performing guest OS for each hypervisor version.









Re: [4.11] Management to VR connection issues

2018-02-22 Thread Rene Moser

On 02/20/2018 08:04 PM, Rohit Yadav wrote:
> Hi Rene,
> 
> 
> Thanks for sharing - I've not seen this in test/production environment yet. 
> Does it help to destroy the VR and check if the issue persists? Also, is this 
> behaviour system-wide for every VR, or VRs of specific networks or topologies 
> such as VPCs? Are these VRs redundant in nature?

We have non-redundant VRs, and we haven't looked at VPC routers yet.

The current analysis shows the following:

1. We started the process to upgrade an existing router.
2. The router gets destroyed and re-deployed with the new 4.11 template,
as expected.
3. The router OS has started, but the ACS router state stays "Starting".
When we log in by console, we see some activity in cloud.log. At this
point, the router is left in this state and gets destroyed after the job
timeout.
4. We reboot manually at the OS level. The VR gets rebooted.
5. After the OS has booted, the ACS router state switches to "Running".
6. We can log in by SSH. However, the ACS router still shows
"requires upgrade" (although the OS has already booted with the 4.11
template).
7. When we upgrade, the same process (points 1-3) happens again. It feels
like a deadlock.


Logs:
https://transfer.sh/DdTtH/management-server.log.gz

We are continuing our investigations.

Regards
René



Re: [4.11] Management to VR connection issues

2018-02-20 Thread Rohit Yadav
Hi Rene,


Thanks for sharing - I've not seen this in test/production environment yet. 
Does it help to destroy the VR and check if the issue persists? Also, is this 
behaviour system-wide for every VR, or VRs of specific networks or topologies 
such as VPCs? Are these VRs redundant in nature?


4.11+ VRs are systemd-enabled and don't reboot after patching, which is a major 
difference between 4.9 and 4.11 systemvms/VRs. To make this work for VMware 
when the NICs come up, we use a hack (in place since at least 4.6) to ping 
the interfaces/gateways:

https://github.com/apache/cloudstack/blob/4.11/systemvm/debian/opt/cloud/bin/setup/common.sh#L335

After NIC/MAC addresses were changed or configured, 4.9 and earlier VRs used to 
reboot (i.e. 4.9 and earlier VRs on VMware rebooted twice: once after patching 
and once more to reconfigure NIC-MAC assignments). 4.11+ VRs don't reboot at 
all, but use udevadm for NIC/MAC/interface configuration:

https://github.com/apache/cloudstack/blob/4.11/systemvm/debian/opt/cloud/bin/setup/router.sh#L62

So you may try two tests and see if either makes a difference with respect to 
the code mentioned above: (a) increase the timeout/ping retries, and (b) reboot 
after the udev/MAC address configuration (which would only require rebuilding 
the systemvm.iso file and scp-ing it to the secondary storage in your test 
environment).
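
To make test (a) concrete, here is a minimal Python sketch of a ping-with-retries loop. It is not the code in common.sh; the function names, the default retry count and the delay are all assumptions, and the ping flags are those of the Linux iputils ping. The idea is simply that raising retries or delay_s gives the NICs more time to come up before the VR gives up.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the Linux iputils ping; True if a reply arrived."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_reachable(host, retries=10, delay_s=1.0, probe=ping_once, sleep=time.sleep):
    """Probe host up to `retries` times; return the successful attempt number or None."""
    for attempt in range(1, retries + 1):
        if probe(host):
            return attempt
        if attempt < retries:
            sleep(delay_s)
    return None
```

The probe and sleep parameters are injectable only so the loop can be exercised without a network; in a real test you would call wait_until_reachable with the gateway IP and larger retries.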

Finally, if you can share logs or other details about the test setup and 
environment, I can help you with some investigations.


- Rohit






From: Rene Moser 
Sent: Tuesday, February 20, 2018 1:46:02 PM
To: users@cloudstack.apache.org; d...@cloudstack.apache.org
Subject: [4.11] Management to VR connection issues

Hi

We upgraded from 4.9 to 4.11. VMware 6.5.0. (Testing environment).

The VR upgrade went through. But we noticed that the communication between
the management server and the VR is not working properly.

We do not yet fully understand the issue. One thing we noted is that the
network configs seem not to be bound to the same interfaces after every
reboot. As a result, after one reboot you may be able to connect to the VR
by SSH, and after another reboot you can't anymore.

The network name eth0 switched from NIC id 3 to 4 after a reboot.
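
One way to capture this kind of change without screenshots is to record the interface-name-to-MAC mapping before and after a reboot and diff the two. The Python sketch below is an illustration only: the function names are made up, and it assumes a Linux guest where /sys/class/net/<name>/address holds each interface's MAC address.

```python
import os


def read_bindings(sysfs_root="/sys/class/net"):
    """Return {interface name: MAC address} as reported by Linux sysfs."""
    bindings = {}
    for name in sorted(os.listdir(sysfs_root)):
        try:
            with open(os.path.join(sysfs_root, name, "address")) as fh:
                bindings[name] = fh.read().strip().lower()
        except OSError:
            continue  # skip entries without a readable address file
    return bindings


def changed_bindings(before, after):
    """Return {name: (old MAC, new MAC)} for interfaces whose MAC differs."""
    return {
        name: (before[name], after[name])
        for name in sorted(before.keys() & after.keys())
        if before[name] != after[name]
    }
```

Saving the output of read_bindings() before the reboot and comparing it with a fresh call afterwards would show directly whether eth0 ended up on a different adapter.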

The VR is kept in the "starting" state. As a consequence, of course, we get
many related issues: no VM deployments (kept in starting state),
VM expunging failures (cleanup fails), and so on.

Has anyone experienced similar issues?

Regards
René
