Public bug reported:
If a host SmartOS server is rebooted and the metadata service is not
available, a KVM VM instance that uses cloud-init (via the SmartOS
datasource) will fail to start.
If the metadata agent on the host server is not available, the Python
code for cloud-init blocks forever waiting for data it will never
receive. This causes the boot process for an instance to hang on
cloud-init.
This is problematic whenever the metadata agent is unavailable, for
whatever reason, while a SmartOS KVM VM that relies on cloud-init is
booting.
From the engineer who worked on this (note that the svcadm command is
run on the host SmartOS server):
You can reproduce this by disabling the metadata service on the SmartOS
host:
svcadm disable metadata
and then boot a KVM VM running an Ubuntu Certified Cloud image such as:
c864f104-624c-43d2-835e-b49a39709b6b (ubuntu-certified-14.04 20150225.2)
When you do this, the VM's boot process will hang at cloud-init. If you
then start the metadata service, cloud-init will not recover.
One of our engineers who looked at this was able to cause forward
progress by applying this patch:
--- /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py.ori 2017-02-23 01:28:28.405885775 +0000
+++ /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py 2017-02-23 01:35:51.281885775 +0000
@@ -286,7 +286,7 @@
if not seed_device:
raise AttributeError("seed_device value is not set")
- ser = serial.Serial(seed_device, timeout=seed_timeout)
+ ser = serial.Serial(seed_device, timeout=10)
if not ser.isOpen():
raise SystemError("Unable to open %s" % seed_device)
which causes the following strace output:
[pid 2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or
TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 2119] write(5, "GET user-script\n", 16) = 16
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0
[pid 2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or
TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 2119] write(5, "GET iptables_disable\n", 21) = 21
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0
instead of:
[pid 1977] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 1977] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or
TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 1977] write(5, "GET base64_keys\n", 16) = 16
[pid 1977] select(6, [5], [], [], NULL
which you get without the patch (notice the NULL for the timeout
parameter). The code that gets blocked in this version of cloud-init is:
ser.write("GET %s\n" % noun.rstrip())
status = str(ser.readline()).rstrip()
in cloudinit/sources/DataSourceSmartOS.py. The ser.readline()
documentation says
(https://pyserial.readthedocs.io/en/latest/shortintro.html#readline):
Be careful when using readline(). Do specify a timeout when opening the
serial port otherwise it could block forever if no newline character is
received. Also note that readlines() only works with a timeout.
readlines() depends on having a timeout and interprets that as EOF (end
of file). It raises an exception if the port is not opened correctly.
which is exactly the situation we've hit here.
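The difference between the two strace outputs can be reproduced without
a serial port. The following is a minimal sketch, not cloud-init code:
the helper name wait_readable is hypothetical, and a pipe stands in for
/dev/ttyS1 purely for illustration. It shows that select() with a
timeout returns an empty result when nothing arrives, whereas passing
None (the NULL seen in the unpatched trace) would block until data
shows up.

```python
import os
import select


def wait_readable(fd, timeout):
    """Return True if fd becomes readable within `timeout` seconds.

    timeout=None corresponds to the NULL timeout in the unpatched strace
    and blocks until data arrives, which may be never.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    return bool(readable)


if __name__ == "__main__":
    # A pipe stands in for the serial device (assumption for illustration).
    read_fd, write_fd = os.pipe()
    try:
        # Nothing has been written yet, so this times out and returns.
        print(wait_readable(read_fd, 0.1))
        # Once the "metadata agent" answers, the descriptor is readable.
        os.write(write_fd, b"SUCCESS\n")
        print(wait_readable(read_fd, 0.1))
    finally:
        os.close(read_fd)
        os.close(write_fd)
```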
It might be better to keep a timeout but, when the timeout is hit, retry
the same GET rather than moving on to the next key. Only a missing
answer (because metadata is unavailable) should be retried; a negative
answer (NOTFOUND, for example) should not.
Once this is resolved, it should be possible to start a VM with cloud-
init and metadata disabled, and then enable metadata some time later and
have the boot process complete at that time.
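The retry behaviour suggested above could look roughly like the sketch
below. It is an illustration only, not the cloud-init implementation:
get_status_with_retry, MetadataUnavailable, max_attempts and delay are
hypothetical names, and the code is written for Python 3 (bytes on the
wire) rather than the Python 2.7 code in the patch. It assumes the port
was opened with a timeout, in which case pyserial's readline() returns
an empty result when the timeout expires.

```python
import time


class MetadataUnavailable(Exception):
    """Hypothetical error: no answer arrived within the allowed attempts."""


def get_status_with_retry(port, noun, max_attempts=5, delay=1.0,
                          sleep=time.sleep):
    """Send "GET <noun>" and return the status line, retrying on silence.

    `port` is any object with write() and readline(), for example a
    serial.Serial instance opened with a timeout.
    """
    request = ("GET %s\n" % noun.rstrip()).encode("ascii")
    for attempt in range(1, max_attempts + 1):
        port.write(request)
        status = port.readline().decode("ascii", "replace").rstrip()
        if status:
            # Any answer, including NOTFOUND, is final and is not retried.
            return status
        if attempt < max_attempts:
            # No answer at all: wait, then ask for the same key again.
            sleep(delay)
    raise MetadataUnavailable("no answer for %r after %d attempts"
                              % (noun, max_attempts))
```

A caller that wants the boot to complete whenever metadata comes back
would pick a large max_attempts; a bounded value avoids hanging forever.
Note that a late reply to an earlier attempt could be read as the answer
to a later one, so a real fix would probably also need to drain stale
input before resending.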
** Affects: cloud-init (Ubuntu)
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1667735
Title:
cloud-init doesn't retry metadata lookups and hangs forever if
metadata is down