"That doesn't really make sense. And re-trying timing out in 10 seconds and 
re-trying wouldn't really change anything."

Unfortunately, that's not the case for cloud-init in 14.04 (retrying
there would fix the hang but miss keys), and it is only correct for
cloud-init in 16.10 because that version has other problems.

With the Ubuntu 14.04 image's version of cloud-init (0.7.5-0ubuntu1.3)
the read() is blocking with no timeout. If there were a timeout and if
the request were correctly retried every 10 seconds, eventually things
would proceed once the metadata service became available. As it stands
now, these instances need to be externally rebooted when they become
hung as they will never recover on their own.

I've just tested 16.10 as well with cloud-init
0.7.8-68-gca3ae67-0ubuntu1~16.10.1. With that version the code has
changed significantly, but the problem is the same. Even though the
timeout has been added in the new version, its usefulness has been
negated since the new code gets itself stuck in an infinite loop when we
hit this case. You can also see some more details in:

https://smartos.org/bugview/IMAGE-1014

In my testing on 16.10, systemd did not kill cloud-init in the 30
minutes I waited.


"Its very arguable that the *right* thing to do is wait forever on the metadata 
service."

I agree with this. However, that's not what cloud-init is doing, in
part because cloud-init's implementation of the metadata specification
(https://eng.joyent.com/mdata/protocol.html) is incomplete. In
particular:

 * It uses V2 without doing NEGOTIATE
 * It uses the KVM serial port without reading all the data from the
   buffer before writing
 * It does not write '\n' and wait for 'invalid command\n'
 * When a read() times out, it tries a read() again instead of starting over

What happens with cloud-init in 16.10 is that if metadata is unavailable
when the instance boots, cloud-init will:

 1) write data into the socket (nobody's listening)
 2) do a select() on the socket looking for readable data (and time out
    after 10 seconds)
 3) goto 2

The loop between 2 and 3 becomes infinite because, even if metadata is
enabled at this point, cloud-init never attempts to send any commands to
it.
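
To illustrate why that loop cannot make progress, here is a small Python
sketch. It is not cloud-init's code: wait_for_reply_only() is a
hypothetical helper, the request payload is a placeholder, a socketpair
stands in for the serial port, and the lost write is simulated by
reading and discarding it on the far end.

```python
import select
import socket


def wait_for_reply_only(sock, timeout, max_polls):
    """Mimic steps 2 and 3: poll for readable data and never re-send.

    Returns the poll number on which data arrived, or None if it never did.
    max_polls exists only so the demonstration terminates.
    """
    for poll in range(1, max_polls + 1):
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            return poll
    return None


client, service = socket.socketpair()

# Step 1: the request is written while nobody is listening.
client.sendall(b"placeholder request\n")

# Stand-in for the request being lost: read it and throw it away.
service.recv(4096)

# The metadata service is now "up", but it has no request to answer,
# so the client can poll as long as it likes without seeing a reply.
result = wait_for_reply_only(client, timeout=0.1, max_polls=5)
print(result)
```

In the real loop there is no max_polls bound, so the wait never ends.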

If instead it were to:

 1) open the socket
 2) read on the socket (with a timeout) and discard any data
 3) write '\n'
 4) read on the socket for 'invalid command\n' (with a timeout; on
    timeout, close the socket and go to 1)
 5) NEGOTIATE V2

before making any queries, it would be able to recover once the metadata
service became available, even if the service was unavailable initially.
If you disable the metadata service and run mdata-get under strace, and
then enable metadata, you'll see that this is how mdata-get is able to
recover in this case.
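
The five steps above could look roughly like the following Python
sketch. This is not a patch for cloud-init: connect_metadata(),
read_line(), drain() and the open_channel callback are hypothetical
names, sockets stand in for the serial port, and the "V2_OK\n" reply to
NEGOTIATE is my reading of the protocol document and should be checked
against it.

```python
import select
import socket
import time


def read_line(sock, timeout):
    """Read up to and including a newline; return None on timeout or EOF."""
    buf = b""
    deadline = time.monotonic() + timeout
    while not buf.endswith(b"\n"):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            return None
        chunk = sock.recv(1)
        if not chunk:  # peer closed the channel
            return None
        buf += chunk
    return buf


def drain(sock, timeout):
    """Step 2: discard anything already buffered on the channel."""
    while select.select([sock], [], [], timeout)[0]:
        if not sock.recv(4096):
            break


def connect_metadata(open_channel, timeout=10.0, max_attempts=None):
    """Repeat the handshake until the service answers.

    max_attempts=None waits forever; a bound is useful only for testing.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        sock = open_channel()                    # step 1
        drain(sock, timeout)                     # step 2
        sock.sendall(b"\n")                      # step 3
        if read_line(sock, timeout) != b"invalid command\n":
            sock.close()                         # step 4: start over
            continue
        sock.sendall(b"NEGOTIATE V2\n")          # step 5
        if read_line(sock, timeout) == b"V2_OK\n":  # assumed reply string
            return sock
        sock.close()
    return None
```

The point of the sketch is that every timeout leads back to step 1, so
a fresh '\n' is written after the service comes up and the client
notices it.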

So in summary: I think we're in agreement that cloud-init should wait
forever for the metadata service, but both 14.04 and 16.10 have
different but related problems in their implementation which prevent
cloud-init from ever actually knowing when metadata has become
available. The consequence of this is that in both versions if metadata
is unavailable when cloud-init is first run, the VMs will hang until
rebooted externally.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1667735

Title:
  cloud-init doesn't retry metadata lookups and hangs forever if
  metadata is down
