Well, it's been a week, and I've had hundreds of successful deployments
(now on 2.0-beta11), whether i specify constraints or not.  I'm marking
this as invalid.

I don't have much to go on for how this fixed itself.  I didn't see
recent changes to cloud-init or the azure provider in late 2.0 betas,
and even if those existed, it wouldn't explain how
http://status.juju.solutions/ (running 1.25) also cleared up.  Just
another mystery I suppose.

** Changed in: cloud-init
       Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1588014

Title:
  cloud-init doesn't always complete on azure machines

Status in cloud-init:
  Invalid

Bug description:
  Sometimes (like 1 in 100 times), my Juju/Azure deployment hangs with a
  machine stuck in 'pending'. On these machines, cloud-init-output.log
  says:

  ...
  Setting up qemu-utils (2.0.0+dfsg-2ubuntu1.24) ...
  Setting up cloud-image-utils (0.27-0ubuntu9.2) ...
  Setting up cloud-utils (0.27-0ubuntu9.2) ...
  Processing triggers for libc-bin (2.19-0ubuntu6.9) ...

  
  And cloud-init.log says:

  ...
  Jun  1 16:44:11 machine-3 [CLOUDINIT] cloud-init[DEBUG]: Ran 19 modules with 
0 failures
  Jun  1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: Reading from 
/proc/uptime (quiet=False)
  Jun  1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: Read 12 bytes from 
/proc/uptime
  Jun  1 16:44:11 machine-3 [CLOUDINIT] util.py[DEBUG]: cloud-init mode 
'modules' took 48.449 seconds (48.45)

  
  This seems to be related to this failure in waagent.log:

  root@machine-3:/var/log# cat waagent.log 
  2016/06/01 16:43:07 Azure Linux Agent Version: WALinuxAgent-2.0.16
  2016/06/01 16:43:07 Linux Distribution Detected      : Ubuntu
  2016/06/01 16:43:07 Module 
/lib/modules/3.19.0-59-generic/kernel/drivers/ata/ata_piix.ko driver for ATAPI 
CD-ROM does not exist.
  2016/06/01 16:43:07 mount: block device /dev/sr0 is write-protected, mounting 
read-only
  2016/06/01 16:43:07 mount: you didn't specify a filesystem type for /dev/sr0
  2016/06/01 16:43:07        I will try type udf
  2016/06/01 16:43:07 mount: you didn't specify a filesystem type for /dev/sr0
  2016/06/01 16:43:07        I will try type udf
  2016/06/01 16:43:07 /dev/sr0 on /mnt/cdrom/secure type udf (ro)
  2016/06/01 16:43:07 mount succeeded on attempt #1
  2016/06/01 16:43:07 VMM Init script not found.  Provisioning for Azure
  2016/06/01 16:43:07 IPv4 address: 10.0.0.6
  2016/06/01 16:43:07 MAC  address: 00:0D:3A:71:79:7D
  2016/06/01 16:43:07 Probing for Azure environment.
  2016/06/01 16:43:07 DoDhcpWork: Setting socket.timeout=10, entering recv
  2016/06/01 16:43:07 Set default gateway: 10.0.0.1
  2016/06/01 16:43:07 Discovered Azure endpoint: 168.63.129.16
  2016/06/01 16:43:07 Fabric preferred wire protocol version: 2015-04-05
  2016/06/01 16:43:07 Negotiated wire protocol version: 2012-11-30
  2016/06/01 16:43:07 SetBlockDeviceTimeout: Update the device sda with timeout 
300
  2016/06/01 16:43:07 SetBlockDeviceTimeout: Update the device sdb with timeout 
300
  2016/06/01 16:43:08 Retrieved GoalState from Azure Fabric.
  2016/06/01 16:43:08 ExpectedState: Started
  2016/06/01 16:43:08 ContainerId: adea2037-ec24-4752-b6f0-0b53f7dee188
  2016/06/01 16:43:08 RoleInstanceId: 
2d0f948c-f310-4290-b3dc-bde08efeb3d9._machine-3
  2016/06/01 16:43:08 Public cert with thumbprint: 
0C420F72F3884BD2CE799D50B9C1AFDB1B0C1072 was retrieved.
  2016/06/01 16:43:18 ERROR:Can't find host key: /etc/ssh/ssh_host_rsa_key.pub
  2016/06/01 16:43:22 Finished processing ExtensionsConfig.xml
  2016/06/01 16:43:22 Successfully reported handler status
  2016/06/01 16:55:24 ERROR:Socket IOError The read operation timed out, 
args:('The read operation timed out',)
  2016/06/01 16:55:24 ERROR:Retry=0
  2016/06/01 16:55:24 ERROR:HTTP Req: HEAD 
https://p437b4r90zrnkr2vcf51fm9q.blob.core.windows.net/osvhds/machine-3.988b8739-235f-4b55-a0ce-9e896a7eb595.status?sv=2014-02-14&sr=b&sig=tEiFGWxfACZ8Y2%2FEF%2Bl6ocG9BdhSwgNRQj%2Btpr3JKHY%3D&se=9999-01-01T00%3A00%3A00Z&sp=rw
  2016/06/01 16:55:24 ERROR:HTTP Req: Data=None
  2016/06/01 16:55:24 ERROR:HTTP Req: Header={'x-ms-version': '2014-02-14', 
'x-ms-date': '2016-06-01T16:55:14Z'}
  2016/06/01 16:55:24 ERROR:HTTP Err: response is empty.

  
  I can wget the file that it's complaining about, though perhaps it's trying 
to get it earlier when it doesn't exist (or is empty).  Contents of that file 
look like this:

  
{"version":"1.0","timestampUTC":"2016-06-01T17:40:34Z","aggregateStatus":{"guestAgentStatus":{"version":"WALinuxAgent-2.0.16","status":"Ready","formattedMessage":{"lang
  ":"en-US","message":"GuestAgent is running and accepting new
  configurations."}},"handlerAggregateStatus":[]}}

  
  I've let machines sit like this for hours, but they don't seem to unstick 
themselves. My quick solution is to ssh onto a 'pending' machine and reboot it. 
 The subsequent boot triggers cloud-init and it finishes whatever it's supposed 
to do. Juju then takes over and all is good.

  Is a waagent failure (if that is indeed the problem) something that
  cloud-init can watch for and handle?  Logs coming shortly...

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1588014/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

Reply via email to