Hi all,

the issues with ppc64el autopkgtest should be resolved now. IS spent a ton of time yesterday analysing the situation, whacking virtual switches, rebooting hypervisors, and all that fun stuff.
The queue has made some progress, but it will still be a couple of days until ppc64el has caught up with the other architectures.

-- What happened

Already at the archive opening, ppc64el capacity was apparently reduced - we ran about 1 test per minute. Presumably this was due to issues in the bos02 cloud, where instances received configuration for two network interfaces despite only having one, and cloud-init then failed to bring up networking. The reason for this apparently was a load balancing issue that caused multiple nodes to allocate networking for the same servers.

Last Friday, bos01 started failing completely, as new hypervisor nodes had been marked active that were not yet ready, and all requests were being allocated to them. This bos01 outage coincided with me changing the script to reject broken bos02 machines more quickly, which led to some confusion on my side.

On Monday, I did some further investigation and tried to see whether I could boot an image and hack around cloud-init to bring up networking on one interface. This was not successful - even deleting the "down" interface and rebooting the server did not bring up networking, so I gave up.

On Tuesday, I noticed the reason for the bos01 failure: all new servers were being allocated on the hypervisor "cybelle.None", which looked odd. And it was - those were the new nodes that had been marked active before they were ready, as mentioned above. I also noticed that all instances on bos02 hung on the 'floette' node before moving to another one, which hopefully helped IS in digging out the issue. Initial whacking of OVS and other components on floette did not yield stable results, so IS later rebooted some hypervisor nodes, and the cloud seems to be stable once again.

-- Changes to autopkgtest-cloud

We can now monitor failures per cloud on the grafana dashboard[1], allowing us to debug issues more effectively and find out which cloud is broken :) BTW, I'm still looking for help with merging the "failed" and "successful" graphs into a single "failure rate" one - it might be impossible with our InfluxDB + Grafana combo, however. (A rough sketch of one idea is at the end of this mail.)

Server creation now rejects servers with two IP addresses immediately instead of waiting for SSH to time out on the first of them (see the second sketch at the end of this mail). If problems pop back up again, they will either be worked around faster, or, if they are fairly persistent, the workers will fail more often and stop, so the "workers in error state" KPI will increase.

[1] https://ubuntu-release.kpi.ubuntu.com/d/76Oe_0-Gz/autopkgtest?orgId=1
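-- Sketch: failure rate query

Regarding the failure-rate graph: if the pass and fail counts are recorded as fields of the same InfluxDB measurement, InfluxQL can divide the aggregates in one query; it only becomes genuinely impossible (with InfluxQL 1.x, at least) when the two counts live in separate measurements, which InfluxQL cannot join. A minimal sketch in Python - the measurement and field names ("autopkgtest", "passed", "failed", "cloud") are assumptions and would need adjusting to whatever the workers actually write:

    #!/usr/bin/python3
    # Sketch: derive a per-cloud failure rate from the two counters.
    # Measurement/field names below are hypothetical.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="metrics")

    # InfluxQL allows arithmetic between aggregates of one measurement,
    # so a single query can compute failed / (failed + passed) per cloud.
    query = (
        'SELECT sum("failed") / (sum("failed") + sum("passed")) '
        'AS failure_rate '
        'FROM "autopkgtest" WHERE time > now() - 1d '
        'GROUP BY time(1h), "cloud"'
    )
    for point in client.query(query).get_points():
        print(point)

The same SELECT should also work directly as a Grafana panel query once now() - 1d and time(1h) are replaced with Grafana's $timeFilter and $__interval variables.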
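-- Sketch: rejecting dual-address servers early

For the curious, this is roughly the shape of the fail-fast check described above, shown here with openstacksdk rather than the actual worker code; the cloud name, image, flavor and network IDs are placeholders:

    #!/usr/bin/python3
    # Sketch: reject instances that boot with two addresses right away,
    # instead of waiting minutes for SSH to time out on the broken one.
    # All IDs and names here are placeholders, not the real worker config.
    import sys
    import openstack

    conn = openstack.connect(cloud="bos02")  # assumes a clouds.yaml entry

    server = conn.compute.create_server(
        name="adt-ppc64el-example",
        image_id="REPLACE-WITH-IMAGE-UUID",
        flavor_id="REPLACE-WITH-FLAVOR-UUID",
        networks=[{"uuid": "REPLACE-WITH-NETWORK-UUID"}],
    )
    server = conn.compute.wait_for_server(server)

    # Exactly one network was requested, so a healthy instance has exactly
    # one address; more means we got the broken dual-interface setup.
    addresses = [a for addrs in server.addresses.values() for a in addrs]
    if len(addresses) != 1:
        conn.compute.delete_server(server)
        sys.exit("instance got %d addresses, rejecting it" % len(addresses))

Deleting the instance immediately is what gets a broken machine recycled within seconds instead of after a multi-minute SSH timeout.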