Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-31 Thread Noah Meyerhans
Our daily image builds include this fix as of yesterday.  I've confirmed
that it resolves the issue there.

I'll close this bug once we've published "release" images containing the
fix, which should occur with the buster update scheduled for this
weekend.

noah



Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-29 Thread kula
On 2020-07-29 08:01:19, Bastian Blank wrote:
> On Tue, Jul 28, 2020 at 03:40:14PM -0700, Noah Meyerhans wrote:
> > Actually, the problem seems to have been caused by
> > https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/192/
> > Prior to that MR, we weren't using systemd-growfs at all.
> 
> Prior to that we did not have any grow capability at all in several
> setups, so reverting is not really an option.

I'd say reverting is always an option.  In cases where behaviour becomes
erratic, I'd prefer to revert to the expected lack of a feature rather than
have inconsistency across a whole fleet of instances.

> > I've confirmed impact on amd64 instances as well as the arm64 instances
> > on which I originally observed it.  It also seems like this could impact
> > our images on other cloud services besides Amazon EC2, but I haven't
> > tested there.
> 
> I'll take a look later.  None of the instances I tested showed this
> behaviour.

Thanks, Waldi.
-- 

|_|0|_|  |
|_|_|0|  "Panta rei" |
|0|0|0|  kuLa    |

gpg --keyserver pgp.mit.edu --recv-keys 0x686930DD58C338B3
3DF1  A4DF  C732  4688  38BC  F121  6869  30DD  58C3  38B3




Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-29 Thread Noah Meyerhans
On Wed, Jul 29, 2020 at 08:04:54AM +0200, Bastian Blank wrote:
> | # /run/systemd/generator/systemd-growfs@-.service
> | # Automatically generated by systemd-fstab-generator
> | 
> | [Unit]
> | BindsTo=%i.mount
> | After=%i.mount
> | Before=shutdown.target local-fs.target
> 
> So it is an artefact of Debian being the only distro still starting the
> real system with a read-only / (which in turn makes first-boot mode
> unusable).

This is a systemd upstream issue, fixed in 244.1.  See
https://github.com/systemd/systemd/issues/14603 and the corresponding
fix in
https://github.com/systemd/systemd/commit/18e6e8635f06ac8d935ed5494ea65c6dac6af90f

The easiest solution for us would be to ship a drop-in configuration
fragment to supply the missing dependency.  I've opened a merge request
to do this at
https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/213

A more complete solution would be to ask the systemd maintainers to
backport the above commit to buster, but it's too late to get that
change into the upcoming point release (10.5).  With that in mind, I
propose that we ship the drop-in now, and pursue the systemd fix for
buster 10.6.  If we get the systemd fix, we can stop shipping the
drop-in.
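
A rough sketch of what such a drop-in could look like follows.  It is not
the content of the merge request above.  It assumes that the missing
dependency is an ordering after systemd-remount-fs.service (the unit that
remounts / according to fstab), and the file name 10-after-remount-fs.conf
and the write_growfs_dropin helper are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the drop-in file name and the chosen dependency are
# assumptions, not taken from the merge request or the upstream commit.

# write_growfs_dropin ROOT
# Writes a drop-in for systemd-growfs@-.service below ROOT.  ROOT would be
# "/" on a running system, or the root of an image being built.
write_growfs_dropin() {
    root=${1%/}
    dir="$root/etc/systemd/system/systemd-growfs@-.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-after-remount-fs.conf" <<'EOF'
[Unit]
# Assumed missing dependency: wait until / has been remounted read-write.
After=systemd-remount-fs.service
EOF
}

# Demonstration against a scratch directory, so nothing on the host changes.
demo_root=$(mktemp -d)
write_growfs_dropin "$demo_root"
cat "$demo_root/etc/systemd/system/systemd-growfs@-.service.d/10-after-remount-fs.conf"
```

On a live system the same function would be called with / as root, followed
by "systemctl daemon-reload" so that systemd picks up the new fragment.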

noah



Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-29 Thread Bastian Blank
On Tue, Jul 28, 2020 at 12:33:48PM -0400, Noah Meyerhans wrote:
> Jul 28 16:14:54 debian systemd-growfs[271]: Partition size 8455699968 is not a multiple of the blocksize 4096, ignoring 3584 bytes

That's normal; we may fix the initramfs grow code to make better
decisions.
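
For reference, the numbers in that log line can be reproduced with plain
integer arithmetic.  The sketch below only redoes the calculation from the
message; the helper names are made up, and it does not show how
systemd-growfs computes the values internally.

```shell
#!/bin/sh
# usable_blocks SIZE_BYTES BLOCK_SIZE: whole blocks that fit in the partition.
usable_blocks() { echo $(( $1 / $2 )); }

# ignored_bytes SIZE_BYTES BLOCK_SIZE: trailing bytes that do not fill a block.
ignored_bytes() { echo $(( $1 % $2 )); }

usable_blocks 8455699968 4096   # prints 2064379
ignored_bytes 8455699968 4096   # prints 3584
```

The block count matches the "2064379 blocks" in the resize failure quoted
below, so both messages refer to the same partition size.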

> Jul 28 16:14:54 debian systemd-growfs[271]: Failed to resize "/" to 2064379 blocks (ext4): Read-only file system

This is weird.  systemd-growfs@.service waits for the real mount, at
least on any systemd versions I looked at:

| # /run/systemd/generator/systemd-growfs@-.service
| # Automatically generated by systemd-fstab-generator
| 
| [Unit]
| BindsTo=%i.mount
| After=%i.mount
| Before=shutdown.target local-fs.target

So it is an artefact of Debian being the only distro still starting the
real system with a read-only / (which in turn makes first-boot mode
unusable).

Bastian

-- 
Lots of people drink from the wrong bottle sometimes.
-- Edith Keeler, "The City on the Edge of Forever",
   stardate unknown



Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-29 Thread Bastian Blank
On Tue, Jul 28, 2020 at 03:40:14PM -0700, Noah Meyerhans wrote:
> Actually, the problem seems to have been caused by
> https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/192/
> Prior to that MR, we weren't using systemd-growfs at all.

Prior to that we did not have any grow capability at all in several
setups, so reverting is not really an option.

> I've confirmed impact on amd64 instances as well as the arm64 instances
> on which I originally observed it.  It also seems like this could impact
> our images on other cloud services besides Amazon EC2, but I haven't
> tested there.

I'll take a look later.  None of the instances I tested showed this
behaviour.

Bastian

-- 
You!  What PLANET is this!
-- McCoy, "The City on the Edge of Forever", stardate 3134.0



Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-28 Thread Noah Meyerhans
Actually, the problem seems to have been caused by
https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/192/

Prior to that MR, we weren't using systemd-growfs at all.

I've confirmed impact on amd64 instances as well as the arm64 instances
on which I originally observed it.  It also seems like this could impact
our images on other cloud services besides Amazon EC2, but I haven't
tested there.

I'm a little surprised that nobody has reported this yet.  I suspect
that it's because although the service fails and the system is
"degraded", it generally is functional.  The people most likely to
experience problems because of this are relying on the filesystem to
successfully resize at launch; it could be that most are simply leaving
the root drive alone and attaching secondary EBS drives or EFS
filesystems for additional storage.

We should consider reverting that MR before we publish images for the
upcoming 10.5 release, unless we can identify and fix systemd-growfs by
then.



Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-28 Thread Noah Meyerhans
Having done a bit of testing, this definitely seems to be a regression.
I've launched 63 instances of the 10.3 AMI in us-east-1
(ami-031d1abcdcbbfbd8f) and 63 of the current 10.4 AMI
(ami-0bb15d03913335eae) on a variety of 6g arm64 instance types.  Exactly
zero of the 10.3 launches ended up in degraded state because of
systemd-growfs failures, while 37/63 (59%) of the 10.4 launches had
systemd-growfs fail.

There was a systemd update in 10.4, to version 241-7~deb10u4, which may
be related.
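
A comparison like the one above could be tallied with something along these
lines.  This is a sketch under assumptions: it expects one state per line on
stdin, for example the output of "systemctl is-system-running" collected from
each instance by whatever means, and the function names are made up.  It is
not the procedure that was actually used for the test.

```shell
#!/bin/sh
# count_degraded: reads one system state per line on stdin and prints
# "DEGRADED TOTAL".
count_degraded() {
    total=0
    degraded=0
    while IFS= read -r state; do
        total=$((total + 1))
        if [ "$state" = "degraded" ]; then
            degraded=$((degraded + 1))
        fi
    done
    echo "$degraded $total"
}

# failure_percent FAILED TOTAL: percentage rounded to the nearest integer.
failure_percent() {
    if [ "$2" -le 0 ]; then
        echo "total must be positive" >&2
        return 1
    fi
    echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

printf 'running\ndegraded\ndegraded\n' | count_degraded   # prints 2 3
failure_percent 37 63                                      # prints 59
```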



Bug#966451: cloud.debian.org: systemd-growfs@-.service fails on arm64 instance types

2020-07-28 Thread Noah Meyerhans
Package: cloud.debian.org
Severity: important
User: cloud.debian@packages.debian.org
Usertags: aws image

Testing the current buster releases for [mcr]6gd.* instance type support
reveals an issue that leads the instances to boot to "degraded" state in
systemd.  The failing unit is systemd-growfs@-.service:

admin@ip-172-31-13-14:~$ systemctl status systemd-growfs@-.service
● systemd-growfs@-.service - Grow File System on /
   Loaded: loaded (/run/systemd/generator/systemd-growfs@-.service; generated)
   Active: failed (Result: exit-code) since Tue 2020-07-28 16:14:54 UTC; 1min 2s ago
     Docs: man:systemd-growfs@.service(8)
  Process: 271 ExecStart=/lib/systemd/systemd-growfs / (code=exited, status=1/FAILURE)
 Main PID: 271 (code=exited, status=1/FAILURE)

Jul 28 16:14:54 debian systemd-growfs[271]: Partition size 8455699968 is not a multiple of the blocksize 4096, ignoring 3584 bytes
Jul 28 16:14:54 debian systemd-growfs[271]: Failed to resize "/" to 2064379 blocks (ext4): Read-only file system
Jul 28 16:14:54 debian systemd[1]: systemd-growfs@-.service: Main process exited, code=exited, status=1/FAILURE
Jul 28 16:14:54 debian systemd[1]: systemd-growfs@-.service: Failed with result 'exit-code'.
Jul 28 16:14:54 debian systemd[1]: Failed to start Grow File System on /.

This does not seem limited to the local-NVMe instance types (with the "d" in
the first field), as I have reproduced the issue on c6g.xlarge as well:

admin@ip-172-31-14-76:~$ systemctl status systemd-growfs@-.service
● systemd-growfs@-.service - Grow File System on /
   Loaded: loaded (/run/systemd/generator/systemd-growfs@-.service; generated)
   Active: failed (Result: exit-code) since Tue 2020-07-28 16:24:22 UTC; 1min 29s ago
     Docs: man:systemd-growfs@.service(8)
  Process: 256 ExecStart=/lib/systemd/systemd-growfs / (code=exited, status=1/FAILURE)
 Main PID: 256 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
admin@ip-172-31-14-76:~$ ec2metadata --instance-type
c6g.xlarge

Instances were tested in us-east-1.  I don't think this has always occurred
on the 6g arm64 instance types, but it seems to be reproducible
approximately 100% of the time right now.