Bug#885166: instability with 4.14 regarding KVM virtualization
On Wed, Feb 21, 2018 at 03:57:31PM +0100, Marc Haber wrote:
> I then applied 2a266f23550be997d783f27e704b9b40c4010292 to 4.14.19,
> resulting in 4.14.19+, and this one has been running flawlessly for two
> days (and 0 minutes, incidentally) now. I will stay on 4.14.19+ until
> 4.15.5 comes out (and will update then). If you don't hear from me again
> in this bug report, then you can safely assume that 4.14.19+ lived on
> the test machine until its planned decommissioning during the 4.15.5
> rollout.

I regret to say that one of the VMs running on the host running 4.14.19+
has just suffered a "BUG: soft lockup - CPU#1 stuck for 58s!" which is
atypical behavior for the issue I had previously, but might have a
common cause.

4.15.5 is currently building on my build host, and I'll move to that
kernel asap.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
Bug#885166: instability with 4.14 regarding KVM virtualization
On Mon, Feb 19, 2018 at 06:31:16AM +0100, Salvatore Bonaccorso wrote:
> On Sun, Feb 18, 2018 at 11:34:21AM +0100, Marc Haber wrote:
> > On Sun, Feb 18, 2018 at 10:15:43AM +0100, Salvatore Bonaccorso wrote:
> > > Looking today through the kernel archive, I noticed an answer from
> > > Paolo Bonzini, <62aa6b81-5456-07dc-cf64-e46747d3a...@redhat.com>,
> > > claiming this is fixed by
> > >
> > > https://git.kernel.org/linus/2a266f23550be997d783f27e704b9b40c4010292
> > > which is in 4.15-rc8, and thus confirming that you did not have the
> > > issue anymore in 4.15.
> >
> > ... unfortunately with a totally unexplaining commit message though.
> >
> > > Closing this bug with that version, but do you have a chance to
> > > confirm that?
> >
> > What exactly do you want me to test:
> >
> > - that the bug doesn't happen any more in Debian 4.15 kernels?
> > - that this bug still happens in Debian's 4.14 kernel and vanishes with
> >   the patch applied?
> > - Something else?
>
> In an optimal case we get a confirmation that
> 2a266f23550be997d783f27e704b9b40c4010292 is the fixing commit for your
> issues. But given we have uploaded 4.15.4-1 to unstable, if you can
> confirm that this one definitively fixes your issue that would be great
> (bonus if you can confirm your last non-working 4.14+commit fixes the
> issue as well).

If I used git correctly, then 4.14.20 already has
2a266f23550be997d783f27e704b9b40c4010292, so I tried 4.14.19.

4.14.19 on the one virt host that had the most violent failures failed
in the first hour of operation, but with a slightly different error
behavior than I was used to. I am therefore not sure whether we are
talking about multiple issues, one of them having been fixed somewhere
in between 4.14.13 and 4.14.19.

I then applied 2a266f23550be997d783f27e704b9b40c4010292 to 4.14.19,
resulting in 4.14.19+, and this one has been running flawlessly for two
days (and 0 minutes, incidentally) now.

I will stay on 4.14.19+ until 4.15.5 comes out (and will update then).
If you don't hear from me again in this bug report, then you can safely
assume that 4.14.19+ lived on the test machine until its planned
decommissioning during the 4.15.5 rollout.

Hope this helps.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
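Note on "If I used git correctly": a stable release does not contain a mainline commit under its original hash, because backports are new commits. By convention the backport's commit message cites the mainline hash, so searching the log between two stable tags for that hash is one plausible way to check. The following is a minimal sketch under assumptions: it expects a clone of the stable tree with tags named v4.14.19 and v4.14.20, and the helper name has_backport and the GIT override variable are made up for this example.

```shell
#!/bin/sh
# Sketch: does a range of stable commits mention a given mainline hash?
# GIT is a hypothetical override so the helper can be exercised without a repo.
GIT=${GIT:-git}

# has_backport UPSTREAM_SHA RANGE
# Succeeds if any commit message in RANGE mentions UPSTREAM_SHA.
has_backport() {
    upstream=$1
    range=$2
    [ -n "$("$GIT" log --oneline --grep="$upstream" "$range")" ]
}

# Example use inside a stable-tree clone (tag names are assumptions):
#   has_backport 2a266f23550be997d783f27e704b9b40c4010292 v4.14.19..v4.14.20 \
#       && echo "fix was backported in 4.14.20"
```

An empty result only shows that no commit message in the range cites the hash; it does not prove the change is absent.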
Bug#885166: instability with 4.14 regarding KVM virtualization
Hi Marc,

On Sun, Feb 18, 2018 at 11:34:21AM +0100, Marc Haber wrote:
> On Sun, Feb 18, 2018 at 10:15:43AM +0100, Salvatore Bonaccorso wrote:
> > Looking today through the kernel archive, I noticed an answer from
> > Paolo Bonzini, <62aa6b81-5456-07dc-cf64-e46747d3a...@redhat.com>,
> > claiming this is fixed by
> >
> > https://git.kernel.org/linus/2a266f23550be997d783f27e704b9b40c4010292
> > which is in 4.15-rc8, and thus confirming that you did not have the
> > issue anymore in 4.15.
>
> ... unfortunately with a totally unexplaining commit message though.
>
> > Closing this bug with that version, but do you have a chance to
> > confirm that?
>
> What exactly do you want me to test:
>
> - that the bug doesn't happen any more in Debian 4.15 kernels?
> - that this bug still happens in Debian's 4.14 kernel and vanishes with
>   the patch applied?
> - Something else?

In an optimal case we get a confirmation that
2a266f23550be997d783f27e704b9b40c4010292 is the fixing commit for your
issues. But given we have uploaded 4.15.4-1 to unstable, if you can
confirm that this one definitively fixes your issue that would be great
(bonus if you can confirm your last non-working 4.14+commit fixes the
issue as well).

Regards,
Salvatore
Bug#885166: instability with 4.14 regarding KVM virtualization
On Sun, Feb 18, 2018 at 10:15:43AM +0100, Salvatore Bonaccorso wrote:
> Looking today through the kernel archive, I noticed an answer from
> Paolo Bonzini, <62aa6b81-5456-07dc-cf64-e46747d3a...@redhat.com>,
> claiming this is fixed by
>
> https://git.kernel.org/linus/2a266f23550be997d783f27e704b9b40c4010292
> which is in 4.15-rc8, and thus confirming that you did not have the
> issue anymore in 4.15.

... unfortunately with a totally unexplaining commit message though.

> Closing this bug with that version, but do you have a chance to
> confirm that?

What exactly do you want me to test:

- that the bug doesn't happen any more in Debian 4.15 kernels?
- that this bug still happens in Debian's 4.14 kernel and vanishes with
  the patch applied?
- Something else?

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
Bug#885166: instability with 4.14 regarding KVM virtualization
Source: linux
Source-Version: 4.15~rc8-1~exp1

Hi Marc,

On Sun, Feb 11, 2018 at 02:44:44PM +0100, Marc Haber wrote:
> Hi,
>
> after in total nine weeks of bisecting, broken filesystems, service
> outages (thankfully on unimportant systems), 4.15 seems to have fixed the
> issue. After going to 4.15, the crashes never happened again.
>
> They have, however, happened with each and every 4.14 release I tried,
> which I stopped doing with 4.14.15 on Jan 28.
>
> This means, for me, that the issue is fixed and that I have just wasted
> nine weeks of time.
>
> For Debian, this means that there is a crippling, data-eating issue in
> the current long-term release kernel. I do sincerely hope that I never
> have to lay my eye on any 4.14 kernel again and hope that no major
> distribution will release with this version.

I'm sorry this was a frustrating triage, I can imagine.

Looking today through the kernel archive, I noticed an answer from
Paolo Bonzini, <62aa6b81-5456-07dc-cf64-e46747d3a...@redhat.com>,
claiming this is fixed by

https://git.kernel.org/linus/2a266f23550be997d783f27e704b9b40c4010292
which is in 4.15-rc8, and thus confirming that you did not have the
issue anymore in 4.15.

Closing this bug with that version, but do you have a chance to
confirm that?

Regards,
Salvatore
Bug#885166: instability with 4.14 regarding KVM virtualization
Hi,

after in total nine weeks of bisecting, broken filesystems, service
outages (thankfully on unimportant systems), 4.15 seems to have fixed the
issue. After going to 4.15, the crashes never happened again.

They have, however, happened with each and every 4.14 release I tried,
which I stopped doing with 4.14.15 on Jan 28.

This means, for me, that the issue is fixed and that I have just wasted
nine weeks of time.

For Debian, this means that there is a crippling, data-eating issue in
the current long-term release kernel. I do sincerely hope that I never
have to lay my eye on any 4.14 kernel again and hope that no major
distribution will release with this version.

Greetings
Marc

On Mon, Jan 08, 2018 at 09:53:14AM +0100, Marc Haber wrote:
> On Mon, Dec 25, 2017 at 10:02:48PM +, Ben Hutchings wrote:
> > Given that commit fb1522e099f0 was merged after -rc7 I assume it's an
> > important fix, though the commit message doesn't spell that out. So I
> > think that whenever bisect asks you to test a version that doesn't
> > contain it, you should cherry-pick it first to avoid the other bug.
>
> After two more weeks, I can now confirm that cherry-picking fb1522e099f0
> onto earlier 4.13-rc kernels makes things _WORSE_.
>
> I believe that
> - plain 4.13-rc4 works fine and allows stable usage of KVM
>   virtualization
> - 4.13-rc4 with cherry-picked fb1522e099f0 is bad, causing random
>   segfaults and file-system corruption in VMs under disk and CPU load
> - 4.13-rc5 is bad either way
>
> I will start bisecting between 4.13-rc4 and 4.13-rc5, hoping that this
> will yield better results. Since I have to wait three to five days to
> flag a kernel as "good" without too much doubt, and bisecting
> 4.13-rc4..4.13-rc5 will take "roughly 7 steps", this will take several
> weeks.
>
> I am open to additional suggestions.
>
> Greetings
> Marc
>
> -- 
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
Bug#885166: instability with 4.14 regarding KVM virtualization
On Mon, Dec 25, 2017 at 10:02:48PM +, Ben Hutchings wrote:
> Given that commit fb1522e099f0 was merged after -rc7 I assume it's an
> important fix, though the commit message doesn't spell that out. So I
> think that whenever bisect asks you to test a version that doesn't
> contain it, you should cherry-pick it first to avoid the other bug.

After two more weeks, I can now confirm that cherry-picking fb1522e099f0
onto earlier 4.13-rc kernels makes things _WORSE_.

I believe that
- plain 4.13-rc4 works fine and allows stable usage of KVM
  virtualization
- 4.13-rc4 with cherry-picked fb1522e099f0 is bad, causing random
  segfaults and file-system corruption in VMs under disk and CPU load
- 4.13-rc5 is bad either way

I will start bisecting between 4.13-rc4 and 4.13-rc5, hoping that this
will yield better results. Since I have to wait three to five days to
flag a kernel as "good" without too much doubt, and bisecting
4.13-rc4..4.13-rc5 will take "roughly 7 steps", this will take several
weeks.

I am open to additional suggestions.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
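Note on the time budget: the "roughly 7 steps" that git bisect reports, combined with three to five days of soak time per kernel, can be turned into a rough calendar estimate. The sketch below is only an approximation under stated assumptions: it models the step count as the smallest k with 2^k >= N for N candidate commits, which is close to but not necessarily identical with git's own estimate, and the function names bisect_steps and soak_days are invented for this example.

```shell
#!/bin/sh
# Rough planning helpers for a slow bisect. All inputs are illustrative.

# bisect_steps N: smallest k such that 2^k >= N (N = commits in the range).
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# soak_days STEPS DAYS_PER_STEP: total waiting time in days.
soak_days() {
    echo $(( $1 * $2 ))
}

# Example: 7 steps at 3 and at 5 days each.
#   soak_days 7 3
#   soak_days 7 5
```

With the figures from the message above, seven steps at three to five days each come to between 21 and 35 days, and that is before counting any step that has to be repeated.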
Bug#885166: instability with 4.14 regarding KVM virtualization
On Wed, 2017-12-27 at 12:18 +0100, Marc Haber wrote:
> On Mon, Dec 25, 2017 at 10:02:48PM +, Ben Hutchings wrote:
> > It's on a branch that started at 4.13-rc7 but wasn't merged into
> > mainline until after 4.13. Comparing the two of them, 569dbb88e80d has
> > the addition of commit fb1522e099f0 "KVM: update to new mmu_notifier
> > semantic v2". So I would guess that what you landed on is a different
> > bug than the one you were looking for.
>
> Ouch. Two bugs with such similar behavior in a single kernel release?
>
> > Given that commit fb1522e099f0 was merged after -rc7 I assume it's an
> > important fix, though the commit message doesn't spell that out. So I
> > think that whenever bisect asks you to test a version that doesn't
> > contain it, you should cherry-pick it first to avoid the other bug. (I
> > think you will then need to use 'git bisect good|bad HEAD^' after
> > testing, rather than implicitly flagging the current head commit.)
>
> Would
>     git show fb1522e099f0 | patch -p1
>     build/test
>     git reset --hard
>     git bisect good|bad
> be the same thing? I would feel much more comfortable with that.

Yes that should be equivalent.

Ben.

-- 
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.
Bug#885166: instability with 4.14 regarding KVM virtualization
Marc Haber wrote:
> On Mon, Dec 25, 2017 at 10:02:48PM +, Ben Hutchings wrote:
>> Given that commit fb1522e099f0 was merged after -rc7 I assume it's an
>> important fix, though the commit message doesn't spell that out. So I
>> think that whenever bisect asks you to test a version that doesn't
>> contain it, you should cherry-pick it first to avoid the other bug. (I
>> think you will then need to use 'git bisect good|bad HEAD^' after
>> testing, rather than implicitly flagging the current head commit.)
>
> Would
>     git show fb1522e099f0 | patch -p1
>     build/test
>     git reset --hard
>     git bisect good|bad
> be the same thing? I would feel much more comfortable with that.

Yes. (The main advantage of 'git cherry-pick' over that is that it
performs rename detection, but that shouldn't be relevant here. You can
similarly do

    git cherry-pick --no-commit fb1522e099f0
    build/test
    git reset --hard
    git bisect good|bad
)

Thanks,
Jonathan
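Note: the sequence suggested above can be wrapped into a single helper so that the order of operations is harder to get wrong. This is a sketch under assumptions, not something posted in the thread: the function name bisect_step, the GIT override variable and the idea of passing the build-and-test procedure as a command are all hypothetical, and the ancestry check with `git merge-base --is-ancestor` is an addition that skips the cherry-pick when the checked-out commit already contains the fix.

```shell
#!/bin/sh
# Sketch of one bisect round with a temporary fix applied.
# FIX is the commit named in the thread; GIT and the test command are
# hypothetical knobs introduced for this example.
FIX=fb1522e099f0
GIT=${GIT:-git}

# bisect_step CMD...: run CMD (the build-and-test procedure) on the commit
# that bisect checked out, with FIX applied on top if it is missing.
bisect_step() {
    # Only apply the fix when the checked-out commit does not contain it.
    if ! "$GIT" merge-base --is-ancestor "$FIX" HEAD; then
        "$GIT" cherry-pick --no-commit "$FIX" || return 1
    fi

    # A successful test means "good", anything else means "bad".
    if "$@"; then verdict=good; else verdict=bad; fi

    # Drop the temporary change so bisect marks the original commit.
    "$GIT" reset --hard || return 1
    "$GIT" bisect "$verdict"
}

# Example use (my-build-and-test.sh is a placeholder):
#   bisect_step ./my-build-and-test.sh
```

In the situation described in this thread a test round means days of running VMs under load, so in practice the steps would be run by hand; the function mainly documents the order: apply, test, reset, then mark.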
Bug#885166: instability with 4.14 regarding KVM virtualization
On Mon, Dec 25, 2017 at 10:02:48PM +, Ben Hutchings wrote:
> It's on a branch that started at 4.13-rc7 but wasn't merged into
> mainline until after 4.13. Comparing the two of them, 569dbb88e80d has
> the addition of commit fb1522e099f0 "KVM: update to new mmu_notifier
> semantic v2". So I would guess that what you landed on is a different
> bug than the one you were looking for.

Ouch. Two bugs with such similar behavior in a single kernel release?

> Given that commit fb1522e099f0 was merged after -rc7 I assume it's an
> important fix, though the commit message doesn't spell that out. So I
> think that whenever bisect asks you to test a version that doesn't
> contain it, you should cherry-pick it first to avoid the other bug. (I
> think you will then need to use 'git bisect good|bad HEAD^' after
> testing, rather than implicitly flagging the current head commit.)

Would
    git show fb1522e099f0 | patch -p1
    build/test
    git reset --hard
    git bisect good|bad
be the same thing? I would feel much more comfortable with that.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
Bug#885166: instability with 4.14 regarding KVM virtualization
On Mon, 2017-12-25 at 14:15 +0100, Marc Haber wrote:
[...]
> I tried bisecting the kernel between 4.13 and 4.14, but the results are
> inconclusive to me:
>
> - 569dbb88e80deb68974ef6fdd6a13edb9d686261 is good
> - ddf720f86efe38cb3ef88b2eaad9ea8ad7c6f798 is bad
> - ddf720f86efe38cb3ef88b2eaad9ea8ad7c6f798 was the result of the kernel
>   bisect between 4.13 and 4.14, but is a one-character typo fix in a
>   comment.
> - I am also confused that ddf720f86efe38cb3ef88b2eaad9ea8ad7c6f798 is in
>   4.13-rc7, therefore earlier than the "good" 4.13 release
[...]

It's on a branch that started at 4.13-rc7 but wasn't merged into
mainline until after 4.13. Comparing the two of them, 569dbb88e80d has
the addition of commit fb1522e099f0 "KVM: update to new mmu_notifier
semantic v2". So I would guess that what you landed on is a different
bug than the one you were looking for.

Given that commit fb1522e099f0 was merged after -rc7 I assume it's an
important fix, though the commit message doesn't spell that out. So I
think that whenever bisect asks you to test a version that doesn't
contain it, you should cherry-pick it first to avoid the other bug. (I
think you will then need to use 'git bisect good|bad HEAD^' after
testing, rather than implicitly flagging the current head commit.)

Ben.

-- 
Ben Hutchings
The world is coming to an end. Please log off.
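Note on the "Some good revs are not ancestors of the bad rev" error in the original report: git bisect expects the "good" revision to be an ancestor of the "bad" one. When the older revision is the broken one and the newer one works, the search is for the commit that fixed the problem, and git bisect supports that with custom terms via `--term-old` and `--term-new`. The sketch below assumes the revisions named in the report (4.13-rc7 broken, 4.13 working); the function name start_fix_hunt and the GIT override variable are invented for this example, and whether this would have led anywhere useful in this case is not established by the thread.

```shell
#!/bin/sh
# Sketch: start a bisect that looks for the commit that FIXED a problem.
# GIT is a hypothetical override so the helper can be exercised without a repo.
GIT=${GIT:-git}

# start_fix_hunt BROKEN_REV FIXED_REV
# BROKEN_REV is the older, failing revision; FIXED_REV the newer, working one.
start_fix_hunt() {
    broken_rev=$1
    fixed_rev=$2
    "$GIT" bisect start --term-old=broken --term-new=fixed || return 1
    "$GIT" bisect broken "$broken_rev" || return 1
    "$GIT" bisect fixed "$fixed_rev"
}

# Example with the hashes from the report:
#   start_fix_hunt cc4a41fe5541a73019a864883297bd5043aa6d98 \
#                  569dbb88e80deb68974ef6fdd6a13edb9d686261
# After each test round, mark the checked-out commit with
#   git bisect broken    or    git bisect fixed
```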
Bug#885166: instability with 4.14 regarding KVM virtualization
Package: src:linux
Version: 4.14.2-1
Severity: normal
Tags: upstream

Hi,

starting with kernel 4.14, the majority of my KVM virtualization hosts
has become unstable. This behavior has been present in every 4.14
kernel, whether self-compiled or the Debian kernel. I am reporting this
in Debian in hope that I can get more input here than I got on the
linux-kernel mailing list.

The issue happens on various Debian stable hosts (didn't try unstable
yet) with AMD and Intel CPUs, including:
- Model name: AMD GX-412TC SOC
- Model name: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
- Model name: Quad-Core AMD Opteron(tm) Processor 1389
- Model name: Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz

The symptoms appear more often when the system is under tight memory
conditions and/or KSM is enabled. Disabling KSM decreases the frequency
of the issue happening, but doesn't make it stop.

Going back to a 4.13 kernel makes all machines rock-stable again.

I also see this behavior in 4.15 release candidate kernels up to -rc4
(-rc5 test still pending).

Symptoms are (choose any combination):
- VMs hanging completely: no ping, no reaction on serial console
- VMs losing their storage: machine still pings, login not possible,
  no "password" prompt after entering user name on serial console
- reliable and reproducible segfault of certain binaries in the VM
  until the VM is restarted
- VM file systems being re-mounted r/o
- VM file systems being corrupted so that external fsck is necessary
- virsh shutdown not working for affected VM
- sometimes, even virsh destroy not working (hanging for minutes until
  Ctrl-C, sometimes error message, unfortunately not written down)
- host not rebooting cleanly, needing hardware reset

I tried bisecting the kernel between 4.13 and 4.14, but the results are
inconclusive to me:

- 569dbb88e80deb68974ef6fdd6a13edb9d686261 is good
- ddf720f86efe38cb3ef88b2eaad9ea8ad7c6f798 is bad
- ddf720f86efe38cb3ef88b2eaad9ea8ad7c6f798 was the result of the kernel
  bisect between 4.13 and 4.14, but is a one-character typo fix in a
  comment.
- I am also confused that ddf720f86efe38cb3ef88b2eaad9ea8ad7c6f798 is in
  4.13-rc7, therefore earlier than the "good" 4.13 release

In the second try, I tried bisecting between those two commits. This
quickly results in:

The merge base cc4a41fe5541a73019a864883297bd5043aa6d98 is bad.
This means the bug has been fixed between cc4a41fe5541a73019a864883297bd5043aa6d98 and [569dbb88e80deb68974ef6fdd6a13edb9d686261].

569dbb88e80deb68974ef6fdd6a13edb9d686261 is Linux 4.13 and is good
cc4a41fe5541a73019a864883297bd5043aa6d98 is Linux 4.13-rc7 and is bad.

Bisecting between those ends up in:

[6/4993]mh@fan:~/linux/git/bisect/linux ((v4.13) *|BISECTING) $ git bisect good
Some good revs are not ancestors of the bad rev.
git bisect cannot work properly in this case.
Maybe you mistook good and bad revs?
[5/4992]mh@fan:~/linux/git/bisect/linux ((v4.13) *|BISECTING) $ git bisect log
git bisect start
# bad: [cc4a41fe5541a73019a864883297bd5043aa6d98] Linux 4.13-rc7
git bisect bad cc4a41fe5541a73019a864883297bd5043aa6d98

What am I doing wrong here? Any idea what to do here?

Greetings
Marc

-- Package-specific info:
** Version:
Linux version 4.14.0-1-amd64 (debian-ker...@lists.debian.org) (gcc version 7.2.0 (Debian 7.2.0-16)) #1 SMP Debian 4.14.2-1 (2017-11-30)

** Command line:
BOOT_IMAGE=/vmlinuz-4.14.0-1-amd64 root=/dev/mapper/heel-root ro net.ifnames=1

** Not tainted

** Kernel log:
[ 10.160061] [drm] Driver supports precise vblank timestamp query.
[ 10.160422] i915 :00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 10.169463] input: ThinkPad Extra Buttons as /devices/platform/thinkpad_acpi/input/input8
[ 10.196551] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 10.200499] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: discard
[ 10.235958] systemd-journald[467]: Received request to flush runtime journal from PID 1
[ 10.237138] [drm] Initialized i915 1.6.0 20170818 for :00:02.0 on minor 0
[ 10.239282] 8021q: 802.1Q VLAN Support v1.8
[ 10.240463] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: discard
[ 10.249705] ACPI: Video Device [VID] (multi-head: yes rom: no post: no)
[ 10.252137] acpi device:00: registered as cooling_device5
[ 10.252233] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input9
[ 10.253318] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: discard
[ 10.265969] FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[ 10.448215] snd_hda_intel :00:1b.0: bound :00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 10.492251] IPv6: ADDRCONF(NETDEV_UP): enp0s25: link is not ready
[