Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 22.08.19 17:40, Max Reitz wrote:
> On 22.08.19 17:25, Max Reitz wrote:
>> On 22.08.19 14:09, Max Reitz wrote:
>>> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>>>
>>> On 22.08.19 13:27, Lukáš Doktor wrote:
>>>> On 21. 08. 19 at 19:51, Max Reitz wrote:
>>>>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>>>>> Hello guys,
>>>>>>
>>>>>> First attempt was rejected due to zip attachment, let's try it again
>>>>>> with just Avocado-vt debug.log and serial console log files attached.
>>>>>>
>>>>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>>>>> skip writing zero buffers to empty COW areas"
>>>>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>>>>>> at it?
>>>>>
>>>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>>>> corruption, but the installation fails because of some segfaults).
>>>>>
>>>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>>>> which is a pain... :-/
>>>>>
>>>>> It looks like the problem disappears when I configure qemu with
>>>>> “--disable-xfsctl”. Can you try that?
>>>>>
>>>>> Max
>>>>>
>>>>
>>>> Hello Max,
>>>>
>>>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works
>>>> well. Also looking at the option I understand why it only failed on
>>>> aarch64 for me: I don't have the libs installed on the other machines,
>>>> therefore it was disabled by "./configure" there. Anyway I guess
>>>> disabling it in my builds won't really fix the issue, right? :-)
>>>
>>> Thanks!
>>>
>>> No, it won’t, but it means the actual root of the problem is probably
>>> rather in some XFS-related code (be it because qemu uses it the wrong
>>> way or because of XFS kernel code) than in the pure qcow2 commit that
>>> made the problem surface by exercising it heavily. (Or in an
>>> interaction between the two.)
>>
>> OK, I got a simpler reproducer now:
>>
>> $ ./qemu-img create -f qcow2 test.qcow2 1M
>> $ (for i in $(seq 15 -1 0); do \
>>        echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
>>    done) \
>>   | ./qemu-io test.qcow2
>> $ for i in $(seq 0 15); do \
>>       echo $i; \
>>       ofs=$((i * 64)); \
>>       ./qemu-io -c "read -P 0 ${ofs}k 1k" \
>>                 -c "read -P 42 $((ofs + 1))k 62k" \
>>                 -c "read -P 0 $((ofs + 63))k 1k" \
>>                 test.qcow2 \
>>       | grep 'verification'; \
>>   done
>>
>> On XFS with --enable-xfsctl, this basically always gives me some
>> verification failure somewhere. (On tmpfs or with --disable-xfsctl, it
>> never fails.)
>>
>> So it seems to be related to I/O from back to front.
>>
>> (You can also reproduce it with a plain “qemu-img bench” invocation,
>> like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
>> (on, say, a 4 GB image), but then the failure appears much later in the
>> image, because you have to wait for some requests to come in reverse
>> (by chance) first.)
>
> The problem is the ftruncate() in xfs_write_zeroes(). It is possible
> for it to yield, then other requests come in, and the data they write
> may get discarded once the ftruncate() settles.

I’ve just sent a patch: “block/file-posix: Fix xfs_write_zeroes()”,
Message-ID <20190822162618.27670-1-mre...@redhat.com>:

https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg01148.html

Max
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 22.08.19 17:25, Max Reitz wrote:
> On 22.08.19 14:09, Max Reitz wrote:
>> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>>
>> On 22.08.19 13:27, Lukáš Doktor wrote:
>>> On 21. 08. 19 at 19:51, Max Reitz wrote:
>>>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>>>> Hello guys,
>>>>>
>>>>> First attempt was rejected due to zip attachment, let's try it again
>>>>> with just Avocado-vt debug.log and serial console log files attached.
>>>>>
>>>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>>>> skip writing zero buffers to empty COW areas"
>>>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>>>>> at it?
>>>>
>>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>>> corruption, but the installation fails because of some segfaults).
>>>>
>>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>>> which is a pain... :-/
>>>>
>>>> It looks like the problem disappears when I configure qemu with
>>>> “--disable-xfsctl”. Can you try that?
>>>>
>>>> Max
>>>>
>>>
>>> Hello Max,
>>>
>>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works
>>> well. Also looking at the option I understand why it only failed on
>>> aarch64 for me: I don't have the libs installed on the other machines,
>>> therefore it was disabled by "./configure" there. Anyway I guess
>>> disabling it in my builds won't really fix the issue, right? :-)
>>
>> Thanks!
>>
>> No, it won’t, but it means the actual root of the problem is probably
>> rather in some XFS-related code (be it because qemu uses it the wrong
>> way or because of XFS kernel code) than in the pure qcow2 commit that
>> made the problem surface by exercising it heavily. (Or in an
>> interaction between the two.)
>
> OK, I got a simpler reproducer now:
>
> $ ./qemu-img create -f qcow2 test.qcow2 1M
> $ (for i in $(seq 15 -1 0); do \
>        echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
>    done) \
>   | ./qemu-io test.qcow2
> $ for i in $(seq 0 15); do \
>       echo $i; \
>       ofs=$((i * 64)); \
>       ./qemu-io -c "read -P 0 ${ofs}k 1k" \
>                 -c "read -P 42 $((ofs + 1))k 62k" \
>                 -c "read -P 0 $((ofs + 63))k 1k" \
>                 test.qcow2 \
>       | grep 'verification'; \
>   done
>
> On XFS with --enable-xfsctl, this basically always gives me some
> verification failure somewhere. (On tmpfs or with --disable-xfsctl, it
> never fails.)
>
> So it seems to be related to I/O from back to front.
>
> (You can also reproduce it with a plain “qemu-img bench” invocation,
> like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
> (on, say, a 4 GB image), but then the failure appears much later in the
> image, because you have to wait for some requests to come in reverse
> (by chance) first.)

The problem is the ftruncate() in xfs_write_zeroes(). It is possible
for it to yield, then other requests come in, and the data they write
may get discarded once the ftruncate() settles.

Max
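
[Editorial illustration] The interleaving described in this message (a truncate that yields, other requests writing in the meantime, and their data being discarded once the truncate settles) can be replayed step by step in a toy shell script. This is only a sketch under stated assumptions and not QEMU's xfs_write_zeroes(): the function name replay_race, the 4096-byte sizes, and the use of the coreutils truncate utility in place of ftruncate() are all invented for the example, and the two "requests" are replayed sequentially instead of actually racing.

```shell
#!/bin/sh
# Toy model only: replays one hypothetical interleaving step by step.
# Nothing here is taken from QEMU's source.
replay_race() {
    f=$(mktemp) || return 1

    # Request A wants to zero bytes [0, 4096). The file is still empty, so
    # A decides to grow it to 4096 bytes (standing in for ftruncate()).
    planned_len=4096

    # Before A's truncate settles, request B writes 4096 bytes of data
    # at offset 4096, so the file is now 8192 bytes long.
    head -c 4096 /dev/zero | tr '\0' '*' |
        dd of="$f" bs=4096 seek=1 conv=notrunc 2>/dev/null
    before=$(( $(wc -c < "$f") ))

    # A's truncate now runs with its stale target length and cuts off
    # everything B wrote past it.
    truncate -s "$planned_len" "$f"
    after=$(( $(wc -c < "$f") ))

    rm -f "$f"
    echo "before=$before after=$after"
}

replay_race
```

In this replay the file shrinks from 8192 back to 4096 bytes, so request B's data is gone. That loosely mirrors the verification failures seen with the reproducer, although the real code path may differ in its details.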
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 22.08.19 14:09, Max Reitz wrote:
> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>
> On 22.08.19 13:27, Lukáš Doktor wrote:
>> On 21. 08. 19 at 19:51, Max Reitz wrote:
>>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>>> Hello guys,
>>>>
>>>> First attempt was rejected due to zip attachment, let's try it again
>>>> with just Avocado-vt debug.log and serial console log files attached.
>>>>
>>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>>> skip writing zero buffers to empty COW areas"
>>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>>>> at it?
>>>
>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>> corruption, but the installation fails because of some segfaults).
>>>
>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>> which is a pain... :-/
>>>
>>> It looks like the problem disappears when I configure qemu with
>>> “--disable-xfsctl”. Can you try that?
>>>
>>> Max
>>>
>>
>> Hello Max,
>>
>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works
>> well. Also looking at the option I understand why it only failed on
>> aarch64 for me: I don't have the libs installed on the other machines,
>> therefore it was disabled by "./configure" there. Anyway I guess
>> disabling it in my builds won't really fix the issue, right? :-)
>
> Thanks!
>
> No, it won’t, but it means the actual root of the problem is probably
> rather in some XFS-related code (be it because qemu uses it the wrong
> way or because of XFS kernel code) than in the pure qcow2 commit that
> made the problem surface by exercising it heavily. (Or in an
> interaction between the two.)

OK, I got a simpler reproducer now:

$ ./qemu-img create -f qcow2 test.qcow2 1M
$ (for i in $(seq 15 -1 0); do \
       echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
   done) \
  | ./qemu-io test.qcow2
$ for i in $(seq 0 15); do \
      echo $i; \
      ofs=$((i * 64)); \
      ./qemu-io -c "read -P 0 ${ofs}k 1k" \
                -c "read -P 42 $((ofs + 1))k 62k" \
                -c "read -P 0 $((ofs + 63))k 1k" \
                test.qcow2 \
      | grep 'verification'; \
  done

On XFS with --enable-xfsctl, this basically always gives me some
verification failure somewhere. (On tmpfs or with --disable-xfsctl, it
never fails.)

So it seems to be related to I/O from back to front.

(You can also reproduce it with a plain “qemu-img bench” invocation,
like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
(on, say, a 4 GB image), but then the failure appears much later in the
image, because you have to wait for some requests to come in reverse
(by chance) first.)

Max
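
[Editorial illustration] To spell out the offset arithmetic in the reproducer: the writes use a 64k stride (which is also qcow2's default cluster size), start 1k into each stride and cover 62k, so every stride should read back as 1k of zeroes, 62k of pattern 42, and 1k of zeroes. The helper below is a small sketch of that layout; the name cluster_layout and its output format are hypothetical, only the offsets come from the loops in the message.

```shell
#!/bin/sh
# Print the expected contents of 64k stride number $1 after the writes.
# Output format (kind, offset, length) is invented for illustration.
cluster_layout() {
    ofs=$(( $1 * 64 ))
    echo "zero ${ofs}k 1k"            # untouched head, checked with read -P 0
    echo "data $(( ofs + 1 ))k 62k"   # written by aio_write -P 42
    echo "zero $(( ofs + 63 ))k 1k"   # untouched tail, checked with read -P 0
}

# Example: the third stride (i = 2) covers 128k up to 192k.
cluster_layout 2
```

A verification failure in any of these three ranges means either that written data was lost or that an area that should have stayed zero was not zero.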
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
(CC-ing Paolo because of the XFS connection, and Stefan because why not.)

On 22.08.19 13:27, Lukáš Doktor wrote:
> On 21. 08. 19 at 19:51, Max Reitz wrote:
>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>> Hello guys,
>>>
>>> First attempt was rejected due to zip attachment, let's try it again
>>> with just Avocado-vt debug.log and serial console log files attached.
>>>
>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>> skip writing zero buffers to empty COW areas"
>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>>> at it?
>>
>> I think I can see the issue on my x64 system (I don’t see the XFS
>> corruption, but the installation fails because of some segfaults).
>>
>> I haven’t found a simpler way to reproduce the problem yet, though,
>> which is a pain... :-/
>>
>> It looks like the problem disappears when I configure qemu with
>> “--disable-xfsctl”. Can you try that?
>>
>> Max
>>
>
> Hello Max,
>
> yes, I'm getting the same behavior. With "--disable-xfsctl" it works
> well. Also looking at the option I understand why it only failed on
> aarch64 for me: I don't have the libs installed on the other machines,
> therefore it was disabled by "./configure" there. Anyway I guess
> disabling it in my builds won't really fix the issue, right? :-)

Thanks!

No, it won’t, but it means the actual root of the problem is probably
rather in some XFS-related code (be it because qemu uses it the wrong
way or because of XFS kernel code) than in the pure qcow2 commit that
made the problem surface by exercising it heavily. (Or in an
interaction between the two.)

Max
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 21. 08. 19 at 19:51, Max Reitz wrote:
> On 21.08.19 16:14, Lukáš Doktor wrote:
>> Hello guys,
>>
>> First attempt was rejected due to zip attachment, let's try it again
>> with just Avocado-vt debug.log and serial console log files attached.
>>
>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>> skip writing zero buffers to empty COW areas"
>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>> at it?
>
> I think I can see the issue on my x64 system (I don’t see the XFS
> corruption, but the installation fails because of some segfaults).
>
> I haven’t found a simpler way to reproduce the problem yet, though,
> which is a pain... :-/
>
> It looks like the problem disappears when I configure qemu with
> “--disable-xfsctl”. Can you try that?
>
> Max
>

Hello Max,

yes, I'm getting the same behavior. With "--disable-xfsctl" it works
well. Also looking at the option I understand why it only failed on
aarch64 for me: I don't have the libs installed on the other machines,
therefore it was disabled by "./configure" there. Anyway I guess
disabling it in my builds won't really fix the issue, right? :-)

Regards,
Lukáš
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 21.08.19 16:14, Lukáš Doktor wrote:
> Hello guys,
>
> First attempt was rejected due to zip attachment, let's try it again
> with just Avocado-vt debug.log and serial console log files attached.
>
> I bisected a regression on aarch64 all the way to this commit: "qcow2:
> skip writing zero buffers to empty COW areas"
> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
> at it?

I think I can see the issue on my x64 system (I don’t see the XFS
corruption, but the installation fails because of some segfaults).

I haven’t found a simpler way to reproduce the problem yet, though,
which is a pain... :-/

It looks like the problem disappears when I configure qemu with
“--disable-xfsctl”. Can you try that?

Max
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 21. 08. 19 at 17:49, Anton Nefedov wrote:
> On 21/8/2019 5:14 PM, Lukáš Doktor wrote:
>> Hello guys,
>>
>> First attempt was rejected due to zip attachment, let's try it again
>> with just Avocado-vt debug.log and serial console log files attached.
>>
>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>> skip writing zero buffers to empty COW areas"
>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>> at it?
>>
>> My reproducer is running kickstart installation of RHEL-8 from DVD on
>> aarch64 gicv3 machine, which never finishes since this commit, where
>> anaconda complains about package installation, occasionally there are
>> also XFS metadata corruption messages on serial console:
>>
>
> hi,
>
> this looks scary :( I doubt that it can have anything to do with aarch64
> but rather a really tricky timing (or, possibly, a broken environment
> like broken fallocate() on a host? who knows..)
>
> Is it always the same machine you observe this issue on? Did you try
> others?
>
> I just wonder if it's worth trying to reproduce it on my machine
> (and I don't have aarch64 on hand now). I can probably come up with
> some torture test that will continuously write to qcow2 with random
> offsets/sizes and verify the result.
>
> If you could kindly reproduce it again then we can probably start with
> enabling qemu traces by appending
> " -trace bdrv* -trace qcow2* -trace file=/some_huge_partition/qemu.log"
> to the command line.
>
> Beware that it's going to produce a huge amount of logs.
>
> Also, the corrupted image and the serial log will be required for
> investigation.
>
> thanks,
>
> /Anton
>

Hello Anton,

I have only tried that on a single machine, but a colleague of mine
reported similar issues even on TCG, installing Fedora on an x86_64
host. I'll try to reproduce it on my x86_64 box, which should simplify
the debugging.

Lukáš
Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]
On 21/8/2019 5:14 PM, Lukáš Doktor wrote:
> Hello guys,
>
> First attempt was rejected due to zip attachment, let's try it again
> with just Avocado-vt debug.log and serial console log files attached.
>
> I bisected a regression on aarch64 all the way to this commit: "qcow2:
> skip writing zero buffers to empty COW areas"
> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
> at it?
>
> My reproducer is running kickstart installation of RHEL-8 from DVD on
> aarch64 gicv3 machine, which never finishes since this commit, where
> anaconda complains about package installation, occasionally there are
> also XFS metadata corruption messages on serial console:
>

hi,

this looks scary :( I doubt that it can have anything to do with aarch64
but rather a really tricky timing (or, possibly, a broken environment
like broken fallocate() on a host? who knows..)

Is it always the same machine you observe this issue on? Did you try
others?

I just wonder if it's worth trying to reproduce it on my machine
(and I don't have aarch64 on hand now). I can probably come up with
some torture test that will continuously write to qcow2 with random
offsets/sizes and verify the result.

If you could kindly reproduce it again then we can probably start with
enabling qemu traces by appending
" -trace bdrv* -trace qcow2* -trace file=/some_huge_partition/qemu.log"
to the command line.

Beware that it's going to produce a huge amount of logs.

Also, the corrupted image and the serial log will be required for
investigation.

thanks,

/Anton
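
[Editorial illustration] The torture test suggested in this message (random offsets and sizes, then verifying the result) could start from something like the generator below. It is a rough sketch under several assumptions and not an existing QEMU test: the function name torture_cmds, the seed handling, and the choice to keep each write inside its own 64k slot (so regions never overlap and can be verified independently) are all invented here. It only prints qemu-io commands of the aio_write -P and read -P forms that appear elsewhere in this thread; it does not run qemu-io itself.

```shell
#!/bin/sh
# Sketch only: prints qemu-io commands, it does not run qemu-io itself.
# $1 = "aio_write" or "read", $2 = seed, $3 = number of 64k slots.
torture_cmds() {
    awk -v cmd="$1" -v seed="$2" -v n="$3" 'BEGIN {
        srand(seed)
        for (i = 0; i < n; i++) {
            off = int(rand() * 32)        # 0..31 KiB into the slot
            len = 1 + int(rand() * 32)    # 1..32 KiB, so off + len <= 63
            pat = 1 + int(rand() * 255)   # non-zero fill byte, 1..255
            printf "%s -P %d %dk %dk\n", cmd, pat, i * 64 + off, len
        }
    }'
}

# Same seed, same regions: first the writes, then the verifying reads.
torture_cmds aio_write 1 4
torture_cmds read 1 4
```

One plausible way to use it is to pipe the aio_write lines into a qemu-io session on a sufficiently large test image and feed the read lines to a second, later invocation, filtering its output for 'verification' messages. Shuffling the write order would be a natural extension.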