Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
On 22.08.19 17:40, Max Reitz wrote:
> On 22.08.19 17:25, Max Reitz wrote:
>> On 22.08.19 14:09, Max Reitz wrote:
>>> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>>>
>>> On 22.08.19 13:27, Lukáš Doktor wrote:
>>>> On 21. 08. 19 at 19:51, Max Reitz wrote:
>>>>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>>>>> Hello guys,
>>>>>>
>>>>>> First attempt was rejected due to zip attachment, let's try it again
>>>>>> with just Avocado-vt debug.log and serial console log files attached.
>>>>>>
>>>>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>>>>> skip writing zero buffers to empty COW areas"
>>>>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look
>>>>>> at it?
>>>>>
>>>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>>>> corruption, but the installation fails because of some segfaults).
>>>>>
>>>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>>>> which is a pain... :-/
>>>>>
>>>>> It looks like the problem disappears when I configure qemu with
>>>>> “--disable-xfsctl”.  Can you try that?
>>>>>
>>>>> Max
>>>>>
>>>>
>>>> Hello Max,
>>>>
>>>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well.
>>>> Also looking at the option I understand why it only failed on aarch64 for
>>>> me, I don't have libs installed on the other machines, therefore it was
>>>> disabled by "./configure" there. Anyway I guess disabling it in my builds
>>>> won't really fix the issue, right? :-)
>>>
>>> Thanks!
>>>
>>> No, it won’t, but it means the actual root of the problem is probably
>>> rather in some XFS-related code (be it because qemu uses it the wrong
>>> way or because of XFS kernel code) than in the pure qcow2 commit that
>>> made the problem surface by exercising it heavily.  (Or in an
>>> interaction between the two.)
>>
>> OK, I got a simpler reproducer now:
>>
>> $ ./qemu-img create -f qcow2 test.qcow2 1M
>> $ (for i in $(seq 15 -1 0); do \
>>    echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
>>    done) \
>>   | ./qemu-io test.qcow2
>> $ for i in $(seq 0 15); do \
>>   echo $i; \
>>   ofs=$((i * 64)); \
>>   ./qemu-io -c "read -P 0 ${ofs}k 1k" \
>>             -c "read -P 42 $((ofs + 1))k 62k" \
>>             -c "read -P 0 $((ofs + 63))k 1k" \
>>             test.qcow2 \
>>   | grep 'verification'; \
>>   done
>>
>> On XFS with --enable-xfsctl, this basically always gives me some
>> verification failure somewhere.  (On tmpfs or with --disable-xfsctl, it
>> never fails.)
>>
>> So it seems to be related to I/O from back to front.
>>
>> (You can also reproduce it with a plain “qemu-img bench” invocation,
>> like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
>> (on, say, a 4 GB image), but then the failure appears much later in the
>> image, because you have to wait for some requests to come in reverse
>> (by chance) first.)
> 
> The problem is the ftruncate() in xfs_write_zeroes().  It is possible
> for it to yield, then other requests come in, and the data they write
> may get discarded once the ftruncate() settles.

I’ve just sent a patch: “block/file-posix: Fix xfs_write_zeroes()”,
Message-ID <20190822162618.27670-1-mre...@redhat.com>:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg01148.html

Max



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
On 22.08.19 17:25, Max Reitz wrote:
> On 22.08.19 14:09, Max Reitz wrote:
>> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
>>
>> On 22.08.19 13:27, Lukáš Doktor wrote:
>>> On 21. 08. 19 at 19:51, Max Reitz wrote:
>>>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>>>> Hello guys,
>>>>>
>>>>> First attempt was rejected due to zip attachment, let's try it again with
>>>>> just Avocado-vt debug.log and serial console log files attached.
>>>>>
>>>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>>>> skip writing zero buffers to empty COW areas"
>>>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at
>>>>> it?
>>>>
>>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>>> corruption, but the installation fails because of some segfaults).
>>>>
>>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>>> which is a pain... :-/
>>>>
>>>> It looks like the problem disappears when I configure qemu with
>>>> “--disable-xfsctl”.  Can you try that?
>>>>
>>>> Max
>>>>
>>>
>>> Hello Max,
>>>
>>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
>>> Also looking at the option I understand why it only failed on aarch64 for 
>>> me, I don't have libs installed on the other machines, therefore it was
>>> disabled by "./configure" there. Anyway I guess disabling it in my builds 
>>> won't really fix the issue, right? :-)
>>
>> Thanks!
>>
>> No, it won’t, but it means the actual root of the problem is probably
>> rather in some XFS-related code (be it because qemu uses it the wrong
>> way or because of XFS kernel code) than in the pure qcow2 commit that
>> made the problem surface by exercising it heavily.  (Or in an
>> interaction between the two.)
> 
> OK, I got a simpler reproducer now:
> 
> $ ./qemu-img create -f qcow2 test.qcow2 1M
> $ (for i in $(seq 15 -1 0); do \
>    echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
>    done) \
>   | ./qemu-io test.qcow2
> $ for i in $(seq 0 15); do \
>   echo $i; \
>   ofs=$((i * 64)); \
>   ./qemu-io -c "read -P 0 ${ofs}k 1k" \
>             -c "read -P 42 $((ofs + 1))k 62k" \
>             -c "read -P 0 $((ofs + 63))k 1k" \
>             test.qcow2 \
>   | grep 'verification'; \
>   done
> 
> On XFS with --enable-xfsctl, this basically always gives me some
> verification failure somewhere.  (On tmpfs or with --disable-xfsctl, it
> never fails.)
> 
> So it seems to be related to I/O from back to front.
> 
> (You can also reproduce it with a plain “qemu-img bench” invocation,
> like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
> (on, say, a 4 GB image), but then the failure appears much later in the
> image, because you have to wait for some requests to come in reverse
> (by chance) first.)

The problem is the ftruncate() in xfs_write_zeroes().  It is possible
for it to yield, then other requests come in, and the data they write
may get discarded once the ftruncate() settles.
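
To make that ordering concrete, here is a minimal sketch that replays one
possible interleaving sequentially on a scratch file. It is not QEMU code:
the two "requests", the offsets, and the use of truncate(1) in place of the
ftruncate() call are assumptions chosen for illustration only.

```shell
#!/bin/sh
# Replay of a lost-write interleaving on a scratch file (illustration only).
# Assumption: request A picks its truncate target before request B extends
# the file, and only applies that target after B has finished writing.
f=$(mktemp)

# Request A: wants zeroes in [0, 64k). It sees an empty file and plans to
# grow it to 64k, but "yields" before the truncate actually happens.
target_a=$((64 * 1024))

# Request B runs in the meantime: 62k of pattern 42 ('*') at offset 65k.
head -c $((62 * 1024)) /dev/zero | tr '\0' '*' \
    | dd of="$f" bs=1024 seek=65 conv=notrunc 2>/dev/null
size_after_b=$(wc -c < "$f")

# Request A resumes and applies its stale target: B's data is cut off.
truncate -s "$target_a" "$f"
size_after_a=$(wc -c < "$f")

echo "after B: $size_after_b bytes, after A's truncate: $size_after_a bytes"
rm -f "$f"
```

After the last step the file is shorter than it was right after B's write,
so everything B wrote is gone, which is the kind of loss the read
verification in the reproducer would report.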

Max



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
On 22.08.19 14:09, Max Reitz wrote:
> (CC-ing Paolo because of the XFS connection, and Stefan because why not.)
> 
> On 22.08.19 13:27, Lukáš Doktor wrote:
>> On 21. 08. 19 at 19:51, Max Reitz wrote:
>>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>>> Hello guys,
>>>>
>>>> First attempt was rejected due to zip attachment, let's try it again with
>>>> just Avocado-vt debug.log and serial console log files attached.
>>>>
>>>> I bisected a regression on aarch64 all the way to this commit: "qcow2:
>>>> skip writing zero buffers to empty COW areas"
>>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at
>>>> it?
>>>
>>> I think I can see the issue on my x64 system (I don’t see the XFS
>>> corruption, but the installation fails because of some segfaults).
>>>
>>> I haven’t found a simpler way to reproduce the problem yet, though,
>>> which is a pain... :-/
>>>
>>> It looks like the problem disappears when I configure qemu with
>>> “--disable-xfsctl”.  Can you try that?
>>>
>>> Max
>>>
>>
>> Hello Max,
>>
>> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
>> Also looking at the option I understand why it only failed on aarch64 for 
>> me, I don't have libs installed on the other machines, therefore it was
>> disabled by "./configure" there. Anyway I guess disabling it in my builds 
>> won't really fix the issue, right? :-)
> 
> Thanks!
> 
> No, it won’t, but it means the actual root of the problem is probably
> rather in some XFS-related code (be it because qemu uses it the wrong
> way or because of XFS kernel code) than in the pure qcow2 commit that
> made the problem surface by exercising it heavily.  (Or in an
> interaction between the two.)

OK, I got a simpler reproducer now:

$ ./qemu-img create -f qcow2 test.qcow2 1M
$ (for i in $(seq 15 -1 0); do \
   echo "aio_write -P 42 $((i * 64 + 1))k 62k"; \
   done) \
  | ./qemu-io test.qcow2
$ for i in $(seq 0 15); do \
  echo $i; \
  ofs=$((i * 64)); \
  ./qemu-io -c "read -P 0 ${ofs}k 1k" \
            -c "read -P 42 $((ofs + 1))k 62k" \
            -c "read -P 0 $((ofs + 63))k 1k" \
            test.qcow2 \
  | grep 'verification'; \
  done

On XFS with --enable-xfsctl, this basically always gives me some
verification failure somewhere.  (On tmpfs or with --disable-xfsctl, it
never fails.)

So it seems to be related to I/O from back to front.
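
For reference, the offset arithmetic in the loops above can be spelled out
with a small helper. This is only a sketch: the layout function is a
hypothetical name, and it assumes 64 KiB clusters (the qcow2 default), so
that each write leaves 1k at the head and 1k at the tail of its cluster
unwritten.

```shell
#!/bin/sh
# Print the ranges the reproducer verifies for one cluster (values in KiB).
# Assumption: 64 KiB clusters, so cluster i starts at i * 64 KiB.
cluster_k=64

layout() {
    i=$1
    ofs=$((i * cluster_k))
    # 1k expected zero, 62k expected pattern 42, 1k expected zero
    echo "cluster $i: zero ${ofs}k+1k, data $((ofs + 1))k+62k, zero $((ofs + 63))k+1k"
}

# First and last cluster of the 1M image
for i in 0 15; do
    layout "$i"
done
```

The two 1k ranges around each write are what the "read -P 0" commands check,
and the 62k range in between is what "read -P 42" checks.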

(You can also reproduce it with a plain “qemu-img bench” invocation,
like “./qemu-img bench -w --pattern=42 -o 1k -S 64k -s 62k test.qcow2”
(on, say, a 4 GB image), but then the failure appears much later in the
image, because you have to wait for some requests to come in reverse
(by chance) first.)

Max



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Max Reitz
(CC-ing Paolo because of the XFS connection, and Stefan because why not.)

On 22.08.19 13:27, Lukáš Doktor wrote:
> On 21. 08. 19 at 19:51, Max Reitz wrote:
>> On 21.08.19 16:14, Lukáš Doktor wrote:
>>> Hello guys,
>>>
>>> First attempt was rejected due to zip attachment, let's try it again with 
>>> just Avocado-vt debug.log and serial console log files attached.
>>>
>>> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
>>> writing zero buffers to empty COW areas" 
>>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at 
>>> it?
>>
>> I think I can see the issue on my x64 system (I don’t see the XFS
>> corruption, but the installation fails because of some segfaults).
>>
>> I haven’t found a simpler way to reproduce the problem yet, though,
>> which is a pain... :-/
>>
>> It looks like the problem disappears when I configure qemu with
>> “--disable-xfsctl”.  Can you try that?
>>
>> Max
>>
> 
> Hello Max,
> 
> yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. 
> Also looking at the option I understand why it only failed on aarch64 for me, 
> I don't have libs installed on the other machines, therefore it was disabled
> by "./configure" there. Anyway I guess disabling it in my builds won't really 
> fix the issue, right? :-)

Thanks!

No, it won’t, but it means the actual root of the problem is probably
rather in some XFS-related code (be it because qemu uses it the wrong
way or because of XFS kernel code) than in the pure qcow2 commit that
made the problem surface by exercising it heavily.  (Or in an
interaction between the two.)

Max



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-22 Thread Lukáš Doktor
On 21. 08. 19 at 19:51, Max Reitz wrote:
> On 21.08.19 16:14, Lukáš Doktor wrote:
>> Hello guys,
>>
>> First attempt was rejected due to zip attachment, let's try it again with 
>> just Avocado-vt debug.log and serial console log files attached.
>>
>> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
>> writing zero buffers to empty COW areas" 
>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at it?
> 
> I think I can see the issue on my x64 system (I don’t see the XFS
> corruption, but the installation fails because of some segfaults).
> 
> I haven’t found a simpler way to reproduce the problem yet, though,
> which is a pain... :-/
> 
> It looks like the problem disappears when I configure qemu with
> “--disable-xfsctl”.  Can you try that?
> 
> Max
> 

Hello Max,

yes, I'm getting the same behavior. With "--disable-xfsctl" it works well. Also 
looking at the option I understand why it only failed on aarch64 for me, I 
don't have libs installed on the other machines, therefore it was disabled by
"./configure" there. Anyway I guess disabling it in my builds won't really fix 
the issue, right? :-)

Regards,
Lukáš



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-21 Thread Max Reitz
On 21.08.19 16:14, Lukáš Doktor wrote:
> Hello guys,
> 
> First attempt was rejected due to zip attachment, let's try it again with 
> just Avocado-vt debug.log and serial console log files attached.
> 
> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
> writing zero buffers to empty COW areas" 
> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at it?

I think I can see the issue on my x64 system (I don’t see the XFS
corruption, but the installation fails because of some segfaults).

I haven’t found a simpler way to reproduce the problem yet, though,
which is a pain... :-/

It looks like the problem disappears when I configure qemu with
“--disable-xfsctl”.  Can you try that?

Max



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-21 Thread Lukáš Doktor
On 21. 08. 19 at 17:49, Anton Nefedov wrote:
> On 21/8/2019 5:14 PM, Lukáš Doktor wrote:
>> Hello guys,
>>
>> First attempt was rejected due to zip attachment, let's try it again with 
>> just Avocado-vt debug.log and serial console log files attached.
>>
>> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
>> writing zero buffers to empty COW areas" 
>> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at it?
>>
>> My reproducer is running kickstart installation of RHEL-8 from DVD on 
>> aarch64 gicv3 machine, which never finishes since this commit, where 
>> anaconda complains about package installation, occasionally there are also 
>> XFS metadata corruption messages on serial console:
>>
> 
> hi,
> 
> this looks scary :( I doubt that it can have anything to do with aarch64
> but rather a really tricky timing (or, possibly, a broken environment
> like broken fallocate() on a host? who knows..)
> 
> Is it always the same machine you observe this issue on? Did you try
> others?
> 
> I just wonder if it's worth trying to reproduce it on my machine
> (and I don't have aarch64 on hand now). I can probably come up with
> some torture test that will continuously write to qcow2 with random
> offsets/sizes and verify the result.
> 
> If you could kindly reproduce it again then we can probably start with
> enabling qemu traces by appending
>   " -trace bdrv* -trace qcow2* -trace file=/some_huge_partition/qemu.log"
> to the command line.
> 
> Beware that it's going to produce a huge amount of logs.
> 
> Also, the corrupted image and the serial log will be required for
> investigation.
> 
> thanks,
> 
> /Anton
> 

Hello Anton,

I have only tried that on a single machine, but a colleague of mine reported
similar issues even under TCG when installing Fedora on an x86_64 host. I'll
try to reproduce it on my x86_64 box, which should simplify the debugging.

Lukáš



Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-21 Thread Anton Nefedov
On 21/8/2019 5:14 PM, Lukáš Doktor wrote:
> Hello guys,
> 
> First attempt was rejected due to zip attachment, let's try it again with 
> just Avocado-vt debug.log and serial console log files attached.
> 
> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
> writing zero buffers to empty COW areas" 
> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at it?
> 
> My reproducer is running kickstart installation of RHEL-8 from DVD on aarch64 
> gicv3 machine, which never finishes since this commit, where anaconda 
> complains about package installation, occasionally there are also XFS 
> metadata corruption messages on serial console:
> 

hi,

this looks scary :( I doubt that it can have anything to do with aarch64
but rather a really tricky timing (or, possibly, a broken environment
like broken fallocate() on a host? who knows..)

Is it always the same machine you observe this issue on? Did you try
others?

I just wonder if it's worth trying to reproduce it on my machine
(and I don't have aarch64 on hand now). I can probably come up with
some torture test that will continuously write to qcow2 with random
offsets/sizes and verify the result.
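
A very rough sketch of what such a test could look like: generate one
randomly placed write per cluster as qemu-io commands, then generate the
matching verifying reads from the same seed. The gen_cmds helper, the
64 KiB cluster assumption, the seed handling and the image name in the
usage note are all made up for illustration; only the qemu-io "aio_write"
and "read" commands with -P are real.

```shell
#!/bin/sh
# Emit qemu-io commands, one per 64 KiB cluster, highest cluster first.
# usage: gen_cmds aio_write|read CLUSTERS [SEED]
# The same SEED yields the same ranges, so a "read" run verifies exactly
# what the earlier "aio_write" run wrote. Offsets and lengths are in KiB.
gen_cmds() {
    awk -v op="$1" -v n="$2" -v seed="${3:-1}" 'BEGIN {
        srand(seed)
        for (i = n - 1; i >= 0; i--) {
            start = int(rand() * 64)             # 0..63 KiB into the cluster
            len = int(rand() * (64 - start)) + 1 # stays inside the cluster
            # pattern i + 1 is non-zero and distinct per cluster
            printf "%s -P %d %dk %dk\n", op, i + 1, i * 64 + start, len
        }
    }'
}
```

It could then be used like this (seed 7 picked arbitrarily), in a loop with
a fresh image and a different seed each time:

  gen_cmds aio_write 16 7 | ./qemu-io test.qcow2
  gen_cmds read 16 7 | ./qemu-io test.qcow2 | grep 'verification'

Note that this only checks the written ranges, not that the untouched parts
of each cluster still read back as zeroes.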

If you could kindly reproduce it again then we can probably start with
enabling qemu traces by appending
  " -trace bdrv* -trace qcow2* -trace file=/some_huge_partition/qemu.log"
to the command line.

Beware that it's going to produce a huge amount of logs.

Also, the corrupted image and the serial log will be required for
investigation.

thanks,

/Anton