Re: [Bug 1878255] Re: Assertion failure in bdrv_aio_cancel, through ide

2020-07-20 Thread Stefan Hajnoczi
Here is another patch that attempts to fix this:
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg05758.html

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1878255

Title:
  Assertion failure in bdrv_aio_cancel, through ide

Status in QEMU:
  New

Bug description:
  Hello,
  While fuzzing, I found an input that triggers an assertion failure in 
bdrv_aio_cancel, through ide:

  #1  0x7685755b in __GI_abort () at abort.c:79
  #2  0x56a8d396 in bdrv_aio_cancel (acb=0x60761290) at 
/home/alxndr/Development/qemu/block/io.c:2746
  #3  0x56a58525 in blk_aio_cancel (acb=0x2) at 
/home/alxndr/Development/qemu/block/block-backend.c:1540
  #4  0x56552f5b in ide_reset (s=) at 
/home/alxndr/Development/qemu/hw/ide/core.c:1318
  #5  0x56552aeb in ide_bus_reset (bus=0x62d17398) at 
/home/alxndr/Development/qemu/hw/ide/core.c:2422
  #6  0x56579ba5 in ahci_reset_port (s=, port=) at /home/alxndr/Development/qemu/hw/ide/ahci.c:650
  #7  0x5657bd8d in ahci_port_write (s=0x61e14d70, port=0x2, 
offset=, val=0x10) at 
/home/alxndr/Development/qemu/hw/ide/ahci.c:360
  #8  0x5657bd8d in ahci_mem_write (opaque=, 
addr=, val=, size=) at 
/home/alxndr/Development/qemu/hw/ide/ahci.c:513
  #9  0x560028d7 in memory_region_write_accessor (mr=, 
addr=, value=, size=, 
shift=, mask=, attrs=...) at 
/home/alxndr/Development/qemu/memory.c:483
  #10 0x56002280 in access_with_adjusted_size (addr=, 
value=, size=, access_size_min=, 
access_size_max=, access_fn=, mr=0x61e14da0, 
attrs=...) at /home/alxndr/Development/qemu/memory.c:544
  #11 0x56002280 in memory_region_dispatch_write (mr=, 
addr=, data=0x10, op=, attrs=...) at 
/home/alxndr/Development/qemu/memory.c:1476
  #12 0x55f171d4 in flatview_write_continue (fv=, 
addr=0xe106c22c, attrs=..., ptr=, len=0x1, addr1=0x7fffb8d0, 
l=, mr=0x61e14da0) at 
/home/alxndr/Development/qemu/exec.c:3137
  #13 0x55f0fb98 in flatview_write (fv=0x6063b180, addr=, attrs=..., buf=, len=) at 
/home/alxndr/Development/qemu/exec.c:3177

  I can reproduce it in qemu 5.0 using:

  cat << EOF | ~/Development/qemu/build/i386-softmmu/qemu-system-i386 -qtest 
stdio -monitor none -serial none -M pc-q35-5.0  -nographic
  outl 0xcf8 0x8000fa24
  outl 0xcfc 0xe106c000
  outl 0xcf8 0x8000fa04
  outw 0xcfc 0x7
  outl 0xcf8 0x8000fb20
  write 0x0 0x3 0x2780e7
  write 0xe106c22c 0xd 0x1130c218021130c218021130c2
  write 0xe106c218 0x15 0x110010110010110010110010110010110010110010
  EOF

  I also attached the commands to this launchpad report, in case the
  formatting is broken:

  qemu-system-i386 -qtest stdio -monitor none -serial none -M pc-q35-5.0
  -nographic < attachment

  Please let me know if I can provide any further info.
  -Alex

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1878255/+subscriptions



Re: [Bug 1878255] Re: Assertion failure in bdrv_aio_cancel, through ide

2020-05-30 Thread Philippe Mathieu-Daudé
On 5/30/20 12:59 AM, John Snow wrote:
> outl 0xcf8 0x8000fa24
> outl 0xcfc 0xe106c000 (Writes e106c00 to BAR5 for 0:31:2)

We might eventually display this in the reproducer output.

> 
> outl 0xcf8 0x8000fa04
> outw 0xcfc 0x7 (Enables BM, Memory IO and PIO for 0:31:2)
> 
> outl 0xcf8 0x8000fb20 (Enables 0:31:3, I guess? My PCI knowledge is
> iffy. We set the enable bit and select BAR4, but then we don't actually
> write to 0xcfc again to set anything.)
> 
> 
> write 0x0 0x3 0x2780e7
> - write these three bytes to addr 0 in memory.
> 
> write 0xe106c22c 0xd 0x1130c218021130c218021130c2
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxSCTL] @ 0x2c: 
> 0x18c23011
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxSERR] @ 0x30: 
> 0xc2301102
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxSACT] @ 0x34: 
> 0x30110218
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxCI] @ 0x38: 
> 0x00c2
> 
> write 0xe106c218 0x15 0x110010110010110010110010110010110010110010
> 
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxCMD] @ 0x18: 
> 0x11100011
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:Reserved] @ 0x1c: 
> 0x00111000
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxTFD] @ 0x20: 
> 0x10001110
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxSIG] @ 0x24: 
> 0x11100011
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxSSTS] @ 0x28: 
> 0x00111000
> - ahci_port_write ahci(0x555c950f71a0)[2]: port write [reg:PxSCTL] @ 0x2c: 
> 0x0010
> 
> Not all of those register writes are actually important for the bug, so
> I simplified them to the fewest writes and fewest bits.
> 
> outl 0xcf8 0x8000fa24
> outl 0xcfc 0xe106c000
> outl 0xcf8 0x8000fa04
> outw 0xcfc 0x7
> outl 0xcf8 0x8000fb20
> write 0x0 0x3 0x2780e7
> write 0xe106c22c 0x4 0x0100
> write 0xe106c238 0x2 0x02
> write 0xe106c218 0x4 0x1100
> write 0xe106c22c 0x1 0x00
> 
> 
> 1. PxSCTL write arms the DET bit. It isn't intended to be left on when 
> PxCMD.ST (Start) is issued. It's not clear what should happen if this DOES 
> occur. (Undefined behavior, at the very least.)
> See AHCI 1.3 section 3.3.1.1 "Offset 2Ch: PxSCTL – Port x Serial ATA Control 
> (SCR2: SControl)"
> 
> This bit is intended to send a reset signal to attached SATA drives.
> QEMU just synchronously resets the drive because we can.
> 
> 
> 2. PxCI is not intended to be written to when PxCMD.ST is unset. The spec 
> suggests that when ST transitions from '1' to '0' that this field is cleared, 
> but it does not suggest what happens when it transitions from '0' to '1'. 
> QEMU will happily set the register.
> 
> 
> 3. PxCMD write: This sets PxCMD.ST and PxCMD.FRE, which engages the AHCI 
> device in earnest.
> 
> At this point, AHCI sees outstanding commands and tries to process them.
> The FIS receive address is never programmed, so it's at zero. We start
> reading a FIS there:
> 
> 15712@1590789960.784835:handle_cmd_fis_dump ahci(0x55b4c56621a0)[2]: FIS:
> 0x00: 27 80 e7 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x40: 34 40 70 01 01 14 eb 20 00 00 00 00 01 00 00 00 
> 0x50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 
> This is translated as:
> 0x27 SATA_FIS_TYPE_REGISTER_H2D
> 0x80 SATA_FIS_REG_H2D_UPDATE_COMMAND_REGISTER
> 0xe7 command = FLUSH CACHE
> 
> This will engage ide_flush_cache() (core.c)
> 
> 
> At this point I get a little confused. I wouldn't think we'd have a 
> BlockBackend here for ide_flush to work on, but it seems to think we do and 
> allows the flush command to proceed. We then immediately try to cancel this 
> flush, but bdrv_aio_cancel can't tolerate an aiocb with a null BDS and panics.
> 
> ...Hm, it should be the case that blk_do_flush detects this as
> ENOMEDIUM, but are we maybe just canceling this request too fast? I
> actually can't trigger this through the console, but I can trigger it by
> redirecting input from a .txt file.
> 
> Yup, OK: if you look in blk_aio_prwv, we schedule a oneshot to invoke
> the callback on a synchronous failure, but we are managing to inject the
> reset command before the oneshot gets a chance to run.

What is not clear to me is, can this be reproduced by a real guest, or
only replaying the fuzzer payload (via the qtest chardev)?

Very nicely detailed analysis btw!

Various parts are worth being copied in the fix commit description.

> 
> I think either blk_aio_cancel or bdrv_aio_cancel needs to check that
> there isn't a dangling BH callback -- it seems wrong to make it as far
> as bdrv_aio_cancel when there's no BDS, but the IDE device no longer has
> any idea why its callback hasn't returned yet, and blk_aio_cancel is