Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer
On Tue, 9 Aug 2022 at 17:26, Denis Krienbühl wrote: > On 9 Aug 2022, at 18:15, Peter Maydell wrote: >> My wild guess is that there's a race condition somewhere such >> that when you're doing this huge amount of detaches, very rarely >> a disk is detached and deleted but this INQUIRY request is >> incorrectly still sent to the disk (which being a freed object, >> might be overwritten with other stuff). But that is purely a guess. > > So.. should this be something I create a bug report for? > > >> If you can repro this on current head-of-git, or at least on >> the most recent release, then yes, file a bug report. > The best I can currently do is start to log what’s going on. Since > I’m not at all familiar with SCSI and this code-base, do you have > any tipps on what I should log to maybe find out where this > race-condition occurs? > > Or if there’s any kind of documentation I could read to understand > better what is going on in the hw/scsi subsystem and how I should > navigate the code. After reading your explanation we’ll probably > look for other workarounds, but I would love to understand what’s > going on. Paolo and Fam are the SCSI subsystem maintainers. They might know whether this sounds like a bug that's already been fixed at some point, or have other suggestions. Context (ie link to the start of this thread on the list archive): https://lists.gnu.org/archive/html/qemu-discuss/2022-08/msg00011.html thanks -- PMM
Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer
> On 9 Aug 2022, at 18:15, Peter Maydell wrote: > > My wild guess is that there's a race condition somewhere such > that when you're doing this huge amount of detaches, very rarely > a disk is detached and deleted but this INQUIRY request is > incorrectly still sent to the disk (which being a freed object, > might be overwritten with other stuff). But that is purely a guess. > >> So.. should this be something I create a bug report for? > > If you can repro this on current head-of-git, or at least on > the most recent release, then yes, file a bug report. The best I can currently do is start to log what’s going on. Since I’m not at all familiar with SCSI and this code-base, do you have any tipps on what I should log to maybe find out where this race-condition occurs? Or if there’s any kind of documentation I could read to understand better what is going on in the hw/scsi subsystem and how I should navigate the code. After reading your explanation we’ll probably look for other workarounds, but I would love to understand what’s going on. Appreciate your help, Denis
Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer
On Tue, 9 Aug 2022 at 16:37, Denis Krienbühl wrote: > We are running a Ceph 15.2.16 cluster at work, connect VMs to it using > librbd1 14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive > (4.2-3ubuntu6.23~cloud0). > > A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This > seems to only happen when lots of volume detach requests are executed in > parallel. We can somewhat reliably reproduce this in our staging environment > (it still takes thousands of detaches and sometimes hours of running tests > until it occurs once). In production it is a lot rarer, but it does occur > with some regularity. > > Using crash dumps and debug symbols I was able to pin-point the cause of the > segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, > QEMU tries to access two SCSIDiskState variables that are both `0x0`, when > QEMU expects them to be characters. > > 1. `s->version` > 2. `s->vendor` > > I found out that in other parts of the code (`scsi_block_realize`) those > variables are not expected to always be valid characters, so I copied that > over and created my own release of the distro package with this patch (here > diffed against master): > > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c > index efee6739f9..419d6e28f1 100644 > --- a/hw/scsi/scsi-disk.c > +++ b/hw/scsi/scsi-disk.c > @@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, > uint8_t *outbuf) > return -1; > } > > +/* Avoid null-pointers leading to segfaults below */ > +if (!s->version) { > +s->version = g_strdup(qemu_hw_version()); > +} > +if (!s->vendor) { > +s->vendor = g_strdup("QEMU"); > +} > + > /* PAGE CODE == 0 */ > buflen = req->cmd.xfer; > if (buflen > SCSI_MAX_INQUIRY_LEN) { If these are NULL, then something has gone weirdly wrong (possibly memory corruption or other Bad Things). At least in the version of this code in upstream git, the 'realize' method does: if (!s->version) { s->version = g_strdup(qemu_hw_version()); } if (!s->vendor) { s->vendor = g_strdup("QEMU"); } which is to say: when the device is created, if the user didn't specify version/vendor strings via QOM properties, we set them to these defaults. So at all points in execution after this, these pointers can't be NULL. > 1. I don’t have a way to reproduce this in the master. > 2. I don’t know if this is safe. > 3. I don’t know if this just masks the issue and the problem lies somewhere > deeper. Number 3. My wild guess is that there's a race condition somewhere such that when you're doing this huge amount of detaches, very rarely a disk is detached and deleted but this INQUIRY request is incorrectly still sent to the disk (which being a freed object, might be overwritten with other stuff). But that is purely a guess. > So.. should this be something I create a bug report for? If you can repro this on current head-of-git, or at least on the most recent release, then yes, file a bug report. thanks -- PMM
Segfault in hw/scsi/scsi-disk.c caused by null pointer
Hi We are running a Ceph 15.2.16 cluster at work, connect VMs to it using librbd1 14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive (4.2-3ubuntu6.23~cloud0). A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This seems to only happen when lots of volume detach requests are executed in parallel. We can somewhat reliably reproduce this in our staging environment (it still takes thousands of detaches and sometimes hours of running tests until it occurs once). In production it is a lot rarer, but it does occur with some regularity. Using crash dumps and debug symbols I was able to pin-point the cause of the segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, QEMU tries to access two SCSIDiskState variables that are both `0x0`, when QEMU expects them to be characters. 1. `s->version` 2. `s->vendor` I found out that in other parts of the code (`scsi_block_realize`) those variables are not expected to always be valid characters, so I copied that over and created my own release of the distro package with this patch (here diffed against master): diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c index efee6739f9..419d6e28f1 100644 --- a/hw/scsi/scsi-disk.c +++ b/hw/scsi/scsi-disk.c @@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf) return -1; } +/* Avoid null-pointers leading to segfaults below */ +if (!s->version) { +s->version = g_strdup(qemu_hw_version()); +} +if (!s->vendor) { +s->vendor = g_strdup("QEMU"); +} + /* PAGE CODE == 0 */ buflen = req->cmd.xfer; if (buflen > SCSI_MAX_INQUIRY_LEN) { And with that fix, we have no segfaults anymore. So I have that, but there are a few problems: 1. I don’t have a way to reproduce this in the master. 2. I don’t know if this is safe. 3. I don’t know if this just masks the issue and the problem lies somewhere deeper. So.. should this be something I create a bug report for? I feel like a maintainer might be able to know much better if this is an actual issue or not. I would also suspect that this is indeed a QEMU bug to a degree, even though it’s maybe something that does not happen with other librbd1 releases, because the assumption that those pointers are always point to valid characters is likely wrong. Also, even if it’s something just for our version, I would appreciate any kind of input on my patch and why you think it helps our case. Cheers, Denis