Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Peter Maydell
On Tue, 9 Aug 2022 at 17:26, Denis Krienbühl  wrote:
> On 9 Aug 2022, at 18:15, Peter Maydell  wrote:
>> My wild guess is that there's a race condition somewhere such
>> that when you're doing this huge amount of detaches, very rarely
>> a disk is detached and deleted but this INQUIRY request is
>> incorrectly still sent to the disk (which being a freed object,
>> might be overwritten with other stuff). But that is purely a guess.
>
> So.. should this be something I create a bug report for?
>
>
>> If you can repro this on current head-of-git, or at least on
>> the most recent release, then yes, file a bug report.

> The best I can currently do is start to log what’s going on. Since
> I’m not at all familiar with SCSI and this code-base, do you have
> any tipps on what I should log to maybe find out where this
> race-condition occurs?
>
> Or if there’s any kind of documentation I could read to understand
> better what is going on in the hw/scsi subsystem and how I should
> navigate the code. After reading your explanation we’ll probably
> look for other workarounds, but I would love to understand what’s
> going on.

Paolo and Fam are the SCSI subsystem maintainers. They might know
whether this sounds like a bug that's already been fixed at some
point, or have other suggestions.

Context (ie link to the start of this thread on the list archive):
https://lists.gnu.org/archive/html/qemu-discuss/2022-08/msg00011.html

thanks
-- PMM



Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Denis Krienbühl

> On 9 Aug 2022, at 18:15, Peter Maydell  wrote:
> 
> My wild guess is that there's a race condition somewhere such
> that when you're doing this huge amount of detaches, very rarely
> a disk is detached and deleted but this INQUIRY request is
> incorrectly still sent to the disk (which being a freed object,
> might be overwritten with other stuff). But that is purely a guess.
> 
>> So.. should this be something I create a bug report for?
> 
> If you can repro this on current head-of-git, or at least on
> the most recent release, then yes, file a bug report.

The best I can currently do is start to log what’s going on. Since I’m not at 
all familiar with SCSI and this code-base, do you have any tipps on what I 
should log to maybe find out where this race-condition occurs?

Or if there’s any kind of documentation I could read to understand better what 
is going on in the hw/scsi subsystem and how I should navigate the code. After 
reading your explanation we’ll probably look for other workarounds, but I would 
love to understand what’s going on.

Appreciate your help,

Denis

Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Peter Maydell
On Tue, 9 Aug 2022 at 16:37, Denis Krienbühl  wrote:
> We are running a Ceph 15.2.16 cluster at work, connect VMs to it using 
> librbd1 14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive 
> (4.2-3ubuntu6.23~cloud0).
>
> A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This 
> seems to only happen when lots of volume detach requests are executed in 
> parallel. We can somewhat reliably reproduce this in our staging environment 
> (it still takes thousands of detaches and sometimes hours of running tests 
> until it occurs once). In production it is a lot rarer, but it does occur 
> with some regularity.
>
> Using crash dumps and debug symbols I was able to pin-point the cause of the 
> segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, 
> QEMU tries to access two SCSIDiskState variables that are both `0x0`, when 
> QEMU expects them to be characters.
>
> 1. `s->version`
> 2. `s->vendor`
>
> I found out that in other parts of the code (`scsi_block_realize`) those 
> variables are not expected to always be valid characters, so I copied that 
> over and created my own release of the distro package with this patch (here 
> diffed against master):
>
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index efee6739f9..419d6e28f1 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
> uint8_t *outbuf)
>  return -1;
>  }
>
> +/* Avoid null-pointers leading to segfaults below */
> +if (!s->version) {
> +s->version = g_strdup(qemu_hw_version());
> +}
> +if (!s->vendor) {
> +s->vendor = g_strdup("QEMU");
> +}
> +
>  /* PAGE CODE == 0 */
>  buflen = req->cmd.xfer;
>  if (buflen > SCSI_MAX_INQUIRY_LEN) {

If these are NULL, then something has gone weirdly wrong (possibly
memory corruption or other Bad Things).

At least in the version of this code in upstream git, the
'realize' method does:

if (!s->version) {
s->version = g_strdup(qemu_hw_version());
}
if (!s->vendor) {
s->vendor = g_strdup("QEMU");
}

which is to say: when the device is created, if the user didn't
specify version/vendor strings via QOM properties, we set them to
these defaults. So at all points in execution after this, these
pointers can't be NULL.

> 1. I don’t have a way to reproduce this in the master.
> 2. I don’t know if this is safe.
> 3. I don’t know if this just masks the issue and the problem lies somewhere 
> deeper.

Number 3.

My wild guess is that there's a race condition somewhere such
that when you're doing this huge amount of detaches, very rarely
a disk is detached and deleted but this INQUIRY request is
incorrectly still sent to the disk (which being a freed object,
might be overwritten with other stuff). But that is purely a guess.

> So.. should this be something I create a bug report for?

If you can repro this on current head-of-git, or at least on
the most recent release, then yes, file a bug report.

thanks
-- PMM



Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Denis Krienbühl
Hi

We are running a Ceph 15.2.16 cluster at work, connect VMs to it using librbd1 
14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive 
(4.2-3ubuntu6.23~cloud0).

A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This seems 
to only happen when lots of volume detach requests are executed in parallel. We 
can somewhat reliably reproduce this in our staging environment (it still takes 
thousands of detaches and sometimes hours of running tests until it occurs 
once). In production it is a lot rarer, but it does occur with some regularity.

Using crash dumps and debug symbols I was able to pin-point the cause of the 
segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, 
QEMU tries to access two SCSIDiskState variables that are both `0x0`, when QEMU 
expects them to be characters.

1. `s->version`
2. `s->vendor`

I found out that in other parts of the code (`scsi_block_realize`) those 
variables are not expected to always be valid characters, so I copied that over 
and created my own release of the distro package with this patch (here diffed 
against master):

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index efee6739f9..419d6e28f1 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 return -1;
 }
 
+/* Avoid null-pointers leading to segfaults below */
+if (!s->version) {
+s->version = g_strdup(qemu_hw_version());
+}
+if (!s->vendor) {
+s->vendor = g_strdup("QEMU");
+}
+
 /* PAGE CODE == 0 */
 buflen = req->cmd.xfer;
 if (buflen > SCSI_MAX_INQUIRY_LEN) {

And with that fix, we have no segfaults anymore.

So I have that, but there are a few problems:

1. I don’t have a way to reproduce this in the master.
2. I don’t know if this is safe.
3. I don’t know if this just masks the issue and the problem lies somewhere 
deeper.

So.. should this be something I create a bug report for? I feel like a 
maintainer might be able to know much better if this is an actual issue or not. 
I would also suspect that this is indeed a QEMU bug to a degree, even though 
it’s maybe something that does not happen with other librbd1 releases, because 
the assumption that those pointers are always point to valid characters is 
likely wrong.

Also, even if it’s something just for our version, I would appreciate any kind 
of input on my patch and why you think it helps our case.

Cheers,

Denis