Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-12 Thread Denis Krienbühl
I see, thanks for pointing me in the direction of those commits.

I figured a new version would not have the same issue, so we’ll upgrade.

Cheers, Denis

> On 12 Aug 2022, at 17:11, Paolo Bonzini  wrote:
> 
> On 8/12/22 16:50, Peter Maydell wrote:
>> As I said previously, this is still absolutely wrong.
>> If we ever get to this function with either of these
>> fields being NULL then there has been a serious problem,
>> probably a memory corruption or use-after-free, or
>> possibly an attempt to use a partially constructed object.
> 
> Yeah, this would still be a use-after-free.  s->version is never
> written (see for example release_string in hw/core/qdev-properties.c)
> so it means that the storage for "s" has been reused.
> 
> The bug has been fixed in version 5.2 of QEMU with the following commit:
> 
> 7a8202c521 scsi/scsi_bus: switch search direction in scsi_device_find
> 7bed89958b device_core: use drain_call_rcu in in qmp_device_add
> 2d24a64661 device-core: use RCU for list of children of a bus
> 42a90a899e scsi: switch to bus->check_address
> a23151e8cc device-core: use atomic_set on .realized property
> 8ddf958e8d scsi/scsi-bus: scsi_device_find: don't return unrealized devices
> 8ff3449560 scsi/scsi_bus: Add scsi_device_get
> 07a47d4a18 virtio-scsi: use scsi_device_get
> 8cfe8013ba scsi/scsi_bus: fix races in REPORT LUNS
> 
> Feel free to pass this information to Canonical so that they can fix
> their old version of QEMU.
> 
> Paolo
> 




Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-12 Thread Paolo Bonzini

On 8/12/22 16:50, Peter Maydell wrote:

As I said previously, this is still absolutely wrong.
If we ever get to this function with either of these
fields being NULL then there has been a serious problem,
probably a memory corruption or use-after-free, or
possibly an attempt to use a partially constructed object.


Yeah, this would still be a use-after-free.  s->version is never
written (see for example release_string in hw/core/qdev-properties.c)
so it means that the storage for "s" has been reused.

The bug has been fixed in version 5.2 of QEMU with the following commit:

7a8202c521 scsi/scsi_bus: switch search direction in scsi_device_find
7bed89958b device_core: use drain_call_rcu in in qmp_device_add
2d24a64661 device-core: use RCU for list of children of a bus
42a90a899e scsi: switch to bus->check_address
a23151e8cc device-core: use atomic_set on .realized property
8ddf958e8d scsi/scsi-bus: scsi_device_find: don't return unrealized devices
8ff3449560 scsi/scsi_bus: Add scsi_device_get
07a47d4a18 virtio-scsi: use scsi_device_get
8cfe8013ba scsi/scsi_bus: fix races in REPORT LUNS

Feel free to pass this information to Canonical so that they can fix
their old version of QEMU.

Paolo




Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-12 Thread Peter Maydell
On Fri, 12 Aug 2022 at 15:41, Denis Krienbühl  wrote:
>
> I’m not much further with my segfault, though I now know that the number of 
> detaches likely does not matter and it seems to occur during the attach, not 
> the detach part of the code.
>
> I adapted my change to be a bit more sane - I think it might make sense in 
> general, as something is clearly wrong, the code can be reached somehow and 
> in this case we probably just want to stop, instead of pretending everything 
> is okay.
>
> So the following change also works for us, causing no segfaults:
>
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index efee6739f9..7273cd6c3d 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -775,6 +775,15 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
> uint8_t *outbuf)
>  return -1;
>  }
>
> +/* Avoid null-pointers leading to segfaults below */
> +if (!s->version) {
> +return -1;
> +}
> +
> +if (!s->vendor) {
> +return -1;
> +}
> +
>  /* PAGE CODE == 0 */
>  buflen = req->cmd.xfer;
>  if (buflen > SCSI_MAX_INQUIRY_LEN) {
>
> I still hope to get some feedback from anyone that is familiar with hw/scsi. 
> Hopefully this reaches someone who can shed some light on this.

As I said previously, this is still absolutely wrong.
If we ever get to this function with either of these
fields being NULL then there has been a serious problem,
probably a memory corruption or use-after-free, or
possibly an attempt to use a partially constructed object.

You need to find out why these fields have ended up NULL,
not just paper over the problem.

You could put in
  assert(s->version);
  assert(s->vendor);

if you like, though that doesn't really gain much over
just segfaulting.

thanks
-- PMM



Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-12 Thread Denis Krienbühl
I’m not much further with my segfault, though I now know that the number of 
detaches likely does not matter and it seems to occur during the attach, not 
the detach part of the code.

I adapted my change to be a bit more sane - I think it might make sense in 
general, as something is clearly wrong, the code can be reached somehow and in 
this case we probably just want to stop, instead of pretending everything is 
okay.

So the following change also works for us, causing no segfaults:

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index efee6739f9..7273cd6c3d 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -775,6 +775,15 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 return -1;
 }
 
+/* Avoid null-pointers leading to segfaults below */
+if (!s->version) {
+return -1;
+}
+
+if (!s->vendor) {
+return -1;
+}
+
 /* PAGE CODE == 0 */
 buflen = req->cmd.xfer;
 if (buflen > SCSI_MAX_INQUIRY_LEN) {

I still hope to get some feedback from anyone that is familiar with hw/scsi. 
Hopefully this reaches someone who can shed some light on this.

Cheers and enjoy your weekend,

Denis

> On 9 Aug 2022, at 18:51, Peter Maydell  wrote:
> 
> On Tue, 9 Aug 2022 at 17:26, Denis Krienbühl  > wrote:
>> On 9 Aug 2022, at 18:15, Peter Maydell  wrote:
>>> My wild guess is that there's a race condition somewhere such
>>> that when you're doing this huge amount of detaches, very rarely
>>> a disk is detached and deleted but this INQUIRY request is
>>> incorrectly still sent to the disk (which being a freed object,
>>> might be overwritten with other stuff). But that is purely a guess.
>> 
>> So.. should this be something I create a bug report for?
>> 
>> 
>>> If you can repro this on current head-of-git, or at least on
>>> the most recent release, then yes, file a bug report.
> 
>> The best I can currently do is start to log what’s going on. Since
>> I’m not at all familiar with SCSI and this code-base, do you have
>> any tipps on what I should log to maybe find out where this
>> race-condition occurs?
>> 
>> Or if there’s any kind of documentation I could read to understand
>> better what is going on in the hw/scsi subsystem and how I should
>> navigate the code. After reading your explanation we’ll probably
>> look for other workarounds, but I would love to understand what’s
>> going on.
> 
> Paolo and Fam are the SCSI subsystem maintainers. They might know
> whether this sounds like a bug that's already been fixed at some
> point, or have other suggestions.
> 
> Context (ie link to the start of this thread on the list archive):
> https://lists.gnu.org/archive/html/qemu-discuss/2022-08/msg00011.html 
> 
> 
> thanks
> -- PMM



Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Peter Maydell
On Tue, 9 Aug 2022 at 17:26, Denis Krienbühl  wrote:
> On 9 Aug 2022, at 18:15, Peter Maydell  wrote:
>> My wild guess is that there's a race condition somewhere such
>> that when you're doing this huge amount of detaches, very rarely
>> a disk is detached and deleted but this INQUIRY request is
>> incorrectly still sent to the disk (which being a freed object,
>> might be overwritten with other stuff). But that is purely a guess.
>
> So.. should this be something I create a bug report for?
>
>
>> If you can repro this on current head-of-git, or at least on
>> the most recent release, then yes, file a bug report.

> The best I can currently do is start to log what’s going on. Since
> I’m not at all familiar with SCSI and this code-base, do you have
> any tipps on what I should log to maybe find out where this
> race-condition occurs?
>
> Or if there’s any kind of documentation I could read to understand
> better what is going on in the hw/scsi subsystem and how I should
> navigate the code. After reading your explanation we’ll probably
> look for other workarounds, but I would love to understand what’s
> going on.

Paolo and Fam are the SCSI subsystem maintainers. They might know
whether this sounds like a bug that's already been fixed at some
point, or have other suggestions.

Context (ie link to the start of this thread on the list archive):
https://lists.gnu.org/archive/html/qemu-discuss/2022-08/msg00011.html

thanks
-- PMM



Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Denis Krienbühl

> On 9 Aug 2022, at 18:15, Peter Maydell  wrote:
> 
> My wild guess is that there's a race condition somewhere such
> that when you're doing this huge amount of detaches, very rarely
> a disk is detached and deleted but this INQUIRY request is
> incorrectly still sent to the disk (which being a freed object,
> might be overwritten with other stuff). But that is purely a guess.
> 
>> So.. should this be something I create a bug report for?
> 
> If you can repro this on current head-of-git, or at least on
> the most recent release, then yes, file a bug report.

The best I can currently do is start to log what’s going on. Since I’m not at 
all familiar with SCSI and this code-base, do you have any tipps on what I 
should log to maybe find out where this race-condition occurs?

Or if there’s any kind of documentation I could read to understand better what 
is going on in the hw/scsi subsystem and how I should navigate the code. After 
reading your explanation we’ll probably look for other workarounds, but I would 
love to understand what’s going on.

Appreciate your help,

Denis

Re: Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Peter Maydell
On Tue, 9 Aug 2022 at 16:37, Denis Krienbühl  wrote:
> We are running a Ceph 15.2.16 cluster at work, connect VMs to it using 
> librbd1 14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive 
> (4.2-3ubuntu6.23~cloud0).
>
> A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This 
> seems to only happen when lots of volume detach requests are executed in 
> parallel. We can somewhat reliably reproduce this in our staging environment 
> (it still takes thousands of detaches and sometimes hours of running tests 
> until it occurs once). In production it is a lot rarer, but it does occur 
> with some regularity.
>
> Using crash dumps and debug symbols I was able to pin-point the cause of the 
> segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, 
> QEMU tries to access two SCSIDiskState variables that are both `0x0`, when 
> QEMU expects them to be characters.
>
> 1. `s->version`
> 2. `s->vendor`
>
> I found out that in other parts of the code (`scsi_block_realize`) those 
> variables are not expected to always be valid characters, so I copied that 
> over and created my own release of the distro package with this patch (here 
> diffed against master):
>
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index efee6739f9..419d6e28f1 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
> uint8_t *outbuf)
>  return -1;
>  }
>
> +/* Avoid null-pointers leading to segfaults below */
> +if (!s->version) {
> +s->version = g_strdup(qemu_hw_version());
> +}
> +if (!s->vendor) {
> +s->vendor = g_strdup("QEMU");
> +}
> +
>  /* PAGE CODE == 0 */
>  buflen = req->cmd.xfer;
>  if (buflen > SCSI_MAX_INQUIRY_LEN) {

If these are NULL, then something has gone weirdly wrong (possibly
memory corruption or other Bad Things).

At least in the version of this code in upstream git, the
'realize' method does:

if (!s->version) {
s->version = g_strdup(qemu_hw_version());
}
if (!s->vendor) {
s->vendor = g_strdup("QEMU");
}

which is to say: when the device is created, if the user didn't
specify version/vendor strings via QOM properties, we set them to
these defaults. So at all points in execution after this, these
pointers can't be NULL.

> 1. I don’t have a way to reproduce this in the master.
> 2. I don’t know if this is safe.
> 3. I don’t know if this just masks the issue and the problem lies somewhere 
> deeper.

Number 3.

My wild guess is that there's a race condition somewhere such
that when you're doing this huge amount of detaches, very rarely
a disk is detached and deleted but this INQUIRY request is
incorrectly still sent to the disk (which being a freed object,
might be overwritten with other stuff). But that is purely a guess.

> So.. should this be something I create a bug report for?

If you can repro this on current head-of-git, or at least on
the most recent release, then yes, file a bug report.

thanks
-- PMM



Segfault in hw/scsi/scsi-disk.c caused by null pointer

2022-08-09 Thread Denis Krienbühl
Hi

We are running a Ceph 15.2.16 cluster at work, connect VMs to it using librbd1 
14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive 
(4.2-3ubuntu6.23~cloud0).

A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This seems 
to only happen when lots of volume detach requests are executed in parallel. We 
can somewhat reliably reproduce this in our staging environment (it still takes 
thousands of detaches and sometimes hours of running tests until it occurs 
once). In production it is a lot rarer, but it does occur with some regularity.

Using crash dumps and debug symbols I was able to pin-point the cause of the 
segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, 
QEMU tries to access two SCSIDiskState variables that are both `0x0`, when QEMU 
expects them to be characters.

1. `s->version`
2. `s->vendor`

I found out that in other parts of the code (`scsi_block_realize`) those 
variables are not expected to always be valid characters, so I copied that over 
and created my own release of the distro package with this patch (here diffed 
against master):

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index efee6739f9..419d6e28f1 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 return -1;
 }
 
+/* Avoid null-pointers leading to segfaults below */
+if (!s->version) {
+s->version = g_strdup(qemu_hw_version());
+}
+if (!s->vendor) {
+s->vendor = g_strdup("QEMU");
+}
+
 /* PAGE CODE == 0 */
 buflen = req->cmd.xfer;
 if (buflen > SCSI_MAX_INQUIRY_LEN) {

And with that fix, we have no segfaults anymore.

So I have that, but there are a few problems:

1. I don’t have a way to reproduce this in the master.
2. I don’t know if this is safe.
3. I don’t know if this just masks the issue and the problem lies somewhere 
deeper.

So.. should this be something I create a bug report for? I feel like a 
maintainer might be able to know much better if this is an actual issue or not. 
I would also suspect that this is indeed a QEMU bug to a degree, even though 
it’s maybe something that does not happen with other librbd1 releases, because 
the assumption that those pointers are always point to valid characters is 
likely wrong.

Also, even if it’s something just for our version, I would appreciate any kind 
of input on my patch and why you think it helps our case.

Cheers,

Denis