Re: NVME hotplug support ?

2024-01-29 Thread Hannes Reinecke

On 1/29/24 14:13, Damien Hedde wrote:



On 1/24/24 08:47, Hannes Reinecke wrote:

On 1/24/24 07:52, Philippe Mathieu-Daudé wrote:

Hi Hannes,

[+Markus as QOM/QDev rubber duck]

On 23/1/24 13:40, Hannes Reinecke wrote:

On 1/23/24 11:59, Damien Hedde wrote:

Hi all,

We are currently looking into hotplugging nvme devices and it is 
currently not possible:

When nvme was introduced 2 years ago, the feature was disabled.

commit cc6fb6bc506e6c47ed604fcb7b7413dff0b7d845
Author: Klaus Jensen
Date:   Tue Jul 6 10:48:40 2021 +0200

    hw/nvme: mark nvme-subsys non-hotpluggable

    We currently lack the infrastructure to handle subsystem
    hotplugging, so disable it.


Does someone know what's lacking, or does anyone have some tips/ideas on
what we should develop to add the support?


Problem is that the object model is messed up. In qemu namespaces 
are attached to controllers, which in turn are children of the PCI 
device.

There are subsystems, but these just reference the controller.

So if you hotunplug the PCI device you detach/destroy the controller 
and detach the namespaces from the controller.
But if you hotplug the PCI device again the NVMe controller will be 
attached to the PCI device, but the namespaces will still be detached.


Klaus said he was going to fix that, and I dimly remember some patches
floating around. But apparently it never went anywhere.

Fundamental problem is that the NVMe hierarchy as per spec is 
incompatible with the qemu object model; qemu requires a strict

tree model where every object has exactly _one_ parent.


The modelling problem is not clear to me.
Do you have an example of how the NVMe hierarchy should be?


Sure.

As per NVMe spec we have this hierarchy:

            ---> subsys ---
           |               |
           |               V
      controller       namespaces

There can be several controllers, and several
namespaces.
The initiator (ie the linux 'nvme' driver) connects
to a controller, queries the subsystem for the attached
namespaces, and presents each namespace as a block device.

For Qemu we have the problem that every device _must_ be
a direct descendant of the parent (expressed by the fact
that each 'parent' object is embedded in the device object).

So if we were to present a NVMe PCI device, the controller
must be derived from the PCI device:

pci -> controller

but now we have to express the NVMe hierarchy, too:

pci -> ctrl1 -> subsys1 -> namespace1

which actually works.
We can easily attach several namespaces:

pci -> ctrl1 -> subsys1 -> namespace2

For a single controller and a single subsystem.
However, as mentioned above, there can be _several_
controllers attached to the same subsystem.
So we can express the second controller:

pci -> ctrl2

but we cannot attach the controller to 'subsys1'
as then 'subsys1' would need to be derived from
'ctrl2', and not (as it is now) from 'ctrl1'.

The most logical step would be to make 'subsystems'
their own entity, independent of any controllers.
But then the block devices (which are derived from
the namespaces) could not be traced back
to the PCI device, and a PCI hotplug would not
'automatically' disconnect the nvme block devices.

Plus the subsystem would be independent from the NVMe
PCI devices, so you could have a subsystem with
no controllers attached. And one would wonder who
should be responsible for cleaning that up.
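
For reference, this is roughly how a shared subsystem with two controllers
is expressed on the command line today (an untested sketch; ids and serials
are made up, option names as used by hw/nvme):

./qemu-system-aarch64 ... \
    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0 \
    -device nvme,serial=ctrl-a,subsys=nvme-subsys-0 \
    -device nvme,serial=ctrl-b,subsys=nvme-subsys-0 \
    -drive file=ns1.img,if=none,id=nvme-ns-drive0 \
    -device nvme-ns,drive=nvme-ns-drive0,nsid=1,shared=on

The namespace is then reachable via either controller, but in the QOM tree
it still hangs off the bus of one particular controller.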



Thanks for the details !

My use case is the simple one with no nvme subsystem/namespaces:
- hotplug a pci nvme device (nvme controller) as in the following CLI 
(which automatically puts the drive into a default namespace)


./qemu-system-aarch64 -nographic -M virt \
    -drive file=nvme0.disk,if=none,id=nvme-drive0 \
    -device nvme,serial=nvme0,id=nvmedev0,drive=nvme-drive0

In the simple tree approach, where subsystems and namespaces are not 
shared by controllers, could we delete the whole nvme hierarchy under the 
controller while unplugging it?


In your first message, you said
  > So if you hotunplug the PCI device you detach/destroy the controller
  > and detach the namespaces from the controller.
  > But if you hotplug the PCI device again the NVMe controller will be
  > attached to the PCI device, but the namespaces will still be detached.

Do you mean that if we unplug the pci device we HAVE to keep some nvme 
objects so that if we plug the device back we can recover them?
Or just that it's hard to unplug nvme objects if they are not real qom 
children of the pci device?


The key point for trying out PCI hotplug with qemu is to attach the PCI 
device to its own PCI root port. Cf. the mail from Klaus Jensen for details.
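
Something like this (an untested sketch; the root port and device ids are
made up, and the root port properties may need adjusting for your machine):

./qemu-system-aarch64 -nographic -M virt \
    -device pcie-root-port,id=rp0,chassis=1,slot=1 \
    -drive file=nvme0.disk,if=none,id=nvme-drive0

and then hotplug/unplug through the monitor:

(qemu) device_add nvme,serial=nvme0,id=nvmedev0,drive=nvme-drive0,bus=rp0
(qemu) device_del nvmedev0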


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich




Re: NVME hotplug support ?

2024-01-23 Thread Hannes Reinecke

On 1/24/24 07:52, Philippe Mathieu-Daudé wrote:

Hi Hannes,

[+Markus as QOM/QDev rubber duck]

On 23/1/24 13:40, Hannes Reinecke wrote:

On 1/23/24 11:59, Damien Hedde wrote:

Hi all,

We are currently looking into hotplugging nvme devices and it is 
currently not possible:

When nvme was introduced 2 years ago, the feature was disabled.

commit cc6fb6bc506e6c47ed604fcb7b7413dff0b7d845
Author: Klaus Jensen
Date:   Tue Jul 6 10:48:40 2021 +0200

    hw/nvme: mark nvme-subsys non-hotpluggable

    We currently lack the infrastructure to handle subsystem
    hotplugging, so disable it.


Does someone know what's lacking, or does anyone have some tips/ideas on what 
we should develop to add the support?


Problem is that the object model is messed up. In qemu namespaces are 
attached to controllers, which in turn are children of the PCI device.

There are subsystems, but these just reference the controller.

So if you hotunplug the PCI device you detach/destroy the controller 
and detach the namespaces from the controller.
But if you hotplug the PCI device again the NVMe controller will be 
attached to the PCI device, but the namespaces will still be detached.


Klaus said he was going to fix that, and I dimly remember some patches
floating around. But apparently it never went anywhere.

Fundamental problem is that the NVMe hierarchy as per spec is 
incompatible with the qemu object model; qemu requires a strict

tree model where every object has exactly _one_ parent.


The modelling problem is not clear to me.
Do you have an example of how the NVMe hierarchy should be?


Sure.

As per NVMe spec we have this hierarchy:

            ---> subsys ---
           |               |
           |               V
      controller       namespaces

There can be several controllers, and several
namespaces.
The initiator (ie the linux 'nvme' driver) connects
to a controller, queries the subsystem for the attached
namespaces, and presents each namespace as a block device.

For Qemu we have the problem that every device _must_ be
a direct descendant of the parent (expressed by the fact
that each 'parent' object is embedded in the device object).

So if we were to present a NVMe PCI device, the controller
must be derived from the PCI device:

pci -> controller

but now we have to express the NVMe hierarchy, too:

pci -> ctrl1 -> subsys1 -> namespace1

which actually works.
We can easily attach several namespaces:

pci -> ctrl1 -> subsys1 -> namespace2

For a single controller and a single subsystem.
However, as mentioned above, there can be _several_
controllers attached to the same subsystem.
So we can express the second controller:

pci -> ctrl2

but we cannot attach the controller to 'subsys1'
as then 'subsys1' would need to be derived from
'ctrl2', and not (as it is now) from 'ctrl1'.

The most logical step would be to make 'subsystems'
their own entity, independent of any controllers.
But then the block devices (which are derived from
the namespaces) could not be traced back
to the PCI device, and a PCI hotplug would not
'automatically' disconnect the nvme block devices.

Plus the subsystem would be independent from the NVMe
PCI devices, so you could have a subsystem with
no controllers attached. And one would wonder who
should be responsible for cleaning that up.

Cheers,

Hannes




Re: NVME hotplug support ?

2024-01-23 Thread Hannes Reinecke

On 1/23/24 11:59, Damien Hedde wrote:

Hi all,

We are currently looking into hotplugging nvme devices and it is currently not 
possible:
When nvme was introduced 2 years ago, the feature was disabled.

commit cc6fb6bc506e6c47ed604fcb7b7413dff0b7d845
Author: Klaus Jensen
Date:   Tue Jul 6 10:48:40 2021 +0200

hw/nvme: mark nvme-subsys non-hotpluggable

We currently lack the infrastructure to handle subsystem hotplugging, so
disable it.


Does someone know what's lacking, or does anyone have some tips/ideas on what we 
should develop to add the support?

Problem is that the object model is messed up. In qemu namespaces are 
attached to controllers, which in turn are children of the PCI device.

There are subsystems, but these just reference the controller.

So if you hotunplug the PCI device you detach/destroy the controller and 
detach the namespaces from the controller.
But if you hotplug the PCI device again the NVMe controller will be 
attached to the PCI device, but the namespaces will still be detached.


Klaus said he was going to fix that, and I dimly remember some patches
floating around. But apparently it never went anywhere.

Fundamental problem is that the NVMe hierarchy as per spec is 
incompatible with the qemu object model; qemu requires a strict

tree model where every object has exactly _one_ parent.

Cheers,

Hannes




Re: [PATCH v2] hw/scsi/megasas: Silent GCC duplicated-cond warning

2023-06-05 Thread Hannes Reinecke

On 6/5/23 15:44, Philippe Mathieu-Daudé wrote:

ping?

On 28/3/23 23:01, Philippe Mathieu-Daudé wrote:

From: Philippe Mathieu-Daudé 

GCC9 is confused when building with CFLAG -O3:

   hw/scsi/megasas.c: In function ‘megasas_scsi_realize’:
   hw/scsi/megasas.c:2387:26: error: duplicated ‘if’ condition 
[-Werror=duplicated-cond]

    2387 | } else if (s->fw_sge >= 128 - MFI_PASS_FRAME_SIZE) {
   hw/scsi/megasas.c:2385:19: note: previously used here
    2385 | if (s->fw_sge >= MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE) {
   cc1: all warnings being treated as errors

When this device was introduced in commit e8f943c3bcc, the author
cared about modularity, using a definition for the firmware limit.

However if the firmware limit isn't changed (MEGASAS_MAX_SGE = 128),
the code ends doing the same check twice.

Per the maintainer [*]:


The original code assumed that one could change MFI_PASS_FRAME_SIZE,
but it turned out not to be possible as it's being hardcoded in the
drivers themselves (even though the interface provides mechanisms to
query it). So we can remove the duplicate lines.


Add the 'MEGASAS_MIN_SGE' definition for the '64' magic value,
slightly rewrite the condition check to simplify the logic a bit,
and remove the unnecessary / duplicated check.

[*] 
https://lore.kernel.org/qemu-devel/e0029fc5-882f-1d63-15e3-1c3dbe9b6...@suse.de/


Suggested-by: Paolo Bonzini 
Signed-off-by: Philippe Mathieu-Daudé 
Signed-off-by: Philippe Mathieu-Daudé 
---
v1: 
https://lore.kernel.org/qemu-devel/20191217173425.5082-6-phi...@redhat.com/

---
  hw/scsi/megasas.c | 16 ++--
  1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index 9cbbb16121..32c70c9e99 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -42,6 +42,7 @@
  #define MEGASAS_MAX_FRAMES 2048 /* Firmware limit at 65535 */
  #define MEGASAS_DEFAULT_FRAMES 1000 /* Windows requires this */
  #define MEGASAS_GEN2_DEFAULT_FRAMES 1008 /* Windows requires this */

+#define MEGASAS_MIN_SGE 64
  #define MEGASAS_MAX_SGE 128 /* Firmware limit */
  #define MEGASAS_DEFAULT_SGE 80
  #define MEGASAS_MAX_SECTORS 0xFFFF  /* No real limit */
@@ -2356,6 +2357,7 @@ static void megasas_scsi_realize(PCIDevice *dev, 
Error **errp)

  MegasasState *s = MEGASAS(dev);
  MegasasBaseClass *b = MEGASAS_GET_CLASS(s);
  uint8_t *pci_conf;
+    uint32_t sge;
  int i, bar_type;
  Error *err = NULL;
  int ret;
@@ -2424,13 +2426,15 @@ static void megasas_scsi_realize(PCIDevice 
*dev, Error **errp)

  if (!s->hba_serial) {
  s->hba_serial = g_strdup(MEGASAS_HBA_SERIAL);
  }
-    if (s->fw_sge >= MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE) {
-    s->fw_sge = MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE;
-    } else if (s->fw_sge >= 128 - MFI_PASS_FRAME_SIZE) {
-    s->fw_sge = 128 - MFI_PASS_FRAME_SIZE;
-    } else {
-    s->fw_sge = 64 - MFI_PASS_FRAME_SIZE;
+
+    sge = s->fw_sge + MFI_PASS_FRAME_SIZE;
+    if (sge < MEGASAS_MIN_SGE) {
+    sge = MEGASAS_MIN_SGE;
+    } else if (sge >= MEGASAS_MAX_SGE) {
+    sge = MEGASAS_MAX_SGE;
  }
+    s->fw_sge = sge - MFI_PASS_FRAME_SIZE;
+
  if (s->fw_cmds > MEGASAS_MAX_FRAMES) {
  s->fw_cmds = MEGASAS_MAX_FRAMES;
  }



Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




Re: [PATCH] hw/block/nvme: re-enable NVMe PCI hotplug

2022-10-12 Thread Hannes Reinecke

On 10/10/22 19:01, Daniel Wagner wrote:

On Tue, May 11, 2021 at 06:12:47PM +0200, Hannes Reinecke wrote:

On 5/11/21 6:03 PM, Klaus Jensen wrote:

On May 11 16:54, Hannes Reinecke wrote:

On 5/11/21 3:37 PM, Klaus Jensen wrote:

On May 11 15:12, Hannes Reinecke wrote:

On 5/11/21 2:22 PM, Klaus Jensen wrote:

[ .. ]

The hotplug fix looks good - I'll post a series that
tries to integrate
both.


Ta.

The more I think about it, the more I think we should be looking into
reparenting the namespaces to the subsystem.
That would have the _immediate_ benefit that 'device_del' and
'device_add' becomes symmetric (ie one doesn't have to do a separate
'device_add nvme-ns'), as the nvme namespace is not affected by the
hotplug event.
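
I.e. today, bringing a controller back after an unplug needs a separate step
to re-attach the namespace, roughly (an untested sketch with made-up ids,
assuming the backing block node was created with -blockdev so it survives
the unplug):

(qemu) device_del nvme0
(qemu) device_add nvme,id=nvme0,serial=deadbeef,subsys=nvme-subsys-0,bus=rp0
(qemu) device_add nvme-ns,drive=nvme-ns-drive0,bus=nvme0,nsid=1

With the namespaces reparented to the subsystem the last step would go away.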



I have that working, but I'm struggling with a QEMU API technicality in
that I apparently cannot simply move the NvmeBus creation to the
nvme-subsys device. For some reason the bus is not available for the
nvme-ns devices. That is, if one does something like this:

   -device nvme-subsys,...
   -device nvme-ns,...

Then I get an error that "no 'nvme-bus' bus found for device 'nvme-ns'".
This is probably just me not grok'ing the qdev well enough, so I'll keep
trying to fix that. What works now is to have the regular setup:


_Normally_ the 'id' of the parent device spans a bus, so the syntax
should be

-device nvme-subsys,id=subsys1,...
-device nvme-ns,bus=subsys1,...


Yeah, I know, I just oversimplified the example. This *is* how I wanted
it to work ;)



As for the nvme device I would initially expose any namespace from the
subsystem to the controller; the nvme spec has some concept of 'active'
or 'inactive' namespaces which would allow us to blank out individual
namespaces on a per-controller basis, but I fear that's not easy to
model with qdev and the structure above.



The nvme-ns device already supports the boolean 'detached' parameter to
support the concept of an inactive namespace.


Yeah, but that doesn't really work if we move the namespace to the
subsystem; the 'detached' parameter is for the controller<->namespace
relationship.
That's why I meant we have to have some sort of NSID map for the controller
such that the controller knows which NSID to access.
I guess we can copy the trick with the NSID array, and reverse the operation
we have now wrt subsystem; keep the main NSID array in the subsystem, and
per-controller NSID arrays holding those which can be accessed.

And ignore the commandline for now; figure that one out later.


[..]


Sorry to ask but has there been any progress on this topic? I just ran
into the same issue: adding an nvme device during runtime is not
showing any namespace.

I _thought_ that the pci hotplug fixes have now been merged with qemu 
upstream. Klaus?


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




Re: [PATCH v11 7/7] docs/zoned-storage: add zoned device documentation

2022-10-10 Thread Hannes Reinecke

On 10/10/22 04:21, Sam Li wrote:

Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
  docs/devel/zoned-storage.rst   | 40 ++
  docs/system/qemu-block-drivers.rst.inc |  6 
  2 files changed, 46 insertions(+)
  create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 00..deaa4ce99b
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,40 @@
+=
+zoned-storage
+=
+
+Zoned Block Devices (ZBDs) devide the LBA space into block regions called zones


divide


+that are larger than the LBA size. They can only allow sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-
+QEMU block layer has three zoned storage model:
+- BLK_Z_HM: This model only allows sequential writes access. It supports a set
+of ZBD-specific I/O request that used by the host to manage device zones.


Maybe:
This model only allows for sequential write access to zones. It supports 
ZBD-specific I/O requests to manage device zones.



+- BLK_Z_HA: It deals with both sequential writes and random writes access.


Maybe better:
This model allows sequential and random writes to zones. It supports 
ZBD-specific I/O requests to manage device zones.



+- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as
+non-zoned devices.


Maybe:
This is the default model with no zones support; it includes both 
regular and drive-managed ZBD devices. ZBD-specific I/O requests are not 
supported.



+
+The block device information resides inside BlockDriverState. QEMU uses
+BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph(for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io is:
+$ path/to/qemu-io --image-opts -n driver=zoned_host_device,filename=/dev/nullb0
+-c "zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..0b97227fd9 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
you may corrupt your host data (use the ``-snapshot`` command
line option or modify the device permissions accordingly).
  
+Zoned block devices

+  Zoned block devices can be passed through to the guest if the emulated 
storage
+  controller supports zoned storage. Use ``--blockdev zoned_host_device,
+  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
+  as ``drive0``.
+
  Windows
  ^^^
  


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




Re: [PATCH v11 5/7] config: add check to block layer

2022-10-10 Thread Hannes Reinecke

On 10/10/22 04:21, Sam Li wrote:

Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
  block.c  | 17 +
  block/file-posix.c   | 13 +
  block/raw-format.c   |  1 +
  include/block/block_int-common.h |  5 +
  4 files changed, 36 insertions(+)
  mode change 100644 => 100755 block.c
  mode change 100644 => 100755 block/file-posix.c

diff --git a/block.c b/block.c
old mode 100644
new mode 100755
index bc85f46eed..bf2f2918e7
--- a/block.c
+++ b/block.c
@@ -7947,6 +7947,23 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
  return;
  }
  
+/*

+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children &&
+/* The host-aware model allows zoned storage constraints and random
+ * write. Allow mixing host-aware and non-zoned drivers. Using
+ * host-aware device as a regular device. */


It's a very unusual style to put comments inside a condition.
Please move it before or after the condition to keep the condition together.


+child_bs->bl.zoned == BLK_Z_HM) {
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
  if (!QLIST_EMPTY(&child_bs->parents)) {
  error_setg(errp, "The node %s already has a parent",
 child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
old mode 100644
new mode 100755
index 226f5d48f5..a9d347292e
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -778,6 +778,19 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
  goto fail;
  }
  }
+#ifdef CONFIG_BLKZONED
+/*
+ * The kernel page cache does not reliably work for writes to SWR zones
+ * of zoned block device because it can not guarantee the order of writes.
+ */
+if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
+if (!(s->open_flags & O_DIRECT)) {


You can join these conditions with '&&' and save one level of indentation.


+error_setg(errp, "driver=zoned_host_device was specified, but it "
+ "requires cache.direct=on, which was not 
specified.");
+return -EINVAL; /* No host kernel page cache */
+}
+}
+#endif
  
  if (S_ISBLK(st.st_mode)) {

  #ifdef BLKDISCARDZEROES
diff --git a/block/raw-format.c b/block/raw-format.c
index 618c6b1ec2..b885688434 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
  BlockDriver bdrv_raw = {
  .format_name  = "raw",
  .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
  .bdrv_probe   = &raw_probe,
  .bdrv_reopen_prepare  = &raw_reopen_prepare,
  .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index cdc06e77a6..37dddc603c 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -127,6 +127,11 @@ struct BlockDriver {
   */
  bool is_format;
  
+/*

+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
  /*
   * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
   * this field set to true, except ones that are defined only by their


The remainder looks good.
Once you fixed the minor editing issues you can add:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




Re: [PATCH v11 4/7] raw-format: add zone operations to pass through requests

2022-10-10 Thread Hannes Reinecke

On 10/10/22 04:21, Sam Li wrote:

raw-format driver usually sits on top of file-posix driver. It needs to
pass through requests of zone commands.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
---
  block/raw-format.c | 13 +
  1 file changed, 13 insertions(+)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




Re: [PATCH v11 1/7] include: add zoned device structs

2022-10-10 Thread Hannes Reinecke

On 10/10/22 04:21, Sam Li wrote:

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
---
  include/block/block-common.h | 43 
  1 file changed, 43 insertions(+)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




Re: [RFC v4 3/9] file-posix: introduce get_sysfs_long_val for a block queue of sysfs attribute

2022-07-12 Thread Hannes Reinecke

On 7/12/22 04:13, Sam Li wrote:

Use sysfs attribute files to get the zoned device information in case
that ioctl() commands of zone management interface won't work.

Signed-off-by: Sam Li 
---
  block/file-posix.c | 38 +++---
  1 file changed, 27 insertions(+), 11 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



Re: [RFC v4 4/9] file-posix: introduce get_sysfs_str_val for device zoned model.

2022-07-12 Thread Hannes Reinecke

On 7/12/22 04:13, Sam Li wrote:

Signed-off-by: Sam Li 
---
  block/file-posix.c   | 60 
  include/block/block-common.h |  4 +--
  2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 3161d39ea4..42708012ff 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1279,6 +1279,65 @@ out:
  #endif
  }
  
+/*

+ * Convert the zoned attribute file in sysfs to internal value.
+ */
+static zone_model get_sysfs_str_val(int fd, struct stat *st) {
+#ifdef CONFIG_LINUX
+char buf[32];
+char *sysfspath = NULL;
+int ret, offset;
+int sysfd = -1;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/zoned",
+major(st->st_rdev), minor(st->st_rdev));
+sysfd = open(sysfspath, O_RDONLY);
+if (sysfd == -1) {
+ret = -errno;
+goto out;
+}
+offset = 0;
+do {
+ret = read(sysfd, buf + offset, sizeof(buf) - 1 + offset);
+if (ret > 0) {
+offset += ret;
+}
+} while (ret == -1);
+/* The file is ended with '\n' */
+if (buf[ret - 1] == '\n') {
+buf[ret - 1] = '\0';
+}
+
+if (strcmp(buf, "host-managed") == 0) {
+return BLK_Z_HM;
+} else if (strcmp(buf, "host-aware") == 0) {
+return BLK_Z_HA;
+} else {
+return -ENOTSUP;
+}
+
+out:
+if (sysfd != -1) {
+close(sysfd);
+}
+g_free(sysfspath);
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
  static int hdev_get_max_segments(int fd, struct stat *st) {
  return get_sysfs_long_val(fd, st, "max_segments");
  }
@@ -1885,6 +1944,7 @@ static int handle_aiocb_zone_mgmt(void *opaque) {
  int64_t len = aiocb->aio_nbytes;
  zone_op op = aiocb->zone_mgmt.op;
  
+zone_model mod;

  struct blk_zone_range range;
  const char *ioctl_name;
  unsigned long ioctl_op;
diff --git a/include/block/block-common.h b/include/block/block-common.h
index 78cddeeda5..35e00afe8e 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -56,8 +56,8 @@ typedef enum zone_op {
  } zone_op;
  
  typedef enum zone_model {

-BLK_Z_HM,
-BLK_Z_HA,
+BLK_Z_HM = 0x1,
+BLK_Z_HA = 0x2,
  } zone_model;
  
  typedef enum BlkZoneCondition {


This hunk is unrelated, please move it into a different patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



Re: [RFC v4 1/9] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-07-12 Thread Hannes Reinecke

On 7/12/22 04:13, Sam Li wrote:

By adding zone management operations in BlockDriver, storage
controller emulation can use the new block layer APIs including
zone_report and zone_mgmt(open, close, finish, reset).

Signed-off-by: Sam Li 
---
  block/block-backend.c|  41 ++
  block/coroutines.h   |   5 +
  block/file-posix.c   | 236 +++
  include/block/block-common.h |  43 +-
  include/block/block_int-common.h |  20 +++
  5 files changed, 344 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index f425b00793..0a05247ae4 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1806,6 +1806,47 @@ int blk_flush(BlockBackend *blk)
  return ret;
  }
  
+/*

+ * Send a zone_report command.
+ * offset can be any number within the zone size. No alignment for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+int64_t *nr_zones,
+BlockZoneDescriptor *zones)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+ret = bdrv_co_zone_report(blk->root->bs, offset, nr_zones, zones);
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * Offset is the start of a zone and len is aligned to zones.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
+int64_t offset, int64_t len)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+ret = blk_check_byte_request(blk, offset, len);
+if (ret < 0) {
+return ret;
+}
+
+ret = bdrv_co_zone_mgmt(blk->root->bs, op, offset, len);
+blk_dec_in_flight(blk);
+return ret;
+}
+
  void blk_drain(BlockBackend *blk)
  {
  BlockDriverState *bs = blk_bs(blk);
diff --git a/block/coroutines.h b/block/coroutines.h
index 830ecaa733..19aa96cc56 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -80,6 +80,11 @@ int coroutine_fn
  blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes);
  
  int coroutine_fn blk_co_do_flush(BlockBackend *blk);

+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+int64_t *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op,
+  int64_t offset, int64_t len);
  
  
  /*

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..e7523ae2ed 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -67,6 +67,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -216,6 +217,13 @@ typedef struct RawPosixAIOData {
  PreallocMode prealloc;
  Error **errp;
  } truncate;
+struct {
+int64_t *nr_zones;


Why is this a pointer?
I'd rather use a number here, seeing that it's the number
of zones in the *zones array ...

But the remainder looks good.

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



Re: [RFC v4 2/9] qemu-io: add zoned block device operations.

2022-07-12 Thread Hannes Reinecke
.argmax = 2,
+    .args = "offset number",
+    .oneline = "report zone information",
+};
+

'zp' is a bit odd; I'd rather go with 'zrp'.

Otherwise:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



Re: [RFC v3 3/5] file-posix: introduce get_sysfs_long_val for zoned device information.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

Use sysfs attribute files to get the zoned device information in case
that ioctl() commands of zone management interface won't work. It can
return long type of value like chunk_sectors, zoned_append_max_bytes,
max_open_zones, max_active_zones.
---
  block/file-posix.c | 37 +
  1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 1b8b0d351f..73c2cdfbca 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1216,15 +1216,19 @@ static int hdev_get_max_hw_transfer(int fd, struct stat 
*st)
  #endif
  }
  
-static int hdev_get_max_segments(int fd, struct stat *st)

-{
+/*
+ * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
+ * max_open_zones, max_active_zones) through sysfs attribute files.
+ */
+static long get_sysfs_long_val(int fd, struct stat *st,
+   const char *attribute) {
  #ifdef CONFIG_LINUX
  char buf[32];
  const char *end;
  char *sysfspath = NULL;
  int ret;
  int sysfd = -1;
-long max_segments;
+long val;
  
  if (S_ISCHR(st->st_mode)) {

  if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
@@ -1237,8 +1241,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
  return -ENOTSUP;
  }
  
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",

-major(st->st_rdev), minor(st->st_rdev));
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
  sysfd = open(sysfspath, O_RDONLY);
  if (sysfd == -1) {
  ret = -errno;
@@ -1256,9 +1261,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
  }
  buf[ret] = 0;
  /* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
+ret = qemu_strtol(buf, &end, 10, &val);
  if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+ret = val;
  }
  
  out:

@@ -1272,6 +1277,15 @@ out:
  #endif
  }
  
+static int hdev_get_max_segments(int fd, struct stat *st) {

+int ret;
+ret = get_sysfs_long_val(fd, st, "max_segments");
+if (ret < 0) {
+return -1;
+}
+return ret;
+}
+
  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
  {
  BDRVRawState *s = bs->opaque;
@@ -1872,6 +1886,7 @@ static int handle_aiocb_zone_report(void *opaque) {
  
  static int handle_aiocb_zone_mgmt(void *opaque) {

  RawPosixAIOData *aiocb = opaque;
+BlockDriverState *s = aiocb->bs;
  int fd = aiocb->aio_fildes;
  int64_t offset = aiocb->aio_offset;
  int64_t len = aiocb->aio_nbytes;
@@ -1884,11 +1899,9 @@ static int handle_aiocb_zone_mgmt(void *opaque) {
  int64_t zone_size_mask;
  int ret;
  
-ret = ioctl(fd, BLKGETZONESZ, &zone_size);

-if (ret) {
-return -1;
-}
-
+g_autofree struct stat *file = g_new(struct stat, 1);
+stat(s->filename, file);
+zone_size = get_sysfs_long_val(fd, file, "chunk_sectors");
  zone_size_mask = zone_size - 1;
  if (offset & zone_size_mask) {
      error_report("offset is not the start of a zone");


Round of applause.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 4/5] file-posix: introduce get_sysfs_str_val for device zoned model.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

---
  block/file-posix.c   | 60 
  include/block/block-common.h |  4 +--
  2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 73c2cdfbca..74c0245e0f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1277,6 +1277,66 @@ out:
  #endif
  }
  
+/*

+ * Convert the zoned attribute file in sysfs to internal value.
+ */
+static zone_model get_sysfs_str_val(int fd, struct stat *st) {
+#ifdef CONFIG_LINUX
+char buf[32];
+char *sysfspath = NULL;
+int ret;
+int sysfd = -1;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/zoned",
+major(st->st_rdev), minor(st->st_rdev));
+sysfd = open(sysfspath, O_RDONLY);
+if (sysfd == -1) {
+ret = -errno;
+goto out;
+}
+do {
+ret = read(sysfd, buf, sizeof(buf) - 1);
+} while (ret == -1 && errno == EINTR);


This is wrong.
read() might return a value smaller than the 'len' argument (sizeof(buf) 
- 1 in your case). But in that case it's a short read, and one needs to 
call 'read()' again to fetch the remaining bytes.


So the correct code would be something like:

offset = 0;
do {
    ret = read(sysfd, buf + offset, sizeof(buf) - 1 - offset);
    if (ret > 0)
        offset += ret;
} while (ret > 0);

Not that you'd actually need it; reads from sysfs are basically never 
interrupted, so you should be able to read from an attribute in one go.

IE alternatively you can drop the 'while' loop and just call read().


+if (ret < 0) {
+ret = -errno;
+goto out;
+} else if (ret == 0) {
+ret = -EIO;
+goto out;
+}
+buf[ret] = 0;
+
+/* The file is ended with '\n' */


I'd rather check if the string ends with an '\n', and overwrite
it with a '\0'. That way you'd be insulated against any changes
to sysfs.
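
For reference, the attribute being parsed here is just a single word plus a
trailing newline, e.g. on a zoned null_blk test device:

$ cat /sys/block/nullb0/queue/zoned
host-managed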


+if (strcmp(buf, "host-managed\n") == 0) {
+return BLK_Z_HM;
+} else if (strcmp(buf, "host-aware\n") == 0) {
+return BLK_Z_HA;
+} else {
+return -ENOTSUP;
+}
+
+out:
+if (sysfd != -1) {
+close(sysfd);
+}
+g_free(sysfspath);
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
  static int hdev_get_max_segments(int fd, struct stat *st) {
  int ret;
  ret = get_sysfs_long_val(fd, st, "max_segments");


And as you already set a precedent in your previous patch, I'd recommend 
splitting this into two patches, one introducing a generic function 
'get_sysfs_str_val()' which returns a string and another function
(eg hdev_get_zone_model()) which calls this function to fetch the device 
zoned model.



diff --git a/include/block/block-common.h b/include/block/block-common.h
index 78cddeeda5..35e00afe8e 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -56,8 +56,8 @@ typedef enum zone_op {
  } zone_op;
  
  typedef enum zone_model {

-BLK_Z_HM,
-BLK_Z_HA,
+BLK_Z_HM = 0x1,
+BLK_Z_HA = 0x2,
  } zone_model;
  
  typedef enum BlkZoneCondition {


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 5/5] qemu-iotests: add zone operation tests.

2022-06-27 Thread Hannes Reinecke

On 6/27/22 02:19, Sam Li wrote:

---
  tests/qemu-iotests/tests/zoned.sh | 49 +++
  1 file changed, 49 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 00..262c0b5427
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+# hidden issues:
+# 1. memory allocation error of "unaligned tcache chunk detected" when the 
nr_zone=1 in zone report
+# 2. qemu-io: after report 10 zones, the program failed at double free error 
and exited.
+echo "report the first zone"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 1"
+echo "report: the first 10 zones"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 10"
+
+echo "open the first zone"
+sudo $QEMU_IO $IMG -c "zone_open 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 1"
+echo "open the last zone"
+sudo $QEMU_IO $IMG -c "zone_open 0x3e7000 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0x3e7000 0 2"
+
+echo "close the first zone"
+sudo $QEMU_IO $IMG -c "zone_close 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0 0 1"
+echo "close the last zone"
+sudo $QEMU_IO $IMG -c "zone_close 0x3e7000 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zone_report 0x3e7000 0 2"
+
+
+echo "reset the second zone"
+sudo $QEMU_IO $IMG -c "zone_reset 0x8 0x8"
+echo "After resetting a zone:"
+sudo $QEMU_IO $IMG -c "zone_report 0x8 0 5"
+
+# success, all done
+sudo rmmod null_blk
+echo "*** done"
+#rm -f $seq.full
+status=0


Caveat: I'm not that familiar with qemu-iotests.
FWIW:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 2/5] qemu-io: add zoned block device operations.

2022-06-27 Thread Hannes Reinecke
Shouldn't 'altname' be different for each function?
'zo', maybe?


+.cfunc = zone_open_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset [offset..] len [len..]",
+.oneline = "explicit open a range of zones in zone block device",
+};
+
+static int zone_close_f(BlockBackend *blk, int argc, char **argv)
+{
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);


argv checking.


+return blk_zone_mgmt(blk, zone_close, offset, len);
+}
+
+static const cmdinfo_t zone_close_cmd = {
+.name = "zone_close",
+.altname = "f",


altname 'zc'


+.cfunc = zone_close_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset [offset..] len [len..]",
+.oneline = "close a range of zones in zone block device",
+};
+
+static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
+{
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);


Argv checking.


+return blk_zone_mgmt(blk, zone_finish, offset, len);
+}
+
+static const cmdinfo_t zone_finish_cmd = {
+.name = "zone_finish",
+.altname = "f",


altname 'zf'


+.cfunc = zone_finish_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset [offset..] len [len..]",
+.oneline = "finish a range of zones in zone block device",
+};
+
+static int zone_reset_f(BlockBackend *blk, int argc, char **argv)
+{
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);


Argv checking.


+return blk_zone_mgmt(blk, zone_reset, offset, len);
+}
+
+static const cmdinfo_t zone_reset_cmd = {
+.name = "zone_reset",
+.altname = "f",


altname 'zf'


+.cfunc = zone_reset_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset [offset..] len [len..]",
+.oneline = "reset a zone write pointer in zone block device",
+};
+
+
  static int truncate_f(BlockBackend *blk, int argc, char **argv);
  static const cmdinfo_t truncate_cmd = {
  .name   = "truncate",
@@ -2498,6 +2614,11 @@ static void __attribute((constructor)) 
init_qemuio_commands(void)
  qemuio_add_command(_write_cmd);
  qemuio_add_command(_flush_cmd);
  qemuio_add_command(_cmd);
+qemuio_add_command(&zone_report_cmd);
+qemuio_add_command(&zone_open_cmd);
+qemuio_add_command(&zone_close_cmd);
+qemuio_add_command(&zone_finish_cmd);
+qemuio_add_command(&zone_reset_cmd);
  qemuio_add_command(_cmd);
  qemuio_add_command(_cmd);
  qemuio_add_command(_cmd);


Otherwise looks okay.

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.

2022-06-27 Thread Hannes Reinecke
struct BlockDriver {
QEMUIOVector *qiov,
int64_t pos);
  
+int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs,

+int64_t offset, int64_t len, int64_t *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, enum zone_op 
op,
+int64_t offset, int64_t len);
+
  /* removable device specific */
  bool (*bdrv_is_inserted)(BlockDriverState *bs);
  void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag);


Other than that: Well done!

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [PATCH] hw/nvme: reattach subsystem namespaces on hotplug

2021-09-24 Thread Hannes Reinecke

On 9/23/21 10:09 PM, Klaus Jensen wrote:

On Sep  9 13:37, Hannes Reinecke wrote:

On 9/9/21 12:47 PM, Klaus Jensen wrote:

On Sep  9 11:43, Hannes Reinecke wrote:

With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
namespaces get moved from the controller to the subsystem if one
is specified.
That keeps the namespaces alive after a controller hot-unplug, but
after a controller hotplug we have to reconnect the namespaces
from the subsystem to the controller.

Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
Cc: Klaus Jensen 
Signed-off-by: Hannes Reinecke 
---
  hw/nvme/subsys.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 93c35950d6..a9404f2b5e 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -14,7 +14,7 @@
  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
  {
  NvmeSubsystem *subsys = n->subsys;
-int cntlid;
+int cntlid, nsid;
  
  for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {

  if (!subsys->ctrls[cntlid]) {
@@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
  
  subsys->ctrls[cntlid] = n;
  
+for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {

+if (subsys->namespaces[nsid]) {
+nvme_attach_ns(n, subsys->namespaces[nsid]);
+}


Thanks Hannes! I like it, keeping things simple.

But we should only attach namespaces that have the shared property or
have ns->attached == 0. Non-shared namespaces may already be attached to
another controller in the subsystem.



Well ... I tried to avoid that subject, but as you brought it up:
There is a slightly tricky issue in fabrics, in that the 'controller' is
independent from the 'connection'.
The 'shared' bit in the CMIC setting indicates that the subsystem may
have more than one _controller_. It doesn't talk about how many
_connections_ a controller may support; that then is the realm of
dynamic or static controllers, which we don't talk about :-).
Suffice it to say, linux only implements the dynamic controller model,
so every connection will be to a different controller.
But you have been warned :-)

However, the 'CMIC' setting is independent of the 'NMIC' setting (ie the
'shared' bit in the namespace).
Which leads to the interesting question of how exactly one should handle
non-shared namespaces in subsystems for which there are multiple
controllers. Especially as the NSID space is per subsystem, so each
controller will be able to figure out if there are blanked-out namespaces.
So to make this a sane setup I would propose to set the 'shared' option
automatically whenever the controller has the 'subsys' option set.
And actually, I would ditch the 'shared' option completely, and make it
dependent on the 'subsys' setting for the controller.
Much like we treat the 'CMIC' setting nowadays.
That avoids lots of confusion, and also makes the implementation _way_
easier.



I see your point. Unfortunately, there is no easy way to ditch shared=
now. But I think it'd be good enough to attach to shared automatically,
but not to "shared=off".

I've already ditched the shared parameter on my RFC refactor series and
always having the namespaces shared.



Okay.


However...

The spec says that "The attach and detach operations are persistent
across all reset events.". This means that we should track those events
in the subsystem and only reattach namespaces that were attached at the
time of the "reset" event. Currently, we don't have anything mapping
that state. But the device already has to take some liberties with
regard to stuff that is considered persistent by the spec (SMART log
etc.) since we do not have any way to store stuff persistently across
qemu invocations, so I think the above is an acceptable compromise.


Careful. 'attach' and 'detach' are MI (management interface) operations.
If and how many namespaces are visible to any given controllers is
actually independent on that; and, in fact, controllers might not even
implement 'attach' or 'detach'.
But I do agree that we don't handle the 'reset' state properly.



The Namespace Attachment command has nothing to do with MI? And the QEMU
controller does support attaching and detaching namespaces.



No, you got me wrong. Whether or not a namespace is attached is 
independent of any 'Namespace attachment' command.
Case in point: the subsystem will be starting up with namespaces already 
attached, even though no-one issued any namespace attachment command.



A potential (as good as it gets) fix would be to keep a map/list of
"persistently" attached controllers on the namespaces and re-attach
according to that when we see that controller joining the subsystem
again. CNTLID would be the obvious choice for the key here, but problem
is that we cant really use it since we assign it sequentially from the
subsys

Re: [PATCH 2/3] scsi: make io_timeout configurable

2021-09-23 Thread Hannes Reinecke

On 9/22/21 5:47 PM, Philippe Mathieu-Daudé wrote:

Hi Hannes,

On 11/16/20 19:31, Hannes Reinecke wrote:

The current code sets an infinite timeout on SG_IO requests,
causing the guest to stall if the host experiences a frame
loss.
This patch adds an 'io_timeout' parameter for SCSIDevice to
make the SG_IO timeout configurable, and also shortens the
default timeout to 30 seconds to avoid infinite stalls.

Signed-off-by: Hannes Reinecke 
---
  hw/scsi/scsi-disk.c    |  6 --
  hw/scsi/scsi-generic.c | 17 +++--
  include/hw/scsi/scsi.h |  4 +++-
  3 files changed, 18 insertions(+), 9 deletions(-)


  int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, uint8_t 
cmd_size,

-    uint8_t *buf, uint8_t buf_size)
+    uint8_t *buf, uint8_t buf_size, uint32_t timeout)

  {
  sg_io_hdr_t io_header;
  uint8_t sensebuf[8];
@@ -520,7 +522,7 @@ int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t 
*cmd, uint8_t cmd_size,

  io_header.cmd_len = cmd_size;
  io_header.mx_sb_len = sizeof(sensebuf);
  io_header.sbp = sensebuf;
-    io_header.timeout = 6000; /* XXX */
+    io_header.timeout = timeout * 1000;



@@ -637,7 +639,7 @@ static int get_stream_blocksize(BlockBackend *blk)
  cmd[0] = MODE_SENSE;
  cmd[4] = sizeof(buf);
-    ret = scsi_SG_IO_FROM_DEV(blk, cmd, sizeof(cmd), buf, sizeof(buf));
+    ret = scsi_SG_IO_FROM_DEV(blk, cmd, sizeof(cmd), buf, sizeof(buf), 6);


Why is this timeout hardcoded? Due to the /* XXX */ comment?



60 seconds is the default command timeout on linux.
And the problem is that the guest might set a command timeout on the 
commands being sent from the guest's user space, but the guest is unable
to communicate this timeout to the host.

Really, one should fix up the virtio spec here to allow for a 'timeout' 
field. But in the absence of that, being able to configure it is the next 
best attempt.
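
With this patch that would be something like (an untested sketch; the ids
are made up, and 'io_timeout' is the property added here, in seconds):

-device virtio-scsi-pci,id=scsi0 \
-drive file=/dev/sg1,if=none,id=sg0 \
-device scsi-generic,drive=sg0,io_timeout=120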


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 2/3] scsi: make io_timeout configurable

2021-09-20 Thread Hannes Reinecke

On 9/20/21 8:56 PM, Paolo Bonzini wrote:

On Mon, Nov 16, 2020 at 7:31 PM Hannes Reinecke  wrote:

The current code sets an infinite timeout on SG_IO requests,
causing the guest to stall if the host experiences a frame
loss.
This patch adds an 'io_timeout' parameter for SCSIDevice to
make the SG_IO timeout configurable, and also shortens the
default timeout to 30 seconds to avoid infinite stalls.


Hannes, could 30 seconds be a bit too short for tape drives?

It would, but then anyone attempting to use tapes via qemu emulation 
deserves to suffer.
Tapes are bitchy even when used normally, so attempting to use them 
under qemu emulation will land you with lots of unhappy experiences, 
where the timeout is the least of your problems.

I sincerely doubt anyone will be using tapes here.
Not in real-world scenarios.

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH] hw/nvme: select first free NSID for legacy drive configuration

2021-09-09 Thread Hannes Reinecke
On 9/9/21 12:52 PM, Klaus Jensen wrote:
> On Sep  9 11:51, Hannes Reinecke wrote:
>> If a legacy 'drive' argument is passed to the controller we cannot
>> assume that '1' will be a free NSID, as the subsys might already
>> have attached a namespace to this NSID. So select the first free
>> one.
>>
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  hw/nvme/ctrl.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index 757cdff038..2c69031ca9 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
>> **errp)
>>  
>>  /* setup a namespace if the controller drive property was given */
>>  if (n->namespace.blkconf.blk) {
>> +int i;
>>  ns = &n->namespace;
>> -ns->params.nsid = 1;
>> +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
>> +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
>> +continue;
>> +}
>> +ns->params.nsid = i;
>> +break;
>> +}
>>  
>>  if (nvme_ns_setup(ns, errp)) {
>>  return;
>> -- 
>> 2.26.2
>>
> 
> Did you actually hit this?
> 
> Because then then property checking logic is bad... The device is not
> supposed to allow nvme,drive= combined with a subsystem property. In
> nvme_check_constraints():
> 
>   if (n->namespace.blkconf.blk && n->subsys) {
> /* error out */
> return;
>   }
> 
> 
Ah. Missed that.
Do ignore my patch then.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [PATCH] hw/nvme: reattach subsystem namespaces on hotplug

2021-09-09 Thread Hannes Reinecke
On 9/9/21 12:47 PM, Klaus Jensen wrote:
> On Sep  9 11:43, Hannes Reinecke wrote:
>> With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>> namespaces get moved from the controller to the subsystem if one
>> is specified.
>> That keeps the namespaces alive after a controller hot-unplug, but
>> after a controller hotplug we have to reconnect the namespaces
>> from the subsystem to the controller.
>>
>> Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
>> Cc: Klaus Jensen 
>> Signed-off-by: Hannes Reinecke 
>> ---
>>  hw/nvme/subsys.c | 8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
>> index 93c35950d6..a9404f2b5e 100644
>> --- a/hw/nvme/subsys.c
>> +++ b/hw/nvme/subsys.c
>> @@ -14,7 +14,7 @@
>>  int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>  {
>>  NvmeSubsystem *subsys = n->subsys;
>> -int cntlid;
>> +int cntlid, nsid;
>>  
>>  for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
>>  if (!subsys->ctrls[cntlid]) {
>> @@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
>>  
>>  subsys->ctrls[cntlid] = n;
>>  
>> +for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
>> +if (subsys->namespaces[nsid]) {
>> +nvme_attach_ns(n, subsys->namespaces[nsid]);
>> +}
> 
> Thanks Hannes! I like it, keeping things simple.
> 
> But we should only attach namespaces that have the shared property or
> have ns->attached == 0. Non-shared namespaces may already be attached to
> another controller in the subsystem.
> 

Well ... I tried to avoid that subject, but as you brought it up:
There is a slightly tricky issue in fabrics, in that the 'controller' is
independent from the 'connection'.
The 'shared' bit in the CMIC setting indicates that the subsystem may
have more than one _controller_. It doesn't talk about how many
_connections_ a controller may support; that then is the realm of
dynamic or static controllers, which we don't talk about :-).
Suffice to say, Linux only implements the dynamic controller model,
so every connection will be to a different controller.
But you have been warned :-)

However, the 'CMIC' setting is independent of the 'NMIC' setting (ie the
'shared' bit in the namespace).
Which leads to the interesting question of how exactly one should handle
non-shared namespaces in subsystems for which there are multiple
controllers. Especially as the NSID space is per subsystem, so each
controller will be able to figure out if there are blanked-out namespaces.
So to make this a sane setup I would propose to set the 'shared' option
automatically whenever the controller has the 'subsys' option set.
And actually, I would ditch the 'shared' option completely, and make it
dependent on the 'subsys' setting for the controller.
Much like we treat the 'CMIC' setting nowadays.
That avoids lots of confusion, and also makes the implementation _way_
easier.
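(A minimal sketch of that proposal, in nvme_ns_realize(): the 'shared' flag
would no longer be a user-visible option but simply follow the subsystem
link -- the one-liner below is illustrative, not an actual patch:)

    /* 'shared' is implied by the controller being linked to a subsystem */
    ns->params.shared = subsys != NULL;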

> However...
> 
> The spec says that "The attach and detach operations are persistent
> across all reset events.". This means that we should track those events
> in the subsystem and only reattach namespaces that were attached at the
> time of the "reset" event. Currently, we don't have anything mapping
> that state. But the device already has to take some liberties with
> regard to stuff that is considered persistent by the spec (SMART log
> etc.) since we do not have any way to store stuff persistently across
> qemu invocations, so I think the above is an acceptable compromise.
> 
Careful. 'attach' and 'detach' are MI (management interface) operations.
If and how many namespaces are visible to any given controller is
actually independent of that; and, in fact, controllers might not even
implement 'attach' or 'detach'.
But I do agree that we don't handle the 'reset' state properly.

> A potential (as good as it gets) fix would be to keep a map/list of
> "persistently" attached controllers on the namespaces and re-attach
> according to that when we see that controller joining the subsystem
> again. CNTLID would be the obvious choice for the key here, but problem
> is that we can't really use it since we assign it sequentially from the
> subsystem, which now looks like a pretty bad choice. CNTLID should have
> been a required property of the nvme device when subsystems are
> involved. Maybe we can fix up the CNTLID assignment to take the serial
> into account (we know that is defined and *should* be unique) and not
> reuse CNTLIDs. This limits the subsystem to NVME_MAX_CONTROLLERS unique
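(A sketch of such a map, with hypothetical field names, using the bitops
helpers QEMU already provides; this only illustrates the idea Klaus
describes, not an actual implementation:)

    /* per-namespace record of which controllers had it attached */
    unsigned long attached_ctrls[BITS_TO_LONGS(NVME_MAX_CONTROLLERS)];

    /* on an explicit attach/detach for controller n */
    set_bit(n->cntlid, ns->attached_ctrls);
    clear_bit(n->cntlid, ns->attached_ctrls);

    /* when a controller (re-)registers with the subsystem */
    if (test_bit(n->cntlid, ns->attached_ctrls)) {
        nvme_attach_ns(n, ns);
    }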

[PATCH] hw/nvme: select first free NSID for legacy drive configuration

2021-09-09 Thread Hannes Reinecke
If a legacy 'drive' argument is passed to the controller we cannot
assume that '1' will be a free NSID, as the subsys might already
have attached a namespace to this NSID. So select the first free
one.

Signed-off-by: Hannes Reinecke 
---
 hw/nvme/ctrl.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 757cdff038..2c69031ca9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6546,8 +6546,15 @@ static void nvme_realize(PCIDevice *pci_dev, Error 
**errp)
 
 /* setup a namespace if the controller drive property was given */
 if (n->namespace.blkconf.blk) {
+int i;
 ns = &n->namespace;
-ns->params.nsid = 1;
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
+continue;
+}
+ns->params.nsid = i;
+break;
+}
 
 if (nvme_ns_setup(ns, errp)) {
 return;
-- 
2.26.2




Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging

2021-09-09 Thread Hannes Reinecke
On 9/9/21 9:59 AM, Klaus Jensen wrote:
> On Sep  9 09:02, Hannes Reinecke wrote:
>> On 7/26/21 9:18 PM, Klaus Jensen wrote:
>>> From: Klaus Jensen 
>>>
>>> Prior to this patch the nvme-ns devices are always children of the
>>> NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
>>> unrealized when the parent device is removed. However, when subsystems
>>> are involved, this is not what we want since the namespaces may be
>>> attached to other controllers as well.
>>>
>>> This patch adds an additional NvmeBus on the subsystem device. When
>>> nvme-ns devices are realized, if the parent controller device is linked
>>> to a subsystem, the parent bus is set to the subsystem one instead. This
>>> makes sure that namespaces are kept alive and not unrealized.
>>>
>>> Reviewed-by: Hannes Reinecke 
>>> Signed-off-by: Klaus Jensen 
>>> ---
>>>   hw/nvme/nvme.h   | 15 ---
>>>   hw/nvme/ctrl.c   | 14 ++
>>>   hw/nvme/ns.c | 18 ++
>>>   hw/nvme/subsys.c |  3 +++
>>>   4 files changed, 35 insertions(+), 15 deletions(-)
>>>
>> Finally got around to test this; sadly, with mixed results.
>> On the good side: controller hotplug works.
>> Within qemu monitor I can do:
>>
>> device_del 
>> device_add 
>>
>> and OS reports:
>> [   56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention button
>> pressed
>> [   56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off due to
>> button press
>> [  104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention button
>> pressed
>> [  104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due to
>> button press
>> [  104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present
>> [  104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up
>> [  104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802
>> [  104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit]
>> [  104.431442] pci :03:00.0: BAR 0: assigned [mem 0xc120-0xc1203fff
>> 64bit]
>> [  104.431580] pcieport :00:09.0: PCI bridge to [bus 03]
>> [  104.431604] pcieport :00:09.0:   bridge window [io  0x7000-0x7fff]
>> [  104.434815] pcieport :00:09.0:   bridge window [mem
>> 0xc120-0xc13f]
>> [  104.436685] pcieport :00:09.0:   bridge window [mem
>> 0x80400-0x805ff 64bit pref]
>> [  104.441896] nvme nvme2: pci function :03:00.0
>> [  104.442151] nvme :03:00.0: enabling device ( -> 0002)
>> [  104.455821] nvme nvme2: 1/0/0 default/read/poll queues
>>
>> So that works.
>> But: the namespace is not reconnected.
>>
>> # nvme list-ns /dev/nvme2
>>
>> draws a blank. So guess some fixup patch is in order...
>>
> 
> Hi Hannes,
> 
> I see. After the device_del/device_add dance, the namespace is there, but it is
> not automatically attached.
> 
> # nvme list-ns -a /dev/nvme0
> [   0]:0x2
> 
> # nvme attach-ns /dev/nvme0 -n 2 -c 0
> attach-ns: Success, nsid:2
> 
> # nvme list-ns /dev/nvme0
> [   0]:0x2
> 
> 
> I don't *think* the spec says that namespaces *must* be re-attached
> automatically? But I would have to check... If it does say that, then this is 
> a
> bug of course.
> 
Errm. Yes, the namespaces must be present after a 'reset' (which a
hotunplug/hotplug cycle amounts to here).

As per spec the namespaces are a property of the _subsystem_, not the
controller. And the controller attaches to a subsystem, so it'll see any
namespaces which are present in the subsystem.
(whether it needs to see _all_ namespaces from that subsystem is another
story, but doesn't need to bother us here :-).

I'll just send a patch for it; it's actually quite trivial.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



[PATCH] hw/nvme: reattach subsystem namespaces on hotplug

2021-09-09 Thread Hannes Reinecke
With commit 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
namespaces get moved from the controller to the subsystem if one
is specified.
That keeps the namespaces alive after a controller hot-unplug, but
after a controller hotplug we have to reconnect the namespaces
from the subsystem to the controller.

Fixes: 5ffbaeed16 ("hw/nvme: fix controller hot unplugging")
Cc: Klaus Jensen 
Signed-off-by: Hannes Reinecke 
---
 hw/nvme/subsys.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 93c35950d6..a9404f2b5e 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -14,7 +14,7 @@
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 {
 NvmeSubsystem *subsys = n->subsys;
-int cntlid;
+int cntlid, nsid;
 
 for (cntlid = 0; cntlid < ARRAY_SIZE(subsys->ctrls); cntlid++) {
 if (!subsys->ctrls[cntlid]) {
@@ -29,12 +29,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 
 subsys->ctrls[cntlid] = n;
 
+for (nsid = 0; nsid < ARRAY_SIZE(subsys->namespaces); nsid++) {
+if (subsys->namespaces[nsid]) {
+nvme_attach_ns(n, subsys->namespaces[nsid]);
+}
+}
 return cntlid;
 }
 
 void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
 {
 subsys->ctrls[n->cntlid] = NULL;
+n->cntlid = -1;
 }
 
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
-- 
2.26.2




Re: [PULL for-6.1 06/11] hw/nvme: fix controller hot unplugging

2021-09-09 Thread Hannes Reinecke

On 7/26/21 9:18 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 15 ---
  hw/nvme/ctrl.c   | 14 ++
  hw/nvme/ns.c | 18 ++
  hw/nvme/subsys.c |  3 +++
  4 files changed, 35 insertions(+), 15 deletions(-)


Finally got around to test this; sadly, with mixed results.
On the good side: controller hotplug works.
Within qemu monitor I can do:

device_del 
device_add 

and OS reports:
[   56.564447] pcieport :00:09.0: pciehp: Slot(0-2): Attention 
button pressed
[   56.564460] pcieport :00:09.0: pciehp: Slot(0-2): Powering off 
due to button press
[  104.293335] pcieport :00:09.0: pciehp: Slot(0-2): Attention 
button pressed
[  104.293347] pcieport :00:09.0: pciehp: Slot(0-2) Powering on due 
to button press

[  104.293540] pcieport :00:09.0: pciehp: Slot(0-2): Card present
[  104.293544] pcieport :00:09.0: pciehp: Slot(0-2): Link Up
[  104.428961] pci :03:00.0: [1b36:0010] type 00 class 0x010802
[  104.429298] pci :03:00.0: reg 0x10: [mem 0x-0x3fff 64bit]
[  104.431442] pci :03:00.0: BAR 0: assigned [mem 
0xc120-0xc1203fff 64bit]

[  104.431580] pcieport :00:09.0: PCI bridge to [bus 03]
[  104.431604] pcieport :00:09.0:   bridge window [io  0x7000-0x7fff]
[  104.434815] pcieport :00:09.0:   bridge window [mem 
0xc120-0xc13f]
[  104.436685] pcieport :00:09.0:   bridge window [mem 
0x80400-0x805ff 64bit pref]

[  104.441896] nvme nvme2: pci function :03:00.0
[  104.442151] nvme :03:00.0: enabling device ( -> 0002)
[  104.455821] nvme nvme2: 1/0/0 default/read/poll queues

So that works.
But: the namespace is not reconnected.

# nvme list-ns /dev/nvme2

draws a blank. So guess some fixup patch is in order...

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Hannes Reinecke

On 7/9/21 8:55 AM, Klaus Jensen wrote:

On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.




Patch 1-3 are fixes we need anyway, so I thought I might as well apply 
them :)



Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?



Yes, if acceptable, I would like to use patch 4 as an interim solution. 
We have a bug we need to fix for 6.1, and I believe this does the job.


Oh, yes, it does. But it's ever so slightly ugly with the reparenting 
stuff. But if that's considered an interim solution I'm fine with it.


You can add my 'Reviewed-by: Hannes Reinecke ' tag if you 
like.


I considered changing the existing nvme-bus to be on the main system 
bus, but then we break the existing behavior that the namespaces attach 
to the most recently defined controller in the absence of the shared 
parameter or an explicit bus parameter.



Do we?
My idea was to always attach a namespace to a subsystem (and, if not 
present, create one). The controller would then 'link' to that 
subsystem. The subsystem would have a 'shared' attribute, which would 
determine if more than one controller can be 'linked' to it.


That way we change the relationship between the controller and the
namespace, as then the namespace would be a child of the subsystem,
and the namespace would need to be detached separately from the controller.

But it fits neatly into the current device model, except the slightly 
awkward 'link' thingie.
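(For illustration, the topology described above would roughly correspond to
a command line like this -- the 'subsys' and 'bus'/'drive' options are taken
from the rest of this thread, the remaining values are made up:)

    -device nvme-subsys,id=subsys1
    -device nvme,serial=deadbeef,subsys=subsys1
    -device nvme-ns,drive=nvm1,bus=subsys1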



Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
    This is the approach that I am taking right now and it works well. 
It allows many-to-many relationships and separates the life times of 
subsystems, namespaces and controllers like you mentioned.
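(Purely illustrative -- no such objects exist today; the property names
below are invented just to visualize the object-based layout Klaus
mentions:)

    -object nvme-nvm-subsystem,id=subsys1,nqn=testnqn
    -object nvme-ns,id=ns1,subsys=subsys1,drive=nvm1
    -device nvme,serial=deadbeef,subsys=subsys1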




Ah. Would like to see that path, then.

    Conceptually, I also really like that the subsystem and namespace 
are not "devices". One could argue that the namespace is comparable 
to a SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN 
actually "shows up" in the host, the nvme namespace does not.




Well, 'devices' really is an abstraction, and it really depends on what 
you declare a device is. But yes, in the QDEV sense with its strict 
inheritance the nvme topology is not a good fit, agreed.


As for SCSI: the namespace is quite comparable to a SCSI LUN; the NVMe 
controller is roughly equivalent to the 'initiator' on SCSI, and the 
subsystem would match up 

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Hannes Reinecke

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.


Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-08 Thread Hannes Reinecke

On 7/7/21 6:14 PM, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 12:43:56PM +0200, Hannes Reinecke wrote:

On 7/7/21 11:53 AM, Klaus Jensen wrote:

On Jul  7 09:49, Hannes Reinecke wrote:

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 18 ++
  hw/nvme/ctrl.c   |  8 +---
  hw/nvme/ns.c | 32 +---
  hw/nvme/subsys.c |  4 
  4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES >
NVME_NSID_BROADCAST - 1);
  typedef struct NvmeCtrl NvmeCtrl;
  typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+    BusState parent_bus;
+    bool is_subsys;
+} NvmeBus;
+
  #define TYPE_NVME_SUBSYS "nvme-subsys"
  #define NVME_SUBSYS(obj) \
  OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
  typedef struct NvmeSubsystem {
  DeviceState parent_obj;
+    NvmeBus bus;
  uint8_t subnqn[256];
  NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
  QTAILQ_HEAD(, NvmeRequest) req_list;
  } NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-    BusState parent_bus;
-} NvmeBus;
-
  #define TYPE_NVME "nvme"
  #define NVME(obj) \
  OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
  static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
  {
-    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
  return NULL;
  }
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
  for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
  ns = nvme_ns(n, i);
-    if (!ns) {
-    continue;
+    if (ns) {
+    ns->attached--;
  }
+    }
-    nvme_ns_cleanup(ns);
+    if (n->subsys) {
+    nvme_subsys_unregister_ctrl(n->subsys, n);
  }
  if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
  static void nvme_ns_realize(DeviceState *dev, Error **errp)
  {
  NvmeNamespace *ns = NVME_NS(dev);
-    BusState *s = qdev_get_parent_bus(dev);
-    NvmeCtrl *n = NVME(s->parent);
-    NvmeSubsystem *subsys = n->subsys;
+    BusState *qbus = qdev_get_parent_bus(dev);
+    NvmeBus *bus = NVME_BUS(qbus);
+    NvmeCtrl *n = NULL;
+    NvmeSubsystem *subsys = NULL;
  uint32_t nsid = ns->params.nsid;
  int i;
-    if (!n->subsys) {
+    if (bus->is_subsys) {
+    subsys = NVME_SUBSYS(qbus->parent);
+    } else {
+    n = NVME(qbus->parent);
+    subsys = n->subsys;
+    }
+
+    if (subsys) {
+    /*
+ * If this namespace belongs to a subsystem (through a
link on the
+ * controller device), reparent the device.
+ */
+    if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+    return;
+    }
+    } else {
  if (ns->params.detached) {
  error_setg(errp, "detached requires that the nvme
device is "
     "linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState
*dev, Error **errp)
  if (!nsid) {
  for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-    if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+    if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
  continue;
  }
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState
*dev, Error **errp)
  return;
  }
  } else {
-    if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+    if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
  error_setg(errp, "namespace id '%d' alr

Re: [PATCH v2 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Hannes Reinecke

On 7/7/21 5:49 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 15 ---
  hw/nvme/ctrl.c   | 14 ++
  hw/nvme/ns.c | 18 ++
  hw/nvme/subsys.c |  3 +++
  4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..83ffabade4cf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,20 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
  typedef struct NvmeCtrl NvmeCtrl;
  typedef struct NvmeNamespace NvmeNamespace;
  
+#define TYPE_NVME_BUS "nvme-bus"

+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+} NvmeBus;
+
  #define TYPE_NVME_SUBSYS "nvme-subsys"
  #define NVME_SUBSYS(obj) \
  OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
  
  typedef struct NvmeSubsystem {

  DeviceState parent_obj;
+NvmeBus bus;
  uint8_t subnqn[256];
  
  NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];

@@ -365,13 +373,6 @@ typedef struct NvmeCQueue {
  QTAILQ_HEAD(, NvmeRequest) req_list;
  } NvmeCQueue;
  
-#define TYPE_NVME_BUS "nvme-bus"

-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
  #define TYPE_NVME "nvme"
  #define NVME(obj) \
  OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..9a3b3a27c293 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6514,16 +6514,14 @@ static void nvme_exit(PCIDevice *pci_dev)
  
  nvme_ctrl_reset(n);
  
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {

-ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (n->subsys) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
+}
  }
  
-nvme_ns_cleanup(ns);


So who is removing the namespaces, then?
I would have expected some cleanup action from the subsystem, seeing 
that we reparent to that ...



-}
-
-if (n->subsys) {
  nvme_subsys_unregister_ctrl(n->subsys, n);
  }
  
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c

index 3c4f5b8c714a..b7cf1494e75b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -441,6 +441,15 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
  }
  }
  
+static void nvme_ns_unrealize(DeviceState *dev)

+{
+NvmeNamespace *ns = NVME_NS(dev);
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+}
+
  static void nvme_ns_realize(DeviceState *dev, Error **errp)
  {
  NvmeNamespace *ns = NVME_NS(dev);
@@ -462,6 +471,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 "linked to an nvme-subsys device");
  return;
  }
+} else {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}


What happens if that fails?
Will we abort? Not create the namespace?


  }
  
  if (nvme_ns_setup(ns, errp)) {

@@ -552,6 +569,7 @@ static void nvme_ns_class_init(ObjectClass *oc, void *data)
  
  dc->bus_type = TYPE_NVME_BUS;

  dc->realize = nvme_ns_realize;
+dc->unrealize = nvme_ns_unrealize;
  device_class_set_props(dc, nvme_ns_props);
  dc->desc = "Virtual NVMe namespace";
  }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..93c35950d69d 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -50,6 +50,9 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
  {
  NvmeSubsystem *subsys = NVME_SUBSYS(dev);
  
+qbus_create_inplace(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev,

+        dev->id);
+
  nvme_subsys_setup(subsys);
  }
  


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Hannes Reinecke

On 7/7/21 11:53 AM, Klaus Jensen wrote:

On Jul  7 09:49, Hannes Reinecke wrote:

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 18 ++
 hw/nvme/ctrl.c   |  8 +---
 hw/nvme/ns.c | 32 +---
 hw/nvme/subsys.c |  4 
 4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > 
NVME_NSID_BROADCAST - 1);

 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+    BusState parent_bus;
+    bool is_subsys;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+    NvmeBus bus;
 uint8_t subnqn[256];
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-    BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
-    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
 return NULL;
 }
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
-    if (!ns) {
-    continue;
+    if (ns) {
+    ns->attached--;
 }
+    }
-    nvme_ns_cleanup(ns);
+    if (n->subsys) {
+    nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
-    BusState *s = qdev_get_parent_bus(dev);
-    NvmeCtrl *n = NVME(s->parent);
-    NvmeSubsystem *subsys = n->subsys;
+    BusState *qbus = qdev_get_parent_bus(dev);
+    NvmeBus *bus = NVME_BUS(qbus);
+    NvmeCtrl *n = NULL;
+    NvmeSubsystem *subsys = NULL;
 uint32_t nsid = ns->params.nsid;
 int i;
-    if (!n->subsys) {
+    if (bus->is_subsys) {
+    subsys = NVME_SUBSYS(qbus->parent);
+    } else {
+    n = NVME(qbus->parent);
+    subsys = n->subsys;
+    }
+
+    if (subsys) {
+    /*
+ * If this namespace belongs to a subsystem (through a link 
on the

+ * controller device), reparent the device.
+ */
+    if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+    return;
+    }
+    } else {
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device 
is "

    "linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, 
Error **errp)

 if (!nsid) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-    if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+    if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
 continue;
 }
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, 
Error **errp)

 return;
 }
 } else {
-    if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+    if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
 error_setg(errp, "namespace id '%d' already allocated", 
nsid);

 return;
 }
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
  }
  }
  
-nvme_attach_ns(n, ns);

+if (n) {
+nvme_attach_ns(n, ns);
+}
  }
  
  static Property nvme_ns_props[] = {

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..fb7c3a7c55fc 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -50,6 +50,10 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
  {
  NvmeSubsystem *subsys = NVME_SUBSYS(dev);
  
+qbus_create_inplace(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev,

+dev->id);
+subsys->bus.is_subsys = true;
+
  nvme_subsys_setup(subsys);
  }
  

Why not always create a subsystem, and distinguish between 'shared' and 
'non-shared' subsystems instead of the ->subsys pointer?
That way all namespaces can be children of the subsystem, we won't need 
any reparenting, and the whole thing will be more in-line with qdev, no?
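(A rough sketch of that idea, assuming an implicit, private subsystem could
be created from nvme_realize() whenever none is given on the command line;
the 'shared' policy field on the subsystem is hypothetical:)

    if (!n->subsys) {
        DeviceState *dev = qdev_new(TYPE_NVME_SUBSYS);

        /* assumed: nvme-subsys can realize without a parent bus */
        if (!qdev_realize_and_unref(dev, NULL, errp)) {
            return;
        }
        n->subsys = NVME_SUBSYS(dev);
        n->subsys->shared = false;   /* hypothetical: private subsystem */
    }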


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 3/4] hw/nvme: unregister controller with subsystem at exit

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Make sure the controller is unregistered from the subsystem when device
is removed.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 1 +
  hw/nvme/ctrl.c   | 4 
  hw/nvme/subsys.c | 5 +
  3 files changed, 10 insertions(+)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 2/4] hw/nvme: mark nvme-subsys non-hotpluggable

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

We currently lack the infrastructure to handle subsystem hotplugging, so
disable it.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/subsys.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 192223d17ca1..dc7a96862f37 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -61,6 +61,7 @@ static void nvme_subsys_class_init(ObjectClass *oc, void 
*data)
  
  dc->realize = nvme_subsys_realize;

  dc->desc = "Virtual NVMe subsystem";
+dc->hotpluggable = false;
  
  device_class_set_props(dc, nvme_subsystem_props);

  }


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 1/4] hw/nvme: remove NvmeCtrl parameter from ns setup/check functions

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

The nvme_ns_setup and nvme_ns_check_constraints should not depend on the
controller state. Refactor and remove it.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h |  2 +-
  hw/nvme/ctrl.c |  2 +-
  hw/nvme/ns.c   | 37 ++---
  3 files changed, 20 insertions(+), 21 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH] hw/block/nvme: re-enable NVMe PCI hotplug

2021-05-11 Thread Hannes Reinecke

On 5/11/21 6:03 PM, Klaus Jensen wrote:

On May 11 16:54, Hannes Reinecke wrote:

On 5/11/21 3:37 PM, Klaus Jensen wrote:

On May 11 15:12, Hannes Reinecke wrote:

On 5/11/21 2:22 PM, Klaus Jensen wrote:

[ .. ]
The hotplug fix looks good - I'll post a series that tries to integrate
both.


Ta.

The more I think about it, the more I think we should be looking into
reparenting the namespaces to the subsystem.
That would have the _immediate_ benefit that 'device_del' and
'device_add' becomes symmetric (ie one doesn't have to do a separate
'device_add nvme-ns'), as the nvme namespace is not affected by the
hotplug event.



I have that working, but I'm struggling with a QEMU API technicality in
that I apparently cannot simply move the NvmeBus creation to the
nvme-subsys device. For some reason the bus is not available for the
nvme-ns devices. That is, if one does something like this:

  -device nvme-subsys,...
  -device nvme-ns,...

Then I get an error that "no 'nvme-bus' bus found for device 'nvme-ns'".
This is probably just me not grok'ing the qdev well enough, so I'll keep
trying to fix that. What works now is to have the regular setup:


_Normally_ the 'id' of the parent device spans a bus, so the syntax
should be

-device nvme-subsys,id=subsys1,...
-device nvme-ns,bus=subsys1,...


Yeah, I know, I just oversimplified the example. This *is* how I wanted 
it to work ;)




As for the nvme device I would initially expose any namespace from the
subsystem to the controller; the nvme spec has some concept of 'active'
or 'inactive' namespaces which would allow us to blank out individual
namespaces on a per-controller basis, but I fear that's not easy to
model with qdev and the structure above.



The nvme-ns device already supports the boolean 'detached' parameter to 
support the concept of an inactive namespace.


Yeah, but that doesn't really work if we move the namespace to the
subsystem; the 'detached' parameter is for the controller<->namespace
relationship.
That's why I meant we have to have some sort of NSID map for the 
controller such that the controller knows with NSID to access.
I guess we can copy the trick with the NSID array, and reverse the 
operation we have now wrt subsystem; keep the main NSID array in the 
subsystem, and per-controller NSID arrays holding those which can be 
accessed.
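(Sketched with hypothetical names, just to show the shape of it: the
subsystem keeps the authoritative namespaces[] array and every controller
gets its own view of it:)

    /* hypothetical per-controller view of the subsystem's namespace list */
    NvmeNamespace *visible_ns[NVME_MAX_NAMESPACES + 1];

    /* on controller registration: start out with everything the subsystem has */
    memcpy(n->visible_ns, subsys->namespaces, sizeof(n->visible_ns));

    /* an NVMe-MI style attach/detach would then only edit this view */
    n->visible_ns[nsid] = NULL;                      /* detach from this controller */
    n->visible_ns[nsid] = subsys->namespaces[nsid];  /* (re-)attach */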


And ignore the commandline for now; figure that one out later.

Cheers,

Hannes

  -device nvme-subsys,...
  -device nvme,...
  -device nvme-ns,...

And the nvme-ns device will then reparent to the NvmeBus on nvme-subsys
(which magically now IS available when nvme-ns is realized). This has
the same end result, but I really would like that the namespaces could
be specified as children of the subsys directly.


Shudder.
Automatic reparenting.
To my understanding from qdev that shouldn't even be possible.
Please don't.



It's perfectly possible with the API and used to implement stuff like 
failover. We are not changing the parent object, we are changing the 
parent bus. hw/sd does something like this (but does mention that its a 
bit of a hack). In this case I'd say we could argue to get away with it 
as well.
Allowing the nvme-ns device to be a child of the controller allows the 
initially attached controller of non-shared namespaces to be easily 
expressible. But I agree, the approach is a bit wacky, which is why I 
haven't posted anything yet.


Yep, I did try to implement multipathing for SCSI at one point, and that 
became patently horrible as it would run against qdev where everything 
is ordered within a tree.


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH] hw/block/nvme: re-enable NVMe PCI hotplug

2021-05-11 Thread Hannes Reinecke
On 5/11/21 3:37 PM, Klaus Jensen wrote:
> On May 11 15:12, Hannes Reinecke wrote:
>> On 5/11/21 2:22 PM, Klaus Jensen wrote:
[ .. ]
>>> The hotplug fix looks good - I'll post a series that tries to integrate
>>> both.
>>>
>> Ta.
>>
>> The more I think about it, the more I think we should be looking into
>> reparenting the namespaces to the subsystem.
>> That would have the _immediate_ benefit that 'device_del' and
>> 'device_add' becomes symmetric (ie one doesn't have to do a separate
>> 'device_add nvme-ns'), as the nvme namespace is not affected by the
>> hotplug event.
>>
> 
> I have that working, but I'm struggling with a QEMU API technicality in
> that I apparently cannot simply move the NvmeBus creation to the
> nvme-subsys device. For some reason the bus is not available for the
> nvme-ns devices. That is, if one does something like this:
> 
>   -device nvme-subsys,...
>   -device nvme-ns,...
> 
> Then I get an error that "no 'nvme-bus' bus found for device 'nvme-ns'".
> This is probably just me not grok'ing the qdev well enough, so I'll keep
> trying to fix that. What works now is to have the regular setup:
> 
_Normally_ the 'id' of the parent device spans a bus, so the syntax
should be

-device nvme-subsys,id=subsys1,...
-device nvme-ns,bus=subsys1,...

As for the nvme device I would initially expose any namespace from the
subsystem to the controller; the nvme spec has some concept of 'active'
or 'inactive' namespaces which would allow us to blank out individual
namespaces on a per-controller basis, but I fear that's not easy to
model with qdev and the structure above.

>   -device nvme-subsys,...
>   -device nvme,...
>   -device nvme-ns,...
> 
> And the nvme-ns device will then reparent to the NvmeBus on nvme-subsys
> (which magically now IS available when nvme-ns is realized). This has
> the same end result, but I really would like that the namespaces could
> be specified as children of the subsys directly.
> 
Shudder.
Automatic reparenting.
To my understanding from qdev that shouldn't even be possible.
Please don't.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke Kernel Storage Architect
h...@suse.de   +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)



Re: [PATCH] hw/block/nvme: re-enable NVMe PCI hotplug

2021-05-11 Thread Hannes Reinecke
On 5/11/21 2:22 PM, Klaus Jensen wrote:
> On May 11 09:35, Hannes Reinecke wrote:
>> Ever since commit e570768566 ("hw/block/nvme: support for shared
>> namespace in subsystem") NVMe PCI hotplug is broken, as the PCI
>> hotplug infrastructure will only work for the nvme devices (which
>> are PCI devices), but not for any attached namespaces.
>> So when re-adding the NVMe PCI device via 'device_add' the NVMe
>> controller is added, but all namespaces are missing.
>> This patch adds device hotplug hooks for NVMe namespaces, such that one
>> can call 'device_add nvme-ns' to (re-)attach the namespaces after
>> the PCI NVMe device 'device_add nvme' hotplug call.
>>
> 
> Hi Hannes,
> 
> Thanks for this.
> 
> The real fix here is that namespaces are properly detached from other
> controllers that it may be shared on.
> 
> But is this really the behavior we want? That nvme-ns devices always
> "belongs to" (in QEMU qdev terms) an nvme device is an artifact of the
> Bus/Device architecture and not really how an NVM subsystem should
> behave. Removing a controller should not cause shared namespaces to
> disappear from other controllers.
> 
> I have a WIP that instead adds an NvmeBus to the nvme-subsys device and
> reparents the nvme-ns devices to that if the parent controller is linked
> to a subsystem. This way, nvme-ns devices won't be unrealized under the
> feet of other controllers.
> 
That would be the other direction I thought of; _technically_ NVMe
namespaces are objects of the subsystem, and 'controllers' are just
temporary objects providing access to the namespaces presented by the
subsystem.
So if you are going to rework it I'd rather make the namespaces children
objects of the subsystem, and have nsid maps per controller detailing
which nsids are accessible from the individual controllers.
That would probably be a simple memcpy() to start with, but it would allow
us to modify that map via NVMe-MI and stuff.

However, if you do that you'll find that subsystems can't be hotplugged,
too; but I'm sure you'll be able to fix it up :-)

> The hotplug fix looks good - I'll post a series that tries to integrate
> both.
> 
Ta.

The more I think about it, the more I think we should be looking into
reparenting the namespaces to the subsystem.
That would have the _immediate_ benefit that 'device_del' and
'device_add' becomes symmetric (ie one doesn't have to do a separate
'device_add nvme-ns'), as the nvme namespace is not affected by the
hotplug event.

This really was a quick hack to demonstrate a shortcoming in the linux
NVMe stack (cf 'nvme-mpath: delete disk after last connection' if you
are interested in details), so I'm sure there is room for improvement.

And the prime reason for sending it out was to gauge interest by
qemu-devel; I have a somewhat mixed experience when sending patches to
the qemu ML ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke Kernel Storage Architect
h...@suse.de   +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)



Re: [Bug 1925496] Re: nvme disk cannot be hotplugged after removal

2021-05-11 Thread Hannes Reinecke
On 5/3/21 9:27 AM, Klaus Jensen wrote:
> On Apr 28 15:00, Max Reitz wrote:
>> On 28.04.21 12:12, Klaus Jensen wrote:
>>> On Apr 28 09:31, Oguz Bektas wrote:
>>>>> My understanding is that this is the expected behavior. The reason is
>>>>> that the drive cannot be deleted immediately when the device is
>>>>> hot-unplugged, since it might not be safe (other parts of QEMU could
>>>>> be using it, like background block jobs).
>>>>>
>>>>> On the other hand, the fact that if the drive is removed explicitly
>>>>> through QMP (or in the monitor with drive_del), the drive id is
>>>>> remains "in use". This might be a completely different bug that is
>>>>> unrelated to the nvme device.
>>>>
>>>> using the same commands I can hot-plug and hot-unplug a scsi disk like
>>>> this without issue - this behavior only appeared on nvme devices.
>>>>
>>>
>>> Kevin, Max, can you shed any light on this?
>>>
>>> Specifically what the expected behavior is wrt. to the drive when
>>> unplugging a device that has one attached?
>>>
>>> If the scsi disk is capable of "cleaning up" immediately, then I
>>> suppose that some steps are missing in the nvme unrealization.
>>
> 
> Hi Max,
> 
> Thanks for your help!
> 
>> I’m not very strong when it comes to question for guest devices, but
>> looking into the QAPI documentation, it says that when DEVICE_DELETED
>> is emitted, it’s safe to reuse the device ID.  Before then, it isn’t,
>> because the guest may yet need to release the device.
>>
> 
> This is specifically related to releasing the "drive", not the device.
> Problem is that when the device is deleted (using device_del), the pci
> device goes away rapidly, but the drive (as shown in `info block`)
> lingers for a couple of seconds before going into the "unlocked" state.
> If the drive is then deleted (`drive_del`) everything is good, but if
> the drive is deleted within those couple of seconds, the drive_del
> completes successfully, but the drive id never becomes available again.
> 
>> So the questions that come to my mind are:
>> - Do nvme guest devices have a release protocol with the guest, so
>> that it just may take some time for the guest to release the device? 
>> I.e. that this would be perfectly normal and documented behavior?
>> (Perhaps this just isn’t the case for scsi, or the guest just releases
>> those devices much quicker)
>>
> 
> The NVMe device is a PCIDevice, I wonder if that is what adds some delay
> on this?
> 
>> - Did Oguz always wait for the DEVICE_DELETED event before reusing the
>> ID?  Sounds like it would be a bug if reusing the ID after getting
>> that event failed.
>>
> 
> From the bug report, it does not look like anything like that is done.
> 
> I basically dont understand the deletion protocol here and why the drive
> is not released immediately. Even if I add a call to
> blockdev_mark_auto_del() there is a delay, but the drive does get
> automatically deleted.

FWIW, I've just sent a patch to re-enable NVMe namespace hotplug;
basically you can 'hot-delete' an nvme device via 'device_del ',
but you cannot 'hot-add' an nvme device via 'device_add '.
Or, rather, you can, but then you end up with a NVME controller with no
namespaces which tends to be kinda pointless.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke Kernel Storage Architect
h...@suse.de   +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)



[PATCH] hw/block/nvme: re-enable NVMe PCI hotplug

2021-05-11 Thread Hannes Reinecke
Ever since commit e570768566 ("hw/block/nvme: support for shared
namespace in subsystem") NVMe PCI hotplug is broken, as the PCI
hotplug infrastructure will only work for the nvme devices (which
are PCI devices), but not for any attached namespaces.
So when re-adding the NVMe PCI device via 'device_add' the NVMe
controller is added, but all namespaces are missing.
This patch adds device hotplug hooks for NVMe namespaces, such that one
can call 'device_add nvme-ns' to (re-)attach the namespaces after
the PCI NVMe device 'device_add nvme' hotplug call.

Fixes: e570768566 ("hw/block/nvme: support for shared namespace in subsystem")
Signed-off-by: Hannes Reinecke 
---
 capstone   |  2 +-
 hw/block/nvme-ns.c | 31 ++
 hw/block/nvme-subsys.c | 12 +
 hw/block/nvme-subsys.h |  1 +
 hw/block/nvme.c| 60 +++---
 hw/block/nvme.h|  1 +
 roms/SLOF  |  2 +-
 roms/openbios  |  2 +-
 roms/u-boot|  2 +-
 9 files changed, 93 insertions(+), 20 deletions(-)

diff --git a/capstone b/capstone
index f8b1b83301..22ead3e0bf 16
--- a/capstone
+++ b/capstone
@@ -1 +1 @@
-Subproject commit f8b1b833015a4ae47110ed068e0deb7106ced66d
+Subproject commit 22ead3e0bfdb87516656453336160e0a37b066bf
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 7bb618f182..3a7e01f10f 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -526,6 +526,36 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 nvme_attach_ns(n, ns);
 }
 
+static void nvme_ns_unrealize(DeviceState *dev)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+BusState *s = qdev_get_parent_bus(dev);
+NvmeCtrl *n = NVME(s->parent);
+NvmeSubsystem *subsys = n->subsys;
+uint32_t nsid = ns->params.nsid;
+int i;
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+
+if (subsys) {
+subsys->namespaces[nsid] = NULL;
+
+if (ns->params.shared) {
+for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
+NvmeCtrl *ctrl = subsys->ctrls[i];
+
+if (ctrl) {
+nvme_detach_ns(ctrl, ns);
+}
+}
+return;
+}
+}
+nvme_detach_ns(n, ns);
+}
+
 static Property nvme_ns_props[] = {
 DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
 DEFINE_PROP_BOOL("detached", NvmeNamespace, params.detached, false),
@@ -563,6 +593,7 @@ static void nvme_ns_class_init(ObjectClass *oc, void *data)
 
 dc->bus_type = TYPE_NVME_BUS;
 dc->realize = nvme_ns_realize;
+dc->unrealize = nvme_ns_unrealize;
 device_class_set_props(dc, nvme_ns_props);
 dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/block/nvme-subsys.c b/hw/block/nvme-subsys.c
index 9604c19117..1c00508f33 100644
--- a/hw/block/nvme-subsys.c
+++ b/hw/block/nvme-subsys.c
@@ -42,6 +42,18 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 return cntlid;
 }
 
+void nvme_subsys_unregister_ctrl(NvmeCtrl *n)
+{
+NvmeSubsystem *subsys = n->subsys;
+int cntlid = n->cntlid;
+
+if (!n->subsys)
+return;
+assert(cntlid < ARRAY_SIZE(subsys->ctrls));
+subsys->ctrls[cntlid] = NULL;
+n->cntlid = -1;
+}
+
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
 {
 const char *nqn = subsys->params.nqn ?
diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h
index 7d7ef5f7f1..2d8a146c4f 100644
--- a/hw/block/nvme-subsys.h
+++ b/hw/block/nvme-subsys.h
@@ -32,6 +32,7 @@ typedef struct NvmeSubsystem {
 } NvmeSubsystem;
 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
+void nvme_subsys_unregister_ctrl(NvmeCtrl *n);
 
 static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
 uint32_t cntlid)
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5fe082ec34..515678b686 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -4963,26 +4963,12 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, 
NvmeRequest *req)
 }
 
 nvme_attach_ns(ctrl, ns);
-__nvme_select_ns_iocs(ctrl, ns);
 } else {
 if (!nvme_ns(ctrl, nsid)) {
 return NVME_NS_NOT_ATTACHED | NVME_DNR;
 }
 
-ctrl->namespaces[nsid - 1] = NULL;
-ns->attached--;
-
-nvme_update_dmrsl(ctrl);
-}
-
-/*
- * Add namespace id to the changed namespace id list for event clearing
- * via Get Log Page command.
- */
-if (!test_and_set_bit(nsid, ctrl->changed_nsids)) {
-nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
-   NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
-   NVME_LOG_CHANGED_NSLIST);
+nvme_detach_ns(ctrl, ns);
 }
 }
 
@@ -6

Re: [PATCH 7/7] scsi: move host_status handling into SCSI drivers

2020-11-17 Thread Hannes Reinecke

On 11/17/20 8:38 AM, Paolo Bonzini wrote:

On 17/11/20 07:55, Hannes Reinecke wrote:

On 11/16/20 11:00 PM, Paolo Bonzini wrote:

On 16/11/20 20:05, Hannes Reinecke wrote:

+    if (sreq->host_status == SCSI_HOST_OK) {
+    SCSISense sense;
+
+    sreq->status = 
scsi_sense_from_host_status(sreq->host_status, &sense);

+    if (sreq->status == CHECK_CONDITION) {
+    scsi_req_build_sense(sreq, sense);
+    }
+    }


Should be != of course.


No.
scsi_req_build_sense() transfers the sense code from the second 
argument
into a proper SCSI sense. Which is only set if the status is 
CHECK_CONDITION...


I mean sreq->host_status != SCSI_HOST_OK.  I might be wrong, but 
every other HBA is using that...



Bah. Yes, of course, you are right.
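(So the hunk should read, with the condition inverted:)

    if (sreq->host_status != SCSI_HOST_OK) {
        SCSISense sense;

        sreq->status = scsi_sense_from_host_status(sreq->host_status, &sense);
        if (sreq->status == CHECK_CONDITION) {
            scsi_req_build_sense(sreq, sense);
        }
    }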

Shall I resubmit? Or how is the process nowadays?


Depends on how busy and grumpy I am. :)  Since we're right in the middle 
of the freeze, let me send a RFC patch for Linux to clean up DID_* a 
little bit.


What's your intention there? I do have (of course) a larger patchset for 
revisiting the SCSI status codes, so I could resubmit those portions 
relating to DID_ codes ...


Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 5/7] scsi: Add mapping for generic SCSI_HOST status to sense codes

2020-11-16 Thread Hannes Reinecke

On 11/16/20 9:05 PM, Paolo Bonzini wrote:

On 16/11/20 20:03, Hannes Reinecke wrote:



+    case SCSI_HOST_TARGET_FAILURE:
+    *sense = SENSE_CODE(TARGET_FAILURE);
+    return CHECK_CONDITION;
+    case SCSI_HOST_RESERVATION_ERROR:
+    return RESERVATION_CONFLICT;
+    case SCSI_HOST_ALLOCATION_FAILURE:
+    *sense = SENSE_CODE(SPACE_ALLOC_FAILED);
+    return CHECK_CONDITION;
+    case SCSI_HOST_MEDIUM_ERROR:
+    *sense = SENSE_CODE(READ_ERROR);
+    return CHECK_CONDITION;


Can these actually be visible to userspace?  I'd rather avoid having 
them in QEMU if possible.


Otherwise, the patches are completely sensible.

And I did it exactly for the opposite purpose: rather than 
painstakingly figuring out which codes _might_ be returned (and be 
utterly surprised if we missed some) add an interpretation for every 
_possible_ code, avoiding nasty surprises.


And that certainly makes sense too.

On the other hand it'd be nice if Linux was clearer about which of the 
SCSI_HOST values are part of the userspace API and which are just an 
(ugly) implementation detail.



Oh, I certainly agree with that.
But that is more of a long-term prospect; I do see some discussions 
ahead if one were to try it. Especially as some codes (like DID_BAD_TARGET 
and DID_NO_CONNECT) have no clear distinction between them, and are used 
more-or-less interchangeably.


But a clear definition of these values would inevitably lead to a change 
in various drivers, which then would lead to a change in behaviour, and 
a possible user-space regression.


So not that easy, sadly.

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 7/7] scsi: move host_status handling into SCSI drivers

2020-11-16 Thread Hannes Reinecke

On 11/16/20 11:00 PM, Paolo Bonzini wrote:

On 16/11/20 20:05, Hannes Reinecke wrote:

+    if (sreq->host_status == SCSI_HOST_OK) {
+    SCSISense sense;
+
+    sreq->status = scsi_sense_from_host_status(sreq->host_status, &sense);

+    if (sreq->status == CHECK_CONDITION) {
+    scsi_req_build_sense(sreq, sense);
+    }
+    }


Should be != of course.


No.
scsi_req_build_sense() transfers the sense code from the second argument
into a proper SCSI sense. Which is only set if the status is 
CHECK_CONDITION...


I mean sreq->host_status != SCSI_HOST_OK.  I might be wrong, but every 
other HBA is using that...



Bah. Yes, of course, you are right.

Shall I resubmit? Or how is the process nowadays?

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 7/7] scsi: move host_status handling into SCSI drivers

2020-11-16 Thread Hannes Reinecke

On 11/16/20 7:58 PM, Paolo Bonzini wrote:

On 16/11/20 19:40, Hannes Reinecke wrote:

+    if (sreq->host_status == SCSI_HOST_OK) {
+    SCSISense sense;
+
+    sreq->status = scsi_sense_from_host_status(sreq->host_status, &sense);

+    if (sreq->status == CHECK_CONDITION) {
+    scsi_req_build_sense(sreq, sense);
+    }
+    }


Should be != of course.


No.
scsi_req_build_sense() transfers the sense code from the second argument
into a proper SCSI sense. Which is only set if the status is 
CHECK_CONDITION...


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 5/7] scsi: Add mapping for generic SCSI_HOST status to sense codes

2020-11-16 Thread Hannes Reinecke

On 11/16/20 7:57 PM, Paolo Bonzini wrote:

On 16/11/20 19:40, Hannes Reinecke wrote:

+    case SCSI_HOST_TARGET_FAILURE:
+    *sense = SENSE_CODE(TARGET_FAILURE);
+    return CHECK_CONDITION;
+    case SCSI_HOST_RESERVATION_ERROR:
+    return RESERVATION_CONFLICT;
+    case SCSI_HOST_ALLOCATION_FAILURE:
+    *sense = SENSE_CODE(SPACE_ALLOC_FAILED);
+    return CHECK_CONDITION;
+    case SCSI_HOST_MEDIUM_ERROR:
+    *sense = SENSE_CODE(READ_ERROR);
+    return CHECK_CONDITION;


Can these actually be visible to userspace?  I'd rather avoid having 
them in QEMU if possible.


Otherwise, the patches are completely sensible.

And I did it exactly for the opposite purpose: rather than painstakingly 
figuring out which codes _might_ be returned (and be utterly surprised 
if we missed some) add an interpretation for every _possible_ code, 
avoiding nasty surprises.


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



[PATCH 5/7] scsi: Add mapping for generic SCSI_HOST status to sense codes

2020-11-16 Thread Hannes Reinecke
As we don't have a driver-specific mapping (yet) we should provide
for a detailed mapping from host_status to SCSI sense codes.

Signed-off-by: Hannes Reinecke 
---
 scsi/utils.c | 60 +++-
 1 file changed, 55 insertions(+), 5 deletions(-)

diff --git a/scsi/utils.c b/scsi/utils.c
index 262ef1c3ea..ae68881184 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -252,6 +252,21 @@ const struct SCSISense sense_code_LUN_COMM_FAILURE = {
 .key = ABORTED_COMMAND, .asc = 0x08, .ascq = 0x00
 };
 
+/* Command aborted, LUN does not respond to selection */
+const struct SCSISense sense_code_LUN_NOT_RESPONDING = {
+.key = ABORTED_COMMAND, .asc = 0x05, .ascq = 0x00
+};
+
+/* Command aborted, Command Timeout during processing */
+const struct SCSISense sense_code_COMMAND_TIMEOUT = {
+.key = ABORTED_COMMAND, .asc = 0x2e, .ascq = 0x02
+};
+
+/* Command aborted, Commands cleared by device server */
+const struct SCSISense sense_code_COMMAND_ABORTED = {
+.key = ABORTED_COMMAND, .asc = 0x2f, .ascq = 0x02
+};
+
 /* Medium Error, Unrecovered read error */
 const struct SCSISense sense_code_READ_ERROR = {
 .key = MEDIUM_ERROR, .asc = 0x11, .ascq = 0x00
@@ -568,6 +583,14 @@ int sg_io_sense_from_errno(int errno_value, struct 
sg_io_hdr *io_hdr,
 switch (errno_value) {
 case EDOM:
 return TASK_SET_FULL;
+case EBADE:
+return RESERVATION_CONFLICT;
+case ENODATA:
+*sense = SENSE_CODE(READ_ERROR);
+return CHECK_CONDITION;
+case EREMOTEIO:
+*sense = SENSE_CODE(LUN_COMM_FAILURE);
+return CHECK_CONDITION;
 case ENOMEM:
 *sense = SENSE_CODE(TARGET_FAILURE);
 return CHECK_CONDITION;
@@ -576,14 +599,41 @@ int sg_io_sense_from_errno(int errno_value, struct 
sg_io_hdr *io_hdr,
 return CHECK_CONDITION;
 }
 } else {
-if (io_hdr->host_status == SCSI_HOST_NO_LUN ||
-io_hdr->host_status == SCSI_HOST_BUSY ||
-io_hdr->host_status == SCSI_HOST_TIME_OUT ||
-(io_hdr->driver_status & SG_ERR_DRIVER_TIMEOUT)) {
+switch (io_hdr->host_status) {
+case SCSI_HOST_NO_LUN:
+*sense = SENSE_CODE(LUN_NOT_RESPONDING);
+return CHECK_CONDITION;
+case SCSI_HOST_BUSY:
 return BUSY;
-} else if (io_hdr->host_status) {
+case SCSI_HOST_TIME_OUT:
+*sense = SENSE_CODE(COMMAND_TIMEOUT);
+return CHECK_CONDITION;
+case SCSI_HOST_BAD_RESPONSE:
+*sense = SENSE_CODE(LUN_COMM_FAILURE);
+return CHECK_CONDITION;
+case SCSI_HOST_ABORTED:
+*sense = SENSE_CODE(COMMAND_ABORTED);
+return CHECK_CONDITION;
+case SCSI_HOST_RESET:
+*sense = SENSE_CODE(RESET);
+return CHECK_CONDITION;
+case SCSI_HOST_TRANSPORT_DISRUPTED:
 *sense = SENSE_CODE(I_T_NEXUS_LOSS);
 return CHECK_CONDITION;
+case SCSI_HOST_TARGET_FAILURE:
+*sense = SENSE_CODE(TARGET_FAILURE);
+return CHECK_CONDITION;
+case SCSI_HOST_RESERVATION_ERROR:
+return RESERVATION_CONFLICT;
+case SCSI_HOST_ALLOCATION_FAILURE:
+*sense = SENSE_CODE(SPACE_ALLOC_FAILED);
+return CHECK_CONDITION;
+case SCSI_HOST_MEDIUM_ERROR:
+*sense = SENSE_CODE(READ_ERROR);
+return CHECK_CONDITION;
+}
+if (io_hdr->driver_status & SG_ERR_DRIVER_TIMEOUT) {
+return BUSY;
 } else if (io_hdr->status) {
 return io_hdr->status;
 } else if (io_hdr->driver_status & SG_ERR_DRIVER_SENSE) {
-- 
2.16.4




[PATCH 1/7] scsi-disk: Add sg_io callback to evaluate status

2020-11-16 Thread Hannes Reinecke
Add a separate sg_io callback to allow us to evaluate the various
states returned by the SG_IO ioctl.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/scsi-disk.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index dd23a38d6a..5d6c892f29 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -76,7 +76,6 @@ typedef struct SCSIDiskReq {
 struct iovec iov;
 QEMUIOVector qiov;
 BlockAcctCookie acct;
-unsigned char *status;
 } SCSIDiskReq;
 
 #define SCSI_DISK_F_REMOVABLE 0
@@ -188,7 +187,7 @@ static bool scsi_disk_req_check_error(SCSIDiskReq *r, int 
ret, bool acct_failed)
 return true;
 }
 
-if (ret < 0 || (r->status && *r->status)) {
+if (ret < 0 || r->req.status) {
 return scsi_handle_rw_error(r, -ret, acct_failed);
 }
 
@@ -452,11 +451,11 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int 
error, bool acct_failed)
  * whether the error has to be handled by the guest or should 
rather
  * pause the host.
  */
-assert(r->status && *r->status);
+assert(r->req.status);
 if (scsi_sense_buf_is_guest_recoverable(r->req.sense, 
sizeof(r->req.sense))) {
 /* These errors are handled by guest. */
sdc->update_sense(&r->req);
-scsi_req_complete(&r->req, *r->status);
+scsi_req_complete(&r->req, r->req.status);
 return true;
 }
 error = scsi_sense_buf_to_errno(r->req.sense, 
sizeof(r->req.sense));
@@ -2688,8 +2687,24 @@ typedef struct SCSIBlockReq {
 
 /* CDB passed to SG_IO.  */
 uint8_t cdb[16];
+BlockCompletionFunc *cb;
+void *cb_opaque;
 } SCSIBlockReq;
 
+static void scsi_block_sgio_complete(void *opaque, int ret)
+{
+SCSIBlockReq *req = (SCSIBlockReq *)opaque;
+SCSIDiskReq *r = &req->req;
+SCSISense sense;
+
+r->req.status = sg_io_sense_from_errno(-ret, &req->io_header, &sense);
+if (r->req.status == CHECK_CONDITION &&
+req->io_header.status != CHECK_CONDITION)
+scsi_req_build_sense(&r->req, sense);
+
+req->cb(req->cb_opaque, ret);
+}
+
 static BlockAIOCB *scsi_block_do_sgio(SCSIBlockReq *req,
   int64_t offset, QEMUIOVector *iov,
   int direction,
@@ -2768,9 +2783,11 @@ static BlockAIOCB *scsi_block_do_sgio(SCSIBlockReq *req,
 io_header->timeout = s->qdev.io_timeout * 1000;
 io_header->usr_ptr = r;
 io_header->flags |= SG_FLAG_DIRECT_IO;
+req->cb = cb;
+req->cb_opaque = opaque;
 trace_scsi_disk_aio_sgio_command(r->req.tag, req->cdb[0], lba,
  nb_logical_blocks, io_header->timeout);
-aiocb = blk_aio_ioctl(s->qdev.conf.blk, SG_IO, io_header, cb, opaque);
+aiocb = blk_aio_ioctl(s->qdev.conf.blk, SG_IO, io_header, 
scsi_block_sgio_complete, req);
 assert(aiocb != NULL);
 return aiocb;
 }
@@ -2884,7 +2901,6 @@ static int32_t scsi_block_dma_command(SCSIRequest *req, 
uint8_t *buf)
 return 0;
 }
 
-r->req.status = >io_header.status;
 return scsi_disk_dma_command(req, buf);
 }
 
-- 
2.16.4




[PATCH 0/3] scsi: infinite guest hangs with scsi-disk

2020-11-16 Thread Hannes Reinecke
Hi all,

one of our customers reported an infinite guest hang following an FC link loss
when using scsi-disk.
Problem is that scsi-disk issues SG_IO commands with a timeout of UINT_MAX,
which essentially signals 'no timeout' to the host kernel. So if a command gets
lost, e.g. during an unexpected link loss, the HBA driver will never attempt to
abort or return the command. Hence the guest will hang forever, and the only
way to resolve things is to reboot the host.

To solve this, the patchset adds an 'io_timeout' parameter to scsi-disk and
scsi-generic, which allows the admin to specify a command timeout for SG_IO
requests. It is initialized to 30 seconds to avoid the infinite hang mentioned
above.

As usual, comments and reviews are welcome.
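
For illustration, with this applied the timeout can be set per device on the
command line; e.g. something along the lines of (the value of 60 seconds is
only an example, the default stays at 30 seconds):

    -device scsi-block,drive=drive0,io_timeout=60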

Hannes Reinecke (3):
  virtio-scsi: trace events
  scsi: make io_timeout configurable
  scsi: add tracing for SG_IO commands

 hw/scsi/scsi-disk.c|  9 ++---
 hw/scsi/scsi-generic.c | 25 ++---
 hw/scsi/trace-events   | 13 +
 hw/scsi/virtio-scsi.c  | 30 +-
 include/hw/scsi/scsi.h |  4 +++-
 5 files changed, 69 insertions(+), 12 deletions(-)

-- 
2.16.4




[PATCH 7/7] scsi: move host_status handling into SCSI drivers

2020-11-16 Thread Hannes Reinecke
Some SCSI drivers like virtio have an internal mapping for the
host_status. This patch moves the host_status translation into
the SCSI drivers to allow those drivers to set up the correct
values.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/esp.c  | 10 ++
 hw/scsi/lsi53c895a.c   | 11 +++
 hw/scsi/megasas.c  |  9 +
 hw/scsi/mptsas.c   |  9 +
 hw/scsi/scsi-disk.c| 10 --
 hw/scsi/scsi-generic.c |  8 +++-
 hw/scsi/spapr_vscsi.c  | 12 +++-
 hw/scsi/virtio-scsi.c  | 41 +++--
 hw/scsi/vmw_pvscsi.c   | 25 +
 include/hw/scsi/scsi.h |  3 ++-
 10 files changed, 123 insertions(+), 15 deletions(-)

diff --git a/hw/scsi/esp.c b/hw/scsi/esp.c
index 93d9c9c7b9..fc88cfac23 100644
--- a/hw/scsi/esp.c
+++ b/hw/scsi/esp.c
@@ -28,6 +28,8 @@
 #include "migration/vmstate.h"
 #include "hw/irq.h"
 #include "hw/scsi/esp.h"
+#include "scsi/utils.h"
+#include "scsi/constants.h"
 #include "trace.h"
 #include "qemu/log.h"
 #include "qemu/module.h"
@@ -489,6 +491,14 @@ void esp_command_complete(SCSIRequest *req, size_t resid)
 {
 ESPState *s = req->hba_private;
 
+if (req->host_status != SCSI_HOST_OK) {
+SCSISense sense;
+
+req->status = scsi_sense_from_host_status(req->host_status, &sense);
+if (req->status == CHECK_CONDITION) {
+scsi_req_build_sense(req, sense);
+}
+}
 if (s->rregs[ESP_RSTAT] & STAT_INT) {
 /* Defer handling command complete until the previous
  * interrupt has been handled.
diff --git a/hw/scsi/lsi53c895a.c b/hw/scsi/lsi53c895a.c
index a4e58580e4..b6aa98c95a 100644
--- a/hw/scsi/lsi53c895a.c
+++ b/hw/scsi/lsi53c895a.c
@@ -18,6 +18,8 @@
 #include "hw/irq.h"
 #include "hw/pci/pci.h"
 #include "hw/scsi/scsi.h"
+#include "scsi/utils.h"
+#include "scsi/constants.h"
 #include "migration/vmstate.h"
 #include "sysemu/dma.h"
 #include "qemu/log.h"
@@ -792,6 +794,15 @@ static void lsi_command_complete(SCSIRequest *req, size_t 
resid)
 LSIState *s = LSI53C895A(req->bus->qbus.parent);
 int out;
 
+if (req->host_status != SCSI_HOST_OK) {
+SCSISense sense;
+
+req->status = scsi_sense_from_host_status(req->host_status, &sense);
+if (req->status == CHECK_CONDITION) {
+scsi_req_build_sense(req, sense);
+}
+}
+
 out = (s->sstat1 & PHASE_MASK) == PHASE_DO;
 trace_lsi_command_complete(req->status);
 s->status = req->status;
diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index 35867dbd40..1f7d806ffa 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -1857,6 +1857,15 @@ static void megasas_command_complete(SCSIRequest *req, 
size_t resid)
 MegasasCmd *cmd = req->hba_private;
 uint8_t cmd_status = MFI_STAT_OK;
 
+if (req->host_status != SCSI_HOST_OK) {
+SCSISense sense;
+
+req->status = scsi_sense_from_host_status(req->host_status, &sense);
+if (req->status == CHECK_CONDITION) {
+scsi_req_build_sense(req, sense);
+}
+}
+
 trace_megasas_command_complete(cmd->index, req->status, resid);
 
 if (req->io_canceled) {
diff --git a/hw/scsi/mptsas.c b/hw/scsi/mptsas.c
index d4fbfb2da7..be3875ce94 100644
--- a/hw/scsi/mptsas.c
+++ b/hw/scsi/mptsas.c
@@ -1143,6 +1143,15 @@ static void mptsas_command_complete(SCSIRequest *sreq,
 hwaddr sense_buffer_addr = req->dev->sense_buffer_high_addr |
 req->scsi_io.SenseBufferLowAddr;
 
+if (sreq->host_status == SCSI_HOST_OK) {
+SCSISense sense;
+
+sreq->status = scsi_sense_from_host_status(sreq->host_status, &sense);
+if (sreq->status == CHECK_CONDITION) {
+scsi_req_build_sense(sreq, sense);
+}
+}
+
 trace_mptsas_command_complete(s, req->scsi_io.MsgContext,
   sreq->status, resid);
 
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6eb0aa3d27..c0cb63707d 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1840,7 +1840,7 @@ static void scsi_disk_emulate_write_data(SCSIRequest *req)
 case VERIFY_10:
 case VERIFY_12:
 case VERIFY_16:
-if (r->req.status == -1) {
+if (r->req.status == GOOD) {
 scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 break;
@@ -2122,7 +2122,7 @@ static int32_t scsi_disk_emulate_command(SCSIRequest 
*req, uint8_t *buf)
 }
 
 illegal_request:
-if (r->req.status == -1) {
+if (r->req.status == GOOD) {
 scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 return 0;
@@ -2697,10 +2697,8 @@ static void scsi_block_sgio_complete(void *opaque, int 
ret)
 

[PATCH 3/7] scsi-disk: convert more errno values back to SCSI statuses

2020-11-16 Thread Hannes Reinecke
From: Paolo Bonzini 

Linux has some OS-specific (and sometimes weird) mappings for various SCSI
statuses and sense codes.  The most important is probably RESERVATION
CONFLICT.  Add them so that they can be reported back to the guest
kernel.

Cc: Hannes Reinecke 
Signed-off-by: Paolo Bonzini 
---
 hw/scsi/scsi-disk.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 5d6c892f29..797779afd6 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -460,6 +460,25 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int 
error, bool acct_failed)
 }
 error = scsi_sense_buf_to_errno(r->req.sense, 
sizeof(r->req.sense));
 break;
+#ifdef CONFIG_LINUX
+/* These errno mapping are specific to Linux.  For more 
information:
+ * - scsi_decide_disposition in drivers/scsi/scsi_error.c
+ * - scsi_result_to_blk_status in drivers/scsi/scsi_lib.c
+ * - blk_errors[] in block/blk-core.c
+ */
+case EBADE:
+/* DID_NEXUS_FAILURE -> BLK_STS_NEXUS.  */
+scsi_req_complete(&r->req, RESERVATION_CONFLICT);
+break;
+case ENODATA:
+/* DID_MEDIUM_ERROR -> BLK_STS_MEDIUM.  */
+scsi_check_condition(r, SENSE_CODE(READ_ERROR));
+break;
+case EREMOTEIO:
+/* DID_TARGET_FAILURE -> BLK_STS_TARGET.  */
+scsi_req_complete(&r->req, HARDWARE_ERROR);
+break;
+#endif
 case ENOMEDIUM:
 scsi_check_condition(r, SENSE_CODE(NO_MEDIUM));
 break;
-- 
2.16.4




[PATCH 4/7] scsi: Rename linux-specific SG_ERR codes to generic SCSI_HOST error codes

2020-11-16 Thread Hannes Reinecke
We really should make a distinction between legitimate sense codes
(ie if one is running against an emulated block device or for
pass-through sense codes), and the intermediate errors generated
during processing of the command, which really are not sense codes
but refer to some specific internal status. And this internal
state is not necessarily linux-specific, but rather can refer to
the qemu implementation itself.
So rename the linux-only SG_ERR codes to SCSI_HOST codes and make
them available generally.

Signed-off-by: Hannes Reinecke 
---
 include/scsi/utils.h | 23 ---
 scsi/utils.c |  6 +++---
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/scsi/utils.h b/include/scsi/utils.h
index fbc5588279..a55ba2c1ea 100644
--- a/include/scsi/utils.h
+++ b/include/scsi/utils.h
@@ -16,6 +16,22 @@ enum SCSIXferMode {
 SCSI_XFER_TO_DEV,/*  WRITE, MODE_SELECT, ... */
 };
 
+enum SCSIHostStatus {
+SCSI_HOST_OK,
+SCSI_HOST_NO_LUN,
+SCSI_HOST_BUSY,
+SCSI_HOST_TIME_OUT,
+SCSI_HOST_BAD_RESPONSE,
+SCSI_HOST_ABORTED,
+SCSI_HOST_ERROR = 0x07,
+SCSI_HOST_RESET = 0x08,
+SCSI_HOST_TRANSPORT_DISRUPTED = 0xe,
+SCSI_HOST_TARGET_FAILURE = 0x10,
+SCSI_HOST_RESERVATION_ERROR = 0x11,
+SCSI_HOST_ALLOCATION_FAILURE = 0x12,
+SCSI_HOST_MEDIUM_ERROR = 0x13,
+};
+
 typedef struct SCSICommand {
 uint8_t buf[SCSI_CMD_BUF_SIZE];
 int len;
@@ -122,13 +138,6 @@ int scsi_cdb_length(uint8_t *buf);
 #define SG_ERR_DRIVER_TIMEOUT  0x06
 #define SG_ERR_DRIVER_SENSE0x08
 
-#define SG_ERR_DID_OK  0x00
-#define SG_ERR_DID_NO_CONNECT  0x01
-#define SG_ERR_DID_BUS_BUSY0x02
-#define SG_ERR_DID_TIME_OUT0x03
-
-#define SG_ERR_DRIVER_SENSE0x08
-
 int sg_io_sense_from_errno(int errno_value, struct sg_io_hdr *io_hdr,
SCSISense *sense);
 #endif
diff --git a/scsi/utils.c b/scsi/utils.c
index b37c283014..262ef1c3ea 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -576,9 +576,9 @@ int sg_io_sense_from_errno(int errno_value, struct 
sg_io_hdr *io_hdr,
 return CHECK_CONDITION;
 }
 } else {
-if (io_hdr->host_status == SG_ERR_DID_NO_CONNECT ||
-io_hdr->host_status == SG_ERR_DID_BUS_BUSY ||
-io_hdr->host_status == SG_ERR_DID_TIME_OUT ||
+if (io_hdr->host_status == SCSI_HOST_NO_LUN ||
+io_hdr->host_status == SCSI_HOST_BUSY ||
+io_hdr->host_status == SCSI_HOST_TIME_OUT ||
 (io_hdr->driver_status & SG_ERR_DRIVER_TIMEOUT)) {
 return BUSY;
 } else if (io_hdr->host_status) {
-- 
2.16.4




[PATCH 2/7] scsi: drop 'result' argument from command_complete callback

2020-11-16 Thread Hannes Reinecke
The command complete callback has a SCSIRequest as the first argument,
and the status field of that structure is identical to the 'status'
argument. So drop the argument from the callback.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/esp-pci.c  |  5 ++---
 hw/scsi/esp.c  |  7 +++
 hw/scsi/lsi53c895a.c   |  6 +++---
 hw/scsi/megasas.c  |  6 ++
 hw/scsi/mptsas.c   |  5 +++--
 hw/scsi/scsi-bus.c |  2 +-
 hw/scsi/spapr_vscsi.c  | 10 +-
 hw/scsi/virtio-scsi.c  |  5 ++---
 hw/scsi/vmw_pvscsi.c   |  4 ++--
 hw/usb/dev-storage.c   |  6 +++---
 hw/usb/dev-uas.c   |  7 +++
 include/hw/scsi/esp.h  |  2 +-
 include/hw/scsi/scsi.h |  2 +-
 13 files changed, 31 insertions(+), 36 deletions(-)

diff --git a/hw/scsi/esp-pci.c b/hw/scsi/esp-pci.c
index 2ce96dc56e..4d7c2cab56 100644
--- a/hw/scsi/esp-pci.c
+++ b/hw/scsi/esp-pci.c
@@ -329,13 +329,12 @@ static const VMStateDescription vmstate_esp_pci_scsi = {
 }
 };
 
-static void esp_pci_command_complete(SCSIRequest *req, uint32_t status,
- size_t resid)
+static void esp_pci_command_complete(SCSIRequest *req, size_t resid)
 {
 ESPState *s = req->hba_private;
 PCIESPState *pci = container_of(s, PCIESPState, esp);
 
-esp_command_complete(req, status, resid);
+esp_command_complete(req, resid);
 pci->dma_regs[DMA_WBC] = 0;
 pci->dma_regs[DMA_STAT] |= DMA_STAT_DONE;
 }
diff --git a/hw/scsi/esp.c b/hw/scsi/esp.c
index b84e0fe33e..93d9c9c7b9 100644
--- a/hw/scsi/esp.c
+++ b/hw/scsi/esp.c
@@ -485,8 +485,7 @@ static void esp_report_command_complete(ESPState *s, 
uint32_t status)
 }
 }
 
-void esp_command_complete(SCSIRequest *req, uint32_t status,
-  size_t resid)
+void esp_command_complete(SCSIRequest *req, size_t resid)
 {
 ESPState *s = req->hba_private;
 
@@ -495,11 +494,11 @@ void esp_command_complete(SCSIRequest *req, uint32_t 
status,
  * interrupt has been handled.
  */
 trace_esp_command_complete_deferred();
-s->deferred_status = status;
+s->deferred_status = req->status;
 s->deferred_complete = true;
 return;
 }
-esp_report_command_complete(s, status);
+esp_report_command_complete(s, req->status);
 }
 
 void esp_transfer_data(SCSIRequest *req, uint32_t len)
diff --git a/hw/scsi/lsi53c895a.c b/hw/scsi/lsi53c895a.c
index 7d13c7dc1c..a4e58580e4 100644
--- a/hw/scsi/lsi53c895a.c
+++ b/hw/scsi/lsi53c895a.c
@@ -787,14 +787,14 @@ static int lsi_queue_req(LSIState *s, SCSIRequest *req, 
uint32_t len)
 }
 
  /* Callback to indicate that the SCSI layer has completed a command.  */
-static void lsi_command_complete(SCSIRequest *req, uint32_t status, size_t 
resid)
+static void lsi_command_complete(SCSIRequest *req, size_t resid)
 {
 LSIState *s = LSI53C895A(req->bus->qbus.parent);
 int out;
 
 out = (s->sstat1 & PHASE_MASK) == PHASE_DO;
-trace_lsi_command_complete(status);
-s->status = status;
+trace_lsi_command_complete(req->status);
+s->status = req->status;
 s->command_complete = 2;
 if (s->waiting && s->dbc != 0) {
 /* Raise phase mismatch for short transfers.  */
diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index e24c12d7ee..35867dbd40 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -1852,13 +1852,12 @@ static void megasas_xfer_complete(SCSIRequest *req, 
uint32_t len)
 }
 }
 
-static void megasas_command_complete(SCSIRequest *req, uint32_t status,
- size_t resid)
+static void megasas_command_complete(SCSIRequest *req, size_t resid)
 {
 MegasasCmd *cmd = req->hba_private;
 uint8_t cmd_status = MFI_STAT_OK;
 
-trace_megasas_command_complete(cmd->index, status, resid);
+trace_megasas_command_complete(cmd->index, req->status, resid);
 
 if (req->io_canceled) {
 return;
@@ -1873,7 +1872,6 @@ static void megasas_command_complete(SCSIRequest *req, 
uint32_t status,
 return;
 }
 } else {
-req->status = status;
 trace_megasas_scsi_complete(cmd->index, req->status,
 cmd->iov_size, req->cmd.xfer);
 if (req->status != GOOD) {
diff --git a/hw/scsi/mptsas.c b/hw/scsi/mptsas.c
index 135e7d96e4..d4fbfb2da7 100644
--- a/hw/scsi/mptsas.c
+++ b/hw/scsi/mptsas.c
@@ -1133,7 +1133,7 @@ static QEMUSGList *mptsas_get_sg_list(SCSIRequest *sreq)
 }
 
 static void mptsas_command_complete(SCSIRequest *sreq,
-uint32_t status, size_t resid)
+size_t resid)
 {
 MPTSASRequest *req = sreq->hba_private;
 MPTSASState *s = req->dev;
@@ -1143,7 +1143,8 @@ static void mptsas_command_complete(SCSIRequest *sreq,
 hwaddr sense_buffer_addr = req->dev->sense_buffer_high_addr |
 req->scsi_io.SenseBufferLowAddr;
 
-trace_mptsas_comma

[PATCH 6/7] scsi: split sg_io_sense_from_errno() in two functions

2020-11-16 Thread Hannes Reinecke
Currently sg_io_sense_from_errno() converts the two input parameters
'errno' and 'io_hdr' into sense code and SCSI status. This patch
splits this off into two functions scsi_sense_from_errno() and
scsi_sense_from_host_status(), both of which are available generically.
This allows us to use the function scsi_sense_from_errno() in
scsi-disk.c instead of the switch statement, allowing us to consolidate
the errno handling.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/scsi-disk.c|  72 +++--
 hw/scsi/scsi-generic.c |  19 +--
 include/scsi/utils.h   |   6 +--
 scsi/qemu-pr-helper.c  |  14 +++--
 scsi/utils.c   | 139 ++---
 5 files changed, 134 insertions(+), 116 deletions(-)
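
The resulting interface, as used by the callers below, is roughly (a sketch of
the prototypes; the actual declarations live in include/scsi/utils.h):

    int scsi_sense_from_errno(int errno_value, SCSISense *sense);
    int scsi_sense_from_host_status(uint8_t host_status, SCSISense *sense);

Both return a SCSI status value; when that status is CHECK_CONDITION, *sense
is filled in and the caller builds the sense buffer via scsi_req_build_sense()
or scsi_build_sense().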

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 797779afd6..6eb0aa3d27 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -445,8 +445,7 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int error, 
bool acct_failed)
 if (acct_failed) {
block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
-switch (error) {
-case 0:
+if (error == 0) {
 /* A passthrough command has run and has produced sense data; check
  * whether the error has to be handled by the guest or should 
rather
  * pause the host.
@@ -459,41 +458,16 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int 
error, bool acct_failed)
 return true;
 }
 error = scsi_sense_buf_to_errno(r->req.sense, 
sizeof(r->req.sense));
-break;
-#ifdef CONFIG_LINUX
-/* These errno mapping are specific to Linux.  For more 
information:
- * - scsi_decide_disposition in drivers/scsi/scsi_error.c
- * - scsi_result_to_blk_status in drivers/scsi/scsi_lib.c
- * - blk_errors[] in block/blk-core.c
- */
-case EBADE:
-/* DID_NEXUS_FAILURE -> BLK_STS_NEXUS.  */
-scsi_req_complete(&r->req, RESERVATION_CONFLICT);
-break;
-case ENODATA:
-/* DID_MEDIUM_ERROR -> BLK_STS_MEDIUM.  */
-scsi_check_condition(r, SENSE_CODE(READ_ERROR));
-break;
-case EREMOTEIO:
-/* DID_TARGET_FAILURE -> BLK_STS_TARGET.  */
-scsi_req_complete(&r->req, HARDWARE_ERROR);
-break;
-#endif
-case ENOMEDIUM:
-scsi_check_condition(r, SENSE_CODE(NO_MEDIUM));
-break;
-case ENOMEM:
-scsi_check_condition(r, SENSE_CODE(TARGET_FAILURE));
-break;
-case EINVAL:
-scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
-break;
-case ENOSPC:
-scsi_check_condition(r, SENSE_CODE(SPACE_ALLOC_FAILED));
-break;
-default:
-scsi_check_condition(r, SENSE_CODE(IO_ERROR));
-break;
+} else {
+SCSISense sense;
+int status;
+
+status = scsi_sense_from_errno(error, &sense);
+if (status == CHECK_CONDITION)
+scsi_build_sense(r->req.sense, sense);
+sdc->update_sense(&r->req);
+scsi_req_complete(&r->req, status);
+return true;
 }
 }
 
@@ -2714,13 +2688,29 @@ static void scsi_block_sgio_complete(void *opaque, int 
ret)
 {
 SCSIBlockReq *req = (SCSIBlockReq *)opaque;
 SCSIDiskReq *r = >req;
+sg_io_hdr_t io_hdr = req->io_header;
 SCSISense sense;
+int status;
 
-r->req.status = sg_io_sense_from_errno(-ret, &req->io_header, &sense);
-if (r->req.status == CHECK_CONDITION &&
-req->io_header.status != CHECK_CONDITION)
+status = scsi_sense_from_errno(-ret, &sense);
+if (status == CHECK_CONDITION) {
scsi_req_build_sense(&r->req, sense);
-
+} else if (status == GOOD &&
+   io_hdr.host_status != SCSI_HOST_OK) {
+status = scsi_sense_from_host_status(io_hdr.host_status, &sense);
+if (status == CHECK_CONDITION) {
+scsi_req_build_sense(&r->req, sense);
+}
+} else if (io_hdr.status == CHECK_CONDITION ||
+   io_hdr.driver_status & SG_ERR_DRIVER_SENSE) {
+status = CHECK_CONDITION;
+r->req.sense_len = io_hdr.sb_len_wr;
+} else if (io_hdr.driver_status & SG_ERR_DRIVER_TIMEOUT) {
+status = BUSY;
+} else if (io_hdr.status) {
+status = io_hdr.status;
+}
+r->req.status = status;
 req->cb(req->cb_opaque, ret);
 }
 
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 8687336438..a2b85678b5 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -74,6 +74,7 @@ static void scsi_command_complete_noio(SCSIGenericReq *r, int 
ret)
 {
 int status;
 SCSISense sense;
+sg_io_hdr_t io_hdr = r->io_header;
 
 assert(r->req

[PATCH 0/7] scsi: scsi-disk corrupts data

2020-11-16 Thread Hannes Reinecke
Hi all,

a customer of ours reported repeated data corruption in the guest following a
command abort.
After lengthy debugging we found that scsi-disk (and scsi-generic, for that
matter) ignores the host_status field from SG_IO once a command is aborted.
If a command is aborted, SG_IO will return with a SCSI status of 'GOOD' and a
host_status of 'DID_TIME_OUT'. scsi-disk will then ignore the DID_TIME_OUT
setting and just report the SCSI status back to the guest. The guest will
assume everything is okay and not retry the command, leading to the data
corruption.

This patchset moves the (Linux-only) SG_ERR host_status codes to generic code
as SCSI_HOST values, and adds a host_status field to SCSIRequest. With that,
drivers like virtio-scsi can interpret the host_status code and map it onto
their driver-specific status. This status is then visible to the guest, which
is then able to take appropriate action.

As usual, comments and reviews are welcome.

Hannes Reinecke (6):
  scsi-disk: Add sg_io callback to evaluate status
  scsi: drop 'result' argument from command_complete callback
  scsi: Rename linux-specific SG_ERR codes to generic SCSI_HOST error
codes
  scsi: Add mapping for generic SCSI_HOST status to sense codes
  scsi: split sg_io_sense_from_errno() in two functions
  scsi: move host_status handling into SCSI drivers

Paolo Bonzini (1):
  scsi-disk: convert more errno values back to SCSI statuses

 hw/scsi/esp-pci.c  |   5 +--
 hw/scsi/esp.c  |  17 +--
 hw/scsi/lsi53c895a.c   |  17 +--
 hw/scsi/megasas.c  |  15 +--
 hw/scsi/mptsas.c   |  14 +-
 hw/scsi/scsi-bus.c |   2 +-
 hw/scsi/scsi-disk.c|  75 ---
 hw/scsi/scsi-generic.c |  21 ++---
 hw/scsi/spapr_vscsi.c  |  20 ++---
 hw/scsi/virtio-scsi.c  |  44 --
 hw/scsi/vmw_pvscsi.c   |  29 +++-
 hw/usb/dev-storage.c   |   6 +--
 hw/usb/dev-uas.c   |   7 ++-
 include/hw/scsi/esp.h  |   2 +-
 include/hw/scsi/scsi.h |   5 ++-
 include/scsi/utils.h   |  29 +++-
 scsi/qemu-pr-helper.c  |  14 --
 scsi/utils.c   | 119 -
 18 files changed, 328 insertions(+), 113 deletions(-)

-- 
2.16.4




[PATCH 1/3] virtio-scsi: trace events

2020-11-16 Thread Hannes Reinecke
Add trace events for virtio command and response tracing.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/trace-events  |  9 +
 hw/scsi/virtio-scsi.c | 30 +-
 2 files changed, 38 insertions(+), 1 deletion(-)
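
To actually see these events, the usual tracing switches apply; for example
something like:

    qemu-system-x86_64 ... -trace 'virtio_scsi_*'

(the exact invocation depends on the trace backend QEMU was built with).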

diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index 9a4a60ca63..0e0aa9847d 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -294,6 +294,15 @@ lsi_awoken(void) "Woken by SIGP"
 lsi_reg_read(const char *name, int offset, uint8_t ret) "Read reg %s 0x%x = 
0x%02x"
 lsi_reg_write(const char *name, int offset, uint8_t val) "Write reg %s 0x%x = 
0x%02x"
 
+# virtio-scsi.c
+virtio_scsi_cmd_req(int lun, uint32_t tag, uint8_t cmd) "virtio_scsi_cmd_req 
lun=%u tag=0x%x cmd=0x%x"
+virtio_scsi_cmd_resp(int lun, uint32_t tag, int response, uint8_t status) 
"virtio_scsi_cmd_resp lun=%u tag=0x%x response=%d status=0x%x"
+virtio_scsi_tmf_req(int lun, uint32_t tag, int subtype) "virtio_scsi_tmf_req 
lun=%u tag=0x%x subtype=%d"
+virtio_scsi_tmf_resp(int lun, uint32_t tag, int response) 
"virtio_scsi_tmf_resp lun=%u tag=0x%x response=%d"
+virtio_scsi_an_req(int lun, uint32_t event_requested) "virtio_scsi_an_req 
lun=%u event_requested=0x%x"
+virtio_scsi_an_resp(int lun, int response) "virtio_scsi_an_resp lun=%u 
response=%d"
+virtio_scsi_event(int lun, int event, int reason) "virtio_scsi_event lun=%u 
event=%d reason=%d"
+
 # scsi-disk.c
 scsi_disk_check_condition(uint32_t tag, uint8_t key, uint8_t asc, uint8_t 
ascq) "Command complete tag=0x%x sense=%d/%d/%d"
 scsi_disk_read_complete(uint32_t tag, size_t size) "Data ready tag=0x%x 
len=%zd"
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3a71ea7097..c9873d5af7 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -27,6 +27,7 @@
 #include "scsi/constants.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "trace.h"
 
 static inline int virtio_scsi_get_lun(uint8_t *lun)
 {
@@ -239,7 +240,11 @@ static void virtio_scsi_cancel_notify(Notifier *notifier, 
void *data)
notifier);
 
 if (--n->tmf_req->remaining == 0) {
-virtio_scsi_complete_req(n->tmf_req);
+VirtIOSCSIReq *req = n->tmf_req;
+
+trace_virtio_scsi_tmf_resp(virtio_scsi_get_lun(req->req.tmf.lun),
+   req->req.tmf.tag, req->resp.tmf.response);
+virtio_scsi_complete_req(req);
 }
 g_free(n);
 }
@@ -273,6 +278,9 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq 
*req)
 req->req.tmf.subtype =
 virtio_tswap32(VIRTIO_DEVICE(s), req->req.tmf.subtype);
 
+trace_virtio_scsi_tmf_req(virtio_scsi_get_lun(req->req.tmf.lun),
+  req->req.tmf.tag, req->req.tmf.subtype);
+
 switch (req->req.tmf.subtype) {
 case VIRTIO_SCSI_T_TMF_ABORT_TASK:
 case VIRTIO_SCSI_T_TMF_QUERY_TASK:
@@ -422,11 +430,23 @@ static void virtio_scsi_handle_ctrl_req(VirtIOSCSI *s, 
VirtIOSCSIReq *req)
 virtio_scsi_bad_req(req);
 return;
 } else {
+req->req.an.event_requested =
+virtio_tswap32(VIRTIO_DEVICE(s), req->req.an.event_requested);
+trace_virtio_scsi_an_req(virtio_scsi_get_lun(req->req.an.lun),
+ req->req.an.event_requested);
 req->resp.an.event_actual = 0;
 req->resp.an.response = VIRTIO_SCSI_S_OK;
 }
 }
 if (r == 0) {
+if (type == VIRTIO_SCSI_T_TMF)
+trace_virtio_scsi_tmf_resp(virtio_scsi_get_lun(req->req.tmf.lun),
+   req->req.tmf.tag,
+   req->resp.tmf.response);
+else if (type == VIRTIO_SCSI_T_AN_QUERY ||
+ type == VIRTIO_SCSI_T_AN_SUBSCRIBE)
+trace_virtio_scsi_an_resp(virtio_scsi_get_lun(req->req.an.lun),
+  req->resp.an.response);
 virtio_scsi_complete_req(req);
 } else {
 assert(r == -EINPROGRESS);
@@ -462,6 +482,10 @@ static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, 
VirtQueue *vq)
 
 static void virtio_scsi_complete_cmd_req(VirtIOSCSIReq *req)
 {
+trace_virtio_scsi_cmd_resp(virtio_scsi_get_lun(req->req.cmd.lun),
+   req->req.cmd.tag,
+   req->resp.cmd.response,
+   req->resp.cmd.status);
 /* Sense data is not in req->resp and is copied separately
  * in virtio_scsi_command_complete.
  */
@@ -559,6 +583,8 @@ static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI 
*s, VirtIOSCSIReq *req)
 return -EINVAL;
 }
 }
+trace_virtio_scsi_cmd_

[PATCH 2/3] scsi: make io_timeout configurable

2020-11-16 Thread Hannes Reinecke
The current code sets an infinite timeout on SG_IO requests,
causing the guest to stall if the host experiences a frame
loss.
This patch adds an 'io_timeout' parameter for SCSIDevice to
make the SG_IO timeout configurable, and also shortens the
default timeout to 30 seconds to avoid infinite stalls.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/scsi-disk.c|  6 --
 hw/scsi/scsi-generic.c | 17 +++--
 include/hw/scsi/scsi.h |  4 +++-
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index e859534eaf..2959526b52 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2604,7 +2604,7 @@ static int get_device_type(SCSIDiskState *s)
 cmd[4] = sizeof(buf);
 
 ret = scsi_SG_IO_FROM_DEV(s->qdev.conf.blk, cmd, sizeof(cmd),
-  buf, sizeof(buf));
+  buf, sizeof(buf), s->qdev.io_timeout);
 if (ret < 0) {
 return -1;
 }
@@ -2765,7 +2765,7 @@ static BlockAIOCB *scsi_block_do_sgio(SCSIBlockReq *req,
 /* The rest is as in scsi-generic.c.  */
 io_header->mx_sb_len = sizeof(r->req.sense);
 io_header->sbp = r->req.sense;
-io_header->timeout = UINT_MAX;
+io_header->timeout = s->qdev.io_timeout * 1000;
 io_header->usr_ptr = r;
 io_header->flags |= SG_FLAG_DIRECT_IO;
 
@@ -3083,6 +3083,8 @@ static Property scsi_block_properties[] = {
DEFAULT_MAX_IO_SIZE),
 DEFINE_PROP_INT32("scsi_version", SCSIDiskState, qdev.default_scsi_version,
   -1),
+DEFINE_PROP_UINT32("io_timeout", SCSIDiskState, qdev.io_timeout,
+   DEFAULT_IO_TIMEOUT),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 2cb23ca891..e07924b3d7 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -114,6 +114,8 @@ static int execute_command(BlockBackend *blk,
SCSIGenericReq *r, int direction,
BlockCompletionFunc *complete)
 {
+SCSIDevice *s = r->req.dev;
+
 r->io_header.interface_id = 'S';
 r->io_header.dxfer_direction = direction;
 r->io_header.dxferp = r->buf;
@@ -122,7 +124,7 @@ static int execute_command(BlockBackend *blk,
 r->io_header.cmd_len = r->req.cmd.len;
 r->io_header.mx_sb_len = sizeof(r->req.sense);
 r->io_header.sbp = r->req.sense;
-r->io_header.timeout = MAX_UINT;
+r->io_header.timeout = s->io_timeout * 1000;
 r->io_header.usr_ptr = r;
 r->io_header.flags |= SG_FLAG_DIRECT_IO;
 
@@ -505,7 +507,7 @@ static int read_naa_id(const uint8_t *p, uint64_t *p_wwn)
 }
 
 int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, uint8_t cmd_size,
-uint8_t *buf, uint8_t buf_size)
+uint8_t *buf, uint8_t buf_size, uint32_t timeout)
 {
 sg_io_hdr_t io_header;
 uint8_t sensebuf[8];
@@ -520,7 +522,7 @@ int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, 
uint8_t cmd_size,
 io_header.cmd_len = cmd_size;
 io_header.mx_sb_len = sizeof(sensebuf);
 io_header.sbp = sensebuf;
-io_header.timeout = 6000; /* XXX */
+io_header.timeout = timeout * 1000;
 
ret = blk_ioctl(blk, SG_IO, &io_header);
 if (ret < 0 || io_header.driver_status || io_header.host_status) {
@@ -550,7 +552,7 @@ static void scsi_generic_set_vpd_bl_emulation(SCSIDevice *s)
 cmd[4] = sizeof(buf);
 
 ret = scsi_SG_IO_FROM_DEV(s->conf.blk, cmd, sizeof(cmd),
-  buf, sizeof(buf));
+  buf, sizeof(buf), s->io_timeout);
 if (ret < 0) {
 /*
  * Do not assume anything if we can't retrieve the
@@ -586,7 +588,7 @@ static void 
scsi_generic_read_device_identification(SCSIDevice *s)
 cmd[4] = sizeof(buf);
 
 ret = scsi_SG_IO_FROM_DEV(s->conf.blk, cmd, sizeof(cmd),
-  buf, sizeof(buf));
+  buf, sizeof(buf), s->io_timeout);
 if (ret < 0) {
 return;
 }
@@ -637,7 +639,7 @@ static int get_stream_blocksize(BlockBackend *blk)
 cmd[0] = MODE_SENSE;
 cmd[4] = sizeof(buf);
 
-ret = scsi_SG_IO_FROM_DEV(blk, cmd, sizeof(cmd), buf, sizeof(buf));
+ret = scsi_SG_IO_FROM_DEV(blk, cmd, sizeof(cmd), buf, sizeof(buf), 6);
 if (ret < 0) {
 return -1;
 }
@@ -727,6 +729,7 @@ static void scsi_generic_realize(SCSIDevice *s, Error 
**errp)
 
 /* Only used by scsi-block, but initialize it nevertheless to be clean.  */
 s->default_scsi_version = -1;
+s->io_timeout = DEFAULT_IO_TIMEOUT;
 scsi_generic_read_device_inquiry(s);
 }
 
@@ -750,6 +753,8 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, 
uint32_t tag, uint32_t lun,
 static Property scsi_generic_properties[] = {
 DEFINE_PROP_DRIVE("drive&quo

[PATCH 3/3] scsi: add tracing for SG_IO commands

2020-11-16 Thread Hannes Reinecke
Add tracepoints for SG_IO commands to allow for debugging
of SG_IO commands.

Signed-off-by: Hannes Reinecke 
---
 hw/scsi/scsi-disk.c| 3 ++-
 hw/scsi/scsi-generic.c | 8 +++-
 hw/scsi/trace-events   | 4 
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 2959526b52..dd23a38d6a 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2768,7 +2768,8 @@ static BlockAIOCB *scsi_block_do_sgio(SCSIBlockReq *req,
 io_header->timeout = s->qdev.io_timeout * 1000;
 io_header->usr_ptr = r;
 io_header->flags |= SG_FLAG_DIRECT_IO;
-
+trace_scsi_disk_aio_sgio_command(r->req.tag, req->cdb[0], lba,
+ nb_logical_blocks, io_header->timeout);
 aiocb = blk_aio_ioctl(s->qdev.conf.blk, SG_IO, io_header, cb, opaque);
 assert(aiocb != NULL);
 return aiocb;
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index e07924b3d7..8687336438 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -128,6 +128,8 @@ static int execute_command(BlockBackend *blk,
 r->io_header.usr_ptr = r;
 r->io_header.flags |= SG_FLAG_DIRECT_IO;
 
+trace_scsi_generic_aio_sgio_command(r->req.tag, r->req.cmd.buf[0],
+r->io_header.timeout);
r->req.aiocb = blk_aio_ioctl(blk, SG_IO, &r->io_header, complete, r);
 if (r->req.aiocb == NULL) {
 return -EIO;
@@ -524,8 +526,12 @@ int scsi_SG_IO_FROM_DEV(BlockBackend *blk, uint8_t *cmd, 
uint8_t cmd_size,
 io_header.sbp = sensebuf;
 io_header.timeout = timeout * 1000;
 
+trace_scsi_generic_ioctl_sgio_command(cmd[0], io_header.timeout);
ret = blk_ioctl(blk, SG_IO, &io_header);
-if (ret < 0 || io_header.driver_status || io_header.host_status) {
+if (ret < 0 || io_header.status ||
+io_header.driver_status || io_header.host_status) {
+trace_scsi_generic_ioctl_sgio_done(cmd[0], ret, io_header.status,
+   io_header.host_status);
 return -1;
 }
 return 0;
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index 0e0aa9847d..9788661bfd 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -331,6 +331,7 @@ scsi_disk_emulate_command_UNKNOWN(int cmd, const char 
*name) "Unknown SCSI comma
 scsi_disk_dma_command_READ(uint64_t lba, uint32_t len) "Read (sector %" PRId64 
", count %u)"
 scsi_disk_dma_command_WRITE(const char *cmd, uint64_t lba, int len) "Write 
%s(sector %" PRId64 ", count %u)"
 scsi_disk_new_request(uint32_t lun, uint32_t tag, const char *line) "Command: 
lun=%d tag=0x%x data=%s"
+scsi_disk_aio_sgio_command(uint32_t tag, uint8_t cmd, uint64_t lba, int len, 
uint32_t timeout) "disk aio sgio: tag=0x%x cmd=0x%x (sector %" PRId64 ", count 
%d) timeout=%u"
 
 # scsi-generic.c
 scsi_generic_command_complete_noio(void *req, uint32_t tag, int statuc) 
"Command complete %p tag=0x%x status=%d"
@@ -342,3 +343,6 @@ scsi_generic_write_data(uint32_t tag) "scsi_write_data 
tag=0x%x"
 scsi_generic_send_command(const char *line) "Command: data=%s"
 scsi_generic_realize_type(int type) "device type %d"
 scsi_generic_realize_blocksize(int blocksize) "block size %d"
+scsi_generic_aio_sgio_command(uint32_t tag, uint8_t cmd, uint32_t timeout) 
"generic aio sgio: tag=0x%x cmd=0x%x timeout=%u"
+scsi_generic_ioctl_sgio_command(uint8_t cmd, uint32_t timeout) "generic ioctl 
sgio: cmd=0x%x timeout=%u"
+scsi_generic_ioctl_sgio_done(uint8_t cmd, int ret, uint8_t status, uint8_t 
host_status) "generic ioctl sgio: cmd=0x%x ret=%d status=0x%x host_status=0x%x"
-- 
2.16.4




Re: [PATCH] scsi-disk: convert more errno values back to SCSI statuses

2020-11-12 Thread Hannes Reinecke

On 11/12/20 10:52 AM, Paolo Bonzini wrote:

Linux has some OS-specific (and sometimes weird) mappings for various SCSI
statuses and sense codes.  The most important is probably RESERVATION
CONFLICT.  Add them so that they can be reported back to the guest
kernel.

Cc: Hannes Reinecke 
Signed-off-by: Paolo Bonzini 
---
  hw/scsi/scsi-disk.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 424bc192b7..fa14d1527a 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -461,6 +461,25 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int 
error, bool acct_failed)
  }
  error = scsi_sense_buf_to_errno(r->req.sense, 
sizeof(r->req.sense));
  break;
+#ifdef CONFIG_LINUX
+/* These errno mapping are specific to Linux.  For more 
information:
+ * - scsi_decide_disposition in drivers/scsi/scsi_error.c
+ * - scsi_result_to_blk_status in drivers/scsi/scsi_lib.c
+ * - blk_errors[] in block/blk-core.c
+ */
+case EBADE:
+/* DID_NEXUS_FAILURE -> BLK_STS_NEXUS.  */
+scsi_req_complete(&r->req, RESERVATION_CONFLICT);
+break;
+case ENODATA:
+/* DID_MEDIUM_ERROR -> BLK_STS_MEDIUM.  */
+scsi_check_condition(r, SENSE_CODE(READ_ERROR));
+break;
+case EREMOTEIO:
+/* DID_TARGET_FAILURE -> BLK_STS_TARGET.  */
+scsi_req_complete(&r->req, HARDWARE_ERROR);
+break;
+#endif
  case ENOMEDIUM:
  scsi_check_condition(r, SENSE_CODE(NO_MEDIUM));
  break;

Well, ironically I'm currently debugging a customer escalation which 
touches exactly this area.
It revolves more around the SG_IO handling; qemu ignores the host_status 
setting completely, leading to data corruption if the host has to abort 
commands.
(Not that the host _does_ abort commands, as qemu SG_IO sets an infinite 
timeout. But that's another story).


And part of the patchset is to enable passing of the host_status code 
back to the drivers. In particular virtio_scsi has a 'response' code 
which matches pretty closely to the linux SCSI DID_XXX codes.
So my idea is to pass the host_status directly down to the drivers, allowing 
virtio-scsi to do a mapping between the DID_XXX and virtio response codes. 
That way we can get rid of the weird mapping in scsi/utils.c where we try to 
map the DID_XXX codes onto sense or SCSI status. And with that we could map 
the error codes above onto DID_XXX codes, too, allowing us to have a more 
streamlined mapping.
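
Something along these lines, using the VIRTIO_SCSI_S_* response codes from the
virtio-scsi header (the exact table here is illustrative only, not taken from
a posted patch):

    switch (req->host_status) {
    case SCSI_HOST_NO_LUN:
        return VIRTIO_SCSI_S_INCORRECT_LUN;
    case SCSI_HOST_BUSY:
        return VIRTIO_SCSI_S_BUSY;
    case SCSI_HOST_TIME_OUT:
    case SCSI_HOST_ABORTED:
        return VIRTIO_SCSI_S_ABORTED;
    case SCSI_HOST_BAD_RESPONSE:
        return VIRTIO_SCSI_S_BAD_TARGET;
    case SCSI_HOST_RESET:
        return VIRTIO_SCSI_S_RESET;
    case SCSI_HOST_TRANSPORT_DISRUPTED:
        return VIRTIO_SCSI_S_TRANSPORT_FAILURE;
    case SCSI_HOST_TARGET_FAILURE:
        return VIRTIO_SCSI_S_TARGET_FAILURE;
    case SCSI_HOST_RESERVATION_ERROR:
        return VIRTIO_SCSI_S_NEXUS_FAILURE;
    default:
        return VIRTIO_SCSI_S_FAILURE;
    }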


Thoughts?

Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 5/6] hw/scsi/megasas: Silent GCC9 duplicated-cond warning

2020-01-07 Thread Hannes Reinecke
On 1/7/20 3:56 PM, Paolo Bonzini wrote:
> On 18/12/19 05:03, Richard Henderson wrote:
>>>  if (!s->hba_serial) {
>>>  s->hba_serial = g_strdup(MEGASAS_HBA_SERIAL);
>>>  }
>>> -if (s->fw_sge >= MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE) {
>>> +if (MEGASAS_MAX_SGE > 128
>>> +&& s->fw_sge >= MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE) {
>>>  s->fw_sge = MEGASAS_MAX_SGE - MFI_PASS_FRAME_SIZE;
>>>  } else if (s->fw_sge >= 128 - MFI_PASS_FRAME_SIZE) {
>>>  s->fw_sge = 128 - MFI_PASS_FRAME_SIZE;
>>>  } else {
>>>  s->fw_sge = 64 - MFI_PASS_FRAME_SIZE;
>>>  }
>>
>> I'm not keen on this.  It looks to me like the raw 128 case should be removed
>> -- surely that's the point of the symbolic constant.  But I'll defer if a
>> maintainer disagrees.
> 
> I don't really understand this chain of ifs.  Hannes, does it make sense
> to just remove the "if (s->fw_sge >= 128 - MFI_PASS_FRAME_SIZE)" case,
> or does Phil's variation (quoted in the patch fragment above) makes sense?
> 
> Or perhaps this rewrite:
> 
> max_sge = s->fw_sge + MFI_PASS_FRAME_SIZE;
> if (max_sge < MEGASAS_MAX_SGE) {
> if (max_sge < 64) {
> error(...);
> } else {
> max_sge = max_sge < 128 ? 64 : 128;
> }
> }
>   s->fw_sge = max_sge - MFI_PASS_FRAME_SIZE;
> 
Yeah, do it.
The original code assumed that one could change MFI_PASS_FRAME_SIZE, but
it turned out not to be possible as it's being hardcoded in the drivers
themselves (even though the interface provides mechanisms to query it).
So we can remove the duplicate lines.
But then I prefer to stick with the define, and avoid having to check
the magic '128' value directly; rather use the define throughout the code.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer



Re: [Qemu-devel] [RFC 0/3] VirtIO RDMA

2019-04-19 Thread Hannes Reinecke

On 4/15/19 12:35 PM, Yuval Shaia wrote:

On Thu, Apr 11, 2019 at 07:02:15PM +0200, Cornelia Huck wrote:

On Thu, 11 Apr 2019 14:01:54 +0300
Yuval Shaia  wrote:


Data center backends use more and more RDMA or RoCE devices and more and
more software runs in virtualized environment.
There is a need for a standard to enable RDMA/RoCE on Virtual Machines.

Virtio is the optimal solution since is the de-facto para-virtualizaton
technology and also because the Virtio specification
allows Hardware Vendors to support Virtio protocol natively in order to
achieve bare metal performance.

This RFC is an effort to addresses challenges in defining the RDMA/RoCE
Virtio Specification and a look forward on possible implementation
techniques.

Open issues/Todo list:
List is huge, this is only start point of the project.
Anyway, here is one example of item in the list:
- Multi VirtQ: Every QP has two rings and every CQ has one. This means that
   in order to support for example 32K QPs we will need 64K VirtQ. Not sure
   that this is reasonable so one option is to have one for all and
   multiplex the traffic on it. This is not a good approach as by design it
   introduces potential starvation. Another approach would be multi
   queues and round-robin (for example) between them.

Typically there will be a one-to-one mapping between QPs and CPUs (on 
the guest). So while one would need to be prepared to support quite some 
QPs, the expectation is that the actual number of QPs used will be 
rather low.
In a similar vein, multiplexing QPs would be defeating the purpose, as 
the overall idea was to have _independent_ QPs to enhance parallelism.



Expectations from this posting:
In general, any comment is welcome, starting from hey, drop this as it is a
very bad idea, to yeah, go ahead, we really want it.
Idea here is that since it is not a minor effort i first want to know if
there is some sort interest in the community for such device.


My first reaction is: Sounds sensible, but it would be good to have a
spec for this :)

You'll need a spec if you want this to go forward anyway, so at least a
sketch would be good to answer questions such as how many virtqueues
you use for which purpose, what is actually put on the virtqueues,
whether there are negotiable features, and what the expectations for
the device and the driver are. It also makes it easier to understand
how this is supposed to work in practice.

If folks agree that this sounds useful, the next step would be to
reserve an id for the device type.


Thanks for the tips, will sure do that, it is that first i wanted to make
sure there is a use case here.

Waiting for any feedback from the community.

I really do like the idea; in fact, it saved me from coding a similar 
thing myself :-)


However, I'm still curious about the overall intent of this driver. 
Where would the I/O be routed _to_ ?

It's nice that we have a virtualized driver, but this driver is
intended to do I/O (even if it doesn't _do_ any I/O ATM :-)
And this I/O needs to be send to (and possibly received from)
something.

So what exactly is this something?
An existing piece of HW on the host?
If so, wouldn't it be more efficient to use vfio, either by using SR-IOV 
or by using virtio-mdev?


Another guest?
If so, how would we route the I/O from one guest to the other?
Shared memory? Implementing a full-blown RDMA switch in qemu?

Oh, and I would _love_ to have a discussion about this at KVM Forum.
Maybe I'll manage to whip up guest-to-guest RDMA connection using 
ivshmem ... let's see.


Cheers,

Hannes
--
Dr. Hannes Reinecke        Teamlead Storage & Networking
h...@suse.de  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] [PATCH] megasas: fix mapped frame size

2019-04-04 Thread Hannes Reinecke

On 4/4/19 2:10 PM, Peter Lieven wrote:

the current value of 1024 bytes (16 * MFI_FRAME_SIZE) we map is not enough to 
hold
the maximum number of scatter gather elements we advertise. We actually need a
maximum of 2048 bytes. This is 128 max sg elements * 16 bytes (sizeof (union 
mfi_sgl)).

Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Lieven 
---
  hw/scsi/megasas.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index a56317e026..5ad762de23 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -477,7 +477,7 @@ static MegasasCmd *megasas_enqueue_frame(MegasasState *s,
  {
  PCIDevice *pcid = PCI_DEVICE(s);
  MegasasCmd *cmd = NULL;
-int frame_size = MFI_FRAME_SIZE * 16;
+int frame_size = MEGASAS_MAX_SGE * sizeof(union mfi_sgl);
  hwaddr frame_size_p = frame_size;
  unsigned long index;
  


Indeed.
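
(Checking the numbers from the commit message: with MFI_FRAME_SIZE being 64
bytes, the old mapping covered 16 * 64 = 1024 bytes, while the advertised
worst case is 128 sg elements * 16 bytes = 2048 bytes.)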

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [Qemu-devel] megasas: Unexpected response from lun 1 while scanning, scan aborted

2019-03-28 Thread Hannes Reinecke

On 3/28/19 1:47 AM, Dongli Zhang wrote:



On 3/27/19 7:31 PM, Hannes Reinecke wrote:

On 3/26/19 5:47 PM, Dongli Zhang wrote:

I am reporting an error that the scsi lun cannot initialize successfully when I
am emulating megasas scsi controller with qemu.

I am not sure if this is issue in qemu or linux kernel.

When 'lun=1' is specified, there is "Unexpected response from lun 1 while
scanning, scan aborted".

Everything works well if 'lun=0' is specified.


Below is the qemu cmdline involved:

-device megasas,id=scsi0 \
-device scsi-hd,drive=drive0,bus=scsi0.0,lun=1 \
-drive file=/home/zhang/img/test.img,if=none,id=drive0,format=raw


Below is the syslog related to 'scsi|SCSI'

# dmesg | grep SCSI
[    0.392494] SCSI subsystem initialized
[    0.460666] Block layer SCSI generic (bsg) driver version 0.4 loaded (major
251)
[    0.706788] sd 1:0:0:0: [sda] Attached SCSI disk
# dmesg | grep scsi
[    0.511643] scsi host0: Avago SAS based MegaRAID driver
[    0.523302] scsi 0:2:0:0: Unexpected response from lun 1 while scanning,
scan aborted
[    0.540364] scsi host1: ata_piix
[    0.540780] scsi host2: ata_piix
[    0.702396] scsi 1:0:0:0: Direct-Access ATA  QEMU HARDDISK    2.5+
PQ: 0 ANSI: 5

When 'lun=1' is changed to 'lun=0', there is no issue.

Thank you very much!


That's an artifact from the megasas emulation in qemu.
Megasas (internally) can't handle LUN numbers (the RAID part only knows about
'disks'), so I took the decision to not expose devices with LUN != 0.
Please use a different SCSI target number, not a non-zero LUN number.


The guest kernel is able to detect the disk if lun is always 0, while target
number is changed:

-device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0
-device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=1,lun=0

# dmesg | grep scsi
[0.935999] scsi host0: ata_piix
[0.936401] scsi host1: ata_piix
[1.100945] scsi 0:0:0:0: Direct-Access ATA  QEMU HARDDISK2.5+
PQ: 0 ANSI: 5
[1.102409] sd 0:0:0:0: Attached scsi generic sg0 type 0
[1.672952] scsi host2: Avago SAS based MegaRAID driver
[1.683886] scsi 2:2:0:0: Direct-Access QEMU QEMU HARDDISK2.5+
PQ: 0 ANSI: 5
[1.684915] scsi 2:2:1:0: Direct-Access QEMU QEMU HARDDISK2.5+
PQ: 0 ANSI: 5
[1.701529] sd 2:2:0:0: Attached scsi generic sg1 type 0
[1.704795] sd 2:2:1:0: Attached scsi generic sg2 type 0
# dmesg | grep SCSI
[0.111015] SCSI subsystem initialized
[0.904712] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 
246)
[1.121174] sd 0:0:0:0: [sda] Attached SCSI disk
[1.703739] sd 2:2:0:0: [sdb] Attached SCSI disk
[1.706964] sd 2:2:1:0: [sdc] Attached SCSI disk



If device with LUN != 0 will not be exposed, why not set max_lun = 0 as what
qemu lsi is doing?

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index a56317e..c966ee0 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -2298,7 +2298,7 @@ static void megasas_scsi_uninit(PCIDevice *d)
  static const struct SCSIBusInfo megasas_scsi_info = {
  .tcq = true,
  .max_target = MFI_MAX_LD,
-.max_lun = 255,
+.max_lun = 0,

  .transfer_data = megasas_xfer_complete,
  .get_sg_list = megasas_get_sg_list,


Hmm. Good point.

In _theory_ one could just jbod mode, in which case _all_ LUNs are 
exposed. But then we could probably adjust it, based on which mode is 
selected.

I'll check.

Cheers,

Hannes
--
Dr. Hannes Reinecke        Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] megasas: Unexpected response from lun 1 while scanning, scan aborted

2019-03-27 Thread Hannes Reinecke

On 3/26/19 5:47 PM, Dongli Zhang wrote:

I am reporting an error that the scsi lun cannot initialize successfully when I
am emulating megasas scsi controller with qemu.

I am not sure if this is issue in qemu or linux kernel.

When 'lun=1' is specified, there is "Unexpected response from lun 1 while
scanning, scan aborted".

Everything works well if 'lun=0' is specified.


Below is the qemu cmdline involved:

-device megasas,id=scsi0 \
-device scsi-hd,drive=drive0,bus=scsi0.0,lun=1 \
-drive file=/home/zhang/img/test.img,if=none,id=drive0,format=raw


Below is the syslog related to 'scsi|SCSI'

# dmesg | grep SCSI
[0.392494] SCSI subsystem initialized
[0.460666] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 
251)
[0.706788] sd 1:0:0:0: [sda] Attached SCSI disk
# dmesg | grep scsi
[0.511643] scsi host0: Avago SAS based MegaRAID driver
[0.523302] scsi 0:2:0:0: Unexpected response from lun 1 while scanning, 
scan aborted
[0.540364] scsi host1: ata_piix
[0.540780] scsi host2: ata_piix
[0.702396] scsi 1:0:0:0: Direct-Access ATA  QEMU HARDDISK2.5+ 
PQ: 0 ANSI: 5

When 'lun=1' is changed to 'lun=0', there is no issue.

Thank you very much!


That's an artifact from the megasas emulation in qemu.
Megasas (internally) can't handle LUN numbers (the RAID part only knows 
about 'disks'), so I took the decision to not expose devices with LUN != 0.

Please use a different SCSI target number, not a non-zero LUN number.

Cheers,

Hannes
--
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] libiscsi task cancellation

2018-02-08 Thread Hannes Reinecke
On 02/08/2018 03:14 PM, Paolo Bonzini wrote:
> On 08/02/2018 15:08, Stefan Hajnoczi wrote:
>> Now on to libiscsi:
>>
>> The iscsi_task_mgmt_async() API documentation says:
>>
>>   * abort_task will also cancel the scsi task. The callback for the
>> scsi task will be invoked with
>>   *SCSI_STATUS_CANCELLED
>>
>> I see that the ABORT TASK TMF response invokes the user's
>> iscsi_task_mgmt_async() callback but not the command callback.  I'm
>> not sure how the command callback is invoked with
>> SCSI_STATUS_CANCELLED unless libiscsi is relying on the target to send
>> that response.
>>
>> Is libiscsi honoring its iscsi_task_mgmt_async() contract?
> 
> No, and QEMU is assuming the "wrong" behavior:
> 
> static void
> iscsi_abort_task_cb(struct iscsi_context *iscsi, int status, void 
> *command_data,
> void *private_data)
> {
> IscsiAIOCB *acb = private_data;
> 
> acb->status = -ECANCELED;
> iscsi_schedule_bh(acb);
> }
> 
The definition of ABORT TASK TMF in SAM is pretty much useless.
To quote:

A response of FUNCTION COMPLETE shall indicate that the task was aborted
or was not in the task set.

I.e. we have no idea whether we ever managed to abort the task; if the task
was still in flight by the time we sent the TMF we'll be getting a
FUNCTION COMPLETE, too.

So most FC HBA firmware implements the abort task with just a blacklist;
the TMF is returned immediately and the command response is
dropped if and when it arrives.
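To illustrate that pattern in isolation (a standalone sketch, not QEMU or
libiscsi code; every name below is made up):

#include <stdbool.h>
#include <stdio.h>

/* Minimal model of a tagged command */
struct task {
    int tag;
    bool aborted;   /* set once the TMF has been acknowledged */
};

/* ABORT TASK: per SAM, FUNCTION COMPLETE tells us nothing about whether
 * the command was actually caught in flight, so just blacklist the tag
 * and report success immediately. */
static void abort_task(struct task *t)
{
    t->aborted = true;
    printf("tag %d: ABORT TASK returned FUNCTION COMPLETE\n", t->tag);
}

/* Response path: a response for a blacklisted tag is silently dropped. */
static void task_response(struct task *t, int status)
{
    if (t->aborted) {
        printf("tag %d: late response (status %d) dropped\n", t->tag, status);
        return;
    }
    printf("tag %d: completed, status %d\n", t->tag, status);
}

int main(void)
{
    struct task t = { .tag = 1, .aborted = false };
    abort_task(&t);        /* TMF completes right away */
    task_response(&t, 0);  /* command response arrives later: ignored */
    return 0;
}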

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



[Qemu-devel] [PATCH 1/2] block: implement shared block device count

2017-12-14 Thread Hannes Reinecke
By setting 'locking=off' when creating a 'file' format drive one
can simulate a multipath setup. This patch adds infrastructure
for tracking shared block devices.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 block.c   | 36 
 hw/scsi/scsi-disk.c   |  6 ++
 include/block/block.h |  2 ++
 include/block/block_int.h |  2 ++
 4 files changed, 46 insertions(+)

diff --git a/block.c b/block.c
index 9a1a0d1e73..cf8b94252c 100644
--- a/block.c
+++ b/block.c
@@ -3801,6 +3801,42 @@ BlockDriverState *bdrv_find_node(const char *node_name)
 return NULL;
 }
 
+void bdrv_find_shared(BlockDriverState *bs)
+{
+BlockDriverState *tmp_bs;
+int max_shared_no = 0;
+unsigned long shared_bits = 0;
+
+if (!bs->filename)
+return;
+QTAILQ_FOREACH(tmp_bs, &graph_bdrv_states, node_list) {
+if (tmp_bs == bs)
+continue;
+if (!strcmp(bs->filename, tmp_bs->filename)) {
+if (tmp_bs->shared_no > 0) {
+set_bit(tmp_bs->shared_no - 1, &shared_bits);
+}
+max_shared_no++;
+}
+}
+if (max_shared_no > 0) {
+int bit;
+
+QTAILQ_FOREACH(tmp_bs, &graph_bdrv_states, node_list) {
+if (!strcmp(bs->filename, tmp_bs->filename) && !tmp_bs->shared_no) {
+bit = find_first_zero_bit(&shared_bits, sizeof(unsigned long));
+tmp_bs->shared_no = bit + 1;
+set_bit(bit, &shared_bits);
+}
+}
+}
+}
+
+int bdrv_get_shared(BlockDriverState *bs)
+{
+return bs->shared_no;
+}
+
 /* Put this QMP function here so it can access the static graph_bdrv_states. */
 BlockDeviceInfoList *bdrv_named_nodes_list(Error **errp)
 {
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 12431177a7..32c1d656b1 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2333,6 +2333,7 @@ static void scsi_realize(SCSIDevice *dev, Error **errp)
 {
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, dev);
 Error *err = NULL;
+int port_no;
 
 if (!s->qdev.conf.blk) {
 error_setg(errp, "drive property not set");
@@ -2396,6 +2397,11 @@ static void scsi_realize(SCSIDevice *dev, Error **errp)
 blk_set_guest_block_size(s->qdev.conf.blk, s->qdev.blocksize);
 
 blk_iostatus_enable(s->qdev.conf.blk);
+bdrv_find_shared(blk_bs(s->qdev.conf.blk));
+port_no = bdrv_get_shared(blk_bs(s->qdev.conf.blk));
+if (port_no && !s->port_index) {
+s->port_index = port_no;
+}
 }
 
 static void scsi_hd_realize(SCSIDevice *dev, Error **errp)
diff --git a/include/block/block.h b/include/block/block.h
index c05cac57e5..5c03c1acfa 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -443,6 +443,8 @@ void bdrv_lock_medium(BlockDriverState *bs, bool locked);
 void bdrv_eject(BlockDriverState *bs, bool eject_flag);
 const char *bdrv_get_format_name(BlockDriverState *bs);
 BlockDriverState *bdrv_find_node(const char *node_name);
+void bdrv_find_shared(BlockDriverState *bs);
+int bdrv_get_shared(BlockDriverState *bs);
 BlockDeviceInfoList *bdrv_named_nodes_list(Error **errp);
 BlockDriverState *bdrv_lookup_bs(const char *device,
  const char *node_name,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index a5482775ec..79c9c3a3aa 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -634,6 +634,8 @@ struct BlockDriverState {
 /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
  * BDRV_REQ_MAY_UNMAP) */
 unsigned int supported_zero_flags;
+/* Shared instance count */
+unsigned int shared_no;
 
 /* the following member gives a name to every node on the bs graph. */
 char node_name[32];
-- 
2.12.3




[Qemu-devel] [PATCH 2/2] scsi: Implement multipath support

2017-12-14 Thread Hannes Reinecke
Implement simple multipath support based on the shared block device
feature. Whenever a shared device is detected the scsi-disk driver
will report a simple ALUA setup with all paths in active/optimized.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 block.c  | 15 +++
 hw/scsi/scsi-disk.c  | 66 
 include/block/block.h|  1 +
 include/scsi/constants.h |  5 
 4 files changed, 87 insertions(+)

diff --git a/block.c b/block.c
index cf8b94252c..e0643989e5 100644
--- a/block.c
+++ b/block.c
@@ -3837,6 +3837,21 @@ int bdrv_get_shared(BlockDriverState *bs)
 return bs->shared_no;
 }
 
+void bdrv_shared_mask(BlockDriverState *bs, unsigned long *shared_mask)
+{
+BlockDriverState *tmp_bs;
+
+if (!bs->filename || !shared_mask)
+return;
+QTAILQ_FOREACH(tmp_bs, &graph_bdrv_states, node_list) {
+if (!strcmp(bs->filename, tmp_bs->filename)) {
+if (tmp_bs->shared_no > 0) {
+set_bit(tmp_bs->shared_no - 1, shared_mask);
+}
+}
+}
+}
+
 /* Put this QMP function here so it can access the static graph_bdrv_states. */
 BlockDeviceInfoList *bdrv_named_nodes_list(Error **errp)
 {
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 32c1d656b1..b73fcafc29 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -676,6 +676,13 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 buflen += 8;
 }
 
+if (bdrv_get_shared(blk_bs(s->qdev.conf.blk))) {
+outbuf[buflen++] = 0x61; // SAS / Binary
+outbuf[buflen++] = 0x95; // PIV / Target port / target port 
group
+outbuf[buflen++] = 0; // reserved
+outbuf[buflen++] = 4;
+buflen += 4;
+}
 if (s->port_index) {
 outbuf[buflen++] = 0x61; // SAS / Binary
 outbuf[buflen++] = 0x94; // PIV / Target port / relative 
target port
@@ -819,6 +826,11 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 outbuf[4] = 36 - 5;
 }
 
+/* Enable TGPS bit */
+if (bdrv_get_shared(blk_bs(s->qdev.conf.blk))) {
+outbuf[5] = 0x10;
+}
+
 /* Sync data transfer and TCQ.  */
 outbuf[7] = 0x10 | (req->bus->info->tcq ? 0x02 : 0);
 return buflen;
@@ -1869,6 +1881,47 @@ static void scsi_disk_emulate_write_data(SCSIRequest 
*req)
 }
 }
 
+static int scsi_emulate_report_target_port_groups(SCSIDiskState *s,
+  uint8_t *inbuf)
+{
+uint8_t *p, *pg;
+int buflen = 0, i, count = 0;
+unsigned long shared_mask = 0;
+
+if (!bdrv_get_shared(blk_bs(s->qdev.conf.blk))) {
+return -1;
+}
+
+bdrv_shared_mask(blk_bs(s->qdev.conf.blk), &shared_mask);
+if (!shared_mask) {
+return -1;
+}
+
+pg = &inbuf[4];
+pg[0] = 0; /* Active/Optimized */
+pg[1] = 0x1; /* Only Active/Optimized is supported */
+
+p = &inbuf[8];
+buflen += 8;
+for (i = 0; i < 32; i++) {
+if (!test_bit(i, &shared_mask))
+continue;
+p[2] = (i + 1) >> 8;
+p[3] = (i + 1) & 0xFF;
+p += 4;
+buflen += 4;
+count++;
+}
+pg[7] = count;
+
+inbuf[0] = (buflen >> 24) & 0xff;
+inbuf[1] = (buflen >> 16) & 0xff;
+inbuf[2] = (buflen >> 8) & 0xff;
+inbuf[3] = buflen & 0xff;
+
+return buflen + 4;
+}
+
 static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
 {
 SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
@@ -2010,6 +2063,19 @@ static int32_t scsi_disk_emulate_command(SCSIRequest 
*req, uint8_t *buf)
 goto illegal_request;
 }
 break;
+case MAINTENANCE_IN:
+if ((req->cmd.buf[1] & 31) == MI_REPORT_TARGET_PORT_GROUPS) {
+DPRINTF("MI REPORT TARGET PORT GROUPS\n");
+memset(outbuf, 0, req->cmd.xfer);
+buflen = scsi_emulate_report_target_port_groups(s, outbuf);
+if (buflen < 0) {
+goto illegal_request;
+}
+break;
+}
+DPRINTF("Unsupported Maintenance In\n");
+goto illegal_request;
+break;
 case MECHANISM_STATUS:
 buflen = scsi_emulate_mechanism_status(s, outbuf);
 if (buflen < 0) {
diff --git a/include/block/block.h b/include/block/block.h
index 5c03c1acfa..b31e033702 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -445,6 +445,7 @@ const char *bdrv_get_format_name(BlockDriverState *bs);
 BlockDriverState *bdrv_find_node(const char *node_name);
 void bdrv_find_shared(BlockDriverState *bs);
 int bdrv_get_shared(BlockDriverState *bs);
+void bdrv_shared_mask(BlockDriverState *bs, unsigned long *shared_mask);
 BlockDeviceInfoList *b

[Qemu-devel] [PATCH 0/2] scsi: Simple ALUA support

2017-12-14 Thread Hannes Reinecke
Hi all,

qemu allows for a simple multipath emulation by just using the
same backing file for two distinct block devices.
This patchset adds a very basic ALUA support to properly report this
scenario to the guest.
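For reference, a command line that sets up such a 'multipath' pair could look
roughly like this (illustrative only; the locking switch is assumed to be the
file driver's 'locking' option, spelled file.locking on -drive):

-drive file=/var/tmp/shared.img,format=raw,if=none,id=path0,file.locking=off
-drive file=/var/tmp/shared.img,format=raw,if=none,id=path1,file.locking=off
-device virtio-scsi-pci,id=scsi0
-device scsi-hd,drive=path0,bus=scsi0.0,channel=0,scsi-id=0,lun=0
-device scsi-hd,drive=path1,bus=scsi0.0,channel=0,scsi-id=1,lun=0

Inside the guest the reported target port groups can then be inspected with
sg3_utils (e.g. sg_rtpg -vv on one of the resulting disks).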

As usual, comments and reviews are welcome.

Hannes Reinecke (2):
  block: implement shared block device count
  scsi: Implement multipath support

 block.c   | 51 +
 hw/scsi/scsi-disk.c   | 72 +++
 include/block/block.h |  3 ++
 include/block/block_int.h |  2 ++
 include/scsi/constants.h  |  5 
 5 files changed, 133 insertions(+)

-- 
2.12.3




[Qemu-devel] [PATCH 1/4] scsi: use 64-bit LUN

2017-12-14 Thread Hannes Reinecke
The LUN value really is a 64-bit number, so we should as well treat
it as such. And we should be using accessor functions to provide
backwards compatibility.
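To make the intent concrete, the accessors could look roughly like the sketch
below. The real definitions live in include/hw/scsi/scsi.h in this patch and
only scsi_lun_from_str is (partially) visible further down, so the bodies here
are assumptions rather than the actual patch code:

#include <stdint.h>

/* Assumed shape of the helpers used throughout the diff (illustrative only) */
static inline uint64_t scsi_lun_from_int(int lun)
{
    /* flat LUN carried in the top 16 bits, mirroring what the existing
     * virtio-scsi code does */
    return ((uint64_t)lun & 0x3FFF) << 48;
}

static inline int scsi_lun_to_int(uint64_t lun64)
{
    return (lun64 >> 48) & 0x3FFF;
}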

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/esp.c  |   6 ++-
 hw/scsi/lsi53c895a.c   |   7 +--
 hw/scsi/megasas.c  |  24 +
 hw/scsi/mptsas.c   |  10 ++--
 hw/scsi/scsi-bus.c | 137 -
 hw/scsi/scsi-disk.c|   6 +--
 hw/scsi/scsi-generic.c |   2 +-
 hw/scsi/spapr_vscsi.c  |  17 +++---
 hw/scsi/virtio-scsi.c  |  10 ++--
 hw/scsi/vmw_pvscsi.c   |  22 
 hw/usb/dev-storage.c   |  11 ++--
 hw/usb/dev-uas.c   |  27 +++---
 include/hw/scsi/scsi.h |  56 +---
 13 files changed, 207 insertions(+), 128 deletions(-)

diff --git a/hw/scsi/esp.c b/hw/scsi/esp.c
index ee586e7d6c..12b76bc5c4 100644
--- a/hw/scsi/esp.c
+++ b/hw/scsi/esp.c
@@ -136,8 +136,10 @@ static void do_busid_cmd(ESPState *s, uint8_t *buf, 
uint8_t busid)
 
 trace_esp_do_busid_cmd(busid);
 lun = busid & 7;
-current_lun = scsi_device_find(&s->bus, 0, s->current_dev->id, lun);
-s->current_req = scsi_req_new(current_lun, 0, lun, buf, s);
+current_lun = scsi_device_find(&s->bus, 0, s->current_dev->id,
+   scsi_lun_from_int(lun));
+s->current_req = scsi_req_new(current_lun, 0, scsi_lun_from_int(lun),
+  buf, s);
 datalen = scsi_req_enqueue(s->current_req);
 s->ti_size = datalen;
 if (datalen != 0) {
diff --git a/hw/scsi/lsi53c895a.c b/hw/scsi/lsi53c895a.c
index 191505df5b..907ba880bf 100644
--- a/hw/scsi/lsi53c895a.c
+++ b/hw/scsi/lsi53c895a.c
@@ -811,7 +811,7 @@ static void lsi_do_command(LSIState *s)
 s->command_complete = 0;
 
 id = (s->select_tag >> 8) & 0xf;
-dev = scsi_device_find(&s->bus, 0, id, s->current_lun);
+dev = scsi_device_find(&s->bus, 0, id, scsi_lun_from_int(s->current_lun));
 if (!dev) {
 lsi_bad_selection(s, id);
 return;
@@ -820,8 +820,9 @@ static void lsi_do_command(LSIState *s)
 assert(s->current == NULL);
 s->current = g_new0(lsi_request, 1);
 s->current->tag = s->select_tag;
-s->current->req = scsi_req_new(dev, s->current->tag, s->current_lun, buf,
-   s->current);
+s->current->req = scsi_req_new(dev, s->current->tag,
+   scsi_lun_from_int(s->current_lun),
+   buf, s->current);
 
 n = scsi_req_enqueue(s->current->req);
 if (n) {
diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index d5eae6239a..2b9fb71b12 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -756,7 +756,7 @@ static int megasas_ctrl_get_info(MegasasState *s, 
MegasasCmd *cmd)
 uint16_t pd_id;
 
 if (num_pd_disks < 8) {
-pd_id = ((sdev->id & 0xFF) << 8) | (sdev->lun & 0xFF);
+pd_id = ((sdev->id & 0xFF) << 8) | scsi_lun_to_int(sdev->lun);
 info.device.port_addr[num_pd_disks] =
 cpu_to_le64(megasas_get_sata_addr(pd_id));
 }
@@ -975,7 +975,7 @@ static int megasas_dcmd_pd_get_list(MegasasState *s, 
MegasasCmd *cmd)
 if (num_pd_disks >= max_pd_disks)
 break;
 
-pd_id = ((sdev->id & 0xFF) << 8) | (sdev->lun & 0xFF);
+pd_id = ((sdev->id & 0xFF) << 8) | scsi_lun_to_int(sdev->lun);
 info.addr[num_pd_disks].device_id = cpu_to_le16(pd_id);
info.addr[num_pd_disks].encl_device_id = 0xFFFF;
 info.addr[num_pd_disks].encl_index = 0;
@@ -1028,7 +1028,8 @@ static int megasas_pd_get_info_submit(SCSIDevice *sdev, 
int lun,
 info->inquiry_data[0] = 0x7f; /* Force PQual 0x3, PType 0x1f */
 info->vpd_page83[0] = 0x7f;
 megasas_setup_inquiry(cmdbuf, 0, sizeof(info->inquiry_data));
-cmd->req = scsi_req_new(sdev, cmd->index, lun, cmdbuf, cmd);
+cmd->req = scsi_req_new(sdev, cmd->index, scsi_lun_from_int(lun),
+cmdbuf, cmd);
 if (!cmd->req) {
 trace_megasas_dcmd_req_alloc_failed(cmd->index,
 "PD get info std inquiry");
@@ -1110,7 +1111,7 @@ static int megasas_dcmd_pd_get_info(MegasasState *s, 
MegasasCmd *cmd)
 pd_id = le16_to_cpu(cmd->frame->dcmd.mbox[0]);
 target_id = (pd_id >> 8) & 0xFF;
 lun_id = pd_id & 0xFF;
-sdev = scsi_device_find(&s->bus, 0, target_id, lun_id);
+sdev = scsi_device_find(&s->bus, 0, target_id, scsi_lun_from_int(lun_id));
 trace_megasas_dcmd_pd_get_info(cmd->index, pd_id);
 
 if (sdev) {
@@ -1200,7 +1201,7 @@ static int megasas_dcmd_ld_list_query(MegasasState *s, 
MegasasCmd *cmd)
  

[Qemu-devel] [PATCH 0/4] virtio-vfc implementation

2017-12-14 Thread Hannes Reinecke
Hi all,

here's my attempt to implement a 'Virtual FC' emulation for virtio-scsi,
based on the presentation at the KVM Forum in Prague.

This doesn't so much implement the FC protocol per se, but rather enables
virtio to pass in addition port information so that the guest can setup
the FC infrastructure in sysfs.

This patchset has a complementary patchset for the linux virtio-scsi driver
which will be posted to the linux-scsi mailing list.

As usual, comments and reviews are welcome.

Hannes Reinecke (4):
  scsi: use 64-bit LUN
  virtio-scsi: implement target rescan
  virtio-scsi: Implement 'native LUN' feature
  scsi: support REPORT_LUNS for LUNs != 0

 hw/scsi/esp.c|   6 +-
 hw/scsi/lsi53c895a.c |   7 +-
 hw/scsi/megasas.c|  24 +++--
 hw/scsi/mptsas.c |  10 +-
 hw/scsi/scsi-bus.c   | 149 ++-
 hw/scsi/scsi-disk.c  |  22 +++-
 hw/scsi/scsi-generic.c   |   2 +-
 hw/scsi/spapr_vscsi.c|  17 +--
 hw/scsi/virtio-scsi.c|  90 ++--
 hw/scsi/vmw_pvscsi.c |  22 ++--
 hw/usb/dev-storage.c |  11 +-
 hw/usb/dev-uas.c |  27 ++---
 include/hw/scsi/scsi.h   |  62 +--
 include/hw/virtio/virtio-scsi.h  |   7 +-
 include/scsi/constants.h |  11 ++
 include/standard-headers/linux/virtio_scsi.h |  16 +++
 16 files changed, 343 insertions(+), 140 deletions(-)

-- 
2.12.3




[Qemu-devel] [PATCH 3/4] virtio-scsi: Implement 'native LUN' feature

2017-12-14 Thread Hannes Reinecke
The 'native LUN' feature allows virtio-scsi to use the LUN
numbers from the underlying storage directly, without having
to modify the LUN number when sending over the wire.
It works by shifting the existing LUN number down by 8 bytes,
and adding the virtio-specific 8-byte LUN steering header.
With that virtio doesn't have to mangle the LUN number, allowing
us to pass the 'real' LUN number to the guest.
Of course, we do cut off the last 8 bytes of the 'real' LUN number,
but I'm not aware of any array utilizing that, so the impact should
be negligible.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/virtio-scsi.c| 29 
 include/hw/scsi/scsi.h   |  4 ++--
 include/standard-headers/linux/virtio_scsi.h |  1 +
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index fa2031f636..ec91c8c403 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -26,20 +26,30 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-static inline uint64_t virtio_scsi_get_lun(uint8_t *lun)
+static inline uint64_t virtio_scsi_get_lun(VirtIODevice *vdev, uint8_t *lun)
 {
-return (((uint64_t)(lun[2] << 8) | lun[3]) & 0x3FFF) << 48;
+uint64_t lun64;
+
+if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_NATIVE_LUN)) {
+lun64 = scsi_lun_from_str(lun) << 16;
+} else {
+lun64 = (((uint64_t)(lun[2] << 8) | lun[3]) & 0x3FFF) << 48;
+}
+return lun64;
 }
 
 static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
 {
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
+
 if (lun[0] != 1) {
 return NULL;
 }
-if (lun[2] != 0 && !(lun[2] >= 0x40 && lun[2] < 0x80)) {
+if (!virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_NATIVE_LUN) &&
+lun[2] != 0 && !(lun[2] >= 0x40 && lun[2] < 0x80)) {
 return NULL;
 }
-return scsi_device_find(&s->bus, 0, lun[1], virtio_scsi_get_lun(lun));
+return scsi_device_find(&s->bus, 0, lun[1], virtio_scsi_get_lun(vdev, 
lun));
 }
 
 void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
@@ -270,7 +280,7 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq 
*req)
 if (!d) {
 goto fail;
 }
-if (d->lun != virtio_scsi_get_lun(req->req.tmf.lun)) {
+if (d->lun != virtio_scsi_get_lun(VIRTIO_DEVICE(s), req->req.tmf.lun)) 
{
 goto incorrect_lun;
 }
 QTAILQ_FOREACH_SAFE(r, &d->requests, next, next) {
@@ -307,7 +317,7 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq 
*req)
 if (!d) {
 goto fail;
 }
-if (d->lun != virtio_scsi_get_lun(req->req.tmf.lun)) {
+if (d->lun != virtio_scsi_get_lun(VIRTIO_DEVICE(s), req->req.tmf.lun)) 
{
 goto incorrect_lun;
 }
 s->resetting++;
@@ -321,7 +331,7 @@ static int virtio_scsi_do_tmf(VirtIOSCSI *s, VirtIOSCSIReq 
*req)
 if (!d) {
 goto fail;
 }
-if (d->lun != virtio_scsi_get_lun(req->req.tmf.lun)) {
+if (d->lun != virtio_scsi_get_lun(VIRTIO_DEVICE(s), req->req.tmf.lun)) 
{
 goto incorrect_lun;
 }
 
@@ -609,7 +619,8 @@ static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI 
*s, VirtIOSCSIReq *req)
 }
 virtio_scsi_ctx_check(s, d);
 req->sreq = scsi_req_new(d, req->req.cmd.tag,
- virtio_scsi_get_lun(req->req.cmd.lun),
+ virtio_scsi_get_lun(VIRTIO_DEVICE(s),
+ req->req.cmd.lun),
  req->req.cmd.cdb, req);
 
 if (req->sreq->cmd.mode != SCSI_XFER_NONE
@@ -980,6 +991,8 @@ static Property virtio_scsi_properties[] = {
 VIRTIO_SCSI_F_CHANGE, true),
 DEFINE_PROP_BIT("rescan", VirtIOSCSI, host_features,
   VIRTIO_SCSI_F_RESCAN, true),
+DEFINE_PROP_BIT("native_lun", VirtIOSCSI, host_features,
+  VIRTIO_SCSI_F_NATIVE_LUN, true),
 DEFINE_PROP_LINK("iothread", VirtIOSCSI, parent_obj.conf.iothread,
  TYPE_IOTHREAD, IOThread *),
 DEFINE_PROP_STRING("wwpn", VirtIOSCSI, parent_obj.conf.wwpn),
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index cb2b2bddcd..a18e430c08 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -175,8 +175,8 @@ static inline uint64_t scsi_lun_from_str(uint8_t *lun)
 uint64_t lun64 = 0;
 
 for (i = 0; i < 8; i += 2) {
-lun64 |= (uint64_t)lun[i] << ((i + 1) * 8) |
-(uint64_t)lun[i + 1] <<

[Qemu-devel] [PATCH 4/4] scsi: support REPORT_LUNS for LUNs != 0

2017-12-14 Thread Hannes Reinecke
SPC doesn't restrict the use of REPORT LUNS to LUN 0, so we
shouldn't be doing so, either.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 83497ac916..67ac472c14 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -548,7 +548,7 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 int fixed_sense = (req->cmd.buf[1] & 1) == 0;
 
 if (req->lun != 0 &&
-buf[0] != INQUIRY && buf[0] != REQUEST_SENSE) {
+buf[0] != INQUIRY && buf[0] != REQUEST_SENSE && buf[0] != REPORT_LUNS) 
{
 scsi_req_build_sense(req, SENSE_CODE(LUN_NOT_SUPPORTED));
 scsi_req_complete(req, CHECK_CONDITION);
 return 0;
-- 
2.12.3




[Qemu-devel] [PATCH 2/4] virtio-scsi: implement target rescan

2017-12-14 Thread Hannes Reinecke
Implement a new virtio-scsi command 'rescan' to return a list of
attached targets. The guest is required to set the 'next_id' field
to the next expected target id; the host will return either that
or the next higher target id (if present), or -1 if no additional
targets are found.
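The guest-visible layout behind VirtIOSCSIRescanReq/VirtIOSCSIRescanResp is
added in the virtio_scsi.h header change listed in the diffstat below and is
not reproduced here, so the field names and types in this sketch are
assumptions based on the description above, not the actual definitions:

#include <stdint.h>

/* Assumed wire format of the new control-queue request (illustrative only) */
struct virtio_scsi_rescan_req {
    uint32_t type;      /* VIRTIO_SCSI_T_RESCAN */
    uint32_t next_id;   /* lowest target id the guest is still interested in */
};

struct virtio_scsi_rescan_resp {
    int32_t  id;        /* that id, the next higher one present, or -1 */
    uint8_t  response;  /* VIRTIO_SCSI_S_OK */
};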

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c   | 10 +
 hw/scsi/scsi-disk.c  | 16 +++-
 hw/scsi/virtio-scsi.c| 55 
 include/hw/scsi/scsi.h   |  6 ++-
 include/hw/virtio/virtio-scsi.h  |  7 +++-
 include/scsi/constants.h | 11 ++
 include/standard-headers/linux/virtio_scsi.h | 15 
 7 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index a0e66d0e01..83497ac916 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -21,6 +21,7 @@ static Property scsi_props[] = {
 DEFINE_PROP_UINT32("channel", SCSIDevice, channel, 0),
 DEFINE_PROP_UINT32("scsi-id", SCSIDevice, id, -1),
 DEFINE_PROP_UINT64("lun", SCSIDevice, lun, -1),
+DEFINE_PROP_UINT8("protocol", SCSIDevice, protocol, SCSI_PROTOCOL_SAS),
 DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -189,6 +190,15 @@ static void scsi_qdev_realize(DeviceState *qdev, Error 
**errp)
 }
 }
 
+if (dev->protocol != SCSI_PROTOCOL_FCP &&
+dev->protocol != SCSI_PROTOCOL_SPI &&
+dev->protocol != SCSI_PROTOCOL_SRP &&
+dev->protocol != SCSI_PROTOCOL_ISCSI &&
+dev->protocol != SCSI_PROTOCOL_SAS &&
+dev->protocol != SCSI_PROTOCOL_UAS) {
+error_setg(errp, "invalid scsi protocol id: %d", dev->protocol);
+return;
+}
 if (dev->id == -1) {
 int id = -1;
 if (dev->lun == -1) {
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index cbee840601..1313aafae3 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -668,7 +668,7 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 }
 
 if (s->qdev.port_wwn) {
-outbuf[buflen++] = 0x61; // SAS / Binary
+outbuf[buflen++] = s->qdev.protocol << 8 | 0x1; // Binary
 outbuf[buflen++] = 0x93; // PIV / Target port / NAA
 outbuf[buflen++] = 0;// reserved
 outbuf[buflen++] = 8;
@@ -677,7 +677,7 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
 }
 
 if (s->port_index) {
-outbuf[buflen++] = 0x61; // SAS / Binary
+outbuf[buflen++] = s->qdev.protocol << 8 | 0x1; // Binary
 outbuf[buflen++] = 0x94; // PIV / Target port / relative 
target port
 outbuf[buflen++] = 0;// reserved
 outbuf[buflen++] = 4;
@@ -2355,6 +2355,18 @@ static void scsi_realize(SCSIDevice *dev, Error **errp)
 return;
 }
 
+if (dev->protocol == SCSI_PROTOCOL_FCP) {
+if (!s->qdev.port_wwn) {
+error_setg(errp,
+   "Missing port_wwn for FCP protocol");
+return;
+}
+if (!s->qdev.node_wwn && (s->qdev.port_wwn >> 60) != 0x02) {
+error_setg(errp,
+   "port_wwn is not a IEEE Extended identifier");
+return;
+}
+}
 if (dev->type == TYPE_DISK) {
 blkconf_geometry(&dev->conf, NULL, 65535, 255, 255, &err);
 if (err) {
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index f98bfb3db5..fa2031f636 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -19,6 +19,7 @@
 #include "hw/virtio/virtio-scsi.h"
 #include "qemu/error-report.h"
 #include "qemu/iov.h"
+#include "qemu/cutils.h"
 #include "sysemu/block-backend.h"
 #include "hw/scsi/scsi.h"
 #include "scsi/constants.h"
@@ -386,6 +387,7 @@ fail:
 static void virtio_scsi_handle_ctrl_req(VirtIOSCSI *s, VirtIOSCSIReq *req)
 {
 VirtIODevice *vdev = (VirtIODevice *)s;
+VirtIOSCSICommon *c = VIRTIO_SCSI_COMMON(vdev);
 uint32_t type;
 int r = 0;
 
@@ -415,6 +417,55 @@ static void virtio_scsi_handle_ctrl_req(VirtIOSCSI *s, 
VirtIOSCSIReq *req)
 req->resp.an.event_actual = 0;
 req->resp.an.response = VIRTIO_SCSI_S_OK;
 }
+} else if (type == VIRTIO_SCSI_T_RESCAN) {
+if (virtio_scsi_parse_req(req, sizeof(VirtIOSCSIRescanReq),
+  sizeof(VirtIOSCSIRescanResp)) < 0) {
+virtio_scsi_bad_req(req);
+return;
+} else {
+BusChild *kid;
+SCSIDevice *dev = NULL;
+
+   

[Qemu-devel] [PATCH] scsi-bus: correct responses for INQUIRY and REQUEST SENSE

2017-08-18 Thread Hannes Reinecke
According to SPC-3 INQUIRY and REQUEST SENSE should return GOOD
even on unsupported LUNS.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index e364410a23..ade31c11f5 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -516,8 +516,10 @@ static size_t scsi_sense_len(SCSIRequest *req)
 static int32_t scsi_target_send_command(SCSIRequest *req, uint8_t *buf)
 {
 SCSITargetReq *r = DO_UPCAST(SCSITargetReq, req, req);
+int fixed_sense = (req->cmd.buf[1] & 1) == 0;
 
-if (req->lun != 0) {
+if (req->lun != 0 &&
+buf[0] != INQUIRY && buf[0] != REQUEST_SENSE) {
 scsi_req_build_sense(req, SENSE_CODE(LUN_NOT_SUPPORTED));
 scsi_req_complete(req, CHECK_CONDITION);
 return 0;
@@ -535,9 +537,28 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 break;
 case REQUEST_SENSE:
 scsi_target_alloc_buf(&r->req, scsi_sense_len(req));
-r->len = scsi_device_get_sense(r->req.dev, r->buf,
-   MIN(req->cmd.xfer, r->buf_len),
-   (req->cmd.buf[1] & 1) == 0);
+if (req->lun != 0) {
+const struct SCSISense sense = SENSE_CODE(LUN_NOT_SUPPORTED);
+
+if (fixed_sense) {
+r->buf[0] = 0x70;
+r->buf[2] = sense.key;
+r->buf[10] = 10;
+r->buf[12] = sense.asc;
+r->buf[13] = sense.ascq;
+r->len = MIN(req->cmd.xfer, SCSI_SENSE_LEN);
+} else {
+r->buf[0] = 0x72;
+r->buf[1] = sense.key;
+r->buf[2] = sense.asc;
+r->buf[3] = sense.ascq;
+r->len = 8;
+}
+} else {
+r->len = scsi_device_get_sense(r->req.dev, r->buf,
+   MIN(req->cmd.xfer, r->buf_len),
+   fixed_sense);
+}
 if (r->req.dev->sense_is_ua) {
 scsi_device_unit_attention_reported(req->dev);
 r->req.dev->sense_len = 0;
-- 
2.12.0




Re: [Qemu-devel] [PATCHv2 3/4] scsi: clarify sense codes for LUN0 emulation

2017-08-17 Thread Hannes Reinecke
On 08/18/2017 02:57 AM, Laszlo Ersek wrote:
> On 08/18/17 02:16, Laszlo Ersek wrote:
>> On 08/17/17 22:57, Laszlo Ersek wrote:
>>> On 08/04/17 12:49, Paolo Bonzini wrote:
>>>> On 04/08/2017 10:36, Hannes Reinecke wrote:
>>>>> The LUN0 emulation is just that, an emulation for a non-existing
>>>>> LUN0. So we should be returning LUN_NOT_SUPPORTED for any request
>>>>> coming from any other LUN.
>>>>> And we should be aborting unhandled commands with INVALID OPCODE,
>>>>> not LUN NOT SUPPORTED.
>>>>>
>>>>> Signed-off-by: Hannes Reinecke <h...@suse.com>
>>>>> ---
>>>>>  hw/scsi/scsi-bus.c | 7 ++-
>>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
>>>>> index 8419c75..79a222f 100644
>>>>> --- a/hw/scsi/scsi-bus.c
>>>>> +++ b/hw/scsi/scsi-bus.c
>>>>> @@ -583,6 +583,11 @@ static int32_t scsi_target_send_command(SCSIRequest 
>>>>> *req, uint8_t *buf)
>>>>>  {
>>>>>  SCSITargetReq *r = DO_UPCAST(SCSITargetReq, req, req);
>>>>>
>>>>> +if (req->lun != 0) {
>>>>> +scsi_req_build_sense(req, SENSE_CODE(LUN_NOT_SUPPORTED));
>>>>> +scsi_req_complete(req, CHECK_CONDITION);
>>>>> +return 0;
>>>>> +}
>>>>>  switch (buf[0]) {
>>>>>  case REPORT_LUNS:
>>>>>  if (!scsi_target_emulate_report_luns(r)) {
>>>>> @@ -613,7 +618,7 @@ static int32_t scsi_target_send_command(SCSIRequest 
>>>>> *req, uint8_t *buf)
>>>>>  case TEST_UNIT_READY:
>>>>>  break;
>>>>>  default:
>>>>> -scsi_req_build_sense(req, SENSE_CODE(LUN_NOT_SUPPORTED));
>>>>> +scsi_req_build_sense(req, SENSE_CODE(INVALID_OPCODE));
>>>>>  scsi_req_complete(req, CHECK_CONDITION);
>>>>>  return 0;
>>>>>  illegal_request:
>>>>>
>>>>
>>>> I am queuing this one since it's an independent bugfix.
>>>
>>> This patch (ded6ddc5a7b9, "scsi: clarify sense codes for LUN0
>>> emulation", 2017-08-04) seems to confuse the media detection in
>>> edk2's "MdeModulePkg/Bus/Scsi/ScsiDiskDxe/ScsiDisk.c".
>>>
>>> Namely, when it enumerates the {targets}x{LUNs} matrix on the
>>> virtio-scsi HBA, it now reports the following message, for each
>>> (target,LUN) pair to which no actual SCSI device (like disk or
>>> CD-ROM) is assigned on the command line:
>>>
>>> ScsiDisk: Sense Key = 0x5 ASC = 0x25!
>>>
>>> Unfortunately, this is not all that happens -- the ScsiDiskDxe driver
>>> even installs a BlockIo protocol instance on the handle (again, there
>>> is no media, and no actual SCSI device), on which further protocols
>>> are stacked, such as BlockIo2:
>>>
>>> ScsiDisk: Sense Key = 0x5 ASC = 0x25!
>>> InstallProtocolInterface: [EfiBlockIoProtocol] 13A59A3A8
>>> InstallProtocolInterface: [EfiBlockIo2Protocol] 13A59A3D8
>>> InstallProtocolInterface: [EfiDiskInfoProtocol] 13A59A4D0
>>>
>>> In turn, in BDS, UEFI boot options are auto-generated for these
>>> devices, which is not nice, given that this procedure in BDS is very
>>> pflash-intensive, and pflash access is remarkably slow on aarch64
>>> KVM.
>>>
>>> For example, if I use one virtio-scsi HBA, and put a CD-ROM on target
>>> 0, LUN 0, and a disk on target 1, LUN 0, then edk2 will create
>>> protocol interfaces, and matching boot options, for
>>>
>>>   2 targets * 7 LUNs/target = 14 LUNs
>>>
>>> of which only 2 make sense.
>>>
>>>
>>> If I revert the patch (on top of v2.10.0-rc3), then everything works
>>> as before -- BlockIo protocol instances are produced only for actual
>>> devices (with media).
>>>
>>> I guess the path forward is to fix the ScsiDiskDxe driver in edk2;
>>> the new ASC should be recognized.
>>>
>>> My question is, how *exactly* did this patch change the reported
>>> sense key and ASC? That is, what did they use to be *before*?
>>> INVALID_OPCODE?
>>
>> I found the bug in edk2. It's a missing error check. I'll send a patch
>> and CC you guys.
> 
> Actually, I thin

Re: [Qemu-devel] [PATCHv2 4/4] scsi: Add 'enclosure' option for scsi devices

2017-08-04 Thread Hannes Reinecke
On 08/04/2017 12:48 PM, Paolo Bonzini wrote:
> On 04/08/2017 10:36, Hannes Reinecke wrote:
>> Make enclosure support optional with the 'enclosure' argument to
>> the scsi device.
>> Adding 'enclosure=on' as option to the SCSI device will present
>> an enclosure service device on LUN0, either as a stand-alone
>> LUN (in case LUN0 is not assigned) or by setting the EncServ bit
>> in the inquiry data if LUN0 is assigned to a block device.
>>
>> Signed-off-by: Hannes Reinecke <h...@suse.com>
>> ---
>>  hw/scsi/scsi-bus.c | 11 ---
>>  hw/scsi/scsi-disk.c|  4 +++-
>>  include/hw/scsi/scsi.h |  1 +
>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
>> index 79a222f..c11422b 100644
>> --- a/hw/scsi/scsi-bus.c
>> +++ b/hw/scsi/scsi-bus.c
>> @@ -22,6 +22,7 @@ static Property scsi_props[] = {
>>  DEFINE_PROP_UINT32("channel", SCSIDevice, channel, 0),
>>  DEFINE_PROP_UINT32("scsi-id", SCSIDevice, id, -1),
>>  DEFINE_PROP_UINT32("lun", SCSIDevice, lun, -1),
>> +DEFINE_PROP_BOOL("enclosure", SCSIDevice, enclosure, false),
>>  DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> @@ -494,11 +495,14 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq 
>> *r)
>>  if (r->req.lun != 0) {
>>  r->buf[0] = TYPE_NO_LUN;
>>  } else {
>> -r->buf[0] = TYPE_ENCLOSURE;
>> +r->buf[0] = r->req.dev->enclosure ?
>> +TYPE_ENCLOSURE : TYPE_NOT_PRESENT | TYPE_INACTIVE;
>>  r->buf[2] = 5; /* Version */
>>  r->buf[3] = 2 | 0x10; /* HiSup, response data format */
>>  r->buf[4] = r->len - 5; /* Additional Length = (Len - 1) - 4 */
>> -r->buf[6] = 0x40; /* Enclosure service */
>> +if (r->req.dev->enclosure) {
>> +r->buf[6] = 0x40; /* Enclosure service */
>> +}
>>  r->buf[7] = 0x10 | (r->req.bus->info->tcq ? 0x02 : 0); /* Sync, 
>> TCQ.  */
>>  memcpy(&r->buf[8], "QEMU", 8);
>>  memcpy(&r->buf[16], "QEMU TARGET ", 16);
>> @@ -600,7 +604,8 @@ static int32_t scsi_target_send_command(SCSIRequest 
>> *req, uint8_t *buf)
>>  }
>>  break;
>>  case RECEIVE_DIAGNOSTIC:
>> -if (!scsi_target_emulate_receive_diagnostic(r)) {
>> +if (!r->req.dev->enclosure ||
>> +!scsi_target_emulate_receive_diagnostic(r)) {
>>  goto illegal_request;
>>  }
>>  break;
>> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
>> index 5f1e5e8..153d97d 100644
>> --- a/hw/scsi/scsi-disk.c
>> +++ b/hw/scsi/scsi-disk.c
>> @@ -792,7 +792,9 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
>> uint8_t *outbuf)
>> the additional length is not adjusted */
>>  outbuf[4] = 36 - 5;
>>  }
>> -
>> +if (s->qdev.lun == 0 && s->qdev.enclosure) {
>> +outbuf[6] = 0x40; /* Enclosure service */
>> +}
> 
> Should this really be set on disks even if you do have a LUN0?
> 
Why not? You _can_ have embedded enclosure devices; that's what this bit
is for...

> Why don't you instead add a very simple scsi-enclosure device that you
> can specify as the LUN0?
> 
You don't have to.
Once you leave LUN0 free (eg by assigning your first disk LUN1) you'll
get this device by virtue of the current LUN0 emulation.

But yeah, maybe I can be doing a separate enclosure device.
Will be checking if and how I can make it to work.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



[Qemu-devel] [PATCHv2 4/4] scsi: Add 'enclosure' option for scsi devices

2017-08-04 Thread Hannes Reinecke
Make enclosure support optional with the 'enclosure' argument to
the scsi device.
Adding 'enclosure=on' as option to the SCSI device will present
an enclosure service device on LUN0, either as a stand-alone
LUN (in case LUN0 is not assigned) or by setting the EncServ bit
in the inquiry data if LUN0 is assigned to a block device.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 11 ---
 hw/scsi/scsi-disk.c|  4 +++-
 include/hw/scsi/scsi.h |  1 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 79a222f..c11422b 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -22,6 +22,7 @@ static Property scsi_props[] = {
 DEFINE_PROP_UINT32("channel", SCSIDevice, channel, 0),
 DEFINE_PROP_UINT32("scsi-id", SCSIDevice, id, -1),
 DEFINE_PROP_UINT32("lun", SCSIDevice, lun, -1),
+DEFINE_PROP_BOOL("enclosure", SCSIDevice, enclosure, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -494,11 +495,14 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq *r)
 if (r->req.lun != 0) {
 r->buf[0] = TYPE_NO_LUN;
 } else {
-r->buf[0] = TYPE_ENCLOSURE;
+r->buf[0] = r->req.dev->enclosure ?
+TYPE_ENCLOSURE : TYPE_NOT_PRESENT | TYPE_INACTIVE;
 r->buf[2] = 5; /* Version */
 r->buf[3] = 2 | 0x10; /* HiSup, response data format */
 r->buf[4] = r->len - 5; /* Additional Length = (Len - 1) - 4 */
-r->buf[6] = 0x40; /* Enclosure service */
+if (r->req.dev->enclosure) {
+r->buf[6] = 0x40; /* Enclosure service */
+}
 r->buf[7] = 0x10 | (r->req.bus->info->tcq ? 0x02 : 0); /* Sync, TCQ.  
*/
 memcpy(&r->buf[8], "QEMU", 8);
 memcpy(&r->buf[16], "QEMU TARGET ", 16);
@@ -600,7 +604,8 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 }
 break;
 case RECEIVE_DIAGNOSTIC:
-if (!scsi_target_emulate_receive_diagnostic(r)) {
+if (!r->req.dev->enclosure ||
+!scsi_target_emulate_receive_diagnostic(r)) {
 goto illegal_request;
 }
 break;
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 5f1e5e8..153d97d 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -792,7 +792,9 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, 
uint8_t *outbuf)
the additional length is not adjusted */
 outbuf[4] = 36 - 5;
 }
-
+if (s->qdev.lun == 0 && s->qdev.enclosure) {
+outbuf[6] = 0x40; /* Enclosure service */
+}
 /* Sync data transfer and TCQ.  */
 outbuf[7] = 0x10 | (req->bus->info->tcq ? 0x02 : 0);
 return buflen;
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 6b85786..243c185 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -100,6 +100,7 @@ struct SCSIDevice
 BlockConf conf;
 SCSISense unit_attention;
 bool sense_is_ua;
+bool enclosure;
 uint8_t sense[SCSI_SENSE_BUF_SIZE];
 uint32_t sense_len;
 QTAILQ_HEAD(, SCSIRequest) requests;
-- 
1.8.5.6




[Qemu-devel] [PATCHv2 3/4] scsi: clarify sense codes for LUN0 emulation

2017-08-04 Thread Hannes Reinecke
The LUN0 emulation is just that, an emulation for a non-existing
LUN0. So we should be returning LUN_NOT_SUPPORTED for any request
coming from any other LUN.
And we should be aborting unhandled commands with INVALID OPCODE,
not LUN NOT SUPPORTED.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 8419c75..79a222f 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -583,6 +583,11 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 {
 SCSITargetReq *r = DO_UPCAST(SCSITargetReq, req, req);
 
+if (req->lun != 0) {
+scsi_req_build_sense(req, SENSE_CODE(LUN_NOT_SUPPORTED));
+scsi_req_complete(req, CHECK_CONDITION);
+return 0;
+}
 switch (buf[0]) {
 case REPORT_LUNS:
 if (!scsi_target_emulate_report_luns(r)) {
@@ -613,7 +618,7 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 case TEST_UNIT_READY:
 break;
 default:
-scsi_req_build_sense(req, SENSE_CODE(LUN_NOT_SUPPORTED));
+scsi_req_build_sense(req, SENSE_CODE(INVALID_OPCODE));
 scsi_req_complete(req, CHECK_CONDITION);
 return 0;
 illegal_request:
-- 
1.8.5.6




[Qemu-devel] [PATCHv2 0/4] scsi: enclosure support

2017-08-04 Thread Hannes Reinecke
Hi all,

due to a customer issue I've added simple subenclosure support
to the SCSI emulation. By setting the 'enclosure' option to a
SCSI device we will now present an enclosure device on LUN0
(if LUN0 is otherwise unassigned) or set the 'EncServ' bit
in the inquiry data if LUN0 is assigned to a device.
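As an illustration (the command line below is assumed, not taken from the
patches), leaving LUN0 unassigned and enabling the option on a disk at LUN1
would look like:

-device virtio-scsi-pci,id=scsi0
-drive file=disk0.img,format=raw,if=none,id=d0
-device scsi-hd,drive=d0,bus=scsi0.0,scsi-id=0,lun=1,enclosure=on

Here the emulated LUN0 shows up as an enclosure service device; if the disk
sat on LUN0 instead, only the EncServ bit in its inquiry data would be set.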

Changes to v1:
- Add patch to clarify sense code responses
- Add 'enclosure' option for SCSI devices

Hannes Reinecke (4):
  scsi: Make LUN 0 a simple enclosure
  scsi: use qemu_uuid to generate logical identifier for SES
  scsi: clarify sense codes for LUN0 emulation
  scsi: Add 'enclosure' option for scsi devices

 hw/scsi/scsi-bus.c | 85 --
 hw/scsi/scsi-disk.c|  4 ++-
 include/hw/scsi/scsi.h |  1 +
 3 files changed, 87 insertions(+), 3 deletions(-)

-- 
1.8.5.6




[Qemu-devel] [PATCHv2 2/4] scsi: use qemu_uuid to generate logical identifier for SES

2017-08-04 Thread Hannes Reinecke
The SES enclosure descriptor requires a logical identifier,
so generate one using the qemu_uuid and the Qumranet OUI.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index c89e82d..8419c75 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -10,6 +10,7 @@
 #include "trace.h"
 #include "sysemu/dma.h"
 #include "qemu/cutils.h"
+#include "qemu/crc32c.h"
 
 static char *scsibus_get_dev_path(DeviceState *dev);
 static char *scsibus_get_fw_dev_path(DeviceState *dev);
@@ -533,6 +534,22 @@ static bool 
scsi_target_emulate_receive_diagnostic(SCSITargetReq *r)
 enc_desc[0] = 0x09;
 enc_desc[2] = 1;
 enc_desc[3] = 0x24;
+if (qemu_uuid_set) {
+uint32_t crc;
+
+/*
+ * Make this a NAA IEEE Registered identifier
+ * using Qumranet OUI (0x001A4A) and the
+ * crc32 from the system UUID.
+ */
+enc_desc[4] = 0x50;
+enc_desc[5] = 0x01;
+enc_desc[6] = 0xa4;
+enc_desc[7] = 0xa0;
+crc = crc32c(0xffffffff, qemu_uuid.data, 16);
+cpu_to_le32s(&crc);
+memcpy(&enc_desc[8], &crc, 4);
+}
 memcpy(&enc_desc[12], "QEMU", 8);
 memcpy(&enc_desc[20], "QEMU TARGET ", 16);
 pstrcpy((char *)&enc_desc[36], 4, qemu_hw_version());
-- 
1.8.5.6




[Qemu-devel] [PATCHv2 1/4] scsi: Make LUN 0 a simple enclosure

2017-08-04 Thread Hannes Reinecke
Instead of having an 'invisible' LUN0 (in case LUN 0 is not connected)
this patch makes LUN0 an enclosure service, exposing it to the OS.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 56 +-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 23c51de..c89e82d 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -493,10 +493,11 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq *r)
 if (r->req.lun != 0) {
 r->buf[0] = TYPE_NO_LUN;
 } else {
-r->buf[0] = TYPE_NOT_PRESENT | TYPE_INACTIVE;
+r->buf[0] = TYPE_ENCLOSURE;
 r->buf[2] = 5; /* Version */
 r->buf[3] = 2 | 0x10; /* HiSup, response data format */
 r->buf[4] = r->len - 5; /* Additional Length = (Len - 1) - 4 */
+r->buf[6] = 0x40; /* Enclosure service */
 r->buf[7] = 0x10 | (r->req.bus->info->tcq ? 0x02 : 0); /* Sync, TCQ.  
*/
 memcpy(&r->buf[8], "QEMU", 8);
 memcpy(&r->buf[16], "QEMU TARGET ", 16);
@@ -505,6 +506,54 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq *r)
 return true;
 }
 
+static bool scsi_target_emulate_receive_diagnostic(SCSITargetReq *r)
+{
+uint8_t page_code = r->req.cmd.buf[2];
+unsigned char *enc_desc, *type_desc;
+
+assert(r->req.dev->lun != r->req.lun);
+
+scsi_target_alloc_buf(&r->req, 0x38);
+
+switch (page_code) {
+case 0x00:
+r->buf[r->len++] = page_code ; /* this page */
+r->buf[r->len++] = 0x00;
+r->buf[r->len++] = 0x00;
+r->buf[r->len++] = 0x03;
+r->buf[r->len++] = 0x00;
+r->buf[r->len++] = 0x01;
+r->buf[r->len++] = 0x08;
+break;
+case 0x01:
+memset(r->buf, 0, 0x38);
+r->buf[0] = page_code;
+r->buf[3] = 0x30;
+enc_desc = &r->buf[8];
+enc_desc[0] = 0x09;
+enc_desc[2] = 1;
+enc_desc[3] = 0x24;
+memcpy(&enc_desc[12], "QEMU", 8);
+memcpy(&enc_desc[20], "QEMU TARGET ", 16);
+pstrcpy((char *)&enc_desc[36], 4, qemu_hw_version());
+type_desc = &r->buf[48];
+type_desc[1] = 1;
+r->len = 0x38;
+break;
+case 0x08:
+r->buf[0] = page_code;
+r->buf[1] = 0x00;
+r->buf[2] = 0x00;
+r->buf[3] = 0x00;
+r->len = 4;
+break;
+default:
+return false;
+}
+r->len = MIN(r->req.cmd.xfer, r->len);
+return true;
+}
+
 static size_t scsi_sense_len(SCSIRequest *req)
 {
 if (req->dev->type == TYPE_SCANNER)
@@ -528,6 +577,11 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 goto illegal_request;
 }
 break;
+case RECEIVE_DIAGNOSTIC:
+if (!scsi_target_emulate_receive_diagnostic(r)) {
+goto illegal_request;
+}
+break;
 case REQUEST_SENSE:
 scsi_target_alloc_buf(&r->req, scsi_sense_len(req));
 r->len = scsi_device_get_sense(r->req.dev, r->buf,
-- 
1.8.5.6




Re: [Qemu-devel] [PATCH 0/2] scsi: enclosure support

2017-08-04 Thread Hannes Reinecke
On 08/04/2017 08:10 AM, Paolo Bonzini wrote:
> 
>> On 08/03/2017 05:10 PM, Paolo Bonzini wrote:
>>> On 03/08/2017 15:26, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> due to a customer issue I've added simple subenclosure support
>>>> to the SCSI emulation. The patch simply converts the current invisible
>>>> LUN0 into an enclosure device; existing setups using LUN0 as disks or
>>>> CD-ROMs will not be affected.
>>>
>>> What is the issue exactly?  That is, for what is it necessary to have a
>>> dummy enclosure?
>>>
>> Well, stock linux displays some very interesting error messages for
>> these types of enclosures. Which was the prime mover for doing this.
> 
> --verbose?
> 
[   12.958454] scsi 1:0:0:254: Wrong diagnostic page; asked for 2 got 0
[   12.958456] scsi 1:0:0:254: Failed to get diagnostic page 0xffea
[   12.958457] scsi 1:0:0:254: Failed to bind enclosure -19
[   12.959392] ses 1:0:0:254: Attached Enclosure device

>>> I agree with Dan that this need machine type compatibility gunk.  For
>>> example, could the new device affect /dev/sgN numbering?
>>
>> Yes, indeed it would.
>>
>> What about a new option to the scsi driver?
> 
> If you do that, you've done 99% of the work to do compatibility so I
> won't complain and do the 1% myself. :)
> 
Okay, will be doing so.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] [PATCH 0/2] scsi: enclosure support

2017-08-03 Thread Hannes Reinecke
On 08/03/2017 05:10 PM, Paolo Bonzini wrote:
> On 03/08/2017 15:26, Hannes Reinecke wrote:
>> Hi all,
>>
>> due to a customer issue I've added simple subenclosure support
>> to the SCSI emulation. The patch simply converts the current invisible
>> LUN0 into an enclosure device; existing setups using LUN0 as disks or
>> CD-ROMs will not be affected.
> 
> What is the issue exactly?  That is, for what is it necessary to have a
> dummy enclosure?
> 
Well, stock linux displays some very interesting error messages for
these types of enclosures.
Which was the prime mover for doing this.

> I agree with Dan that this need machine type compatibility gunk.  For
> example, could the new device affect /dev/sgN numbering?
> 
Yes, indeed it would.

What about a new option to the scsi driver?
With that each user could selectively enable it, and we wouldn't need to
worry with machine type compability...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] [PATCH 1/2] scsi: Make LUN 0 a simple enclosure

2017-08-03 Thread Hannes Reinecke
On 08/03/2017 03:32 PM, Daniel P. Berrange wrote:
> On Thu, Aug 03, 2017 at 03:27:00PM +0200, Hannes Reinecke wrote:
>> Instead of having an 'invisible' LUN0 (in case LUN 0 is not connected)
>> this patch makes LUN0 an enclosure service, exposing it to the OS.
>>
>> Signed-off-by: Hannes Reinecke <h...@suse.com>
>> ---
>>  hw/scsi/scsi-bus.c | 56 
>> +-
>>  1 file changed, 55 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
>> index 23c51de..c89e82d 100644
>> --- a/hw/scsi/scsi-bus.c
>> +++ b/hw/scsi/scsi-bus.c
>> @@ -493,10 +493,11 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq 
>> *r)
>>  if (r->req.lun != 0) {
>>  r->buf[0] = TYPE_NO_LUN;
>>  } else {
>> -r->buf[0] = TYPE_NOT_PRESENT | TYPE_INACTIVE;
>> +r->buf[0] = TYPE_ENCLOSURE;
>>  r->buf[2] = 5; /* Version */
>>  r->buf[3] = 2 | 0x10; /* HiSup, response data format */
>>  r->buf[4] = r->len - 5; /* Additional Length = (Len - 1) - 4 */
>> +r->buf[6] = 0x40; /* Enclosure service */
>>  r->buf[7] = 0x10 | (r->req.bus->info->tcq ? 0x02 : 0); /* Sync, 
>> TCQ.  */
>>  memcpy(&r->buf[8], "QEMU", 8);
>>  memcpy(&r->buf[16], "QEMU TARGET ", 16);
> 
> I would think this needs to be tied into machine type version, otherwise
> when you migrate old to new QEMU, LUN0 is suddenly going to change beneath
> the running guest ?
> 
(I _knew_ this would be coming ...)

It will only change if LUN0 is _not_ assigned, i.e. if a LUN larger than 0
is the first LUN on that host.
In those cases the system would see an additional LUN, correct.
But as that LUN is trivially not used I don't really see a problem with
that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    zSeries & Storage
h...@suse.com  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



[Qemu-devel] [PATCH 0/2] scsi: enclosure support

2017-08-03 Thread Hannes Reinecke
Hi all,

due to a customer issue I've added simple subenclosure support
to the SCSI emulation. The patch simply converts the current invisible
LUN0 into an enclosure device; existing setups using LUN0 as disks or
CD-ROMs will not be affected.

Hannes Reinecke (2):
  scsi: Make LUN 0 a simple enclosure
  scsi: use qemu_uuid to generate logical identifier for SES

 hw/scsi/scsi-bus.c | 73 +-
 1 file changed, 72 insertions(+), 1 deletion(-)

-- 
1.8.5.6




[Qemu-devel] [PATCH 1/2] scsi: Make LUN 0 a simple enclosure

2017-08-03 Thread Hannes Reinecke
Instead of having an 'invisible' LUN0 (in case LUN 0 is not connected)
this patch makes LUN0 an enclosure service, exposing it to the OS.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 56 +-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 23c51de..c89e82d 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -493,10 +493,11 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq *r)
 if (r->req.lun != 0) {
 r->buf[0] = TYPE_NO_LUN;
 } else {
-r->buf[0] = TYPE_NOT_PRESENT | TYPE_INACTIVE;
+r->buf[0] = TYPE_ENCLOSURE;
 r->buf[2] = 5; /* Version */
 r->buf[3] = 2 | 0x10; /* HiSup, response data format */
 r->buf[4] = r->len - 5; /* Additional Length = (Len - 1) - 4 */
+r->buf[6] = 0x40; /* Enclosure service */
 r->buf[7] = 0x10 | (r->req.bus->info->tcq ? 0x02 : 0); /* Sync, TCQ.  
*/
 memcpy(&r->buf[8], "QEMU", 8);
 memcpy(&r->buf[16], "QEMU TARGET ", 16);
@@ -505,6 +506,54 @@ static bool scsi_target_emulate_inquiry(SCSITargetReq *r)
 return true;
 }
 
+static bool scsi_target_emulate_receive_diagnostic(SCSITargetReq *r)
+{
+uint8_t page_code = r->req.cmd.buf[2];
+unsigned char *enc_desc, *type_desc;
+
+assert(r->req.dev->lun != r->req.lun);
+
+scsi_target_alloc_buf(&r->req, 0x38);
+
+switch (page_code) {
+case 0x00:
+r->buf[r->len++] = page_code ; /* this page */
+r->buf[r->len++] = 0x00;
+r->buf[r->len++] = 0x00;
+r->buf[r->len++] = 0x03;
+r->buf[r->len++] = 0x00;
+r->buf[r->len++] = 0x01;
+r->buf[r->len++] = 0x08;
+break;
+case 0x01:
+memset(r->buf, 0, 0x38);
+r->buf[0] = page_code;
+r->buf[3] = 0x30;
+enc_desc = &r->buf[8];
+enc_desc[0] = 0x09;
+enc_desc[2] = 1;
+enc_desc[3] = 0x24;
+memcpy(&enc_desc[12], "QEMU", 8);
+memcpy(&enc_desc[20], "QEMU TARGET ", 16);
+pstrcpy((char *)&enc_desc[36], 4, qemu_hw_version());
+type_desc = &r->buf[48];
+type_desc[1] = 1;
+r->len = 0x38;
+break;
+case 0x08:
+r->buf[0] = page_code;
+r->buf[1] = 0x00;
+r->buf[2] = 0x00;
+r->buf[3] = 0x00;
+r->len = 4;
+break;
+default:
+return false;
+}
+r->len = MIN(r->req.cmd.xfer, r->len);
+return true;
+}
+
 static size_t scsi_sense_len(SCSIRequest *req)
 {
 if (req->dev->type == TYPE_SCANNER)
@@ -528,6 +577,11 @@ static int32_t scsi_target_send_command(SCSIRequest *req, 
uint8_t *buf)
 goto illegal_request;
 }
 break;
+case RECEIVE_DIAGNOSTIC:
+if (!scsi_target_emulate_receive_diagnostic(r)) {
+goto illegal_request;
+}
+break;
 case REQUEST_SENSE:
 scsi_target_alloc_buf(&r->req, scsi_sense_len(req));
 r->len = scsi_device_get_sense(r->req.dev, r->buf,
-- 
1.8.5.6




[Qemu-devel] [PATCH 2/2] scsi: use qemu_uuid to generate logical identifier for SES

2017-08-03 Thread Hannes Reinecke
The SES enclosure descriptor requires a logical identifier,
so generate one using the qemu_uuid and the Qumranet OUI.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index c89e82d..8419c75 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -10,6 +10,7 @@
 #include "trace.h"
 #include "sysemu/dma.h"
 #include "qemu/cutils.h"
+#include "qemu/crc32c.h"
 
 static char *scsibus_get_dev_path(DeviceState *dev);
 static char *scsibus_get_fw_dev_path(DeviceState *dev);
@@ -533,6 +534,22 @@ static bool 
scsi_target_emulate_receive_diagnostic(SCSITargetReq *r)
 enc_desc[0] = 0x09;
 enc_desc[2] = 1;
 enc_desc[3] = 0x24;
+if (qemu_uuid_set) {
+uint32_t crc;
+
+/*
+ * Make this a NAA IEEE Registered identifier
+ * using Qumranet OUI (0x001A4A) and the
+ * crc32 from the system UUID.
+ */
+enc_desc[4] = 0x50;
+enc_desc[5] = 0x01;
+enc_desc[6] = 0xa4;
+enc_desc[7] = 0xa0;
+crc = crc32c(0xffffffff, qemu_uuid.data, 16);
+cpu_to_le32s(&crc);
+memcpy(&enc_desc[8], &crc, 4);
+}
 memcpy(&enc_desc[12], "QEMU", 8);
 memcpy(&enc_desc[20], "QEMU TARGET ", 16);
 pstrcpy((char *)&enc_desc[36], 4, qemu_hw_version());
-- 
1.8.5.6




Re: [Qemu-devel] 答复: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA

2017-05-17 Thread Hannes Reinecke
On 05/16/2017 06:22 PM, Paolo Bonzini wrote:
> Pruning to sort out the basic disagreements.
> 
> On 16/05/2017 17:22, Hannes Reinecke wrote:
>>> That depends on how you would like to do controller passthrough in
>>> general.  iSCSI doesn't have the 64-bit target ID, and doesn't have
>>> (AFAIK) hot-plug/hot-unplug support, so it's less important than FC.
>>>
>> iSCSI has its 'iqn' string, which is defined to be a 256-byte string.
>> Hence the number 
>> And if we're updating virtio anyway, we could as well update it to carry
>> _all_ possible scsi IDs.
> 
> Yes, but one iSCSI connection maps to one initiator and target IQN.
> It's not like FC where each frame can specify its own initiator ID.
> 
Sure. But updating the format to hold _any_ SCSI Name would allow us to
reflect the actual initiator port name used by the host.
So the guest could be
>>>> (3) would be feasible, as it would effectively mean 'just' to update the
>>>> current NPIV mechanism. However, this would essentially lock us in for
>>>> FC; any other types (think NVMe) will require yet another solution.
>>> An FC-NVMe driver could also expose the same vhost interface, couldn't it?
>>> FC-NVMe doesn't have to share the Linux code; but sharing the virtio 
>>> standard
>>> and the userspace ABI would be great.
>>>
>>> In fact, the main advantage of virtio-fc would be that (if we define it 
>>> properly)
>>> it could be reused for FC-NVMe instead of having to extend e.g. virtio-blk.
>>> For example virtio-scsi has request, to-device payload, response, 
>>> from-device
>>> payload.  virtio-fc's request format could be the initiator and target port
>>> identifiers, followed by FCP_CMD, to-device payload, FCP_RSP, from-device
>>> payload.
>>>
>> As already said: We do _not_ have access to the FCP frames.
>> So designing a virtio-fc protocol will only work for libfc-based HBAs,
>> namely fnic, bnx2fc, and fcoe.
>> Given that the future of FCoE is somewhat unclear I doubt it's a good
>> idea to restrict ourselves to that.
> 
> I understand that.  It doesn't have to be a 1:1 match with FCP frames or
> even IUs; in fact the above minimal example is not (no OXID/RXID and no
> FCP_XFER_RDY IU are just the first two things that come to mind).
> 
> The only purpose is to have a *transport* that is (roughly speaking)
> flexible enough to support future NPIV usecases which will certainly
> include FC-NVMe.  In other words: I'm inventing my own cooked FCP
> format, but I might as well base it on FCP itself as much as possible.
> 
Weeelll ... I don't want to go into nit-picking here, but FC-NVMe is
_NOT_ FCP. In fact, it's a different FC-4 provider with its own set of
FC-4 commands etc.

> Likewise, I'm not going to even mention ELS, but we would need _some_
> kind of protocol to query name servers, receive state change
> notifications, and get service parameters.  No idea yet how to do that,
> probably something similar to virtio-scsi control and event queues, but
> why not make the requests/responses look a little like PLOGI and PRLI?
> 
And my idea here is to keep virtio-scsi as the basic mode of (command)
transfer, but add a set of transport management commands which would
allow us to do things like:
- port discovery / scan
- port instantiation / login
- port reset
- transport link notification / status check
- transport reset

Those could be defined transport independently; and the neat thing is
they could even be made to work with the current NPIV implementation
with some tooling.
And we could define things such that all current transport protocols can
be mapped onto it.
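Purely as an illustration of the shape this could take (nothing below exists
today; all names and values are hypothetical), the control queue could grow a
family of transport-management requests along these lines:

#include <stdint.h>

/* Hypothetical request types, modelled on the existing virtio-scsi
 * control queue; none of this is specified anywhere yet. */
#define VIRTIO_SCSI_T_TPORT_SCAN    16  /* port discovery / scan */
#define VIRTIO_SCSI_T_TPORT_LOGIN   17  /* port instantiation / login */
#define VIRTIO_SCSI_T_TPORT_RESET   18  /* port / transport reset */
#define VIRTIO_SCSI_T_TPORT_STATUS  19  /* link notification / status check */

struct virtio_scsi_tport_req {
    uint32_t type;            /* one of the values above */
    uint8_t  port_name[256];  /* transport-level name: WWPN, IQN, ... */
};

struct virtio_scsi_tport_resp {
    uint8_t  response;        /* VIRTIO_SCSI_S_OK or an error code */
    uint8_t  link_state;      /* for status requests */
};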

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] 答复: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA

2017-05-16 Thread Hannes Reinecke
vtap" like layer (e.g.,
>>> socket+bind creates an NPIV vport).  This layer can work on some kind of
>>> FCP encapsulation, not the raw thing, and virtio-fc could be designed
>>> according to a similar format for simplicity.
>>
>> (4) would require raw FCP frame access, which is one thing we do _not_
>> have. Each card (except for the pure FCoE ones like bnx2fc, fnic, and
>> fcoe) only allows access to pre-formatted I/O commands. And has its own
>> mechanism for generating sequence IDs etc. So anything requiring raw FCP
>> access is basically out of the game.
> 
> Not raw.  It could even be defined at the exchange level (plus some special
> things for discovery and login services).  But I agree that (4) is a bit
> pie-in-the-sky.
> 
>> Overall, I would vote to specify a new virtio scsi format _first_,
>> keeping in mind all of these options.
>> (1), (3), and (4) all require an update anyway :-)
>>
>> The big advantage I see with (1) is that it can be added with just some
>> code changes to qemu and virtio-scsi. Every other option requires some
>> vendor buy-in, which inevitably leads to more discussions, delays, and
>> more complex interaction (changes to qemu, virtio, _and_ the affected HBAs).
> 
> I agree.  But if we have to reinvent everything in a couple years for
> NVMe over fabrics, maybe it's not worth it.
> 
>> While we're at it: We also need a 'timeout' field in the virtio request
>> structure. I even posted an RFC for it :-)
> 
> Yup, I've seen it. :)
> 
Cool. Thanks.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] 答复: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA

2017-05-16 Thread Hannes Reinecke
ost HBA driver
> - no new guest drivers are needed (well, almost due to above issues)
> - out-of-the-box support for live migration, albeit with hacks required
> such as Hyper-V's two WWNN/WWPN pairs
> 
> Disadvantages:
> - no full FCP support
> - guest devices exposed as /dev nodes to the host
> 
> 
> 2) the exact opposite: use the recently added "mediated device
> passthrough" (mdev) framework to present a "fake" PCI device to the
> guest.  mdev is currently used for vGPU and will also be used by s390
> for CCW passthrough.  It lets the host driver take care of device
> emulation, and the result is similar to an SR-IOV virtual function but
> without requiring SR-IOV in the host.  The PCI device would presumably
> reuse in the guest the same driver as the host.
> 
> Advantages:
> - no new guest drivers are needed
> - solution confined entirely within the host driver
> - each driver can use its own native 'cooked' format for FC frames
> 
> Disadvantages:
> - specific to each HBA driver
> - guests cannot be stopped/restarted across hosts with different HBAs
> - it's still device passthrough, so live migration is a mess (and would
> require guest-specific code in QEMU)
> 
> 
> 3) handle passthrough with a kernel driver.  Under this model, the guest
> uses the virtio device, but the passthrough of commands and TMFs is
> performed by the host driver.  The host driver grows the option to
> present an NPIV vport through a vhost interface (*not* the same thing as
> LIO's vhost-scsi target, but a similar API with a different /dev node or
> even one node per scsi_host).
> 
> We can then choose whether to do it with virtio-scsi or with a new
> virtio-fc.
> 
> Advantages:
> - guests can be stopped/restarted across hosts with different HBAs
> - no need to do the "two WWNN/WWPN pairs" hack for live migration,
> unlike e.g. Hyper-V
> - a bit Rube Goldberg, but the vhost interface can be consumed by any
> userspace program, not just by virtual machines
> 
> Disadvantages:
> - requires a new generalized vhost-scsi (or vhost-fc) layer
> - not sure about support for live migration (what to do about in-flight
> commands?)
> 
> I don't know the Linux code well enough to know if it would require code
> specific to each HBA driver.  Maybe just some refactoring.
> 
> 
> 4) same as (3), but in userspace with a "macvtap" like layer (e.g.,
> socket+bind creates an NPIV vport).  This layer can work on some kind of
> FCP encapsulation, not the raw thing, and virtio-fc could be designed
> according to a similar format for simplicity.
> 
> Advantages:
> - less dependencies on kernel code
> - simplest for live migration
> - most flexible for userspace usage
> 
> Disadvantages:
> - possibly two packs of cats to herd (SCSI + networking)?
> - haven't thought much about it, so I'm not sure about the feasibility
> 
> Again, I don't know the Linux code well enough to know if it would
> require code specific to each HBA driver.
> 
> 
> If we can get the hardware manufacturers (and the SCSI maintainers...)
> on board, (3) would probably be pretty easy to achieve, even accounting
> for the extra complication of writing a virtio-fc specification.  Really
> just one hardware manufacturer, the others would follow suit.
> 
With option (1) and the target/initiator ID extensions we should be able
to get basic NPIV support to work, and would even be able to handle
reservations in a sane manner.

(4) would require raw FCP frame access, which is one thing we do _not_
have. Each card (except for the pure FCoE ones like bnx2fc, fnic, and
fcoe) only allows access to pre-formatted I/O commands. And has its own
mechanism for generating sequence IDs etc. So anything requiring raw FCP
access is basically out of the game.

(3) would be feasible, as it would effectively mean 'just' to update the
current NPIV mechanism. However, this would essentially lock us in for
FC; any other types (think NVMe) will require yet another solution.

(2) sounds interesting, but I'd have to have a look into the code to
figure out if it could easily be done.

> (2) would probably be what the manufacturers like best, but it would be
> worse for lock in.  Or... they would like it best *because* it would be
> worse for lock in.
> 
> The main disadvantage of (2)/(3) against (1) is more complex testing.  I
> guess we can add a vhost-fc target for testing to LIO, so as not to
> require an FC card for guest development.  And if it is still a problem
> 'cause configfs requires root, we can add a fake FC target in QEMU.
> 
Overall, I would vote to specify a new virtio scsi format _first_,
keeping in mind all of these options.
(1), (3), and (4) all require an update anyway :-)

The big advantage I see with (1) is that it can be added with just some
code changes to qemu and virtio-scsi. Every other option requires some
vendor buy-in, which inevitably leads to more discussions, delays, and
more complex interaction (changes to qemu, virtio, _and_ the affected HBAs).

While we're at it: We also need a 'timeout' field in the virtio request
structure. I even posted an RFC for it :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



[Qemu-devel] [PATCH 4/4] virtio: implement VIRTIO_SCSI_F_TIMEOUT feature

2017-05-12 Thread Hannes Reinecke
Implement a handler for the VIRTIO_SCSI_F_TIMEOUT feature, which
allows passing in the assigned command timeout in seconds or
minutes. This allows specifying a timeout of up to 3 hours.
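
For example, a crn value of 30 maps to a 30 second timeout, a value of
90 maps to 90 minutes, and 255 disables the timeout altogether.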

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/virtio-scsi.c| 16 
 include/standard-headers/linux/virtio_scsi.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 46a3e3f280..f1c2f7cb5b 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -534,6 +534,7 @@ static void virtio_scsi_fail_cmd_req(VirtIOSCSIReq *req)
 static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq 
*req)
 {
 VirtIOSCSICommon *vs = &s->parent_obj;
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
 SCSIDevice *d;
 int rc;
 
@@ -560,6 +561,21 @@ static int virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI 
*s, VirtIOSCSIReq *req)
  virtio_scsi_get_lun(req->req.cmd.lun),
  req->req.cmd.cdb, req);
 
+if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_TIMEOUT)) {
+int timeout = (int)req->req.cmd.crn;
+
+if (timeout < 60) {
+/* Timeouts below 60 are in seconds */
+req->sreq->timeout = timeout * 1000;
+} else if (timeout == 255) {
+/* 255 is infinite timeout */
+req->sreq->timeout = UINT_MAX;
+} else {
+/* Otherwise the timeout is in minutes */
+req->sreq->timeout = timeout * 1000 * 60;
+}
+}
+
 if (req->sreq->cmd.mode != SCSI_XFER_NONE
 && (req->sreq->cmd.mode != req->mode ||
 req->sreq->cmd.xfer > req->qsgl.size)) {
diff --git a/include/standard-headers/linux/virtio_scsi.h 
b/include/standard-headers/linux/virtio_scsi.h
index ab66166b6a..b00e2b95ab 100644
--- a/include/standard-headers/linux/virtio_scsi.h
+++ b/include/standard-headers/linux/virtio_scsi.h
@@ -120,6 +120,7 @@ struct virtio_scsi_config {
 #define VIRTIO_SCSI_F_HOTPLUG  1
 #define VIRTIO_SCSI_F_CHANGE   2
 #define VIRTIO_SCSI_F_T10_PI   3
+#define VIRTIO_SCSI_F_TIMEOUT  4
 
 /* Response codes */
 #define VIRTIO_SCSI_S_OK   0
-- 
2.12.0




[Qemu-devel] [PATCH 3/4] scsi: per-request timeouts

2017-05-12 Thread Hannes Reinecke
Add a 'timeout' value per request to allow specifying individual
per-request timeouts.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-bus.c |  1 +
 hw/scsi/scsi-disk.c| 15 +++
 hw/scsi/scsi-generic.c | 12 
 include/hw/scsi/scsi.h |  1 +
 4 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index f5574469c8..206000d873 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -690,6 +690,7 @@ SCSIRequest *scsi_req_new(SCSIDevice *d, uint32_t tag, 
uint32_t lun,
 
 req->cmd = cmd;
 req->resid = req->cmd.xfer;
+req->timeout = d->timeout;
 
 switch (buf[0]) {
 case INQUIRY:
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4ac4c872fe..6a6f091a9c 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2503,7 +2503,9 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, 
uint32_t tag, uint32_t lun,
 printf("\n");
 }
 #endif
-
+if (req) {
+req->timeout = d->timeout;
+}
 return req;
 }
 
@@ -2679,7 +2681,7 @@ static BlockAIOCB *scsi_block_do_sgio(SCSIBlockReq *req,
 /* The rest is as in scsi-generic.c.  */
 io_header->mx_sb_len = sizeof(r->req.sense);
 io_header->sbp = r->req.sense;
-io_header->timeout = s->qdev.timeout;
+io_header->timeout = r->req.timeout;
 io_header->usr_ptr = r;
 io_header->flags |= SG_FLAG_DIRECT_IO;
 
@@ -2812,14 +2814,19 @@ static SCSIRequest *scsi_block_new_request(SCSIDevice 
*d, uint32_t tag,
void *hba_private)
 {
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, d);
+struct SCSIRequest *req;
 
 if (scsi_block_is_passthrough(s, buf)) {
-return scsi_req_alloc(&scsi_generic_req_ops, &s->qdev, tag, lun,
+req = scsi_req_alloc(&scsi_generic_req_ops, &s->qdev, tag, lun,
   hba_private);
 } else {
-return scsi_req_alloc(&scsi_block_dma_reqops, &s->qdev, tag, lun,
+req = scsi_req_alloc(&scsi_block_dma_reqops, &s->qdev, tag, lun,
   hba_private);
 }
+if (req) {
+req->timeout = d->timeout;
+}
+return req;
 }
 
 static int scsi_block_parse_cdb(SCSIDevice *d, SCSICommand *cmd,
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 998b6a4558..31c3a75bad 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -157,8 +157,6 @@ static int execute_command(BlockBackend *blk,
SCSIGenericReq *r, int direction,
BlockCompletionFunc *complete)
 {
-SCSIDevice *s = r->req.dev;
-
 r->io_header.interface_id = 'S';
 r->io_header.dxfer_direction = direction;
 r->io_header.dxferp = r->buf;
@@ -167,7 +165,7 @@ static int execute_command(BlockBackend *blk,
 r->io_header.cmd_len = r->req.cmd.len;
 r->io_header.mx_sb_len = sizeof(r->req.sense);
 r->io_header.sbp = r->req.sense;
-r->io_header.timeout = s->timeout;
+r->io_header.timeout = r->req.timeout;
 r->io_header.usr_ptr = r;
 r->io_header.flags |= SG_FLAG_DIRECT_IO;
 
@@ -596,7 +594,13 @@ const SCSIReqOps scsi_generic_req_ops = {
 static SCSIRequest *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun,
  uint8_t *buf, void *hba_private)
 {
-return scsi_req_alloc(&scsi_generic_req_ops, d, tag, lun, hba_private);
+struct SCSIRequest *req;
+
+req = scsi_req_alloc(&scsi_generic_req_ops, d, tag, lun, hba_private);
+if (req) {
+req->timeout = d->timeout;
+}
+return req;
 }
 
 static Property scsi_generic_properties[] = {
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index a976e85cfa..6c51ddc3ef 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -51,6 +51,7 @@ struct SCSIRequest {
 uint32_t  tag;
 uint32_t  lun;
 uint32_t  status;
+uint32_t  timeout;
 void  *hba_private;
 size_tresid;
 SCSICommand   cmd;
-- 
2.12.0




[Qemu-devel] [PATCH 2/4] scsi: use host default timeouts for SCSI commands

2017-05-12 Thread Hannes Reinecke
Instead of disabling command aborts by setting the command timeout
to infinity we should be setting it to '0' by default, allowing
the host to fall back to its default values.

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-disk.c| 3 +--
 hw/scsi/scsi-generic.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index dd01ff7e06..4ac4c872fe 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2898,8 +2898,7 @@ static Property scsi_hd_properties[] = {
DEFAULT_MAX_UNMAP_SIZE),
 DEFINE_PROP_UINT64("max_io_size", SCSIDiskState, max_io_size,
DEFAULT_MAX_IO_SIZE),
-DEFINE_PROP_UINT32("timeout", SCSIDevice, timeout,
-   MAX_UINT),
+DEFINE_PROP_UINT32("timeout", SCSIDevice, timeout, 0),
 DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
 DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index fd02a0f4b2..998b6a4558 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -601,7 +601,7 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, 
uint32_t tag, uint32_t lun,
 
 static Property scsi_generic_properties[] = {
 DEFINE_PROP_DRIVE("drive", SCSIDevice, conf.blk),
-DEFINE_PROP_UINT32("timeout", SCSIDevice, timeout, MAX_UINT),
+DEFINE_PROP_UINT32("timeout", SCSIDevice, timeout, 0),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.12.0




[Qemu-devel] [PATCH 1/4] scsi: make default command timeout user-settable

2017-05-12 Thread Hannes Reinecke
By default, all SCSI commands are sent with an infinite timeout,
which essentially disables any command abort mechanism on the
host and causes the guest to stall.
This patch adds a new option 'timeout' for scsi-generic and
scsi-disk which allows the user to set the timeout value to
something sensible.
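
For illustration (the drive id below is made up), something like

  -device scsi-generic,drive=sg0,timeout=30

on the qemu command line would replace the current infinite default
with a finite, user-chosen timeout for that device.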

Signed-off-by: Hannes Reinecke <h...@suse.com>
---
 hw/scsi/scsi-disk.c| 4 +++-
 hw/scsi/scsi-generic.c | 5 -
 include/hw/scsi/scsi.h | 1 +
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index a53f058621..dd01ff7e06 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2679,7 +2679,7 @@ static BlockAIOCB *scsi_block_do_sgio(SCSIBlockReq *req,
 /* The rest is as in scsi-generic.c.  */
 io_header->mx_sb_len = sizeof(r->req.sense);
 io_header->sbp = r->req.sense;
-io_header->timeout = UINT_MAX;
+io_header->timeout = s->qdev.timeout;
 io_header->usr_ptr = r;
 io_header->flags |= SG_FLAG_DIRECT_IO;
 
@@ -2898,6 +2898,8 @@ static Property scsi_hd_properties[] = {
DEFAULT_MAX_UNMAP_SIZE),
 DEFINE_PROP_UINT64("max_io_size", SCSIDiskState, max_io_size,
DEFAULT_MAX_IO_SIZE),
+DEFINE_PROP_UINT32("timeout", SCSIDevice, timeout,
+   MAX_UINT),
 DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
 DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index a55ff87c22..fd02a0f4b2 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -157,6 +157,8 @@ static int execute_command(BlockBackend *blk,
SCSIGenericReq *r, int direction,
BlockCompletionFunc *complete)
 {
+SCSIDevice *s = r->req.dev;
+
 r->io_header.interface_id = 'S';
 r->io_header.dxfer_direction = direction;
 r->io_header.dxferp = r->buf;
@@ -165,7 +167,7 @@ static int execute_command(BlockBackend *blk,
 r->io_header.cmd_len = r->req.cmd.len;
 r->io_header.mx_sb_len = sizeof(r->req.sense);
 r->io_header.sbp = r->req.sense;
-r->io_header.timeout = MAX_UINT;
+r->io_header.timeout = s->timeout;
 r->io_header.usr_ptr = r;
 r->io_header.flags |= SG_FLAG_DIRECT_IO;
 
@@ -599,6 +601,7 @@ static SCSIRequest *scsi_new_request(SCSIDevice *d, 
uint32_t tag, uint32_t lun,
 
 static Property scsi_generic_properties[] = {
 DEFINE_PROP_DRIVE("drive", SCSIDevice, conf.blk),
+DEFINE_PROP_UINT32("timeout", SCSIDevice, timeout, MAX_UINT),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 6b85786dbf..a976e85cfa 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -110,6 +110,7 @@ struct SCSIDevice
 uint64_t max_lba;
 uint64_t wwn;
 uint64_t port_wwn;
+uint32_t timeout;
 };
 
 extern const VMStateDescription vmstate_scsi_device;
-- 
2.12.0




[Qemu-devel] [RFC PATCH 0/4] Virtio command timeouts (qemu part)

2017-05-12 Thread Hannes Reinecke
Hi all,

we've run into a really awkward customer situation where the guest would
hang forever due to an SG_IO ioctl on the host not returning.
Looking into it we found that qemu will submit direct I/O requests with
an _infinite_ timeout (well, actually UINT_MAX, which due to a kernel
bug gets translated into (ULONG)-2, resulting in a timeout of
4.2 years :-).
And this particular I/O ran into a timeout on the wire due to a flaky
connection, which resulted in the 'normal' block-level timeout on the
host being disabled, and the SCSI stack never sending any aborts as
the block-layer was still waiting for the I/O timeout to expire.

Unfortunately I didn't find a way to create a stand-alone patch; the
fix I'm proposing relies on fixes for qemu running on the host and
the kernel side running on the guest.

The proposed fix consists of several parts:
- make the standard device-timeout user-settable via a 'timeout'
  attribute to 'scsi-disk' and 'scsi-generic'
- Add a kernel patch to implement an eh_timeout_handler() for
  virtio_scsi(); this patch just checks if the command is still pending
  and resets the timer if so.
- Add a request timeout to allow drivers to modify the timeout
  on a per-request basis.
- Implement a new VIRTIO_SCSI_F_TIMEOUT feature allowing virtio-scsi
  to pass in a timeout via the otherwise unused 'crn' field.
- Add a kernel patch to implement the VIRTIO_SCSI_F_TIMEOUT feature
  so that the timeout is added per virtio request.

With that virtio-scsi on the guest can pass in the used timeout to the
qemu on the host side, which then can use this timeout to issue I/O
requests to the host.
The host can then properly abort a command if the timeout is hit, and
the aborted command will be returned to the guest.
The guest itself doesn't need to (and, in fact, in most cases can't) abort
any commands anymore, so it just needs to reset the I/O timer until the
requests are returned.
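
To illustrate the crn encoding from the last patch, a guest-side helper
(purely hypothetical, not part of the kernel patches mentioned above)
could map a block-layer timeout to the crn byte roughly like this:

#include <stdint.h>

/* Hypothetical sketch mirroring the decode on the qemu side: crn values
 * 1..59 are seconds, 60..254 are minutes, 255 means "no timeout", and 0
 * lets the host apply its own default. */
static uint8_t virtscsi_encode_timeout(unsigned int timeout_secs)
{
    unsigned int mins;

    if (timeout_secs < 60)
        return (uint8_t)timeout_secs;   /* 0 = host default, 1..59 = seconds */

    mins = (timeout_secs + 59) / 60;    /* round up to whole minutes */
    if (mins < 60)
        mins = 60;      /* smaller values would be read as seconds again */
    if (mins > 254)
        mins = 254;     /* 255 is reserved for "no timeout" */
    return (uint8_t)mins;
}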

However, as this is quite an elaborate construct I'd like to get some
feedback for it.

Hannes Reinecke (4):
  scsi: make default command timeout user-settable
  scsi: use host default timeouts for SCSI commands
  scsi: per-request timeouts
  virtio: implement VIRTIO_SCSI_F_TIMEOUT feature

 hw/scsi/scsi-bus.c   |  1 +
 hw/scsi/scsi-disk.c  | 16 
 hw/scsi/scsi-generic.c   | 11 +--
 hw/scsi/virtio-scsi.c| 16 
 include/hw/scsi/scsi.h   |  2 ++
 include/standard-headers/linux/virtio_scsi.h |  1 +
 6 files changed, 41 insertions(+), 6 deletions(-)

-- 
2.12.0




Re: [Qemu-devel] [PATCH v2 25/30] trace: Fix parameter types in hw/scsi

2017-03-14 Thread Hannes Reinecke
On 03/13/2017 08:55 PM, Eric Blake wrote:
> An upcoming patch will let the compiler warn us when we are silently
> losing precision in traces; update the trace definitions to pass
> through the full value at the callsite.  Also update some callers
> to avoid variable-sized dma_addr_t (which cannot be easily used in
> trace headers) and needless casts for values used in tracing.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> ---
>  hw/scsi/esp-pci.c|   2 +-
>  hw/scsi/megasas.c|   8 ++--
>  hw/scsi/trace-events | 114 
> +--
>  3 files changed, 63 insertions(+), 61 deletions(-)
> 
Reviewed-by: Hannes Reinecke <h...@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] [PATCH v2 02/30] trace: Fix incorrect megasas trace parameters

2017-03-14 Thread Hannes Reinecke
On 03/13/2017 08:55 PM, Eric Blake wrote:
> hw/scsi/trace-events lists cmd as the first parameter for both
> megasas_iovec_overflow and megasas_iovec_underflow, but the caller
> was mistakenly passing cmd->iov_size twice instead of the command
> index.  Also, trace_megasas_abort_invalid is called with parameters
> in the wrong order.  Broken since its introduction in commit
> e8f943c3.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> ---
>  hw/scsi/megasas.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
> index e3d59b7..84b8caf 100644
> --- a/hw/scsi/megasas.c
> +++ b/hw/scsi/megasas.c
> @@ -291,7 +291,7 @@ static int megasas_map_sgl(MegasasState *s, MegasasCmd 
> *cmd, union mfi_sgl *sgl)
>  if (cmd->iov_size > iov_size) {
>  trace_megasas_iovec_overflow(cmd->index, iov_size, cmd->iov_size);
>  } else if (cmd->iov_size < iov_size) {
> -trace_megasas_iovec_underflow(cmd->iov_size, iov_size, 
> cmd->iov_size);
> +trace_megasas_iovec_underflow(cmd->index, iov_size, cmd->iov_size);
>  }
>  cmd->iov_offset = 0;
>  return 0;
> @@ -1924,8 +1924,8 @@ static int megasas_handle_abort(MegasasState *s, 
> MegasasCmd *cmd)
>  abort_ctx &= (uint64_t)0xFFFFFFFF;
>  }
>  if (abort_cmd->context != abort_ctx) {
> -trace_megasas_abort_invalid_context(cmd->index, abort_cmd->index,
> -abort_cmd->context);
> +trace_megasas_abort_invalid_context(cmd->index, abort_cmd->context,
> +    abort_cmd->index);
>  s->event_count++;
>  return MFI_STAT_ABORT_NOT_POSSIBLE;
>  }
> 
Reviewed-by: Hannes Reinecke <h...@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] 答复: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA

2017-02-22 Thread Hannes Reinecke
On 02/22/2017 09:19 AM, Lin Ma wrote:
> Hi Hannes,
> 
>>>> Hannes Reinecke <h...@suse.de> 2017/2/16 Thursday 5:56 PM >>>
>>On 02/16/2017 09:39 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 16/02/2017 08:16, Lin Ma wrote:
>>>>> What are the benefits of having FC access from the guest?
>>>>
>>>> Actually, I haven't dug into it too much, just thought that from
>>>> virtualization's
>>>> perspective, when interacting with FC storage, having complete FC access
>>>> from the guest is the way it should use.
>>>
>>> How much of this requires a completely new spec?  Could we get enough of
>>> the benefit (e.g. the ability to issue rescans or LIPs on the host) by
>>> extending virtio-scsi?
>>>
>> I guess I'd need to chime in with my favourite topic :-)
>>
>> Initially I really would go with extending the virtio-scsi spec, as
>> 'real' virtio-fc suffers from some issues:
>> While it's of course possible to implement a full fc stack in qemu
>> itself, it's not easily possible to assign 'real' FC devices to the guest.
>> Problem here is that any virtio-fc would basically have to use the
>> standard FC frame format for virtio itself, but all 'real' FC HBAs
>> (excluding FCoE drivers) do _not_ allow access to the actual FC frames,
>> but rather present you with a 'cooked' version of them.
>> So if you were attempting to pass FC devices to the guest you would have
>> to implement yet-another conversion between raw FC frames and the
>> version the HBA would like.
>> And that doesn't even deal with the real complexity like generating
>> correct OXID/RXIDs etc.
>> 
>> So initially I would propose to update the virtio spec to include:
>> - Full 64bit LUNs
>> - A 64bit target port ID (well, _actually_ it should be a SCSI-compliant
>>   target port ID, but as this essentially is a free text field I'd
>>   restrict it to something sensible)
>> For full compatibility we'd also need a (64-bit) initiator ID, but that is
>> essentially a property of the virtio queue, so we don't need to transmit
>> it with every command; once during queue setup is enough.
>> And if we restrict virtio-scsi to point-to-point topology we can even
>> associate the target port ID with the virtio queue, making
>> implementation even easier.
>> I'm not sure if that is a good idea long-term, as one might want to
>> identify an NPIV host with a virtio device, in which case we're having a
>> 1-M topology and that model won't work anymore.
>> 
>> To be precise:
>> 
>> I'd propose to update
>> 
>> struct virtio_scsi_config
>> with a field 'u8 initiator_id[8]'
>> 
>> and
>> 
>> struct virtio_scsi_req_cmd
>> with a field 'u8 target_id[8]'
>> 
>> and do away with the weird LUN remapping qemu has nowadays.
> Does it mean we don't need to provide '-drive' and '-device scsi-hd'
> options in the qemu command line, so the guest can get its own LUNs
> through the fc switch, right?
> 
No, you still would need that (at least initially).
But with the modifications above we can add tooling around qemu to
establish the correct (host) device mappings.
Without it we
a) have no idea from the host side which devices should be attached to
any given guest
b) have no idea from the guest side what the initiator and target IDs
are; which will get _really_ tricky if someone decides to use persistent
reservations from within the guest...

For handling NPIV proper we would need to update qemu to
a) locate the NPIV host based on the initiator ID from the guest
b) stop exposing the devices attached to that NPIV host to the guest
c) establish a 'rescan' routine to capture any state changes (LUN
remapping etc) of the NPIV host.

But having the initiator and target IDs in virtio scsi is a mandatory
step before we can attempt anything else.
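
As an aside, the host-side plumbing for a) can build on the existing
fc_host sysfs interface, where writing "<wwpn>:<wwnn>" to vport_create
instantiates an NPIV vport. A rough sketch (host number and WWNs below
are made up):

#include <stdio.h>

/* Rough sketch: create an NPIV vport for a guest-assigned initiator ID
 * by writing "<wwpn>:<wwnn>" to the parent fc_host's vport_create
 * attribute.  Host number and WWNs are made up for illustration. */
static int create_npiv_vport(int host_no, const char *wwpn, const char *wwnn)
{
    char path[64];
    FILE *f;
    int ret;

    snprintf(path, sizeof(path),
             "/sys/class/fc_host/host%d/vport_create", host_no);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ret = (fprintf(f, "%s:%s", wwpn, wwnn) < 0) ? -1 : 0;
    if (fclose(f) != 0)
        ret = -1;
    return ret;
}

/* e.g. create_npiv_vport(7, "21000024ff454301", "20000024ff454301"); */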

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] 答复: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA

2017-02-16 Thread Hannes Reinecke
On 02/16/2017 09:39 AM, Paolo Bonzini wrote:
> 
> 
> On 16/02/2017 08:16, Lin Ma wrote:
>>> What are the benefits of having FC access from the guest?
>>
>> Actually, I haven't dug into it too much, just thought that from virtualization's
>> perspective, when interacting with FC storage, having complete FC access
>> from the guest is the way it should use.
> 
> How much of this requires a completely new spec?  Could we get enough of
> the benefit (e.g. the ability to issue rescans or LIPs on the host) by
> extending virtio-scsi?
> 
I guess I'd need to chime in with my favourite topic :-)

Initially I really would go with extending the virtio-scsi spec, as
'real' virtio-fc suffers from some issues:
While it's of course possible to implement a full fc stack in qemu
itself, it's not easily possible to assign 'real' FC devices to the guest.
Problem here is that any virtio-fc would basically have to use the
standard FC frame format for virtio itself, but all 'real' FC HBAs
(excluding FCoE drivers) do _not_ allow access to the actual FC frames,
but rather present you with a 'cooked' version of them.
So if you were attempting to pass FC devices to the guest you would have
to implement yet-another conversion between raw FC frames and the
version the HBA would like.
And that doesn't even deal with the real complexity like generating
correct OXID/RXIDs etc.

So initially I would propose to update the virtio spec to include:
- Full 64bit LUNs
- A 64bit target port ID (well, _actually_ it should be a SCSI-compliant
  target port ID, but as this essentially is a free text field I'd
  restrict it to something sensible)
For full compatibility we'd also need a (64-bit) initiator ID, but that is
essentially a property of the virtio queue, so we don't need to transmit
it with every command; once during queue setup is enough.
And if we restrict virtio-scsi to point-to-point topology we can even
associate the target port ID with the virtio queue, making
implementation even easier.
I'm not sure if that is a good idea long-term, as one might want to
identify an NPIV host with a virtio device, in which case we're having a
1-M topology and that model won't work anymore.

To be precise:

I'd propose to update

struct virtio_scsi_config
with a field 'u8 initiator_id[8]'

and

struct virtio_scsi_req_cmd
with a field 'u8 target_id[8]'

and do away with the weird LUN remapping qemu has nowadays.

That should be enough to get proper NPIV passthrough working.
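
Sketched as C structures, the additions would look roughly like this
(existing fields abbreviated; placement and naming purely illustrative):

#include <stdint.h>

struct virtio_scsi_config {
    /* ... existing fields (num_queues, seg_max, max_sectors, ...) ... */
    uint8_t initiator_id[8];    /* proposed: SCSI-compliant initiator port ID */
};

struct virtio_scsi_req_cmd {
    uint8_t lun[8];             /* existing; would carry a full 64-bit LUN */
    uint8_t target_id[8];       /* proposed: SCSI-compliant target port ID */
    /* ... existing fields (id, task_attr, prio, crn, cdb[], ...) ... */
};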

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


