Re: [virtio-comment] Re: virtio-sound linux driver conformance to spec

2023-09-18 Thread Paolo Bonzini

On 9/19/23 02:35, Anton Yakovlev wrote:


If the Linux virtio sound driver violates a specification, then there must be
a conformance statement that the driver does not follow. As far as I know,
there is no such thing at the moment.


There is one in 2.7.13.3: "The device MAY access the descriptor chains 
the driver created and the memory they refer to immediately"


And likewise for packed virtqueues in 2.8.21.1: "The device MAY access 
the descriptor and any following descriptors the driver created and the 
memory they refer to immediately"


I think it's a mistake to use MAY here, as opposed to "may".  This is 
not an optional feature; it's a MUST NOT requirement on the driver's 
part that should be in 2.7.13.3.1 and 2.8.21.1.1.


This would not prevent the virtio-snd spec from overriding the 
requirement.  If an override is desirable (for example because other 
hardware behaves like this), there should be a provision in 2.7.13.3.1 
and 2.8.21.1.1.  For example:


2.7.13.3.1 Unless the device specification specifies otherwise, the 
driver MUST NOT write to the descriptor chains and the memory they refer 
to, between the /idx/ update and the time the device places the buffer 
on the used ring.


2.8.21.1.1 Unless the device specification specifies otherwise, the 
driver MUST NOT write to the descriptor, to any following descriptors 
the driver created, nor to the memory they refer to, between the /flags/ 
update and the time the device places the buffer on the used ring.
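To make the window concrete, here is a minimal driver-side sketch for a
split virtqueue (field names follow the spec's struct virtq_avail; qsz,
head and the annotation are illustrative, not spec text):

/* Sketch: publishing a descriptor chain on a split virtqueue. */
avail->ring[avail->idx % qsz] = head;  /* expose the chain             */
smp_wmb();                             /* order buffers before idx     */
avail->idx++;                          /* <-- from here until the id
                                        * comes back on the used ring,
                                        * the driver MUST NOT touch the
                                        * chain or the memory it refers
                                        * to (absent an override)      */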



In the virtio-snd spec there would be a normative statement like:

5.14.6.8.1.1  The device MUST NOT read from available device-readable 
buffers beyond the first buffer_bytes / period_bytes periods.


5.14.6.8.1.2  The driver MAY write to device-readable buffers beyond the 
first buffer_bytes / period_bytes periods, even after offering them to 
the device.
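(As a worked example with assumed values: period_bytes = 2048 and 
buffer_bytes = 8192 give 8192 / 2048 = 4 periods, so the device may read 
at most the first four queued periods, while the driver remains free to 
rewrite any queued period beyond those four, even after offering it.)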




As an aside, here are two other statements that have a similar issue:

- 2.6.1.1.2 "the driver MAY release any resource associated with that
virtqueue" (instead 2.6.1.1.1 should have something like "After a queue 
has been reset by the driver, the device MUST NOT access any resource 
associated with a virtqueue").


- 2.7.5.1 "[the device] MAY do so for debugging or diagnostic purposes" 
(this is not normative and can be just "may")



Thanks,

Paolo



Re: [virtio-comment] virtio-sound linux driver conformance to spec

2023-09-13 Thread Paolo Bonzini
On Wed, Sep 13, 2023 at 5:05 PM Matias Ezequiel Vara Larsen wrote:
>
> Hello,
>
> This email is to report a behavior of the Linux virtio-sound driver that
> looks like it is not conforming to the VirtIO specification. The kernel
> driver is moving buffers from the used ring to the available ring
> without knowing if the content has been updated from the user. If the
> device picks up buffers from the available ring just after it is
> notified, it happens that the content is old. This problem can be fixed
> by waiting a period of time after the device dequeues a buffer from the
> available ring. The driver should not be allowed to change the content
> of a buffer in the available ring. When buffers are enqueued in the
> available ring, the device can consume them immediately.
>
> 1. Is the actual driver implementation following the spec?

If I understand the issue correctly, it's not. As you say, absent any
clarification a buffer that has been placed in the available ring
should be unmodifiable; the driver should operate as if the data in
the available ring is copied to an internal buffer instantly when the
buffer id is added to the ring.

I am assuming this is a playback buffer. To clarify, does the driver
expect buffers to be read only as needed, which is a fraction of a
second before the data is played back?

> 2. Shall the spec be extended to support such a use-case?

If the driver places N buffers, I think it's a reasonable expectation (for the
sound device only!) that the Nth isn't read until the (N-1)th has
started playing. With two buffers only, the behavior you specify would
not be permissible (because as soon as buffer 1 starts playing, the
device can read buffer 2; there is never an idle buffer). With three
buffers, you could write buffer 3 while buffer 1 plays; write buffer 1
while buffer 2 plays; and write buffer 2 while buffer 3 plays. Is this
enough?
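
To illustrate, a minimal sketch of that three-buffer rotation (buf[],
wait_until_playing() and write_pcm() are assumed helper names):

/* With three queued buffers, the device may already be reading the
 * buffer after the one being played, so the only safely writable
 * buffer is the one *behind* the playing position. */
for (unsigned int i = 0; ; i = (i + 1) % 3) {
        wait_until_playing(i);         /* buffer i started playing    */
        write_pcm(buf[(i + 2) % 3]);   /* refill the buffer behind it */
}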

That said, being reasonable isn't enough for virtio-sound to do it and
deviate from other virtio devices. Why does the Linux driver behave
like this? Is it somehow constrained by the kernel->userspace APIs?

Paolo


Re: [PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"

2023-07-12 Thread Paolo Bonzini

On 7/12/23 15:40, Christoph Hellwig wrote:

The problem is that the SCSI stack does not send this command, so we
should do it in the driver. In fact we do it for
VIRTIO_SCSI_EVT_RESET_RESCAN (hotplug), but not for
VIRTIO_SCSI_EVT_RESET_REMOVED (hotunplug).


No, you should absolutely not do it in the driver.  The fact that
virtio-scsi even tries to do some of its own LUN scanning is
problematic and should have never happened.


I agree that it should not do it for hot-unplug.  However, for hot-plug 
the spec says that a hotplug event for LUN 0 represents the addition of 
an entire target, so why is it incorrect to start a REPORT LUNS scan if 
the host doesn't tell you the exact LUN(s) that have been added?


There is a similar case in mpi3mr/mpi3mr_os.c, though it's only scanning 
for newly added devices after a controller reset.


Paolo



Re: [PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"

2023-07-12 Thread Paolo Bonzini

On 7/11/23 19:06, Stefano Garzarella wrote:

CCing `./scripts/get_maintainer.pl -f drivers/scsi/virtio_scsi.c`,
since I found a few things in the virtio-scsi driver...

FYI we have seen that Linux has problems with a QEMU patch for the
virtio-scsi device (details at the bottom of this email in the revert
commit message and BZ).


This is what I found when I looked at the Linux code:

In scsi_report_sense() in linux/drivers/scsi/scsi_error.c, Linux calls
scsi_report_lun_change(), which sets `sdev_target->expecting_lun_change =
1` when we receive a UNIT ATTENTION with REPORT LUNS CHANGED
(sshdr->asc == 0x3f && sshdr->ascq == 0x0e).

When `sdev_target->expecting_lun_change = 1` is set and we call
scsi_check_sense(), for example to check the next UNIT ATTENTION, it
will return NEEDS_RETRY, which I think causes the issues we are
seeing.

`sdev_target->expecting_lun_change` is reset only in
scsi_decide_disposition() when `REPORT_LUNS` command returns with
SAM_STAT_GOOD.
That command is issued in scsi_report_lun_scan() called by
__scsi_scan_target(), called for example by scsi_scan_target(),
scsi_scan_host(), etc.
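
Condensed into a sketch (paraphrasing the kernel paths above, not
verbatim code; cmd_was_report_luns is a placeholder):

/* scsi_report_sense(): latch the REPORT LUNS CHANGED unit attention */
if (sshdr->asc == 0x3f && sshdr->ascq == 0x0e)
        sdev_target->expecting_lun_change = 1;

/* scsi_check_sense(): while latched, further UNIT ATTENTIONs retry */
if (sdev_target->expecting_lun_change)
        return NEEDS_RETRY;

/* scsi_decide_disposition(): only REPORT_LUNS + SAM_STAT_GOOD clears it */
if (cmd_was_report_luns && status == SAM_STAT_GOOD)
        sdev_target->expecting_lun_change = 0;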

So, checking QEMU, we send VIRTIO_SCSI_EVT_RESET_RESCAN during hotplug
and VIRTIO_SCSI_EVT_RESET_REMOVED during hotunplug. In both cases we
now also send the UNIT ATTENTION.

In the virtio-scsi driver, when we receive VIRTIO_SCSI_EVT_RESET_RESCAN
(hotplug) we call scsi_scan_target() or scsi_add_device(). Both of them
will call __scsi_scan_target() at some point, sending the `REPORT_LUNS`
command to the device. This does not happen for
VIRTIO_SCSI_EVT_RESET_REMOVED (hotunplug). Indeed, if I remove the
UNIT ATTENTION from the hotunplug in QEMU, everything works well.

So, I tried to add a scan also for VIRTIO_SCSI_EVT_RESET_REMOVED:


The point of having the event queue is to avoid expensive scans of the 
entire host, so I don't think this is the right thing to do.


On the Linux side, one change we might do is to remove the printk for 
adapters that do process hotplug/hotunplug, using a new flag in 
scsi_host_template.  There are several callers of scsi_add_device() and 
scsi_remove_device() in adapter code, so at least these should not issue 
the printk:


drivers/scsi/aacraid/commsup.c
drivers/scsi/arcmsr/arcmsr_hba.c
drivers/scsi/esas2r/esas2r_main.c
drivers/scsi/hpsa.c
drivers/scsi/ipr.c
drivers/scsi/megaraid/megaraid_sas_base.c
drivers/scsi/mvumi.c
drivers/scsi/pmcraid.c
drivers/scsi/smartpqi/smartpqi_init.c
drivers/scsi/virtio_scsi.c
drivers/scsi/vmw_pvscsi.c
drivers/scsi/xen-scsifront.c

Paolo


Another thing I noticed is that in QEMU maybe we should set the UNIT
ATTENTION first and then send the event on the virtqueue, because the
scan should happen after the unit attention, but I don't know whether
the unit attention is always processed before the virtqueue event.

I mean something like this:

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 45b95ea070..13db40f4f3 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1079,8 +1079,8 @@ static void virtio_scsi_hotplug(HotplugHandler *hotplug_dev, DeviceState *dev,

  };

  virtio_scsi_acquire(s);
-    virtio_scsi_push_event(s, &info);
  scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
+    virtio_scsi_push_event(s, &info);
  virtio_scsi_release(s);
  }
  }
@@ -,8 +,8 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev,


  if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) {
  virtio_scsi_acquire(s);
-    virtio_scsi_push_event(s, &info);
  scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED));
+    virtio_scsi_push_event(s, &info);
  virtio_scsi_release(s);
  }
  }

At this point I think the problem is in the handling of the
VIRTIO_SCSI_EVT_RESET_REMOVED event in the virtio-scsi driver, where
somehow we have to redo the bus scan, but scsi_scan_host() doesn't seem
to be enough when the event rate is very high.

I don't know if along with this fix, we also need to limit the rate in
QEMU somehow.

Sorry for the length of this email, but I'm not familiar with SCSI and
wanted some suggestions on how to proceed.

Paolo, Stefan, Linux SCSI maintainers, any suggestion?


Thanks,
Stefano

On Wed, Jul 05, 2023 at 09:15:23AM +0200, Stefano Garzarella wrote:

This reverts commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2.

That commit causes several problems in Linux as described in the BZ.
In particular, after a while, other devices on the bus are no longer
usable even if those devices are not affected by the hotunplug.
This may be a problem in Linux, but we have not been able to identify
it so far. So better to revert this patch until we find a solution.

Also, Oracle, which initially proposed this patch for a problem with
Solaris, seems to have already reverted it downstream:
   https://linux.oracle.com/errata/ELSA-2023-12065.html

Suggested-by: Thomas Huth 
Buglink: https://bugzilla.redhat

Re: [PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"

2023-07-12 Thread Paolo Bonzini

On 7/11/23 22:21, Mike Christie wrote:

What was the issue you are seeing?

Was it something like: you get the UA, we retry, then on one of the
retries the sense is not set up correctly, so the SCSI error handler
runs? That fails and the device goes offline?

If you turn on scsi debugging you would see:


[  335.445922] sd 0:0:0:0: [sda] tag#15 Add. Sense: Reported luns data has 
changed
[  335.445922] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445925] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445929] sd 0:0:0:0: [sda] tag#17 Done: FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK cmd_age=0s
[  335.445932] sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 00 db 4f c0 00 00 
20 00
[  335.445934] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445936] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445938] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445940] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445942] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445945] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.451447] scsi host0: scsi_eh_0: waking up 0/2/2
[  335.451453] scsi host0: Total of 2 commands on 1 devices require eh work
[  335.451457] sd 0:0:0:0: [sda] tag#16 scsi_eh_0: requesting sense


Does this log come from internal discussions within Oracle?


I don't know the qemu scsi code well, but I scanned the code for my co-worker
and my guess was commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 had a race in 
it.

How is locking done when it is a bus-level UA but there are multiple devices
on the bus?


No locking should be necessary, the code is single threaded.  However, 
what can happen is that two consecutive calls to 
virtio_scsi_handle_cmd_req_prepare use the unit attention ReqOps, and 
then the second virtio_scsi_handle_cmd_req_submit finds no unit 
attention (see the loop in virtio_scsi_handle_cmd_vq).  That can 
definitely explain the log above.
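
An illustrative timeline of that race (a sketch using the function names
above):

/* cmd A: virtio_scsi_handle_cmd_req_prepare() -> UA pending, pick the
 *        unit-attention ReqOps
 * cmd B: virtio_scsi_handle_cmd_req_prepare() -> UA still pending, pick
 *        the unit-attention ReqOps too
 * cmd A: virtio_scsi_handle_cmd_req_submit()  -> reports the UA and
 *        clears it
 * cmd B: virtio_scsi_handle_cmd_req_submit()  -> finds no UA left, and
 *        returns the empty sense seen in the log
 */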


Paolo



Re: [PATCH] scsi: virtio_scsi: Remove a useless function call

2023-05-29 Thread Paolo Bonzini

On 5/29/23 09:35, Christophe JAILLET wrote:

'inq_result' is known to be NULL. There is no point calling kfree().

Signed-off-by: Christophe JAILLET 
---
  drivers/scsi/virtio_scsi.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 58498da9869a..bd5633667d01 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -338,10 +338,8 @@ static int virtscsi_rescan_hotunplug(struct virtio_scsi *vscsi)
 	int result, inquiry_len, inq_result_len = 256;
 	char *inq_result = kmalloc(inq_result_len, GFP_KERNEL);
 
-	if (!inq_result) {
-		kfree(inq_result);
+	if (!inq_result)
 		return -ENOMEM;
-	}
 
 	shost_for_each_device(sdev, shost) {
 		inquiry_len = sdev->inquiry_len ? sdev->inquiry_len : 36;


Reviewed-by: Paolo Bonzini 



Re: [PATCH v4 0/8] Introduce akcipher service for virtio-crypto

2022-04-12 Thread Paolo Bonzini




In our plan, the feature is designed for the HTTPS offloading case and
other applications that use kernel RSA/ECDSA via the keyctl syscall.


Hi Zhenwei,

what is the % of time spent doing asymmetric key operations in your
benchmark?  I am not very familiar with crypto acceleration but my
understanding has always been that most time is spent doing either
hashing (for signing) or symmetric key operations (for encryption).

If I understand correctly, without support for acceleration these 
patches are more of a demonstration of virtio-crypto, or usable for 
testing purposes.


Would it be possible to extend virtio-crypto to use keys already in the
host keyctl, or in a PKCS#11 smartcard, so that virtio-crypto could also
provide the functionality of an HSM?  Or does the standard require that
the keys are provided by the guest?

Paolo


Re: [PATCH] scsi: virtio_scsi: Fix a NULL pointer dereference in virtscsi_rescan_hotunplug()

2021-11-30 Thread Paolo Bonzini

On 11/30/21 18:19, Zhou Qingyang wrote:

--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -337,7 +337,11 @@ static void virtscsi_rescan_hotunplug(struct virtio_scsi *vscsi)
unsigned char scsi_cmd[MAX_COMMAND_SIZE];
int result, inquiry_len, inq_result_len = 256;
char *inq_result = kmalloc(inq_result_len, GFP_KERNEL);
-
+   if (!inq_result) {
+   pr_err("%s:no enough memory for inq_result\n",
+   __func__);
+   return;
+   }
shost_for_each_device(sdev, shost) {
inquiry_len = sdev->inquiry_len ? sdev->inquiry_len : 36;
  


In practice this will never happen, since the kmalloc is very small, so 
I think it's easier to just return early without a printk.  On the other 
hand, if the out-of-memory really could happen, this should be a 
pr_err_ratelimited.
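
Concretely, one of these two (a sketch of the options above):

	/* option 1: the allocation is tiny, failure is not realistic */
	if (!inq_result)
		return;

	/* option 2: if out-of-memory really could happen */
	if (!inq_result) {
		pr_err_ratelimited("%s: failed to allocate inq_result\n",
				   __func__);
		return;
	}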


Paolo


Re: [PATCH] Fix SEV-ES INS/OUTS instructions for word, dword, and qword.

2021-11-18 Thread Paolo Bonzini

On 11/18/21 03:13, Michael Sterritt wrote:

Properly type the operands being passed to __put_user()/__get_user().
Otherwise, these routines truncate data for dependent instructions
(e.g., INSW) and only read/write one byte.

Tested: Tested by sending a string with `REP OUTSW` to a port and then
reading it back in with `REP INSW` on the same port. Previous behavior
was to only send and receive the first char of the size. For example,
word operations for "abcd" would only read/write "ac". With change, the
full string is now written and read back.

Signed-off-by: Michael Sterritt 
Reviewed-by: Marc Orr 
Reviewed-by: Peter Gonda 
---
  arch/x86/kernel/sev.c | 57 +--
  1 file changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 74f0ec955384..a9fc2ac7a8bd 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -294,11 +294,6 @@ static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
   char *dst, char *buf, size_t size)
  {
unsigned long error_code = X86_PF_PROT | X86_PF_WRITE;
-   char __user *target = (char __user *)dst;
-   u64 d8;
-   u32 d4;
-   u16 d2;
-   u8  d1;
  
 	/*
 	 * This function uses __put_user() independent of whether kernel or user
@@ -320,26 +315,42 @@ static enum es_result vc_write_mem(struct es_em_ctxt *ctxt,
 	 * instructions here would cause infinite nesting.
 	 */
switch (size) {
-   case 1:
+   case 1: {
+   u8 d1;
+   u8 __user *target = (u8 __user *)dst;
+
memcpy(&d1, buf, 1);
if (__put_user(d1, target))
goto fault;
break;
-   case 2:
+   }
+   case 2: {
+   u16 d2;
+   u16 __user *target = (u16 __user *)dst;
+
memcpy(&d2, buf, 2);
if (__put_user(d2, target))
goto fault;
break;
-   case 4:
+   }
+   case 4: {
+   u32 d4;
+   u32 __user *target = (u32 __user *)dst;
+
memcpy(&d4, buf, 4);
if (__put_user(d4, target))
goto fault;
break;
-   case 8:
+   }
+   case 8: {
+   u64 d8;
+   u64 __user *target = (u64 __user *)dst;
+
memcpy(&d8, buf, 8);
if (__put_user(d8, target))
goto fault;
break;
+   }
default:
WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
return ES_UNSUPPORTED;
@@ -362,11 +373,6 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
  char *src, char *buf, size_t size)
  {
unsigned long error_code = X86_PF_PROT;
-   char __user *s = (char __user *)src;
-   u64 d8;
-   u32 d4;
-   u16 d2;
-   u8  d1;
  
 	/*
 	 * This function uses __get_user() independent of whether kernel or user
@@ -388,26 +394,41 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
 	 * instructions here would cause infinite nesting.
 	 */
switch (size) {
-   case 1:
+   case 1: {
+   u8 d1;
+   u8 __user *s = (u8 __user *)src;
+
if (__get_user(d1, s))
goto fault;
memcpy(buf, &d1, 1);
break;
-   case 2:
+   }
+   case 2: {
+   u16 d2;
+   u16 __user *s = (u16 __user *)src;
+
if (__get_user(d2, s))
goto fault;
memcpy(buf, &d2, 2);
break;
-   case 4:
+   }
+   case 4: {
+   u32 d4;
+   u32 __user *s = (u32 __user *)src;
+
if (__get_user(d4, s))
goto fault;
memcpy(buf, &d4, 4);
break;
-   case 8:
+   }
+   case 8: {
+   u64 d8;
+   u64 __user *s = (u64 __user *)src;
if (__get_user(d8, s))
goto fault;
memcpy(buf, &d8, 8);
break;
+   }
default:
WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size);
return ES_UNSUPPORTED;



Reviewed-by: Paolo Bonzini 
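
As background on why the retyping matters (a minimal sketch; dst is the
destination pointer from the patch): __put_user() and __get_user()
derive the access size from the pointee type of the pointer argument, so
the old char __user * casts truncated every access to one byte:

u16 val = 0x6261;                       /* "ab" on little-endian      */
char __user *p8  = (char __user *)dst;
u16 __user  *p16 = (u16 __user *)dst;

__put_user(val, p8);    /* stores 1 byte, 0x61: the old behavior      */
__put_user(val, p16);   /* stores both bytes: the fixed behavior      */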



Re: [RFC] hypercall-vsock: add a new vsock transport

2021-11-11 Thread Paolo Bonzini

On 11/11/21 09:14, Wang, Wei W wrote:
Adding Andra and Sergio, because IIRC Firecracker and libkrun
emulate virtio-vsock with virtio-mmio, so the implementation
should be simple and also not directly tied to a specific VMM.

OK. This would be OK for KVM-based guests. But Hyper-V and VMware
based guests don't have virtio-mmio support, so if the MigTD (a
special guest) we provide is based on virtio-mmio, it would not be
usable for them.


Hyper-V and VMware (and KVM) would have to add support for
hypercall-vsock anyway.  Why can't they just implement a subset of
virtio-mmio?  It's not hard and there's even plenty of permissively-
licensed code in the various VMMs for the *BSDs.

In fact, instead of defining your own transport for vsock, my first idea
would have been the opposite: reuse virtio-mmio for the registers and
the virtqueue format, and define your own virtio device for the MigTD!

Thanks,

Paolo



Re: [PATCH 1/1] virtio: disable partitions scanning for no partitions block

2021-07-15 Thread Paolo Bonzini

On 15/07/21 11:47, Yury Kamenev wrote:

+#ifdef CONFIG_VIRTIO_BLK_NO_PART_SCAN
+   if (unlikely(partitions_scanning_disable))
+   /* disable partitions scanning if it was stated in virtio features*/
+   if (virtio_has_feature(vdev, VIRTIO_BLK_F_NO_PART_SCAN))
+   vblk->disk->flags |= GENHD_FL_NO_PART_SCAN;
+#endif
+


Has this been added to the spec?  It doesn't seem like a good idea, as 
pointed out by Stefan[1], Christoph[2] and myself[3].


Paolo

[1] 
https://lore.kernel.org/linux-block/20210524145654.ga2...@lst.de/T/#m2697cb41578490aad49ed1d8fa6604bf0924b54d
[2] 
https://lore.kernel.org/linux-block/20210524145654.ga2...@lst.de/T/#mc59329fd824102f94ac2f6b29fe94a652849aca0
[3] 
https://lore.kernel.org/linux-block/20210524145654.ga2...@lst.de/T/#mee6787c4fd87790b64feccc9e77fd5f618c2c336




Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-25 Thread Paolo Bonzini

On 25/05/21 09:38, Ming Lei wrote:

On Tue, May 25, 2021 at 09:22:48AM +0200, Paolo Bonzini wrote:

On 24/05/21 16:59, Christoph Hellwig wrote:

On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:

Possible drawbacks of this approach:

- Hardware virtio_blk implementations may find virtqueue_disable_cb()
expensive since it requires DMA. If such devices become popular then
the virtio_blk driver could use a similar approach to NVMe when
VIRTIO_F_ACCESS_PLATFORM is detected in the future.

- If a blk_poll() thread is descheduled it not only hurts polling
performance but also delays completion of non-REQ_HIPRI requests on
that virtqueue since vq notifications are disabled.


Yes, I think this is a dangerous configuration.  What argument exists
again just using dedicated poll queues?


There isn't an equivalent of the admin queue in virtio-blk, which would
allow the guest to configure the desired number of poll queues.  The number
of queues is fixed.


Dedicated vqs can be used for poll only, and I understand the VM needn't
know whether a vq is polled or driven by IRQ.

I tried that in v5.4, but did not see an obvious IOPS boost, so I gave up.

https://github.com/ming1/linux/commits/my_v5.4-virtio-irq-poll


Sure, but polling can be beneficial even for a single queue.  Queues 
have a cost on the host side as well, so a 1 vCPU - 1 queue model may 
not always be the best.


Paolo



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-25 Thread Paolo Bonzini

On 24/05/21 16:59, Christoph Hellwig wrote:

On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:

Possible drawbacks of this approach:

- Hardware virtio_blk implementations may find virtqueue_disable_cb()
   expensive since it requires DMA. If such devices become popular then
   the virtio_blk driver could use a similar approach to NVMe when
   VIRTIO_F_ACCESS_PLATFORM is detected in the future.

- If a blk_poll() thread is descheduled it not only hurts polling
   performance but also delays completion of non-REQ_HIPRI requests on
   that virtqueue since vq notifications are disabled.


Yes, I think this is a dangerous configuration.  What argument exists
again just using dedicated poll queues?


There isn't an equivalent of the admin queue in virtio-blk, which would 
allow the guest to configure the desired number of poll queues.  The 
number of queues is fixed.


Could the blk_poll() thread use preempt notifiers to enable/disable 
callbacks, for example using two new .poll_start and .end_poll callbacks 
to struct blk_mq_ops?
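
Sketched out, the idea would look something like this (hypothetical
fields; struct blk_mq_ops has no such callbacks today):

struct blk_mq_ops {
	/* ... existing callbacks ... */

	/* hypothetical: called when a blk_poll() thread begins polling or
	 * is scheduled back in, so the driver can disable vq callbacks */
	void (*poll_start)(struct blk_mq_hw_ctx *hctx);

	/* hypothetical: called when polling ends or the thread is
	 * descheduled, so the driver can re-enable vq callbacks */
	void (*end_poll)(struct blk_mq_hw_ctx *hctx);
};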


Paolo



Re: [PATCH 1/1] virtio: disable partitions scanning for no partitions block

2021-05-24 Thread Paolo Bonzini

On 24/05/21 21:34, Юрий Каменев wrote:

Hi

Is your goal to avoid accidentally detecting partitions because it's
confusing when that happens?

The main goal is reducing the kernel start time. It might be useful 
in tiny systems that use, for example, squashfs images with certainly no 
partitions. Disabling partition scanning for these images can save a 
few tens of milliseconds, which can be a significant acceleration for 
starting such systems.


Perhaps that could be configured in the image, for example in the kernel 
command line?


Paolo


24.05.2021, 17:29, "Stefan Hajnoczi" :

On Thu, May 20, 2021 at 04:39:08PM +0300, Yury Kamenev wrote:

Hi,
Is there a VIRTIO spec change for the new VIRTIO_BLK_F_NO_PS feature
bit? Please send one:
https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#feedback



GENHD_FL_NO_PART_SCAN is not used much in other drivers. This makes me
wonder if the same use case is addressed through other means with SCSI,
NVMe, etc devices. Maybe Christoph or Jens can weigh in on whether
adding a bit to disable partition scanning for a virtio-blk fits into
the big picture?

Is your goal to avoid accidentally detecting partitions because it's
confusing when that happens?

VIRTIO is currently undergoing auditing and changes to support untrusted
devices. From that perspective adding a device feature bit to disable
partition scanning does not help protect the guest from an untrusted
disk. The guest cannot trust the device, instead the guest itself would
need to be configured to avoid partition scanning of untrusted devices.

Stefan

  Signed-off-by: Yury Kamenev <dam...@yandex-team.ru>
  ---
   drivers/block/virtio_blk.c | 6 ++
   include/uapi/linux/virtio_blk.h | 1 +
   2 files changed, 7 insertions(+)

  diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
  index b9fa3ef5b57c..17edcfee2208 100644
  --- a/drivers/block/virtio_blk.c
  +++ b/drivers/block/virtio_blk.c
  @@ -799,6 +799,10 @@ static int virtblk_probe(struct virtio_device *vdev)
   vblk->disk->flags |= GENHD_FL_EXT_DEVT;
   vblk->index = index;

  + /*Disable partitions scanning for no-partitions block*/


Formatting cleanup and rephrasing:

   /* Disable partition scanning for devices with no partitions */

  + if (virtio_has_feature(vdev, VIRTIO_BLK_F_NO_PS))


I suggest using a more obvious name:

   VIRTIO_BLK_F_NO_PART_SCAN

  + vblk->disk->flags |= GENHD_FL_NO_PART_SCAN;
  +
   /* configure queue flush support */
   virtblk_update_cache_mode(vdev);

  @@ -977,6 +981,7 @@ static unsigned int features_legacy[] = {
   VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
   VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
   VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
  + VIRTIO_BLK_F_NO_PS,
   }
   ;
   static unsigned int features[] = {
  @@ -984,6 +989,7 @@ static unsigned int features[] = {
   VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
   VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
   VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
  + VIRTIO_BLK_F_NO_PS,
   };

   static struct virtio_driver virtio_blk = {
  diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
  index d888f013d9ff..f197d07afb05 100644
  --- a/include/uapi/linux/virtio_blk.h
  +++ b/include/uapi/linux/virtio_blk.h
  @@ -40,6 +40,7 @@
   #define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
   #define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
   #define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
  +#define VIRTIO_BLK_F_NO_PS 16 /* No partitions */

   /* Legacy feature bits */
   #ifndef VIRTIO_BLK_NO_LEGACY
  --
  2.24.3 (Apple Git-128)




Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-12-04 Thread Paolo Bonzini

On 04/12/20 16:49, Sasha Levin wrote:

On Fri, Dec 04, 2020 at 09:27:28AM +0100, Paolo Bonzini wrote:

On 01/12/20 00:59, Sasha Levin wrote:


It's quite easy to NAK a patch too, just reply saying "no" and it'll be
dropped (just like this patch was dropped right after your first reply)
so the burden on maintainers is minimal.


The maintainers are _already_ marking patches with "Cc: stable".  That 


They're not, though. Some forget, some subsystems don't mark anything,
some don't mark it as it's not stable material when it lands in their
tree but then it turns out to be one if it sits there for too long.


That means some subsystems will be worse as far as stable release 
support goes.  That's not a problem:


- some subsystems have people paid to do backports to LTS releases when 
patches don't apply; others don't, if the patch doesn't apply the bug is 
simply not fixed in LTS releases


- some subsystems are worse than others even in "normal" releases :)

(plus backports) is where the burden on maintainers should start and 
end.  I don't see the need to second guess them.


This is similar to describing our CI infrastructure as "second
guessing": why are we second guessing authors and maintainers who are
obviously doing the right thing by testing their patches and reporting
issues to them?


No, it's not the same.  CI helps find bugs before you have to waste 
time bisecting regressions across thousands of commits.  The 
lack of stable tags _can_ certainly be a problem, but it solves itself 
sooner or later when people upgrade their kernel.



Are you saying that you have always gotten stable tags right? never
missed a stable tag where one should go?


Of course I did, just like I have introduced bugs.  But at least I try 
to do my best both at adding stable tags and at not introducing bugs.


Paolo


Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-12-04 Thread Paolo Bonzini

On 01/12/20 00:59, Sasha Levin wrote:


It's quite easy to NAK a patch too, just reply saying "no" and it'll be
dropped (just like this patch was dropped right after your first reply)
so the burden on maintainers is minimal.


The maintainers are _already_ marking patches with "Cc: stable".  That 
(plus backports) is where the burden on maintainers should start and 
end.  I don't see the need to second guess them.


Paolo



Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-30 Thread Paolo Bonzini

On 30/11/20 20:44, Mike Christie wrote:

I have never seen a public/open-source vhost-scsi testsuite.

For patch 23 (the one that adds the lun reset support which is built on
patch 22), we can't add it to stable right now if you wanted to, because
it has a bug in it. Michael T, sent the fix:

https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?h=linux-next&id=b4fffc177fad3c99ee049611a508ca9561bb6871

to Linus today.


Ok, so at least it was only a close call and anyway not for something 
that most people would be running on their machines.  But it still seems 
to me that the state of CI in Linux is abysmal compared to what is 
needed to arbitrarily(*) pick up patches and commit them to "stable" trees.


Paolo

(*) A ML bot is an arbitrary choice as far as we are concerned since we 
cannot know how it makes a decision.




Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-30 Thread Paolo Bonzini

On 30/11/20 18:38, Sasha Levin wrote:
I am not aware of any public CI being done _at all_ on vhost-scsi, 
by CKI or anyone else.  So autoselection should be done only on 
subsystems that have very high coverage in CI.


Where can I find a testsuite for virtio/vhost? I see one for KVM, but
where is the one that the maintainers of virtio/vhost run on patches
that come in?


I don't know of any, especially for vhost-scsi.  MikeC?

Paolo


Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-30 Thread Paolo Bonzini

On 30/11/20 14:57, Greg KH wrote:

Every patch should be "fixing a real issue"---even a new feature.  But the
larger the patch, the more the submitters and maintainers should be trusted
rather than a bot.  The line between feature and bugfix _sometimes_ is
blurry, I would say that in this case it's not, and it makes me question how
the bot decided that this patch would be acceptable for stable (which AFAIK
is not something that can be answered).

I thought that earlier Sasha said that this patch was needed as a
prerequisite patch for a later fix, right?  If not, sorry, I've lost the
train of thought in this thread...


Yeah---sorry I am replying to 22/33 but referring to 23/33, which is the 
one that in my opinion should not be blindly accepted for stable kernels 
without the agreement of the submitter or maintainer.


Paolo



Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-30 Thread Paolo Bonzini

On 30/11/20 14:28, Greg KH wrote:

Lines of code is not everything. If you think that this needs additional
testing then that's fine and we can drop it, but not picking up a fix
just because it's 120 lines is not something we'd do.

Starting with the first two steps in stable-kernel-rules.rst:

Rules on what kind of patches are accepted, and which ones are not, into the
"-stable" tree:

  - It must be obviously correct and tested.
  - It cannot be bigger than 100 lines, with context.

We do obviously take patches that are bigger than 100 lines, as there
are always exceptions to the rules here.  Look at all of the
spectre/meltdown patches as one such example.  Should we refuse a patch
just because it fixes a real issue yet is 101 lines long?


Every patch should be "fixing a real issue"---even a new feature.  But 
the larger the patch, the more the submitters and maintainers should be 
trusted rather than a bot.  The line between feature and bugfix 
_sometimes_ is blurry, I would say that in this case it's not, and it 
makes me question how the bot decided that this patch would be 
acceptable for stable (which AFAIK is not something that can be answered).


Paolo



Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-30 Thread Paolo Bonzini

On 29/11/20 22:06, Sasha Levin wrote:

On Sun, Nov 29, 2020 at 06:34:01PM +0100, Paolo Bonzini wrote:

On 29/11/20 05:13, Sasha Levin wrote:

Which doesn't seem to be suitable for stable either...  Patch 3/5 in


Why not? It was sent as a fix to Linus.


Dunno, 120 lines of new code?  Even if it's okay for an rc, I don't 
see why it would be backported to stable releases and released 
without any kind of testing.  Maybe for 5.9 the chances of breaking 


Lines of code is not everything. If you think that this needs additional
testing then that's fine and we can drop it, but not picking up a fix
just because it's 120 lines is not something we'd do.


Starting with the first two steps in stable-kernel-rules.rst:

Rules on what kind of patches are accepted, and which ones are not, into 
the "-stable" tree:


 - It must be obviously correct and tested.
 - It cannot be bigger than 100 lines, with context.


Plus all the testing we have for the stable trees, yes. It goes beyond
just compiling at this point.

Your very own co-workers (https://cki-project.org/) are pushing hard on
this effort around stable kernel testing, and statements like these
aren't helping anyone.


I am not aware of any public CI being done _at all_ on vhost-scsi, 
by CKI or anyone else.  So autoselection should be done only on 
subsystems that have very high coverage in CI.


Paolo


Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-29 Thread Paolo Bonzini

On 29/11/20 05:13, Sasha Levin wrote:
Which doesn't seem to be suitable for stable either...  Patch 3/5 in 


Why not? It was sent as a fix to Linus.


Dunno, 120 lines of new code?  Even if it's okay for an rc, I don't see 
why it would be backported to stable releases and released without 
any kind of testing.  Maybe for 5.9 the chances of breaking things are 
low, but stuff like locking rules might have changed since older 
releases like 5.4 or 4.19.  The autoselection bot does not know that; it 
basically crosses its fingers that these larger-scale changes will cause 
the patches not to apply or compile anymore.


Maybe it's just me, but the whole "autoselect stable patches and 
release them" process is very suspicious.  You are basically crossing 
your fingers and are ready to release any kind of untested crap, because 
you do not trust maintainers to mark stable patches correctly.  And 
then, when a backport is broken, it's the maintainers who get the blame 
and have to fix it.


Personally I don't care because I have asked you to opt KVM out of 
autoselection, but this is the opposite of what Greg brags about when he 
touts the virtues of the upstream stable process over vendor kernels.


Paolo

the series might be (vhost scsi: fix cmd completion race), so I can 
understand including 1/5 and 2/5 just in case, but not the rest.  Does 
the bot not understand diffstats?


Not on their own, no. What's wrong with the diffstats?




Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-25 Thread Paolo Bonzini

On 25/11/20 19:01, Sasha Levin wrote:

On Wed, Nov 25, 2020 at 06:48:21PM +0100, Paolo Bonzini wrote:

On 25/11/20 16:35, Sasha Levin wrote:

From: Mike Christie 

[ Upstream commit 18f1becb6948cd411fd01968a0a54af63732e73c ]

Move code to parse lun from req's lun_buf to helper, so tmf code
can use it in the next patch.

Signed-off-by: Mike Christie 
Reviewed-by: Paolo Bonzini 
Acked-by: Jason Wang 
Link: 
https://lore.kernel.org/r/1604986403-4931-5-git-send-email-michael.chris...@oracle.com 


Signed-off-by: Michael S. Tsirkin 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Sasha Levin 


This doesn't seem like stable material, does it?


It went in as a dependency for efd838fec17b ("vhost scsi: Add support
for LUN resets."), which is the next patch.


Which doesn't seem to be suitable for stable either...  Patch 3/5 in the 
series might be (vhost scsi: fix cmd completion race), so I can 
understand including 1/5 and 2/5 just in case, but not the rest.  Does 
the bot not understand diffstats?


Paolo



Re: [PATCH AUTOSEL 5.9 22/33] vhost scsi: add lun parser helper

2020-11-25 Thread Paolo Bonzini

On 25/11/20 16:35, Sasha Levin wrote:

From: Mike Christie 

[ Upstream commit 18f1becb6948cd411fd01968a0a54af63732e73c ]

Move code to parse lun from req's lun_buf to helper, so tmf code
can use it in the next patch.

Signed-off-by: Mike Christie 
Reviewed-by: Paolo Bonzini 
Acked-by: Jason Wang 
Link: 
https://lore.kernel.org/r/1604986403-4931-5-git-send-email-michael.chris...@oracle.com
Signed-off-by: Michael S. Tsirkin 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Sasha Levin 


This doesn't seem like stable material, does it?

Paolo


---
  drivers/vhost/scsi.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 5d8850f5aef16..ed7dc6b998f65 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -898,6 +898,11 @@ vhost_scsi_get_req(struct vhost_virtqueue *vq, struct vhost_scsi_ctx *vc,
 	return ret;
 }
 
+static u16 vhost_buf_to_lun(u8 *lun_buf)
+{
+	return ((lun_buf[2] << 8) | lun_buf[3]) & 0x3FFF;
+}
+
 static void
 vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 {
@@ -1036,12 +1041,12 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		tag = vhost64_to_cpu(vq, v_req_pi.tag);
 		task_attr = v_req_pi.task_attr;
 		cdb = &v_req_pi.cdb[0];
-		lun = ((v_req_pi.lun[2] << 8) | v_req_pi.lun[3]) & 0x3FFF;
+		lun = vhost_buf_to_lun(v_req_pi.lun);
 	} else {
 		tag = vhost64_to_cpu(vq, v_req.tag);
 		task_attr = v_req.task_attr;
 		cdb = &v_req.cdb[0];
-		lun = ((v_req.lun[2] << 8) | v_req.lun[3]) & 0x3FFF;
+		lun = vhost_buf_to_lun(v_req.lun);
 	}
 	/*
 	 * Check that the received CDB size does not exceeded our





Re: [PATCH 09/17] vhost scsi: fix cmd completion race

2020-10-30 Thread Paolo Bonzini
On 30/10/20 09:51, Michael S. Tsirkin wrote:
> On Wed, Oct 21, 2020 at 07:34:55PM -0500, Mike Christie wrote:
>> We might not do the final se_cmd put from vhost_scsi_complete_cmd_work.
>> When the last put happens a little later then we could race where
>> vhost_scsi_complete_cmd_work does vhost_signal, the guest runs and sends
>> more IO, and vhost_scsi_handle_vq runs but does not find any free cmds.
>>
>> This patch has us delay completing the cmd until the last lio core ref
>> is dropped. We then know that once we signal to the guest that the cmd
>> is completed that if it queues a new command it will find a free cmd.
>>
>> Signed-off-by: Mike Christie 
> 
> Paolo, could you review this one?

I don't know how LIO does all the callbacks, honestly (I have only ever
worked on the virtio-scsi driver, not vhost-scsi, and I have only ever
reviewed some virtio-scsi spec bits of vhost-scsi).

The vhost_scsi_complete_cmd_work parts look fine, but I have no idea
about the vhost_scsi_queue_data_in and vhost_scsi_queue_status calls.

Paolo



Re: [PATCH 5/8] vhost scsi: add lun parser helper

2020-09-23 Thread Paolo Bonzini
On 21/09/20 20:23, Mike Christie wrote:
> Move code to parse lun from req's lun_buf to helper, so tmf code
> can use it in the next patch.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/scsi.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 26d0f75..736ce19 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -899,6 +899,11 @@ static void vhost_scsi_submission_work(struct work_struct *work)
>   return ret;
>  }
>  
> +static u16 vhost_buf_to_lun(u8 *lun_buf)
> +{
> + return ((lun_buf[2] << 8) | lun_buf[3]) & 0x3FFF;
> +}
> +
>  static void
>  vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
>  {
> @@ -1037,12 +1042,12 @@ static void vhost_scsi_submission_work(struct work_struct *work)
>   tag = vhost64_to_cpu(vq, v_req_pi.tag);
>   task_attr = v_req_pi.task_attr;
>   cdb = &v_req_pi.cdb[0];
> - lun = ((v_req_pi.lun[2] << 8) | v_req_pi.lun[3]) & 0x3FFF;
> + lun = vhost_buf_to_lun(v_req_pi.lun);
>   } else {
>   tag = vhost64_to_cpu(vq, v_req.tag);
>   task_attr = v_req.task_attr;
>   cdb = &v_req.cdb[0];
> - lun = ((v_req.lun[2] << 8) | v_req.lun[3]) & 0x3FFF;
> + lun = vhost_buf_to_lun(v_req.lun);
>   }
>   /*
>* Check that the received CDB size does not exceeded our
> 

Reviewed-by: Paolo Bonzini 
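
For reference, the bytes decoded here follow virtio-scsi's single-level
LUN format (byte 0 is 1, byte 1 is the target, bytes 2-3 carry the LUN
with the 0x4000 flat-addressing bits set). A worked example with assumed
input:

/* lun_buf = { 0x01, 0x03, 0x40, 0x05 }
 * ((0x40 << 8) | 0x05) & 0x3FFF = 0x4005 & 0x3FFF = 0x0005 -> LUN 5 */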



Re: [PATCH] Rescan the entire target on transport reset when LUN is 0

2020-09-08 Thread Paolo Bonzini
On 28/08/20 14:21, Matej Genci wrote:
> VirtIO 1.0 spec says
> The removed and rescan events ... when sent for LUN 0, they MAY
> apply to the entire target so the driver can ask the initiator
> to rescan the target to detect this.
> 
> This change introduces the behaviour described above by scanning the
> entire scsi target when LUN is set to 0. This is both a functional and a
> performance fix. It aligns the driver with the spec and allows control
> planes to hotplug targets with large numbers of LUNs without having to
> request a RESCAN for each one of them.
> 
> Signed-off-by: Matej Genci 
> Suggested-by: Felipe Franciosi 
> ---
>  drivers/scsi/virtio_scsi.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index bfec84aacd90..a4b9bc7b4b4a 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -284,7 +284,12 @@ static void virtscsi_handle_transport_reset(struct virtio_scsi *vscsi,
>  
>   switch (virtio32_to_cpu(vscsi->vdev, event->reason)) {
>   case VIRTIO_SCSI_EVT_RESET_RESCAN:
> - scsi_add_device(shost, 0, target, lun);
> + if (lun == 0) {
> + scsi_scan_target(&shost->shost_gendev, 0, target,
> +  SCAN_WILD_CARD, SCSI_SCAN_INITIAL);
> + } else {
> + scsi_add_device(shost, 0, target, lun);
> + }
>   break;
>   case VIRTIO_SCSI_EVT_RESET_REMOVED:
>   sdev = scsi_device_lookup(shost, 0, target, lun);
> 


Acked-by: Paolo Bonzini 



Re: [PATCH 1/1] scsi: virtio-scsi: handle correctly case when all LUNs were unplugged

2020-07-29 Thread Paolo Bonzini
On 29/07/20 21:48, Maxim Levitsky wrote:
> Commit 5ff843721467 ("scsi: virtio_scsi: unplug LUNs when events missed")
> almost fixed the case of mass unplugging of LUNs, but it missed a
> corner case in which all the LUNs are unplugged at the same time.
> 
> In this case INQUIRY ends with DID_BAD_TARGET.
> Detect this and unplug the LUN.
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  drivers/scsi/virtio_scsi.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 0e0910c5b9424..c7f0c22b6f11d 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -351,6 +351,16 @@ static void virtscsi_rescan_hotunplug(struct virtio_scsi *vscsi)
>   /* PQ indicates the LUN is not attached */
>   scsi_remove_device(sdev);
>   }
> +
> + else if (host_byte(result) == DID_BAD_TARGET) {
> + /*
> +  * if all LUNs of a virtio-scsi device are unplugged,
> +  * it will respond with BAD TARGET on any INQUIRY
> +  * command.
> +  * Remove the device in this case as well
> +  */
> + scsi_remove_device(sdev);
> + }
>   }
>  
>   kfree(inq_result);
> 

Acked-by: Paolo Bonzini 



Re: [PATCH] scsi: virtio_scsi: Remove unnecessary condition checks

2020-07-10 Thread Paolo Bonzini
On 10/07/20 09:40, Markus Elfring wrote:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/virtio_scsi.c?id=42f82040ee66db13525dc6f14b8559890b2f4c1c#n980
>>>
>>> if (!virtscsi_cmd_cache) {
>>> pr_err("kmem_cache_create() for virtscsi_cmd_cache failed\n");
>>> -   goto error;
>>> +   return -ENOMEM;
>>> }
>>
>> Could be doable, but I don't see a particular benefit.
> 
> Can a bit more “compliance” (with the Linux coding style) matter here?

No.

>> Having a single error loop is an advantage by itself.
> 
> I do not see that a loop is involved in the implementation of the function 
> “init”.

s/loop/label/ sorry.

Paolo


Re: [PATCH] scsi: virtio_scsi: Remove unnecessary condition checks

2020-07-10 Thread Paolo Bonzini
On 10/07/20 08:32, Markus Elfring wrote:
 +  mempool_destroy(virtscsi_cmd_pool);
 +  virtscsi_cmd_pool = NULL;
 +  kmem_cache_destroy(virtscsi_cmd_cache);
 +  virtscsi_cmd_cache = NULL;
return ret;
  }
>>>
>>> How do you think about to add a jump target so that the execution
>>> of a few statements can be avoided according to a previous
>>> null pointer check?
>>
>> The point of the patch is precisely to simplify the code,
> 
> I suggest to reconsider also Linux coding style aspects
> for the implementation of the function “init”.
> https://elixir.bootlin.com/linux/v5.8-rc4/source/drivers/scsi/virtio_scsi.c#L980
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/virtio_scsi.c?id=42f82040ee66db13525dc6f14b8559890b2f4c1c#n980
> 
>   if (!virtscsi_cmd_cache) {
>   pr_err("kmem_cache_create() for virtscsi_cmd_cache failed\n");
> - goto error;
> + return -ENOMEM;
>   }

Could be doable, but I don't see a particular benefit.  Having a single
error loop is an advantage by itself.

The coding style is a suggestion.  Note the difference between

kfree(foo->bar);
kfree(foo);

and

kfree(bar);
kfree(foo);

> See also:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?id=42f82040ee66db13525dc6f14b8559890b2f4c1c#n461
> 
> 
>> executing a couple more instruction is not an issue.
> 
> With which update steps would like to achieve such a code variant?
> 
> destroy_pool:
>   mempool_destroy(virtscsi_cmd_pool);
>   virtscsi_cmd_pool = NULL;
> destroy_cache:
>   kmem_cache_destroy(virtscsi_cmd_cache);
>   virtscsi_cmd_cache = NULL;
>   return ret;

... while there's no advantage in this.

Paolo


Re: [PATCH 12/24] scsi: virtio_scsi: Demote seemingly unintentional kerneldoc header

2020-07-09 Thread Paolo Bonzini
On 09/07/20 19:45, Lee Jones wrote:
> This is the only use of kerneldoc in the source file and no
> descriptions are provided.
> 
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/scsi/virtio_scsi.c:109: warning: Function parameter or member 'vscsi' not described in 'virtscsi_complete_cmd'
>  drivers/scsi/virtio_scsi.c:109: warning: Function parameter or member 'buf' not described in 'virtscsi_complete_cmd'
> 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Paolo Bonzini 
> Cc: Stefan Hajnoczi 
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/scsi/virtio_scsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 0e0910c5b9424..56875467e4984 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -100,7 +100,7 @@ static void virtscsi_compute_resid(struct scsi_cmnd *sc, 
> u32 resid)
>   scsi_set_resid(sc, resid);
>  }
>  
> -/**
> +/*
>   * virtscsi_complete_cmd - finish a scsi_cmd and invoke scsi_done
>   *
>   * Called with vq_lock held.
> 

Acked-by: Paolo Bonzini 



Re: [PATCH] scsi: virtio_scsi: Remove unnecessary condition checks

2020-07-09 Thread Paolo Bonzini
On 09/07/20 19:16, Markus Elfring wrote:
>> +mempool_destroy(virtscsi_cmd_pool);
>> +virtscsi_cmd_pool = NULL;
>> +kmem_cache_destroy(virtscsi_cmd_cache);
>> +virtscsi_cmd_cache = NULL;
>>  return ret;
>>  }
> 
> How do you think about to add a jump target so that the execution
> of a few statements can be avoided according to a previous
> null pointer check?

The point of the patch is precisely to simplify the code, executing a
couple more instruction is not an issue.

Paolo



Re: [PATCH] scsi: virtio_scsi: remove unnecessary condition check

2020-07-09 Thread Paolo Bonzini
On 09/07/20 16:46, Xianting Tian wrote:
> kmem_cache_destroy can correctly handle null pointer parameter,
> so there is no need to check if the parameter is null before
> calling kmem_cache_destroy.
> 
> Signed-off-by: Xianting Tian 
> ---
>  drivers/scsi/virtio_scsi.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index bfec84a..5bc288f 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -1007,10 +1007,8 @@ static int __init init(void)
>   mempool_destroy(virtscsi_cmd_pool);
>   virtscsi_cmd_pool = NULL;
>   }
> - if (virtscsi_cmd_cache) {
> - kmem_cache_destroy(virtscsi_cmd_cache);
> - virtscsi_cmd_cache = NULL;
> - }
> + kmem_cache_destroy(virtscsi_cmd_cache);
> + virtscsi_cmd_cache = NULL;
>       return ret;
>  }
>  
> 

Acked-by: Paolo Bonzini 
Reviewed-by: Paolo Bonzini 



Re: [PATCH] scsi: virtio_scsi: remove unnecessary condition check

2020-07-09 Thread Paolo Bonzini
On 09/07/20 17:06, Xianting Tian wrote:
> kmem_cache_destroy and mempool_destroy can correctly handle
> null pointer parameter, so there is no need to check if the
> parameter is null before calling kmem_cache_destroy and
> mempool_destroy.
> 
> Signed-off-by: Xianting Tian 
> ---
>  drivers/scsi/virtio_scsi.c | 12 
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index bfec84a..54ac83e 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -1003,14 +1003,10 @@ static int __init init(void)
>   return 0;
>  
>  error:
> - if (virtscsi_cmd_pool) {
> - mempool_destroy(virtscsi_cmd_pool);
> - virtscsi_cmd_pool = NULL;
> - }
> - if (virtscsi_cmd_cache) {
> - kmem_cache_destroy(virtscsi_cmd_cache);
> - virtscsi_cmd_cache = NULL;
> - }
> + mempool_destroy(virtscsi_cmd_pool);
> + virtscsi_cmd_pool = NULL;
> + kmem_cache_destroy(virtscsi_cmd_cache);
> + virtscsi_cmd_cache = NULL;
>   return ret;
>  }
>  
> 

Reviewed-by: Paolo Bonzini 
Acked-by: Paolo Bonzini 



Re: [virtio-dev] VIRTIO adoption in other hypervisors

2020-02-28 Thread Paolo Bonzini
On 28/02/20 12:18, Alex Bennée wrote:
>> OS X Hypervisor.framework just uses QEMU, so it can use virtio devices
>> too.  VirtualBox also supports virtio devices.
> I guess these don't do any sort of vhost support so all virtio devices
> are handled directly in QEMU?

OS X can use vhost-user.

Paolo


Re: [virtio-dev] VIRTIO adoption in other hypervisors

2020-02-28 Thread Paolo Bonzini
On 28/02/20 11:16, Alex Bennée wrote:
>   - How about HyperV and the OSX equivalent?

OS X Hypervisor.framework just uses QEMU, so it can use virtio devices
too.  VirtualBox also supports virtio devices.

Paolo


Re: [PATCH 15/24] compat_ioctl: scsi: move ioctl handling into drivers

2019-12-11 Thread Paolo Bonzini
On 12/12/19 00:05, Michael S. Tsirkin wrote:
>> @@ -405,6 +405,9 @@ static int virtblk_getgeo(struct block_device *bd, struct hd_geometry *geo)
>>  
>>  static const struct block_device_operations virtblk_fops = {
>>  .ioctl  = virtblk_ioctl,
>> +#ifdef CONFIG_COMPAT
>> +.compat_ioctl = blkdev_compat_ptr_ioctl,
>> +#endif
>>  .owner  = THIS_MODULE,
>>  .getgeo = virtblk_getgeo,
>>  };
> Hmm - is virtio blk lumped in with scsi things intentionally?

I think it's because the only ioctl for virtio-blk is SG_IO.  It makes
sense to lump it in with scsi, but I wouldn't mind getting rid of
CONFIG_VIRTIO_BLK_SCSI altogether.

Paolo



Re: DANGER WILL ROBINSON, DANGER

2019-10-04 Thread Paolo Bonzini
On 04/10/19 11:41, Mircea CIRJALIU - MELIU wrote:
> I get it so far. I have a patch that does mirroring in a separate VMA.
> We create an extra VMA with VM_PFNMAP/VM_MIXEDMAP that mirrors the 
> source VMA in the other QEMU and is refreshed by the device MMU notifier.

So for example on the host you'd have a new ioctl on the kvm file
descriptor.  You pass a size and you get back a file descriptor for that
guest's physical memory, which is mmap-able up to the size you specified
in the ioctl.

In turn, the file descriptor would have ioctls to map/unmap ranges of
the guest memory into its mmap-able range.  Accessing an unmapped range
produces a SIGSEGV.
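
To make the flow concrete, here is a rough sketch of what such a uAPI
could look like (every struct, name and ioctl number below is
hypothetical, made up just for illustration):

/* Hypothetical sketch only -- names, numbers and layout are illustrative. */
#include <linux/ioctl.h>
#include <linux/kvm.h>      /* for KVMIO */
#include <linux/types.h>

struct kvm_memview_create {
        __u64 size;         /* size of the mmap-able range, in bytes */
};

struct kvm_memview_map {
        __u64 offset;       /* offset into the mmap-able range */
        __u64 gpa;          /* guest physical address to expose there */
        __u64 size;         /* length of the mapping */
};

/* On the kvm VM fd: returns a new fd whose mmap window is 'size' bytes. */
#define KVM_CREATE_MEMVIEW  _IOW(KVMIO, 0xf0, struct kvm_memview_create)

/* On the returned fd: map/unmap guest memory into the mmap window;
 * touching an unmapped part of the window raises SIGSEGV.
 */
#define KVM_MEMVIEW_MAP     _IOW('k', 0xf1, struct kvm_memview_map)
#define KVM_MEMVIEW_UNMAP   _IOW('k', 0xf2, struct kvm_memview_map)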

When asked via the QEMU monitor, QEMU will create the file descriptor
and pass it back via SCM_RIGHTS.  The management application can then
use it to hotplug memory into the destination...

> Create a new memslot based on the mirror VMA, hotplug it into the guest as
> a new memory device (is this possible?) and have a guest-side driver allocate
> pages from that area.

... using the existing ivshmem device, whose BAR can be accessed and
mmap-ed from the guest via sysfs.  In other words, the hotplugging will
use the file descriptor returned by QEMU when creating the ivshmem device.
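
(For reference, from inside the guest the BAR can be reached with a plain
open/mmap of the sysfs resource file; a minimal sketch, where the PCI
address and the 16 MiB size are just example values:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* BAR2 is the ivshmem shared-memory BAR. */
        int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource2", O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *shm = mmap(NULL, 16 << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* ... use the shared memory ... */

        munmap(shm, 16 << 20);
        close(fd);
        return 0;
}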

We then need an additional mechanism to invoke the map/unmap ioctls from
the guest.  Without writing a guest-side driver it is possible to:

- pass a socket into the "create guest physical memory view" ioctl
above.  KVM will then associate that KVMI socket with the newly created
file descriptor.

- use KVMI messages to that socket to map/unmap sections of memory

> Redirect (some) GFN->HVA translations into the new VMA based on a table 
> of addresses required by the introspector process.

That would be tricky because there are multiple paths (gfn_to_page,
gfn_to_pfn, etc.).

There is some complication in this because the new device has to be
plumbed at multiple levels (KVM, QEMU, libvirt).  But it seems like a
very easily separated piece of code (except for the KVMI socket part,
which can be added later), so I suggest that you contribute the KVM
parts first.

Paolo


Re: DANGER WILL ROBINSON, DANGER

2019-10-03 Thread Paolo Bonzini
On 03/10/19 20:31, Jerome Glisse wrote:
> So in summary, the source qemu process has an anonymous vma (regular
> libc malloc for instance). The introspector qemu process which
> mirrors the source qemu uses mmap on /dev/kvm (assuming you can
> reuse the kvm device file for this; otherwise you can introduce a
> new kvm device file).

It should be a new device, something like /dev/kvmmem.  BitDefender's
RFC patches already have the right userspace API, that was not an issue.

Paolo


Re: DANGER WILL ROBINSON, DANGER

2019-10-03 Thread Paolo Bonzini
On 03/10/19 17:42, Jerome Glisse wrote:
> All that is needed is to make sure that vm_normal_page() will see those
> ptes (inside the process that is mirroring the other process) as special,
> which is the case either because insert_pfn() marks the pte as special, or
> the kvm device driver which controls the vm_operation struct sets a
> find_special_page() callback that always returns NULL, or the vma has
> either VM_PFNMAP or VM_MIXEDMAP set (which is the case with insert_pfn).
> 
> So you can keep the existing kvm code unmodified.

Great, thanks.  And KVM is already able to handle VM_PFNMAP/VM_MIXEDMAP,
so that should work.

Paolo


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Paolo Bonzini
On 02/10/19 19:04, Jerome Glisse wrote:
> On Wed, Oct 02, 2019 at 06:18:06PM +0200, Paolo Bonzini wrote:
>>>> If the mapping of the source VMA changes, mirroring can update the
>>>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>>>> dismantles its own existing page tables (so that they can be recreated
>>>> with the new mapping from the source VMA)?
>>
>> The KVM inspector process is also (or can be) a QEMU that will have to
>> create its own KVM guest page table.  So if a page in the source VMA is
>> unmapped we want:
>>
>> - the source KVM to invalidate its guest page table (done by the KVM MMU
>> notifier)
>>
>> - the target VMA to be invalidated (easy using mirroring)
>>
>> - the target KVM to invalidate its guest page table, as a result of
>> invalidation of the target VMA
> 
> You can do the target KVM invalidation inside the mirroring invalidation
> code.

Why should the source and target KVMs behave differently?  If the source
invalidates its guest page table via MMU notifiers, so should the target.

The KVM MMU notifier exists so that nothing (including mirroring) needs
to know that there is KVM on the other side.  Any interaction between
KVM page tables and VMAs must be mediated by MMU notifiers, anything
else is unacceptable.

If it is possible to invoke the MMU notifiers around the calls to
insert_pfn, that of course would be perfect.
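
For what it's worth, a minimal sketch of that bracketing, assuming the
current mmu_notifier API (mirror_update_pfn is a made-up helper name):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Bracket the mirror update with MMU notifier invalidation, so a KVM
 * instance backed by this VMA tears down and later rebuilds its guest
 * page tables.
 */
static vm_fault_t mirror_update_pfn(struct vm_area_struct *vma,
                                    unsigned long addr, unsigned long pfn)
{
        struct mmu_notifier_range range;
        vm_fault_t ret;

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
                                vma->vm_mm, addr, addr + PAGE_SIZE);
        mmu_notifier_invalidate_range_start(&range);

        ret = vmf_insert_pfn(vma, addr, pfn);

        mmu_notifier_invalidate_range_end(&range);
        return ret;
}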

Thanks,

Paolo


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Paolo Bonzini
On 02/10/19 16:15, Jerome Glisse wrote:
>>> Why would you need to target mmu notifier on target vma ?
>> If the mapping of the source VMA changes, mirroring can update the
>> target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
>> dismantles its own existing page tables (so that they can be recreated
>> with the new mapping from the source VMA)?
>>
> So just to make sure i follow we have:
>   - qemu process on host with anonymous vma
> -> host cpu page table
>   - kvm which maps host anonymous vma to guest
> -> kvm guest page table
>   - kvm inspector process which mirror vma from qemu process
> -> inspector process page table
> 
> AFAIK the KVM notifiers will clear the kvm guest page table whenever
> necessary (through kvm_mmu_notifier_invalidate_range_start). This is
> what ensures that KVM dismantles its own mapping; it abides by mmu-
> notifier callbacks. If it did not you would have bugs (at least I
> expect so). Am I wrong here?

The KVM inspector process is also (or can be) a QEMU that will have to
create its own KVM guest page table.

So if a page in the source VMA is unmapped we want:

- the source KVM to invalidate its guest page table (done by the KVM MMU
notifier)

- the target VMA to be invalidated (easy using mirroring)

- the target KVM to invalidate its guest page table, as a result of
invalidation of the target VMA

Paolo


Re: DANGER WILL ROBINSON, DANGER

2019-10-02 Thread Paolo Bonzini
On 02/10/19 21:27, Jerome Glisse wrote:
> On Tue, Sep 10, 2019 at 07:49:51AM +, Mircea CIRJALIU - MELIU wrote:
>>> On 05/09/19 20:09, Jerome Glisse wrote:
 Not sure I understand; are you saying that the solution I outlined
 above does not work? If so then I think you are wrong: in the above
 solution the importing process mmaps a device file and the resulting
 vma is then populated using insert_pfn() and constantly kept
 synchronized with the target process through mirroring, which means that
 you never have to look at the struct page ... you can mirror any kind
 of memory from the remote process.
>>>
>>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
>>> the KVM MMU notifier), then that would work.  Though I guess it would be
>>> possible to call MMU notifier update callbacks around the call to 
>>> insert_pfn.
>>
>> Can't do that.
>> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
>> the target VMA. It's also static, so I'll have to access it through
>> vmf_insert_pfn()
>> or vmf_insert_mixed().
> 
> Why would you need to target mmu notifier on target vma ?

If the mapping of the source VMA changes, mirroring can update the
target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
dismantles its own existing page tables (so that they can be recreated
with the new mapping from the source VMA)?

Thanks,

Paolo

> You do not need
> that. The workflow is:
> 
> userspace:
> ptr = mmap(/dev/kvm-mirroring-device, virtual_addresse_of_target)
> 
> Then when the mirroring process access ptr it triggers page fault that
> endup in the vm_operation_struct->fault() which is just doing:
> 
> kernel-kvm-mirroring-function:
> kvm_mirror_page_fault(struct vm_fault *vmf) {
> struct kvm_mirror_struct *kvmms;
> 
> kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
> ...
> again:
> hmm_range_register(&range);
> hmm_range_snapshot(&range);
> take_lock(kvmms->update);
> if (!hmm_range_valid(&range)) {
> vm_insert_pfn();
> drop_lock(kvmms->update);
> hmm_range_unregister(&range);
> return VM_FAULT_NOPAGE;
> }
> drop_lock(kvmms->update);
> goto again;
> }
> 
> The notifier callback:
> kvmms_notifier_start() {
> take_lock(kvmms->update);
> clear_pte(start, end);
> drop_lock(kvmms->update);
> }
> 
>>
>> Our model (the importing process is encapsulated in another VM) forces us
>> to mirror certain pages from the anon VMA backing one VM's system RAM to 
>> the other VM's anon VMA. 
> 
> The mirror does not have to be an anon vma; it can very well be a
> device vma, i.e. an mmap of a device file. I do not see any reason why
> the mirror needs to be an anon vma. Please explain why.
> 
>>
>> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
>> the target anon VMA, but I guess this breaks the VMA. Is this recommended?
> 
> The mirror vma should not be an anon vma.
> 
>>
>> Then, mapping anon pages from one VMA to another without fixing the 
>> refcount and the mapcount breaks the daemons that think they're working 
>> on a pure anon VMA (kcompactd, khugepaged).
> 
> Note here the target vma, i.e. the mirroring one, is an mmap of a device
> file and thus is skipped by all of the above (kcompactd, khugepaged, ...);
> it is fully ignored by core mm.
> 
> Thus you do not need to fix the refcount in any way. If any of the core
> mm tries to reclaim memory from the original vma then you will get mmu
> notifier callbacks and all you have to do is clear the page table of your
> device vma.
> 
> I did exactly that as a tools in the past and it works just fine with
> no change to core mm whatsoever.
> 
> Cheers,
> Jérôme
> 


Re: DANGER WILL ROBINSON, DANGER

2019-09-09 Thread Paolo Bonzini
On 05/09/19 20:09, Jerome Glisse wrote:
> Not sure I understand; are you saying that the solution I outlined above
> does not work? If so then I think you are wrong: in the above solution
> the importing process mmaps a device file and the resulting vma is then
> populated using insert_pfn() and constantly kept synchronized with the
> target process through mirroring, which means that you never have to look
> at the struct page ... you can mirror any kind of memory from the remote
> process.

If insert_pfn in turn calls MMU notifiers for the target VMA (which
would be the KVM MMU notifier), then that would work.  Though I guess it
would be possible to call MMU notifier update callbacks around the call
to insert_pfn.

Paolo


Re: [PATCH v2] scsi: virtio_scsi: unplug LUNs when events missed

2019-09-09 Thread Paolo Bonzini
On 06/09/19 10:54, Stefan Hajnoczi wrote:
> On Thu, Sep 05, 2019 at 06:19:28PM +, Matt Lupfer wrote:
>> The event handler calls scsi_scan_host() when events are missed, which
>> will hotplug new LUNs.  However, this function won't remove any
>> unplugged LUNs.  The result is that hotunplug doesn't work properly when
>> the number of unplugged LUNs exceeds the event queue size (currently 8).
>>
>> Scan existing LUNs when events are missed to check if they are still
>> present.  If not, remove them.
>>
>> Signed-off-by: Matt Lupfer 
>> ---
>>  drivers/scsi/virtio_scsi.c | 33 +
>>  1 file changed, 33 insertions(+)
> 
> Please include a changelog in future patch revisions.  For example:
> 
>   Signed-off-by: ...
>   ---
>   v2:
> * Replaced magic constants with sd.h constants [Michael]
> 
> Just C and virtio code review, no SCSI specifics:
> 
> Reviewed-by: Stefan Hajnoczi 
> 

Acked-by: Paolo Bonzini 




Re: [RFC PATCH v6 64/92] kvm: introspection: add single-stepping

2019-08-14 Thread Paolo Bonzini
On 14/08/19 14:36, Nicusor CITU wrote:
> Thank you for signaling this. This piece of code is a leftover from the
> initial attempt to make single-stepping work.
> Based on the latest results, we do not actually need to change
> interruptibility during the single step. It is enough to enable the MTF
> and just suppress any interrupt injection (if any) before letting the
> vcpu enter the guest.
> 

This is exactly what testcases are for...

Paolo


Re: [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation

2019-08-14 Thread Paolo Bonzini
On 14/08/19 14:07, Adalbert Lazăr wrote:
> On Tue, 13 Aug 2019 11:20:45 +0200, Paolo Bonzini  wrote:
>> On 09/08/19 18:00, Adalbert Lazăr wrote:
>>> From: Mihai Donțu 
>>>
>>> It can happen that we end up emulating the VMCALL instruction as a
>>> result of the handling of an EPT write fault. In this situation, the
>>> emulator will try to unconditionally patch the correct hypercall opcode
>>> bytes using emulator_write_emulated(). However, this last call uses the
>>> fault GPA (if available) or, otherwise, walks the guest page tables at
>>> RIP. The trouble begins when using KVMI, where we forbid the use of
>>> the fault GPA and fall back to the guest pt walk: in Windows (8.1 and
>>> newer) the page that we try to write into is marked read-execute and as
>>> such emulator_write_emulated() fails and we inject a write #PF, leading
>>> to a guest crash.
>>>
>>> The fix is rather simple: check the existing instruction bytes before
>>> doing the patching. This does not change the normal KVM behaviour, but
>>> does help when using KVMI as we no longer inject a write #PF.
>>
>> Fixing the hypercall is just an optimization.  Can we just hush and
>> return to the guest if emulator_write_emulated returns
>> X86EMUL_PROPAGATE_FAULT?
>>
>> Paolo
> 
> Something like this?
> 
>   err = emulator_write_emulated(...);
>   if (err == X86EMUL_PROPAGATE_FAULT)
>   err = X86EMUL_CONTINUE;
>   return err;

Yes.  The only difference will be that you'll keep getting #UD vmexits
instead of hypercall vmexits.  It's also safer: we want to obey those
r-x permissions, because PatchGuard would crash the system if it noticed
the rewriting for whatever reason.

Paolo

>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 04b1d2916a0a..965c4f0108eb 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7363,16 +7363,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>>>  
>>> +#define KVM_HYPERCALL_INSN_LEN 3
>>> +
>>>  static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
>>>  {
>>> +   int err;
>>> struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
>>> -   char instruction[3];
>>> +   char buf[KVM_HYPERCALL_INSN_LEN];
>>> +   char instruction[KVM_HYPERCALL_INSN_LEN];
>>> unsigned long rip = kvm_rip_read(vcpu);
>>>  
>>> +   err = emulator_read_emulated(ctxt, rip, buf, sizeof(buf),
>>> +&ctxt->exception);
>>> +   if (err != X86EMUL_CONTINUE)
>>> +   return err;
>>> +
>>> kvm_x86_ops->patch_hypercall(vcpu, instruction);
>>> +   if (!memcmp(instruction, buf, sizeof(instruction)))
>>> +   /*
>>> +* The hypercall instruction is the correct one. Retry
>>> +* its execution maybe we got here as a result of an
>>> +* event other than #UD which has been resolved in the
>>> +* mean time.
>>> +*/
>>> +   return X86EMUL_CONTINUE;
>>>  
>>> -   return emulator_write_emulated(ctxt, rip, instruction, 3,
>>> -   &ctxt->exception);
>>> +   return emulator_write_emulated(ctxt, rip, instruction,
>>> +  sizeof(instruction), &ctxt->exception);
>>>  }


Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)

2019-08-14 Thread Paolo Bonzini
On 14/08/19 11:48, Adalbert Lazăr wrote:
>> Why does closing the socket require destroying the kvmi object?  E.g. can
>> it be marked as defunct or whatever and only fully removed on a synchronous
>> unhook from userspace?  Re-hooking could either require said unhook, or
>> maybe reuse the existing kvmi object with a new socket.
> Would it be better to have the following ioctls?
> 
>   - hook (alloc kvmi and kvmi_vcpu structs)
>   - notify_imminent_unhook (send the KVMI_EVENT_UNHOOK event)
>   - unhook (free kvmi and kvmi_vcpu structs)

Yeah, that is nice also because it leaves the timeout policy to
userspace.  (BTW, please change references to QEMU to "userspace").

Paolo

Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)

2019-08-13 Thread Paolo Bonzini
On 13/08/19 17:01, Sean Christopherson wrote:
>>> It's a bit unclear how, but we'll try to get rid of the refcount object,
>>> which will remove a lot of code, indeed.
>> You can keep it for now.  It may become clearer how to fix it after the
>> event loop is cleaned up.
> By event loop, do you mean the per-vCPU jobs list?

Yes, I meant event handling (which involves the jobs list).

Paolo


Re: [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest

2019-08-13 Thread Paolo Bonzini
On 13/08/19 15:54, Adalbert Lazăr wrote:
> Leaving kvm_vcpu_block() in order to handle a request such as 'pause'
> would cause the vCPU to enter the guest when resumed. Most of the
> time this does not appear to be an issue, but during early boot a
> non-boot vCPU can start executing code from areas that
> first needed to be set up by vCPU #0.
> 
> In a particular case, vCPU #1 executed code which resided in an area
> not covered by a memslot, which caused an EPT violation that got
> turned into an MMIO request in mmu_set_spte() that required emulation.
> Unfortunately, the emulator tripped, exited to userspace and the VM
> was aborted.

Okay, this makes sense.  Maybe you want to handle KVM_REQ_INTROSPECTION
in vcpu_run rather than vcpu_enter_guest?
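
Something along these lines (a simplified sketch, not the actual vcpu_run
code; KVM_REQ_INTROSPECTION and kvmi_handle_requests are the names used
in this series):

static int vcpu_run(struct kvm_vcpu *vcpu)
{
        int r = 1;

        while (r > 0) {
                /*
                 * Handle introspection commands here, so a vCPU woken up
                 * from kvm_vcpu_block() processes them before it has a
                 * chance to enter the guest.
                 */
                if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu))
                        kvmi_handle_requests(vcpu);

                if (kvm_vcpu_running(vcpu))
                        r = vcpu_enter_guest(vcpu);
                else
                        r = vcpu_block(vcpu->kvm, vcpu);
        }

        return r;
}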

Paolo

Re: [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage()

2019-08-13 Thread Paolo Bonzini
On 13/08/19 16:33, Adalbert Lazăr wrote:
> On Tue, 13 Aug 2019 10:47:34 +0200, Paolo Bonzini  wrote:
>> On 09/08/19 18:00, Adalbert Lazăr wrote:
>>> If the EPT violation was caused by an execute restriction imposed by the
>>> introspection tool, gpa_available will point to the instruction pointer,
>>> not to the read/write location that has to be used to emulate the
>>> current instruction.
>>>
>>> This optimization should be disabled only when the VM is introspected,
>>> not just because the introspection subsystem is present.
>>>
>>> Signed-off-by: Adalbert Lazăr 
>>
>> The right thing to do is to not set gpa_available for fetch failures in 
>> kvm_mmu_page_fault instead:
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 24843cf49579..1bdca40fa831 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -5364,8 +5364,12 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
>>  enum emulation_result er;
>>  bool direct = vcpu->arch.mmu->direct_map;
>>  
>> -/* With shadow page tables, fault_address contains a GVA or nGPA.  */
>> -if (vcpu->arch.mmu->direct_map) {
>> +/*
>> + * With shadow page tables, fault_address contains a GVA or nGPA.
>> + * On a fetch fault, fault_address contains the instruction pointer.
>> + */
>> +if (vcpu->arch.mmu->direct_map &&
>> +likely(!(error_code & PFERR_FETCH_MASK)) {
>>  vcpu->arch.gpa_available = true;
>>  vcpu->arch.gpa_val = cr2;
>>  }
>
> Sure, but I think we'll have to extend the check.
> 
> Searching the logs I've found:
> 
> kvm/x86: re-translate broken translation that caused EPT violation
> 
> Signed-off-by: Mircea Cirjaliu 
> 
>  arch/x86/kvm/x86.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> /home/b/kvmi@9cad844~1/arch/x86/kvm/x86.c:4757,4762 - 
> /home/b/kvmi@9cad844/arch/x86/kvm/x86.c:4757,4763
>*/
>   if (vcpu->arch.gpa_available &&
>   emulator_can_use_gpa(ctxt) &&
> + (vcpu->arch.error_code & PFERR_GUEST_FINAL_MASK) &&
>   (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
>   gpa = vcpu->arch.gpa_val;
>   ret = vcpu_is_mmio_gpa(vcpu, addr, gpa, write);
> 

Yes, adding that check makes sense as well (still in kvm_mmu_page_fault).

Paolo

Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)

2019-08-13 Thread Paolo Bonzini
On 13/08/19 13:57, Adalbert Lazăr wrote:
>> The refcounting approach seems a bit backwards, and AFAICT is driven by
>> implementing unhook via a message, which also seems backwards.  I assume
>> hook and unhook are relatively rare events and not performance critical,
>> so make those the restricted/slow flows, e.g. force userspace to quiesce
>> the VM by making unhook() mutually exclusive with every vcpu ioctl() and
>> maybe anything that takes kvm->lock. 
>>
>> Then kvmi_ioctl_unhook() can use thread_stop() and kvmi_recv() just needs
>> to check kthread_should_stop().
>>
>> That way kvmi doesn't need to be refcounted since it's guaranteed to be
>> alive if the pointer is non-null.  Eliminating the refcounting will clean
>> up a lot of the code by eliminating calls to kvmi_{get,put}(), e.g.
>> wrappers like kvmi_breakpoint_event() just check vcpu->kvmi, or maybe
>> even get dropped altogether.
> 
> The unhook event has been added to cover the following case: while the
> introspection tool runs in another VM, both VMs, the virtual appliance
> and the introspected VM, could be paused by the user. We needed a way
> to signal this to the introspection tool and give it time to unhook
> (the introspected VM has to run and execute the introspection commands
> during this phase). The receiving thread quits when the socket is closed
> (by QEMU or by the introspection tool).
> 
> It's a bit unclear how, but we'll try to get rid of the refcount object,
> which will remove a lot of code, indeed.

You can keep it for now.  It may become clearer how to fix it after the
event loop is cleaned up.

>>
>>> +void kvmi_create_vm(struct kvm *kvm)
>>> +{
>>> +   init_completion(&kvm->kvmi_completed);
>>> +   complete(&kvm->kvmi_completed);
>> Pretty sure you don't want to be calling complete() here.
> The intention was to stop the hooking ioctl until the VM is
> created. A better name for 'kvmi_completed' would have been
> 'ready_to_be_introspected', as kvmi_hook() will wait for it.
> 
> We'll see how we can get rid of the completion object.

The ioctls are not accessible while kvm_create_vm runs (only after
kvm_dev_ioctl_create_vm calls fd_install).  Even if they were, however,
you should have placed init_completion much earlier, otherwise
wait_for_completion would access uninitialized memory.

Paolo

Re: DANGER WILL ROBINSON, DANGER

2019-08-13 Thread Paolo Bonzini
On 13/08/19 13:24, Matthew Wilcox wrote:
>>>
>>> This is an awfully big patch to the memory management code, buried in
>>> the middle of a gigantic series which almost guarantees nobody would
>>> look at it.  I call shenanigans.
>> Are you calling shenanigans on the patch submitter (which is gratuitous)
>> or on the KVM maintainers/reviewers?
>
> On the patch submitter, of course.  How can I possibly be criticising you
> for something you didn't do?

No idea.  "Nobody would look at it" definitely includes me though.

In any case, water under the bridge.  The submitter did duly mark the
series as RFC; I don't see anything wrong in what he did apart from not
having testcases. :)

Paolo


Re: [RFC PATCH v6 00/92] VM introspection

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> 
> Patches 1-20: unroll a big part of the KVM introspection subsystem,
> sent in one patch in the previous versions.
> 
> Patches 21-24: extend the current page tracking code.
> 
> Patches 25-33: make use of page tracking to support the
> KVMI_SET_PAGE_ACCESS introspection command and the KVMI_EVENT_PF event
> (on EPT violations caused by the tracking settings).
> 
> Patches 34-42: include the SPP feature (Enable Sub-page
> Write Protection Support), already sent to KVM list:
> 
>   
> https://lore.kernel.org/lkml/20190717133751.12910-1-weijiang.y...@intel.com/
> 
> Patches 43-46: add the commands needed to use SPP.
> 
> Patches 47-63: unroll almost all the rest of the introspection code.
> 
> Patches 64-67: add single-stepping, mostly as a way to overcome the
> unimplemented instructions, but also as a feature for the introspection
> tool.
> 
> Patches 68-70: cover more cases related to EPT violations.
> 
> Patches 71-73: add the remote mapping feature, allowing the introspection
> tool to map into its address space a page from guest memory.
> 
> Patch 74: add a fix to hypercall emulation.
> 
> Patches 75-76: disable some features/optimizations when the introspection
> code is present.
> 
> Patches 77-78: add trace functions for the introspection code and change
> some related to interrupts/exceptions injection.
> 
> Patches 79-92: new instructions for the x86 emulator, including cmpxchg
> fixes.

Thanks for the very good explanation.  Apart from the complicated flow
of KVM request handling and KVM reply, the main issue is the complete
lack of testcases.  There should be a kvmi_test in
tools/testing/selftests/kvm, and each patch adding a new ioctl or event
should add a new testcase.
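
A skeleton along the usual selftests lines would be enough to start with;
everything past vm creation is sketched, since the hooking ioctl is
specific to this series:

/* Hypothetical tools/testing/selftests/kvm/x86_64/kvmi_test.c skeleton. */
#include "test_util.h"
#include "kvm_util.h"

static void guest_code(void)
{
        for (;;)
                ;       /* nothing; the host side drives the test */
}

int main(int argc, char *argv[])
{
        struct kvm_vm *vm;

        vm = vm_create_default(0 /* vcpuid */, 0, guest_code);

        /*
         * Sketch: create a socketpair, hook one end with the series'
         * introspection ioctl, then exchange KVMI_GET_VERSION and one
         * command/event pair per patch under test.
         */

        kvm_vm_free(vm);
        return 0;
}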

Paolo

Re: DANGER WILL ROBINSON, DANGER

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:24, Matthew Wilcox wrote:
> On Fri, Aug 09, 2019 at 07:00:26PM +0300, Adalbert Lazăr wrote:
>> +++ b/include/linux/page-flags.h
>> @@ -417,8 +417,10 @@ PAGEFLAG(Idle, idle, PF_ANY)
>>   */
>>  #define PAGE_MAPPING_ANON   0x1
>>  #define PAGE_MAPPING_MOVABLE0x2
>> +#define PAGE_MAPPING_REMOTE 0x4
> Uh.  How do you know page->mapping would otherwise have bit 2 clear?
> Who's guaranteeing that?
> 
> This is an awfully big patch to the memory management code, buried in
> the middle of a gigantic series which almost guarantees nobody would
> look at it.  I call shenanigans.

Are you calling shenanigans on the patch submitter (which is gratuitous)
or on the KVM maintainers/reviewers?

It's not true that nobody would look at it.  Of course no one from
linux-mm is going to look at it, but the maintainer that looks at the
gigantic series is very much expected to look at it and explain to the
submitter that this patch is unacceptable as is.

In fact I shouldn't have to explain this to you; you know better than
believing that I would try to sneak it past the mm folks.  I am puzzled.

Paolo

Re: [RFC PATCH v6 74/92] kvm: x86: do not unconditionally patch the hypercall instruction during emulation

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:00, Adalbert Lazăr wrote:
> From: Mihai Donțu 
> 
> It can happen that we end up emulating the VMCALL instruction as a
> result of the handling of an EPT write fault. In this situation, the
> emulator will try to unconditionally patch the correct hypercall opcode
> bytes using emulator_write_emulated(). However, this last call uses the
> fault GPA (if available) or, otherwise, walks the guest page tables at
> RIP. The trouble begins when using KVMI, where we forbid the use of
> the fault GPA and fall back to the guest pt walk: in Windows (8.1 and
> newer) the page that we try to write into is marked read-execute and as
> such emulator_write_emulated() fails and we inject a write #PF, leading
> to a guest crash.
> 
> The fix is rather simple: check the existing instruction bytes before
> doing the patching. This does not change the normal KVM behaviour, but
> does help when using KVMI as we no longer inject a write #PF.

Fixing the hypercall is just an optimization.  Can we just hush and
return to the guest if emulator_write_emulated returns
X86EMUL_PROPAGATE_FAULT?

Paolo

> CC: Joerg Roedel 
> Signed-off-by: Mihai Donțu 
> Signed-off-by: Adalbert Lazăr 
> ---
>  arch/x86/kvm/x86.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 04b1d2916a0a..965c4f0108eb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7363,16 +7363,33 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>  
> +#define KVM_HYPERCALL_INSN_LEN 3
> +
>  static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
>  {
> + int err;
>   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> - char instruction[3];
> + char buf[KVM_HYPERCALL_INSN_LEN];
> + char instruction[KVM_HYPERCALL_INSN_LEN];
>   unsigned long rip = kvm_rip_read(vcpu);
>  
> + err = emulator_read_emulated(ctxt, rip, buf, sizeof(buf),
> +  &ctxt->exception);
> + if (err != X86EMUL_CONTINUE)
> + return err;
> +
>   kvm_x86_ops->patch_hypercall(vcpu, instruction);
> + if (!memcmp(instruction, buf, sizeof(instruction)))
> + /*
> +  * The hypercall instruction is the correct one. Retry
> +  * its execution maybe we got here as a result of an
> +  * event other than #UD which has been resolved in the
> +  * mean time.
> +  */
> + return X86EMUL_CONTINUE;
>  
> - return emulator_write_emulated(ctxt, rip, instruction, 3,
> - &ctxt->exception);
> + return emulator_write_emulated(ctxt, rip, instruction,
> +sizeof(instruction), &ctxt->exception);
>  }
>  
>  static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
> 


Re: [RFC PATCH v6 76/92] kvm: x86: disable EPT A/D bits if introspection is present

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:00, Adalbert Lazăr wrote:
> Signed-off-by: Adalbert Lazăr 
> ---
>  arch/x86/kvm/vmx/vmx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index dc648ba47df3..152c58b63f69 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7718,7 +7718,7 @@ static __init int hardware_setup(void)
>   !cpu_has_vmx_invept_global())
>   enable_ept = 0;
>  
> - if (!cpu_has_vmx_ept_ad_bits() || !enable_ept)
> + if (!cpu_has_vmx_ept_ad_bits() || !enable_ept || kvmi_is_present())
>   enable_ept_ad_bits = 0;
>  
>   if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
> 

Why?

Paolo

Re: [RFC PATCH v6 79/92] kvm: x86: emulate movsd xmm, m64

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:00, Adalbert Lazăr wrote:
> From: Mihai Donțu 
> 
> This is needed in order to be able to support guest code that uses movsd to
> write into pages that are marked for write tracking.
> 
> Signed-off-by: Mihai Donțu 
> Signed-off-by: Adalbert Lazăr 
> ---
>  arch/x86/kvm/emulate.c | 32 +++-
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 34431cf31f74..9d38f892beea 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1177,6 +1177,27 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
>   return X86EMUL_CONTINUE;
>  }
>  
> +static u8 simd_prefix_to_bytes(const struct x86_emulate_ctxt *ctxt,
> +int simd_prefix)
> +{
> + u8 bytes;
> +
> + switch (ctxt->b) {
> + case 0x11:
> + /* movsd xmm, m64 */
> + /* movups xmm, m128 */
> + if (simd_prefix == 0xf2) {
> + bytes = 8;
> + break;
> + }
> + /* fallthrough */
> + default:
> + bytes = 16;
> + break;
> + }
> + return bytes;
> +}
> +
>  static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
>   struct operand *op)
>  {
> @@ -1187,7 +1208,7 @@ static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
>  
>   if (ctxt->d & Sse) {
>   op->type = OP_XMM;
> - op->bytes = 16;
> + op->bytes = ctxt->op_bytes;
>   op->addr.xmm = reg;
>   read_sse_reg(ctxt, &op->vec_val, reg);
>   return;
> @@ -1238,7 +1259,7 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
>   ctxt->d & ByteOp);
>   if (ctxt->d & Sse) {
>   op->type = OP_XMM;
> - op->bytes = 16;
> + op->bytes = ctxt->op_bytes;
>   op->addr.xmm = ctxt->modrm_rm;
>   read_sse_reg(ctxt, &op->vec_val, ctxt->modrm_rm);
>   return rc;
> @@ -4529,7 +4550,7 @@ static const struct gprefix pfx_0f_2b = {
>  };
>  
>  static const struct gprefix pfx_0f_10_0f_11 = {
> - I(Unaligned, em_mov), I(Unaligned, em_mov), N, N,
> + I(Unaligned, em_mov), I(Unaligned, em_mov), I(Unaligned, em_mov), N,
>  };
>  
>  static const struct gprefix pfx_0f_28_0f_29 = {
> @@ -5097,7 +5118,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>  {
>   int rc = X86EMUL_CONTINUE;
>   int mode = ctxt->mode;
> - int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
> + int def_op_bytes, def_ad_bytes, goffset, simd_prefix = 0;
>   bool op_prefix = false;
>   bool has_seg_override = false;
>   struct opcode opcode;
> @@ -5320,7 +5341,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>   ctxt->op_bytes = 4;
>  
>   if (ctxt->d & Sse)
> - ctxt->op_bytes = 16;
> + ctxt->op_bytes = simd_prefix_to_bytes(ctxt,
> +   simd_prefix);
>   else if (ctxt->d & Mmx)
>   ctxt->op_bytes = 8;
>   }
> 

Please submit all these emulator patches as a separate series, complete
with testcases for kvm-unit-tests.

Paolo

Re: [RFC PATCH v6 07/92] kvm: introspection: honor the reply option when handling the KVMI_GET_VERSION command

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> Obviously, the KVMI_GET_VERSION command must not be used when the command
> reply is disabled by a previous KVMI_CONTROL_CMD_RESPONSE command.
> 
> This commit changes the code path in order to check the reply option
> (enabled/disabled) before trying to reply to this command. If the command
> reply is disabled it will return an error to the caller. In the end, the
> receiving worker will finish and the introspection socket will be closed.
> 
> Signed-off-by: Adalbert Lazăr 
> ---
>  virt/kvm/kvmi_msg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvmi_msg.c b/virt/kvm/kvmi_msg.c
> index ea5c7e23669a..2237a6ed25f6 100644
> --- a/virt/kvm/kvmi_msg.c
> +++ b/virt/kvm/kvmi_msg.c
> @@ -169,7 +169,7 @@ static int handle_get_version(struct kvmi *ikvm,
>   memset(&rpl, 0, sizeof(rpl));
>   rpl.version = KVMI_VERSION;
>  
> - return kvmi_msg_vm_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
> + return kvmi_msg_vm_maybe_reply(ikvm, msg, 0, &rpl, sizeof(rpl));
>  }
>  
>  static bool is_command_allowed(struct kvmi *ikvm, int id)
> 

Go ahead and squash this in the previous patch.

Paolo

Re: [RFC PATCH v6 06/92] kvm: introspection: add KVMI_CONTROL_CMD_RESPONSE

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> +If `now` is 1, the command reply is enabled/disabled (according to
> +`enable`) starting with the current command. For example, `enable=0`
> +and `now=1` means that the reply is disabled for this command too,
> while `enable=0` and `now=0` means that a reply will be sent for this
> +command, but not for the next ones (until enabled back with another
> +*KVMI_CONTROL_CMD_RESPONSE*).
> +
> +This command is used by the introspection tool to disable the replies
> for commands returning an error code only (e.g. *KVMI_SET_REGISTERS*)
> +when an error is less likely to happen. For example, the following
> +commands can be used to reply to an event with a single `write()` call:
> +
> + KVMI_CONTROL_CMD_RESPONSE enable=0 now=1
> + KVMI_SET_REGISTERS vcpu=N
> + KVMI_EVENT_REPLY   vcpu=N
> + KVMI_CONTROL_CMD_RESPONSE enable=1 now=0

I don't understand the usage.  Is there any case where you want now == 1
actually?  Can you just say that KVMI_CONTROL_CMD_RESPONSE never has a
reply, or to make now==enable?
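
With now == enable the handler would shrink to something like this (a
sketch; cmd_reply_disabled is a made-up field name):

static int handle_control_cmd_response(struct kvmi *ikvm,
                        const struct kvmi_control_cmd_response *req)
{
        ikvm->cmd_reply_disabled = !req->enable;
        return 0;       /* never reply to this command itself */
}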

> + if (err)
> + kvmi_warn(ikvm, "Error code %d discarded for message id %d\n",
> +   err, msg->id);
> +

Would it make sense to even close the socket if there is an error?

Paolo

Re: [RFC PATCH v6 01/92] kvm: introduce KVMI (VM introspection subsystem)

2019-08-13 Thread Paolo Bonzini
On 12/08/19 22:20, Sean Christopherson wrote:
> The refcounting approach seems a bit backwards, and AFAICT is driven by
> implementing unhook via a message, which also seems backwards.  I assume
> hook and unhook are relatively rare events and not performance critical,
> so make those the restricted/slow flows, e.g. force userspace to quiesce
> the VM by making unhook() mutually exclusive with every vcpu ioctl() and
> maybe anything that takes kvm->lock. 

The reason for the unhook event, as far as I understand, is because the
introspection appliance can poke int3 into the guest and needs an
opportunity to undo that.

I don't have a big problem with that and the refcounting, at least for
this first iteration---it can be tackled later, once the general event
loop is simplified---however I agree with the other comments that Sean
made.  Fortunately it should not be hard to apply them to the whole
patchset with search and replace on the patches themselves.

Paolo


Re: [RFC PATCH v6 70/92] kvm: x86: filter out access rights only when tracked by the introspection tool

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:00, Adalbert Lazăr wrote:
> It should complete the commit fd34a9518173 ("kvm: x86: consult the page 
> tracking from kvm_mmu_get_page() and __direct_map()")
> 
> Signed-off-by: Adalbert Lazăr 
> ---
>  arch/x86/kvm/mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 65b6acba82da..fd64cf1115da 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2660,6 +2660,9 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  static unsigned int kvm_mmu_page_track_acc(struct kvm_vcpu *vcpu, gfn_t gfn,
>  unsigned int acc)
>  {
> + if (!kvmi_tracked_gfn(vcpu, gfn))
> + return acc;
> +
>   if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREREAD))
>   acc &= ~ACC_USER_MASK;
>   if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_PREWRITE) ||
> 

If this patch is always needed, then the function should be named
something like kvm_mmu_apply_introspection_access and kvmi_tracked_gfn
should be tested from the moment it is introduced.

But the commit message says nothing about _why_ it is needed, so I
cannot guess.  I would very much avoid it however.  Is it just an
optimization?

Paolo

Re: [RFC PATCH v6 27/92] kvm: introspection: use page track

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> +
> + /*
> +  * This function uses kvm->mmu_lock so it's not allowed to be
> +  * called under kvmi_put(). It can reach a deadlock if called
> +  * from kvm_mmu_load -> kvmi_tracked_gfn -> kvmi_put.
> +  */
> + kvmi_clear_mem_access(kvm);

kvmi_tracked_gfn does not exist yet.

More generally, this comment says why you are calling this here, but it
says nothing about the split of responsibility between
kvmi_end_introspection and kvmi_release.  Please add a comment for this
as soon as you add kvmi_end_introspection (which according to my earlier
review should be patch 1).

Paolo

Re: [RFC PATCH v6 16/92] kvm: introspection: handle events and event replies

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> 
> +  reply->padding2);
> +
> + ivcpu->reply_waiting = false;
> + return expected->error;
> +}
> +
>  /*

Is this missing a wakeup?

>  
> +static bool need_to_wait(struct kvm_vcpu *vcpu)
> +{
> + struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
> +
> + return ivcpu->reply_waiting;
> +}
> +

Do you actually need this function?  It seems to me that everywhere you
call it you already have an ivcpu, so you can just access the field.

Also, "reply_waiting" means "there is a reply that is waiting".  What
you mean is "waiting_for_reply".

The overall structure of the jobs code is confusing.  The same function,
kvm_run_jobs_and_wait, starts out as an infinite loop and only gets a
"break" in a later patch.  It is also not clear why kvmi_job_wait is
called through a job.  Can you just have kvm_run_jobs in
KVM_REQ_INTROSPECTION, and something like this when sending an event:

int kvmi_wait_for_reply(struct kvm_vcpu *vcpu)
{
        struct kvmi_vcpu *ivcpu = IVCPU(vcpu);
        int err;

        while (ivcpu->waiting_for_reply) {
                kvmi_run_jobs(vcpu);

                err = swait_event_killable(vcpu->wq,
                                           !ivcpu->waiting_for_reply ||
                                           !list_empty(&ivcpu->job_list));

                if (err)
                        return -EINTR;
        }

        return 0;
}

?

Paolo

Re: [RFC PATCH v6 75/92] kvm: x86: disable gpa_available optimization in emulator_read_write_onepage()

2019-08-13 Thread Paolo Bonzini
On 09/08/19 18:00, Adalbert Lazăr wrote:
> If the EPT violation was caused by an execute restriction imposed by the
> introspection tool, gpa_available will point to the instruction pointer,
> not to the read/write location that has to be used to emulate the
> current instruction.
> 
> This optimization should be disabled only when the VM is introspected,
> not just because the introspection subsystem is present.
> 
> Signed-off-by: Adalbert Lazăr 

The right thing to do is to not set gpa_available for fetch failures in 
kvm_mmu_page_fault instead:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24843cf49579..1bdca40fa831 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5364,8 +5364,12 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
enum emulation_result er;
bool direct = vcpu->arch.mmu->direct_map;
 
-   /* With shadow page tables, fault_address contains a GVA or nGPA.  */
-   if (vcpu->arch.mmu->direct_map) {
+   /*
+* With shadow page tables, fault_address contains a GVA or nGPA.
+* On a fetch fault, fault_address contains the instruction pointer.
+*/
+   if (vcpu->arch.mmu->direct_map &&
+   likely(!(error_code & PFERR_FETCH_MASK)) {
vcpu->arch.gpa_available = true;
vcpu->arch.gpa_val = cr2;
}


Paolo

> ---
>  arch/x86/kvm/x86.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 965c4f0108eb..3975331230b9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5532,7 +5532,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
>* operation using rep will only have the initial GPA from the NPF
>* occurred.
>*/
> - if (vcpu->arch.gpa_available &&
> + if (vcpu->arch.gpa_available && !kvmi_is_present() &&
>   emulator_can_use_gpa(ctxt) &&
>   (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
>   gpa = vcpu->arch.gpa_val;
> 


Re: [RFC PATCH v6 02/92] kvm: introspection: add basic ioctls (hook/unhook)

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> +static int kvmi_recv(void *arg)
> +{
> + struct kvmi *ikvm = arg;
> +
> + kvmi_info(ikvm, "Hooking VM\n");
> +
> + while (kvmi_msg_process(ikvm))
> + ;
> +
> + kvmi_info(ikvm, "Unhooking VM\n");
> +
> + kvmi_end_introspection(ikvm);
> +
> + return 0;
> +}
> +

Rename this to kvmi_recv_thread instead, please.

> +
> + /*
> +  * Make sure all the KVM/KVMI structures are linked and no pointer
> +  * is read as NULL after the reference count has been set.
> +  */
> + smp_mb__before_atomic();

This is an smp_wmb(), not an smp_mb__before_atomic().  Add a comment
that it pairs with the refcount_inc_not_zero in kvmi_get.
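
That is:

	/*
	 * Make sure all the KVM/KVMI structures are linked and no pointer
	 * is read as NULL after the reference count has been set.
	 * Pairs with refcount_inc_not_zero() in kvmi_get().
	 */
	smp_wmb();
	refcount_set(&kvm->kvmi_ref, 1);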

> + refcount_set(&kvm->kvmi_ref, 1);
> +


> @@ -57,8 +183,27 @@ void kvmi_destroy_vm(struct kvm *kvm)
>   if (!ikvm)
>   return;
>  
> + /* trigger socket shutdown - kvmi_recv() will start shutdown process */
> + kvmi_sock_shutdown(ikvm);
> +
>   kvmi_put(kvm);
>  
>   /* wait for introspection resources to be released */
>   wait_for_completion_killable(&kvm->kvmi_completed);
>  }
> +

This addition means that kvmi_destroy_vm should have called
kvmi_end_introspection instead.  In patch 1, kvmi_end_introspection
should have been just kvmi_put; now this patch can add kvmi_sock_shutdown.
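
In other words, a sketch of the suggested layering (patch 1 would
introduce kvmi_end_introspection as just kvmi_put; this patch then
prepends the socket shutdown):

static void kvmi_end_introspection(struct kvm *kvm)
{
	/* trigger socket shutdown - kvmi_recv_thread() will notice */
	kvmi_sock_shutdown(IKVM(kvm));

	kvmi_put(kvm);
}

void kvmi_destroy_vm(struct kvm *kvm)
{
	if (!IKVM(kvm))
		return;

	kvmi_end_introspection(kvm);

	/* wait for introspection resources to be released */
	wait_for_completion_killable(&kvm->kvmi_completed);
}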

Paolo

Re: [RFC PATCH v6 13/92] kvm: introspection: make the vCPU wait even when its jobs list is empty

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> +void kvmi_handle_requests(struct kvm_vcpu *vcpu)
> +{
> + struct kvmi *ikvm;
> +
> + ikvm = kvmi_get(vcpu->kvm);
> + if (!ikvm)
> + return;
> +
> + for (;;) {
> + int err = kvmi_run_jobs_and_wait(vcpu);
> +
> + if (err)
> + break;
> + }
> +
> + kvmi_put(vcpu->kvm);
> +}
> +

Using kvmi_run_jobs_and_wait from two places (here and kvmi_send_event)
is very confusing.  Does kvmi_handle_requests need to do this, or can it
just use kvmi_run_jobs?

Paolo

Re: [RFC PATCH v6 14/92] kvm: introspection: handle introspection commands before returning to guest

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> + prepare_to_swait_exclusive(&vcpu->wq, &wait,
> +TASK_INTERRUPTIBLE);
> +
> + if (kvm_vcpu_check_block(vcpu) < 0)
> + break;
> +
> + waited = true;
> + schedule();
> +
> + if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu)) {
> + do_kvmi_work = true;
> + break;
> + }
> + }
>  
> - waited = true;
> - schedule();
> + finish_swait(&vcpu->wq, &wait);
> +
> + if (do_kvmi_work)
> + kvmi_handle_requests(vcpu);
> + else
> + break;
>   }

Is this needed?  Or can it just go back to KVM_RUN and handle
KVM_REQ_INTROSPECTION there (in which case it would be basically
premature optimization)?

Paolo

Re: [RFC PATCH v6 26/92] kvm: x86: add kvm_mmu_nested_pagefault()

2019-08-13 Thread Paolo Bonzini
On 09/08/19 17:59, Adalbert Lazăr wrote:
> +static bool vmx_nested_pagefault(struct kvm_vcpu *vcpu)
> +{
> + if (vcpu->arch.exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
> + return false;
> + return true;
> +}
> +

This hook is misnamed; it has nothing to do with nested virtualization.
Rather, it returns true if the failure happened while translating
the address of a guest page table.

SVM makes the same information available in EXITINFO[33].

Paolo

Re: [PATCH 1/1] scsi: virtio_scsi: remove unused 'affinity_hint_set'

2019-06-19 Thread Paolo Bonzini
On 19/06/19 09:52, Dongli Zhang wrote:
> The 'affinity_hint_set' is not used any longer since
> commit 0d9f0a52c8b9 ("virtio_scsi: use virtio IRQ affinity").
> 
> Signed-off-by: Dongli Zhang 
> ---
>  drivers/scsi/virtio_scsi.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 13f1b3b..1705398 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -74,9 +74,6 @@ struct virtio_scsi {
>  
>   u32 num_queues;
>  
> - /* If the affinity hint is set for virtqueues */
> - bool affinity_hint_set;
> -
>   struct hlist_node node;
>  
>   /* Protected by event_vq lock */
> 

Acked-by: Paolo Bonzini 


Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-27 Thread Paolo Bonzini
On 27/05/19 14:32, Peter Zijlstra wrote:
> On Mon, May 27, 2019 at 12:21:59PM +0200, Paolo Bonzini wrote:
>> On 27/05/19 11:47, Peter Zijlstra wrote:
> 
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>>>  
>>>  static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>>>  
>>> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>>> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>>> const struct flush_tlb_info *info)
>>>  {
>>> u8 state;
>>> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>>>  * queue flush_on_enter for pre-empted vCPUs
>>>  */
>>> for_each_cpu(cpu, flushmask) {
>>> +   if (cpu == smp_processor_id())
>>> +   continue;
>>> +
>>
>> Even this would be just an optimization; the vCPU you're running on
>> cannot be preempted.  You can just change others to multi.
> 
> Yeah, I know, but it felt weird so I added the explicit skip. No strong
> feelings though.

Neither here, and it would indeed deserve a comment if you left the if out.
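
E.g., with the loop body as in the kernel's kvm_flush_tlb_others():

	for_each_cpu(cpu, flushmask) {
		/*
		 * The local CPU can be in flushmask, but the vCPU we are
		 * currently running on cannot be preempted, so
		 * KVM_VCPU_PREEMPTED can never be set for it and no
		 * explicit skip is needed.
		 */
		src = &per_cpu(steal_time, cpu);
		state = READ_ONCE(src->preempted);
		if ((state & KVM_VCPU_PREEMPTED)) {
			if (try_cmpxchg(&src->preempted, &state,
					state | KVM_VCPU_FLUSH_TLB))
				__cpumask_clear_cpu(cpu, flushmask);
		}
	}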

Paolo



Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-27 Thread Paolo Bonzini
On 27/05/19 11:47, Peter Zijlstra wrote:
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
> 
>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>> b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>> void (*flush_tlb_user)(void);
>>> void (*flush_tlb_kernel)(void);
>>> void (*flush_tlb_one_user)(unsigned long addr);
>>> +   /*
>>> +* flush_tlb_multi() is the preferred interface. When it is used,
>>> +* flush_tlb_others() should return false.
>>
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
> 
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.
> 
> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
> 
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
>  
>  static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
>  
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>   const struct flush_tlb_info *info)
>  {
>   u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>* queue flush_on_enter for pre-empted vCPUs
>*/
>   for_each_cpu(cpu, flushmask) {
> + if (cpu == smp_processor_id())
> + continue;
> +

Even this would be just an optimization; the vCPU you're running on
cannot be preempted.  You can just change others to multi.

Paolo

>   src = &per_cpu(steal_time, cpu);
>   state = READ_ONCE(src->preempted);
>   if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
>   }
>   }
>  
> - native_flush_tlb_others(flushmask, info);
> + native_flush_tlb_multi(flushmask, info);
>  }
>  
>  static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
>   if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
>   !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
>   kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> + pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
>   pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> - static_key_disable(&flush_tlb_multi_enabled.key);
>   }
>  
>   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> 



[PATCH] vhost-scsi: remove incorrect memory barrier

2019-04-16 Thread Paolo Bonzini
At this point, vs_tpg is not public at all; tv_tpg_vhost_count
is accessed under tpg->tv_tpg_mutex; tpg->vhost_scsi is
accessed under vhost_scsi_mutex.  Therefore there are no atomic
operations involved at all here; just remove the barrier.

Reported-by: Andrea Parri 
Signed-off-by: Paolo Bonzini 
---
 drivers/vhost/scsi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 618fb6461017..c090d177bd75 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1443,7 +1443,6 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
tpg->tv_tpg_vhost_count++;
tpg->vhost_scsi = vs;
vs_tpg[tpg->tport_tpgt] = tpg;
-   smp_mb__after_atomic();
match = true;
}
mutex_unlock(&tpg->tv_tpg_mutex);
-- 
1.8.3.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


CFP: KVM Forum 2019

2019-04-06 Thread Paolo Bonzini

KVM Forum 2019: Call For Participation
October 30-November 1, 2019 - Lyon Convention Center - Lyon, France

(All submissions must be received before June 15, 2019 at 23:59 PST)
=

KVM Forum is an annual event that presents a rare opportunity
for developers and users to meet, discuss the state of Linux
virtualization technology, and plan for the challenges ahead. 
We invite you to lead part of the discussion by submitting a speaking
proposal for KVM Forum 2019.

At this highly technical conference, developers driving innovation
in the KVM virtualization stack (Linux, KVM, QEMU, libvirt) can
meet users who depend on KVM as part of their offerings, or to
power their data centers and clouds.

KVM Forum will include sessions on the state of the KVM
virtualization stack, planning for the future, and many
opportunities for attendees to collaborate. As we celebrate ten years
of KVM development in the Linux kernel, KVM continues to be a
critical part of the FOSS cloud infrastructure.

This year, KVM Forum is joining Open Source Summit in Lyon,
France. Selected talks from KVM Forum will be presented on
Wednesday October 30 to the full audience of the Open Source Summit.
Also, attendees of KVM Forum will have access to all of the talks from
Open Source Summit on Wednesday.

https://events.linuxfoundation.org/events/kvm-forum-2019/program/call-for-proposals/

Suggested topics:
* Scaling, latency optimizations, performance tuning, real-time guests
* Hardening and security
* New features
* Testing

KVM and the Linux kernel:
* Nested virtualization
* Resource management (CPU, I/O, memory) and scheduling
* VFIO: IOMMU, SR-IOV, virtual GPU, etc.
* Networking: Open vSwitch, XDP, etc.
* virtio and vhost
* Architecture ports and new processor features

QEMU:
* Management interfaces: QOM and QMP
* New devices, new boards, new architectures
* Graphics, desktop virtualization and virtual GPU
* New storage features
* High availability, live migration and fault tolerance
* Emulation and TCG
* Firmware: ACPI, UEFI, coreboot, U-Boot, etc.

Management and infrastructure
* Managing KVM: Libvirt, OpenStack, oVirt, etc.
* Storage: Ceph, Gluster, SPDK, etc.
* Network Function Virtualization: DPDK, OPNFV, OVN, etc.
* Provisioning



SUBMITTING YOUR PROPOSAL


Abstracts due: June 15, 2019

Please submit a short abstract (~150 words) describing your presentation
proposal. Slots vary in length up to 45 minutes.

Submit your proposal here:
https://events.linuxfoundation.org/events/kvm-forum-2019/program/call-for-proposals/
Please use only the categories "presentation" and "panel discussion".

You will be notified by August 12, 2019, whether or not your presentation
proposal has been accepted.

Speakers will receive a complimentary pass for the event. In case
your submission has multiple presenters, only the primary speaker for a
proposal will receive a complimentary event pass. For panel discussions, all
panelists will receive a complimentary event pass.


TECHNICAL TALKS

A good technical talk should not just report on what has happened over
the last year; it should present a concrete problem and how it impacts
the user and/or developer community. Whenever applicable, focus on
work that needs to be done, difficulties that haven't yet been solved,
and on decisions that other developers should be aware of. Summarizing
recent developments is okay but it should not be more than a small
portion of the overall talk.


END-USER TALKS

One of the big challenges as developers is to know what, where and how
people actually use our software. We will reserve a few slots for end
users talking about their deployment challenges and achievements.

If you are using KVM in production, you are encouraged to submit a speaking
proposal. Simply mark it as an end-user talk. As an end user, this is a
unique opportunity to get your input to developers.


HANDS-ON / BOF SESSIONS

We will reserve some time for people to get together and discuss
strategic decisions as well as other topics that are best solved within
smaller groups.

These sessions will be announced during the event. If you are interested
in organizing such a session, please add it to the list at

  http://www.linux-kvm.org/page/KVM_Forum_2019_BOF

Let people who you think might be interested know about your BOF, and
encourage them to add their names to the wiki page as well. Please add
your ideas to the list before KVM Forum starts.


PANEL DISCUSSIONS

If you are proposing a panel discussion, please make sure that you list
all of your potential panelists in your abstract. We will request full
biographies if a panel is accepted.


==
HOTEL / TRAVEL
==

This year's event will take place at the Lyon Convention Center.
For information on hotels close to the conference, please visit
https://events.linuxf

Re: [PATCH] MAiNTAINERS: add Paolo, Stefan for virtio blk/scsi

2019-03-27 Thread Paolo Bonzini
On 27/03/19 17:57, Stefan Hajnoczi wrote:
> On Wed, Mar 27, 2019 at 10:33:57AM -0400, Michael S. Tsirkin wrote:
>> Jason doesn't really have the time to review blk/scsi
>> patches. Paolo and Stefan agreed to help out.
>>
>> Thanks guys!
>>
>> Signed-off-by: Michael S. Tsirkin 
>>
>> ---
> 
> There is relatively little activity in this area so I'd like to reply
> with Reviewed-by/Acked-by on the mailing list and have patches merged
> via your virtio tree.  That way I do not maintain a sub-tree and send
> you pull requests.  Does this sound good?

FWIW me too, that's why I suggested that Michael add us as reviewers
rather than maintainers. So,

Acked-by: Paolo Bonzini 

too.

Paolo

> Acked-by: Stefan Hajnoczi 
> 




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] MAiNTAINERS: add Paolo, Stefan for virtio blk/scsi

2019-03-27 Thread Paolo Bonzini
On 27/03/19 15:33, Michael S. Tsirkin wrote:
> Jason doesn't really have the time to review blk/scsi
> patches. Paolo and Stefan agreed to help out.
> 
> Thanks guys!
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 95a5ebecd04f..8326d19c1681 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16247,7 +16247,7 @@ F:drivers/char/virtio_console.c
>  F:   include/linux/virtio_console.h
>  F:   include/uapi/linux/virtio_console.h
>  
> -VIRTIO CORE, NET AND BLOCK DRIVERS
> +VIRTIO CORE AND NET DRIVERS
>  M:   "Michael S. Tsirkin" 
>  M:   Jason Wang 
>  L:   virtualization@lists.linux-foundation.org
> @@ -16262,6 +16262,18 @@ F:   include/uapi/linux/virtio_*.h
>  F:   drivers/crypto/virtio/
>  F:   mm/balloon_compaction.c
>  
> +VIRTIO BLOCK AND SCSI DRIVERS
> +M:   "Michael S. Tsirkin" 
> +M:   Paolo Bonzini 
> +M:   Stefan Hajnoczi 

Please make this R at least for me, so that it's clear that patches
still flow through your and Jason's tree.  Not sure if you want to keep
Jason as M.

Paolo

> +L:   virtualization@lists.linux-foundation.org
> +S:   Maintained
> +F:   drivers/block/virtio_blk.c
> +F:   drivers/scsi/virtio_scsi.c
> +F:   include/uapi/linux/virtio_blk.h
> +F:   include/uapi/linux/virtio_scsi.h
> +F:   drivers/vhost/scsi.c
> +
>  VIRTIO CRYPTO DRIVER
>  M:   Gonglei 
>  L:   virtualization@lists.linux-foundation.org
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/1] vhost: add vhost_blk driver

2018-11-06 Thread Paolo Bonzini
On 06/11/2018 08:13, Christoph Hellwig wrote:
> On Tue, Nov 06, 2018 at 10:45:08AM +0800, Jason Wang wrote:
>>> Storage industry is shifting away from SCSI, which has a scaling
>>> problem.
>>
>> Know little about storage. For scaling, do you mean SCSI protocol itself? If
>> not, it's probably not a real issue for virtio-scsi itself.
> 
> The above is utter bullshit.  There is a big NVMe hype, but it is not
> because "SCSI has a scaling problem", but because the industry can sell
> a new thing, and the standardization body seems easier to work with.

In the case of Linux, there is indeed some performance to be gained by
skipping the SCSI layer in the guest and to a lesser extent in QEMU,
which is why virtio-blk is still in use.  But there's no scaling
problem, it's just extra constant overhead.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8] virtio_blk: add discard and write zeroes support

2018-10-29 Thread Paolo Bonzini
On 26/10/2018 10:26, Christoph Hellwig wrote:
> On Fri, Oct 26, 2018 at 01:28:54AM +0200, Paolo Bonzini wrote:
>> On 15/10/2018 11:27, Christoph Hellwig wrote:
>>> There is some issues in this spec.  For one using the multiple ranges
>>> also for write zeroes is rather inefficient.  Write zeroes really should
>>> use the same format as read and write.
>>
>> What makes it inefficient?
> 
> We require a memory allocation for each write zeroes instead of encoding
> the lba/len in the command.

Oh, I see.  That's not a spec issue, the lba/length descriptor can be
included in the same buffer as the rest of the command; using
kmalloc_array even for a single-bio REQ_OP_WRITE_ZEROES is a choice made
by this patch, I suppose for simplicity.

It is possible to special case single-bio unmap and write zeroes so that
they don't call virtblk_setup_discard_write_zeroes and avoid
RQF_SPECIAL_PAYLOAD.
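
For illustration, a minimal sketch of the single-bio fast path's request
layout (the struct is hypothetical, not the driver's actual virtblk_req):

struct virtblk_wz_req {
	struct virtio_blk_outhdr hdr;	/* type = VIRTIO_BLK_T_WRITE_ZEROES */
	struct {
		__le64 sector;		/* start of the single range */
		__le32 num_sectors;	/* length of the single range */
		__le32 flags;
	} range;			/* lives in the same allocation as
					 * the header: no kmalloc_array(),
					 * no RQF_SPECIAL_PAYLOAD */
};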

Thanks,

Paolo

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8] virtio_blk: add discard and write zeroes support

2018-10-25 Thread Paolo Bonzini
On 15/10/2018 11:27, Christoph Hellwig wrote:
> There is some issues in this spec.  For one using the multiple ranges
> also for write zeroes is rather inefficient.  Write zeroes really should
> use the same format as read and write.

What makes it inefficient?

> Second the unmap flag isn't properly specified at all, as nothing
> says the device may not unmap without the unmap flag.  Please take
> a look at the SCSI or NVMe spec for some guidance.

Thanks, I'll submit a patch for this.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support

2018-10-04 Thread Paolo Bonzini
On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).

Not if you want to migrate to pre-Skylake systems.

> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?

No. :(  We still get reports whenever we break 2007-2008 hardware.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] vhost/scsi: truncate T10 PI iov_iter to prot_bytes

2018-09-24 Thread Paolo Bonzini
On 21/09/2018 19:49, Greg Edwards wrote:
> On Wed, Aug 22, 2018 at 01:21:53PM -0600, Greg Edwards wrote:
>> Commands with protection information included were not truncating the
>> protection iov_iter to the number of protection bytes in the command.
>> This resulted in vhost_scsi mis-calculating the size of the protection
>> SGL in vhost_scsi_calc_sgls(), and including both the protection and
>> data SG entries in the protection SGL.
>>
>> Fixes: 09b13fa8c1a1 ("vhost/scsi: Add ANY_LAYOUT support in 
>> vhost_scsi_handle_vq")
>> Signed-off-by: Greg Edwards 
> 
> 
> Any thoughts on this patch?
> 
> 
>> ---
>>  drivers/vhost/scsi.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
>> index 76f8d649147b..cbe0ea26c1ff 100644
>> --- a/drivers/vhost/scsi.c
>> +++ b/drivers/vhost/scsi.c
>> @@ -964,7 +964,8 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct 
>> vhost_virtqueue *vq)
>>  prot_bytes = vhost32_to_cpu(vq, 
>> v_req_pi.pi_bytesin);
>>  }
>>  /*
>> - * Set prot_iter to data_iter, and advance past any
>> + * Set prot_iter to data_iter and truncate it to
>> + * prot_bytes, and advance data_iter past any
>>   * preceeding prot_bytes that may be present.
>>   *
>>   * Also fix up the exp_data_len to reflect only the
>> @@ -973,6 +974,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct 
>> vhost_virtqueue *vq)
>>  if (prot_bytes) {
>>  exp_data_len -= prot_bytes;
>>  prot_iter = data_iter;
>> +iov_iter_truncate(&prot_iter, prot_bytes);
>>      iov_iter_advance(&data_iter, prot_bytes);
>>  }
>>  tag = vhost64_to_cpu(vq, v_req_pi.tag);
>> --
>> 2.17.1

Fixes: 09b13fa8c1a1093e9458549ac8bb203a7c65c62a
Cc: sta...@vger.kernel.org
Reviewed-by: Paolo Bonzini 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4] virtio_blk: add DISCARD and WRIET ZEROES commands support

2018-06-04 Thread Paolo Bonzini
On 04/06/2018 06:14, Liu, Changpeng wrote:
>>> But I believe the specification says VIRTIO_BLK_T_OUT means direction, so
>>> OR the two bits together should compliance with the specification.
>> I cannot find that in the specification:
>>
>>   
>> http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.html#x1-2020002
>>
>> and it would contradict the "The type of the request is either ... or
>> ..." wording that I quoted from the spec above.
>>
>> If you do find something in the spec, please let me know and we can
>> figure out how to make the spec consistent.
>
> I saw comments in the file linux/usr/include/uapi/linux/virtio_blk.h, which say
> VIRTIO_BLK_T_OUT may be combined with other commands and indicates direction;
> the specification does not have such a description.

I don't think it is in the specification indeed (however, 11 and 13 were
chosen so that VIRTIO_BLK_T_OUT can still indicate direction).
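
Concretely (values as in the virtio-blk header; reading bit 0 as the
direction bit is the informal convention being discussed, not spec text):

#define VIRTIO_BLK_T_IN            0	/* read: device writes memory */
#define VIRTIO_BLK_T_OUT           1	/* write: device reads memory */
#define VIRTIO_BLK_T_DISCARD      11	/* 0b1011: bit 0 set, "out"-like */
#define VIRTIO_BLK_T_WRITE_ZEROES 13	/* 0b1101: bit 0 set, "out"-like */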

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_ring: switch to dma_XX barriers for rpmsg

2018-04-19 Thread Paolo Bonzini
On 19/04/2018 19:46, Michael S. Tsirkin wrote:
>> This should be okay, but I wonder if there should be a virtio_wmb(...)
>> or an "if (weak_barriers) wmb()" before the "writel" in vm_notify
>> (drivers/virtio/virtio_mmio.c).
>>
>> Thanks,
>>
>> Paolo
> That one uses weak barriers AFAIK.
> 
> IIUC you mean rproc_virtio_notify.
> 
> I suspect it works because specific kick callbacks have a barrier internally.

Yes, that one.  At least keystone_rproc_kick doesn't seem to have a barrier.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_ring: switch to dma_XX barriers for rpmsg

2018-04-19 Thread Paolo Bonzini
On 19/04/2018 19:35, Michael S. Tsirkin wrote:
> virtio is using barriers to order memory accesses, thus
> dma_wmb/rmb is a good match.
> 
> Build-tested on x86: Before
> 
> [mst@tuck linux]$ size drivers/virtio/virtio_ring.o
>    text    data     bss     dec     hex filename
>   11392     820       0   12212    2fb4 drivers/virtio/virtio_ring.o
> 
> After
> [mst@tuck linux]$ size drivers/virtio/virtio_ring.o
>    text    data     bss     dec     hex filename
>   11284     820       0   12104    2f48 drivers/virtio/virtio_ring.o
> 
> Cc: Ohad Ben-Cohen 
> Cc: Bjorn Andersson 
> Cc: linux-remotep...@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> It's good in theory, but could one of RPMSG maintainers please review
> and ack this patch? Or even better test it?
> 
> All these barriers are useless on Intel anyway ...

This should be okay, but I wonder if there should be a virtio_wmb(...)
or an "if (weak_barriers) wmb()" before the "writel" in vm_notify
(drivers/virtio/virtio_mmio.c).
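
Something like this sketch of vm_notify (the question in code form, not a
patch; whether the extra barrier is actually needed is exactly the open
point):

static bool vm_notify(struct virtqueue *vq)
{
	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);

	/* hypothetical: order the ring updates before the MMIO kick */
	virtio_wmb(true);

	writel(vq->index, vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
	return true;
}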

Thanks,

Paolo

> 
>  include/linux/virtio_ring.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index bbf3252..fab0213 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -35,7 +35,7 @@ static inline void virtio_rmb(bool weak_barriers)
>   if (weak_barriers)
>   virt_rmb();
>   else
> - rmb();
> + dma_rmb();
>  }
>  
>  static inline void virtio_wmb(bool weak_barriers)
> @@ -43,7 +43,7 @@ static inline void virtio_wmb(bool weak_barriers)
>   if (weak_barriers)
>   virt_wmb();
>   else
> - wmb();
> + dma_wmb();
>  }
>  
>  static inline void virtio_store_mb(bool weak_barriers,
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Paolo Bonzini
On 10/04/2018 06:57, Tiwei Bie wrote:
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> 
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?

I think so.  We have vendor-specific quirks, and at some point there was
an idea of using quirks to implement (vendor-specific) live migration
support for assigned devices.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-05 Thread Paolo Bonzini
On 04/04/2018 10:02, Siwei Liu wrote:
>> pci_bus_num is almost always a bug if not done within
>> a context of a PCI host, bridge, etc.
>>
>> In particular this will not DTRT if run before guest assigns bus
>> numbers.
>>
> I was seeking a means to reserve a specific PCI bus slot from drivers,
> and update the driver when the guest assigns the bus number, but it seems
> there's no low-hanging fruit. For that reason the bus_num is
> only obtained when it's really needed (during get_config), and I
> assume at that point the pci bus assignment is already done. I know
> the current one is not perfect, but we need that information (PCI
> bus:slot.func number) to name the guest device correctly.

Can you use the -device "id", and look it up as

devices = container_get(qdev_get_machine(), "/peripheral");
return object_resolve_path_component(devices, id);

?
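
Wrapped into a helper, the lookup would be something like this (the helper
name is made up; the two QOM calls are the ones above):

static DeviceState *find_peripheral_by_id(const char *id)
{
    Object *devices = container_get(qdev_get_machine(), "/peripheral");
    Object *obj = object_resolve_path_component(devices, id);

    return obj ? DEVICE(obj) : NULL;
}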

Thanks,

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 06/27] x86/entry/64: Adapt assembly for PIE support

2018-03-15 Thread Paolo Bonzini
On 14/03/2018 16:54, Christopher Lameter wrote:
>>> +   pushq   %rax/* Support Position Independent Code */
>>> +   leaq1f(%rip), %rax  /* RIP */
>>> +   xchgq   %rax, (%rsp)/* Restore RAX, put 1f */
>>> iretq   /* continues at repeat_nmi below */
>>> UNWIND_HINT_IRET_REGS
>>>  1:
>> Urgh, xchg with a memop has an implicit LOCK prefix.
> this_cpu_xchg uses a cmpxchg without the LOCK prefix as a replacement to reduce latency.

That requires using a second register, since %rax is used as the
comparison source.  At this point it's easier to just push %rax twice:

pushq %rax            /* placeholder: will become the return RIP */
pushq %rax            /* save the caller's %rax */
leaq 1f(%rip), %rax
movq %rax, 8(%rsp)    /* write 1f into the placeholder slot */
popq %rax             /* restore %rax */
iretq                 /* resumes at 1f */

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 1/4] KVM/vmx: re-write the msr auto switch feature

2017-09-25 Thread Paolo Bonzini
On 25/09/2017 15:02, Wei Wang wrote:
> On 09/25/2017 07:54 PM, Paolo Bonzini wrote:
>> On 25/09/2017 06:44, Wei Wang wrote:
>>>   +static void update_msr_autoload_count_max(void)
>>> +{
>>> +    u64 vmx_msr;
>>> +    int n;
>>> +
>>> +    /*
>>> + * According to the Intel SDM, if Bits 27:25 of
>>> MSR_IA32_VMX_MISC is
>>> + * n, then (n + 1) * 512 is the recommended max number of MSRs
>>> to be
>>> + * included in the VMExit and VMEntry MSR auto switch list.
>>> + */
>>> +    rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
>>> +    n = ((vmx_msr & 0xe000000) >> 25) + 1;
>>> +    msr_autoload_count_max = n * KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT;
>>> +}
>>> +
>>
>> Any reasons to do this if it's unlikely that we'll ever update more than
>> 512 MSRs?
>>
>> Paolo
> 
> It isn't a must to allocate memory for 512 MSRs, but I think it would
> be good to allocate memory at least for 128 MSRs, because on skylake
> we have 32 triplets of MSRs (FROM/TO/INFO), which are 96 in total
> already.
> 
> The disadvantage is that developers would need to manually calculate
> and change the number carefully when they need to add new MSRs for
> auto switching in the future. So, if consuming a little bit more
> memory isn't a concern, I think we can directly allocate memory based
> on what's recommended by the hardware.
Sure.  I was just thinking that it's unnecessary to actually use
VMX_MISC; sticking to one-page allocations is nicer.
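
The arithmetic behind the one-page suggestion, as a sketch (the define is
hypothetical; struct vmx_msr_entry is the 16-byte {index, reserved, value}
format used by the VMX auto-switch lists):

/* 4096 / 16 = 256 auto-switched MSRs per page, comfortably above the
 * ~96 LBR MSRs mentioned above. */
#define NR_AUTOLOAD_MSRS	(PAGE_SIZE / sizeof(struct vmx_msr_entry))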

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v1 2/4] KVM/vmx: auto switch MSR_IA32_DEBUGCTLMSR

2017-09-25 Thread Paolo Bonzini
On 25/09/2017 06:44, Wei Wang wrote:
> Passthrough the MSR_IA32_DEBUGCTLMSR to the guest, and take advantage of
> the hardware VT-x feature to auto switch the msr upon VMExit and VMEntry.

I think most bits in the MSR should not be passed through (for example
FREEZE_WHILE_SMM_EN, FREEZE_LBRS_ON_PMI etc.).  Using auto-switch of
course is fine instead.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 1/4] KVM/vmx: re-write the msr auto switch feature

2017-09-25 Thread Paolo Bonzini
On 25/09/2017 06:44, Wei Wang wrote:
>  
> +static void update_msr_autoload_count_max(void)
> +{
> + u64 vmx_msr;
> + int n;
> +
> + /*
> +  * According to the Intel SDM, if Bits 27:25 of MSR_IA32_VMX_MISC is
> +  * n, then (n + 1) * 512 is the recommended max number of MSRs to be
> +  * included in the VMExit and VMEntry MSR auto switch list.
> +  */
> + rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
> + n = ((vmx_msr & 0xe000000) >> 25) + 1;
> + msr_autoload_count_max = n * KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT;
> +}
> +


Any reasons to do this if it's unlikely that we'll ever update more than
512 MSRs?

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-25 Thread Paolo Bonzini
On 25/09/2017 06:44, Wei Wang wrote:
> Passthrough the LBR stack to the guest, and auto switch the stack MSRs
> upon VMEntry and VMExit.
> 
> Signed-off-by: Wei Wang 

This has to be enabled separately for each guest, because it may prevent
live migration to hosts with a different family/model.

Paolo

> ---
>  arch/x86/kvm/vmx.c | 50 ++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5f5c2f1..35e02a7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -107,6 +107,9 @@ static u64 __read_mostly host_xss;
>  static bool __read_mostly enable_pml = 1;
>  module_param_named(pml, enable_pml, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_lbrv = 1;
> +module_param_named(lbrv, enable_lbrv, bool, 0444);
> +
>  #define KVM_VMX_TSC_MULTIPLIER_MAX 0xULL
>  
>  /* Guest_tsc -> host_tsc conversion requires 64-bit division.  */
> @@ -5428,6 +5431,25 @@ static void ept_set_mmio_spte_mask(void)
>  VMX_EPT_MISCONFIG_WX_VALUE);
>  }
>  
> +static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx)
> +{
> + int i;
> + struct perf_lbr_stack lbr_stack;
> +
> + perf_get_lbr_stack(&lbr_stack);
> +
> + add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0);
> +
> + for (i = 0; i < lbr_stack.lbr_nr; i++) {
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0);
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0);
> + if (lbr_stack.lbr_info)
> + add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0,
> +   0);
> + }
> +}
> +
>  #define VMX_XSS_EXIT_BITMAP 0
>  /*
>   * Sets up the vmcs for emulated real mode.
> @@ -5508,6 +5530,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  
>   add_atomic_switch_msr(vmx, MSR_IA32_DEBUGCTLMSR, 0, 0);
>  
> + if (enable_lbrv)
> + auto_switch_lbr_msrs(vmx);
> +
>   if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
>   vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
>  
> @@ -6721,6 +6746,28 @@ void vmx_enable_tdp(void)
>   kvm_enable_tdp();
>  }
>  
> +static void vmx_passthrough_lbr_msrs(void)
> +{
> + int i;
> + struct perf_lbr_stack lbr_stack;
> +
> + if (perf_get_lbr_stack(&lbr_stack) < 0) {
> + enable_lbrv = false;
> + return;
> + }
> +
> + vmx_disable_intercept_for_msr(MSR_LBR_SELECT, false);
> + vmx_disable_intercept_for_msr(lbr_stack.lbr_tos, false);
> +
> + for (i = 0; i < lbr_stack.lbr_nr; i++) {
> + vmx_disable_intercept_for_msr(lbr_stack.lbr_from + i, false);
> + vmx_disable_intercept_for_msr(lbr_stack.lbr_to + i, false);
> + if (lbr_stack.lbr_info)
> + vmx_disable_intercept_for_msr(lbr_stack.lbr_info + i,
> +   false);
> + }
> +}
> +
>  static __init int hardware_setup(void)
>  {
>   int r = -ENOMEM, i, msr;
> @@ -6822,6 +6869,9 @@ static __init int hardware_setup(void)
>   vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
>   vmx_disable_intercept_for_msr(MSR_IA32_DEBUGCTLMSR, false);
>  
> + if (enable_lbrv)
> + vmx_passthrough_lbr_msrs();
> +
>   memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
>   vmx_msr_bitmap_legacy, PAGE_SIZE);
>   memcpy(vmx_msr_bitmap_longmode_x2apic_apicv,
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-11 Thread Paolo Bonzini
On 11/08/2017 19:23, Michael S. Tsirkin wrote:
> On Fri, Aug 11, 2017 at 04:09:26PM +0200, Paolo Bonzini wrote:
>> On 10/08/2017 23:41, Michael S. Tsirkin wrote:
>>>>> Then we probably should fail probe if vq size is too small.
>>>> What does this mean?
>>>
>>> We must prevent driver from submitting s/g lists > vq size to device.
>>
>> What is the rationale for the limit?
> 
> So the host knows what it needs to support.
> 
>> both virtio-blk and virtio-scsi transmit their own value for the
>> maximum sg list size (max_seg in virtio-scsi, seg_max in virtio-blk).
> 
> No other device has it, and it seemed like a good idea to
> limit it generally at the time.
> 
> we can fix the spec to relax the requirement for blk and scsi -
> want to submit a proposal? Alternatively, add a generic field
> for that.

Yes, I can submit a proposal.  blk and scsi are the ones that are most
likely to have very long sg lists.

When I was designing scsi I just copied that field from blk. :)

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-11 Thread Paolo Bonzini
On 10/08/2017 23:41, Michael S. Tsirkin wrote:
>>> Then we probably should fail probe if vq size is too small.
>> What does this mean?
> 
> We must prevent driver from submitting s/g lists > vq size to device.

What is the rationale for the limit?  It makes no sense if indirect
descriptors are available, especially because...

> Either tell linux to avoid s/g lists that are too long, or
> simply fail request if this happens, or refuse to attach driver to device.
> 
> Later option would look something like this within probe:
> 
> for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
>   if (vqs[i]->num < MAX_SG_USED_BY_LINUX)
>   goto err;
> 
> 
> I don't know what's MAX_SG_USED_BY_LINUX though.
> 

... both virtio-blk and virtio-scsi transmit their own value for the
maximum sg list size (max_seg in virtio-scsi, seg_max in virtio-blk).
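
For reference, a sketch of how a driver reads these per-device limits (in
the Linux uapi headers both fields are spelled seg_max):

u32 blk_seg_max, scsi_seg_max;

/* virtio-blk: maximum number of segments in a request */
virtio_cread(vdev, struct virtio_blk_config, seg_max, &blk_seg_max);

/* virtio-scsi: same idea, in its own config layout */
virtio_cread(vdev, struct virtio_scsi_config, seg_max, &scsi_seg_max);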

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.

2017-08-10 Thread Paolo Bonzini
On 10/08/2017 18:56, Richard W.M. Jones wrote:
> Since switching to blk-mq as the default in commit 5c279bd9e406
> ("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as
> much kernel memory.
> 
> qemu currently allocates a fixed 128 entry virtqueue.  can_queue
> currently is set to 1024.  But with indirect descriptors, each command
> in the queue takes 1 virtqueue entry, so the number of commands which
> can be queued is equal to the length of the virtqueue.
> 
> Note I intend to send a patch to qemu to allow the virtqueue size to
> be configured from the qemu command line.
> 
> Thanks Paolo Bonzini, Christoph Hellwig.
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  drivers/scsi/virtio_scsi.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 9be211d68b15..7c28e8d4955a 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -818,7 +818,6 @@ static struct scsi_host_template 
> virtscsi_host_template_single = {
>   .eh_timed_out = virtscsi_eh_timed_out,
>   .slave_alloc = virtscsi_device_alloc,
>  
> - .can_queue = 1024,
>   .dma_boundary = UINT_MAX,
>   .use_clustering = ENABLE_CLUSTERING,
>   .target_alloc = virtscsi_target_alloc,
> @@ -839,7 +838,6 @@ static struct scsi_host_template 
> virtscsi_host_template_multi = {
>   .eh_timed_out = virtscsi_eh_timed_out,
>   .slave_alloc = virtscsi_device_alloc,
>  
> - .can_queue = 1024,
>   .dma_boundary = UINT_MAX,
>   .use_clustering = ENABLE_CLUSTERING,
>   .target_alloc = virtscsi_target_alloc,
> @@ -972,6 +970,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
>   if (err)
>   goto virtscsi_init_failed;
>  
> + shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
> +
>   cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
>   shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
>   shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
> 

Acked-by: Paolo Bonzini 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-10 Thread Paolo Bonzini
On 10/08/2017 18:40, Richard W.M. Jones wrote:
> If using indirect descriptors, you can make the total_sg as large as
> you want.  If not, BUG is too serious because the function later
> returns -ENOSPC.
> 
> Thanks Paolo Bonzini, Christoph Hellwig.
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  drivers/virtio/virtio_ring.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 5e1b548828e6..27cbc1eab868 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>   }
>  #endif
>  
> - BUG_ON(total_sg > vq->vring.num);
>   BUG_ON(total_sg == 0);
>  
>   head = vq->free_head;
> @@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>* buffers, then go indirect. FIXME: tune this threshold */
>   if (vq->indirect && total_sg > 1 && vq->vq.num_free)
>   desc = alloc_indirect(_vq, total_sg, gfp);
> - else
> + else {
>   desc = NULL;
> + WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);

So we get here only if vq->vq.num_free is zero.  In that case,
vq->indirect makes no difference for the appropriateness of the warning
(that is, it should never be issued for indirect descriptors).

> + }
>  
>   if (desc) {
>   /* Use a single buffer which doesn't continue */
> 


Reviewed-by: Paolo Bonzini 

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.

2017-08-10 Thread Paolo Bonzini
On 10/08/2017 18:40, Richard W.M. Jones wrote:
> Since switching to blk-mq as the default in commit 5c279bd9e406
> ("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as
> much kernel memory.
> 
> qemu currently allocates a fixed 128 entry virtqueue.  can_queue
> currently is set to 1024.  But with indirect descriptors, each command
> in the queue takes 1 virtqueue entry, so the number of commands which
> can be queued is equal to the length of the virtqueue.
> 
> Note I intend to send a patch to qemu to allow the virtqueue size to
> be configured from the qemu command line.

You can clear .can_queue from the templates.  Otherwise looks good.

Paolo

> Thanks Paolo Bonzini, Christoph Hellwig.
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  drivers/scsi/virtio_scsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 9be211d68b15..d6b4ff634c0d 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
>   if (err)
>   goto virtscsi_init_failed;
>  
> + shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
> +
>   cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
>   shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
>   shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0xFFFF;
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2] virtio-blk: add DISCARD support to virtio-blk driver

2017-07-05 Thread Paolo Bonzini


On 05/07/2017 09:57, Liu, Changpeng wrote:
>> Please include a patch for the specification.  Since we are at it, I
> Thanks Paolo, do you mean including a text file which describes the changes
> to the specification?

The specification is hosted in an svn (Subversion) repository at
https://tools.oasis-open.org/version-control/svn/virtio.  You can
provide a patch and send it to virtio-comm...@lists.oasis-open.org.

Thanks,

Paolo

>> would like to have three operations defined using the same descriptor:
>>
>> - discard (SCSI UNMAP)
>>
>> - write zeroes (SCSI WRITE SAME without UNMAP flag)
>>
>> - write zeroes and possibly discard (SCSI WRITE SAME with UNMAP flag)
>>
>> The last two can use the same command VIRTIO_BLK_T_WRITE_ZEROES, using
>> the reserved field as a flags field.
>
> Will add write zeroes feature.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/1] Change email address

2017-07-04 Thread Paolo Bonzini


On 04/07/2017 11:30, Cornelia Huck wrote:
> New employer, new address. Make sure people use the right one.
> 
> Christian, Martin, it's probably quickest if one of you takes this.

I'm sending a pull request tomorrow so I'll include this too.

Paolo

> Cornelia Huck (1):
>   Update my email address
> 
>  MAINTAINERS | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2] virtio-blk: add DISCARD support to virtio-blk driver

2017-07-04 Thread Paolo Bonzini


On 05/07/2017 10:44, Changpeng Liu wrote:
> Currently the virtio-blk driver does not provide a discard feature flag, so
> filesystems built on top of the block device will not send discard
> commands. This is okay for an HDD backend, but it will hurt performance
> for an SSD backend.
> 
> Add a feature flag VIRTIO_BLK_F_DISCARD and command VIRTIO_BLK_T_DISCARD
> to extend the existing virtio-blk protocol, and define a 16-byte discard
> descriptor for each discard segment; the descriptor definition aligns with
> the SCSI and NVM Express protocols, and the virtio-blk driver will support
> multi-range discard requests as well.
> 
> Signed-off-by: Changpeng Liu 

Please include a patch for the specification.  Since we are at it, I
would like to have three operations defined using the same descriptor:

- discard (SCSI UNMAP)

- write zeroes (SCSI WRITE SAME without UNMAP flag)

- write zeroes and possibly discard (SCSI WRITE SAME with UNMAP flag)

The last two can use the same command VIRTIO_BLK_T_WRITE_ZEROES, using
the reserved field as a flags field.
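
A hypothetical layout for that shared descriptor (field names are
illustrative, not final spec text):

struct virtio_blk_discard_write_zeroes {
	__le64 sector;		/* starting sector of the range */
	__le32 num_sectors;	/* length of the range in sectors */
	__le32 flags;		/* bit 0: unmap hint, meaningful only for
				 * VIRTIO_BLK_T_WRITE_ZEROES */
};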

Paolo

> ---
>  drivers/block/virtio_blk.c  | 76 
> +++--
>  include/uapi/linux/virtio_blk.h | 19 +++
>  2 files changed, 92 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 0297ad7..8f0c614 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -172,10 +172,52 @@ static int virtblk_add_req(struct virtqueue *vq, struct 
> virtblk_req *vbr,
>   return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
>  }
>  
> +static inline int virtblk_setup_discard(struct request *req)
> +{
> + unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
> + u32 block_size = queue_logical_block_size(req->q);
> + struct virtio_blk_discard *range;
> + struct bio *bio;
> +
> + if (block_size < 512 || !block_size)
> + return -1;
> +
> + range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
> + if (!range)
> + return -1;
> +
> + __rq_for_each_bio(bio, req) {
> + u64 slba = (bio->bi_iter.bi_sector << 9) / block_size;
> + u32 nlb = bio->bi_iter.bi_size / block_size;
> +
> + range[n].reserved = cpu_to_le32(0);
> + range[n].nlba = cpu_to_le32(nlb);
> + range[n].slba = cpu_to_le64(slba);
> + n++;
> + }
> +
> + if (WARN_ON_ONCE(n != segments)) {
> + kfree(range);
> + return -1;
> + }
> +
> + req->special_vec.bv_page = virt_to_page(range);
> + req->special_vec.bv_offset = offset_in_page(range);
> + req->special_vec.bv_len = sizeof(*range) * segments;
> + req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +
> + return 0;
> +}
> +
>  static inline void virtblk_request_done(struct request *req)
>  {
>   struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>  
> + if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
> + kfree(page_address(req->special_vec.bv_page) +
> +   req->special_vec.bv_offset);
> + }
> +
>   switch (req_op(req)) {
>   case REQ_OP_SCSI_IN:
>   case REQ_OP_SCSI_OUT:
> @@ -237,6 +279,9 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>   case REQ_OP_FLUSH:
>   type = VIRTIO_BLK_T_FLUSH;
>   break;
> + case REQ_OP_DISCARD:
> + type = VIRTIO_BLK_T_DISCARD;
> + break;
>   case REQ_OP_SCSI_IN:
>   case REQ_OP_SCSI_OUT:
>   type = VIRTIO_BLK_T_SCSI_CMD;
> @@ -256,9 +301,15 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>  
>   blk_mq_start_request(req);
>  
> + if (type == VIRTIO_BLK_T_DISCARD) {
> + err = virtblk_setup_discard(req);
> + if (err)
> + return BLK_STS_IOERR;
> + }
> +
>   num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
>   if (num) {
> - if (rq_data_dir(req) == WRITE)
> + if (rq_data_dir(req) == WRITE || type == VIRTIO_BLK_T_DISCARD)
>   vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, 
> VIRTIO_BLK_T_OUT);
>   else
>   vbr->out_hdr.type |= cpu_to_virtio32(vblk->vdev, 
> VIRTIO_BLK_T_IN);
> @@ -767,6 +818,25 @@ static int virtblk_probe(struct virtio_device *vdev)
>   if (!err && opt_io_size)
>   blk_queue_io_opt(q, blk_size * opt_io_size);
>  
> + if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
> + q->limits.discard_alignment = blk_size;
> + q->limits.discard_granularity = blk_size;
> +
> + virtio_cread(vdev, struct virtio_blk_config, max_discard_seg, 
> &v);
> + if (v)
> + blk_queue_max_discard_sectors(q, v);
> + else
> + blk_queue_max_discard_sectors(q, -1U);
> +
> + virtio_cread(vdev, struct virtio_blk_config, max_d

[PATCH] virtio_scsi: let host do exception handling

2017-06-21 Thread Paolo Bonzini
virtio_scsi tries to do exception handling after the default
30 seconds timeout expires.  However, it's better to let the host
control the timeout, otherwise with a heavy I/O load it is
likely that an abort will also timeout.  This leads to fatal
errors like filesystems going offline.

Disable the 'sd' timeout and allow the host to do exception
handling, following the precedent of the storvsc driver.

Hannes has a proposal to introduce timeouts in virtio, but
this provides an immediate solution for stable kernels too.

Reported-by: Douglas Miller 
Cc: "James E.J. Bottomley" 
Cc: "Martin K. Petersen" 
Cc: Hannes Reinecke 
Cc: linux-s...@vger.kernel.org
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 
---
 drivers/scsi/virtio_scsi.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index f8dbfeee6c63..55d344ea6869 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -796,6 +796,16 @@ static int virtscsi_map_queues(struct Scsi_Host *shost)
return blk_mq_virtio_map_queues(&shost->tag_set, vscsi->vdev, 2);
 }
 
+/*
+ * The host guarantees to respond to each command, although I/O latencies might
+ * be higher than on bare metal.  Reset the timer unconditionally to give the
+ * host a chance to perform EH.
+ */
+static enum blk_eh_timer_return virtscsi_eh_timed_out(struct scsi_cmnd *scmnd)
+{
+   return BLK_EH_RESET_TIMER;
+}
+
 static struct scsi_host_template virtscsi_host_template_single = {
.module = THIS_MODULE,
.name = "Virtio SCSI HBA",
@@ -806,6 +816,7 @@ static struct scsi_host_template 
virtscsi_host_template_single = {
.change_queue_depth = virtscsi_change_queue_depth,
.eh_abort_handler = virtscsi_abort,
.eh_device_reset_handler = virtscsi_device_reset,
+   .eh_timed_out = virtscsi_eh_timed_out,
.slave_alloc = virtscsi_device_alloc,
 
.can_queue = 1024,
@@ -826,6 +837,7 @@ static struct scsi_host_template 
virtscsi_host_template_multi = {
.change_queue_depth = virtscsi_change_queue_depth,
.eh_abort_handler = virtscsi_abort,
.eh_device_reset_handler = virtscsi_device_reset,
+   .eh_timed_out = virtscsi_eh_timed_out,
 
.can_queue = 1024,
.dma_boundary = UINT_MAX,
-- 
2.13.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


  1   2   3   4   5   6   >