Re: [Qemu-devel] [RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread Peter Maydell
On 25 October 2014 09:24, john.liuli john.li...@huawei.com wrote:
 To get the interrupt reason to support features such as VIRTIO_NET_F_STATUS
 I add a new register offset VIRTIO_MMIO_ISRMEM which
 will help to establish a shared memory region between qemu and the
 virtio-mmio device. Then the interrupt reason can be accessed by the
 guest driver through this region. At the same time, the virtio-mmio
 driver checks this region to see whether irqfd is supported during
 the irq handler registration, and a different handler will be assigned.

If you want to add a new register you should probably propose
an update to the virtio spec. However, it seems to me it would
be better to get generic PCI/PCIe working on the ARM virt
board instead; then we can let virtio-mmio quietly fade away.
This has been on the todo list for ages (and there have been
RFC patches posted for plain PCI), it's just nobody's had time
to work on it.

thanks
-- PMM


Re: [RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread Michael S. Tsirkin
On Mon, Oct 27, 2014 at 05:19:23PM +0800, Li Liu wrote:
 
 
 On 2014/10/26 19:52, Michael S. Tsirkin wrote:
  On Sat, Oct 25, 2014 at 04:24:52PM +0800, john.liuli wrote:
  From: Li Liu john.li...@huawei.com
 
  This set of patches tries to implement irqfd support for vhost-net 
  based on virtio-mmio.
 
  I had posted a mail talking about the status of vhost-net 
  on kvm-arm; refer to http://www.spinics.net/lists/kvm-arm/msg10804.html.
  Some dependent patches are listed in that mail too. Basically 
  vhost-net brings great performance improvements, almost 50%+.
 
  It's easy to implement irqfd support with PCI MSI-X. But so far 
  arm32 does not provide an equivalent mechanism to let a device 
  allocate multiple interrupts. And even though aarch64 provides LPIs,
  they are not available in the short term either.
 
  As Gauguey Remy said, "Vhost does not emulate a complete virtio 
  adapter but only manages virtqueue operations." The vhost module
  doesn't update the ISR register, so with only one irq there is 
  no way to get the interrupt reason even if we can inject the 
  irq correctly.
  
  Well guests don't read ISR in MSI-X mode so why does it help
  to set the ISR bit?
 
 Yeah, vhost doesn't need to set ISR under MSI-X mode. But for ARM,
 without an MSI-X-like mechanism the guest can't get the interrupt reason
 through the single irq handler (with one gsi resource).

 So I build a shared memory region, filled in by qemu, to provide the
 interrupt reason instead of the ISR register, without bothering vhost.
 Then even with only one irq and only one irq handler, the guest can
 still distinguish why the irq occurred.
 
 Li.

OK so this might allow sharing IRQs between config and VQs
which might be a handy optimization even for PCI
(at the moment reading ISR causes an exit).

Need to see how all this would work on MP systems though.
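
A minimal sketch of the exit being referred to (illustrative only, not from
the posted patches; read_intr_reason and the use_shared_mem flag are made up
here, while base, isr_mem and VIRTIO_MMIO_INTERRUPT_STATUS are the names the
driver and patch already use):

    static u32 read_intr_reason(struct virtio_mmio_device *vm_dev,
                                bool use_shared_mem)
    {
            if (!use_shared_mem)
                    /* ISR is a device register: this readl() traps to the host. */
                    return readl(vm_dev->base + VIRTIO_MMIO_INTERRUPT_STATUS);

            /* Proposed scheme: the reason lives in guest RAM shared with the
             * device model, so reading it is an ordinary, exit-free load. */
            return ACCESS_ONCE(*vm_dev->isr_mem);
    }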


 
  
  To get the interrupt reason to support features such as VIRTIO_NET_F_STATUS 
  I add a new register offset VIRTIO_MMIO_ISRMEM which 
  will help to establish a shared memory region between qemu and 
  the virtio-mmio device. Then the interrupt reason can be accessed by the
  guest driver through this region. At the same time, the virtio-mmio 
  driver checks this region to see whether irqfd is supported during 
  the irq handler registration, and a different handler will be assigned.
 
  I want to know whether this is the right direction. Does it comply with the 
  virtio-mmio spec? Or does anyone have better ideas for emulating MSI-X 
  based on virtio-mmio? I hope to get feedback and guidance.
  Thx for any help.
 
  Li Liu (2):
Add a new register offset let interrupt reason available
Assign a new irq handler while irqfd enabled
 
   drivers/virtio/virtio_mmio.c |   55 
  +++---
   include/linux/virtio_mmio.h  |3 +++
   2 files changed, 55 insertions(+), 3 deletions(-)
 
  -- 
  1.7.9.5
 
  
  .
  


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Mike Galbraith
On Sat, 2014-10-25 at 00:04 +0200, Peter Zijlstra wrote: 
 On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:
  The additional register pressure may just cause a few more register moves
  which should be negligible in the overall performance . The additional
  icache pressure, however, may have some impact on performance. I was trying
  to balance the performance of the pv and non-pv versions so that we won't
  penalize the pv code too much for a bit more performance in the non-pv code.
  Doing it your way will add a lot of function call and register
  saving/restoring to the pv code.
 
 If people care about performance they should not be using virt crap :-)

I tried some benching recently.. where did they hide the fastpaths? :)

-Mike



[RFC PATCH 1/2] Add a new register offset let interrupt reason available

2014-10-27 Thread john.liuli
From: Li Liu john.li...@huawei.com

Add a new register offset VIRTIO_MMIO_ISRMEM which helps to
establish a shared memory region between the virtio-mmio driver
and qemu with two purposes:

1. Guest virtio-mmio driver can get the interrupt reason.
2. Check whether irqfd is enabled in order to register a different irq handler.

Signed-off-by: Li Liu john.li...@huawei.com
---
 drivers/virtio/virtio_mmio.c |   21 -
 include/linux/virtio_mmio.h  |3 +++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index ef9a165..28ddb55 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -122,6 +122,8 @@ struct virtio_mmio_device {
/* a list of queues so we can dispatch IRQs */
spinlock_t lock;
struct list_head virtqueues;
+
+   uint8_t *isr_mem;
 };
 
 struct virtio_mmio_vq_info {
@@ -443,6 +445,7 @@ static int virtio_mmio_probe(struct platform_device *pdev)
struct virtio_mmio_device *vm_dev;
struct resource *mem;
unsigned long magic;
+   int err;
 
mem = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!mem)
@@ -481,6 +484,15 @@ static int virtio_mmio_probe(struct platform_device *pdev)
return -ENXIO;
}
 
+   vm_dev->isr_mem = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL|__GFP_ZERO);
+   if (vm_dev->isr_mem == NULL) {
+   dev_err(&pdev->dev, "Allocate isr memory failed!\n");
+   return -ENOMEM;
+   }
+
+   writel(virt_to_phys(vm_dev->isr_mem),
+  vm_dev->base + VIRTIO_MMIO_ISRMEM);
+
	vm_dev->vdev.id.device = readl(vm_dev->base + VIRTIO_MMIO_DEVICE_ID);
	vm_dev->vdev.id.vendor = readl(vm_dev->base + VIRTIO_MMIO_VENDOR_ID);
 
@@ -488,13 +500,20 @@ static int virtio_mmio_probe(struct platform_device *pdev)
 
platform_set_drvdata(pdev, vm_dev);
 
-   return register_virtio_device(&vm_dev->vdev);
+   err = register_virtio_device(&vm_dev->vdev);
+   if (err) {
+   free_pages_exact(vm_dev->isr_mem, PAGE_SIZE);
+   vm_dev->isr_mem = NULL;
+   }
+
+   return err;
 }
 
 static int virtio_mmio_remove(struct platform_device *pdev)
 {
struct virtio_mmio_device *vm_dev = platform_get_drvdata(pdev);
 
+   free_pages_exact(vm_dev->isr_mem, PAGE_SIZE);
	unregister_virtio_device(&vm_dev->vdev);
 
return 0;
diff --git a/include/linux/virtio_mmio.h b/include/linux/virtio_mmio.h
index 5c7b6f0..b1e3ec7 100644
--- a/include/linux/virtio_mmio.h
+++ b/include/linux/virtio_mmio.h
@@ -95,6 +95,9 @@
 /* Device status register - Read Write */
 #define VIRTIO_MMIO_STATUS 0x070
 
+/* Allocate ISRMEM for interrupt reason - Write Only */
+#define VIRTIO_MMIO_ISRMEM 0x080
+
 /* The config space is defined by each driver as
  * the per-driver configuration space - Read Write */
 #define VIRTIO_MMIO_CONFIG 0x100
-- 
1.7.9.5




Re: [RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread Li Liu


On 2014/10/26 19:52, Michael S. Tsirkin wrote:
 On Sat, Oct 25, 2014 at 04:24:52PM +0800, john.liuli wrote:
 From: Li Liu john.li...@huawei.com

 This set of patches tries to implement irqfd support for vhost-net 
 based on virtio-mmio.

 I had posted a mail talking about the status of vhost-net 
 on kvm-arm; refer to http://www.spinics.net/lists/kvm-arm/msg10804.html.
 Some dependent patches are listed in that mail too. Basically 
 vhost-net brings great performance improvements, almost 50%+.

 It's easy to implement irqfd support with PCI MSI-X. But so far 
 arm32 does not provide an equivalent mechanism to let a device 
 allocate multiple interrupts. And even though aarch64 provides LPIs,
 they are not available in the short term either.

 As Gauguey Remy said, "Vhost does not emulate a complete virtio 
 adapter but only manages virtqueue operations." The vhost module
 doesn't update the ISR register, so with only one irq there is 
 no way to get the interrupt reason even if we can inject the 
 irq correctly.
 
 Well guests don't read ISR in MSI-X mode so why does it help
 to set the ISR bit?

Yeah, vhost doesn't need to set ISR under MSI-X mode. But for ARM,
without an MSI-X-like mechanism the guest can't get the interrupt reason
through the single irq handler (with one gsi resource).

So I build a shared memory region, filled in by qemu, to provide the
interrupt reason instead of the ISR register, without bothering vhost.
Then even with only one irq and only one irq handler, the guest can
still distinguish why the irq occurred.

Li.

 
 To get the interrupt reason to support features such as VIRTIO_NET_F_STATUS 
 I add a new register offset VIRTIO_MMIO_ISRMEM which 
 will help to establish a shared memory region between qemu and 
 the virtio-mmio device. Then the interrupt reason can be accessed by the
 guest driver through this region. At the same time, the virtio-mmio 
 driver checks this region to see whether irqfd is supported during 
 the irq handler registration, and a different handler will be assigned.

 I want to know whether this is the right direction. Does it comply with the 
 virtio-mmio spec? Or does anyone have better ideas for emulating MSI-X 
 based on virtio-mmio? I hope to get feedback and guidance.
 Thx for any help.

 Li Liu (2):
   Add a new register offset let interrupt reason available
   Assign a new irq handler while irqfd enabled

  drivers/virtio/virtio_mmio.c |   55 
 +++---
  include/linux/virtio_mmio.h  |3 +++
  2 files changed, 55 insertions(+), 3 deletions(-)

 -- 
 1.7.9.5

 
 .
 



[RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread john.liuli
From: Li Liu john.li...@huawei.com

This set of patches tries to implement irqfd support for vhost-net 
based on virtio-mmio.

I had posted a mail talking about the status of vhost-net 
on kvm-arm; refer to http://www.spinics.net/lists/kvm-arm/msg10804.html.
Some dependent patches are listed in that mail too. Basically 
vhost-net brings great performance improvements, almost 50%+.

It's easy to implement irqfd support with PCI MSI-X. But so far 
arm32 does not provide an equivalent mechanism to let a device 
allocate multiple interrupts. And even though aarch64 provides LPIs,
they are not available in the short term either.

As Gauguey Remy said, "Vhost does not emulate a complete virtio 
adapter but only manages virtqueue operations." The vhost module
doesn't update the ISR register, so with only one irq there is 
no way to get the interrupt reason even if we can inject the 
irq correctly.

To get the interrupt reason to support features such as VIRTIO_NET_F_STATUS 
I add a new register offset VIRTIO_MMIO_ISRMEM which 
will help to establish a shared memory region between qemu and 
the virtio-mmio device. Then the interrupt reason can be accessed by the
guest driver through this region. At the same time, the virtio-mmio 
driver checks this region to see whether irqfd is supported during 
the irq handler registration, and a different handler will be assigned.

I want to know whether this is the right direction. Does it comply with the 
virtio-mmio spec? Or does anyone have better ideas for emulating MSI-X 
based on virtio-mmio? I hope to get feedback and guidance.
Thx for any help.

Li Liu (2):
  Add a new register offset let interrupt reason available
  Assign a new irq handler while irqfd enabled

 drivers/virtio/virtio_mmio.c |   55 +++---
 include/linux/virtio_mmio.h  |3 +++
 2 files changed, 55 insertions(+), 3 deletions(-)

-- 
1.7.9.5




[RFC PATCH 2/2] Assign a new irq handler while irqfd enabled

2014-10-27 Thread john.liuli
From: Li Liu john.li...@huawei.com

This irq handler gets the interrupt reason from a
shared memory region, and will be assigned only while
irqfd is enabled.

Signed-off-by: Li Liu john.li...@huawei.com
---
 drivers/virtio/virtio_mmio.c |   34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 28ddb55..7229605 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -259,7 +259,31 @@ static irqreturn_t vm_interrupt(int irq, void *opaque)
return ret;
 }
 
+/* Notify all virtqueues on an interrupt. */
+static irqreturn_t vm_interrupt_irqfd(int irq, void *opaque)
+{
+   struct virtio_mmio_device *vm_dev = opaque;
+   struct virtio_mmio_vq_info *info;
+   unsigned long status;
+   unsigned long flags;
+   irqreturn_t ret = IRQ_NONE;
 
+   /* Read the interrupt reason and reset it */
+   status = *vm_dev->isr_mem;
+   *vm_dev->isr_mem = 0x0;
+
+   if (unlikely(status & VIRTIO_MMIO_INT_CONFIG)) {
+   virtio_config_changed(&vm_dev->vdev);
+   ret = IRQ_HANDLED;
+   }
+
+   spin_lock_irqsave(&vm_dev->lock, flags);
+   list_for_each_entry(info, &vm_dev->virtqueues, node)
+   ret |= vring_interrupt(irq, info->vq);
+   spin_unlock_irqrestore(&vm_dev->lock, flags);
+
+   return ret;
+}
 
 static void vm_del_vq(struct virtqueue *vq)
 {
@@ -391,6 +415,7 @@ error_available:
return ERR_PTR(err);
 }
 
+#define VIRTIO_MMIO_F_IRQFD (1 << 7)
 static int vm_find_vqs(struct virtio_device *vdev, unsigned nvqs,
   struct virtqueue *vqs[],
   vq_callback_t *callbacks[],
@@ -400,8 +425,13 @@ static int vm_find_vqs(struct virtio_device *vdev, 
unsigned nvqs,
	unsigned int irq = platform_get_irq(vm_dev->pdev, 0);
	int i, err;
 
-   err = request_irq(irq, vm_interrupt, IRQF_SHARED,
-   dev_name(&vdev->dev), vm_dev);
+   if (*vm_dev->isr_mem & VIRTIO_MMIO_F_IRQFD) {
+   err = request_irq(irq, vm_interrupt_irqfd, IRQF_SHARED,
+ dev_name(&vdev->dev), vm_dev);
+   } else {
+   err = request_irq(irq, vm_interrupt, IRQF_SHARED,
+ dev_name(&vdev->dev), vm_dev);
+   }
if (err)
return err;
 
-- 
1.7.9.5




Re: [RFC PATCH 2/2] Assign a new irq handler while irqfd enabled

2014-10-27 Thread Li Liu


On 2014/10/26 19:56, Michael S. Tsirkin wrote:
 On Sat, Oct 25, 2014 at 04:24:54PM +0800, john.liuli wrote:
 From: Li Liu john.li...@huawei.com

 This irq handler gets the interrupt reason from a
 shared memory region, and will be assigned only while
 irqfd is enabled.

 Signed-off-by: Li Liu john.li...@huawei.com
 ---
  drivers/virtio/virtio_mmio.c |   34 --
  1 file changed, 32 insertions(+), 2 deletions(-)

 diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
 index 28ddb55..7229605 100644
 --- a/drivers/virtio/virtio_mmio.c
 +++ b/drivers/virtio/virtio_mmio.c
 @@ -259,7 +259,31 @@ static irqreturn_t vm_interrupt(int irq, void *opaque)
  return ret;
  }
  
 +/* Notify all virtqueues on an interrupt. */
 +static irqreturn_t vm_interrupt_irqfd(int irq, void *opaque)
 +{
 +struct virtio_mmio_device *vm_dev = opaque;
 +struct virtio_mmio_vq_info *info;
 +unsigned long status;
 +unsigned long flags;
 +irqreturn_t ret = IRQ_NONE;
  
 +/* Read the interrupt reason and reset it */
 +status = *vm_dev->isr_mem;
 +*vm_dev->isr_mem = 0x0;
 
 you are reading and modifying shared memory
 without atomics and any memory barriers.
 Why is this safe?
 

good catch, a stupid mistake.
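
One way to close that race (a sketch only, assuming the reason field is
widened to a naturally aligned 32-bit word in the shared page instead of the
single byte used here) is an atomic read-and-clear in place of the two plain
accesses:

    /* xchg() is fully ordered, so reason bits the device model sets between
     * the load and the clear are not lost, and the handler observes
     * everything published before the interrupt was raised. */
    status = xchg((u32 *)vm_dev->isr_mem, 0);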

 +
 +if (unlikely(status & VIRTIO_MMIO_INT_CONFIG)) {
 +virtio_config_changed(&vm_dev->vdev);
 +ret = IRQ_HANDLED;
 +}
 +
 +spin_lock_irqsave(&vm_dev->lock, flags);
 +list_for_each_entry(info, &vm_dev->virtqueues, node)
 +ret |= vring_interrupt(irq, info->vq);
 +spin_unlock_irqrestore(&vm_dev->lock, flags);
 +
 +return ret;
 +}
  
  static void vm_del_vq(struct virtqueue *vq)
  {
 
 So you invoke callbacks for all VQs.
 This won't scale well as the number of VQs grows, will it?
 
 @@ -391,6 +415,7 @@ error_available:
  return ERR_PTR(err);
  }
  
 +#define VIRTIO_MMIO_F_IRQFD (1 << 7)
  static int vm_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 struct virtqueue *vqs[],
 vq_callback_t *callbacks[],
 @@ -400,8 +425,13 @@ static int vm_find_vqs(struct virtio_device *vdev, 
 unsigned nvqs,
  unsigned int irq = platform_get_irq(vm_dev->pdev, 0);
  int i, err;
  
 -err = request_irq(irq, vm_interrupt, IRQF_SHARED,
 -dev_name(&vdev->dev), vm_dev);
 +if (*vm_dev->isr_mem & VIRTIO_MMIO_F_IRQFD) {
 +err = request_irq(irq, vm_interrupt_irqfd, IRQF_SHARED,
 +  dev_name(&vdev->dev), vm_dev);
 +} else {
 +err = request_irq(irq, vm_interrupt, IRQF_SHARED,
 +  dev_name(&vdev->dev), vm_dev);
 +}
  if (err)
  return err;
 
 
 So still a single interrupt for all VQs.
 Again this doesn't scale: a single CPU has to handle
 interrupts for all of them.
 I think you need to find a way to get per-VQ interrupts.

Yeah, AFAIK it's impossible to distribute work to different CPUs with
only one irq and without an MSI-X-like mechanism. Assigning multiple GSIs to
one device is obviously wasteful and not scalable. Any ideas? Thx.

 
 -- 
 1.7.9.5

 
 .
 



Re: [Qemu-devel] [RFC PATCH 0/2] virtio-mmio: add irqfd support for vhost-net based on virtio-mmio

2014-10-27 Thread Li Liu


On 2014/10/27 17:37, Peter Maydell wrote:
 On 25 October 2014 09:24, john.liuli john.li...@huawei.com wrote:
 To get the interrupt reason to support features such as VIRTIO_NET_F_STATUS
 I add a new register offset VIRTIO_MMIO_ISRMEM which
 will help to establish a shared memory region between qemu and
 the virtio-mmio device. Then the interrupt reason can be accessed by the
 guest driver through this region. At the same time, the virtio-mmio
 driver checks this region to see whether irqfd is supported during
 the irq handler registration, and a different handler will be assigned.
 
 If you want to add a new register you should probably propose
 an update to the virtio spec. However, it seems to me it would
 be better to get generic PCI/PCIe working on the ARM virt
 board instead; then we can let virtio-mmio quietly fade away.
 This has been on the todo list for ages (and there have been
 RFC patches posted for plain PCI), it's just nobody's had time
 to work on it.
 
 thanks
 -- PMM
 

So you mean virtio-mmio will eventually be replaced by PCI/PCIe on ARM?
If so, let this patch go with the wind :). Thx.

Li.
 .
 



Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Waiman Long

On 10/24/2014 06:04 PM, Peter Zijlstra wrote:

On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:

The additional register pressure may just cause a few more register moves
which should be negligible in the overall performance . The additional
icache pressure, however, may have some impact on performance. I was trying
to balance the performance of the pv and non-pv versions so that we won't
penalize the pv code too much for a bit more performance in the non-pv code.
Doing it your way will add a lot of function call and register
saving/restoring to the pv code.

If people care about performance they should not be using virt crap :-)

I only really care about bare metal.


Yes, I am aware of that. However, the whole point of doing PV spinlock 
is to improve performance in a virtual guest.


Anyway, I had done some measurements. In my test system, the 
queue_spin_lock_slowpath() function has a text size of about 400 bytes 
without PV, but 1120 bytes with PV. I made some changes to create 
separate versions of PV and non-PV slowpath functions as shown by the 
diff below. The text size is now about 430 bytes for the non-PV version 
and 925 bytes for the PV version. The overall object size increases by a 
bit more than 200 bytes, but the icache footprint should be reduced no 
matter which version is used.


-Longman



diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h

index d424252..241bf30 100644
--- a/arch/x86/include/asm/pvqspinlock.h
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -79,9 +79,6 @@ static inline void pv_init_node(struct mcs_spinlock *node)

BUILD_BUG_ON(sizeof(struct pv_qnode) > 5*sizeof(struct mcs_spinlock));


-   if (!pv_enabled())
-   return;
-
pn->cpustate = PV_CPU_ACTIVE;
pn->mayhalt  = false;
pn->mycpu    = smp_processor_id();
@@ -132,9 +129,6 @@ static inline bool pv_link_and_wait_node(u32 old, struct mcs_spinlock *node)

struct pv_qnode *ppn, *pn = (struct pv_qnode *)node;
unsigned int count;

-   if (!pv_enabled())
-   return false;
-
if (!(old & _Q_TAIL_MASK)) {
node->locked = true;	/* At queue head now */
goto ret;
@@ -236,9 +230,6 @@ pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)

 {
struct pv_qnode *pn = (struct pv_qnode *)node;

-   if (!pv_enabled())
-   return smp_load_acquire(&lock->val.counter);
-
for (;;) {
unsigned int count;
s8 oldstate;
@@ -328,8 +319,6 @@ static inline void pv_wait_check(struct qspinlock *lock,
struct pv_qnode *pnxt = (struct pv_qnode *)next;
struct pv_qnode *pcur = (struct pv_qnode *)node;

-   if (!pv_enabled())
-   return;
/*
 * Clear the locked and head values of lock holder
 */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1662dbd..05aea57 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -16,6 +16,7 @@
  * Authors: Waiman Long waiman.l...@hp.com
  *  Peter Zijlstra pzijl...@redhat.com
  */
+#ifndef _GEN_PV_LOCK_SLOWPATH
 #include <linux/smp.h>
 #include <linux/bug.h>
 #include <linux/cpumask.h>
@@ -271,19 +272,37 @@ void queue_spin_unlock_slowpath(struct qspinlock 
*lock)

 }
 EXPORT_SYMBOL(queue_spin_unlock_slowpath);

-#else
+static void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+#else  /* CONFIG_PARAVIRT_SPINLOCKS */
+
+static inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock, 
u32 val)

+   { }

-static inline void pv_init_node(struct mcs_spinlock *node) { }
-static inline void pv_wait_check(struct qspinlock *lock,
-struct mcs_spinlock *node,
-struct mcs_spinlock *next) { }
-static inline bool pv_link_and_wait_node(u32 old, struct mcs_spinlock 
*node)

+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+/*
+ * Dummy PV functions for bare-metal slowpath code
+ */
+static inline void nopv_init_node(struct mcs_spinlock *node)   { }
+static inline void nopv_wait_check(struct qspinlock *lock,
+  struct mcs_spinlock *node,
+  struct mcs_spinlock *next)   { }
+static inline bool nopv_link_and_wait_node(u32 old, struct mcs_spinlock 
*node)

   { return false; }
-static inline int  pv_wait_head(struct qspinlock *lock,
+static inline int  nopv_wait_head(struct qspinlock *lock,
struct mcs_spinlock *node)
   { return smp_load_acquire(&lock->val.counter); }
+static inline bool return_true(void)   { return true;  }
+static inline bool return_false(void)  { return false; }

-#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+#define pv_init_node   nopv_init_node
+#define pv_wait_check  nopv_wait_check
+#define pv_link_and_wait_node  nopv_link_and_wait_node

Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Peter Zijlstra
On Mon, Oct 27, 2014 at 01:15:53PM -0400, Waiman Long wrote:
 On 10/24/2014 06:04 PM, Peter Zijlstra wrote:
 On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:
 The additional register pressure may just cause a few more register moves
 which should be negligible in the overall performance . The additional
 icache pressure, however, may have some impact on performance. I was trying
 to balance the performance of the pv and non-pv versions so that we won't
 penalize the pv code too much for a bit more performance in the non-pv code.
 Doing it your way will add a lot of function call and register
 saving/restoring to the pv code.
 If people care about performance they should not be using virt crap :-)
 
 I only really care about bare metal.
 
 Yes, I am aware of that. However, the whole point of doing PV spinlock is to
 improve performance in a virtual guest.

Anything that avoids the lock holder preemption nonsense is a _massive_
win for them, a few function calls should not even register on that
scale.

 +#ifdef _GEN_PV_LOCK_SLOWPATH
 +static void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 +#else
  void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 +#endif

If you have two functions you might as well use the PV stuff to patch in
the right function call at the usage sites and avoid:

 +   if (pv_enabled()) {
 +   pv_queue_spin_lock_slowpath(lock, val);
 +   return;
 +   }

this altogether.
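
The shape of that suggestion, as a rough kernel-flavoured sketch (not Peter's
actual patch: the *_op pointer and the function names are invented here, and
the real thing would bind the call through the paravirt patching machinery
rather than a plain pointer):

    struct qspinlock;       /* layout as in the real patches; opaque here */

    void native_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
    void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);

    /* Bound once at boot: bare metal keeps the native variant, a PV guest
     * rewrites the call (or this pointer) to the PV variant. */
    static void (*queue_spin_lock_slowpath_op)(struct qspinlock *, u32) =
            native_queue_spin_lock_slowpath;

    static inline void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
    {
            /* no pv_enabled() test left in the contended path */
            queue_spin_lock_slowpath_op(lock, val);
    }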

 this_cpu_dec(mcs_nodes[0].count);
  }
  EXPORT_SYMBOL(queue_spin_lock_slowpath);
 +
 +#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 +/*
 + * Generate the PV version of the queue_spin_lock_slowpath function
 + */
 +#undef pv_init_node
 +#undef pv_wait_check
 +#undef pv_link_and_wait_node
 +#undef pv_wait_head
 +#undef EXPORT_SYMBOL
 +#undef in_pv_code
 +
 +#define _GEN_PV_LOCK_SLOWPATH
 +#define EXPORT_SYMBOL(x)
 +#define in_pv_code return_true
 +#define pv_enabled return_false
 +
 +#include qspinlock.c
 +
 +#endif

That's properly disgusting :-) But a lot better than actually
duplicating everything I suppose.
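
For anyone following along, the trick in miniature (a stand-alone toy, not
the real qspinlock.c; the file and symbol names are invented): the file
includes itself once with the PV hook stubbed out and once with it live, so a
single body yields two functions.

    /* twopath.c: build with "cc -c twopath.c" */
    #ifndef _GEN_PV_VARIANT
    #include <stdio.h>

    static void pv_hook(void) { }        /* stubbed out for the native build */
    #define slowpath        slowpath_native
    #endif

    void slowpath(int val)               /* one body, emitted twice */
    {
            pv_hook();
            printf("slowpath(%d)\n", val);
    }

    #ifndef _GEN_PV_VARIANT
    #undef slowpath

    #define _GEN_PV_VARIANT
    static void pv_hook_live(void) { puts("pv hook"); }
    #define pv_hook         pv_hook_live /* hook is live in the PV build */
    #define slowpath        slowpath_pv

    #include "twopath.c"                 /* re-include to emit the PV variant */
    #endif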


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Waiman Long

On 10/24/2014 04:54 AM, Peter Zijlstra wrote:

On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:


Since enabling paravirt spinlock will disable unlock function inlining,
a jump label can be added to the unlock function without adding patch
sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain,
again reducing the overhead of enabling PV spinlocks while running on a
real machine.

Look at:

   http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
  DEF_NATIVE(, mov32, "mov %edi, %eax");
  DEF_NATIVE(, mov64, "mov %rdi, %rax");

+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
+#endif
+
  unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
  {
 return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
 PATCH_SITE(pv_cpu_ops, clts);
 PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+   PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif

 patch_site:
 ret = paravirt_patch_insns(ibuf, len, start, end);


That makes sure to overwrite the callee-saved call to the
pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).


Therefore you can retain the inlined unlock with hardly (there might be
some NOP padding) any overhead at all. On PV it reverts to a callee
saved function call.


My concern is that spin_unlock() can be called in many places, including 
loadable kernel modules. Is the paravirt_patch_ident_32() function able 
to patch all of them in reasonable time? How about a kernel module 
loaded later at run time?


So I think we may still need to disable unlock function inlining even if 
we use your way of kernel call-site patching.


Regards,
Longman


Re: [PATCH v12 00/11] qspinlock: a 4-byte queue spinlock with PV support

2014-10-27 Thread Waiman Long

On 10/24/2014 04:57 AM, Peter Zijlstra wrote:

On Thu, Oct 16, 2014 at 02:10:29PM -0400, Waiman Long wrote:

v11-v12:
  - Based on PeterZ's version of the qspinlock patch
(https://lkml.org/lkml/2014/6/15/63).
  - Incorporated many of the review comments from Konrad Wilk and
Paolo Bonzini.
  - The pvqspinlock code is largely from my previous version with
PeterZ's way of going from queue tail to head and his idea of
using callee saved calls to KVM and XEN codes.

Thanks for taking the time to refresh this.. I would prefer you use a
little more of the PV techniques I outlined in my latest PV patch to
further reduce the overhead of PV enabled kernels on real hardware.

This is an important use case, because distro kernels will have to
enable PV support while their majority of installations will be on
physical hardware.

Other than that I see no reason not to move this forward.


Thanks for reviewing the patch and agreeing to move forward. Currently, I 
am thinking of separating out PV and non-PV versions of the lock 
slowpath functions as shown in my previous mail. That should also 
minimize the performance impact on bare metal even more than what can be 
done with the PV techniques used in your patch while not penalizing PV 
performance.


As for the unlock function, if the site patching function can handle all 
the possible call sites of spin_unlock() without disabling function 
inlining, I will be glad to use your way of handling the unlock function. 
Otherwise, I will prefer my current approach as it is simpler and easier 
to understand, as well as similar to what has been done in the pv 
ticket spinlock code.


-Longman


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Konrad Rzeszutek Wilk
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 On 10/24/2014 04:54 AM, Peter Zijlstra wrote:
 On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
 
 Since enabling paravirt spinlock will disable unlock function inlining,
 a jump label can be added to the unlock function without adding patch
 sites all over the kernel.
 But you don't have to. My patches allowed for the inline to remain,
 again reducing the overhead of enabling PV spinlocks while running on a
 real machine.
 
 Look at:
 
http://lkml.kernel.org/r/20140615130154.213923...@chello.nl
 
 In particular this hunk:
 
 Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 ===
 --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
 +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
   DEF_NATIVE(, mov32, "mov %edi, %eax");
   DEF_NATIVE(, mov64, "mov %rdi, %rax");
 
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
 +DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
 +#endif
 +
   unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
   {
  return paravirt_patch_insns(insnbuf, len,
 @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
  PATCH_SITE(pv_cpu_ops, clts);
  PATCH_SITE(pv_mmu_ops, flush_tlb_single);
  PATCH_SITE(pv_cpu_ops, wbinvd);
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
 +   PATCH_SITE(pv_lock_ops, queue_unlock);
 +#endif
 
  patch_site:
  ret = paravirt_patch_insns(ibuf, len, start, end);
 
 
 That makes sure to overwrite the callee-saved call to the
 pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).
 
 
 Therefore you can retain the inlined unlock with hardly (there might be
 some NOP padding) any overhead at all. On PV it reverts to a callee
 saved function call.
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Is the paravirt_patch_ident_32() function able to
 patch all of them in reasonable time? How about a kernel module loaded later
 at run time?

It has to. When the modules are loaded the .paravirt symbols are exposed
and the module loader patches that.

And during bootup time (before modules are loaded) it also patches everything
- when it only runs on one CPU.
 
 So I think we may still need to disable unlock function inlining even if we
 use your way of kernel call-site patching.

No need. Inlining should be (and is) working just fine.
 
 Regards,
 Longman


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Peter Zijlstra
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 On 10/24/2014 04:54 AM, Peter Zijlstra wrote:
 On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
 
 Since enabling paravirt spinlock will disable unlock function inlining,
 a jump label can be added to the unlock function without adding patch
 sites all over the kernel.
 But you don't have to. My patches allowed for the inline to remain,
 again reducing the overhead of enabling PV spinlocks while running on a
 real machine.
 
 Look at:
 
http://lkml.kernel.org/r/20140615130154.213923...@chello.nl
 
 In particular this hunk:
 
 Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 ===
 --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
 +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
   DEF_NATIVE(, mov32, "mov %edi, %eax");
   DEF_NATIVE(, mov64, "mov %rdi, %rax");
 
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
 +DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
 +#endif
 +
   unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
   {
  return paravirt_patch_insns(insnbuf, len,
 @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
  PATCH_SITE(pv_cpu_ops, clts);
  PATCH_SITE(pv_mmu_ops, flush_tlb_single);
  PATCH_SITE(pv_cpu_ops, wbinvd);
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
 +   PATCH_SITE(pv_lock_ops, queue_unlock);
 +#endif
 
  patch_site:
  ret = paravirt_patch_insns(ibuf, len, start, end);
 
 
 That makes sure to overwrite the callee-saved call to the
 pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).
 
 
 Therefore you can retain the inlined unlock with hardly (there might be
 some NOP padding) any overhead at all. On PV it reverts to a callee
 saved function call.
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Is the paravirt_patch_ident_32() function able to
 patch all of them in reasonable time? How about a kernel module loaded later
 at run time?

modules should be fine, see arch/x86/kernel/module.c:module_finalize()
-> apply_paravirt().

Also note that the 'default' text is an indirect call into the paravirt
ops table which routes to the 'right' function, so even if the text
patching would be 'late' calls would 'work' as expected, just slower.


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Waiman Long

On 10/27/2014 01:27 PM, Peter Zijlstra wrote:

On Mon, Oct 27, 2014 at 01:15:53PM -0400, Waiman Long wrote:

On 10/24/2014 06:04 PM, Peter Zijlstra wrote:

On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:

The additional register pressure may just cause a few more register moves
which should be negligible in the overall performance . The additional
icache pressure, however, may have some impact on performance. I was trying
to balance the performance of the pv and non-pv versions so that we won't
penalize the pv code too much for a bit more performance in the non-pv code.
Doing it your way will add a lot of function call and register
saving/restoring to the pv code.

If people care about performance they should not be using virt crap :-)

I only really care about bare metal.

Yes, I am aware of that. However, the whole point of doing PV spinlock is to
improve performance in a virtual guest.

Anything that avoids the lock holder preemption nonsense is a _massive_
win for them, a few function calls should not even register on that
scale.


I would say all the PV stuff is mostly useful for an over-committed 
guest where a single CPU is shared by more than one guest. When the 
guests are not overcommitted, the PV code seldom gets triggered. In those 
cases, the overhead of the extra function call and register 
saving/restoring will show up.



+#ifdef _GEN_PV_LOCK_SLOWPATH
+static void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+#else
  void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+#endif

If you have two functions you might as well use the PV stuff to patch in
the right function call at the usage sites and avoid:


+   if (pv_enabled()) {
+   pv_queue_spin_lock_slowpath(lock, val);
+   return;
+   }

this alltogether.


Good point! I will do some investigation on how to do this kind of 
function address patching and eliminate the extra function call overhead.



 this_cpu_dec(mcs_nodes[0].count);
  }
  EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+/*
+ * Generate the PV version of the queue_spin_lock_slowpath function
+ */
+#undef pv_init_node
+#undef pv_wait_check
+#undef pv_link_and_wait_node
+#undef pv_wait_head
+#undef EXPORT_SYMBOL
+#undef in_pv_code
+
+#define _GEN_PV_LOCK_SLOWPATH
+#define EXPORT_SYMBOL(x)
+#define in_pv_code return_true
+#define pv_enabled return_false
+
+#include qspinlock.c
+
+#endif

That's properly disgusting :-) But a lot better than actually
duplicating everything I suppose.


I know you don't like this kind of preprocessor trick, but this is the 
easiest way that I can think of to generate two separate functions from 
the same source code.


-Longman


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Waiman Long

On 10/27/2014 02:02 PM, Konrad Rzeszutek Wilk wrote:

On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:

On 10/24/2014 04:54 AM, Peter Zijlstra wrote:

On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:


Since enabling paravirt spinlock will disable unlock function inlining,
a jump label can be added to the unlock function without adding patch
sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain,
again reducing the overhead of enabling PV spinlocks while running on a
real machine.

Look at:

   http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
  DEF_NATIVE(, mov32, "mov %edi, %eax");
  DEF_NATIVE(, mov64, "mov %rdi, %rax");

+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
+#endif
+
  unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
  {
 return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
 PATCH_SITE(pv_cpu_ops, clts);
 PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+   PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif

 patch_site:
 ret = paravirt_patch_insns(ibuf, len, start, end);


That makes sure to overwrite the callee-saved call to the
pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).


Therefore you can retain the inlined unlock with hardly (there might be
some NOP padding) any overhead at all. On PV it reverts to a callee
saved function call.

My concern is that spin_unlock() can be called in many places, including
loadable kernel modules. Is the paravirt_patch_ident_32() function able to
patch all of them in reasonable time? How about a kernel module loaded later
at run time?

It has to. When the modules are loaded the .paravirt symbols are exposed
and the module loader patches that.

And during bootup time (before modules are loaded) it also patches everything
- when it only runs on one CPU.

So I think we may still need to disable unlock function inlining even if we
use your way of kernel call-site patching.

No need. Inlining should be (and is) working just fine.

Regards,
Longman


Thanks for letting me know about the paravirt patching capability 
available in the kernel. In this case, I would say we should use Peter's 
way of doing unlock without disabling unlock function inlining. That 
will further reduce the performance difference of kernels with and 
without PV.


Cheer,
Longman


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Waiman Long

On 10/27/2014 02:04 PM, Peter Zijlstra wrote:

On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:

On 10/24/2014 04:54 AM, Peter Zijlstra wrote:

On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:


Since enabling paravirt spinlock will disable unlock function inlining,
a jump label can be added to the unlock function without adding patch
sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain,
again reducing the overhead of enabling PV spinlocks while running on a
real machine.

Look at:

   http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
  DEF_NATIVE(, mov32, "mov %edi, %eax");
  DEF_NATIVE(, mov64, "mov %rdi, %rax");

+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
+#endif
+
  unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
  {
 return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
 PATCH_SITE(pv_cpu_ops, clts);
 PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+   PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif

 patch_site:
 ret = paravirt_patch_insns(ibuf, len, start, end);


That makes sure to overwrite the callee-saved call to the
pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).


Therefore you can retain the inlined unlock with hardly (there might be
some NOP padding) any overhead at all. On PV it reverts to a callee
saved function call.

My concern is that spin_unlock() can be called in many places, including
loadable kernel modules. Is the paravirt_patch_ident_32() function able to
patch all of them in reasonable time? How about a kernel module loaded later
at run time?

modules should be fine, see arch/x86/kernel/module.c:module_finalize()
-> apply_paravirt().

Also note that the 'default' text is an indirect call into the paravirt
ops table which routes to the 'right' function, so even if the text
patching would be 'late' calls would 'work' as expected, just slower.


Thanks for letting me know about that. I have this concern because your 
patch didn't change the current configuration of disabling unlock 
inlining when paravirt_spinlock is enabled. With that, I think it is 
worthwhile to reduce the performance delta between the PV and non-PV 
kernel on bare metal.


-Longman


Re: [PATCH RFC 1/4] virtio_net: pass vi around

2014-10-27 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Too many places poke at [rs]q->vq->vdev->priv just to get
 the vi structure.  Let's just pass the pointer around: seems
 cleaner, and might even be faster.

Agreed, it's neater.

Acked-by: Rusty Russell ru...@rustcorp.com.au

Thanks,
Rusty.


 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/net/virtio_net.c | 36 +++-
  1 file changed, 19 insertions(+), 17 deletions(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 57cbc7d..36f3dfc 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -241,11 +241,11 @@ static unsigned long mergeable_buf_to_ctx(void *buf, 
 unsigned int truesize)
  }
  
  /* Called from bottom half context */
 -static struct sk_buff *page_to_skb(struct receive_queue *rq,
 +static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 +struct receive_queue *rq,
  struct page *page, unsigned int offset,
  unsigned int len, unsigned int truesize)
  {
  - struct virtnet_info *vi = rq->vq->vdev->priv;
   struct sk_buff *skb;
   struct skb_vnet_hdr *hdr;
   unsigned int copy, hdr_len, hdr_padded_len;
 @@ -328,12 +328,13 @@ static struct sk_buff *receive_small(void *buf, 
 unsigned int len)
  }
  
  static struct sk_buff *receive_big(struct net_device *dev,
 +struct virtnet_info *vi,
  struct receive_queue *rq,
  void *buf,
  unsigned int len)
  {
   struct page *page = buf;
 - struct sk_buff *skb = page_to_skb(rq, page, 0, len, PAGE_SIZE);
 + struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
  
   if (unlikely(!skb))
   goto err;
 @@ -359,7 +360,8 @@ static struct sk_buff *receive_mergeable(struct 
 net_device *dev,
   int offset = buf - page_address(page);
   unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
  
 - struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 + struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
 +truesize);
   struct sk_buff *curr_skb = head_skb;
  
   if (unlikely(!curr_skb))
 @@ -433,9 +435,9 @@ err_buf:
   return NULL;
  }
  
 -static void receive_buf(struct receive_queue *rq, void *buf, unsigned int 
 len)
 +static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
 + void *buf, unsigned int len)
  {
  - struct virtnet_info *vi = rq->vq->vdev->priv;
   struct net_device *dev = vi->dev;
   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
   struct sk_buff *skb;
 @@ -459,9 +461,9 @@ static void receive_buf(struct receive_queue *rq, void 
 *buf, unsigned int len)
   if (vi->mergeable_rx_bufs)
   skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len);
   else if (vi->big_packets)
 - skb = receive_big(dev, rq, buf, len);
 + skb = receive_big(dev, vi, rq, buf, len);
   else
 - skb = receive_small(buf, len);
 + skb = receive_small(vi, buf, len);
  
   if (unlikely(!skb))
   return;
 @@ -530,9 +532,9 @@ frame_err:
   dev_kfree_skb(skb);
  }
  
 -static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 +static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue 
 *rq,
 +  gfp_t gfp)
  {
  - struct virtnet_info *vi = rq->vq->vdev->priv;
   struct sk_buff *skb;
   struct skb_vnet_hdr *hdr;
   int err;
 @@ -655,9 +657,9 @@ static int add_recvbuf_mergeable(struct receive_queue 
 *rq, gfp_t gfp)
   * before we're receiving packets, or from refill_work which is
   * careful to disable receiving (using napi_disable).
   */
 -static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 +static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 +   gfp_t gfp)
  {
  - struct virtnet_info *vi = rq->vq->vdev->priv;
   int err;
   bool oom;
  
 @@ -668,7 +670,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t 
 gfp)
   else if (vi->big_packets)
   err = add_recvbuf_big(rq, gfp);
   else
 - err = add_recvbuf_small(rq, gfp);
 + err = add_recvbuf_small(vi, rq, gfp);
  
   oom = err == -ENOMEM;
   if (err)
 @@ -717,7 +719,7 @@ static void refill_work(struct work_struct *work)
   struct receive_queue *rq = &vi->rq[i];
  
   napi_disable(&rq->napi);
 - still_empty = !try_fill_recv(rq, GFP_KERNEL);
 + still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
   virtnet_napi_enable(rq);
  
   /* In theory, this can happen: 

Re: [PATCH] virtio_blk: fix race at module removal

2014-10-27 Thread Rusty Russell
Ming Lei tom.leim...@gmail.com writes:
 On Fri, Oct 24, 2014 at 12:12 AM, Michael S. Tsirkin m...@redhat.com wrote:
 If a device appears while module is being removed,
 driver will get a callback after we've given up
 on the major number.

 In theory this means this major number can get reused
 by something else, resulting in a conflict.

 Yes, there is a tiny race window.

Applied.

Thanks,
Rusty.


Re: [PATCH RFC 2/4] virtio_net: get rid of virtio_net_hdr/skb_vnet_hdr

2014-10-27 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 virtio 1.0 doesn't use virtio_net_hdr anymore, and in fact, it's not
 really useful since virtio_net_hdr_mrg_rxbuf includes that as the first
 field anyway.

 Let's drop it, precalculate header len and store within vi instead.

 This way we can also remove struct skb_vnet_hdr.

Yes, this is definitely a win.

Acked-by: Rusty Russell ru...@rustcorp.com.au

Thanks,
Rusty.


 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/net/virtio_net.c | 88 
 ++--
  1 file changed, 40 insertions(+), 48 deletions(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 36f3dfc..a795a23 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -123,6 +123,9 @@ struct virtnet_info {
   /* Host can handle any s/g split between our header and packet data */
   bool any_header_sg;
  
 + /* Packet virtio header size */
 + u8 hdr_len;
 +
   /* Active statistics */
   struct virtnet_stats __percpu *stats;
  
 @@ -139,21 +142,14 @@ struct virtnet_info {
   struct notifier_block nb;
  };
  
 -struct skb_vnet_hdr {
 - union {
 - struct virtio_net_hdr hdr;
 - struct virtio_net_hdr_mrg_rxbuf mhdr;
 - };
 -};
 -
  struct padded_vnet_hdr {
 - struct virtio_net_hdr hdr;
 + struct virtio_net_hdr_mrg_rxbuf hdr;
   /*
 -  * virtio_net_hdr should be in a separated sg buffer because of a
 -  * QEMU bug, and data sg buffer shares same page with this header sg.
 -  * This padding makes next sg 16 byte aligned after virtio_net_hdr.
 +  * hdr is in a separate sg buffer, and data sg buffer shares same page
 +  * with this header sg. This padding makes next sg 16 byte aligned
 +  * after the header.
*/
 - char padding[6];
 + char padding[4];
  };
  
  /* Converting between virtqueue no. and kernel tx/rx queue no.
 @@ -179,9 +175,9 @@ static int rxq2vq(int rxq)
   return rxq * 2;
  }
  
 -static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
 +static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff 
 *skb)
  {
  - return (struct skb_vnet_hdr *)skb->cb;
  + return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb;
  }
  
  /*
 @@ -247,7 +243,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
 *vi,
  unsigned int len, unsigned int truesize)
  {
   struct sk_buff *skb;
 - struct skb_vnet_hdr *hdr;
 + struct virtio_net_hdr_mrg_rxbuf *hdr;
   unsigned int copy, hdr_len, hdr_padded_len;
   char *p;
  
 @@ -260,13 +256,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
 *vi,
  
   hdr = skb_vnet_hdr(skb);
  
  - if (vi->mergeable_rx_bufs) {
  - hdr_len = sizeof hdr->mhdr;
  - hdr_padded_len = sizeof hdr->mhdr;
  - } else {
  - hdr_len = sizeof hdr->hdr;
  + hdr_len = vi->hdr_len;
  + if (vi->mergeable_rx_bufs)
 + hdr_padded_len = sizeof *hdr;
 + else
   hdr_padded_len = sizeof(struct padded_vnet_hdr);
 - }
  
   memcpy(hdr, p, hdr_len);
  
 @@ -317,11 +311,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
 *vi,
   return skb;
  }
  
 -static struct sk_buff *receive_small(void *buf, unsigned int len)
 +static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, 
 unsigned int len)
  {
   struct sk_buff * skb = buf;
  
 - len -= sizeof(struct virtio_net_hdr);
  + len -= vi->hdr_len;
   skb_trim(skb, len);
  
   return skb;
 @@ -354,8 +348,8 @@ static struct sk_buff *receive_mergeable(struct 
 net_device *dev,
unsigned int len)
  {
   void *buf = mergeable_ctx_to_buf_address(ctx);
 - struct skb_vnet_hdr *hdr = buf;
  - u16 num_buf = virtio16_to_cpu(rq->vq->vdev, hdr->mhdr.num_buffers);
  + struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
  + u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
   struct page *page = virt_to_head_page(buf);
   int offset = buf - page_address(page);
   unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
 @@ -373,8 +367,8 @@ static struct sk_buff *receive_mergeable(struct 
 net_device *dev,
   if (unlikely(!ctx)) {
   pr_debug("%s: rx error: %d buffers out of %d missing\n",
dev->name, num_buf,
  -  virtio16_to_cpu(rq->vq->vdev,
  -  hdr->mhdr.num_buffers));
  +  virtio16_to_cpu(vi->vdev,
  +  hdr->num_buffers));
   dev->stats.rx_length_errors++;
   goto err_buf;
   }
 @@ -441,7 +435,7 @@ static void receive_buf(struct virtnet_info *vi, struct 
 receive_queue *rq,
   struct net_device *dev = vi->dev;
   struct 

Re: [PATCH RFC 4/4] virtio_net: bigger header when VERSION_1 is set

2014-10-27 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 With VERSION_1 virtio_net uses same header size
 whether mergeable buffers are enabled or not.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

These two are great too, thanks:

Acked-by: Rusty Russell ru...@rustcorp.com.au

Cheers,
Rusty.

 ---
  drivers/net/virtio_net.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 9c6d50f..a2fe340 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -1764,7 +1764,8 @@ static int virtnet_probe(struct virtio_device *vdev)
   if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
   vi->mergeable_rx_bufs = true;
  
 - if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 + if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) ||
 + virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
   vi->hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
   else
   vi->hdr_len = sizeof(struct virtio_net_hdr);
 -- 
 MST


Re: [PATCH RFC 00/11] qemu: towards virtio-1 host support

2014-10-27 Thread Michael S. Tsirkin
On Fri, Oct 24, 2014 at 10:38:39AM +0200, Cornelia Huck wrote:
 On Fri, 24 Oct 2014 00:42:20 +0300
 Michael S. Tsirkin m...@redhat.com wrote:
 
  On Tue, Oct 07, 2014 at 04:39:56PM +0200, Cornelia Huck wrote:
   This patchset aims to get us some way to implement virtio-1 compliant
   and transitional devices in qemu. Branch available at
   
   git://github.com/cohuck/qemu virtio-1
   
   I've mainly focused on:
   - endianness handling
   - extended feature bits
   - virtio-ccw new/changed commands
  
  So issues identified so far:
 
 Thanks for taking a look.
 
  - devices not converted yet should not advertize 1.0
 
 Neither should an unconverted transport. So we can either
 - have the transport set the bit and rely on the device's ->get_features
   callback to mask it out
   (virtio-ccw has to change the calling order for get_features, btw.)
 - have the device set the bit and the transport mask it out later. Feels a
   bit weird, as virtio-1 is a transport feature bit.


I thought more about it; I think the right thing
would be for unconverted transports to clear the
high bits on ack and get features.

So bit 32 is set, but not exposed to guests.
In fact, at least for PCI, we have a 32-bit field for
features in 0.9 so it's automatic.
Didn't check mmio yet.
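
For readers not used to the split-word feature API being patched below:
host_features/get_features deal in 32-bit words selected by `index', so word 1
carries feature bits 32-63 and VIRTIO_F_VERSION_1 (bit 32) is bit 0 of that
word, which is what the masking in the hunks relies on. A quick stand-alone
illustration (plain C, not QEMU code):

    #include <assert.h>
    #include <stdint.h>

    #define VIRTIO_F_VERSION_1 32

    int main(void)
    {
            /* word 0 = feature bits 0..31, word 1 = bits 32..63 */
            uint32_t host_features[2] = { 0xffffffff, 0xffffffff };

            /* an unconverted device masks VERSION_1 out of word 1 */
            host_features[1] &= ~(1u << (VIRTIO_F_VERSION_1 - 32));

            assert((host_features[1] & 1u) == 0);   /* bit 32 is now clear */
            return 0;
    }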



 I'm tending towards the first option; smth like this (on top of my
 branch):
 
 diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
 index c29c8c8..f6501ea 100644
 --- a/hw/9pfs/virtio-9p-device.c
 +++ b/hw/9pfs/virtio-9p-device.c
 @@ -24,6 +24,9 @@
  static uint32_t virtio_9p_get_features(VirtIODevice *vdev, unsigned int 
 index,
 uint32_t features)
  {
 +if (index == 1) {
 +features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
 +}
  if (index > 0) {
  return features;
  }
 diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
 index 0d843fe..07a7a6f 100644
 --- a/hw/char/virtio-serial-bus.c
 +++ b/hw/char/virtio-serial-bus.c
 @@ -472,6 +472,9 @@ static uint32_t get_features(VirtIODevice *vdev, unsigned 
 int index,
  {
  VirtIOSerial *vser;
  
 +if (index == 1) {
 +features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
 +}
  if (index > 0) {
  return features;
  }
 diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
 index 67f91c0..4b75105 100644
 --- a/hw/net/virtio-net.c
 +++ b/hw/net/virtio-net.c
 @@ -447,6 +447,9 @@ static uint32_t virtio_net_get_features(VirtIODevice 
 *vdev, unsigned int index,
  VirtIONet *n = VIRTIO_NET(vdev);
  NetClientState *nc = qemu_get_queue(n-nic);
  
 +if (index == 1 && get_vhost_net(nc->peer)) {
 +features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
 +}
  if (index > 0) {
  return features;
  }
 diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
 index 80efe88..07fbf40 100644
 --- a/hw/s390x/virtio-ccw.c
 +++ b/hw/s390x/virtio-ccw.c
 @@ -839,16 +839,16 @@ static int virtio_ccw_device_init(VirtioCcwDevice *dev, 
 VirtIODevice *vdev)
  dev-revision = -1;
  
  /* Set default feature bits that are offered by the host. */
 +dev->host_features[0] = 0x1 << VIRTIO_F_NOTIFY_ON_EMPTY;
 +dev->host_features[0] |= 0x1 << VIRTIO_F_BAD_FEATURE;
 +
 +dev->host_features[1] = 1 << (VIRTIO_F_VERSION_1 - 32);
 +
  for (i = 0; i < ARRAY_SIZE(dev->host_features); i++) {
  dev->host_features[i] =
  virtio_bus_get_vdev_features(dev->bus, i, 
 dev->host_features[i]);
  }
 
 -dev->host_features[0] |= 0x1 << VIRTIO_F_NOTIFY_ON_EMPTY;
 -dev->host_features[0] |= 0x1 << VIRTIO_F_BAD_FEATURE;
 -
 -dev->host_features[1] = 1 << (VIRTIO_F_VERSION_1 - 32);
 -
  css_generate_sch_crws(sch->cssid, sch->ssid, sch->schid,
parent->hotplugged, 1);
  return 0;
 diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
 index 8e1afa0..8a8fdb9 100644
 --- a/hw/scsi/vhost-scsi.c
 +++ b/hw/scsi/vhost-scsi.c
 @@ -155,6 +155,9 @@ static uint32_t vhost_scsi_get_features(VirtIODevice 
 *vdev, unsigned int index,
  {
  VHostSCSI *s = VHOST_SCSI(vdev);
  
 +if (index == 1) {
 +features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
 +}
  if (index > 0) {
  return features;
  }
 diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
 index 088e688..378783f 100644
 --- a/hw/scsi/virtio-scsi.c
 +++ b/hw/scsi/virtio-scsi.c
 @@ -611,6 +611,9 @@ static void virtio_scsi_set_config(VirtIODevice *vdev,
  static uint32_t virtio_scsi_get_features(VirtIODevice *vdev, unsigned int 
 index,
   uint32_t requested_features)
  {
 +if (index == 1) {
 +requested_features &= ~(1 << (VIRTIO_F_VERSION_1 - 32));
 +}
  return requested_features;
  }
  
 diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
 index 5af17e2..bd52845 100644
 --- a/hw/virtio/virtio-balloon.c
 +++ b/hw/virtio/virtio-balloon.c
 @@ -306,6 +306,9 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,