Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-19 Thread Rusty Russell
On Mon, 19 Dec 2011 10:50:13 +0800, Amos Kong ak...@redhat.com wrote:
 The format is broken on the webpage, so I attached the result file.
 It's also available here: http://amosk.info/download/rusty-fix-perf.txt

Thanks Amos.

Linus, please apply.  Fixes virtio-mmio without breaking virtio_pci
performance.

Thanks,
Rusty.

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers; using the smp ones rather than the
real device ones.  That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci.  In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.

By comparison, this branch is in the noise.

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c |   10 ++
 drivers/s390/kvm/kvm_virtio.c  |2 +-
 drivers/virtio/virtio_mmio.c   |7 ---
 drivers/virtio/virtio_pci.c|4 ++--
 drivers/virtio/virtio_ring.c   |   34 +-
 include/linux/virtio_ring.h|1 +
 tools/virtio/linux/virtio.h|1 +
 tools/virtio/virtio_test.c |3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)
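
For orientation, the call-site changes below all insert a new weak_barriers
flag immediately after the vdev argument; the updated prototype presumably
looks like this sketch (reconstructed from the hunks, not quoted from the
header):

struct virtqueue *vring_new_virtqueue(unsigned int num,
                                      unsigned int vring_align,
                                      struct virtio_device *vdev,
                                      bool weak_barriers, /* true: SMP barriers suffice */
                                      void *pages,
                                      void (*notify)(struct virtqueue *vq),
                                      void (*callback)(struct virtqueue *vq),
                                      const char *name);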

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
}
 
/*
-* OK, tell virtio_ring.c to set up a virtqueue now we know its size
-* and we've got a pointer to its pages.
+* OK, tell virtio_ring.c to set up a virtqueue now we know its size
+* and we've got a pointer to its pages.  Note that we set weak_barriers
+* to 'true': the host is just a(nother) SMP CPU, so we only need inter-cpu
+* barriers.
 */
-   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-                            vdev, lvq->pages, lg_notify, callback, name);
+   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+                            true, lvq->pages, lg_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
goto out;
 
 vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-                         vdev, (void *) config->address,
+                         vdev, true, (void *) config->address,
                          kvm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
        vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
-   /* Create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-                            vdev, info->queue, vm_notify, callback, name);
+   /* Create the vring: no weak barriers, the other side could
+    * be an independent device. */
+   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+                            false, info->queue, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
-                            vdev, info->queue, vp_notify, callback, name);
+   vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+                            true, info->queue, vp_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -28,17 +28,20 @@
 #ifdef CONFIG_SMP
 /* Where possible, use SMP barriers which are more lightweight than mandatory
  * barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define 
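
The virtio_ring.c hunk is truncated here.  Judging from the weak_barriers
flag introduced above, the replacement helpers plausibly take a shape like
this sketch (an assumption on my part: the flag is stored on the
vring_virtqueue and checked at run time):

#ifdef CONFIG_SMP
#define virtio_mb(vq) \
        do { if ((vq)->weak_barriers) smp_mb(); else mb(); } while (0)
#define virtio_rmb(vq) \
        do { if ((vq)->weak_barriers) smp_rmb(); else rmb(); } while (0)
#define virtio_wmb(vq) \
        do { if ((vq)->weak_barriers) smp_wmb(); else wmb(); } while (0)
#else
/* On UP builds the smp_*() barriers compile down to barrier(), which is
 * not enough when the other side is a real device, so always use the
 * mandatory variants. */
#define virtio_mb(vq) mb()
#define virtio_rmb(vq) rmb()
#define virtio_wmb(vq) wmb()
#endif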

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-18 Thread Amos Kong

On 12/12/11 13:12, Rusty Russell wrote:

On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:

On 12/12/11 06:27, Benjamin Herrenschmidt wrote:

On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:


Forwarding some results by Amos, who ran multiple netperf streams in
parallel, from an external box to the guest.  TCP_STREAM results were
noisy.  This could be due to buffering done by TCP, where packet size
varies even as message size is constant.

TCP_RR results were consistent. In this benchmark, after switching
to mandatory barriers, CPU utilization increased by up to 35% while
throughput went down by up to 14%.  The normalized throughput/cpu
regressed consistently, between 7% and 35%.

The fix applied was simply this:


What machine & processor was this ?


pinned guest memory to numa node 1


Please try this patch.  How much does the branch cost us?

(Compiles, untested).

Thanks,
Rusty.

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers; using the smp ones rather than the
real device ones.  That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci.  In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.

By comparison, this branch costs us???

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
  drivers/lguest/lguest_device.c |   10 ++
  drivers/s390/kvm/kvm_virtio.c  |2 +-
  drivers/virtio/virtio_mmio.c   |7 ---
  drivers/virtio/virtio_pci.c|4 ++--
  drivers/virtio/virtio_ring.c   |   34 +-
  include/linux/virtio_ring.h|1 +
  tools/virtio/linux/virtio.h|1 +
  tools/virtio/virtio_test.c |3 ++-
  8 files changed, 38 insertions(+), 24 deletions(-)


Hi all,

I tested with the same environment and scenarios.
I tested each scenario three times and computed the average for more
precision.


Thanks, Amos

- compare results ---
Mon Dec 19 09:51:09 2011

1 - avg-old.netperf.exhost_guest.txt
2 - avg-fixed.netperf.exhost_guest.txt

==
TCP_STREAM
sessions| size| throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj| #tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
%  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| 0| 0
1 2| 64| 2141.12| 15.72| 136| 0| 33| 0| 1744| 0| 34| 873777| 972303| 928926| 0.00| 0.00
2 2| 64| 2140.88| 15.64| 137| 0| 33| 0| 1744| 0| 34| 926588| 942841| 974095| 0.00| 0.00
%  | 0.0| -0.0| -0.5| +0.7| 0| 0.0| 0| 0.0| 0| 0.0| +6.0| -3.0| +4.9| 0| 0
1 4| 64| 4076.80| 19.82| 205| 0| 30| 0| 1577| 0| 67| 1422282| 1166425| 1539219| 0.00| 0.00
2 4| 64| 4094.32| 20.70| 197| 0| 31| 0| 1612| 0| 68| 1704330| 1314077| 1833394| 0.00| 0.00
%  | 0.0| +0.4| +4.4| -3.9| 0| +3.3| 0| +2.2| 0| +1.5| +19.8| +12.7| +19.1| 0| 0
1 1| 256| 2867.48| 13.44| 213| 0| 32| 0| 1726| 0| 14| 666430| 694922| 690730| 0.00| 0.00
2 1| 256| 2874.20| 12.71| 226| 0| 32| 0| 1709| 0| 14| 697960| 740407| 721807| 0.00| 0.00
%  | 0.0| +0.2| -5.4| +6.1| 0| 0.0| 0| -1.0| 0| 0.0| +4.7| +6.5| +4.5| 0| 0
1 2| 256| 5642.82| 17.61| 320| 0| 30| 0| 1594| 0| 30| 1226861| 1236081| 1268562| 0.00| 0.00
2 2| 256| 5661.06| 17.41| 326| 0| 30| 0| 1594| 0| 29| 1175696| 1143490| 1221528| 0.00| 0.00
%  | 0.0| +0.3| -1.1|

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-18 Thread Benjamin Herrenschmidt
On Mon, 2011-12-19 at 10:19 +0800, Amos Kong wrote:

 I tested with the same environment and scenarios.
 I tested each scenario three times and computed the average for more
 precision.
 
 Thanks, Amos
 
 - compare results ---
 Mon Dec 19 09:51:09 2011
 
 1 - avg-old.netperf.exhost_guest.txt
 2 - avg-fixed.netperf.exhost_guest.txt

The output is word wrapped and generally unreadable. Any chance you can
provide us with a summary of the outcome ?

Cheers,
Ben.

 ==
 TCP_STREAM
 sessions| size| throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj| #tpkt/#exit| #rpkt/#irq
 1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
 2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
 %  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| 0| 0
 1 2| 64| 2141.12| 15.72| 136| 0| 33| 0| 1744| 0| 34| 873777| 972303| 928926| 0.00| 0.00
 2 2| 64| 2140.88| 15.64| 137| 0| 33| 0| 1744| 0| 34| 926588| 942841| 974095| 0.00| 0.00
 %  | 0.0| -0.0| -0.5| +0.7| 0| 0.0| 0| 0.0| 0| 0.0| +6.0| -3.0| +4.9| 0| 0
 1 4| 64| 4076.80| 19.82| 205| 0| 30| 0| 1577| 0| 67| 1422282| 1166425| 1539219| 0.00| 0.00
 2 4| 64| 4094.32| 20.70| 197| 0| 31| 0| 1612| 0| 68| 1704330| 1314077| 1833394| 0.00| 0.00
 %  | 0.0| +0.4| +4.4| -3.9| 0| +3.3| 0| +2.2| 0| +1.5| +19.8| +12.7| +19.1| 0| 0
 1 1| 256| 2867.48| 13.44| 213| 0| 32| 0| 1726| 0| 14| 666430| 694922| 690730| 0.00| 0.00
 2 1| 256| 2874.20| 12.71| 226| 0| 32| 0| 1709| 0| 14| 697960| 740407| 721807| 0.00| 0.00
 %  | 0.0| +0.2| -5.4| +6.1| 0| 0.0| 0| -1.0| 0| 0.0| +4.7| +6.5| +4.5| 0| 0
 1 2| 256| 5642.82| 17.61| 320| 0| 30| 0| 1594| 0| 30| 1226861| 1236081| 1268562| 0.00| 0.00
 2 2| 256| 5661.06| 17.41| 326| 0| 30| 0| 1594| 0| 29| 1175696| 1143490| 1221528| 0.00| 0.00
 %  | 0.0| +0.3| -1.1| +1.9| 0| 0.0| 0| 0.0| 0| -3.3| -4.2| -7.5| -3.7| 0| 0
 1 4| 256| 9404.27| 23.55| 399| 0| 33| 0| 1744| 0| 37| 1692245| 659975| 1765103| 0.00| 0.00
 2 4| 256| 8761.11| 23.18| 376| 0| 32| 0| 1726| 0| 36| 1699382| 418992| 1870804| 0.00| 0.00
 %  | 0.0| -6.8| -1.6| -5.8| 0| -3.0| 0| -1.0| 0| -2.7| +0.4| -36.5| +6.0| 0| 0
 1 1| 512| 3803.66| 14.20| 267| 0| 30| 0| 1594| 0| 14| 693992| 750078| 721107| 0.00| 0.00
 2 1| 512| 3838.02| 15.47| 248| 0| 31| 0| 1612| 0| 15| 811709| 773505| 838788| 0.00| 0.00
 %  | 0.0| +0.9| +8.9| -7.1| 0| +3.3| 0| +1.1| 0| +7.1| +17.0| +3.1| +16.3| 0| 0
 1 2| 512| 8606.11| 19.34| 444| 0| 32| 0| 1709| 0| 29| 1264624| 647652| 1309740| 0.00| 0.00
 2 2| 512| 8127.80| 18.93| 428| 0| 32| 0| 1726| 0| 28| 1216606| 1179269| 1266260| 0.00| 0.00
 %  | 0.0| -5.6| -2.1| -3.6| 0| 0.0| 0| +1.0| 0| -3.4| -3.8| +82.1| -3.3| 0| 0
 1 4| 512| 9409.41| 22.88| 413| 0| 30| 0| 1577|

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-18 Thread Amos Kong

On 19/12/11 10:19, Amos Kong wrote:

On 12/12/11 13:12, Rusty Russell wrote:

On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:

On 12/12/11 06:27, Benjamin Herrenschmidt wrote:

On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:


Forwarding some results by Amos, who ran multiple netperf streams in
parallel, from an external box to the guest. TCP_STREAM results were
noisy. This could be due to buffering done by TCP, where packet size
varies even as message size is constant.

TCP_RR results were consistent. In this benchmark, after switching
to mandatory barriers, CPU utilization increased by up to 35% while
throughput went down by up to 14%.  The normalized throughput/cpu
regressed consistently, between 7% and 35%.

The fix applied was simply this:


What machine & processor was this ?


pinned guest memory to numa node 1


Please try this patch. How much does the branch cost us?

(Compiles, untested).

Thanks,
Rusty.

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers; using the smp ones rather than the
real device ones. That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci. In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.

By comparison, this branch costs us???

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
drivers/lguest/lguest_device.c | 10 ++
drivers/s390/kvm/kvm_virtio.c | 2 +-
drivers/virtio/virtio_mmio.c | 7 ---
drivers/virtio/virtio_pci.c | 4 ++--
drivers/virtio/virtio_ring.c | 34 +-
include/linux/virtio_ring.h | 1 +
tools/virtio/linux/virtio.h | 1 +
tools/virtio/virtio_test.c | 3 ++-
8 files changed, 38 insertions(+), 24 deletions(-)


Hi all,

I tested with the same environment and scenarios.
I tested each scenario three times and computed the average for more
precision.

Thanks, Amos

- compare results ---
Mon Dec 19 09:51:09 2011

1 - avg-old.netperf.exhost_guest.txt
2 - avg-fixed.netperf.exhost_guest.txt

==
TCP_STREAM
sessions| size| throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj| #tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
%  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1|


The format is broken on the webpage, so I attached the result file.
It's also available here: http://amosk.info/download/rusty-fix-perf.txt
- compare results ---
Mon Dec 19 09:51:09 2011

1 - avg-old.netperf.exhost_guest.txt
2 - avg-fixed.netperf.exhost_guest.txt

==
TCP_STREAM
sessions| size| throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj| #tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
%  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| 0| 0
1 2| 64| 2141.12| 15.72| 136| 0| 33| 0| 1744| 0| 34| 873777| 972303| 928926| 0.00| 0.00
2 2| 64| 2140.88| 15.64| 137| 0| 33| 0| 1744| 0| 34| 926588| 942841| 974095| 0.00| 0.00
%  | 0.0| -0.0| -0.5| +0.7| 0| 0.0| 0| 0.0| 0| 0.0| +6.0| -3.0| +4.9| 0| 0
1 4| 64| 4076.80| 19.82| 205| 0| 30| 0| 1577| 0| 67| 1422282| 1166425| 1539219| 0.00| 0.00
2 4| 64| 4094.32| 20.70| 197| 0| 31| 0| 1612| 0| 68| 1704330| 1314077| 1833394| 0.00| 0.00
%  | 0.0| +0.4| +4.4| -3.9| 0| +3.3| 0| +2.2| 0| +1.5| +19.8| +12.7| +19.1| 0| 0
1 1| 256| 2867.48| 13.44| 213| 0| 32| 0| 1726| 0| 14| 666430| 694922| 690730| 0.00| 0.00
2 1| 256|

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-18 Thread Rusty Russell
On Tue, 13 Dec 2011 07:56:36 +0800, Amos Kong ak...@redhat.com wrote:
 On 12/12/2011 01:12 PM, Rusty Russell wrote:
  On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:
  On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
  On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
 
  Forwarding some results by Amos, who ran multiple netperf streams in
  parallel, from an external box to the guest.  TCP_STREAM results were
  noisy.  This could be due to buffering done by TCP, where packet size
  varies even as message size is constant.
 
  TCP_RR results were consistent. In this benchmark, after switching
  to mandatory barriers, CPU utilization increased by up to 35% while
  throughput went down by up to 14%.  The normalized throughput/cpu
  regressed consistently, between 7% and 35%.
 
  The fix applied was simply this:
 
  What machine & processor was this ?
 
 pinned guest memory to numa node 1
  
  Please try this patch.  How much does the branch cost us?
 
 Ok, I will provide the result later.

Any news?  We're cutting it very fine.  I've CC'd Linus so he knows this
is coming...

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers; using the smp ones rather than the
real device ones.  That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci.  In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.

By comparison, this branch costs us???

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c |   10 ++
 drivers/s390/kvm/kvm_virtio.c  |2 +-
 drivers/virtio/virtio_mmio.c   |7 ---
 drivers/virtio/virtio_pci.c|4 ++--
 drivers/virtio/virtio_ring.c   |   34 +-
 include/linux/virtio_ring.h|1 +
 tools/virtio/linux/virtio.h|1 +
 tools/virtio/virtio_test.c |3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
}
 
/*
-* OK, tell virtio_ring.c to set up a virtqueue now we know its size
-* and we've got a pointer to its pages.
+* OK, tell virtio_ring.c to set up a virtqueue now we know its size
+* and we've got a pointer to its pages.  Note that we set weak_barriers
+* to 'true': the host is just a(nother) SMP CPU, so we only need inter-cpu
+* barriers.
 */
-   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-                            vdev, lvq->pages, lg_notify, callback, name);
+   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+                            true, lvq->pages, lg_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
goto out;
 
 vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-                         vdev, (void *) config->address,
+                         vdev, true, (void *) config->address,
                          kvm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
        vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
-   /* Create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-                            vdev, info->queue, vm_notify, callback, name);
+   /* Create the vring: no weak barriers, the other side could
+    * be an independent device. */
+   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+                            false, info->queue, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
-   vq = 

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-18 Thread Amos Kong

On 19/12/11 10:41, Benjamin Herrenschmidt wrote:

On Mon, 2011-12-19 at 10:19 +0800, Amos Kong wrote:


I tested with the same environment and scenarios.
 I tested each scenario three times and computed the average for more
 precision.

Thanks, Amos

- compare results ---
Mon Dec 19 09:51:09 2011

1 - avg-old.netperf.exhost_guest.txt
2 - avg-fixed.netperf.exhost_guest.txt


The output is word wrapped and generally unreadable. Any chance you can
provide us with a summary of the outcome ?

Cheers,
Ben.


Hi Ben,

The change in TCP_RR throughput is very small.
External host -> guest: some of the TCP_STREAM and TCP_MAERTS throughput
dropped a little.
Local host -> guest: some of the TCP_STREAM and TCP_MAERTS throughput
increased a little.



About compare result format:
---

1 - avg-old.netperf.exhost_guest.txt


average result (tested 3 times) file of test 1

2 - avg-fixed.netperf.exhost_guest.txt


average result file of test 2


==
TCP_STREAM


^^^ protocol


  sessions| size| throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj| #tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102|

^^^ average result of the old kernel: netserver started in the guest,
netperf client(s) started on the external host

2 1| 64| 1079.44| 10.29| 104|

^^^ average result of the fixed kernel

% | 0.0| +0.5| -2.0| +2.0|

^^^ percentage change between test 1 and test 2


sessions: number of netperf clients
size: request/response sizes
#rx-pkts: number of received packets
#rx-byts: number of received bytes
#rx-intr: number of receive interrupts
#io_exit: number of io exits
#irq_inj: number of injected irqs


Thanks, Amos.



Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-12 Thread Amos Kong
On 12/12/2011 01:12 PM, Rusty Russell wrote:
 On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:
 On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
 On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:

 Forwarding some results by Amos, who ran multiple netperf streams in
 parallel, from an external box to the guest.  TCP_STREAM results were
 noisy.  This could be due to buffering done by TCP, where packet size
 varies even as message size is constant.

 TCP_RR results were consistent. In this benchmark, after switching
 to mandatory barriers, CPU utilization increased by up to 35% while
 throughput went down by up to 14%.  The normalized throughput/cpu
 regressed consistently, between 7% and 35%.

 The fix applied was simply this:

 What machine & processor was this ?

 pinned guest memory to numa node 1
 
 Please try this patch.  How much does the branch cost us?

Ok, I will provide the result later.

Thanks,
Amos

 (Compiles, untested).
 
 Thanks,
 Rusty.
 
 From: Rusty Russell ru...@rustcorp.com.au
 Subject: virtio: harsher barriers for virtio-mmio.
 
 We were cheating with our barriers; using the smp ones rather than the
 real device ones.  That was fine, until virtio-mmio came along, which
 could be talking to a real device (a non-SMP CPU).
 
 Unfortunately, just putting back the real barriers (reverting
 d57ed95d) causes a performance regression on virtio-pci.  In
 particular, Amos reports netbench's TCP_RR over virtio_net CPU
 utilization increased up to 35% while throughput went down by up to
 14%.
 
 By comparison, this branch costs us???
 
 Reference: https://lkml.org/lkml/2011/12/11/22
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/lguest/lguest_device.c |   10 ++
  drivers/s390/kvm/kvm_virtio.c  |2 +-
  drivers/virtio/virtio_mmio.c   |7 ---
  drivers/virtio/virtio_pci.c|4 ++--
  drivers/virtio/virtio_ring.c   |   34 +-
  include/linux/virtio_ring.h|1 +
  tools/virtio/linux/virtio.h|1 +
  tools/virtio/virtio_test.c |3 ++-
  8 files changed, 38 insertions(+), 24 deletions(-)
 
 diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
 --- a/drivers/lguest/lguest_device.c
 +++ b/drivers/lguest/lguest_device.c
 @@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
   }
  
   /*
 -  * OK, tell virtio_ring.c to set up a virtqueue now we know its size
 -  * and we've got a pointer to its pages.
 +  * OK, tell virtio_ring.c to set up a virtqueue now we know its size
 +  * and we've got a pointer to its pages.  Note that we set weak_barriers
 +  * to 'true': the host is just a(nother) SMP CPU, so we only need inter-cpu
 +  * barriers.
*/
 - vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
 -  vdev, lvq->pages, lg_notify, callback, name);
 + vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
 +  true, lvq->pages, lg_notify, callback, name);
   if (!vq) {
   err = -ENOMEM;
   goto unmap;
 diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
 --- a/drivers/s390/kvm/kvm_virtio.c
 +++ b/drivers/s390/kvm/kvm_virtio.c
 @@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
   goto out;
  
   vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
 -  vdev, (void *) config->address,
 +  vdev, true, (void *) config->address,
    kvm_notify, callback, name);
   if (!vq) {
   err = -ENOMEM;
 diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
 --- a/drivers/virtio/virtio_mmio.c
 +++ b/drivers/virtio/virtio_mmio.c
 @@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
   writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
   vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
  
 - /* Create the vring */
 - vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
 -  vdev, info->queue, vm_notify, callback, name);
 + /* Create the vring: no weak barriers, the other side could
 +  * be an independent device. */
 + vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
 +  false, info->queue, vm_notify, callback, name);
   if (!vq) {
   err = -ENOMEM;
   goto error_new_virtqueue;
 diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
 --- a/drivers/virtio/virtio_pci.c
 +++ b/drivers/virtio/virtio_pci.c
 @@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
  
   /* create the vring */
 - vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
 -  vdev, info->queue, vp_notify, callback, name);
 +

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-11 Thread Michael S. Tsirkin
On Sat, Dec 03, 2011 at 03:44:36PM +1030, Rusty Russell wrote:
 On Sat, 03 Dec 2011 10:09:44 +1100, Benjamin Herrenschmidt 
 b...@kernel.crashing.org wrote:
  On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
   A trivial, albeit sub-optimal, solution would be to simply revert
   commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
   that's going to have a negative impact on performance of SMP-based
   virtualization use cases.
  
  Have you measured the impact of using normal barriers (non-SMP ones)
  like we use on normal HW drivers unconditionally ?
  
  IE. If the difference is small enough I'd say just go for it and avoid
  the bloat.
 
 Yep.  Plan is:
 1) Measure the difference.
 2) Difference unmeasurable?  Use normal barriers (ie. revert d57ed95).
 3) Difference small?  Revert d57ed95 for 3.2, revisit for 3.3.
 4) Difference large?  Runtime switch based on if you're PCI for 3.2,
revisit for 3.3.
 
 Cheers,
 Rusty.

Forwarding some results by Amos, who ran multiple netperf streams in
parallel, from an external box to the guest.  TCP_STREAM results were
noisy.  This could be due to buffering done by TCP, where packet size
varies even as message size is constant.

TCP_RR results were consistent. In this benchmark, after switching
to mandatory barriers, CPU utilization increased by up to 35% while
throughput went down by up to 14%.  The normalized throughput/cpu
regressed consistently, between 7% and 35%.

The fix applied was simply this:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3198f2e..fdccb77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,7 +23,7 @@
 
 /* virtio guest is communicating with a virtual "device" that actually runs on
  * a host processor.  Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
+#if 0
 /* Where possible, use SMP barriers which are more lightweight than mandatory
  * barriers, because mandatory barriers control MMIO effects on accesses
  * through relaxed memory I/O windows (which virtio does not use). */



-- 
MST
Fri Dec  9 23:57:33 2011

1 - old-exhost_guest.txt
2 - fixed-exhost_guest.txt

==
TCP_STREAM
sessions| size| throughput| cpu| normalize| #tx-pkts| #rx-pkts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj| #tpkt/#exit| #rpkt/#irq
1 1| 64| 949.64| 10.64| 89| 1170134| 1368739| 0| 17| 487392| 488820| 504716| 2.39| 2.71
2 1| 64| 946.03| 10.87| 87| 1119582| 1325851| 0| 17| 493763| 485865| 516161| 2.30| 2.57
%  | 0.0| -0.4| +2.2| -2.2| -4.3| -3.1| 0| 0.0| +1.3| -0.6| +2.3| -3.8| -5.2
1 2| 64| 1877.15| 15.45| 121| 2151267| 2561929| 0| 33| 923916| 971093| 969360| 2.22| 2.64
2 2| 64| 1867.63| 15.06| 124| 2212457| 2607606| 0| 33| 836160| 927721| 883964| 2.38| 2.95
%  | 0.0| -0.5| -2.5| +2.5| +2.8| +1.8| 0| 0.0| -9.5| -4.5| -8.8| +7.2| +11.7
1 4| 64| 3577.38| 19.62| 182| 4176151| 5036661| 0| 64| 1677417| 1412979| 1859101| 2.96| 2.71
2 4| 64| 3583.17| 20.05| 178| 4215327| 5063534| 0| 65| 1682582| 1549394| 1759033| 2.72| 2.88
%  | 0.0| +0.2| +2.2| -2.2| +0.9| +0.5| 0| +1.6| +0.3| +9.7| -5.4| -8.1| +6.3
1 1| 256| 2654.52| 11.41| 232| 925787| 1029214| 0| 14| 597763| 670927| 619414| 1.38| 1.66
2 1| 256| 2632.22| 20.32| 129| 977446| 1036094| 0| 15| 742699| 715460| 764512| 1.37| 1.36
%  | 0.0| -0.8| +78.1| -44.4| +5.6| +0.7| 0| +7.1| +24.2| +6.6| +23.4| -0.7| -18.1
1 2| 256| 5228.76| 16.94| 308| 1949442| 2082492| 0| 30| 1230329| 1323945| 1274262| 1.47| 1.63
2 2| 256| 5140.98| 19.58| 262| 1991090| 2093206| 0| 30| 1400232| 1271363| 1441564| 1.57| 1.45
%  | 0.0| -1.7| +15.6| -14.9| +2.1| +0.5| 0| 0.0| +13.8| -4.0| +13.1| +6.8| -11.0
1 4| 256| 9412.61| 24.04| 391| 2292404| 2351356| 0| 35| 1669864| 555786| 1741742| 4.12| 1.35
2 4| 256| 9408.92| 22.80| 412| 2303267|

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-11 Thread Benjamin Herrenschmidt
On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:

 Forwarding some results by Amos, who ran multiple netperf streams in
 parallel, from an external box to the guest.  TCP_STREAM results were
 noisy.  This could be due to buffering done by TCP, where packet size
 varies even as message size is constant.
 
 TCP_RR results were consistent. In this benchmark, after switching
 to mandatory barriers, CPU utilization increased by up to 35% while
 throughput went down by up to 14%.  The normalized throughput/cpu
 regressed consistently, between 7% and 35%.
 
 The fix applied was simply this:

What machine & processor was this ?

Cheers,
Ben.

 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 index 3198f2e..fdccb77 100644
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -23,7 +23,7 @@
  
  /* virtio guest is communicating with a virtual "device" that actually runs on
   * a host processor.  Memory barriers are used to control SMP effects. */
 -#ifdef CONFIG_SMP
 +#if 0
  /* Where possible, use SMP barriers which are more lightweight than mandatory
   * barriers, because mandatory barriers control MMIO effects on accesses
   * through relaxed memory I/O windows (which virtio does not use). */
 
 
 




Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-11 Thread Amos Kong

On 12/12/11 06:27, Benjamin Herrenschmidt wrote:

On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:


Forwarding some results by Amos, who ran multiple netperf streams in
parallel, from an external box to the guest.  TCP_STREAM results were
noisy.  This could be due to buffering done by TCP, where packet size
varies even as message size is constant.

TCP_RR results were consistent. In this benchmark, after switching
to mandatory barriers, CPU utilization increased by up to 35% while
throughput went down by up to 14%.  The normalized throughput/cpu
regressed consistently, between 7% and 35%.

The fix applied was simply this:


What machine & processor was this ?


pinned guest memory to numa node 1
# numactl -m 1 qemu-kvm ...
pinned guest vcpu threads and the vhost thread each to a single cpu of numa node 1
# taskset -p 0x10 8348 (vhost_net_thread)
# taskset -p 0x20 8353 (vcpu 1 thread)
# taskset -p 0x40 8357 (vcpu 2 thread)
pinned cpu/memory of the netperf client process to node 1
# numactl --cpunodebind=1 --membind=1 netperf ...

8 cores
---
processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model   : 44
model name  : Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz
stepping: 2
microcode   : 0xc
cpu MHz : 1596.000
cache size  : 12288 KB
physical id : 1
siblings: 4
core id : 10
cpu cores   : 4
apicid  : 52
initial apicid  : 52
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx 
smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes 
lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid

bogomips: 4787.76
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

# cat /proc/meminfo
MemTotal:   16446616 kB
MemFree:15874092 kB
Buffers:   30404 kB
Cached:   238640 kB
SwapCached:0 kB
Active:   100204 kB
Inactive: 184312 kB
Active(anon):  15724 kB
Inactive(anon):4 kB
Active(file):  84480 kB
Inactive(file):   184308 kB
Unevictable:   0 kB
Mlocked:   0 kB
SwapTotal:   8388604 kB
SwapFree:8388604 kB
Dirty:56 kB
Writeback: 0 kB
AnonPages: 15548 kB
Mapped:11540 kB
Shmem:   256 kB
Slab:  82444 kB
SReclaimable:  19220 kB
SUnreclaim:63224 kB
KernelStack:1224 kB
PageTables: 2256 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit:16611912 kB
Committed_AS: 209068 kB
VmallocTotal:   34359738367 kB
VmallocUsed:  224244 kB
VmallocChunk:   34351073668 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
DirectMap4k:9876 kB
DirectMap2M: 2070528 kB
DirectMap1G:14680064 kB

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 8175 MB
node 0 free: 7706 MB
node 1 cpus: 4 5 6 7
node 1 size: 8192 MB
node 1 free: 7796 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1



Cheers,
Ben.


diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3198f2e..fdccb77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,7 +23,7 @@

 /* virtio guest is communicating with a virtual "device" that actually runs on
   * a host processor.  Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
+#if 0
  /* Where possible, use SMP barriers which are more lightweight than mandatory
   * barriers, because mandatory barriers control MMIO effects on accesses
   * through relaxed memory I/O windows (which virtio does not use). */








Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-11 Thread Rusty Russell
On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:
 On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
  On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
 
  Forwarding some results by Amos, who ran multiple netperf streams in
  parallel, from an external box to the guest.  TCP_STREAM results were
  noisy.  This could be due to buffering done by TCP, where packet size
  varies even as message size is constant.
 
  TCP_RR results were consistent. In this benchmark, after switching
  to mandatory barriers, CPU utilization increased by up to 35% while
  throughput went down by up to 14%.  The normalized throughput/cpu
  regressed consistently, between 7% and 35%.
 
  The fix applied was simply this:
 
  What machine & processor was this ?
 
 pinned guest memory to numa node 1

Please try this patch.  How much does the branch cost us?

(Compiles, untested).

Thanks,
Rusty.

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers; using the smp ones rather than the
real device ones.  That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci.  In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.

By comparison, this branch costs us???

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c |   10 ++
 drivers/s390/kvm/kvm_virtio.c  |2 +-
 drivers/virtio/virtio_mmio.c   |7 ---
 drivers/virtio/virtio_pci.c|4 ++--
 drivers/virtio/virtio_ring.c   |   34 +-
 include/linux/virtio_ring.h|1 +
 tools/virtio/linux/virtio.h|1 +
 tools/virtio/virtio_test.c |3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
}
 
/*
-* OK, tell virtio_ring.c to set up a virtqueue now we know its size
-* and we've got a pointer to its pages.
+* OK, tell virtio_ring.c to set up a virtqueue now we know its size
+* and we've got a pointer to its pages.  Note that we set weak_barriers
+* to 'true': the host is just a(nother) SMP CPU, so we only need inter-cpu
+* barriers.
 */
-   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-                            vdev, lvq->pages, lg_notify, callback, name);
+   vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+                            true, lvq->pages, lg_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
goto out;
 
 vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-                         vdev, (void *) config->address,
+                         vdev, true, (void *) config->address,
                          kvm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
        vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
-   /* Create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-                            vdev, info->queue, vm_notify, callback, name);
+   /* Create the vring: no weak barriers, the other side could
+    * be an independent device. */
+   vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+                            false, info->queue, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
-   vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
-                            vdev, info->queue, vp_notify, callback, name);
+   vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+   

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-04 Thread Rusty Russell
On Sat, 03 Dec 2011 10:09:44 +1100, Benjamin Herrenschmidt 
b...@kernel.crashing.org wrote:
 On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
  A trivial, albeit sub-optimal, solution would be to simply revert
  commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
  that's going to have a negative impact on performance of SMP-based
  virtualization use cases.
 
 Have you measured the impact of using normal barriers (non-SMP ones)
 like we use on normal HW drivers unconditionally ?
 
 IE. If the difference is small enough I'd say just go for it and avoid
 the bloat.

Yep.  Plan is:
1) Measure the difference.
2) Difference unmeasurable?  Use normal barriers (ie. revert d57ed95).
3) Difference small?  Revert d57ed95 for 3.2, revisit for 3.3.
4) Difference large?  Runtime switch based on if you're PCI for 3.2,
   revisit for 3.3.

Cheers,
Rusty.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-02 Thread Benjamin Herrenschmidt
On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
 Virtio is using memory barriers to control the ordering of
 references to the vrings on SMP systems. When the guest is compiled
 with SMP support, virtio is only using SMP barriers in order to
 avoid incurring the overhead involved with mandatory barriers.
 
 Lately, though, virtio is being increasingly used with inter-processor
 communication scenarios too, which involve running two (separate)
 instances of operating systems on two (separate) processors, each of
 which might either be UP or SMP.
 
 To control the ordering of memory references when the vrings are shared
 between two external processors, we must always use mandatory barriers.
 
 A trivial, albeit sub-optimal, solution would be to simply revert
 commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
 that's going to have a negative impact on performance of SMP-based
 virtualization use cases.

Have you measured the impact of using normal barriers (non-SMP ones)
like we use on normal HW drivers unconditionally ?

IE. If the difference is small enough I'd say just go for it and avoid
the bloat.

Ben.




Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-02 Thread Ohad Ben-Cohen
On Sat, Dec 3, 2011 at 1:09 AM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 Have you measured the impact of using normal barriers (non-SMP ones)
 like we use on normal HW drivers unconditionally ?

 IE. If the difference is small enough I'd say just go for it and avoid
 the bloat.

I agree.

MST wanted to give this a try this week. If all goes well and there's
no unreasonable regression, we'd just switch to mandatory barriers.

Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-01 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 12:58:59PM +1030, Rusty Russell wrote:
 On Thu, 1 Dec 2011 01:13:07 +0200, Michael S. Tsirkin m...@redhat.com 
 wrote:
  For x86, stores into memory are ordered. So I think that yes, smp_XXX
  can be selected at compile time.
  
  So let's forget the virtio strangeness for a minute,
 
 Hmm, we got away with light barriers because we knew we were not
 *really* talking to a device.  But now with virtio-mmio, turns out we
 are :)

You think virtio-mmio has this issue too?  It's reported on remoteproc...

 I'm really tempted to revert d57ed95 for 3.2, and we can revisit this
 optimization later if it proves worthwhile.
 
 Thoughts?
 Rusty. 

Generally it does seem the best we can do for 3.2.

Given it's rc3, I'd be a bit wary of introducing regressions - I'll try
to find some real setups (as in - not my laptop) to run some benchmarks
on, to verify there's no major problem.
I hope I can report on this in about a week from now - want to hold onto this 
meanwhile?

Further, if we do revert, need to remember to apply the following
beforehand, to avoid breaking virtio tool:

tools/virtio: implement mandatory barriers for x86

Signed-off-by: Michael S. Tsirkin m...@redhat.com

diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
index 68b8b8d..1bf0e80 100644
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -172,11 +172,18 @@ struct virtqueue {
 #define MODULE_LICENSE(__MODULE_LICENSE_value) \
const char *__MODULE_LICENSE_name = __MODULE_LICENSE_value
 
 #define CONFIG_SMP
 
 #if defined(__i386__) || defined(__x86_64__)
 #define barrier() asm volatile("" ::: "memory")
 #define mb() __sync_synchronize()
+#if defined(__i386__)
+#define wmb() mb()
+#define rmb() mb()
+#else
+#define wmb() asm volatile("sfence" ::: "memory")
+#define rmb() asm volatile("lfence" ::: "memory")
+#endif
 
 #define smp_mb()   mb()
 # define smp_rmb() barrier()

-- 
MST


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-01 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 08:14:26AM +0200, Ohad Ben-Cohen wrote:
 On Thu, Dec 1, 2011 at 1:13 AM, Michael S. Tsirkin m...@redhat.com wrote:
  For x86, stores into memory are ordered. So I think that yes, smp_XXX
  can be selected at compile time.
 
 But then you can't use the same kernel image for both scenarios.

I was talking about virtio-pci. That always allocates the ring
in the normal memory.

 It won't take long until people will use virtio on ARM for both
 virtualization and for talking to devices, and having to rebuild the
 kernel for different use cases is nasty.

Yes, I understand that it's nasty.



Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-12-01 Thread Rusty Russell
On Thu, 1 Dec 2011 10:12:37 +0200, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Dec 01, 2011 at 12:58:59PM +1030, Rusty Russell wrote:
  On Thu, 1 Dec 2011 01:13:07 +0200, Michael S. Tsirkin m...@redhat.com 
  wrote:
   For x86, stores into memory are ordered. So I think that yes, smp_XXX
   can be selected at compile time.
   
   So let's forget the virtio strangeness for a minute,
  
  Hmm, we got away with light barriers because we knew we were not
  *really* talking to a device.  But now with virtio-mmio, turns out we
  are :)
 
 You think virtio-mmio has this issue too?  It's reported on remoteproc...

I think any non-virtual, non-PCI device has to worry about it.  Perhaps
all virtio-mmio are virtual (at this point).

I'm tempted to say we want permission from the device to do relaxed
barriers (so I don't have to worry about it!)

  I'm really tempted to revert d57ed95 for 3.2, and we can revisit this
  optimization later if it proves worthwhile.
 
 Generally it does seem the best we can do for 3.2.
 
 Given it's rc3, I'd be a bit wary of introducing regressions - I'll try
 to find some real setups (as in - not my laptop) to run some benchmarks
 on, to verify there's no major problem.
 I hope I can report on this in about a week from now - want to hold onto this 
 meanwhile?

Yep, no huge hurry.  Thanks!

Cheers,
Rusty.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Tue, Nov 29, 2011 at 5:16 PM, Michael S. Tsirkin m...@redhat.com wrote:
 This mentions iommu - is there a need to use dma api to let
 the firmware access the rings? Or does it have access to all
 of memory?

IOMMU may or may not be used, it really depends on the hardware (my
personal SoC does employ one, while others don't).

The vrings are created in non-cacheable memory, which is allocated
using dma_alloc_coherent, but that isn't necessarily controlling the
remote processor access to the memory (a notable example is an
iommu-less remote processor which can directly access the physical
memory).

 Is there cache snooping? If yes access from an external device
 typically works mostly in the same way as smp ...

No, nothing fancy like that. Every processor has its own cache, with
no coherency protocol. The remote processor should really be treated
as a device, and not as a processor that is part of an SMP
configuration, and we must prohibit both the compiler and the CPU from
reordering memory operations.

 So you put virtio rings in MMIO memory?

I'll be precise: the vrings are created in non-cacheable memory, which
both processors have access to.

 Could you please give a couple of examples of breakage?

Sure. Basically, the order of the vring memory operations appears
different to the observing processor. For example, avail->idx gets
updated before the new entry is put in the available array...
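
A simplified sketch of the publish path in question (paraphrasing
virtio_ring.c, not verbatim) shows the two stores whose order must be
enforced:

vq->vring.avail->ring[avail] = head;    /* store 1: expose the new entry */
virtio_wmb();                           /* order store 1 before store 2 */
vq->vring.avail->idx++;                 /* store 2: publish it via the index */

With virtio_wmb() mapped to smp_wmb(), an observer that is a device rather
than another CPU may see store 2 before store 1.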

Thanks,
Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Tue, Nov 29, 2011 at 5:19 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
  Is an extra branch faster or slower than reverting d57ed95?

 Sorry, unfortunately I have no way to measure this, as I don't have
 any virtualization/x86 setup. I'm developing on ARM SoCs, where
 virtualization hardware is coming, but not here yet.

 You can try using the micro-benchmark in tools/virtio/.

Hmm, care to show me exactly what you mean?

Though I somewhat suspect that any micro-benchmarking I'll do with my
random ARM SoC will not have much value to real virtualization/x86
workloads.

Thanks,
Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 01:55:53PM +0200, Ohad Ben-Cohen wrote:
 On Tue, Nov 29, 2011 at 5:19 PM, Michael S. Tsirkin m...@redhat.com wrote:
  On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
   Is an extra branch faster or slower than reverting d57ed95?
 
  Sorry, unfortunately I have no way to measure this, as I don't have
  any virtualization/x86 setup. I'm developing on ARM SoCs, where
  virtualization hardware is coming, but not here yet.
 
  You can try using the micro-benchmark in tools/virtio/.
 
  Hmm, care to show me exactly what you mean?

make headers_install
make -C tools/virtio/
(you'll need an empty stub for tools/virtio/linux/module.h,
 I just sent a patch to add that)
sudo insmod tools/virtio/vhost_test/vhost_test.ko
./tools/virtio/virtio_test

 Though I somewhat suspect that any micro-benchmarking I'll do with my
 random ARM SoC will not have much value to real virtualization/x86
 workloads.
 
 Thanks,
 Ohad.

Real virtualization/x86 can keep using current smp_XX barriers, right?
We can have some config for your kind of setup.

-- 
MST


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 01:45:05PM +0200, Ohad Ben-Cohen wrote:
  So you put virtio rings in MMIO memory?
 
 I'll be precise: the vrings are created in non-cacheable memory, which
 both processors have access to.
 
  Could you please give a couple of examples of breakage?
 
 Sure. Basically, the order of the vring memory operations appears
 different to the observing processor. For example, avail->idx gets
 updated before the new entry is put in the available array...

I see. And this happens because the ARM processor reorders
memory writes to this uncacheable memory?
And in an SMP configuration, writes are somehow not reordered?

For example, if we had such an AMP configuration with an x86
processor, wmb() (sfence) would be wrong and smp_wmb() would be sufficient.

Just checking that this is not a bug in the smp_wmb implementation
for the specific platform.

-- 
MST


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 4:59 PM, Michael S. Tsirkin m...@redhat.com wrote:
 I see. And this happens because the ARM processor reorders
 memory writes

Yes.

 And in an SMP configuration, writes are somehow not reordered?

They are, but then the smp memory barriers are enough to control these
effects. It's not enough to control reordering as seen by a device
(which is what our AMP processors are) though.

(btw, the difference between an SMP processor and a device here lies
in how the memory is mapped: normal memory vs. device memory
attributes. it's an ARM thingy).

 Just checking that this is not a bug in the smp_wmb implementation
 for the specific platform.

No, it's not.

ARM's smp memory barriers use ARM's DMB instruction, which is enough
to control SMP effects, whereas ARM's mandatory memory barriers use
ARM's DSB instruction, which is required to ensure the ordering
between Device and Normal memory accesses.
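
Roughly, and as an illustrative sketch rather than verbatim kernel code,
the mapping Ohad describes is:

/* inter-CPU ordering only: */
#define smp_wmb()  __asm__ __volatile__("dmb" : : : "memory")
/* ordering that is also visible to devices: */
#define wmb()      __asm__ __volatile__("dsb" : : : "memory")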

Thanks,
Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Wed, Nov 30, 2011 at 06:04:56PM +0200, Ohad Ben-Cohen wrote:
 On Wed, Nov 30, 2011 at 4:59 PM, Michael S. Tsirkin m...@redhat.com wrote:
  I see. And this happens because the ARM processor reorders
  memory writes
 
 Yes.
 
  And in an SMP configuration, writes are somehow not reordered?
 
 They are, but then the smp memory barriers are enough to control these
 effects. It's not enough to control reordering as seen by a device
 (which is what our AMP processors are) though.
 
 (btw, the difference between an SMP processor and a device here lies
 in how the memory is mapped: normal memory vs. device memory
 attributes. it's an ARM thingy).

How are the rings mapped? normal memory, right?
We allocate them with plain alloc_pages_exact in virtio_pci.c ...

  Just checking that this is not a bug in the smp_wmb implementation
  for the specific platform.
 
 No, it's not.
 
 ARM's smp memory barriers use ARM's DMB instruction, which is enough
 to control SMP effects, whereas ARM's mandatory memory barriers use
 ARM's DSB instruction, which is required to ensure the ordering
 between Device and Normal memory accesses.
 
 Thanks,
 Ohad.

Yes, wmb() is required to ensure ordering for MMIO.
But here both accesses - the index and the ring - are to
memory, not MMIO.

I could understand the ring kick bypassing the index write, maybe ...
But you described an index write bypassing a descriptor write.
Is this something you see in practice?


-- 
MST


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin m...@redhat.com wrote:
 How are the rings mapped? normal memory, right?

No, device memory.

 We allocate them with plain alloc_pages_exact in virtio_pci.c ...

I'm not using virtio_pci.c; remoteproc is allocating the rings using
the DMA API.

 Yes, wmb() is required to ensure ordering for MMIO.
 But here both accesses - the index and the ring - are to
 memory, not MMIO.

I'm doing I/O with a device over shared memory. It does require
mandatory barriers, as I explained.

 Is this something you see in practice?

Yes. These bugs are very real.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 4:50 PM, Michael S. Tsirkin m...@redhat.com wrote:
 make headers_install
 make -C tools/virtio/
 (you'll need an empty stub for tools/virtio/linux/module.h,
  I just sent a patch to add that)
 sudo insmod tools/virtio/vhost_test/vhost_test.ko
 ./tools/virtio/virtio_test

Ok, I gave this a spin.

I've tried to see if reverting d57ed95 has any measurable effect on
the execution time of virtio_test's run_test(), but I couldn't see any
(several attempts with and without d57ed95 yielded very similar range
of execution times).

YMMV though, especially with real workloads.

 Real virtualization/x86 can keep using current smp_XX barriers, right?

Yes, sure. ARM virtualization can too, since smp_XX barriers are
enough for that scenario.

 We can have some config for your kind of setup.

Please note that it can't be a compile-time decision though (unless
we're willing to effectively revert d57ed95 when this config kicks
in): it's not unlikely that one would want to have both use cases
running at the same time.

Thanks,
Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 12:43:08AM +0200, Ohad Ben-Cohen wrote:
 On Wed, Nov 30, 2011 at 4:50 PM, Michael S. Tsirkin m...@redhat.com wrote:
  make headers_install
  make -C tools/virtio/
  (you'll need an empty stub for tools/virtio/linux/module.h,
   I just sent a patch to add that)
  sudo insmod tools/virtio/vhost_test/vhost_test.ko
  ./tools/virtio/virtio_test
 
 Ok, I gave this a spin.
 
 I've tried to see if reverting d57ed95 has any measurable effect on
 the execution time of virtio_test's run_test(), but I couldn't see any
 (several attempts with and without d57ed95 yielded very similar range
 of execution times).
 
 YMMV though, especially with real workloads.
 
  Real virtualization/x86 can keep using current smp_XX barriers, right?
 
 Yes, sure. ARM virtualization can too, since smp_XX barriers are
 enough for that scenario.
 
  We can have some config for your kind of setup.
 
 Please note that it can't be a compile-time decision though (unless
 we're willing to effectively revert d57ed95 when this config kicks
 in): it's not unlikely that one would want to have both use cases
 running at the same time.
 
 Thanks,
 Ohad.

For x86, stores into memory are ordered. So I think that yes, smp_XXX
can be selected at compile time.

So let's forget the virtio strangeness for a minute.

To me it starts looking like we need some new kind of barrier
that handles accesses to DMA-coherent memory. dma_Xmb()?
dma_coherent_Xmb()?  For example, on x86, dma_wmb() can be barrier(),
but on your system it needs to be a DSB.

We can set the rule that dma barriers are guaranteed stronger
than smp ones, and we can just use dma_ everywhere.
So the strength will be:

smp < dma < mandatory

And now virtio can use DMA barriers; instead of adding
overhead on x86, x86 will actually gain from this,
as we'll drop the mandatory barriers on UP systems.

Hmm?

-- 
MST
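
(A hypothetical sketch of what such a dma_*mb() family could look like;
these helpers did not exist in the kernel at the time of this thread:)

/* Strength rule: smp_*mb() <= dma_*mb() <= *mb(). */
#ifdef CONFIG_X86
#define dma_wmb() barrier() /* stores to coherent memory are ordered on x86 */
#define dma_rmb() barrier()
#else
#define dma_wmb() wmb()     /* e.g. a DSB/DMB on ARM, as the platform needs */
#define dma_rmb() rmb()
#endif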


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Wed, Nov 30, 2011 at 6:24 PM, Ohad Ben-Cohen o...@wizery.com wrote:
 On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin m...@redhat.com wrote:
 How are the rings mapped? normal memory, right?

 No, device memory.

Ok, I have more info.

Originally remoteproc was mapping the rings using ioremap, and that
meant ARM Device memory.

Recently, though, we moved to CMA (allocating memory for the rings via
dma_alloc_coherent), and that isn't Device memory anymore: it's
uncacheable Normal memory (on ARM v6+).
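
(Roughly, the ring allocation now looks like the following sketch, using
the generic DMA API; the wrapper name and parameters are illustrative,
not the actual remoteproc code:)

static void *alloc_ring(struct device *dev, unsigned int num,
                        unsigned long align, dma_addr_t *dma)
{
        /* Coherent memory: uncacheable Normal memory on ARMv6+,
         * not Device memory. */
        return dma_alloc_coherent(dev, PAGE_ALIGN(vring_size(num, align)),
                                  dma, GFP_KERNEL);
}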

We still require mandatory barriers though: one very reproducible
problem I personally face is that the avail index doesn't get updated
before the kick. As a result, the remote processor misses a buffer
that was just added (the kick wakes it up only to find that the avail
index wasn't changed yet). This probably happens because the mailbox
used to kick the remote processor is mapped as Device memory, and
therefore the kick can be reordered before the updates to the ring
can be observed.
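
(In code terms, the failing sequence is essentially the following; the
names vq_id and mbox_reg are illustrative stand-ins, and on some ARM
configurations writel() itself already implies a mandatory write barrier:)

vq->vring.avail->idx++;    /* store to uncacheable Normal memory */
mb();                      /* DSB: without it, the remote CPU can observe
                            * the kick before the new avail index */
writel(vq_id, mbox_reg);   /* kick: store to a Device-memory mailbox */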

I did get two additional reports of reordering issues, on setups
different from my own, which I can't personally reproduce: the one
I described earlier (the avail index getting updated before the avail
array) and one in the receive path (reading a used buffer we had
already read). I couldn't verify those personally, but both issues
were reported to be gone once mandatory barriers were used.

I expect those reports only to increase: the diversity of platforms
now looking into adopting virtio for this kind of inter-processor
communication is huge, with several different architectures and even
more hardware implementations on the way (not only ARM).

Thanks,
Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Michael S. Tsirkin
On Thu, Dec 01, 2011 at 01:27:10AM +0200, Ohad Ben-Cohen wrote:
 On Wed, Nov 30, 2011 at 6:24 PM, Ohad Ben-Cohen o...@wizery.com wrote:
  On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin m...@redhat.com wrote:
  How are the rings mapped? normal memory, right?
 
  No, device memory.
 
 Ok, I have more info.
 
 Originally remoteproc was mapping the rings using ioremap, and that
 meant ARM Device memory.
 
 Recently, though, we moved to CMA (allocating memory for the rings via
 dma_alloc_coherent), and that isn't Device memory anymore: it's
 uncacheable Normal memory (on ARM v6+).

And these accesses need to be ordered with DSB? Or DMB?

 We still require mandatory barriers though: one very reproducible
 problem I personally face is that the avail index doesn't get updated
 before the kick.

Aha! The *kick* really is MMIO. So I think we do need a mandatory barrier
before the kick.  Maybe we need it for virtio-pci as well
(not on kvm, naturally :)). Off-hand this seems to belong in the transport
layer, but I need to think about it.
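
(One possible shape for that: a hypothetical transport notify hook that
enforces the ordering itself; the names here are stand-ins, loosely
modeled on virtio_mmio:)

static void __iomem *doorbell;  /* Device-mapped kick register (stand-in) */

static void amp_notify(struct virtqueue *vq)
{
        /* Mandatory barrier: make the ring updates in coherent RAM
         * visible to the remote processor before it sees the kick. */
        wmb();
        writel(0 /* queue index */, doorbell);
}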

 As a result, the remote processor misses a buffer
 that was just added (the kick wakes it up only to find that the avail
 index wasn't changed yet). This probably happens because the mailbox
 used to kick the remote processor is mapped as Device memory, and
 therefore the kick can be reordered before the updates to the ring
 can be observed.
 
 I did get two additional reports of reordering issues, on setups
 different from my own, which I can't personally reproduce: the one
 I described earlier (the avail index getting updated before the avail
 array) and one in the receive path (reading a used buffer we had
 already read). I couldn't verify those personally, but both issues
 were reported to be gone once mandatory barriers were used.

Hmm. So it's a hint that something is wrong with the memory ordering,
but not what's wrong exactly.

 I expect those reports only to increase: the diversity of platforms
 now looking into adopting virtio for this kind of inter-processor
 communication is huge, with several different architectures and even
 more hardware implementations on the way (not only ARM).
 
 Thanks,
 Ohad.

Right. We need to be very careful with memory;
it's a tricky field. One known problem with virtio
is its insistence on using native endianness
for some fields. If POWER is used, we'll have to fix this.

-- 
MST


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Rusty Russell
On Thu, 1 Dec 2011 01:13:07 +0200, Michael S. Tsirkin m...@redhat.com wrote:
 For x86, stores into memory are ordered. So I think that yes, smp_XXX
 can be selected at compile time.
 
 So let's forget the virtio strangeness for a minute.

Hmm, we got away with light barriers because we knew we were not
*really* talking to a device.  But now with virtio-mmio, turns out we
are :)

I'm really tempted to revert d57ed95 for 3.2, and we can revisit this
optimization later if it proves worthwhile.

Thoughts?
Rusty. 


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Thu, Dec 1, 2011 at 1:13 AM, Michael S. Tsirkin m...@redhat.com wrote:
 For x86, stores into memory are ordered. So I think that yes, smp_XXX
 can be selected at compile time.

But then you can't use the same kernel image for both scenarios.

It won't take long before people use virtio on ARM both for
virtualization and for talking to devices, and having to rebuild the
kernel for different use cases is nasty.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Thu, Dec 1, 2011 at 1:43 AM, Michael S. Tsirkin m...@redhat.com wrote:
 And these accesses need to be ordered with DSB? Or DMB?

DMB (i.e. smp barriers) should be enough for Normal memory
accesses, though the other issues that were reported to me are a bit
concerning. I'm still trying to get more information about them, in
the hope that I can eventually reproduce them myself.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-30 Thread Ohad Ben-Cohen
On Thu, Dec 1, 2011 at 4:28 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Hmm, we got away with light barriers because we knew we were not
 *really* talking to a device.  But now with virtio-mmio, turns out we
 are :)

 I'm really tempted to revert d57ed95 for 3.2, and we can revisit this
 optimization later if it proves worthwhile.

+1


[RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-29 Thread Ohad Ben-Cohen
Virtio is using memory barriers to control the ordering of
references to the vrings on SMP systems. When the guest is compiled
with SMP support, virtio is only using SMP barriers in order to
avoid incurring the overhead involved with mandatory barriers.

Lately, though, virtio is being increasingly used with inter-processor
communication scenarios too, which involve running two (separate)
instances of operating systems on two (separate) processors, each of
which might either be UP or SMP.

To control the ordering of memory references when the vrings are shared
between two external processors, we must always use mandatory barriers.

A trivial, albeit sub-optimal, solution would be to simply revert
commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
that's going to have a negative impact on performance of SMP-based
virtualization use cases.

A different approach, as demonstrated by this patch, would pick the type
of memory barriers, at run time, according to the requirements of the
virtio device. This way, both SMP virtualization scenarios and inter-
processor communication use cases would run correctly, without making
any performance compromises (except for those incurred by an additional
branch or level of indirection).

This patch introduces VIRTIO_RING_F_REMOTEPROC, a new virtio transport
feature, which should be used by virtio devices that run on remote
processors. The CONFIG_SMP variant of virtio_{mb, rmb, wmb} is then changed
to use SMP barriers only if VIRTIO_RING_F_REMOTEPROC is absent.

Signed-off-by: Ohad Ben-Cohen o...@wizery.com
---
Alternatively, we can also introduce some kind of virtio_mb_ops and set it
according to the nature of the vdev, with handlers that just do the right
thing, instead of introducing that branch.
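
(For illustration, that alternative could look something like the sketch
below; hypothetical, not part of this patch:)

static void vring_smp_wmb(void)  { smp_wmb(); }
static void vring_mand_wmb(void) { wmb(); }

struct virtio_mb_ops {
        void (*mb)(void);
        void (*rmb)(void);
        void (*wmb)(void);
};

/* Set once at setup time according to the nature of the vdev; call
 * sites then use vq->mb_ops->wmb() etc., trading the branch for an
 * indirect call. */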

Though I also wonder how big the performance gain of d57ed95 really is?

 drivers/virtio/virtio_ring.c |   78 +-
 include/linux/virtio_ring.h  |6 +++
 2 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..cf66a2d 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,24 +23,6 @@
 #include <linux/slab.h>
 #include <linux/module.h>
 
-/* virtio guest is communicating with a virtual device that actually runs on
- * a host processor.  Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
-/* Where possible, use SMP barriers which are more lightweight than mandatory
- * barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define virtio_rmb() smp_rmb()
-#define virtio_wmb() smp_wmb()
-#else
-/* We must force memory ordering even if guest is UP since host could be
- * running on another CPU, but SMP barriers are defined to barrier() in that
- * configuration. So fall back to mandatory barriers instead. */
-#define virtio_mb() mb()
-#define virtio_rmb() rmb()
-#define virtio_wmb() wmb()
-#endif
-
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
 #define BAD_RING(_vq, fmt, args...)\
@@ -86,6 +68,9 @@ struct vring_virtqueue
/* Host publishes avail event idx */
bool event;
 
+   /* Host runs on a remote processor */
+   bool rproc;
+
/* Number of free buffers */
unsigned int num_free;
/* Head of free buffer list. */
@@ -108,6 +93,48 @@ struct vring_virtqueue
void *data[];
 };
 
+/*
+ * virtio guest is communicating with a virtual device that may either run
+ * on the host processor, or on an external processor. The former requires
+ * memory barriers in order to control SMP effects, but the latter must
+ * use mandatory barriers.
+ */
+#ifdef CONFIG_SMP
+/* Where possible, use SMP barriers which are more lightweight than mandatory
+ * barriers, because mandatory barriers control MMIO effects on accesses
+ * through relaxed memory I/O windows. */
+static inline void virtio_mb(struct vring_virtqueue *vq)
+{
+   if (vq->rproc)
+   mb();
+   else
+   smp_mb();
+}
+
+static inline void virtio_rmb(struct vring_virtqueue *vq)
+{
+   if (vq->rproc)
+   rmb();
+   else
+   smp_rmb();
+}
+
+static inline void virtio_wmb(struct vring_virtqueue *vq)
+{
+   if (vq->rproc)
+   wmb();
+   else
+   smp_wmb();
+}
+#else
+/* We must force memory ordering even if guest is UP since host could be
+ * running on another CPU, but SMP barriers are defined to barrier() in that
+ * configuration. So fall back to mandatory barriers instead. */
+static inline void virtio_mb(struct vring_virtqueue *vq) { mb(); }
+static inline void virtio_rmb(struct vring_virtqueue *vq) { rmb(); }
+static inline void virtio_wmb(struct vring_virtqueue *vq) { wmb(); }
+#endif
+
 #define to_vvq(_vq) container_of(_vq, 
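
(The archived patch is truncated above. For context, a simplified sketch
of how these per-vq helpers end up being used, following the existing
virtqueue_kick() pattern in virtio_ring.c:)

/* Descriptors and available array entries must be visible before we
 * expose the new available index. */
virtio_wmb(vq);
vq->vring.avail->idx += vq->num_added;
vq->num_added = 0;

/* The index update must be visible before we read the notify flag. */
virtio_mb(vq);
if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
        vq->notify(&vq->vq);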

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-29 Thread Michael S. Tsirkin
On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
 Virtio is using memory barriers to control the ordering of
 references to the vrings on SMP systems. When the guest is compiled
 with SMP support, virtio is only using SMP barriers in order to
 avoid incurring the overhead involved with mandatory barriers.
 
 Lately, though, virtio is being increasingly used with inter-processor
 communication scenarios too, which involve running two (separate)
 instances of operating systems on two (separate) processors, each of
 which might either be UP or SMP.

Is that using virtio-mmio? If yes, would the extra serialization belong
at that layer?

 To control the ordering of memory references when the vrings are shared
 between two external processors, we must always use mandatory barriers.

Sorry, could you pls explain what 'two external processors' are?
I think I know that if two x86 CPUs in an SMP system run kernels built
in an SMP configuration, smp_*mb barriers are enough.

Documentation/memory-barriers.txt says:
Mandatory barriers should not be used to control SMP effects ...
They may, however, be used to control MMIO effects on accesses through
relaxed memory I/O windows.

We don't control MMIO/relaxed memory I/O windows here, so what exactly
is the issue?

Could you please give an example of a setup that is currently broken?

 
 A trivial, albeit sub-optimal, solution would be to simply revert
 commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though,
 that's going to have a negative impact on performance of SMP-based
 virtualization use cases.
 
 A different approach, as demonstrated by this patch, would pick the type
 of memory barriers, at run time, according to the requirements of the
 virtio device. This way, both SMP virtualization scenarios and inter-
 processor communication use cases would run correctly, without making
 any performance compromises (except for those incurred by an additional
 branch or level of indirection).

Is an extra branch faster or slower than reverting d57ed95?

 
 This patch introduces VIRTIO_RING_F_REMOTEPROC, a new virtio transport
 feature, which should be used by virtio devices that run on remote
 processors. The CONFIG_SMP variant of virtio_{mb, rmb, wmb} is then changed
 to use SMP barriers only if VIRTIO_RING_F_REMOTEPROC is absent.

One wonders how the remote side knows enough to set this flag?

 
 Signed-off-by: Ohad Ben-Cohen o...@wizery.com
 ---
 Alternatively, we can also introduce some kind of virtio_mb_ops and set it
 according to the nature of the vdev, with handlers that just do the right
 thing, instead of introducing that branch.
 
 Though I also wonder how big the performance gain of d57ed95 really is?

Want to check and tell us?

  drivers/virtio/virtio_ring.c |   78 +-
  include/linux/virtio_ring.h  |6 +++
  2 files changed, 60 insertions(+), 24 deletions(-)
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 index c7a2c20..cf66a2d 100644
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -23,24 +23,6 @@
  #include <linux/slab.h>
  #include <linux/module.h>
  
 -/* virtio guest is communicating with a virtual device that actually runs on
 - * a host processor.  Memory barriers are used to control SMP effects. */
 -#ifdef CONFIG_SMP
 -/* Where possible, use SMP barriers which are more lightweight than mandatory
 - * barriers, because mandatory barriers control MMIO effects on accesses
 - * through relaxed memory I/O windows (which virtio does not use). */
 -#define virtio_mb() smp_mb()
 -#define virtio_rmb() smp_rmb()
 -#define virtio_wmb() smp_wmb()
 -#else
 -/* We must force memory ordering even if guest is UP since host could be
 - * running on another CPU, but SMP barriers are defined to barrier() in that
 - * configuration. So fall back to mandatory barriers instead. */
 -#define virtio_mb() mb()
 -#define virtio_rmb() rmb()
 -#define virtio_wmb() wmb()
 -#endif
 -
  #ifdef DEBUG
  /* For development, we want to crash whenever the ring is screwed. */
  #define BAD_RING(_vq, fmt, args...)  \
 @@ -86,6 +68,9 @@ struct vring_virtqueue
   /* Host publishes avail event idx */
   bool event;
  
 + /* Host runs on a remote processor */
 + bool rproc;
 +
   /* Number of free buffers */
   unsigned int num_free;
   /* Head of free buffer list. */
 @@ -108,6 +93,48 @@ struct vring_virtqueue
   void *data[];
  };
  
 +/*
 + * virtio guest is communicating with a virtual device that may either run
 + * on the host processor, or on an external processor. The former requires
 + * memory barriers in order to control SMP effects, but the latter must
 + * use mandatory barriers.
 + */
 +#ifdef CONFIG_SMP
 +/* Where possible, use SMP barriers which are more lightweight than mandatory
 + * barriers, because mandatory barriers control MMIO effects on accesses
 + * through 

Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-29 Thread Ohad Ben-Cohen
On Tue, Nov 29, 2011 at 3:11 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
 Virtio is using memory barriers to control the ordering of
 references to the vrings on SMP systems. When the guest is compiled
 with SMP support, virtio is only using SMP barriers in order to
 avoid incurring the overhead involved with mandatory barriers.

 Lately, though, virtio is being increasingly used with inter-processor
 communication scenarios too, which involve running two (separate)
 instances of operating systems on two (separate) processors, each of
 which might either be UP or SMP.

 Is that using virtio-mmio?

No, I'm using this:

https://lkml.org/lkml/2011/10/25/139

 Sorry, could you pls explain what are 'two external processors'?
 I think I know that if two x86 CPUs in an SMP system run kernels built
 in an SMP configuration, smp_*mb barriers are enough.

Sure:

My setup is not SMP-based; it's two separate processors running in an
AMP configuration. The processors have completely different architectures,
are not cache coherent, and simply share some memory, which is used
for communication, with virtio as the shared-memory wire protocol
(i.e. we're not even doing virtualization: we have Linux on one
processor and some RTOS on the other, and they use virtio to send and
receive buffers).

So it's not SMP effects we're controlling; we're pretty much doing
MMIO and must use mandatory barriers (otherwise we see breakage).

 Is an extra branch faster or slower than reverting d57ed95?

Sorry, unfortunately I have no way to measure this, as I don't have
any virtualization/x86 setup. I'm developing on ARM SoCs, where
virtualization hardware is coming, but not here yet.

 One wonders how the remote side knows enough to set this flag?

The remote side is a dedicated firmware loaded in order to boot the
other processor, so it really is intimate with the platform and its
requirements (and it publishes the virtio device features for every
supported virtio device).

 Though I also wonder how big really is the perforamnce gain of d57ed95 ?

 Want to check and tell us?

Sorry, as I said, I have no way to do that... Any chance you did some
measurements back then, when d57ed95 was introduced?

I just guessed it improved performance; otherwise you probably
wouldn't have bothered. But if not, then reverting it is surely
preferable to adding more complexity.

Thanks!
Ohad.


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-29 Thread Michael S. Tsirkin
On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
 On Tue, Nov 29, 2011 at 3:11 PM, Michael S. Tsirkin m...@redhat.com wrote:
  On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
  Virtio is using memory barriers to control the ordering of
  references to the vrings on SMP systems. When the guest is compiled
  with SMP support, virtio is only using SMP barriers in order to
  avoid incurring the overhead involved with mandatory barriers.
 
  Lately, though, virtio is being increasingly used with inter-processor
  communication scenarios too, which involve running two (separate)
  instances of operating systems on two (separate) processors, each of
  which might either be UP or SMP.
 
  Is that using virtio-mmio?
 
 No, I'm using this:
 
 https://lkml.org/lkml/2011/10/25/139

This mentions an IOMMU - is there a need to use the DMA API to let
the firmware access the rings? Or does it have access to all
of memory?

  Sorry, could you pls explain what are 'two external processors'?
  I think I know that if two x86 CPUs in an SMP system run kernels built
  in an SMP configuration, smp_*mb barriers are enough.
 
 Sure:
 
 My setup is not SMP-based; it's two separate processors running in an
 AMP configuration. The processors have completely different architectures,
 are not cache coherent, and simply share some memory, which is used
 for communication, with virtio as the shared-memory wire protocol
 (i.e. we're not even doing virtualization: we have Linux on one
 processor and some RTOS on the other, and they use virtio to send and
 receive buffers).

I'd like to make sure I understand the memory model some more.
Is there cache snooping? If yes, access from an external device
typically works mostly the same way as SMP ...


 So it's not SMP effects we're controlling; we're pretty much doing
 MMIO and must use mandatory barriers

So you put virtio rings in MMIO memory?

 (otherwise we see breakage).

Could you please give a couple of examples of breakage?

-- 
MST


Re: [RFC] virtio: use mandatory barriers for remote processor vdevs

2011-11-29 Thread Michael S. Tsirkin
On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
  Is an extra branch faster or slower than reverting d57ed95?
 
 Sorry, unfortunately I have no way to measure this, as I don't have
 any virtualization/x86 setup. I'm developing on ARM SoCs, where
 virtualization hardware is coming, but not here yet.

You can try using the micro-benchmark in tools/virtio/.
That does not need virtualization.

-- 
MST