Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-06 Thread Stephen Hemminger
On Fri, 06 Jul 2012 11:20:06 +0800
Jason Wang jasow...@redhat.com wrote:

 On 07/05/2012 08:51 PM, Sasha Levin wrote:
  On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
  @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
   if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
   vi->has_cvq = true;
 
  +   /* Use single tx/rx queue pair as default */
  +   vi->num_queue_pairs = 1;
  +   vi->total_queue_pairs = num_queue_pairs;
  The code is using this default even if the amount of queue pairs it
  wants was specified during initialization. This basically limits any
  device to use 1 pair when starting up.
 
 
 Yes, currently the virtio-net driver would use 1 txq/rxq pair by default 
 since multiqueue may not outperform singlequeue in all kinds of workloads. So it's 
 better to keep that as the default and let the user enable multiqueue via ethtool -L.
 

I would prefer that the driver sized number of queues based on number
of online CPU's. That is what real hardware does. What kind of workload
are you doing? If it is some DBMS benchmark then maybe the issue is that
some CPU's need to be reserved.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC V5 0/5] Multiqueue virtio-net

2012-07-06 Thread Jason Wang

On 07/06/2012 01:45 AM, Rick Jones wrote:

On 07/05/2012 03:29 AM, Jason Wang wrote:



Test result:

1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning

- Guest to External Host TCP STREAM
sessions size throughput1 throughput2 %  norm1 norm2 %
1 64 650.55 655.61 100% 24.88 24.86 99%
2 64 1446.81 1309.44 90% 30.49 27.16 89%
4 64 1430.52 1305.59 91% 30.78 26.80 87%
8 64 1450.89 1270.82 87% 30.83 25.95 84%


Was the -D test-specific option used to set TCP_NODELAY?  I'm guessing 
from your description of how packet sizes were smaller with multiqueue 
and your need to hack tcp_write_xmit() it wasn't but since we don't 
have the specific netperf command lines (hint hint :) I wanted to make 
certain.

Hi Rick:

I didn't specify -D for disabling Nagle. I also collected the number of TX 
packets and the average packet size:


Guest to External Host ( 2vcpu 1q vs 2q )
sessions size tput-sq tput-mq %  norm-sq norm-mq %  #tx-pkts-sq #tx-pkts-mq %  avg-sz-sq avg-sz-mq %

1 64 668.85 671.13 100% 25.80 26.86 104% 629038 627126 99% 1395 1403 100%
2 64 1421.29 1345.40 94% 32.06 27.57 85% 1318498 1246721 94% 1413 1414 100%
4 64 1469.96 1365.42 92% 32.44 27.04 83% 1362542 1277848 93% 1414 1401 99%
8 64 1131.00 1361.58 120% 24.81 26.76 107% 1223700 1280970 104% 1395 1394 99%

1 256 1883.98 1649.87 87% 60.67 58.48 96% 1542775 1465836 95% 1592 1472 92%
2 256 4847.09 3539.74 73% 98.35 64.05 65% 2683346 3074046 114% 2323 1505 64%
4 256 5197.33 3283.48 63% 109.14 62.39 57% 1819814 2929486 160% 3636 1467 40%
8 256 5953.53 3359.22 56% 122.75 64.21 52% 906071 2924148 322% 8282 1502 18%

1 512 3019.70 2646.07 87% 93.89 86.78 92% 2003780 2256077 112% 1949 1532 78%
2 512 7455.83 5861.03 78% 173.79 104.43 60% 1200322 3577142 298% 7831 2114 26%
4 512 8962.28 7062.20 78% 213.08 127.82 59% 468142 2594812 554% 24030 3468 14%
8 512 7849.82 8523.85 108% 175.41 154.19 87% 304923 1662023 545% 38640 6479 16%


When multiqueue was enabled, it did achieve a higher packet rate, but with 
a much smaller average packet size. It looks to me like multiqueue is faster 
and guest TCP has less opportunity to build larger skbs to send, so lots of 
small packets need to be sent, which leads to many more exits and more vhost 
work. One interesting thing is that if I run tcpdump in the host where the 
guest runs, I can see an obvious throughput increase. To verify this 
assumption, I hacked tcp_write_xmit() with the following patch and set 
tcp_tso_win_divisor=1; then multiqueue can outperform or at least match 
singlequeue throughput, though it could introduce latency that I haven't 
measured yet.


I'm no TCP expert, but the changes look reasonable to me:
- do the full-sized TSO check in tcp_tso_should_defer() only for Westwood, 
according to TCP Westwood

- run tcp_tso_should_defer() also for tso_segs == 1 when TSO is enabled.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c465d3e..166a888 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1567,7 +1567,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)

 	in_flight = tcp_packets_in_flight(tp);

-	BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));
+	BUG_ON(tp->snd_cwnd <= in_flight);

 	send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;

@@ -1576,9 +1576,11 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)

 	limit = min(send_win, cong_win);

+#if 0
 	/* If a full-sized TSO skb can be sent, do it. */
 	if (limit >= sk->sk_gso_max_size)
 		goto send_now;
+#endif

 	/* Middle in queue won't get any more data, full sendable already? */
 	if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
@@ -1795,10 +1797,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 					     (tcp_skb_is_last(sk, skb) ?
 					      nonagle : TCP_NAGLE_PUSH))))
 				break;
-		} else {
-			if (!push_one && tcp_tso_should_defer(sk, skb))
-				break;
 		}
+		if (!push_one && tcp_tso_should_defer(sk, skb))
+			break;

 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))






Instead of calling them throughput1 and throughput2, it might be clearer 
in future to identify them as singlequeue and multiqueue.




Sure.
Also, how are you combining the concurrent netperf results?  Are you 
taking sums of what netperf reports, or are you gathering statistics 
outside of netperf?




The throughputs were just summed from the netperf results, as the netperf 
manual suggests. The CPU utilization was measured by mpstat.
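For reference, the aggregation itself is trivial; with the per-session throughputs collected into a file (file name and values below are hypothetical examples, not the actual harness), the aggregate is just the sum:

```shell
# Two concurrent netperf sessions, one reported throughput (Mbit/s) per line.
printf '650.55\n655.61\n' > throughputs.txt

# Aggregate throughput is the plain sum of the per-session values.
awk '{ sum += $1 } END { printf "aggregate: %.2f\n", sum }' throughputs.txt
# prints: aggregate: 1306.16
```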

- TCP RR
sessions size throughput1 throughput2 %  norm1 norm2 %
50 1 54695.41 84164.98 153% 1957.33 1901.31 97%


A single instance TCP_RR test would help confirm/refute any 
non-trivial change in 

Re: [net-next RFC V5 4/5] virtio_net: multiqueue support

2012-07-06 Thread Jason Wang

On 07/06/2012 04:02 AM, Amos Kong wrote:

On 07/05/2012 06:29 PM, Jason Wang wrote:

This patch converts virtio_net to a multi queue device. After negotiating the
VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
and the driver could read the number from config space.

The driver expects the number of rx/tx queue pairs to be equal to the number of
vcpus. To maximize the performance of these per-cpu rx/tx queue pairs, some
optimizations were introduced:

- Txq selection is based on the processor id in order to avoid contending a lock
   whose owner may exit to the host.
- Since the txq/rxq are per-cpu, affinity hints were set to the cpu that owns
   the queue pairs.

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
Signed-off-by: Jason Wang jasow...@redhat.com
---

...



  static int virtnet_probe(struct virtio_device *vdev)
  {
-   int err;
+   int i, err;
struct net_device *dev;
struct virtnet_info *vi;
+   u16 num_queues, num_queue_pairs;
+
+   /* Find if host supports multiqueue virtio_net device */
+   err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
+   offsetof(struct virtio_net_config,
+   num_queues), &num_queues);
+
+   /* We need atleast 2 queue's */


s/atleast/at least/



+   if (err || num_queues < 2)
+   num_queues = 2;
+   if (num_queues > MAX_QUEUES * 2)
+   num_queues = MAX_QUEUES;

 num_queues = MAX_QUEUES * 2;

MAX_QUEUES is the limitation of RX or TX.


Right, it's a typo, thanks.

+
+   num_queue_pairs = num_queues / 2;

...





Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-06 Thread Jason Wang

On 07/06/2012 04:07 AM, Amos Kong wrote:

On 07/05/2012 08:51 PM, Sasha Levin wrote:

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:

@@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
 vi->has_cvq = true;




+   /* Use single tx/rx queue pair as default */
+   vi->num_queue_pairs = 1;
+   vi->total_queue_pairs = num_queue_pairs;

vi->total_queue_pairs should also be set to 1:

vi->total_queue_pairs = 1;


Hi Amos:

total_queue_pairs is the max number of queue pairs that the device could 
provide, so it's ok here.

The code is using this default even if the amount of queue pairs it
wants was specified during initialization. This basically limits any
device to use 1 pair when starting up.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html






Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-06 Thread Sasha Levin
On Fri, 2012-07-06 at 11:20 +0800, Jason Wang wrote:
 On 07/05/2012 08:51 PM, Sasha Levin wrote:
  On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
  @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
   if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
   vi->has_cvq = true;
 
  +   /* Use single tx/rx queue pair as default */
  +   vi->num_queue_pairs = 1;
  +   vi->total_queue_pairs = num_queue_pairs;
  The code is using this default even if the amount of queue pairs it
  wants was specified during initialization. This basically limits any
  device to use 1 pair when starting up.
 
 
 Yes, currently the virtio-net driver would use 1 txq/rxq pair by default 
 since multiqueue may not outperform singlequeue in all kinds of workloads. So it's 
 better to keep that as the default and let the user enable multiqueue via ethtool -L.

I think it makes sense to set it to 1 if the amount of initial queue
pairs wasn't specified.

On the other hand, if a virtio-net driver was probed to provide
VIRTIO_NET_F_MULTIQUEUE and has set something reasonable in
virtio_net_config.num_queues, then that setting shouldn't be quietly
ignored and reset back to 1.

What I'm basically saying is that I agree that the *default* should be 1
- but if the user has explicitly asked for something else during
initialization, then the default should be overridden.



Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Nicholas A. Bellinger
On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
 On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
 
  So I'm pretty sure this discrepancy is attributed to the small block
  random I/O bottleneck currently present for all Linux/SCSI core LLDs
  regardless of physical or virtual storage fabric.
  
  The SCSI wide host-lock less conversion that happened in .38 code back
  in 2010, and subsequently having LLDs like virtio-scsi convert to run in
  host-lock-less mode have helped to some extent..  But it's still not
  enough..
  
  Another example where we've been able to prove this bottleneck recently
  is with the following target setup:
  
  *) Intel Romley production machines with 128 GB of DDR-3 memory
  *) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
  *) Mellanox PCI-express Gen3 HCA running at 56 gb/sec 
  *) Infiniband SRP Target backported to RHEL 6.2 + latest OFED
  
  In this setup using ib_srpt + IBLOCK w/ emulate_write_cache=1 +
  iomemory_vsl export we end up avoiding SCSI core bottleneck on the
  target machine, just as with the tcm_vhost example here for host kernel
  side processing with vhost.
  
  Using Linux IB SRP initiator + Windows Server 2008 R2 SCSI-miniport SRP
  (OFED) Initiator connected to four ib_srpt LUNs, we've observed that
  MSFT SCSI is currently outperforming RHEL 6.2 on the order of ~285K vs.
  ~215K with heavy random 4k WRITE iometer / fio tests.  Note this is with an
  optimized queue_depth ib_srp client w/ noop I/O scheduling, but is
  still lacking the host_lock-less patches on RHEL 6.2 OFED..
  
  This bottleneck has been mentioned by various people (including myself)
  on linux-scsi over the last 18 months, and I've proposed that it be
  discussed at KS-2012 so we can start making some forward progress:
 
 Well, no, it hasn't.  You randomly drop things like this into unrelated
 email (I suppose that is a mention in strict English construction) but
 it's not really enough to get anyone to pay attention since they mostly
 stopped reading at the top, if they got that far: most people just go by
 subject when wading through threads initially.
 

It most certainly has been made clear to me, numerous times from many
people in the Linux/SCSI community that there is a bottleneck for small
block random I/O in SCSI core vs. raw Linux/Block, as well as vs. non
Linux based SCSI subsystems.

My apologies if mentioning this issue last year at LC 2011 to you
privately did not take a tone of a more serious nature, or that
proposing a topic for LSF-2012 this year was not a clear enough
indication of a problem with SCSI small block random I/O performance.

 But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
 kernel, which is now nearly three years old) is 25% slower than W2k8R2
 on infiniband isn't really going to get anyone excited either
 (particularly when you mention OFED, which usually means a stack
 replacement on Linux anyway).
 

The specific issue was first raised for .38, where we were able to get
most of the interesting high performance LLDs converted to using
internal locking methods so that host_lock did not have to be obtained
during each ->queuecommand() I/O dispatch, right..?

This has helped a good deal for large multi-lun scsi_host configs that
are now running in host-lock less mode, but there is still a large
discrepancy single LUN vs. raw struct block_device access even with LLD
host_lock less mode enabled.

Now I think the virtio-blk client performance is demonstrating this
issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
flash benchmarks that demonstrate some other yet-to-be-determined
limitations of virtio-scsi-raw vs. tcm_vhost for this particular fio
randrw workload.

 What people might pay attention to is evidence that there's a problem in
 3.5-rc6 (without any OFED crap).  If you're not going to bother
 investigating, it has to be in an environment they can reproduce (so
 ordinary hardware, not infiniband) otherwise it gets ignored as an
 esoteric hardware issue.
 

It's really quite simple for anyone to demonstrate the bottleneck
locally on any machine using tcm_loop with raw block flash.  Take a
struct block_device backend (like a Fusion IO /dev/fio*) and using
IBLOCK and export locally accessible SCSI LUNs via tcm_loop..

Using FIO there is a significant drop in randrw 4k performance between
tcm_loop -> IBLOCK vs. raw struct block device backends.  And no, it's
not some type of target IBLOCK or tcm_loop bottleneck; it's a per SCSI
LUN limitation for small block random I/Os, on the order of ~75K for each
SCSI LUN.

If anyone has actually gone faster than this with any single SCSI
LUN on any storage fabric, I would be interested in hearing about your
setup.

Thanks,

--nab



Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-06 Thread Jason Wang

On 07/06/2012 02:38 PM, Stephen Hemminger wrote:

On Fri, 06 Jul 2012 11:20:06 +0800
Jason Wang jasow...@redhat.com  wrote:


On 07/05/2012 08:51 PM, Sasha Levin wrote:

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:

@@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
  if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
  vi->has_cvq = true;

+   /* Use single tx/rx queue pair as default */
+   vi->num_queue_pairs = 1;
+   vi->total_queue_pairs = num_queue_pairs;

The code is using this default even if the amount of queue pairs it
wants was specified during initialization. This basically limits any
device to use 1 pair when starting up.


Yes, currently the virtio-net driver would use 1 txq/rxq pair by default
since multiqueue may not outperform singlequeue in all kinds of workloads. So it's
better to keep that as the default and let the user enable multiqueue via ethtool -L.


I would prefer that the driver sized number of queues based on number
of online CPU's. That is what real hardware does. What kind of workload
are you doing? If it is some DBMS benchmark then maybe the issue is that
some CPU's need to be reserved.


I ran the rr and stream tests of netperf, and multiqueue shows improvement in 
the rr test and a regression for small packet transmission in the stream test. 
For small packet transmission, multiqueue tends to send many more small 
packets, which also increases the cpu utilization. I suspect multiqueue is 
faster and tcp does not get a chance to merge big enough packets before 
sending, but this needs more thought.
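For completeness, the guest-side knob mentioned earlier would presumably look like this (interface name and queue count are examples only):

```shell
# Show supported vs. currently enabled queue counts for the NIC
ethtool -l eth0

# Enable 4 combined (tx/rx pair) queues; with this series the driver
# default stays at 1 pair until something like this is run
ethtool -L eth0 combined 4
```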



[RFC V3 0/5] Multiqueue support for tap and virtio-net/vhost

2012-07-06 Thread Jason Wang
Hello all:

This series is an update of the last version of multiqueue support, adding
multiqueue capability to both tap and virtio-net.

Some kinds of tap backend already support multiqueue (macvtap in Linux) or will
(tap). In such tap backends, each file descriptor of a tap is a queue, and
ioctls are provided to attach an existing tap file descriptor to the
tun/tap device. So the patches let qemu use this kind of backend, and let it
transmit and receive packets through multiple file descriptors.

Patch 1 introduces a new helper to get all matched options; after this patch, we
could pass multiple file descriptors to a single netdev by:

  qemu -netdev tap,id=h0,queues=2,fd=10,fd=11 ...

Patch 2 introduces generic helpers in tap to attach or detach a file descriptor
to/from a tap device; emulated nics could use these helpers to enable/disable queues.

Patch 3 modifies NICState to allow multiple VLANClientState objects to be stored in
it; with this patch, qemu has basic support for multiqueue-capable tap backends.

Patch 4 implements a 1:1 mapping of tx/rx virtqueue pairs with the vhost_net backend.

Patch 5 converts virtio-net to a multiqueue device; after this patch, a multiqueue
virtio-net device could be specified by:

  qemu -netdev tap,id=h0,queues=2 -device virtio-net-pci,netdev=h0,queues=2

Performance numbers:

I posted them in the threads of the RFC of the multiqueue virtio-net driver:
http://www.spinics.net/lists/kvm/msg75386.html

Multiqueue with vhost shows improvement in TCP_RR, and degradation for small
packet transmission.

Changes from V2:
- split vhost patch from virtio-net
- add the support of queue number negotiation through control virtqueue
- hotplug, set_link and migration support
- bug fixes

Changes from V1:

- rebase to the latest
- fix memory leak in parse_netdev
- fix guest notifiers assignment/de-assignment
- changes the command lines to:
   qemu -netdev tap,queues=2 -device virtio-net-pci,queues=2

References:
- V2 http://www.spinics.net/lists/kvm/msg74588.html
- V1 http://comments.gmane.org/gmane.comp.emulators.qemu/100481

Jason Wang (5):
  option: introduce qemu_get_opt_all()
  tap: multiqueue support
  net: multiqueue support
  vhost: multiqueue support
  virtio-net: add multiqueue support

 hw/dp8393x.c |2 +-
 hw/mcf_fec.c |2 +-
 hw/qdev-properties.c |   34 +++-
 hw/qdev.h|3 +-
 hw/vhost.c   |   53 --
 hw/vhost.h   |2 +
 hw/vhost_net.c   |7 +-
 hw/vhost_net.h   |2 +-
 hw/virtio-net.c  |  505 ++
 hw/virtio-net.h  |   12 ++
 net.c|   83 +++--
 net.h|   16 ++-
 net/tap-aix.c|   13 ++-
 net/tap-bsd.c|   13 ++-
 net/tap-haiku.c  |   13 ++-
 net/tap-linux.c  |   56 ++-
 net/tap-linux.h  |4 +
 net/tap-solaris.c|   13 ++-
 net/tap-win32.c  |   11 +
 net/tap.c|  199 +---
 net/tap.h|7 +-
 qemu-option.c|   19 ++
 qemu-option.h|2 +
 23 files changed, 787 insertions(+), 284 deletions(-)



[RFC V3 1/5] option: introduce qemu_get_opt_all()

2012-07-06 Thread Jason Wang
Sometimes we need to pass options like -netdev tap,fd=100,fd=101,fd=102, which
can not be properly parsed by qemu_find_opt() because it only returns the first
matched option. So qemu_opt_get_all() is introduced to return an array of
pointers which contains all matched options.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 qemu-option.c |   19 +++
 qemu-option.h |2 ++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/qemu-option.c b/qemu-option.c
index bb3886c..9263125 100644
--- a/qemu-option.c
+++ b/qemu-option.c
@@ -545,6 +545,25 @@ static QemuOpt *qemu_opt_find(QemuOpts *opts, const char *name)
 return NULL;
 }
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+ int max)
+{
+QemuOpt *opt;
+int index = 0;
+
+    QTAILQ_FOREACH_REVERSE(opt, &opts->head, QemuOptHead, next) {
+        if (strcmp(opt->name, name) == 0) {
+            if (index < max) {
+                optp[index++] = opt->str;
+            }
+            if (index == max) {
+                break;
+            }
+        }
+    }
+    return index;
+}
+
 const char *qemu_opt_get(QemuOpts *opts, const char *name)
 {
 QemuOpt *opt = qemu_opt_find(opts, name);
diff --git a/qemu-option.h b/qemu-option.h
index 951dec3..3c9a273 100644
--- a/qemu-option.h
+++ b/qemu-option.h
@@ -106,6 +106,8 @@ struct QemuOptsList {
 QemuOptDesc desc[];
 };
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+ int max);
 const char *qemu_opt_get(QemuOpts *opts, const char *name);
 bool qemu_opt_get_bool(QemuOpts *opts, const char *name, bool defval);
 uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t defval);
-- 
1.7.1



[RFC V3 2/5] tap: multiqueue support

2012-07-06 Thread Jason Wang
Some operating systems (such as Linux) support multiqueue tap; this is done
through attaching multiple sockets to the net device and exposing multiple file
descriptors.

This patch lets qemu utilize this kind of backend, and introduces helpers for:

- creating a multiqueue-capable tap device
- increasing and decreasing the number of queues, by introducing helpers to attach
  or detach a file descriptor to/from the device

Then qemu can use this as the infrastructure for emulating a multiqueue-capable
network card.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net.c |4 +
 net/tap-aix.c |   13 +++-
 net/tap-bsd.c |   13 +++-
 net/tap-haiku.c   |   13 +++-
 net/tap-linux.c   |   56 +++-
 net/tap-linux.h   |4 +
 net/tap-solaris.c |   13 +++-
 net/tap-win32.c   |   11 +++
 net/tap.c |  197 ++---
 net/tap.h |7 ++-
 10 files changed, 253 insertions(+), 78 deletions(-)

diff --git a/net.c b/net.c
index 4aa416c..eabe830 100644
--- a/net.c
+++ b/net.c
@@ -978,6 +978,10 @@ static const struct {
 .name = "vhostforce",
 .type = QEMU_OPT_BOOL,
 .help = "force vhost on for non-MSIX virtio guests",
+}, {
+.name = "queues",
+.type = QEMU_OPT_NUMBER,
+.help = "number of queues the backend can provide",
 },
 #endif /* _WIN32 */
 { /* end of list */ }
diff --git a/net/tap-aix.c b/net/tap-aix.c
index e19aaba..f111e0f 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include net/tap.h
 #include stdio.h
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+ int vnet_hdr_required, int attach)
 {
 fprintf(stderr, "no tap on AIX\n");
 return -1;
@@ -59,3 +60,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+return -1;
+}
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 937a94b..44f3421 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -33,7 +33,8 @@
 #include net/if_tap.h
 #endif
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+ int vnet_hdr_required, int attach)
 {
 int fd;
 #ifdef TAPGIFNAME
@@ -145,3 +146,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+return -1;
+}
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 91dda8e..6fb6719 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -25,7 +25,8 @@
 #include net/tap.h
 #include stdio.h
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+ int vnet_hdr_required, int attach)
 {
 fprintf(stderr, "no tap on Haiku\n");
 return -1;
@@ -59,3 +60,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+return -1;
+}
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 41d581b..ed0afe9 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -35,7 +35,8 @@
 
 #define PATH_NET_TUN "/dev/net/tun"
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+ int vnet_hdr_required, int attach)
 {
 struct ifreq ifr;
 int fd, ret;
@@ -47,6 +48,8 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
 }
 memset(&ifr, 0, sizeof(ifr));
 ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+if (!attach)
+ifr.ifr_flags |= IFF_MULTI_QUEUE;
 
 if (*vnet_hdr) {
 unsigned int features;
@@ -71,7 +74,11 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
 pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
 else
 pstrcpy(ifr.ifr_name, IFNAMSIZ, "tap%d");
-ret = ioctl(fd, TUNSETIFF, (void *) &ifr);
+if (attach) {
+ifr.ifr_flags |= IFF_ATTACH_QUEUE;
+ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+} else
+ret = ioctl(fd, TUNSETIFF, (void *) &ifr);
 if (ret != 0) {
 if (ifname[0] != '\0') {
 error_report("could not configure %s (%s): %m", PATH_NET_TUN, ifr.ifr_name);
@@ -197,3 +204,48 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 }
 }
 }
+
+/* Attach a file descriptor to a TUN/TAP device. This 

[RFC V3 3/5] net: multiqueue support

2012-07-06 Thread Jason Wang
This patch adds multiqueue support for emulated nics. Each VLANClientState
pair is now abstracted as a queue instead of a nic, and multiple VLANClientState
pointers are stored in the NICState. A queue_index was also introduced to let
the emulated nics know which queue a packet came from or was sent
out on. Virtio-net would be the first user.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/dp8393x.c |2 +-
 hw/mcf_fec.c |2 +-
 hw/qdev-properties.c |   34 +
 hw/qdev.h|3 +-
 net.c|   79 +-
 net.h|   16 +++--
 net/tap.c|2 +-
 7 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index 017d074..483a868 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, target_phys_addr_t base, int it_shift,
 
 s->conf.macaddr = nd->macaddr;
 s->conf.vlan = nd->vlan;
-s->conf.peer = nd->netdev;
+s->conf.peers[0] = nd->netdev;
 
 s->nic = qemu_new_nic(net_dp83932_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c
index ae37bef..69f508d 100644
--- a/hw/mcf_fec.c
+++ b/hw/mcf_fec.c
@@ -473,7 +473,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd,
 
 s->conf.macaddr = nd->macaddr;
 s->conf.vlan = nd->vlan;
-s->conf.peer = nd->netdev;
+s->conf.peers[0] = nd->netdev;
 
 s->nic = qemu_new_nic(net_mcf_fec_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/qdev-properties.c b/hw/qdev-properties.c
index 9ae3187..88e97e9 100644
--- a/hw/qdev-properties.c
+++ b/hw/qdev-properties.c
@@ -554,16 +554,38 @@ PropertyInfo qdev_prop_chr = {
 
 static int parse_netdev(DeviceState *dev, const char *str, void **ptr)
 {
-VLANClientState *netdev = qemu_find_netdev(str);
+VLANClientState ***nc = (VLANClientState ***)ptr;
+VLANClientState *vcs[MAX_QUEUE_NUM];
+int queues, i = 0;
+int ret;
 
-if (netdev == NULL) {
-return -ENOENT;
+*nc = g_malloc(MAX_QUEUE_NUM * sizeof(VLANClientState *));
+queues = qemu_find_netdev_all(str, vcs, MAX_QUEUE_NUM);
+if (queues == 0) {
+ret = -ENOENT;
+goto err;
 }
-if (netdev->peer) {
-return -EEXIST;
+
+for (i = 0; i < queues; i++) {
+if (vcs[i] == NULL) {
+ret = -ENOENT;
+goto err;
+}
+
+if (vcs[i]->peer) {
+ret = -EEXIST;
+goto err;
+}
+
+(*nc)[i] = vcs[i];
+vcs[i]->queue_index = i;
 }
-*ptr = netdev;
+
 return 0;
+
+err:
+g_free(*nc);
+return ret;
 }
 
 static const char *print_netdev(void *ptr)
diff --git a/hw/qdev.h b/hw/qdev.h
index 5386b16..1c023b4 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -248,6 +248,7 @@ extern PropertyInfo qdev_prop_blocksize;
 .defval= (bool)_defval,  \
 }
 
+
 #define DEFINE_PROP_UINT8(_n, _s, _f, _d)   \
 DEFINE_PROP_DEFAULT(_n, _s, _f, _d, qdev_prop_uint8, uint8_t)
 #define DEFINE_PROP_UINT16(_n, _s, _f, _d)  \
@@ -274,7 +275,7 @@ extern PropertyInfo qdev_prop_blocksize;
 #define DEFINE_PROP_STRING(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_string, char*)
 #define DEFINE_PROP_NETDEV(_n, _s, _f) \
-DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, VLANClientState*)
+DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, VLANClientState**)
 #define DEFINE_PROP_VLAN(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_vlan, VLANState*)
 #define DEFINE_PROP_DRIVE(_n, _s, _f) \
diff --git a/net.c b/net.c
index eabe830..f5db537 100644
--- a/net.c
+++ b/net.c
@@ -238,16 +238,33 @@ NICState *qemu_new_nic(NetClientInfo *info,
 {
 VLANClientState *nc;
 NICState *nic;
+int i;
 
 assert(info->type == NET_CLIENT_TYPE_NIC);
 assert(info->size >= sizeof(NICState));
 
-nc = qemu_new_net_client(info, conf->vlan, conf->peer, model, name);
+if (conf->peers) {
+nc = qemu_new_net_client(info, NULL, conf->peers[0], model, name);
+} else {
+nc = qemu_new_net_client(info, conf->vlan, NULL, model, name);
+}
 
 nic = DO_UPCAST(NICState, nc, nc);
 nic->conf = conf;
 nic->opaque = opaque;
 
+/* For compatibility with single queue nic */
+nic->ncs[0] = nc;
+nc->opaque = nic;
+
+for (i = 1; i < conf->queues; i++) {
+VLANClientState *vc = qemu_new_net_client(info, NULL, conf->peers[i],
+  model, name);
+vc->opaque = nic;
+nic->ncs[i] = vc;
+vc->queue_index = i;
+}
+
 return nic;
 }
 
@@ -283,11 +300,10 @@ void qemu_del_vlan_client(VLANClientState *vc)
 {
 /* If there is a peer NIC, delete and cleanup client, but do not free. */
 if (!vc->vlan && vc->peer && vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
-NICState *nic = DO_UPCAST(NICState, nc, 

[RFC V3 4/5] vhost: multiqueue support

2012-07-06 Thread Jason Wang
This patch converts vhost to support multiple queues. It implements a 1:1
mapping of vhost devs and tap fds. That is to say, the patch creates and uses
N vhost devs as the backend of an N-queue virtio-net device.

The main work is to convert the virtqueue index into a vhost queue index; this is
done by introducing a vq_index field in the vhost_dev struct to record the index of
the first virtqueue that is used by the vhost dev. Then vhost could simply convert
it to the local vhost queue index and issue ioctls.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/vhost.c  |   53 ++---
 hw/vhost.h  |2 ++
 hw/vhost_net.c  |7 +--
 hw/vhost_net.h  |2 +-
 hw/virtio-net.c |2 +-
 5 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 43664e7..3eb6037 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -620,11 +620,12 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 {
 target_phys_addr_t s, l, a;
 int r;
+int vhost_vq_index = (idx > 2 ? idx - 1 : idx) % dev->nvqs;
 struct vhost_vring_file file = {
-.index = idx,
+.index = vhost_vq_index
 };
 struct vhost_vring_state state = {
-.index = idx,
+.index = vhost_vq_index
 };
 struct VirtQueue *vvq = virtio_get_queue(vdev, idx);
 
@@ -670,11 +671,12 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 goto fail_alloc_ring;
 }
 
-r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
 if (r < 0) {
 r = -errno;
 goto fail_alloc;
 }
+
 file.fd = event_notifier_get_fd(virtio_queue_get_host_notifier(vvq));
 r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
 if (r) {
@@ -715,7 +717,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
 unsigned idx)
 {
 struct vhost_vring_state state = {
-.index = idx,
+.index = (idx > 2 ? idx - 1 : idx) % dev->nvqs,
 };
 int r;
 r = ioctl(dev->control, VHOST_GET_VRING_BASE, &state);
@@ -829,7 +831,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
 }
 
 for (i = 0; i  hdev-nvqs; ++i) {
-r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, true);
+r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+hdev->vq_index + i,
+true);
 if (r < 0) {
 fprintf(stderr, "vhost VQ %d notifier binding failed: %d\n", i, -r);
 goto fail_vq;
@@ -839,7 +843,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
 return 0;
 fail_vq:
 while (--i >= 0) {
-r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+hdev->vq_index + i,
+false);
 if (r < 0) {
 fprintf(stderr, "vhost VQ %d notifier cleanup error: %d\n", i, -r);
 fflush(stderr);
@@ -860,7 +866,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
 int i, r;
 
 for (i = 0; i < hdev->nvqs; ++i) {
-r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+ hdev->vq_index + i,
+ false);
 if (r < 0) {
 fprintf(stderr, "vhost VQ %d notifier cleanup failed: %d\n", i, -r);
 fflush(stderr);
@@ -879,10 +887,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 goto fail;
 }
 
-r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
-if (r < 0) {
-fprintf(stderr, "Error binding guest notifier: %d\n", -r);
-goto fail_notifiers;
+if (hdev->vq_index == 0) {
+r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+if (r < 0) {
+fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+goto fail_notifiers;
+}
 }
 
 r = vhost_dev_set_features(hdev, hdev-log_enabled);
@@ -898,7 +908,7 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 r = vhost_virtqueue_init(hdev,
  vdev,
  hdev-vqs + i,
- i);
+ hdev->vq_index + i);
 if (r  0) {
 goto fail_vq;
 }
@@ -925,8 +935,9 @@ fail_vq:
 vhost_virtqueue_cleanup(hdev,
 vdev,
 hdev-vqs + i,
-i);
+hdev->vq_index + i);
 }
+   

[RFC V3 5/5] virtio-net: add multiqueue support

2012-07-06 Thread Jason Wang
Based on the multiqueue support for taps and NICState, this patch adds
multiqueue capability to virtio-net. For userspace virtio-net emulation,
each pair of VLANClientState peers is abstracted as a tx/rx queue pair. For
vhost, one vhost net device is created per virtio-net tx/rx queue pair, so
when multiqueue is enabled, N vhost devices/threads are created for an
N-queue virtio-net device.

Since the guest may not want to use all the queues that qemu provides (one
example is an old guest without multiqueue support), the files are
attached/detached on demand when the guest sets the virtio_net status.

This feature is negotiated through VIRTIO_NET_F_MULTIQUEUE. A new property
"queues" is added to the virtio-net device to specify the number of queues
it supports. With this patch a virtio-net backend with N queues can be
created by:

qemu -netdev tap,id=hn0,queues=2 -device virtio-net-pci,netdev=hn0,queues=2

To let the user tweak performance, the guest can negotiate the number of
queues it wishes to use through the control virtqueue.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c |  505 ++-
 hw/virtio-net.h |   12 ++
 2 files changed, 361 insertions(+), 156 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 30eb4f4..afe69e8 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -21,39 +21,48 @@
 #include virtio-net.h
 #include vhost_net.h
 
-#define VIRTIO_NET_VM_VERSION    11
+#define VIRTIO_NET_VM_VERSION    12
 
 #define MAC_TABLE_ENTRIES    64
 #define MAX_VLAN    (1 << 12)   /* Per 802.1Q definition */
 
-typedef struct VirtIONet
+struct VirtIONet;
+
+typedef struct VirtIONetQueue
 {
-VirtIODevice vdev;
-uint8_t mac[ETH_ALEN];
-uint16_t status;
 VirtQueue *rx_vq;
 VirtQueue *tx_vq;
-VirtQueue *ctrl_vq;
-NICState *nic;
 QEMUTimer *tx_timer;
 QEMUBH *tx_bh;
 uint32_t tx_timeout;
-int32_t tx_burst;
 int tx_waiting;
-uint32_t has_vnet_hdr;
-uint8_t has_ufo;
 struct {
 VirtQueueElement elem;
 ssize_t len;
 } async_tx;
+struct VirtIONet *n;
+uint8_t vhost_started;
+} VirtIONetQueue;
+
+typedef struct VirtIONet
+{
+VirtIODevice vdev;
+uint8_t mac[ETH_ALEN];
+uint16_t status;
+VirtIONetQueue vqs[MAX_QUEUE_NUM];
+VirtQueue *ctrl_vq;
+NICState *nic;
+int32_t tx_burst;
+uint32_t has_vnet_hdr;
+uint8_t has_ufo;
 int mergeable_rx_bufs;
+int multiqueue;
 uint8_t promisc;
 uint8_t allmulti;
 uint8_t alluni;
 uint8_t nomulti;
 uint8_t nouni;
 uint8_t nobcast;
-uint8_t vhost_started;
 struct {
 int in_use;
 int first_multi;
@@ -63,6 +72,8 @@ typedef struct VirtIONet
 } mac_table;
 uint32_t *vlans;
 DeviceState *qdev;
+uint16_t queues;
+uint16_t real_queues;
 } VirtIONet;
 
 /* TODO
@@ -74,12 +85,25 @@ static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 return (VirtIONet *)vdev;
 }
 
+static int vq_get_pair_index(VirtIONet *n, VirtQueue *vq)
+{
+int i;
+for (i = 0; i < n->queues; i++) {
+if (n->vqs[i].tx_vq == vq || n->vqs[i].rx_vq == vq) {
+return i;
+}
+}
+assert(0);
+return -1;
+}
+
 static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIONet *n = to_virtio_net(vdev);
 struct virtio_net_config netcfg;
 
 stw_p(&netcfg.status, n->status);
+netcfg.queues = n->queues * 2;
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -103,78 +127,146 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
 (n->status & VIRTIO_NET_S_LINK_UP) && n->vdev.vm_running;
 }
 
-static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
+static void virtio_net_vhost_status(VLANClientState *nc, VirtIONet *n,
+uint8_t status)
 {
-if (!n->nic->nc.peer) {
+int queue_index = nc->queue_index;
+VLANClientState *peer = nc->peer;
+VirtIONetQueue *netq = &n->vqs[nc->queue_index];
+
+if (!peer) {
 return;
 }
-if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+if (peer->info->type != NET_CLIENT_TYPE_TAP) {
 return;
 }
 
-if (!tap_get_vhost_net(n->nic->nc.peer)) {
+if (!tap_get_vhost_net(peer)) {
 return;
 }
-if (!!n->vhost_started == virtio_net_started(n, status) &&
-  !n->nic->nc.peer->link_down) {
+if (!!netq->vhost_started == virtio_net_started(n, status) &&
+ !peer->link_down) {
 return;
 }
-if (!n->vhost_started) {
-int r;
-if (!vhost_net_query(tap_get_vhost_net(n->nic->nc.peer), &n->vdev)) {
+if (!netq->vhost_started) {
+int r;
+if (!vhost_net_query(tap_get_vhost_net(peer), &n->vdev)) {
 return;
 }
-r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev, 0);
+
+r = 

Re: [net-next RFC V5 0/5] Multiqueue virtio-net

2012-07-06 Thread Rick Jones

On 07/06/2012 12:42 AM, Jason Wang wrote:

I'm no expert on TCP, but the changes look reasonable:
- we can do the full-sized TSO check in tcp_tso_should_defer() only for
westwood, according to tcp westwood
- run tcp_tso_should_defer for tso_segs = 1 when tso is enabled.


I'm sure Eric and David will weigh in on the TCP change.  My initial 
inclination would have been to say: well, if multiqueue is draining 
faster, that means ACKs come back faster, which means the race between 
more data being queued by netperf and ACKs will go more to the ACKs, 
which means the segments being sent will be smaller.  As TCP_NODELAY is 
not set, the Nagle algorithm is in force, which means once there is data 
outstanding on the connection, no more will be sent until either the 
outstanding data is ACKed, or there is an accumulation of > MSS worth of 
data to send.



Also, how are you combining the concurrent netperf results?  Are you
taking sums of what netperf reports, or are you gathering statistics
outside of netperf?



The throughputs were just summed from the netperf results, as the netperf
manual suggests. The CPU utilization was measured by mpstat.


Which mechanism to address skew error?  The netperf manual describes 
more than one:


http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

Personally, my preference these days is to use the demo mode method of 
aggregate results as it can be rather faster than (ab)using the 
confidence intervals mechanism, which I suspect may not really scale all 
that well to large numbers of concurrent netperfs.


I also tend to use the --enable-burst configure option to allow me to 
minimize the number of concurrent netperfs in the first place.  Set 
TCP_NODELAY (the test-specific -D option) and then have several 
transactions outstanding at one time (test-specific -b option with a 
number of additional in-flight transactions).


This is expressed in the runemomniaggdemo.sh script:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh

which uses the find_max_burst.sh script:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh

to pick the burst size to use in the concurrent netperfs, the results of 
which can be post-processed with:


http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py

The nice feature of using the demo mode mechanism is that, when it is 
coupled with systems with reasonably synchronized clocks (e.g. NTP), it can 
be used for many-to-many testing in addition to one-to-many testing 
(which cannot be dealt with by the confidence interval method of dealing 
with skew error).



A single instance TCP_RR test would help confirm/refute any
non-trivial change in (effective) path length between the two cases.



Yes, I would test this thanks.


Excellent.

happy benchmarking,

rick jones

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Nicholas A. Bellinger
On Fri, 2012-07-06 at 17:49 +0400, James Bottomley wrote:
 On Fri, 2012-07-06 at 02:13 -0700, Nicholas A. Bellinger wrote:
  On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
   On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
   

SNIP

This bottleneck has been mentioned by various people (including myself)
on linux-scsi the last 18 months, and I've proposed that that it be
discussed at KS-2012 so we can start making some forward progress:
   
   Well, no, it hasn't.  You randomly drop things like this into unrelated
   email (I suppose that is a mention in strict English construction) but
   it's not really enough to get anyone to pay attention since they mostly
   stopped reading at the top, if they got that far: most people just go by
   subject when wading through threads initially.
   
  
  It most certainly has been made clear to me, numerous times from many
  people in the Linux/SCSI community that there is a bottleneck for small
  block random I/O in SCSI core vs. raw Linux/Block, as well as vs. non
  Linux based SCSI subsystems.
  
  My apologies if mentioning this issue last year at LC 2011 to you
  privately did not take a tone of a more serious nature, or that
  proposing a topic for LSF-2012 this year was not a clear enough
  indication of a problem with SCSI small block random I/O performance.
  
   But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
   kernel, which is now nearly three years old) is 25% slower than W2k8R2
   on infiniband isn't really going to get anyone excited either
   (particularly when you mention OFED, which usually means a stack
   replacement on Linux anyway).
   
  
  The specific issue was first raised for .38, where we were able to get
  most of the interesting high performance LLDs converted to using
  internal locking methods so that host_lock did not have to be obtained
  during each ->queuecommand() I/O dispatch, right..?
  
  This has helped a good deal for large multi-lun scsi_host configs that
  are now running in host-lock less mode, but there is still a large
  discrepancy single LUN vs. raw struct block_device access even with LLD
  host_lock less mode enabled.
  
  Now I think the virtio-blk client performance is demonstrating this
  issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
  flash benchmarks that is demonstrate some other yet-to-be determined
  limitations for virtio-scsi-raw vs. tcm_vhost for this particular fio
  randrw workload.
  
   What people might pay attention to is evidence that there's a problem in
   3.5-rc6 (without any OFED crap).  If you're not going to bother
   investigating, it has to be in an environment they can reproduce (so
   ordinary hardware, not infiniband) otherwise it gets ignored as an
   esoteric hardware issue.
   
  
  It's really quite simple for anyone to demonstrate the bottleneck
  locally on any machine using tcm_loop with raw block flash.  Take a
  struct block_device backend (like a Fusion IO /dev/fio*) and using
  IBLOCK and export locally accessible SCSI LUNs via tcm_loop..
  
  Using FIO there is a significant drop for randrw 4k performance between
  tcm_loop - IBLOCK vs. raw struct block device backends.  And no, it's
  not some type of target IBLOCK or tcm_loop bottleneck, it's a per SCSI
  LUN limitation for small block random I/Os on the order of ~75K for each
  SCSI LUN.
 
 Here, you're saying here that the end to end SCSI stack tops out at
 around 75k iops, which is reasonably respectable if you don't employ any
 mitigation like queue steering and interrupt polling ... what were the
 mitigation techniques in the test you employed by the way?
 

~75K per SCSI LUN in a multi-lun per host setup is being optimistic btw.
On the other side of the coin, the same pure block device can easily go
~200K per backend.

For the simplest case with tcm_loop, a struct scsi_cmnd is queued via
cmwq to execute in process context and submit the backend I/O.  Once
completed from IBLOCK, the I/O is run through a target completion wq, and
completed back to SCSI.

There is no fancy queue steering or interrupt polling going on (at least
not in tcm_loop) because it's a simple virtual SCSI LLD similar to
scsi_debug.
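
For anyone wanting to reproduce the tcm_loop vs. raw block comparison
described above, a sketch of an fio job along these lines should show the
gap (the device path is a placeholder; run it once against the raw backend,
e.g. a Fusion IO /dev/fio*, and once against the tcm_loop-exported LUN, and
compare the reported IOPS):

```ini
; Hypothetical fio job for a 4k random read/write comparison.
[global]
ioengine=libaio
direct=1
rw=randrw
bs=4k
iodepth=32
numjobs=4
runtime=60
time_based
group_reporting

[randrw-4k]
; replace with the raw device or the tcm_loop SCSI LUN under test
filename=/dev/sdX
```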

 But previously, you ascribed a performance drop of around 75% on
 virtio-scsi (topping out around 15-20k iops) to this same problem ...
 that doesn't really seem likely.
 

No.  I ascribed the performance difference between virtio-scsi+tcm_vhost
vs. bare-metal raw block flash to this bottleneck in Linux/SCSI.

It's obvious that virtio-scsi-raw going through QEMU SCSI / block is
having some other shortcomings.

 Here's the rough ranges of concern:
 
 10K iops: standard arrays
 100K iops: modern expensive fast flash drives on 6Gb links
 1M iops: PCIe NVMexpress like devices
 
 SCSI should do arrays with no problem at all, so I'd be really concerned
 that it can't make 0-20k iops.  If you push the system and fine tune it,
 SCSI can just about 

[PATCH] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning

2012-07-06 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch changes virtio-scsi to use a new virtio_driver->scan() callback
so that scsi_scan_host() can be properly invoked once virtio_dev_probe() has
set add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK) to signal active virtio-ring
operation, instead of from within virtscsi_probe().

This fixes a bug where SCSI LUN scanning for both virtio-scsi-raw and
virtio-scsi/tcm_vhost setups was happening before VIRTIO_CONFIG_S_DRIVER_OK
had been set, causing VIRTIO_SCSI_S_BAD_TARGET to occur.  This fixes a bug
with virtio-scsi/tcm_vhost where LUN scan was not detecting LUNs.

Tested with virtio-scsi-raw + virtio-scsi/tcm_vhost w/ IBLOCK on 3.5-rc2 code.

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
Cc: Zhi Yong Wu wu...@cn.ibm.com
Cc: Christoph Hellwig h...@lst.de
Cc: Hannes Reinecke h...@suse.de
Cc: sta...@vger.kernel.org
Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org
---
 drivers/scsi/virtio_scsi.c |   15 ---
 drivers/virtio/virtio.c|5 -
 include/linux/virtio.h |1 +
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 1b38431..391b30d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -481,9 +481,10 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 err = scsi_add_host(shost, &vdev->dev);
 if (err)
 goto scsi_add_host_failed;
-
-   scsi_scan_host(shost);
-
+   /*
+* scsi_scan_host() happens in virtscsi_scan() via virtio_driver->scan()
+* after VIRTIO_CONFIG_S_DRIVER_OK has been set..
+*/
return 0;
 
 scsi_add_host_failed:
@@ -493,6 +494,13 @@ virtscsi_init_failed:
return err;
 }
 
+static void virtscsi_scan(struct virtio_device *vdev)
+{
+   struct Scsi_Host *shost = (struct Scsi_Host *)vdev->priv;
+
+   scsi_scan_host(shost);
+}
+
 static void virtscsi_remove_vqs(struct virtio_device *vdev)
 {
/* Stop all the virtqueues. */
@@ -537,6 +545,7 @@ static struct virtio_driver virtio_scsi_driver = {
.driver.owner = THIS_MODULE,
.id_table = id_table,
.probe = virtscsi_probe,
+   .scan = virtscsi_scan,
 #ifdef CONFIG_PM
.freeze = virtscsi_freeze,
.restore = virtscsi_restore,
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f355807..c3b3f7f 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -141,8 +141,11 @@ static int virtio_dev_probe(struct device *_d)
 err = drv->probe(dev);
if (err)
add_status(dev, VIRTIO_CONFIG_S_FAILED);
-   else
+   else {
add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
+   if (drv->scan)
+   drv->scan(dev);
+   }
 
return err;
 }
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..a1ba8bb 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -92,6 +92,7 @@ struct virtio_driver {
const unsigned int *feature_table;
unsigned int feature_table_size;
int (*probe)(struct virtio_device *dev);
+   void (*scan)(struct virtio_device *dev);
void (*remove)(struct virtio_device *dev);
void (*config_changed)(struct virtio_device *dev);
 #ifdef CONFIG_PM
-- 
1.7.2.5



Re: [PATCH] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning

2012-07-06 Thread Nicholas A. Bellinger
Hi James,

Please consider picking this one up for your next scsi-rc-fixes PULL,
and it's CC'ed to stable following Paolo's request.

Thank you,

--nab




Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Nicholas A. Bellinger
On Fri, 2012-07-06 at 15:30 -0500, Christoph Lameter wrote:
 On Fri, 6 Jul 2012, James Bottomley wrote:
 
  What people might pay attention to is evidence that there's a problem in
  3.5-rc6 (without any OFED crap).  If you're not going to bother
  investigating, it has to be in an environment they can reproduce (so
  ordinary hardware, not infiniband) otherwise it gets ignored as an
  esoteric hardware issue.
 
 The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been
 supported for a long time and it's a very important technology given the
 problematic nature of ethernet at high network speeds.
 
 OFED crap exists for those running RHEL5/6. The new enterprise distros are
 based on the 3.2 kernel which has pretty good Infiniband support
 out of the box.
 

So I don't think the HCAs or Infiniband fabric was the limiting factor
for small block random I/O in the RHEL 6.2 w/ OFED vs. Windows Server
2008 R2 w/ OFED setup mentioned earlier.

I've seen both FC and iSCSI fabrics demonstrate the same type of random
small block I/O performance anomalies with Linux/SCSI clients too.  The
v3.x Linux/SCSI clients are certainly better in the multi-lun per host
small block random I/O case, but single LUN performance is (still)
lacking compared to everything else.

Also RHEL 6.2 does have the scsi-host-lock less bits in place now, but
it's been more a matter of converting OFED ib_srp code to run in
host-lock less mode to realize extra gains for multi-lun per host.
