Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 09:25:34 PM:

  Sure, will get a build/test on latest bits and send in 1-2 days.
 
The TX-only patch helped the guest TX path but didn't help
host->guest much (as tested using TCP_MAERTS from the guest).
But with the TX+RX patch, both directions are getting
improvements.
  
   Also, my hope is that with appropriate queue mapping,
   we might be able to do away with heuristics to detect
   single stream load that TX only code needs.
 
  Yes, that whole stuff is removed, and the TX/RX path is
  unchanged with this patch (thankfully :)

 Cool. I was wondering whether in that case, we can
 do without host kernel changes at all,
 and use a separate fd for each TX/RX pair.
 The advantage of that approach is that this way,
 the max fd limit naturally sets an upper bound
 on the amount of resources userspace can use up.

 Thoughts?

 In any case, pls don't let the above delay
 sending an RFC.

I will look into this also.

Please excuse the delay in sending the patch out - my
bits are a little old, so it is taking some time to move to
the latest kernel and get some initial TCP/UDP test results.
I should have it ready by tomorrow.

Thanks,

- KK

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-23 Thread Michael S. Tsirkin
On Wed, Feb 23, 2011 at 12:18:36PM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM:
 
 Hi Michael,
 
   Yes. Michael Tsirkin had wanted to see how the MQ RX patch
   would look like, so I was in the process of getting the two
   working together. The patch is ready and is being tested.
   Should I send an RFC patch at this time?
 
  Yes, please do.
 
 Sure, will get a build/test on latest bits and send in 1-2 days.
 
   The TX-only patch helped the guest TX path but didn't help
   host->guest much (as tested using TCP_MAERTS from the guest).
   But with the TX+RX patch, both directions are getting
   improvements.
 
  Also, my hope is that with appropriate queue mapping,
  we might be able to do away with heuristics to detect
  single stream load that TX only code needs.
 
 Yes, that whole stuff is removed, and the TX/RX path is
 unchanged with this patch (thankfully :)

Cool. I was wondering whether in that case, we can
do without host kernel changes at all,
and use a separate fd for each TX/RX pair.
The advantage of that approach is that this way,
the max fd limit naturally sets an upper bound
on the amount of resources userspace can use up.

Thoughts?

In any case, pls don't let the above delay
sending an RFC.

   Remote testing is still to be done.
 
  Others might be able to help here once you post the patch.
 
 That's great, will appreciate any help.
 
 Thanks,
 
 - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-23 Thread Simon Horman
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
 Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:
 
 Hi Simon,
 
 
  I have a few questions about the results below:
 
  1. Are the (%) comparisons between non-mq and mq virtio?
 
 Yes - mainline kernel with transmit-only MQ patch.
 
  2. Was UDP or TCP used?
 
 TCP. I had done some initial testing on UDP, but don't have
 the results now as they are really old. But I will be running
 it again.
 
  3. What was the transmit size (-m option to netperf)?
 
 I didn't use the -m option, so it defaults to 16K. The
 script does:
 
 netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
 
  Also, I'm interested to know what the status of these patches is.
  Are you planning a fresh series?
 
 Yes. Michael Tsirkin had wanted to see how the MQ RX patch
 would look like, so I was in the process of getting the two
 working together. The patch is ready and is being tested.
 Should I send an RFC patch at this time?
 
 The TX-only patch helped the guest TX path but didn't help
 host->guest much (as tested using TCP_MAERTS from the guest).
 But with the TX+RX patch, both directions are getting
 improvements. Remote testing is still to be done.

Hi Krishna,

thanks for clarifying the test results.
I'm looking forward to the forthcoming RFC patches.


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Simon Horman
On Wed, Oct 20, 2010 at 02:24:52PM +0530, Krishna Kumar wrote:
 The following set of patches implements transmit MQ in virtio-net.  Also
 included are the userspace qemu changes.  MQ is disabled by default unless
 qemu specifies it.

Hi Krishna,

I have a few questions about the results below:

1. Are the (%) comparisons between non-mq and mq virtio?
2. Was UDP or TCP used?
3. What was the transmit size (-m option to netperf)?

Also, I'm interested to know what the status of these patches is.
Are you planning a fresh series?

 
   Changes from rev2:
   --
 1. Define (in virtio_net.h) the maximum send txqs; and use in
virtio-net and vhost-net.
 2. vi->sq[i] is allocated individually, resulting in cache line
aligned sq[0] to sq[n].  Another option was to define
'send_queue' as:
struct send_queue {
struct virtqueue *svq;
struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
} cacheline_aligned_in_smp;
and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
the submitted method is preferable.
 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
handles TX[0-n].
 4. Further change TX handling such that vhost[0] handles both RX/TX
for single stream case.
 
   Enabling MQ on virtio:
   ---
 When the following options are passed to qemu:
 - smp > 1
 - vhost=on
 - mq=on (new option, default: off)
 then #txqueues = #cpus.  The #txqueues can be changed by using an
 optional 'numtxqs' option.  e.g. for a smp=4 guest:
 vhost=on   ->  #txqueues = 1
 vhost=on,mq=on ->  #txqueues = 4
 vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
 vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
 
 
Performance (guest -> local host):
---
 System configuration:
 Host:  8 Intel Xeon, 8 GB memory
 Guest: 4 cpus, 2 GB memory
 Test: Each test case runs for 60 secs, sum over three runs (except
 when number of netperf sessions is 1, which has 10 runs of 12 secs
 each).  No tuning (default netperf) other than tasksetting the vhost threads to
 cpus 0-3.  numtxqs=32 gave the best results though the guest had
 only 4 vcpus (I haven't tried beyond that).
 
 __ numtxqs=2, vhosts=3  
 #sessions  BW%      CPU%     RCPU%    SD%      RSD%

 1    4.46    -1.96    0.19   -12.50    -6.06
 2    4.93    -1.16    2.10     0       -2.38
 4   46.17    64.77   33.72    19.51    -2.48
 8   47.89    70.00   36.23    41.46    13.35
 16  48.97    80.44   40.67    21.11    -5.46
 24  49.03    78.78   41.22    20.51    -4.78
 32  51.11    77.15   42.42    15.81    -6.87
 40  51.60    71.65   42.43     9.75    -8.94
 48  50.10    69.55   42.85    11.80    -5.81
 64  46.24    68.42   42.67    14.18    -3.28
 80  46.37    63.13   41.62     7.43    -6.73
 96  46.40    63.31   42.20     9.36    -4.78
 128 50.43    62.79   42.16    13.11    -1.23
 
 BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
 
 __ numtxqs=8, vhosts=5  
 #sessions   BW%  CPU% RCPU% SD%  RSD%
 
 1   -0.76   -1.56    2.33    0       3.03
 2   17.41   11.11   11.41    0      -4.76
 4   42.12   55.11   30.20   19.51    0.62
 8   54.69   80.00   39.22   24.39   -3.88
 16  54.77   81.62   40.89   20.34   -6.58
 24  54.66   79.68   41.57   15.49   -8.99
 32  54.92   76.82   41.79   17.59   -5.70
 40  51.79   68.56   40.53   15.31   -3.87
 48  51.72   66.40   40.84    9.72   -7.13
 64  51.11   63.94   41.10    5.93   -8.82
 80  46.51   59.50   39.80    9.33   -4.18
 96  47.72   57.75   39.84    4.20   -7.62
 128 54.35   58.95   40.66    3.24   -8.63
 
 BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
 
 __ numtxqs=16, vhosts=5  ___
 #sessions   BW%  CPU% RCPU% SD%  RSD%
 
 1   -1.43   -3.52    1.55    0       3.03
 2   33.09   21.63   20.12  -10.00   -9.52
 4   67.17   94.60   44.28   19.51  -11.80
 8   75.72  108.14   49.15   25.00  -10.71
 16  80.34  101.77   52.94   25.93   -4.49
 24  70.84   93.12   43.62   27.63   -5.03
 32  69.01   94.16   47.33   29.68   -1.51
 40  58.56   63.47   25.91   -3.92  -25.85
 48 

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Krishna Kumar2
Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:

Hi Simon,


 I have a few questions about the results below:

 1. Are the (%) comparisons between non-mq and mq virtio?

Yes - mainline kernel with transmit-only MQ patch.

 2. Was UDP or TCP used?

TCP. I had done some initial testing on UDP, but don't have
the results now as they are really old. But I will be running
it again.

 3. What was the transmit size (-m option to netperf)?

I didn't use the -m option, so it defaults to 16K. The
script does:

netperf -t TCP_STREAM -c -C -l 60 -H $SERVER

 Also, I'm interested to know what the status of these patches is.
 Are you planning a fresh series?

Yes. Michael Tsirkin had wanted to see how the MQ RX patch
would look like, so I was in the process of getting the two
working together. The patch is ready and is being tested.
Should I send an RFC patch at this time?

The TX-only patch helped the guest TX path but didn't help
host->guest much (as tested using TCP_MAERTS from the guest).
But with the TX+RX patch, both directions are getting
improvements. Remote testing is still to be done.

Thanks,

- KK

Changes from rev2:
--
  1. Define (in virtio_net.h) the maximum send txqs; and use in
 virtio-net and vhost-net.
  2. vi->sq[i] is allocated individually, resulting in cache line
 aligned sq[0] to sq[n].  Another option was to define
 'send_queue' as:
 struct send_queue {
 struct virtqueue *svq;
 struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 } cacheline_aligned_in_smp;
 and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
 the submitted method is preferable.
  3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
 handles TX[0-n].
  4. Further change TX handling such that vhost[0] handles both RX/TX
 for single stream case.
 
Enabling MQ on virtio:
---
  When the following options are passed to qemu:
  - smp > 1
  - vhost=on
  - mq=on (new option, default: off)
  then #txqueues = #cpus.  The #txqueues can be changed by using an
  optional 'numtxqs' option.  e.g. for a smp=4 guest:
  vhost=on   ->  #txqueues = 1
  vhost=on,mq=on ->  #txqueues = 4
  vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
  vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
 
 
 Performance (guest -> local host):
 ---
  System configuration:
  Host:  8 Intel Xeon, 8 GB memory
  Guest: 4 cpus, 2 GB memory
  Test: Each test case runs for 60 secs, sum over three runs (except
  when number of netperf sessions is 1, which has 10 runs of 12 secs
  each).  No tuning (default netperf) other than tasksetting the vhost threads to
  cpus 0-3.  numtxqs=32 gave the best results though the guest had
  only 4 vcpus (I haven't tried beyond that).
 
  __ numtxqs=2, vhosts=3  
  #sessions  BW%      CPU%     RCPU%    SD%      RSD%

  1    4.46    -1.96    0.19   -12.50    -6.06
  2    4.93    -1.16    2.10     0       -2.38
  4   46.17    64.77   33.72    19.51    -2.48
  8   47.89    70.00   36.23    41.46    13.35
  16  48.97    80.44   40.67    21.11    -5.46
  24  49.03    78.78   41.22    20.51    -4.78
  32  51.11    77.15   42.42    15.81    -6.87
  40  51.60    71.65   42.43     9.75    -8.94
  48  50.10    69.55   42.85    11.80    -5.81
  64  46.24    68.42   42.67    14.18    -3.28
  80  46.37    63.13   41.62     7.43    -6.73
  96  46.40    63.31   42.20     9.36    -4.78
  128 50.43    62.79   42.16    13.11    -1.23
  
  BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
 
  __ numtxqs=8, vhosts=5  
  #sessions   BW%  CPU% RCPU% SD%  RSD%
  
  1   -0.76   -1.56    2.33    0       3.03
  2   17.41   11.11   11.41    0      -4.76
  4   42.12   55.11   30.20   19.51    0.62
  8   54.69   80.00   39.22   24.39   -3.88
  16  54.77   81.62   40.89   20.34   -6.58
  24  54.66   79.68   41.57   15.49   -8.99
  32  54.92   76.82   41.79   17.59   -5.70
  40  51.79   68.56   40.53   15.31   -3.87
  48  51.72   66.40   40.84    9.72   -7.13
  64  51.11   63.94   41.10    5.93   -8.82
  80  46.51   59.50   39.80    9.33   -4.18
  96  47.72   57.75   39.84    4.20   -7.62
  128 54.35   58.95   40.66    3.24   -8.63
  
  BW: 38.9%,  

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Michael S. Tsirkin
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
 Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:
 
 Hi Simon,
 
 
  I have a few questions about the results below:
 
  1. Are the (%) comparisons between non-mq and mq virtio?
 
 Yes - mainline kernel with transmit-only MQ patch.
 
  2. Was UDP or TCP used?
 
 TCP. I had done some initial testing on UDP, but don't have
 the results now as it is really old. But I will be running
 it again.
 
  3. What was the transmit size (-m option to netperf)?
 
 I didn't use the -m option, so it defaults to 16K. The
 script does:
 
 netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
 
  Also, I'm interested to know what the status of these patches is.
  Are you planning a fresh series?
 
 Yes. Michael Tsirkin had wanted to see how the MQ RX patch
 would look like, so I was in the process of getting the two
 working together. The patch is ready and is being tested.
 Should I send an RFC patch at this time?

Yes, please do.

 The TX-only patch helped the guest TX path but didn't help
 host->guest much (as tested using TCP_MAERTS from the guest).
 But with the TX+RX patch, both directions are getting
 improvements.

Also, my hope is that with appropriate queue mapping,
we might be able to do away with heuristics to detect
single stream load that TX only code needs.

 Remote testing is still to be done.

Others might be able to help here once you post the patch.

 Thanks,
 
 - KK
 
 Changes from rev2:
 --
   1. Define (in virtio_net.h) the maximum send txqs; and use in
  virtio-net and vhost-net.
   2. vi->sq[i] is allocated individually, resulting in cache line
  aligned sq[0] to sq[n].  Another option was to define
  'send_queue' as:
  struct send_queue {
  struct virtqueue *svq;
  struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
  } cacheline_aligned_in_smp;
  and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
  the submitted method is preferable.
   3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
  handles TX[0-n].
   4. Further change TX handling such that vhost[0] handles both RX/TX
  for single stream case.
  
 Enabling MQ on virtio:
 ---
   When the following options are passed to qemu:
   - smp > 1
   - vhost=on
   - mq=on (new option, default: off)
   then #txqueues = #cpus.  The #txqueues can be changed by using an
   optional 'numtxqs' option.  e.g. for a smp=4 guest:
   vhost=on   ->  #txqueues = 1
   vhost=on,mq=on ->  #txqueues = 4
   vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
   vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
  
  
  Performance (guest -> local host):
  ---
   System configuration:
   Host:  8 Intel Xeon, 8 GB memory
   Guest: 4 cpus, 2 GB memory
   Test: Each test case runs for 60 secs, sum over three runs (except
   when number of netperf sessions is 1, which has 10 runs of 12 secs
   each).  No tuning (default netperf) other than tasksetting the vhost threads to
   cpus 0-3.  numtxqs=32 gave the best results though the guest had
   only 4 vcpus (I haven't tried beyond that).
  
   __ numtxqs=2, vhosts=3  
   #sessions  BW%      CPU%     RCPU%    SD%      RSD%

   1    4.46    -1.96    0.19   -12.50    -6.06
   2    4.93    -1.16    2.10     0       -2.38
   4   46.17    64.77   33.72    19.51    -2.48
   8   47.89    70.00   36.23    41.46    13.35
   16  48.97    80.44   40.67    21.11    -5.46
   24  49.03    78.78   41.22    20.51    -4.78
   32  51.11    77.15   42.42    15.81    -6.87
   40  51.60    71.65   42.43     9.75    -8.94
   48  50.10    69.55   42.85    11.80    -5.81
   64  46.24    68.42   42.67    14.18    -3.28
   80  46.37    63.13   41.62     7.43    -6.73
   96  46.40    63.31   42.20     9.36    -4.78
   128 50.43    62.79   42.16    13.11    -1.23
   
   BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
  
   __ numtxqs=8, vhosts=5  
   #sessions   BW%  CPU% RCPU% SD%  RSD%
   
   1   -0.76   -1.56    2.33    0       3.03
   2   17.41   11.11   11.41    0      -4.76
   4   42.12   55.11   30.20   19.51    0.62
   8   54.69   80.00   39.22   24.39   -3.88
   16  54.77   81.62   40.89   20.34   -6.58
   24  54.66   79.68   41.57   15.49   -8.99
   32  54.92   76.82   41.79   17.59   -5.70
   40

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM:

Hi Michael,

  Yes. Michael Tsirkin had wanted to see how the MQ RX patch
  would look like, so I was in the process of getting the two
  working together. The patch is ready and is being tested.
  Should I send an RFC patch at this time?

 Yes, please do.

Sure, will get a build/test on latest bits and send in 1-2 days.

  The TX-only patch helped the guest TX path but didn't help
  host->guest much (as tested using TCP_MAERTS from the guest).
  But with the TX+RX patch, both directions are getting
  improvements.

 Also, my hope is that with appropriate queue mapping,
 we might be able to do away with heuristics to detect
 single stream load that TX only code needs.

Yes, that whole stuff is removed, and the TX/RX path is
unchanged with this patch (thankfully :)

  Remote testing is still to be done.

 Others might be able to help here once you post the patch.

That's great, will appreciate any help.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-10 Thread Michael S. Tsirkin
On Tue, Nov 09, 2010 at 10:54:57PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 09:03:25 PM:
 
Something strange here, right?
1. You are consistently getting 10G/s here, and even with a single
   stream?
  
   Sorry, I should have mentioned this though I had stated in my
   earlier mails. Each test result has two iterations, each of 60
   seconds, except when #netperfs is 1 for which I do 10 iterations
   (sum across 10 iterations).
 
  So need to divide the number by 10?
 
 Yes, that is what I get with 512/1K macvtap I/O size :)
 
I started doing many more iterations
   for 1 netperf after finding the issue earlier with single stream.
   So the BW is only 4.5-7 Gbps.
  
2. With 2 streams is where we get >10G/s originally. Instead of
   doubling that we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.
  
   (doubling happens consistently for guest -> host, but never for
   remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
   testing scenario. In first case, there is a slight improvement in
   BW and good reduction in SD. In the second case, only SD improves
   (though BW drops for 2 stream for some reason).  In both cases,
   BW and SD improves as the number of sessions increase.
 
  I guess this is another indication that something's wrong.
 
 The patch - both virtio-net and vhost-net - doesn't add any
 locking/mutexes or other synchronization. Guest -> host
 performance improvement of up to 100% shows the patch is not
 doing anything wrong.

My concern is this: we don't seem to do anything in tap or macvtap to
help packets from separate virtio queues get to separate queues in the
hardware device and to avoid reordering when we do this.

- skb_tx_hash calculation will get different results
- the hash math that e.g. TCP does will run on the guest and seems to be discarded

etc

Maybe it's as simple as some tap/macvtap ioctls to set up the queue number
in skbs. Or maybe we need to pass the skb hash from guest to host.
It's this last option that should make us especially cautious, as it'll
affect the guest/host interface.

Also see d5a9e24afb4ab38110ebb777588ea0bd0eacbd0a: if we have
hardware which records an RX queue, it appears important to
pass that info to the guest and to use it in selecting the TX queue.
Of course we won't see this in netperf runs, but it needs to
be given thought too - supporting it seems to suggest either
sticking the hash in the virtio-net header for both tx and rx,
or using multiple RX queues.

  We are quite far from line rate; the fact that BW does not scale
  means there's some contention in the code.
 
 Attaining line speed with macvtap seems to be a generic issue
 unrelated to my patch specifically. IMHO if there is nothing
 wrong in the code (review) and it is accepted, it will benefit as
 others can also help to find what needs to be implemented in
 vhost/macvtap/qemu to get line speed for guest -> remote host.

No problem, I will queue these patches in some branch
to help enable cooperation, as well as help you
iterate with incremental patches instead of resending it all each time.


 PS: bare-metal performance for host -> remote host is also
 2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.
 
 Thanks,

You mean native Linux BW does not scale for your host with
# of connections either? I guess this just means we need another
setup for testing?

 - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Michael S. Tsirkin
On Tue, Nov 09, 2010 at 08:58:44PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 06:52:39 PM:
 
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
   
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
  Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:

 Any feedback, comments, objections, issues or bugs about the
 patches? Please let me know if something needs to be done.

 Some more test results:
 _
 Host -> Guest BW (numtxqs=2)
 #   BW% CPU%RCPU%   SD% RSD%
 _
   
I think we discussed the need for external to guest testing
over 10G. For large messages we should not see any change
but you should be able to get better numbers for small messages
assuming an MQ NIC card.
  
   I had to make a few changes to qemu (and a minor change in the macvtap
   driver) to get multiple TXQ support using macvtap working. The NIC
   is an ixgbe card.
  
  
  __________________________________________________________________
  Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
  #   BW1    BW2 (%)        SD1    SD2 (%)        RSD1    RSD2 (%)
  __________________________________________________________________
  1   14367  13142 (-8.5)   56     62 (10.7)      8       8 (0)
  2   3652   3855 (5.5)     37     35 (-5.4)      7       6 (-14.2)
  4   12529  12059 (-3.7)   65     77 (18.4)      35      35 (0)
  8   13912  14668 (5.4)    288    332 (15.2)     175     184 (5.1)
  16  13433  14455 (7.6)    1218   1321 (8.4)     920     943 (2.5)
  24  12750  13477 (5.7)    2876   2985 (3.7)     2514    2348 (-6.6)
  32  11729  12632 (7.6)    5299   5332 (.6)      4934    4497 (-8.8)
  40  11061  11923 (7.7)    8482   8364 (-1.3)    8374    7495 (-10.4)
  48  10624  11267 (6.0)    12329  12258 (-.5)    12762   11538 (-9.5)
  64  10524  10596 (.6)     21689  22859 (5.3)    23626   22403 (-5.1)
  80  9856   10284 (4.3)    35769  36313 (1.5)    39932   36419 (-8.7)
  96  9691   10075 (3.9)    52357  52259 (-.1)    58676   53463 (-8.8)
  128 9351   9794 (4.7)     114707 94275 (-17.8)  114050  97337 (-14.6)
  __________________________________________________________________
  Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)

  __________________________________________________________________
  Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
  #   BW1    BW2 (%)        SD1    SD2 (%)        RSD1    RSD2 (%)
  __________________________________________________________________
  1   16509  15985 (-3.1)   45     47 (4.4)       7       7 (0)
  2   6963   4499 (-35.3)   17     51 (200.0)     7       7 (0)
  4   12932  11080 (-14.3)  49     74 (51.0)      35      35 (0)
  8   13878  14095 (1.5)    223    292 (30.9)     175     181 (3.4)
  16  13440  13698 (1.9)    980    1131 (15.4)    926     942 (1.7)
  24  12680  12927 (1.9)    2387   2463 (3.1)     2526    2342 (-7.2)
  32  11714  12261 (4.6)    4506   4486 (-.4)     4941    4463 (-9.6)
  40  11059  11651 (5.3)    7244   7081 (-2.2)    8349    7437 (-10.9)
  48  10580  11095 (4.8)    10811  10500 (-2.8)   12809   11403 (-10.9)
  64  10569  10566 (0)      19194  19270 (.3)     23648   21717 (-8.1)
  80  9827   10753 (9.4)    31668  29425 (-7.0)   39991   33824 (-15.4)
  96  10043  10150 (1.0)    45352  44227 (-2.4)   57766   51131 (-11.4)
  128 9360   9979 (6.6)     92058  79198 (-13.9)  114381  92873 (-18.8)
  __________________________________________________________________
  Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)
  
   Is there anything else you would like me to test/change, or shall
   I submit the next version (with the above macvtap changes)?
  
   Thanks,
  
   - KK
 
  Something strange here, right?
  1. You are consistently getting 10G/s here, and even with a single
 stream?
 
 Sorry, I should have mentioned this though I had stated in my
 earlier mails. Each test result has two iterations, each of 60
 seconds, except when #netperfs is 1 for which I do 10 iterations
 (sum across 10 iterations).

So need to divide the number by 10?

  I started doing many more iterations
 for 1 netperf after finding the issue earlier with single stream.
 So the BW is only 4.5-7 Gbps.
 
  2. With 2 streams is where we get >10G/s originally. Instead of
 doubling that we get a marginal improvement with 2 queues and
 about 30% worse with 1 queue.
 
 (doubling happens consistently for guest -> host, but never for
 remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
 testing scenario. In first case, there is a slight improvement in
 BW and good reduction in SD. In the second case, only SD improves

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Michael S. Tsirkin
On Tue, Nov 09, 2010 at 10:08:21AM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 02:27:09 PM:
 
  Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
 
  On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
  
   Any feedback, comments, objections, issues or bugs about the
   patches? Please let me know if something needs to be done.
  
   Some more test results:
   _
 Host -> Guest BW (numtxqs=2)
   #   BW% CPU%RCPU%   SD% RSD%
   _
 
  I think we discussed the need for external to guest testing
  over 10G. For large messages we should not see any change
  but you should be able to get better numbers for small messages
  assuming an MQ NIC card.
 
 I had to make a few changes to qemu (and a minor change in the macvtap
 driver) to get multiple TXQ support using macvtap working. The NIC
 is an ixgbe card.
 
 __________________________________________________________________
 Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
 #   BW1    BW2 (%)        SD1    SD2 (%)        RSD1    RSD2 (%)
 __________________________________________________________________
 1   14367  13142 (-8.5)   56     62 (10.7)      8       8 (0)
 2   3652   3855 (5.5)     37     35 (-5.4)      7       6 (-14.2)
 4   12529  12059 (-3.7)   65     77 (18.4)      35      35 (0)
 8   13912  14668 (5.4)    288    332 (15.2)     175     184 (5.1)
 16  13433  14455 (7.6)    1218   1321 (8.4)     920     943 (2.5)
 24  12750  13477 (5.7)    2876   2985 (3.7)     2514    2348 (-6.6)
 32  11729  12632 (7.6)    5299   5332 (.6)      4934    4497 (-8.8)
 40  11061  11923 (7.7)    8482   8364 (-1.3)    8374    7495 (-10.4)
 48  10624  11267 (6.0)    12329  12258 (-.5)    12762   11538 (-9.5)
 64  10524  10596 (.6)     21689  22859 (5.3)    23626   22403 (-5.1)
 80  9856   10284 (4.3)    35769  36313 (1.5)    39932   36419 (-8.7)
 96  9691   10075 (3.9)    52357  52259 (-.1)    58676   53463 (-8.8)
 128 9351   9794 (4.7)     114707 94275 (-17.8)  114050  97337 (-14.6)
 __________________________________________________________________
 Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)

 __________________________________________________________________
 Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
 #   BW1    BW2 (%)        SD1    SD2 (%)        RSD1    RSD2 (%)
 __________________________________________________________________
 1   16509  15985 (-3.1)   45     47 (4.4)       7       7 (0)
 2   6963   4499 (-35.3)   17     51 (200.0)     7       7 (0)
 4   12932  11080 (-14.3)  49     74 (51.0)      35      35 (0)
 8   13878  14095 (1.5)    223    292 (30.9)     175     181 (3.4)
 16  13440  13698 (1.9)    980    1131 (15.4)    926     942 (1.7)
 24  12680  12927 (1.9)    2387   2463 (3.1)     2526    2342 (-7.2)
 32  11714  12261 (4.6)    4506   4486 (-.4)     4941    4463 (-9.6)
 40  11059  11651 (5.3)    7244   7081 (-2.2)    8349    7437 (-10.9)
 48  10580  11095 (4.8)    10811  10500 (-2.8)   12809   11403 (-10.9)
 64  10569  10566 (0)      19194  19270 (.3)     23648   21717 (-8.1)
 80  9827   10753 (9.4)    31668  29425 (-7.0)   39991   33824 (-15.4)
 96  10043  10150 (1.0)    45352  44227 (-2.4)   57766   51131 (-11.4)
 128 9360   9979 (6.6)     92058  79198 (-13.9)  114381  92873 (-18.8)
 __________________________________________________________________
 Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)
 
 Is there anything else you would like me to test/change, or shall
 I submit the next version (with the above macvtap changes)?
 
 Thanks,
 
 - KK

Something strange here, right?
1. You are consistently getting 10G/s here, and even with a single stream?
2. With 2 streams is where we get >10G/s originally. Instead of
   doubling that we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.

Is your card MQ?

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 09:03:25 PM:

   Something strange here, right?
   1. You are consistently getting 10G/s here, and even with a single
  stream?
 
  Sorry, I should have mentioned this though I had stated in my
  earlier mails. Each test result has two iterations, each of 60
  seconds, except when #netperfs is 1 for which I do 10 iterations
  (sum across 10 iterations).

 So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

   I started doing many more iterations
  for 1 netperf after finding the issue earlier with single stream.
  So the BW is only 4.5-7 Gbps.
 
   2. With 2 streams is where we get >10G/s originally. Instead of
  doubling that we get a marginal improvement with 2 queues and
  about 30% worse with 1 queue.
 
  (doubling happens consistently for guest -> host, but never for
  remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
  testing scenario. In first case, there is a slight improvement in
  BW and good reduction in SD. In the second case, only SD improves
  (though BW drops for 2 stream for some reason).  In both cases,
  BW and SD improves as the number of sessions increase.

 I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net - doesn't add any
locking/mutexes or other synchronization. Guest -> host
performance improvement of up to 100% shows the patch is not
doing anything wrong.

 We are quite far from line rate; the fact that BW does not scale
 means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue
unrelated to my patch specifically. IMHO if there is nothing
wrong in the code (review) and it is accepted, it will benefit as
others can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest -> remote host.

PS: bare-metal performance for host -> remote host is also
2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 02:27:09 PM:

 Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

 On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
   Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
 
  Any feedback, comments, objections, issues or bugs about the
  patches? Please let me know if something needs to be done.
 
  Some more test results:
  _
  Host -> Guest BW (numtxqs=2)
  #   BW% CPU%RCPU%   SD% RSD%
  _

 I think we discussed the need for external to guest testing
 over 10G. For large messages we should not see any change
 but you should be able to get better numbers for small messages
 assuming an MQ NIC card.

I had to make a few changes to qemu (and a minor change in the macvtap
driver) to get multiple TXQ support using macvtap working. The NIC
is an ixgbe card.

__
Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
#    BW1    BW2 (%)         SD1     SD2 (%)        RSD1    RSD2 (%)
__________________________________________________________________
1    14367  13142 (-8.5)    56      62 (10.7)      8       8 (0)
2    3652   3855 (5.5)      37      35 (-5.4)      7       6 (-14.2)
4    12529  12059 (-3.7)    65      77 (18.4)      35      35 (0)
8    13912  14668 (5.4)     288     332 (15.2)     175     184 (5.1)
16   13433  14455 (7.6)     1218    1321 (8.4)     920     943 (2.5)
24   12750  13477 (5.7)     2876    2985 (3.7)     2514    2348 (-6.6)
32   11729  12632 (7.6)     5299    5332 (.6)      4934    4497 (-8.8)
40   11061  11923 (7.7)     8482    8364 (-1.3)    8374    7495 (-10.4)
48   10624  11267 (6.0)     12329   12258 (-.5)    12762   11538 (-9.5)
64   10524  10596 (.6)      21689   22859 (5.3)    23626   22403 (-5.1)
80   9856   10284 (4.3)     35769   36313 (1.5)    39932   36419 (-8.7)
96   9691   10075 (3.9)     52357   52259 (-.1)    58676   53463 (-8.8)
128  9351   9794 (4.7)      114707  94275 (-17.8)  114050  97337 (-14.6)
__
Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)

__
Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
#    BW1    BW2 (%)         SD1     SD2 (%)        RSD1    RSD2 (%)
__________________________________________________________________
1    16509  15985 (-3.1)    45      47 (4.4)       7       7 (0)
2    6963   4499 (-35.3)    17      51 (200.0)     7       7 (0)
4    12932  11080 (-14.3)   49      74 (51.0)      35      35 (0)
8    13878  14095 (1.5)     223     292 (30.9)     175     181 (3.4)
16   13440  13698 (1.9)     980     1131 (15.4)    926     942 (1.7)
24   12680  12927 (1.9)     2387    2463 (3.1)     2526    2342 (-7.2)
32   11714  12261 (4.6)     4506    4486 (-.4)     4941    4463 (-9.6)
40   11059  11651 (5.3)     7244    7081 (-2.2)    8349    7437 (-10.9)
48   10580  11095 (4.8)     10811   10500 (-2.8)   12809   11403 (-10.9)
64   10569  10566 (0)       19194   19270 (.3)     23648   21717 (-8.1)
80   9827   10753 (9.4)     31668   29425 (-7.0)   39991   33824 (-15.4)
96   10043  10150 (1.0)     45352   44227 (-2.4)   57766   51131 (-11.4)
128  9360   9979 (6.6)      92058   79198 (-13.9)  114381  92873 (-18.8)
__
Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)

Is there anything else you would like me to test/change, or shall
I submit the next version (with the above macvtap changes)?

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-03 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
  Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
 
  Results for UDP BW tests (unidirectional, sum across
  3 iterations, each iteration of 45 seconds, default
  netperf, vhosts bound to cpus 0-3; no other tuning):

 Is binding vhost threads to CPUs really required?
 What happens if we let the scheduler do its job?
   
Nothing drastic, I remember BW% and SD% both improved a
bit as a result of binding.
  
   If there's a significant improvement this would mean that
   we need to rethink the vhost-net interaction with the scheduler.
 
  I will get a test run with and without binding and post the
  results later today.
 
 Correction: The result with binding is much better for
 SD/CPU compared to without-binding:

Something that was suggested to me off-list is
trying to set smp affinity for NIC: in host to guest
case probably virtio-net, for external to guest
the host NIC as well.
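
For what it's worth, pinning a NIC's interrupt vectors is done through
procfs; a sketch (the interface name, IRQ number, and CPU mask below
are placeholders to be read off the host):

```shell
# Locate the NIC's interrupt vectors (driver name is an example):
grep ixgbe /proc/interrupts

# Pin IRQ 42 (placeholder) to CPU0 (bitmask 1); repeat per TX/RX vector:
echo 1 > /proc/irq/42/smp_affinity
```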

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-29 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
  Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
 
  Results for UDP BW tests (unidirectional, sum across
  3 iterations, each iteration of 45 seconds, default
  netperf, vhosts bound to cpus 0-3; no other tuning):

 Is binding vhost threads to CPUs really required?
 What happens if we let the scheduler do its job?
   
Nothing drastic, I remember BW% and SD% both improved a
bit as a result of binding.
  
   If there's a significant improvement this would mean that
   we need to rethink the vhost-net interaction with the scheduler.
 
  I will get a test run with and without binding and post the
  results later today.
 
 Correction: The result with binding is much better for
 SD/CPU compared to without-binding:

Can you pls try finding out why that is?  Is some thread bouncing between
CPUs?  Does a wrong numa node get picked up?
In practice users are very unlikely to pin threads to CPUs.

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-29 Thread linux_kvm
On Fri, 29 Oct 2010 13:26 +0200, Michael S. Tsirkin m...@redhat.com
wrote:
 On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
   Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
 In practice users are very unlikely to pin threads to CPUs.

I may be misunderstanding what you're referring to. It caught my
attention since I'm working on a configuration to do what you say is
unlikely, so I'll chime in for what it's worth.

An option in Vyatta allows assigning CPU affinity to network adapters,
since apparently separate L2 caches can have a significant impact on
throughput.

Although much of their focus seems to be on commercial virtualization
platforms, I do see quite a few forum posts with regard to KVM.
Maybe this still qualifies as an edge case, but as for virtualized
routing theirs seems to offer the most functionality.

http://www.vyatta.org/forum/viewtopic.php?t=2697

-cb


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com

   I think we discussed the need for external to guest testing
   over 10G. For large messages we should not see any change
   but you should be able to get better numbers for small messages
   assuming a MQ NIC card.
 
  For external host, there is a contention among different
  queues (vhosts) when packets are processed in tun/bridge,
  unless I implement MQ TX for macvtap (tun/bridge?).  So
  my testing shows a small improvement (1 to 1.5% average)
  in BW and a rise in SD (between 10-15%).  For remote host,
  I think tun/macvtap needs MQ TX support?

 Confused. I thought this *is* with a multiqueue tun/macvtap?
 bridge does not do any queueing AFAIK ...
 I think we need to fix the contention. With migration what was guest to
 host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com
 
I think we discussed the need for external to guest testing
over 10G. For large messages we should not see any change
but you should be able to get better numbers for small messages
assuming a MQ NIC card.
  
   For external host, there is a contention among different
   queues (vhosts) when packets are processed in tun/bridge,
   unless I implement MQ TX for macvtap (tun/bridge?).  So
   my testing shows a small improvement (1 to 1.5% average)
   in BW and a rise in SD (between 10-15%).  For remote host,
   I think tun/macvtap needs MQ TX support?
 
  Confused. I thought this *is* with a multiqueue tun/macvtap?
  bridge does not do any queueing AFAIK ...
  I think we need to fix the contention. With migration what was guest to
  host a minute ago might become guest to external now ...
 
 Macvtap RX is MQ but not TX. I don't think MQ TX support is
 required for macvtap, though. Is it enough for existing
 macvtap sendmsg to work, since it calls dev_queue_xmit
 which selects the txq for the outgoing device?
 
 Thanks,
 
 - KK

I think there would be an issue with using a single poll notifier and
with contention on the send-buffer atomic variable.
Is tun different than macvtap? We need to support both long term ...

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
 Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:

 Results for UDP BW tests (unidirectional, sum across
 3 iterations, each iteration of 45 seconds, default
 netperf, vhosts bound to cpus 0-3; no other tuning):
   
Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?
  
   Nothing drastic, I remember BW% and SD% both improved a
   bit as a result of binding.
 
  If there's a significant improvement this would mean that
  we need to rethink the vhost-net interaction with the scheduler.

 I will get a test run with and without binding and post the
 results later today.

Correction: The result with binding is much better for
SD/CPU compared to without-binding:

_
 numtxqs=8,vhosts=5, Bind vs No-bind
#    BW%    CPU%    RCPU%   SD%     RSD%
_________________________________________
1    11.25  10.77   1.89    0       -6.06
2    18.66  7.20    7.20    -14.28  -7.40
4    4.24   -1.27   1.56    -2.70   -.98
8    14.91  -3.79   5.46    -12.19  -3.76
16   12.32  -8.67   4.63    -35.97  -26.66
24   11.68  -7.83   5.10    -40.73  -32.37
32   13.09  -10.51  6.57    -51.52  -42.28
40   11.04  -4.12   11.23   -50.69  -42.81
48   8.61   -10.30  6.04    -62.38  -55.54
64   7.55   -6.05   6.41    -61.20  -56.04
80   8.74   -11.45  6.29    -72.65  -67.17
96   9.84   -6.01   9.87    -69.89  -64.78
128  5.57   -6.23   8.99    -75.03  -70.97
_
BW: 10.4%,  CPU/RCPU: -7.4%,7.7%,  SD: -70.5%,-65.7%

Notes:
1.  All my earlier test results were with vhost bound
    to cpus 0-3 for both the org and new kernels.
2.  I am not using MST's use_mq patch, only mainline
kernel. However, I reported earlier that I got
better results with that patch. The result for
MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 04:39:13 PM:

(merging two posts into one)

 I think we discussed the need for external to guest testing
 over 10G. For large messages we should not see any change
 but you should be able to get better numbers for small messages
 assuming a MQ NIC card.

For external host, there is a contention among different
queues (vhosts) when packets are processed in tun/bridge,
unless I implement MQ TX for macvtap (tun/bridge?).  So
my testing shows a small improvement (1 to 1.5% average)
in BW and a rise in SD (between 10-15%).  For remote host,
I think tun/macvtap needs MQ TX support?

Results for UDP BW tests (unidirectional, sum across
3 iterations, each iteration of 45 seconds, default
netperf, vhosts bound to cpus 0-3; no other tuning):
  
   Is binding vhost threads to CPUs really required?
   What happens if we let the scheduler do its job?
 
  Nothing drastic, I remember BW% and SD% both improved a
  bit as a result of binding.

 If there's a significant improvement this would mean that
 we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the
results later today.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 10:44:14AM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 04:39:13 PM:
 
 (merging two posts into one)
 
  I think we discussed the need for external to guest testing
  over 10G. For large messages we should not see any change
  but you should be able to get better numbers for small messages
  assuming a MQ NIC card.
 
 For external host, there is a contention among different
 queues (vhosts) when packets are processed in tun/bridge,
 unless I implement MQ TX for macvtap (tun/bridge?).  So
 my testing shows a small improvement (1 to 1.5% average)
 in BW and a rise in SD (between 10-15%).  For remote host,
 I think tun/macvtap needs MQ TX support?

Confused. I thought this *is* with a multiqueue tun/macvtap?
bridge does not do any queueing AFAIK ...
I think we need to fix the contention. With migration what was guest to
host a minute ago might become guest to external now ...

 Results for UDP BW tests (unidirectional, sum across
 3 iterations, each iteration of 45 seconds, default
 netperf, vhosts bound to cpus 0-3; no other tuning):
   
Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?
  
   Nothing drastic, I remember BW% and SD% both improved a
   bit as a result of binding.
 
  If there's a significant improvement this would mean that
  we need to rethink the vhost-net interaction with the scheduler.
 
 I will get a test run with and without binding and post the
 results later today.
 
 Thanks,
 
 - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Michael S. Tsirkin
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
  Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
 
 Any feedback, comments, objections, issues or bugs about the
 patches? Please let me know if something needs to be done.
 
 Some more test results:
 _
  Host-Guest BW (numtxqs=2)
 #    BW%    CPU%    RCPU%   SD%     RSD%
 _

I think we discussed the need for external to guest testing
over 10G. For large messages we should not see any change
but you should be able to get better numbers for small messages
assuming a MQ NIC card.

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/26/2010 10:40:35 AM:

  I am trying to wrap my head around kernel/user interface here.
  E.g., will we need another incompatible change when we add multiple RX
  queues?

 Though I added a 'mq' option to qemu, there shouldn't be
 any incompatibility between old and new qemu's wrt vhost
 and virtio-net drivers. So the old qemu will run new host
 and new guest without issues, and new qemu can also run
 old host and old guest. Multiple RXQ will also not add
 any incompatibility.

 With MQ RX, I will be able to remove the heuristic (idea
 from David Stevens).  The idea is: Guest sends out packets
 on, say TXQ#2, vhost#2 processes the packets but packets
 going out from host to guest might be sent out on a
 different RXQ, say RXQ#4.  Guest receives the packet on
 RXQ#4, and all future responses on that connection are sent
 on TXQ#4.  Now vhost#4 processes both RX and TX packets for
 this connection.  Without needing to hash on the connection,
 guest can make sure that the same vhost thread will handle
 a single connection.

  Also need to think about how robust our single stream heuristic is,
  e.g. what are the chances it will misdetect a bidirectional
  UDP stream as a single TCP?

 I think it should not happen. The heuristic code gets
 called for handling just the transmit packets, packets
 that vhost sends out to the guest skip this path.

 I tested unidirectional and bidirectional UDP to confirm:

 8 iterations of iperf tests, each iteration of 15 secs,
 result is the sum of all 8 iterations in Gbits/sec
 __
 Uni-directional        Bi-directional
   Org      New           Org      New
 ______________________________________
   71.78    71.77         71.74    72.07
 __


Results for UDP BW tests (unidirectional, sum across
3 iterations, each iteration of 45 seconds, default
netperf, vhosts bound to cpus 0-3; no other tuning):

-- numtxqs=8, vhosts=5 -
#    BW%     CPU%    SD%
________________________________
1    .49     1.07    0
2    23.51   52.51   26.66
4    75.17   72.43   8.57
8    86.54   80.21   27.85
16   92.37   85.99   6.27
24   91.37   84.91   8.41
32   89.78   82.90   3.31
48   89.85   79.95   -3.57
64   85.83   80.28   2.22
80   88.90   79.47   -23.18
96   90.12   79.98   14.71
128  86.13   80.60   4.42

BW: 71.3%, CPU: 80.4%, SD: 1.2%


-- numtxqs=16, vhosts=5 
#    BW%     CPU%    SD%
________________________________
1    1.80    0       0
2    19.81   50.68   26.66
4    57.31   52.77   8.57
8    108.44  88.19   -5.21
16   106.09  85.03   -4.44
24   102.34  84.23   -.82
32   102.77  82.71   -5.81
48   100.00  79.62   -7.29
64   96.86   79.75   -6.10
80   99.26   79.82   -27.34
96   94.79   80.02   -5.08
128  98.14   81.15   -15.25

BW: 77.9%,  CPU: 80.4%,  SD: -13.6%

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Michael S. Tsirkin
On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
 Results for UDP BW tests (unidirectional, sum across
 3 iterations, each iteration of 45 seconds, default
 netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com

 On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
  Results for UDP BW tests (unidirectional, sum across
  3 iterations, each iteration of 45 seconds, default
  netperf, vhosts bound to cpus 0-3; no other tuning):

 Is binding vhost threads to CPUs really required?
 What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a
bit as a result of binding. I started binding vhost thread
after Avi suggested it in response to my v1 patch (he
suggested some more that I haven't done), and have been
doing only this tuning ever since. This is part of his
mail for the tuning:

vhost:
thread #0:  CPU0
thread #1:  CPU1
thread #2:  CPU2
thread #3:  CPU3

I simply bound each thread to CPU0-3 instead.
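
A sketch of that binding (the PIDs below are placeholders for the four
vhost thread IDs, which show up in ps as vhost threads of the qemu
process):

```shell
# Bind four vhost threads (placeholder PIDs) to CPU0-3, one each:
cpu=0
for pid in 1234 1235 1236 1237; do
    taskset -pc "$cpu" "$pid"    # pin this thread to one CPU
    cpu=$(( (cpu + 1) % 4 ))
done
```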

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Michael S. Tsirkin
On Tue, Oct 26, 2010 at 03:31:39PM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com
 
  On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
   Results for UDP BW tests (unidirectional, sum across
   3 iterations, each iteration of 45 seconds, default
   netperf, vhosts bound to cpus 0-3; no other tuning):
 
  Is binding vhost threads to CPUs really required?
  What happens if we let the scheduler do its job?
 
 Nothing drastic, I remember BW% and SD% both improved a
 bit as a result of binding.

If there's a significant improvement this would mean that
we need to rethink the vhost-net interaction with the scheduler.

 I started binding vhost thread
 after Avi suggested it in response to my v1 patch (he
 suggested some more that I haven't done), and have been
 doing only this tuning ever since. This is part of his
 mail for the tuning:
 
   vhost:
   thread #0:  CPU0
   thread #1:  CPU1
   thread #2:  CPU2
   thread #3:  CPU3
 
 I simply bound each thread to CPU0-3 instead.
 
 Thanks,
 
 - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Krishna Kumar2
 Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:

Any feedback, comments, objections, issues or bugs about the
patches? Please let me know if something needs to be done.

Some more test results:
_
 Host-Guest BW (numtxqs=2)
#    BW%    CPU%    RCPU%   SD%     RSD%
_________________________________________
1    5.5    3.31    .67     -5.88   0
2    -2.11  -1.01   -2.08   4.34    0
4    13.53  10.77   13.87   -1.96   0
8    34.22  22.80   30.53   -8.46   -2.50
16   30.89  24.06   35.17   -5.20   3.20
24   33.22  26.30   43.39   -5.17   7.58
32   30.85  27.27   47.74   -.59    15.51
40   33.80  27.33   48.00   -7.42   7.59
48   45.93  26.33   45.46   -12.24  1.10
64   33.51  27.11   45.00   -3.27   10.30
80   39.28  29.21   52.33   -4.88   12.17
96   32.05  31.01   57.72   -1.02   19.05
128  35.66  32.04   60.00   -.66    20.41
_
BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%


Guest-Host 512 byte (numtxqs=2):
#    BW%     CPU%    RCPU%   SD%     RSD%
_________________________________________
1    3.02    -3.84   -4.76   -12.50  -7.69
2    52.77   -15.73  -8.66   -45.31  -40.33
4    -23.14  13.84   7.50    50.58   40.81
8    -21.44  28.08   16.32   63.06   47.43
16   33.53   46.50   27.19   7.61    -6.60
24   55.77   42.81   30.49   -8.65   -16.48
32   52.59   38.92   29.08   -9.18   -15.63
40   50.92   36.11   28.92   -10.59  -15.30
48   46.63   34.73   28.17   -7.83   -12.32
64   45.56   37.12   28.81   -5.05   -10.80
80   44.55   36.60   28.45   -4.95   -10.61
96   43.02   35.97   28.89   -.11    -5.31
128  38.54   33.88   27.19   -4.79   -9.54
_
BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%


Thanks,

- KK



 [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

 Following set of patches implement transmit MQ in virtio-net.  Also
 included is the user qemu changes.  MQ is disabled by default unless
 qemu specifies it.

   Changes from rev2:
   --
 1. Define (in virtio_net.h) the maximum send txqs; and use in
virtio-net and vhost-net.
 2. vi->sq[i] is allocated individually, resulting in cache line
aligned sq[0] to sq[n].  Another option was to define
'send_queue' as:
struct send_queue {
struct virtqueue *svq;
struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
} cacheline_aligned_in_smp;
and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
the submitted method is preferable.
 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
handles TX[0-n].
 4. Further change TX handling such that vhost[0] handles both RX/TX
for single stream case.

   Enabling MQ on virtio:
   ---
 When following options are passed to qemu:
 - smp > 1
 - vhost=on
 - mq=on (new option, default:off)
 then #txqueues = #cpus.  The #txqueues can be changed by using an
 optional 'numtxqs' option.  e.g. for a smp=4 guest:
 vhost=on                   ->  #txqueues = 1
 vhost=on,mq=on             ->  #txqueues = 4
 vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
 vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
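
A hypothetical invocation under these rules might look like the
following (the mq/numtxqs options exist only with this patchset
applied, and the exact netdev syntax here is a guess for illustration):

```shell
# Hypothetical: qemu with the patchset's mq/numtxqs options.
# 4 vcpus + mq=on would give 4 TX queues unless numtxqs overrides it.
qemu-system-x86_64 -smp 4 -m 2048 \
    -netdev tap,id=net0,vhost=on,mq=on,numtxqs=2 \
    -device virtio-net-pci,netdev=net0 \
    disk.img
```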


Performance (guest -> local host):
---
 System configuration:
 Host:  8 Intel Xeon, 8 GB memory
 Guest: 4 cpus, 2 GB memory
 Test: Each test case runs for 60 secs, sum over three runs (except
 when number of netperf sessions is 1, which has 10 runs of 12 secs
 each).  No tuning (default netperf) other than taskset vhost's to
 cpus 0-3.  numtxqs=32 gave the best results though the guest had
 only 4 vcpus (I haven't tried beyond that).

 __ numtxqs=2, vhosts=3  
 #sessions  BW%    CPU%    RCPU%   SD%     RSD%
 _______________________________________________
 1          4.46   -1.96   .19     -12.50  -6.06
 2          4.93   -1.16   2.10    0       -2.38
 4          46.17  64.77   33.72   19.51   -2.48
 8          47.89  70.00   36.23   41.46   13.35
 16         48.97  80.44   40.67   21.11   -5.46
 24         49.03  78.78   41.22   20.51   -4.78
 32         51.11  77.15   42.42   15.81   -6.87
 40         51.60  71.65   42.43   9.75    -8.94
 48         50.10  69.55   42.85   11.80   -5.81
 64         46.24  68.42   42.67   14.18   -3.28
 80         46.37  63.13   41.62   7.43    -6.73
 96         46.40  63.31   42.20   9.36    -4.78
 128        50.43  62.79   42.16   13.11   -1.23
 
 BW: 37.2%,  

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Michael S. Tsirkin
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
  Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
 
 Any feedback, comments, objections, issues or bugs about the
 patches? Please let me know if something needs to be done.

I am trying to wrap my head around kernel/user interface here.
E.g., will we need another incompatible change when we add multiple RX
queues? Also need to think about how robust our single stream heuristic is,
e.g. what are the chances it will misdetect a bidirectional
UDP stream as a single TCP?

 Some more test results:
 _
  Host-Guest BW (numtxqs=2)
 #    BW%    CPU%    RCPU%   SD%     RSD%
 _________________________________________
 1    5.5    3.31    .67     -5.88   0
 2    -2.11  -1.01   -2.08   4.34    0
 4    13.53  10.77   13.87   -1.96   0
 8    34.22  22.80   30.53   -8.46   -2.50
 16   30.89  24.06   35.17   -5.20   3.20
 24   33.22  26.30   43.39   -5.17   7.58
 32   30.85  27.27   47.74   -.59    15.51
 40   33.80  27.33   48.00   -7.42   7.59
 48   45.93  26.33   45.46   -12.24  1.10
 64   33.51  27.11   45.00   -3.27   10.30
 80   39.28  29.21   52.33   -4.88   12.17
 96   32.05  31.01   57.72   -1.02   19.05
 128  35.66  32.04   60.00   -.66    20.41
 _
 BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%
 
 
 Guest-Host 512 byte (numtxqs=2):
 #    BW%     CPU%    RCPU%   SD%     RSD%
 _________________________________________
 1    3.02    -3.84   -4.76   -12.50  -7.69
 2    52.77   -15.73  -8.66   -45.31  -40.33
 4    -23.14  13.84   7.50    50.58   40.81
 8    -21.44  28.08   16.32   63.06   47.43
 16   33.53   46.50   27.19   7.61    -6.60
 24   55.77   42.81   30.49   -8.65   -16.48
 32   52.59   38.92   29.08   -9.18   -15.63
 40   50.92   36.11   28.92   -10.59  -15.30
 48   46.63   34.73   28.17   -7.83   -12.32
 64   45.56   37.12   28.81   -5.05   -10.80
 80   44.55   36.60   28.45   -4.95   -10.61
 96   43.02   35.97   28.89   -.11    -5.31
 128  38.54   33.88   27.19   -4.79   -9.54
 _
 BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%
 
 
 Thanks,
 
 - KK
 
 
 
  [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
 
  Following set of patches implement transmit MQ in virtio-net.  Also
  included is the user qemu changes.  MQ is disabled by default unless
  qemu specifies it.
 
Changes from rev2:
--
  1. Define (in virtio_net.h) the maximum send txqs; and use in
 virtio-net and vhost-net.
  2. vi->sq[i] is allocated individually, resulting in cache line
 aligned sq[0] to sq[n].  Another option was to define
 'send_queue' as:
 struct send_queue {
 struct virtqueue *svq;
 struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 } cacheline_aligned_in_smp;
 and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
 the submitted method is preferable.
  3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
 handles TX[0-n].
  4. Further change TX handling such that vhost[0] handles both RX/TX
 for single stream case.
 
Enabling MQ on virtio:
---
  When following options are passed to qemu:
  - smp > 1
  - vhost=on
  - mq=on (new option, default:off)
  then #txqueues = #cpus.  The #txqueues can be changed by using an
  optional 'numtxqs' option.  e.g. for a smp=4 guest:
  vhost=on                   ->  #txqueues = 1
  vhost=on,mq=on             ->  #txqueues = 4
  vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
  vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
 
 
 Performance (guest -> local host):
 ---
  System configuration:
  Host:  8 Intel Xeon, 8 GB memory
  Guest: 4 cpus, 2 GB memory
  Test: Each test case runs for 60 secs, sum over three runs (except
  when number of netperf sessions is 1, which has 10 runs of 12 secs
  each).  No tuning (default netperf) other than taskset vhost's to
  cpus 0-3.  numtxqs=32 gave the best results though the guest had
  only 4 vcpus (I haven't tried beyond that).
 
  __ numtxqs=2, vhosts=3  
  #sessions  BW%    CPU%    RCPU%   SD%     RSD%
  ______________________________________________
  1          4.46   -1.96   .19     -12.50  -6.06
  2          4.93   -1.16   2.10    0       -2.38
  4          46.17  64.77   33.72   19.51   -2.48
  8          47.89  70.00   36.23   41.46   13.35
  16         48.97  80.44   40.67   21.11   -5.46
  24 49.03

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/25/2010 09:47:18 PM:

  Any feedback, comments, objections, issues or bugs about the
  patches? Please let me know if something needs to be done.

 I am trying to wrap my head around kernel/user interface here.
 E.g., will we need another incompatible change when we add multiple RX
 queues?

Though I added a 'mq' option to qemu, there shouldn't be
any incompatibility between old and new qemu's wrt vhost
and virtio-net drivers. So the old qemu will run new host
and new guest without issues, and new qemu can also run
old host and old guest. Multiple RXQ will also not add
any incompatibility.

With MQ RX, I will be able to remove the heuristic (idea
from David Stevens).  The idea is: Guest sends out packets
on, say TXQ#2, vhost#2 processes the packets but packets
going out from host to guest might be sent out on a
different RXQ, say RXQ#4.  Guest receives the packet on
RXQ#4, and all future responses on that connection are sent
on TXQ#4.  Now vhost#4 processes both RX and TX packets for
this connection.  Without needing to hash on the connection,
guest can make sure that the same vhost thread will handle
a single connection.

 Also need to think about how robust our single stream heuristic is,
 e.g. what are the chances it will misdetect a bidirectional
 UDP stream as a single TCP?

I think it should not happen. The heuristic code gets
called for handling just the transmit packets, packets
that vhost sends out to the guest skip this path.

I tested unidirectional and bidirectional UDP to confirm:

8 iterations of iperf tests, each iteration of 15 secs,
result is the sum of all 8 iterations in Gbits/sec
__
Uni-directional        Bi-directional
  Org      New           Org      New
______________________________________
  71.78    71.77         71.74    72.07
__

Thanks,

- KK
