Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 09:25:34 PM:

> > Sure, will get a build/test on latest bits and send in 1-2 days.
> >
> > > > The TX-only patch helped the guest TX path but didn't help
> > > > host -> guest much (as tested using TCP_MAERTS from the
> > > > guest). But with the TX+RX patch, both directions are getting
> > > > improvements.
> > >
> > > Also, my hope is that with appropriate queue mapping, we might
> > > be able to do away with heuristics to detect single stream load
> > > that TX only code needs.
> >
> > Yes, that whole stuff is removed, and the TX/RX path is unchanged
> > with this patch (thankfully :)
>
> Cool. I was wondering whether in that case, we can do without host
> kernel changes at all, and use a separate fd for each TX/RX pair.
> The advantage of that approach is that this way, the max fd limit
> naturally sets an upper bound on the amount of resources userspace
> can use up. Thoughts?
>
> In any case, pls don't let the above delay sending an RFC.

I will look into this also. Please excuse the delay in sending the
patch out faster - my bits are a little old, so it is taking some
time to move to the latest kernel and get some initial TCP/UDP test
results. I should have it ready by tomorrow.

Thanks,

- KK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 12:18:36PM +0530, Krishna Kumar2 wrote:
> Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM:
>
> Hi Michael,
>
> > > Yes. Michael Tsirkin had wanted to see how the MQ RX patch
> > > would look like, so I was in the process of getting the two
> > > working together. The patch is ready and is being tested.
> > > Should I send a RFC patch at this time?
> >
> > Yes, please do.
>
> Sure, will get a build/test on latest bits and send in 1-2 days.
>
> > > The TX-only patch helped the guest TX path but didn't help
> > > host -> guest much (as tested using TCP_MAERTS from the guest).
> > > But with the TX+RX patch, both directions are getting
> > > improvements.
> >
> > Also, my hope is that with appropriate queue mapping, we might be
> > able to do away with heuristics to detect single stream load that
> > TX only code needs.
>
> Yes, that whole stuff is removed, and the TX/RX path is unchanged
> with this patch (thankfully :)

Cool. I was wondering whether in that case, we can do without host
kernel changes at all, and use a separate fd for each TX/RX pair.
The advantage of that approach is that this way, the max fd limit
naturally sets an upper bound on the amount of resources userspace
can use up. Thoughts?

In any case, pls don't let the above delay sending an RFC.

> > > Remote testing is still to be done.
> >
> > Others might be able to help here once you post the patch.
>
> That's great, will appreciate any help.
>
> Thanks,
>
> - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
> Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:
>
> Hi Simon,
>
> > I have a few questions about the results below:
> >
> > 1. Are the (%) comparisons between non-mq and mq virtio?
>
> Yes - mainline kernel with transmit-only MQ patch.
>
> > 2. Was UDP or TCP used?
>
> TCP. I had done some initial testing on UDP, but don't have the
> results now as it is really old. But I will be running it again.
>
> > 3. What was the transmit size (-m option to netperf)?
>
> I didn't use the -m option, so it defaults to 16K. The script does:
>
>     netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
>
> > Also, I'm interested to know what the status of these patches is.
> > Are you planning a fresh series?
>
> Yes. Michael Tsirkin had wanted to see how the MQ RX patch would
> look like, so I was in the process of getting the two working
> together. The patch is ready and is being tested. Should I send a
> RFC patch at this time?
>
> The TX-only patch helped the guest TX path but didn't help
> host -> guest much (as tested using TCP_MAERTS from the guest). But
> with the TX+RX patch, both directions are getting improvements.
> Remote testing is still to be done.

Hi Krishna, thanks for clarifying the test results. I'm looking
forward to the forthcoming RFC patches.
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Oct 20, 2010 at 02:24:52PM +0530, Krishna Kumar wrote:
> Following set of patches implement transmit MQ in virtio-net. Also
> included are the user qemu changes. MQ is disabled by default
> unless qemu specifies it.

Hi Krishna,

I have a few questions about the results below:

1. Are the (%) comparisons between non-mq and mq virtio?
2. Was UDP or TCP used?
3. What was the transmit size (-m option to netperf)?

Also, I'm interested to know what the status of these patches is.
Are you planning a fresh series?

> Changes from rev2:
> ------------------
> 1. Define (in virtio_net.h) the maximum send txqs; and use in
>    virtio-net and vhost-net.
> 2. vi->sq[i] is allocated individually, resulting in cache line
>    aligned sq[0] to sq[n]. Another option was to define
>    'send_queue' as:
>        struct send_queue {
>                struct virtqueue *svq;
>                struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>        } ____cacheline_aligned_in_smp;
>    and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the
>    submitted method is preferable.
> 3. Changed vhost model such that vhost[0] handles RX and
>    vhost[1-MAX] handles TX[0-n].
> 4. Further change TX handling such that vhost[0] handles both RX/TX
>    for single stream case.
>
> Enabling MQ on virtio:
> ----------------------
> When the following options are passed to qemu:
>     - smp > 1
>     - vhost=on
>     - mq=on (new option, default: off)
> then #txqueues = #cpus. The #txqueues can be changed by using an
> optional 'numtxqs' option, e.g. for a smp=4 guest:
>     vhost=on                  ->  #txqueues = 1
>     vhost=on,mq=on            ->  #txqueues = 4
>     vhost=on,mq=on,numtxqs=2  ->  #txqueues = 2
>     vhost=on,mq=on,numtxqs=8  ->  #txqueues = 8
>
> Performance (guest -> local host):
> ----------------------------------
> System configuration:
>     Host:  8 Intel Xeon, 8 GB memory
>     Guest: 4 cpus, 2 GB memory
> Test: Each test case runs for 60 secs, sum over three runs (except
> when number of netperf sessions is 1, which has 10 runs of 12 secs
> each). No tuning (default netperf) other than taskset'ing vhosts to
> cpus 0-3. numtxqs=32 gave the best results though the guest had
> only 4 vcpus (I haven't tried beyond that).
>
> ____________________________________________________
>                numtxqs=2, vhosts=3
> #sessions  BW%     CPU%    RCPU%   SD%     RSD%
> 1          4.46    -1.96   .19     -12.50  -6.06
> 2          4.93    -1.16   2.10    0       -2.38
> 4          46.17   64.77   33.72   19.51   -2.48
> 8          47.89   70.00   36.23   41.46   13.35
> 16         48.97   80.44   40.67   21.11   -5.46
> 24         49.03   78.78   41.22   20.51   -4.78
> 32         51.11   77.15   42.42   15.81   -6.87
> 40         51.60   71.65   42.43   9.75    -8.94
> 48         50.10   69.55   42.85   11.80   -5.81
> 64         46.24   68.42   42.67   14.18   -3.28
> 80         46.37   63.13   41.62   7.43    -6.73
> 96         46.40   63.31   42.20   9.36    -4.78
> 128        50.43   62.79   42.16   13.11   -1.23
> BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
>
> ____________________________________________________
>                numtxqs=8, vhosts=5
> #sessions  BW%     CPU%    RCPU%   SD%     RSD%
> 1          -.76    -1.56   2.33    0       3.03
> 2          17.41   11.11   11.41   0       -4.76
> 4          42.12   55.11   30.20   19.51   .62
> 8          54.69   80.00   39.22   24.39   -3.88
> 16         54.77   81.62   40.89   20.34   -6.58
> 24         54.66   79.68   41.57   15.49   -8.99
> 32         54.92   76.82   41.79   17.59   -5.70
> 40         51.79   68.56   40.53   15.31   -3.87
> 48         51.72   66.40   40.84   9.72    -7.13
> 64         51.11   63.94   41.10   5.93    -8.82
> 80         46.51   59.50   39.80   9.33    -4.18
> 96         47.72   57.75   39.84   4.20    -7.62
> 128        54.35   58.95   40.66   3.24    -8.63
> BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
>
> ____________________________________________________
>                numtxqs=16, vhosts=5
> #sessions  BW%     CPU%    RCPU%   SD%     RSD%
> 1          -1.43   -3.52   1.55    0       3.03
> 2          33.09   21.63   20.12   -10.00  -9.52
> 4          67.17   94.60   44.28   19.51   -11.80
> 8          75.72   108.14  49.15   25.00   -10.71
> 16         80.34   101.77  52.94   25.93   -4.49
> 24         70.84   93.12   43.62   27.63   -5.03
> 32         69.01   94.16   47.33   29.68   -1.51
> 40         58.56   63.47   25.91   -3.92   -25.85
> 48
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:

Hi Simon,

> I have a few questions about the results below:
>
> 1. Are the (%) comparisons between non-mq and mq virtio?

Yes - mainline kernel with transmit-only MQ patch.

> 2. Was UDP or TCP used?

TCP. I had done some initial testing on UDP, but don't have the
results now as it is really old. But I will be running it again.

> 3. What was the transmit size (-m option to netperf)?

I didn't use the -m option, so it defaults to 16K. The script does:

    netperf -t TCP_STREAM -c -C -l 60 -H $SERVER

> Also, I'm interested to know what the status of these patches is.
> Are you planning a fresh series?

Yes. Michael Tsirkin had wanted to see how the MQ RX patch would
look like, so I was in the process of getting the two working
together. The patch is ready and is being tested. Should I send a
RFC patch at this time?

The TX-only patch helped the guest TX path but didn't help
host -> guest much (as tested using TCP_MAERTS from the guest). But
with the TX+RX patch, both directions are getting improvements.
Remote testing is still to be done.

Thanks,

- KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
> Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:
>
> Hi Simon,
>
> > I have a few questions about the results below:
> >
> > 1. Are the (%) comparisons between non-mq and mq virtio?
>
> Yes - mainline kernel with transmit-only MQ patch.
>
> > 2. Was UDP or TCP used?
>
> TCP. I had done some initial testing on UDP, but don't have the
> results now as it is really old. But I will be running it again.
>
> > 3. What was the transmit size (-m option to netperf)?
>
> I didn't use the -m option, so it defaults to 16K. The script does:
>
>     netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
>
> > Also, I'm interested to know what the status of these patches is.
> > Are you planning a fresh series?
>
> Yes. Michael Tsirkin had wanted to see how the MQ RX patch would
> look like, so I was in the process of getting the two working
> together. The patch is ready and is being tested. Should I send a
> RFC patch at this time?

Yes, please do.

> The TX-only patch helped the guest TX path but didn't help
> host -> guest much (as tested using TCP_MAERTS from the guest). But
> with the TX+RX patch, both directions are getting improvements.

Also, my hope is that with appropriate queue mapping, we might be
able to do away with heuristics to detect single stream load that TX
only code needs.

> Remote testing is still to be done.

Others might be able to help here once you post the patch.

> Thanks,
>
> - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM:

Hi Michael,

> > Yes. Michael Tsirkin had wanted to see how the MQ RX patch would
> > look like, so I was in the process of getting the two working
> > together. The patch is ready and is being tested. Should I send a
> > RFC patch at this time?
>
> Yes, please do.

Sure, will get a build/test on latest bits and send in 1-2 days.

> > The TX-only patch helped the guest TX path but didn't help
> > host -> guest much (as tested using TCP_MAERTS from the guest).
> > But with the TX+RX patch, both directions are getting
> > improvements.
>
> Also, my hope is that with appropriate queue mapping, we might be
> able to do away with heuristics to detect single stream load that
> TX only code needs.

Yes, that whole stuff is removed, and the TX/RX path is unchanged
with this patch (thankfully :)

> > Remote testing is still to be done.
>
> Others might be able to help here once you post the patch.

That's great, will appreciate any help.

Thanks,

- KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tue, Nov 09, 2010 at 10:54:57PM +0530, Krishna Kumar2 wrote:
> Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 09:03:25 PM:
>
> > Something strange here, right?
> >
> > 1. You are consistently getting 10G/s here, and even with a
> >    single stream?
>
> Sorry, I should have mentioned this though I had stated in my
> earlier mails. Each test result has two iterations, each of 60
> seconds, except when #netperfs is 1 for which I do 10 iterations
> (sum across 10 iterations). So need to divide the number by 10?
> Yes, that is what I get with 512/1K macvtap I/O size :)
>
> I started doing many more iterations for 1 netperf after finding
> the issue earlier with single stream. So the BW is only 4.5-7 Gbps.
>
> > 2. With 2 streams, is where we get 10G/s originally. Instead of
> >    doubling that we get a marginal improvement with 2 queues and
> >    about 30% worse with 1 queue.
>
> (doubling happens consistently for guest -> host, but never for
> remote host)
>
> I tried 512/txqs=2 and 1024/txqs=8 to get a varied testing
> scenario. In the first case, there is a slight improvement in BW
> and a good reduction in SD. In the second case, only SD improves
> (though BW drops for 2 streams for some reason). In both cases, BW
> and SD improve as the number of sessions increases.

I guess this is another indication that something's wrong.

> The patch - both virtio-net and vhost-net - doesn't have any
> locking/mutexes or any synchronization method. Guest -> host
> performance improvement of up to 100% shows the patch is not doing
> anything wrong.

My concern is this: we don't seem to do anything in tap or macvtap
to help packets from separate virtio queues get to separate queues
in the hardware device, and to avoid reordering when we do this.
- skb_tx_hash calculation will get different results
- hash math that e.g. tcp does will run on the guest and seems to
  be discarded
etc.

Maybe it's as simple as some tap/macvtap ioctls to set up the queue
number in skbs. Or maybe we need to pass the skb hash from guest to
host. It's this last option that should make us especially cautious,
as it'll affect the guest/host interface.

Also see d5a9e24afb4ab38110ebb777588ea0bd0eacbd0a: if we have
hardware which records an RX queue, it appears important to pass
that info to the guest and to use it in selecting the TX queue. Of
course we won't see this in netperf runs, but this needs to be given
thought too - supporting it seems to suggest either sticking the
hash in the virtio net header for both tx and rx, or using multiple
RX queues.

> > We are quite far from line rate, the fact BW does not scale means
> > there's some contention in the code.
>
> Attaining line speed with macvtap seems to be a generic issue and
> unrelated to my patch specifically. IMHO if there is nothing wrong
> in the code (review) and it is accepted, it will benefit as others
> can also help to find what needs to be implemented in
> vhost/macvtap/qemu to get line speed for guest -> remote host.

No problem, I will queue these patches in some branch to help enable
cooperation, as well as help you iterate with incremental patches
instead of resending it all each time.

> PS: bare-metal performance for host -> remote host is also 2.7 Gbps
> and 2.8 Gbps for 512/1024 for the same card.
>
> Thanks,
>
> - KK

You mean native linux BW does not scale for your host with # of
connections either? I guess this just means we need another setup
for testing?
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tue, Nov 09, 2010 at 08:58:44PM +0530, Krishna Kumar2 wrote:
> Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 06:52:39 PM:
>
> > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> > >
> > > Any feedback, comments, objections, issues or bugs about the
> > > patches? Please let me know if something needs to be done.
> > >
> > > Some more test results:
> > > ______________________________________
> > >       Host -> Guest BW (numtxqs=2)
> > > #     BW%     CPU%    RCPU%   SD%     RSD%
> > > ______________________________________
> >
> > I think we discussed the need for external to guest testing over
> > 10G. For large messages we should not see any change but you
> > should be able to get better numbers for small messages assuming
> > a MQ NIC card.
>
> I had to make a few changes to qemu (and a minor change in the
> macvtap driver) to get multiple TXQ support using macvtap working.
> The NIC is an ixgbe card.
>
> [...]
>
> Is there anything else you would like me to test/change, or shall I
> submit the next version (with the above macvtap changes)?
>
> Thanks,
>
> - KK

Something strange here, right?

1. You are consistently getting 10G/s here, and even with a single
   stream?

2. With 2 streams, is where we get 10G/s originally. Instead of
   doubling that we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tue, Nov 09, 2010 at 10:08:21AM +0530, Krishna Kumar2 wrote:
> Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 02:27:09 PM:
>
> > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> > >
> > > Any feedback, comments, objections, issues or bugs about the
> > > patches? Please let me know if something needs to be done.
> > >
> > > Some more test results:
> > > ______________________________________
> > >       Host -> Guest BW (numtxqs=2)
> > > #     BW%     CPU%    RCPU%   SD%     RSD%
> > > ______________________________________
> >
> > I think we discussed the need for external to guest testing over
> > 10G. For large messages we should not see any change but you
> > should be able to get better numbers for small messages assuming
> > a MQ NIC card.
>
> I had to make a few changes to qemu (and a minor change in the
> macvtap driver) to get multiple TXQ support using macvtap working.
> The NIC is an ixgbe card.
>
> [...]
>
> Is there anything else you would like me to test/change, or shall I
> submit the next version (with the above macvtap changes)?
>
> Thanks,
>
> - KK

Something strange here, right?

1. You are consistently getting 10G/s here, and even with a single
   stream?

2. With 2 streams, is where we get 10G/s originally. Instead of
   doubling that we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.

Is your card MQ?

--
MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 09:03:25 PM:

> Something strange here, right?
>
> 1. You are consistently getting 10G/s here, and even with a single
>    stream?

Sorry, I should have mentioned this though I had stated in my
earlier mails. Each test result has two iterations, each of 60
seconds, except when #netperfs is 1 for which I do 10 iterations
(sum across 10 iterations). So need to divide the number by 10? Yes,
that is what I get with 512/1K macvtap I/O size :)

I started doing many more iterations for 1 netperf after finding the
issue earlier with single stream. So the BW is only 4.5-7 Gbps.

> 2. With 2 streams, is where we get 10G/s originally. Instead of
>    doubling that we get a marginal improvement with 2 queues and
>    about 30% worse with 1 queue.

(doubling happens consistently for guest -> host, but never for
remote host)

I tried 512/txqs=2 and 1024/txqs=8 to get a varied testing scenario.
In the first case, there is a slight improvement in BW and a good
reduction in SD. In the second case, only SD improves (though BW
drops for 2 streams for some reason). In both cases, BW and SD
improve as the number of sessions increases.

The patch - both virtio-net and vhost-net - doesn't have any
locking/mutexes or any synchronization method. Guest -> host
performance improvement of up to 100% shows the patch is not doing
anything wrong.

> We are quite far from line rate, the fact BW does not scale means
> there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue and
unrelated to my patch specifically. IMHO if there is nothing wrong
in the code (review) and it is accepted, it will benefit as others
can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest -> remote host.

PS: bare-metal performance for host -> remote host is also 2.7 Gbps
and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 02:27:09 PM:

> Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> >
> > Any feedback, comments, objections, issues or bugs about the
> > patches? Please let me know if something needs to be done.
> >
> > Some more test results:
> > ______________________________________
> >       Host -> Guest BW (numtxqs=2)
> > #     BW%     CPU%    RCPU%   SD%     RSD%
> > ______________________________________
>
> I think we discussed the need for external to guest testing over
> 10G. For large messages we should not see any change but you should
> be able to get better numbers for small messages assuming a MQ NIC
> card.

I had to make a few changes to qemu (and a minor change in the
macvtap driver) to get multiple TXQ support using macvtap working.
The NIC is an ixgbe card.

__________________________________________________________________
       Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
#    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
__________________________________________________________________
1    14367  13142 (-8.5)   56      62 (10.7)      8       8 (0)
2    3652   3855 (5.5)     37      35 (-5.4)      7       6 (-14.2)
4    12529  12059 (-3.7)   65      77 (18.4)      35      35 (0)
8    13912  14668 (5.4)    288     332 (15.2)     175     184 (5.1)
16   13433  14455 (7.6)    1218    1321 (8.4)     920     943 (2.5)
24   12750  13477 (5.7)    2876    2985 (3.7)     2514    2348 (-6.6)
32   11729  12632 (7.6)    5299    5332 (.6)      4934    4497 (-8.8)
40   11061  11923 (7.7)    8482    8364 (-1.3)    8374    7495 (-10.4)
48   10624  11267 (6.0)    12329   12258 (-.5)    12762   11538 (-9.5)
64   10524  10596 (.6)     21689   22859 (5.3)    23626   22403 (-5.1)
80   9856   10284 (4.3)    35769   36313 (1.5)    39932   36419 (-8.7)
96   9691   10075 (3.9)    52357   52259 (-.1)    58676   53463 (-8.8)
128  9351   9794 (4.7)     114707  94275 (-17.8)  114050  97337 (-14.6)
__________________________________________________________________
Avg: BW: (3.3)  SD: (-7.3)  RSD: (-11.0)

__________________________________________________________________
       Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
#    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
__________________________________________________________________
1    16509  15985 (-3.1)   45      47 (4.4)       7       7 (0)
2    6963   4499 (-35.3)   17      51 (200.0)     7       7 (0)
4    12932  11080 (-14.3)  49      74 (51.0)      35      35 (0)
8    13878  14095 (1.5)    223     292 (30.9)     175     181 (3.4)
16   13440  13698 (1.9)    980     1131 (15.4)    926     942 (1.7)
24   12680  12927 (1.9)    2387    2463 (3.1)     2526    2342 (-7.2)
32   11714  12261 (4.6)    4506    4486 (-.4)     4941    4463 (-9.6)
40   11059  11651 (5.3)    7244    7081 (-2.2)    8349    7437 (-10.9)
48   10580  11095 (4.8)    10811   10500 (-2.8)   12809   11403 (-10.9)
64   10569  10566 (0)      19194   19270 (.3)     23648   21717 (-8.1)
80   9827   10753 (9.4)    31668   29425 (-7.0)   39991   33824 (-15.4)
96   10043  10150 (1.0)    45352   44227 (-2.4)   57766   51131 (-11.4)
128  9360   9979 (6.6)     92058   79198 (-13.9)  114381  92873 (-18.8)
__________________________________________________________________
Avg: BW: (-.5)  SD: (-7.5)  RSD: (-14.7)

Is there anything else you would like me to test/change, or shall I
submit the next version (with the above macvtap changes)?

Thanks,

- KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
>
> > Results for UDP BW tests (unidirectional, sum across 3
> > iterations, each iteration of 45 seconds, default netperf, vhosts
> > bound to cpus 0-3; no other tuning):
> >
> > > Is binding vhost threads to CPUs really required? What happens
> > > if we let the scheduler do its job?
> >
> > Nothing drastic, I remember BW% and SD% both improved a bit as a
> > result of binding.
> >
> > > If there's a significant improvement this would mean that we
> > > need to rethink the vhost-net interaction with the scheduler.
> >
> > I will get a test run with and without binding and post the
> > results later today.
>
> Correction: The result with binding is much better for SD/CPU
> compared to without-binding:

Something that was suggested to me off-list is trying to set smp
affinity for the NIC: in the host-to-guest case probably virtio-net,
for external-to-guest the host NIC as well.

--
MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote: Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM: Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding. If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler. I will get a test run with and without binding and post the results later today. Correction: The result with binding is much better for SD/CPU compared to without-binding: Can you pls try to find out why that is? Is some thread bouncing between CPUs? Does a wrong NUMA node get picked up? In practice users are very unlikely to pin threads to CPUs. -- MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Fri, 29 Oct 2010 13:26 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote: Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM: In practice users are very unlikely to pin threads to CPUs. I may be misunderstanding what you're referring to. It caught my attention since I'm working on a configuration to do what you say is unlikely, so I'll chime in for what it's worth. An option in Vyatta allows assigning CPU affinity to network adapters, since apparently separate L2 caches can have a significant impact on throughput. Although much of their focus seems to be on commercial virtualization platforms, I do see quite a few forum posts with regard to KVM. Maybe this still qualifies as an edge case, but as for virtualized routing theirs seems to offer the most functionality. http://www.vyatta.org/forum/viewtopic.php?t=2697 -cb
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card. For external host, there is a contention among different queues (vhosts) when packets are processed in tun/bridge, unless I implement MQ TX for macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5% average) in BW and a rise in SD (between 10-15%). For remote host, I think tun/macvtap needs MQ TX support? Confused. I thought this *is* with a multiqueue tun/macvtap? bridge does not do any queueing AFAIK ... I think we need to fix the contention. With migration what was guest to host a minute ago might become guest to external now ... Macvtap RX is MQ but not TX. I don't think MQ TX support is required for macvtap, though. Is it enough for existing macvtap sendmsg to work, since it calls dev_queue_xmit which selects the txq for the outgoing device? Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card. For external host, there is a contention among different queues (vhosts) when packets are processed in tun/bridge, unless I implement MQ TX for macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5% average) in BW and a rise in SD (between 10-15%). For remote host, I think tun/macvtap needs MQ TX support? Confused. I thought this *is* with a multiqueue tun/macvtap? bridge does not do any queueing AFAIK ... I think we need to fix the contention. With migration what was guest to host a minute ago might become guest to external now ... Macvtap RX is MQ but not TX. I don't think MQ TX support is required for macvtap, though. Is it enough for existing macvtap sendmsg to work, since it calls dev_queue_xmit which selects the txq for the outgoing device? Thanks, - KK I think there would be an issue with using a single poll notifier and contention on send buffer atomic variable. Is tun different than macvtap? We need to support both long term ... -- MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM: Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding. If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler. I will get a test run with and without binding and post the results later today. Correction: The result with binding is much better for SD/CPU compared to without-binding:

numtxqs=8, vhosts=5, Bind vs No-bind
#     BW%      CPU%     RCPU%    SD%      RSD%
1     11.25    10.77    1.89     0        -6.06
2     18.66    7.20     7.20     -14.28   -7.40
4     4.24     -1.27    1.56     -2.70    -.98
8     14.91    -3.79    5.46     -12.19   -3.76
16    12.32    -8.67    4.63     -35.97   -26.66
24    11.68    -7.83    5.10     -40.73   -32.37
32    13.09    -10.51   6.57     -51.52   -42.28
40    11.04    -4.12    11.23    -50.69   -42.81
48    8.61     -10.30   6.04     -62.38   -55.54
64    7.55     -6.05    6.41     -61.20   -56.04
80    8.74     -11.45   6.29     -72.65   -67.17
96    9.84     -6.01    9.87     -69.89   -64.78
128   5.57     -6.23    8.99     -75.03   -70.97
BW: 10.4%, CPU/RCPU: -7.4%, 7.7%, SD/RSD: -70.5%, -65.7%

Notes: 1. All my earlier test results were with vhost bound to cpus 0-3 for both org and new kernel. 2. I am not using MST's use_mq patch, only mainline kernel. However, I reported earlier that I got better results with that patch. The result for MQ vs MQ+use_mm patch (from my earlier mail): BW: 0 CPU/RCPU: -4.2,-6.1 SD/RSD: -13.1,-15.6 Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 04:39:13 PM: (merging two posts into one) I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card. For external host, there is a contention among different queues (vhosts) when packets are processed in tun/bridge, unless I implement MQ TX for macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5% average) in BW and a rise in SD (between 10-15%). For remote host, I think tun/macvtap needs MQ TX support? Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding. If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler. I will get a test run with and without binding and post the results later today. Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Thu, Oct 28, 2010 at 10:44:14AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 04:39:13 PM: (merging two posts into one) I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card. For external host, there is a contention among different queues (vhosts) when packets are processed in tun/bridge, unless I implement MQ TX for macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5% average) in BW and a rise in SD (between 10-15%). For remote host, I think tun/macvtap needs MQ TX support? Confused. I thought this *is* with a multiqueue tun/macvtap? bridge does not do any queueing AFAIK ... I think we need to fix the contention. With migration what was guest to host a minute ago might become guest to external now ... Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding. If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler. I will get a test run with and without binding and post the results later today. Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote: Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM: Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done. Some more test results: _ Host-Guest BW (numtxqs=2) # BW% CPU% RCPU% SD% RSD% _ I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card. -- MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/26/2010 10:40:35 AM: I am trying to wrap my head around kernel/user interface here. E.g., will we need another incompatible change when we add multiple RX queues? Though I added a 'mq' option to qemu, there shouldn't be any incompatibility between old and new qemu's wrt vhost and virtio-net drivers. So the old qemu will run new host and new guest without issues, and new qemu can also run old host and old guest. Multiple RXQ will also not add any incompatibility. With MQ RX, I will be able to remove the heuristic (idea from David Stevens). The idea is: Guest sends out packets on, say TXQ#2, vhost#2 processes the packets but packets going out from host to guest might be sent out on a different RXQ, say RXQ#4. Guest receives the packet on RXQ#4, and all future responses on that connection are sent on TXQ#4. Now vhost#4 processes both RX and TX packets for this connection. Without needing to hash on the connection, guest can make sure that the same vhost thread will handle a single connection. Also need to think about how robust our single stream heuristic is, e.g. what are the chances it will misdetect a bidirectional UDP stream as a single TCP? I think it should not happen. The heuristic code gets called for handling just the transmit packets; packets that vhost sends out to the guest skip this path.
I tested unidirectional and bidirectional UDP to confirm: 8 iterations of iperf tests, each iteration of 15 secs, result is the sum of all 8 iterations in Gbits/sec:

          Uni-directional        Bi-directional
          Org      New           Org      New
          71.78    71.77         71.74    72.07

Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning):

numtxqs=8, vhosts=5
#     BW%      CPU%     SD%
1     .49      1.07     0
2     23.51    52.51    26.66
4     75.17    72.43    8.57
8     86.54    80.21    27.85
16    92.37    85.99    6.27
24    91.37    84.91    8.41
32    89.78    82.90    3.31
48    89.85    79.95    -3.57
64    85.83    80.28    2.22
80    88.90    79.47    -23.18
96    90.12    79.98    14.71
128   86.13    80.60    4.42
BW: 71.3%, CPU: 80.4%, SD: 1.2%

numtxqs=16, vhosts=5
#     BW%      CPU%     SD%
1     1.80     0        0
2     19.81    50.68    26.66
4     57.31    52.77    8.57
8     108.44   88.19    -5.21
16    106.09   85.03    -4.44
24    102.34   84.23    -.82
32    102.77   82.71    -5.81
48    100.00   79.62    -7.29
64    96.86    79.75    -6.10
80    99.26    79.82    -27.34
96    94.79    80.02    -5.08
128   98.14    81.15    -15.25
BW: 77.9%, CPU: 80.4%, SD: -13.6%

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote: Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? -- MST
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote: Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding. I started binding vhost threads after Avi suggested it in response to my v1 patch (he suggested some more that I haven't done), and have been doing only this tuning ever since. This is part of his mail for the tuning: vhost: thread #0: CPU0 thread #1: CPU1 thread #2: CPU2 thread #3: CPU3 I simply bound each thread to CPU0-3 instead. Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tue, Oct 26, 2010 at 03:31:39PM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote: Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning): Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job? Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding. If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler. I started binding vhost threads after Avi suggested it in response to my v1 patch (he suggested some more that I haven't done), and have been doing only this tuning ever since. This is part of his mail for the tuning: vhost: thread #0: CPU0 thread #1: CPU1 thread #2: CPU2 thread #3: CPU3 I simply bound each thread to CPU0-3 instead. Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM: Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done. Some more test results: _ Host-Guest BW (numtxqs=2) # BW% CPU%RCPU% SD% RSD% _ 1 5.53.31 .67 -5.88 0 2 -2.11 -1.01 -2.08 4.340 4 13.53 10.77 13.87 -1.96 0 8 34.22 22.80 30.53 -8.46 -2.50 16 30.89 24.06 35.17 -5.20 3.20 24 33.22 26.30 43.39 -5.17 7.58 32 30.85 27.27 47.74 -.5915.51 40 33.80 27.33 48.00 -7.42 7.59 48 45.93 26.33 45.46 -12.24 1.10 64 33.51 27.11 45.00 -3.27 10.30 80 39.28 29.21 52.33 -4.88 12.17 96 32.05 31.01 57.72 -1.02 19.05 128 35.66 32.04 60.00 -.6620.41 _ BW: 23.5% CPU/RCPU: 28.6%,51.2% SD/RSD: -2.6%,15.8% Guest-Host 512 byte (numtxqs=2): # BW% CPU%RCPU% SD% RSD% _ 1 3.02-3.84 -4.76 -12.50 -7.69 2 52.77 -15.73 -8.66 -45.31 -40.33 4 -23.14 13.84 7.5050.58 40.81 8 -21.44 28.08 16.32 63.06 47.43 16 33.53 46.50 27.19 7.61-6.60 24 55.77 42.81 30.49 -8.65 -16.48 32 52.59 38.92 29.08 -9.18 -15.63 40 50.92 36.11 28.92 -10.59 -15.30 48 46.63 34.73 28.17 -7.83 -12.32 64 45.56 37.12 28.81 -5.05 -10.80 80 44.55 36.60 28.45 -4.95 -10.61 96 43.02 35.97 28.89 -.11-5.31 128 38.54 33.88 27.19 -4.79 -9.54 _ BW: 34.4% CPU/RCPU: 35.9%,27.8% SD/RSD: -4.1%,-9.3% Thanks, - KK [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Following set of patches implement transmit MQ in virtio-net. Also included is the user qemu changes. MQ is disabled by default unless qemu specifies it. Changes from rev2: -- 1. Define (in virtio_net.h) the maximum send txqs; and use in virtio-net and vhost-net. 2. vi-sq[i] is allocated individually, resulting in cache line aligned sq[0] to sq[n]. Another option was to define 'send_queue' as: struct send_queue { struct virtqueue *svq; struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; } cacheline_aligned_in_smp; and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the submitted method is preferable. 3. 
Changed vhost model such that vhost[0] handles RX and vhost[1-MAX] handles TX[0-n]. 4. Further change TX handling such that vhost[0] handles both RX/TX for single stream case. Enabling MQ on virtio: --- When following options are passed to qemu: - smp 1 - vhost=on - mq=on (new option, default:off) then #txqueues = #cpus. The #txqueues can be changed by using an optional 'numtxqs' option. e.g. for a smp=4 guest: vhost=on - #txqueues = 1 vhost=on,mq=on - #txqueues = 4 vhost=on,mq=on,numtxqs=2 - #txqueues = 2 vhost=on,mq=on,numtxqs=8 - #txqueues = 8 Performance (guest - local host): --- System configuration: Host: 8 Intel Xeon, 8 GB memory Guest: 4 cpus, 2 GB memory Test: Each test case runs for 60 secs, sum over three runs (except when number of netperf sessions is 1, which has 10 runs of 12 secs each). No tuning (default netperf) other than taskset vhost's to cpus 0-3. numtxqs=32 gave the best results though the guest had only 4 vcpus (I haven't tried beyond that). __ numtxqs=2, vhosts=3 #sessions BW% CPU%RCPU%SD% RSD% 1 4.46-1.96 .19 -12.50 -6.06 2 4.93-1.162.10 0 -2.38 4 46.1764.77 33.72 19.51 -2.48 8 47.8970.00 36.23 41.4613.35 16 48.9780.44 40.67 21.11 -5.46 24 49.0378.78 41.22 20.51 -4.78 32 51.1177.15 42.42 15.81 -6.87 40 51.6071.65 42.43 9.75-8.94 48 50.1069.55 42.85 11.80 -5.81 64 46.2468.42 42.67 14.18 -3.28 80 46.3763.13 41.62 7.43-6.73 96 46.4063.31 42.20 9.36-4.78 12850.4362.79 42.16 13.11 -1.23 BW: 37.2%,
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote: Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM: Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done. I am trying to wrap my head around kernel/user interface here. E.g., will we need another incompatible change when we add multiple RX queues? Also need to think about how robust our single stream heuristic is, e.g. what are the chances it will misdetect a bidirectional UDP stream as a single TCP? Some more test results: _ Host-Guest BW (numtxqs=2) # BW% CPU%RCPU% SD% RSD% _ 1 5.53.31 .67 -5.88 0 2 -2.11 -1.01 -2.08 4.340 4 13.53 10.77 13.87 -1.96 0 8 34.22 22.80 30.53 -8.46 -2.50 16 30.89 24.06 35.17 -5.20 3.20 24 33.22 26.30 43.39 -5.17 7.58 32 30.85 27.27 47.74 -.5915.51 40 33.80 27.33 48.00 -7.42 7.59 48 45.93 26.33 45.46 -12.24 1.10 64 33.51 27.11 45.00 -3.27 10.30 80 39.28 29.21 52.33 -4.88 12.17 96 32.05 31.01 57.72 -1.02 19.05 128 35.66 32.04 60.00 -.6620.41 _ BW: 23.5% CPU/RCPU: 28.6%,51.2% SD/RSD: -2.6%,15.8% Guest-Host 512 byte (numtxqs=2): # BW% CPU%RCPU% SD% RSD% _ 1 3.02-3.84 -4.76 -12.50 -7.69 2 52.77 -15.73 -8.66 -45.31 -40.33 4 -23.14 13.84 7.5050.58 40.81 8 -21.44 28.08 16.32 63.06 47.43 16 33.53 46.50 27.19 7.61-6.60 24 55.77 42.81 30.49 -8.65 -16.48 32 52.59 38.92 29.08 -9.18 -15.63 40 50.92 36.11 28.92 -10.59 -15.30 48 46.63 34.73 28.17 -7.83 -12.32 64 45.56 37.12 28.81 -5.05 -10.80 80 44.55 36.60 28.45 -4.95 -10.61 96 43.02 35.97 28.89 -.11-5.31 128 38.54 33.88 27.19 -4.79 -9.54 _ BW: 34.4% CPU/RCPU: 35.9%,27.8% SD/RSD: -4.1%,-9.3% Thanks, - KK [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Following set of patches implement transmit MQ in virtio-net. Also included is the user qemu changes. MQ is disabled by default unless qemu specifies it. Changes from rev2: -- 1. Define (in virtio_net.h) the maximum send txqs; and use in virtio-net and vhost-net. 2. 
vi-sq[i] is allocated individually, resulting in cache line aligned sq[0] to sq[n]. Another option was to define 'send_queue' as: struct send_queue { struct virtqueue *svq; struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; } cacheline_aligned_in_smp; and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the submitted method is preferable. 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX] handles TX[0-n]. 4. Further change TX handling such that vhost[0] handles both RX/TX for single stream case. Enabling MQ on virtio: --- When following options are passed to qemu: - smp 1 - vhost=on - mq=on (new option, default:off) then #txqueues = #cpus. The #txqueues can be changed by using an optional 'numtxqs' option. e.g. for a smp=4 guest: vhost=on - #txqueues = 1 vhost=on,mq=on - #txqueues = 4 vhost=on,mq=on,numtxqs=2 - #txqueues = 2 vhost=on,mq=on,numtxqs=8 - #txqueues = 8 Performance (guest - local host): --- System configuration: Host: 8 Intel Xeon, 8 GB memory Guest: 4 cpus, 2 GB memory Test: Each test case runs for 60 secs, sum over three runs (except when number of netperf sessions is 1, which has 10 runs of 12 secs each). No tuning (default netperf) other than taskset vhost's to cpus 0-3. numtxqs=32 gave the best results though the guest had only 4 vcpus (I haven't tried beyond that). __ numtxqs=2, vhosts=3 #sessions BW% CPU%RCPU%SD% RSD% 1 4.46-1.96 .19 -12.50 -6.06 2 4.93-1.162.10 0 -2.38 4 46.1764.77 33.72 19.51 -2.48 8 47.8970.00 36.23 41.4613.35 16 48.9780.44 40.67 21.11 -5.46 24 49.03
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/25/2010 09:47:18 PM: Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done. I am trying to wrap my head around kernel/user interface here. E.g., will we need another incompatible change when we add multiple RX queues? Though I added a 'mq' option to qemu, there shouldn't be any incompatibility between old and new qemu's wrt vhost and virtio-net drivers. So the old qemu will run new host and new guest without issues, and new qemu can also run old host and old guest. Multiple RXQ will also not add any incompatibility. With MQ RX, I will be able to remove the heuristic (idea from David Stevens). The idea is: Guest sends out packets on, say TXQ#2, vhost#2 processes the packets but packets going out from host to guest might be sent out on a different RXQ, say RXQ#4. Guest receives the packet on RXQ#4, and all future responses on that connection are sent on TXQ#4. Now vhost#4 processes both RX and TX packets for this connection. Without needing to hash on the connection, guest can make sure that the same vhost thread will handle a single connection. Also need to think about how robust our single stream heuristic is, e.g. what are the chances it will misdetect a bidirectional UDP stream as a single TCP? I think it should not happen. The heuristic code gets called for handling just the transmit packets; packets that vhost sends out to the guest skip this path. I tested unidirectional and bidirectional UDP to confirm: 8 iterations of iperf tests, each iteration of 15 secs, result is the sum of all 8 iterations in Gbits/sec:

          Uni-directional        Bi-directional
          Org      New           Org      New
          71.78    71.77         71.74    72.07

Thanks, - KK