Re: Fw: Benchmarking for vhost polling patch
Our suggestion would be to use the maximum (a large enough) value, so that vhost is polling 100% of the time. The polling optimization mainly addresses users who want to maximize their performance, even at the expense of wasting cpu cycles. The maximum value will produce the biggest impact on performance. *Everyone* is interested in getting maximum performance from their systems. Maybe so, but not everyone is willing to pay the price. That is also the reason why this optimization should not be enabled by default. However, using the maximum default value will be valuable even for users who care more about the normalized throughput/cpu criteria. Such users, interested in finer tuning of the polling timeout, need to look for an optimal timeout value for their system. The maximum value serves as the upper limit of the range that needs to be searched for such an optimal timeout value. The number of users who are going to do this kind of tuning can be counted on one hand. If the optimization is not enabled by default, the default value is almost irrelevant, because when users turn on the feature they should understand that there's an associated cost and they have to tune their system if they want to get the maximum benefit (depending on how they define their maximum benefit). The maximum value is a good starting point that will work in most cases and can be used to start the tuning. There are some cases where the networking stack already exposes low-level hardware detail to userspace, e.g. tcp polling configuration. If we can't come up with a way to abstract hardware, maybe we can at least tie it to these existing controls rather than introducing new ones? We've spent time thinking about the possible interfaces that could be appropriate for such an optimization (including tcp polling). We think that using an ioctl as the interface to configure the virtual device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured, makes a lot of sense, and is consistent with the existing mechanism. Thanks, Razya The guest is giving up its share of CPU for the benefit of vhost, right? So maybe exposing this to the guest is appropriate, and then add e.g. an ethtool interface for the guest admin to set this. The decision of whether to turn polling on (and with what rate) should be made by the system administrator, who has a broad view of the system and workload, and not by the guest administrator. Polling should be a tunable parameter from the host side; the guest should not be aware of it. The guest is not necessarily giving up its time. It may be that there's just an extra dedicated core or free cpu cycles on a different cpu. We provide a mechanism and an interface that can be tuned by some other program to implement its policy. This patch is all about the mechanism and not the policy of how to use it. Thank you, Razya
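To make the proposed interface concrete, here is a rough sketch of what a per-device polling-timeout ioctl could look like, modeled on the existing VHOST_* ioctls. The ioctl name, the request number 0x70, and the argument type are illustrative assumptions only; no such ioctl exists in the patch as posted.

/* Hypothetical uapi addition plus the userspace (e.g. QEMU) caller, as a sketch. */
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

#define VHOST_VIRTIO 0xAF
/* Polling timeout in microseconds; 0 would disable polling for this device. */
#define VHOST_SET_POLL_TIMEOUT _IOW(VHOST_VIRTIO, 0x70, __u32)

static int vhost_set_poll_timeout(int vhost_fd, __u32 timeout_us)
{
        /* vhost_fd is the open /dev/vhost-net (or similar) descriptor. */
        return ioctl(vhost_fd, VHOST_SET_POLL_TIMEOUT, &timeout_us);
}

A management layer could then choose the timeout per device according to a host-side policy, instead of relying on a single global module parameter.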
Re: Fw: Benchmarking for vhost polling patch
Michael S. Tsirkin m...@redhat.com wrote on 12/01/2015 12:36:13 PM: From: Michael S. Tsirkin m...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org, Eyal Moscovici/Haifa/IBM@IBMIL Date: 12/01/2015 12:36 PM Subject: Re: Fw: Benchmarking for vhost polling patch On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote: Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind: - What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting? - We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs? The extra cpu waste vs throughput gains depends on the polling timeout value (poll_stop_idle). The best value to choose is dependent on the workload and the system hardware and configuration. There is nothing that we can say about this value in advance. The system's manager/administrator should use this optimization with the awareness that polling consumes extra cpu cycles, as documented. - Should this be tunable from userspace per vhost instance? Why is it only tunable globally? It should be tunable per vhost thread. We can do it in a subsequent patch. So I think whether the patchset is appropriate upstream will depend exactly on coming up with a reasonable interface for enabling and tuning the functionality. How about adding a new ioctl for each vhost device that sets the poll_stop_idle (the timeout)? This should be aligned with the QEMU way of doing things. I was hopeful some reasonable default value can be derived from e.g. the cost of the exit. If that is not the case, it becomes that much harder for users to select good default values. Our suggestion would be to use the maximum (a large enough) value, so that vhost is polling 100% of the time. The polling optimization mainly addresses users who want to maximize their performance, even at the expense of wasting cpu cycles. The maximum value will produce the biggest impact on performance. However, using the maximum default value will be valuable even for users who care more about the normalized throughput/cpu criteria. Such users, interested in finer tuning of the polling timeout, need to look for an optimal timeout value for their system. The maximum value serves as the upper limit of the range that needs to be searched for such an optimal timeout value. There are some cases where the networking stack already exposes low-level hardware detail to userspace, e.g. tcp polling configuration. If we can't come up with a way to abstract hardware, maybe we can at least tie it to these existing controls rather than introducing new ones? We've spent time thinking about the possible interfaces that could be appropriate for such an optimization (including tcp polling). We think that using an ioctl as the interface to configure the virtual device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured, makes a lot of sense, and is consistent with the existing mechanism. Thanks, Razya - How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart? - What happens in overcommit scenarios?
Does polling make things much worse? Clearly polling will work worse if e.g. vhost and vcpu share the host cpu. How can we avoid conflicts? For the last two questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream; if it goes in, vhost polling can do something similar. Any data points to shed light on these questions? I ran a simple apache benchmark, with an overcommit scenario, where both the vcpu and vhost share the same core. In some cases (c4 in my testcases) polling surprisingly produced a better throughput. Likely because latency is hurt, so you get better batching? Therefore, it is hard to predict how the polling will impact performance in advance. If it's so hard, users will struggle to configure this properly. Looks like an argument for us developers to do the hard work, and expose simpler controls to users? It is up to whoever is using this optimization to use it wisely. Thanks, Razya
Re: Fw: Benchmarking for vhost polling patch
Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind: - What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting? - We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs? The extra cpu waste vs throughput gains depends on the polling timeout value (poll_stop_idle). The best value to choose is dependent on the workload and the system hardware and configuration. There is nothing that we can say about this value in advance. The system's manager/administrator should use this optimization with the awareness that polling consumes extra cpu cycles, as documented. - Should this be tunable from userspace per vhost instance? Why is it only tunable globally? It should be tunable per vhost thread. We can do it in a subsequent patch. - How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart? - What happens in overcommit scenarios? Does polling make things much worse? Clearly polling will work worse if e.g. vhost and vcpu share the host cpu. How can we avoid conflicts? For the last two questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream; if it goes in, vhost polling can do something similar. Any data points to shed light on these questions? I ran a simple apache benchmark, with an overcommit scenario, where both the vcpu and vhost share the same core. In some cases (c4 in my testcases) polling surprisingly produced a better throughput. Therefore, it is hard to predict how the polling will impact performance in advance. It is up to whoever is using this optimization to use it wisely. Thanks, Razya
Fw: Benchmarking for vhost polling patch
Hi Michael, Just a follow up on the polling patch numbers. Please let me know if you find these numbers satisfying enough to continue with submitting this patch. Otherwise - we'll have this patch submitted as part of the larger Elvis patch set rather than independently. Thank you, Razya - Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM - From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Date: 25/11/2014 02:43 PM Subject: Re: Benchmarking for vhost polling patch Sent by: kvm-ow...@vger.kernel.org Hi Michael, Hi Razya, On the netperf benchmark, it looks like polling=10 gives a modest but measurable gain. So from that perspective it might be worth it if it's not too much code, though we'll need to spend more time checking the macro effect - we barely moved the needle on the macro benchmark and that is suspicious. I ran memcached with various values for the key/value arguments, and managed to see a bigger impact of polling than when I used the default values. Here are the numbers:

key=250, value=2048 | TPS    | net rate | vhost util | vm util | TPS/cpu | TPS/CPU change
polling=0           | 101540 | 103.0    | 46         | 100     | 695.47  |
polling=5           | 136747 | 123.0    | 83         | 100     | 747.25  | 0.074440609
polling=7           | 140722 | 125.7    | 84         | 100     | 764.79  | 0.099663658
polling=10          | 141719 | 126.3    | 87         | 100     | 757.85  | 0.089688003
polling=15          | 142430 | 127.1    | 90         | 100     | 749.63  | 0.077863015
polling=25          | 146347 | 128.7    | 95         | 100     | 750.49  | 0.079107993
polling=50          | 150882 | 131.1    | 100        | 100     | 754.41  | 0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is why we can expect less impact for polling as compared to netperf. However, as shown above, we managed to get 10% TPS/CPU improvement with the polling patch. Is there a chance you are actually trading latency for throughput? Do you observe any effect on latency? No. How about trying some other benchmark, e.g. NFS? Tried, but it didn't produce enough I/O (vhost was at most at 15% util). Also, I am wondering: since the vhost thread is polling in kernel anyway, shouldn't we try and poll the host NIC? That would likely reduce at least the latency significantly, wouldn't it? Yes, it could be a great addition at some point, but needs a thorough investigation. In any case, not a part of this patch... Thanks, Razya
Re: Benchmarking for vhost polling patch
Hi Michael, Hi Razya, On the netperf benchmark, it looks like polling=10 gives a modest but measurable gain. So from that perspective it might be worth it if it's not too much code, though we'll need to spend more time checking the macro effect - we barely moved the needle on the macro benchmark and that is suspicious. I ran memcached with various values for the key/value arguments, and managed to see a bigger impact of polling than when I used the default values. Here are the numbers:

key=250, value=2048 | TPS    | net rate | vhost util | vm util | TPS/cpu | TPS/CPU change
polling=0           | 101540 | 103.0    | 46         | 100     | 695.47  |
polling=5           | 136747 | 123.0    | 83         | 100     | 747.25  | 0.074440609
polling=7           | 140722 | 125.7    | 84         | 100     | 764.79  | 0.099663658
polling=10          | 141719 | 126.3    | 87         | 100     | 757.85  | 0.089688003
polling=15          | 142430 | 127.1    | 90         | 100     | 749.63  | 0.077863015
polling=25          | 146347 | 128.7    | 95         | 100     | 750.49  | 0.079107993
polling=50          | 150882 | 131.1    | 100        | 100     | 754.41  | 0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is why we can expect less impact for polling as compared to netperf. However, as shown above, we managed to get 10% TPS/CPU improvement with the polling patch. Is there a chance you are actually trading latency for throughput? Do you observe any effect on latency? No. How about trying some other benchmark, e.g. NFS? Tried, but it didn't produce enough I/O (vhost was at most at 15% util). Also, I am wondering: since the vhost thread is polling in kernel anyway, shouldn't we try and poll the host NIC? That would likely reduce at least the latency significantly, wouldn't it? Yes, it could be a great addition at some point, but needs a thorough investigation. In any case, not a part of this patch... Thanks, Razya
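For readers following the numbers: the TPS/cpu column appears to be the transaction rate divided by the combined cpu consumption of the vhost thread and the vm (both in percent of a core). For example, 101540 / (46 + 100) is roughly 695.5 for the no-polling row and 136747 / (83 + 100) is roughly 747.25 for polling=5, matching the table; the last column is then the fractional change of this ratio relative to the no-polling row (e.g. 747.25 / 695.47 - 1 is roughly 0.0744).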
Benchmarking for vhost polling patch
Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM: From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org Date: 29/10/2014 02:38 PM Subject: Benchmarking for vhost polling patch Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. If it makes sense to you, I will continue with the other changes requested for the patch. Thank you, Razya Dear Michael, I'm still interested in hearing your opinion about these numbers http://marc.info/?l=kvm&m=141458631532669&w=2, and whether it is worthwhile to continue with the polling patch. Thank you, Razya
Re: Benchmarking for vhost polling patch
Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM: From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org Date: 29/10/2014 02:38 PM Subject: Benchmarking for vhost polling patch Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. If it makes sense to you, I will continue with the other changes requested for the patch. Thank you, Razya Hi Michael, Have you had the chance to look into these numbers? Thank you, Razya
Re: Benchmarking for vhost polling patch
Zhang Haoyu zhan...@sangfor.com wrote on 30/10/2014 01:30:08 PM: From: Zhang Haoyu zhan...@sangfor.com To: Razya Ladelsky/Haifa/IBM@IBMIL, mst m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm kvm@vger.kernel.org Date: 30/10/2014 01:30 PM Subject: Re: Benchmarking for vhost polling patch Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 vm and 1 vhost, each running on their own dedicated core. Could you provide the code for this change? Thanks, Zhang Haoyu Hi Zhang, Do you mean the change in code for poll_stop_idle? Thanks, Razya
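For readers wondering what such a change could look like, here is a minimal sketch of a poll_stop_idle counted in microseconds and checked with ktime instead of jiffies. The parameter default, the helper name, and its arguments are assumptions for illustration; this is not the code actually used in the experiments.

/* Sketch only: stop-idle cutoff kept in microseconds rather than jiffies. */
#include <linux/ktime.h>
#include <linux/module.h>

static int poll_stop_idle = 3000000; /* 3 seconds, expressed in microseconds */
module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of a virtqueue after this many microseconds of no work.");

/* Called from the worker loop; last_work is when the virtqueue last had work. */
static bool vhost_poll_idle_expired(ktime_t last_work)
{
        return ktime_us_delta(ktime_get(), last_work) > poll_stop_idle;
}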
Benchmarking for vhost polling patch
Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 vm and 1 vhost, each running on their own dedicated core.

Here are the numbers for netperf (micro benchmark):

polling   |Send |Throughput|Utilization  |S. Demand    |vhost|exits|throughput|throughput
mode      |Msg  |          |Send   Recv  |Send   Recv  |util |/sec | /cpu     | /cpu
          |Size |          |local  remote|local  remote|     |     |          | % change
          |bytes|10^6bits/s|  %      %   |us/KB  us/KB |  %  |     |          |
-----------------------------------------------------------------------------------------
NoPolling   64   1054.11    99.97  3.01   7.78   3.74   38.80  92K    7.60
Polling=1   64   1036.67    99.97  2.93   7.90   3.70   53.00  92K    6.78     -10.78
Polling=5   64   1079.27    99.97  3.07   7.59   3.73   83.00  90K    5.90     -22.35
Polling=7   64   1444.90    99.97  3.98   5.67   3.61   95.00  19.5K  7.41      -2.44
Polling=10  64   1521.70    99.97  4.21   5.38   3.63   98.00  8.5K   7.69       1.19
Polling=25  64   1534.24    99.97  4.18   5.34   3.57   99.00  8.5K   7.71       1.51
Polling=50  64   1534.24    99.97  4.18   5.34   3.57   99.00  8.5K   7.71       1.51

NoPolling   128  1577.39    99.97  4.09   5.19   3.40   54.00  113K   10.24
Polling=1   128  1596.08    99.97  4.22   5.13   3.47   71.00  120K   9.34      -8.88
Polling=5   128  2238.49    99.97  5.45   3.66   3.19   92.00  24K    11.66     13.82
Polling=7   128  2330.97    99.97  5.59   3.51   3.14   95.00  19.5K  11.96     16.70
Polling=10  128  2375.78    99.97  5.69   3.45   3.14   98.00  10K    12.00     17.14
Polling=25  128  2655.01    99.97  2.45   3.09   1.21   99.00  8.5K   13.34     30.25
Polling=50  128  2655.01    99.97  2.45   3.09   1.21   99.00  8.5K   13.34     30.25

NoPolling   256  2558.10    99.97  2.33   3.20   1.20   67.00  120K   15.32
Polling=1   256  2508.93    99.97  3.13   3.27   1.67   75.00  125K   14.34     -6.41
Polling=5   256  3740.34    99.97  2.70   2.19   0.95   94.00  17K    19.28     25.86
Polling=7   256  3692.69    99.97  2.80   2.22   0.99   97.00  15.5K  18.75     22.37
Polling=10  256  4036.60    99.97  2.69   2.03   0.87   99.00  8.5K   20.29     32.42
Polling=25  256  3998.89    99.97  2.64   2.05   0.87   99.00  8.5K   20.10     31.18
Polling=50  256  3998.89    99.97  2.64   2.05   0.87   99.00  8.5K   20.10     31.18

NoPolling   512  4531.50    99.90  2.75   1.81   0.79   78.00  55K    25.47
Polling=1   512  4684.19    99.95  2.69   1.75   0.75   83.00  35K    25.60      0.52
Polling=5   512  4932.65    99.75  2.75   1.68   0.74   91.00  12K    25.86      1.52
Polling=7   512  5226.14    99.86  2.80   1.57   0.70   95.00  7.5K   26.82      5.30
Polling=10  512  5464.90    99.60  2.90   1.49   0.70   96.00  8.2K   27.94      9.69
Polling=25  512  5550.44    99.58  2.84   1.47   0.67   99.00  7.5K   27.95      9.73
Polling=50  512  5550.44    99.58  2.84   1.47   0.67   99.00  7.5K   27.95      9.73

As you can see from the last column, polling improves performance in most cases.

I ran memcached (macro benchmark), where (as in the previous benchmark) the vm and vhost each get their own dedicated core. I configured memslap with C=128, T=8, as this configuration was required to produce enough load to saturate the vm. I tried several other configurations, but this one produced the maximal throughput (for the baseline). The numbers for memcached (macro benchmark):

polling     |time  |TPS    |Net  |vhost |vm   |exits |TPS/cpu|TPS/cpu
mode        |      |       |rate |util %|util%|/sec  |       |% change
----------------------------------------------------------------------
Disabled     15.9s  125819   91.5   45    99    87K    873.74
polling=1    15.8s  126820   92.3   60    99    87K    797.61   -8.71
polling=5    12.82   155799  113.4   79    99    25.5K  875.28    0.18
polling=10   11.7s  160639  116.9   83    99    16.3K  882.63    1.02
polling=15   12.4s  160897  117.2   87    99    15K    865.04   -1.00
polling=100  11.7s  170971  124.4   99    99    30     863.49   -1.17

For memcached, TPS/cpu does not show a significant difference in any of the cases. However, TPS numbers did improve by up to 35%, which can be useful for under-utilized systems which have cpu time to spare for extra throughput.
If it makes sense to you, I will continue with the other changes requested for the patch. Thank you, Razya
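For clarity, the "% change" columns in both tables above are relative changes of the normalized metric against the no-polling baseline: for example, in the 64B netperf block, 6.78 / 7.60 - 1 is roughly -10.8% for Polling=1, and in the memcached table, 882.63 / 873.74 - 1 is roughly +1.0% for polling=10, matching the reported -10.78 and 1.02.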
RE: [PATCH] vhost: Add polling mode
David Laight david.lai...@aculab.com wrote on 21/08/2014 05:29:41 PM: From: David Laight david.lai...@aculab.com To: Razya Ladelsky/Haifa/IBM@IBMIL, Michael S. Tsirkin m...@redhat.com Cc: abel.gor...@gmail.com abel.gor...@gmail.com, Alex Glikson/ Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/ IBM@IBMIL, kvm@vger.kernel.org kvm@vger.kernel.org, linux- ker...@vger.kernel.org linux-ker...@vger.kernel.org, net...@vger.kernel.org net...@vger.kernel.org, virtualizat...@lists.linux-foundation.org virtualizat...@lists.linux-foundation.org, Yossi Kuperman1/Haifa/IBM@IBMIL Date: 21/08/2014 05:31 PM Subject: RE: [PATCH] vhost: Add polling mode From: Razya Ladelsky Michael S. Tsirkin m...@redhat.com wrote on 20/08/2014 01:57:10 PM: Results: Netperf, 1 vm: The polling patch improved throughput by ~33% (1516 MB/sec - 2046 MB/sec). Number of exits/sec decreased 6x. The same improvement was shown when I tested with 3 vms running netperf (4086 MB/sec - 5545 MB/sec). filebench, 1 vm: ops/sec improved by 13% with the polling patch. Number of exits was reduced by 31%. The same experiment with 3 vms running filebench showed similar numbers. Signed-off-by: Razya Ladelsky ra...@il.ibm.com This really needs more thourough benchmarking report, including system data. One good example for a related patch: http://lwn.net/Articles/551179/ though for virtualization, we need data about host as well, and if you want to look at streaming benchmarks, you need to test different message sizes and measure packet size. Hi Michael, I have already tried running netperf with several message sizes: 64,128,256,512,600,800... But the results are inconsistent even in the baseline/unpatched configuration. For smaller msg sizes, I get consistent numbers. However, at some point, when I increase the msg size I get unstable results. For example, for a 512B msg, I get two scenarios: vm utilization 100%, vhost utilization 75%, throughput ~6300 vm utilization 80%, vhost utilization 13%, throughput ~9400 (line rate) I don't know why vhost is behaving that way for certain message sizes. Do you have any insight to why this is happening? Have you tried looking at the actual ethernet packet sizes. It may well jump between using small packets (the size of the writes) and full sized ones. I will check it, Thanks, Razya If you are trying to measure ethernet packet 'cost' you need to use UDP. However that probably uses different code paths. David -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: Add polling mode
Christian Borntraeger borntrae...@de.ibm.com wrote on 20/08/2014 11:41:32 AM: Results: Netperf, 1 vm: The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec). Number of exits/sec decreased 6x. The same improvement was shown when I tested with 3 vms running netperf (4086 MB/sec -> 5545 MB/sec). filebench, 1 vm: ops/sec improved by 13% with the polling patch. Number of exits was reduced by 31%. The same experiment with 3 vms running filebench showed similar numbers. Signed-off-by: Razya Ladelsky ra...@il.ibm.com Gave it a quick try on s390/kvm. As expected it makes no difference for a big streaming workload like iperf. uperf with a 1-1 round robin indeed got faster by about 30%. The high CPU consumption is something that bothers me though, as virtualized systems tend to be full. Thanks for confirming the results! The best way to use this patch would be along with a shared vhost thread for multiple devices/vms, as described in: http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument This work assumes having a dedicated I/O core where the vhost thread serves multiple vms, which makes the high cpu utilization less of a concern.
+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
This seems ridiculously high. Even one jiffy is an eternity, so setting it to 1 as a default would reduce the CPU overhead for most cases. If we don't have a packet in one millisecond, we can surely go back to the kick approach, I think. Christian Good point, will reduce it and recheck. Thank you, Razya
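To put the numbers in perspective: with HZ=1000 a jiffy is 1 ms, so the default poll_stop_idle = 3*HZ keeps the worker busy-polling for up to 3 seconds (3,000,000 us) after the last piece of work before giving up, while a value of 1 gives up after roughly a millisecond, which is still orders of magnitude longer than the microsecond-scale cost of the guest kick and exit that polling is trying to avoid.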
Re: [PATCH] vhost: Add polling mode
Michael S. Tsirkin m...@redhat.com wrote on 20/08/2014 01:57:10 PM: Results: Netperf, 1 vm: The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec). Number of exits/sec decreased 6x. The same improvement was shown when I tested with 3 vms running netperf (4086 MB/sec -> 5545 MB/sec). filebench, 1 vm: ops/sec improved by 13% with the polling patch. Number of exits was reduced by 31%. The same experiment with 3 vms running filebench showed similar numbers. Signed-off-by: Razya Ladelsky ra...@il.ibm.com This really needs a more thorough benchmarking report, including system data. One good example for a related patch: http://lwn.net/Articles/551179/ though for virtualization, we need data about the host as well, and if you want to look at streaming benchmarks, you need to test different message sizes and measure packet size. Hi Michael, I have already tried running netperf with several message sizes: 64,128,256,512,600,800... But the results are inconsistent even in the baseline/unpatched configuration. For smaller msg sizes, I get consistent numbers. However, at some point, when I increase the msg size I get unstable results. For example, for a 512B msg, I get two scenarios: either vm utilization 100%, vhost utilization 75%, throughput ~6300, or vm utilization 80%, vhost utilization 13%, throughput ~9400 (line rate). I don't know why vhost is behaving that way for certain message sizes. Do you have any insight into why this is happening? Thank you, Razya
Re: [PATCH] vhost: Add polling mode
That was just one example. There are many other possibilities. Either actually make the systems load all host CPUs equally, or divide throughput by host CPU. The polling patch adds this capability to vhost, reducing costly exit overhead when the vm is loaded. In order to load the vm I ran netperf with a msg size of 256:
Without polling: 2480 Mbits/sec, utilization: vm - 100%, vhost - 64%
With polling: 4160 Mbits/sec, utilization: vm - 100%, vhost - 100%
Therefore, throughput/cpu without polling is 15.1, and 20.8 with polling. My intention was to load vhost as close as possible to 100% utilization without polling, in order to compare it to the polling utilization case (where vhost is always 100%). The best use case, of course, would be when the shared vhost thread work (TBD) is integrated and then vhost will actually be using its polling cycles to handle requests of multiple devices (even from multiple vms). Thanks, Razya
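Spelling out the normalization: throughput divided by total cpu (vm plus vhost utilization) gives 2480 / (100 + 64), roughly 15.1 Mbits/sec per percent of cpu without polling, and 4160 / (100 + 100) = 20.8 with polling, i.e. roughly a 38% improvement per cpu cycle spent.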
Re: [PATCH] vhost: Add polling mode
Hi Michael, Sorry for the delay, had some problems with my mailbox, and I realized just now that my reply wasn't sent. The vm indeed ALWAYS utilized 100% cpu, whether polling was enabled or not. The vhost thread utilized less than 100% (of the other cpu) when polling was disabled. Enabling polling increased its utilization to 100% (in which case both cpus were 100% utilized). Hmm this means the testing wasn't successful then, as you said: The idea was to get it 100% loaded, so we can see that the polling is getting it to produce higher throughput. in fact here you are producing more throughput but spending more power to produce it, which can have any number of explanations besides polling improving the efficiency. For example, increasing system load might disable host power management. Hi Michael, I re-ran the tests, this time with the turbo mode and C-states features off. No Polling: 1 VM running netperf (msg size 64B): 1107 Mbits/sec Polling: 1 VM running netperf (msg size 64B): 1572 Mbits/sec As you can see from the new results, the numbers are lower, but relatively (polling on/off) there's no change. Thank you, Razya -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: Add polling mode
Michael S. Tsirkin m...@redhat.com wrote on 12/08/2014 12:18:50 PM: From: Michael S. Tsirkin m...@redhat.com To: David Miller da...@davemloft.net Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, linux-ker...@vger.kernel.org, net...@vger.kernel.org, virtualizat...@lists.linux-foundation.org Date: 12/08/2014 12:18 PM Subject: Re: [PATCH] vhost: Add polling mode On Mon, Aug 11, 2014 at 12:46:21PM -0700, David Miller wrote: From: Michael S. Tsirkin m...@redhat.com Date: Sun, 10 Aug 2014 21:45:59 +0200 On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote: ... And, did your tests actually produce 100% load on both host CPUs? ... Michael, please do not quote an entire patch just to ask a one line question. I truly, truly, wish it was simpler in modern email clients to delete the unrelated quoted material because I bet when people do this they are simply being lazy. Thank you. Lazy - mea culpa, though I'm using mutt so it isn't even hard. The question still stands: the test results are only valid if CPU was at 100% in all configurations. This is the reason I generally prefer it when people report throughput divided by CPU (power would be good too but it still isn't easy for people to get that number). Hi Michael, Sorry for the delay, had some problems with my mailbox, and I realized just now that my reply wasn't sent. The vm indeed ALWAYS utilized 100% cpu, whether polling was enabled or not. The vhost thread utilized less than 100% (of the other cpu) when polling was disabled. Enabling polling increased its utilization to 100% (in which case both cpus were 100% utilized). -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vhost: Add polling mode
From: Razya Ladelsky ra...@il.ibm.com Date: Thu, 31 Jul 2014 09:47:20 +0300 Subject: [PATCH] vhost: Add polling mode When vhost is waiting for buffers from the guest driver (e.g., more packets to send in vhost-net's transmit queue), it normally goes to sleep and waits for the guest to kick it. This kick involves a PIO in the guest, and therefore an exit (and possibly userspace involvement in translating this PIO exit into a file descriptor event), all of which hurts performance. If the system is under-utilized (has cpu time to spare), vhost can continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. This patch adds an optional polling mode to vhost, that can be enabled via a kernel module parameter, poll_start_rate. When polling is active for a virtqueue, the guest is asked to disable notification (kicks), and the worker thread continuously checks for new buffers. When it does discover new buffers, it simulates a kick by invoking the underlying backend driver (such as vhost-net), which thinks it got a real kick from the guest, and acts accordingly. If the underlying driver asks not to be kicked, we disable polling on this virtqueue. We start polling on a virtqueue when we notice it has work to do. Polling on this virtqueue is later disabled after 3 seconds of polling turning up no new work, as in this case we are better off returning to the exit-based notification mechanism. The default timeout of 3 seconds can be changed with the poll_stop_idle kernel module parameter. This polling approach makes lot of sense for new HW with posted-interrupts for which we have exitless host-to-guest notifications. But even with support for posted interrupts, guest-to-host communication still causes exits. Polling adds the missing part. When systems are overloaded, there won't be enough cpu time for the various vhost threads to poll their guests' devices. For these scenarios, we plan to add support for vhost threads that can be shared by multiple devices, even of multiple vms. Our ultimate goal is to implement the I/O acceleration features described in: KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) https://www.youtube.com/watch?v=9EyweibHfEs and https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html I ran some experiments with TCP stream netperf and filebench (having 2 threads performing random reads) benchmarks on an IBM System x3650 M4. I have two machines, A and B. A hosts the vms, B runs the netserver. The vms (on A) run netperf, its destination server is running on B. All runs loaded the guests in a way that they were (cpu) saturated. For example, I ran netperf with 64B messages, which is heavily loading the vm (which is why its throughput is low). The idea was to get it 100% loaded, so we can see that the polling is getting it to produce higher throughput. The system had two cores per guest, as to allow for both the vcpu and the vhost thread to run concurrently for maximum throughput (but I didn't pin the threads to specific cores). My experiments were fair in a sense that for both cases, with or without polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that way). The only difference was whether polling was enabled/disabled. Results: Netperf, 1 vm: The polling patch improved throughput by ~33% (1516 MB/sec - 2046 MB/sec). Number of exits/sec decreased 6x. The same improvement was shown when I tested with 3 vms running netperf (4086 MB/sec - 5545 MB/sec). filebench, 1 vm: ops/sec improved by 13% with the polling patch. 
Number of exits was reduced by 31%. The same experiment with 3 vms running filebench showed similar numbers. Signed-off-by: Razya Ladelsky ra...@il.ibm.com --- drivers/vhost/net.c |6 +- drivers/vhost/scsi.c |6 +- drivers/vhost/vhost.c | 245 +++-- drivers/vhost/vhost.h | 38 +++- 4 files changed, 277 insertions(+), 18 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 971a760..558aecb 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f) } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); - vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev); - vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev); + vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, + vqs[VHOST_NET_VQ_TX]); + vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, + vqs[VHOST_NET_VQ_RX]); f-private_data = n; diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 4f4ffa4..665eeeb 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) if (!vqs) goto err_vqs; - vhost_work_init(vs
[PATCH v2] vhost: Add polling mode
Resubmitting the patch in: http://marc.info/?l=kvmm=140594903520308w=2 after fixing the whitespaces issues. Thank you, Razya From f293e470b36ff9eb4910540c620315c418e4a8fc Mon Sep 17 00:00:00 2001 From: Razya Ladelsky ra...@il.ibm.com Date: Thu, 31 Jul 2014 09:47:20 +0300 Subject: [PATCH] vhost: Add polling mode Add an optional polling mode to continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. Signed-off-by: Razya Ladelsky ra...@il.ibm.com --- drivers/vhost/net.c |6 +- drivers/vhost/scsi.c |6 +- drivers/vhost/vhost.c | 245 +++-- drivers/vhost/vhost.h | 38 +++- 4 files changed, 277 insertions(+), 18 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 971a760..558aecb 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f) } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); - vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev); - vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev); + vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, + vqs[VHOST_NET_VQ_TX]); + vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, + vqs[VHOST_NET_VQ_RX]); f-private_data = n; diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 4f4ffa4..665eeeb 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) if (!vqs) goto err_vqs; - vhost_work_init(vs-vs_completion_work, vhost_scsi_complete_cmd_work); - vhost_work_init(vs-vs_event_work, tcm_vhost_evt_work); - + vhost_work_init(vs-vs_completion_work, NULL, + vhost_scsi_complete_cmd_work); + vhost_work_init(vs-vs_event_work, NULL, tcm_vhost_evt_work); vs-vs_events_nr = 0; vs-vs_events_missed = false; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index c90f437..fbe8174 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -24,9 +24,17 @@ #include linux/slab.h #include linux/kthread.h #include linux/cgroup.h +#include linux/jiffies.h #include linux/module.h #include vhost.h +static int poll_start_rate = 0; +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR); +MODULE_PARM_DESC(poll_start_rate, Start continuous polling of virtqueue when rate of events is at least this number per jiffy. 
If 0, never start polling.); + +static int poll_stop_idle = 3*HZ; /* 3 seconds */ +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR); +MODULE_PARM_DESC(poll_stop_idle, Stop continuous polling of virtqueue after this many jiffies of no work.); enum { VHOST_MEMORY_MAX_NREGIONS = 64, @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync, return 0; } -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn) +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq, + vhost_work_fn_t fn) { INIT_LIST_HEAD(work-node); work-fn = fn; init_waitqueue_head(work-done); work-flushing = 0; work-queue_seq = work-done_seq = 0; + work-vq = vq; } EXPORT_SYMBOL_GPL(vhost_work_init); /* Init poll structure */ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, -unsigned long mask, struct vhost_dev *dev) +unsigned long mask, struct vhost_virtqueue *vq) { init_waitqueue_func_entry(poll-wait, vhost_poll_wakeup); init_poll_funcptr(poll-table, vhost_poll_func); poll-mask = mask; - poll-dev = dev; + poll-dev = vq-dev; poll-wqh = NULL; - - vhost_work_init(poll-work, fn); + vhost_work_init(poll-work, vq, fn); } EXPORT_SYMBOL_GPL(vhost_poll_init); @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_queue); +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue. + * + * Enabling this mode it tells the guest not to notify (kick) us when its + * has made more work available on this virtqueue; Rather, we will continuously + * poll this virtqueue in the worker thread. If multiple virtqueues are polled, + * the worker thread polls them all, e.g., in a round-robin fashion. + * Note that vqpoll.enabled doesn't always mean that this virtqueue is + * actually being polled: The backend (e.g., net.c) may temporarily disable it + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged. + * + * It is assumed that these functions are called relatively rarely, when vhost + * notices that this virtqueue's usage pattern significantly changed
Re: [PATCH] vhost: Add polling mode
kvm-ow...@vger.kernel.org wrote on 29/07/2014 03:40:18 PM: From: Michael S. Tsirkin m...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL, Cc: abel.gor...@gmail.com, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, kvm@vger.kernel.org, kvm-ow...@vger.kernel.org, Yossi Kuperman1/Haifa/IBM@IBMIL Date: 29/07/2014 03:40 PM Subject: Re: [PATCH] vhost: Add polling mode Sent by: kvm-ow...@vger.kernel.org On Tue, Jul 29, 2014 at 03:23:59PM +0300, Razya Ladelsky wrote: Hmm there aren't a lot of numbers there :(. Speed increased by 33% but by how much? E.g. maybe you are getting from 1Mbyte/sec to 1.3, if so it's hard to get excited about it. Netperf 1 VM: 1516 MB/sec -> 2046 MB/sec and for 3 VMs: 4086 MB/sec -> 5545 MB/sec What do you mean by 1 VM? Streaming TCP host to vm? Also, your throughput is somewhat low, it's worth seeing why you can't hit higher speeds. My configuration is this: I have two machines, A and B. A hosts the vms, B runs the netserver. One vm (on A) runs netperf, where its destination server is running on B. I ran netperf with 64B messages, which is heavily loading the vm, which is why its throughput is low. The idea was to get it 100% loaded, so we can see that the polling is getting it to produce higher throughput. Some questions that come to mind: what was the message size? I would expect several measurements with different values. How did host CPU utilization change? The message size was 64B in order to get the VM to be cpu saturated, so the vm had 99% cpu and vhost 38%; with the polling patch both had 99%. Hmm so a net loss in throughput/CPU. Actually, my experiments were fair in the sense that for both cases, with or without polling, I ran both threads, vcpu and vhost, on 2 cores (set their affinity that way). The only difference was whether polling was enabled/disabled. What about latency? As we are competing with the guest for host CPU, would worst-case or average latency suffer? Polling indeed doesn't make a lot of sense if there aren't enough available cores. In these cases polling should not be used. Thank you, Razya OK but the scheduler might run vm and vhost on the same cpu even if cores are available. This needs to be detected somehow and polling disabled. Thanks, -- MST
Re: [PATCH] vhost: Add polling mode
kvm-ow...@vger.kernel.org wrote on 29/07/2014 04:30:34 AM: From: Zhang Haoyu zhan...@sangfor.com To: Jason Wang jasow...@redhat.com, Abel Gordon abel.gor...@gmail.com, Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, kvm kvm@vger.kernel.org, Michael S. Tsirkin m...@redhat.com, Yossi Kuperman1/Haifa/IBM@IBMIL Date: 29/07/2014 04:35 AM Subject: Re: [PATCH] vhost: Add polling mode Sent by: kvm-ow...@vger.kernel.org Maybe tying "vhost-net scalability tuning: threading for many VMs" and "vhost: Add polling mode" together would be a good marriage, because there would be a better chance of finding work to do within a shorter polling window, so fewer cpu cycles are wasted. Hi Zhang, Indeed, having one vhost thread shared by multiple vms, polling for their requests, is the ultimate goal of this plan. The current challenge with it is that the cgroup mechanism needs to be supported/incorporated somehow by this shared vhost thread, as it now serves multiple vms (processes). B.T.W. - if someone wants to help with this effort (mainly the cgroup issue), it would be greatly appreciated...! Thank you, Razya Thanks, Zhang Haoyu Hello All, When vhost is waiting for buffers from the guest driver (e.g., more packets to send in vhost-net's transmit queue), it normally goes to sleep and waits for the guest to kick it. This kick involves a PIO in the guest, and therefore an exit (and possibly userspace involvement in translating this PIO exit into a file descriptor event), all of which hurts performance. If the system is under-utilized (has cpu time to spare), vhost can continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. This patch adds an optional polling mode to vhost, which can be enabled via a kernel module parameter, poll_start_rate. When polling is active for a virtqueue, the guest is asked to disable notification (kicks), and the worker thread continuously checks for new buffers. When it does discover new buffers, it simulates a kick by invoking the underlying backend driver (such as vhost-net), which thinks it got a real kick from the guest, and acts accordingly. If the underlying driver asks not to be kicked, we disable polling on this virtqueue. We start polling on a virtqueue when we notice it has work to do. Polling on this virtqueue is later disabled after 3 seconds of polling turning up no new work, as in this case we are better off returning to the exit-based notification mechanism. The default timeout of 3 seconds can be changed with the poll_stop_idle kernel module parameter. This polling approach makes a lot of sense for new HW with posted-interrupts for which we have exitless host-to-guest notifications. But even with support for posted interrupts, guest-to-host communication still causes exits. Polling adds the missing part. When systems are overloaded, there won't be enough cpu time for the various vhost threads to poll their guests' devices. For these scenarios, we plan to add support for vhost threads that can be shared by multiple devices, even of multiple vms. Our ultimate goal is to implement the I/O acceleration features described in: KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) https://www.youtube.com/watch?v=9EyweibHfEs and https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html Comments are welcome, Thank you, Razya Thanks for the work. Do you have perf numbers for this? Hi Jason, Thanks for reviewing.
I ran some experiments with TCP stream netperf and filebench (having 2 threads performing random reads) benchmarks on an IBM System x3650 M4. All runs loaded the guests in a way that they were (cpu) saturated. The system had two cores per guest, as to allow for both the vcpu and the vhost thread to run concurrently for maximum throughput (but I didn't pin the threads to specific cores) I get: Netperf, 1 vm: The polling patch improved throughput by ~33%. Number of exits/sec decreased 6x. The same improvement was shown when I tested with 3 vms running netperf. filebench, 1 vm: ops/sec improved by 13% with the polling patch. Number of exits was reduced by 31%. The same experiment with 3 vms running filebench showed similar numbers. Looks good, may worth to add the result in the commit log. And looks like the patch only poll for virtqueue. In the future, may worth to add callbacks for vhost_net to poll socket. Then it could be used with rx busy polling in host which may speedup the rx also. Did you mean polling the network device to avoid
Re: [PATCH] vhost: Add polling mode
Michael S. Tsirkin m...@redhat.com wrote on 29/07/2014 11:06:40 AM: From: Michael S. Tsirkin m...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL, Cc: kvm@vger.kernel.org, abel.gor...@gmail.com, Joel Nider/Haifa/ IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/ IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL Date: 29/07/2014 11:06 AM Subject: Re: [PATCH] vhost: Add polling mode On Mon, Jul 21, 2014 at 04:23:44PM +0300, Razya Ladelsky wrote: Hello All, When vhost is waiting for buffers from the guest driver (e.g., more packets to send in vhost-net's transmit queue), it normally goes to sleep and waits for the guest to kick it. This kick involves a PIO in the guest, and therefore an exit (and possibly userspace involvement in translating this PIO exit into a file descriptor event), all of which hurts performance. If the system is under-utilized (has cpu time to spare), vhost can continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. This patch adds an optional polling mode to vhost, that can be enabled via a kernel module parameter, poll_start_rate. When polling is active for a virtqueue, the guest is asked to disable notification (kicks), and the worker thread continuously checks for new buffers. When it does discover new buffers, it simulates a kick by invoking the underlying backend driver (such as vhost-net), which thinks it got a real kick from the guest, and acts accordingly. If the underlying driver asks not to be kicked, we disable polling on this virtqueue. We start polling on a virtqueue when we notice it has work to do. Polling on this virtqueue is later disabled after 3 seconds of polling turning up no new work, as in this case we are better off returning to the exit-based notification mechanism. The default timeout of 3 seconds can be changed with the poll_stop_idle kernel module parameter. This polling approach makes lot of sense for new HW with posted-interrupts for which we have exitless host-to-guest notifications. But even with support for posted interrupts, guest-to-host communication still causes exits. Polling adds the missing part. When systems are overloaded, there won?t be enough cpu time for the various vhost threads to poll their guests' devices. For these scenarios, we plan to add support for vhost threads that can be shared by multiple devices, even of multiple vms. Our ultimate goal is to implement the I/O acceleration features described in: KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) https://www.youtube.com/watch?v=9EyweibHfEs and https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html Comments are welcome, Thank you, Razya From: Razya Ladelsky ra...@il.ibm.com Add an optional polling mode to continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. Signed-off-by: Razya Ladelsky ra...@il.ibm.com This is an optimization patch, isn't it? Could you please include some numbers showing its effect? Hi Michael, Sure. I included them in a reply to Jason Wang in this thread, Here it is: http://www.spinics.net/linux/lists/kvm/msg106049.html --- drivers/vhost/net.c |6 +- drivers/vhost/scsi.c |5 +- drivers/vhost/vhost.c | 247 +++-- drivers/vhost/vhost.h | 37 +++- 4 files changed, 277 insertions(+), 18 deletions(-) Whitespace seems mangled to the point of making patch unreadable. Can you pls repost? Sure. 
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 971a760..558aecb 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f) } vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX); - vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev); - vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev); + vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, + vqs[VHOST_NET_VQ_TX]); + vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, + vqs[VHOST_NET_VQ_RX]); f-private_data = n; diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 4f4ffa4..56f0233 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1528,9 +1528,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) if (!vqs) goto err_vqs; - vhost_work_init(vs-vs_completion_work, vhost_scsi_complete_cmd_work); - vhost_work_init(vs-vs_event_work, tcm_vhost_evt_work); - + vhost_work_init(vs-vs_completion_work, NULL, vhost_scsi_complete_cmd_work); + vhost_work_init(vs
Re: [PATCH] vhost: Add polling mode
Hmm there aren't a lot of numbers there :(. Speed increased by 33% but by how much? E.g. maybe you are getting from 1Mbyte/sec to 1.3, if so it's hard to get excited about it. Netperf 1 VM: 1516 MB/sec -> 2046 MB/sec and for 3 VMs: 4086 MB/sec -> 5545 MB/sec Some questions that come to mind: what was the message size? I would expect several measurements with different values. How did host CPU utilization change? The message size was 64B in order to get the VM to be cpu saturated, so the vm had 99% cpu and vhost 38%; with the polling patch both had 99%. What about latency? As we are competing with the guest for host CPU, would worst-case or average latency suffer? Polling indeed doesn't make a lot of sense if there aren't enough available cores. In these cases polling should not be used. Thank you, Razya Thanks, -- MST
Re: [PATCH] vhost: Add polling mode
Jason Wang jasow...@redhat.com wrote on 23/07/2014 08:26:36 AM: From: Jason Wang jasow...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, Michael S. Tsirkin m...@redhat.com, Cc: abel.gor...@gmail.com, Joel Nider/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL Date: 23/07/2014 08:26 AM Subject: Re: [PATCH] vhost: Add polling mode On 07/21/2014 09:23 PM, Razya Ladelsky wrote: Hello All, When vhost is waiting for buffers from the guest driver (e.g., more packets to send in vhost-net's transmit queue), it normally goes to sleep and waits for the guest to kick it. This kick involves a PIO in the guest, and therefore an exit (and possibly userspace involvement in translating this PIO exit into a file descriptor event), all of which hurts performance. If the system is under-utilized (has cpu time to spare), vhost can continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. This patch adds an optional polling mode to vhost, that can be enabled via a kernel module parameter, poll_start_rate. When polling is active for a virtqueue, the guest is asked to disable notification (kicks), and the worker thread continuously checks for new buffers. When it does discover new buffers, it simulates a kick by invoking the underlying backend driver (such as vhost-net), which thinks it got a real kick from the guest, and acts accordingly. If the underlying driver asks not to be kicked, we disable polling on this virtqueue. We start polling on a virtqueue when we notice it has work to do. Polling on this virtqueue is later disabled after 3 seconds of polling turning up no new work, as in this case we are better off returning to the exit-based notification mechanism. The default timeout of 3 seconds can be changed with the poll_stop_idle kernel module parameter. This polling approach makes lot of sense for new HW with posted-interrupts for which we have exitless host-to-guest notifications. But even with support for posted interrupts, guest-to-host communication still causes exits. Polling adds the missing part. When systems are overloaded, there won?t be enough cpu time for the various vhost threads to poll their guests' devices. For these scenarios, we plan to add support for vhost threads that can be shared by multiple devices, even of multiple vms. Our ultimate goal is to implement the I/O acceleration features described in: KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) https://www.youtube.com/watch?v=9EyweibHfEs and https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html Comments are welcome, Thank you, Razya Thanks for the work. Do you have perf numbers for this? Hi Jason, Thanks for reviewing. I ran some experiments with TCP stream netperf and filebench (having 2 threads performing random reads) benchmarks on an IBM System x3650 M4. All runs loaded the guests in a way that they were (cpu) saturated. The system had two cores per guest, as to allow for both the vcpu and the vhost thread to run concurrently for maximum throughput (but I didn't pin the threads to specific cores) I get: Netperf, 1 vm: The polling patch improved throughput by ~33%. Number of exits/sec decreased 6x. The same improvement was shown when I tested with 3 vms running netperf. filebench, 1 vm: ops/sec improved by 13% with the polling patch. Number of exits was reduced by 31%. The same experiment with 3 vms running filebench showed similar numbers. And looks like the patch only poll for virtqueue. 
In the future, it may be worth adding callbacks for vhost_net to poll the socket. Then it could be used with rx busy polling in the host, which may speed up the rx path as well.

Did you mean polling the network device to avoid interrupts?

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..678d766 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>

 #include "vhost.h"

+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");

I'm not sure using jiffies is good enough, since the user needs to know the HZ value. It may be worth looking at sk_busy_loop(), which uses sched_clock() and microseconds.

Ok, will look into it, thanks.

+/* Enable or disable virtqueue polling
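For illustration only, here is roughly what Jason's suggestion could look like if the idle timeout were expressed in microseconds and measured with sched_clock(), similar to what sk_busy_loop() does. The parameter name poll_stop_idle_us and the vq->poll_start_ns field are invented for this sketch; they are not part of the posted patch.

#include <linux/sched.h>        /* sched_clock() */
#include <linux/time.h>         /* USEC_PER_SEC, NSEC_PER_USEC */

/* Hypothetical replacement for poll_stop_idle, in microseconds. */
static int poll_stop_idle_us = 3 * USEC_PER_SEC;        /* 3 seconds */
module_param(poll_stop_idle_us, int, S_IRUGO|S_IWUSR);
MODULE_PARM_DESC(poll_stop_idle_us, "Stop continuous polling of virtqueue after this many microseconds of no work.");

/* Called from the worker loop; vq->poll_start_ns (hypothetical field)
 * would be reset whenever polling starts or new work is found. */
static bool vhost_poll_idle_expired(struct vhost_virtqueue *vq)
{
        u64 now = sched_clock();        /* nanoseconds */

        return now - vq->poll_start_ns >
               (u64)poll_stop_idle_us * NSEC_PER_USEC;
}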
[PATCH] vhost: Add polling mode
Hello All, When vhost is waiting for buffers from the guest driver (e.g., more packets to send in vhost-net's transmit queue), it normally goes to sleep and waits for the guest to kick it. This kick involves a PIO in the guest, and therefore an exit (and possibly userspace involvement in translating this PIO exit into a file descriptor event), all of which hurts performance. If the system is under-utilized (has cpu time to spare), vhost can continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. This patch adds an optional polling mode to vhost, which can be enabled via a kernel module parameter, poll_start_rate.

When polling is active for a virtqueue, the guest is asked to disable notification (kicks), and the worker thread continuously checks for new buffers. When it does discover new buffers, it simulates a kick by invoking the underlying backend driver (such as vhost-net), which thinks it got a real kick from the guest, and acts accordingly. If the underlying driver asks not to be kicked, we disable polling on this virtqueue. We start polling on a virtqueue when we notice it has work to do. Polling on this virtqueue is later disabled after 3 seconds of polling turning up no new work, as in this case we are better off returning to the exit-based notification mechanism. The default timeout of 3 seconds can be changed with the poll_stop_idle kernel module parameter.

This polling approach makes a lot of sense for new HW with posted interrupts, for which we have exitless host-to-guest notifications. But even with support for posted interrupts, guest-to-host communication still causes exits. Polling adds the missing part. When systems are overloaded, there won't be enough cpu time for the various vhost threads to poll their guests' devices. For these scenarios, we plan to add support for vhost threads that can be shared by multiple devices, even of multiple vms.

Our ultimate goal is to implement the I/O acceleration features described in: KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) https://www.youtube.com/watch?v=9EyweibHfEs and https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html

Comments are welcome, Thank you, Razya

From: Razya Ladelsky ra...@il.ibm.com

Add an optional polling mode to continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us.
Signed-off-by: Razya Ladelsky ra...@il.ibm.com
---
 drivers/vhost/net.c   |   6 +-
 drivers/vhost/scsi.c  |   5 +-
 drivers/vhost/vhost.c | 247 +++--
 drivers/vhost/vhost.h |  37 +++-
 4 files changed, 277 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 971a760..558aecb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);

-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			vqs[VHOST_NET_VQ_RX]);

 	f->private_data = n;

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4f4ffa4..56f0233 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1528,9 +1528,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;

-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
-
+	vhost_work_init(&vs->vs_completion_work, NULL, vhost_scsi_complete_cmd_work);
+	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..678d766 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>

 #include "vhost.h"

+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -58,27 +66,27 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0
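The cover letter describes the start/stop policy in prose; the sketch below restates that logic as code, purely for illustration. Only poll_start_rate, poll_stop_idle, vhost_disable_notify() and vhost_enable_notify() come from real vhost code; the vq fields used here (polling, kick_rate, last_work_jiffies) are hypothetical stand-ins for whatever bookkeeping the actual patch keeps.

/* Illustrative sketch of the polling start/stop heuristic. */
static void vhost_polling_heuristic(struct vhost_virtqueue *vq)
{
        if (!vq->polling) {
                /* Start polling once guest notifications arrive at least
                 * poll_start_rate times per jiffy (0 disables polling). */
                if (poll_start_rate && vq->kick_rate >= poll_start_rate) {
                        vhost_disable_notify(vq->dev, vq);  /* no more kicks */
                        vq->polling = true;
                        vq->last_work_jiffies = jiffies;
                }
                return;
        }

        /* Stop polling after poll_stop_idle jiffies with no new buffers and
         * fall back to the exit-based notification mechanism. */
        if (time_after(jiffies, vq->last_work_jiffies + poll_stop_idle)) {
                vq->polling = false;
                vhost_enable_notify(vq->dev, vq);
        }
}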
Re: Updated Elvis Upstreaming Roadmap
Hi, To summarize the issues raised and the following steps:

1. Shared vhost thread: will support multiple vms, while supporting cgroups. As soon as we have a design to support cgroups with multiple vms, we'll share it.

2. Adding vhost polling mode: this patch can be submitted independently of (1). We'll add a condition that will be checked periodically, in order to stop polling if the guest is not running (scheduled out) at that time (a rough sketch of such a check follows below).

3. Implement good heuristics (policies) in the vhost module for adding/removing vhost threads. We will not expose an interface to user-space at this time.

Thank you, Razya
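For item 2, a rough idea of how such a periodic check could look. vhost currently has no ready-made way to ask whether the owning guest's vcpu is running, so the helper vhost_vq_guest_running() below is purely hypothetical and would need a real hook (e.g. a KVM preempt notifier or a flag shared with the vcpu thread); the vq->polling field is likewise illustrative.

/* Hypothetical check, run every few polling iterations: if the guest is
 * scheduled out it cannot produce new buffers, so stop polling and
 * re-enable notifications until it runs again. */
static void vhost_poll_check_guest(struct vhost_virtqueue *vq)
{
        if (!vq->polling)
                return;

        if (!vhost_vq_guest_running(vq)) {      /* hypothetical helper */
                vq->polling = false;
                vhost_enable_notify(vq->dev, vq);
        }
}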
Re: Updated Elvis Upstreaming Roadmap
Gleb Natapov g...@minantech.com wrote on 24/12/2013 06:21:03 PM: From: Gleb Natapov g...@kernel.org To: Razya Ladelsky/Haifa/IBM@IBMIL, Cc: Michael S. Tsirkin m...@redhat.com, abel.gor...@gmail.com, Anthony Liguori anth...@codemonkey.ws, as...@redhat.com, digitale...@google.com, Eran Raichstein/Haifa/IBM@IBMIL, g...@redhat.com, jasow...@redhat.com, Joel Nider/Haifa/IBM@IBMIL, kvm@vger.kernel.org, kvm-ow...@vger.kernel.org, pbonz...@redhat.com, Stefan Hajnoczi stefa...@gmail.com, Yossi Kuperman1/Haifa/IBM@IBMIL, Eyal Moscovici/Haifa/IBM@IBMIL, b...@redhat.com Date: 24/12/2013 06:21 PM Subject: Re: Updated Elvis Upstreaming Roadmap Sent by: Gleb Natapov g...@minantech.com

On Tue, Dec 17, 2013 at 12:04:42PM +0200, Razya Ladelsky wrote:

4. vhost statistics The issue that was raised for the vhost statistics was using ftrace instead of the debugfs mechanism. However, looking further into the kvm stat mechanism, we learned that ftrace didn't replace the plain debugfs mechanism, but was used in addition to it.

It did. Statistics in debugfs are deprecated. No new statistics are added there. kvm_stat is using ftrace now (if available), and of course ftrace gives seamless integration with perf.

O.k., I understand. We'll look more into ftrace to see that it fully supports our vhost statistics requirements. Thank you, Razya

-- Gleb.
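To make the ftrace route concrete, here is a minimal sketch of what a vhost polling tracepoint could look like; the event name and fields are invented for illustration and do not exist in the posted patches.

/* include/trace/events/vhost.h (illustrative only; struct vhost_virtqueue
 * comes from drivers/vhost/vhost.h) */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM vhost

#if !defined(_TRACE_VHOST_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VHOST_H

#include <linux/tracepoint.h>

TRACE_EVENT(vhost_virtqueue_poll,
        TP_PROTO(struct vhost_virtqueue *vq, bool found_work),
        TP_ARGS(vq, found_work),
        TP_STRUCT__entry(
                __field(void *, vq)
                __field(bool,   found_work)
        ),
        TP_fast_assign(
                __entry->vq = vq;
                __entry->found_work = found_work;
        ),
        TP_printk("vq %p found_work %d", __entry->vq, __entry->found_work)
);

#endif /* _TRACE_VHOST_H */
#include <trace/define_trace.h>

The worker loop would then call trace_vhost_virtqueue_poll(vq, found) on each polling pass, and the per-event counts become visible under /sys/kernel/debug/tracing/events/vhost/ as well as to perf.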
Updated Elvis Upstreaming Roadmap
Hi, Thank you all for your comments. I'm sorry for taking this long to reply, I was away on vacation. It was a good, long discussion, and many issues were raised, which we'd like to address with the following proposed roadmap for the Elvis patches. In general, we believe it would be best to start with patches that are as simple as possible, providing the basic Elvis functionality, and attend to the more complicated issues in subsequent patches. Here's the roadmap for the Elvis patches:

1. Shared vhost thread for multiple devices. The way to go here, we believe, is to start with a patch having a shared vhost thread for multiple devices of the SAME vm. The next step/patch may be handling vms belonging to the same cgroup. Finally, we need to extend the functionality so that the shared vhost thread serves multiple vms (not necessarily belonging to the same cgroup). There was a lot of discussion about the way to address the enforcement of cgroup policies, and we will consider the various solutions in a future patch.

2. Creation of vhost threads. We suggested two ways of controlling the creation and removal of vhost threads:
- statically determining the maximum number of virtio devices per worker via a kernel module parameter
- dynamically: a sysfs mechanism to add and remove vhost threads
It seems that it would be simplest to take the static approach as a first stage (a rough sketch of what that could look like follows below). At a second stage (next patch), we'll advance to dynamically changing the number of vhost threads, using the static module parameter only as a default value. Regarding cmwq (concurrency-managed workqueues), it is an interesting mechanism, which we need to explore further. At the moment we prefer not to change the vhost model to use cmwq, as some of the issues that were discussed, such as cgroups, are not supported by it, and this adds more complexity. However, we'll look further into it, and consider it at a later stage.

3. Adding polling mode to vhost. It is a good idea to make polling adaptive, based on various factors such as the I/O rate, the guest kick overhead (which is the trade-off of polling), or the amount of wasted cycles (cycles we kept polling but no new work was added). However, as an initial polling patch, we would prefer a naive polling approach, which could be tuned in later patches.

4. vhost statistics. The issue that was raised for the vhost statistics was using ftrace instead of the debugfs mechanism. However, looking further into the kvm stat mechanism, we learned that ftrace didn't replace the plain debugfs mechanism, but was used in addition to it. We propose to continue using debugfs for statistics, in a manner similar to kvm, and at some point in the future ftrace can be added to vhost as well.

Does this plan look o.k.? If there are no further comments, I'll start preparing the patches according to what we've agreed on thus far. Thank you, Razya
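For the static first stage of item 2, the mechanism could be as simple as a module parameter consulted when a new device is opened. The names below (devs_per_worker, struct vhost_worker, vhost_find_worker_with_room(), vhost_create_worker()) are placeholders for illustration, not code from the Elvis patches.

/* Upper bound on how many virtio devices one shared vhost worker serves. */
static int devs_per_worker = 4;
module_param(devs_per_worker, int, S_IRUGO);
MODULE_PARM_DESC(devs_per_worker, "Maximum number of virtio devices served by one shared vhost worker thread.");

/* Illustrative device-open path: reuse a worker that still has room
 * (in the first stage, only workers of the same VM would qualify),
 * otherwise spawn a new kernel thread for this device. */
static struct vhost_worker *vhost_get_worker(struct vhost_dev *dev)
{
        struct vhost_worker *w;

        w = vhost_find_worker_with_room(dev, devs_per_worker);
        if (!w)
                w = vhost_create_worker(dev);
        return w;
}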
Re: Elvis upstreaming plan
Michael S. Tsirkin m...@redhat.com wrote on 24/11/2013 12:26:15 PM: From: Michael S. Tsirkin m...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL, Cc: kvm@vger.kernel.org, anth...@codemonkey.ws, g...@redhat.com, pbonz...@redhat.com, as...@redhat.com, jasow...@redhat.com, digitale...@google.com, abel.gor...@gmail.com, Abel Gordon/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL Date: 24/11/2013 12:22 PM Subject: Re: Elvis upstreaming plan

On Sun, Nov 24, 2013 at 11:22:17AM +0200, Razya Ladelsky wrote:

Hi all, I am Razya Ladelsky, I work at the IBM Haifa virtualization team, which developed Elvis, presented by Abel Gordon at the last KVM forum: ELVIS video: https://www.youtube.com/watch?v=9EyweibHfEs ELVIS slides: https://drive.google.com/file/d/0BzyAwvVlQckeQmpnOHM5SnB5UVE

According to the discussions that took place at the forum, upstreaming some of the Elvis approaches seems to be a good idea, which we would like to pursue. Our plan for the first patches is the following:

1. Shared vhost thread between multiple devices. This patch creates a worker thread and worker queue shared across multiple virtio devices. We would like to modify the patch posted in https://github.com/abelg/virtual_io_acceleration/commit/3dc6a3ce7bcbe87363c2df8a6b6fee0c14615766 to limit a vhost thread to serve multiple devices only if they belong to the same VM, as Paolo suggested, to avoid isolation or cgroups concerns. Another modification is related to the creation and removal of vhost threads, which will be discussed next.

2. Sysfs mechanism to add and remove vhost threads. This patch allows us to add and remove vhost threads dynamically. A simpler way to control the creation of vhost threads is statically determining the maximum number of virtio devices per worker via a kernel module parameter (which is the way the previously mentioned patch is currently implemented). I'd like to ask for advice here about the more preferable way to go: although having the sysfs mechanism provides more flexibility, it may be a good idea to start with a simple static parameter, and have the first patches as simple as possible. What do you think?

3. Add virtqueue polling mode to vhost. Have the vhost thread poll the virtqueues with a high I/O rate for new buffers, and avoid asking the guest to kick us. https://github.com/abelg/virtual_io_acceleration/commit/26616133fafb7855cc80fac070b0572fd1aaf5d0

4. vhost statistics. This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat, based on the old kvm_stats). https://github.com/abelg/virtual_io_acceleration/commit/ac14206ea56939ecc3608dc5f978b86fa322e7b0

5. Add heuristics to improve I/O scheduling. This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. https://github.com/abelg/virtual_io_acceleration/commit/f6a4f1a5d6b82dc754e8af8af327b8d0f043dc4d This patch improves the handling of the requests by the vhost thread, but could perhaps be delayed to a later time, and not submitted as one of the first Elvis patches. I'd love to hear some comments about whether this patch needs to be part of the first submission.

Any other feedback on this plan will be appreciated, Thank you, Razya

How about we start with the stats patch? This will make it easier to evaluate the other patches.

Hi Michael, Thank you for your quick reply.
Our plan was to send all of these patches, which contain the Elvis code. We can start with the stats patch; however, many of the statistics there are related to the features that the other patches provide... B.T.W., if you get a chance to look at the rest of the patches, I'd really appreciate your comments. Thank you very much, Razya
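Since the stats patch exposes its counters through debugfs, a stripped-down sketch of that pattern is shown below; the directory and counter names are invented here, and the real patch tracks a much richer set of metrics.

#include <linux/debugfs.h>

/* Illustrative counters only. */
static u64 vhost_poll_cycles;
static u64 vhost_notif_cycles;
static struct dentry *vhost_debugfs_dir;

/* Would be called from vhost's module init; each counter becomes a
 * read-only file under /sys/kernel/debug/vhost/, which a userspace
 * script (like the vhost_stat tool mentioned above) can sample. */
static int vhost_stats_init(void)
{
        vhost_debugfs_dir = debugfs_create_dir("vhost", NULL);
        if (!vhost_debugfs_dir)
                return -ENOMEM;

        debugfs_create_u64("poll_cycles", 0444, vhost_debugfs_dir,
                           &vhost_poll_cycles);
        debugfs_create_u64("notif_cycles", 0444, vhost_debugfs_dir,
                           &vhost_notif_cycles);
        return 0;
}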