Re: [PATCH v2 00/21] arm64: KVM: world switch in C
On Tue, Dec 01, 2015 at 05:51:46PM +, Marc Zyngier wrote:
> On 01/12/15 12:00, Christoffer Dall wrote:
> > On Tue, Dec 01, 2015 at 09:58:23AM +, Marc Zyngier wrote:
> >> On 30/11/15 20:33, Christoffer Dall wrote:
> >>> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
> and mean piece of hand-crafted assembly code. Over time, features have
> crept in, the code has become harder to maintain, and the smallest
> change is a pain to introduce. The VHE patches are a prime example of
> why this doesn't work anymore.
>
> This series rewrites most of the existing assembly code in C, but keeps
> the existing code structure in place (most function names will look
> familiar to the reader). The biggest change is that we don't have to
> deal with a static register allocation (the compiler does it for us),
> we can easily follow structure and pointers, and only the lowest level
> is still in assembly code. Oh, and a negative diffstat.
>
> There is still a healthy dose of inline assembly (system register
> accessors, runtime code patching), but I've tried not to make it too
> invasive. The generated code, while not exactly brilliant, doesn't
> look too shabby. I do expect a small performance degradation, but I
> believe this is something we can improve over time (my initial
> measurements don't show any obvious regression though).
> >>>
> >>> I ran this through my experimental setup on m400 and got this:
> >>
> >> [...]
> >>
> >>> What this tells me is that we do take a noticeable hit on the
> >>> world-switch path, which shows up in the TCP_RR and hackbench
> >>> workloads, which have a high precision in their output.
> >>>
> >>> Note that the memcached number is well within its variability
> >>> between individual benchmark runs, where it varies by up to 12% of
> >>> its average in over 80% of the executions.
> >>>
> >>> I don't think this is a showstopper, but we could consider looking
> >>> more closely at a breakdown of the world-switch path and verify
> >>> if/where we are really taking a hit.
> >>
> >> Thanks for doing so, very interesting. As a data point, what
> >> compiler are you using? I'd expect some variability based on the
> >> compiler version...
> >>
> > I used the following (compiling natively on the m400):
> >
> > gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)
>
> For what it is worth, I've run hackbench on my Seattle B0 (8xA57 2GHz),
> with a 4 vcpu VM and got the following results (10 runs per kernel
> version, same configuration):
>
> v4.4-rc3-wsinc: Average 31.750
> 32.459
> 32.124
> 32.435
> 31.940
> 31.085
> 31.804
> 31.862
> 30.985
> 31.450
> 31.359
>
> v4.4-rc3: Average 31.954
> 31.806
> 31.598
> 32.697
> 31.472
> 31.410
> 32.562
> 31.938
> 31.932
> 31.672
> 32.459
>
> This is with GCC as produced by Linaro:
> aarch64-linux-gnu-gcc (Linaro GCC 5.1-2015.08) 5.1.1 20150608
>
> It could well be that your compiler generates worse code than the one I
> use, or that the code it outputs is badly tuned for XGene. I guess I
> need to unearth my Mustang to find out...

Worth investigating I suppose. At any rate, the conclusion stays the
same; we should proceed with these patches.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/21] arm64: KVM: world switch in C
On 01/12/15 12:00, Christoffer Dall wrote:
> On Tue, Dec 01, 2015 at 09:58:23AM +, Marc Zyngier wrote:
>> On 30/11/15 20:33, Christoffer Dall wrote:
>>> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>>>> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>>>> and mean piece of hand-crafted assembly code. Over time, features
>>>> have crept in, the code has become harder to maintain, and the
>>>> smallest change is a pain to introduce. The VHE patches are a prime
>>>> example of why this doesn't work anymore.
>>>>
>>>> This series rewrites most of the existing assembly code in C, but
>>>> keeps the existing code structure in place (most function names will
>>>> look familiar to the reader). The biggest change is that we don't
>>>> have to deal with a static register allocation (the compiler does it
>>>> for us), we can easily follow structure and pointers, and only the
>>>> lowest level is still in assembly code. Oh, and a negative diffstat.
>>>>
>>>> There is still a healthy dose of inline assembly (system register
>>>> accessors, runtime code patching), but I've tried not to make it too
>>>> invasive. The generated code, while not exactly brilliant, doesn't
>>>> look too shabby. I do expect a small performance degradation, but I
>>>> believe this is something we can improve over time (my initial
>>>> measurements don't show any obvious regression though).
>>>
>>> I ran this through my experimental setup on m400 and got this:
>>
>> [...]
>>
>>> What this tells me is that we do take a noticeable hit on the
>>> world-switch path, which shows up in the TCP_RR and hackbench
>>> workloads, which have a high precision in their output.
>>>
>>> Note that the memcached number is well within its variability between
>>> individual benchmark runs, where it varies by up to 12% of its
>>> average in over 80% of the executions.
>>>
>>> I don't think this is a showstopper, but we could consider looking
>>> more closely at a breakdown of the world-switch path and verify
>>> if/where we are really taking a hit.
>>
>> Thanks for doing so, very interesting. As a data point, what compiler
>> are you using? I'd expect some variability based on the compiler
>> version...
>>
> I used the following (compiling natively on the m400):
>
> gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)

For what it is worth, I've run hackbench on my Seattle B0 (8xA57 2GHz),
with a 4 vcpu VM and got the following results (10 runs per kernel
version, same configuration):

v4.4-rc3-wsinc: Average 31.750
32.459
32.124
32.435
31.940
31.085
31.804
31.862
30.985
31.450
31.359

v4.4-rc3: Average 31.954
31.806
31.598
32.697
31.472
31.410
32.562
31.938
31.932
31.672
32.459

This is with GCC as produced by Linaro:
aarch64-linux-gnu-gcc (Linaro GCC 5.1-2015.08) 5.1.1 20150608

It could well be that your compiler generates worse code than the one I
use, or that the code it outputs is badly tuned for XGene. I guess I
need to unearth my Mustang to find out...

M.

--
Jazz is not dead. It just smells funny...
Re: [PATCH v2 00/21] arm64: KVM: world switch in C
On Tue, Dec 01, 2015 at 09:58:23AM +, Marc Zyngier wrote:
> On 30/11/15 20:33, Christoffer Dall wrote:
> > On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
> >> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
> >> and mean piece of hand-crafted assembly code. Over time, features
> >> have crept in, the code has become harder to maintain, and the
> >> smallest change is a pain to introduce. The VHE patches are a prime
> >> example of why this doesn't work anymore.
> >>
> >> This series rewrites most of the existing assembly code in C, but
> >> keeps the existing code structure in place (most function names will
> >> look familiar to the reader). The biggest change is that we don't
> >> have to deal with a static register allocation (the compiler does it
> >> for us), we can easily follow structure and pointers, and only the
> >> lowest level is still in assembly code. Oh, and a negative diffstat.
> >>
> >> There is still a healthy dose of inline assembly (system register
> >> accessors, runtime code patching), but I've tried not to make it too
> >> invasive. The generated code, while not exactly brilliant, doesn't
> >> look too shabby. I do expect a small performance degradation, but I
> >> believe this is something we can improve over time (my initial
> >> measurements don't show any obvious regression though).
> >
> > I ran this through my experimental setup on m400 and got this:
>
> [...]
>
> > What this tells me is that we do take a noticeable hit on the
> > world-switch path, which shows up in the TCP_RR and hackbench
> > workloads, which have a high precision in their output.
> >
> > Note that the memcached number is well within its variability between
> > individual benchmark runs, where it varies by up to 12% of its
> > average in over 80% of the executions.
> >
> > I don't think this is a showstopper, but we could consider looking
> > more closely at a breakdown of the world-switch path and verify
> > if/where we are really taking a hit.
>
> Thanks for doing so, very interesting. As a data point, what compiler
> are you using? I'd expect some variability based on the compiler
> version...
>
I used the following (compiling natively on the m400):

gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)

-Christoffer
Re: [PATCH v2 00/21] arm64: KVM: world switch in C
On 30/11/15 20:33, Christoffer Dall wrote:
> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>> and mean piece of hand-crafted assembly code. Over time, features have
>> crept in, the code has become harder to maintain, and the smallest
>> change is a pain to introduce. The VHE patches are a prime example of
>> why this doesn't work anymore.
>>
>> This series rewrites most of the existing assembly code in C, but
>> keeps the existing code structure in place (most function names will
>> look familiar to the reader). The biggest change is that we don't have
>> to deal with a static register allocation (the compiler does it for
>> us), we can easily follow structure and pointers, and only the lowest
>> level is still in assembly code. Oh, and a negative diffstat.
>>
>> There is still a healthy dose of inline assembly (system register
>> accessors, runtime code patching), but I've tried not to make it too
>> invasive. The generated code, while not exactly brilliant, doesn't
>> look too shabby. I do expect a small performance degradation, but I
>> believe this is something we can improve over time (my initial
>> measurements don't show any obvious regression though).
>
> I ran this through my experimental setup on m400 and got this:

[...]

> What this tells me is that we do take a noticeable hit on the
> world-switch path, which shows up in the TCP_RR and hackbench
> workloads, which have a high precision in their output.
>
> Note that the memcached number is well within its variability between
> individual benchmark runs, where it varies by up to 12% of its average
> in over 80% of the executions.
>
> I don't think this is a showstopper, but we could consider looking
> more closely at a breakdown of the world-switch path and verify
> if/where we are really taking a hit.

Thanks for doing so, very interesting. As a data point, what compiler
are you using? I'd expect some variability based on the compiler
version...

Thanks,

M.

--
Jazz is not dead. It just smells funny...
Re: [PATCH v2 00/21] arm64: KVM: world switch in C
On 11/30/2015 12:33 PM, Christoffer Dall wrote:
> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>> and mean piece of hand-crafted assembly code. Over time, features have
>> crept in, the code has become harder to maintain, and the smallest
>> change is a pain to introduce. The VHE patches are a prime example of
>> why this doesn't work anymore.
>>
>> This series rewrites most of the existing assembly code in C, but
>> keeps the existing code structure in place (most function names will
>> look familiar to the reader). The biggest change is that we don't have
>> to deal with a static register allocation (the compiler does it for
>> us), we can easily follow structure and pointers, and only the lowest
>> level is still in assembly code. Oh, and a negative diffstat.
>>
>> There is still a healthy dose of inline assembly (system register
>> accessors, runtime code patching), but I've tried not to make it too
>> invasive. The generated code, while not exactly brilliant, doesn't
>> look too shabby. I do expect a small performance degradation, but I
>> believe this is something we can improve over time (my initial
>> measurements don't show any obvious regression though).
>
> I ran this through my experimental setup on m400 and got this:
>
> BM              v4.4-rc2    v4.4-rc2-wsinc   overhead
> --              --------    --------------   --------
> Apache           5297.11       5243.77       101.02%
> fio rand read    4354.33       4294.50       101.39%
> fio rand write   2465.33       2231.33       110.49%
> hackbench          17.48         19.78       113.16%
> memcached       96442.69     101274.04        95.23%
> TCP_MAERTS       5966.89       6029.72        98.96%
> TCP_STREAM       6284.60       6351.74        98.94%
> TCP_RR          15044.71      14324.03       105.03%
> pbzip2 c           18.13         17.89        98.68%
> pbzip2 d           11.42         11.45       100.26%
> kernbench          50.13         50.28       100.30%
> mysql 1           152.84        154.01       100.77%
> mysql 2            98.12         98.94       100.84%
> mysql 4            51.32         51.17        99.71%
> mysql 8            27.31         27.70       101.42%
> mysql 20           16.80         17.21       102.47%
> mysql 100          13.71         14.11       102.92%
> mysql 200          15.20         15.20       100.00%
> mysql 400          17.16         17.16       100.00%
>
> (you want to see this with a viewer that renders clear-text and tabs
> properly)
>
> What this tells me is that we do take a noticeable hit on the
> world-switch path, which shows up in the TCP_RR and hackbench
> workloads, which have a high precision in their output.
>
> Note that the memcached number is well within its variability between
> individual benchmark runs, where it varies by up to 12% of its average
> in over 80% of the executions.
>
> I don't think this is a showstopper, but we could consider looking
> more closely at a breakdown of the world-switch path and verify
> if/where we are really taking a hit.
>
> -Christoffer
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

I ran some of the lmbench 'micro benchmarks' - currently the usleep one
consistently stands out by about 0.4%, or an extra 300ns per sleep. A
few other ones have some outliers; I will look at these more closely.
Tests were run on Juno.

- Mario
Re: [PATCH v2 00/21] arm64: KVM: world switch in C
On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
> and mean piece of hand-crafted assembly code. Over time, features have
> crept in, the code has become harder to maintain, and the smallest
> change is a pain to introduce. The VHE patches are a prime example of
> why this doesn't work anymore.
>
> This series rewrites most of the existing assembly code in C, but keeps
> the existing code structure in place (most function names will look
> familiar to the reader). The biggest change is that we don't have to
> deal with a static register allocation (the compiler does it for us),
> we can easily follow structure and pointers, and only the lowest level
> is still in assembly code. Oh, and a negative diffstat.
>
> There is still a healthy dose of inline assembly (system register
> accessors, runtime code patching), but I've tried not to make it too
> invasive. The generated code, while not exactly brilliant, doesn't
> look too shabby. I do expect a small performance degradation, but I
> believe this is something we can improve over time (my initial
> measurements don't show any obvious regression though).

I ran this through my experimental setup on m400 and got this:

BM              v4.4-rc2    v4.4-rc2-wsinc   overhead
--              --------    --------------   --------
Apache           5297.11       5243.77       101.02%
fio rand read    4354.33       4294.50       101.39%
fio rand write   2465.33       2231.33       110.49%
hackbench          17.48         19.78       113.16%
memcached       96442.69     101274.04        95.23%
TCP_MAERTS       5966.89       6029.72        98.96%
TCP_STREAM       6284.60       6351.74        98.94%
TCP_RR          15044.71      14324.03       105.03%
pbzip2 c           18.13         17.89        98.68%
pbzip2 d           11.42         11.45       100.26%
kernbench          50.13         50.28       100.30%
mysql 1           152.84        154.01       100.77%
mysql 2            98.12         98.94       100.84%
mysql 4            51.32         51.17        99.71%
mysql 8            27.31         27.70       101.42%
mysql 20           16.80         17.21       102.47%
mysql 100          13.71         14.11       102.92%
mysql 200          15.20         15.20       100.00%
mysql 400          17.16         17.16       100.00%

(you want to see this with a viewer that renders clear-text and tabs
properly)

What this tells me is that we do take a noticeable hit on the
world-switch path, which shows up in the TCP_RR and hackbench
workloads, which have a high precision in their output.

Note that the memcached number is well within its variability between
individual benchmark runs, where it varies by up to 12% of its average
in over 80% of the executions.

I don't think this is a showstopper, but we could consider looking
more closely at a breakdown of the world-switch path and verify
if/where we are really taking a hit.

-Christoffer
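The convention behind the overhead column is not spelled out in the
message, but the numbers are consistent with it being baseline/patched
for throughput-style results (higher is better) and patched/baseline
for time-style results (lower is better), so that a value above 100%
always indicates a regression. A short Python sketch checking a few
rows of the table under that interpretation:

```python
# Recompute the "overhead" column for a few rows of the table above.
# Assumption (inferred from the numbers, not stated in the thread):
# throughput results use baseline/patched, time results patched/baseline.
results = {
    # name: (v4.4-rc2 baseline, v4.4-rc2-wsinc patched, higher_is_better)
    "Apache":    (5297.11, 5243.77, True),    # requests/s
    "hackbench": (17.48, 19.78, False),       # seconds
    "TCP_RR":    (15044.71, 14324.03, True),  # transactions/s
    "kernbench": (50.13, 50.28, False),       # seconds
}

for name, (base, patched, higher_is_better) in results.items():
    ratio = base / patched if higher_is_better else patched / base
    print(f"{name:10s} {100 * ratio:6.2f}%")
# -> Apache 101.02%, hackbench 113.16%, TCP_RR 105.03%, kernbench 100.30%
```

The recomputed values match the table, including the two largest hits
(hackbench and TCP_RR) that the commentary singles out as
world-switch-sensitive workloads.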