Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-12-01 Thread Christoffer Dall
On Tue, Dec 01, 2015 at 05:51:46PM +, Marc Zyngier wrote:
> On 01/12/15 12:00, Christoffer Dall wrote:
> > On Tue, Dec 01, 2015 at 09:58:23AM +, Marc Zyngier wrote:
> >> On 30/11/15 20:33, Christoffer Dall wrote:
> >>> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>  Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>  and mean piece of hand-crafted assembly code. Over time, features have
>  crept in, the code has become harder to maintain, and the smallest
>  change is a pain to introduce. The VHE patches are a prime example of
>  why this doesn't work anymore.
> 
>  This series rewrites most of the existing assembly code in C, but keeps
>  the existing code structure in place (most function names will look
>  familiar to the reader). The biggest change is that we don't have to
>  deal with a static register allocation (the compiler does it for us),
>  we can easily follow structure and pointers, and only the lowest level
>  is still in assembly code. Oh, and a negative diffstat.
> 
>  There is still a healthy dose of inline assembly (system register
>  accessors, runtime code patching), but I've tried not to make it too
>  invasive. The generated code, while not exactly brilliant, doesn't
>  look too shabby. I do expect a small performance degradation, but I
>  believe this is something we can improve over time (my initial
>  measurements don't show any obvious regression though).
> >>>
> >>> I ran this through my experimental setup on m400 and got this:
> >>
> >> [...]
> >>
> >>> What this tells me is that we do take a noticeable hit on the
> >>> world-switch path, which shows up in the TCP_RR and hackbench workloads,
> >>> which report their results with high precision.
> >>>
> >>> Note that the memcached number is well within its variability between
> >>> individual benchmark runs, where it varies by up to 12% of its average
> >>> in over 80% of the executions.
> >>>
> >>> I don't think this is a showstopper, but we could consider looking
> >>> more closely at a breakdown of the world-switch path and verifying
> >>> if/where we are really taking a hit.
> >>
> >> Thanks for doing so, very interesting. As a data point, what compiler
> >> are you using? I'd expect some variability based on the compiler version...
> >>
> > I used the following (compiling natively on the m400):
> > 
> > gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)
> 
> For what it is worth, I've run hackbench on my Seattle B0 (8xA57 2GHz),
> with a 4 vcpu VM and got the following results (10 runs per kernel
> version, same configuration):
> 
> v4.4-rc3-wsinc: Average 31.750
> 32.459
> 32.124
> 32.435
> 31.940
> 31.085
> 31.804
> 31.862
> 30.985
> 31.450
> 31.359
> 
> v4.4-rc3: Average 31.954
> 31.806
> 31.598
> 32.697
> 31.472
> 31.410
> 32.562
> 31.938
> 31.932
> 31.672
> 32.459
> 
> This is with GCC as produced by Linaro:
> aarch64-linux-gnu-gcc (Linaro GCC 5.1-2015.08) 5.1.1 20150608
> 
> It could well be that your compiler generates worse code than the one I
> use, or that the code it outputs is badly tuned for XGene. I guess I
> need to unearth my Mustang to find out...
> 
Worth investigating, I suppose.  At any rate, the conclusion stays the
same; we should proceed with these patches.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-12-01 Thread Marc Zyngier
On 01/12/15 12:00, Christoffer Dall wrote:
> On Tue, Dec 01, 2015 at 09:58:23AM +, Marc Zyngier wrote:
>> On 30/11/15 20:33, Christoffer Dall wrote:
>>> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
 Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
 and mean piece of hand-crafted assembly code. Over time, features have
 crept in, the code has become harder to maintain, and the smallest
 change is a pain to introduce. The VHE patches are a prime example of
 why this doesn't work anymore.

 This series rewrites most of the existing assembly code in C, but keeps
 the existing code structure in place (most function names will look
 familiar to the reader). The biggest change is that we don't have to
 deal with a static register allocation (the compiler does it for us),
 we can easily follow structure and pointers, and only the lowest level
 is still in assembly code. Oh, and a negative diffstat.

 There is still a healthy dose of inline assembly (system register
 accessors, runtime code patching), but I've tried not to make it too
 invasive. The generated code, while not exactly brilliant, doesn't
 look too shabby. I do expect a small performance degradation, but I
 believe this is something we can improve over time (my initial
 measurements don't show any obvious regression though).
>>>
>>> I ran this through my experimental setup on m400 and got this:
>>
>> [...]
>>
>>> What this tells me is that we do take a noticeable hit on the
>>> world-switch path, which shows up in the TCP_RR and hackbench workloads,
>>> which report their results with high precision.
>>>
>>> Note that the memcached number is well within its variability between
>>> individual benchmark runs, where it varies by up to 12% of its average
>>> in over 80% of the executions.
>>>
>>> I don't think this is a showstopper, but we could consider looking
>>> more closely at a breakdown of the world-switch path and verifying
>>> if/where we are really taking a hit.
>>
>> Thanks for doing so, very interesting. As a data point, what compiler
>> are you using? I'd expect some variability based on the compiler version...
>>
> I used the following (compiling natively on the m400):
> 
> gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)

For what it is worth, I've run hackbench on my Seattle B0 (8xA57 2GHz),
with a 4 vcpu VM and got the following results (10 runs per kernel
version, same configuration):

v4.4-rc3-wsinc: Average 31.750
32.459
32.124
32.435
31.940
31.085
31.804
31.862
30.985
31.450
31.359

v4.4-rc3: Average 31.954
31.806
31.598
32.697
31.472
31.410
32.562
31.938
31.932
31.672
32.459
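For what it's worth, the quoted averages can be reproduced from the raw
runs above; a quick sanity-check script, assuming a plain arithmetic mean
over the 10 runs per kernel:

```python
# Reproduce the per-kernel hackbench averages quoted above
# (arithmetic mean of the 10 individual run times, in seconds).
wsinc = [32.459, 32.124, 32.435, 31.940, 31.085,
         31.804, 31.862, 30.985, 31.450, 31.359]  # v4.4-rc3-wsinc
base = [31.806, 31.598, 32.697, 31.472, 31.410,
        32.562, 31.938, 31.932, 31.672, 32.459]   # v4.4-rc3

def mean(xs):
    return sum(xs) / len(xs)

# The quoted 31.954 for v4.4-rc3 looks truncated rather than
# rounded (the mean comes out at 31.9546).
print(f"v4.4-rc3-wsinc: {mean(wsinc):.4f}")
print(f"v4.4-rc3:       {mean(base):.4f}")
```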

This is with GCC as produced by Linaro:
aarch64-linux-gnu-gcc (Linaro GCC 5.1-2015.08) 5.1.1 20150608

It could well be that your compiler generates worse code than the one I
use, or that the code it outputs is badly tuned for XGene. I guess I
need to unearth my Mustang to find out...

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-12-01 Thread Christoffer Dall
On Tue, Dec 01, 2015 at 09:58:23AM +, Marc Zyngier wrote:
> On 30/11/15 20:33, Christoffer Dall wrote:
> > On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
> >> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
> >> and mean piece of hand-crafted assembly code. Over time, features have
> >> crept in, the code has become harder to maintain, and the smallest
> >> change is a pain to introduce. The VHE patches are a prime example of
> >> why this doesn't work anymore.
> >>
> >> This series rewrites most of the existing assembly code in C, but keeps
> >> the existing code structure in place (most function names will look
> >> familiar to the reader). The biggest change is that we don't have to
> >> deal with a static register allocation (the compiler does it for us),
> >> we can easily follow structure and pointers, and only the lowest level
> >> is still in assembly code. Oh, and a negative diffstat.
> >>
> >> There is still a healthy dose of inline assembly (system register
> >> accessors, runtime code patching), but I've tried not to make it too
> >> invasive. The generated code, while not exactly brilliant, doesn't
> >> look too shabby. I do expect a small performance degradation, but I
> >> believe this is something we can improve over time (my initial
> >> measurements don't show any obvious regression though).
> > 
> > I ran this through my experimental setup on m400 and got this:
> 
> [...]
> 
> > What this tells me is that we do take a noticeable hit on the
> > world-switch path, which shows up in the TCP_RR and hackbench workloads,
> > which report their results with high precision.
> > 
> > Note that the memcached number is well within its variability between
> > individual benchmark runs, where it varies by up to 12% of its average
> > in over 80% of the executions.
> > 
> > I don't think this is a showstopper, but we could consider looking
> > more closely at a breakdown of the world-switch path and verifying
> > if/where we are really taking a hit.
> 
> Thanks for doing so, very interesting. As a data point, what compiler
> are you using? I'd expect some variability based on the compiler version...
> 
I used the following (compiling natively on the m400):

gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)


-Christoffer


Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-12-01 Thread Marc Zyngier
On 30/11/15 20:33, Christoffer Dall wrote:
> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>> and mean piece of hand-crafted assembly code. Over time, features have
>> crept in, the code has become harder to maintain, and the smallest
>> change is a pain to introduce. The VHE patches are a prime example of
>> why this doesn't work anymore.
>>
>> This series rewrites most of the existing assembly code in C, but keeps
>> the existing code structure in place (most function names will look
>> familiar to the reader). The biggest change is that we don't have to
>> deal with a static register allocation (the compiler does it for us),
>> we can easily follow structure and pointers, and only the lowest level
>> is still in assembly code. Oh, and a negative diffstat.
>>
>> There is still a healthy dose of inline assembly (system register
>> accessors, runtime code patching), but I've tried not to make it too
>> invasive. The generated code, while not exactly brilliant, doesn't
>> look too shabby. I do expect a small performance degradation, but I
>> believe this is something we can improve over time (my initial
>> measurements don't show any obvious regression though).
> 
> I ran this through my experimental setup on m400 and got this:

[...]

> What this tells me is that we do take a noticeable hit on the
> world-switch path, which shows up in the TCP_RR and hackbench workloads,
> which report their results with high precision.
> 
> Note that the memcached number is well within its variability between
> individual benchmark runs, where it varies by up to 12% of its average
> in over 80% of the executions.
> 
> I don't think this is a showstopper, but we could consider looking
> more closely at a breakdown of the world-switch path and verifying
> if/where we are really taking a hit.

Thanks for doing so, very interesting. As a data point, what compiler
are you using? I'd expect some variability based on the compiler version...

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-11-30 Thread Mario Smarduch


On 11/30/2015 12:33 PM, Christoffer Dall wrote:
> On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
>> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
>> and mean piece of hand-crafted assembly code. Over time, features have
>> crept in, the code has become harder to maintain, and the smallest
>> change is a pain to introduce. The VHE patches are a prime example of
>> why this doesn't work anymore.
>>
>> This series rewrites most of the existing assembly code in C, but keeps
>> the existing code structure in place (most function names will look
>> familiar to the reader). The biggest change is that we don't have to
>> deal with a static register allocation (the compiler does it for us),
>> we can easily follow structure and pointers, and only the lowest level
>> is still in assembly code. Oh, and a negative diffstat.
>>
>> There is still a healthy dose of inline assembly (system register
>> accessors, runtime code patching), but I've tried not to make it too
>> invasive. The generated code, while not exactly brilliant, doesn't
>> look too shabby. I do expect a small performance degradation, but I
>> believe this is something we can improve over time (my initial
>> measurements don't show any obvious regression though).
> 
> I ran this through my experimental setup on m400 and got this:
> 
> BM              v4.4-rc2   v4.4-rc2-wsinc   overhead
> --------------  --------   --------------   --------
> Apache          5297.11    5243.77          101.02%
> fio rand read   4354.33    4294.50          101.39%
> fio rand write  2465.33    2231.33          110.49%
> hackbench       17.48      19.78            113.16%
> memcached       96442.69   101274.04        95.23%
> TCP_MAERTS      5966.89    6029.72          98.96%
> TCP_STREAM      6284.60    6351.74          98.94%
> TCP_RR          15044.71   14324.03         105.03%
> pbzip2 c        18.13      17.89            98.68%
> pbzip2 d        11.42      11.45            100.26%
> kernbench       50.13      50.28            100.30%
> mysql 1         152.84     154.01           100.77%
> mysql 2         98.12      98.94            100.84%
> mysql 4         51.32      51.17            99.71%
> mysql 8         27.31      27.70            101.42%
> mysql 20        16.80      17.21            102.47%
> mysql 100       13.71      14.11            102.92%
> mysql 200       15.20      15.20            100.00%
> mysql 400       17.16      17.16            100.00%
> 
> (you want to see this with a viewer that renders clear-text and tabs
> properly)
> 
> What this tells me is that we do take a noticeable hit on the
> world-switch path, which shows up in the TCP_RR and hackbench workloads,
> which report their results with high precision.
> 
> Note that the memcached number is well within its variability between
> individual benchmark runs, where it varies by up to 12% of its average
> in over 80% of the executions.
> 
> I don't think this is a showstopper, but we could consider looking
> more closely at a breakdown of the world-switch path and verifying
> if/where we are really taking a hit.
> 
> -Christoffer
> 

I ran some of the lmbench micro-benchmarks. Currently the usleep one
consistently stands out, by about 0.4%, or an extra 300 ns per sleep.
A few others show some outliers, which I will look at more closely.
The tests were run on Juno.
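For anyone wanting a rough cross-check without lmbench, something in this
spirit can be used to measure per-sleep overhead; this is a minimal sketch
of my own, not lmbench's actual methodology, and the chosen sleep length
and iteration count are arbitrary:

```python
# Time many short sleeps and report the mean latency beyond the
# requested duration, in nanoseconds per sleep.
import time

def avg_sleep_overhead_ns(sleep_us=100, iterations=1000):
    requested_ns = sleep_us * 1000
    total_extra = 0
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        time.sleep(sleep_us / 1e6)
        # Extra time spent beyond the requested sleep duration.
        total_extra += time.perf_counter_ns() - t0 - requested_ns
    return total_extra / iterations

if __name__ == "__main__":
    print(f"average extra latency: {avg_sleep_overhead_ns():.0f} ns")
```

Running this under a guest and on the host, with and without the series,
would give a comparable per-sleep number.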

- Mario


Re: [PATCH v2 00/21] arm64: KVM: world switch in C

2015-11-30 Thread Christoffer Dall
On Fri, Nov 27, 2015 at 06:49:54PM +, Marc Zyngier wrote:
> Once upon a time, the KVM/arm64 world switch was a nice, clean, lean
> and mean piece of hand-crafted assembly code. Over time, features have
> crept in, the code has become harder to maintain, and the smallest
> change is a pain to introduce. The VHE patches are a prime example of
> why this doesn't work anymore.
> 
> This series rewrites most of the existing assembly code in C, but keeps
> the existing code structure in place (most function names will look
> familiar to the reader). The biggest change is that we don't have to
> deal with a static register allocation (the compiler does it for us),
> we can easily follow structure and pointers, and only the lowest level
> is still in assembly code. Oh, and a negative diffstat.
> 
> There is still a healthy dose of inline assembly (system register
> accessors, runtime code patching), but I've tried not to make it too
> invasive. The generated code, while not exactly brilliant, doesn't
> look too shabby. I do expect a small performance degradation, but I
> believe this is something we can improve over time (my initial
> measurements don't show any obvious regression though).

I ran this through my experimental setup on m400 and got this:

BM              v4.4-rc2   v4.4-rc2-wsinc   overhead
--------------  --------   --------------   --------
Apache          5297.11    5243.77          101.02%
fio rand read   4354.33    4294.50          101.39%
fio rand write  2465.33    2231.33          110.49%
hackbench       17.48      19.78            113.16%
memcached       96442.69   101274.04        95.23%
TCP_MAERTS      5966.89    6029.72          98.96%
TCP_STREAM      6284.60    6351.74          98.94%
TCP_RR          15044.71   14324.03         105.03%
pbzip2 c        18.13      17.89            98.68%
pbzip2 d        11.42      11.45            100.26%
kernbench       50.13      50.28            100.30%
mysql 1         152.84     154.01           100.77%
mysql 2         98.12      98.94            100.84%
mysql 4         51.32      51.17            99.71%
mysql 8         27.31      27.70            101.42%
mysql 20        16.80      17.21            102.47%
mysql 100       13.71      14.11            102.92%
mysql 200       15.20      15.20            100.00%
mysql 400       17.16      17.16            100.00%

(you want to see this with a viewer that renders clear-text and tabs
properly)
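For reference, the overhead column appears to be oriented so that values
above 100% always mean the wsinc kernel does worse: for throughput
benchmarks (higher is better) it is baseline/wsinc, and for time-based
benchmarks (lower is better) it is wsinc/baseline. A small sketch
reproducing a few rows under that reading; the higher_is_better flags are
my inference from the numbers, not something stated here:

```python
# Compute the overhead percentage so that >100% always means the
# world-switch-in-C (wsinc) kernel performed worse than the baseline.
def overhead(baseline, wsinc, higher_is_better):
    ratio = baseline / wsinc if higher_is_better else wsinc / baseline
    return ratio * 100.0

print(f"Apache:    {overhead(5297.11, 5243.77, True):.2f}%")    # throughput
print(f"hackbench: {overhead(17.48, 19.78, False):.2f}%")       # time
print(f"memcached: {overhead(96442.69, 101274.04, True):.2f}%") # throughput
```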

What this tells me is that we do take a noticeable hit on the
world-switch path, which shows up in the TCP_RR and hackbench workloads,
which report their results with high precision.

Note that the memcached number is well within its variability between
individual benchmark runs, where it varies by up to 12% of its average
in over 80% of the executions.

I don't think this is a showstopper, but we could consider looking
more closely at a breakdown of the world-switch path and verifying
if/where we are really taking a hit.

-Christoffer