Hi,
On 12/4/18 8:26 PM, Julien Grall wrote:
At the moment, the implementation of Set/Way operations will go through
all the entries of the guest P2M and flush them. However, this is very
expensive and may render a guest OS using them unusable.
For instance, Linux 32-bit will use Set/Way operations during secondary
CPU bring-up. As the implementation is really expensive, it may be possible
to hit the CPU bring-up timeout.
To limit the Set/Way impact, we track which pages of the guest have
been accessed between batches of Set/Way operations. This is done
using bit[0] (aka valid bit) of the P2M entry.
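
The tracking scheme described above can be illustrated with a small toy
model. This is only a sketch and not the code from the patch: the flat
table, the struct and every toy_* name are hypothetical, and the "flush"
is just a counter. It shows the idea that clearing bit[0] makes the next
access trap, the trap sets the bit again, and the Set/Way emulation then
only needs to look at entries whose bit is set.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRY_VALID ((uint64_t)1 << 0)  /* bit[0] of the entry */

/* Hypothetical flat table standing in for the guest P2M. */
struct toy_p2m {
    uint64_t *entries;
    size_t nr;
    unsigned long traps;    /* faults taken to re-validate an entry */
    unsigned long flushed;  /* entries handled by the Set/Way emulation */
};

/* Clear the valid bit everywhere so the next access to each entry traps. */
static void toy_invalidate_all(struct toy_p2m *p2m)
{
    for (size_t i = 0; i < p2m->nr; i++)
        p2m->entries[i] &= ~ENTRY_VALID;
}

/* Guest access: a cleared valid bit means one trap, then the entry is valid. */
static void toy_guest_access(struct toy_p2m *p2m, size_t idx)
{
    if (!(p2m->entries[idx] & ENTRY_VALID)) {
        p2m->traps++;
        p2m->entries[idx] |= ENTRY_VALID;
    }
}

/* Set/Way emulation: only entries touched since the last batch are flushed. */
static void toy_set_way_flush(struct toy_p2m *p2m)
{
    for (size_t i = 0; i < p2m->nr; i++)
        if (p2m->entries[i] & ENTRY_VALID)
            p2m->flushed++;  /* a real implementation would clean the cache here */

    /* Restart the tracking for the next batch. */
    toy_invalidate_all(p2m);
}
```

In this model a page that is accessed several times between two batches
costs a single trap, and pages that were never accessed cost nothing at
flush time.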
This patch introduces a new per-arch helper to perform actions just
before the guest is first unpaused. This will be used to invalidate the
P2M to track access from the start of the guest.
Signed-off-by: Julien Grall <julien.gr...@arm.com>
---
While we can spread d->creation_finished all over the code, the per-arch
helper to perform actions just before the guest is first unpaused can
bring a lot of benefit for both architectures. For instance, on Arm, the
flush to the instruction cache could be delayed until the domain is
first run. This would greatly improve the performance of creating a guest.
I am still benchmarking whether having a command line option is
worth it. I will provide numbers as soon as I have them.
I remembered that Stefano suggested looking at the impact on boot. This
is a bit tricky to do, as many kernel configurations exist and not all
the mappings may have been touched during boot.
Instead I wrote a tiny guest [1] that will zero roughly 1GB of memory.
Because the toolstack will always try to allocate with the biggest
mapping, I had to hack the toolstack a bit to be able to test with
different mapping sizes (but not a mix). The guest has only one vCPU with
a dedicated pCPU.
- 1GB: 0.03% slower when starting with valid bit unset
- 2MB: 0.04% faster when starting with valid bit unset
- 4KB: ~3% slower when starting with valid bit unset
The performance difference using 1GB and 2MB mappings is pretty much
insignificant because the number of traps is very limited (resp. 1 and
513). With 4KB mappings, there is a much more significant drop because
you have more traps (~262700) as the P2M contains more entries.
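
The trap counts can be roughly reproduced with a back-of-the-envelope
calculation. The sketch below is not part of the patch. It assumes a 4KB
granule (512 entries per table), a region of exactly 1GB aligned on a
1GB boundary, and one trap per invalidated entry at each level from
level 1 down to the level holding the mapping. The function name and its
parameters are made up for illustration.

```c
#include <stdint.h>

#define ENTRIES_PER_TABLE 512ULL  /* 4KB granule, 8-byte entries */

/*
 * Estimate the number of traps needed to re-validate every entry covering
 * nr_gb gigabytes of guest memory.
 * mapping_level: 1 = 1GB blocks, 2 = 2MB blocks, 3 = 4KB pages.
 */
static uint64_t estimate_traps(uint64_t nr_gb, unsigned int mapping_level)
{
    uint64_t traps = 0;
    uint64_t entries = nr_gb;  /* level 1 entries covering the region */

    for (unsigned int level = 1; level <= mapping_level; level++) {
        traps += entries;              /* one trap per entry at this level */
        entries *= ENTRIES_PER_TABLE;  /* each entry fans out to a full table */
    }

    return traps;
}
```

Under these assumptions, estimate_traps(1, 1) gives 1 and
estimate_traps(1, 2) gives 1 + 512 = 513, matching the figures above.
estimate_traps(1, 3) gives 1 + 512 + 262144 = 262657, which is in line
with the ~262700 observed.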
However, having many 4KB mappings in the P2M is pretty unlikely as the
toolstack will always try to get bigger mappings. In the real world, you
should only have 4KB mappings when your guest memory is not aligned with
a bigger mapping. If you end up having many 4KB mappings, then you are
already going to see a performance impact in the long run because of the
TLB pressure.
Overall, I would not recommend introducing a command line option until
we have figured out a use case where the traps cause a noticeable slowdown.
Cheers,
[1]
.text
b _start /* branch to kernel start, magic */
.long 0 /* reserved */
.quad 0x0 /* Image load offset from start of RAM */
.quad 0x0 /* XXX: Effective Image size */
.quad 2 /* kernel flags: LE, 4K page size */
.quad 0 /* reserved */
.quad 0 /* reserved */
.quad 0 /* reserved */
.byte 0x41 /* Magic number, "ARM\x64" */
.byte 0x52
.byte 0x4d
.byte 0x64
.long 0 /* reserved */
_start:
isb
mrs x0, CNTPCT_EL0
isb
adrp x2, _end
ldr x3, =(0x40000000 + (1 << 30))
1: str xzr, [x2], #8
cmp x2, x3
b.lo 1b
isb
mrs x1, CNTPCT_EL0
isb
hvc #0xffff
1: b 1b
--
Julien Grall
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel