Way operations

Julien Grall Thu, 06 Dec 2018 04:22:50 -0800

Hi,

On 12/4/18 8:26 PM, Julien Grall wrote:

At the moment, the implementation of Set/Way operations will go through
all the entries of the guest P2M and flush them. However, this is very
expensive and may render a guest OS using them unusable.


For instance, Linux 32-bit will use Set/Way operations during secondary
CPU bring-up. As the implementation is really expensive, it may be possible
to hit the CPU bring-up timeout.

To limit the Set/Way impact, we track which pages of the guest have been
accessed between batches of Set/Way operations. This is done using
bit[0] (aka valid bit) of the P2M entry.

A new per-arch helper is introduced to perform actions just before the
guest is first unpaused. This will be used to invalidate the P2M to
track access from the start of the guest.
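
The tracking scheme described above can be modelled with a toy sketch.
This is not the code from the patch: the flat table, the entry count and
every toy_* name are hypothetical, and the step that re-arms tracking
after a flush is an assumption drawn from the phrase "between batches".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_P2M_ENTRIES 8                 /* hypothetical table size */
#define TOY_VALID (UINT64_C(1) << 0)      /* bit[0], the valid bit */

/* Hypothetical flat table standing in for the guest P2M. */
struct toy_p2m {
    uint64_t entry[TOY_P2M_ENTRIES];
};

/* Clear the valid bit everywhere so the next access to each entry traps. */
static void toy_invalidate_all(struct toy_p2m *p2m)
{
    for (size_t i = 0; i < TOY_P2M_ENTRIES; i++)
        p2m->entry[i] &= ~TOY_VALID;
}

/*
 * Fault path: an entry with the valid bit clear has just been accessed,
 * so set the bit again. Returns true if the fault came from tracking.
 */
static bool toy_handle_fault(struct toy_p2m *p2m, size_t idx)
{
    if (idx >= TOY_P2M_ENTRIES || (p2m->entry[idx] & TOY_VALID))
        return false;
    p2m->entry[idx] |= TOY_VALID;
    return true;
}

/*
 * Set/Way batch: only entries accessed since the last batch (valid bit
 * set) need flushing. Tracking is then re-armed (assumption).
 * Returns the number of entries that would have been flushed.
 */
static size_t toy_flush_accessed(struct toy_p2m *p2m)
{
    size_t flushed = 0;

    for (size_t i = 0; i < TOY_P2M_ENTRIES; i++)
        if (p2m->entry[i] & TOY_VALID)
            flushed++;            /* a real cache clean would go here */

    toy_invalidate_all(p2m);
    return flushed;
}
```

In this model, calling toy_invalidate_all() once before the guest first
runs plays the role of the new per-arch helper: every later access goes
through toy_handle_fault() exactly once per entry per batch.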

Signed-off-by: Julien Grall <[email protected]>

---

While we can spread d->creation_finished all over the code, the per-arch
helper to perform actions just before the guest is first unpaused can
bring a lot of benefit for both architectures. For instance, on Arm, the
flush to the instruction cache could be delayed until the domain is
first run. This would greatly improve the performance of guest creation.

I am still benchmarking whether having a command line option is worth
it. I will provide numbers as soon as I have them.

I remember Stefano suggested looking at the impact on boot. This is a
bit tricky to do as there are many kernel configurations and not all
the mappings may have been touched during boot.

Instead I wrote a tiny guest [1] that will zero roughly 1GB of memory.
Because the toolstack will always try to allocate with the biggest
mapping, I had to hack the toolstack a bit to be able to test with
different mapping sizes (but not a mix). The guest has only one vCPU
with a dedicated pCPU.

        - 1GB: 0.03% slower when starting with valid bit unset
        - 2MB: 0.04% faster when starting with valid bit unset
        - 4KB: ~3% slower when starting with valid bit unset

The performance difference with 1GB and 2MB mappings is pretty much
insignificant because the number of traps is very limited (1 and 513
respectively). With 4KB mappings, there is a much more significant drop
because there are more traps (~262700), as the P2M contains more
entries.
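
The trap counts above can be reproduced with a short calculation. The
sketch below assumes a 4KB granule (512 entries per table) and one trap
per P2M entry at every level from the 1GB level down to the leaf. That
reading is an assumption, but it matches the quoted figures; the function
name is hypothetical.

```c
#include <stdint.h>

#define ENTRIES_PER_TABLE 512u  /* 4KB granule: 512 entries per table */

/*
 * Number of traps taken while touching 1GB of memory, assuming one trap
 * per entry at each level between the 1GB entry and the leaf mapping.
 * leaf_level: 1 = 1GB mapping, 2 = 2MB mappings, 3 = 4KB mappings.
 */
static uint64_t traps_for_1gb(unsigned int leaf_level)
{
    uint64_t traps = 0;
    uint64_t entries = 1;       /* a single entry covers 1GB at level 1 */

    for (unsigned int level = 1; level <= leaf_level; level++) {
        traps += entries;
        entries *= ENTRIES_PER_TABLE;
    }

    return traps;
}
```

Under these assumptions the function gives 1 trap for a 1GB mapping,
1 + 512 = 513 for 2MB mappings, and 1 + 512 + 262144 = 262657 for 4KB
mappings, in line with the ~262700 measured.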

However, having many 4KB mappings in the P2M is pretty unlikely as the
toolstack will always try to get bigger mappings. In the real world, you
should only have 4KB mappings when your guest memory is not aligned with
a bigger mapping. If you end up with many 4KB mappings, then you are
already going to have a performance impact in the long run because of
the TLB pressure.

Overall, I would not recommend introducing a command line option until
we have figured out a use case where the trap will be a slowdown.


Cheers,

[1]

.text
    b       _start                  /* branch to kernel start, magic */
    .long   0                       /* reserved */

    .quad   0x0                     /* Image load offset from start of RAM */

    .quad   0x0                     /* XXX: Effective Image size */
    .quad   2                       /* kernel flags: LE, 4K page size */
    .quad   0                       /* reserved */
    .quad   0                       /* reserved */
    .quad   0                       /* reserved */
    .byte   0x41                    /* Magic number, "ARM\x64" */
    .byte   0x52
    .byte   0x4d
    .byte   0x64
    .long   0                       /* reserved */

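_start:
    isb
    mrs     x0, CNTPCT_EL0          /* x0 = counter before the loop */
    isb

    adrp    x2, _end                /* x2 = 4KB-aligned page address of _end */
    ldr     x3, =(0x40000000 + (1 << 30))   /* x3 = 0x40000000 + 1GB */
1:  str     xzr, [x2], #8           /* zero 8 bytes, post-increment x2 */
    cmp     x2, x3
    b.lo    1b                      /* loop until x2 reaches x3 */

    isb
    mrs     x1, CNTPCT_EL0          /* x1 = counter after the loop */
    isb
    hvc     #0xffff
1:  b       1b                      /* spin forever */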

--
Julien Grall

_______________________________________________
Xen-devel mailing list
[email protected]
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH for-4.12 v2 17/17] xen/arm: Track page accessed between batch of Set/Way operations
