This is the initial mail, meant as information and as the start of the
discussion about step 2 of the xen-ppc profiling support. The targets are:
1. Sampling xen by passing xen samples to a profiling domain
2. Passing results of postponed perfmon interrupts to the appropriate
   domain (currently these samples are ignored)
   a) by using a mechanism compatible with the one we need for 1. anyway
   b) or by mangling the vcpu structure to emulate the occurred irq and
      let the domain handle it
3. "Context switch" the pmc status in all transitions between domains and
   xen while sampling xen
== Background I - PMC counters need to be reset after the perfmon
interrupt occurred ==
First a very compact description of how performance monitoring is set
up and what has to happen in/after a performance monitor interrupt - for
the real details read the appropriate cpu user manual.
An operating system may set up MMCR0, MMCR1 and MMCRA so that, for
example, PMC2 counts cycles, and register a handler for the perfmon irq.
A PMC contains a signed 32 bit value (bit 0 is the CTR_NEG bit). When
the PMC wraps to a negative value, the current instruction/data address is
written to the SIAR/SDAR SPRs - this is a performance monitor
exception. For example, if you want to sample every 0x10000000 cycles you
set the PMC to 0x70000000 (0x7FFFFFFF+0x1 = 0x80000000 is the first
negative value).
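To make the arithmetic explicit, here is a minimal sketch in plain C. The helper names are made up for this mail and nothing here touches a real SPR; it only shows how the start value follows from the wanted sample period.

```c
#include <assert.h>
#include <stdint.h>

/* First value a 32 bit PMC treats as negative (most significant bit set). */
#define PMC_FIRST_NEGATIVE 0x80000000u

/* Hypothetical helper: start value so that the PMC turns negative
 * after `period` counted events. */
static uint32_t pmc_reload_value(uint32_t period)
{
    /* A period of 0 or above 0x80000000 cannot be expressed this way. */
    assert(period > 0 && period <= PMC_FIRST_NEGATIVE);
    return PMC_FIRST_NEGATIVE - period;
}

/* A PMC counts as negative when its most significant bit is set. */
static int pmc_is_negative(uint32_t pmc)
{
    return (pmc & PMC_FIRST_NEGATIVE) != 0;
}
```

With a period of 0x10000000 this gives the 0x70000000 from the example, and adding the period to that value lands exactly on 0x80000000.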
If a performance monitor exception exists (a PMC is negative) AND
external interrupts are enabled (MSR[EE]=1) AND performance monitor
exceptions are enabled (MMCR0[PMXE]=1), the performance monitor
interrupt occurs and the handler can read SIAR/SDAR.
The interrupt itself disables subsequent perfmon irqs by setting
MMCR0[PMXE] to zero; the irq handler has to do the rest. It has to reset
the PMC to a non-negative value: in our scenario we wanted to sample
every 0x10000000 cycles, so the handler has to set PMC2 back to
0x70000000. It also has to re-enable perfmon interrupts by setting
MMCR0[PMXE] back to 1.
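The sequence above can be sketched as a small software model. This is not xen or linux code: the struct and function names are invented for illustration, the registers are plain struct fields instead of SPRs, and the MMCR0[PMXE] mask is an assumption that has to be checked against the cpu manual. The model also ignores everything else interrupt entry does (for example clearing MSR[EE]).

```c
#include <assert.h>
#include <stdint.h>

#define MSR_EE      0x8000u      /* external interrupt enable */
#define MMCR0_PMXE  0x04000000u  /* perfmon exception enable (assumed mask) */
#define PMC_NEG     0x80000000u  /* sign bit of a 32 bit PMC */

/* Simulated register state, not real SPR accesses. */
struct pmu_model {
    uint64_t msr;
    uint32_t mmcr0;
    uint32_t pmc2;
};

/* The interrupt is delivered only if all three conditions hold. */
static int perfmon_irq_deliverable(const struct pmu_model *m)
{
    return (m->pmc2 & PMC_NEG) && (m->msr & MSR_EE) &&
           (m->mmcr0 & MMCR0_PMXE);
}

/* What the interrupt itself does in this model: it clears PMXE. */
static void perfmon_irq_entry(struct pmu_model *m)
{
    assert(perfmon_irq_deliverable(m));
    m->mmcr0 &= ~MMCR0_PMXE;
}

/* What the handler has to do: reset the PMC, then re-enable PMXE. */
static void perfmon_irq_handler(struct pmu_model *m, uint32_t reset_value)
{
    assert((reset_value & PMC_NEG) == 0);  /* must be non-negative */
    m->pmc2 = reset_value;
    m->mmcr0 |= MMCR0_PMXE;
}
```

If the handler skipped the last line, perfmon_irq_deliverable() would never become true again, which is the "performance monitor stops working" case.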
Oprofile resets the PMC values in the kernel perfmon interrupt handler,
which knows the values (sysfs) and uses them to reset the PMCs properly.
== Issue I - postponed perfmon irq's ==
Sometimes perfmon interrupts belonging to a domain occur in xen space.
Currently the performance monitor is always set up with MMCR0[FCH], so
the hypervisor privilege level can't be the original source. The values
of MMCRA[SAMPHV] & MMCRA[SAMPPR] also show that the sample was taken in
domain space. Such a sample can be reported "postponed" into xen
because we do not run completely with MSR[EE]=0. The current perfmon
handler in xen simply ignores those samples - they are few enough that
this is currently a negligible issue (small loss of accuracy).
Even though we do not (yet) sample xen space, the handler needs to be
there to re-enable MMCR0[PMXE] and to reset the PMC values. Otherwise the
performance monitor would stop working after the first of these postponed
irqs, because without MMCR0[PMXE]=1 no further perfmon irq would happen.
As described above, linux knows the values to which it should reset the
wrapped PMCs in its handler, but xen does not. Combined with the
postponed perfmon irqs that belong to a domain but occur in xen, we
have a perfmon irq handler in xen that does not know how to reset
wrapped PMC values properly.
The current implementation of the xen perfmon handler resets the PMC
values to the defaults of oprofile, but for profiling to work properly
a) xen needs to be aware of the values a domain would reset the PMCs to
b) or pass the sample to the domain so that it can consume it (and also
do the reset/re-enable part for the domain)
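Variant a) could look roughly like the sketch below. Everything in it is hypothetical: there is no such structure in xen today, the number of PMCs is an assumption, and the fallback value just stands in for whatever default the current handler uses. It only illustrates "reset each wrapped PMC to the value the domain would have used, if xen knows it".

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PMCS 8            /* assumption: number of PMCs on the cpu */
#define PMC_NEG  0x80000000u  /* sign bit of a 32 bit PMC */

/* Hypothetical per-domain record of the values the domain would reload. */
struct pmc_reset_info {
    int      valid;            /* has the domain told xen its values? */
    uint32_t reset[NUM_PMCS];
};

/* Reset every wrapped (negative) PMC; returns how many were reset. */
static int reset_wrapped_pmcs(uint32_t pmc[NUM_PMCS],
                              const struct pmc_reset_info *info,
                              uint32_t fallback)
{
    int i, n = 0;

    assert((fallback & PMC_NEG) == 0);  /* fallback must be non-negative */
    for (i = 0; i < NUM_PMCS; i++) {
        if (!(pmc[i] & PMC_NEG))
            continue;                   /* not wrapped, leave it counting */
        pmc[i] = (info && info->valid) ? info->reset[i] : fallback;
        n++;
    }
    return n;
}
```

Variant b) avoids this table completely, because the domain handler already has the values.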
== Background II - why PHYP might not need to know about PMC reset values ==
This is my current assumption after a lot of chat discussions and
document reviews - I welcome every comment that makes this clearer.
As a XenPPC developer you can look at PHYP as a black box and know that
"whatever" has to work non-paravirtualized because it works that way in
PHYP. This is the case for the PMC handling described above - so why
does it work without passing the wanted PMC reset values to the hypervisor?
The basic assumption is that PHYP runs completely with MSR[EE]=0; in
this case our kind of postponed perfmon interrupt simply does not occur.
Starting from the point where MSR[EE] is set back to one in the domain,
the perfmon exception gets reported as an interrupt - in the domain.
Because in this scenario the domain gets every perfmon interrupt, it can
handle samples and reset the PMCs properly without any issue.
Even if PHYP had the issue of postponed perfmon irqs because it may not
be running fully with MSR[EE]=0, they might work around it by
altering whatever they have as the equivalent of our vcpu struct. They
could "emulate" the interrupt by altering all registers the way the irq
would. SIAR/SDAR stay valid until the next performance monitor exception
occurs, so after returning to the domain it would continue with the
handler, read SIAR/SDAR and reset the PMCs ... properly.
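As a rough illustration of what such an "emulated" interrupt means for the saved guest state, here is a heavily simplified model. The struct is not the real vcpu struct, the vector offset 0xF00 is my assumption for the performance monitor interrupt, and real interrupt entry changes more MSR bits than the two handled here.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_EE          0x8000u  /* external interrupt enable */
#define MSR_PR          0x4000u  /* problem (user) state */
#define PERFMON_VECTOR  0xF00u   /* assumed perfmon interrupt vector */

/* Hypothetical, reduced view of the saved guest registers. */
struct vcpu_regs_model {
    uint64_t pc;
    uint64_t msr;
    uint64_t srr0;
    uint64_t srr1;
};

/* Rewrite the saved state as if the perfmon irq had been taken by the
 * guest. Returns 0 if the guest has interrupts disabled (the exception
 * would then simply stay pending), 1 if the irq was injected. */
static int emulate_perfmon_irq(struct vcpu_regs_model *r)
{
    if (!(r->msr & MSR_EE))
        return 0;
    r->srr0 = r->pc;                         /* where to resume later */
    r->srr1 = r->msr;                        /* MSR to restore later */
    r->msr &= ~(uint64_t)(MSR_EE | MSR_PR);  /* handler runs with EE=0 */
    r->pc = PERFMON_VECTOR;                  /* continue in the handler */
    assert(!(r->msr & MSR_EE));
    return 1;
}
```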
== Issue II - MMCR0[PMAO] polling needed to sample xen ? ==
As written above, we are neither always nor never running with MSR[EE]=1
in xen-ppc. While MSR[EE]=0 all the time would be nice for deferring the
irq back to the domain, MSR[EE]=0 is also a problem when the intention
is to sample the hypervisor itself.
We can't assume that sampling xen works with the interrupt based
mechanism because MSR[EE] is 0 too often and unpredictably. So we
additionally need a polling based mechanism to be at least somewhat
accurate (it can only be as good as the frequency of the poll actions).
Polling would need one or more good places in xen to check via
MMCR0[PMAO] whether a perfmon exception occurred (if we have enough
frequently hit places that enable MSR[EE], this would do the job too).
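A poll hook could be as small as the following sketch. Again this is only a model: the function and struct names are invented, the MMCR0[PMAO] mask is an assumption, and clearing the bit as acknowledgement has to be verified against the cpu manual.

```c
#include <assert.h>
#include <stdint.h>

#define MMCR0_PMAO 0x00000080u  /* perfmon alert occurred (assumed mask) */

/* Simulated registers, not real SPR accesses. */
struct pmu_model {
    uint32_t mmcr0;
    uint64_t siar;
};

struct sample {
    uint64_t pc;
};

/* Hypothetical poll hook for "good places" in xen. Returns 1 and fills
 * *out if a perfmon exception is pending, 0 otherwise. */
static int perfmon_poll(struct pmu_model *m, struct sample *out)
{
    assert(m != 0 && out != 0);
    if (!(m->mmcr0 & MMCR0_PMAO))
        return 0;
    out->pc = m->siar;          /* sampled instruction address */
    m->mmcr0 &= ~MMCR0_PMAO;    /* acknowledge in this model */
    return 1;
}
```

The accuracy limit mentioned above is visible here: SIAR holds whatever address was sampled when the exception happened, but the sample is only noticed at the next poll.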
== Issue III - emulating perfmon irq in the domain by altering vcpu
prohibited ? ==
The intention to profile xen prohibits us from implementing the handling
of the postponed irqs with the "emulate the perfmon irq" workaround
mentioned above, because new perfmon exceptions belonging to xen would
overwrite SIAR/SDAR before the domain can read their results.
To continue with the plan to sample xen we will need an event channel to
report xen samples to a profiling domain (similar to the xenoprof approach)
and an event channel (maybe the same) to report samples of postponed
perfmon irqs. This way linux, which knows the PMC reset values, can reset
them, which saves us from implementing the PMC reset value awareness for
the postponed irqs. But xen will need to get a (kind
of/mechanism/trick) "pmu setup interface" which defines PMC reset values
and all the other pmu related registers for the part of profiling xen.
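To give an idea of what would travel over such a channel, here is a sketch of a tiny sample buffer. The layout, the names and the slot count are all invented for this mail and are not the xenoprof format; the event channel notification itself is only marked by a comment.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 4  /* tiny for illustration; must be a power of two */

enum sample_kind { SAMPLE_XEN, SAMPLE_POSTPONED };

/* Hypothetical sample record handed to the profiling domain. */
struct pmu_sample {
    enum sample_kind kind;  /* taken in xen, or postponed domain irq */
    uint16_t domid;         /* domain the sample belongs to */
    uint64_t siar;
    uint64_t sdar;
};

struct sample_ring {
    uint32_t head;          /* next slot xen writes */
    uint32_t tail;          /* next slot the consumer reads */
    uint32_t lost;          /* samples dropped because the ring was full */
    struct pmu_sample slot[RING_SLOTS];
};

/* Producer side (xen). Returns 0 and counts a loss if the ring is full. */
static int ring_put(struct sample_ring *r, const struct pmu_sample *s)
{
    /* Free running counters only wrap correctly for power-of-two sizes. */
    assert((RING_SLOTS & (RING_SLOTS - 1)) == 0);
    if (r->head - r->tail == RING_SLOTS) {
        r->lost++;
        return 0;
    }
    r->slot[r->head % RING_SLOTS] = *s;
    r->head++;
    return 1;               /* here xen would notify the event channel */
}

/* Consumer side (profiling domain). Returns 0 if the ring is empty. */
static int ring_get(struct sample_ring *r, struct pmu_sample *out)
{
    if (r->head == r->tail)
        return 0;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return 1;
}
```

The kind field is what would let one channel carry both xen samples and postponed domain samples.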
== Background III - how to switch PMC context on xen<->domain transitions ==
I currently think that the domain should still sample with MMCR0[FCH]=1,
so in the transition into the hypervisor we have frozen counters until
PMU_SAVE_STATE in exceptions.S has saved the domain perfmon settings &
PMCs and restored the ones of xen. As its last step it sets MMCR0[FCH]=0
so the profiling continues with the xen configuration. On the way back
to a domain it first sets MMCR0[FCH]=1 and then saves the xen / restores
the domain perfmon status. This would profile all of xen except the small
slice between PMU_SAVE_STATE and the involved domain.
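The ordering matters more than the details, so here it is as a small C model. The real thing would be assembly in exceptions.S working on SPRs; the struct, the function names, the number of PMCs and the MMCR0[FCH] mask are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define MMCR0_FCH 0x00000001u  /* freeze in hypervisor state (assumed mask) */
#define NUM_PMCS  8            /* assumption: number of PMCs on the cpu */

/* Hypothetical perfmon context: control registers plus counters. */
struct pmu_ctx {
    uint32_t mmcr0;
    uint64_t mmcr1;
    uint64_t mmcra;
    uint32_t pmc[NUM_PMCS];
};

/* Domain -> xen: counters are frozen on entry because of FCH=1. */
static void pmu_enter_xen(struct pmu_ctx *hw, struct pmu_ctx *dom_save,
                          const struct pmu_ctx *xen_ctx)
{
    assert(hw->mmcr0 & MMCR0_FCH);  /* domain is expected to run with FCH=1 */
    *dom_save = *hw;                /* save domain settings and PMCs */
    *hw = *xen_ctx;                 /* restore the ones of xen */
    hw->mmcr0 &= ~MMCR0_FCH;        /* last step: xen counting starts */
}

/* Xen -> domain: freeze first, then swap the contexts. */
static void pmu_leave_xen(struct pmu_ctx *hw, struct pmu_ctx *xen_save,
                          const struct pmu_ctx *dom_ctx)
{
    hw->mmcr0 |= MMCR0_FCH;         /* first step: stop counting in xen */
    *xen_save = *hw;                /* save xen settings and PMCs */
    *hw = *dom_ctx;                 /* restore the domain status */
    assert(hw->mmcr0 & MMCR0_FCH);  /* domain context keeps FCH=1 */
}
```

The code that runs before pmu_enter_xen() and after pmu_leave_xen() is the small slice of xen that stays unprofiled.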
One thing is for sure: these issues make the implementation of sampling
xen more complex than I initially thought :-(
The complexity of this may make the text easy to misunderstand - if
anything is confusing, please ask me to improve my description ;-)
Regards,
IBM Linux Technology Center, Open Virtualization
Xen-ppc-devel mailing list