This is the initial mail to inform about and start the discussion of step 2 of the xen-ppc profiling support. The targets are:
1. Sampling xen by passing xen samples to a profiling domain
2. Passing results of postponed perfmon interrupts to the appropriate domain (currently these samples are ignored)
 a) by using a mechanism compatible with the one we need for 1. anyway
 b) or by mangling the vcpu structure to emulate the occurred irq and let the domain handle it
3. "Context switch" the pmc status in all transitions between domains and xen while sampling xen

== Background I - PMC counters need to be reset after the perfmon interrupt occurred ==
First a very compact description of how the performance monitoring is set up and what has to happen in/after a performance monitor interrupt - for the real details read the appropriate cpu user manual. An operating system may set up MMCR0, MMCR1 and MMCRA in a way that, for example, PMC2 counts cycles, and it registers a handler for the perfmon irq. A PMC contains a signed 32 bit value (bit 0 is the CTR_NEG bit). When the PMC wraps to a negative value the current instruction/data address is written to the SIAR/SDAR spr's - this is a performance monitor exception. For example, if you want to take a sample every 0x10000000 cycles you set the PMC to 0x70000000 (0x7FFFFFFF+0x1 = 0x80000000 is the first negative value).

If a performance monitor exception exists (a PMC is negative) AND external interrupts are enabled (MSR[EE]=1) AND performance monitor exceptions are enabled (MMCR0[PMXE]=1), the performance monitor interrupt occurs and the handler can read SIAR/SDAR. The interrupt itself disables subsequent perfmon irq's by setting MMCR0[PMXE] to zero; the irq handler has to do the rest. It has to reset the PMC to a non-negative value; in our scenario we wanted to sample every 0x10000000 cycles, so the handler has to set PMC2 back to 0x70000000. It also has to re-enable perfmon interrupts by setting MMCR0[PMXE] back to 1. Oprofile resets the PMC values in the kernel perfmon interrupt handler, which knows the configured values (sysfs) and uses them to reset the PMCs properly.
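To make the arithmetic and the three interrupt conditions concrete, here is a minimal sketch in C. It is a pure software model, not kernel or xen code; the struct and all function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Start value so that the PMC turns negative after `period` events. */
uint32_t pmc_reload_value(uint32_t period)
{
    assert(period > 0 && period <= 0x80000000u);
    return 0x80000000u - period;
}

/* A PMC counts as negative when its most significant bit is set. */
bool pmc_is_negative(uint32_t pmc)
{
    return (pmc & 0x80000000u) != 0;
}

/* Hypothetical model of the state that matters for this discussion. */
struct pmu_model {
    uint32_t pmc2;
    bool pmxe;   /* models MMCR0[PMXE] */
    bool msr_ee; /* models MSR[EE] */
};

/* The interrupt is taken only if all three conditions hold. */
bool perfmon_irq_taken(const struct pmu_model *m)
{
    return pmc_is_negative(m->pmc2) && m->msr_ee && m->pmxe;
}

/* What the handler still has to do after the irq cleared PMXE. */
void perfmon_handler_finish(struct pmu_model *m, uint32_t period)
{
    m->pmc2 = pmc_reload_value(period); /* back to a non-negative value */
    m->pmxe = true;                     /* re-enable perfmon interrupts */
}
```

With a period of 0x10000000 the reload value is 0x70000000, matching the example above. A negative PMC with MSR[EE]=0 models exactly the pending exception that is discussed in the following sections.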

== Issue I - postponed perfmon irq's ==
Sometimes perfmon interrupts belonging to a domain occur in xen space. Currently the performance monitor is always set up with MMCR0[FCH], so the hypervisor privilege level can't be the original source. The values of MMCRA[SAMPHV] and MMCRA[SAMPPR] also show that the sample was taken in domain space. Such a sample can be reported "postponed" into xen because we do not run completely with MSR[EE]=0. The current perfmon handler in xen just ignores those samples - they are few enough that this is currently a negligible issue (small loss of accuracy). The handler needs to be there, although we do not (yet) sample xen space, to re-enable MMCR0[PMXE] and to reset the PMC values. (Otherwise the performance monitor would stop working after the first of these postponed irq's, because without MMCR0[PMXE]=1 no further perfmon irq would happen.)

As described above, linux knows the values to which it should reset the wrapped PMCs in its handler, but xen does not. Combined with the issue of postponed perfmon irq's that belong to a domain but occur in xen, we have a perfmon irq handler in xen that does not know how to reset wrapped PMC values properly. The current implementation of the xen perfmon handler resets the PMC values to the defaults of oprofile, but to let profiling work properly
a) xen needs to be aware of the values a domain would reset the PMCs to
b) or pass the sample to the domain so that it can consume it (and also do the reset/re-enable part for the domain)
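As an illustration of variant a), the following sketch shows what a xen-side handler could do if it had a per-domain table of reset values. This is not the current xen handler; the struct, the function name and the number of PMCs are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PMCS 8 /* assumption: the real count depends on the CPU */

/* Hypothetical per-domain record xen would have to maintain. */
struct dom_pmu_state {
    uint32_t reset_value[NUM_PMCS]; /* what the domain's handler would write */
    uint32_t pmc[NUM_PMCS];         /* models the PMC registers */
    bool pmxe;                      /* models MMCR0[PMXE] */
};

/*
 * Handle a postponed perfmon irq on behalf of the domain:
 * reset every wrapped (negative) PMC and re-enable PMXE.
 * Returns the number of PMCs that were reset.
 */
int xen_handle_postponed_irq(struct dom_pmu_state *d)
{
    int reset = 0;

    assert(d != NULL);
    for (size_t i = 0; i < NUM_PMCS; i++) {
        if (d->pmc[i] & 0x80000000u) {
            d->pmc[i] = d->reset_value[i];
            reset++;
        }
    }
    d->pmxe = true; /* without this no further perfmon irq would happen */
    return reset;
}
```

The open question with this variant is how reset_value would get filled, since the domain only configures it inside its own kernel; variant b) avoids that by leaving the reset to the domain.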

== Background II - why PHYP might not need to know about PMC reset values ==
This is my current assumption after a lot of chat discussions and document reviews - I welcome every comment that makes this clearer. As a XenPPC developer you can look at PHYP as a black box and know that "whatever" has to work non-paravirtualized because it works that way in PHYP. This is the case for the PMC handling described above - so why does it work without passing the wanted PMC reset values to the hypervisor explicitly?

The basic assumption is that PHYP runs completely with MSR[EE]=0; in this case our kind of postponed perfmon interrupts just do not occur. Once MSR[EE] is set back to one in the domain, the perfmon exception gets reported as an interrupt - in the domain. Because in this scenario the domain gets every perfmon interrupt, it can handle the samples and reset the PMC counters properly without any issue.

Even if PHYP had the issue of postponed perfmon irq's, because it might not run fully with MSR[EE]=0, they could work around it by altering whatever they have as the equivalent of our vcpu struct. They could "emulate" the interrupt by altering all registers the way the irq would. The SIAR/SDAR is valid until the next performance monitor exception occurs, so after returning to the domain it would continue with the handler, read SIAR/SDAR and reset the PMCs ... properly.
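A rough sketch of what "emulating" the interrupt in a saved register set could look like. This is a simplified model and says nothing about how PHYP really works; the struct is hypothetical, the MSR[EE] mask and the 0xF00 vector are my understanding of the 64-bit PowerPC layout and should be checked against the cpu manual, and a real interrupt changes more MSR bits than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_EE         0x8000ULL /* assumed mask for MSR[EE] */
#define PERFMON_VECTOR 0x0F00ULL /* assumed perfmon interrupt vector */

/* Hypothetical stand-in for the saved domain registers in a vcpu struct. */
struct vcpu_regs_model {
    uint64_t pc;
    uint64_t msr;
    uint64_t srr0;
    uint64_t srr1;
};

/*
 * Rewrite the saved registers as the interrupt would have done.
 * Returns false if the domain currently has MSR[EE]=0, in which
 * case the exception has to stay pending.
 */
bool emulate_perfmon_irq(struct vcpu_regs_model *r)
{
    assert(r != NULL);
    if (!(r->msr & MSR_EE))
        return false;
    r->srr0 = r->pc;        /* return address for the domain's rfid */
    r->srr1 = r->msr;       /* MSR to restore on return */
    r->msr &= ~MSR_EE;      /* handler starts with external irqs off */
    r->pc = PERFMON_VECTOR; /* domain resumes in its perfmon handler */
    return true;
}
```

After such a rewrite the domain would resume in its own handler and could read SIAR/SDAR, which only works as long as no new performance monitor exception has overwritten them in between.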

== Issue II - MMCR0[PMAO] polling needed to sample xen ? ==
As written above, in xen-ppc we run neither always nor never with MSR[EE]=1. While MSR[EE]=0 all the time would be nice to defer the irq back to the domain, MSR[EE]=0 is also a problem for the intention of sampling the hypervisor itself. We can't assume that sampling xen works with the interrupt based mechanism, because MSR[EE] is 0 too often and at unpredictable times. So we additionally need a polling based mechanism to be at least a little accurate (it can only be as good as the frequency of the poll actions). Polling would need one or more good places in xen to check via MMCR0[PMAO] whether a perfmon exception occurred. (If we have enough places that enable MSR[EE] frequently, this would do the job too.)
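A minimal sketch of such a poll point, again as a software model only. The function and struct names are hypothetical, the PMAO mask is an assumed value to be checked against the cpu manual, and the registers are passed in as plain values instead of being read via mfspr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MMCR0_PMAO 0x00000080u /* assumed mask for MMCR0[PMAO] */

struct xen_sample {
    uint64_t siar; /* sampled instruction address */
    uint64_t sdar; /* sampled data address */
};

/*
 * Hypothetical poll point: if an alert is pending, capture the
 * sample into `out`, clear the pending bit in the model and
 * return true. Otherwise return false and leave `out` untouched.
 */
bool xen_poll_perfmon(uint32_t *mmcr0, uint64_t siar, uint64_t sdar,
                      struct xen_sample *out)
{
    assert(mmcr0 != NULL && out != NULL);
    if (!(*mmcr0 & MMCR0_PMAO))
        return false;
    out->siar = siar;
    out->sdar = sdar;
    *mmcr0 &= ~MMCR0_PMAO;
    return true;
}
```

A real poll point would also have to reset the wrapped PMC, as the interrupt handler does. The accuracy of this approach is bounded by how often the poll is called, so the hard part is finding call sites in xen that are hit frequently enough.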

== Issue III - emulating perfmon irq in the domain by altering vcpu prohibited ? ==
The intention to profile xen prohibits us from handling the postponed irq's with the "emulate the perfmon irq" workaround mentioned above, because new perfmon exceptions belonging to xen would overwrite SIAR/SDAR before the domain can read their results. To continue the plan to sample xen we will need an event channel to report xen samples to a profiling domain (similar to the xenoprof approach) and an event channel (maybe the same one) to report samples of postponed perfmon irq's. This way linux, which knows the PMC reset values, can reset them, which saves us from implementing the PMC reset value awareness for the postponed irq's. But xen will still need some kind of "pmu setup interface" (a mechanism or trick) that defines the PMC reset values and all the other pmu related registers for the part of profiling xen itself.
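To sketch how one channel could carry both kinds of samples, here is a small ring buffer with an origin tag per sample. This is only an illustration of the idea; it is not the xenoprof buffer format, and every name and the ring size are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 64 /* arbitrary size for the example */

enum sample_origin {
    SAMPLE_XEN,              /* sample taken while profiling xen itself */
    SAMPLE_POSTPONED_DOMAIN  /* domain sample whose irq arrived in xen */
};

struct pm_sample {
    enum sample_origin origin;
    uint16_t domain_id; /* only meaningful for postponed samples */
    uint64_t siar;
    uint64_t sdar;
};

struct sample_ring {
    struct pm_sample slot[RING_SIZE];
    unsigned head; /* next slot to write */
    unsigned tail; /* next slot to read */
};

/* Producer side (xen). Returns false if the ring is full. */
bool ring_put(struct sample_ring *r, struct pm_sample s)
{
    assert(r != NULL);
    unsigned next = (r->head + 1) % RING_SIZE;
    if (next == r->tail)
        return false; /* full: the sample is dropped */
    r->slot[r->head] = s;
    r->head = next;
    return true;
}

/* Consumer side (profiling domain). Returns false if empty. */
bool ring_get(struct sample_ring *r, struct pm_sample *out)
{
    assert(r != NULL && out != NULL);
    if (r->tail == r->head)
        return false;
    *out = r->slot[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return true;
}
```

The origin tag lets the consumer tell xen samples from postponed domain samples, so a single event channel could signal both. One slot always stays unused to distinguish a full ring from an empty one.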

== Background III - how to switch PMC context on xen<->domain transitions ==
I currently think that the domain should still sample with MMCR0[FCH]=1, so in the transition into the hypervisor we have frozen counters until the PMU_SAVE_STATE in exceptions.S has saved the domain perfmon settings & PMCs and restored the ones of xen. Finally it sets MMCR0[FCH]=0 so the profiling continues with the xen configuration. On the way back to a domain it first sets MMCR0[FCH]=1 and then saves the xen / restores the domain perfmon status. This would profile all of xen except the small slice between PMU_SAVE_STATE and the involved domain.
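The ordering described above can be sketched as follows. This models the registers as a plain struct and is not the PMU_SAVE_STATE code from exceptions.S; the function names, the FCH mask and the number of PMCs are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PMCS  8           /* assumption: depends on the CPU */
#define MMCR0_FCH 0x00000001u /* assumed mask for MMCR0[FCH] */

/* One perfmon context; also used here to model the hardware. */
struct pmu_ctx {
    uint32_t mmcr0;
    uint32_t pmc[NUM_PMCS];
};

/* Domain -> xen: counters are frozen on entry because FCH=1. */
void pmu_enter_xen(struct pmu_ctx *hw, struct pmu_ctx *dom,
                   const struct pmu_ctx *xen)
{
    assert(hw != NULL && dom != NULL && xen != NULL);
    assert(hw->mmcr0 & MMCR0_FCH); /* the domain samples with FCH=1 */
    *dom = *hw;                    /* save the domain settings and PMCs */
    *hw = *xen;                    /* restore the xen settings and PMCs */
    hw->mmcr0 &= ~MMCR0_FCH;       /* last step: start counting in xen */
}

/* Xen -> domain: freeze first, then swap the contexts back. */
void pmu_leave_xen(struct pmu_ctx *hw, struct pmu_ctx *xen,
                   const struct pmu_ctx *dom)
{
    assert(hw != NULL && xen != NULL && dom != NULL);
    hw->mmcr0 |= MMCR0_FCH;  /* first step: stop counting in xen */
    *xen = *hw;              /* save the xen settings and PMCs */
    *hw = *dom;              /* restore the domain settings and PMCs */
    hw->mmcr0 |= MMCR0_FCH;  /* the domain keeps sampling with FCH=1 */
}
```

The point of the ordering is that the counters never run while a mixed state is loaded: FCH is cleared only after the xen context is complete and set again before it is saved.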

One thing is for sure: these issues make the implementation to sample xen more complex than I thought initially :-( The complexity may make this text easy to misunderstand - if anything is confusing to someone, please ask me to improve my description ;-)


Grüsse / regards, Christian Ehrhardt

IBM Linux Technology Center, Open Virtualization
+49 7031/16-3385

IBM Deutschland Entwicklung GmbH
Vorsitzender des Aufsichtsrats: Johann Weihen Geschäftsführung: Herbert Kircher Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

Xen-ppc-devel mailing list
