Jan Kiszka wrote:
Philippe Gerum wrote:

Jan Kiszka wrote:


Philippe Gerum wrote:


...
Fixed. The cause was related to the thread migration routine to
primary mode (xnshadow_harden), which would spuriously call the Linux
rescheduling procedure from the primary domain under certain
circumstances. This bug only triggers on preemptible kernels. This
also fixes the spinlock recursion issue which is sometimes triggered
when the spinlock debug option is active.


Gasp. I've found a severe regression with this fix, so more work is
needed. More later.


End of alert. Should be ok now.



No crashes so far, looks good. But the final test, a box which always
went to hell very quickly, is still waiting in my office - more on
Monday.

Anyway, there seem to be some latency issues pending. I discovered this
again with my migration test. Please give it a try on a mid- (800 MHz
Athlon in my case) to low-end box. On that Athlon I got peaks of over
100 us in the userspace latency test right when starting the migration
test. The Athlon does not support the NMI watchdog, but on my 1.4 GHz
notebook there were alarms (>30 us) hitting in the native registry
during rt_task_create. I have no clue yet whether anything is broken there.


I suspect that rt_registry_enter() is inherently a long operation when
considered as a non-preemptible sum of reasonably short ones. Since it
is always called with interrupts enabled, we should split the work in
there, releasing interrupts in the middle. The tricky thing is that we
must ensure that the new registration slot is not exposed in a
half-baked state during the preemptible section.
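
Roughly what I have in mind, as a sketch only -- all the slot/alloc
names below are invented and the real registry code will differ -- is a
two-phase registration: allocate and hide the slot during a short
non-preemptible section, do the lengthy part with interrupts back on,
then publish the slot atomically at the end:

int registry_enter_split(struct registry *reg, void *object)
{
        struct slot *slot;
        spl_t s;

        xnlock_get_irqsave(&nklock, s);
        slot = alloc_slot(reg);            /* short, non-preemptible part */
        if (slot == NULL) {
                xnlock_put_irqrestore(&nklock, s);
                return -ENOMEM;
        }
        slot->state = SLOT_PENDING;        /* invisible to lookups */
        xnlock_put_irqrestore(&nklock, s); /* interrupts back on */

        do_lengthy_init(slot, object);     /* preemptible section */

        xnlock_get_irqsave(&nklock, s);
        slot->state = SLOT_ACTIVE;         /* publish atomically */
        xnlock_put_irqrestore(&nklock, s);

        return 0;
}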


Yeah, I guess there are a few more such complex call chains inside the
core lock, at least when looking at the native skin. For a regression
test suite, we should define load scenarios of low-prio realtime tasks
doing some init/cleanup and communication while e.g. the latency test is
running. This should give a clearer picture of what numbers you can
expect in normal application scenarios.
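
Something along these lines, run at low priority next to the latency
test, would already hammer the registry path -- just a sketch against
the native skin, task names, priorities and stack sizes are arbitrary:

#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <native/task.h>

int main(void)
{
        RT_TASK t;
        char name[16];
        int i, err;

        mlockall(MCL_CURRENT | MCL_FUTURE);
        /* Become a low-prio Xenomai task so native services may be called. */
        rt_task_shadow(NULL, "regstress", 1, 0);

        for (i = 0; ; i++) {
                snprintf(name, sizeof(name), "load%03d", i % 64);
                err = rt_task_create(&t, name, 8192, 1, 0);
                if (err) {
                        fprintf(stderr, "rt_task_create: %d\n", err);
                        break;
                }
                rt_task_delete(&t);     /* immediate cleanup, exercises removal too */
                usleep(1000);           /* throttle the loop a bit */
        }
        return 0;
}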


We need that back-tracer soon - did I mention this before? ;)


Well, we have backtrace support for detecting latency peaks, but it's
dependent on NMI availability. The thing is that not every platform
provides programmable NMI support. A possible option would be to
overload the existing LTT tracepoints in order to keep an execution
backtrace, so that we would not have to rely on any hw support.
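
I.e. each instrumentation point would just push its location into a
small ring we could dump whenever a latency threshold is crossed.
Purely illustrative sketch -- all names are invented and the x86-32
rdtsc timestamping is only there to keep the example short:

#define TRACE_DEPTH 256 /* must be a power of two */

struct trace_entry {
        unsigned long location;         /* instrumented point */
        unsigned long long tsc;         /* timestamp */
};

static struct trace_entry trace_ring[TRACE_DEPTH];
static unsigned int trace_pos;

static inline void trace_point(unsigned long location)
{
        struct trace_entry *e = &trace_ring[trace_pos++ & (TRACE_DEPTH - 1)];

        e->location = location;
        __asm__ __volatile__("rdtsc" : "=A" (e->tsc));
}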



The advantage of Fu's mcount-based tracer will be that it can also
capture functions you do not expect, e.g. accidentally called kernel
services. His patch, likely against Adeos, will enable kernel-wide
function tracing which you can use to instrument IRQ-off paths or (in a
second step or so) other things you are interested in. And it will
maintain a FULL calling history, something that NMI can't do.

NMI will still be useful for hard lock-ups, LTT for a more global view
of what's happening, but the mcount instrumentation should give deep
insight into the core's and skins' critical timing behaviour.
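
In user-space terms the principle looks like this -- not Fu's patch,
just an illustration using gcc's -finstrument-functions hooks, the
user-land cousin of the kernel's -pg/mcount hook; build with
"gcc -finstrument-functions -g calltrace.c":

#include <stdio.h>

#define RING_SIZE 1024

static void *ring[RING_SIZE];
static unsigned int head;

void __cyg_profile_func_enter(void *func, void *caller)
        __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *func, void *caller)
        __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *func, void *caller)
{
        ring[head++ % RING_SIZE] = func;        /* full entry history */
}

void __cyg_profile_func_exit(void *func, void *caller)
{
        /* exits could be recorded as well; omitted for brevity */
}

static void leaf(void) { }
static void branch(void) { leaf(); }

int main(void)
{
        unsigned int i;

        branch();

        for (i = 0; i < head && i < RING_SIZE; i++)
                printf("%u: %p\n", i, ring[i]); /* resolve with addr2line */
        return 0;
}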


No problem. I've just suggested building a bicycle to go to the shop
around the corner, but if you tell me a spaceship to visit Venus is at
hand, I'll wait for it: shopping can wait.

--

Philippe.
