On Thu, 2009-05-14 at 15:10 +0200, Gilles Chanteperdrix wrote:
Philippe Gerum wrote:
On Thu, 2009-05-14 at 14:52 +0200, Gilles Chanteperdrix wrote:
Philippe Gerum wrote:
On Thu, 2009-05-14 at 12:20 +0200, Jan Kiszka wrote:
Philippe Gerum wrote:
On Wed, 2009-05-13 at 18:10 +0200, Jan Kiszka wrote:
Philippe Gerum wrote:
On Wed, 2009-05-13 at 17:28 +0200, Jan Kiszka wrote:
Philippe Gerum wrote:
On Wed, 2009-05-13 at 15:18 +0200, Jan Kiszka wrote:
Gilles Chanteperdrix wrote:
Jan Kiszka wrote:
Hi Gilles,
I'm currently facing a nasty effect with switchtest over the latest git
head (only tested this so far): running it inside my test VM (i.e. with
frequent excessive latencies), I get a stalled Linux timer IRQ quite
quickly. The system is otherwise still responsive, Xenomai timers are
still being delivered, other Linux IRQs too. switchtest complained about
Warning: Linux is compiled to use FPU in kernel-space.
when it was started. Kernels are 2.6.28.9/ipipe-x86-2.2-07 and
2.6.29.3/ipipe-x86-2.3-01 (LTTng patched in, but unused); both show the
same effect.
Seen this before?
The warning about Linux being compiled to use FPU in kernel-space means
that you enabled soft RAID or compiled for K7, Geode, or any other
RAID is on (ordinary server config).
configuration using 3DNow for such simple operations as memcpy. It is
harmless; it simply means that switchtest cannot use the FPU in
kernel-space.
The bug you have is probably the same as the one described here, which I
am able to reproduce on my Atom:
https://mail.gna.org/public/xenomai-help/2009-04/msg00200.html
Unfortunately, I for one am working on ARM issues and am not available
to debug x86 issues. I think Philippe is busy too...
OK, looks like I got the same flu here.
Philippe, did you find out any more details in the meantime? If not, I'm
afraid I have to pick this up.
No, I did not resume this task yet. Working from the powerpc side of the
universe here.
Hoho, don't think this rain here over x86 would never make it down to
ARM or PPC land! ;)
Martin, could you check if this helps you, too?
Jan
(as usual, ready to be pulled from 'for-upstream')
-
Host IRQs may not only be triggered from non-root domains.
Are you sure of this? I can't find any spot where this assumption would
be wrong. host_pend() is basically there to relay RT timer ticks and
device IRQs, and this only happens on behalf of the pipeline head. At
least, this is how rthal_irq_host_pend() should be used in any case. If
you did find a spot where this interface is being called from the lower
stage, then this is the root bug to fix.
I haven't studied the I-pipe trace w.r.t. this in detail yet, but I
could imagine that some shadow task is interrupted in primary mode by
the timer IRQ and then leaves the handler in secondary mode due to
whatever events occur between schedule-out and schedule-in at the end of
xnintr_clock_handler.
You need a thread context to move to secondary, I just can't see how
such a scenario would be possible.
Here is the trace of events:
= Shadow task starts migration to secondary
= in xnpod_suspend_thread, nklock is briefly released before xnpod_schedule
Which is the root bug. Blame it on me; this recent change in -head breaks a
basic rule a lot of code is based on: a self-suspending thread may not
be preempted while scheduling out, i.e. suspension and rescheduling must
be atomically performed. xnshadow_relax() counts on this too.
Actually, I think the idea was mine in the first place... Maybe we can
specify a special flag to xnpod_suspend_thread to ask for atomic
suspension (maybe reuse XNATOMIC?).
I don't think so. We really need the basic assumption to hold in any
case, because this is expected by most of the callers, and this
micro-optimization is not worth the risk of introducing a race if
misused.
Well, I tend to disagree. The assumption that the thread is suspended
from the point of view of the scheduler still holds even when the nklock
is released, and it is what callers like rt_cond_wait are expecting. The
assumptions of xnshadow_relax do not seem to me like a common assumption.
The assumption is that the thread has been suspended _and_ scheduled out
atomically, not only put in a suspended state, which is quite different
when considered from an interrupt context. I'm worried by the fact that
re-enabling interrupts in the middle of this critical transition breaks
the unspoken rule that sched->curr may not be seen as bearing any block
bit in its status word from anywhere in the code executed from the local
CPU but xnpod_suspend_thread().
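To make that rule concrete, here is a minimal toy model in C. It is only
a sketch, not Xenomai code: every name in it (toy_sched, toy_irq,
toy_suspend_self, TOY_BLOCKED, ...) is hypothetical, and a plain boolean
stands in for nklock with interrupts masked. It counts how often a
simulated interrupt on the local CPU observes the current thread with a
block bit set, once with suspension and rescheduling done atomically and
once with the lock briefly dropped in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names are hypothetical, none is Xenomai's API. */
#define TOY_BLOCKED 0x1u

struct toy_thread {
    unsigned status;            /* status word holding the block bit */
};

struct toy_sched {
    struct toy_thread *curr;    /* thread running on this CPU */
    struct toy_thread *idle;    /* picked when curr is blocked */
    bool lock_held;             /* stands in for nklock + IRQs off */
    int violations;             /* IRQs that saw curr with a block bit */
};

/* Simulated interrupt: it can only run while the lock is not held. */
static void toy_irq(struct toy_sched *s)
{
    if (s->lock_held)
        return;                 /* masked, nothing observed */
    if (s->curr->status & TOY_BLOCKED)
        s->violations++;        /* the unspoken rule is broken */
}

/* Switch away from a blocked current thread; the lock must be held. */
static void toy_schedule(struct toy_sched *s)
{
    assert(s->lock_held);
    if (s->curr->status & TOY_BLOCKED)
        s->curr = s->idle;
}

/* Self-suspension. With atomic == false the lock is released briefly
 * between setting the block bit and rescheduling. */
static void toy_suspend_self(struct toy_sched *s, bool atomic)
{
    s->lock_held = true;
    s->curr->status |= TOY_BLOCKED;
    if (!atomic) {
        s->lock_held = false;
        toy_irq(s);             /* IRQ lands inside the window */
        s->lock_held = true;
    }
    toy_schedule(s);
    s->lock_held = false;
    toy_irq(s);                 /* IRQ after the switch sees idle */
}

/* Run one self-suspension and return the number of violations seen. */
static int toy_run(bool atomic)
{
    struct toy_thread worker = { 0 }, idle = { 0 };
    struct toy_sched s = { &worker, &idle, false, 0 };

    toy_suspend_self(&s, atomic);
    return s.violations;
}
```

In this model toy_run(true) reports no violation, while toy_run(false)
reports one: the interrupt that fires inside the window sees a current
thread that is already marked blocked but has not been switched out yet.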
Another issue may arise in the SMP case, where xnpod_suspend_thread()
would block a thread running on a remote CPU; in theory, re-enabling
interrupts before the IPIs are sent from xnpod_schedule() - to kick the
remote