> Hi,
> well, if I'm not totally wrong, we have a design problem in the
> RT-thread hardening path. I dug into the crash Jeroen reported and I'm
> quite sure that this is the reason.
> So that's the bad news. The good one is that we can at least work around
> it by switching off CONFIG_PREEMPT for Linux (this implicitly means that
> it's a 2.6-only issue).
> But let's start with two assumptions my further analysis is based on:
> [Xenomai]
>  o Shadow threads have only one stack, i.e. one context. If the
>   real-time part is active (this includes being blocked on some
>   xnsynch object or delayed), the original Linux task must never be
>   executed, even if it would immediately fall asleep again. That's
>   because the stack is in use by the real-time part at that time.
>   And this condition is checked in do_schedule_event() [1].
> [Linux]
>  o A Linux task which has called set_current_state(<blocking_bit>) will
>   remain in the run-queue until it calls schedule() on its own.

Yes, you are right.

Let's keep in mind the following piece of code.


[code]    from sched.c::schedule()
    switch_count = &prev->nivcsw;
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {    <--- MUST BE TRUE FOR A TASK TO BE REMOVED
        switch_count = &prev->nvcsw;
        if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
                unlikely(signal_pending(prev))))
            prev->state = TASK_RUNNING;
        else {
            if (prev->state == TASK_UNINTERRUPTIBLE)
                rq->nr_uninterruptible++;
            deactivate_task(prev, rq);            <--- [*] removing from the active queue
        }
    }
[/code]

On executing schedule(), a "current" (prev = current) task is not removed from the active queue in one of the following cases:

[1] prev->state == 0, i.e. == TASK_RUNNING (since #define TASK_RUNNING  0);

[2] add_preempt_count(PREEMPT_ACTIVE) has been called before calling schedule() from the task's context
    i.e. from the context of the "current" task (prev = current in schedule());

[3] there is a pending signal for the "current" task and its state is TASK_INTERRUPTIBLE.
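
The three cases can be written down as a single predicate. What follows is only a user-space sketch of the condition in schedule(), not kernel code: would_dequeue() and check_cases() are hypothetical names, the task-state values are those of the 2.6 headers, and the PREEMPT_ACTIVE bit is architecture-specific, so the value used here is merely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Task states as defined in 2.6-era <linux/sched.h>. */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2

/* Assumption: the real bit position depends on the architecture. */
#define PREEMPT_ACTIVE        0x10000000u

/*
 * Hypothetical helper: returns true if schedule() would reach
 * deactivate_task(prev, rq) for a task in the given situation.
 */
static bool would_dequeue(long state, unsigned int preempt_count,
                          bool signal_pending)
{
    if (state == TASK_RUNNING)                          /* case [1] */
        return false;
    if (preempt_count & PREEMPT_ACTIVE)                 /* case [2] */
        return false;
    if ((state & TASK_INTERRUPTIBLE) && signal_pending) /* case [3] */
        return false;
    return true;
}

/* Walks through the three cases plus two situations that do dequeue. */
static void check_cases(void)
{
    assert(!would_dequeue(TASK_RUNNING, 0, false));
    assert(!would_dequeue(TASK_INTERRUPTIBLE, PREEMPT_ACTIVE, false));
    assert(!would_dequeue(TASK_INTERRUPTIBLE, 0, true));
    assert(would_dequeue(TASK_INTERRUPTIBLE, 0, false));
    /* A pending signal does not rescue an uninterruptible sleeper. */
    assert(would_dequeue(TASK_UNINTERRUPTIBLE, 0, true));
}
```

Only a task that has set a blocking state and then calls schedule() voluntarily, with no PREEMPT_ACTIVE and no signal rescuing it, leaves the active queue.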

Keeping that in mind too, let's take a look at what happens in your "crash"-scenario.

> ...
> 3) Interruption by some Linux IRQ. This may cause other
> threads to become runnable as well, but the gatekeeper has
> the highest prio and will therefore be the next. The problem is
> that the rescheduling on Linux IRQ exit will PREEMPT our task
> in xnshadow_harden(), it will NOT remove it from the Linux
> run-queue.

Right. But what actually happens is the following sequence of calls:

ret_from_intr ---> resume_kernel ---> need_resched ---> sched.c::preempt_schedule_irq() ---> schedule()        (**)

As a result, schedule() is indeed called, but it does not execute the [*] code (the deactivate_task() branch):
the "current" task is not removed from the active queue.
The reason is [2] (from the list above): PREEMPT_ACTIVE is set in preempt_schedule_irq() before it calls schedule().
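
To make the difference between the two entry paths concrete, here is a small user-space model. It is only a sketch under stated assumptions: struct model_task, model_schedule(), model_preempt_schedule_irq() and demo_irq_preemption() are hypothetical names, the run-queue is reduced to a single flag, and the PREEMPT_ACTIVE value is illustrative. It mirrors only the fact that preempt_schedule_irq() brackets its schedule() call with PREEMPT_ACTIVE.

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_RUNNING        0
#define TASK_INTERRUPTIBLE  1
#define PREEMPT_ACTIVE      0x10000000u  /* assumption: arch-specific in the kernel */

/* Hypothetical stand-in for the few task fields that matter here. */
struct model_task {
    long state;
    bool on_runqueue;
    bool signal_pending;
};

static unsigned int model_preempt_count;

/* Models only the dequeue decision taken at the top of schedule(). */
static void model_schedule(struct model_task *prev)
{
    if (prev->state && !(model_preempt_count & PREEMPT_ACTIVE)) {
        if ((prev->state & TASK_INTERRUPTIBLE) && prev->signal_pending)
            prev->state = TASK_RUNNING;
        else
            prev->on_runqueue = false;   /* deactivate_task() */
    }
}

/* Models the IRQ-exit path: schedule() runs with PREEMPT_ACTIVE set. */
static void model_preempt_schedule_irq(struct model_task *curr)
{
    model_preempt_count += PREEMPT_ACTIVE;
    model_schedule(curr);
    model_preempt_count -= PREEMPT_ACTIVE;
}

/* A task that already set a blocking state gets preempted by an IRQ. */
static void demo_irq_preemption(void)
{
    struct model_task t = { TASK_INTERRUPTIBLE, true, false };

    model_preempt_schedule_irq(&t);
    assert(t.on_runqueue);               /* still runnable for Linux */

    model_schedule(&t);                  /* the voluntary call */
    assert(!t.on_runqueue);              /* only now it is dequeued */
}
```

In this model, a task preempted between setting its state and its own schedule() call stays on the run-queue, which is the window the quoted scenario describes.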

> And now we are in real troubles: The
> gatekeeper will kick off our RT part which will take over the
> thread's stack. As soon as the RT domain falls asleep and
> Linux takes over again, it will continue our non-RT part as well!
> Actually, this seems to be the reason for the panic in
> do_schedule_event(). Without CONFIG_XENO_OPT_DEBUG
> and this check, we will run both parts AT THE SAME
> TIME now, thus violating my first assumption. The system gets
> fatally corrupted.
> Well, I would be happy if someone can prove me wrong here.

I'm afraid you are right.

> The problem is that I don't see a solution because Linux does
> not provide an atomic wake-up + schedule-out under
> CONFIG_PREEMPT. I'm currently considering a hack to remove the
> migrating Linux thread manually from the run-queue, but this could
> easily break the Linux scheduler.

I have a "stupid" idea off the top of my head, but I'd prefer to test it on my own first so as not to look like a complete idiot if it's totally wrong. Err... it's difficult to look more of an idiot than I already am? :o)

> Jan

Best regards,
Dmitry Adamushko
