Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Hi Philippe,
>>>>> I'm afraid this one is serious: let the attached migration stress test
>>>>> run on likely any Xenomai since 2.0, preferably with
>>>>> CONFIG_XENO_OPT_DEBUG on. Will give a nice crash sooner or later (I'm
>>>>> trying to set up a serial console now).
>>> Confirmed here. My test box went through some nifty triple salto out of
>>> the window running this frag for 2 min or so. Actually, the semop
>>> handshake is not even needed to cause the crash. At first sight, it
>>> looks like a migration issue taking place during the critical phase when
>>> a shadow thread switches back to Linux to terminate.
>>>> As it took some time to persuade my box not to just reboot but to
>>>> give a message, I'm posting here the kernel dump of the P-III running
>>>> nat_migration:
>>>> [...]
>>>> Xenomai: starting native API services.
>>>> ce649fb4 ce648000 00000b17 00000202 c0139246 cdf2819c cdf28070 0b12d310
>>>>        00000037 ce648000 00000000 c02f0700 00009a28 00000000 b7e94a70 bfed63c8
>>>>        00000000 ce648000 c0102fcb b7e94a70 bfed63dc b7faf4b0 bfed63c8 00000000
>>>> Call Trace:
>>>> [<c0139246>] __ipipe_dispatch_event+0x96/0x130
>>>> [<c0102fcb>] work_resched+0x6/0x1c
>>>> Xenomai: fatal: blocked thread migration[22175] rescheduled?!
>>>> (status=0x300010, sig=0, prev=watchdog/0[3])
>>> This babe is awakened by Linux while Xeno sees it in a dormant state,
>>> likely after it has terminated. No wonder things are going wild
>>> after that... Ok, job queued. Thanks.
>> I think I can explain this warning now: This happens during creation of
>> a new userspace real-time thread. In the context of the newly created
>> Linux pthread that is to become a real-time thread, Xenomai first sets
>> up the real-time part and then calls xnshadow_map. The latter function
>> does further init and then signals via xnshadow_signal_completion to the
>> parent Linux thread (the caller of rt_task_create e.g.) that the thread
>> is up. This happens before xnshadow_harden, i.e. still in preemptible
>> linux context.
>> The signalling should normally not cause a reschedule, as the caller -
>> the to-be-mapped linux pthread - has a higher prio than the woken-up
>> thread.
> Xeno never assumes this.
>> And Xenomai implicitly assumes with this fatal-test above that
>> there is no preemption! But it can happen: the watchdog thread of linux
>> does preempt here. So, I think it's a false positive.
> This is wrong. This check is not related to Linux preemption at all; it
> makes sure that control over any shadow is shared in a strictly
> _mutually exclusive_ way, so that a thread blocked at Xenomai level may
> not be seen as runnable by Linux either. Disabling it would only make
> things worse, since the scheduling state is obviously corrupted when it
> triggers, and that is the root bug we are chasing right now. You should
> not draw any conclusion beyond that. Additionally, keep in mind that
> Xeno has already been run over some PREEMPT_RT patches, which assume an
> infinite number of CPUs over a fine-grained code base and thus induce
> maximum preemption probabilities.

Ok, my explanation was a quick hack before some meeting here; I should
have elaborated it more thoroughly. Let's go through it step by step so
that you can tell me where I go off the right path:

1. We enter xnshadow_map. The linux thread is happily running, the
   shadow thread is in XNDORMANT state and not yet linked to its linux
   mate. Any linux preemption hitting us here and causing a reactivation
   of this particular linux thread later will not cause any activity of
   do_schedule_event related to this thread because [1] is NULL. That's
   important, we will see later why.

2. After some init stuff, xnshadow_map links the shadow to the linux
   thread [2] and then calls xnshadow_signal_completion. This call would
   normally wake up the sleeping parent of our linux thread, performing
   a direct standard linux schedule from the new-born thread to the
   parent. Again, nothing here that do_schedule_event could complain
   about.

3. Now let's consider some preemption by a third linux task after [2]
   but before [3]. Scheduling away the new linux thread is no issue. But
   when it comes back again, we will see those nice xnpod_fatal. The
   reason: our shadow thread is now linked to its linux mate, thus [1]
   will evaluate non-NULL, and later also [4] will hit as XNDORMANT is
   part of XNTHREAD_BLOCK_BITS (and the thread is not ptraced).

Ok, this is how I see THIS particular issue so far. For me the question
is now:

 a) Am I right?
 b) If yes, is this preemption harmless, and thus the warning in the
    described context a false positive?
 c) If it is not, can this cause the crash that follows?


