Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Jan Kiszka wrote:
>>>>
>>>>
>>>>> Hi Philippe,
>>>>>
>>>>> I'm afraid this one is serious: let the attached migration stress test
>>>>> run on likely any Xenomai version since 2.0, preferably with
>>>>> CONFIG_XENO_OPT_DEBUG on. It will give a nice crash sooner or later (I'm
>>>>> trying to set up a serial console now).
>>>>>
>>>
>>> Confirmed here. My test box went through some nifty triple salto out of
>>> the window running this fragment for 2 min or so. Actually, the semop
>>> handshake is not even needed to cause the crash. At first sight, it
>>> looks like a migration issue taking place during the critical phase when
>>> a shadow thread switches back to Linux to terminate.
>>>
>>>
>>>>
>>>> As it took some time to persuade my box to not just reboot but to
>>>> give a
>>>> message, I'm posting here the kernel dump of the P-III running
>>>> nat_migration:
>>>>
>>>> [...]
>>>> Xenomai: starting native API services.
>>>> ce649fb4 ce648000 00000b17 00000202 c0139246 cdf2819c cdf28070 0b12d310
>>>>       00000037 ce648000 00000000 c02f0700 00009a28 00000000 b7e94a70
>>>> bfed63c8
>>>>       00000000 ce648000 c0102fcb b7e94a70 bfed63dc b7faf4b0 bfed63c8
>>>> 00000000
>>>> Call Trace:
>>>> [<c0139246>] __ipipe_dispatch_event+0x96/0x130
>>>> [<c0102fcb>] work_resched+0x6/0x1c
>>>> Xenomai: fatal: blocked thread migration[22175] rescheduled?!
>>>> (status=0x300010, sig=0, prev=watchdog/0[3])
>>>
>>> This babe is awakened by Linux while Xeno sees it in a dormant state,
>>> likely after it has terminated. No wonder things are going wild
>>> after that... Ok, job queued. Thanks.
>>>
>>
>>
>> I think I can explain this warning now: it happens during creation of
>> a new userspace real-time thread. In the context of the newly created
>> Linux pthread that is to become a real-time thread, Xenomai first sets
>> up the real-time part and then calls xnshadow_map. The latter function
>> does further initialization and then signals via
>> xnshadow_signal_completion to the parent Linux thread (e.g. the caller
>> of rt_task_create) that the thread is up. This happens before
>> xnshadow_harden, i.e. still in preemptible Linux context.
>>
>> The signalling should normally not cause a reschedule, as the caller -
>> the to-be-mapped Linux pthread - has a higher priority than the
>> woken-up thread.
> 
> Xeno never assumes this.
> 
>> And Xenomai implicitly assumes with this fatal-test above that
>> there is no preemption! But it can happen: the watchdog thread of linux
>> does preempt here. So, I think it's a false positive.
>>
> 
> This is wrong. This check is not related to Linux preemption at all; it
> makes sure that control over any shadow is shared in a strictly
> _mutually exclusive_ way, so that a thread blocked at the Xenomai level
> may not be seen as runnable by Linux either. Disabling it only makes
> things worse, since the scheduling state is obviously corrupted when it
> triggers, and that's the root bug we are chasing right now. You should
> not draw any conclusion beyond that. Additionally, keep in mind that
> Xeno has already run over some PREEMPT_RT patches, for which an infinite
> number of CPUs is assumed over a fine-grained code base, which induces
> maximum preemption probabilities.
> 

Ok, my explanation was a quick hack before some meeting here; I should
have elaborated it more thoroughly. Let's try to do it step by step so
that you can say where I go off the right path:

1. We enter xnshadow_map. The Linux thread is happily running, the
   shadow thread is in XNDORMANT state and not yet linked to its Linux
   mate. Any Linux preemption hitting us here and causing a reactivation
   of this particular Linux thread later will not cause any activity of
   do_schedule_event related to this thread, because [1] is NULL. That's
   important; we will see why later.

2. After some init stuff, xnshadow_map links the shadow to the Linux
   thread [2] and then calls xnshadow_signal_completion. This call would
   normally wake up the sleeping parent of our Linux thread, performing
   a direct standard Linux schedule from the newborn thread to the
   parent. Again, nothing here that do_schedule_event could complain
   about.

3. Now let's consider some preemption by a third Linux task after [2]
   but before [3]. Scheduling away the new Linux thread is no issue. But
   when it comes back again, we will see that nice xnpod_fatal. The
   reason: our shadow thread is now linked to its Linux mate, so [1]
   will evaluate non-NULL, and later [4] will also hit, as XNDORMANT is
   part of XNTHREAD_BLOCK_BITS (and the thread is not ptraced).

Ok, this is how I see THIS particular issue so far. For me the question
is now:

 a) Am I right?
 b) If yes, is this preemption harmless, making the warning in the
    described context a false positive?
 c) If it is not, can this preemption cause the subsequent crash?

Jan

[1]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L1515
[2]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L765
[3]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L621
[4]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#L1555


_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
