Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

Gilles Chanteperdrix Fri, 15 Jul 2011 05:33:13 -0700

On 07/14/2011 10:57 PM, Jan Kiszka wrote:
> On 2011-07-13 21:12, Gilles Chanteperdrix wrote:
>> On 07/13/2011 09:04 PM, Jan Kiszka wrote:
>>> On 2011-07-13 20:39, Gilles Chanteperdrix wrote:
>>>> On 07/12/2011 07:43 PM, Jan Kiszka wrote:
>>>>> On 2011-07-12 19:38, Gilles Chanteperdrix wrote:
>>>>>> On 07/12/2011 07:34 PM, Jan Kiszka wrote:
>>>>>>> On 2011-07-12 19:31, Gilles Chanteperdrix wrote:
>>>>>>>> On 07/12/2011 02:57 PM, Jan Kiszka wrote:
>>>>>>>>>                       xnlock_put_irqrestore(&nklock, s);
>>>>>>>>>                       xnpod_schedule();
>>>>>>>>>               }
>>>>>>>>> @@ -1036,6 +1043,7 @@ redo:
>>>>>>>>>        * to process this signal anyway.
>>>>>>>>>        */
>>>>>>>>>       if (rthal_current_domain == rthal_root_domain) {
>>>>>>>>> +             XENO_BUGON(NUCLEUS, xnthread_test_info(thread, 
>>>>>>>>> XNATOMIC));
>>>>>>>>
>>>>>>>> Misleading dead code again, XNATOMIC is cleared not ten lines above.
>>>>>>>
>>>>>>> Nope, I forgot to remove that line.
>>>>>>>
>>>>>>>>
>>>>>>>>>               if (XENO_DEBUG(NUCLEUS) && (!signal_pending(this_task)
>>>>>>>>>                   || this_task->state != TASK_RUNNING))
>>>>>>>>>                       xnpod_fatal
>>>>>>>>> @@ -1044,6 +1052,8 @@ redo:
>>>>>>>>>               return -ERESTARTSYS;
>>>>>>>>>       }
>>>>>>>>>  
>>>>>>>>> +     xnthread_clear_info(thread, XNATOMIC);
>>>>>>>>
>>>>>>>> Why this? I find the xnthread_clear_info(XNATOMIC) right at the right
>>>>>>>> place at the point it currently is.
>>>>>>>
>>>>>>> Nope. Now we either clear XNATOMIC after successful migration or when
>>>>>>> the signal is about to be sent (ie. in the hook). That way we can test
>>>>>>> more reliably (TM) in the gatekeeper if the thread can be migrated.
>>>>>>
>>>>>> Ok for adding the XNATOMIC test, because it improves the robustness, but
>>>>>> why changing the way XNATOMIC is set and clear? Chances of breaking
>>>>>> thing while changing code in this area are really high...
>>>>>
>>>>> The current code is (most probably) broken as it does not properly
>>>>> synchronizes the gatekeeper against a signaled and "runaway" target
>>>>> Linux task.
>>>>>
>>>>> We need an indication if a Linux signal will (or already has) woken up
>>>>> the to-be-migrated task. That task may have continued over its context,
>>>>> potentially on a different CPU. Providing this indication is the purpose
>>>>> of changing where XNATOMIC is cleared.
>>>>
>>>> What about synchronizing with the gatekeeper with a semaphore, as done
>>>> in the first patch you sent, but doing it in xnshadow_harden, as soon as
>>>> we detect that we are not back from schedule in primary mode? It seems
>>>> it would avoid any further issue, as we would then be guaranteed that
>>>> the thread could not switch to TASK_INTERRUPTIBLE again before the
>>>> gatekeeper is finished.
>>>
>>> The problem is that the gatekeeper tests the task state without holding
>>> the task's rq lock (which is not available to us without a kernel
>>> patch). That cannot work reliably as long as we accept signals. That's
>>> why I'm trying to move state change and test under nklock.
>>>
>>>>
>>>> What worries me is the comment in xnshadow_harden:
>>>>
>>>>     * gatekeeper sent us to primary mode. Since
>>>>     * TASK_UNINTERRUPTIBLE is unavailable to us without wrecking
>>>>     * the runqueue's count of uniniterruptible tasks, we just
>>>>     * notice the issue and gracefully fail; the caller will have
>>>>     * to process this signal anyway.
>>>>     */
>>>>
>>>> Does this mean that we can not switch to TASK_UNINTERRUPTIBLE at this
>>>> point? Or simply that TASK_UNINTERRUPTIBLE is not available for the
>>>> business of xnshadow_harden?
>>>>
>>>
>>> TASK_UNINTERRUPTIBLE is not available without patching the kernel's
>>> scheduler for the reason mentioned in the comment (the scheduler becomes
>>> confused and may pick the wrong tasks, IIRC).
>>
>> Does not using down/up in the taskexit event handler risk to cause the
>> same issue?
> 
> Yes, and that means the first patch is incomplete without something like
> the second.
> 
>>
>>>
>>> But I would refrain from trying to "improve" the gatekeeper design. I've
>>> recently mentioned this to Philippe offlist: For Xenomai 3 with some
>>> ipipe v3, we must rather patch schedule() to enable zero-switch domain
>>> migration. Means: enter the scheduler, let it suspend current and pick
>>> another task, but then simply escalate to the RT domain before doing any
>>> context switch. That's much cheaper than the current design and
>>> hopefully also less error-prone.
>>
>> So, do you want me to merge your for-upstream branch?
> 
> You may merge up to for-upstream^, ie. without any gatekeeper fixes.
> 
> I strongly suspect that there are still more races in the migration
> path. The crashes we face even with all patches applied may be related
> to a shadow task being executed under Linux and Xenomai at the same time.


Maybe we could try the following patch instead?

diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
index 01f4200..deb7620 100644
--- a/ksrc/nucleus/shadow.c
+++ b/ksrc/nucleus/shadow.c
@@ -1033,6 +1033,8 @@ redo:
                        xnpod_fatal
                            ("xnshadow_harden() failed for thread %s[%d]",
                             thread->name, xnthread_user_pid(thread));
+               down(&sched->gksync);
+               up(&sched->gksync);
                return -ERESTARTSYS;
        }
 

-- 
                                                                Gilles.

_______________________________________________
Xenomai-core mailing list
[email protected]
https://mail.gna.org/listinfo/xenomai-core

Re: [Xenomai-core] [Xenomai-git] Jan Kiszka : nucleus: Fix race between gatekeeper and thread deletion

Reply via email to