Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>>> It's still unclear what is going on precisely; we are still
>>>>>>>>>>>>> digging, but the test system that can reproduce this is highly
>>>>>>>>>>>>> contended.
>>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs
>>>>>>>>>>>> from active_mm->pgd while we are looping over that fault, i.e. the
>>>>>>>>>>>> kernel tries to fix up the wrong mm. That means we have an open
>>>>>>>>>>>> race window somewhere between updating cr3 and active_mm (isn't
>>>>>>>>>>>> switch_mm run in a preemptible manner now?).
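>>>>>>>>>>>>
>>>>>>>>>>>> The check boils down to something like this in the fault path (a
>>>>>>>>>>>> rough sketch only, not the actual instrumentation):
>>>>>>>>>>>>
>>>>>>>>>>>>         /* hypothetical debug check at do_page_fault() entry */
>>>>>>>>>>>>         if (read_cr3() != __pa(current->active_mm->pgd))
>>>>>>>>>>>>                 printk(KERN_ERR "cr3=%lx, active_mm->pgd at %lx\n",
>>>>>>>>>>>>                        read_cr3(), __pa(current->active_mm->pgd));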
>>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm?
>>>>>>>>>>>
>>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we
>>>>>>>>>>>> are now checking if it makes a difference. Digging deeper into the
>>>>>>>>>>>> code in the meantime...
>>>>>>>>>>> As you have found out in the meantime, we do not use unlocked
>>>>>>>>>>> context switches on x86.
>>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>> The last question I asked myself (but couldn't answer yet due to 
>>>>>>>>>> other
>>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>>>> switch_mm for its Linux callers?
>>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>>>> All of them need protection (I pushed IRQ disabling into switch_mm), but
>>>>>>> that is not enough according to current tests. It seems to reduce the
>>>>>>> probability of corruption, though.
>>>>>>>
>>>>>>>> however, is that leave_mm sets cr3 and just clears
>>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy
>>>>>>>> between cr3 and active_mm. I do not know what could happen if Xenomai
>>>>>>>> interrupted leave_mm between the cpu_clear and the write_cr3. From what
>>>>>>>> I understand, switch_mm called by Xenomai upon return to root would
>>>>>>>> re-set the bit and re-set cr3, which would be set to the kernel cr3
>>>>>>>> right after that, but this would result in the active_mm->cpu_vm_mask
>>>>>>>> bit being set instead of cleared as expected. So, maybe an irqs-off
>>>>>>>> section is missing in leave_mm.
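>>>>>>>>
>>>>>>>> For reference, leave_mm looks roughly like this here (a sketch of the
>>>>>>>> x86 tlb code; the exact per-cpu accessors may differ):
>>>>>>>>
>>>>>>>>         void leave_mm(int cpu)
>>>>>>>>         {
>>>>>>>>                 if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK)
>>>>>>>>                         BUG();
>>>>>>>>                 /* window opens: the mask stops matching cr3... */
>>>>>>>>                 cpu_clear(cpu, per_cpu(cpu_tlbstate, cpu).active_mm->cpu_vm_mask);
>>>>>>>>                 /* ...and only closes once cr3 points at the kernel
>>>>>>>>                  * page tables again */
>>>>>>>>                 load_cr3(swapper_pg_dir);
>>>>>>>>         }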
>>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>>>> but now I'm parsing context_switch wrt lazy tlb.
>>>>>>>
>>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>>>> picture yet.
>>>>>>
>>>>> Looking closer at the call sites of switch_mm, I think the problem is
>>>>> mostly related to use_mm from fs/aio.c (the customer is using aio
>>>>> heavily). But other callers need protection, too. We are going to test
>>>>> this patch tomorrow:
>>>>>
>>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>>> index 76da125..d90fca3 100644
>>>>> --- a/fs/aio.c
>>>>> +++ b/fs/aio.c
>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>>  {
>>>>>   struct mm_struct *active_mm;
>>>>>   struct task_struct *tsk = current;
>>>>> + unsigned long flags;
>>>>>  
>>>>>   task_lock(tsk);
>>>>>   active_mm = tsk->active_mm;
>>>>>   atomic_inc(&mm->mm_count);
>>>>> + local_irq_save_hw_cond(flags);
>>>>>   tsk->mm = mm;
>>>>>   tsk->active_mm = mm;
>>>>>   switch_mm(active_mm, mm, tsk);
>>>>> + local_irq_restore_hw_cond(flags);
>>>>>   task_unlock(tsk);
>>>>>  
>>>>>   mmdrop(active_mm);
>>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>>> index aa8f86c..8c545a4 100644
>>>>> --- a/kernel/sched.c
>>>>> +++ b/kernel/sched.c
>>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>>>           next->active_mm = oldmm;
>>>>>           atomic_inc(&oldmm->mm_count);
>>>>>           enter_lazy_tlb(oldmm, next);
>>>>> - } else
>>>>> + } else {
>>>>> +         unsigned long flags;
>>>>> +         local_irq_save_hw_cond(flags);
>>>>>           switch_mm(oldmm, mm, next);
>>>>> +         local_irq_restore_hw_cond(flags);
>>>>> + }
>>>>>  
>>>>>   if (unlikely(!prev->mm)) {
>>>>>           prev->active_mm = NULL;
>>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>>>  
>>>>>   BUG_ON(cpu_online(smp_processor_id()));
>>>>>  
>>>>> - if (mm != &init_mm)
>>>>> + if (mm != &init_mm) {
>>>>> +         unsigned long flags;
>>>>> +         local_irq_save_hw_cond(flags);
>>>>>           switch_mm(mm, &init_mm, current);
>>>>> +         local_irq_restore_hw_cond(flags);
>>>>> + }
>>>>>   mmdrop(mm);
>>>>>  }
>>>> Please fix the callee instead of ironing the call sites. This would
>>>> avoid further issues as upstream adds more switch_mm call sites over
>>>> time, and would make ironing activate_mm unnecessary.
>>>>
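>>>> A callee-side version could look roughly like this (a sketch only;
>>>> __switch_mm is an illustrative name for the original switch_mm body):
>>>>
>>>>         static inline void switch_mm(struct mm_struct *prev,
>>>>                                      struct mm_struct *next,
>>>>                                      struct task_struct *tsk)
>>>>         {
>>>>                 unsigned long flags;
>>>>
>>>>                 local_irq_save_hw_cond(flags);
>>>>                 /* the original body then runs with hw IRQs off on
>>>>                  * every call path, present and future */
>>>>                 __switch_mm(prev, next, tsk);
>>>>                 local_irq_restore_hw_cond(flags);
>>>>         }
>>>>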
>>> Btw, this is how the powerpc port works in locked switch mode, and this
>>> particular bug does not apply in unlocked switch mode anyway; this is
>>> why we don't have that problem on this arch. The ARM port always works
>>> in unlocked switch mode IIRC for latency reasons, so this should be a
>>> non-issue here as well. Gilles, would you confirm this?
>> Yes, on ARM switch_mm always runs with irqs on. We handle the case where
>> switch_mm was interrupted in the middle of its operations by restarting
>> the mm switch after Xenomai switches back to Linux.
>>
> 
> On archs with non-atomic switch_mm(), use_mm() will require a different
> strategy. I'm thinking about something like
> 
> use_mm():
>       set_some_flag();
>       barrier();
>       current->mm = new_mm;
>       current->active_mm = new_mm;
>       switch_mm(old_active_mm, new_mm, current);
>       clear_some_flag();
> 
> and switch_mm():
>       ...
>       if (likely(prev != next) || some_flag_set()) {
>               clear_some_flag();
>               ...
> 
> i.e. enforce the mm switch if we may have interrupted use_mm at an
> unpleasant time. I just don't know yet where to attach some_flag. Should
> be the current task, but can we always access it from switch_mm?

These mechanisms are already in place. All you have to do is:

use_mm()
        ipipe_active_mm = NULL;
        barrier();
        current->mm = new_mm;
        current->active_mm = new_mm;
        switch_mm(old_active_mm, new_mm, current);
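
The switch-back path then detects and replays an interrupted switch;
roughly (a sketch, with identifiers loosely modeled on the ARM port):

        /* on return to the root (Linux) domain */
        if (per_cpu(ipipe_active_mm, smp_processor_id()) !=
            current->active_mm)
                /* use_mm()/switch_mm() was caught in flight, so
                 * ipipe_active_mm is still NULL: redo the whole switch.
                 * The ipipe-aware switch_mm must not short-circuit the
                 * prev == next case here. */
                switch_mm(current->active_mm, current->active_mm, current);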

-- 
                                            Gilles.

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
