Philippe Gerum wrote:
> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>> It's still unclear what goes on precisely; we are still
>>>>>>>>>>> digging, but the test system that can reproduce this is
>>>>>>>>>>> highly contended.
>>>>>>>>>> Short update: further instrumentation revealed that cr3 differs
>>>>>>>>>> from active_mm->pgd while we are looping over that fault, i.e.
>>>>>>>>>> the kernel tries to fix up the wrong mm. That means we have an
>>>>>>>>>> open race window somewhere between updating cr3 and active_mm
>>>>>>>>>> (isn't switch_mm run in a preemptible manner now?).
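For illustration, a minimal sketch of the kind of check described above
(hypothetical, not the actual instrumentation; it assumes kernel context
and that x86's load_cr3() installs __pa(mm->pgd)):

/*
 * Hypothetical instrumentation sketch: compare the hardware cr3
 * against the pgd of the mm the fault fixup is about to use. On x86,
 * load_cr3() writes __pa(mm->pgd) into cr3, so any mismatch means cr3
 * and active_mm have diverged.
 */
static void check_cr3_vs_active_mm(void)
{
	struct mm_struct *mm = current->active_mm;

	if (mm && read_cr3() != __pa(mm->pgd))
		printk(KERN_ERR "cr3 %#lx != __pa(active_mm->pgd) %#lx\n",
		       read_cr3(), (unsigned long)__pa(mm->pgd));
}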
>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm?
>>>>>>>>>
>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and
>>>>>>>>>> we are now checking whether it makes a difference. Digging
>>>>>>>>>> deeper into the code in the meantime...
>>>>>>>>> As you have found out in the meantime, we do not use unlocked
>>>>>>>>> context switches on x86.
>>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>> The last question I asked myself (but could not yet answer due to
>>>>>>>> other activity) was: where are the local_irq_disable/enable_hw
>>>>>>>> calls around switch_mm for its Linux callers?
>>>>>>> Ha, that's the point: only activate_mm is protected, but we have
>>>>>>> more call sites in 2.6.29, and maybe in other kernels, too!
>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>> We have two direct callers of switch_mm in sched.c and one in
>>>>> fs/aio.c. All of them need protection (I pushed IRQ disabling into
>>>>> switch_mm), but that is not enough according to current tests. It
>>>>> seems to reduce the probability of corruption, though.
>>>>>
>>>>>> however, is that leave_mm sets cr3 and just clears
>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy
>>>>>> between cr3 and active_mm. I do not know what could happen if
>>>>>> Xenomai interrupted leave_mm between the cpu_clear and the
>>>>>> write_cr3. From what I understand, switch_mm called by Xenomai upon
>>>>>> return to root would re-set the bit and re-set cr3, which would be
>>>>>> set to the kernel cr3 right after that, but this would result in the
>>>>>> active_mm->cpu_vm_mask bit being set instead of cleared as expected.
>>>>>> So, maybe an irqs-off section is missing in leave_mm.
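For reference, a simplified sketch of what x86 leave_mm() does (the 32-
and 64-bit variants in 2.6.29 differ in how they fetch the TLB state;
this is not verbatim). The window described above sits between the two
statements:

/*
 * Simplified leave_mm() sketch, not the verbatim 2.6.29 code. An
 * interrupt hitting between cpu_clear() and load_cr3() sees
 * cpu_vm_mask already cleared while cr3 still points at the old mm's
 * page tables.
 */
void leave_mm(int cpu)
{
	struct mm_struct *mm = current->active_mm;

	cpu_clear(cpu, mm->cpu_vm_mask);
	/* <-- window: a Xenomai domain switch here can re-set the bit */
	load_cr3(swapper_pg_dir);	/* point cr3 at the kernel pgd */
}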
>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt,
>>>>> but now I'm parsing context_switch wrt lazy TLB.
>>>>>
>>>> Hmm... lazy TLB: this means a new task is switched in and has
>>>> active_mm != mm. But do_page_fault reads task->mm... Just thoughts,
>>>> no clear picture yet.
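To see why that matters, here is an abridged sketch of the relevant
do_page_fault() lines (2.6.29-era, from memory, not verbatim):

/*
 * The fixup path walks tsk->mm, while the hardware walked whatever
 * page tables cr3 points to. For a lazy-TLB kernel thread tsk->mm is
 * NULL; only active_mm corresponds to cr3.
 */
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
/* ... */
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);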
>>>>
>>> Looking closer at the call sites of switch_mm, I think the problem is
>>> mostly related to use_mm from fs/aio.c (the customer is using aio
>>> heavily). But other callers need protection, too. We are going to test
>>> this patch tomorrow:
>>>
>>> diff --git a/fs/aio.c b/fs/aio.c
>>> index 76da125..d90fca3 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>  {
>>>     struct mm_struct *active_mm;
>>>     struct task_struct *tsk = current;
>>> +   unsigned long flags;
>>>  
>>>     task_lock(tsk);
>>>     active_mm = tsk->active_mm;
>>>     atomic_inc(&mm->mm_count);
>>> +   local_irq_save_hw_cond(flags);
>>>     tsk->mm = mm;
>>>     tsk->active_mm = mm;
>>>     switch_mm(active_mm, mm, tsk);
>>> +   local_irq_restore_hw_cond(flags);
>>>     task_unlock(tsk);
>>>  
>>>     mmdrop(active_mm);
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index aa8f86c..8c545a4 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>             next->active_mm = oldmm;
>>>             atomic_inc(&oldmm->mm_count);
>>>             enter_lazy_tlb(oldmm, next);
>>> -   } else
>>> +   } else {
>>> +           unsigned long flags;
>>> +           local_irq_save_hw_cond(flags);
>>>             switch_mm(oldmm, mm, next);
>>> +           local_irq_restore_hw_cond(flags);
>>> +   }
>>>  
>>>     if (unlikely(!prev->mm)) {
>>>             prev->active_mm = NULL;
>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>  
>>>     BUG_ON(cpu_online(smp_processor_id()));
>>>  
>>> -   if (mm != &init_mm)
>>> +   if (mm != &init_mm) {
>>> +           unsigned long flags;
>>> +           local_irq_save_hw_cond(flags);
>>>             switch_mm(mm, &init_mm, current);
>>> +           local_irq_restore_hw_cond(flags);
>>> +   }
>>>     mmdrop(mm);
>>>  }
>> Please fix the callee instead of ironing the call sites. This would
>> avoid further issues as upstream adds more switch_mm calls over time,
>> and it would make ironing activate_mm unnecessary.
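A sketch of that callee-side fix, assuming the original switch_mm body
is factored into a hypothetical __switch_mm() and reusing the same
I-pipe helpers as the patch above:

/*
 * Sketch only: wrap the whole MMU switch in a hard-irqs-off section,
 * so every caller, present and future, is covered. __switch_mm()
 * stands for the original switch_mm body (hypothetical name).
 */
static inline void switch_mm(struct mm_struct *prev,
			     struct mm_struct *next,
			     struct task_struct *tsk)
{
	unsigned long flags;

	local_irq_save_hw_cond(flags);
	__switch_mm(prev, next, tsk);
	local_irq_restore_hw_cond(flags);
}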
>>
> 
> Btw, this is how the powerpc port works in locked switch mode, and this
> particular bug does not apply in unlocked switch mode anyway; this is
> why we don't have that problem on this arch. The ARM port always works
> in unlocked switch mode, IIRC for latency reasons, so this should be a
> non-issue there as well. Gilles, would you confirm this?

Yes, on ARM switch_mm always runs with irqs on. We handle the case where
switch_mm was interrupted in the middle of its operation by restarting
the mm switch after Xenomai switches back to Linux.
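Roughly, a sketch of that restart scheme (not the actual ARM I-pipe
code; do_switch_mm() and the per-CPU switching_mm variable are
hypothetical names):

/*
 * Record which mm is being installed before touching the hardware; if
 * Xenomai preempted the switch midway, redo it once execution returns
 * to the Linux domain.
 */
static DEFINE_PER_CPU(struct mm_struct *, switching_mm);

static void deferred_switch_mm(struct mm_struct *next)
{
	per_cpu(switching_mm, smp_processor_id()) = next;
	barrier();
	do_switch_mm(next);	/* may be preempted by Xenomai here */
	barrier();
	per_cpu(switching_mm, smp_processor_id()) = NULL;
}

/* Called when execution returns to the Linux domain: */
static void resume_root_mm(void)
{
	struct mm_struct *mm = per_cpu(switching_mm, smp_processor_id());

	if (mm)
		do_switch_mm(mm);	/* restart the interrupted switch */
}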

-- 
                                            Gilles.
