Jan Kiszka wrote:
> Philippe Gerum wrote:
>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, 
>>>>>>>>>>>> but the
>>>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't 
>>>>>>>>>>> switch_mm run
>>>>>>>>>>> in a preemptible manner now?).
>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>>>
>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we 
>>>>>>>>>>> are now
>>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in 
>>>>>>>>>>> the
>>>>>>>>>>> meanwhile...
>>>>>>>>>> As you have found out in the mean time, we do not use unlocked 
>>>>>>>>>> context
>>>>>>>>>> switches on x86.
>>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>>> switch_mm for its Linux callers?
>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>>>> is not enough according to current tests. It seems to reduce the
>>>>>> probability of corruption, though.
>>>>>>
>>>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>>>> missing in leave_mm.
>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>>> but now I'm parsing context_switch w.r.t. lazy tlb.
>>>>>>
>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>>> picture yet.
>>>>>
>>>> Looking closer at the call sites of switch_mm, I think the problem
>>>> is mostly related to use_mm from fs/aio.c (customer is using aio
>>>> heavily). But other callers need protection, too. We are going to test
>>>> this patch tomorrow:
>>>>
>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>> index 76da125..d90fca3 100644
>>>> --- a/fs/aio.c
>>>> +++ b/fs/aio.c
>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>  {
>>>>    struct mm_struct *active_mm;
>>>>    struct task_struct *tsk = current;
>>>> +  unsigned long flags;
>>>>  
>>>>    task_lock(tsk);
>>>>    active_mm = tsk->active_mm;
>>>>    atomic_inc(&mm->mm_count);
>>>> +  local_irq_save_hw_cond(flags);
>>>>    tsk->mm = mm;
>>>>    tsk->active_mm = mm;
>>>>    switch_mm(active_mm, mm, tsk);
>>>> +  local_irq_restore_hw_cond(flags);
>>>>    task_unlock(tsk);
>>>>  
>>>>    mmdrop(active_mm);
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index aa8f86c..8c545a4 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>>            next->active_mm = oldmm;
>>>>            atomic_inc(&oldmm->mm_count);
>>>>            enter_lazy_tlb(oldmm, next);
>>>> -  } else
>>>> +  } else {
>>>> +          unsigned long flags;
>>>> +          local_irq_save_hw_cond(flags);
>>>>            switch_mm(oldmm, mm, next);
>>>> +          local_irq_restore_hw_cond(flags);
>>>> +  }
>>>>  
>>>>    if (unlikely(!prev->mm)) {
>>>>            prev->active_mm = NULL;
>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>>  
>>>>    BUG_ON(cpu_online(smp_processor_id()));
>>>>  
>>>> -  if (mm != &init_mm)
>>>> +  if (mm != &init_mm) {
>>>> +          unsigned long flags;
>>>> +          local_irq_save_hw_cond(flags);
>>>>            switch_mm(mm, &init_mm, current);
>>>> +          local_irq_restore_hw_cond(flags);
>>>> +  }
>>>>    mmdrop(mm);
>>>>  }
>>> Please fix the callee instead of armoring each call site. That would
>>> avoid further breakage as upstream grows additional switch_mm call
>>> sites over time, and would also make the special-casing of activate_mm
>>> unnecessary.
>>>
>> Btw, this is how the powerpc port works in locked switch mode, and this
>> particular bug does not apply in unlocked switch mode anyway; this is
>> why we don't have that problem on this arch. The ARM port always works
>> in unlocked switch mode IIRC for latency reasons, so this should be a
>> non-issue here as well. Gilles, would you confirm this?
> 
> The aio issue (if it turns out to be one, /me still needs to understand
> the concrete patch) should bite any arch, but we may solve it
> differently where switch_mm is actually reentrant.

Took a beer to get my mind free, here is the potential race:

-> aio thread runs with mm == init_mm
-> aio thread calls into use_mm
-> preemption by Xenomai right after tsk->active_mm = mm
-> Xenomai switches from the aio thread to an RT thread with identical
   mm (definitely possible in our scenario)
-> switch_mm will not update cr3 as prev == next
-> next fault of the RT thread (e.g. in __xn_put_user...) will attempt
   to fix up the right mm, but the CPU will keep faulting against init_mm

Jan


_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
