On Wed, 2009-07-01 at 20:56 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
> >> Jan Kiszka wrote:
> >>> Jan Kiszka wrote:
> >>>> Gilles Chanteperdrix wrote:
> >>>>> Jan Kiszka wrote:
> >>>>>> Jan Kiszka wrote:
> >>>>>>> Gilles Chanteperdrix wrote:
> >>>>>>>> Jan Kiszka wrote:
> >>>>>>>>> Jan Kiszka wrote:
> >>>>>>>>>> It's still unclear what goes on precisely; we are still digging,
> >>>>>>>>>> but the test system that can reproduce this is highly contended.
> >>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
> >>>>>>>>> active_mm->pgd while we are looping over that fault, i.e. the kernel
> >>>>>>>>> tries to fix up the wrong mm. That means we have an open race window
> >>>>>>>>> between updating cr3 and active_mm somewhere (isn't switch_mm run in
> >>>>>>>>> a preemptible manner now?).
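
(Side note, not from the original report: the kind of check that exposes this
mismatch from within the fault path looks roughly like the sketch below,
assuming x86 and that current->active_mm is valid at that point.)

/* Sketch only: compare the hardware page-table base against what the kernel
 * believes the active mm is. A mismatch means the fault handler is about to
 * fix up the wrong mm. */
static void check_cr3_vs_active_mm(void)
{
        struct mm_struct *active_mm = current->active_mm;

        if (active_mm && read_cr3() != __pa(active_mm->pgd))
                printk(KERN_ERR "cr3 %lx != active_mm->pgd %lx\n",
                       read_cr3(), (unsigned long)__pa(active_mm->pgd));
}
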
> >>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm?
> >>>>>>>>
> >>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are
> >>>>>>>>> now checking if it makes a difference. Digging deeper into the code
> >>>>>>>>> in the meantime...
> >>>>>>>> As you have found out in the meantime, we do not use unlocked context
> >>>>>>>> switches on x86.
> >>>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>> The last question I asked myself (but couldn't answer yet due to other
> >>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
> >>>>>>> switch_mm for its Linux callers?
> >>>>>> Ha, that's the point: only activate_mm is protected, but we have more
> >>>>>> spots in 2.6.29 and maybe other kernels, too!
> >>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
> >>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
> >>>> All of them need protection (I pushed IRQ disabling into switch_mm), but
> >>>> that is not enough according to current tests. It seems to reduce the
> >>>> probability of corruption, though.
> >>>>
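
(For reference: "pushed IRQ disabling into switch_mm" amounts to roughly the
sketch below. The __do_switch_mm name is made up here and stands for the
original, unprotected switch_mm body; the real change just wraps that body
with the I-pipe hard-IRQ primitives already used further down.)

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
        unsigned long flags;

        /* Hard-IRQ protection pushed into the callee, so that every caller
         * is covered, including ones added by later kernels. */
        local_irq_save_hw_cond(flags);
        __do_switch_mm(prev, next, tsk);        /* hypothetical helper */
        local_irq_restore_hw_cond(flags);
}
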
> >>>>> however, is that leave_mm sets cr3 and just clears
> >>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
> >>>>> cr3 and active_mm. I do not know what would happen if Xenomai
> >>>>> interrupted leave_mm between the cpu_clear and the write_cr3. From what
> >>>>> I understand, switch_mm called by Xenomai upon return to root would
> >>>>> re-set the bit and re-set cr3, which would be set to the kernel cr3
> >>>>> right after that, but this would result in the active_mm->cpu_vm_mask
> >>>>> bit being set instead of cleared as expected. So, maybe an IRQs-off
> >>>>> section is missing in leave_mm.
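
(To illustrate the window described above: simplified from the 2.6.29 x86
leave_mm, with the suggested IRQs-off section added. Whether that section is
actually needed is answered right below.)

void leave_mm(int cpu)
{
        unsigned long flags;

        /* Hypothetical protection: without it, a Xenomai preemption between
         * the cpu_clear and the cr3 load could run switch_mm on the way back
         * to the root domain, re-setting the bit and reloading cr3, so
         * cpu_vm_mask ends up set although cr3 finally points at the kernel
         * page tables again. */
        local_irq_save_hw_cond(flags);
        cpu_clear(cpu, per_cpu(cpu_tlbstate, cpu).active_mm->cpu_vm_mask);
        load_cr3(swapper_pg_dir);
        local_irq_restore_hw_cond(flags);
}
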
> >>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
> >>>> but now I'm parsing context_switch wrt lazy tlb.
> >>>>
> >>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
> >>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
> >>> picture yet.
> >>>
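
(Roughly what do_page_fault does on entry in 2.6.29, abbreviated, which is
why the lazy-TLB case matters: the handler works on tsk->mm, not on the
active_mm that cr3 was last loaded from.)

        tsk = current;
        mm = tsk->mm;

        /* A kernel thread in lazy-TLB mode has mm == NULL and only a
         * borrowed active_mm, so it never reaches the user fixup path. */
        if (unlikely(in_atomic() || !mm))
                goto bad_area_nosemaphore;
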
> >> Looking closer at the call sites of switch_mm, I think the problem is
> >> mostly related to use_mm from fs/aio.c (the customer is using aio
> >> heavily). But other callers need protection, too. We are going to test
> >> this patch tomorrow:
> >>
> >> diff --git a/fs/aio.c b/fs/aio.c
> >> index 76da125..d90fca3 100644
> >> --- a/fs/aio.c
> >> +++ b/fs/aio.c
> >> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
> >>  {
> >>    struct mm_struct *active_mm;
> >>    struct task_struct *tsk = current;
> >> +  unsigned long flags;
> >>  
> >>    task_lock(tsk);
> >>    active_mm = tsk->active_mm;
> >>    atomic_inc(&mm->mm_count);
> >> +  local_irq_save_hw_cond(flags);
> >>    tsk->mm = mm;
> >>    tsk->active_mm = mm;
> >>    switch_mm(active_mm, mm, tsk);
> >> +  local_irq_restore_hw_cond(flags);
> >>    task_unlock(tsk);
> >>  
> >>    mmdrop(active_mm);
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index aa8f86c..8c545a4 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >>            next->active_mm = oldmm;
> >>            atomic_inc(&oldmm->mm_count);
> >>            enter_lazy_tlb(oldmm, next);
> >> -  } else
> >> +  } else {
> >> +          unsigned long flags;
> >> +          local_irq_save_hw_cond(flags);
> >>            switch_mm(oldmm, mm, next);
> >> +          local_irq_restore_hw_cond(flags);
> >> +  }
> >>  
> >>    if (unlikely(!prev->mm)) {
> >>            prev->active_mm = NULL;
> >> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
> >>  
> >>    BUG_ON(cpu_online(smp_processor_id()));
> >>  
> >> -  if (mm != &init_mm)
> >> +  if (mm != &init_mm) {
> >> +          unsigned long flags;
> >> +          local_irq_save_hw_cond(flags);
> >>            switch_mm(mm, &init_mm, current);
> >> +          local_irq_restore_hw_cond(flags);
> >> +  }
> >>    mmdrop(mm);
> >>  }
> > 
> > Please fix the callee instead of hardening the call sites. This would
> > avoid further issues as upstream adds more switch_mm calls over time,
> > and would make hardening activate_mm unnecessary.
> 
> This was my first idea, too, but it didn't work; see use_mm(). I don't
> think we will get around reviewing switch_mm users on kernel updates.
> 
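
(A reading of why fixing only switch_mm is not enough for use_mm(), inferred
from the patch above rather than stated explicitly anywhere: the task pointer
updates and the cr3 switch have to sit in the same hard-IRQ-off section.)

        /* With protection only inside switch_mm, this window remains: */
        tsk->mm = mm;
        tsk->active_mm = mm;
        /* <-- a Xenomai preemption here already sees the new active_mm
         *     while cr3 still points at the old page tables */
        switch_mm(active_mm, mm, tsk);  /* only this would be hard-IRQ safe */
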

At the very least, you won't have to separately patch the spots that can be
handled this way. Avoiding needlessly invasive changes is not a substitute
for proper review in any case.

> Jan
> 
-- 
Philippe.



_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
