Jan Kiszka <[email protected]> writes:
> On 09.06.21 08:12, Philippe Gerum wrote:
>>
>> Jan Kiszka <[email protected]> writes:
>>
>>> On 08.06.21 19:39, Philippe Gerum wrote:
>>>>
>>>> Philippe Gerum <[email protected]> writes:
>>>>
>>>>> Jan Kiszka <[email protected]> writes:
>>>>>
>>>>>> On 07.06.21 18:19, Philippe Gerum wrote:
>>>>>>>
>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>
>>>>>>>> On 07.06.21 17:04, Philippe Gerum wrote:
>>>>>>>>>
>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>
>>>>>>>>>> On 07.06.21 15:48, Jan Kiszka wrote:
>>>>>>>>>>> On 07.06.21 15:44, Jan Kiszka wrote:
>>>>>>>>>>>> On 07.06.21 15:03, Philippe Gerum wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 07.06.21 09:17, Philippe Gerum wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 06.06.21 16:43, Philippe Gerum via Xenomai wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From: Jan Kiszka <[email protected]>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The correct order is hard IRQs off first, then stalling inband, see also
>>>>>>>>>>>>>>>>>> I-pipe. Otherwise, we risk taking an interrupt for the inband stage but
>>>>>>>>>>>>>>>>>> not injecting it before returning to user space.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Jan Kiszka <[email protected]>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Fixes the RCU stalls Florian reported, at least for me.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm afraid this wasn't the last regression of translating I-pipe into
>>>>>>>>>>>>>>>>>> dovetail..
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  include/linux/entry-common.h | 3 ++-
>>>>>>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>>>>>>>>>>>>>>>>>> index 0fb45b2d6094..00540110985e 100644
>>>>>>>>>>>>>>>>>> --- a/include/linux/entry-common.h
>>>>>>>>>>>>>>>>>> +++ b/include/linux/entry-common.h
>>>>>>>>>>>>>>>>>> @@ -203,7 +203,8 @@ static inline void local_irq_disable_exit_to_user(void);
>>>>>>>>>>>>>>>>>>  #ifndef local_irq_disable_exit_to_user
>>>>>>>>>>>>>>>>>>  static inline void local_irq_disable_exit_to_user(void)
>>>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>>> -	local_irq_disable_full();
>>>>>>>>>>>>>>>>>> +	hard_cond_local_irq_disable();
>>>>>>>>>>>>>>>>>> +	local_irq_disable();
>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>  #endif
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Good spot, a real issue here, just like the one the CPU_IDLE code deals
>>>>>>>>>>>>>>>>> with. The fix looks good, but maybe we could do even better.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The way local_irq_disable_full() works tends to introduce this issue in
>>>>>>>>>>>>>>>>> a sneaky way, when the code path does not synchronize the interrupt log
>>>>>>>>>>>>>>>>> (exit_to_user_mode_loop may be at fault in this case, since it does not
>>>>>>>>>>>>>>>>> traverse synchronize_pipeline()). Lack of synchronization would still
>>>>>>>>>>>>>>>>> hit us if some trap handler turns hard irqs off -> on -> off the same
>>>>>>>>>>>>>>>>> way, and we don't leave through the user exit path.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Inverting the order of the actions in local_irq_disable_full() may be an
>>>>>>>>>>>>>>>>> option (inband_irq_disable would allow that); making sure we cannot be
>>>>>>>>>>>>>>>>> caught in the same issue when returning to kernel mode is another
>>>>>>>>>>>>>>>>> way. This needs more thought, I think.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So, always this way?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> local_irq_disable_full:
>>>>>>>>>>>>>>>>     hard_local_irq_disable
>>>>>>>>>>>>>>>>     local_irq_disable
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> local_irq_enable_full:
>>>>>>>>>>>>>>>>     hard_local_irq_enable
>>>>>>>>>>>>>>>>     local_irq_enable
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Or even flip local_irq_enable_full as well?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not this one: hard irqs must be on before local_irq_enable() is issued,
>>>>>>>>>>>>>>> otherwise we would trigger a debug assertion. The reason for that
>>>>>>>>>>>>>>> assertion is that the I-pipe version of local_irq_enable() force-enables
>>>>>>>>>>>>>>> hard irqs, and this is something we want to depart from, because such a
>>>>>>>>>>>>>>> side effect is error-prone whether or not the caller expects it. For
>>>>>>>>>>>>>>> this reason, inband_irq_enable() expects hard irqs on, which should be
>>>>>>>>>>>>>>> the current hard irq state for the caller under normal circumstances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Was there ever code that required ordering local_irq_disable_full the
>>>>>>>>>>>>>>>> other way around?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After review, I don't think so.
>>>>>>>>>>>>>>> The current order for
>>>>>>>>>>>>>>> local_irq_disable_full() was rather determined by applying a reverse
>>>>>>>>>>>>>>> sequence compared to local_irq_enable_full(). So this looks good. I need
>>>>>>>>>>>>>>> to test this on the armv[7/8] side here first before merging.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a commit somewhere that I can drop already?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. These are fine with me in their latest iteration:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    irq_pipeline: Clean up stage_info field and users
>>>>>>>>>>>>>    irq_pipeline: Account for stage migration across faults
>>>>>>>>>>>>>    irq_pipeline: Warn when calling irqentry_enter with oob stalled
>>>>>>>>>>>>>    x86: dovetail: Fix TS flag reservation
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. This one should be replaced by a fix in local_irq_disable_full(),
>>>>>>>>>>>>> pending some tests ongoing here:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    irq_pipeline: Prevent returning to user space with pending inband IRQs
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, as I mentioned earlier, this may not be sufficient per se, we
>>>>>>>>>>>>> may have to synchronize the in-band log when unstalling the stage on the
>>>>>>>>>>>>> kernel exit path, i.e. the exit point fixed up by the "Account for stage
>>>>>>>>>>>>> migration across faults" patch. This would address the following
>>>>>>>>>>>>> scenario:
>>>>>>>>>>>>>
>>>>>>>>>>>>> irqentry_enter on kernel context:
>>>>>>>>>>>>>     stall_inband();
>>>>>>>>>>>>>     trap_handler();
>>>>>>>>>>>>>         hard_local_irq_enable()
>>>>>>>>>>>>>         hard_local_irq_disable()
>>>>>>>>>>>>>     if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>>>>>>>>>>>>         state.stage_info == IRQENTRY_OOB) {
>>>>>>>>>>>>>         unstall_inband();
>>>>>>>>>>>>>         synchronize_pipeline();
>>>>>>>>>>>>
>>>>>>>>>>>> Oh, unstall_inband does not synchronize? Because of
>>>>>>>>>>>> hard_local_irq_disable or in general?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ah, found it - unstall != inband_irq_enable. Why not use the latter
>>>>>>>>>>> directly?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ...because it's instrumented to detect hard_irqs_disabled.
>>>>>>>>>>
>>>>>>>>>> Can we safely sync at that point?
>>>>>>>>>
>>>>>>>>> Yes, we may do this, because we restore the original -unstalled- state of
>>>>>>>>> the in-band stage. So if we have pending IRQs, we may play them at this
>>>>>>>>> point. What we may not do, obviously, is create synchronization points
>>>>>>>>> where the kernel does not expect irqs.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Again: When will any reschedules that might be requested by those
>>>>>>>> replayed IRQs be handled then?
>>>>>>>>
>>>>>>>
>>>>>>> Nowhere, unless we provide a log synchronization routine dedicated to the
>>>>>>> in-band kernel exit path, which would fold in the rescheduling call via
>>>>>>> irqentry_exit_cond_resched(). I'll write that one.
>>>>>>>
>>>>>>
>>>>>> How about simply moving the replay up, before the kernel checks for
>>>>>> pending reschedules anyway? We can stall inband then again, to please
>>>>>> lockdep & Co., as long as there is no hard-irq-on section before the
>>>>>> intret.
>>>>>>
>>>>>
>>>>> Yes, this is the right way to do this. I'm on it.
>>>>
>>>> Please test this (heavily) with cobalt. This survived a 2-hour-long kvm
>>>> test here, running loops of the evl test suite + hackbench over 64 vcpus,
>>>> so far so good. No reason for any difference in behavior between cobalt
>>>> and evl in this area, this is merely a dovetail business.
>>>>
>>>> commit 30808d958277a525baad33f3c68ccb7910dc3176 (HEAD -> rebase/v5.10.y-evl)
>>>> Author: Philippe Gerum <[email protected]>
>>>>
>>>>     genirq: irq_pipeline: synchronize log on irq exit to kernel
>>>>
>>>>     We must make sure to play any IRQ which might be pending in the
>>>>     in-band log before leaving an interrupt frame for a preempted kernel
>>>>     context.
>>>>
>>>>     This completes "irq_pipeline: Account for stage migration across
>>>>     faults", so that we synchronize the log once the in-band stage is
>>>>     unstalled. In addition, we also care to do this before
>>>>     preempt_schedule_irq() runs, so that we won't miss any rescheduling
>>>>     request which might have been triggered by some IRQ we just played.
>>>>
>>>>     Signed-off-by: Philippe Gerum <[email protected]>
>>>>     Suggested-by: Jan Kiszka <[email protected]>
>>>>
>>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>>> index 4e81c0c03e5726a..99512307537561b 100644
>>>> --- a/kernel/entry/common.c
>>>> +++ b/kernel/entry/common.c
>>>> @@ -477,8 +477,57 @@ bool irqexit_may_preempt_schedule(irqentry_state_t state,
>>>>  
>>>>  #endif
>>>>  
>>>> +#ifdef CONFIG_IRQ_PIPELINE
>>>> +
>>>> +static bool irqentry_syncstage(irqentry_state_t state) /* hard irqs off */
>>>> +{
>>>> +	/*
>>>> +	 * If pipelining interrupts, enable in-band IRQs then
>>>> +	 * synchronize the interrupt log on exit if:
>>>> +	 *
>>>> +	 * - irqentry_enter() stalled the stage in order to mirror the
>>>> +	 * hardware state.
>>>> +	 *
>>>> +	 * - we where coming from oob, thus went through a stage migration
>>>> +	 * that was caused by taking a CPU exception, e.g., a fault.
>>>> +	 *
>>>> +	 * We run before preempt_schedule_irq() may be called later on
>>>> +	 * by preemptible kernels, so that any rescheduling request
>>>> +	 * triggered by in-band IRQ handlers is considered.
>>>> +	 */
>>>> +	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>>> +	    state.stage_info == IRQENTRY_OOB) {
>>>> +		unstall_inband_nocheck();
>>>> +		synchronize_pipeline_on_irq();
>>>> +		stall_inband_nocheck();
>>>> +		return true;
>>>> +	}
>>>> +
>>>> +	return false;
>>>> +}
>>>> +
>>>> +static void irqentry_unstall(void)
>>>> +{
>>>> +	unstall_inband_nocheck();
>>>> +}
>>>> +
>>>> +#else
>>>> +
>>>> +static bool irqentry_syncstage(irqentry_state_t state)
>>>> +{
>>>> +	return false;
>>>> +}
>>>> +
>>>> +static void irqentry_unstall(void)
>>>> +{
>>>> +}
>>>> +
>>>> +#endif
>>>> +
>>>>  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>>  {
>>>> +	bool synchronized = false;
>>>> +
>>>>  	if (running_oob())
>>>>  		return;
>>>>  
>>>> @@ -488,7 +537,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>>  	if (user_mode(regs)) {
>>>>  		irqentry_exit_to_user_mode(regs);
>>>>  		return;
>>>> -	} else if (irqexit_may_preempt_schedule(state, regs)) {
>>>> +	}
>>>> +
>>>> +	synchronized = irqentry_syncstage(state);
>>>> +
>>>> +	if (irqexit_may_preempt_schedule(state, regs)) {
>>>>  		/*
>>>>  		 * If RCU was not watching on entry this needs to be done
>>>>  		 * carefully and needs the same ordering of lockdep/tracing
>>>> @@ -521,17 +574,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>>  	}
>>>>  
>>>>  out:
>>>> -#ifdef CONFIG_IRQ_PIPELINE
>>>> -	/*
>>>> -	 * If pipelining interrupts, clear the in-band stall bit if
>>>> -	 * irqentry_enter() raised it in order to mirror the hardware
>>>> -	 * state. Also clear it when we where coming from oob, thus went
>>>> -	 * through a migration that was caused by taking, e.g., a fault.
>>>> -	 */
>>>> -	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>>> -	    state.stage_info == IRQENTRY_OOB)
>>>> -		unstall_inband();
>>>> -#endif
>>>> +	if (synchronized)
>>>> +		irqentry_unstall();
>>>> +
>>>>  	return;
>>>>  }
>>>>
>>>
>>> I've put that into a staging/v5.10.y-dovetail branch, pointed
>>> xenomai-images to that and started a run in our lab. That should give us
>>> some confidence that things work under cobalt across all archs, and we
>>> can make that commit official.
>>>
>>> However, heavy stressing would require triggering migration-on-fault in
>>> tighter loops while running something like stress-ng in the background.
>>> Maybe it would make sense to add such a pattern to the switchtest (it would
>>> also require a way to silence the related kernel warnings during
>>> runtime, not only build-time)?
>>>
>>
>> switchtest aims at exercising the core scheduling logic, including the
>> transition between execution stages, so migration-on-fault would fit
>> there indeed.
>>
>
> Do we need this testcase first? My feeling from review and testing is
> that it is safe to move forward with the patch already.
>

Looks ok here too. The patch is fairly straightforward, and it has been
tested on the EVL side of the universe on a handful of armv7 and armv8
machines without any glitch. I'll merge this patch shortly.

-- 
Philippe.
