Jan Kiszka <[email protected]> writes:
> On 08.06.21 19:39, Philippe Gerum wrote:
>>
>> Philippe Gerum <[email protected]> writes:
>>
>>> Jan Kiszka <[email protected]> writes:
>>>
>>>> On 07.06.21 18:19, Philippe Gerum wrote:
>>>>>
>>>>> Jan Kiszka <[email protected]> writes:
>>>>>
>>>>>> On 07.06.21 17:04, Philippe Gerum wrote:
>>>>>>>
>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>
>>>>>>>> On 07.06.21 15:48, Jan Kiszka wrote:
>>>>>>>>> On 07.06.21 15:44, Jan Kiszka wrote:
>>>>>>>>>> On 07.06.21 15:03, Philippe Gerum wrote:
>>>>>>>>>>>
>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>
>>>>>>>>>>>> On 07.06.21 09:17, Philippe Gerum wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 06.06.21 16:43, Philippe Gerum via Xenomai wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Jan Kiszka <[email protected]>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The correct order is hard IRQs off first, then stalling inband, see also
>>>>>>>>>>>>>>>> I-pipe. Otherwise, we risk taking an interrupt for the inband stage but
>>>>>>>>>>>>>>>> not injecting it before returning to user space.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Jan Kiszka <[email protected]>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fixes the RCU stalls Florian reported, at least for me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm afraid this wasn't the last regression of translating I-pipe into
>>>>>>>>>>>>>>>> dovetail..
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  include/linux/entry-common.h | 3 ++-
>>>>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>>>>>>>>>>>>>>>> index 0fb45b2d6094..00540110985e 100644
>>>>>>>>>>>>>>>> --- a/include/linux/entry-common.h
>>>>>>>>>>>>>>>> +++ b/include/linux/entry-common.h
>>>>>>>>>>>>>>>> @@ -203,7 +203,8 @@ static inline void local_irq_disable_exit_to_user(void);
>>>>>>>>>>>>>>>>  #ifndef local_irq_disable_exit_to_user
>>>>>>>>>>>>>>>>  static inline void local_irq_disable_exit_to_user(void)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>> -	local_irq_disable_full();
>>>>>>>>>>>>>>>> +	hard_cond_local_irq_disable();
>>>>>>>>>>>>>>>> +	local_irq_disable();
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>  #endif
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Good spot, a real issue here, just like the one the CPU_IDLE code deals
>>>>>>>>>>>>>>> with. The fix looks good, but maybe we could do even better.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The way local_irq_disable_full() works tends to introduce this issue in
>>>>>>>>>>>>>>> a sneaky way, when the code path does not synchronize the interrupt log
>>>>>>>>>>>>>>> (exit_to_user_mode_loop may be at fault in this case, which does not
>>>>>>>>>>>>>>> traverse synchronize_pipeline()). Lack of synchronization would still
>>>>>>>>>>>>>>> hit us if some trap handler turns hard irqs off -> on -> off the same
>>>>>>>>>>>>>>> way, and we don't leave through the user exit path.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Inverting the order of the actions in local_irq_disable_full() may be an
>>>>>>>>>>>>>>> option (inband_irq_disable would allow that); making sure we cannot be
>>>>>>>>>>>>>>> caught by the same issue when returning to kernel mode is another
>>>>>>>>>>>>>>> way.
>>>>>>>>>>>>>>> This needs more thought, I think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, always this way?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> local_irq_disable_full:
>>>>>>>>>>>>>>     hard_local_irq_disable
>>>>>>>>>>>>>>     local_irq_disable
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> local_irq_enable_full:
>>>>>>>>>>>>>>     hard_local_irq_enable
>>>>>>>>>>>>>>     local_irq_enable
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or even flip local_irq_enable_full as well?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not this one: hard irqs must be on before local_irq_enable() is issued,
>>>>>>>>>>>>> otherwise we would trigger a debug assertion. The reason for that
>>>>>>>>>>>>> assertion is that the I-pipe version of local_irq_enable() force-enables
>>>>>>>>>>>>> hard irqs, and this is something we want to depart from, because such a
>>>>>>>>>>>>> side effect is error-prone whether or not the caller expects it. For
>>>>>>>>>>>>> this reason, inband_irq_enable() expects hard irqs on, which should be
>>>>>>>>>>>>> the current hard irq state for the caller under normal circumstances.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Was there ever code that required ordering local_irq_disable_full the
>>>>>>>>>>>>>> other way around?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> After review, I don't think so. The current order for
>>>>>>>>>>>>> local_irq_disable_full() was rather determined by applying the reverse
>>>>>>>>>>>>> sequence compared to local_irq_enable_full(). So this looks good. I need
>>>>>>>>>>>>> to test this on the armv[7/8] side here first before merging.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a commit somewhere that I can drop already?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 1.
>>>>>>>>>>> These are fine with me in their latest iteration:
>>>>>>>>>>>
>>>>>>>>>>>   irq_pipeline: Clean up stage_info field and users
>>>>>>>>>>>   irq_pipeline: Account for stage migration across faults
>>>>>>>>>>>   irq_pipeline: Warn when calling irqentry_enter with oob stalled
>>>>>>>>>>>   x86: dovetail: Fix TS flag reservation
>>>>>>>>>>>
>>>>>>>>>>> 2. This one should be replaced by a fix in local_irq_disable_full(),
>>>>>>>>>>> pending some tests ongoing here:
>>>>>>>>>>>
>>>>>>>>>>>   irq_pipeline: Prevent returning to user space with pending inband IRQs
>>>>>>>>>>>
>>>>>>>>>>> However, as I mentioned earlier, this may not be sufficient per se; we
>>>>>>>>>>> may have to synchronize the in-band log when unstalling the stage on the
>>>>>>>>>>> kernel exit path, i.e. the exit point fixed up by the "Account for stage
>>>>>>>>>>> migration across faults" patch. This would address the following
>>>>>>>>>>> scenario:
>>>>>>>>>>>
>>>>>>>>>>> irqentry_enter on kernel context:
>>>>>>>>>>>     stall_inband();
>>>>>>>>>>>     trap_handler();
>>>>>>>>>>>         hard_local_irq_enable()
>>>>>>>>>>>         hard_local_irq_disable()
>>>>>>>>>>>     if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>>>>>>>>>>         state.stage_info == IRQENTRY_OOB) {
>>>>>>>>>>>             unstall_inband();
>>>>>>>>>>>             synchronize_pipeline();
>>>>>>>>>>
>>>>>>>>>> Oh, unstall_inband does not synchronize? Because of
>>>>>>>>>> hard_local_irq_disable or in general?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ah, found it - unstall != inband_irq_enable. Why not use the latter
>>>>>>>>> directly?
>>>>>>>>>
>>>>>>>>
>>>>>>>> ...because it's instrumented to detect hard_irqs_disabled.
>>>>>>>>
>>>>>>>> Can we safely sync at that point?
>>>>>>>
>>>>>>> Yes, we may do this, because we restore the original -unstalled- state of
>>>>>>> the in-band stage. So if we have pending IRQs, we may play them at this
>>>>>>> point.
>>>>>>> What we may not do, obviously, is create synchronization points
>>>>>>> where the kernel does not expect irqs.
>>>>>>>
>>>>>>
>>>>>> Again: When will any reschedules that might be requested by those
>>>>>> replayed IRQs be handled then?
>>>>>>
>>>>>
>>>>> Nowhere, unless we provide a log synchronization routine dedicated to the
>>>>> in-band kernel exit path, which would fold in the rescheduling call via
>>>>> irqentry_exit_cond_resched(). I'll write that one.
>>>>>
>>>>
>>>> How about simply moving the replay up, before the kernel checks for
>>>> pending reschedules anyway? We can stall inband then again, to please
>>>> lockdep & Co., as long as there is no hard-irq-on section before the
>>>> intret.
>>>>
>>>
>>> Yes, this is the right way to do this. I'm on it.
>>
>> Please test this (heavily) with cobalt. This survived a 2-hour-long kvm
>> test here, running loops of the evl test suite + hackbench over 64 vcpus,
>> so far so good. No reason for any difference in behavior between cobalt
>> and evl in this area, this is merely dovetail business.
>>
>> commit 30808d958277a525baad33f3c68ccb7910dc3176 (HEAD -> rebase/v5.10.y-evl)
>> Author: Philippe Gerum <[email protected]>
>>
>>     genirq: irq_pipeline: synchronize log on irq exit to kernel
>>
>>     We must make sure to play any IRQ which might be pending in the
>>     in-band log before leaving an interrupt frame for a preempted kernel
>>     context.
>>
>>     This completes "irq_pipeline: Account for stage migration across
>>     faults", so that we synchronize the log once the in-band stage is
>>     unstalled. In addition, we also care to do this before
>>     preempt_schedule_irq() runs, so that we won't miss any rescheduling
>>     request which might have been triggered by some IRQ we just played.
>>
>> Signed-off-by: Philippe Gerum <[email protected]>
>> Suggested-by: Jan Kiszka <[email protected]>
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index 4e81c0c03e5726a..99512307537561b 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -477,8 +477,57 @@ bool irqexit_may_preempt_schedule(irqentry_state_t state,
>>
>>  #endif
>>
>> +#ifdef CONFIG_IRQ_PIPELINE
>> +
>> +static bool irqentry_syncstage(irqentry_state_t state) /* hard irqs off */
>> +{
>> +	/*
>> +	 * If pipelining interrupts, enable in-band IRQs then
>> +	 * synchronize the interrupt log on exit if:
>> +	 *
>> +	 * - irqentry_enter() stalled the stage in order to mirror the
>> +	 * hardware state.
>> +	 *
>> +	 * - we where coming from oob, thus went through a stage migration
>> +	 * that was caused by taking a CPU exception, e.g., a fault.
>> +	 *
>> +	 * We run before preempt_schedule_irq() may be called later on
>> +	 * by preemptible kernels, so that any rescheduling request
>> +	 * triggered by in-band IRQ handlers is considered.
>> +	 */
>> +	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>> +	    state.stage_info == IRQENTRY_OOB) {
>> +		unstall_inband_nocheck();
>> +		synchronize_pipeline_on_irq();
>> +		stall_inband_nocheck();
>> +		return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>> +static void irqentry_unstall(void)
>> +{
>> +	unstall_inband_nocheck();
>> +}
>> +
>> +#else
>> +
>> +static bool irqentry_syncstage(irqentry_state_t state)
>> +{
>> +	return false;
>> +}
>> +
>> +static void irqentry_unstall(void)
>> +{
>> +}
>> +
>> +#endif
>> +
>>  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>  {
>> +	bool synchronized = false;
>> +
>>  	if (running_oob())
>>  		return;
>>
>> @@ -488,7 +537,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>  	if (user_mode(regs)) {
>>  		irqentry_exit_to_user_mode(regs);
>>  		return;
>> -	} else if (irqexit_may_preempt_schedule(state, regs)) {
>> +	}
>> +
>> +	synchronized = irqentry_syncstage(state);
>> +
>> +	if (irqexit_may_preempt_schedule(state, regs)) {
>>  		/*
>>  		 * If RCU was not watching on entry this needs to be done
>>  		 * carefully and needs the same ordering of lockdep/tracing
>> @@ -521,17 +574,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>  	}
>>
>>  out:
>> -#ifdef CONFIG_IRQ_PIPELINE
>> -	/*
>> -	 * If pipelining interrupts, clear the in-band stall bit if
>> -	 * irqentry_enter() raised it in order to mirror the hardware
>> -	 * state. Also clear it when we where coming from oob, thus went
>> -	 * through a migration that was caused by taking, e.g., a fault.
>> -	 */
>> -	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>> -	    state.stage_info == IRQENTRY_OOB)
>> -		unstall_inband();
>> -#endif
>> +	if (synchronized)
>> +		irqentry_unstall();
>> +
>> 	return;
>>  }
>>
>>
>
> I've put that into a staging/v5.10.y-dovetail branch, pointed
> xenomai-images to that and started a run in our lab.
> That should give us
> some confidence that things work under cobalt across all archs, and we
> can make that commit official.
>
> However, heavy stressing would require triggering migration-on-fault in
> tighter loops while running something like stress-ng in the background.
> Maybe it would make sense to add such a pattern to the switchtest (would
> also require a way to silence the related kernel warnings during
> runtime, not only build-time)?
>

switchtest aims at exercising the core scheduling logic, including the
transition between execution stages, so migration-on-fault would fit
there indeed.

-- 
Philippe.
