On 09.06.21 08:12, Philippe Gerum wrote:
>
> Jan Kiszka <[email protected]> writes:
>
>> On 08.06.21 19:39, Philippe Gerum wrote:
>>>
>>> Philippe Gerum <[email protected]> writes:
>>>
>>>> Jan Kiszka <[email protected]> writes:
>>>>
>>>>> On 07.06.21 18:19, Philippe Gerum wrote:
>>>>>>
>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>
>>>>>>> On 07.06.21 17:04, Philippe Gerum wrote:
>>>>>>>>
>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>
>>>>>>>>> On 07.06.21 15:48, Jan Kiszka wrote:
>>>>>>>>>> On 07.06.21 15:44, Jan Kiszka wrote:
>>>>>>>>>>> On 07.06.21 15:03, Philippe Gerum wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 07.06.21 09:17, Philippe Gerum wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06.06.21 16:43, Philippe Gerum via Xenomai wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> From: Jan Kiszka <[email protected]>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The correct order is hard IRQs off first, then stalling inband; see also
>>>>>>>>>>>>>>>>> I-pipe. Otherwise, we risk taking an interrupt for the inband stage but
>>>>>>>>>>>>>>>>> not injecting it before returning to user space.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Jan Kiszka <[email protected]>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Fixes the RCU stalls Florian reported, at least for me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm afraid this wasn't the last regression of translating I-pipe into
>>>>>>>>>>>>>>>>> dovetail..
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  include/linux/entry-common.h | 3 ++-
>>>>>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>>>>>>>>>>>>>>>>> index 0fb45b2d6094..00540110985e 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/entry-common.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/entry-common.h
>>>>>>>>>>>>>>>>> @@ -203,7 +203,8 @@ static inline void local_irq_disable_exit_to_user(void);
>>>>>>>>>>>>>>>>>  #ifndef local_irq_disable_exit_to_user
>>>>>>>>>>>>>>>>>  static inline void local_irq_disable_exit_to_user(void)
>>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>> -	local_irq_disable_full();
>>>>>>>>>>>>>>>>> +	hard_cond_local_irq_disable();
>>>>>>>>>>>>>>>>> +	local_irq_disable();
>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>  #endif
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Good spot, a real issue here, just like the one the CPU_IDLE code deals
>>>>>>>>>>>>>>>> with. The fix looks good, but maybe we could do even better.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The way local_irq_disable_full() works tends to introduce this issue in
>>>>>>>>>>>>>>>> a sneaky way, when the code path does not synchronize the interrupt log
>>>>>>>>>>>>>>>> (exit_to_user_mode_loop may be at fault in this case, which does not
>>>>>>>>>>>>>>>> traverse synchronize_pipeline()). Lack of synchronization would still
>>>>>>>>>>>>>>>> hit us if some trap handler turns hard irqs off -> on -> off the same
>>>>>>>>>>>>>>>> way, and we don't leave through the user exit path.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Inverting the order of the actions in local_irq_disable_full() may be an
>>>>>>>>>>>>>>>> option (inband_irq_disable would allow that); making sure we cannot be
>>>>>>>>>>>>>>>> caught in the same issue when returning to kernel mode is another
>>>>>>>>>>>>>>>> way. This needs more thought, I think.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, always this way?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> local_irq_disable_full:
>>>>>>>>>>>>>>>   hard_local_irq_disable
>>>>>>>>>>>>>>>   local_irq_disable
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> local_irq_enable_full:
>>>>>>>>>>>>>>>   hard_local_irq_enable
>>>>>>>>>>>>>>>   local_irq_enable
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or even flip local_irq_enable_full as well?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not this one: hard irqs must be on before local_irq_enable() is issued,
>>>>>>>>>>>>>> otherwise we would trigger a debug assertion. The reason for such an
>>>>>>>>>>>>>> assertion is that the I-pipe version of local_irq_enable() force-enables
>>>>>>>>>>>>>> hard irqs, and this is something we want to depart from, because
>>>>>>>>>>>>>> whether or not the caller expects such a side effect is error-prone. For
>>>>>>>>>>>>>> this reason, inband_irq_enable() expects hard irqs on, which should be
>>>>>>>>>>>>>> the current hard irq state for the caller under normal circumstances.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Was there ever code that required ordering local_irq_disable_full the
>>>>>>>>>>>>>>> other way around?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After review, I don't think so. The current order for
>>>>>>>>>>>>>> local_irq_disable_full() was rather determined by applying a reverse
>>>>>>>>>>>>>> sequence compared to local_irq_enable_full(). So this looks good.
>>>>>>>>>>>>>> I need to test this on the armv[7/8] side here first before merging.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a commit somewhere that I can drop already?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 1. These are fine with me in their latest iteration:
>>>>>>>>>>>>
>>>>>>>>>>>>    irq_pipeline: Clean up stage_info field and users
>>>>>>>>>>>>    irq_pipeline: Account for stage migration across faults
>>>>>>>>>>>>    irq_pipeline: Warn when calling irqentry_enter with oob stalled
>>>>>>>>>>>>    x86: dovetail: Fix TS flag reservation
>>>>>>>>>>>>
>>>>>>>>>>>> 2. This one should be replaced by a fix in local_irq_disable_full(),
>>>>>>>>>>>> pending some tests ongoing here:
>>>>>>>>>>>>
>>>>>>>>>>>>    irq_pipeline: Prevent returning to user space with pending inband IRQs
>>>>>>>>>>>>
>>>>>>>>>>>> However, as I mentioned earlier, this may not be sufficient per se; we
>>>>>>>>>>>> may have to synchronize the in-band log when unstalling the stage on the
>>>>>>>>>>>> kernel exit path, i.e. the exit point fixed up by the "Account for stage
>>>>>>>>>>>> migration across faults" patch. This would address the following
>>>>>>>>>>>> scenario:
>>>>>>>>>>>>
>>>>>>>>>>>> irqentry_enter on kernel context:
>>>>>>>>>>>>     stall_inband();
>>>>>>>>>>>>     trap_handler();
>>>>>>>>>>>>         hard_local_irq_enable()
>>>>>>>>>>>>         hard_local_irq_disable()
>>>>>>>>>>>>     if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>>>>>>>>>>>         state.stage_info == IRQENTRY_OOB) {
>>>>>>>>>>>>             unstall_inband();
>>>>>>>>>>>>             synchronize_pipeline();
>>>>>>>>>>>
>>>>>>>>>>> Oh, unstall_inband does not synchronize? Because of
>>>>>>>>>>> hard_local_irq_disable or in general?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ah, found it - unstall != inband_irq_enable. Why not use the latter
>>>>>>>>>> directly?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ...because it's instrumented to detect hard_irqs_disabled.
>>>>>>>>>
>>>>>>>>> Can we safely sync at that point?
>>>>>>>>
>>>>>>>> Yes, we may do this, because we restore the original -unstalled- state of
>>>>>>>> the in-band stage. So if we have pending IRQs, we may play them at this
>>>>>>>> point. What we may not do, obviously, is create synchronization points
>>>>>>>> where the kernel does not expect irqs.
>>>>>>>>
>>>>>>>
>>>>>>> Again: When will any reschedules that might be requested by those
>>>>>>> replayed IRQs be handled then?
>>>>>>>
>>>>>>
>>>>>> Nowhere, unless we provide a log synchronization routine dedicated to the
>>>>>> in-band kernel exit path, which would fold in the rescheduling call via
>>>>>> irqentry_exit_cond_resched(). I'll write that one.
>>>>>>
>>>>>
>>>>> How about simply moving the replay up, before the kernel checks for
>>>>> pending reschedules anyway? We can stall inband then again, to please
>>>>> lockdep & Co., as long as there is no hard-irq-on section before the
>>>>> intret.
>>>>>
>>>>
>>>> Yes, this is the right way to do this. I'm on it.
>>>
>>> Please test this (heavily) with cobalt. This survived a 2-hour-long kvm
>>> test here, running loops of the evl test suite + hackbench over 64 vcpus,
>>> so far so good. There is no reason for any difference in behavior between
>>> cobalt and evl in this area; this is merely dovetail business.
>>>
>>> commit 30808d958277a525baad33f3c68ccb7910dc3176 (HEAD -> rebase/v5.10.y-evl)
>>> Author: Philippe Gerum <[email protected]>
>>>
>>>     genirq: irq_pipeline: synchronize log on irq exit to kernel
>>>
>>>     We must make sure to play any IRQ which might be pending in the
>>>     in-band log before leaving an interrupt frame for a preempted kernel
>>>     context.
>>>
>>>     This completes "irq_pipeline: Account for stage migration across
>>>     faults", so that we synchronize the log once the in-band stage is
>>>     unstalled.
>>>     In addition, we also care to do this before
>>>     preempt_schedule_irq() runs, so that we won't miss any rescheduling
>>>     request which might have been triggered by some IRQ we just played.
>>>
>>>     Signed-off-by: Philippe Gerum <[email protected]>
>>>     Suggested-by: Jan Kiszka <[email protected]>
>>>
>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>>> index 4e81c0c03e5726a..99512307537561b 100644
>>> --- a/kernel/entry/common.c
>>> +++ b/kernel/entry/common.c
>>> @@ -477,8 +477,57 @@ bool irqexit_may_preempt_schedule(irqentry_state_t state,
>>>  
>>>  #endif
>>>  
>>> +#ifdef CONFIG_IRQ_PIPELINE
>>> +
>>> +static bool irqentry_syncstage(irqentry_state_t state) /* hard irqs off */
>>> +{
>>> +	/*
>>> +	 * If pipelining interrupts, enable in-band IRQs then
>>> +	 * synchronize the interrupt log on exit if:
>>> +	 *
>>> +	 * - irqentry_enter() stalled the stage in order to mirror the
>>> +	 * hardware state.
>>> +	 *
>>> +	 * - we were coming from oob, thus went through a stage migration
>>> +	 * that was caused by taking a CPU exception, e.g., a fault.
>>> +	 *
>>> +	 * We run before preempt_schedule_irq() may be called later on
>>> +	 * by preemptible kernels, so that any rescheduling request
>>> +	 * triggered by in-band IRQ handlers is considered.
>>> +	 */
>>> +	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>> +	    state.stage_info == IRQENTRY_OOB) {
>>> +		unstall_inband_nocheck();
>>> +		synchronize_pipeline_on_irq();
>>> +		stall_inband_nocheck();
>>> +		return true;
>>> +	}
>>> +
>>> +	return false;
>>> +}
>>> +
>>> +static void irqentry_unstall(void)
>>> +{
>>> +	unstall_inband_nocheck();
>>> +}
>>> +
>>> +#else
>>> +
>>> +static bool irqentry_syncstage(irqentry_state_t state)
>>> +{
>>> +	return false;
>>> +}
>>> +
>>> +static void irqentry_unstall(void)
>>> +{
>>> +}
>>> +
>>> +#endif
>>> +
>>>  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>  {
>>> +	bool synchronized = false;
>>> +
>>>  	if (running_oob())
>>>  		return;
>>>  
>>> @@ -488,7 +537,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>  	if (user_mode(regs)) {
>>>  		irqentry_exit_to_user_mode(regs);
>>>  		return;
>>> -	} else if (irqexit_may_preempt_schedule(state, regs)) {
>>> +	}
>>> +
>>> +	synchronized = irqentry_syncstage(state);
>>> +
>>> +	if (irqexit_may_preempt_schedule(state, regs)) {
>>>  		/*
>>>  		 * If RCU was not watching on entry this needs to be done
>>>  		 * carefully and needs the same ordering of lockdep/tracing
>>> @@ -521,17 +574,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>>>  	}
>>>  
>>>  out:
>>> -#ifdef CONFIG_IRQ_PIPELINE
>>> -	/*
>>> -	 * If pipelining interrupts, clear the in-band stall bit if
>>> -	 * irqentry_enter() raised it in order to mirror the hardware
>>> -	 * state. Also clear it when we were coming from oob, thus went
>>> -	 * through a migration that was caused by taking, e.g., a fault.
>>> -	 */
>>> -	if (state.stage_info == IRQENTRY_INBAND_UNSTALLED ||
>>> -	    state.stage_info == IRQENTRY_OOB)
>>> -		unstall_inband();
>>> -#endif
>>> +	if (synchronized)
>>> +		irqentry_unstall();
>>> +
>>>  	return;
>>>  }
>>>
>>
>> I've put that into a staging/v5.10.y-dovetail branch, pointed
>> xenomai-images to it, and started a run in our lab. That should give us
>> some confidence that things work under cobalt across all archs, and we
>> can make that commit official.
>>
>> However, heavy stressing would require triggering migration-on-fault in
>> tighter loops while running something like stress-ng in the background.
>> Maybe it would make sense to add such a pattern to the switchtest (it would
>> also require a way to silence the related kernel warnings at runtime,
>> not only at build time)?
>>
>
> switchtest aims at exercising the core scheduling logic, including the
> transition between execution stages, so migration-on-fault would fit
> there indeed.
>
Do we need this testcase first? My feeling from review and testing is
that it is safe to move forward with the patch already.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
