Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-02 Thread Sebastian Andrzej Siewior
On 2025-12-02 13:39:56 [+0100], Jan Kiszka wrote:
> On 02.12.25 09:24, Sebastian Andrzej Siewior wrote:
> > On 2025-12-01 22:51:50 [+0100], Jan Kiszka wrote:
> >> How should we handle this? Consider 6.12 mainline with -rt and cgroups
> >> as potentially broken, asking people to use 6.12-rt? Or port this back?
> > 
> > If you have everything in v6.12 for a usable RT system and this is the
> > only missing piece I could ask the stable team nicely to backport this.
> > 
> 
> Given that this is a fix for potential lock-up... Does it have
> dependencies? The other two patches in this series are optimizations
> only, or should they better join the backport?

They are just optimisations if I am not mistaken, but this patch will
conflict if they are not backported as well.

> Jan

Sebastian



Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-02 Thread Jan Kiszka
On 02.12.25 09:24, Sebastian Andrzej Siewior wrote:
> On 2025-12-01 22:51:50 [+0100], Jan Kiszka wrote:
>> How should we handle this? Consider 6.12 mainline with -rt and cgroups
>> as potentially broken, asking people to use 6.12-rt? Or port this back?
> 
> If you have everything in v6.12 for a usable RT system and this is the
> only missing piece I could ask the stable team nicely to backport this.
> 

Given that this is a fix for potential lock-up... Does it have
dependencies? The other two patches in this series are optimizations
only, or should they better join the backport?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center



Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-02 Thread Sebastian Andrzej Siewior
On 2025-12-01 22:51:50 [+0100], Jan Kiszka wrote:
> How should we handle this? Consider 6.12 mainline with -rt and cgroups
> as potentially broken, asking people to use 6.12-rt? Or port this back?

If you have everything in v6.12 for a usable RT system and this is the
only missing piece I could ask the stable team nicely to backport this.

> Jan
> 
Sebastian



Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-02 Thread Sebastian Andrzej Siewior
On 2025-12-02 07:57:04 [+0100], Jan Kiszka wrote:
> So, we have this right now in the 6.12-rt and 6.1-rt trees. The patch
> (b12e35832805) that enabled the lockup above was added to 4.18. In
> 4.19-rt (still under maintenance via 4.19-cip-rt), we had ktimersoftd -
> was that addressing the issue as well? Or could timers have expired back
> then also outside of that thread, thus potentially without SCHED_FIFO prio?

Looking at v4.19.25-rt16, there is ktimersoftd/ via "softirq: split
timer softirqs out of ksoftirqd". This later became ktimers/.
So yes, the timer should be expired with a SCHED_FIFO priority.

You are most likely aware of "Defer throttle when task exits to user"
https://lore.kernel.org/all/[email protected]/

> Thanks,
> Jan

Sebastian
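
The statement above, that the timer softirqs are expired by a SCHED_FIFO
thread, can be checked on a running system. The following is a minimal
sketch and not part of the patch: the helper name report_timer_threads is
hypothetical, and it assumes the thread names ktimers/N and ktimersoftd/N
mentioned above as well as the procps ps columns cls, rtprio and comm.

```shell
#!/bin/sh
# Report the scheduling class of the per-CPU timer softirq threads.
# Reads "CLS RTPRIO COMM" lines (the layout of `ps -eo cls,rtprio,comm`)
# on stdin and prints one line per ktimers/ or ktimersoftd/ thread.
# report_timer_threads is an illustrative name, not an existing tool.
report_timer_threads() {
    awk '$3 ~ /^ktimers(oftd)?\// {
        # ps prints "FF" in the CLS column for SCHED_FIFO tasks.
        verdict = ($1 == "FF") ? "SCHED_FIFO" : "not SCHED_FIFO"
        printf "%s: class=%s rtprio=%s (%s)\n", $3, $1, $2, verdict
        found = 1
    }
    END { if (!found) print "no ktimers/ktimersoftd threads found" }'
}

# Usage on a live system:
#   ps -eo cls,rtprio,comm | report_timer_threads
```

If no such thread is listed, timer softirqs are presumably still handled by
ksoftirqd (SCHED_OTHER) on that kernel, which is the situation the patch
description calls out.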



Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-01 Thread Jan Kiszka
On 02.12.25 00:49, Luis Claudio R. Goncalves wrote:
> On Mon, Dec 01, 2025 at 10:51:50PM +0100, Jan Kiszka wrote:
>> On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
>>> A timer/ hrtimer softirq is raised in-IRQ context. With threaded
>>> interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
>>> for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
>>> means it will compete with other tasks for CPU resources.
>>> This can introduce long delays for timer processing on heavily loaded
>>> systems and is not desired.
>>>
>>> Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
>>> timers thread and let it run at the lowest SCHED_FIFO priority.
>>> Wake-ups for RT tasks happen from hardirq context so only timer_list timers
>>> and hrtimers for "regular" tasks are processed here. The higher priority
>>> ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
>>>
>>> Using a dedicated variable to store the pending softirq bit values
>>> ensures that the timers are not accidentally picked up by ksoftirqd and
>>> other threaded interrupts.
>>> It shouldn't be picked up by ksoftirqd since it runs at lower priority.
>>> However if ksoftirqd is already running while a timer fires, then
>>> ksoftirqd will be PI-boosted due to the BH-lock to ktimer's priority.
>>> Ideally we try to avoid having ksoftirqd running.
>>>
>>> The timer thread can pick up pending softirqs from ksoftirqd but only
>>> if the softirq load is high. It is not desired that the picked up
>>> softirqs are processed at SCHED_FIFO priority under high softirq load
>>> but this can already happen by a PI-boost by a force-threaded interrupt.
>>>
>>> [ [email protected]: rcutorture.c fixes, storm fix by introduction of
>>>   local_timers_pending() for tick_nohz_next_event() ]
>>>
>>> [ [email protected]: Ensure ktimersd gets woken up even if a
>>>   softirq is currently served. ]
>>>
>>> Reviewed-by: Paul E. McKenney  [rcutorture]
>>> Reviewed-by: Frederic Weisbecker 
>>> Signed-off-by: Sebastian Andrzej Siewior 
>>
>> This went into 6.13 and was never backported to 6.12-lts. And that is
>> why you can easily stall the latter with a workload like this and
>> CONFIG_PREEMPT_RT enabled:
>>
>> echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
>> echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
>>
>> mkdir /sys/fs/cgroup/stalltest.sub1
>> mkdir /sys/fs/cgroup/stalltest.sub2
>> sleep 1000 &
>> pid=$!
>>
>> systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
>> while true; do
>> echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
>> echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
>> done"
>>
>> echo "1000 2" > /sys/fs/cgroup/stalltest.slice/cpu.max
>>
>> This triggers a lock-up if a holder of cgroup_file_kn_lock with
>> SCHED_OTHER is scheduled out after using up its timeslice and then
>> cgroup_file_notify_timer fires over a SCHED_OTHER context as well,
>> trying to get this lock, failing and then never being able to reactivate
>> the lock holder again as well.
>>
>> I've nicely reproduced this with upstream 6.12.58 while Debian's latest
>> 6.12-rt does not trigger because it additionally has the downstream -rt
>> patches on board.
>>
>> How should we handle this? Consider 6.12 mainline with -rt and cgroups
>> as potentially broken, asking people to use 6.12-rt? Or port this back?
>>
>> BTW, the original report of this issue came from an older
>> 5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
>> feature introduced to the -rt patches? Was it ever backported to 5.10-rt
>> or other -rt versions?
> 
> Hi Jan!
> 
> I failed to locate the original discussion (from v5.10-rt) as the V1 of this
> patchset is a new thread. Anyway, you are correct, the commit below (and the
> other two changes from the series) are not present in v5.10-rt.
> 
> AFAICT commit 49a17639508c ("softirq: Use a dedicated thread for timer wakeups
> on PREEMPT_RT.") was merged initially to v6.13-rc1, it was never exclusive
> to the RT tree.

So, we have this right now in the 6.12-rt and 6.1-rt trees. The patch
(b12e35832805) that enabled the lockup above was added to 4.18. In
4.19-rt (still under maintenance via 4.19-cip-rt), we had ktimersoftd -
was that addressing the issue as well? Or could timers have expired back
then also outside of that thread, thus potentially without SCHED_FIFO prio?

Thanks,
Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center



Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-01 Thread Luis Claudio R. Goncalves
On Mon, Dec 01, 2025 at 10:51:50PM +0100, Jan Kiszka wrote:
> On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
> > A timer/ hrtimer softirq is raised in-IRQ context. With threaded
> > interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
> > for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
> > means it will compete with other tasks for CPU resources.
> > This can introduce long delays for timer processing on heavily loaded
> > systems and is not desired.
> > 
> > Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
> > timers thread and let it run at the lowest SCHED_FIFO priority.
> > Wake-ups for RT tasks happen from hardirq context so only timer_list timers
> > and hrtimers for "regular" tasks are processed here. The higher priority
> > ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
> > 
> > Using a dedicated variable to store the pending softirq bit values
> > ensures that the timers are not accidentally picked up by ksoftirqd and
> > other threaded interrupts.
> > It shouldn't be picked up by ksoftirqd since it runs at lower priority.
> > However if ksoftirqd is already running while a timer fires, then
> > ksoftirqd will be PI-boosted due to the BH-lock to ktimer's priority.
> > Ideally we try to avoid having ksoftirqd running.
> > 
> > The timer thread can pick up pending softirqs from ksoftirqd but only
> > if the softirq load is high. It is not desired that the picked up
> > softirqs are processed at SCHED_FIFO priority under high softirq load
> > but this can already happen by a PI-boost by a force-threaded interrupt.
> > 
> > [ [email protected]: rcutorture.c fixes, storm fix by introduction of
> >   local_timers_pending() for tick_nohz_next_event() ]
> > 
> > [ [email protected]: Ensure ktimersd gets woken up even if a
> >   softirq is currently served. ]
> > 
> > Reviewed-by: Paul E. McKenney  [rcutorture]
> > Reviewed-by: Frederic Weisbecker 
> > Signed-off-by: Sebastian Andrzej Siewior 
> 
> This went into 6.13 and was never backported to 6.12-lts. And that is
> why you can easily stall the latter with a workload like this and
> CONFIG_PREEMPT_RT enabled:
> 
> echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
> echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
> 
> mkdir /sys/fs/cgroup/stalltest.sub1
> mkdir /sys/fs/cgroup/stalltest.sub2
> sleep 1000 &
> pid=$!
> 
> systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
> while true; do
> echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
> echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
> done"
> 
> echo "1000 2" > /sys/fs/cgroup/stalltest.slice/cpu.max
> 
> This triggers a lock-up if a holder of cgroup_file_kn_lock with
> SCHED_OTHER is scheduled out after using up its timeslice and then
> cgroup_file_notify_timer fires over a SCHED_OTHER context as well,
> trying to get this lock, failing and then never being able to reactivate
> the lock holder again as well.
> 
> I've nicely reproduced this with upstream 6.12.58 while Debian's latest
> 6.12-rt does not trigger because it additionally has the downstream -rt
> patches on board.
> 
> How should we handle this? Consider 6.12 mainline with -rt and cgroups
> as potentially broken, asking people to use 6.12-rt? Or port this back?
> 
> BTW, the original report of this issue came from an older
> 5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
> feature introduced to the -rt patches? Was it ever backported to 5.10-rt
> or other -rt versions?

Hi Jan!

I failed to locate the original discussion (from v5.10-rt) as the V1 of this
patchset is a new thread. Anyway, you are correct, the commit below (and the
other two changes from the series) are not present in v5.10-rt.

AFAICT commit 49a17639508c ("softirq: Use a dedicated thread for timer wakeups
on PREEMPT_RT.") was merged initially to v6.13-rc1, it was never exclusive
to the RT tree.

Luis

> Jan
> 
> -- 
> Siemens AG, Foundational Technologies
> Linux Expert Center
> 
---end quoted text---




Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.

2025-12-01 Thread Jan Kiszka
On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
> A timer/ hrtimer softirq is raised in-IRQ context. With threaded
> interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
> for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
> means it will compete with other tasks for CPU resources.
> This can introduce long delays for timer processing on heavily loaded
> systems and is not desired.
> 
> Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
> timers thread and let it run at the lowest SCHED_FIFO priority.
> Wake-ups for RT tasks happen from hardirq context so only timer_list timers
> and hrtimers for "regular" tasks are processed here. The higher priority
> ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
> 
> Using a dedicated variable to store the pending softirq bit values
> ensures that the timers are not accidentally picked up by ksoftirqd and
> other threaded interrupts.
> It shouldn't be picked up by ksoftirqd since it runs at lower priority.
> However if ksoftirqd is already running while a timer fires, then
> ksoftirqd will be PI-boosted due to the BH-lock to ktimer's priority.
> Ideally we try to avoid having ksoftirqd running.
> 
> The timer thread can pick up pending softirqs from ksoftirqd but only
> if the softirq load is high. It is not desired that the picked up
> softirqs are processed at SCHED_FIFO priority under high softirq load
> but this can already happen by a PI-boost by a force-threaded interrupt.
> 
> [ [email protected]: rcutorture.c fixes, storm fix by introduction of
>   local_timers_pending() for tick_nohz_next_event() ]
> 
> [ [email protected]: Ensure ktimersd gets woken up even if a
>   softirq is currently served. ]
> 
> Reviewed-by: Paul E. McKenney  [rcutorture]
> Reviewed-by: Frederic Weisbecker 
> Signed-off-by: Sebastian Andrzej Siewior 

This went into 6.13 and was never backported to 6.12-lts. And that is
why you can easily stall the latter with a workload like this and
CONFIG_PREEMPT_RT enabled:

echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/stalltest.sub1
mkdir /sys/fs/cgroup/stalltest.sub2
sleep 1000 &
pid=$!

systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
while true; do
echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
done"

echo "1000 2" > /sys/fs/cgroup/stalltest.slice/cpu.max

This triggers a lock-up if a holder of cgroup_file_kn_lock with
SCHED_OTHER is scheduled out after using up its timeslice and then
cgroup_file_notify_timer fires over a SCHED_OTHER context as well,
trying to get this lock, failing and then never being able to reactivate
the lock holder again as well.

I've nicely reproduced this with upstream 6.12.58 while Debian's latest
6.12-rt does not trigger because it additionally has the downstream -rt
patches on board.

How should we handle this? Consider 6.12 mainline with -rt and cgroups
as potentially broken, asking people to use 6.12-rt? Or port this back?

BTW, the original report of this issue came from an older
5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
feature introduced to the -rt patches? Was it ever backported to 5.10-rt
or other -rt versions?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center
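
Since the change was merged for 6.13 and not backported to 6.12 stable, a
quick version check can flag kernels that may lack it. This is only a
sketch under stated assumptions: the function name
predates_mainline_ktimers is hypothetical, it looks at the version number
alone, and it cannot tell whether an -rt tree carries the change as a
downstream patch.

```shell
#!/bin/sh
# Return 0 if the given kernel release is older than 6.13, i.e. predates
# the mainline merge of the dedicated timer thread discussed above.
# NOTE: illustrative helper; an -rt tree may already include the change
# downstream, so a "true" result is a hint, not proof.
predates_mainline_ktimers() {
    release=$1
    major=${release%%.*}          # "6.12.58" -> "6"
    rest=${release#*.}            # "6.12.58" -> "12.58"
    minor=${rest%%[!0-9]*}        # "12.58"   -> "12"
    [ "$major" -lt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -lt 13 ]; }
}

# Usage on a live system (uname -v contains PREEMPT_RT on RT kernels):
#   if uname -v | grep -q PREEMPT_RT &&
#      predates_mainline_ktimers "$(uname -r)"; then
#       echo "RT kernel older than 6.13: check for the ktimers change"
#   fi
```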