[tip: core/rcu] rcu: Enable rcu_normal_after_boot unconditionally for RT
The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     36221e109eb20ac111bc3bf3e8d5639aa457c7e0
Gitweb:        https://git.kernel.org/tip/36221e109eb20ac111bc3bf3e8d5639aa457c7e0
Author:        Julia Cartwright
AuthorDate:    Tue, 15 Dec 2020 15:16:47 +01:00
Committer:     Paul E. McKenney
CommitterDate: Mon, 04 Jan 2021 13:43:51 -08:00

rcu: Enable rcu_normal_after_boot unconditionally for RT

Expedited RCU grace periods send IPIs to all non-idle CPUs, and thus can
disrupt time-critical code in real-time applications.  However, there is
a portion of boot-time processing (presumably before any real-time
applications have started) where expedited RCU grace periods are the only
option.  And so it is that experience with the -rt patchset indicates that
PREEMPT_RT systems should always set the rcupdate.rcu_normal_after_boot
kernel boot parameter.

This commit therefore makes the post-boot application environment safe
for real-time applications by having PREEMPT_RT systems disable the
rcupdate.rcu_normal_after_boot kernel boot parameter and act as if this
parameter had been set.  This means that post-boot calls to
synchronize_rcu_expedited() will be treated as if they were instead calls
to synchronize_rcu(), thus preventing the IPIs, and thus avoiding
disrupting real-time applications.

Suggested-by: Luiz Capitulino
Acked-by: Paul E. McKenney
Signed-off-by: Julia Cartwright
Signed-off-by: Sebastian Andrzej Siewior
[ paulmck: Update kernel-parameters.txt accordingly. ]
Signed-off-by: Paul E. McKenney
---
 Documentation/admin-guide/kernel-parameters.txt | 7 +++++++
 kernel/rcu/update.c                             | 4 +++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 521255f..e0008d9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4474,6 +4474,13 @@
 			only normal grace-period primitives.  No effect
 			on CONFIG_TINY_RCU kernels.
 
+			But note that CONFIG_PREEMPT_RT=y kernels enable
+			this kernel boot parameter, forcibly setting
+			it to the value one, that is, converting any
+			post-boot attempt at an expedited RCU grace
+			period to instead use normal non-expedited
+			grace-period processing.
+
 	rcupdate.rcu_task_ipi_delay= [KNL]
 			Set time in jiffies during which RCU tasks will
 			avoid sending IPIs, starting with the beginning
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 39334d2..b95ae86 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -56,8 +56,10 @@
 #ifndef CONFIG_TINY_RCU
 module_param(rcu_expedited, int, 0);
 module_param(rcu_normal, int, 0);
-static int rcu_normal_after_boot;
+static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT);
+#ifndef CONFIG_PREEMPT_RT
 module_param(rcu_normal_after_boot, int, 0);
+#endif
 #endif /* #ifndef CONFIG_TINY_RCU */
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
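For readers tracing the mechanism: the latch that turns this parameter into
behavior runs once, late in boot. A minimal sketch of the relevant logic,
paraphrased from kernel/rcu/update.c (not a verbatim quote):

    /* Called once as in-kernel boot completes; latches normal-only GPs. */
    void rcu_end_inkernel_boot(void)
    {
            rcu_unexpedite_gp();
            if (rcu_normal_after_boot)
                    WRITE_ONCE(rcu_normal, 1);
    }

Because rcu_normal_after_boot now starts out as 1 on PREEMPT_RT, the latch
always fires, and every subsequent synchronize_rcu_expedited() is handled
as a plain synchronize_rcu().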
[tip: locking/core] squashfs: Make use of local lock in multi_cpu decompressor
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     fd56200a16c72c7c3ec3e54e06160dfaa5b8dee8
Gitweb:        https://git.kernel.org/tip/fd56200a16c72c7c3ec3e54e06160dfaa5b8dee8
Author:        Julia Cartwright
AuthorDate:    Wed, 27 May 2020 22:11:16 +02:00
Committer:     Ingo Molnar
CommitterDate: Thu, 28 May 2020 10:31:10 +02:00

squashfs: Make use of local lock in multi_cpu decompressor

The squashfs multi CPU decompressor makes use of get_cpu_ptr() to
acquire a pointer to per-CPU data. get_cpu_ptr() implicitly disables
preemption, which serializes the access to the per-CPU data.

But decompression can take quite some time depending on the data size.
The observed preempt-disabled times in real world scenarios went up to
8ms, causing massive wakeup latencies. This happens on all CPUs as the
decompression is fully parallelized.

Replace the implicit preemption control with an explicit local lock.
This allows RT kernels to substitute it with a real per-CPU lock, which
serializes the access but keeps the code section preemptible. On non-RT
kernels this maps to preempt_disable() as before, i.e. no functional
change.

[ bigeasy: Use local_lock(), patch description ]

Reported-by: Alexander Stein
Signed-off-by: Julia Cartwright
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Ingo Molnar
Tested-by: Alexander Stein
Acked-by: Peter Zijlstra
Link: https://lore.kernel.org/r/20200527201119.1692513-5-bige...@linutronix.de
---
 fs/squashfs/decompressor_multi_percpu.c | 21 ++++++++++++---------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/squashfs/decompressor_multi_percpu.c b/fs/squashfs/decompressor_multi_percpu.c
index 2a2a2d1..e206ebf 100644
--- a/fs/squashfs/decompressor_multi_percpu.c
+++ b/fs/squashfs/decompressor_multi_percpu.c
@@ -8,6 +8,7 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/percpu.h>
+#include <linux/local_lock.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
@@ -20,7 +21,8 @@
  */
 
 struct squashfs_stream {
-	void		*stream;
+	void		*stream;
+	local_lock_t	lock;
 };
 
 void *squashfs_decompressor_create(struct squashfs_sb_info *msblk,
@@ -41,6 +43,7 @@ void *squashfs_decompressor_create(struct squashfs_sb_info *msblk,
 			err = PTR_ERR(stream->stream);
 			goto out;
 		}
+		local_lock_init(&stream->lock);
 	}
 
 	kfree(comp_opts);
@@ -75,12 +78,16 @@ void squashfs_decompressor_destroy(struct squashfs_sb_info *msblk)
 int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh,
 	int b, int offset, int length, struct squashfs_page_actor *output)
 {
-	struct squashfs_stream __percpu *percpu =
-			(struct squashfs_stream __percpu *) msblk->stream;
-	struct squashfs_stream *stream = get_cpu_ptr(percpu);
-	int res = msblk->decompressor->decompress(msblk, stream->stream, bh, b,
-		offset, length, output);
-	put_cpu_ptr(stream);
+	struct squashfs_stream *stream;
+	int res;
+
+	local_lock(&msblk->stream->lock);
+	stream = this_cpu_ptr(msblk->stream);
+
+	res = msblk->decompressor->decompress(msblk, stream->stream, bh, b,
+			offset, length, output);
+
+	local_unlock(&msblk->stream->lock);
 
 	if (res < 0)
 		ERROR("%s decompression failed, data probably corrupt\n",
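The local_lock API used here generalizes beyond squashfs. A minimal,
self-contained sketch of the pattern, using hypothetical names (my_state,
scratch) rather than anything from this patch:

    #include <linux/local_lock.h>
    #include <linux/percpu.h>

    struct my_pcpu_state {
            local_lock_t lock;
            int scratch;
    };

    static DEFINE_PER_CPU(struct my_pcpu_state, my_state) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    static void touch_state(void)
    {
            /* !RT: disables preemption; RT: per-CPU spinlock, preemptible */
            local_lock(&my_state.lock);
            this_cpu_ptr(&my_state)->scratch++;
            local_unlock(&my_state.lock);
    }

The design point is that the lock names the protected per-CPU data
explicitly, so RT can substitute a sleeping lock without the code having to
know which configuration it runs under.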
Re: [patch 10/12] hrtimer: Determine hard/soft expiry mode for hrtimer sleepers on RT
On Fri, Jul 26, 2019 at 08:30:58PM +0200, Thomas Gleixner wrote:
> From: Sebastian Andrzej Siewior
>
> On PREEMPT_RT enabled kernels hrtimers which are not explicitly marked for
> hard interrupt expiry mode are moved into soft interrupt context either for
> latency reasons or because the hrtimer callback takes regular spinlocks or
> invokes other functions which are not suitable for hard interrupt context
> on PREEMPT_RT.
>
> The hrtimer_sleeper callback is RT compatible in hard interrupt context,
> but there is a latency concern: Untrusted userspace can spawn many threads
> which arm timers for the same expiry time on the same CPU. On expiry that
> causes a latency spike due to the wakeup of a gazillion threads.
>
> OTOH, privileged real-time user space applications rely on the low latency
> of hard interrupt wakeups. These syscall related wakeups are all based on
> hrtimer sleepers.
>
> If the current task is in a real-time scheduling class, mark the mode for
> hard interrupt expiry.
>
> [ tglx: Split out of a larger combo patch. Added changelog ]
>
> Signed-off-by: Sebastian Andrzej Siewior
> Signed-off-by: Thomas Gleixner
> ---
>  kernel/time/hrtimer.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1662,6 +1662,30 @@ static enum hrtimer_restart hrtimer_wake
>  static void __hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
>  				   clockid_t clock_id, enum hrtimer_mode mode)
>  {
> +	/*
> +	 * On PREEMPT_RT enabled kernels hrtimers which are not explicitly
> +	 * marked for hard interrupt expiry mode are moved into soft
> +	 * interrupt context either for latency reasons or because the
> +	 * hrtimer callback takes regular spinlocks or invokes other
> +	 * functions which are not suitable for hard interrupt context on
> +	 * PREEMPT_RT.
> +	 *
> +	 * The hrtimer_sleeper callback is RT compatible in hard interrupt
> +	 * context, but there is a latency concern: Untrusted userspace can
> +	 * spawn many threads which arm timers for the same expiry time on
> +	 * the same CPU. That causes a latency spike due to the wakeup of
> +	 * a gazillion threads.
> +	 *
> +	 * OTOH, privileged real-time user space applications rely on the
> +	 * low latency of hard interrupt wakeups. If the current task is in
> +	 * a real-time scheduling class, mark the mode for hard interrupt
> +	 * expiry.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		if (task_is_realtime(current) && !(mode & HRTIMER_MODE_SOFT))
> +			mode |= HRTIMER_MODE_HARD;

Because this ends up sampling the task's scheduling parameters only at the
time of enqueue, it doesn't take into consideration whether or not the task
may be holding a PI lock and later be boosted if contended by an RT thread.

Am I correct in assuming there is an induced inversion here in this case,
because the deferred wakeup mechanism isn't part of the PI chain?

If so, is this just to be an accepted limitation at this point?  Is the
intent to argue this away as bad RT application design? :)

   Julia
Re: [patch V2 1/1] Kconfig: Introduce CONFIG_PREEMPT_RT
On Wed, Jul 17, 2019 at 10:01:49PM +0200, Thomas Gleixner wrote:
> Add a new entry to the preemption menu which enables the real-time support
> for the kernel. The choice is only enabled when an architecture supports
> it.
>
> It selects PREEMPT as the RT features depend on it. To achieve that the
> existing PREEMPT choice is renamed to PREEMPT_LL, which selects PREEMPT as
> well.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner
> Acked-by: Paul E. McKenney
> Acked-by: Steven Rostedt (VMware)
> Acked-by: Clark Williams
> Acked-by: Daniel Bristot de Oliveira
> Acked-by: Frederic Weisbecker
> Acked-by: Ingo Molnar
> Acked-by: Peter Zijlstra (Intel)
> Acked-by: Marc Zyngier
> Acked-by: Daniel Wagner
> ---

I'm excited to see where this goes.

Acked-by: Julia Cartwright

   Julia
[tip:sched/core] kthread: Convert worker lock to raw spinlock
Commit-ID:  fe99a4f4d6022ec92f9b52a5528cb9b77513e7d1
Gitweb:     https://git.kernel.org/tip/fe99a4f4d6022ec92f9b52a5528cb9b77513e7d1
Author:     Julia Cartwright
AuthorDate: Tue, 12 Feb 2019 17:25:53 +0100
Committer:  Thomas Gleixner
CommitDate: Thu, 28 Feb 2019 11:18:38 +0100

kthread: Convert worker lock to raw spinlock

In order to enable the queuing of kthread work items from hardirq context
even when PREEMPT_RT_FULL is enabled, convert the worker spin_lock to a
raw_spin_lock.

This is only acceptable to do because the work performed under the lock is
well-bounded and minimal.

Reported-by: Steffen Trumtrar
Reported-by: Tim Sander
Signed-off-by: Julia Cartwright
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Tested-by: Steffen Trumtrar
Reviewed-by: Petr Mladek
Cc: Guenter Roeck
Link: https://lkml.kernel.org/r/20190212162554.19779-1-bige...@linutronix.de
---
 include/linux/kthread.h |  4 ++--
 kernel/kthread.c        | 42 +++++++++++++++++++++---------------------
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index c1961761311d..6b8c064f0cbc 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -85,7 +85,7 @@ enum {
 struct kthread_worker {
 	unsigned int		flags;
-	spinlock_t		lock;
+	raw_spinlock_t		lock;
 	struct list_head	work_list;
 	struct list_head	delayed_work_list;
 	struct task_struct	*task;
@@ -106,7 +106,7 @@ struct kthread_delayed_work {
 };
 
 #define KTHREAD_WORKER_INIT(worker)	{				\
-	.lock = __SPIN_LOCK_UNLOCKED((worker).lock),			\
+	.lock = __RAW_SPIN_LOCK_UNLOCKED((worker).lock),		\
 	.work_list = LIST_HEAD_INIT((worker).work_list),		\
 	.delayed_work_list = LIST_HEAD_INIT((worker).delayed_work_list),\
 	}
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 087d18d771b5..5641b55783a6 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -599,7 +599,7 @@ void __kthread_init_worker(struct kthread_worker *worker,
 				struct lock_class_key *key)
 {
 	memset(worker, 0, sizeof(struct kthread_worker));
-	spin_lock_init(&worker->lock);
+	raw_spin_lock_init(&worker->lock);
 	lockdep_set_class_and_name(&worker->lock, key, name);
 	INIT_LIST_HEAD(&worker->work_list);
 	INIT_LIST_HEAD(&worker->delayed_work_list);
@@ -641,21 +641,21 @@ repeat:
 
 	if (kthread_should_stop()) {
 		__set_current_state(TASK_RUNNING);
-		spin_lock_irq(&worker->lock);
+		raw_spin_lock_irq(&worker->lock);
 		worker->task = NULL;
-		spin_unlock_irq(&worker->lock);
+		raw_spin_unlock_irq(&worker->lock);
 		return 0;
 	}
 
 	work = NULL;
-	spin_lock_irq(&worker->lock);
+	raw_spin_lock_irq(&worker->lock);
 	if (!list_empty(&worker->work_list)) {
 		work = list_first_entry(&worker->work_list,
 					struct kthread_work, node);
 		list_del_init(&work->node);
 	}
 	worker->current_work = work;
-	spin_unlock_irq(&worker->lock);
+	raw_spin_unlock_irq(&worker->lock);
 
 	if (work) {
 		__set_current_state(TASK_RUNNING);
@@ -812,12 +812,12 @@ bool kthread_queue_work(struct kthread_worker *worker,
 	bool ret = false;
 	unsigned long flags;
 
-	spin_lock_irqsave(&worker->lock, flags);
+	raw_spin_lock_irqsave(&worker->lock, flags);
 	if (!queuing_blocked(worker, work)) {
 		kthread_insert_work(worker, work, &worker->work_list);
 		ret = true;
 	}
-	spin_unlock_irqrestore(&worker->lock, flags);
+	raw_spin_unlock_irqrestore(&worker->lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kthread_queue_work);
@@ -843,7 +843,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t)
 	if (WARN_ON_ONCE(!worker))
 		return;
 
-	spin_lock(&worker->lock);
+	raw_spin_lock(&worker->lock);
 	/* Work must not be used with >1 worker, see kthread_queue_work(). */
 	WARN_ON_ONCE(work->worker != worker);
@@ -852,7 +852,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t)
 	list_del_init(&work->node);
 	kthread_insert_work(worker, work, &worker->work_list);
 
-	spin_unlock(&worker->lock);
+	raw_spin_unlock(&worker->lock);
 }
 EXPORT_SYMBOL(kthread_delayed_work_timer_fn);
@@ -908,14 +908,14 @@ bool kthread_queue_delayed_work(struct kthread_worker *worker,
Re: [PATCH] iommu/dmar: fix buffer overflow during PCI bus notification
On Wed, Feb 20, 2019 at 10:46:31AM -0600, Julia Cartwright wrote:
> Commit 57384592c433 ("iommu/vt-d: Store bus information in RMRR PCI
> device path") changed the type of the path data, however, the change in
> path type was not reflected in size calculations. Update to use the
> correct type and prevent a buffer overflow.
>
> This bug manifests in systems with deep PCI hierarchies, and can lead to
> an overflow of the static allocated buffer (dmar_pci_notify_info_buf),
> or can lead to overflow of slab-allocated data.
>
>    BUG: KASAN: global-out-of-bounds in dmar_alloc_pci_notify_info+0x1d5/0x2e0
>    Write of size 1 at addr 90445d80 by task swapper/0/1
>    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W  4.14.87-rt49-02406-gd0a0e96 #1
>    Call Trace:
>     ? dump_stack+0x46/0x59
>     ? print_address_description+0x1df/0x290
>     [..]
>    The buggy address belongs to the variable:
>     dmar_pci_notify_info_buf+0x40/0x60
>
> Fixes: 57384592c433 ("iommu/vt-d: Store bus information in RMRR PCI device path")
> Signed-off-by: Julia Cartwright
> ---
>  drivers/iommu/dmar.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index 6b7df25e1488..9c49300e9fb7 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -145,7 +145,7 @@ dmar_alloc_pci_notify_info(struct pci_dev *dev, unsigned long event)
>  	for (tmp = dev; tmp; tmp = tmp->bus->self)
>  		level++;
>
> -	size = sizeof(*info) + level * sizeof(struct acpi_dmar_pci_path);
> +	size = sizeof(*info) + level * sizeof(info->path[0]);

This is probably a candidate for struct_size() instead, if that's what is
preferred.

   Julia
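For reference, struct_size() (from include/linux/overflow.h) computes
sizeof(*p) + count * sizeof(p->member[0]) for a trailing flexible/zero-length
array while saturating on arithmetic overflow. A sketch of what the line
above would become under that suggestion:

    #include <linux/overflow.h>

    	size = struct_size(info, path, level);

This keeps the size tied to the declared type of info->path and adds
overflow protection for free.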
[PATCH] iommu/dmar: fix buffer overflow during PCI bus notification
Commit 57384592c433 ("iommu/vt-d: Store bus information in RMRR PCI device
path") changed the type of the path data, however, the change in path type
was not reflected in size calculations. Update to use the correct type and
prevent a buffer overflow.

This bug manifests in systems with deep PCI hierarchies, and can lead to an
overflow of the static allocated buffer (dmar_pci_notify_info_buf), or can
lead to overflow of slab-allocated data.

   BUG: KASAN: global-out-of-bounds in dmar_alloc_pci_notify_info+0x1d5/0x2e0
   Write of size 1 at addr 90445d80 by task swapper/0/1
   CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W  4.14.87-rt49-02406-gd0a0e96 #1
   Call Trace:
    ? dump_stack+0x46/0x59
    ? print_address_description+0x1df/0x290
    ? dmar_alloc_pci_notify_info+0x1d5/0x2e0
    ? kasan_report+0x256/0x340
    ? dmar_alloc_pci_notify_info+0x1d5/0x2e0
    ? e820__memblock_setup+0xb0/0xb0
    ? dmar_dev_scope_init+0x424/0x48f
    ? __down_write_common+0x1ec/0x230
    ? dmar_dev_scope_init+0x48f/0x48f
    ? dmar_free_unused_resources+0x109/0x109
    ? cpumask_next+0x16/0x20
    ? __kmem_cache_create+0x392/0x430
    ? kmem_cache_create+0x135/0x2f0
    ? e820__memblock_setup+0xb0/0xb0
    ? intel_iommu_init+0x170/0x1848
    ? _raw_spin_unlock_irqrestore+0x32/0x60
    ? migrate_enable+0x27a/0x5b0
    ? sched_setattr+0x20/0x20
    ? migrate_disable+0x1fc/0x380
    ? task_rq_lock+0x170/0x170
    ? try_to_run_init_process+0x40/0x40
    ? locks_remove_file+0x85/0x2f0
    ? dev_prepare_static_identity_mapping+0x78/0x78
    ? rt_spin_unlock+0x39/0x50
    ? lockref_put_or_lock+0x2a/0x40
    ? dput+0x128/0x2f0
    ? __rcu_read_unlock+0x66/0x80
    ? __fput+0x250/0x300
    ? __rcu_read_lock+0x1b/0x30
    ? mntput_no_expire+0x38/0x290
    ? e820__memblock_setup+0xb0/0xb0
    ? pci_iommu_init+0x25/0x63
    ? pci_iommu_init+0x25/0x63
    ? do_one_initcall+0x7e/0x1c0
    ? initcall_blacklisted+0x120/0x120
    ? kernel_init_freeable+0x27b/0x307
    ? rest_init+0xd0/0xd0
    ? kernel_init+0xf/0x120
    ? rest_init+0xd0/0xd0
    ? ret_from_fork+0x1f/0x40

   The buggy address belongs to the variable:
    dmar_pci_notify_info_buf+0x40/0x60

Fixes: 57384592c433 ("iommu/vt-d: Store bus information in RMRR PCI device path")
Signed-off-by: Julia Cartwright
---
 drivers/iommu/dmar.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 6b7df25e1488..9c49300e9fb7 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -145,7 +145,7 @@ dmar_alloc_pci_notify_info(struct pci_dev *dev, unsigned long event)
 	for (tmp = dev; tmp; tmp = tmp->bus->self)
 		level++;
 
-	size = sizeof(*info) + level * sizeof(struct acpi_dmar_pci_path);
+	size = sizeof(*info) + level * sizeof(info->path[0]);
 
 	if (size <= sizeof(dmar_pci_notify_info_buf)) {
 		info = (struct dmar_pci_notify_info *)dmar_pci_notify_info_buf;
 	} else {
-- 
2.20.1
Re: [RFC PATCH] arm64/fpsimd: Don't disable softirq when touching FPSIMD/SVE state
Hello Julien-

On Fri, Feb 08, 2019 at 04:55:13PM +0000, Julien Grall wrote:
> When the kernel is compiled with CONFIG_KERNEL_MODE_NEON, some part of
> the kernel may be able to use FPSIMD/SVE. This is for instance the case
> for crypto code.
>
> Any use of FPSIMD/SVE in the kernel is clearly marked by using the
> functions kernel_neon_{begin, end}. Furthermore, this can only be used
> when may_use_simd() returns true.
>
> The current implementation of may_use_simd() allows softirq to use
> FPSIMD/SVE unless it is currently in use (i.e. kernel_neon_busy is true).
> When in use, softirqs usually fall back to a software method.
>
> At the moment, as a softirq may use FPSIMD/SVE, softirqs are disabled
> when touching the FPSIMD/SVE context. This has the drawback of disabling
> all softirqs even if they are not using FPSIMD/SVE.
>
> As a softirq should not rely on being able to use simd at a given time,
> there is limited reason to keep softirqs disabled when touching the
> FPSIMD/SVE context. Instead, we can disable only preemption and note that
> the NEON unit is currently in use.
>
> This patch introduces two new helpers kernel_neon_{disable, enable} to
> mark the area using FPSIMD/SVE context and uses them in replacement of
> local_bh_{disable, enable}. The functions kernel_neon_{begin, end} are
> also re-implemented to use the new helpers.
>
> Signed-off-by: Julien Grall
>
> ---
>
> I have been exploring this solution as an alternative approach to the RT
> patch "arm64: fpsimd: use preemp_disable in addition to local_bh_disable()".
>
> So far, the patch has only been lightly tested.
>
> For RT-linux, it might be possible to use migrate_{enable, disable}. I
> am quite new with RT and have some trouble understanding the semantics
> of migrate_{enable, disable}. So far, I am still unsure if it is possible
> to run another userspace task on the same CPU while getting preempted
> when the migration is disabled.

Yes, a thread in a migrate_disable()-protected critical section can be
preempted, and another thread on the same CPU may enter the critical
section.

If it's necessary to prevent this from happening, then you need to also
make use of a per-CPU mutex. On RT, this would do the right thing
w.r.t. priority inheritance.

   Julia
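To make that concrete, a rough sketch of the per-CPU mutex idea on an RT
kernel, with hypothetical names (fpsimd_lock, fpsimd_critical_section) that
don't come from any in-tree user:

    static DEFINE_PER_CPU(struct mutex, fpsimd_lock);

    static int __init fpsimd_lock_setup(void)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    mutex_init(&per_cpu(fpsimd_lock, cpu));
            return 0;
    }

    static void fpsimd_critical_section(void)
    {
            struct mutex *lock;

            migrate_disable();              /* pin to this CPU, stay preemptible */
            lock = this_cpu_ptr(&fpsimd_lock);
            mutex_lock(lock);               /* excludes other tasks on this CPU,
                                             * with priority inheritance on RT */
            /* ... touch per-CPU FPSIMD/SVE state ... */
            mutex_unlock(lock);
            migrate_enable();
    }

migrate_disable() keeps this_cpu_ptr() stable across preemption; the mutex
then provides the mutual exclusion that preempt_disable() used to imply.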
Re: Re: [PATCH v3 1/3] KVM: arm/arm64: vgic: Make vgic_irq->irq_lock a raw_spinlock
On Fri, Feb 01, 2019 at 03:30:58PM +0000, Julien Grall wrote:
> Hi Julien,
>
> On 07/01/2019 15:06, Julien Thierry wrote:
> > vgic_irq->irq_lock must always be taken with interrupts disabled as
> > it is used in interrupt context.
>
> I am a bit confused with the reason here. The code mentions that
> ap_list_lock could be taken from the timer interrupt handler. I assume it
> speaks about the handler kvm_arch_timer_handler. Looking at the
> configuration of the interrupt, the flag IRQF_NO_THREAD is not set, so the
> interrupt should be threaded when CONFIG_PREEMPT_RT_FULL is set. If my
> understanding is correct, this means the interrupt thread would sleep if it
> takes the spinlock.
>
> Did I miss anything? Do you have an exact path where the vGIC is actually
> called from an interrupt context?

The part you're missing is that percpu interrupts are not force threaded:

	static int irq_setup_forced_threading(struct irqaction *new)
	{
		if (!force_irqthreads)
			return 0;
		if (new->flags & (IRQF_NO_THREAD | IRQF_PERCPU | IRQF_ONESHOT))
			return 0;

		/* ... */
	}

   Julia
[ANNOUNCE] 4.9.146-rt125
Hello RT Folks!

I'm pleased to announce the 4.9.146-rt125 stable release. Apologies for the
long-overdue update to the 4.9-rt tree.

This release is just an update to the new stable 4.9.146 version and no RT
specific changes have been made.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v4.9-rt
  Head SHA1: 5b5e350d8c82f5dcbad136aeede6b73d41cd1e8a

Or to build 4.9.146-rt125 directly, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.tar.xz
  http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.9.146.xz
  http://www.kernel.org/pub/linux/kernel/projects/rt/4.9/patch-4.9.146-rt125.patch.xz

Enjoy, and happy holidays,
Julia
[PATCH 0/2] Fix watchdogd wakeup deferral on RT
The following two patches solve an issue reported by Steffen Trumtrar and
Tim Sander to the linux-rt-users mailing list: the wakeup of the watchdogd
thread can be starved by an RT task, because that wakeup is issued from an
hrtimer deferred through ktimersoftd (on PREEMPT_RT_FULL).

The first patch adjusts the kthread_worker locking to make use of a raw
spinlock, making it suitable for work item queueing from hardirq context on
PREEMPT_RT. This patch can be applied directly to mainline without any
functional change.

The second patch adjusts the hrtimer used by the watchdog core to be a
_HARD timer (and thus not deferred through ktimersoftd w/ PREEMPT_RT). This
patch depends on hrtimer patches carried in the RT patch, and so should
therefore land there.

Cc: Guenter Roeck
Cc: Steffen Trumtrar
Cc: Tim Sander

Julia Cartwright (2):
  kthread: convert worker lock to raw spinlock
  watchdog, rt: prevent deferral of watchdogd wakeup

 drivers/watchdog/watchdog_dev.c |  2 +-
 include/linux/kthread.h         |  2 +-
 kernel/kthread.c                | 42 ++++++++++++++++++-------------------
 3 files changed, 23 insertions(+), 23 deletions(-)

-- 
2.18.0
[PATCH 1/2] kthread: convert worker lock to raw spinlock
In order to enable the queuing of kthread work items from hardirq context
even when PREEMPT_RT_FULL is enabled, convert the worker spin_lock to a
raw_spin_lock.

This is only acceptable to do because the work performed under the lock is
well-bounded and minimal.

Cc: Sebastian Andrzej Siewior
Cc: Guenter Roeck
Reported-and-tested-by: Steffen Trumtrar
Reported-by: Tim Sander
Signed-off-by: Julia Cartwright
---
 include/linux/kthread.h |  2 +-
 kernel/kthread.c        | 42 +++++++++++++++++++++---------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index c1961761311d..ad292898f7f2 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -85,7 +85,7 @@ enum {
 struct kthread_worker {
 	unsigned int		flags;
-	spinlock_t		lock;
+	raw_spinlock_t		lock;
 	struct list_head	work_list;
 	struct list_head	delayed_work_list;
 	struct task_struct	*task;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 486dedbd9af5..c1d9ee6671c6 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -597,7 +597,7 @@ void __kthread_init_worker(struct kthread_worker *worker,
 				struct lock_class_key *key)
 {
 	memset(worker, 0, sizeof(struct kthread_worker));
-	spin_lock_init(&worker->lock);
+	raw_spin_lock_init(&worker->lock);
 	lockdep_set_class_and_name(&worker->lock, key, name);
 	INIT_LIST_HEAD(&worker->work_list);
 	INIT_LIST_HEAD(&worker->delayed_work_list);
@@ -639,21 +639,21 @@ int kthread_worker_fn(void *worker_ptr)
 
 	if (kthread_should_stop()) {
 		__set_current_state(TASK_RUNNING);
-		spin_lock_irq(&worker->lock);
+		raw_spin_lock_irq(&worker->lock);
 		worker->task = NULL;
-		spin_unlock_irq(&worker->lock);
+		raw_spin_unlock_irq(&worker->lock);
 		return 0;
 	}
 
 	work = NULL;
-	spin_lock_irq(&worker->lock);
+	raw_spin_lock_irq(&worker->lock);
 	if (!list_empty(&worker->work_list)) {
 		work = list_first_entry(&worker->work_list,
 					struct kthread_work, node);
 		list_del_init(&work->node);
 	}
 	worker->current_work = work;
-	spin_unlock_irq(&worker->lock);
+	raw_spin_unlock_irq(&worker->lock);
 
 	if (work) {
 		__set_current_state(TASK_RUNNING);
@@ -810,12 +810,12 @@ bool kthread_queue_work(struct kthread_worker *worker,
 	bool ret = false;
 	unsigned long flags;
 
-	spin_lock_irqsave(&worker->lock, flags);
+	raw_spin_lock_irqsave(&worker->lock, flags);
 	if (!queuing_blocked(worker, work)) {
 		kthread_insert_work(worker, work, &worker->work_list);
 		ret = true;
 	}
-	spin_unlock_irqrestore(&worker->lock, flags);
+	raw_spin_unlock_irqrestore(&worker->lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kthread_queue_work);
@@ -841,7 +841,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t)
 	if (WARN_ON_ONCE(!worker))
 		return;
 
-	spin_lock(&worker->lock);
+	raw_spin_lock(&worker->lock);
 	/* Work must not be used with >1 worker, see kthread_queue_work(). */
 	WARN_ON_ONCE(work->worker != worker);
@@ -850,7 +850,7 @@ void kthread_delayed_work_timer_fn(struct timer_list *t)
 	list_del_init(&work->node);
 	kthread_insert_work(worker, work, &worker->work_list);
 
-	spin_unlock(&worker->lock);
+	raw_spin_unlock(&worker->lock);
 }
 EXPORT_SYMBOL(kthread_delayed_work_timer_fn);
@@ -906,14 +906,14 @@ bool kthread_queue_delayed_work(struct kthread_worker *worker,
 	unsigned long flags;
 	bool ret = false;
 
-	spin_lock_irqsave(&worker->lock, flags);
+	raw_spin_lock_irqsave(&worker->lock, flags);
 	if (!queuing_blocked(worker, work)) {
 		__kthread_queue_delayed_work(worker, dwork, delay);
 		ret = true;
 	}
-	spin_unlock_irqrestore(&worker->lock, flags);
+	raw_spin_unlock_irqrestore(&worker->lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kthread_queue_delayed_work);
@@ -949,7 +949,7 @@ void kthread_flush_work(struct kthread_work *work)
 	if (!worker)
 		return;
 
-	spin_lock_irq(&worker->lock);
+	raw_spin_lock_irq(&worker->lock);
 	/* Work must not be used with >1 worker, see kthread_queue_work(). */
 	WARN_ON_ONCE(work->worker != worker);
@@ -961,7 +961,7 @@ void kthread_flush_work(struct k
[PATCH RT 2/2] watchdog, rt: prevent deferral of watchdogd wakeup
When PREEMPT_RT_FULL is enabled, all hrtimer expiry functions are deferred
for execution into the context of ktimersoftd unless otherwise annotated.

Deferring the expiry of the hrtimer used by the watchdog core, however, is
a waste, as the callback does nothing but queue a kthread work item and
wake up watchdogd.

It's worse than that, too: the deferral through ktimersoftd also means that
for correct behavior a user must adjust the scheduling parameters of both
watchdogd _and_ ktimersoftd, which is unnecessary and has other side
effects (like causing unrelated expiry functions to execute at potentially
elevated priority).

Instead, mark the hrtimer used by the watchdog core as being _HARD to allow
its execution directly from hardirq context. The work done in this expiry
function is well-bounded and minimal.

A user still must adjust the scheduling parameters of watchdogd to be
correct w.r.t. their application needs.

Cc: Guenter Roeck
Reported-and-tested-by: Steffen Trumtrar
Reported-by: Tim Sander
Signed-off-by: Julia Cartwright
---
 drivers/watchdog/watchdog_dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index ffbdc4642ea5..9c2b3e5cebdc 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -945,7 +945,7 @@ static int watchdog_cdev_register(struct watchdog_device *wdd, dev_t devno)
 		return -ENODEV;
 
 	kthread_init_work(&wd_data->work, watchdog_ping_work);
-	hrtimer_init(&wd_data->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&wd_data->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
 	wd_data->timer.function = watchdog_timer_expired;
 
 	if (wdd->id == 0) {
-- 
2.18.0
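Taken together with patch 1/2, the wakeup path becomes hardirq-safe end to
end. An illustrative sketch of the pattern these two patches enable, using
hypothetical names (ping_worker, ping_timer) rather than the watchdog
core's own:

    static struct kthread_worker *ping_worker;	/* SCHED_FIFO, created elsewhere */
    static struct kthread_work ping_work;
    static struct hrtimer ping_timer;

    static enum hrtimer_restart ping_timer_fn(struct hrtimer *t)
    {
            /* Runs in hardirq context; safe now that the worker lock is raw. */
            kthread_queue_work(ping_worker, &ping_work);
            hrtimer_forward_now(t, ms_to_ktime(500));
            return HRTIMER_RESTART;
    }

    static void ping_timer_setup(void)
    {
            /* _HARD keeps expiry out of ktimersoftd on PREEMPT_RT_FULL */
            hrtimer_init(&ping_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
            ping_timer.function = ping_timer_fn;
            hrtimer_start(&ping_timer, ms_to_ktime(500), HRTIMER_MODE_REL_HARD);
    }

Only the worker thread's priority then matters for wakeup latency; no
ktimersoftd tuning is required.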
Re: [ANNOUNCE] Submit a topic for the RT Microconference at Linux Plumbers
Hello all-

On Tue, Sep 04, 2018 at 01:25:29PM -0400, Steven Rostedt wrote:
> Hi RT folks!
>
> The call for proposals (CfP) is now open for the RT Microconference at
> Linux Plumbers in Vancouver, Canada. The topics we are looking at this
> year are:
>
> - How will PREEMPT_RT be maintained (is it a subsystem?)
>
> - Who will maintain it?
>
> - How do we teach the rest of the kernel developers how not to break
>   PREEMPT_RT?
>
> - How do we catch when it breaks?
>
> - What tools can we add into tools/ that other kernel developers can use to
>   test and learn about PREEMPT_RT?
>
> - What tests can we add into tools/testing/selftests?
>
> - What kernel boot selftests can be added?
>
> - How can we modify lockdep to catch problems that can break PREEMPT_RT?
>
> - Discuss various types of failures that can happen with PREEMPT_RT that
>   normally would not happen in the vanilla kernel.
>
> - How do we handle stable backports?
>
> - Interaction with Safety Critical domains?
>
> - Interrupt threads are RT and are not protected by the RT Throttling.
>   How can we prevent interrupt thread starvation from a rogue RT task?
>
> - How to determine if a kernel is booted with PREEMPT_RT enabled?
>
> - librtpi: "solving" long-standing glibc pthread condvar internal lock
>   inversion with a new implementation; motivations, technical details, etc.
>
> - SCHED_DEADLINE - Lots of research has been made to various issues like CPU
>   affinity, Bandwidth inheritance, cgroup support; but nothing has been done
>   to the kernel. Let's make a commitment and push these ideas forward into
>   mainline.

There was some confusion as to whether or not a topic being listed above
automatically meant that it was to be included in the RT Microconference
agenda. The answer is: NO.

If you would like to see a specific topic discussed, either from this list
or otherwise, you must submit it via the portal listed below. Only topics
submitted to the portal will be considered for the RT Microconference
agenda.

Note also that submitting a topic means that you will be responsible for
introducing the necessary context to the group to facilitate further
discussion.

> If you want to lead one of the above sessions, or if you have a new
> session you want to submit, please do so here:
>
>   https://linuxplumbersconf.org/event/2/abstracts/
>
> Set up an account, and then click "Submit new proposal".
>
> Fill out the "Title", "Content", select the "Authors" as the one leading
> the session (yourself), and then for "Track" select "RT MC Topics CfP".
> Select "Yes" to agree with the anti-harassment policy, and then hit
> Submit.

Thanks, and see you there!

   Julia
[PATCH RT 11/22] mm/slub: close possible memory-leak in kmem_cache_alloc_bulk()
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

Under certain circumstances we could leak elements which were moved to the
local "to_free" list. The damage is limited since I can't find any users
here.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit 5022166d3b225bf5e343efb3ea01b3c5a41d69ba)
Signed-off-by: Julia Cartwright
---
 mm/slub.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/slub.c b/mm/slub.c
index 67eb368b9314..738b2bccbd5f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3209,6 +3209,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	return i;
 error:
 	local_irq_enable();
+	free_delayed(&to_free);
 	slab_post_alloc_hook(s, flags, i, p);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
-- 
2.18.0
[ANNOUNCE] 4.9.115-rt93
Hello RT Folks!

I'm pleased to announce the 4.9.115-rt93 stable release.

This release is just an update to the new stable 4.9.115 version and no RT
specific changes have been made.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v4.9-rt
  Head SHA1: 236d20fd941a444577579ba398ba9907f3b33277

Or to build 4.9.115-rt93 directly, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.tar.xz
  http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.9.115.xz
  http://www.kernel.org/pub/linux/kernel/projects/rt/4.9/patch-4.9.115-rt93.patch.xz

Enjoy!
Julia
[PATCH RT 10/22] arm*: disable NEON in kernel mode
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

NEON in kernel mode is used by the crypto algorithms and raid6 code. While
the raid6 code looks okay, the crypto algorithms do not: NEON is enabled on
first invocation and may allocate/free/map memory before the NEON mode is
disabled again. This needs to be fixed before it can be enabled.

On ARM, NEON in kernel mode can simply be disabled. On ARM64 it needs to
stay on due to possible EFI callbacks, so here I disable each algorithm.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit b3a776555e0d465df138d254d6dc3ac1b718ac6d)
Signed-off-by: Julia Cartwright
---
 arch/arm/Kconfig          |  2 +-
 arch/arm64/crypto/Kconfig | 14 +++++++-------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 5715844e83e3..8c40f7d73251 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -2158,7 +2158,7 @@ config NEON
 
 config KERNEL_MODE_NEON
 	bool "Support for NEON in kernel mode"
-	depends on NEON && AEABI
+	depends on NEON && AEABI && !PREEMPT_RT_BASE
 	help
 	  Say Y to include support for NEON in kernel mode.
 
diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2cf32e9887e1..cd71b3432720 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -10,41 +10,41 @@ if ARM64_CRYPTO
 
 config CRYPTO_SHA1_ARM64_CE
 	tristate "SHA-1 digest algorithm (ARMv8 Crypto Extensions)"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_HASH
 
 config CRYPTO_SHA2_ARM64_CE
 	tristate "SHA-224/SHA-256 digest algorithm (ARMv8 Crypto Extensions)"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_HASH
 
 config CRYPTO_GHASH_ARM64_CE
 	tristate "GHASH (for GCM chaining mode) using ARMv8 Crypto Extensions"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_HASH
 
 config CRYPTO_AES_ARM64_CE
 	tristate "AES core cipher using ARMv8 Crypto Extensions"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_ALGAPI
 
 config CRYPTO_AES_ARM64_CE_CCM
 	tristate "AES in CCM mode using ARMv8 Crypto Extensions"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_ALGAPI
 	select CRYPTO_AES_ARM64_CE
 	select CRYPTO_AEAD
 
 config CRYPTO_AES_ARM64_CE_BLK
 	tristate "AES in ECB/CBC/CTR/XTS modes using ARMv8 Crypto Extensions"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_AES_ARM64_CE
 	select CRYPTO_ABLK_HELPER
 
 config CRYPTO_AES_ARM64_NEON_BLK
 	tristate "AES in ECB/CBC/CTR/XTS modes using NEON instructions"
-	depends on ARM64 && KERNEL_MODE_NEON
+	depends on ARM64 && KERNEL_MODE_NEON && !PREEMPT_RT_BASE
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_AES
 	select CRYPTO_ABLK_HELPER
-- 
2.18.0
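For context, the constraint being worked around is the usage contract of
kernel-mode NEON: the begin/end section is non-preemptible, so nothing in it
may sleep, allocate, or take sleeping locks. A sketch of the canonical
caller pattern (the crypto drivers above violate its spirit by doing
allocations between begin and end):

    	if (may_use_simd()) {
    		kernel_neon_begin();	/* disables preemption for the section */
    		/* SIMD-accelerated work: no sleeping, no allocation */
    		kernel_neon_end();
    	} else {
    		/* scalar fallback path */
    	}

On an RT kernel the non-preemptible section is exactly the latency problem,
hence disabling these users until the sections are bounded.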
[PATCH RT 17/22] alarmtimer: Prevent live lock in alarm_cancel()
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

If alarm_try_to_cancel() requires a retry, then depending on the priority
setting the retry loop might prevent timer callback completion on RT.
Prevent that by waiting for completion on RT; no change for a non-RT
kernel.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit 51e376c469bf05f32cb1ceb9e39d31bb92f1f6c8)
Signed-off-by: Julia Cartwright
---
 kernel/time/alarmtimer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index d67ef56ca9bc..61b20a656863 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -407,7 +407,7 @@ int alarm_cancel(struct alarm *alarm)
 		int ret = alarm_try_to_cancel(alarm);
 		if (ret >= 0)
 			return ret;
-		cpu_relax();
+		hrtimer_wait_for_timer(&alarm->timer);
 	}
 }
 EXPORT_SYMBOL_GPL(alarm_cancel);
-- 
2.18.0
[PATCH RT 18/22] posix-timers: move the rcu head out of the union
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

On RT the timer can be preempted while running, and therefore we wait with
timer_wait_for_callback() for the timer to complete (instead of busy
looping). The RCU read lock is held to ensure that this posix timer is not
removed while we wait on it.

If the timer is removed, then it invokes call_rcu() with a pointer that is
shared with the hrtimer because it is part of the same union. In order to
avoid any possible side effects, move the rcu head out of the union.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit b8401365af110949f12c7cf1fa86b4c0ea069bbd)
Signed-off-by: Julia Cartwright
---
 include/linux/posix-timers.h | 2 +-
 kernel/time/posix-timers.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 62d44c176071..cbd3f9334543 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -92,8 +92,8 @@ struct k_itimer {
 			struct alarm alarmtimer;
 			ktime_t interval;
 		} alarm;
-		struct rcu_head rcu;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 6084618436fd..45d8033caec4 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -568,7 +568,7 @@ static struct k_itimer * alloc_posix_timer(void)
 
 static void k_itimer_rcu_free(struct rcu_head *head)
 {
-	struct k_itimer *tmr = container_of(head, struct k_itimer, it.rcu);
+	struct k_itimer *tmr = container_of(head, struct k_itimer, rcu);
 
 	kmem_cache_free(posix_timers_cache, tmr);
 }
@@ -585,7 +585,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	call_rcu(&tmr->it.rcu, k_itimer_rcu_free);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
-- 
2.18.0
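The hazard in miniature (illustrative layout only, not the real k_itimer
definition): with the rcu_head inside the union, call_rcu() reuses storage
that a concurrently waiting task may still be reading.

    struct itimer_before {
            union {
                    struct hrtimer real;	/* a preempted waiter may still
                                             * inspect this timer... */
                    struct rcu_head rcu;	/* ...while call_rcu() links the
                                             * object through the same bytes */
            } it;
    };

    struct itimer_after {
            union {
                    struct hrtimer real;
            } it;
            struct rcu_head rcu;		/* no longer aliases live state */
    };

Moving the rcu_head out of the union costs a few bytes per timer but removes
the aliasing entirely.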
[PATCH RT 19/22] locallock: provide {get,put}_locked_ptr() variants
4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

Provide a set of locallocked accessors for pointers to per-CPU data; this
is useful for dynamically-allocated per-CPU regions, for example.

These are symmetric with the {get,put}_cpu_ptr() per-CPU accessor variants.

Signed-off-by: Julia Cartwright
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit 3d45cf23db4f76cd356ebb0aa4cdaa7d92d1a64e)
Signed-off-by: Julia Cartwright
---
 include/linux/locallock.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/locallock.h b/include/linux/locallock.h
index 280f884a05a3..0c3ff5b23f6a 100644
--- a/include/linux/locallock.h
+++ b/include/linux/locallock.h
@@ -238,6 +238,14 @@ static inline int __local_unlock_irqrestore(struct local_irq_lock *lv,
 
 #define put_locked_var(lvar, var)	local_unlock(lvar);
 
+#define get_locked_ptr(lvar, var)					\
+	({								\
+		local_lock(lvar);					\
+		this_cpu_ptr(var);					\
+	})
+
+#define put_locked_ptr(lvar, var)	local_unlock(lvar);
+
 #define local_lock_cpu(lvar)						\
 	({								\
 		local_lock(lvar);					\
@@ -278,6 +286,8 @@ static inline void local_irq_lock_init(int lvar) { }
 
 #define get_locked_var(lvar, var)	get_cpu_var(var)
 #define put_locked_var(lvar, var)	put_cpu_var(var)
+#define get_locked_ptr(lvar, var)	get_cpu_ptr(var)
+#define put_locked_ptr(lvar, var)	put_cpu_ptr(var)
 
 #define local_lock_cpu(lvar)		get_cpu()
 #define local_unlock_cpu(lvar)		put_cpu()
-- 
2.18.0
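A usage sketch with hypothetical names (my_lock, my_stats; the per-CPU
region would come from alloc_percpu() at init time):

    static DEFINE_LOCAL_IRQ_LOCK(my_lock);
    static struct my_stats __percpu *my_stats;

    static void bump_stat(void)
    {
            struct my_stats *s;

            /* takes the local lock, then resolves this CPU's instance */
            s = get_locked_ptr(my_lock, my_stats);
            s->count++;
            put_locked_ptr(my_lock, my_stats);
    }

On non-RT this degrades to get_cpu_ptr()/put_cpu_ptr() (plain preemption
disabling); on RT the section stays preemptible under the local lock.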
[PATCH RT 06/22] rcu: Do not include rtmutex_common.h unconditionally
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

[ upstream commit b88697810d7c1d102a529990f9071b0f14cfe6df ]

This commit adjusts include files and provides definitions in preparation
for suppressing lockdep false-positive ->boost_mtx complaints. Without this
preparation, architectures not supporting rt_mutex will get build failures.

Reported-by: kbuild test robot
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Paul E. McKenney
Signed-off-by: Julia Cartwright
---
 kernel/rcu/tree_plugin.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index a1a7bafcce15..3d18d08e8382 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -37,6 +37,7 @@
  * This probably needs to be excluded from -rt builds.
  */
 #define rt_mutex_owner(a) ({ WARN_ON_ONCE(1); NULL; })
+#define rt_mutex_futex_unlock(x) WARN_ON_ONCE(1)
 
 #endif /* #else #ifdef CONFIG_RCU_BOOST */
 
@@ -834,8 +835,6 @@ static void rcu_cpu_kthread_setup(unsigned int cpu)
 
 #ifdef CONFIG_RCU_BOOST
 
-#include "../locking/rtmutex_common.h"
-
 #ifdef CONFIG_RCU_TRACE
 
 static void rcu_initiate_boost_trace(struct rcu_node *rnp)
-- 
2.18.0
[PATCH RT 12/22] locking: add types.h
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

During the stable update the arm architecture did not compile anymore due
to the missing definitions of u16/u32.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit 1289b06974d64f244a26455fab699c6a1332f4bc)
Signed-off-by: Julia Cartwright
---
 include/linux/spinlock_types_raw.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/spinlock_types_raw.h b/include/linux/spinlock_types_raw.h
index edffc4d53fc9..03235b475b77 100644
--- a/include/linux/spinlock_types_raw.h
+++ b/include/linux/spinlock_types_raw.h
@@ -1,6 +1,8 @@
 #ifndef __LINUX_SPINLOCK_TYPES_RAW_H
 #define __LINUX_SPINLOCK_TYPES_RAW_H
 
+#include <linux/types.h>
+
 #if defined(CONFIG_SMP)
 # include <asm/spinlock_types.h>
 #else
-- 
2.18.0
[PATCH RT 09/22] crypto: limit more FPU-enabled sections
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

Those crypto drivers use SSE/AVX/… for their crypto work and in order to do
so in kernel they need to enable the "FPU" in kernel mode, which disables
preemption. There are two problems with the way they are used:

- the while loop which processes X bytes may create latency spikes and
  should be avoided or limited.

- the cipher-walk-next part may allocate/free memory and may use
  kmap_atomic().

The whole kernel_fpu_begin()/end() processing isn't probably that cheap. It
most likely makes sense to process as much of those as possible in one go.
The new *_fpu_sched_rt() schedules only if a RT task is pending.

Probably we should measure the performance of those ciphers in pure SW mode
and with these optimisations to see if it makes sense to keep them for RT.

This kernel_fpu_resched() makes the code more preemptible, which might hurt
performance.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit 0dcc4c1693ef37e166da420ef7c68c7047c996f1)
Signed-off-by: Julia Cartwright
---
 arch/x86/crypto/camellia_aesni_avx2_glue.c | 20 ++++++++++++++++++
 arch/x86/crypto/camellia_aesni_avx_glue.c  | 19 +++++++++++++++++
 arch/x86/crypto/cast6_avx_glue.c           | 24 +++++++++++++++++++++
 arch/x86/crypto/chacha20_glue.c            |  9 ++++++++
 arch/x86/crypto/serpent_avx2_glue.c        | 19 +++++++++++++++++
 arch/x86/crypto/serpent_avx_glue.c         | 23 ++++++++++++++++++++
 arch/x86/crypto/serpent_sse2_glue.c        | 23 ++++++++++++++++++++
 arch/x86/crypto/twofish_avx_glue.c         | 27 ++++++++++++++++++++---
 arch/x86/include/asm/fpu/api.h             |  1 +
 arch/x86/kernel/fpu/core.c                 | 12 +++++++++++
 10 files changed, 158 insertions(+), 19 deletions(-)

diff --git a/arch/x86/crypto/camellia_aesni_avx2_glue.c b/arch/x86/crypto/camellia_aesni_avx2_glue.c
index 60907c139c4e..0902db7d326a 100644
--- a/arch/x86/crypto/camellia_aesni_avx2_glue.c
+++ b/arch/x86/crypto/camellia_aesni_avx2_glue.c
@@ -206,6 +206,20 @@ struct crypt_priv {
 	bool fpu_enabled;
 };
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+static void camellia_fpu_end_rt(struct crypt_priv *ctx)
+{
+	bool fpu_enabled = ctx->fpu_enabled;
+
+	if (!fpu_enabled)
+		return;
+	camellia_fpu_end(fpu_enabled);
+	ctx->fpu_enabled = false;
+}
+#else
+static void camellia_fpu_end_rt(struct crypt_priv *ctx) { }
+#endif
+
 static void encrypt_callback(void *priv, u8 *srcdst, unsigned int nbytes)
 {
 	const unsigned int bsize = CAMELLIA_BLOCK_SIZE;
@@ -221,16 +235,19 @@ static void encrypt_callback(void *priv, u8 *srcdst, unsigned int nbytes)
 	}
 
 	if (nbytes >= CAMELLIA_AESNI_PARALLEL_BLOCKS * bsize) {
+		kernel_fpu_resched();
 		camellia_ecb_enc_16way(ctx->ctx, srcdst, srcdst);
 		srcdst += bsize * CAMELLIA_AESNI_PARALLEL_BLOCKS;
 		nbytes -= bsize * CAMELLIA_AESNI_PARALLEL_BLOCKS;
 	}
 
 	while (nbytes >= CAMELLIA_PARALLEL_BLOCKS * bsize) {
+		kernel_fpu_resched();
 		camellia_enc_blk_2way(ctx->ctx, srcdst, srcdst);
 		srcdst += bsize * CAMELLIA_PARALLEL_BLOCKS;
 		nbytes -= bsize * CAMELLIA_PARALLEL_BLOCKS;
 	}
+	camellia_fpu_end_rt(ctx);
 
 	for (i = 0; i < nbytes / bsize; i++, srcdst += bsize)
 		camellia_enc_blk(ctx->ctx, srcdst, srcdst);
@@ -251,16 +268,19 @@ static void decrypt_callback(void *priv, u8 *srcdst, unsigned int nbytes)
 	}
 
 	if (nbytes >= CAMELLIA_AESNI_PARALLEL_BLOCKS * bsize) {
+		kernel_fpu_resched();
 		camellia_ecb_dec_16way(ctx->ctx, srcdst, srcdst);
 		srcdst += bsize * CAMELLIA_AESNI_PARALLEL_BLOCKS;
 		nbytes -= bsize * CAMELLIA_AESNI_PARALLEL_BLOCKS;
 	}
 
 	while (nbytes >= CAMELLIA_PARALLEL_BLOCKS * bsize) {
+		kernel_fpu_resched();
 		camellia_dec_blk_2way(ctx->ctx, srcdst, srcdst);
 		srcdst += bsize * CAMELLIA_PARALLEL_BLOCKS;
 		nbytes -= bsize * CAMELLIA_PARALLEL_BLOCKS;
 	}
+	camellia_fpu_end_rt(ctx);
 
 	for (i = 0; i < nbytes / bsize; i++, srcdst += bsize)
 		camellia_dec_blk(ctx->ctx, srcdst, srcdst);
diff --git a/arch/x86/crypto/camellia_aesni_avx_glue.c b/arch/x86/crypto/camellia_aesni_avx_glue.c
index d96429da88eb..3b8e91841039 100644
--- a/arch/x86/crypto/camellia_aesni_avx_glue.c
+++ b/arch/x86/crypto/camellia_aesni_avx_glue.c
@@ -210,6 +210,21 @@ struct crypt_priv {
 	bool fpu_enabled;
 };
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+static void camellia_fpu_end_rt(struct crypt_priv *ctx)
+{
+	bool fpu_enabled = ctx->fpu_enabled;
+
+	if (!fpu_enabled)
+		re
[PATCH RT 07/22] rcu: Suppress lockdep false-positive ->boost_mtx complaints
From: "Paul E. McKenney" 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- [ Upstream commit 02a7c234e54052101164368ff981bd72f7acdd65 ] RCU priority boosting uses rt_mutex_init_proxy_locked() to initialize an rt_mutex structure in locked state held by some other task. When that other task releases it, lockdep complains (quite accurately, but a bit uselessly) that the other task never acquired it. This complaint can suppress other, more helpful, lockdep complaints, and in any case it is a false positive. This commit therefore switches from rt_mutex_unlock() to rt_mutex_futex_unlock(), thereby avoiding the lockdep annotations. Of course, if lockdep ever learns about rt_mutex_init_proxy_locked(), addtional adjustments will be required. Suggested-by: Peter Zijlstra Signed-off-by: Paul E. McKenney Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Julia Cartwright --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 3d18d08e8382..510de72ad8a3 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -486,7 +486,7 @@ void rcu_read_unlock_special(struct task_struct *t) /* Unboost if we were boosted. */ if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex) - rt_mutex_unlock(&rnp->boost_mtx); + rt_mutex_futex_unlock(&rnp->boost_mtx); /* * If this was the last task on the expedited lists, -- 2.18.0
[PATCH RT 04/22] futex: Fix OWNER_DEAD fixup
From: Peter Zijlstra

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

[ Upstream commit a97cb0e7b3f4c6297fd857055ae8e895f402f501 ]

Both Geert and DaveJ reported that the recent futex commit:

  c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex")

introduced a problem with setting OWNER_DEAD. We set the bit on an
uninitialized variable and then entirely optimize it away as a dead-store.

Move the setting of the bit to where it is more useful.

Reported-by: Geert Uytterhoeven
Reported-by: Dave Jones
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex")
Link: http://lkml.kernel.org/r/20180122103947.gd2...@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Julia Cartwright
---
 kernel/futex.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index cdd68ba6e3a6..57038131ad3f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2301,9 +2301,6 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	oldowner = pi_state->owner;
 
-	/* Owner died? */
-	if (!pi_state->owner)
-		newtid |= FUTEX_OWNER_DIED;
 	/*
 	 * We are here because either:
@@ -2364,6 +2361,9 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	}
 
 	newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
+	/* Owner died? */
+	if (!pi_state->owner)
+		newtid |= FUTEX_OWNER_DIED;
 
 	if (get_futex_value_locked(&uval, uaddr))
 		goto handle_fault;
-- 
2.18.0
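In miniature, the bug pattern being fixed (illustrative code with
hypothetical names, not the futex sources themselves):

    	unsigned int flags;		/* never initialized */

    	if (owner_died)
    		flags |= DEAD_BIT;	/* read-modify-write of garbage;
    					 * the compiler may legally treat
    					 * this as a dead store and drop it */

    	/* ... much later, the only real initialization: */
    	flags = base_value | OTHER_BIT;	/* fully overwrites flags anyway */

Because the OR happened before the variable's only defining assignment, the
bit could never survive; moving the OR after the assignment is the fix.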
[PATCH RT 16/22] block: blk-mq: move blk_queue_usage_counter_release() into process context
From: Sebastian Andrzej Siewior

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

| BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914
| in_atomic(): 1, irqs_disabled(): 0, pid: 255, name: kworker/u257:6
| 5 locks held by kworker/u257:6/255:
|  #0: ("events_unbound"){.+.+.+}, at: [] process_one_work+0x171/0x5e0
|  #1: ((&entry->work)){+.+.+.}, at: [] process_one_work+0x171/0x5e0
|  #2: (&shost->scan_mutex){+.+.+.}, at: [] __scsi_add_device+0xa3/0x130 [scsi_mod]
|  #3: (&set->tag_list_lock){+.+...}, at: [] blk_mq_init_queue+0x96a/0xa50
|  #4: (rcu_read_lock_sched){..}, at: [] percpu_ref_kill_and_confirm+0x1d/0x120
| Preemption disabled at: [] blk_mq_freeze_queue_start+0x56/0x70
|
| CPU: 2 PID: 255 Comm: kworker/u257:6 Not tainted 3.18.7-rt0+ #1
| Workqueue: events_unbound async_run_entry_fn
|  0003 8800bc29f998 815b3a12
|  8800bc29f9b8 8109aa16 8800bc29fa28
|  8800bc5d1bc8 8800bc29f9e8 815b8dd4 8800
| Call Trace:
|  [] dump_stack+0x4f/0x7c
|  [] __might_sleep+0x116/0x190
|  [] rt_spin_lock+0x24/0x60
|  [] __wake_up+0x29/0x60
|  [] blk_mq_usage_counter_release+0x1e/0x20
|  [] percpu_ref_kill_and_confirm+0x106/0x120
|  [] blk_mq_freeze_queue_start+0x56/0x70
|  [] blk_mq_update_tag_set_depth+0x40/0xd0
|  [] blk_mq_init_queue+0x98c/0xa50
|  [] scsi_mq_alloc_queue+0x20/0x60 [scsi_mod]
|  [] scsi_alloc_sdev+0x2f5/0x370 [scsi_mod]
|  [] scsi_probe_and_add_lun+0x9e4/0xdd0 [scsi_mod]
|  [] __scsi_add_device+0x126/0x130 [scsi_mod]
|  [] ata_scsi_scan_host+0xaf/0x200 [libata]
|  [] async_port_probe+0x46/0x60 [libata]
|  [] async_run_entry_fn+0x3b/0xf0
|  [] process_one_work+0x201/0x5e0

percpu_ref_kill_and_confirm() invokes blk_mq_usage_counter_release() in a
rcu-sched region. A swait-based wake queue can't be used due to the
wake_up_all() usage and disabled interrupts in !RT configs (as reported by
Corey Minyard).

The wq_has_sleeper() check has been suggested by Peter Zijlstra.

Cc: stable...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
(cherry picked from commit 2d701058d614554cce412a787f00568b9fdffade)
Signed-off-by: Julia Cartwright
---
 block/blk-core.c       | 14 +++++++++++++-
 include/linux/blkdev.h |  2 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 87d3e0a503e5..346d5bba3948 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -675,12 +675,21 @@ void blk_queue_exit(struct request_queue *q)
 	percpu_ref_put(&q->q_usage_counter);
 }
 
+static void blk_queue_usage_counter_release_swork(struct swork_event *sev)
+{
+	struct request_queue *q =
+		container_of(sev, struct request_queue, mq_pcpu_wake);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 {
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	wake_up_all(&q->mq_freeze_wq);
+	if (wq_has_sleeper(&q->mq_freeze_wq))
+		swork_queue(&q->mq_pcpu_wake);
 }
 
 static void blk_rq_timed_out_timer(unsigned long data)
@@ -751,6 +760,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	INIT_SWORK(&q->mq_pcpu_wake, blk_queue_usage_counter_release_swork);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
@@ -3556,6 +3566,8 @@ int __init blk_dev_init(void)
 	if (!kblockd_workqueue)
 		panic("Failed to create kblockd\n");
 
+	BUG_ON(swork_get());
+
 	request_cachep = kmem_cache_create("blkdev_requests",
 			sizeof(struct request), 0, SLAB_PANIC, NULL);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fdb449fe3ff7..ab039211ab9f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -24,6 +24,7 @@
 #include <linux/rcupdate.h>
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
+#include <linux/swork.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -469,6 +470,7 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	struct swork_event	mq_pcpu_wake;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
-- 
2.18.0
[PATCH RT 02/22] futex: Fix more put_pi_state() vs. exit_pi_state_list() races
From: Peter Zijlstra

4.9.115-rt94-rc1 stable review patch.
If you have any objection to the inclusion of this patch, let me know.

--- 8< --- 8< --- 8< ---

[ Upstream commit 51d00899f7e6ded15c89cb4e2cb11a35283bac81 ]

Dmitry (through syzbot) reported being able to trigger the WARN in
get_pi_state() and a use-after-free on:

	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);

Both are due to this race:

  exit_pi_state_list()				put_pi_state()

  lock(&curr->pi_lock)
  while() {
	pi_state = list_first_entry(head);
	hb = hash_futex(&pi_state->key);
	unlock(&curr->pi_lock);

						dec_and_test(&pi_state->refcount);

	lock(&hb->lock)
	lock(&pi_state->pi_mutex.wait_lock)	// uaf if pi_state free'd
	lock(&curr->pi_lock);

	....

	unlock(&curr->pi_lock);
	get_pi_state();				// WARN; refcount==0

The problem is we take the reference count too late, and don't allow it
being 0. Fix it by using inc_not_zero() and simply retrying the loop when
we fail to get a refcount. In that case put_pi_state() should remove the
entry from the list.

Reported-by: Dmitry Vyukov
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: Gratian Crisan
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: dvh...@infradead.org
Cc: syzbot
Cc: syzkaller-b...@googlegroups.com
Cc:
Fixes: c74aef2d06a9 ("futex: Fix pi_state->owner serialization")
Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdf...@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Julia Cartwright
---
 kernel/futex.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 47e42faad6c5..270148be5647 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -899,11 +899,27 @@ void exit_pi_state_list(struct task_struct *curr)
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
-
 		next = head->next;
 		pi_state = list_entry(next, struct futex_pi_state, list);
 		key = pi_state->key;
 		hb = hash_futex(&key);
+
+		/*
+		 * We can race against put_pi_state() removing itself from the
+		 * list (a waiter going away). put_pi_state() will first
+		 * decrement the reference count and then modify the list, so
+		 * its possible to see the list entry but fail this reference
+		 * acquire.
+		 *
+		 * In that case; drop the locks to let put_pi_state() make
+		 * progress and retry the loop.
+		 */
+		if (!atomic_inc_not_zero(&pi_state->refcount)) {
+			raw_spin_unlock_irq(&curr->pi_lock);
+			cpu_relax();
+			raw_spin_lock_irq(&curr->pi_lock);
+			continue;
+		}
 		raw_spin_unlock_irq(&curr->pi_lock);
 
 		spin_lock(&hb->lock);
@@ -914,10 +930,12 @@ void exit_pi_state_list(struct task_struct *curr)
 		 * task still owns the PI-state:
 		 */
 		if (head->next != next) {
+			/* retain curr->pi_lock for the loop invariant */
 			raw_spin_unlock(&curr->pi_lock);
 			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 			spin_unlock(&hb->lock);
 			raw_spin_lock_irq(&curr->pi_lock);
+			put_pi_state(pi_state);
 			continue;
 		}
 
@@ -925,9 +943,8 @@ void exit_pi_state_list(struct task_struct *curr)
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
 		pi_state->owner = NULL;
-		raw_spin_unlock(&curr->pi_lock);
-		get_pi_state(pi_state);
+		raw_spin_unlock(&curr->pi_lock);
 		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
-- 
2.18.0
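The fix is an instance of the general lookup-with-refcount idiom: never
take a plain reference on an object found through a list; instead, try to
revive it and back off if it is already dying. A generic sketch with
hypothetical names (obj, use_obj, put_obj), not the futex code itself:

    	/* under a lock (or RCU) that keeps the list structure stable */
    	list_for_each_entry(obj, head, node) {
    		if (!atomic_inc_not_zero(&obj->refcount))
    			continue;	/* racing free: refcount already hit
    					 * zero, the entry is on its way out */
    		/* obj is guaranteed live until we drop our reference */
    		use_obj(obj);
    		put_obj(obj);		/* may free obj if we were the last */
    		break;
    	}

The subtlety here, as in the futex case, is that the dying object may still
be visible on the list for a short window; atomic_inc_not_zero() is what
detects that window instead of resurrecting a refcount of zero.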
[PATCH RT 21/22] seqlock: provide the same ordering semantics as mainline
4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- The mainline implementation of read_seqbegin() orders prior loads w.r.t. the read-side critical section. Fixup the RT writer-boosting implementation to provide the same guarantee. Also, while we're here, update the usage of ACCESS_ONCE() to use READ_ONCE(). Fixes: e69f15cf77c23 ("seqlock: Prevent rt starvation") Cc: stable...@vger.kernel.org Signed-off-by: Julia Cartwright Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit afa4c06b89a3c0fb7784ff900ccd707bef519cb7) Signed-off-by: Julia Cartwright --- include/linux/seqlock.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 3d7223ffdd3b..de04d6d0face 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -461,6 +461,7 @@ static inline unsigned read_seqbegin(seqlock_t *sl) spin_unlock_wait(&sl->lock); goto repeat; } + smp_rmb(); return ret; } #endif -- 2.18.0
[PATCH RT 01/22] futex: Fix pi_state->owner serialization
From: Peter Zijlstra 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- [ Upstream commit c74aef2d06a9f59cece89093eecc552933cba72a ] There was a reported suspicion about a race between exit_pi_state_list() and put_pi_state(). The same report mentioned the comment with put_pi_state() said it should be called with hb->lock held, and it no longer is in all places. As it turns out, the pi_state->owner serialization is indeed broken. As per the new rules: 734009e96d19 ("futex: Change locking rules") pi_state->owner should be serialized by pi_state->pi_mutex.wait_lock. For the sites setting pi_state->owner we already hold wait_lock (where required) but exit_pi_state_list() and put_pi_state() were not and raced on clearing it. Fixes: 734009e96d19 ("futex: Change locking rules") Reported-by: Gratian Crisan Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Cc: dvh...@infradead.org Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20170922154806.jd3ffltfk24m4...@hirez.programming.kicks-ass.net Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Julia Cartwright --- kernel/futex.c | 34 ++ 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 8ab0ddd4cf8f..47e42faad6c5 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -819,8 +819,6 @@ static void get_pi_state(struct futex_pi_state *pi_state) /* * Drops a reference to the pi_state object and frees or caches it * when the last reference is gone. - * - * Must be called with the hb lock held. */ static void put_pi_state(struct futex_pi_state *pi_state) { @@ -835,16 +833,22 @@ static void put_pi_state(struct futex_pi_state *pi_state) * and has cleaned up the pi_state already */ if (pi_state->owner) { - raw_spin_lock_irq(&pi_state->owner->pi_lock); - list_del_init(&pi_state->list); - raw_spin_unlock_irq(&pi_state->owner->pi_lock); + struct task_struct *owner; - rt_mutex_proxy_unlock(&pi_state->pi_mutex, pi_state->owner); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + owner = pi_state->owner; + if (owner) { + raw_spin_lock(&owner->pi_lock); + list_del_init(&pi_state->list); + raw_spin_unlock(&owner->pi_lock); + } + rt_mutex_proxy_unlock(&pi_state->pi_mutex, owner); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); } - if (current->pi_state_cache) + if (current->pi_state_cache) { kfree(pi_state); - else { + } else { /* * pi_state->list is already empty. * clear pi_state->owner. 
@@ -903,14 +907,15 @@ void exit_pi_state_list(struct task_struct *curr) raw_spin_unlock_irq(&curr->pi_lock); spin_lock(&hb->lock); - - raw_spin_lock_irq(&curr->pi_lock); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + raw_spin_lock(&curr->pi_lock); /* * We dropped the pi-lock, so re-check whether this * task still owns the PI-state: */ if (head->next != next) { - raw_spin_unlock_irq(&curr->pi_lock); + raw_spin_unlock(&curr->pi_lock); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock); raw_spin_lock_irq(&curr->pi_lock); continue; @@ -920,9 +925,10 @@ void exit_pi_state_list(struct task_struct *curr) WARN_ON(list_empty(&pi_state->list)); list_del_init(&pi_state->list); pi_state->owner = NULL; - raw_spin_unlock_irq(&curr->pi_lock); + raw_spin_unlock(&curr->pi_lock); get_pi_state(pi_state); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock); rt_mutex_futex_unlock(&pi_state->pi_mutex); @@ -1204,6 +1210,10 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key, WARN_ON(!list_empty(&pi_state->list)); list_add(&pi_state->list, &p->pi_state_list); + /* +* Assignment without holding pi_state->pi_mutex.wait_lock is safe +* because there is no concurrency as the object is not published yet. +*/ pi_state->owner = p; raw_spin_unlock_irq(&p->pi_lock); -- 2.18.0
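As a reading aid, both hunks above converge on one locking rule: pi_state->owner may only be read or cleared while holding pi_state->pi_mutex.wait_lock, with the owner's pi_lock nested inside it. A distilled sketch of that rule (not kernel source):

	static void pi_state_clear_owner(struct futex_pi_state *pi_state)
	{
		struct task_struct *owner;

		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); /* serializes ->owner */
		owner = pi_state->owner;
		if (owner) {
			raw_spin_lock(&owner->pi_lock);	/* nests inside wait_lock */
			list_del_init(&pi_state->list);
			raw_spin_unlock(&owner->pi_lock);
		}
		pi_state->owner = NULL;
		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
	}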
[PATCH RT 14/22] Revert "rt,ntp: Move call to schedule_delayed_work() to helper thread"
From: Sebastian Andrzej Siewior 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- I've been looking at this in v3.10-RT where it got in. The patch description says |The ntp code for notify_cmos_timer() is called from a hard interrupt |context. I see only one caller of ntp_notify_cmos_timer() and that is do_adjtimex() after "raw_spin_unlock_irqrestore()". I see a few callers of do_adjtimex() which is SYS_adjtimex() (+compat) and posix_clock_realtime_adj() which in turn is called by SYS_clock_adjtime(). Reverting the patch. Cc: stable...@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit 932c5783d4434250a1019f49ae81b80731dfd4cd) Signed-off-by: Julia Cartwright --- kernel/time/ntp.c | 26 -- 1 file changed, 26 deletions(-) diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index 05b7391bf9bd..6df8927c58a5 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -17,7 +17,6 @@ #include #include #include -#include #include "ntp_internal.h" #include "timekeeping_internal.h" @@ -569,35 +568,10 @@ static void sync_cmos_clock(struct work_struct *work) &sync_cmos_work, timespec64_to_jiffies(&next)); } -#ifdef CONFIG_PREEMPT_RT_FULL - -static void run_clock_set_delay(struct swork_event *event) -{ - queue_delayed_work(system_power_efficient_wq, &sync_cmos_work, 0); -} - -static struct swork_event ntp_cmos_swork; - -void ntp_notify_cmos_timer(void) -{ - swork_queue(&ntp_cmos_swork); -} - -static __init int create_cmos_delay_thread(void) -{ - WARN_ON(swork_get()); - INIT_SWORK(&ntp_cmos_swork, run_clock_set_delay); - return 0; -} -early_initcall(create_cmos_delay_thread); - -#else - void ntp_notify_cmos_timer(void) { queue_delayed_work(system_power_efficient_wq, &sync_cmos_work, 0); } -#endif /* CONFIG_PREEMPT_RT_FULL */ #else void ntp_notify_cmos_timer(void) { } -- 2.18.0
[PATCH RT 13/22] net: use task_struct instead of CPU number as the queue owner on -RT
From: Sebastian Andrzej Siewior 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- In commit ("net: move xmit_recursion to per-task variable on -RT") the recursion level was changed to be per-task since we can get preempted in BH on -RT. The lock owner should consequently be recorded as the task that holds the lock and not the CPU. Otherwise we trigger the "Dead loop on virtual device" warning on SMP systems. Cc: stable...@vger.kernel.org Reported-by: Kurt Kanzenbach Tested-by: Kurt Kanzenbach Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit d3a66ffd1c4f0253076069b10a8223e7b6e80e38) Signed-off-by: Julia Cartwright --- include/linux/netdevice.h | 54 ++- net/core/dev.c| 6 - 2 files changed, 53 insertions(+), 7 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 85fc72b8a92b..6f1a3f286c4b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -594,7 +594,11 @@ struct netdev_queue { * write-mostly part */ spinlock_t _xmit_lock cacheline_aligned_in_smp; +#ifdef CONFIG_PREEMPT_RT_FULL + struct task_struct *xmit_lock_owner; +#else int xmit_lock_owner; +#endif /* * Time (in jiffies) of last Tx */ @@ -3610,41 +3614,79 @@ static inline u32 netif_msg_init(int debug_value, int default_msg_enable_bits) return (1 << debug_value) - 1; } +#ifdef CONFIG_PREEMPT_RT_FULL +static inline void netdev_queue_set_owner(struct netdev_queue *txq, int cpu) +{ + txq->xmit_lock_owner = current; +} + +static inline void netdev_queue_clear_owner(struct netdev_queue *txq) +{ + txq->xmit_lock_owner = NULL; +} + +static inline bool netdev_queue_has_owner(struct netdev_queue *txq) +{ + if (txq->xmit_lock_owner != NULL) + return true; + return false; +} + +#else + +static inline void netdev_queue_set_owner(struct netdev_queue *txq, int cpu) +{ + txq->xmit_lock_owner = cpu; +} + +static inline void netdev_queue_clear_owner(struct netdev_queue *txq) +{ + txq->xmit_lock_owner = -1; +} + +static inline bool netdev_queue_has_owner(struct netdev_queue *txq) +{ + if (txq->xmit_lock_owner != -1) + return true; + return false; +} +#endif + static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu) { spin_lock(&txq->_xmit_lock); - txq->xmit_lock_owner = cpu; + netdev_queue_set_owner(txq, cpu); } static inline void __netif_tx_lock_bh(struct netdev_queue *txq) { spin_lock_bh(&txq->_xmit_lock); - txq->xmit_lock_owner = smp_processor_id(); + netdev_queue_set_owner(txq, smp_processor_id()); } static inline bool __netif_tx_trylock(struct netdev_queue *txq) { bool ok = spin_trylock(&txq->_xmit_lock); if (likely(ok)) - txq->xmit_lock_owner = smp_processor_id(); + netdev_queue_set_owner(txq, smp_processor_id()); return ok; } static inline void __netif_tx_unlock(struct netdev_queue *txq) { - txq->xmit_lock_owner = -1; + netdev_queue_clear_owner(txq); spin_unlock(&txq->_xmit_lock); } static inline void __netif_tx_unlock_bh(struct netdev_queue *txq) { - txq->xmit_lock_owner = -1; + netdev_queue_clear_owner(txq); spin_unlock_bh(&txq->_xmit_lock); } static inline void txq_trans_update(struct netdev_queue *txq) { - if (txq->xmit_lock_owner != -1) + if (netdev_queue_has_owner(txq)) txq->trans_start = jiffies; } diff --git a/net/core/dev.c b/net/core/dev.c index 93995575d23a..e7dc4700e463 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3449,7 +3449,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv) if (dev->flags & IFF_UP) { int cpu = smp_processor_id(); /* ok 
because BHs are off */ +#ifdef CONFIG_PREEMPT_RT_FULL + if (txq->xmit_lock_owner != current) { +#else if (txq->xmit_lock_owner != cpu) { +#endif if (unlikely(xmit_rec_read() > XMIT_RECURSION_LIMIT)) goto recursion_alert; @@ -7168,7 +7172,7 @@ static void netdev_init_one_queue(struct net_device *dev, /* Initialize queue lock */ spin_lock_init(&queue->_xmit_lock); netdev_set_xmit_lockdep_class(&queue->_xmit_lock, dev->type); - queue->xmit_lock_owner = -1; + netdev_queue_clear_owner(queue); netdev_queue_numa_node_write(queue, NUMA_NO_NODE); queue->dev = dev; #ifdef CONFIG_BQL -- 2.18.0
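The conversion above boils down to one question the xmit path keeps asking: "am I already holding this txq?" A sketch of that test (the helper name is invented for illustration); on -RT a preemptible lock holder can migrate between CPUs mid-transmit, so only the task identity is a stable answer:

	static bool txq_held_by_me(struct netdev_queue *txq)
	{
	#ifdef CONFIG_PREEMPT_RT_FULL
		return txq->xmit_lock_owner == current;	/* task identity survives migration */
	#else
		return txq->xmit_lock_owner == smp_processor_id(); /* preemption is off */
	#endif
	}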
[PATCH RT 05/22] rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites
From: Boqun Feng 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- [ Upstream commit 6b0ef92fee2a3189eba6d6b827b247cb4f6da7e9 ] When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a HARDIRQ-safe->HARDIRQ-unsafe deadlock: === WARNING: inconsistent lock state 4.16.0-rc4+ #1 Not tainted inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. takes: __schedule+0xbe/0xaf0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock+0x2a/0x40 scheduler_tick+0x47/0xf0 ... other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(&rq->lock); lock(&rq->lock); *** DEADLOCK *** 1 lock held by rcu_torture_rea/724: rcu_torture_read_lock+0x0/0x70 stack backtrace: CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 Call Trace: lock_acquire+0x90/0x200 ? __schedule+0xbe/0xaf0 _raw_spin_lock+0x2a/0x40 ? __schedule+0xbe/0xaf0 __schedule+0xbe/0xaf0 preempt_schedule_irq+0x2f/0x60 retint_kernel+0x1b/0x2d RIP: 0010:rcu_read_unlock_special+0x0/0x680 ? rcu_torture_read_unlock+0x60/0x60 __rcu_read_unlock+0x64/0x70 rcu_torture_read_unlock+0x17/0x60 rcu_torture_reader+0x275/0x450 ? rcutorture_booster_init+0x110/0x110 ? rcu_torture_stall+0x230/0x230 ? kthread+0x10e/0x130 kthread+0x10e/0x130 ? kthread_create_worker_on_cpu+0x70/0x70 ? call_usermodehelper_exec_async+0x11a/0x150 ret_from_fork+0x3a/0x50 This happens with the following even sequence: preempt_schedule_irq(); local_irq_enable(); __schedule(): local_irq_disable(); // irq off ... rcu_note_context_switch(): rcu_note_preempt_context_switch(): rcu_read_unlock_special(): local_irq_save(flags); ... raw_spin_unlock_irqrestore(...,flags); // irq remains off rt_mutex_futex_unlock(): raw_spin_lock_irq(); ... raw_spin_unlock_irq(); // accidentally set irq on rq_lock(): raw_spin_lock(); // acquiring rq->lock with irq on which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause deadlocks in scheduler code. This problem was introduced by commit 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints"). That brought the user of rt_mutex_futex_unlock() with irq off. To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with *lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock() with irq off. Fixes: 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints") Signed-off-by: Boqun Feng Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Lai Jiangshan Cc: Steven Rostedt Cc: Josh Triplett Cc: Mathieu Desnoyers Cc: "Paul E . 
McKenney" Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.f...@gmail.com Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Julia Cartwright --- kernel/locking/rtmutex.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 57361d631749..5e15f5c73637 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -2213,11 +2213,12 @@ void __sched rt_mutex_futex_unlock(struct rt_mutex *lock) { WAKE_Q(wake_q); WAKE_Q(wake_sleeper_q); + unsigned long flags; bool postunlock; - raw_spin_lock_irq(&lock->wait_lock); + raw_spin_lock_irqsave(&lock->wait_lock, flags); postunlock = __rt_mutex_futex_unlock(lock, &wake_q, &wake_sleeper_q); - raw_spin_unlock_irq(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); if (postunlock) rt_mutex_postunlock(&wake_q, &wake_sleeper_q); -- 2.18.0
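The crux is that the *_irq() lock variants unconditionally re-enable interrupts on unlock — the "accidentally set irq on" step in the trace — while the irqsave/irqrestore pair preserves whatever state the caller had. A sketch of the shape of the fix:

	static void futex_unlock_sketch(raw_spinlock_t *wait_lock)
	{
		unsigned long flags;

		/*
		 * irqsave records the caller's IRQ state; irqrestore puts
		 * back exactly that state, so an irqs-off caller such as
		 * rcu_read_unlock_special() stays irqs-off afterwards.
		 */
		raw_spin_lock_irqsave(wait_lock, flags);
		/* ... build the wake queues ... */
		raw_spin_unlock_irqrestore(wait_lock, flags);
	}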
[PATCH RT 20/22] squashfs: make use of local lock in multi_cpu decompressor
4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- Currently, the squashfs multi_cpu decompressor makes use of get_cpu_ptr()/put_cpu_ptr(), which unconditionally disable preemption during decompression. Because the workload is distributed across CPUs, all CPUs can observe a very high wakeup latency, which has been seen to be as much as 8000us. Convert this decompressor to make use of a local lock, which will allow execution of the decompressor with preemption-enabled, but also ensure concurrent accesses to the percpu compressor data on the local CPU will be serialized. Cc: stable...@vger.kernel.org Reported-by: Alexander Stein Tested-by: Alexander Stein Signed-off-by: Julia Cartwright Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit c160736542d7b3d67da32848d2f028b8e35730e5) Signed-off-by: Julia Cartwright --- fs/squashfs/decompressor_multi_percpu.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/fs/squashfs/decompressor_multi_percpu.c b/fs/squashfs/decompressor_multi_percpu.c index 23a9c28ad8ea..6a73c4fa88e7 100644 --- a/fs/squashfs/decompressor_multi_percpu.c +++ b/fs/squashfs/decompressor_multi_percpu.c @@ -10,6 +10,7 @@ #include #include #include +#include #include "squashfs_fs.h" #include "squashfs_fs_sb.h" @@ -25,6 +26,8 @@ struct squashfs_stream { void*stream; }; +static DEFINE_LOCAL_IRQ_LOCK(stream_lock); + void *squashfs_decompressor_create(struct squashfs_sb_info *msblk, void *comp_opts) { @@ -79,10 +82,15 @@ int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh, { struct squashfs_stream __percpu *percpu = (struct squashfs_stream __percpu *) msblk->stream; - struct squashfs_stream *stream = get_cpu_ptr(percpu); - int res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, - offset, length, output); - put_cpu_ptr(stream); + struct squashfs_stream *stream; + int res; + + stream = get_locked_ptr(stream_lock, percpu); + + res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, + offset, length, output); + + put_locked_ptr(stream_lock, stream); if (res < 0) ERROR("%s decompression failed, data probably corrupt\n", -- 2.18.0
[PATCH RT 08/22] sched, tracing: Fix trace_sched_pi_setprio() for deboosting
From: Sebastian Andrzej Siewior 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- [ Upstream commit 4ff648decf4712d39f184fc2df3163f43975575a ] Since the following commit: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()") the sched_pi_setprio trace point shows the "newprio" during a deboost: |futex sched_pi_setprio: comm=futex_requeue_p pid=2234 oldprio=98 newprio=98 |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2234 prev_prio=120 This patch open codes __rt_effective_prio() in the tracepoint as the 'newprio' to get the old behaviour back / the correct priority: |futex sched_pi_setprio: comm=futex_requeue_p pid=2220 oldprio=98 newprio=120 |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2220 prev_prio=120 Peter suggested to open code the new priority so people using tracehook could get the deadline data out. Reported-by: Mansky Christian Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Fixes: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()") Link: http://lkml.kernel.org/r/20180524132647.gg6ziuogczdmj...@linutronix.de Signed-off-by: Ingo Molnar Signed-off-by: Julia Cartwright --- include/trace/events/sched.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 516ae88cddf4..742682079acf 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -429,7 +429,9 @@ TRACE_EVENT(sched_pi_setprio, memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); __entry->pid= tsk->pid; __entry->oldprio= tsk->prio; - __entry->newprio= pi_task ? pi_task->prio : tsk->prio; + __entry->newprio= pi_task ? + min(tsk->normal_prio, pi_task->prio) : + tsk->normal_prio; /* XXX SCHED_DEADLINE bits missing */ ), -- 2.18.0
[PATCH RT 03/22] futex: Avoid violating the 10th rule of futex
From: Peter Zijlstra 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- [ Upstream commit c1e2f0eaf015fb7076d51a339011f2383e6dd389 ] Julia reported futex state corruption in the following scenario: waiter waker stealer (prio > waiter) futex(WAIT_REQUEUE_PI, uaddr, uaddr2, timeout=[N ms]) futex_wait_requeue_pi() futex_wait_queue_me() freezable_schedule() futex(LOCK_PI, uaddr2) futex(CMP_REQUEUE_PI, uaddr, uaddr2, 1, 0) /* requeues waiter to uaddr2 */ futex(UNLOCK_PI, uaddr2) wake_futex_pi() cmp_futex_value_locked(uaddr2, waiter) wake_up_q() task> futex(LOCK_PI, uaddr2) __rt_mutex_start_proxy_lock() try_to_take_rt_mutex() /* steals lock */ rt_mutex_set_owner(lock, stealer) rt_mutex_wait_proxy_lock() __rt_mutex_slowlock() try_to_take_rt_mutex() /* fails, lock held by stealer */ if (timeout && !timeout->task) return -ETIMEDOUT; fixup_owner() /* lock wasn't acquired, so, fixup_pi_state_owner skipped */ return -ETIMEDOUT; /* At this point, we've returned -ETIMEDOUT to userspace, but the * futex word shows waiter to be the owner, and the pi_mutex has * stealer as the owner */ futex_lock(LOCK_PI, uaddr2) -> bails with EDEADLK, futex word says we're owner. And suggested that what commit: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") removes from fixup_owner() looks to be just what is needed. And indeed it is -- I completely missed that requeue_pi could also result in this case. So we need to restore that, except that subsequent patches, like commit: 16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock") changed all the locking rules. Even without that, the sequence: - if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) { - locked = 1; - goto out; - } - raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock); - owner = rt_mutex_owner(&q->pi_state->pi_mutex); - if (!owner) - owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); - raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock); - ret = fixup_pi_state_owner(uaddr, q, owner); already suggests there were races; otherwise we'd never have to look at next_owner. So instead of doing 3 consecutive wait_lock sections with who knows what races, we do it all in a single section. Additionally, the usage of pi_state->owner in fixup_owner() was only safe because only the rt_mutex owner would modify it, which this additional case wrecks. Luckily the values can only change away and not to the value we're testing, this means we can do a speculative test and double check once we have the wait_lock. Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") Reported-by: Julia Cartwright Reported-by: Gratian Crisan Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Tested-by: Julia Cartwright Tested-by: Gratian Crisan Cc: Darren Hart Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65...@hirez.programming.kicks-ass.net Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Julia Cartwright --- kernel/futex.c | 83 ++--- kernel/locking/rtmutex.c| 26 --- kernel/locking/rtmutex_common.h | 1 + 3 files changed, 87 insertions(+), 23 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 270148be5647..cdd68ba6e3a6 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2287,21 +2287,17 @@ static void unqueue_me_pi(struct futex_q *q) spin_unlock(q->
[PATCH RT 15/22] Revert "block: blk-mq: Use swait"
From: Sebastian Andrzej Siewior 4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- This reverts commit "block: blk-mq: Use swait". The issue remains but will be fixed differently. Cc: stable...@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior (cherry-picked from ca3fd6cf836739fd59eac2f7a9b0261365e818bb) Signed-off-by: Julia Cartwright --- block/blk-core.c | 10 +- block/blk-mq.c | 6 +++--- include/linux/blkdev.h | 2 +- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index e4ac43392875..87d3e0a503e5 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -662,9 +662,9 @@ int blk_queue_enter(struct request_queue *q, bool nowait) if (nowait) return -EBUSY; - swait_event(q->mq_freeze_wq, - !atomic_read(&q->mq_freeze_depth) || - blk_queue_dying(q)); + wait_event(q->mq_freeze_wq, + !atomic_read(&q->mq_freeze_depth) || + blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; } @@ -680,7 +680,7 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref) struct request_queue *q = container_of(ref, struct request_queue, q_usage_counter); - swake_up_all(&q->mq_freeze_wq); + wake_up_all(&q->mq_freeze_wq); } static void blk_rq_timed_out_timer(unsigned long data) @@ -750,7 +750,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) q->bypass_depth = 1; __set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags); - init_swait_queue_head(&q->mq_freeze_wq); + init_waitqueue_head(&q->mq_freeze_wq); /* * Init percpu_ref in atomic mode so that it's faster to shutdown. diff --git a/block/blk-mq.c b/block/blk-mq.c index e0a804ab5420..3a49552974ec 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -72,7 +72,7 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start); static void blk_mq_freeze_queue_wait(struct request_queue *q) { - swait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter)); + wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter)); } /* @@ -110,7 +110,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q) WARN_ON_ONCE(freeze_depth < 0); if (!freeze_depth) { percpu_ref_reinit(&q->q_usage_counter); - swake_up_all(&q->mq_freeze_wq); + wake_up_all(&q->mq_freeze_wq); } } EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue); @@ -129,7 +129,7 @@ void blk_mq_wake_waiters(struct request_queue *q) * dying, we need to ensure that processes currently waiting on * the queue are notified as well. */ - swake_up_all(&q->mq_freeze_wq); + wake_up_all(&q->mq_freeze_wq); } bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 34fd1ed9845e..fdb449fe3ff7 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -468,7 +468,7 @@ struct request_queue { struct throtl_data *td; #endif struct rcu_head rcu_head; - struct swait_queue_head mq_freeze_wq; + wait_queue_head_t mq_freeze_wq; struct percpu_ref q_usage_counter; struct list_headall_q_node; -- 2.18.0
[PATCH RT 00/22] Linux 4.9.115-rt94-rc1
Hello RT folks! This patchset brings back many RT-specific fixes that have gone into subsequent 4.14-rt and 4.16-rt releases. One of my x86 boxes very intermittently triggers a WARN_ON() on bootup in migrate_enable(), which I'm still trying to triage. If you can more reliably reproduce this, please let me know. This release candidate will not be pushed to the git tree. To build 4.9.115-rt94-rc1 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.9.115.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.9/patch-4.9.115-rt94-rc1.patch.xz If all goes well with testing, this rc will be promoted to an official release on 8/16/2018. Please go forth and test! Thanks, Julia --- Boqun Feng (1): rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites Julia Cartwright (4): locallock: provide {get,put}_locked_ptr() variants squashfs: make use of local lock in multi_cpu decompressor seqlock: provide the same ordering semantics as mainline Linux 4.9.115-rt94-rc1 Paul E. McKenney (1): rcu: Suppress lockdep false-positive ->boost_mtx complaints Peter Zijlstra (4): futex: Fix pi_state->owner serialization futex: Fix more put_pi_state() vs. exit_pi_state_list() races futex: Avoid violating the 10th rule of futex futex: Fix OWNER_DEAD fixup Sebastian Andrzej Siewior (12): rcu: Do not include rtmutex_common.h unconditionally sched, tracing: Fix trace_sched_pi_setprio() for deboosting crypto: limit more FPU-enabled sections arm*: disable NEON in kernel mode mm/slub: close possible memory-leak in kmem_cache_alloc_bulk() locking: add types.h net: use task_struct instead of CPU number as the queue owner on -RT Revert "rt,ntp: Move call to schedule_delayed_work() to helper thread" Revert "block: blk-mq: Use swait" block: blk-mq: move blk_queue_usage_counter_release() into process context alarmtimer: Prevent live lock in alarm_cancel() posix-timers: move the rcu head out of the union arch/arm/Kconfig | 2 +- arch/arm64/crypto/Kconfig | 14 +- arch/x86/crypto/camellia_aesni_avx2_glue.c | 20 +++ arch/x86/crypto/camellia_aesni_avx_glue.c | 19 +++ arch/x86/crypto/cast6_avx_glue.c | 24 +++- arch/x86/crypto/chacha20_glue.c| 9 +- arch/x86/crypto/serpent_avx2_glue.c| 19 +++ arch/x86/crypto/serpent_avx_glue.c | 23 +++- arch/x86/crypto/serpent_sse2_glue.c| 23 +++- arch/x86/crypto/twofish_avx_glue.c | 27 +++- arch/x86/include/asm/fpu/api.h | 1 + arch/x86/kernel/fpu/core.c | 12 ++ block/blk-core.c | 22 +++- block/blk-mq.c | 6 +- fs/squashfs/decompressor_multi_percpu.c| 16 ++- include/linux/blkdev.h | 4 +- include/linux/locallock.h | 10 ++ include/linux/netdevice.h | 54 +++- include/linux/posix-timers.h | 2 +- include/linux/seqlock.h| 1 + include/linux/spinlock_types_raw.h | 2 + include/trace/events/sched.h | 4 +- kernel/futex.c | 144 - kernel/locking/rtmutex.c | 31 +++-- kernel/locking/rtmutex_common.h| 1 + kernel/rcu/tree_plugin.h | 5 +- kernel/time/alarmtimer.c | 2 +- kernel/time/ntp.c | 26 kernel/time/posix-timers.c | 4 +- localversion-rt| 2 +- mm/slub.c | 1 + net/core/dev.c | 6 +- 32 files changed, 412 insertions(+), 124 deletions(-) -- 2.18.0
[PATCH RT 22/22] Linux 4.9.115-rt94-rc1
4.9.115-rt94-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- --- localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/localversion-rt b/localversion-rt index e98a1fe050bd..dcc2fd2ca155 100644 --- a/localversion-rt +++ b/localversion-rt @@ -1 +1 @@ --rt93 +-rt94-rc1 -- 2.18.0
Re: simple-framebuffer enquire
On Tue, Jun 26, 2018 at 08:44:58PM +0200, Hans de Goede wrote: > Hi, > > On 26-06-18 18:35, Michael Nazzareno Trimarchi wrote: [..] > > cat memblock/reserved > > 0: 0x80004000..0x80007fff > > 1: 0x8010..0x81e030b3 > > 2: 0x8300..0x83007fff > > 3: 0x8400..0x85ff > > 4: 0x86fa2000..0x87021fff > > > > + reserved-memory { > > + #address-cells = <1>; > > + #size-cells = <1>; > > + ranges; > > + > > + display_reserved: framebuffer@86fa2000 { > > + reg = <0x86fa2000 0x8>; > > + }; > > + > > + }; [..] > > Still have the same on ioremap. > > Hmm, I guess the kernel does map the entire region its get > passed and simply makes sure to not touch the reserved mem, > where as with the changes to the passed in mem-region the > sunxi u-boot code does the memory does not get mapped by > the kernel at all ? If the intent is to reserve memory _and_ prevent it from being included in the kernel's linear map, then it is also necessary to include the 'no-map' property for this reserved-mem node. From Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt: no-map (optional) - empty property - Indicates the operating system must not create a virtual mapping of the region as part of its standard mapping of system memory, nor permit speculative access to it under any circumstances other than under the control of the device driver using the region. Julia
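Concretely, the node quoted above would then become something like the following sketch (the size cell is derived from the memblock dump earlier in the thread, 0x86fa2000..0x87021fff, i.e. 0x80000 bytes, and should be double-checked):

	reserved-memory {
		#address-cells = <1>;
		#size-cells = <1>;
		ranges;

		display_reserved: framebuffer@86fa2000 {
			reg = <0x86fa2000 0x80000>;
			no-map;
		};
	};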
[ANNOUNCE] 4.9.98-rt76
Hello RT Folks! I'm pleased to announce the 4.9.98-rt76 stable release. This release is just an update to the new stable 4.9.98 version and no RT specific changes have been made. Expect a 4.9.98-rt77-rc1 with backports from rt-devel soon. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.9-rt Head SHA1: bc64e52891b8b9e56b19cbb800a1b8c415df13e8 Or to build 4.9.98-rt76 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.9.98.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.9/patch-4.9.98-rt76.patch.xz Enjoy! Julia
Re: [LINUX PATCH v8 2/2] memory: pl353: Add driver for arm pl353 static memory controller
On Mon, May 07, 2018 at 10:12:28AM +, Naga Sureshkumar Relli wrote: > Hi Julia, > > Thanks for reviewing the patch and Sorry for my late reply. This patch > went to junk folder, hence I didn't catch this patch. > > From: Julia Cartwright [mailto:ju...@ni.com] [..] > > > > It would be easier to follow if you constructed your two patchsets with git > > format-patch -- > > thread. > > > > I am using the same but with out --thread. > > > Or, merge them into a single patchset, especially considering the > > dependency between patches. > > But both are different patches, one for Device tree documentation and other > for driver update. Yes, I'm not proposing you merge _patches_ but _patchsets_. You have two patchsets, one for the SMC driver, and another for the NAND. Given that they depend on one another, it's helpful for reviewers if you sent them all together, with a cover letter which describes the entire patchset, changelog, it's dependencies, revision changelog, etc. Something like: [PATCH v9 0/4] rawnand: memory: add support for PL353 static memory controller + NAND [PATCH v9 1/4] Devicetree: Add pl353 smc controller devicetree binding information [PATCH v9 2/4] memory: pl353: Add driver for arm pl353 static memory controller [PATCH v9 3/4] Documentation: nand: pl353: Add documentation for controller and driver [PATCH v9 4/4] mtd: rawnand: pl353: Add basic driver for arm pl353 smc nand interface Anyway, just with --thread enabled would be an improvement. [..] > > > --- a/drivers/memory/Kconfig > > > +++ b/drivers/memory/Kconfig > > > @@ -152,6 +152,13 @@ config DA8XX_DDRCTL > > > This driver is for the DDR2/mDDR Memory Controller present on > > > Texas Instruments da8xx SoCs. It's used to tweak various memory > > > controller configuration options. > > > +config PL35X_SMC > > > + bool "ARM PL35X Static Memory Controller(SMC) driver" > > > > Is there any reason why this can't be tristate? > > There is a Nand driver which uses this driver. i.e The NAND driver > Depends on this driver. That's true, but it's irrelevant to question I asked. It is perfectly valid for both the SMC and NAND drivers to be tristate, why are you not allowing this configuration? Julia
[PATCH RT 1/2] locallock: provide {get,put}_locked_ptr() variants
Provide a set of locallocked accessors for pointers to per-CPU data; this is useful for dynamically-allocated per-CPU regions, for example. These are symmetric with the {get,put}_cpu_ptr() per-CPU accessor variants. Signed-off-by: Julia Cartwright --- include/linux/locallock.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/locallock.h b/include/linux/locallock.h index d658c2552601..921eab83cd34 100644 --- a/include/linux/locallock.h +++ b/include/linux/locallock.h @@ -222,6 +222,14 @@ static inline int __local_unlock_irqrestore(struct local_irq_lock *lv, #define put_locked_var(lvar, var) local_unlock(lvar); +#define get_locked_ptr(lvar, var) \ + ({ \ + local_lock(lvar); \ + this_cpu_ptr(var); \ + }) + +#define put_locked_ptr(lvar, var) local_unlock(lvar); + #define local_lock_cpu(lvar) \ ({ \ local_lock(lvar); \ @@ -262,6 +270,8 @@ static inline void local_irq_lock_init(int lvar) { } #define get_locked_var(lvar, var) get_cpu_var(var) #define put_locked_var(lvar, var) put_cpu_var(var) +#define get_locked_ptr(lvar, var) get_cpu_ptr(var) +#define put_locked_ptr(lvar, var) put_cpu_ptr(var) #define local_lock_cpu(lvar) get_cpu() #define local_unlock_cpu(lvar) put_cpu() -- 2.17.0
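A hedged usage sketch of the new accessors (the percpu counter and all names here are invented for illustration): on RT the local lock keeps the section preemptible while still serializing access to the local CPU's data; on !RT it collapses to the plain get_cpu_ptr()/put_cpu_ptr() pair, i.e. preempt_disable():

	static DEFINE_LOCAL_IRQ_LOCK(stats_lock);
	static struct my_stats __percpu *stats;	/* from alloc_percpu() */

	static void stats_add(unsigned long val)
	{
		struct my_stats *s;

		s = get_locked_ptr(stats_lock, stats);
		s->sum += val;
		put_locked_ptr(stats_lock, s);
	}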
[PATCH RT 2/2] squashfs: make use of local lock in multi_cpu decompressor
Currently, the squashfs multi_cpu decompressor makes use of get_cpu_ptr()/put_cpu_ptr(), which unconditionally disable preemption during decompression. Because the workload is distributed across CPUs, all CPUs can observe a very high wakeup latency, which has been seen to be as much as 8000us. Convert this decompressor to make use of a local lock, which will allow execution of the decompressor with preemption-enabled, but also ensure concurrent accesses to the percpu compressor data on the local CPU will be serialized. Reported-by: Alexander Stein Tested-by: Alexander Stein Signed-off-by: Julia Cartwright --- fs/squashfs/decompressor_multi_percpu.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/fs/squashfs/decompressor_multi_percpu.c b/fs/squashfs/decompressor_multi_percpu.c index 23a9c28ad8ea..6a73c4fa88e7 100644 --- a/fs/squashfs/decompressor_multi_percpu.c +++ b/fs/squashfs/decompressor_multi_percpu.c @@ -10,6 +10,7 @@ #include #include #include +#include #include "squashfs_fs.h" #include "squashfs_fs_sb.h" @@ -25,6 +26,8 @@ struct squashfs_stream { void*stream; }; +static DEFINE_LOCAL_IRQ_LOCK(stream_lock); + void *squashfs_decompressor_create(struct squashfs_sb_info *msblk, void *comp_opts) { @@ -79,10 +82,15 @@ int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh, { struct squashfs_stream __percpu *percpu = (struct squashfs_stream __percpu *) msblk->stream; - struct squashfs_stream *stream = get_cpu_ptr(percpu); - int res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, - offset, length, output); - put_cpu_ptr(stream); + struct squashfs_stream *stream; + int res; + + stream = get_locked_ptr(stream_lock, percpu); + + res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, + offset, length, output); + + put_locked_ptr(stream_lock, stream); if (res < 0) ERROR("%s decompression failed, data probably corrupt\n", -- 2.17.0
[PATCH RT] seqlock: provide the same ordering semantics as mainline
The mainline implementation of read_seqbegin() orders prior loads w.r.t. the read-side critical section. Fixup the RT writer-boosting implementation to provide the same guarantee. Also, while we're here, update the usage of ACCESS_ONCE() to use READ_ONCE(). Fixes: e69f15cf77c23 ("seqlock: Prevent rt starvation") Cc: stable...@vger.kernel.org Signed-off-by: Julia Cartwright --- Found during code inspection of the RT seqlock implementation. Julia include/linux/seqlock.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index a59751276b94..597ce5a9e013 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -453,7 +453,7 @@ static inline unsigned read_seqbegin(seqlock_t *sl) unsigned ret; repeat: - ret = ACCESS_ONCE(sl->seqcount.sequence); + ret = READ_ONCE(sl->seqcount.sequence); if (unlikely(ret & 1)) { /* * Take the lock and let the writer proceed (i.e. evtl @@ -462,6 +462,7 @@ static inline unsigned read_seqbegin(seqlock_t *sl) spin_unlock_wait(&sl->lock); goto repeat; } + smp_rmb(); return ret; } #endif -- 2.16.1
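To see what the barrier buys, here is a user-space approximation of the reader side (C11 atomics standing in for READ_ONCE()/smp_rmb(); a sketch of the idea, not the kernel implementation):

	#include <stdatomic.h>

	extern atomic_uint seq;		/* odd while a writer is active */

	static unsigned my_read_seqbegin(void)
	{
		unsigned ret;

		while ((ret = atomic_load_explicit(&seq,
				memory_order_relaxed)) & 1)
			;	/* writer in progress, wait */

		/*
		 * The smp_rmb() analogue: order the sequence load before
		 * the data loads inside the read section, so a torn
		 * update cannot be observed alongside an even sequence.
		 */
		atomic_thread_fence(memory_order_acquire);
		return ret;
	}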
Re: [LINUX PATCH v8 2/2] memory: pl353: Add driver for arm pl353 static memory controller
Hello- On Wed, Mar 14, 2018 at 04:14:34PM +0530, nagasureshkumarre...@gmail.com wrote: > From: Naga Sureshkumar Relli I'm pleased to see this patchset revived and resubmitted! It would be easier to follow if you constructed your two patchsets with git format-patch --thread. Or, merge them into a single patchset, especially considering the dependency between patches. Some code review comments below: > Add driver for arm pl353 static memory controller. This controller is > used in xilinx zynq soc for interfacing the nand and nor/sram memory > devices. > > Signed-off-by: Naga Sureshkumar Relli > --- > Changes in v8: > - None > Changes in v7: > - Corrected the kconfig to use tristate selection > - Corrected the GPL licence ident > - Added boundary checks for nand timing parameters > Changes in v6: > - Fixed checkpatch.pl reported warnings > Changes in v5: > - Added pl353_smc_get_clkrate function, made pl353_smc_set_cycles as public > API > - Removed nand timing parameter initialization and moved it to nand driver > Changes in v4: > - Modified driver to support multiple instances > - Used sleep instaed of busywait for delay > Changes in v3: > - None > Changes in v2: > - Since now the timing parameters are in nano seconds, added logic to convert > them to the cycles > --- > drivers/memory/Kconfig | 7 + > drivers/memory/Makefile | 1 + > drivers/memory/pl353-smc.c | 548 > > include/linux/platform_data/pl353-smc.h | 34 ++ > 4 files changed, 590 insertions(+) > create mode 100644 drivers/memory/pl353-smc.c > create mode 100644 include/linux/platform_data/pl353-smc.h > > diff --git a/drivers/memory/Kconfig b/drivers/memory/Kconfig > index 19a0e83..d70d6db 100644 > --- a/drivers/memory/Kconfig > +++ b/drivers/memory/Kconfig > @@ -152,6 +152,13 @@ config DA8XX_DDRCTL > This driver is for the DDR2/mDDR Memory Controller present on > Texas Instruments da8xx SoCs. It's used to tweak various memory > controller configuration options. > +config PL35X_SMC > + bool "ARM PL35X Static Memory Controller(SMC) driver" Is there any reason why this can't be tristate? [..] > +++ b/drivers/memory/pl353-smc.c [..] > +/** > + * struct pl353_smc_data - Private smc driver structure > + * @devclk: Pointer to the peripheral clock > + * @aperclk: Pointer to the APER clock > + */ > +struct pl353_smc_data { > + struct clk *memclk; > + struct clk *aclk; > +}; > + > +/* SMC virtual register base */ > +static void __iomem *pl353_smc_base; While it's true that the Zynq chips only have a single instance of this IP, there is no real reason why an SoC can't instantiate more than one. I'm a bit uncomfortable with the fact that this has been designed out. > + > +/** > + * pl353_smc_set_buswidth - Set memory buswidth > + * @bw: Memory buswidth (8 | 16) > + * Return: 0 on success or negative errno. > + */ > +int pl353_smc_set_buswidth(unsigned int bw) > +{ > + > + if (bw != PL353_SMC_MEM_WIDTH_8 && bw != PL353_SMC_MEM_WIDTH_16) > + return -EINVAL; > + > + writel(bw, pl353_smc_base + PL353_SMC_SET_OPMODE_OFFS); There should be no reason not to use the _relaxed() accessor variants. 
> + writel(PL353_SMC_DC_UPT_NAND_REGS, pl353_smc_base + > +PL353_SMC_DIRECT_CMD_OFFS); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(pl353_smc_set_buswidth); > + > +/** > + * pl353_smc_set_cycles - Set memory timing parameters > + * @t0: t_rcread cycle time > + * @t1: t_wcwrite cycle time > + * @t2: t_rea/t_ceoeoutput enable assertion delay > + * @t3: t_wpwrite enable deassertion delay > + * @t4: t_clr/t_pc page cycle time > + * @t5: t_ar/t_ta ID read time/turnaround time > + * @t6: t_rrbusy to RE timing > + * > + * Sets NAND chip specific timing parameters. > + */ > +static void pl353_smc_set_cycles(u32 t0, u32 t1, u32 t2, u32 t3, u32 > + t4, u32 t5, u32 t6) > +{ > + t0 &= PL353_SMC_SET_CYCLES_T0_MASK; > + t1 = (t1 & PL353_SMC_SET_CYCLES_T1_MASK) << > + PL353_SMC_SET_CYCLES_T1_SHIFT; > + t2 = (t2 & PL353_SMC_SET_CYCLES_T2_MASK) << > + PL353_SMC_SET_CYCLES_T2_SHIFT; > + t3 = (t3 & PL353_SMC_SET_CYCLES_T3_MASK) << > + PL353_SMC_SET_CYCLES_T3_SHIFT; > + t4 = (t4 & PL353_SMC_SET_CYCLES_T4_MASK) << > + PL353_SMC_SET_CYCLES_T4_SHIFT; > + t5 = (t5 & PL353_SMC_SET_CYCLES_T5_MASK) << > + PL353_SMC_SET_CYCLES_T5_SHIFT; > + t6 = (t6 & PL353_SMC_SET_CYCLES_T6_MASK) << > + PL353_SMC_SET_CYCLES_T6_SHIFT; > + > + t0 |= t1 | t2 | t3 | t4 | t5 | t6; > + > + writel(t0, pl353_smc_base + PL353_SMC_SET_CYCLES_OFFS); > + writel(PL353_SMC_DC_UPT_NAND_REGS, pl353_smc_base + > +
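To make the _relaxed() suggestion concrete, the buswidth helper could read as follows (a sketch; assumes these register writes have no ordering requirement against DMA memory, which appears to be the case here):

	int pl353_smc_set_buswidth(unsigned int bw)
	{
		if (bw != PL353_SMC_MEM_WIDTH_8 && bw != PL353_SMC_MEM_WIDTH_16)
			return -EINVAL;

		/* Plain register writes to a single device; writes to the
		 * same endpoint remain ordered, so _relaxed() suffices. */
		writel_relaxed(bw, pl353_smc_base + PL353_SMC_SET_OPMODE_OFFS);
		writel_relaxed(PL353_SMC_DC_UPT_NAND_REGS,
			       pl353_smc_base + PL353_SMC_DIRECT_CMD_OFFS);

		return 0;
	}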
Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
On Fri, Mar 23, 2018 at 01:21:31PM -0400, joe.ko...@concurrent-rt.com wrote: > On Fri, Mar 23, 2018 at 11:59:21AM -0500, Julia Cartwright wrote: > > On Fri, Mar 23, 2018 at 11:09:59AM -0400, joe.ko...@concurrent-rt.com wrote: > > > I see the below kernel splat in 4.9-rt when I run a test program that > > > continually changes the affinity of some set of running pids: > > > > > >do not call blocking ops when !TASK_RUNNING; state=2 set at ... > > > ... > > > stop_one_cpu+0x60/0x80 > > > migrate_enable+0x21f/0x3e0 > > > rt_spin_unlock+0x2f/0x40 > > > prepare_to_wait+0x5c/0x80 > > > ... > > > > This is clearly a problem. > > > > > The reason is that spin_unlock, write_unlock, and read_unlock call > > > migrate_enable, and since 4.4-rt, migrate_enable will sleep if it > > > discovers > > > that a migration is in order. But sleeping in the unlock services is not > > > expected by most kernel developers, > > > > I don't buy this, see below: > > > > > and where that counts most is in code sequences like the following: > > > > > > set_current_state(TASK_UNINTERRUPIBLE); > > > spin_unlock(&s); > > > schedule(); > > > > The analog in mainline is CONFIG_PREEMPT and the implicit > > preempt_enable() in spin_unlock(). In this configuration, a kernel > > developer should _absolutely_ expect their task to be suspended (and > > potentially migrated), _regardless of the task state_ if there is a > > preemption event on the CPU on which this task is executing. > > > > Similarly, on RT, there is nothing _conceptually_ wrong on RT with > > migrating on migrate_enable(), regardless of task state, if there is a > > pending migration event. > > My understanding is, in standard Linux and in rt, setting > task state to anything other than TASK_RUNNING in of itself > blocks preemption. I'm assuming you're referring to the window of time between a task setting its state to !TASK_RUNNING and schedule()? The task remains preemptible in this window. > A preemption is not really needed here as it is expected that there is > a schedule() written in that will shortly be executed. > > And if a 'involuntary schedule' (ie, preemption) were allowed to occur > between the task state set and the schedule(), that would change the > task state back to TASK_RUNNING; This isn't the case. A preempted task preserves its state. Julia
Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
Hey Joe- Thanks for the writeup. On Fri, Mar 23, 2018 at 11:09:59AM -0400, joe.ko...@concurrent-rt.com wrote: > I see the below kernel splat in 4.9-rt when I run a test program that > continually changes the affinity of some set of running pids: > >do not call blocking ops when !TASK_RUNNING; state=2 set at ... > ... > stop_one_cpu+0x60/0x80 > migrate_enable+0x21f/0x3e0 > rt_spin_unlock+0x2f/0x40 > prepare_to_wait+0x5c/0x80 > ... This is clearly a problem. > The reason is that spin_unlock, write_unlock, and read_unlock call > migrate_enable, and since 4.4-rt, migrate_enable will sleep if it discovers > that a migration is in order. But sleeping in the unlock services is not > expected by most kernel developers, I don't buy this, see below: > and where that counts most is in code sequences like the following: > > set_current_state(TASK_UNINTERRUPIBLE); > spin_unlock(&s); > schedule(); The analog in mainline is CONFIG_PREEMPT and the implicit preempt_enable() in spin_unlock(). In this configuration, a kernel developer should _absolutely_ expect their task to be suspended (and potentially migrated), _regardless of the task state_ if there is a preemption event on the CPU on which this task is executing. Similarly, on RT, there is nothing _conceptually_ wrong on RT with migrating on migrate_enable(), regardless of task state, if there is a pending migration event. It's clear, however, that the mechanism used here is broken ... Julia
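For reference, the canonical pattern under discussion (a generic sketch, not lifted from any particular call site); the point is that an involuntary preemption anywhere in this window leaves ->state untouched:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		/*
		 * A preemption here suspends the task but preserves its
		 * !TASK_RUNNING state; only the explicit schedule() below
		 * interprets that state as "go to sleep".
		 */
		spin_unlock(&s);
		schedule();
		spin_lock(&s);
	}
	__set_current_state(TASK_RUNNING);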
Re: [PATCH v2 1/8] [PATCH 1/8] drivers/peci: Add support for PECI bus driver core
On Wed, Feb 21, 2018 at 08:15:59AM -0800, Jae Hyun Yoo wrote: > This commit adds driver implementation for PECI bus into linux > driver framework. > > Signed-off-by: Jae Hyun Yoo > --- [..] > +static int peci_locked_xfer(struct peci_adapter *adapter, > + struct peci_xfer_msg *msg, > + bool do_retry, > + bool has_aw_fcs) _locked generally means that this function is invoked with some critical lock held, what lock does the caller need to acquire before invoking this function? > +{ > + ktime_t start, end; > + s64 elapsed_ms; > + int rc = 0; > + > + if (!adapter->xfer) { Is this really an optional feature of an adapter? If this is not optional, then this check should be in place when the adapter is registered, not here. (And it should WARN_ON(), because it's a driver developer error). > + dev_dbg(&adapter->dev, "PECI level transfers not supported\n"); > + return -ENODEV; > + } > + > + if (in_atomic() || irqs_disabled()) { As Andrew mentioned, this is broken. You don't even need a might_sleep(). The locking functions you use here will already include a might_sleep() w/ CONFIG_DEBUG_ATOMIC_SLEEP. > + rt_mutex_trylock(&adapter->bus_lock); > + if (!rc) > + return -EAGAIN; /* PECI activity is ongoing */ > + } else { > + rt_mutex_lock(&adapter->bus_lock); > + } > + > + if (do_retry) > + start = ktime_get(); > + > + do { > + rc = adapter->xfer(adapter, msg); > + > + if (!do_retry) > + break; > + > + /* Per the PECI spec, need to retry commands that return 0x8x */ > + if (!(!rc && ((msg->rx_buf[0] & DEV_PECI_CC_RETRY_ERR_MASK) == > + DEV_PECI_CC_TIMEOUT))) > + break; This is pretty difficult to parse. Can you split it into two different conditions? > + > + /* Set the retry bit to indicate a retry attempt */ > + msg->tx_buf[1] |= DEV_PECI_RETRY_BIT; Are you sure this bit is to be set in the _second_ byte of tx_buf? 
> + > + /* Recalculate the AW FCS if it has one */ > + if (has_aw_fcs) > + msg->tx_buf[msg->tx_len - 1] = 0x80 ^ > + peci_aw_fcs((u8 *)msg, > + 2 + msg->tx_len); > + > + /* Retry for at least 250ms before returning an error */ > + end = ktime_get(); > + elapsed_ms = ktime_to_ms(ktime_sub(end, start)); > + if (elapsed_ms >= DEV_PECI_RETRY_TIME_MS) { > + dev_dbg(&adapter->dev, "Timeout retrying xfer!\n"); > + break; > + } > + } while (true); > + > + rt_mutex_unlock(&adapter->bus_lock); > + > + return rc; > +} > + > +static int peci_xfer(struct peci_adapter *adapter, struct peci_xfer_msg *msg) > +{ > + return peci_locked_xfer(adapter, msg, false, false); > +} > + > +static int peci_xfer_with_retries(struct peci_adapter *adapter, > + struct peci_xfer_msg *msg, > + bool has_aw_fcs) > +{ > + return peci_locked_xfer(adapter, msg, true, has_aw_fcs); > +} > + > +static int peci_scan_cmd_mask(struct peci_adapter *adapter) > +{ > + struct peci_xfer_msg msg; > + u32 dib; > + int rc = 0; > + > + /* Update command mask just once */ > + if (adapter->cmd_mask & BIT(PECI_CMD_PING)) > + return 0; > + > + msg.addr = PECI_BASE_ADDR; > + msg.tx_len= GET_DIB_WR_LEN; > + msg.rx_len= GET_DIB_RD_LEN; > + msg.tx_buf[0] = GET_DIB_PECI_CMD; > + > + rc = peci_xfer(adapter, &msg); > + if (rc < 0) { > + dev_dbg(&adapter->dev, "PECI xfer error, rc : %d\n", rc); > + return rc; > + } > + > + dib = msg.rx_buf[0] | (msg.rx_buf[1] << 8) | > + (msg.rx_buf[2] << 16) | (msg.rx_buf[3] << 24); > + > + /* Check special case for Get DIB command */ > + if (dib == 0x00) { > + dev_dbg(&adapter->dev, "DIB read as 0x00\n"); > + return -1; > + } > + > + if (!rc) { You should change this to: if (rc) { dev_dbg(&adapter->dev, "Error reading DIB, rc : %d\n", rc); return rc; } And then leave the happy path below unindented. > + /** > + * setting up the supporting commands based on minor rev# > + * see PECI Spec Table 3-1 > + */ > + dib = (dib >> 8) & 0xF; > + > + if (dib >= 0x1) { > + adapter->cmd_mask |= BIT(PECI_CMD_RD_PKG_CFG); > + adapter->cmd_mask |= BIT(PECI_CMD_WR_PKG_CFG); > + } > + > + if (dib >= 0x2) > + adapter->cmd_mask |= BI
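Tying the two locking comments together, the entry to peci_locked_xfer() would then look something like this sketch (the in_atomic() usage itself still needs rethinking, as noted above):

	if (in_atomic() || irqs_disabled()) {
		/* Don't block; PECI activity may be ongoing. */
		if (!rt_mutex_trylock(&adapter->bus_lock))
			return -EAGAIN;
	} else {
		rt_mutex_lock(&adapter->bus_lock);
	}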
Re: C tricks for efficient stack zeroing
On Fri, Mar 02, 2018 at 08:50:17PM +0100, Jason A. Donenfeld wrote: [..] > What would be really nice would be to somehow keep track of the > maximum stack depth, and just before the function returns, clear from > the maximum depth to its stack base, all in one single call. This > would not only make the code faster and less brittle, but it would > also clean up some algorithms quite a bit. > > Ideally this would take the form of a gcc attribute on the function, > but I was unable to find anything of that nature. I started looking > for little C tricks for this, and came up dry too. I realize I could > probably just take the current stack address and zero out until _the > very end_ but that seems to overshoot and would probably be bad for > performance. The best I've been able to do come up with are some > x86-specific macros, but that approach seems a bit underwhelming. > Other approaches include adding a new attribute via the gcc plugin > system, which could make this kind of thing more complete [cc'ing > pipacs in case he's thought about that before]. Can objtool support a static stack usage analysis? I'm wondering if it's possible to place these sensitive functions in a special linker section, like .text.stackzero.<fn>; objtool could collect static call data (as it already does) and stack usage, spitting out a symbol definition stackzero_<fn>_max_depth, which you could then use to bound your zeroing. Obviously this is a static analysis, with the limitations therein. Julia
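For illustration only, the "zero from the current stack address" idea dismissed above as overshooting might look roughly like this (assumes a downward-growing stack and an externally supplied depth bound; entirely hypothetical, not proposed kernel code):

	#include <string.h>

	static void __attribute__((noinline)) zero_used_stack(size_t max_depth)
	{
		/*
		 * The frame address approximates the current stack
		 * pointer; the max_depth bytes below it are abandoned
		 * stack that may still hold sensitive spills.  A kernel
		 * version would use memzero_explicit() so the store
		 * cannot be optimized away.
		 */
		char *sp = __builtin_frame_address(0);

		memset(sp - max_depth, 0, max_depth);
	}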
[ANNOUNCE] 4.9.84-rt62
Hello RT Folks! I'm pleased to announce the 4.9.84-rt62 stable release. This release is just an update to the new stable 4.9.84 version and no RT specific changes have been made. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.9-rt Head SHA1: be42cd02846a611af533103a3f4b6a7d8c592f49 Or to build 4.9.84-rt62 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.9.84.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.9/patch-4.9.84-rt62.patch.xz Enjoy! Julia
Re: [PATCH 1/2] kernel/sofirq: consolidate common code in __tasklet_schedule() + _hi_
On Thu, Feb 15, 2018 at 03:07:07PM -0500, Steven Rostedt wrote: > On Thu, 15 Feb 2018 18:20:41 +0100 > Sebastian Andrzej Siewior wrote: > > > -void __tasklet_schedule(struct tasklet_struct *t) > > +static void __tasklet_schedule_common(struct tasklet_struct *t, > > + struct tasklet_head *head, > > + unsigned int softirq_nr) > > { > > unsigned long flags; > > > > local_irq_save(flags); > > If you look at the original patch, it did not move local_irq_save() > into the common function. > > > t->next = NULL; > > - *__this_cpu_read(tasklet_vec.tail) = t; > > - __this_cpu_write(tasklet_vec.tail, &(t->next)); > > - raise_softirq_irqoff(TASKLET_SOFTIRQ); > > + *head->tail = t; > > + head->tail = &(t->next); > > + raise_softirq_irqoff(softirq_nr); > > local_irq_restore(flags); > > } > > + > > +void __tasklet_schedule(struct tasklet_struct *t) > > +{ > > + __tasklet_schedule_common(t, this_cpu_ptr(&tasklet_vec), > > What can happen is, we reference (tasklet_vec) on one CPU, get > preempted (running in ksoftirqd), scheduled on another CPU, then when > inside the common code, we are executing on a different CPU than the > tasklet is for. The rasise_softirq() is happening on the wrong CPU. > > The local_irq_save() can't be moved to the common function. It must be > done by each individual function. Well, it can be, but the percpu access needs to go with it; so an alternative solution would be the below. I'm also wondering whether the t->next = NULL assignment could be lifted out from irqs-disabled region. Julia diff --git a/kernel/softirq.c b/kernel/softirq.c index fa7ed89a9fcf..177de3640c78 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -461,12 +461,14 @@ static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec); static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec); static void __tasklet_schedule_common(struct tasklet_struct *t, - struct tasklet_head *head, + struct tasklet_head __percpu *headp, unsigned int softirq_nr) { + struct tasklet_head *head; unsigned long flags; local_irq_save(flags); + head = this_cpu_ptr(headp); t->next = NULL; *head->tail = t; head->tail = &(t->next); @@ -476,14 +478,14 @@ static void __tasklet_schedule_common(struct tasklet_struct *t, void __tasklet_schedule(struct tasklet_struct *t) { - __tasklet_schedule_common(t, this_cpu_ptr(&tasklet_vec), + __tasklet_schedule_common(t, &tasklet_vec, TASKLET_SOFTIRQ); } EXPORT_SYMBOL(__tasklet_schedule); void __tasklet_hi_schedule(struct tasklet_struct *t) { - __tasklet_schedule_common(t, this_cpu_ptr(&tasklet_hi_vec), + __tasklet_schedule_common(t, &tasklet_hi_vec, HI_SOFTIRQ); } EXPORT_SYMBOL(__tasklet_hi_schedule);
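Distilled, the hazard Steven describes and the fix above come down to the order of two operations (sketch):

	/* Racy: ksoftirqd can be preempted and migrated between the
	 * percpu dereference and the IRQ-off section: */
	head = this_cpu_ptr(&tasklet_vec);	/* CPU may change after this */
	local_irq_save(flags);

	/* Safe (what the diff above does): IRQs off pins the CPU, so
	 * the list and raise_softirq_irqoff() target the same CPU: */
	local_irq_save(flags);
	head = this_cpu_ptr(&tasklet_vec);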
[PATCH 4.9.y] ubifs: Massage assert in ubifs_xattr_set() wrt. init_xattrs
From: Xiaolei Li This is a conceptual cherry-pick of commit d8db5b1ca9d4c57e49893d0f78e6d5ce81450cc8 upstream. The inode is not locked in init_xattrs when creating a new inode. Without this patch, there will occurs assert when booting or creating a new file, if the kernel config CONFIG_SECURITY_SMACK is enabled. Log likes: UBIFS assert failed in ubifs_xattr_set at 298 (pid 1156) CPU: 1 PID: 1156 Comm: ldconfig Tainted: G S 4.12.0-rc1-207440-g1e70b02 #2 Hardware name: MediaTek MT2712 evaluation board (DT) Call trace: [] dump_backtrace+0x0/0x238 [] show_stack+0x14/0x20 [] dump_stack+0x9c/0xc0 [] ubifs_xattr_set+0x374/0x5e0 [] init_xattrs+0x5c/0xb8 [] security_inode_init_security+0x110/0x190 [] ubifs_init_security+0x30/0x68 [] ubifs_mkdir+0x100/0x200 [] vfs_mkdir+0x11c/0x1b8 [] SyS_mkdirat+0x74/0xd0 [] __sys_trace_return+0x0/0x4 Signed-off-by: Xiaolei Li Signed-off-by: Richard Weinberger Cc: sta...@vger.kernel.org (julia: massaged to apply to 4.9.y, which doesn't contain fscrypto support) Signed-off-by: Julia Cartwright --- Hey all- We reproduced the issue fixed upstream by Xiaolei Li's commit in 4.9.y, with the very similar backtrace: UBIFS assert failed in __ubifs_setxattr at 282 (pid 1362) CPU: 1 PID: 1362 Comm: sed Not tainted 4.9.47-rt37 #1 Backtrace: (dump_backtrace) from (show_stack+0x20/0x24) (show_stack) from (dump_stack+0x80/0xa0) (dump_stack) from (__ubifs_setxattr+0x84/0x634) (__ubifs_setxattr) from (init_xattrs+0x70/0xac) (init_xattrs) from (security_inode_init_security+0x100/0x144) (security_inode_init_security) from (ubifs_init_security+0x38/0x6c) (ubifs_init_security) from (ubifs_create+0xe8/0x1fc) (ubifs_create) from (path_openat+0x97c/0x1090) (path_openat) from (do_filp_open+0x48/0x94) (do_filp_open) from (do_sys_open+0x134/0x1d8) (do_sys_open) from (SyS_open+0x30/0x34) (SyS_open) from (ret_fast_syscall+0x0/0x3c) Please consider applying this to 4.9.y at the very least, it may apply further back as well. Thanks! Julia fs/ubifs/xattr.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c index d9f9615bfd71..3979d767a0cb 100644 --- a/fs/ubifs/xattr.c +++ b/fs/ubifs/xattr.c @@ -270,7 +270,8 @@ static struct inode *iget_xattr(struct ubifs_info *c, ino_t inum) } static int __ubifs_setxattr(struct inode *host, const char *name, - const void *value, size_t size, int flags) + const void *value, size_t size, int flags, + bool check_lock) { struct inode *inode; struct ubifs_info *c = host->i_sb->s_fs_info; @@ -279,7 +280,8 @@ static int __ubifs_setxattr(struct inode *host, const char *name, union ubifs_key key; int err; - ubifs_assert(inode_is_locked(host)); + if (check_lock) + ubifs_assert(inode_is_locked(host)); if (size > UBIFS_MAX_INO_DATA) return -ERANGE; @@ -548,7 +550,8 @@ static int init_xattrs(struct inode *inode, const struct xattr *xattr_array, } strcpy(name, XATTR_SECURITY_PREFIX); strcpy(name + XATTR_SECURITY_PREFIX_LEN, xattr->name); - err = __ubifs_setxattr(inode, name, xattr->value, xattr->value_len, 0); + err = __ubifs_setxattr(inode, name, xattr->value, + xattr->value_len, 0, false); kfree(name); if (err < 0) break; @@ -594,7 +597,8 @@ static int ubifs_xattr_set(const struct xattr_handler *handler, name = xattr_full_name(handler, name); if (value) - return __ubifs_setxattr(inode, name, value, size, flags); + return __ubifs_setxattr(inode, name, value, size, flags, + true); else return __ubifs_removexattr(inode, name); } -- 2.16.1
Re: [patch v18 1/4] drivers: jtag: Add JTAG core driver
On Mon, Jan 29, 2018 at 04:31:42PM +0200, Oleksandr Shamray wrote: > Initial patch for JTAG driver > JTAG class driver provide infrastructure to support hardware/software > JTAG platform drivers. It provide user layer API interface for flashing > and debugging external devices which equipped with JTAG interface > using standard transactions. > > Driver exposes set of IOCTL to user space for: > - XFER: > - SIR (Scan Instruction Register, IEEE 1149.1 Data Register scan); > - SDR (Scan Data Register, IEEE 1149.1 Instruction Register scan); > - RUNTEST (Forces the IEEE 1149.1 bus to a run state for a specified > number of clocks). > - SIOCFREQ/GIOCFREQ for setting and reading JTAG frequency. > > Driver core provides set of internal APIs for allocation and > registration: > - jtag_register; > - jtag_unregister; > - jtag_alloc; > - jtag_free; > > Platform driver on registration with jtag-core creates the next > entry in dev folder: > /dev/jtagX > > Signed-off-by: Oleksandr Shamray > Signed-off-by: Jiri Pirko > Acked-by: Philippe Ombredanne [..] > diff --git a/drivers/jtag/jtag.c b/drivers/jtag/jtag.c [..] > +struct jtag *jtag_alloc(size_t priv_size, const struct jtag_ops *ops) > +{ > + struct jtag *jtag; > + > + jtag = kzalloc(sizeof(*jtag) + priv_size, GFP_KERNEL); > + if (!jtag) > + return NULL; > + > + if (!ops) > + return NULL; > + > + if (!ops->idle || !ops->mode_set || !ops->status_get || !ops->xfer) > + return NULL; Did you think through this? You leak 'jtag' here and above. Perform all the ops checks prior to the allocation. Julia
Re: [patch v17 1/4] drivers: jtag: Add JTAG core driver
Hello Oleksandr- On Tue, Jan 16, 2018 at 09:18:56AM +0200, Oleksandr Shamray wrote: [..] > v16->v17 > Comments pointed by Julia Cartwright More review feedback below: [..] > +++ b/drivers/jtag/jtag.c [..] > +static long jtag_ioctl(struct file *file, unsigned int cmd, unsigned long > arg) > +{ > + struct jtag *jtag = file->private_data; > + struct jtag_run_test_idle idle; > + struct jtag_xfer xfer; > + u8 *xfer_data; > + u32 data_size; > + u32 value; > + int err; > + > + if (!arg) > + return -EINVAL; > + > + switch (cmd) { > + case JTAG_GIOCFREQ: > + if (!jtag->ops->freq_get) > + err = -EOPNOTSUPP; Did you mean: return -EOPNOTSUPP; ? > + > + err = jtag->ops->freq_get(jtag, &value); Otherwise your check was worthless; you'll call NULL here. Also, w.r.t. the set of ops which are required to be implemented: this isn't the right place to do the check. Instead, do it in jtag_alloc(): struct jtag *jtag_alloc(size_t priv_size, const struct jtag_ops *ops) { struct jtag *jtag; if (!ops->freq_get || !ops->xfer || ...) /* fixup condition */ return NULL; jtag = kzalloc(sizeof(*jtag) + priv_size, GFP_KERNEL); if (!jtag) return NULL; jtag->ops = ops; return jtag; } EXPORT_SYMBOL_GPL(jtag_alloc); [..] > + case JTAG_IOCXFER: [..] > + data_size = DIV_ROUND_UP(xfer.length, BITS_PER_BYTE); > + xfer_data = memdup_user(u64_to_user_ptr(xfer.tdio), data_size); > + > + if (!xfer_data) memdup_user() doesn't return NULL on error. You need to check for IS_ERR(xfer_data). > + return -EFAULT; > + > + err = jtag->ops->xfer(jtag, &xfer, xfer_data); > + if (err) { > + kfree(xfer_data); > + return -EFAULT; > + } > + > + err = copy_to_user(u64_to_user_ptr(xfer.tdio), > +(void *)(xfer_data), data_size); > + > + if (err) { > + kfree(xfer_data); > + return -EFAULT; > + } > + > + kfree(xfer_data); Move the kfree() above the if (err). > + if (copy_to_user((void *)arg, &xfer, sizeof(struct jtag_xfer))) > + return -EFAULT; > + break; > + > + case JTAG_GIOCSTATUS: > + if (!jtag->ops->status_get) > + return -EOPNOTSUPP; > + > + err = jtag->ops->status_get(jtag, &value); > + if (err) > + break; > + > + err = put_user(value, (__u32 *)arg); > + if (err) > + err = -EFAULT; put_user() returns -EFAULT on failure, so this shouldn't be necessary. [..] > --- /dev/null > +++ b/include/uapi/linux/jtag.h [..] > +/** > + * struct jtag_xfer - jtag xfer: > + * > + * @type: transfer type > + * @direction: xfer direction > + * @length: xfer bits len > + * @tdio : xfer data array > + * @endir: xfer end state > + * > + * Structure represents interface to JTAG device for jtag sdr xfer > + * execution. > + */ > +struct jtag_xfer { > + __u8type; > + __u8direction; > + __u8endstate; Just to be as unambiguous as possible, considering this is ABI, I would suggest explicitly putting a padding byte here. > + __u32 length; > + __u64 tdio; > +}; Thanks, Julia
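Folding the three JTAG_IOCXFER review points above together, the error handling would end up roughly as follows. This is a sketch against the patch under review, not the code that eventually landed:

	xfer_data = memdup_user(u64_to_user_ptr(xfer.tdio), data_size);
	if (IS_ERR(xfer_data))
		return PTR_ERR(xfer_data);	/* memdup_user() never returns NULL */

	err = jtag->ops->xfer(jtag, &xfer, xfer_data);
	if (!err && copy_to_user(u64_to_user_ptr(xfer.tdio), xfer_data,
				 data_size))
		err = -EFAULT;

	kfree(xfer_data);			/* single free point, on all paths */
	if (err)
		return err;

	if (copy_to_user((void __user *)arg, &xfer, sizeof(struct jtag_xfer)))
		return -EFAULT;
	break;

The key points: IS_ERR()/PTR_ERR() rather than a NULL check, and exactly one kfree() reachable from every path so no branch can leak or double-free the buffer.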
Re: [ANNOUNCE] 4.9.76-rt61
On Tue, Jan 16, 2018 at 11:11:32PM +0000, Bernhard Landauer wrote: > Ah. Working now after adding in validpgpkeys array in the PKGBUILD. This was the first v4.9-rt release since Steven handed over his maintainership, so that would explain why it would be signed by a different key (although my key has been signed by Steven's). Thanks, Julia
[ANNOUNCE] 4.9.76-rt61
Hello RT Folks! I'm pleased to announce the 4.9.76-rt61 stable release. This release is just an update to the new stable 4.9.76 version and no RT specific changes have been made. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.9-rt Head SHA1: 90816cc1d4a1d23efe37b74866c6174dd5eab6b5 Or to build 4.9.76-rt61 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.9.76.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.9/patch-4.9.76-rt61.patch.xz Enjoy, Julia
Re: [patch v16 1/4] drivers: jtag: Add JTAG core driver
Hello Oleksandr- I have a few comments below. On Fri, Jan 12, 2018 at 07:08:26PM +0200, Oleksandr Shamray wrote: > Initial patch for JTAG driver > JTAG class driver provide infrastructure to support hardware/software > JTAG platform drivers. It provide user layer API interface for flashing > and debugging external devices which equipped with JTAG interface > using standard transactions. > > Driver exposes set of IOCTL to user space for: > - XFER: > - SIR (Scan Instruction Register, IEEE 1149.1 Data Register scan); > - SDR (Scan Data Register, IEEE 1149.1 Instruction Register scan); > - RUNTEST (Forces the IEEE 1149.1 bus to a run state for a specified > number of clocks). > - SIOCFREQ/GIOCFREQ for setting and reading JTAG frequency. > > Driver core provides set of internal APIs for allocation and > registration: > - jtag_register; > - jtag_unregister; > - jtag_alloc; > - jtag_free; > > Platform driver on registration with jtag-core creates the next > entry in dev folder: > /dev/jtagX > > Signed-off-by: Oleksandr Shamray > Signed-off-by: Jiri Pirko > Acked-by: Philippe Ombredanne [..] > +static long jtag_ioctl(struct file *file, unsigned int cmd, unsigned long > arg) > +{ > + struct jtag *jtag = file->private_data; > + struct jtag_run_test_idle idle; > + struct jtag_xfer xfer; > + u8 *xfer_data; > + u32 data_size; > + u32 value; > + int err; > + > + if (!arg) > + return -EINVAL; > + > + switch (cmd) { > + case JTAG_GIOCFREQ: > + if (!jtag->ops->freq_get) > + err = -EOPNOTSUPP; > + > + err = jtag->ops->freq_get(jtag, &value); > + if (err) > + break; > + > + if (put_user(value, (__u32 *)arg)) > + err = -EFAULT; > + break; > + > + case JTAG_SIOCFREQ: > + if (!jtag->ops->freq_set) > + return -EOPNOTSUPP; > + > + if (get_user(value, (__u32 *)arg)) > + return -EFAULT; > + if (value == 0) > + return -EINVAL; > + > + err = jtag->ops->freq_set(jtag, value); > + break; > + > + case JTAG_IOCRUNTEST: > + if (!jtag->ops->idle) > + return -EOPNOTSUPP; > + > + if (copy_from_user(&idle, (void *)arg, > +sizeof(struct jtag_run_test_idle))) > + return -EFAULT; > + > + if (idle.endstate > JTAG_STATE_PAUSEDR) > + return -EINVAL; > + > + err = jtag->ops->idle(jtag, &idle); > + break; > + > + case JTAG_IOCXFER: > + if (!jtag->ops->xfer) Are all ops optional? That seems bizarre. I would have expected at least one callback to be required. [..] > +static int jtag_open(struct inode *inode, struct file *file) > +{ > + struct jtag *jtag = container_of(file->private_data, struct jtag, > + miscdev); > + > + if (mutex_lock_interruptible(&jtag->open_lock)) > + return -ERESTARTSYS; > + > + if (jtag->opened) { > + mutex_unlock(&jtag->open_lock); > + return -EBUSY; > + } > + > + nonseekable_open(inode, file); > + file->private_data = jtag; These two can be moved out of the lock. > + jtag->opened = true; > + mutex_unlock(&jtag->open_lock); > + return 0; > +} > + > +static int jtag_release(struct inode *inode, struct file *file) > +{ > + struct jtag *jtag = file->private_data; > + > + mutex_lock(&jtag->open_lock); > + jtag->opened = false; > + mutex_unlock(&jtag->open_lock); > + return 0; > +} > + > +static const struct file_operations jtag_fops = { > + .owner = THIS_MODULE, > + .open = jtag_open, > + .release= jtag_release, > + .llseek = noop_llseek, > + .unlocked_ioctl = jtag_ioctl, > +}; > + > +struct jtag *jtag_alloc(size_t priv_size, const struct jtag_ops *ops) > +{ > + struct jtag *jtag; > + > + jtag = kzalloc(sizeof(*jtag), GFP_KERNEL); Did you mean to allocate: sizeof(*jtag) + priv_size? 
> + if (!jtag) > + return NULL; > + > + jtag->ops = ops; > + return jtag; > +} > +EXPORT_SYMBOL_GPL(jtag_alloc); > + > +void jtag_free(struct jtag *jtag) > +{ > + kfree(jtag); > +} > +EXPORT_SYMBOL_GPL(jtag_free); > + > +int jtag_register(struct jtag *jtag) > +{ > + char *name; > + int err; > + int id; > + > + id = ida_simple_get(&jtag_ida, 0, 0, GFP_KERNEL); > + if (id < 0) > + return id; > + > + jtag->id = id; > + > + name = kzalloc(MAX_JTAG_NAME_LEN, GFP_KERNEL); > + if (!name) { > + err = -ENOMEM; > + goto err_jtag_alloc; > + } > + > + err = snprintf(name, MAX_JTAG_NAME_LEN, "jtag%d", id); > + if (err < 0) > + goto err_jtag_name; > + > + mutex_init(&jtag->ope
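The sizeof(*jtag) vs. sizeof(*jtag) + priv_size point raised above is the usual trailing-private-data idiom. A sketch of what was presumably intended, using the fields visible in the quoted code; the flexible-array member and the jtag_priv() accessor are illustrative assumptions, not taken from the patch:

struct jtag {
	const struct jtag_ops	*ops;
	int			id;
	bool			opened;
	struct mutex		open_lock;
	struct miscdevice	miscdev;
	/* driver-private area of priv_size bytes follows the struct */
	unsigned long		priv[];
};

struct jtag *jtag_alloc(size_t priv_size, const struct jtag_ops *ops)
{
	/* one allocation covers the core struct plus the private area */
	struct jtag *jtag = kzalloc(sizeof(*jtag) + priv_size, GFP_KERNEL);

	if (!jtag)
		return NULL;

	jtag->ops = ops;
	return jtag;
}

static inline void *jtag_priv(struct jtag *jtag)
{
	return jtag->priv;	/* where a platform driver stashes its state */
}

Allocating only sizeof(*jtag), as the quoted code does, would hand platform drivers a private area that overruns the allocation.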
Re: [PATCH] futex: Avoid violating the 10th rule of futex
On Fri, Dec 08, 2017 at 01:49:39PM +0100, Peter Zijlstra wrote: > On Thu, Dec 07, 2017 at 05:02:40PM -0600, Gratian Crisan wrote: [..] > Assuming nothing bad happens; find below the patch with a Changelog > attached. > > --- > Subject: futex: Avoid violating the 10th rule of futex > From: Peter Zijlstra > Date: Thu Dec 7 16:54:23 CET 2017 > > Julia reported futex state corruption in the following scenario: > >waiter waker > stealer (prio > waiter) > >futex(WAIT_REQUEUE_PI, uaddr, uaddr2, > timeout=[N ms]) > futex_wait_requeue_pi() > futex_wait_queue_me() > freezable_schedule() > >futex(LOCK_PI, uaddr2) >futex(CMP_REQUEUE_PI, uaddr, > uaddr2, 1, 0) > /* requeues waiter to uaddr2 */ >futex(UNLOCK_PI, uaddr2) > wake_futex_pi() > > cmp_futex_value_locked(uaddr2, waiter) > wake_up_q() > > clears sleeper->task> > > futex(LOCK_PI, uaddr2) > > __rt_mutex_start_proxy_lock() > >try_to_take_rt_mutex() /* steals lock */ > > rt_mutex_set_owner(lock, stealer) > > > > rt_mutex_wait_proxy_lock() > __rt_mutex_slowlock() >try_to_take_rt_mutex() /* fails, lock held by stealer */ >if (timeout && !timeout->task) > return -ETIMEDOUT; > fixup_owner() >/* lock wasn't acquired, so, > fixup_pi_state_owner skipped */ > >return -ETIMEDOUT; > >/* At this point, we've returned -ETIMEDOUT to userspace, but the > * futex word shows waiter to be the owner, and the pi_mutex has > * stealer as the owner */ > >futex_lock(LOCK_PI, uaddr2) > -> bails with EDEADLK, futex word says we're owner. > > And suggested that what commit: > > 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") > > removes from fixup_owner() looks to be just what is needed. And indeed > it is -- I completely missed that requeue_pi could also result in this > case. So we need to restore that, except that subsequent patches, like > commit: > > 16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock") > > changed all the locking rules. Even without that, the sequence: > > - if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) { > - locked = 1; > - goto out; > - } > > - raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock); > - owner = rt_mutex_owner(&q->pi_state->pi_mutex); > - if (!owner) > - owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); > - raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock); > - ret = fixup_pi_state_owner(uaddr, q, owner); > > already suggests there were races; otherwise we'd never have to look > at next_owner. > > So instead of doing 3 consecutive wait_lock sections with who knows > what races, we do it all in a single section. Additionally, the usage > of pi_state->owner in fixup_owner() was only safe because only the > rt_mutex owner would modify it, which this additional case wrecks. > > Luckily the values can only change away and not to the value we're > testing, this means we can do a speculative test and double check once > we have the wait_lock. > > Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") > Reported-and-Tested-by: Julia Cartwright > Reported-and-Tested-by: Gratian Crisan > Signed-off-by: Peter Zijlstra (Intel) Hey Peter- We've been running w/ this patch now without further regression. I was expecting to see this hit 4.15-rc* eventually, but haven't seen it land anywhere. Is this in the queue for 4.16, then? Thanks! Julia
Re: [ANNOUNCE] 4.1.46-rt52
On Mon, Dec 11, 2017 at 03:01:39PM -0600, Corey Minyard wrote: > On 12/11/2017 02:20 PM, Julia Cartwright wrote: > > On Mon, Dec 11, 2017 at 01:04:58PM -0600, Corey Minyard wrote: [..] > > > > > > > > Sebastian Andrzej Siewior (2): > > > > PM / CPU: replace raw_notifier with atomic_notifier (fixup) > > > This change breaks the compile of 4.1 on ARM (and other things using > > > kernel/cpu_pm.c). > > > > > > rcu_irq_enter_irqson() and friends doesn't exist in 4.1. In fact, > > > rcu_irq_enter() doesn't even exist. > > > > > > Sorry I didn't get to testing this earlier, I've been focused on other > > > things. > > Thanks! My current builds apparently don't currently cover CPU_PM > > builds, so I didn't see this :(. Thanks for the report. > > > > v4.1 does contain rcu_irq_enter(), and it looks like what we should be > > using in 4.1 instead. > > Ah, it does, I looked in the wrong place. > > > I'm thinking the below should be okay; I'll be standing up a proper cpu > > hotplugging test to give it some runtime. > > The below should work. You'll need an ARM64, ARM or MIPS system with > the proper config settings to exercise this code, though. Shouldn't be a problem, I've got an ARM64 board that I've been meaning to integrate into my testing anyway. Thanks, Julia
Re: [ANNOUNCE] 4.1.46-rt52
On Mon, Dec 11, 2017 at 01:04:58PM -0600, Corey Minyard wrote: > On 11/29/2017 05:13 PM, Julia Cartwright wrote: > > Hello RT Folks! > > > > I'm pleased to announce the 4.1.46-rt52 stable release. > > > > You can get this release via the git tree at: > > > >git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git > > > >branch: v4.1-rt > >Head SHA1: 6e737a91c1ce923d4e10db831e14cfef9a2459cc > > > > Or to build 4.1.46-rt52 directly, the following patches should be applied: > > > > http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz > > > > http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.46.xz > > > > http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.46-rt52.patch.xz > > > > You can also build from 4.1.46-rt51 by applying the incremental patch: > > > > http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/incr/patch-4.1.46-rt51-rt52.patch.xz > > > > Enjoy! > > Julia > > > > Changes from v4.1.46-rt51: > > --- > > Julia Cartwright (2): > >workqueue: fixup rcu check for RT > >Linux 4.1.46-rt52 > > > > Sebastian Andrzej Siewior (2): > >PM / CPU: replace raw_notifier with atomic_notifier (fixup) > > This change breaks the compile of 4.1 on ARM (and other things using > kernel/cpu_pm.c). > > rcu_irq_enter_irqson() and friends doesn't exist in 4.1. In fact, > rcu_irq_enter() doesn't even exist. > > Sorry I didn't get to testing this earlier, I've been focused on other things. Thanks! My current builds apparently don't currently cover CPU_PM builds, so I didn't see this :(. Thanks for the report. v4.1 does contain rcu_irq_enter(), and it looks like what we should be using in 4.1 instead. I'm thinking the below should be okay; I'll be standing up a proper cpu hotplugging test to give it some runtime. Thanks, Julia -- 8< -- Subject: [PATCH] PM / CPU: replace usage of rcu_irq_{enter,exit}_irqson() Commit c50243950f97e ("PM / CPU: replace raw_notifier with atomic_notifier (fixup)") intended to fix up RCU usage in idle, by telling RCU to pay attention. However, the cherry-pick back from rt-devel included the invocation of the _irqson() variants of the rcu_irq_{enter,exit}() API, which don't exist in 4.1. For 4.1, simply drop the _irqson() variant and use rcu_irq_{enter,exit}() directly. This is safe, because in 4.1, these functions internally disable interrupts. Reported-by: Corey Minyard Signed-off-by: Julia Cartwright
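The diff itself wasn't included above, but from the description (and the cpu_pm.c hunk visible in the -rt52 announcement later in the thread) the 4.1 fixup would presumably look like this sketch:

static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls)
{
	int ret;

	/*
	 * Let RCU know this CPU is not idle across the notifier call.
	 * In v4.1, rcu_irq_enter()/rcu_irq_exit() disable interrupts
	 * internally, so the _irqson() variants (which don't exist in
	 * this tree) aren't needed.
	 */
	rcu_irq_enter();
	ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL,
					   nr_to_call, nr_calls);
	rcu_irq_exit();

	return notifier_to_errno(ret);
}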
Re: PI futexes + lock stealing woes
On Thu, Dec 07, 2017 at 08:57:59AM -0600, Gratian Crisan wrote: > > Peter Zijlstra writes: > > The below compiles and boots, but is otherwise untested. Could you give > > it a spin? > > Thank you! Yes, I'll start a test now. I gave it a spin with my minimal reproducing case w/ the strategic msleep(). Previously, I could reproduce it in less than a minute in that setup; now it's been running for a couple hours. So ... looks good? :) Thanks, Julia
[PATCH] trace-cmd: add plugin for decoding syscalls/sys_enter_futex
The futex syscall is a complicated one. It supports thirteen multiplexed operations, each with different semantics and encodings for the syscalls six arguments. Manually decoding these arguments is tedious and error prone. This plugin provides symbolic names for futex operations, futex flags, and tries to be intelligent about the intent of specific arguments (for example, waking operations use 'val' as an integer count, not just an arbitrary value). It doesn't do a full decode of the FUTEX_WAKE_OP's 'val3' argument, however, this is a good starting point. Signed-off-by: Julia Cartwright --- Makefile | 1 + plugin_futex.c | 135 + 2 files changed, 136 insertions(+) create mode 100644 plugin_futex.c diff --git a/Makefile b/Makefile index 5c35143c8867..58b957d9a578 100644 --- a/Makefile +++ b/Makefile @@ -365,6 +365,7 @@ PLUGIN_OBJS += plugin_kvm.o PLUGIN_OBJS += plugin_mac80211.o PLUGIN_OBJS += plugin_sched_switch.o PLUGIN_OBJS += plugin_function.o +PLUGIN_OBJS += plugin_futex.o PLUGIN_OBJS += plugin_xen.o PLUGIN_OBJS += plugin_scsi.o PLUGIN_OBJS += plugin_cfg80211.o diff --git a/plugin_futex.c b/plugin_futex.c new file mode 100644 index ..b12c92d91a86 --- /dev/null +++ b/plugin_futex.c @@ -0,0 +1,135 @@ +/* + * Copyright (C) 2017 National Instruments Corp. + * + * Author: Julia Cartwright + * + * ~~ + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; + * version 2.1 of the License (not later!) + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see <http://www.gnu.org/licenses> + * + * ~~ + */ +#include +#include +#include +#include + +#include "trace-cmd.h" + +struct futex_args { + unsigned long long uaddr; + unsigned long long op; + unsigned long long val; + unsigned long long utime; /* or val2 */ + unsigned long long uaddr2; + unsigned long long val3; +}; + +struct futex_op { + const char *name; + const char *fmt_val; + const char *fmt_utime; + const char *fmt_uaddr2; + const char *fmt_val3; +}; + +static const struct futex_op futex_op_tbl[] = { + {"FUTEX_WAIT", " val=0x%08llx", " utime=0x%08llx", NULL, NULL }, + {"FUTEX_WAKE", " val=%llu", NULL, NULL, NULL }, + { "FUTEX_FD", " val=%llu", NULL, NULL, NULL }, + { "FUTEX_REQUEUE", " val=%llu", " val2=%llu", " uaddr2=0x%08llx", NULL }, + { "FUTEX_CMP_REQUEUE", " val=%llu", " val2=%llu", " uaddr2=0x%08llx", " val3=0x%08llx" }, + { "FUTEX_WAKE_OP", " val=%llu", " val2=%llu", " uaddr2=0x%08llx", " val3=0x%08llx" }, + { "FUTEX_LOCK_PI",NULL, " utime=0x%08llx", NULL, NULL }, + { "FUTEX_UNLOCK_PI",NULL, NULL, NULL, NULL }, + { "FUTEX_TRYLOCK_PI",NULL, NULL, NULL, NULL }, + { "FUTEX_WAIT_BITSET", " val=0x%08llx", " utime=0x%08llx", NULL, " val3=0x%08llx" }, + { "FUTEX_WAKE_BITSET", " val=%llu", NULL, NULL, " val3=0x%08llx" }, + { "FUTEX_WAIT_REQUEUE_PI", " val=0x%08llx", " utime=0x%08llx", " uaddr2=0x%08llx", " val3=0x%08llx" }, + { "FUTEX_CMP_REQUEUE_PI", " val=%llu", " val2=%llu", " uaddr2=0x%08llx", " val3=0x%08llx" }, +}; + + +static void futex_print(struct trace_seq *s, const struct futex_args *args, + const struct futex_op *fop) +{ + trace_seq_printf(s, " uaddr=0x%08llx", args->uaddr); + + if (fop->fmt_val) +
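The tail of the diff is truncated in the archive, but the shape of the decoding is clear from the table above: mask the command bits out of 'op', index futex_op_tbl, then print the modifier flags. A sketch of that usual split, using the uapi futex.h constants; whether the plugin's futex_print() does precisely this is not shown in what survives of the patch:

#include <stdio.h>

#define FUTEX_PRIVATE_FLAG	128
#define FUTEX_CLOCK_REALTIME	256
#define FUTEX_CMD_MASK		(~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME))

static void print_futex_op(unsigned long long op)
{
	/* low bits select one of the thirteen multiplexed operations */
	unsigned int cmd = op & FUTEX_CMD_MASK;

	if (cmd < sizeof(futex_op_tbl) / sizeof(futex_op_tbl[0]))
		printf("%s", futex_op_tbl[cmd].name);
	else
		printf("FUTEX_UNKNOWN(%u)", cmd);

	/* high bits are OR'd modifier flags, printed separately */
	if (op & FUTEX_PRIVATE_FLAG)
		printf("|FUTEX_PRIVATE_FLAG");
	if (op & FUTEX_CLOCK_REALTIME)
		printf("|FUTEX_CLOCK_REALTIME");
}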
[PATCH v2 2/3] net: macb: reduce scope of rx_fs_lock-protected regions
Commit ae8223de3df5 ("net: macb: Added support for RX filtering") introduces a lock, rx_fs_lock which is intended to protect the list of rx_flow items and synchronize access to the hardware rx filtering registers. However, the region protected by this lock is overscoped, unnecessarily including things like slab allocation. Reduce this lock scope to only include operations which must be performed atomically: list traversal, addition, and removal, and hitting the macb filtering registers. This fixes the use of kmalloc w/ GFP_KERNEL in atomic context. Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering") Cc: Rafal Ozieblo Cc: Julia Lawall Acked-by: Nicolas Ferre Signed-off-by: Julia Cartwright --- drivers/net/ethernet/cadence/macb_main.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index b7644836aba1..758e8b3042b2 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -2796,6 +2796,7 @@ static int gem_add_flow_filter(struct net_device *netdev, struct macb *bp = netdev_priv(netdev); struct ethtool_rx_flow_spec *fs = &cmd->fs; struct ethtool_rx_fs_item *item, *newfs; + unsigned long flags; int ret = -EINVAL; bool added = false; @@ -2811,6 +2812,8 @@ static int gem_add_flow_filter(struct net_device *netdev, htonl(fs->h_u.tcp_ip4_spec.ip4dst), htons(fs->h_u.tcp_ip4_spec.psrc), htons(fs->h_u.tcp_ip4_spec.pdst)); + spin_lock_irqsave(&bp->rx_fs_lock, flags); + /* find correct place to add in list */ list_for_each_entry(item, &bp->rx_fs_list.list, list) { if (item->fs.location > newfs->fs.location) { @@ -2833,9 +2836,11 @@ static int gem_add_flow_filter(struct net_device *netdev, if (netdev->features & NETIF_F_NTUPLE) gem_enable_flow_filters(bp, 1); + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return 0; err: + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); kfree(newfs); return ret; } @@ -2846,6 +2851,9 @@ static int gem_del_flow_filter(struct net_device *netdev, struct macb *bp = netdev_priv(netdev); struct ethtool_rx_fs_item *item; struct ethtool_rx_flow_spec *fs; + unsigned long flags; + + spin_lock_irqsave(&bp->rx_fs_lock, flags); list_for_each_entry(item, &bp->rx_fs_list.list, list) { if (item->fs.location == cmd->fs.location) { @@ -2862,12 +2870,14 @@ static int gem_del_flow_filter(struct net_device *netdev, gem_writel_n(bp, SCRT2, fs->location, 0); list_del(&item->list); - kfree(item); bp->rx_fs_list.count--; + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); + kfree(item); return 0; } } + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return -EINVAL; } @@ -2936,11 +2946,8 @@ static int gem_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd, static int gem_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd) { struct macb *bp = netdev_priv(netdev); - unsigned long flags; int ret; - spin_lock_irqsave(&bp->rx_fs_lock, flags); - switch (cmd->cmd) { case ETHTOOL_SRXCLSRLINS: if ((cmd->fs.location >= bp->max_tuples) @@ -2959,7 +2966,6 @@ static int gem_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd) ret = -EOPNOTSUPP; } - spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return ret; } -- 2.14.2
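The gem_del_flow_filter() hunk above is the classic unlink-then-free shape, distilled here with generic names (not macb code):

	spin_lock_irqsave(&lock, flags);
	list_for_each_entry(item, &head, list) {
		if (item_matches(item)) {
			list_del(&item->list);	/* unlink while holding the lock */
			count--;
			spin_unlock_irqrestore(&lock, flags);
			kfree(item);		/* free only once unreachable */
			return 0;
		}
	}
	spin_unlock_irqrestore(&lock, flags);
	return -EINVAL;

Once the item is off the list no other CPU can reach it, so the kfree() can safely run outside the critical section, keeping the irqsave region down to the bare list and register manipulation.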
[PATCH v2 0/3] macb rx filter cleanups
Here's a proper patchset based on net-next. v1 -> v2: - Rebased on net-next - Add Nicolas's Acks - Reorder commits, putting the list_empty() cleanups prior to the others. - Added commit reverting the GFP_ATOMIC change. Julia Cartwright (3): net: macb: kill useless use of list_empty() net: macb: reduce scope of rx_fs_lock-protected regions net: macb: change GFP_ATOMIC to GFP_KERNEL drivers/net/ethernet/cadence/macb_main.c | 47 1 file changed, 23 insertions(+), 24 deletions(-) -- 2.14.2
[PATCH v2 1/3] net: macb: kill useless use of list_empty()
The list_for_each_entry() macro already handles the case where the list is empty (by not executing the loop body). It's not necessary to handle this case specially, so stop doing so. Cc: Rafal Ozieblo Acked-by: Nicolas Ferre Signed-off-by: Julia Cartwright --- drivers/net/ethernet/cadence/macb_main.c | 31 --- 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index ebfeab853bf4..b7644836aba1 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -2812,24 +2812,20 @@ static int gem_add_flow_filter(struct net_device *netdev, htons(fs->h_u.tcp_ip4_spec.psrc), htons(fs->h_u.tcp_ip4_spec.pdst)); /* find correct place to add in list */ - if (list_empty(&bp->rx_fs_list.list)) - list_add(&newfs->list, &bp->rx_fs_list.list); - else { - list_for_each_entry(item, &bp->rx_fs_list.list, list) { - if (item->fs.location > newfs->fs.location) { - list_add_tail(&newfs->list, &item->list); - added = true; - break; - } else if (item->fs.location == fs->location) { - netdev_err(netdev, "Rule not added: location %d not free!\n", - fs->location); - ret = -EBUSY; - goto err; - } + list_for_each_entry(item, &bp->rx_fs_list.list, list) { + if (item->fs.location > newfs->fs.location) { + list_add_tail(&newfs->list, &item->list); + added = true; + break; + } else if (item->fs.location == fs->location) { + netdev_err(netdev, "Rule not added: location %d not free!\n", + fs->location); + ret = -EBUSY; + goto err; } - if (!added) - list_add_tail(&newfs->list, &bp->rx_fs_list.list); } + if (!added) + list_add_tail(&newfs->list, &bp->rx_fs_list.list); gem_prog_cmp_regs(bp, fs); bp->rx_fs_list.count++; @@ -2851,9 +2847,6 @@ static int gem_del_flow_filter(struct net_device *netdev, struct ethtool_rx_fs_item *item; struct ethtool_rx_flow_spec *fs; - if (list_empty(&bp->rx_fs_list.list)) - return -EINVAL; - list_for_each_entry(item, &bp->rx_fs_list.list, list) { if (item->fs.location == cmd->fs.location) { /* disable screener regs for the flow entry */ -- 2.14.2
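For anyone wondering why the guard was redundant: list_for_each_entry() expands to roughly the following (simplified from <linux/list.h>), and the loop condition is false on the very first test when the list is empty, so the body never runs:

#define list_for_each_entry(pos, head, member)				\
	for (pos = list_first_entry(head, typeof(*pos), member);	\
	     &pos->member != (head);	/* immediately false if empty */\
	     pos = list_next_entry(pos, member))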
[PATCH v2 3/3] net: macb: change GFP_ATOMIC to GFP_KERNEL
Now that the rx_fs_lock is no longer held across allocation, it's safe to use GFP_KERNEL for allocating new entries. This reverts commit 81da3bf6e3f88 ("net: macb: change GFP_KERNEL to GFP_ATOMIC"). Cc: Julia Lawall Signed-off-by: Julia Cartwright --- drivers/net/ethernet/cadence/macb_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index 758e8b3042b2..234667eaaa92 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -2800,7 +2800,7 @@ static int gem_add_flow_filter(struct net_device *netdev, int ret = -EINVAL; bool added = false; - newfs = kmalloc(sizeof(*newfs), GFP_ATOMIC); + newfs = kmalloc(sizeof(*newfs), GFP_KERNEL); if (newfs == NULL) return -ENOMEM; memcpy(&newfs->fs, fs, sizeof(newfs->fs)); -- 2.14.2
[PATCH 2/2] net: macb: kill useless use of list_empty()
The list_for_each_entry() macro already handles the case where the list is empty (by not executing the loop body). It's not necessary to handle this case specially, so stop doing so. Cc: Rafal Ozieblo Signed-off-by: Julia Cartwright --- This is an additional cleanup patch found when looking at this code. Julia drivers/net/ethernet/cadence/macb_main.c | 34 1 file changed, 13 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index e7ef104a077d..3643c6ad2322 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -2815,25 +2815,22 @@ static int gem_add_flow_filter(struct net_device *netdev, spin_lock_irqsave(&bp->rx_fs_lock, flags); /* find correct place to add in list */ - if (list_empty(&bp->rx_fs_list.list)) - list_add(&newfs->list, &bp->rx_fs_list.list); - else { - list_for_each_entry(item, &bp->rx_fs_list.list, list) { - if (item->fs.location > newfs->fs.location) { - list_add_tail(&newfs->list, &item->list); - added = true; - break; - } else if (item->fs.location == fs->location) { - netdev_err(netdev, "Rule not added: location %d not free!\n", - fs->location); - ret = -EBUSY; - goto err; - } + list_for_each_entry(item, &bp->rx_fs_list.list, list) { + if (item->fs.location > newfs->fs.location) { + list_add_tail(&newfs->list, &item->list); + added = true; + break; + } else if (item->fs.location == fs->location) { + netdev_err(netdev, "Rule not added: location %d not free!\n", + fs->location); + ret = -EBUSY; + goto err; } - if (!added) - list_add_tail(&newfs->list, &bp->rx_fs_list.list); } + if (!added) + list_add_tail(&newfs->list, &bp->rx_fs_list.list); + gem_prog_cmp_regs(bp, fs); bp->rx_fs_list.count++; /* enable filtering if NTUPLE on */ @@ -2859,11 +2856,6 @@ static int gem_del_flow_filter(struct net_device *netdev, spin_lock_irqsave(&bp->rx_fs_lock, flags); - if (list_empty(&bp->rx_fs_list.list)) { - spin_unlock_irqrestore(&bp->rx_fs_lock, flags); - return -EINVAL; - } - list_for_each_entry(item, &bp->rx_fs_list.list, list) { if (item->fs.location == cmd->fs.location) { /* disable screener regs for the flow entry */ -- 2.14.2
[PATCH 1/2] net: macb: reduce scope of rx_fs_lock-protected regions
Commit ae8223de3df5 ("net: macb: Added support for RX filtering") introduces a lock, rx_fs_lock which is intended to protect the list of rx_flow items and synchronize access to the hardware rx filtering registers. However, the region protected by this lock is overscoped, unnecessarily including things like slab allocation. Reduce this lock scope to only include operations which must be performed atomically: list traversal, addition, and removal, and hitting the macb filtering registers. This fixes the use of kmalloc w/ GFP_KERNEL in atomic context. Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering") Cc: Rafal Ozieblo Cc: Julia Lawall Signed-off-by: Julia Cartwright --- While Julia Lawall's cocci-generated patch fixes the problem, the right solution is to obviate the problem altogether. Thanks, The Other Julia drivers/net/ethernet/cadence/macb_main.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index c5fa87cdc6c4..e7ef104a077d 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -2796,6 +2796,7 @@ static int gem_add_flow_filter(struct net_device *netdev, struct macb *bp = netdev_priv(netdev); struct ethtool_rx_flow_spec *fs = &cmd->fs; struct ethtool_rx_fs_item *item, *newfs; + unsigned long flags; int ret = -EINVAL; bool added = false; @@ -2811,6 +2812,8 @@ static int gem_add_flow_filter(struct net_device *netdev, htonl(fs->h_u.tcp_ip4_spec.ip4dst), htons(fs->h_u.tcp_ip4_spec.psrc), htons(fs->h_u.tcp_ip4_spec.pdst)); + spin_lock_irqsave(&bp->rx_fs_lock, flags); + /* find correct place to add in list */ if (list_empty(&bp->rx_fs_list.list)) list_add(&newfs->list, &bp->rx_fs_list.list); @@ -2837,9 +2840,11 @@ static int gem_add_flow_filter(struct net_device *netdev, if (netdev->features & NETIF_F_NTUPLE) gem_enable_flow_filters(bp, 1); + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return 0; err: + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); kfree(newfs); return ret; } @@ -2850,9 +2855,14 @@ static int gem_del_flow_filter(struct net_device *netdev, struct macb *bp = netdev_priv(netdev); struct ethtool_rx_fs_item *item; struct ethtool_rx_flow_spec *fs; + unsigned long flags; - if (list_empty(&bp->rx_fs_list.list)) + spin_lock_irqsave(&bp->rx_fs_lock, flags); + + if (list_empty(&bp->rx_fs_list.list)) { + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return -EINVAL; + } list_for_each_entry(item, &bp->rx_fs_list.list, list) { if (item->fs.location == cmd->fs.location) { @@ -2869,12 +2879,14 @@ static int gem_del_flow_filter(struct net_device *netdev, gem_writel_n(bp, SCRT2, fs->location, 0); list_del(&item->list); - kfree(item); bp->rx_fs_list.count--; + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); + kfree(item); return 0; } } + spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return -EINVAL; } @@ -2943,11 +2955,8 @@ static int gem_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd, static int gem_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd) { struct macb *bp = netdev_priv(netdev); - unsigned long flags; int ret; - spin_lock_irqsave(&bp->rx_fs_lock, flags); - switch (cmd->cmd) { case ETHTOOL_SRXCLSRLINS: if ((cmd->fs.location >= bp->max_tuples) @@ -2966,7 +2975,6 @@ static int gem_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd) ret = -EOPNOTSUPP; } - spin_unlock_irqrestore(&bp->rx_fs_lock, flags); return ret; } -- 2.14.2
Re: PI futexes + lock stealing woes
On Fri, Dec 01, 2017 at 12:11:15PM -0800, Darren Hart wrote: > On Wed, Nov 29, 2017 at 11:56:05AM -0600, Julia Cartwright wrote: > > Hey Thomas, Peter- > > > > Gratian and I have been debugging into a nasty and difficult race w/ > > futexes seemingly the culprit. The original symptom we were seeing > > was a seemingly spurious -EDEADLK from a futex(LOCK_PI) operation. > > > > On further analysis, however, it appears the thread which gets the > > spurious -EDEADLK has observed a weird futex state: a prior > > futex(WAIT_REQUEUE_PI) operation has returned -ETIMEDOUT, but the uaddr2 > > futex word owner field indicates that it's the owner. > > > > Do you have a reproducer you can share? We have a massive application which seems to reproduce it in 8 hours or so, but it's not in a state to be shared :(. So far, every attempt at creating a simple, smaller reproducing case has failed. We're still trying, though :(. One debugging technique we're trying to employ as well now that we think we have a handle on the race is to pry the race window open with some strategically placed spinning (or fixed-period sleeping). Hopefully that will make it easier to reproduce ... > > Here's an attempt to boil down this situation into a pseudo trace; I'm > > happy to forward along the full traces as well, if that would be > > helpful: > > Please do forward the full trace Will do. Chances are they are large enough to bounce from LKML, but I'll send them along privately. > > > >waiter waker > > stealer (prio > waiter) > > > >futex(WAIT_REQUEUE_PI, uaddr, uaddr2, > > timeout=[N ms]) > > futex_wait_requeue_pi() > > futex_wait_queue_me() > > freezable_schedule() > > > >futex(LOCK_PI, uaddr2) > >futex(CMP_REQUEUE_PI, uaddr, > > uaddr2, 1, 0) > > /* requeues waiter to uaddr2 > > */ > >futex(UNLOCK_PI, uaddr2) > > wake_futex_pi() > > > > cmp_futex_value_locked(uaddr, waiter) minor fix: the above should have been: cmp_futex_value_locked(uaddr2, waiter) > > wake_up_q() > > > > > clears sleeper->task> > > > >futex(LOCK_PI, uaddr2) > > > > __rt_mutex_start_proxy_lock() > > > > try_to_take_rt_mutex() /* steals lock */ > > > > rt_mutex_set_owner(lock, stealer) > > > > > > > > rt_mutex_wait_proxy_lock() > > __rt_mutex_slowlock() > >try_to_take_rt_mutex() /* fails, lock held by stealer */ > >if (timeout && !timeout->task) > > return -ETIMEDOUT; > > fixup_owner() > >/* lock wasn't acquired, so, > > fixup_pi_state_owner skipped */ > >return -ETIMEDOUT; > > > >/* At this point, we've returned -ETIMEDOUT to userspace, but the > > * futex word shows waiter to be the owner, and the pi_mutex has > > * stealer as the owner */ > > eee Indeed. :( > >futex_lock(LOCK_PI, uaddr2) > > -> bails with EDEADLK, futex word says we're owner. > > > > At some later point in execution, the stealer gets scheduled back in and > > will do fixup_owner() which fixes up the futex word, but at that point > > it's too late: the waiter has already observed the wonky state. > > > > fixup_owner() used to have additional seemingly relevant checks in place > > that were removed 73d786bd043eb ("futex: Rework inconsistent > > rt_mutex/futex_q state"). > > This and the subsequent changes moving some of this out from under the > hb->
[ANNOUNCE] 4.1.46-rt52
Hello RT Folks! I'm pleased to announce the 4.1.46-rt52 stable release. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.1-rt Head SHA1: 6e737a91c1ce923d4e10db831e14cfef9a2459cc Or to build 4.1.46-rt52 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.46.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.46-rt52.patch.xz You can also build from 4.1.46-rt51 by applying the incremental patch: http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/incr/patch-4.1.46-rt51-rt52.patch.xz Enjoy! Julia Changes from v4.1.46-rt51: --- Julia Cartwright (2): workqueue: fixup rcu check for RT Linux 4.1.46-rt52 Sebastian Andrzej Siewior (2): PM / CPU: replace raw_notifier with atomic_notifier (fixup) kernel/hrtimer: migrate deferred timer on CPU down --- kernel/cpu_pm.c | 7 +++ kernel/time/hrtimer.c | 5 + kernel/workqueue.c| 2 +- localversion-rt | 2 +- 4 files changed, 14 insertions(+), 2 deletions(-) --- diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c index 9da42f83ee03..d1d1c3961553 100644 --- a/kernel/cpu_pm.c +++ b/kernel/cpu_pm.c @@ -28,8 +28,15 @@ static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls) { int ret; + /* +* __atomic_notifier_call_chain has a RCU read critical section, which +* could be disfunctional in cpu idle. Copy RCU_NONIDLE code to let +* RCU know this. +*/ + rcu_irq_enter_irqson(); ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, nr_to_call, nr_calls); + rcu_irq_exit_irqson(); return notifier_to_errno(ret); } diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 2c6be169bdc7..75c990b00525 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1951,6 +1951,11 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base, /* Clear the migration state bit */ timer->state &= ~HRTIMER_STATE_MIGRATE; } +#ifdef CONFIG_PREEMPT_RT_BASE + list_splice_tail(&old_base->expired, &new_base->expired); + if (!list_empty(&new_base->expired)) + raise_softirq_irqoff(HRTIMER_SOFTIRQ); +#endif } static void migrate_hrtimers(int scpu) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6bdcab98501c..90e261c8811e 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -363,7 +363,7 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq); "RCU or wq->mutex should be held") #define assert_rcu_or_wq_mutex_or_pool_mutex(wq) \ - rcu_lockdep_assert(rcu_read_lock_sched_held() ||\ + rcu_lockdep_assert(rcu_read_lock_held() || \ lockdep_is_held(&wq->mutex) || \ lockdep_is_held(&wq_pool_mutex), \ "sched RCU, wq->mutex or wq_pool_mutex should be held") diff --git a/localversion-rt b/localversion-rt index 75493460c41f..66a5ed8bf3d7 100644 --- a/localversion-rt +++ b/localversion-rt @@ -1 +1 @@ --rt51 +-rt52
PI futexes + lock stealing woes
Hey Thomas, Peter- Gratian and I have been debugging into a nasty and difficult race w/ futexes seemingly the culprit. The original symptom we were seeing was a seemingly spurious -EDEADLK from a futex(LOCK_PI) operation. On further analysis, however, it appears the thread which gets the spurious -EDEADLK has observed a weird futex state: a prior futex(WAIT_REQUEUE_PI) operation has returned -ETIMEDOUT, but the uaddr2 futex word owner field indicates that it's the owner. Here's an attempt to boil down this situation into a pseudo trace; I'm happy to forward along the full traces as well, if that would be helpful: waiter waker stealer (prio > waiter) futex(WAIT_REQUEUE_PI, uaddr, uaddr2, timeout=[N ms]) futex_wait_requeue_pi() futex_wait_queue_me() freezable_schedule() futex(LOCK_PI, uaddr2) futex(CMP_REQUEUE_PI, uaddr, uaddr2, 1, 0) /* requeues waiter to uaddr2 */ futex(UNLOCK_PI, uaddr2) wake_futex_pi() cmp_futex_value_locked(uaddr, waiter) wake_up_q() task> futex(LOCK_PI, uaddr2) __rt_mutex_start_proxy_lock() try_to_take_rt_mutex() /* steals lock */ rt_mutex_set_owner(lock, stealer) rt_mutex_wait_proxy_lock() __rt_mutex_slowlock() try_to_take_rt_mutex() /* fails, lock held by stealer */ if (timeout && !timeout->task) return -ETIMEDOUT; fixup_owner() /* lock wasn't acquired, so, fixup_pi_state_owner skipped */ return -ETIMEDOUT; /* At this point, we've returned -ETIMEDOUT to userspace, but the * futex word shows waiter to be the owner, and the pi_mutex has * stealer as the owner */ futex_lock(LOCK_PI, uaddr2) -> bails with EDEADLK, futex word says we're owner. At some later point in execution, the stealer gets scheduled back in and will do fixup_owner() which fixes up the futex word, but at that point it's too late: the waiter has already observed the wonky state. fixup_owner() used to have additional seemingly relevant checks in place that were removed 73d786bd043eb ("futex: Rework inconsistent rt_mutex/futex_q state"). The actual kernel we've been testing is 4.9.33-rt23, w/ 153fbd1226fb3 ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races") cherry-picked w/ PREEMPT_RT_FULL. However, it appears that this issue may affect v4.15-rc1? Thoughts on how to move forward? Nasty. Julia
[PATCH RT 4/4] Linux 4.1.46-rt52-rc1
4.1.46-rt52-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- --- localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/localversion-rt b/localversion-rt index 75493460c41f..d42746076d9b 100644 --- a/localversion-rt +++ b/localversion-rt @@ -1 +1 @@ --rt51 +-rt52-rc1 -- 2.14.2
[PATCH RT 3/4] kernel/hrtimer: migrate deferred timer on CPU down
4.1.46-rt52-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Sebastian Andrzej Siewior hrtimers, which were deferred to the softirq context, and expire between softirq shutdown and hrtimer migration are dangling around. If the CPU goes back up the list head will be initialized and this corrupts the timer's list. It will remain unnoticed until a hrtimer_cancel(). This moves those timers so they will expire. Cc: stable...@vger.kernel.org Reported-by: Mike Galbraith Tested-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit b3c08bffdcdd23f1b3ca8d9c01e3b8a715e03d46) Signed-off-by: Julia Cartwright --- kernel/time/hrtimer.c | 5 + 1 file changed, 5 insertions(+) diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 2c6be169bdc7..75c990b00525 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1951,6 +1951,11 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base, /* Clear the migration state bit */ timer->state &= ~HRTIMER_STATE_MIGRATE; } +#ifdef CONFIG_PREEMPT_RT_BASE + list_splice_tail(&old_base->expired, &new_base->expired); + if (!list_empty(&new_base->expired)) + raise_softirq_irqoff(HRTIMER_SOFTIRQ); +#endif } static void migrate_hrtimers(int scpu) -- 2.14.2
[PATCH RT 0/4] Linux 4.1.46-rt52-rc1
Dear RT Folks, This is the RT stable review cycle of patch 4.1.46-rt52-rc1. Please review the included patches, and test! You might ask: "Where was 4.1.46-rt51?". The answer is: it was released silently, and only as a means to facilitate release-level bisection. Version 4.1.46-rt51 has known issues to be resolved by the pending release of -rt52. The -rc release will be uploaded to kernel.org and will be deleted when the final release is out. This is just a review release (or release candidate). The pre-releases will not be pushed to the git repository, only the final release is. If all goes well, this patch will be converted to the next main release on 11/15/2017. Julia To build 4.1.46-rt52-rc1 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.46.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.46-rt52-rc1.patch.xz You can also build from 4.1.46-rt51 release by applying the incremental patch: http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/incr/patch-4.1.46-rt51-rt52-rc1.patch.xz Julia Cartwright (2): workqueue: fixup rcu check for RT Linux 4.1.46-rt52-rc1 Sebastian Andrzej Siewior (2): PM / CPU: replace raw_notifier with atomic_notifier (fixup) kernel/hrtimer: migrate deferred timer on CPU down kernel/cpu_pm.c | 7 +++ kernel/time/hrtimer.c | 5 + kernel/workqueue.c| 2 +- localversion-rt | 2 +- 4 files changed, 14 insertions(+), 2 deletions(-) -- 2.14.2
[PATCH RT 2/4] workqueue: fixup rcu check for RT
4.1.46-rt52-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- Upstream commit 5b95e1af8d17d ("workqueue: wq_pool_mutex protects the attrs-installation") introduced an additional assertion (assert_rcu_or_wq_mutex_or_pool_mutex) which contains a check ensuring that the caller is in an RCU-sched read-side critical section. However, on RT, the locking rules are lessened to only require _normal_ RCU. Fix up this check. The upstream commit was cherry-picked back into stable v4.1.19 as d3c4dd8843be. This fixes up the bogus splat triggered on boot: === [ INFO: suspicious RCU usage. ] 4.1.42-rt50 --- kernel/workqueue.c:609 sched RCU, wq->mutex or wq_pool_mutex should be held! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by swapper/0/1: #0: ((pendingb_lock).lock){+.+...}, at: queue_work_on+0x64/0x1c0 #1: (rcu_read_lock){..}, at: __queue_work+0x2a/0x880 stack backtrace: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.1.42-rt50 #4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Call Trace: dump_stack+0x70/0x9a lockdep_rcu_suspicious+0xe7/0x120 unbound_pwq_by_node+0x92/0x100 __queue_work+0x28c/0x880 ? __queue_work+0x2a/0x880 queue_work_on+0xc9/0x1c0 call_usermodehelper_exec+0x1a7/0x200 kobject_uevent_env+0x4be/0x520 ? initcall_blacklist+0xa2/0xa2 kobject_uevent+0xb/0x10 kset_register+0x34/0x50 bus_register+0x100/0x2d0 ? ftrace_define_fields_workqueue_work+0x29/0x29 subsys_virtual_register+0x26/0x50 wq_sysfs_init+0x12/0x14 do_one_initcall+0x88/0x1b0 ? parse_args+0x190/0x410 kernel_init_freeable+0x204/0x299 ? rest_init+0x140/0x140 kernel_init+0x9/0xf0 ret_from_fork+0x42/0x70 ? rest_init+0x140/0x140 Reported-by: Sebastian Andrzej Siewior Signed-off-by: Julia Cartwright --- kernel/workqueue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6bdcab98501c..90e261c8811e 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -363,7 +363,7 @@ static void workqueue_sysfs_unregister(struct workqueue_struct *wq); "RCU or wq->mutex should be held") #define assert_rcu_or_wq_mutex_or_pool_mutex(wq) \ - rcu_lockdep_assert(rcu_read_lock_sched_held() ||\ + rcu_lockdep_assert(rcu_read_lock_held() || \ lockdep_is_held(&wq->mutex) || \ lockdep_is_held(&wq_pool_mutex), \ "sched RCU, wq->mutex or wq_pool_mutex should be held") -- 2.14.2
[PATCH RT 1/4] PM / CPU: replace raw_notifier with atomic_notifier
4.1.46-rt52-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Sebastian Andrzej Siewior The original patch changed betwen its posting and what finally went into Rafael's tree so here is the delta. Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit f648e23dac72deef07f25e05fc09dbbc209dbd33) Signed-off-by: Julia Cartwright --- kernel/cpu_pm.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c index 9da42f83ee03..d1d1c3961553 100644 --- a/kernel/cpu_pm.c +++ b/kernel/cpu_pm.c @@ -28,8 +28,15 @@ static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls) { int ret; + /* +* __atomic_notifier_call_chain has a RCU read critical section, which +* could be disfunctional in cpu idle. Copy RCU_NONIDLE code to let +* RCU know this. +*/ + rcu_irq_enter_irqson(); ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, nr_to_call, nr_calls); + rcu_irq_exit_irqson(); return notifier_to_errno(ret); } -- 2.14.2
[PATCH] futex: Drop now unnecessary check in exit_pi_state()
This check was an attempt to protect against a race with put_pi_state() by ensuring that the pi_state_list was consistent across the unlock/lock of pi_lock. However, as of commit 153fbd1226fb3 ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races"), this check is no longer necessary because we now hold a reference to the pi_state object across the unlock/lock of pi_lock. This reference guarantees that a put_pi_state() on another CPU won't rip the pi_state object from the list when we drop pi_lock. Cc: Gratian Crisan Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Darren Hart Signed-off-by: Julia Cartwright --- I'm not sure my analysis is 100% correct here, so please carefully think through it, as I'm sure you all always do when futex patches hit your mailbox :). Julia kernel/futex.c | 16 ++-- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index ca5bb9cba5cf..e127ec0555b6 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -889,7 +889,7 @@ static struct task_struct *futex_find_get_task(pid_t pid) */ void exit_pi_state_list(struct task_struct *curr) { - struct list_head *next, *head = &curr->pi_state_list; + struct list_head *head = &curr->pi_state_list; struct futex_pi_state *pi_state; struct futex_hash_bucket *hb; union futex_key key = FUTEX_KEY_INIT; @@ -903,8 +903,7 @@ void exit_pi_state_list(struct task_struct *curr) */ raw_spin_lock_irq(&curr->pi_lock); while (!list_empty(head)) { - next = head->next; - pi_state = list_entry(next, struct futex_pi_state, list); + pi_state = list_first_entry(head, struct futex_pi_state, list); key = pi_state->key; hb = hash_futex(&key); @@ -929,17 +928,6 @@ void exit_pi_state_list(struct task_struct *curr) spin_lock(&hb->lock); raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); raw_spin_lock(&curr->pi_lock); - /* -* We dropped the pi-lock, so re-check whether this -* task still owns the PI-state: -*/ - if (head->next != next) { - /* retain curr->pi_lock for the loop invariant */ - raw_spin_unlock(&pi_state->pi_mutex.wait_lock); - spin_unlock(&hb->lock); - put_pi_state(pi_state); - continue; - } WARN_ON(pi_state->owner != curr); WARN_ON(list_empty(&pi_state->list)); -- 2.14.2
Re: [PATCH 1/2] IB/hfi1: Use preempt_{dis,en}able_nort()
On Thu, Oct 05, 2017 at 01:53:05PM -0300, Arnaldo Carvalho de Melo wrote: > Em Thu, Oct 05, 2017 at 09:17:44AM -0500, Julia Cartwright escreveu: > > On Tue, Oct 03, 2017 at 12:49:19PM -0300, Arnaldo Carvalho de Melo wrote: > > > +++ b/drivers/infiniband/hw/hfi1/pio.c > > > @@ -1421,7 +1421,7 @@ struct pio_buf *sc_buffer_alloc(struct send_context > > > *sc, u32 dw_len, > > > > /* there is enough room */ > > > > - preempt_disable(); > > > + preempt_disable_nort(); > > > this_cpu_inc(*sc->buffers_allocated); > > > Have you tried this on RT w/ CONFIG_DEBUG_PREEMPT? > > No > > > I believe that the this_cpu_* operations perform a preemption check, which > > we'd trip. > > Humm, looking at include/linux/percpu-defs.h on v4.11.12-rt14 I see > (trimmed to what we're discussing here): > > #ifdef CONFIG_DEBUG_PREEMPT > extern void __this_cpu_preempt_check(const char *op); > #else > static inline void __this_cpu_preempt_check(const char *op) { } > #endif > > #define __this_cpu_add(pcp, val) \ > ({\ > __this_cpu_preempt_check("add");\ > raw_cpu_add(pcp, val); \ > }) > #define __this_cpu_inc(pcp) __this_cpu_add(pcp, 1) > > /* > * Operations with implied preemption/interrupt protection. These > * operations can be used without worrying about preemption or interrupt. > */ > #define this_cpu_add(pcp, val) __pcpu_size_call(this_cpu_add_, pcp, > val) > #define this_cpu_inc(pcp) this_cpu_add(pcp, 1) > > > You may also have to change these to the non-preempt checked variants. > > So __this_cpu_inc() checks preemption but this_cpu_inc() doesn't and > thus we're ok here? Or am I getting lost in this maze of defines? :-) I think I was the one lost in the maze. You are correct. Sorry for the confusion. My mind expected that the __ prefixed versions would be the "raw" versions that are free of checks, with the checks made in the non prefixed versions, but it is the other way around. I'm happy to accept that the bug is within my mind. Julia
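For other readers of the thread, the upshot (per the 4.11-rt percpu-defs.h quoted above): the un-prefixed accessors carry their own protection and perform no debug check, while the __-prefixed "raw" variants assume the caller already disabled preemption and, under CONFIG_DEBUG_PREEMPT, will splat if it hasn't:

	/* safe in preemptible context; implies its own protection,
	 * performs no preemption check */
	this_cpu_inc(*sc->buffers_allocated);

	/* caller must have preemption disabled; with
	 * CONFIG_DEBUG_PREEMPT this goes through
	 * __this_cpu_preempt_check() and warns otherwise */
	__this_cpu_inc(*sc->buffers_allocated);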
Re: [PATCH 1/2] IB/hfi1: Use preempt_{dis,en}able_nort()
On Thu, Oct 05, 2017 at 06:16:23PM +0200, Thomas Gleixner wrote: > On Thu, 5 Oct 2017, Julia Cartwright wrote: > > > On Thu, Oct 05, 2017 at 11:55:39AM -0400, Steven Rostedt wrote: > > > On Thu, 5 Oct 2017 10:37:59 -0500 > > > Julia Cartwright wrote: > > > > > > > On Thu, Oct 05, 2017 at 05:27:30PM +0200, Thomas Gleixner wrote: > > > > > On Thu, 5 Oct 2017, Julia Cartwright wrote: > > > > > > On Tue, Oct 03, 2017 at 12:49:19PM -0300, Arnaldo Carvalho de Melo > > > > > > wrote: > > > > > > > - preempt_disable(); > > > > > > > + preempt_disable_nort(); > > > > > > > this_cpu_inc(*sc->buffers_allocated); > > > > > > > > > > > > Have you tried this on RT w/ CONFIG_DEBUG_PREEMPT? I believe that > > > > > > the > > > > > > this_cpu_* operations perform a preemption check, which we'd trip. > > > > > > > > > > Good point. Changing this to migrate_disable() would do the trick. > > > > > > > > Wouldn't we still trip the preempt check even with migration disabled? > > > > In another thread I asked the same question: should the preemption > > > > checks here be converted to migration-checks in RT? > > > > > > Is it a "preemption check"? > > > > Sorry if I was unclear, more precisely: the this_cpu_* family of > > accessors, w/ CONFIG_DEBUG_PREEMPT currently spits out a warning when > > the caller is invoked in a context where preemption is enabled. > > > > The check is shared w/ the smp_processor_id() check, as implemented in > > lib/smp_processor_id.c. It effectively boils down to a check of > > preempt_count() and irqs_disabled(). > > Except that on RT that check cares about the migrate disable state. You can > invoke this_cpu_* and smp_processor_id() in preemptible/interruptible > context because of: > > if (cpumask_equal(current->cpus_ptr, cpumask_of(this_cpu))) > goto out; > > That's true even on mainline. > > But you are right that this check needs some update because > migrate_disable() does not immediately update the allowed cpumask IIRC. Actually, I think it does: migrate_disable() -> p = current; ... migrate_disable_update_cpus_allowed(p) -> p->cpus_ptr = cpumask_of(smp_processor_id()) ... Perhaps it's worth a simple comment update, below. Julia -- 8< -- Subject: [PATCH] kernel: sched: document smp_processor_id/this_cpu safety in migration-disabled context On RT, users of smp_processor_id() and this_cpu_*() per-cpu accessors are considered safe if the caller executes with migration disabled. On !RT, preempt_disable() is sufficient to make this guarantee, however on RT, the lesser migrate_disable() is sufficient. It is not entirely obvious which check in check_preemption_disabled() makes this work, so document it. Signed-off-by: Julia Cartwright --- lib/smp_processor_id.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/lib/smp_processor_id.c b/lib/smp_processor_id.c index de3b2d925473..c8091d9eb1b4 100644 --- a/lib/smp_processor_id.c +++ b/lib/smp_processor_id.c @@ -20,7 +20,11 @@ notrace static unsigned int check_preemption_disabled(const char *what1, /* * Kernel threads bound to a single CPU can safely use -* smp_processor_id(): +* smp_processor_id(). +* +* In addition, threads which are currently executing within +* a migration disabled region can safely use smp_processor_id() and +* this_cpu accessors. */ if (cpumask_equal(current->cpus_ptr, cpumask_of(this_cpu))) goto out; -- 2.14.1
Re: [PATCH 1/2] IB/hfi1: Use preempt_{dis,en}able_nort()
On Thu, Oct 05, 2017 at 11:55:39AM -0400, Steven Rostedt wrote: > On Thu, 5 Oct 2017 10:37:59 -0500 > Julia Cartwright wrote: > > > On Thu, Oct 05, 2017 at 05:27:30PM +0200, Thomas Gleixner wrote: > > > On Thu, 5 Oct 2017, Julia Cartwright wrote: > > > > On Tue, Oct 03, 2017 at 12:49:19PM -0300, Arnaldo Carvalho de Melo > > > > wrote: > > > > > - preempt_disable(); > > > > > + preempt_disable_nort(); > > > > > this_cpu_inc(*sc->buffers_allocated); > > > > > > > > Have you tried this on RT w/ CONFIG_DEBUG_PREEMPT? I believe that the > > > > this_cpu_* operations perform a preemption check, which we'd trip. > > > > > > Good point. Changing this to migrate_disable() would do the trick. > > > > Wouldn't we still trip the preempt check even with migration disabled? > > In another thread I asked the same question: should the preemption > > checks here be converted to migration-checks in RT? > > Is it a "preemption check"? Sorry if I was unclear, more precisely: the this_cpu_* family of accessors, w/ CONFIG_DEBUG_PREEMPT currently spits out a warning when the caller is invoked in a context where preemption is enabled. The check is shared w/ the smp_processor_id() check, as implemented in lib/smp_processor_id.c. It effectively boils down to a check of preempt_count() and irqs_disabled(). > Getting a cpu # should only care about migration. I think we're agreeing? :) > This isn't the same as a rcu_sched check is it? That does care about > preemption. This is something totally different, I think. Julia
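Condensed, the shared check in lib/smp_processor_id.c has roughly this shape (a simplified sketch of check_preemption_disabled(), not the verbatim source):

notrace static unsigned int check_preemption_disabled(const char *what1,
                                                      const char *what2)
{
        int this_cpu = raw_smp_processor_id();

        if (likely(preempt_count()))
                goto out;

        if (irqs_disabled())
                goto out;

        /*
         * Kernel threads bound to a single CPU -- and, on RT, tasks
         * inside a migrate_disable() section -- can safely use
         * smp_processor_id() and the this_cpu accessors.
         */
        if (cpumask_equal(current->cpus_ptr, cpumask_of(this_cpu)))
                goto out;

        /* Otherwise: emit the CONFIG_DEBUG_PREEMPT warning. */
        printk(KERN_ERR "BUG: using %s%s() in preemptible code\n",
               what1, what2);
out:
        return this_cpu;
}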
Re: [PATCH 1/2] IB/hfi1: Use preempt_{dis,en}able_nort()
On Thu, Oct 05, 2017 at 05:27:30PM +0200, Thomas Gleixner wrote: > On Thu, 5 Oct 2017, Julia Cartwright wrote: > > On Tue, Oct 03, 2017 at 12:49:19PM -0300, Arnaldo Carvalho de Melo wrote: > > > - preempt_disable(); > > > + preempt_disable_nort(); > > > this_cpu_inc(*sc->buffers_allocated); > > > > Have you tried this on RT w/ CONFIG_DEBUG_PREEMPT? I believe that the > > this_cpu_* operations perform a preemption check, which we'd trip. > > Good point. Changing this to migrate_disable() would do the trick. Wouldn't we still trip the preempt check even with migration disabled? In another thread I asked the same question: should the preemption checks here be converted to migration-checks in RT? Julia
Re: [PATCH 1/2] IB/hfi1: Use preempt_{dis,en}able_nort()
On Tue, Oct 03, 2017 at 12:49:19PM -0300, Arnaldo Carvalho de Melo wrote: > From: Arnaldo Carvalho de Melo > > sc_buffer_alloc() disables preemption that will be reenabled by either > pio_copy() or seg_pio_copy_end(). But before disabling preemption it > grabs a spin lock that will be dropped after it disables preemption, > which ends up triggering a warning in migrate_disable() later on. > > spin_lock_irqsave(&sc->alloc_lock) > migrate_disable() ++p->migrate_disable -> 2 > preempt_disable() > spin_unlock_irqrestore(&sc->alloc_lock) > migrate_enable() in_atomic(), so just returns, migrate_disable stays at > 2 > spin_lock_irqsave(some other lock) -> b00m > > And the WARN_ON code ends up tripping over this over and over in > log_store(). > > Sequence captured via ftrace_dump_on_oops + crash utility 'dmesg' > command. > > [512258.613862] sm-3297 16 .11 359465349134644: sc_buffer_alloc > <-hfi1_verbs_send_pio > [512258.613876] sm-3297 16 .11 359465349134719: migrate_disable > <-sc_buffer_alloc > [512258.613890] sm-3297 16 .12 359465349134798: rt_spin_lock > <-sc_buffer_alloc > [512258.613903] sm-3297 16 112 359465349135481: rt_spin_unlock > <-sc_buffer_alloc > [512258.613916] sm-3297 16 112 359465349135556: migrate_enable > <-sc_buffer_alloc > [512258.613935] sm-3297 16 112 359465349135788: seg_pio_copy_start > <-hfi1_verbs_send_pio > [512258.613954] sm-3297 16 112 359465349136273: update_sge > <-hfi1_verbs_send_pio > [512258.613981] sm-3297 16 112 359465349136373: seg_pio_copy_mid > <-hfi1_verbs_send_pio > [512258.613999] sm-3297 16 112 359465349136873: update_sge > <-hfi1_verbs_send_pio > [512258.614017] sm-3297 16 112 359465349136956: seg_pio_copy_mid > <-hfi1_verbs_send_pio > [512258.614035] sm-3297 16 112 359465349137221: seg_pio_copy_end > <-hfi1_verbs_send_pio > [512258.614048] sm-3297 16 .12 359465349137360: migrate_disable > <-hfi1_verbs_send_pio > [512258.614065] sm-3297 16 .12 359465349137476: warn_slowpath_null > <-migrate_disable > [512258.614081] sm-3297 16 .12 359465349137564: __warn > <-warn_slowpath_null > [512258.614088] sm-3297 16 .12 359465349137958: printk <-__warn > [512258.614096] sm-3297 16 .12 359465349138055: vprintk_default <-printk > [512258.614104] sm-3297 16 .12 359465349138144: vprintk_emit > <-vprintk_default > [512258.614111] sm-3297 16 d12 359465349138312: _raw_spin_lock > <-vprintk_emit > [512258.614119] sm-3297 16 d...112 359465349138789: log_store <-vprintk_emit > [512258.614127] sm-3297 16 .12 359465349139068: migrate_disable > <-vprintk_emit > > According to a discussion (see Link: below) on the linux-rt-users > mailing list, this locking is done for performance reasons, not for > correctness, so use the _nort() variants to avoid the above problem. 
> > Suggested-by: Julia Cartwright > Cc: Clark Williams > Cc: Dean Luick > Cc: Dennis Dalessandro > Cc: Doug Ledford > Cc: Kaike Wan > Cc: Leon Romanovsky > Cc: linux-r...@vger.kernel.org > Cc: Peter Zijlstra > Cc: Sebastian Andrzej Siewior > Cc: Sebastian Sanchez > Cc: Steven Rostedt > Cc: Thomas Gleixner > Link: > http://lkml.kernel.org/r/20170926210045.GO29872@jcartwri.amer.corp.natinst.com > > Signed-off-by: Arnaldo Carvalho de Melo > --- > drivers/infiniband/hw/hfi1/pio.c | 2 +- > drivers/infiniband/hw/hfi1/pio_copy.c | 4 ++-- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/infiniband/hw/hfi1/pio.c > b/drivers/infiniband/hw/hfi1/pio.c > index 615be68e40b3..3a30bde9a07b 100644 > --- a/drivers/infiniband/hw/hfi1/pio.c > +++ b/drivers/infiniband/hw/hfi1/pio.c > @@ -1421,7 +1421,7 @@ struct pio_buf *sc_buffer_alloc(struct send_context > *sc, u32 dw_len, > > /* there is enough room */ > > - preempt_disable(); > + preempt_disable_nort(); > this_cpu_inc(*sc->buffers_allocated); Have you tried this on RT w/ CONFIG_DEBUG_PREEMPT? I believe that the this_cpu_* operations perform a preemption check, which we'd trip. You may also have to change these to the non-preempt checked variants. Julia
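For readers outside the RT tree: the _nort() variants under discussion come from the RT patchset and collapse to the ordinary preempt accessors on non-RT kernels while compiling away on RT. A sketch reconstructed from memory of the -rt series' include/linux/preempt.h, so treat the exact details as an assumption:

#ifdef CONFIG_PREEMPT_RT_FULL
/* On RT, the section stays preemptible; only a compiler barrier. */
# define preempt_disable_nort()         barrier()
# define preempt_enable_nort()          barrier()
#else
/* On non-RT, identical to the plain variants. */
# define preempt_disable_nort()         preempt_disable()
# define preempt_enable_nort()          preempt_enable()
#endif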
Re: [PATCH v2 37/40] tracing: Add inter-event hist trigger Documentation
On Tue, Sep 05, 2017 at 04:57:49PM -0500, Tom Zanussi wrote: > Add background and details on inter-event hist triggers, including > hist variables, synthetic events, and actions. > > Signed-off-by: Tom Zanussi > Signed-off-by: Baohong Liu > --- > Documentation/trace/events.txt | 385 > + > 1 file changed, 385 insertions(+) > > diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt > index f271d87..7ee720b 100644 > --- a/Documentation/trace/events.txt > +++ b/Documentation/trace/events.txt > @@ -571,6 +571,7 @@ The following commands are supported: > .sym-offset display an address as a symbol and offset > .syscalldisplay a syscall id as a system call name > .execname display a common_pid as a program name > + .usecs display a $common_timestamp in microseconds > >Note that in general the semantics of a given field aren't >interpreted when applying a modifier to it, but there are some > @@ -2101,3 +2102,387 @@ The following commands are supported: > Hits: 489 > Entries: 7 > Dropped: 0 > + > +6.3 Inter-event hist triggers > +- > + > +Inter-event hist triggers are hist triggers that combine values from > +one or more other events and create a histogram using that data. Data > +from an inter-event histogram can in turn become the source for > +further combined histograms, thus providing a chain of related > +histograms, which is important for some applications. > + > +The most important example of an inter-event quantity that can be used > +in this manner is latency, which is simply a difference in timestamps > +between two events (although trace events don't have an externally > +visible timestamp field, the inter-event hist trigger support adds a > +pseudo-field to all events named '$common_timestamp' which can be used > +as if it were an actual event field) What's between the parentheses is covered below, is it worth mentioning in both places? [..] > + - Trace events don't have a 'timestamp' associated with them, but > +there is an implicit timestamp saved along with an event in the > +underlying ftrace ring buffer. This timestamp is now exposed as a > +a synthetic field named '$common_timestamp' which can be used in > +histograms as if it were any other event field. Note that it has > +a '$' prefixed to it - this is meant to indicate that it isn't an > +actual field in the trace format but rather is a synthesized value > +that nonetheless can be used as if it were an actual field. By > +default it is in units of nanoseconds; appending '.usecs' to a > +common_timestamp field changes the units to microseconds. > + > +These features are decribed in more detail in the following sections. ^ described > + > +6.3.1 Histogram Variables > +- > + > +Variables are simply named locations used for saving and retrieving > +values between matching events. A 'matching' event is defined as an > +event that has a matching key - if a variable is saved for a histogram > +entry corresponding to that key, any subsequent event with a matching > +key can access that variable. > + > +A variable's value is normally available to any subsequent event until > +it is set to something else by a subsequent event. The one exception > +to that rule is that any variable used in an expression is essentially > +'read-once' - once it's used by an expression in a subsequent event, > +it's reset to its 'unset' state, which means it can't be used again > +unless it's set again. 
This ensures not only that an event doesn't > +use an uninitialized variable in a calculation, but that that variable > +is used only once and not for any unrelated subsequent match. > + > +The basic syntax for saving a variable is to simply prefix a unique > +variable name not corresponding to any keyword along with an '=' sign > +to any event field. This would seem to make it much more difficult in the future to add new keywords for hist triggers without breaking users existing triggers. Maybe that's not a problem given the tracing ABI wild west. [..] > +6.3.2 Synthetic Events > +-- > + > +Synthetic events are user-defined events generated from hist trigger > +variables or fields associated with one or more other events. Their > +purpose is to provide a mechanism for displaying data spanning > +multiple events consistent with the existing and already familiar > +usage for normal events. > + > +To define a synthetic event, the user writes a simple specification > +consisting of the name of the new event along with one or more > +variables and their types, which can be any valid field type, > +separated by semicolons, to the tracing/synthetic_events file. > + > +For instance, the following creates a new event named 'wakeup_latency' > +with 3 fields: lat, pid, and prio. Each of those fields is sim
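As a concrete illustration of the interface described above, the quoted 'wakeup_latency' definition could also be written programmatically; a sketch assuming debugfs is mounted at the conventional /sys/kernel/debug location and a kernel carrying this patch set:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Event name and field list taken from the quoted documentation. */
        const char *def = "wakeup_latency u64 lat; pid_t pid; int prio";
        int fd = open("/sys/kernel/debug/tracing/synthetic_events", O_WRONLY);

        if (fd < 0) {
                perror("open synthetic_events");
                return 1;
        }
        if (write(fd, def, strlen(def)) != (ssize_t)strlen(def))
                perror("write definition");
        close(fd);
        return 0;
}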
Re: [ANNOUNCE] 4.1.42-rt50
On Thu, Aug 17, 2017 at 09:30:28AM +0200, Sebastian Andrzej Siewior wrote: > On 2017-08-16 15:42:28 [-0500], Julia Cartwright wrote: [..] > > diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c > > index 9656a3c36503..9da42f83ee03 100644 > > --- a/kernel/cpu_pm.c > > +++ b/kernel/cpu_pm.c > > @@ -22,14 +22,13 @@ > > #include > > #include > > > > -static DEFINE_RWLOCK(cpu_pm_notifier_lock); > > -static RAW_NOTIFIER_HEAD(cpu_pm_notifier_chain); > > +static ATOMIC_NOTIFIER_HEAD(cpu_pm_notifier_chain); > > > > static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int > > *nr_calls) > > { > > int ret; > > > > - ret = __raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, > > + ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, > > nr_to_call, nr_calls); > > there is a piece missing, upstream has a different change queued. I know > that this is the same as in latest RT, this is just to let you know in > case someone complains about an RCU backtrace??? Ah! Thanks. Indeed. I did see the upstream-queued version get cc'd to linux-rt-users, but didn't see the rcu_irq_{enter,exit}_irqson() addition, assuming it to be what landed in rt-devel. Will you be fixing this up in 4.11-rt? Julia
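The piece Sebastian refers to is the rcu_irq_{enter,exit}_irqson() bracketing in the upstream variant of this change. A sketch of that version, reconstructed from memory of the mainline commit (details may differ):

static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call,
                         int *nr_calls)
{
        int ret;

        /*
         * The atomic notifier chain is walked under an RCU read-side
         * critical section, but in the idle path RCU is not watching
         * this CPU; tell RCU about it explicitly so it doesn't emit
         * a lockdep-RCU splat.
         */
        rcu_irq_enter_irqson();
        ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event,
                                           NULL, nr_to_call, nr_calls);
        rcu_irq_exit_irqson();

        return notifier_to_errno(ret);
}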
[ANNOUNCE] 4.1.42-rt50
Hello RT Folks! I'm pleased to announce the 4.1.42-rt50 stable release. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.1-rt Head SHA1: 2e2586f49c8f6b84ceeecce704901405a7e780df Or to build 4.1.42-rt50 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.42.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.42-rt50.patch.xz You can also build from 4.1.42-rt49 by applying the incremental patch: http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/incr/patch-4.1.42-rt49-rt50.patch.xz Enjoy! Julia Changes from v4.1.42-rt49: --- Alex Shi (1): cpu_pm: replace raw_notifier to atomic_notifier Julia Cartwright (1): Linux 4.1.42-rt50 Peter Zijlstra (2): lockdep: Fix per-cpu static objects sched: Remove TASK_ALL Thomas Gleixner (2): rtmutex: Make lock_killable work sched: Prevent task state corruption by spurious lock wakeup include/linux/sched.h| 1 - include/linux/smp.h | 12 init/main.c | 8 kernel/cpu_pm.c | 43 ++- kernel/locking/rtmutex.c | 19 +++ kernel/module.c | 6 +- kernel/sched/core.c | 2 +- localversion-rt | 2 +- mm/percpu.c | 5 - 9 files changed, 44 insertions(+), 54 deletions(-) --- diff --git a/include/linux/sched.h b/include/linux/sched.h index d51525ce2c41..7587d6181cd2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -227,7 +227,6 @@ extern char ___assert_task_state[1 - 2*!!( /* Convenience macros for the sake of wake_up */ #define TASK_NORMAL(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE) -#define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED) /* get_task_state() */ #define TASK_REPORT(TASK_RUNNING | TASK_INTERRUPTIBLE | \ diff --git a/include/linux/smp.h b/include/linux/smp.h index e6ab36aeaaab..cbf6836524dc 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -120,6 +120,13 @@ extern unsigned int setup_max_cpus; extern void __init setup_nr_cpu_ids(void); extern void __init smp_init(void); +extern int __boot_cpu_id; + +static inline int get_boot_cpu_id(void) +{ + return __boot_cpu_id; +} + #else /* !SMP */ static inline void smp_send_stop(void) { } @@ -158,6 +165,11 @@ static inline void smp_init(void) { up_late_init(); } static inline void smp_init(void) { } #endif +static inline int get_boot_cpu_id(void) +{ + return 0; +} + #endif /* !SMP */ /* diff --git a/init/main.c b/init/main.c index 0486a8e11fc0..e1bae15a2154 100644 --- a/init/main.c +++ b/init/main.c @@ -451,6 +451,10 @@ void __init parse_early_param(void) * Activate the first processor. 
*/ +#ifdef CONFIG_SMP +int __boot_cpu_id; +#endif + static void __init boot_cpu_init(void) { int cpu = smp_processor_id(); @@ -459,6 +463,10 @@ static void __init boot_cpu_init(void) set_cpu_active(cpu, true); set_cpu_present(cpu, true); set_cpu_possible(cpu, true); + +#ifdef CONFIG_SMP + __boot_cpu_id = cpu; +#endif } void __init __weak smp_setup_processor_id(void) diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c index 9656a3c36503..9da42f83ee03 100644 --- a/kernel/cpu_pm.c +++ b/kernel/cpu_pm.c @@ -22,14 +22,13 @@ #include #include -static DEFINE_RWLOCK(cpu_pm_notifier_lock); -static RAW_NOTIFIER_HEAD(cpu_pm_notifier_chain); +static ATOMIC_NOTIFIER_HEAD(cpu_pm_notifier_chain); static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls) { int ret; - ret = __raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, + ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, nr_to_call, nr_calls); return notifier_to_errno(ret); @@ -47,14 +46,7 @@ static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls) */ int cpu_pm_register_notifier(struct notifier_block *nb) { - unsigned long flags; - int ret; - - write_lock_irqsave(&cpu_pm_notifier_lock, flags); - ret = raw_notifier_chain_register(&cpu_pm_notifier_chain, nb); - write_unlock_irqrestore(&cpu_pm_notifier_lock, flags); - - return ret; + return atomic_notifier_chain_register(&cpu_pm_notifier_chain, nb); } EXPORT_SYMBOL_GPL(cpu_pm_register_notifier); @@ -69,14 +61,7 @@ EXPORT_SYMBOL_GPL(cpu_pm_register_notifier); */ int cpu_pm_unregister_notifier(struct notifier_block *nb) { - unsigned long flags; - int ret; - - write_lock_irqsave(&cpu_pm_notifier_lock, flags); - ret = raw_notifier_chain_unregister(&cpu_pm_notifier_chain, nb); - write_unlock_ir
[PATCH RT 2/6] rtmutex: Make lock_killable work
4.1.42-rt50-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Thomas Gleixner Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionally. Cc: rt-sta...@vger.kernel.org Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit 9edcb2cd71ff3684755f52129538260efa382789) Signed-off-by: Julia Cartwright --- kernel/locking/rtmutex.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index e0b0d9b419b5..3e45ceb862bd 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1511,18 +1511,13 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state, if (try_to_take_rt_mutex(lock, current, waiter)) break; - /* -* TASK_INTERRUPTIBLE checks for signals and -* timeout. Ignored otherwise. -*/ - if (unlikely(state == TASK_INTERRUPTIBLE)) { - /* Signal pending? */ - if (signal_pending(current)) - ret = -EINTR; - if (timeout && !timeout->task) - ret = -ETIMEDOUT; - if (ret) - break; + if (timeout && !timeout->task) { + ret = -ETIMEDOUT; + break; + } + if (signal_pending_state(state, current)) { + ret = -EINTR; + break; } if (ww_ctx && ww_ctx->acquired > 0) { -- 2.13.1
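The reason signal_pending_state() can be used unconditionally here is that it folds the sleeper's state into the decision: TASK_INTERRUPTIBLE sleepers see any pending signal, while killable (TASK_WAKEKILL) sleepers wake only for fatal signals. Its shape, quoted from memory of include/linux/sched.h of this era, so treat as a sketch:

static inline int signal_pending_state(long state, struct task_struct *p)
{
        /* Only interruptible and killable sleeps care about signals. */
        if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                return 0;
        if (!signal_pending(p))
                return 0;

        /* Killable sleeps wake only for fatal signals. */
        return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}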
[PATCH RT 0/6] Linux 4.1.42-rt50-rc1
Dear RT Folks, This is the RT stable review cycle of patch 4.1.42-rt50-rc1. Please review the included patches, and test! The -rc release will be uploaded to kernel.org and will be deleted when the final release is out. This is just a review release (or release candidate). The pre-releases will not be pushed to the git repository, only the final release is. If all goes well, this patch will be converted to the next main release on 8/14/2017. Julia To build 4.1.42-rt50-rc1 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.42.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.42-rt50-rc1.patch.xz You can also build from 4.1.42-rt49 release by applying the incremental patch: http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/incr/patch-4.1.42-rt49-rt50-rc1.patch.xz Alex Shi (1): cpu_pm: replace raw_notifier to atomic_notifier Julia Cartwright (1): Linux 4.1.42-rt50-rc1 Peter Zijlstra (2): lockdep: Fix per-cpu static objects sched: Remove TASK_ALL Thomas Gleixner (2): rtmutex: Make lock_killable work sched: Prevent task state corruption by spurious lock wakeup include/linux/sched.h| 1 - include/linux/smp.h | 12 init/main.c | 8 kernel/cpu_pm.c | 43 ++- kernel/locking/rtmutex.c | 19 +++ kernel/module.c | 6 +- kernel/sched/core.c | 2 +- localversion-rt | 2 +- mm/percpu.c | 5 - 9 files changed, 44 insertions(+), 54 deletions(-) -- 2.13.1
[PATCH RT 6/6] Linux 4.1.42-rt50-rc1
4.1.42-rt50-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- --- localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/localversion-rt b/localversion-rt index 4b7dca68a5b4..e8a9a36bb066 100644 --- a/localversion-rt +++ b/localversion-rt @@ -1 +1 @@ --rt49 +-rt50-rc1 -- 2.13.1
[PATCH RT 5/6] cpu_pm: replace raw_notifier to atomic_notifier
4.1.42-rt50-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Alex Shi This patch replaces a rwlock and raw notifier with an atomic notifier protected by a spinlock and RCU. The first reason for this replacement is a 'scheduling while atomic' bug in the RT kernel on the arm/arm64 platforms. On arm/arm64, the rwlock cpu_pm_notifier_lock in cpu_pm causes a potential schedule after irq disable in the idle call chain: cpu_startup_entry cpu_idle_loop local_irq_disable() cpuidle_idle_call call_cpuidle cpuidle_enter cpuidle_enter_state ->enter :arm_enter_idle_state cpu_pm_enter/exit CPU_PM_CPU_IDLE_ENTER read_lock(&cpu_pm_notifier_lock); <-- sleep in idle __rt_spin_lock(); schedule(); The kernel panic is here: [4.609601] BUG: scheduling while atomic: swapper/1/0/0x00000002 [4.609608] [] arm_enter_idle_state+0x18/0x70 [4.609614] Modules linked in: [4.609615] [] cpuidle_enter_state+0xf0/0x218 [4.609620] [] cpuidle_enter+0x18/0x20 [4.609626] Preemption disabled at: [4.609627] [] call_cpuidle+0x24/0x40 [4.609635] [] schedule_preempt_disabled+0x1c/0x28 [4.609639] [] cpu_startup_entry+0x154/0x1f8 [4.609645] [] secondary_start_kernel+0x15c/0x1a0 Daniel Lezcano said this notification is needed on arm/arm64 platforms. Sebastian suggested using an atomic_notifier instead of the rwlock, which not only removes the sleeping in idle, but also improves latency. This patch passed Fengguang's 0day testing. Signed-off-by: Alex Shi Cc: Sebastian Andrzej Siewior Cc: Thomas Gleixner Cc: Anders Roxell Cc: Rik van Riel Cc: Steven Rostedt Cc: Rafael J. Wysocki Cc: Daniel Lezcano Cc: linux-rt-users Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit df0fba5ba4c69cdc68bdaa5ca7a4100d959fdd07) Signed-off-by: Julia Cartwright --- kernel/cpu_pm.c | 43 ++- 1 file changed, 6 insertions(+), 37 deletions(-) diff --git a/kernel/cpu_pm.c b/kernel/cpu_pm.c index 9656a3c36503..9da42f83ee03 100644 --- a/kernel/cpu_pm.c +++ b/kernel/cpu_pm.c @@ -22,14 +22,13 @@ #include #include -static DEFINE_RWLOCK(cpu_pm_notifier_lock); -static RAW_NOTIFIER_HEAD(cpu_pm_notifier_chain); +static ATOMIC_NOTIFIER_HEAD(cpu_pm_notifier_chain); static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls) { int ret; - ret = __raw_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, + ret = __atomic_notifier_call_chain(&cpu_pm_notifier_chain, event, NULL, nr_to_call, nr_calls); return notifier_to_errno(ret); @@ -47,14 +46,7 @@ static int cpu_pm_notify(enum cpu_pm_event event, int nr_to_call, int *nr_calls) */ int cpu_pm_register_notifier(struct notifier_block *nb) { - unsigned long flags; - int ret; - - write_lock_irqsave(&cpu_pm_notifier_lock, flags); - ret = raw_notifier_chain_register(&cpu_pm_notifier_chain, nb); - write_unlock_irqrestore(&cpu_pm_notifier_lock, flags); - - return ret; + return atomic_notifier_chain_register(&cpu_pm_notifier_chain, nb); } EXPORT_SYMBOL_GPL(cpu_pm_register_notifier); @@ -69,14 +61,7 @@ EXPORT_SYMBOL_GPL(cpu_pm_register_notifier); */ int cpu_pm_unregister_notifier(struct notifier_block *nb) { - unsigned long flags; - int ret; - - write_lock_irqsave(&cpu_pm_notifier_lock, flags); - ret = raw_notifier_chain_unregister(&cpu_pm_notifier_chain, nb); - write_unlock_irqrestore(&cpu_pm_notifier_lock, flags); - - return ret; + return atomic_notifier_chain_unregister(&cpu_pm_notifier_chain, nb); } EXPORT_SYMBOL_GPL(cpu_pm_unregister_notifier); @@ -100,7 +85,6 @@ int cpu_pm_enter(void) int nr_calls; int ret = 0; -
read_lock(&cpu_pm_notifier_lock); ret = cpu_pm_notify(CPU_PM_ENTER, -1, &nr_calls); if (ret) /* @@ -108,7 +92,6 @@ int cpu_pm_enter(void) * PM entry who are notified earlier to prepare for it. */ cpu_pm_notify(CPU_PM_ENTER_FAILED, nr_calls - 1, NULL); - read_unlock(&cpu_pm_notifier_lock); return ret; } @@ -128,13 +111,7 @@ EXPORT_SYMBOL_GPL(cpu_pm_enter); */ int cpu_pm_exit(void) { - int ret; - - read_lock(&cpu_pm_notifier_lock); - ret = cpu_pm_notify(CPU_PM_EXIT, -1, NULL); - read_unlock(&cpu_pm_notifier_lock); - - return ret; + return cpu_pm_notify(CPU_PM_EXIT, -1, NULL); } EXPORT_SYMBOL_GPL(cpu_pm_exit); @@ -159,7 +136,6 @@ int cpu_cluster_pm_enter(void) int nr_calls; int ret = 0; - read_lock(&cpu_pm_notifier_lock); ret = cpu_pm
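From a consumer's point of view the API is unchanged; only the context rules tighten, since callbacks now run under the atomic notifier chain and must not take sleeping locks. A hedged sketch of a typical user (the callback and its device state are hypothetical):

#include <linux/cpu_pm.h>
#include <linux/notifier.h>

static int my_cpu_pm_cb(struct notifier_block *nb, unsigned long action,
                        void *data)
{
        switch (action) {
        case CPU_PM_ENTER:
                /* Save per-CPU hardware state; runs with IRQs off in idle. */
                break;
        case CPU_PM_EXIT:
        case CPU_PM_ENTER_FAILED:
                /* Restore the saved state. */
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block my_cpu_pm_nb = {
        .notifier_call = my_cpu_pm_cb,
};

static int __init my_init(void)
{
        /* Callbacks must be atomic-context safe under the new scheme. */
        return cpu_pm_register_notifier(&my_cpu_pm_nb);
}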
[PATCH RT 1/6] lockdep: Fix per-cpu static objects
4.1.42-rt50-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Peter Zijlstra Since commit 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") we try to collapse per-cpu locks into a single class by giving them all the same key. For this key we choose the canonical address of the per-cpu object, which would be the offset into the per-cpu area. This has two problems: - there is a case where we run !0 lock->key through static_obj() and expect this to pass; it doesn't for canonical pointers. - 0 is a valid canonical address. Cure both issues by redefining the canonical address as the address of the per-cpu variable on the boot CPU. Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at all, track the boot CPU in a variable. Fixes: 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") Reported-by: kernel test robot Signed-off-by: Peter Zijlstra (Intel) Tested-by: Borislav Petkov Cc: Sebastian Andrzej Siewior Cc: linux...@kvack.org Cc: w...@linux.intel.com Cc: kernel test robot Cc: LKP Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j...@hirez.programming.kicks-ass.net Signed-off-by: Thomas Gleixner (cherry picked from commit c9fe9196079f738c89c3ffcdce3fbe142ac3f3c4) Signed-off-by: Julia Cartwright --- include/linux/smp.h | 12 init/main.c | 8 kernel/module.c | 6 +- mm/percpu.c | 5 - 4 files changed, 29 insertions(+), 2 deletions(-) diff --git a/include/linux/smp.h b/include/linux/smp.h index e6ab36aeaaab..cbf6836524dc 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -120,6 +120,13 @@ extern unsigned int setup_max_cpus; extern void __init setup_nr_cpu_ids(void); extern void __init smp_init(void); +extern int __boot_cpu_id; + +static inline int get_boot_cpu_id(void) +{ + return __boot_cpu_id; +} + #else /* !SMP */ static inline void smp_send_stop(void) { } @@ -158,6 +165,11 @@ static inline void smp_init(void) { up_late_init(); } static inline void smp_init(void) { } #endif +static inline int get_boot_cpu_id(void) +{ + return 0; +} + #endif /* !SMP */ /* diff --git a/init/main.c b/init/main.c index 0486a8e11fc0..e1bae15a2154 100644 --- a/init/main.c +++ b/init/main.c @@ -451,6 +451,10 @@ void __init parse_early_param(void) * Activate the first processor. 
*/ +#ifdef CONFIG_SMP +int __boot_cpu_id; +#endif + static void __init boot_cpu_init(void) { int cpu = smp_processor_id(); @@ -459,6 +463,10 @@ static void __init boot_cpu_init(void) set_cpu_active(cpu, true); set_cpu_present(cpu, true); set_cpu_possible(cpu, true); + +#ifdef CONFIG_SMP + __boot_cpu_id = cpu; +#endif } void __init __weak smp_setup_processor_id(void) diff --git a/kernel/module.c b/kernel/module.c index a7ac858fd1a1..982c57b2c2a1 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -542,8 +542,12 @@ bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr) void *va = (void *)addr; if (va >= start && va < start + mod->percpu_size) { - if (can_addr) + if (can_addr) { *can_addr = (unsigned long) (va - start); + *can_addr += (unsigned long) + per_cpu_ptr(mod->percpu, + get_boot_cpu_id()); + } preempt_enable(); return true; } diff --git a/mm/percpu.c b/mm/percpu.c index 4146b00bfde7..b41c3960d5fb 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1297,8 +1297,11 @@ bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr) void *va = (void *)addr; if (va >= start && va < start + static_size) { - if (can_addr) + if (can_addr) { *can_addr = (unsigned long) (va - start); + *can_addr += (unsigned long) + per_cpu_ptr(base, get_boot_cpu_id()); + } return true; } } -- 2.13.1
[PATCH RT 3/6] sched: Prevent task state corruption by spurious lock wakeup
4.1.42-rt50-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Thomas Gleixner Mathias and others reported GDB failures on RT. The following scenario leads to task state corruption: CPU0 CPU1 T1->state = TASK_XXX; spin_lock(&lock) rt_spin_lock_slowlock(&lock->rtmutex) raw_spin_lock(&rtm->wait_lock); T1->saved_state = current->state; T1->state = TASK_UNINTERRUPTIBLE; spin_unlock(&lock) task_blocks_on_rt_mutex(rtm) rt_spin_lock_slowunlock(&lock->rtmutex) queue_waiter(rtm) raw_spin_lock(&rtm->wait_lock); pi_chain_walk(rtm) raw_spin_unlock(&rtm->wait_lock); wake_top_waiter(T1) raw_spin_lock(&rtm->wait_lock); for (;;) { if (__try_to_take_rt_mutex()) <- Succeeds break; ... } T1->state = T1->saved_state; try_to_wake_up(T1) ttwu_do_wakeup(T1) T1->state = TASK_RUNNING; In most cases this is harmless because waiting for some event, which is the usual reason for TASK_[UN]INTERRUPTIBLE, has to be safe against other forms of spurious wakeups anyway. But in the case of TASK_TRACED this is actually fatal, because the task loses the TASK_TRACED state. In consequence it fails to consume SIGSTOP which was sent from the debugger and actually delivers SIGSTOP to the task which breaks the ptrace mechanics and brings the debugger into an unexpected state. The TASK_TRACED state should prevent getting there due to the state matching logic in try_to_wake_up(). But that's not true because wake_up_lock_sleeper() uses TASK_ALL as state mask. That's bogus because lock sleepers always use TASK_UNINTERRUPTIBLE, so the wakeup should use that as well. The cure is way simpler than figuring it out: Change the mask used in wake_up_lock_sleeper() from TASK_ALL to TASK_UNINTERRUPTIBLE. Cc: stable...@vger.kernel.org Reported-by: Mathias Koehrer Reported-by: David Hauck Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit 2f9f24e15088d2ef3244d088a9604d7e98c9c625) Signed-off-by: Julia Cartwright --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0d3a40b24304..ee11a59e53ff 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1876,7 +1876,7 @@ EXPORT_SYMBOL(wake_up_process); */ int wake_up_lock_sleeper(struct task_struct *p) { - return try_to_wake_up(p, TASK_ALL, WF_LOCK_SLEEPER); + return try_to_wake_up(p, TASK_UNINTERRUPTIBLE, WF_LOCK_SLEEPER); } int wake_up_state(struct task_struct *p, unsigned int state) -- 2.13.1
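For readers unfamiliar with the RT slowpath, the saved_state juggling in the scenario looks roughly like this (a simplified sketch; helper names are illustrative, not the verbatim RT source):

static void rt_spin_lock_slowlock_sketch(struct rt_mutex *lock)
{
        struct task_struct *self = current;

        raw_spin_lock(&lock->wait_lock);

        /* Stash whatever state the task was in (e.g. TASK_TRACED)... */
        self->saved_state = self->state;
        /* ...and sleep in TASK_UNINTERRUPTIBLE like any lock sleeper. */
        __set_current_state(TASK_UNINTERRUPTIBLE);

        /* ... enqueue the waiter, then loop on __try_to_take_rt_mutex()
         * and schedule() until the lock is acquired ... */

        /*
         * Restore the stashed state on acquisition. The bug: with
         * TASK_ALL as wakeup mask, a late wake_up_lock_sleeper() could
         * still match the restored TASK_TRACED and overwrite it with
         * TASK_RUNNING.
         */
        __set_current_state(self->saved_state);
        raw_spin_unlock(&lock->wait_lock);
}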
[PATCH RT 4/6] sched: Remove TASK_ALL
4.1.42-rt50-rc1 stable review patch. If you have any objection to the inclusion of this patch, let me know. --- 8< --- 8< --- 8< --- From: Peter Zijlstra It's unused: $ git grep "\<TASK_ALL\>" | wc -l 1 And dangerous, kill the bugger. Cc: stable...@vger.kernel.org Acked-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior (cherry picked from commit a1762267d95649bddf6e94e7e3305e0207d0fff0) Signed-off-by: Julia Cartwright --- include/linux/sched.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index d51525ce2c41..7587d6181cd2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -227,7 +227,6 @@ extern char ___assert_task_state[1 - 2*!!( /* Convenience macros for the sake of wake_up */ #define TASK_NORMAL(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE) -#define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED) /* get_task_state() */ #define TASK_REPORT(TASK_RUNNING | TASK_INTERRUPTIBLE | \ -- 2.13.1
Re: [PATCH] [RFC] tpm_tis: tpm_tcg_flush() after iowrite*()s
On Fri, Aug 04, 2017 at 04:56:51PM -0500, Haris Okanovic wrote: > I have a latency issue using a SPI-based TPM chip with tpm_tis driver > from non-rt usermode application, which induces ~400 us latency spikes > in cyclictest (Intel Atom E3940 system, PREEMPT_RT_FULL kernel). > > The spikes are caused by a stalling ioread8() operation, following a > sequence of 30+ iowrite8()s to the same address. I believe this happens > because the writes are cached (in cpu or somewhere along the bus), which > gets flushed on the first LOAD instruction (ioread*()) that follows. To use the ARM parlance, these accesses aren't "cached" (which would imply that a result could be returned to the load from any intermediate node in the interconnect), but instead are "bufferable". It is really unfortunate that we continue to run into this class of problem across various CPU vendors and various underlying bus technologies; it's the continuing curse of running PREEMPT_RT on commodity hardware. RT is not easy :) > The enclosed change appears to fix this issue: read the TPM chip's > access register (status code) after every iowrite*() operation. Are we engaged in a game of whack-a-mole with all of the drivers which use this same access pattern (of which I imagine there are quite a few!)? I'm wondering if we should explore the idea of adding a load in the iowriteN()/writeX() macros (marking those accesses in which reads cause side effects explicitly, redirecting to a _raw() variant or something). Obviously that would be expensive for non-RT use cases, but for helping constrain latency, it may be worth it for RT. Julia
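The shape of the enclosed change, sketched (the helper names are hypothetical; TPM_ACCESS() is the driver's existing register-offset macro):

/* Flush posted writes by forcing a read from the device. */
static inline void tpm_tis_flush(void __iomem *iobase)
{
        ioread8(iobase + TPM_ACCESS(0));
}

/*
 * Write, then immediately flush, so long bursts of buffered writes
 * can't pile up and stall an unrelated later load for hundreds of
 * microseconds.
 */
static inline void tpm_tis_iowrite8(u8 b, void __iomem *iobase, u32 addr)
{
        iowrite8(b, iobase + addr);
        tpm_tis_flush(iobase);
}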
[ANNOUNCE] 4.1.42-rt49
Hello RT Folks, I'm happy to announce the 4.1.42-rt49 stable release. This release is just an update to the new stable 4.1.42 version and no RT specific changes have been made. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.1-rt Head SHA1: 64fd80a3b0ef4db0d91b6fa224f49197745ef503 Or to build 4.1.42-rt49 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.42.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.42-rt49.patch.xz Let me know if there are any problems! Julia
Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 !//RE: kernel BUG at kernel/locking/rtmutex.c:1027
On Tue, Jun 27, 2017 at 05:47:41AM +0000, Feng Feng24 Liu wrote: > Hi, Julia > Thanks for your kind hint. I will try this patch. > The problem is intermittent. I will try to reproduce it. > BTW, could you give me a link to the emails which > discuss " nsfs: mark dentry with DCACHE_RCUACCESS "? I > cannot find them (just the patch email) The conversation upstream started with a syzkaller-found problem here: https://marc.info/?l=linux-netdev&m=149183505630266&w=2 Hopefully that helps, Julia
Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 !//RE: kernel BUG at kernel/locking/rtmutex.c:1027
On Mon, Jun 26, 2017 at 04:54:36PM +0200, Sebastian Andrzej Siewior wrote: > On 2017-06-26 10:24:18 [-0400], Steven Rostedt wrote: > > > CPU: 17 PID: 1738811 Comm: ip Not tainted 4.4.70-thinkcloud-nfv #1 > > > Hardware name: LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS > > > -[TCE124M-2.10]- 06/23/2016 > > > task: 881cda2c27c0 ti: 881ea0538000 task.ti: 881ea0538000 > > > RIP: 0010:[] [] > > > __try_to_take_rt_mutex+0x34/0x160 > > > RSP: 0018:881ea053bb50 EFLAGS: 00010082 > > > RAX: RBX: 881f805416a8 RCX: > > > RDX: 881ea053bb98 RSI: 881cda2c27c0 RDI: 881f805416a8 > > > RBP: 881ea053bb60 R08: 0001 R09: 0002 > > > R10: 0a01 R11: 0001 R12: 881cda2c27c0 > > > R13: 881cda2c27c0 R14: 0202 R15: 881f6b0c27c0 > > > FS: 7f28be315740() GS:88205f8c() > > > knlGS: > > > CS: 0010 DS: ES: CR0: 80050033 > > > CR2: 0038 CR3: 001e9e479000 CR4: 003406e0 > > > DR0: DR1: DR2: > > > DR3: DR6: fffe0ff0 DR7: 0400 > > > Stack: > > > 881f805416a8 881ea053bb98 881ea053bc28 81a8f03d > > > 881ea053c000 01ff881ea053bb90 881cda2c27c0 881f6b0c27c1 > > > 881cda2c2eb0 0001 > > > Call Trace: > > > [] rt_spin_lock_slowlock+0x13d/0x390 > > > [] rt_spin_lock+0x1f/0x30 > > > [] lockref_get_not_dead+0xf/0x50 > > > [] ns_get_path+0x61/0x1d0 > > > > Hmm, this is in the filesystem code. What were you doing when this > > happened? > > and do you have any patches except the RT patch? This stack trace looks very familiar to an upstream use-after-free bug fixed in v4.11 (commit 073c516ff735, "nsfs: mark dentry with DCACHE_RCUACCESS", attached below), it's tagged for stable, but doesn't look like it's trickled it's way back to 4.4.y yet. Can you reproduce this problem reliably? Can you confirm that the below fixes it (it'll require some minor cajoling to apply cleanly)? Julia -- 8< -- From: Cong Wang Date: Wed, 19 Apr 2017 15:11:00 -0700 Subject: [PATCH] nsfs: mark dentry with DCACHE_RCUACCESS Andrey reported a use-after-free in __ns_get_path(): spin_lock include/linux/spinlock.h:299 [inline] lockref_get_not_dead+0x19/0x80 lib/lockref.c:179 __ns_get_path+0x197/0x860 fs/nsfs.c:66 open_related_ns+0xda/0x200 fs/nsfs.c:143 sock_ioctl+0x39d/0x440 net/socket.c:1001 vfs_ioctl fs/ioctl.c:45 [inline] do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685 SYSC_ioctl fs/ioctl.c:700 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 We are under rcu read lock protection at that point: rcu_read_lock(); d = atomic_long_read(&ns->stashed); if (!d) goto slow; dentry = (struct dentry *)d; if (!lockref_get_not_dead(&dentry->d_lockref)) goto slow; rcu_read_unlock(); but don't use a proper RCU API on the free path, therefore a parallel __d_free() could free it at the same time. We need to mark the stashed dentry with DCACHE_RCUACCESS so that __d_free() will be called after all readers leave RCU. Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs") Cc: Alexander Viro Cc: Andrew Morton Reported-by: Andrey Konovalov Signed-off-by: Cong Wang Signed-off-by: Linus Torvalds --- fs/nsfs.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/nsfs.c b/fs/nsfs.c index 1656843e87d2..323f492e0822 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -91,6 +91,7 @@ slow: return ERR_PTR(-ENOMEM); } d_instantiate(dentry, inode); + dentry->d_flags |= DCACHE_RCUACCESS; dentry->d_fsdata = (void *)ns->ops; d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry); if (d) { -- 2.13.1
[ANNOUNCE] 4.1.40-rt48
Hello RT Folks, I'm happy to announce the 4.1.40-rt48 stable release. This release is just an update to the new stable 4.1.40 version and no RT specific changes have been made. You can get this release via the git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git branch: v4.1-rt Head SHA1: a87c589ef7f10183dc648ef6be7e53eb3143c42e Or to build 4.1.40-rt48 directly, the following patches should be applied: http://www.kernel.org/pub/linux/kernel/v4.x/linux-4.1.tar.xz http://www.kernel.org/pub/linux/kernel/v4.x/patch-4.1.40.xz http://www.kernel.org/pub/linux/kernel/projects/rt/4.1/patch-4.1.40-rt48.patch.xz Let me know if there are any problems! Julia
[PATCH] md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
On mainline, there is no functional difference, just less code, and symmetric lock/unlock paths. On PREEMPT_RT builds, this fixes the following warning, seen by Alexander GQ Gerasiov, due to the sleeping nature of spinlocks. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993 in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1 CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: GW 4.9.20-rt16-stand6-686 #1 Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015 Workqueue: writeback wb_workfn (flush-253:0) Call Trace: dump_stack+0x47/0x68 ? migrate_enable+0x4a/0xf0 ___might_sleep+0x101/0x180 rt_spin_lock+0x17/0x40 add_stripe_bio+0x4e3/0x6c0 [raid456] ? preempt_count_add+0x42/0xb0 raid5_make_request+0x737/0xdd0 [raid456] Reported-by: Alexander GQ Gerasiov Tested-by: Alexander GQ Gerasiov Signed-off-by: Julia Cartwright --- Hey All- While this fixes a problem on RT primarily, the patch is equally applicable upstream, as such probably makes sense to be pulled through the md tree. It may also make sense to be pulled directly into rt-devel. Alexander- I turned your "I confirm the fix" to a 'Tested-by', let me know if that's a problem. Thanks, Julia drivers/md/raid5.c | 17 +++-- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index fa2c4de32a64..54dc2995aeee 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -110,8 +110,7 @@ static inline void unlock_device_hash_lock(struct r5conf *conf, int hash) static inline void lock_all_device_hash_locks_irq(struct r5conf *conf) { int i; - local_irq_disable(); - spin_lock(conf->hash_locks); + spin_lock_irq(conf->hash_locks); for (i = 1; i < NR_STRIPE_HASH_LOCKS; i++) spin_lock_nest_lock(conf->hash_locks + i, conf->hash_locks); spin_lock(&conf->device_lock); @@ -121,9 +120,9 @@ static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf) { int i; spin_unlock(&conf->device_lock); - for (i = NR_STRIPE_HASH_LOCKS; i; i--) - spin_unlock(conf->hash_locks + i - 1); - local_irq_enable(); + for (i = NR_STRIPE_HASH_LOCKS - 1; i; i--) + spin_unlock(conf->hash_locks + i); + spin_unlock_irq(conf->hash_locks); } /* bio's attached to a stripe+device for I/O are linked together in bi_sector @@ -732,12 +731,11 @@ static bool is_full_stripe_write(struct stripe_head *sh) static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) { - local_irq_disable(); if (sh1 > sh2) { - spin_lock(&sh2->stripe_lock); + spin_lock_irq(&sh2->stripe_lock); spin_lock_nested(&sh1->stripe_lock, 1); } else { - spin_lock(&sh1->stripe_lock); + spin_lock_irq(&sh1->stripe_lock); spin_lock_nested(&sh2->stripe_lock, 1); } } @@ -745,8 +743,7 @@ static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) { spin_unlock(&sh1->stripe_lock); - spin_unlock(&sh2->stripe_lock); - local_irq_enable(); + spin_unlock_irq(&sh2->stripe_lock); } /* Only freshly new full stripe normal write stripe can be added to a batch list */ -- 2.12.0
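The underlying rule generalizes beyond raid5; a minimal sketch of the anti-pattern and its replacement (the example lock is hypothetical):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);

static void broken_on_rt(void)
{
        local_irq_disable();
        spin_lock(&example_lock);       /* may sleep on RT: BUG splat */
        /* ... critical section ... */
        spin_unlock(&example_lock);
        local_irq_enable();
}

static void works_everywhere(void)
{
        /*
         * On !RT this is exactly local_irq_disable() + spin_lock();
         * on RT it maps to a sleeping lock without actually disabling
         * interrupts, so the code stays preemptible and legal.
         */
        spin_lock_irq(&example_lock);
        /* ... critical section ... */
        spin_unlock_irq(&example_lock);
}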
Re: [PATCH RT 1/1] remoteproc: Prevent schedule while atomic
On Thu, Mar 23, 2017 at 10:26:49AM +0000, Lee Jones wrote: > On Thu, 23 Mar 2017, Lionel DEBIEVE wrote: > > > On 03/22/2017 07:47 PM, Julia Cartwright wrote: > > > On Wed, Mar 22, 2017 at 01:30:12PM -0500, Grygorii Strashko wrote: > > >> On 03/22/2017 01:01 PM, Steven Rostedt wrote: > > >>> On Wed, 22 Mar 2017 12:37:59 -0500 > > >>> Julia Cartwright wrote: > > >>> > > >>>> Which kernel were you testing on, here? From what I can tell, this > > >>>> should have been fixed with Thomas's commit: > > >>>> > > >>>> 2a1d3ab8986d ("genirq: Handle force threading of irqs with primary > > >>>> and thread handler") > > >>> Thanks Julia for looking into this. I just looked at the code, and saw > > >>> that it does very little with the lock held, and was fine with the > > >>> conversion. But if that interrupt handler should be in a thread, we > > >>> should see if that's the issue first. > > >> > > >> It will not be threaded because IRQF_ONESHOT is used. > > >> > > >> ret = devm_request_threaded_irq(&pdev->dev, irq, > > >> sti_mbox_irq_handler, > > >> sti_mbox_thread_handler, > > >> IRQF_ONESHOT, mdev->name, mdev); > > > Indeed. I had skipped over this important detail when I was skimming > > > through the code. > > > > > > Thanks for clarifying! > > > > > > Is IRQF_ONESHOT really necessary for this device? The primary handler > > > invokes sti_mbox_disable_channel() on the interrupting channel, which I > > > would hope would acquiesce the pending interrupt at the device level? > > Not sure. This part of the code is a remnant from when I re-wrote it. > > What is the alternative? If, on the completed execution of the registered primary handler, you can ensure that the device is no longer asserting an interrupt to the connected irq chip, then IRQF_ONESHOT isn't necessary, because it's safe for the irq core to unmask the interrupt after the primary handler runs. It appears that it might be able to make this guarantee, if that's what sti_mbox_disable_channel() is doing. > NB: What does 'acquiesce' mean in this context? Is that a typo? I meant 'acquiesce' in the sense I mentioned before: prevent the device from asserting the interrupt. Perhaps it's an uncommon use of the word. > > > Also, as written there are num_inst reads of STI_IRQ_VAL_OFFSET in the > > > primary handler, which seems inefficient...(unless of course reading > > > incurs side effects, here). > > Inefficient in what respect? I've since looked again and realized that the register base is recalculated based on 'instance', so disregard this comment. Julia
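Spelled out as a sketch, the alternative Julia describes looks like this ('struct my_dev' and the my_dev_*() helpers are hypothetical stand-ins for the sti_mbox equivalents):

#include <linux/interrupt.h>

static irqreturn_t my_primary_handler(int irq, void *data)
{
        struct my_dev *mdev = data;

        if (!my_dev_irq_pending(mdev))
                return IRQ_NONE;

        /*
         * Quiesce the interrupt at the device before returning; this
         * is the guarantee that makes dropping IRQF_ONESHOT safe,
         * since the core may unmask the line as soon as we return.
         */
        my_dev_mask_irq(mdev);
        return IRQ_WAKE_THREAD;
}

static irqreturn_t my_thread_handler(int irq, void *data)
{
        struct my_dev *mdev = data;

        /* ... heavy lifting, sleeping allowed ... */
        my_dev_unmask_irq(mdev);        /* re-arm the device */
        return IRQ_HANDLED;
}

static int my_probe_irq(struct device *dev, struct my_dev *mdev, int irq)
{
        /* No IRQF_ONESHOT: the primary handler quiesces the device. */
        return devm_request_threaded_irq(dev, irq, my_primary_handler,
                                         my_thread_handler, 0,
                                         "my_dev", mdev);
}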
Re: [PATCH v4 1/4] pinctrl: rockchip: remove unnecessary locking
On Thu, Mar 23, 2017 at 09:01:43PM +0100, Heiko Stübner wrote: > On Thursday, 23 March 2017 at 13:29:10 CET, Julia Cartwright wrote: > > On Thu, Mar 23, 2017 at 06:55:50PM +0100, Heiko Stübner wrote: > > > On Thursday, 23 March 2017 at 17:51:53 CET, John Keeping wrote: > > > > On Thu, 23 Mar 2017 11:10:20 -0500, Julia Cartwright wrote: [..] > > > > > > @@ -1185,17 +1177,14 @@ static int rockchip_set_drive_perpin(struct > > > > > > rockchip_pin_bank *bank,> > > > > > > > > > > > > > rmask = BIT(15) | BIT(31); > > > > > > data |= BIT(31); > > > > > > ret = regmap_update_bits(regmap, reg, rmask, > > > > > > data); > > > > > > > > > > > > - if (ret) { > > > > > > - spin_unlock_irqrestore(&bank->slock, > > > > > > flags); > > > > > > + if (ret) > > > > > > > > > > > > return ret; > > > > > > > > > > > > - } > > > > > > > > > > > > rmask = 0x3 | (0x3 << 16); > > > > > > temp |= (0x3 << 16); > > > > > > reg += 0x4; > > > > > > ret = regmap_update_bits(regmap, reg, rmask, > > > > > > temp); > > > > > > > > > > Killing the lock here means the writes to this pair of registers > > > > > (reg > > > > > and reg + 4) can be observed non-atomically. Have you convinced > > > > > yourself that this isn't a problem? > > > > I called it out in v1 [1] since this bit is new since v4.4 where I > > > > originally wrote this patch, and didn't get any comments about it. > > > > I've convinced myself that removing the lock doesn't cause any problems > > > > for writing to the hardware: if the lock would prevent writes > > > > interleaving then it means that two callers are trying to write > > > > different drive strengths to the same pin, and even with a lock here one > > > > of them will end up with the wrong drive strength. > > > > But it does mean that a read via rockchip_get_drive_perpin() may see an > > > > inconsistent state. I think adding a new lock specifically for this > > > > particular drive strength bit is overkill and I can't find a scenario > > > > where this will actually matter; any driver setting a pinctrl config > > > > must already be doing something to avoid racing two configurations > > > > against each other, mustn't it? > > > also, pins can normally only be requested once - see drivers complaining > > > if > > > one of their pins is already held by a different driver. So if you really > > > end up with two things writing to the same drive strength bits, the > > > driver holding the pins must be really messed up anyway :-) > > My concern would be if two independent pins' drive strength > > configuration would walk on each other, because they happen to be > > configured via the same registers. > > If that's not possible, then great! > ah sorry that we didn't make that clearer in the beginning, but no that also > isn't possible. The registers use a "hiword-mask" scheme, so on each write to > bit (x) the corresponding write-mask in bit (x+16) also needs to be set, so > each write will only ever affect bits enabled in that way. Awesome, thanks for clearing things up. Thanks! Julia
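For reference, the hiword-mask convention Heiko describes can be sketched as follows (HIWORD_UPDATE mirrors the macro used across Rockchip drivers; the register offset and field placement here are made up for illustration):

/*
 * Bits [31:16] are a write-enable mask for bits [15:0]: only the bits
 * whose mask bit is set are modified, so no read-modify-write (and
 * therefore no lock) is needed for a single-field update.
 */
#define HIWORD_UPDATE(val, mask, shift) \
        (((val) << (shift)) | ((mask) << ((shift) + 16)))

/*
 * e.g. set a hypothetical 2-bit drive-strength field at bit 4 to 0x2,
 * touching nothing else, in one posted write:
 */
writel(HIWORD_UPDATE(0x2, 0x3, 4), base + reg);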