[PATCH kernel] powerpc/perf: Fix 32bit compile

2022-04-20 Thread Alexey Kardashevskiy
The "read_bhrb" global symbol is only referenced under CONFIG_PPC64 in
arch/powerpc/perf/core-book3s.c, but the file is compiled for both 32 bit
and 64 bit anyway (and LLVM fails to link it on 32 bit).

This fixes it by moving bhrb.o to obj64 targets.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/perf/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 2f46e31c7612..4f53d0b97539 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -3,11 +3,11 @@
 obj-y  += callchain.o callchain_$(BITS).o perf_regs.o
 obj-$(CONFIG_COMPAT)   += callchain_32.o
 
-obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
+obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   isa207-common.o power8-pmu.o power9-pmu.o \
-  generic-compat-pmu.o power10-pmu.o
+  generic-compat-pmu.o power10-pmu.o bhrb.o
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_PPC_POWERNV)  += imc-pmu.o
-- 
2.30.2
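For context on why moving the object file works: the `obj64-` prefix is arch-local kbuild glue, with the powerpc Makefiles folding the width-specific lists back into the normal object list. A sketch of that pattern (worth verifying against the actual tree) looks like:

```make
# Sketch of the kbuild glue behind obj64-/obj32-: the arch Makefile folds
# the width-specific lists into obj-y, so after this patch bhrb.o is only
# built when CONFIG_PPC64 is set and the 32-bit link never sees it.
obj-$(CONFIG_PPC64) += $(obj64-y)
obj-$(CONFIG_PPC32) += $(obj32-y)
```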



[PATCH] powerpc/pci: Remove useless null check before call of_node_put()

2022-04-20 Thread Haowen Bai
No need for a NULL check before calling of_node_put(), since the
implementation of of_node_put() already handles NULL.

Signed-off-by: Haowen Bai 
---
 arch/powerpc/kernel/pci_dn.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 61571ae23953..ba3bbc9bec2d 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -357,8 +357,8 @@ void pci_remove_device_node_info(struct device_node *dn)
 
/* Drop the parent pci_dn's ref to our backing dt node */
parent = of_get_parent(dn);
-   if (parent)
-   of_node_put(parent);
+
+   of_node_put(parent);
 
/*
 * At this point we *might* still have a pci_dev that was
-- 
2.7.4



[PATCH] ASoC: imx-hdmi: remove useless null check before call of_node_put()

2022-04-20 Thread Haowen Bai
No need for a NULL check before calling of_node_put(), since the
implementation of of_node_put() already handles NULL.

Signed-off-by: Haowen Bai 
---
 sound/soc/fsl/imx-hdmi.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sound/soc/fsl/imx-hdmi.c b/sound/soc/fsl/imx-hdmi.c
index e10136afa741..2ae1a889c68d 100644
--- a/sound/soc/fsl/imx-hdmi.c
+++ b/sound/soc/fsl/imx-hdmi.c
@@ -206,8 +206,7 @@ static int imx_hdmi_probe(struct platform_device *pdev)
}
 
 fail:
-   if (cpu_np)
-   of_node_put(cpu_np);
+   of_node_put(cpu_np);
 
return ret;
 }
-- 
2.7.4



Re: [PATCH] misc: ocxl: fix possible double free in ocxl_file_register_afu

2022-04-20 Thread Hangyu Hua

On 2022/4/21 06:54, Michael Ellerman wrote:

Hangyu Hua  writes:

info_release() will be called in device_unregister() when info->dev's
reference count is 0. So there is no need to call ocxl_afu_put() and
kfree() again.


Double frees are often exploitable. But it looks to me like this error
path is not easily reachable by an attacker.

ocxl_file_register_afu() is only called from ocxl_probe(), and we only
go to err_unregister if the sysfs or cdev initialisation fails, which
should only happen if we hit ENOMEM, or we have a duplicate device which
would be a device-tree/hardware error. But maybe Fred can check more
closely, I don't know the driver that well.

cheers



Hi Michael,

You are right. It is hard to exploit, at least in my understanding. 
That's why I didn't send this to security@. But it still needs to be 
fixed. By the way, if you are interested in this kind of bug, you can 
check out the other three patches I submitted recently.


rpmsg: virtio: fix possible double free in rpmsg_virtio_add_ctrl_dev()
https://lore.kernel.org/all/20220418101724.42174-1-hbh...@gmail.com/

rpmsg: virtio: fix possible double free in rpmsg_probe()
https://lore.kernel.org/all/20220418093144.40859-1-hbh...@gmail.com/

hwtracing: stm: fix possible double free in stm_register_device()
https://lore.kernel.org/all/20220418081632.35121-1-hbh...@gmail.com/

They are all of the similar bugs I could find in the kernel.

Thanks.




Fix this by adding free_minor() and a return to the err_unregister error path.

Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & 
frontend")
Signed-off-by: Hangyu Hua 
---
  drivers/misc/ocxl/file.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index d881f5e40ad9..6777c419a8da 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -556,7 +556,9 @@ int ocxl_file_register_afu(struct ocxl_afu *afu)
  
  err_unregister:

ocxl_sysfs_unregister_afu(info); // safe to call even if register failed
+   free_minor(info);
device_unregister(&info->dev);
+   return rc;
  err_put:
ocxl_afu_put(afu);
free_minor(info);
--
2.25.1


Re: [PATCH] powerpc/time: Always set decrementer in timer_interrupt()

2022-04-20 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of April 21, 2022 12:28 am:
> Hello,
> 
> On Thu, Apr 21, 2022 at 12:16:57AM +1000, Michael Ellerman wrote:
>> This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
>> Don't enable MSR[EE] in irq handlers unless perf is in use").
>> 
>> Prior to that commit, we always set the decrementer in
>> timer_interrupt(), to clear the timer interrupt. Otherwise we could end
>> up continuously taking timer interrupts.
>> 
>> When high res timers are enabled there is no problem seen with leaving
>> the decrementer untouched in timer_interrupt(), because it will be
>> programmed via hrtimer_interrupt() -> tick_program_event() ->
>> clockevents_program_event() -> decrementer_set_next_event().
>> 
>> However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
> 
> How difficult is it to detect this condition?
> 
> Maybe detecting this could be just added?

Possibly not too difficult but I'd like to see if we can get this to work
in core timer code -

https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-April/242212.html

I'll resend as a patch and see what flamage I get...

Thanks,
Nick



Re: [PATCH] powerpc/time: Always set decrementer in timer_interrupt()

2022-04-20 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of April 21, 2022 12:16 am:
> This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
> Don't enable MSR[EE] in irq handlers unless perf is in use").
> 
> Prior to that commit, we always set the decrementer in
> timer_interrupt(), to clear the timer interrupt. Otherwise we could end
> up continuously taking timer interrupts.
> 
> When high res timers are enabled there is no problem seen with leaving
> the decrementer untouched in timer_interrupt(), because it will be
> programmed via hrtimer_interrupt() -> tick_program_event() ->
> clockevents_program_event() -> decrementer_set_next_event().
> 
> However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
> see a stall/lockup, because tick_nohz_handler() does not cause a
> reprogram of the decrementer, leading to endless timer interrupts.
> Example trace:
> 
>   [1.898617][T7] Freeing initrd memory: 2624K^M
>   [   22.680919][C1] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:^M
>   [   22.682281][C1] rcu: 0-: (25 ticks this GP) idle=073/0/0x1 
> softirq=10/16 fqs=1050 ^M
>   [   22.682851][C1]  (detected by 1, t=2102 jiffies, g=-1179, q=476)^M
>   [   22.683649][C1] Sending NMI from CPU 1 to CPUs 0:^M
>   [   22.685252][C0] NMI backtrace for cpu 0^M
>   [   22.685649][C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 5.16.0-rc2-00185-g0faf20a1ad16 #145^M
>   [   22.686393][C0] NIP:  c0016d64 LR: c0f6cca4 CTR: 
> c019c6e0^M
>   [   22.686774][C0] REGS: c2833590 TRAP: 0500   Not tainted  
> (5.16.0-rc2-00185-g0faf20a1ad16)^M
>   [   22.687222][C0] MSR:  80009033   CR: 
> 24000222  XER: ^M
>   [   22.688297][C0] CFAR: c000c854 IRQMASK: 0 ^M
>   ...
>   [   22.692637][C0] NIP [c0016d64] 
> arch_local_irq_restore+0x174/0x250^M
>   [   22.694443][C0] LR [c0f6cca4] __do_softirq+0xe4/0x3dc^M
>   [   22.695762][C0] Call Trace:^M
>   [   22.696050][C0] [c2833830] [c0f6cc80] 
> __do_softirq+0xc0/0x3dc (unreliable)^M
>   [   22.697377][C0] [c2833920] [c0151508] 
> __irq_exit_rcu+0xd8/0x130^M
>   [   22.698739][C0] [c2833950] [c0151730] 
> irq_exit+0x20/0x40^M
>   [   22.699938][C0] [c2833970] [c0027f40] 
> timer_interrupt+0x270/0x460^M
>   [   22.701119][C0] [c28339d0] [c00099a8] 
> decrementer_common_virt+0x208/0x210^M
> 
> Possibly this should be fixed in the lowres timing code, but that would
> be a generic change and could take some time and may not backport
> easily, so for now make the programming of the decrementer unconditional
> again in timer_interrupt() to avoid the stall/lockup.
> 
> Fixes: 0faf20a1ad16 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq 
> handlers unless perf is in use")
> Reported-by: Miguel Ojeda 

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/time.c | 29 ++---
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f5cbfe5efd25..f80cce0e3899 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -615,23 +615,22 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
>   return;
>   }
>  
> - /* Conditionally hard-enable interrupts. */
> - if (should_hard_irq_enable()) {
> - /*
> -  * Ensure a positive value is written to the decrementer, or
> -  * else some CPUs will continue to take decrementer exceptions.
> -  * When the PPC_WATCHDOG (decrementer based) is configured,
> -  * keep this at most 31 bits, which is about 4 seconds on most
> -  * systems, which gives the watchdog a chance of catching timer
> -  * interrupt hard lockups.
> -  */
> - if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
> - set_dec(0x7fff);
> - else
> - set_dec(decrementer_max);
> + /*
> +  * Ensure a positive value is written to the decrementer, or
> +  * else some CPUs will continue to take decrementer exceptions.
> +  * When the PPC_WATCHDOG (decrementer based) is configured,
> +  * keep this at most 31 bits, which is about 4 seconds on most
> +  * systems, which gives the watchdog a chance of catching timer
> +  * interrupt hard lockups.
> +  */
> + if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
> + set_dec(0x7fff);
> + else
> + set_dec(decrementer_max);
>  
> + /* Conditionally hard-enable interrupts. */
> + if (should_hard_irq_enable())
>   do_hard_irq_enable();
> - }
>  
>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>   if (atomic_read(&ppc_n_lost_interrupts) != 0)
> -- 
> 2.34.1
> 
> 


Re: [PATCH] misc: ocxl: fix possible double free in ocxl_file_register_afu

2022-04-20 Thread Michael Ellerman
Hangyu Hua  writes:
> info_release() will be called in device_unregister() when info->dev's
> reference count is 0. So there is no need to call ocxl_afu_put() and
> kfree() again.

Double frees are often exploitable. But it looks to me like this error
path is not easily reachable by an attacker.

ocxl_file_register_afu() is only called from ocxl_probe(), and we only
go to err_unregister if the sysfs or cdev initialisation fails, which
should only happen if we hit ENOMEM, or we have a duplicate device which
would be a device-tree/hardware error. But maybe Fred can check more
closely, I don't know the driver that well.

cheers


> Fix this by adding free_minor() and a return to the err_unregister error path.
>
> Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & 
> frontend")
> Signed-off-by: Hangyu Hua 
> ---
>  drivers/misc/ocxl/file.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
> index d881f5e40ad9..6777c419a8da 100644
> --- a/drivers/misc/ocxl/file.c
> +++ b/drivers/misc/ocxl/file.c
> @@ -556,7 +556,9 @@ int ocxl_file_register_afu(struct ocxl_afu *afu)
>  
>  err_unregister:
>   ocxl_sysfs_unregister_afu(info); // safe to call even if register failed
> + free_minor(info);
>   device_unregister(&info->dev);
> + return rc;
>  err_put:
>   ocxl_afu_put(afu);
>   free_minor(info);
> -- 
> 2.25.1


Re: [PATCH v1 1/1] powerpc/83xx/mpc8349emitx: Get rid of of_node assignment

2022-04-20 Thread Michael Ellerman
Linus Walleij  writes:
> On Wed, Apr 6, 2022 at 3:02 PM Andy Shevchenko
>  wrote:
>> On Mon, Mar 28, 2022 at 03:16:08PM +0200, Linus Walleij wrote:
>> > On Wed, Mar 23, 2022 at 6:43 PM Andy Shevchenko
>> >  wrote:
>> >
>> > > Let GPIO library to assign of_node from the parent device.
>> > > This allows to move GPIO library and drivers to use fwnode
>> > > APIs instead of being stuck with OF-only interfaces.
>> > >
>> > > Signed-off-by: Andy Shevchenko 
>> >
>> > That's a nice patch.
>> > Reviewed-by: Linus Walleij 
>>
>> Thanks!
>>
>> Can we have this applied now?
>
> I think Michael Ellerman could help with this?
>
> Michael?

Yep, I'll pick it up when I start putting things into next.

That's usually the week after rc2, but I had a break for Easter.

cheers


Re: [PATCH v2 2/8] mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-04-20 Thread Vlastimil Babka
On 3/29/22 18:43, David Hildenbrand wrote:
> Let's test that __HAVE_ARCH_PTE_SWP_EXCLUSIVE works as expected.
> 
> Signed-off-by: David Hildenbrand 

Acked-by: Vlastimil Babka 

> ---
>  mm/debug_vm_pgtable.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index db2abd9e415b..55f1a8dc716f 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -837,6 +837,19 @@ static void __init pmd_soft_dirty_tests(struct 
> pgtable_debug_args *args) { }
>  static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args 
> *args) { }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args)
> +{
> +#ifdef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
> +
> + pr_debug("Validating PTE swap exclusive\n");
> + pte = pte_swp_mkexclusive(pte);
> + WARN_ON(!pte_swp_exclusive(pte));

I guess only this WARN_ON needs to be guarded by the #ifdef, but it doesn't
matter that much - we won't gain significantly more test coverage.

> + pte = pte_swp_clear_exclusive(pte);
> + WARN_ON(pte_swp_exclusive(pte));
> +#endif /* __HAVE_ARCH_PTE_SWP_EXCLUSIVE */
> +}
> +
>  static void __init pte_swap_tests(struct pgtable_debug_args *args)
>  {
>   swp_entry_t swp;
> @@ -1288,6 +1301,8 @@ static int __init debug_vm_pgtable(void)
>   pte_swap_soft_dirty_tests();
>   pmd_swap_soft_dirty_tests();
>  
> + pte_swap_exclusive_tests();
> +
>   pte_swap_tests();
>   pmd_swap_tests();
>  



Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit

2022-04-20 Thread David Hildenbrand
On 20.04.22 19:10, Vlastimil Babka wrote:
> On 3/29/22 18:43, David Hildenbrand wrote:
>> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
>> it. We do this, to keep fork() logic on swap entries easy and efficient:
>> for example, if we wouldn't clear it when unmapping, we'd have to lookup
>> the page in the swapcache for each and every swap entry during fork() and
>> clear PG_anon_exclusive if set.
>>
>> Instead, we want to store that information directly in the swap pte,
>> protected by the page table lock, similarly to how we handle
>> SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
>> swap entries, we don't want to mess with the swap type (e.g., still one
>> bit) because it overcomplicates swap code.
>>
>> In try_to_unmap(), we already reject to unmap in case the page might be
>> pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
>> Checking if there are other unexpected references reliably *before*
>> completely unmapping a page is unfortunately not really possible: THP
>> heavily overcomplicate the situation. Once fully unmapped it's easier --
>> we, for example, make sure that there are no unexpected references
>> *after* unmapping a page before starting writeback on that page.
>>
>> So, we currently might end up unmapping a page and clearing
>> PG_anon_exclusive if that page has additional references, for example,
>> due to a FOLL_GET.
>>
>> do_swap_page() has to re-determine if a page is exclusive, which will
>> easily fail if there are other references on a page, most prominently
>> GUP references via FOLL_GET. This can currently result in memory
>> corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
>> when fork() is never involved: try_to_unmap() will succeed, and when
>> refaulting the page, it cannot be marked exclusive and will get replaced
>> by a copy in the page tables on the next write access, resulting in writes
>> via the GUP reference to the page being lost.
>>
>> In an ideal world, everybody that uses GUP and wants to modify page
>> content, such as O_DIRECT, would properly use FOLL_PIN. However, that
>> conversion will take a while. It's easier to fix what used to work in the
>> past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition,
>> by remembering PG_anon_exclusive we can further reduce unnecessary COW
>> in some cases, so it's the natural thing to do.
>>
>> So let's transfer the PG_anon_exclusive information to the swap pte and
>> store it via an architecture-dependent pte bit; use that information when
>> restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
>> simply have to clear the pte bit and are done.
>>
>> Of course, there is one corner case to handle: swap backends that don't
>> support concurrent page modifications while the page is under writeback.
>> Special case these, and drop the exclusive marker. Add a comment why that
>> is just fine (also, reuse_swap_page() would have done the same in the
>> past).
>>
>> In the future, we'll hopefully have all architectures support
>> __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
>> stubs and the define completely. Then, we can also convert
>> SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
>> support: either simply use a yet unused pte bit that can be used for swap
>> entries, steal one from the arch type bits if they exceed 5, or steal one
>> from the offset bits.
>>
>> Note: R/O FOLL_GET references were never really reliable, especially
>> when taking one on a shared page and then writing to the page (e.g., GUP
>> after fork()). FOLL_GET, including R/W references, were never really
>> reliable once fork was involved (e.g., GUP before fork(),
>> GUP during fork()). KSM steps back in case it stumbles over unexpected
>> references and is, therefore, fine.
>>
>> Signed-off-by: David Hildenbrand 
> 
> With the fixup as reportedy by Miaohe Lin
> 
> Acked-by: Vlastimil Babka 
> 
> (sent a separate mm-commits mail to inquire about the fix going missing from
> mmotm)
> 
> https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4bae...@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb

Yes I saw that, thanks for catching that!


-- 
Thanks,

David / dhildenb



Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit

2022-04-20 Thread Vlastimil Babka
On 3/29/22 18:43, David Hildenbrand wrote:
> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
> it. We do this, to keep fork() logic on swap entries easy and efficient:
> for example, if we wouldn't clear it when unmapping, we'd have to lookup
> the page in the swapcache for each and every swap entry during fork() and
> clear PG_anon_exclusive if set.
> 
> Instead, we want to store that information directly in the swap pte,
> protected by the page table lock, similarly to how we handle
> SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
> swap entries, we don't want to mess with the swap type (e.g., still one
> bit) because it overcomplicates swap code.
> 
> In try_to_unmap(), we already reject to unmap in case the page might be
> pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
> Checking if there are other unexpected references reliably *before*
> completely unmapping a page is unfortunately not really possible: THP
> heavily overcomplicate the situation. Once fully unmapped it's easier --
> we, for example, make sure that there are no unexpected references
> *after* unmapping a page before starting writeback on that page.
> 
> So, we currently might end up unmapping a page and clearing
> PG_anon_exclusive if that page has additional references, for example,
> due to a FOLL_GET.
> 
> do_swap_page() has to re-determine if a page is exclusive, which will
> easily fail if there are other references on a page, most prominently
> GUP references via FOLL_GET. This can currently result in memory
> corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
> when fork() is never involved: try_to_unmap() will succeed, and when
> refaulting the page, it cannot be marked exclusive and will get replaced
> by a copy in the page tables on the next write access, resulting in writes
> via the GUP reference to the page being lost.
> 
> In an ideal world, everybody that uses GUP and wants to modify page
> content, such as O_DIRECT, would properly use FOLL_PIN. However, that
> conversion will take a while. It's easier to fix what used to work in the
> past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition,
> by remembering PG_anon_exclusive we can further reduce unnecessary COW
> in some cases, so it's the natural thing to do.
> 
> So let's transfer the PG_anon_exclusive information to the swap pte and
> store it via an architecture-dependent pte bit; use that information when
> restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
> simply have to clear the pte bit and are done.
> 
> Of course, there is one corner case to handle: swap backends that don't
> support concurrent page modifications while the page is under writeback.
> Special case these, and drop the exclusive marker. Add a comment why that
> is just fine (also, reuse_swap_page() would have done the same in the
> past).
> 
> In the future, we'll hopefully have all architectures support
> __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
> stubs and the define completely. Then, we can also convert
> SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
> support: either simply use a yet unused pte bit that can be used for swap
> entries, steal one from the arch type bits if they exceed 5, or steal one
> from the offset bits.
> 
> Note: R/O FOLL_GET references were never really reliable, especially
> when taking one on a shared page and then writing to the page (e.g., GUP
> after fork()). FOLL_GET, including R/W references, were never really
> reliable once fork was involved (e.g., GUP before fork(),
> GUP during fork()). KSM steps back in case it stumbles over unexpected
> references and is, therefore, fine.
> 
> Signed-off-by: David Hildenbrand 

With the fixup as reportedy by Miaohe Lin

Acked-by: Vlastimil Babka 

(sent a separate mm-commits mail to inquire about the fix going missing from
mmotm)

https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4bae...@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb



Re: [PATCH] powerpc/time: Always set decrementer in timer_interrupt()

2022-04-20 Thread Michal Suchánek
Hello,

On Thu, Apr 21, 2022 at 12:16:57AM +1000, Michael Ellerman wrote:
> This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
> Don't enable MSR[EE] in irq handlers unless perf is in use").
> 
> Prior to that commit, we always set the decrementer in
> timer_interrupt(), to clear the timer interrupt. Otherwise we could end
> up continuously taking timer interrupts.
> 
> When high res timers are enabled there is no problem seen with leaving
> the decrementer untouched in timer_interrupt(), because it will be
> programmed via hrtimer_interrupt() -> tick_program_event() ->
> clockevents_program_event() -> decrementer_set_next_event().
> 
> However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we

How difficult is it to detect this condition?

Maybe detecting this could be just added?

Thanks

Michal

> see a stall/lockup, because tick_nohz_handler() does not cause a
> reprogram of the decrementer, leading to endless timer interrupts.
> Example trace:
> 
>   [1.898617][T7] Freeing initrd memory: 2624K^M
>   [   22.680919][C1] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:^M
>   [   22.682281][C1] rcu: 0-: (25 ticks this GP) idle=073/0/0x1 
> softirq=10/16 fqs=1050 ^M
>   [   22.682851][C1]  (detected by 1, t=2102 jiffies, g=-1179, q=476)^M
>   [   22.683649][C1] Sending NMI from CPU 1 to CPUs 0:^M
>   [   22.685252][C0] NMI backtrace for cpu 0^M
>   [   22.685649][C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 5.16.0-rc2-00185-g0faf20a1ad16 #145^M
>   [   22.686393][C0] NIP:  c0016d64 LR: c0f6cca4 CTR: 
> c019c6e0^M
>   [   22.686774][C0] REGS: c2833590 TRAP: 0500   Not tainted  
> (5.16.0-rc2-00185-g0faf20a1ad16)^M
>   [   22.687222][C0] MSR:  80009033   CR: 
> 24000222  XER: ^M
>   [   22.688297][C0] CFAR: c000c854 IRQMASK: 0 ^M
>   ...
>   [   22.692637][C0] NIP [c0016d64] 
> arch_local_irq_restore+0x174/0x250^M
>   [   22.694443][C0] LR [c0f6cca4] __do_softirq+0xe4/0x3dc^M
>   [   22.695762][C0] Call Trace:^M
>   [   22.696050][C0] [c2833830] [c0f6cc80] 
> __do_softirq+0xc0/0x3dc (unreliable)^M
>   [   22.697377][C0] [c2833920] [c0151508] 
> __irq_exit_rcu+0xd8/0x130^M
>   [   22.698739][C0] [c2833950] [c0151730] 
> irq_exit+0x20/0x40^M
>   [   22.699938][C0] [c2833970] [c0027f40] 
> timer_interrupt+0x270/0x460^M
>   [   22.701119][C0] [c28339d0] [c00099a8] 
> decrementer_common_virt+0x208/0x210^M
> 
> Possibly this should be fixed in the lowres timing code, but that would
> be a generic change and could take some time and may not backport
> easily, so for now make the programming of the decrementer unconditional
> again in timer_interrupt() to avoid the stall/lockup.
> 
> Fixes: 0faf20a1ad16 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq 
> handlers unless perf is in use")
> Reported-by: Miguel Ojeda 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/time.c | 29 ++---
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f5cbfe5efd25..f80cce0e3899 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -615,23 +615,22 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
>   return;
>   }
>  
> - /* Conditionally hard-enable interrupts. */
> - if (should_hard_irq_enable()) {
> - /*
> -  * Ensure a positive value is written to the decrementer, or
> -  * else some CPUs will continue to take decrementer exceptions.
> -  * When the PPC_WATCHDOG (decrementer based) is configured,
> -  * keep this at most 31 bits, which is about 4 seconds on most
> -  * systems, which gives the watchdog a chance of catching timer
> -  * interrupt hard lockups.
> -  */
> - if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
> - set_dec(0x7fff);
> - else
> - set_dec(decrementer_max);
> + /*
> +  * Ensure a positive value is written to the decrementer, or
> +  * else some CPUs will continue to take decrementer exceptions.
> +  * When the PPC_WATCHDOG (decrementer based) is configured,
> +  * keep this at most 31 bits, which is about 4 seconds on most
> +  * systems, which gives the watchdog a chance of catching timer
> +  * interrupt hard lockups.
> +  */
> + if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
> + set_dec(0x7fff);
> + else
> + set_dec(decrementer_max);
>  
> + /* Conditionally hard-enable interrupts. */
> + if (should_hard_irq_enable())
>   do_hard_irq_enable();
> - }
>  
>  #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
>   if 

[PATCH] powerpc/time: Always set decrementer in timer_interrupt()

2022-04-20 Thread Michael Ellerman
This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
Don't enable MSR[EE] in irq handlers unless perf is in use").

Prior to that commit, we always set the decrementer in
timer_interrupt(), to clear the timer interrupt. Otherwise we could end
up continuously taking timer interrupts.

When high res timers are enabled there is no problem seen with leaving
the decrementer untouched in timer_interrupt(), because it will be
programmed via hrtimer_interrupt() -> tick_program_event() ->
clockevents_program_event() -> decrementer_set_next_event().

However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
see a stall/lockup, because tick_nohz_handler() does not cause a
reprogram of the decrementer, leading to endless timer interrupts.
Example trace:

  [1.898617][T7] Freeing initrd memory: 2624K^M
  [   22.680919][C1] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:^M
  [   22.682281][C1] rcu: 0-: (25 ticks this GP) idle=073/0/0x1 
softirq=10/16 fqs=1050 ^M
  [   22.682851][C1]  (detected by 1, t=2102 jiffies, g=-1179, q=476)^M
  [   22.683649][C1] Sending NMI from CPU 1 to CPUs 0:^M
  [   22.685252][C0] NMI backtrace for cpu 0^M
  [   22.685649][C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
5.16.0-rc2-00185-g0faf20a1ad16 #145^M
  [   22.686393][C0] NIP:  c0016d64 LR: c0f6cca4 CTR: 
c019c6e0^M
  [   22.686774][C0] REGS: c2833590 TRAP: 0500   Not tainted  
(5.16.0-rc2-00185-g0faf20a1ad16)^M
  [   22.687222][C0] MSR:  80009033   CR: 
24000222  XER: ^M
  [   22.688297][C0] CFAR: c000c854 IRQMASK: 0 ^M
  ...
  [   22.692637][C0] NIP [c0016d64] 
arch_local_irq_restore+0x174/0x250^M
  [   22.694443][C0] LR [c0f6cca4] __do_softirq+0xe4/0x3dc^M
  [   22.695762][C0] Call Trace:^M
  [   22.696050][C0] [c2833830] [c0f6cc80] 
__do_softirq+0xc0/0x3dc (unreliable)^M
  [   22.697377][C0] [c2833920] [c0151508] 
__irq_exit_rcu+0xd8/0x130^M
  [   22.698739][C0] [c2833950] [c0151730] 
irq_exit+0x20/0x40^M
  [   22.699938][C0] [c2833970] [c0027f40] 
timer_interrupt+0x270/0x460^M
  [   22.701119][C0] [c28339d0] [c00099a8] 
decrementer_common_virt+0x208/0x210^M

Possibly this should be fixed in the lowres timing code, but that would
be a generic change and could take some time and may not backport
easily, so for now make the programming of the decrementer unconditional
again in timer_interrupt() to avoid the stall/lockup.

Fixes: 0faf20a1ad16 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq 
handlers unless perf is in use")
Reported-by: Miguel Ojeda 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/time.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f5cbfe5efd25..f80cce0e3899 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -615,23 +615,22 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(timer_interrupt)
return;
}
 
-   /* Conditionally hard-enable interrupts. */
-   if (should_hard_irq_enable()) {
-   /*
-* Ensure a positive value is written to the decrementer, or
-* else some CPUs will continue to take decrementer exceptions.
-* When the PPC_WATCHDOG (decrementer based) is configured,
-* keep this at most 31 bits, which is about 4 seconds on most
-* systems, which gives the watchdog a chance of catching timer
-* interrupt hard lockups.
-*/
-   if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
-   set_dec(0x7fff);
-   else
-   set_dec(decrementer_max);
+   /*
+* Ensure a positive value is written to the decrementer, or
+* else some CPUs will continue to take decrementer exceptions.
+* When the PPC_WATCHDOG (decrementer based) is configured,
+* keep this at most 31 bits, which is about 4 seconds on most
+* systems, which gives the watchdog a chance of catching timer
+* interrupt hard lockups.
+*/
+   if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
+   set_dec(0x7fff);
+   else
+   set_dec(decrementer_max);
 
+   /* Conditionally hard-enable interrupts. */
+   if (should_hard_irq_enable())
do_hard_irq_enable();
-   }
 
 #if defined(CONFIG_PPC32) && defined(CONFIG_PPC_PMAC)
	if (atomic_read(&ppc_n_lost_interrupts) != 0)
-- 
2.34.1



Re: [PATCH 0/3] KVM: x86 SRCU bug fix and SRCU hardening

2022-04-20 Thread Paolo Bonzini
Queued, thanks.

Paolo




Re: [PATCH] ASoC: fsl: using pm_runtime_resume_and_get instead of pm_runtime_get_sync

2022-04-20 Thread Pavel Machek
Hi!

> From: Minghao Chi 
> 
> Using pm_runtime_resume_and_get is more appropriate
> for simplifying code
> 
> Reported-by: Zeal Robot 
> Signed-off-by: Minghao Chi 
> ---
>  sound/soc/fsl/fsl_esai.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
> index ed444e8f1d6b..1a2bdf8e76f0 100644
> --- a/sound/soc/fsl/fsl_esai.c
> +++ b/sound/soc/fsl/fsl_esai.c
> @@ -1050,11 +1050,9 @@ static int fsl_esai_probe(struct platform_device *pdev)
>   goto err_pm_disable;
>   }
>  
> - ret = pm_runtime_get_sync(>dev);
> - if (ret < 0) {
> - pm_runtime_put_noidle(>dev);
> + ret = pm_runtime_resume_and_get(>dev);
> + if (ret < 0)
>   goto err_pm_get_sync;
> - }
>  
>   ret = fsl_esai_hw_init(esai_priv);
>   if (ret)

Please take a closer look at that function.

a) error labels are now misnamed

b) there's leak if
   ret = fsl_esai_hw_init(esai_priv);
   if (ret)
 goto err_pm_get_sync;

happens.

Best regards,
Pavel   
   

-- 
People of Russia, stop Putin before his war on Ukraine escalates.




Re: [PATCH v2 2/2] fbdev: Warn in hot-unplug workaround for framebuffers without device

2022-04-20 Thread Javier Martinez Canillas
Hello Thomas,

Thanks a lot for re-spinning your series.

On 4/19/22 12:04, Thomas Zimmermann wrote:
> A workaround makes fbdev hot-unplugging work for framebuffers without
> device. The only user for this feature was offb. As each OF framebuffer
> now has an associated platform device, the workaround should no longer
> be triggered. Update it with a warning and rewrite the comment. Fbdev
> drivers that trigger the hot-unplug workaround really need to be fixed.
> 
> Signed-off-by: Thomas Zimmermann 
> Suggested-by: Javier Martinez Canillas 
> ---

Reviewed-by: Javier Martinez Canillas 

-- 
Best regards,

Javier Martinez Canillas
Linux Engineering
Red Hat



RE: [PATCH RFC 2/8] arm64: stacktrace: Add arch_within_stack_frames

2022-04-20 Thread David Laight
> > Thanks for doing this implementation! One reason usercopy hardening
> > didn't persue doing a "full" stacktrace was because it seemed relatively
> > expensive. Did you do any usercopy-heavily workload testing to see if
> > there was a noticeable performance impact?

Look at anything that uses sendmsg().
It is noticeably more expensive than sendto().
All the extra copy_from_user() calls cause measurable slowdowns.
Using __copy_from_user() (to avoid 'hardening') in the socket code
and when reading the iov[] gives a measurable improvement.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)