Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Paolo Bonzini
On 28/05/20 00:21, David Ahern wrote:
> On 5/27/20 3:07 PM, Paolo Bonzini wrote:
>> I see what you meant now.  statsfs can also be used to enumerate objects
>> if one is so inclined (with the prototype in patch 7, for example, each
>> network interface becomes a directory).
> 
> There are many use cases that have hundreds to thousands of network
> devices. Having a sysfs entry per device already bloats memory usage for
> these use cases; another filesystem with an entry per device makes that
> worse. Really the wrong direction for large scale systems.

Hi David,

IMO the important part for now is having a flexible kernel API for
exposing statistics across multiple subsystems, so that they can be
harvested in an efficient way.  The userspace API is secondary, and
multiple APIs can be added to cater for different use cases.

For example, as of the first five patches the memory usage is the same
as what is now in the mainline kernel, since all the patchset does is
take existing debugfs inodes and move them to statsfs.  I agree that, if
the concept is extended to the whole kernel, scalability and memory
usage become an issue; and indeed, the long-term plan is to support a
binary format that is actually _more_ efficient than the status quo for
large scale systems.

In the meanwhile, the new filesystem can be disabled (see the difference
between "STATS_FS" and "STATS_FS_API") if it imposes undesirable overhead.

Thanks,

Paolo



Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-27 Thread Paul Mackerras
On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> The locking rules for walking partition scoped table are different from
> process scoped table. Hence add a helper for secondary linux page table
> walk and also add a check whether we are holding the right locks.

This patch is causing new warnings to appear when testing migration,
like this:

[  142.090159] ------------[ cut here ]------------
[  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
[  142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
[  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G W 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
[  142.090189] NIP:  c00800fe848c LR: c00800fe8488 CTR:
[  142.090190] REGS: c01e19f077e0 TRAP: 0700   Tainted: G W (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
[  142.090191] MSR:  90029033   CR: 4422  XER: 2004
[  142.090196] CFAR: c012f5ac IRQMASK: 0
   GPR00: c00800fe8488 c01e19f07a70 c00800ffe200 0039
   GPR04: 0001 c01ffc8b4900 00018840 0007
   GPR08: 0003 0001 0007 0001
   GPR12: 2000 c01fff6d9400 00011f884678 7fff70b7
   GPR16: 7fff7137cb90 7fff7dcb4410 0001
   GPR20: 0ffe  0001
   GPR24: 8000 0001 c01e1f67e600 c01e1fd82410
   GPR28: 1000 c01e2e41 0fff 0ffe
[  142.090217] NIP [c00800fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[  142.090223] LR [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
[  142.090224] Call Trace:
[  142.090230] [c01e19f07a70] [c00800fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
[  142.090236] [c01e19f07b50] [c00800fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
[  142.090292] [c01e19f07be0] [c00800eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
[  142.090300] [c01e19f07c00] [c00800edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
[  142.090302] [c01e19f07d50] [c046e148] ksys_ioctl+0xf8/0x150
[  142.090305] [c01e19f07da0] [c046e1c8] sys_ioctl+0x28/0x80
[  142.090307] [c01e19f07dc0] [c003652c] system_call_exception+0x16c/0x240
[  142.090309] [c01e19f07e20] [c000d070] system_call_common+0xf0/0x278
[  142.090310] Instruction dump:
[  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 6000 3c82 e8848468 3c62
[  142.090317] e86384a8 38840010 4800673d e8410018 <0fe0> 4bfffdd4 6000 6000
[  142.090322] ---[ end trace 619d45057b6919e0 ]---

Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
locklessly, and only takes the kvm->mmu_lock once it finds a dirty
PTE.  I think that is important for performance, since on any given
scan of the guest real address space we may only find a small
proportion of the guest pages to be dirty.
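The pattern in question, as a minimal sketch (helper names here are
illustrative, not the exact KVM code):

	static int test_clear_dirty(struct kvm *kvm, pte_t *ptep)
	{
		int ret = 0;

		/* Lockless read: most PTEs are clean, so skip them cheaply. */
		if (!pte_dirty(READ_ONCE(*ptep)))
			return 0;

		spin_lock(&kvm->mmu_lock);
		/* Recheck under the lock; the PTE may have changed meanwhile. */
		if (pte_dirty(*ptep)) {
			clear_dirty_pte(ptep);	/* illustrative helper */
			ret = 1;
		}
		spin_unlock(&kvm->mmu_lock);
		return ret;
	}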

Are you now relying on the kvm->mmu_lock to protect the existence of
the PTEs, or just their content?

Paul.


Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again

2020-05-27 Thread Michael Ellerman
Petr Mladek  writes:
> The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
> to archs where they work") caused the bpf_probe_read{, str}() functions to
> no longer be available on architectures where the same logical address
> might have different content in kernel and user memory mapping. These
> architectures should use the probe_read_{user,kernel}_str helpers instead.
>
> For backward compatibility, the problematic functions are still available
> on architectures where the user and kernel address spaces are not
> overlapping. This is indicated by CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
>
> At the moment, these backward compatible functions are enabled only
> on x86_64, arm, and arm64. Let's do it also on powerpc, which has a
> non-overlapping address space as well.
>
> Signed-off-by: Petr Mladek 

This seems like it should have a Fixes: tag and go into v5.7?

cheers

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index d13b5328ca10..b29d7cb38368 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -126,6 +126,7 @@ config PPC
>   select ARCH_HAS_MMIOWB  if PPC64
>   select ARCH_HAS_PHYS_TO_DMA
>   select ARCH_HAS_PMEM_API
> + select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>   select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
>   select ARCH_HAS_PTE_SPECIAL
>   select ARCH_HAS_MEMBARRIER_CALLBACKS
> -- 
> 2.26.2
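For reference, the split the quoted commit message refers to looks like
this from a BPF program (a minimal sketch assuming a libbpf CO-RE build
with a generated vmlinux.h; the probe site is just an example):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	SEC("kprobe/vfs_unlink")
	int BPF_KPROBE(trace_unlink, struct inode *dir, struct dentry *dentry)
	{
		const unsigned char *kname;
		char name[64];

		/* dentry is a kernel pointer: use the _kernel variants
		 * rather than the legacy bpf_probe_read{,_str}(). */
		bpf_probe_read_kernel(&kname, sizeof(kname), &dentry->d_name.name);
		bpf_probe_read_kernel_str(name, sizeof(name), kname);
		bpf_printk("unlink: %s", name);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";

For user pointers one would use bpf_probe_read_user{,_str}() instead; on
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE architectures the legacy
helpers keep working for backward compatibility.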


Re: [PATCH] powerpc/64: Remove unused generic_secondary_thread_init()

2020-05-27 Thread Jordan Niethe
On Tue, May 26, 2020 at 4:36 PM Michael Ellerman  wrote:
>
> The last caller was removed in 2014 in commit fb5a515704d7 ("powerpc:
> Remove platforms/wsp and associated pieces").
>
> Once generic_secondary_thread_init() is removed there are no longer
> any uses of book3e_secondary_thread_init() or
> generic_secondary_common_init(), so remove them too.
>
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/smp.h   |  1 -
>  arch/powerpc/kernel/exceptions-64e.S |  4 
>  arch/powerpc/kernel/head_64.S| 18 --
>  3 files changed, 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..81a49566ccd8 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -243,7 +243,6 @@ extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
>   * 64-bit but defining them all here doesn't harm
>   */
>  extern void generic_secondary_smp_init(void);
> -extern void generic_secondary_thread_init(void);
>  extern unsigned long __secondary_hold_spinloop;
>  extern unsigned long __secondary_hold_acknowledge;
>  extern char __secondary_hold;
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index d9ed79415100..9f9e8686798b 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1814,10 +1814,6 @@ _GLOBAL(book3e_secondary_core_init)
>  1:	mtlr	r28
> blr
>
> -_GLOBAL(book3e_secondary_thread_init)
> -   mflr	r28
> -   b   3b
> -
> .globl init_core_book3e
>  init_core_book3e:
> /* Establish the interrupt vector base */
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index 0e05a9a47a4b..4ae2c18c5fc6 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -302,23 +302,6 @@ _GLOBAL(fsl_secondary_thread_init)
>  1:
>  #endif

Nothing directly calls generic_secondary_thread_init(), but I think
fsl_secondary_thread_init(), which is directly above, "falls through"
into it, and fsl_secondary_thread_init() still has callers.

>
> -_GLOBAL(generic_secondary_thread_init)
> -   mr  r24,r3
> -
> -   /* turn on 64-bit mode */
> -   bl  enable_64b_mode
> -
> -   /* get a valid TOC pointer, wherever we're mapped at */
> -   bl  relative_toc
> -   tovirt(r2,r2)
> -
> -#ifdef CONFIG_PPC_BOOK3E
> -   /* Book3E initialization */
> -   mr  r3,r24
> -   bl  book3e_secondary_thread_init
> -#endif
> -   b   generic_secondary_common_init
> -
>  /*
>   * On pSeries and most other platforms, secondary processors spin
>   * in the following code.
> @@ -385,7 +368,6 @@ _GLOBAL(generic_secondary_smp_init)
>  20:
>  #endif
>
> -generic_secondary_common_init:
> /* Set up a paca value for this processor. Since we have the
>  * physical cpu id in r24, we need to search the pacas to find
>  * which logical id maps to our physical one.
> --
> 2.25.1
>


Re: [v3 2/2] dts: ppc: t1024rdb: remove interrupts property

2020-05-27 Thread Li Yang
On Tue, May 26, 2020 at 10:52 PM Biwen Li  wrote:
>
> From: Biwen Li 
>
> Since the interrupt pin for RTC DS1339 is not connected
> to the CPU on T1024RDB, remove the interrupt property
> from the device tree.
>
> This also fixes the following warning from hwclock.util-linux:
> $ hwclock.util-linux
> hwclock.util-linux: select() to /dev/rtc0
> to wait for clock tick timed out
>
> Signed-off-by: Biwen Li 

Acked-by: Li Yang 

> ---
>  arch/powerpc/boot/dts/fsl/t1024rdb.dts | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/arch/powerpc/boot/dts/fsl/t1024rdb.dts b/arch/powerpc/boot/dts/fsl/t1024rdb.dts
> index 645caff98ed1..605ceec66af3 100644
> --- a/arch/powerpc/boot/dts/fsl/t1024rdb.dts
> +++ b/arch/powerpc/boot/dts/fsl/t1024rdb.dts
> @@ -161,7 +161,6 @@
> rtc@68 {
> compatible = "dallas,ds1339";
> reg = <0x68>;
> -   interrupts = <0x1 0x1 0 0>;
> };
> };
>
> --
> 2.17.1
>


Re: [v3 1/2] dts: ppc: t4240rdb: remove interrupts property

2020-05-27 Thread Li Yang
On Tue, May 26, 2020 at 10:49 PM Biwen Li  wrote:
>
> From: Biwen Li 
>
> Since the interrupt pin for RTC DS1374 is not connected
> to the CPU on T4240RDB, remove the interrupt property
> from the device tree.
>
> This also fixes the following warning from hwclock.util-linux:
> $ hwclock.util-linux
> hwclock.util-linux: select() to /dev/rtc0
> to wait for clock tick timed out
>
> Signed-off-by: Biwen Li 

Acked-by: Li Yang 

> ---
>  arch/powerpc/boot/dts/fsl/t4240rdb.dts | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/arch/powerpc/boot/dts/fsl/t4240rdb.dts b/arch/powerpc/boot/dts/fsl/t4240rdb.dts
> index a56a705d41f7..145896f2eef6 100644
> --- a/arch/powerpc/boot/dts/fsl/t4240rdb.dts
> +++ b/arch/powerpc/boot/dts/fsl/t4240rdb.dts
> @@ -144,7 +144,6 @@
> rtc@68 {
> compatible = "dallas,ds1374";
> reg = <0x68>;
> -   interrupts = <0x1 0x1 0 0>;
> };
> };
>
> --
> 2.17.1
>


Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread David Ahern
On 5/27/20 3:07 PM, Paolo Bonzini wrote:
> I see what you meant now.  statsfs can also be used to enumerate objects
> if one is so inclined (with the prototype in patch 7, for example, each
> network interface becomes a directory).

There are many use cases that have hundreds to thousands of network devices.
Having a sysfs entry per device already bloats memory usage for these
use cases; another filesystem with an entry per device makes that worse.
Really the wrong direction for large scale systems.


Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Paolo Bonzini
On 27/05/20 23:27, Jakub Kicinski wrote:
> On Wed, 27 May 2020 23:07:53 +0200 Paolo Bonzini wrote:
>>> Again, I have little KVM knowledge, but BPF also uses a fd-based API,
>>> and carries stats over the same syscall interface.  
>>
>> Can BPF stats (for BPF scripts created by whatever process is running in
>> the system) be collected by an external daemon that does not have access
>> to the file descriptor?  For KVM it's of secondary importance to gather
>> stats in the program; it can be nice to have and we are thinking of a
>> way to export the stats over the fd-based API, but it's less useful than
>> system-wide monitoring.  Perhaps this is a difference between the two.
> 
> Yes, check out bpftool prog list (bpftool code is under tools/bpf/ in
> the kernel tree). BPF statistics are under a static key, so you may not
> see any on your system. My system shows e.g.:
> 
> 81: kprobe  name abc  tag cefaa9376bdaae75  gpl run_time_ns 80941 run_cnt 152
>   loaded_at 2020-05-26T13:00:24-0700  uid 0
>   xlated 512B  jited 307B  memlock 4096B  map_ids 66,64
>   btf_id 16
> 
> In this example run_time_ns and run_cnt are stats.
> 
> The first number on the left is the program ID. BPF has an IDA, and
> each object gets an integer id. So admin (or CAP_BPF, I think) can
> iterate over the ids and open fds to objects of interest.

Got it, thanks.  But then "I'd hope that whatever daemon collects [BPF]
stats doesn't run as root". :)

>> Another case where stats and configuration are separate is CPUs, where
>> CPU enumeration is done in sysfs but statistics are exposed in various
>> procfs files such as /proc/interrupts and /proc/stats.
> 
> True, but I'm guessing everyone is just okay living with the legacy
> procfs format there. Otherwise I'd guess the stats would have been added
> to sysfs. I'd be curious to hear the full story there.

Yeah, it's a chicken-and-egg problem in that there's no good place in
sysfs to put statistics right now, which is part of what this filesystem
is trying to solve (the other part is the API).

You can read more about Google's use case at
http://lkml.iu.edu/hypermail/linux/kernel/2005.0/08056.html, it does
include both network and interrupt stats and it's something that they've
been using in production for quite some time.  We'd like the statsfs API
to be the basis for including something akin to that in Linux.

To be honest, it's unlikely that Emanuele (who has just finished his
internship at Red Hat) and I will pursue the networking stats further
than the demo patch at the end of this series. However, we're trying to
make sure that the API is at least ready for that, and to probe whether
any developers from other subsystems would be interested in using
statsfs.  So thanks for bringing your point of view!

Thanks,

Paolo



Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Jakub Kicinski
On Wed, 27 May 2020 23:07:53 +0200 Paolo Bonzini wrote:
> > Again, I have little KVM knowledge, but BPF also uses a fd-based API,
> > and carries stats over the same syscall interface.  
> 
> Can BPF stats (for BPF scripts created by whatever process is running in
> the system) be collected by an external daemon that does not have access
> to the file descriptor?  For KVM it's of secondary importance to gather
> stats in the program; it can be nice to have and we are thinking of a
> way to export the stats over the fd-based API, but it's less useful than
> system-wide monitoring.  Perhaps this is a difference between the two.

Yes, check out bpftool prog list (bpftool code is under tools/bpf/ in
the kernel tree). BPF statistics are under a static key, so you may not
see any on your system. My system shows e.g.:

81: kprobe  name abc  tag cefaa9376bdaae75  gpl run_time_ns 80941 run_cnt 152
loaded_at 2020-05-26T13:00:24-0700  uid 0
xlated 512B  jited 307B  memlock 4096B  map_ids 66,64
btf_id 16

In this example run_time_ns and run_cnt are stats.

The first number on the left is the program ID. BPF has an IDA, and
each object gets an integer id. So admin (or CAP_BPF, I think) can
iterate over the ids and open fds to objects of interest.
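Pulling those stats programmatically follows the same id walk (a sketch
using libbpf's low-level wrappers; error handling trimmed for brevity):

	#include <stdio.h>
	#include <unistd.h>
	#include <bpf/bpf.h>

	int main(void)
	{
		__u32 id = 0;

		while (!bpf_prog_get_next_id(id, &id)) {
			struct bpf_prog_info info = {};
			__u32 len = sizeof(info);
			int fd = bpf_prog_get_fd_by_id(id);

			if (fd < 0)
				continue;	/* prog went away, or no permission */
			if (!bpf_obj_get_info_by_fd(fd, &info, &len))
				printf("%u: %s run_time_ns %llu run_cnt %llu\n",
				       id, info.name, info.run_time_ns,
				       info.run_cnt);
			close(fd);
		}
		return 0;
	}

The run_time_ns/run_cnt fields stay zero unless the static key is
flipped, e.g. with sysctl kernel.bpf_stats_enabled=1.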

> Another case where stats and configuration are separate is CPUs, where
> CPU enumeration is done in sysfs but statistics are exposed in various
> procfs files such as /proc/interrupts and /proc/stats.

True, but I'm guessing everyone is just okay living with the legacy
procfs format there. Otherwise I'd guess the stats would have been added
to sysfs. I'd be curious to hear the full story there.


Re: [PATCH v3 6/8] mm: parallelize deferred_init_memmap()

2020-05-27 Thread Alexander Duyck
On Wed, May 27, 2020 at 10:37 AM Daniel Jordan  wrote:
>
> Deferred struct page init is a significant bottleneck in kernel boot.
> Optimizing it maximizes availability for large-memory systems and allows
> spinning up short-lived VMs as needed without having to leave them
> running.  It also benefits bare metal machines hosting VMs that are
> sensitive to downtime.  In projects such as VMM Fast Restart[1], where
> guest state is preserved across kexec reboot, it helps prevent
> application and network timeouts in the guests.
>
> Multithread to take full advantage of system memory bandwidth.
>
> The maximum number of threads is capped at the number of CPUs on the
> node because speedups always improve with additional threads on every
> system tested, and at this phase of boot, the system is otherwise idle
> and waiting on page init to finish.
>
> Helper threads operate on section-aligned ranges to both avoid false
> sharing when setting the pageblock's migrate type and to avoid accessing
> uninitialized buddy pages, though max order alignment is enough for the
> latter.
>
> The minimum chunk size is also a section.  There was benefit to using
> multiple threads even on relatively small memory (1G) systems, and this
> is the smallest size that the alignment allows.
>
> The time (milliseconds) is the slowest node to initialize since boot
> blocks until all nodes finish.  intel_pstate is loaded in active mode
> without hwp and with turbo enabled, and intel_idle is active as well.
>
> Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
>   2 nodes * 26 cores * 2 threads = 104 CPUs
>   384G/node = 768G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
> node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
>       (  0)             --   4089.7 (  8.1)         --   1785.7 (  7.6)
>    2% (  1)           1.7%   4019.3 (  1.5)       3.8%   1717.7 ( 11.8)
>   12% (  6)          34.9%   2662.7 (  2.9)      79.9%    359.3 (  0.6)
>   25% ( 13)          39.9%   2459.0 (  3.6)      91.2%    157.0 (  0.0)
>   37% ( 19)          39.2%   2485.0 ( 29.7)      90.4%    172.0 ( 28.6)
>   50% ( 26)          39.3%   2482.7 ( 25.7)      90.3%    173.7 ( 30.0)
>   75% ( 39)          39.0%   2495.7 (  5.5)      89.4%    190.0 (  1.0)
>  100% ( 52)          40.2%   2443.7 (  3.8)      92.3%    138.0 (  1.0)
>
> Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
>   1 node * 16 cores * 2 threads = 32 CPUs
>   192G/node = 192G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
> node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
>       (  0)             --   1988.7 (  9.6)         --   1096.0 ( 11.5)
>    3% (  1)           1.1%   1967.0 ( 17.6)       0.3%   1092.7 ( 11.0)
>   12% (  4)          41.1%   1170.3 ( 14.2)      73.8%    287.0 (  3.6)
>   25% (  8)          47.1%   1052.7 ( 21.9)      83.9%    177.0 ( 13.5)
>   38% ( 12)          48.9%   1016.3 ( 12.1)      86.8%    144.7 (  1.5)
>   50% ( 16)          48.9%   1015.7 (  8.1)      87.8%    134.0 (  4.4)
>   75% ( 24)          49.1%   1012.3 (  3.1)      88.1%    130.3 (  2.3)
>  100% ( 32)          49.5%   1004.0 (  5.3)      88.5%    125.7 (  2.1)
>
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
>   2 nodes * 18 cores * 2 threads = 72 CPUs
>   128G/node = 256G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
> node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
>       (  0)             --   1680.0 (  4.6)         --    627.0 (  4.0)
>    3% (  1)           0.3%   1675.7 (  4.5)      -0.2%    628.0 (  3.6)
>   11% (  4)          25.6%   1250.7 (  2.1)      67.9%    201.0 (  0.0)
>   25% (  9)          30.7%   1164.0 ( 17.3)      81.8%    114.3 ( 17.7)
>   36% ( 13)          31.4%   1152.7 ( 10.8)      84.0%    100.3 ( 17.9)
>   50% ( 18)          31.5%   1150.7 (  9.3)      83.9%    101.0 ( 14.1)
>   75% ( 27)          31.7%   1148.0 (  5.6)      84.5%     97.3 (  6.4)
>  100% ( 36)          32.0%   1142.3 (  4.0)      85.6%     90.0 (  1.0)
>
> AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>   1 node * 8 cores * 2 threads = 16 CPUs
>   64G/node = 64G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
> node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
>       (  0)             --   1029.3 ( 25.1)         --    240.7 (  1.5)
>    6% (  1)          -0.6%   1036.0 (  7.8)      -2.2%    246.0 (  0.0)
>   12% (  2)          11.8%    907.7 (  8.6)      44.7%    133.0 (  1.0)
>   25% (  4)          13.9%    886.0 ( 10.6)      62.6%     90.0 (  6.0)

Re: [PATCH v3 5/8] mm: don't track number of pages during deferred initialization

2020-05-27 Thread Alexander Duyck
On Wed, May 27, 2020 at 10:37 AM Daniel Jordan  wrote:
>
> Deferred page init used to report the number of pages initialized:
>
>   node 0 initialised, 32439114 pages in 97ms
>
> Tracking this makes the code more complicated when using multiple
> threads.  Given that the statistic probably has limited value,
> especially since a zone grows on demand so that the page count can vary,
> just remove it.
>
> The boot message now looks like
>
>   node 0 deferred pages initialised in 97ms
>
> Signed-off-by: Daniel Jordan 
> Suggested-by: Alexander Duyck 

This looks good to me.

Reviewed-by: Alexander Duyck 

> ---
>  mm/page_alloc.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0c0d9364aa6d..d64f3027fdfa6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data)
>  {
> pg_data_t *pgdat = data;
> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> -   unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> +   unsigned long spfn = 0, epfn = 0;
> unsigned long first_init_pfn, flags;
> unsigned long start = jiffies;
> struct zone *zone;
> @@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data)
>  * allocator.
>  */
> while (spfn < epfn) {
> -   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +   deferred_init_maxorder(&i, zone, &spfn, &epfn);
> cond_resched();
> }
>  zone_empty:
> /* Sanity check that the next zone really is unpopulated */
> WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> -   pr_info("node %d initialised, %lu pages in %ums\n",
> -   pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
> +   pr_info("node %d deferred pages initialised in %ums\n",
> +   pgdat->node_id, jiffies_to_msecs(jiffies - start));
>
> pgdat_init_report_one_done();
> return 0;
> --
> 2.26.2
>
>


Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Andrew Lunn
On Tue, May 26, 2020 at 01:03:10PM +0200, Emanuele Giuseppe Esposito wrote:
> There is currently no common way shared throughout the Linux kernel for
> subsystems to expose statistics to userspace; subsystems have
> to take care of gathering and displaying statistics by themselves, for
> example in the form of files in debugfs. For example KVM has its own code
> section that takes care of this in virt/kvm/kvm_main.c, where it sets up
> debugfs handlers for displaying values and aggregating them from various
> subfolders to obtain information about the system state (e.g. displaying
> the total number of exits, calculated by summing all exits of all cpus of
> all running virtual machines).
> 
> Allowing each section of the kernel to do so has two disadvantages. First,
> it will introduce redundant code. Second, debugfs is anyway not the right
> place for statistics (for example it is affected by lockdown).
> 
> In this patch series I introduce statsfs, a synthetic ram-based virtual
> filesystem that takes care of gathering and displaying statistics for the
> Linux kernel subsystems.
> 
> The file system is mounted on /sys/kernel/stats and would already be used
> by KVM. Statsfs was initially introduced by Paolo Bonzini [1].
> 
> Statsfs offers a generic and stable API, allowing any kind of
> directory/file organization and supporting multiple kinds of aggregation
> (not only sum, but also average, max, min and count_zero) and data types
> (boolean, unsigned/signed and custom types). The implementation, which is
> a generalization of KVM’s debugfs statistics code, takes care of gathering
> and displaying information at run time; users only need to specify the
> values to be included in each source.
> 
> Statsfs would also be a different mountpoint from debugfs, and would not
> suffer from limited access due to the security lockdown patches. Its main
> function is to display each statistic as a file in the desired folder
> hierarchy defined through the API. Statsfs files can be read, and possibly
> cleared if their file mode allows it.
> 
> Statsfs has two main components: the public API defined by
> include/linux/statsfs.h, and the virtual file system which should end up in
> /sys/kernel/stats.
> 

Hi Emanuele

> The API has two main elements, values and sources. Kernel subsystems like
> KVM can use the API to create a source, add child sources/values/aggregates
> and register it to the root source (that on the virtual fs would be
> /sys/kernel/statsfs).
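(As an illustration, that flow would read roughly as below; the
signatures are hypothetical stand-ins for the API in these patches, not
its exact form.)

	struct foo_stats {
		u64 exits;
		u64 halts;
	};

	void foo_register_stats(struct foo_stats *stats, int id)
	{
		struct statsfs_source *src;

		/* hypothetical signatures, for illustration only */
		src = statsfs_source_create("foo%d", id);
		statsfs_source_add_value(src, "exits", &stats->exits);
		statsfs_source_add_value(src, "halts", &stats->halts);
		statsfs_source_register(src);	/* under /sys/kernel/stats */
	}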

Another issue I see with networking is that statistics counters can be
dynamic. They can come and go. One of the drivers I work on has extra
statistics available when a fibre interface is used, compared to a
copper interface. And this happens at run time. The netlink API has no
problems with this. It is a snapshot of what counters are currently
available. There is no state in the API.
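To make that concrete, here is a driver-side sketch using the standard
ethtool stats callbacks, where the counter set is recomputed on every
call; the fibre/copper split and the counter names are made up:

	#include <linux/ethtool.h>
	#include <linux/netdevice.h>

	#define NUM_COMMON_STATS 2
	#define NUM_FIBRE_STATS  1

	static const char * const common_stat_names[] = {
		"rx_packets", "tx_packets",
	};
	static const char * const fibre_stat_names[] = {
		"fec_corrected_errors",
	};

	struct example_priv {
		bool is_fibre;	/* can flip at runtime with the media type */
	};

	static int example_get_sset_count(struct net_device *dev, int sset)
	{
		struct example_priv *priv = netdev_priv(dev);

		if (sset != ETH_SS_STATS)
			return -EOPNOTSUPP;
		/* The count is a fresh snapshot every time userspace asks. */
		return NUM_COMMON_STATS + (priv->is_fibre ? NUM_FIBRE_STATS : 0);
	}

	static void example_get_strings(struct net_device *dev, u32 sset, u8 *data)
	{
		struct example_priv *priv = netdev_priv(dev);
		int i;

		if (sset != ETH_SS_STATS)
			return;
		for (i = 0; i < NUM_COMMON_STATS; i++)
			strscpy(data + i * ETH_GSTRING_LEN,
				common_stat_names[i], ETH_GSTRING_LEN);
		if (priv->is_fibre)
			for (i = 0; i < NUM_FIBRE_STATS; i++)
				strscpy(data + (NUM_COMMON_STATS + i) * ETH_GSTRING_LEN,
					fibre_stat_names[i], ETH_GSTRING_LEN);
	}

	/* .get_ethtool_stats would fill the same number of u64s in the
	 * same order as .get_strings reported them. */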

In my humble opinion, networking is unlikely to adopt your approach.
You probably want to look around for other subsystems which have
statistics, and see if you can cover their requirements, and get them
on board.

   Andrew


Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Paolo Bonzini
On 27/05/20 22:23, Jakub Kicinski wrote:
> On Wed, 27 May 2020 15:14:41 +0200 Emanuele Giuseppe Esposito wrote:
>> Regarding the config, as I said the idea is to gather multiple 
>> subsystems' statistics, therefore there wouldn't be a single 
>> configuration method like in netlink.
>> For example in kvm there are file descriptors for configuration, and 
>> creating them requires no privilege, contrary to the network interfaces.
>
> Enumerating networking interfaces, addresses, and almost all of the
> configuration requires no extra privilege. In fact I'd hope that
> whatever daemon collects network stats doesn't run as root :)
> 
> I think enumerating objects is of primary importance, and statistics 
> of those objects are subordinate.

I see what you meant now.  statsfs can also be used to enumerate objects
if one is so inclined (with the prototype in patch 7, for example, each
network interface becomes a directory).

> Again, I have little KVM knowledge, but BPF also uses a fd-based API,
> and carries stats over the same syscall interface.

Can BPF stats (for BPF scripts created by whatever process is running in
the system) be collected by an external daemon that does not have access
to the file descriptor?  For KVM it's of secondary importance to gather
stats in the program; it can be nice to have and we are thinking of a
way to export the stats over the fd-based API, but it's less useful than
system-wide monitoring.  Perhaps this is a difference between the two.

Another case where stats and configuration are separate is CPUs, where
CPU enumeration is done in sysfs but statistics are exposed in various
procfs files such as /proc/interrupts and /proc/stats.

Thanks,

Paolo



Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Jakub Kicinski
On Wed, 27 May 2020 15:14:41 +0200 Emanuele Giuseppe Esposito wrote:
> Regarding the config, as I said the idea is to gather multiple 
> subsystems' statistics, therefore there wouldn't be a single 
> configuration method like in netlink.
> For example in kvm there are file descriptors for configuration, and 
> creating them requires no privilege, contrary to the network interfaces.

Enumerating networking interfaces, addresses, and almost all of the
configuration requires no extra privilege. In fact I'd hope that
whatever daemon collects network stats doesn't run as root :)

I think enumerating objects is of primary importance, and statistics 
of those objects are subordinate.

Again, I have little KVM knowledge, but BPF also uses a fd-based API,
and carries stats over the same syscall interface.


Re: [PATCH v8 1/5] powerpc: Document details on H_SCM_HEALTH hcall

2020-05-27 Thread Dan Williams
On Tue, May 26, 2020 at 9:13 PM Vaibhav Jain  wrote:
>
> Add documentation to 'papr_hcalls.rst' describing the bitmap flags
> that are returned from H_SCM_HEALTH hcall as per the PAPR-SCM
> specification.
>

Please do a global s/SCM/PMEM/ or s/SCM/NVDIMM/. It's unfortunate that
we already have 2 ways to describe persistent memory devices; let's not
perpetuate a third, so that "grep" has a chance to find
interrelated code across architectures. Other than that this looks
good to me.

> Cc: "Aneesh Kumar K . V" 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: Ira Weiny 
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
> v7..v8:
> * Added a clarification on bit-ordering of Health Bitmap
>
> Resend:
> * None
>
> v6..v7:
> * None
>
> v5..v6:
> * New patch in the series
> ---
>  Documentation/powerpc/papr_hcalls.rst | 45 ---
>  1 file changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst b/Documentation/powerpc/papr_hcalls.rst
> index 3493631a60f8..45063f305813 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -220,13 +220,50 @@ from the LPAR memory.
>  **H_SCM_HEALTH**
>
>  | Input: drcIndex
> -| Out: *health-bitmap, health-bit-valid-bitmap*
> +| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
>  | Return Value: *H_Success, H_Parameter, H_Hardware*
>
>  Given a DRC Index return the info on predictive failure and overall health of
> -the NVDIMM. The asserted bits in the health-bitmap indicate a single
> -predictive failure and health-bit-valid-bitmap indicate which bits in
> -health-bitmap are valid.
> +the NVDIMM. The asserted bits in the health-bitmap indicate one or more states
> +(described in the table below) of the NVDIMM and health-bit-valid-bitmap
> +indicates which bits in health-bitmap are valid. The bits are reported in
> +reverse bit ordering, for example a value of 0xC400000000000000
> +indicates bits 0, 1, and 5 are valid.
> +
> +Health Bitmap Flags:
> +
> ++------+------------------------------------------------------------------+
> +|  Bit |   Definition                                                     |
> ++======+==================================================================+
> +|  00  | SCM device is unable to persist memory contents.                |
> +|      | If the system is powered down, nothing will be saved.           |
> ++------+------------------------------------------------------------------+
> +|  01  | SCM device failed to persist memory contents. Either contents   |
> +|      | were not saved successfully on power down or were not restored  |
> +|      | properly on power up.                                           |
> ++------+------------------------------------------------------------------+
> +|  02  | SCM device contents are persisted from previous IPL. The data   |
> +|      | from the last boot were successfully restored.                  |
> ++------+------------------------------------------------------------------+
> +|  03  | SCM device contents are not persisted from previous IPL. There  |
> +|      | was no data to restore from the last boot.                      |
> ++------+------------------------------------------------------------------+
> +|  04  | SCM device memory life remaining is critically low              |
> ++------+------------------------------------------------------------------+
> +|  05  | SCM device will be garded off next IPL due to failure           |
> ++------+------------------------------------------------------------------+
> +|  06  | SCM contents cannot persist due to current platform health      |
> +|      | status. A hardware failure may prevent data from being saved or |
> +|      | restored.                                                       |
> ++------+------------------------------------------------------------------+
> +|  07  | SCM device is unable to persist memory contents in certain      |
> +|      | conditions                                                      |
> ++------+------------------------------------------------------------------+
> +|  08  | SCM device is encrypted                                         |
> ++------+------------------------------------------------------------------+
> +|  09  | SCM device has successfully completed a requested erase or      |
> +|      | secure erase procedure.                                         |
> ++------+------------------------------------------------------------------+
> +|10:63 | Reserved / Unused                                               |
> ++------+------------------------------------------------------------------+
>
>  **H_SCM_PERFORMANCE_STATS**
>
> --
> 2.26.2
>


[PATCH v3 7/8] mm: make deferred init's max threads arch-specific

2020-05-27 Thread Daniel Jordan
Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 arch/x86/mm/init_64.c| 12 
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c  | 13 -
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   /*
+* More CPUs always led to greater speedups on tested systems, up to
+* all the nodes' CPUs.  Use all since the system is otherwise idle
+* now.
+*/
+   return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
	for (; i != U64_MAX;  \
	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d47016849531..329fd1a809c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1835,6 +1835,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	}
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1883,11 +1890,7 @@ static int __init deferred_init_memmap(void *data)
 first_init_pfn))
goto zone_empty;
 
-   /*
-* More CPUs always led to greater speedups on tested systems, up to
-* all the nodes' CPUs.  Use all since the system is otherwise idle now.
-*/
-   max_threads = max(cpumask_weight(cpumask), 1u);
+   max_threads = deferred_page_init_max_threads(cpumask);
 
while (spfn < epfn) {
unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
-- 
2.26.2



[PATCH v3 1/8] padata: remove exit routine

2020-05-27 Thread Daniel Jordan
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 kernel/padata.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-   cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-   cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



[PATCH v3 8/8] padata: document multithreaded jobs

2020-05-27 Thread Daniel Jordan
Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 Documentation/core-api/padata.rst | 41 +++
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 9a24c111781d9..0830e5b0e8211 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 =======================================
 
-:Date: December 2019
+:Date: May 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 ------------
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run serialized jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
 #include 
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically allocated on the main thread's stack.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =========
 
-- 
2.26.2
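Taken together with the struct added in patch 4, the three steps the
documentation describes read roughly like this in use (a sketch; the
zeroing job is purely illustrative, and padata_do_multithreaded() is
__init-only in this series):

	#include <linux/cache.h>
	#include <linux/cpumask.h>
	#include <linux/padata.h>
	#include <linux/sizes.h>
	#include <linux/string.h>

	struct zero_state {
		u8 *buf;	/* shared state, on the main thread's stack */
	};

	/* Step 2: the thread function, called once per chunk. */
	static void __init zero_chunk(unsigned long start, unsigned long end,
				      void *arg)
	{
		struct zero_state *s = arg;

		memset(s->buf + start, 0, end - start);
	}

	static void __init zero_in_parallel(u8 *buf, unsigned long len)
	{
		/* Step 3: prepare shared state; step 1: describe the job. */
		struct zero_state state = { .buf = buf };
		struct padata_mt_job job = {
			.thread_fn   = zero_chunk,
			.fn_arg      = &state,
			.start       = 0,
			.size        = len,
			.align       = SMP_CACHE_BYTES,
			.min_chunk   = SZ_1M,
			.max_threads = num_online_cpus(),
		};

		padata_do_multithreaded(&job);	/* returns when all chunks finish */
	}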



[PATCH v3 5/8] mm: don't track number of pages during deferred initialization

2020-05-27 Thread Daniel Jordan
Deferred page init used to report the number of pages initialized:

  node 0 initialised, 32439114 pages in 97ms

Tracking this makes the code more complicated when using multiple
threads.  Given that the statistic probably has limited value,
especially since a zone grows on demand so that the page count can vary,
just remove it.

The boot message now looks like

  node 0 deferred pages initialised in 97ms

Signed-off-by: Daniel Jordan 
Suggested-by: Alexander Duyck 
---
 mm/page_alloc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..d64f3027fdfa6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data)
 {
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-   unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+   unsigned long spfn = 0, epfn = 0;
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
@@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data)
 * allocator.
 */
while (spfn < epfn) {
-   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+   deferred_init_maxorder(&i, zone, &spfn, &epfn);
cond_resched();
}
 zone_empty:
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
-   pr_info("node %d initialised, %lu pages in %ums\n",
-   pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
+   pr_info("node %d deferred pages initialised in %ums\n",
+   pgdat->node_id, jiffies_to_msecs(jiffies - start));
 
pgdat_init_report_one_done();
return 0;
-- 
2.26.2



[PATCH v3 3/8] padata: allocate work structures for parallel jobs from a pool

2020-05-27 Thread Daniel Jordan
padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, but makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.
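In padata_do_parallel() that fallback is shaped like this (a condensed
sketch of the hunk below; locking around the pool elided):

	pw = padata_work_alloc();	/* NULL once the pool is drained */
	if (pw) {
		padata_work_init(pw, padata_parallel_worker, padata);
		queue_work(pinst->parallel_wq, &pw->pw_work);
	} else {
		/* Maximum works limit exceeded, run in the current task. */
		padata->parallel(padata);
	}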

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |   8 +--
 kernel/padata.c| 118 +++--
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
 struct padata_priv {
	struct list_head	list;
	struct parallel_data	*pd;
	int			cb_cpu;
-	int			cpu;
	unsigned int		seq_nr;
	int			info;
	void			(*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-	struct padata_list	parallel;
	struct padata_list	reorder;
-	struct work_struct	work;
	atomic_t		num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
struct padata_parallel_queue__percpu *pqueue;
struct padata_serial_queue  __percpu *squeue;
atomic_trefcnt;
-   atomic_tseq_nr;
+   unsigned intseq_nr;
unsigned intprocessed;
int cpu;
struct padata_cpumask   cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include 
 #include 
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+   struct work_struct  pw_work;
+	struct list_head	pw_list;  /* padata_free_works linkage */
+	void			*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-   struct padata_parallel_queue *pqueue;
-   LIST_HEAD(local_list);
+   struct padata_work *pw;
 
-   local_bh_disable();
-   pqueue = container_of(parallel_work,
- struct padata_parallel_queue, work);
+   lockdep_assert_held(&padata_works_lock);
 
-   spin_lock(&pqueue->parallel.lock);
-   list_replace_init(&pqueue->parallel.list, &local_list);
-   spin_unlock(&pqueue->parallel.lock);
+   if (list_empty(&padata_free_works))
+   return NULL;   /* No more work items allowed to be queued. */
 
-   while (!list_empty(&local_list)) {
-   struct padata_priv *padata;
+   pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+   list_del(&pw->pw_list);
+   return pw;
+}
 
-   padata = list_entry(local_list.next,
-   struct padata_priv, list);
+static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
+void *data)
+{
+   

[PATCH v3 6/8] mm: parallelize deferred_init_memmap()

2020-05-27 Thread Daniel Jordan
Deferred struct page init is a significant bottleneck in kernel boot.
Optimizing it maximizes availability for large-memory systems and allows
spinning up short-lived VMs as needed without having to leave them
running.  It also benefits bare metal machines hosting VMs that are
sensitive to downtime.  In projects such as VMM Fast Restart[1], where
guest state is preserved across kexec reboot, it helps prevent
application and network timeouts in the guests.

Multithread to take full advantage of system memory bandwidth.

The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.

Helper threads operate on section-aligned ranges to both avoid false
sharing when setting the pageblock's migrate type and to avoid accessing
uninitialized buddy pages, though max order alignment is enough for the
latter.

The minimum chunk size is also a section.  There was benefit to using
multiple threads even on relatively small memory (1G) systems, and this
is the smallest size that the alignment allows.
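The loop that hands this work to padata is shaped like the following (a
condensed sketch of this patch's main hunk; see the posted diff for the
exact code):

	while (spfn < epfn) {
		unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
		struct padata_mt_job job = {
			.thread_fn   = deferred_init_memmap_chunk,
			.fn_arg      = zone,
			.start       = spfn,
			.size        = epfn_align - spfn,
			.align       = PAGES_PER_SECTION,
			.min_chunk   = PAGES_PER_SECTION,
			.max_threads = max_threads,
		};

		/* One multithreaded job per section-aligned span. */
		padata_do_multithreaded(&job);
		deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
						    epfn_align);
	}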

The time (milliseconds) is the slowest node to initialize since boot
blocks until all nodes finish.  intel_pstate is loaded in active mode
without hwp and with turbo enabled, and intel_idle is active as well.

Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
  2 nodes * 26 cores * 2 threads = 104 CPUs
  384G/node = 768G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)             --   4089.7 (  8.1)         --   1785.7 (  7.6)
   2% (  1)           1.7%   4019.3 (  1.5)       3.8%   1717.7 ( 11.8)
  12% (  6)          34.9%   2662.7 (  2.9)      79.9%    359.3 (  0.6)
  25% ( 13)          39.9%   2459.0 (  3.6)      91.2%    157.0 (  0.0)
  37% ( 19)          39.2%   2485.0 ( 29.7)      90.4%    172.0 ( 28.6)
  50% ( 26)          39.3%   2482.7 ( 25.7)      90.3%    173.7 ( 30.0)
  75% ( 39)          39.0%   2495.7 (  5.5)      89.4%    190.0 (  1.0)
 100% ( 52)          40.2%   2443.7 (  3.8)      92.3%    138.0 (  1.0)

Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
  1 node * 16 cores * 2 threads = 32 CPUs
  192G/node = 192G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)             --   1988.7 (  9.6)         --   1096.0 ( 11.5)
   3% (  1)           1.1%   1967.0 ( 17.6)       0.3%   1092.7 ( 11.0)
  12% (  4)          41.1%   1170.3 ( 14.2)      73.8%    287.0 (  3.6)
  25% (  8)          47.1%   1052.7 ( 21.9)      83.9%    177.0 ( 13.5)
  38% ( 12)          48.9%   1016.3 ( 12.1)      86.8%    144.7 (  1.5)
  50% ( 16)          48.9%   1015.7 (  8.1)      87.8%    134.0 (  4.4)
  75% ( 24)          49.1%   1012.3 (  3.1)      88.1%    130.3 (  2.3)
 100% ( 32)          49.5%   1004.0 (  5.3)      88.5%    125.7 (  2.1)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
  2 nodes * 18 cores * 2 threads = 72 CPUs
  128G/node = 256G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)             --   1680.0 (  4.6)         --    627.0 (  4.0)
   3% (  1)           0.3%   1675.7 (  4.5)      -0.2%    628.0 (  3.6)
  11% (  4)          25.6%   1250.7 (  2.1)      67.9%    201.0 (  0.0)
  25% (  9)          30.7%   1164.0 ( 17.3)      81.8%    114.3 ( 17.7)
  36% ( 13)          31.4%   1152.7 ( 10.8)      84.0%    100.3 ( 17.9)
  50% ( 18)          31.5%   1150.7 (  9.3)      83.9%    101.0 ( 14.1)
  75% ( 27)          31.7%   1148.0 (  5.6)      84.5%     97.3 (  6.4)
 100% ( 36)          32.0%   1142.3 (  4.0)      85.6%     90.0 (  1.0)

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 8 cores * 2 threads = 16 CPUs
  64G/node = 64G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
node% (thr)        speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)             --   1029.3 ( 25.1)         --    240.7 (  1.5)
   6% (  1)          -0.6%   1036.0 (  7.8)      -2.2%    246.0 (  0.0)
  12% (  2)          11.8%    907.7 (  8.6)      44.7%    133.0 (  1.0)
  25% (  4)          13.9%    886.0 ( 10.6)      62.6%     90.0 (  6.0)
  38% (  6)          17.8%    845.7 ( 14.2)      69.1%     74.3 (  3.8)
  50% (  8)          16.8%    856.0 ( 22.1)      72.9%     65.3 (  5.7)
  75% ( 12)          15.4%    871.0 ( 29.2)

[PATCH v3 0/8] padata: parallelize deferred page init

2020-05-27 Thread Daniel Jordan
Thanks to Alex for his continued review and Josh for running v2!  Please
continue to review and test, and acks for the padata parts would be
appreciated.

Daniel

--

Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 8.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 4% to 49%, saving up to 1.6 out of 4 seconds.  Patch 6 has more
numbers.

The powerpc and s390 lists are included in case they want to give this a
try, they had enabled this feature when it was configured per arch.

Series based on v5.7-rc7 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v3
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v3

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.5
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

v3:
 - Remove nr_pages accounting as suggested by Alex, adding a new patch
 - Align deferred init ranges up not down, simplify surrounding code (Alex)
 - Add Josh's T-b's from v2 (Josh's T-b's for v1 lost in rebase, apologies!)
 - Move padata.h include up in init/main.c to reduce patch collisions (Andrew)
 - Slightly reword Documentation patch
 - Rebase on v5.7-rc7 and retest

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential
   false sharing (Alex)
 - Return the maximum thread count from a function that architectures
   can override, with the generic version returning 1 (current
   behavior).  Override for x86 since that's the only arch this series
   has been tested on so far.  Other archs can test with more threads
   by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
    https://www.youtube.com/watch?v=pBsHnf93tcQ

https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/

[2] https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jor...@oracle.com/

Daniel Jordan (8):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: don't track number of pages during deferred initialization
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c |  12 ++
 include/linux/memblock.h  |   3 +
 include/linux/padata.h|  43 -
 init/main.c   |   2 +
 kernel/padata.c   | 277 --
 mm/Kconfig|   6 +-
 mm/page_alloc.c   |  59 +--
 8 files changed, 361 insertions(+), 82 deletions(-)


base-commit: 

[PATCH v3 4/8] padata: add basic support for multithreaded jobs

2020-05-27 Thread Daniel Jordan
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |  29 
 kernel/padata.c| 152 -
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
	struct list_head		list;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ * possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ * the client to communicate the minimum amount of work that's
+ * appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *   depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+   void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+	void			*fn_arg;
+   unsigned long   start;
+   unsigned long   size;
+   unsigned long   align;
+   unsigned long   min_chunk;
+   int max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
  struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +36,8 @@
 #include 
 #include 
 
+#definePADATA_WORK_ONSTACK 1   /* Work's memory is on stack */
+
 struct padata_work {
struct work_struct  pw_work;
struct list_headpw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+   spinlock_t  lock;
+   struct completion   completion;
+   struct padata_mt_job*job;
+   int nworks;
+   int nworks_fini;
+   unsigned long   chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-void *data)
+void *data, int flags)
 {
-   INIT_WORK(&pw->pw_work, work_fn);
+   if (flags & PADATA_WORK_ONSTACK)
+           INIT_WORK_ONSTACK(&pw->pw_work, work_fn);
+   else
+           INIT_WORK(&pw->pw_work, work_fn);

[PATCH v3 2/8] padata: initialize earlier

2020-05-27 Thread Daniel Jordan
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |  6 ++
 init/main.c|  2 ++
 kernel/padata.c| 17 -
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #definePADATA_INVALID  4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..df32f67214d23 100644
--- a/init/main.c
+++ b/init/main.c
@@ -63,6 +63,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();
 
+   padata_init();
page_alloc_init_late();
/* Initialize page ext after all struct pages are initialized. */
page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
int ret;
 
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
  padata_cpu_online, NULL);
if (ret < 0)
-   return ret;
+   goto err;
hp_online = ret;
 
ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
  NULL, padata_cpu_dead);
if (ret < 0) {
cpuhp_remove_multi_state(hp_online);
-   return ret;
+   goto err;
}
-   return 0;
-}
-module_init(padata_driver_init);
 
+   return;
+err:
+   pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Paolo Bonzini
On 27/05/20 15:33, Andrew Lunn wrote:
>> I don't really know a lot about the networking subsystem, and as it was
>> pointed out in another email on patch 7 by Andrew, networking needs to
>> atomically gather and display statistics in order to make them consistent,
>> and currently this is not supported by stats_fs but could be added in
>> future.
> 
> Do you have any idea how you will support atomic access? It does not
> seem easy to implement in a filesystem based model.

Hi Andrew,

there are plans to support binary access.  Emanuele and I don't really
have a plan for how to implement it, but there are developers from
Google that have ideas (because Google has a similar "metricfs" thing
in-house).

I think atomic access would use some kind of "source_ops" struct
containing create_snapshot and release_snapshot function pointers.
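
Concretely, a first sketch of that idea might look like this (purely
illustrative, nothing like it exists in the posted series):

	struct stats_fs_source_ops {
		/* Freeze a consistent view of the source's values. */
		void *(*create_snapshot)(struct stats_fs_source *source);
		/* Release the view once the reader is done with it. */
		void (*release_snapshot)(struct stats_fs_source *source,
					 void *snapshot);
	};

A reader would then be served from the snapshot rather than from the
live counters, which is what would make the values mutually consistent.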

Paolo



[RFC PATCH 2/4] powerpc/64s: Don't let DT CPU features set FSCR_DSCR

2020-05-27 Thread Michael Ellerman
The device tree CPU features binding includes FSCR bit numbers which
Linux is instructed to set by firmware.

Whether that's a good idea or not, in the case of the DSCR the Linux
implementation has a hard requirement that the FSCR_DSCR bit not be
set by default. We use it to track when a process reads/writes to
DSCR, so it must be clear to begin with.

So if firmware tells us to set FSCR_DSCR we must ignore it.

Currently this does not cause a bug in our DSCR handling because the
value of FSCR that the device tree CPU features code establishes is
only used by swapper. All other tasks use the value hard coded in
init_task.thread.fscr.

However we'd like to fix that in a future commit, at which point this
will become necessary.

Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: sta...@vger.kernel.org # v4.12+
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/dt_cpu_ftrs.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 36bc0d5c4f3a..fca4d7ff22b9 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -346,6 +346,14 @@ static int __init feat_enable_dscr(struct dt_cpu_feature 
*f)
 {
u64 lpcr;
 
+   /*
+* Linux relies on FSCR[DSCR] being clear, so that we can take the
+* facility unavailable interrupt and track the task's usage of DSCR.
+* See facility_unavailable_exception().
+* Clear the bit here so that feat_enable() doesn't set it.
+*/
+   f->fscr_bit_nr = -1;
+
feat_enable(f);
 
lpcr = mfspr(SPRN_LPCR);
-- 
2.25.1



[RFC PATCH 4/4] powerpc/64s: Don't set FSCR bits in INIT_THREAD

2020-05-27 Thread Michael Ellerman
Since the previous commit that saves the value of FSCR configured at
boot into init_task.thread.fscr, the static initialisation in
INIT_THREAD now no longer has any effect.

So remove it.

For non DT CPU features, the end result is the same, because
__init_FSCR() is called on all CPUs that have an FSCR (Power8,
Power9), and it sets FSCR_TAR & FSCR_EBB.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/processor.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 4e53df163b92..52a67835057a 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -302,7 +302,6 @@ struct thread_struct {
.ksp = INIT_SP, \
.addr_limit = KERNEL_DS, \
.fpexc_mode = 0, \
-   .fscr = FSCR_TAR | FSCR_EBB \
 }
 #endif
 
-- 
2.25.1



[RFC PATCH 3/4] powerpc/64s: Save FSCR to init_task.thread.fscr after feature init

2020-05-27 Thread Michael Ellerman
At boot the FSCR is initialised via one of two paths. On most systems
it's set to a hard coded value in __init_FSCR().

On newer skiboot systems we use the device tree CPU features binding,
where firmware can tell Linux what bits to set in FSCR (and HFSCR).

In both cases the value that's configured at boot is not propagated
into the init_task.thread.fscr value prior to the initial fork of init
(pid 1), which means the value is not used by any processes other than
swapper (the idle task).

For the __init_FSCR() case this is OK, because the value in
init_task.thread.fscr is initialised to something sensible. However it
does mean that the value set in __init_FSCR() is not used other than
for swapper, which is odd and confusing.

The bigger problem is for the device tree CPU features case it
prevents firmware from setting (or clearing) FSCR bits for use by user
space. This means all existing kernels can not have features
enabled/disabled by firmware if those features require
setting/clearing FSCR bits.

We can handle both cases by saving the FSCR value into
init_task.thread.fscr after we have initialised it at boot. This fixes
the bug for device tree CPU features, and will allow us to simplify
the initialisation for the __init_FSCR() case in a future patch.

Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Cc: sta...@vger.kernel.org # v4.12+
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/prom.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 1dcf0e214a22..a74bfd09cb38 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -685,6 +685,23 @@ static void __init tm_init(void)
 static void tm_init(void) { }
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
+#ifdef CONFIG_PPC64
+static void __init save_fscr_to_task(void)
+{
+   /*
+* Ensure the init_task (pid 0, aka swapper) uses the value of FSCR we
+* have configured via the device tree features or via __init_FSCR().
+* That value will then be propagated to pid 1 (init) and all future
+* processes.
+*/
+   if (early_cpu_has_feature(CPU_FTR_ARCH_207S))
+   init_task.thread.fscr = mfspr(SPRN_FSCR);
+}
+#else
+static inline void save_fscr_to_task(void) {};
+#endif
+
+
 void __init early_init_devtree(void *params)
 {
phys_addr_t limit;
@@ -773,6 +790,8 @@ void __init early_init_devtree(void *params)
BUG();
}
 
+   save_fscr_to_task();
+
 #if defined(CONFIG_SMP) && defined(CONFIG_PPC64)
/* We'll later wait for secondaries to check in; there are
 * NCPUS-1 non-boot CPUs  :-)
-- 
2.25.1



[RFC PATCH 1/4] powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()

2020-05-27 Thread Michael Ellerman
__init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc:
Add support for context switching the TAR register") (Feb 2013), and
only set FSCR_TAR.

At that point FSCR (Facility Status and Control Register) was not
context switched, so the setting was permanent after boot.

Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit
54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again
that was permanent after boot.

Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on
POWER8") (Aug 2013) added a limited context switch of FSCR, just the
FSCR_DSCR bit was context switched based on thread.dscr_inherit. That
commit said "This clears the H/FSCR DSCR bit initially", but it
didn't, it left the initialisation of FSCR_DSCR in __init_FSCR().
However the initial context switch from init_task to pid 1 would clear
FSCR_DSCR because thread.dscr_inherit was 0.

That commit also introduced the requirement that FSCR_DSCR be clear
for user processes, so that we can take the facility unavailable
interrupt in order to manage dscr_inherit.

Then in commit 152d523e6307 ("powerpc: Create context switch helpers
save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to
thread_struct. However it still wasn't fully context switched, we just
took the existing value and set FSCR_DSCR if the new thread had
dscr_inherit set. FSCR was still initialised at boot to FSCR_DSCR |
FSCR_TAR, but that value was not propagated into the thread_struct, so
the initial context switch set FSCR_DSCR back to 0.

Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context
switching") (Jun 2016) added a full context switch of the FSCR, and
added an initialisation of init_task.thread.fscr to FSCR_TAR |
FSCR_EBB, but omitted FSCR_DSCR.

The end result is that swapper runs with FSCR_DSCR set because of the
initialisation in __init_FSCR(), but no other processes do, they use
the value from init_task.thread.fscr.

Having FSCR_DSCR set for swapper allows it to access SPR 3 from
userspace, but swapper never runs userspace, so it has no useful
effect. It's also confusing to have the value initialised in two
places to two different values.

So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the
point where there's a single value of FSCR, even if it's still set in
two places.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/cpu_setup_power.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index a460298c7ddb..f91ecb10d0ae 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -184,7 +184,7 @@ _GLOBAL(__restore_cpu_power9)
 
 __init_FSCR:
mfspr   r3,SPRN_FSCR
-   ori r3,r3,FSCR_TAR|FSCR_DSCR|FSCR_EBB
+   ori r3,r3,FSCR_TAR|FSCR_EBB
mtspr   SPRN_FSCR,r3
blr
 
-- 
2.25.1



Re: [PATCH] media: omap3isp: Shuffle cacheflush.h and include mm.h

2020-05-27 Thread Laurent Pinchart
Hi Nathan,

(CC'ing Sakari Ailus and the linux-media mailing list)

On Wed, May 27, 2020 at 01:13:37AM -0700, Nathan Chancellor wrote:
> On Wed, May 27, 2020 at 09:02:51AM +0200, Geert Uytterhoeven wrote:
> > On Wed, May 27, 2020 at 6:37 AM Nathan Chancellor wrote:
> > > After mm.h was removed from the asm-generic version of cacheflush.h,
> > > s390 allyesconfig shows several warnings of the following nature:
> > >
> > > In file included from ./arch/s390/include/generated/asm/cacheflush.h:1,
> > >  from drivers/media/platform/omap3isp/isp.c:42:
> > > ./include/asm-generic/cacheflush.h:16:42: warning: 'struct mm_struct'
> > > declared inside parameter list will not be visible outside of this
> > > definition or declaration
> > >
> > > cacheflush.h does not include mm.h nor does it include any forward
> > > declaration of these structures hence the warning. To avoid this,
> > > include mm.h explicitly in this file and shuffle cacheflush.h below it.
> > >
> > > Fixes: 19c0054597a0 ("asm-generic: don't include <linux/mm.h> in cacheflush.h")
> > > Signed-off-by: Nathan Chancellor 
> > 
> > Thanks for your patch!
> > 
> > > I am aware the fixes tag is kind of irrelevant because that SHA will
> > > change in the next linux-next revision and this will probably get folded
> > > into the original patch anyways but still.
> > >
> > > The other solution would be to add forward declarations of these structs
> > > to the top of cacheflush.h, I just chose to do what Christoph did in the
> > > original patch. I am happy to do that instead if you all feel that is
> > > better.
> > 
> > That actually looks like a better solution to me, as it would address the
> > problem for all users.

Headers should be self-contained, so that would be the best fix in my
opinion.

This being said, as cacheflush.h isn't needed in isp.c, I think we
should also drop it. It seems to have been included there since the
first driver version, and was likely a left-over from the out-of-tree
development. Manual cache handling was part of
drivers/media/platform/omap3isp/ispqueue.c and has been removed in
commit fbac1400bd1a ("[media] omap3isp: Move to videobuf2").

cacheflush.h can also be dropped from ispvideo.c which suffers from the
same issue.

> > >  drivers/media/platform/omap3isp/isp.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/media/platform/omap3isp/isp.c b/drivers/media/platform/omap3isp/isp.c
> > > index a4ee6b86663e..54106a768e54 100644
> > > --- a/drivers/media/platform/omap3isp/isp.c
> > > +++ b/drivers/media/platform/omap3isp/isp.c
> > > @@ -39,8 +39,6 @@
> > >   * Troy Laramy 
> > >   */
> > >
> > > -#include 
> > > -
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -49,6 +47,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -58,6 +57,8 @@
> > >  #include 
> > >  #include 
> > >
> > > +#include 
> > > +
> > >  #ifdef CONFIG_ARM_DMA_USE_IOMMU
> > >  #include 
> > >  #endif
> > 
> > Why does this file need  at all?
> > It doesn't call any of the flush_*() functions, and seems to compile fine
> > without (on arm32).
> > 
> > Perhaps it was included at the top intentionally, to override the 
> > definitions
> > of copy_{to,from}_user_page()? Fortunately that doesn't seem to be the
> > case, from a quick look at the assembler output.
> > 
> > So let's just remove the #include instead?
> 
> Sounds good to me. I can send a patch if needed or I suppose Andrew can
> just make a small fixup patch for it. Let me know what I should do.

-- 
Regards,

Laurent Pinchart


Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Andrew Lunn
> I don't really know a lot about the networking subsystem, and as it was
> pointed out in another email on patch 7 by Andrew, networking needs to
> atomically gather and display statistics in order to make them consistent,
> and currently this is not supported by stats_fs but could be added in
> future.

Hi Emanuele

Do you have any idea how you will support atomic access? It does not
seem easy to implement in a filesystem based model.

 Andrew


Re: [PATCH v3 3/7] kunit: tests for stats_fs API

2020-05-27 Thread Emanuele Giuseppe Esposito




>> In order to run them, the kernel .config must set CONFIG_KUNIT=y
>> and a new .kunitconfig file must be created with CONFIG_STATS_FS=y
>> and CONFIG_STATS_FS_TEST=y



> It looks like CONFIG_STATS_FS is built-in, but it exports
> much of the functionality you are testing.  However could the
> tests also be built as a module (i.e. make CONFIG_STATS_FS_TEST
> a tristate variable)? To test this you'd need to specify
> CONFIG_KUNIT=m and CONFIG_STATS_FS_TEST=m, and testing would
> simply be a case of "modprobe"ing the stats fs module and collecting
> results in /sys/kernel/debug/kunit/ (rather
> than running kunit.py). Are you relying on unexported internals in
> the tests that would prevent building them as a module?



I haven't checked it yet, but the tests should work as a separate module.
I will look into it, thanks.

Emanuele



Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-27 Thread Emanuele Giuseppe Esposito





>> The file system is mounted on /sys/kernel/stats and would be already used
>> by kvm. Statsfs was initially introduced by Paolo Bonzini [1].


> What's the direct motivation for this work? Moving KVM stats out of
> debugfs?


There are many reasons: one of them is not using debugfs for statistics,
but also (and mainly) to try to have a single tool that automatically
takes care of the stats and displays them, instead of leaving each
subsystem "on its own".


Sure, everyone gathers and processes stats in different ways, and the 
aim of this tool is to hopefully be extensible enough to cover all needs.

> In my experience stats belong in the API used for creating/enumerating
> objects, statsfs sounds like going in the exact opposite direction -
> creating a parallel structure / hierarchy for exposing stats.


> I know nothing about KVM but are you sure all the info that has to be
> exposed will be stats?

I don't understand, what do you mean here?




> In case of networking we have the basic stats in sysfs, under the
> netdevice's kobject. But since we're not using sysfs much any more
> for config, new stats are added in netlink APIs. Again - same APIs
> used for enumeration and config.


I don't really know a lot about the networking subsystem, and as it was 
pointed out in another email on patch 7 by Andrew, networking needs to 
atomically gather and display statistics in order to make them 
consistent, and currently this is not supported by stats_fs but could be 
added in future.


In addition, right now it won't work properly if the networking 
namespaces are enabled. That is another issue to take into 
consideration. That's also why I marked patch 7 as "not for merge"


Regarding the config, as I said the idea is to gather multiple 
subsystems' statistics, therefore there wouldn't be a single 
configuration method like in netlink.
For example in kvm there are file descriptors for configuration, and 
creating them requires no privilege, contrary to the network interfaces.


Thank you,
Emanuele



Re: [PATCH v3 3/7] kunit: tests for stats_fs API

2020-05-27 Thread Alan Maguire
On Tue, 26 May 2020, Emanuele Giuseppe Esposito wrote:

> Add kunit tests to extensively test the stats_fs API functionality.
>

I've added in the kunit-related folks.
 
> In order to run them, the kernel .config must set CONFIG_KUNIT=y
> and a new .kunitconfig file must be created with CONFIG_STATS_FS=y
> and CONFIG_STATS_FS_TEST=y
>

It looks like CONFIG_STATS_FS is built-in, but it exports
much of the functionality you are testing.  However could the
tests also be built as a module (i.e. make CONFIG_STATS_FS_TEST
a tristate variable)? To test this you'd need to specify
CONFIG_KUNIT=m and CONFIG_STATS_FS_TEST=m, and testing would
simply be a case of "modprobe"ing the stats fs module and collecting
results in /sys/kernel/debug/kunit/ (rather 
than running kunit.py). Are you relying on unexported internals in
the tests that would prevent building them as a module?

Thanks!

Alan


[PATCH V4 2/2] tools/perf: Add perf tools support for extended register capability in powerpc

2020-05-27 Thread Athira Rajeev
From: Anju T Sudhakar 

Add extended regs to sample_reg_mask on the tool side for use
with the `-I?` option. The perf tools side uses the extended mask to
display the platform-supported register names (with the -I? option) to
the user, and also sends this mask to the kernel to capture the extended
registers in each sample. Hence the mask value is decided based on the
processor version.
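
For example, with this change "perf record -I?" on a Power9 system
should additionally list mmcr0, mmcr1 and mmcr2 among the available
registers (see the sample output in the kernel-side patch of this
series).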

Signed-off-by: Anju T Sudhakar 
[Decide extended mask at run time based on platform]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++-
 tools/perf/arch/powerpc/include/perf_regs.h |  5 ++-
 tools/perf/arch/powerpc/util/perf_regs.c| 55 +
 3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..485b1d5 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
+   - PERF_REG_PMU_MASK)
+
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/perf/arch/powerpc/include/perf_regs.h
index e18a355..46ed00d 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -64,7 +64,10 @@
[PERF_REG_POWERPC_DAR] = "dar",
[PERF_REG_POWERPC_DSISR] = "dsisr",
[PERF_REG_POWERPC_SIER] = "sier",
-   [PERF_REG_POWERPC_MMCRA] = "mmcra"
+   [PERF_REG_POWERPC_MMCRA] = "mmcra",
+   [PERF_REG_POWERPC_MMCR0] = "mmcr0",
+   [PERF_REG_POWERPC_MMCR1] = "mmcr1",
+   [PERF_REG_POWERPC_MMCR2] = "mmcr2",
 };
 
 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
index 0a52429..9179230 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -6,9 +6,14 @@
 
 #include "../../../util/perf_regs.h"
 #include "../../../util/debug.h"
+#include "../../../util/event.h"
+#include "../../../util/header.h"
+#include "../../../perf-sys.h"
 
 #include 
 
+#define PVR_POWER9 0x004E
+
 const struct sample_reg sample_reg_masks[] = {
SMPL_REG(r0, PERF_REG_POWERPC_R0),
SMPL_REG(r1, PERF_REG_POWERPC_R1),
@@ -55,6 +60,9 @@
SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
SMPL_REG(sier, PERF_REG_POWERPC_SIER),
SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
+   SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
+   SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
+   SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
SMPL_REG_END
 };
 
@@ -163,3 +171,50 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 
return SDT_ARG_VALID;
 }
+
+uint64_t arch__intr_reg_mask(void)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .sample_type= PERF_SAMPLE_REGS_INTR,
+   .precise_ip = 1,
+   .disabled   = 1,
+   .exclude_kernel = 1,
+   };
+   int fd, ret;
+   char buffer[64];
+   u32 version;
+   u64 extended_mask = 0;
+
+   /* Get the PVR value to set the extended
+* mask specific to platform
+*/
+   get_cpuid(buffer, sizeof(buffer));
+   ret = sscanf(buffer, "%u,", &version);
+
+   if (ret != 1) {
+   pr_debug("Failed to get the processor version, unable to output 
extended registers\n");
+   return PERF_REGS_MASK;
+   }
+
+   if (version == PVR_POWER9)
+   extended_mask = PERF_REG_PMU_MASK_300;
+   else
+   return PERF_REGS_MASK;
+
+   attr.sample_regs_intr = extended_mask;
+   attr.sample_period = 1;
+   event_attr_init(&attr);
+
+   /*
+* check if the pmu supports perf extended regs, before
+* returning the register mask to sample.
+*/
+   fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+   if (fd != -1) {
+   close(fd);
+   return (extended_mask | PERF_REGS_MASK);
+   }
+   return PERF_REGS_MASK;
+}
-- 
1.8.3.1



[PATCH V4 1/2] powerpc/perf: Add support for outputting extended regs in perf intr_regs

2020-05-27 Thread Athira Rajeev
From: Anju T Sudhakar 

Add support for the perf extended register capability in powerpc.
The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to indicate a
PMU which supports extended registers. The generic code defines the
mask of extended registers as 0 for unsupported architectures.

The patch adds extended regs support for the power9 platform by
exposing the MMCR0, MMCR1 and MMCR2 registers.

The REG_RESERVED mask needs an update to include the extended regs.
`PERF_REG_EXTENDED_MASK`, which contains the mask value of the supported
registers, is defined at runtime in the kernel based on the platform,
since the supported registers (and hence the mask value) may differ from
one processor version to another.
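
To make the mask arithmetic concrete, here is a sketch restating the
macros added below (the variable names are illustrative):

	/* Every register up to and including MMCRA, bits 0..MAX-1: */
	u64 base_mask = (1ULL << PERF_REG_POWERPC_MAX) - 1;
	/* Everything up to MMCR2, minus the base registers, which
	 * leaves exactly the MMCR0/MMCR1/MMCR2 bits set:
	 */
	u64 ext_mask = ((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) - base_mask;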

with patch
--

available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2

PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
... intr regs: mask 0x ABI 64-bit
 r0    0xc012b77c
 r1    0xc03fe5e03930
 r2    0xc1b0e000
 r3    0xc03fdcddf800
 r4    0xc03fc788
 r5    0x9c422724be
 r6    0xc03fe5e03908
 r7    0xff63bddc8706
 r8    0x9e4
 r9    0x0
 r10   0x1
 r11   0x0
 r12   0xc01299c0
 r13   0xc03c4800
 r14   0x0
 r15   0x7fffdd8b8b00
 r16   0x0
 r17   0x7fffdd8be6b8
 r18   0x7e7076607730
 r19   0x2f
 r20   0xc0001fc26c68
 r21   0xc0002041e4227e00
 r22   0xc0002018fb60
 r23   0x1
 r24   0xc03ffec4d900
 r25   0x8000
 r26   0x0
 r27   0x1
 r28   0x1
 r29   0xc1be1260
 r30   0x6008010
 r31   0xc03ffebb7218
 nip   0xc012b910
 msr   0x90009033
 orig_r3 0xc012b86c
 ctr   0xc01299c0
 link  0xc012b77c
 xer   0x0
 ccr   0x2800
 softe 0x1
 trap  0xf00
 dar   0x0
 dsisr 0x800
 sier  0x0
 mmcra 0x800
 mmcr0 0x82008090
 mmcr1 0x1e00
 mmcr2 0x0
 ... thread: perf:4784

Signed-off-by: Anju T Sudhakar 
[Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/perf_event_server.h |  8 +++
 arch/powerpc/include/uapi/asm/perf_regs.h| 14 +++-
 arch/powerpc/perf/core-book3s.c  |  1 +
 arch/powerpc/perf/perf_regs.c| 34 +---
 arch/powerpc/perf/power9-pmu.c   |  6 +
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3e9703f..1458e1a 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -15,6 +15,9 @@
 #define MAX_EVENT_ALTERNATIVES 8
 #define MAX_LIMITED_HWCOUNTERS 2
 
+extern u64 mask_var;
+#define PERF_REG_EXTENDED_MASK  mask_var
+
 struct perf_event;
 
 /*
@@ -55,6 +58,11 @@ struct power_pmu {
int *blacklist_ev;
/* BHRB entries in the PMU */
int bhrb_nr;
+   /*
+* set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
+* the pmu supports extended perf regs capability
+*/
+   int capabilities;
 };
 
 /*
diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..485b1d5 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
+   - PERF_REG_PMU_MASK)
+
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 3dcfecf..7f63edf 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2275,6 +2275,7 @@ int register_power_pmu(struct power_pmu *pmu)
pmu->name);
 
power_pmu.attr_groups = ppmu->attr_groups;
+   power_pmu.capabilities |= (ppmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS);
 
 #ifdef MSR_HV
/*
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index a213a0a..c8a7e8c 

[PATCH V4 0/2] powerpc/perf: Add support for perf extended regs in powerpc

2020-05-27 Thread Athira Rajeev
Patch set to add support for the perf extended register capability in
powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to
indicate a PMU which supports extended registers. The generic code
defines the mask of extended registers as 0 for unsupported architectures.

Patch 1/2 defines this PERF_PMU_CAP_EXTENDED_REGS mask to output the
values of mmcr0, mmcr1 and mmcr2 for POWER9, and defines
`PERF_REG_EXTENDED_MASK` at runtime, which contains the mask value of
the registers supported under extended regs.

Patch 2/2 adds extended regs to sample_reg_mask in the tool side to use
with `-I?` option.

Anju T Sudhakar (2):
  powerpc/perf: Add support for outputting extended regs in perf
intr_regs
  tools/perf: Add perf tools support for extended register capability in
powerpc

---
Changes from v3 -> v4
- Addressed the comments for new line/tab issue
  and added "Reviewed-by" from Madhavan Srinivasn.

Changes from v2 -> v3
- Split kernel and tools side patches as suggested by Arnaldo
- Addressed review comment from Madhavan Srinivasn

Changes from v1 -> v2

- PERF_REG_EXTENDED_MASK` is defined at runtime in the kernel
based on platform. This will give flexibility in using extended
regs for all processor versions where the supported registers may differ.
- removed PERF_REG_EXTENDED_MASK from the perf tools side. Based on the
processor version(from PVR value), tool side will return the appropriate
extended mask
- Since tool changes can handle without a "PERF_REG_EXTENDED_MASK" macro,
dropped patch to set NO_AUXTRACE.
- Addressed review comments from Ravi Bangoria for V1

---

 arch/powerpc/include/asm/perf_event_server.h|  8 
 arch/powerpc/include/uapi/asm/perf_regs.h   | 14 ++-
 arch/powerpc/perf/core-book3s.c |  1 +
 arch/powerpc/perf/perf_regs.c   | 34 +--
 arch/powerpc/perf/power9-pmu.c  |  6 +++
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++-
 tools/perf/arch/powerpc/include/perf_regs.h |  5 ++-
 tools/perf/arch/powerpc/util/perf_regs.c| 55 +
 8 files changed, 131 insertions(+), 6 deletions(-)

-- 
1.8.3.1



[PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again

2020-05-27 Thread Petr Mladek
The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
to archs where they work") made the bpf_probe_read{, str}() functions no
longer available on architectures where the same logical address might
have different content in the kernel and user memory mappings. These
architectures should use the probe_read_{user,kernel}_str helpers instead.

For backward compatibility, the problematic functions are still available
on architectures where the user and kernel address spaces do not overlap.
This is indicated by CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.

At the moment, these backward compatible functions are enabled only
on x86_64, arm, and arm64. Let's enable them also on powerpc, which has
a non-overlapping address space as well.
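
For illustration, a minimal sketch of the kind of legacy tracing program
this re-enables (the attach point and probed argument are hypothetical;
new code should prefer bpf_probe_read_kernel()/_user()):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	SEC("kprobe/some_kernel_function")
	int probe(struct pt_regs *ctx)
	{
		char buf[64];

		/* Legacy helper, available again once the architecture
		 * selects ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
		 */
		bpf_probe_read(buf, sizeof(buf), (void *)PT_REGS_PARM1(ctx));
		return 0;
	}

	char _license[] SEC("license") = "GPL";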

Signed-off-by: Petr Mladek 
To: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: Masami Hiramatsu 
Cc: Brendan Gregg 
Cc: Christoph Hellwig 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d13b5328ca10..b29d7cb38368 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -126,6 +126,7 @@ config PPC
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
+   select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
-- 
2.26.2



Re: [PATCH 2/3] powerpc/pci: unmap legacy INTx interrupts of passthrough IO adapters

2020-05-27 Thread Cédric Le Goater
Hello Michael,

On 4/29/20 9:51 AM, Cédric Le Goater wrote:
> When a passthrough IO adapter is removed from a pseries machine using
> hash MMU and the XIVE interrupt mode, the POWER hypervisor, pHyp,
> expects the guest OS to have cleared all page table entries related to
> the adapter. If some are still present, the RTAS call which isolates
> the PCI slot returns error 9001 "valid outstanding translations" and
> the removal of the IO adapter fails.
> 
> INTx interrupt numbers need special care because Linux maps the
> interrupts automatically in the Linux interrupt number space if they
> are presented in the device tree node describing the IO adapter. These
> interrupts are not un-mapped automatically and in case of a hot-plug
> adapter, the PCI hot-plug layer needs to handle the cleanup to make
> sure that all the page table entries of the XIVE ESB pages are
> cleared.

It seems this patch needs more digging to make sure we are handling
the IRQ unmapping in the correct PCI handler. Could you please keep
it back for the moment?

Thanks,

C.


Re: [PATCH 2/3] powerpc/pci: unmap legacy INTx interrupts of passthrough IO adapters

2020-05-27 Thread Cédric Le Goater
On 5/27/20 2:57 AM, Oliver O'Halloran wrote:
> On Wed, Apr 29, 2020 at 5:51 PM Cédric Le Goater  wrote:
>>
>> When a passthrough IO adapter is removed from a pseries machine using
>> hash MMU and the XIVE interrupt mode, the POWER hypervisor, pHyp,
>> expects the guest OS to have cleared all page table entries related to
>> the adapter. If some are still present, the RTAS call which isolates
>> the PCI slot returns error 9001 "valid outstanding translations" and
>> the removal of the IO adapter fails.
>>
>> INTx interrupt numbers need special care because Linux maps the
>> interrupts automatically in the Linux interrupt number space if they
>> are presented in the device tree node describing the IO adapter. These
>> interrupts are not un-mapped automatically and in case of a hot-plug
>> adapter, the PCI hot-plug layer needs to handle the cleanup to make
>> sure that all the page table entries of the XIVE ESB pages are
>> cleared.
>>
>> Cc: "Oliver O'Halloran" 
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/kernel/pci-hotplug.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>> index bf83f76563a3..9e9c6befd7ea 100644
>> --- a/arch/powerpc/kernel/pci-hotplug.c
>> +++ b/arch/powerpc/kernel/pci-hotplug.c
>> @@ -57,6 +57,8 @@ void pcibios_release_device(struct pci_dev *dev)
>> struct pci_controller *phb = pci_bus_to_host(dev->bus);
>> struct pci_dn *pdn = pci_get_pdn(dev);
>>
>> +   irq_dispose_mapping(dev->irq);
> 
> What does the original mapping? Powerpc arch code or the PCI core?

Powerpc. In pci_read_irq_line() when a device is added.

> Tearing down the mapping in pcibios_release_device() seems a bit fishy
> to me since the PCI core has already torn down the device state at
> that point. If the release is delayed it's possible that another
> pci_dev has mapped the IRQ before we get here, but maybe that's ok.

Which scenario would that be? Multiple devices mapping the same INTx
interrupt because all are used already?

Where should we drop the mapping?

Thanks,

C.

>> +
>> eeh_remove_device(dev);
>>
>> if (phb->controller_ops.release_device)
>> --
>> 2.25.4
>>



Re: [PATCH v8 5/5] powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_HEALTH

2020-05-27 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> This patch implements support for PDSM request 'PAPR_SCM_PDSM_HEALTH'
> that returns a newly introduced 'struct nd_papr_pdsm_health' instance
> containing dimm health information back to user space in response to
> ND_CMD_CALL. This functionality is implemented in newly introduced
> papr_scm_get_health() that queries the scm-dimm health information and
> then copies this information to the package payload whose layout is
> defined by 'struct nd_papr_pdsm_health'.
>
> The patch also introduces a new member 'struct papr_scm_priv.health'
> thats an instance of 'struct nd_papr_pdsm_health' to cache the health
> information of a nvdimm. As a result functions drc_pmem_query_health()
> and flags_show() are updated to populate and use this new struct
> instead of a u64 integer that was earlier used.
>

Reviewed-by: Aneesh Kumar K.V 

> Cc: "Aneesh Kumar K . V" 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: Ira Weiny 
> Signed-off-by: Vaibhav Jain 

-aneesh


Re: [PATCH v8 4/5] ndctl/papr_scm,uapi: Add support for PAPR nvdimm specific methods

2020-05-27 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> Introduce support for PAPR NVDIMM Specific Methods (PDSM) in papr_scm
> module and add the command family to the white list of NVDIMM command
> sets. Also advertise support for ND_CMD_CALL for the nvdimm
> command mask and implement necessary scaffolding in the module to
> handle ND_CMD_CALL ioctl and PDSM requests that we receive.
>
> The layout of the PDSM request as we expect from libnvdimm/libndctl is
> described in newly introduced uapi header 'papr_scm_pdsm.h' which
> defines a new 'struct nd_pdsm_cmd_pkg' header. This header is used
> to communicate the PDSM request via member
> 'nd_pkg_papr_scm->nd_command' and size of payload that need to be
> sent/received for servicing the PDSM.
>
> A new function is_cmd_valid() is implemented that reads the args to
> papr_scm_ndctl() and performs sanity tests on them. A new function
> papr_scm_service_pdsm() is introduced and is called from
> papr_scm_ndctl() in case of a PDSM request is received via ND_CMD_CALL
> command from libnvdimm.
>

Reviewed-by: Aneesh Kumar K.V 

> Cc: "Aneesh Kumar K . V" 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: Ira Weiny 
> Signed-off-by: Vaibhav Jain 
> ---

-aneesh


Re: [PATCH v8 3/5] powerpc/papr_scm: Fetch nvdimm health information from PHYP

2020-05-27 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> Implement support for fetching nvdimm health information via
> H_SCM_HEALTH hcall as documented in Ref[1]. The hcall returns a pair
> of 64-bit bitmap, bitwise-and of which is then stored in
> 'struct papr_scm_priv' and subsequently partially exposed to
> user-space via newly introduced dimm specific attribute
> 'papr/flags'. Since the hcall is costly, the health information is
> cached and only re-queried, 60s after the previous successful hcall.
>
> The patch also adds a  documentation text describing flags reported by
> the the new sysfs attribute 'papr/flags' is also introduced at
> Documentation/ABI/testing/sysfs-bus-papr-scm.
>
> [1] commit 58b278f568f0 ("powerpc: Provide initial documentation for
> PAPR hcalls")
>

Reviewed-by: Aneesh Kumar K.V 

> Cc: "Aneesh Kumar K . V" 
> Cc: Dan Williams 
> Cc: Michael Ellerman 
> Cc: Ira Weiny 
> Signed-off-by: Vaibhav Jain 


[PATCH] powerpc/fadump: account for memory_limit while reserving memory

2020-05-27 Thread Hari Bathini
If the memory chunk found for reserving memory overshoots the imposed
memory limit, do not proceed with reserving memory. This was the default
behavior until commit 140777a3d8df ("powerpc/fadump: consider reserved
ranges while reserving memory") unwittingly changed it.

Reported-by: kbuild test robot 
Fixes: 140777a3d8df ("powerpc/fadump: consider reserved ranges while reserving memory")
Cc: sta...@vger.kernel.org
Signed-off-by: Hari Bathini 
---

For reference:
- https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-May/211136.html


 arch/powerpc/kernel/fadump.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 63aac8b..78ab9a6 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -603,7 +603,7 @@ int __init fadump_reserve_mem(void)
 */
base = fadump_locate_reserve_mem(base, size);
 
-   if (!base) {
+   if (!base || (base + size > mem_boundary)) {
pr_err("Failed to find memory chunk for 
reservation!\n");
goto error_out;
}



Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT

2020-05-27 Thread Laurent Dufour

Le 27/05/2020 à 06:16, Paul Mackerras a écrit :

On Wed, May 20, 2020 at 07:43:08PM +0200, Laurent Dufour wrote:

The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
reserved to the Ultravisor.

However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.

In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
not set in that case.

Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour 


Thanks, applied to my kvm-ppc-next branch.  I expanded the comment in
the code a little.


Thanks, the comment is more explicit now.

Laurent.


Re: [PATCH] media: omap3isp: Shuffle cacheflush.h and include mm.h

2020-05-27 Thread Nathan Chancellor
Hi Geert,

On Wed, May 27, 2020 at 09:02:51AM +0200, Geert Uytterhoeven wrote:
> Hi Nathan,
> 
> CC Laurent
> 
> On Wed, May 27, 2020 at 6:37 AM Nathan Chancellor
>  wrote:
> > After mm.h was removed from the asm-generic version of cacheflush.h,
> > s390 allyesconfig shows several warnings of the following nature:
> >
> > In file included from ./arch/s390/include/generated/asm/cacheflush.h:1,
> >  from drivers/media/platform/omap3isp/isp.c:42:
> > ./include/asm-generic/cacheflush.h:16:42: warning: 'struct mm_struct'
> > declared inside parameter list will not be visible outside of this
> > definition or declaration
> >
> > cacheflush.h does not include mm.h nor does it include any forward
> > declaration of these structures hence the warning. To avoid this,
> > include mm.h explicitly in this file and shuffle cacheflush.h below it.
> >
> > Fixes: 19c0054597a0 ("asm-generic: don't include <linux/mm.h> in cacheflush.h")
> > Signed-off-by: Nathan Chancellor 
> 
> Thanks for your patch!
> 
> > I am aware the fixes tag is kind of irrelevant because that SHA will
> > change in the next linux-next revision and this will probably get folded
> > into the original patch anyways but still.
> >
> > The other solution would be to add forward declarations of these structs
> > to the top of cacheflush.h, I just chose to do what Christoph did in the
> > original patch. I am happy to do that instead if you all feel that is
> > better.
> 
> That actually looks like a better solution to me, as it would address the
> problem for all users.
> 
> >  drivers/media/platform/omap3isp/isp.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/media/platform/omap3isp/isp.c b/drivers/media/platform/omap3isp/isp.c
> > index a4ee6b86663e..54106a768e54 100644
> > --- a/drivers/media/platform/omap3isp/isp.c
> > +++ b/drivers/media/platform/omap3isp/isp.c
> > @@ -39,8 +39,6 @@
> >   * Troy Laramy 
> >   */
> >
> > -#include 
> > -
> >  #include 
> >  #include 
> >  #include 
> > @@ -49,6 +47,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -58,6 +57,8 @@
> >  #include 
> >  #include 
> >
> > +#include 
> > +
> >  #ifdef CONFIG_ARM_DMA_USE_IOMMU
> >  #include 
> >  #endif
> 
> Why does this file need  at all?
> It doesn't call any of the flush_*() functions, and seems to compile fine
> without (on arm32).
> 
> Perhaps it was included at the top intentionally, to override the definitions
> of copy_{to,from}_user_page()? Fortunately that doesn't seem to be the
> case, from a quick look at the assembler output.
> 
> So let's just remove the #include instead?

Sounds good to me. I can send a patch if needed or I suppose Andrew can
just make a small fixup patch for it. Let me know what I should do.

Cheers,
Nathan


Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone

2020-05-27 Thread kbuild test robot
Hi Alexandre,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on linus/master v5.7-rc7 next-20200526]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Alexandre-Ghiti/vmalloc-kernel-mapping-and-relocatable-kernel/20200524-170109
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: riscv-allyesconfig (attached as .config)
compiler: riscv64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot 

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> arch/riscv/mm/init.c:383:6: warning: no previous prototype for 
>> 'create_kernel_page_table' [-Wmissing-prototypes]
383 | void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
|  ^~~~

vim +/create_kernel_page_table +383 arch/riscv/mm/init.c

   382  
 > 383  void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
   384  {
   385  uintptr_t va, end_va;
   386  
   387  end_va = kernel_virt_addr + load_sz;
   388  for (va = kernel_virt_addr; va < end_va; va += map_size)
   389  create_pgd_mapping(pgdir, va,
   390 load_pa + (va - kernel_virt_addr),
   391 map_size, PAGE_KERNEL_EXEC);
   392  }
   393  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone

2020-05-27 Thread Alex Ghiti

Le 5/27/20 à 2:05 AM, Zong Li a écrit :

On Wed, May 27, 2020 at 1:06 AM Alex Ghiti  wrote:

Hi Zong,

Le 5/26/20 à 5:43 AM, Zong Li a écrit :

On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti  wrote:

This is a preparatory patch for relocatable kernel.

The kernel used to be linked at PAGE_OFFSET address and used to be loaded
physically at the beginning of the main memory. Therefore, we could use
the linear mapping for the kernel mapping.

But the relocated kernel base address will be different from PAGE_OFFSET
and since in the linear mapping, two different virtual addresses cannot
point to the same physical address, the kernel mapping needs to lie outside
the linear mapping.

In addition, because modules and BPF must be close to the kernel (inside
+-2GB window), the kernel is placed at the end of the vmalloc zone minus
2GB, which leaves room for modules and BPF. The kernel could not be
placed at the beginning of the vmalloc zone since other vmalloc
allocations from the kernel could get all the +-2GB window around the
kernel which would prevent new modules and BPF programs to be loaded.
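
As a rough picture of the resulting virtual layout (derived from the
macros in the hunks below):

	VMALLOC_START .......... regular vmalloc allocations
	VMALLOC_END - 2GB + 1 .. kernel image (KERNEL_LINK_ADDR); modules
	                         and BPF must sit within +-2GB of this
	VMALLOC_END ............ == PAGE_OFFSET - 1
	PAGE_OFFSET ............ linear mapping of RAM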

Signed-off-by: Alexandre Ghiti 
---
   arch/riscv/boot/loader.lds.S |  3 +-
   arch/riscv/include/asm/page.h| 10 +-
   arch/riscv/include/asm/pgtable.h | 37 +---
   arch/riscv/kernel/head.S |  3 +-
   arch/riscv/kernel/module.c   |  4 +--
   arch/riscv/kernel/vmlinux.lds.S  |  3 +-
   arch/riscv/mm/init.c | 58 +---
   arch/riscv/mm/physaddr.c |  2 +-
   8 files changed, 87 insertions(+), 33 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
   /* SPDX-License-Identifier: GPL-2.0 */

   #include 
+#include 

   OUTPUT_ARCH(riscv)
   ENTRY(_start)

   SECTIONS
   {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;

  .payload : {
  *(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..48bb09b6a9b7 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;

   #ifdef CONFIG_MMU
   extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
   extern unsigned long pfn_base;
   #define ARCH_PFN_OFFSET(pfn_base)
   #else
   #define va_pa_offset   0
+#define va_kernel_pa_offset0
   #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
   #endif /* CONFIG_MMU */

   extern unsigned long max_low_pfn;
   extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;

   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+   ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  \
+   (((x) >= PAGE_OFFSET) ? \
+   linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))

   #ifdef CONFIG_DEBUG_VIRTUAL
   extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 35b60035b6b0..25213cfaf680 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,29 @@

   #include 

-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
-
-#ifdef CONFIG_MMU
+#ifndef CONFIG_MMU
+#define KERNEL_VIRT_ADDR   PAGE_OFFSET
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
+/*
+ * Leave 2GB for modules and BPF that must lie within a 2GB range around
+ * the kernel.
+ */
+#define KERNEL_VIRT_ADDR   (VMALLOC_END - SZ_2G + 1)
+#define KERNEL_LINK_ADDR   KERNEL_VIRT_ADDR

   #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
   #define VMALLOC_END  (PAGE_OFFSET - 1)
   #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)

   #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   (kernel_virt_addr)
+#define BPF_JIT_REGION_END (kernel_virt_addr + BPF_JIT_REGION_SIZE)

It seems to have a potential risk here, the region of bpf is
overlapping with kernel mapping, so if kernel size is bigger than
128MB, bpf region would be occupied and run out by kernel mapping.

Is there the risk as I mentioned?



Sorry, I forgot to answer this one: I was confident that 128MB was large
enough for the kernel and BPF. But I see no reason to leave this risk, so
I'll change kernel_virt_addr for _end so that BPF will have its 128MB
reserved.

Thanks !

Alex





+
+#ifdef CONFIG_64BIT
+#define VMALLOC_MODULE_START   BPF_JIT_REGION_END

Re: [PATCH] media: omap3isp: Shuffle cacheflush.h and include mm.h

2020-05-27 Thread Geert Uytterhoeven
Hi Nathan,

CC Laurent

On Wed, May 27, 2020 at 6:37 AM Nathan Chancellor
 wrote:
> After mm.h was removed from the asm-generic version of cacheflush.h,
> s390 allyesconfig shows several warnings of the following nature:
>
> In file included from ./arch/s390/include/generated/asm/cacheflush.h:1,
>  from drivers/media/platform/omap3isp/isp.c:42:
> ./include/asm-generic/cacheflush.h:16:42: warning: 'struct mm_struct'
> declared inside parameter list will not be visible outside of this
> definition or declaration
>
> cacheflush.h does not include mm.h nor does it include any forward
> declaration of these structures hence the warning. To avoid this,
> include mm.h explicitly in this file and shuffle cacheflush.h below it.
>
> Fixes: 19c0054597a0 ("asm-generic: don't include <linux/mm.h> in cacheflush.h")
> Signed-off-by: Nathan Chancellor 

Thanks for your patch!

> I am aware the fixes tag is kind of irrelevant because that SHA will
> change in the next linux-next revision and this will probably get folded
> into the original patch anyways but still.
>
> The other solution would be to add forward declarations of these structs
> to the top of cacheflush.h, I just chose to do what Christoph did in the
> original patch. I am happy to do that instead if you all feel that is
> better.

That actually looks like a better solution to me, as it would address the
problem for all users.

>  drivers/media/platform/omap3isp/isp.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/platform/omap3isp/isp.c b/drivers/media/platform/omap3isp/isp.c
> index a4ee6b86663e..54106a768e54 100644
> --- a/drivers/media/platform/omap3isp/isp.c
> +++ b/drivers/media/platform/omap3isp/isp.c
> @@ -39,8 +39,6 @@
>   * Troy Laramy 
>   */
>
> -#include 
> -
>  #include 
>  #include 
>  #include 
> @@ -49,6 +47,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -58,6 +57,8 @@
>  #include 
>  #include 
>
> +#include 
> +
>  #ifdef CONFIG_ARM_DMA_USE_IOMMU
>  #include 
>  #endif

Why does this file need  at all?
It doesn't call any of the flush_*() functions, and seems to compile fine
without (on arm32).

Perhaps it was included at the top intentionally, to override the definitions
of copy_{to,from}_user_page()? Fortunately that doesn't seem to be the
case, from a quick look at the assembler output.

So let's just remove the #include instead?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v4 6/7] KVM: MIPS: clean up redundant 'kvm_run' parameters

2020-05-27 Thread Tianjia Zhang




On 2020/4/27 13:40, Huacai Chen wrote:

Reviewed-by: Huacai Chen 

On Mon, Apr 27, 2020 at 12:35 PM Tianjia Zhang
 wrote:


In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
  arch/mips/include/asm/kvm_host.h |  28 +---
  arch/mips/kvm/emulate.c  |  59 ++--
  arch/mips/kvm/mips.c |  11 ++-
  arch/mips/kvm/trap_emul.c| 114 ++-
  arch/mips/kvm/vz.c   |  26 +++
  5 files changed, 87 insertions(+), 151 deletions(-)
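
The shape of the change is mechanical; roughly like this (a sketch, not the
literal diff -- 'handle_exit' is a stand-in name for the affected functions):

/* Before: both pointers had to be threaded through every call chain. */
static int handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu);

/* After: 'run' is recovered from the vcpu, so the parameter goes away. */
static int handle_exit(struct kvm_vcpu *vcpu)
{
	struct kvm_run *run = vcpu->run;

	run->exit_reason = KVM_EXIT_UNKNOWN;
	return 0;
}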



Hi Huacai,

These two patches (6/7 and 7/7) should be merged separately through the
MIPS architecture tree. At present, there seems to be no good way to
merge the whole series through a single tree.


For this series, the patches for some architectures have already been
merged, while others still need an updated patch.


Thanks and best,
Tianjia


Re: [PATCH v3 1/3] riscv: Move kernel mapping to vmalloc zone

2020-05-27 Thread Zong Li
On Wed, May 27, 2020 at 1:06 AM Alex Ghiti  wrote:
>
> Hi Zong,
>
> On 5/26/20 at 5:43 AM, Zong Li wrote:
> > On Sun, May 24, 2020 at 4:54 PM Alexandre Ghiti  wrote:
> >> This is a preparatory patch for relocatable kernel.
> >>
> >> The kernel used to be linked at the PAGE_OFFSET address and loaded
> >> physically at the beginning of main memory. Therefore, we could use
> >> the linear mapping for the kernel mapping.
> >>
> >> But a relocated kernel base address will differ from PAGE_OFFSET, and
> >> since two different virtual addresses in the linear mapping cannot
> >> point to the same physical address, the kernel mapping needs to lie
> >> outside the linear mapping.
> >>
> >> In addition, because modules and BPF must be close to the kernel (inside
> >> a +-2GB window, the reach of PC-relative addressing), the kernel is
> >> placed at the end of the vmalloc zone minus 2GB, which leaves room for
> >> modules and BPF. The kernel could not be placed at the beginning of the
> >> vmalloc zone, since other vmalloc allocations from the kernel could
> >> consume the whole +-2GB window around the kernel, which would prevent
> >> new modules and BPF programs from being loaded.
> >>
> >> Signed-off-by: Alexandre Ghiti 
> >> ---
> >>   arch/riscv/boot/loader.lds.S |  3 +-
> >>   arch/riscv/include/asm/page.h| 10 +-
> >>   arch/riscv/include/asm/pgtable.h | 37 +---
> >>   arch/riscv/kernel/head.S |  3 +-
> >>   arch/riscv/kernel/module.c   |  4 +--
> >>   arch/riscv/kernel/vmlinux.lds.S  |  3 +-
> >>   arch/riscv/mm/init.c | 58 +---
> >>   arch/riscv/mm/physaddr.c |  2 +-
> >>   8 files changed, 87 insertions(+), 33 deletions(-)
> >>
> >> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> >> index 47a5003c2e28..62d94696a19c 100644
> >> --- a/arch/riscv/boot/loader.lds.S
> >> +++ b/arch/riscv/boot/loader.lds.S
> >> @@ -1,13 +1,14 @@
> >>   /* SPDX-License-Identifier: GPL-2.0 */
> >>
> >>   #include 
> >> +#include 
> >>
> >>   OUTPUT_ARCH(riscv)
> >>   ENTRY(_start)
> >>
> >>   SECTIONS
> >>   {
> >> -   . = PAGE_OFFSET;
> >> +   . = KERNEL_LINK_ADDR;
> >>
> >>  .payload : {
> >>  *(.payload)
> >> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >> index 2d50f76efe48..48bb09b6a9b7 100644
> >> --- a/arch/riscv/include/asm/page.h
> >> +++ b/arch/riscv/include/asm/page.h
> >> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
> >>
> >>   #ifdef CONFIG_MMU
> >>   extern unsigned long va_pa_offset;
> >> +extern unsigned long va_kernel_pa_offset;
> >>   extern unsigned long pfn_base;
> >>   #define ARCH_PFN_OFFSET(pfn_base)
> >>   #else
> >>   #define va_pa_offset   0
> >> +#define va_kernel_pa_offset0
> >>   #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
> >>   #endif /* CONFIG_MMU */
> >>
> >>   extern unsigned long max_low_pfn;
> >>   extern unsigned long min_low_pfn;
> >> +extern unsigned long kernel_virt_addr;
> >>
> >>   #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
> >> -#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
> >> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
> >> +#define kernel_mapping_va_to_pa(x) \
> >> +   ((unsigned long)(x) - va_kernel_pa_offset)
> >> +#define __va_to_pa_nodebug(x)  \
> >> +   (((x) >= PAGE_OFFSET) ? \
> >> +   linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
> >>
> >>   #ifdef CONFIG_DEBUG_VIRTUAL
> >>   extern phys_addr_t __virt_to_phys(unsigned long x);
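
In other words, __va_to_pa_nodebug() now selects one of two offsets
depending on which mapping the virtual address belongs to. Written out as
plain C (an illustrative sketch, not part of the patch):

/* Linear-mapping addresses start at PAGE_OFFSET; anything below is the
 * relocated kernel image near the top of the vmalloc zone. */
static inline unsigned long va_to_pa(unsigned long va)
{
	if (va >= PAGE_OFFSET)
		return va - va_pa_offset;	/* linear mapping */
	return va - va_kernel_pa_offset;	/* kernel mapping */
}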
> >> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >> index 35b60035b6b0..25213cfaf680 100644
> >> --- a/arch/riscv/include/asm/pgtable.h
> >> +++ b/arch/riscv/include/asm/pgtable.h
> >> @@ -11,23 +11,29 @@
> >>
> >>   #include 
> >>
> >> -#ifndef __ASSEMBLY__
> >> -
> >> -/* Page Upper Directory not used in RISC-V */
> >> -#include 
> >> -#include 
> >> -#include 
> >> -#include 
> >> -
> >> -#ifdef CONFIG_MMU
> >> +#ifndef CONFIG_MMU
> >> +#define KERNEL_VIRT_ADDR   PAGE_OFFSET
> >> +#define KERNEL_LINK_ADDR   PAGE_OFFSET
> >> +#else
> >> +/*
> >> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
> >> + * the kernel.
> >> + */
> >> +#define KERNEL_VIRT_ADDR   (VMALLOC_END - SZ_2G + 1)
> >> +#define KERNEL_LINK_ADDR   KERNEL_VIRT_ADDR
> >>
> >>   #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
> >>   #define VMALLOC_END  (PAGE_OFFSET - 1)
> >>   #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
> >>
> >>   #define BPF_JIT_REGION_SIZE(SZ_128M)
> >> -#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> >> -#define BPF_JIT_REGION_END (VMALLOC_END)
> >> +#define BPF_JIT_REGION_START   (kernel_virt_addr)
> >> +#define BPF_JIT_REGION_END (kernel_virt_addr + BPF_JIT_REGION_SIZE)
> > It seems to have a