Re: [PATCH 3/3] imx8mq.dtsi: add the sdma nodes
Hi Angus,

What is the status of this patch? Most likely this should go through Shawn's tree. I noticed that I have also sent a similar patch to Shawn: https://www.spinics.net/lists/arm-kernel/msg708424.html

So, let's coordinate and work better on this. I am now preparing another series in which I add the SAI nodes and enable the WM codec on imx8MQ. If you don't mind, I will pick the relevant changes from this patch, add them to my series, and then send them to Shawn.

thanks,
Daniel.

On Sun, Jan 20, 2019 at 4:05 AM Angus Ainslie (Purism) wrote:
>
> Add the sdma nodes to the base devicetree for the imx8mq
>
> Signed-off-by: Angus Ainslie (Purism)
> ---
>  arch/arm64/boot/dts/freescale/imx8mq.dtsi | 31 +++
>  1 file changed, 31 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/freescale/imx8mq.dtsi b/arch/arm64/boot/dts/freescale/imx8mq.dtsi
> index c0402375e7c1..0b9a9b5ae7b7 100644
> --- a/arch/arm64/boot/dts/freescale/imx8mq.dtsi
> +++ b/arch/arm64/boot/dts/freescale/imx8mq.dtsi
> @@ -336,6 +336,19 @@
>                         clocks = <&clk IMX8MQ_CLK_WDOG3_ROOT>;
>                         status = "disabled";
>                 };
> +
> +               sdma2: sdma@302c0000 {
> +                       compatible = "fsl,imx8mq-sdma", "fsl,imx7d-sdma";
> +                       reg = <0x302c0000 0x10000>;
> +                       interrupts = IRQ_TYPE_LEVEL_HIGH>;
> +                       clocks = <&clk IMX8MQ_CLK_SDMA2_ROOT>,
> +                                <&clk IMX8MQ_CLK_SDMA2_ROOT>;
> +                       clock-names = "ipg", "ahb";
> +                       #dma-cells = <3>;
> +                       fsl,sdma-ram-script-name = "imx/sdma/sdma-imx7d.bin";
> +                       fsl,ratio-1-1;
> +                       status = "disabled";
> +               };
>         };
>
>         bus@30400000 { /* AIPS2 */
> @@ -370,6 +383,8 @@
>                         clocks = <&clk IMX8MQ_CLK_UART3_ROOT>,
>                                  <&clk IMX8MQ_CLK_UART3_ROOT>;
>                         clock-names = "ipg", "per";
> +                       dmas = <&sdma1 26 4 0>, <&sdma1 27 4 0>;
> +                       dma-names = "rx", "tx";
>                         status = "disabled";
>                 };
>
> @@ -381,6 +396,8 @@
>                         clocks = <&clk IMX8MQ_CLK_UART2_ROOT>,
>                                  <&clk IMX8MQ_CLK_UART2_ROOT>;
>                         clock-names = "ipg", "per";
> +                       dmas = <&sdma1 24 4 0>, <&sdma1 25 4 0>;
> +                       dma-names = "rx", "tx";
>                         status = "disabled";
>                 };
>
> @@ -432,6 +449,8 @@
>                         clocks = <&clk IMX8MQ_CLK_UART4_ROOT>,
>                                  <&clk IMX8MQ_CLK_UART4_ROOT>;
>                         clock-names = "ipg", "per";
> +                       dmas = <&sdma1 28 4 0>, <&sdma1 29 4 0>;
> +                       dma-names = "rx", "tx";
>                         status = "disabled";
>                 };
>
> @@ -465,6 +484,18 @@
>                         status = "disabled";
>                 };
>
> +               sdma1: sdma@30bd0000 {
> +                       compatible = "fsl,imx8mq-sdma", "fsl,imx7d-sdma";
> +                       reg = <0x30bd0000 0x10000>;
> +                       interrupts = ;
> +                       clocks = <&clk IMX8MQ_CLK_SDMA1_ROOT>,
> +                                <&clk IMX8MQ_CLK_SDMA1_ROOT>;
> +                       clock-names = "ipg", "ahb";
> +                       #dma-cells = <3>;
> +                       fsl,sdma-ram-script-name = "imx/sdma/sdma-imx7d.bin";
> +                       status = "disabled";
> +               };
> +
>                 fec1: ethernet@30be0000 {
>                         compatible = "fsl,imx8mq-fec", "fsl,imx6sx-fec";
>                         reg = <0x30be0000 0x10000>;
> --
> 2.17.1
>
Re: [PATCH V3 2/2] dt-bindings: imx: add new resources to scu resource table
On Wed, 20 Feb 2019 14:35:06 +0000, Anson Huang wrote:

> Add new resources as below according to the latest system
> controller firmware for new features:
>
>     IMX_SC_R_PERF
>     IMX_SC_R_OCRAM
>     IMX_SC_R_DMA_5_CH0
>     IMX_SC_R_DMA_5_CH1
>     IMX_SC_R_DMA_5_CH2
>     IMX_SC_R_DMA_5_CH3
>     IMX_SC_R_ATTESTATION
>
> Signed-off-by: Anson Huang
> ---
> No changes since V2, just split the patch into 2: one patch to remove
> resources and the other to add new resources.
> ---
>  include/dt-bindings/firmware/imx/rsrc.h | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>

Reviewed-by: Rob Herring
Re: [PATCH v3 0/8] nic: thunderx: fix communication races between VF & PF
From: Vadim Lomovtsev
Date: Wed, 20 Feb 2019 11:02:42 +0000

> The ThunderX CN88XX NIC Virtual Function driver uses a mailbox interface
> to communicate with the physical function driver. Each VF has its own pair
> of mailbox registers to read from and write to. The mailbox registers
> have no protection from possible races, so it has to be implemented
> on the software side.
>
> After long-term testing with a loop of 'ip link set up/down'
> commands, it was found that there are two possible scenarios when
> a race condition appears:
>  1. VF receives a link change message from PF while VF is sending an RX mode
>     configuration message to PF at the same time from a separate thread.
>  2. PF receives an RX mode configuration from VF while, at the same time,
>     in a separate thread, PF detects a link status change and sends the
>     appropriate message to the particular VF.
>
> Both cases lead to the mailbox data being rewritten, the NIC VF messaging
> control data being updated incorrectly, and the communication sequence
> getting broken.
>
> This patch series addresses the race condition in VF & PF communication.
>
> Changes:
> v1 -> v2
>  - 0000: correct typo in cover letter subject: 'betwen' -> 'between';
>  - move link state polling request task from PF to VF
>    instead of checking status of mailbox irq;
> v2 -> v3
>  - 0003: change return type of nicvf_send_cfg_done() function
>    from int to void;
>  - 0007: update subject and remove unused variable 'netdev'
>    from nicvf_link_status_check_task() function;

Series applied, thanks.
Re: [PATCH 6/6] perf thread-stack: Hide x86 retpolines
Em Wed, Jan 09, 2019 at 11:18:35AM +0200, Adrian Hunter escreveu:

> x86 retpoline functions pollute the call graph by showing up everywhere
> there is an indirect branch, but they do not really mean anything. Make
> changes so that the default retpoline functions will no longer appear in
> the call graph. Note this only affects the call graph, since all the
> original branches are left unchanged.
>
> This does not handle function return thunks, nor is there any improvement
> for the handling of inline thunks or extern thunks.
>
> Example:
>
> $ cat simple-retpoline.c
> __attribute__((noinline)) int bar(void)
> {
>         return -1;
> }
>
> int foo(void)
> {
>         return bar() + 1;
> }
>
> __attribute__((indirect_branch("thunk"))) int main()
> {
>         int (*volatile fn)(void) = foo;
>
>         fn();
>         return fn();
> }
> $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
> $ objdump -d simple-retpoline
>
> 1040 <main>:
>  1040:  48 83 ec 18           sub    $0x18,%rsp
>  1044:  48 8d 05 25 01 00 00  lea    0x125(%rip),%rax   # 1170 <foo>
>  104b:  48 89 44 24 08        mov    %rax,0x8(%rsp)
>  1050:  48 8b 44 24 08        mov    0x8(%rsp),%rax
>  1055:  e8 1f 01 00 00        callq  1179 <__x86_indirect_thunk_rax>
>  105a:  48 8b 44 24 08        mov    0x8(%rsp),%rax
>  105f:  48 83 c4 18           add    $0x18,%rsp
>  1063:  e9 11 01 00 00        jmpq   1179 <__x86_indirect_thunk_rax>
>
> 1160 <bar>:
>  1160:  b8 ff ff ff ff        mov    $0xffffffff,%eax
>  1165:  c3                    retq
>
> 1170 <foo>:
>  1170:  e8 eb ff ff ff        callq  1160 <bar>
>  1175:  83 c0 01              add    $0x1,%eax
>  1178:  c3                    retq
>
> 1179 <__x86_indirect_thunk_rax>:
>  1179:  e8 07 00 00 00        callq  1185 <__x86_indirect_thunk_rax+0xc>
>  117e:  f3 90                 pause
>  1180:  0f ae e8              lfence
>  1183:  eb f9                 jmp    117e <__x86_indirect_thunk_rax+0x5>
>  1185:  48 89 04 24           mov    %rax,(%rsp)
>  1189:  c3                    retq
>
> $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
> $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
> 2019-01-08 14:03:37.851655 Creating database...
> 2019-01-08 14:03:37.863256 Writing records...
> 2019-01-08 14:03:38.069750 Adding indexes
> 2019-01-08 14:03:38.078799 Done
> $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db
>
> Before:
>
>     main
>         -> __x86_indirect_thunk_rax
>             -> __x86_indirect_thunk_rax
>                 -> foo
>                     -> bar
>
> After:
>
>     main
>         -> foo
>             -> bar
>
> Signed-off-by: Adrian Hunter
> ---
>  tools/perf/util/thread-stack.c | 112 -
>  1 file changed, 109 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
> index 632c07a125ab..805e30836460 100644
> --- a/tools/perf/util/thread-stack.c
> +++ b/tools/perf/util/thread-stack.c
> @@ -20,6 +20,7 @@
>  #include "thread.h"
>  #include "event.h"
>  #include "machine.h"
> +#include "env.h"
>  #include "util.h"
>  #include "debug.h"
>  #include "symbol.h"
> @@ -29,6 +30,19 @@
>
>  #define STACK_GROWTH 2048
>
> +/*
> + * State of retpoline detection.
> + *
> + * RETPOLINE_NONE: no retpoline detection
> + * X86_RETPOLINE_POSSIBLE: x86 retpoline possible
> + * X86_RETPOLINE_DETECTED: x86 retpoline detected
> + */
> +enum retpoline_state_t {
> +	RETPOLINE_NONE,
> +	X86_RETPOLINE_POSSIBLE,
> +	X86_RETPOLINE_DETECTED,
> +};
> +
>  /**
>   * struct thread_stack_entry - thread stack entry.
>   * @ret_addr: return address
> @@ -64,6 +78,7 @@ struct thread_stack_entry {
>   * @crp: call/return processor
>   * @comm: current comm
>   * @arr_sz: size of array if this is the first element of an array
> + * @rstate: used to detect retpolines
>   */
>  struct thread_stack {
>  	struct thread_stack_entry *stack;
> @@ -76,6 +91,7 @@ struct thread_stack {
>  	struct call_return_processor *crp;
>  	struct comm *comm;
>  	unsigned int arr_sz;
> +	enum retpoline_state_t rstate;
>  };
>
>  /*
> @@ -115,10 +131,16 @@ static int thread_stack__init(struct thread_stack *ts, struct thread *thread,
>  	if (err)
>  		return err;
>
> -	if (thread->mg &&
[v6 PATCH 3/6] RISC-V: Remove NR_CPUs check during hartid search from DT
In a non-SMP configuration, the hartid can be higher than NR_CPUS, so riscv_of_processor_hartid() should not compare the hartid with NR_CPUS in that case. Moreover, this function checks all the DT properties of a hart node, so the NR_CPUS comparison seems out of place.

Signed-off-by: Atish Patra
Reviewed-by: Christoph Hellwig
Reviewed-by: Anup Patel
---
 arch/riscv/kernel/cpu.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index d1d9bfd5..cf2fca12 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -34,10 +34,6 @@ int riscv_of_processor_hartid(struct device_node *node)
 		pr_warn("Found CPU without hart ID\n");
 		return -ENODEV;
 	}
-	if (hart >= NR_CPUS) {
-		pr_info("Found hart ID %d, which is above NR_CPUs. Disabling this hart\n", hart);
-		return -ENODEV;
-	}
 	if (!of_device_is_available(node)) {
 		pr_info("CPU with hartid=%d is not available\n", hart);
--
2.7.4
[v6 PATCH 0/6] Various SMP related fixes
The existing upstream kernel doesn't boot for a non-SMP configuration. This patch series addresses various issues with non-SMP configurations. The series is based on 5.0-rc7 plus Johan's patch series mentioned below. Tested on both QEMU and the HiFive Unleashed board using both OpenSBI & BBL.

https://lore.kernel.org/lkml/20190118140308.9599-1-jo...@kernel.org/

Changes from v5 -> v6
1. Dropped driver patches as they have been accepted via the drivers tree.

Changes from v4 -> v5
1. Continue processing other harts even if the isa string is incorrect for a single hart during HWCAP processing.

Changes from v3 -> v4
1. Fixed commit text length issues.
2. Updated hwcap patch to use common capabilities of all harts.
3. Rebased on Johan's patch series.

Changes from v2 -> v3
1. Fixed spurious white space.
2. Added lockdep for the smpboot completion variable.
3. Added a sanity check for hwcap.

Changes from v1 -> v2
1. Moved the cpuid to hartid map to smp.c from setup.c.
2. Split the 3rd patch into several small patches based on logical grouping.
3. Added a new patch that fixes an issue in the hwcap query.
4. Changed the title of the patch series.

Atish Patra (6):
  RISC-V: Do not wait indefinitely in __cpu_up
  RISC-V: Move cpuid to hartid mapping to SMP.
  RISC-V: Remove NR_CPUs check during hartid search from DT
  RISC-V: Allow hartid-to-cpuid function to fail.
  RISC-V: Compare cpuid with NR_CPUS before mapping.
  RISC-V: Assign hwcap as per common capabilities.

 arch/riscv/include/asm/smp.h   | 18 +-
 arch/riscv/kernel/cpu.c        |  4 ----
 arch/riscv/kernel/cpufeature.c | 41 ++---
 arch/riscv/kernel/setup.c      |  9 -
 arch/riscv/kernel/smp.c        | 10 +-
 arch/riscv/kernel/smpboot.c    | 20 +---
 6 files changed, 61 insertions(+), 41 deletions(-)

--
2.7.4
[v6 PATCH 4/6] RISC-V: Allow hartid-to-cpuid function to fail.
It is perfectly okay to call riscv_hartid_to_cpuid() for a hartid that is not mapped to a CPU id. This can happen if the calling function retrieves the hartid from DT but that hart was never brought online by the firmware or kernel for whatever reason. There is no need to BUG() in that case. A negative error return is sufficient, and the calling function should always check the return value.

Signed-off-by: Atish Patra
Reviewed-by: Anup Patel
Reviewed-by: Christoph Hellwig
---
 arch/riscv/kernel/smp.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
index b69883c6..ca99f0fb 100644
--- a/arch/riscv/kernel/smp.c
+++ b/arch/riscv/kernel/smp.c
@@ -60,7 +60,6 @@ int riscv_hartid_to_cpuid(int hartid)
 		return i;

 	pr_err("Couldn't find cpu id for hartid [%d]\n", hartid);
-	BUG();
 	return i;
 }
--
2.7.4
[v6 PATCH 2/6] RISC-V: Move cpuid to hartid mapping to SMP.
Currently, the logical CPU id to physical hartid mapping is defined for both SMP and non-SMP configurations. This is not required, as we need it only for SMP configurations. For the non-SMP use case, the mapping function can directly return boot_cpu_hartid. The reverse mapping function, i.e. hartid to cpuid, can be called for any valid but not booted hart, so it should return the default cpu 0 only for the boot hartid.

Signed-off-by: Atish Patra
Reviewed-by: Anup Patel
Reviewed-by: Christoph Hellwig
---
 arch/riscv/include/asm/smp.h | 18 +-
 arch/riscv/kernel/setup.c    |  9 -
 arch/riscv/kernel/smp.c      |  9 +
 3 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/include/asm/smp.h b/arch/riscv/include/asm/smp.h
index 41aa73b4..636a934f 100644
--- a/arch/riscv/include/asm/smp.h
+++ b/arch/riscv/include/asm/smp.h
@@ -19,16 +19,17 @@
 #include

 #define INVALID_HARTID ULONG_MAX
+
+struct seq_file;
+extern unsigned long boot_cpu_hartid;
+
+#ifdef CONFIG_SMP
 /*
  * Mapping between linux logical cpu index and hartid.
  */
 extern unsigned long __cpuid_to_hartid_map[NR_CPUS];
 #define cpuid_to_hartid_map(cpu)    __cpuid_to_hartid_map[cpu]

-struct seq_file;
-
-#ifdef CONFIG_SMP
-
 /* print IPI stats */
 void show_ipi_stats(struct seq_file *p, int prec);

@@ -58,7 +59,14 @@ static inline void show_ipi_stats(struct seq_file *p, int prec)

 static inline int riscv_hartid_to_cpuid(int hartid)
 {
-	return 0;
+	if (hartid == boot_cpu_hartid)
+		return 0;
+
+	return -1;
+}
+static inline unsigned long cpuid_to_hartid_map(int cpu)
+{
+	return boot_cpu_hartid;
 }

 static inline void riscv_cpuid_to_hartid_mask(const struct cpumask *in,
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index fb09e013..61c81616 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -61,15 +61,6 @@ EXPORT_SYMBOL(empty_zero_page);

 atomic_t hart_lottery;
 unsigned long boot_cpu_hartid;
-unsigned long __cpuid_to_hartid_map[NR_CPUS] = {
-	[0 ... NR_CPUS-1] = INVALID_HARTID
-};
-
-void __init smp_setup_processor_id(void)
-{
-	cpuid_to_hartid_map(0) = boot_cpu_hartid;
-}
-
 #ifdef CONFIG_BLK_DEV_INITRD
 static void __init setup_initrd(void)
 {
diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
index 246635ea..b69883c6 100644
--- a/arch/riscv/kernel/smp.c
+++ b/arch/riscv/kernel/smp.c
@@ -36,6 +36,15 @@ enum ipi_message_type {
 	IPI_MAX
 };

+unsigned long __cpuid_to_hartid_map[NR_CPUS] = {
+	[0 ... NR_CPUS-1] = INVALID_HARTID
+};
+
+void __init smp_setup_processor_id(void)
+{
+	cpuid_to_hartid_map(0) = boot_cpu_hartid;
+}
+
 /* A collection of single bit ipi messages. */
 static struct {
 	unsigned long stats[IPI_MAX] ____cacheline_aligned;
--
2.7.4
[v6 PATCH 1/6] RISC-V: Do not wait indefinitely in __cpu_up
In the SMP path, __cpu_up() waits indefinitely for the other CPU to come online. This is wrong, as the other CPU might be disabled in machine mode and the possible CPU mask is set from the cpus present in DT. Introduce a completion variable and wait for only one second.

Signed-off-by: Atish Patra
Reviewed-by: Anup Patel
Reviewed-by: Christoph Hellwig
---
 arch/riscv/kernel/smpboot.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index 6e281325..d369b669 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -39,6 +39,7 @@

 void *__cpu_up_stack_pointer[NR_CPUS];
 void *__cpu_up_task_pointer[NR_CPUS];
+static DECLARE_COMPLETION(cpu_running);

 void __init smp_prepare_boot_cpu(void)
 {
@@ -77,6 +78,7 @@ void __init setup_smp(void)

 int __cpu_up(unsigned int cpu, struct task_struct *tidle)
 {
+	int ret = 0;
 	int hartid = cpuid_to_hartid_map(cpu);
 	tidle->thread_info.cpu = cpu;
@@ -92,10 +94,16 @@ int __cpu_up(unsigned int cpu, struct task_struct *tidle)
 			  task_stack_page(tidle) + THREAD_SIZE);
 	WRITE_ONCE(__cpu_up_task_pointer[hartid], tidle);

-	while (!cpu_online(cpu))
-		cpu_relax();
+	lockdep_assert_held(&cpu_running);
+	wait_for_completion_timeout(&cpu_running,
+					   msecs_to_jiffies(1000));

-	return 0;
+	if (!cpu_online(cpu)) {
+		pr_crit("CPU%u: failed to come online\n", cpu);
+		ret = -EIO;
+	}
+
+	return ret;
 }

 void __init smp_cpus_done(unsigned int max_cpus)
@@ -121,6 +129,7 @@ asmlinkage void __init smp_callin(void)
 	 * a local TLB flush right now just in case.
 	 */
 	local_flush_tlb_all();
+	complete(&cpu_running);
 	/*
 	 * Disable preemption before enabling interrupts, so we don't try to
 	 * schedule a CPU that hasn't actually started yet.
--
2.7.4
[v6 PATCH 5/6] RISC-V: Compare cpuid with NR_CPUS before mapping.
We should never have a cpuid greater than NR_CPUS. Compare with NR_CPUS before creating the mapping between logical and physical CPU ids. This is also mandatory, as the NR_CPUS check has been removed from riscv_of_processor_hartid().

Signed-off-by: Atish Patra
Reviewed-by: Anup Patel
Reviewed-by: Christoph Hellwig
---
 arch/riscv/kernel/smpboot.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index d369b669..eb533b5c 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -66,6 +66,11 @@ void __init setup_smp(void)
 			found_boot_cpu = 1;
 			continue;
 		}
+		if (cpuid >= NR_CPUS) {
+			pr_warn("Invalid cpuid [%d] for hartid [%d]\n",
+				cpuid, hart);
+			break;
+		}

 		cpuid_to_hartid_map(cpuid) = hart;
 		set_cpu_possible(cpuid, true);
--
2.7.4
[v6 PATCH 6/6] RISC-V: Assign hwcap as per common capabilities.
Currently, we set hwcap based on the first valid hart from DT. This may not always be correct, as that hart might not be the current booting cpu or may have different capabilities. Set hwcap to the capabilities supported by all possible harts with "okay" status.

Signed-off-by: Atish Patra
---
 arch/riscv/kernel/cpufeature.c | 41 ++++++++++++++++---------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index e7a4701f..bc29b010 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include

 unsigned long elf_hwcap __read_mostly;
 #ifdef CONFIG_FPU
@@ -42,28 +43,30 @@ void riscv_fill_hwcap(void)

 	elf_hwcap = 0;

-	/*
-	 * We don't support running Linux on hertergenous ISA systems. For
-	 * now, we just check the ISA of the first "okay" processor.
-	 */
 	for_each_of_cpu_node(node) {
-		if (riscv_of_processor_hartid(node) >= 0)
-			break;
-	}
-	if (!node) {
-		pr_warn("Unable to find \"cpu\" devicetree entry\n");
-		return;
-	}
+		unsigned long this_hwcap = 0;

-	if (of_property_read_string(node, "riscv,isa", &isa)) {
-		pr_warn("Unable to find \"riscv,isa\" devicetree entry\n");
-		of_node_put(node);
-		return;
-	}
-	of_node_put(node);
+		if (riscv_of_processor_hartid(node) < 0)
+			continue;

-	for (i = 0; i < strlen(isa); ++i)
-		elf_hwcap |= isa2hwcap[(unsigned char)(isa[i])];
+		if (of_property_read_string(node, "riscv,isa", &isa)) {
+			pr_warn("Unable to find \"riscv,isa\" devicetree entry\n");
+			continue;
+		}
+
+		for (i = 0; i < strlen(isa); ++i)
+			this_hwcap |= isa2hwcap[(unsigned char)(isa[i])];
+
+		/*
+		 * All "okay" harts should have the same isa. Set HWCAP based on
+		 * the common capabilities of every "okay" hart, in case they
+		 * don't.
+		 */
+		if (elf_hwcap)
+			elf_hwcap &= this_hwcap;
+		else
+			elf_hwcap = this_hwcap;
+	}

 	/* We don't support systems with F but without D, so mask those out
 	 * here. */
--
2.7.4
Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
On Fri, 22 Feb 2019 11:34:58 -0800
Alexei Starovoitov wrote:

> so you're saying you will break existing kprobe scripts?

Yes, we may.

> I don't think it's a good idea.
> It's not acceptable to break the bpf_probe_read uapi.

Then you may need to add more code to determine whether the address is in user space or in the kernel, and take the appropriate route, before calling probe_kernel_read().

-- Steve
Re: [PATCH] r8152: Fix an error on RTL8153-BD MAC Address Passthrough support
From: David Chen
Date: Wed, 20 Feb 2019 13:47:19 +0800

> From: David Chen
>
> RTL8153-BD is used in the Dell DA300 type-C dongle.
> Add RTL8153-BD support to activate MAC address pass-through on the DA300.
> This applies a correction to a previously submitted patch in the net.git tree.
>
> Signed-off-by: David Chen

Applied, thanks.
[PATCH v6 7/7] ARM: tegra: Add firmware calls required for suspend-resume on Tegra30
In order to suspend-resume the CPU with the Trusted Foundations firmware present on Tegra30, the LP1/LP2 boot vectors and CPU caches need to be set up using firmware calls, and the suspend code shall then avoid re-disabling the parts that were disabled by the firmware.

Signed-off-by: Dmitry Osipenko
---
 arch/arm/mach-tegra/pm.c            | 49 +
 arch/arm/mach-tegra/reset-handler.S | 26 +++
 arch/arm/mach-tegra/sleep.S         | 14 ++---
 3 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/arch/arm/mach-tegra/pm.c b/arch/arm/mach-tegra/pm.c
index 1ad5719779b0..f209f59e0daf 100644
--- a/arch/arm/mach-tegra/pm.c
+++ b/arch/arm/mach-tegra/pm.c
@@ -33,11 +33,13 @@

 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
+#include

 #include "iomap.h"
 #include "pm.h"
@@ -159,6 +161,28 @@ int tegra_cpu_do_idle(void)

 static int tegra_sleep_cpu(unsigned long v2p)
 {
+	/*
+	 * L2 cache disabling using the kernel API is only allowed when all
+	 * secondary CPUs are offline. The cache has to be disabled with
+	 * MMU-on if cache maintenance is done via the Trusted Foundations
+	 * firmware. Note that CPUIDLE won't ever enter powergate on Tegra30
+	 * if any of the secondary CPUs is online, and this is the LP2-idle
+	 * code-path only for Tegra20/30.
+	 */
+	if (trusted_foundations_registered())
+		outer_disable();
+
+	/*
+	 * Note that besides setting up the CPU reset vector this firmware
+	 * call may also do the following, depending on the FW version:
+	 *  1) Disable L2. But this doesn't matter since we already
+	 *     disabled the L2.
+	 *  2) Disable D-cache. This needs to be taken into account in
+	 *     particular by tegra_disable_clean_inv_dcache(), which
+	 *     shall avoid the re-disable.
+	 */
+	call_firmware_op(prepare_idle, TF_PM_MODE_LP2);
+
 	setup_mm_for_reboot();
 	tegra_sleep_cpu_finish(v2p);

@@ -197,6 +221,14 @@ void tegra_idle_lp2_last(void)

 	cpu_suspend(PHYS_OFFSET - PAGE_OFFSET, &tegra_sleep_cpu);

+	/*
+	 * Resume L2 cache if it wasn't re-enabled early during resume,
+	 * which is the case for Tegra30 that has to re-enable the cache
+	 * via firmware call. In other cases the cache is already enabled and
+	 * hence re-enabling is a no-op. This is always a no-op on Tegra114+.
+	 */
+	outer_resume();
+
 	restore_cpu_complex();
 	cpu_cluster_pm_exit();
 }
@@ -215,6 +247,15 @@ enum tegra_suspend_mode tegra_pm_validate_suspend_mode(

 static int tegra_sleep_core(unsigned long v2p)
 {
+	/*
+	 * The cache has to be disabled with MMU-on if cache maintenance is
+	 * done via the Trusted Foundations firmware. This is a no-op on
+	 * Tegra114+.
+	 */
+	if (trusted_foundations_registered())
+		outer_disable();
+
+	call_firmware_op(prepare_idle, TF_PM_MODE_LP1);
+
 	setup_mm_for_reboot();
 	tegra_sleep_core_finish(v2p);

@@ -342,6 +383,14 @@ static int tegra_suspend_enter(suspend_state_t state)

 	cpu_suspend(PHYS_OFFSET - PAGE_OFFSET, tegra_sleep_func);

+	/*
+	 * Resume L2 cache if it wasn't re-enabled early during resume,
+	 * which is the case for Tegra30 that has to re-enable the cache
+	 * via firmware call. In other cases the cache is already enabled and
+	 * hence re-enabling is a no-op.
+	 */
+	outer_resume();
+
 	switch (mode) {
 	case TEGRA_SUSPEND_LP1:
 		tegra_suspend_exit_lp1();
diff --git a/arch/arm/mach-tegra/reset-handler.S b/arch/arm/mach-tegra/reset-handler.S
index 3bf202819534..19a609046547 100644
--- a/arch/arm/mach-tegra/reset-handler.S
+++ b/arch/arm/mach-tegra/reset-handler.S
@@ -20,6 +20,7 @@

 #include
 #include
+#include

 #include
 #include
@@ -76,6 +77,7 @@ ENTRY(tegra_resume)
 	orr	r1, r1, #1
 	str	r1, [r0]
 #endif
+	bl	tegra_resume_trusted_foundations

 #ifdef CONFIG_CACHE_L2X0
 	/* L2 cache resume & re-enable */
@@ -88,6 +90,30 @@ end_ca9_scu_l2_resume:
 	b	cpu_resume
 ENDPROC(tegra_resume)
+
+/*
+ * tegra_resume_trusted_foundations
+ *
+ * Trusted Foundations firmware initialization.
+ *
+ * Doesn't return if firmware is present.
+ * Corrupted registers: r1, r2
+ */
+ENTRY(tegra_resume_trusted_foundations)
+	/* Check whether the Trusted Foundations firmware is present. */
+	mov32	r2, TEGRA_IRAM_BASE + TEGRA_IRAM_RESET_HANDLER_OFFSET
+	ldr	r1, =__tegra_cpu_reset_handler_data_offset + \
+			RESET_DATA(TF_PRESENT)
+	ldr	r1, [r2, r1]
+	cmp	r1, #0
+	reteq	lr
+
+ .arch_extension sec
+	/* First call after suspend wakes firmware. No arguments required. */
+	smc	#0
+
+	b
Re: [PATCH] rpmsg: virtio_rpmsg_bus: acknowledge the received creation message
Thanks for the review.

On Fri, Feb 22, 2019 at 6:44 PM Arnaud Pouliquen wrote:
>
> Hello Xiang,
>
> On 2/12/19 8:13 AM, Xiang Xiao wrote:
> > From: QianWenfa
> >
> > the two-phase handshake lets the client initiate the transfer
> > immediately, without the server side having to send a dummy message first.
>
> As discussed on the OpenAMP mailing list, the first point (from my pov) is
> to figure out whether this should be part of the rpmsg protocol or be service
> dependent (so managed by the rpmsg client itself)...

Here is my thought:
1. The ack message doesn't have any side effect, except that the kernel sends back the local address through the NS channel directly.
2. But without this message, the remote side can't send a message first due to the lack of the kernel address information.

The second limitation is very strange for all developers, I think.

> > Signed-off-by: Wenfa Qian
> > Signed-off-by: Xiang Xiao
> > ---
> >  drivers/rpmsg/virtio_rpmsg_bus.c | 25 -
> >  1 file changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
> > index 664f957..e323c98 100644
> > --- a/drivers/rpmsg/virtio_rpmsg_bus.c
> > +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
> > @@ -71,6 +71,7 @@ struct virtproc_info {
> >
> >  /* The feature bitmap for virtio rpmsg */
> >  #define VIRTIO_RPMSG_F_NS	0 /* RP supports name service notifications */
> > +#define VIRTIO_RPMSG_F_ACK	1 /* RP supports name service acknowledge */
> >
> >  /**
> >   * struct rpmsg_hdr - common header for all rpmsg messages
> > @@ -115,10 +116,12 @@ struct rpmsg_ns_msg {
> >   *
> >   * @RPMSG_NS_CREATE: a new remote service was just created
> >   * @RPMSG_NS_DESTROY: a known remote service was just destroyed
> > + * @RPMSG_NS_ACK: acknowledge the previous creation message
> >   */
> >  enum rpmsg_ns_flags {
> >  	RPMSG_NS_CREATE		= 0,
> >  	RPMSG_NS_DESTROY	= 1,
> > +	RPMSG_NS_ACK		= 2,
> >  };
> >
> >  /**
> > @@ -330,13 +333,14 @@ static int virtio_rpmsg_announce_create(struct rpmsg_device *rpdev)
> >  	int err = 0;
> >
> >  	/* need to tell remote processor's name service about this channel ? */
> > -	if (rpdev->announce && rpdev->ept &&
> > +	if (rpdev->ept && (rpdev->announce ||
> > +	    virtio_has_feature(vrp->vdev, VIRTIO_RPMSG_F_ACK)) &&
> >  	    virtio_has_feature(vrp->vdev, VIRTIO_RPMSG_F_NS)) {
>
> Regarding this condition I have a concern. If rpdev->announce is false
> but the VIRTIO_RPMSG_F_ACK feature is set, this will generate an ack message.
> Seems that this condition can occur depending on the rpmsg_device
> struct provided on registration, when rpmsg_dev_probe is called.

The ack message can't be sent before the device probe succeeds; otherwise the early ack makes the remote side consider the kernel driver ready and send a message immediately, but nobody can actually receive that message.

> >  		struct rpmsg_ns_msg nsm;
> >
> >  		strncpy(nsm.name, rpdev->id.name, RPMSG_NAME_SIZE);
> >  		nsm.addr = rpdev->ept->addr;
> > -		nsm.flags = RPMSG_NS_CREATE;
> > +		nsm.flags = rpdev->announce ? RPMSG_NS_CREATE : RPMSG_NS_ACK;
> >
> >  		err = rpmsg_sendto(rpdev->ept, &nsm, sizeof(nsm), RPMSG_NS_ADDR);
> >  		if (err)
> > @@ -820,6 +824,7 @@ static int rpmsg_ns_cb(struct rpmsg_device *rpdev, void *data, int len,
> >  	struct rpmsg_channel_info chinfo;
> >  	struct virtproc_info *vrp = priv;
> >  	struct device *dev = &vrp->vdev->dev;
> > +	struct device *tmp;
> >  	int ret;
> >
> >  #if defined(CONFIG_DYNAMIC_DEBUG)
> > @@ -847,21 +852,30 @@ static int rpmsg_ns_cb(struct rpmsg_device *rpdev, void *data, int len,
> >  	msg->name[RPMSG_NAME_SIZE - 1] = '\0';
> >
> >  	dev_info(dev, "%sing channel %s addr 0x%x\n",
> > -		 msg->flags & RPMSG_NS_DESTROY ? "destroy" : "creat",
> > +		 msg->flags == RPMSG_NS_ACK ? "ack" :
> > +		 msg->flags == RPMSG_NS_DESTROY ? "destroy" : "creat",
> >  		 msg->name, msg->addr);
> >
> >  	strncpy(chinfo.name, msg->name, sizeof(chinfo.name));
> >  	chinfo.src = RPMSG_ADDR_ANY;
> >  	chinfo.dst = msg->addr;
> >
> > -	if (msg->flags & RPMSG_NS_DESTROY) {
> > +	if (msg->flags == RPMSG_NS_DESTROY) {
> >  		ret = rpmsg_unregister_device(&vrp->vdev->dev, &chinfo);
> >  		if (ret)
> >  			dev_err(dev, "rpmsg_destroy_channel failed: %d\n", ret);
> > -	} else {
> > +	} else if (msg->flags == RPMSG_NS_CREATE) {
> >  		newch = rpmsg_create_channel(vrp, &chinfo);
> >  		if (!newch)
> >  			dev_err(dev, "rpmsg_create_channel failed\n");
> > +	} else if (msg->flags == RPMSG_NS_ACK) {
> > +		chinfo.dst = RPMSG_ADDR_ANY;
> > +		tmp = rpmsg_find_device(&vrp->vdev->dev, &chinfo);
>
> nit-picking... replace &vrp->vdev->dev by dev and change
[PATCH v6 6/7] ARM: tegra: Always boot CPU in ARM-mode
The CPU always jumps into the reset handler in ARM mode from the Trusted Foundations firmware, hence let's make the CPU always jump into the kernel in ARM mode regardless of the firmware's presence. This is required to make a Thumb-2 kernel work with the Trusted Foundations firmware on Tegra30.

Signed-off-by: Dmitry Osipenko
---
 arch/arm/mach-tegra/reset-handler.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/mach-tegra/reset-handler.S b/arch/arm/mach-tegra/reset-handler.S
index 6bea95d165fa..3bf202819534 100644
--- a/arch/arm/mach-tegra/reset-handler.S
+++ b/arch/arm/mach-tegra/reset-handler.S
@@ -113,6 +113,7 @@ ENTRY(__tegra_cpu_reset_handler_start)
  * must be position-independent.
  */

+	.arm
 	.align L1_CACHE_SHIFT
 ENTRY(__tegra_cpu_reset_handler)
--
2.20.1
[PATCH v6 5/7] ARM: tegra: Don't apply CPU erratas in insecure mode
CPU isn't allowed to touch secure registers while running under secure monitor. Hence skip applying of CPU erratas in the reset handler if Trusted Foundations firmware presents. Partially based on work done by Michał Mirosław [1]. [1] https://www.spinics.net/lists/arm-kernel/msg594768.html Signed-off-by: Dmitry Osipenko --- arch/arm/mach-tegra/reset-handler.S | 23 --- arch/arm/mach-tegra/reset.c | 3 +++ arch/arm/mach-tegra/reset.h | 9 +++-- arch/arm/mach-tegra/sleep-tegra20.S | 4 4 files changed, 26 insertions(+), 13 deletions(-) diff --git a/arch/arm/mach-tegra/reset-handler.S b/arch/arm/mach-tegra/reset-handler.S index 805f306fa6f7..6bea95d165fa 100644 --- a/arch/arm/mach-tegra/reset-handler.S +++ b/arch/arm/mach-tegra/reset-handler.S @@ -29,8 +29,6 @@ #define PMC_SCRATCH41 0x140 -#define RESET_DATA(x) ((TEGRA_RESET_##x)*4) - #ifdef CONFIG_PM_SLEEP /* * tegra_resume @@ -121,6 +119,12 @@ ENTRY(__tegra_cpu_reset_handler) cpsid aif, 0x13 @ SVC mode, interrupts disabled tegra_get_soc_id TEGRA_APB_MISC_BASE, r6 + + adr r12, __tegra_cpu_reset_handler_data + ldr r5, [r12, #RESET_DATA(TF_PRESENT)] + cmp r5, #0 + bne after_errata + #ifdef CONFIG_ARCH_TEGRA_2x_SOC t20_check: cmp r6, #TEGRA20 @@ -155,7 +159,6 @@ after_errata: and r10, r10, #0x3 @ R10 = CPU number mov r11, #1 mov r11, r11, lsl r10 @ R11 = CPU mask - adr r12, __tegra_cpu_reset_handler_data #ifdef CONFIG_SMP /* Does the OS know about this CPU? */ @@ -169,10 +172,9 @@ after_errata: cmp r6, #TEGRA20 bne 1f /* If not CPU0, don't let CPU0 reset CPU1 now that CPU1 is coming up. 
*/ - mov32 r5, TEGRA_IRAM_BASE + TEGRA_IRAM_RESET_HANDLER_OFFSET mov r0, #CPU_NOT_RESETTABLE cmp r10, #0 - strneb r0, [r5, #__tegra20_cpu1_resettable_status_offset] + strneb r0, [r12, #RESET_DATA(RESETTABLE_STATUS)] 1: #endif @@ -277,14 +279,13 @@ ENDPROC(__tegra_cpu_reset_handler) .align L1_CACHE_SHIFT .type __tegra_cpu_reset_handler_data, %object .globl __tegra_cpu_reset_handler_data + .globl __tegra_cpu_reset_handler_data_offset + .equ __tegra_cpu_reset_handler_data_offset, \ + . - __tegra_cpu_reset_handler_start __tegra_cpu_reset_handler_data: - .rept TEGRA_RESET_DATA_SIZE - .long 0 + .rept TEGRA_RESET_DATA_SIZE + .long 0 .endr - .globl __tegra20_cpu1_resettable_status_offset - .equ __tegra20_cpu1_resettable_status_offset, \ - . - __tegra_cpu_reset_handler_start - .byte 0 .align L1_CACHE_SHIFT ENTRY(__tegra_cpu_reset_handler_end) diff --git a/arch/arm/mach-tegra/reset.c b/arch/arm/mach-tegra/reset.c index dc558892753c..b02ae7699842 100644 --- a/arch/arm/mach-tegra/reset.c +++ b/arch/arm/mach-tegra/reset.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "iomap.h" #include "irammap.h" @@ -89,6 +90,8 @@ static void __init tegra_cpu_reset_handler_enable(void) void __init tegra_cpu_reset_handler_init(void) { + __tegra_cpu_reset_handler_data[TEGRA_RESET_TF_PRESENT] = + trusted_foundations_registered(); #ifdef CONFIG_SMP __tegra_cpu_reset_handler_data[TEGRA_RESET_MASK_PRESENT] = diff --git a/arch/arm/mach-tegra/reset.h b/arch/arm/mach-tegra/reset.h index 9c479c7925b8..db0e6b3097ab 100644 --- a/arch/arm/mach-tegra/reset.h +++ b/arch/arm/mach-tegra/reset.h @@ -25,7 +25,11 @@ #define TEGRA_RESET_STARTUP_SECONDARY 3 #define TEGRA_RESET_STARTUP_LP2 4 #define TEGRA_RESET_STARTUP_LP1 5 -#define TEGRA_RESET_DATA_SIZE 6 +#define TEGRA_RESET_RESETTABLE_STATUS 6 +#define TEGRA_RESET_TF_PRESENT 7 +#define TEGRA_RESET_DATA_SIZE 8 + +#define RESET_DATA(x) ((TEGRA_RESET_##x)*4) #ifndef __ASSEMBLY__ @@ -49,7 +53,8 @@ void __tegra_cpu_reset_handler_end(void); 
(u32)__tegra_cpu_reset_handler_start))) #define tegra20_cpu1_resettable_status \ (IO_ADDRESS(TEGRA_IRAM_BASE + TEGRA_IRAM_RESET_HANDLER_OFFSET + \ -(u32)__tegra20_cpu1_resettable_status_offset)) + ((u32)&__tegra_cpu_reset_handler_data[TEGRA_RESET_RESETTABLE_STATUS] - \ +(u32)__tegra_cpu_reset_handler_start))) #endif #define tegra_cpu_reset_handler_offset \ diff --git a/arch/arm/mach-tegra/sleep-tegra20.S b/arch/arm/mach-tegra/sleep-tegra20.S index dedeebfccc55..50d51d3465f6 100644 --- a/arch/arm/mach-tegra/sleep-tegra20.S +++ b/arch/arm/mach-tegra/sleep-tegra20.S @@ -28,6 +28,7 @@ #include #include "irammap.h" +#include "reset.h" #include "sleep.h" #define EMC_CFG 0xc @@ -53,6 +54,9 @@ #define
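The RESET_DATA() macro in the reset.h hunk above converts a slot index into a byte offset, since each slot in __tegra_cpu_reset_handler_data is a 32-bit word; the assembly then does `ldr r5, [r12, #RESET_DATA(TF_PRESENT)]`. Below is a standalone userspace model of that layout, for illustration only (the helper function is not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Slot indices mirror the reset.h hunk in the patch above. */
#define TEGRA_RESET_STARTUP_SECONDARY 3
#define TEGRA_RESET_STARTUP_LP2       4
#define TEGRA_RESET_STARTUP_LP1       5
#define TEGRA_RESET_RESETTABLE_STATUS 6
#define TEGRA_RESET_TF_PRESENT        7
#define TEGRA_RESET_DATA_SIZE         8

/* Each slot is a 32-bit word, so the byte offset is index * 4. */
#define RESET_DATA(x) ((TEGRA_RESET_##x) * 4)

/* Illustrative helper (not in the patch): reads the TF_PRESENT slot the
 * same way the assembly's ldr with #RESET_DATA(TF_PRESENT) does. */
static uint32_t tf_present(const uint32_t *data)
{
        return data[RESET_DATA(TF_PRESENT) / 4];
}
```

With this layout, the old separate `__tegra20_cpu1_resettable_status_offset` byte becomes just another word-sized slot, which is what lets the patch drop the extra `.equ`/`.byte` bookkeeping.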
[PATCH v6 3/7] ARM: trusted_foundations: Provide information about whether firmware is registered
Add a helper that provides information about whether the Trusted Foundations firmware operations have been registered. Signed-off-by: Dmitry Osipenko --- arch/arm/firmware/trusted_foundations.c| 5 + arch/arm/include/asm/trusted_foundations.h | 7 +++ 2 files changed, 12 insertions(+) diff --git a/arch/arm/firmware/trusted_foundations.c b/arch/arm/firmware/trusted_foundations.c index c496f4cc49cb..d795ed83a3cd 100644 --- a/arch/arm/firmware/trusted_foundations.c +++ b/arch/arm/firmware/trusted_foundations.c @@ -172,3 +172,8 @@ void of_register_trusted_foundations(void) panic("Trusted Foundation: missing version-minor property\n"); register_trusted_foundations(); } + +bool trusted_foundations_registered(void) +{ + return firmware_ops == &trusted_foundations_ops; +} diff --git a/arch/arm/include/asm/trusted_foundations.h b/arch/arm/include/asm/trusted_foundations.h index cdd48ab7d191..3f23fa493db6 100644 --- a/arch/arm/include/asm/trusted_foundations.h +++ b/arch/arm/include/asm/trusted_foundations.h @@ -31,6 +31,7 @@ #include #include #include +#include #define TF_PM_MODE_LP0 0 #define TF_PM_MODE_LP1 1 @@ -47,6 +48,7 @@ struct trusted_foundations_platform_data { void register_trusted_foundations(struct trusted_foundations_platform_data *pd); void of_register_trusted_foundations(void); +bool trusted_foundations_registered(void); #else /* CONFIG_TRUSTED_FOUNDATIONS */ @@ -74,6 +76,11 @@ static inline void of_register_trusted_foundations(void) if (of_find_compatible_node(NULL, NULL, "tlm,trusted-foundations")) register_trusted_foundations(NULL); } + +static inline bool trusted_foundations_registered(void) +{ + return false; +} #endif /* CONFIG_TRUSTED_FOUNDATIONS */ #endif -- 2.20.1
[PATCH v6 4/7] ARM: tegra: Set up L2 cache using Trusted Foundations firmware
On Tegra30 the L2 cache should be initialized using a firmware call if the CPU is running in insecure mode. Set up the required outer-cache write_sec() callback early during boot using the firmware API; the call is always a NO-OP on T114+ and is a NO-OP on T20/T30 if the Trusted Foundations firmware node isn't present in the device-tree. Signed-off-by: Dmitry Osipenko --- arch/arm/mach-tegra/tegra.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm/mach-tegra/tegra.c b/arch/arm/mach-tegra/tegra.c index f9587be48235..1e89cfefbf68 100644 --- a/arch/arm/mach-tegra/tegra.c +++ b/arch/arm/mach-tegra/tegra.c @@ -38,6 +38,7 @@ #include #include +#include #include #include #include @@ -74,6 +75,7 @@ static void __init tegra_init_early(void) { of_register_trusted_foundations(); tegra_cpu_reset_handler_init(); + call_firmware_op(l2x0_init); } static void __init tegra_dt_init_irq(void) -- 2.20.1
[PATCH v6 1/7] ARM: trusted_foundations: Implement L2 cache initialization callback
Implement an L2 cache initialization firmware callback that should be invoked early during boot in order to set up the required outer cache driver's callbacks. Partially based on work done by Michał Mirosław [1]. [1] https://www.spinics.net/lists/arm-kernel/msg594765.html Signed-off-by: Dmitry Osipenko --- arch/arm/firmware/trusted_foundations.c | 46 + 1 file changed, 46 insertions(+) diff --git a/arch/arm/firmware/trusted_foundations.c b/arch/arm/firmware/trusted_foundations.c index 689e6565abfc..3bf61a5933b9 100644 --- a/arch/arm/firmware/trusted_foundations.c +++ b/arch/arm/firmware/trusted_foundations.c @@ -18,8 +18,15 @@ #include #include #include +#include +#include #include +#define TF_CACHE_MAINT 0xfffff100 + +#define TF_CACHE_ENABLE 1 +#define TF_CACHE_DISABLE 2 + #define TF_SET_CPU_BOOT_ADDR_SMC 0xfffff200 #define TF_CPU_PM 0xfffffffc @@ -67,9 +74,48 @@ static int tf_prepare_idle(void) return 0; } +#ifdef CONFIG_CACHE_L2X0 +static void tf_cache_write_sec(unsigned long val, unsigned int reg) +{ + static u32 l2x0_way_mask = 0xff; + static u32 l2x0_aux_ctrl = 0; + + switch (reg) { + case L2X0_AUX_CTRL: + l2x0_aux_ctrl = val; + + if (l2x0_aux_ctrl & BIT(16)) + l2x0_way_mask = 0xffff; + break; + + case L2X0_CTRL: + if (val == L2X0_CTRL_EN) + tf_generic_smc(TF_CACHE_MAINT, TF_CACHE_ENABLE, + l2x0_aux_ctrl); + else + tf_generic_smc(TF_CACHE_MAINT, TF_CACHE_DISABLE, + l2x0_way_mask); + break; + + default: + break; + } +} + +static int tf_init_cache(void) +{ + outer_cache.write_sec = tf_cache_write_sec; + + return 0; +} +#endif /* CONFIG_CACHE_L2X0 */ + static const struct firmware_ops trusted_foundations_ops = { .set_cpu_boot_addr = tf_set_cpu_boot_addr, .prepare_idle = tf_prepare_idle, +#ifdef CONFIG_CACHE_L2X0 + .l2x0_init = tf_init_cache, +#endif }; void register_trusted_foundations(struct trusted_foundations_platform_data *pd) -- 2.20.1
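The tf_cache_write_sec() callback above has to remember the way mask from the aux-control write so that a later cache-disable SMC can name all the ways: bit 16 of the PL310 aux control register selects 16-way associativity, widening the mask from 0xff to 0xffff. A standalone model of just that bookkeeping (the pure-function shape is illustrative, not the driver's):

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Models the way-mask selection in tf_cache_write_sec(): when the
 * recorded L2X0_AUX_CTRL value has bit 16 set the cache is 16-way,
 * so the firmware's cache-disable call must cover 16 ways (0xffff)
 * instead of the default 8 (0xff). */
static uint32_t way_mask_for(uint32_t aux_ctrl)
{
        return (aux_ctrl & BIT(16)) ? 0xffff : 0xff;
}
```

This is why the callback caches the aux-control value in a static variable: the L2X0_CTRL disable write arrives later and carries no associativity information of its own.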
[PATCH v6 2/7] ARM: trusted_foundations: Make prepare_idle call to take mode argument
The Trusted Foundations firmware call varies depending on the required suspend-mode. Make the firmware API take a mode argument in order to expose all of the modes to the firmware user. Signed-off-by: Dmitry Osipenko --- arch/arm/firmware/trusted_foundations.c| 29 -- arch/arm/include/asm/firmware.h| 2 +- arch/arm/include/asm/trusted_foundations.h | 6 + arch/arm/mach-tegra/cpuidle-tegra114.c | 3 ++- 4 files changed, 36 insertions(+), 4 deletions(-) diff --git a/arch/arm/firmware/trusted_foundations.c b/arch/arm/firmware/trusted_foundations.c index 3bf61a5933b9..c496f4cc49cb 100644 --- a/arch/arm/firmware/trusted_foundations.c +++ b/arch/arm/firmware/trusted_foundations.c @@ -67,9 +67,34 @@ static int tf_set_cpu_boot_addr(int cpu, unsigned long boot_addr) return 0; } -static int tf_prepare_idle(void) +static int tf_prepare_idle(unsigned long mode) { - tf_generic_smc(TF_CPU_PM, TF_CPU_PM_S1_NOFLUSH_L2, cpu_boot_addr); + switch (mode) { + case TF_PM_MODE_LP0: + tf_generic_smc(TF_CPU_PM, TF_CPU_PM_S3, cpu_boot_addr); + break; + + case TF_PM_MODE_LP1: + tf_generic_smc(TF_CPU_PM, TF_CPU_PM_S2, cpu_boot_addr); + break; + + case TF_PM_MODE_LP1_NO_MC_CLK: + tf_generic_smc(TF_CPU_PM, TF_CPU_PM_S2_NO_MC_CLK, + cpu_boot_addr); + break; + + case TF_PM_MODE_LP2: + tf_generic_smc(TF_CPU_PM, TF_CPU_PM_S1, cpu_boot_addr); + break; + + case TF_PM_MODE_LP2_NOFLUSH_L2: + tf_generic_smc(TF_CPU_PM, TF_CPU_PM_S1_NOFLUSH_L2, + cpu_boot_addr); + break; + + default: + return -EINVAL; + } return 0; } diff --git a/arch/arm/include/asm/firmware.h b/arch/arm/include/asm/firmware.h index 34c1d96ef46d..6698272bbcbf 100644 --- a/arch/arm/include/asm/firmware.h +++ b/arch/arm/include/asm/firmware.h @@ -24,7 +24,7 @@ struct firmware_ops { /* * Inform the firmware we intend to enter CPU idle mode */ - int (*prepare_idle)(void); + int (*prepare_idle)(unsigned long mode); /* * Enters CPU idle mode */ diff --git a/arch/arm/include/asm/trusted_foundations.h b/arch/arm/include/asm/trusted_foundations.h 
index 00748350cf72..cdd48ab7d191 100644 --- a/arch/arm/include/asm/trusted_foundations.h +++ b/arch/arm/include/asm/trusted_foundations.h @@ -32,6 +32,12 @@ #include #include +#define TF_PM_MODE_LP0 0 +#define TF_PM_MODE_LP1 1 +#define TF_PM_MODE_LP1_NO_MC_CLK 2 +#define TF_PM_MODE_LP2 3 +#define TF_PM_MODE_LP2_NOFLUSH_L2 4 + struct trusted_foundations_platform_data { unsigned int version_major; unsigned int version_minor; diff --git a/arch/arm/mach-tegra/cpuidle-tegra114.c b/arch/arm/mach-tegra/cpuidle-tegra114.c index e3fbcfedf845..3b9af4766cdf 100644 --- a/arch/arm/mach-tegra/cpuidle-tegra114.c +++ b/arch/arm/mach-tegra/cpuidle-tegra114.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include "cpuidle.h" @@ -46,7 +47,7 @@ static int tegra114_idle_power_down(struct cpuidle_device *dev, tegra_set_cpu_in_lp2(); cpu_pm_enter(); - call_firmware_op(prepare_idle); + call_firmware_op(prepare_idle, TF_PM_MODE_LP2_NOFLUSH_L2); /* Do suspend by ourselves if the firmware does not implement it */ if (call_firmware_op(do_idle, 0) == -ENOSYS) -- 2.20.1
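The tf_prepare_idle() rework above is a straight mode-to-SMC-command dispatch. Here is a standalone model of that mapping; the TF_PM_MODE_* values come from the patch, but the TF_CPU_PM_S* command codes are not shown in this series' context, so the enum values below are placeholders for illustration only:

```c
#include <assert.h>

/* Mode values copied from the trusted_foundations.h hunk in the patch. */
#define TF_PM_MODE_LP0            0
#define TF_PM_MODE_LP1            1
#define TF_PM_MODE_LP1_NO_MC_CLK  2
#define TF_PM_MODE_LP2            3
#define TF_PM_MODE_LP2_NOFLUSH_L2 4

/* Placeholder command codes -- assumed, not taken from the patch. */
enum tf_cmd { S3, S2, S2_NO_MC_CLK, S1, S1_NOFLUSH_L2, CMD_INVALID };

/* Models the switch in tf_prepare_idle(): one firmware command per
 * suspend mode, anything else is rejected (-EINVAL in the driver). */
static enum tf_cmd mode_to_cmd(unsigned long mode)
{
        switch (mode) {
        case TF_PM_MODE_LP0:            return S3;
        case TF_PM_MODE_LP1:            return S2;
        case TF_PM_MODE_LP1_NO_MC_CLK:  return S2_NO_MC_CLK;
        case TF_PM_MODE_LP2:            return S1;
        case TF_PM_MODE_LP2_NOFLUSH_L2: return S1_NOFLUSH_L2;
        default:                        return CMD_INVALID;
        }
}
```

Note the inversion in the naming: the shallowest Linux state (LP2) maps to the firmware's S1, and the deepest (LP0) to S3, which is why an explicit table like this is clearer than arithmetic on the mode number.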
[PATCH v6 0/7] Support Trusted Foundations firmware on Tegra30
Hello, This patchset adds support for the Trusted Foundations firmware on NVIDIA Tegra30. Pretty much all Tegra30 consumer devices have that firmware, and the upstream kernel can't boot on those devices without firmware support. Changelog: v6: - One patch was accidentally messed up in v5; this is fixed now. - Squashed "Support L2 cache maintenance done via firmware" patch into the "Add firmware calls..." patch. - The l2x0_init() firmware callback is now invoked unconditionally because it is always a NO-OP on T114+ and is a NO-OP on T20/30 if the firmware node is missing from the device-tree, hence there is no need to check the machine's DT compatible as was done in previous versions of the series. v5: - Fixed machine hanging on disabling D-cache during suspend; it turned out there are slight variations in behaviour between firmware versions with regard to cache management. Thanks to Robert Yang for reporting the problem and helping with the solution. v4: - Fixed Thumb2-kernel hanging on Tegra20; it turned out it was not a good idea to switch the CPU into Thumb2 mode right after jumping into the reset handler. - Moved LP2-mode-set firmware call invocation to a later stage to better replicate what the downstream kernel does. This change was suggested by Robert Yang and fixes a system hang on the Ouya game console. - Added references to the original work made by Michał Mirosław into commit messages of the relevant patches. v3: - Implemented suspend-resume support. - Reworked arm/firmware/trusted_foundations.c a tad. Now the cache is getting properly initialized, and cache enabling / disabling is supported. v2: - The "Don't apply CPU erratas in insecure mode" patch got some cleanup, in particular resolved the messiness in __tegra_cpu_reset_handler_data. - Added a comment to tf_cache_write_sec(), justifying the warning message. 
Dmitry Osipenko (7): ARM: trusted_foundations: Implement L2 cache initialization callback ARM: trusted_foundations: Make prepare_idle call to take mode argument ARM: trusted_foundations: Provide information about whether firmware is registered ARM: tegra: Set up L2 cache using Trusted Foundations firmware ARM: tegra: Don't apply CPU erratas in insecure mode ARM: tegra: Always boot CPU in ARM-mode ARM: tegra: Add firmware calls required for suspend-resume on Tegra30 arch/arm/firmware/trusted_foundations.c| 80 +- arch/arm/include/asm/firmware.h| 2 +- arch/arm/include/asm/trusted_foundations.h | 13 arch/arm/mach-tegra/cpuidle-tegra114.c | 3 +- arch/arm/mach-tegra/pm.c | 49 + arch/arm/mach-tegra/reset-handler.S| 50 +++--- arch/arm/mach-tegra/reset.c| 3 + arch/arm/mach-tegra/reset.h| 9 ++- arch/arm/mach-tegra/sleep-tegra20.S| 4 ++ arch/arm/mach-tegra/sleep.S| 14 ++-- arch/arm/mach-tegra/tegra.c| 2 + 11 files changed, 207 insertions(+), 22 deletions(-) -- 2.20.1
Re: [PATCH v4 1/3] dt-bindings: phy: Add Stingray USB PHY binding document
On Fri, Feb 22, 2019 at 11:29 AM Srinath Mannam wrote: > > Hi Rob, > > Thanks for the review, Please find my comments below in line. > > On Fri, Feb 22, 2019 at 10:50 PM Rob Herring wrote: > > > > On Wed, Feb 20, 2019 at 04:04:00PM +0530, Srinath Mannam wrote: > > > Add DT binding document for Stingray USB PHY. > > > > > > Signed-off-by: Srinath Mannam > > > Reviewed-by: Florian Fainelli > > > Reviewed-by: Scott Branden > > > --- > > > .../bindings/phy/brcm,stingray-usb-phy.txt | 62 > > > ++ > > > 1 file changed, 62 insertions(+) > > > create mode 100644 > > > Documentation/devicetree/bindings/phy/brcm,stingray-usb-phy.txt > > > > > > diff --git > > > a/Documentation/devicetree/bindings/phy/brcm,stingray-usb-phy.txt > > > b/Documentation/devicetree/bindings/phy/brcm,stingray-usb-phy.txt > > > new file mode 100644 > > > index 000..da19236 > > > --- /dev/null > > > +++ b/Documentation/devicetree/bindings/phy/brcm,stingray-usb-phy.txt > > > @@ -0,0 +1,62 @@ > > > +Broadcom Stingray USB PHY > > > + > > > +Required properties: > > > + - compatible : should be one of the listed compatibles > > > + - "brcm,sr-usb-combo-phy" is a combo PHY has one SS PHY and one HS > > > PHY. > > > + - "brcm,sr-usb-hs-phy" has a single HS PHY. > > > + - reg: offset and length of the PHY blocks registers > > > + - address-cells: should be 1 > > > + - size-cells: should be 0 > > > + > > > +Sub-nodes: > > > + brcm,sr-usb-combo-phy have two sub-nodes for one SS PHY and one HS PHY. > > > + > > > +Sub-nodes required properties: > > > + - reg: required for brcm,sr-usb-phy model PHY. > > > + reg value 0 is HS PHY and 1 is SS PHY. 
> > > + - phy-cells: generic PHY binding; must be 0 > > > + > > > +Refer to phy/phy-bindings.txt for the generic PHY binding properties > > > + > > > +Example: > > > + usbphy0: usb-phy@0 { > > > + compatible = "brcm,sr-usb-combo-phy"; > > > + reg = <0x 0x100>; > > > + #address-cells = <1>; > > > + #size-cells = <0>; > > > + > > > + usb0_phy0: phy@0 { > > > + reg = <0>; > > > + #phy-cells = <0>; > > > + }; > > > + > > > + usb0_phy1: phy@1 { > > > + reg = <1>; > > > + #phy-cells = <0>; > > > + }; > > Again, you don't need child nodes here. There are not any per-child > resources. Clients can refer to <&usbphy0 1> just as easily as > <&usb0_phy1>. This is why we have #phy-cells. > This phy controller is a combo PHY: it has one Super Speed USB PHY and > one High Speed USB PHY. > We need to create two PHY devices inside the driver to initialize and > service (reset) both SS and HS PHYs separately. > That is the reason we used two child nodes. What you do in the driver is your business. That is independent of the binding. Go look at other phy drivers which use #phy-cells=1. The .of_xlate() function is what converts the phy cells to a struct phy. Rob
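Rob's point is that with `#phy-cells = <1>` the provider's `.of_xlate()` maps the single specifier cell onto one of the provider's internal PHY instances, so the binding needs no child nodes. A userspace sketch of that idea (names and the two-element layout are illustrative, not the actual driver's):

```c
#include <assert.h>
#include <stddef.h>

struct phy { int id; };

struct phy_provider {
        struct phy phys[2];     /* [0] = HS PHY, [1] = SS PHY */
};

/* Models .of_xlate(): a client reference <&usbphy0 N> selects phys[N].
 * The real kernel callback would return ERR_PTR(-ENODEV) on a bad cell;
 * here NULL stands in for that. */
static struct phy *sr_usb_of_xlate(struct phy_provider *p, unsigned int cell)
{
        if (cell >= 2)
                return NULL;
        return &p->phys[cell];
}

static struct phy_provider prov = { .phys = { { .id = 0 }, { .id = 1 } } };

/* Illustrative wrapper so the mapping is easy to exercise. */
static int xlate_id(unsigned int cell)
{
        struct phy *ph = sr_usb_of_xlate(&prov, cell);
        return ph ? ph->id : -1;
}
```

The driver still creates two `struct phy` objects internally; the binding just doesn't have to mirror that split in device-tree.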
Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote: > On Fri, 22 Feb 2019 11:27:05 -0800 > Alexei Starovoitov wrote: > > > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote: > > > > > > Then we should still probably fix up "__probe_kernel_read()" to not > > > allow user accesses. The easiest way to do that is actually likely to > > > use the "unsafe_get_user()" functions *without* doing a > > > uaccess_begin(), which will mean that modern CPU's will simply fault > > > on a kernel access to user space. > > > > On bpf side the bpf_probe_read() helper just calls probe_kernel_read() > > and users pass both user and kernel addresses into it and expect > > that the helper will actually try to read from that address. > > > > If __probe_kernel_read will suddenly start failing on all user addresses > > it will break the expectations. > > How do we solve it in bpf_probe_read? > > Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte > > in the loop? > > That's doable, but people already complain that bpf_probe_read() is slow > > and shows up in their perf report. > > We're changing kprobes to add a specific flag to say that we want to > differentiate between kernel or user reads. Can this be done with > bpf_probe_read()? If it's showing up in perf report, I doubt a single so you're saying you will break existing kprobe scripts? I don't think it's a good idea. It's not acceptable to break bpf_probe_read uapi.
Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
On Fri, 22 Feb 2019 11:27:05 -0800 Alexei Starovoitov wrote: > On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote: > > > > Then we should still probably fix up "__probe_kernel_read()" to not > > allow user accesses. The easiest way to do that is actually likely to > > use the "unsafe_get_user()" functions *without* doing a > > uaccess_begin(), which will mean that modern CPU's will simply fault > > on a kernel access to user space. > > On bpf side the bpf_probe_read() helper just calls probe_kernel_read() > and users pass both user and kernel addresses into it and expect > that the helper will actually try to read from that address. > > If __probe_kernel_read will suddenly start failing on all user addresses > it will break the expectations. > How do we solve it in bpf_probe_read? > Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte > in the loop? > That's doable, but people already complain that bpf_probe_read() is slow > and shows up in their perf report. We're changing kprobes to add a specific flag to say that we want to differentiate between kernel or user reads. Can this be done with bpf_probe_read()? If it's showing up in perf report, I doubt a single check is going to cause an issue. In fact, it may actually help speed things up as the read will be optimized for either user or kernel address reading. -- Steve
[PATCH 1/2] habanalabs: don't print result when rc indicates error
send_cpu_message() doesn't update the result parameter when an error occurs in its code. Therefore, callers of send_cpu_message() shouldn't use the result value when the return code indicates error. This patch fixes a static checker warning in goya_test_cpu_queue(), where that function did print the result even though the return code from send_cpu_message() indicated error. Signed-off-by: Oded Gabbay --- drivers/misc/habanalabs/goya/goya.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c index c43bd37fe693..e6f0d49ab71a 100644 --- a/drivers/misc/habanalabs/goya/goya.c +++ b/drivers/misc/habanalabs/goya/goya.c @@ -3380,10 +3380,16 @@ int goya_test_cpu_queue(struct hl_device *hdev) rc = hdev->asic_funcs->send_cpu_message(hdev, (u32 *) _pkt, sizeof(test_pkt), HL_DEVICE_TIMEOUT_USEC, ); - if (!rc) - dev_info(hdev->dev, "queue test on CPU queue succeeded\n"); - else - dev_err(hdev->dev, "CPU queue test failed (0x%08lX)\n", result); + if (!rc) { + if (result == ARMCP_PACKET_FENCE_VAL) + dev_info(hdev->dev, + "queue test on CPU queue succeeded\n"); + else + dev_err(hdev->dev, + "CPU queue test failed (0x%08lX)\n", result); + } else { + dev_err(hdev->dev, "CPU queue test failed, error %d\n", rc); + } return rc; } -- 2.18.0
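The fix above encodes a simple rule: the `result` out-parameter is only meaningful when the return code is 0, and even then the queue test passes only if the fence value matches. A standalone model of that decision (FENCE_VAL is an assumed placeholder for ARMCP_PACKET_FENCE_VAL, whose value isn't shown here):

```c
#include <assert.h>
#include <stdint.h>

#define FENCE_VAL 0xCAFECAFEu   /* assumed placeholder value */

/* Models the corrected success check in goya_test_cpu_queue():
 * if rc is nonzero, result is stale and must not be inspected;
 * otherwise the test succeeds only on a matching fence value. */
static int queue_test_ok(int rc, uint32_t result)
{
        if (rc)
                return 0;
        return result == FENCE_VAL;
}
```

The original bug was precisely the missing first branch: printing `result` after `rc` indicated failure, when `send_cpu_message()` had never written it.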
[PATCH 2/2] habanalabs: driver's Kconfig must select DMA_SHARED_BUFFER
The driver uses the DMA_BUF module which is built only if DMA_SHARED_BUFFER is selected. DMA_SHARED_BUFFER doesn't have any dependencies so it is ok to select it (as done by many other components). Signed-off-by: Oded Gabbay --- drivers/misc/habanalabs/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig index b7f38a14caf5..80400a035dc1 100644 --- a/drivers/misc/habanalabs/Kconfig +++ b/drivers/misc/habanalabs/Kconfig @@ -6,6 +6,7 @@ config HABANA_AI tristate "HabanaAI accelerators (habanalabs)" depends on PCI select FRAME_VECTOR + select DMA_SHARED_BUFFER help Enables PCIe card driver for Habana's AI Processors (AIP) that are designed to accelerate Deep Learning inference and training workloads. -- 2.18.0
Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote: > > Then we should still probably fix up "__probe_kernel_read()" to not > allow user accesses. The easiest way to do that is actually likely to > use the "unsafe_get_user()" functions *without* doing a > uaccess_begin(), which will mean that modern CPU's will simply fault > on a kernel access to user space. On bpf side the bpf_probe_read() helper just calls probe_kernel_read() and users pass both user and kernel addresses into it and expect that the helper will actually try to read from that address. If __probe_kernel_read will suddenly start failing on all user addresses it will break the expectations. How do we solve it in bpf_probe_read? Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte in the loop? That's doable, but people already complain that bpf_probe_read() is slow and shows up in their perf report.
Re: [PATCH v2] dt-bindings: Add vendor prefix for techstar
Rob, On Tue, Feb 12, 2019 at 4:57 PM Jagan Teki wrote: > > Add vendor prefix for techstar, known as > Shenzhen Techstar Electronics Co., Ltd. a known producer for LCD modules. > > Signed-off-by: Jagan Teki > --- > Changes for v2: > - rebase for master > > Documentation/devicetree/bindings/vendor-prefixes.txt | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt > b/Documentation/devicetree/bindings/vendor-prefixes.txt > index 7eae7a95ab2a..a5ad57eaee20 100644 > --- a/Documentation/devicetree/bindings/vendor-prefixes.txt > +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt > @@ -397,6 +397,7 @@ tcl Toby Churchill Ltd. > technexion TechNexion > technologicTechnologic Systems > tempo Tempo Semiconductor > +techstar Shenzhen Techstar Electronics Co., Ltd. > terasicTerasic Inc. > thine THine Electronics, Inc. > ti Texas Instruments > -- Can you apply this, the relevant driver is mainline already. [1] [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=849b2e3ff9698226ab91e034d52cbb1da92a5b4c
Re: [RFC][PATCH 00/16] sched: Core scheduling
On 2/22/19 6:20 AM, Peter Zijlstra wrote: > On Fri, Feb 22, 2019 at 01:17:01PM +0100, Paolo Bonzini wrote: >> On 18/02/19 21:40, Peter Zijlstra wrote: >>> On Mon, Feb 18, 2019 at 09:49:10AM -0800, Linus Torvalds wrote: On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra wrote: > > However; whichever way around you turn this cookie; it is expensive and > nasty. Do you (or anybody else) have numbers for real loads? Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer. >>> >>> Not for these patches; they stopped crashing only yesterday and I >>> cleaned them up and send them out. >>> >>> The previous version; which was more horrible; but L1TF complete, was >>> between OK-ish and horrible depending on the number of VMEXITs a >>> workload had. >>> >>> If there were close to no VMEXITs, it beat smt=off, if there were lots >>> of VMEXITs it was far far worse. Supposedly hosting people try their >>> very bestest to have no VMEXITs so it mostly works for them (with the >>> obvious exception of single VCPU guests). >> >> If you are giving access to dedicated cores to guests, you also let them >> do PAUSE/HLT/MWAIT without vmexits and the host just thinks it's a CPU >> bound workload. >> >> In any case, IIUC what you are looking for is: >> >> 1) take a benchmark that *is* helped by SMT, this will be something CPU >> bound. >> >> 2) compare two runs, one without SMT and without core scheduler, and one >> with SMT+core scheduler. >> >> 3) find out whether performance is helped by SMT despite the increased >> overhead of the core scheduler >> >> Do you want some other load in the host, so that the scheduler actually >> does do something? Or is the point just that you show that the >> performance isn't affected when the scheduler does not have anything to >> do (which should be obvious, but having numbers is always better)? 
> > Well, what _I_ want is for all this to just go away :-) > > Tim did much of testing last time around; and I don't think he did > core-pinning of VMs much (although I'm sure he did some of that). I'm Yes. The last time around I tested basic scenarios like: 1. single VM pinned on a core 2. 2 VMs pinned on a core 3. system oversubscription (no pinning) In general, CPU bound benchmarks and even things without too much I/O causing lots of VMexits perform better with HT than without for Peter's last patchset. > still a complete virt noob; I can barely boot a VM to save my life. > > (you should be glad to not have heard my cursing at qemu cmdline when > trying to reproduce some of Tim's results -- lets just say that I can > deal with gpg) > > I'm sure he tried some oversubscribed scenarios without pinning. We did try some oversubscribed scenarios like SPECVirt, that tried to squeeze tons of VMs on a single system in over subscription mode. There're two main problems in the last go around: 1. Workload with high rate of Vmexits (SpecVirt is one) were a major source of pain when we tried Peter's previous patchset. The switch from vcpus to qemu and back in previous version of Peter's patch requires some coordination between the hyperthread siblings via IPI. And for workload that does this a lot, the overhead quickly added up. For Peter's new patch, this overhead hopefully would be reduced and give better performance. 2. Load balancing is quite tricky. Peter's last patchset did not have load balancing for consolidating compatible running threads. I did some non-sophisticated load balancing to pair vcpus up. But the constant vcpu migrations overhead probably ate up any improvements from better load pairing. So I didn't get much improvement in the over-subscription case when turning on load balancing to consolidate the VCPUs of the same VM. We'll probably have to try out this incarnation of Peter's patch and see how well the load balancing works. 
I'll try to line up some benchmarking folks to do some tests. Tim
Re: [PATCH v5 4/9] mm/mmu_notifier: contextual information for event enums
On 2/19/19 12:04 PM, jgli...@redhat.com wrote: From: Jérôme Glisse CPU page table update can happens for many reasons, not only as a result s/update/updates s/happens/happen of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). This patch introduce a set of enums that can be associated with each of s/introduce/introduces the events triggering a mmu notifier. Latter patches take advantages of those enum values. s/advantages/advantage - UNMAP: munmap() or mremap() - CLEAR: page table is cleared (migration, compaction, reclaim, ...) - PROTECTION_VMA: change in access protections for the range - PROTECTION_PAGE: change in access protections for page in the range - SOFT_DIRTY: soft dirtyness tracking s/dirtyness/dirtiness Being able to identify munmap() and mremap() from other reasons why the page table is cleared is important to allow user of mmu notifier to update their own internal tracking structure accordingly (on munmap or mremap it is not longer needed to track range of virtual address as it becomes invalid). 
Signed-off-by: Jérôme Glisse Cc: Christian König Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Rodrigo Vivi Cc: Jan Kara Cc: Andrea Arcangeli Cc: Peter Xu Cc: Felix Kuehling Cc: Jason Gunthorpe Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Michal Hocko Cc: Christian Koenig Cc: Ralph Campbell Cc: John Hubbard Cc: k...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linux-r...@vger.kernel.org Cc: Arnd Bergmann --- include/linux/mmu_notifier.h | 30 ++ 1 file changed, 30 insertions(+) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index c8672c366f67..2386e71ac1b8 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -10,6 +10,36 @@ struct mmu_notifier; struct mmu_notifier_ops; +/** + * enum mmu_notifier_event - reason for the mmu notifier callback + * @MMU_NOTIFY_UNMAP: either munmap() that unmap the range or a mremap() that + * move the range I would say something about the VMA for the notifier range is being deleted. MMU notifier clients can then use this case to remove any policy or access counts associated with the range. Just changing the PTE to "no access" as in the CLEAR case doesn't mean a policy which prefers device private memory over system memory should be cleared. + * + * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like + * madvise() or replacing a page by another one, ...). + * + * @MMU_NOTIFY_PROTECTION_VMA: update is due to protection change for the range + * ie using the vma access permission (vm_page_prot) to update the whole range + * is enough no need to inspect changes to the CPU page table (mprotect() + * syscall) + * + * @MMU_NOTIFY_PROTECTION_PAGE: update is due to change in read/write flag for + * pages in the range so to mirror those changes the user must inspect the CPU + * page table (from the end callback). + * + * @MMU_NOTIFY_SOFT_DIRTY: soft dirty accounting (still same page and same + * access flags). 
User should soft dirty the page in the end callback to make + * sure that anyone relying on soft dirtyness catch pages that might be written + * through non CPU mappings. + */ +enum mmu_notifier_event { + MMU_NOTIFY_UNMAP = 0, + MMU_NOTIFY_CLEAR, + MMU_NOTIFY_PROTECTION_VMA, + MMU_NOTIFY_PROTECTION_PAGE, + MMU_NOTIFY_SOFT_DIRTY, +}; + #ifdef CONFIG_MMU_NOTIFIER /*
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Thu, Feb 21, 2019 at 12:22 AM Dominique Martinet wrote: > > Tom Herbert wrote on Wed, Feb 20, 2019: > > > When the client closes the socket, some messages are obviously still "in > > > flight", and the server will recv a POLLERR notification on the csock at > > > some point with many messages left. > > > The documentation says to unattach the csock when you get POLLERR. If I > > > do that, the kcm socket will no longer give me any message, so all the > > > messages still in flight at the time are lost. > > > > So basically it sounds like you're interested in supporting TCP > > connections that are half closed. I believe that the error in half > > closed is EPIPE, so if the TCP socket returns that it can be ignored > > and the socket can continue being attached and used to send data. > > Did you mean 'can continue being attached and used to receive data'? > No, I meant shutdown on the receive side when a FIN is received. TX is still allowed to drain any queued bytes. To support shutdown on the TX side would require additional logic since we need to effectively detach the transmit path but retain the receive path. I'm not sure this is a compelling use case to support. > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on > both the csock and kcm socket will do many needless wakeups on only the > csock from what I can see, so I'd need some holdoff timer or something. > I guess it's possible though. We might need to clear the error somehow. Maybe a read of zero bytes? > > > Another possibility is to add some linger semantics to an attached > > socket. For instance, a large message might be sent so that part of > > the message is queued in TCP and part is queued in the KCM socket. > > Unattach would probably break that message. 
We probably want a linger > > option similar to SO_LINGER (or maybe just use the option on the TCP > > socket) that means don't complete the detach until any message being > > transmitted on the lower socket has been queued. > > That would certainly work, even if non-obvious from a user standpoint. > > > > > > I'd like to see some retry on ENOMEM before this is merged though, so > > > > while I'm there I'll resend this with a second patch doing that > > > > retry... I think just not setting strp->interrupted and not reporting > > > > the error up might be enough? Will have to try either way. > > > > > > I also tried playing with that without much success. > > > I had assumed just not calling strp_parser_err() (which calls the > > > abort_parser cb) would be enough, eventually calling strp_start_timer() > > > like the !len case, but no can do. > > > > I think you need to ignore the ENOMEM and have a timer or other > > callback to retry the operation in the future. > > Yes, that's what I had intended to try; basically just break out and > schedule the timer as said above. You might want to look at some other systems; I don't recall if there's a hook that can be used for when memory pressure is relieved. > > After a bit more debugging, this part works (__strp_recv() is called > again); but the next packet that is treated properly is rejected because > by the time __strp_recv() was called again a new skb was read and the > length isn't large enough to go all the way into the new packet, so this > test fails: > } else if (len <= (ssize_t)head->len - > skb->len - stm->strp.offset) { > /* Length must be into new skb (and also > * greater than zero) > */ > STRP_STATS_INCR(strp->stats.bad_hdr_len); > strp_parser_err(strp, -EPROTO, desc); > > So I need to figure a way to say "call this function again without > reading more data" somehow, or make this check more lax e.g. accept any > len > 0 after a retry maybe... 
> Removing that branch altogether seems to work at least but I'm not sure > we'd want to? I like the check since it's conservative and covers the normal case. Maybe just need some more logic? > (grmbl at this slow VM and strparser not being possible to enable as a > module, it takes ages to test) > > > > > With that said, returning 0 from the parse function also raises POLLERR > > > on the csock and hangs netparser, so things aren't that simple... > > > > Can you point to where this is happening. If the parse_msg callback > > returns 0 that is suppose to indicate that more bytes are needed. > > I just blindly returned 0 "from time to time" in the kcm parser > function, but looking at above it was failing on the same check. > This somewhat makes sense for this one to fail here if a new packet was > read, no sane parser function should give a length smaller than what > they require to determine the length. > > > Thanks, > -- > Dominique
Re: [PATCHv6 07/10] acpi/hmat: Register processor domain to its memory
On Fri, Feb 22, 2019 at 10:48 AM Keith Busch wrote: > > On Wed, Feb 20, 2019 at 11:02:01PM +0100, Rafael J. Wysocki wrote: > > On Thu, Feb 14, 2019 at 6:10 PM Keith Busch wrote: > > > config ACPI_HMAT > > > bool "ACPI Heterogeneous Memory Attribute Table Support" > > > depends on ACPI_NUMA > > > + select HMEM_REPORTING > > > > If you want to do this here, I'm not sure that defining HMEM_REPORTING > > as a user-selectable option is a good idea. In particular, I don't > > really think that setting ACPI_HMAT without it makes a lot of sense. > > Apart from this, the patch looks reasonable to me. > > I'm trying to implement based on the feedback, but I'm a little confused. > > As I have it at the moment, HMEM_REPORTING is not user-prompted, so > another option needs to turn it on. I have ACPI_HMAT do that here. > > So when you say it's a bad idea to make HMEM_REPORTING user selectable, > isn't it already not user selectable? > > If I do it the other way around, that's going to make HMEM_REPORTING > complicated if a non-ACPI implementation wants to report HMEM > properties. Agree. If a platform supports these HMEM properties then they should be reported. ACPI_HMAT is that opt-in for ACPI based platforms, and other archs can do something similar. It's not clear that one would ever want to opt-in to HMAT support and opt-out of reporting any of it to userspace.
Re: [v4 PATCH 8/8] RISC-V: Assign hwcap as per comman capabilities.
On 2/14/19 3:49 PM, Atish Patra wrote: On 2/13/19 4:38 PM, Palmer Dabbelt wrote: On Wed, 13 Feb 2019 00:44:42 PST (-0800), jo...@kernel.org wrote: On Tue, Feb 12, 2019 at 11:58:10AM -0800, Atish Patra wrote: On 2/12/19 3:25 AM, Johan Hovold wrote: On Tue, Feb 12, 2019 at 03:10:12AM -0800, Atish Patra wrote: Currently, we set hwcap based on first valid hart from DT. This may not be correct always as that hart might not be current booting cpu or may have a different capability. Set hwcap as the capabilities supported by all possible harts with "okay" status. Signed-off-by: Atish Patra --- arch/riscv/kernel/cpufeature.c | 41 ++--- 1 file changed, 22 insertions(+), 19 deletions(-) diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c index e7a4701f..a1e4fb34 100644 --- a/arch/riscv/kernel/cpufeature.c +++ b/arch/riscv/kernel/cpufeature.c @@ -20,6 +20,7 @@ #include #include #include +#include unsigned long elf_hwcap __read_mostly; #ifdef CONFIG_FPU @@ -42,28 +43,30 @@ void riscv_fill_hwcap(void) elf_hwcap = 0; - /* -* We don't support running Linux on hertergenous ISA systems. For -* now, we just check the ISA of the first "okay" processor. -*/ for_each_of_cpu_node(node) { - if (riscv_of_processor_hartid(node) >= 0) - break; - } - if (!node) { - pr_warn("Unable to find \"cpu\" devicetree entry\n"); - return; - } + unsigned long this_hwcap = 0; - if (of_property_read_string(node, "riscv,isa", )) { - pr_warn("Unable to find \"riscv,isa\" devicetree entry\n"); - of_node_put(node); - return; - } - of_node_put(node); + if (riscv_of_processor_hartid(node) < 0) + continue; - for (i = 0; i < strlen(isa); ++i) - elf_hwcap |= isa2hwcap[(unsigned char)(isa[i])]; + if (of_property_read_string(node, "riscv,isa", )) { + pr_warn("Unable to find \"riscv,isa\" devicetree entry\n"); + return; Did you want "continue" here to continue processing the other harts? Hmm. If a cpu node doesn't have isa in DT, that means DT is wrong. 
A "continue" here will let user space use other harts just with a warning message? Returning here will not set elf_hwcap, which forces the user to fix the DT. I am not sure what the defined behavior should be in this case. Any thoughts? The problem is that the proposed code might still set elf_hwcap -- it all depends on the order of the hart nodes in the DT (i.e. it will only be left unset if the first node is malformed). For that reason, I'd say it's better to either bail out (hard, or at least with elf_hwcap unset) or to continue processing the other nodes. The former might break current systems with malformed DTs, though. And since the harts are expected to have the same ISA, continuing the processing while warning and ignoring the malformed node might be acceptable. Handling malformed device trees by providing a warning and an empty HWCAP seems like the right way to go to me. If I understand you correctly, you prefer the following things to be done in case of a malformed DT: 1. Print a warning message 2. Unset the entire HWCAP 3. Return without processing other harts. This will most likely result in a panic when user space starts. Is this correct? As per our offline discussion, we should let the kernel avoid setting any value for the cpu with an incorrect DT entry and continue for other harts. A warning is enough. This is fine as long as user space never sees that hart. As the hart enumeration depends on riscv_of_processor_hartid, the hart with a corrupted isa property will never boot. riscv_of_processor_hartid will return -ENODEV if the "riscv,isa" property is not present. Moreover, the discussed conditional statement will not even be executed unless there is memory corruption or somebody corrupts the DT on the fly. So we can continue with the patch as it is. I will just resend the series (dropping the driver patches) for easy merge. Regards, Atish Johan
Re: [PATCH] arm64: dts: rockchip: Add support for the Orange Pi RK3399 board.
On Fri, Feb 22, 2019 at 12:46 PM Alexis Ballier wrote: > > This adds basic support for the Orange Pi RK3399 board. > What works: > - SD card / emmc. > - Debug UART > - Ethernet > - USB: Type C, internal USB3 for SATA, 4 USB 2.0 ports > - Sensors: All of them but the Hall sensor. > - Buttons > - Wifi, Bluetooth > - HDMI out > > Signed-off-by: Alexis Ballier > Cc: devicet...@vger.kernel.org > Cc: Heiko Stuebner > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux-rockc...@lists.infradead.org > Cc: linux-kernel@vger.kernel.org > > --- > > The board is heavily based on the Sapphire one, hence this patch is also > based on this DT. This is why I kept the copyrights of the sapphire DT. > --- > .../devicetree/bindings/arm/rockchip.yaml | 5 + > arch/arm64/boot/dts/rockchip/Makefile | 1 + > .../boot/dts/rockchip/rk3399-orangepi.dts | 778 ++ > 3 files changed, 784 insertions(+) > create mode 100644 arch/arm64/boot/dts/rockchip/rk3399-orangepi.dts Reviewed-by: Rob Herring
Re: [PATCH RFC v2 4/4] PCI: hotplug: Add quirk For Dell nvme pcie switches
On 2/21/19 8:05 PM, Oliver wrote: > On Fri, Feb 22, 2019 at 5:38 AM wrote: >> On 2/21/19 1:57 AM, Lukas Wunner wrote: [snip] >>> If the quirk is x86-specific, please enclose it in "#ifdef CONFIG_X86" >>> to reduce kernel footprint on other arches. >> >> That's a tricky one. If you look at p. 185 of [1], items 9, 11, and 12 >> are standard x16 cards that would fit in any x16 slot. Those cards have >> the offending switches. >> >> On the one hand, you could take the cards and backplane and put them in >> a non-x86 system. On the other hand, I don't see why someone would >> want to do this. > > I have a couple of POWER boxes with Dell branded switch cards in them. > I have no idea why either, but it does happen. The hardware debouncer, I think, is on the backplane, so you'd really need both the switch and backplane combo. I've seen other marketing departments call the switches "NVMe HBA". Alex
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote: > In the presence of more than one memory cgroup in the system our reclaim > logic just sucks. When we hit a memory limit (global or a limit on a > cgroup with subgroups) we reclaim some memory from all cgroups. > This sucks because the cgroup that allocates more often always wins. > E.g. a job that allocates a lot of clean, rarely used page cache will push > out of memory other jobs with a relatively small, all-in-memory active > working set. > > To prevent such situations we have memcg controls like low/max, etc. which > are supposed to protect jobs or limit them so they do not hurt others. > But memory cgroups are very hard to configure right because it requires > precise knowledge of the workload, which may vary during execution. > E.g. setting a memory limit means that the job won't be able to use all memory > in the system for page cache even if the rest of the system is idle. > Basically our current scheme requires configuring every single cgroup > in the system. > > I think we can do better. The idea proposed by this patch is to reclaim > only inactive pages and only from cgroups that have a big > (!inactive_is_low()) inactive list, and go back to shrinking active lists > only if all inactive lists are low. Yes, you are absolutely right. We shouldn't go after active pages as long as there are plenty of inactive pages around. That's the global reclaim policy, and we currently fail to translate that well to cgrouped systems. Setting group protections or limits would work around this problem, but they're kind of a red herring. We shouldn't ever allow use-once streams to push out hot workingsets, that's a bug. 
> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, > struct mem_cgroup *memcg, > > scan >>= sc->priority; > > + if (!sc->may_shrink_active && inactive_list_is_low(lruvec, > + file, memcg, sc, false)) > + scan = 0; > + > /* >* If the cgroup's already been deleted, make sure to >* scrape out the remaining cache. > @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct > scan_control *sc) > struct reclaim_state *reclaim_state = current->reclaim_state; > unsigned long nr_reclaimed, nr_scanned; > bool reclaimable = false; > + bool retry; > > do { > struct mem_cgroup *root = sc->target_mem_cgroup; > @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct > scan_control *sc) > }; > struct mem_cgroup *memcg; > > + retry = false; > + > memset(>nr, 0, sizeof(sc->nr)); > > nr_reclaimed = sc->nr_reclaimed; > @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct > scan_control *sc) > } > } while ((memcg = mem_cgroup_iter(root, memcg, ))); > > + if ((sc->nr_scanned - nr_scanned) == 0 && > + !sc->may_shrink_active) { > + sc->may_shrink_active = 1; > + retry = true; > + continue; > + } Using !scanned as the gate could be a problem. There might be a cgroup that has inactive pages on the local level, but when viewed from the system level the total inactive pages in the system might still be low compared to active ones. In that case we should go after active pages. Basically, during global reclaim, the answer for whether active pages should be scanned or not should be the same regardless of whether the memory is all global or whether it's spread out between cgroups. The reason this isn't the case is because we're checking the ratio at the lruvec level - which is the highest level (and identical to the node counters) when memory is global, but it's at the lowest level when memory is cgrouped. 
So IMO what we should do is: - At the beginning of global reclaim, use node_page_state() to compare the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim can go after active pages or not. Regardless of what the ratio is in individual lruvecs. - And likewise at the beginning of cgroup limit reclaim, walk the subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE and ACTIVE_FILE counters, and make inactive_is_low() decision on those sums.
Re: [PATCH net] net: socket: set sock->sk to NULL after calling proto_ops::release()
On Fri, Feb 22, 2019 at 10:25:09AM -0800, Eric Dumazet wrote: > > > On 02/22/2019 09:57 AM, Eric Biggers wrote: > > > ->setattr() is called under inode_lock(), which __sock_release() also > > takes. So > > the uses of sock->sk are serialized. See commit 6d8c50dcb029 ("socket: > > close > > race condition between sock_close() and sockfs_setattr()"). > > Oh right, we added another inode_lock()/inode_unlock() for sock_close() An interesting question is whether anything else will be confused by sock->sk && sock->sk->sk_socket != sock I'd still like to figure out if we could simply make sock_orphan() do something like if (likely(sk->sk_socket)) sk->sk_socket->sk = NULL; just before sk_set_socket(sk, NULL); That would make for much easier rules; the question is whether anything relies upon the windows when linkage between socket and sock is not symmetrical...
Re: [PATCH RESEND V6 2/2] clk: imx: scu: add cpu frequency scaling support
Quoting Anson Huang (2019-02-21 18:32:10) > On NXP's i.MX SoCs with a system controller inside, CPU frequency > scaling can ONLY be done by the system controller firmware, and it > can ONLY be requested from secure mode, so the Linux kernel has to > call ARM SMC to trap to ARM-Trusted-Firmware to request that the system > controller firmware do the CPU frequency scaling. > > This patch adds i.MX system controller CPU frequency scaling support; > it reuses the cpufreq-dt driver and implements the CPU frequency scaling > inside the SCU clock driver. > > Signed-off-by: Anson Huang Ah I missed one thing, see below. > @@ -180,6 +185,23 @@ static long clk_scu_round_rate(struct clk_hw *hw, > unsigned long rate, > return rate; > } > > +static int clk_scu_atf_set_cpu_rate(struct clk_hw *hw, unsigned long rate, > + unsigned long parent_rate) > +{ > + struct clk_scu *clk = to_clk_scu(hw); > + struct arm_smccc_res res; > + unsigned long cluster_id; > + > + if (clk->rsrc_id == IMX_SC_R_A35) > + cluster_id = 0; Do we still need to check this anymore? Why not just always use cluster_id 0? > + > + /* CPU frequency scaling can ONLY be done by ARM-Trusted-Firmware */ > + arm_smccc_smc(IMX_SIP_CPUFREQ, IMX_SIP_SET_CPUFREQ, > + cluster_id, rate, 0, 0, 0, 0, &res); Not checking would make this work, whereas keeping the check causes this code to sometimes pass an uninitialized value from the stack. > + > + return 0; > +}
Re: [PATCH v5 3/9] mm/mmu_notifier: convert mmu_notifier_range->blockable to a flags
On 2/19/19 12:04 PM, jgli...@redhat.com wrote: From: Jérôme Glisse Use an unsigned field for flags other than blockable and convert the blockable field to be one of those flags. Signed-off-by: Jérôme Glisse Cc: Christian König Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Rodrigo Vivi Cc: Jan Kara Cc: Andrea Arcangeli Cc: Peter Xu Cc: Felix Kuehling Cc: Jason Gunthorpe Cc: Andrew Morton Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Michal Hocko Cc: Christian Koenig Cc: Ralph Campbell Cc: John Hubbard Cc: k...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linux-r...@vger.kernel.org Cc: Arnd Bergmann --- include/linux/mmu_notifier.h | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index e630def131ce..c8672c366f67 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -25,11 +25,13 @@ struct mmu_notifier_mm { spinlock_t lock; }; +#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) + struct mmu_notifier_range { struct mm_struct *mm; unsigned long start; unsigned long end; - bool blockable; + unsigned flags; }; struct mmu_notifier_ops { @@ -229,7 +231,7 @@ extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, static inline bool mmu_notifier_range_blockable(const struct mmu_notifier_range *range) { - return range->blockable; + return (range->flags & MMU_NOTIFIER_RANGE_BLOCKABLE); } static inline void mmu_notifier_release(struct mm_struct *mm) @@ -275,7 +277,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { if (mm_has_notifiers(range->mm)) { - range->blockable = true; + range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE; __mmu_notifier_invalidate_range_start(range); } } @@ -284,7 +286,7 @@ static inline int mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { if (mm_has_notifiers(range->mm)) { - range->blockable = false; + range->flags &= 
~MMU_NOTIFIER_RANGE_BLOCKABLE; return __mmu_notifier_invalidate_range_start(range); } return 0; @@ -331,6 +333,7 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range, range->mm = mm; range->start = start; range->end = end; + range->flags = 0; } #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ Reviewed-by: Ralph Campbell
Re: [PATCH RESEND V6 2/2] clk: imx: scu: add cpu frequency scaling support
Quoting Anson Huang (2019-02-21 18:32:10) > On NXP's i.MX SoCs with a system controller inside, CPU frequency > scaling can ONLY be done by the system controller firmware, and it > can ONLY be requested from secure mode, so the Linux kernel has to > call ARM SMC to trap to ARM-Trusted-Firmware to request that the system > controller firmware do the CPU frequency scaling. > > This patch adds i.MX system controller CPU frequency scaling support; > it reuses the cpufreq-dt driver and implements the CPU frequency scaling > inside the SCU clock driver. > > Signed-off-by: Anson Huang > --- Applied to clk-next
Re: [PATCH v2 2/2] selftests/ftrace: Replace \e with \033
On Fri, 22 Feb 2019 08:59:42 -0700 shuah wrote: > On 2/22/19 7:59 AM, Masami Hiramatsu wrote: > > On Fri, 22 Feb 2019 10:10:21 +0100 > > Juerg Haefliger wrote: > > > >> The \e sequence character is not POSIX. Fix that by using \033 instead. > >> > >> Signed-off-by: Juerg Haefliger > > > > This also looks good to me. > > > > Acked-by: Masami Hiramatsu > > > > Thank you! > > > > > I wasn't on the To or Cc on the original patch. Make sure to > run the get_maintainers and include the people it recommends. > > Please resend the patch to me, so I can take it for 5.1-rc1 And add both Masami and my Acked-by when you send it to Shuah (for both patches). Acked-by: Steven Rostedt (VMware) -- Steve
Re: [PATCH 1/5] mm/workingset: remove unused @mapping argument in workingset_eviction()
On Fri, 2019-02-22 at 20:43 +0300, Andrey Ryabinin wrote: > workingset_eviction() doesn't use and never did use the @mapping > argument. > Remove it. > > Signed-off-by: Andrey Ryabinin > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vlastimil Babka > Cc: Rik van Riel > Cc: Mel Gorman Acked-by: Rik van Riel -- All Rights Reversed.
Re: [PATCH 3/5] mm/compaction: pass pgdat to too_many_isolated() instead of zone
On Fri, 2019-02-22 at 20:43 +0300, Andrey Ryabinin wrote: > too_many_isolated() in mm/compaction.c looks only at node state, > so it makes more sense to change argument to pgdat instead of zone. > > Signed-off-by: Andrey Ryabinin > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Vlastimil Babka > Cc: Rik van Riel > Cc: Mel Gorman Acked-by: Rik van Riel -- All Rights Reversed.
[PATCH] null_blk: fix checking for REQ_FUA
null_handle_bio() erroneously uses the bio_op() macro, which masks out the request flag bits, including REQ_FUA, so the check always fails. Fix this by checking bio->bi_opf directly. Signed-off-by: Heinz Mauelshagen --- drivers/block/null_blk_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c index 62c9654b9ce8..04b9d415511b 100644 --- a/drivers/block/null_blk_main.c +++ b/drivers/block/null_blk_main.c @@ -1104,7 +1104,7 @@ static int null_handle_bio(struct nullb_cmd *cmd) len = bvec.bv_len; err = null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset, op_is_write(bio_op(bio)), sector, -bio_op(bio) & REQ_FUA); +bio->bi_opf & REQ_FUA); if (err) { spin_unlock_irq(&nullb->lock); return err; -- 2.20.1
Re: [PATCH v5 2/9] mm/mmu_notifier: convert user range->blockable to helper function
On 2/19/19 12:04 PM, jgli...@redhat.com wrote: From: Jérôme Glisse Use the mmu_notifier_range_blockable() helper function instead of directly dereferencing the range->blockable field. This is done to make it easier to change the mmu_notifier range field. This patch is the outcome of the following coccinelle patch: %<--- @@ identifier I1, FN; @@ FN(..., struct mmu_notifier_range *I1, ...) { <... -I1->blockable +mmu_notifier_range_blockable(I1) ...> } --->% spatch --in-place --sp-file blockable.spatch --dir . Signed-off-by: Jérôme Glisse Cc: Christian König Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Rodrigo Vivi Cc: Jan Kara Cc: Andrea Arcangeli Cc: Peter Xu Cc: Felix Kuehling Cc: Jason Gunthorpe Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Michal Hocko Cc: Christian Koenig Cc: Ralph Campbell Cc: John Hubbard Cc: k...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linux-r...@vger.kernel.org Cc: Arnd Bergmann --- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 8 drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +- drivers/gpu/drm/radeon/radeon_mn.c | 4 ++-- drivers/infiniband/core/umem_odp.c | 5 +++-- drivers/xen/gntdev.c| 6 +++--- mm/hmm.c| 6 +++--- mm/mmu_notifier.c | 2 +- virt/kvm/kvm_main.c | 3 ++- 8 files changed, 19 insertions(+), 17 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index 3e6823fdd939..58ed401c5996 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -256,14 +256,14 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn, /* TODO we should be able to split locking for interval tree and * amdgpu_mn_invalidate_node */ - if (amdgpu_mn_read_lock(amn, range->blockable)) + if (amdgpu_mn_read_lock(amn, mmu_notifier_range_blockable(range))) return -EAGAIN; it = interval_tree_iter_first(>objects, range->start, end); while (it) { struct amdgpu_mn_node *node; - if (!range->blockable) { + if 
(!mmu_notifier_range_blockable(range)) { amdgpu_mn_read_unlock(amn); return -EAGAIN; } @@ -299,7 +299,7 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, /* notification is exclusive, but interval is inclusive */ end = range->end - 1; - if (amdgpu_mn_read_lock(amn, range->blockable)) + if (amdgpu_mn_read_lock(amn, mmu_notifier_range_blockable(range))) return -EAGAIN; it = interval_tree_iter_first(>objects, range->start, end); @@ -307,7 +307,7 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn, struct amdgpu_mn_node *node; struct amdgpu_bo *bo; - if (!range->blockable) { + if (!mmu_notifier_range_blockable(range)) { amdgpu_mn_read_unlock(amn); return -EAGAIN; } diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c index 1d3f9a31ad61..777b3f8727e7 100644 --- a/drivers/gpu/drm/i915/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c @@ -122,7 +122,7 @@ userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, while (it) { struct drm_i915_gem_object *obj; - if (!range->blockable) { + if (!mmu_notifier_range_blockable(range)) { ret = -EAGAIN; break; } diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c index b3019505065a..c9bd1278f573 100644 --- a/drivers/gpu/drm/radeon/radeon_mn.c +++ b/drivers/gpu/drm/radeon/radeon_mn.c @@ -133,7 +133,7 @@ static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn, /* TODO we should be able to split locking for interval tree and * the tear down. 
*/ - if (range->blockable) + if (mmu_notifier_range_blockable(range)) mutex_lock(>lock); else if (!mutex_trylock(>lock)) return -EAGAIN; @@ -144,7 +144,7 @@ static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn, struct radeon_bo *bo; long r; - if (!range->blockable) { + if (!mmu_notifier_range_blockable(range)) { ret = -EAGAIN; goto out_unlock; } diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 012044f16d1c..3a3f1538d295 100644 --- a/drivers/infiniband/core/umem_odp.c +++
Re: [PATCH v5 1/9] mm/mmu_notifier: helper to test if a range invalidation is blockable
On 2/19/19 12:04 PM, jgli...@redhat.com wrote: From: Jérôme Glisse Simple helpers to test if range invalidation is blockable. Latter patches use cocinnelle to convert all direct dereference of range-> blockable to use this function instead so that we can convert the blockable field to an unsigned for more flags. Signed-off-by: Jérôme Glisse Cc: Christian König Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Rodrigo Vivi Cc: Jan Kara Cc: Andrea Arcangeli Cc: Peter Xu Cc: Felix Kuehling Cc: Jason Gunthorpe Cc: Andrew Morton Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Michal Hocko Cc: Christian Koenig Cc: Ralph Campbell Cc: John Hubbard Cc: k...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Cc: linux-r...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: Arnd Bergmann --- include/linux/mmu_notifier.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 4050ec1c3b45..e630def131ce 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -226,6 +226,12 @@ extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r, extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); +static inline bool +mmu_notifier_range_blockable(const struct mmu_notifier_range *range) +{ + return range->blockable; +} + static inline void mmu_notifier_release(struct mm_struct *mm) { if (mm_has_notifiers(mm)) @@ -455,6 +461,11 @@ static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range, #define mmu_notifier_range_init(range, mm, start, end) \ _mmu_notifier_range_init(range, start, end) +static inline bool +mmu_notifier_range_blockable(const struct mmu_notifier_range *range) +{ + return true; +} static inline int mm_has_notifiers(struct mm_struct *mm) { Reviewed-by: Ralph Campbell
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On Fri, 2019-02-22 at 20:58 +0300, Andrey Ryabinin wrote: > In the presence of more than one memory cgroup in the system our reclaim > logic just sucks. When we hit a memory limit (global or a limit on a > cgroup with subgroups) we reclaim some memory from all cgroups. > This sucks because the cgroup that allocates more often always > wins. > E.g. a job that allocates a lot of clean, rarely used page cache will > push > out of memory other jobs with a relatively small, all-in-memory > active working set. > > To prevent such situations we have memcg controls like low/max, etc. > which > are supposed to protect jobs or limit them so they do not hurt > others. > But memory cgroups are very hard to configure right because it > requires > precise knowledge of the workload, which may vary during > execution. > E.g. setting a memory limit means that the job won't be able to use all > memory > in the system for page cache even if the rest of the system is idle. > Basically our current scheme requires configuring every single > cgroup > in the system. > > I think we can do better. The idea proposed by this patch is to > reclaim > only inactive pages and only from cgroups that have a big > (!inactive_is_low()) inactive list, and go back to shrinking active > lists > only if all inactive lists are low. Your general idea seems like a good one, but the logic in the code seems a little convoluted to me. I wonder if we can simplify things a little, by checking (when we enter page reclaim) whether the pgdat has enough inactive pages based on the node_page_state statistics, and basing our decision whether or not to scan the active lists off that. As it stands, your patch seems like the kind of code that makes perfect sense today, but which will confuse people who look at the code two years from now. If the code could be made a little more explicit, great. 
If there are good reasons to do things in the fallback way your current patch does it, the code could use some good comments explaining why :) -- All Rights Reversed.
Re: [PATCH v3] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()
On 2/22/19 6:01 AM, Jing Xiangfeng wrote: Thanks, just a couple small changes. > A user can change a node-specific hugetlb count, i.e. > /sys/devices/system/node/node1/hugepages/hugepages-2048kB Please make that, /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages > the calculated value of count is a total number of huge pages. It could > overflow when a user enters an extremely high value. If so, the total > number of huge pages could be a small value, which is not what the user > expects. > We can simply fix it by setting count to ULONG_MAX, then it goes on. This > may be more in line with the user's intention of allocating as many huge > pages as possible. > > Signed-off-by: Jing Xiangfeng > --- > mm/hugetlb.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index afef616..18fa7d7 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2423,7 +2423,10 @@ static ssize_t __nr_hugepages_store_common(bool > obey_mempolicy, >* per node hstate attribute: adjust count to global, >* but restrict alloc/free to the specified node. >*/ > + unsigned long old_count = count; > count += h->nr_huge_pages - h->nr_huge_pages_node[nid]; Also, adding a comment here about checking for overflow would help people reading the code. Something like, /* * If user specified count causes overflow, set to * largest possible value. */ -- Mike Kravetz > + if (count < old_count) > + count = ULONG_MAX; > init_nodemask_of_node(nodes_allowed, nid); > } else > nodes_allowed = &node_states[N_MEMORY]; >
Re: [PATCH] x86/kvm/mmu: fix switch between root and guest MMUs
Paolo Bonzini writes:
> On 22/02/19 17:45, Vitaly Kuznetsov wrote:
>> Commit 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") brought one subtle
>> change: previously, when switching back from L2 to L1, we were resetting
>> MMU hooks (like mmu->get_cr3()) in kvm_init_mmu() called from
>> nested_vmx_load_cr3() and now we do that in nested_ept_uninit_mmu_context()
>> when we re-target the vcpu->arch.mmu pointer.
>> The change itself looks logical: if nested_ept_init_mmu_context() changes
>> something, then nested_ept_uninit_mmu_context() restores it back. There is,
>> however, one thing: the following call chain:
>>
>>  nested_vmx_load_cr3()
>>   kvm_mmu_new_cr3()
>>    __kvm_mmu_new_cr3()
>>     fast_cr3_switch()
>>      cached_root_available()
>>
>> now happens with MMU hooks pointing to the new MMU (root MMU in our case)
>> while previously it was happening with the old one. cached_root_available()
>> tries to stash the current root, but it is incorrect to read the current
>> CR3 with mmu->get_cr3(); we need to use old_mmu->get_cr3(), which in the
>> case where we're switching from L2 to L1 is guest_mmu. (BTW, in the shadow
>> page tables case this is a non-issue because we don't switch MMUs.)
>>
>> While we could have tried to guess that we're switching between MMUs and
>> call the right ->get_cr3() from cached_root_available(), this seems to be
>> overly complicated. Instead, just stash the corresponding CR3 when setting
>> root_hpa and make cached_root_available() use the stashed value.
>>
>> Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
>> Signed-off-by: Vitaly Kuznetsov
>
> Is the bug latent until the other patch? Or are they completely
> separate issues?

This one was reported as https://bugs.launchpad.net/qemu/+bug/1813165 (and I still don't completely understand why it is only SMM which is broken - or is it?). While working on it I noticed that fast_cr3_switch() actually doesn't work (even after this patch, when we stop putting incorrect values in the cache with cached_root_available()).
So in the end they seem to be fairly independent. -- Vitaly
[RFC PATCH 19/20] net: silan: sc92031: Remove stale comment about mmiowb()
mmiowb() is no more. It has ceased to be. It is an ex-barrier. So remove all references to it from comments.

Signed-off-by: Will Deacon
---
 drivers/net/ethernet/silan/sc92031.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/silan/sc92031.c b/drivers/net/ethernet/silan/sc92031.c
index db5dc8ce0aff..02b3962b0e63 100644
--- a/drivers/net/ethernet/silan/sc92031.c
+++ b/drivers/net/ethernet/silan/sc92031.c
@@ -251,7 +251,6 @@ enum PMConfigBits {
  * use of mdelay() at _sc92031_reset.
  * Functions prefixed with _sc92031_ must be called with the lock held;
  * functions prefixed with sc92031_ must be called without the lock held.
- * Use mmiowb() before unlocking if the hardware was written to.
  */

 /* Locking rules for the interrupt:
-- 
2.11.0
[RFC PATCH 17/20] scsi: qla1280: Remove stale comment about mmiowb()
All mmiowb() invocations have been removed, so there's no need to keep banging on about it.

Signed-off-by: Will Deacon
---
 drivers/scsi/qla1280.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/drivers/scsi/qla1280.c b/drivers/scsi/qla1280.c
index 93acbc5094f0..327eff67a1ee 100644
--- a/drivers/scsi/qla1280.c
+++ b/drivers/scsi/qla1280.c
@@ -3363,16 +3363,6 @@ qla1280_isp_cmd(struct scsi_qla_host *ha)

 	/*
 	 * Update request index to mailbox4 (Request Queue In).
-	 * The mmiowb() ensures that this write is ordered with writes by other
-	 * CPUs. Without the mmiowb(), it is possible for the following:
-	 *	CPU A posts write of index 5 to mailbox4
-	 *	CPU A releases host lock
-	 *	CPU B acquires host lock
-	 *	CPU B posts write of index 6 to mailbox4
-	 *	On PCI bus, order reverses and write of 6 posts, then index 5,
-	 *	causing chip to issue full queue of stale commands
-	 *	The mmiowb() prevents future writes from crossing the barrier.
-	 *	See Documentation/driver-api/device-io.rst for more information.
 	 */
 	WRT_REG_WORD(&ha->mailbox4, ha->req_ring_index);
-- 
2.11.0
[RFC PATCH 20/20] arch: Remove dummy mmiowb() definitions from arch code
Now that no driver code is using mmiowb() directly, we can remove the dummy definitions from the architectures that don't make use of asm-generic/io.h Signed-off-by: Will Deacon --- arch/alpha/include/asm/io.h| 2 -- arch/hexagon/include/asm/io.h | 2 -- arch/parisc/include/asm/io.h | 2 -- arch/sparc/include/asm/io_64.h | 2 -- 4 files changed, 8 deletions(-) diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index 4c533fc94d62..ccf9d65166bb 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -513,8 +513,6 @@ extern inline void writeq(u64 b, volatile void __iomem *addr) #define writel_relaxed(b, addr)__raw_writel(b, addr) #define writeq_relaxed(b, addr)__raw_writeq(b, addr) -#define mmiowb() - /* * String version of IO memory access ops: */ diff --git a/arch/hexagon/include/asm/io.h b/arch/hexagon/include/asm/io.h index e17262ad125e..3d0ae09c2b8e 100644 --- a/arch/hexagon/include/asm/io.h +++ b/arch/hexagon/include/asm/io.h @@ -184,8 +184,6 @@ static inline void writel(u32 data, volatile void __iomem *addr) #define writew_relaxed __raw_writew #define writel_relaxed __raw_writel -#define mmiowb() - /* * Need an mtype somewhere in here, for cache type deals? * This is probably too long for an inline. 
diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h index afe493b23d04..b163043d49db 100644 --- a/arch/parisc/include/asm/io.h +++ b/arch/parisc/include/asm/io.h @@ -229,8 +229,6 @@ static inline void writeq(unsigned long long q, volatile void __iomem *addr) #define writel_relaxed(l, addr)writel(l, addr) #define writeq_relaxed(q, addr)writeq(q, addr) -#define mmiowb() do { } while (0) - void memset_io(volatile void __iomem *addr, unsigned char val, int count); void memcpy_fromio(void *dst, const volatile void __iomem *src, int count); void memcpy_toio(volatile void __iomem *dst, const void *src, int count); diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h index b162c23ae8c2..688911051b44 100644 --- a/arch/sparc/include/asm/io_64.h +++ b/arch/sparc/include/asm/io_64.h @@ -396,8 +396,6 @@ static inline void memcpy_toio(volatile void __iomem *dst, const void *src, } } -#define mmiowb() - #ifdef __KERNEL__ /* On sparc64 we have the whole physical IO address space accessible -- 2.11.0
[RFC PATCH 16/20] drivers: Remove explicit invocations of mmiowb()
mmiowb() is now implied by spin_unlock() on architectures that require it, so there is no reason to call it from driver code. This patch was generated using coccinelle: @mmiowb@ @@ - mmiowb(); and invoked as: $ for d in drivers include/linux/qed sound; do \ spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done Signed-off-by: Will Deacon --- drivers/crypto/cavium/nitrox/nitrox_reqmgr.c | 4 --- drivers/dma/txx9dmac.c | 3 --- drivers/firewire/ohci.c| 1 - drivers/gpu/drm/i915/intel_hdmi.c | 10 drivers/ide/tx4939ide.c| 2 -- drivers/infiniband/hw/hfi1/chip.c | 3 --- drivers/infiniband/hw/hfi1/pio.c | 1 - drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 2 -- drivers/infiniband/hw/mlx4/qp.c| 6 - drivers/infiniband/hw/mlx5/qp.c| 1 - drivers/infiniband/hw/mthca/mthca_cmd.c| 6 - drivers/infiniband/hw/mthca/mthca_cq.c | 5 drivers/infiniband/hw/mthca/mthca_qp.c | 17 - drivers/infiniband/hw/mthca/mthca_srq.c| 6 - drivers/infiniband/hw/qedr/verbs.c | 12 - drivers/infiniband/hw/qib/qib_iba6120.c| 4 --- drivers/infiniband/hw/qib/qib_iba7220.c| 3 --- drivers/infiniband/hw/qib/qib_iba7322.c| 3 --- drivers/infiniband/hw/qib/qib_sd7220.c | 4 --- drivers/media/pci/dt3155/dt3155.c | 8 -- drivers/memstick/host/jmb38x_ms.c | 4 --- drivers/misc/ioc4.c| 2 -- drivers/misc/mei/hw-me.c | 3 --- drivers/misc/tifm_7xx1.c | 1 - drivers/mmc/host/alcor.c | 1 - drivers/mmc/host/sdhci.c | 13 -- drivers/mmc/host/tifm_sd.c | 3 --- drivers/mmc/host/via-sdmmc.c | 10 drivers/mtd/nand/raw/r852.c| 2 -- drivers/mtd/nand/raw/txx9ndfmc.c | 1 - drivers/net/ethernet/aeroflex/greth.c | 1 - drivers/net/ethernet/alacritech/slicoss.c | 4 --- drivers/net/ethernet/amazon/ena/ena_com.c | 1 - drivers/net/ethernet/atheros/atlx/atl1.c | 1 - drivers/net/ethernet/atheros/atlx/atl2.c | 1 - drivers/net/ethernet/broadcom/bnx2.c | 4 --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c| 2 -- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h| 4 --- .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c| 1 - 
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 29 -- drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 1 - drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 2 -- drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 4 --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 --- drivers/net/ethernet/broadcom/tg3.c| 6 - .../net/ethernet/cavium/liquidio/cn66xx_device.c | 10 .../net/ethernet/cavium/liquidio/octeon_device.c | 1 - drivers/net/ethernet/cavium/liquidio/octeon_droq.c | 4 --- .../net/ethernet/cavium/liquidio/request_manager.c | 1 - drivers/net/ethernet/intel/e1000/e1000_main.c | 5 drivers/net/ethernet/intel/e1000e/netdev.c | 7 -- drivers/net/ethernet/intel/fm10k/fm10k_iov.c | 2 -- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 5 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 5 drivers/net/ethernet/intel/iavf/iavf_txrx.c| 5 drivers/net/ethernet/intel/ice/ice_txrx.c | 5 drivers/net/ethernet/intel/igb/igb_main.c | 5 drivers/net/ethernet/intel/igbvf/netdev.c | 4 --- drivers/net/ethernet/intel/igc/igc_main.c | 5 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 5 drivers/net/ethernet/marvell/sky2.c| 4 --- drivers/net/ethernet/mellanox/mlx4/catas.c | 4 --- drivers/net/ethernet/mellanox/mlx4/cmd.c | 13 -- drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 1 - drivers/net/ethernet/myricom/myri10ge/myri10ge.c | 2 -- drivers/net/ethernet/neterion/s2io.c | 2 -- drivers/net/ethernet/neterion/vxge/vxge-main.c | 5 drivers/net/ethernet/neterion/vxge/vxge-traffic.c | 4 --- drivers/net/ethernet/qlogic/qed/qed_int.c | 13 -- drivers/net/ethernet/qlogic/qed/qed_spq.c | 3 --- drivers/net/ethernet/qlogic/qede/qede_ethtool.c| 8 -- drivers/net/ethernet/qlogic/qede/qede_fp.c | 8 -- drivers/net/ethernet/qlogic/qla3xxx.c | 1 - drivers/net/ethernet/qlogic/qlge/qlge.h| 1 -
[RFC PATCH 03/20] mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors
Removing explicit calls to mmiowb() from driver code means that we must now call into the generic mmiowb_spin_{lock,unlock}() functions from the core spinlock code. In order to elide barriers following critical sections without any I/O writes, we also hook into the asm-generic I/O routines.

Signed-off-by: Will Deacon
---
 include/asm-generic/io.h        | 3 ++-
 include/linux/spinlock.h        | 11 ++-
 kernel/locking/spinlock_debug.c | 6 +-
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index 303871651f8a..bc490a746602 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -19,6 +19,7 @@
 #include
 #endif

+#include
 #include

 #ifndef mmiowb
@@ -49,7 +50,7 @@

 /* serialize device access against a spin_unlock, usually handled there. */
 #ifndef __io_aw
-#define __io_aw()	barrier()
+#define __io_aw()	mmiowb_set_pending()
 #endif

 #ifndef __io_pbw
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index e089157dcf97..4298b1b31d9b 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include

 /*
@@ -177,6 +178,7 @@ do { \
 static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
 {
 	__acquire(lock);
+	mmiowb_spin_lock();
 	arch_spin_lock(&lock->raw_lock);
 }

@@ -188,16 +190,23 @@ static inline void do_raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long *flags) __acquires(lock)
 {
 	__acquire(lock);
+	mmiowb_spin_lock();
 	arch_spin_lock_flags(&lock->raw_lock, *flags);
 }

 static inline int do_raw_spin_trylock(raw_spinlock_t *lock)
 {
-	return arch_spin_trylock(&(lock)->raw_lock);
+	int ret = arch_spin_trylock(&(lock)->raw_lock);
+
+	if (ret)
+		mmiowb_spin_lock();
+
+	return ret;
 }

 static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 {
+	mmiowb_spin_unlock();
 	arch_spin_unlock(&lock->raw_lock);
 	__release(lock);
 }
diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
index 9aa0fccd5d43..654484b6e70c
100644
--- a/kernel/locking/spinlock_debug.c
+++ b/kernel/locking/spinlock_debug.c
@@ -109,6 +109,7 @@ static inline void debug_spin_unlock(raw_spinlock_t *lock)
  */
 void do_raw_spin_lock(raw_spinlock_t *lock)
 {
+	mmiowb_spin_lock();
 	debug_spin_lock_before(lock);
 	arch_spin_lock(&lock->raw_lock);
 	debug_spin_lock_after(lock);
@@ -118,8 +119,10 @@ int do_raw_spin_trylock(raw_spinlock_t *lock)
 {
 	int ret = arch_spin_trylock(&lock->raw_lock);

-	if (ret)
+	if (ret) {
+		mmiowb_spin_lock();
 		debug_spin_lock_after(lock);
+	}
 #ifndef CONFIG_SMP
 	/*
 	 * Must not happen on UP:
@@ -131,6 +134,7 @@ int do_raw_spin_trylock(raw_spinlock_t *lock)

 void do_raw_spin_unlock(raw_spinlock_t *lock)
 {
+	mmiowb_spin_unlock();
 	debug_spin_unlock(lock);
 	arch_spin_unlock(&lock->raw_lock);
 }
-- 
2.11.0
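For readers following along, the mmiowb_spin_{lock,unlock}() helpers that these hooks call into can be modelled in user-space C. This is a simplified, single-threaded sketch of the asm-generic tracking logic (a nesting count plus a "pending I/O write" flag), not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the per-CPU tracking state (the kernel keeps this in a
 * per-CPU structure; here a single global stands in for it). */
static struct {
	unsigned int nest_count;   /* depth of held spinlocks */
	bool mmiowb_pending;       /* was an MMIO write done under a lock? */
} state;

/* Stand-in for the architecture's real barrier, so we can assert on it. */
static bool barrier_issued;
static void mmiowb(void) { barrier_issued = true; }

/* Called from the I/O accessors (via __io_aw()): remember that an MMIO
 * write happened inside a critical section. */
static void mmiowb_set_pending(void)
{
	if (state.nest_count)
		state.mmiowb_pending = true;
}

/* Called from do_raw_spin_lock() and a successful trylock. */
static void mmiowb_spin_lock(void)
{
	state.nest_count++;
}

/* Called from do_raw_spin_unlock(): issue the barrier only if an MMIO
 * write actually occurred, eliding it for I/O-free critical sections. */
static void mmiowb_spin_unlock(void)
{
	if (state.mmiowb_pending) {
		state.mmiowb_pending = false;
		mmiowb();
	}
	state.nest_count--;
}
```

The counter (rather than a plain flag) is what lets this scheme survive nested locks, which is exactly where the old PowerPC single-flag trick falls down, as the powerpc patch later in the series notes.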
[RFC PATCH 04/20] ARM: io: Remove useless definition of mmiowb()
ARM includes asm-generic/io.h, which provides a dummy definition of mmiowb() if one isn't already provided by the architecture. Remove the useless definition.

Signed-off-by: Will Deacon
---
 arch/arm/include/asm/io.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index 6b51826ab3d1..7e22c81398c4 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -281,8 +281,6 @@ extern void _memcpy_fromio(void *, const volatile void __iomem *, size_t);
 extern void _memcpy_toio(volatile void __iomem *, const void *, size_t);
 extern void _memset_io(volatile void __iomem *, int, size_t);

-#define mmiowb()
-
 /*
  * Memory access primitives
  *
-- 
2.11.0
[RFC PATCH 18/20] i40iw: Redefine i40iw_mmiowb() to do nothing
mmiowb() is now implicit in spin_unlock(), so there's no reason to call it from driver code. Redefine i40iw_mmiowb() to do nothing instead.

Signed-off-by: Will Deacon
---
 drivers/infiniband/hw/i40iw/i40iw_osdep.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_osdep.h b/drivers/infiniband/hw/i40iw/i40iw_osdep.h
index f27be3e7830b..d474aad62a81 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_osdep.h
+++ b/drivers/infiniband/hw/i40iw/i40iw_osdep.h
@@ -211,7 +211,7 @@ enum i40iw_status_code i40iw_hw_manage_vf_pble_bp(struct i40iw_device *iwdev,
 struct i40iw_sc_vsi;
 void i40iw_hw_stats_start_timer(struct i40iw_sc_vsi *vsi);
 void i40iw_hw_stats_stop_timer(struct i40iw_sc_vsi *vsi);
-#define i40iw_mmiowb() mmiowb()
+#define i40iw_mmiowb() do { } while (0)
 void i40iw_wr32(struct i40iw_hw *hw, u32 reg, u32 value);
 u32 i40iw_rd32(struct i40iw_hw *hw, u32 reg);
 #endif	/* _I40IW_OSDEP_H_ */
-- 
2.11.0
[RFC PATCH 15/20] drivers: Remove useless trailing comments from mmiowb() invocations
In preparation for removing mmiowb() from driver code altogether using coccinelle, remove all trailing comments since they won't be picked up by spatch later on and would end up being preserved in the code.

Signed-off-by: Will Deacon
---
 drivers/infiniband/hw/hfi1/chip.c                | 2 +-
 drivers/infiniband/hw/qedr/verbs.c               | 2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h  | 2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 2 +-
 drivers/scsi/bnx2i/bnx2i_hwi.c                   | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index b443642eac02..955bad21a519 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -8352,7 +8352,7 @@ static inline void clear_recv_intr(struct hfi1_ctxtdata *rcd)
 	struct hfi1_devdata *dd = rcd->dd;
 	u32 addr = CCE_INT_CLEAR + (8 * rcd->ireg);

-	mmiowb();	/* make sure everything before is written */
+	mmiowb();
 	write_csr(dd, addr, rcd->imask);
 	/* force the above write on the chip and get a value back */
 	(void)read_csr(dd, addr);
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index e1ccf32b1c3d..23353e0e4bd4 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -3744,7 +3744,7 @@ int qedr_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
 		if (rdma_protocol_iwarp(&dev->ibdev, 1)) {
 			writel(qp->rq.iwarp_db2_data.raw, qp->rq.iwarp_db2);
-			mmiowb();	/* for second doorbell */
+			mmiowb();
 		}

 		wr = wr->next;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index 2462e7aa0c5d..1ed068509337 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -527,7 +527,7 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
 		REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
 			       ((u32 *)&rx_prods)[i]);

-	mmiowb();	/* keep prod updates ordered */
+	mmiowb();
DP(NETIF_MSG_RX_STATUS, "queue[%d]: wrote bd_prod %u cqe_prod %u sge_prod %u\n", diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 3b5b47e98c73..64bc6d6fcd65 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -5244,7 +5244,7 @@ static void bnx2x_update_eq_prod(struct bnx2x *bp, u16 prod) { /* No memory barriers */ storm_memset_eq_prod(bp, prod, BP_FUNC(bp)); - mmiowb(); /* keep prod updates ordered */ + mmiowb(); } static int bnx2x_cnic_handle_cfc_del(struct bnx2x *bp, u32 cid, diff --git a/drivers/scsi/bnx2i/bnx2i_hwi.c b/drivers/scsi/bnx2i/bnx2i_hwi.c index fae6f71e677d..d56a78f411cd 100644 --- a/drivers/scsi/bnx2i/bnx2i_hwi.c +++ b/drivers/scsi/bnx2i/bnx2i_hwi.c @@ -280,7 +280,7 @@ static void bnx2i_ring_sq_dbell(struct bnx2i_conn *bnx2i_conn, int count) } else writew(count, ep->qp.ctx_base + CNIC_SEND_DOORBELL); - mmiowb(); /* flush posted PCI writes */ + mmiowb(); } -- 2.11.0
[RFC PATCH 14/20] Documentation: Kill all references to mmiowb()
The guarantees provided by mmiowb() are now provided implicitly by spin_unlock(), so we can remove all references to this most confusing of barriers from our Documentation. Good riddance.

Signed-off-by: Will Deacon
---
 Documentation/driver-api/device-io.rst  | 45 --
 Documentation/driver-api/pci/p2pdma.rst | 4 --
 Documentation/memory-barriers.txt       | 103 ++--
 3 files changed, 4 insertions(+), 148 deletions(-)

diff --git a/Documentation/driver-api/device-io.rst b/Documentation/driver-api/device-io.rst
index b00b23903078..0e389378f71d 100644
--- a/Documentation/driver-api/device-io.rst
+++ b/Documentation/driver-api/device-io.rst
@@ -103,51 +103,6 @@ continuing execution::
         ha->flags.ints_enabled = 0;
     }

-In addition to write posting, on some large multiprocessing systems
-(e.g. SGI Challenge, Origin and Altix machines) posted writes won't be
-strongly ordered coming from different CPUs. Thus it's important to
-properly protect parts of your driver that do memory-mapped writes with
-locks and use the :c:func:`mmiowb()` to make sure they arrive in the
-order intended. Issuing a regular readX() will also ensure write ordering,
-but should only be used when the
-driver has to be sure that the write has actually arrived at the device
-(not that it's simply ordered with respect to other writes), since a
-full readX() is a relatively expensive operation.
-
-Generally, one should use :c:func:`mmiowb()` prior to releasing a spinlock
-that protects regions using :c:func:`writeb()` or similar functions that
-aren't surrounded by readb() calls, which will ensure ordering
-and flushing. The following pseudocode illustrates what might occur if
-write ordering isn't guaranteed via :c:func:`mmiowb()` or one of the
-readX() functions::
-
-    CPU A:  spin_lock_irqsave(&dev_lock, flags)
-    CPU A:  ...
-    CPU A:  writel(newval, ring_ptr);
-    CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
-            ...
-    CPU B:  spin_lock_irqsave(&dev_lock, flags)
-    CPU B:  writel(newval2, ring_ptr);
-    CPU B:  ...
-    CPU B:  spin_unlock_irqrestore(&dev_lock, flags)
-
-In the case above, newval2 could be written to ring_ptr before newval.
-Fixing it is easy though::
-
-    CPU A:  spin_lock_irqsave(&dev_lock, flags)
-    CPU A:  ...
-    CPU A:  writel(newval, ring_ptr);
-    CPU A:  mmiowb(); /* ensure no other writes beat us to the device */
-    CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
-            ...
-    CPU B:  spin_lock_irqsave(&dev_lock, flags)
-    CPU B:  writel(newval2, ring_ptr);
-    CPU B:  ...
-    CPU B:  mmiowb();
-    CPU B:  spin_unlock_irqrestore(&dev_lock, flags)
-
-See tg3.c for a real world example of how to use :c:func:`mmiowb()`
-
 PCI ordering rules also guarantee that PIO read responses arrive after
 any outstanding DMA writes from that bus, since for some devices the
 result of a readb() call may signal to the driver that a DMA transaction is
diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
index 6d85b5a2598d..44deb52beeb4 100644
--- a/Documentation/driver-api/pci/p2pdma.rst
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -132,10 +132,6 @@ precludes passing these pages to userspace.
 P2P memory is also technically IO memory but should never have any side
 effects behind it. Thus, the order of loads and stores should not be
 important and ioreadX(), iowriteX() and friends should not be necessary.
-However, as the memory is not cache coherent, if access ever needs to
-be protected by a spinlock then :c:func:`mmiowb()` must be used before
-unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
-Documentation/memory-barriers.txt)

 P2P DMA Support Library
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 158947ae78c2..6d6eff413462 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1937,21 +1937,6 @@ There are some more advanced barrier functions:
      information on consistent memory.
-MMIO WRITE BARRIER --- - -The Linux kernel also has a special barrier for use with memory-mapped I/O -writes: - - mmiowb(); - -This is a variation on the mandatory write barrier that causes writes to weakly -ordered I/O regions to be partially ordered. Its effects may go beyond the -CPU->Hardware interface and actually affect the hardware at some level. - -See the subsection "Acquires vs I/O accesses" for more information. - - === IMPLICIT KERNEL MEMORY BARRIERS === @@ -2317,75 +2302,6 @@ But it won't see any of: *E, *F or *G following RELEASE Q - -ACQUIRES VS I/O ACCESSES - - -Under certain circumstances (especially involving NUMA), I/O accesses within -two spinlocked sections on two different CPUs may be seen as interleaved by the -PCI bridge, because the PCI bridge does not necessarily participate in the
[RFC PATCH 12/20] powerpc: mmiowb: Hook up mmiowb() implementation to asm-generic code
In a bid to kill off explicit mmiowb() usage in driver code, hook up the asm-generic mmiowb() tracking code but override all of the functions so that the existing (flawed) implementation is preserved on Power. Future work should either use the asm-generic implementation entirely, or implement a similar counting mechanism to avoid erroneously skipping barrier invocations when they are needed. Signed-off-by: Will Deacon --- arch/powerpc/Kconfig| 1 + arch/powerpc/include/asm/Kbuild | 1 - arch/powerpc/include/asm/io.h | 33 +++-- arch/powerpc/include/asm/mmiowb.h | 41 + arch/powerpc/include/asm/spinlock.h | 17 --- 5 files changed, 45 insertions(+), 48 deletions(-) create mode 100644 arch/powerpc/include/asm/mmiowb.h diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 2890d36eb531..242ee248b922 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -134,6 +134,7 @@ config PPC select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_FORTIFY_SOURCE select ARCH_HAS_GCOV_PROFILE_ALL + select ARCH_HAS_MMIOWB if PPC64 && SMP select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_PMEM_APIif PPC64 select ARCH_HAS_PTE_SPECIAL diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild index 57bd1f6660f4..77ff7fb24823 100644 --- a/arch/powerpc/include/asm/Kbuild +++ b/arch/powerpc/include/asm/Kbuild @@ -8,7 +8,6 @@ generic-y += irq_regs.h generic-y += irq_work.h generic-y += local64.h generic-y += mcs_spinlock.h -generic-y += mmiowb.h generic-y += preempt.h generic-y += rwsem.h generic-y += vtime.h diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 7f19fbd3ba55..828100476ba6 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -34,14 +34,11 @@ extern struct pci_dev *isa_bridge_pcidev; #include #include #include +#include #include #include #include -#ifdef CONFIG_PPC64 -#include -#endif - #define SIO_CONFIG_RA 0x398 #define SIO_CONFIG_RD 0x399 @@ -107,12 +104,6 @@ extern bool isa_io_special; * */ -#ifdef 
CONFIG_PPC64 -#define IO_SET_SYNC_FLAG() do { local_paca->io_sync = 1; } while(0) -#else -#define IO_SET_SYNC_FLAG() -#endif - #define DEF_MMIO_IN_X(name, size, insn)\ static inline u##size name(const volatile u##size __iomem *addr) \ { \ @@ -127,7 +118,7 @@ static inline void name(volatile u##size __iomem *addr, u##size val)\ { \ __asm__ __volatile__("sync;"#insn" %1,%y0" \ : "=Z" (*addr) : "r" (val) : "memory"); \ - IO_SET_SYNC_FLAG(); \ + mmiowb_set_pending(); \ } #define DEF_MMIO_IN_D(name, size, insn)\ @@ -144,7 +135,7 @@ static inline void name(volatile u##size __iomem *addr, u##size val)\ { \ __asm__ __volatile__("sync;"#insn"%U0%X0 %1,%0" \ : "=m" (*addr) : "r" (val) : "memory"); \ - IO_SET_SYNC_FLAG(); \ + mmiowb_set_pending(); \ } DEF_MMIO_IN_D(in_8, 8, lbz); @@ -652,24 +643,6 @@ static inline void name at \ #include -#ifdef CONFIG_PPC32 -#define mmiowb() -#else -/* - * Enforce synchronisation of stores vs. spin_unlock - * (this does it explicitly, though our implementation of spin_unlock - * does it implicitely too) - */ -static inline void mmiowb(void) -{ - unsigned long tmp; - - __asm__ __volatile__("sync; li %0,0; stb %0,%1(13)" - : "=" (tmp) : "i" (offsetof(struct paca_struct, io_sync)) - : "memory"); -} -#endif /* !CONFIG_PPC32 */ - static inline void iosync(void) { __asm__ __volatile__ ("sync" : : : "memory"); diff --git a/arch/powerpc/include/asm/mmiowb.h b/arch/powerpc/include/asm/mmiowb.h new file mode 100644 index ..7647ffcced1c --- /dev/null +++ b/arch/powerpc/include/asm/mmiowb.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_MMIOWB_H +#define _ASM_POWERPC_MMIOWB_H + +#ifdef CONFIG_ARCH_HAS_MMIOWB + +#include +#include +#include + +#define mmiowb() mb() + +/* + * Note: this is broken, as it doesn't correctly handle sequences such as: + * + * spin_lock();// io_sync = 0 + * outb(42, port); // io_sync = 1 + * spin_lock();// io_sync = 0 + * ... + * spin_unlock(); + * spin_unlock(); + * + * The
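The flaw in the single-flag io_sync scheme that this patch's comment describes can be demonstrated with a small user-space model (illustrative names; the real flag lives in the powerpc paca and the barrier is a `sync`):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the flawed single-flag scheme: spin_lock() clears the flag,
 * an I/O write sets it, and spin_unlock() issues a barrier only if the
 * flag is set. A nested spin_lock() clears the flag set by an earlier
 * I/O write, so the outer unlock erroneously skips its barrier.
 */
static bool io_sync;
static int barriers;

static void flag_spin_lock(void)  { io_sync = false; }
static void flag_io_write(void)   { io_sync = true; }

static void flag_spin_unlock(void)
{
	if (io_sync) {
		barriers++;      /* stands in for the real sync/mmiowb() */
		io_sync = false;
	}
}
```

A counting scheme like the asm-generic one (nesting depth plus pending flag) does not lose the barrier in the nested case, which is why the commit message suggests it as the eventual replacement.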
[RFC PATCH 07/20] nds32: io: Remove useless definition of mmiowb()
mmiowb() only makes sense for SMP platforms, so we can remove it entirely for nds32.

Signed-off-by: Will Deacon
---
 arch/nds32/include/asm/io.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/nds32/include/asm/io.h b/arch/nds32/include/asm/io.h
index 71cd226d6863..5ef8ae5ba833 100644
--- a/arch/nds32/include/asm/io.h
+++ b/arch/nds32/include/asm/io.h
@@ -55,8 +55,6 @@ static inline u32 __raw_readl(const volatile void __iomem *addr)
 #define __iormb()	rmb()
 #define __iowmb()	wmb()

-#define mmiowb()	__asm__ __volatile__ ("msync all" : : : "memory");
-
 /*
  * {read,write}{b,w,l,q}_relaxed() are like the regular version, but
  * are not guaranteed to provide ordering against spinlocks or memory
-- 
2.11.0
[RFC PATCH 13/20] riscv: mmiowb: Hook up mmiowb() implementation to asm-generic code
In a bid to kill off explicit mmiowb() usage in driver code, hook up the asm-generic mmiowb() tracking code for riscv, so that an mmiowb() is automatically issued from spin_unlock() if an I/O write was performed in the critical section. Signed-off-by: Will Deacon --- arch/riscv/Kconfig | 1 + arch/riscv/include/asm/Kbuild | 1 - arch/riscv/include/asm/io.h | 15 ++- arch/riscv/include/asm/mmiowb.h | 14 ++ 4 files changed, 17 insertions(+), 14 deletions(-) create mode 100644 arch/riscv/include/asm/mmiowb.h diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 515fc3cc9687..3e7acb997235 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -49,6 +49,7 @@ config RISCV select RISCV_TIMER select GENERIC_IRQ_MULTI_HANDLER select ARCH_HAS_PTE_SPECIAL + select ARCH_HAS_MMIOWB if SMP config MMU def_bool y diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild index 221cd2ec78a4..cccd12cf27d4 100644 --- a/arch/riscv/include/asm/Kbuild +++ b/arch/riscv/include/asm/Kbuild @@ -21,7 +21,6 @@ generic-y += kvm_para.h generic-y += local.h generic-y += local64.h generic-y += mm-arch-hooks.h -generic-y += mmiowb.h generic-y += mutex.h generic-y += percpu.h generic-y += preempt.h diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h index 1d9c1376dc64..744fd92e77bc 100644 --- a/arch/riscv/include/asm/io.h +++ b/arch/riscv/include/asm/io.h @@ -20,6 +20,7 @@ #define _ASM_RISCV_IO_H #include +#include extern void __iomem *ioremap(phys_addr_t offset, unsigned long size); @@ -100,18 +101,6 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) #endif /* - * FIXME: I'm flip-flopping on whether or not we should keep this or enforce - * the ordering with I/O on spinlocks like PowerPC does. The worry is that - * drivers won't get this correct, but I also don't want to introduce a fence - * into the lock code that otherwise only uses AMOs (and is essentially defined - * by the ISA to be correct). 
For now I'm leaving this here: "o,w" is - * sufficient to ensure that all writes to the device have completed before the - * write to the spinlock is allowed to commit. I surmised this from reading - * "ACQUIRES VS I/O ACCESSES" in memory-barriers.txt. - */ -#define mmiowb() __asm__ __volatile__ ("fence o,w" : : : "memory"); - -/* * Unordered I/O memory access primitives. These are even more relaxed than * the relaxed versions, as they don't even order accesses between successive * operations to the I/O regions. @@ -165,7 +154,7 @@ static inline u64 __raw_readq(const volatile void __iomem *addr) #define __io_br() do {} while (0) #define __io_ar(v) __asm__ __volatile__ ("fence i,r" : : : "memory"); #define __io_bw() __asm__ __volatile__ ("fence w,o" : : : "memory"); -#define __io_aw() do {} while (0) +#define __io_aw() mmiowb_set_pending() #define readb(c) ({ u8 __v; __io_br(); __v = readb_cpu(c); __io_ar(__v); __v; }) #define readw(c) ({ u16 __v; __io_br(); __v = readw_cpu(c); __io_ar(__v); __v; }) diff --git a/arch/riscv/include/asm/mmiowb.h b/arch/riscv/include/asm/mmiowb.h new file mode 100644 index ..5d7e3a2b4e3b --- /dev/null +++ b/arch/riscv/include/asm/mmiowb.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _ASM_RISCV_MMIOWB_H +#define _ASM_RISCV_MMIOWB_H + +/* + * "o,w" is sufficient to ensure that all writes to the device have completed + * before the write to the spinlock is allowed to commit. + */ +#define mmiowb() __asm__ __volatile__ ("fence o,w" : : : "memory"); + +#include + +#endif /* ASM_RISCV_MMIOWB_H */ -- 2.11.0
[RFC PATCH 08/20] m68k: io: Remove useless definition of mmiowb()
m68k includes asm-generic/io.h, which provides a dummy definition of mmiowb() if one isn't already provided by the architecture. Remove the useless definition.

Cc: Geert Uytterhoeven
Signed-off-by: Will Deacon
---
 arch/m68k/include/asm/io_mm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/m68k/include/asm/io_mm.h b/arch/m68k/include/asm/io_mm.h
index 782b78f8a048..6c03ca5bc436 100644
--- a/arch/m68k/include/asm/io_mm.h
+++ b/arch/m68k/include/asm/io_mm.h
@@ -377,8 +377,6 @@ static inline void isa_delay(void)
 #define writesw(port, buf, nr)	raw_outsw((port), (u16 *)(buf), (nr))
 #define writesl(port, buf, nr)	raw_outsl((port), (u32 *)(buf), (nr))

-#define mmiowb()
-
 #ifndef CONFIG_SUN3
 #define IO_SPACE_LIMIT 0x
 #else
-- 
2.11.0
[RFC PATCH 00/20] Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())
Hi all, This series attempts to remove explicit mmiowb() calls from driver code, instead adding it to spin_unlock() for architectures that require it. The infamous "PowerPC trick" was found to be flawed, so a correct but probably slower version is made generic for architectures wishing to use it. I've ported RISC-V over to this interface, and hooked up PowerPC to the generic infrastructure but with their existing implementation. Other architectures that require a definition for mmiowb() now have it unconditionally called from their spin_unlock() code. If this is deemed to be expensive, they should be able to do something similar to RISC-V. I used coccinelle to remove the vast majority of the users, and the trivial spatch invocation is included in the commit message of that patch. I've also *build-tested* this for RISC-V, ia64, mips64, sh, powerpc64 and arm64. This is based on a couple of other I/O-related series that I have pending: https://lkml.org/lkml/2019/2/22/484 https://lkml.org/lkml/2019/2/22/500 so I've pushed the whole lot out on a branch for your comfort and convenience: git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git io Feedback welcome, Will Cc: "Paul E. McKenney" Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Arnd Bergmann Cc: Peter Zijlstra Cc: Andrea Parri Cc: Palmer Dabbelt Cc: Daniel Lustig Cc: David Howells Cc: Alan Stern Cc: Linus Torvalds Cc: "Maciej W. 
Rozycki" Cc: Paul Burton Cc: Ingo Molnar Cc: Yoshinori Sato Cc: Rich Felker Cc: Tony Luck --->8 Will Deacon (20): asm-generic/mmiowb: Add generic implementation of mmiowb() tracking arch: Use asm-generic header for asm/mmiowb.h mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors ARM: io: Remove useless definition of mmiowb() arm64: io: Remove useless definition of mmiowb() x86: io: Remove useless definition of mmiowb() nds32: io: Remove useless definition of mmiowb() m68k: io: Remove useless definition of mmiowb() sh: Add unconditional mmiowb() to arch_spin_unlock() mips: Add unconditional mmiowb() to arch_spin_unlock() ia64: Add unconditional mmiowb() to arch_spin_unlock() powerpc: mmiowb: Hook up mmwiob() implementation to asm-generic code riscv: mmiowb: Hook up mmwiob() implementation to asm-generic code Documentation: Kill all references to mmiowb() drivers: Remove useless trailing comments from mmiowb() invocations drivers: Remove explicit invocations of mmiowb() scsi: qla1280: Remove stale comment about mmiowb() i40iw: Redefine i40iw_mmiowb() to do nothing net: silan: sc92031: Remove stale comment about mmiowb() arch: Remove dummy mmiowb() definitions from arch code Documentation/driver-api/device-io.rst | 45 - Documentation/driver-api/pci/p2pdma.rst| 4 - Documentation/memory-barriers.txt | 103 + arch/alpha/include/asm/Kbuild | 1 + arch/alpha/include/asm/io.h| 2 - arch/arc/include/asm/Kbuild| 1 + arch/arm/include/asm/Kbuild| 1 + arch/arm/include/asm/io.h | 2 - arch/arm64/include/asm/Kbuild | 1 + arch/arm64/include/asm/io.h| 2 - arch/c6x/include/asm/Kbuild| 1 + arch/csky/include/asm/Kbuild | 1 + arch/h8300/include/asm/Kbuild | 1 + arch/hexagon/include/asm/Kbuild| 1 + arch/hexagon/include/asm/io.h | 2 - arch/ia64/include/asm/io.h | 4 - arch/ia64/include/asm/mmiowb.h | 12 +++ arch/ia64/include/asm/spinlock.h | 2 + arch/m68k/include/asm/Kbuild | 1 + arch/m68k/include/asm/io_mm.h | 2 - arch/microblaze/include/asm/Kbuild | 1 + 
arch/mips/include/asm/io.h | 3 - arch/mips/include/asm/mmiowb.h | 11 +++ arch/mips/include/asm/spinlock.h | 15 +++ arch/nds32/include/asm/Kbuild | 1 + arch/nds32/include/asm/io.h| 2 - arch/nios2/include/asm/Kbuild | 1 + arch/openrisc/include/asm/Kbuild | 1 + arch/parisc/include/asm/Kbuild | 1 + arch/parisc/include/asm/io.h | 2 - arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/io.h | 33 +-- arch/powerpc/include/asm/mmiowb.h | 41 arch/powerpc/include/asm/spinlock.h| 17 arch/riscv/Kconfig | 1 + arch/riscv/include/asm/io.h| 15 +-- arch/riscv/include/asm/mmiowb.h| 14 +++
[RFC PATCH 02/20] arch: Use asm-generic header for asm/mmiowb.h
Hook up asm-generic/mmiowb.h to Kbuild for all architectures so that we can subsequently include asm/mmiowb.h from core code. Cc: Masahiro Yamada Cc: Arnd Bergmann Signed-off-by: Will Deacon --- arch/alpha/include/asm/Kbuild | 1 + arch/arc/include/asm/Kbuild| 1 + arch/arm/include/asm/Kbuild| 1 + arch/arm64/include/asm/Kbuild | 1 + arch/c6x/include/asm/Kbuild| 1 + arch/csky/include/asm/Kbuild | 1 + arch/h8300/include/asm/Kbuild | 1 + arch/hexagon/include/asm/Kbuild| 1 + arch/ia64/include/asm/Kbuild | 1 + arch/m68k/include/asm/Kbuild | 1 + arch/microblaze/include/asm/Kbuild | 1 + arch/mips/include/asm/Kbuild | 1 + arch/nds32/include/asm/Kbuild | 1 + arch/nios2/include/asm/Kbuild | 1 + arch/openrisc/include/asm/Kbuild | 1 + arch/parisc/include/asm/Kbuild | 1 + arch/powerpc/include/asm/Kbuild| 1 + arch/riscv/include/asm/Kbuild | 1 + arch/s390/include/asm/Kbuild | 1 + arch/sh/include/asm/Kbuild | 1 + arch/sparc/include/asm/Kbuild | 1 + arch/um/include/asm/Kbuild | 1 + arch/unicore32/include/asm/Kbuild | 1 + arch/xtensa/include/asm/Kbuild | 1 + 24 files changed, 24 insertions(+) diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild index dc0ab28baca1..2b4b2fe715bb 100644 --- a/arch/alpha/include/asm/Kbuild +++ b/arch/alpha/include/asm/Kbuild @@ -8,6 +8,7 @@ generic-y += fb.h generic-y += irq_work.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += preempt.h generic-y += sections.h generic-y += trace_clock.h diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild index caa270261521..138f564afc74 100644 --- a/arch/arc/include/asm/Kbuild +++ b/arch/arc/include/asm/Kbuild @@ -14,6 +14,7 @@ generic-y += local.h generic-y += local64.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += msi.h generic-y += parport.h generic-y += percpu.h diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild index 1d66db9c9db5..ee6d155ca372 100644 --- 
a/arch/arm/include/asm/Kbuild +++ b/arch/arm/include/asm/Kbuild @@ -9,6 +9,7 @@ generic-y += kdebug.h generic-y += local.h generic-y += local64.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += msi.h generic-y += parport.h generic-y += preempt.h diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild index 1e17ea5c372b..3dae4fd028cf 100644 --- a/arch/arm64/include/asm/Kbuild +++ b/arch/arm64/include/asm/Kbuild @@ -13,6 +13,7 @@ generic-y += local.h generic-y += local64.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += msi.h generic-y += qrwlock.h generic-y += qspinlock.h diff --git a/arch/c6x/include/asm/Kbuild b/arch/c6x/include/asm/Kbuild index 63b4a1705182..fda53087eabc 100644 --- a/arch/c6x/include/asm/Kbuild +++ b/arch/c6x/include/asm/Kbuild @@ -22,6 +22,7 @@ generic-y += kprobes.h generic-y += local.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += mmu.h generic-y += mmu_context.h generic-y += pci.h diff --git a/arch/csky/include/asm/Kbuild b/arch/csky/include/asm/Kbuild index 2a0abe8f2a35..95f4e550db8a 100644 --- a/arch/csky/include/asm/Kbuild +++ b/arch/csky/include/asm/Kbuild @@ -28,6 +28,7 @@ generic-y += linkage.h generic-y += local.h generic-y += local64.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += module.h generic-y += mutex.h generic-y += pci.h diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild index 961c1dc064e1..86ad35d93861 100644 --- a/arch/h8300/include/asm/Kbuild +++ b/arch/h8300/include/asm/Kbuild @@ -29,6 +29,7 @@ generic-y += local.h generic-y += local64.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += mmu.h generic-y += mmu_context.h generic-y += module.h diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild index b25fd42aa0f4..ff9134a01a13 100644 --- a/arch/hexagon/include/asm/Kbuild +++ 
b/arch/hexagon/include/asm/Kbuild @@ -23,6 +23,7 @@ generic-y += local.h generic-y += local64.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += pci.h generic-y += percpu.h generic-y += preempt.h diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild index 43e21fe3499c..3273d7aedfa0 100644 --- a/arch/ia64/include/asm/Kbuild +++ b/arch/ia64/include/asm/Kbuild @@ -4,6 +4,7 @@ generic-y += exec.h generic-y += irq_work.h generic-y += mcs_spinlock.h generic-y += mm-arch-hooks.h +generic-y += mmiowb.h generic-y += preempt.h generic-y += trace_clock.h generic-y += vtime.h diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild index 95f8f631c4df..2add40f9fdef 100644 --- a/arch/m68k/include/asm/Kbuild +++ b/arch/m68k/include/asm/Kbuild
[RFC PATCH 06/20] x86: io: Remove useless definition of mmiowb()
x86 maps mmiowb() to barrier(), but this is superfluous because a
compiler barrier is already implied by spin_unlock(). Since x86 also
includes asm-generic/io.h in its asm/io.h file, we can remove the
definition entirely and pick up the dummy definition from core code.

Signed-off-by: Will Deacon
---
 arch/x86/include/asm/io.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 686247db3106..a06a9f8294ea 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -90,8 +90,6 @@ build_mmio_write(__writel, "l", unsigned int, "r", )
 #define __raw_writew __writew
 #define __raw_writel __writel
 
-#define mmiowb() barrier()
-
 #ifdef CONFIG_X86_64
 
 build_mmio_read(readq, "q", u64, "=r", :"memory")
-- 
2.11.0
Re: [PATCH] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
Paolo Bonzini writes: > On 22/02/19 17:46, Vitaly Kuznetsov wrote: >> I noticed that fast_cr3_switch() always fails when we switch back from L2 >> to L1 as it is not able to find a cached root. This is odd: host's CR3 >> usually stays the same, we expect to always follow the fast path. Turns >> out the problem is that page role is always mismatched because >> kvm_mmu_get_page() filters out cr4_pae when direct, the value is stored >> in page header and later compared with new_role in cached_root_available(). >> As cr4_pae is always set in long mode prev_roots cache is dysfunctional. > > Really cr4_pae means "are the PTEs 8 bytes". So I think your patch is > correct but on top we should set it to 1 (not zero!!) for > kvm_calc_shadow_ept_root_page_role, init_kvm_nested_mmu and > kvm_calc_tdp_mmu_root_page_role. Or maybe everything breaks with that > change. > Yes, exactly. If we put '1' there kvm_mmu_get_page() will again filter it out and we won't be able to find the root in prev_roots cache :-( >> - Do not clear cr4_pae in kvm_mmu_get_page() and check direct on call sites >> (detect_write_misaligned(), get_written_sptes()). > > These only run with shadow page tables, by the way. > Yes, and that's why I think it may make sense to move the filtering logic there. At least in other cases cr4_pae will always be equal to is_pae(). It seems I know too little about shadow paging and all these corner cases :-( -- Vitaly
[RFC PATCH 09/20] sh: Add unconditional mmiowb() to arch_spin_unlock()
The mmiowb() macro is horribly difficult to use and drivers will
continue to work most of the time if they omit a call when it is
required.

Rather than rely on driver authors getting this right, push mmiowb()
into arch_spin_unlock() for sh. If this is deemed to be a performance
issue, a subsequent optimisation could make use of ARCH_HAS_MMIOWB to
elide the barrier in cases where no I/O writes were performed inside
the critical section.

Cc: Yoshinori Sato
Cc: Rich Felker
Signed-off-by: Will Deacon
---
 arch/sh/include/asm/Kbuild          |  1 -
 arch/sh/include/asm/io.h            |  3 ---
 arch/sh/include/asm/mmiowb.h        | 12 ++++++++++++
 arch/sh/include/asm/spinlock-llsc.h |  2 ++
 4 files changed, 14 insertions(+), 4 deletions(-)
 create mode 100644 arch/sh/include/asm/mmiowb.h

diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index e3f6926c4b86..a6ef3fee5f85 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -13,7 +13,6 @@ generic-y += local.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
-generic-y += mmiowb.h
 generic-y += parport.h
 generic-y += percpu.h
 generic-y += preempt.h
diff --git a/arch/sh/include/asm/io.h b/arch/sh/include/asm/io.h
index 4f7f235f15f8..c28e37a344ad 100644
--- a/arch/sh/include/asm/io.h
+++ b/arch/sh/include/asm/io.h
@@ -229,9 +229,6 @@ __BUILD_IOPORT_STRING(q, u64)
 
 #define IO_SPACE_LIMIT 0x
 
-/* synco on SH-4A, otherwise a nop */
-#define mmiowb() wmb()
-
 /* We really want to try and get these to memcpy etc */
 void memcpy_fromio(void *, const volatile void __iomem *, unsigned long);
 void memcpy_toio(volatile void __iomem *, const void *, unsigned long);
diff --git a/arch/sh/include/asm/mmiowb.h b/arch/sh/include/asm/mmiowb.h
new file mode 100644
index ..535d59735f1d
--- /dev/null
+++ b/arch/sh/include/asm/mmiowb.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SH_MMIOWB_H
+#define __ASM_SH_MMIOWB_H
+
+#include
+
+/* synco on SH-4A, otherwise a nop */
+#define mmiowb() wmb()
+
+#include
+
+#endif /* __ASM_SH_MMIOWB_H */
diff --git a/arch/sh/include/asm/spinlock-llsc.h b/arch/sh/include/asm/spinlock-llsc.h
index 786ee0fde3b0..7fd929cd2e7a 100644
--- a/arch/sh/include/asm/spinlock-llsc.h
+++ b/arch/sh/include/asm/spinlock-llsc.h
@@ -47,6 +47,8 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	unsigned long tmp;
 
+	/* This could be optimised with ARCH_HAS_MMIOWB */
+	mmiowb();
 	__asm__ __volatile__ (
 		"mov	#1, %0	! arch_spin_unlock	\n\t"
 		"mov.l	%0, @%1	\n\t"
-- 
2.11.0
[RFC PATCH 11/20] ia64: Add unconditional mmiowb() to arch_spin_unlock()
The mmiowb() macro is horribly difficult to use and drivers will
continue to work most of the time if they omit a call when it is
required.

Rather than rely on driver authors getting this right, push mmiowb()
into arch_spin_unlock() for ia64. If this is deemed to be a performance
issue, a subsequent optimisation could make use of ARCH_HAS_MMIOWB to
elide the barrier in cases where no I/O writes were performed inside
the critical section.

Signed-off-by: Will Deacon
---
 arch/ia64/include/asm/Kbuild     |  1 -
 arch/ia64/include/asm/io.h       |  4 ----
 arch/ia64/include/asm/mmiowb.h   | 12 ++++++++++++
 arch/ia64/include/asm/spinlock.h |  2 ++
 4 files changed, 14 insertions(+), 5 deletions(-)
 create mode 100644 arch/ia64/include/asm/mmiowb.h

diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index 3273d7aedfa0..43e21fe3499c 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -4,7 +4,6 @@ generic-y += exec.h
 generic-y += irq_work.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
-generic-y += mmiowb.h
 generic-y += preempt.h
 generic-y += trace_clock.h
 generic-y += vtime.h
diff --git a/arch/ia64/include/asm/io.h b/arch/ia64/include/asm/io.h
index 1e6fef69bb01..7f2371ba04a4 100644
--- a/arch/ia64/include/asm/io.h
+++ b/arch/ia64/include/asm/io.h
@@ -119,8 +119,6 @@ extern int valid_mmap_phys_addr_range (unsigned long pfn, size_t count);
  * Ensure ordering of I/O space writes. This will make sure that writes
  * following the barrier will arrive after all previous writes. For most
  * ia64 platforms, this is a simple 'mf.a' instruction.
- *
- * See Documentation/driver-api/device-io.rst for more information.
  */
 static inline void ___ia64_mmiowb(void)
 {
@@ -296,7 +294,6 @@ __outsl (unsigned long port, const void *src, unsigned long count)
 #define __outb		platform_outb
 #define __outw		platform_outw
 #define __outl		platform_outl
-#define __mmiowb	platform_mmiowb
 
 #define inb(p)		__inb(p)
 #define inw(p)		__inw(p)
@@ -310,7 +307,6 @@ __outsl (unsigned long port, const void *src, unsigned long count)
 #define outsb(p,s,c)	__outsb(p,s,c)
 #define outsw(p,s,c)	__outsw(p,s,c)
 #define outsl(p,s,c)	__outsl(p,s,c)
-#define mmiowb()	__mmiowb()
 
 /*
  * The address passed to these functions are ioremap()ped already.
diff --git a/arch/ia64/include/asm/mmiowb.h b/arch/ia64/include/asm/mmiowb.h
new file mode 100644
index ..238d56172c6f
--- /dev/null
+++ b/arch/ia64/include/asm/mmiowb.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ASM_IA64_MMIOWB_H
+#define _ASM_IA64_MMIOWB_H
+
+#include
+
+#define mmiowb() platform_mmiowb()
+
+#include
+
+#endif /* _ASM_IA64_MMIOWB_H */
diff --git a/arch/ia64/include/asm/spinlock.h b/arch/ia64/include/asm/spinlock.h
index afd0b3121b4c..5f620e66384e 100644
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -73,6 +73,8 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 {
 	unsigned short *p = (unsigned short *)&lock->lock + 1, tmp;
 
+	/* This could be optimised with ARCH_HAS_MMIOWB */
+	mmiowb();
 	asm volatile ("ld2.bias %0=[%1]" : "=r"(tmp) : "r"(p));
 	WRITE_ONCE(*p, (tmp + 2) & ~1);
 }
-- 
2.11.0
[RFC PATCH 10/20] mips: Add unconditional mmiowb() to arch_spin_unlock()
The mmiowb() macro is horribly difficult to use and drivers will
continue to work most of the time if they omit a call when it is
required.

Rather than rely on driver authors getting this right, push mmiowb()
into arch_spin_unlock() for mips. If this is deemed to be a performance
issue, a subsequent optimisation could make use of ARCH_HAS_MMIOWB to
elide the barrier in cases where no I/O writes were performed inside
the critical section.

Signed-off-by: Will Deacon
---
 arch/mips/include/asm/Kbuild     |  1 -
 arch/mips/include/asm/io.h       |  3 ---
 arch/mips/include/asm/mmiowb.h   | 11 +++++++++++
 arch/mips/include/asm/spinlock.h | 15 +++++++++++++++
 4 files changed, 26 insertions(+), 4 deletions(-)
 create mode 100644 arch/mips/include/asm/mmiowb.h

diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 5653b1e47dd0..f15d5db5dd67 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -13,7 +13,6 @@ generic-y += irq_work.h
 generic-y += local64.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
-generic-y += mmiowb.h
 generic-y += msi.h
 generic-y += parport.h
 generic-y += percpu.h
diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 845fbbc7a2e3..29997e42480e 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -102,9 +102,6 @@ static inline void set_io_port_base(unsigned long base)
 #define iobarrier_w() wmb()
 #define iobarrier_sync() iob()
 
-/* Some callers use this older API instead. */
-#define mmiowb() iobarrier_w()
-
 /*
  * virt_to_phys	-	map virtual addresses to physical
  * @address: address to remap
diff --git a/arch/mips/include/asm/mmiowb.h b/arch/mips/include/asm/mmiowb.h
new file mode 100644
index ..a40824e3ef8e
--- /dev/null
+++ b/arch/mips/include/asm/mmiowb.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_MMIOWB_H
+#define _ASM_MMIOWB_H
+
+#include
+
+#define mmiowb() iobarrier_w()
+
+#include
+
+#endif /* _ASM_MMIOWB_H */
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index ee81297d9117..8a88eb265516 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -11,6 +11,21 @@
 
 #include
 #include
+
+#include
+
+#define queued_spin_unlock queued_spin_unlock
+/**
+ * queued_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ */
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+	/* This could be optimised with ARCH_HAS_MMIOWB */
+	mmiowb();
+	smp_store_release(&lock->locked, 0);
+}
+
 #include
 
 #endif /* _ASM_SPINLOCK_H */
-- 
2.11.0
[RFC PATCH 05/20] arm64: io: Remove useless definition of mmiowb()
arm64 includes asm-generic/io.h, which provides a dummy definition of
mmiowb() if one isn't already provided by the architecture. Remove the
useless definition.

Signed-off-by: Will Deacon
---
 arch/arm64/include/asm/io.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8bb7210ac286..b807cb9b517d 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -124,8 +124,6 @@ static inline u64 __raw_readq(const volatile void __iomem *addr)
 #define __io_par(v)	__iormb(v)
 #define __iowmb()	wmb()
 
-#define mmiowb()	do { } while (0)
-
 /*
  * Relaxed I/O memory access primitives. These follow the Device memory
  * ordering rules but do not guarantee any ordering relative to Normal memory
-- 
2.11.0
[RFC PATCH 01/20] asm-generic/mmiowb: Add generic implementation of mmiowb() tracking
In preparation for removing all explicit mmiowb() calls from driver
code, implement a tracking system in asm-generic based on the PowerPC
implementation. This allows architectures with a non-empty mmiowb()
definition to automatically have the barrier inserted in spin_unlock()
following a critical section containing an I/O write.

Signed-off-by: Will Deacon
---
 include/asm-generic/mmiowb.h | 60 ++++++++++++++++++++++++++++++++++++
 kernel/Kconfig.locks         |  3 +++
 kernel/locking/spinlock.c    |  5 ++++
 3 files changed, 68 insertions(+)
 create mode 100644 include/asm-generic/mmiowb.h

diff --git a/include/asm-generic/mmiowb.h b/include/asm-generic/mmiowb.h
new file mode 100644
index ..1cec8907806f
--- /dev/null
+++ b/include/asm-generic/mmiowb.h
@@ -0,0 +1,60 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_H
+#define __ASM_GENERIC_MMIOWB_H
+
+/*
+ * Generic implementation of mmiowb() tracking for spinlocks.
+ *
+ * If your architecture doesn't ensure that writes to an I/O peripheral
+ * within two spinlocked sections on two different CPUs are seen by the
+ * peripheral in the order corresponding to the lock handover, then you
+ * need to follow these FIVE easy steps:
+ *
+ *	1. Implement mmiowb() in asm/mmiowb.h and then #include this file
+ *	2. Ensure your I/O write accessors call mmiowb_set_pending()
+ *	3. Select ARCH_HAS_MMIOWB
+ *	4. Untangle the resulting mess of header files
+ *	5. Complain to your architects
+ */
+#if defined(CONFIG_ARCH_HAS_MMIOWB) && defined(CONFIG_SMP)
+
+#include
+#include
+#include
+
+struct mmiowb_state {
+	u16	nesting_count;
+	u16	mmiowb_pending;
+};
+DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+
+#ifndef mmiowb_set_pending
+static inline void mmiowb_set_pending(void)
+{
+	__this_cpu_write(__mmiowb_state.mmiowb_pending, 1);
+}
+#endif
+
+#ifndef mmiowb_spin_lock
+static inline void mmiowb_spin_lock(void)
+{
+	if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1)
+		__this_cpu_write(__mmiowb_state.mmiowb_pending, 0);
+}
+#endif
+
+#ifndef mmiowb_spin_unlock
+static inline void mmiowb_spin_unlock(void)
+{
+	if (__this_cpu_xchg(__mmiowb_state.mmiowb_pending, 0))
+		mmiowb();
+	__this_cpu_dec_return(__mmiowb_state.nesting_count);
+}
+#endif
+
+#else
+#define mmiowb_set_pending()	do { } while (0)
+#define mmiowb_spin_lock()	do { } while (0)
+#define mmiowb_spin_unlock()	do { } while (0)
+#endif	/* CONFIG_ARCH_HAS_MMIOWB && CONFIG_SMP */
+#endif	/* __ASM_GENERIC_MMIOWB_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..04976ae41176 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -248,3 +248,6 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config ARCH_HAS_MMIOWB
+	bool
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 936f3d14dd6b..cbae365d7dd1 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -22,6 +22,11 @@
 #include
 #include
 
+#ifdef CONFIG_ARCH_HAS_MMIOWB
+DEFINE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
+#endif
+
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
-- 
2.11.0
Re: [PATCH] i2c: Allow recovery of the initial IRQ by an I2C client device.
On Fri, Feb 22, 2019 at 3:32 AM Wolfram Sang wrote: > > > > > > But I still have the feeling that the problem is not solved at the > > > > right place. In i2c_new_device() we are storing parts of the fields of > > > > struct i2c_board_info, and when resetting the irq we are losing > > > > information. This patch solves that, but I wonder if the IRQ should > > > > not be 'simply' set in i2c_device_probe(). This means we also need to > > > > store the .resources of info, but I have a feeling this will be less > > > > error prone in the future. > > > > > > > > But this is just my guts telling me something is not right. I would > > > > perfectly understand if we want to get this merged ASAP. > > > > > > > > So given that the code is correct, this is my: > > > > Reviewed-by: Benjamin Tissoires > > > > > > > > But at least I have expressed my feelings :) > > > > > > Which I can relate to very much. I see the code solves the issue but my > > > feeling is that we are patching around something which should be handled > > > differently in general. > > > > > > Is somebody willing to research this further? > > > > > > Thanks for your input. > > > > > > > I would be willing to have more of a look at it but am slightly > > nervous I am not right person as all the systems I currently work > > with are DT based so don't really exemplify the issue at all. > > Thank you! Well, I'll be there, too. Discussing, reviewing, testing. And > if we have Benjamin for that on board as well, then I think we have a > good start for that task :) Others are more than welcome to join, too, > of course. > I'm also more familiar with device-tree (just came across this on my personal laptop) but happy to review and test at the risk of learning something.
Re: [PATCHv6 07/10] acpi/hmat: Register processor domain to its memory
On Wed, Feb 20, 2019 at 11:02:01PM +0100, Rafael J. Wysocki wrote: > On Thu, Feb 14, 2019 at 6:10 PM Keith Busch wrote: > > config ACPI_HMAT > > bool "ACPI Heterogeneous Memory Attribute Table Support" > > depends on ACPI_NUMA > > + select HMEM_REPORTING > > If you want to do this here, I'm not sure that defining HMEM_REPORTING > as a user-selectable option is a good idea. In particular, I don't > really think that setting ACPI_HMAT without it makes a lot of sense. > Apart from this, the patch looks reasonable to me. I'm trying to implement based on the feedback, but I'm a little confused. As I have it at the moment, HMEM_REPORTING is not user-prompted, so another option needs to turn it on. I have ACPI_HMAT do that here. So when you say it's a bad idea to make HMEM_REPORTING user selectable, isn't it already not user selectable? If I do it the other way around, that's going to make HMEM_REPORTING complicated if a non-ACPI implementation wants to report HMEM properties.
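For readers following the Kconfig discussion: the distinction hinges on whether a bool symbol carries a prompt string. A sketch of the two shapes (symbol names taken from the thread; help text and surrounding options omitted, so treat this as illustrative rather than the actual Kconfig):

```kconfig
# Shape under discussion: no prompt string, so HMEM_REPORTING is not
# user-selectable and only gets enabled when another symbol selects it.
config HMEM_REPORTING
	bool

config ACPI_HMAT
	bool "ACPI Heterogeneous Memory Attribute Table Support"
	depends on ACPI_NUMA
	select HMEM_REPORTING

# Alternative shape: adding a prompt would make it user-selectable,
# which is what the review comment advises against when ACPI_HMAT
# selects it anyway:
#
# config HMEM_REPORTING
#	bool "Heterogeneous memory attribute reporting"
```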
[PATCH] arm64: dts: rockchip: Add support for the Orange Pi RK3399 board.
This adds basic support for the Orange Pi RK3399 board. What works: - SD card / emmc. - Debug UART - Ethernet - USB: Type C, internal USB3 for SATA, 4 USB 2.0 ports - Sensors: All of them but the Hall sensor. - Buttons - Wifi, Bluetooth - HDMI out Signed-off-by: Alexis Ballier Cc: devicet...@vger.kernel.org Cc: Heiko Stuebner Cc: linux-arm-ker...@lists.infradead.org Cc: linux-rockc...@lists.infradead.org Cc: linux-kernel@vger.kernel.org --- The board is heavily based on the Sapphire one, hence this patch is also based on this DT. This is why I kept the copyrights of the sapphire DT. --- .../devicetree/bindings/arm/rockchip.yaml | 5 + arch/arm64/boot/dts/rockchip/Makefile | 1 + .../boot/dts/rockchip/rk3399-orangepi.dts | 778 ++ 3 files changed, 784 insertions(+) create mode 100644 arch/arm64/boot/dts/rockchip/rk3399-orangepi.dts diff --git a/Documentation/devicetree/bindings/arm/rockchip.yaml b/Documentation/devicetree/bindings/arm/rockchip.yaml index eb4d7f356543..58b075fcd78d 100644 --- a/Documentation/devicetree/bindings/arm/rockchip.yaml +++ b/Documentation/devicetree/bindings/arm/rockchip.yaml @@ -309,6 +309,11 @@ properties: - const: phytec,rk3288-phycore-som - const: rockchip,rk3288 + - description: Orange Pi RK3399 board +items: + - const: rockchip,rk3399-orangepi + - const: rockchip,rk3399 + - description: Pine64 Rock64 items: - const: pine64,rock64 diff --git a/arch/arm64/boot/dts/rockchip/Makefile b/arch/arm64/boot/dts/rockchip/Makefile index 1b28fa72ea0b..25454e2cf2de 100644 --- a/arch/arm64/boot/dts/rockchip/Makefile +++ b/arch/arm64/boot/dts/rockchip/Makefile @@ -18,6 +18,7 @@ dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-gru-scarlet-inx.dtb dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-gru-scarlet-kd.dtb dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopc-t4.dtb dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-m4.dtb +dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-orangepi.dtb dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-puma-haikou.dtb dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-roc-pc.dtb 
dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-rock-pi-4.dtb diff --git a/arch/arm64/boot/dts/rockchip/rk3399-orangepi.dts b/arch/arm64/boot/dts/rockchip/rk3399-orangepi.dts new file mode 100644 index ..e33f0c8ce2c9 --- /dev/null +++ b/arch/arm64/boot/dts/rockchip/rk3399-orangepi.dts @@ -0,0 +1,778 @@ +// SPDX-License-Identifier: (GPL-2.0+ OR MIT) +/* + * Copyright (c) 2017 Fuzhou Rockchip Electronics Co., Ltd. + */ + +/dts-v1/; + +#include "dt-bindings/pwm/pwm.h" +#include "dt-bindings/input/input.h" +#include "rk3399.dtsi" +#include "rk3399-opp.dtsi" + +/ { + model = "Orange Pi RK3399 Board"; + compatible = "rockchip,rk3399-orangepi", "rockchip,rk3399"; + + chosen { + stdout-path = "serial2:150n8"; + }; + + clkin_gmac: external-gmac-clock { + compatible = "fixed-clock"; + clock-frequency = <12500>; + clock-output-names = "clkin_gmac"; + #clock-cells = <0>; + }; + + dc_12v: dc-12v { + compatible = "regulator-fixed"; + regulator-name = "dc_12v"; + regulator-always-on; + regulator-boot-on; + regulator-min-microvolt = <1200>; + regulator-max-microvolt = <1200>; + }; + + keys: gpio-keys { + compatible = "gpio-keys"; + autorepeat; + + power { + debounce-interval = <100>; + gpios = < RK_PA5 GPIO_ACTIVE_LOW>; + label = "GPIO Power"; + linux,code = ; + linux,input-type = <1>; + pinctrl-names = "default"; + pinctrl-0 = <_btn>; + wakeup-source; + }; + }; + + adc-keys { + compatible = "adc-keys"; + io-channels = < 1>; + io-channel-names = "buttons"; + keyup-threshold-microvolt = <180>; + poll-interval = <100>; + + button-up { + label = "Volume Up"; + linux,code = ; + press-threshold-microvolt = <10>; + }; + + button-down { + label = "Volume Down"; + linux,code = ; + press-threshold-microvolt = <30>; + }; + + back { + label = "Back"; + linux,code = ; + press-threshold-microvolt = <985000>; + }; + + menu { + label = "Menu"; + linux,code = ; + press-threshold-microvolt = <1314000>; + }; + }; + + sdio_pwrseq:
Re: [PATCH net-next 1/7] net: phy: marvell10g: Use get_features to get the PHY abilities
On 21.02.2019 10:51, Maxime Chevallier wrote: > The Alaska family of 10G PHYs has more abilities than the ones listed in > PHY_10GBIT_FULL_FEATURES, the exact list depending on the model. > > Make use of the newly introduced .get_features call to build this list, > using genphy_c45_pma_read_abilities to build the list of supported > linkmodes, and adding autoneg ability based on what's reported by the AN > MMD. > > .config_init is still used to validate the interface_mode. > > Signed-off-by: Maxime Chevallier > --- > drivers/net/phy/marvell10g.c | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/phy/marvell10g.c b/drivers/net/phy/marvell10g.c > index 9ea27acf05ad..65ef469adf58 100644 > --- a/drivers/net/phy/marvell10g.c > +++ b/drivers/net/phy/marvell10g.c > @@ -233,8 +233,6 @@ static int mv3310_resume(struct phy_device *phydev) > > static int mv3310_config_init(struct phy_device *phydev) > { > - int ret, val; > - > /* Check that the PHY interface type is compatible */ > if (phydev->interface != PHY_INTERFACE_MODE_SGMII && > phydev->interface != PHY_INTERFACE_MODE_XAUI && > @@ -242,6 +240,12 @@ static int mv3310_config_init(struct phy_device *phydev) > phydev->interface != PHY_INTERFACE_MODE_10GKR) > return -ENODEV; > > + return 0; > +} > + > +static int mv3310_get_features(struct phy_device *phydev) > +{ After my just submitted patch to include the aneg capability checking in genphy_c45_pma_read_abilities() function mv3310_get_features() isn't needed any longer and can be replaced with the generic one. But we can make this change afterwards, then you don't have to rework your series. Also I'm not sure whether there will be a 5.0-rc8 or whether beginning of next week we'll see 5.0. In the latter case we're a little bit in a hurry because the merge window will start very soon. 
> +	int ret, val;
>
>  	if (phydev->c45_ids.devices_in_package & MDIO_DEVS_AN) {
>  		val = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_STAT1);
>  		if (val < 0)
> @@ -429,7 +433,7 @@ static struct phy_driver mv3310_drivers[] = {
>  		.phy_id		= 0x002b09aa,
>  		.phy_id_mask	= MARVELL_PHY_ID_MASK,
>  		.name		= "mv88x3310",
> -		.features	= PHY_10GBIT_FEATURES,
> +		.get_features	= mv3310_get_features,
>  		.soft_reset	= gen10g_no_soft_reset,
>  		.config_init	= mv3310_config_init,
>  		.probe		= mv3310_probe,
>
Re: [GIT PULL] clk fixes for v5.0-rc7
The pull request you sent on Thu, 21 Feb 2019 13:57:21 -0800: > https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git > tags/clk-fixes-for-linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/a3504f7a38233030def726fcfe692e786ab162df Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [GIT PULL] Please pull RDMA subsystem changes
The pull request you sent on Thu, 21 Feb 2019 23:07:20 +: > git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/168bd29830e8ebbffcd70d2af50249dca088e1a8 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [git pull] drm fixes for 5.0-rc8 (or final)
The pull request you sent on Fri, 22 Feb 2019 10:13:42 +1000: > git://anongit.freedesktop.org/drm/drm tags/drm-fixes-2019-02-22 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/6ee2846cb4e7c6e8acdcb265299ad1c6de42b437 Thank you! -- Deet-doot-dot, I am a bot. https://korg.wiki.kernel.org/userdoc/prtracker
Re: [PATCH v3 7/9] dt-bindings: serial: Add Milbeaut serial driver description
On Wed, 20 Feb 2019 16:44:37 +0900, Sugaya Taichi wrote: > Add DT bindings document for Milbeaut serial driver. > > Signed-off-by: Sugaya Taichi > --- > .../devicetree/bindings/serial/milbeaut-uart.txt| 21 > + > 1 file changed, 21 insertions(+) > create mode 100644 Documentation/devicetree/bindings/serial/milbeaut-uart.txt > Reviewed-by: Rob Herring
Re: [PATCH] Bluetooth: btusb: request wake pin with NOAUTOEN
On Thu, Feb 21, 2019 at 04:30:51PM -0800, Brian Norris wrote: > Badly-designed systems might have (for example) active-high wake pins > that default to high (e.g., because of external pull ups) until they > have an active firmware which starts driving it low. This can cause an > interrupt storm in the time between request_irq() and disable_irq(). > > We don't support shared interrupts here, so let's just pre-configure the > interrupt to avoid auto-enabling it. > > Fixes: fd913ef7ce61 ("Bluetooth: btusb: Add out-of-band wakeup support") > Signed-off-by: Brian Norris > --- > drivers/bluetooth/btusb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c > index 4761499db9ee..470ee68555d9 100644 > --- a/drivers/bluetooth/btusb.c > +++ b/drivers/bluetooth/btusb.c > @@ -2885,6 +2885,7 @@ static int btusb_config_oob_wake(struct hci_dev *hdev) > return 0; > } > > + irq_set_status_flags(irq, IRQ_NOAUTOEN); > ret = devm_request_irq(>dev, irq, btusb_oob_wake_handler, > 0, "OOB Wake-on-BT", data); > if (ret) { > @@ -2899,7 +2900,6 @@ static int btusb_config_oob_wake(struct hci_dev *hdev) > } > > data->oob_wake_irq = irq; > - disable_irq(irq); > bt_dev_info(hdev, "OOB Wake-on-BT configured at IRQ %u", irq); > return 0; > } Reviewed-by: Matthias Kaehlcke
Re: [PATCH v3 00/20] Merge text_poke fixes and executable lockdowns
On Fri, 2019-02-22 at 17:14 +0100, Borislav Petkov wrote: > On Thu, Feb 21, 2019 at 03:44:31PM -0800, Rick Edgecombe wrote: > > Changes v2 to v3: > > - Fix commit messages and comments [Boris] > > - Rename VM_HAS_SPECIAL_PERMS [Boris] > > - Remove unnecessary local variables [Boris] > > - Rename set_alias_*() functions [Boris, Andy] > > - Save/restore DR registers when using temporary mm > > - Move line deletion from patch 10 to patch 17 > > In your previous submission there was a patch called > > Subject: [PATCH v2 01/20] Fix "x86/alternatives: Lockdep-enforce text_mutex in > text_poke*()" > > What happened to it? > > It did introduce a function text_poke_kgdb(), a.o., and I see this > function in the diff contexts in some of the patches in this submission > so it looks to me like you missed that first patch when submitting v3? > > Or am *I* missing something? > > Thx. > Oh, you are right! Sorry about that. I'll just send a new version with fixes for other comments instead of a resend of this one. Thanks, Rick
Re: [PATCH 3/3] arm64: dts: rockchip: move QCA6174A wakeup pin into its USB node
On Thu, Feb 21, 2019 at 04:34:03PM -0800, Brian Norris wrote: > Currently, we don't coordinate BT USB activity with our handling of the > BT out-of-band wake pin, and instead just use gpio-keys. That causes > problems because we have no way of distinguishing wake activity due to a > BT device (e.g., mouse) vs. the BT controller (e.g., re-configuring wake > mask before suspend). This can cause spurious wake events just because > we, for instance, try to reconfigure the host controller's event mask > before suspending. > > We can avoid these synchronization problems by handling the BT wake pin > directly in the btusb driver -- for all activity up until BT controller > suspend(), we simply listen to normal USB activity (e.g., to know the > difference between device and host activity); once we're really ready to > suspend the host controller, there should be no more host activity, and > only *then* do we unmask the GPIO interrupt. > > This is already supported by btusb; we just need to describe the wake > pin in the right node. > > We list 2 compatible properties, since both PID/VID pairs show up on > Scarlet devices, and they're both essentially identical QCA6174A-based > modules. > > Also note that the polarity was wrong before: Qualcomm implemented WAKE > as active high, not active low. We only got away with this because > gpio-keys always reconfigured us as bi-directional edge-triggered. > > Finally, we have an external pull-up and a level-shifter on this line > (we didn't notice Qualcomm's polarity in the initial design), so we > can't do pull-down. Switch to pull-none. 
> > Signed-off-by: Brian Norris > --- > This patch is also required to make this stable, but since it's not > really tied to the device tree, and it's an existing bug, I sent it > separately: > > https://lore.kernel.org/patchwork/patch/1044896/ > Subject: Bluetooth: btusb: request wake pin with NOAUTOEN > > .../dts/rockchip/rk3399-gru-chromebook.dtsi | 13 ++ > .../boot/dts/rockchip/rk3399-gru-scarlet.dtsi | 46 --- > arch/arm64/boot/dts/rockchip/rk3399-gru.dtsi | 13 -- > 3 files changed, 42 insertions(+), 30 deletions(-) > > diff --git a/arch/arm64/boot/dts/rockchip/rk3399-gru-chromebook.dtsi > b/arch/arm64/boot/dts/rockchip/rk3399-gru-chromebook.dtsi > index c400be64170e..931640e9aed4 100644 > --- a/arch/arm64/boot/dts/rockchip/rk3399-gru-chromebook.dtsi > +++ b/arch/arm64/boot/dts/rockchip/rk3399-gru-chromebook.dtsi > @@ -200,6 +200,19 @@ > pinctrl-0 = <_en>; > pwm-delay-us = <1>; > }; > + > + gpio_keys: gpio-keys { > + compatible = "gpio-keys"; > + pinctrl-names = "default"; > + pinctrl-0 = <_host_wake_l>; > + > + wake_on_bt: wake-on-bt { > + label = "Wake-on-Bluetooth"; > + gpios = < 3 GPIO_ACTIVE_LOW>; > + linux,code = ; > + wakeup-source; > + }; > + }; > }; > > _bigcpu { > diff --git a/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi > b/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi > index fc50b3ef758c..3e2196c08473 100644 > --- a/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi > +++ b/arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet.dtsi > @@ -175,6 +175,21 @@ > pinctrl-0 = <_en>; > wakeup-delay-ms = <250>; > }; > + > + gpio_keys: gpio-keys { > + compatible = "gpio-keys"; > + pinctrl-names = "default"; > + pinctrl-0 = <_eject_odl>; > + > + pen-insert { > + label = "Pen Insert"; > + /* Insert = low, eject = high */ > + gpios = < 1 GPIO_ACTIVE_LOW>; > + linux,code = ; > + linux,input-type = ; > + wakeup-source; > + }; > + }; > }; > > /* pp900_s0 aliases */ > @@ -328,20 +343,6 @@ camera: { > <4>; > }; > > -_keys { > - pinctrl-names = "default"; > 
- pinctrl-0 = <_host_wake_l>, <_eject_odl>; > - > - pen-insert { > - label = "Pen Insert"; > - /* Insert = low, eject = high */ > - gpios = < 1 GPIO_ACTIVE_LOW>; > - linux,code = ; > - linux,input-type = ; > - wakeup-source; > - }; > -}; > - > _tunnel { > google,remote-bus = <0>; > }; > @@ -437,8 +438,19 @@ camera: { > status = "okay"; > }; > > -_on_bt { > - gpios = < 2 GPIO_ACTIVE_LOW>; > +_host0_ohci { > + #address-cells = <1>; > + #size-cells = <0>; > + > + qca_bt: bt@1 { > + compatible = "usb0cf3,e300", "usb04ca,301a"; > + reg = <1>; > + pinctrl-names = "default"; > + pinctrl-0 = <_host_wake_l>; > + interrupt-parent = <>; > + interrupts = <2 IRQ_TYPE_LEVEL_HIGH>; > + interrupt-names =
Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault
On Fri, Feb 22, 2019 at 9:48 AM Andy Lutomirski wrote: > > > On Feb 22, 2019, at 9:43 AM, Linus Torvalds > > wrote: > > > > Then we should still probably fix up "__probe_kernel_read()" to not > > allow user accesses. The easiest way to do that is actually likely to > > use the "unsafe_get_user()" functions *without* doing a > > uaccess_begin(), which will mean that modern CPU's will simply fault > > on a kernel access to user space. > > > > The nice thing about that is that usually developers will have access > > to exactly those modern boxes, so the people who notice that it > > doesn't work are the right people. > > We use probe_kernel_read() from oops code. I’d rather it return -EFAULT than > oops harder and kill the first oops. It would still do that. Using the unsafe_get_user() macros doesn't remove the exception handling, and we wouldn't remove the whole "pagefault_disable()" either. So it would work exactly the same way it does now, except on a modern CPU it would return -EFAULT for a user space access due to AC not being set. So no "oops harder", only "safer accesses". Linus
[PATCH] kasan: prevent recursive instrumentation
When CONFIG_KASAN is selected, <linux/kasan-checks.h> defines the prototypes for kasan_check_{read,write}(), rather than inline stubs. This is the case even when it is included by files which are not instrumented by KASAN.

Where helpers (e.g. the atomics) are instrumented with explicit calls to kasan_check_{read,write}(), this results in files being unexpectedly instrumented. This is problematic as it can result in instrumentation in files which cannot safely call these functions, such as within the EFI stub:

[mark@lakrids:~/src/linux]% usekorg 8.1.0 make ARCH=arm64 CROSS_COMPILE=aarch64-linux- -j64 -s
drivers/firmware/efi/libstub/arm-stub.stub.o: In function `atomic_set':
/home/mark/src/linux/./include/asm-generic/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'
/home/mark/src/linux/./include/asm-generic/atomic-instrumented.h:44:(.init.text+0xa0): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `__efistub_kasan_check_write'
Makefile:1021: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

Let's avoid this by ensuring that uninstrumented files are given the same stub definitions of these functions as used for !KASAN builds. So that the stub definitions don't conflict with the real definitions in (uninstrumented) common KASAN code, the real definitions are prefixed with underscores and called from unprefixed macros. Any compiler-generated instrumentation uses separate __asan_{load,store}_*() entry points, and is not affected by this change.

Signed-off-by: Mark Rutland
Cc: Alexander Potapenko
Cc: Andrey Ryabinin
Cc: Dmitry Vyukov
Cc: Ingo Molnar
---
 include/linux/kasan-checks.h | 8 +---
 mm/kasan/common.c            | 8
 scripts/Makefile.kasan       | 2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

Hi,

Assuming the KASAN folk are happy with this, I'd like this to be queued in the tip tree, where the arm64 instrumented atomics are already queued.

Thanks,
Mark.
diff --git a/include/linux/kasan-checks.h b/include/linux/kasan-checks.h
index d314150658a4..6cf4b41a5393 100644
--- a/include/linux/kasan-checks.h
+++ b/include/linux/kasan-checks.h
@@ -2,9 +2,11 @@
 #ifndef _LINUX_KASAN_CHECKS_H
 #define _LINUX_KASAN_CHECKS_H
 
-#ifdef CONFIG_KASAN
-void kasan_check_read(const volatile void *p, unsigned int size);
-void kasan_check_write(const volatile void *p, unsigned int size);
+#if defined(CONFIG_KASAN) && !defined (KASAN_NOSANITIZE)
+void __kasan_check_read(const volatile void *p, unsigned int size);
+#define kasan_check_read __kasan_check_read
+void __kasan_check_write(const volatile void *p, unsigned int size);
+#define kasan_check_write __kasan_check_write
 #else
 static inline void kasan_check_read(const volatile void *p, unsigned int size)
 { }
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 73c9cbfdedf4..630e32838adb 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -95,17 +95,17 @@ void kasan_disable_current(void)
 	current->kasan_depth--;
 }
 
-void kasan_check_read(const volatile void *p, unsigned int size)
+void __kasan_check_read(const volatile void *p, unsigned int size)
 {
 	check_memory_region((unsigned long)p, size, false, _RET_IP_);
 }
-EXPORT_SYMBOL(kasan_check_read);
+EXPORT_SYMBOL(__kasan_check_read);
 
-void kasan_check_write(const volatile void *p, unsigned int size)
+void __kasan_check_write(const volatile void *p, unsigned int size)
 {
 	check_memory_region((unsigned long)p, size, true, _RET_IP_);
 }
-EXPORT_SYMBOL(kasan_check_write);
+EXPORT_SYMBOL(__kasan_check_write);
 
 #undef memset
 void *memset(void *addr, int c, size_t len)
diff --git a/scripts/Makefile.kasan b/scripts/Makefile.kasan
index 25c259df8ffa..c475a8ac776c 100644
--- a/scripts/Makefile.kasan
+++ b/scripts/Makefile.kasan
@@ -52,5 +52,5 @@ CFLAGS_KASAN := -fsanitize=kernel-hwaddress \
 endif # CONFIG_KASAN_SW_TAGS
 
 ifdef CONFIG_KASAN
-CFLAGS_KASAN_NOSANITIZE := -fno-builtin
+CFLAGS_KASAN_NOSANITIZE := -fno-builtin -DKASAN_NOSANITIZE
 endif
-- 
2.11.0
Re: [PATCH v5 8/8] ARM: tegra: Add firmware calls required for suspend-resume
22.02.2019 20:59, Dmitry Osipenko пишет: > In order to resume CPU from suspend via trusted Foundations firmware, > the LP1/LP2 boot vectors and CPU caches need to be set up using the > firmware calls. > > Signed-off-by: Dmitry Osipenko > --- > arch/arm/mach-tegra/pm.c| 53 ++--- > arch/arm/mach-tegra/reset-handler.S | 26 ++ > arch/arm/mach-tegra/sleep.S | 3 +- > 3 files changed, 61 insertions(+), 21 deletions(-) > > diff --git a/arch/arm/mach-tegra/pm.c b/arch/arm/mach-tegra/pm.c > index 66c8cd63dd86..f209f59e0daf 100644 > --- a/arch/arm/mach-tegra/pm.c > +++ b/arch/arm/mach-tegra/pm.c > @@ -33,6 +33,7 @@ > #include > > #include > +#include > #include > #include > #include > @@ -160,6 +161,28 @@ int tegra_cpu_do_idle(void) > > static int tegra_sleep_cpu(unsigned long v2p) > { > + /* > + * L2 cache disabling using kernel API only allowed when all > + * secondary CPU's are offline. Cache have to be disabled with > + * MMU-on if cache maintenance is done via Trusted Foundations > + * firmware. Note that CPUIDLE won't ever enter powergate on Tegra30 > + * if any of secondary CPU's is online and this is the LP2-idle > + * code-path only for Tegra20/30. > + */ > + if (trusted_foundations_registered()) > + outer_disable(); > + > + /* > + * Note that besides of setting up CPU reset vector this firmware > + * call may also do the following, depending on the FW version: > + * 1) Disable L2. But this doesn't matter since we already > + * disabled the L2. > + * 2) Disable D-cache. This need to be taken into account in > + * particular by the tegra_disable_clean_inv_dcache() which > + * shall avoid the re-disable. > + */ > + call_firmware_op(prepare_idle, TF_PM_MODE_LP2); > + > setup_mm_for_reboot(); > tegra_sleep_cpu_finish(v2p); > > @@ -196,24 +219,13 @@ void tegra_idle_lp2_last(void) > cpu_cluster_pm_enter(); > suspend_cpu_complex(); > > - /* > - * L2 cache disabling using kernel API only allowed when all > - * secondary CPU's are offline. 
Cache have to be disabled early > - * if cache maintenance is done via Trusted Foundations firmware. > - * Note that CPUIDLE won't ever enter powergate on Tegra30 if any > - * of secondary CPU's is online and this is the LP2 codepath only > - * for Tegra20/30. > - */ > - if (trusted_foundations_registered()) > - outer_disable(); > - > cpu_suspend(PHYS_OFFSET - PAGE_OFFSET, _sleep_cpu); > > /* >* Resume L2 cache if it wasn't re-enabled early during resume, >* which is the case for Tegra30 that has to re-enable the cache >* via firmware call. In other cases cache is already enabled and > - * hence re-enabling is a no-op. > + * hence re-enabling is a no-op. This is always a no-op on Tegra114+. >*/ > outer_resume(); > > @@ -235,6 +247,15 @@ enum tegra_suspend_mode tegra_pm_validate_suspend_mode( > > static int tegra_sleep_core(unsigned long v2p) > { > + /* > + * Cache have to be disabled with MMU-on if cache maintenance is done > + * via Trusted Foundations firmware. This is a no-op on Tegra114+. > + */ > + if (trusted_foundations_registered()) > + outer_disable(); > + > + call_firmware_op(prepare_idle, TF_PM_MODE_LP1); > + > setup_mm_for_reboot(); > tegra_sleep_core_finish(v2p); > > @@ -360,14 +381,6 @@ static int tegra_suspend_enter(suspend_state_t state) > break; > } > > - /* > - * Cache have to be disabled early if cache maintenance is done > - * via Trusted Foundations firmware. Otherwise this is a no-op, > - * like on Tegra114+. > - */ > - if (trusted_foundations_registered()) > - outer_disable(); > - > cpu_suspend(PHYS_OFFSET - PAGE_OFFSET, tegra_sleep_func); Seems I messed up the rebasing a tad. Will send another version.
Re: [PATCH V3 00/10] X86/KVM/Hyper-V: Add HV ept tlb range list flush support in KVM
On 22/02/19 16:06, lantianyu1...@gmail.com wrote: > From: Lan Tianyu > > This patchset is to introduce hv ept tlb range list flush function > support in the KVM MMU component. Flushing ept tlbs of several address > range can be done via single hypercall and new list flush function is > used in the kvm_mmu_commit_zap_page() and FNAME(sync_page). This patchset > also adds more hv ept tlb range flush support in more KVM MMU function. > > This patchset is based on the fix patch "x86/Hyper-V: Fix definition > HV_MAX_FLUSH_REP_COUNT". > (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1939455.html) Note that this won't make it in 5.1 unless Linus releases an -rc8. Otherwise, I'll get to it next week. Thanks, Paolo
[PATCH hyperv-fixes] hv_netvsc: Fix IP header checksum for coalesced packets
From: Haiyang Zhang

Incoming packets may have IP header checksum verified by the host. They may not have IP header checksum computed after coalescing. This patch re-computes the checksum when necessary, otherwise the packets may be dropped, because the Linux network stack always checks it.

Signed-off-by: Haiyang Zhang
---
 drivers/net/hyperv/netvsc_drv.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 256adbd044f5..cf4897043e83 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -744,6 +744,14 @@ void netvsc_linkstatus_callback(struct net_device *net,
 	schedule_delayed_work(_ctx->dwork, 0);
 }
 
+static void netvsc_comp_ipcsum(struct sk_buff *skb)
+{
+	struct iphdr *iph = (struct iphdr *)skb->data;
+
+	iph->check = 0;
+	iph->check = ip_fast_csum(iph, iph->ihl);
+}
+
 static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net,
 					     struct netvsc_channel *nvchan)
 {
@@ -770,9 +778,17 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net,
 	/* skb is already created with CHECKSUM_NONE */
 	skb_checksum_none_assert(skb);
 
-	/*
-	 * In Linux, the IP checksum is always checked.
-	 * Do L4 checksum offload if enabled and present.
+	/* Incoming packets may have IP header checksum verified by the host.
+	 * They may not have IP header checksum computed after coalescing.
+	 * We compute it here if the flags are set, because on Linux, the IP
+	 * checksum is always checked.
+	 */
+	if (csum_info && csum_info->receive.ip_checksum_value_invalid &&
+	    csum_info->receive.ip_checksum_succeeded &&
+	    skb->protocol == htons(ETH_P_IP))
+		netvsc_comp_ipcsum(skb);
+
+	/* Do L4 checksum offload if enabled and present.
 	 */
 	if (csum_info && (net->features & NETIF_F_RXCSUM)) {
 		if (csum_info->receive.tcp_checksum_succeeded ||
-- 
2.19.1
Re: [PATCH net] net: socket: set sock->sk to NULL after calling proto_ops::release()
On 02/22/2019 09:57 AM, Eric Biggers wrote: > ->setattr() is called under inode_lock(), which __sock_release() also takes. > So > the uses of sock->sk are serialized. See commit 6d8c50dcb029 ("socket: close > race condition between sock_close() and sockfs_setattr()"). Oh right, we added another inode_lock()/inode_unlock() for sock_close() > > The issue now is that if ->setattr() happens *after* __sock_release() (which > is > possible if fchownat() gets the reference to the file's 'struct path', then > the > file is close()d by another thread, then fchownat() continues), it will see > stale sock->sk because for many socket types it wasn't set to NULL earlier. > > - Eric >
Re: general protection fault in rose_send_frame
syzbot has found a reproducer for the following crash on: HEAD commit:7a25c6c0aac8 rocker: Add missing break for PRE_BRIDGE_FLAGS git tree: net-next console output: https://syzkaller.appspot.com/x/log.txt?x=104002d8c0 kernel config: https://syzkaller.appspot.com/x/.config?x=58cb9d752ba5f3e0 dashboard link: https://syzkaller.appspot.com/bug?extid=7078ae989d857fe17988 compiler: gcc (GCC) 9.0.0 20181231 (experimental) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1224c304c0 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=151824f8c0 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+7078ae989d857fe17...@syzkaller.appspotmail.com IPv6: ADDRCONF(NETDEV_CHANGE): hsr_slave_1: link becomes ready 8021q: adding VLAN 0 to HW filter on device batadv0 IPv6: ADDRCONF(NETDEV_CHANGE): rose0: link becomes ready kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] PREEMPT SMP KASAN CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.0.0-rc7+ #76 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:rose_send_frame+0x1a8/0x280 net/rose/rose_link.c:104 Code: c1 ea 03 80 3c 02 00 0f 85 8d 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 20 49 8d bc 24 58 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 7e 49 8b 94 24 58 03 00 00 e9 b8 fe ff ff e8 f0 53 RSP: 0018:8880ae807ae8 EFLAGS: 00010202 RAX: dc00 RBX: 88809aa1fa00 RCX: 86378efb RDX: 006b RSI: 8637902c RDI: 0358 RBP: 8880ae807b18 R08: 8887dec0 R09: ed101263146d R10: ed101263146c R11: 88809318a363 R12: R13: 0078 R14: 0005 R15: 8880a4f6fbc0 FS: () GS:8880ae80() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f23b3e4ee78 CR3: 9eef3000 CR4: 001406f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: rose_transmit_clear_request+0x1de/0x2a0 net/rose/rose_link.c:258 rose_rx_call_request+0x4ea/0x1990 net/rose/af_rose.c:1001 rose_loopback_timer+0x26a/0x3f0 
net/rose/rose_loopback.c:100 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325 expire_timers kernel/time/timer.c:1362 [inline] __run_timers kernel/time/timer.c:1681 [inline] __run_timers kernel/time/timer.c:1649 [inline] run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694 __do_softirq+0x266/0x95a kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807 RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58 Code: ff ff ff 48 89 c7 48 89 45 d8 e8 59 6c a1 fa 48 8b 45 d8 e9 ce fe ff ff 48 89 df e8 48 6c a1 fa eb 82 90 90 90 90 90 90 fb f4 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90 RSP: 0018:88807d08 EFLAGS: 0286 ORIG_RAX: ff13 RAX: 11125061 RBX: 8887dec0 RCX: RDX: dc00 RSI: 0001 RDI: 8887e73c RBP: 88807d38 R08: 8887dec0 R09: R10: R11: R12: R13: 889282f8 R14: R15: arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:555 default_idle_call+0x36/0x90 kernel/sched/idle.c:93 cpuidle_idle_call kernel/sched/idle.c:153 [inline] do_idle+0x386/0x570 kernel/sched/idle.c:262 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:353 rest_init+0x245/0x37b init/main.c:442 arch_call_rest_init+0xe/0x1b start_kernel+0x803/0x83c init/main.c:739 x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:470 x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:451 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243 Modules linked in: ---[ end trace 8f29030702ddb052 ]--- RIP: 0010:rose_send_frame+0x1a8/0x280 net/rose/rose_link.c:104 Code: c1 ea 03 80 3c 02 00 0f 85 8d 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 20 49 8d bc 24 58 03 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 7e 49 8b 94 24 58 03 00 00 e9 b8 fe ff ff e8 f0 53 RSP: 0018:8880ae807ae8 EFLAGS: 00010202 RAX: dc00 RBX: 88809aa1fa00 RCX: 86378efb RDX: 
006b RSI: 8637902c RDI: 0358 RBP: 8880ae807b18 R08: 8887dec0 R09: ed101263146d R10: ed101263146c R11: 88809318a363 R12: R13: 0078 R14: 0005 R15: 8880a4f6fbc0 FS: () GS:8880ae80() knlGS: CS: 0010 DS: ES: CR0:
Re: [PATCH] KVM: MMU: record maximum physical address width in kvm_mmu_extended_role
On 20/02/19 08:06, Yu Zhang wrote: > Hi Paolo, any comments on this patch? And the other one(kvm: x86: Return > LA57 feature based on hardware capability )? :-) Queued both, thanks. Paolo > On Fri, Feb 01, 2019 at 12:09:23AM +0800, Yu Zhang wrote: >> Previously, commit 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow >> MMU reconfiguration is needed") offered some optimization to avoid >> the unnecessary reconfiguration. Yet one scenario is broken - when >> cpuid changes VM's maximum physical address width, reconfiguration >> is needed to reset the reserved bits. Also, the TDP may need to >> reset its shadow_root_level when this value is changed. >> >> To fix this, a new field, maxphyaddr, is introduced in the extended >> role structure to keep track of the configured guest physical address >> width. >> >> Signed-off-by: Yu Zhang >> --- >> Cc: Paolo Bonzini >> Cc: "Radim Krčmář" >> Cc: Thomas Gleixner >> Cc: Ingo Molnar >> Cc: Borislav Petkov >> Cc: "H. Peter Anvin" >> Cc: linux-kernel@vger.kernel.org >> --- >> arch/x86/include/asm/kvm_host.h | 1 + >> arch/x86/kvm/mmu.c | 1 + >> 2 files changed, 2 insertions(+) >> >> diff --git a/arch/x86/include/asm/kvm_host.h >> b/arch/x86/include/asm/kvm_host.h >> index 4660ce9..be87f71 100644 >> --- a/arch/x86/include/asm/kvm_host.h >> +++ b/arch/x86/include/asm/kvm_host.h >> @@ -299,6 +299,7 @@ struct kvm_mmu_memory_cache { >> unsigned int cr4_smap:1; >> unsigned int cr4_smep:1; >> unsigned int cr4_la57:1; >> +unsigned int maxphyaddr:6; >> }; >> }; >> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c >> index ce770b4..2b74505 100644 >> --- a/arch/x86/kvm/mmu.c >> +++ b/arch/x86/kvm/mmu.c >> @@ -4769,6 +4769,7 @@ static union kvm_mmu_extended_role >> kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu) >> ext.cr4_pse = !!is_pse(vcpu); >> ext.cr4_pke = !!kvm_read_cr4_bits(vcpu, X86_CR4_PKE); >> ext.cr4_la57 = !!kvm_read_cr4_bits(vcpu, X86_CR4_LA57); >> +ext.maxphyaddr = cpuid_maxphyaddr(vcpu); >> >> ext.valid = 1; >> >> -- >> 1.9.1 
>> > > Thanks > Yu >
Re: [PATCH 5/5] mm/vmscan: don't forcely shrink active anon lru list
On Fri, Feb 22, 2019 at 08:43:37PM +0300, Andrey Ryabinin wrote:
> shrink_node_memcg() always forcibly shrinks the active anon list.
> This doesn't seem like correct behavior. If the system/memcg has no
> swap, it's absolutely pointless to rebalance the anon lru lists.
> And in case we did scan the active anon list above, it's unclear why
> we would need this additional force scan. If there are cases when we
> want a more aggressive scan of the anon lru, we should just change the
> scan target in get_scan_count() (and better explain such cases in the
> comments).
>
> Remove this force shrink and let get_scan_count() decide how much of
> the active anon we want to shrink.

This change breaks the anon pre-aging. The idea behind this is that the
VM maintains a small batch of anon reclaim candidates with recent access
information. On every reclaim, even when we just trim cache, which is
the most common reclaim mode, but also when we just swapped out some
pages and shrunk the inactive anon list, at the end of it we make sure
that the list of potential anon candidates is refilled for the next
reclaim cycle.

The comments for this are above inactive_list_is_low() and the
age_active_anon() call from kswapd.

Re: no swap, you are correct. We should gate that rebalancing on
total_swap_pages, just like age_active_anon() does.
Re: [PATCHv6 06/10] node: Add memory-side caching attributes
On Fri, Feb 22, 2019 at 10:09 AM Keith Busch wrote: > > On Fri, Feb 22, 2019 at 11:12:38AM +0100, Brice Goglin wrote: > > Le 14/02/2019 à 18:10, Keith Busch a écrit : > > > +What: > > > /sys/devices/system/node/nodeX/memory_side_cache/indexY/size > > > +Date: December 2018 > > > +Contact: Keith Busch > > > +Description: > > > + The size of this memory side cache in bytes. > > > > > > Hello Keith, > > > > CPU-side cache size is reported in kilobytes: > > > > $ cat > > /sys/devices/system/cpu/cpu0/cache/index*/size > > > > 32K > > 32K > > 256K > > 4096K > > > > Can you do the same of memory-side caches instead of reporting bytes? > > Ok, will do. Ugh, please no. Don't optimize sysfs for human consumption. That 'K' now needs to be parsed.
Re: [PATCH v2] IB/mlx4: Increase the timeout for CM cache
On Sun, Feb 17, 2019 at 03:45:12PM +0100, Håkon Bugge wrote:
> Using CX-3 virtual functions, either from a bare-metal machine or
> pass-through from a VM, MAD packets are proxied through the PF driver.
>
> Since the VF drivers have separate name spaces for MAD Transaction Ids
> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
> in a cache.
>
> Following the RDMA Connection Manager (CM) protocol, it is clear when
> an entry has to be evicted from the cache. But life is not perfect,
> remote peers may die or be rebooted. Hence, there is a timeout to wipe
> out a cache entry, after which the PF driver assumes the remote peer
> has gone.
>
> During workloads where a high number of QPs are destroyed concurrently,
> an excessive amount of CM DREQ retries has been observed.
>
> The problem can be demonstrated in a bare-metal environment, where two
> nodes have instantiated 8 VFs each. This is using dual-ported HCAs, so
> we have 16 vPorts per physical server.
>
> 64 processes are associated with each vPort and create and destroy
> one QP for each of the remote 64 processes. That is, 1024 QPs per
> vPort, all in all 16K QPs. The QPs are created/destroyed using the
> CM.
>
> When tearing down these 16K QPs, excessive CM DREQ retries (and
> duplicates) are observed. With some cat/paste/awk wizardry on the
> infiniband_cm sysfs, we observe as a sum of the 16 vPorts on one of
> the nodes:
>
> cm_rx_duplicates:
>       dreq  2102
> cm_rx_msgs:
>       drep  1989
>       dreq  6195
>       rep   3968
>       req   4224
>       rtu   4224
> cm_tx_msgs:
>       drep  4093
>       dreq 27568
>       rep   4224
>       req   3968
>       rtu   3968
> cm_tx_retries:
>       dreq 23469
>
> Note that the active/passive side is equally distributed between the
> two nodes.
>
> Enabling pr_debug in cm.c gives tons of:
>
> [171778.814239] mlx4_ib_multiplex_cm_handler: id{slave:
> 1,sl_cm_id: 0xd393089f} is NULL!
>
> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
> tear-down phase of the application is reduced from approximately 90 to
> 50 seconds.
> Retries/duplicates are also significantly reduced:
>
> cm_rx_duplicates:
>       dreq  2460
> []
> cm_tx_retries:
>       dreq  3010
>       req     47
>
> Increasing the timeout further didn't help, as these duplicates and
> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
> numbers fell down to about 10 for both of them.
>
> Adjustment of the CMA timeout is not part of this commit.
>
> Signed-off-by: Håkon Bugge
> Acked-by: Jack Morgenstein
> ---

Applied to for-next

Thanks,
Jason
Re: [PATCH v8 0/7] freezer for cgroup v2
Hey, Oleg.

On Fri, Feb 22, 2019 at 05:34:42PM +0100, Oleg Nesterov wrote:
> > ptrace support is a lot less important than kill for sure but if at
> > all possible I think it'd be better to have it
>
> Tejun, I agree it would be better. I did not argue with that.
>
> The question is how this can be implemented. And how much uglifications^W
> complications in the core kernel code this needs.

Yeah, sure thing. It's always a trade-off, but given that this is
something we're gonna expose to userspace as an API which will be
difficult to change too noticeably once released, I think it'd be
worthwhile to spend some effort to get the visible semantics right.

> > To summarize, the ideal result is the frozen state to be "stuck in
> > jobctl stop loop"
>
> Not sure I understand, but I don't think this can work...
>
> Let me repeat, imo the freezer should be "reliable", it shouldn't get
> stuck in CGRP_FREEZE state forever if, say, it races with vfork(). And
> personally I think this is more important than (very limited) ptrace
> support.

Absolutely, it makes no sense to ship it otherwise. I don't think
anyone is saying that we can ship it while it doesn't work reliably.

> So I think it too should somehow interact with freezable_schedule/etc.

You mean freezer_do_not_count(), right? As long as the task is
guaranteed to be trapped by signal stop afterwards (and they are), we
can likely use them the same way. The only thing to be careful about
would be ensuring that we don't end up flipping the group-level frozen
state in between.

Would something like that work?

Thanks.

--
tejun
Re: [PATCH 2/2] clk: ingenic: Fix doc of ingenic_cgu_div_info
Quoting Paul Cercueil (2019-01-27 18:09:21)
> The 'div' field does not represent a number of bits used to divide
> (i.e. right-shift) the divider, but a number itself used to divide
> the divider.
>
> Signed-off-by: Paul Cercueil
> Signed-off-by: Maarten ter Huurne
> Cc:
> ---

Applied to clk-next
Re: [PATCH 1/2] clk: ingenic: Fix round_rate misbehaving with non-integer dividers
Quoting Paul Cercueil (2019-01-27 18:09:20)
> Take a parent rate of 180 MHz, and a requested rate of 4.285715 MHz.
> This results in a theoretical divider of 41.93, which is then rounded
> up to 42. The .round_rate function would then return (180 MHz / 42) as
> the clock, rounded down, so 4.285714 MHz.
>
> Calling clk_set_rate on 4.285714 MHz would round the rate again, and
> give a theoretical divider of 42.028, now rounded up to 43, and the
> rate returned would be (180 MHz / 43), which is 4.186046 MHz, i.e. not
> what we requested.
>
> Fix this by rounding up the divisions.
>
> Signed-off-by: Paul Cercueil
> Tested-by: Maarten ter Huurne
> Cc:
> ---

Applied to clk-next
Re: [PATCHv6 06/10] node: Add memory-side caching attributes
On Fri, Feb 22, 2019 at 11:22:12AM +0100, Brice Goglin wrote:
> On 14/02/2019 at 18:10, Keith Busch wrote:
> > +What:		/sys/devices/system/node/nodeX/memory_side_cache/indexY/associativity
> > +Date:		December 2018
> > +Contact:	Keith Busch
> > +Description:
> > +		The cache's associativity: 0 for direct mapped, non-zero if
> > +		indexed.
>
> Should we rename "associativity" to "indexing" or something else?
>
> When I see "associativity" that contains 0, I tend to interpret this as
> the associativity value itself, which would mean fully associative here
> (as in the CPU-side cache "ways_of_associativity" attribute), while
> actually 0 means direct-mapped (i.e. 1-associative) with your
> semantics.
>
> Brice

Yes, that's a good suggestion.
Re: [PATCH v2 08/14] clk: qcom: hfpll: CLK_IGNORE_UNUSED
Quoting Jorge Ramirez-Ortiz (2019-01-28 10:32:55)
> When COMMON_CLK_DISABLED_UNUSED is set, in an effort to save power and
> to keep the software model of the clock in line with reality, the
> framework traverses the clock tree and disables those clocks that
> were enabled by the firmware but have not been enabled by any device
> driver.
>
> If CPUFREQ is enabled, early during system boot it might attempt to
> change the CPU frequency ("set_rate"). If the HFPLL is selected as a
> provider, it will then change the rate for this clock.
>
> As boot continues, clk_disable_unused_subtree will run. Since it won't
> find a valid counter (enable_count) for a clock that is actually
> enabled, it will attempt to disable it, which will cause the CPU to
> stop. Notice that in this driver, calls to check whether the clock is
> enabled are routed via the is_enabled callback, which queries the
> hardware.
>
> This commit, rather than marking the clock critical and forcing the
> clock to be always enabled, addresses the above scenario by making
> sure the clock is not disabled, while continuing to rely on the
> firmware to enable the clock.
>
> Co-developed-by: Niklas Cassel
> Signed-off-by: Niklas Cassel
> Signed-off-by: Jorge Ramirez-Ortiz
> ---
>  drivers/clk/qcom/hfpll.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/clk/qcom/hfpll.c b/drivers/clk/qcom/hfpll.c
> index 0ffed0d..9d92f5d 100644
> --- a/drivers/clk/qcom/hfpll.c
> +++ b/drivers/clk/qcom/hfpll.c
> @@ -58,6 +58,7 @@ static int qcom_hfpll_probe(struct platform_device *pdev)
>                 .parent_names = (const char *[]){ "xo" },
>                 .num_parents = 1,
>                 .ops = &clk_ops_hfpll,
> +               .flags = CLK_IGNORE_UNUSED,

Please put some sort of similar comment in the code so we can find it
without git history digging.