Re: [PATCH 12/14] powerpc: pass node id into create_section_mapping
Nicholas Piggin writes:

> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 328ff9abc333..435b19e74508 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -862,9 +862,9 @@ static void remove_pagetable(unsigned long start, unsigned long end)
>  	radix__flush_tlb_kernel_range(start, end);
>  }
>  
> -int __ref radix__create_section_mapping(unsigned long start, unsigned long end)
> +int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
>  {
> -	return create_physical_mapping(start, end);
> +	return create_physical_mapping(start, end, nid);
>  }

This got a little muddled. We add the nid argument here, but
create_physical_mapping() doesn't take it until patch 14.

I managed to fix it by rearranging the last three patches and fiddling
things a bit. If you can check the result once I push that would be good.

cheers
Re: [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered
Nicholas Piggin writes:

> ---
>  arch/powerpc/include/asm/paca.h    |  3 +-
>  arch/powerpc/kernel/paca.c         | 90 --
>  arch/powerpc/kernel/prom.c         |  5 ++-
>  arch/powerpc/kernel/setup-common.c | 24 +++---
>  4 files changed, 51 insertions(+), 71 deletions(-)

Added SOB.

cheers
Re: [PATCH 08/14] powerpc/setup: cpu_to_phys_id array
Nicholas Piggin writes:

> Build an array that finds hardware CPU number from logical CPU
> number in firmware CPU discovery. Use that rather than setting
> paca of other CPUs directly, to begin with. Subsequent patch will
> not have pacas allocated at this point.
> ---
>  arch/powerpc/include/asm/smp.h     |  1 +
>  arch/powerpc/kernel/prom.c         |  7 +++
>  arch/powerpc/kernel/setup-common.c | 15 ++-
>  3 files changed, 22 insertions(+), 1 deletion(-)

Added SOB.

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 4dffef947b8a..5979e34ba90e 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -874,5 +874,12 @@ EXPORT_SYMBOL(cpu_to_chip_id);
>  
>  bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
>  {
> +	/*
> +	 * Early firmware scanning must use this rather than
> +	 * get_hard_smp_processor_id because we don't have pacas allocated
> +	 * until memory topology is discovered.
> +	 */
> +	if (cpu_to_phys_id != NULL)
> +		return (int)phys_id == cpu_to_phys_id[cpu];

This needed an #ifdef CONFIG_SMP.

cheers
Re: [PATCH 10/14] powerpc/64: allocate pacas per node
Nicholas Piggin writes:

> Per-node allocations are possible on 64s with radix that does
> not have the bolted SLB limitation.
>
> Hash would be able to do the same if all CPUs had the bottom of
> their node-local memory bolted as well. This is left as an
> exercise for the reader.
> ---
>  arch/powerpc/kernel/paca.c     | 41 +++--
>  arch/powerpc/kernel/setup_64.c |  4 
>  2 files changed, 39 insertions(+), 6 deletions(-)

Added SOB.

cheers
[PATCH] powerpc/fscr: Enable interrupts earlier before calling get_user()
The function get_user() can sleep while trying to fetch an instruction
from the user address space, which causes the following warning from the
scheduler:

  BUG: sleeping function called from invalid context

Interrupts do get enabled again, but only a bit later, after get_user()
has already been called. This change moves the enabling of these
interrupts earlier, so that it covers the call to get_user(). While at
it, also check for kernel mode and crash, as this interrupt should never
have been triggered from kernel context.

Signed-off-by: Anshuman Khandual
---
 arch/powerpc/kernel/traps.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 1e48d15..4d5a55e 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1611,6 +1611,13 @@ void facility_unavailable_exception(struct pt_regs *regs)
 	else
 		value = mfspr(SPRN_FSCR);
 
+	/* We should not have taken this interrupt in kernel */
+	BUG_ON(!user_mode(regs));
+
+	/* We restore the interrupt state now */
+	if (!arch_irq_disabled_regs(regs))
+		local_irq_enable();
+
 	status = value >> 56;
 	if (status == FSCR_DSCR_LG) {
 		/*
@@ -1683,10 +1690,6 @@ void facility_unavailable_exception(struct pt_regs *regs)
 		facility_strings[status])
 		facility = facility_strings[status];
 
-	/* We restore the interrupt state now */
-	if (!arch_irq_disabled_regs(regs))
-		local_irq_enable();
-
 	pr_err_ratelimited("%sFacility '%s' unavailable (%d), exception at 0x%lx, MSR=%lx\n",
 		hv ? "Hypervisor " : "", facility, status, regs->nip, regs->msr);
-- 
2.9.3
Re: [PATCH] powerpc/powernv/nvram: opal_nvram_write handle unknown OPAL errors
Nicholas Piggin writes:

> opal_nvram_write currently just assumes success if it encounters an
> error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO
> on other errors instead.
>
> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/platforms/powernv/opal-nvram.c | 2 ++
>  1 file changed, 2 insertions(+)

Acked-by: Stewart Smith

-- 
Stewart Smith
OPAL Architect, IBM.
[PATCH v2 4/4] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
From: Mahesh Salgaonkar

For fadump to work successfully there should not be any holes in the
reserved memory ranges where the kernel has asked firmware to move the
contents of old kernel memory in the event of a crash. But this memory
area is currently not protected from hot-remove operations. Hence, the
fadump service can fail to re-register after a hot-remove operation if
the hot-removed memory belongs to the fadump reserved region. To avoid
this, make sure that memory from the fadump reserved area is not
hot-removable if fadump is registered.

However, if the user still wants to remove that memory, he can do so by
manually stopping the fadump service before the hot-remove operation.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h               |    2 +-
 arch/powerpc/kernel/fadump.c                    |   10 --
 arch/powerpc/platforms/pseries/hotplug-memory.c |    7 +--
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 776cba0baec4..bd84b496d2b8 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -206,7 +206,7 @@ struct fad_crash_memory_ranges {
 	unsigned long long	size;
 };
 
-extern int is_fadump_boot_memory_area(u64 addr, ulong size);
+extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
 		const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 0268a32b632e..331066eefaee 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 
 /*
  * If fadump is registered, check if the memory provided
- * falls within boot memory area.
+ * falls within boot memory area and reserved memory area.
  */
-int is_fadump_boot_memory_area(u64 addr, ulong size)
+int is_fadump_memory_area(u64 addr, ulong size)
 {
+	u64 d_start = fw_dump.reserve_dump_area_start;
+	u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+
 	if (!fw_dump.dump_registered)
 		return 0;
 
+	if (((addr + size) > d_start) && (addr <= d_end))
+		return 1;
+
 	return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
 }
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..e4c658cda3a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 	phys_addr = lmb->base_addr;
 
 #ifdef CONFIG_FA_DUMP
-	/* Don't hot-remove memory that falls in fadump boot memory area */
-	if (is_fadump_boot_memory_area(phys_addr, block_sz))
+	/*
+	 * Don't hot-remove memory that falls in fadump boot memory area
+	 * and memory that is reserved for capturing old kernel memory.
+	 */
+	if (is_fadump_memory_area(phys_addr, block_sz))
 		return false;
 #endif
[PATCH v2 3/4] powerpc/fadump: throw proper error message on fadump registration failure.
From: Mahesh Salgaonkar

fadump fails to register when there are holes in the reserved memory
area. This can happen if the user has hot-removed memory that falls in
the fadump reserved memory area. Throw a meaningful error message to the
user in such a case.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 9d75619cac28..0268a32b632e 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -216,6 +216,36 @@ static int is_boot_memory_area_contiguous(void)
 	return ret;
 }
 
+/*
+ * Returns 1, if there are no holes in reserved memory area,
+ * 0 otherwise.
+ */
+static int is_reserved_memory_area_contiguous(void)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+	unsigned long d_start = fw_dump.reserve_dump_area_start;
+	unsigned long d_end = d_start + fw_dump.reserve_dump_area_size;
+	int ret = 0;
+
+	for_each_memblock(memory, reg) {
+		start = max(d_start, (unsigned long)reg->base);
+		end = min(d_end, (unsigned long)(reg->base + reg->size));
+		if (d_start < end) {
+			/* Memory hole from d_start to start */
+			if (start > d_start)
+				break;
+
+			if (end == d_end) {
+				ret = 1;
+				break;
+			}
+			d_start = end + 1;
+		}
+	}
+	return ret;
+}
+
 /* Print firmware assisted dump configurations for debugging purpose. */
 static void fadump_show_config(void)
 {
@@ -605,6 +635,9 @@ static int register_fw_dump(struct fadump_mem_struct *fdm)
 		if (!is_boot_memory_area_contiguous())
 			pr_err("Can't have holes in boot memory area while "
			       "registering fadump\n");
+		else if (!is_reserved_memory_area_contiguous())
+			pr_err("Can't have holes in reserved memory area while"
+			       " registering fadump\n");
 		printk(KERN_ERR "Failed to register firmware-assisted kernel"
 			" dump. Parameter Error(%d).\n", rc);
[PATCH v2 2/4] powerpc/fadump: exclude memory holes while reserving memory in second kernel.
From: Mahesh Salgaonkar

The second kernel, during early boot after the crash, reserves the rest
of the memory above the boot memory size to make sure it does not touch
any of the dump memory area. It uses memblock_reserve(), which reserves
the specified memory region irrespective of memory holes present within
that region. There are chances the previous kernel would have
hot-removed some of its memory, leaving memory holes behind. In such
cases the fadump kernel reports an incorrect number of reserved pages
through the arch_reserved_kernel_pages() hook, causing the kernel to
hang or panic.

Fix this by excluding memory holes while reserving the rest of the
memory above the boot memory size during second kernel boot after a
crash.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 2098583c935f..9d75619cac28 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -434,6 +434,23 @@ static inline unsigned long get_fadump_metadata_base(
 		be64_to_cpu(fdm_active->rmr_region.source_len));
 }
 
+static void fadump_memblock_reserve(unsigned long base, unsigned long size)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+
+	for_each_memblock(memory, reg) {
+		start = (unsigned long)reg->base;
+		end = start + (unsigned long)reg->size;
+
+		if ((start >= base) && (start < (base + size))) {
+			if (end > (base + size))
+				end = base + size;
+			memblock_reserve(start, end - start);
+		}
+	}
+}
+
 int __init fadump_reserve_mem(void)
 {
 	unsigned long base, size, memory_boundary;
@@ -488,7 +505,7 @@ int __init fadump_reserve_mem(void)
 	 */
 	base = fw_dump.boot_memory_size;
 	size = memory_boundary - base;
-	memblock_reserve(base, size);
+	fadump_memblock_reserve(base, size);
 	printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
 		"for saving crash dump\n", (unsigned long)(size >> 20),
[PATCH v2 1/4] powerpc/fadump: Reservationless firmware assisted dump
From: Mahesh Salgaonkar

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with terabytes of memory, this reservation can be quite
significant. In some cases, fadump fails if the memory reserved is
insufficient, or if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".

We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump. Fadump can now only capture kernel memory.

Instead of setting aside a significant chunk of memory nobody can use,
this patch uses CMA instead, to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but that
applications are free to use.

Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7557         193        6822          12         541        6725
Swap:          4095           0        4095

With this patch:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           8133         194        7464          12         475        7338
Swap:          4095           0        4095

Changes made here are completely transparent to how fadump has
traditionally worked.

Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.

TODO:
- Handle the case where the CMA reservation spans nodes.
Signed-off-by: Ananth N Mavinakayanahalli
Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h |    3 +
 arch/powerpc/kernel/fadump.c      |  179 +++--
 2 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..776cba0baec4 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -119,6 +119,7 @@ struct fadump_mem_struct {
 	struct fadump_section		cpu_state_data;
 	struct fadump_section		hpte_region;
 	struct fadump_section		rmr_region;
+	struct fadump_section		metadata_region;
 };
 
 /* Firmware-assisted dump configuration details. */
@@ -141,6 +142,8 @@ struct fw_dump {
 	unsigned long	fadump_supported:1;
 	unsigned long	dump_active:1;
 	unsigned long	dump_registered:1;
+	/* flag to indicate fadump metadata area is cma allocated */
+	unsigned long	cma_alloc:1;
 };
 
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c2688918f..2098583c935f 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -45,11 +46,57 @@
 static struct fw_dump fw_dump;
 static struct fadump_mem_struct fdm;
 static const struct fadump_mem_struct *fdm_active;
+static struct cma *fadump_cma;
 
 static DEFINE_MUTEX(fadump_mutex);
 struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
 int crash_mem_ranges;
 
+/*
+ * fadump_cma_reserve() - reserve area for fadump memory reservation
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the memblock allocator
+ * has been activated.
+ */
+int __init fadump_cma_reserve(void)
+{
+	unsigned long long base, size;
+	int rc;
+
+	if (!fw_dump.fadump_enabled)
+		return 0;
+
+	base = fw_dump.reserve_dump_area_start;
+	size = fw_dump.reserve_dump_area_size;
+	pr_debug("Original reserve area base %ld, size %ld\n",
+		 (unsigned long)base >> 20,
+		 (unsigned long)size >> 20);
+	if (!size)
+		return 0;
+
+	rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
+				    "fadump_cma", &fadump_cma);
+	if (rc) {
+		printk(KERN_ERR "fadump: Failed to reserve cma area for "
+		       "firmware-assisted dump, %d\n", rc);
+		fw_dump.reserve_dump_area_size = 0;
+		return 0;
+	}
+	/*
+	 * So we now have cma area reserved for fadump. base may be different
+	 * from what we requested.
+	 */
+	fw_dump.reserve_dump_area_start = cma_get_base(fadump_cma);
+
Re: [PATCH 6/6] doc/devicetree: NVDIMM region documentation
On Thu, Mar 29, 2018 at 4:06 AM, Rob Herring wrote:
> On Tue, Mar 27, 2018 at 9:53 AM, Oliver wrote:
>> On Tue, Mar 27, 2018 at 9:24 AM, Rob Herring wrote:
>>> On Fri, Mar 23, 2018 at 07:12:09PM +1100, Oliver O'Halloran wrote:
>>>> Add device-tree binding documentation for the nvdimm region driver.
>>>>
>>>> Cc: devicet...@vger.kernel.org
>>>> Signed-off-by: Oliver O'Halloran
>>>> ---
>>>>  .../devicetree/bindings/nvdimm/nvdimm-region.txt | 45 ++
>>>>  1 file changed, 45 insertions(+)
>>>>  create mode 100644 Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt
>>>> new file mode 100644
>>>> index ..02091117ff16
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt
>>>> @@ -0,0 +1,45 @@
>>>> +Device-tree bindings for NVDIMM memory regions
>>>> +-
>>>> +
>>>> +Non-volatile DIMMs are memory modules used to provide (cacheable) main memory
>>>
>>> Are DIMMs always going to be the only form factor for NV memory?
>>>
>>> And if you have multiple DIMMs, does each DT node correspond to a DIMM?
>>
>> A nvdimm-region might correspond to a single NVDIMM, a set of
>> interleaved NVDIMMs, or it might just be a chunk of normal memory that
>> you want treated as a NVDIMM for some reason. The last case is useful
>> for provisioning install media on servers since it allows you to
>> download a DVD image, turn it into an nvdimm-region, and kexec into
>> the installer which can use it as a root disk. That may seem a little
>> esoteric, but it's handy and we're using a full linux environment for
>> our boot loader so it's easy to make use of.
>
> I'm really just asking if we should drop the "dimm" name because it is
> not always a DIMM. Maybe pmem instead? I don't know, naming is
> hard(TM).

pmem is probably a better name. I'll fix that up.

>>> If not, then what if we want/need to provide power control to a DIMM?
>>
>> That would require a DIMM (and probably memory controller) specific
>> driver. I've deliberately left out how regions are mapped back to
>> DIMMs from the binding since it's not really clear to me how that
>> should work. A phandle array pointing to each DIMM device (which could
>> be anything) would do the trick, but I've found that a bit awkward to
>> plumb into the model that libnvdimm expects.

>>>> +that retains its contents across power cycles. In more practical terms, they
>>>> +are a kind of storage device where the contents can be accessed by the CPU
>>>> +directly, rather than indirectly via a storage controller or similar. An
>>>> +nvdimm-region specifies a physical address range that is hosted on an NVDIMM
>>>> +device.
>>>> +
>>>> +Bindings for the region nodes:
>>>> +-
>>>> +
>>>> +Required properties:
>>>> + - compatible = "nvdimm-region"
>>>> +
>>>> + - reg = ;
>>>> +   The system physical address range of this nvdimm region.
>>>> +
>>>> +Optional properties:
>>>> + - Any relevant NUMA associativity properties for the target platform.
>>>> + - A "volatile" property indicating that this region is actually in
>>>> +   normal DRAM and does not require cache flushes after each write.
>>>> +
>>>> +A complete example:
>>>> +
>>>> +/ {
>>>> +	#size-cells = <2>;
>>>> +	#address-cells = <2>;
>>>> +
>>>> +	platform {
>>>
>>> Perhaps we need a more well defined node here. Like we have 'memory' for
>>> memory nodes.
>>
>> I think treating it as a platform device is fine. Memory nodes are
>
> Platform device is a Linux term...
>
>> special since the OS needs to know where it can allocate early in boot
>> and I don't see non-volatile memory as being similarly significant.
>> Fundamentally an NVDIMM is just a memory mapped storage device so we
>> should be able to defer looking at them until later in boot.
>
> It's not clear if 'platform' is just an example or random name or what
> the node is required to be called. In the latter case, we should be
> much more specific because 'platform' could be anything. In the former
> case, then we have no way to find or validate the node because the
> name could be anything and there's no compatible property either.

Sorry, the platform node is just there as an example. I'll remove it.

> "region" is pretty generic too.

It is, but I didn't see a compelling reason to call it something else.

>> That said you might have problems with XIP kernels and what not. I
>> think that problem is better solved through other means though.

>>>> +	region@5000 {
>>>> +		compatible = "nvdimm-region";
>>>> +		reg = <0x0001 0x
Re: [PATCH v8 22/24] mm: Speculative page fault handler return VMA
2018-03-29 10:26 GMT+08:00 Ganesh Mahendran: > Hi, Laurent > > 2018-02-16 23:25 GMT+08:00 Laurent Dufour : >> When the speculative page fault handler is returning VM_RETRY, there is a >> chance that VMA fetched without grabbing the mmap_sem can be reused by the >> legacy page fault handler. By reusing it, we avoid calling find_vma() >> again. To achieve, that we must ensure that the VMA structure will not be >> freed in our back. This is done by getting the reference on it (get_vma()) >> and by assuming that the caller will call the new service >> can_reuse_spf_vma() once it has grabbed the mmap_sem. >> >> can_reuse_spf_vma() is first checking that the VMA is still in the RB tree >> , and then that the VMA's boundaries matched the passed address and release >> the reference on the VMA so that it can be freed if needed. >> >> In the case the VMA is freed, can_reuse_spf_vma() will have returned false >> as the VMA is no more in the RB tree. > > when I applied this patch to arm64, I got a crash: > > [6.088296] Unable to handle kernel NULL pointer dereference at > virtual address > [6.088307] pgd = ff9d67735000 > [6.088313] [] *pgd=0001795e3003, > *pud=0001795e3003, *pmd= > [6.088372] [ cut here ] > [6.088377] Kernel BUG at ff9d64f65960 [verbose debug info unavailable] > [6.088384] Internal error: Oops - BUG: 9645 [#1] PREEMPT SMP > [6.088389] BUG: Bad rss-counter state mm:ffe8f3861040 idx:0 val:90 > [6.088393] BUG: Bad rss-counter state mm:ffe8f3861040 idx:1 val:58 > [6.088398] Modules linked in: > [6.088408] CPU: 1 PID: 621 Comm: qseecomd Not tainted 4.4.78-perf+ #88 > [6.088413] Hardware name: Qualcomm Technologies, Inc. 
SDM 636 > PM660 + PM660L MTP E7S (DT) > [6.088419] task: ffe8f6208000 ti: ffe872a8c000 task.ti: > ffe872a8c000 > [6.088432] PC is at __rb_erase_color+0x108/0x240 > [6.088441] LR is at vma_interval_tree_remove+0x244/0x24c > [6.088447] pc : [] lr : [] > pstate: 604001c5 > [6.088451] sp : ffe872a8fa50 > [6.088455] x29: ffe872a8fa50 x28: 0008 > [6.088462] x27: 0009 x26: > [6.088470] x25: ffe8f458fb80 x24: 00768ff87000 > [6.088477] x23: x22: > [6.088484] x21: ff9d64d9be7c x20: ffe8f3ff0680 > [6.088492] x19: ffe8f212e9b0 x18: 0074 > [6.088499] x17: 0007 x16: 000e > [6.088507] x15: ff9d65c88000 x14: 0001 > [6.088514] x13: 00192d76 x12: 00989680 > [6.088521] x11: 001f x10: ff9d661ded1b > [6.088528] x9 : 007691759000 x8 : 07691759 > [6.088535] x7 : x6 : ffe871ebada8 > [6.088541] x5 : 00e1 x4 : ffe8f212e958 > [6.088548] x3 : 00e9 x2 : > [6.088555] x1 : ffe8f212f110 x0 : ffe8f212e9b1 > [6.088564] > [6.088564] PC: 0xff9d64f65920: > [6.088568] 5920 f902 aa0103e0 aa1603e1 d63f02a0 aa1603e1 > f9400822 f9000662 f9000833 > [6.088590] 5940 143b f9400a61 f9400020 370002c0 f9400436 > b2400260 f9000a76 f9000433 > [6.088610] 5960 f90002c0 f9400260 f920 f9000261 f27ef400 > 54000100 f9400802 eb13005f > [6.088630] 5980 5461 f9000801 1404 f9000401 1402 > f9000281 aa1303e0 d63f02a0 > [6.088652] > [6.088652] LR: 0xff9d64d9c298: > [6.088656] c298 f9403083 b483 f9400c63 eb03005f 9a832042 > f9403883 eb02007f 54a0 > [6.088676] c2b8 f9003882 f9402c82 927ef442 b5fffd22 b480 > f0e2 9139f042 94072561 > [6.088695] c2d8 a8c17bfd d65f03c0 a9bf7bfd 910003fd f943 > d280 b4e3 f9400c65 > [6.088715] c2f8 d1016063 eb0100bf 5463 aa0303e0 97fffef2 > a8c17bfd d65f03c0 a9bf7bfd > [6.088735] > [6.088735] SP: 0xffe872a8fa10: > [6.088740] fa10 64d9c2d8 ff9d 72a8fa50 ffe8 64f65960 > ff9d 604001c5 > [6.088759] fa30 71d67d70 ffe8 71c281e8 ffe8 > 0080 64daa90c ff9d > [6.088779] fa50 72a8fa90 ffe8 64d9c2d8 ff9d 71ebada8 > ffe8 f3ff0678 ffe8 > [6.088799] fa70 72a8fb80 ffe8 > 0001 > [6.088818] > [6.088823] Process qseecomd 
(pid: 621, stack limit = 0xffe872a8c028) > [6.088828] Call trace: > [6.088834] Exception stack(0xffe872a8f860 to 0xffe872a8f990) > [6.088841] f860: ffe8f212e9b0 0080 > 82b37000 ff9d64f65960 > [6.088848] f880: 604001c5 ff9d672c8680 > ff9d672c9c00 ff9d672d3ab7 > [6.088855] f8a0: ffe872a8f8f0 ff9d64db9bfc > ffe8f9402c00 > [6.088861] f8c0: ffe872a8c000 > ffe872a8f920 ff9d64db9bfc >
Re: [PATCH v8 22/24] mm: Speculative page fault handler return VMA
Hi, Laurent 2018-02-16 23:25 GMT+08:00 Laurent Dufour: > When the speculative page fault handler is returning VM_RETRY, there is a > chance that VMA fetched without grabbing the mmap_sem can be reused by the > legacy page fault handler. By reusing it, we avoid calling find_vma() > again. To achieve, that we must ensure that the VMA structure will not be > freed in our back. This is done by getting the reference on it (get_vma()) > and by assuming that the caller will call the new service > can_reuse_spf_vma() once it has grabbed the mmap_sem. > > can_reuse_spf_vma() is first checking that the VMA is still in the RB tree > , and then that the VMA's boundaries matched the passed address and release > the reference on the VMA so that it can be freed if needed. > > In the case the VMA is freed, can_reuse_spf_vma() will have returned false > as the VMA is no more in the RB tree. when I applied this patch to arm64, I got a crash: [6.088296] Unable to handle kernel NULL pointer dereference at virtual address [6.088307] pgd = ff9d67735000 [6.088313] [] *pgd=0001795e3003, *pud=0001795e3003, *pmd= [6.088372] [ cut here ] [6.088377] Kernel BUG at ff9d64f65960 [verbose debug info unavailable] [6.088384] Internal error: Oops - BUG: 9645 [#1] PREEMPT SMP [6.088389] BUG: Bad rss-counter state mm:ffe8f3861040 idx:0 val:90 [6.088393] BUG: Bad rss-counter state mm:ffe8f3861040 idx:1 val:58 [6.088398] Modules linked in: [6.088408] CPU: 1 PID: 621 Comm: qseecomd Not tainted 4.4.78-perf+ #88 [6.088413] Hardware name: Qualcomm Technologies, Inc. 
SDM 636 PM660 + PM660L MTP E7S (DT) [6.088419] task: ffe8f6208000 ti: ffe872a8c000 task.ti: ffe872a8c000 [6.088432] PC is at __rb_erase_color+0x108/0x240 [6.088441] LR is at vma_interval_tree_remove+0x244/0x24c [6.088447] pc : [] lr : [] pstate: 604001c5 [6.088451] sp : ffe872a8fa50 [6.088455] x29: ffe872a8fa50 x28: 0008 [6.088462] x27: 0009 x26: [6.088470] x25: ffe8f458fb80 x24: 00768ff87000 [6.088477] x23: x22: [6.088484] x21: ff9d64d9be7c x20: ffe8f3ff0680 [6.088492] x19: ffe8f212e9b0 x18: 0074 [6.088499] x17: 0007 x16: 000e [6.088507] x15: ff9d65c88000 x14: 0001 [6.088514] x13: 00192d76 x12: 00989680 [6.088521] x11: 001f x10: ff9d661ded1b [6.088528] x9 : 007691759000 x8 : 07691759 [6.088535] x7 : x6 : ffe871ebada8 [6.088541] x5 : 00e1 x4 : ffe8f212e958 [6.088548] x3 : 00e9 x2 : [6.088555] x1 : ffe8f212f110 x0 : ffe8f212e9b1 [6.088564] [6.088564] PC: 0xff9d64f65920: [6.088568] 5920 f902 aa0103e0 aa1603e1 d63f02a0 aa1603e1 f9400822 f9000662 f9000833 [6.088590] 5940 143b f9400a61 f9400020 370002c0 f9400436 b2400260 f9000a76 f9000433 [6.088610] 5960 f90002c0 f9400260 f920 f9000261 f27ef400 54000100 f9400802 eb13005f [6.088630] 5980 5461 f9000801 1404 f9000401 1402 f9000281 aa1303e0 d63f02a0 [6.088652] [6.088652] LR: 0xff9d64d9c298: [6.088656] c298 f9403083 b483 f9400c63 eb03005f 9a832042 f9403883 eb02007f 54a0 [6.088676] c2b8 f9003882 f9402c82 927ef442 b5fffd22 b480 f0e2 9139f042 94072561 [6.088695] c2d8 a8c17bfd d65f03c0 a9bf7bfd 910003fd f943 d280 b4e3 f9400c65 [6.088715] c2f8 d1016063 eb0100bf 5463 aa0303e0 97fffef2 a8c17bfd d65f03c0 a9bf7bfd [6.088735] [6.088735] SP: 0xffe872a8fa10: [6.088740] fa10 64d9c2d8 ff9d 72a8fa50 ffe8 64f65960 ff9d 604001c5 [6.088759] fa30 71d67d70 ffe8 71c281e8 ffe8 0080 64daa90c ff9d [6.088779] fa50 72a8fa90 ffe8 64d9c2d8 ff9d 71ebada8 ffe8 f3ff0678 ffe8 [6.088799] fa70 72a8fb80 ffe8 0001 [6.088818] [6.088823] Process qseecomd (pid: 621, stack limit = 0xffe872a8c028) [6.088828] Call trace: [6.088834] Exception stack(0xffe872a8f860 to 
0xffe872a8f990) [6.088841] f860: ffe8f212e9b0 0080 82b37000 ff9d64f65960 [6.088848] f880: 604001c5 ff9d672c8680 ff9d672c9c00 ff9d672d3ab7 [6.088855] f8a0: ffe872a8f8f0 ff9d64db9bfc ffe8f9402c00 [6.088861] f8c0: ffe872a8c000 ffe872a8f920 ff9d64db9bfc [6.088867] f8e0: ffe8f9402b00 ffe872a8fa10 ff9d64dba568 [6.088874] f900: ffbe61c759c0 ffe871d67d70 ffe8f9402c00 1de56fb006cba396 [6.01] f920: ffe8f212e9b1 ffe8f212f110
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 09:55:07AM -0700, Kees Cook wrote:
>On Wed, Mar 28, 2018 at 8:26 AM, Shea Levy wrote:
>> Now only those architectures that have custom initrd free requirements
>> need to define free_initrd_mem.
>>
>> Signed-off-by: Shea Levy
>
>Yay consolidation! :)
>
>> --- a/usr/Kconfig
>> +++ b/usr/Kconfig
>> @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION
>>  	default ".lzma" if RD_LZMA
>>  	default ".bz2"  if RD_BZIP2
>>  	default ""
>> +
>> +config HAVE_ARCH_FREE_INITRD_MEM
>> +	bool
>> +	default n
>
>If you keep the Kconfig, you can leave off "default n", and I'd
>suggest adding a help section just to describe what the per-arch
>responsibilities are when select-ing the config. (See
>HAVE_ARCH_SECCOMP_FILTER for an example.)

One question about this change. The original code would "select"
HAVE_ARCH_FREE_INITRD_MEM on those archs. After this change, we need to
manually "select" this?

>-Kees
>
>-- 
>Kees Cook
>Pixel Security

-- 
Wei Yang
Help you, Help me
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Thu, 29 Mar 2018 09:37:52 +1100 Oliver wrote:

> On Thu, Mar 29, 2018 at 9:14 AM, Russell King - ARM Linux wrote:
> > On Wed, Mar 28, 2018 at 02:04:22PM -0500, Rob Landley wrote:
> >> On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote:
> >> > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote:
> >> >> On 03/28/2018 10:26 AM, Shea Levy wrote:
> >> >>> Now only those architectures that have custom initrd free requirements
> >> >>> need to define free_initrd_mem.
> >> >> ...
> >> >>> --- a/arch/arc/mm/init.c
> >> >>> +++ b/arch/arc/mm/init.c
> >> >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void)
> >> >>>  {
> >> >>>  	free_initmem_default(-1);
> >> >>>  }
> >> >>> -
> >> >>> -#ifdef CONFIG_BLK_DEV_INITRD
> >> >>> -void __init free_initrd_mem(unsigned long start, unsigned long end)
> >> >>> -{
> >> >>> -	free_reserved_area((void *)start, (void *)end, -1, "initrd");
> >> >>> -}
> >> >>> -#endif
> >> >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> >> >>> index 3f972e83909b..19d1c5594e2d 100644
> >> >>> --- a/arch/arm/Kconfig
> >> >>> +++ b/arch/arm/Kconfig
> >> >>> @@ -47,6 +47,7 @@ config ARM
> >> >>>  	select HARDIRQS_SW_RESEND
> >> >>>  	select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT)
> >> >>>  	select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
> >> >>> +	select HAVE_ARCH_FREE_INITRD_MEM
> >> >>>  	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
> >> >>>  	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
> >> >>>  	select HAVE_ARCH_MMAP_RND_BITS if MMU
> >> >>
> >> >> Isn't this why weak symbols were invented?
> >> >
> >> > Weak symbols means that we end up with both the weakly-referenced code
> >> > and the arch code in the kernel image. That's fine if the weak code
> >> > is small.
> >>
> >> The kernel's been able to build with link time garbage collection since 2016:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d
> >>
> >> Wouldn't that remove the unused one?
> >
> > Probably, if anyone bothered to use that, which they don't.
> >
> > LD_DEAD_CODE_DATA_ELIMINATION is a symbol without a prompt, and from
> > what I can see, nothing selects it. Therefore, the symbol is always
> > disabled, and so the feature never gets used in mainline kernels.
> >
> > Brings up the obvious question - why is it there if it's completely
> > unused? (Maybe to cause confusion, and allowing a justification
> > for __weak ?)

Well, weak symbols have been used long before it was added.

> IIRC Nick had some patches to do the arch enablement for powerpc, but
> I'm not sure what happened to them though. I suspect it just fell down
> Nick's ever growing TODO list.

Yeah, I had started some patches for powerpc and x86 that have ended up
on the back burner. There's been some MIPS people playing with it too.

For the kernel, LD_DEAD_CODE_DATA_ELIMINATION is not great. It can save
a little, but you get issues like any exception table entry or bug table
entry in a function will create a reference back to the function, so the
linker can't trim it away even if nothing else references it.

I'll try to take another look at it within the next few months and
remove it if I can't make progress. Nicolas Pitre has been doing some
much better work on dead code using real LTO.

Thanks,
Nick
[PATCH] macintosh/adb: Use C99 initializers for struct adb_driver instances
No change to object files. Cc: Benjamin HerrenschmidtSigned-off-by: Finn Thain --- drivers/macintosh/adb-iop.c| 14 +++--- drivers/macintosh/macio-adb.c | 15 +++ drivers/macintosh/via-macii.c | 14 +++--- drivers/macintosh/via-pmu.c| 14 +++--- drivers/macintosh/via-pmu68k.c | 14 +++--- 5 files changed, 35 insertions(+), 36 deletions(-) diff --git a/drivers/macintosh/adb-iop.c b/drivers/macintosh/adb-iop.c index 15db69d8ba69..ca623e6446e4 100644 --- a/drivers/macintosh/adb-iop.c +++ b/drivers/macintosh/adb-iop.c @@ -53,13 +53,13 @@ static void adb_iop_poll(void); static int adb_iop_reset_bus(void); struct adb_driver adb_iop_driver = { - "ISM IOP", - adb_iop_probe, - adb_iop_init, - adb_iop_send_request, - adb_iop_autopoll, - adb_iop_poll, - adb_iop_reset_bus + .name = "ISM IOP", + .probe= adb_iop_probe, + .init = adb_iop_init, + .send_request = adb_iop_send_request, + .autopoll = adb_iop_autopoll, + .poll = adb_iop_poll, + .reset_bus= adb_iop_reset_bus }; static void adb_iop_end_req(struct adb_request *req, int state) diff --git a/drivers/macintosh/macio-adb.c b/drivers/macintosh/macio-adb.c index 9a6223add30e..eb3adfb7f88d 100644 --- a/drivers/macintosh/macio-adb.c +++ b/drivers/macintosh/macio-adb.c @@ -70,14 +70,13 @@ static void macio_adb_poll(void); static int macio_adb_reset_bus(void); struct adb_driver macio_adb_driver = { - "MACIO", - macio_probe, - macio_init, - macio_send_request, - /*macio_write,*/ - macio_adb_autopoll, - macio_adb_poll, - macio_adb_reset_bus + .name = "MACIO", + .probe= macio_probe, + .init = macio_init, + .send_request = macio_send_request, + .autopoll = macio_adb_autopoll, + .poll = macio_adb_poll, + .reset_bus= macio_adb_reset_bus, }; int macio_probe(void) diff --git a/drivers/macintosh/via-macii.c b/drivers/macintosh/via-macii.c index 4ba06a1695ea..cf6f7d52d6be 100644 --- a/drivers/macintosh/via-macii.c +++ b/drivers/macintosh/via-macii.c @@ -91,13 +91,13 @@ static void macii_poll(void); static int macii_reset_bus(void); struct 
adb_driver via_macii_driver = { - "Mac II", - macii_probe, - macii_init, - macii_send_request, - macii_autopoll, - macii_poll, - macii_reset_bus + .name = "Mac II", + .probe= macii_probe, + .init = macii_init, + .send_request = macii_send_request, + .autopoll = macii_autopoll, + .poll = macii_poll, + .reset_bus= macii_reset_bus, }; static enum macii_state { diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c index c4c2b3b85ebc..8da474899d79 100644 --- a/drivers/macintosh/via-pmu.c +++ b/drivers/macintosh/via-pmu.c @@ -199,13 +199,13 @@ static const struct file_operations pmu_options_proc_fops; #ifdef CONFIG_ADB struct adb_driver via_pmu_driver = { - "PMU", - pmu_probe, - pmu_init, - pmu_send_request, - pmu_adb_autopoll, - pmu_poll_adb, - pmu_adb_reset_bus + .name = "PMU", + .probe= pmu_probe, + .init = pmu_init, + .send_request = pmu_send_request, + .autopoll = pmu_adb_autopoll, + .poll = pmu_poll_adb, + .reset_bus= pmu_adb_reset_bus, }; #endif /* CONFIG_ADB */ diff --git a/drivers/macintosh/via-pmu68k.c b/drivers/macintosh/via-pmu68k.c index 7d9c4baf8c11..d545ed45e482 100644 --- a/drivers/macintosh/via-pmu68k.c +++ b/drivers/macintosh/via-pmu68k.c @@ -120,13 +120,13 @@ static void pmu_enable_backlight(int on); static void pmu_set_brightness(int level); struct adb_driver via_pmu_driver = { - "68K PMU", - pmu_probe, - pmu_init, - pmu_send_request, - pmu_autopoll, - pmu_poll, - pmu_reset_bus + .name = "68K PMU", + .probe= pmu_probe, + .init = pmu_init, + .send_request = pmu_send_request, + .autopoll = pmu_autopoll, + .poll = pmu_poll, + .reset_bus= pmu_reset_bus, }; /*
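For readers unfamiliar with the conversion above: C99 designated initializers bind each value to a field by name, so the initializer keeps working if fields are added or reordered, which positional initializers do not, and omitted members are zero-initialized. The struct below is a cut-down stand-in for `struct adb_driver`, not the real definition.

```c
#include <string.h>

/* Simplified stand-in for struct adb_driver. */
struct demo_driver {
	const char *name;
	int (*probe)(void);
	int (*init)(void);
};

static int demo_probe(void)
{
	return 1;
}

/* Positional (pre-patch style): order must match the struct exactly,
 * and inserting a new field silently shifts every later value. */
static struct demo_driver positional = { "demo", demo_probe, NULL };

/* Designated (post-patch style): each value is tied to its field by
 * name; .init is omitted and therefore NULL. */
static struct demo_driver designated = {
	.name  = "demo",
	.probe = demo_probe,
};

int drivers_equivalent(void)
{
	return strcmp(positional.name, designated.name) == 0 &&
	       positional.probe == designated.probe &&
	       positional.init == designated.init;
}
```

Both forms produce identical objects here, which is why the patch changes no object files; the designated form is simply safer against future struct changes.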
Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()
Dave Hansen writes: > On 03/28/2018 01:47 PM, Thiago Jung Bauermann wrote: if (flags) - assert(rdpkey_reg() > orig_pkey_reg); + assert(rdpkey_reg() < orig_pkey_reg); } void pkey_write_allow(int pkey) >>> This seems so horribly wrong that I wonder how it worked in the first >>> place. Any idea? >> The code simply wasn't used. pkey_disable_clear() is called by >> pkey_write_allow() and pkey_access_allow(), but before this patch series >> nothing called either of these functions. >> > > Ahh, that explains it. Can that get stuck in the changelog, please? Yes, will be in the next version. -- Thiago Jung Bauermann IBM Linux Technology Center
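The corrected assertion encodes a simple invariant: pkey_disable_clear() only clears per-key disable bits, so the register value can only go down, never up. A toy model of that invariant (names and the two-bits-per-key layout follow the x86 PKRU convention for illustration; this is not the selftest's actual code):

```c
#define PKEY_DISABLE_ACCESS	0x1u
#define PKEY_DISABLE_WRITE	0x2u

/* Stand-in for the hardware pkey register read by rdpkey_reg(). */
unsigned int pkey_reg = 0xffffffffu;	/* start with every key disabled */

/* Clear the given disable flags for one pkey.  Since bits are only
 * ever cleared, the new value is always <= the old one, which is
 * what the fixed assert checks. */
unsigned int pkey_disable_clear(int pkey, unsigned int flags)
{
	pkey_reg &= ~(flags << (2 * pkey));
	return pkey_reg;
}
```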
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Thu, Mar 29, 2018 at 9:14 AM, Russell King - ARM Linuxwrote: > On Wed, Mar 28, 2018 at 02:04:22PM -0500, Rob Landley wrote: >> >> >> On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote: >> > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: >> >> On 03/28/2018 10:26 AM, Shea Levy wrote: >> >>> Now only those architectures that have custom initrd free requirements >> >>> need to define free_initrd_mem. >> >> ... >> >>> --- a/arch/arc/mm/init.c >> >>> +++ b/arch/arc/mm/init.c >> >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void) >> >>> { >> >>> free_initmem_default(-1); >> >>> } >> >>> - >> >>> -#ifdef CONFIG_BLK_DEV_INITRD >> >>> -void __init free_initrd_mem(unsigned long start, unsigned long end) >> >>> -{ >> >>> - free_reserved_area((void *)start, (void *)end, -1, "initrd"); >> >>> -} >> >>> -#endif >> >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >> >>> index 3f972e83909b..19d1c5594e2d 100644 >> >>> --- a/arch/arm/Kconfig >> >>> +++ b/arch/arm/Kconfig >> >>> @@ -47,6 +47,7 @@ config ARM >> >>> select HARDIRQS_SW_RESEND >> >>> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) >> >>> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 >> >>> + select HAVE_ARCH_FREE_INITRD_MEM >> >>> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU >> >>> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU >> >>> select HAVE_ARCH_MMAP_RND_BITS if MMU >> >> >> >> Isn't this why weak symbols were invented? >> > >> > Weak symbols means that we end up with both the weakly-referenced code >> > and the arch code in the kernel image. That's fine if the weak code >> > is small. >> >> The kernel's been able to build with link time garbage collection since 2016: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d >> >> Wouldn't that remove the unused one? > > Probably, if anyone bothered to use that, which they don't. 
> > LD_DEAD_CODE_DATA_ELIMINATION is a symbol without a prompt, and from > what I can see, nothing selects it. Therefore, the symbol is always > disabled, and so the feature never gets used in mainline kernels. > > Brings up the obvious question - why is it there if it's completely > unused? (Maybe to cause confusion, and allowing a justification > for __weak ?) IIRC Nick had some patches to do the arch enablement for powerpc, but I'm not sure what happened to them though. I suspect it just fell down Nick's ever growing TODO list.
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 02:04:22PM -0500, Rob Landley wrote: > > > On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote: > > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: > >> On 03/28/2018 10:26 AM, Shea Levy wrote: > >>> Now only those architectures that have custom initrd free requirements > >>> need to define free_initrd_mem. > >> ... > >>> --- a/arch/arc/mm/init.c > >>> +++ b/arch/arc/mm/init.c > >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void) > >>> { > >>> free_initmem_default(-1); > >>> } > >>> - > >>> -#ifdef CONFIG_BLK_DEV_INITRD > >>> -void __init free_initrd_mem(unsigned long start, unsigned long end) > >>> -{ > >>> - free_reserved_area((void *)start, (void *)end, -1, "initrd"); > >>> -} > >>> -#endif > >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > >>> index 3f972e83909b..19d1c5594e2d 100644 > >>> --- a/arch/arm/Kconfig > >>> +++ b/arch/arm/Kconfig > >>> @@ -47,6 +47,7 @@ config ARM > >>> select HARDIRQS_SW_RESEND > >>> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) > >>> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 > >>> + select HAVE_ARCH_FREE_INITRD_MEM > >>> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU > >>> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > >>> select HAVE_ARCH_MMAP_RND_BITS if MMU > >> > >> Isn't this why weak symbols were invented? > > > > Weak symbols means that we end up with both the weakly-referenced code > > and the arch code in the kernel image. That's fine if the weak code > > is small. > > The kernel's been able to build with link time garbage collection since 2016: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d > > Wouldn't that remove the unused one? Probably, if anyone bothered to use that, which they don't. LD_DEAD_CODE_DATA_ELIMINATION is a symbol without a prompt, and from what I can see, nothing selects it. 
Therefore, the symbol is always disabled, and so the feature never gets used in mainline kernels. Brings up the obvious question - why is it there if it's completely unused? (Maybe to cause confusion, and allowing a justification for __weak ?) -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 28 Mar 2018, at 02:49, Matthew Wilcox wrote: > > On Tue, Mar 27, 2018 at 03:53:53PM -0700, Kees Cook wrote: >> I agree: pushing this off to libc leaves a lot of things unprotected. >> I think this should live in the kernel. The question I have is about >> making it maintainable/readable/etc. >> >> The state-of-the-art for ASLR is moving to finer granularity (over >> just base-address offset), so I'd really like to see this supported in >> the kernel. We'll be getting there for other things in the future, and >> I'd like to have a working production example for researchers to >> study, etc. > > One thing we need is to limit the fragmentation of this approach. > Even on 64-bit systems, we can easily get into a situation where there isn't > space to map a contiguous terabyte. As I wrote before, shift_random was introduced to act as the fragmentation limit. Even without it, the main question here is: 'if we can't allocate N bytes, how many bytes have we already allocated?'. From that point of view, I already showed in the previous version of the patch that if an application uses modest allocations, it will have enough memory. If it uses tens of gigabytes or terabytes of memory, that application has every chance of being exploited with or without full randomization, since it is much easier to find (or guess) a usable pointer. For instance, with only 128 terabytes of address space for user space, the probability of exploiting such an application is on the order of 1/128, which is not secure at all. This is a very rough estimate, but I am trying to make things easier to understand. Best regards, Ilya
RE: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> The default limit of only 65536 VMAs will also quickly come into play > if consecutive anon mmaps don't get merged. Of course this can be > raised, but it has significant resource and performance (fork) costs. Could the random mmap address chooser look for how many existing VMAs have space before/after and the right attributes to merge with the new one you want to create? If this is above some threshold (100?) then pick one of them randomly and allocate the new address so that it will merge from below/above with an existing one. That should still give you a very high degree of randomness, but prevent out of control numbers of VMAs from being created. -Tony
Re: [PATCH] Extract initrd free logic from arch-specific code.
On 03/28/2018 11:48 AM, Russell King - ARM Linux wrote: > On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: >> On 03/28/2018 10:26 AM, Shea Levy wrote: >>> Now only those architectures that have custom initrd free requirements >>> need to define free_initrd_mem. >> ... >>> --- a/arch/arc/mm/init.c >>> +++ b/arch/arc/mm/init.c >>> @@ -229,10 +229,3 @@ void __ref free_initmem(void) >>> { >>> free_initmem_default(-1); >>> } >>> - >>> -#ifdef CONFIG_BLK_DEV_INITRD >>> -void __init free_initrd_mem(unsigned long start, unsigned long end) >>> -{ >>> - free_reserved_area((void *)start, (void *)end, -1, "initrd"); >>> -} >>> -#endif >>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >>> index 3f972e83909b..19d1c5594e2d 100644 >>> --- a/arch/arm/Kconfig >>> +++ b/arch/arm/Kconfig >>> @@ -47,6 +47,7 @@ config ARM >>> select HARDIRQS_SW_RESEND >>> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) >>> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 >>> + select HAVE_ARCH_FREE_INITRD_MEM >>> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU >>> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU >>> select HAVE_ARCH_MMAP_RND_BITS if MMU >> >> Isn't this why weak symbols were invented? > > Weak symbols means that we end up with both the weakly-referenced code > and the arch code in the kernel image. That's fine if the weak code > is small. The kernel's been able to build with link time garbage collection since 2016: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b67067f1176d Wouldn't that remove the unused one? Rob
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 28 Mar 2018, at 01:16, Theodore Y. Ts'o wrote: > > On Tue, Mar 27, 2018 at 04:51:08PM +0300, Ilya Smith wrote: >>> /dev/[u]random is not sufficient? >> >> Using /dev/[u]random makes 3 syscalls - open, read, close. This is a >> performance >> issue. > > You may want to take a look at the getrandom(2) system call, which is > the recommended way of getting secure random numbers from the kernel. > >>> Well, I am pretty sure userspace can implement proper free ranges >>> tracking… >> >> I think we need to know what libc developers will say on implementing ASLR >> in >> user-mode. I am pretty sure they will say 'never' or 'some-day'. And the >> problem >> of ASLR will stay forever. > > Why can't you send patches to the libc developers? > > Regards, > > - Ted I still believe the issue is on the kernel side, not in the library. Best regards, Ilya
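For reference, the getrandom(2) call Ted mentions replaces the open/read/close triple with a single syscall; glibc has exposed a wrapper for it since version 2.25. A minimal sketch:

```c
#include <stdint.h>
#include <sys/random.h>

/* Fill a 64-bit value from the kernel's urandom pool in one syscall;
 * loop to handle short reads or EINTR. */
uint64_t get_random_u64(void)
{
	uint64_t r = 0;

	while (getrandom(&r, sizeof(r), 0) != (ssize_t)sizeof(r))
		;
	return r;
}
```

With flags = 0 this reads from the urandom pool and blocks only until the pool is initialized, so it avoids both the syscall overhead Ilya objects to and the early-boot weakness of /dev/urandom.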
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 27 Mar 2018, at 17:38, Michal Hockowrote: > > On Tue 27-03-18 16:51:08, Ilya Smith wrote: >> >>> On 27 Mar 2018, at 10:24, Michal Hocko wrote: >>> >>> On Mon 26-03-18 22:45:31, Ilya Smith wrote: > On 26 Mar 2018, at 11:46, Michal Hocko wrote: > > On Fri 23-03-18 20:55:49, Ilya Smith wrote: >> >>> On 23 Mar 2018, at 15:48, Matthew Wilcox wrote: >>> >>> On Thu, Mar 22, 2018 at 07:36:36PM +0300, Ilya Smith wrote: Current implementation doesn't randomize address returned by mmap. All the entropy ends with choosing mmap_base_addr at the process creation. After that mmap build very predictable layout of address space. It allows to bypass ASLR in many cases. This patch make randomization of address on any mmap call. >>> >>> Why should this be done in the kernel rather than libc? libc is >>> perfectly >>> capable of specifying random numbers in the first argument of mmap. >> Well, there is following reasons: >> 1. It should be done in any libc implementation, what is not possible >> IMO; > > Is this really so helpful? Yes, ASLR is one of very important mitigation techniques which are really used to protect applications. If there is no ASLR, it is very easy to exploit vulnerable application and compromise the system. We can’t just fix all the vulnerabilities right now, thats why we have mitigations - techniques which are makes exploitation more hard or impossible in some cases. Thats why it is helpful. >>> >>> I am not questioning ASLR in general. I am asking whether we really need >>> per mmap ASLR in general. I can imagine that some environments want to >>> pay the additional price and other side effects, but considering this >>> can be achieved by libc, why to add more code to the kernel? >> >> I believe this is the only one right place for it. Adding these 200+ lines >> of >> code we give this feature for any user - on desktop, on server, on IoT >> device, >> on SCADA, etc. 
But if only glibc implements 'user-mode ASLR', IoT and >> SCADA >> devices will never get it. > I guess it would really help if you could be more specific about the > class of security issues this would help to mitigate. My first > understanding was that we need some randomization between program > executable segments to reduce the attack space when a single address > leaks and you know the segments layout (ordering). But why do we need > _all_ mmaps to be randomized? Because that complicates the > implementation considerably, for the different reasons you have mentioned > earlier. > There are the following reasons: 1) To protect the layout if one region has leaked (as you said). 2) To protect against exploitation of out-of-bounds vulnerabilities in some cases (CWE-125, CWE-787). 3) To protect against exploitation of buffer overflows in some cases (CWE-120). 4) To protect the application in cases when the attacker needs to guess an address (see the ASLR-NG paper by Hector Marco-Gisbert and Ismael Ripoll-Ripoll). And maybe more cases. > Do you have any specific CVE that would be mitigated by this > randomization approach? > I am sorry, I am not a security expert to see all the consequences, but a > vague 'the more randomization the better' sounds rather weak to me. It is hard to name a concrete CVE number, sorry. Mitigations are made to prevent exploitation, not to fix vulnerabilities; a good mitigation will make a vulnerable application crash rather than be compromised in most cases. This means the better the randomization, the lower the successful exploitation rate. Thanks, Ilya
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 10:58:51AM -0500, Rob Landley wrote: > On 03/28/2018 10:26 AM, Shea Levy wrote: > > Now only those architectures that have custom initrd free requirements > > need to define free_initrd_mem. > ... > > --- a/arch/arc/mm/init.c > > +++ b/arch/arc/mm/init.c > > @@ -229,10 +229,3 @@ void __ref free_initmem(void) > > { > > free_initmem_default(-1); > > } > > - > > -#ifdef CONFIG_BLK_DEV_INITRD > > -void __init free_initrd_mem(unsigned long start, unsigned long end) > > -{ > > - free_reserved_area((void *)start, (void *)end, -1, "initrd"); > > -} > > -#endif > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > > index 3f972e83909b..19d1c5594e2d 100644 > > --- a/arch/arm/Kconfig > > +++ b/arch/arm/Kconfig > > @@ -47,6 +47,7 @@ config ARM > > select HARDIRQS_SW_RESEND > > select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) > > select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 > > + select HAVE_ARCH_FREE_INITRD_MEM > > select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU > > select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > > select HAVE_ARCH_MMAP_RND_BITS if MMU > > Isn't this why weak symbols were invented? Weak symbols means that we end up with both the weakly-referenced code and the arch code in the kernel image. That's fine if the weak code is small. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
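For comparison, the weak-symbol alternative Rob is alluding to would look roughly like this: a generic default definition that any architecture can override simply by providing a strong definition of the same symbol. As Russell notes, the cost is that an overridden weak body still occupies space in the image unless something discards it. Function names here are illustrative.

```c
#include <stdio.h>

/* Generic default: the linker uses this whenever no strong definition
 * of the same symbol exists.  An arch overrides it by defining a
 * non-weak free_initrd_mem_demo() in its own object file - but the
 * weak body below still ends up in the final image. */
__attribute__((weak)) int free_initrd_mem_demo(void)
{
	puts("generic initrd free path");
	return 0;
}

int run_initrd_free(void)
{
	/* Resolves to the strong definition if one was linked in,
	 * otherwise to the weak default above. */
	return free_initrd_mem_demo();
}
```

This is the trade-off the thread is weighing: a Kconfig symbol makes the choice explicit at configure time, while __weak makes it implicit at link time at the cost of carrying dead default code.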
Re: [PATCH] Extract initrd free logic from arch-specific code.
Hi Rob, Rob Landley writes: > On 03/28/2018 10:26 AM, Shea Levy wrote: >> Now only those architectures that have custom initrd free requirements >> need to define free_initrd_mem. > ... >> --- a/arch/arc/mm/init.c >> +++ b/arch/arc/mm/init.c >> @@ -229,10 +229,3 @@ void __ref free_initmem(void) >> { >> free_initmem_default(-1); >> } >> - >> -#ifdef CONFIG_BLK_DEV_INITRD >> -void __init free_initrd_mem(unsigned long start, unsigned long end) >> -{ >> -free_reserved_area((void *)start, (void *)end, -1, "initrd"); >> -} >> -#endif >> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >> index 3f972e83909b..19d1c5594e2d 100644 >> --- a/arch/arm/Kconfig >> +++ b/arch/arm/Kconfig >> @@ -47,6 +47,7 @@ config ARM >> select HARDIRQS_SW_RESEND >> select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) >> select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 >> +select HAVE_ARCH_FREE_INITRD_MEM >> select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU >> select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU >> select HAVE_ARCH_MMAP_RND_BITS if MMU > > Isn't this why weak symbols were invented? > This approach was suggested by Christoph Hellwig upthread, and seems to have some precedent elsewhere (e.g. strncasecmp), but I agree weak symbols seem appropriate here. I'm happy to implement either approach! > > Confused, > > Rob Thanks, Shea
Re: [PATCH] Extract initrd free logic from arch-specific code.
On 03/28/2018 10:26 AM, Shea Levy wrote: > Now only those architectures that have custom initrd free requirements > need to define free_initrd_mem. ... > --- a/arch/arc/mm/init.c > +++ b/arch/arc/mm/init.c > @@ -229,10 +229,3 @@ void __ref free_initmem(void) > { > free_initmem_default(-1); > } > - > -#ifdef CONFIG_BLK_DEV_INITRD > -void __init free_initrd_mem(unsigned long start, unsigned long end) > -{ > - free_reserved_area((void *)start, (void *)end, -1, "initrd"); > -} > -#endif > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > index 3f972e83909b..19d1c5594e2d 100644 > --- a/arch/arm/Kconfig > +++ b/arch/arm/Kconfig > @@ -47,6 +47,7 @@ config ARM > select HARDIRQS_SW_RESEND > select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) > select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 > + select HAVE_ARCH_FREE_INITRD_MEM > select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU > select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU > select HAVE_ARCH_MMAP_RND_BITS if MMU Isn't this why weak symbols were invented? Confused, Rob
[PATCH] Extract initrd free logic from arch-specific code.
Now only those architectures that have custom initrd free requirements need to define free_initrd_mem. Signed-off-by: Shea Levy--- arch/alpha/mm/init.c | 8 arch/arc/mm/init.c| 7 --- arch/arm/Kconfig | 1 + arch/arm64/Kconfig| 1 + arch/blackfin/Kconfig | 1 + arch/c6x/mm/init.c| 7 --- arch/cris/Kconfig | 1 + arch/frv/mm/init.c| 11 --- arch/h8300/mm/init.c | 7 --- arch/hexagon/Kconfig | 1 + arch/ia64/Kconfig | 1 + arch/m32r/Kconfig | 1 + arch/m32r/mm/init.c | 11 --- arch/m68k/mm/init.c | 7 --- arch/metag/Kconfig| 1 + arch/microblaze/mm/init.c | 7 --- arch/mips/Kconfig | 1 + arch/mn10300/Kconfig | 1 + arch/nios2/mm/init.c | 7 --- arch/openrisc/mm/init.c | 7 --- arch/parisc/mm/init.c | 7 --- arch/powerpc/mm/mem.c | 7 --- arch/riscv/mm/init.c | 6 -- arch/s390/Kconfig | 1 + arch/score/Kconfig| 1 + arch/sh/mm/init.c | 7 --- arch/sparc/Kconfig| 1 + arch/tile/Kconfig | 1 + arch/um/kernel/mem.c | 7 --- arch/unicore32/Kconfig| 1 + arch/x86/Kconfig | 1 + arch/xtensa/Kconfig | 1 + init/initramfs.c | 7 +++ usr/Kconfig | 4 34 files changed, 28 insertions(+), 113 deletions(-) diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c index 9d74520298ab..55f7c8efa962 100644 --- a/arch/alpha/mm/init.c +++ b/arch/alpha/mm/init.c @@ -291,11 +291,3 @@ free_initmem(void) { free_initmem_default(-1); } - -#ifdef CONFIG_BLK_DEV_INITRD -void -free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index ba145065c579..7bcf23ab1756 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -229,10 +229,3 @@ void __ref free_initmem(void) { free_initmem_default(-1); } - -#ifdef CONFIG_BLK_DEV_INITRD -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 3f972e83909b..19d1c5594e2d 100644 --- a/arch/arm/Kconfig +++ 
b/arch/arm/Kconfig @@ -47,6 +47,7 @@ config ARM select HARDIRQS_SW_RESEND select HAVE_ARCH_AUDITSYSCALL if (AEABI && !OABI_COMPAT) select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6 + select HAVE_ARCH_FREE_INITRD_MEM select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index cb03e93f03cf..de93620870af 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -85,6 +85,7 @@ config ARM64 select HAVE_ALIGNED_STRUCT_PAGE if SLUB select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_BITREVERSE + select HAVE_ARCH_FREE_INITRD_MEM select HAVE_ARCH_HUGE_VMAP select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48) diff --git a/arch/blackfin/Kconfig b/arch/blackfin/Kconfig index d9c2866ba618..6c6dae9fe894 100644 --- a/arch/blackfin/Kconfig +++ b/arch/blackfin/Kconfig @@ -15,6 +15,7 @@ config BLACKFIN def_bool y select HAVE_ARCH_KGDB select HAVE_ARCH_TRACEHOOK + select HAVE_ARCH_FREE_INITRD_MEM select HAVE_DYNAMIC_FTRACE select HAVE_FTRACE_MCOUNT_RECORD select HAVE_FUNCTION_GRAPH_TRACER diff --git a/arch/c6x/mm/init.c b/arch/c6x/mm/init.c index 4cc72b0d1c1d..a11cb657182a 100644 --- a/arch/c6x/mm/init.c +++ b/arch/c6x/mm/init.c @@ -66,13 +66,6 @@ void __init mem_init(void) mem_init_print_info(NULL); } -#ifdef CONFIG_BLK_DEV_INITRD -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif - void __init free_initmem(void) { free_initmem_default(-1); diff --git a/arch/cris/Kconfig b/arch/cris/Kconfig index cd5a0865c97f..5425f77e5664 100644 --- a/arch/cris/Kconfig +++ b/arch/cris/Kconfig @@ -76,6 +76,7 @@ config CRIS select HAVE_DEBUG_BUGVERBOSE if ETRAX_ARCH_V32 select HAVE_NMI select DMA_DIRECT_OPS if PCI + select HAVE_ARCH_FREE_INITRD_MEM config HZ int diff --git a/arch/frv/mm/init.c 
b/arch/frv/mm/init.c index cf464100e838..345edc4dc462 100644 --- a/arch/frv/mm/init.c +++ b/arch/frv/mm/init.c @@ -131,14 +131,3 @@ void free_initmem(void) free_initmem_default(-1); #endif } /* end free_initmem() */ - -/*/ -/* - * free the initial ramdisk memory - */
Re: Linux 4.16: Reported regressions as of Tuesday, 2018-03-27 (Was: Linux 4.16-rc7)
On Tue, Mar 27, 2018 at 09:13:32PM +0200, Thorsten Leemhuis wrote: > On 26.03.2018 01:37, Linus Torvalds wrote: > > […] Anyway. Go out and test. And let's hope next week is nice and calm and > > I can release the final 4.16 next Sunday without any extra rc's. > > > >Linus > > Hi! Find below my sixth regression report for Linux 4.16. It lists 7 > regressions I'm currently aware of. 2 were fixed since last weeks > report; 2 are new. > > Are you aware of any other regressions that got introduced this > development cycle? Then please let me know by mail (a simple bounce or > forward to the sender of this email address is enough!). And please tell > me if there is anything in the report that shouldn't be there. > > Ciao, Thorsten > > == Current regressions == > > Dell R640 does not boot due to SCSI/SATA failure > - Status: Afaics still unfixed; lost track, ask reporter for an update > on Monday morning, no reply yet > - Cause: https://git.kernel.org/torvalds/c/84676c1f21e8 > - Reported: 2018-02-22 > https://marc.info/?l=linux-kernel=151931128006031 > - Note: Issue understood and even (kind of accidentally) fixed by a > patch series that was proposed for 4.17 (see links) > - Last known developer activity: 2018-03-14 > https://marc.info/?l=linux-block=152102086831636=2 > - Other relevant links: > https://marc.info/?l=linux-block=152051511802229=2 > https://marc.info/?l=linux-kernel=152026091325037 I'm not the original reporter but I can confirm this regression has been fixed by git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes (tag 0c30612d535460af7fe16e7e6ed8b97defe4adbe) The machine has a HP array and stopped booting some time around rc1. I tried to bisect it and in the end it was the commit 84676c1f21e8 that I reverted for testing builds, so I'm sure it's the same issue as you've been tracking.
Re: [PATCH] Extract initrd free logic from arch-specific code.
On Wed, Mar 28, 2018 at 8:26 AM, Shea Levy wrote: > Now only those architectures that have custom initrd free requirements > need to define free_initrd_mem. > > Signed-off-by: Shea Levy Yay consolidation! :) > --- a/usr/Kconfig > +++ b/usr/Kconfig > @@ -233,3 +233,7 @@ config INITRAMFS_COMPRESSION > default ".lzma" if RD_LZMA > default ".bz2" if RD_BZIP2 > default "" > + > +config HAVE_ARCH_FREE_INITRD_MEM > + bool > + default n If you keep the Kconfig, you can leave off "default n", and I'd suggest adding a help section just to describe what the per-arch responsibilities are when select-ing the config. (See HAVE_ARCH_SECCOMP_FILTER for an example.) -Kees -- Kees Cook Pixel Security
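Concretely, Kees's suggestion would turn the hunk into something like the following; the help-text wording is a sketch of what such a section might say, not his proposed text:

```
config HAVE_ARCH_FREE_INITRD_MEM
	bool
	help
	  Architectures that select this symbol must provide their own
	  implementation of free_initrd_mem() instead of using the
	  generic one in init/initramfs.c.
```

Since `bool` symbols default to `n` anyway, dropping `default n` loses nothing.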
Re: RFC on writel and writel_relaxed
On Thu, 29 Mar 2018 08:31:32 +1100 Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > > David Miller wrote: > > > > > From: Benjamin Herrenschmidt > > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > > the archs are unused or dead anyway. > > > > > > Agreed. > > > > While we're making decrees here, can we do something about mmiowb? > > The semantics are basically indecipherable. > > I was going to tackle that next :-) > > > This is a variation on the mandatory write barrier that causes writes to > > weakly > > ordered I/O regions to be partially ordered. Its effects may go beyond > > the > > CPU->Hardware interface and actually affect the hardware at some level. > > > > How can a driver writer possibly get that right? > > > > IIRC it was added for some big ia64 system that was really expensive > > to implement the proper wmb() semantics on. So wmb() semantics were > > quietly downgraded, then the subsequently broken drivers they cared > > about were fixed by adding the stronger mmiowb(). > > > > What should have happened was wmb and writel remained correct, sane, and > > expensive, and they add an mmio_wmb() to order MMIO stores made by the > > writel_relaxed accessors, then use that to speed up the few drivers they > > care about. > > > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > > make wmb ordering talk about stores to the device, not to some > > intermediate stage of the interconnect where it can be subsequently > > reordered wrt the device? Drivers can be converted back to using wmb > > or writel gradually. > > I was under the impression that mmiowb was specifically about ordering > writel's with a subsequent spin_unlock; without it, MMIOs from > different CPUs (within the same lock) would still arrive out of order.
Yes more or less, and I think that until mmiowb was introduced, wmb or writel was sufficient for this. Thanks, Nick
Re: [PATCH v3 00/41] cxlflash: OCXL transport support and miscellaneous fixes
Uma, > This patch series adds OCXL support to the cxlflash driver. With this > support, new devices using the OCXL transport will be supported by the > cxlflash driver along with the existing CXL devices. An effort is made > to keep this transport specific function independent of the existing > core driver that communicates with the AFU. > > The first three patches contain a minor fix and staging improvements. > > This series is intended for 4.17 and is bisectable. Something this big really needs to be submitted around rc2/rc3. It's way too late in the cycle to ensure proper zeroday coverage for all these commits. I have started a 4.18/scsi-queue branch to hold this series for now. The 4.18 branch will be rebased once 4.17rc1 is out in a few weeks. Your changes won't show up in for-next until then either. Thanks! -- Martin K. Petersen Oracle Linux Engineering
Re: RFC on writel and writel_relaxed
On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > David Miller wrote: > > > From: Benjamin Herrenschmidt > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > the archs are unused or dead anyway. > > > > Agreed. > > While we're making decrees here, can we do something about mmiowb? > The semantics are basically indecipherable. I was going to tackle that next :-) > This is a variation on the mandatory write barrier that causes writes to > weakly > ordered I/O regions to be partially ordered. Its effects may go beyond the > CPU->Hardware interface and actually affect the hardware at some level. > > How can a driver writer possibly get that right? > > IIRC it was added for some big ia64 system that was really expensive > to implement the proper wmb() semantics on. So wmb() semantics were > quietly downgraded, then the subsequently broken drivers they cared > about were fixed by adding the stronger mmiowb(). > > What should have happened was wmb and writel remained correct, sane, and > expensive, and they add an mmio_wmb() to order MMIO stores made by the > writel_relaxed accessors, then use that to speed up the few drivers they > care about. > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > make wmb ordering talk about stores to the device, not to some > intermediate stage of the interconnect where it can be subsequently > reordered wrt the device? Drivers can be converted back to using wmb > or writel gradually. I was under the impression that mmiowb was specifically about ordering writel's with a subsequent spin_unlock, without it, MMIOs from different CPUs (within the same lock) would still arrive OO. If that's indeed the case, I would suggest ia64 switches to a similar per-cpu flag trick powerpc uses. Cheers, Ben. > Thanks, > Nick
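For readers following the thread, the pattern Ben describes can be sketched in kernel-style pseudocode. This is purely illustrative (the driver, `struct hw_ring`, and its `doorbell` field are invented; only `spin_lock`/`writel`/`mmiowb` are real kernel APIs): two CPUs serialized by the same spinlock can still have their MMIO writes reach the device out of order on the platforms under discussion.

```
/* Pseudocode sketch of a hypothetical driver -- not from any real one. */
static DEFINE_SPINLOCK(hw_lock);

static void hw_submit(struct hw_ring *ring, u32 tail)
{
	spin_lock(&hw_lock);
	writel(tail, ring->doorbell);	/* posted MMIO write */
	/*
	 * The unlock below only orders the CPU's own stores.  On the
	 * affected platforms the MMIO write may still be in flight in
	 * the interconnect, so the next CPU to take hw_lock can have
	 * its doorbell write arrive at the device first.  mmiowb()
	 * pushes the MMIO write out before the lock is released.
	 */
	mmiowb();
	spin_unlock(&hw_lock);
}
```

The per-cpu flag trick Ben mentions makes the unlock path itself perform the flush, but only when an MMIO write actually happened inside the critical section, which is roughly how powerpc avoids requiring explicit mmiowb() calls in drivers.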
Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder
On Wed, 28 Mar 2018, Laurent Dufour wrote: > >> @@ -326,7 +336,10 @@ static unsigned long move_vma(struct vm_area_struct > >> *vma, > >>mremap_userfaultfd_prep(new_vma, uf); > >>arch_remap(mm, old_addr, old_addr + old_len, > >> new_addr, new_addr + new_len); > >> + if (vma != new_vma) > >> + vm_raw_write_end(vma); > >>} > >> + vm_raw_write_end(new_vma); > > > > Just do > > > > vm_raw_write_end(vma); > > vm_raw_write_end(new_vma); > > > > here. > > Are you sure ? we can have vma = new_vma done if (unlikely(err)) > Sorry, what I meant was do if (vma != new_vma) vm_raw_write_end(vma); vm_raw_write_end(new_vma); after the conditional. Having the locking unnecessarily embedded in the conditional has been an issue in the past with other areas of core code, unless you have a strong reason for it.
Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT
On Wed, 28 Mar 2018, Laurent Dufour wrote: > > Putting this in mm/Kconfig is definitely the right way to go about it > > instead of any generic option in arch/*. > > > > My question, though, was making this configurable by the user: > > > > config SPECULATIVE_PAGE_FAULT > > bool "Speculative page faults" > > depends on X86_64 || PPC > > default y > > help > > .. > > > > It's a question about whether we want this always enabled on x86_64 and > > power or whether the user should be able to disable it (right now they > > can't). With a large feature like this, you may want to offer something > > simple (disable CONFIG_SPECULATIVE_PAGE_FAULT) if someone runs into > > regressions. > > I agree, but I think it would be important to get the per architecture > enablement to avoid complex check here. For instance in the case of powerPC > this is only supported for PPC_BOOK3S_64. > > To avoid exposing such per architecture define here, what do you think about > having supporting architectures setting ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > and the SPECULATIVE_PAGE_FAULT depends on this, like this: > > In mm/Kconfig: > config SPECULATIVE_PAGE_FAULT > bool "Speculative page faults" > depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && SMP > default y > help > ... > > In arch/powerpc/Kconfig: > config PPC > ... > select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64 > > In arch/x86/Kconfig: > config X86_64 > ... > select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > > Looks good to me! It feels like this will add more assurance that if things regress for certain workloads that it can be disabled. I don't feel strongly about the default value, I'm ok with it being enabled by default.
Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()
On 03/28/2018 01:47 PM, Thiago Jung Bauermann wrote: >>> if (flags) >>> - assert(rdpkey_reg() > orig_pkey_reg); >>> + assert(rdpkey_reg() < orig_pkey_reg); >>> } >>> >>> void pkey_write_allow(int pkey) >> This seems so horribly wrong that I wonder how it worked in the first >> place. Any idea? > The code simply wasn't used. pkey_disable_clear() is called by > pkey_write_allow() and pkey_access_allow(), but before this patch series > nothing called either of these functions. > Ahh, that explains it. Can that get stuck in the changelog, please?
Re: [PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
Joe Perches writes: > On Wed, 2018-03-28 at 16:36 -0400, Shea Levy wrote: >> Signed-off-by: Shea Levy > > Most people seem to want some form of commit message > and not just your sign-off. > Ah, if the subject is insufficient I can add some more detail. > > And btw: > > It seems you used get_maintainer to determine who to > send these patches to. > > I suggest you add --nogit and --nogit-fallback to the > get_maintainer command line you use to avoid sending > these patches to people like me that have done drive-by > cleanup work on these files. Whoops, thanks for the tip and sorry for the noise!
Re: [PATCH v12 07/22] selftests/vm: fixed bugs in pkey_disable_clear()
Dave Hansen writes: > On 02/21/2018 05:55 PM, Ram Pai wrote: >> --- a/tools/testing/selftests/vm/protection_keys.c >> +++ b/tools/testing/selftests/vm/protection_keys.c >> @@ -461,7 +461,7 @@ void pkey_disable_clear(int pkey, int flags) >> pkey, pkey, pkey_rights); >> pkey_assert(pkey_rights >= 0); >> >> -pkey_rights |= flags; >> +pkey_rights &= ~flags; >> >> ret = pkey_set(pkey, pkey_rights, 0); >> /* pkey_reg and flags have the same format */ >> @@ -475,7 +475,7 @@ void pkey_disable_clear(int pkey, int flags) >> dprintf1("%s(%d) pkey_reg: 0x%016lx\n", __func__, >> pkey, rdpkey_reg()); >> if (flags) >> -assert(rdpkey_reg() > orig_pkey_reg); >> +assert(rdpkey_reg() < orig_pkey_reg); >> } >> >> void pkey_write_allow(int pkey) > > This seems so horribly wrong that I wonder how it worked in the first > place. Any idea? The code simply wasn't used. pkey_disable_clear() is called by pkey_write_allow() and pkey_access_allow(), but before this patch series nothing called either of these functions. -- Thiago Jung Bauermann IBM Linux Technology Center
Re: [PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
On Wed, 2018-03-28 at 16:36 -0400, Shea Levy wrote: > Signed-off-by: Shea Levy
Most people seem to want some form of commit message and not just your sign-off. And btw: It seems you used get_maintainer to determine who to send these patches to. I suggest you add --nogit and --nogit-fallback to the get_maintainer command line you use to avoid sending these patches to people like me that have done drive-by cleanup work on these files.
Re: [PATCH] powerpc/64: Fix checksum folding in csum_add
On Tue, Mar 27, 2018 at 05:22:32PM +0200, LEROY Christophe wrote: > Shile Zhang wrote: > > >fix the missed point in Paul's patch: > >"powerpc/64: Fix checksum folding in csum_tcpudp_nofold and > >ip_fast_csum_nofold" > > > >Signed-off-by: Shile Zhang > >--- > > arch/powerpc/include/asm/checksum.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > >diff --git a/arch/powerpc/include/asm/checksum.h > >b/arch/powerpc/include/asm/checksum.h > >index 5b1a6e3..430d038 100644 > >--- a/arch/powerpc/include/asm/checksum.h > >+++ b/arch/powerpc/include/asm/checksum.h > >@@ -108,7 +108,7 @@ static inline __wsum csum_add(__wsum csum, __wsum addend) > > > > #ifdef __powerpc64__ > > res += (__force u64)addend; > >-return (__force __wsum)((u32)res + (res >> 32)); > >+return (__force __wsum) from64to32(res); > > Did you encounter a bug due to that ? > As far as i understand, csum and addend are 32 bits so can't exceed 0x > Then their sum won't exceed 0x1fffe. So the sum of upper and lower part > won't carry
If the sum of the two halves was 0x1fffe, then that previously got truncated to 32 bits and returned as 0xfffe, which is wrong - the result should be 0x. Paul.
[PATCH v4 14/16] powerpc: Use generic free_initrd_mem.
Signed-off-by: Shea Levy --- arch/powerpc/mm/mem.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index fe8c61149fb8..e85b2a3cd264 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -404,13 +404,6 @@ void free_initmem(void) free_initmem_default(POISON_FREE_INITMEM); } -#ifdef CONFIG_BLK_DEV_INITRD -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} -#endif - /* * This is called when a page has been modified by the kernel. * It just marks the page as not i-cache clean. We do the i-cache -- 2.16.2
[PATCH v2 04/19] powerpc/kvm: Prefer fault_in_pages_readable function
Directly use fault_in_pages_readable instead of manual __get_user code. Fix warning treated as error with W=1: arch/powerpc/kernel/kvm.c:675:6: error: variable ‘tmp’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: use fault_in_pages_readable instead arch/powerpc/kernel/kvm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c index 9ad37f827a97..124c51030b75 100644 --- a/arch/powerpc/kernel/kvm.c +++ b/arch/powerpc/kernel/kvm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -672,14 +673,13 @@ static void kvm_use_magic_page(void) { u32 *p; u32 *start, *end; - u32 tmp; u32 features; /* Tell the host to map the magic page to -4096 on all CPUs */ on_each_cpu(kvm_map_magic_page, , 1); /* Quick self-test to see if the mapping works */ - if (__get_user(tmp, (u32*)KVM_MAGIC_PAGE)) { + if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) { kvm_patching_worked = false; return; } -- 2.11.0
[PATCH v2 07/19] powerpc/powermac: Make some functions static
These functions can all be static, make it so. Fix warnings treated as errors with W=1: arch/powerpc/platforms/powermac/pci.c:1022:6: error: no previous prototype for ‘pmac_pci_fixup_ohci’ [-Werror=missing-prototypes] arch/powerpc/platforms/powermac/pci.c:1057:6: error: no previous prototype for ‘pmac_pci_fixup_cardbus’ [-Werror=missing-prototypes] arch/powerpc/platforms/powermac/pci.c:1094:6: error: no previous prototype for ‘pmac_pci_fixup_pciata’ [-Werror=missing-prototypes] Remove has_address declaration and assignment since not used. Also add gcc attribute unused to fix a warning treated as error with W=1: arch/powerpc/platforms/powermac/pci.c:784:19: error: variable ‘has_address’ set but not used [-Werror=unused-but-set-variable] arch/powerpc/platforms/powermac/pci.c:907:22: error: variable ‘ht’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: remove has_address variable since not used arch/powerpc/platforms/powermac/pci.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c index 0b8174a79993..67c497093e0a 100644 --- a/arch/powerpc/platforms/powermac/pci.c +++ b/arch/powerpc/platforms/powermac/pci.c @@ -781,12 +781,12 @@ static int __init pmac_add_bridge(struct device_node *dev) struct resource rsrc; char *disp_name; const int *bus_range; - int primary = 1, has_address = 0; + int primary = 1; DBG("Adding PCI host bridge %pOF\n", dev); /* Fetch host bridge registers address */ - has_address = (of_address_to_resource(dev, 0, ) == 0); + of_address_to_resource(dev, 0, ); /* Get bus range if any */ bus_range = of_get_property(dev, "bus-range", ); @@ -904,7 +904,7 @@ static int pmac_pci_root_bridge_prepare(struct pci_host_bridge *bridge) void __init pmac_pci_init(void) { struct device_node *np, *root; - struct device_node *ht = NULL; + struct device_node *ht __maybe_unused = NULL; 
pci_set_flags(PCI_CAN_SKIP_ISA_ALIGN); @@ -1019,7 +1019,7 @@ static bool pmac_pci_enable_device_hook(struct pci_dev *dev) return true; } -void pmac_pci_fixup_ohci(struct pci_dev *dev) +static void pmac_pci_fixup_ohci(struct pci_dev *dev) { struct device_node *node = pci_device_to_OF_node(dev); @@ -1054,7 +1054,7 @@ void __init pmac_pcibios_after_init(void) } } -void pmac_pci_fixup_cardbus(struct pci_dev* dev) +static void pmac_pci_fixup_cardbus(struct pci_dev *dev) { if (!machine_is(powermac)) return; @@ -1091,7 +1091,7 @@ void pmac_pci_fixup_cardbus(struct pci_dev* dev) DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_TI, PCI_ANY_ID, pmac_pci_fixup_cardbus); -void pmac_pci_fixup_pciata(struct pci_dev* dev) +static void pmac_pci_fixup_pciata(struct pci_dev *dev) { u8 progif = 0; -- 2.11.0
[PATCH v2 05/19] powerpc/chrp/setup: Add attribute unused and make some functions static
Remove variable declaration idu_size and associated code since not used. These functions can all be static, make it so. Fix warnings treated as errors with W=1: arch/powerpc/platforms/chrp/setup.c:97:6: error: no previous prototype for ‘chrp_show_cpuinfo’ [-Werror=missing-prototypes] arch/powerpc/platforms/chrp/setup.c:302:13: error: no previous prototype for ‘chrp_setup_arch’ [-Werror=missing-prototypes] arch/powerpc/platforms/chrp/setup.c:385:16: error: variable ‘idu_size’ set but not used [-Werror=unused-but-set-variable] arch/powerpc/platforms/chrp/setup.c:526:13: error: no previous prototype for ‘chrp_init_IRQ’ [-Werror=missing-prototypes] arch/powerpc/platforms/chrp/setup.c:559:1: error: no previous prototype for ‘chrp_init2’ [-Werror=missing-prototypes] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: Remove totally variable idu_size arch/powerpc/platforms/chrp/setup.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/chrp/setup.c b/arch/powerpc/platforms/chrp/setup.c index 481ed133e04b..d6d8ffc0271e 100644 --- a/arch/powerpc/platforms/chrp/setup.c +++ b/arch/powerpc/platforms/chrp/setup.c @@ -94,7 +94,7 @@ static const char *chrp_names[] = { "Total Impact Briq" }; -void chrp_show_cpuinfo(struct seq_file *m) +static void chrp_show_cpuinfo(struct seq_file *m) { int i, sdramen; unsigned int t; @@ -299,7 +299,7 @@ static __init void chrp_init(void) of_node_put(node); } -void __init chrp_setup_arch(void) +static void __init chrp_setup_arch(void) { struct device_node *root = of_find_node_by_path("/"); const char *machine = NULL; @@ -382,7 +382,7 @@ static void __init chrp_find_openpic(void) { struct device_node *np, *root; int len, i, j; - int isu_size, idu_size; + int isu_size; const unsigned int *iranges, *opprop = NULL; int oplen = 0; unsigned long opaddr; @@ -427,11 +427,9 @@ static void __init chrp_find_openpic(void) } isu_size = 0; - idu_size = 0; if (len > 0 && iranges[1] != 0) { 
printk(KERN_INFO "OpenPIC irqs %d..%d in IDU\n", iranges[0], iranges[0] + iranges[1] - 1); - idu_size = iranges[1]; } if (len > 1) isu_size = iranges[3]; @@ -523,7 +521,7 @@ static void __init chrp_find_8259(void) } } -void __init chrp_init_IRQ(void) +static void __init chrp_init_IRQ(void) { #if defined(CONFIG_VT) && defined(CONFIG_INPUT_ADBHID) && defined(CONFIG_XMON) struct device_node *kbd; @@ -555,7 +553,7 @@ void __init chrp_init_IRQ(void) #endif } -void __init +static void __init chrp_init2(void) { #ifdef CONFIG_NVRAM -- 2.11.0
[PATCH v2 03/19] powerpc: Mark variables as unused
Add gcc attribute unused for two variables. Fix warnings treated as errors with W=1: arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: move path within ifdef DEBUG_PROM arch/powerpc/kernel/prom_init.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index acf4b2e0530c..4163b11abb6c 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -603,7 +603,7 @@ static void __init early_cmdline_parse(void) const char *opt; char *p; - int l = 0; + int l __maybe_unused = 0; prom_cmd_line[0] = 0; p = prom_cmd_line; @@ -1385,7 +1385,7 @@ static void __init reserve_mem(u64 base, u64 size) static void __init prom_init_mem(void) { phandle node; - char *path, type[64]; + char *path __maybe_unused, type[64]; unsigned int plen; cell_t *p, *endp; __be32 val; @@ -1406,7 +1406,6 @@ static void __init prom_init_mem(void) prom_debug("root_size_cells: %x\n", rsc); prom_debug("scanning memory:\n"); - path = prom_scratch; for (node = 0; prom_next_node(); ) { type[0] = 0; @@ -1431,6 +1430,7 @@ static void __init prom_init_mem(void) endp = p + (plen / sizeof(cell_t)); #ifdef DEBUG_PROM + path = prom_scratch; memset(path, 0, PROM_SCRATCH_SIZE); call_prom("package-to-path", 3, 1, node, path, PROM_SCRATCH_SIZE-1); prom_debug(" node %s :\n", path); -- 2.11.0
[PATCH v2 02/19] powerpc/powermac: Mark variable x as unused
Since the value of x is never intended to be read, remove it. Fix warning treated as error with W=1: arch/powerpc/platforms/powermac/udbg_scc.c:76:9: error: variable ‘x’ set but not used [-Werror=unused-but-set-variable] Suggested-by: Christophe Leroy Signed-off-by: Mathieu Malaterre --- v2: remove x completely arch/powerpc/platforms/powermac/udbg_scc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powermac/udbg_scc.c b/arch/powerpc/platforms/powermac/udbg_scc.c index d83135a9830e..8901973ed683 100644 --- a/arch/powerpc/platforms/powermac/udbg_scc.c +++ b/arch/powerpc/platforms/powermac/udbg_scc.c @@ -73,7 +73,7 @@ void udbg_scc_init(int force_scc) struct device_node *stdout = NULL, *escc = NULL, *macio = NULL; struct device_node *ch, *ch_def = NULL, *ch_a = NULL; const char *path; - int i, x; + int i; escc = of_find_node_by_name(NULL, "escc"); if (escc == NULL) @@ -120,7 +120,7 @@ void udbg_scc_init(int force_scc) mb(); for (i = 2; i != 0; --i) - x = in_8(sccc); + in_8(sccc); out_8(sccc, 0x09); /* reset A or B side */ out_8(sccc, 0xc0); -- 2.11.0
[PATCH v2 01/19] powerpc/powermac: Mark variable x as unused
Since the value of x is never intended to be read, declare it with gcc attribute as unused. Fix warning treated as error with W=1: arch/powerpc/platforms/powermac/bootx_init.c:471:21: error: variable ‘x’ set but not used [-Werror=unused-but-set-variable] Signed-off-by: Mathieu Malaterre --- v2: move x variable within its local scope arch/powerpc/platforms/powermac/bootx_init.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/powermac/bootx_init.c b/arch/powerpc/platforms/powermac/bootx_init.c index c3c9bbb3573a..d44e8571c1ec 100644 --- a/arch/powerpc/platforms/powermac/bootx_init.c +++ b/arch/powerpc/platforms/powermac/bootx_init.c @@ -468,7 +468,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4) boot_infos_t *bi = (boot_infos_t *) r4; unsigned long hdr; unsigned long space; - unsigned long ptr, x; + unsigned long ptr; char *model; unsigned long offset = reloc_offset(); @@ -562,6 +562,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4) * MMU switched OFF, so this should not be useful anymore. */ if (bi->version < 4) { + unsigned long x __maybe_unused; bootx_printf("Touching pages...\n"); /* -- 2.11.0
Re: [PATCH 15/19] powerpc: Add missing prototype
On Fri, Mar 23, 2018 at 1:20 PM, christophe leroy wrote: > > > On 22/03/2018 21:20, Mathieu Malaterre wrote: >> >> Add one missing prototype for function rh_dump_blk. Fix warning treated as >> error in W=1: >> >>arch/powerpc/lib/rheap.c:740:6: error: no previous prototype for >> ‘rh_dump_blk’ [-Werror=missing-prototypes] >> >> Signed-off-by: Mathieu Malaterre >> --- >> arch/powerpc/include/asm/rheap.h | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/arch/powerpc/include/asm/rheap.h >> b/arch/powerpc/include/asm/rheap.h >> index 172381769cfc..e75d96de19a0 100644 >> --- a/arch/powerpc/include/asm/rheap.h >> +++ b/arch/powerpc/include/asm/rheap.h >> @@ -83,6 +83,9 @@ extern int rh_get_stats(rh_info_t * info, int what, int >> max_stats, >> /* Simple dump of remote heap info */ >> extern void rh_dump(rh_info_t * info); >> +/* Simple dump of remote info block */ >> +extern void rh_dump_blk(rh_info_t *info, rh_block_t *blk); >> + > > > Only used in one place, should be static Well here is what I see over here: $ git grep rh_dump_blk ... arch/powerpc/lib/rheap.c:EXPORT_SYMBOL_GPL(rh_dump_blk); > Christophe > >> /* Set owner of taken block */ >> extern int rh_set_owner(rh_info_t * info, unsigned long start, const >> char *owner); >>
Re: [PATCH 11/19] powerpc/powermac: Move pmac_pfunc_base_install prototype to header file
On Fri, Mar 23, 2018 at 1:13 PM, christophe leroy wrote: > > > On 22/03/2018 21:19, Mathieu Malaterre wrote: >> >> The pmac_pfunc_base_install prototype was declared in powermac/smp.c since >> function was used there, move it to pmac_pfunc.h header to be visible in >> pfunc_base.c. Fix a warning treated as error with W=1: >> >>arch/powerpc/platforms/powermac/pfunc_base.c:330:12: error: no previous >> prototype for ‘pmac_pfunc_base_install’ [-Werror=missing-prototypes] >> >> Signed-off-by: Mathieu Malaterre >> --- >> arch/powerpc/include/asm/pmac_pfunc.h | 1 + >> arch/powerpc/platforms/powermac/smp.c | 1 - >> 2 files changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/include/asm/pmac_pfunc.h >> b/arch/powerpc/include/asm/pmac_pfunc.h >> index 73bd8f28f2a8..99f7a288789a 100644 >> --- a/arch/powerpc/include/asm/pmac_pfunc.h >> +++ b/arch/powerpc/include/asm/pmac_pfunc.h >> @@ -245,6 +245,7 @@ extern void pmf_put_function(struct pmf_function >> *func); >> extern int pmf_call_one(struct pmf_function *func, struct pmf_args >> *args); >> +extern int pmac_pfunc_base_install(void); > > > > extern keyword is not needed I understand; but for consistency every single prototype in this header file actually uses the extern keyword. Is there a guide/best practice to refer to in this case ? > Christophe > >> /* Suspend/resume code called by via-pmu directly for now */ >> extern void pmac_pfunc_base_suspend(void); >> diff --git a/arch/powerpc/platforms/powermac/smp.c >> b/arch/powerpc/platforms/powermac/smp.c >> index 95275e0e2efa..447da6db450a 100644 >> --- a/arch/powerpc/platforms/powermac/smp.c >> +++ b/arch/powerpc/platforms/powermac/smp.c >> @@ -65,7 +65,6 @@ >> #endif >> extern void __secondary_start_pmac_0(void); >> -extern int pmac_pfunc_base_install(void); >> static void (*pmac_tb_freeze)(int freeze); >> static u64 timebase; >>
[PATCH v3] powerpc/altivec: Add missing prototypes for altivec
Some functions prototypes were missing for the non-altivec code. Add the missing prototypes in a new header file, fix warnings treated as errors with W=1: arch/powerpc/lib/xor_vmx_glue.c:18:6: error: no previous prototype for ‘xor_altivec_2’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:29:6: error: no previous prototype for ‘xor_altivec_3’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:40:6: error: no previous prototype for ‘xor_altivec_4’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:52:6: error: no previous prototype for ‘xor_altivec_5’ [-Werror=missing-prototypes] The prototypes were already present in but this header file is meant to be included after . Trying to re-use directly would lead to warnings such as: arch/powerpc/include/asm/xor.h:39:15: error: variable ‘xor_block_altivec’ has initializer but incomplete type Trying to re-use after in xor_vmx_glue.c would in turn trigger the following warnings: include/asm-generic/xor.h:688:34: error: ‘xor_block_32regs’ defined but not used [-Werror=unused-variable] Signed-off-by: Mathieu Malaterre --- v3: missing changelog, no other change v2: new version where a separate header was introduced. Also update commit message explaining history. 
arch/powerpc/include/asm/xor.h | 12 +--- arch/powerpc/include/asm/xor_altivec.h | 19 +++ arch/powerpc/lib/xor_vmx_glue.c| 1 + 3 files changed, 21 insertions(+), 11 deletions(-) create mode 100644 arch/powerpc/include/asm/xor_altivec.h diff --git a/arch/powerpc/include/asm/xor.h b/arch/powerpc/include/asm/xor.h index a36c2069d8ed..7d6dc503349d 100644 --- a/arch/powerpc/include/asm/xor.h +++ b/arch/powerpc/include/asm/xor.h @@ -24,17 +24,7 @@ #include #include - -void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in); -void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in); -void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in, - unsigned long *v4_in); -void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in, - unsigned long *v4_in, unsigned long *v5_in); +#include static struct xor_block_template xor_block_altivec = { .name = "altivec", diff --git a/arch/powerpc/include/asm/xor_altivec.h b/arch/powerpc/include/asm/xor_altivec.h new file mode 100644 index ..6ca923510b59 --- /dev/null +++ b/arch/powerpc/include/asm/xor_altivec.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_XOR_ALTIVEC_H +#define _ASM_POWERPC_XOR_ALTIVEC_H + +#ifdef CONFIG_ALTIVEC + +void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in); +void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in); +void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in); +void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in, unsigned long *v5_in); + +#endif +#endif /* _ASM_POWERPC_XOR_ALTIVEC_H */ diff --git a/arch/powerpc/lib/xor_vmx_glue.c 
b/arch/powerpc/lib/xor_vmx_glue.c index 6521fe5e8cef..dab2b6bfcf36 100644 --- a/arch/powerpc/lib/xor_vmx_glue.c +++ b/arch/powerpc/lib/xor_vmx_glue.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "xor_vmx.h" void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, -- 2.11.0
[PATCH v2] powerpc/altivec: Add missing prototypes for altivec
Some functions prototypes were missing for the non-altivec code. Add the missing prototypes in a new header file, fix warnings treated as errors with W=1: arch/powerpc/lib/xor_vmx_glue.c:18:6: error: no previous prototype for ‘xor_altivec_2’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:29:6: error: no previous prototype for ‘xor_altivec_3’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:40:6: error: no previous prototype for ‘xor_altivec_4’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:52:6: error: no previous prototype for ‘xor_altivec_5’ [-Werror=missing-prototypes] The prototypes were already present in but this header file is meant to be included after . Trying to re-use directly would lead to warnings such as: arch/powerpc/include/asm/xor.h:39:15: error: variable ‘xor_block_altivec’ has initializer but incomplete type Trying to re-use after in xor_vmx_glue.c would in turn trigger the following warnings: include/asm-generic/xor.h:688:34: error: ‘xor_block_32regs’ defined but not used [-Werror=unused-variable] Signed-off-by: Mathieu Malaterre --- arch/powerpc/include/asm/xor.h | 12 +--- arch/powerpc/include/asm/xor_altivec.h | 19 +++ arch/powerpc/lib/xor_vmx_glue.c| 1 + 3 files changed, 21 insertions(+), 11 deletions(-) create mode 100644 arch/powerpc/include/asm/xor_altivec.h diff --git a/arch/powerpc/include/asm/xor.h b/arch/powerpc/include/asm/xor.h index a36c2069d8ed..7d6dc503349d 100644 --- a/arch/powerpc/include/asm/xor.h +++ b/arch/powerpc/include/asm/xor.h @@ -24,17 +24,7 @@ #include #include - -void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in); -void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in); -void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, unsigned long *v3_in, - unsigned long *v4_in); -void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, - unsigned long *v2_in, 
unsigned long *v3_in, - unsigned long *v4_in, unsigned long *v5_in); +#include static struct xor_block_template xor_block_altivec = { .name = "altivec", diff --git a/arch/powerpc/include/asm/xor_altivec.h b/arch/powerpc/include/asm/xor_altivec.h new file mode 100644 index ..6ca923510b59 --- /dev/null +++ b/arch/powerpc/include/asm/xor_altivec.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_XOR_ALTIVEC_H +#define _ASM_POWERPC_XOR_ALTIVEC_H + +#ifdef CONFIG_ALTIVEC + +void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in); +void xor_altivec_3(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in); +void xor_altivec_4(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in); +void xor_altivec_5(unsigned long bytes, unsigned long *v1_in, + unsigned long *v2_in, unsigned long *v3_in, + unsigned long *v4_in, unsigned long *v5_in); + +#endif +#endif /* _ASM_POWERPC_XOR_ALTIVEC_H */ diff --git a/arch/powerpc/lib/xor_vmx_glue.c b/arch/powerpc/lib/xor_vmx_glue.c index 6521fe5e8cef..dab2b6bfcf36 100644 --- a/arch/powerpc/lib/xor_vmx_glue.c +++ b/arch/powerpc/lib/xor_vmx_glue.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "xor_vmx.h" void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, -- 2.11.0
Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder
On 28/03/2018 00:12, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 88042d843668..ef6ef0627090 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct >> anon_vma_chain *node); >> extern int __vm_enough_memory(struct mm_struct *mm, long pages, int >> cap_sys_admin); >> extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start, >> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert, >> -struct vm_area_struct *expand); >> +struct vm_area_struct *expand, bool keep_locked); >> static inline int vma_adjust(struct vm_area_struct *vma, unsigned long >> start, >> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert) >> { >> -return __vma_adjust(vma, start, end, pgoff, insert, NULL); >> +return __vma_adjust(vma, start, end, pgoff, insert, NULL, false); >> } >> -extern struct vm_area_struct *vma_merge(struct mm_struct *, >> +extern struct vm_area_struct *__vma_merge(struct mm_struct *, >> struct vm_area_struct *prev, unsigned long addr, unsigned long end, >> unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, >> -struct mempolicy *, struct vm_userfaultfd_ctx); >> +struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked); >> +static inline struct vm_area_struct *vma_merge(struct mm_struct *vma, >> +struct vm_area_struct *prev, unsigned long addr, unsigned long end, >> +unsigned long vm_flags, struct anon_vma *anon, struct file *file, >> +pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff) >> +{ >> +return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off, >> + pol, uff, false); >> +} > > The first formal to vma_merge() is an mm, not a vma. Oops ! > This area could use an uncluttering. 
> >> extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); >> extern int __split_vma(struct mm_struct *, struct vm_area_struct *, >> unsigned long addr, int new_below); >> diff --git a/mm/mmap.c b/mm/mmap.c >> index d6533cb85213..ac32b577a0c9 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct >> *mm, >> */ >> int __vma_adjust(struct vm_area_struct *vma, unsigned long start, >> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert, >> -struct vm_area_struct *expand) >> +struct vm_area_struct *expand, bool keep_locked) >> { >> struct mm_struct *mm = vma->vm_mm; >> struct vm_area_struct *next = vma->vm_next, *orig_vma = vma; >> @@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned >> long start, >> >> if (next && next != vma) >> vm_raw_write_end(next); >> -vm_raw_write_end(vma); >> +if (!keep_locked) >> +vm_raw_write_end(vma); >> >> validate_mm(mm); >> > > This will require a fixup for the following patch where a retval from > anon_vma_close() can also return without vma locked even though > keep_locked == true. Yes, I saw your previous email about that. > > How does vma_merge() handle that error wrt vm_raw_write_begin(vma)? The unwritten assumption is that when __vma_adjust() returns an error, the vma is no longer sequence locked. That needs to be clarified by a big comment near the __vma_adjust() definition. This being said, in that case __vma_merge() returns NULL, which means it didn't do the job (the caller doesn't know whether this is due to an error or because the VMAs cannot be merged). Again the assumption is that no VMA is left locked in that case, since the merge operation was not done. But this is not documented at all and I have to fix that.
In addition, the caller copy_vma(), which is the only one calling __vma_merge() with keep_locked=true, assumes that in that case the VMA is not locked; it will allocate a new VMA which will be locked before it is inserted in the RB tree. So it should work, but I have to make a serious effort to document all of that in the code. Thanks a lot for raising this! > >> @@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, >> unsigned long vm_flags, >> * parameter) may establish ptes with the wrong permissions of >> * instead of the right permissions of . >> */ >> -struct vm_area_struct *vma_merge(struct mm_struct *mm, >> +struct vm_area_struct *__vma_merge(struct mm_struct *mm, >> struct vm_area_struct *prev, unsigned long addr, >> unsigned long end, unsigned long vm_flags, >> struct anon_vma *anon_vma, struct file *file, >> pgoff_t pgoff, struct mempolicy *policy, >> -struct vm_userfaultfd_ctx vm_userfaultfd_ctx) >> +struct vm_userfaultfd_ctx vm_userfaultfd_ctx, >>
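The locking contract sketched in these replies could eventually be written down next to `__vma_adjust()` roughly as follows. This is only a guess at what such a comment might say, based purely on the discussion above, not actual kernel text:

```c
/*
 * Sequence-lock (vm_sequence) contract for __vma_adjust()/__vma_merge():
 *
 *  - On success with keep_locked == true, 'vma' is returned with its
 *    vm_sequence still write-locked; the caller must eventually call
 *    vm_raw_write_end(vma) itself.
 *
 *  - On error from __vma_adjust(), or when __vma_merge() returns NULL
 *    (no merge performed), no VMA is left sequence-locked, regardless
 *    of keep_locked.
 *
 * copy_vma() is the only keep_locked == true caller: on a NULL return
 * it allocates a fresh VMA and write-locks it before inserting it into
 * the RB tree, so either outcome leaves it holding exactly one locked VMA.
 */
```

Writing the invariant this way makes both halves of the NULL-return ambiguity (error vs. not mergeable) harmless to the caller, which is the property copy_vma() relies on.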
Re: [PATCH v9 07/24] mm: VMA sequence count
On 27/03/2018 23:30, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index faf85699f1a1..5898255d0aeb 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -558,6 +558,10 @@ void __vma_link_rb(struct mm_struct *mm, struct >> vm_area_struct *vma, >> else >> mm->highest_vm_end = vm_end_gap(vma); >> >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT >> +seqcount_init(>vm_sequence); >> +#endif >> + >> /* >> * vma->vm_prev wasn't known when we followed the rbtree to find the >> * correct insertion point for that vma. As a result, we could not >> @@ -692,6 +696,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned >> long start, >> long adjust_next = 0; >> int remove_next = 0; >> >> +/* >> + * Why using vm_raw_write*() functions here to avoid lockdep's warning ? >> + * >> + * Locked is complaining about a theoretical lock dependency, involving >> + * 3 locks: >> + * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim >> + * >> + * Here are the major path leading to this dependency : >> + * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem >> + * 2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim >> + * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem >> + * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence >> + * >> + * So there is no way to solve this easily, especially because in >> + * unmap_mapping_range() the i_mmap_rwsem is grab while the impacted >> + * VMAs are not yet known. >> + * However, the way the vm_seq is used is guarantying that we will >> + * never block on it since we just check for its value and never wait >> + * for it to move, see vma_has_changed() and handle_speculative_fault(). 
>> + */ >> +vm_raw_write_begin(vma); >> +if (next) >> +vm_raw_write_begin(next); >> + >> if (next && !insert) { >> struct vm_area_struct *exporter = NULL, *importer = NULL; >> > > Eek, what about later on: > > /* >* Easily overlooked: when mprotect shifts the boundary, >* make sure the expanding vma has anon_vma set if the >* shrinking vma had, to cover any anon pages imported. >*/ > if (exporter && exporter->anon_vma && !importer->anon_vma) { > int error; > > importer->anon_vma = exporter->anon_vma; > error = anon_vma_clone(importer, exporter); > if (error) > return error; > } > > This needs > > if (error) { > if (next && next != vma) > vm_raw_write_end(next); > vm_raw_write_end(vma); > return error; > } Nice catch ! Thanks, Laurent.
Re: [PATCH 6/6] doc/devicetree: NVDIMM region documentation
On Wed, Mar 28, 2018 at 10:06 AM, Rob Herringwrote: [..] > >> Are DIMMs always going to be the only form factor for NV memory? > >> > >> And if you have multiple DIMMs, does each DT node correspond to a DIMM? > > > > A nvdimm-region might correspond to a single NVDIMM, a set of > > interleaved NVDIMMs, or it might just be a chunk of normal memory that > > you want treated as a NVDIMM for some reason. The last case is useful > > for provisioning install media on servers since it allows you do > > download a DVD image, turn it into an nvdimm-region, and kexec into > > the installer which can use it as a root disk. That may seem a little > > esoteric, but it's handy and we're using a full linux environment for > > our boot loader so it's easy to make use of. > > I'm really just asking if we should drop the "dimm" name because it is > not always a DIMM. Maybe pmem instead? I don't know, naming is > hard(TM). The Linux enabling uses the term "memory device". The Linux device object name for memory devices is "nmem". [..] > > special since the OS needs to know where it can allocate early in boot > > and I don't see non-volatile memory as being similarly significant. > > Fundamentally an NVDIMM is just a memory mapped storage device so we > > should be able to defer looking at them until later in boot. > > It's not clear if 'platform' is just an example or random name or what > the node is required to be called. In the latter case, we should be > much more specific because 'platform' could be anything. In the former > case, then we have no way to find or validate the node because the > name could be anything and there's no compatible property either. > > "region" is pretty generic too. > The term "nvdimm-region" has specific meaning to the libnvdimm sub-system. It is a contiguous physical address range backed by one or more memory devices, DIMMs, in an interleaved configuration (interleave set). 
One feature that is currently missing from libnvdimm is a management interface to change an interleave configuration. To date, Linux only reads a static region configuration published by platform firmware. Linux can provide dynamic provisioning of namespaces out of those regions, but interleave configuration has been left to vendor specific tooling to date. It would be great to start incorporating generic Linux support for that capability across platform firmware implementations.
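Pulling the thread's suggestions together — flat top-level nodes rather than a `platform` container, the `nvdimm-region` compatible, and the optional `volatile` property — a binding usage might look like the sketch below. Node names, addresses, and sizes are invented for illustration:

```dts
/ {
	#address-cells = <2>;
	#size-cells = <2>;

	/* One node per contiguous region (interleave set), not per DIMM. */
	pmem@100000000 {
		compatible = "nvdimm-region";
		reg = <0x1 0x00000000 0x0 0x40000000>;
	};

	/* Ordinary DRAM exposed as a region, e.g. for kexec install media.
	 * "volatile" tells the OS no cache flushing is needed after writes. */
	pmem@200000000 {
		compatible = "nvdimm-region";
		reg = <0x2 0x00000000 0x0 0x40000000>;
		volatile;
	};
};
```

This follows the structure of ordinary memory nodes, as suggested at the end of the thread, and sidesteps the question of naming the container node.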
Re: [PATCH] powerpc/powernv/nvram: opal_nvram_write handle unknown OPAL errors
On 03/27/2018 01:08 PM, Nicholas Piggin wrote:

On Tue, 27 Mar 2018 12:47:31 +0530 Vasant Hegde wrote:

On 03/26/2018 08:32 PM, Nicholas Piggin wrote:

opal_nvram_write currently just assumes success if it encounters an error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO on other errors instead.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/platforms/powernv/opal-nvram.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-nvram.c b/arch/powerpc/platforms/powernv/opal-nvram.c
index 9db4398ded5d..13bf625dc3e8 100644
--- a/arch/powerpc/platforms/powernv/opal-nvram.c
+++ b/arch/powerpc/platforms/powernv/opal-nvram.c
@@ -59,6 +59,8 @@ static ssize_t opal_nvram_write(char *buf, size_t count, loff_t *index)
 		if (rc == OPAL_BUSY_EVENT)
 			opal_poll_events(NULL);

The current code does a continuous poll here. Maybe we should have a small breathing time here. What do you say?

Yeah, that's probably not a bad idea. I cc'ed the skiboot list -- what's a reasonable delay between retries?

I think it depends on the individual API. In the case of dump retrieval I have a 20ms delay, as the FSP takes time to copy data to host memory. But maybe here we can have a smaller delay.

Linux has a bunch of similar kinds of loops if you grep for opal_poll_events and OPAL_BUSY. It would be good to convert them all to a standard form with a standard delay as recommended by OPAL, and where specific calls have a different delay for a good reason, that would be documented in the OPAL API docs.

Yes. We should update the API documentation.

-Vasant
Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count
On 27/03/2018 23:57, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index 5898255d0aeb..d6533cb85213 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -847,17 +847,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned >> long start, >> } >> >> if (start != vma->vm_start) { >> -vma->vm_start = start; >> +WRITE_ONCE(vma->vm_start, start); >> start_changed = true; >> } >> if (end != vma->vm_end) { >> -vma->vm_end = end; >> +WRITE_ONCE(vma->vm_end, end); >> end_changed = true; >> } >> -vma->vm_pgoff = pgoff; >> +WRITE_ONCE(vma->vm_pgoff, pgoff); >> if (adjust_next) { >> -next->vm_start += adjust_next << PAGE_SHIFT; >> -next->vm_pgoff += adjust_next; >> +WRITE_ONCE(next->vm_start, >> + next->vm_start + (adjust_next << PAGE_SHIFT)); >> +WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next); >> } >> >> if (root) { >> @@ -1781,6 +1782,7 @@ unsigned long mmap_region(struct file *file, unsigned >> long addr, >> out: >> perf_event_mmap(vma); >> >> +vm_write_begin(vma); >> vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); >> if (vm_flags & VM_LOCKED) { >> if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || >> @@ -1803,6 +1805,7 @@ unsigned long mmap_region(struct file *file, unsigned >> long addr, >> vma->vm_flags |= VM_SOFTDIRTY; >> >> vma_set_page_prot(vma); >> +vm_write_end(vma); >> >> return addr; >> > > Shouldn't this also protect vma->vm_flags? Nice catch ! I just found that too while reviewing the entire patch to answer your previous email. 
> > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1796,7 +1796,8 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, > vma == get_gate_vma(current->mm))) > mm->locked_vm += (len >> PAGE_SHIFT); > else > - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; > + WRITE_ONCE(vma->vm_flags, > +vma->vm_flags & VM_LOCKED_CLEAR_MASK); > } > > if (file) > @@ -1809,7 +1810,7 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, >* then new mapped in-place (which must be aimed as >* a completely new data area). >*/ > - vma->vm_flags |= VM_SOFTDIRTY; > + WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY); > > vma_set_page_prot(vma); > vm_write_end(vma); >
Re: [RFC PATCH] powerpc/xmon: Use OPAL_DEBUG to debug srest in OPAL
On 03/27/2018 12:58 PM, Nicholas Piggin wrote:

On Tue, 27 Mar 2018 12:42:32 +0530 Vasant Hegde wrote:

On 03/26/2018 08:39 PM, Nicholas Piggin wrote:

xmon can be entered via sreset NMI (from a management sreset, or an NMI IPI), which can interrupt OPAL. Add checks to xmon to see if pc or sp are within OPAL memory, and if so, then use OPAL_DEBUG to print the OPAL stack and return the Linux stack, which can then be dumped by xmon.

Nick, OPAL uses the FSP/cronus interface for many of its debug interfaces (like OPAL assert, getting the OPAL console, triggering FSP R/R, etc). Maybe in future we will add new debug capabilities.

It would be good to ensure an API could accommodate them, or at least not get in the way.

Agree. Once secureboot is enabled none of these interfaces work and we have limited debug capability. Here you are using a very generic API name (OPAL_DEBUG). Maybe we should have a generic interface (exported via debugfs?) here rather than an sreset-specific one.

OPAL_DEBUG here actually uses the sub-function OPAL_DEBUG_DUMP_STACK (1), but I didn't bring that constant across from skiboot, which I should have.

Nick, maybe we should define the sub-function usage. Also, the current API limits the number of arguments and their types. Maybe we should have argument 2 as "void *"? Something like: s64 opal_debug(u32 debug_type, void *data, u64 dsize); That way each individual sub-function can parse/use it based on its need.

But I don't think this is sreset specific. If Linux has any way to get an r1 for a CPU in OPAL, then it can use this function. If it does not, then it simply can't use it. I haven't really followed what's happening with secure boot, but presumably we can still get NMIs (at least machine check, even if all system reset sources are suppressed).

AFAIK secureboot won't block us here. It mostly blocks external entities (like FSP/cronus) from accessing host memory (they can't directly read or write host memory, SCOM operations are restricted, etc).

-Vasant
Re: [PATCH 6/6] doc/devicetree: NVDIMM region documentation
On Tue, Mar 27, 2018 at 9:53 AM, Oliverwrote: > On Tue, Mar 27, 2018 at 9:24 AM, Rob Herring wrote: >> On Fri, Mar 23, 2018 at 07:12:09PM +1100, Oliver O'Halloran wrote: >>> Add device-tree binding documentation for the nvdimm region driver. >>> >>> Cc: devicet...@vger.kernel.org >>> Signed-off-by: Oliver O'Halloran >>> --- >>> .../devicetree/bindings/nvdimm/nvdimm-region.txt | 45 >>> ++ >>> 1 file changed, 45 insertions(+) >>> create mode 100644 >>> Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> >>> diff --git a/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> new file mode 100644 >>> index ..02091117ff16 >>> --- /dev/null >>> +++ b/Documentation/devicetree/bindings/nvdimm/nvdimm-region.txt >>> @@ -0,0 +1,45 @@ >>> +Device-tree bindings for NVDIMM memory regions >>> +- >>> + >>> +Non-volatile DIMMs are memory modules used to provide (cacheable) main >>> memory >> >> Are DIMMs always going to be the only form factor for NV memory? >> >> And if you have multiple DIMMs, does each DT node correspond to a DIMM? > > A nvdimm-region might correspond to a single NVDIMM, a set of > interleaved NVDIMMs, or it might just be a chunk of normal memory that > you want treated as a NVDIMM for some reason. The last case is useful > for provisioning install media on servers since it allows you do > download a DVD image, turn it into an nvdimm-region, and kexec into > the installer which can use it as a root disk. That may seem a little > esoteric, but it's handy and we're using a full linux environment for > our boot loader so it's easy to make use of. I'm really just asking if we should drop the "dimm" name because it is not always a DIMM. Maybe pmem instead? I don't know, naming is hard(TM). >> If not, then what if we want/need to provide power control to a DIMM? > > That would require a DIMM (and probably memory controller) specific > driver. 
I've deliberately left out how regions are mapped back to > DIMMs from the binding since it's not really clear to me how that > should work. A phandle array pointing to each DIMM device (which could > be anything) would do the trick, but I've found that a bit awkward to > plumb into the model that libnvdimm expects. > >>> +that retains its contents across power cycles. In more practical terms, >>> they >>> +are kind of storage device where the contents can be accessed by the CPU >>> +directly, rather than indirectly via a storage controller or similar. The >>> an >>> +nvdimm-region specifies a physical address range that is hosted on an >>> NVDIMM >>> +device. >>> + >>> +Bindings for the region nodes: >>> +- >>> + >>> +Required properties: >>> + - compatible = "nvdimm-region" >>> + >>> + - reg = ; >>> + The system physical address range of this nvdimm region. >>> + >>> +Optional properties: >>> + - Any relevant NUMA assocativity properties for the target platform. >>> + - A "volatile" property indicating that this region is actually in >>> + normal DRAM and does not require cache flushes after each write. >>> + >>> +A complete example: >>> + >>> + >>> +/ { >>> + #size-cells = <2>; >>> + #address-cells = <2>; >>> + >>> + platform { >> >> Perhaps we need a more well defined node here. Like we have 'memory' for >> memory nodes. > > I think treating it as a platform device is fine. Memory nodes are Platform device is a Linux term... > special since the OS needs to know where it can allocate early in boot > and I don't see non-volatile memory as being similarly significant. > Fundamentally an NVDIMM is just a memory mapped storage device so we > should be able to defer looking at them until later in boot. It's not clear if 'platform' is just an example or random name or what the node is required to be called. In the latter case, we should be much more specific because 'platform' could be anything. 
In the former case, then we have no way to find or validate the node because the name could be anything and there's no compatible property either. "region" is pretty generic too. > > That said you might have problems with XIP kernels and what not. I > think that problem is better solved through other means though. > >>> + region@5000 { >>> + compatible = "nvdimm-region; >>> + reg = <0x0001 0x 0x 0x4000> >>> + >>> + }; >>> + >>> + region@6000 { >>> + compatible = "nvdimm-region"; >>> + reg = <0x0001 0x 0x 0x4000> Thinking about this some more, the 2 levels of nodes is pointless. Just follow memory nodes structure. nv-memory@1 { compatible = "nvdimm-region"; reg =
Re: RFC on writel and writel_relaxed
On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > > they should have the same semantics as writel. > > > > > > One thing that ARM does IIRC is that it only guarantees to order writel() > > > within > > > one device, and the memory mapped PCI I/O space window almost certainly > > > counts as a separate device to the CPU. > > > > That sounds bogus. > > To elaborate, if you do the following on arm: > > writel(DEVICE_FOO); > writel(DEVICE_BAR); > > we generally cannot guarantee in which order those accesses will hit the > devices even if we add every barrier under the sun. You'd need something > in between, specific to DEVICE_FOO (probably a read-back) to really push > the first write out. This doesn't sound like it would be that uncommon to > me. The PCI posted write does not require the above to execute 'in order' only that any bus segment shared by the two devices have the writes issued in CPU order. ie at a shared PCI root port for instance. If I recall this is very similar to the ordering that ARM's on-chip AXI interconnect is supposed to provide.. So I'd be very surprised if a modern ARM64 has an meaningful difference from x86 here. When talking about ordering between the devices, the relevant question is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA from the DEVICE_FOO. 'ordered' means that in this case writel(DEVICE_FOO) must be presented to FOO before anything generated by BAR. Jason
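The read-back workaround Will alludes to can be made concrete. The sketch below uses made-up device names and register offsets; it shows how a driver would force the posted write to DEVICE_FOO to complete before kicking DEVICE_BAR into DMAing from it:

```c
/* All names here (foo_base, FOO_DESC, BAR_DOORBELL, ...) are hypothetical. */
writel(desc_addr, foo_base + FOO_DESC);   /* posted write toward FOO */
(void)readl(foo_base + FOO_STATUS);       /* read-back from the SAME device:
                                           * it cannot complete until the
                                           * posted write has reached FOO */
writel(1, bar_base + BAR_DOORBELL);       /* BAR may now safely DMA from FOO */
```

The read-back works because a read response must come from the device itself, which (per PCI ordering rules) pushes any posted writes ahead of it; nothing weaker than that pins down *when* the first write lands.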
Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count
On 27/03/2018 23:45, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c >> index 65ae54659833..a2d9c87b7b0b 100644 >> --- a/fs/proc/task_mmu.c >> +++ b/fs/proc/task_mmu.c >> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, >> const char __user *buf, >> goto out_mm; >> } >> for (vma = mm->mmap; vma; vma = vma->vm_next) { >> -vma->vm_flags &= ~VM_SOFTDIRTY; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, >> + vma->vm_flags & >> ~VM_SOFTDIRTY); >> vma_set_page_prot(vma); >> +vm_write_end(vma); >> } >> downgrade_write(>mmap_sem); >> break; >> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c >> index cec550c8468f..b8212ba17695 100644 >> --- a/fs/userfaultfd.c >> +++ b/fs/userfaultfd.c >> @@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct >> list_head *fcs) >> >> octx = vma->vm_userfaultfd_ctx.ctx; >> if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { >> +vm_write_begin(vma); >> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; >> -vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); >> +WRITE_ONCE(vma->vm_flags, >> + vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING)); >> +vm_write_end(vma); >> return 0; >> } >> > > In several locations in this patch vm_write_begin(vma) -> > vm_write_end(vma) is nesting things other than vma->vm_flags, > vma->vm_policy, etc. I think it's better to do vm_write_end(vma) as soon > as the members that the seqcount protects are modified. In other words, > this isn't offering protection for vma->vm_userfaultfd_ctx. There are > several examples of this in the patch. That's true in this particular case, and I could change that to not include the change to vm_userfaultfd_ctx. This being said, I don't think this will have a major impact, but I'll make a close review on this patch to be sure there is too large protected part of code. 
>> @@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, >> struct file *file) >> vma = prev; >> else >> prev = vma; >> -vma->vm_flags = new_flags; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, new_flags); >> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; >> +vm_write_end(vma); >> } >> up_write(>mmap_sem); >> mmput(mm); >> @@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct >> userfaultfd_ctx *ctx, >> * the next vma was merged into the current one and >> * the current one has not been updated yet. >> */ >> -vma->vm_flags = new_flags; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, new_flags); >> vma->vm_userfaultfd_ctx.ctx = ctx; >> +vm_write_end(vma); >> >> skip: >> prev = vma; >> @@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct >> userfaultfd_ctx *ctx, >> * the next vma was merged into the current one and >> * the current one has not been updated yet. >> */ >> -vma->vm_flags = new_flags; >> +vm_write_begin(vma); >> +WRITE_ONCE(vma->vm_flags, new_flags); >> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; >> +vm_write_end(vma); >> >> skip: >> prev = vma; >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index b7e2268dfc9a..32314e9e48dd 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm, >> if (mm_find_pmd(mm, address) != pmd) >> goto out; >> >> +vm_write_begin(vma); >> anon_vma_lock_write(vma->anon_vma); >> >> pte = pte_offset_map(pmd, address); >> @@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm, >> pmd_populate(mm, pmd, pmd_pgtable(_pmd)); >> spin_unlock(pmd_ptl); >> anon_vma_unlock_write(vma->anon_vma); >> +vm_write_end(vma); >> result = SCAN_FAIL; >> goto out; >> } >> @@ -1075,6 +1077,7 @@ static void collapse_huge_page(struct mm_struct *mm, >> set_pmd_at(mm, address, pmd, _pmd); >> update_mmu_cache_pmd(vma, address, pmd); >> spin_unlock(pmd_ptl); >> +vm_write_end(vma); >> >> *hpage = NULL; >> >> 
diff --git a/mm/madvise.c
Re: RFC on writel and writel_relaxed
On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) David Miller wrote:

> From: Benjamin Herrenschmidt
> Date: Thu, 29 Mar 2018 02:13:16 +1100
>
> > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > the archs are unused or dead anyway.
>
> Agreed.

While we're making decrees here, can we do something about mmiowb? The semantics are basically indecipherable. This is a variation on the mandatory write barrier that causes writes to weakly ordered I/O regions to be partially ordered. Its effects may go beyond the CPU->Hardware interface and actually affect the hardware at some level. How can a driver writer possibly get that right?

IIRC it was added for some big ia64 system on which it was really expensive to implement the proper wmb() semantics. So the wmb() semantics were quietly downgraded, then the subsequently broken drivers they cared about were fixed by adding the stronger mmiowb().

What should have happened was that wmb and writel remained correct, sane, and expensive, and they added an mmio_wmb() to order MMIO stores made by the writel_relaxed accessors, then used that to speed up the few drivers they care about.

Now that ia64 doesn't matter too much, can we deprecate mmiowb and just make wmb ordering talk about stores to the device, not to some intermediate stage of the interconnect where it can be subsequently reordered wrt the device? Drivers can be converted back to using wmb or writel gradually.

Thanks,
Nick
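For context, the pattern mmiowb() was introduced for looks roughly like this (a sketch with invented names; on most architectures mmiowb() is a no-op, but on the large ia64 systems in question it kept one CPU's MMIO stores from being overtaken on the interconnect after the spinlock was handed to another CPU):

```c
spin_lock(&dev->hw_lock);
writel(val, dev->regs + REG_CMD);   /* MMIO store to a weakly ordered window */
mmiowb();                           /* order the store before the unlock, so
                                     * the next hw_lock holder's MMIO cannot
                                     * reach the device ahead of this one */
spin_unlock(&dev->hw_lock);
```

The complaint above is precisely that this burden falls on every driver writer instead of being folded into writel()/spin_unlock() semantics.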
RE: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
>
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16 bit wide PIO interface. Reading the high address 'clocked' the fifo, so the first 3 reads could happen in any order, but the 4th had to be last. This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people had built the pcb (not made easy by some docs using the ibm bit numbering). In fact it didn't.

While that driver only had to run on a very specific small ppc, generic drivers might have similar issues.

I suspect that writel() is always (or should always be):

	barrier_before_writel()
	writel_relaxed()
	barrier_after_writel()

So if a driver needs to do multiple writes (without strong ordering) it should be able to repeat the writel_relaxed() with only one set of barriers. Similarly for readl(). In addition, a lesser barrier is probably enough between a readl_relaxed() and a writel_relaxed() that is conditional on the read value.

David
Re: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt
Date: Thu, 29 Mar 2018 02:13:16 +1100

> Let's fix all archs, it's way easier than fixing all drivers. Half of
> the archs are unused or dead anyway.

Agreed.
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote: > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > > broadcast IPI. > > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > > userspace), this > > > > > work may be done at the exit of this state. Delaying synchronization > > > > > helps to > > > > > save power if CPU is in idle state and decrease latency for real-time > > > > > tasks. > > > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab > > > > > and arm64 > > > > > code to delay syncronization. > > > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the > > > > > CPU running > > > > > isolated task would be fatal, as it breaks isolation. The approach > > > > > with delaying > > > > > of synchronization work helps to maintain isolated state. > > > > > > > > > > I've tested it with test from task isolation series on ThunderX2 for > > > > > more than > > > > > 10 hours (10k giga-ticks) without breaking isolation. > > > > > > > > > > Signed-off-by: Yury Norov> > > > > --- > > > > > arch/arm64/kernel/insn.c | 2 +- > > > > > include/linux/smp.h | 2 ++ > > > > > kernel/smp.c | 24 > > > > > mm/slab.c| 2 +- > > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > > index 2718a77da165..9d7c492e920e 100644 > > > > > --- a/arch/arm64/kernel/insn.c > > > > > +++ b/arch/arm64/kernel/insn.c > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > > *addrs[], u32 insns[], int cnt) > > > > >* synchronization. 
> > > > >*/ > > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > > insns[0]); > > > > > - kick_all_cpus_sync(); > > > > > + kick_active_cpus_sync(); > > > > > return ret; > > > > > } > > > > > } > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > > index 9fb239e12b82..27215e22240d 100644 > > > > > --- a/include/linux/smp.h > > > > > +++ b/include/linux/smp.h > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask > > > > > *mask, > > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > > > void kick_all_cpus_sync(void); > > > > > +void kick_active_cpus_sync(void); > > > > > void wake_up_all_idle_cpus(void); > > > > > > > > > > /* > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, > > > > > smp_call_func_t func, > > > > > } > > > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > > +static inline void kick_active_cpus_sync(void) { } > > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > > index 084c8b3a2681..0358d6673850 100644 > > > > > --- a/kernel/smp.c > > > > > +++ b/kernel/smp.c > > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > > > } > > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > > > +/** > > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > > + * that state. 
> > > > > + */ > > > > > +void kick_active_cpus_sync(void) > > > > > +{ > > > > > + int cpu; > > > > > + struct cpumask kernel_cpus; > > > > > + > > > > > + smp_mb(); > > > > > + > > > > > + cpumask_clear(_cpus); > > > > > + preempt_disable(); > > > > > + for_each_online_cpu(cpu) { > > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > > must IPI it, correct? > > > > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > > can invoke it when it next leaves its quiescent state? Or are you able > > > > to ignore the CPU in that case? (If you are able to ignore the CPU in > > > > that case, I could give you a lower-cost function to get your job done.) > > > > > > > > Thanx, Paul > > > > > > What's actually needed for synchronization is issuing memory barrier on > > > target > > > CPUs before we start executing kernel code. > > > > > > smp_mb() is implicitly called in smp_call_function*() path for it. In > > >
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 07:41 -0400, ok...@codeaurora.org wrote:
> Yes, we want to get there indeed. It is because of some arch not
> implementing writel properly. Maintainers want to play safe.
>
> That is why I asked if IA64 and other well known archs follow the
> strongly ordered rule at this moment like PPC and ARM.
>
> Or should we go and inform every arch about this before yanking wmb()?
>
> Maintainers are afraid of introducing a regression.

Let's fix all archs, it's way easier than fixing all drivers. Half of
the archs are unused or dead anyway.

> > The above code makes no sense, and just looks stupid to me. It also
> > generates pointlessly bad code on x86, so it's bad there too.
> >
> > Linus
Re: RFC on writel and writel_relaxed
On Wed, 2018-03-28 at 11:30 +, David Laight wrote:
> From: Benjamin Herrenschmidt
> > Sent: 28 March 2018 10:56
> > ...
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> >
> > This is wrong:
> >
> > writel(1, RESET_REG);
> > usleep(10);
> > writel(0, RESET_REG);
> >
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> >
> > The typical "fix" is to turn that into:
> >
> > writel(1, RESET_REG);
> > readl(RESET_REG); /* Flush posted writes */
>
> Would a writel(1, RESET_REG) here provide enough synchronisation?

Probably yes. It's one of those things where you try to deal with the
fact that 90% of driver writers barely understand the basic stuff and so
you need the "default" accessors to be hardened as much as possible.

We still need to get a reasonable definition of the semantics of the
relaxed ones vs. WC memory, but let's get through that exercise first
and hopefully for the last time.

> > usleep(10);
> > writel(0, RESET_REG);
> >
> > *However* the issue here, at least on power, is that the CPU can issue
> > that readl but doesn't necessarily wait for it to complete (ie, the
> > data to return), before proceeding to the usleep. Now a usleep contains
> > a bunch of loads and stores and is probably fine, but a udelay which
> > just loops on the timebase may not be.
> >
> > Thus we may still violate the timing requirement.
>
> I've seen that sort of code (with udelay() and no read back) quite often.
> How many were in linux I don't know.
>
> For small delays I usually fix it by repeated writes (of the same value)
> to the device register. That can guarantee very short intervals.

As long as you know the bus frequency...

> The only time I've actually seen buffered writes break timing was
> between a 286 and an 8259 interrupt controller.

:-)

The problem for me is not so much what I've seen, I realize that most of
the issues we are talking about are the kind that will hit once in a
thousand times or less. But we *can* reason about them in a way that can
effectively prevent the problem completely, and when your cluster has
1 machine, 1/1000 starts becoming significant.

These days the vast majority of IO devices either are 99% DMA driven so
that a bit of overhead on MMIOs is irrelevant, or have one fast path
(low latency IB etc...) that needs some finer control, and the rest is
all setup which can be paranoid at will.

So I think we should aim for the "default" accessors most people use to
be as hardened as we can think of. I favor correctness over performance
in all cases.

But then we also define a reasonable semantic for the relaxed ones
(well, we sort-of do have one, we might have to make it a bit more
precise in some areas) that allows the few MMIO fast paths that care to
be optimized.

> If you wrote to the mask then enabled interrupts the first IACK cycle
> could be too close to write and break the cycle recovery time.
> That clobbered most of the interrupt controller registers.
> That probably affected every 286 board ever built!
> Not sure how much software added the required extra bus cycle.
>
> > What we did inside readl, with the twi;isync sequence (which basically
> > means, trap on return value with "trap never" as a condition, followed
> > by isync that ensures all exception conditions are resolved), is force
> > the CPU to "consume" the data from the read before moving on.
> >
> > This effectively makes readl fully synchronous (we would probably avoid
> > that if we were to implement a readl_relaxed).
>
> I've always wondered exactly what the twi;isync were for - always seemed
> very heavy handed for most mmio reads.
> Particularly if you are doing mmio reads from a data fifo.

If you do that you should use the "s" version of the accessors.

Those will only do the above trick at the end of the access series.

Also a FIFO needs special care about endianness anyway, so you should
use those accessors regardless. (Hint: you never endian swap a FIFO even
on BE on a LE device, unless something's been wired very badly in HW).

> Perhaps there should be a writel_wait() that is allowed to do a read back
> for such code paths?

I think what we have is fine: we just define the standard writel/readl
as fairly simple and hardened, and we look at providing a somewhat
reasonable set of relaxed variants for optimizing fast paths. We pretty
much already are there, we just need to be better at defining the
semantics.

And for the super high perf case, which thankfully is either seldom
(server high perf network stuff) or very arch specific (ARM SoC stuff),
arch specific driver hacks will always remain the norm.

Cheers,
Ben.

> David
>
[RFC PATCH 3/3] powerpc/64s: always flush non-local CPUs from single threaded mms
Go one step further, if we're going to put a tlbie on the bus at all,
make it count. Always flush all others and restore our mm to a local
one.
---
 arch/powerpc/mm/tlb-radix.c | 45 +++--
 1 file changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index f00acacf48f1..ba48539e799e 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -424,10 +424,16 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
 		return;
 
 	preempt_disable();
-	if (!mm_is_thread_local(mm))
-		_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
-	else
+	if (mm_is_thread_local(mm)) {
 		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+	} else {
+		if (mm_is_singlethreaded(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+			mm_reset_thread_local(mm);
+		} else {
+			_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
+		}
+	}
 	preempt_enable();
 }
 
@@ -496,14 +502,14 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 			nr_pages > tlb_single_page_flush_ceiling);
 	}
 
-	if (full) {
+	if (!local && mm_is_singlethreaded(mm)) {
+		_tlbie_pid(pid, RIC_FLUSH_ALL);
+		mm_reset_thread_local(mm);
+	} else if (full) {
 		if (local) {
 			_tlbiel_pid(pid, RIC_FLUSH_TLB);
 		} else {
-			if (mm_is_singlethreaded(mm)) {
-				_tlbie_pid(pid, RIC_FLUSH_ALL);
-				mm_reset_thread_local(mm);
-			} else if (mm_needs_flush_escalation(mm)) {
+			if (mm_needs_flush_escalation(mm)) {
 				_tlbie_pid(pid, RIC_FLUSH_ALL);
 			} else {
 				_tlbie_pid(pid, RIC_FLUSH_TLB);
@@ -618,19 +624,17 @@ static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 			nr_pages > tlb_single_page_flush_ceiling);
 	}
 
-	if (full) {
+	if (!local && mm_is_singlethreaded(mm)) {
+		_tlbie_pid(pid, RIC_FLUSH_ALL);
+		mm_reset_thread_local(mm);
+	} else if (full) {
 		if (local) {
 			_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
 		} else {
-			if (mm_is_singlethreaded(mm)) {
-				_tlbie_pid(pid, RIC_FLUSH_ALL);
-				mm_reset_thread_local(mm);
-			} else {
-				if (mm_needs_flush_escalation(mm))
-					also_pwc = true;
+			if (mm_needs_flush_escalation(mm))
+				also_pwc = true;
 
-				_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
-			}
+			_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
 		}
 	} else {
 		if (local)
@@ -676,7 +680,12 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
 	if (mm_is_thread_local(mm)) {
 		_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
 	} else {
-		_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		if (mm_is_singlethreaded(mm)) {
+			_tlbie_pid(pid, RIC_FLUSH_ALL);
+			mm_reset_thread_local(mm);
+		} else {
+			_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
+		}
 	}
 	preempt_enable();
-- 
2.16.1
[RFC PATCH 2/3] powerpc/64s/radix: reset mm_cpumask for single thread process when possible
When a single-threaded process has a non-local mm_cpumask and requires
a full PID tlbie invalidation, use that as an opportunity to reset the
cpumask back to the current CPU we're running on.

No other thread can concurrently switch to this mm, because it must
have had a reference on mm_users before it could use_mm. mm_users can
be asynchronously incremented e.g., by mmget_not_zero, but those users
should not be doing use_mm.
---
 arch/powerpc/include/asm/mmu_context.h | 33 +++-
 arch/powerpc/include/asm/tlb.h | 7 +
 arch/powerpc/mm/tlb-radix.c | 57 +-
 3 files changed, 75 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 3a15b6db9501..9d373c8fe9fa 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -6,6 +6,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -102,13 +103,6 @@ static inline void mm_context_add_copro(struct mm_struct *mm)
 
 static inline void mm_context_remove_copro(struct mm_struct *mm)
 {
-	int c;
-
-	c = atomic_dec_if_positive(&mm->context.copros);
-
-	/* Detect imbalance between add and remove */
-	WARN_ON(c < 0);
-
 	/*
 	 * Need to broadcast a global flush of the full mm before
 	 * decrementing active_cpus count, as the next TLBI may be
@@ -119,10 +113,15 @@ static inline void mm_context_remove_copro(struct mm_struct *mm)
 	 * for the time being. Invalidations will remain global if
 	 * used on hash.
 	 */
-	if (c == 0 && radix_enabled()) {
+	if (radix_enabled()) {
 		flush_all_mm(mm);
 		dec_mm_active_cpus(mm);
 	}
+
+	/* Detect imbalance between add and remove */
+	WARN_ON(!atomic_read(&mm->context.copros));
+	atomic_dec(&mm->context.copros);
+
 }
 #else
 static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
@@ -162,6 +161,24 @@ static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 static inline void enter_lazy_tlb(struct mm_struct *mm,
 				  struct task_struct *tsk)
 {
+#ifdef CONFIG_PPC_BOOK3S_64
+	/*
+	 * Under radix, we do not want to keep lazy PIDs around because
+	 * even if the CPU does not access userspace, it can still bring
+	 * in translations through speculation and prefetching.
+	 *
+	 * Switching away here allows us to trim back the mm_cpumask in
+	 * cases where we know the process is not running on some CPUs
+	 * (see mm/tlb-radix.c).
+	 */
+	if (radix_enabled() && mm != &init_mm) {
+		mmgrab(&init_mm);
+		tsk->active_mm = &init_mm;
+		switch_mm_irqs_off(mm, tsk->active_mm, tsk);
+		mmdrop(mm);
+	}
+#endif
+
 	/* 64-bit Book3E keeps track of current PGD in the PACA */
 #ifdef CONFIG_PPC_BOOK3E_64
 	get_paca()->pgd = NULL;
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index a7eabff27a0f..c4f43bcc6494 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -76,6 +76,13 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
 		return false;
 	return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
+static inline void mm_reset_thread_local(struct mm_struct *mm)
+{
+	WARN_ON(!(atomic_read(&mm->mm_users) == 1 && current->mm == mm));
+	atomic_set(&mm->context.active_cpus, 1);
+	cpumask_clear(mm_cpumask(mm));
+	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index a07f5372a4bf..f00acacf48f1 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -341,6 +341,15 @@ void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmadd
 }
 EXPORT_SYMBOL(radix__local_flush_tlb_page);
 
+static bool mm_is_singlethreaded(struct mm_struct *mm)
+{
+	if (atomic_read(&mm->context.copros) > 0)
+		return false;
+	if (atomic_read(&mm->mm_users) == 1 && current->mm == mm)
+		return true;
+	return false;
+}
+
 static bool mm_needs_flush_escalation(struct mm_struct *mm)
 {
 	/*
@@ -348,7 +357,9 @@ static bool mm_needs_flush_escalation(struct mm_struct *mm)
 	 * caching PTEs and not flushing them properly when
 	 * RIC = 0 for a PID/LPID invalidate
 	 */
-	return atomic_read(&mm->context.copros) != 0;
+	if (atomic_read(&mm->context.copros) > 0)
+		return true;
+	return false;
 }
 
 #ifdef CONFIG_SMP
@@ -362,12 +373,17 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 
 	preempt_disable();
 	if (!mm_is_thread_local(mm)) {
-		if
[RFC PATCH 1/3] powerpc/64s: do not flush TLB when relaxing access
Book3S does not require TLB flushing when protection is being relaxed.
From Power ISA v3.0B, p.1090:

  Setting a Reference or Change Bit or Upgrading Access Authority
  (PTE Subject to Atomic Hardware Updates)

  If the only change being made to a valid PTE that is subject to
  atomic hardware updates is to set the Reference or Change bit to 1
  or to add access authorities, a simpler sequence suffices because
  the translation hardware will refetch the PTE if an access is
  attempted for which the only problems were reference and/or change
  bits needing to be set or insufficient access authority.

Signed-off-by: Nicholas Piggin
---
 arch/powerpc/mm/pgtable-book3s64.c | 1 -
 arch/powerpc/mm/pgtable.c | 3 ++-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 422e80253a33..de49cedcbc84 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -40,7 +40,6 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 	if (changed) {
 		__ptep_set_access_flags(vma->vm_mm, pmdp_ptep(pmdp),
 					pmd_pte(entry), address);
-		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	}
 	return changed;
 }
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9f361ae571e9..5b07a626df5b 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -224,7 +224,8 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 		if (!is_vm_hugetlb_page(vma))
 			assert_pte_locked(vma->vm_mm, address);
 		__ptep_set_access_flags(vma->vm_mm, ptep, entry, address);
-		flush_tlb_page(vma, address);
+		if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
+			flush_tlb_page(vma, address);
 	}
 	return changed;
 }
-- 
2.16.1
[RFC PATCH 0/3] powerpc tlbie reductions again
Last time this came up, there was concern about whether we can trim the
mm cpumask, and what concurrency there is vs use_mm(). I've had more of
a look and still think this is okay. I haven't thought of a good way to
add debug checks to ensure it though.

When doing a parallel kernel build on a 2 socket P9 system, this series
causes tlbie (broadcast) to go from 1.37 million (22k/sec) to 181
thousand (3k/sec). tlbiel (local) flushes increase from 20.2 to 23.7
million, because flushing a PID requires 128 tlbiel (vs 1 tlbie), and
also because we set the cutoff for switching from va range to full PID
flush higher when doing tlbiel.

End result performance was very little changed, a tiny improvement maybe
but well under 1%. Kernel compile mostly stays off the interconnect,
this is a small system, and there is no nMMU involvement; any of these
factors could make broadcast tlbie reduction more important.

Remaining work: ensuring correctness of this stuff, implementations for
hash, understanding and testing nMMU cases better, using IPIs for
some/all types of invalidations, then possibly looking at doing
something more fancy with the PID allocator.

Nicholas Piggin (3):
  powerpc/64s: do not flush TLB when relaxing access
  powerpc/64s/radix: reset mm_cpumask for single thread process when
    possible
  powerpc/64s: always flush non-local CPUs from single threaded mms

 arch/powerpc/include/asm/mmu_context.h | 33 ++
 arch/powerpc/include/asm/tlb.h | 7 +++
 arch/powerpc/mm/pgtable-book3s64.c | 1 -
 arch/powerpc/mm/pgtable.c | 3 +-
 arch/powerpc/mm/tlb-radix.c | 78 +-
 5 files changed, 92 insertions(+), 30 deletions(-)

-- 
2.16.1
Re: [PATCH v3 41/41] cxlflash: Handle spurious interrupts
On Mon, Mar 26, 2018 at 11:35:42AM -0500, Uma Krishnan wrote:
> The following Oops can occur when there is heavy I/O traffic and the host
> is reset by a tool such as sg_reset.
>
> [c000200fff3fbc90] c0081690117c process_cmd_doneq+0x104/0x500 [cxlflash] (unreliable)
> [c000200fff3fbd80] c00816901648 cxlflash_rrq_irq+0xd0/0x150 [cxlflash]
> [c000200fff3fbde0] c0193130 __handle_irq_event_percpu+0xa0/0x310
> [c000200fff3fbea0] c01933d8 handle_irq_event_percpu+0x38/0x90
> [c000200fff3fbee0] c0193494 handle_irq_event+0x64/0xb0
> [c000200fff3fbf10] c0198ea0 handle_fasteoi_irq+0xc0/0x230
> [c000200fff3fbf40] c019182c generic_handle_irq+0x4c/0x70
> [c000200fff3fbf60] c001794c __do_irq+0x7c/0x1c0
> [c000200fff3fbf90] c002a390 call_do_irq+0x14/0x24
> [c000200e5828fab0] c0017b2c do_IRQ+0x9c/0x130
> [c000200e5828fb00] c0009b04 h_virt_irq_common+0x114/0x120
>
> When a context is reset, the pending commands are flushed and the AFU
> is notified. Before the AFU handles this request there could be command
> completion interrupts queued to PHB which are yet to be delivered to the
> context. In this scenario, a context could receive an interrupt for a
> command that has been flushed, leading to a possible crash when the memory
> for the flushed command is accessed.
>
> To resolve this problem, a boolean will indicate if the hardware queue is
> ready to process interrupts or not. This can be evaluated in the interrupt
> handler before processing an interrupt.
>
> Signed-off-by: Uma Krishnan

Acked-by: Matthew R. Ochs
Re: [PATCH v3 40/41] cxlflash: Remove commands from pending list on timeout
On Mon, Mar 26, 2018 at 11:35:34AM -0500, Uma Krishnan wrote:
> The following Oops can occur if an internal command sent to the AFU does
> not complete within the timeout:
>
> [c00ff101b810] c00816020d94 term_mc+0xfc/0x1b0 [cxlflash]
> [c00ff101b8a0] c00816020fb0 term_afu+0x168/0x280 [cxlflash]
> [c00ff101b930] c008160232ec cxlflash_pci_error_detected+0x184/0x230 [cxlflash]
> [c00ff101b9e0] c0080d95d468 cxl_vphb_error_detected+0x90/0x150 [cxl]
> [c00ff101ba20] c0080d95f27c cxl_pci_error_detected+0xa4/0x240 [cxl]
> [c00ff101bac0] c003eaf8 eeh_report_error+0xd8/0x1b0
> [c00ff101bb20] c003d0b8 eeh_pe_dev_traverse+0x98/0x170
> [c00ff101bbb0] c003f438 eeh_handle_normal_event+0x198/0x580
> [c00ff101bc60] c003fba4 eeh_handle_event+0x2a4/0x338
> [c00ff101bd10] c00400b8 eeh_event_handler+0x1f8/0x200
> [c00ff101bdc0] c013da48 kthread+0x1a8/0x1b0
> [c00ff101be30] c000b528 ret_from_kernel_thread+0x5c/0xb4
>
> When an internal command times out, the command buffer is freed while it
> is still in the pending commands list of the context. This corrupts the
> list and when the context is cleaned up, a crash is encountered.
>
> To resolve this issue, when an AFU command or TMF command times out, the
> command should be deleted from the hardware queue pending command list
> before freeing the buffer.
>
> Signed-off-by: Uma Krishnan

Acked-by: Matthew R. Ochs
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Wed, Mar 28, 2018 at 05:41:40PM +0300, Yury Norov wrote: > On Wed, Mar 28, 2018 at 06:56:17AM -0700, Paul E. McKenney wrote: > > On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote: > > > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > > > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > > > > broadcast IPI. > > > > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > > > > userspace), this > > > > > > > work may be done at the exit of this state. Delaying > > > > > > > synchronization helps to > > > > > > > save power if CPU is in idle state and decrease latency for > > > > > > > real-time tasks. > > > > > > > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in > > > > > > > mm/slab and arm64 > > > > > > > code to delay syncronization. > > > > > > > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to > > > > > > > the CPU running > > > > > > > isolated task would be fatal, as it breaks isolation. The > > > > > > > approach with delaying > > > > > > > of synchronization work helps to maintain isolated state. > > > > > > > > > > > > > > I've tested it with test from task isolation series on ThunderX2 > > > > > > > for more than > > > > > > > 10 hours (10k giga-ticks) without breaking isolation. 
> > > > > > > > > > > > > > Signed-off-by: Yury Norov> > > > > > > --- > > > > > > > arch/arm64/kernel/insn.c | 2 +- > > > > > > > include/linux/smp.h | 2 ++ > > > > > > > kernel/smp.c | 24 > > > > > > > mm/slab.c| 2 +- > > > > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > > > > index 2718a77da165..9d7c492e920e 100644 > > > > > > > --- a/arch/arm64/kernel/insn.c > > > > > > > +++ b/arch/arm64/kernel/insn.c > > > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > > > > *addrs[], u32 insns[], int cnt) > > > > > > >* synchronization. > > > > > > >*/ > > > > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > > > > insns[0]); > > > > > > > - kick_all_cpus_sync(); > > > > > > > + kick_active_cpus_sync(); > > > > > > > return ret; > > > > > > > } > > > > > > > } > > > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > > > > index 9fb239e12b82..27215e22240d 100644 > > > > > > > --- a/include/linux/smp.h > > > > > > > +++ b/include/linux/smp.h > > > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct > > > > > > > cpumask *mask, > > > > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > > > > > > > void kick_all_cpus_sync(void); > > > > > > > +void kick_active_cpus_sync(void); > > > > > > > void wake_up_all_idle_cpus(void); > > > > > > > > > > > > > > /* > > > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask > > > > > > > *mask, smp_call_func_t func, > > > > > > > } > > > > > > > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > > > > +static inline void kick_active_cpus_sync(void) { } > > > > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > > > > index 084c8b3a2681..0358d6673850 100644 > > > > > > > --- 
a/kernel/smp.c > > > > > > > +++ b/kernel/smp.c > > > > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > > > > > } > > > > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > > > > > > > +/** > > > > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > > > > + * that state. > > > > > > > + */ > > > > > > > +void kick_active_cpus_sync(void) > > > > > > > +{ > > > > > > > + int cpu; > > > > > > > + struct cpumask kernel_cpus; > > > > > > > + > > > > > > > + smp_mb(); > > > > > > > + > > > > > > > + cpumask_clear(&kernel_cpus); > > > > > > > + preempt_disable(); > > > > > > > + for_each_online_cpu(cpu) { > > > > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > > > > must IPI it, correct? > > > > > > > > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > > > > can invoke it when it next leaves its quiescent state? Or are you > > > > > >
Re: [PATCH v3 39/41] cxlflash: Synchronize reset and remove ops
On Mon, Mar 26, 2018 at 11:35:27AM -0500, Uma Krishnan wrote:
> The following Oops can be encountered if a device removal or system
> shutdown is initiated while an EEH recovery is in process:
>
> [c00ff2f479c0] c00815256f18 cxlflash_pci_slot_reset+0xa0/0x100 [cxlflash]
> [c00ff2f47a30] c0080dae22e0 cxl_pci_slot_reset+0x168/0x290 [cxl]
> [c00ff2f47ae0] c003ef1c eeh_report_reset+0xec/0x170
> [c00ff2f47b20] c003d0b8 eeh_pe_dev_traverse+0x98/0x170
> [c00ff2f47bb0] c003f80c eeh_handle_normal_event+0x56c/0x580
> [c00ff2f47c60] c003fba4 eeh_handle_event+0x2a4/0x338
> [c00ff2f47d10] c00400b8 eeh_event_handler+0x1f8/0x200
> [c00ff2f47dc0] c013da48 kthread+0x1a8/0x1b0
> [c00ff2f47e30] c000b528 ret_from_kernel_thread+0x5c/0xb4
>
> The remove handler frees AFU memory while the EEH recovery is in progress,
> leading to a race condition. This can result in a crash if the recovery
> thread tries to access this memory.
>
> To resolve this issue, the cxlflash remove handler will evaluate the
> device state and yield to any active reset or probing threads.
>
> Signed-off-by: Uma Krishnan

Looks good!

Acked-by: Matthew R. Ochs
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Wed, Mar 28, 2018 at 06:56:17AM -0700, Paul E. McKenney wrote: > On Wed, Mar 28, 2018 at 04:36:05PM +0300, Yury Norov wrote: > > On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > > > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > > > broadcast IPI. > > > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > > > userspace), this > > > > > > work may be done at the exit of this state. Delaying > > > > > > synchronization helps to > > > > > > save power if CPU is in idle state and decrease latency for > > > > > > real-time tasks. > > > > > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in > > > > > > mm/slab and arm64 > > > > > > code to delay syncronization. > > > > > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to > > > > > > the CPU running > > > > > > isolated task would be fatal, as it breaks isolation. The approach > > > > > > with delaying > > > > > > of synchronization work helps to maintain isolated state. > > > > > > > > > > > > I've tested it with test from task isolation series on ThunderX2 > > > > > > for more than > > > > > > 10 hours (10k giga-ticks) without breaking isolation. 
> > > > > > > > > > > > Signed-off-by: Yury Norov> > > > > > --- > > > > > > arch/arm64/kernel/insn.c | 2 +- > > > > > > include/linux/smp.h | 2 ++ > > > > > > kernel/smp.c | 24 > > > > > > mm/slab.c| 2 +- > > > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > > > index 2718a77da165..9d7c492e920e 100644 > > > > > > --- a/arch/arm64/kernel/insn.c > > > > > > +++ b/arch/arm64/kernel/insn.c > > > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > > > *addrs[], u32 insns[], int cnt) > > > > > > * synchronization. > > > > > > */ > > > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > > > insns[0]); > > > > > > - kick_all_cpus_sync(); > > > > > > + kick_active_cpus_sync(); > > > > > > return ret; > > > > > > } > > > > > > } > > > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > > > index 9fb239e12b82..27215e22240d 100644 > > > > > > --- a/include/linux/smp.h > > > > > > +++ b/include/linux/smp.h > > > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask > > > > > > *mask, > > > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > > > > > void kick_all_cpus_sync(void); > > > > > > +void kick_active_cpus_sync(void); > > > > > > void wake_up_all_idle_cpus(void); > > > > > > > > > > > > /* > > > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask > > > > > > *mask, smp_call_func_t func, > > > > > > } > > > > > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > > > +static inline void kick_active_cpus_sync(void) { } > > > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > > > index 084c8b3a2681..0358d6673850 100644 > > > > > > --- a/kernel/smp.c > > > > > > +++ b/kernel/smp.c > > > > > > @@ -724,6 +724,30 @@ void 
kick_all_cpus_sync(void) > > > > > > } > > > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > > > > > +/** > > > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > > > + * that state. > > > > > > + */ > > > > > > +void kick_active_cpus_sync(void) > > > > > > +{ > > > > > > + int cpu; > > > > > > + struct cpumask kernel_cpus; > > > > > > + > > > > > > + smp_mb(); > > > > > > + > > > > > > + cpumask_clear(&kernel_cpus); > > > > > > + preempt_disable(); > > > > > > + for_each_online_cpu(cpu) { > > > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > > > must IPI it, correct? > > > > > > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > > > can invoke it when it next leaves its quiescent state? Or are you > > > > > able > > > > > to ignore the CPU in that case? (If you are able to ignore the CPU in > > > > > that case, I could give you a lower-cost function to get your job > > > > > done.) > > > > >
Re: [v2, 01/10] powerpc: Add security feature flags for Spectre/Meltdown
On Tue, 2018-03-27 at 12:01:44 UTC, Michael Ellerman wrote:
> This commit adds security feature flags to reflect the settings we
> receive from firmware regarding Spectre/Meltdown mitigations.
>
> The feature names reflect the names we are given by firmware on bare
> metal machines. See the hostboot source for details.
>
> Arguably these could be firmware features, but that then requires them
> to be read early in boot so they're available prior to asm feature
> patching, but we don't actually want to use them for patching. We may
> also want to dynamically update them in future, which would be
> incompatible with the way firmware features work (at the moment at
> least). So for now just make them separate flags.
>
> Signed-off-by: Michael Ellerman

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/9a868f634349e62922c226834aa23e

cheers
Re: [v3,1/8] powerpc: Add ppc_breakpoint_available()
On Tue, 2018-03-27 at 04:37:17 UTC, Michael Neuling wrote:
> Add ppc_breakpoint_available() to determine if a breakpoint is
> available currently via the DAWR or DABR.
>
> Signed-off-by: Michael Neuling

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/404b27d66ed657ebccb08a9c8f8f65

cheers
Re: [v2] powerpc/perf: Fix the kernel address leak to userspace via SDAR
On Wed, 2018-03-21 at 11:40:26 UTC, Madhavan Srinivasan wrote:
> Sampled Data Address Register (SDAR) is a 64-bit
> register that contains the effective address of
> the storage operand of an instruction that was
> being executed, possibly out-of-order, at or around
> the time that the Performance Monitor alert occurred.
>
> In certain scenarios SDAR happens to contain a kernel
> address even for userspace-only sampling. Add checks
> to prevent it.
>
> Signed-off-by: Madhavan Srinivasan

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cd1231d7035fea894118d5155ff984

cheers
Re: [v2] powerpc/perf: Fix kernel address leak to userspace via BHRB buffer
On Wed, 2018-03-21 at 11:40:25 UTC, Madhavan Srinivasan wrote:
> The current Branch History Rolling Buffer (BHRB) code does
> not check for any privilege levels before updating the data
> from BHRB. This leaks kernel addresses to userspace even when
> profiling only with userspace privileges. Add proper checks
> to prevent it.
>
> Acked-by: Balbir Singh
> Signed-off-by: Madhavan Srinivasan

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/bb19af816025d495376bd76bf6fbcf

cheers
Re: [v2] powerpc/perf: Fix kernel address leaks via Sampling registers
On Wed, 2018-03-21 at 11:40:24 UTC, Madhavan Srinivasan wrote: > From: Michael Ellerman> > Current code in power_pmu_disable() does not clear the sampling > registers like Sampling Instruction Address Register (SIAR) and > Sampling Data Address Register (SDAR) after disabling the PMU. > Since these are userspace readable and could contain a kernel > address, add code to explicitly clear the content of these registers. > Patch also adds a "context synchronizing instruction" to enforce > no further updates to these registers as mandated by PowerISA. > > "If an mtspr instruction is executed that changes the > value of a Performance Monitor register other than > SIAR, SDAR, and SIER, the change is not guaranteed > to have taken effect until after a subsequent context > synchronizing instruction has been executed (see > Chapter 11. "Synchronization Requirements for Context Alterations" on page 1133)." > > Signed-off-by: Madhavan Srinivasan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/e1ebd0e5b9d0a10ba65e63a3514b6d cheers
Re: [v2,1/9] powerpc/eeh: Remove eeh_handle_event()
On Mon, 2018-03-19 at 02:46:20 UTC, Sam Bobroff wrote: > The function eeh_handle_event(pe) does nothing other than switching > between calling eeh_handle_normal_event(pe) and > eeh_handle_special_event(). However it is only called in two places, > one where pe can't be NULL and the other where it must be NULL (see > eeh_event_handler()) so it does nothing but obscure the flow of > control. > > So, remove it. > > Signed-off-by: Sam Bobroff> Reviewed-by: Alexey Kardashevskiy Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/68701780712f7ddb2fa81032aa1b4a cheers
Re: [v3, 1/5] rfi-flush: Move the logic to avoid a redo into the debugfs code
On Wed, 2018-03-14 at 22:40:38 UTC, Mauricio Faria de Oliveira wrote: > From: Michael Ellerman> > rfi_flush_enable() includes a check to see if we're already > enabled (or disabled), and in that case does nothing. > > But that means calling setup_rfi_flush() a 2nd time doesn't actually > work, which is a bit confusing. > > Move that check into the debugfs code, where it really belongs. > > Signed-off-by: Michael Ellerman > Signed-off-by: Mauricio Faria de Oliveira Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/1e2a9fc7496955faacbbed49461d61 cheers
Re: powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping()
On Fri, 2018-03-09 at 20:45:58 UTC, Mauricio Faria de Oliveira wrote: > Fix the warning messages for stop_machine_change_mapping(), and a number > of other affected functions in its call chain. > > All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit > is okay (keeps them / does not discard them). > > Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu. > > $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux > ... > MODPOST vmlinux.o > WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the > function stop_machine_change_mapping() to the function > .meminit.text:create_physical_mapping() > The function stop_machine_change_mapping() references > the function __meminit create_physical_mapping(). > This is often because stop_machine_change_mapping lacks a __meminit > annotation or the annotation of create_physical_mapping is wrong. > > WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the > function stop_machine_change_mapping() to the function > .meminit.text:create_physical_mapping() > The function stop_machine_change_mapping() references > the function __meminit create_physical_mapping(). > This is often because stop_machine_change_mapping lacks a __meminit > annotation or the annotation of create_physical_mapping is wrong. > ... > > Signed-off-by: Mauricio Faria de Oliveira> Acked-by: Balbir Singh Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/bde709a70884bfc790da6fbc4467c9 cheers
Re: [1/3] powerpc/perf: Infrastructure to support addition of blacklisted events
On Sun, 2018-03-04 at 11:56:26 UTC, Madhavan Srinivasan wrote: > Introduce code to support addition of blacklisted events for a > processor version. A 'pointer' and 'int' variable to hold the > number of events are added to 'struct power_pmu', along with a > generic function to loop through the list to validate the given > event. Generic function 'is_event_blacklisted' is called in > power_pmu_event_init() to detect and reject early. > > Signed-off-by: Madhavan Srinivasan Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/b58064da046243f0c988afd939997e cheers
Re: [kernel] powerpc/npu: Do not try invalidating 32bit table when 64bit table is enabled
On Tue, 2018-02-13 at 05:51:35 UTC, Alexey Kardashevskiy wrote: > GPUs and the corresponding NVLink bridges get different PEs as they have > separate translation validation entries (TVEs). We put these PEs to > the same IOMMU group so they cannot be passed through separately. > So the iommu_table_group_ops::set_window/unset_window for GPUs do set > tables to the NPU PEs as well which means that iommu_table's list of > attached PEs (iommu_table_group_link) has both GPU and NPU PEs linked. > This list is used for TCE cache invalidation. > > The problem is that NPU PE has just a single TVE and can be programmed > to point to 32bit or 64bit windows while GPU PE has two (as any other PCI > device). So we end up having a 32bit iommu_table struct linked to both > PEs even though only the 64bit TCE table cache can be invalidated on NPU. > And a relatively recent skiboot detects this and prints errors. > > This changes GPU's iommu_table_group_ops::set_window/unset_window to make > sure that NPU PE is only linked to the table actually used by the hardware. > If there are two tables used by an IOMMU group, the NPU PE will use > the last programmed one which with the current use scenarios is expected > to be a 64bit one. > > Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/d41ce7b1bcc3e1d02cc9da3b83c0fe cheers
Re: [kernel] powerpc/mm: Fix typo in comments
On Thu, 2018-02-01 at 05:07:25 UTC, Alexey Kardashevskiy wrote: > Fixes: 912cc87a6 "powerpc/mm/radix: Add LPID based tlb flush helpers" > Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/b574df94883df4d37f1b9d648867d6 cheers
Re: [kernel] powerpc/lpar/debug: Initialize flags before printing debug message
On Tue, 2018-01-09 at 05:52:14 UTC, Alexey Kardashevskiy wrote: > With DEBUG enabled, there is a compile error: > "error: 'flags' is used uninitialized in this function". > > This moves the pr_devel() a little further down, to where @flags is initialized. > > Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a8c0bf3c621e0acc01451e27fe47c4 cheers
Re: [kernel] powerpc/init: Do not advertise radix during client-architecture-support
On Tue, 2018-01-09 at 05:45:20 UTC, Alexey Kardashevskiy wrote: > Currently the pseries kernel advertises radix MMU support even if > the actual support is disabled via the CONFIG_PPC_RADIX_MMU option. > > This adds a check for CONFIG_PPC_RADIX_MMU to avoid advertising radix > to the hypervisor. > > Suggested-by: Paul Mackerras> Signed-off-by: Alexey Kardashevskiy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/79b4686857029cdea97d0102d179ae cheers
Re: powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9
On Thu, 2017-02-16 at 05:03:39 UTC, Paul Mackerras wrote: > On POWER9, since commit cc3d2940133d ("powerpc/64: Enable use of radix > MMU under hypervisor on POWER9", 2017-01-30), we set both the radix and > HPT bits in the client-architecture-support (CAS) vector, which tells > the hypervisor that we can do either radix or HPT. According to PAPR, > if we use this combination we are promising to do a H_REGISTER_PROC_TBL > hcall later on to let the hypervisor know whether we are doing radix > or HPT. We currently do this call if we are doing radix but not if > we are doing HPT. If the hypervisor is able to support both radix > and HPT guests, it would be entitled to defer allocation of the HPT > until the H_REGISTER_PROC_TBL call, and to fail any attempts to create > HPTEs until the H_REGISTER_PROC_TBL call. Thus we need to do a > H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may > crash at boot time. > > This adds the code to call H_REGISTER_PROC_TBL in this case, before > we attempt to create any HPT entries using H_ENTER. > > Fixes: cc3d2940133d ("powerpc/64: Enable use of radix MMU under hypervisor on > POWER9") > Signed-off-by: Paul Mackerras> Reviewed-by: Suraj Jitindar Singh Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/dbfcf3cb9c681aa0c5d0bb46068f98 cheers
Re: [PATCH] powerpc: Fix smp_wmb barrier definition to use lwsync consistently
On Wed, 28 Mar 2018 23:40:05 +1100 Michael Ellerman wrote: > Nicholas Piggin writes: > > > asm/barrier.h is not always included after asm/synch.h, which meant > > it was missing __SUBARCH_HAS_LWSYNC, so in some files smp_wmb() would > > be eieio when it should be lwsync. kernel/time/hrtimer.c is one case. > > Wow nice catch. Only broken since 2008 presumably. > > Some days I think maybe we aren't very good at this writing software > thing, good to have some certainty :) Yeah, I only caught it by luck when looking through instruction traces. The pipeline model just happens to make eieio look different to most other instructions (which is likely a bug in the model) which made me look closer at it. Could have been with us for another 10 years. > > __SUBARCH_HAS_LWSYNC is only used in one place, so just fold it in > > to where it's used. Previously with my small simulator config, 377 > > instances of eieio in the tree. After this patch there are 55. > > At least for Book3S this isn't actually a terrible bug AFAICS: > > - smp_wmb() is only defined to order accesses to cacheable memory. > - smp_wmb() only orders prior stores vs later stores. > - eieio orders all prior stores vs all later stores for cacheable > memory. > - lwsync orders everything except prior stores vs later loads for > cacheable memory. > > So eieio and lwsync are both valid to use as smp_wmb(), but it's still > terribly fishy that we were using both in different places depending on > include ordering. Oh yeah, it's not a bug in the sense that it would cause a violation of memory ordering, only performance (and expectations when debugging and observing things I guess). eieio works fine for smp_wmb(). > I'm inclined to tag this for stable unless anyone can think of a reason > not to? I think that would be good. Thanks, Nick
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Mon, Mar 26, 2018 at 05:45:55AM -0700, Paul E. McKenney wrote: > On Sun, Mar 25, 2018 at 11:11:54PM +0300, Yury Norov wrote: > > On Sun, Mar 25, 2018 at 12:23:28PM -0700, Paul E. McKenney wrote: > > > On Sun, Mar 25, 2018 at 08:50:04PM +0300, Yury Norov wrote: > > > > kick_all_cpus_sync() forces all CPUs to sync caches by sending > > > > broadcast IPI. > > > > If CPU is in extended quiescent state (idle task or nohz_full > > > > userspace), this > > > > work may be done at the exit of this state. Delaying synchronization > > > > helps to > > > > save power if CPU is in idle state and decrease latency for real-time > > > > tasks. > > > > > > > > This patch introduces kick_active_cpus_sync() and uses it in mm/slab > > > > and arm64 > > > > code to delay synchronization. > > > > > > > > For task isolation (https://lkml.org/lkml/2017/11/3/589), IPI to the > > > > CPU running > > > > isolated task would be fatal, as it breaks isolation. The approach with > > > > delaying > > > > of synchronization work helps to maintain isolated state. > > > > > > > > I've tested it with test from task isolation series on ThunderX2 for > > > > more than > > > > 10 hours (10k giga-ticks) without breaking isolation. > > > > > > > > Signed-off-by: Yury Norov> > > > --- > > > > arch/arm64/kernel/insn.c | 2 +- > > > > include/linux/smp.h | 2 ++ > > > > kernel/smp.c | 24 > > > > mm/slab.c | 2 +- > > > > 4 files changed, 28 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c > > > > index 2718a77da165..9d7c492e920e 100644 > > > > --- a/arch/arm64/kernel/insn.c > > > > +++ b/arch/arm64/kernel/insn.c > > > > @@ -291,7 +291,7 @@ int __kprobes aarch64_insn_patch_text(void > > > > *addrs[], u32 insns[], int cnt) > > > > * synchronization.
> > > > */ > > > > ret = aarch64_insn_patch_text_nosync(addrs[0], > > > > insns[0]); > > > > - kick_all_cpus_sync(); > > > > + kick_active_cpus_sync(); > > > > return ret; > > > > } > > > > } > > > > diff --git a/include/linux/smp.h b/include/linux/smp.h > > > > index 9fb239e12b82..27215e22240d 100644 > > > > --- a/include/linux/smp.h > > > > +++ b/include/linux/smp.h > > > > @@ -105,6 +105,7 @@ int smp_call_function_any(const struct cpumask > > > > *mask, > > > > smp_call_func_t func, void *info, int wait); > > > > > > > > void kick_all_cpus_sync(void); > > > > +void kick_active_cpus_sync(void); > > > > void wake_up_all_idle_cpus(void); > > > > > > > > /* > > > > @@ -161,6 +162,7 @@ smp_call_function_any(const struct cpumask *mask, > > > > smp_call_func_t func, > > > > } > > > > > > > > static inline void kick_all_cpus_sync(void) { } > > > > +static inline void kick_active_cpus_sync(void) { } > > > > static inline void wake_up_all_idle_cpus(void) { } > > > > > > > > #ifdef CONFIG_UP_LATE_INIT > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > index 084c8b3a2681..0358d6673850 100644 > > > > --- a/kernel/smp.c > > > > +++ b/kernel/smp.c > > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > > } > > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > > > +/** > > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > > + * that state. > > > > + */ > > > > +void kick_active_cpus_sync(void) > > > > +{ > > > > + int cpu; > > > > + struct cpumask kernel_cpus; > > > > + > > > > + smp_mb(); > > > > + > > > > + cpumask_clear(_cpus); > > > > + preempt_disable(); > > > > + for_each_online_cpu(cpu) { > > > > + if (!rcu_eqs_special_set(cpu)) > > > > > > If we get here, the CPU is not in a quiescent state, so we therefore > > > must IPI it, correct? 
> > > > > > But don't you also need to define rcu_eqs_special_exit() so that RCU > > > can invoke it when it next leaves its quiescent state? Or are you able > > > to ignore the CPU in that case? (If you are able to ignore the CPU in > > > that case, I could give you a lower-cost function to get your job done.) > > > > > > Thanx, Paul > > > > What's actually needed for synchronization is issuing memory barrier on > > target > > CPUs before we start executing kernel code. > > > > smp_mb() is implicitly called in smp_call_function*() path for it. In > > rcu_eqs_special_set() -> rcu_dynticks_eqs_exit() path, > > smp_mb__after_atomic() > > is called just before rcu_eqs_special_exit(). > > > > So I think, rcu_eqs_special_exit() may be left untouched. Empty > > rcu_eqs_special_exit() in new RCU path corresponds empty
Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key
On 26/03/2018 00:10, David Rientjes wrote: > On Wed, 21 Mar 2018, Laurent Dufour wrote: > >> I found the root cause of this lockdep warning. >> >> In mmap_region(), unmap_region() may be called while vma_link() has not been >> called. This happens during the error path if call_mmap() failed. >> >> The only way to fix that particular case is to call >> seqcount_init(&vma->vm_sequence) when initializing the vma in mmap_region(). >> > > Ack, although that would require a fixup to dup_mmap() as well. You're right, I'll fix that too. Thanks a lot. Laurent.
Re: [PATCH 2/2] smp: introduce kick_active_cpus_sync()
On Mon, Mar 26, 2018 at 02:57:35PM -0400, Steven Rostedt wrote: > On Mon, 26 Mar 2018 10:53:13 +0200 > Andrea Parri wrote: > > > > --- a/kernel/smp.c > > > +++ b/kernel/smp.c > > > @@ -724,6 +724,30 @@ void kick_all_cpus_sync(void) > > > } > > > EXPORT_SYMBOL_GPL(kick_all_cpus_sync); > > > > > > +/** > > > + * kick_active_cpus_sync - Force CPUs that are not in extended > > > + * quiescent state (idle or nohz_full userspace) sync by sending > > > + * IPI. Extended quiescent state CPUs will sync at the exit of > > > + * that state. > > > + */ > > > +void kick_active_cpus_sync(void) > > > +{ > > > + int cpu; > > > + struct cpumask kernel_cpus; > > > + > > > + smp_mb(); > > > > (A general remark only:) > > > > checkpatch.pl should have warned about the fact that this barrier is > > missing an accompanying comment (which accesses are being "ordered", > > what is the pairing barrier, etc.). > > He could have simply copied the comment above the smp_mb() for > kick_all_cpus_sync(): > > /* Make sure the change is visible before we kick the cpus */ > > The kick itself is pretty much a synchronization primitive. > > That is, you make some changes and then you need all CPUs to see it, > and you call: kick_active_cpus_sync(), which is the barrier to make > sure your previous changes are seen on all CPUs before you proceed > further. Note, the matching barrier is implicit in the IPI itself. > > -- Steve I know that I had to copy the comment from kick_all_cpus_sync(), but I don't like copy-pasting in general, and as Steven said, this smp_mb() is already inside a synchronization routine, so we may hope that users of kick_*_cpus_sync() will explain better what for they need it... > > > > > Moreover if, as your reply above suggested, your patch is relying on > > "implicit barriers" (something I would not recommend) then even more > > so you should comment on these requirements.
> > > > This could: (a) force you to reason about the memory ordering stuff, > > (b) ease the task of reviewing and adopting your patch, (c) ease the > > task of preserving those requirements (as implementations change). > > > > Andrea I need v2 anyway, and I will add comments to address all questions in this thread. I also hope that we'll agree that for powerpc it's also safe to delay synchronization, and if so, we will have no users of kick_all_cpus_sync(), and can drop it. (It looks like this, because a nohz_full userspace CPU cannot have pending IPIs, but I'd like to get confirmation from powerpc people.) Would it make sense to rename kick_all_cpus_sync() to smp_mb_sync(), which would stand for 'synchronous memory barrier on all online CPUs'? Yury
[GIT PULL] Please pull powerpc/linux.git powerpc-4.16-6 tag
Hi Linus, Please pull some more powerpc fixes for 4.16. Apologies if this is a bit big at rc7, but they're all reasonably important fixes. None are actually for new code, so they aren't indicative of 4.16 being in bad shape from our point of view. cheers The following changes since commit b0c41b8b6e43120d7c35e4709508a3d90a09646e: powerpc/pseries: Fix vector5 in ibm architecture vector table (2018-03-06 23:05:38 +1100) are available in the git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.16-6 for you to fetch changes up to 52396500f97c53860164debc7d4f759077853423: powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs (2018-03-26 07:40:17 +1100) - powerpc fixes for 4.16 #6 These are actually all fixes for pre-4.16 code, or new hardware workarounds. Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new firmware interface for describing CPU features. Fix lost pending interrupts due to a race in our interrupt soft-masking code. A workaround for a nest MMU bug with TLB invalidations on Power9. A workaround for broadcast TLB invalidations on Power9. Fix a bug in our instruction SLB miss handler, when handling bad addresses (eg. >= TASK_SIZE), which could corrupt non-volatile user GPRs. Thanks to: Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt, Nicholas Piggin.
Aneesh Kumar K.V (3):
      powerpc/mm/radix: Remove unused code
      powerpc/mm/radix: Move the functions that does the actual tlbie closer
      powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

Benjamin Herrenschmidt (2):
      powerpc/mm: Add tracking of the number of coprocessors using a context
      powerpc/mm: Workaround Nest MMU bug with TLB invalidations

Michael Ellerman (1):
      powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

Nicholas Piggin (2):
      powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened
      powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs

 arch/powerpc/include/asm/book3s/64/mmu.h           |   3 +
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   3 -
 arch/powerpc/include/asm/cputable.h                |   3 +-
 arch/powerpc/include/asm/mmu_context.h             |  18 ++-
 arch/powerpc/kernel/dt_cpu_ftrs.c                  |   6 +
 arch/powerpc/kernel/exceptions-64s.S               |   2 +-
 arch/powerpc/kernel/irq.c                          |   8 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c             |   3 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c                |  11 ++
 arch/powerpc/mm/hash_native_64.c                   |  16 +-
 arch/powerpc/mm/mmu_context_book3s64.c             |   1 +
 arch/powerpc/mm/pgtable_64.c                       |   1 +
 arch/powerpc/mm/tlb-radix.c                        | 169 +++--
 13 files changed, 154 insertions(+), 90 deletions(-)
Re: [PATCH] powerpc: Fix smp_wmb barrier definition to use lwsync consistently
Nicholas Piggin writes: > asm/barrier.h is not always included after asm/synch.h, which meant > it was missing __SUBARCH_HAS_LWSYNC, so in some files smp_wmb() would > be eieio when it should be lwsync. kernel/time/hrtimer.c is one case. Wow nice catch. Only broken since 2008 presumably. Some days I think maybe we aren't very good at this writing software thing, good to have some certainty :) > __SUBARCH_HAS_LWSYNC is only used in one place, so just fold it in > to where it's used. Previously with my small simulator config, 377 > instances of eieio in the tree. After this patch there are 55. At least for Book3S this isn't actually a terrible bug AFAICS: - smp_wmb() is only defined to order accesses to cacheable memory. - smp_wmb() only orders prior stores vs later stores. - eieio orders all prior stores vs all later stores for cacheable memory. - lwsync orders everything except prior stores vs later loads for cacheable memory. So eieio and lwsync are both valid to use as smp_wmb(), but it's still terribly fishy that we were using both in different places depending on include ordering. I'm inclined to tag this for stable unless anyone can think of a reason not to?
cheers > diff --git a/arch/powerpc/include/asm/barrier.h > b/arch/powerpc/include/asm/barrier.h > index 10daa1d56e0a..c7c63959ba91 100644 > --- a/arch/powerpc/include/asm/barrier.h > +++ b/arch/powerpc/include/asm/barrier.h > @@ -35,7 +35,8 @@ > #define rmb() __asm__ __volatile__ ("sync" : : : "memory") > #define wmb() __asm__ __volatile__ ("sync" : : : "memory") > > -#ifdef __SUBARCH_HAS_LWSYNC > +/* The sub-arch has lwsync */ > +#if defined(__powerpc64__) || defined(CONFIG_PPC_E500MC) > #define SMPWMB LWSYNC > #else > #define SMPWMB eieio > diff --git a/arch/powerpc/include/asm/synch.h > b/arch/powerpc/include/asm/synch.h > index 63e7f5a1f105..6ec546090ba1 100644 > --- a/arch/powerpc/include/asm/synch.h > +++ b/arch/powerpc/include/asm/synch.h > @@ -6,10 +6,6 @@ > #include > #include > > -#if defined(__powerpc64__) || defined(CONFIG_PPC_E500MC) > -#define __SUBARCH_HAS_LWSYNC > -#endif > - > #ifndef __ASSEMBLY__ > extern unsigned int __start___lwsync_fixup, __stop___lwsync_fixup; > extern void do_lwsync_fixups(unsigned long value, void *fixup_start, > -- > 2.16.1
Re: [PATCH 01/16] initrd: Add generic code path for common initrd unloading logic.
On Wed, Mar 28, 2018 at 2:04 PM, Christoph Hellwig wrote: >> +#ifdef CONFIG_INITRAMFS_GENERIC_UNLOAD >> +void free_initrd_mem(unsigned long start, unsigned long end) >> +{ >> + free_reserved_area((void *)start, (void *)end, -1, "initrd"); >> +} >> +#endif > > Given how trivial this is and how many architectures can use it I'd > reverse the polarity and add a CONFIG_HAVE_ARCH_FREE_INITRD_MEM > instead. And while adding "special" functionality to the generic version, more and more users of CONFIG_HAVE_ARCH_FREE_INITRD_MEM will be removed. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 01/16] initrd: Add generic code path for common initrd unloading logic.
> +#ifdef CONFIG_INITRAMFS_GENERIC_UNLOAD > +void free_initrd_mem(unsigned long start, unsigned long end) > +{ > + free_reserved_area((void *)start, (void *)end, -1, "initrd"); > +} > +#endif Given how trivial this is and how many architectures can use it I'd reverse the polarity and add a CONFIG_HAVE_ARCH_FREE_INITRD_MEM instead.
Re: [PATCH 14/19] powerpc/altivec: Add missing prototypes for altivec
On Wed, Mar 28, 2018 at 9:26 AM, Mathieu Malaterre wrote: > On Tue, Mar 27, 2018 at 7:33 PM, LEROY Christophe > wrote: >> LEROY Christophe wrote: >> >> >>> Mathieu Malaterre wrote: >>> Christophe, On Sat, Mar 24, 2018 at 9:10 PM, LEROY Christophe wrote: > > Mathieu Malaterre wrote: > > >> On Fri, Mar 23, 2018 at 1:19 PM, christophe leroy >> wrote: >>> >>> >>> >>> >>> On 22/03/2018 at 21:20, Mathieu Malaterre wrote: Some function prototypes were missing for the non-altivec code. Add the missing prototypes directly in xor_vmx, fix warnings treated as errors with W=1: arch/powerpc/lib/xor_vmx_glue.c:18:6: error: no previous prototype for ‘xor_altivec_2’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:29:6: error: no previous prototype for ‘xor_altivec_3’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:40:6: error: no previous prototype for ‘xor_altivec_4’ [-Werror=missing-prototypes] arch/powerpc/lib/xor_vmx_glue.c:52:6: error: no previous prototype for ‘xor_altivec_5’ [-Werror=missing-prototypes] Signed-off-by: Mathieu Malaterre --- arch/powerpc/lib/xor_vmx.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/lib/xor_vmx.h b/arch/powerpc/lib/xor_vmx.h index 5c2b0839b179..2173e3c84151 100644 --- a/arch/powerpc/lib/xor_vmx.h +++ b/arch/powerpc/lib/xor_vmx.h @@ -19,3 +19,17 @@ void __xor_altivec_4(unsigned long bytes, unsigned long *v1_in, void __xor_altivec_5(unsigned long bytes, unsigned long *v1_in, unsigned long *v2_in, unsigned long *v3_in, unsigned long *v4_in, unsigned long *v5_in); + +void xor_altivec_2(unsigned long bytes, unsigned long *v1_in, +unsigned long *v2_in); + >>> >>> >>> >>> >>> Only used in one place, should be static instead of adding it in a .h >>> >>> Same for the other ones. >> >> >> >> $ git grep xor_altivec_2 >> [...] >> arch/powerpc/lib/xor_vmx_glue.c:EXPORT_SYMBOL(xor_altivec_2); >> >> Are you sure I can change this function to static ? > > > > Yes you are right.
But in fact those functions are already defined in > asm/xor.h > So you just need to add the missing #include I originally tried it, but this leads to: CC arch/powerpc/lib/xor_vmx_glue.o In file included from arch/powerpc/lib/xor_vmx_glue.c:16:0: ./arch/powerpc/include/asm/xor.h:39:15: error: variable ‘xor_block_altivec’ has initializer but incomplete type static struct xor_block_template xor_block_altivec = { ^~ ./arch/powerpc/include/asm/xor.h:40:2: error: unknown field ‘name’ specified in initializer .name = "altivec", ^ [...] The file (powerpc) is pretty much expected to be included after . I did not want to tweak to test for #ifdef _XOR_H just before #ifdef _XOR_H static struct xor_block_template xor_block_altivec = { [...] since this seems like a hack to me. Is it OK to test for #ifdef _XOR_H in ? >>> >>> What about including linux/raid/xor.h in asm/xor.h ? > > This leads to: > > CALL ../arch/powerpc/kernel/systbl_chk.sh > In file included from ../arch/powerpc/include/asm/xor.h:57:0, > from ../arch/powerpc/lib/xor_vmx_glue.c:17: > ../include/asm-generic/xor.h:688:34: error: ‘xor_block_32regs’ defined > but not used [-Werror=unused-variable] > static struct xor_block_template xor_block_32regs = { > ^~~~ > ../include/asm-generic/xor.h:680:34: error: ‘xor_block_8regs’ defined > but not used [-Werror=unused-variable] > static struct xor_block_template xor_block_8regs = { > ^~~ > In file included from ../arch/powerpc/lib/xor_vmx_glue.c:17:0: > ../arch/powerpc/include/asm/xor.h:39:34: error: ‘xor_block_altivec’ > defined but not used [-Werror=unused-variable] > static struct xor_block_template xor_block_altivec = { > ^ > CALL ../arch/powerpc/kernel/prom_init_check.sh > I'll prepare a patch which moves the prototypes from arch/powerpc/include/asm/xor.h to arch/powerpc/include/asm/xor_altivec.h
Re: RFC on writel and writel_relaxed
On 2018-03-28 02:14, Linus Torvalds wrote: On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya wrote: Basically changing it to dma_buffer->foo = 1; /* WB */ wmb(); writel_relaxed(KICK, DMA_KICK_REGISTER); /* UC */ mmiowb() Why? Why not just remove the wmb(), and keep the barrier in the writel()? Yes, we want to get there indeed. It is because of some arch not implementing writel properly. Maintainers want to play safe. That is why I asked if IA64 and other well known archs follow the strongly ordered rule at this moment like PPC and ARM. Or should we go and inform every arch about this before yanking wmb()? Maintainers are afraid of introducing a regression. The above code makes no sense, and just looks stupid to me. It also generates pointlessly bad code on x86, so it's bad there too. Linus
RE: RFC on writel and writel_relaxed
From: Benjamin Herrenschmidt > Sent: 28 March 2018 10:56 ... > For example, let's say I have a device with a reset bit and the spec > says the reset bit needs to be set for at least 10us. > > This is wrong: > > writel(1, RESET_REG); > usleep(10); > writel(0, RESET_REG); > > Because of write posting, the first write might arrive at the device > right before the second one. > > The typical "fix" is to turn that into: > > writel(1, RESET_REG); > readl(RESET_REG); /* Flush posted writes */ Would a writel(1, RESET_REG) here provide enough synchronisation? > usleep(10); > writel(0, RESET_REG); > > *However* the issue here, at least on power, is that the CPU can issue > that readl but doesn't necessarily wait for it to complete (ie, the > data to return), before proceeding to the usleep. Now a usleep contains > a bunch of loads and stores and is probably fine, but a udelay which > just loops on the timebase may not be. > > Thus we may still violate the timing requirement. I've seen that sort of code (with udelay() and no read back) quite often. How many were in linux I don't know. For small delays I usually fix it by repeated writes (of the same value) to the device register. That can guarantee very short intervals. The only time I've actually seen buffered writes break timing was between a 286 and an 8259 interrupt controller. If you wrote to the mask then enabled interrupts the first IACK cycle could be too close to the write and break the cycle recovery time. That clobbered most of the interrupt controller registers. That probably affected every 286 board ever built! Not sure how much software added the required extra bus cycle. > What we did inside readl, with the twi;isync sequence (which basically > means, trap on return value with "trap never" as a condition, followed > by isync that ensures all exception conditions are resolved), is force > the CPU to "consume" the data from the read before moving on.
> > This effectively makes readl fully synchronous (we would probably avoid > that if we were to implement a readl_relaxed). I've always wondered exactly what the twi;isync were for - always seemed very heavy handed for most mmio reads. Particularly if you are doing mmio reads from a data fifo. Perhaps there should be a writel_wait() that is allowed to do a read back for such code paths? David