Re: perf tools: add support for generating bpf prologue on powerpc
On Thu, 2016-05-05 at 15:23:19 UTC, "Naveen N. Rao" wrote:
> Generalize existing macros to serve the purpose.
>
> Cc: Wang Nan
> Cc: Arnaldo Carvalho de Melo
> Cc: Masami Hiramatsu
> Cc: Ian Munsie
> Cc: Michael Ellerman
> Signed-off-by: Naveen N. Rao
> ---
> With this patch:
> # ./perf test 37
> 37: Test BPF filter :
> 37.1: Test basic BPF filtering : Ok
> 37.2: Test BPF prologue generation : Ok
> 37.3: Test BPF relocation checker : Ok
>
>  tools/perf/arch/powerpc/Makefile          |  1 +
>  tools/perf/arch/powerpc/util/dwarf-regs.c | 40 +--
>  2 files changed, 29 insertions(+), 12 deletions(-)

Looks feasible, and it's in powerpc-only code. Should I take this, or acme?

cheers
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [3/3] powerpc/fadump: add support for fadump_nr_cpus= parameter
On Fri, 2016-06-05 at 11:51:08 UTC, Hari Bathini wrote:
> Kernel parameter 'nr_cpus' can be used to limit the maximum number
> of processors that an SMP kernel could support. This patch extends
> this to fadump by introducing a 'fadump_nr_cpus' parameter that can
> help in booting the fadump kernel with a lower memory footprint.

Is there really no other way to do this? I really hate adding new, single-use command line parameters.

cheers
Re: [2/3] powerpc/fadump: add support to specify memory range based size
On Fri, 2016-06-05 at 11:50:37 UTC, Hari Bathini wrote:
> Currently, memory for fadump can be specified with fadump_reserve_mem=size,
> where only a fixed size can be specified. This patch tries to extend this
> syntax to support conditional reservation based on memory size, with the
> below syntax:
>
>     fadump_reserve_mem=<range>:<size>[,<range>:<size>,...]
>
> This syntax helps using the same command line parameter for different
> system memory sizes.

This is basically using the crashkernel= syntax, right? So can we please reuse the crashkernel= parsing code?

cheers
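For readers unfamiliar with the crashkernel=-style syntax being discussed, the following userspace sketch shows how the "start-end:size[,start-end:size,...]" grammar selects a reservation based on the system RAM size. The helper names (`memparse_simple`, `parse_reserve_mem`) are illustrative stand-ins, not the kernel's; `memparse_simple` only handles K/M/G suffixes.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the kernel's memparse(): number plus optional K/M/G suffix. */
static unsigned long long memparse_simple(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': v <<= 10; /* fall through */
	case 'M': v <<= 10; /* fall through */
	case 'K': v <<= 10; (*end)++; break;
	}
	return v;
}

/*
 * Walk the comma-separated list "start-end:size[,...]" and return the
 * size of the first entry whose [start, end) range contains system_ram,
 * or 0 if no entry matches.  An omitted end ("4G-:512M") is open-ended.
 */
static unsigned long long parse_reserve_mem(const char *cmdline,
					    unsigned long long system_ram)
{
	char *cur = (char *)cmdline;

	do {
		unsigned long long start, end = ~0ULL, size;

		start = memparse_simple(cur, &cur);
		if (*cur != '-')
			return 0;
		cur++;
		if (*cur != ':')	/* an end value is present */
			end = memparse_simple(cur, &cur);
		if (*cur != ':')
			return 0;
		cur++;
		size = memparse_simple(cur, &cur);

		if (system_ram >= start && system_ram < end)
			return size;
	} while (*cur++ == ',');

	return 0;
}
```

With `fadump_reserve_mem=1G-4G:256M,4G-:512M`, a 2GB machine would reserve 256MB and an 8GB machine 512MB, which is why a single command line can serve differently sized systems.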
Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
On Fri, 6 May 2016 16:35:38 +1000 Alexey Kardashevskiy wrote:
> On 05/06/2016 01:05 AM, Alex Williamson wrote:
>> On Thu, 5 May 2016 12:15:46 +0000
>> "Tian, Kevin" wrote:
>>
>>>> From: Yongji Xie [mailto:xyj...@linux.vnet.ibm.com]
>>>> Sent: Thursday, May 05, 2016 7:43 PM
>>>>
>>>> Hi David and Kevin,
>>>>
>>>> On 2016/5/5 17:54, David Laight wrote:
>>>>> From: Tian, Kevin
>>>>> Sent: 05 May 2016 10:37
>>>>> ...
>>>>>> Actually, we are not aimed at accessing the MSI-X table from the
>>>>>> guest. So I think it's safe to pass through the MSI-X table if we
>>>>>> can make sure the guest kernel would not touch the MSI-X table in
>>>>>> a normal code path, such as a para-virtualized guest kernel on PPC64.
>>>>>>
>>>>> Then how do you prevent a malicious guest kernel accessing it?
>>>>> Or a malicious guest driver for an ethernet card setting up
>>>>> the receive buffer ring to contain a single word entry that
>>>>> contains the address associated with an MSI-X interrupt and
>>>>> then using a loopback mode to cause a specific packet to be
>>>>> received that writes the required word through that address.
>>>>>
>>>>> Remember the PCIe cycle for an interrupt is a normal memory write
>>>>> cycle.
>>>>>
>>>>> David
>>>>>
>>>> If we have enough permission to load a malicious driver or
>>>> kernel, we can easily break the guest without an exposed
>>>> MSI-X table.
>>>>
>>>> I think it should be safe to expose the MSI-X table if we can
>>>> make sure that a malicious guest driver/kernel can't use
>>>> the MSI-X table to break another guest or the host. The
>>>> capability of IRQ remapping could provide this kind of protection.
>>>>
>>> With IRQ remapping it doesn't mean you can pass through the MSI-X
>>> structure to the guest. I know actual IRQ remapping might be platform
>>> specific, but at least for the Intel VT-d specification, the MSI-X entry
>>> must be configured with a remappable format by the host kernel, which
>>> contains an index into the IRQ remapping table. The index will find an
>>> IRQ remapping entry which controls interrupt routing for a specific
>>> device. If you allow a malicious program a random index into the MSI-X
>>> entry of an assigned device, the hole is obvious...
>>>
>>> The above might make sense only for an IRQ remapping implementation
>>> which doesn't rely on the extended MSI-X format (e.g. simply based on
>>> BDF). If that's the case for PPC, then you should build MSI-X
>>> passthrough based on this fact instead of on general IRQ remapping
>>> being enabled or not.
>>
>> I don't think anyone is expecting that we can expose the MSI-X vector
>> table to the guest and the guest can make direct use of it. The end
>> goal here is that the guest on a power system is already
>> paravirtualized to not program the device MSI-X by directly writing to
>> the MSI-X vector table. They have hypercalls for this since they
>> always run virtualized. Therefore a) they never intend to touch the
>> MSI-X vector table and b) they have sufficient isolation that a guest
>> can only hurt itself by doing so.
>>
>> On x86 we don't have a); our method of programming the MSI-X vector
>> table is to directly write to it. Therefore we will always require QEMU
>> to place a MemoryRegion over the vector table to intercept those
>> accesses. However with interrupt remapping, we do have b) on x86, which
>> means that we don't need to be so strict in disallowing user accesses
>> to the MSI-X vector table. It's not useful for configuring MSI-X on
>> the device, but the user should only be able to hurt themselves by
>> writing it directly. x86 doesn't really get anything out of this
>> change, but it helps this special case on power pretty significantly
>> aiui. Thanks,
>
> Excellent short overview, saved :)
>
> How do we proceed with these patches? Nobody seems to be objecting to
> them, but also nobody seems to be taking them either...

Well, this series is still based on some non-upstream patches, so...
Once that dependency is resolved, this series should probably be split into functional areas for acceptance by the appropriate subsystem maintainers.
Re: [PATCH v9 22/22] PCI/hotplug: PowerPC PowerNV PCI hotplug driver
On Thu, May 5, 2016 at 7:28 PM, Gavin Shan wrote:
> On Thu, May 05, 2016 at 12:04:49PM -0500, Rob Herring wrote:
>> On Tue, May 3, 2016 at 8:22 AM, Gavin Shan wrote:
>>> This adds a standalone driver to support PCI hotplug for the PowerPC
>>> PowerNV platform that runs on top of skiboot firmware. The firmware
>>> identifies hotpluggable slots and marks their device tree nodes with the
>>> proper "ibm,slot-pluggable" and "ibm,reset-by-firmware" properties. The
>>> driver scans device tree nodes to create/register PCI hotplug slots
>>> accordingly.
>>>
>>> The PCI slots are organized as a tree, which means one
>>> PCI slot might have a parent PCI slot, and a parent PCI slot possibly
>>> contains multiple child PCI slots. At plugging time, the parent
>>> PCI slot is populated before its children. The child PCI slots are
>>> removed before their parent PCI slot can be removed from the system.
>>>
>>> If the skiboot firmware doesn't support slot status retrieval, the PCI
>>> slot device node shouldn't have the property "ibm,reset-by-firmware". In
>>> that case, none of the valid PCI slots will be detected from the device
>>> tree. The skiboot firmware doesn't export the capability to access
>>> attention LEDs yet; it's something TBD.
>>>
>>> Signed-off-by: Gavin Shan
>>> Acked-by: Bjorn Helgaas
>>
>> [...]
>>
>>> +static void pnv_php_handle_poweron(struct pnv_php_slot *php_slot)
>>> +{
>>> +	void *fdt, *fdt1, *dt;
>>> +	int confirm = PNV_PHP_POWER_CONFIRMED_SUCCESS;
>>> +	int ret;
>>> +
>>> +	/* We don't know the FDT blob size. We try to get it through a
>>> +	 * maximal memory chunk and then copy it to another chunk that
>>> +	 * fits the real size.
>>> +	 */
>>> +	fdt1 = kzalloc(0x10000, GFP_KERNEL);
>>> +	if (!fdt1)
>>> +		goto error;
>>> +
>>> +	ret = pnv_pci_get_device_tree(php_slot->dn->phandle, fdt1, 0x10000);
>>> +	if (ret)
>>> +		goto free_fdt1;
>>> +
>>> +	fdt = kzalloc(fdt_totalsize(fdt1), GFP_KERNEL);
>>> +	if (!fdt)
>>> +		goto free_fdt1;
>>> +
>>> +	/* Unflatten device tree blob */
>>> +	memcpy(fdt, fdt1, fdt_totalsize(fdt1));
>>
>> This is wrong. If the size is greater than 64K, then you will be
>> overrunning the fdt1 buffer. You need to fetch the FDT again if it is
>> bigger than 64KB.
>>
> Thanks for the review, Rob. Sorry that I don't see how it's a problem. An
> errcode is returned from pnv_pci_get_device_tree() if the FDT blob
> size is greater than 64K. In this case, memcpy() won't be triggered.
> pnv_pci_get_device_tree() relies on the firmware implementation, which
> avoids overrunning the buffer.

Okay, I missed that pnv_pci_get_device_tree would error out.

> On the other hand, it would be reasonable to retry retrieving the
> FDT blob if a 64K buffer isn't enough. Also, kzalloc() can be replaced
> with alloc_pages() as 64K is the default page size on PPC64. I will
> have something like below until someone has more concerns. As the
> size of the allocated buffer will be greater than the real FDT blob
> size, some memory (not too much) is wasted. I guess it should be ok.
>
>	struct page *page;
>	void *fdt;
>	unsigned int order;
>	int ret;
>
>	for (order = 0; order < MAX_ORDER; order++) {
>		page = alloc_pages(GFP_KERNEL, order);
>		if (page) {
>			fdt = page_address(page);
>			ret = pnv_pci_get_device_tree(php_slot->dn->phandle,
>						      fdt, (1 << order) * PAGE_SIZE);
>			if (ret) {
>				dev_dbg(&php_slot->pdev->dev,
>					"Error %d getting device tree (%d)\n",
>					ret, order);
>				free_pages((unsigned long)fdt, order);
>				continue;
>			}
>		}
>	}

I would allocate a minimal buffer to read the header, get the actual size, then allocate a new buffer. There's no point in looping.
If you know 64KB is the biggest size you should ever see, then how you had it is reasonable, too.

Rob
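Rob's suggested two-step fetch can be sketched in userspace as below. `fake_get_device_tree()` is a hypothetical stand-in for the firmware call (it copies at most `bufsize` bytes of a pretend blob), and `fdt_totalsize_of()` mimics reading the big-endian `totalsize` field at offset 4 of a real FDT header; none of these names come from the patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_BLOB_SIZE 300u

/* Mock firmware call: fills buf with up to bufsize bytes of the blob.
 * The blob's header carries its total size at offset 4, big-endian,
 * like a real struct fdt_header. */
static int fake_get_device_tree(uint8_t *buf, size_t bufsize)
{
	static uint8_t blob[FAKE_BLOB_SIZE];

	blob[4] = 0;
	blob[5] = 0;
	blob[6] = FAKE_BLOB_SIZE >> 8;
	blob[7] = FAKE_BLOB_SIZE & 0xff;
	memcpy(buf, blob, bufsize < sizeof(blob) ? bufsize : sizeof(blob));
	return 0;
}

/* Read the 32-bit big-endian totalsize field from an FDT header. */
static uint32_t fdt_totalsize_of(const uint8_t *hdr)
{
	return ((uint32_t)hdr[4] << 24) | ((uint32_t)hdr[5] << 16) |
	       ((uint32_t)hdr[6] << 8) | hdr[7];
}

/* Two-step fetch: header first for the real size, then an exact-size
 * buffer.  Returns a malloc'ed blob (caller frees) or NULL. */
static uint8_t *fetch_fdt(size_t *size_out)
{
	uint8_t hdr[8];
	uint8_t *fdt;
	uint32_t total;

	if (fake_get_device_tree(hdr, sizeof(hdr)))
		return NULL;

	total = fdt_totalsize_of(hdr);
	fdt = malloc(total);
	if (!fdt)
		return NULL;

	if (fake_get_device_tree(fdt, total)) {
		free(fdt);
		return NULL;
	}
	*size_out = total;
	return fdt;
}
```

The design point is that no guess about a maximum size is baked in: the second allocation is always exactly as large as the blob claims to be, so neither looping over orders nor a 64KB assumption is needed.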
Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
On Fri, May 06, 2016 at 01:33:01PM +0200, Petr Mladek wrote:
> On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
>> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
>> index 782fbb5..b3b8639 100644
>> --- a/kernel/livepatch/patch.c
>> +++ b/kernel/livepatch/patch.c
>> @@ -29,6 +29,7 @@
>>  #include
>>  #include
>>  #include "patch.h"
>> +#include "transition.h"
>>
>>  static LIST_HEAD(klp_ops);
>>
>> @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>>  	ops = container_of(fops, struct klp_ops, fops);
>>
>>  	rcu_read_lock();
>> +
>>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>>  				      stack_node);
>> -	if (WARN_ON_ONCE(!func))
>> +
>> +	if (!func)
>>  		goto unlock;
>>
>> +	/*
>> +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
>> +	 * an explanation of why this read barrier is needed.
>> +	 */
>> +	smp_rmb();
>> +
>> +	if (unlikely(func->transition)) {
>> +
>> +		/*
>> +		 * See the comment for the 1st smp_wmb() in
>> +		 * klp_init_transition() for an explanation of why this read
>> +		 * barrier is needed.
>> +		 */
>> +		smp_rmb();
>
> I would add here:
>
>	WARN_ON_ONCE(current->patch_state == KLP_UNDEFINED);
>
> We do not know in which context this is called, so the printk's are
> not ideal. But it will get triggered only if there is a bug in
> the livepatch implementation. It should happen at random locations
> and rather early when a bug is introduced.
>
> Anyway, better to die and catch the bug than let the system run
> in an undefined state and produce cryptic errors later on.

Ok, makes sense.

>> +		if (current->patch_state == KLP_UNPATCHED) {
>> +			/*
>> +			 * Use the previously patched version of the function.
>> +			 * If no previous patches exist, use the original
>> +			 * function.
>> +			 */
>> +			func = list_entry_rcu(func->stack_node.next,
>> +					      struct klp_func, stack_node);
>> +
>> +			if (&func->stack_node == &ops->func_stack)
>> +				goto unlock;
>> +		}
>> +	}
>
> I have been staring at the code for too long now. I need to step back
> for a while. I'll have another look when you send the next version.
> Anyway, you did great work. I speak mainly for the livepatch part and
> I like it.

Thanks for the helpful reviews! I'll be on vacation again next week, so I get a break too :-)

-- Josh
Canyonlands oops at Shutdown
Getting the following at shutdown with 4.6-rc kernels on a Sam460ex Canyonlands board.

Regards,
Julian

[ 1533.722779] Unable to handle kernel paging request for data at address 0x0128
[ 1533.744309] Faulting instruction address: 0xc026d3c8
[ 1535.763583] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1535.782886] PREEMPT Canyonlands
[ 1535.799805] Modules linked in:
[ 1535.816546] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 4.6.0-rc6-sam460ex-jm #4
[ 1535.838341] task: ea85 ti: ea846000 task.ti: ea846000
[ 1535.857783] NIP: c026d3c8 LR: c0466984 CTR: c001a8ac
[ 1535.876847] REGS: ea847d10 TRAP: 0300 Not tainted (4.6.0-rc6-sam460ex-jm)
[ 1535.898224] MSR: 00029000 CR: 44422284 XER:
[ 1535.918868] DEAR: 0128 ESR:
GPR00: c0466984 ea847dc0 ea85 0108 000f fff0 0007
GPR08: 0001 c0b5a19c ea847de0 28428468 205cfe94 205a946e bfff8a7c
GPR16: 2097d008 2097d018 2097d090 bfff8980 bfff897c 4321fedc
GPR24: 2097d578 c0b3f17c c0b8 c0a95e68 eaa35000 c0b8545c 0108
[ 1536.024665] NIP [c026d3c8] kobject_get+0x18/0x80
[ 1536.044003] LR [c0466984] get_device+0x1c/0x38
[ 1536.063198] Call Trace:
[ 1536.080424] [ea847dc0] [eaa3fa10] 0xeaa3fa10 (unreliable)
[ 1536.100896] [ea847dd0] [c0466984] get_device+0x1c/0x38
[ 1536.121059] [ea847de0] [c04689f4] device_shutdown+0x58/0x178
[ 1536.141727] [ea847e10] [c003c280] kernel_halt+0x38/0x64
[ 1536.161875] [ea847e20] [c003c4cc] SyS_reboot+0x140/0x1b0
[ 1536.182057] [ea847f40] [c000ad80] ret_from_syscall+0x0/0x3c
[ 1536.202568] --- interrupt: c01 at 0x203d1fbc
[ 1536.202568]     LR = 0x2058f878
[ 1536.239761] Instruction dump:
[ 1536.257358] 4b9c 7fa3eb78 484cad91 39610020 7fe3fb78 4bda3d74 9421fff0 7c0802a6
[ 1536.280308] 93e1000c 7c7f1b79 90010014 41820060 <813f0020> 2f89 41bc001c 809f
[ 1536.303611] ---[ end trace f5a63492b41c62f2 ]---
[ 1536.323546]
[ 1537.340293] note: systemd-shutdow[1] exited with preempt_count 1
[ 1537.363602] Kernel panic - not syncing: Attempted to kill init!
[ 1537.363602] exitcode=0x000b
[ 1537.404665] Rebooting in 180 seconds..

U-Boot 2015.a (May 16 2015 - 14:20:11)

CPU: AMCC PowerPC 460EX Rev. B at 1155 MHz (PLB=231 OPB=115 EBC=115)
     No Security/Kasumi support
     Bootstrap Option H - Boot ROM Location I2C (Addr 0x52)
     Internal PCI arbiter enabled
     32 kB I-Cache 32 kB D-Cache
Board: Sam460ex/cr, PCIe 4x + SATA-2
I2C:   ready
DRAM:  ddr2_boost enabled, level 3
       2 GiB (ECC not enabled, 462 MHz, CL4)
PCI:   Bus Dev VenId DevId Class Int
        00  04  1095  3512  0104  00
        00  06  126f  0501  0380  00
PCIE1: successfully set as root-complex
        02  00  1002  683f  0300  ff
Net:   ppc_4xx_eth0
FPGA:  Revision 03 (2010-10-07)
SM502: found
PERMD2: not found
VGA:   1
VESA:  OK
[0.00] Using Canyonlands machine description
[0.00] Linux version 4.6.0-rc6-sam460ex-jm (root@julian-VirtualBox) (gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2) ) #4 PREEMPT Fri May 6 07:54:21 AST 2016
[0.00] Zone ranges:
[0.00]   DMA      [mem 0x-0x2fff]
[0.00]   Normal   empty
[0.00]   HighMem  [mem 0x3000-0x7fff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node 0: [mem 0x-0x7fff]
[0.00] Initmem setup node 0 [mem 0x-0x7fff]
[0.00] MMU: Allocated 1088 bytes of context maps for 255 contexts
[0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 522752
[0.00] Kernel command line: root=/dev/sda6 console=ttyS0,115200 console=tty0
[0.00] PID hash table entries: 4096 (order: 2, 16384 bytes)
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Sorting __ex_table...
[0.00] Memory: 2001664K/2097152K available (7416K kernel code, 316K rwdata, 3808K rodata, 240K init, 370K bss, 95488K reserved, 0K cma-reserved, 1310720K highmem)
[0.00] Kernel virtual memory layout:
[0.00]   * 0xfffcf000..0xf000 : fixmap
[0.00]   * 0xffc0..0xffe0 : highmem PTEs
[0.00]   * 0xffa0..0xffc0 : consistent mem
[0.00]   * 0xffa0..0xffa0 : early ioremap
[0.00]   * 0xf100..0xffa0 : vmalloc & ioremap
[0.00] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] Preemptible hierarchical RCU implementation.
[0.00] Build-time adjustment of leaf fanout to 32.
[0.00] NR_IRQS:512 nr_irqs:512 16
[0.00] UIC0 (32 IRQ sources) at DCR 0xc0
[0.00] UIC1 (32 IRQ sources) at DCR 0xd0
[0.00] UIC2 (32 IRQ sources) at DCR 0xe0
[0.00] UIC3 (32 IRQ sources) at DCR 0xf0
[
Re: klp_task_patch: was: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
On Thu, May 05, 2016 at 01:57:01PM +0200, Petr Mladek wrote:
> I have missed that the two commands are called with preemption
> disabled. So, I had the following crazy scenario in mind:
>
>
> CPU0				CPU1
>
> klp_enable_patch()
>
>   klp_target_state = KLP_PATCHED;
>
>   for_each_task()
>     set TIF_PENDING_PATCH
>
>				# task 123
>
>				if (klp_patch_pending(current))
>				  klp_patch_task(current)
>
>				    clear TIF_PENDING_PATCH
>
>				    smp_rmb();
>
>				    # switch to assembly of
>				    # klp_patch_task()
>
>				    mov klp_target_state, %r12
>
>				    # interrupt and schedule
>				    # another task
>
>   klp_reverse_transition();
>
>     klp_target_state = KLP_UNPATCHED;
>
>     klp_try_to_complete_transition()
>
>       task = 123;
>       if (task->patch_state == klp_target_state)
>         return 0;
>
>     => task 123 is in the target state and does
>        not block the conversion
>
>   klp_complete_transition()
>
>
>   # disable previous patch on the stack
>   klp_disable_patch();
>
>     klp_target_state = KLP_UNPATCHED;
>
>				    # task 123 gets scheduled again
>				    lea %r12, task->patch_state
>
>				    => it happily stores an outdated
>				       state

Thanks for the clear explanation, this helps a lot.

> This is why the two functions should get called with preemption
> disabled. We should document it at least. I imagine that we will
> use them later also in another context and nobody will remember
> this crazy scenario.
>
> Well, even disabled preemption does not help. The process on
> CPU1 might also be interrupted by an NMI and do some long
> printk in it.
>
> IMHO, the only safe approach is to call klp_patch_task()
> only for "current" in a safe place. Then this race is harmless.
> The switch happens in a safe place, so that it does not matter
> into which state the process is switched.

I'm not sure about this solution. When klp_complete_transition() is called, we need all tasks to be patched, for good. We don't want any of them to randomly switch to the wrong state at some later time in the middle of a future patch operation.
How would changing klp_patch_task() to only use "current" prevent that?

> In other words, the task state might be updated only
>
>    + by the task itself in a safe place
>    + by another task when the updated one is sleeping in a safe place
>
> This should be well documented and the API should help to avoid
> a misuse.

I think we could fix it to be safe for future callers who might not have preemption disabled with a couple of changes to klp_patch_task(): disabling preemption and testing/clearing the TIF_PATCH_PENDING flag before changing the patch state:

void klp_patch_task(struct task_struct *task)
{
	preempt_disable();

	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
		task->patch_state = READ_ONCE(klp_target_state);

	preempt_enable();
}

We would also need a synchronize_sched() after the patching is complete, either at the end of klp_try_complete_transition() or in klp_complete_transition(). That would make sure that all existing calls to klp_patch_task() are done.

-- Josh
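The ordering argument above can be modeled in userspace with C11 atomics. This is only a sketch of the idea Josh proposes, not the kernel code: the pending flag is atomically tested-and-cleared *before* the task's state is copied from the global target, so a reversal that re-sets the flag causes the task to be revisited rather than left with a stale state. The `struct task` fields and enum values here are illustrative stand-ins for `TIF_PATCH_PENDING` and `task->patch_state`.

```c
#include <assert.h>
#include <stdatomic.h>

enum { KLP_UNPATCHED, KLP_PATCHED };

struct task {
	atomic_int pending;	/* models the TIF_PATCH_PENDING bit */
	int patch_state;	/* models task->patch_state          */
};

static atomic_int klp_target_state = KLP_PATCHED;

/* Clear the flag first, then read the target state: if a concurrent
 * reversal changes klp_target_state, it also re-sets pending, so the
 * task will be switched again rather than left stale. */
static void klp_patch_task(struct task *t)
{
	if (atomic_exchange(&t->pending, 0))
		t->patch_state = atomic_load(&klp_target_state);
}
```

A second call with the flag already clear is a no-op, which is exactly the property that makes the later synchronize_sched() sufficient to flush in-flight calls.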
[PATCH 3/3] powerpc/fadump: add support for fadump_nr_cpus= parameter
Kernel parameter 'nr_cpus' can be used to limit the maximum number of processors that an SMP kernel could support. This patch extends this to fadump by introducing a 'fadump_nr_cpus' parameter that can help in booting the fadump kernel with a lower memory footprint.

Suggested-by: Mahesh Salgaonkar
Signed-off-by: Hari Bathini
---
 arch/powerpc/kernel/fadump.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index a7fef3e..c75783c 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -470,6 +470,28 @@ static int __init early_fadump_param(char *p)
 }
 early_param("fadump", early_fadump_param);
 
+/* Look for fadump_nr_cpus= cmdline option. */
+static int __init early_fadump_nrcpus(char *p)
+{
+	int nr_cpus;
+
+	/*
+	 * The fadump_nr_cpus parameter is only applicable on a
+	 * fadump active kernel. This is to reduce the memory
+	 * needed to boot a fadump active kernel.
+	 * So, check if we are booting after a crash.
+	 */
+	if (!is_fadump_active())
+		return 0;
+
+	get_option(&p, &nr_cpus);
+	if (nr_cpus > 0 && nr_cpus < nr_cpu_ids)
+		nr_cpu_ids = nr_cpus;
+
+	return 0;
+}
+early_param("fadump_nr_cpus", early_fadump_nrcpus);
+
 static void register_fw_dump(struct fadump_mem_struct *fdm)
 {
 	int rc;
[PATCH 2/3] powerpc/fadump: add support to specify memory range based size
Currently, memory for fadump can be specified with fadump_reserve_mem=size, where only a fixed size can be specified. This patch tries to extend this syntax to support conditional reservation based on memory size, with the below syntax:

    fadump_reserve_mem=<range>:<size>[,<range>:<size>,...]

This syntax helps using the same command line parameter for different system memory sizes.

Signed-off-by: Hari Bathini
---
 arch/powerpc/kernel/fadump.c | 127 +++---
 1 file changed, 118 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d0af58b..a7fef3e 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -193,6 +193,121 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	return addr;
 }
 
+#define FADUMP_MEM_CMDLINE_PREFIX	"fadump_reserve_mem="
+
+static __init char *get_last_fadump_reserve_mem(void)
+{
+	char *p = boot_command_line, *fadump_cmdline = NULL;
+
+	/* find fadump_reserve_mem and use the last one if there are more */
+	p = strstr(p, FADUMP_MEM_CMDLINE_PREFIX);
+	while (p) {
+		fadump_cmdline = p;
+		p = strstr(p+1, FADUMP_MEM_CMDLINE_PREFIX);
+	}
+
+	return fadump_cmdline;
+}
+
+#define parse_fadump_print(fmt, arg...) \
+	printk(KERN_INFO "fadump_reserve_mem: " fmt, ##arg)
+
+/*
+ * This function parses command line for fadump_reserve_mem=
+ *
+ * Supports the below two syntaxes:
+ *    1. fadump_reserve_mem=size
+ *    2. fadump_reserve_mem=ramsize-range:size[,...]
+ *
+ * Sets fw_dump.reserve_bootvar with the memory size
+ * provided, 0 otherwise
+ *
+ * The function returns -EINVAL on failure, 0 otherwise.
+ */
+static int __init parse_fadump_reserve_mem(void)
+{
+	char *cur, *tmp;
+	char *first_colon, *first_space;
+	char *fadump_cmdline;
+	unsigned long long system_ram;
+
+	fw_dump.reserve_bootvar = 0;
+	fadump_cmdline = get_last_fadump_reserve_mem();
+
+	/* when no fadump_reserve_mem= cmdline option is provided */
+	if (!fadump_cmdline)
+		return 0;
+
+	first_colon = strchr(fadump_cmdline, ':');
+	first_space = strchr(fadump_cmdline, ' ');
+	cur = fadump_cmdline + strlen(FADUMP_MEM_CMDLINE_PREFIX);
+
+	/* for fadump_reserve_mem=size cmdline syntax */
+	if (!first_colon || (first_space && (first_colon > first_space))) {
+		fw_dump.reserve_bootvar = memparse(cur, &tmp);
+		return 0;
+	}
+
+	/* for fadump_reserve_mem=ramsize-range:size[,...] cmdline syntax */
+	system_ram = memblock_phys_mem_size();
+	/* for each entry of the comma-separated list */
+	do {
+		unsigned long long start, end = ULLONG_MAX, size;
+
+		/* get the start of the range */
+		start = memparse(cur, &tmp);
+		if (cur == tmp) {
+			parse_fadump_print("Memory value expected\n");
+			return -EINVAL;
+		}
+		cur = tmp;
+		if (*cur != '-') {
+			parse_fadump_print("'-' expected\n");
+			return -EINVAL;
+		}
+		cur++;
+
+		/* if no ':' is here, then we read the end */
+		if (*cur != ':') {
+			end = memparse(cur, &tmp);
+			if (cur == tmp) {
+				parse_fadump_print("Memory value expected\n");
+				return -EINVAL;
+			}
+			cur = tmp;
+			if (end <= start) {
+				parse_fadump_print("end <= start\n");
+				return -EINVAL;
+			}
+		}
+
+		if (*cur != ':') {
+			parse_fadump_print("':' expected\n");
+			return -EINVAL;
+		}
+		cur++;
+
+		size = memparse(cur, &tmp);
+		if (cur == tmp) {
+			parse_fadump_print("Memory value expected\n");
+			return -EINVAL;
+		}
+		cur = tmp;
+		if (size >= system_ram) {
+			parse_fadump_print("invalid size\n");
+			return -EINVAL;
+		}
+
+		/* match ? */
+		if (system_ram >= start && system_ram < end) {
+			fw_dump.reserve_bootvar = size;
+			break;
+		}
+	} while (*cur++ == ',');
+
+	return 0;
+}
+
 /**
  * fadump_calculate_reserve_size(): reserve variable boot area 5% of System RAM
  *
@@ -212,6 +327,9 @@ static inline unsigned long fadump_calculate_reserve_size(void)
 {
 	unsigned long size;
 
+	/* sets fw_dump.reserve_bootvar */
+
[PATCH 1/3] powerpc/fadump: set an upper limit for the default memory reserved for fadump
When the boot memory size for fadump is not specified, memory is reserved for fadump based on the system RAM size. As the system RAM size increases, the memory reserved for fadump increases as well. This patch sets an upper limit on the memory reserved for fadump, to avoid reserving excess memory.

Signed-off-by: Hari Bathini
---
 arch/powerpc/include/asm/fadump.h |    6 ++
 arch/powerpc/kernel/fadump.c      |    4 
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index b4407d0..2c3cb32 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -43,6 +43,12 @@
 #define MIN_BOOT_MEM	(((RMA_END < (0x1UL << 28)) ? (0x1UL << 28) : RMA_END) \
 			+ (0x1UL << 26))
 
+/*
+ * Maximum memory needed for fadump to boot up successfully. Use this as
+ * an upper limit for fadump so we don't end up reserving excess memory.
+ */
+#define MAX_BOOT_MEM	(0x1UL << 32)
+
 #define memblock_num_regions(memblock_type)	(memblock.memblock_type.cnt)
 
 #ifndef ELF_CORE_EFLAGS
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3cb3b02a..d0af58b 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -225,6 +225,10 @@ static inline unsigned long fadump_calculate_reserve_size(void)
 	/* round it down in multiples of 256 */
 	size = size & ~0x0FFFFFFFUL;
 
+	/* Set an upper limit on the memory to be reserved */
+	if (size > MAX_BOOT_MEM)
+		size = MAX_BOOT_MEM;
+
 	/* Truncate to memory_limit. We don't want to over reserve the memory.*/
 	if (memory_limit && size > memory_limit)
 		size = memory_limit;
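The sizing rule the patch applies can be sketched as a small userspace model: round the candidate reservation down to a 256MB multiple, then cap it at a 4GB MAX_BOOT_MEM. This assumes a 64-bit `unsigned long` (as on PPC64); the function name is illustrative, not from the patch.

```c
#include <assert.h>

#define MAX_BOOT_MEM (0x1UL << 32)	/* 4GB upper limit, per the patch */

static unsigned long clamp_reserve_size(unsigned long size)
{
	/* round it down in multiples of 256MB */
	size = size & ~0x0FFFFFFFUL;

	/* set an upper limit on the memory to be reserved */
	if (size > MAX_BOOT_MEM)
		size = MAX_BOOT_MEM;

	return size;
}
```

So a 5%-of-RAM estimate of 304MB becomes a 256MB reservation, and on a very large machine the reservation stops growing at 4GB instead of scaling with RAM.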
Re: [PATCH V10 00/28] Add new powerpc specific ELF core notes
On Tue, 2016-02-16 at 14:29 +0530, Anshuman Khandual wrote:
> This patch series adds twelve new ELF core note sections which can
> be used with the existing ptrace requests PTRACE_GETREGSET/SETREGSET
> for accessing various transactional memory and other miscellaneous
> debug register sets on the powerpc platform.
>
> Test Result (All tests pass on both BE and LE)
> ----------------------------------------------
> ptrace-ebb		PASS
> ptrace-gpr		PASS
> ptrace-tm-gpr		PASS
> ptrace-tm-spd-gpr	PASS
> ptrace-tar		PASS
> ptrace-tm-tar		PASS
> ptrace-tm-spd-tar	PASS
> ptrace-vsx		PASS
> ptrace-tm-vsx		PASS
> ptrace-tm-spd-vsx	PASS
> ptrace-tm-spr		PASS

How are you building the tests? On BE I get:

In file included from ptrace-tm-gpr.c:12:0:
ptrace-tm-gpr.c: In function ‘trace_tm_gpr’:
In file included from ptrace.h:31:0,
                 from ptrace-tm-vsx.c:11:
ptrace-tm-vsx.c: In function ‘ptrace_tm_vsx’:
ptrace-gpr.h:20:19: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define FPR_2_REP 0x3f60624de000
                   ^
ptrace-tm-gpr.c:209:26: note: in expansion of macro ‘FPR_2_REP’
  ret = validate_fpr(fpr, FPR_2_REP);
                          ^
ptrace-tm-vsx.c:150:46: error: ‘PPC_FEATURE2_HTM’ undeclared (first use in this function)
  SKIP_IF(!((long)get_auxv_entry(AT_HWCAP2) & PPC_FEATURE2_HTM));
                                              ^
/home/kerkins/workspace/kernel-build-selftests/arch/powerpc/compiler/gcc_ubuntu_be/linux/tools/testing/selftests/powerpc/utils.h:49:7: note: in definition of macro ‘SKIP_IF’
   if ((x)) { \
       ^
ptrace-gpr.h:19:19: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define FPR_1_REP 0x3f50624de000
                   ^
ptrace-tm-gpr.c:217:26: note: in expansion of macro ‘FPR_1_REP’
  ret = validate_fpr(fpr, FPR_1_REP);
                          ^
ptrace-tm-vsx.c:150:46: note: each undeclared identifier is reported only once for each function it appears in
  SKIP_IF(!((long)get_auxv_entry(AT_HWCAP2) & PPC_FEATURE2_HTM));
                                              ^
/home/kerkins/workspace/kernel-build-selftests/arch/powerpc/compiler/gcc_ubuntu_be/linux/tools/testing/selftests/powerpc/utils.h:49:7: note: in definition of macro ‘SKIP_IF’
   if ((x)) { \
       ^
ptrace-gpr.h:21:19: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define FPR_3_REP 0x3f689374c000
                   ^
ptrace-tm-gpr.c:233:30: note: in expansion of macro ‘FPR_3_REP’
  ret = write_ckpt_fpr(child, FPR_3_REP);
                              ^
In file included from ptrace.h:31:0,
                 from ptrace-tm-gpr.c:11:
ptrace-tm-gpr.c: In function ‘ptrace_tm_gpr’:
ptrace-tm-gpr.c:249:46: error: ‘PPC_FEATURE2_HTM’ undeclared (first use in this function)
  SKIP_IF(!((long)get_auxv_entry(AT_HWCAP2) & PPC_FEATURE2_HTM));
                                              ^
/home/kerkins/workspace/kernel-build-selftests/arch/powerpc/compiler/gcc_ubuntu_be/linux/tools/testing/selftests/powerpc/utils.h:49:7: note: in definition of macro ‘SKIP_IF’
   if ((x)) { \
       ^
ptrace-tm-gpr.c:249:46: note: each undeclared identifier is reported only once for each function it appears in
  SKIP_IF(!((long)get_auxv_entry(AT_HWCAP2) & PPC_FEATURE2_HTM));
                                              ^
/home/kerkins/workspace/kernel-build-selftests/arch/powerpc/compiler/gcc_ubuntu_be/linux/tools/testing/selftests/powerpc/utils.h:49:7: note: in definition of macro ‘SKIP_IF’
   if ((x)) { \
       ^
cc1: all warnings being treated as errors
In file included from ../pmu/ebb/ebb.h:12:0,
                 from ptrace-ebb.c:11:
ptrace-ebb.h: In function ‘reset_ebb_with_clear_mask’:
../pmu/ebb/../../reg.h:49:31: error: left shift count >= width of type [-Werror=shift-count-overflow]
 #define BESCR_PME (0x1ul << 32) /* PMU Event-based exception Enable */
                               ^
../pmu/ebb/../../reg.h:16:60: note: in definition of macro ‘mtspr’
      : "r" ((unsigned long)(v)) \
                                                            ^
ptrace-ebb.h:73:28: note: in expansion of macro ‘BESCR_PME’
  mtspr(SPRN_BESCRS, BESCR_PME);
                            ^
In file included from ptrace-tm-tar.c:12:0:
ptrace-tm-tar.c: In function ‘tm_tar’:
ptrace-tar.h:24:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define PPR_4 0x10	/* or 2,2,2 */
                 ^
ptrace-tm-tar.c:78:45: note: in expansion of macro ‘PPR_4’
  ret = validate_tar_registers(regs, TAR_4, PPR_4, DSCR_4);
                                             ^
In file included from
Re: [RFC PATCH v2 17/18] livepatch: change to a per-task consistency model
On Thu 2016-04-28 15:44:48, Josh Poimboeuf wrote:
> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 782fbb5..b3b8639 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include
>  #include
>  #include "patch.h"
> +#include "transition.h"
>
>  static LIST_HEAD(klp_ops);
>
> @@ -58,11 +59,42 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  	ops = container_of(fops, struct klp_ops, fops);
>
>  	rcu_read_lock();
> +
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>  				      stack_node);
> -	if (WARN_ON_ONCE(!func))
> +
> +	if (!func)
>  		goto unlock;
>
> +	/*
> +	 * See the comment for the 2nd smp_wmb() in klp_init_transition() for
> +	 * an explanation of why this read barrier is needed.
> +	 */
> +	smp_rmb();
> +
> +	if (unlikely(func->transition)) {
> +
> +		/*
> +		 * See the comment for the 1st smp_wmb() in
> +		 * klp_init_transition() for an explanation of why this read
> +		 * barrier is needed.
> +		 */
> +		smp_rmb();

I would add here:

	WARN_ON_ONCE(current->patch_state == KLP_UNDEFINED);

We do not know in which context this is called, so the printk's are not ideal. But it will get triggered only if there is a bug in the livepatch implementation. It should happen at random locations and rather early when a bug is introduced.

Anyway, better to die and catch the bug than let the system run in an undefined state and produce cryptic errors later on.

> +		if (current->patch_state == KLP_UNPATCHED) {
> +			/*
> +			 * Use the previously patched version of the function.
> +			 * If no previous patches exist, use the original
> +			 * function.
> +			 */
> +			func = list_entry_rcu(func->stack_node.next,
> +					      struct klp_func, stack_node);
> +
> +			if (&func->stack_node == &ops->func_stack)
> +				goto unlock;
> +		}
> +	}

I have been staring at the code for too long now. I need to step back for a while. I'll have another look when you send the next version. Anyway, you did great work. I speak mainly for the livepatch part and I like it.

Best Regards,
Petr
Re: [PATCH v9 04/22] powerpc/powernv: Increase PE# capacity
On Fri, May 06, 2016 at 05:17:25PM +1000, Alexey Kardashevskiy wrote:
>On 05/03/2016 11:22 PM, Gavin Shan wrote:
>>Each PHB maintains an array helping to translate 2-bytes Request
>>ID (RID) to PE# with the assumption that PE# takes one byte, meaning
>>that we can't have more than 256 PEs. However, pci_dn->pe_number
>>already had 4-bytes for the PE#.
>
>Can you possibly have more than 256 PEs? Or exactly 256? What patch in this
>series makes use of it?
>
>I probably asked but do not remember the answer :)
>
>Looks like waste of memory - you only used a small fraction of
>pe_rmap[0x10000] and now the waste is quadrupled.
>

The PE capacities on different hardware are different, as below, so we're going to support 16-bit PE numbers in the near future. That means the elements in the array need to be "unsigned short" at least, and 2 pages (2 * 64KB) will be reserved for it.

P7IOC: 127, PHB3: 256, PHB4: 65536, NPU1: 4, NPU2: 16

I agree some memory is wasted, and the wasted amount depends on the PCI topology. No memory is wasted if 256 buses show up on one particular PHB; the fewer buses a PHB has, the more memory is wasted. As I explained before, the total used memory is 4 pages (4 * 64KB). Considering the memory capacity on PPC64 (especially PowerNV), I guess it's fine. Note that the memory is allocated from memblock together with the PHB instance.

The alternative solution (to avoid wasting memory) would be to search for the PE number according to the input BDFN through the PE list maintained in each PHB. Obviously, it would require more logic and more CPU cycles. So it's a kind of trade-off. If you really want to see this, I absolutely can do it in the next revision. Another option would be to improve it later and keep the code as what we have. Please share your thoughts.

>
>>
>>This extends the PE# capacity for every PHB. After that, the PE number
>>is represented by 4-bytes value. Then we can reuse IODA_INVALID_PE to
>>check the PE# in phb->pe_rmap[] is valid or not.
>
>Looks like using IODA_INVALID_PE is the only reason for this patch.
>

For now, yes. In the near future, it needs to be extended to represent 16-bit PE numbers for PHB4, as I explained above.

>
>>
>>Signed-off-by: Gavin Shan
>>Reviewed-by: Daniel Axtens
>>---
>> arch/powerpc/platforms/powernv/pci-ioda.c | 6 +-
>> arch/powerpc/platforms/powernv/pci.h | 7 ++-
>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
>>b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index cbd4c0b..cf96cb5 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -768,7 +768,7 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb,
>>struct pnv_ioda_pe *pe)
>>
>> 	/* Clear the reverse map */
>> 	for (rid = pe->rid; rid < rid_end; rid++)
>>-		phb->ioda.pe_rmap[rid] = 0;
>>+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>
>> 	/* Release from all parents PELT-V */
>> 	while (parent) {
>>@@ -3406,6 +3406,10 @@ static void __init pnv_pci_init_ioda_phb(struct
>>device_node *np,
>> 	if (prop32)
>> 		phb->ioda.reserved_pe_idx = be32_to_cpup(prop32);
>>
>>+	/* Invalidate RID to PE# mapping */
>>+	for (segno = 0; segno < ARRAY_SIZE(phb->ioda.pe_rmap); segno++)
>>+		phb->ioda.pe_rmap[segno] = IODA_INVALID_PE;
>>+
>> 	/* Parse 64-bit MMIO range */
>> 	pnv_ioda_parse_m64_window(phb);
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci.h
>>b/arch/powerpc/platforms/powernv/pci.h
>>index 904f60b..80f5326 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -156,11 +156,8 @@ struct pnv_phb {
>> 	struct list_head	pe_list;
>> 	struct mutex		pe_list_mutex;
>>
>>-	/* Reverse map of PEs, will have to extend if
>>-	 * we are to support more than 256 PEs, indexed
>>-	 * bus { bus, devfn }
>>-	 */
>>-	unsigned char		pe_rmap[0x10000];
>>+	/* Reverse map of PEs, indexed by {bus, devfn} */
>>+	unsigned int		pe_rmap[0x10000];
>>
>> 	/* TCE cache invalidate registers (physical and
>> 	 * remapped)
Re: [PATCH 0/9] FP/VEC/VSX switching optimisations
On 2016/05/05 05:32PM, Naveen N Rao wrote: > On 2016/02/29 05:53PM, Cyril Bur wrote: > > Cover-letter for V1 of the series is at > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-November/136350.html > > > > Cover-letter for V2 of the series is at > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-January/138054.html > > > > Changes in V3: > > Addressed review comments from Michael Neuling > > - Made commit message in 4/9 better reflect the patch > > - Removed overuse of #ifdef blocks and redundant condition in 5/9 > > - Split 6/8 in two to better prepare for 7,8,9 > > - Removed #ifdefs in 6/9 > > > > Changes in V4: > > - Addressed non ABI compliant ASM macros in 1/9 > > - Fixed build breakage due to changing #ifdefs in V3 (6/9) > > - Reordered some conditions in if statements > > > > Changes in V5: > > - Enhanced basic-asm.h to provide ABI independent macro as pointed out by > >Naveen Rao. > >- Tested for both BE and LE builds. Had to disable -flto from the > > selftests/powerpc Makefile as it didn't play well with the custom ASM. > > - Added some extra debugging output to the vmx_signal testcase > > - Fixed comments in testing code > > - Updated VSX test code to use GCC Altivec macros > > > > Changes in V6: > > - Removed recursive definition of CFLAGS in math/Makefile > > - Corrected the use of the word param in favour of doubleword > > - Reordered some code in basic-asm.h and neatened some comments > > This series is resulting in a kernel crash with one of the perf tests. > To reproduce, build perf and run the test for breakpoint overflow signal > handler. 
> > # ./perf test -v 17 > 17: Test breakpoint overflow signal handler : > --- start --- > test child forked, pid 3753 > failed opening event 0 > failed opening event 0 > cpu 0xd: Vector: 600 (Alignment) at [c000edd738c0] > pc: c000a818: save_fpu+0xa8/0x2ac > lr: c001568c: __giveup_fpu+0x2c/0x90 > sp: c000edd73b40 >msr: 8280b033 >dar: c000edc436e0 > dsisr: 4200 > current = 0xc000edc42c00 > paca= 0xc7e82700 softe: 0irq_happened: 0x01 > pid = 3753, comm = perf > Linux version 4.6.0-rc3-nnr+ (root@rhel71le) (gcc version 4.8.3 20140911 (Red > Hat 4.8.3-8) (GCC) ) #93 SMP Wed May 4 22:01:06 IST 2016 > enter ? for help > [link register ] c001568c __giveup_fpu+0x2c/0x90 > [c000edd73b40] (unreliable) > [c000edd73b70] c0015730 giveup_fpu+0x40/0xa0 > [c000edd73ba0] c0015810 flush_fp_to_thread+0x80/0x90 > [c000edd73bd0] c0026b3c setup_sigcontext.constprop.3+0xbc/0x1f0 > [c000edd73c30] c00274c4 handle_rt_signal64+0x3b4/0x7c0 > [c000edd73d10] c0017ee0 do_signal+0x150/0x2b0 > [c000edd73e00] c0018220 do_notify_resume+0xd0/0x110 > [c000edd73e30] c0009844 ret_from_except_lite+0x70/0x74 > --- Exception: 900 (Decrementer) at 100b3c88 > SP (3fffd08cfb20) is in userspace > d:mon> ls save_fpu > save_fpu: c000a770 > > With v4.5, the test would fail, but not cause what looks to be an > alignment exception. 
xmon couldn't decode the instructions:

d:mon>
c000a810  38800000  li      r4,0
c000a814  f0000250  .long 0xf0000250
c000a818  7c062798  .long 0x7c062798
c000a81c  f0000250  .long 0xf0000250
c000a820  38800010  li      r4,16
c000a824  f0210a50  .long 0xf0210a50
c000a828  7c262798  .long 0x7c262798
c000a82c  f0210a50  .long 0xf0210a50
c000a830  38800020  li      r4,32
c000a834  f0421250  .long 0xf0421250

However, with objdump, the instructions look to be ok:

c000a810:  00 00 80 38  li      r4,0
c000a814:  50 02 00 f0  xxswapd vs0,vs0
c000a818:  98 27 06 7c  stxvd2x vs0,r6,r4
c000a81c:  50 02 00 f0  xxswapd vs0,vs0
c000a820:  10 00 80 38  li      r4,16
c000a824:  50 0a 21 f0  xxswapd vs1,vs1
c000a828:  98 27 26 7c  stxvd2x vs1,r6,r4
c000a82c:  50 0a 21 f0  xxswapd vs1,vs1

I saw this on a LE vm on Power7 and that looks to be the issue, since a BE vm does not show this. I'm attaching the .config in case it helps.

- Naveen

#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.6.0-rc3 Kernel Configuration
#
CONFIG_PPC64=y

#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
CONFIG_POWER7_CPU=y
# CONFIG_POWER8_CPU is not set
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
CONFIG_PPC_ICSWX=y
# CONFIG_PPC_ICSWX_PID is not set
# CONFIG_PPC_ICSWX_USE_SIGILL is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2048
CONFIG_PPC_DOORBELL=y
# CONFIG_CPU_BIG_ENDIAN
[GIT PULL] Please pull powerpc/linux.git powerpc-4.6-5 tag
Hi Linus,

Please pull one powerpc fix for 4.6:

The following changes since commit d701cca6744fe0d67c86346dcfc9b128b17b5045:

  powerpc: wire up preadv2 and pwritev2 syscalls (2016-04-27 16:47:55 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.6-5

for you to fetch changes up to b4c112114aab9aff5ed4568ca5e662bb02cdfe74:

  powerpc: Fix bad inline asm constraint in create_zero_mask() (2016-05-02 11:10:25 +1000)

powerpc fixes for 4.6 #4

- Fix bad inline asm constraint in create_zero_mask() from Anton Blanchard

Anton Blanchard (1):
  powerpc: Fix bad inline asm constraint in create_zero_mask()

 arch/powerpc/include/asm/word-at-a-time.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH v2] cxl: Add kernel API to allow a context to operate with relocate disabled
From: Ian Munsie

cxl devices typically access memory using an MMU in much the same way as the CPU, and each context includes a state register much like the MSR in the CPU. Like the CPU, the state register includes a bit to enable relocation, which we currently always enable.

In some cases, it may be desirable to allow a device to access memory using real addresses instead of effective addresses, so this adds a new API, cxl_set_translation_mode, that can be used to disable relocation on a given kernel context. This can allow for the creation of a special privileged context that the device can use if it needs relocation disabled, and it can use regular contexts at times when it needs relocation enabled.

This interface is only available to users of the kernel API for obvious reasons, and will never be supported in a virtualised environment. This will be used by the upcoming cxl support in the mlx5 driver.

Signed-off-by: Ian Munsie
---
Changes since v1:
- Changed API to use a dedicated cxl_set_translation_mode() call instead of adding an extra parameter to cxl_start_context2() based on review feedback from Frederic Barrat
- Changed error code for attempting to use in PowerVM environment to -EPERM

 drivers/misc/cxl/api.c    | 19 +++
 drivers/misc/cxl/cxl.h    | 1 +
 drivers/misc/cxl/guest.c  | 3 +++
 drivers/misc/cxl/native.c | 5 +++--
 include/misc/cxl.h        | 8
 5 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 8075823..6d228cc 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -183,6 +183,7 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 		ctx->pid = get_task_pid(task, PIDTYPE_PID);
 		ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID);
 		kernel = false;
+		ctx->real_mode = false;
 	}

 	cxl_ctx_get();
@@ -219,6 +220,24 @@ void cxl_set_master(struct cxl_context *ctx)
 }
 EXPORT_SYMBOL_GPL(cxl_set_master);

+int cxl_set_translation_mode(struct cxl_context *ctx, bool real_mode)
+{
+	if (ctx->status ==
STARTED) { + /* +* We could potentially update the PE and issue an update LLCMD +* to support this, but it doesn't seem to have a good use case +* since it's trivial to just create a second kernel context +* with different translation modes, so until someone convinces +* me otherwise: +*/ + return -EBUSY; + } + + ctx->real_mode = real_mode; + return 0; +} +EXPORT_SYMBOL_GPL(cxl_set_translation_mode); + /* wrappers around afu_* file ops which are EXPORTED */ int cxl_fd_open(struct inode *inode, struct file *file) { diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index dfdbfb0..6e3e485 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -523,6 +523,7 @@ struct cxl_context { bool pe_inserted; bool master; bool kernel; + bool real_mode; bool pending_irq; bool pending_fault; bool pending_afu_err; diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c index 769971c..c2815b9 100644 --- a/drivers/misc/cxl/guest.c +++ b/drivers/misc/cxl/guest.c @@ -617,6 +617,9 @@ static int guest_attach_process(struct cxl_context *ctx, bool kernel, u64 wed, u { pr_devel("in %s\n", __func__); + if (ctx->real_mode) + return -EPERM; + ctx->kernel = kernel; if (ctx->afu->current_mode == CXL_MODE_DIRECTED) return attach_afu_directed(ctx, wed, amr); diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c index ef494ba..ba459a9 100644 --- a/drivers/misc/cxl/native.c +++ b/drivers/misc/cxl/native.c @@ -485,8 +485,9 @@ static u64 calculate_sr(struct cxl_context *ctx) if (mfspr(SPRN_LPCR) & LPCR_TC) sr |= CXL_PSL_SR_An_TC; if (ctx->kernel) { - sr |= CXL_PSL_SR_An_R | (mfmsr() & MSR_SF); - sr |= CXL_PSL_SR_An_HV; + if (!ctx->real_mode) + sr |= CXL_PSL_SR_An_R; + sr |= (mfmsr() & MSR_SF) | CXL_PSL_SR_An_HV; } else { sr |= CXL_PSL_SR_An_PR | CXL_PSL_SR_An_R; sr &= ~(CXL_PSL_SR_An_HV); diff --git a/include/misc/cxl.h b/include/misc/cxl.h index 7d5e261..56560c5 100644 --- a/include/misc/cxl.h +++ b/include/misc/cxl.h @@ -127,6 +127,14 @@ int 
cxl_afu_reset(struct cxl_context *ctx);
 void cxl_set_master(struct cxl_context *ctx);

 /*
+ * Sets the context to use real mode memory accesses to operate with
+ * translation disabled. Note that this only makes sense for kernel contexts
+ * under bare metal, and will not work with virtualisation. May only be
+ * performed on stopped contexts.
+ */
+int cxl_set_translation_mode(struct cxl_context *ctx, bool real_mode);
Re: [PATCH] cxl: Add kernel API to allow a context to operate with relocate disabled
Sure thing, that actually simplifies things a great deal. Testing now and will resend shortly :)

-Ian
Re: [PATCH v9 04/22] powerpc/powernv: Increase PE# capacity
On 05/03/2016 11:22 PM, Gavin Shan wrote:

Each PHB maintains an array helping to translate 2-bytes Request ID (RID) to PE# with the assumption that PE# takes one byte, meaning that we can't have more than 256 PEs. However, pci_dn->pe_number already had 4-bytes for the PE#.

Can you possibly have more than 256 PEs? Or exactly 256? What patch in this series makes use of it?

I probably asked but do not remember the answer :)

Looks like waste of memory - you only used a small fraction of pe_rmap[0x10000] and now the waste is quadrupled.

This extends the PE# capacity for every PHB. After that, the PE number is represented by 4-bytes value. Then we can reuse IODA_INVALID_PE to check the PE# in phb->pe_rmap[] is valid or not.

Looks like using IODA_INVALID_PE is the only reason for this patch.

Signed-off-by: Gavin Shan
Reviewed-by: Daniel Axtens
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 6 +-
 arch/powerpc/platforms/powernv/pci.h | 7 ++-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index cbd4c0b..cf96cb5 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -768,7 +768,7 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)

 	/* Clear the reverse map */
 	for (rid = pe->rid; rid < rid_end; rid++)
-		phb->ioda.pe_rmap[rid] = 0;
+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;

 	/* Release from all parents PELT-V */
 	while (parent) {
@@ -3406,6 +3406,10 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	if (prop32)
 		phb->ioda.reserved_pe_idx = be32_to_cpup(prop32);

+	/* Invalidate RID to PE# mapping */
+	for (segno = 0; segno < ARRAY_SIZE(phb->ioda.pe_rmap); segno++)
+		phb->ioda.pe_rmap[segno] = IODA_INVALID_PE;
+
 	/* Parse 64-bit MMIO range */
 	pnv_ioda_parse_m64_window(phb);

diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 904f60b..80f5326 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -156,11 +156,8 @@ struct pnv_phb {
 	struct list_head	pe_list;
 	struct mutex		pe_list_mutex;

-	/* Reverse map of PEs, will have to extend if
-	 * we are to support more than 256 PEs, indexed
-	 * bus { bus, devfn }
-	 */
-	unsigned char		pe_rmap[0x10000];
+	/* Reverse map of PEs, indexed by {bus, devfn} */
+	unsigned int		pe_rmap[0x10000];

 	/* TCE cache invalidate registers (physical and
 	 * remapped)

--
Alexey
Re: [GIT PULL 00/17] perf/core improvements and fixes
* Arnaldo Carvalho de Melo wrote: > Hi Ingo, > > Please consider pulling, > > - Arnaldo > > > The following changes since commit 1b6de5917172967acd8db4d222df4225d23a8a60: > > perf/x86/intel/pt: Convert ACCESS_ONCE()s (2016-05-05 10:16:29 +0200) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git > tags/perf-core-for-mingo-20160505 > > for you to fetch changes up to b6b85dad30ad7e7394990e2317a780577974a4e6: > > perf evlist: Rename variable in perf_mmap__read() (2016-05-05 21:04:04 > -0300) > > > perf/core improvements and fixes: > > User visible: > > - Order output of 'perf trace --summary' better, now the threads will > appear in ascending order of number of events, and then, for each, in > descending order of syscalls by the time spent in the syscalls, so > that the last page produced can be the one about the most interesting > thread straced, suggested by Milian Wolff (Arnaldo Carvalho de Melo) > > - Do not show the runtime_ms for a thread when not collecting it, that > is done so far only with 'perf trace --sched' (Arnaldo Carvalho de Melo) > > - Fix kallsyms perf test on ppc64le (Naveen N. Rao) > > Infrastructure: > > - Move global variables related to presence of some keys in the sort order to > a > per hist struct, to allow code like the hists browser to work with multiple > hists with different lists of columns (Jiri Olsa) > > - Add support for generating bpf prologue in powerpc (Naveen N. Rao) > > - Fix kprobe and kretprobe handling with kallsyms on ppc64le (Naveen N.
Rao) > > - evlist mmap changes, prep work for supporting reading backwards (Wang Nan) > > Signed-off-by: Arnaldo Carvalho de Melo > > > Arnaldo Carvalho de Melo (5): > perf machine: Introduce number of threads member > perf tools: Add template for generating rbtree resort class > perf trace: Sort summary output by number of events > perf trace: Sort syscalls stats by msecs in --summary > perf trace: Do not show the runtime_ms for a thread when not collecting > it > > Jiri Olsa (7): > perf hists: Move sort__need_collapse into struct perf_hpp_list > perf hists: Move sort__has_parent into struct perf_hpp_list > perf hists: Move sort__has_sym into struct perf_hpp_list > perf hists: Move sort__has_dso into struct perf_hpp_list > perf hists: Move sort__has_socket into struct perf_hpp_list > perf hists: Move sort__has_thread into struct perf_hpp_list > perf hists: Move sort__has_comm into struct perf_hpp_list > > Naveen N. Rao (3): > perf tools powerpc: Add support for generating bpf prologue > perf powerpc: Fix kprobe and kretprobe handling with kallsyms on ppc64le > perf symbols: Fix kallsyms perf test on ppc64le > > Wang Nan (2): > perf evlist: Extract perf_mmap__read() > perf evlist: Rename variable in perf_mmap__read() > > tools/perf/arch/powerpc/Makefile| 1 + > tools/perf/arch/powerpc/util/dwarf-regs.c | 40 +--- > tools/perf/arch/powerpc/util/sym-handling.c | 43 ++-- > tools/perf/builtin-diff.c | 4 +- > tools/perf/builtin-report.c | 4 +- > tools/perf/builtin-top.c| 8 +- > tools/perf/builtin-trace.c | 87 ++-- > tools/perf/tests/hists_common.c | 2 +- > tools/perf/tests/hists_cumulate.c | 2 +- > tools/perf/tests/hists_link.c | 4 +- > tools/perf/tests/hists_output.c | 2 +- > tools/perf/ui/browsers/hists.c | 32 +++--- > tools/perf/ui/gtk/hists.c | 2 +- > tools/perf/ui/hist.c| 2 +- > tools/perf/util/annotate.c | 2 +- > tools/perf/util/callchain.c | 2 +- > tools/perf/util/evlist.c| 56 ++- > tools/perf/util/hist.c | 14 +-- > tools/perf/util/hist.h | 10 ++ > 
tools/perf/util/machine.c | 9 +- > tools/perf/util/machine.h | 1 + > tools/perf/util/probe-event.c | 5 +- > tools/perf/util/probe-event.h | 3 +- > tools/perf/util/rb_resort.h | 149 > > tools/perf/util/sort.c | 35 +++ > tools/perf/util/sort.h | 7 -- > tools/perf/util/symbol-elf.c| 7 +- > tools/perf/util/symbol.h| 3 +- > 28 files changed, 382 insertions(+), 154 deletions(-) > create mode 100644 tools/perf/util/rb_resort.h Pulled, thanks a lot Arnaldo! Ingo
Re: [PATCH v9 03/22] powerpc/powernv: Move pnv_pci_ioda_setup_opal_tce_kill() around
On 05/03/2016 11:22 PM, Gavin Shan wrote:

pnv_pci_ioda_setup_opal_tce_kill() is called by pnv_ioda_setup_dma() to remap the TCE kill register. What's done in pnv_ioda_setup_dma() will be covered in pcibios_setup_bridge(), which is invoked on each PCI bridge. It means we will possibly remap the TCE kill register multiple times, and it's unnecessary. This moves pnv_pci_ioda_setup_opal_tce_kill() to where the PHB is initialized (pnv_pci_init_ioda_phb()) to avoid the above issue.

Signed-off-by: Gavin Shan
Reviewed-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5ee8a57..cbd4c0b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2599,8 +2599,6 @@ static void pnv_ioda_setup_dma(struct pnv_phb *phb)
 	pr_info("PCI: Domain %04x has %d available 32-bit DMA segments\n",
 		hose->global_number, phb->ioda.dma32_count);

-	pnv_pci_ioda_setup_opal_tce_kill(phb);
-
 	/* Walk our PE list and configure their DMA segments */
 	list_for_each_entry(pe, &phb->ioda.pe_list, list) {
 		weight = pnv_pci_ioda_pe_dma_weight(pe);
@@ -3396,6 +3394,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	if (phb->regs == NULL)
 		pr_err(" Failed to map registers !\n");

+	/* Initialize TCE kill register */
+	pnv_pci_ioda_setup_opal_tce_kill(phb);
+
 	/* Initialize more IODA stuff */
 	phb->ioda.total_pe_num = 1;
 	prop32 = of_get_property(np, "ibm,opal-num-pes", NULL);

--
Alexey
Re: [PATCH 5/5] vfio-pci: Allow to mmap MSI-X table if interrupt remapping is supported
On 05/06/2016 01:05 AM, Alex Williamson wrote:
On Thu, 5 May 2016 12:15:46 + "Tian, Kevin" wrote:
From: Yongji Xie [mailto:xyj...@linux.vnet.ibm.com] Sent: Thursday, May 05, 2016 7:43 PM

Hi David and Kevin,

On 2016/5/5 17:54, David Laight wrote:
From: Tian, Kevin Sent: 05 May 2016 10:37 ...

Actually, we are not aimed at accessing the MSI-X table from the guest. So I think it's safe to pass through the MSI-X table if we can make sure the guest kernel would not touch the MSI-X table in a normal code path, such as a para-virtualized guest kernel on PPC64.

Then how do you prevent a malicious guest kernel accessing it?

Or a malicious guest driver for an ethernet card setting up the receive buffer ring to contain a single word entry that contains the address associated with an MSI-X interrupt, and then using a loopback mode to cause a specific packet to be received that writes the required word through that address. Remember, the PCIe cycle for an interrupt is a normal memory write cycle.

David

If we have enough permission to load a malicious driver or kernel, we can easily break the guest without an exposed MSI-X table. I think it should be safe to expose the MSI-X table if we can make sure that a malicious guest driver/kernel can't use the MSI-X table to break other guests or the host. The capability of IRQ remapping could provide this kind of protection.

IRQ remapping doesn't mean you can pass through the MSI-X structure to the guest. I know actual IRQ remapping might be platform specific, but at least for the Intel VT-d specification, an MSI-X entry must be configured with a remappable format by the host kernel, which contains an index into the IRQ remapping table. The index will find an IRQ remapping entry which controls interrupt routing for a specific device. If you allow a malicious program a random index into an MSI-X entry of an assigned device, the hole is obvious...

The above might make sense only for an IRQ remapping implementation which doesn't rely on the extended MSI-X format (e.g. simply based on BDF).
If that's the case for PPC, then you should build MSI-X passthrough based on this fact instead of on general IRQ remapping being enabled or not.

I don't think anyone is expecting that we can expose the MSI-X vector table to the guest and the guest can make direct use of it. The end goal here is that the guest on a power system is already paravirtualized to not program the device MSI-X by directly writing to the MSI-X vector table. They have hypercalls for this since they always run virtualized. Therefore a) they never intend to touch the MSI-X vector table, and b) they have sufficient isolation that a guest can only hurt itself by doing so.

On x86 we don't have a); our method of programming the MSI-X vector table is to directly write to it. Therefore we will always require QEMU to place a MemoryRegion over the vector table to intercept those accesses. However, with interrupt remapping we do have b) on x86, which means that we don't need to be so strict in disallowing user accesses to the MSI-X vector table. It's not useful for configuring MSI-X on the device, but the user should only be able to hurt themselves by writing it directly. x86 doesn't really get anything out of this change, but it helps this special case on power pretty significantly aiui. Thanks,

Excellent short overview, saved :)

How do we proceed with these patches? Nobody seems to object to them, but nobody seems to be taking them either...

--
Alexey
Re: [PATCH 4/5] pci-ioda: Set PCI_BUS_FLAGS_MSI_REMAP for IODA host bridge
On 04/27/2016 10:43 PM, Yongji Xie wrote:

Any IODA host bridge has the capability of IRQ remapping. So we set PCI_BUS_FLAGS_MSI_REMAP when this kind of host bridge is detected.

Signed-off-by: Yongji Xie
Reviewed-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f90dc04..9557638 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3080,6 +3080,12 @@ static void pnv_pci_ioda_fixup(void)
 	pnv_npu_ioda_fixup();
 }

+int pnv_pci_ioda_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+	bridge->bus->bus_flags |= PCI_BUS_FLAGS_MSI_REMAP;
+	return 0;
+}
+
 /*
  * Returns the alignment for I/O or memory windows for P2P
  * bridges. That actually depends on how PEs are segmented.
@@ -3364,6 +3370,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	 */
 	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;

+	ppc_md.pcibios_root_bridge_prepare = pnv_pci_ioda_root_bridge_prepare;
+
 	if (phb->type == PNV_PHB_NPU)
 		hose->controller_ops = pnv_npu_ioda_controller_ops;
 	else

--
Alexey
Re: [PATCH 0/2] Enable ZONE_DEVICE on POWER
Hi,

I've been working on kernel support for a persistent memory (nvdimm) device, and the kernel driver infrastructure requires ZONE_DEVICE for DAX support. I've had it enabled in my tree for some time (without altmap support) without any real issues. I wasn't planning on upstreaming any of my changes until 4.8 at the earliest, so I am ok with carrying these patches myself. However, there has been some interest in using ZONE_DEVICE for other things on ppc (wasn't that you?), and given that ZONE_DEVICE is gated behind CONFIG_EXPERT, I can't see there being any kind of negative impact on end users by merging it now. At the very least it lets the rest of the kernel development community know that changes affecting zones should also be tested on powerpc.

Thanks,
Oliver

On Fri, May 6, 2016 at 3:13 PM, Anshuman Khandual wrote:
> On 05/05/2016 08:18 PM, Aneesh Kumar K.V wrote:
>> Anshuman Khandual writes:
>>
>>> This enables base ZONE_DEVICE support on POWER. This series depends on
>>> the following patches posted by Oliver.
>>>
>>> https://patchwork.ozlabs.org/patch/618867/
>>> https://patchwork.ozlabs.org/patch/618868/
>>>
>>> Anshuman Khandual (2):
>>> powerpc/mm: Make vmemmap_populate accommodate ZONE_DEVICE memory
>>> powerpc/mm: Enable support for ZONE_DEVICE on PPC_BOOK3S_64 platforms
>>>
>>> arch/powerpc/mm/init_64.c | 4 +++-
>>> mm/Kconfig| 2 +-
>>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>>
>>
>> What is the use case? Who will use ZONE_DEVICE on ppc64? This should
>> be merged along with the patch series that uses this.
>
> IIUC, Oliver has been looking at using ZONE_DEVICE for the NVDIMM (or
> some other persistent memory) drivers. I have been following Dan Williams'
> work on this front and want to explore more details about its functioning
> on ppc64. This enablement will just help us a little bit in that direction.