[PATCH] powerpc: Align hot loops of memset() and backwards_memcpy()
From: Anton Blanchard

Align the hot loops in our assembly implementation of memset() and
backwards_memcpy().

backwards_memcpy() is called from tcp_v4_rcv(), so we might want to
optimise this a little more.

Signed-off-by: Anton Blanchard
---
 arch/powerpc/lib/mem_64.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
index 43435c6..eda7a96 100644
--- a/arch/powerpc/lib/mem_64.S
+++ b/arch/powerpc/lib/mem_64.S
@@ -37,6 +37,7 @@ _GLOBAL(memset)
 	clrldi	r5,r5,58
 	mtctr	r0
 	beq	5f
+	.balign	16
 4:	std	r4,0(r6)
 	std	r4,8(r6)
 	std	r4,16(r6)
@@ -90,6 +91,7 @@ _GLOBAL(backwards_memcpy)
 	andi.	r0,r6,3
 	mtctr	r7
 	bne	5f
+	.balign	16
 1:	lwz	r7,-4(r4)
 	lwzu	r8,-8(r4)
 	stw	r7,-4(r6)
--
2.7.4
Crashes in refresh_zone_stat_thresholds when some nodes have no memory
It appears that commit 75ef71840539 ("mm, vmstat: add infrastructure for per-node vmstats", 2016-07-28) has introduced a regression on machines that have nodes which have no memory, such as the POWER8 server that I use for testing. When I boot current upstream, I get a splat like this: [1.713998] Unable to handle kernel paging request for data at address 0xff7a1 [1.714164] Faulting instruction address: 0xc0270cd0 [1.714304] Oops: Kernel access of bad area, sig: 11 [#1] [1.714414] SMP NR_CPUS=2048 NUMA PowerNV [1.714530] Modules linked in: [1.714647] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118 [1.714786] task: c00ff0680010 task.stack: c00ff0704000 [1.714926] NIP: c0270cd0 LR: c0270ce8 CTR: [1.715093] REGS: c00ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+) [1.715232] MSR: 900102009033 CR: 846b6824 XER: 2000 [1.715748] CFAR: c0008768 DAR: 000ff7a1 DSISR: 4200 SOFTE: 1 GPR00: c0270d08 c00ff0707b80 c11fb200 GPR04: 0800 GPR08: 000ff7a1 c122aae0 GPR12: c0a1e440 cfb8 c000c188 GPR16: GPR20: c0cecad0 GPR24: c0d035b8 c0d6cd18 c0d6cd18 c01fffa86300 GPR28: c01fffa96300 c1230034 c122eb18 [1.717484] NIP [c0270cd0] refresh_zone_stat_thresholds+0x80/0x240 [1.717568] LR [c0270ce8] refresh_zone_stat_thresholds+0x98/0x240 [1.717648] Call Trace: [1.717687] [c00ff0707b80] [c0270d08] refresh_zone_stat_thresholds+0xb8/0x240 (unreliable) [1.717818] [c00ff0707bd0] [c0a1e4d4] init_per_zone_wmark_min+0x94/0xb0 [1.717932] [c00ff0707c30] [c000b90c] do_one_initcall+0x6c/0x1d0 [1.718036] [c00ff0707cf0] [c0d04244] kernel_init_freeable+0x294/0x384 [1.718150] [c00ff0707dc0] [c000c1a8] kernel_init+0x28/0x160 [1.718249] [c00ff0707e30] [c0009968] ret_from_kernel_thread+0x5c/0x74 [1.718358] Instruction dump: [1.718408] 3fc20003 3bde4e34 3b80 6042 3860 3fbb0001 481c 6042 [1.718575] 3d220003 3929f8e0 7d49502a e93d9c00 <7f8a49ae> 38a30001 38800800 7ca507b4 It turns out that we can get a pgdat in the online pgdat list where pgdat->per_cpu_nodestats is NULL. 
On my machine the pgdats for nodes 1 and 17 are like this. All the memory is in nodes 0 and 16. With the patch below, the system boots normally. I don't guarantee to have found every place that needs a check, and it may be better to fix this by allocating space for per-cpu statistics on nodes which have no memory rather than checking at each use site. Paul. mm: cope with memoryless nodes not having per-cpu statistics allocated It seems that the pgdat for nodes which have no memory will also have no per-cpu statistics space allocated, that is, pgdat->per_cpu_nodestats is NULL. Avoid crashing on machines which have memoryless nodes by checking for non-NULL pgdat->per_cpu_nodestats. Signed-off-by: Paul Mackerras --- diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 6137719..48b2780 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -184,8 +184,9 @@ static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat, #ifdef CONFIG_SMP int cpu; - for_each_online_cpu(cpu) - x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item]; + if (pgdat->per_cpu_nodestats) + for_each_online_cpu(cpu) + x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item]; if (x < 0) x = 0; diff --git a/mm/vmstat.c b/mm/vmstat.c index 89cec42..d83881e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -176,6 +176,10 @@ void refresh_zone_stat_thresholds(void) /* Zero current pgdat thresholds */ for_each_online_pgdat(pgdat) { + if (!pgdat->per_cpu_nodestats) { + pr_err("No nodestats for node %d\n", pgdat->node_id); + continue; + } for_each_online_cpu(cpu) { per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0; } @@ -184,6 +188,10 @@ void refresh_zone_stat_thresholds(void) for_each_populated_zone(zone) { struct pglist_data *pgdat = zone->zone_pgdat; unsigned long max_drift, tolerate_drift; + if (!pgdat->per_cpu_nodestats) { + pr_err("No per cpu nodestats\n"); + continue; + }
[PATCH] crypto: crc32c-vpmsum - Convert to CPU feature based module autoloading
From: Anton Blanchard This patch utilises the GENERIC_CPU_AUTOPROBE infrastructure to automatically load the crc32c-vpmsum module if the CPU supports it. Signed-off-by: Anton Blanchard --- arch/powerpc/crypto/crc32c-vpmsum_glue.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/crypto/crc32c-vpmsum_glue.c b/arch/powerpc/crypto/crc32c-vpmsum_glue.c index bfe3d37..9fa046d 100644 --- a/arch/powerpc/crypto/crc32c-vpmsum_glue.c +++ b/arch/powerpc/crypto/crc32c-vpmsum_glue.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #define CHKSUM_BLOCK_SIZE 1 @@ -157,7 +158,7 @@ static void __exit crc32c_vpmsum_mod_fini(void) crypto_unregister_shash(&alg); } -module_init(crc32c_vpmsum_mod_init); +module_cpu_feature_match(PPC_MODULE_FEATURE_VEC_CRYPTO, crc32c_vpmsum_mod_init); module_exit(crc32c_vpmsum_mod_fini); MODULE_AUTHOR("Anton Blanchard "); -- 2.7.4
Re: [PATCH v13 06/30] powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
Hi all,

This is causing cppcheck warnings (having just landed in next):

[arch/powerpc/kernel/ptrace.c:2062]: (error) Uninitialized variable: ckpt_regs
[arch/powerpc/kernel/ptrace.c:2130]: (error) Uninitialized variable: ckpt_regs

This is from...

> -static int gpr32_get(struct task_struct *target,
> +static int gpr32_get_common(struct task_struct *target,
> 		     const struct user_regset *regset,
> 		     unsigned int pos, unsigned int count,
> -		     void *kbuf, void __user *ubuf)
> +		     void *kbuf, void __user *ubuf, bool tm_active)
> {
> 	const unsigned long *regs = &target->thread.regs->gpr[0];
> +	const unsigned long *ckpt_regs;
> 	compat_ulong_t *k = kbuf;
> 	compat_ulong_t __user *u = ubuf;
> 	compat_ulong_t reg;
> 	int i;
>
> -	if (target->thread.regs == NULL)
> -		return -EIO;
> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> +	ckpt_regs = &target->thread.ckpt_regs.gpr[0];
> +#endif
> +	if (tm_active) {
> +		regs = ckpt_regs;

... this bit here. If the ifdef doesn't trigger, cppcheck can't find an initialisation for ckpt_regs, so it complains. Technically it's a false positive as (I assume!) tm_active cannot ever be true in the absence of CONFIG_PPC_TRANSACTIONAL_MEM.

Is there a nice simple fix we could deploy to squash this warning, or will we just live with it?
> -static int gpr32_set(struct task_struct *target,
> +static int gpr32_set_common(struct task_struct *target,
> 		     const struct user_regset *regset,
> 		     unsigned int pos, unsigned int count,
> -		     const void *kbuf, const void __user *ubuf)
> +		     const void *kbuf, const void __user *ubuf, bool tm_active)
> {
> 	unsigned long *regs = &target->thread.regs->gpr[0];
> +	unsigned long *ckpt_regs;
> 	const compat_ulong_t *k = kbuf;
> 	const compat_ulong_t __user *u = ubuf;
> 	compat_ulong_t reg;
>
> -	if (target->thread.regs == NULL)
> -		return -EIO;
> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> +	ckpt_regs = &target->thread.ckpt_regs.gpr[0];
> +#endif
>
> -	CHECK_FULL_REGS(target->thread.regs);
> +	if (tm_active) {
> +		regs = ckpt_regs;

FWIW it happens again here.

Regards,
Daniel Axtens

signature.asc Description: PGP signature
Re: [PATCH v2] powerpc/32: fix csum_partial_copy_generic()
Scott,

On 4 August 2016 at 05:53, Scott Wood wrote:
> On Tue, 2016-08-02 at 10:07 +0200, Christophe Leroy wrote:
>> commit 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
>> based on copy_tofrom_user()") introduced a bug when destination
>> address is odd and initial csum is not null
>>
>> In that (rare) case the initial csum value has to be rotated one byte
>> as well as the resulting value is
>>
>> This patch also fixes related comments
>>
>> Fixes: 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
>> based on copy_tofrom_user()")
>> Cc: sta...@vger.kernel.org
>>
>> Signed-off-by: Christophe Leroy
>> ---
>> v2: updated comments as suggested by Segher
>>
>> arch/powerpc/lib/checksum_32.S | 7 ---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> Alessio, can you confirm whether this fixes the problem you reported?

No, unfortunately.

Ciao,
Alessio
Re: [PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
On Wed, Aug 03, 2016 at 06:40:45PM +1000, Alexey Kardashevskiy wrote: > "powerpc/powernv/pci: Rework accessing the TCE invalidate register" > broke TCE invalidation on IODA2/PHB3 for real mode. > > This makes invalidate work again. > > Fixes: fd141d1a99a3 > Signed-off-by: Alexey Kardashevskiy > --- > arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > b/arch/powerpc/platforms/powernv/pci-ioda.c > index 53b56c0..59c7e7d 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -1877,7 +1877,7 @@ static void pnv_pci_phb3_tce_invalidate(struct > pnv_ioda_pe *pe, bool rm, > unsigned shift, unsigned long index, > unsigned long npages) > { > - __be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, false); > + __be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, rm); > unsigned long start, end, inc; > > /* We'll invalidate DMA address in PE scope */ > @@ -1935,10 +1935,12 @@ static void pnv_pci_ioda2_tce_invalidate(struct > iommu_table *tbl, > pnv_pci_phb3_tce_invalidate(pe, rm, shift, > index, npages); > else if (rm) > + { > opal_rm_pci_tce_kill(phb->opal_id, >OPAL_PCI_TCE_KILL_PAGES, >pe->pe_number, 1u << shift, >index << shift, npages); > + } These braces look a) unrelated to the actual point of the patch, b) unnecessary and c) not in keeping with normal coding style. > else > opal_pci_tce_kill(phb->opal_id, > OPAL_PCI_TCE_KILL_PAGES, -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER
On Wed, Aug 03, 2016 at 06:40:43PM +1000, Alexey Kardashevskiy wrote: > 178a787502 "vfio: Enable VFIO device for powerpc" made an attempt to > enable VFIO KVM device on POWER. > > However as CONFIG_KVM_BOOK3S_64 does not use "common-objs-y", > VFIO KVM device was not enabled for Book3s KVM, this adds VFIO to > the kvm-book3s_64-objs-y list. > > While we are here, enforce KVM_VFIO on KVM_BOOK3S as other platforms > already do. > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson This should be merged regardless of the rest of the series. There's no reason not to include the kvm device on Power, and it makes life easier for userspace because it doesn't have to have conditionals about whether to instantiate it or not. > --- > arch/powerpc/kvm/Kconfig | 1 + > arch/powerpc/kvm/Makefile | 3 +++ > 2 files changed, 4 insertions(+) > > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig > index c2024ac..b7c494b 100644 > --- a/arch/powerpc/kvm/Kconfig > +++ b/arch/powerpc/kvm/Kconfig > @@ -64,6 +64,7 @@ config KVM_BOOK3S_64 > select KVM_BOOK3S_64_HANDLER > select KVM > select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE > + select KVM_VFIO if VFIO > ---help--- > Support running unmodified book3s_64 and book3s_32 guest kernels > in virtual machines on book3s_64 host processors. > diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile > index 1f9e552..8907af9 100644 > --- a/arch/powerpc/kvm/Makefile > +++ b/arch/powerpc/kvm/Makefile > @@ -88,6 +88,9 @@ endif > kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ > book3s_xics.o > > +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ > + $(KVM)/vfio.o > + > kvm-book3s_64-module-objs += \ > $(KVM)/kvm_main.o \ > $(KVM)/eventfd.o \ -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
[patch] ppc/cell: missing error code in spufs_mkgang()
We should return -ENOMEM if alloc_spu_gang() fails.

Signed-off-by: Dan Carpenter

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 5be15cf..2975754 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -496,8 +496,10 @@ spufs_mkgang(struct inode *dir, struct dentry *dentry, umode_t mode)
 	gang = alloc_spu_gang();
 	SPUFS_I(inode)->i_ctx = NULL;
 	SPUFS_I(inode)->i_gang = gang;
-	if (!gang)
+	if (!gang) {
+		ret = -ENOMEM;
 		goto out_iput;
+	}

 	inode->i_op = &simple_dir_inode_operations;
 	inode->i_fop = &simple_dir_operations;
[patch] powerpc/fsl_rio: fix a missing error code
We should set the error code here. Otherwise static checkers complain.

Signed-off-by: Dan Carpenter

diff --git a/arch/powerpc/sysdev/fsl_rio.c b/arch/powerpc/sysdev/fsl_rio.c
index 984e816..68e7c0d 100644
--- a/arch/powerpc/sysdev/fsl_rio.c
+++ b/arch/powerpc/sysdev/fsl_rio.c
@@ -491,6 +491,7 @@ int fsl_rio_setup(struct platform_device *dev)
 	rmu_node = of_parse_phandle(dev->dev.of_node, "fsl,srio-rmu-handle", 0);
 	if (!rmu_node) {
 		dev_err(&dev->dev, "No valid fsl,srio-rmu-handle property\n");
+		rc = -ENOENT;
 		goto err_rmu;
 	}
 	rc = of_address_to_resource(rmu_node, 0, &rmu_regs);
Re: [PATCH 1/2] mm: Allow disabling deferred struct page initialisation
* Dave Hansen [2016-08-03 11:17:43]:
> On 08/02/2016 11:38 PM, Srikar Dronamraju wrote:
> > * Dave Hansen [2016-08-02 11:09:21]:
> >> On 08/02/2016 06:19 AM, Srikar Dronamraju wrote:
> >>> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise
> >>> only certain size memory per node. The certain size takes into account
> >>> the dentry and inode cache sizes. However such a kernel when booting a
> >>> secondary kernel will not be able to allocate the required amount of
> >>> memory to suffice for the dentry and inode caches. This results in
> >>> crashes like the below on large systems such as 32 TB systems.
> >>
> >> What's a "secondary kernel"?
> >>
> > I mean the kernel that's booted to collect the crash. On fadump, the
> > first kernel acts as the secondary kernel, i.e. the same kernel is booted
> > to collect the crash.
>
> OK, but I'm still not seeing what the problem is. You've said that it
> crashes and that it crashes during inode/dentry cache allocation.
>
> But, *why* does the same kernel image crash when it is used as a
> "secondary kernel"?

I guess you already got it. But let me try to explain it again.

Let's say we have a 32 TB system with 16 nodes, each node having 2 TB of memory. We are assuming deferred page initialisation is configured.

When the regular kernel boots:
1. It reserves 5% of the memory for fadump.
2. It initializes 8GB per node, i.e. 128GB.
3. It allocates the dentry/inode cache, which is around 16GB.
4. It then kicks off the parallel page struct initialization.

Now let's say the kernel crashed and fadump was triggered:
1. The same kernel boots in the 5% reserved space, which is 1600GB.
2. It reserves the remaining 95% of memory.
3. It tries to initialize 8GB per node but can only initialize 8GB in total (since, except for the 1st node, the other nodes are all reserved).
4. It tries to allocate the 16GB dentry/inode cache but fails (it tries to reclaim, but reclaim needs a spinlock and the spinlock is not yet initialized).

--
Thanks and Regards
Srikar Dronamraju
another test
This time with a PGP signature -- Cheers, Stephen Rothwell pgpxkiMyeBowX.pgp Description: OpenPGP digital signature
[PATCH] powerpc/eeh: Fix slot locations on NPU and legacy platforms
The slot location code as part of EEH has never functioned perfectly on every powerpc system. The device node properties "ibm,slot-loc", "ibm,slot-location-code" and "ibm,io-base-loc-code" have all been presented in different cases, and in some situations, there are legacy platforms not conforming to the conventions of populating root buses with "ibm,io-base-loc-code" and child nodes with "ibm,slot-location-code". Specifically, some legacy platforms use "ibm,loc-code" instead, which stopped working with 7e56f627768. In addition, EEH PEs for NPU devices have slot locations specified on the devices instead of buses due to their architecture, and these were not printed. This has been fixed by looking at the top device of a PE for a slot location before checking its bus. Fixes: 7e56f627768 "powerpc/eeh: Fix PE location code" Cc: #4.4+ Signed-off-by: Russell Currey --- arch/powerpc/kernel/eeh_pe.c | 31 ++- 1 file changed, 26 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index f0520da..034538c 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -881,17 +881,34 @@ void eeh_pe_restore_bars(struct eeh_pe *pe) * eeh_pe_loc_get - Retrieve location code binding to the given PE * @pe: EEH PE * - * Retrieve the location code of the given PE. If the primary PE bus - * is root bus, we will grab location code from PHB device tree node - * or root port. Otherwise, the upstream bridge's device tree node - * of the primary PE bus will be checked for the location code. + * Retrieve the location code of the given PE. The first device associated + * with the PE is checked for a slot location. If missing, the bus of the + * device is checked instead. If this is a root bus, the location code is + * taken from the PHB device tree node or root port. If not, the upstream + * bridge's device tree node of the primary PE bus will be checked instead. 
+ * If a slot location isn't found on the bus, walk through parent buses + * until a location is found. */ const char *eeh_pe_loc_get(struct eeh_pe *pe) { - struct pci_bus *bus = eeh_pe_bus_get(pe); + struct pci_bus *bus; struct device_node *dn; const char *loc = NULL; + /* Check the slot location of the first (top) PCI device */ + struct eeh_dev *edev = + list_first_entry_or_null(&pe->edevs, struct eeh_dev, list); + + if (edev) { + loc = of_get_property(edev->pdn->node, + "ibm,slot-location-code", NULL); + if (loc) + return loc; + } + + /* If there's nothing on the device, look at the bus */ + bus = eeh_pe_bus_get(pe); + while (bus) { dn = pci_bus_to_OF_node(bus); if (!dn) { @@ -905,6 +922,10 @@ const char *eeh_pe_loc_get(struct eeh_pe *pe) loc = of_get_property(dn, "ibm,slot-location-code", NULL); + /* Fall back to ibm,loc-code if nothing else is found */ + if (!loc) + loc = of_get_property(dn, "ibm,loc-code", NULL); + if (loc) return loc; -- 2.9.2
Re: [PATCH] crypto: powerpc - CRYPT_CRC32C_VPMSUM should depend on ALTIVEC
Hi Michael, > The optimised crc32c implementation depends on VMX (aka. Altivec) > instructions, so the kernel must be built with Altivec support in > order for the crc32c code to build. Thanks for that, looks good. Acked-by: Anton Blanchard > Fixes: 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c") > Signed-off-by: Michael Ellerman > --- > crypto/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/crypto/Kconfig b/crypto/Kconfig > index a9377bef25e3..84d71482bf08 100644 > --- a/crypto/Kconfig > +++ b/crypto/Kconfig > @@ -439,7 +439,7 @@ config CRYPTO_CRC32C_INTEL > > config CRYPT_CRC32C_VPMSUM > tristate "CRC32c CRC algorithm (powerpc64)" > - depends on PPC64 > + depends on PPC64 && ALTIVEC > select CRYPTO_HASH > select CRC32 > help
[PATCH] powerpc/book3s: Fix MCE console messages for unrecoverable MCE.
From: Mahesh Salgaonkar When machine check occurs with MSR(RI=0), it means MC interrupt is unrecoverable and kernel goes down to panic path. But the console message still shows it as recovered. This patch fixes the MCE console messages. Signed-off-by: Mahesh Salgaonkar --- arch/powerpc/kernel/mce.c |3 ++- arch/powerpc/platforms/powernv/opal.c |2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c index ef267fd..5e7ece0 100644 --- a/arch/powerpc/kernel/mce.c +++ b/arch/powerpc/kernel/mce.c @@ -92,7 +92,8 @@ void save_mce_event(struct pt_regs *regs, long handled, mce->in_use = 1; mce->initiator = MCE_INITIATOR_CPU; - if (handled) + /* Mark it recovered if we have handled it and MSR(RI=1). */ + if (handled && (regs->msr & MSR_RI)) mce->disposition = MCE_DISPOSITION_RECOVERED; else mce->disposition = MCE_DISPOSITION_NOT_RECOVERED; diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c index 5385434..8154171 100644 --- a/arch/powerpc/platforms/powernv/opal.c +++ b/arch/powerpc/platforms/powernv/opal.c @@ -401,6 +401,8 @@ static int opal_recover_mce(struct pt_regs *regs, if (!(regs->msr & MSR_RI)) { /* If MSR_RI isn't set, we cannot recover */ + printk(KERN_ERR "Machine check interrupt unrecoverable:" + " MSR(RI=0)\n"); recovered = 0; } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) { /* Platform corrected itself */
Re: [PATCH] powerpc/xics: Properly set Edge/Level type and enable resend
Benjamin Herrenschmidt writes: > This sets the type of the interrupt appropriately. We set it as follow: > > - If not mapped from the device-tree, we use edge. This is the case > of the virtual interrupts and PCI MSIs for example. > > - If mapped from the device-tree and #interrupt-cells is 2 (PAPR > compliant), we use the second cell to set the appropriate type > > - If mapped from the device-tree and #interrupt-cells is 1 (current > OPAL on P8 does that), we assume level sensitive since those are > typically going to be the PSI LSIs which are level sensitive. > > Additionally, we mark the interrupts requested via the opal_interrupts > property all level. This is a bit fishy but the best we can do until we > fix OPAL to properly expose them with a complete descriptor. It is also > correct for the current HW anyway as OPAL interrupts are currently PCI > error and PSI interrupts which are level. > > Finally now that edge interrupts are properly identified, we can enable > CONFIG_HARDIRQS_SW_RESEND which will make the core re-send them if > they occur while masked, which some drivers rely upon. > > This fixes issues with lost interrupts on some Mellanox adapters. > > Signed-off-by: Benjamin Herrenschmidt Broken since forever? Cc stable? cheers
Re: [PATCH v2 3/3] powernv: Fix MCE handler to avoid trashing CR0/CR1 registers.
Mahesh J Salgaonkar writes:
> From: Mahesh Salgaonkar
>
> The current implementation of MCE early handling modifies CR0/1 registers
> without saving its old values. Fix this by moving early check for
> powersaving mode to machine_check_handle_early().

From (internal bug report) it seems as though in a test where one injects continuous SLB Multi Hit errors, this bug could lead to rebooting "due to Platform error" rather than continuing to recover successfully. It might be a good idea to mention that in the commit message here.

Also, should this go to stable?

--
Stewart Smith
OPAL Architect, IBM.
Re: [PATCH v2] powerpc/32: fix csum_partial_copy_generic()
On Tue, 2016-08-02 at 10:07 +0200, Christophe Leroy wrote:
> commit 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
> based on copy_tofrom_user()") introduced a bug when destination
> address is odd and initial csum is not null
>
> In that (rare) case the initial csum value has to be rotated one byte
> as well as the resulting value is
>
> This patch also fixes related comments
>
> Fixes: 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
> based on copy_tofrom_user()")
> Cc: sta...@vger.kernel.org
>
> Signed-off-by: Christophe Leroy
> ---
> v2: updated comments as suggested by Segher
>
> arch/powerpc/lib/checksum_32.S | 7 ---
> 1 file changed, 4 insertions(+), 3 deletions(-)

Alessio, can you confirm whether this fixes the problem you reported?

-Scott

>
> diff --git a/arch/powerpc/lib/checksum_32.S b/arch/powerpc/lib/checksum_32.S
> index d90870a..0a57fe6 100644
> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -127,8 +127,9 @@ _GLOBAL(csum_partial_copy_generic)
> 	stw	r7,12(r1)
> 	stw	r8,8(r1)
>
> -	andi.	r0,r4,1		/* is destination address even ? */
> -	cmplwi	cr7,r0,0
> +	rlwinm	r0,r4,3,0x8
> +	rlwnm	r6,r6,r0,0,31	/* odd destination address: rotate one byte */
> +	cmplwi	cr7,r0,0	/* is destination address even ? */
> 	addic	r12,r6,0
> 	addi	r6,r4,-4
> 	neg	r0,r4
> @@ -237,7 +238,7 @@ _GLOBAL(csum_partial_copy_generic)
> 66:	addze	r3,r12
> 	addi	r1,r1,16
> 	beqlr+	cr7
> -	rlwinm	r3,r3,8,0,31	/* swap bytes for odd destination */
> +	rlwinm	r3,r3,8,0,31	/* odd destination address: rotate one byte */
> 	blr
>
> /* read fault */
DMARC (and DKIM) problems
Hi all, For some time we have been coping with DMARC by rewriting the sender address for any email sent from a site with a restrictive DMARC policy. This was because the DKIM verification would fail for such an email once it had been processed by the mailing list software and so sites (like Yahoo) who implemented DMARC would bounce such emails. It turns out that by just not adding the footer to each email, we no longer break the DKIM signatures. So, I have turned off the footer and will leave it that way unless someone objects. This means that I have also turned off sender address rewriting. -- Cheers, Stephen Rothwell
test, please ignore again
Just like last time. -- Cheers, Stephen Rothwell
test, please ignore
I am just testing the interaction of the mailing list with DKIM after removing the footer. -- Cheers, Stephen Rothwell
Re: [PATCH] powernv: Search for new flash DT node location
Jack Miller writes: > On Wed, Aug 03, 2016 at 05:16:34PM +1000, Michael Ellerman wrote: >> We could instead just search for all nodes that are compatible with >> "ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs(). >> >> Is there a particular reason not to do that? > > I'm actually surprised that this is preferred. Jeremy mentioned something > similar, but I guess I just don't like the idea of finding devices in weird > places in the tree. But where is "weird". Arguably "/opal/flash" is weird. What does it mean? There's a bus called "opal" and a device on it called "flash"? No. Point being the structure is fairly arbitrary, or at least debatable, so tying the code 100% to the structure is inflexible. As we have discovered. Our other option is to tell skiboot to get stuffed, and leave the flash node where it was on P8. > Then again, if we can't trust the DT we're in bigger > trouble than erroneous flash nodes =). Quite :) > If we really just want to find compatible nodes anywhere, let's simplify i2c > and pdev_init into one function and make that behavior consistent with this > new patch. That seems OK to me. We should get an ack from Stewart though for the other node types. cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: move hmi.c to arch/powerpc/kvm/
Paolo Bonzini writes: > hmi.c functions are unused unless sibling_subcore_state is nonzero, and > that in turn happens only if KVM is in use. So move the code to > arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_64_HANDLER > rather than CONFIG_PPC_BOOK3S_64. The sibling_subcore_state is also > included in struct paca_struct only if KVM is supported by the kernel. Ok. Initially I was concerned because there are a bunch of non-KVM related HMI causes (e.g. the CAPP will raise an HMI if it loses the link to the CAPI card.) https://github.com/open-power/skiboot/blob/master/core/hmi.c lists lots of HMIs created by hardware events. Having said that, you're right that this particular file is KVM specific. Reviewed-by: Daniel Axtens Mahesh: is there a way to cause the TB to desynchronise and then test if this resynchronisation works? Regards, Daniel > > Cc: Paul Mackerras > Cc: Michael Ellerman > Cc: Mahesh Salgaonkar > Cc: linuxppc-dev@lists.ozlabs.org > Cc: kvm-...@vger.kernel.org > Cc: k...@vger.kernel.org > Signed-off-by: Paolo Bonzini > --- > It would be nice to have this in 4.8, to minimize any 4.9 conflicts. > Build-tested only, with and without KVM enabled. 
> > arch/powerpc/include/asm/hmi.h | 2 +- > arch/powerpc/include/asm/paca.h| 10 +- > arch/powerpc/kernel/Makefile | 2 +- > arch/powerpc/kvm/Makefile | 1 + > arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} | 0 > 5 files changed, 8 insertions(+), 7 deletions(-) > rename arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} (100%) > > diff --git a/arch/powerpc/include/asm/hmi.h b/arch/powerpc/include/asm/hmi.h > index 88b4901ac4ee..d3b6ad6e137c 100644 > --- a/arch/powerpc/include/asm/hmi.h > +++ b/arch/powerpc/include/asm/hmi.h > @@ -21,7 +21,7 @@ > #ifndef __ASM_PPC64_HMI_H__ > #define __ASM_PPC64_HMI_H__ > > -#ifdef CONFIG_PPC_BOOK3S_64 > +#ifdef CONFIG_KVM_BOOK3S_64_HANDLER > > #define CORE_TB_RESYNC_REQ_BIT 63 > #define MAX_SUBCORE_PER_CORE 4 > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h > index 148303e7771f..625321e7e581 100644 > --- a/arch/powerpc/include/asm/paca.h > +++ b/arch/powerpc/include/asm/paca.h > @@ -183,11 +183,6 @@ struct paca_struct { >*/ > u16 in_mce; > u8 hmi_event_available; /* HMI event is available */ > - /* > - * Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for > - * more details > - */ > - struct sibling_subcore_state *sibling_subcore_state; > #endif > > /* Stuff for accurate time accounting */ > @@ -202,6 +197,11 @@ struct paca_struct { > struct kvmppc_book3s_shadow_vcpu shadow_vcpu; > #endif > struct kvmppc_host_state kvm_hstate; > + /* > + * Bitmap for sibling subcore status. 
See kvm/book3s_hv_ras.c for > + * more details > + */ > + struct sibling_subcore_state *sibling_subcore_state; > #endif > }; > > diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile > index b2027a5cf508..fe4c075bcf50 100644 > --- a/arch/powerpc/kernel/Makefile > +++ b/arch/powerpc/kernel/Makefile > @@ -41,7 +41,7 @@ obj-$(CONFIG_VDSO32)+= vdso32/ > obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o > obj-$(CONFIG_PPC_BOOK3S_64) += cpu_setup_ppc970.o cpu_setup_pa6t.o > obj-$(CONFIG_PPC_BOOK3S_64) += cpu_setup_power.o > -obj-$(CONFIG_PPC_BOOK3S_64) += mce.o mce_power.o hmi.o > +obj-$(CONFIG_PPC_BOOK3S_64) += mce.o mce_power.o > obj-$(CONFIG_PPC_BOOK3E_64) += exceptions-64e.o idle_book3e.o > obj-$(CONFIG_PPC64) += vdso64/ > obj-$(CONFIG_ALTIVEC)+= vecemu.o > diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile > index 1f9e5529e692..855d4b95d752 100644 > --- a/arch/powerpc/kvm/Makefile > +++ b/arch/powerpc/kvm/Makefile > @@ -78,6 +78,7 @@ kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \ > > ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE > kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \ > + book3s_hv_hmi.o \ > book3s_hv_rmhandlers.o \ > book3s_hv_rm_mmu.o \ > book3s_hv_ras.o \ > diff --git a/arch/powerpc/kernel/hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c > similarity index 100% > rename from arch/powerpc/kernel/hmi.c > rename to arch/powerpc/kvm/book3s_hv_hmi.c > -- > 1.8.3.1 > > ___ > Linuxppc-dev mailing list > Linuxppc-dev@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/linuxppc-dev signature.asc Description: PGP signature ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
Hi Arnd,

On Wed, 03 Aug 2016 20:52:48 +0200 Arnd Bergmann wrote:
>
> Most of the difference appears to be in branch trampolines (634 added,
> 559 removed, 14837 unchanged) as you suspect, but I also see a couple
> of symbols show up in vmlinux that were not there before:
>
> -A __crc_dma_noop_ops
> -D dma_noop_ops
> -R __clz_tab
> -r fdt_errtable
> -r __kcrctab_dma_noop_ops
> -r __kstrtab_dma_noop_ops
> -R __ksymtab_dma_noop_ops
> -t dma_noop_alloc
> -t dma_noop_free
> -t dma_noop_map_page
> -t dma_noop_mapping_error
> -t dma_noop_map_sg
> -t dma_noop_supported
> -T fdt_add_reservemap_entry
> -T fdt_begin_node
> -T fdt_create
> -T fdt_create_empty_tree
> -T fdt_end_node
> -T fdt_finish
> -T fdt_finish_reservemap
> -T fdt_property
> -T fdt_resize
> -T fdt_strerror
> -T find_cpio_data
>
> From my first look, it seems that all of lib/*.o is now getting linked
> into vmlinux, while we traditionally leave out everything from lib/
> that is not referenced.

You could try removing the --{,no-}whole-archive arguments to ld in
scripts/link-vmlinux.sh. Last time I did that, though, a whole lot of
stuff failed to be linked in. (Especially stuff only referenced by
EXPORT_SYMBOL()s, but that may have been fixed.)

> I also see a noticeable overhead in link time, the numbers are for
> a cache-hot rebuild after a successful allyesconfig build, using a
> 24-way Opteron@2.5GHz, just relinking vmlinux:

I was afraid of that, but is it offset by the time saved by not doing
the "ld -r"s along the way? It may also be that (for powerpc anyway)
the linker is doing a better job.

--
Cheers,
Stephen Rothwell
Re: [PATCH 00/14] Present useful limits to user (v2)
Hello,

I'm trying the systemtap approach and it looks promising. The script
annotates strace-like output with capability, device-access and RLIMIT
information, and prints a summary at the end. Here's sample output from
a wpa_supplicant run:

mprotect(0x7efebf14, 16384, PROT_READ) = 0 [DATA 548864 -> 573440] [AS 44986368 -> 45002752]
brk(0x55d9611f8000) = 94392125718528 missing [Capabilities=CAP_SYS_ADMIN] [AS 45002752 -> 45010944]
open(0x55d960716462, O_RDWR) = 3 [DeviceAllow=/dev/char/1:3 rw]
open("/dev/random", O_RDONLY|O_NONBLOCK) = 3 [DeviceAllow=/dev/char/1:8 r]
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 4 [RestrictAddressFamilies=AF_UNIX] [NOFILE 3 -> 4]
open("/etc/wpa_supplicant.conf", O_RDONLY) = 5 [NOFILE 4 -> 5]
socket(PF_NETLINK, SOCK_RAW, 0) = 5 [RestrictAddressFamilies=AF_NETLINK]
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 16) = 6 [RestrictAddressFamilies=AF_NETLINK] [NOFILE 5 -> 6]
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 16) = 7 [RestrictAddressFamilies=AF_NETLINK] [NOFILE 6 -> 7]
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 8 [RestrictAddressFamilies=AF_INET] [NOFILE 7 -> 8]
open("/dev/rfkill", O_RDONLY) = 9 [DeviceAllow=/dev/char/10:58 r] [NOFILE 8 -> 9]
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 10 [RestrictAddressFamilies=AF_UNIX] [NOFILE 9 -> 10]
sendmsg(6, 0x7ffc778f35b0, 0x0) = 36 [Capabilities=CAP_NET_ADMIN]

Summary:
CapabilityBoundingSet=CAP_NET_ADMIN CAP_NET_RAW
Consider also missing CapabilityBoundingSet=CAP_SYS_ADMIN
DeviceAllow=/dev/char/1:3 rw
DeviceAllow=/dev/char/1:8 r
DeviceAllow=/dev/char/10:58 r
DeviceAllow=/dev/char/1:9 r
LimitFSIZE=0
LimitDATA=577536
LimitSTACK=139264
LimitCORE=0
LimitNOFILE=15
LimitAS=45146112
LimitNPROC=171
LimitMEMLOCK=0
LimitSIGPENDING=0
LimitMSGQUEUE=0
LimitNICE=0
LimitRTPRIO=0
RestrictAddressFamilies=AF_UNIX AF_INET AF_NETLINK AF_PACKET
MemoryDenyWriteExecute=true

Some values are not correct. NPROC is wrong because staprun needs to be
run as root instead of the separate privileged user for wpa_supplicant,
and that skews the user process count. DATA/AS/STACK seem to be a bit
off. I can easily use this as a systemd service configuration drop-in
otherwise.

Now, the relevant part for the kernel is that I'd like to analyze error
paths better, so the system calls would also be annotated when there's a
failure because an RLIMIT is too tight. It would be easier to insert
probes if there was only one path for RLIMIT checks. Would it be OK to
make the function task_rlimit() a full check against the limit and also
make it a non-inlined function, just for improved probing purposes?
There's already error analysis for the capabilities, but there are some
false positive hits (like brk() complaining about missing CAP_SYS_ADMIN
above).

-Topi

#! /bin/sh
# suppress some run-time errors here for cleaner output
//bin/true && exec stap --suppress-handler-errors --skip-badvars $0 ${1+"$@"}
/*
 * Compile:
 *  stap -p4 -DSTP_NO_OVERLOAD -m strace
 * Run:
 *  /usr/bin/staprun -R -c "/sbin/wpa_supplicant -u -O /run/wpa_supplicant -c /etc/wpa_supplicant.conf -i wlan0" -w /root/strace.ko only_capability_use=1 timestamp=0
 */

/* configuration options; set these with stap -G */
global follow_fork = 0 /* -Gfollow_fork=1 means trace descendant processes too */
global timestamp = 1 /* -Gtimestamp=0 means don't print a syscall timestamp */
global elapsed_time = 0 /* -Gelapsed_time=1 means print a syscall duration too */
global only_capability_use = 0 /* -Gonly_capability_use=1 means print only when capabilities are used */

global thread_argstr%
global thread_time%
global syscalls_nonreturn[2]
global capnames[64]
global used_caps
global missing_caps
global all_used_caps
global all_missing_caps
global accessed_devices[1000]
global all_accessed_devices[1000]
global highwatermark_fsize
global highwatermark_data
global highwatermark_stack
global highwatermark_core
global highwatermark_nproc
global highwatermark_nofile
global highwatermark_memlock
global highwatermark_as
global highwatermark_sigpending
global highwatermark_msgqueue
global highwatermark_nice
global highwatermark_rtprio
global old_highwatermark_fsize
global old_highwatermark_data
global old_highwatermark_stack
global old_highwatermark_core
global old_highwatermark_nproc
global old_highwatermark_nofile
global old_highwatermark_memlock
global old_highwatermark_as
global old_highwatermark_sigpending
global old_highwatermark_msgqueue
global old_highwatermark_nice
global old_highwatermark_rtprio
global afnames[64]
global used_afs
global missing_afs
global all_used_afs
global all_missing_afs
global no_memory_deny_write_execute
global all_memory_deny_write_execute = "true"
global print_syscall

probe begin {
	/* list those syscalls that never .return */
	syscalls_nonreturn["exit"]=1
	syscalls_nonreturn["exit_group"]=1
	// grep '#define CAP_.*[0-9]+$' /usr/src/linux-headers*/include/uapi/linux/capability.h | awk '{ print "capnames[" $3 "] = \"" $2 "\";" }'
	capnames[0] = "CAP_CHOWN";
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
On 08/03/2016 06:34 PM, Benjamin Herrenschmidt wrote:
> I think this is best done by the relevant community maintainer, I just
> threw an idea but I'm not that familiar with the details :-)

Ok, sure; got it.

> Did you send them to the lkml list ?

Yup, plus a few other lists from get_maintainer.pl iirc.

Mailing list archive links:
- linux-kernel: http://marc.info/?l=linux-kernel&m=146798084822100&w=2
- linux-doc: http://marc.info/?l=linux-doc&m=146798085522104&w=2
- linux-nvme: http://lists.infradead.org/pipermail/linux-nvme/2016-July/005349.html
- linuxppc-dev: https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-July/145624.html

Thanks,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center
[PATCH] ibmvfc: Set READ FCP_XFER_READY DISABLED bit in PRLI
The READ FCP_XFER_READY DISABLED bit is required to always be set to one
since FCP-3. Set it in the service parameter page frame during process
login.

Signed-off-by: Tyrel Datwyler
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index ab67ec4..4a680ce 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -3381,6 +3381,7 @@ static void ibmvfc_tgt_send_prli(struct ibmvfc_target *tgt)
 	prli->parms.type = IBMVFC_SCSI_FCP_TYPE;
 	prli->parms.flags = cpu_to_be16(IBMVFC_PRLI_EST_IMG_PAIR);
 	prli->parms.service_parms = cpu_to_be32(IBMVFC_PRLI_INITIATOR_FUNC);
+	prli->parms.service_parms |= cpu_to_be32(IBMVFC_PRLI_READ_FCP_XFER_RDY_DISABLED);

 	ibmvfc_set_tgt_action(tgt, IBMVFC_TGT_ACTION_INIT_WAIT);
 	if (ibmvfc_send_event(evt, vhost, default_timeout)) {
--
2.7.4
[PATCH] ibmvfc: add FC Class 3 Error Recovery support
The ibmvfc driver currently doesn't support FC Class 3 Error Recovery.
However, it is simply a matter of informing the VIOS that the payload
expects to use sequence level error recovery via a bit flag in the
ibmvfc_cmd structure. This patch adds a module parameter to enable error
recovery support at boot time. When enabled, the RETRY service parameter
bit is set during PRLI, and ibmvfc_cmd->flags includes the
IBMVFC_CLASS_3_ERR bit.

Signed-off-by: Tyrel Datwyler
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 10 ++
 drivers/scsi/ibmvscsi/ibmvfc.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 4a680ce..6b92169 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -52,6 +52,7 @@ static unsigned int max_requests = IBMVFC_MAX_REQUESTS_DEFAULT;
 static unsigned int disc_threads = IBMVFC_MAX_DISC_THREADS;
 static unsigned int ibmvfc_debug = IBMVFC_DEBUG;
 static unsigned int log_level = IBMVFC_DEFAULT_LOG_LEVEL;
+static unsigned int cls3_error = IBMVFC_CLS3_ERROR;
 static LIST_HEAD(ibmvfc_head);
 static DEFINE_SPINLOCK(ibmvfc_driver_lock);
 static struct scsi_transport_template *ibmvfc_transport_template;
@@ -86,6 +87,9 @@ MODULE_PARM_DESC(debug, "Enable driver debug information. "
 module_param_named(log_level, log_level, uint, 0);
 MODULE_PARM_DESC(log_level, "Set to 0 - 4 for increasing verbosity of device driver. "
 		 "[Default=" __stringify(IBMVFC_DEFAULT_LOG_LEVEL) "]");
+module_param_named(cls3_error, cls3_error, uint, 0);
+MODULE_PARM_DESC(cls3_error, "Enable FC Class 3 Error Recovery. "
+		 "[Default=" __stringify(IBMVFC_CLS3_ERROR) "]");

 static const struct {
 	u16 status;
@@ -1335,6 +1339,9 @@ static int ibmvfc_map_sg_data(struct scsi_cmnd *scmd,
 	struct srp_direct_buf *data = &vfc_cmd->ioba;
 	struct ibmvfc_host *vhost = dev_get_drvdata(dev);

+	if (cls3_error)
+		vfc_cmd->flags |= cpu_to_be16(IBMVFC_CLASS_3_ERR);
+
 	sg_mapped = scsi_dma_map(scmd);
 	if (!sg_mapped) {
 		vfc_cmd->flags |= cpu_to_be16(IBMVFC_NO_MEM_DESC);
@@ -3383,6 +3390,9 @@ static void ibmvfc_tgt_send_prli(struct ibmvfc_target *tgt)
 	prli->parms.service_parms = cpu_to_be32(IBMVFC_PRLI_INITIATOR_FUNC);
 	prli->parms.service_parms |= cpu_to_be32(IBMVFC_PRLI_READ_FCP_XFER_RDY_DISABLED);

+	if (cls3_error)
+		prli->parms.service_parms |= cpu_to_be32(IBMVFC_PRLI_RETRY);
+
 	ibmvfc_set_tgt_action(tgt, IBMVFC_TGT_ACTION_INIT_WAIT);
 	if (ibmvfc_send_event(evt, vhost, default_timeout)) {
 		vhost->discovery_threads--;
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 8fae032..7f9bb07 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -54,6 +54,7 @@
 #define IBMVFC_DEV_LOSS_TMO	(5 * 60)
 #define IBMVFC_DEFAULT_LOG_LEVEL	2
 #define IBMVFC_MAX_CDB_LEN	16
+#define IBMVFC_CLS3_ERROR	0

 /*
  * Ensure we have resources for ERP and initialization:
--
2.7.4
[PATCH 0/2] ibmvfc: FC-TAPE Support
This patchset introduces optional FC-TAPE/FC Class 3 Error Recovery to the ibmvfc client driver. Tyrel Datwyler (2): ibmvfc: Set READ FCP_XFER_READY DISABLED bit in PRLI ibmvfc: add FC Class 3 Error Recovery support drivers/scsi/ibmvscsi/ibmvfc.c | 11 +++ drivers/scsi/ibmvscsi/ibmvfc.h | 1 + 2 files changed, 12 insertions(+) -- 2.7.4 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
On Wed, 2016-08-03 at 16:39 -0300, Mauricio Faria de Oliveira wrote:
> Hi Ben,
>
> On 06/13/2016 06:26 PM, Benjamin Herrenschmidt wrote:
> >
> > Another option would be to use a dma_attr for silencing mapping
> > errors which NVME could use provided it does handle them gracefully ...
>
> I recently submitted patches that implement your suggestion [1].
> May you please review/comment if they're OK with you?

I think this is best done by the relevant community maintainer, I just
threw an idea but I'm not that familiar with the details :-)

Did you send them to the lkml list ?

> Thanks!
>
> [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/146850.html
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
On Wednesday, August 3, 2016 2:44:29 PM CEST Segher Boessenkool wrote:
> Hi Arnd,
>
> On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote:
> > From my first look, it seems that all of lib/*.o is now getting linked
> > into vmlinux, while we traditionally leave out everything from lib/
> > that is not referenced.
> >
> > I also see a noticeable overhead in link time, the numbers are for
> > a cache-hot rebuild after a successful allyesconfig build, using a
> > 24-way Opteron@2.5GHz, just relinking vmlinux:
> >
> > $ time make skj30 vmlinux   # before
> > real    2m8.092s
> > user    3m41.008s
> > sys     0m48.172s
> >
> > $ time make skj30 vmlinux   # after
> > real    4m10.189s
> > user    5m43.804s
> > sys     0m52.988s
>
> Is it better when using rcT instead of rcsT?

It seems to be noticeably better for the clean rebuild case, though not
as good as the original:

real    3m34.015s
user    5m7.104s
sys     0m49.172s

I've also tried now with my own patch applied as well (linking each
drivers/*/built-in.o into vmlinux rather than having them linked into
drivers/built-in.o first), but that makes no difference.

	Arnd
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
Hi Arnd, On Wed, Aug 03, 2016 at 08:52:48PM +0200, Arnd Bergmann wrote: > From my first look, it seems that all of lib/*.o is now getting linked > into vmlinux, while we traditionally leave out everything from lib/ > that is not referenced. > > I also see a noticeable overhead in link time, the numbers are for > a cache-hot rebuild after a successful allyesconfig build, using a > 24-way Opteron@2.5Ghz, just relinking vmlinux: > > $ time make skj30 vmlinux # before > real 2m8.092s > user 3m41.008s > sys 0m48.172s > > $ time make skj30 vmlinux # after > real 4m10.189s > user 5m43.804s > sys 0m52.988s Is it better when using rcT instead of rcsT? Segher ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: convert 'iommu_alloc failed' messages to dynamic debug
Hi Ben, On 06/13/2016 06:26 PM, Benjamin Herrenschmidt wrote: Another option would be to use a dma_attr for silencing mapping errors which NVME could use provided it does handle them gracefully ... I recently submitted patches that implement your suggestion [1]. May you please review/comment if they're OK with you? Thanks! [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-August/146850.html -- Mauricio Faria de Oliveira IBM Linux Technology Center ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RESEND][PATCH v2 2/2] powerpc/fadump: parse fadump reserve memory size based on memory range
Currently, memory for fadump can be specified with fadump_reserve_mem=size,
where only a fixed size can be specified. Add the below syntax as well, to
support conditional reservation based on system memory size:

	fadump_reserve_mem=ramsize-range:size[,ramsize-range:size,...]

This syntax helps using the same commandline parameter for different
system memory sizes.

Signed-off-by: Hari Bathini
Reviewed-by: Mahesh J Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 64 --
 1 file changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index b3a6633..4661ae6 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -193,6 +193,56 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	return addr;
 }

+/*
+ * This function parses command line for fadump_reserve_mem=
+ *
+ * Supports the below two syntaxes:
+ *    1. fadump_reserve_mem=size
+ *    2. fadump_reserve_mem=ramsize-range:size[,...]
+ *
+ * Sets fw_dump.reserve_bootvar with the memory size
+ * provided, 0 otherwise
+ *
+ * The function returns -EINVAL on failure, 0 otherwise.
+ */
+static int __init parse_fadump_reserve_mem(void)
+{
+	char *name = "fadump_reserve_mem=";
+	char *fadump_cmdline = NULL, *cur;
+
+	fw_dump.reserve_bootvar = 0;
+
+	/* find fadump_reserve_mem and use the last one if there are many */
+	cur = strstr(boot_command_line, name);
+	while (cur) {
+		fadump_cmdline = cur;
+		cur = strstr(cur+1, name);
+	}
+
+	/* when no fadump_reserve_mem= cmdline option is provided */
+	if (!fadump_cmdline)
+		return 0;
+
+	fadump_cmdline += strlen(name);
+
+	/* for fadump_reserve_mem=size cmdline syntax */
+	if (!is_param_range_based(fadump_cmdline)) {
+		fw_dump.reserve_bootvar = memparse(fadump_cmdline, NULL);
+		return 0;
+	}
+
+	/* for fadump_reserve_mem=ramsize-range:size[,...] cmdline syntax */
+	cur = fadump_cmdline;
+	fw_dump.reserve_bootvar = parse_mem_range_size("fadump_reserve_mem",
+					&cur, memblock_phys_mem_size());
+	if (cur == fadump_cmdline) {
+		printk(KERN_INFO "fadump_reserve_mem: Invalid syntax!\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /**
  * fadump_calculate_reserve_size(): reserve variable boot area 5% of System RAM
  *
@@ -212,12 +262,17 @@ static inline unsigned long fadump_calculate_reserve_size(void)
 {
 	unsigned long size;

+	/* sets fw_dump.reserve_bootvar */
+	parse_fadump_reserve_mem();
+
 	/*
 	 * Check if the size is specified through fadump_reserve_mem= cmdline
 	 * option. If yes, then use that.
	 */
 	if (fw_dump.reserve_bootvar)
 		return fw_dump.reserve_bootvar;
+	else
+		printk(KERN_INFO "fadump: calculating default boot size\n");

 	/* divide by 20 to get 5% of value */
 	size = memblock_end_of_DRAM() / 20;
@@ -348,15 +403,6 @@ static int __init early_fadump_param(char *p)
 }
 early_param("fadump", early_fadump_param);

-/* Look for fadump_reserve_mem= cmdline option */
-static int __init early_fadump_reserve_mem(char *p)
-{
-	if (p)
-		fw_dump.reserve_bootvar = memparse(p, &p);
-	return 0;
-}
-early_param("fadump_reserve_mem", early_fadump_reserve_mem);
-
 static void register_fw_dump(struct fadump_mem_struct *fdm)
 {
 	int rc;
[RESEND][PATCH v2 1/2] kexec: refactor code parsing size based on memory range
crashkernel parameter supports different syntaxes to specify the amount
of memory to be reserved for the kdump kernel. Below is one of the
supported syntaxes that needs parsing to find the memory size to
reserve, based on memory range:

	crashkernel=ramsize-range:size[,ramsize-range:size,...]

While such parsing is implemented for the crashkernel parameter, it
applies to other parameters, like fadump_reserve_mem=, which could use
similar syntax. This patch moves crashkernel's parsing code for the
above syntax to kernel/params.c for reuse. Two functions,
is_param_range_based() and parse_mem_range_size(), are added to
kernel/params.c for this purpose. Any parameter that uses the above
syntax can use is_param_range_based() to validate the syntax and
parse_mem_range_size() to get the parsed memory size. While some code is
moved to kernel/params.c, there is no functional change in parsing the
crashkernel parameter.

Signed-off-by: Hari Bathini
---
Changes from v1:
1. Updated changelog

 include/linux/kernel.h |  5 +++
 kernel/kexec_core.c    | 63 +++-
 kernel/params.c        | 96
 3 files changed, 106 insertions(+), 58 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d96a611..2df7ba2 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -435,6 +435,11 @@ extern char *get_options(const char *str, int nints, int *ints);
 extern unsigned long long memparse(const char *ptr, char **retptr);
 extern bool parse_option_str(const char *str, const char *option);

+extern bool __init is_param_range_based(const char *cmdline);
+extern unsigned long long __init parse_mem_range_size(const char *param,
+						      char **str,
+						      unsigned long long system_ram);
+
 extern int core_kernel_text(unsigned long addr);
 extern int core_kernel_data(unsigned long addr);
 extern int __kernel_text_address(unsigned long addr);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 5616755..3a74024 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1104,59 +1104,9 @@ static int __init parse_crashkernel_mem(char *cmdline,
 	char *cur = cmdline, *tmp;

 	/* for each entry of the comma-separated list */
-	do {
-		unsigned long long start, end = ULLONG_MAX, size;
-
-		/* get the start of the range */
-		start = memparse(cur, &tmp);
-		if (cur == tmp) {
-			pr_warn("crashkernel: Memory value expected\n");
-			return -EINVAL;
-		}
-		cur = tmp;
-		if (*cur != '-') {
-			pr_warn("crashkernel: '-' expected\n");
-			return -EINVAL;
-		}
-		cur++;
-
-		/* if no ':' is here, than we read the end */
-		if (*cur != ':') {
-			end = memparse(cur, &tmp);
-			if (cur == tmp) {
-				pr_warn("crashkernel: Memory value expected\n");
-				return -EINVAL;
-			}
-			cur = tmp;
-			if (end <= start) {
-				pr_warn("crashkernel: end <= start\n");
-				return -EINVAL;
-			}
-		}
-
-		if (*cur != ':') {
-			pr_warn("crashkernel: ':' expected\n");
-			return -EINVAL;
-		}
-		cur++;
-
-		size = memparse(cur, &tmp);
-		if (cur == tmp) {
-			pr_warn("Memory value expected\n");
-			return -EINVAL;
-		}
-		cur = tmp;
-		if (size >= system_ram) {
-			pr_warn("crashkernel: invalid size\n");
-			return -EINVAL;
-		}
-
-		/* match ? */
-		if (system_ram >= start && system_ram < end) {
-			*crash_size = size;
-			break;
-		}
-	} while (*cur++ == ',');
+	*crash_size = parse_mem_range_size("crashkernel", &cur, system_ram);
+	if (cur == cmdline)
+		return -EINVAL;

 	if (*crash_size > 0) {
 		while (*cur && *cur != ' ' && *cur != '@')
@@ -1293,7 +1243,6 @@ static int __init __parse_crashkernel(char *cmdline, const char *name,
 			  const char *suffix)
 {
-	char	*first_colon, *first_space;
 	char	*ck_cmdline;

 	BUG_ON(!crash_size || !crash_base);
@@ -1311,12 +1260,10 @@ static int __init __parse_crashkernel(char *cmdline,
 		return parse_crashkernel_suffix(ck_cmdline, crash_size, s
[RESEND][PATCH v2 0/2] powerpc/fadump: support memory range syntax for fadump memory reservation
This patchset adds support to input system memory range based memory size for fadump reservation. The crashkernel parameter already supports such syntax. The first patch refactors the parsing code of crashkernel parameter for reuse. The second patch uses the newly refactored parsing code to reserve memory for fadump based on system memory size. --- Hari Bathini (2): kexec: refactor code parsing size based on memory range powerpc/fadump: parse fadump reserve memory size based on memory range arch/powerpc/kernel/fadump.c | 64 include/linux/kernel.h |5 ++ kernel/kexec_core.c | 63 ++-- kernel/params.c | 96 ++ 4 files changed, 161 insertions(+), 67 deletions(-) ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] lkdtm: Mark lkdtm_rodata_do_nothing() notrace
On Tue, Aug 2, 2016 at 9:59 PM, Michael Ellerman wrote:
> lkdtm_rodata_do_nothing() is an empty function which is generated in
> order to test the non-executability of rodata.
>
> Currently if function tracing is enabled then an mcount callsite will be
> generated for lkdtm_rodata_do_nothing(), and it will appear in the list
> of available functions for function tracing (available_filter_functions).
>
> Given its purpose purely as a test function, it seems preferable for
> lkdtm_rodata_do_nothing() to be marked notrace, so it doesn't appear as
> traceable.
>
> This also avoids triggering a linker bug on powerpc:
>
>   https://sourceware.org/bugzilla/show_bug.cgi?id=20428
>
> When the linker sees code that needs to generate a call stub, eg. a
> branch to mcount(), it assumes the section is executable and
> dereferences a NULL pointer leading to a linker segfault. Marking
> lkdtm_rodata_do_nothing() notrace avoids triggering the bug because the
> function contains no other function calls.
>
> Signed-off-by: Michael Ellerman

Awesome! Thanks for tracking this down. I've applied it to my tree, it
should get picked up by Greg on my next pull request.

-Kees

> ---
>  drivers/misc/lkdtm_rodata.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/misc/lkdtm_rodata.c b/drivers/misc/lkdtm_rodata.c
> index 166b1db3969f..3564477b8c2d 100644
> --- a/drivers/misc/lkdtm_rodata.c
> +++ b/drivers/misc/lkdtm_rodata.c
> @@ -4,7 +4,7 @@
>   */
>  #include "lkdtm.h"
>
> -void lkdtm_rodata_do_nothing(void)
> +void notrace lkdtm_rodata_do_nothing(void)
>  {
>  	/* Does nothing. We just want an architecture agnostic "return". */
>  }
> --
> 2.7.4

--
Kees Cook
Brillo & Chrome OS Security
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
On Thursday, August 4, 2016 1:37:29 AM CEST Nicholas Piggin wrote:
>
> I've attached what I'm using, which builds and runs for me without
> any work. Your arch obviously has to select the option to use it.
>
>     text     data     bss      dec     hex filename
> 11196784  1185024 1923820 14305628  da495c vmlinuxppc64.before
> 11187536  1181848 1923176 14292560  da1650 vmlinuxppc64.after
>
> ~9K text saving, ~3K data saving. I assume this comes from fewer
> branch trampolines and toc entries, but haven't verified exactly.

The patch seems to work great, but for me it's getting bigger (compared
to my older patch, mainline allyesconfig doesn't build):

    text     data      bss       dec     hex filename
51299868 42599559 23362148 117261575 6fd4507 vmlinuxarm.before
51302545 42595015 23361884 117259444 6fd3cb4 vmlinuxarm.after

Most of the difference appears to be in branch trampolines (634 added,
559 removed, 14837 unchanged) as you suspect, but I also see a couple
of symbols show up in vmlinux that were not there before:

-A __crc_dma_noop_ops
-D dma_noop_ops
-R __clz_tab
-r fdt_errtable
-r __kcrctab_dma_noop_ops
-r __kstrtab_dma_noop_ops
-R __ksymtab_dma_noop_ops
-t dma_noop_alloc
-t dma_noop_free
-t dma_noop_map_page
-t dma_noop_mapping_error
-t dma_noop_map_sg
-t dma_noop_supported
-T fdt_add_reservemap_entry
-T fdt_begin_node
-T fdt_create
-T fdt_create_empty_tree
-T fdt_end_node
-T fdt_finish
-T fdt_finish_reservemap
-T fdt_property
-T fdt_resize
-T fdt_strerror
-T find_cpio_data

From my first look, it seems that all of lib/*.o is now getting linked
into vmlinux, while we traditionally leave out everything from lib/
that is not referenced.

I also see a noticeable overhead in link time, the numbers are for
a cache-hot rebuild after a successful allyesconfig build, using a
24-way Opteron@2.5GHz, just relinking vmlinux:

$ time make skj30 vmlinux   # before
real    2m8.092s
user    3m41.008s
sys     0m48.172s

$ time make skj30 vmlinux   # after
real    4m10.189s
user    5m43.804s
sys     0m52.988s

That is clearly a very sharp difference. Fortunately for the defconfig
build, the times are much lower, and I see no real difference other
than the noise between subsequent runs:

$ time make skj30 vmlinux   # before
real    0m5.415s
user    0m19.716s
sys     0m9.356s

$ time make skj30 vmlinux   # before
real    0m9.536s
user    0m21.320s
sys     0m9.224s

$ time make skj30 vmlinux   # after
real    0m5.539s
user    0m20.360s
sys     0m9.224s

$ time make skj30 vmlinux   # after
real    0m9.138s
user    0m21.932s
sys     0m8.988s

$ time make skj30 vmlinux   # after
real    0m5.659s
user    0m20.332s
sys     0m9.620s

	Arnd
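The before/after symbol comparison done here boils down to a sorted-set difference between two `nm` listings. A minimal coreutils sketch, using stand-in symbol lists in place of real vmlinux binaries:

```shell
#!/bin/sh
# Stand-in symbol lists; in practice these would come from something like
#   nm vmlinux.before | awk '{print $NF}' | sort > before.syms
printf '%s\n' fdt_create init_task | sort > before.syms
printf '%s\n' dma_noop_ops fdt_create init_task | sort > after.syms

# Symbols present only in the "after" build, i.e. newly linked in.
comm -13 before.syms after.syms
# prints: dma_noop_ops
```

`comm` requires both inputs sorted with the same collation, which piping both through the same `sort` guarantees.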
Re: [PATCH 1/2] mm: Allow disabling deferred struct page initialisation
On 08/02/2016 11:38 PM, Srikar Dronamraju wrote: > * Dave Hansen [2016-08-02 11:09:21]: >> On 08/02/2016 06:19 AM, Srikar Dronamraju wrote: >>> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise >>> only certain size memory per node. The certain size takes into account >>> the dentry and inode cache sizes. However such a kernel when booting a >>> secondary kernel will not be able to allocate the required amount of >>> memory to suffice for the dentry and inode caches. This results in >>> crashes like the below on large systems such as 32 TB systems. >> >> What's a "secondary kernel"? >> > I mean the kernel thats booted to collect the crash, On fadump, the > first kernel acts as the secondary kernel i.e the same kernel is booted > to collect the crash. OK, but I'm still not seeing what the problem is. You've said that it crashes and that it crashes during inode/dentry cache allocation. But, *why* does the same kernel image crash in when it is used as a "secondary kernel"? ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v2] powernv: Simplify searching for compatible device nodes
(rebased on powerpc/next)

This condenses the opal node searching into a single function that finds
all compatible nodes, instead of just searching the ibm,opal children,
for ipmi, flash, and prd, similar to how opal-i2c nodes are found.

Signed-off-by: Jack Miller
---
 arch/powerpc/platforms/powernv/opal.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 8b4fc68..9db12ce 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -631,21 +631,11 @@ static void __init opal_dump_region_init(void)
 			"rc = %d\n", rc);
 }

-static void opal_pdev_init(struct device_node *opal_node,
-			   const char *compatible)
+static void opal_pdev_init(const char *compatible)
 {
 	struct device_node *np;

-	for_each_child_of_node(opal_node, np)
-		if (of_device_is_compatible(np, compatible))
-			of_platform_device_create(np, NULL, NULL);
-}
-
-static void opal_i2c_create_devs(void)
-{
-	struct device_node *np;
-
-	for_each_compatible_node(np, NULL, "ibm,opal-i2c")
+	for_each_compatible_node(np, NULL, compatible)
 		of_platform_device_create(np, NULL, NULL);
 }

@@ -717,7 +707,7 @@ static int __init opal_init(void)
 	opal_hmi_handler_init();

 	/* Create i2c platform devices */
-	opal_i2c_create_devs();
+	opal_pdev_init("ibm,opal-i2c");

 	/* Setup a heatbeat thread if requested by OPAL */
 	opal_init_heartbeat();
@@ -752,12 +742,12 @@ static int __init opal_init(void)
 	}

 	/* Initialize platform devices: IPMI backend, PRD & flash interface */
-	opal_pdev_init(opal_node, "ibm,opal-ipmi");
-	opal_pdev_init(opal_node, "ibm,opal-flash");
-	opal_pdev_init(opal_node, "ibm,opal-prd");
+	opal_pdev_init("ibm,opal-ipmi");
+	opal_pdev_init("ibm,opal-flash");
+	opal_pdev_init("ibm,opal-prd");

 	/* Initialise platform device: oppanel interface */
-	opal_pdev_init(opal_node, "ibm,opal-oppanel");
+	opal_pdev_init("ibm,opal-oppanel");

 	/* Initialise OPAL kmsg dumper for flushing console on panic */
 	opal_kmsg_init();
--
2.9.2
[PATCH] powernv: Simplify searching for compatible device nodes
This condenses the opal node searching into a single function that finds
all compatible nodes, instead of just searching the ibm,opal children,
for ipmi, flash, and prd, similar to how opal-i2c nodes are found.

Signed-off-by: Jack Miller
---
 arch/powerpc/platforms/powernv/opal.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index ae29eaf..86b7352 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -640,21 +640,11 @@ static void __init opal_dump_region_init(void)
 			"rc = %d\n", rc);
 }

-static void opal_pdev_init(struct device_node *opal_node,
-			   const char *compatible)
+static void opal_pdev_init(const char *compatible)
 {
 	struct device_node *np;

-	for_each_child_of_node(opal_node, np)
-		if (of_device_is_compatible(np, compatible))
-			of_platform_device_create(np, NULL, NULL);
-}
-
-static void opal_i2c_create_devs(void)
-{
-	struct device_node *np;
-
-	for_each_compatible_node(np, NULL, "ibm,opal-i2c")
+	for_each_compatible_node(np, NULL, compatible)
 		of_platform_device_create(np, NULL, NULL);
 }

@@ -722,7 +712,7 @@ static int __init opal_init(void)
 	opal_hmi_handler_init();

 	/* Create i2c platform devices */
-	opal_i2c_create_devs();
+	opal_pdev_init("ibm,opal-i2c");

 	/* Setup a heatbeat thread if requested by OPAL */
 	opal_init_heartbeat();
@@ -754,9 +744,9 @@ static int __init opal_init(void)
 	}

 	/* Initialize platform devices: IPMI backend, PRD & flash interface */
-	opal_pdev_init(opal_node, "ibm,opal-ipmi");
-	opal_pdev_init(opal_node, "ibm,opal-flash");
-	opal_pdev_init(opal_node, "ibm,opal-prd");
+	opal_pdev_init("ibm,opal-ipmi");
+	opal_pdev_init("ibm,opal-flash");
+	opal_pdev_init("ibm,opal-prd");

 	/* Initialise OPAL kmsg dumper for flushing console on panic */
 	opal_kmsg_init();
--
2.9.2
Re: [PATCH] powernv: Search for new flash DT node location
On Wed, Aug 03, 2016 at 05:16:34PM +1000, Michael Ellerman wrote: > We could instead just search for all nodes that are compatible with > "ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs(). > > Is there a particular reason not to do that? I'm actually surprised that this is preferred. Jeremy mentioned something similar, but I guess I just don't like the idea of finding devices in weird places in the tree. Then again, if we can't trust the DT we're in bigger trouble than erroneous flash nodes =). If we really just want to find compatible nodes anywhere, let's simplify i2c and pdev_init into one function and make that behavior consistent with this new patch. - Jack
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
On Wed, 03 Aug 2016 14:29:13 +0200 Arnd Bergmann wrote: > On Wednesday, August 3, 2016 10:19:11 PM CEST Stephen Rothwell wrote: > > Hi Arnd, > > > > On Wed, 03 Aug 2016 09:52:23 +0200 Arnd Bergmann wrote: > > > > > > Using a different way to link the kernel would also help us with > > > the remaining allyesconfig problem on ARM, as the problem is only in > > > 'ld -r' not producing trampolines for symbols that later cannot get > > > them any more. It would probably also help building with ld.gold, > > > which is currently not working. > > > > > > What is your suggested alternative? > > > > I have a patch that make the built-in.o files into thin archives (same > > as archives, but the actual objects are replaced with the name of the > > original object file). That way the final link has all the original > > objects. I haven't checked to see what the overheads of doing it this > > way is. > > > > Nick Piggin has just today taken my old patch (it was last rebased to > > v4.4-rc1) and tried it on a recent kernel and it still seems to mostly > > work. It probably needs some tidying up, but you are welcome to test > > it if you want to. > > Sure, I'll certainly give it a try on ARM when you send me a copy. I've attached what I'm using, which builds and runs for me without any work. Your arch obviously has to select the option to use it. text data bss dec hex filename 11196784 1185024 1923820 14305628 da495c vmlinuxppc64.before 11187536 1181848 1923176 14292560 da1650 vmlinuxppc64.after ~9K text saving, ~3K data saving. I assume this comes from fewer branch trampolines and toc entries, but haven't verified exactly. commit 8bc3ca4798c215e9a9107b6d44408f0af259f84f Author: Stephen Rothwell Date: Tue Oct 30 12:14:18 2012 +1100 kbuild: allow architectures to use thin archives instead of ld -r Alan Modra has been trying to convince the kernel developers that ld -r is "evil" for many years. 
This is an alternative and means that the linker has much more information available to it when it links the kernel. Signed-off-by: Stephen Rothwell diff --git a/arch/Kconfig b/arch/Kconfig index d794384..1330bf4 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -424,6 +424,12 @@ config CC_STACKPROTECTOR_STRONG endchoice +config THIN_ARCHIVES + bool + help + Select this if the architecture wants to use thin archives + instead of ld -r to create the built-in.o files. + config HAVE_CONTEXT_TRACKING bool help diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 0d1ca5b..bbf60b3 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -358,10 +358,15 @@ $(sort $(subdir-obj-y)): $(subdir-ym) ; # Rule to compile a set of .o files into one .o file # ifdef builtin-target +ifdef CONFIG_THIN_ARCHIVES + cmd_make_builtin = rm -f $@; $(AR) rcsT$(KBUILD_ARFLAGS) +else + cmd_make_builtin = $(LD) $(ld_flags) -r -o +endif quiet_cmd_link_o_target = LD $@ # If the list of objects to link is empty, just create an empty built-in.o cmd_link_o_target = $(if $(strip $(obj-y)),\ - $(LD) $(ld_flags) -r -o $@ $(filter $(obj-y), $^) \ + $(cmd_make_builtin) $@ $(filter $(obj-y), $^) \ $(cmd_secanalysis),\ rm -f $@; $(AR) rcs$(KBUILD_ARFLAGS) $@) diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh index f0f6d9d..ef4658f 100755 --- a/scripts/link-vmlinux.sh +++ b/scripts/link-vmlinux.sh @@ -41,8 +41,14 @@ info() # ${1} output file modpost_link() { - ${LD} ${LDFLAGS} -r -o ${1} ${KBUILD_VMLINUX_INIT} \ - --start-group ${KBUILD_VMLINUX_MAIN} --end-group + local objects + + if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then + objects="--whole-archive ${KBUILD_VMLINUX_INIT} ${KBUILD_VMLINUX_MAIN} --no-whole-archive" + else + objects="${KBUILD_VMLINUX_INIT} --start-group ${KBUILD_VMLINUX_MAIN} --end-group" + fi + ${LD} ${LDFLAGS} -r -o ${1} ${objects} } # Link of vmlinux @@ -51,11 +57,16 @@ modpost_link() vmlinux_link() { local lds="${objtree}/${KBUILD_LDS}" + local objects 
if [ "${SRCARCH}" != "um" ]; then + if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then + objects="--whole-archive ${KBUILD_VMLINUX_INIT} ${KBUILD_VMLINUX_MAIN} --no-whole-archive" + else + objects="${KBUILD_VMLINUX_INIT} --start-group ${KBUILD_VMLINUX_MAIN} --end-group" + fi ${LD} ${LDFLAGS} ${LDFLAGS_vmlinux} -o ${2} \ - -T ${lds} ${KBUILD_VMLINUX_INIT} \ - --start-group ${KBUILD_VMLINUX_MAIN} --end-group ${1} + -T ${lds} ${objects} ${1} else ${CC} ${CFLAGS_vmlinux} -o
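The thin-archive mechanism the patch wires into kbuild can be demonstrated with GNU `ar` alone: the `T` modifier stores only member paths instead of copying object contents, which is why the final link still sees every original object. A minimal sketch, assuming GNU binutils (plain files stand in for `.o` files — `ar` does not care):

```shell
cd "$(mktemp -d)"
head -c 4096 /dev/zero > a.o
head -c 4096 /dev/zero > b.o
# Conventional archive: member contents are copied into the archive.
ar rcs  regular.a a.o b.o
# Thin archive (the T modifier, as in the patch's "$(AR) rcsT"):
# only the member names/paths are recorded.
ar rcsT thin.a    a.o b.o
ls -l regular.a thin.a   # thin.a is far smaller than regular.a
ar t thin.a              # member listing is identical to a regular archive
```

The listing is the point: to the linker a thin archive behaves like a normal one, but resolving a member goes back to the original object file on disk.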
Re: [v4] Fix to avoid IS_ERR_VALUE and IS_ERR abuses on 64bit systems.
On Wednesday 03 August 2016 01:27 AM, Scott Wood wrote: On 08/02/2016 10:34 AM, arvind Yadav wrote: On Tuesday 02 August 2016 01:15 PM, Arnd Bergmann wrote: On Monday, August 1, 2016 4:55:43 PM CEST Scott Wood wrote: On 08/01/2016 02:02 AM, Arnd Bergmann wrote: diff --git a/include/linux/err.h b/include/linux/err.h index 1e35588..c2a2789 100644 --- a/include/linux/err.h +++ b/include/linux/err.h @@ -18,7 +18,17 @@ #ifndef __ASSEMBLY__ -#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO) +#define IS_ERR_VALUE(x) unlikely(is_error_check(x)) + +static inline int is_error_check(unsigned long error) Please leave the existing macro alone. I think you were looking for something specific to the return code of qe_muram_alloc() function, so please add a helper in that subsystem if you need it, not in the generic header files. qe_muram_alloc (a.k.a. cpm_muram_alloc) returns unsigned long. The problem is certain callers that store the return value in a u32. Why not just fix those callers to store it in unsigned long (at least until error checking is done)? Yes, that would also address another problem with code like kfree((void *)ugeth->tx_bd_ring_offset[i]); which is not 64-bit safe when tx_bd_ring_offset is a 32-bit value that also holds the return value of qe_muram_alloc. Well, hopefully it doesn't hold a return of qe_muram_alloc() when it's being passed to kfree()... There's also the code that casts kmalloc()'s return to u32, etc. ucc_geth is not 64-bit clean in general. Arnd Yes, we will fix caller. Caller api is not safe on 64bit. The API is fine (or at least, I haven't seen a valid issue pointed out yet). The problem is the ucc_geth driver. Even qe_muram_addr(a.k.a. cpm_muram_addr )passing value unsigned int, but it should be unsigned long. cpm_muram_addr takes unsigned long as a parameter, not that it matters since you can't pass errors into it and a muram offset should never exceed 32 bits. 
-Scott Yes, it will work on a 32-bit machine, but it is not safe on 64-bit. Example: ugeth->tx_bd_ring_offset[j] = qe_muram_alloc(length, UCC_GETH_TX_BD_RING_ALIGNMENT); if (!IS_ERR_VALUE(ugeth->tx_bd_ring_offset[j])) ugeth->p_tx_bd_ring[j] = (u8 __iomem *) qe_muram_addr(ugeth-> tx_bd_ring_offset[j]); If qe_muram_alloc() returns an error, IS_ERR_VALUE() on the truncated value will always evaluate to 0 (the check never fires for an 'unsigned int'), so qe_muram_addr() will then return a wrong virtual address, which can cause an error. -Arvind
Re: [PATCH] powerpc/eeh: trivial fix to non-conventional PCI address output on EEH log
On 07/24/2016 10:46 PM, Gavin Shan wrote: On Mon, Jul 25, 2016 at 10:47:13AM +1000, Michael Ellerman wrote: "Guilherme G. Piccoli" writes: This is a very minor/trivial fix for the output of PCI address on EEH logs. The PCI address in the "OF node" field currently uses ":" as the separator for the function, but the usual separator is ".". This patch changes the separator to a dot, so the PCI address is printed as usual. No functional changes were introduced. What consumes the log? Can it cope with us changing the formatting? The log is printed by pr_warn() as part of the EEH kernel log. Also, it's the argument passed to the RTAS call "ibm,slot-error-detail" and it's put into the user data section of the RTAS call's output, which is then used by the RTAS daemon (rtasd). I don't see anyone expecting a fixed format for it in the user data section. The format was already adjusted once, in commit 0ed352dddbfc ("powerpc/eeh: Reduce lines of log dump") on Jul 17 2014. No complaints have been received against it so far. I guess nobody cares about the format, or an alarm just hasn't been raised yet :) Thanks, Gavin Quick follow-up on this: the RTAS daemon stores the information captured via ibm,slot-error-detail in a log file, which can be accessed using the command "rtas_dump -f /var/log/platform". More information on this can be found in https://www.ibm.com/support/knowledgecenter/linuxonibm/liaau/liaau-diagnosing-rtas-events.htm . I was able to check this log and the EEH PCI address output was there, in ASCII text format. Thanks, Guilherme
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
On Wednesday, August 3, 2016 10:19:11 PM CEST Stephen Rothwell wrote: > Hi Arnd, > > On Wed, 03 Aug 2016 09:52:23 +0200 Arnd Bergmann wrote: > > > > Using a different way to link the kernel would also help us with > > the remaining allyesconfig problem on ARM, as the problem is only in > > 'ld -r' not producing trampolines for symbols that later cannot get > > them any more. It would probably also help building with ld.gold, > > which is currently not working. > > > > What is your suggested alternative? > > I have a patch that make the built-in.o files into thin archives (same > as archives, but the actual objects are replaced with the name of the > original object file). That way the final link has all the original > objects. I haven't checked to see what the overheads of doing it this > way is. > > Nick Piggin has just today taken my old patch (it was last rebased to > v4.4-rc1) and tried it on a recent kernel and it still seems to mostly > work. It probably needs some tidying up, but you are welcome to test > it if you want to. Sure, I'll certainly give it a try on ARM when you send me a copy. Arnd
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
Hi Arnd, On Wed, 03 Aug 2016 09:52:23 +0200 Arnd Bergmann wrote: > > Using a different way to link the kernel would also help us with > the remaining allyesconfig problem on ARM, as the problem is only in > 'ld -r' not producing trampolines for symbols that later cannot get > them any more. It would probably also help building with ld.gold, > which is currently not working. > > What is your suggested alternative? I have a patch that makes the built-in.o files into thin archives (the same as normal archives, but the actual objects are replaced with the names of the original object files). That way the final link has all the original objects. I haven't checked what the overhead of doing it this way is. Nick Piggin has just today taken my old patch (it was last rebased to v4.4-rc1) and tried it on a recent kernel and it still seems to mostly work. It probably needs some tidying up, but you are welcome to test it if you want to. -- Cheers, Stephen Rothwell
Re: linker tables on powerpc - build issues
"Luis R. Rodriguez" writes: > I've run into a few compilation issues with linker tables support [0] > [1] on only a few architectures: > > blackfin - compiler issue it seems, I have a work around now in place > arm - some alignment issue - still need to iron this out > powerpc - issue with including on > > The issue with powerpc can be replicated easily with the patch below, > and compilation fails even on a 'make defconfig' configuration, the > issues are recurring include header ordering issues. I've given this > some tries to fix but am still a bit bewildered how to best do this > without affecting non-powerpc compilations. The patch below > replicates the changes in question, it does not include the linker > table work at all, it just includes instead of > to reduce and provide an example of the issues > observed. The list of errors are also pretty endless... so was hoping > some power folks might be able to take a glance if possible. If you > have any ideas, please let me know. What is the end goal? You want to be able to include asm/sections.h in asm/jump_labels.h? So that you can get some macros to wrap the pushsection etc, am I right? The biggest problem I see is dereference_function_descriptor(), which uses probe_kernel(), which pulls in uaccess.h. But it doesn't really make sense for dereference_function_descriptor() to be in sections.h AFAICS. I'll see if I can unstitch it tomorrow. cheers ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH] powerpc: move hmi.c to arch/powerpc/kvm/
hmi.c functions are unused unless sibling_subcore_state is nonzero, and that in turn happens only if KVM is in use. So move the code to arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_64_HANDLER rather than CONFIG_PPC_BOOK3S_64. The sibling_subcore_state is also included in struct paca_struct only if KVM is supported by the kernel. Cc: Paul Mackerras Cc: Michael Ellerman Cc: Mahesh Salgaonkar Cc: linuxppc-dev@lists.ozlabs.org Cc: kvm-...@vger.kernel.org Cc: k...@vger.kernel.org Signed-off-by: Paolo Bonzini --- It would be nice to have this in 4.8, to minimize any 4.9 conflicts. Build-tested only, with and without KVM enabled. arch/powerpc/include/asm/hmi.h | 2 +- arch/powerpc/include/asm/paca.h| 10 +- arch/powerpc/kernel/Makefile | 2 +- arch/powerpc/kvm/Makefile | 1 + arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} | 0 5 files changed, 8 insertions(+), 7 deletions(-) rename arch/powerpc/{kernel/hmi.c => kvm/book3s_hv_hmi.c} (100%) diff --git a/arch/powerpc/include/asm/hmi.h b/arch/powerpc/include/asm/hmi.h index 88b4901ac4ee..d3b6ad6e137c 100644 --- a/arch/powerpc/include/asm/hmi.h +++ b/arch/powerpc/include/asm/hmi.h @@ -21,7 +21,7 @@ #ifndef __ASM_PPC64_HMI_H__ #define __ASM_PPC64_HMI_H__ -#ifdef CONFIG_PPC_BOOK3S_64 +#ifdef CONFIG_KVM_BOOK3S_64_HANDLER #defineCORE_TB_RESYNC_REQ_BIT 63 #define MAX_SUBCORE_PER_CORE 4 diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 148303e7771f..625321e7e581 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -183,11 +183,6 @@ struct paca_struct { */ u16 in_mce; u8 hmi_event_available; /* HMI event is available */ - /* -* Bitmap for sibling subcore status. 
See kvm/book3s_hv_ras.c for -* more details -*/ - struct sibling_subcore_state *sibling_subcore_state; #endif /* Stuff for accurate time accounting */ @@ -202,6 +197,11 @@ struct paca_struct { struct kvmppc_book3s_shadow_vcpu shadow_vcpu; #endif struct kvmppc_host_state kvm_hstate; + /* +* Bitmap for sibling subcore status. See kvm/book3s_hv_ras.c for +* more details +*/ + struct sibling_subcore_state *sibling_subcore_state; #endif }; diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index b2027a5cf508..fe4c075bcf50 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -41,7 +41,7 @@ obj-$(CONFIG_VDSO32) += vdso32/ obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_ppc970.o cpu_setup_pa6t.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_power.o -obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o hmi.o +obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o obj-$(CONFIG_PPC_BOOK3E_64)+= exceptions-64e.o idle_book3e.o obj-$(CONFIG_PPC64)+= vdso64/ obj-$(CONFIG_ALTIVEC) += vecemu.o diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 1f9e5529e692..855d4b95d752 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -78,6 +78,7 @@ kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \ ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \ + book3s_hv_hmi.o \ book3s_hv_rmhandlers.o \ book3s_hv_rm_mmu.o \ book3s_hv_ras.o \ diff --git a/arch/powerpc/kernel/hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c similarity index 100% rename from arch/powerpc/kernel/hmi.c rename to arch/powerpc/kvm/book3s_hv_hmi.c -- 1.8.3.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2/2] fadump: Disable deferred page struct initialisation
Vlastimil Babka writes: > On 08/03/2016 07:20 AM, Balbir Singh wrote: >> On Tue, 2016-08-02 at 18:49 +0530, Srikar Dronamraju wrote: >>> Fadump kernel reserves significant number of memory blocks. On a multi-node >>> machine, with CONFIG_DEFFERRED_STRUCT_PAGE support, fadump kernel fails to >>> boot. Fix this by disabling deferred page struct initialisation. >>> >> >> How much memory does a fadump kernel need? Can we bump up the limits >> depending >> on the config. I presume when you say fadump kernel you mean kernel with >> FADUMP in the config? >> >> BTW, I would much rather prefer a config based solution that does not select >> DEFERRED_INIT if FADUMP is enabled. > > IIRC the kdump/fadump kernel is typically the same vmlinux as the main > kernel, just with special initrd and boot params. So if you want > deferred init for the main kernel, this would be impractical. Yes. Distros won't build a separate kernel, so it has to work at runtime. cheers
Re: [pasemi] Radeon HD graphics card not recognised after the powerpc-4.8-1 commit
On Wed, 2016-08-03 at 11:03 +0200, Christian Zigotzky wrote: > I reverted the commit "powerpc-4.8-1" and Xorg works. The commit > "powerpc-4.8-1" is the problem. > > Link: > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bad60e6f259a01cf9f29a1ef8d435ab6c60b2de9 > > Which source code modification in the commit "powerpc-4.8-1" could be > the problem? This is a merge, not a commit. Can you bisect down that branch ? Also include the kernel dmesg log. Cheers, Ben.
Re: [PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit
On Wed, 3 Aug 2016 18:40:47 +1000 Alexey Kardashevskiy wrote: > At the moment VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when > the userspace starts using VFIO. When the userspace process finishes, > all the pinned pages need to be put; this is done as a part of > the userspace memory context (MM) destruction which happens on > the very last mmdrop(). > > This approach has a problem that a MM of the userspace process > may live longer than the userspace process itself as kernel threads > use userspace process MMs which were running on a CPU where > the kernel thread was scheduled to. If this happened, the MM remains > referenced until this exact kernel thread wakes up again > and releases the very last reference to the MM; on an idle system this > can take even hours. > > This references and caches MM once per container and adds tracking > how many times each preregistered area was registered in > a specific container. This way we do not depend on @current pointing to > a valid task descriptor. > > This changes the userspace interface to return EBUSY if memory is > already registered (mm_iommu_get() used to increment the counter); > however it should not have any practical effect as the only > userspace tool available now does register memory area once per > container anyway. > > As tce_iommu_register_pages/tce_iommu_unregister_pages are called > under container->lock, this does not need additional locking. > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: Nicholas Piggin
Re: [PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
On Wed, 3 Aug 2016 18:40:46 +1000 Alexey Kardashevskiy wrote: > In some situations the userspace memory context may live longer than > the userspace process itself so if we need to do proper memory context > cleanup, we better cache @mm and use it later when the process is gone > (@current or @current->mm are NULL). > > This changes mm_iommu_xxx API to receive mm_struct instead of using one > from @current. > > This is needed by the following patch to do proper cleanup in time. > This depends on "powerpc/powernv/ioda: Fix endianness when reading TCEs" > to do proper cleanup via tce_iommu_clear() patch. > > To keep API consistent, this replaces mm_context_t with mm_struct; > we stick to mm_struct as mm_iommu_adjust_locked_vm() helper needs > access to &mm->mmap_sem. > > This should cause no behavioral change. > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: Nicholas Piggin I still have some questions about the use of mm in the driver, but those aren't issues introduced by this patch, so as it is I think the bug fix of this and the next patch is good.
[pasemi] Radeon HD graphics card not recognised after the powerpc-4.8-1 commit
Hello, I tried to compile the latest Git kernel today. It boots but Xorg doesn't work anymore. [41.210] (++) using VT number 7 [41.341] (II) [KMS] Kernel modesetting enabled. [41.341] (EE) No devices detected. [41.341] (EE) Fatal server error: [41.341] (EE) no screens found(EE) [41.341] (EE) I reverted the commit "powerpc-4.8-1" and Xorg works. The commit "powerpc-4.8-1" is the problem. Link: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bad60e6f259a01cf9f29a1ef8d435ab6c60b2de9 Which source code modification in the commit "powerpc-4.8-1" could be the problem? Cheers, Christian
[PATCH kernel 09/15] powerpc/mmu: Add real mode support for IOMMU preregistered memory
This makes mm_iommu_lookup() able to work in realmode by replacing list_for_each_entry_rcu() (which can do debug stuff which can fail in real mode) with list_for_each_entry_lockless(). This adds realmode version of mm_iommu_ua_to_hpa() which adds explicit vmalloc'd-to-linear address conversion. Unlike mm_iommu_ua_to_hpa(), mm_iommu_ua_to_hpa_rm() can fail. This changes mm_iommu_preregistered() to receive @mm as in real mode @current does not always have a correct pointer. This adds realmode version of mm_iommu_lookup() which receives @mm (for the same reason as for mm_iommu_preregistered()) and uses lockless version of list_for_each_entry_rcu(). Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/mmu_context.h | 4 arch/powerpc/mm/mmu_context_iommu.c| 39 ++ 2 files changed, 43 insertions(+) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index a4c4ed5..939030c 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -27,10 +27,14 @@ extern long mm_iommu_put(struct mm_struct *mm, extern void mm_iommu_init(struct mm_struct *mm); extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, unsigned long ua, unsigned long size); +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm( + struct mm_struct *mm, unsigned long ua, unsigned long size); extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, unsigned long ua, unsigned long entries); extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned long *hpa); +extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa); extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem); extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem); #endif diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index 10f01fe..36a906c 
100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -242,6 +242,25 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, } EXPORT_SYMBOL_GPL(mm_iommu_lookup); +struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm, + unsigned long ua, unsigned long size) +{ + struct mm_iommu_table_group_mem_t *mem, *ret = NULL; + + list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list, + next) { + if ((mem->ua <= ua) && + (ua + size <= mem->ua + +(mem->entries << PAGE_SHIFT))) { + ret = mem; + break; + } + } + + return ret; +} +EXPORT_SYMBOL_GPL(mm_iommu_lookup_rm); + struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, unsigned long ua, unsigned long entries) { @@ -273,6 +292,26 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, } EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa); +long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa) +{ + const long entry = (ua - mem->ua) >> PAGE_SHIFT; + void *va = &mem->hpas[entry]; + unsigned long *ra; + + if (entry >= mem->entries) + return -EFAULT; + + ra = (void *) vmalloc_to_phys(va); + if (!ra) + return -EFAULT; + + *hpa = *ra | (ua & ~PAGE_MASK); + + return 0; +} +EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa_rm); + long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem) { if (atomic64_inc_not_zero(&mem->mapped)) -- 2.5.0.rc3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel 08/15] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
So far iommu_table obejcts were only used in virtual mode and had a single owner. We are going to change by implementing in-kernel acceleration of DMA mapping requests, including real mode. This adds a kref to iommu_table and defines new helpers to update it. This replaces iommu_free_table() with iommu_table_put() and makes iommu_free_table() static. iommu_table_get() is not used in this patch but will be in the following one. While we are here, this removes @node_name parameter as it has never been really useful on powernv and carrying it for the pseries platform code to iommu_free_table() seems to be quite useless too. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 5 +++-- arch/powerpc/kernel/iommu.c | 24 +++- arch/powerpc/kernel/vio.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++--- arch/powerpc/platforms/powernv/pci.c | 1 + arch/powerpc/platforms/pseries/iommu.c| 3 ++- drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- 7 files changed, 34 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index f49a72a..cd4df44 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -114,6 +114,7 @@ struct iommu_table { struct list_head it_group_list;/* List of iommu_table_group_link */ unsigned long *it_userspace; /* userspace view of the table */ struct iommu_table_ops *it_ops; + struct krefit_kref; }; #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ @@ -146,8 +147,8 @@ static inline void *get_iommu_table_base(struct device *dev) extern int dma_iommu_dma_supported(struct device *dev, u64 mask); -/* Frees table for an individual device node */ -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); +extern void iommu_table_get(struct iommu_table *tbl); +extern void iommu_table_put(struct iommu_table *tbl); /* Initializes an iommu_table based in values set in the passed-in * structure diff 
--git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 13263b0..a8f017a 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -710,13 +710,13 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) return tbl; } -void iommu_free_table(struct iommu_table *tbl, const char *node_name) +static void iommu_table_free(struct kref *kref) { unsigned long bitmap_sz; unsigned int order; + struct iommu_table *tbl; - if (!tbl) - return; + tbl = container_of(kref, struct iommu_table, it_kref); if (tbl->it_ops->free) tbl->it_ops->free(tbl); @@ -735,7 +735,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) /* verify that table contains no entries */ if (!bitmap_empty(tbl->it_map, tbl->it_size)) - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name); + pr_warn("%s: Unexpected TCEs\n", __func__); /* calculate bitmap size in bytes */ bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long); @@ -747,7 +747,21 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) /* free table */ kfree(tbl); } -EXPORT_SYMBOL_GPL(iommu_free_table); + +void iommu_table_get(struct iommu_table *tbl) +{ + kref_get(&tbl->it_kref); +} +EXPORT_SYMBOL_GPL(iommu_table_get); + +void iommu_table_put(struct iommu_table *tbl) +{ + if (!tbl) + return; + + kref_put(&tbl->it_kref, iommu_table_free); +} +EXPORT_SYMBOL_GPL(iommu_table_put); /* Creates TCEs for a user provided buffer. The user buffer must be * contiguous real kernel storage (not vmalloc). 
The address passed here diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c index 8d7358f..188f452 100644 --- a/arch/powerpc/kernel/vio.c +++ b/arch/powerpc/kernel/vio.c @@ -1318,7 +1318,7 @@ static void vio_dev_release(struct device *dev) struct iommu_table *tbl = get_iommu_table_base(dev); if (tbl) - iommu_free_table(tbl, of_node_full_name(dev->of_node)); + iommu_table_put(tbl); of_node_put(dev->of_node); kfree(to_vio_dev(dev)); } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 74ab8382..c04afd2 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1394,7 +1394,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe iommu_group_put(pe->table_group.group); BUG_ON(pe->table_group.group); } - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); + iommu_table_put(tbl); } static void pnv_ioda_release_vf_PE(struc
[PATCH kernel 07/15] powerpc/iommu: Cleanup iommu_table disposal
At the moment iommu_table could be disposed by either calling iommu_table_free() directly or it_ops::free() which only implementation for IODA2 calls iommu_table_free() anyway. As we are going to have reference counting on tables, we need an unified way of disposing tables. This moves it_ops::free() call into iommu_free_table() and makes use of the latter everywhere. The free() callback now handles only platform-specific data. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/iommu.c | 4 arch/powerpc/platforms/powernv/pci-ioda.c | 6 ++ drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index a8e3490..13263b0 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -718,6 +718,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) if (!tbl) return; + if (tbl->it_ops->free) + tbl->it_ops->free(tbl); + if (!tbl->it_map) { kfree(tbl); return; @@ -744,6 +747,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) /* free table */ kfree(tbl); } +EXPORT_SYMBOL_GPL(iommu_free_table); /* Creates TCEs for a user provided buffer. The user buffer must be * contiguous real kernel storage (not vmalloc). 
The address passed here diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 59c7e7d..74ab8382 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1394,7 +1394,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe iommu_group_put(pe->table_group.group); BUG_ON(pe->table_group.group); } - pnv_pci_ioda2_table_free_pages(tbl); iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); } @@ -1987,7 +1986,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index, static void pnv_ioda2_table_free(struct iommu_table *tbl) { pnv_pci_ioda2_table_free_pages(tbl); - iommu_free_table(tbl, "pnv"); } static struct iommu_table_ops pnv_ioda2_iommu_ops = { @@ -2313,7 +2311,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) if (rc) { pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", rc); - pnv_ioda2_table_free(tbl); + iommu_free_table(tbl, ""); return rc; } @@ -2399,7 +2397,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group) pnv_pci_ioda2_set_bypass(pe, false); pnv_pci_ioda2_unset_window(&pe->table_group, 0); - pnv_ioda2_table_free(tbl); + iommu_free_table(tbl, "pnv"); } static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 40e71a0..79f26c7 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -660,7 +660,7 @@ static void tce_iommu_free_table(struct iommu_table *tbl) unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT; tce_iommu_userspace_view_free(tbl); - tbl->it_ops->free(tbl); + iommu_free_table(tbl, ""); decrement_locked_vm(pages); } -- 2.5.0.rc3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel 15/15] KVM: PPC: Add in-kernel acceleration for VFIO
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO without passing them to user space, which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to handle a TCE request in real mode first; if that fails, it passes the request to the virtual mode handler to complete the operation. If the virtual mode handler fails too, the request is passed to user space, though this is never expected to happen. The first user of this is VFIO on POWER; trampolines to the VFIO external user API functions are required for this patch.

This adds an ioctl() interface to the SPAPR TCE fd, which already handles in-kernel acceleration for emulated I/O by allocating the guest view of the TCE table in KVM. The new ioctls allow userspace to attach/detach VFIO containers to the kernel-allocated TCE table and have the hardware TCE table updates handled in the kernel. The new interface accepts a VFIO container fd and uses the exported API to get to the actual hardware TCE table. Until the _unset() ioctl is called, the VFIO container is referenced to guarantee the TCE table's presence in memory. This also releases unused containers when a new container is registered; the criterion for "unused" is vfio_container_get_iommu_data_ext() returning NULL, which happens when the container fd is closed.

Note that this interface does not operate on IOMMU groups, as TCE tables are owned by VFIO containers (which may even have no IOMMU groups attached).

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to user space.

Tests show that this patch increases transmission speed from 220MB/s to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/kvm_host.h | 8 + arch/powerpc/include/uapi/asm/kvm.h | 12 ++ arch/powerpc/kvm/book3s_64_vio.c| 403 arch/powerpc/kvm/book3s_64_vio_hv.c | 173 arch/powerpc/kvm/powerpc.c | 2 + 5 files changed, 598 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index ec35af3..3e3d65f 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -182,6 +182,13 @@ struct kvmppc_pginfo { atomic_t refcnt; }; +struct kvmppc_spapr_tce_container { + struct list_head next; + struct rcu_head rcu; + struct vfio_container *vfiocontainer; + struct iommu_table *tbl; +}; + struct kvmppc_spapr_tce_table { struct list_head list; struct kvm *kvm; @@ -190,6 +197,7 @@ struct kvmppc_spapr_tce_table { u32 page_shift; u64 offset; /* in pages */ u64 size; /* window size in pages */ + struct list_head containers; struct page *pages[0]; }; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index c93cf35..cbeb7bb 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -342,6 +342,18 @@ struct kvm_create_spapr_tce_64 { __u64 size; /* in pages */ }; +#define KVM_SPAPR_TCE (':') +#define KVM_SPAPR_TCE_VFIO_SET _IOW(KVM_SPAPR_TCE, 0x00, \ +struct kvm_spapr_tce_vfio) +#define KVM_SPAPR_TCE_VFIO_UNSET _IOW(KVM_SPAPR_TCE, 0x01, \ +struct kvm_spapr_tce_vfio) + +struct kvm_spapr_tce_vfio { + __u32 argsz; + __u32 flags; + __u32 container_fd; +}; + /* for KVM_ALLOCATE_RMA */ struct kvm_allocate_rma { __u64 rma_size; diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 15df8ae..d420ee0 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -27,6 +27,10 @@ #include #include #include +#include +#include +#include +#include #include #include @@ -39,6 +43,70 @@ #include #include #include +#include + +static struct iommu_table 
*kvm_vfio_container_spapr_tce_table_get_ext( + void *iommu_data, u64 offset) +{ + struct iommu_table *tbl; + struct iommu_table *(*fn)(void *, u64); + + fn = symbol_get(vfio_container_spapr_tce_table_get_ext); + if (!fn) + return NULL; + + tbl = fn(iommu_data, offset); + + symbol_put(vfio_container_spapr_tce_table_get_ext); + + return tbl; +} + +static struct vfio_container *kvm_vfio_container_get_ext(struct file *filep) +{ + struct vfio_container *container; + struct vfio_container *(*fn)(struct file *); + + fn = symbol_get(vfio_container_get_ext); + if (!fn) + return NULL; + + container = fn(filep); + + symbol_put(vfio_container_get_ext); + + return container; +} + +static void kvm_vfio_container_put_ext(struct vfio_container *container)
[PATCH kernel 14/15] vfio/spapr_tce: Export container API for external users
This exports helpers which are needed to keep a VFIO container in memory while there are external users such as KVM. Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/vfio.c | 30 ++ drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++- include/linux/vfio.h| 6 ++ 3 files changed, 51 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index d1d70e0..baf6a9c 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg) EXPORT_SYMBOL_GPL(vfio_external_check_extension); /** + * External user API for containers, exported by symbols to be linked + * dynamically. + * + */ +struct vfio_container *vfio_container_get_ext(struct file *filep) +{ + struct vfio_container *container = filep->private_data; + + if (filep->f_op != &vfio_fops) + return ERR_PTR(-EINVAL); + + vfio_container_get(container); + + return container; +} +EXPORT_SYMBOL_GPL(vfio_container_get_ext); + +void vfio_container_put_ext(struct vfio_container *container) +{ + vfio_container_put(container); +} +EXPORT_SYMBOL_GPL(vfio_container_put_ext); + +void *vfio_container_get_iommu_data_ext(struct vfio_container *container) +{ + return container->iommu_data; +} +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext); + +/** * Sub-module support */ /* diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 3594ad3..fceea3d 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = { .detach_group = tce_iommu_detach_group, }; +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data, + u64 offset) +{ + struct tce_container *container = iommu_data; + struct iommu_table *tbl = NULL; + + if (tce_iommu_find_table(container, offset, &tbl) < 0) + return NULL; + + iommu_table_get(tbl); + + return tbl; +} 
+EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
+
 static int __init tce_iommu_init(void)
 {
 	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
@@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESC);
-
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b..1c2138a 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
 extern int vfio_external_user_iommu_id(struct vfio_group *group);
 extern long vfio_external_check_extension(struct vfio_group *group,
 		unsigned long arg);
+extern struct vfio_container *vfio_container_get_ext(struct file *filep);
+extern void vfio_container_put_ext(struct vfio_container *container);
+extern void *vfio_container_get_iommu_data_ext(
+		struct vfio_container *container);
+extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
+		void *iommu_data, u64 offset);
 
 /*
  * Sub-module helpers
-- 
2.5.0.rc3
[PATCH kernel 02/15] KVM: PPC: Finish enabling VFIO KVM device on POWER
178a787502 "vfio: Enable VFIO device for powerpc" made an attempt to enable VFIO KVM device on POWER. However as CONFIG_KVM_BOOK3S_64 does not use "common-objs-y", VFIO KVM device was not enabled for Book3s KVM, this adds VFIO to the kvm-book3s_64-objs-y list. While we are here, enforce KVM_VFIO on KVM_BOOK3S as other platforms already do. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/Kconfig | 1 + arch/powerpc/kvm/Makefile | 3 +++ 2 files changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index c2024ac..b7c494b 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -64,6 +64,7 @@ config KVM_BOOK3S_64 select KVM_BOOK3S_64_HANDLER select KVM select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE + select KVM_VFIO if VFIO ---help--- Support running unmodified book3s_64 and book3s_32 guest kernels in virtual machines on book3s_64 host processors. diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 1f9e552..8907af9 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -88,6 +88,9 @@ endif kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \ book3s_xics.o +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \ + $(KVM)/vfio.o + kvm-book3s_64-module-objs += \ $(KVM)/kvm_main.o \ $(KVM)/eventfd.o \ -- 2.5.0.rc3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel 13/15] KVM: PPC: Pass kvm* to kvmppc_find_table()
The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm* there. This will be used in the following patches where we will be attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than to VCPU). Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/kvm_ppc.h | 2 +- arch/powerpc/kvm/book3s_64_vio.c| 7 --- arch/powerpc/kvm/book3s_64_vio_hv.c | 13 +++-- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 2544eda..7f1abe9 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -167,7 +167,7 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm_vcpu *vcpu, unsigned long liobn); + struct kvm *kvm, unsigned long liobn); extern long kvmppc_ioba_validate(struct kvmppc_spapr_tce_table *stt, unsigned long ioba, unsigned long npages); extern long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *tt, diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index c379ff5..15df8ae 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -212,12 +212,13 @@ fail: long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -245,7 +246,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, u64 __user *tces; u64 tce; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -299,7 +300,7 @@ long 
kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, struct kvmppc_spapr_tce_table *stt; long i, ret; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index a3be4bd..8a6834e 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -49,10 +49,9 @@ * WARNING: This will be called in real or virtual mode on HV KVM and virtual * mode on PR KVM */ -struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm_vcpu *vcpu, +struct kvmppc_spapr_tce_table *kvmppc_find_table(struct kvm *kvm, unsigned long liobn) { - struct kvm *kvm = vcpu->kvm; struct kvmppc_spapr_tce_table *stt; list_for_each_entry_lockless(stt, &kvm->arch.spapr_tce_tables, list) @@ -194,12 +193,13 @@ static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup( long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvmppc_spapr_tce_table *stt = kvmppc_find_table(vcpu, liobn); + struct kvmppc_spapr_tce_table *stt; long ret; /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */ /* liobn, ioba, tce); */ + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -252,7 +252,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, unsigned long tces, entry, ua = 0; unsigned long *rmap = NULL; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -335,7 +335,7 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, struct kvmppc_spapr_tce_table *stt; long i, ret; - stt = kvmppc_find_table(vcpu, liobn); + stt = kvmppc_find_table(vcpu->kvm, liobn); if (!stt) return H_TOO_HARD; @@ -356,12 +356,13 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu, long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba) { - struct kvmppc_spapr_tce_table *stt = 
kvmppc_find_table(vcpu, liobn);
+	struct kvmppc_spapr_tce_table *stt;
 	long ret;
 	unsigned long idx;
 	struct page *page;
 	u64 *tbl;
 
+	stt = kvmppc_find_table(vcpu->kvm, liobn);
 	if (!stt)
 		return H_TOO_HARD;
-- 
2.5.0.rc3
[PATCH kernel 01/15] Revert "iommu: Add a function to find an iommu group by id"
This reverts commit aa16bea929ae ("iommu: Add a function to find an iommu group by id") as the iommu_group_get_by_id() helper has never been used and it is unlikely it will be in the foreseeable future. Dead code is broken code.

Signed-off-by: Alexey Kardashevskiy
---
 drivers/iommu/iommu.c | 29 -
 include/linux/iommu.h |  1 -
 2 files changed, 30 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b06d935..d2f5efe 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -217,35 +217,6 @@ struct iommu_group *iommu_group_alloc(void)
 }
 EXPORT_SYMBOL_GPL(iommu_group_alloc);
 
-struct iommu_group *iommu_group_get_by_id(int id)
-{
-	struct kobject *group_kobj;
-	struct iommu_group *group;
-	const char *name;
-
-	if (!iommu_group_kset)
-		return NULL;
-
-	name = kasprintf(GFP_KERNEL, "%d", id);
-	if (!name)
-		return NULL;
-
-	group_kobj = kset_find_obj(iommu_group_kset, name);
-	kfree(name);
-
-	if (!group_kobj)
-		return NULL;
-
-	group = container_of(group_kobj, struct iommu_group, kobj);
-	BUG_ON(group->id != id);
-
-	kobject_get(group->devices_kobj);
-	kobject_put(&group->kobj);
-
-	return group;
-}
-EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
-
 /**
  * iommu_group_get_iommudata - retrieve iommu_data registered for a group
  * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a35fb8b..93c69fa 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -215,7 +215,6 @@ extern int bus_set_iommu(struct bus_type *bus, const struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern bool iommu_capable(struct bus_type *bus, enum iommu_cap cap);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
-extern struct iommu_group *iommu_group_get_by_id(int id);
 extern void iommu_domain_free(struct iommu_domain *domain);
 extern int iommu_attach_device(struct iommu_domain *domain,
 			       struct device *dev);
-- 
2.5.0.rc3
[PATCH kernel 12/15] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM on book3s-64 and not have the IOMMU bits for PCI pass-through support, as it costs little and allows VFIO to function on book3s KVM. Having IOMMU_API always enabled makes it unnecessary to have a lot of "#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those ifdefs we could only accelerate user-space-emulated devices (but not VFIO), which does not seem very useful.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index b7c494b..63b60a8 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -65,6 +65,7 @@ config KVM_BOOK3S_64
 	select KVM
 	select KVM_BOOK3S_PR_POSSIBLE if !KVM_BOOK3S_HV_POSSIBLE
 	select KVM_VFIO if VFIO
+	select SPAPR_TCE_IOMMU if IOMMU_SUPPORT
 	---help---
 	  Support running unmodified book3s_64 and book3s_32 guest kernels
 	  in virtual machines on book3s_64 host processors.
-- 
2.5.0.rc3
[PATCH kernel 00/15] powerpc/kvm/vfio: Enable in-kernel acceleration
This is my current queue of patches to add acceleration of TCE updates in KVM. This has a long history and was rewritten pretty much completely again, this time I am teaching KVM about VFIO containers. Some patches (such as 01/15) could be posted separately but I keep all of them here to make review easier (if the concept turns out be wrong - then I might still want to have 01/15). Please comment. Thanks. Alexey Kardashevskiy (15): Revert "iommu: Add a function to find an iommu group by id" KVM: PPC: Finish enabling VFIO KVM device on POWER KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again powerpc/iommu: Stop using @current in mm_iommu_xxx powerpc/mm/iommu: Put pages on process exit powerpc/iommu: Cleanup iommu_table disposal powerpc/vfio_spapr_tce: Add reference counting to iommu_table powerpc/mmu: Add real mode support for IOMMU preregistered memory KVM: PPC: Use preregistered memory API to access TCE list powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently KVM: PPC: Pass kvm* to kvmppc_find_table() vfio/spapr_tce: Export container API for external users KVM: PPC: Add in-kernel acceleration for VFIO arch/powerpc/include/asm/iommu.h | 12 +- arch/powerpc/include/asm/kvm_host.h | 8 + arch/powerpc/include/asm/kvm_ppc.h| 2 +- arch/powerpc/include/asm/mmu_context.h| 23 +- arch/powerpc/include/uapi/asm/kvm.h | 12 + arch/powerpc/kernel/iommu.c | 49 +++- arch/powerpc/kernel/setup-common.c| 2 +- arch/powerpc/kernel/vio.c | 2 +- arch/powerpc/kvm/Kconfig | 2 + arch/powerpc/kvm/Makefile | 3 + arch/powerpc/kvm/book3s_64_vio.c | 410 +- arch/powerpc/kvm/book3s_64_vio_hv.c | 251 -- arch/powerpc/kvm/powerpc.c| 2 + arch/powerpc/mm/mmu_context_book3s64.c| 6 +- arch/powerpc/mm/mmu_context_iommu.c | 96 --- arch/powerpc/platforms/powernv/pci-ioda.c | 46 +++- arch/powerpc/platforms/powernv/pci.c | 1 + 
 arch/powerpc/platforms/pseries/iommu.c    |   3 +-
 drivers/iommu/iommu.c                     |  29 ---
 drivers/vfio/vfio.c                       |  30 +++
 drivers/vfio/vfio_iommu_spapr_tce.c       | 107 ++--
 include/linux/iommu.h                     |   1 -
 include/linux/vfio.h                      |   6 +
 include/uapi/linux/kvm.h                  |   1 +
 24 files changed, 959 insertions(+), 145 deletions(-)
-- 
2.5.0.rc3
[PATCH kernel 11/15] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
In real mode, TCE tables are invalidated using special cache-inhibited store instructions which are not available in virtual mode. This defines and implements an exchange_rm() callback. This does not define set_rm/clear_rm/flush_rm callbacks as there is no user for those - exchange/exchange_rm are only to be used by KVM for VFIO. The exchange_rm callback is defined for the IODA1/IODA2 powernv platforms. This replaces list_for_each_entry_rcu with its lockless version, as from now on pnv_pci_ioda2_tce_invalidate() can be called in real mode too.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/iommu.h          |  7 +++
 arch/powerpc/kernel/iommu.c               | 23 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cd4df44..a13d207 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -64,6 +64,11 @@ struct iommu_table_ops {
 			long index,
 			unsigned long *hpa,
 			enum dma_data_direction *direction);
+	/* Real mode */
+	int (*exchange_rm)(struct iommu_table *tbl,
+			long index,
+			unsigned long *hpa,
+			enum dma_data_direction *direction);
 #endif
 	void (*clear)(struct iommu_table *tbl,
 			long index, long npages);
@@ -209,6 +214,8 @@ extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
 		unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry,
+		unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 		int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a8f017a..65b2dac 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1020,6 +1020,29 @@ long iommu_tce_xchg(struct
iommu_table *tbl, unsigned long entry, } EXPORT_SYMBOL_GPL(iommu_tce_xchg); +long iommu_tce_xchg_rm(struct iommu_table *tbl, unsigned long entry, + unsigned long *hpa, enum dma_data_direction *direction) +{ + long ret; + + ret = tbl->it_ops->exchange_rm(tbl, entry, hpa, direction); + + if (!ret && ((*direction == DMA_FROM_DEVICE) || + (*direction == DMA_BIDIRECTIONAL))) { + struct page *pg = realmode_pfn_to_page(*hpa >> PAGE_SHIFT); + + if (likely(pg)) { + SetPageDirty(pg); + } else { + tbl->it_ops->exchange_rm(tbl, entry, hpa, direction); + ret = -EFAULT; + } + } + + return ret; +} +EXPORT_SYMBOL_GPL(iommu_tce_xchg_rm); + int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index c04afd2..a0b5ea6 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1827,6 +1827,17 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index, return ret; } + +static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index, + unsigned long *hpa, enum dma_data_direction *direction) +{ + long ret = pnv_tce_xchg(tbl, index, hpa, direction); + + if (!ret) + pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true); + + return ret; +} #endif static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index, @@ -1841,6 +1852,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .set = pnv_ioda1_tce_build, #ifdef CONFIG_IOMMU_API .exchange = pnv_ioda1_tce_xchg, + .exchange_rm = pnv_ioda1_tce_xchg_rm, #endif .clear = pnv_ioda1_tce_free, .get = pnv_tce_get, @@ -1915,7 +1927,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, { struct iommu_table_group_link *tgl; - list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) { + list_for_each_entry_lockless(tgl, &tbl->it_group_list, next) { struct pnv_ioda_pe *pe = container_of(tgl->table_group, struct pnv_ioda_pe, 
table_group); struct pnv_phb *phb = pe->phb; @@ -1973,6 +1985,17 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index, return ret; } + +static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index, + unsigned long *hpa, enum d
[PATCH kernel 10/15] KVM: PPC: Use preregistered memory API to access TCE list
VFIO on sPAPR already implements guest memory pre-registration when the entire guest RAM gets pinned. This can be used to translate the physical address of a guest page containing the TCE list from H_PUT_TCE_INDIRECT. This makes use of the pre-registrered memory API to access TCE list pages in order to avoid unnecessary locking on the KVM memory reverse map as we know that all of guest memory is pinned and we have a flat array mapping GPA to HPA which makes it simpler and quicker to index into that array (even with looking up the kernel page tables in vmalloc_to_phys) than it is to find the memslot, lock the rmap entry, look up the user page tables, and unlock the rmap entry. Note that the rmap pointer is initialized to NULL where declared (not in this patch). Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * updated the commit log with Paul's comment --- arch/powerpc/kvm/book3s_64_vio_hv.c | 65 - 1 file changed, 49 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c index d461c44..a3be4bd 100644 --- a/arch/powerpc/kvm/book3s_64_vio_hv.c +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa, EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua); #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu) +{ + return mm_iommu_preregistered(vcpu->kvm->mm); +} + +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup( + struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size) +{ + return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size); +} + long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, if (ret != H_SUCCESS) return ret; - if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap)) - return H_TOO_HARD; + if (kvmppc_preregistered(vcpu)) { + /* +* We get here if guest 
memory was pre-registered which +* is normally VFIO case and gpa->hpa translation does not +* depend on hpt. +*/ + struct mm_iommu_table_group_mem_t *mem; - rmap = (void *) vmalloc_to_phys(rmap); + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL)) + return H_TOO_HARD; - /* -* Synchronize with the MMU notifier callbacks in -* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.). -* While we have the rmap lock, code running on other CPUs -* cannot finish unmapping the host real page that backs -* this guest real page, so we are OK to access the host -* real page. -*/ - lock_rmap(rmap); - if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) { - ret = H_TOO_HARD; - goto unlock_exit; + mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K); + if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces)) + return H_TOO_HARD; + } else { + /* +* This is emulated devices case. +* We do not require memory to be preregistered in this case +* so lock rmap and do __find_linux_pte_or_hugepte(). +*/ + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap)) + return H_TOO_HARD; + + rmap = (void *) vmalloc_to_phys(rmap); + + /* +* Synchronize with the MMU notifier callbacks in +* book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.). +* While we have the rmap lock, code running on other CPUs +* cannot finish unmapping the host real page that backs +* this guest real page, so we are OK to access the host +* real page. +*/ + lock_rmap(rmap); + if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) { + ret = H_TOO_HARD; + goto unlock_exit; + } } for (i = 0; i < npages; ++i) { @@ -290,7 +322,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, } unlock_exit: - unlock_rmap(rmap); + if (rmap) + unlock_rmap(rmap); return ret; } -- 2.5.0.rc3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel 06/15] powerpc/mm/iommu: Put pages on process exit
At the moment the VFIO IOMMU SPAPR v2 driver pins all guest RAM pages when userspace starts using VFIO. When the userspace process finishes, all the pinned pages need to be put; this is done as part of the destruction of the userspace memory context (MM), which happens on the very last mmdrop().

The problem with this approach is that the MM of the userspace process may live longer than the process itself, as kernel threads take a reference to the MM of the userspace process that was running on the CPU they were scheduled on. If this happens, the MM remains referenced until that exact kernel thread wakes up again and releases the very last reference to the MM; on an idle system this can take hours.

This patch references and caches the MM once per container and tracks how many times each preregistered area was registered in a specific container. This way we do not depend on @current pointing to a valid task descriptor.

This changes the userspace interface to return EBUSY if memory is already registered (mm_iommu_get() used to increment a counter instead); however, it should have no practical effect as the only userspace tool available now registers a memory area once per container anyway.

As tce_iommu_register_pages()/tce_iommu_unregister_pages() are called under container->lock, no additional locking is needed.
Signed-off-by: Alexey Kardashevskiy # Conflicts: # arch/powerpc/include/asm/mmu_context.h # arch/powerpc/mm/mmu_context_book3s64.c # arch/powerpc/mm/mmu_context_iommu.c --- arch/powerpc/include/asm/mmu_context.h | 1 - arch/powerpc/mm/mmu_context_book3s64.c | 4 --- arch/powerpc/mm/mmu_context_iommu.c| 11 --- drivers/vfio/vfio_iommu_spapr_tce.c| 52 +- 4 files changed, 51 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b85cc7b..a4c4ed5 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -25,7 +25,6 @@ extern long mm_iommu_get(struct mm_struct *mm, extern long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem); extern void mm_iommu_init(struct mm_struct *mm); -extern void mm_iommu_cleanup(struct mm_struct *mm); extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, unsigned long ua, unsigned long size); extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index ad82735..1a07969 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { -#ifdef CONFIG_SPAPR_TCE_IOMMU - mm_iommu_cleanup(mm); -#endif - #ifdef CONFIG_PPC_ICSWX drop_cop(mm->context.acop, mm); kfree(mm->context.cop_lockp); diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index ee6685b..10f01fe 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -293,14 +293,3 @@ void mm_iommu_init(struct mm_struct *mm) { INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list); } - -void mm_iommu_cleanup(struct mm_struct *mm) -{ - struct mm_iommu_table_group_mem_t *mem, *tmp; - - 
list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list, - next) { - list_del_rcu(&mem->next); - mm_iommu_do_free(mem); - } -} diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 9752e77..40e71a0 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -89,6 +89,15 @@ struct tce_iommu_group { }; /* + * A container needs to remember which preregistered areas and how many times + * it has referenced to do proper cleanup at the userspace process exit. + */ +struct tce_iommu_prereg { + struct list_head next; + struct mm_iommu_table_group_mem_t *mem; +}; + +/* * The container descriptor supports only a single group per container. * Required by the API as the container is not supplied with the IOMMU group * at the moment of initialization. @@ -101,12 +110,26 @@ struct tce_container { struct mm_struct *mm; struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES]; struct list_head group_list; + struct list_head prereg_list; }; +static long tce_iommu_prereg_free(struct tce_container *container, + struct tce_iommu_prereg *tcemem) +{ + long ret; + + list_del(&tcemem->next); + ret = mm_iommu_put(container->mm, tcemem->mem); + kfree(tcemem); + + return ret; +} + static long tce_iommu_unregister_pages(struct tce_container *container, __u64 vaddr, __u64
[PATCH kernel 05/15] powerpc/iommu: Stop using @current in mm_iommu_xxx
In some situations the userspace memory context may live longer than the userspace process itself, so to do proper memory context cleanup we had better cache @mm and use it later, when the process is gone (@current or @current->mm is NULL). This changes the mm_iommu_xxx API to receive an mm_struct instead of using the one from @current. This is needed by the following patch to do proper cleanup in time. This depends on the "powerpc/powernv/ioda: Fix endianness when reading TCEs" patch to do proper cleanup via tce_iommu_clear(). To keep the API consistent, this replaces mm_context_t with mm_struct; we stick to mm_struct as the mm_iommu_adjust_locked_vm() helper needs access to &mm->mmap_sem. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/mmu_context.h | 20 +++-- arch/powerpc/kernel/setup-common.c | 2 +- arch/powerpc/mm/mmu_context_book3s64.c | 4 +-- arch/powerpc/mm/mmu_context_iommu.c| 54 ++ drivers/vfio/vfio_iommu_spapr_tce.c| 41 -- 5 files changed, 62 insertions(+), 59 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 9d2cd0c..b85cc7b 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -18,16 +18,18 @@ extern void destroy_context(struct mm_struct *mm); #ifdef CONFIG_SPAPR_TCE_IOMMU struct mm_iommu_table_group_mem_t; -extern bool mm_iommu_preregistered(void); -extern long mm_iommu_get(unsigned long ua, unsigned long entries, +extern bool mm_iommu_preregistered(struct mm_struct *mm); +extern long mm_iommu_get(struct mm_struct *mm, + unsigned long ua, unsigned long entries, struct mm_iommu_table_group_mem_t **pmem); -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem); -extern void mm_iommu_init(mm_context_t *ctx); -extern void mm_iommu_cleanup(mm_context_t *ctx); -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua, - unsigned long size); -extern struct mm_iommu_table_group_mem_t 
*mm_iommu_find(unsigned long ua, - unsigned long entries); +extern long mm_iommu_put(struct mm_struct *mm, + struct mm_iommu_table_group_mem_t *mem); +extern void mm_iommu_init(struct mm_struct *mm); +extern void mm_iommu_cleanup(struct mm_struct *mm); +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, + unsigned long ua, unsigned long size); +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, + unsigned long ua, unsigned long entries); extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned long *hpa); extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem); diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 714b4ba..e90b68a 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -905,7 +905,7 @@ void __init setup_arch(char **cmdline_p) init_mm.context.pte_frag = NULL; #endif #ifdef CONFIG_SPAPR_TCE_IOMMU - mm_iommu_init(&init_mm.context); + mm_iommu_init(&init_mm); #endif irqstack_early_init(); exc_lvl_early_init(); diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c index b114f8b..ad82735 100644 --- a/arch/powerpc/mm/mmu_context_book3s64.c +++ b/arch/powerpc/mm/mmu_context_book3s64.c @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) mm->context.pte_frag = NULL; #endif #ifdef CONFIG_SPAPR_TCE_IOMMU - mm_iommu_init(&mm->context); + mm_iommu_init(mm); #endif return 0; } @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { #ifdef CONFIG_SPAPR_TCE_IOMMU - mm_iommu_cleanup(&mm->context); + mm_iommu_cleanup(mm); #endif #ifdef CONFIG_PPC_ICSWX diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index da6a216..ee6685b 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ 
b/arch/powerpc/mm/mmu_context_iommu.c @@ -53,7 +53,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, } pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n", - current->pid, + current ? current->pid : 0, incr ? '+' : '-', npages << PAGE_SHIFT, mm->locked_vm << PAGE_SHIFT, @@ -63,28 +63,22 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, return ret; } -bool mm_iommu_preregistered(void) +bool mm_iommu_preregistered(struct mm_struct *mm) { - if (!current || !current->mm) - return false; - - return !list_empty(&current->mm-
[PATCH kernel 04/15] powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
"powerpc/powernv/pci: Rework accessing the TCE invalidate register" broke TCE invalidation on IODA2/PHB3 for real mode. This makes invalidate work again. Fixes: fd141d1a99a3 Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/powernv/pci-ioda.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 53b56c0..59c7e7d 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1877,7 +1877,7 @@ static void pnv_pci_phb3_tce_invalidate(struct pnv_ioda_pe *pe, bool rm, unsigned shift, unsigned long index, unsigned long npages) { - __be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, false); + __be64 __iomem *invalidate = pnv_ioda_get_inval_reg(pe->phb, rm); unsigned long start, end, inc; /* We'll invalidate DMA address in PE scope */ @@ -1935,10 +1935,12 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, pnv_pci_phb3_tce_invalidate(pe, rm, shift, index, npages); else if (rm) + { opal_rm_pci_tce_kill(phb->opal_id, OPAL_PCI_TCE_KILL_PAGES, pe->pe_number, 1u << shift, index << shift, npages); + } else opal_pci_tce_kill(phb->opal_id, OPAL_PCI_TCE_KILL_PAGES, -- 2.5.0.rc3 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel 03/15] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
This adds a capability number for in-kernel support for VFIO on the SPAPR platform. The capability tells the user space whether the in-kernel handlers of H_PUT_TCE can handle VFIO-targeted requests. If not, the user space must not attempt to allocate a TCE table in the host kernel via the KVM_CREATE_SPAPR_TCE ioctl, because in that case TCE requests would not be passed to the user space, which is the desired behaviour in that situation. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- include/uapi/linux/kvm.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index e98bb4c..3b4b723 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -870,6 +870,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_S390_USER_INSTR0 130 #define KVM_CAP_MSI_DEVID 131 #define KVM_CAP_PPC_HTM 132 +#define KVM_CAP_SPAPR_TCE_VFIO 133 #ifdef KVM_CAP_IRQ_ROUTING -- 2.5.0.rc3
[RESEND PATCH v5 1/2] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs
When decoding the perf_regs mask in regs_dump__printf(), we loop through the mask using find_first_bit and find_next_bit functions. "mask" is of type "u64", but sent as an "unsigned long *" to lib functions along with sizeof(). While the existing code works fine in most cases, the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian). When reading u64 using (u32 *)(&val)[0], perf (lib/find_*_bit()) assumes it gets lower 32bits of u64 which is wrong. Proposed fix is to swap the words of the u64 to handle this case. This is _not_ an endianness swap. Suggested-by: Yury Norov Reviewed-by: Yury Norov Acked-by: Jiri Olsa Cc: Yury Norov Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Adrian Hunter Cc: Kan Liang Cc: Wang Nan Cc: Michael Ellerman Signed-off-by: Madhavan Srinivasan --- tools/include/linux/bitmap.h | 2 ++ tools/lib/bitmap.c | 18 ++ tools/perf/builtin-script.c | 4 +++- tools/perf/util/session.c| 4 +++- 4 files changed, 26 insertions(+), 2 deletions(-) diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h index 28f5493da491..5e98525387dc 100644 --- a/tools/include/linux/bitmap.h +++ b/tools/include/linux/bitmap.h @@ -2,6 +2,7 @@ #define _PERF_BITOPS_H #include +#include #include #define DECLARE_BITMAP(name,bits) \ @@ -10,6 +11,7 @@ int __bitmap_weight(const unsigned long *bitmap, int bits); void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1, const unsigned long *bitmap2, int bits); +void bitmap_from_u64(unsigned long *dst, u64 mask); #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1))) diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c index 0a1adcfd..464a0cc63e6a 100644 --- a/tools/lib/bitmap.c +++ b/tools/lib/bitmap.c @@ -29,3 +29,21 @@ void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1, for (k = 0; k < nr; k++) dst[k] = bitmap1[k] | bitmap2[k]; } + +/* + * bitmap_from_u64 - Check and swap words within u64. 
+ * @mask: source bitmap + * @dst: destination bitmap + * + * In 32 bit big endian userspace on a 64bit kernel, 'unsigned long' is 32 bits. + * When reading u64 using (u32 *)(&val)[0] and (u32 *)(&val)[1], + * we will get wrong value for the mask. That is "(u32 *)(&val)[0]" + * gets upper 32 bits of u64, but perf may expect lower 32bits of u64. + */ +void bitmap_from_u64(unsigned long *dst, u64 mask) +{ + dst[0] = mask & ULONG_MAX; + + if (sizeof(mask) > sizeof(unsigned long)) + dst[1] = mask >> 32; +} diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c index 971ff91b16cb..20d7988a1636 100644 --- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -418,11 +418,13 @@ static void print_sample_iregs(struct perf_sample *sample, struct regs_dump *regs = &sample->intr_regs; uint64_t mask = attr->sample_regs_intr; unsigned i = 0, r; + DECLARE_BITMAP(_mask, 64); if (!regs) return; - for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) { + bitmap_from_u64(_mask, mask); + for_each_set_bit(r, _mask, sizeof(mask) * 8) { u64 val = regs->regs[i++]; printf("%5s:0x%"PRIx64" ", perf_reg_name(r), val); } diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c index 5d61242a6e64..440a9fb2a6fb 100644 --- a/tools/perf/util/session.c +++ b/tools/perf/util/session.c @@ -944,8 +944,10 @@ static void branch_stack__printf(struct perf_sample *sample) static void regs_dump__printf(u64 mask, u64 *regs) { unsigned rid, i = 0; + DECLARE_BITMAP(_mask, 64); - for_each_set_bit(rid, (unsigned long *) &mask, sizeof(mask) * 8) { + bitmap_from_u64(_mask, mask); + for_each_set_bit(rid, _mask, sizeof(mask) * 8) { u64 val = regs[i++]; printf(" %-5s 0x%" PRIx64 "\n", -- 2.7.4
[RESEND PATCH 2/2] perf/core: Fix the mask in perf_output_sample_regs
When decoding the perf_regs mask in perf_output_sample_regs(), we loop through the mask using find_first_bit and find_next_bit functions. While the existing code works fine in most cases, the logic is broken for a 32bit kernel (Big Endian). When reading u64 mask using (u32 *)(&val)[0], find_*_bit() assumes it gets lower 32bits of u64 but instead gets upper 32bits which is wrong. Proposed fix is to swap the words of the u64 to handle this case. This is _not_ an endianness swap. Suggested-by: Yury Norov Reviewed-by: Yury Norov Cc: Yury Norov Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Michael Ellerman Signed-off-by: Madhavan Srinivasan --- include/linux/bitmap.h | 2 ++ kernel/events/core.c | 4 +++- lib/bitmap.c | 19 +++ 3 files changed, 24 insertions(+), 1 deletion(-) diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index 27bfc0b631a9..6f2cc9eb12d9 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -188,6 +188,8 @@ extern int bitmap_print_to_pagebuf(bool list, char *buf, #define small_const_nbits(nbits) \ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG) +extern void bitmap_from_u64(unsigned long *dst, u64 mask); + static inline void bitmap_zero(unsigned long *dst, unsigned int nbits) { if (small_const_nbits(nbits)) diff --git a/kernel/events/core.c b/kernel/events/core.c index 356a6c7cb52a..f5ed20a63a5e 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5269,8 +5269,10 @@ perf_output_sample_regs(struct perf_output_handle *handle, struct pt_regs *regs, u64 mask) { int bit; + DECLARE_BITMAP(_mask, 64); - for_each_set_bit(bit, (const unsigned long *) &mask, + bitmap_from_u64(_mask, mask); + for_each_set_bit(bit, _mask, sizeof(mask) * BITS_PER_BYTE) { u64 val; diff --git a/lib/bitmap.c b/lib/bitmap.c index eca88087fa8a..2b9bda507645 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -1170,3 +1170,22 @@ void bitmap_copy_le(unsigned long *dst, const unsigned long 
*src, unsigned int n } EXPORT_SYMBOL(bitmap_copy_le); #endif + +/* + * bitmap_from_u64 - Check and swap words within u64. + * @mask: source bitmap + * @dst: destination bitmap + * + * In 32bit Big Endian kernel, when using (u32 *)(&val)[*] + * to read u64 mask, we will get wrong word. + * That is "(u32 *)(&val)[0]" gets upper 32 bits, + * but expected could be lower 32bits of u64. + */ +void bitmap_from_u64(unsigned long *dst, u64 mask) +{ + dst[0] = mask & ULONG_MAX; + + if (sizeof(mask) > sizeof(unsigned long)) + dst[1] = mask >> 32; +} +EXPORT_SYMBOL(bitmap_from_u64); -- 2.7.4
Re: powerpc allyesconfig / allmodconfig linux-next next-20160729 - next-20160729 build failures
On Wednesday, August 3, 2016 10:23:24 AM CEST Stephen Rothwell wrote: > Hi Luis, > > On Wed, 3 Aug 2016 00:02:43 +0200 "Luis R. Rodriguez" > wrote: > > > > Thanks for the confirmation. For how long is it known this is broken? > > Does anyone care and fix these ? Or is this best effort? > > This has been broken for many years > > I have a couple of times almost fixed it, but it requires that we > change from using "ld -r" to build the built-in.o objects and some > changes to the powerpc head.S code ... I will give it another shot now > that the merge window is almost over (and linux-next goes into its > quieter time). Using a different way to link the kernel would also help us with the remaining allyesconfig problem on ARM, as the problem is only in 'ld -r' not producing trampolines for symbols that later cannot get them any more. It would probably also help building with ld.gold, which is currently not working. What is your suggested alternative? Arnd
Re: [PATCH] powernv: Search for new flash DT node location
Quoting Jack Miller (2016-08-02 06:50:35) > Skiboot will place the flash device tree node at ibm,opal/flash/flash@0 > on P9 and later systems, so Linux needs to search for it there as well > as ibm,opal/flash@0 for backwards compatibility. > > Signed-off-by: Jack Miller > --- > arch/powerpc/platforms/powernv/opal.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/powernv/opal.c > b/arch/powerpc/platforms/powernv/opal.c > index ae29eaf..2847cb0 100644 > --- a/arch/powerpc/platforms/powernv/opal.c > +++ b/arch/powerpc/platforms/powernv/opal.c > @@ -755,9 +755,14 @@ static int __init opal_init(void) > > /* Initialize platform devices: IPMI backend, PRD & flash interface */ > opal_pdev_init(opal_node, "ibm,opal-ipmi"); > - opal_pdev_init(opal_node, "ibm,opal-flash"); > + opal_pdev_init(opal_node, "ibm,opal-flash"); // old <= P8 flash > location > opal_pdev_init(opal_node, "ibm,opal-prd"); > > + /* New >= P9 flash location */ > + np = of_get_child_by_name(opal_node, "flash"); > + if (np) > + opal_pdev_init(np, "ibm,opal-flash"); We could instead just search for all nodes that are compatible with "ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs(). Is there a particular reason not to do that? cheers
Re: [PATCH] rtc-opal: Fix handling of firmware error codes, prevent busy loops
Stewart Smith writes: > According to the OPAL docs: > https://github.com/open-power/skiboot/blob/skiboot-5.2.5/doc/opal-api/opal-rtc-read-3.txt > https://github.com/open-power/skiboot/blob/skiboot-5.2.5/doc/opal-api/opal-rtc-write-4.txt > OPAL_HARDWARE may be returned from OPAL_RTC_READ or OPAL_RTC_WRITE and this > indicates either a transient or permanent error. > > Prior to this patch, Linux was not dealing with OPAL_HARDWARE being a > permanent error particularly well, in that you could end up in a busy > loop. > > This was not too hard to trigger on an AMI BMC based OpenPOWER machine > doing a continuous "ipmitool mc reset cold" to the BMC, the result of > that being that we'd get stuck in an infinite loop in opal_get_rtc_time. > > We now retry a few times before returning the error higher up the stack. Looks like this has always been broken, so: Fixes: 16b1d26e77b1 ("rtc/tpo: Driver to support rtc and wakeup on PowerNV platform") > Cc: sta...@vger.kernel.org And therefore that should be: Cc: sta...@vger.kernel.org # v3.19+ > Signed-off-by: Stewart Smith > --- > drivers/rtc/rtc-opal.c | 12 ++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c > index 9c18d6fd8107..fab19e3e2fba 100644 > --- a/drivers/rtc/rtc-opal.c > +++ b/drivers/rtc/rtc-opal.c > @@ -58,6 +58,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 > *h_m_s_ms) > static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm) > { > long rc = OPAL_BUSY; > + int retries = 10; > u32 y_m_d; > u64 h_m_s_ms; > __be32 __y_m_d; > @@ -67,8 +68,11 @@ static int opal_get_rtc_time(struct device *dev, struct > rtc_time *tm) > rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms); > if (rc == OPAL_BUSY_EVENT) > opal_poll_events(NULL); > - else > + else if (retries-- && (rc == OPAL_HARDWARE > +|| rc == OPAL_INTERNAL_ERROR)) > msleep(10); > + else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT) > + break; > } This is a pretty gross API at this 
point. That's basically a score of 2 on Rusty's API usability index ("Read the implementation and you'll get it right" - http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html). The docs don't mention OPAL_INTERNAL_ERROR being transient, nor do they mention OPAL_BUSY. Can we at least do a wrapper function in opal.h for drivers to use that handles some or all of these cases? cheers