Re: [PATCH 1/2] mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
On Thu, Jul 12, 2018 at 11:48:47AM +0900, Joonsoo Kim wrote:
> One of the existing users is the general DMA layer and it takes gfp flags that are
> provided by the user. I haven't checked all the DMA allocation sites, but how do
> you convince yourself that none of them try to use anything other
> than GFP_KERNEL [|__GFP_NOWARN]?

They still use a few other things like __GFP_COMP, __GFP_DMA or GFP_HUGEPAGE. But all of these are bogus as we have various implementations that can't respect them. I plan to get rid of the gfp_t argument in the dma_map_ops alloc method in a few merge windows because of that, but it needs further implementation consolidation first.
[PATCH kernel] KVM: PPC: Expose userspace mm context id via debugfs
This adds a debugfs entry with the mm context id of a process which is using KVM. This id is an index in the process table, so userspace can dump that tree provided it is granted access to /dev/mem.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 58 +
 2 files changed, 59 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index fa4efa7..bb72667 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -284,6 +284,7 @@ struct kvm_arch {
 	u64 process_table;
 	struct dentry *debugfs_dir;
 	struct dentry *htab_dentry;
+	struct dentry *mm_ctxid_dentry;
 	struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f3a8cf..3b9eb17 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -2138,11 +2138,69 @@ static const struct file_operations debugfs_htab_fops = {
 	.llseek	 = generic_file_llseek,
 };
 
+static int debugfs_mm_ctxid_open(struct inode *inode, struct file *file)
+{
+	struct kvm *kvm = inode->i_private;
+
+	kvm_get_kvm(kvm);
+	file->private_data = kvm;
+
+	return nonseekable_open(inode, file);
+}
+
+static int debugfs_mm_ctxid_release(struct inode *inode, struct file *file)
+{
+	struct kvm *kvm = file->private_data;
+
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static ssize_t debugfs_mm_ctxid_read(struct file *file, char __user *buf,
+				     size_t len, loff_t *ppos)
+{
+	struct kvm *kvm = file->private_data;
+	ssize_t n, left, ret;
+	char tmp[64];
+
+	if (!kvm_is_radix(kvm))
+		return 0;
+
+	ret = snprintf(tmp, sizeof(tmp) - 1, "%lu\n", kvm->mm->context.id);
+	if (*ppos >= ret)
+		return 0;
+
+	left = min_t(ssize_t, ret - *ppos, len);
+	n = copy_to_user(buf, tmp + *ppos, left);
+	ret = left - n;
+	*ppos += ret;
+
+	return ret;
+}
+
+static ssize_t debugfs_mm_ctxid_write(struct file *file, const char __user *buf,
+				      size_t len, loff_t *ppos)
+{
+	return -EACCES;
+}
+
+static const struct file_operations debugfs_mm_ctxid_fops = {
+	.owner	 = THIS_MODULE,
+	.open	 = debugfs_mm_ctxid_open,
+	.release = debugfs_mm_ctxid_release,
+	.read	 = debugfs_mm_ctxid_read,
+	.write	 = debugfs_mm_ctxid_write,
+	.llseek	 = generic_file_llseek,
+};
+
 void kvmppc_mmu_debugfs_init(struct kvm *kvm)
 {
 	kvm->arch.htab_dentry = debugfs_create_file("htab", 0400,
 						    kvm->arch.debugfs_dir, kvm,
 						    &debugfs_htab_fops);
+	kvm->arch.mm_ctxid_dentry = debugfs_create_file("mm_ctxid", 0400,
+							kvm->arch.debugfs_dir, kvm,
+							&debugfs_mm_ctxid_fops);
 }
 
 void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
-- 
2.11.0
Re: [PATCH kernel] powerpc/powernv/ioda2: Add 256M IOMMU page size to the default POWER8 case
On Mon, 2018-07-02 at 17:42 +1000, Alexey Kardashevskiy wrote:
> The sketchy bypass uses 256M pages so add this page size as well.
> 
> This should cause no behavioral change but will be used later.
> 
> Fixes: 477afd6ea6 "powerpc/ioda: Use ibm,supported-tce-sizes for
> IOMMU page size mask"
> Signed-off-by: Alexey Kardashevskiy

Reviewed-by: Russell Currey

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5bd0eb6..557c11d 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2925,7 +2925,7 @@ static unsigned long
> pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
>  	/* Add 16M for POWER8 by default */
>  	if (cpu_has_feature(CPU_FTR_ARCH_207S) &&
>  	    !cpu_has_feature(CPU_FTR_ARCH_300))
> -		mask |= SZ_16M;
> +		mask |= SZ_16M | SZ_256M;
>  	return mask;
>  }
> 
[RFC PATCH] mm: optimise pte dirty/accessed bits handling in fork
fork clears the dirty/accessed bits from new ptes in the child, even though the mapping allows such accesses. This logic has existed more or less forever, since well before physical page reclaim and cleaning were as strongly tied to pte access state as they are today. Now that they are, this access bit clearing logic does not achieve much.

Other than this case, Linux is "eager" to set dirty/accessed bits when setting up mappings, which avoids micro-faults (and page faults on CPUs that implement these bits in software).

With this patch, there are no cases I could instrument where dirty/accessed bits do not match the access permissions without memory pressure (and without more exotic things like migration). This speeds up a fork/exit microbenchmark by about 5% on POWER9 (which uses a software fault fallback mechanism to set these bits). I expect the difference will barely be noticeable on x86 CPUs, but it would be interesting to see. Other archs might care more, and anyway it's always good if we can remove code and make things a bit faster.

I don't *think* I'm missing anything fundamental, but it would be good to be sure. Comments?
Thanks,
Nick

---
 mm/huge_memory.c |  4 ++--
 mm/memory.c      | 10 +++++-----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..c1d41cad9aad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -974,7 +974,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	pmd = pmd_wrprotect(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1065,7 +1065,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	pudp_set_wrprotect(src_mm, addr, src_pud);
-	pud = pud_mkold(pud_wrprotect(pud));
+	pud = pud_wrprotect(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..3fea40da3a58 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1023,12 +1023,12 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	/*
-	 * If it's a shared mapping, mark it clean in
-	 * the child
+	 * Child inherits dirty and young bits from parent. There is no
+	 * point clearing them because any cleaning or aging has to walk
+	 * all ptes anyway, and it will notice the bits set in the parent.
+	 * Leaving them set avoids stalls and even page faults on CPUs that
+	 * handle these bits in software.
 	 */
-	if (vm_flags & VM_SHARED)
-		pte = pte_mkclean(pte);
-	pte = pte_mkold(pte);
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
-- 
2.17.0
Re: [PATCH] powerpc: Replaced msleep(x) with msleep(OPAL_BUSY_DELAY_MS)
On Thu, 12 Jul 2018 15:46:06 +1000 Michael Ellerman wrote:
> Daniel Klamt writes:
> 
> > Replaced msleep(x) with with msleep(OPAL_BUSY_DELAY_MS)
> > to diocument these sleep is to wait for opal.
> >
> > Signed-off-by: Daniel Klamt
> > Signed-off-by: Bjoern Noetel
> 
> Thanks.
> 
> Your change log should be in the imperative mood, see:
> 
>   https://git.kernel.org/pub/scm/git/git.git/tree/Documentation/SubmittingPatches?id=HEAD#n133
> 
> In this case that just means saying "Replace" rather than "Replaced".
> 
> Also the prefix should be "powerpc/xive". You can guess that by doing:
> 
>   $ git log --oneline arch/powerpc/sysdev/xive/native.c
> 
> And notice that the majority of commits use that prefix.
> 
> I've fixed both of those things up for you.

Sorry, just noticed this. I've got a patch which changes the xive stuff to the "standard" format, which this will clash with:

	if (rc == OPAL_BUSY_EVENT) {
		msleep(OPAL_BUSY_DELAY_MS);
		opal_poll_events(NULL);
	} else if (rc == OPAL_BUSY) {
		msleep(OPAL_BUSY_DELAY_MS);
	}

If it's already merged that's fine, I can rebase.

Thanks,
Nick
Re: [PATCH kernel v6 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
On Wed, 11 Jul 2018 21:00:44 +1000 Alexey Kardashevskiy wrote:
> A VM which has:
> - a DMA capable device passed through to it (eg. network card);
> - running a malicious kernel that ignores H_PUT_TCE failure;
> - capability of using IOMMU pages bigger than physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
>
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
>
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
>
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> did not hit this yet as the very first time the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls the VFIO IOMMU driver; this fails and
> the guest does not retry.
>
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
>
> This calculates the maximum page size as a minimum of the natural region
> alignment and compound page size. For the page shift this uses the shift
> returned by find_linux_pte() which indicates how the page is mapped to
> the current userspace - if the page is huge and this is not zero, then
> it is a leaf pte and the page is mapped within the range.
> 
> Signed-off-by: Alexey Kardashevskiy

> @@ -199,6 +209,25 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		}
>  	}
>  
> populate:
> +	pageshift = PAGE_SHIFT;
> +	if (PageCompound(page)) {
> +		pte_t *pte;
> +		struct page *head = compound_head(page);
> +		unsigned int compshift = compound_order(head);
> +
> +		local_irq_save(flags); /* disables as well */
> +		pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
> +		local_irq_restore(flags);
> +		if (!pte) {
> +			ret = -EFAULT;
> +			goto unlock_exit;
> +		}
> +		/* Double check it is still the same pinned page */
> +		if (pte_page(*pte) == head && pageshift == compshift)
> +			pageshift = max_t(unsigned int, pageshift,
> +					  PAGE_SHIFT);

I don't understand this logic. If the page was different, the shift would be wrong. You're not retrying but instead ignoring it in that case.

I think I would be slightly happier with the definitely-not-racy get_user_pages slow approach. Anything lock-less like this would be a premature optimisation without performance numbers...

Thanks,
Nick
[RFC] macintosh: Use common code to access RTC
Once the 68k Mac port adopts the via-pmu driver, it must access the PMU RTC using the appropriate command format. The same code can then be used for both m68k and powerpc. Replace the RTC code that's duplicated in arch/powerpc and arch/m68k with common RTC accessors for Cuda and PMU devices.

While we're at it, drop the problematic WARN_ON that was introduced in commit 22db552b50fa ("powerpc/powermac: Fix rtc read/write functions").
---
The patch below hasn't been tested yet and would need to be applied after my PMU patch series (v4). So I'll probably append it to v5.

The patch below is an alternative both to the patch Arnd posted here,
https://lore.kernel.org/lkml/20180619140229.3615110-2-a...@arndb.de/
as well as another future patch to remove the WARN_ON from arch/powerpc/platforms/powermac/time.c.
---
 arch/m68k/mac/misc.c                   |  75 +++
 arch/powerpc/platforms/powermac/time.c | 130 -
 drivers/macintosh/via-cuda.c           |  35 +
 drivers/macintosh/via-pmu.c            |  33 +
 include/linux/cuda.h                   |   4 +
 include/linux/pmu.h                    |   4 +
 6 files changed, 99 insertions(+), 182 deletions(-)

diff --git a/arch/m68k/mac/misc.c b/arch/m68k/mac/misc.c
index 28090a44fa09..21e3afa48de9 100644
--- a/arch/m68k/mac/misc.c
+++ b/arch/m68k/mac/misc.c
@@ -33,34 +33,6 @@ static void (*rom_reset)(void);
 
 #ifdef CONFIG_ADB_CUDA
-static long cuda_read_time(void)
-{
-	struct adb_request req;
-	long time;
-
-	if (cuda_request(&req, NULL, 2, CUDA_PACKET, CUDA_GET_TIME) < 0)
-		return 0;
-	while (!req.complete)
-		cuda_poll();
-
-	time = (req.reply[3] << 24) | (req.reply[4] << 16) |
-	       (req.reply[5] << 8) | req.reply[6];
-	return time - RTC_OFFSET;
-}
-
-static void cuda_write_time(long data)
-{
-	struct adb_request req;
-
-	data += RTC_OFFSET;
-	if (cuda_request(&req, NULL, 6, CUDA_PACKET, CUDA_SET_TIME,
-			 (data >> 24) & 0xFF, (data >> 16) & 0xFF,
-			 (data >> 8) & 0xFF, data & 0xFF) < 0)
-		return;
-	while (!req.complete)
-		cuda_poll();
-}
-
 static __u8 cuda_read_pram(int offset)
 {
 	struct adb_request req;
@@ -86,34 +58,6 @@ static void cuda_write_pram(int offset, __u8 data)
 #endif /* CONFIG_ADB_CUDA */
 
 #ifdef CONFIG_ADB_PMU
-static long pmu_read_time(void)
-{
-	struct adb_request req;
-	long time;
-
-	if (pmu_request(&req, NULL, 1, PMU_READ_RTC) < 0)
-		return 0;
-	while (!req.complete)
-		pmu_poll();
-
-	time = (req.reply[1] << 24) | (req.reply[2] << 16) |
-	       (req.reply[3] << 8) | req.reply[4];
-	return time - RTC_OFFSET;
-}
-
-static void pmu_write_time(long data)
-{
-	struct adb_request req;
-
-	data += RTC_OFFSET;
-	if (pmu_request(&req, NULL, 5, PMU_SET_RTC,
-			(data >> 24) & 0xFF, (data >> 16) & 0xFF,
-			(data >> 8) & 0xFF, data & 0xFF) < 0)
-		return;
-	while (!req.complete)
-		pmu_poll();
-}
-
 static __u8 pmu_read_pram(int offset)
 {
 	struct adb_request req;
@@ -291,13 +235,17 @@ static long via_read_time(void)
  * is basically any machine with Mac II-style ADB.
  */
-static void via_write_time(long time)
+static void via_set_rtc_time(struct rtc_time *tm)
 {
 	union {
 		__u8 cdata[4];
 		long idata;
 	} data;
 	__u8 temp;
+	unsigned long time;
+
+	time = mktime(tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+		      tm->tm_hour, tm->tm_min, tm->tm_sec);
 
 	/* Clear the write protect bit */
 
@@ -635,12 +583,12 @@ int mac_hwclk(int op, struct rtc_time *t)
 #ifdef CONFIG_ADB_CUDA
 	case MAC_ADB_EGRET:
 	case MAC_ADB_CUDA:
-		now = cuda_read_time();
+		now = cuda_get_time();
 		break;
 #endif
 #ifdef CONFIG_ADB_PMU
 	case MAC_ADB_PB2:
-		now = pmu_read_time();
+		now = pmu_get_time();
 		break;
 #endif
 	default:
@@ -659,24 +607,21 @@ int mac_hwclk(int op, struct rtc_time *t)
 			__func__, t->tm_year + 1900, t->tm_mon + 1, t->tm_mday,
 			t->tm_hour, t->tm_min, t->tm_sec);
 
-	now = mktime(t->tm_year + 1900, t->tm_mon + 1, t->tm_mday,
-		     t->tm_hour, t->tm_min, t->tm_sec);
-
 	switch (macintosh_config->adb_type) {
 	case MAC_ADB_IOP:
 	case MAC_ADB_II:
 	case MAC_ADB_PB1:
-		via_write_time(now);
+		via_set_rtc_time(t);
 		break;
 #ifdef CONFIG_ADB_CUDA
 	case MAC_ADB_EGRET:
Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
> > I just roughly checked, but if I checked the right place,
> > vmemmap_populated() checks for the section to contain the flags we are
> > setting in sparse_init_one_section().
> 
> Yes.
> 
> > But with this patch, we populate everything first, and then we call
> > sparse_init_one_section() in sparse_init().
> > As I said I could be mistaken because I just checked the surface.
> 
> Yeah I think that's correct.
> 
> This might just be a bug in our code, let me look at it a bit.

I wonder if something like this could do the trick:

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 51ce091914f9..e281651f50cd 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -177,6 +177,8 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
 	vmemmap_list = vmem_back;
 }
 
+static unsigned long last_addr_populated = 0;
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
@@ -191,7 +193,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		void *p;
 		int rc;
 
-		if (vmemmap_populated(start, page_size))
+		if (start + page_size <= last_addr_populated)
 			continue;
 
 		if (altmap)
@@ -212,6 +214,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 				__func__, rc);
 			return -EFAULT;
 		}
+		last_addr_populated = start + page_size;
 	}

I know it looks hacky, and chances are that it is wrong, but could you give it a try? I will try to grab a ppc server and try it out too.

Thanks
-- 
Oscar Salvador
SUSE L3
Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.
On Tue, 3 Jul 2018 08:08:14 +1000 "Nicholas Piggin" wrote:
> On Mon, 02 Jul 2018 11:17:06 +0530
> Mahesh J Salgaonkar wrote:
> 
> > From: Mahesh Salgaonkar
> >
> > On pseries, as of today system crashes if we get a machine check
> > exceptions due to SLB errors. These are soft errors and can be
> > fixed by flushing the SLBs so the kernel can continue to function
> > instead of system crash. We do this in real mode before turning on
> > MMU. Otherwise we would run into nested machine checks. This patch
> > now fetches the rtas error log in real mode and flushes the SLBs on
> > SLB errors.
> >
> > Signed-off-by: Mahesh Salgaonkar
> > ---
> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  1
> >  arch/powerpc/include/asm/machdep.h            |  1
> >  arch/powerpc/kernel/exceptions-64s.S          | 42 +
> >  arch/powerpc/kernel/mce.c                     | 16 +++-
> >  arch/powerpc/mm/slb.c                         |  6 +++
> >  arch/powerpc/platforms/powernv/opal.c         |  1
> >  arch/powerpc/platforms/pseries/pseries.h      |  1
> >  arch/powerpc/platforms/pseries/ras.c          | 51 +
> >  arch/powerpc/platforms/pseries/setup.c        |  1
> >  9 files changed, 116 insertions(+), 4 deletions(-)
> >
> > +TRAMP_REAL_BEGIN(machine_check_pSeries_early)
> > +BEGIN_FTR_SECTION
> > +	EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
> > +	mr	r10,r1			/* Save r1 */
> > +	ld	r1,PACAMCEMERGSP(r13)	/* Use MC emergency stack */
> > +	subi	r1,r1,INT_FRAME_SIZE	/* alloc stack frame */
> > +	mfspr	r11,SPRN_SRR0		/* Save SRR0 */
> > +	mfspr	r12,SPRN_SRR1		/* Save SRR1 */
> > +	EXCEPTION_PROLOG_COMMON_1()
> > +	EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
> > +	EXCEPTION_PROLOG_COMMON_3(0x200)
> > +	addi	r3,r1,STACK_FRAME_OVERHEAD
> > +	BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
> 
> Is there any reason you can't use the existing
> machine_check_powernv_early code to do all this?
> 
> > diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> > index efdd16a79075..221271c96a57 100644
> > --- a/arch/powerpc/kernel/mce.c
> > +++ b/arch/powerpc/kernel/mce.c
> > @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
> >  {
> >  	long handled = 0;
> >  
> > -	__this_cpu_inc(irq_stat.mce_exceptions);
> > +	/*
> > +	 * For pSeries we count mce when we go into virtual mode machine
> > +	 * check handler. Hence skip it. Also, we can't access per cpu
> > +	 * variables in real mode for LPAR.
> > +	 */
> > +	if (early_cpu_has_feature(CPU_FTR_HVMODE))
> > +		__this_cpu_inc(irq_stat.mce_exceptions);
> >  
> > -	if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
> > +	/*
> > +	 * See if platform is capable of handling machine check.
> > +	 * Otherwise fallthrough and allow CPU to handle this machine check.
> > +	 */
> > +	if (ppc_md.machine_check_early)
> > +		handled = ppc_md.machine_check_early(regs);
> > +	else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
> > +		handled = cur_cpu_spec->machine_check_early(regs);
> 
> Would be good to add a powernv ppc_md handler which does the
> cur_cpu_spec->machine_check_early() call now that other platforms are
> calling this code. Because those aren't valid as a fallback call, but
> specific to powernv.

Something like this (untested)?

Subject: [PATCH] powerpc/powernv: define platform MCE handler.

---
 arch/powerpc/kernel/mce.c              |  3 ---
 arch/powerpc/platforms/powernv/setup.c | 11 +++
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 221271c96a57..ae17d8aa60c4 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -498,12 +498,9 @@ long machine_check_early(struct pt_regs *regs)
 
 	/*
 	 * See if platform is capable of handling machine check.
-	 * Otherwise fallthrough and allow CPU to handle this machine check.
 	 */
 	if (ppc_md.machine_check_early)
 		handled = ppc_md.machine_check_early(regs);
-	else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
-		handled = cur_cpu_spec->machine_check_early(regs);
 
 	return handled;
 }
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index f96df0a25d05..b74c93bc2e55 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -431,6 +431,16 @@ static unsigned long pnv_get_proc_freq(unsigned int cpu)
 	return ret_freq;
 }
 
+static long pnv_machine_check_early(struct pt_regs *regs)
+{
+	long handled = 0;
+
+	if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
+		handled = cur_cpu_spec->machine_check_early(regs);
+
+	return handled;
+}
+
 define_machine(powern
Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
On Thu, Jul 12, 2018 at 5:50 AM Oscar Salvador wrote:
>
> > > I just roughly check, but if I checked the right place,
> > > vmemmap_populated() checks for the section to contain the flags we are
> > > setting in sparse_init_one_section().
> >
> > Yes.
> >
> > > But with this patch, we populate first everything, and then we call
> > > sparse_init_one_section() in sparse_init().
> > > As I said I could be mistaken because I just checked the surface.

Yes, this is right: sparse_init_one_section() is needed after every populate call on ppc64. I am adding this to my sparse_init re-write, and it actually simplifies the code, as it avoids one extra loop, and makes ppc64 work.

Pavel
[PATCH] powerpc/prom_init: remove linux,stdout-package property
This property was added in 2004 by
https://github.com/mpe/linux-fullhistory/commit/689fe5072fe9a0dec914bfa4fa60aed1e54563e6
and the only use of it, which was already inside `#if 0`, was removed a month later by
https://github.com/mpe/linux-fullhistory/commit/1fbe5a6d90f6cd4ea610737ef488719d1a875de7

Fixes: https://github.com/linuxppc/linux/issues/125
Signed-off-by: Murilo Opsfelder Araujo
---
 arch/powerpc/kernel/prom_init.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5425dd3d6a9f..c45fb463c9e5 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2102,8 +2102,6 @@ static void __init prom_init_stdout(void)
 	stdout_node = call_prom("instance-to-package", 1, 1, prom.stdout);
 	if (stdout_node != PROM_ERROR) {
 		val = cpu_to_be32(stdout_node);
-		prom_setprop(prom.chosen, "/chosen", "linux,stdout-package",
-			     &val, sizeof(val));
 
 		/* If it's a display, note it */
 		memset(type, 0, sizeof(type));
-- 
2.17.1
Re: [next-20180711][Oops] linux-next kernel boot is broken on powerpc
> Related commit could be one of below? I see lots of patches related to mm
> and could not bisect:
>
> 5479976fda7d3ab23ba0a4eb4d60b296eb88b866 mm: page_alloc: restore memblock_next_valid_pfn() on arm/arm64
> 41619b27b5696e7e5ef76d9c692dd7342c1ad7eb mm-drop-vm_bug_on-from-__get_free_pages-fix
> 531bbe6bd2721f4b66cdb0f5cf5ac14612fa1419 mm: drop VM_BUG_ON from __get_free_pages
> 479350dd1a35f8bfb2534697e5ca68ee8a6e8dea mm, page_alloc: actually ignore mempolicies for high priority allocations
> 088018f6fe571444caaeb16e84c9f24f22dfc8b0 mm: skip invalid pages block at a time in zero_resv_unresv()

Looks like:

0ba29a108979 mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER

This patch is going to be reverted from linux-next. Abdul, please verify that the issue is gone once you revert this patch.

Thank you,
Pavel
[PATCH] powerpc/Makefile: Assemble with -me500 when building for E500
Some of the assembly files use instructions specific to BookE or E500, which are rejected with the now-default -mcpu=powerpc, so we must pass -me500 to the assembler just as we pass -me200 for E200.

Fixes: 4bf4f42a2feb ("powerpc/kbuild: Set default generic machine type for 32-bit compile")
Signed-off-by: James Clarke
---
 arch/powerpc/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 2ea575cb..fb96206d 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -243,6 +243,7 @@ endif
 cpu-as-$(CONFIG_4xx)		+= -Wa,-m405
 cpu-as-$(CONFIG_ALTIVEC)	+= $(call as-option,-Wa$(comma)-maltivec)
 cpu-as-$(CONFIG_E200)		+= -Wa,-me200
+cpu-as-$(CONFIG_E500)		+= -Wa,-me500
 cpu-as-$(CONFIG_PPC_BOOK3S_64)	+= -Wa,-mpower4
 cpu-as-$(CONFIG_PPC_E500MC)	+= $(call as-option,-Wa$(comma)-me500mc)
-- 
2.18.0
Re: [PATCH kernel v6 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
On Wed, 11 Jul 2018 21:00:44 +1000 Alexey Kardashevskiy wrote:
> A VM which has:
> - a DMA capable device passed through to it (eg. network card);
> - running a malicious kernel that ignores H_PUT_TCE failure;
> - capability of using IOMMU pages bigger than physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
>
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
>
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
>
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> did not hit this yet as the very first time the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls the VFIO IOMMU driver; this fails and
> the guest does not retry.
>
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
>
> This calculates the maximum page size as a minimum of the natural region
> alignment and compound page size. For the page shift this uses the shift
> returned by find_linux_pte() which indicates how the page is mapped to
> the current userspace - if the page is huge and this is not zero, then
> it is a leaf pte and the page is mapped within the range.
> 
> Signed-off-by: Alexey Kardashevskiy
> ---
> Changes:
> v6:
> * replaced hugetlbfs with pageshift from find_linux_pte()
> 
> v5:
> * only consider compound pages from hugetlbfs
> 
> v4:
> * reimplemented max pageshift calculation
> 
> v3:
> * fixed upper limit for the page size
> * added checks that we don't register parts of a huge page
> 
> v2:
> * explicitely check for compound pages before calling compound_order()
> 
> ---
> The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> for IOMMU pages without checking the mmu pagesize and this will fail at
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> 
> With the change, mapping will fail in KVM and the guest will print:
> 
> mlx5_core :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 18 1f returned 0 (liobn = 0x8001 starting addr = 800 0)
> mlx5_core :00:00.0: created tce table LIOBN 0x8001 for /pci@8002000/ethernet@0
> mlx5_core :00:00.0: failed to map direct window for /pci@8002000/ethernet@0: -1
> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 ++--
>  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++--
>  arch/powerpc/mm/mmu_context_iommu.c    | 39 ++--
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
>  5 files changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 896efa5..79d570c 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa);
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa);
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index d066e37..8c456fa 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
>  		return H_TOO_HARD;
>  
> -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
>  		return H_HARDWARE;
> 
[PATCH 13/18] ibmvscsi: change strncpy+truncation to strlcpy
Generated by scripts/coccinelle/misc/strncpy_truncation.cocci

Signed-off-by: Dominique Martinet
---
Please see https://marc.info/?l=linux-kernel&m=153144450722324&w=2 (the first patch of the series) for the motivation behind this patch.

 drivers/scsi/ibmvscsi/ibmvscsi.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 17df76f0be3c..79eb8af03a19 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -1274,14 +1274,12 @@ static void send_mad_capabilities(struct ibmvscsi_host_data *hostdata)
 	if (hostdata->client_migrated)
 		hostdata->caps.flags |= cpu_to_be32(CLIENT_MIGRATED);
 
-	strncpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
+	strlcpy(hostdata->caps.name, dev_name(&hostdata->host->shost_gendev),
 		sizeof(hostdata->caps.name));
-	hostdata->caps.name[sizeof(hostdata->caps.name) - 1] = '\0';
 
 	location = of_get_property(of_node, "ibm,loc-code", NULL);
 	location = location ? location : dev_name(hostdata->dev);
-	strncpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
-	hostdata->caps.loc[sizeof(hostdata->caps.loc) - 1] = '\0';
+	strlcpy(hostdata->caps.loc, location, sizeof(hostdata->caps.loc));
 
 	req->common.type = cpu_to_be32(VIOSRP_CAPABILITIES_TYPE);
 	req->buffer = cpu_to_be64(hostdata->caps_addr);
-- 
2.17.1
Re: [PATCH kernel v3 0/6] powerpc/powernv/iommu: Optimize memory use
On Wed, 4 Jul 2018 16:13:43 +1000 Alexey Kardashevskiy wrote:
> This patchset aims to reduce actual memory use for guests with
> sparse memory. The pseries guest uses dynamic DMA windows to map
> the entire guest RAM but it only actually maps onlined memory
> which may not be contiguous. I hit this when I tried passing
> through the NVLink2-connected GPU RAM of an NVIDIA V100, and trying to
> map this RAM at the same offset as in the real hardware
> forced me to rework how I handle these windows.
>
> This moves the userspace-to-host-physical translation table
> (iommu_table::it_userspace) from the VFIO TCE IOMMU subdriver to
> the platform code and reuses the already existing multilevel
> TCE table code which we have for the hardware tables.
> At last, in 6/6 I switch to on-demand allocation so we do not
> allocate huge chunks of the table if we do not have to;
> there is some math in 6/6.
>
> Changes:
> v3:
> * rebased on v4.18-rc3 and fixed compile error in 6/6
>
> v2:
> * bugfix and error handling in 6/6
>
>
> This is based on sha1
> 021c917 Linus Torvalds "Linux 4.18-rc3".
>
> Please comment. Thanks.

Ping?

> 
> 
> Alexey Kardashevskiy (6):
>   powerpc/powernv: Remove useless wrapper
>   powerpc/powernv: Move TCE manupulation code to its own file
>   KVM: PPC: Make iommu_table::it_userspace big endian
>   powerpc/powernv: Add indirect levels to it_userspace
>   powerpc/powernv: Rework TCE level allocation
>   powerpc/powernv/ioda: Allocate indirect TCE levels on demand
> 
>  arch/powerpc/platforms/powernv/Makefile       |   2 +-
>  arch/powerpc/include/asm/iommu.h              |  11 +-
>  arch/powerpc/platforms/powernv/pci.h          |  44 ++-
>  arch/powerpc/kvm/book3s_64_vio.c              |  11 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c           |  18 +-
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 399 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c     | 184 ++--
>  arch/powerpc/platforms/powernv/pci.c          | 158 --
>  drivers/vfio/vfio_iommu_spapr_tce.c           |  65 +
>  9 files changed, 478 insertions(+), 414 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c

-- 
Alexey
Re: [PATCH 1/2] mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
2018-07-12 16:15 GMT+09:00 Christoph Hellwig:
> On Thu, Jul 12, 2018 at 11:48:47AM +0900, Joonsoo Kim wrote:
>> One of the existing users is the general DMA layer and it takes gfp flags that are
>> provided by the user. I haven't checked all the DMA allocation sites, but how do
>> you convince yourself that none of them try to use anything other
>> than GFP_KERNEL [|__GFP_NOWARN]?
>
> They still use a few other things like __GFP_COMP, __GFP_DMA or
> GFP_HUGEPAGE. But all these are bogus as we have various implementations
> that can't respect them. I plan to get rid of the gfp_t argument
> in the dma_map_ops alloc method in a few merge windows because of that,
> but it needs further implementation consolidation first.

Okay. If those flags are all, this change would be okay. For a reminder of why this gfp flag was introduced in cma_alloc(), see the following link:

https://marc.info/?l=linux-mm&m=148431452118407

Thanks.