Re: [PATCH v2] scsi: ibmvfc: interface updates for future FPIN and MQ support
On 9/2/20 7:11 PM, Martin K. Petersen wrote:
>
> Tyrel,
>
>> Fixup compiler errors from neglected commit --amend
>
> Bunch of formatting-related checkpatch warnings. Please fix.
>
> Thanks!

So, I stuck to the existing style already in that header. If I'm going to fix it up to make checkpatch happy, I imagine it makes sense to send a prerequisite patch that fixes up the rest of the header.

-Tyrel
Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()
On 02/09/2020 08:34, Leonardo Bras wrote:
> On Mon, 2020-08-31 at 10:47 +1000, Alexey Kardashevskiy wrote:
>
> Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
> should be enough, is there any chance to get indirect mapping in qemu
> like this? (DDW but with smaller DMA window available)

You will have to hack the guest kernel to always do indirect mapping or hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of available TCEs. But you will be testing QEMU/KVM which behave quite differently to pHyp in this particular case.

> As you suggested before, building for 4k cpu pagesize should be the
> best approach. It would allow testing for both pHyp and qemu scenarios.
>
>> Because if we want the former (==support), then we'll have to align
>> the size up to the bigger page size when allocating/zeroing system
>> pages, etc.
>
> This part I don't understand. Why do we need to align everything to the
> bigger pagesize?
>
> I mean, is not that enough that the range [ret, ret + size[ is both
> allocated by mm and mapped on a iommu range?
>
> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
> IOMMU_PAGE_SIZE() == 64k.
>
> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
> All the space the user asked for is allocated and mapped for DMA.

The user asked to map 16K, the rest - 48K - is used for something else (may be even mapped to another device) but you are making all 64K accessible by the device which only should be able to access 16K. In practice, if this happens, H_PUT_TCE will simply fail.

> I have noticed mlx5 driver getting a few bytes in a buffer, and using
> iommu_map_page(). It does map a whole page for as few bytes as the user

Whole 4K system page or whole 64K iommu page?

> I tested it in 64k system page + 64k iommu page.
> The 64K system page may be used for anything, and a small portion of it
> (say 128 bytes) needs to be used for DMA.
> The whole page is mapped by IOMMU, and the driver gets info of the
> memory range it should access / modify.

This works because the whole system page belongs to the same memory context and IOMMU allows a device to access that page. You can still have problems if there is a bug within the page but it will go mostly unnoticed as it will be memory corruption.

If your system page is smaller (4K) than the IOMMU page (64K), then the device gets wider access than it should but it is still going to be silent memory corruption.

> wants mapped, and the other bytes get used for something else, or just
> mapped on another DMA page. It seems to work fine.

With 4K system page and 64K IOMMU page? In practice it would take an effort or/and bad luck to see it crashing.

Thanks,

> I haven't tested it yet. On a 64k system page and 4k/64k iommu page, it
> works as described above.
> I am new to this, so I am trying to understand how a memory page mapped
> as DMA, and used for something else could be a problem.

From the device perspective, there is PCI space and everything from 0 till 1<<64 is accessible and what is that mapped to - the device does not know. PHB's IOMMU is the thing to notice invalid access and raise EEH but PHB only knows about PCI->physical memory mapping (with IOMMU pages) but nothing about the host kernel pages.

Does this help?

Thanks,

> Thanks!
>
> Bigger pages are not the case here as I understand it.

I did not get this part, what do you mean? Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the supported set of sizes is different for P8/P9 and type of IO (PHB, NVLink/CAPI).

>> Update those functions to guarantee alignment with requested size
>> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>>
>> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
>> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>> Signed-off-by: Leonardo Bras
>> ---
>>  arch/powerpc/kernel/iommu.c | 17 +
>>  1 file changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 9704f3f76e63..d7086087830f 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
>>  	}
>>
>>  	if (dev)
>> -		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
>> -				      1 << tbl->it_page_shift);
>> +		boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);

Run checkpatch.pl, should complain about a long line.

> It's 86 columns long, which is less than the new limit of 100 columns
> Linus announced a few weeks ago. checkpatch.pl was updated too:
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col

Yay finally :) Thanks, :)

>>  	else
>> -		boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
>> +		boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>>  		/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>>
>>  	n = iommu_area_
Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift
On 02/09/2020 07:38, Leonardo Bras wrote:
> On Mon, 2020-08-31 at 13:48 +1000, Alexey Kardashevskiy wrote:
>
> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> hardcoded 40-bit mask (0xfful), for hard-coded 12-bit (4k) pagesize,
> and on PAPR+/LoPAR also defines TCE as having bits 0-51 described as
> RPN, as described before.
>
> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> shows system memory mapping into a TCE, and the TCE also has bits 0-51
> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>
> In fact, by the looks of those figures, the RPN_MASK should always be
> a 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.

I suspect the mask is there in the first place for extra protection against too big addresses going to the TCE table (or/and for virtual vs physical addresses). Using a 52-bit mask makes no sense for anything, you could just drop the mask and let the C compiler deal with 64bit "uint" as it is basically a 4K page address anywhere in the 64bit space.

Thanks,

> Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
> physical address space. The IODA3 spec does explicitly say the upper
> bits are optional and the implementation only needs to support enough
> to cover up to the physical address limit, which is 56bits on P9 /
> PHB4. If you want to validate that the address will fit inside of
> MAX_PHYSMEM_BITS then fine, but I think that should be done as a
> WARN_ON or similar rather than just silently masking off the bits.

We can do this and probably should anyway but I am also pretty sure we can just ditch the mask and have the hypervisor return an error which will show up in dmesg.

> Ok then, ditching the mask.

Well, you could run a little experiment and set some bits above that old mask and see how phyp reacts :)

Thanks!

--
Alexey
Re: [PATCH 1/2] dma-mapping: introduce dma_get_seg_boundary_nr_pages()
Nicolin Chen writes:
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9704f3f76e63..cbc2e62db597 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -236,15 +236,10 @@ static unsigned long iommu_range_alloc(struct device *dev,
>  		}
>  	}
>
> -	if (dev)
> -		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> -				      1 << tbl->it_page_shift);
> -	else
> -		boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> -	/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> +	boundary_size = dma_get_seg_boundary_nr_pages(dev, tbl->it_page_shift);
>
>  	n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> -			boundary_size >> tbl->it_page_shift, align_mask);
> +			boundary_size, align_mask);

This has changed the units of boundary_size, but it's unused elsewhere in the function so that's OK.

If you need to do a v2 for any other reason, then I'd just drop boundary_size and call dma_get_seg_boundary_nr_pages() directly in the function call.

Acked-by: Michael Ellerman (powerpc)

cheers
Re: [PATCH] ASoC: fsl_sai: Set SAI Channel Mode to Output Mode
On Thu, Sep 03, 2020 at 11:09:15AM +0800, Shengjiu Wang wrote:
> Transmit data pins will output zero when slots are masked or channels
> are disabled. In CHMOD TDM mode, transmit data pins are tri-stated when
> slots are masked or channels are disabled. When data pins are tri-stated,
> there is noise on some channels when FS clock value is high and data is
> read while fsclk is transitioning from high to low.
>
> Signed-off-by: Cosmin-Gabriel Samoila
> Signed-off-by: Shengjiu Wang

Acked-by: Nicolin Chen

Though one nit inline:

> ---
>  sound/soc/fsl/fsl_sai.c | 12 ++--
>  sound/soc/fsl/fsl_sai.h |  2 ++
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> index 62c5fdb678fc..33b194a5c1dc 100644
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -486,6 +486,12 @@ static int fsl_sai_hw_params(struct snd_pcm_substream *substream,
>
>  	val_cr4 |= FSL_SAI_CR4_FRSZ(slots);
>
> +	/* Output Mode - data pins transmit 0 when slots are masked
> +	 * or channels are disabled
> +	 */

Coding style for multi-line comments. Yet, probably can simplify?

	/* Set to output mode to avoid tri-stated data pins */
Re: [PATCH 0/2] dma-mapping: update default segment_boundary_mask
On Wed, Sep 02, 2020 at 10:13:12AM +0200, Niklas Schnelle wrote:
> On 9/2/20 12:16 AM, Nicolin Chen wrote:
> > These two patches are to update default segment_boundary_mask.
> >
> > PATCH-1 fixes overflow issues in callers of dma_get_seg_boundary.
> > Previous version was a series: https://lkml.org/lkml/2020/8/31/1026
> >
> > Then PATCH-2 sets default segment_boundary_mask to ULONG_MAX.
> >
> > Nicolin Chen (2):
> >   dma-mapping: introduce dma_get_seg_boundary_nr_pages()
> >   dma-mapping: set default segment_boundary_mask to ULONG_MAX
>
> I gave both of your patches a quick test ride on a couple of dev mainframes,
> both NVMe, ConnectX and virtio-pci devices all seems to work fine.
> I already commented on Christoph's mail that I like the helper approach,
> so as for s390 you can add my
>
> Acked-by: Niklas Schnelle

Thanks for testing and the ack!
Re: [PATCH v2] scsi: ibmvfc: interface updates for future FPIN and MQ support
Tyrel,

> Fixup compiler errors from neglected commit --amend

Bunch of formatting-related checkpatch warnings. Please fix.

Thanks!

--
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH] powerpc/64s: handle ISA v3.1 local copy-paste context switches
On Tue, Aug 25, 2020 at 05:55:35PM +1000, Nicholas Piggin wrote:
> The ISA v3.1 the copy-paste facility has a new memory move functionality
> which allows the copy buffer to be pasted to domestic memory (RAM) as
> opposed to foreign memory (accelerator).
>
> This means the POWER9 trick of avoiding the cp_abort on context switch if
> the process had not mapped foreign memory does not work on POWER10. Do the
> cp_abort unconditionally there.
>
> KVM must also cp_abort on guest exit to prevent copy buffer state leaking
> between contexts.
>
> Signed-off-by: Nicholas Piggin

For the KVM part:

Acked-by: Paul Mackerras
Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
On Wed, Sep 2, 2020 at 8:17 AM Christophe Leroy wrote:
>
> With this fix, I get
>
> root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M
> 536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s
>
> That's still far from the 91.7MB/s I get with 5.9-rc2, but better than
> the 65.8MB/s I got yesterday with your series. Still some way to go though.

I don't see why this change would make any difference.

And btw, why do the 32-bit and 64-bit checks even differ? It's not like the extra (single) instruction should even matter. I think the main reason is that the simpler 64-bit case could stay as a macro (because it only uses "addr" and "size" once), but honestly, that "simplification" doesn't help when you then need to have that #ifdef for the 32-bit case and an inline function anyway.

So why isn't it just

	static inline int __access_ok(unsigned long addr, unsigned long size)
	{
		return addr <= TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
	}

for both and be done with it? The "size=0" check is only relevant for the "addr == TASK_SIZE_MAX" case, and existed in the old code because it had that "-1" thing because "seg.seg" was actually TASK_SIZE-1. Now that we don't have any TASK_SIZE-1, zero isn't special any more.

However, I suspect a bigger reason for the actual performance degradation would be the patch that makes things use "write_iter()" for writing, even when a simpler "write()" exists. For writing to /dev/null, the cost of setting up iterators and all the pointless indirection is all kinds of stupid.

So I think "write()" should just go back to default to using "->write()" rather than "->write_iter()" if the simpler case exists.

Linus
Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker
Le 02/09/2020 à 19:41, Nick Desaulniers a écrit :
> On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman wrote:
>> Nick Desaulniers writes:
>>> Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan sections")
>>
>> I think I'll just revert that for v5.9 ?
>
> SGTM; you'll probably still want these changes with some modifications
> at some point; vdso32 did have at least one orphaned section, and will
> be important for hermetic builds. Seeing crashes in supported versions
> of the tools ties our hands at the moment.

Keeping the tool problem aside with binutils 2.26, do you have a way to really link an elf32ppc object when building vdso32 for PPC64 ?

Christophe
Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker
On Wed, Sep 02, 2020 at 05:43:03PM +0200, Christophe Leroy wrote:
> > Try with a newer ld? If it still happens with current versions, please
> > open a bug report? https://sourceware.org/bugzilla
>
> Yes it works with 2.30 and 2.34

Ah okay, I missed this part.

> But minimum for building kernel is supposed to be 2.23

Sure. That could be upgraded to 2.24 -- you should use a binutils at least as new as your GCC, and that requires 4.9 now -- but that probably doesn't help you here.

Segher
Re: ptrace_syscall_32 is failing
On Wed, Sep 2, 2020 at 1:29 AM Thomas Gleixner wrote:
>
> On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> > On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner wrote:
> >> > I think that they almost work for x86, but not quite as
> >> > indicated by this bug. Even if we imagine we can somehow hack around
> >> > this bug, I imagine we're going to find other problems with this
> >> > model, e.g. the potential upcoming exit problem I noted in my review.
> >>
> >> What's the upcoming problem?
> >
> > If we ever want to get single-stepping fully correct across syscalls,
> > we might need to inject SIGTRAP on syscall return. This would be more
> > awkward if we can't run instrumentable code after the syscall part of
> > the syscall is done.
>
> We run a lot of instrumentable code after sys_foo() returns. Otherwise
> all the TIF work would not be possible at all.
>
> But you might tell me where exactly you want to inject the SIGTRAP in
> the syscall exit code flow.

It would be a bit complicated. Definitely after any signals from the syscall are delivered. Right now, I think that we don't deliver a SIGTRAP on the instruction boundary after SYSCALL while single-stepping. (I think we used to, but only sometimes, and now we are at least consistent.) This is because IRET will not trap if it starts with TF clear and ends up setting it. (I asked Intel to document this, and I think they finally did, although I haven't gotten around to reading the new docs. Certainly the old docs as of a year or two ago had no description whatsoever of how TF changes worked.)

Deciding exactly *when* a trap should occur would be nontrivial -- we can't trap on sigreturn() from a SIGTRAP, for example. So this isn't fully worked out.

> >> I don't think we want that in general. The current variant is perfectly
> >> fine for everything except the 32bit fast syscall nonsense. Also
> >> irqentry_entry/exit is not equivalent to the syscall_enter/exit
> >> counterparts.
> >
> > If there are any architectures in which actual work is needed to
> > figure out whether something is a syscall in the first place, they'll
> > want to do the usual kernel entry work before the syscall entry work.
>
> That's low level entry code which does not require RCU, lockdep, tracing
> or whatever muck we setup before actual work can be done.
>
>   arch_asm_entry()
>     ...
>     arch_c_entry(cause) {
>       switch(cause) {
>         case EXCEPTION: arch_c_exception(...);
>         case SYSCALL: arch_c_syscall(...);
>         ...
>       }

You're assuming that figuring out the cause doesn't need the kernel entry code to run first. In the case of the 32-bit vDSO fast syscalls, we arguably don't know whether an entry is a syscall until we have done a user memory access. Logically, we're doing:

	if (get_user() < 0) {
		/* Not a syscall. This is actually a silly operation that
		   sets AX = -EFAULT and returns. Do not audit or invoke
		   ptrace. */
	} else {
		/* This actually is a syscall. */
	}

So we really do want to stick arch code between the enter_from_user_mode() and the audit check. We *can't* audit because we don't know the syscall args.

Now maybe we could invent new semantics for this in which a fault here is still somehow a syscall, but I think that would be a real ABI change and would want very careful thought. And it would be weird -- syscalls are supposed to actually call the syscall handler, aren't they? (Arguably we should go back in time and make this a SIGSEGV. We have the infrastructure to do this cleanly, but when I wrote the code I just copied the ABI from code that was before my time. Even so, it would be an exception, not a syscall.)

> You really want to differentiate between exception and syscall
> entry/exit.

Why do we want to distinguish between exception and syscall entry/exit?

For the enter part, AFAICS the exception case boils down to enter_from_user_mode() and the syscall case is:

	enter_from_user_mode(regs);
	instrumentation_begin();

	local_irq_enable();
	ti_work = READ_ONCE(current_thread_info()->flags);
	if (ti_work & SYSCALL_ENTER_WORK)
		syscall = syscall_trace_enter(regs, syscall, ti_work);
	instrumentation_end();

Which would decompose quite nicely as a regular (non-syscall) entry plus the syscall part later.

> The splitting of syscall_enter_from_user_mode() is only necessary for
> that 32bit fast syscall thing on x86 and there is no point to open code
> it with two calls for e.g. do_syscall_64().
>
> > Maybe your patch actually makes this possible -- I haven't digested
> > all the details yet.
> >
> > Who advised you to drop the arch parameter?
>
> Kees, IIRC, but I would have to search through the gazillions of mail
> threads to be sure.
>
> >> +	syscall_enter_from_user_mode_prepare(regs);
> >
> > I'm getting lost in all these "enter" functions...
>
> It's not that hard.
>
>   syscall_enter_from_user_mode_prepare()
>   + syscall_enter_from_user_mode_work()
Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker
Hi,

Le 02/09/2020 à 16:14, Segher Boessenkool a écrit :
> Hi!
>
> On Wed, Sep 02, 2020 at 06:46:45AM +, Christophe Leroy wrote:
>> ld crashes:
>>
>>   LD      arch/powerpc/kernel/vdso32/vdso32.so.dbg
>> /bin/sh: line 1: 23780 Segmentation fault (core dumped)
>> ppc-linux-ld -EB -m elf32ppc -shared -soname linux-vdso32.so.1
>> --eh-frame-hdr --orphan-handling=warn -T
>> arch/powerpc/kernel/vdso32/vdso32.lds
>> arch/powerpc/kernel/vdso32/sigtramp.o
>> arch/powerpc/kernel/vdso32/gettimeofday.o
>> arch/powerpc/kernel/vdso32/datapage.o
>> arch/powerpc/kernel/vdso32/cacheflush.o
>> arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o
>> arch/powerpc/kernel/vdso32/vdso32.so.dbg
>> make[4]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139
>>
>> [root@localhost linux-powerpc]# ppc-linux-ld --version
>> GNU ld (GNU Binutils) 2.26.20160125
>
> [ Don't build as root :-P ]
>
> Try with a newer ld? If it still happens with current versions, please
> open a bug report? https://sourceware.org/bugzilla

Yes it works with 2.30 and 2.34

But minimum for building kernel is supposed to be 2.23

Christophe
Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
Le 02/09/2020 à 14:36, Christoph Hellwig a écrit :
> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
>>> -	return 0;
>>> -	return (size == 0 || size - 1 <= seg.seg - addr);
>>> +	if (addr >= TASK_SIZE_MAX)
>>> +		return false;
>>> +	if (size == 0)
>>> +		return false;
>>
>> __access_ok() was returning true when size == 0 up to now. Any reason
>> to return false now ?
>
> No, this is accidental and broken. Can you re-run your benchmark with
> this fixed?

With this fix, I get

root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M
1048576+0 records in
1048576+0 records out
536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s
real	0m 6.78s
user	0m 1.64s
sys	0m 5.13s

That's still far from the 91.7MB/s I get with 5.9-rc2, but better than the 65.8MB/s I got yesterday with your series. Still some way to go though.

Christophe
RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
From: Christophe Leroy
> Sent: 02 September 2020 15:13
>
> Le 02/09/2020 à 15:51, David Laight a écrit :
> > From: Christophe Leroy
> >> Sent: 02 September 2020 14:25
> >> Le 02/09/2020 à 15:13, David Laight a écrit :
> >>> From: Christoph Hellwig
> >>>> Sent: 02 September 2020 13:37
> >>>>
> >>>> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >>>>>> -	return 0;
> >>>>>> -	return (size == 0 || size - 1 <= seg.seg - addr);
> >>>>>> +	if (addr >= TASK_SIZE_MAX)
> >>>>>> +		return false;
> >>>>>> +	if (size == 0)
> >>>>>> +		return false;
> >>>>>
> >>>>> __access_ok() was returning true when size == 0 up to now. Any reason
> >>>>> to return false now ?
> >>>>
> >>>> No, this is accidental and broken. Can you re-run your benchmark with
> >>>> this fixed?
> >>>
> >>> Is TASK_SIZE_MASK defined such that you can do:
> >>>
> >>> 	return ((addr | size) < TASK_SIZE_MAX) || !size;
> >>
> >> TASK_SIZE_MAX will usually be 0xc000
> >>
> >> With:
> >> 	addr = 0x8000;
> >> 	size = 0x8000;
> >>
> >> I expect it to fail
> >>
> >> With the formula you propose it will succeed, won't it ?
> >
> > Hmmm... Was I getting confused about some comments for 64bit
> > about there being such a big hole between valid user and kernel
> > addresses that it was enough to check that 'size < TASK_SIZE_MAX'.
> >
> > That would be true for 64bit x86 (and probably ppc (& arm??))
> > if TASK_SIZE_MAX were 0x4 << 60.
> > IIUC the highest user address is (much) less than 0x0 << 60
> > and the lowest kernel address (much) greater than 0xf << 60
> > on all these 64bit platforms.
> >
> > Actually if doing access_ok() inside get_user() you don't
> > need to check the size at all.
>
> You mean on 64 bit or on any platform ?

64bit and 32bit

> What about a word write to 0xbffe, won't it overwrite 0xc000 ?
>
> > You don't even need to in copy_to/from_user() provided
> > it always does a forwards copy.
>
> Do you mean due to the gap ?
> Is it guaranteed to be a gap ? Even on a 32 bits having TASK_SIZE set
> to 0xc000 and PAGE_OFFSET set to the same ?

I read somewhere (I won't find it again) that the last 4k page (below 0xc000) must not be allocated on i386 because some cpu (both intel and amd) do 'horrid things' if they try to (IIRC) do instruction prefetches across the boundary.

So the accesses to 0xbffe will fault and the one to 0xc000 won't happen (in any useful way at least).

I'd suspect that not allocating the 3G-4k page would be a safe bet on all architectures - even 68k.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c
https://bugzilla.kernel.org/show_bug.cgi?id=206203

Erhard F. (erhar...@mailbox.org) changed:

           What    |Removed |Added
------------------------------------------
Attachment #288187|0       |1
       is obsolete|        |

--- Comment #20 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 292289
  --> https://bugzilla.kernel.org/attachment.cgi?id=292289&action=edit
kernel .config (kernel 5.9-rc3, Talos II)

--
You are receiving this mail because:
You are watching the assignee of the bug.
[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c
https://bugzilla.kernel.org/show_bug.cgi?id=206203

Erhard F. (erhar...@mailbox.org) changed:

           What    |Removed |Added
------------------------------------------
Attachment #288189|0       |1
       is obsolete|        |

--- Comment #19 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 292287
  --> https://bugzilla.kernel.org/attachment.cgi?id=292287&action=edit
kmemleak output (kernel 5.9-rc3, Talos II)

--
You are receiving this mail because:
You are watching the assignee of the bug.
[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c
https://bugzilla.kernel.org/show_bug.cgi?id=206203

Erhard F. (erhar...@mailbox.org) changed:

           What    |Removed |Added
------------------------------------------
Attachment #288185|0       |1
       is obsolete|        |

--- Comment #18 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 292285
  --> https://bugzilla.kernel.org/attachment.cgi?id=292285&action=edit
dmesg (kernel 5.9-rc3, Talos II)

--
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker
Hi!

On Wed, Sep 02, 2020 at 06:46:45AM +, Christophe Leroy wrote:
> ld crashes:
>
>   LD      arch/powerpc/kernel/vdso32/vdso32.so.dbg
> /bin/sh: line 1: 23780 Segmentation fault (core dumped)
> ppc-linux-ld -EB -m elf32ppc -shared -soname linux-vdso32.so.1
> --eh-frame-hdr --orphan-handling=warn -T
> arch/powerpc/kernel/vdso32/vdso32.lds
> arch/powerpc/kernel/vdso32/sigtramp.o
> arch/powerpc/kernel/vdso32/gettimeofday.o
> arch/powerpc/kernel/vdso32/datapage.o
> arch/powerpc/kernel/vdso32/cacheflush.o
> arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o
> arch/powerpc/kernel/vdso32/vdso32.so.dbg
> make[4]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139
>
> [root@localhost linux-powerpc]# ppc-linux-ld --version
> GNU ld (GNU Binutils) 2.26.20160125

[ Don't build as root :-P ]

Try with a newer ld? If it still happens with current versions, please open a bug report? https://sourceware.org/bugzilla

Segher
Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
Le 02/09/2020 à 15:51, David Laight a écrit :
> From: Christophe Leroy
>> Sent: 02 September 2020 14:25
>> Le 02/09/2020 à 15:13, David Laight a écrit :
>>> From: Christoph Hellwig
>>>> Sent: 02 September 2020 13:37
>>>>
>>>> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
>>>>>> -	return 0;
>>>>>> -	return (size == 0 || size - 1 <= seg.seg - addr);
>>>>>> +	if (addr >= TASK_SIZE_MAX)
>>>>>> +		return false;
>>>>>> +	if (size == 0)
>>>>>> +		return false;
>>>>>
>>>>> __access_ok() was returning true when size == 0 up to now. Any reason
>>>>> to return false now ?
>>>>
>>>> No, this is accidental and broken. Can you re-run your benchmark with
>>>> this fixed?
>>>
>>> Is TASK_SIZE_MASK defined such that you can do:
>>>
>>> 	return ((addr | size) < TASK_SIZE_MAX) || !size;
>>
>> TASK_SIZE_MAX will usually be 0xc000
>>
>> With:
>> 	addr = 0x8000;
>> 	size = 0x8000;
>>
>> I expect it to fail
>>
>> With the formula you propose it will succeed, won't it ?
>
> Hmmm... Was I getting confused about some comments for 64bit
> about there being such a big hole between valid user and kernel
> addresses that it was enough to check that 'size < TASK_SIZE_MAX'.
>
> That would be true for 64bit x86 (and probably ppc (& arm??))
> if TASK_SIZE_MAX were 0x4 << 60.
> IIUC the highest user address is (much) less than 0x0 << 60
> and the lowest kernel address (much) greater than 0xf << 60
> on all these 64bit platforms.
>
> Actually if doing access_ok() inside get_user() you don't
> need to check the size at all.

You mean on 64 bit or on any platform ?

What about a word write to 0xbffe, won't it overwrite 0xc000 ?

> You don't even need to in copy_to/from_user() provided
> it always does a forwards copy.

Do you mean due to the gap ?

Is it guaranteed to be a gap ? Even on a 32 bits having TASK_SIZE set to 0xc000 and PAGE_OFFSET set to the same ?

Christophe
RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
From: Christophe Leroy
> Sent: 02 September 2020 14:25
> Le 02/09/2020 à 15:13, David Laight a écrit :
> > From: Christoph Hellwig
> >> Sent: 02 September 2020 13:37
> >>
> >> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >>>> -	return 0;
> >>>> -	return (size == 0 || size - 1 <= seg.seg - addr);
> >>>> +	if (addr >= TASK_SIZE_MAX)
> >>>> +		return false;
> >>>> +	if (size == 0)
> >>>> +		return false;
> >>>
> >>> __access_ok() was returning true when size == 0 up to now. Any reason
> >>> to return false now ?
> >>
> >> No, this is accidental and broken. Can you re-run your benchmark with
> >> this fixed?
> >
> > Is TASK_SIZE_MASK defined such that you can do:
> >
> > 	return ((addr | size) < TASK_SIZE_MAX) || !size;
>
> TASK_SIZE_MAX will usually be 0xc000
>
> With:
> 	addr = 0x8000;
> 	size = 0x8000;
>
> I expect it to fail
>
> With the formula you propose it will succeed, won't it ?

Hmmm... Was I getting confused about some comments for 64bit about there being such a big hole between valid user and kernel addresses that it was enough to check that 'size < TASK_SIZE_MAX'.

That would be true for 64bit x86 (and probably ppc (& arm??)) if TASK_SIZE_MAX were 0x4 << 60. IIUC the highest user address is (much) less than 0x0 << 60 and the lowest kernel address (much) greater than 0xf << 60 on all these 64bit platforms.

Actually if doing access_ok() inside get_user() you don't need to check the size at all.

You don't even need to in copy_to/from_user() provided it always does a forwards copy. (Rather than copying the last word first for misaligned lengths.)

	David
Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
Le 02/09/2020 à 15:13, David Laight a écrit :
> From: Christoph Hellwig
>> Sent: 02 September 2020 13:37
>>
>> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
>>>> -	return 0;
>>>> -	return (size == 0 || size - 1 <= seg.seg - addr);
>>>> +	if (addr >= TASK_SIZE_MAX)
>>>> +		return false;
>>>> +	if (size == 0)
>>>> +		return false;
>>>
>>> __access_ok() was returning true when size == 0 up to now. Any reason
>>> to return false now ?
>>
>> No, this is accidental and broken. Can you re-run your benchmark with
>> this fixed?
>
> Is TASK_SIZE_MASK defined such that you can do:
>
> 	return ((addr | size) < TASK_SIZE_MAX) || !size;

TASK_SIZE_MAX will usually be 0xc000

With:
	addr = 0x8000;
	size = 0x8000;

I expect it to fail

With the formula you propose it will succeed, won't it ?

Christophe
Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
Anshuman Khandual writes:
> On 09/01/2020 12:00 PM, Aneesh Kumar K.V wrote:
>> On 9/1/20 9:33 AM, Anshuman Khandual wrote:
>>> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>>>> The seems to be missing quite a lot of details w.r.t allocating
>>>> the correct pgtable_t page (huge_pte_alloc()), holding the right
>>>> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
>>>>
>>>> ppc64 do have runtime checks within CONFIG_DEBUG_VM for most of
>>>> these. Hence disable the test on ppc64.
>>>
>>> Would really like this to get resolved in an uniform and better way
>>> instead, i.e a modified hugetlb_advanced_tests() which works on all
>>> platforms including ppc64.
>>>
>>> In absence of a modified version, I do realize the situation here,
>>> where DEBUG_VM_PGTABLE test either runs on ppc64 or just completely
>>> remove hugetlb_advanced_tests() from other platforms as well.
>>>
>>>> Signed-off-by: Aneesh Kumar K.V
>>>> ---
>>>>  mm/debug_vm_pgtable.c | 4
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>>>> index a188b6e4e37e..21329c7d672f 100644
>>>> --- a/mm/debug_vm_pgtable.c
>>>> +++ b/mm/debug_vm_pgtable.c
>>>> @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot)
>>>>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>>>>  }
>>>>
>>>> +#ifndef CONFIG_PPC_BOOK3S_64
>>>>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>>>>  					  struct vm_area_struct *vma,
>>>>  					  pte_t *ptep, unsigned long pfn,
>>>> @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>>>>  	pte = huge_ptep_get(ptep);
>>>>  	WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
>>>>  }
>>>> +#endif
>>>
>>> In the worst case if we could not get a new hugetlb_advanced_tests()
>>> test that works on all platforms, this might be the last fallback
>>> option. In which case, it will require a proper comment section with
>>> a "FIXME: ", explaining the current situation (and that #ifdef is
>>> temporary in nature) and a hugetlb_advanced_tests() stub when
>>> CONFIG_PPC_BOOK3S_64 is enabled.
>>>
>>>>  #else /* !CONFIG_HUGETLB_PAGE */
>>>>  static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
>>>>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>>>> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
>>>>  	pud_populate_tests(mm, pudp, saved_pmdp);
>>>>  	spin_unlock(ptl);
>>>>
>>>> +#ifndef CONFIG_PPC_BOOK3S_64
>>>>  	hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>>>> +#endif
>>
>> I actually wanted to add #ifdef BROKEN. That test is completely broken.
>> Infact I would suggest to remove that test completely.
>>
>>> #ifdef will not be required here as there would be a stub definition
>>> for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.
>>>
>>>>  	spin_lock(&mm->page_table_lock);
>>>>  	p4d_clear_tests(mm, p4dp);
>>>
>>> But again, we should really try and avoid taking this path.
>>
>> To be frank i am kind of frustrated with how this patch series is being
>> looked at. We pushed a completely broken test to upstream and right now
>> we have a code in upstream that crash when booted on ppc64. My attempt
>> has been to make progress here and you definitely seems to be not in
>> agreement to that.

I am afraid, this does not accurately represent the situation.

- The second set patch series got merged in it's V5 after accommodating almost
  all reviews and objections during previous discussion cycles. For a complete
  development log, please refer https://patchwork.kernel.org/cover/11658627/.

- The series has been repeatedly tested on arm64 and x86 platforms for multiple
  configurations but build tested on all other enabled platforms. I have always
  been dependent on voluntary help from folks on the list to get this tested on
  other enabled platforms as I dont have access to such systems. Always assumed
  that is the way to go for anything which runs on multiple platforms. So, am I
  expected to test on platforms that I dont have access to ? But I am ready to
  be corrected here, if the community protocol is not what I have always assumed
  it to be.

- Each and every version of the series had appropriately copied all the enabled
  platform's mailing list. Also, I had explicitly asked for volunteers to test
  this out on platforms apart from x86 and arm64. We had positive response from
  all platforms i.e arc, s390, ppc32 but except for ppc64.

  https://patchwork.kernel.org/cover/11644771/
  https://patchwork.kernel.org/cover/11603713/

- The development cycle provided sufficient time window for any detailed review
  and test. I have always been willing to address almost all the issues bro
RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
From: Christoph Hellwig
> Sent: 02 September 2020 13:37
>
> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >> -	return 0;
> >> -	return (size == 0 || size - 1 <= seg.seg - addr);
> >> +	if (addr >= TASK_SIZE_MAX)
> >> +		return false;
> >> +	if (size == 0)
> >> +		return false;
> >
> > __access_ok() was returning true when size == 0 up to now. Any reason to
> > return false now ?
>
> No, this is accidental and broken. Can you re-run your benchmark with
> this fixed?

Is TASK_SIZE_MAX defined such that you can do:

	return ((addr | size) < TASK_SIZE_MAX) || !size;

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
On 09/01/2020 12:00 PM, Aneesh Kumar K.V wrote: > On 9/1/20 9:33 AM, Anshuman Khandual wrote: >> >> >> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: >>> The seems to be missing quite a lot of details w.r.t allocating >>> the correct pgtable_t page (huge_pte_alloc()), holding the right >>> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA. >>> >>> ppc64 do have runtime checks within CONFIG_DEBUG_VM for most of these. >>> Hence disable the test on ppc64. >> >> Would really like this to get resolved in an uniform and better way >> instead, i.e a modified hugetlb_advanced_tests() which works on all >> platforms including ppc64. >> >> In absence of a modified version, I do realize the situation here, >> where DEBUG_VM_PGTABLE test either runs on ppc64 or just completely >> remove hugetlb_advanced_tests() from other platforms as well. >> >>> >>> Signed-off-by: Aneesh Kumar K.V >>> --- >>> mm/debug_vm_pgtable.c | 4 >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c >>> index a188b6e4e37e..21329c7d672f 100644 >>> --- a/mm/debug_vm_pgtable.c >>> +++ b/mm/debug_vm_pgtable.c >>> @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long >>> pfn, pgprot_t prot) >>> #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ >>> } >>> +#ifndef CONFIG_PPC_BOOK3S_64 >>> static void __init hugetlb_advanced_tests(struct mm_struct *mm, >>> struct vm_area_struct *vma, >>> pte_t *ptep, unsigned long pfn, >>> @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct >>> mm_struct *mm, >>> pte = huge_ptep_get(ptep); >>> WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte))); >>> } >>> +#endif >> >> In the worst case if we could not get a new hugetlb_advanced_tests() test >> that works on all platforms, this might be the last fallback option. 
In >> which case, it will require a proper comment section with a "FIXME: ", >> explaining the current situation (and that #ifdef is temporary in nature) >> and a hugetlb_advanced_tests() stub when CONFIG_PPC_BOOK3S_64 is enabled. >> >>> #else /* !CONFIG_HUGETLB_PAGE */ >>> static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) >>> { } >>> static void __init hugetlb_advanced_tests(struct mm_struct *mm, >>> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void) >>> pud_populate_tests(mm, pudp, saved_pmdp); >>> spin_unlock(ptl); >>> +#ifndef CONFIG_PPC_BOOK3S_64 >>> hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); >>> +#endif >> > > I actually wanted to add #ifdef BROKEN. That test is completely broken. > Infact I would suggest to remove that test completely. > > > >> #ifdef will not be required here as there would be a stub definition >> for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled. >> >>> spin_lock(&mm->page_table_lock); >>> p4d_clear_tests(mm, p4dp); >>> >> >> But again, we should really try and avoid taking this path. >> > > To be frank i am kind of frustrated with how this patch series is being > looked at. We pushed a completely broken test to upstream and right now we > have a code in upstream that crash when booted on ppc64. My attempt has been > to make progress here and you definitely seems to be not in agreement to that. > I am afraid, this does not accurately represent the situation. - The second set patch series got merged in it's V5 after accommodating almost all reviews and objections during previous discussion cycles. For a complete development log, please refer https://patchwork.kernel.org/cover/11658627/. - The series has been repeatedly tested on arm64 and x86 platforms for multiple configurations but build tested on all other enabled platforms. 
I have always been dependent on voluntary help from folks on the list to get this tested on other enabled platforms as I dont have access to such systems. Always assumed that is the way to go for anything which runs on multiple platforms. So, am I expected to test on platforms that I dont have access to ? But I am ready to be corrected here, if the community protocol is not what I have always assumed it to be. - Each and every version of the series had appropriately copied all the enabled platform's mailing list. Also, I had explicitly asked for volunteers to test this out on platforms apart from x86 and arm64. We had positive response from all platforms i.e arc, s390, ppc32 but except for ppc64. https://patchwork.kernel.org/cover/11644771/ https://patchwork.kernel.org/cover/11603713/ - The development cycle provided sufficient time window for any detailed review and test. I have always been willing to address almost all the issues brought forward during these discussions. From past experience on this test, there is an inherent need to understand platform specific details while trying to come up with something gene
Re: [PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.
On 9/2/20 6:10 PM, Christophe Leroy wrote: Le 02/09/2020 à 13:42, Aneesh Kumar K.V a écrit : ppc64 supports huge vmap only with radix translation. Hence use arch helper to determine the huge vmap support. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 00649b47f6e0..4c73e63b4ceb 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pmd_leaf(pmd)); } +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { pmd_t pmd; - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) + if (!arch_ioremap_pmd_supported()) What about moving ioremap_pmd_enabled() from mm/ioremap.c to some .h, and using it ? As ioremap_pmd_enabled() is defined at all time, no need of #ifdef yes. This was discussed earlier too. IMHO we should do that outside this series. I guess figuring out ioremap_pmd/pud support can definitely be simplified. 
With a generic version like #ifndef arch_ioremap_pmd_supported static inline bool arch_ioremap_pmd_supported(void) { return false; } #endif return; pr_debug("Validating PMD huge\n"); @@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); } +#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } +#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { @@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pud_leaf(pud)); } +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { pud_t pud; - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) + if (!arch_ioremap_pud_supported()) What about moving ioremap_pud_enabled() from mm/ioremap.c to some .h, and using it ? As ioremap_pud_enabled() is defined at all time, no need of #ifdef return; pr_debug("Validating PUD huge\n"); @@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); } +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { } +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ + #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { } static void __init pud_advanced_tests(struct mm_struct *mm, Christophe
Re: [PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.
Le 02/09/2020 à 13:42, Aneesh Kumar K.V a écrit : ppc64 supports huge vmap only with radix translation. Hence use arch helper to determine the huge vmap support. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 00649b47f6e0..4c73e63b4ceb 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pmd_leaf(pmd)); } +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { pmd_t pmd; - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) + if (!arch_ioremap_pmd_supported()) What about moving ioremap_pmd_enabled() from mm/ioremap.c to some .h, and using it ? As ioremap_pmd_enabled() is defined at all time, no need of #ifdef return; pr_debug("Validating PMD huge\n"); @@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); } +#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } +#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { @@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pud_leaf(pud)); } +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { pud_t pud; - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) + if (!arch_ioremap_pud_supported()) What about moving ioremap_pud_enabled() from mm/ioremap.c to some .h, and using it ? 
As ioremap_pud_enabled() is defined at all time, no need of #ifdef return; pr_debug("Validating PUD huge\n"); @@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); } +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { } +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ + #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { } static void __init pud_advanced_tests(struct mm_struct *mm, Christophe
Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()
On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote: >> -return 0; >> -return (size == 0 || size - 1 <= seg.seg - addr); >> +if (addr >= TASK_SIZE_MAX) >> +return false; >> +if (size == 0) >> +return false; > > __access_ok() was returning true when size == 0 up to now. Any reason to > return false now ? No, this is accidental and broken. Can you re-run your benchmark with this fixed?
Re: [PATCH v4 02/13] powerpc/mm: Move setting pte specific flags to pfn_pte
Le 02/09/2020 à 13:42, Aneesh Kumar K.V a écrit : powerpc used to set the pte specific flags in set_pte_at(). This is different from other architectures. To be consistent with other architecture update pfn_pte to set _PAGE_PTE on ppc64. Also, drop now unused pte_mkpte. We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without setting _PAGE_PTE bit. We will remove that after a few releases. With respect to huge pmd entries, pmd_mkhuge() takes care of adding the _PAGE_PTE bit. Signed-off-by: Aneesh Kumar K.V Reviewed-by: Christophe Leroy Small nit below. --- arch/powerpc/include/asm/book3s/64/pgtable.h | 15 +-- arch/powerpc/include/asm/nohash/pgtable.h| 5 - arch/powerpc/mm/pgtable.c| 5 - 3 files changed, 9 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 079211968987..2382fd516f6b 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -823,6 +818,14 @@ static inline int pte_none(pte_t pte) static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, int percpu) { + + VM_WARN_ON(!(pte_raw(pte) & cpu_to_be64(_PAGE_PTE))); + /* +* Keep the _PAGE_PTE added till we are sure we handle _PAGE_PTE +* in all the callers. +*/ +pte = __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE)); Wrong alignment, there is a leading space. + if (radix_enabled()) return radix__set_pte_at(mm, addr, ptep, pte, percpu); return hash__set_pte_at(mm, addr, ptep, pte, percpu); Christophe
Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker
Nick Desaulniers writes: > Rather than invoke the compiler as the driver, use the linker. That way > we can check --orphan-handling=warn support correctly, as cc-ldoption > was removed in > commit 055efab3120b ("kbuild: drop support for cc-ldoption"). Ouch. Seems make is quite happy to $(call deadbeef, ...) and not print a warning, which I guess is probably a feature. > Painstakingly compared the output between `objdump -a` before and after > this change. Now function symbols have the correct type of FUNC rather > than NONE, and the entry is slightly different (which doesn't matter for > the vdso). Binary size is the same. > > Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan > sections") I think I'll just revert that for v5.9 ? cheers > Link: > https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/ > Signed-off-by: Nick Desaulniers > --- > arch/powerpc/include/asm/vdso.h | 17 ++--- > arch/powerpc/kernel/vdso64/Makefile | 8 ++-- > arch/powerpc/kernel/vdso64/vdso64.lds.S | 1 - > 3 files changed, 8 insertions(+), 18 deletions(-) > > diff --git a/arch/powerpc/include/asm/vdso.h b/arch/powerpc/include/asm/vdso.h > index 2ff884853f97..11b2ecf49f79 100644 > --- a/arch/powerpc/include/asm/vdso.h > +++ b/arch/powerpc/include/asm/vdso.h > @@ -24,19 +24,7 @@ int vdso_getcpu_init(void); > > #else /* __ASSEMBLY__ */ > > -#ifdef __VDSO64__ > -#define V_FUNCTION_BEGIN(name) \ > - .globl name;\ > - name: \ > - > -#define V_FUNCTION_END(name) \ > - .size name,.-name; > - > -#define V_LOCAL_FUNC(name) (name) > -#endif /* __VDSO64__ */ > - > -#ifdef __VDSO32__ > - > +#if defined(__VDSO32__) || defined (__VDSO64__) > #define V_FUNCTION_BEGIN(name) \ > .globl name;\ > .type name,@function; \ > @@ -46,8 +34,7 @@ int vdso_getcpu_init(void); > .size name,.-name; > > #define V_LOCAL_FUNC(name) (name) > - > -#endif /* __VDSO32__ */ > +#endif /* __VDSO{32|64}__ */ > > #endif /* __ASSEMBLY__ */ > > diff --git 
a/arch/powerpc/kernel/vdso64/Makefile > b/arch/powerpc/kernel/vdso64/Makefile > index 38c317f25141..7ea3ce537d0a 100644 > --- a/arch/powerpc/kernel/vdso64/Makefile > +++ b/arch/powerpc/kernel/vdso64/Makefile > @@ -32,9 +32,13 @@ $(obj)/%.so: OBJCOPYFLAGS := -S > $(obj)/%.so: $(obj)/%.so.dbg FORCE > $(call if_changed,objcopy) > > +ldflags-y := -shared -soname linux-vdso64.so.1 \ > + $(call ld-option, --eh-frame-hdr) \ > + $(call ld-option, --orphan-handling=warn) -T > + > # actual build commands > -quiet_cmd_vdso64ld = VDSO64L $@ > - cmd_vdso64ld = $(CC) $(c_flags) -o $@ -Wl,-T$(filter %.lds,$^) > $(filter %.o,$^) $(call cc-ldoption, -Wl$(comma)--orphan-handling=warn) > +quiet_cmd_vdso64ld = LD $@ > + cmd_vdso64ld = $(cmd_ld) > > # install commands for the unstripped file > quiet_cmd_vdso_install = INSTALL $@ > diff --git a/arch/powerpc/kernel/vdso64/vdso64.lds.S > b/arch/powerpc/kernel/vdso64/vdso64.lds.S > index 4e3a8d4ee614..58c33b704b6a 100644 > --- a/arch/powerpc/kernel/vdso64/vdso64.lds.S > +++ b/arch/powerpc/kernel/vdso64/vdso64.lds.S > @@ -11,7 +11,6 @@ OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", > "elf64-powerpcle") > OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc") > #endif > OUTPUT_ARCH(powerpc:common64) > -ENTRY(_start) > > SECTIONS > { > -- > 2.28.0.402.g5ffc5be6b7-goog
[PATCH v4 13/13] mm/debug_vm_pgtable: Avoid none pte in pte_clear_test
pte_clear_tests operates on an existing pte entry. Make sure that is not a none pte entry.

Signed-off-by: Aneesh Kumar K.V
---
 mm/debug_vm_pgtable.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 9afa1354326b..c36530c69e33 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -542,9 +542,10 @@ static void __init pgd_populate_tests(struct mm_struct *mm, pgd_t *pgdp,
 #endif /* PAGETABLE_P4D_FOLDED */
 
 static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
-				   unsigned long vaddr)
+				   unsigned long pfn, unsigned long vaddr,
+				   pgprot_t prot)
 {
-	pte_t pte = ptep_get(ptep);
+	pte_t pte = pfn_pte(pfn, prot);
 
 	pr_debug("Validating PTE clear\n");
 	pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
@@ -1049,7 +1050,7 @@ static int __init debug_vm_pgtable(void)
 	ptl = pte_lockptr(mm, pmdp);
 	spin_lock(ptl);
 
-	pte_clear_tests(mm, ptep, vaddr);
+	pte_clear_tests(mm, ptep, pte_aligned, vaddr, prot);
 	pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
 	pte_unmap_unlock(ptep, ptl);
 
-- 
2.26.2
[PATCH v4 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
The test seems to be missing quite a lot of details w.r.t allocating the correct pgtable_t page (huge_pte_alloc()), holding the right lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.

ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these. Hence disable the test on ppc64.

Signed-off-by: Aneesh Kumar K.V
---
 mm/debug_vm_pgtable.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index b53903fdee85..9afa1354326b 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -811,6 +811,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot)
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 }
 
+#ifndef CONFIG_PPC_BOOK3S_64
 static void __init hugetlb_advanced_tests(struct mm_struct *mm,
 					  struct vm_area_struct *vma,
 					  pte_t *ptep, unsigned long pfn,
@@ -853,6 +854,7 @@ static void __init hugetlb_advanced_tests(struct mm_struct *mm,
 	pte = huge_ptep_get(ptep);
 	WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
 }
+#endif
 
 #else /* !CONFIG_HUGETLB_PAGE */
 static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
 static void __init hugetlb_advanced_tests(struct mm_struct *mm,
@@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
 	pud_populate_tests(mm, pudp, saved_pmdp);
 	spin_unlock(ptl);
 
+#ifndef CONFIG_PPC_BOOK3S_64
 	hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+#endif
 
 	spin_lock(&mm->page_table_lock);
 	p4d_clear_tests(mm, p4dp);
-- 
2.26.2
[PATCH v4 11/13] mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries
pmd_clear() should not be used to clear pmd level pte entries. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 26023d990bd0..b53903fdee85 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -196,6 +196,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmd = READ_ONCE(*pmdp); WARN_ON(pmd_young(pmd)); + /* Clear the pte entries */ + pmdp_huge_get_and_clear(mm, vaddr, pmdp); pgtable = pgtable_trans_huge_withdraw(mm, pmdp); } @@ -319,6 +321,8 @@ static void __init pud_advanced_tests(struct mm_struct *mm, pudp_test_and_clear_young(vma, vaddr, pudp); pud = READ_ONCE(*pudp); WARN_ON(pud_young(pud)); + + pudp_huge_get_and_clear(mm, vaddr, pudp); } static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) @@ -442,8 +446,6 @@ static void __init pud_populate_tests(struct mm_struct *mm, pud_t *pudp, * This entry points to next level page table page. * Hence this must not qualify as pud_bad(). */ - pmd_clear(pmdp); - pud_clear(pudp); pud_populate(mm, pudp, pmdp); pud = READ_ONCE(*pudp); WARN_ON(pud_bad(pud)); @@ -575,7 +577,6 @@ static void __init pmd_populate_tests(struct mm_struct *mm, pmd_t *pmdp, * This entry points to next level page table page. * Hence this must not qualify as pmd_bad(). */ - pmd_clear(pmdp); pmd_populate(mm, pmdp, pgtable); pmd = READ_ONCE(*pmdp); WARN_ON(pmd_bad(pmd)); -- 2.26.2
[PATCH v4 10/13] mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
Architectures like ppc64 use deposited page table while updating the huge pte entries. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 2bc1952e5f83..26023d990bd0 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot) static void __init pmd_advanced_tests(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmdp, unsigned long pfn, unsigned long vaddr, - pgprot_t prot) + pgprot_t prot, pgtable_t pgtable) { pmd_t pmd; @@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, /* Align the address wrt HPAGE_PMD_SIZE */ vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE; + pgtable_trans_huge_deposit(mm, pmdp, pgtable); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_set_wrprotect(mm, vaddr, pmdp); @@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_test_and_clear_young(vma, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(pmd_young(pmd)); + + pgtable = pgtable_trans_huge_withdraw(mm, pmdp); } static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) @@ -371,7 +375,7 @@ static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { } static void __init pmd_advanced_tests(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmdp, unsigned long pfn, unsigned long vaddr, - pgprot_t prot) + pgprot_t prot, pgtable_t pgtable) { } static void __init pud_advanced_tests(struct mm_struct *mm, @@ -1048,7 +1052,7 @@ static int __init debug_vm_pgtable(void) ptl = pmd_lock(mm, pmdp); pmd_clear_tests(mm, pmdp); - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); pmd_huge_tests(pmdp, pmd_aligned, prot); pmd_populate_tests(mm, pmdp, saved_ptep); spin_unlock(ptl); -- 
2.26.2
[PATCH v4 09/13] mm/debug_vm_pgtable/locks: Take correct page table lock
Make sure we call pte accessors with correct lock held. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 35 ++- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index f59cf6a9b05e..2bc1952e5f83 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -1035,30 +1035,39 @@ static int __init debug_vm_pgtable(void) hugetlb_basic_tests(pte_aligned, prot); - pte_clear_tests(mm, ptep, vaddr); - pmd_clear_tests(mm, pmdp); - pud_clear_tests(mm, pudp); - p4d_clear_tests(mm, p4dp); - pgd_clear_tests(mm, pgdp); + /* +* Page table modifying tests. They need to hold +* proper page table lock. +*/ ptl = pte_lockptr(mm, pmdp); spin_lock(ptl); - + pte_clear_tests(mm, ptep, vaddr); pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); - + pte_unmap_unlock(ptep, ptl); + ptl = pmd_lock(mm, pmdp); + pmd_clear_tests(mm, pmdp); + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); pmd_huge_tests(pmdp, pmd_aligned, prot); + pmd_populate_tests(mm, pmdp, saved_ptep); + spin_unlock(ptl); + + ptl = pud_lock(mm, pudp); + pud_clear_tests(mm, pudp); + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); pud_huge_tests(pudp, pud_aligned, prot); + pud_populate_tests(mm, pudp, saved_pmdp); + spin_unlock(ptl); - pte_unmap_unlock(ptep, ptl); + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); - pmd_populate_tests(mm, pmdp, saved_ptep); - pud_populate_tests(mm, pudp, saved_pmdp); + spin_lock(&mm->page_table_lock); + p4d_clear_tests(mm, p4dp); + pgd_clear_tests(mm, pgdp); p4d_populate_tests(mm, p4dp, saved_pudp); pgd_populate_tests(mm, pgdp, saved_p4dp); + spin_unlock(&mm->page_table_lock); p4d_free(mm, saved_p4dp); pud_free(mm, saved_pudp); -- 2.26.2
[PATCH v4 08/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together
This will help in adding proper locks in a later patch Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 51 --- 1 file changed, 28 insertions(+), 23 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index de333871f407..f59cf6a9b05e 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -986,7 +986,7 @@ static int __init debug_vm_pgtable(void) p4dp = p4d_alloc(mm, pgdp, vaddr); pudp = pud_alloc(mm, p4dp, vaddr); pmdp = pmd_alloc(mm, pudp, vaddr); - ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl); + ptep = pte_alloc_map(mm, pmdp, vaddr); /* * Save all the page table page addresses as the page table @@ -1006,33 +1006,12 @@ static int __init debug_vm_pgtable(void) p4d_basic_tests(p4d_aligned, prot); pgd_basic_tests(pgd_aligned, prot); - pte_clear_tests(mm, ptep, vaddr); - pmd_clear_tests(mm, pmdp); - pud_clear_tests(mm, pudp); - p4d_clear_tests(mm, p4dp); - pgd_clear_tests(mm, pgdp); - - pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); - pmd_leaf_tests(pmd_aligned, prot); pud_leaf_tests(pud_aligned, prot); - pmd_huge_tests(pmdp, pmd_aligned, prot); - pud_huge_tests(pudp, pud_aligned, prot); - pte_savedwrite_tests(pte_aligned, protnone); pmd_savedwrite_tests(pmd_aligned, protnone); - pte_unmap_unlock(ptep, ptl); - - pmd_populate_tests(mm, pmdp, saved_ptep); - pud_populate_tests(mm, pudp, saved_pmdp); - p4d_populate_tests(mm, p4dp, saved_pudp); - pgd_populate_tests(mm, pgdp, saved_p4dp); - pte_special_tests(pte_aligned, prot); pte_protnone_tests(pte_aligned, protnone); pmd_protnone_tests(pmd_aligned, protnone); @@ -1050,11 +1029,37 @@ static int __init debug_vm_pgtable(void) pmd_swap_tests(pmd_aligned, prot); swap_migration_tests(); - hugetlb_basic_tests(pte_aligned, prot); pmd_thp_tests(pmd_aligned, prot); 
pud_thp_tests(pud_aligned, prot); + hugetlb_basic_tests(pte_aligned, prot); + + pte_clear_tests(mm, ptep, vaddr); + pmd_clear_tests(mm, pmdp); + pud_clear_tests(mm, pudp); + p4d_clear_tests(mm, p4dp); + pgd_clear_tests(mm, pgdp); + + ptl = pte_lockptr(mm, pmdp); + spin_lock(ptl); + + pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); + + + pmd_huge_tests(pmdp, pmd_aligned, prot); + pud_huge_tests(pudp, pud_aligned, prot); + + pte_unmap_unlock(ptep, ptl); + + pmd_populate_tests(mm, pmdp, saved_ptep); + pud_populate_tests(mm, pudp, saved_pmdp); + p4d_populate_tests(mm, p4dp, saved_pudp); + pgd_populate_tests(mm, pgdp, saved_p4dp); + p4d_free(mm, saved_p4dp); pud_free(mm, saved_pudp); pmd_free(mm, saved_pmdp); -- 2.26.2
[PATCH v4 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry
set_pte_at() should not be used to set a pte entry at locations that already holds a valid pte entry. Architectures like ppc64 don't do TLB invalidate in set_pte_at() and hence expect it to be used to set locations that are not a valid PTE. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 35 +++ 1 file changed, 15 insertions(+), 20 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 9cafed39c236..de333871f407 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -79,15 +79,18 @@ static void __init pte_advanced_tests(struct mm_struct *mm, { pte_t pte = pfn_pte(pfn, prot); + /* +* Architectures optimize set_pte_at by avoiding TLB flush. +* This requires set_pte_at to be not used to update an +* existing pte entry. Clear pte before we do set_pte_at +*/ + pr_debug("Validating PTE advanced\n"); pte = pfn_pte(pfn, prot); set_pte_at(mm, vaddr, ptep, pte); ptep_set_wrprotect(mm, vaddr, ptep); pte = ptep_get(ptep); WARN_ON(pte_write(pte)); - - pte = pfn_pte(pfn, prot); - set_pte_at(mm, vaddr, ptep, pte); ptep_get_and_clear(mm, vaddr, ptep); pte = ptep_get(ptep); WARN_ON(!pte_none(pte)); @@ -101,13 +104,11 @@ static void __init pte_advanced_tests(struct mm_struct *mm, ptep_set_access_flags(vma, vaddr, ptep, pte, 1); pte = ptep_get(ptep); WARN_ON(!(pte_write(pte) && pte_dirty(pte))); - - pte = pfn_pte(pfn, prot); - set_pte_at(mm, vaddr, ptep, pte); ptep_get_and_clear_full(mm, vaddr, ptep, 1); pte = ptep_get(ptep); WARN_ON(!pte_none(pte)); + pte = pfn_pte(pfn, prot); pte = pte_mkyoung(pte); set_pte_at(mm, vaddr, ptep, pte); ptep_test_and_clear_young(vma, vaddr, ptep); @@ -169,9 +170,6 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_set_wrprotect(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(pmd_write(pmd)); - - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); - set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); @@ -185,13 +183,11 @@ static 
void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1); pmd = READ_ONCE(*pmdp); WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd))); - - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); - set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); pmd = pmd_mkyoung(pmd); set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_test_and_clear_young(vma, vaddr, pmdp); @@ -292,17 +288,9 @@ static void __init pud_advanced_tests(struct mm_struct *mm, WARN_ON(pud_write(pud)); #ifndef __PAGETABLE_PMD_FOLDED - pud = pud_mkhuge(pfn_pud(pfn, prot)); - set_pud_at(mm, vaddr, pudp, pud); pudp_huge_get_and_clear(mm, vaddr, pudp); pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); - - pud = pud_mkhuge(pfn_pud(pfn, prot)); - set_pud_at(mm, vaddr, pudp, pud); - pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); - pud = READ_ONCE(*pudp); - WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ pud = pud_mkhuge(pfn_pud(pfn, prot)); @@ -315,6 +303,13 @@ static void __init pud_advanced_tests(struct mm_struct *mm, pud = READ_ONCE(*pudp); WARN_ON(!(pud_write(pud) && pud_dirty(pud))); +#ifndef __PAGETABLE_PMD_FOLDED + pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); + pud = READ_ONCE(*pudp); + WARN_ON(!pud_none(pud)); +#endif /* __PAGETABLE_PMD_FOLDED */ + + pud = pud_mkhuge(pfn_pud(pfn, prot)); pud = pud_mkyoung(pud); set_pud_at(mm, vaddr, pudp, pud); pudp_test_and_clear_young(vma, vaddr, pudp); -- 2.26.2
[PATCH v4 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at
kernel expects entries to be marked huge before we use set_pmd_at()/set_pud_at(). Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 8704901f6bd8..9cafed39c236 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, unsigned long pfn, unsigned long vaddr, pgprot_t prot) { - pmd_t pmd = pfn_pmd(pfn, prot); + pmd_t pmd; if (!has_transparent_hugepage()) return; @@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, /* Align the address wrt HPAGE_PMD_SIZE */ vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE; - pmd = pfn_pmd(pfn, prot); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_set_wrprotect(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(pmd_write(pmd)); - pmd = pfn_pmd(pfn, prot); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); - pmd = pfn_pmd(pfn, prot); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); pmd = pmd_wrprotect(pmd); pmd = pmd_mkclean(pmd); set_pmd_at(mm, vaddr, pmdp, pmd); @@ -236,7 +236,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { - pmd_t pmd = pfn_pmd(pfn, prot); + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) return; @@ -276,7 +276,7 @@ static void __init pud_advanced_tests(struct mm_struct *mm, unsigned long pfn, unsigned long vaddr, pgprot_t prot) { - pud_t pud = pfn_pud(pfn, prot); + pud_t pud; if (!has_transparent_hugepage()) return; @@ -285,25 +285,27 @@ static void __init pud_advanced_tests(struct mm_struct *mm, /* Align the address wrt HPAGE_PUD_SIZE */ vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE; + pud 
= pud_mkhuge(pfn_pud(pfn, prot)); set_pud_at(mm, vaddr, pudp, pud); pudp_set_wrprotect(mm, vaddr, pudp); pud = READ_ONCE(*pudp); WARN_ON(pud_write(pud)); #ifndef __PAGETABLE_PMD_FOLDED - pud = pfn_pud(pfn, prot); + pud = pud_mkhuge(pfn_pud(pfn, prot)); set_pud_at(mm, vaddr, pudp, pud); pudp_huge_get_and_clear(mm, vaddr, pudp); pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); - pud = pfn_pud(pfn, prot); + pud = pud_mkhuge(pfn_pud(pfn, prot)); set_pud_at(mm, vaddr, pudp, pud); pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ - pud = pfn_pud(pfn, prot); + + pud = pud_mkhuge(pfn_pud(pfn, prot)); pud = pud_wrprotect(pud); pud = pud_mkclean(pud); set_pud_at(mm, vaddr, pudp, pud); -- 2.26.2
[PATCH v4 05/13] mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING
Saved write support was added to track the write bit of a pte after marking the pte protnone. This was done so that AUTONUMA can convert a write pte to protnone and still track the old write bit. When converting it back we set the pte write bit correctly thereby avoiding a write fault again. Hence enable the test only when CONFIG_NUMA_BALANCING is enabled and use protnone protflags. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 4c73e63b4ceb..8704901f6bd8 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -119,10 +119,14 @@ static void __init pte_savedwrite_tests(unsigned long pfn, pgprot_t prot) { pte_t pte = pfn_pte(pfn, prot); + if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) + return; + pr_debug("Validating PTE saved write\n"); WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte)))); WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte)))); } + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot) { @@ -234,6 +238,9 @@ static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { pmd_t pmd = pfn_pmd(pfn, prot); + if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) + return; + pr_debug("Validating PMD saved write\n"); WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd)))); WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd)))); @@ -1019,8 +1026,8 @@ static int __init debug_vm_pgtable(void) pmd_huge_tests(pmdp, pmd_aligned, prot); pud_huge_tests(pudp, pud_aligned, prot); - pte_savedwrite_tests(pte_aligned, prot); - pmd_savedwrite_tests(pmd_aligned, prot); + pte_savedwrite_tests(pte_aligned, protnone); + pmd_savedwrite_tests(pmd_aligned, protnone); pte_unmap_unlock(ptep, ptl); -- 2.26.2
[PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.
ppc64 supports huge vmap only with radix translation. Hence use arch helper to determine the huge vmap support. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 00649b47f6e0..4c73e63b4ceb 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pmd_leaf(pmd)); } +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { pmd_t pmd; - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) + if (!arch_ioremap_pmd_supported()) return; pr_debug("Validating PMD huge\n"); @@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); } +#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } +#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { @@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pud_leaf(pud)); } +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { pud_t pud; - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) + if (!arch_ioremap_pud_supported()) return; pr_debug("Validating PUD huge\n"); @@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); } +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { } +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ + #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ static void __init 
pud_basic_tests(unsigned long pfn, pgprot_t prot) { } static void __init pud_advanced_tests(struct mm_struct *mm, -- 2.26.2
[PATCH v4 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in random value
ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in random value. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 086309fb9b6f..00649b47f6e0 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -44,10 +44,17 @@ * entry type. But these bits might affect the ability to clear entries with * pxx_clear() because of how dynamic page table folding works on s390. So * while loading up the entries do not change the lower 4 bits. It does not - * have affect any other platform. + * have affect any other platform. Also avoid the 62nd bit on ppc64 that is + * used to mark a pte entry. */ -#define S390_MASK_BITS 4 -#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS) +#define S390_SKIP_MASK GENMASK(3, 0) +#if __BITS_PER_LONG == 64 +#define PPC64_SKIP_MASKGENMASK(62, 62) +#else +#define PPC64_SKIP_MASK0x0 +#endif +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK) +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) #define RANDOM_NZVALUE GENMASK(7, 0) static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot) -- 2.26.2
[PATCH v4 01/13] powerpc/mm: Add DEBUG_VM WARN for pmd_clear
With the hash page table, the kernel should not use pmd_clear for clearing huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 6de56c3b33c4..079211968987 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -868,6 +868,13 @@ static inline bool pte_ci(pte_t pte) static inline void pmd_clear(pmd_t *pmdp) { + if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) { + /* +* Don't use this if we can possibly have a hash page table +* entry mapping this. +*/ + WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == (H_PAGE_HASHPTE | _PAGE_PTE)); + } *pmdp = __pmd(0); } @@ -916,6 +923,13 @@ static inline int pmd_bad(pmd_t pmd) static inline void pud_clear(pud_t *pudp) { + if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) { + /* +* Don't use this if we can possibly have a hash page table +* entry mapping this. +*/ + WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == (H_PAGE_HASHPTE | _PAGE_PTE)); + } *pudp = __pud(0); } -- 2.26.2
[PATCH v4 02/13] powerpc/mm: Move setting pte specific flags to pfn_pte
powerpc used to set the pte specific flags in set_pte_at(). This is different from other architectures. To be consistent with other architecture update pfn_pte to set _PAGE_PTE on ppc64. Also, drop now unused pte_mkpte. We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without setting _PAGE_PTE bit. We will remove that after a few releases. With respect to huge pmd entries, pmd_mkhuge() takes care of adding the _PAGE_PTE bit. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 15 +-- arch/powerpc/include/asm/nohash/pgtable.h| 5 - arch/powerpc/mm/pgtable.c| 5 - 3 files changed, 9 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 079211968987..2382fd516f6b 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -619,7 +619,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); - return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot) | _PAGE_PTE); } static inline unsigned long pte_pfn(pte_t pte) @@ -655,11 +655,6 @@ static inline pte_t pte_mkexec(pte_t pte) return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC)); } -static inline pte_t pte_mkpte(pte_t pte) -{ - return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE)); -} - static inline pte_t pte_mkwrite(pte_t pte) { /* @@ -823,6 +818,14 @@ static inline int pte_none(pte_t pte) static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, int percpu) { + + VM_WARN_ON(!(pte_raw(pte) & cpu_to_be64(_PAGE_PTE))); + /* +* Keep the _PAGE_PTE added till we are sure we handle _PAGE_PTE +* in all the callers. 
+*/ +pte = __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE)); + if (radix_enabled()) return radix__set_pte_at(mm, addr, ptep, pte, percpu); return hash__set_pte_at(mm, addr, ptep, pte, percpu); diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index 4b7c3472eab1..6277e7596ae5 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -140,11 +140,6 @@ static inline pte_t pte_mkold(pte_t pte) return __pte(pte_val(pte) & ~_PAGE_ACCESSED); } -static inline pte_t pte_mkpte(pte_t pte) -{ - return pte; -} - static inline pte_t pte_mkspecial(pte_t pte) { return __pte(pte_val(pte) | _PAGE_SPECIAL); diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 9c0547d77af3..ab57b07ef39a 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -184,9 +184,6 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, */ VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep)); - /* Add the pte bit when trying to set a pte */ - pte = pte_mkpte(pte); - /* Note: mm->context.id might not yet have been assigned as * this context might not have been activated yet when this * is called. @@ -275,8 +272,6 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_ */ VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep)); - pte = pte_mkpte(pte); - pte = set_pte_filter(pte); val = pte_val(pte); -- 2.26.2
[PATCH v4 00/13] mm/debug_vm_pgtable fixes
This patch series includes fixes for debug_vm_pgtable test code so that they follow page table updates rules correctly. The first two patches introduce changes w.r.t ppc64. The patches are included in this series for completeness. We can merge them via ppc64 tree if required. Hugetlb test is disabled on ppc64 because that needs larger change to satisfy page table update rules. These tests are broken w.r.t page table update rules and results in kernel crash as below. [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0] pc: c009a5ec: assert_pte_locked+0x14c/0x380 lr: c05c: pte_update+0x11c/0x190 sp: c00c6d1e7950 msr: 82029033 current = 0xc00c6d172c80 paca= 0xc3ba irqmask: 0x03 irq_happened: 0x01 pid = 1, comm = swapper/0 kernel BUG at arch/powerpc/mm/pgtable.c:304! [link register ] c05c pte_update+0x11c/0x190 [c00c6d1e7950] 0001 (unreliable) [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190 [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8 [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338 [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0 [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4 [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160 [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c With DEBUG_VM disabled [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x [ 20.530183] Faulting instruction address: 0xc00df330 cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700] pc: c00df330: memset+0x68/0x104 lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 sp: c00c6d19f990 msr: 82009033 dar: 0 current = 0xc00c6d177480 paca= 0xc0001ec4f400 irqmask: 0x03 irq_happened: 0x01 pid = 1, comm = swapper/0 [link register ] c009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 [c00c6d19f990] c009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378 [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244 [c00c6d19fba0] c00116ec 
do_one_initcall+0xac/0x5f0 [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4 [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160 [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c Changes from v3: * Address review feedback * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. Changes from v2: * Fix build failure with different configs and architecture. Changes from v1: * Address review feedback * drop test specific pfn_pte and pfn_pmd. * Update ppc64 page table helper to add _PAGE_PTE Aneesh Kumar K.V (13): powerpc/mm: Add DEBUG_VM WARN for pmd_clear powerpc/mm: Move setting pte specific flags to pfn_pte mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support. mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry mm/debug_vm_pgtable/locks: Move non page table modifying test together mm/debug_vm_pgtable/locks: Take correct page table lock mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64 mm/debug_vm_pgtable: Avoid none pte in pte_clear_test arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++- arch/powerpc/include/asm/nohash/pgtable.h| 5 - arch/powerpc/mm/pgtable.c| 5 - mm/debug_vm_pgtable.c| 171 --- 4 files changed, 131 insertions(+), 79 deletions(-) -- 2.26.2
[Bug 209029] kernel 5.9-rc2 fails to boot on a PowerMac G5 11,2 - BUG: Kernel NULL pointer dereference on read at 0x00000020
https://bugzilla.kernel.org/show_bug.cgi?id=209029 --- Comment #4 from Erhard F. (erhar...@mailbox.org) --- (In reply to Christophe Leroy from comment #3) > Did you try without CONFIG_DEBUG_VM_PGTABLE ? Without CONFIG_DEBUG_VM_PGTABLE the G5 boots fine. Thanks! > If you want CONFIG_DEBUG_VM_PGTABLE, the following series aims at fixing it > for PPC64: > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=197961 Did not check the series as current ozlabs patches indicate that the CONFIG_DEBUG_VM_PGTABLE option is removed for the time being. -- You are receiving this mail because: You are watching the assignee of the bug.
Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker
On 9/1/20 10:25 PM, Nick Desaulniers wrote: Rather than invoke the compiler as the driver, use the linker. That way we can check --orphan-handling=warn support correctly, as cc-ldoption was removed in commit 055efab3120b ("kbuild: drop support for cc-ldoption"). Requires dropping the .got section. I couldn't find how it was used in the vdso32. Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan sections") Link: https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/ Signed-off-by: Nick Desaulniers --- Not sure removing .got is a good idea or not. Otherwise I observe the following link error: powerpc-linux-gnu-ld: warning: orphan section `.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.got' powerpc-linux-gnu-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got powerpc-linux-gnu-ld: final link failed: bad value Finally, I think I spotted it: make arch/powerpc/kernel/vdso32/ V=1 powerpc64-linux-ld -EB -m elf64ppc -shared -soname linux-vdso32.so.1 --eh-frame-hdr --orphan-handling=warn -T arch/powerpc/kernel/vdso32/vdso32.lds arch/powerpc/kernel/vdso32/sigtramp.o arch/powerpc/kernel/vdso32/gettimeofday.o arch/powerpc/kernel/vdso32/datapage.o arch/powerpc/kernel/vdso32/cacheflush.o arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o arch/powerpc/kernel/vdso32/vdso32.so.dbg If I do the same manually but with -m elf32ppc instead of -m elf64ppc, there is no failure. Adding -m elf32ppc to ldflags-y also works, although I don't much like having "-m elf64ppc -m elf32ppc" on the line. Christophe
Re: [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
Excerpts from Michael Ellerman's message of September 1, 2020 10:00 pm: > Nicholas Piggin writes: >> Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of >> single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of >> a process under certain conditions. One of the assumptions is that >> mm_users would not be incremented via a reference outside the process >> context with mmget_not_zero() then go on to kthread_use_mm() via that >> reference. >> >> That invariant was broken by io_uring code (see previous sparc64 fix), >> but I'll point Fixes: to the original powerpc commit because we are >> changing that assumption going forward, so this will make backports >> match up. >> >> Fix this by no longer relying on that assumption, but by having each CPU >> check the mm is not being used, and clearing their own bit from the mask >> if it's okay. This fix relies on commit 38cf307c1f20 ("mm: fix >> kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch, >> and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled. > > You could use: > > Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate") Good idea, I will. >> Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of >> single-threaded mm_cpumask") >> Signed-off-by: Nicholas Piggin >> --- >> arch/powerpc/include/asm/tlb.h | 13 - >> arch/powerpc/mm/book3s64/radix_tlb.c | 23 --- >> 2 files changed, 16 insertions(+), 20 deletions(-) > > One minor nit below if you're respinning anyway. > > You know this stuff better than me, but I still reviewed it and it seems > good to me. > > Reviewed-by: Michael Ellerman Thanks.
> >> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h >> index fbc6f3002f23..d97f061fecac 100644 >> --- a/arch/powerpc/include/asm/tlb.h >> +++ b/arch/powerpc/include/asm/tlb.h >> @@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm) >> return false; >> return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)); >> } >> -static inline void mm_reset_thread_local(struct mm_struct *mm) >> -{ >> -WARN_ON(atomic_read(&mm->context.copros) > 0); >> -/* >> - * It's possible for mm_access to take a reference on mm_users to >> - * access the remote mm from another thread, but it's not allowed >> - * to set mm_cpumask, so mm_users may be > 1 here. >> - */ >> -WARN_ON(current->mm != mm); >> -atomic_set(&mm->context.active_cpus, 1); >> -cpumask_clear(mm_cpumask(mm)); >> -cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm)); >> -} >> #else /* CONFIG_PPC_BOOK3S_64 */ >> static inline int mm_is_thread_local(struct mm_struct *mm) >> { >> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c >> b/arch/powerpc/mm/book3s64/radix_tlb.c >> index 0d233763441f..a421a0e3f930 100644 >> --- a/arch/powerpc/mm/book3s64/radix_tlb.c >> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c >> @@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg) >> struct mm_struct *mm = arg; >> unsigned long pid = mm->context.id; >> >> +/* >> + * A kthread could have done a mmget_not_zero() after the flushing CPU >> + * checked mm_users == 1, and be in the process of kthread_use_mm when > ^ > in mm_is_singlethreaded() > > Adding that reference would help join the dots for a new reader I think. Yes you're right I can change that. Thanks, Nick
Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type
On Wed, Sep 02, 2020 at 06:00:24PM +1000, Jordan Niethe wrote: > On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras wrote: > > > > On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote: > > > The ppc_inst type was added to help cope with the addition of prefixed > > > instructions to the ISA. Convert KVM to use this new type for dealing > > > with instructions. For now do not try to add further support for > > > prefixed instructions. > > > > This change does seem to splatter itself across a lot of code that > > mostly or exclusively runs on machines which are not POWER10 and will > > never need to handle prefixed instructions, unfortunately. I wonder > > if there is a less invasive way to approach this. > Something less invasive would be good. > > > > In particular we are inflicting this 64-bit struct on 32-bit platforms > > unnecessarily (I assume, correct me if I am wrong here). > No, that is something that I wanted to avoid, on 32 bit platforms > it is a 32bit struct: > > struct ppc_inst { > u32 val; > #ifdef CONFIG_PPC64 > u32 suffix; > #endif > } __packed; > > > > How would it be to do something like: > > > > typedef unsigned long ppc_inst_t; > > > > so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms, > > and then use that instead of 'struct ppc_inst'? You would still need > > to change the function declarations but I think most of the function > > bodies would not need to be changed. In particular you would avoid a > > lot of the churn related to having to add ppc_inst_val() and suchlike. > > Would the idea be to get rid of `struct ppc_inst` entirely or just not > use it in kvm?
> In an earlier series I did something similar (at least code shared > between 32bit and 64bit would need helpers, but 32bit only code need > not change): > > #ifdef __powerpc64__ > > typedef struct ppc_inst { > union { > struct { > u32 word; > u32 pad; > } __packed; > struct { > u32 prefix; > u32 suffix; > } __packed; > }; > } ppc_inst; > > #else /* !__powerpc64__ */ > > typedef u32 ppc_inst; > #endif > > However mpe wanted to avoid using a typedef > (https://patchwork.ozlabs.org/comment/2391845/) Well it doesn't have to be typedef'd, it could just be "unsigned long", which is used in other places for things that want to be 32-bit on 32-bit machines and 64-bit on 64-bit machines. I do however think that it should be a numeric type so that we can mask, shift and compare it more easily. I know that's less "abstract" but it's also a lot less obfuscated and I think that will lead to clearer code. If you got the opposite advice from Michael Ellerman or Nick Piggin then I will discuss it with them. > We did also talk about just using a u64 for instructions > (https://lore.kernel.org/linuxppc-dev/1585028462.t27rstc2uf.astr...@bobo.none/) > but the concern was that as prefixed instructions act as two separate > u32s (prefix is always before the suffix regardless of endianess) > keeping it as a u64 would lead to lot of macros and potential > confusion. > But it does seem if that can avoid a lot of needless churn it might > worth the trade off. u32 *ip; instr = *ip++; if (is_prefix(instr) && is_suitably_aligned(ip)) instr = (instr << 32) | *ip++; would avoid the endian issues pretty cleanly I think. In other words the prefix would always be the high half of the 64-bit value, so you can't just do a single 64-bit of the instruction on little-endian platforms; but you can't do a single 64-bit load for other reasons as well, such as alignment. Paul.
Re: [RFC PATCH 2/2] KVM: PPC: Book3S HV: Support prefixed instructions
On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras wrote: > > On Thu, Aug 20, 2020 at 01:39:22PM +1000, Jordan Niethe wrote: > > There are two main places where instructions are loaded from the guest: > > * Emulate loadstore - such as when performing MMIO emulation > > triggered by an HDSI > > * After an HV emulation assistance interrupt (e40) > > > > If it is a prefixed instruction that triggers these cases, its suffix > > must be loaded. Use the SRR1_PREFIX bit to decide if a suffix needs to > > be loaded. Make sure if this bit is set inject_interrupt() also sets it > > when giving an interrupt to the guest. > > > > ISA v3.10 extends the Hypervisor Emulation Instruction Register (HEIR) > > to 64 bits long to accommodate prefixed instructions. For interrupts > > caused by a word instruction the instruction is loaded into bits 32:63 > > and bits 0:31 are zeroed. When caused by a prefixed instruction the > > prefix and suffix are loaded into bits 0:63. > > > > Signed-off-by: Jordan Niethe > > --- > > arch/powerpc/kvm/book3s.c | 15 +-- > > arch/powerpc/kvm/book3s_64_mmu_hv.c | 10 +++--- > > arch/powerpc/kvm/book3s_hv_builtin.c| 3 +++ > > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 14 ++ > > 4 files changed, 37 insertions(+), 5 deletions(-) > > > > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c > > index 70d8967acc9b..18b1928a571b 100644 > > --- a/arch/powerpc/kvm/book3s.c > > +++ b/arch/powerpc/kvm/book3s.c > > @@ -456,13 +456,24 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, > > { > > ulong pc = kvmppc_get_pc(vcpu); > > u32 word; > > + u64 doubleword; > > int r; > > > > if (type == INST_SC) > > pc -= 4; > > > > - r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false); > > - *inst = ppc_inst(word); > > + if ((kvmppc_get_msr(vcpu) & SRR1_PREFIXED)) { > > + r = kvmppc_ld(vcpu, &pc, sizeof(u64), &doubleword, false); > > Should we also have a check here that the doubleword is not crossing a > page boundary? 
I can't think of a way to get this code to cross a > page boundary, assuming the hardware is working correctly, but it > makes me just a little nervous. I didn't think it could happen but I will add a check to be safe. > > > +#ifdef CONFIG_CPU_LITTLE_ENDIAN > > + *inst = ppc_inst_prefix(doubleword & 0x, doubleword > > >> 32); > > +#else > > + *inst = ppc_inst_prefix(doubleword >> 32, doubleword & > > 0x); > > +#endif > > Ick. Is there a cleaner way to do this? Would it be nicer to read the prefix as u32 then the suffix as a u32 too? > > > + } else { > > + r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false); > > + *inst = ppc_inst(word); > > + } > > + > > if (r == EMULATE_DONE) > > return r; > > else > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c > > b/arch/powerpc/kvm/book3s_64_mmu_hv.c > > index 775ce41738ce..0802471f4856 100644 > > --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c > > +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c > > @@ -411,9 +411,13 @@ static int instruction_is_store(struct ppc_inst instr) > > unsigned int mask; > > > > mask = 0x1000; > > - if ((ppc_inst_val(instr) & 0xfc00) == 0x7c00) > > - mask = 0x100; /* major opcode 31 */ > > - return (ppc_inst_val(instr) & mask) != 0; > > + if (ppc_inst_prefixed(instr)) { > > + return (ppc_inst_suffix(instr) & mask) != 0; > > + } else { > > + if ((ppc_inst_val(instr) & 0xfc00) == 0x7c00) > > + mask = 0x100; /* major opcode 31 */ > > + return (ppc_inst_val(instr) & mask) != 0; > > + } > > The way the code worked before, the mask depended on whether the > instruction was a D-form (or DS-form or other variant) instruction, > where you can tell loads and stores apart by looking at the major > opcode, or an X-form instruction, where you look at the minor opcode. > > Now we are only looking at the minor opcode if it is not a prefixed > instruction. Are there no X-form prefixed loads or stores? I could not see an X-form load/stores so I went with just that. 
But checking the ISA it does mention "..X-form instructions that are preceded by an MLS-form or MMLS-form prefix..." so I shall use the other mask too. > > Paul. Thank you for the comments and suggestions.
[PATCH] mm: check for memory's node later during boot
register_mem_sect_under_node() is checking the memory block's node id only
if the system state is "SYSTEM_BOOTING". On PowerPC, the memory blocks are
registered while the system state is "SYSTEM_SCHEDULING", the one before
SYSTEM_RUNNING.

The consequence, on a PowerPC guest with interleaved memory node ranges, is
that some memory blocks can be assigned to multiple nodes in sysfs. This
later prevents some memory hot-plug and hot-unplug operations from
succeeding because stale links remain. A panic like this is then displayed:

[ cut here ]
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP: c0403f34 LR: c0403f2c CTR:
REGS: c004876e3660 TRAP: 0700 Not tainted (5.9.0-rc1+)
MSR: 8282b033 CR: 24000448 XER: 2004
CFAR: c0846d20 IRQMASK: 0
GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
GPR04: 0227 c004805ae680 0004886f
GPR08: 0226 0003 0002 fffd
GPR12: 88000484 c0001ec96280
GPR16: 0004 0003
GPR20: c0047814ffe0 c0077c08 0010 c13332c8
GPR24: c11f6cc0
GPR28: ffef 0001 00015000 1000
NIP [c0403f34] add_memory_resource+0x244/0x340
LR [c0403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 (unreliable)
[c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
[c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
[c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
[c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
[c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
[c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
[c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
[c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
[c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
[c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
[c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
---[ end trace 562fd6c109cd0fb2 ]---

To prevent these multiple links, do the node check for all states prior
to SYSTEM_RUNNING, not just SYSTEM_BOOTING.

Signed-off-by: Laurent Dufour
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Cc: Andrew Morton
Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
---
 drivers/base/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 508b80f6329b..8e9f39b562ef 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -789,7 +789,7 @@ static int register_mem_sect_under_node(struct memory_block *mem_blk,
 	 * case, during hotplug we know that all pages in the memory
 	 * block belong to the same node.
 	 */
-	if (system_state == SYSTEM_BOOTING) {
+	if (system_state < SYSTEM_RUNNING) {
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
 			continue;
-- 
2.28.0
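The one-liner above is easier to see with the system_state ordering spelled out. A minimal stand-alone sketch (not kernel code; the enum mirrors the ordering in include/linux/kernel.h) of why `<` catches the PowerPC registration window while `==` does not:

```c
#include <assert.h>

/* Toy model of the fix: on PowerPC the memory blocks are registered
 * while system_state is SYSTEM_SCHEDULING, so the old "== SYSTEM_BOOTING"
 * test skipped the per-pfn node check there. */
enum system_states {
	SYSTEM_BOOTING,
	SYSTEM_SCHEDULING,
	SYSTEM_RUNNING,
	SYSTEM_HALT,
	SYSTEM_POWER_OFF,
	SYSTEM_RESTART,
};

/* old condition: node check done only while strictly booting */
static int old_check(enum system_states s)
{
	return s == SYSTEM_BOOTING;
}

/* fixed condition: node check done for every state before RUNNING */
static int new_check(enum system_states s)
{
	return s < SYSTEM_RUNNING;
}
```

The hotplug path (SYSTEM_RUNNING and later) keeps its existing behaviour under both conditions; only the SYSTEM_SCHEDULING window changes.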
Re: [PATCH] cpuidle-pseries: Fix CEDE latency conversion from tb to us
Hello Joel,

On Wed, Sep 02, 2020 at 01:08:35AM +, Joel Stanley wrote:
> On Tue, 1 Sep 2020 at 14:09, Gautham R. Shenoy wrote:
> >
> > From: "Gautham R. Shenoy"
> >
> > commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform. The values
> > advertised by the platform are in timebase ticks. However the cpuidle
> > framework requires the latency values in microseconds.
> >
> > If the tb-ticks value advertised by the platform corresponds to a value
> > smaller than 1us, during the conversion from tb-ticks to microseconds,
> > in the current code, the result becomes zero. This is incorrect as it
> > puts a CEDE state on par with the snooze state.
> >
> > This patch fixes this by rounding up the result obtained while
> > converting the latency value from tb-ticks to microseconds.
> >
> > Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)")
> >
> > Signed-off-by: Gautham R. Shenoy
>
> Reviewed-by: Joel Stanley

Thanks for reviewing the fix.

> Should you check for the zero case and print a warning?

Yes, that would be better. I will post a v2 with that.
> > ---
> >  drivers/cpuidle/cpuidle-pseries.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> > index ff6d99e..9043358 100644
> > --- a/drivers/cpuidle/cpuidle-pseries.c
> > +++ b/drivers/cpuidle/cpuidle-pseries.c
> > @@ -361,7 +361,7 @@ static void __init fixup_cede0_latency(void)
> >  	for (i = 0; i < nr_xcede_records; i++) {
> >  		struct xcede_latency_record *record = &payload->records[i];
> >  		u64 latency_tb = be64_to_cpu(record->latency_ticks);
> > -		u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> > +		u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC);
> >
> >  		if (latency_us < min_latency_us)
> >  			min_latency_us = latency_us;
> > --
> > 1.9.4
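The truncation being fixed here is easy to demonstrate outside the kernel. A stand-alone sketch, with DIV_ROUND_UP written out to mirror the kernel macro and the tb_to_ns() step assumed to have already produced a nanosecond value:

```c
#include <assert.h>

/* Mirrors include/linux/math.h's DIV_ROUND_UP. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define NSEC_PER_USEC 1000ULL

/* Old conversion: plain integer division truncates sub-microsecond
 * latencies to 0, putting the CEDE state on par with snooze. */
static unsigned long long latency_us_truncated(unsigned long long ns)
{
	return ns / NSEC_PER_USEC;
}

/* Fixed conversion: round up, so any non-zero latency is at least 1us. */
static unsigned long long latency_us_rounded_up(unsigned long long ns)
{
	return DIV_ROUND_UP(ns, NSEC_PER_USEC);
}
```

Exact microsecond multiples are unaffected by the rounding; only the sub-microsecond values change from 0 to 1.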
Re: ptrace_syscall_32 is failing
On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner wrote:
>> > I think that they almost work for x86, but not quite as
>> > indicated by this bug. Even if we imagine we can somehow hack around
>> > this bug, I imagine we're going to find other problems with this
>> > model, e.g. the potential upcoming exit problem I noted in my review.
>>
>> What's the upcoming problem?
>
> If we ever want to get single-stepping fully correct across syscalls,
> we might need to inject SIGTRAP on syscall return. This would be more
> awkward if we can't run instrumentable code after the syscall part of
> the syscall is done.

We run a lot of instrumentable code after sys_foo() returns. Otherwise
all the TIF work would not be possible at all. But you might tell me
where exactly you want to inject the SIGTRAP in the syscall exit code
flow.

>> I don't think we want that in general. The current variant is perfectly
>> fine for everything except the 32bit fast syscall nonsense. Also
>> irqentry_entry/exit is not equivalent to the syscall_enter/exit
>> counterparts.
>
> If there are any architectures in which actual work is needed to
> figure out whether something is a syscall in the first place, they'll
> want to do the usual kernel entry work before the syscall entry work.

That's low level entry code which does not require RCU, lockdep, tracing
or whatever muck we setup before actual work can be done.

    arch_asm_entry()
      ...
      arch_c_entry(cause) {
          switch(cause) {
          case EXCEPTION: arch_c_exception(...);
          case SYSCALL:   arch_c_syscall(...);
          ...
          }
      }

You really want to differentiate between exception and syscall
entry/exit.

The splitting of syscall_enter_from_user_mode() is only necessary for
that 32bit fast syscall thing on x86 and there is no point to open code
it with two calls for e.g. do_syscall_64().

> Maybe your patch actually makes this possible -- I haven't digested
> all the details yet.
>
> Who advised you to drop the arch parameter?

Kees, IIRC, but I would have to search through the gazillions of mail
threads to be sure.

>> +	syscall_enter_from_user_mode_prepare(regs);
>
> I'm getting lost in all these "enter" functions...

It's not that hard.

      syscall_enter_from_user_mode_prepare()
    + syscall_enter_from_user_mode_work()
    = syscall_enter_from_user_mode()

That's exactly what you suggested just with the difference that it is
explicit for syscalls and not using irqentry_enter/exit().

If we would do that then instead of having a single call for sane
syscall paths:

    arch_c_entry()
        nr = syscall_enter_from_user_mode();

or for that 32bit fast syscall nonsense the split variant:

    arch_c_entry()
        syscall_enter_from_user_mode_prepare();
        do_fast_syscall_muck();
        nr = syscall_enter_from_user_mode_work();

we'd have:

    arch_c_entry()
        irqentry_enter();
        local_irq_enable();
        nr = syscall_enter_from_user_mode_work();
        ...

which enforces two calls for sane entries and more code in arch/

Thanks,

        tglx
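The composition identity tglx spells out can be modeled in a few lines. A toy sketch, not kernel code: the function bodies are stand-ins (the real prepare step does lockdep/RCU/context-tracking setup and enables interrupts, the work step runs seccomp/tracepoints/audit); only the prepare + work = combined structure is the point:

```c
#include <assert.h>

static int state_prepared;

/* stand-in for the low-level state setup + local_irq_enable() */
static void syscall_enter_from_user_mode_prepare(void)
{
	state_prepared = 1;
}

/* stand-in for seccomp/tracepoints/audit; may rewrite the syscall nr,
 * and must only run after the prepare step */
static long syscall_enter_from_user_mode_work(long nr)
{
	assert(state_prepared);
	return nr;
}

/* the combined helper is exactly prepare followed by work */
static long syscall_enter_from_user_mode(long nr)
{
	syscall_enter_from_user_mode_prepare();
	return syscall_enter_from_user_mode_work(nr);
}
```

Sane architectures call only the combined helper; the x86 32-bit fast-syscall path is the one caller that needs to interleave its own work between the two halves.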
Re: [PATCH] powerpc: Fix random segfault when freeing hugetlb range
On 9/2/20 1:41 PM, Christophe Leroy wrote:
On 02/09/2020 at 05:23, Aneesh Kumar K.V wrote:
Christophe Leroy writes:

The following random segfault is observed from time to time with the
map_hugetlb selftest:

root@localhost:~# ./map_hugetlb 1 19
524288 kB hugepages
Mapping 1 Mbytes
Segmentation fault

[   31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 779a6834 code 1 in ld-2.23.so[77966000+21000]
[   31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 90010044 9361002c 93810030 93a10034 93c10038
[   31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 81430038 <80e90004> 814a0004 7f443a14 813a0004
[   31.221911] BUG: Bad rss-counter state mm:(ptrval) type:MM_FILEPAGES val:33
[   31.229362] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:5

This fault is due to hugetlb_free_pgd_range() freeing page tables that
are also used by regular pages.

As explained in the comment at the beginning of hugetlb_free_pgd_range(),
the verification done in free_pgd_range() on floor and ceiling is not
done here, which means hugetlb_free_pte_range() can free outside the
expected range.

As the verification cannot be done in hugetlb_free_pgd_range(), it must
be done in hugetlb_free_pte_range().

Reviewed-by: Aneesh Kumar K.V

Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
 arch/powerpc/mm/hugetlbpage.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..e7ae2a2c4545 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -330,10 +330,24 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 				 get_hugepd_cache_index(pdshift - shift));
 }
 
-static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr)
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+				   unsigned long addr, unsigned long end,
+				   unsigned long floor, unsigned long ceiling)
 {
+	unsigned long start = addr;
 	pgtable_t token = pmd_pgtable(*pmd);
 
+	start &= PMD_MASK;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= PMD_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
+

We do repeat that for the pud/pmd/pte hugetlb_free_*_range variants. Can
we consolidate that with a comment explaining that we are checking if
the pgtable entry is mapping outside the range?

I was thinking about refactoring that into a helper and adding all the
necessary comments to explain what it does. Will do that in a followup
series if you are OK. This patch is a bug fix and also has to go through
stable.

agreed. Thanks.
-aneesh
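The floor/ceiling test added by the patch can be checked in isolation. A stand-alone model (the PMD_SHIFT value and the helper name are illustrative, not the kernel's): the PTE table under a pmd may only be freed when the whole PMD-aligned region around [addr, end) lies inside [floor, ceiling), because otherwise regular pages outside the hugetlb range still use it:

```c
#include <assert.h>

#define PMD_SHIFT 21
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/* Same checks as the patch adds to hugetlb_free_pte_range(), returning
 * 1 when freeing is allowed instead of freeing. A ceiling of 0 means
 * "no ceiling": ceiling - 1 wraps to ULONG_MAX, so the end test passes. */
static int pte_table_can_be_freed(unsigned long addr, unsigned long end,
				  unsigned long floor, unsigned long ceiling)
{
	unsigned long start = addr & PMD_MASK;

	if (start < floor)
		return 0;
	if (ceiling) {
		ceiling &= PMD_MASK;
		if (!ceiling)
			return 0;
	}
	if (end - 1 > ceiling - 1)
		return 0;
	return 1;
}
```

The second test case below is exactly the bug scenario: the PMD-aligned start falls below floor, so regular pages may share the page table and it must not be freed.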
Re: [PATCH 0/2] dma-mapping: update default segment_boundary_mask
On 9/2/20 12:16 AM, Nicolin Chen wrote:
> These two patches are to update the default segment_boundary_mask.
>
> PATCH-1 fixes overflow issues in callers of dma_get_seg_boundary.
> Previous version was a series: https://lkml.org/lkml/2020/8/31/1026
>
> Then PATCH-2 sets default segment_boundary_mask to ULONG_MAX.
>
> Nicolin Chen (2):
>   dma-mapping: introduce dma_get_seg_boundary_nr_pages()
>   dma-mapping: set default segment_boundary_mask to ULONG_MAX

I gave both of your patches a quick test ride on a couple of dev
mainframes; NVMe, ConnectX and virtio-pci devices all seem to work
fine. I already commented on Christoph's mail that I like the helper
approach, so as for s390 you can add my

Acked-by: Niklas Schnelle

>  arch/alpha/kernel/pci_iommu.c    |  7 +-
>  arch/ia64/hp/common/sba_iommu.c  |  3 +--
>  arch/powerpc/kernel/iommu.c      |  9 ++---
>  arch/s390/pci/pci_dma.c          |  6 ++
>  arch/sparc/kernel/iommu-common.c | 10 +++---
>  arch/sparc/kernel/iommu.c        |  3 +--
>  arch/sparc/kernel/pci_sun4v.c    |  3 +--
>  arch/x86/kernel/amd_gart_64.c    |  3 +--
>  drivers/parisc/ccio-dma.c        |  3 +--
>  drivers/parisc/sba_iommu.c       |  3 +--
>  include/linux/dma-mapping.h      | 21 -
>  11 files changed, 34 insertions(+), 37 deletions(-)
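The overflow PATCH-1 guards against is worth seeing concretely: once the default boundary mask becomes ULONG_MAX, computing the page count as "(mask + 1) >> shift" wraps to zero, while shifting first is safe. A sketch with illustrative helper names, not the exact kernel API:

```c
#include <assert.h>
#include <limits.h>

/* Old caller pattern: add 1 to the boundary mask before shifting.
 * Wraps to 0 when mask == ULONG_MAX, so the allocator sees 0 pages. */
static unsigned long nr_pages_overflowing(unsigned long mask, unsigned int shift)
{
	return (mask + 1) >> shift;
}

/* Helper-style pattern: shift first, then add 1. No wrap possible. */
static unsigned long nr_pages_safe(unsigned long mask, unsigned int shift)
{
	return (mask >> shift) + 1;
}
```

This is why centralizing the computation in one helper is attractive: every open-coded caller had a chance to get the operation order wrong once the mask's default changed.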
Re: [PATCH] powerpc: Fix random segfault when freeing hugetlb range
On 02/09/2020 at 05:23, Aneesh Kumar K.V wrote:
Christophe Leroy writes:

The following random segfault is observed from time to time with the
map_hugetlb selftest:

root@localhost:~# ./map_hugetlb 1 19
524288 kB hugepages
Mapping 1 Mbytes
Segmentation fault

[   31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 779a6834 code 1 in ld-2.23.so[77966000+21000]
[   31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 90010044 9361002c 93810030 93a10034 93c10038
[   31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 81430038 <80e90004> 814a0004 7f443a14 813a0004
[   31.221911] BUG: Bad rss-counter state mm:(ptrval) type:MM_FILEPAGES val:33
[   31.229362] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:5

This fault is due to hugetlb_free_pgd_range() freeing page tables that
are also used by regular pages.

As explained in the comment at the beginning of hugetlb_free_pgd_range(),
the verification done in free_pgd_range() on floor and ceiling is not
done here, which means hugetlb_free_pte_range() can free outside the
expected range.

As the verification cannot be done in hugetlb_free_pgd_range(), it must
be done in hugetlb_free_pte_range().

Reviewed-by: Aneesh Kumar K.V

Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
 arch/powerpc/mm/hugetlbpage.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..e7ae2a2c4545 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -330,10 +330,24 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 				 get_hugepd_cache_index(pdshift - shift));
 }
 
-static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr)
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+				   unsigned long addr, unsigned long end,
+				   unsigned long floor, unsigned long ceiling)
 {
+	unsigned long start = addr;
 	pgtable_t token = pmd_pgtable(*pmd);
 
+	start &= PMD_MASK;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= PMD_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
+

We do repeat that for the pud/pmd/pte hugetlb_free_*_range variants. Can
we consolidate that with a comment explaining that we are checking if
the pgtable entry is mapping outside the range?

I was thinking about refactoring that into a helper and adding all the
necessary comments to explain what it does. Will do that in a followup
series if you are OK. This patch is a bug fix and also has to go through
stable.

Christophe
Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2
On Tue, Sep 01, 2020 at 06:25:12PM +0100, Al Viro wrote:
> On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote:
>
> > 10.92%  dd  [kernel.kallsyms]  [k] iov_iter_zero
>
> Interesting... Could you get an instruction-level profile inside
> iov_iter_zero(), along with the disassembly of that sucker?

So the interesting thing here is that with this series none of these
code paths should have changed at all, and the biggest items on the
profile look the same modulo some minor reordering.
Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS
On Tue, Sep 01, 2020 at 11:57:37AM -0700, Kees Cook wrote:
> On Sat, Aug 29, 2020 at 11:24:06AM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote:
> > > On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig wrote:
> > > >
> > > > Once we can't manipulate the address limit, we also can't test what
> > > > happens when the manipulation is abused.
> > >
> > > Just remove these tests entirely.
> > >
> > > Once set_fs() doesn't exist on x86, the tests no longer make any sense
> > > what-so-ever, because test coverage will be basically zero.
> > >
> > > So don't make the code uglier just to maintain a fiction that
> > > something is tested when it isn't really.
> >
> > Sure, fine with me unless Kees screams.
>
> To clarify: if any of x86, arm64, arm, powerpc, riscv, and s390 are
> using set_fs(), I want to keep this test. "ugly" is fine in lkdtm. :)

And Linus wants them gone entirely, so I'll need a stage fight between
the two of you.

At least for this merge window I'm only planning on x86 and powerpc,
plus maybe riscv if I get the work done in time, although help from the
maintainers would be welcome. s390 has a driver that still uses set_fs
that will need some surgery - it shouldn't be too bad, but arm will be
a piece of work. Unless I get help it will take a while.
Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type
On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras wrote:
>
> On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote:
> > The ppc_inst type was added to help cope with the addition of prefixed
> > instructions to the ISA. Convert KVM to use this new type for dealing
> > with instructions. For now do not try to add further support for
> > prefixed instructions.
>
> This change does seem to splatter itself across a lot of code that
> mostly or exclusively runs on machines which are not POWER10 and will
> never need to handle prefixed instructions, unfortunately. I wonder
> if there is a less invasive way to approach this.

Something less invasive would be good.

> In particular we are inflicting this 64-bit struct on 32-bit platforms
> unnecessarily (I assume, correct me if I am wrong here).

No, that is something that I wanted to avoid; on 32-bit platforms it is
a 32-bit struct:

struct ppc_inst {
	u32 val;
#ifdef CONFIG_PPC64
	u32 suffix;
#endif
} __packed;

> How would it be to do something like:
>
> typedef unsigned long ppc_inst_t;
>
> so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms,
> and then use that instead of 'struct ppc_inst'? You would still need
> to change the function declarations but I think most of the function
> bodies would not need to be changed. In particular you would avoid a
> lot of the churn related to having to add ppc_inst_val() and suchlike.

Would the idea be to get rid of 'struct ppc_inst' entirely or just not
use it in kvm?
In an earlier series I did something similar (at least code shared
between 32-bit and 64-bit would need helpers, but 32-bit only code need
not change):

#ifdef __powerpc64__

typedef struct ppc_inst {
	union {
		struct {
			u32 word;
			u32 pad;
		} __packed;
		struct {
			u32 prefix;
			u32 suffix;
		} __packed;
	};
} ppc_inst;

#else /* !__powerpc64__ */

typedef u32 ppc_inst;

#endif

However mpe wanted to avoid using a typedef
(https://patchwork.ozlabs.org/comment/2391845/).

We did also talk about just using a u64 for instructions
(https://lore.kernel.org/linuxppc-dev/1585028462.t27rstc2uf.astr...@bobo.none/)
but the concern was that as prefixed instructions act as two separate
u32s (the prefix is always before the suffix regardless of endianness)
keeping it as a u64 would lead to a lot of macros and potential
confusion. But it does seem that if it can avoid a lot of needless
churn it might be worth the trade off.

> > -static inline unsigned make_dsisr(unsigned instr)
> > +static inline unsigned make_dsisr(struct ppc_inst instr)
> >  {
> >  	unsigned dsisr;
> > +	u32 word = ppc_inst_val(instr);
> >
> >
> >  	/* bits 6:15 --> 22:31 */
> > -	dsisr = (instr & 0x03ff0000) >> 16;
> > +	dsisr = (word & 0x03ff0000) >> 16;
> >
> >  	if (IS_XFORM(instr)) {
> >  		/* bits 29:30 --> 15:16 */
> > -		dsisr |= (instr & 0x00000006) << 14;
> > +		dsisr |= (word & 0x00000006) << 14;
> >  		/* bit 25 --> 17 */
> > -		dsisr |= (instr & 0x00000040) << 8;
> > +		dsisr |= (word & 0x00000040) << 8;
> >  		/* bits 21:24 --> 18:21 */
> > -		dsisr |= (instr & 0x00000780) << 3;
> > +		dsisr |= (word & 0x00000780) << 3;
> >  	} else {
> >  		/* bit 5 --> 17 */
> > -		dsisr |= (instr & 0x04000000) >> 12;
> > +		dsisr |= (word & 0x04000000) >> 12;
> >  		/* bits 1:4 --> 18:21 */
> > -		dsisr |= (instr & 0x78000000) >> 17;
> > +		dsisr |= (word & 0x78000000) >> 17;
> >  		/* bits 30:31 --> 12:13 */
> >  		if (IS_DSFORM(instr))
> > -			dsisr |= (instr & 0x00000003) << 18;
> > +			dsisr |= (word & 0x00000003) << 18;

> Here I would have done something like:
>
> -static inline unsigned make_dsisr(unsigned instr)
> +static inline unsigned make_dsisr(struct ppc_inst pi)
>  {
>  	unsigned dsisr;
> +	u32 instr = ppc_inst_val(pi);
>
> and left the rest of the function unchanged.

That is better.

> At first I wondered why we still had that function, since IBM Power
> CPUs have not set DSISR on an alignment interrupt since POWER3 days.
> It turns out this function is used by PR KVM when it is emulating
> one of the old 32-bit PowerPC CPUs (601, 603, 604, 750, 7450 etc.).
>
> > diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
>
> Despite the file name, this code is not used on IBM Power servers.
> It is for platforms which run under an ePAPR (not server PAPR)
> hypervisor (which would be a KVM variant, but generally book E KVM not
> book 3S).
>
> Paul.
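For reference, the bit shuffling under discussion compiles and runs stand-alone once ppc_inst is reduced to a plain u32. The masks below are written out with the trailing zeros that got lost in the quoted diff, and IS_XFORM/IS_DSFORM follow the usual kernel definitions (primary opcode 31 for X-form, >= 56 for DS-form); treat this as a sketch, not the authoritative source:

```c
#include <assert.h>

typedef unsigned int u32;

#define get_op(inst)	((inst) >> 26)
#define IS_XFORM(inst)	(get_op(inst) == 31)
#define IS_DSFORM(inst)	(get_op(inst) >= 56)

/* Rebuild a DSISR value from the faulting instruction word, moving the
 * instruction fields (IBM bit numbering, MSB = bit 0) into the DSISR
 * positions documented in the comments. */
static unsigned make_dsisr(u32 instr)
{
	unsigned dsisr;

	/* bits 6:15 --> 22:31 */
	dsisr = (instr & 0x03ff0000) >> 16;

	if (IS_XFORM(instr)) {
		/* bits 29:30 --> 15:16 */
		dsisr |= (instr & 0x00000006) << 14;
		/* bit 25 --> 17 */
		dsisr |= (instr & 0x00000040) << 8;
		/* bits 21:24 --> 18:21 */
		dsisr |= (instr & 0x00000780) << 3;
	} else {
		/* bit 5 --> 17 */
		dsisr |= (instr & 0x04000000) >> 12;
		/* bits 1:4 --> 18:21 */
		dsisr |= (instr & 0x78000000) >> 17;
		/* bits 30:31 --> 12:13 */
		if (IS_DSFORM(instr))
			dsisr |= (instr & 0x00000003) << 18;
	}
	return dsisr;
}
```

Having the function stand-alone like this also shows why the "u32 instr = ppc_inst_val(pi);" prologue Paul suggests is the minimal-churn conversion: the body only ever consumes the 32-bit word.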
Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker
On 9/1/20 10:25 PM, Nick Desaulniers wrote:
Rather than invoke the compiler as the driver, use the linker. That way
we can check --orphan-handling=warn support correctly, as cc-ldoption
was removed in commit 055efab3120b ("kbuild: drop support for
cc-ldoption").

Requires dropping the .got section. I couldn't find how it was used in
the vdso32.

Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan sections")
Link: https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/
Signed-off-by: Nick Desaulniers
---
Not sure removing .got is a good idea or not. Otherwise I observe the
following link error:

powerpc-linux-gnu-ld: warning: orphan section `.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.got'
powerpc-linux-gnu-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc-linux-gnu-ld: final link failed: bad value

sigtramp.c doesn't mention anything from the GOT AFAICT, and doesn't
look like it contains relocations that do, so I'm not sure where
references to _GLOBAL_OFFSET_TABLE_ are coming from.

I'm getting the same but only when building for PPC64. I don't get any
reference to sigtramp.o though:

  CALL    scripts/checksyscalls.sh
  CALL    scripts/atomic/check-atomics.sh
  VDSO32A arch/powerpc/kernel/vdso32/sigtramp.o
  VDSO32A arch/powerpc/kernel/vdso32/gettimeofday.o
  VDSO32A arch/powerpc/kernel/vdso32/datapage.o
  VDSO32A arch/powerpc/kernel/vdso32/cacheflush.o
  VDSO32A arch/powerpc/kernel/vdso32/note.o
  VDSO32A arch/powerpc/kernel/vdso32/getcpu.o
  LD      arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc64-linux-ld: final link failed: Bad value

(GCC 8.1, Binutils 2.30)

So it seems that the .got section is being created by the linker. Don't
know why though.

With GCC 10.1, binutils 2.34 I get:

  LDS     arch/powerpc/kernel/vdso32/vdso32.lds
  VDSO32A arch/powerpc/kernel/vdso32/sigtramp.o
  VDSO32A arch/powerpc/kernel/vdso32/gettimeofday.o
  VDSO32A arch/powerpc/kernel/vdso32/datapage.o
  VDSO32A arch/powerpc/kernel/vdso32/cacheflush.o
  VDSO32A arch/powerpc/kernel/vdso32/note.o
  VDSO32A arch/powerpc/kernel/vdso32/getcpu.o
  LD      arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: warning: orphan section `.branch_lt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.branch_lt'
powerpc64-linux-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc64-linux-ld: final link failed: bad value

I can't see any .branch_lt section when objdumping sigtramp.o or any
other .o

When I move sigtramp.o to the end of the definition of obj-vdso32 in the
Makefile, I then get:

powerpc64-linux-ld: warning: orphan section `.branch_lt' from `arch/powerpc/kernel/vdso32/gettimeofday.o' being placed in section `.branch_lt'
powerpc64-linux-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc64-linux-ld: final link failed: bad value

gettimeofday.o now being the first object in obj-vdso32.

Christophe

 arch/powerpc/kernel/vdso32/Makefile     | 7 +++++--
 arch/powerpc/kernel/vdso32/vdso32.lds.S | 3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/Makefile b/arch/powerpc/kernel/vdso32/Makefile
index 87ab1152d5ce..611a5951945a 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -27,6 +27,9 @@ UBSAN_SANITIZE := n
 ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
 	-Wl,-soname=linux-vdso32.so.1 -Wl,--hash-style=both
 asflags-y := -D__VDSO32__ -s
+ldflags-y := -shared -soname linux-vdso32.so.1 \
+	$(call ld-option, --eh-frame-hdr) \
+	$(call ld-option, --orphan-handling=warn) -T
 
 obj-y += vdso32_wrapper.o
 extra-y += vdso32.lds
@@ -49,8 +52,8 @@ $(obj-vdso32): %.o: %.S FORCE
 	$(call if_changed_dep,vdso32as)
 
 # actual build commands
-quiet_cmd_vdso32ld = VDSO32L $@
-      cmd_vdso32ld = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ $(call cc-ldoption, -Wl$(comma)--orphan-handling=warn) -Wl,-T$(filter %.lds,$^) $(filter %.o,$^)
+quiet_cmd_vdso32ld = LD      $@
+      cmd_vdso32ld = $(cmd_ld)
 quiet_cmd_vdso32as = VDSO32A $@
       cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) -c -o $@ $<
 
diff --git a/arch/powerpc/kernel/vdso32/vdso32.lds.S b/arch/powerpc/kernel/vdso32/vdso32.lds.S
index 4c985467a668..0ccdebad18b8 100644
--- a/arch/powerpc/kernel/vdso32/vdso32.lds.S
+++ b/arch/powerpc/kernel/vdso32/vdso32.lds.S
@@ -61,7 +61,6 @@ SECTIONS
 	.fixup		: { *(.fixup) }
 	.dynamic	: { *(.dynamic) }		:text	:dynamic
-	.got		: { *(.got) }			:text
 	.plt		: { *(.plt) }
 	_end = .;
@@ -108,7 +107,9 @@ SECTIONS
 	.debug_varnames	 0 : { *(.debug_varnames) }
 	/DISCARD/	: {
+		*(.got)
 		*(.note.GNU-stack)
+		*(.branch_lt)