Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
On Fri, Jun 16, 2023 at 12:48:02PM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)"
> >
> > module_alloc() is used everywhere as a means to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF, to modules
> > and puts the burden of code allocation on the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints on where the executable memory can be located, and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc(), and execmem_free() and jit_free() are replacements of
> > module_memfree(), to allow updating all call sites to use the new APIs.
> >
> > The intended semantics for the new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code; the allocations of the
> > associated data, like data sections of a module, will use the
> > execmem_data_alloc() interface that will be added later.
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.

With near and far it should mention from where, and that's getting too
long. I don't mind changing the names, but I couldn't think of anything
better than Song's execmem and your jit.

> Reviewed-by: Kent Overstreet

Thanks!

--
Sincerely yours,
Mike.
Re: [PATCH v2 01/12] nios2: define virtual address space for modules
On Fri, Jun 16, 2023 at 04:00:19PM +, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> >  void *module_alloc(unsigned long size)
> >  {
> > -	if (size == 0)
> > -		return NULL;
> > -	return kmalloc(size, GFP_KERNEL);
> > -}
> > -
> > -/* Free memory returned from module_alloc */
> > -void module_memfree(void *module_region)
> > -{
> > -	kfree(module_region);
> > +	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> > +				    GFP_KERNEL, PAGE_KERNEL_EXEC,
> > +				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> > +				    __builtin_return_address(0));
> >  }
> >
> >  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *s
>
> I wonder if the (size == 0) check is really needed, but
> __vmalloc_node_range() will WARN on this case where the old code won't.

module_alloc() should not be called with zero size, so a warning there
would be appropriate. Besides, no other module_alloc() had this check.

--
Sincerely yours,
Mike.
Re: [PATCH v2 RESEND] ASoC: fsl MPC52xx drivers require PPC_BESTCOMM
Hi Mark, Liam,

On 5/30/23 16:38, Randy Dunlap wrote:
> Hello maintainers,
>
> I am still seeing these build errors on linux-next-20230530.
>
> Is there a problem with the patch?
> Thanks.

I am still seeing build errors on linux-next-20230615.
Is there a problem with the patch? Can it be applied/merged?

Thanks.

> On 5/21/23 15:57, Randy Dunlap wrote:
>> Both SND_MPC52xx_SOC_PCM030 and SND_MPC52xx_SOC_EFIKA select
>> SND_SOC_MPC5200_AC97. The latter symbol depends on PPC_BESTCOMM,
>> so the 2 former symbols should also depend on PPC_BESTCOMM since
>> "select" does not follow any dependency chains.
>>
>> This prevents a kconfig warning and build errors:
>>
>> WARNING: unmet direct dependencies detected for SND_SOC_MPC5200_AC97
>>   Depends on [n]: SOUND [=y] && !UML && SND [=m] && SND_SOC [=m] &&
>>     SND_POWERPC_SOC [=m] && PPC_MPC52xx [=y] && PPC_BESTCOMM [=n]
>>   Selected by [m]:
>>   - SND_MPC52xx_SOC_PCM030 [=m] && SOUND [=y] && !UML && SND [=m] &&
>>     SND_SOC [=m] && SND_POWERPC_SOC [=m] && PPC_MPC5200_SIMPLE [=y]
>>   - SND_MPC52xx_SOC_EFIKA [=m] && SOUND [=y] && !UML && SND [=m] &&
>>     SND_SOC [=m] && SND_POWERPC_SOC [=m] && PPC_EFIKA [=y]
>>
>> ERROR: modpost: "mpc5200_audio_dma_destroy"
>>   [sound/soc/fsl/mpc5200_psc_ac97.ko] undefined!
>> ERROR: modpost: "mpc5200_audio_dma_create"
>>   [sound/soc/fsl/mpc5200_psc_ac97.ko] undefined!
>>
>> Fixes: 40d9ec14e7e1 ("ASoC: remove BROKEN from Efika and pcm030 fabric drivers")
>> Signed-off-by: Randy Dunlap
>> Cc: Grant Likely
>> Cc: Mark Brown
>> Cc: Liam Girdwood
>> Cc: Shengjiu Wang
>> Cc: Xiubo Li
>> Cc: alsa-de...@alsa-project.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: Jaroslav Kysela
>> Cc: Takashi Iwai
>> ---
>> v2: use correct email address for Mark Brown.
>>
>>  sound/soc/fsl/Kconfig |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff -- a/sound/soc/fsl/Kconfig b/sound/soc/fsl/Kconfig
>> --- a/sound/soc/fsl/Kconfig
>> +++ b/sound/soc/fsl/Kconfig
>> @@ -243,7 +243,7 @@ config SND_SOC_MPC5200_AC97
>>
>>  config SND_MPC52xx_SOC_PCM030
>>  	tristate "SoC AC97 Audio support for Phytec pcm030 and WM9712"
>> -	depends on PPC_MPC5200_SIMPLE
>> +	depends on PPC_MPC5200_SIMPLE && PPC_BESTCOMM
>>  	select SND_SOC_MPC5200_AC97
>>  	select SND_SOC_WM9712
>>  	help
>> @@ -252,7 +252,7 @@ config SND_MPC52xx_SOC_PCM030
>>
>>  config SND_MPC52xx_SOC_EFIKA
>>  	tristate "SoC AC97 Audio support for bbplan Efika and STAC9766"
>> -	depends on PPC_EFIKA
>> +	depends on PPC_EFIKA && PPC_BESTCOMM
>>  	select SND_SOC_MPC5200_AC97
>>  	select SND_SOC_STAC9766
>>  	help

--
~Randy
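The pitfall Randy's patch fixes, `select` forcing a symbol on without honoring that symbol's own dependencies, can be reduced to a minimal Kconfig sketch. The symbol names below are invented for illustration; only the pattern matches the patch above.

```kconfig
# BAD: FOO can be enabled while DEP=n; "select" forces BAR on anyway,
# violating BAR's dependency and producing an "unmet direct dependencies"
# warning plus modpost link errors for BAR's symbols.
config FOO
	tristate "Feature built on BAR"
	select BAR              # select does not follow BAR's "depends on"

config BAR
	tristate
	depends on DEP

# GOOD: the selecting symbol carries the dependency itself, so BAR can
# only ever be selected when DEP is already satisfied.
config FOO
	tristate "Feature built on BAR"
	depends on DEP
	select BAR
```

This is exactly the shape of the fix: SND_MPC52xx_SOC_PCM030 and SND_MPC52xx_SOC_EFIKA gain `&& PPC_BESTCOMM` because they select SND_SOC_MPC5200_AC97, which depends on PPC_BESTCOMM.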
Re: [PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()
On Thu, Jun 15, 2023 at 04:02:43PM -0700, Hugh Dickins wrote:
> To keep balance in future, __update_tlb() remember to pte_unmap() after
> pte_offset_map(). This is an odd case, since the caller has already done
> pte_offset_map_lock(), then mips forgets the address and recalculates it;
> but my two naive attempts to clean that up did more harm than good.
>
> Tested-by: Nathan Chancellor
> Signed-off-by: Hugh Dickins

FWIW: Tested-by: Yu Zhao

There is another problem, likely caused by khugepaged, that happened
multiple times. But I don't think it's related to your series, just FYI.

Got mcheck at 81134ef0
CPU: 3 PID: 36 Comm: khugepaged Not tainted 6.4.0-rc6-00049-g62d8779610bb-dirty #1
$ 0 : 0014 411ac004 4000
$ 4 : c000 0045 00011a80045b 00011a80045b
$ 8 : 800080188000 81b526c0 0200
$12 : 0028 81910cb4 0207
$16 : 00aaab80 837ee990 81b50200 85066ae0
$20 : 0001 8000 81c1 00aaab80
$24 : 0002 812b75f8
$28 : 8231 82313b00 81b5 81134d88
Hi : 017a
Lo :
epc   : 81134ef0 __update_tlb+0x260/0x2a0
ra    : 81134d88 __update_tlb+0xf8/0x2a0
Status: 14309ce2 KX SX UX KERNEL EXL
Cause : 00800060 (ExcCode 18)
PrId  : 000d9602 (Cavium Octeon III)
CPU: 3 PID: 36 Comm: khugepaged Not tainted 6.4.0-rc6-00049-g62d8779610bb-dirty #1
Stack : 0001 0008 82313768 82313768
        823138f8 a6c8cd76e1667e00 81db4f28 0001
        30302d3663722d30 643236672d393430 0010 81910cc0
        81d96bcc 81a68ed0 81b5 0001
        8000 81c1 00aaab80 0002
        815b78c0 a184e710 00c0 8231
        82313760 81b5 8111c9cc 8111c9ec
        ...
Call Trace:
[] show_stack+0x64/0x158
[] dump_stack_lvl+0x5c/0x7c
[] do_mcheck+0x2c/0x98
[] handle_mcheck_int+0x38/0x50

Index    : 8000
PageMask : 1fe000
EntryHi  : 00aaab8000bd
EntryLo0 : 411a8004
EntryLo1 : 411ac004
Wired    : 0
PageGrain: e800

Index: 2 pgmask=4kb va=c0fe4000 asid=b9 [ri=0 xi=0 pa=22a7000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=22af000 c=0 d=1 v=1 g=1]
Index: 3 pgmask=4kb va=c0fe8000 asid=b9 [ri=0 xi=0 pa=238 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2381000 c=0 d=1 v=1 g=1]
Index: 4 pgmask=4kb va=c0fea000 asid=b9 [ri=0 xi=0 pa=23e9000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=23ea000 c=0 d=1 v=1 g=1]
Index: 5 pgmask=4kb va=c0fee000 asid=b9 [ri=0 xi=0 pa=2881000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2882000 c=0 d=1 v=1 g=1]
Index: 6 pgmask=4kb va=c0fefffb asid=b9 [ri=0 xi=0 pa=2cc2000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2cc3000 c=0 d=1 v=1 g=1]
Index: 7 pgmask=4kb va=c0fec000 asid=b9 [ri=0 xi=0 pa=23eb000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=288 c=0 d=1 v=1 g=1]
Index: 8 pgmask=4kb va=c0fe6000 asid=b9 [ri=0 xi=0 pa=237e000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=237f000 c=0 d=1 v=1 g=1]
Index: 14 pgmask=4kb va=c0fefff62000 asid=8e [ri=0 xi=0 pa=7477000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=745e000 c=0 d=1 v=1 g=1]
Index: 15 pgmask=4kb va=c0fefff52000 asid=8e [ri=0 xi=0 pa=744c000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=616d000 c=0 d=1 v=1 g=1]
Index: 16 pgmask=4kb va=c0fefff42000 asid=8e [ri=0 xi=0 pa=6334000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=616b000 c=0 d=1 v=1 g=1]
Index: 19 pgmask=4kb va=c0fefffb6000 asid=8e [ri=0 xi=0 pa=505 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=5051000 c=0 d=1 v=1 g=1]
Index: 20 pgmask=4kb va=c0fefff72000 asid=b9 [ri=0 xi=0 pa=7504000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=7503000 c=0 d=1 v=1 g=1]
Index: 58 pgmask=4kb va=c0fefffaa000 asid=8e [ri=0 xi=0 pa=5126000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=5127000 c=0 d=1 v=1 g=1]
Index: 59 pgmask=4kb va=c0fefffba000 asid=8e [ri=0 xi=0 pa=5129000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=512a000 c=0 d=1 v=1 g=1]
Index: 79 pgmask=4kb va=c006 asid=8e [ri=0 xi=0 pa=534b000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=62f9000 c=0 d=1 v=1 g=1]
Index: 80 pgmask=4kb va=c005e000 asid=8e
[powerpc:next] BUILD SUCCESS 7a313166d7dd0b366f8a47e15cd3c40910494cb6
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
branch HEAD: 7a313166d7dd0b366f8a47e15cd3c40910494cb6  powerpc: update ppc_save_regs to save current r1 in pt_regs

elapsed time: 724m

configs tested: 132
configs skipped: 6

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha      allyesconfig                 gcc
alpha      defconfig                    gcc
alpha      randconfig-r001-20230616     gcc
arc        allyesconfig                 gcc
arc        defconfig                    gcc
arc        randconfig-r016-20230616     gcc
arc        randconfig-r025-20230616     gcc
arc        randconfig-r031-20230616     gcc
arc        randconfig-r043-20230616     gcc
arm        allmodconfig                 gcc
arm        allyesconfig                 gcc
arm        defconfig                    gcc
arm        randconfig-r003-20230616     gcc
arm        randconfig-r011-20230616     clang
arm        randconfig-r033-20230616     gcc
arm        randconfig-r046-20230616     clang
arm64      allyesconfig                 gcc
arm64      defconfig                    gcc
arm64      randconfig-r001-20230616     clang
arm64      randconfig-r032-20230616     clang
arm64      randconfig-r036-20230616     clang
csky       defconfig                    gcc
csky       randconfig-r012-20230616     gcc
csky       randconfig-r022-20230616     gcc
csky       randconfig-r026-20230616     gcc
csky       randconfig-r034-20230616     gcc
hexagon    randconfig-r041-20230616     clang
hexagon    randconfig-r045-20230616     clang
i386       allyesconfig                 gcc
i386       buildonly-randconfig-r004-20230616  clang
i386       buildonly-randconfig-r005-20230616  clang
i386       buildonly-randconfig-r006-20230616  clang
i386       debian-10.3                  gcc
i386       defconfig                    gcc
i386       randconfig-i001-20230616     clang
i386       randconfig-i002-20230616     clang
i386       randconfig-i003-20230616     clang
i386       randconfig-i004-20230616     clang
i386       randconfig-i005-20230616     clang
i386       randconfig-i006-20230616     clang
i386       randconfig-i011-20230616     gcc
i386       randconfig-i012-20230616     gcc
i386       randconfig-i013-20230616     gcc
i386       randconfig-i014-20230616     gcc
i386       randconfig-i015-20230616     gcc
i386       randconfig-i016-20230616     gcc
i386       randconfig-r002-20230616     clang
i386       randconfig-r004-20230616     clang
loongarch  allmodconfig                 gcc
loongarch  allnoconfig                  gcc
loongarch  defconfig                    gcc
loongarch  randconfig-r005-20230616     gcc
loongarch  randconfig-r013-20230616     gcc
loongarch  randconfig-r031-20230616     gcc
m68k       allmodconfig                 gcc
m68k       allyesconfig                 gcc
m68k       defconfig                    gcc
m68k       randconfig-r035-20230616     gcc
microblaze randconfig-r011-20230616     gcc
microblaze randconfig-r021-20230616     gcc
mips       allmodconfig                 gcc
mips       allyesconfig                 gcc
mips       randconfig-r024-20230616     clang
nios2      defconfig                    gcc
nios2      randconfig-r013-20230616     gcc
openrisc   randconfig-r006-20230616     gcc
openrisc   randconfig-r014-20230616     gcc
openrisc   randconfig-r025-20230616     gcc
parisc     allyesconfig                 gcc
parisc     defconfig                    gcc
parisc     randconfig-r002-20230616     gcc
parisc     randconfig-r023-20230616     gcc
parisc64   defconfig                    gcc
powerpc    allmodconfig                 gcc
powerpc    allnoconfig                  gcc
powerpc    randconfig-r002-20230616     clang
powerpc    randconfig-r036-20230616     clang
riscv      allmodconfig                 gcc
riscv      allnoconfig                  gcc
riscv      allyesconfig                 gcc
riscv      defconfig                    gcc
riscv      randconfig-r035-20230616     clang
riscv      randconfig-r042-20230616     gcc
riscv      rv32_defconfig               gcc
s390       allmodconfig                 gcc
s390
[powerpc:merge] BUILD SUCCESS 12ffddc6444780aec83fa5086673ec005c0bace4
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 12ffddc6444780aec83fa5086673ec005c0bace4  Automatic merge of 'fixes' into merge (2023-06-09 23:36)

elapsed time: 724m

configs tested: 110
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha      allyesconfig                 gcc
alpha      defconfig                    gcc
alpha      randconfig-r001-20230616     gcc
arc        allyesconfig                 gcc
arc        defconfig                    gcc
arc        randconfig-r022-20230616     gcc
arc        randconfig-r035-20230616     gcc
arc        randconfig-r043-20230616     gcc
arm        allmodconfig                 gcc
arm        allyesconfig                 gcc
arm        defconfig                    gcc
arm        randconfig-r006-20230616     gcc
arm        randconfig-r026-20230616     clang
arm        randconfig-r034-20230616     gcc
arm        randconfig-r046-20230616     clang
arm64      allyesconfig                 gcc
arm64      defconfig                    gcc
arm64      randconfig-r002-20230616     clang
csky       defconfig                    gcc
csky       randconfig-r021-20230616     gcc
hexagon    randconfig-r041-20230616     clang
hexagon    randconfig-r045-20230616     clang
i386       allyesconfig                 gcc
i386       buildonly-randconfig-r004-20230616  clang
i386       buildonly-randconfig-r005-20230616  clang
i386       buildonly-randconfig-r006-20230616  clang
i386       debian-10.3                  gcc
i386       defconfig                    gcc
i386       randconfig-i001-20230616     clang
i386       randconfig-i002-20230616     clang
i386       randconfig-i003-20230616     clang
i386       randconfig-i004-20230616     clang
i386       randconfig-i005-20230616     clang
i386       randconfig-i006-20230616     clang
i386       randconfig-i011-20230616     gcc
i386       randconfig-i012-20230616     gcc
i386       randconfig-i013-20230616     gcc
i386       randconfig-i014-20230616     gcc
i386       randconfig-i015-20230616     gcc
i386       randconfig-i016-20230616     gcc
i386       randconfig-r005-20230616     clang
loongarch  allmodconfig                 gcc
loongarch  allnoconfig                  gcc
loongarch  defconfig                    gcc
loongarch  randconfig-r003-20230616     gcc
loongarch  randconfig-r012-20230616     gcc
loongarch  randconfig-r014-20230616     gcc
loongarch  randconfig-r015-20230616     gcc
m68k       allmodconfig                 gcc
m68k       allyesconfig                 gcc
m68k       defconfig                    gcc
m68k       randconfig-r024-20230616     gcc
mips       allmodconfig                 gcc
mips       allyesconfig                 gcc
mips       randconfig-r006-20230616     gcc
nios2      defconfig                    gcc
openrisc   randconfig-r002-20230616     gcc
openrisc   randconfig-r031-20230616     gcc
parisc     allyesconfig                 gcc
parisc     defconfig                    gcc
parisc     randconfig-r011-20230616     gcc
parisc     randconfig-r016-20230616     gcc
parisc64   defconfig                    gcc
powerpc    allmodconfig                 gcc
powerpc    allnoconfig                  gcc
riscv      allmodconfig                 gcc
riscv      allnoconfig                  gcc
riscv      allyesconfig                 gcc
riscv      defconfig                    gcc
riscv      randconfig-r042-20230616     gcc
riscv      rv32_defconfig               gcc
s390       allmodconfig                 gcc
s390       allyesconfig                 gcc
s390       defconfig                    gcc
s390       randconfig-r013-20230616     gcc
s390       randconfig-r032-20230616     clang
s390       randconfig-r044-20230616     gcc
sh         allmodconfig                 gcc
sh         randconfig-r004-20230616     gcc
sh         randconfig-r005-20230616     gcc
sh         randconfig-r033-20230616     gcc
sparc      allyesconfig                 gcc
sparc      defconfig                    gcc
sparc64    randconfig-r003-20230616     gcc
um         defconfig                    gcc
um
Re: [PATCH v4 04/34] pgtable: Create struct ptdesc
On Thu, Jun 15, 2023 at 12:57 AM Hugh Dickins wrote:
>
> On Mon, 12 Jun 2023, Vishal Moola (Oracle) wrote:
>
> > Currently, page table information is stored within struct page. As part
> > of simplifying struct page, create struct ptdesc for page table
> > information.
> >
> > Signed-off-by: Vishal Moola (Oracle)
>
> Vishal, as I think you have already guessed, your ptdesc series and
> my pte_free_defer() "mm: free retracted page table by RCU" series are
> on a collision course.
>
> Probably just trivial collisions in most architectures, which either
> of us can easily adjust to the other; powerpc likely to be more awkward,
> but fairly easily resolved; s390 quite a problem.
>
> I've so far been unable to post a v2 of my series (and powerpc and s390
> were stupidly wrong in the v1), because a good s390 patch is not yet
> decided - Gerald Schaefer and I are currently working on that, on the
> s390 list (I took off most Ccs until we are settled and I can post v2).
>
> As you have no doubt found yourself, s390 has sophisticated handling of
> free half-pages already, and I need to add rcu_head usage in there too:
> it's tricky to squeeze it all in, and ptdesc does not appear to help us
> in any way (though mostly it's just changing some field names, okay).
>
> If ptdesc were actually allowing a flexible structure which architectures
> could add into, that would (in some future) be nice; but of course at
> present it's still fitting it all into one struct page, and mandating
> new restrictions which just make an architecture's job harder.

A goal of ptdescs is to make architectures' jobs simpler and standardized.
Unfortunately, ptdescs are nowhere near isolated from struct page yet.

This version of struct ptdesc contains the exact number of fields
architectures need right now, just reorganized to be located next to each
other. It *probably* shouldn't make an architecture's job harder, aside
from discouraging the use of yet more members of struct page.
> Some notes on problematic fields below FYI.
>
> > ---
> >  include/linux/pgtable.h | 51 +
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index c5a51481bbb9..330de96ebfd6 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> >  #endif /* CONFIG_MMU */
> >
> > +
> > +/**
> > + * struct ptdesc - Memory descriptor for page tables.
> > + * @__page_flags: Same as page flags. Unused for page tables.
> > + * @pt_list: List of used page tables. Used for s390 and x86.
> > + * @_pt_pad_1: Padding that aliases with page's compound head.
> > + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> > + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
> > + * @pt_mm: Used for x86 pgds.
> > + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only.
> > + * @ptl: Lock for the page table.
> > + *
> > + * This struct overlays struct page for now. Do not modify without a good
> > + * understanding of the issues.
> > + */
> > +struct ptdesc {
> > +	unsigned long __page_flags;
> > +
> > +	union {
> > +		struct list_head pt_list;
>
> I shall be needing struct rcu_head rcu_head (or pt_rcu_head or whatever,
> if you prefer) in this union too. Sharing the lru or pt_list with rcu_head
> is what's difficult to get right and efficient on s390 - and if ptdesc gave
> us an independent rcu_head for each page table, that would be a blessing!
> but sadly not, it still has to squeeze into a struct page.
I can add a pt_rcu_head along with a comment to deter aliasing issues :)
Independent rcu_heads aren't coming any time soon though :(

> > +		struct {
> > +			unsigned long _pt_pad_1;
> > +			pgtable_t pmd_huge_pte;
> > +		};
> > +	};
> > +	unsigned long _pt_s390_gaddr;
> > +
> > +	union {
> > +		struct mm_struct *pt_mm;
> > +		atomic_t pt_frag_refcount;
>
> Whether s390 will want pt_mm is not yet decided: I want to use it,
> Gerald prefers to go without it; but if we do end up using it,
> then pt_frag_refcount is a luxury we would have to give up.

I don't like the use of pt_mm for s390 either.

s390 uses space equivalent to all five words allocated in the page table
struct (albeit in various places of struct page). Using extra space
(especially space allocated for unrelated reasons) just because it exists
makes things more complicated and confusing, and s390 is already confusing
enough as a result of that.

If having access to pt_mm is necessary I can drop the pt_frag_refcount
patch, but I'd rather avoid it.

> s390 does very well already with its _refcount tricks, and I'd expect
> powerpc's simpler but more wasteful implementation to work as
Re: [PATCH v4 04/34] pgtable: Create struct ptdesc
On Thu, Jun 15, 2023 at 12:57:19AM -0700, Hugh Dickins wrote:
> Probably just trivial collisions in most architectures, which either
> of us can easily adjust to the other; powerpc likely to be more awkward,
> but fairly easily resolved; s390 quite a problem.
>
> I've so far been unable to post a v2 of my series (and powerpc and s390
> were stupidly wrong in the v1), because a good s390 patch is not yet
> decided - Gerald Schaefer and I are currently working on that, on the
> s390 list (I took off most Ccs until we are settled and I can post v2).
>
> As you have no doubt found yourself, s390 has sophisticated handling of
> free half-pages already, and I need to add rcu_head usage in there too:
> it's tricky to squeeze it all in, and ptdesc does not appear to help us
> in any way (though mostly it's just changing some field names, okay).
>
> If ptdesc were actually allowing a flexible structure which architectures
> could add into, that would (in some future) be nice; but of course at
> present it's still fitting it all into one struct page, and mandating
> new restrictions which just make an architecture's job harder.

The intent is to get ptdescs to be dynamically allocated at some point in
the ~2-3 years out future when we have finished the folio project ...
which is not a terribly helpful thing for me to say.

I have three suggestions, probably all dreadful:

1. s390 could change its behaviour to always allocate page tables in
   pairs. That is, it fills in two pmd_t entries any time it takes a
   fault in either of them.

2. We could allocate two or four pages at a time for s390 to allocate
   2kB pages from. That gives us a lot more space to store RCU heads.

3. We could use s390 as a guinea-pig for dynamic ptdesc allocation.
   Every time we allocate a struct page, we have a slab cache for an
   s390-special definition of struct ptdesc, we allocate a ptdesc and
   store a pointer to that in compound_head.
We could sweeten #3 by doing that not just for s390 but also for every
configuration which has ALLOC_SPLIT_PTLOCKS today. That would get rid of
the ambiguity between "is ptl a pointer or a lock".

> But I've no desire to undo powerpc's use of pt_frag_refcount:
> just warning that we may want to undo any use of it in s390.

I would dearly love ppc & s390 to use the _same_ scheme to solve the
same problem.
Re: [PATCH v9 00/14] pci: Work around ASMedia ASM2824 PCIe link training failures
On Fri, Jun 16, 2023 at 01:27:52PM +0100, Maciej W. Rozycki wrote:
> On Thu, 15 Jun 2023, Bjorn Helgaas wrote:
>
> As per my earlier remark:
>
> > I think making a system halfway-fixed would make little sense, but with
> > the actual fix actually made last as you suggested I think this can be
> > split off, because it'll make no functional change by itself.
>
> I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS
> stub into the change carrying the actual workaround and then have the
> reset path update with a follow-up change only, but I won't fight over it.
> It's only one tree revision that will be in this halfway-fixed state and
> I'll trust your judgement here.

Thanks for raising this. Here's my thought process:

  12 PCI: Provide stub failed link recovery for device probing and hot plug
  13 PCI: Add failed link recovery for device reset events
  14 PCI: Work around PCIe link training failures

Patch 12 [1] adds calls to pcie_failed_link_retrain(), which does nothing
and returns false. Functionally, it's a no-op, but the structure is
important later.

Patch 13 [2] claims to request failed link recovery after resets, but
actually doesn't do anything yet because pcie_failed_link_retrain() is
still a no-op, so this was a bit confusing.

Patch 14 [3] implements pcie_failed_link_retrain(), so the recovery
mentioned in 12 and 13 actually happens. But this patch doesn't add the
call to pcie_failed_link_retrain(), so it's a little bit hard to connect
the dots.

I agree that as I rearranged it, the workaround doesn't apply in all cases
simultaneously. Maybe not ideal, but maybe not terrible either.

Looking at it again, maybe it would have made more sense to move the
pcie_wait_for_link_delay() change to the last patch along with the
pci_dev_wait() change. I dunno.
Bjorn

[1] 12 https://lore.kernel.org/r/alpine.deb.2.21.2306111619570.64...@angie.orcam.me.uk
[2] 13 https://lore.kernel.org/r/alpine.deb.2.21.2306111631050.64...@angie.orcam.me.uk
[3] 14 https://lore.kernel.org/r/alpine.deb.2.21.2305310038540.59...@angie.orcam.me.uk
Re: [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport wrote:
>
> From: "Mike Rapoport (IBM)"
>
> Dynamic ftrace must allocate memory for code and this was impossible
> without CONFIG_MODULES.
>
> With execmem separated from the modules code, execmem_text_alloc() is
> available regardless of CONFIG_MODULES.
>
> Remove dependency of dynamic ftrace on CONFIG_MODULES and make
> CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.
>
> Signed-off-by: Mike Rapoport (IBM)

Acked-by: Song Liu

> ---
>  arch/x86/Kconfig         |  1 +
>  arch/x86/kernel/ftrace.c | 10 --
>  2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..ab64bbef9e50 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -35,6 +35,7 @@ config X86_64
>  	select SWIOTLB
>  	select ARCH_HAS_ELFCORE_COMPAT
>  	select ZONE_DMA32
> +	select EXECMEM if DYNAMIC_FTRACE
>
>  config FORCE_DYNAMIC_FTRACE
>  	def_bool y
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index f77c63bb3203..a824a5d3b129 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
>  /* Currently only x86_64 supports dynamic trampolines */
>  #ifdef CONFIG_X86_64
>
> -#ifdef CONFIG_MODULES
> -/* Module allocation simplifies allocating memory for code */
>  static inline void *alloc_tramp(unsigned long size)
>  {
>  	return execmem_text_alloc(size);
> @@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
>  {
>  	execmem_free(tramp);
>  }
> -#else
> -/* Trampolines can only be created if modules are supported */
> -static inline void *alloc_tramp(unsigned long size)
> -{
> -	return NULL;
> -}
> -static inline void tramp_free(void *tramp) { }
> -#endif
>
>  /* Defined as markers to the end of the ftrace default trampolines */
>  extern void ftrace_regs_caller_end(void);
> --
> 2.35.1
>
Re: [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport wrote:
>
> From: "Mike Rapoport (IBM)"
>
> execmem does not depend on modules, on the contrary modules use
> execmem.
>
> To make execmem available when CONFIG_MODULES=n, for instance for
> kprobes, split execmem_params initialization out from
> arch/kernel/module.c and compile it when CONFIG_EXECMEM=y
>
> Signed-off-by: Mike Rapoport (IBM)
> ---
[...]
> +
> +struct execmem_params __init *execmem_arch_params(void)
> +{
> +	u64 module_alloc_end;
> +
> +	kaslr_init();

Aha, this addresses my comment on the earlier patch. Thanks!

Acked-by: Song Liu

> +
> +	module_alloc_end = module_alloc_base + MODULES_VSIZE;
> +
> +	execmem_params.modules.text.pgprot = PAGE_KERNEL;
> +	execmem_params.modules.text.start = module_alloc_base;
> +	execmem_params.modules.text.end = module_alloc_end;
> +
> +	execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
[...]
Re: [PATCH v2 09/12] powerpc: extend execmem_params for kprobes allocations
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport wrote:
>
> From: "Mike Rapoport (IBM)"
>
> powerpc overrides kprobes::alloc_insn_page() to remove writable
> permissions when STRICT_MODULE_RWX is on.
>
> Add definition of jit area to execmem_params to allow using the generic
> kprobes::alloc_insn_page() with the desired permissions.
>
> As powerpc uses breakpoint instructions to inject kprobes, it does not
> need to constrain kprobe allocations to the modules area and can use the
> entire vmalloc address space.
>
> Signed-off-by: Mike Rapoport (IBM)

Acked-by: Song Liu

> ---
>  arch/powerpc/kernel/kprobes.c | 14 --
>  arch/powerpc/kernel/module.c  | 13 +
>  2 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 5db8df5e3657..14c5ddec3056 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
>  	return (kprobe_opcode_t *)(addr + offset);
>  }
>
> -void *alloc_insn_page(void)
> -{
> -	void *page;
> -
> -	page = jit_text_alloc(PAGE_SIZE);
> -	if (!page)
> -		return NULL;
> -
> -	if (strict_module_rwx_enabled())
> -		set_memory_rox((unsigned long)page, 1);
> -
> -	return page;
> -}
> -
>  int arch_prepare_kprobe(struct kprobe *p)
>  {
>  	int ret = 0;
> diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
> index 4c6c15bf3947..8e5b379d6da1 100644
> --- a/arch/powerpc/kernel/module.c
> +++ b/arch/powerpc/kernel/module.c
> @@ -96,6 +96,11 @@ static struct execmem_params execmem_params = {
>  			.alignment = 1,
>  		},
>  	},
> +	.jit = {
> +		.text = {
> +			.alignment = 1,
> +		},
> +	},
>  };
>
>
> @@ -131,5 +136,13 @@ struct execmem_params __init *execmem_arch_params(void)
>
>  	execmem_params.modules.text.pgprot = prot;
>
> +	execmem_params.jit.text.start = VMALLOC_START;
> +	execmem_params.jit.text.end = VMALLOC_END;
> +
> +	if (strict_module_rwx_enabled())
> +		execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
> +	else
> +		execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
> +
> 	return &execmem_params;
> }
> --
> 2.35.1
>
Re: [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport wrote:
>
> From: "Mike Rapoport (IBM)"
>
> RISC-V overrides kprobes::alloc_insn_range() to use the entire vmalloc area
> rather than limit the allocations to the modules area.
>
> Slightly reorder execmem_params initialization to support both 32 and 64
> bit variants and add definition of jit area to execmem_params to support
> generic kprobes::alloc_insn_page().
>
> Signed-off-by: Mike Rapoport (IBM)

Acked-by: Song Liu

> ---
>  arch/riscv/kernel/module.c         | 16 +++-
>  arch/riscv/kernel/probes/kprobes.c | 10 --
>  2 files changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index ee5e04cd3f21..cca6ed4e9340 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -436,7 +436,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>  	return 0;
>  }
>
> -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> +#ifdef CONFIG_MMU
>  static struct execmem_params execmem_params = {
>  	.modules = {
>  		.text = {
> @@ -444,12 +444,26 @@ static struct execmem_params execmem_params = {
>  			.alignment = 1,
>  		},
>  	},
> +	.jit = {
> +		.text = {
> +			.pgprot = PAGE_KERNEL_READ_EXEC,
> +			.alignment = 1,
> +		},
> +	},
>  };
>
>  struct execmem_params __init *execmem_arch_params(void)
>  {
> +#ifdef CONFIG_64BIT
>  	execmem_params.modules.text.start = MODULES_VADDR;
>  	execmem_params.modules.text.end = MODULES_END;
> +#else
> +	execmem_params.modules.text.start = VMALLOC_START;
> +	execmem_params.modules.text.end = VMALLOC_END;
> +#endif
> +
> +	execmem_params.jit.text.start = VMALLOC_START;
> +	execmem_params.jit.text.end = VMALLOC_END;
>
> 	return &execmem_params;
>  }
> diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
> index 2f08c14a933d..e64f2f3064eb 100644
> --- a/arch/riscv/kernel/probes/kprobes.c
> +++ b/arch/riscv/kernel/probes/kprobes.c
> @@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
>  	return 0;
>  }
>
> -#ifdef CONFIG_MMU
> -void *alloc_insn_page(void)
> -{
> -	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> -				    GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
> -				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> -				    __builtin_return_address(0));
> -}
> -#endif
> -
>  /* install breakpoint in text */
>  void __kprobes arch_arm_kprobe(struct kprobe *p)
>  {
> --
> 2.35.1
>
Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > The memory allocations for kprobes on arm64 can be placed anywhere in > vmalloc address space and currently this is implemented with an override > of alloc_insn_page() in arm64. > > Extend execmem_params with a range for generated code allocations and > make kprobes on arm64 use this extension rather than override > alloc_insn_page(). > > Signed-off-by: Mike Rapoport (IBM) > --- > arch/arm64/kernel/module.c | 9 + > arch/arm64/kernel/probes/kprobes.c | 7 --- > include/linux/execmem.h| 11 +++ > mm/execmem.c | 14 +- > 4 files changed, 33 insertions(+), 8 deletions(-) > > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c > index c3d999f3a3dd..52b09626bc0f 100644 > --- a/arch/arm64/kernel/module.c > +++ b/arch/arm64/kernel/module.c > @@ -30,6 +30,13 @@ static struct execmem_params execmem_params = { > .alignment = MODULE_ALIGN, > }, > }, > + .jit = { > + .text = { > + .start = VMALLOC_START, > + .end = VMALLOC_END, > + .alignment = 1, > + }, > + }, > }; This is growing fast. :) We have 3 now: text, data, jit. And it will be 5 when we split data into rw data, ro data, ro after init data. I wonder whether we should still do some type enum here. But we can revisit this topic later. Other than that Acked-by: Song Liu
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > Data related to code allocations, such as module data section, need to > comply with architecture constraints for its placement and its > allocation right now was done using execmem_text_alloc(). > > Create a dedicated API for allocating data related to code allocations > and allow architectures to define address ranges for data allocations. > > Since currently this is only relevant for powerpc variants that use the > VMALLOC address space for module data allocations, automatically reuse > address ranges defined for text unless address range for data is > explicitly defined by an architecture. > > With separation of code and data allocations, data sections of the > modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC which > was a default on many architectures. > > Signed-off-by: Mike Rapoport (IBM) [...] > static void free_mod_mem(struct module *mod) > diff --git a/mm/execmem.c b/mm/execmem.c > index a67acd75ffef..f7bf496ad4c3 100644 > --- a/mm/execmem.c > +++ b/mm/execmem.c > @@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size) > fallback_start, fallback_end, kasan); > } > > +void *execmem_data_alloc(size_t size) > +{ > + unsigned long start = execmem_params.modules.data.start; > + unsigned long end = execmem_params.modules.data.end; > + pgprot_t pgprot = execmem_params.modules.data.pgprot; > + unsigned int align = execmem_params.modules.data.alignment; > + unsigned long fallback_start = > execmem_params.modules.data.fallback_start; > + unsigned long fallback_end = execmem_params.modules.data.fallback_end; > + bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW; > + > + return execmem_alloc(size, start, end, align, pgprot, > +fallback_start, fallback_end, kasan); > +} > + > void execmem_free(void *ptr) > { > /* > @@ -101,6 +115,28 @@ static bool execmem_validate_params(struct > execmem_params *p) > return true; > } > > +static void 
execmem_init_missing(struct execmem_params *p) Shall we call this execmem_default_init_data? > +{ > + struct execmem_modules_range *m = &p->modules; > + if (!pgprot_val(execmem_params.modules.data.pgprot)) > + execmem_params.modules.data.pgprot = PAGE_KERNEL; Do we really need to check each of these? IOW, can we do: if (!pgprot_val(execmem_params.modules.data.pgprot)) { execmem_params.modules.data.pgprot = PAGE_KERNEL; execmem_params.modules.data.alignment = m->text.alignment; execmem_params.modules.data.start = m->text.start; execmem_params.modules.data.end = m->text.end; execmem_params.modules.data.fallback_start = m->text.fallback_start; execmem_params.modules.data.fallback_end = m->text.fallback_end; } Thanks, Song [...]
[PATCH v4 20/25] iommu/sun50i: Add an IOMMU_IDENTITY_DOMAIN
Prior to commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") the sun50i_iommu_detach_device() function was being called by ops->detach_dev(). This is an IDENTITY domain so convert sun50i_iommu_detach_device() into sun50i_iommu_identity_attach() and a full IDENTITY domain and thus hook it back up the same way as the old ops->detach_dev(). Signed-off-by: Jason Gunthorpe --- drivers/iommu/sun50i-iommu.c | 26 +++--- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/sun50i-iommu.c b/drivers/iommu/sun50i-iommu.c index 74c5cb93e90027..0bf08b120cf105 100644 --- a/drivers/iommu/sun50i-iommu.c +++ b/drivers/iommu/sun50i-iommu.c @@ -757,21 +757,32 @@ static void sun50i_iommu_detach_domain(struct sun50i_iommu *iommu, iommu->domain = NULL; } -static void sun50i_iommu_detach_device(struct iommu_domain *domain, - struct device *dev) +static int sun50i_iommu_identity_attach(struct iommu_domain *identity_domain, + struct device *dev) { - struct sun50i_iommu_domain *sun50i_domain = to_sun50i_domain(domain); struct sun50i_iommu *iommu = dev_iommu_priv_get(dev); + struct sun50i_iommu_domain *sun50i_domain; dev_dbg(dev, "Detaching from IOMMU domain\n"); - if (iommu->domain != domain) - return; + if (iommu->domain == identity_domain) + return 0; + sun50i_domain = to_sun50i_domain(iommu->domain); if (refcount_dec_and_test(&sun50i_domain->refcnt)) sun50i_iommu_detach_domain(iommu, sun50i_domain); + return 0; } +static struct iommu_domain_ops sun50i_iommu_identity_ops = { + .attach_dev = sun50i_iommu_identity_attach, +}; + +static struct iommu_domain sun50i_iommu_identity_domain = { + .type = IOMMU_DOMAIN_IDENTITY, + .ops = &sun50i_iommu_identity_ops, +}; + static int sun50i_iommu_attach_device(struct iommu_domain *domain, struct device *dev) { @@ -789,8 +800,7 @@ static int sun50i_iommu_attach_device(struct iommu_domain *domain, if (iommu->domain == domain) return 0; - if (iommu->domain) - sun50i_iommu_detach_device(iommu->domain, dev); +
sun50i_iommu_identity_attach(&sun50i_iommu_identity_domain, dev); sun50i_iommu_attach_domain(iommu, sun50i_domain); @@ -827,6 +837,7 @@ static int sun50i_iommu_of_xlate(struct device *dev, } static const struct iommu_ops sun50i_iommu_ops = { + .identity_domain = &sun50i_iommu_identity_domain, .pgsize_bitmap = SZ_4K, .device_group = sun50i_iommu_device_group, .domain_alloc = sun50i_iommu_domain_alloc, @@ -985,6 +996,7 @@ static int sun50i_iommu_probe(struct platform_device *pdev) if (!iommu) return -ENOMEM; spin_lock_init(&iommu->iommu_lock); + iommu->domain = &sun50i_iommu_identity_domain; platform_set_drvdata(pdev, iommu); iommu->dev = &pdev->dev; -- 2.40.1
[PATCH v4 15/25] iommufd/selftest: Make the mock iommu driver into a real driver
I've avoided doing this because there is no way to make this happen without an intrusion into the core code. Up till now this has avoided needing the core code's probe path with some hackery - but now that default domains are becoming mandatory it is unavoidable. The core probe path must be run to set the default_domain, only it can do it. Without a default domain iommufd can't use the group. Make it so that iommufd selftest can create a real iommu driver and bind it only to is own private bus. Add iommu_device_register_bus() as a core code helper to make this possible. It simply sets the right pointers and registers the notifier block. The mock driver then works like any normal driver should, with probe triggered by the bus ops When the bus->iommu_ops stuff is fully unwound we can probably do better here and remove this special case. Remove set_platform_dma_ops from selftest and make it use a BLOCKED default domain. Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommu-priv.h | 16 +++ drivers/iommu/iommu.c | 43 drivers/iommu/iommufd/iommufd_private.h | 5 +- drivers/iommu/iommufd/main.c| 8 +- drivers/iommu/iommufd/selftest.c| 141 +--- 5 files changed, 144 insertions(+), 69 deletions(-) create mode 100644 drivers/iommu/iommu-priv.h diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h new file mode 100644 index 00..1cbc04b9cf7297 --- /dev/null +++ b/drivers/iommu/iommu-priv.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. 
+ */ +#ifndef __IOMMU_PRIV_H +#define __IOMMU_PRIV_H + +#include + +int iommu_device_register_bus(struct iommu_device *iommu, + const struct iommu_ops *ops, struct bus_type *bus, + struct notifier_block *nb); +void iommu_device_unregister_bus(struct iommu_device *iommu, +struct bus_type *bus, +struct notifier_block *nb); + +#endif diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 7ca70e2a3f51e9..a3a4d004767b4d 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -36,6 +36,7 @@ #include "dma-iommu.h" #include "iommu-sva.h" +#include "iommu-priv.h" static struct kset *iommu_group_kset; static DEFINE_IDA(iommu_group_ida); @@ -287,6 +288,48 @@ void iommu_device_unregister(struct iommu_device *iommu) } EXPORT_SYMBOL_GPL(iommu_device_unregister); +#if IS_ENABLED(CONFIG_IOMMUFD_TEST) +void iommu_device_unregister_bus(struct iommu_device *iommu, +struct bus_type *bus, +struct notifier_block *nb) +{ + bus_unregister_notifier(bus, nb); + iommu_device_unregister(iommu); +} +EXPORT_SYMBOL_GPL(iommu_device_unregister_bus); + +/* + * Register an iommu driver against a single bus. This is only used by iommufd + * selftest to create a mock iommu driver. The caller must provide + * some memory to hold a notifier_block. 
+ */ +int iommu_device_register_bus(struct iommu_device *iommu, + const struct iommu_ops *ops, struct bus_type *bus, + struct notifier_block *nb) +{ + int err; + + iommu->ops = ops; + nb->notifier_call = iommu_bus_notifier; + err = bus_register_notifier(bus, nb); + if (err) + return err; + + spin_lock(_device_lock); + list_add_tail(>list, _device_list); + spin_unlock(_device_lock); + + bus->iommu_ops = ops; + err = bus_iommu_probe(bus); + if (err) { + iommu_device_unregister_bus(iommu, bus, nb); + return err; + } + return 0; +} +EXPORT_SYMBOL_GPL(iommu_device_register_bus); +#endif + static struct dev_iommu *dev_iommu_get(struct device *dev) { struct dev_iommu *param = dev->iommu; diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index b38e67d1988bdb..368f66c63a239a 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -303,7 +303,7 @@ extern size_t iommufd_test_memory_limit; void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd, unsigned int ioas_id, u64 *iova, u32 *flags); bool iommufd_should_fail(void); -void __init iommufd_test_init(void); +int __init iommufd_test_init(void); void iommufd_test_exit(void); bool iommufd_selftest_is_mock_dev(struct device *dev); #else @@ -316,8 +316,9 @@ static inline bool iommufd_should_fail(void) { return false; } -static inline void __init iommufd_test_init(void) +static inline int __init iommufd_test_init(void) { + return 0; } static inline void iommufd_test_exit(void) { diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 3fbe636c3d8a69..042d45cc0b1c0d 100644 ---
[PATCH v4 10/25] iommu/exynos: Implement an IDENTITY domain
What exynos calls exynos_iommu_detach_device is actually putting the iommu into identity mode. Move to the new core support for ARM_DMA_USE_IOMMU by defining ops->identity_domain. Tested-by: Marek Szyprowski Acked-by: Marek Szyprowski Signed-off-by: Jason Gunthorpe --- drivers/iommu/exynos-iommu.c | 66 +--- 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c index c275fe71c4db32..5e12b85dfe8705 100644 --- a/drivers/iommu/exynos-iommu.c +++ b/drivers/iommu/exynos-iommu.c @@ -24,6 +24,7 @@ typedef u32 sysmmu_iova_t; typedef u32 sysmmu_pte_t; +static struct iommu_domain exynos_identity_domain; /* We do not consider super section mapping (16MB) */ #define SECT_ORDER 20 @@ -829,7 +830,7 @@ static int __maybe_unused exynos_sysmmu_suspend(struct device *dev) struct exynos_iommu_owner *owner = dev_iommu_priv_get(master); mutex_lock(>rpm_lock); - if (data->domain) { + if (>domain->domain != _identity_domain) { dev_dbg(data->sysmmu, "saving state\n"); __sysmmu_disable(data); } @@ -847,7 +848,7 @@ static int __maybe_unused exynos_sysmmu_resume(struct device *dev) struct exynos_iommu_owner *owner = dev_iommu_priv_get(master); mutex_lock(>rpm_lock); - if (data->domain) { + if (>domain->domain != _identity_domain) { dev_dbg(data->sysmmu, "restoring state\n"); __sysmmu_enable(data); } @@ -980,17 +981,20 @@ static void exynos_iommu_domain_free(struct iommu_domain *iommu_domain) kfree(domain); } -static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain, - struct device *dev) +static int exynos_iommu_identity_attach(struct iommu_domain *identity_domain, + struct device *dev) { - struct exynos_iommu_domain *domain = to_exynos_domain(iommu_domain); struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev); - phys_addr_t pagetable = virt_to_phys(domain->pgtable); + struct exynos_iommu_domain *domain; + phys_addr_t pagetable; struct sysmmu_drvdata *data, *next; unsigned long flags; - if 
(!has_sysmmu(dev) || owner->domain != iommu_domain) - return; + if (owner->domain == identity_domain) + return 0; + + domain = to_exynos_domain(owner->domain); + pagetable = virt_to_phys(domain->pgtable); mutex_lock(>rpm_lock); @@ -1009,15 +1013,25 @@ static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain, list_del_init(>domain_node); spin_unlock(>lock); } - owner->domain = NULL; + owner->domain = identity_domain; spin_unlock_irqrestore(>lock, flags); mutex_unlock(>rpm_lock); - dev_dbg(dev, "%s: Detached IOMMU with pgtable %pa\n", __func__, - ); + dev_dbg(dev, "%s: Restored IOMMU to IDENTITY from pgtable %pa\n", + __func__, ); + return 0; } +static struct iommu_domain_ops exynos_identity_ops = { + .attach_dev = exynos_iommu_identity_attach, +}; + +static struct iommu_domain exynos_identity_domain = { + .type = IOMMU_DOMAIN_IDENTITY, + .ops = _identity_ops, +}; + static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain, struct device *dev) { @@ -1026,12 +1040,11 @@ static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain, struct sysmmu_drvdata *data; phys_addr_t pagetable = virt_to_phys(domain->pgtable); unsigned long flags; + int err; - if (!has_sysmmu(dev)) - return -ENODEV; - - if (owner->domain) - exynos_iommu_detach_device(owner->domain, dev); + err = exynos_iommu_identity_attach(_identity_domain, dev); + if (err) + return err; mutex_lock(>rpm_lock); @@ -1407,26 +1420,12 @@ static struct iommu_device *exynos_iommu_probe_device(struct device *dev) return >iommu; } -static void exynos_iommu_set_platform_dma(struct device *dev) -{ - struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev); - - if (owner->domain) { - struct iommu_group *group = iommu_group_get(dev); - - if (group) { - exynos_iommu_detach_device(owner->domain, dev); - iommu_group_put(group); - } - } -} - static void exynos_iommu_release_device(struct device *dev) { struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev); struct sysmmu_drvdata 
*data; - exynos_iommu_set_platform_dma(dev); +
[PATCH v4 17/25] iommu/qcom_iommu: Add an IOMMU_IDENTITY_DOMAIN
This brings back the ops->detach_dev() code that commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it into an IDENTITY domain. Signed-off-by: Jason Gunthorpe --- drivers/iommu/arm/arm-smmu/qcom_iommu.c | 39 + 1 file changed, 39 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c index a503ed758ec302..9d7b9d8b4386d4 100644 --- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c +++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c @@ -387,6 +387,44 @@ static int qcom_iommu_attach_dev(struct iommu_domain *domain, struct device *dev return 0; } +static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain, + struct device *dev) +{ + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct qcom_iommu_domain *qcom_domain; + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); + struct qcom_iommu_dev *qcom_iommu = to_iommu(dev); + unsigned int i; + + if (domain == identity_domain || !domain) + return 0; + + qcom_domain = to_qcom_iommu_domain(domain); + if (WARN_ON(!qcom_domain->iommu)) + return -EINVAL; + + pm_runtime_get_sync(qcom_iommu->dev); + for (i = 0; i < fwspec->num_ids; i++) { + struct qcom_iommu_ctx *ctx = to_ctx(qcom_domain, fwspec->ids[i]); + + /* Disable the context bank: */ + iommu_writel(ctx, ARM_SMMU_CB_SCTLR, 0); + + ctx->domain = NULL; + } + pm_runtime_put_sync(qcom_iommu->dev); + return 0; +} + +static struct iommu_domain_ops qcom_iommu_identity_ops = { + .attach_dev = qcom_iommu_identity_attach, +}; + +static struct iommu_domain qcom_iommu_identity_domain = { + .type = IOMMU_DOMAIN_IDENTITY, + .ops = _iommu_identity_ops, +}; + static int qcom_iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, size_t pgsize, size_t pgcount, int prot, gfp_t gfp, size_t *mapped) @@ -553,6 +591,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct of_phandle_args *args) } static const struct iommu_ops qcom_iommu_ops = { + .identity_domain = 
&qcom_iommu_identity_domain, .capable = qcom_iommu_capable, .domain_alloc = qcom_iommu_domain_alloc, .probe_device = qcom_iommu_probe_device, -- 2.40.1
[PATCH v4 01/25] iommu: Add iommu_ops->identity_domain
This allows a driver to set a global static to an IDENTITY domain and the core code will automatically use it whenever an IDENTITY domain is requested. By making it always available it means the IDENTITY can be used in error handling paths to force the iommu driver into a known state. Devices implementing global static identity domains should avoid failing their attach_dev ops. Convert rockchip to use the new mechanism. Tested-by: Steven Price Tested-by: Marek Szyprowski Tested-by: Nicolin Chen Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommu.c | 3 +++ drivers/iommu/rockchip-iommu.c | 9 + include/linux/iommu.h | 3 +++ 3 files changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 9e0228ef612b85..bb840a818525ad 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1917,6 +1917,9 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus, if (bus == NULL || bus->iommu_ops == NULL) return NULL; + if (alloc_type == IOMMU_DOMAIN_IDENTITY && bus->iommu_ops->identity_domain) + return bus->iommu_ops->identity_domain; + domain = bus->iommu_ops->domain_alloc(alloc_type); if (!domain) return NULL; diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index 4054030c323795..4fbede269e6712 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -1017,13 +1017,8 @@ static int rk_iommu_identity_attach(struct iommu_domain *identity_domain, return 0; } -static void rk_iommu_identity_free(struct iommu_domain *domain) -{ -} - static struct iommu_domain_ops rk_identity_ops = { .attach_dev = rk_iommu_identity_attach, - .free = rk_iommu_identity_free, }; static struct iommu_domain rk_identity_domain = { @@ -1087,9 +1082,6 @@ static struct iommu_domain *rk_iommu_domain_alloc(unsigned type) { struct rk_iommu_domain *rk_domain; - if (type == IOMMU_DOMAIN_IDENTITY) - return _identity_domain; - if (type != IOMMU_DOMAIN_UNMANAGED && type != 
IOMMU_DOMAIN_DMA) return NULL; @@ -1214,6 +1206,7 @@ static int rk_iommu_of_xlate(struct device *dev, } static const struct iommu_ops rk_iommu_ops = { + .identity_domain = &rk_identity_domain, .domain_alloc = rk_iommu_domain_alloc, .probe_device = rk_iommu_probe_device, .release_device = rk_iommu_release_device, diff --git a/include/linux/iommu.h b/include/linux/iommu.h index d3164259667599..c3004eac2f88e8 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -254,6 +254,8 @@ struct iommu_iotlb_gather { * will be blocked by the hardware. * @pgsize_bitmap: bitmap of all possible supported page sizes * @owner: Driver module providing these ops + * @identity_domain: An always available, always attachable identity + * translation. */ struct iommu_ops { bool (*capable)(struct device *dev, enum iommu_cap); @@ -287,6 +289,7 @@ struct iommu_ops { const struct iommu_domain_ops *default_domain_ops; unsigned long pgsize_bitmap; struct module *owner; + struct iommu_domain *identity_domain; }; /** -- 2.40.1
[PATCH v4 00/25] iommu: Make default_domains mandatory
[ It would be good to get this in linux-next, we have some good test coverage on the ARM side already, thanks! ] It has been a long time coming, this series completes the default_domain transition and makes it so that the core IOMMU code will always have a non-NULL default_domain for every driver on every platform. set_platform_dma_ops() turned out to be a bad idea, and so completely remove it. This is achieved by changing each driver to either: 1 - Convert the existing (or deleted) ops->detach_dev() into an op->attach_dev() of an IDENTITY domain. This is based on the theory that the ARM32 HW is able to function when the iommu is turned off as so the turned off state is an IDENTITY translation. 2 - Use a new PLATFORM domain type. This is a hack to accommodate drivers that we don't really know WTF they do. S390 is legitimately using this to switch to it's platform dma_ops implementation, which is where the name comes from. 3 - Do #1 and force the default domain to be IDENTITY, this corrects the tegra-smmu case where even an ARM64 system would have a NULL default_domain. Using this we can apply the rules: a) ARM_DMA_USE_IOMMU mode always uses either the driver's ops->default_domain, ops->def_domain_type(), or an IDENTITY domain. All ARM32 drivers provide one of these three options. b) dma-iommu.c mode uses either the driver's ops->default_domain, ops->def_domain_type or the usual DMA API policy logic based on the command line/etc to pick IDENTITY/DMA domain types c) All other arch's (PPC/S390) use ops->default_domain always. See the patch "Require a default_domain for all iommu drivers" for a per-driver breakdown. The conversion broadly teaches a bunch of ARM32 drivers that they can do IDENTITY domains. There is some educated guessing involved that these are actual IDENTITY domains. If this turns out to be wrong the driver can be trivially changed to use a BLOCKING domain type instead. 
Further, the domain type only matters for drivers using ARM64's dma-iommu.c mode as it will select IDENTITY based on the command line and expect IDENTITY to work. For ARM32 and other arch cases it is purely documentation. Finally, based on all the analysis in this series, we can purge IOMMU_DOMAIN_UNMANAGED/DMA constants from most of the drivers. This greatly simplifies understanding the driver contract to the core code. IOMMU drivers should not be involved in policy for how the DMA API works, that should be a core core decision. The main gain from this work is to remove alot of ARM_DMA_USE_IOMMU specific code and behaviors from drivers. All that remains in iommu drivers after this series is the calls to arm_iommu_create_mapping(). This is a step toward removing ARM_DMA_USE_IOMMU. The IDENTITY domains added to the ARM64 supporting drivers can be tested by booting in ARM64 mode and enabling CONFIG_IOMMU_DEFAULT_PASSTHROUGH. If the system still boots then most likely the implementation is an IDENTITY domain. If not we can trivially change it to BLOCKING or at worst PLATFORM if there is no detail what is going on in the HW. I think this is pretty safe for the ARM32 drivers as they don't really change, the code that was in detach_dev continues to be called in the same places it was called before. 
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_all_defdom v4: - Fix rebasing typo missing ops->alloc_domain_paging check - Rebase on latest Joerg tree v3: https://lore.kernel.org/r/0-v3-89830a6c7841+43d-iommu_all_defdom_...@nvidia.com - FSL is back to a PLATFORM domain, with some fixing so it attach only does something when leaving an UNMANAGED domain like it always was - Rebase on Joerg's tree, adjust for "alloc_type" change - Change the ARM32 untrusted check to a WARN_ON since no ARM32 system can currently set trusted v2: https://lore.kernel.org/r/0-v2-8d1dc464eac9+10f-iommu_all_defdom_...@nvidia.com - FSL is an IDENTITY domain - Delete terga-gart instead of trying to carry it - Use the policy determination from iommu_get_default_domain_type() to drive the arm_iommu mode - Reorganize and introduce new patches to do the above: * Split the ops->identity_domain to an independent earlier patch * Remove the UNMANAGED return from def_domain_type in mtk_v1 earlier so the new iommu_get_default_domain_type() can work * Make the driver's def_domain_type have higher policy priority than untrusted * Merge the set_platfom_dma_ops hunk from mtk_v1 along with rockchip into the patch that forced IDENTITY on ARM32 - Revise sun50i to be cleaner and have a non-NULL internal domain - Reword logging in exynos - Remove the gdev from the group alloc path, instead add a new function __iommu_group_domain_alloc() that takes in the group and uses the first device. Split this to its own patch - New patch to make iommufd's mock selftest into a real driver - New patch to fix power's partial iommu driver v1:
[PATCH v4 16/25] iommu: Remove ops->set_platform_dma_ops()
All drivers are now using IDENTITY or PLATFORM domains for what this did, we can remove it now. It is no longer possible to attach to a NULL domain. Tested-by: Heiko Stuebner Tested-by: Niklas Schnelle Tested-by: Steven Price Tested-by: Marek Szyprowski Tested-by: Nicolin Chen Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommu.c | 30 +- include/linux/iommu.h | 4 2 files changed, 5 insertions(+), 29 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index a3a4d004767b4d..e60640f6ccb625 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2250,21 +2250,8 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group, if (group->domain == new_domain) return 0; - /* -* New drivers should support default domains, so set_platform_dma() -* op will never be called. Otherwise the NULL domain represents some -* platform specific behavior. -*/ - if (!new_domain) { - for_each_group_device(group, gdev) { - const struct iommu_ops *ops = dev_iommu_ops(gdev->dev); - - if (!WARN_ON(!ops->set_platform_dma_ops)) - ops->set_platform_dma_ops(gdev->dev); - } - group->domain = NULL; - return 0; - } + if (WARN_ON(!new_domain)) + return -EINVAL; /* * Changing the domain is done by calling attach_dev() on the new @@ -2300,19 +2287,15 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group, */ last_gdev = gdev; for_each_group_device(group, gdev) { - const struct iommu_ops *ops = dev_iommu_ops(gdev->dev); - /* -* If set_platform_dma_ops is not present a NULL domain can -* happen only for first probe, in which case we leave -* group->domain as NULL and let release clean everything up. +* A NULL domain can happen only for first probe, in which case +* we leave group->domain as NULL and let release clean +* everything up. 
*/ if (group->domain) WARN_ON(__iommu_device_set_domain( group, gdev->dev, group->domain, IOMMU_SET_DOMAIN_MUST_SUCCEED)); - else if (ops->set_platform_dma_ops) - ops->set_platform_dma_ops(gdev->dev); if (gdev == last_gdev) break; } @@ -2926,9 +2909,6 @@ static int iommu_setup_default_domain(struct iommu_group *group, /* * There are still some drivers which don't support default domains, so * we ignore the failure and leave group->default_domain NULL. -* -* We assume that the iommu driver starts up the device in -* 'set_platform_dma_ops' mode if it does not support default domains. */ dom = iommu_group_alloc_default_domain(group, req_type); if (!dom) { diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ef0af09326..49331573f1d1f5 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -237,9 +237,6 @@ struct iommu_iotlb_gather { * @release_device: Remove device from iommu driver handling * @probe_finalize: Do final setup work after the device is added to an IOMMU * group and attached to the groups domain - * @set_platform_dma_ops: Returning control back to the platform DMA ops. This op - *is to support old IOMMU drivers, new drivers should use - *default domains, and the common IOMMU DMA ops. * @device_group: find iommu group for a particular device * @get_resv_regions: Request list of reserved regions for a device * @of_xlate: add OF master IDs to iommu grouping @@ -271,7 +268,6 @@ struct iommu_ops { struct iommu_device *(*probe_device)(struct device *dev); void (*release_device)(struct device *dev); void (*probe_finalize)(struct device *dev); - void (*set_platform_dma_ops)(struct device *dev); struct iommu_group *(*device_group)(struct device *dev); /* Request/Free a list of reserved regions for a device */ -- 2.40.1
[PATCH v4 09/25] iommu: Allow an IDENTITY domain as the default_domain in ARM32
Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the same stuff, the way they relate to the IOMMU core is quiet different. dma-iommu.c expects the core code to setup an UNMANAGED domain (of type IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This becomes the default_domain for the group. ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly allocates an UNMANAGED domain and operates it just like an external driver. In this case group->default_domain is NULL. If the driver provides a global static identity_domain then automatically use it as the default_domain when in ARM_DMA_USE_IOMMU mode. This allows drivers that implemented default_domain == NULL as an IDENTITY translation to trivially get a properly labeled non-NULL default_domain on ARM32 configs. With this arrangment when ARM_DMA_USE_IOMMU wants to disconnect from the device the normal detach_domain flow will restore the IDENTITY domain as the default domain. Overall this makes attach_dev() of the IDENTITY domain called in the same places as detach_dev(). This effectively migrates these drivers to default_domain mode. For drivers that support ARM64 they will gain support for the IDENTITY translation mode for the dma_api and behave in a uniform way. Drivers use this by setting ops->identity_domain to a static singleton iommu_domain that implements the identity attach. If the core detects ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain during probe. Drivers can continue to prevent the use of DMA translation by returning IOMMU_DOMAIN_IDENTITY from def_domain_type, this will completely prevent IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU. This allows removing the set_platform_dma_ops() from every remaining driver. Remove the set_platform_dma_ops from rockchip and mkt_v1 as all it does is set an existing global static identity domain. 
mkt_v1 does not support IOMMU_DOMAIN_DMA and it does not compile on ARM64 so this transformation is safe. Tested-by: Steven Price Tested-by: Marek Szyprowski Tested-by: Nicolin Chen Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommu.c | 26 +++--- drivers/iommu/mtk_iommu_v1.c | 12 drivers/iommu/rockchip-iommu.c | 10 -- 3 files changed, 23 insertions(+), 25 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 0c4fc46c210366..7ca70e2a3f51e9 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1757,15 +1757,35 @@ static int iommu_get_default_domain_type(struct iommu_group *group, int type; lockdep_assert_held(>mutex); + + /* +* ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an +* identity_domain and it will automatically become their default +* domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain. +* Override the selection to IDENTITY if we are sure the driver supports +* it. +*/ + if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) { + type = IOMMU_DOMAIN_IDENTITY; + if (best_type && type && best_type != type) + goto err; + best_type = target_type = IOMMU_DOMAIN_IDENTITY; + } + for_each_group_device(group, gdev) { type = best_type; if (ops->def_domain_type) { type = ops->def_domain_type(gdev->dev); - if (best_type && type && best_type != type) + if (best_type && type && best_type != type) { + /* Stick with the last driver override we saw */ + best_type = type; goto err; + } } - if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) { + /* No ARM32 using systems will set untrusted, it cannot work. 
+		 */
+		if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted &&
+		    !WARN_ON(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU))) {
 			type = IOMMU_DOMAIN_DMA;
 			if (best_type && type && best_type != type)
 				goto err;
@@ -1790,7 +1810,7 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
 		"Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
 		iommu_domain_type_str(type), dev_name(last_dev),
 		iommu_domain_type_str(best_type));
-	return 0;
+	return best_type;
 }

 static void iommu_group_do_probe_finalize(struct device *dev)
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index cc3e7d53d33ad9..7c0c1d50df5f75 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -337,11 +337,6 @@ static struct
[PATCH v4 14/25] iommu/msm: Implement an IDENTITY domain
What msm does during msm_iommu_set_platform_dma() is actually putting the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA, however it cannot be compiled on ARM64 either. Most likely it is fine to support dma-iommu.c.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/msm_iommu.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 79d89bad5132b7..26ed81cfeee897 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -443,15 +443,20 @@ static int msm_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	return ret;
 }

-static void msm_iommu_set_platform_dma(struct device *dev)
+static int msm_iommu_identity_attach(struct iommu_domain *identity_domain,
+				     struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-	struct msm_priv *priv = to_msm_priv(domain);
+	struct msm_priv *priv;
 	unsigned long flags;
 	struct msm_iommu_dev *iommu;
 	struct msm_iommu_ctx_dev *master;
-	int ret;
+	int ret = 0;

+	if (domain == identity_domain || !domain)
+		return 0;
+
+	priv = to_msm_priv(domain);
 	free_io_pgtable_ops(priv->iop);

 	spin_lock_irqsave(&msm_iommu_lock, flags);
@@ -468,8 +473,18 @@ static void msm_iommu_set_platform_dma(struct device *dev)
 	}
 fail:
 	spin_unlock_irqrestore(&msm_iommu_lock, flags);
+	return ret;
 }

+static struct iommu_domain_ops msm_iommu_identity_ops = {
+	.attach_dev = msm_iommu_identity_attach,
+};
+
+static struct iommu_domain msm_iommu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &msm_iommu_identity_ops,
+};
+
 static int msm_iommu_map(struct iommu_domain *domain, unsigned long iova,
 			 phys_addr_t pa, size_t pgsize, size_t pgcount,
 			 int prot, gfp_t gfp, size_t *mapped)
@@ -675,10 +690,10 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 }

 static struct iommu_ops msm_iommu_ops = {
+	.identity_domain =
&msm_iommu_identity_domain,
 	.domain_alloc = msm_iommu_domain_alloc,
 	.probe_device = msm_iommu_probe_device,
 	.device_group = generic_device_group,
-	.set_platform_dma_ops = msm_iommu_set_platform_dma,
 	.pgsize_bitmap = MSM_IOMMU_PGSIZES,
 	.of_xlate = qcom_iommu_of_xlate,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
--
2.40.1
[PATCH v4 06/25] iommu/tegra-gart: Remove tegra-gart
Thierry says this is not used anymore, and doesn't think it makes sense as an iommu driver. The HW it supports is about 10 years old now and newer HW uses different IOMMU drivers.

As this is the only driver with a GART approach, and it doesn't really meet the driver expectations from the IOMMU core, let's just remove it so we don't have to think about how to make it fit in.

It has a number of identified problems:
 - The assignment of iommu_groups doesn't match the HW behavior
 - It claims to have an UNMANAGED domain but it is really an IDENTITY
   domain with a translation aperture. This is inconsistent with the core
   expectation for security sensitive operations
 - It doesn't implement a SW page table under struct iommu_domain so
    * It can't accept a map until the domain is attached
    * It forgets about all maps after the domain is detached
    * It doesn't clear the HW of maps once the domain is detached
      (made worse by having the wrong groups)

Cc: Thierry Reding
Cc: Dmitry Osipenko
Acked-by: Thierry Reding
Signed-off-by: Jason Gunthorpe
---
 arch/arm/configs/multi_v7_defconfig |   1 -
 arch/arm/configs/tegra_defconfig    |   1 -
 drivers/iommu/Kconfig               |  11 -
 drivers/iommu/Makefile              |   1 -
 drivers/iommu/tegra-gart.c          | 371
 drivers/memory/tegra/mc.c           |  34 ---
 drivers/memory/tegra/tegra20.c      |  28 ---
 include/soc/tegra/mc.h              |  26 --
 8 files changed, 473 deletions(-)
 delete mode 100644 drivers/iommu/tegra-gart.c

diff --git a/arch/arm/configs/multi_v7_defconfig b/arch/arm/configs/multi_v7_defconfig
index 871fffe92187bf..daba1afdbd1100 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -1063,7 +1063,6 @@ CONFIG_BCM2835_MBOX=y
 CONFIG_QCOM_APCS_IPC=y
 CONFIG_QCOM_IPCC=y
 CONFIG_ROCKCHIP_IOMMU=y
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_EXYNOS_IOMMU=y
 CONFIG_QCOM_IOMMU=y
diff --git a/arch/arm/configs/tegra_defconfig b/arch/arm/configs/tegra_defconfig
index f32047e24b633e..ad31b9322911ce 100644
--- a/arch/arm/configs/tegra_defconfig
+++
b/arch/arm/configs/tegra_defconfig @@ -293,7 +293,6 @@ CONFIG_CHROME_PLATFORMS=y CONFIG_CROS_EC=y CONFIG_CROS_EC_I2C=m CONFIG_CROS_EC_SPI=m -CONFIG_TEGRA_IOMMU_GART=y CONFIG_TEGRA_IOMMU_SMMU=y CONFIG_ARCH_TEGRA_2x_SOC=y CONFIG_ARCH_TEGRA_3x_SOC=y diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 4d800601e8ecd6..3309f297bbd822 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -235,17 +235,6 @@ config SUN50I_IOMMU help Support for the IOMMU introduced in the Allwinner H6 SoCs. -config TEGRA_IOMMU_GART - bool "Tegra GART IOMMU Support" - depends on ARCH_TEGRA_2x_SOC - depends on TEGRA_MC - select IOMMU_API - help - Enables support for remapping discontiguous physical memory - shared with the operating system into contiguous I/O virtual - space through the GART (Graphics Address Relocation Table) - hardware included on Tegra SoCs. - config TEGRA_IOMMU_SMMU bool "NVIDIA Tegra SMMU Support" depends on ARCH_TEGRA diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 769e43d780ce89..95ad9dbfbda022 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -20,7 +20,6 @@ obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o obj-$(CONFIG_OMAP_IOMMU_DEBUG) += omap-iommu-debug.o obj-$(CONFIG_ROCKCHIP_IOMMU) += rockchip-iommu.o obj-$(CONFIG_SUN50I_IOMMU) += sun50i-iommu.o -obj-$(CONFIG_TEGRA_IOMMU_GART) += tegra-gart.o obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c deleted file mode 100644 index a482ff838b5331..00 --- a/drivers/iommu/tegra-gart.c +++ /dev/null @@ -1,371 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * IOMMU API for Graphics Address Relocation Table on Tegra20 - * - * Copyright (c) 2010-2012, NVIDIA CORPORATION. All rights reserved. 
- * - * Author: Hiroshi DOYU - */ - -#define dev_fmt(fmt) "gart: " fmt - -#include -#include -#include -#include -#include -#include -#include - -#include - -#define GART_REG_BASE 0x24 -#define GART_CONFIG(0x24 - GART_REG_BASE) -#define GART_ENTRY_ADDR(0x28 - GART_REG_BASE) -#define GART_ENTRY_DATA(0x2c - GART_REG_BASE) - -#define GART_ENTRY_PHYS_ADDR_VALID BIT(31) - -#define GART_PAGE_SHIFT12 -#define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT) -#define GART_PAGE_MASK GENMASK(30, GART_PAGE_SHIFT) - -/* bitmap of the page sizes currently supported */ -#define GART_IOMMU_PGSIZES (GART_PAGE_SIZE) - -struct gart_device { - void __iomem*regs; - u32 *savedata; - unsigned long
[PATCH v4 19/25] iommu/mtk_iommu: Add an IOMMU_IDENTITY_DOMAIN
This brings back the ops->detach_dev() code that commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it into an IDENTITY domain.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/mtk_iommu.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e93906d6e112e8..fdb7f5162b1d64 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -753,6 +753,28 @@ static int mtk_iommu_attach_device(struct iommu_domain *domain,
 	return ret;
 }

+static int mtk_iommu_identity_attach(struct iommu_domain *identity_domain,
+				     struct device *dev)
+{
+	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+	struct mtk_iommu_data *data = dev_iommu_priv_get(dev);
+
+	if (domain == identity_domain || !domain)
+		return 0;
+
+	mtk_iommu_config(data, dev, false, 0);
+	return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_identity_ops = {
+	.attach_dev = mtk_iommu_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &mtk_iommu_identity_ops,
+};
+
 static int mtk_iommu_map(struct iommu_domain *domain, unsigned long iova,
 			 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 			 int prot, gfp_t gfp, size_t *mapped)
@@ -972,6 +994,7 @@ static void mtk_iommu_get_resv_regions(struct device *dev,
 }

 static const struct iommu_ops mtk_iommu_ops = {
+	.identity_domain = &mtk_iommu_identity_domain,
 	.domain_alloc = mtk_iommu_domain_alloc,
 	.probe_device = mtk_iommu_probe_device,
 	.release_device = mtk_iommu_release_device,
--
2.40.1
[PATCH v4 18/25] iommu/ipmmu: Add an IOMMU_IDENTITY_DOMAIN
This brings back the ops->detach_dev() code that commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it into an IDENTITY domain. Also reverts commit 584d334b1393 ("iommu/ipmmu-vmsa: Remove ipmmu_utlb_disable()").

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/ipmmu-vmsa.c | 43 ++
 1 file changed, 43 insertions(+)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9f64c5c9f5b90a..de958e411a92e0 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -298,6 +298,18 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain,
 	mmu->utlb_ctx[utlb] = domain->context_id;
 }

+/*
+ * Disable MMU translation for the microTLB.
+ */
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
+			       unsigned int utlb)
+{
+	struct ipmmu_vmsa_device *mmu = domain->mmu;
+
+	ipmmu_imuctr_write(mmu, utlb, 0);
+	mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
+}
+
 static void ipmmu_tlb_flush_all(void *cookie)
 {
 	struct ipmmu_vmsa_domain *domain = cookie;
@@ -630,6 +642,36 @@ static int ipmmu_attach_device(struct iommu_domain *io_domain,
 	return 0;
 }

+static int ipmmu_iommu_identity_attach(struct iommu_domain *identity_domain,
+				       struct device *dev)
+{
+	struct iommu_domain *io_domain = iommu_get_domain_for_dev(dev);
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct ipmmu_vmsa_domain *domain;
+	unsigned int i;
+
+	if (io_domain == identity_domain || !io_domain)
+		return 0;
+
+	domain = to_vmsa_domain(io_domain);
+	for (i = 0; i < fwspec->num_ids; ++i)
+		ipmmu_utlb_disable(domain, fwspec->ids[i]);
+
+	/*
+	 * TODO: Optimize by disabling the context when no device is attached.
+	 */
+	return 0;
+}
+
+static struct iommu_domain_ops ipmmu_iommu_identity_ops = {
+	.attach_dev = ipmmu_iommu_identity_attach,
+};
+
+static struct iommu_domain ipmmu_iommu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &ipmmu_iommu_identity_ops,
+};
+
 static int ipmmu_map(struct iommu_domain *io_domain, unsigned long iova,
 		     phys_addr_t paddr, size_t pgsize, size_t pgcount,
 		     int prot, gfp_t gfp, size_t *mapped)
@@ -848,6 +890,7 @@ static struct iommu_group *ipmmu_find_group(struct device *dev)
 }

 static const struct iommu_ops ipmmu_ops = {
+	.identity_domain = &ipmmu_iommu_identity_domain,
 	.domain_alloc = ipmmu_domain_alloc,
 	.probe_device = ipmmu_probe_device,
 	.release_device = ipmmu_release_device,
--
2.40.1
[PATCH v4 12/25] iommu/tegra-smmu: Support DMA domains in tegra
All ARM64 iommu drivers should support IOMMU_DOMAIN_DMA to enable dma-iommu.c.

tegra is blocking dma-iommu usage, and also default_domains, because it wants an identity translation. This is needed for some device quirks. The correct way to do this is to support IDENTITY domains and use ops->def_domain_type() to return IOMMU_DOMAIN_IDENTITY for only the quirky devices.

Add support for IOMMU_DOMAIN_DMA and force IOMMU_DOMAIN_IDENTITY mode for everything, so there are no behavior changes.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/tegra-smmu.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index f63f1d4f0bd10f..6cba034905edbf 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -276,7 +276,7 @@ static struct iommu_domain *tegra_smmu_domain_alloc(unsigned type)
 {
 	struct tegra_smmu_as *as;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
+	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
 		return NULL;

 	as = kzalloc(sizeof(*as), GFP_KERNEL);
@@ -989,6 +989,12 @@ static int tegra_smmu_def_domain_type(struct device *dev)
 }

 static const struct iommu_ops tegra_smmu_ops = {
+	/*
+	 * FIXME: For now we want to run all translation in IDENTITY mode,
+	 * better would be to have a def_domain_type op do this for just the
+	 * quirky device.
+	 */
+	.default_domain = &tegra_smmu_identity_domain,
 	.identity_domain = &tegra_smmu_identity_domain,
 	.def_domain_type = &tegra_smmu_def_domain_type,
 	.domain_alloc = tegra_smmu_domain_alloc,
--
2.40.1
[PATCH v4 08/25] iommu: Reorganize iommu_get_default_domain_type() to respect def_domain_type()
Except for dart, every driver returns 0 or IDENTITY from def_domain_type(). The drivers that return IDENTITY have some kind of good reason, typically that quirky hardware really can't support anything other than IDENTITY.

Arrange things so that if the driver says it needs IDENTITY then iommu_get_default_domain_type() either fails or returns IDENTITY. It will never reject the driver's override to IDENTITY.

The only real functional difference is that the PCI untrusted flag is now ignored for quirky HW instead of overriding the IOMMU driver.

This makes the next patch cleaner; it wants to force IDENTITY always for ARM_IOMMU because there is no support for DMA.

Tested-by: Steven Price
Tested-by: Marek Szyprowski
Tested-by: Nicolin Chen
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c | 66 +--
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c8f6664767152d..0c4fc46c210366 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1608,19 +1608,6 @@ struct iommu_group *fsl_mc_device_group(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);

-static int iommu_get_def_domain_type(struct device *dev)
-{
-	const struct iommu_ops *ops = dev_iommu_ops(dev);
-
-	if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
-		return IOMMU_DOMAIN_DMA;
-
-	if (ops->def_domain_type)
-		return ops->def_domain_type(dev);
-
-	return 0;
-}
-
 static struct iommu_domain *
 __iommu_group_alloc_default_domain(const struct bus_type *bus,
 				   struct iommu_group *group, int req_type)
@@ -1761,36 +1748,49 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 static int iommu_get_default_domain_type(struct iommu_group *group,
 					 int target_type)
 {
+	const struct iommu_ops *ops = dev_iommu_ops(
+		list_first_entry(&group->devices, struct group_device, list)
+			->dev);
 	int best_type = target_type;
 	struct group_device *gdev;
 	struct device *last_dev;
+	int type;

 	lockdep_assert_held(&group->mutex);
+
 	for_each_group_device(group, gdev) {
-		unsigned int type = iommu_get_def_domain_type(gdev->dev);
-
-		if (best_type && type && best_type != type) {
-			if (target_type) {
-				dev_err_ratelimited(
-					gdev->dev,
-					"Device cannot be in %s domain\n",
-					iommu_domain_type_str(target_type));
-				return -1;
-			}
-
-			dev_warn(
-				gdev->dev,
-				"Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
-				iommu_domain_type_str(type), dev_name(last_dev),
-				iommu_domain_type_str(best_type));
-			return 0;
+		type = best_type;
+		if (ops->def_domain_type) {
+			type = ops->def_domain_type(gdev->dev);
+			if (best_type && type && best_type != type)
+				goto err;
 		}
-		if (!best_type)
-			best_type = type;
+
+		if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
+			type = IOMMU_DOMAIN_DMA;
+			if (best_type && type && best_type != type)
+				goto err;
+		}
+		best_type = type;
 		last_dev = gdev->dev;
 	}
 	return best_type;
+
+err:
+	if (target_type) {
+		dev_err_ratelimited(
+			gdev->dev,
+			"Device cannot be in %s domain - it is forcing %s\n",
+			iommu_domain_type_str(target_type),
+			iommu_domain_type_str(type));
+		return -1;
+	}
+
+	dev_warn(
+		gdev->dev,
+		"Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
+		iommu_domain_type_str(type), dev_name(last_dev),
+		iommu_domain_type_str(best_type));
+	return 0;
 }

 static void iommu_group_do_probe_finalize(struct device *dev)
--
2.40.1
[PATCH v4 11/25] iommu/tegra-smmu: Implement an IDENTITY domain
What tegra-smmu does during tegra_smmu_set_platform_dma() is actually putting the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining ops->identity_domain.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/tegra-smmu.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 1cbf063ccf147a..f63f1d4f0bd10f 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -511,23 +511,39 @@ static int tegra_smmu_attach_dev(struct iommu_domain *domain,
 	return err;
 }

-static void tegra_smmu_set_platform_dma(struct device *dev)
+static int tegra_smmu_identity_attach(struct iommu_domain *identity_domain,
+				      struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-	struct tegra_smmu_as *as = to_smmu_as(domain);
-	struct tegra_smmu *smmu = as->smmu;
+	struct tegra_smmu_as *as;
+	struct tegra_smmu *smmu;
 	unsigned int index;

 	if (!fwspec)
-		return;
+		return -ENODEV;

+	if (domain == identity_domain || !domain)
+		return 0;
+
+	as = to_smmu_as(domain);
+	smmu = as->smmu;
 	for (index = 0; index < fwspec->num_ids; index++) {
 		tegra_smmu_disable(smmu, fwspec->ids[index], as->id);
 		tegra_smmu_as_unprepare(smmu, as);
 	}
+	return 0;
 }

+static struct iommu_domain_ops tegra_smmu_identity_ops = {
+	.attach_dev = tegra_smmu_identity_attach,
+};
+
+static struct iommu_domain tegra_smmu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &tegra_smmu_identity_ops,
+};
+
 static void tegra_smmu_set_pde(struct tegra_smmu_as *as, unsigned long iova,
 			       u32 value)
 {
@@ -962,11 +978,22 @@ static int tegra_smmu_of_xlate(struct device *dev,
 	return iommu_fwspec_add_ids(dev, &id, 1);
 }

+static int tegra_smmu_def_domain_type(struct device *dev)
+{
+	/*
+	 * FIXME: For now we want to run all translation in IDENTITY mode, due
+	 * to some device quirks. Better would be to just quirk the troubled
+	 * devices.
+	 */
+	return IOMMU_DOMAIN_IDENTITY;
+}
+
 static const struct iommu_ops tegra_smmu_ops = {
+	.identity_domain = &tegra_smmu_identity_domain,
+	.def_domain_type = &tegra_smmu_def_domain_type,
 	.domain_alloc = tegra_smmu_domain_alloc,
 	.probe_device = tegra_smmu_probe_device,
 	.device_group = tegra_smmu_device_group,
-	.set_platform_dma_ops = tegra_smmu_set_platform_dma,
 	.of_xlate = tegra_smmu_of_xlate,
 	.pgsize_bitmap = SZ_4K,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
--
2.40.1
[PATCH v4 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()
These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively allows them to support that mode.

The prior work to require default_domains makes this safe because every one of these drivers is either compilation incompatible with dma-iommu.c, or already establishing a default_domain. In both cases alloc_domain() will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe to drop the test.

Removing these tests clarifies that the domain allocation path is only about the functionality of a paging domain and has nothing to do with policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.

Tested-by: Niklas Schnelle
Tested-by: Steven Price
Tested-by: Marek Szyprowski
Tested-by: Nicolin Chen
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/msm_iommu.c    | 7 ++-
 drivers/iommu/mtk_iommu_v1.c | 7 ++-
 drivers/iommu/omap-iommu.c   | 7 ++-
 drivers/iommu/s390-iommu.c   | 7 ++-
 4 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
 	SET_M(base, ctx, 1);
 }

-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
 {
 	struct msm_priv *priv;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
 	if (!priv)
 		goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 static struct iommu_ops msm_iommu_ops = {
 	.identity_domain = &msm_iommu_identity_domain,
-	.domain_alloc = msm_iommu_domain_alloc,
+	.domain_alloc_paging = msm_iommu_domain_alloc_paging,
 	.probe_device = msm_iommu_probe_device,
 	.device_group = generic_device_group,
 	.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct mtk_iommu_v1_data *data)
 	return 0;
 }

-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)
+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device *dev)
 {
 	struct mtk_iommu_v1_domain *dom;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	dom = kzalloc(sizeof(*dom), GFP_KERNEL);
 	if (!dom)
 		return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct mtk_iommu_v1_data *data)
 static const struct iommu_ops mtk_iommu_v1_ops = {
 	.identity_domain = &mtk_iommu_v1_identity_domain,
-	.domain_alloc = mtk_iommu_v1_domain_alloc,
+	.domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
 	.probe_device = mtk_iommu_v1_probe_device,
 	.probe_finalize = mtk_iommu_v1_probe_finalize,
 	.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13 +1580,10 @@ static struct iommu_domain omap_iommu_identity_domain = {
 	.ops = &omap_iommu_identity_ops,
 };

-static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *omap_iommu_domain_alloc_paging(struct device *dev)
 {
 	struct omap_iommu_domain *omap_domain;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	omap_domain = kzalloc(sizeof(*omap_domain), GFP_KERNEL);
 	if (!omap_domain)
 		return NULL;
@@ -1748,7 +1745,7 @@ static struct iommu_group *omap_iommu_device_group(struct device *dev)
 static const struct iommu_ops omap_iommu_ops = {
 	.identity_domain = &omap_iommu_identity_domain,
-	.domain_alloc = omap_iommu_domain_alloc,
+	.domain_alloc_paging = omap_iommu_domain_alloc_paging,
 	.probe_device = omap_iommu_probe_device,
 	.release_device = omap_iommu_release_device,
 	.device_group = omap_iommu_device_group,
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index f0c867c57a5b9b..5695ad71d60e24 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -39,13 +39,10 @@ static bool s390_iommu_capable(struct device *dev, enum iommu_cap cap)
 	}
 }

-static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
+static struct iommu_domain *s390_domain_alloc_paging(struct device *dev)
 {
 	struct s390_domain *s390_domain;

-	if (domain_type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	s390_domain = kzalloc(sizeof(*s390_domain),
[PATCH v4 21/25] iommu: Require a default_domain for all iommu drivers
At this point every iommu driver will cause a default_domain to be selected, so we can finally remove this gap from the core code.

The following table explains what each driver supports and what the resulting default_domain will be:

                       ops->default_domain
                   IDENTITY  DMA  PLATFORM    ARM32      dma-iommu  ARCH
 amd/iommu.c           Y      Y               N/A        either
 apple-dart.c          Y      Y               N/A        either
 arm-smmu.c            Y      Y               IDENTITY   either
 qcom_iommu.c          G      Y               IDENTITY   either
 arm-smmu-v3.c         Y      Y               N/A        either
 exynos-iommu.c        G      Y               IDENTITY   either
 fsl_pamu_domain.c     Y                      N/A        N/A        PLATFORM
 intel/iommu.c         Y      Y               N/A        either
 ipmmu-vmsa.c          G      Y               IDENTITY   either
 msm_iommu.c           G                      IDENTITY   N/A
 mtk_iommu.c           G      Y               IDENTITY   either
 mtk_iommu_v1.c        G                      IDENTITY   N/A
 omap-iommu.c          G                      IDENTITY   N/A
 rockchip-iommu.c      G      Y               IDENTITY   either
 s390-iommu.c          Y      Y               N/A        N/A        PLATFORM
 sprd-iommu.c          Y                      N/A        DMA
 sun50i-iommu.c        G      Y               IDENTITY   either
 tegra-smmu.c          G      Y               IDENTITY   IDENTITY
 virtio-iommu.c        Y      Y               N/A        either
 spapr                 Y      Y               N/A        N/A        PLATFORM

 * G means ops->identity_domain is used
 * N/A means the driver will not compile in this configuration

ARM32 drivers select an IDENTITY default domain through either the ops->identity_domain or by directly requesting an IDENTITY domain through alloc_domain().

In ARM64 mode tegra-smmu will still block the use of dma-iommu.c and forces an IDENTITY domain.

S390 uses a PLATFORM domain to represent when the dma_ops are set to the s390 iommu code. fsl_pamu uses an IDENTITY domain. POWER SPAPR uses PLATFORM and blocking to enable its weird VFIO mode.

The x86 drivers continue unchanged.

After this patch group->default_domain is only NULL for a short period during bus iommu probing while all the groups are constituted. Otherwise it is always !NULL.

This completes changing the iommu subsystem driver contract to a system where the current iommu_domain always represents some form of translation and the driver is continuously asserting a definable translation mode.

It resolves the confusion that the original ops->detach_dev() caused around what translation, exactly, the IOMMU is performing after detach.
There were at least three different answers to that question in the tree; they are all now clearly named with domain types.

Tested-by: Heiko Stuebner
Tested-by: Niklas Schnelle
Tested-by: Steven Price
Tested-by: Marek Szyprowski
Tested-by: Nicolin Chen
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e60640f6ccb625..98b855487cf03c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1805,10 +1805,12 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
 	 * ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
 	 * identity_domain and it will automatically become their default
 	 * domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
-	 * Override the selection to IDENTITY if we are sure the driver supports
-	 * it.
+	 * Override the selection to IDENTITY.
 	 */
-	if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {
+	if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+		static_assert(!(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) &&
+				IS_ENABLED(CONFIG_IOMMU_DMA)));
+
 		type = IOMMU_DOMAIN_IDENTITY;
 		if (best_type && type && best_type != type)
 			goto err;
@@ -2906,18 +2908,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
 	if (req_type < 0)
 		return -EINVAL;

-	/*
-	 * There are still some drivers which don't support default domains, so
-	 * we ignore the failure and leave group->default_domain NULL.
-	 */
 	dom = iommu_group_alloc_default_domain(group, req_type);
-	if (!dom) {
-		/* Once in default_domain mode we
[PATCH v4 23/25] iommu: Add ops->domain_alloc_paging()
This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING domain, so it saves a few lines in a lot of drivers needlessly checking the type.

More critically, this allows us to sweep out all the IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the drivers, simplifying what is going on in the code and ultimately removing the now-unused special cases in drivers where they did not support IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forward looking and pass in a 'struct device *' argument. We can provide this when allocating the default_domain. No drivers will look at this.

Tested-by: Steven Price
Tested-by: Marek Szyprowski
Tested-by: Nicolin Chen
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c | 13 ++---
 include/linux/iommu.h |  3 +++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0346c05e108438..8f3464ba204498 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1985,6 +1985,7 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);

 static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+						 struct device *dev,
 						 unsigned int type)
 {
 	struct iommu_domain *domain;
@@ -1992,8 +1993,13 @@ static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
 	if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
 		return ops->identity_domain;
+	else if (type & __IOMMU_DOMAIN_PAGING && ops->domain_alloc_paging) {
+		domain = ops->domain_alloc_paging(dev);
+	} else if (ops->domain_alloc)
+		domain = ops->domain_alloc(alloc_type);
+	else
+		return NULL;

-	domain = ops->domain_alloc(alloc_type);
 	if (!domain)
 		return NULL;
@@ -2024,14 +2030,15 @@ __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)

 	lockdep_assert_held(&group->mutex);

-	return
 __iommu_domain_alloc(dev_iommu_ops(dev), type);
+	return __iommu_domain_alloc(dev_iommu_ops(dev), dev, type);
 }

 struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
 {
 	if (bus == NULL || bus->iommu_ops == NULL)
 		return NULL;
-	return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+	return __iommu_domain_alloc(bus->iommu_ops, NULL,
+				    IOMMU_DOMAIN_UNMANAGED);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 49331573f1d1f5..8e4d178c49c417 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -233,6 +233,8 @@ struct iommu_iotlb_gather {
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *                       UNMANAGED, DMA, and DMA_FQ domain types.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -264,6 +266,7 @@ struct iommu_ops {
 	/* Domain allocation and freeing by the iommu driver */
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+	struct iommu_domain *(*domain_alloc_paging)(struct device *dev);

 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
--
2.40.1
[PATCH v4 02/25] iommu: Add IOMMU_DOMAIN_PLATFORM
This is used when the iommu driver is taking control of the dma_ops, currently only on S390 and POWER spapr. It is designed to preserve the original ops->detach_dev() semantic that these drivers were built around.

Provide an opaque domain type and a 'default_domain' ops value that allows the driver to trivially force any single domain as the default domain.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c | 14 +-
 include/linux/iommu.h |  6 ++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bb840a818525ad..c8f6664767152d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1644,6 +1644,17 @@ iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)

 	lockdep_assert_held(&group->mutex);

+	/*
+	 * Allow legacy drivers to specify the domain that will be the default
+	 * domain. This should always be either an IDENTITY or PLATFORM domain.
+	 * Do not use in new drivers.
+	 */
+	if (bus->iommu_ops->default_domain) {
+		if (req_type)
+			return ERR_PTR(-EINVAL);
+		return bus->iommu_ops->default_domain;
+	}
+
 	if (req_type)
 		return __iommu_group_alloc_default_domain(bus, group, req_type);
@@ -1953,7 +1964,8 @@ void iommu_domain_free(struct iommu_domain *domain)
 	if (domain->type == IOMMU_DOMAIN_SVA)
 		mmdrop(domain->mm);
 	iommu_put_dma_cookie(domain);
-	domain->ops->free(domain);
+	if (domain->ops->free)
+		domain->ops->free(domain);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_free);

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c3004eac2f88e8..ef0af09326 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -64,6 +64,7 @@ struct iommu_domain_geometry {
 #define __IOMMU_DOMAIN_DMA_FQ	(1U << 3)  /* DMA-API uses flush queue    */

 #define __IOMMU_DOMAIN_SVA	(1U << 4)  /* Shared process address space */
+#define __IOMMU_DOMAIN_PLATFORM	(1U << 5)

 #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
 /*
@@ -81,6 +82,8 @@ struct iommu_domain_geometry {
 *	invalidation.
* IOMMU_DOMAIN_SVA- DMA addresses are shared process addresses * represented by mm_struct's. + * IOMMU_DOMAIN_PLATFORM - Legacy domain for drivers that do their own + * dma_api stuff. Do not use in new drivers. */ #define IOMMU_DOMAIN_BLOCKED (0U) #define IOMMU_DOMAIN_IDENTITY (__IOMMU_DOMAIN_PT) @@ -91,6 +94,7 @@ struct iommu_domain_geometry { __IOMMU_DOMAIN_DMA_API | \ __IOMMU_DOMAIN_DMA_FQ) #define IOMMU_DOMAIN_SVA (__IOMMU_DOMAIN_SVA) +#define IOMMU_DOMAIN_PLATFORM (__IOMMU_DOMAIN_PLATFORM) struct iommu_domain { unsigned type; @@ -256,6 +260,7 @@ struct iommu_iotlb_gather { * @owner: Driver module providing these ops * @identity_domain: An always available, always attachable identity * translation. + * @default_domain: If not NULL this will always be set as the default domain. */ struct iommu_ops { bool (*capable)(struct device *dev, enum iommu_cap); @@ -290,6 +295,7 @@ struct iommu_ops { unsigned long pgsize_bitmap; struct module *owner; struct iommu_domain *identity_domain; + struct iommu_domain *default_domain; }; /** -- 2.40.1
[PATCH v4 13/25] iommu/omap: Implement an IDENTITY domain
What omap does during omap_iommu_set_platform_dma() is actually putting
the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA, however it cannot be
compiled on ARM64 either. Most likely it is fine to support dma-iommu.c

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/omap-iommu.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 537e402f9bba97..34340ef15241bc 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1555,16 +1555,31 @@ static void _omap_iommu_detach_dev(struct omap_iommu_domain *omap_domain,
 	omap_domain->dev = NULL;
 }
 
-static void omap_iommu_set_platform_dma(struct device *dev)
+static int omap_iommu_identity_attach(struct iommu_domain *identity_domain,
+				      struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-	struct omap_iommu_domain *omap_domain = to_omap_domain(domain);
+	struct omap_iommu_domain *omap_domain;
 
+	if (domain == identity_domain || !domain)
+		return 0;
+
+	omap_domain = to_omap_domain(domain);
 	spin_lock(&omap_domain->lock);
 	_omap_iommu_detach_dev(omap_domain, dev);
 	spin_unlock(&omap_domain->lock);
+	return 0;
 }
 
+static struct iommu_domain_ops omap_iommu_identity_ops = {
+	.attach_dev = omap_iommu_identity_attach,
+};
+
+static struct iommu_domain omap_iommu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &omap_iommu_identity_ops,
+};
+
 static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
 {
 	struct omap_iommu_domain *omap_domain;
@@ -1732,11 +1747,11 @@ static struct iommu_group *omap_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops omap_iommu_ops = {
+	.identity_domain = &omap_iommu_identity_domain,
 	.domain_alloc = omap_iommu_domain_alloc,
 	.probe_device = omap_iommu_probe_device,
 	.release_device = omap_iommu_release_device,
 	.device_group = omap_iommu_device_group,
-	.set_platform_dma_ops = omap_iommu_set_platform_dma,
 	.pgsize_bitmap = OMAP_IOMMU_PGSIZES,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
 		.attach_dev = omap_iommu_attach_dev,
-- 
2.40.1
[PATCH v4 22/25] iommu: Add __iommu_group_domain_alloc()
Allocate a domain from a group. Automatically obtains the iommu_ops to use
from the device list of the group.

Convert the internal callers to use it.

Tested-by: Steven Price
Tested-by: Marek Szyprowski
Tested-by: Nicolin Chen
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c | 66 ++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 98b855487cf03c..0346c05e108438 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -94,8 +94,8 @@ static const char * const iommu_group_resv_type_string[] = {
 static int iommu_bus_notifier(struct notifier_block *nb,
 			      unsigned long action, void *data);
 static void iommu_release_device(struct device *dev);
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-						 unsigned type);
+static struct iommu_domain *
+__iommu_group_domain_alloc(struct iommu_group *group, unsigned int type);
 static int __iommu_attach_device(struct iommu_domain *domain,
 				 struct device *dev);
 static int __iommu_attach_group(struct iommu_domain *domain,
@@ -1652,12 +1652,11 @@ struct iommu_group *fsl_mc_device_group(struct device *dev)
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
 static struct iommu_domain *
-__iommu_group_alloc_default_domain(const struct bus_type *bus,
-				   struct iommu_group *group, int req_type)
+__iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
 	if (group->default_domain && group->default_domain->type == req_type)
 		return group->default_domain;
-	return __iommu_domain_alloc(bus, req_type);
+	return __iommu_group_domain_alloc(group, req_type);
 }
 
 /*
@@ -1667,9 +1666,10 @@ __iommu_group_alloc_default_domain(const struct bus_type *bus,
 static struct iommu_domain *
 iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
-	const struct bus_type *bus =
+	struct device *dev =
 		list_first_entry(&group->devices, struct group_device, list)
-			->dev->bus;
+			->dev;
+	const struct iommu_ops *ops = dev_iommu_ops(dev);
 	struct iommu_domain *dom;
 
 	lockdep_assert_held(&group->mutex);
@@ -1679,24 +1679,24 @@ iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 	 * domain. This should always be either an IDENTITY or PLATFORM domain.
 	 * Do not use in new drivers.
 	 */
-	if (bus->iommu_ops->default_domain) {
+	if (ops->default_domain) {
 		if (req_type)
 			return ERR_PTR(-EINVAL);
-		return bus->iommu_ops->default_domain;
+		return ops->default_domain;
 	}
 
 	if (req_type)
-		return __iommu_group_alloc_default_domain(bus, group, req_type);
+		return __iommu_group_alloc_default_domain(group, req_type);
 
 	/* The driver gave no guidance on what type to use, try the default */
-	dom = __iommu_group_alloc_default_domain(bus, group, iommu_def_domain_type);
+	dom = __iommu_group_alloc_default_domain(group, iommu_def_domain_type);
 	if (dom)
 		return dom;
 
 	/* Otherwise IDENTITY and DMA_FQ defaults will try DMA */
 	if (iommu_def_domain_type == IOMMU_DOMAIN_DMA)
 		return NULL;
-	dom = __iommu_group_alloc_default_domain(bus, group, IOMMU_DOMAIN_DMA);
+	dom = __iommu_group_alloc_default_domain(group, IOMMU_DOMAIN_DMA);
 	if (!dom)
 		return NULL;
 
@@ -1984,19 +1984,16 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-						 unsigned type)
+static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+						 unsigned int type)
 {
 	struct iommu_domain *domain;
 	unsigned int alloc_type = type & IOMMU_DOMAIN_ALLOC_FLAGS;
 
-	if (bus == NULL || bus->iommu_ops == NULL)
-		return NULL;
+	if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
+		return ops->identity_domain;
 
-	if (alloc_type == IOMMU_DOMAIN_IDENTITY && bus->iommu_ops->identity_domain)
-		return bus->iommu_ops->identity_domain;
-
-	domain = bus->iommu_ops->domain_alloc(alloc_type);
+	domain = ops->domain_alloc(alloc_type);
 	if (!domain)
 		return NULL;
 
@@ -2006,10 +2003,10 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
 	 * may override this later
 	 */
 	if (!domain->pgsize_bitmap)
		domain->pgsize_bitmap =
[PATCH v4 05/25] iommu/fsl_pamu: Implement a PLATFORM domain
This driver is nonsensical. To not block migrating the core API away from
NULL default_domains give it a hacky PLATFORM domain that keeps it working
exactly as it always did.

Leave some comments around to warn away any future people looking at this.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/fsl_pamu_domain.c | 41 ++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 4ac0e247ec2b51..e9d2bff4659b7c 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -196,6 +196,13 @@ static struct iommu_domain *fsl_pamu_domain_alloc(unsigned type)
 {
 	struct fsl_dma_domain *dma_domain;
 
+	/*
+	 * FIXME: This isn't creating an unmanaged domain since the
+	 * default_domain_ops do not have any map/unmap function it doesn't meet
+	 * the requirements for __IOMMU_DOMAIN_PAGING. The only purpose seems to
+	 * allow drivers/soc/fsl/qbman/qman_portal.c to do
+	 * fsl_pamu_configure_l1_stash()
+	 */
 	if (type != IOMMU_DOMAIN_UNMANAGED)
 		return NULL;
 
@@ -283,15 +290,33 @@ static int fsl_pamu_attach_device(struct iommu_domain *domain,
 	return ret;
 }
 
-static void fsl_pamu_set_platform_dma(struct device *dev)
+/*
+ * FIXME: fsl/pamu is completely broken in terms of how it works with the iommu
+ * API. Immediately after probe the HW is left in an IDENTITY translation and
+ * the driver provides a non-working UNMANAGED domain that it can switch over
+ * to. However it cannot switch back to an IDENTITY translation, instead it
+ * switches to what looks like BLOCKING.
+ */
+static int fsl_pamu_platform_attach(struct iommu_domain *platform_domain,
+				    struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-	struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+	struct fsl_dma_domain *dma_domain;
 	const u32 *prop;
 	int len;
 	struct pci_dev *pdev = NULL;
 	struct pci_controller *pci_ctl;
 
+	/*
+	 * Hack to keep things working as they always have, only leaving an
+	 * UNMANAGED domain makes it BLOCKING.
+	 */
+	if (domain == platform_domain || !domain ||
+	    domain->type != IOMMU_DOMAIN_UNMANAGED)
+		return 0;
+
+	dma_domain = to_fsl_dma_domain(domain);
+
 	/*
 	 * Use LIODN of the PCI controller while detaching a
 	 * PCI device.
@@ -312,8 +337,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
 		detach_device(dev, dma_domain);
 	else
 		pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+	return 0;
 }
 
+static struct iommu_domain_ops fsl_pamu_platform_ops = {
+	.attach_dev = fsl_pamu_platform_attach,
+};
+
+static struct iommu_domain fsl_pamu_platform_domain = {
+	.type = IOMMU_DOMAIN_PLATFORM,
+	.ops = &fsl_pamu_platform_ops,
+};
+
 /* Set the domain stash attribute */
 int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
 {
@@ -395,11 +430,11 @@ static struct iommu_device *fsl_pamu_probe_device(struct device *dev)
 }
 
 static const struct iommu_ops fsl_pamu_ops = {
+	.default_domain = &fsl_pamu_platform_domain,
 	.capable	= fsl_pamu_capable,
 	.domain_alloc	= fsl_pamu_domain_alloc,
 	.probe_device	= fsl_pamu_probe_device,
 	.device_group	= fsl_pamu_device_group,
-	.set_platform_dma_ops = fsl_pamu_set_platform_dma,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
 		.attach_dev	= fsl_pamu_attach_device,
 		.iova_to_phys	= fsl_pamu_iova_to_phys,
-- 
2.40.1
[PATCH v4 24/25] iommu: Convert simple drivers with DOMAIN_DMA to domain_alloc_paging()
These drivers are all trivially converted since the function is only
called if the domain type is going to be IOMMU_DOMAIN_UNMANAGED/DMA.

Tested-by: Heiko Stuebner
Tested-by: Steven Price
Tested-by: Marek Szyprowski
Tested-by: Nicolin Chen
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 6 ++----
 drivers/iommu/exynos-iommu.c            | 7 ++-----
 drivers/iommu/ipmmu-vmsa.c              | 7 ++-----
 drivers/iommu/mtk_iommu.c               | 7 ++-----
 drivers/iommu/rockchip-iommu.c          | 7 ++-----
 drivers/iommu/sprd-iommu.c              | 7 ++-----
 drivers/iommu/sun50i-iommu.c            | 9 +++------
 drivers/iommu/tegra-smmu.c              | 7 ++-----
 8 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 9d7b9d8b4386d4..a2140fdc65ed58 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -319,12 +319,10 @@ static int qcom_iommu_init_domain(struct iommu_domain *domain,
 	return ret;
 }
 
-static struct iommu_domain *qcom_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *qcom_iommu_domain_alloc_paging(struct device *dev)
 {
 	struct qcom_iommu_domain *qcom_domain;
 
-	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-		return NULL;
 	/*
 	 * Allocate the domain and initialise some of its data structures.
 	 * We can't really do anything meaningful until we've added a
@@ -593,7 +591,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 static const struct iommu_ops qcom_iommu_ops = {
 	.identity_domain = &qcom_iommu_identity_domain,
 	.capable	= qcom_iommu_capable,
-	.domain_alloc	= qcom_iommu_domain_alloc,
+	.domain_alloc_paging = qcom_iommu_domain_alloc_paging,
 	.probe_device	= qcom_iommu_probe_device,
 	.device_group	= generic_device_group,
 	.of_xlate	= qcom_iommu_of_xlate,
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 5e12b85dfe8705..d6dead2ed10c11 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -887,7 +887,7 @@ static inline void exynos_iommu_set_pte(sysmmu_pte_t *ent, sysmmu_pte_t val)
 				   DMA_TO_DEVICE);
 }
 
-static struct iommu_domain *exynos_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *exynos_iommu_domain_alloc_paging(struct device *dev)
 {
 	struct exynos_iommu_domain *domain;
 	dma_addr_t handle;
@@ -896,9 +896,6 @@ static struct iommu_domain *exynos_iommu_domain_alloc(unsigned type)
 	/* Check if correct PTE offsets are initialized */
 	BUG_ON(PG_ENT_SHIFT < 0 || !dma_dev);
 
-	if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
 	if (!domain)
 		return NULL;
@@ -1472,7 +1469,7 @@ static int exynos_iommu_of_xlate(struct device *dev,
 
 static const struct iommu_ops exynos_iommu_ops = {
 	.identity_domain = &exynos_identity_domain,
-	.domain_alloc = exynos_iommu_domain_alloc,
+	.domain_alloc_paging = exynos_iommu_domain_alloc_paging,
 	.device_group = generic_device_group,
 	.probe_device = exynos_iommu_probe_device,
 	.release_device = exynos_iommu_release_device,
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index de958e411a92e0..27d36347e0fced 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -566,13 +566,10 @@ static irqreturn_t ipmmu_irq(int irq, void *dev)
  * IOMMU Operations
  */
 
-static struct iommu_domain *ipmmu_domain_alloc(unsigned type)
+static struct iommu_domain *ipmmu_domain_alloc_paging(struct device *dev)
 {
 	struct ipmmu_vmsa_domain *domain;
 
-	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-		return NULL;
-
 	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
 	if (!domain)
 		return NULL;
@@ -891,7 +888,7 @@ static struct iommu_group *ipmmu_find_group(struct device *dev)
 
 static const struct iommu_ops ipmmu_ops = {
 	.identity_domain = &ipmmu_iommu_identity_domain,
-	.domain_alloc = ipmmu_domain_alloc,
+	.domain_alloc_paging = ipmmu_domain_alloc_paging,
 	.probe_device = ipmmu_probe_device,
 	.release_device = ipmmu_release_device,
 	.probe_finalize = ipmmu_probe_finalize,
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index fdb7f5162b1d64..3590d3399add32 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -667,13 +667,10 @@ static int mtk_iommu_domain_finalise(struct mtk_iommu_domain *dom,
 	return 0;
 }
 
-static struct iommu_domain *mtk_iommu_domain_alloc(unsigned type)
+static struct iommu_domain
[PATCH v4 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390
The PLATFORM domain will be set as the default domain and attached as
normal during probe. The driver will ignore the initial attach from a NULL
domain to the PLATFORM domain.

After this, the PLATFORM domain's attach_dev will be called whenever we
detach from an UNMANAGED domain (eg for VFIO). This is the same time the
original design would have called op->detach_dev().

This is temporary until the S390 dma-iommu.c conversion is merged.

Tested-by: Heiko Stuebner
Tested-by: Niklas Schnelle
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/s390-iommu.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 	return 0;
 }
 
-static void s390_iommu_set_platform_dma(struct device *dev)
+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+				      struct device *dev)
 {
 	struct zpci_dev *zdev = to_zpci_dev(dev);
 
+	if (!zdev->s390_domain)
+		return 0;
+
 	__s390_iommu_detach_device(zdev);
 	zpci_dma_init_device(zdev);
+	return 0;
 }
 
+static struct iommu_domain_ops s390_iommu_platform_ops = {
+	.attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+	.type = IOMMU_DOMAIN_PLATFORM,
+	.ops = &s390_iommu_platform_ops,
+};
+
 static void s390_iommu_get_resv_regions(struct device *dev,
					struct list_head *list)
 {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
 }
 
 static const struct iommu_ops s390_iommu_ops = {
+	.default_domain = &s390_iommu_platform_domain,
 	.capable = s390_iommu_capable,
 	.domain_alloc = s390_domain_alloc,
 	.probe_device = s390_iommu_probe_device,
 	.release_device = s390_iommu_release_device,
 	.device_group = generic_device_group,
-	.set_platform_dma_ops = s390_iommu_set_platform_dma,
 	.pgsize_bitmap = SZ_4K,
 	.get_resv_regions = s390_iommu_get_resv_regions,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.40.1
[PATCH v4 07/25] iommu/mtk_iommu_v1: Implement an IDENTITY domain
What mtk does during mtk_iommu_v1_set_platform_dma() is actually putting
the iommu into identity mode. Make this available as a proper IDENTITY
domain.

The mtk_iommu_v1_def_domain_type() from commit 8bbe13f52cb7
("iommu/mediatek-v1: Add def_domain_type") explains this was needed to
allow probe_finalize() to be called, but now the IDENTITY domain will do
the same job so change the returned def_domain_type.

mtk_v1 is the only driver that returns IOMMU_DOMAIN_UNMANAGED from
def_domain_type(). This allows the next patch to enforce an IDENTITY
domain policy for this driver.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/mtk_iommu_v1.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 8a0a5e5d049f4a..cc3e7d53d33ad9 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -319,11 +319,27 @@ static int mtk_iommu_v1_attach_device(struct iommu_domain *domain, struct device
 	return 0;
 }
 
-static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+static int mtk_iommu_v1_identity_attach(struct iommu_domain *identity_domain,
+					struct device *dev)
 {
 	struct mtk_iommu_v1_data *data = dev_iommu_priv_get(dev);
 
 	mtk_iommu_v1_config(data, dev, false);
+	return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_v1_identity_ops = {
+	.attach_dev = mtk_iommu_v1_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_v1_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &mtk_iommu_v1_identity_ops,
+};
+
+static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+{
+	mtk_iommu_v1_identity_attach(&mtk_iommu_v1_identity_domain, dev);
 }
 
 static int mtk_iommu_v1_map(struct iommu_domain *domain, unsigned long iova,
@@ -443,7 +459,7 @@ static int mtk_iommu_v1_create_mapping(struct device *dev, struct of_phandle_arg
 
 static int mtk_iommu_v1_def_domain_type(struct device *dev)
 {
-	return IOMMU_DOMAIN_UNMANAGED;
+	return IOMMU_DOMAIN_IDENTITY;
 }
 
 static struct iommu_device *mtk_iommu_v1_probe_device(struct device *dev)
@@ -578,6 +594,7 @@ static int mtk_iommu_v1_hw_init(const struct mtk_iommu_v1_data *data)
 }
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
+	.identity_domain = &mtk_iommu_v1_identity_domain,
 	.domain_alloc = mtk_iommu_v1_domain_alloc,
 	.probe_device = mtk_iommu_v1_probe_device,
 	.probe_finalize = mtk_iommu_v1_probe_finalize,
-- 
2.40.1
[PATCH v4 03/25] powerpc/iommu: Setup a default domain and remove set_platform_dma_ops
POWER is using the set_platform_dma_ops() callback to hook up its private
dma_ops, but this is buried under some indirection and is weirdly
happening for a BLOCKED domain as well.

For better documentation create a PLATFORM domain to manage the dma_ops,
since that is what it is for, and make the BLOCKED domain an alias for it.
BLOCKED is required for VFIO.

This also removes the leaky allocation of the BLOCKED domain by using a
global static.

Signed-off-by: Jason Gunthorpe
---
 arch/powerpc/kernel/iommu.c | 38 +++++++++++++++----------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 67f0b01e6ff575..0f17cd767e1676 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1266,7 +1266,7 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = {
 /*
 * A simple iommu_ops to allow less cruft in generic VFIO code.
 */
-static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom,
+static int spapr_tce_platform_iommu_attach_dev(struct iommu_domain *dom,
 					       struct device *dev)
 {
 	struct iommu_group *grp = iommu_group_get(dev);
@@ -1283,17 +1283,22 @@ static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom,
 	return ret;
 }
 
-static void spapr_tce_blocking_iommu_set_platform_dma(struct device *dev)
-{
-	struct iommu_group *grp = iommu_group_get(dev);
-	struct iommu_table_group *table_group;
+static const struct iommu_domain_ops spapr_tce_platform_domain_ops = {
+	.attach_dev = spapr_tce_platform_iommu_attach_dev,
+};
 
-	table_group = iommu_group_get_iommudata(grp);
-	table_group->ops->release_ownership(table_group);
-}
+static struct iommu_domain spapr_tce_platform_domain = {
+	.type = IOMMU_DOMAIN_PLATFORM,
+	.ops = &spapr_tce_platform_domain_ops,
+};
 
-static const struct iommu_domain_ops spapr_tce_blocking_domain_ops = {
-	.attach_dev = spapr_tce_blocking_iommu_attach_dev,
+static struct iommu_domain spapr_tce_blocked_domain = {
+	.type = IOMMU_DOMAIN_BLOCKED,
+	/*
+	 * FIXME: SPAPR mixes blocked and platform behaviors, the blocked domain
+	 * also sets the dma_api ops
+	 */
+	.ops = &spapr_tce_platform_domain_ops,
 };
 
 static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)
@@ -1310,18 +1315,9 @@ static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)
 
 static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
 {
-	struct iommu_domain *dom;
-
 	if (type != IOMMU_DOMAIN_BLOCKED)
 		return NULL;
-
-	dom = kzalloc(sizeof(*dom), GFP_KERNEL);
-	if (!dom)
-		return NULL;
-
-	dom->ops = &spapr_tce_blocking_domain_ops;
-
-	return dom;
+	return &spapr_tce_blocked_domain;
 }
 
 static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
@@ -1357,12 +1353,12 @@ static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops spapr_tce_iommu_ops = {
+	.default_domain = &spapr_tce_platform_domain,
 	.capable = spapr_tce_iommu_capable,
 	.domain_alloc = spapr_tce_iommu_domain_alloc,
 	.probe_device = spapr_tce_iommu_probe_device,
 	.release_device = spapr_tce_iommu_release_device,
 	.device_group = spapr_tce_iommu_device_group,
-	.set_platform_dma_ops = spapr_tce_blocking_iommu_set_platform_dma,
 };
 
 static struct attribute *spapr_tce_iommu_attrs[] = {
-- 
2.40.1
Re: [PATCH v2 05/12] modules, execmem: drop module_alloc
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport wrote:
>
> From: "Mike Rapoport (IBM)"
>
> Define default parameters for address range for code allocations using
> the current values in module_alloc() and make execmem_text_alloc() use
> these defaults when an architecture does not supply its specific
> parameters.
>
> With this, execmem_text_alloc() implements memory allocation in a way
> compatible with module_alloc() and can be used as a replacement for
> module_alloc().
>
> Signed-off-by: Mike Rapoport (IBM)

Acked-by: Song Liu

> ---
>  include/linux/execmem.h      |  8 ++++++++
>  include/linux/moduleloader.h | 12 ------------
>  kernel/module/main.c         |  7 -------
>  mm/execmem.c                 | 12 ++++++++----
>  4 files changed, 16 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 68b2bfc79993..b9a97fcdf3c5 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -4,6 +4,14 @@
>
>  #include
>
> +#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> +    !defined(CONFIG_KASAN_VMALLOC)
> +#include
> +#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
> +#else
> +#define MODULE_ALIGN PAGE_SIZE
> +#endif
> +
>  /**
>  * struct execmem_range - definition of a memory range suitable for code and
>  * related data allocations
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index b3374342f7af..4321682fe849 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
>  /* Additional bytes needed by arch in front of individual sections */
>  unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
>
> -/* Allocator used for allocating struct module, core sections and init
> -   sections.  Returns NULL on failure. */
> -void *module_alloc(unsigned long size);
> -
>  /* Determines if the section name is an init section (that is only used during
>  * module loading).
>  */
> @@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
>  /* Any cleanup before freeing mod->module_init */
>  void module_arch_freeing_init(struct module *mod);
>
> -#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> -    !defined(CONFIG_KASAN_VMALLOC)
> -#include
> -#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
> -#else
> -#define MODULE_ALIGN PAGE_SIZE
> -#endif
> -
>  #endif
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 43810a3bdb81..b445c5ad863a 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
>  	}
>  }
>
> -void * __weak module_alloc(unsigned long size)
> -{
> -	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> -}
> -
>  bool __weak module_init_section(const char *name)
>  {
>  	return strstarts(name, ".init");
> diff --git a/mm/execmem.c b/mm/execmem.c
> index 2fe36dcc7bdf..a67acd75ffef 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -59,9 +59,6 @@ void *execmem_text_alloc(size_t size)
>  	unsigned long fallback_end = execmem_params.modules.text.fallback_end;
>  	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
>
> -	if (!execmem_params.modules.text.start)
> -		return module_alloc(size);
> -
>  	return execmem_alloc(size, start, end, align, pgprot,
>  			     fallback_start, fallback_end, kasan);
>  }
> @@ -108,8 +105,15 @@ void __init execmem_init(void)
>  {
>  	struct execmem_params *p = execmem_arch_params();
>
> -	if (!p)
> +	if (!p) {
> +		p = &execmem_params;
> +		p->modules.text.start = VMALLOC_START;
> +		p->modules.text.end = VMALLOC_END;
> +		p->modules.text.pgprot = PAGE_KERNEL_EXEC;
> +		p->modules.text.alignment = 1;
> +
>  		return;
> +	}
>
>  	if (!execmem_validate_params(p))
>  		return;
> --
> 2.35.1
>
Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport wrote:
[...]
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 5af4975caeb5..c3d999f3a3dd 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -17,56 +17,50 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
>  #include
>
> -void *module_alloc(unsigned long size)
> +static struct execmem_params execmem_params = {
> +	.modules = {
> +		.flags = EXECMEM_KASAN_SHADOW,
> +		.text = {
> +			.alignment = MODULE_ALIGN,
> +		},
> +	},
> +};
> +
> +struct execmem_params __init *execmem_arch_params(void)
>  {
>  	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> -	gfp_t gfp_mask = GFP_KERNEL;
> -	void *p;
> -
> -	/* Silence the initial allocation */
> -	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> -		gfp_mask |= __GFP_NOWARN;
>
> -	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> -	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> -		/* don't exceed the static module region - see below */
> -		module_alloc_end = MODULES_END;
> +	execmem_params.modules.text.pgprot = PAGE_KERNEL;
> +	execmem_params.modules.text.start = module_alloc_base;

I think I mentioned this earlier. For arm64 with CONFIG_RANDOMIZE_BASE,
module_alloc_base is not yet set when execmem_arch_params() is called.
So we will need some extra logic for this.

Thanks,
Song

> +	execmem_params.modules.text.end = module_alloc_end;
>
> -	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -				 module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
> -				 NUMA_NO_NODE, __builtin_return_address(0));
> -
> -	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> +	/*
> +	 * KASAN without KASAN_VMALLOC can only deal with module
> +	 * allocations being served from the reserved module region,
> +	 * since the remainder of the vmalloc region is already
> +	 * backed by zero shadow pages, and punching holes into it
> +	 * is non-trivial. Since the module region is not randomized
> +	 * when KASAN is enabled without KASAN_VMALLOC, it is even
> +	 * less likely that the module region gets exhausted, so we
> +	 * can simply omit this fallback in that case.
> +	 */
> +	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
>  	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
>  	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> -	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
> -		/*
> -		 * KASAN without KASAN_VMALLOC can only deal with module
> -		 * allocations being served from the reserved module region,
> -		 * since the remainder of the vmalloc region is already
> -		 * backed by zero shadow pages, and punching holes into it
> -		 * is non-trivial. Since the module region is not randomized
> -		 * when KASAN is enabled without KASAN_VMALLOC, it is even
> -		 * less likely that the module region gets exhausted, so we
> -		 * can simply omit this fallback in that case.
> -		 */
> -		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -				module_alloc_base + SZ_2G, GFP_KERNEL,
> -				PAGE_KERNEL, 0, NUMA_NO_NODE,
> -				__builtin_return_address(0));
> -
> -	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> -		vfree(p);
> -		return NULL;
> +	     !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
> +		unsigned long end = module_alloc_base + SZ_2G;
> +
> +		execmem_params.modules.text.fallback_start = module_alloc_base;
> +		execmem_params.modules.text.fallback_end = end;
>  	}
>
> -	/* Memory is intended to be executable, reset the pointer tag. */
> -	return kasan_reset_tag(p);
> +	return &execmem_params;
>  }
>
>  enum aarch64_reloc_op {
Re: [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport wrote:
>
> From: "Mike Rapoport (IBM)"
>
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
>
> Provide a generic implementation in execmem that uses the parameters
> for address space ranges, required alignment and page protections
> provided by architectures.
>
> The architectures must fill execmem_params structure and implement
> execmem_arch_params() that returns a pointer to that structure. This
> way the execmem initialization won't be called from every architecture,
> but rather from a central place, namely initialization of the core
> memory management.
>
> The execmem provides execmem_text_alloc() API that wraps
> __vmalloc_node_range() with the parameters defined by the architectures.
> If an architecture does not implement execmem_arch_params(),
> execmem_text_alloc() will fall back to module_alloc().
>
> The name execmem_text_alloc() emphasizes that the allocated memory is
> for executable code, the allocations of the associated data, like data
> sections of a module will use execmem_data_alloc() interface that will
> be added later.
>
> Signed-off-by: Mike Rapoport (IBM)

Acked-by: Song Liu
Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
On Fri, Jun 16, 2023 at 9:48 AM Kent Overstreet wrote: > > On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > > > module_alloc() is used everywhere as a means to allocate memory for code. > > > > Besides being semantically wrong, this unnecessarily ties all subsystems > > that need to allocate code, such as ftrace, kprobes and BPF, to modules > > and puts the burden of code allocation on the modules code. > > > > Several architectures override module_alloc() because of various > > constraints on where the executable memory can be located, and this causes > > additional obstacles for improvements of code allocation. > > > > Start splitting code allocation from modules by introducing > > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs. > > > > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for > > module_alloc(), and execmem_free() and jit_free() are replacements of > > module_memfree(), to allow updating all call sites to use the new APIs. > > > > The intended semantics of the new allocation APIs: > > > > * execmem_text_alloc() should be used to allocate memory that must reside > > close to the kernel image, like loadable kernel modules and generated > > code that is restricted by relative addressing. > > > > * jit_text_alloc() should be used to allocate memory for generated code > > when there are no restrictions on the code placement. For > > architectures that require that any code is within a certain distance > > from the kernel image, jit_text_alloc() will be essentially aliased to > > execmem_text_alloc(). > > > > The names execmem_text_alloc() and jit_text_alloc() emphasize that the > > allocated memory is for executable code; the allocations of the > > associated data, like data sections of a module, will use the > > execmem_data_alloc() interface that will be added later. > > I like the API split - at the risk of further bikeshedding, perhaps > near_text_alloc() and far_text_alloc()?
Would be more explicit. > > Reviewed-by: Kent Overstreet Acked-by: Song Liu
Re: [PATCH v2 01/12] nios2: define virtual address space for modules
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" > > nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26 > cannot reach all of vmalloc address space. > > Define module space as 32MiB below the kernel base and switch nios2 to > use vmalloc for module allocations. > > Suggested-by: Thomas Gleixner > Signed-off-by: Mike Rapoport (IBM) > Acked-by: Dinh Nguyen Acked-by: Song Liu > --- > arch/nios2/include/asm/pgtable.h | 5 - > arch/nios2/kernel/module.c | 19 --- > 2 files changed, 8 insertions(+), 16 deletions(-) > > diff --git a/arch/nios2/include/asm/pgtable.h > b/arch/nios2/include/asm/pgtable.h > index 0f5c2564e9f5..0073b289c6a4 100644 > --- a/arch/nios2/include/asm/pgtable.h > +++ b/arch/nios2/include/asm/pgtable.h > @@ -25,7 +25,10 @@ > #include > > #define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE > -#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) > +#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1) > + > +#define MODULES_VADDR (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M) > +#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) > > struct mm_struct; > > diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c > index 76e0a42d6e36..9c97b7513853 100644 > --- a/arch/nios2/kernel/module.c > +++ b/arch/nios2/kernel/module.c > @@ -21,23 +21,12 @@ > > #include > > -/* > - * Modules should NOT be allocated with kmalloc for (obvious) reasons. > - * But we do it for now to avoid relocation issues. 
CALL26/PCREL26 cannot reach > - * from 0x80000000 (vmalloc area) to 0xc0000000 (kernel) (kmalloc returns > - * addresses in 0xc0000000) > - */ > void *module_alloc(unsigned long size) > { > - if (size == 0) > - return NULL; > - return kmalloc(size, GFP_KERNEL); > -} > - > -/* Free memory returned from module_alloc */ > -void module_memfree(void *module_region) > -{ > - kfree(module_region); > + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, > + GFP_KERNEL, PAGE_KERNEL_EXEC, > + VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, > + __builtin_return_address(0)); > } > > int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab, > -- > 2.35.1 >
Re: ppc64le vmlinuz is huge when building with BTF
Dominique Martinet wrote: Naveen N Rao wrote on Fri, Jun 16, 2023 at 04:28:53PM +0530: > We're not stripping anything in vmlinuz for other archs -- the linker > script already should be including only the bare minimum to decompress > itself (+compressed useful bits), so I guess it's a Kbuild issue for the > arch. For a related discussion, see: http://lore.kernel.org/CAK18DXZKs2PNmLndeGYqkPxmrrBR=6ca3bhyYCj=ghya7dh...@mail.gmail.com Thanks, I didn't know that ppc64le boots straight into vmlinux, as 'make install' somehow installs something called 'vmlinuz-lts' (-lts coming out of localversion afaiu, but vmlinuz would come from the build scripts); this is somewhat confusing as vmlinuz on other archs is a compressed/pre-processed binary, so I'd expect it to at least be stripped... As far as I can tell, the kernel's install script doesn't give out that name, so 'vmlinuz' is likely coming from the distro's /[s]bin/installkernel script. It probably needs an override to retain 'vmlinux'. > We can add a strip but I unfortunately have no way of testing a ppc build, > I'll ask around on the linux-kbuild and linuxppc-dev lists if that's > expected; it shouldn't be that bad now that that's figured out. Stripping vmlinux would indeed be the way to go. As mentioned in the above link, fedora also packages a stripped vmlinux for ppc64le: https://src.fedoraproject.org/rpms/kernel/blob/4af17bffde7a1eca9ab164e5de0e391c277998a4/f/kernel.spec#_1797 It feels somewhat wrong to add a strip just for ppc64le after make install, but I guess we probably ought to do the same... I don't have any hardware to test booting the result though, I'll submit an update and ask for someone to test when it's done. (bit busy but that doesn't take long, will do that tomorrow morning before I forget) Thanks! You're right that it's likely just powerpc that is different here. It sure would be nice if we could iron out issues with our zImage. - Naveen
Re: [PATCH v2 00/12] mm: jit/text allocator
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > Hi, > > module_alloc() is used everywhere as a means to allocate memory for > code. > > Besides being semantically wrong, this unnecessarily ties all > subsystems > that need to allocate code, such as ftrace, kprobes and BPF, to > modules and > puts the burden of code allocation on the modules code. > > Several architectures override module_alloc() because of various > constraints on where the executable memory can be located, and this > causes > additional obstacles for improvements of code allocation. I like how this series leaves the allocation code centralized at the end of it, because it will be much easier when we get to ROX, huge page, text_poking() type stuff. I guess that's the idea. I'm just catching up on what you and Song have been up to.
Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > Data related to code allocations, such as module data sections, needs > to > comply with architecture constraints for its placement, and its > allocation is currently done using execmem_text_alloc(). > > Create a dedicated API for allocating data related to code > allocations > and allow architectures to define address ranges for data > allocations. Right now the cross-arch way to specify kernel memory permissions is encoded in the function names of all the set_memory_foo()'s. You can't just have unified prot names because some archs have NX and some have X bits, etc. CPA wouldn't know if it needs to set or unset a bit if you pass in a PROT. But then you end up with a new function for *each* combination (i.e. set_memory_rox()). I wish CPA had flags like mmap() does, and I wonder if it makes sense here instead of execmem_data_alloc(). Maybe that is an overhaul for another day though...
Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > module_alloc() is used everywhere as a means to allocate memory for code. > > Besides being semantically wrong, this unnecessarily ties all subsystems > that need to allocate code, such as ftrace, kprobes and BPF, to modules > and puts the burden of code allocation on the modules code. > > Several architectures override module_alloc() because of various > constraints on where the executable memory can be located, and this causes > additional obstacles for improvements of code allocation. > > Start splitting code allocation from modules by introducing > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs. > > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for > module_alloc(), and execmem_free() and jit_free() are replacements of > module_memfree(), to allow updating all call sites to use the new APIs. > > The intended semantics of the new allocation APIs: > > * execmem_text_alloc() should be used to allocate memory that must reside > close to the kernel image, like loadable kernel modules and generated > code that is restricted by relative addressing. > > * jit_text_alloc() should be used to allocate memory for generated code > when there are no restrictions on the code placement. For > architectures that require that any code is within a certain distance > from the kernel image, jit_text_alloc() will be essentially aliased to > execmem_text_alloc(). > > The names execmem_text_alloc() and jit_text_alloc() emphasize that the > allocated memory is for executable code; the allocations of the > associated data, like data sections of a module, will use the > execmem_data_alloc() interface that will be added later. I like the API split - at the risk of further bikeshedding, perhaps near_text_alloc() and far_text_alloc()? Would be more explicit. Reviewed-by: Kent Overstreet
Re: [PATCH v2 4/6] watchdog/hardlockup: Make HAVE_NMI_WATCHDOG sparc64-specific
Hi, On Fri, Jun 16, 2023 at 8:07 AM Petr Mladek wrote: > > There are several hardlockup detector implementations and several Kconfig > values which allow selection and build of the preferred one. > > CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb > ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36. > It was a preparation step for introducing the new generic perf hardlockup > detector. > > The existing arch-specific variants did not support the to-be-created > generic build configurations, sysctl interface, etc. This distinction > was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog: > Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR") > in v2.6.38. > > CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105 > ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced > the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used > by three architectures, namely blackfin, mn10300, and sparc. > > The support for blackfin and mn10300 architectures has been completely > dropped some time ago. And sparc is the only architecture with the historic > NMI watchdog at the moment. > > And the old sparc implementation is really special. It is always built on > sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b > ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added > in v4.10-rc1. 
> > There are only a few locations where the sparc64 NMI watchdog interacts > with the generic hardlockup detectors code: > > + implements arch_touch_nmi_watchdog() which is called from the generic > touch_nmi_watchdog() > > + implements watchdog_hardlockup_enable()/disable() to support > /proc/sys/kernel/nmi_watchdog > > + is always preferred over other generic watchdogs, see > CONFIG_HARDLOCKUP_DETECTOR > > + includes asm/nmi.h into linux/nmi.h because some sparc-specific > functions are needed in sparc-specific code which includes > only linux/nmi.h. > > The situation became more complicated after the commit 05a4a95279311c3 > ("kernel/watchdog: split up config options") and commit 2104180a53698df5 > ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1. > They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for the powerpc > specific hardlockup detector. It was compatible with the perf one > regarding the general boot, sysctl, and programming interfaces. > > HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of > HAVE_NMI_WATCHDOG. It made some sense because all arch-specific > detectors had some common requirements, namely: > > + implemented arch_touch_nmi_watchdog() > + included asm/nmi.h into linux/nmi.h > + defined the default value for /proc/sys/kernel/nmi_watchdog > > But it actually has made things pretty complicated when the generic > buddy hardlockup detector was added. Before, the generic perf detector > was never supported together with an arch-specific one. But the buddy > detector could work on any SMP system. It means that an architecture > could support both the arch-specific and buddy detector. > > As a result, there are a few tricky dependencies.
For example, > CONFIG_HARDLOCKUP_DETECTOR depends on: > > ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && > !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH > > The problem is that the very special sparc implementation is defined as: > > HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH > > Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear > without understanding the history. > > Make the logic less tricky and more self-explanatory by making > HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to > HAVE_HARDLOCKUP_DETECTOR_SPARC64. > > Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF, > and HARDLOCKUP_DETECTOR_BUDDY may conflict only with > HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR > and it is no longer enabled when HAVE_NMI_WATCHDOG is set. > > Signed-off-by: Petr Mladek > > watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 > > The configuration variable HAVE_NMI_WATCHDOG has a generic name but > it is selected only for SPARC64. > > It should _not_ be used in general because it is not integrated with > the other hardlockup detectors. Namely, it does not support the hardlockup > specific command line parameters and sysctl interface. Instead, it is > enabled/disabled together with the softlockup detector by the global > "watchdog" sysctl. > > Rename it to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 to make the special > behavior more clear. > > Also the variable is set only on sparc64. Move the definition > from arch/Kconfig to arch/sparc/Kconfig.debug. > > Signed-off-by: Petr Mladek I think you goofed up when squashing the patches. You've now got a second patch subject after your first Signed-off-by and
Re: [PATCH v2 2/6] watchdog/hardlockup: Make the config checks more straightforward
Hi, On Fri, Jun 16, 2023 at 8:07 AM Petr Mladek wrote: > > There are four possible variants of hardlockup detectors: > > + buddy: available when SMP is set. > > + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. > > + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. > > + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set > and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. > > The check for the sparc64 variant is more complicated because > HAVE_NMI_WATCHDOG is used to #ifdef code used by both the arch-specific > and sparc64-specific variants. Therefore it is automatically > selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. > > This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. > It reduces the size of some checks but it makes them harder to follow. > > Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH > is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global > HARDLOCKUP_DETECTOR switch is enabled/disabled. > > Make the logic more straightforward by the following changes: > > + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and > HAVE_NMI_WATCHDOG in comments. > > + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is a separate > HAVE_* for all four hardlockup detector variants. > > Use it in the other conditions instead of SMP. It makes it > clear that it is about the buddy detector. > > + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR > and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand > the conditions between the four hardlockup detector variants. > > + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY > can be enabled. It explains the dependency on the other > hardlockup detector variants. > > Also, it allows removing HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". > It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when > the global HARDLOCKUP_DETECTOR switch is changed.
> > + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables > disappear when the hardlockup detectors are disabled. > > Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY > value is not preserved when the global switch is disabled. > The user has to make the decision again when it gets re-enabled. > > Signed-off-by: Petr Mladek > --- > arch/Kconfig | 23 +- > lib/Kconfig.debug | 62 +++ > 2 files changed, 53 insertions(+), 32 deletions(-) While I'd still paint the bikeshed a different color and organize the dependencies a little differently, as discussed in your v1 post, this is still OK w/ me. Reviewed-by: Douglas Anderson
Re: [PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related config values a logical way
Hi, On Fri, Jun 16, 2023 at 8:06 AM Petr Mladek wrote: > > There are four possible variants of hardlockup detectors: > > + buddy: available when SMP is set. > > + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. > > + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. > > + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set > and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. > > Only one hardlockup detector can be compiled in. The selection is done > using quite complex dependencies between several CONFIG variables. > The following patches will try to make it more straightforward. > > As a first step, reorder the definitions of the various CONFIG variables. > The logical order is: > >1. HAVE_* variables define available variants. They are typically > defined in the arch/ config files. > >2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup > detector is enabled at all. > >3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether > the buddy detector should be preferred over the perf one. > Note that the arch specific variants are always preferred when > available. > >4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given > detector is enabled in the end. > >5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH > are temporary variables that are going to be removed in > a followup patch. > > This is a preparation step for further cleanup that will change the logic > without shuffling the definitions. > > This change temporarily breaks the C-like ordering where the variables are > declared or defined before they are used. It is not really needed for > Kconfig. Also, the following patches will rework the logic so that > the ordering will be C-like in the end. > > The patch just shuffles the definitions. It should not change the existing > behavior.
> > Signed-off-by: Petr Mladek > --- > lib/Kconfig.debug | 112 +++--- > 1 file changed, 56 insertions(+), 56 deletions(-) Reviewed-by: Douglas Anderson
Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote: > -void *module_alloc(unsigned long size) > -{ > - gfp_t gfp_mask = GFP_KERNEL; > - void *p; > - > - if (PAGE_ALIGN(size) > MODULES_LEN) > - return NULL; > +static struct execmem_params execmem_params = { > + .modules = { > + .flags = EXECMEM_KASAN_SHADOW, > + .text = { > + .alignment = MODULE_ALIGN, > + }, > + }, > +}; Did you consider making these execmem_params __ro_after_init? Not that it is security sensitive, but it's a nice hint to the reader that it is only modified at init. And I guess basically free sanitizing of buggy writes to it. > > - p = __vmalloc_node_range(size, MODULE_ALIGN, > - MODULES_VADDR + > get_module_load_offset(), > - MODULES_END, gfp_mask, PAGE_KERNEL, > - VM_FLUSH_RESET_PERMS | > VM_DEFER_KMEMLEAK, > - NUMA_NO_NODE, > __builtin_return_address(0)); > +struct execmem_params __init *execmem_arch_params(void) > +{ > + unsigned long start = MODULES_VADDR + > get_module_load_offset(); I think we can drop the mutex in get_module_load_offset() now, since execmem_arch_params() should only be called once at init. > > - if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) > { > - vfree(p); > - return NULL; > - } > + execmem_params.modules.text.start = start; > + execmem_params.modules.text.end = MODULES_END; > + execmem_params.modules.text.pgprot = PAGE_KERNEL; > > - return p; > + return &execmem_params; > } >
Re: [PATCH v2 01/12] nios2: define virtual address space for modules
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote: > void *module_alloc(unsigned long size) > { > - if (size == 0) > - return NULL; > - return kmalloc(size, GFP_KERNEL); > -} > - > -/* Free memory returned from module_alloc */ > -void module_memfree(void *module_region) > -{ > - kfree(module_region); > + return __vmalloc_node_range(size, 1, MODULES_VADDR, > MODULES_END, > + GFP_KERNEL, PAGE_KERNEL_EXEC, > + VM_FLUSH_RESET_PERMS, > NUMA_NO_NODE, > + __builtin_return_address(0)); > } > > int apply_relocate_add(Elf32_Shdr *sechdrs, const char *s I wonder if the (size == 0) check is really needed, but __vmalloc_node_range() will WARN on this case where the old code won't.
Re: [PATCH v2 0/6] watchdog/hardlockup: Cleanup configuration of hardlockup detectors
On Fri 2023-06-16 17:06:12, Petr Mladek wrote: > Hi, > > this patchset is supposed to replace the last patch in the patchset cleaning > up after introducing the buddy detector, see > https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid > > Changes against v1: > > + Better explained the C-like ordering in the 1st patch. > > + Squashed patches for splitting and renaming HAVE_NMI_WATCHDOG, > updated commit message with the history and more facts. > > + Updated comments about the sparc64 variant. It is not handled together > with the softlockup detector. In fact, it is always built. And it even > used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: > implement watchdog_nmi_enable and watchdog_nmi_disable") added in > v4.10-rc1. > > I realized this when updating the comment for the 4th patch. My original > statement in the v1 patchset was based on code reading. I looked at it from > a bad side. > > + Removed superfluous "default n" > + Fixed typos. I sometimes find the diff between the two versions useful. Especially when the patches are not trivial and the last version made only cosmetic changes.
This is what I got by comparing "git format-patch" generated patchsets: diff -purN watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/-cover-letter.patch watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/-cover-letter.patch --- watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/-cover-letter.patch 2023-06-16 16:42:07.769941775 +0200 +++ watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/-cover-letter.patch 2023-06-16 16:39:42.179877676 +0200 @@ -1,9 +1,33 @@ -From 0456ed568d98ba5bba8148e4f60d769e3c5a6c7a Mon Sep 17 00:00:00 2001 +From bcf4dfab5a64ee691eb5154b1361ed59610c9387 Mon Sep 17 00:00:00 2001 From: Petr Mladek -Date: Fri, 16 Jun 2023 16:42:07 +0200 -Subject: [PATCH 0/6] *** SUBJECT HERE *** +Date: Fri, 16 Jun 2023 16:28:13 +0200 +Subject: [PATCH v2 0/6] watchdog/hardlockup: Cleanup configuration of hardlockup detectors -*** BLURB HERE *** +Hi, + +this patchset is supposed to replace the last patch in the patchset cleaning +up after introducing the buddy detector, see +https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid + +Changes against v1: + + + Better explained the C-like ordering in the 1st patch. + + + Squashed patches for splitting and renaming HAVE_NMI_WATCHDOG, +updated commit message with the history and more facts. + + + Updated comments about the sparc64 variant. It is not handled together +with the softlockup detector. In fact, it is always build. And it even +used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: +implement watchdog_nmi_enable and watchdog_nmi_disable") added in +v4.10-rc1. + +I realized this when updating the comment for the 4th patch. My original +statement in v1 patchset was based on code reading. I looked at it from +a bad side. + + + Removed superfluous "default n" + + Fixed typos. 
Petr Mladek (6): watchdog/hardlockup: Sort hardlockup detector related config values a @@ -19,12 +43,12 @@ Petr Mladek (6): arch/powerpc/Kconfig | 5 +- arch/powerpc/include/asm/nmi.h | 2 - arch/sparc/Kconfig | 2 +- - arch/sparc/Kconfig.debug | 20 ++ + arch/sparc/Kconfig.debug | 14 arch/sparc/include/asm/nmi.h | 1 - include/linux/nmi.h| 14 ++-- kernel/watchdog.c | 2 +- - lib/Kconfig.debug | 115 +++-- - 9 files changed, 104 insertions(+), 74 deletions(-) + lib/Kconfig.debug | 114 ++--- + 9 files changed, 97 insertions(+), 74 deletions(-) -- 2.35.3 diff -purN watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch --- watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch 2023-06-16 16:42:07.741941379 +0200 +++ watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch 2023-06-16 16:28:53.594682369 +0200 @@ -1,7 +1,7 @@ -From 9d643e4254b224d22dd1411c51386ab686c052a7 Mon Sep 17 00:00:00 2001 +From bd7019bff3a28fb0bc163308101118adccf699d3 Mon Sep 17 00:00:00 2001 From: Petr Mladek Date: Thu, 1 Jun 2023 15:35:09 +0200 -Subject: [PATCH 1/6] watchdog/hardlockup: Sort hardlockup detector related +Subject: [PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related config values a logical way There are four possible variants of hardlockup detectors: @@ -40,6 +40,14 @@
[PATCH v2 6/6] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH
The HAVE_ prefix means that the code could be enabled. Add another variable for HAVE_HARDLOCKUP_DETECTOR_ARCH without this prefix. It will be set when it should be built. It will make it compatible with the other hardlockup detectors. The change allows to clean up dependencies of PPC_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_PERF definitions for powerpc. As a result HAVE_HARDLOCKUP_DETECTOR_PERF has the same dependencies on arm, x86, powerpc architectures. Signed-off-by: Petr Mladek Reviewed-by: Douglas Anderson --- arch/powerpc/Kconfig | 5 ++--- include/linux/nmi.h | 2 +- lib/Kconfig.debug| 9 + 3 files changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 539d1f03ff42..987e730740d7 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -90,8 +90,7 @@ config NMI_IPI config PPC_WATCHDOG bool - depends on HARDLOCKUP_DETECTOR - depends on HAVE_HARDLOCKUP_DETECTOR_ARCH + depends on HARDLOCKUP_DETECTOR_ARCH default y help This is a placeholder when the powerpc hardlockup detector @@ -240,7 +239,7 @@ config PPC select HAVE_GCC_PLUGINS if GCC_VERSION >= 50200 # plugin support on gcc <= 5.1 is buggy on PPC select HAVE_GENERIC_VDSO select HAVE_HARDLOCKUP_DETECTOR_ARCHif PPC_BOOK3S_64 && SMP - select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH + select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && HAVE_PERF_EVENTS_NMI select HAVE_HW_BREAKPOINT if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx) select HAVE_IOREMAP_PROT select HAVE_IRQ_TIME_ACCOUNTING diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 515d6724f469..ec808ebd36ba 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -9,7 +9,7 @@ #include /* Arch specific watchdogs might need to share extra watchdog-related APIs. 
*/ -#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64) +#if defined(CONFIG_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64) #include #endif diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index f285e9cf967a..2c4bb72e72ad 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1056,6 +1056,7 @@ config HARDLOCKUP_DETECTOR depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH imply HARDLOCKUP_DETECTOR_PERF imply HARDLOCKUP_DETECTOR_BUDDY + imply HARDLOCKUP_DETECTOR_ARCH select LOCKUP_DETECTOR help @@ -1101,6 +1102,14 @@ config HARDLOCKUP_DETECTOR_BUDDY depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER +config HARDLOCKUP_DETECTOR_ARCH + bool + depends on HARDLOCKUP_DETECTOR + depends on HAVE_HARDLOCKUP_DETECTOR_ARCH + help + The arch-specific implementation of the hardlockup detector will + be used. + # # Both the "perf" and "buddy" hardlockup detectors count hrtimer # interrupts. This config enables functions managing this common code. -- 2.35.3
[PATCH v2 5/6] watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64
The HAVE_ prefix means that the code could be enabled. Add another variable for HAVE_HARDLOCKUP_DETECTOR_SPARC64 without this prefix. It will be set when it should be built. It will make it compatible with the other hardlockup detectors. Before, it is far from obvious that the SPARC64 variant is actually used: $> make ARCH=sparc64 defconfig $> grep HARDLOCKUP_DETECTOR .config CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y After, it is more clear: $> make ARCH=sparc64 defconfig $> grep HARDLOCKUP_DETECTOR .config CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y CONFIG_HARDLOCKUP_DETECTOR_SPARC64=y Signed-off-by: Petr Mladek Reviewed-by: Douglas Anderson --- arch/sparc/Kconfig.debug | 7 ++- include/linux/nmi.h | 4 ++-- kernel/watchdog.c| 2 +- lib/Kconfig.debug| 2 +- 4 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/sparc/Kconfig.debug b/arch/sparc/Kconfig.debug index 4903b6847e43..37e003665de6 100644 --- a/arch/sparc/Kconfig.debug +++ b/arch/sparc/Kconfig.debug @@ -16,10 +16,15 @@ config FRAME_POINTER default y config HAVE_HARDLOCKUP_DETECTOR_SPARC64 - depends on HAVE_NMI bool + depends on HAVE_NMI + select HARDLOCKUP_DETECTOR_SPARC64 help Sparc64 hardlockup detector is the last one developed before adding the common infrastructure for handling hardlockup detectors. It is always built. It does _not_ use the common command line parameters and sysctl interface, except for /proc/sys/kernel/nmi_watchdog. + +config HARDLOCKUP_DETECTOR_SPARC64 + bool + depends on HAVE_HARDLOCKUP_DETECTOR_SPARC64 diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 7ee6c35d1f05..515d6724f469 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -9,7 +9,7 @@ #include /* Arch specific watchdogs might need to share extra watchdog-related APIs. 
*/ -#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64) +#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64) #include <asm/nmi.h> #endif @@ -92,7 +92,7 @@ static inline void hardlockup_detector_disable(void) {} #endif /* Sparc64 has special implementation that is always enabled. */ -#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64) +#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64) void arch_touch_nmi_watchdog(void); #else static inline void arch_touch_nmi_watchdog(void) { } diff --git a/kernel/watchdog.c b/kernel/watchdog.c index babd2f3c8b72..a2154e753cb4 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -29,7 +29,7 @@ static DEFINE_MUTEX(watchdog_mutex); -#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64) +#if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64) # define WATCHDOG_HARDLOCKUP_DEFAULT 1 #else # define WATCHDOG_HARDLOCKUP_DEFAULT 0 diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index e94664339e28..f285e9cf967a 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1052,7 +1052,7 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY # config HARDLOCKUP_DETECTOR bool "Detect Hard Lockups" - depends on DEBUG_KERNEL && !S390 && !HAVE_HARDLOCKUP_DETECTOR_SPARC64 + depends on DEBUG_KERNEL && !S390 && !HARDLOCKUP_DETECTOR_SPARC64 depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH imply HARDLOCKUP_DETECTOR_PERF imply HARDLOCKUP_DETECTOR_BUDDY -- 2.35.3
[PATCH v2 4/6] watchdog/hardlockup: Make HAVE_NMI_WATCHDOG sparc64-specific
There are several hardlockup detector implementations and several Kconfig values which allow selection and build of the preferred one. CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36. It was a preparation step for introducing the new generic perf hardlockup detector. The existing arch-specific variants did not support the to-be-created generic build configurations, sysctl interface, etc. This distinction was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog: Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR") in v2.6.38. CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105 ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced the above-mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used by three architectures, namely blackfin, mn10300, and sparc. Support for the blackfin and mn10300 architectures was completely dropped some time ago, and sparc is the only architecture with the historic NMI watchdog at the moment. And the old sparc implementation is really special. It is always built on sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added in v4.10-rc1. There are only a few locations where the sparc64 NMI watchdog interacts with the generic hardlockup detectors code: + implements arch_touch_nmi_watchdog() which is called from the generic touch_nmi_watchdog() + implements watchdog_hardlockup_enable()/disable() to support /proc/sys/kernel/nmi_watchdog + is always preferred over other generic watchdogs, see CONFIG_HARDLOCKUP_DETECTOR + includes asm/nmi.h into linux/nmi.h because some sparc-specific functions are needed in sparc-specific code which includes only linux/nmi.h.
The situation became more complicated after the commit 05a4a95279311c3 ("kernel/watchdog: split up config options") and commit 2104180a53698df5 ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1. They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for the powerpc-specific hardlockup detector. It was compatible with the perf one regarding the general boot, sysctl, and programming interfaces. HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of HAVE_NMI_WATCHDOG. It made some sense because all arch-specific detectors had some common requirements, namely: + implemented arch_touch_nmi_watchdog() + included asm/nmi.h into linux/nmi.h + defined the default value for /proc/sys/kernel/nmi_watchdog But it actually made things pretty complicated when the generic buddy hardlockup detector was added. Before, the generic perf detector was never supported together with an arch-specific one. But the buddy detector could work on any SMP system. It means that an architecture could support both the arch-specific and buddy detector. As a result, there are a few tricky dependencies. For example, CONFIG_HARDLOCKUP_DETECTOR depends on: ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH The problem is that the very special sparc implementation is defined as: HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear without understanding the history. Make the logic less tricky and more self-explanatory by making HAVE_NMI_WATCHDOG specific to the sparc64 implementation. And rename it to HAVE_HARDLOCKUP_DETECTOR_SPARC64. Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF, and HARDLOCKUP_DETECTOR_BUDDY may conflict only with HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR and it is no longer enabled when HAVE_NMI_WATCHDOG is set.
Signed-off-by: Petr Mladek watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 The configuration variable HAVE_NMI_WATCHDOG has a generic name but it is selected only for SPARC64. It should _not_ be used in general because it is not integrated with the other hardlockup detectors. Namely, it does not support the hardlockup-specific command line parameters and sysctl interface. Instead, it is enabled/disabled together with the softlockup detector by the global "watchdog" sysctl. Rename it to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 to make the special behavior clearer. Also the variable is set only on sparc64. Move the definition from arch/Kconfig to arch/sparc/Kconfig.debug. Signed-off-by: Petr Mladek --- arch/Kconfig | 18 -- arch/sparc/Kconfig | 2 +- arch/sparc/Kconfig.debug | 9 + include/linux/nmi.h | 5 ++--- kernel/watchdog.c| 2 +- lib/Kconfig.debug| 15 +-- 6 files changed, 18 insertions(+), 33 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index
[PATCH v2 3/6] watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h
arch_touch_nmi_watchdog() needs a different implementation for various hardlockup detector implementations. And it does nothing when no hardlockup detector is built at all. arch_touch_nmi_watchdog() is declared via linux/nmi.h. And it must be defined as an empty function when there is no hardlockup detector. It is done directly in this header file for the perf and buddy detectors. And it is done in the included asm/nmi.h for arch-specific detectors. The reason probably is that the arch-specific variants build the code under different conditions. For example, powerpc builds the code when CONFIG_PPC_WATCHDOG is enabled. Another reason might be that these architectures define more functions in asm/nmi.h anyway. However the generic code actually knows when the function will be implemented. It happens when some full featured or the sparc64-specific hardlockup detector is built. In particular, CONFIG_HARDLOCKUP_DETECTOR can be enabled only when a generic or arch-specific full featured hardlockup detector is available. The only exception is sparc64 which can be built even when the global HARDLOCKUP_DETECTOR switch is disabled. The information about sparc64 is a bit complicated. The hardlockup detector is built there when CONFIG_HAVE_NMI_WATCHDOG is set and CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. People might wonder whether this change really makes things easier. The motivation is: + The current logic in linux/nmi.h is far from obvious. For example, arch_touch_nmi_watchdog() is defined as {} when neither CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER nor CONFIG_HAVE_NMI_WATCHDOG is defined. + The change synchronizes the checks in lib/Kconfig.debug and in the generic code. + It is a step that will help clean up HAVE_NMI_WATCHDOG-related checks. The change should not change the existing behavior.
Signed-off-by: Petr Mladek Reviewed-by: Douglas Anderson --- arch/powerpc/include/asm/nmi.h | 2 -- arch/sparc/include/asm/nmi.h | 1 - include/linux/nmi.h| 13 ++--- 3 files changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h index 43bfd4de868f..ce25318c3902 100644 --- a/arch/powerpc/include/asm/nmi.h +++ b/arch/powerpc/include/asm/nmi.h @@ -3,11 +3,9 @@ #define _ASM_NMI_H #ifdef CONFIG_PPC_WATCHDOG -extern void arch_touch_nmi_watchdog(void); long soft_nmi_interrupt(struct pt_regs *regs); void watchdog_hardlockup_set_timeout_pct(u64 pct); #else -static inline void arch_touch_nmi_watchdog(void) {} static inline void watchdog_hardlockup_set_timeout_pct(u64 pct) {} #endif diff --git a/arch/sparc/include/asm/nmi.h b/arch/sparc/include/asm/nmi.h index 90ee7863d9fe..920dc23f443f 100644 --- a/arch/sparc/include/asm/nmi.h +++ b/arch/sparc/include/asm/nmi.h @@ -8,7 +8,6 @@ void nmi_adjust_hz(unsigned int new_hz); extern atomic_t nmi_active; -void arch_touch_nmi_watchdog(void); void start_nmi_watchdog(void *unused); void stop_nmi_watchdog(void *unused); diff --git a/include/linux/nmi.h b/include/linux/nmi.h index b5d0b7ab52fb..b9e816bde14a 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -7,6 +7,8 @@ #include #include + +/* Arch specific watchdogs might need to share extra watchdog-related APIs. */ #if defined(CONFIG_HAVE_NMI_WATCHDOG) #include <asm/nmi.h> #endif @@ -89,12 +91,17 @@ extern unsigned int hardlockup_panic; static inline void hardlockup_detector_disable(void) {} #endif -#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER) +/* Sparc64 has special implementation that is always enabled.
*/ +#if defined(CONFIG_HARDLOCKUP_DETECTOR) || \ +(defined(CONFIG_HAVE_NMI_WATCHDOG) && !defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH)) void arch_touch_nmi_watchdog(void); +#else +static inline void arch_touch_nmi_watchdog(void) { } +#endif + +#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER) void watchdog_hardlockup_touch_cpu(unsigned int cpu); void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs); -#elif !defined(CONFIG_HAVE_NMI_WATCHDOG) -static inline void arch_touch_nmi_watchdog(void) { } #endif #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) -- 2.35.3
[PATCH v2 2/6] watchdog/hardlockup: Make the config checks more straightforward
There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. The check for the sparc64 variant is more complicated because HAVE_NMI_WATCHDOG is used to #ifdef code used by both the arch-specific and the sparc64-specific variants. Therefore it is automatically selected with HAVE_HARDLOCKUP_DETECTOR_ARCH. This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH. It reduces the size of some checks but it makes them harder to follow. Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is enabled/disabled. Make the logic more straightforward by the following changes: + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and HAVE_NMI_WATCHDOG in comments. + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is a separate HAVE_* for all four hardlockup detector variants. Use it in the other conditions instead of SMP. It makes it clear that it is about the buddy detector. + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand the dependencies between the four hardlockup detector variants. + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY can be enabled. It explains the dependency on the other hardlockup detector variants. Also it allows removing HARDLOCKUP_DETECTOR_NON_ARCH by using "imply". It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when the global HARDLOCKUP_DETECTOR switch is changed. + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables disappear when the hardlockup detectors are disabled.
Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY value is not preserved when the global switch is disabled. The user has to make the decision again when it gets re-enabled. Signed-off-by: Petr Mladek --- arch/Kconfig | 23 +- lib/Kconfig.debug | 62 +++ 2 files changed, 53 insertions(+), 32 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 422f0ffa269e..77e5af5fda3f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -404,17 +404,28 @@ config HAVE_NMI_WATCHDOG depends on HAVE_NMI bool help - The arch provides a low level NMI watchdog. It provides - asm/nmi.h, and defines its own watchdog_hardlockup_probe() and - arch_touch_nmi_watchdog(). + The arch provides its own hardlockup detector implementation instead + of the generic ones. + + Sparc64 defines this variable without HAVE_HARDLOCKUP_DETECTOR_ARCH. + It is the last arch-specific implementation which was developed before + adding the common infrastructure for handling hardlockup detectors. + It is always built. It does _not_ use the common command line + parameters and sysctl interface, except for + /proc/sys/kernel/nmi_watchdog. config HAVE_HARDLOCKUP_DETECTOR_ARCH bool select HAVE_NMI_WATCHDOG help - The arch chooses to provide its own hardlockup detector, which is - a superset of the HAVE_NMI_WATCHDOG. It also conforms to config - interfaces and parameters provided by hardlockup detector subsystem. + The arch provides its own hardlockup detector implementation instead + of the generic ones. + + It uses the same command line parameters, and sysctl interface, + as the generic hardlockup detectors. + + HAVE_NMI_WATCHDOG is selected to build the code shared with + the sparc64 specific implementation. config HAVE_PERF_REGS bool diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 3e91fa33c7a0..a0b0c4decb89 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1035,16 +1035,33 @@ config BOOTPARAM_SOFTLOCKUP_PANIC Say N if unsure. 
+config HAVE_HARDLOCKUP_DETECTOR_BUDDY + bool + depends on SMP + default y + # -# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard -# lockup detector rather than the perf based detector. +# Global switch whether to build a hardlockup detector at all. It is available +# only when the architecture supports at least one implementation. There are +# two exceptions. The hardlockup detector is never enabled on: +# +# s390: it reported many false positives there +# +# sparc64: has a custom implementation which is not using the common +# hardlockup command line options and sysctl interface. +# +# Note that HAVE_NMI_WATCHDOG is used to distinguish the
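The intended shape of the reworked options can be sketched as a simplified Kconfig fragment (abbreviated, not the literal lib/Kconfig.debug text). The key design choice is `imply` rather than `select`: `imply` only suggests a default, which gets re-evaluated whenever HARDLOCKUP_DETECTOR is toggled, whereas `select` would force the implied symbol on unconditionally:

```kconfig
# Simplified sketch of the structure described above -- illustrative only.

config HARDLOCKUP_DETECTOR
	bool "Detect Hard Lockups"
	depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY
	imply HARDLOCKUP_DETECTOR_PERF
	imply HARDLOCKUP_DETECTOR_BUDDY

config HARDLOCKUP_DETECTOR_PERF
	bool
	depends on HARDLOCKUP_DETECTOR
	depends on HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY

config HARDLOCKUP_DETECTOR_BUDDY
	bool
	depends on HARDLOCKUP_DETECTOR
	depends on HAVE_HARDLOCKUP_DETECTOR_BUDDY
	depends on !HARDLOCKUP_DETECTOR_PERF
```

Because each final HARDLOCKUP_DETECTOR_* symbol carries its own exact `depends on` line, the HARDLOCKUP_DETECTOR_NON_ARCH helper symbol becomes unnecessary.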
[PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related config values in a logical way
There are four possible variants of hardlockup detectors: + buddy: available when SMP is set. + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set. + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set. + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set. Only one hardlockup detector can be compiled in. The selection is done using quite complex dependencies between several CONFIG variables. The following patches will try to make it more straightforward. As a first step, reorder the definitions of the various CONFIG variables. The logical order is: 1. HAVE_* variables define available variants. They are typically defined in the arch/ config files. 2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup detector is enabled at all. 3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether the buddy detector should be preferred over the perf one. Note that the arch-specific variants are always preferred when available. 4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given detector is enabled in the end. 5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH are temporary variables that are going to be removed in a followup patch. This is a preparation step for further cleanup. It will change the logic without shuffling the definitions. This change temporarily breaks the C-like ordering where the variables are declared or defined before they are used. It is not really needed for Kconfig. Also the following patches will rework the logic so that the ordering will be C-like in the end. The patch just shuffles the definitions. It should not change the existing behavior.
Signed-off-by: Petr Mladek --- lib/Kconfig.debug | 112 +++--- 1 file changed, 56 insertions(+), 56 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ed7b01c4bd41..3e91fa33c7a0 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1035,62 +1035,6 @@ config BOOTPARAM_SOFTLOCKUP_PANIC Say N if unsure. -# Both the "perf" and "buddy" hardlockup detectors count hrtimer -# interrupts. This config enables functions managing this common code. -config HARDLOCKUP_DETECTOR_COUNTS_HRTIMER - bool - select SOFTLOCKUP_DETECTOR - -config HARDLOCKUP_DETECTOR_PERF - bool - depends on HAVE_HARDLOCKUP_DETECTOR_PERF - select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER - -config HARDLOCKUP_DETECTOR_BUDDY - bool - depends on SMP - select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER - -# For hardlockup detectors you can have one directly provided by the arch -# or use a "non-arch" one. If you're using a "non-arch" one that is -# further divided the perf hardlockup detector (which, confusingly, needs -# arch-provided perf support) and the buddy hardlockup detector (which just -# needs SMP). In either case, using the "non-arch" code conflicts with -# the NMI watchdog code (which is sometimes used directly and sometimes used -# by the arch-provided hardlockup detector). -config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH - bool - depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG - default y - -config HARDLOCKUP_DETECTOR_PREFER_BUDDY - bool "Prefer the buddy CPU hardlockup detector" - depends on HAVE_HARDLOCKUP_DETECTOR_PERF && SMP - help - Say Y here to prefer the buddy hardlockup detector over the perf one. - - With the buddy detector, each CPU uses its softlockup hrtimer - to check that the next CPU is processing hrtimer interrupts by - verifying that a counter is increasing. - - This hardlockup detector is useful on systems that don't have - an arch-specific hardlockup detector or if resources needed - for the hardlockup detector are better used for other things. 
- -# This will select the appropriate non-arch hardlockdup detector -config HARDLOCKUP_DETECTOR_NON_ARCH - bool - depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH - select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY - select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY - -# -# Enables a timestamp based low pass filter to compensate for perf based -# hard lockup detection which runs too fast due to turbo modes. -# -config HARDLOCKUP_CHECK_TIMESTAMP - bool - # # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard # lockup detector rather than the perf based detector. @@ -,6 +1055,62 @@ config HARDLOCKUP_DETECTOR chance to run. The current stack trace is displayed upon detection and the system will stay locked up. +config HARDLOCKUP_DETECTOR_PREFER_BUDDY + bool "Prefer the buddy CPU hardlockup detector"
[PATCH v2 0/6] watchdog/hardlockup: Cleanup configuration of hardlockup detectors
Hi, this patchset is supposed to replace the last patch in the patchset cleaning up after introducing the buddy detector, see https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid Changes against v1: + Better explained the C-like ordering in the 1st patch. + Squashed patches for splitting and renaming HAVE_NMI_WATCHDOG, updated commit message with the history and more facts. + Updated comments about the sparc64 variant. It is not handled together with the softlockup detector. In fact, it is always built. And it even used to be always enabled until the commit 7a5c8b57cec93196b ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added in v4.10-rc1. I realized this when updating the comment for the 4th patch. My original statement in the v1 patchset was based on code reading, and I looked at it from the wrong side. + Removed superfluous "default n" + Fixed typos. Petr Mladek (6): watchdog/hardlockup: Sort hardlockup detector related config values in a logical way watchdog/hardlockup: Make the config checks more straightforward watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h watchdog/hardlockup: Make HAVE_NMI_WATCHDOG sparc64-specific watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64 watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH arch/Kconfig | 17 ++--- arch/powerpc/Kconfig | 5 +- arch/powerpc/include/asm/nmi.h | 2 - arch/sparc/Kconfig | 2 +- arch/sparc/Kconfig.debug | 14 arch/sparc/include/asm/nmi.h | 1 - include/linux/nmi.h| 14 ++-- kernel/watchdog.c | 2 +- lib/Kconfig.debug | 114 ++--- 9 files changed, 97 insertions(+), 74 deletions(-) -- 2.35.3
Re: [RFC PATCH v1 3/3] powerpc: WIP draft support to objtool check
Few comments.. On Fri, Jun 16, 2023 at 03:47:52PM +0200, Christophe Leroy wrote: > diff --git a/tools/objtool/check.c b/tools/objtool/check.c > index 0fcf99c91400..f945fe271706 100644 > --- a/tools/objtool/check.c > +++ b/tools/objtool/check.c > @@ -236,6 +236,7 @@ static bool __dead_end_function(struct objtool_file > *file, struct symbol *func, > "x86_64_start_reservations", > "xen_cpu_bringup_again", > "xen_start_kernel", > + "longjmp", > }; > > if (!func) > @@ -2060,13 +2061,12 @@ static int add_jump_table(struct objtool_file *file, > struct instruction *insn, >* instruction. >*/ > list_for_each_entry_from(reloc, >sec->reloc_list, list) { > - > /* Check for the end of the table: */ > if (reloc != table && reloc->jump_table_start) > break; > > /* Make sure the table entries are consecutive: */ > - if (prev_offset && reloc->offset != prev_offset + 8) > + if (prev_offset && reloc->offset != prev_offset + 4) Do we want a global variable (from elf.c) called elf_sizeof_long or so? > break; > > /* Detect function pointers from contiguous objects: */ > @@ -2074,7 +2074,10 @@ static int add_jump_table(struct objtool_file *file, > struct instruction *insn, > reloc->addend == pfunc->offset) > break; > > - dest_insn = find_insn(file, reloc->sym->sec, reloc->addend); > + if (table->jump_table_is_rel) > + dest_insn = find_insn(file, reloc->sym->sec, > reloc->addend + table->offset - reloc->offset); > + else > + dest_insn = find_insn(file, reloc->sym->sec, > reloc->addend); offset = reloc->addend; if (table->jump_table_is_rel) offset += table->offset - reloc->offset; dest_insn = find_insn(file, reloc->sym->sec, offset); perhaps? 
> if (!dest_insn) > break; > > @@ -4024,6 +4022,11 @@ static bool ignore_unreachable_insn(struct > objtool_file *file, struct instructio > if (insn->ignore || insn->type == INSN_NOP || insn->type == INSN_TRAP) > return true; > > + /* powerpc relocatable files have a word in front of each relocatable > function */ > + if ((file->elf->ehdr.e_machine == EM_PPC || file->elf->ehdr.e_machine > == EM_PPC64) && > + (file->elf->ehdr.e_flags & EF_PPC_RELOCATABLE_LIB) && > + insn_func(next_insn_same_sec(file, insn))) > + return true; Can't you simply decode that word to INSN_NOP or so?
[RFC PATCH v1 1/3] Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"
This reverts commit 1e688dd2a3d6759d416616ff07afc4bb836c4213. That commit aimed at optimising the code around generation of WARN_ON/BUG_ON but this leads to a lot of dead code erroneously generated by GCC. text data bss dec hex filename 9551585 3627834 224376 13403795 cc8693 vmlinux.before 9535281 3628358 224376 13388015 cc48ef vmlinux.after Once this change is reverted, in a standard configuration (pmac32 + function tracer) the text is reduced by 16k which is around 1.7% Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/kup.h | 2 +- arch/powerpc/include/asm/bug.h| 67 +++ arch/powerpc/include/asm/extable.h| 14 arch/powerpc/include/asm/ppc_asm.h| 11 ++- arch/powerpc/kernel/misc_32.S | 2 +- arch/powerpc/kernel/traps.c | 9 +-- .../powerpc/primitives/asm/extable.h | 1 - 7 files changed, 25 insertions(+), 81 deletions(-) delete mode 12 tools/testing/selftests/powerpc/primitives/asm/extable.h diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 54cf46808157..c82323b864e1 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -90,7 +90,7 @@ /* Prevent access to userspace using any key values */ LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED) 999: tdne\gpr1, \gpr2 - EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) + EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67) #endif .endm diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h index ef42adb44aa3..a565995fb742 100644 --- a/arch/powerpc/include/asm/bug.h +++ b/arch/powerpc/include/asm/bug.h @@ -4,14 +4,13 @@ #ifdef __KERNEL__ #include -#include #ifdef CONFIG_BUG #ifdef __ASSEMBLY__ #include #ifdef CONFIG_DEBUG_BUGVERBOSE -.macro __EMIT_BUG_ENTRY addr,file,line,flags +.macro EMIT_BUG_ENTRY addr,file,line,flags .section __bug_table,"aw" 5001: .4byte \addr - . .4byte 5002f - . 
@@ -23,7 +22,7 @@ .previous .endm #else -.macro __EMIT_BUG_ENTRY addr,file,line,flags +.macro EMIT_BUG_ENTRY addr,file,line,flags .section __bug_table,"aw" 5001: .4byte \addr - . .short \flags @@ -32,18 +31,6 @@ .endm #endif /* verbose */ -.macro EMIT_WARN_ENTRY addr,file,line,flags - EX_TABLE(\addr,\addr+4) - __EMIT_BUG_ENTRY \addr,\file,\line,\flags -.endm - -.macro EMIT_BUG_ENTRY addr,file,line,flags - .if \flags & 1 /* BUGFLAG_WARNING */ - .err /* Use EMIT_WARN_ENTRY for warnings */ - .endif - __EMIT_BUG_ENTRY \addr,\file,\line,\flags -.endm - #else /* !__ASSEMBLY__ */ /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and sizeof(struct bug_entry), respectively */ @@ -73,16 +60,6 @@ "i" (sizeof(struct bug_entry)), \ ##__VA_ARGS__) -#define WARN_ENTRY(insn, flags, label, ...)\ - asm_volatile_goto( \ - "1: " insn "\n" \ - EX_TABLE(1b, %l[label]) \ - _EMIT_BUG_ENTRY \ - : : "i" (__FILE__), "i" (__LINE__), \ - "i" (flags), \ - "i" (sizeof(struct bug_entry)), \ - ##__VA_ARGS__ : : label) - /* * BUG_ON() and WARN_ON() do their best to cooperate with compile-time * optimisations. However depending on the complexity of the condition @@ -95,16 +72,7 @@ } while (0) #define HAVE_ARCH_BUG -#define __WARN_FLAGS(flags) do { \ - __label__ __label_warn_on; \ - \ - WARN_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags), __label_warn_on); \ - barrier_before_unreachable(); \ - __builtin_unreachable();\ - \ -__label_warn_on: \ - break; \ -} while (0) +#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags)) #ifdef CONFIG_PPC64 #define BUG_ON(x) do { \ @@ -117,25 +85,15 @@ __label_warn_on: \ } while (0) #define WARN_ON(x) ({ \ - bool __ret_warn_on = false; \ - do {\ - if
[RFC PATCH v1 3/3] powerpc: WIP draft support to objtool check
This messy draft patch is a first try at adding support for objtool checks on powerpc. This is in preparation for doing uaccess validation for powerpc. For the time being, this is implemented for PPC32 only, eventually breaking support for other targets. Will be reworked to be more generic once a final working status has been achieved. All assembly files have been deactivated as they require huge work and are not really needed in the first place for uaccess validation. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 1 + scripts/Makefile.lib | 2 +- tools/objtool/arch/powerpc/decode.c | 60 +-- .../arch/powerpc/include/arch/special.h | 2 +- tools/objtool/arch/powerpc/special.c | 44 +- tools/objtool/check.c | 29 + tools/objtool/include/objtool/elf.h | 1 + tools/objtool/include/objtool/special.h | 2 +- 8 files changed, 118 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 542be1c3c315..3bd244784af1 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -259,6 +259,7 @@ config PPC select HAVE_OPTPROBES select HAVE_OBJTOOL if PPC32 || MPROFILE_KERNEL select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL + select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL select HAVE_PERF_EVENTS select HAVE_PERF_EVENTS_NMI if PPC64 select HAVE_PERF_REGS diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib index 100a386fcd71..298e2656e911 100644 --- a/scripts/Makefile.lib +++ b/scripts/Makefile.lib @@ -267,7 +267,7 @@ objtool-args-$(CONFIG_RETHUNK) += --rethunk objtool-args-$(CONFIG_SLS) += --sls objtool-args-$(CONFIG_STACK_VALIDATION)+= --stackval objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE) += --static-call -objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess +objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess --sec-address objtool-args-$(CONFIG_GCOV_KERNEL) += --no-unreachable objtool-args-$(CONFIG_PREFIX_SYMBOLS) += --prefix=$(CONFIG_FUNCTION_PADDING_BYTES) diff --git a/tools/objtool/arch/powerpc/decode.c
b/tools/objtool/arch/powerpc/decode.c index 53b55690f320..e95c0470e34b 100644 --- a/tools/objtool/arch/powerpc/decode.c +++ b/tools/objtool/arch/powerpc/decode.c @@ -43,24 +43,72 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec unsigned long offset, unsigned int maxlen, struct instruction *insn) { - unsigned int opcode; + unsigned int opcode, xop; + unsigned int rs, ra, rb, bo, bi, to, uimm, l; enum insn_type typ; unsigned long imm; u32 ins; ins = bswap_if_needed(file->elf, *(u32 *)(sec->data->d_buf + offset)); opcode = ins >> 26; - typ = INSN_OTHER; - imm = 0; + xop = (ins >> 1) & 0x3ff; + rs = bo = to = (ins >> 21) & 0x1f; + ra = bi = (ins >> 16) & 0x1f; + rb = (ins >> 11) & 0x1f; + uimm = (ins >> 0) & 0x; + l = ins & 1; switch (opcode) { + case 16: /* bc[l][a] */ + if (ins & 1)/* bcl[a] */ + typ = INSN_OTHER; + else/* bc[a] */ + typ = INSN_JUMP_CONDITIONAL; + + imm = ins & 0xfffc; + if (imm & 0x8000) + imm -= 0x1; + imm |= ins & 2; /* AA flag */ + insn->immediate = imm; + break; case 18: /* b[l][a] */ - if ((ins & 3) == 1) /* bl */ + if (ins & 1)/* bl[a] */ typ = INSN_CALL; + else/* b[a] */ + typ = INSN_JUMP_UNCONDITIONAL; imm = ins & 0x3fc; if (imm & 0x200) imm -= 0x400; + imm |= ins & 2; /* AA flag */ + insn->immediate = imm; + break; + case 19: + if (xop == 16 && bo == 20 && bi == 0) /* blr */ + typ = INSN_RETURN; + else if (xop == 50) /* rfi */ + typ = INSN_JUMP_DYNAMIC; + else if (xop == 528 && bo == 20 && bi ==0 && !l)/* bctr */ + typ = INSN_JUMP_DYNAMIC; + else if (xop == 528 && bo == 20 && bi ==0 && l) /* bctrl */ + typ = INSN_CALL_DYNAMIC; + else + typ = INSN_OTHER; + break; + case 24: + if (rs == 0 && ra == 0 && uimm == 0) + typ = INSN_NOP; +
[RFC PATCH v1 2/3] powerpc: Mark all .S files invalid for objtool
A lot of work is required in .S files in order to get them ready for objtool checks. For the time being, exclude them from the checks. This is done with the script below: #!/bin/sh DIRS=`find arch/powerpc -name "*.S" -exec dirname {} \; | sort | uniq` for d in $DIRS do pushd $d echo >> Makefile for f in *.S do echo "OBJECT_FILES_NON_STANDARD_$f := y" | sed s/"\.S"/".o"/g done >> Makefile popd done Signed-off-by: Christophe Leroy --- arch/powerpc/boot/Makefile | 17 + arch/powerpc/crypto/Makefile | 13 +++ arch/powerpc/kernel/Makefile | 44 ++ arch/powerpc/kernel/trace/Makefile | 4 ++ arch/powerpc/kernel/vdso/Makefile | 11 ++ arch/powerpc/kexec/Makefile| 2 + arch/powerpc/kvm/Makefile | 13 +++ arch/powerpc/lib/Makefile | 25 arch/powerpc/mm/book3s32/Makefile | 3 ++ arch/powerpc/mm/nohash/Makefile| 3 ++ arch/powerpc/perf/Makefile | 2 + arch/powerpc/platforms/44x/Makefile| 2 + arch/powerpc/platforms/52xx/Makefile | 3 ++ arch/powerpc/platforms/83xx/Makefile | 2 + arch/powerpc/platforms/cell/spufs/Makefile | 3 ++ arch/powerpc/platforms/pasemi/Makefile | 2 + arch/powerpc/platforms/powermac/Makefile | 3 ++ arch/powerpc/platforms/powernv/Makefile| 3 ++ arch/powerpc/platforms/ps3/Makefile| 2 + arch/powerpc/platforms/pseries/Makefile| 2 + arch/powerpc/purgatory/Makefile| 3 ++ arch/powerpc/sysdev/Makefile | 3 ++ arch/powerpc/xmon/Makefile | 3 ++ 23 files changed, 168 insertions(+) diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile index 771b79423bbc..c046eb9d341e 100644 --- a/arch/powerpc/boot/Makefile +++ b/arch/powerpc/boot/Makefile @@ -513,3 +513,20 @@ $(wrapper-installed): $(DESTDIR)$(WRAPPER_BINDIR) $(srctree)/$(obj)/wrapper | $( $(call cmd,install_wrapper) $(obj)/bootwrapper_install: $(all-installed) + +OBJECT_FILES_NON_STANDARD_crt0.o := y +OBJECT_FILES_NON_STANDARD_crtsavres.o := y +OBJECT_FILES_NON_STANDARD_div64.o := y +OBJECT_FILES_NON_STANDARD_fixed-head.o := y +OBJECT_FILES_NON_STANDARD_gamecube-head.o := y +OBJECT_FILES_NON_STANDARD_motload-head.o 
:= y +OBJECT_FILES_NON_STANDARD_opal-calls.o := y +OBJECT_FILES_NON_STANDARD_ps3-head.o := y +OBJECT_FILES_NON_STANDARD_ps3-hvcall.o := y +OBJECT_FILES_NON_STANDARD_pseries-head.o := y +OBJECT_FILES_NON_STANDARD_string.o := y +OBJECT_FILES_NON_STANDARD_util.o := y +OBJECT_FILES_NON_STANDARD_wii-head.o := y +OBJECT_FILES_NON_STANDARD_zImage.coff.lds.o := y +OBJECT_FILES_NON_STANDARD_zImage.lds.o := y +OBJECT_FILES_NON_STANDARD_zImage.ps3.lds.o := y diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile index 7b4f516abec1..f0381d137b06 100644 --- a/arch/powerpc/crypto/Makefile +++ b/arch/powerpc/crypto/Makefile @@ -34,3 +34,16 @@ $(obj)/aesp10-ppc.S $(obj)/ghashp10-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE OBJECT_FILES_NON_STANDARD_aesp10-ppc.o := y OBJECT_FILES_NON_STANDARD_ghashp10-ppc.o := y + +OBJECT_FILES_NON_STANDARD_aes-gcm-p10.o := y +OBJECT_FILES_NON_STANDARD_aes-spe-core.o := y +OBJECT_FILES_NON_STANDARD_aes-spe-keys.o := y +OBJECT_FILES_NON_STANDARD_aes-spe-modes.o := y +OBJECT_FILES_NON_STANDARD_aes-tab-4k.o := y +OBJECT_FILES_NON_STANDARD_crc32c-vpmsum_asm.o := y +OBJECT_FILES_NON_STANDARD_crc32-vpmsum_core.o := y +OBJECT_FILES_NON_STANDARD_crct10dif-vpmsum_asm.o := y +OBJECT_FILES_NON_STANDARD_md5-asm.o := y +OBJECT_FILES_NON_STANDARD_sha1-powerpc-asm.o := y +OBJECT_FILES_NON_STANDARD_sha1-spe-asm.o := y +OBJECT_FILES_NON_STANDARD_sha256-spe-asm.o := y diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 9bf2be123093..19a2c83645e1 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -229,3 +229,47 @@ $(obj)/vdso64_wrapper.o : $(obj)/vdso/vdso64.so.dbg # for cleaning subdir- += vdso + +OBJECT_FILES_NON_STANDARD_85xx_entry_mapping.o := y +OBJECT_FILES_NON_STANDARD_cpu_setup_44x.o := y +OBJECT_FILES_NON_STANDARD_cpu_setup_6xx.o := y +OBJECT_FILES_NON_STANDARD_cpu_setup_e500.o := y +OBJECT_FILES_NON_STANDARD_cpu_setup_pa6t.o := y +OBJECT_FILES_NON_STANDARD_cpu_setup_ppc970.o := y 
+OBJECT_FILES_NON_STANDARD_entry_32.o := y +OBJECT_FILES_NON_STANDARD_entry_64.o := y +OBJECT_FILES_NON_STANDARD_epapr_hcalls.o := y +OBJECT_FILES_NON_STANDARD_exceptions-64e.o := y +OBJECT_FILES_NON_STANDARD_exceptions-64s.o := y +OBJECT_FILES_NON_STANDARD_fpu.o := y +OBJECT_FILES_NON_STANDARD_head_40x.o := y +OBJECT_FILES_NON_STANDARD_head_44x.o := y +OBJECT_FILES_NON_STANDARD_head_64.o := y +OBJECT_FILES_NON_STANDARD_head_85xx.o := y +OBJECT_FILES_NON_STANDARD_head_8xx.o := y
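The generator script quoted in the patch description above has been flattened by the archive. A runnable reconstruction is sketched below; it swaps the bash-only pushd/popd for a POSIX subshell and runs against a throwaway demo tree created with mktemp rather than arch/powerpc, so the file names here are illustrative stand-ins only:

```shell
#!/bin/sh
# Reconstruction of the generator script from the patch description: for
# every *.S file, append "OBJECT_FILES_NON_STANDARD_<file>.o := y" to the
# Makefile in that file's directory.

# Self-contained demo tree (the original ran over arch/powerpc).
ROOT=$(mktemp -d)
mkdir -p "$ROOT/kernel"
touch "$ROOT/kernel/entry_32.S" "$ROOT/kernel/head_64.S" "$ROOT/kernel/Makefile"

DIRS=$(find "$ROOT" -name "*.S" -exec dirname {} \; | sort -u)
for d in $DIRS
do
	(
	cd "$d" || exit 1
	echo >> Makefile
	for f in *.S
	do
		# Rewrite the .S suffix to .o, as in the original sed.
		echo "OBJECT_FILES_NON_STANDARD_$f := y" | sed 's/\.S/.o/g'
	done >> Makefile
	)
done

cat "$ROOT/kernel/Makefile"
```

Each affected Makefile ends up with one OBJECT_FILES_NON_STANDARD_<file>.o := y line per assembly file, which is exactly the shape of the additions shown in the diff above.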
[RFC PATCH v1 0/3] powerpc/objtool: First step towards uaccess validation (v1)
This RFC is a first step towards the validation of userspace accesses. For the time being it targets only PPC32 and includes hacks directly in the core part of objtool. It doesn't yet include handling of uaccess at all, but is a first step towards supporting objtool validation. Assembly files have been kept aside as they require a huge amount of work before being ready for objtool validation and are not directly relevant for uaccess validation. Please have a look and shout if I'm going in the wrong direction. For the few hacks done directly in the core part of objtool, don't hesitate to suggest ways to make them more generic.

Christophe Leroy (3):
  Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"
  powerpc: Mark all .S files invalid for objtool
  powerpc: WIP draft support to objtool check

 arch/powerpc/Kconfig                          |  1 +
 arch/powerpc/boot/Makefile                    | 17 +
 arch/powerpc/crypto/Makefile                  | 13
 arch/powerpc/include/asm/book3s/64/kup.h      |  2 +-
 arch/powerpc/include/asm/bug.h                | 67 +++
 arch/powerpc/include/asm/extable.h            | 14
 arch/powerpc/include/asm/ppc_asm.h            | 11 ++-
 arch/powerpc/kernel/Makefile                  | 44
 arch/powerpc/kernel/misc_32.S                 |  2 +-
 arch/powerpc/kernel/trace/Makefile            |  4 ++
 arch/powerpc/kernel/traps.c                   |  9 +--
 arch/powerpc/kernel/vdso/Makefile             | 11 +++
 arch/powerpc/kexec/Makefile                   |  2 +
 arch/powerpc/kvm/Makefile                     | 13
 arch/powerpc/lib/Makefile                     | 25 +++
 arch/powerpc/mm/book3s32/Makefile             |  3 +
 arch/powerpc/mm/nohash/Makefile               |  3 +
 arch/powerpc/perf/Makefile                    |  2 +
 arch/powerpc/platforms/44x/Makefile           |  2 +
 arch/powerpc/platforms/52xx/Makefile          |  3 +
 arch/powerpc/platforms/83xx/Makefile          |  2 +
 arch/powerpc/platforms/cell/spufs/Makefile    |  3 +
 arch/powerpc/platforms/pasemi/Makefile        |  2 +
 arch/powerpc/platforms/powermac/Makefile      |  3 +
 arch/powerpc/platforms/powernv/Makefile       |  3 +
 arch/powerpc/platforms/ps3/Makefile           |  2 +
 arch/powerpc/platforms/pseries/Makefile       |  2 +
 arch/powerpc/purgatory/Makefile               |  3 +
 arch/powerpc/sysdev/Makefile                  |  3 +
 arch/powerpc/xmon/Makefile                    |  3 +
 scripts/Makefile.lib                          |  2 +-
 tools/objtool/arch/powerpc/decode.c           | 60 +++--
 .../arch/powerpc/include/arch/special.h       |  2 +-
 tools/objtool/arch/powerpc/special.c          | 44 +++-
 tools/objtool/check.c                         | 29
 tools/objtool/include/objtool/elf.h           |  1 +
 tools/objtool/include/objtool/special.h       |  2 +-
 .../powerpc/primitives/asm/extable.h          |  1 -
 38 files changed, 311 insertions(+), 104 deletions(-)
 delete mode 12 tools/testing/selftests/powerpc/primitives/asm/extable.h

-- 
2.40.1
Re: [PATCH v4 04/34] pgtable: Create struct ptdesc
On Mon, Jun 12, 2023 at 02:03:53PM -0700, Vishal Moola (Oracle) wrote: > Currently, page table information is stored within struct page. As part > of simplifying struct page, create struct ptdesc for page table > information. > > Signed-off-by: Vishal Moola (Oracle) > --- > include/linux/pgtable.h | 51 + > 1 file changed, 51 insertions(+) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index c5a51481bbb9..330de96ebfd6 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct > vm_area_struct *vma, > #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */ > #endif /* CONFIG_MMU */ > > + > +/** > + * struct ptdesc - Memory descriptor for page tables. > + * @__page_flags: Same as page flags. Unused for page tables. > + * @pt_list: List of used page tables. Used for s390 and x86. > + * @_pt_pad_1: Padding that aliases with page's compound head. > + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs. > + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only. > + * @pt_mm: Used for x86 pgds. > + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 > only. > + * @ptl: Lock for the page table. > + * > + * This struct overlays struct page for now. Do not modify without a good > + * understanding of the issues. > + */ > +struct ptdesc { > + unsigned long __page_flags; > + > + union { > + struct list_head pt_list; > + struct { > + unsigned long _pt_pad_1; > + pgtable_t pmd_huge_pte; > + }; > + }; > + unsigned long _pt_s390_gaddr; > + > + union { > + struct mm_struct *pt_mm; > + atomic_t pt_frag_refcount; > + }; > + > +#if ALLOC_SPLIT_PTLOCKS > + spinlock_t *ptl; > +#else > + spinlock_t ptl; > +#endif > +}; I think you should include the memcg here too? It needs to be valid for a ptdesc, even if we don't currently deref it through the ptdesc type. Also, do you see a way to someday put a 'struct rcu_head' into here? Thanks, Jason
Re: [PATCH v9 00/14] pci: Work around ASMedia ASM2824 PCIe link training failures
On Thu, 15 Jun 2023, Bjorn Helgaas wrote: > > If doing it this way, which I actually like, I think it would be a little > > bit better performance- and style-wise if this was written as: > > > > if (pci_is_pcie(dev)) { > > bridge = pci_upstream_bridge(dev); > > retrain = !!bridge; > > } > > > > (or "retrain = bridge != NULL" if you prefer this style), and then we > > don't have to repeatedly check two variables iff (pcie && !bridge) in the > > loop below: > > Done, thanks, I do like that better. I did: > > bridge = pci_upstream_bridge(dev); > if (bridge) > retrain = true; > > because it seems like it flows more naturally when reading. Perfect, and good timing too, as I have just started checking your tree as your message arrived. I ran my usual tests with and w/o PCI_QUIRKS enabled and results were as expected. As before I didn't check hot plug and reset paths as these features are awkward with the HiFive Unmatched system involved. I have skimmed over the changes as committed to pci/enumeration and found nothing suspicious. I have verified that the tree builds as at each of them with my configuration. As per my earlier remark: > I think making a system halfway-fixed would make little sense, but with > the actual fix actually made last as you suggested I think this can be > split off, because it'll make no functional change by itself. I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS stub into the change carrying the actual workaround and then have the reset path update with a follow-up change only, but I won't fight over it. It's only one tree revision that will be in this halfway-fixed state and I'll trust your judgement here. Let me know if anything pops up related to these changes anytime and I'll be happy to look into it. The system involved is nearing two years since its deployment already, but hopefully it has many years to go yet and will continue being ready to verify things. 
It's not that there's lots of real RISC-V hardware available, let alone with PCI/e connectivity. Thank you for staying with me and reviewing this patch series through all the iterations. Maciej
Re: ppc64le vmlinuz is huge when building with BTF
Naveen N Rao wrote on Fri, Jun 16, 2023 at 04:28:53PM +0530:
> > We're not stripping anything in vmlinuz for other archs -- the linker
> > script already should be including only the bare minimum to decompress
> > itself (+compressed useful bits), so I guess it's a Kbuild issue for the
> > arch.
>
> For a related discussion, see:
> http://lore.kernel.org/CAK18DXZKs2PNmLndeGYqkPxmrrBR=6ca3bhyYCj=ghya7dh...@mail.gmail.com

Thanks, I didn't know that ppc64le boots straight into vmlinux, as 'make
install' somehow installs something called 'vmlinuz-lts' (-lts coming out of
localversion afaiu, but vmlinuz would come from the build scripts); this is
somewhat confusing, as vmlinuz on other archs is a compressed/pre-processed
binary, so I'd expect it to at least be stripped...

> > We can add a strip, but I unfortunately have no way of testing a ppc
> > build; I'll ask around on the linux-kbuild and linuxppc-dev lists if
> > that's expected; it shouldn't be that bad now that it's figured out.
>
> Stripping vmlinux would indeed be the way to go. As mentioned in the above
> link, fedora also packages a strip'ed vmlinux for ppc64le:
> https://src.fedoraproject.org/rpms/kernel/blob/4af17bffde7a1eca9ab164e5de0e391c277998a4/f/kernel.spec#_1797

It feels somewhat wrong to add a strip just for ppc64le after make install,
but I guess we probably ought to do the same... I don't have any hardware to
test booting the result though; I'll submit an update and ask for someone to
test when it's done. (A bit busy, but that doesn't take long -- will do it
tomorrow morning before I forget.)

Thanks!
-- 
Dominique Martinet | Asmadeus
[PATCH v2 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE
pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect and
move_huge_pmd helpers, use the architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/pgtable.h | 2 ++
 mm/mremap.c             | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8c5174d1f9db..c7f5806dc9d1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -550,6 +550,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long address, pud_t *pudp)
 {
@@ -563,6 +564,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 {
 	BUILD_BUG();
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #endif

diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..6373db571e5c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -338,7 +338,7 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
 			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
-- 
2.40.1
[PATCH v2 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works
Architectures like powerpc would like to use different page table allocators
and mapping mechanisms to implement vmemmap optimization. Similar to
vmemmap_populate, allow architectures to implement
vmemmap_populate_compound_pages.

Signed-off-by: Aneesh Kumar K.V
---
 mm/sparse-vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 10d73a0dfcec..0b83706c08fd 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 		start, end - 1);
 }
 
+#ifndef vmemmap_populate_compound_pages
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 				       struct vmem_altmap *altmap,
 				       struct page *reuse)
@@ -446,6 +447,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	return 0;
 }
 
+#endif
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap)
-- 
2.40.1
Re: [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES
Mike Rapoport writes:
> From: "Mike Rapoport (IBM)"
>
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.

I think you can remove the MODULES dependency from BPF_JIT as well:

--8<--
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 2dfe1079f772..fa4587027f8b 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -41,7 +41,6 @@ config BPF_JIT
 	bool "Enable BPF Just In Time compiler"
 	depends on BPF
 	depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-	depends on MODULES
 	help
 	  BPF programs are normally handled by a BPF interpreter. This option
 	  allows the kernel to generate native code when a program is loaded
--8<--

Björn
[PATCH v2 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize
This is not used by radix anymore. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 10 -- arch/powerpc/mm/init_64.c| 21 ++--- 2 files changed, 14 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index f0527eebe012..d57b4f5d5cb3 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -601,16 +601,6 @@ void __init radix__early_init_mmu(void) mmu_virtual_psize = MMU_PAGE_4K; #endif -#ifdef CONFIG_SPARSEMEM_VMEMMAP - /* vmemmap mapping */ - if (mmu_psize_defs[MMU_PAGE_2M].shift) { - /* -* map vmemmap using 2M if available -*/ - mmu_vmemmap_psize = MMU_PAGE_2M; - } else - mmu_vmemmap_psize = mmu_virtual_psize; -#endif /* * initialize page table size */ diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 5701faca39ef..6db7a063ba63 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, return false; } -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, - struct vmem_altmap *altmap) +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node, +struct vmem_altmap *altmap) { bool altmap_alloc; unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; -#ifdef CONFIG_PPC_BOOK3S_64 - if (radix_enabled()) - return radix__vmemmap_populate(start, end, node, altmap); -#endif - /* Align to the page size of the linear mapping. 
*/ start = ALIGN_DOWN(start, page_size); @@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, return 0; } +int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, + struct vmem_altmap *altmap) +{ + +#ifdef CONFIG_PPC_BOOK3S_64 + if (radix_enabled()) + return radix__vmemmap_populate(start, end, node, altmap); +#endif + + return __vmemmap_populate(start, end, node, altmap); +} + #ifdef CONFIG_MEMORY_HOTPLUG static unsigned long vmemmap_list_free(unsigned long start) { -- 2.40.1
[PATCH v2 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence with 64K page size, we don't use vmemmap deduplication for PMD-level mapping. Signed-off-by: Aneesh Kumar K.V --- Documentation/mm/vmemmap_dedup.rst | 1 + Documentation/powerpc/index.rst| 1 + Documentation/powerpc/vmemmap_dedup.rst| 101 ++ arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/radix.h | 8 + arch/powerpc/mm/book3s64/radix_pgtable.c | 203 + 6 files changed, 315 insertions(+) create mode 100644 Documentation/powerpc/vmemmap_dedup.rst diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst index a4b12ff906c4..c573e08b5043 100644 --- a/Documentation/mm/vmemmap_dedup.rst +++ b/Documentation/mm/vmemmap_dedup.rst @@ -210,6 +210,7 @@ the device (altmap). The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64), PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64). +For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst The differences with HugeTLB are relatively minor. diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst index 85e80e30160b..1b7d943ae8f3 100644 --- a/Documentation/powerpc/index.rst +++ b/Documentation/powerpc/index.rst @@ -35,6 +35,7 @@ powerpc ultravisor vas-api vcpudispatch_stats +vmemmap_dedup features diff --git a/Documentation/powerpc/vmemmap_dedup.rst b/Documentation/powerpc/vmemmap_dedup.rst new file mode 100644 index ..dc4db59fdf87 --- /dev/null +++ b/Documentation/powerpc/vmemmap_dedup.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Device DAX +== + +The device-dax interface uses the tail deduplication technique explained in +Documentation/mm/vmemmap_dedup.rst + +On powerpc, vmemmap deduplication is only used with radix MMU translation. Also +with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap +deduplication. 
+
+With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
+page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
+vmemmap deduplication possible.
+
+With 1G PUD level mapping, we require 16384 struct pages and a single 64K
+vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
+require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping.
+
+Here's how things look on device-dax after the sections are populated::
+
+  [ASCII diagram, garbled in the archive: 1G PUD level mapping with vmemmap
+   pages 0..15, where the tail vmemmap pages all map to a shared
+   deduplicated page]
+
+With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
+4K vmemmap page contains 64 struct pages (4K/sizeof(struct page)). Hence we
+require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.
+
+Here's how things look on device-dax after the sections are populated::
+
+  [ASCII diagram, garbled and truncated in the archive: 2M PMD level mapping
+   with 8 vmemmap pages, tail vmemmap pages deduplicated as above]
[PATCH v2 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function
This is in preparation to update radix to implement vmemmap optimization for devdax. Below are the rules w.r.t radix vmemmap mapping 1. First try to map things using PMD (2M) 2. With altmap if altmap cross-boundary check returns true, fall back to PAGE_SIZE 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to PAGE_SIZE On removing vmemmap mapping, check if every subsection that is using the vmemmap area is invalid. If found to be invalid, that implies we can safely free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because with 64K page size, we need to do the above check even at the PAGE_SIZE granularity. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/radix.h | 2 + arch/powerpc/include/asm/pgtable.h | 3 + arch/powerpc/mm/book3s64/radix_pgtable.c | 319 +++-- arch/powerpc/mm/init_64.c | 26 +- 4 files changed, 319 insertions(+), 31 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 8cdff5a05011..87d4c1e62491 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -332,6 +332,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start, unsigned long phys); int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node, struct vmem_altmap *altmap); +void __ref radix__vmemmap_free(unsigned long start, unsigned long end, + struct vmem_altmap *altmap); extern void radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size); diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 9972626ddaf6..6d4cd2ebae6e 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x) struct seq_file; void arch_report_meminfo(struct seq_file *m); +int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size); +bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, + unsigned long page_size); #endif /* CONFIG_PPC64 */ #endif /* __ASSEMBLY__ */ diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index d7e2dd3d4add..ef886fab643d 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d) p4d_clear(p4d); } +static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end) +{ + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE); + + return !vmemmap_populated(start, PMD_SIZE); +} + +static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end) +{ + unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE); + + return !vmemmap_populated(start, PAGE_SIZE); +} + +static void __meminit free_vmemmap_pages(struct page *page, +struct vmem_altmap *altmap, +int order) +{ + unsigned int nr_pages = 1 << order; + + if (altmap) { + unsigned long alt_start, alt_end; + unsigned long base_pfn = page_to_pfn(page); + + /* +* with 1G vmemmap mapping we can have things set up +* such that even though altmap is specified we never +* use altmap.
+*/ + alt_start = altmap->base_pfn; + alt_end = altmap->base_pfn + altmap->reserve + + altmap->free + altmap->alloc + altmap->align; + + if (base_pfn >= alt_start && base_pfn < alt_end) { + vmem_altmap_free(altmap, nr_pages); + return; + } + } + + if (PageReserved(page)) { + /* allocated from memblock */ + while (nr_pages--) + free_reserved_page(page++); + } else + free_pages((unsigned long)page_address(page), order); +} + static void remove_pte_table(pte_t *pte_start, unsigned long addr, -unsigned long end, bool direct) +unsigned long end, bool direct, +struct vmem_altmap *altmap) { unsigned long next, pages = 0; pte_t *pte; @@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr, if (!pte_present(*pte)) continue; - if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) { - /* -* The vmemmap_free() and
[PATCH v2 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage
This is enabled only with radix translation and 1G hugepage size. This will be used with devdax device memory with a namespace alignment of 1G. Anon transparent hugepage is not supported even though we do have helpers checking pud_trans_huge(). We should never find that return true. The only expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP. Some of the helpers are never expected to get called on hash translation and hence is marked to call BUG() in such a case. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 156 -- arch/powerpc/include/asm/book3s/64/radix.h| 37 + .../include/asm/book3s/64/tlbflush-radix.h| 2 + arch/powerpc/include/asm/book3s/64/tlbflush.h | 8 + arch/powerpc/mm/book3s64/pgtable.c| 78 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 arch/powerpc/mm/book3s64/radix_tlb.c | 7 + arch/powerpc/platforms/Kconfig.cputype| 1 + include/trace/events/thp.h| 17 ++ 9 files changed, 323 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 4acc9690f599..9a05de007956 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte) { return __pud_raw(pte_raw(pte)); } + +static inline pte_t *pudp_ptep(pud_t *pud) +{ + return (pte_t *)pud; +} + +#define pud_pfn(pud) pte_pfn(pud_pte(pud)) +#define pud_dirty(pud) pte_dirty(pud_pte(pud)) +#define pud_young(pud) pte_young(pud_pte(pud)) +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) +#define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) +#define pud_mkwrite(pud) pte_pud(pte_mkwrite(pud_pte(pud))) #define pud_write(pud) pte_write(pud_pte(pud)) +#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY +#define 
pud_soft_dirty(pmd)pte_soft_dirty(pud_pte(pud)) +#define pud_mksoft_dirty(pmd) pte_pud(pte_mksoft_dirty(pud_pte(pud))) +#define pud_clear_soft_dirty(pmd) pte_pud(pte_clear_soft_dirty(pud_pte(pud))) +#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ + static inline int pud_bad(pud_t pud) { if (radix_enabled()) @@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write) #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); +extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); +extern void set_pud_at(struct mm_struct *mm, unsigned long addr, + pud_t *pudp, pud_t pud); + static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd) { } +static inline void update_mmu_cache_pud(struct vm_area_struct *vma, + unsigned long addr, pud_t *pud) +{ +} + extern int hash__has_transparent_hugepage(void); static inline int has_transparent_hugepage(void) { @@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void) } #define has_transparent_hugepage has_transparent_hugepage +static inline int has_transparent_pud_hugepage(void) +{ + if (radix_enabled()) + return radix__has_transparent_pud_hugepage(); + return 0; +} +#define has_transparent_pud_hugepage has_transparent_pud_hugepage + static inline unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, unsigned long clr, unsigned long set) @@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set); } +static inline unsigned long +pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp, + unsigned long clr, unsigned long set) +{ + if (radix_enabled()) + return 
radix__pud_hugepage_update(mm, addr, pudp, clr, set); + BUG(); + return pud_val(*pudp); +} + /* * returns true for pmd migration entries, THP, devmap, hugetlb * But compile time dependent on THP config @@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd) return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); } +static inline int pud_large(pud_t pud) +{ + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); +} + /* * For radix we should always find H_PAGE_HASHPTE zero. Hence * the below will work for radix too @@
[PATCH v2 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap optimization includes an update of both the permissions (writeable to read-only) and the output address (pfn) of the vmemmap ptes. That is not supported without unmapping of pte(marking it invalid) by some architectures. With DAX vmemmap optimization we don't require such pte updates and architectures can enable DAX vmemmap optimization while having hugetlb vmemmap optimization disabled. Hence split DAX optimization support into a different config. loongarch and riscv don't have devdax support. So the DAX config is not enabled for them. With this change, arm64 should be able to select DAX optimization [1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP") Signed-off-by: Aneesh Kumar K.V --- arch/loongarch/Kconfig | 2 +- arch/riscv/Kconfig | 2 +- arch/x86/Kconfig | 3 ++- fs/Kconfig | 2 +- include/linux/mm.h | 2 +- mm/Kconfig | 5 - 6 files changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index d38b066fc931..2060990c4612 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -55,7 +55,7 @@ config LOONGARCH select ARCH_USE_QUEUED_SPINLOCKS select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT select ARCH_WANT_LD_ORPHAN_WARN - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select ARCH_WANTS_NO_INSTR select BUILDTIME_TABLE_SORT select COMMON_CLK diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 5966ad97c30c..d8f3765ff115 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -50,7 +50,7 @@ config RISCV select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT select ARCH_WANT_HUGE_PMD_SHARE if 64BIT select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU select BUILDTIME_TABLE_SORT if MMU diff 
--git a/arch/x86/Kconfig b/arch/x86/Kconfig index 53bab123a8ee..eb383960b6ee 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -127,7 +127,8 @@ config X86 select ARCH_WANT_GENERAL_HUGETLB select ARCH_WANT_HUGE_PMD_SHARE select ARCH_WANT_LD_ORPHAN_WARN - select ARCH_WANT_OPTIMIZE_VMEMMAP if X86_64 + select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64 + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64 select ARCH_WANTS_THP_SWAP if X86_64 select ARCH_HAS_PARANOID_L1D_FLUSH select BUILDTIME_TABLE_SORT diff --git a/fs/Kconfig b/fs/Kconfig index 18d034ec7953..9c104c130a6e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -252,7 +252,7 @@ config HUGETLB_PAGE config HUGETLB_PAGE_OPTIMIZE_VMEMMAP def_bool HUGETLB_PAGE - depends on ARCH_WANT_OPTIMIZE_VMEMMAP + depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP depends on SPARSEMEM_VMEMMAP config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON diff --git a/include/linux/mm.h b/include/linux/mm.h index 9a45e61cd83f..6e56ae09f0c1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3553,7 +3553,7 @@ void vmemmap_free(unsigned long start, unsigned long end, #endif #define VMEMMAP_RESERVE_NR 2 -#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) { diff --git a/mm/Kconfig b/mm/Kconfig index 7672a22647b4..7b388c10baab 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -461,7 +461,10 @@ config SPARSEMEM_VMEMMAP # Select this config option from the architecture Kconfig, if it is preferred # to enable the feature of HugeTLB/dev_dax vmemmap optimization. # -config ARCH_WANT_OPTIMIZE_VMEMMAP +config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP + bool + +config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP bool config HAVE_MEMBLOCK_PHYS_MAP -- 2.40.1
[PATCH v2 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
This helps architectures to override pmd_same and pud_same independently. Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2fe19720075e..8c5174d1f9db 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -681,7 +681,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) { return pmd_val(pmd_a) == pmd_val(pmd_b); } +#endif +#ifndef __HAVE_ARCH_PUD_SAME static inline int pud_same(pud_t pud_a, pud_t pud_b) { return pud_val(pud_a) == pud_val(pud_b); -- 2.40.1
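The guard follows the usual `__HAVE_ARCH_*` pattern: an architecture defines the macro and supplies its own helper before the generic header is seen, and the generic fallback compiles out. A user-space sketch of the pattern — the `pud_t` layout and the masked comparison here are illustrative stand-ins, not the kernel's definitions:

```c
#include <assert.h>

/* toy stand-in for the kernel's pud_t; illustrative only */
typedef struct { unsigned long val; } pud_t;
#define pud_val(x) ((x).val)

/* An architecture that wants its own comparison defines the guard
 * macro and its helper first ... */
#define __HAVE_ARCH_PUD_SAME
static inline int pud_same(pud_t pud_a, pud_t pud_b)
{
	/* arch version: e.g. ignore low software bits when comparing */
	return (pud_val(pud_a) & ~0xfUL) == (pud_val(pud_b) & ~0xfUL);
}

/* ... so the generic fallback (what this patch wraps) is skipped */
#ifndef __HAVE_ARCH_PUD_SAME
static inline int pud_same(pud_t pud_a, pud_t pud_b)
{
	return pud_val(pud_a) == pud_val(pud_b);
}
#endif
```

With the patch, pmd_same and pud_same get independent guards, so an arch can override one without being forced to redefine the other.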
[PATCH v2 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override
dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within vmemmap such that tail page mapping can point to the second PAGE_SIZE area. Enforce that in vmemmap_can_optimize() function. Architectures like powerpc also want to enable vmemmap optimization conditionally (only with radix MMU translation). Hence allow architecture override. Signed-off-by: Aneesh Kumar K.V --- include/linux/mm.h | 30 ++ mm/mm_init.c | 2 +- 2 files changed, 27 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 27ce77080c79..9a45e61cd83f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -31,6 +31,8 @@ #include #include +#include + struct mempolicy; struct anon_vma; struct anon_vma_chain; @@ -3550,13 +3552,33 @@ void vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap); #endif +#define VMEMMAP_RESERVE_NR 2 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP -static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, - struct dev_pagemap *pgmap) +static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, + struct dev_pagemap *pgmap) { - return is_power_of_2(sizeof(struct page)) && - pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap; + if (pgmap) { + unsigned long nr_pages; + unsigned long nr_vmemmap_pages; + + nr_pages = pgmap_vmemmap_nr(pgmap); + nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT); + /* +* For vmemmap optimization with DAX we need minimum 2 vmemmap +* pages. 
See layout diagram in Documentation/mm/vmemmap_dedup.rst +*/ + return is_power_of_2(sizeof(struct page)) && + (nr_vmemmap_pages > VMEMMAP_RESERVE_NR) && !altmap; + } + return false; } +/* + * If we don't have an architecture override, use the generic rule + */ +#ifndef vmemmap_can_optimize +#define vmemmap_can_optimize __vmemmap_can_optimize +#endif + #else static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) diff --git a/mm/mm_init.c b/mm/mm_init.c index 7f7f9c677854..d1676afc94f1 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap, if (!vmemmap_can_optimize(altmap, pgmap)) return pgmap_vmemmap_nr(pgmap); - return 2 * (PAGE_SIZE / sizeof(struct page)); + return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page)); } static void __ref memmap_init_compound(struct page *head, -- 2.40.1
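The size check in __vmemmap_can_optimize() reduces to simple arithmetic: the struct pages describing the device pages must span strictly more than VMEMMAP_RESERVE_NR (2) vmemmap pages. A user-space sketch of just that arithmetic (the page size and struct-page size below are assumed values for illustration; the real function additionally checks is_power_of_2(sizeof(struct page)) and !altmap):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT         16	/* assume 64K pages, as on ppc64 radix */
#define STRUCT_PAGE_SIZE   64	/* assumed sizeof(struct page) */
#define VMEMMAP_RESERVE_NR 2

/* mirrors the size test: the vmemmap backing nr_pages device pages
 * must cover more than VMEMMAP_RESERVE_NR vmemmap pages */
static bool can_optimize(unsigned long nr_pages)
{
	unsigned long nr_vmemmap_pages =
		(nr_pages * STRUCT_PAGE_SIZE) >> PAGE_SHIFT;

	return nr_vmemmap_pages > VMEMMAP_RESERVE_NR;
}
```

With these assumed sizes, 2048 device pages need exactly 2 vmemmap pages (not enough), while 4096 need 4 (enough for the head page plus a shared tail page).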
[PATCH v2 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
We will use this in a later patch to do tlb flush when clearing pud entries on powerpc. This is similar to commit 93a98695f2f9 ("mm: change pmdp_huge_get_and_clear_full take vm_area_struct as arg") Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 4 ++-- mm/debug_vm_pgtable.c | 2 +- mm/huge_memory.c| 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index b3f4dd0240f5..2fe19720075e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -442,11 +442,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL -static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm, +static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma, unsigned long address, pud_t *pudp, int full) { - return pudp_huge_get_and_clear(mm, address, pudp); + return pudp_huge_get_and_clear(vma->vm_mm, address, pudp); } #endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index c54177aabebd..c2bf25d5e5cd 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -382,7 +382,7 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) WARN_ON(!(pud_write(pud) && pud_dirty(pud))); #ifndef __PAGETABLE_PMD_FOLDED - pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1); + pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1); pud = READ_ONCE(*args->pudp); WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 624671aaa60d..8774b4751a84 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1980,7 +1980,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, if (!ptl) return 0; - pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); + pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); tlb_remove_pud_tlb_entry(tlb, pud, 
addr); if (vma_is_special_huge(vma)) { spin_unlock(ptl); -- 2.40.1
[PATCH v2 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support
Architectures like powerpc would like to enable transparent huge page pud support only with radix translation. To support that add has_transparent_pud_hugepage() helper that architectures can override. Signed-off-by: Aneesh Kumar K.V --- drivers/nvdimm/pfn_devs.c | 2 +- include/linux/pgtable.h | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index af7d9301520c..18ad315581ca 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned long *alignments) if (has_transparent_hugepage()) { alignments[1] = HPAGE_PMD_SIZE; - if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) + if (has_transparent_pud_hugepage()) alignments[2] = HPAGE_PUD_SIZE; } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index c5a51481bbb9..b3f4dd0240f5 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1597,6 +1597,9 @@ typedef unsigned int pgtbl_mod_mask; #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE) #endif +#ifndef has_transparent_pud_hugepage +#define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) +#endif /* * On some architectures it depends on the mm if the p4d/pud or pmd * layer of the page table hierarchy is folded or not. -- 2.40.1
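The point of the override hook is that the answer can become a runtime question instead of a compile-time one. A user-space sketch — radix_enabled() here is a stand-in flag, not powerpc's real helper:

```c
#include <assert.h>
#include <stdbool.h>

/* stand-in for powerpc's runtime MMU-mode test; illustrative only */
static bool radix_enabled_flag;
static bool radix_enabled(void) { return radix_enabled_flag; }

/* an arch override makes PUD THP support depend on the active MMU ... */
#define has_transparent_pud_hugepage() radix_enabled()

/* ... while the generic fallback would be a compile-time constant
 * (IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) in the patch) */
#ifndef has_transparent_pud_hugepage
#define has_transparent_pud_hugepage() true
#endif
```

Callers such as nd_pfn_supported_alignments() then pick up the runtime answer automatically.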
[PATCH v2 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemmap can cross-boundary
Without this fix, the last subsection vmemmap can end up in memory even if the namespace is created with -M mem and has sufficient space in the altmap area. Fixes: cf387d9644d8 ("libnvdimm/altmap: Track namespace boundaries in altmap") Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/init_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 05b0d584e50b..fe1b83020e0d 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -189,7 +189,7 @@ static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long star unsigned long nr_pfn = page_size / sizeof(struct page); unsigned long start_pfn = page_to_pfn((struct page *)start); - if ((start_pfn + nr_pfn) > altmap->end_pfn) + if ((start_pfn + nr_pfn - 1) > altmap->end_pfn) return true; if (start_pfn < altmap->base_pfn) -- 2.40.1
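The fix is an off-by-one in the boundary predicate: the last pfn of the range is start_pfn + nr_pfn - 1, so a range whose last pfn lands exactly on end_pfn fits in the altmap, yet the old check flagged it as crossing. A user-space sketch of the two predicates:

```c
#include <assert.h>
#include <stdbool.h>

/* does [start_pfn, start_pfn + nr_pfn) spill past the altmap end? */
static bool crosses_old(unsigned long start_pfn, unsigned long nr_pfn,
			unsigned long end_pfn)
{
	return (start_pfn + nr_pfn) > end_pfn;		/* off by one */
}

static bool crosses_fixed(unsigned long start_pfn, unsigned long nr_pfn,
			  unsigned long end_pfn)
{
	return (start_pfn + nr_pfn - 1) > end_pfn;	/* check the last pfn */
}
```

The predicates disagree exactly when the range's last pfn equals end_pfn — the "last subsection" case from the commit message, which the old check wrongly pushed out of the altmap and into memory.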
[PATCH v2 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 15a099e53cde..76f6a1f3b9d8 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -910,7 +910,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start, unsigned long phys) { /* Create a PTE encoding */ - unsigned long flags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_KERNEL_RW; int nid = early_pfn_to_nid(phys >> PAGE_SHIFT); int ret; @@ -919,7 +918,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start, return -1; } - ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, nid); + ret = __map_kernel_page_nid(start, phys, PAGE_KERNEL, page_size, nid); BUG_ON(ret); return 0; -- 2.40.1
[PATCH v2 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
On memory unplug reduce DirectMap page count correctly. root@ubuntu-guest:# grep Direct /proc/meminfo DirectMap4k: 0 kB DirectMap64k: 0 kB DirectMap2M:115343360 kB DirectMap1G: 0 kB Before fix: root@ubuntu-guest:# ndctl disable-namespace all disabled 1 namespace root@ubuntu-guest:# grep Direct /proc/meminfo DirectMap4k: 0 kB DirectMap64k: 0 kB DirectMap2M:115343360 kB DirectMap1G: 0 kB After fix: root@ubuntu-guest:# ndctl disable-namespace all disabled 1 namespace root@ubuntu-guest:# grep Direct /proc/meminfo DirectMap4k: 0 kB DirectMap64k: 0 kB DirectMap2M:104857600 kB DirectMap1G: 0 kB Fixes: a2dc009afa9a ("powerpc/mm/book3s/radix: Add mapping statistics") Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 34 +++- 1 file changed, 22 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 570add33c02d..15a099e53cde 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -743,9 +743,9 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d) } static void remove_pte_table(pte_t *pte_start, unsigned long addr, -unsigned long end) +unsigned long end, bool direct) { - unsigned long next; + unsigned long next, pages = 0; pte_t *pte; pte = pte_start + pte_index(addr); @@ -767,13 +767,16 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr, } pte_clear(_mm, addr, pte); + pages++; } + if (direct) + update_page_count(mmu_virtual_psize, -pages); } static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr, -unsigned long end) + unsigned long end, bool direct) { - unsigned long next; + unsigned long next, pages = 0; pte_t *pte_base; pmd_t *pmd; @@ -791,19 +794,22 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr, continue; } pte_clear(_mm, addr, (pte_t *)pmd); + pages++; continue; } pte_base = (pte_t *)pmd_page_vaddr(*pmd); - remove_pte_table(pte_base, addr, 
next); + remove_pte_table(pte_base, addr, next, direct); free_pte_table(pte_base, pmd); } + if (direct) + update_page_count(MMU_PAGE_2M, -pages); } static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr, -unsigned long end) + unsigned long end, bool direct) { - unsigned long next; + unsigned long next, pages = 0; pmd_t *pmd_base; pud_t *pud; @@ -821,16 +827,20 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr, continue; } pte_clear(_mm, addr, (pte_t *)pud); + pages++; continue; } pmd_base = pud_pgtable(*pud); - remove_pmd_table(pmd_base, addr, next); + remove_pmd_table(pmd_base, addr, next, direct); free_pmd_table(pmd_base, pud); } + if (direct) + update_page_count(MMU_PAGE_1G, -pages); } -static void __meminit remove_pagetable(unsigned long start, unsigned long end) +static void __meminit remove_pagetable(unsigned long start, unsigned long end, + bool direct) { unsigned long addr, next; pud_t *pud_base; @@ -859,7 +869,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end) } pud_base = p4d_pgtable(*p4d); - remove_pud_table(pud_base, addr, next); + remove_pud_table(pud_base, addr, next, direct); free_pud_table(pud_base, p4d); } @@ -882,7 +892,7 @@ int __meminit radix__create_section_mapping(unsigned long start, int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end) { - remove_pagetable(start, end); + remove_pagetable(start, end, true); return 0; } #endif /* CONFIG_MEMORY_HOTPLUG */ @@ -918,7 +928,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start, #ifdef CONFIG_MEMORY_HOTPLUG void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size) { - remove_pagetable(start, start + page_size); + remove_pagetable(start, start + page_size, false); } #endif #endif --
[PATCH v2 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
This should not be within CONFIG_PPC_64S_HASH_MMU. We use mmu_vmemmap_psize on radix while mapping the vmemmap area. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 5f8c6fbe8a69..570add33c02d 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -594,7 +594,6 @@ void __init radix__early_init_mmu(void) { unsigned long lpcr; -#ifdef CONFIG_PPC_64S_HASH_MMU #ifdef CONFIG_PPC_64K_PAGES /* PAGE_SIZE mappings */ mmu_virtual_psize = MMU_PAGE_64K; @@ -611,7 +610,6 @@ void __init radix__early_init_mmu(void) mmu_vmemmap_psize = MMU_PAGE_2M; } else mmu_vmemmap_psize = mmu_virtual_psize; -#endif #endif /* * initialize page table size -- 2.40.1
[PATCH v2 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 2297aa764ecd..5f8c6fbe8a69 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -952,7 +952,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add assert_spin_locked(pmd_lockptr(mm, pmdp)); #endif - old = radix__pte_update(mm, addr, (pte_t *)pmdp, clr, set, 1); + old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1); trace_hugepage_update(addr, old, clr, set); return old; -- 2.40.1
[PATCH v2 00/16] Add support for DAX vmemmap optimization for ppc64
This patch series implements changes required to support DAX vmemmap optimization for ppc64. The vmemmap optimization is only enabled with radix MMU translation and 1GB PUD mapping with 64K page size. The patch series also splits hugetlb vmemmap optimization into a separate Kconfig variable so that architectures can enable DAX vmemmap optimization without enabling hugetlb vmemmap optimization. This should enable architectures like arm64 to enable DAX vmemmap optimization even though they can't enable hugetlb vmemmap optimization. More details of the same are in patch "mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization" Changes from V1: * Fix make htmldocs warning * Fix vmemmap allocation bugs with different alignment values. * Correctly check for section validity before we free vmemmap area Aneesh Kumar K.V (16): powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting. powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding powerpc/mm/dax: Fix the condition when checking if altmap vmemmap can cross-boundary mm/hugepage pud: Allow arch-specific helper function to check huge page pud support mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override mm/vmemmap: Allow architectures to override how vmemmap optimization works mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization powerpc/book3s64/mm: Enable transparent pud hugepage powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function powerpc/book3s64/radix: Add support for vmemmap optimization for radix powerpc/book3s64/radix: Remove mmu_vmemmap_psize Documentation/mm/vmemmap_dedup.rst| 1 +
Documentation/powerpc/index.rst | 1 + Documentation/powerpc/vmemmap_dedup.rst | 101 +++ arch/loongarch/Kconfig| 2 +- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 156 - arch/powerpc/include/asm/book3s/64/radix.h| 47 ++ .../include/asm/book3s/64/tlbflush-radix.h| 2 + arch/powerpc/include/asm/book3s/64/tlbflush.h | 8 + arch/powerpc/include/asm/pgtable.h| 3 + arch/powerpc/mm/book3s64/pgtable.c| 78 +++ arch/powerpc/mm/book3s64/radix_pgtable.c | 577 -- arch/powerpc/mm/book3s64/radix_tlb.c | 7 + arch/powerpc/mm/init_64.c | 39 +- arch/powerpc/platforms/Kconfig.cputype| 1 + arch/riscv/Kconfig| 2 +- arch/x86/Kconfig | 3 +- drivers/nvdimm/pfn_devs.c | 2 +- fs/Kconfig| 2 +- include/linux/mm.h| 32 +- include/linux/pgtable.h | 11 +- include/trace/events/thp.h| 17 + mm/Kconfig| 5 +- mm/debug_vm_pgtable.c | 2 +- mm/huge_memory.c | 2 +- mm/mm_init.c | 2 +- mm/mremap.c | 2 +- mm/sparse-vmemmap.c | 3 + 28 files changed, 1032 insertions(+), 77 deletions(-) create mode 100644 Documentation/powerpc/vmemmap_dedup.rst -- 2.40.1
Re: ppc64le vmlinuz is huge when building with BTF
[Cc linuxppc-dev] Dominique Martinet wrote: Alan Maguire wrote on Thu, Jun 15, 2023 at 03:31:49PM +0100: However the problem I suspect is this: 51 .debug_info 0a488b55 026f8d20 2**0 CONTENTS, READONLY, DEBUGGING [...] The debug info hasn't been stripped, so I suspect the packaging spec file or equivalent - in perhaps trying to preserve the .BTF section - is preserving debug info too. DWARF needs to be there at BTF generation time in vmlinux but is usually stripped for non-debug packages. Thanks Alan and Eduard! I guess I should have checked that first, it helps. We're not stripping anything in vmlinuz for other archs -- the linker script already should be including only the bare minimum to decompress itself (+compressed useful bits), so I guess it's a Kbuild issue for the arch. For a related discussion, see: http://lore.kernel.org/CAK18DXZKs2PNmLndeGYqkPxmrrBR=6ca3bhyYCj=ghya7dh...@mail.gmail.com We can add a strip but I unfortunately have no way of testing ppc build, I'll ask around the build linux-kbuild and linuxppc-dev lists if that's expected; it shouldn't be that bad now that's figured out. Stripping vmlinux would indeed be the way to go. As mentioned in the above link, fedora also packages a strip'ed vmlinux for ppc64le: https://src.fedoraproject.org/rpms/kernel/blob/4af17bffde7a1eca9ab164e5de0e391c277998a4/f/kernel.spec#_1797 - Naveen
[PATCH AUTOSEL 6.1 15/26] ASoC: fsl_sai: Enable BCI bit if SAI works on synchronous mode with BYP asserted
From: Chancel Liu [ Upstream commit 32cf0046a652116d6a216d575f3049a9ff9dd80d ] There's an issue on SAI synchronous mode that TX/RX side can't get BCLK from RX/TX it sync with if BYP bit is asserted. It's a workaround to fix it that enable SION of IOMUX pad control and assert BCI. For example if TX sync with RX which means both TX and RX are using clk form RX and BYP=1. TX can get BCLK only if the following two conditions are valid: 1. SION of RX BCLK IOMUX pad is set to 1 2. BCI of TX is set to 1 Signed-off-by: Chancel Liu Acked-by: Shengjiu Wang Link: https://lore.kernel.org/r/20230530103012.3448838-1-chancel@nxp.com Signed-off-by: Mark Brown Signed-off-by: Sasha Levin --- sound/soc/fsl/fsl_sai.c | 11 +-- sound/soc/fsl/fsl_sai.h | 1 + 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c index 6d88af5b287fe..b33104715c7ba 100644 --- a/sound/soc/fsl/fsl_sai.c +++ b/sound/soc/fsl/fsl_sai.c @@ -491,14 +491,21 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq) regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_MSEL_MASK, FSL_SAI_CR2_MSEL(sai->mclk_id[tx])); - if (savediv == 1) + if (savediv == 1) { regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP, FSL_SAI_CR2_BYP); - else + if (fsl_sai_dir_is_synced(sai, adir)) + regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs), + FSL_SAI_CR2_BCI, FSL_SAI_CR2_BCI); + else + regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs), + FSL_SAI_CR2_BCI, 0); + } else { regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP, savediv / 2 - 1); + } if (sai->soc_data->max_register >= FSL_SAI_MCTL) { /* SAI is in master mode at this point, so enable MCLK */ diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h index 697f6690068c8..c5423f81e4560 100644 --- a/sound/soc/fsl/fsl_sai.h +++ b/sound/soc/fsl/fsl_sai.h @@ -116,6 +116,7 @@ /* SAI Transmit and Receive Configuration 2 Register */ #define 
FSL_SAI_CR2_SYNC BIT(30) +#define FSL_SAI_CR2_BCIBIT(28) #define FSL_SAI_CR2_MSEL_MASK (0x3 << 26) #define FSL_SAI_CR2_MSEL_BUS 0 #define FSL_SAI_CR2_MSEL_MCLK1 BIT(26) -- 2.39.2
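The control flow the patch adds to fsl_sai_set_bclk() can be read as a small decision function: BYP (savediv == 1) plus a synced direction means BCI must be set; BYP without sync clears it; otherwise the divider path is programmed. A sketch of just that decision — the enum names are illustrative, not register values:

```c
#include <assert.h>
#include <stdbool.h>

enum bclk_cfg { CFG_DIV, CFG_BYP, CFG_BYP_BCI };

/* mirrors the branch added to fsl_sai_set_bclk(): with bypass active,
 * a synced direction additionally needs BCI asserted to receive BCLK */
static enum bclk_cfg pick_cfg(unsigned int savediv, bool dir_is_synced)
{
	if (savediv == 1)
		return dir_is_synced ? CFG_BYP_BCI : CFG_BYP;
	return CFG_DIV;
}
```

This captures the workaround's two conditions from the commit message: BYP asserted on the synced side, and BCI set so TX can still take BCLK from RX (together with SION on the pad, which is pinctrl-side and outside this driver change).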
[PATCH AUTOSEL 6.3 16/30] ASoC: fsl_sai: Enable BCI bit if SAI works on synchronous mode with BYP asserted
From: Chancel Liu [ Upstream commit 32cf0046a652116d6a216d575f3049a9ff9dd80d ] There's an issue on SAI synchronous mode that TX/RX side can't get BCLK from RX/TX it sync with if BYP bit is asserted. It's a workaround to fix it that enable SION of IOMUX pad control and assert BCI. For example if TX sync with RX which means both TX and RX are using clk form RX and BYP=1. TX can get BCLK only if the following two conditions are valid: 1. SION of RX BCLK IOMUX pad is set to 1 2. BCI of TX is set to 1 Signed-off-by: Chancel Liu Acked-by: Shengjiu Wang Link: https://lore.kernel.org/r/20230530103012.3448838-1-chancel@nxp.com Signed-off-by: Mark Brown Signed-off-by: Sasha Levin --- sound/soc/fsl/fsl_sai.c | 11 +-- sound/soc/fsl/fsl_sai.h | 1 + 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c index 990bba0be1fb1..8a64f3c1d1556 100644 --- a/sound/soc/fsl/fsl_sai.c +++ b/sound/soc/fsl/fsl_sai.c @@ -491,14 +491,21 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq) regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_MSEL_MASK, FSL_SAI_CR2_MSEL(sai->mclk_id[tx])); - if (savediv == 1) + if (savediv == 1) { regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP, FSL_SAI_CR2_BYP); - else + if (fsl_sai_dir_is_synced(sai, adir)) + regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs), + FSL_SAI_CR2_BCI, FSL_SAI_CR2_BCI); + else + regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs), + FSL_SAI_CR2_BCI, 0); + } else { regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP, savediv / 2 - 1); + } if (sai->soc_data->max_register >= FSL_SAI_MCTL) { /* SAI is in master mode at this point, so enable MCLK */ diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h index 197748a888d5f..a53c4f0e25faf 100644 --- a/sound/soc/fsl/fsl_sai.h +++ b/sound/soc/fsl/fsl_sai.h @@ -116,6 +116,7 @@ /* SAI Transmit and Receive Configuration 2 Register */ #define 
FSL_SAI_CR2_SYNC BIT(30) +#define FSL_SAI_CR2_BCIBIT(28) #define FSL_SAI_CR2_MSEL_MASK (0x3 << 26) #define FSL_SAI_CR2_MSEL_BUS 0 #define FSL_SAI_CR2_MSEL_MCLK1 BIT(26) -- 2.39.2
[PATCH v2 12/12] kprobes: remove dependency on CONFIG_MODULES
From: "Mike Rapoport (IBM)" kprobes depended on CONFIG_MODULES because it had to allocate memory for code. Since code allocations are now implemented with execmem, kprobes can be enabled in non-modular kernels. Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the dependency of CONFIG_KPROBES on CONFIG_MODULES. Signed-off-by: Mike Rapoport (IBM) --- arch/Kconfig| 2 +- kernel/kprobes.c| 43 + kernel/trace/trace_kprobe.c | 11 ++ 3 files changed, 37 insertions(+), 19 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 205fd23e0cad..f2e9f82c7d0d 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -39,9 +39,9 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES select KALLSYMS + select EXECMEM select TASKS_RCU if PREEMPTION help Kprobes allows you to trap at almost any kernel address and diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 37c928d5deaf..2c2ba29d3f9a 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1568,6 +1568,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* Check if 'p' is probing a module.
*/ *probed_mod = __module_text_address((unsigned long) p->addr); if (*probed_mod) { @@ -1591,6 +1592,8 @@ static int check_kprobe_address_safe(struct kprobe *p, ret = -ENOENT; } } +#endif + out: preempt_enable(); jump_label_unlock(); @@ -2484,24 +2487,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } -/* Remove all symbols in given area from kprobe blacklist */ -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) -{ - struct kprobe_blacklist_entry *ent, *n; - - list_for_each_entry_safe(ent, n, _blacklist, list) { - if (ent->start_addr < start || ent->start_addr >= end) - continue; - list_del(>list); - kfree(ent); - } -} - -static void kprobe_remove_ksym_blacklist(unsigned long entry) -{ - kprobe_remove_area_blacklist(entry, entry + 1); -} - int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, char *type, char *sym) { @@ -2566,6 +2551,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start, return ret ? 
: arch_populate_kprobe_blacklist(); } +#ifdef CONFIG_MODULES +/* Remove all symbols in given area from kprobe blacklist */ +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) +{ + struct kprobe_blacklist_entry *ent, *n; + + list_for_each_entry_safe(ent, n, _blacklist, list) { + if (ent->start_addr < start || ent->start_addr >= end) + continue; + list_del(>list); + kfree(ent); + } +} + +static void kprobe_remove_ksym_blacklist(unsigned long entry) +{ + kprobe_remove_area_blacklist(entry, entry + 1); +} + static void add_module_kprobe_blacklist(struct module *mod) { unsigned long start, end; @@ -2667,6 +2671,7 @@ static struct notifier_block kprobe_module_nb = { .notifier_call = kprobes_module_callback, .priority = 0 }; +#endif void kprobe_free_init_mem(void) { @@ -2726,8 +2731,10 @@ static int __init init_kprobes(void) err = arch_init_kprobes(); if (!err) err = register_die_notifier(_exceptions_nb); +#ifdef CONFIG_MODULES if (!err) err = register_module_notifier(_module_nb); +#endif kprobes_initialized = (err == 0); kprobe_sysctls_init(); diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 59cda19a9033..cf804e372554 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -111,6 +111,7 @@ static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk, return strncmp(module_name(mod), name, len) == 0 && name[len] == ':'; } +#ifdef CONFIG_MODULES static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk) { char *p; @@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk) return ret; } +#else +static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk) +{ + return false; +} +#endif static bool trace_kprobe_is_busy(struct dyn_event *ev) { @@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk) return ret; } +#ifdef CONFIG_MODULES /* Module notifier call back, checking event on the 
module */ static int trace_kprobe_module_callback(struct notifier_block *nb,
[PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
From: "Mike Rapoport (IBM)" Dynamic ftrace must allocate memory for code and this was impossible without CONFIG_MODULES. With execmem separated from the modules code, execmem_text_alloc() is available regardless of CONFIG_MODULES. Remove dependency of dynamic ftrace on CONFIG_MODULES and make CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig. Signed-off-by: Mike Rapoport (IBM) --- arch/x86/Kconfig | 1 + arch/x86/kernel/ftrace.c | 10 -- 2 files changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 53bab123a8ee..ab64bbef9e50 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -35,6 +35,7 @@ config X86_64 select SWIOTLB select ARCH_HAS_ELFCORE_COMPAT select ZONE_DMA32 + select EXECMEM if DYNAMIC_FTRACE config FORCE_DYNAMIC_FTRACE def_bool y diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index f77c63bb3203..a824a5d3b129 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command) /* Currently only x86_64 supports dynamic trampolines */ #ifdef CONFIG_X86_64 -#ifdef CONFIG_MODULES -/* Module allocation simplifies allocating memory for code */ static inline void *alloc_tramp(unsigned long size) { return execmem_text_alloc(size); @@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp) { execmem_free(tramp); } -#else -/* Trampolines can only be created if modules are supported */ -static inline void *alloc_tramp(unsigned long size) -{ - return NULL; -} -static inline void tramp_free(void *tramp) { } -#endif /* Defined as markers to the end of the ftrace default trampolines */ extern void ftrace_regs_caller_end(void); -- 2.35.1
[PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES
From: "Mike Rapoport (IBM)"

execmem does not depend on modules; on the contrary, modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/arm/kernel/module.c        | 36
 arch/arm/mm/init.c              | 36
 arch/arm64/include/asm/memory.h |  8 +
 arch/arm64/include/asm/module.h |  6
 arch/arm64/kernel/kaslr.c       |  3 +-
 arch/arm64/kernel/module.c      | 50
 arch/arm64/mm/init.c            | 56 +++
 arch/loongarch/kernel/module.c  | 18 --
 arch/loongarch/mm/init.c        | 20 +++
 arch/mips/kernel/module.c       | 18 --
 arch/mips/mm/init.c             | 19 +++
 arch/parisc/kernel/module.c     | 17 --
 arch/parisc/mm/init.c           | 22 +++-
 arch/powerpc/kernel/module.c    | 58
 arch/powerpc/mm/mem.c           | 59 +
 arch/riscv/kernel/module.c      | 34 ---
 arch/riscv/mm/init.c            | 34 +++
 arch/s390/kernel/module.c       | 38 -
 arch/s390/mm/init.c             | 41 +++
 arch/sparc/kernel/module.c      | 23 -
 arch/sparc/mm/Makefile          |  2 ++
 arch/sparc/mm/execmem.c         | 25 ++
 arch/x86/kernel/module.c        | 50
 arch/x86/mm/init.c              | 54 ++
 24 files changed, 376 insertions(+), 351 deletions(-)
 create mode 100644 arch/sparc/mm/execmem.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index f66d479c1c7d..054e799e7091 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,48 +16,12 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
 #include
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_params execmem_params = {
-	.modules = {
-		.text = {
-			.start = MODULES_VADDR,
-			.end = MODULES_END,
-			.alignment = 1,
-		},
-	},
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-	execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
-
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-		execmem_params.modules.text.fallback_start = VMALLOC_START;
-		execmem_params.modules.text.fallback_end = VMALLOC_END;
-	}
-
-	return &execmem_params;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
 	return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index ce64bdb55a16..cffd29f15782 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -486,3 +487,38 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 	free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#if defined(CONFIG_MMU) && defined(CONFIG_EXECMEM)
+static struct execmem_params execmem_params = {
+	.modules = {
+		.text = {
+			.start = MODULES_VADDR,
+			.end = MODULES_END,
+			.alignment = 1,
+		},
+	},
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+	execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
+
+	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+		execmem_params.modules.text.fallback_start = VMALLOC_START;
+		execmem_params.modules.text.fallback_end = VMALLOC_END;
+	}
+
+	return &execmem_params;
+}
+#endif
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c735afdf639b..962ad2d8b5e5 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -214,6 +214,14 @@ static inline bool kaslr_enabled(void)
 	return kaslr_offset() >=
[PATCH v2 09/12] powerpc: extend execmem_params for kprobes allocations
From: "Mike Rapoport (IBM)"

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add definition of jit area to execmem_params to allow using the generic
kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/powerpc/kernel/kprobes.c | 14 --
 arch/powerpc/kernel/module.c  | 13 +
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 5db8df5e3657..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
 	return (kprobe_opcode_t *)(addr + offset);
 }
 
-void *alloc_insn_page(void)
-{
-	void *page;
-
-	page = jit_text_alloc(PAGE_SIZE);
-	if (!page)
-		return NULL;
-
-	if (strict_module_rwx_enabled())
-		set_memory_rox((unsigned long)page, 1);
-
-	return page;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
 	int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 4c6c15bf3947..8e5b379d6da1 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -96,6 +96,11 @@ static struct execmem_params execmem_params = {
 			.alignment = 1,
 		},
 	},
+	.jit = {
+		.text = {
+			.alignment = 1,
+		},
+	},
 };
 
@@ -131,5 +136,13 @@ struct execmem_params __init *execmem_arch_params(void)
 
 	execmem_params.modules.text.pgprot = prot;
 
+	execmem_params.jit.text.start = VMALLOC_START;
+	execmem_params.jit.text.end = VMALLOC_END;
+
+	if (strict_module_rwx_enabled())
+		execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+	else
+		execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
+
 	return &execmem_params;
 }
-- 
2.35.1
[PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations
From: "Mike Rapoport (IBM)"

RISC-V overrides kprobes::alloc_insn_page() to use the entire vmalloc
area rather than limit the allocations to the modules area.

Slightly reorder execmem_params initialization to support both 32 and 64
bit variants and add definition of jit area to execmem_params to support
generic kprobes::alloc_insn_page().

Signed-off-by: Mike Rapoport (IBM)
---
 arch/riscv/kernel/module.c         | 16 +++-
 arch/riscv/kernel/probes/kprobes.c | 10 --
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index ee5e04cd3f21..cca6ed4e9340 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -436,7 +436,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+#ifdef CONFIG_MMU
 static struct execmem_params execmem_params = {
 	.modules = {
 		.text = {
@@ -444,12 +444,26 @@ static struct execmem_params execmem_params = {
 			.alignment = 1,
 		},
 	},
+	.jit = {
+		.text = {
+			.pgprot = PAGE_KERNEL_READ_EXEC,
+			.alignment = 1,
+		},
+	},
 };
 
 struct execmem_params __init *execmem_arch_params(void)
 {
+#ifdef CONFIG_64BIT
 	execmem_params.modules.text.start = MODULES_VADDR;
 	execmem_params.modules.text.end = MODULES_END;
+#else
+	execmem_params.modules.text.start = VMALLOC_START;
+	execmem_params.modules.text.end = VMALLOC_END;
+#endif
+
+	execmem_params.jit.text.start = VMALLOC_START;
+	execmem_params.jit.text.end = VMALLOC_END;
 
 	return &execmem_params;
 }
diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 	return 0;
 }
 
-#ifdef CONFIG_MMU
-void *alloc_insn_page(void)
-{
-	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-				    GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
-				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-				    __builtin_return_address(0));
-}
-#endif
-
 /* install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
-- 
2.35.1
[PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions
From: "Mike Rapoport (IBM)"

The memory allocations for kprobes on arm64 can be placed anywhere in
vmalloc address space and currently this is implemented with an override
of alloc_insn_page() in arm64.

Extend execmem_params with a range for generated code allocations and
make kprobes on arm64 use this extension rather than override
alloc_insn_page().

Signed-off-by: Mike Rapoport (IBM)
---
 arch/arm64/kernel/module.c         |  9 +
 arch/arm64/kernel/probes/kprobes.c |  7 ---
 include/linux/execmem.h            | 11 +++
 mm/execmem.c                       | 14 +-
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index c3d999f3a3dd..52b09626bc0f 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
 			.alignment = MODULE_ALIGN,
 		},
 	},
+	.jit = {
+		.text = {
+			.start = VMALLOC_START,
+			.end = VMALLOC_END,
+			.alignment = 1,
+		},
+	},
 };
 
 struct execmem_params __init *execmem_arch_params(void)
@@ -40,6 +47,8 @@ struct execmem_params __init *execmem_arch_params(void)
 	execmem_params.modules.text.start = module_alloc_base;
 	execmem_params.modules.text.end = module_alloc_end;
 
+	execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+
 	/*
 	 * KASAN without KASAN_VMALLOC can only deal with module
 	 * allocations being served from the reserved module region,
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 70b91a8c6bb3..6fccedd02b2a 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 	return 0;
 }
 
-void *alloc_insn_page(void)
-{
-	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 /* arm kprobe: install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 2e1221310d13..dc7c9a446111 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -52,12 +52,23 @@ struct execmem_modules_range {
 	struct execmem_range data;
 };
 
+/**
+ * struct execmem_jit_range - architecture parameters for address space
+ *			      suitable for JIT code allocations
+ * @text: address range for text allocations
+ */
+struct execmem_jit_range {
+	struct execmem_range text;
+};
+
 /**
  * struct execmem_params - architecture parameters for code allocations
  * @modules: parameters for modules address space
+ * @jit: parameters for jit memory address space
  */
 struct execmem_params {
 	struct execmem_modules_range	modules;
+	struct execmem_jit_range	jit;
 };
 
 /**
diff --git a/mm/execmem.c b/mm/execmem.c
index f7bf496ad4c3..9730ecef9a30 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -89,7 +89,12 @@ void execmem_free(void *ptr)
 
 void *jit_text_alloc(size_t size)
 {
-	return execmem_text_alloc(size);
+	unsigned long start = execmem_params.jit.text.start;
+	unsigned long end = execmem_params.jit.text.end;
+	pgprot_t pgprot = execmem_params.jit.text.pgprot;
+	unsigned int align = execmem_params.jit.text.alignment;
+
+	return execmem_alloc(size, start, end, align, pgprot, 0, 0, false);
 }
 
 void jit_free(void *ptr)
@@ -135,6 +140,13 @@ static void execmem_init_missing(struct execmem_params *p)
 		execmem_params.modules.data.fallback_start = m->text.fallback_start;
 		execmem_params.modules.data.fallback_end = m->text.fallback_end;
 	}
+
+	if (!execmem_params.jit.text.start) {
+		execmem_params.jit.text.start = m->text.start;
+		execmem_params.jit.text.end = m->text.end;
+		execmem_params.jit.text.alignment = m->text.alignment;
+		execmem_params.jit.text.pgprot = m->text.pgprot;
+	}
 }
 
 void __init execmem_init(void)
-- 
2.35.1
[PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()
From: "Mike Rapoport (IBM)"

Data related to code allocations, such as module data sections, must
comply with architecture constraints for its placement, but until now it
was allocated with execmem_text_alloc().

Create a dedicated API for allocating data related to code allocations
and allow architectures to define address ranges for data allocations.

Since currently this is only relevant for powerpc variants that use the
VMALLOC address space for module data allocations, automatically reuse
address ranges defined for text unless an address range for data is
explicitly defined by an architecture.

With the separation of code and data allocations, data sections of
modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC,
which was the default on many architectures.

Signed-off-by: Mike Rapoport (IBM)
---
 arch/powerpc/kernel/module.c |  8 +++
 include/linux/execmem.h      | 18 +++
 kernel/module/main.c         | 15 +++--
 mm/execmem.c                 | 43
 4 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index ba7abff77d98..4c6c15bf3947 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -103,6 +103,10 @@ struct execmem_params __init *execmem_arch_params(void)
 {
 	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
 
+	/*
+	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+	 * allow allocating data in the entire vmalloc space
+	 */
 #ifdef MODULES_VADDR
 	unsigned long limit = (unsigned long)_etext - SZ_32M;
 
@@ -116,6 +120,10 @@ struct execmem_params __init *execmem_arch_params(void)
 		execmem_params.modules.text.start = MODULES_VADDR;
 		execmem_params.modules.text.end = MODULES_END;
 	}
+	execmem_params.modules.data.start = VMALLOC_START;
+	execmem_params.modules.data.end = VMALLOC_END;
+	execmem_params.modules.data.pgprot = PAGE_KERNEL;
+	execmem_params.modules.data.alignment = 1;
 #else
 	execmem_params.modules.text.start = VMALLOC_START;
 	execmem_params.modules.text.end = VMALLOC_END;
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index b9a97fcdf3c5..2e1221310d13 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -44,10 +44,12 @@ enum execmem_module_flags {
  *	     space
  * @flags: options for module memory allocations
  * @text: address range for text allocations
+ * @data: address range for data allocations
  */
 struct execmem_modules_range {
 	enum execmem_module_flags flags;
 	struct execmem_range text;
+	struct execmem_range data;
 };
 
 /**
@@ -89,6 +91,22 @@ struct execmem_params *execmem_arch_params(void);
  */
 void *execmem_text_alloc(size_t size);
 
+/**
+ * execmem_data_alloc - allocate memory for data coupled to code
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain data coupled with executable code,
+ * like data sections in kernel modules.
+ *
+ * The memory will have protections defined by architecture.
+ *
+ * The allocated memory will reside in an area that does not impose
+ * restrictions on the addressing modes.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_data_alloc(size_t size);
+
 /**
  * execmem_free - free executable memory
  * @ptr: pointer to the memory that should be freed
diff --git a/kernel/module/main.c b/kernel/module/main.c
index b445c5ad863a..d6582bfec1f6 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1195,25 +1195,16 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
-static bool mod_mem_use_vmalloc(enum mod_mem_type type)
-{
-	return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
-		mod_mem_type_is_core_data(type);
-}
-
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-	if (mod_mem_use_vmalloc(type))
-		return vzalloc(size);
+	if (mod_mem_type_is_data(type))
+		return execmem_data_alloc(size);
 	return execmem_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
 {
-	if (mod_mem_use_vmalloc(type))
-		vfree(ptr);
-	else
-		execmem_free(ptr);
+	execmem_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
diff --git a/mm/execmem.c b/mm/execmem.c
index a67acd75ffef..f7bf496ad4c3 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
 			     fallback_start, fallback_end, kasan);
 }
 
+void *execmem_data_alloc(size_t size)
+{
+	unsigned long start = execmem_params.modules.data.start;
+	unsigned long end = execmem_params.modules.data.end;
+	pgprot_t pgprot =
[PATCH v2 05/12] modules, execmem: drop module_alloc
From: "Mike Rapoport (IBM)"

Define default parameters for the address range for code allocations
using the current values in module_alloc() and make execmem_text_alloc()
use these defaults when an architecture does not supply its specific
parameters.

With this, execmem_text_alloc() implements memory allocation in a way
compatible with module_alloc() and can be used as a replacement for
module_alloc().

Signed-off-by: Mike Rapoport (IBM)
---
 include/linux/execmem.h      |  8
 include/linux/moduleloader.h | 12
 kernel/module/main.c         |  7 ---
 mm/execmem.c                 | 12
 4 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 68b2bfc79993..b9a97fcdf3c5 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -4,6 +4,14 @@
 
 #include
 
+#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
+		!defined(CONFIG_KASAN_VMALLOC)
+#include
+#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define MODULE_ALIGN PAGE_SIZE
+#endif
+
 /**
  * struct execmem_range - definition of a memory range suitable for code and
  *			  related data allocations
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index b3374342f7af..4321682fe849 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
 /* Additional bytes needed by arch in front of individual sections */
 unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
 
-/* Allocator used for allocating struct module, core sections and init
-   sections.  Returns NULL on failure. */
-void *module_alloc(unsigned long size);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
@@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
 /* Any cleanup before freeing mod->module_init */
 void module_arch_freeing_init(struct module *mod);
 
-#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
-		!defined(CONFIG_KASAN_VMALLOC)
-#include
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-#else
-#define MODULE_ALIGN PAGE_SIZE
-#endif
-
 #endif
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 43810a3bdb81..b445c5ad863a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
 	}
 }
 
-void * __weak module_alloc(unsigned long size)
-{
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
diff --git a/mm/execmem.c b/mm/execmem.c
index 2fe36dcc7bdf..a67acd75ffef 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -59,9 +59,6 @@ void *execmem_text_alloc(size_t size)
 	unsigned long fallback_end = execmem_params.modules.text.fallback_end;
 	bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
 
-	if (!execmem_params.modules.text.start)
-		return module_alloc(size);
-
 	return execmem_alloc(size, start, end, align, pgprot,
 			     fallback_start, fallback_end, kasan);
 }
@@ -108,8 +105,15 @@ void __init execmem_init(void)
 {
 	struct execmem_params *p = execmem_arch_params();
 
-	if (!p)
+	if (!p) {
+		p = &execmem_params;
+		p->modules.text.start = VMALLOC_START;
+		p->modules.text.end = VMALLOC_END;
+		p->modules.text.pgprot = PAGE_KERNEL_EXEC;
+		p->modules.text.alignment = 1;
+
 		return;
+	}
 
 	if (!execmem_validate_params(p))
 		return;
-- 
2.35.1