Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 12:48:02PM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> > 
> > module_alloc() is used everywhere as a means to allocate memory for code.
> > 
> > Besides being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF, to modules
> > and puts the burden of code allocation on the modules code.
> > 
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> > 
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> > 
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc(), and execmem_free() and jit_free() are replacements for
> > module_memfree(), to allow updating all call sites to use the new APIs.
> > 
> > The intended semantics of the new allocation APIs are:
> > 
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> > 
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> > 
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code; allocations of the associated
> > data, like the data sections of a module, will use the execmem_data_alloc()
> > interface that will be added later.
> 
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.

With near and far, the names would also have to say near or far from what,
and that's getting too long. I don't mind changing the names, but I couldn't
think of anything
better than Song's execmem and your jit.
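
For anyone skimming the thread, the initial wrappers the commit message
describes boil down to roughly the following sketch (bodies paraphrased
from the description above, not copied from the patch):

void *execmem_text_alloc(size_t size)
{
	return module_alloc(size);	/* must stay close to the kernel image */
}

void *jit_text_alloc(size_t size)
{
	return module_alloc(size);	/* placement may be relaxed per arch */
}

void execmem_free(void *ptr)
{
	module_memfree(ptr);		/* replaces module_memfree() at call sites */
}

void jit_free(void *ptr)
{
	module_memfree(ptr);
}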
 
> Reviewed-by: Kent Overstreet 

Thanks!

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 01/12] nios2: define virtual address space for modules

2023-06-16 Thread Mike Rapoport
On Fri, Jun 16, 2023 at 04:00:19PM +, Edgecombe, Rick P wrote:
> On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> >  void *module_alloc(unsigned long size)
> >  {
> > -   if (size == 0)
> > -   return NULL;
> > -   return kmalloc(size, GFP_KERNEL);
> > -}
> > -
> > -/* Free memory returned from module_alloc */
> > -void module_memfree(void *module_region)
> > -{
> > -   kfree(module_region);
> > +   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> > +   GFP_KERNEL, PAGE_KERNEL_EXEC,
> > +   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> > +   __builtin_return_address(0));
> >  }
> >  
> >  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *s
> 
> I wonder if the (size == 0) check is really needed; __vmalloc_node_range()
> will WARN in this case, where the old code won't.

module_alloc() should not be called with zero size, so a warning there
would be appropriate.
Besides, no other module_alloc() had this check.

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 RESEND] ASoC: fsl MPC52xx drivers require PPC_BESTCOMM

2023-06-16 Thread Randy Dunlap
Hi Mark, Liam,

On 5/30/23 16:38, Randy Dunlap wrote:
> Hello maintainers,
> 
> I am still seeing these build errors on linux-next-20230530.
> 
> Is there a problem with the patch?
> Thanks.
> 

I am still seeing build errors on linux-next-20230615.

Is there a problem with the patch?

Can it be applied/merged?

Thanks.

> On 5/21/23 15:57, Randy Dunlap wrote:
>> Both SND_MPC52xx_SOC_PCM030 and SND_MPC52xx_SOC_EFIKA select
>> SND_SOC_MPC5200_AC97. The latter symbol depends on PPC_BESTCOMM,
>> so the 2 former symbols should also depend on PPC_BESTCOMM since
>> "select" does not follow any dependency chains.
>>
>> This prevents a kconfig warning and build errors:
>>
>> WARNING: unmet direct dependencies detected for SND_SOC_MPC5200_AC97
>>   Depends on [n]: SOUND [=y] && !UML && SND [=m] && SND_SOC [=m] && SND_POWERPC_SOC [=m] && PPC_MPC52xx [=y] && PPC_BESTCOMM [=n]
>>   Selected by [m]:
>>   - SND_MPC52xx_SOC_PCM030 [=m] && SOUND [=y] && !UML && SND [=m] && SND_SOC [=m] && SND_POWERPC_SOC [=m] && PPC_MPC5200_SIMPLE [=y]
>>   - SND_MPC52xx_SOC_EFIKA [=m] && SOUND [=y] && !UML && SND [=m] && SND_SOC [=m] && SND_POWERPC_SOC [=m] && PPC_EFIKA [=y]
>>
>> ERROR: modpost: "mpc5200_audio_dma_destroy" [sound/soc/fsl/mpc5200_psc_ac97.ko] undefined!
>> ERROR: modpost: "mpc5200_audio_dma_create" [sound/soc/fsl/mpc5200_psc_ac97.ko] undefined!
>>
>> Fixes: 40d9ec14e7e1 ("ASoC: remove BROKEN from Efika and pcm030 fabric drivers")
>> Signed-off-by: Randy Dunlap 
>> Cc: Grant Likely 
>> Cc: Mark Brown 
>> Cc: Liam Girdwood 
>> Cc: Shengjiu Wang 
>> Cc: Xiubo Li 
>> Cc: alsa-de...@alsa-project.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: Jaroslav Kysela 
>> Cc: Takashi Iwai 
>> ---
>> v2: use correct email address for Mark Brown.
>>
>>  sound/soc/fsl/Kconfig |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff -- a/sound/soc/fsl/Kconfig b/sound/soc/fsl/Kconfig
>> --- a/sound/soc/fsl/Kconfig
>> +++ b/sound/soc/fsl/Kconfig
>> @@ -243,7 +243,7 @@ config SND_SOC_MPC5200_AC97
>>  
>>  config SND_MPC52xx_SOC_PCM030
>>  tristate "SoC AC97 Audio support for Phytec pcm030 and WM9712"
>> -depends on PPC_MPC5200_SIMPLE
>> +depends on PPC_MPC5200_SIMPLE && PPC_BESTCOMM
>>  select SND_SOC_MPC5200_AC97
>>  select SND_SOC_WM9712
>>  help
>> @@ -252,7 +252,7 @@ config SND_MPC52xx_SOC_PCM030
>>  
>>  config SND_MPC52xx_SOC_EFIKA
>>  tristate "SoC AC97 Audio support for bbplan Efika and STAC9766"
>> -depends on PPC_EFIKA
>> +depends on PPC_EFIKA && PPC_BESTCOMM
>>  select SND_SOC_MPC5200_AC97
>>  select SND_SOC_STAC9766
>>  help
> 

-- 
~Randy


Re: [PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()

2023-06-16 Thread Yu Zhao
On Thu, Jun 15, 2023 at 04:02:43PM -0700, Hugh Dickins wrote:
> To keep balance in future, __update_tlb() remember to pte_unmap() after
> pte_offset_map().  This is an odd case, since the caller has already done
> pte_offset_map_lock(), then mips forgets the address and recalculates it;
> but my two naive attempts to clean that up did more harm than good.
> 
> Tested-by: Nathan Chancellor 
> Signed-off-by: Hugh Dickins 

FWIW: Tested-by: Yu Zhao 
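
For reference, the pairing rule the patch restores looks roughly like
this (a hypothetical caller for illustration, not the actual mips
__update_tlb() code):

static void example_walk(pmd_t *pmdp, unsigned long address)
{
	pte_t *ptep = pte_offset_map(pmdp, address);

	if (!ptep)
		return;
	/* ... use *ptep while the kmap/highmem mapping is live ... */
	pte_unmap(ptep);	/* every successful pte_offset_map() needs this */
}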

There is another problem, likely caused by khugepaged, that has happened
multiple times. But I don't think it's related to your series; just FYI.

  Got mcheck at 81134ef0
  CPU: 3 PID: 36 Comm: khugepaged Not tainted 6.4.0-rc6-00049-g62d8779610bb-dirty #1
  $ 0   :  0014 411ac004 4000
  $ 4   : c000 0045 00011a80045b 00011a80045b
  $ 8   : 800080188000 81b526c0 0200 
  $12   : 0028 81910cb4  0207
  $16   : 00aaab80 837ee990 81b50200 85066ae0
  $20   : 0001 8000 81c1 00aaab80
  $24   : 0002 812b75f8
  $28   : 8231 82313b00 81b5 81134d88
  Hi: 017a
  Lo: 
  epc   : 81134ef0 __update_tlb+0x260/0x2a0
  ra: 81134d88 __update_tlb+0xf8/0x2a0
  Status: 14309ce2  KX SX UX KERNEL EXL
  Cause : 00800060 (ExcCode 18)
  PrId  : 000d9602 (Cavium Octeon III)
  CPU: 3 PID: 36 Comm: khugepaged Not tainted 6.4.0-rc6-00049-g62d8779610bb-dirty #1
  Stack : 0001  0008 82313768
  82313768 823138f8  
  a6c8cd76e1667e00 81db4f28 0001 30302d3663722d30
  643236672d393430 0010 81910cc0 
  81d96bcc   81a68ed0
  81b5 0001 8000 81c1
  00aaab80 0002 815b78c0 a184e710
  00c0 8231 82313760 81b5
  8111c9cc   
    8111c9ec 
  ...
  Call Trace:
  [] show_stack+0x64/0x158
  [] dump_stack_lvl+0x5c/0x7c
  [] do_mcheck+0x2c/0x98
  [] handle_mcheck_int+0x38/0x50
  
  Index: 8000
  PageMask : 1fe000
  EntryHi  : 00aaab8000bd
  EntryLo0 : 411a8004
  EntryLo1 : 411ac004
  Wired: 0
  PageGrain: e800
  
  Index:  2 pgmask=4kb va=c0fe4000 asid=b9
	[ri=0 xi=0 pa=22a7000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=22af000 c=0 d=1 v=1 g=1]
  Index:  3 pgmask=4kb va=c0fe8000 asid=b9
	[ri=0 xi=0 pa=238 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2381000 c=0 d=1 v=1 g=1]
  Index:  4 pgmask=4kb va=c0fea000 asid=b9
	[ri=0 xi=0 pa=23e9000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=23ea000 c=0 d=1 v=1 g=1]
  Index:  5 pgmask=4kb va=c0fee000 asid=b9
	[ri=0 xi=0 pa=2881000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2882000 c=0 d=1 v=1 g=1]
  Index:  6 pgmask=4kb va=c0fefffb asid=b9
	[ri=0 xi=0 pa=2cc2000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=2cc3000 c=0 d=1 v=1 g=1]
  Index:  7 pgmask=4kb va=c0fec000 asid=b9
	[ri=0 xi=0 pa=23eb000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=288 c=0 d=1 v=1 g=1]
  Index:  8 pgmask=4kb va=c0fe6000 asid=b9
	[ri=0 xi=0 pa=237e000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=237f000 c=0 d=1 v=1 g=1]
  Index: 14 pgmask=4kb va=c0fefff62000 asid=8e
	[ri=0 xi=0 pa=7477000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=745e000 c=0 d=1 v=1 g=1]
  Index: 15 pgmask=4kb va=c0fefff52000 asid=8e
	[ri=0 xi=0 pa=744c000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=616d000 c=0 d=1 v=1 g=1]
  Index: 16 pgmask=4kb va=c0fefff42000 asid=8e
	[ri=0 xi=0 pa=6334000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=616b000 c=0 d=1 v=1 g=1]
  Index: 19 pgmask=4kb va=c0fefffb6000 asid=8e
	[ri=0 xi=0 pa=505 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=5051000 c=0 d=1 v=1 g=1]
  Index: 20 pgmask=4kb va=c0fefff72000 asid=b9
	[ri=0 xi=0 pa=7504000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=7503000 c=0 d=1 v=1 g=1]
  Index: 58 pgmask=4kb va=c0fefffaa000 asid=8e
	[ri=0 xi=0 pa=5126000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=5127000 c=0 d=1 v=1 g=1]
  Index: 59 pgmask=4kb va=c0fefffba000 asid=8e
	[ri=0 xi=0 pa=5129000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=512a000 c=0 d=1 v=1 g=1]
  Index: 79 pgmask=4kb va=c006 asid=8e
	[ri=0 xi=0 pa=534b000 c=0 d=1 v=1 g=1] [ri=0 xi=0 pa=62f9000 c=0 d=1 v=1 g=1]
  Index: 80 pgmask=4kb va=c005e000 asid=8e
  

[powerpc:next] BUILD SUCCESS 7a313166d7dd0b366f8a47e15cd3c40910494cb6

2023-06-16 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
branch HEAD: 7a313166d7dd0b366f8a47e15cd3c40910494cb6  powerpc: update ppc_save_regs to save current r1 in pt_regs

elapsed time: 724m

configs tested: 132
configs skipped: 6

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
alpharandconfig-r001-20230616   gcc  
arc  allyesconfig   gcc  
arc defconfig   gcc  
arc  randconfig-r016-20230616   gcc  
arc  randconfig-r025-20230616   gcc  
arc  randconfig-r031-20230616   gcc  
arc  randconfig-r043-20230616   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm  randconfig-r003-20230616   gcc  
arm  randconfig-r011-20230616   clang
arm  randconfig-r033-20230616   gcc  
arm  randconfig-r046-20230616   clang
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r001-20230616   clang
arm64randconfig-r032-20230616   clang
arm64randconfig-r036-20230616   clang
cskydefconfig   gcc  
csky randconfig-r012-20230616   gcc  
csky randconfig-r022-20230616   gcc  
csky randconfig-r026-20230616   gcc  
csky randconfig-r034-20230616   gcc  
hexagon  randconfig-r041-20230616   clang
hexagon  randconfig-r045-20230616   clang
i386 allyesconfig   gcc  
i386 buildonly-randconfig-r004-20230616   clang
i386 buildonly-randconfig-r005-20230616   clang
i386 buildonly-randconfig-r006-20230616   clang
i386  debian-10.3   gcc  
i386defconfig   gcc  
i386 randconfig-i001-20230616   clang
i386 randconfig-i002-20230616   clang
i386 randconfig-i003-20230616   clang
i386 randconfig-i004-20230616   clang
i386 randconfig-i005-20230616   clang
i386 randconfig-i006-20230616   clang
i386 randconfig-i011-20230616   gcc  
i386 randconfig-i012-20230616   gcc  
i386 randconfig-i013-20230616   gcc  
i386 randconfig-i014-20230616   gcc  
i386 randconfig-i015-20230616   gcc  
i386 randconfig-i016-20230616   gcc  
i386 randconfig-r002-20230616   clang
i386 randconfig-r004-20230616   clang
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarch   defconfig   gcc  
loongarchrandconfig-r005-20230616   gcc  
loongarchrandconfig-r013-20230616   gcc  
loongarchrandconfig-r031-20230616   gcc  
m68k allmodconfig   gcc  
m68k allyesconfig   gcc  
m68kdefconfig   gcc  
m68k randconfig-r035-20230616   gcc  
microblaze   randconfig-r011-20230616   gcc  
microblaze   randconfig-r021-20230616   gcc  
mips allmodconfig   gcc  
mips allyesconfig   gcc  
mips randconfig-r024-20230616   clang
nios2   defconfig   gcc  
nios2randconfig-r013-20230616   gcc  
openrisc randconfig-r006-20230616   gcc  
openrisc randconfig-r014-20230616   gcc  
openrisc randconfig-r025-20230616   gcc  
parisc   allyesconfig   gcc  
parisc  defconfig   gcc  
parisc   randconfig-r002-20230616   gcc  
parisc   randconfig-r023-20230616   gcc  
parisc64defconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
powerpc  randconfig-r002-20230616   clang
powerpc  randconfig-r036-20230616   clang
riscvallmodconfig   gcc  
riscv allnoconfig   gcc  
riscvallyesconfig   gcc  
riscv   defconfig   gcc  
riscvrandconfig-r035-20230616   clang
riscvrandconfig-r042-20230616   gcc  
riscv  rv32_defconfig   gcc  
s390 allmodconfig   gcc  
s390

[powerpc:merge] BUILD SUCCESS 12ffddc6444780aec83fa5086673ec005c0bace4

2023-06-16 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 12ffddc6444780aec83fa5086673ec005c0bace4  Automatic merge of 'fixes' into merge (2023-06-09 23:36)

elapsed time: 724m

configs tested: 110
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
alpharandconfig-r001-20230616   gcc  
arc  allyesconfig   gcc  
arc defconfig   gcc  
arc  randconfig-r022-20230616   gcc  
arc  randconfig-r035-20230616   gcc  
arc  randconfig-r043-20230616   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm  randconfig-r006-20230616   gcc  
arm  randconfig-r026-20230616   clang
arm  randconfig-r034-20230616   gcc  
arm  randconfig-r046-20230616   clang
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r002-20230616   clang
cskydefconfig   gcc  
csky randconfig-r021-20230616   gcc  
hexagon  randconfig-r041-20230616   clang
hexagon  randconfig-r045-20230616   clang
i386 allyesconfig   gcc  
i386 buildonly-randconfig-r004-20230616   clang
i386 buildonly-randconfig-r005-20230616   clang
i386 buildonly-randconfig-r006-20230616   clang
i386  debian-10.3   gcc  
i386defconfig   gcc  
i386 randconfig-i001-20230616   clang
i386 randconfig-i002-20230616   clang
i386 randconfig-i003-20230616   clang
i386 randconfig-i004-20230616   clang
i386 randconfig-i005-20230616   clang
i386 randconfig-i006-20230616   clang
i386 randconfig-i011-20230616   gcc  
i386 randconfig-i012-20230616   gcc  
i386 randconfig-i013-20230616   gcc  
i386 randconfig-i014-20230616   gcc  
i386 randconfig-i015-20230616   gcc  
i386 randconfig-i016-20230616   gcc  
i386 randconfig-r005-20230616   clang
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarch   defconfig   gcc  
loongarchrandconfig-r003-20230616   gcc  
loongarchrandconfig-r012-20230616   gcc  
loongarchrandconfig-r014-20230616   gcc  
loongarchrandconfig-r015-20230616   gcc  
m68k allmodconfig   gcc  
m68k allyesconfig   gcc  
m68kdefconfig   gcc  
m68k randconfig-r024-20230616   gcc  
mips allmodconfig   gcc  
mips allyesconfig   gcc  
mips randconfig-r006-20230616   gcc  
nios2   defconfig   gcc  
openrisc randconfig-r002-20230616   gcc  
openrisc randconfig-r031-20230616   gcc  
parisc   allyesconfig   gcc  
parisc  defconfig   gcc  
parisc   randconfig-r011-20230616   gcc  
parisc   randconfig-r016-20230616   gcc  
parisc64defconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
riscvallmodconfig   gcc  
riscv allnoconfig   gcc  
riscvallyesconfig   gcc  
riscv   defconfig   gcc  
riscvrandconfig-r042-20230616   gcc  
riscv  rv32_defconfig   gcc  
s390 allmodconfig   gcc  
s390 allyesconfig   gcc  
s390defconfig   gcc  
s390 randconfig-r013-20230616   gcc  
s390 randconfig-r032-20230616   clang
s390 randconfig-r044-20230616   gcc  
sh   allmodconfig   gcc  
sh   randconfig-r004-20230616   gcc  
sh   randconfig-r005-20230616   gcc  
sh   randconfig-r033-20230616   gcc  
sparcallyesconfig   gcc  
sparc   defconfig   gcc  
sparc64  randconfig-r003-20230616   gcc  
um  defconfig   gcc  
um

Re: [PATCH v4 04/34] pgtable: Create struct ptdesc

2023-06-16 Thread Vishal Moola
On Thu, Jun 15, 2023 at 12:57 AM Hugh Dickins  wrote:
>
> On Mon, 12 Jun 2023, Vishal Moola (Oracle) wrote:
>
> > Currently, page table information is stored within struct page. As part
> > of simplifying struct page, create struct ptdesc for page table
> > information.
> >
> > Signed-off-by: Vishal Moola (Oracle) 
>
> Vishal, as I think you have already guessed, your ptdesc series and
> my pte_free_defer() "mm: free retracted page table by RCU" series are
> on a collision course.
>
> Probably just trivial collisions in most architectures, which either
> of us can easily adjust to the other; powerpc likely to be more awkward,
> but fairly easily resolved; s390 quite a problem.
>
> I've so far been unable to post a v2 of my series (and powerpc and s390
> were stupidly wrong in the v1), because a good s390 patch is not yet
> decided - Gerald Schaefer and I are currently working on that, on the
> s390 list (I took off most Ccs until we are settled and I can post v2).
>
> As you have no doubt found yourself, s390 has sophisticated handling of
> free half-pages already, and I need to add rcu_head usage in there too:
> it's tricky to squeeze it all in, and ptdesc does not appear to help us
> in any way (though mostly it's just changing some field names, okay).
>
> If ptdesc were actually allowing a flexible structure which architectures
> could add into, that would (in some future) be nice; but of course at
> present it's still fitting it all into one struct page, and mandating
> new restrictions which just make an architecture's job harder.

A goal of ptdescs is to make architectures' jobs simpler and standardized.
Unfortunately, ptdescs are nowhere near isolated from struct page yet.
This version of struct ptdesc contains exactly the fields architectures
need right now, just reorganized to be located next to each other. It *probably*
shouldn't make an architecture's job harder, aside from discouraging the use
of yet more members of struct page.

> Some notes on problematic fields below FYI.
>
> > ---
> >  include/linux/pgtable.h | 51 +
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index c5a51481bbb9..330de96ebfd6 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> >  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> >  #endif /* CONFIG_MMU */
> >
> > +
> > +/**
> > + * struct ptdesc - Memory descriptor for page tables.
> > + * @__page_flags: Same as page flags. Unused for page tables.
> > + * @pt_list: List of used page tables. Used for s390 and x86.
> > + * @_pt_pad_1: Padding that aliases with page's compound head.
> > + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> > + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
> > + * @pt_mm: Used for x86 pgds.
> > + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only.
> > + * @ptl: Lock for the page table.
> > + *
> > + * This struct overlays struct page for now. Do not modify without a good
> > + * understanding of the issues.
> > + */
> > +struct ptdesc {
> > + unsigned long __page_flags;
> > +
> > + union {
> > + struct list_head pt_list;
>
> I shall be needing struct rcu_head rcu_head (or pt_rcu_head or whatever,
> if you prefer) in this union too.  Sharing the lru or pt_list with rcu_head
> is what's difficult to get right and efficient on s390 - and if ptdesc gave
> us an independent rcu_head for each page table, that would be a blessing!
> but sadly not, it still has to squeeze into a struct page.

I can add a pt_rcu_head along with a comment to deter aliasing issues :)
Independent rcu_heads aren't coming any time soon though :(
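
A rough sketch of that addition (field placement here is illustrative
only, not from the posted series):

	union {
		struct list_head pt_list;
		/* aliases pt_list: a table must never be on the list and
		 * pending RCU free at the same time */
		struct rcu_head pt_rcu_head;
		struct {
			unsigned long _pt_pad_1;
			pgtable_t pmd_huge_pte;
		};
	};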

> > + struct {
> > + unsigned long _pt_pad_1;
> > + pgtable_t pmd_huge_pte;
> > + };
> > + };
> > + unsigned long _pt_s390_gaddr;
> > +
> > + union {
> > + struct mm_struct *pt_mm;
> > + atomic_t pt_frag_refcount;
>
> Whether s390 will want pt_mm is not yet decided: I want to use it,
> Gerald prefers to go without it; but if we do end up using it,
> then pt_frag_refcount is a luxury we would have to give up.

I don't like the use of pt_mm for s390 either. s390 uses space equivalent
to all five words allocated in the page table struct (albeit in various places
of struct page). Using extra space (especially allocated for unrelated
reasons) just because it exists makes things more complicated and
confusing, and s390 is already confusing enough as a result of that.

If having access to pt_mm is necessary I can drop the
pt_frag_refcount patch, but I'd rather avoid it.

> s390 does very well already with its _refcount tricks, and I'd expect
> powerpc's simpler but more wasteful implementation to work as 

Re: [PATCH v4 04/34] pgtable: Create struct ptdesc

2023-06-16 Thread Matthew Wilcox
On Thu, Jun 15, 2023 at 12:57:19AM -0700, Hugh Dickins wrote:
> Probably just trivial collisions in most architectures, which either
> of us can easily adjust to the other; powerpc likely to be more awkward,
> but fairly easily resolved; s390 quite a problem.
> 
> I've so far been unable to post a v2 of my series (and powerpc and s390
> were stupidly wrong in the v1), because a good s390 patch is not yet
> decided - Gerald Schaefer and I are currently working on that, on the
> s390 list (I took off most Ccs until we are settled and I can post v2).
> 
> As you have no doubt found yourself, s390 has sophisticated handling of
> free half-pages already, and I need to add rcu_head usage in there too:
> it's tricky to squeeze it all in, and ptdesc does not appear to help us
> in any way (though mostly it's just changing some field names, okay).
> 
> If ptdesc were actually allowing a flexible structure which architectures
> could add into, that would (in some future) be nice; but of course at
> present it's still fitting it all into one struct page, and mandating
> new restrictions which just make an architecture's job harder.

The intent is to get ptdescs to be dynamically allocated at some point
in the ~2-3 years out future when we have finished the folio project ...
which is not a terribly helpful thing for me to say.

I have three suggestions, probably all dreadful:

1. s390 could change its behaviour to always allocate page tables in
pairs.  That is, it fills in two pmd_t entries any time it takes a fault
in either of them.

2. We could allocate two or four pages at a time for s390 to allocate
2kB pages from.  That gives us a lot more space to store RCU heads.

3. We could use s390 as a guinea-pig for dynamic ptdesc allocation.
Every time we allocate a page-table page, we allocate a ptdesc from a
slab cache holding an s390-special definition of struct ptdesc, and store
a pointer to it in compound_head.

We could sweeten #3 by doing that not just for s390 but also for every
configuration which has ALLOC_SPLIT_PTLOCKS today.  That would get rid
of the ambiguity between "is ptl a pointer or a lock".

> But I've no desire to undo powerpc's use of pt_frag_refcount:
> just warning that we may want to undo any use of it in s390.

I would dearly love ppc & s390 to use the _same_ scheme to solve the
same problem.


Re: [PATCH v9 00/14] pci: Work around ASMedia ASM2824 PCIe link training failures

2023-06-16 Thread Bjorn Helgaas
On Fri, Jun 16, 2023 at 01:27:52PM +0100, Maciej W. Rozycki wrote:
> On Thu, 15 Jun 2023, Bjorn Helgaas wrote:

>  As per my earlier remark:
> 
> > I think making a system halfway-fixed would make little sense, but with
> > the actual fix actually made last as you suggested I think this can be
> > split off, because it'll make no functional change by itself.
> 
> I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS 
> stub into the change carrying the actual workaround and then have the 
> reset path update with a follow-up change only, but I won't fight over it.  
> It's only one tree revision that will be in this halfway-fixed state and 
> I'll trust your judgement here.

Thanks for raising this.  Here's my thought process:

  12 PCI: Provide stub failed link recovery for device probing and hot plug
  13 PCI: Add failed link recovery for device reset events
  14 PCI: Work around PCIe link training failures

Patch 12 [1] adds calls to pcie_failed_link_retrain(), which does
nothing and returns false.  Functionally, it's a no-op, but the
structure is important later.

Patch 13 [2] claims to request failed link recovery after resets, but
actually doesn't do anything yet because pcie_failed_link_retrain() is
still a no-op, so this was a bit confusing.

Patch 14 [3] implements pcie_failed_link_retrain(), so the recovery
mentioned in 12 and 13 actually happens.  But this patch doesn't add
the call to pcie_failed_link_retrain(), so it's a little bit hard to
connect the dots.

I agree that as I rearranged it, the workaround doesn't apply in all
cases simultaneously.  Maybe not ideal, but maybe not terrible either.
Looking at it again, maybe it would have made more sense to move the
pcie_wait_for_link_delay() change to the last patch along with the
pci_dev_wait() change.  I dunno.

Bjorn

[1] 12 https://lore.kernel.org/r/alpine.deb.2.21.2306111619570.64...@angie.orcam.me.uk
[2] 13 https://lore.kernel.org/r/alpine.deb.2.21.2306111631050.64...@angie.orcam.me.uk
[3] 14 https://lore.kernel.org/r/alpine.deb.2.21.2305310038540.59...@angie.orcam.me.uk


Re: [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Dynamic ftrace must allocate memory for code and this was impossible
> without CONFIG_MODULES.
>
> With execmem separated from the modules code, execmem_text_alloc() is
> available regardless of CONFIG_MODULES.
>
> Remove dependency of dynamic ftrace on CONFIG_MODULES and make
> CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 

> ---
>  arch/x86/Kconfig |  1 +
>  arch/x86/kernel/ftrace.c | 10 --
>  2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 53bab123a8ee..ab64bbef9e50 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -35,6 +35,7 @@ config X86_64
> select SWIOTLB
> select ARCH_HAS_ELFCORE_COMPAT
> select ZONE_DMA32
> +   select EXECMEM if DYNAMIC_FTRACE
>
>  config FORCE_DYNAMIC_FTRACE
> def_bool y
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index f77c63bb3203..a824a5d3b129 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
>  /* Currently only x86_64 supports dynamic trampolines */
>  #ifdef CONFIG_X86_64
>
> -#ifdef CONFIG_MODULES
> -/* Module allocation simplifies allocating memory for code */
>  static inline void *alloc_tramp(unsigned long size)
>  {
> return execmem_text_alloc(size);
> @@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
>  {
> execmem_free(tramp);
>  }
> -#else
> -/* Trampolines can only be created if modules are supported */
> -static inline void *alloc_tramp(unsigned long size)
> -{
> -   return NULL;
> -}
> -static inline void tramp_free(void *tramp) { }
> -#endif
>
>  /* Defined as markers to the end of the ftrace default trampolines */
>  extern void ftrace_regs_caller_end(void);
> --
> 2.35.1
>


Re: [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> execmem does not depend on modules, on the contrary modules use
> execmem.
>
> To make execmem available when CONFIG_MODULES=n, for instance for
> kprobes, split execmem_params initialization out from
> arch/kernel/module.c and compile it when CONFIG_EXECMEM=y
>
> Signed-off-by: Mike Rapoport (IBM) 
> ---
[...]
> +
> +struct execmem_params __init *execmem_arch_params(void)
> +{
> +   u64 module_alloc_end;
> +
> +   kaslr_init();

Aha, this addresses my comment on the earlier patch. Thanks!

Acked-by: Song Liu 


> +
> +   module_alloc_end = module_alloc_base + MODULES_VSIZE;
> +
> +   execmem_params.modules.text.pgprot = PAGE_KERNEL;
> +   execmem_params.modules.text.start = module_alloc_base;
> +   execmem_params.modules.text.end = module_alloc_end;
> +
> +   execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
[...]


Re: [PATCH v2 09/12] powerpc: extend execmem_params for kprobes allocations

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> powerpc overrides kprobes::alloc_insn_page() to remove writable
> permissions when STRICT_MODULE_RWX is on.
>
> Add definition of jit area to execmem_params to allow using the generic
> kprobes::alloc_insn_page() with the desired permissions.
>
> As powerpc uses breakpoint instructions to inject kprobes, it does not
> need to constrain kprobe allocations to the modules area and can use the
> entire vmalloc address space.
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 
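
With the jit range defined here, the generic helper the commit message
refers to presumably reduces to something like this sketch (not quoted
from the series):

void *alloc_insn_page(void)
{
	/* pgprot from execmem_params.jit.text is applied by the allocator */
	return jit_text_alloc(PAGE_SIZE);
}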


> ---
>  arch/powerpc/kernel/kprobes.c | 14 --
>  arch/powerpc/kernel/module.c  | 13 +
>  2 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 5db8df5e3657..14c5ddec3056 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
> return (kprobe_opcode_t *)(addr + offset);
>  }
>
> -void *alloc_insn_page(void)
> -{
> -   void *page;
> -
> -   page = jit_text_alloc(PAGE_SIZE);
> -   if (!page)
> -   return NULL;
> -
> -   if (strict_module_rwx_enabled())
> -   set_memory_rox((unsigned long)page, 1);
> -
> -   return page;
> -}
> -
>  int arch_prepare_kprobe(struct kprobe *p)
>  {
> int ret = 0;
> diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
> index 4c6c15bf3947..8e5b379d6da1 100644
> --- a/arch/powerpc/kernel/module.c
> +++ b/arch/powerpc/kernel/module.c
> @@ -96,6 +96,11 @@ static struct execmem_params execmem_params = {
> .alignment = 1,
> },
> },
> +   .jit = {
> +   .text = {
> +   .alignment = 1,
> +   },
> +   },
>  };
>
>
> @@ -131,5 +136,13 @@ struct execmem_params __init *execmem_arch_params(void)
>
> execmem_params.modules.text.pgprot = prot;
>
> +   execmem_params.jit.text.start = VMALLOC_START;
> +   execmem_params.jit.text.end = VMALLOC_END;
> +
> +   if (strict_module_rwx_enabled())
> +   execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
> +   else
> +   execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
> +
> return &execmem_params;
>  }
> --
> 2.35.1
>


Re: [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> RISC-V overrides kprobes::alloc_insn_page() to use the entire vmalloc area
> rather than limit the allocations to the modules area.
>
> Slightly reorder execmem_params initialization to support both 32 and 64
> bit variants and add a definition of the jit area to execmem_params to support
> generic kprobes::alloc_insn_page().
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 


> ---
>  arch/riscv/kernel/module.c | 16 +++-
>  arch/riscv/kernel/probes/kprobes.c | 10 --
>  2 files changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index ee5e04cd3f21..cca6ed4e9340 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -436,7 +436,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
> return 0;
>  }
>
> -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> +#ifdef CONFIG_MMU
>  static struct execmem_params execmem_params = {
> .modules = {
> .text = {
> @@ -444,12 +444,26 @@ static struct execmem_params execmem_params = {
> .alignment = 1,
> },
> },
> +   .jit = {
> +   .text = {
> +   .pgprot = PAGE_KERNEL_READ_EXEC,
> +   .alignment = 1,
> +   },
> +   },
>  };
>
>  struct execmem_params __init *execmem_arch_params(void)
>  {
> +#ifdef CONFIG_64BIT
> execmem_params.modules.text.start = MODULES_VADDR;
> execmem_params.modules.text.end = MODULES_END;
> +#else
> +   execmem_params.modules.text.start = VMALLOC_START;
> +   execmem_params.modules.text.end = VMALLOC_END;
> +#endif
> +
> +   execmem_params.jit.text.start = VMALLOC_START;
> +   execmem_params.jit.text.end = VMALLOC_END;
>
> return &execmem_params;
>  }
> diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
> index 2f08c14a933d..e64f2f3064eb 100644
> --- a/arch/riscv/kernel/probes/kprobes.c
> +++ b/arch/riscv/kernel/probes/kprobes.c
> @@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
> return 0;
>  }
>
> -#ifdef CONFIG_MMU
> -void *alloc_insn_page(void)
> -{
> -   return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> -GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
> -VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> -__builtin_return_address(0));
> -}
> -#endif
> -
>  /* install breakpoint in text */
>  void __kprobes arch_arm_kprobe(struct kprobe *p)
>  {
> --
> 2.35.1
>


Re: [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:52 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> The memory allocations for kprobes on arm64 can be placed anywhere in
> vmalloc address space and currently this is implemented with an override
> of alloc_insn_page() in arm64.
>
> Extend execmem_params with a range for generated code allocations and
> make kprobes on arm64 use this extension rather than override
> alloc_insn_page().
>
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/arm64/kernel/module.c |  9 +
>  arch/arm64/kernel/probes/kprobes.c |  7 ---
>  include/linux/execmem.h| 11 +++
>  mm/execmem.c   | 14 +-
>  4 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index c3d999f3a3dd..52b09626bc0f 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
> .alignment = MODULE_ALIGN,
> },
> },
> +   .jit = {
> +   .text = {
> +   .start = VMALLOC_START,
> +   .end = VMALLOC_END,
> +   .alignment = 1,
> +   },
> +   },
>  };

This is growing fast. :) We have 3 now: text, data, jit. And it will be
5 when we split data into rw data, ro data, ro after init data. I wonder
whether we should still do some type enum here. But we can revisit
this topic later.
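
Something like the following hypothetical shape (all names here are
invented for illustration, not from the patch set):

enum execmem_type {
	EXECMEM_MODULE_TEXT,
	EXECMEM_MODULE_DATA,
	EXECMEM_JIT_TEXT,
	EXECMEM_TYPE_MAX,
};

struct execmem_params {
	struct execmem_range	ranges[EXECMEM_TYPE_MAX];
};

void *execmem_alloc_type(enum execmem_type type, size_t size);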

Other than that

Acked-by: Song Liu 


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Data related to code allocations, such as module data sections, needs to
> comply with architecture constraints for its placement, and until now it
> was allocated using execmem_text_alloc().
>
> Create a dedicated API for allocating data related to code allocations
> and allow architectures to define address ranges for data allocations.
>
> Since currently this is only relevant for powerpc variants that use the
> VMALLOC address space for module data allocations, automatically reuse
> address ranges defined for text unless address range for data is
> explicitly defined by an architecture.
>
> With separation of code and data allocations, data sections of the
> modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC which
> was a default on many architectures.
>
> Signed-off-by: Mike Rapoport (IBM) 
[...]
>  static void free_mod_mem(struct module *mod)
> diff --git a/mm/execmem.c b/mm/execmem.c
> index a67acd75ffef..f7bf496ad4c3 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
>  fallback_start, fallback_end, kasan);
>  }
>
> +void *execmem_data_alloc(size_t size)
> +{
> +   unsigned long start = execmem_params.modules.data.start;
> +   unsigned long end = execmem_params.modules.data.end;
> +   pgprot_t pgprot = execmem_params.modules.data.pgprot;
> +   unsigned int align = execmem_params.modules.data.alignment;
> +   unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> +   unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> +   bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
> +
> +   return execmem_alloc(size, start, end, align, pgprot,
> +fallback_start, fallback_end, kasan);
> +}
> +
>  void execmem_free(void *ptr)
>  {
> /*
> @@ -101,6 +115,28 @@ static bool execmem_validate_params(struct execmem_params *p)
> return true;
>  }
>
> +static void execmem_init_missing(struct execmem_params *p)

Shall we call this execmem_default_init_data?

> +{
> +   struct execmem_modules_range *m = &p->modules;
> +
> +   if (!pgprot_val(execmem_params.modules.data.pgprot))
> +   execmem_params.modules.data.pgprot = PAGE_KERNEL;

Do we really need to check each of these? IOW, can we do:

if (!pgprot_val(execmem_params.modules.data.pgprot)) {
	execmem_params.modules.data.pgprot = PAGE_KERNEL;
	execmem_params.modules.data.alignment = m->text.alignment;
	execmem_params.modules.data.start = m->text.start;
	execmem_params.modules.data.end = m->text.end;
	execmem_params.modules.data.fallback_start = m->text.fallback_start;
	execmem_params.modules.data.fallback_end = m->text.fallback_end;
}

Thanks,
Song

[...]


[PATCH v4 20/25] iommu/sun50i: Add an IOMMU_IDENTITY_DOMAIN

2023-06-16 Thread Jason Gunthorpe
Prior to commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") the
sun50i_iommu_detach_device() function was being called by
ops->detach_dev().

This is an IDENTITY domain so convert sun50i_iommu_detach_device() into
sun50i_iommu_identity_attach() and a full IDENTITY domain and thus hook it
back up the same way as the old ops->detach_dev().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/sun50i-iommu.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/sun50i-iommu.c b/drivers/iommu/sun50i-iommu.c
index 74c5cb93e90027..0bf08b120cf105 100644
--- a/drivers/iommu/sun50i-iommu.c
+++ b/drivers/iommu/sun50i-iommu.c
@@ -757,21 +757,32 @@ static void sun50i_iommu_detach_domain(struct sun50i_iommu *iommu,
iommu->domain = NULL;
 }
 
-static void sun50i_iommu_detach_device(struct iommu_domain *domain,
-  struct device *dev)
+static int sun50i_iommu_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
-   struct sun50i_iommu_domain *sun50i_domain = to_sun50i_domain(domain);
struct sun50i_iommu *iommu = dev_iommu_priv_get(dev);
+   struct sun50i_iommu_domain *sun50i_domain;
 
dev_dbg(dev, "Detaching from IOMMU domain\n");
 
-   if (iommu->domain != domain)
-   return;
+   if (iommu->domain == identity_domain)
+   return 0;
 
+   sun50i_domain = to_sun50i_domain(iommu->domain);
if (refcount_dec_and_test(&sun50i_domain->refcnt))
sun50i_iommu_detach_domain(iommu, sun50i_domain);
+   return 0;
 }
 
+static struct iommu_domain_ops sun50i_iommu_identity_ops = {
+   .attach_dev = sun50i_iommu_identity_attach,
+};
+
+static struct iommu_domain sun50i_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &sun50i_iommu_identity_ops,
+};
+
 static int sun50i_iommu_attach_device(struct iommu_domain *domain,
  struct device *dev)
 {
@@ -789,8 +800,7 @@ static int sun50i_iommu_attach_device(struct iommu_domain *domain,
if (iommu->domain == domain)
return 0;
 
-   if (iommu->domain)
-   sun50i_iommu_detach_device(iommu->domain, dev);
+   sun50i_iommu_identity_attach(&sun50i_iommu_identity_domain, dev);
 
sun50i_iommu_attach_domain(iommu, sun50i_domain);
 
@@ -827,6 +837,7 @@ static int sun50i_iommu_of_xlate(struct device *dev,
 }
 
 static const struct iommu_ops sun50i_iommu_ops = {
+   .identity_domain = &sun50i_iommu_identity_domain,
.pgsize_bitmap  = SZ_4K,
.device_group   = sun50i_iommu_device_group,
.domain_alloc   = sun50i_iommu_domain_alloc,
@@ -985,6 +996,7 @@ static int sun50i_iommu_probe(struct platform_device *pdev)
if (!iommu)
return -ENOMEM;
spin_lock_init(&iommu->iommu_lock);
+   iommu->domain = &sun50i_iommu_identity_domain;
platform_set_drvdata(pdev, iommu);
iommu->dev = &pdev->dev;
 
-- 
2.40.1



[PATCH v4 15/25] iommufd/selftest: Make the mock iommu driver into a real driver

2023-06-16 Thread Jason Gunthorpe
I've avoided doing this because there is no way to make this happen
without an intrusion into the core code. Up till now this has avoided
needing the core code's probe path with some hackery - but now that
default domains are becoming mandatory it is unavoidable. The core probe
path must be run to set the default_domain, only it can do it. Without
a default domain iommufd can't use the group.

Make it so that iommufd selftest can create a real iommu driver and bind
it only to its own private bus. Add iommu_device_register_bus() as a core
code helper to make this possible. It simply sets the right pointers and
registers the notifier block. The mock driver then works like any normal
driver should, with probe triggered by the bus ops.

When the bus->iommu_ops stuff is fully unwound we can probably do better
here and remove this special case.

Remove set_platform_dma_ops from selftest and make it use a BLOCKED
default domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu-priv.h  |  16 +++
 drivers/iommu/iommu.c   |  43 
 drivers/iommu/iommufd/iommufd_private.h |   5 +-
 drivers/iommu/iommufd/main.c|   8 +-
 drivers/iommu/iommufd/selftest.c| 141 +---
 5 files changed, 144 insertions(+), 69 deletions(-)
 create mode 100644 drivers/iommu/iommu-priv.h

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
new file mode 100644
index 00..1cbc04b9cf7297
--- /dev/null
+++ b/drivers/iommu/iommu-priv.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef __IOMMU_PRIV_H
+#define __IOMMU_PRIV_H
+
+#include <linux/iommu.h>
+
+int iommu_device_register_bus(struct iommu_device *iommu,
+ const struct iommu_ops *ops, struct bus_type *bus,
+ struct notifier_block *nb);
+void iommu_device_unregister_bus(struct iommu_device *iommu,
+struct bus_type *bus,
+struct notifier_block *nb);
+
+#endif
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7ca70e2a3f51e9..a3a4d004767b4d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -36,6 +36,7 @@
 #include "dma-iommu.h"
 
 #include "iommu-sva.h"
+#include "iommu-priv.h"
 
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -287,6 +288,48 @@ void iommu_device_unregister(struct iommu_device *iommu)
 }
 EXPORT_SYMBOL_GPL(iommu_device_unregister);
 
+#if IS_ENABLED(CONFIG_IOMMUFD_TEST)
+void iommu_device_unregister_bus(struct iommu_device *iommu,
+struct bus_type *bus,
+struct notifier_block *nb)
+{
+   bus_unregister_notifier(bus, nb);
+   iommu_device_unregister(iommu);
+}
+EXPORT_SYMBOL_GPL(iommu_device_unregister_bus);
+
+/*
+ * Register an iommu driver against a single bus. This is only used by iommufd
+ * selftest to create a mock iommu driver. The caller must provide
+ * some memory to hold a notifier_block.
+ */
+int iommu_device_register_bus(struct iommu_device *iommu,
+ const struct iommu_ops *ops, struct bus_type *bus,
+ struct notifier_block *nb)
+{
+   int err;
+
+   iommu->ops = ops;
+   nb->notifier_call = iommu_bus_notifier;
+   err = bus_register_notifier(bus, nb);
+   if (err)
+   return err;
+
+   spin_lock(&iommu_device_lock);
+   list_add_tail(&iommu->list, &iommu_device_list);
+   spin_unlock(&iommu_device_lock);
+
+   bus->iommu_ops = ops;
+   err = bus_iommu_probe(bus);
+   if (err) {
+   iommu_device_unregister_bus(iommu, bus, nb);
+   return err;
+   }
+   return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_device_register_bus);
+#endif
+
 static struct dev_iommu *dev_iommu_get(struct device *dev)
 {
struct dev_iommu *param = dev->iommu;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index b38e67d1988bdb..368f66c63a239a 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -303,7 +303,7 @@ extern size_t iommufd_test_memory_limit;
 void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
   unsigned int ioas_id, u64 *iova, u32 *flags);
 bool iommufd_should_fail(void);
-void __init iommufd_test_init(void);
+int __init iommufd_test_init(void);
 void iommufd_test_exit(void);
 bool iommufd_selftest_is_mock_dev(struct device *dev);
 #else
@@ -316,8 +316,9 @@ static inline bool iommufd_should_fail(void)
 {
return false;
 }
-static inline void __init iommufd_test_init(void)
+static inline int __init iommufd_test_init(void)
 {
+   return 0;
 }
 static inline void iommufd_test_exit(void)
 {
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 3fbe636c3d8a69..042d45cc0b1c0d 100644
--- 

[PATCH v4 10/25] iommu/exynos: Implement an IDENTITY domain

2023-06-16 Thread Jason Gunthorpe
What exynos calls exynos_iommu_detach_device is actually putting the iommu
into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

Tested-by: Marek Szyprowski 
Acked-by: Marek Szyprowski 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/exynos-iommu.c | 66 +---
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index c275fe71c4db32..5e12b85dfe8705 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -24,6 +24,7 @@
 
 typedef u32 sysmmu_iova_t;
 typedef u32 sysmmu_pte_t;
+static struct iommu_domain exynos_identity_domain;
 
 /* We do not consider super section mapping (16MB) */
 #define SECT_ORDER 20
@@ -829,7 +830,7 @@ static int __maybe_unused exynos_sysmmu_suspend(struct device *dev)
struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
mutex_lock(&owner->rpm_lock);
-   if (data->domain) {
+   if (&owner->domain->domain != &exynos_identity_domain) {
dev_dbg(data->sysmmu, "saving state\n");
__sysmmu_disable(data);
}
@@ -847,7 +848,7 @@ static int __maybe_unused exynos_sysmmu_resume(struct device *dev)
struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
mutex_lock(&owner->rpm_lock);
-   if (data->domain) {
+   if (&owner->domain->domain != &exynos_identity_domain) {
dev_dbg(data->sysmmu, "restoring state\n");
__sysmmu_enable(data);
}
@@ -980,17 +981,20 @@ static void exynos_iommu_domain_free(struct iommu_domain *iommu_domain)
kfree(domain);
 }
 
-static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain,
-   struct device *dev)
+static int exynos_iommu_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
-   struct exynos_iommu_domain *domain = to_exynos_domain(iommu_domain);
struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
-   phys_addr_t pagetable = virt_to_phys(domain->pgtable);
+   struct exynos_iommu_domain *domain;
+   phys_addr_t pagetable;
struct sysmmu_drvdata *data, *next;
unsigned long flags;
 
-   if (!has_sysmmu(dev) || owner->domain != iommu_domain)
-   return;
+   if (owner->domain == identity_domain)
+   return 0;
+
+   domain = to_exynos_domain(owner->domain);
+   pagetable = virt_to_phys(domain->pgtable);
 
mutex_lock(&owner->rpm_lock);
 
@@ -1009,15 +1013,25 @@ static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain,
list_del_init(&data->domain_node);
spin_unlock(&data->lock);
}
-   owner->domain = NULL;
+   owner->domain = identity_domain;
spin_unlock_irqrestore(&owner->lock, flags);
 
mutex_unlock(&owner->rpm_lock);
 
-   dev_dbg(dev, "%s: Detached IOMMU with pgtable %pa\n", __func__,
-   &pagetable);
+   dev_dbg(dev, "%s: Restored IOMMU to IDENTITY from pgtable %pa\n",
+   __func__, &pagetable);
+   return 0;
 }
 
+static struct iommu_domain_ops exynos_identity_ops = {
+   .attach_dev = exynos_iommu_identity_attach,
+};
+
+static struct iommu_domain exynos_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &exynos_identity_ops,
+};
+
 static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain,
   struct device *dev)
 {
@@ -1026,12 +1040,11 @@ static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain,
struct sysmmu_drvdata *data;
phys_addr_t pagetable = virt_to_phys(domain->pgtable);
unsigned long flags;
+   int err;
 
-   if (!has_sysmmu(dev))
-   return -ENODEV;
-
-   if (owner->domain)
-   exynos_iommu_detach_device(owner->domain, dev);
+   err = exynos_iommu_identity_attach(&exynos_identity_domain, dev);
+   if (err)
+   return err;
 
mutex_lock(&owner->rpm_lock);
 
@@ -1407,26 +1420,12 @@ static struct iommu_device *exynos_iommu_probe_device(struct device *dev)
return >iommu;
 }
 
-static void exynos_iommu_set_platform_dma(struct device *dev)
-{
-   struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
-
-   if (owner->domain) {
-   struct iommu_group *group = iommu_group_get(dev);
-
-   if (group) {
-   exynos_iommu_detach_device(owner->domain, dev);
-   iommu_group_put(group);
-   }
-   }
-}
-
 static void exynos_iommu_release_device(struct device *dev)
 {
struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
struct sysmmu_drvdata *data;
 
-   exynos_iommu_set_platform_dma(dev);
+   

[PATCH v4 17/25] iommu/qcom_iommu: Add an IOMMU_IDENTITY_DOMAIN

2023-06-16 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 39 +
 1 file changed, 39 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index a503ed758ec302..9d7b9d8b4386d4 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -387,6 +387,44 @@ static int qcom_iommu_attach_dev(struct iommu_domain *domain, struct device *dev
return 0;
 }
 
+static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct qcom_iommu_domain *qcom_domain;
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct qcom_iommu_dev *qcom_iommu = to_iommu(dev);
+   unsigned int i;
+
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   qcom_domain = to_qcom_iommu_domain(domain);
+   if (WARN_ON(!qcom_domain->iommu))
+   return -EINVAL;
+
+   pm_runtime_get_sync(qcom_iommu->dev);
+   for (i = 0; i < fwspec->num_ids; i++) {
+   struct qcom_iommu_ctx *ctx = to_ctx(qcom_domain, fwspec->ids[i]);
+
+   /* Disable the context bank: */
+   iommu_writel(ctx, ARM_SMMU_CB_SCTLR, 0);
+
+   ctx->domain = NULL;
+   }
+   pm_runtime_put_sync(qcom_iommu->dev);
+   return 0;
+}
+
+static struct iommu_domain_ops qcom_iommu_identity_ops = {
+   .attach_dev = qcom_iommu_identity_attach,
+};
+
+static struct iommu_domain qcom_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &qcom_iommu_identity_ops,
+};
+
 static int qcom_iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, size_t pgsize, size_t pgcount,
  int prot, gfp_t gfp, size_t *mapped)
@@ -553,6 +591,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct of_phandle_args *args)
 }
 
 static const struct iommu_ops qcom_iommu_ops = {
+   .identity_domain = &qcom_iommu_identity_domain,
.capable= qcom_iommu_capable,
.domain_alloc   = qcom_iommu_domain_alloc,
.probe_device   = qcom_iommu_probe_device,
-- 
2.40.1



[PATCH v4 01/25] iommu: Add iommu_ops->identity_domain

2023-06-16 Thread Jason Gunthorpe
This allows a driver to set a global static to an IDENTITY domain and
the core code will automatically use it whenever an IDENTITY domain
is requested.

By making it always available, the IDENTITY domain can be used in error
handling paths to force the iommu driver into a known state. Drivers
implementing global static identity domains should avoid failing their
attach_dev ops.

Convert rockchip to use the new mechanism.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c  | 3 +++
 drivers/iommu/rockchip-iommu.c | 9 +
 include/linux/iommu.h  | 3 +++
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9e0228ef612b85..bb840a818525ad 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1917,6 +1917,9 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
 
+   if (alloc_type == IOMMU_DOMAIN_IDENTITY && bus->iommu_ops->identity_domain)
+   return bus->iommu_ops->identity_domain;
+
domain = bus->iommu_ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 4054030c323795..4fbede269e6712 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -1017,13 +1017,8 @@ static int rk_iommu_identity_attach(struct iommu_domain *identity_domain,
return 0;
 }
 
-static void rk_iommu_identity_free(struct iommu_domain *domain)
-{
-}
-
 static struct iommu_domain_ops rk_identity_ops = {
.attach_dev = rk_iommu_identity_attach,
-   .free = rk_iommu_identity_free,
 };
 
 static struct iommu_domain rk_identity_domain = {
@@ -1087,9 +1082,6 @@ static struct iommu_domain *rk_iommu_domain_alloc(unsigned type)
 {
struct rk_iommu_domain *rk_domain;
 
-   if (type == IOMMU_DOMAIN_IDENTITY)
-   return &rk_identity_domain;
-
if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
return NULL;
 
@@ -1214,6 +1206,7 @@ static int rk_iommu_of_xlate(struct device *dev,
 }
 
 static const struct iommu_ops rk_iommu_ops = {
+   .identity_domain = &rk_identity_domain,
.domain_alloc = rk_iommu_domain_alloc,
.probe_device = rk_iommu_probe_device,
.release_device = rk_iommu_release_device,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d3164259667599..c3004eac2f88e8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -254,6 +254,8 @@ struct iommu_iotlb_gather {
  *will be blocked by the hardware.
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @owner: Driver module providing these ops
+ * @identity_domain: An always available, always attachable identity
+ *   translation.
  */
 struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
@@ -287,6 +289,7 @@ struct iommu_ops {
const struct iommu_domain_ops *default_domain_ops;
unsigned long pgsize_bitmap;
struct module *owner;
+   struct iommu_domain *identity_domain;
 };
 
 /**
-- 
2.40.1



[PATCH v4 00/25] iommu: Make default_domain's mandatory

2023-06-16 Thread Jason Gunthorpe
[ It would be good to get this in linux-next, we have some good test
coverage on the ARM side already, thanks! ]

It has been a long time coming, this series completes the default_domain
transition and makes it so that the core IOMMU code will always have a
non-NULL default_domain for every driver on every
platform. set_platform_dma_ops() turned out to be a bad idea, and so
completely remove it.

This is achieved by changing each driver to either:

1 - Convert the existing (or deleted) ops->detach_dev() into an
op->attach_dev() of an IDENTITY domain.

This is based on the theory that the ARM32 HW is able to function when
the iommu is turned off and so the turned-off state is an IDENTITY
translation.

2 - Use a new PLATFORM domain type. This is a hack to accommodate drivers
where we don't really know WTF they do. S390 is legitimately using this
to switch to its platform dma_ops implementation, which is where the
name comes from.

3 - Do #1 and force the default domain to be IDENTITY, this corrects
the tegra-smmu case where even an ARM64 system would have a NULL
default_domain.
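
As a minimal sketch of pattern #1 (hypothetical driver "foo"; the
foo_hw_disable_translation() helper stands in for whatever the old
detach_dev() body did and is not a real API):

  /* The old ops->detach_dev() body becomes the attach_dev() of a
   * statically allocated IDENTITY domain the core can attach anytime.
   */
  static int foo_identity_attach(struct iommu_domain *identity_domain,
                                 struct device *dev)
  {
          struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

          /* Already in identity mode, nothing to do */
          if (domain == identity_domain || !domain)
                  return 0;

          /* The old detach_dev() logic: turn translation off */
          foo_hw_disable_translation(dev);
          return 0;
  }

  static struct iommu_domain_ops foo_identity_ops = {
          .attach_dev = foo_identity_attach,
  };

  static struct iommu_domain foo_identity_domain = {
          .type = IOMMU_DOMAIN_IDENTITY,
          .ops = &foo_identity_ops,
  };

The static foo_identity_domain is then plugged into
ops->identity_domain so the core can select and attach it like any
other domain.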

Using this we can apply the rules:

a) ARM_DMA_USE_IOMMU mode always uses either the driver's
   ops->default_domain, ops->def_domain_type(), or an IDENTITY domain.
   All ARM32 drivers provide one of these three options.

b) dma-iommu.c mode uses either the driver's ops->default_domain,
   ops->def_domain_type or the usual DMA API policy logic based on the
   command line/etc to pick IDENTITY/DMA domain types

c) All other arches (PPC/S390) always use ops->default_domain.

See the patch "Require a default_domain for all iommu drivers" for a
per-driver breakdown.

The conversion broadly teaches a bunch of ARM32 drivers that they can do
IDENTITY domains. There is some educated guessing involved that these are
actual IDENTITY domains. If this turns out to be wrong the driver can be
trivially changed to use a BLOCKING domain type instead. Further, the
domain type only matters for drivers using ARM64's dma-iommu.c mode as it
will select IDENTITY based on the command line and expect IDENTITY to
work. For ARM32 and other arch cases it is purely documentation.

Finally, based on all the analysis in this series, we can purge
IOMMU_DOMAIN_UNMANAGED/DMA constants from most of the drivers. This
greatly simplifies understanding the driver contract to the core
code. IOMMU drivers should not be involved in policy for how the DMA API
works, that should be a core code decision.

The main gain from this work is to remove a lot of ARM_DMA_USE_IOMMU
specific code and behaviors from drivers. All that remains in iommu
drivers after this series is the calls to arm_iommu_create_mapping().

This is a step toward removing ARM_DMA_USE_IOMMU.

The IDENTITY domains added to the ARM64 supporting drivers can be tested
by booting in ARM64 mode and enabling CONFIG_IOMMU_DEFAULT_PASSTHROUGH. If
the system still boots then most likely the implementation is an IDENTITY
domain. If not we can trivially change it to BLOCKING or at worst PLATFORM
if there is no detail what is going on in the HW.

I think this is pretty safe for the ARM32 drivers as they don't really
change; the code that was in detach_dev continues to be called in the same
places it was called before.

This is on github: https://github.com/jgunthorpe/linux/commits/iommu_all_defdom

v4:
 - Fix rebasing typo missing ops->domain_alloc_paging check
 - Rebase on latest Joerg tree
v3: 
https://lore.kernel.org/r/0-v3-89830a6c7841+43d-iommu_all_defdom_...@nvidia.com
 - FSL is back to a PLATFORM domain, with some fixing so its attach only
   does something when leaving an UNMANAGED domain, like it always was
 - Rebase on Joerg's tree, adjust for "alloc_type" change
 - Change the ARM32 untrusted check to a WARN_ON since no ARM32 system
   can currently set untrusted
v2: 
https://lore.kernel.org/r/0-v2-8d1dc464eac9+10f-iommu_all_defdom_...@nvidia.com
 - FSL is an IDENTITY domain
 - Delete tegra-gart instead of trying to carry it
 - Use the policy determination from iommu_get_default_domain_type() to
   drive the arm_iommu mode
 - Reorganize and introduce new patches to do the above:
* Split the ops->identity_domain to an independent earlier patch
* Remove the UNMANAGED return from def_domain_type in mtk_v1 earlier
  so the new iommu_get_default_domain_type() can work
* Make the driver's def_domain_type have higher policy priority than
  untrusted
* Merge the set_platform_dma_ops hunk from mtk_v1 along with rockchip
  into the patch that forced IDENTITY on ARM32
 - Revise sun50i to be cleaner and have a non-NULL internal domain
 - Reword logging in exynos
 - Remove the gdev from the group alloc path, instead add a new
   function __iommu_group_domain_alloc() that takes in the group
   and uses the first device. Split this to its own patch
 - New patch to make iommufd's mock selftest into a real driver
 - New patch to fix power's partial iommu driver
v1: 

[PATCH v4 16/25] iommu: Remove ops->set_platform_dma_ops()

2023-06-16 Thread Jason Gunthorpe
All drivers are now using IDENTITY or PLATFORM domains for what this did,
so we can remove it now. It is no longer possible to attach to a NULL domain.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 30 +-
 include/linux/iommu.h |  4 
 2 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a3a4d004767b4d..e60640f6ccb625 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2250,21 +2250,8 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
if (group->domain == new_domain)
return 0;
 
-   /*
-* New drivers should support default domains, so set_platform_dma()
-* op will never be called. Otherwise the NULL domain represents some
-* platform specific behavior.
-*/
-   if (!new_domain) {
-   for_each_group_device(group, gdev) {
-   const struct iommu_ops *ops = dev_iommu_ops(gdev->dev);
-
-   if (!WARN_ON(!ops->set_platform_dma_ops))
-   ops->set_platform_dma_ops(gdev->dev);
-   }
-   group->domain = NULL;
-   return 0;
-   }
+   if (WARN_ON(!new_domain))
+   return -EINVAL;
 
/*
 * Changing the domain is done by calling attach_dev() on the new
@@ -2300,19 +2287,15 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 */
last_gdev = gdev;
for_each_group_device(group, gdev) {
-   const struct iommu_ops *ops = dev_iommu_ops(gdev->dev);
-
/*
-* If set_platform_dma_ops is not present a NULL domain can
-* happen only for first probe, in which case we leave
-* group->domain as NULL and let release clean everything up.
+* A NULL domain can happen only for first probe, in which case
+* we leave group->domain as NULL and let release clean
+* everything up.
 */
if (group->domain)
WARN_ON(__iommu_device_set_domain(
group, gdev->dev, group->domain,
IOMMU_SET_DOMAIN_MUST_SUCCEED));
-   else if (ops->set_platform_dma_ops)
-   ops->set_platform_dma_ops(gdev->dev);
if (gdev == last_gdev)
break;
}
@@ -2926,9 +2909,6 @@ static int iommu_setup_default_domain(struct iommu_group *group,
/*
 * There are still some drivers which don't support default domains, so
 * we ignore the failure and leave group->default_domain NULL.
-*
-* We assume that the iommu driver starts up the device in
-* 'set_platform_dma_ops' mode if it does not support default domains.
 */
dom = iommu_group_alloc_default_domain(group, req_type);
if (!dom) {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ef0af09326..49331573f1d1f5 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -237,9 +237,6 @@ struct iommu_iotlb_gather {
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
  *  group and attached to the groups domain
- * @set_platform_dma_ops: Returning control back to the platform DMA ops. This op
- *is to support old IOMMU drivers, new drivers should use
- *default domains, and the common IOMMU DMA ops.
  * @device_group: find iommu group for a particular device
  * @get_resv_regions: Request list of reserved regions for a device
  * @of_xlate: add OF master IDs to iommu grouping
@@ -271,7 +268,6 @@ struct iommu_ops {
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
void (*probe_finalize)(struct device *dev);
-   void (*set_platform_dma_ops)(struct device *dev);
struct iommu_group *(*device_group)(struct device *dev);
 
/* Request/Free a list of reserved regions for a device */
-- 
2.40.1



[PATCH v4 09/25] iommu: Allow an IDENTITY domain as the default_domain in ARM32

2023-06-16 Thread Jason Gunthorpe
Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the
same stuff, the way they relate to the IOMMU core is quite different.

dma-iommu.c expects the core code to setup an UNMANAGED domain (of type
IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This
becomes the default_domain for the group.

ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly
allocates an UNMANAGED domain and operates it just like an external
driver. In this case group->default_domain is NULL.

If the driver provides a global static identity_domain then automatically
use it as the default_domain when in ARM_DMA_USE_IOMMU mode.

This allows drivers that implemented default_domain == NULL as an IDENTITY
translation to trivially get a properly labeled non-NULL default_domain on
ARM32 configs.

With this arrangement, when ARM_DMA_USE_IOMMU wants to disconnect from the
device the normal detach_domain flow will restore the IDENTITY domain as
the default domain. Overall this makes attach_dev() of the IDENTITY domain
called in the same places as detach_dev().

This effectively migrates these drivers to default_domain mode. For
drivers that support ARM64 they will gain support for the IDENTITY
translation mode for the dma_api and behave in a uniform way.

Drivers use this by setting ops->identity_domain to a static singleton
iommu_domain that implements the identity attach. If the core detects
ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain
during probe.

Drivers can continue to prevent the use of DMA translation by returning
IOMMU_DOMAIN_IDENTITY from def_domain_type, this will completely prevent
IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU.

This allows removing the set_platform_dma_ops() from every remaining
driver.

Remove the set_platform_dma_ops from rockchip and mtk_v1 as all it does
is set an existing global static identity domain. mtk_v1 does not support
IOMMU_DOMAIN_DMA and it does not compile on ARM64 so this transformation
is safe.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c  | 26 +++---
 drivers/iommu/mtk_iommu_v1.c   | 12 
 drivers/iommu/rockchip-iommu.c | 10 --
 3 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0c4fc46c210366..7ca70e2a3f51e9 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1757,15 +1757,35 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
int type;
 
lockdep_assert_held(&group->mutex);
+
+   /*
+* ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
+* identity_domain and it will automatically become their default
+* domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
+* Override the selection to IDENTITY if we are sure the driver supports
+* it.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {
+   type = IOMMU_DOMAIN_IDENTITY;
+   if (best_type && type && best_type != type)
+   goto err;
+   best_type = target_type = IOMMU_DOMAIN_IDENTITY;
+   }
+
for_each_group_device(group, gdev) {
type = best_type;
if (ops->def_domain_type) {
type = ops->def_domain_type(gdev->dev);
-   if (best_type && type && best_type != type)
+   if (best_type && type && best_type != type) {
+   /* Stick with the last driver override we saw */
+   best_type = type;
goto err;
+   }
}
 
-   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
+   /* No ARM32 using systems will set untrusted, it cannot work. */
+   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted &&
+   !WARN_ON(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU))) {
type = IOMMU_DOMAIN_DMA;
if (best_type && type && best_type != type)
goto err;
@@ -1790,7 +1810,7 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
"Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
iommu_domain_type_str(type), dev_name(last_dev),
iommu_domain_type_str(best_type));
-   return 0;
+   return best_type;
 }
 
 static void iommu_group_do_probe_finalize(struct device *dev)
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index cc3e7d53d33ad9..7c0c1d50df5f75 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -337,11 +337,6 @@ static struct 

[PATCH v4 14/25] iommu/msm: Implement an IDENTITY domain

2023-06-16 Thread Jason Gunthorpe
What msm does during msm_iommu_set_platform_dma() is actually putting the
iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA; however, it cannot be
compiled on ARM64 either. Most likely it is fine to support dma-iommu.c.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/msm_iommu.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 79d89bad5132b7..26ed81cfeee897 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -443,15 +443,20 @@ static int msm_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
return ret;
 }
 
-static void msm_iommu_set_platform_dma(struct device *dev)
+static int msm_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct msm_priv *priv = to_msm_priv(domain);
+   struct msm_priv *priv;
unsigned long flags;
struct msm_iommu_dev *iommu;
struct msm_iommu_ctx_dev *master;
-   int ret;
+   int ret = 0;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   priv = to_msm_priv(domain);
free_io_pgtable_ops(priv->iop);
 
spin_lock_irqsave(&msm_iommu_lock, flags);
@@ -468,8 +473,18 @@ static void msm_iommu_set_platform_dma(struct device *dev)
}
 fail:
spin_unlock_irqrestore(&msm_iommu_lock, flags);
+   return ret;
 }
 
+static struct iommu_domain_ops msm_iommu_identity_ops = {
+   .attach_dev = msm_iommu_identity_attach,
+};
+
+static struct iommu_domain msm_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &msm_iommu_identity_ops,
+};
+
 static int msm_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t pa, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -675,10 +690,10 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 }
 
 static struct iommu_ops msm_iommu_ops = {
+   .identity_domain = &msm_iommu_identity_domain,
.domain_alloc = msm_iommu_domain_alloc,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = msm_iommu_set_platform_dma,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
.of_xlate = qcom_iommu_of_xlate,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.40.1



[PATCH v4 06/25] iommu/tegra-gart: Remove tegra-gart

2023-06-16 Thread Jason Gunthorpe
Thierry says this is not used anymore, and doesn't think it makes sense as
an iommu driver. The HW it supports is about 10 years old now and newer HW
uses different IOMMU drivers.

As this is the only driver with a GART approach, and it doesn't really
meet the driver expectations from the IOMMU core, let's just remove it
so we don't have to think about how to make it fit in.

It has a number of identified problems:
 - The assignment of iommu_groups doesn't match the HW behavior

 - It claims to have an UNMANAGED domain but it is really an IDENTITY
   domain with a translation aperture. This is inconsistent with the core
   expectation for security sensitive operations

 - It doesn't implement a SW page table under struct iommu_domain so
   * It can't accept a map until the domain is attached
   * It forgets about all maps after the domain is detached
   * It doesn't clear the HW of maps once the domain is detached
 (made worse by having the wrong groups)

Cc: Thierry Reding 
Cc: Dmitry Osipenko 
Acked-by: Thierry Reding 
Signed-off-by: Jason Gunthorpe 
---
 arch/arm/configs/multi_v7_defconfig |   1 -
 arch/arm/configs/tegra_defconfig|   1 -
 drivers/iommu/Kconfig   |  11 -
 drivers/iommu/Makefile  |   1 -
 drivers/iommu/tegra-gart.c  | 371 
 drivers/memory/tegra/mc.c   |  34 ---
 drivers/memory/tegra/tegra20.c  |  28 ---
 include/soc/tegra/mc.h  |  26 --
 8 files changed, 473 deletions(-)
 delete mode 100644 drivers/iommu/tegra-gart.c

diff --git a/arch/arm/configs/multi_v7_defconfig b/arch/arm/configs/multi_v7_defconfig
index 871fffe92187bf..daba1afdbd1100 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -1063,7 +1063,6 @@ CONFIG_BCM2835_MBOX=y
 CONFIG_QCOM_APCS_IPC=y
 CONFIG_QCOM_IPCC=y
 CONFIG_ROCKCHIP_IOMMU=y
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_EXYNOS_IOMMU=y
 CONFIG_QCOM_IOMMU=y
diff --git a/arch/arm/configs/tegra_defconfig b/arch/arm/configs/tegra_defconfig
index f32047e24b633e..ad31b9322911ce 100644
--- a/arch/arm/configs/tegra_defconfig
+++ b/arch/arm/configs/tegra_defconfig
@@ -293,7 +293,6 @@ CONFIG_CHROME_PLATFORMS=y
 CONFIG_CROS_EC=y
 CONFIG_CROS_EC_I2C=m
 CONFIG_CROS_EC_SPI=m
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_ARCH_TEGRA_2x_SOC=y
 CONFIG_ARCH_TEGRA_3x_SOC=y
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 4d800601e8ecd6..3309f297bbd822 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -235,17 +235,6 @@ config SUN50I_IOMMU
help
  Support for the IOMMU introduced in the Allwinner H6 SoCs.
 
-config TEGRA_IOMMU_GART
-   bool "Tegra GART IOMMU Support"
-   depends on ARCH_TEGRA_2x_SOC
-   depends on TEGRA_MC
-   select IOMMU_API
-   help
- Enables support for remapping discontiguous physical memory
- shared with the operating system into contiguous I/O virtual
- space through the GART (Graphics Address Relocation Table)
- hardware included on Tegra SoCs.
-
 config TEGRA_IOMMU_SMMU
bool "NVIDIA Tegra SMMU Support"
depends on ARCH_TEGRA
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 769e43d780ce89..95ad9dbfbda022 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -20,7 +20,6 @@ obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o
 obj-$(CONFIG_OMAP_IOMMU_DEBUG) += omap-iommu-debug.o
 obj-$(CONFIG_ROCKCHIP_IOMMU) += rockchip-iommu.o
 obj-$(CONFIG_SUN50I_IOMMU) += sun50i-iommu.o
-obj-$(CONFIG_TEGRA_IOMMU_GART) += tegra-gart.o
 obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
 obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
deleted file mode 100644
index a482ff838b5331..00
--- a/drivers/iommu/tegra-gart.c
+++ /dev/null
@@ -1,371 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * IOMMU API for Graphics Address Relocation Table on Tegra20
- *
- * Copyright (c) 2010-2012, NVIDIA CORPORATION.  All rights reserved.
- *
- * Author: Hiroshi DOYU 
- */
-
-#define dev_fmt(fmt)   "gart: " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-#define GART_REG_BASE  0x24
-#define GART_CONFIG		(0x24 - GART_REG_BASE)
-#define GART_ENTRY_ADDR	(0x28 - GART_REG_BASE)
-#define GART_ENTRY_DATA	(0x2c - GART_REG_BASE)
-
-#define GART_ENTRY_PHYS_ADDR_VALID BIT(31)
-
-#define GART_PAGE_SHIFT	12
-#define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT)
-#define GART_PAGE_MASK GENMASK(30, GART_PAGE_SHIFT)
-
-/* bitmap of the page sizes currently supported */
-#define GART_IOMMU_PGSIZES (GART_PAGE_SIZE)
-
-struct gart_device {
-   void __iomem	*regs;
-   u32 *savedata;
-   unsigned long   

[PATCH v4 19/25] iommu/mtk_iommu: Add an IOMMU_IDENTITY_DOMAIN

2023-06-16 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/mtk_iommu.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e93906d6e112e8..fdb7f5162b1d64 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -753,6 +753,28 @@ static int mtk_iommu_attach_device(struct iommu_domain *domain,
return ret;
 }
 
+static int mtk_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct mtk_iommu_data *data = dev_iommu_priv_get(dev);
+
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   mtk_iommu_config(data, dev, false, 0);
+   return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_identity_ops = {
+   .attach_dev = mtk_iommu_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &mtk_iommu_identity_ops,
+};
+
 static int mtk_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -972,6 +994,7 @@ static void mtk_iommu_get_resv_regions(struct device *dev,
 }
 
 static const struct iommu_ops mtk_iommu_ops = {
+   .identity_domain = &mtk_iommu_identity_domain,
.domain_alloc   = mtk_iommu_domain_alloc,
.probe_device   = mtk_iommu_probe_device,
.release_device = mtk_iommu_release_device,
-- 
2.40.1



[PATCH v4 18/25] iommu/ipmmu: Add an IOMMU_IDENTITY_DOMAIN

2023-06-16 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Also reverts commit 584d334b1393 ("iommu/ipmmu-vmsa: Remove
ipmmu_utlb_disable()")

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/ipmmu-vmsa.c | 43 ++
 1 file changed, 43 insertions(+)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9f64c5c9f5b90a..de958e411a92e0 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -298,6 +298,18 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain,
mmu->utlb_ctx[utlb] = domain->context_id;
 }
 
+/*
+ * Disable MMU translation for the microTLB.
+ */
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
+  unsigned int utlb)
+{
+   struct ipmmu_vmsa_device *mmu = domain->mmu;
+
+   ipmmu_imuctr_write(mmu, utlb, 0);
+   mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
+}
+
 static void ipmmu_tlb_flush_all(void *cookie)
 {
struct ipmmu_vmsa_domain *domain = cookie;
@@ -630,6 +642,36 @@ static int ipmmu_attach_device(struct iommu_domain *io_domain,
return 0;
 }
 
+static int ipmmu_iommu_identity_attach(struct iommu_domain *identity_domain,
+  struct device *dev)
+{
+   struct iommu_domain *io_domain = iommu_get_domain_for_dev(dev);
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct ipmmu_vmsa_domain *domain;
+   unsigned int i;
+
+   if (io_domain == identity_domain || !io_domain)
+   return 0;
+
+   domain = to_vmsa_domain(io_domain);
+   for (i = 0; i < fwspec->num_ids; ++i)
+   ipmmu_utlb_disable(domain, fwspec->ids[i]);
+
+   /*
+* TODO: Optimize by disabling the context when no device is attached.
+*/
+   return 0;
+}
+
+static struct iommu_domain_ops ipmmu_iommu_identity_ops = {
+   .attach_dev = ipmmu_iommu_identity_attach,
+};
+
+static struct iommu_domain ipmmu_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &ipmmu_iommu_identity_ops,
+};
+
 static int ipmmu_map(struct iommu_domain *io_domain, unsigned long iova,
 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -848,6 +890,7 @@ static struct iommu_group *ipmmu_find_group(struct device *dev)
 }
 
 static const struct iommu_ops ipmmu_ops = {
+   .identity_domain = &ipmmu_iommu_identity_domain,
.domain_alloc = ipmmu_domain_alloc,
.probe_device = ipmmu_probe_device,
.release_device = ipmmu_release_device,
-- 
2.40.1



[PATCH v4 12/25] iommu/tegra-smmu: Support DMA domains in tegra

2023-06-16 Thread Jason Gunthorpe
All ARM64 iommu drivers should support IOMMU_DOMAIN_DMA to enable
dma-iommu.c.

tegra is blocking dma-iommu usage, and also default_domains, because it
wants an identity translation. This is needed for some device quirk. The
correct way to do this is to support IDENTITY domains and use
ops->def_domain_type() to return IOMMU_DOMAIN_IDENTITY for only the quirky
devices.

Add support for IOMMU_DOMAIN_DMA and force IOMMU_DOMAIN_IDENTITY mode for
everything so no behavior changes.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/tegra-smmu.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index f63f1d4f0bd10f..6cba034905edbf 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -276,7 +276,7 @@ static struct iommu_domain *tegra_smmu_domain_alloc(unsigned type)
 {
struct tegra_smmu_as *as;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
+   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
return NULL;
 
as = kzalloc(sizeof(*as), GFP_KERNEL);
@@ -989,6 +989,12 @@ static int tegra_smmu_def_domain_type(struct device *dev)
 }
 
 static const struct iommu_ops tegra_smmu_ops = {
+   /*
+* FIXME: For now we want to run all translation in IDENTITY mode,
+* better would be to have a def_domain_type op do this for just the
+* quirky device.
+*/
+   .default_domain = &tegra_smmu_identity_domain,
+   .identity_domain = &tegra_smmu_identity_domain,
+   .def_domain_type = &tegra_smmu_def_domain_type,
.domain_alloc = tegra_smmu_domain_alloc,
-- 
2.40.1



[PATCH v4 08/25] iommu: Reorganize iommu_get_default_domain_type() to respect def_domain_type()

2023-06-16 Thread Jason Gunthorpe
Except for dart, every driver returns 0 or IDENTITY from def_domain_type().

The drivers that return IDENTITY have some kind of good reason, typically
that quirky hardware really can't support anything other than IDENTITY.

Arrange things so that if the driver says it needs IDENTITY then
iommu_get_default_domain_type() either fails or returns IDENTITY.  It will
never reject the driver's override to IDENTITY.

The only real functional difference is that the PCI untrusted flag is now
ignored for quirky HW instead of overriding the IOMMU driver.

This makes the next patch cleaner that wants to force IDENTITY always for
ARM_IOMMU because there is no support for DMA.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 66 +--
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c8f6664767152d..0c4fc46c210366 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1608,19 +1608,6 @@ struct iommu_group *fsl_mc_device_group(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
-static int iommu_get_def_domain_type(struct device *dev)
-{
-   const struct iommu_ops *ops = dev_iommu_ops(dev);
-
-   if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
-   return IOMMU_DOMAIN_DMA;
-
-   if (ops->def_domain_type)
-   return ops->def_domain_type(dev);
-
-   return 0;
-}
-
 static struct iommu_domain *
 __iommu_group_alloc_default_domain(const struct bus_type *bus,
   struct iommu_group *group, int req_type)
@@ -1761,36 +1748,49 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 static int iommu_get_default_domain_type(struct iommu_group *group,
 int target_type)
 {
+   const struct iommu_ops *ops = dev_iommu_ops(
+   list_first_entry(&group->devices, struct group_device, list)
+   ->dev);
int best_type = target_type;
struct group_device *gdev;
struct device *last_dev;
+   int type;
 
lockdep_assert_held(&group->mutex);
-
for_each_group_device(group, gdev) {
-   unsigned int type = iommu_get_def_domain_type(gdev->dev);
-
-   if (best_type && type && best_type != type) {
-   if (target_type) {
-   dev_err_ratelimited(
-   gdev->dev,
-   "Device cannot be in %s domain\n",
-   iommu_domain_type_str(target_type));
-   return -1;
-   }
-
-   dev_warn(
-   gdev->dev,
-   "Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
-   iommu_domain_type_str(type), dev_name(last_dev),
-   iommu_domain_type_str(best_type));
-   return 0;
+   type = best_type;
+   if (ops->def_domain_type) {
+   type = ops->def_domain_type(gdev->dev);
+   if (best_type && type && best_type != type)
+   goto err;
}
-   if (!best_type)
-   best_type = type;
+
+   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
+   type = IOMMU_DOMAIN_DMA;
+   if (best_type && type && best_type != type)
+   goto err;
+   }
+   best_type = type;
last_dev = gdev->dev;
}
return best_type;
+
+err:
+   if (target_type) {
+   dev_err_ratelimited(
+   gdev->dev,
+   "Device cannot be in %s domain - it is forcing %s\n",
+   iommu_domain_type_str(target_type),
+   iommu_domain_type_str(type));
+   return -1;
+   }
+
+   dev_warn(
+   gdev->dev,
+   "Device needs domain type %s, but device %s in the same iommu group requires type %s - using default\n",
+   iommu_domain_type_str(type), dev_name(last_dev),
+   iommu_domain_type_str(best_type));
+   return 0;
 }
 
 static void iommu_group_do_probe_finalize(struct device *dev)
-- 
2.40.1



[PATCH v4 11/25] iommu/tegra-smmu: Implement an IDENTITY domain

2023-06-16 Thread Jason Gunthorpe
What tegra-smmu does during tegra_smmu_set_platform_dma() is actually
putting the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/tegra-smmu.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 1cbf063ccf147a..f63f1d4f0bd10f 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -511,23 +511,39 @@ static int tegra_smmu_attach_dev(struct iommu_domain *domain,
return err;
 }
 
-static void tegra_smmu_set_platform_dma(struct device *dev)
+static int tegra_smmu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-   struct tegra_smmu_as *as = to_smmu_as(domain);
-   struct tegra_smmu *smmu = as->smmu;
+   struct tegra_smmu_as *as;
+   struct tegra_smmu *smmu;
unsigned int index;
 
if (!fwspec)
-   return;
+   return -ENODEV;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   as = to_smmu_as(domain);
+   smmu = as->smmu;
for (index = 0; index < fwspec->num_ids; index++) {
tegra_smmu_disable(smmu, fwspec->ids[index], as->id);
tegra_smmu_as_unprepare(smmu, as);
}
+   return 0;
 }
 
+static struct iommu_domain_ops tegra_smmu_identity_ops = {
+   .attach_dev = tegra_smmu_identity_attach,
+};
+
+static struct iommu_domain tegra_smmu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &tegra_smmu_identity_ops,
+};
+
 static void tegra_smmu_set_pde(struct tegra_smmu_as *as, unsigned long iova,
   u32 value)
 {
@@ -962,11 +978,22 @@ static int tegra_smmu_of_xlate(struct device *dev,
return iommu_fwspec_add_ids(dev, &id, 1);
 }
 
+static int tegra_smmu_def_domain_type(struct device *dev)
+{
+   /*
+* FIXME: For now we want to run all translation in IDENTITY mode, due
+* to some device quirks. Better would be to just quirk the troubled
+* devices.
+*/
+   return IOMMU_DOMAIN_IDENTITY;
+}
+
 static const struct iommu_ops tegra_smmu_ops = {
+   .identity_domain = &tegra_smmu_identity_domain,
+   .def_domain_type = &tegra_smmu_def_domain_type,
.domain_alloc = tegra_smmu_domain_alloc,
.probe_device = tegra_smmu_probe_device,
.device_group = tegra_smmu_device_group,
-   .set_platform_dma_ops = tegra_smmu_set_platform_dma,
.of_xlate = tegra_smmu_of_xlate,
.pgsize_bitmap = SZ_4K,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.40.1



[PATCH v4 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()

2023-06-16 Thread Jason Gunthorpe
These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively
allows them to support that mode.

The prior work to require default_domains makes this safe because every
one of these drivers is either compilation incompatible with dma-iommu.c,
or already establishing a default_domain. In both cases alloc_domain()
will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe
to drop the test.

Removing these tests clarifies that the domain allocation path is only
about the functionality of a paging domain and has nothing to do with
policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.

Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/msm_iommu.c| 7 ++-
 drivers/iommu/mtk_iommu_v1.c | 7 ++-
 drivers/iommu/omap-iommu.c   | 7 ++-
 drivers/iommu/s390-iommu.c   | 7 ++-
 4 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
SET_M(base, ctx, 1);
 }
 
-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
 {
struct msm_priv *priv;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 
 static struct iommu_ops msm_iommu_ops = {
.identity_domain = &msm_iommu_identity_domain,
-   .domain_alloc = msm_iommu_domain_alloc,
+   .domain_alloc_paging = msm_iommu_domain_alloc_paging,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct mtk_iommu_v1_data *data)
return 0;
 }
 
-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)
+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device *dev)
 {
struct mtk_iommu_v1_domain *dom;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!dom)
return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct mtk_iommu_v1_data *data)
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
.identity_domain = &mtk_iommu_v1_identity_domain,
-   .domain_alloc   = mtk_iommu_v1_domain_alloc,
+   .domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13 +1580,10 @@ static struct iommu_domain omap_iommu_identity_domain = {
.ops = &omap_iommu_identity_ops,
 };
 
-static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *omap_iommu_domain_alloc_paging(struct device *dev)
 {
struct omap_iommu_domain *omap_domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
omap_domain = kzalloc(sizeof(*omap_domain), GFP_KERNEL);
if (!omap_domain)
return NULL;
@@ -1748,7 +1745,7 @@ static struct iommu_group *omap_iommu_device_group(struct device *dev)
 
 static const struct iommu_ops omap_iommu_ops = {
.identity_domain = &omap_iommu_identity_domain,
-   .domain_alloc   = omap_iommu_domain_alloc,
+   .domain_alloc_paging = omap_iommu_domain_alloc_paging,
.probe_device   = omap_iommu_probe_device,
.release_device = omap_iommu_release_device,
.device_group   = omap_iommu_device_group,
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index f0c867c57a5b9b..5695ad71d60e24 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -39,13 +39,10 @@ static bool s390_iommu_capable(struct device *dev, enum iommu_cap cap)
}
 }
 
-static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
+static struct iommu_domain *s390_domain_alloc_paging(struct device *dev)
 {
struct s390_domain *s390_domain;
 
-   if (domain_type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
s390_domain = kzalloc(sizeof(*s390_domain), 

[PATCH v4 21/25] iommu: Require a default_domain for all iommu drivers

2023-06-16 Thread Jason Gunthorpe
At this point every iommu driver will cause a default_domain to be
selected, so we can finally remove this gap from the core code.

The following table explains what each driver supports and what the
resulting default_domain will be:

                        ops->default_domain
                  IDENTITY  DMA  PLATFORM     ARM32      dma-iommu  ARCH
amd/iommu.c           Y      Y                 N/A        either
apple-dart.c          Y      Y                 N/A        either
arm-smmu.c            Y      Y                 IDENTITY   either
qcom_iommu.c          G      Y                 IDENTITY   either
arm-smmu-v3.c         Y      Y                 N/A        either
exynos-iommu.c        G      Y                 IDENTITY   either
fsl_pamu_domain.c                  Y           N/A        N/A        PLATFORM
intel/iommu.c         Y      Y                 N/A        either
ipmmu-vmsa.c          G      Y                 IDENTITY   either
msm_iommu.c           G                        IDENTITY   N/A
mtk_iommu.c           G      Y                 IDENTITY   either
mtk_iommu_v1.c        G                        IDENTITY   N/A
omap-iommu.c          G                        IDENTITY   N/A
rockchip-iommu.c      G      Y                 IDENTITY   either
s390-iommu.c                 Y     Y           N/A        N/A        PLATFORM
sprd-iommu.c                 Y                 N/A        DMA
sun50i-iommu.c        G      Y                 IDENTITY   either
tegra-smmu.c          G      Y                 IDENTITY   IDENTITY
virtio-iommu.c        Y      Y                 N/A        either
spapr                        Y     Y           N/A        N/A        PLATFORM
 * G means ops->identity_domain is used
 * N/A means the driver will not compile in this configuration

ARM32 drivers select an IDENTITY default domain through either the
ops->identity_domain or directly requesting an IDENTITY domain through
alloc_domain().

In ARM64 mode tegra-smmu will still block the use of dma-iommu.c and
forces an IDENTITY domain.

S390 uses a PLATFORM domain to represent when the dma_ops are set to the
s390 iommu code.

fsl_pamu uses a PLATFORM domain.

POWER SPAPR uses PLATFORM and blocking to enable its weird VFIO mode.

The x86 drivers continue unchanged.

After this patch group->default_domain is only NULL for a short period
during bus iommu probing while all the groups are constituted. Otherwise
it is always !NULL.

This completes changing the iommu subsystem driver contract to a system
where the current iommu_domain always represents some form of translation
and the driver is continuously asserting a definable translation mode.

It resolves the confusion that the original ops->detach_dev() caused
around what translation, exactly, is the IOMMU performing after
detach. There were at least three different answers to that question in
the tree, they are all now clearly named with domain types.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e60640f6ccb625..98b855487cf03c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1805,10 +1805,12 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
 * ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
 * identity_domain and it will automatically become their default
 * domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
-* Override the selection to IDENTITY if we are sure the driver supports
-* it.
+* Override the selection to IDENTITY.
 */
-   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+   static_assert(!(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) &&
+   IS_ENABLED(CONFIG_IOMMU_DMA)));
+
type = IOMMU_DOMAIN_IDENTITY;
if (best_type && type && best_type != type)
goto err;
@@ -2906,18 +2908,9 @@ static int iommu_setup_default_domain(struct iommu_group *group,
if (req_type < 0)
return -EINVAL;
 
-   /*
-* There are still some drivers which don't support default domains, so
-* we ignore the failure and leave group->default_domain NULL.
-*/
dom = iommu_group_alloc_default_domain(group, req_type);
-   if (!dom) {
-   /* Once in default_domain mode we 

[PATCH v4 23/25] iommu: Add ops->domain_alloc_paging()

2023-06-16 Thread Jason Gunthorpe
This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING
domain, so it saves a few lines in a lot of drivers needlessly checking
the type.

More critically, this allows us to sweep out all the
IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the
drivers, simplifying what is going on in the code and ultimately removing
the now-unused special cases in drivers where they did not support
IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is
functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forwards looking and pass in a 'struct device *' argument. We can
provide this when allocating the default_domain. No drivers will look at
this.
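
To make the contract concrete, here is a hedged sketch of the driver
side (hypothetical driver "foo" and its struct foo_domain are
placeholders, not a real driver):

  struct foo_domain {
          struct iommu_domain domain;
          /* driver private page table state would live here */
  };

  static struct iommu_domain *foo_domain_alloc_paging(struct device *dev)
  {
          struct foo_domain *priv;

          /* No type check needed: the core only calls this op when it
           * wants a __IOMMU_DOMAIN_PAGING (UNMANAGED/DMA/DMA_FQ) domain.
           */
          priv = kzalloc(sizeof(*priv), GFP_KERNEL);
          if (!priv)
                  return NULL;

          /* Leaving domain.ops NULL lets the core install
           * ops->default_domain_ops, as the converted drivers rely on.
           */
          return &priv->domain;
  }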

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 13 ++---
 include/linux/iommu.h |  3 +++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0346c05e108438..8f3464ba204498 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1985,6 +1985,7 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
 static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+struct device *dev,
 unsigned int type)
 {
struct iommu_domain *domain;
@@ -1992,8 +1993,13 @@ static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
 
if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
return ops->identity_domain;
+   else if (type & __IOMMU_DOMAIN_PAGING && ops->domain_alloc_paging) {
+   domain = ops->domain_alloc_paging(dev);
+   } else if (ops->domain_alloc)
+   domain = ops->domain_alloc(alloc_type);
+   else
+   return NULL;
 
-   domain = ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
 
@@ -2024,14 +2030,15 @@ __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)
 
lockdep_assert_held(>mutex);
 
-   return __iommu_domain_alloc(dev_iommu_ops(dev), type);
+   return __iommu_domain_alloc(dev_iommu_ops(dev), dev, type);
 }
 
 struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
 {
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
-   return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+   return __iommu_domain_alloc(bus->iommu_ops, NULL,
+   IOMMU_DOMAIN_UNMANAGED);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 49331573f1d1f5..8e4d178c49c417 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -233,6 +233,8 @@ struct iommu_iotlb_gather {
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *   UNMANAGED, DMA, and DMA_FQ domain types.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -264,6 +266,7 @@ struct iommu_ops {
 
/* Domain allocation and freeing by the iommu driver */
struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+   struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
 
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
-- 
2.40.1



[PATCH v4 02/25] iommu: Add IOMMU_DOMAIN_PLATFORM

2023-06-16 Thread Jason Gunthorpe
This is used when the iommu driver is taking control of the dma_ops,
currently only on S390 and power spapr. It is designed to preserve the
original ops->detach_dev() semantic that S390 was built around.

Provide an opaque domain type and a 'default_domain' ops value that allows
the driver to trivially force any single domain as the default domain.
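
As a hedged illustration (hypothetical driver "foo"; the real users are
fsl_pamu and s390 later in this series), forcing a PLATFORM domain as
the default looks like:

  static int foo_platform_attach(struct iommu_domain *platform_domain,
                                 struct device *dev)
  {
          /* Hand the device back to the platform's own dma_ops */
          return 0;
  }

  static struct iommu_domain_ops foo_platform_ops = {
          .attach_dev = foo_platform_attach,
  };

  static struct iommu_domain foo_platform_domain = {
          .type = IOMMU_DOMAIN_PLATFORM,
          .ops = &foo_platform_ops,
  };

  static const struct iommu_ops foo_ops = {
          /* The core will always pick this as the default_domain */
          .default_domain = &foo_platform_domain,
  };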

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 14 +-
 include/linux/iommu.h |  6 ++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bb840a818525ad..c8f6664767152d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1644,6 +1644,17 @@ iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 
lockdep_assert_held(>mutex);
 
+   /*
+* Allow legacy drivers to specify the domain that will be the default
+* domain. This should always be either an IDENTITY or PLATFORM domain.
+* Do not use in new drivers.
+*/
+   if (bus->iommu_ops->default_domain) {
+   if (req_type)
+   return ERR_PTR(-EINVAL);
+   return bus->iommu_ops->default_domain;
+   }
+
if (req_type)
return __iommu_group_alloc_default_domain(bus, group, req_type);
 
@@ -1953,7 +1964,8 @@ void iommu_domain_free(struct iommu_domain *domain)
if (domain->type == IOMMU_DOMAIN_SVA)
mmdrop(domain->mm);
iommu_put_dma_cookie(domain);
-   domain->ops->free(domain);
+   if (domain->ops->free)
+   domain->ops->free(domain);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_free);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c3004eac2f88e8..ef0af09326 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -64,6 +64,7 @@ struct iommu_domain_geometry {
 #define __IOMMU_DOMAIN_DMA_FQ  (1U << 3)  /* DMA-API uses flush queue*/
 
 #define __IOMMU_DOMAIN_SVA (1U << 4)  /* Shared process address space */
+#define __IOMMU_DOMAIN_PLATFORM	(1U << 5)
 
 #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
 /*
@@ -81,6 +82,8 @@ struct iommu_domain_geometry {
  *   invalidation.
  * IOMMU_DOMAIN_SVA- DMA addresses are shared process addresses
  *   represented by mm_struct's.
+ * IOMMU_DOMAIN_PLATFORM   - Legacy domain for drivers that do their own
+ *   dma_api stuff. Do not use in new drivers.
  */
 #define IOMMU_DOMAIN_BLOCKED   (0U)
 #define IOMMU_DOMAIN_IDENTITY  (__IOMMU_DOMAIN_PT)
@@ -91,6 +94,7 @@ struct iommu_domain_geometry {
 __IOMMU_DOMAIN_DMA_API |   \
 __IOMMU_DOMAIN_DMA_FQ)
 #define IOMMU_DOMAIN_SVA   (__IOMMU_DOMAIN_SVA)
+#define IOMMU_DOMAIN_PLATFORM  (__IOMMU_DOMAIN_PLATFORM)
 
 struct iommu_domain {
unsigned type;
@@ -256,6 +260,7 @@ struct iommu_iotlb_gather {
  * @owner: Driver module providing these ops
  * @identity_domain: An always available, always attachable identity
  *   translation.
+ * @default_domain: If not NULL this will always be set as the default domain.
  */
 struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
@@ -290,6 +295,7 @@ struct iommu_ops {
unsigned long pgsize_bitmap;
struct module *owner;
struct iommu_domain *identity_domain;
+   struct iommu_domain *default_domain;
 };
 
 /**
-- 
2.40.1



[PATCH v4 13/25] iommu/omap: Implement an IDENTITY domain

2023-06-16 Thread Jason Gunthorpe
What omap does during omap_iommu_set_platform_dma() is actually putting
the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA; however, it cannot be
compiled on ARM64 either. Most likely it is fine to support dma-iommu.c.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/omap-iommu.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 537e402f9bba97..34340ef15241bc 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1555,16 +1555,31 @@ static void _omap_iommu_detach_dev(struct omap_iommu_domain *omap_domain,
omap_domain->dev = NULL;
 }
 
-static void omap_iommu_set_platform_dma(struct device *dev)
+static int omap_iommu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct omap_iommu_domain *omap_domain = to_omap_domain(domain);
+   struct omap_iommu_domain *omap_domain;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   omap_domain = to_omap_domain(domain);
spin_lock(&omap_domain->lock);
_omap_iommu_detach_dev(omap_domain, dev);
spin_unlock(&omap_domain->lock);
+   return 0;
 }
 
+static struct iommu_domain_ops omap_iommu_identity_ops = {
+   .attach_dev = omap_iommu_identity_attach,
+};
+
+static struct iommu_domain omap_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
.ops = &omap_iommu_identity_ops,
+};
+
 static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
 {
struct omap_iommu_domain *omap_domain;
@@ -1732,11 +1747,11 @@ static struct iommu_group *omap_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops omap_iommu_ops = {
+   .identity_domain = &omap_iommu_identity_domain,
.domain_alloc   = omap_iommu_domain_alloc,
.probe_device   = omap_iommu_probe_device,
.release_device = omap_iommu_release_device,
.device_group   = omap_iommu_device_group,
-   .set_platform_dma_ops = omap_iommu_set_platform_dma,
.pgsize_bitmap  = OMAP_IOMMU_PGSIZES,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = omap_iommu_attach_dev,
-- 
2.40.1



[PATCH v4 22/25] iommu: Add __iommu_group_domain_alloc()

2023-06-16 Thread Jason Gunthorpe
Allocate a domain from a group. Automatically obtains the iommu_ops to use
from the device list of the group. Convert the internal callers to use it.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 66 ---
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 98b855487cf03c..0346c05e108438 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -94,8 +94,8 @@ static const char * const iommu_group_resv_type_string[] = {
 static int iommu_bus_notifier(struct notifier_block *nb,
  unsigned long action, void *data);
 static void iommu_release_device(struct device *dev);
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-unsigned type);
+static struct iommu_domain *
+__iommu_group_domain_alloc(struct iommu_group *group, unsigned int type);
 static int __iommu_attach_device(struct iommu_domain *domain,
 struct device *dev);
 static int __iommu_attach_group(struct iommu_domain *domain,
@@ -1652,12 +1652,11 @@ struct iommu_group *fsl_mc_device_group(struct device *dev)
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
 static struct iommu_domain *
-__iommu_group_alloc_default_domain(const struct bus_type *bus,
-  struct iommu_group *group, int req_type)
+__iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
if (group->default_domain && group->default_domain->type == req_type)
return group->default_domain;
-   return __iommu_domain_alloc(bus, req_type);
+   return __iommu_group_domain_alloc(group, req_type);
 }
 
 /*
@@ -1667,9 +1666,10 @@ __iommu_group_alloc_default_domain(const struct bus_type *bus,
 static struct iommu_domain *
 iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
-   const struct bus_type *bus =
+   struct device *dev =
list_first_entry(>devices, struct group_device, list)
-   ->dev->bus;
+   ->dev;
+   const struct iommu_ops *ops = dev_iommu_ops(dev);
struct iommu_domain *dom;
 
lockdep_assert_held(&group->mutex);
@@ -1679,24 +1679,24 @@ iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 * domain. This should always be either an IDENTITY or PLATFORM domain.
 * Do not use in new drivers.
 */
-   if (bus->iommu_ops->default_domain) {
+   if (ops->default_domain) {
if (req_type)
return ERR_PTR(-EINVAL);
-   return bus->iommu_ops->default_domain;
+   return ops->default_domain;
}
 
if (req_type)
-   return __iommu_group_alloc_default_domain(bus, group, req_type);
+   return __iommu_group_alloc_default_domain(group, req_type);
 
/* The driver gave no guidance on what type to use, try the default */
-   dom = __iommu_group_alloc_default_domain(bus, group, iommu_def_domain_type);
+   dom = __iommu_group_alloc_default_domain(group, iommu_def_domain_type);
if (dom)
return dom;
 
/* Otherwise IDENTITY and DMA_FQ defaults will try DMA */
if (iommu_def_domain_type == IOMMU_DOMAIN_DMA)
return NULL;
-   dom = __iommu_group_alloc_default_domain(bus, group, IOMMU_DOMAIN_DMA);
+   dom = __iommu_group_alloc_default_domain(group, IOMMU_DOMAIN_DMA);
if (!dom)
return NULL;
 
@@ -1984,19 +1984,16 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-unsigned type)
+static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+unsigned int type)
 {
struct iommu_domain *domain;
unsigned int alloc_type = type & IOMMU_DOMAIN_ALLOC_FLAGS;
 
-   if (bus == NULL || bus->iommu_ops == NULL)
-   return NULL;
+   if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
+   return ops->identity_domain;
 
-   if (alloc_type == IOMMU_DOMAIN_IDENTITY && 
bus->iommu_ops->identity_domain)
-   return bus->iommu_ops->identity_domain;
-
-   domain = bus->iommu_ops->domain_alloc(alloc_type);
+   domain = ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
 
@@ -2006,10 +2003,10 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
 * may override this later
 */
if (!domain->pgsize_bitmap)
-   domain->pgsize_bitmap = 

[PATCH v4 05/25] iommu/fsl_pamu: Implement a PLATFORM domain

2023-06-16 Thread Jason Gunthorpe
This driver is nonsensical. To not block migrating the core API away from
NULL default_domains, give it a hacky PLATFORM domain that keeps it
working exactly as it always did.

Leave some comments around to warn away any future people looking at this.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/fsl_pamu_domain.c | 41 ++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 4ac0e247ec2b51..e9d2bff4659b7c 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -196,6 +196,13 @@ static struct iommu_domain *fsl_pamu_domain_alloc(unsigned type)
 {
struct fsl_dma_domain *dma_domain;
 
+   /*
+* FIXME: This isn't creating an unmanaged domain since the
+* default_domain_ops do not have any map/unmap function it doesn't meet
+* the requirements for __IOMMU_DOMAIN_PAGING. The only purpose seems to
+* allow drivers/soc/fsl/qbman/qman_portal.c to do
+* fsl_pamu_configure_l1_stash()
+*/
if (type != IOMMU_DOMAIN_UNMANAGED)
return NULL;
 
@@ -283,15 +290,33 @@ static int fsl_pamu_attach_device(struct iommu_domain *domain,
return ret;
 }
 
-static void fsl_pamu_set_platform_dma(struct device *dev)
+/*
+ * FIXME: fsl/pamu is completely broken in terms of how it works with the iommu
+ * API. Immediately after probe the HW is left in an IDENTITY translation and
+ * the driver provides a non-working UNMANAGED domain that it can switch over
+ * to. However it cannot switch back to an IDENTITY translation, instead it
+ * switches to what looks like BLOCKING.
+ */
+static int fsl_pamu_platform_attach(struct iommu_domain *platform_domain,
+   struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+   struct fsl_dma_domain *dma_domain;
const u32 *prop;
int len;
struct pci_dev *pdev = NULL;
struct pci_controller *pci_ctl;
 
+   /*
+* Hack to keep things working as they always have, only leaving an
+* UNMANAGED domain makes it BLOCKING.
+*/
+   if (domain == platform_domain || !domain ||
+   domain->type != IOMMU_DOMAIN_UNMANAGED)
+   return 0;
+
+   dma_domain = to_fsl_dma_domain(domain);
+
/*
 * Use LIODN of the PCI controller while detaching a
 * PCI device.
@@ -312,8 +337,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
detach_device(dev, dma_domain);
else
pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+   return 0;
 }
 
+static struct iommu_domain_ops fsl_pamu_platform_ops = {
+   .attach_dev = fsl_pamu_platform_attach,
+};
+
+static struct iommu_domain fsl_pamu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &fsl_pamu_platform_ops,
+};
+
 /* Set the domain stash attribute */
 int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
 {
@@ -395,11 +430,11 @@ static struct iommu_device *fsl_pamu_probe_device(struct device *dev)
 }
 
 static const struct iommu_ops fsl_pamu_ops = {
+   .default_domain = &fsl_pamu_platform_domain,
.capable= fsl_pamu_capable,
.domain_alloc   = fsl_pamu_domain_alloc,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
-   .set_platform_dma_ops = fsl_pamu_set_platform_dma,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = fsl_pamu_attach_device,
.iova_to_phys   = fsl_pamu_iova_to_phys,
-- 
2.40.1



[PATCH v4 24/25] iommu: Convert simple drivers with DOMAIN_DMA to domain_alloc_paging()

2023-06-16 Thread Jason Gunthorpe
These drivers are all trivially converted since the function is only
called if the domain type is going to be
IOMMU_DOMAIN_UNMANAGED/DMA.

Tested-by: Heiko Stuebner 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 6 ++
 drivers/iommu/exynos-iommu.c| 7 ++-
 drivers/iommu/ipmmu-vmsa.c  | 7 ++-
 drivers/iommu/mtk_iommu.c   | 7 ++-
 drivers/iommu/rockchip-iommu.c  | 7 ++-
 drivers/iommu/sprd-iommu.c  | 7 ++-
 drivers/iommu/sun50i-iommu.c| 9 +++--
 drivers/iommu/tegra-smmu.c  | 7 ++-
 8 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 9d7b9d8b4386d4..a2140fdc65ed58 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -319,12 +319,10 @@ static int qcom_iommu_init_domain(struct iommu_domain 
*domain,
return ret;
 }
 
-static struct iommu_domain *qcom_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *qcom_iommu_domain_alloc_paging(struct device *dev)
 {
struct qcom_iommu_domain *qcom_domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-   return NULL;
/*
 * Allocate the domain and initialise some of its data structures.
 * We can't really do anything meaningful until we've added a
@@ -593,7 +591,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct 
of_phandle_args *args)
 static const struct iommu_ops qcom_iommu_ops = {
.identity_domain = &qcom_iommu_identity_domain,
.capable= qcom_iommu_capable,
-   .domain_alloc   = qcom_iommu_domain_alloc,
+   .domain_alloc_paging = qcom_iommu_domain_alloc_paging,
.probe_device   = qcom_iommu_probe_device,
.device_group   = generic_device_group,
.of_xlate   = qcom_iommu_of_xlate,
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 5e12b85dfe8705..d6dead2ed10c11 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -887,7 +887,7 @@ static inline void exynos_iommu_set_pte(sysmmu_pte_t *ent, 
sysmmu_pte_t val)
   DMA_TO_DEVICE);
 }
 
-static struct iommu_domain *exynos_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *exynos_iommu_domain_alloc_paging(struct device 
*dev)
 {
struct exynos_iommu_domain *domain;
dma_addr_t handle;
@@ -896,9 +896,6 @@ static struct iommu_domain 
*exynos_iommu_domain_alloc(unsigned type)
/* Check if correct PTE offsets are initialized */
BUG_ON(PG_ENT_SHIFT < 0 || !dma_dev);
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!domain)
return NULL;
@@ -1472,7 +1469,7 @@ static int exynos_iommu_of_xlate(struct device *dev,
 
 static const struct iommu_ops exynos_iommu_ops = {
.identity_domain = &exynos_identity_domain,
-   .domain_alloc = exynos_iommu_domain_alloc,
+   .domain_alloc_paging = exynos_iommu_domain_alloc_paging,
.device_group = generic_device_group,
.probe_device = exynos_iommu_probe_device,
.release_device = exynos_iommu_release_device,
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index de958e411a92e0..27d36347e0fced 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -566,13 +566,10 @@ static irqreturn_t ipmmu_irq(int irq, void *dev)
  * IOMMU Operations
  */
 
-static struct iommu_domain *ipmmu_domain_alloc(unsigned type)
+static struct iommu_domain *ipmmu_domain_alloc_paging(struct device *dev)
 {
struct ipmmu_vmsa_domain *domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-   return NULL;
-
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!domain)
return NULL;
@@ -891,7 +888,7 @@ static struct iommu_group *ipmmu_find_group(struct device 
*dev)
 
 static const struct iommu_ops ipmmu_ops = {
.identity_domain = &ipmmu_iommu_identity_domain,
-   .domain_alloc = ipmmu_domain_alloc,
+   .domain_alloc_paging = ipmmu_domain_alloc_paging,
.probe_device = ipmmu_probe_device,
.release_device = ipmmu_release_device,
.probe_finalize = ipmmu_probe_finalize,
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index fdb7f5162b1d64..3590d3399add32 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -667,13 +667,10 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom,
return 0;
 }
 
-static struct iommu_domain *mtk_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *mtk_iommu_domain_alloc_paging(struct device *dev)

[PATCH v4 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390

2023-06-16 Thread Jason Gunthorpe
The PLATFORM domain will be set as the default domain and attached as
normal during probe. The driver will ignore the initial attach from a NULL
domain to the PLATFORM domain.

After this, the PLATFORM domain's attach_dev will be called whenever we
detach from an UNMANAGED domain (eg for VFIO). This is the same time the
original design would have called op->detach_dev().

This is temporary until the S390 dma-iommu.c conversion is merged.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/s390-iommu.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain 
*domain,
return 0;
 }
 
-static void s390_iommu_set_platform_dma(struct device *dev)
+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+ struct device *dev)
 {
struct zpci_dev *zdev = to_zpci_dev(dev);
 
+   if (!zdev->s390_domain)
+   return 0;
+
__s390_iommu_detach_device(zdev);
zpci_dma_init_device(zdev);
+   return 0;
 }
 
+static struct iommu_domain_ops s390_iommu_platform_ops = {
+   .attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &s390_iommu_platform_ops,
+};
+
 static void s390_iommu_get_resv_regions(struct device *dev,
struct list_head *list)
 {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
 }
 
 static const struct iommu_ops s390_iommu_ops = {
+   .default_domain = &s390_iommu_platform_domain,
.capable = s390_iommu_capable,
.domain_alloc = s390_domain_alloc,
.probe_device = s390_iommu_probe_device,
.release_device = s390_iommu_release_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = s390_iommu_set_platform_dma,
.pgsize_bitmap = SZ_4K,
.get_resv_regions = s390_iommu_get_resv_regions,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.40.1



[PATCH v4 07/25] iommu/mtk_iommu_v1: Implement an IDENTITY domain

2023-06-16 Thread Jason Gunthorpe
What mtk does during mtk_iommu_v1_set_platform_dma() is actually putting
the iommu into identity mode. Make this available as a proper IDENTITY
domain.

The mtk_iommu_v1_def_domain_type() from
commit 8bbe13f52cb7 ("iommu/mediatek-v1: Add def_domain_type") explains
this was needed to allow probe_finalize() to be called, but now the
IDENTITY domain will do the same job so change the returned
def_domain_type.

mtk_v1 is the only driver that returns IOMMU_DOMAIN_UNMANAGED from
def_domain_type().  This allows the next patch to enforce an IDENTITY
domain policy for this driver.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/mtk_iommu_v1.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 8a0a5e5d049f4a..cc3e7d53d33ad9 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -319,11 +319,27 @@ static int mtk_iommu_v1_attach_device(struct iommu_domain 
*domain, struct device
return 0;
 }
 
-static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+static int mtk_iommu_v1_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
struct mtk_iommu_v1_data *data = dev_iommu_priv_get(dev);
 
mtk_iommu_v1_config(data, dev, false);
+   return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_v1_identity_ops = {
+   .attach_dev = mtk_iommu_v1_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_v1_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &mtk_iommu_v1_identity_ops,
+};
+
+static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+{
+   mtk_iommu_v1_identity_attach(&mtk_iommu_v1_identity_domain, dev);
 }
 
 static int mtk_iommu_v1_map(struct iommu_domain *domain, unsigned long iova,
@@ -443,7 +459,7 @@ static int mtk_iommu_v1_create_mapping(struct device *dev, 
struct of_phandle_arg
 
 static int mtk_iommu_v1_def_domain_type(struct device *dev)
 {
-   return IOMMU_DOMAIN_UNMANAGED;
+   return IOMMU_DOMAIN_IDENTITY;
 }
 
 static struct iommu_device *mtk_iommu_v1_probe_device(struct device *dev)
@@ -578,6 +594,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
 }
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
+   .identity_domain = &mtk_iommu_v1_identity_domain,
.domain_alloc   = mtk_iommu_v1_domain_alloc,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
-- 
2.40.1



[PATCH v4 03/25] powerpc/iommu: Setup a default domain and remove set_platform_dma_ops

2023-06-16 Thread Jason Gunthorpe
POWER is using the set_platform_dma_ops() callback to hook up its private
dma_ops, but this is buried under some indirection and is weirdly
happening for a BLOCKED domain as well.

For better documentation create a PLATFORM domain to manage the dma_ops,
since that is what it is for, and make the BLOCKED domain an alias for
it. BLOCKED is required for VFIO.

Also removes the leaky allocation of the BLOCKED domain by using a global
static.

Signed-off-by: Jason Gunthorpe 
---
 arch/powerpc/kernel/iommu.c | 38 +
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 67f0b01e6ff575..0f17cd767e1676 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1266,7 +1266,7 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = {
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
-static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom,
+static int spapr_tce_platform_iommu_attach_dev(struct iommu_domain *dom,
   struct device *dev)
 {
struct iommu_group *grp = iommu_group_get(dev);
@@ -1283,17 +1283,22 @@ static int spapr_tce_blocking_iommu_attach_dev(struct 
iommu_domain *dom,
return ret;
 }
 
-static void spapr_tce_blocking_iommu_set_platform_dma(struct device *dev)
-{
-   struct iommu_group *grp = iommu_group_get(dev);
-   struct iommu_table_group *table_group;
+static const struct iommu_domain_ops spapr_tce_platform_domain_ops = {
+   .attach_dev = spapr_tce_platform_iommu_attach_dev,
+};
 
-   table_group = iommu_group_get_iommudata(grp);
-   table_group->ops->release_ownership(table_group);
-}
+static struct iommu_domain spapr_tce_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &spapr_tce_platform_domain_ops,
+};
 
-static const struct iommu_domain_ops spapr_tce_blocking_domain_ops = {
-   .attach_dev = spapr_tce_blocking_iommu_attach_dev,
+static struct iommu_domain spapr_tce_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   /*
+* FIXME: SPAPR mixes blocked and platform behaviors, the blocked domain
+* also sets the dma_api ops
+*/
+   .ops = &spapr_tce_platform_domain_ops,
 };
 
 static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)
@@ -1310,18 +1315,9 @@ static bool spapr_tce_iommu_capable(struct device *dev, 
enum iommu_cap cap)
 
 static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
 {
-   struct iommu_domain *dom;
-
if (type != IOMMU_DOMAIN_BLOCKED)
return NULL;
-
-   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
-   if (!dom)
-   return NULL;
-
-   dom->ops = &spapr_tce_blocking_domain_ops;
-
-   return dom;
+   return &spapr_tce_blocked_domain;
 }
 
 static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
@@ -1357,12 +1353,12 @@ static struct iommu_group 
*spapr_tce_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops spapr_tce_iommu_ops = {
+   .default_domain = &spapr_tce_platform_domain,
.capable = spapr_tce_iommu_capable,
.domain_alloc = spapr_tce_iommu_domain_alloc,
.probe_device = spapr_tce_iommu_probe_device,
.release_device = spapr_tce_iommu_release_device,
.device_group = spapr_tce_iommu_device_group,
-   .set_platform_dma_ops = spapr_tce_blocking_iommu_set_platform_dma,
 };
 
 static struct attribute *spapr_tce_iommu_attrs[] = {
-- 
2.40.1



Re: [PATCH v2 05/12] modules, execmem: drop module_alloc

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Define default parameters for the address range for code allocations using
> the current values in module_alloc() and make execmem_text_alloc() use
> these defaults when an architecture does not supply its specific
> parameters.
>
> With this, execmem_text_alloc() implements memory allocation in a way
> compatible with module_alloc() and can be used as a replacement for
> module_alloc().
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 

> ---
>  include/linux/execmem.h  |  8 
>  include/linux/moduleloader.h | 12 
>  kernel/module/main.c |  7 ---
>  mm/execmem.c | 12 
>  4 files changed, 16 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 68b2bfc79993..b9a97fcdf3c5 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -4,6 +4,14 @@
>
>  #include 
>
> +#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> +   !defined(CONFIG_KASAN_VMALLOC)
> +#include 
> +#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
> +#else
> +#define MODULE_ALIGN PAGE_SIZE
> +#endif
> +
>  /**
>   * struct execmem_range - definition of a memory range suitable for code and
>   *   related data allocations
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index b3374342f7af..4321682fe849 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
>  /* Additional bytes needed by arch in front of individual sections */
>  unsigned int arch_mod_section_prepend(struct module *mod, unsigned int 
> section);
>
> -/* Allocator used for allocating struct module, core sections and init
> -   sections.  Returns NULL on failure. */
> -void *module_alloc(unsigned long size);
> -
>  /* Determines if the section name is an init section (that is only used 
> during
>   * module loading).
>   */
> @@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
>  /* Any cleanup before freeing mod->module_init */
>  void module_arch_freeing_init(struct module *mod);
>
> -#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> -   !defined(CONFIG_KASAN_VMALLOC)
> -#include 
> -#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
> -#else
> -#define MODULE_ALIGN PAGE_SIZE
> -#endif
> -
>  #endif
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 43810a3bdb81..b445c5ad863a 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
> }
>  }
>
> -void * __weak module_alloc(unsigned long size)
> -{
> -   return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -   GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -   NUMA_NO_NODE, __builtin_return_address(0));
> -}
> -
>  bool __weak module_init_section(const char *name)
>  {
> return strstarts(name, ".init");
> diff --git a/mm/execmem.c b/mm/execmem.c
> index 2fe36dcc7bdf..a67acd75ffef 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -59,9 +59,6 @@ void *execmem_text_alloc(size_t size)
> unsigned long fallback_end = execmem_params.modules.text.fallback_end;
> bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
>
> -   if (!execmem_params.modules.text.start)
> -   return module_alloc(size);
> -
> return execmem_alloc(size, start, end, align, pgprot,
>  fallback_start, fallback_end, kasan);
>  }
> @@ -108,8 +105,15 @@ void __init execmem_init(void)
>  {
> struct execmem_params *p = execmem_arch_params();
>
> -   if (!p)
> +   if (!p) {
> +   p = &execmem_params;
> +   p->modules.text.start = VMALLOC_START;
> +   p->modules.text.end = VMALLOC_END;
> +   p->modules.text.pgprot = PAGE_KERNEL_EXEC;
> +   p->modules.text.alignment = 1;
> +
> return;
> +   }
>
> if (!execmem_validate_params(p))
> return;
> --
> 2.35.1
>


Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
[...]
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 5af4975caeb5..c3d999f3a3dd 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -17,56 +17,50 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>
> -void *module_alloc(unsigned long size)
> +static struct execmem_params execmem_params = {
> +   .modules = {
> +   .flags = EXECMEM_KASAN_SHADOW,
> +   .text = {
> +   .alignment = MODULE_ALIGN,
> +   },
> +   },
> +};
> +
> +struct execmem_params __init *execmem_arch_params(void)
>  {
> u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> -   gfp_t gfp_mask = GFP_KERNEL;
> -   void *p;
> -
> -   /* Silence the initial allocation */
> -   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> -   gfp_mask |= __GFP_NOWARN;
>
> -   if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> -   IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> -   /* don't exceed the static module region - see below */
> -   module_alloc_end = MODULES_END;
> +   execmem_params.modules.text.pgprot = PAGE_KERNEL;
> +   execmem_params.modules.text.start = module_alloc_base;

I think I mentioned this earlier. For arm64 with CONFIG_RANDOMIZE_BASE,
module_alloc_base is not yet set when execmem_arch_params() is
called. So we will need some extra logic for this.
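
For example, it might need to look something like the sketch below --
purely hypothetical, assuming execmem grew a second hook that runs after
module_alloc_base has been finalized (neither the hook nor its name
exist in this series):

struct execmem_params __init *execmem_arch_late_params(void)
{
	/* Hypothetical: patch in the randomized base once it is final. */
	execmem_params.modules.text.start = module_alloc_base;
	execmem_params.modules.text.end = module_alloc_base + MODULES_VSIZE;
	return &execmem_params;
}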

Thanks,
Song


> +   execmem_params.modules.text.end = module_alloc_end;
>
> -   p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -   module_alloc_end, gfp_mask, PAGE_KERNEL, 
> VM_DEFER_KMEMLEAK,
> -   NUMA_NO_NODE, __builtin_return_address(0));
> -
> -   if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> +   /*
> +* KASAN without KASAN_VMALLOC can only deal with module
> +* allocations being served from the reserved module region,
> +* since the remainder of the vmalloc region is already
> +* backed by zero shadow pages, and punching holes into it
> +* is non-trivial. Since the module region is not randomized
> +* when KASAN is enabled without KASAN_VMALLOC, it is even
> +* less likely that the module region gets exhausted, so we
> +* can simply omit this fallback in that case.
> +*/
> +   if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
>  (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> - !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
> -   /*
> -* KASAN without KASAN_VMALLOC can only deal with module
> -* allocations being served from the reserved module region,
> -* since the remainder of the vmalloc region is already
> -* backed by zero shadow pages, and punching holes into it
> -* is non-trivial. Since the module region is not randomized
> -* when KASAN is enabled without KASAN_VMALLOC, it is even
> -* less likely that the module region gets exhausted, so we
> -* can simply omit this fallback in that case.
> -*/
> -   p = __vmalloc_node_range(size, MODULE_ALIGN, 
> module_alloc_base,
> -   module_alloc_base + SZ_2G, GFP_KERNEL,
> -   PAGE_KERNEL, 0, NUMA_NO_NODE,
> -   __builtin_return_address(0));
> -
> -   if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
> -   vfree(p);
> -   return NULL;
> + !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
> +   unsigned long end = module_alloc_base + SZ_2G;
> +
> +   execmem_params.modules.text.fallback_start = 
> module_alloc_base;
> +   execmem_params.modules.text.fallback_end = end;
> }
>
> -   /* Memory is intended to be executable, reset the pointer tag. */
> -   return kasan_reset_tag(p);
> +   return &execmem_params;
>  }
>
>  enum aarch64_reloc_op {


Re: [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
>
> Provide a generic implementation in execmem that uses the parameters
> for address space ranges, required alignment and page protections
> provided by architectures.
>
> The architectures must fill the execmem_params structure and implement
> execmem_arch_params() that returns a pointer to that structure. This
> way the execmem initialization won't be called from every architecture,
> but rather from a central place, namely initialization of the core
> memory management.
>
> The execmem provides execmem_text_alloc() API that wraps
> __vmalloc_node_range() with the parameters defined by the architectures.
> If an architecture does not implement execmem_arch_params(),
> execmem_text_alloc() will fall back to module_alloc().
>
> The name execmem_text_alloc() emphasizes that the allocated memory is
> for executable code, the allocations of the associated data, like data
> sections of a module will use execmem_data_alloc() interface that will
> be added later.
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 9:48 AM Kent Overstreet
 wrote:
>
> On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> >
> > module_alloc() is used everywhere as a means to allocate memory for code.
> >
> > Besides being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation on the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intended semantics for the new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> > The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> > allocated memory is for executable code, the allocations of the
> > associated data, like data sections of a module will use
> > execmem_data_alloc() interface that will be added later.
>
> I like the API split - at the risk of further bikeshedding, perhaps
> near_text_alloc() and far_text_alloc()? Would be more explicit.
>
> Reviewed-by: Kent Overstreet 

Acked-by: Song Liu 


Re: [PATCH v2 01/12] nios2: define virtual address space for modules

2023-06-16 Thread Song Liu
On Fri, Jun 16, 2023 at 1:51 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
> cannot reach all of vmalloc address space.
>
> Define module space as 32MiB below the kernel base and switch nios2 to
> use vmalloc for module allocations.
>
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Dinh Nguyen 

Acked-by: Song Liu 


> ---
>  arch/nios2/include/asm/pgtable.h |  5 -
>  arch/nios2/kernel/module.c   | 19 ---
>  2 files changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/arch/nios2/include/asm/pgtable.h 
> b/arch/nios2/include/asm/pgtable.h
> index 0f5c2564e9f5..0073b289c6a4 100644
> --- a/arch/nios2/include/asm/pgtable.h
> +++ b/arch/nios2/include/asm/pgtable.h
> @@ -25,7 +25,10 @@
>  #include 
>
>  #define VMALLOC_START  CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
> -#define VMALLOC_END	(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
> +#define VMALLOC_END	(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
> +
> +#define MODULES_VADDR  (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
> +#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
>
>  struct mm_struct;
>
> diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
> index 76e0a42d6e36..9c97b7513853 100644
> --- a/arch/nios2/kernel/module.c
> +++ b/arch/nios2/kernel/module.c
> @@ -21,23 +21,12 @@
>
>  #include 
>
> -/*
> - * Modules should NOT be allocated with kmalloc for (obvious) reasons.
> - * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
> - * from 0x80000000 (vmalloc area) to 0xc0000000 (kernel) (kmalloc returns
> - * addresses in 0xc0000000)
> - */
>  void *module_alloc(unsigned long size)
>  {
> -   if (size == 0)
> -   return NULL;
> -   return kmalloc(size, GFP_KERNEL);
> -}
> -
> -/* Free memory returned from module_alloc */
> -void module_memfree(void *module_region)
> -{
> -   kfree(module_region);
> +   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> +   GFP_KERNEL, PAGE_KERNEL_EXEC,
> +   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> +   __builtin_return_address(0));
>  }
>
>  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
> --
> 2.35.1
>


Re: ppc64le vmlinuz is huge when building with BTF

2023-06-16 Thread Naveen N Rao

Dominique Martinet wrote:

Naveen N Rao wrote on Fri, Jun 16, 2023 at 04:28:53PM +0530:

> We're not stripping anything in vmlinuz for other archs -- the linker
> script already should be including only the bare minimum to decompress
> itself (+compressed useful bits), so I guess it's a Kbuild issue for the
> arch.

For a related discussion, see:
http://lore.kernel.org/CAK18DXZKs2PNmLndeGYqkPxmrrBR=6ca3bhyYCj=ghya7dh...@mail.gmail.com


Thanks, I didn't know that ppc64le boots straight into vmlinux, as 'make
install' somehow installs something called 'vmlinuz-lts' (-lts coming
out of localversion afaiu, but vmlinuz would come from the build
scripts) ; this is somewhat confusing as vmlinuz on other archs is a
compressed/pre-processed binary so I'd expect it to at least be
stripped...


As far as I can tell, the kernel's install script doesn't give out that
name, so 'vmlinuz' is likely coming from the distro's 
/[s]bin/installkernel script. It probably needs an override to retain 
'vmlinux'. 




> We can add a strip but I unfortunately have no way of testing ppc build,
> I'll ask around the build linux-kbuild and linuxppc-dev lists if that's
> expected; it shouldn't be that bad now that's figured out.

Stripping vmlinux would indeed be the way to go. As mentioned in the above
link, fedora also packages a strip'ed vmlinux for ppc64le:
https://src.fedoraproject.org/rpms/kernel/blob/4af17bffde7a1eca9ab164e5de0e391c277998a4/f/kernel.spec#_1797


It feels somewhat wrong to add a strip just for ppc64le after make
install, but I guess we probably ought to do the same...
I don't have any hardware to test booting the result though, I'll submit
an update and ask for someone to test when it's done.
(bit busy but that doesn't take long, will do that tomorrow morning
before I forget)


Thanks! You're right that it's likely just powerpc that is different 
here. It sure would be nice if we can iron out issues with our zImage.



- Naveen



Re: [PATCH v2 00/12] mm: jit/text allocator

2023-06-16 Thread Edgecombe, Rick P
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> module_alloc() is used everywhere as a means to allocate memory for
> code.
> 
> Besides being semantically wrong, this unnecessarily ties all
> subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to
> modules and
> puts the burden of code allocation on the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this
> causes
> additional obstacles for improvements of code allocation.

I like how this series leaves the allocation code centralized at the
end of it because it will be much easier when we get to ROX, huge page,
text_poking() type stuff.

I guess that's the idea. I'm just catching up on what you and Song have
been up to.




Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-16 Thread Edgecombe, Rick P
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Data related to code allocations, such as module data section, need
> to
> comply with architecture constraints for its placement and its
> allocation right now was done using execmem_text_alloc().
> 
> Create a dedicated API for allocating data related to code
> allocations
> and allow architectures to define address ranges for data
> allocations.

Right now the cross-arch way to specify kernel memory permissions is
encoded in the function names of all the set_memory_foo()'s. You can't
just have unified prot names because some arch's have NX and some have
X bits, etc. CPA wouldn't know if it needs to set or unset a bit if you
pass in a PROT.

But then you end up with a new function for *each* combination (i.e.
set_memory_rox()). I wish CPA has flags like mmap() does, and I wonder
if it makes sense here instead of execmem_data_alloc().

Maybe that is an overhaul for another day though...
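
For the record, a flags-style interface could look something like the
sketch below. None of these names exist in the kernel today; it just
illustrates the mmap()-like shape instead of one exported helper per
permission combination.

#define SET_MEM_R	0x1
#define SET_MEM_W	0x2
#define SET_MEM_X	0x4

/* Each arch would translate the request into its own set/clear bit ops. */
int set_memory_flags(unsigned long addr, int numpages, unsigned int flags);

/* instead of set_memory_rox(addr, numpages): */
static int make_text_rox(unsigned long addr, int numpages)
{
	return set_memory_flags(addr, numpages, SET_MEM_R | SET_MEM_X);
}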


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-16 Thread Kent Overstreet
On Fri, Jun 16, 2023 at 11:50:28AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> > module_alloc() is used everywhere as a means to allocate memory for code.
> 
> > Besides being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation on the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> 
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
> 
> > The intended semantics for the new allocation APIs:
> 
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
> 
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
> 
> The names execmem_text_alloc() and jit_text_alloc() emphasize that the
> allocated memory is for executable code, the allocations of the
> associated data, like data sections of a module will use
> execmem_data_alloc() interface that will be added later.

I like the API split - at the risk of further bikeshedding, perhaps
near_text_alloc() and far_text_alloc()? Would be more explicit.

Reviewed-by: Kent Overstreet 


Re: [PATCH v2 4/6] watchdog/hardlockup: Make HAVE_NMI_WATCHDOG sparc64-specific

2023-06-16 Thread Doug Anderson
Hi,

On Fri, Jun 16, 2023 at 8:07 AM Petr Mladek  wrote:
>
> There are several hardlockup detector implementations and several Kconfig
> values which allow selection and build of the preferred one.
>
> CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb
> ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36.
> It was a preparation step for introducing the new generic perf hardlockup
> detector.
>
> The existing arch-specific variants did not support the to-be-created
> generic build configurations, sysctl interface, etc. This distinction
> was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog:
> Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR")
> in v2.6.38.
>
> CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105
> ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced
> the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used
> by three architectures, namely blackfin, mn10300, and sparc.
>
> The support for blackfin and mn10300 architectures was completely
> dropped some time ago. And sparc is the only architecture with the historic
> NMI watchdog at the moment.
>
> And the old sparc implementation is really special. It is always built on
> sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b
> ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added
> in v4.10-rc1.
>
> There are only a few locations where the sparc64 NMI watchdog interacts
> with the generic hardlockup detectors code:
>
>   + implements arch_touch_nmi_watchdog() which is called from the generic
> touch_nmi_watchdog()
>
>   + implements watchdog_hardlockup_enable()/disable() to support
> /proc/sys/kernel/nmi_watchdog
>
>   + is always preferred over other generic watchdogs, see
> CONFIG_HARDLOCKUP_DETECTOR
>
>   + includes asm/nmi.h into linux/nmi.h because some sparc-specific
> functions are needed in sparc-specific code which includes
> only linux/nmi.h.
>
> The situation became more complicated after the commit 05a4a95279311c3
> ("kernel/watchdog: split up config options") and commit 2104180a53698df5
> ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1.
> They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc
> specific hardlockup detector. It was compatible with the perf one
> regarding the general boot, sysctl, and programming interfaces.
>
> HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of
> HAVE_NMI_WATCHDOG. It made some sense because all arch-specific
> detectors had some common requirements, namely:
>
>   + implemented arch_touch_nmi_watchdog()
>   + included asm/nmi.h into linux/nmi.h
>   + defined the default value for /proc/sys/kernel/nmi_watchdog
>
> But it actually has made things pretty complicated when the generic
> buddy hardlockup detector was added. Before, the generic perf detector
> was never supported together with an arch-specific one. But the buddy
> detector could work on any SMP system. It means that an architecture
> could support both the arch-specific and buddy detector.
>
> As a result, there are a few tricky dependencies. For example,
> CONFIG_HARDLOCKUP_DETECTOR depends on:
>
>   ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && 
> !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH
>
> The problem is that the very special sparc implementation is defined as:
>
>   HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH
>
> Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear
> without reading and understanding the history.
>
> Make the logic less tricky and more self-explanatory by making
> HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to
> HAVE_HARDLOCKUP_DETECTOR_SPARC64.
>
> Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF,
> and HARDLOCKUP_DETECTOR_BUDDY may conflict only with
> HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR
> and it is no longer enabled when HAVE_NMI_WATCHDOG is set.
>
> Signed-off-by: Petr Mladek 
>
> watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64
>
> The configuration variable HAVE_NMI_WATCHDOG has a generic name but
> it is selected only for SPARC64.
>
> It should _not_ be used in general because it is not integrated with
> the other hardlockup detectors. Namely, it does not support the hardlockup
> specific command line parameters and systcl interface. Instead, it is
> enabled/disabled together with the softlockup detector by the global
> "watchdog" sysctl.
>
> Rename it to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 to make the special
> behavior more clear.
>
> Also the variable is set only on sparc64. Move the definition
> from arch/Kconfig to arch/sparc/Kconfig.debug.
>
> Signed-off-by: Petr Mladek 

I think you goofed up when squashing the patches. You've now got a
second patch subject after your first Signed-off-by and 

Re: [PATCH v2 2/6] watchdog/hardlockup: Make the config checks more straightforward

2023-06-16 Thread Doug Anderson
Hi,

On Fri, Jun 16, 2023 at 8:07 AM Petr Mladek  wrote:
>
> There are four possible variants of hardlockup detectors:
>
>   + buddy: available when SMP is set.
>
>   + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.
>
>   + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
>
>   + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
> and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> The check for the sparc64 variant is more complicated because
> HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific
> and sparc64 specific variant. Therefore it is automatically
> selected with HAVE_HARDLOCKUP_DETECTOR_ARCH.
>
> This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH.
> It reduces the size of some checks but it makes them harder to follow.
>
> Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH
> is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global
> HARDLOCKUP_DETECTOR switch is enabled/disabled.
>
> Make the logic more straightforward by the following changes:
>
>   + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and
> HAVE_NMI_WATCHDOG in comments.
>
>   + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate
> HAVE_* for all four hardlockup detector variants.
>
> Use it in the other conditions instead of SMP. It makes it
> clear that it is about the buddy detector.
>
>   + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR
> and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand
> the conditions between the four hardlockup detector variants.
>
>   + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY
> can be enabled. It explains the dependency on the other
> hardlockup detector variants.
>
> Also it allows removing HARDLOCKUP_DETECTOR_NON_ARCH by using "imply".
> It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when
> the global HARDLOCKUP_DETECTOR switch is changed.
>
>   + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables
> disappear when the hardlockup detectors are disabled.
>
> Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY
> value is not preserved when the global switch is disabled.
> The user has to make the decision again when it gets re-enabled.
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/Kconfig  | 23 +-
>  lib/Kconfig.debug | 62 +++
>  2 files changed, 53 insertions(+), 32 deletions(-)

While I'd still paint the bikeshed a different color and organize the
dependencies a little differently, as discussed in your v1 post, this
is still OK w/ me.

Reviewed-by: Douglas Anderson 


Re: [PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-16 Thread Doug Anderson
Hi,

On Fri, Jun 16, 2023 at 8:06 AM Petr Mladek  wrote:
>
> There are four possible variants of hardlockup detectors:
>
>   + buddy: available when SMP is set.
>
>   + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.
>
>   + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
>
>   + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
> and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> Only one hardlockup detector can be compiled in. The selection is done
> using quite complex dependencies between several CONFIG variables.
> The following patches will try to make it more straightforward.
>
> As a first step, reorder the definitions of the various CONFIG variables.
> The logical order is:
>
>1. HAVE_* variables define available variants. They are typically
>   defined in the arch/ config files.
>
>2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
>   detector is enabled at all.
>
>3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
>   the buddy detector should be preferred over the perf one.
>   Note that the arch specific variants are always preferred when
>   available.
>
>4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
>   detector is enabled in the end.
>
>5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
>   are temporary variables that are going to be removed in
>   a followup patch.
>
> This is a preparation step for further cleanup. It will change the logic
> without shuffling the definitions.
>
> This change temporarily breaks the C-like ordering where the variables are
> declared or defined before they are used. It is not really needed for
> Kconfig. Also the following patches will rework the logic so that
> the ordering will be C-like in the end.
>
> The patch just shuffles the definitions. It should not change the existing
> behavior.
>
> Signed-off-by: Petr Mladek 
> ---
>  lib/Kconfig.debug | 112 +++---
>  1 file changed, 56 insertions(+), 56 deletions(-)

Reviewed-by: Douglas Anderson 


Re: [PATCH v2 04/12] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-06-16 Thread Edgecombe, Rick P
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
> -void *module_alloc(unsigned long size)
> -{
> -   gfp_t gfp_mask = GFP_KERNEL;
> -   void *p;
> -
> -   if (PAGE_ALIGN(size) > MODULES_LEN)
> -   return NULL;
> +static struct execmem_params execmem_params = {
> +   .modules = {
> +   .flags = EXECMEM_KASAN_SHADOW,
> +   .text = {
> +   .alignment = MODULE_ALIGN,
> +   },
> +   },
> +};

Did you consider making these execmem_params's ro_after_init? Not that
it is security sensitive, but it's a nice hint to the reader that it is
only modified at init. And I guess basically free sanitizing of buggy
writes to it.
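
I.e. something like this (sketch, assuming nothing needs to write to it
after init):

static struct execmem_params execmem_params __ro_after_init = {
	.modules = {
		.flags = EXECMEM_KASAN_SHADOW,
		.text = {
			.alignment = MODULE_ALIGN,
		},
	},
};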

>  
> -   p = __vmalloc_node_range(size, MODULE_ALIGN,
> -    MODULES_VADDR +
> get_module_load_offset(),
> -    MODULES_END, gfp_mask, PAGE_KERNEL,
> -    VM_FLUSH_RESET_PERMS |
> VM_DEFER_KMEMLEAK,
> -    NUMA_NO_NODE,
> __builtin_return_address(0));
> +struct execmem_params __init *execmem_arch_params(void)
> +{
> +   unsigned long start = MODULES_VADDR +
> get_module_load_offset();

I think we can drop the mutex's in get_module_load_offset() now, since
execmem_arch_params() should only be called once at init.
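
Sketch of what that could look like, assuming there really is exactly
one init-time caller and keeping the existing kaslr_enabled() check and
the 1..1024 page range:

static unsigned long __init get_module_load_offset(void)
{
	/* Single caller at init: no lock and no cached value needed. */
	if (kaslr_enabled())
		return get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
	return 0;
}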

>  
> -   if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0))
> {
> -   vfree(p);
> -   return NULL;
> -   }
> +   execmem_params.modules.text.start = start;
> +   execmem_params.modules.text.end = MODULES_END;
> +   execmem_params.modules.text.pgprot = PAGE_KERNEL;
>  
> -   return p;
> +   return _params;
>  }
>  



Re: [PATCH v2 01/12] nios2: define virtual address space for modules

2023-06-16 Thread Edgecombe, Rick P
On Fri, 2023-06-16 at 11:50 +0300, Mike Rapoport wrote:
>  void *module_alloc(unsigned long size)
>  {
> -   if (size == 0)
> -   return NULL;
> -   return kmalloc(size, GFP_KERNEL);
> -}
> -
> -/* Free memory returned from module_alloc */
> -void module_memfree(void *module_region)
> -{
> -   kfree(module_region);
> +   return __vmalloc_node_range(size, 1, MODULES_VADDR,
> MODULES_END,
> +   GFP_KERNEL, PAGE_KERNEL_EXEC,
> +   VM_FLUSH_RESET_PERMS,
> NUMA_NO_NODE,
> +   __builtin_return_address(0));
>  }
>  
>  int apply_relocate_add(Elf32_Shdr *sechdrs, const char *s

I wonder if the (size == 0) check is really needed, but
__vmalloc_node_range() will WARN on this case where the old code won't.
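
If the silent-NULL behaviour were worth preserving, the guard is cheap
(sketch, based on the replacement above):

void *module_alloc(unsigned long size)
{
	if (size == 0)
		return NULL;	/* the old kmalloc() path returned NULL silently */
	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}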


Re: [PATCH v2 0/6] watchdog/hardlockup: Cleanup configuration of hardlockup detectors

2023-06-16 Thread Petr Mladek
On Fri 2023-06-16 17:06:12, Petr Mladek wrote:
> Hi,
> 
> this patchset is supposed to replace the last patch in the patchset cleaning
> up after introducing the buddy detector, see
> https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid
> 
> Changes against v1:
> 
>   + Better explained the C-like ordering in the 1st patch.
> 
>   + Squashed patches for splitting and renaming HAVE_NMI_WATCHDOG,
> updated commit message with the history and more facts.
> 
>   + Updated comments about the sparc64 variant. It is not handled together
with the softlockup detector. In fact, it is always built. And it even
> used to be always enabled until the commit 7a5c8b57cec93196b ("sparc:
> implement watchdog_nmi_enable and watchdog_nmi_disable") added in
> v4.10-rc1.
> 
> I realized this when updating the comment for the 4th patch. My original
> statement in v1 patchset was based on code reading. I looked at it from
> a bad side.
> 
>   + Removed superfluous "default n"
>   + Fixed typos.

I sometimes find the diff between the two versions useful. Especially when the
patches are not trivial and the last version made only cosmetic
changes.

This is what I got by comparing "git format-patch" generated patchsets:

diff -purN 
watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/-cover-letter.patch
 
watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/-cover-letter.patch
--- 
watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/-cover-letter.patch
2023-06-16 16:42:07.769941775 +0200
+++ 
watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/-cover-letter.patch
  2023-06-16 16:39:42.179877676 +0200
@@ -1,9 +1,33 @@
-From 0456ed568d98ba5bba8148e4f60d769e3c5a6c7a Mon Sep 17 00:00:00 2001
+From bcf4dfab5a64ee691eb5154b1361ed59610c9387 Mon Sep 17 00:00:00 2001
 From: Petr Mladek 
-Date: Fri, 16 Jun 2023 16:42:07 +0200
-Subject: [PATCH 0/6] *** SUBJECT HERE ***
+Date: Fri, 16 Jun 2023 16:28:13 +0200
+Subject: [PATCH v2 0/6] watchdog/hardlockup: Cleanup configuration of 
hardlockup detectors
 
-*** BLURB HERE ***
+Hi,
+
+this patchset is supposed to replace the last patch in the patchset cleaning
+up after introducing the buddy detector, see
+https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid
+
+Changes against v1:
+
+  + Better explained the C-like ordering in the 1st patch.
+
+  + Squashed patches for splitting and renaming HAVE_NMI_WATCHDOG,
+updated commit message with the history and more facts.
+
+  + Updated comments about the sparc64 variant. It is not handled together
+with the softlockup detector. In fact, it is always built. And it even
+used to be always enabled until the commit 7a5c8b57cec93196b ("sparc:
+implement watchdog_nmi_enable and watchdog_nmi_disable") added in
+v4.10-rc1.
+
+I realized this when updating the comment for the 4th patch. My original
+statement in v1 patchset was based on code reading. I looked at it from
+a bad side.
+
+  + Removed superfluous "default n"
+  + Fixed typos.
 
 Petr Mladek (6):
   watchdog/hardlockup: Sort hardlockup detector related config values a
@@ -19,12 +43,12 @@ Petr Mladek (6):
  arch/powerpc/Kconfig   |   5 +-
  arch/powerpc/include/asm/nmi.h |   2 -
  arch/sparc/Kconfig |   2 +-
- arch/sparc/Kconfig.debug   |  20 ++
+ arch/sparc/Kconfig.debug   |  14 
  arch/sparc/include/asm/nmi.h   |   1 -
  include/linux/nmi.h|  14 ++--
  kernel/watchdog.c  |   2 +-
- lib/Kconfig.debug  | 115 +++--
- 9 files changed, 104 insertions(+), 74 deletions(-)
+ lib/Kconfig.debug  | 114 ++---
+ 9 files changed, 97 insertions(+), 74 deletions(-)
 
 -- 
 2.35.3
diff -purN 
watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch
 
watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch
--- 
watchdog-buddy-hardlockup-detector-config-cleanup-v1-iter1-reference/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch
2023-06-16 16:42:07.741941379 +0200
+++ 
watchdog-buddy-hardlockup-detector-config-cleanup-v2-iter1/0001-watchdog-hardlockup-Sort-hardlockup-detector-related.patch
  2023-06-16 16:28:53.594682369 +0200
@@ -1,7 +1,7 @@
-From 9d643e4254b224d22dd1411c51386ab686c052a7 Mon Sep 17 00:00:00 2001
+From bd7019bff3a28fb0bc163308101118adccf699d3 Mon Sep 17 00:00:00 2001
 From: Petr Mladek 
 Date: Thu, 1 Jun 2023 15:35:09 +0200
-Subject: [PATCH 1/6] watchdog/hardlockup: Sort hardlockup detector related
+Subject: [PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related
  config values a logical way
 
 There are four possible variants of hardlockup detectors:
@@ -40,6 +40,14 @@ 

[PATCH v2 6/6] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-06-16 Thread Petr Mladek
The HAVE_ prefix means that the code could be enabled. Add another
variable for HAVE_HARDLOCKUP_DETECTOR_ARCH without this prefix.
It will be set when it should be built. It will make it compatible
with the other hardlockup detectors.

The change allows cleaning up the dependencies of PPC_WATCHDOG
and HAVE_HARDLOCKUP_DETECTOR_PERF definitions for powerpc.

As a result HAVE_HARDLOCKUP_DETECTOR_PERF has the same dependencies
on arm, x86, powerpc architectures.

Signed-off-by: Petr Mladek 
Reviewed-by: Douglas Anderson 
---
 arch/powerpc/Kconfig | 5 ++---
 include/linux/nmi.h  | 2 +-
 lib/Kconfig.debug| 9 +
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 539d1f03ff42..987e730740d7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -90,8 +90,7 @@ config NMI_IPI
 
 config PPC_WATCHDOG
bool
-   depends on HARDLOCKUP_DETECTOR
-   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
+   depends on HARDLOCKUP_DETECTOR_ARCH
default y
help
  This is a placeholder when the powerpc hardlockup detector
@@ -240,7 +239,7 @@ config PPC
select HAVE_GCC_PLUGINS if GCC_VERSION >= 50200   # 
plugin support on gcc <= 5.1 is buggy on PPC
select HAVE_GENERIC_VDSO
select HAVE_HARDLOCKUP_DETECTOR_ARCHif PPC_BOOK3S_64 && SMP
-   select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
+   select HAVE_HARDLOCKUP_DETECTOR_PERFif PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI
select HAVE_HW_BREAKPOINT   if PERF_EVENTS && (PPC_BOOK3S 
|| PPC_8xx)
select HAVE_IOREMAP_PROT
select HAVE_IRQ_TIME_ACCOUNTING
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 515d6724f469..ec808ebd36ba 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -9,7 +9,7 @@
 #include 
 
 /* Arch specific watchdogs might need to share extra watchdog-related APIs. */
-#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_ARCH) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 #include 
 #endif
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index f285e9cf967a..2c4bb72e72ad 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1056,6 +1056,7 @@ config HARDLOCKUP_DETECTOR
depends on HAVE_HARDLOCKUP_DETECTOR_PERF || 
HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
imply HARDLOCKUP_DETECTOR_PERF
imply HARDLOCKUP_DETECTOR_BUDDY
+   imply HARDLOCKUP_DETECTOR_ARCH
select LOCKUP_DETECTOR
 
help
@@ -1101,6 +1102,14 @@ config HARDLOCKUP_DETECTOR_BUDDY
depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
 
+config HARDLOCKUP_DETECTOR_ARCH
+   bool
+   depends on HARDLOCKUP_DETECTOR
+   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
+   help
+ The arch-specific implementation of the hardlockup detector will
+ be used.
+
 #
 # Both the "perf" and "buddy" hardlockup detectors count hrtimer
 # interrupts. This config enables functions managing this common code.
-- 
2.35.3



[PATCH v2 5/6] watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64

2023-06-16 Thread Petr Mladek
The HAVE_ prefix means that the code could be enabled. Add another
variable for HAVE_HARDLOCKUP_DETECTOR_SPARC64 without this prefix.
It will be set when it should be built. It will make it compatible
with the other hardlockup detectors.

Before, it is far from obvious that the SPARC64 variant is actually used:

$> make ARCH=sparc64 defconfig
$> grep HARDLOCKUP_DETECTOR .config
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y

After, it is more clear:

$> make ARCH=sparc64 defconfig
$> grep HARDLOCKUP_DETECTOR .config
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
CONFIG_HARDLOCKUP_DETECTOR_SPARC64=y

Signed-off-by: Petr Mladek 
Reviewed-by: Douglas Anderson 
---
 arch/sparc/Kconfig.debug | 7 ++-
 include/linux/nmi.h  | 4 ++--
 kernel/watchdog.c| 2 +-
 lib/Kconfig.debug| 2 +-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/Kconfig.debug b/arch/sparc/Kconfig.debug
index 4903b6847e43..37e003665de6 100644
--- a/arch/sparc/Kconfig.debug
+++ b/arch/sparc/Kconfig.debug
@@ -16,10 +16,15 @@ config FRAME_POINTER
default y
 
 config HAVE_HARDLOCKUP_DETECTOR_SPARC64
-   depends on HAVE_NMI
bool
+   depends on HAVE_NMI
+   select HARDLOCKUP_DETECTOR_SPARC64
help
  Sparc64 hardlockup detector is the last one developed before adding
  the common infrastructure for handling hardlockup detectors. It is
  always built. It does _not_ use the common command line parameters
  and sysctl interface, except for /proc/sys/kernel/nmi_watchdog.
+
+config HARDLOCKUP_DETECTOR_SPARC64
+   bool
+   depends on HAVE_HARDLOCKUP_DETECTOR_SPARC64
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 7ee6c35d1f05..515d6724f469 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -9,7 +9,7 @@
 #include 
 
 /* Arch specific watchdogs might need to share extra watchdog-related APIs. */
-#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || 
defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 #include 
 #endif
 
@@ -92,7 +92,7 @@ static inline void hardlockup_detector_disable(void) {}
 #endif
 
 /* Sparc64 has a special implementation that is always enabled. */
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || 
defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 void arch_touch_nmi_watchdog(void);
 #else
 static inline void arch_touch_nmi_watchdog(void) { }
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index babd2f3c8b72..a2154e753cb4 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -29,7 +29,7 @@
 
 static DEFINE_MUTEX(watchdog_mutex);
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR) || 
defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
 # define WATCHDOG_HARDLOCKUP_DEFAULT   1
 #else
 # define WATCHDOG_HARDLOCKUP_DEFAULT   0
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e94664339e28..f285e9cf967a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1052,7 +1052,7 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
 #
 config HARDLOCKUP_DETECTOR
bool "Detect Hard Lockups"
-   depends on DEBUG_KERNEL && !S390 && !HAVE_HARDLOCKUP_DETECTOR_SPARC64
+   depends on DEBUG_KERNEL && !S390 && !HARDLOCKUP_DETECTOR_SPARC64
depends on HAVE_HARDLOCKUP_DETECTOR_PERF || 
HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
imply HARDLOCKUP_DETECTOR_PERF
imply HARDLOCKUP_DETECTOR_BUDDY
-- 
2.35.3



[PATCH v2 4/6] watchdog/hardlockup: Make HAVE_NMI_WATCHDOG sparc64-specific

2023-06-16 Thread Petr Mladek
There are several hardlockup detector implementations and several Kconfig
values which allow selection and build of the preferred one.

CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb
("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36.
It was a preparation step for introducing the new generic perf hardlockup
detector.

The existing arch-specific variants did not support the to-be-created
generic build configurations, sysctl interface, etc. This distinction
was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog:
Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR")
in v2.6.38.

CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105
("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced
the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used
by three architectures, namely blackfin, mn10300, and sparc.

The support for the blackfin and mn10300 architectures was completely
dropped some time ago, and sparc is now the only architecture with the
historic NMI watchdog.

And the old sparc implementation is really special. It is always built on
sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b
("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added
in v4.10-rc1.

There are only a few locations where the sparc64 NMI watchdog interacts
with the generic hardlockup detectors code:

  + implements arch_touch_nmi_watchdog() which is called from the generic
touch_nmi_watchdog()

  + implements watchdog_hardlockup_enable()/disable() to support
/proc/sys/kernel/nmi_watchdog

  + is always preferred over other generic watchdogs, see
CONFIG_HARDLOCKUP_DETECTOR

  + includes asm/nmi.h into linux/nmi.h because some sparc-specific
functions are needed in sparc-specific code which includes
only linux/nmi.h.
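
The first point, for example, corresponds to roughly this pattern in the
generic header (a simplified sketch of include/linux/nmi.h, not an exact
copy):

static inline void touch_nmi_watchdog(void)
{
	/*
	 * No-op unless an arch-specific or the sparc64 hardlockup
	 * detector is built in.
	 */
	arch_touch_nmi_watchdog();
	touch_softlockup_watchdog();
}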

The situation became more complicated after the commit 05a4a95279311c3
("kernel/watchdog: split up config options") and commit 2104180a53698df5
("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1.
They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc
specific hardlockup detector. It was compatible with the perf one
regarding the general boot, sysctl, and programming interfaces.

HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of
HAVE_NMI_WATCHDOG. It made some sense because all arch-specific
detectors had some common requirements, namely:

  + implemented arch_touch_nmi_watchdog()
  + included asm/nmi.h into linux/nmi.h
  + defined the default value for /proc/sys/kernel/nmi_watchdog

But it actually made things pretty complicated when the generic buddy
hardlockup detector was added. Before, the generic perf detector was never
supported together with an arch-specific one. But the buddy detector can
work on any SMP system, which means that an architecture could support
both the arch-specific and the buddy detector.

As a result, there are a few tricky dependencies. For example,
CONFIG_HARDLOCKUP_DETECTOR depends on:

  ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH

The problem is that the very special sparc implementation is defined as:

  HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH

Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear
without understanding the history.

Make the logic less tricky and more self-explanatory by making
HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to
HAVE_HARDLOCKUP_DETECTOR_SPARC64.

Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF,
and HARDLOCKUP_DETECTOR_BUDDY may conflict only with
HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR
and it is no longer enabled when HAVE_NMI_WATCHDOG is set.
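
In C terms, the rename turns an inferred condition into a named one
(an illustrative sketch, not a hunk from this patch):

/* before: the sparc64 case had to be derived from two symbols */
#if defined(CONFIG_HAVE_NMI_WATCHDOG) && \
    !defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH)
/* sparc64-only code */
#endif

/* after: the sparc64 case is named directly */
#ifdef CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64
/* sparc64-only code */
#endif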

Signed-off-by: Petr Mladek 

watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_DETECTOR_SPARC64

The configuration variable HAVE_NMI_WATCHDOG has a generic name but
it is selected only for SPARC64.

It should _not_ be used in general because it is not integrated with
the other hardlockup detectors. Namely, it does not support the hardlockup
specific command line parameters and sysctl interface. Instead, it is
enabled/disabled together with the softlockup detector by the global
"watchdog" sysctl.

Rename it to HAVE_HARDLOCKUP_DETECTOR_SPARC64 to make the special
behavior clearer.

Also the variable is set only on sparc64. Move the definition
from arch/Kconfig to arch/sparc/Kconfig.debug.

Signed-off-by: Petr Mladek 
---
 arch/Kconfig | 18 --
 arch/sparc/Kconfig   |  2 +-
 arch/sparc/Kconfig.debug |  9 +
 include/linux/nmi.h  |  5 ++---
 kernel/watchdog.c|  2 +-
 lib/Kconfig.debug| 15 +--
 6 files changed, 18 insertions(+), 33 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 

[PATCH v2 3/6] watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h

2023-06-16 Thread Petr Mladek
arch_touch_nmi_watchdog() needs a different implementation for various
hardlockup detector implementations. And it does nothing when
any hardlockup detector is not built at all.

arch_touch_nmi_watchdog() is declared via linux/nmi.h. And it must be
defined as an empty function when there is no hardlockup detector.
It is done directly in this header file for the perf and buddy detectors.
And it is done in the included asm/nmi.h for arch-specific detectors.

The reason probably is that the arch-specific variants build the code
under different conditions. For example, powerpc64 builds the code only
when CONFIG_PPC_WATCHDOG is enabled.

Another reason might be that these architectures define more functions
in asm/nmi.h anyway.

However, the generic code actually knows when the function will be
implemented: when a full-featured or the sparc64-specific
hardlockup detector is built.

In particular, CONFIG_HARDLOCKUP_DETECTOR can be enabled only when
a generic or arch-specific full-featured hardlockup detector is available.
The only exception is sparc64, which can be built even when the global
HARDLOCKUP_DETECTOR switch is disabled.

The information about sparc64 is a bit complicated. The hardlockup
detector is built there when CONFIG_HAVE_NMI_WATCHDOG is set and
CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

People might wonder whether this change really makes things easier.
The motivation is:

  + The current logic in linux/nmi.h is far from obvious.
For example, arch_touch_nmi_watchdog() is defined as {} when
neither CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER nor
CONFIG_HAVE_NMI_WATCHDOG is defined.

  + The change synchronizes the checks in lib/Kconfig.debug and
in the generic code.

  + It is a step that will help cleaning HAVE_NMI_WATCHDOG related
checks.

The change should not change the existing behavior.
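
As a reminder of why a definition must exist in every configuration:
any caller of touch_nmi_watchdog() ends up reaching the arch hook, as in
this contrived sketch (slow_work_item() is a made-up helper):

#include <linux/nmi.h>

static void flush_everything(void)
{
	int i;

	for (i = 0; i < 1000000; i++) {
		slow_work_item(i);	/* hypothetical long-running step */
		touch_nmi_watchdog();	/* calls arch_touch_nmi_watchdog(),
					 * which must compile to an empty
					 * function when no hardlockup
					 * detector is built */
	}
}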

Signed-off-by: Petr Mladek 
Reviewed-by: Douglas Anderson 
---
 arch/powerpc/include/asm/nmi.h |  2 --
 arch/sparc/include/asm/nmi.h   |  1 -
 include/linux/nmi.h| 13 ++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/nmi.h b/arch/powerpc/include/asm/nmi.h
index 43bfd4de868f..ce25318c3902 100644
--- a/arch/powerpc/include/asm/nmi.h
+++ b/arch/powerpc/include/asm/nmi.h
@@ -3,11 +3,9 @@
 #define _ASM_NMI_H
 
 #ifdef CONFIG_PPC_WATCHDOG
-extern void arch_touch_nmi_watchdog(void);
 long soft_nmi_interrupt(struct pt_regs *regs);
 void watchdog_hardlockup_set_timeout_pct(u64 pct);
 #else
-static inline void arch_touch_nmi_watchdog(void) {}
 static inline void watchdog_hardlockup_set_timeout_pct(u64 pct) {}
 #endif
 
diff --git a/arch/sparc/include/asm/nmi.h b/arch/sparc/include/asm/nmi.h
index 90ee7863d9fe..920dc23f443f 100644
--- a/arch/sparc/include/asm/nmi.h
+++ b/arch/sparc/include/asm/nmi.h
@@ -8,7 +8,6 @@ void nmi_adjust_hz(unsigned int new_hz);
 
 extern atomic_t nmi_active;
 
-void arch_touch_nmi_watchdog(void);
 void start_nmi_watchdog(void *unused);
 void stop_nmi_watchdog(void *unused);
 
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index b5d0b7ab52fb..b9e816bde14a 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -7,6 +7,8 @@
 
 #include 
 #include 
+
+/* Arch specific watchdogs might need to share extra watchdog-related APIs. */
 #if defined(CONFIG_HAVE_NMI_WATCHDOG)
 #include 
 #endif
@@ -89,12 +91,17 @@ extern unsigned int hardlockup_panic;
 static inline void hardlockup_detector_disable(void) {}
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
+/* Sparc64 has special implementation that is always enabled. */
+#if defined(CONFIG_HARDLOCKUP_DETECTOR) || \
+(defined(CONFIG_HAVE_NMI_WATCHDOG) && !defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH))
 void arch_touch_nmi_watchdog(void);
+#else
+static inline void arch_touch_nmi_watchdog(void) { }
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
 void watchdog_hardlockup_touch_cpu(unsigned int cpu);
 void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs);
-#elif !defined(CONFIG_HAVE_NMI_WATCHDOG)
-static inline void arch_touch_nmi_watchdog(void) { }
 #endif
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
-- 
2.35.3



[PATCH v2 2/6] watchdog/hardlockup: Make the config checks more straightforward

2023-06-16 Thread Petr Mladek
There are four possible variants of hardlockup detectors:

  + buddy: available when SMP is set.

  + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.

  + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.

  + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

The check for the sparc64 variant is more complicated because
HAVE_NMI_WATCHDOG is used to #ifdef code shared by both the arch-specific
and the sparc64-specific variants. Therefore it is automatically
selected with HAVE_HARDLOCKUP_DETECTOR_ARCH.

This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH.
It reduces the size of some checks but it makes them harder to follow.

Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH
is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global
HARDLOCKUP_DETECTOR switch is enabled/disabled.

Make the logic more straightforward by the following changes:

  + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and
HAVE_NMI_WATCHDOG in comments.

  + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate
HAVE_* for all four hardlockup detector variants.

Use it in the other conditions instead of SMP. It makes it
clear that it is about the buddy detector.

  + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR
and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand
the conditions between the four hardlockup detector variants.

  + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY
can be enabled. It explains the dependency on the other
hardlockup detector variants.

Also, it allows removing HARDLOCKUP_DETECTOR_NON_ARCH by using "imply".
It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when
the global HARDLOCKUP_DETECTOR switch is changed.

  + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables
disappear when the hardlockup detectors are disabled.

Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY
value is not preserved when the global switch is disabled.
The user has to make the decision again when it gets re-enabled.
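
The intended outcome of these rules can be summarized as follows
(a stand-alone C sketch of the selection logic, not kernel code; it also
folds in the sparc64 special case listed above):

#include <stdbool.h>

enum hld_variant { HLD_NONE, HLD_ARCH, HLD_SPARC64, HLD_PERF, HLD_BUDDY };

static enum hld_variant pick_detector(bool have_arch, bool have_sparc64,
				      bool have_perf, bool have_buddy,
				      bool prefer_buddy)
{
	if (have_arch)
		return HLD_ARCH;	/* arch variant is always preferred */
	if (have_sparc64)
		return HLD_SPARC64;	/* special case, always built */
	if (have_perf && !prefer_buddy)
		return HLD_PERF;
	if (have_buddy)
		return HLD_BUDDY;
	return HLD_NONE;
}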

Signed-off-by: Petr Mladek 
---
 arch/Kconfig  | 23 +-
 lib/Kconfig.debug | 62 +++
 2 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 422f0ffa269e..77e5af5fda3f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -404,17 +404,28 @@ config HAVE_NMI_WATCHDOG
depends on HAVE_NMI
bool
help
- The arch provides a low level NMI watchdog. It provides
- asm/nmi.h, and defines its own watchdog_hardlockup_probe() and
- arch_touch_nmi_watchdog().
+ The arch provides its own hardlockup detector implementation instead
+ of the generic ones.
+
+ Sparc64 defines this variable without HAVE_HARDLOCKUP_DETECTOR_ARCH.
+ It is the last arch-specific implementation which was developed before
+ adding the common infrastructure for handling hardlockup detectors.
+ It is always built. It does _not_ use the common command line
+ parameters and sysctl interface, except for
+ /proc/sys/kernel/nmi_watchdog.
 
 config HAVE_HARDLOCKUP_DETECTOR_ARCH
bool
select HAVE_NMI_WATCHDOG
help
- The arch chooses to provide its own hardlockup detector, which is
- a superset of the HAVE_NMI_WATCHDOG. It also conforms to config
- interfaces and parameters provided by hardlockup detector subsystem.
+ The arch provides its own hardlockup detector implementation instead
+ of the generic ones.
+
+ It uses the same command line parameters, and sysctl interface,
+ as the generic hardlockup detectors.
+
+ HAVE_NMI_WATCHDOG is selected to build the code shared with
+ the sparc64 specific implementation.
 
 config HAVE_PERF_REGS
bool
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3e91fa33c7a0..a0b0c4decb89 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1035,16 +1035,33 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 
  Say N if unsure.
 
+config HAVE_HARDLOCKUP_DETECTOR_BUDDY
+   bool
+   depends on SMP
+   default y
+
 #
-# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
-# lockup detector rather than the perf based detector.
+# Global switch whether to build a hardlockup detector at all. It is available
+# only when the architecture supports at least one implementation. There are
+# two exceptions. The hardlockup detector is never enabled on:
+#
+#  s390: it reported many false positives there
+#
+#  sparc64: has a custom implementation which is not using the common
+#  hardlockup command line options and sysctl interface.
+#
+# Note that HAVE_NMI_WATCHDOG is used to distinguish the 

[PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-16 Thread Petr Mladek
There are four possible variants of hardlockup detectors:

  + buddy: available when SMP is set.

  + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.

  + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.

  + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.

Only one hardlockup detector can be compiled in. The selection is done
using quite complex dependencies between several CONFIG variables.
The following patches will try to make it more straightforward.

As a first step, reorder the definitions of the various CONFIG variables.
The logical order is:

   1. HAVE_* variables define available variants. They are typically
  defined in the arch/ config files.

   2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
  detector is enabled at all.

   3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
  the buddy detector should be preferred over the perf one.
  Note that the arch specific variants are always preferred when
  available.

   4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
  detector is enabled in the end.

   5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
  are temporary variables that are going to be removed in
  a followup patch.

This is a preparation step for further cleanup. It will change the logic
without shuffling the definitions.

This change temporarily breaks the C-like ordering where the variables are
declared or defined before they are used. Such ordering is not really
needed for Kconfig. Also, the following patches will rework the logic so
that the ordering will be C-like in the end.

The patch just shuffles the definitions. It should not change the existing
behavior.

Signed-off-by: Petr Mladek 
---
 lib/Kconfig.debug | 112 +++---
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ed7b01c4bd41..3e91fa33c7a0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1035,62 +1035,6 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 
  Say N if unsure.
 
-# Both the "perf" and "buddy" hardlockup detectors count hrtimer
-# interrupts. This config enables functions managing this common code.
-config HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
-   bool
-   select SOFTLOCKUP_DETECTOR
-
-config HARDLOCKUP_DETECTOR_PERF
-   bool
-   depends on HAVE_HARDLOCKUP_DETECTOR_PERF
-   select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
-
-config HARDLOCKUP_DETECTOR_BUDDY
-   bool
-   depends on SMP
-   select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
-
-# For hardlockup detectors you can have one directly provided by the arch
-# or use a "non-arch" one. If you're using a "non-arch" one that is
-# further divided the perf hardlockup detector (which, confusingly, needs
-# arch-provided perf support) and the buddy hardlockup detector (which just
-# needs SMP). In either case, using the "non-arch" code conflicts with
-# the NMI watchdog code (which is sometimes used directly and sometimes used
-# by the arch-provided hardlockup detector).
-config HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
-   bool
-   depends on (HAVE_HARDLOCKUP_DETECTOR_PERF || SMP) && !HAVE_NMI_WATCHDOG
-   default y
-
-config HARDLOCKUP_DETECTOR_PREFER_BUDDY
-   bool "Prefer the buddy CPU hardlockup detector"
-   depends on HAVE_HARDLOCKUP_DETECTOR_PERF && SMP
-   help
- Say Y here to prefer the buddy hardlockup detector over the perf one.
-
- With the buddy detector, each CPU uses its softlockup hrtimer
- to check that the next CPU is processing hrtimer interrupts by
- verifying that a counter is increasing.
-
- This hardlockup detector is useful on systems that don't have
- an arch-specific hardlockup detector or if resources needed
- for the hardlockup detector are better used for other things.
-
-# This will select the appropriate non-arch hardlockdup detector
-config HARDLOCKUP_DETECTOR_NON_ARCH
-   bool
-   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
-   select HARDLOCKUP_DETECTOR_BUDDY if !HAVE_HARDLOCKUP_DETECTOR_PERF || HARDLOCKUP_DETECTOR_PREFER_BUDDY
-   select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF && !HARDLOCKUP_DETECTOR_PREFER_BUDDY
-
-#
-# Enables a timestamp based low pass filter to compensate for perf based
-# hard lockup detection which runs too fast due to turbo modes.
-#
-config HARDLOCKUP_CHECK_TIMESTAMP
-   bool
-
 #
 # arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
 # lockup detector rather than the perf based detector.
@@ -,6 +1055,62 @@ config HARDLOCKUP_DETECTOR
  chance to run.  The current stack trace is displayed upon detection
  and the system will stay locked up.
 
+config HARDLOCKUP_DETECTOR_PREFER_BUDDY
+   bool "Prefer the buddy CPU hardlockup detector"

[PATCH v2 0/6] watchdog/hardlockup: Cleanup configuration of hardlockup detectors

2023-06-16 Thread Petr Mladek
Hi,

this patchset is supposed to replace the last patch in the patchset cleaning
up after introducing the buddy detector, see
https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid

Changes against v1:

  + Better explained the C-like ordering in the 1st patch.

  + Squashed patches for splitting and renaming HAVE_NMI_WATCHDOG,
updated commit message with the history and more facts.

  + Updated comments about the sparc64 variant. It is not handled together
with the softlockup detector. In fact, it is always built. And it even
used to be always enabled until the commit 7a5c8b57cec93196b ("sparc:
implement watchdog_nmi_enable and watchdog_nmi_disable") added in
v4.10-rc1.

    I realized this when updating the comment for the 4th patch. My original
    statement in the v1 patchset was based on reading the code, and I looked
    at it from the wrong angle.

  + Removed superfluous "default n"
  + Fixed typos.

Petr Mladek (6):
  watchdog/hardlockup: Sort hardlockup detector related config values a
logical way
  watchdog/hardlockup: Make the config checks more straightforward
  watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in
linux/nmi.h
  watchdog/hardlockup:  Make HAVE_NMI_WATCHDOG sparc64-specific
  watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64
  watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

 arch/Kconfig   |  17 ++---
 arch/powerpc/Kconfig   |   5 +-
 arch/powerpc/include/asm/nmi.h |   2 -
 arch/sparc/Kconfig |   2 +-
 arch/sparc/Kconfig.debug   |  14 
 arch/sparc/include/asm/nmi.h   |   1 -
 include/linux/nmi.h|  14 ++--
 kernel/watchdog.c  |   2 +-
 lib/Kconfig.debug  | 114 ++---
 9 files changed, 97 insertions(+), 74 deletions(-)

-- 
2.35.3



Re: [RFC PATCH v1 3/3] powerpc: WIP draft support to objtool check

2023-06-16 Thread Peter Zijlstra


Few comments..

On Fri, Jun 16, 2023 at 03:47:52PM +0200, Christophe Leroy wrote:

> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 0fcf99c91400..f945fe271706 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -236,6 +236,7 @@ static bool __dead_end_function(struct objtool_file 
> *file, struct symbol *func,
>   "x86_64_start_reservations",
>   "xen_cpu_bringup_again",
>   "xen_start_kernel",
> + "longjmp",
>   };
>  
>   if (!func)
> @@ -2060,13 +2061,12 @@ static int add_jump_table(struct objtool_file *file, 
> struct instruction *insn,
>* instruction.
>*/
>   list_for_each_entry_from(reloc, >sec->reloc_list, list) {
> -
>   /* Check for the end of the table: */
>   if (reloc != table && reloc->jump_table_start)
>   break;
>  
>   /* Make sure the table entries are consecutive: */
> - if (prev_offset && reloc->offset != prev_offset + 8)
> + if (prev_offset && reloc->offset != prev_offset + 4)

Do we want a global variable (from elf.c) called elf_sizeof_long or so?
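
One possible shape for such a helper (a sketch keyed off the ELF class,
not code from the patch):

static inline int elf_sizeof_long(const struct elf *elf)
{
	return elf->ehdr.e_ident[EI_CLASS] == ELFCLASS64 ? 8 : 4;
}

/* ... and the check above would become: */
if (prev_offset && reloc->offset != prev_offset + elf_sizeof_long(file->elf))
	break;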

>   break;
>  
>   /* Detect function pointers from contiguous objects: */
> @@ -2074,7 +2074,10 @@ static int add_jump_table(struct objtool_file *file, 
> struct instruction *insn,
>   reloc->addend == pfunc->offset)
>   break;
>  
> - dest_insn = find_insn(file, reloc->sym->sec, reloc->addend);
> + if (table->jump_table_is_rel)
> + dest_insn = find_insn(file, reloc->sym->sec, 
> reloc->addend + table->offset - reloc->offset);
> + else
> + dest_insn = find_insn(file, reloc->sym->sec, 
> reloc->addend);

offset = reloc->addend;
if (table->jump_table_is_rel)
offset += table->offset - reloc->offset;
dest_insn = find_insn(file, reloc->sym->sec, offset);

perhaps?

>   if (!dest_insn)
>   break;
>  
> @@ -4024,6 +4022,11 @@ static bool ignore_unreachable_insn(struct 
> objtool_file *file, struct instructio
>   if (insn->ignore || insn->type == INSN_NOP || insn->type == INSN_TRAP)
>   return true;
>  
> + /* powerpc relocatable files have a word in front of each relocatable 
> function */
> + if ((file->elf->ehdr.e_machine == EM_PPC || file->elf->ehdr.e_machine 
> == EM_PPC64) &&
> + (file->elf->ehdr.e_flags & EF_PPC_RELOCATABLE_LIB) &&
> + insn_func(next_insn_same_sec(file, insn)))
> + return true;

Can't you simply decode that word to INSN_NOP or so?


[RFC PATCH v1 1/3] Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"

2023-06-16 Thread Christophe Leroy
This reverts commit 1e688dd2a3d6759d416616ff07afc4bb836c4213.

That commit aimed at optimising the code around generation of
WARN_ON/BUG_ON but this leads to a lot of dead code erroneously
generated by GCC.

 text  data bss dec hex filename
  9551585   3627834  224376 13403795 cc8693 vmlinux.before
  9535281   3628358  224376 13388015 cc48ef vmlinux.after

Once this change is reverted, in a standard configuration (pmac32 +
function tracer) the text is reduced by 16k which is around 1.7%

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/kup.h  |  2 +-
 arch/powerpc/include/asm/bug.h| 67 +++
 arch/powerpc/include/asm/extable.h| 14 
 arch/powerpc/include/asm/ppc_asm.h| 11 ++-
 arch/powerpc/kernel/misc_32.S |  2 +-
 arch/powerpc/kernel/traps.c   |  9 +--
 .../powerpc/primitives/asm/extable.h  |  1 -
 7 files changed, 25 insertions(+), 81 deletions(-)
 delete mode 12 tools/testing/selftests/powerpc/primitives/asm/extable.h

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 54cf46808157..c82323b864e1 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -90,7 +90,7 @@
/* Prevent access to userspace using any key values */
LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED)
 999:   tdne\gpr1, \gpr2
-   EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
+   EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67)
 #endif
 .endm
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index ef42adb44aa3..a565995fb742 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -4,14 +4,13 @@
 #ifdef __KERNEL__
 
 #include 
-#include 
 
 #ifdef CONFIG_BUG
 
 #ifdef __ASSEMBLY__
 #include 
 #ifdef CONFIG_DEBUG_BUGVERBOSE
-.macro __EMIT_BUG_ENTRY addr,file,line,flags
+.macro EMIT_BUG_ENTRY addr,file,line,flags
 .section __bug_table,"aw"
 5001:   .4byte \addr - .
 .4byte 5002f - .
@@ -23,7 +22,7 @@
 .previous
 .endm
 #else
-.macro __EMIT_BUG_ENTRY addr,file,line,flags
+.macro EMIT_BUG_ENTRY addr,file,line,flags
 .section __bug_table,"aw"
 5001:   .4byte \addr - .
 .short \flags
@@ -32,18 +31,6 @@
 .endm
 #endif /* verbose */
 
-.macro EMIT_WARN_ENTRY addr,file,line,flags
-   EX_TABLE(\addr,\addr+4)
-   __EMIT_BUG_ENTRY \addr,\file,\line,\flags
-.endm
-
-.macro EMIT_BUG_ENTRY addr,file,line,flags
-   .if \flags & 1 /* BUGFLAG_WARNING */
-   .err /* Use EMIT_WARN_ENTRY for warnings */
-   .endif
-   __EMIT_BUG_ENTRY \addr,\file,\line,\flags
-.endm
-
 #else /* !__ASSEMBLY__ */
 /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and
sizeof(struct bug_entry), respectively */
@@ -73,16 +60,6 @@
  "i" (sizeof(struct bug_entry)),   \
  ##__VA_ARGS__)
 
-#define WARN_ENTRY(insn, flags, label, ...)\
-   asm_volatile_goto(  \
-   "1: " insn "\n" \
-   EX_TABLE(1b, %l[label]) \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), \
- "i" (flags),  \
- "i" (sizeof(struct bug_entry)),   \
- ##__VA_ARGS__ : : label)
-
 /*
  * BUG_ON() and WARN_ON() do their best to cooperate with compile-time
  * optimisations. However depending on the complexity of the condition
@@ -95,16 +72,7 @@
 } while (0)
 #define HAVE_ARCH_BUG
 
-#define __WARN_FLAGS(flags) do {   \
-   __label__ __label_warn_on;  \
-   \
-   WARN_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags), __label_warn_on); 
\
-   barrier_before_unreachable();   \
-   __builtin_unreachable();\
-   \
-__label_warn_on:   \
-   break;  \
-} while (0)
+#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | 
(flags))
 
 #ifdef CONFIG_PPC64
 #define BUG_ON(x) do { \
@@ -117,25 +85,15 @@ __label_warn_on:   
\
 } while (0)
 
 #define WARN_ON(x) ({  \
-   bool __ret_warn_on = false; \
-   do {\
-   if 

[RFC PATCH v1 3/3] powerpc: WIP draft support to objtool check

2023-06-16 Thread Christophe Leroy
This messy draft patch is a first try at adding support for objtool
checks on powerpc. This is in preparation for doing uaccess
validation on powerpc.

For the time being, this is implemented for PPC32 only breaking
support for other targets eventually. Will be reworked to be more
generic once a final working status has been achieved.

All assembly files have been deactivated as they require huge
work and are not really needed at the first place for uaccess validation.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig  |  1 +
 scripts/Makefile.lib  |  2 +-
 tools/objtool/arch/powerpc/decode.c   | 60 +--
 .../arch/powerpc/include/arch/special.h   |  2 +-
 tools/objtool/arch/powerpc/special.c  | 44 +-
 tools/objtool/check.c | 29 +
 tools/objtool/include/objtool/elf.h   |  1 +
 tools/objtool/include/objtool/special.h   |  2 +-
 8 files changed, 118 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 542be1c3c315..3bd244784af1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -259,6 +259,7 @@ config PPC
select HAVE_OPTPROBES
select HAVE_OBJTOOL if PPC32 || MPROFILE_KERNEL
select HAVE_OBJTOOL_MCOUNT  if HAVE_OBJTOOL
+   select HAVE_UACCESS_VALIDATION  if HAVE_OBJTOOL
select HAVE_PERF_EVENTS
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 100a386fcd71..298e2656e911 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -267,7 +267,7 @@ objtool-args-$(CONFIG_RETHUNK)  
+= --rethunk
 objtool-args-$(CONFIG_SLS) += --sls
 objtool-args-$(CONFIG_STACK_VALIDATION)+= --stackval
 objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE) += --static-call
-objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess
+objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess 
--sec-address
 objtool-args-$(CONFIG_GCOV_KERNEL) += --no-unreachable
 objtool-args-$(CONFIG_PREFIX_SYMBOLS)  += 
--prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
 
diff --git a/tools/objtool/arch/powerpc/decode.c 
b/tools/objtool/arch/powerpc/decode.c
index 53b55690f320..e95c0470e34b 100644
--- a/tools/objtool/arch/powerpc/decode.c
+++ b/tools/objtool/arch/powerpc/decode.c
@@ -43,24 +43,72 @@ int arch_decode_instruction(struct objtool_file *file, 
const struct section *sec
unsigned long offset, unsigned int maxlen,
struct instruction *insn)
 {
-   unsigned int opcode;
+   unsigned int opcode, xop;
+   unsigned int rs, ra, rb, bo, bi, to, uimm, l;
enum insn_type typ;
unsigned long imm;
u32 ins;
 
ins = bswap_if_needed(file->elf, *(u32 *)(sec->data->d_buf + offset));
opcode = ins >> 26;
-   typ = INSN_OTHER;
-   imm = 0;
+   xop = (ins >> 1) & 0x3ff;
+   rs = bo = to = (ins >> 21) & 0x1f;
+   ra = bi = (ins >> 16) & 0x1f;
+   rb = (ins >> 11) & 0x1f;
+   uimm = (ins >> 0) & 0x;
+   l = ins & 1;
 
switch (opcode) {
+   case 16: /* bc[l][a] */
+   if (ins & 1)/* bcl[a] */
+   typ = INSN_OTHER;
+   else/* bc[a] */
+   typ = INSN_JUMP_CONDITIONAL;
+
+   imm = ins & 0xfffc;
+   if (imm & 0x8000)
+   imm -= 0x1;
+   imm |= ins & 2; /* AA flag */
+   insn->immediate = imm;
+   break;
case 18: /* b[l][a] */
-   if ((ins & 3) == 1) /* bl */
+   if (ins & 1)/* bl[a] */
typ = INSN_CALL;
+   else/* b[a] */
+   typ = INSN_JUMP_UNCONDITIONAL;
 
imm = ins & 0x3fc;
if (imm & 0x200)
imm -= 0x400;
+   imm |= ins & 2; /* AA flag */
+   insn->immediate = imm;
+   break;
+   case 19:
+   if (xop == 16 && bo == 20 && bi == 0)   /* blr */
+   typ = INSN_RETURN;
+   else if (xop == 50) /* rfi */
+   typ = INSN_JUMP_DYNAMIC;
+   else if (xop == 528 && bo == 20 && bi ==0 && !l)/* bctr 
*/
+   typ = INSN_JUMP_DYNAMIC;
+   else if (xop == 528 && bo == 20 && bi ==0 && l) /* 
bctrl */
+   typ = INSN_CALL_DYNAMIC;
+   else
+   typ = INSN_OTHER;
+   break;
+   case 24:
+   if (rs == 0 && ra == 0 && uimm == 0)
+   typ = INSN_NOP;
+

[RFC PATCH v1 2/3] powerpc: Mark all .S files invalid for objtool

2023-06-16 Thread Christophe Leroy
A lot of work is required in .S files in order to get them ready
for objtool checks.

For the time being, exclude them from the checks.

This is done with the script below:

#!/bin/bash
DIRS=`find arch/powerpc -name "*.S" -exec dirname {} \; | sort | uniq`
for d in $DIRS
do
pushd $d
echo >> Makefile
for f in *.S
do
echo "OBJECT_FILES_NON_STANDARD_$f := y" | sed 
s/"\.S"/".o"/g
done >> Makefile
popd
done

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/boot/Makefile | 17 +
 arch/powerpc/crypto/Makefile   | 13 +++
 arch/powerpc/kernel/Makefile   | 44 ++
 arch/powerpc/kernel/trace/Makefile |  4 ++
 arch/powerpc/kernel/vdso/Makefile  | 11 ++
 arch/powerpc/kexec/Makefile|  2 +
 arch/powerpc/kvm/Makefile  | 13 +++
 arch/powerpc/lib/Makefile  | 25 
 arch/powerpc/mm/book3s32/Makefile  |  3 ++
 arch/powerpc/mm/nohash/Makefile|  3 ++
 arch/powerpc/perf/Makefile |  2 +
 arch/powerpc/platforms/44x/Makefile|  2 +
 arch/powerpc/platforms/52xx/Makefile   |  3 ++
 arch/powerpc/platforms/83xx/Makefile   |  2 +
 arch/powerpc/platforms/cell/spufs/Makefile |  3 ++
 arch/powerpc/platforms/pasemi/Makefile |  2 +
 arch/powerpc/platforms/powermac/Makefile   |  3 ++
 arch/powerpc/platforms/powernv/Makefile|  3 ++
 arch/powerpc/platforms/ps3/Makefile|  2 +
 arch/powerpc/platforms/pseries/Makefile|  2 +
 arch/powerpc/purgatory/Makefile|  3 ++
 arch/powerpc/sysdev/Makefile   |  3 ++
 arch/powerpc/xmon/Makefile |  3 ++
 23 files changed, 168 insertions(+)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 771b79423bbc..c046eb9d341e 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -513,3 +513,20 @@ $(wrapper-installed): $(DESTDIR)$(WRAPPER_BINDIR) 
$(srctree)/$(obj)/wrapper | $(
$(call cmd,install_wrapper)
 
 $(obj)/bootwrapper_install: $(all-installed)
+
+OBJECT_FILES_NON_STANDARD_crt0.o := y
+OBJECT_FILES_NON_STANDARD_crtsavres.o := y
+OBJECT_FILES_NON_STANDARD_div64.o := y
+OBJECT_FILES_NON_STANDARD_fixed-head.o := y
+OBJECT_FILES_NON_STANDARD_gamecube-head.o := y
+OBJECT_FILES_NON_STANDARD_motload-head.o := y
+OBJECT_FILES_NON_STANDARD_opal-calls.o := y
+OBJECT_FILES_NON_STANDARD_ps3-head.o := y
+OBJECT_FILES_NON_STANDARD_ps3-hvcall.o := y
+OBJECT_FILES_NON_STANDARD_pseries-head.o := y
+OBJECT_FILES_NON_STANDARD_string.o := y
+OBJECT_FILES_NON_STANDARD_util.o := y
+OBJECT_FILES_NON_STANDARD_wii-head.o := y
+OBJECT_FILES_NON_STANDARD_zImage.coff.lds.o := y
+OBJECT_FILES_NON_STANDARD_zImage.lds.o := y
+OBJECT_FILES_NON_STANDARD_zImage.ps3.lds.o := y
diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile
index 7b4f516abec1..f0381d137b06 100644
--- a/arch/powerpc/crypto/Makefile
+++ b/arch/powerpc/crypto/Makefile
@@ -34,3 +34,16 @@ $(obj)/aesp10-ppc.S $(obj)/ghashp10-ppc.S: $(obj)/%.S: 
$(src)/%.pl FORCE
 
 OBJECT_FILES_NON_STANDARD_aesp10-ppc.o := y
 OBJECT_FILES_NON_STANDARD_ghashp10-ppc.o := y
+
+OBJECT_FILES_NON_STANDARD_aes-gcm-p10.o := y
+OBJECT_FILES_NON_STANDARD_aes-spe-core.o := y
+OBJECT_FILES_NON_STANDARD_aes-spe-keys.o := y
+OBJECT_FILES_NON_STANDARD_aes-spe-modes.o := y
+OBJECT_FILES_NON_STANDARD_aes-tab-4k.o := y
+OBJECT_FILES_NON_STANDARD_crc32c-vpmsum_asm.o := y
+OBJECT_FILES_NON_STANDARD_crc32-vpmsum_core.o := y
+OBJECT_FILES_NON_STANDARD_crct10dif-vpmsum_asm.o := y
+OBJECT_FILES_NON_STANDARD_md5-asm.o := y
+OBJECT_FILES_NON_STANDARD_sha1-powerpc-asm.o := y
+OBJECT_FILES_NON_STANDARD_sha1-spe-asm.o := y
+OBJECT_FILES_NON_STANDARD_sha256-spe-asm.o := y
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 9bf2be123093..19a2c83645e1 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -229,3 +229,47 @@ $(obj)/vdso64_wrapper.o : $(obj)/vdso/vdso64.so.dbg
 
 # for cleaning
 subdir- += vdso
+
+OBJECT_FILES_NON_STANDARD_85xx_entry_mapping.o := y
+OBJECT_FILES_NON_STANDARD_cpu_setup_44x.o := y
+OBJECT_FILES_NON_STANDARD_cpu_setup_6xx.o := y
+OBJECT_FILES_NON_STANDARD_cpu_setup_e500.o := y
+OBJECT_FILES_NON_STANDARD_cpu_setup_pa6t.o := y
+OBJECT_FILES_NON_STANDARD_cpu_setup_ppc970.o := y
+OBJECT_FILES_NON_STANDARD_entry_32.o := y
+OBJECT_FILES_NON_STANDARD_entry_64.o := y
+OBJECT_FILES_NON_STANDARD_epapr_hcalls.o := y
+OBJECT_FILES_NON_STANDARD_exceptions-64e.o := y
+OBJECT_FILES_NON_STANDARD_exceptions-64s.o := y
+OBJECT_FILES_NON_STANDARD_fpu.o := y
+OBJECT_FILES_NON_STANDARD_head_40x.o := y
+OBJECT_FILES_NON_STANDARD_head_44x.o := y
+OBJECT_FILES_NON_STANDARD_head_64.o := y
+OBJECT_FILES_NON_STANDARD_head_85xx.o := y
+OBJECT_FILES_NON_STANDARD_head_8xx.o := y

[RFC PATCH v1 0/3] powerpc/objtool: First step towards uaccess validation (v1)

2023-06-16 Thread Christophe Leroy
This RFC is a first step towards the validation of userspace accesses.

For the time being it targets only PPC32 and includes hacks directly in
the core part of objtool.

It doesn't yet include handling of uaccess at all but is a first step
to support objtool validation.

Assembly files have been kept aside as they require a huge amount of work
before being ready for objtool validation and are not directly relevant for
uaccess validation.

Please have a look and stop me if I'm going in the wrong direction.

For the few hacks done directly in the core part of objtool, don't
hesitate to suggest ways to make them more generic.

Christophe Leroy (3):
  Revert "powerpc/bug: Provide better flexibility to
WARN_ON/__WARN_FLAGS() with asm goto"
  powerpc: Mark all .S files invalid for objtool
  powerpc: WIP draft support to objtool check

 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/boot/Makefile| 17 +
 arch/powerpc/crypto/Makefile  | 13 
 arch/powerpc/include/asm/book3s/64/kup.h  |  2 +-
 arch/powerpc/include/asm/bug.h| 67 +++
 arch/powerpc/include/asm/extable.h| 14 
 arch/powerpc/include/asm/ppc_asm.h| 11 ++-
 arch/powerpc/kernel/Makefile  | 44 
 arch/powerpc/kernel/misc_32.S |  2 +-
 arch/powerpc/kernel/trace/Makefile|  4 ++
 arch/powerpc/kernel/traps.c   |  9 +--
 arch/powerpc/kernel/vdso/Makefile | 11 +++
 arch/powerpc/kexec/Makefile   |  2 +
 arch/powerpc/kvm/Makefile | 13 
 arch/powerpc/lib/Makefile | 25 +++
 arch/powerpc/mm/book3s32/Makefile |  3 +
 arch/powerpc/mm/nohash/Makefile   |  3 +
 arch/powerpc/perf/Makefile|  2 +
 arch/powerpc/platforms/44x/Makefile   |  2 +
 arch/powerpc/platforms/52xx/Makefile  |  3 +
 arch/powerpc/platforms/83xx/Makefile  |  2 +
 arch/powerpc/platforms/cell/spufs/Makefile|  3 +
 arch/powerpc/platforms/pasemi/Makefile|  2 +
 arch/powerpc/platforms/powermac/Makefile  |  3 +
 arch/powerpc/platforms/powernv/Makefile   |  3 +
 arch/powerpc/platforms/ps3/Makefile   |  2 +
 arch/powerpc/platforms/pseries/Makefile   |  2 +
 arch/powerpc/purgatory/Makefile   |  3 +
 arch/powerpc/sysdev/Makefile  |  3 +
 arch/powerpc/xmon/Makefile|  3 +
 scripts/Makefile.lib  |  2 +-
 tools/objtool/arch/powerpc/decode.c   | 60 +++--
 .../arch/powerpc/include/arch/special.h   |  2 +-
 tools/objtool/arch/powerpc/special.c  | 44 +++-
 tools/objtool/check.c | 29 
 tools/objtool/include/objtool/elf.h   |  1 +
 tools/objtool/include/objtool/special.h   |  2 +-
 .../powerpc/primitives/asm/extable.h  |  1 -
 38 files changed, 311 insertions(+), 104 deletions(-)
 delete mode 12 tools/testing/selftests/powerpc/primitives/asm/extable.h

-- 
2.40.1



Re: [PATCH v4 04/34] pgtable: Create struct ptdesc

2023-06-16 Thread Jason Gunthorpe
On Mon, Jun 12, 2023 at 02:03:53PM -0700, Vishal Moola (Oracle) wrote:
> Currently, page table information is stored within struct page. As part
> of simplifying struct page, create struct ptdesc for page table
> information.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  include/linux/pgtable.h | 51 +
>  1 file changed, 51 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c5a51481bbb9..330de96ebfd6 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>  #endif /* CONFIG_MMU */
>  
> +
> +/**
> + * struct ptdesc - Memory descriptor for page tables.
> + * @__page_flags: Same as page flags. Unused for page tables.
> + * @pt_list: List of used page tables. Used for s390 and x86.
> + * @_pt_pad_1: Padding that aliases with page's compound head.
> + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
> + * @pt_mm: Used for x86 pgds.
> + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 
> only.
> + * @ptl: Lock for the page table.
> + *
> + * This struct overlays struct page for now. Do not modify without a good
> + * understanding of the issues.
> + */
> +struct ptdesc {
> + unsigned long __page_flags;
> +
> + union {
> + struct list_head pt_list;
> + struct {
> + unsigned long _pt_pad_1;
> + pgtable_t pmd_huge_pte;
> + };
> + };
> + unsigned long _pt_s390_gaddr;
> +
> + union {
> + struct mm_struct *pt_mm;
> + atomic_t pt_frag_refcount;
> + };
> +
> +#if ALLOC_SPLIT_PTLOCKS
> + spinlock_t *ptl;
> +#else
> + spinlock_t ptl;
> +#endif
> +};

I think you should include the memcg here too? It needs to be valid
for a ptdesc, even if we don't currently deref it through the ptdesc
type.

Also, do you see a way to someday put a 'struct rcu_head' into here?
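
Purely as an illustration of where those two could live, assuming the same
overlay rules as the struct above (the exact offsets relative to struct
page would still need auditing):

struct ptdesc {
	unsigned long __page_flags;

	union {
		struct rcu_head pt_rcu_head;	/* same size as list_head */
		struct list_head pt_list;
		struct {
			unsigned long _pt_pad_1;
			pgtable_t pmd_huge_pte;
		};
	};
	unsigned long _pt_s390_gaddr;

	union {
		struct mm_struct *pt_mm;
		atomic_t pt_frag_refcount;
	};

#if ALLOC_SPLIT_PTLOCKS
	spinlock_t *ptl;
#else
	spinlock_t ptl;
#endif
#ifdef CONFIG_MEMCG
	unsigned long pt_memcg_data;	/* would need to alias
					 * page->memcg_data */
#endif
};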

Thanks,
Jason


Re: [PATCH v9 00/14] pci: Work around ASMedia ASM2824 PCIe link training failures

2023-06-16 Thread Maciej W. Rozycki
On Thu, 15 Jun 2023, Bjorn Helgaas wrote:

> >  If doing it this way, which I actually like, I think it would be a little 
> > bit better performance- and style-wise if this was written as:
> > 
> > if (pci_is_pcie(dev)) {
> > bridge = pci_upstream_bridge(dev);
> > retrain = !!bridge;
> > }
> > 
> > (or "retrain = bridge != NULL" if you prefer this style), and then we 
> > don't have to repeatedly check two variables iff (pcie && !bridge) in the 
> > loop below:
> 
> Done, thanks, I do like that better.  I did:
> 
>   bridge = pci_upstream_bridge(dev);
>   if (bridge)
> retrain = true;
> 
> because it seems like it flows more naturally when reading.

 Perfect, and good timing too, as I have just started checking your tree 
as your message arrived.  I ran my usual tests with and w/o PCI_QUIRKS 
enabled and results were as expected.  As before I didn't check hot plug 
and reset paths as these features are awkward with the HiFive Unmatched 
system involved.

 I have skimmed over the changes as committed to pci/enumeration and found 
nothing suspicious. I have verified that the tree builds at each of
them with my configuration.

 As per my earlier remark:

> I think making a system halfway-fixed would make little sense, but with
> the actual fix actually made last as you suggested I think this can be
> split off, because it'll make no functional change by itself.

I am not perfectly happy with your rearrangement to fold the !PCI_QUIRKS 
stub into the change carrying the actual workaround and then have the 
reset path update with a follow-up change only, but I won't fight over it.  
It's only one tree revision that will be in this halfway-fixed state and 
I'll trust your judgement here.

 Let me know if anything pops up related to these changes anytime and I'll 
be happy to look into it.  The system involved is nearing two years since 
its deployment already, but hopefully it has many years to go yet and will 
continue being ready to verify things.  It's not that there's lots of real 
RISC-V hardware available, let alone with PCI/e connectivity.

 Thank you for staying with me and reviewing this patch series through all 
the iterations.

  Maciej


Re: ppc64le vmlinuz is huge when building with BTF

2023-06-16 Thread Dominique Martinet
Naveen N Rao wrote on Fri, Jun 16, 2023 at 04:28:53PM +0530:
> > We're not stripping anything in vmlinuz for other archs -- the linker
> > script already should be including only the bare minimum to decompress
> > itself (+compressed useful bits), so I guess it's a Kbuild issue for the
> > arch.
> 
> For a related discussion, see:
> http://lore.kernel.org/CAK18DXZKs2PNmLndeGYqkPxmrrBR=6ca3bhyYCj=ghya7dh...@mail.gmail.com

Thanks, I didn't know that ppc64le boots straight into vmlinux, as 'make
install' somehow installs something called 'vmlinuz-lts' (-lts coming
out of localversion afaiu, but vmlinuz would come from the build
scripts); this is somewhat confusing as vmlinuz on other archs is a
compressed/pre-processed binary so I'd expect it to at least be
stripped...

> > We can add a strip but I unfortunately have no way of testing ppc build,
> > I'll ask around the build linux-kbuild and linuxppc-dev lists if that's
> > expected; it shouldn't be that bad now that's figured out.
> 
> Stripping vmlinux would indeed be the way to go. As mentioned in the above
> link, fedora also packages a strip'ed vmlinux for ppc64le:
> https://src.fedoraproject.org/rpms/kernel/blob/4af17bffde7a1eca9ab164e5de0e391c277998a4/f/kernel.spec#_1797

It feels somewhat wrong to add a strip just for ppc64le after make
install, but I guess we probably ought to do the same...
I don't have any hardware to test booting the result, though; I'll submit
an update and ask for someone to test when it's done.
(bit busy but that doesn't take long, will do that tomorrow morning
before I forget)

Thanks!
-- 
Dominique Martinet | Asmadeus


[PATCH v2 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE

2023-06-16 Thread Aneesh Kumar K.V
The pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect
and move_huge_pmd helpers, use the architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 mm/mremap.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8c5174d1f9db..c7f5806dc9d1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -550,6 +550,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
  unsigned long address, pud_t *pudp)
 {
@@ -563,6 +564,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 {
BUILD_BUG();
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #endif
 
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..6373db571e5c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -338,7 +338,7 @@ static inline bool move_normal_pud(struct vm_area_struct 
*vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && 
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pud_t *old_pud, pud_t 
*new_pud)
 {
-- 
2.40.1



[PATCH v2 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works

2023-06-16 Thread Aneesh Kumar K.V
Architectures like powerpc would like to use different page table
allocators and mapping mechanisms to implement vmemmap optimization.
Similar to vmemmap_populate, allow architectures to implement
vmemmap_populate_compound_pages.
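
With this hook, an architecture override would follow the usual convention
of providing the function plus a same-named macro (a sketch of the expected
arch-side pattern; the signature mirrors the generic implementation in
mm/sparse-vmemmap.c):

/* e.g. in an arch header pulled in via asm/pgtable.h */
int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
					      unsigned long start,
					      unsigned long end, int node,
					      struct dev_pagemap *pgmap);
#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages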

Signed-off-by: Aneesh Kumar K.V 
---
 mm/sparse-vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 10d73a0dfcec..0b83706c08fd 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
start, end - 1);
 }
 
+#ifndef vmemmap_populate_compound_pages
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int 
node,
   struct vmem_altmap *altmap,
   struct page *reuse)
@@ -446,6 +447,8 @@ static int __meminit 
vmemmap_populate_compound_pages(unsigned long start_pfn,
return 0;
 }
 
+#endif
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
-- 
2.40.1



Re: [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES

2023-06-16 Thread Björn Töpel
Mike Rapoport  writes:

> From: "Mike Rapoport (IBM)" 
>
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.

I think you can remove the MODULES dependency from BPF_JIT as well:

--8<--
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 2dfe1079f772..fa4587027f8b 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -41,7 +41,6 @@ config BPF_JIT
bool "Enable BPF Just In Time compiler"
depends on BPF
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-   depends on MODULES
help
  BPF programs are normally handled by a BPF interpreter. This option
  allows the kernel to generate native code when a program is loaded
--8<--


Björn


[PATCH v2 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize

2023-06-16 Thread Aneesh Kumar K.V
This is not used by radix anymore.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 --
 arch/powerpc/mm/init_64.c| 21 ++---
 2 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index f0527eebe012..d57b4f5d5cb3 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -601,16 +601,6 @@ void __init radix__early_init_mmu(void)
mmu_virtual_psize = MMU_PAGE_4K;
 #endif
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-   /* vmemmap mapping */
-   if (mmu_psize_defs[MMU_PAGE_2M].shift) {
-   /*
-* map vmemmap using 2M if available
-*/
-   mmu_vmemmap_psize = MMU_PAGE_2M;
-   } else
-   mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
/*
 * initialize page table size
 */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 5701faca39ef..6db7a063ba63 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, 
unsigned long start,
return false;
 }
 
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
-   struct vmem_altmap *altmap)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int 
node,
+struct vmem_altmap *altmap)
 {
bool altmap_alloc;
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (radix_enabled())
-   return radix__vmemmap_populate(start, end, node, altmap);
-#endif
-
/* Align to the page size of the linear mapping. */
start = ALIGN_DOWN(start, page_size);
 
@@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
return 0;
 }
 
+int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
+  struct vmem_altmap *altmap)
+{
+
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (radix_enabled())
+   return radix__vmemmap_populate(start, end, node, altmap);
+#endif
+
+   return __vmemmap_populate(start, end, node, altmap);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long vmemmap_list_free(unsigned long start)
 {
-- 
2.40.1



[PATCH v2 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix

2023-06-16 Thread Aneesh Kumar K.V
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap
page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence
with 64K page size, we don't use vmemmap deduplication for PMD-level
mapping.
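
The arithmetic behind that statement, written out as compile-time checks
(assuming a 64K PAGE_SIZE and a 64-byte struct page, as in the text above):

#include <assert.h>

#define PAGE_SZ		(64UL * 1024)		/* 64K page size */
#define STRUCT_PAGE_SZ	64UL			/* sizeof(struct page) */
#define PMD_MAP_SZ	(2UL * 1024 * 1024)	/* 2M PMD-level mapping */

static_assert(PMD_MAP_SZ / PAGE_SZ == 32,
	      "struct pages needed for one 2M mapping");
static_assert(PAGE_SZ / STRUCT_PAGE_SZ == 1024,
	      "struct pages held by one 64K vmemmap page");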

Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/mm/vmemmap_dedup.rst |   1 +
 Documentation/powerpc/index.rst|   1 +
 Documentation/powerpc/vmemmap_dedup.rst| 101 ++
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/book3s/64/radix.h |   8 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 203 +
 6 files changed, 315 insertions(+)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

diff --git a/Documentation/mm/vmemmap_dedup.rst 
b/Documentation/mm/vmemmap_dedup.rst
index a4b12ff906c4..c573e08b5043 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -210,6 +210,7 @@ the device (altmap).
 
 The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
 PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst
 
 The differences with HugeTLB are relatively minor.
 
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 85e80e30160b..1b7d943ae8f3 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -35,6 +35,7 @@ powerpc
 ultravisor
 vas-api
 vcpudispatch_stats
+vmemmap_dedup
 
 features
 
diff --git a/Documentation/powerpc/vmemmap_dedup.rst 
b/Documentation/powerpc/vmemmap_dedup.rst
new file mode 100644
index ..dc4db59fdf87
--- /dev/null
+++ b/Documentation/powerpc/vmemmap_dedup.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Device DAX
+==
+
+The device-dax interface uses the tail deduplication technique explained in
+Documentation/mm/vmemmap_dedup.rst
+
+On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
+with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
+deduplication.
+
+With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
+page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
+vmemmap deduplication possible.
+
+With 1G PUD level mapping, we require 16384 struct pages and a single 64K
+vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
+require 16 64K pages in vmemmap to map the struct page for 1G PUD level 
mapping.
+
+Here's how things look like on device-dax after the sections are populated::
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | +---+ | | | |
+ |   | | 4 | + | | |
+ |PUD| +---+   | | |
+ |   level   | | . | --+ | |
+ |  mapping  | +---+ | |
+ |   | | . | + |
+ |   | +---+   |
+ |   | | 15| --+
+ |   | +---+
+ |   |
+ |   |
+ |   |
+ +---+
+
+
+With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
+4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we
+require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.
+
+Here's how things look like on device-dax after the sections are populated::
+
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | 

[PATCH v2 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-06-16 Thread Aneesh Kumar K.V
This is in preparation for updating radix to implement vmemmap optimization
for devdax. Below are the rules w.r.t. radix vmemmap mapping:

1. First try to map things using PMD (2M)
2. With altmap, if the altmap cross-boundary check returns true, fall back to
   PAGE_SIZE
3. If we can't allocate PMD_SIZE backing memory for vmemmap, fall back to
   PAGE_SIZE

On removing a vmemmap mapping, check whether every subsection that is using
the vmemmap area is invalid. If all of them are, we can safely
free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86
because with 64K page size, we need to do the above check even at the
PAGE_SIZE granularity.
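
The mapping rules above boil down to the following decision (a stand-alone
sketch; the two predicates stand in for the real altmap and allocator
checks):

#include <stdbool.h>

enum vmemmap_map_size { MAP_WITH_PMD, MAP_WITH_PAGE };

static enum vmemmap_map_size pick_mapping(bool altmap_crosses_boundary,
					  bool pmd_backing_available)
{
	if (altmap_crosses_boundary)
		return MAP_WITH_PAGE;	/* rule 2 */
	if (!pmd_backing_available)
		return MAP_WITH_PAGE;	/* rule 3 */
	return MAP_WITH_PMD;		/* rule 1 */
}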

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/radix.h |   2 +
 arch/powerpc/include/asm/pgtable.h |   3 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 319 +++--
 arch/powerpc/mm/init_64.c  |  26 +-
 4 files changed, 319 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 8cdff5a05011..87d4c1e62491 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -332,6 +332,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned 
long start,
 unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
  int node, struct vmem_altmap *altmap);
+void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
+  struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
unsigned long page_size);
 
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 9972626ddaf6..6d4cd2ebae6e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x)
 
 struct seq_file;
 void arch_report_meminfo(struct seq_file *m);
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size);
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+  unsigned long page_size);
 #endif /* CONFIG_PPC64 */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index d7e2dd3d4add..ef886fab643d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
p4d_clear(p4d);
 }
 
+static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long 
end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+   return !vmemmap_populated(start, PMD_SIZE);
+}
+
+static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long 
end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+
+   return !vmemmap_populated(start, PAGE_SIZE);
+
+}
+
+static void __meminit free_vmemmap_pages(struct page *page,
+struct vmem_altmap *altmap,
+int order)
+{
+   unsigned int nr_pages = 1 << order;
+
+   if (altmap) {
+   unsigned long alt_start, alt_end;
+   unsigned long base_pfn = page_to_pfn(page);
+
+   /*
+* With 1G vmemmap mapping we can have things set up
+* such that even though altmap is specified we never
+* use altmap.
+*/
+   alt_start = altmap->base_pfn;
+   alt_end = altmap->base_pfn + altmap->reserve +
+   altmap->free + altmap->alloc + altmap->align;
+
+   if (base_pfn >= alt_start && base_pfn < alt_end) {
+   vmem_altmap_free(altmap, nr_pages);
+   return;
+   }
+   }
+
+   if (PageReserved(page)) {
+   /* allocated from memblock */
+   while (nr_pages--)
+   free_reserved_page(page++);
+   } else
+   free_pages((unsigned long)page_address(page), order);
+}
+
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end, bool direct)
+unsigned long end, bool direct,
+struct vmem_altmap *altmap)
 {
unsigned long next, pages = 0;
pte_t *pte;
@@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
if (!pte_present(*pte))
continue;
 
-   if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) {
-   /*
-* The vmemmap_free() and 

[PATCH v2 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage

2023-06-16 Thread Aneesh Kumar K.V
This is enabled only with radix translation and 1G hugepage size. This will
be used with devdax device memory with a namespace alignment of 1G.

Anon transparent hugepage is not supported even though we do have helpers
checking pud_trans_huge(). Those should never return true. The only
expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.

Some of the helpers are never expected to get called on hash translation
and hence are marked to call BUG() in such a case.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 --
 arch/powerpc/include/asm/book3s/64/radix.h|  37 +
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  28 
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 include/trace/events/thp.h|  17 ++
 9 files changed, 323 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..9a05de007956 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte)
 {
return __pud_raw(pte_raw(pte));
 }
+
+static inline pte_t *pudp_ptep(pud_t *pud)
+{
+   return (pte_t *)pud;
+}
+
+#define pud_pfn(pud)   pte_pfn(pud_pte(pud))
+#define pud_dirty(pud) pte_dirty(pud_pte(pud))
+#define pud_young(pud) pte_young(pud_pte(pud))
+#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkdirty(pud)   pte_pud(pte_mkdirty(pud_pte(pud)))
+#define pud_mkclean(pud)   pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkyoung(pud)   pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite(pud)   pte_pud(pte_mkwrite(pud_pte(pud)))
 #define pud_write(pud) pte_write(pud_pte(pud))
 
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#define pud_soft_dirty(pud)    pte_soft_dirty(pud_pte(pud))
+#define pud_mksoft_dirty(pud)  pte_pud(pte_mksoft_dirty(pud_pte(pud)))
+#define pud_clear_soft_dirty(pud) pte_pud(pte_clear_soft_dirty(pud_pte(pud)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 static inline int pud_bad(pud_t pud)
 {
if (radix_enabled())
@@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool 
write)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
   pmd_t *pmdp, pmd_t pmd);
+extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
+  pud_t *pudp, pud_t pud);
+
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd)
 {
 }
 
+static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
+   unsigned long addr, pud_t *pud)
+{
+}
+
 extern int hash__has_transparent_hugepage(void);
 static inline int has_transparent_hugepage(void)
 {
@@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void)
 }
 #define has_transparent_hugepage has_transparent_hugepage
 
+static inline int has_transparent_pud_hugepage(void)
+{
+   if (radix_enabled())
+   return radix__has_transparent_pud_hugepage();
+   return 0;
+}
+#define has_transparent_pud_hugepage has_transparent_pud_hugepage
+
 static inline unsigned long
 pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
unsigned long clr, unsigned long set)
@@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long 
addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
 }
 
+static inline unsigned long
+pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp,
+   unsigned long clr, unsigned long set)
+{
+   if (radix_enabled())
+   return radix__pud_hugepage_update(mm, addr, pudp, clr, set);
+   BUG();
+   return pud_val(*pudp);
+}
+
 /*
  * returns true for pmd migration entries, THP, devmap, hugetlb
  * But compile time dependent on THP config
@@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd)
return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
 }
 
+static inline int pud_large(pud_t pud)
+{
+   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
 /*
  * For radix we should always find H_PAGE_HASHPTE zero. Hence
  * the below will work for radix too
@@ 

[PATCH v2 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization

2023-06-16 Thread Aneesh Kumar K.V
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to
read-only) and the output address (pfn) of the vmemmap ptes. That is not
supported on some architectures without unmapping the pte (marking it
invalid) first.

With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb
vmemmap optimization disabled. Hence split DAX optimization support into a
different config.

loongarch and riscv don't have devdax support. So the DAX config is not
enabled for them. With this change, arm64 should be able to select DAX
optimization.

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable 
HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Signed-off-by: Aneesh Kumar K.V 
---
 arch/loongarch/Kconfig | 2 +-
 arch/riscv/Kconfig | 2 +-
 arch/x86/Kconfig   | 3 ++-
 fs/Kconfig | 2 +-
 include/linux/mm.h | 2 +-
 mm/Kconfig | 5 -
 6 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index d38b066fc931..2060990c4612 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -55,7 +55,7 @@ config LOONGARCH
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_NO_INSTR
select BUILDTIME_TABLE_SORT
select COMMON_CLK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 5966ad97c30c..d8f3765ff115 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -50,7 +50,7 @@ config RISCV
select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE
select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select BUILDTIME_TABLE_SORT if MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..eb383960b6ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,7 +127,8 @@ config X86
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP   if X86_64
select ARCH_WANTS_THP_SWAP  if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
diff --git a/fs/Kconfig b/fs/Kconfig
index 18d034ec7953..9c104c130a6e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -252,7 +252,7 @@ config HUGETLB_PAGE
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
-   depends on ARCH_WANT_OPTIMIZE_VMEMMAP
+   depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a45e61cd83f..6e56ae09f0c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3553,7 +3553,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 #endif
 
 #define VMEMMAP_RESERVE_NR 2
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
  struct dev_pagemap *pgmap)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..7b388c10baab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -461,7 +461,10 @@ config SPARSEMEM_VMEMMAP
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
 #
-config ARCH_WANT_OPTIMIZE_VMEMMAP
+config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
+   bool
+
+config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
bool
 
 config HAVE_MEMBLOCK_PHYS_MAP
-- 
2.40.1



[PATCH v2 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME

2023-06-16 Thread Aneesh Kumar K.V
This helps architectures to override pmd_same and pud_same independently.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2fe19720075e..8c5174d1f9db 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,7 +681,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
 }
+#endif
 
+#ifndef __HAVE_ARCH_PUD_SAME
 static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
-- 
2.40.1



[PATCH v2 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override

2023-06-16 Thread Aneesh Kumar K.V
dax vmemmap optimization requires a minimum area of 2 PAGE_SIZE within
vmemmap so that the tail page mapping can point to the second PAGE_SIZE
area. Enforce that in the vmemmap_can_optimize() function.

Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation). Hence allow architecture
override.
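
Illustrative numbers behind that minimum (assuming 4K pages and a 64-byte
struct page; not taken from the patch):

	/*
	 * A 2M devdax compound page covers 512 base pages, so its vmemmap
	 * spans 512 * 64 / 4096 = 8 pages, above the required minimum of 2
	 * (a head page plus one tail page that all other tail mappings can
	 * point to).  A 128K compound page (32 base pages) needs only
	 * 32 * 64 = 2048 bytes, less than one full vmemmap page, so the
	 * check below rejects it.
	 */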

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/mm.h | 30 ++
 mm/mm_init.c   |  2 +-
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..9a45e61cd83f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,8 @@
 #include 
 #include 
 
+#include 
+
 struct mempolicy;
 struct anon_vma;
 struct anon_vma_chain;
@@ -3550,13 +3552,33 @@ void vmemmap_free(unsigned long start, unsigned long 
end,
struct vmem_altmap *altmap);
 #endif
 
+#define VMEMMAP_RESERVE_NR 2
 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
-  struct dev_pagemap *pgmap)
+static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
 {
-   return is_power_of_2(sizeof(struct page)) &&
-   pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
+   if (pgmap) {
+   unsigned long nr_pages;
+   unsigned long nr_vmemmap_pages;
+
+   nr_pages = pgmap_vmemmap_nr(pgmap);
+   nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> 
PAGE_SHIFT);
+   /*
+* For vmemmap optimization with DAX we need minimum 2 vmemmap
+* pages. See layout diagram in 
Documentation/mm/vmemmap_dedup.rst
+*/
+   return is_power_of_2(sizeof(struct page)) &&
+   (nr_vmemmap_pages > VMEMMAP_RESERVE_NR) && !altmap;
+   }
+   return false;
 }
+/*
+ * If we don't have an architecture override, use the generic rule
+ */
+#ifndef vmemmap_can_optimize
+#define vmemmap_can_optimize __vmemmap_can_optimize
+#endif
+
 #else
 static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
   struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..d1676afc94f1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct 
vmem_altmap *altmap,
if (!vmemmap_can_optimize(altmap, pgmap))
return pgmap_vmemmap_nr(pgmap);
 
-   return 2 * (PAGE_SIZE / sizeof(struct page));
+   return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
 }
 
 static void __ref memmap_init_compound(struct page *head,
-- 
2.40.1



[PATCH v2 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg

2023-06-16 Thread Aneesh Kumar K.V
We will use this in a later patch to do a TLB flush when clearing pud entries
on powerpc. This is similar to commit 93a98695f2f9 ("mm: change
pmdp_huge_get_and_clear_full take vm_area_struct as arg")

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 4 ++--
 mm/debug_vm_pgtable.c   | 2 +-
 mm/huge_memory.c| 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b3f4dd0240f5..2fe19720075e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -442,11 +442,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct 
vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
-static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
+static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long address, pud_t *pudp,
int full)
 {
-   return pudp_huge_get_and_clear(mm, address, pudp);
+   return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..c2bf25d5e5cd 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -382,7 +382,7 @@ static void __init pud_advanced_tests(struct 
pgtable_debug_args *args)
WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-   pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1);
+   pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1);
pud = READ_ONCE(*args->pudp);
WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..8774b4751a84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1980,7 +1980,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
if (!ptl)
return 0;
 
-   pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+   pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
if (vma_is_special_huge(vma)) {
spin_unlock(ptl);
-- 
2.40.1



[PATCH v2 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support

2023-06-16 Thread Aneesh Kumar K.V
Architectures like powerpc would like to enable transparent huge page pud
support only with radix translation. To support that, add a
has_transparent_pud_hugepage() helper that architectures can override.
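
The override pattern this enables looks like the powerpc variant later in
this series (patch 13/16): define a same-named macro so the generic
#ifndef fallback below is skipped.

	static inline int has_transparent_pud_hugepage(void)
	{
		if (radix_enabled())
			return radix__has_transparent_pud_hugepage();
		return 0;
	}
	#define has_transparent_pud_hugepage has_transparent_pud_hugepage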

Signed-off-by: Aneesh Kumar K.V 
---
 drivers/nvdimm/pfn_devs.c | 2 +-
 include/linux/pgtable.h   | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index af7d9301520c..18ad315581ca 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned 
long *alignments)
 
if (has_transparent_hugepage()) {
alignments[1] = HPAGE_PMD_SIZE;
-   if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+   if (has_transparent_pud_hugepage())
alignments[2] = HPAGE_PUD_SIZE;
}
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..b3f4dd0240f5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1597,6 +1597,9 @@ typedef unsigned int pgtbl_mod_mask;
 #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
 #endif
 
+#ifndef has_transparent_pud_hugepage
+#define has_transparent_pud_hugepage() 
IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#endif
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.40.1



[PATCH v2 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemap can cross-boundary

2023-06-16 Thread Aneesh Kumar K.V
Without this fix, the last subsection vmemmap can end up in memory even if
the namespace is created with -M mem and has sufficient space in the altmap
area.
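
Illustrative arithmetic (assuming end_pfn names the last pfn of the altmap
area): with end_pfn = 0x10ff and a final subsection whose vmemmap spans
start_pfn = 0x1080 with nr_pfn = 0x80, the old check computed
0x1080 + 0x80 = 0x1100 > 0x10ff and falsely reported a boundary cross,
pushing that vmemmap into regular memory; the fixed check computes
0x10ff > 0x10ff, which is false, so the vmemmap correctly stays in the
altmap.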

Fixes: cf387d9644d8 ("libnvdimm/altmap: Track namespace boundaries in altmap")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 05b0d584e50b..fe1b83020e0d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -189,7 +189,7 @@ static bool altmap_cross_boundary(struct vmem_altmap 
*altmap, unsigned long star
unsigned long nr_pfn = page_size / sizeof(struct page);
unsigned long start_pfn = page_to_pfn((struct page *)start);
 
-   if ((start_pfn + nr_pfn) > altmap->end_pfn)
+   if ((start_pfn + nr_pfn - 1) > altmap->end_pfn)
return true;
 
if (start_pfn < altmap->base_pfn)
-- 
2.40.1



[PATCH v2 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding

2023-06-16 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 15a099e53cde..76f6a1f3b9d8 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -910,7 +910,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
  unsigned long phys)
 {
/* Create a PTE encoding */
-   unsigned long flags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_KERNEL_RW;
int nid = early_pfn_to_nid(phys >> PAGE_SHIFT);
int ret;
 
@@ -919,7 +918,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
return -1;
}
 
-   ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, 
nid);
+   ret = __map_kernel_page_nid(start, phys, PAGE_KERNEL, page_size, nid);
BUG_ON(ret);
 
return 0;
-- 
2.40.1



[PATCH v2 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo

2023-06-16 Thread Aneesh Kumar K.V
On memory unplug reduce DirectMap page count correctly.
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:115343360 kB
DirectMap1G:   0 kB

Before fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:115343360 kB
DirectMap1G:   0 kB

After fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:104857600 kB
DirectMap1G:   0 kB
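
(The delta is 115343360 - 104857600 = 10485760 kB, i.e. 10 GiB, presumably
the direct-mapped size of the disabled namespace.)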

Fixes: a2dc009afa9a ("powerpc/mm/book3s/radix: Add mapping statistics")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 34 +++-
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 570add33c02d..15a099e53cde 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -743,9 +743,9 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 }
 
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end)
+unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pte_t *pte;
 
pte = pte_start + pte_index(addr);
@@ -767,13 +767,16 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
}
 
pte_clear(&init_mm, addr, pte);
+   pages++;
}
+   if (direct)
+   update_page_count(mmu_virtual_psize, -pages);
 }
 
 static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
-unsigned long end)
+  unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pte_t *pte_base;
pmd_t *pmd;
 
@@ -791,19 +794,22 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, 
unsigned long addr,
continue;
}
pte_clear(&init_mm, addr, (pte_t *)pmd);
+   pages++;
continue;
}
 
pte_base = (pte_t *)pmd_page_vaddr(*pmd);
-   remove_pte_table(pte_base, addr, next);
+   remove_pte_table(pte_base, addr, next, direct);
free_pte_table(pte_base, pmd);
}
+   if (direct)
+   update_page_count(MMU_PAGE_2M, -pages);
 }
 
 static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
-unsigned long end)
+  unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pmd_t *pmd_base;
pud_t *pud;
 
@@ -821,16 +827,20 @@ static void __meminit remove_pud_table(pud_t *pud_start, 
unsigned long addr,
continue;
}
pte_clear(&init_mm, addr, (pte_t *)pud);
+   pages++;
continue;
}
 
pmd_base = pud_pgtable(*pud);
-   remove_pmd_table(pmd_base, addr, next);
+   remove_pmd_table(pmd_base, addr, next, direct);
free_pmd_table(pmd_base, pud);
}
+   if (direct)
+   update_page_count(MMU_PAGE_1G, -pages);
 }
 
-static void __meminit remove_pagetable(unsigned long start, unsigned long end)
+static void __meminit remove_pagetable(unsigned long start, unsigned long end,
+  bool direct)
 {
unsigned long addr, next;
pud_t *pud_base;
@@ -859,7 +869,7 @@ static void __meminit remove_pagetable(unsigned long start, 
unsigned long end)
}
 
pud_base = p4d_pgtable(*p4d);
-   remove_pud_table(pud_base, addr, next);
+   remove_pud_table(pud_base, addr, next, direct);
free_pud_table(pud_base, p4d);
}
 
@@ -882,7 +892,7 @@ int __meminit radix__create_section_mapping(unsigned long 
start,
 
 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long 
end)
 {
-   remove_pagetable(start, end);
+   remove_pagetable(start, end, true);
return 0;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -918,7 +928,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned 
long page_size)
 {
-   remove_pagetable(start, start + page_size);
+   remove_pagetable(start, start + page_size, false);
 }
 #endif
 #endif
-- 

[PATCH v2 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix

2023-06-16 Thread Aneesh Kumar K.V
This should not be within CONFIG_PPC_64S_HASH_MMU. We use
mmu_vmemmap_psize on radix while mapping the vmemmap area.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 5f8c6fbe8a69..570add33c02d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -594,7 +594,6 @@ void __init radix__early_init_mmu(void)
 {
unsigned long lpcr;
 
-#ifdef CONFIG_PPC_64S_HASH_MMU
 #ifdef CONFIG_PPC_64K_PAGES
/* PAGE_SIZE mappings */
mmu_virtual_psize = MMU_PAGE_64K;
@@ -611,7 +610,6 @@ void __init radix__early_init_mmu(void)
mmu_vmemmap_psize = MMU_PAGE_2M;
} else
mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
 #endif
/*
 * initialize page table size
-- 
2.40.1



[PATCH v2 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.

2023-06-16 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 2297aa764ecd..5f8c6fbe8a69 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -952,7 +952,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct 
*mm, unsigned long add
assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
-   old = radix__pte_update(mm, addr, (pte_t *)pmdp, clr, set, 1);
+   old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1);
trace_hugepage_update(addr, old, clr, set);
 
return old;
-- 
2.40.1



[PATCH v2 00/16] Add support for DAX vmemmap optimization for ppc64

2023-06-16 Thread Aneesh Kumar K.V
This patch series implements changes required to support DAX vmemmap
optimization for ppc64. The vmemmap optimization is only enabled with radix MMU
translation and 1GB PUD mapping with 64K page size. The patch series also
splits hugetlb vmemmap optimization out as a separate Kconfig variable so that
architectures can enable DAX vmemmap optimization without enabling hugetlb
vmemmap optimization. This should enable architectures like arm64 to enable DAX
vmemmap optimization while they can't enable hugetlb vmemmap optimization. More
details of the same are in patch "mm/vmemmap optimization: Split hugetlb and
devdax vmemmap optimization"

Changes from V1:
* Fix make htmldocs warning
* Fix vmemmap allocation bugs with different alignment values.
* Correctly check for section validity before we free the vmemmap area

Aneesh Kumar K.V (16):
  powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
  powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
  powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
  powerpc/mm/dax: Fix the condition when checking if altmap vmemap can
cross-boundary
  mm/hugepage pud: Allow arch-specific helper function to check huge
page pud support
  mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
  mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to
override
  mm/vmemmap: Allow architectures to override how vmemmap optimization
works
  mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
  mm/huge pud: Use transparent huge pud helpers only with
CONFIG_TRANSPARENT_HUGEPAGE
  mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
  powerpc/book3s64/mm: Enable transparent pud hugepage
  powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap
handling function
  powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  powerpc/book3s64/radix: Remove mmu_vmemmap_psize

 Documentation/mm/vmemmap_dedup.rst|   1 +
 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/vmemmap_dedup.rst   | 101 +++
 arch/loongarch/Kconfig|   2 +-
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 -
 arch/powerpc/include/asm/book3s/64/radix.h|  47 ++
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/pgtable.h|   3 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 577 --
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/mm/init_64.c |  39 +-
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 arch/riscv/Kconfig|   2 +-
 arch/x86/Kconfig  |   3 +-
 drivers/nvdimm/pfn_devs.c |   2 +-
 fs/Kconfig|   2 +-
 include/linux/mm.h|  32 +-
 include/linux/pgtable.h   |  11 +-
 include/trace/events/thp.h|  17 +
 mm/Kconfig|   5 +-
 mm/debug_vm_pgtable.c |   2 +-
 mm/huge_memory.c  |   2 +-
 mm/mm_init.c  |   2 +-
 mm/mremap.c   |   2 +-
 mm/sparse-vmemmap.c   |   3 +
 28 files changed, 1032 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

-- 
2.40.1



Re: ppc64le vmlinuz is huge when building with BTF

2023-06-16 Thread Naveen N Rao

[Cc linuxppc-dev]

Dominique Martinet wrote:


Alan Maguire wrote on Thu, Jun 15, 2023 at 03:31:49PM +0100:

However the problem I suspect is this:

 51 .debug_info   0a488b55  0000000000000000  0000000000000000  026f8d20  2**0
                  CONTENTS, READONLY, DEBUGGING
[...]

The debug info hasn't been stripped, so I suspect the packaging spec
file or equivalent - in perhaps trying to preserve the .BTF section -
is preserving debug info too. DWARF needs to be there at BTF
generation time in vmlinux but is usually stripped for non-debug
packages.


Thanks Alan and Eduard!
I guess I should have checked that first, it helps.

We're not stripping anything in vmlinuz for other archs -- the linker
script already should be including only the bare minimum to decompress
itself (+compressed useful bits), so I guess it's a Kbuild issue for the
arch.


For a related discussion, see:
http://lore.kernel.org/CAK18DXZKs2PNmLndeGYqkPxmrrBR=6ca3bhyYCj=ghya7dh...@mail.gmail.com


We can add a strip but I unfortunately have no way of testing a ppc build;
I'll ask around the linux-kbuild and linuxppc-dev lists if that's
expected. It shouldn't be that bad now that's figured out.


Stripping vmlinux would indeed be the way to go. As mentioned in the
above link, Fedora also packages a stripped vmlinux for ppc64le:

https://src.fedoraproject.org/rpms/kernel/blob/4af17bffde7a1eca9ab164e5de0e391c277998a4/f/kernel.spec#_1797
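
For reference, a minimal form of such a step (untested here) would be
something like "strip --strip-debug vmlinux", which drops only the
.debug_* sections and so should leave .BTF in place.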


- Naveen



[PATCH AUTOSEL 6.1 15/26] ASoC: fsl_sai: Enable BCI bit if SAI works on synchronous mode with BYP asserted

2023-06-16 Thread Sasha Levin
From: Chancel Liu 

[ Upstream commit 32cf0046a652116d6a216d575f3049a9ff9dd80d ]

There's an issue in SAI synchronous mode: the TX/RX side can't get BCLK
from the RX/TX side it syncs with if the BYP bit is asserted. As a
workaround, enable SION of the IOMUX pad control and assert BCI.

For example, if TX syncs with RX, both TX and RX use the clock from RX
and BYP=1. TX can get BCLK only if the following two conditions hold:
1. SION of RX BCLK IOMUX pad is set to 1
2. BCI of TX is set to 1

Signed-off-by: Chancel Liu 
Acked-by: Shengjiu Wang 
Link: https://lore.kernel.org/r/20230530103012.3448838-1-chancel@nxp.com
Signed-off-by: Mark Brown 
Signed-off-by: Sasha Levin 
---
 sound/soc/fsl/fsl_sai.c | 11 +--
 sound/soc/fsl/fsl_sai.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
index 6d88af5b287fe..b33104715c7ba 100644
--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -491,14 +491,21 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool 
tx, u32 freq)
regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_MSEL_MASK,
   FSL_SAI_CR2_MSEL(sai->mclk_id[tx]));
 
-   if (savediv == 1)
+   if (savediv == 1) {
regmap_update_bits(sai->regmap, reg,
   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
   FSL_SAI_CR2_BYP);
-   else
+   if (fsl_sai_dir_is_synced(sai, adir))
+   regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs),
+  FSL_SAI_CR2_BCI, FSL_SAI_CR2_BCI);
+   else
+   regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs),
+  FSL_SAI_CR2_BCI, 0);
+   } else {
regmap_update_bits(sai->regmap, reg,
   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
   savediv / 2 - 1);
+   }
 
if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
/* SAI is in master mode at this point, so enable MCLK */
diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h
index 697f6690068c8..c5423f81e4560 100644
--- a/sound/soc/fsl/fsl_sai.h
+++ b/sound/soc/fsl/fsl_sai.h
@@ -116,6 +116,7 @@
 
 /* SAI Transmit and Receive Configuration 2 Register */
 #define FSL_SAI_CR2_SYNC   BIT(30)
+#define FSL_SAI_CR2_BCIBIT(28)
 #define FSL_SAI_CR2_MSEL_MASK  (0x3 << 26)
 #define FSL_SAI_CR2_MSEL_BUS   0
 #define FSL_SAI_CR2_MSEL_MCLK1 BIT(26)
-- 
2.39.2



[PATCH AUTOSEL 6.3 16/30] ASoC: fsl_sai: Enable BCI bit if SAI works on synchronous mode with BYP asserted

2023-06-16 Thread Sasha Levin
From: Chancel Liu 

[ Upstream commit 32cf0046a652116d6a216d575f3049a9ff9dd80d ]

There's an issue in SAI synchronous mode: the TX/RX side can't get BCLK
from the RX/TX side it syncs with if the BYP bit is asserted. As a
workaround, enable SION of the IOMUX pad control and assert BCI.

For example, if TX syncs with RX, both TX and RX use the clock from RX
and BYP=1. TX can get BCLK only if the following two conditions hold:
1. SION of RX BCLK IOMUX pad is set to 1
2. BCI of TX is set to 1

Signed-off-by: Chancel Liu 
Acked-by: Shengjiu Wang 
Link: https://lore.kernel.org/r/20230530103012.3448838-1-chancel@nxp.com
Signed-off-by: Mark Brown 
Signed-off-by: Sasha Levin 
---
 sound/soc/fsl/fsl_sai.c | 11 +--
 sound/soc/fsl/fsl_sai.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
index 990bba0be1fb1..8a64f3c1d1556 100644
--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -491,14 +491,21 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool 
tx, u32 freq)
regmap_update_bits(sai->regmap, reg, FSL_SAI_CR2_MSEL_MASK,
   FSL_SAI_CR2_MSEL(sai->mclk_id[tx]));
 
-   if (savediv == 1)
+   if (savediv == 1) {
regmap_update_bits(sai->regmap, reg,
   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
   FSL_SAI_CR2_BYP);
-   else
+   if (fsl_sai_dir_is_synced(sai, adir))
+   regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs),
+  FSL_SAI_CR2_BCI, FSL_SAI_CR2_BCI);
+   else
+   regmap_update_bits(sai->regmap, FSL_SAI_xCR2(tx, ofs),
+  FSL_SAI_CR2_BCI, 0);
+   } else {
regmap_update_bits(sai->regmap, reg,
   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
   savediv / 2 - 1);
+   }
 
if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
/* SAI is in master mode at this point, so enable MCLK */
diff --git a/sound/soc/fsl/fsl_sai.h b/sound/soc/fsl/fsl_sai.h
index 197748a888d5f..a53c4f0e25faf 100644
--- a/sound/soc/fsl/fsl_sai.h
+++ b/sound/soc/fsl/fsl_sai.h
@@ -116,6 +116,7 @@
 
 /* SAI Transmit and Receive Configuration 2 Register */
 #define FSL_SAI_CR2_SYNC   BIT(30)
+#define FSL_SAI_CR2_BCIBIT(28)
 #define FSL_SAI_CR2_MSEL_MASK  (0x3 << 26)
 #define FSL_SAI_CR2_MSEL_BUS   0
 #define FSL_SAI_CR2_MSEL_MCLK1 BIT(26)
-- 
2.39.2



[PATCH v2 12/12] kprobes: remove dependency on CONFIG_MODULES

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

kprobes depended on CONFIG_MODULES because it needs to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/Kconfig|  2 +-
 kernel/kprobes.c| 43 +
 kernel/trace/trace_kprobe.c | 11 ++
 3 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..f2e9f82c7d0d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -39,9 +39,9 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
select KALLSYMS
+   select EXECMEM
select TASKS_RCU if PREEMPTION
help
  Kprobes allows you to trap at almost any kernel address and
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 37c928d5deaf..2c2ba29d3f9a 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1568,6 +1568,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1591,6 +1592,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2484,24 +2487,6 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
-{
-   struct kprobe_blacklist_entry *ent, *n;
-
-   list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-   if (ent->start_addr < start || ent->start_addr >= end)
-   continue;
-   list_del(&ent->list);
-   kfree(ent);
-   }
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-   kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
 {
@@ -2566,6 +2551,25 @@ static int __init populate_kprobe_blacklist(unsigned 
long *start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
+{
+   struct kprobe_blacklist_entry *ent, *n;
+
+   list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+   if (ent->start_addr < start || ent->start_addr >= end)
+   continue;
+   list_del(&ent->list);
+   kfree(ent);
+   }
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+   kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
unsigned long start, end;
@@ -2667,6 +2671,7 @@ static struct notifier_block kprobe_module_nb = {
.notifier_call = kprobes_module_callback,
.priority = 0
 };
+#endif
 
 void kprobe_free_init_mem(void)
 {
@@ -2726,8 +2731,10 @@ static int __init init_kprobes(void)
err = arch_init_kprobes();
if (!err)
err = register_die_notifier(&kprobe_exceptions_nb);
+#ifdef CONFIG_MODULES
if (!err)
err = register_module_notifier(&kprobe_module_nb);
+#endif
 
kprobes_initialized = (err == 0);
kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 59cda19a9033..cf804e372554 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -111,6 +111,7 @@ static nokprobe_inline bool 
trace_kprobe_within_module(struct trace_kprobe *tk,
return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
 }
 
+#ifdef CONFIG_MODULES
 static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 {
char *p;
@@ -129,6 +130,12 @@ static nokprobe_inline bool 
trace_kprobe_module_exist(struct trace_kprobe *tk)
 
return ret;
 }
+#else
+static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+   return false;
+}
+#endif
 
 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
 }
 
+#ifdef CONFIG_MODULES
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
  

[PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_text_alloc() is
available regardless of CONFIG_MODULES.

Remove dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/ftrace.c | 10 --
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..ab64bbef9e50 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+   select EXECMEM if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index f77c63bb3203..a824a5d3b129 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
return execmem_text_alloc(size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
 {
execmem_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-   return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
-- 
2.35.1



[PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

execmem does not depend on modules; on the contrary, modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm/kernel/module.c| 36 
 arch/arm/mm/init.c  | 36 
 arch/arm64/include/asm/memory.h |  8 +
 arch/arm64/include/asm/module.h |  6 
 arch/arm64/kernel/kaslr.c   |  3 +-
 arch/arm64/kernel/module.c  | 50 
 arch/arm64/mm/init.c| 56 +++
 arch/loongarch/kernel/module.c  | 18 --
 arch/loongarch/mm/init.c| 20 +++
 arch/mips/kernel/module.c   | 18 --
 arch/mips/mm/init.c | 19 +++
 arch/parisc/kernel/module.c | 17 --
 arch/parisc/mm/init.c   | 22 +++-
 arch/powerpc/kernel/module.c| 58 
 arch/powerpc/mm/mem.c   | 59 +
 arch/riscv/kernel/module.c  | 34 ---
 arch/riscv/mm/init.c| 34 +++
 arch/s390/kernel/module.c   | 38 -
 arch/s390/mm/init.c | 41 +++
 arch/sparc/kernel/module.c  | 23 -
 arch/sparc/mm/Makefile  |  2 ++
 arch/sparc/mm/execmem.c | 25 ++
 arch/x86/kernel/module.c| 50 
 arch/x86/mm/init.c  | 54 ++
 24 files changed, 376 insertions(+), 351 deletions(-)
 create mode 100644 arch/sparc/mm/execmem.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index f66d479c1c7d..054e799e7091 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,48 +16,12 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
 #include 
 #include 
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_params execmem_params = {
-   .modules = {
-   .text = {
-   .start = MODULES_VADDR,
-   .end = MODULES_END,
-   .alignment = 1,
-   },
-   },
-};
-
-struct execmem_params __init *execmem_arch_params(void)
-{
-   execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
-
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-   execmem_params.modules.text.fallback_start = VMALLOC_START;
-   execmem_params.modules.text.fallback_end = VMALLOC_END;
-   }
-
-   return &execmem_params;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index ce64bdb55a16..cffd29f15782 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -486,3 +487,38 @@ void free_initrd_mem(unsigned long start, unsigned long 
end)
free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#if defined(CONFIG_MMU) && defined(CONFIG_EXECMEM)
+static struct execmem_params execmem_params = {
+   .modules = {
+   .text = {
+   .start = MODULES_VADDR,
+   .end = MODULES_END,
+   .alignment = 1,
+   },
+   },
+};
+
+struct execmem_params __init *execmem_arch_params(void)
+{
+   execmem_params.modules.text.pgprot = PAGE_KERNEL_EXEC;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   execmem_params.modules.text.fallback_start = VMALLOC_START;
+   execmem_params.modules.text.fallback_end = VMALLOC_END;
+   }
+
+   return &execmem_params;
+}
+#endif
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c735afdf639b..962ad2d8b5e5 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -214,6 +214,14 @@ static inline bool kaslr_enabled(void)
return kaslr_offset() >= 

[PATCH v2 09/12] powerpc: extend execmem_params for kprobes allocations

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add definition of jit area to execmem_params to allow using the generic
kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.
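
With that, the generic alloc_insn_page() that powerpc now falls back to
presumably reduces to something like the following (assumed from the
series description, not shown in this mail):

	/* generic weak implementation, honoring execmem_params.jit.text */
	void __weak *alloc_insn_page(void)
	{
		return jit_text_alloc(PAGE_SIZE);
	}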

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/kprobes.c | 14 --
 arch/powerpc/kernel/module.c  | 13 +
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 5db8df5e3657..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long 
addr, unsigned long offse
return (kprobe_opcode_t *)(addr + offset);
 }
 
-void *alloc_insn_page(void)
-{
-   void *page;
-
-   page = jit_text_alloc(PAGE_SIZE);
-   if (!page)
-   return NULL;
-
-   if (strict_module_rwx_enabled())
-   set_memory_rox((unsigned long)page, 1);
-
-   return page;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 4c6c15bf3947..8e5b379d6da1 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -96,6 +96,11 @@ static struct execmem_params execmem_params = {
.alignment = 1,
},
},
+   .jit = {
+   .text = {
+   .alignment = 1,
+   },
+   },
 };
 
 
@@ -131,5 +136,13 @@ struct execmem_params __init *execmem_arch_params(void)
 
execmem_params.modules.text.pgprot = prot;
 
+   execmem_params.jit.text.start = VMALLOC_START;
+   execmem_params.jit.text.end = VMALLOC_END;
+
+   if (strict_module_rwx_enabled())
+   execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+   else
+   execmem_params.jit.text.pgprot = PAGE_KERNEL_EXEC;
+
return &execmem_params;
 }
-- 
2.35.1



[PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

RISC-V overrides kprobes::alloc_insn_page() to use the entire vmalloc area
rather than limit the allocations to the modules area.

Slightly reorder execmem_params initialization to support both 32 and 64
bit variants and add definition of jit area to execmem_params to support
the generic kprobes::alloc_insn_page().

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/riscv/kernel/module.c | 16 +++-
 arch/riscv/kernel/probes/kprobes.c | 10 --
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index ee5e04cd3f21..cca6ed4e9340 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -436,7 +436,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+#ifdef CONFIG_MMU
 static struct execmem_params execmem_params = {
.modules = {
.text = {
@@ -444,12 +444,26 @@ static struct execmem_params execmem_params = {
.alignment = 1,
},
},
+   .jit = {
+   .text = {
+   .pgprot = PAGE_KERNEL_READ_EXEC,
+   .alignment = 1,
+   },
+   },
 };
 
 struct execmem_params __init *execmem_arch_params(void)
 {
+#ifdef CONFIG_64BIT
execmem_params.modules.text.start = MODULES_VADDR;
execmem_params.modules.text.end = MODULES_END;
+#else
+   execmem_params.modules.text.start = VMALLOC_START;
+   execmem_params.modules.text.end = VMALLOC_END;
+#endif
+
+   execmem_params.jit.text.start = VMALLOC_START;
+   execmem_params.jit.text.end = VMALLOC_END;
 
return &execmem_params;
 }
diff --git a/arch/riscv/kernel/probes/kprobes.c 
b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
 }
 
-#ifdef CONFIG_MMU
-void *alloc_insn_page(void)
-{
-   return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
-VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-__builtin_return_address(0));
-}
-#endif
-
 /* install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
-- 
2.35.1



[PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

The memory allocations for kprobes on arm64 can be placed anywhere in
vmalloc address space and currently this is implemented with an override
of alloc_insn_page() in arm64.

Extend execmem_params with a range for generated code allocations and
make kprobes on arm64 use this extension rather than override
alloc_insn_page().

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm64/kernel/module.c |  9 +
 arch/arm64/kernel/probes/kprobes.c |  7 ---
 include/linux/execmem.h| 11 +++
 mm/execmem.c   | 14 +-
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index c3d999f3a3dd..52b09626bc0f 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -30,6 +30,13 @@ static struct execmem_params execmem_params = {
.alignment = MODULE_ALIGN,
},
},
+   .jit = {
+   .text = {
+   .start = VMALLOC_START,
+   .end = VMALLOC_END,
+   .alignment = 1,
+   },
+   },
 };
 
 struct execmem_params __init *execmem_arch_params(void)
@@ -40,6 +47,8 @@ struct execmem_params __init *execmem_arch_params(void)
execmem_params.modules.text.start = module_alloc_base;
execmem_params.modules.text.end = module_alloc_end;
 
+   execmem_params.jit.text.pgprot = PAGE_KERNEL_ROX;
+
/*
 * KASAN without KASAN_VMALLOC can only deal with module
 * allocations being served from the reserved module region,
diff --git a/arch/arm64/kernel/probes/kprobes.c 
b/arch/arm64/kernel/probes/kprobes.c
index 70b91a8c6bb3..6fccedd02b2a 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
 }
 
-void *alloc_insn_page(void)
-{
-   return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
-   NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 /* arm kprobe: install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 2e1221310d13..dc7c9a446111 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -52,12 +52,23 @@ struct execmem_modules_range {
struct execmem_range data;
 };
 
+/**
+ * struct execmem_jit_range - architecture parameters for address space
+ *   suitable for JIT code allocations
+ * @text:  address range for text allocations
+ */
+struct execmem_jit_range {
+   struct execmem_range text;
+};
+
 /**
 * struct execmem_params - architecture parameters for code allocations
  * @modules:   parameters for modules address space
+ * @jit:   parameters for jit memory address space
  */
 struct execmem_params {
struct execmem_modules_rangemodules;
+   struct execmem_jit_rangejit;
 };
 
 /**
diff --git a/mm/execmem.c b/mm/execmem.c
index f7bf496ad4c3..9730ecef9a30 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -89,7 +89,12 @@ void execmem_free(void *ptr)
 
 void *jit_text_alloc(size_t size)
 {
-   return execmem_text_alloc(size);
+   unsigned long start = execmem_params.jit.text.start;
+   unsigned long end = execmem_params.jit.text.end;
+   pgprot_t pgprot = execmem_params.jit.text.pgprot;
+   unsigned int align = execmem_params.jit.text.alignment;
+
+   return execmem_alloc(size, start, end, align, pgprot, 0, 0, false);
 }
 
 void jit_free(void *ptr)
@@ -135,6 +140,13 @@ static void execmem_init_missing(struct execmem_params *p)
execmem_params.modules.data.fallback_start = 
m->text.fallback_start;
execmem_params.modules.data.fallback_end = m->text.fallback_end;
}
+
+   if (!execmem_params.jit.text.start) {
+   execmem_params.jit.text.start = m->text.start;
+   execmem_params.jit.text.end = m->text.end;
+   execmem_params.jit.text.alignment = m->text.alignment;
+   execmem_params.jit.text.pgprot = m->text.pgprot;
+   }
 }
 
 void __init execmem_init(void)
-- 
2.35.1



[PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Data related to code allocations, such as module data sections, needs to
comply with architecture constraints for its placement, and until now its
allocation was done using execmem_text_alloc().

Create a dedicated API for allocating data related to code allocations
and allow architectures to define address ranges for data allocations.

Since currently this is only relevant for powerpc variants that use the
VMALLOC address space for module data allocations, automatically reuse
address ranges defined for text unless address range for data is
explicitly defined by an architecture.

With separation of code and data allocations, data sections of the
modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC, which
was the default on many architectures.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/module.c |  8 +++
 include/linux/execmem.h  | 18 +++
 kernel/module/main.c | 15 +++--
 mm/execmem.c | 43 
 4 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index ba7abff77d98..4c6c15bf3947 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -103,6 +103,10 @@ struct execmem_params __init *execmem_arch_params(void)
 {
pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : 
PAGE_KERNEL_EXEC;
 
+   /*
+* BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+* allow allocating data in the entire vmalloc space
+*/
 #ifdef MODULES_VADDR
unsigned long limit = (unsigned long)_etext - SZ_32M;
 
@@ -116,6 +120,10 @@ struct execmem_params __init *execmem_arch_params(void)
execmem_params.modules.text.start = MODULES_VADDR;
execmem_params.modules.text.end = MODULES_END;
}
+   execmem_params.modules.data.start = VMALLOC_START;
+   execmem_params.modules.data.end = VMALLOC_END;
+   execmem_params.modules.data.pgprot = PAGE_KERNEL;
+   execmem_params.modules.data.alignment = 1;
 #else
execmem_params.modules.text.start = VMALLOC_START;
execmem_params.modules.text.end = VMALLOC_END;
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index b9a97fcdf3c5..2e1221310d13 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -44,10 +44,12 @@ enum execmem_module_flags {
  *   space
  * @flags: options for module memory allocations
  * @text:  address range for text allocations
+ * @data:  address range for data allocations
  */
 struct execmem_modules_range {
enum execmem_module_flags flags;
struct execmem_range text;
+   struct execmem_range data;
 };
 
 /**
@@ -89,6 +91,22 @@ struct execmem_params *execmem_arch_params(void);
  */
 void *execmem_text_alloc(size_t size);
 
+/**
+ * execmem_data_alloc - allocate memory for data coupled to code
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain data coupled with executable code,
+ * like data sections in kernel modules.
+ *
+ * The memory will have protections defined by the architecture.
+ *
+ * The allocated memory will reside in an area that does not impose
+ * restrictions on the addressing modes.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_data_alloc(size_t size);
+
 /**
  * execmem_free - free executable memory
  * @ptr: pointer to the memory that should be freed
diff --git a/kernel/module/main.c b/kernel/module/main.c
index b445c5ad863a..d6582bfec1f6 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1195,25 +1195,16 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
-static bool mod_mem_use_vmalloc(enum mod_mem_type type)
-{
-   return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
-   mod_mem_type_is_core_data(type);
-}
-
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-   if (mod_mem_use_vmalloc(type))
-   return vzalloc(size);
+   if (mod_mem_type_is_data(type))
+   return execmem_data_alloc(size);
return execmem_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
 {
-   if (mod_mem_use_vmalloc(type))
-   vfree(ptr);
-   else
-   execmem_free(ptr);
+   execmem_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
diff --git a/mm/execmem.c b/mm/execmem.c
index a67acd75ffef..f7bf496ad4c3 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -63,6 +63,20 @@ void *execmem_text_alloc(size_t size)
 fallback_start, fallback_end, kasan);
 }
 
+void *execmem_data_alloc(size_t size)
+{
+   unsigned long start = execmem_params.modules.data.start;
+   unsigned long end = execmem_params.modules.data.end;
+   pgprot_t pgprot = 
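A sketch of how execmem_data_alloc() plausibly continues, modeled on
execmem_text_alloc() and jit_text_alloc() above; the body below is an
inference, not verbatim patch text:

	void *execmem_data_alloc(size_t size)
	{
		unsigned long start = execmem_params.modules.data.start;
		unsigned long end = execmem_params.modules.data.end;
		pgprot_t pgprot = execmem_params.modules.data.pgprot;
		unsigned int align = execmem_params.modules.data.alignment;

		/* Assumed: data allocations request no KASAN shadow and no
		 * fallback range, mirroring the jit_text_alloc() call above.
		 */
		return execmem_alloc(size, start, end, align, pgprot, 0, 0, false);
	}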

[PATCH v2 05/12] modules, execmem: drop module_alloc

2023-06-16 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Define default parameters for the address range used for code
allocations, taking the current values from module_alloc(), and make
execmem_text_alloc() use these defaults when an architecture does not
supply its own parameters.

With this, execmem_text_alloc() implements memory allocation in a way
compatible with module_alloc() and can be used as a replacement for
module_alloc().

Signed-off-by: Mike Rapoport (IBM) 
---
 include/linux/execmem.h  |  8 
 include/linux/moduleloader.h | 12 
 kernel/module/main.c |  7 ---
 mm/execmem.c | 12 
 4 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 68b2bfc79993..b9a97fcdf3c5 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -4,6 +4,14 @@
 
 #include 
 
+#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
+   !defined(CONFIG_KASAN_VMALLOC)
+#include <linux/kasan.h>
+#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define MODULE_ALIGN PAGE_SIZE
+#endif
+
 /**
  * struct execmem_range - definition of a memory range suitable for code and
  *   related data allocations
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index b3374342f7af..4321682fe849 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
 /* Additional bytes needed by arch in front of individual sections */
unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
 
-/* Allocator used for allocating struct module, core sections and init
-   sections.  Returns NULL on failure. */
-void *module_alloc(unsigned long size);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
@@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
 /* Any cleanup before freeing mod->module_init */
 void module_arch_freeing_init(struct module *mod);
 
-#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
-   !defined(CONFIG_KASAN_VMALLOC)
-#include <linux/kasan.h>
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-#else
-#define MODULE_ALIGN PAGE_SIZE
-#endif
-
 #endif
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 43810a3bdb81..b445c5ad863a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
}
 }
 
-void * __weak module_alloc(unsigned long size)
-{
-   return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-   NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 bool __weak module_init_section(const char *name)
 {
return strstarts(name, ".init");
diff --git a/mm/execmem.c b/mm/execmem.c
index 2fe36dcc7bdf..a67acd75ffef 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -59,9 +59,6 @@ void *execmem_text_alloc(size_t size)
unsigned long fallback_end = execmem_params.modules.text.fallback_end;
bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
 
-   if (!execmem_params.modules.text.start)
-   return module_alloc(size);
-
return execmem_alloc(size, start, end, align, pgprot,
 fallback_start, fallback_end, kasan);
 }
@@ -108,8 +105,15 @@ void __init execmem_init(void)
 {
struct execmem_params *p = execmem_arch_params();
 
-   if (!p)
+   if (!p) {
+   p = &execmem_params;
+   p->modules.text.start = VMALLOC_START;
+   p->modules.text.end = VMALLOC_END;
+   p->modules.text.pgprot = PAGE_KERNEL_EXEC;
+   p->modules.text.alignment = 1;
+
return;
+   }
 
if (!execmem_validate_params(p))
return;
-- 
2.35.1
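
For reference, with the defaults installed by execmem_init() above,
execmem_text_alloc() presumably reduces to the very call the removed
__weak module_alloc() made. A sketch under that assumption, treating
execmem_alloc() as a thin __vmalloc_node_range() wrapper when no
fallback range or KASAN shadow is requested; the function name is
hypothetical:

	#include <linux/vmalloc.h>

	/* Hypothetical: what the default parameters (VMALLOC_START to
	 * VMALLOC_END, PAGE_KERNEL_EXEC, alignment 1) boil down to,
	 * matching the deleted __weak module_alloc().
	 */
	static void *default_text_alloc(size_t size)
	{
		return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
					    GFP_KERNEL, PAGE_KERNEL_EXEC,
					    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
					    __builtin_return_address(0));
	}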


