Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT
Le 02/03/2022 à 04:22, Anshuman Khandual a écrit :
>
> On 3/1/22 1:46 PM, Christophe Leroy wrote:
>>
>> Le 01/03/2022 à 01:31, Russell King (Oracle) a écrit :
>>> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>>>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>>>> This defines and exports a platform specific custom vm_get_page_prot()
>>>>>> via subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and
>>>>>> __PXXX macros can be dropped which are no longer needed.
>>>>>
>>>>> What I would really like to know is why having to run _code_ to work out
>>>>> what the page protections need to be is better than looking it up in a
>>>>> table.
>>>>>
>>>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>>>> additional code size with it.
>>>>>
>>>>> I'm struggling to see what the benefit is.
>>>>
>>>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>>>> protection values. Although that is being run in the core MM and from a
>>>> platform perspective __SXXX, __PXXX are just being exported for a table.
>>>> Looking it up in a table (and applying more constructs thereafter) is
>>>> not much different than a clean switch case statement in terms of CPU
>>>> usage. So this is not more expensive in terms of CPU cycles.
>>>
>>> I disagree.
>>
>> So do I.
>>
>>> However, let's base this disagreement on some evidence. Here is the
>>> present 32-bit ARM implementation:
>>>
>>> 00000048 <vm_get_page_prot>:
>>>   48: e200000f   and   r0, r0, #15
>>>   4c: e3003000   movw  r3, #0
>>>         4c: R_ARM_MOVW_ABS_NC .LANCHOR1
>>>   50: e3403000   movt  r3, #0
>>>         50: R_ARM_MOVT_ABS .LANCHOR1
>>>   54: e7930100   ldr   r0, [r3, r0, lsl #2]
>>>   58: e12fff1e   bx    lr
>>>
>>> That is five instructions long.
>>
>> On ppc32 I get:
>>
>> 00000094 <vm_get_page_prot>:
>>   94: 3d 20 00 00   lis    r9,0
>>         96: R_PPC_ADDR16_HA .data..ro_after_init
>>   98: 54 84 16 ba   rlwinm r4,r4,2,26,29
>>   9c: 39 29 00 00   addi   r9,r9,0
>>         9e: R_PPC_ADDR16_LO .data..ro_after_init
>>   a0: 7d 29 20 2e   lwzx   r9,r9,r4
>>   a4: 91 23 00 00   stw    r9,0(r3)
>>   a8: 4e 80 00 20   blr
>>
>>> Please show that your new implementation is not more expensive on
>>> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
>>> the disassembly.
>>
>> With your series I get:
>>
>> 00000000 <vm_get_page_prot>:
>>    0: 3d 20 00 00   lis    r9,0
>>         2: R_PPC_ADDR16_HA .rodata
>>    4: 39 29 00 00   addi   r9,r9,0
>>         6: R_PPC_ADDR16_LO .rodata
>>    8: 54 84 16 ba   rlwinm r4,r4,2,26,29
>>    c: 7d 49 20 2e   lwzx   r10,r9,r4
>>   10: 7d 4a 4a 14   add    r10,r10,r9
>>   14: 7d 49 03 a6   mtctr  r10
>>   18: 4e 80 04 20   bctr
>>   1c: 39 20 03 15   li     r9,789
>>   20: 91 23 00 00   stw    r9,0(r3)
>>   24: 4e 80 00 20   blr
>>   28: 39 20 01 15   li     r9,277
>>   2c: 91 23 00 00   stw    r9,0(r3)
>>   30: 4e 80 00 20   blr
>>   34: 39 20 07 15   li     r9,1813
>>   38: 91 23 00 00   stw    r9,0(r3)
>>   3c: 4e 80 00 20   blr
>>   40: 39 20 05 15   li     r9,1301
>>   44: 91 23 00 00   stw    r9,0(r3)
>>   48: 4e 80 00 20   blr
>>   4c: 39 20 01 11   li     r9,273
>>   50: 4b ff ff d0   b      20
>>
>> That is definitely more expensive, it implements a table of branches.
>
> Okay, will split out the PPC32 implementation that retains existing
> table look-up method. Also planning to keep that inside same file
> (arch/powerpc/mm/mmap.c), unless you have a different preference.

My point was not to get something specific for PPC32, but to amplify on
Russell's objection. As this is bad for ARM and bad for PPC32, do we have
any evidence that your change is good for any other architecture ?

I checked PPC64 and there is exactly the same drawback. With the current
implementation it is a small function performing a table read then a few
adjustments. After your change it is a bigger function implementing a
table of branches.

So, as requested by Russell, could you look at the disassembly for other
architectures and show us that ARM and POWERPC are the only ones for
which your change is not optimal ?

See below the difference for POWERPC64.

Current implementation:

0a60 <.vm_get_page_prot>:
 a60: 3d 42 00 00   addis r10,r2,0
       a62: R_PPC64_TOC16_HA .data..ro_after_init
 a64: 78 89 1e 68   rldic r9,r4,3,57
 a68: 39 4a 00 00   addi
Re: [PATCH V3 04/30] powerpc/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT
Anshuman Khandual writes:
> This defines and exports a platform specific custom vm_get_page_prot() via
> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> macros can be dropped which are no longer needed. While here, this also
> localizes arch_vm_get_page_prot() as powerpc_vm_get_page_prot() and moves
> it near vm_get_page_prot().
>
> Cc: Michael Ellerman
> Cc: Paul Mackerras
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual
> ---
>  arch/powerpc/Kconfig               |  1 +
>  arch/powerpc/include/asm/mman.h    | 12 --
>  arch/powerpc/include/asm/pgtable.h | 19 --
>  arch/powerpc/mm/mmap.c             | 59 ++
>  4 files changed, 60 insertions(+), 31 deletions(-)

Acked-by: Michael Ellerman (powerpc)

cheers

___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT
On 3/1/22 1:46 PM, Christophe Leroy wrote:
>
> Le 01/03/2022 à 01:31, Russell King (Oracle) a écrit :
>> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>>> This defines and exports a platform specific custom vm_get_page_prot()
>>>>> via subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and
>>>>> __PXXX macros can be dropped which are no longer needed.
>>>>
>>>> What I would really like to know is why having to run _code_ to work out
>>>> what the page protections need to be is better than looking it up in a
>>>> table.
>>>>
>>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>>> additional code size with it.
>>>>
>>>> I'm struggling to see what the benefit is.
>>>
>>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>>> protection values. Although that is being run in the core MM and from a
>>> platform perspective __SXXX, __PXXX are just being exported for a table.
>>> Looking it up in a table (and applying more constructs thereafter) is
>>> not much different than a clean switch case statement in terms of CPU
>>> usage. So this is not more expensive in terms of CPU cycles.
>>
>> I disagree.
>
> So do I.
>
>> However, let's base this disagreement on some evidence. Here is the
>> present 32-bit ARM implementation:
>>
>> 00000048 <vm_get_page_prot>:
>>   48: e200000f   and   r0, r0, #15
>>   4c: e3003000   movw  r3, #0
>>         4c: R_ARM_MOVW_ABS_NC .LANCHOR1
>>   50: e3403000   movt  r3, #0
>>         50: R_ARM_MOVT_ABS .LANCHOR1
>>   54: e7930100   ldr   r0, [r3, r0, lsl #2]
>>   58: e12fff1e   bx    lr
>>
>> That is five instructions long.
>
> On ppc32 I get:
>
> 00000094 <vm_get_page_prot>:
>   94: 3d 20 00 00   lis    r9,0
>         96: R_PPC_ADDR16_HA .data..ro_after_init
>   98: 54 84 16 ba   rlwinm r4,r4,2,26,29
>   9c: 39 29 00 00   addi   r9,r9,0
>         9e: R_PPC_ADDR16_LO .data..ro_after_init
>   a0: 7d 29 20 2e   lwzx   r9,r9,r4
>   a4: 91 23 00 00   stw    r9,0(r3)
>   a8: 4e 80 00 20   blr
>
>> Please show that your new implementation is not more expensive on
>> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
>> the disassembly.
>
> With your series I get:
>
> 00000000 <vm_get_page_prot>:
>    0: 3d 20 00 00   lis    r9,0
>         2: R_PPC_ADDR16_HA .rodata
>    4: 39 29 00 00   addi   r9,r9,0
>         6: R_PPC_ADDR16_LO .rodata
>    8: 54 84 16 ba   rlwinm r4,r4,2,26,29
>    c: 7d 49 20 2e   lwzx   r10,r9,r4
>   10: 7d 4a 4a 14   add    r10,r10,r9
>   14: 7d 49 03 a6   mtctr  r10
>   18: 4e 80 04 20   bctr
>   1c: 39 20 03 15   li     r9,789
>   20: 91 23 00 00   stw    r9,0(r3)
>   24: 4e 80 00 20   blr
>   28: 39 20 01 15   li     r9,277
>   2c: 91 23 00 00   stw    r9,0(r3)
>   30: 4e 80 00 20   blr
>   34: 39 20 07 15   li     r9,1813
>   38: 91 23 00 00   stw    r9,0(r3)
>   3c: 4e 80 00 20   blr
>   40: 39 20 05 15   li     r9,1301
>   44: 91 23 00 00   stw    r9,0(r3)
>   48: 4e 80 00 20   blr
>   4c: 39 20 01 11   li     r9,273
>   50: 4b ff ff d0   b      20
>
> That is definitely more expensive, it implements a table of branches.

Okay, will split out the PPC32 implementation that retains existing
table look-up method. Also planning to keep that inside same file
(arch/powerpc/mm/mmap.c), unless you have a different preference.
Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT
On 3/1/22 6:01 AM, Russell King (Oracle) wrote:
> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>> This defines and exports a platform specific custom vm_get_page_prot()
>>>> via subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and
>>>> __PXXX macros can be dropped which are no longer needed.
>>>
>>> What I would really like to know is why having to run _code_ to work out
>>> what the page protections need to be is better than looking it up in a
>>> table.
>>>
>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>> additional code size with it.
>>>
>>> I'm struggling to see what the benefit is.
>>
>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>> protection values. Although that is being run in the core MM and from a
>> platform perspective __SXXX, __PXXX are just being exported for a table.
>> Looking it up in a table (and applying more constructs thereafter) is
>> not much different than a clean switch case statement in terms of CPU
>> usage. So this is not more expensive in terms of CPU cycles.
>
> I disagree.
>
> However, let's base this disagreement on some evidence. Here is the
> present 32-bit ARM implementation:
>
> 00000048 <vm_get_page_prot>:
>   48: e200000f   and   r0, r0, #15
>   4c: e3003000   movw  r3, #0
>         4c: R_ARM_MOVW_ABS_NC .LANCHOR1
>   50: e3403000   movt  r3, #0
>         50: R_ARM_MOVT_ABS .LANCHOR1
>   54: e7930100   ldr   r0, [r3, r0, lsl #2]
>   58: e12fff1e   bx    lr
>
> That is five instructions long.
>
> Please show that your new implementation is not more expensive on
> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
> the disassembly.
>
> I think you will find way more than five instructions in your version -
> the compiler will have to issue code to decode the protection bits,
> probably using a table of branches or absolute PC values, or possibly
> the worst case of using multiple comparisons and branches. It then has
> to load constants that may be moved using movw on ARMv7, but on
> older architectures would have to be created from multiple instructions
> or loaded from the literal pool. Then there'll be instructions to load
> the address of "user_pgprot", retrieve its value, and bitwise-or that.
>
> Therefore, I fail to see how your approach of getting rid of the table
> is somehow "better" than what we currently have in terms of the effect
> on the resulting code.
>
> If you don't like the __P and __S stuff and two arch_* hooks, you could
> move the table into arch code along with vm_get_page_prot() without the
> additional unnecessary hooks, while keeping all the benefits of the
> table lookup.

Okay, will change the arm's vm_get_page_prot() implementation as
suggested.
Re: [PATCH V3 19/30] csky/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT
Acked-by: Guo Ren

On Mon, Feb 28, 2022 at 7:10 PM Anshuman Khandual wrote:
>
> This defines and exports a platform specific custom vm_get_page_prot() via
> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> macros can be dropped which are no longer needed.
>
> Cc: Geert Uytterhoeven
> Cc: linux-c...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual
> ---
>  arch/csky/Kconfig               |  1 +
>  arch/csky/include/asm/pgtable.h | 18 --
>  arch/csky/mm/init.c             | 32 
>  3 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
> index 132f43f12dd8..209dac5686dd 100644
> --- a/arch/csky/Kconfig
> +++ b/arch/csky/Kconfig
> @@ -6,6 +6,7 @@ config CSKY
>         select ARCH_HAS_GCOV_PROFILE_ALL
>         select ARCH_HAS_SYNC_DMA_FOR_CPU
>         select ARCH_HAS_SYNC_DMA_FOR_DEVICE
> +       select ARCH_HAS_VM_GET_PAGE_PROT
>         select ARCH_USE_BUILTIN_BSWAP
>         select ARCH_USE_QUEUED_RWLOCKS
>         select ARCH_WANT_FRAME_POINTERS if !CPU_CK610 && $(cc-option,-mbacktrace)
> diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
> index 151607ed5158..2c6b1cfb1cce 100644
> --- a/arch/csky/include/asm/pgtable.h
> +++ b/arch/csky/include/asm/pgtable.h
> @@ -76,24 +76,6 @@
>  #define MAX_SWAPFILES_CHECK() \
>         BUILD_BUG_ON(MAX_SWAPFILES_SHIFT != 5)
>
> -#define __P000 PAGE_NONE
> -#define __P001 PAGE_READ
> -#define __P010 PAGE_READ
> -#define __P011 PAGE_READ
> -#define __P100 PAGE_READ
> -#define __P101 PAGE_READ
> -#define __P110 PAGE_READ
> -#define __P111 PAGE_READ
> -
> -#define __S000 PAGE_NONE
> -#define __S001 PAGE_READ
> -#define __S010 PAGE_WRITE
> -#define __S011 PAGE_WRITE
> -#define __S100 PAGE_READ
> -#define __S101 PAGE_READ
> -#define __S110 PAGE_WRITE
> -#define __S111 PAGE_WRITE
> -
>  extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>  #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
>
> diff --git a/arch/csky/mm/init.c b/arch/csky/mm/init.c
> index bf2004aa811a..f9babbed17d4 100644
> --- a/arch/csky/mm/init.c
> +++ b/arch/csky/mm/init.c
> @@ -197,3 +197,35 @@ void __init fixaddr_init(void)
>         vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
>         fixrange_init(vaddr, vaddr + PMD_SIZE, swapper_pg_dir);
>  }
> +
> +pgprot_t vm_get_page_prot(unsigned long vm_flags)
> +{
> +       switch (vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) {
> +       case VM_NONE:
> +               return PAGE_NONE;
> +       case VM_READ:
> +       case VM_WRITE:
> +       case VM_WRITE | VM_READ:
> +       case VM_EXEC:
> +       case VM_EXEC | VM_READ:
> +       case VM_EXEC | VM_WRITE:
> +       case VM_EXEC | VM_WRITE | VM_READ:
> +               return PAGE_READ;
> +       case VM_SHARED:
> +               return PAGE_NONE;
> +       case VM_SHARED | VM_READ:
> +               return PAGE_READ;
> +       case VM_SHARED | VM_WRITE:
> +       case VM_SHARED | VM_WRITE | VM_READ:
> +               return PAGE_WRITE;
> +       case VM_SHARED | VM_EXEC:
> +       case VM_SHARED | VM_EXEC | VM_READ:
> +               return PAGE_READ;
> +       case VM_SHARED | VM_EXEC | VM_WRITE:
> +       case VM_SHARED | VM_EXEC | VM_WRITE | VM_READ:
> +               return PAGE_WRITE;
> +       default:
> +               BUILD_BUG();
> +       }
> +}
> +EXPORT_SYMBOL(vm_get_page_prot);
> --
> 2.25.1
>

--
Best Regards
Guo Ren

ML: https://lore.kernel.org/linux-csky/
Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT
Le 01/03/2022 à 01:31, Russell King (Oracle) a écrit :
> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>> This defines and exports a platform specific custom vm_get_page_prot()
>>>> via subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and
>>>> __PXXX macros can be dropped which are no longer needed.
>>>
>>> What I would really like to know is why having to run _code_ to work out
>>> what the page protections need to be is better than looking it up in a
>>> table.
>>>
>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>> additional code size with it.
>>>
>>> I'm struggling to see what the benefit is.
>>
>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>> protection values. Although that is being run in the core MM and from a
>> platform perspective __SXXX, __PXXX are just being exported for a table.
>> Looking it up in a table (and applying more constructs thereafter) is
>> not much different than a clean switch case statement in terms of CPU
>> usage. So this is not more expensive in terms of CPU cycles.
>
> I disagree.

So do I.

> However, let's base this disagreement on some evidence. Here is the
> present 32-bit ARM implementation:
>
> 00000048 <vm_get_page_prot>:
>   48: e200000f   and   r0, r0, #15
>   4c: e3003000   movw  r3, #0
>         4c: R_ARM_MOVW_ABS_NC .LANCHOR1
>   50: e3403000   movt  r3, #0
>         50: R_ARM_MOVT_ABS .LANCHOR1
>   54: e7930100   ldr   r0, [r3, r0, lsl #2]
>   58: e12fff1e   bx    lr
>
> That is five instructions long.

On ppc32 I get:

00000094 <vm_get_page_prot>:
  94: 3d 20 00 00   lis    r9,0
        96: R_PPC_ADDR16_HA .data..ro_after_init
  98: 54 84 16 ba   rlwinm r4,r4,2,26,29
  9c: 39 29 00 00   addi   r9,r9,0
        9e: R_PPC_ADDR16_LO .data..ro_after_init
  a0: 7d 29 20 2e   lwzx   r9,r9,r4
  a4: 91 23 00 00   stw    r9,0(r3)
  a8: 4e 80 00 20   blr

> Please show that your new implementation is not more expensive on
> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
> the disassembly.

With your series I get:

00000000 <vm_get_page_prot>:
   0: 3d 20 00 00   lis    r9,0
        2: R_PPC_ADDR16_HA .rodata
   4: 39 29 00 00   addi   r9,r9,0
        6: R_PPC_ADDR16_LO .rodata
   8: 54 84 16 ba   rlwinm r4,r4,2,26,29
   c: 7d 49 20 2e   lwzx   r10,r9,r4
  10: 7d 4a 4a 14   add    r10,r10,r9
  14: 7d 49 03 a6   mtctr  r10
  18: 4e 80 04 20   bctr
  1c: 39 20 03 15   li     r9,789
  20: 91 23 00 00   stw    r9,0(r3)
  24: 4e 80 00 20   blr
  28: 39 20 01 15   li     r9,277
  2c: 91 23 00 00   stw    r9,0(r3)
  30: 4e 80 00 20   blr
  34: 39 20 07 15   li     r9,1813
  38: 91 23 00 00   stw    r9,0(r3)
  3c: 4e 80 00 20   blr
  40: 39 20 05 15   li     r9,1301
  44: 91 23 00 00   stw    r9,0(r3)
  48: 4e 80 00 20   blr
  4c: 39 20 01 11   li     r9,273
  50: 4b ff ff d0   b      20

That is definitely more expensive, it implements a table of branches.

Christophe