Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Christophe Leroy


On 02/03/2022 at 04:22, Anshuman Khandual wrote:
> 
> 
> On 3/1/22 1:46 PM, Christophe Leroy wrote:
>>
>>
>> On 01/03/2022 at 01:31, Russell King (Oracle) wrote:
>>> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>>>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>>>> This defines and exports a platform specific custom vm_get_page_prot() via
>>>>>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>>>>>> macros can be dropped which are no longer needed.
>>>>>
>>>>> What I would really like to know is why having to run _code_ to work out
>>>>> what the page protections need to be is better than looking it up in a
>>>>> table.
>>>>>
>>>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>>>> additional code size with it.
>>>>>
>>>>> I'm struggling to see what the benefit is.

>>>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>>>> protection values. Although that is being run in the core MM and from a
>>>> platform perspective __SXXX, __PXXX are just being exported for a table.
>>>> Looking it up in a table (and applying more constructs thereafter) is
>>>> not much different than a clean switch case statement in terms of CPU
>>>> usage. So this is not more expensive in terms of CPU cycles.
>>>
>>> I disagree.
>>
>> So do I.
>>
>>>
>>> However, let's base this disagreement on some evidence. Here is the
>>> present 32-bit ARM implementation:
>>>
>>> 0048 <vm_get_page_prot>:
>>>   48:   e200000f    and     r0, r0, #15
>>>   4c:   e3003000    movw    r3, #0
>>>             4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
>>>   50:   e3403000    movt    r3, #0
>>>             50: R_ARM_MOVT_ABS  .LANCHOR1
>>>   54:   e7930100    ldr     r0, [r3, r0, lsl #2]
>>>   58:   e12fff1e    bx      lr
>>>
>>> That is five instructions long.
>>
>> On ppc32 I get:
>>
>> 0094 <vm_get_page_prot>:
>>   94:  3d 20 00 00     lis     r9,0
>>             96: R_PPC_ADDR16_HA .data..ro_after_init
>>   98:  54 84 16 ba     rlwinm  r4,r4,2,26,29
>>   9c:  39 29 00 00     addi    r9,r9,0
>>             9e: R_PPC_ADDR16_LO .data..ro_after_init
>>   a0:  7d 29 20 2e     lwzx    r9,r9,r4
>>   a4:  91 23 00 00     stw     r9,0(r3)
>>   a8:  4e 80 00 20     blr
>>
>>
>>>
>>> Please show that your new implementation is not more expensive on
>>> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
>>> the disassembly.
>>
>> With your series I get:
>>
>> 00000000 <vm_get_page_prot>:
>>    0:  3d 20 00 00     lis     r9,0
>>             2: R_PPC_ADDR16_HA  .rodata
>>    4:  39 29 00 00     addi    r9,r9,0
>>             6: R_PPC_ADDR16_LO  .rodata
>>    8:  54 84 16 ba     rlwinm  r4,r4,2,26,29
>>    c:  7d 49 20 2e     lwzx    r10,r9,r4
>>   10:  7d 4a 4a 14     add     r10,r10,r9
>>   14:  7d 49 03 a6     mtctr   r10
>>   18:  4e 80 04 20     bctr
>>   1c:  39 20 03 15     li      r9,789
>>   20:  91 23 00 00     stw     r9,0(r3)
>>   24:  4e 80 00 20     blr
>>   28:  39 20 01 15     li      r9,277
>>   2c:  91 23 00 00     stw     r9,0(r3)
>>   30:  4e 80 00 20     blr
>>   34:  39 20 07 15     li      r9,1813
>>   38:  91 23 00 00     stw     r9,0(r3)
>>   3c:  4e 80 00 20     blr
>>   40:  39 20 05 15     li      r9,1301
>>   44:  91 23 00 00     stw     r9,0(r3)
>>   48:  4e 80 00 20     blr
>>   4c:  39 20 01 11     li      r9,273
>>   50:  4b ff ff d0     b       20 <vm_get_page_prot+0x20>
>>
>>
>> That is definitely more expensive: it implements a table of branches.
> 
> Okay, will split out the PPC32 implementation so that it retains the
> existing table lookup method. Also planning to keep that inside the same
> file (arch/powerpc/mm/mmap.c), unless you have a different preference.

My point was not to get something specific for PPC32, but to amplify 
Russell's objection.

As this is bad for ARM and bad for PPC32, do we have any evidence that 
your change is good for any other architecture?

I checked PPC64 and there is exactly the same drawback. With the current 
implementation it is a small function performing a table read and then a 
few adjustments. After your change it is a bigger function implementing a 
table of branches.
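
To make the two shapes concrete, here is a minimal stand-alone C sketch of
what is being compared. The types, flag values and table contents below are
simplified placeholders, not the real arm or powerpc definitions:

/* Illustrative sketch only: simplified types and placeholder protection
 * values, not the real arm or powerpc definitions. */
#include <stdio.h>
#include <stdint.h>

typedef uint32_t pgprot_t;

#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* Shape 1, the current scheme: one indexed load from a 16-entry table. */
static const pgprot_t protection_map[16] = {
        /* __P000..__P111 then __S000..__S111 (placeholder values) */
        0x000, 0x111, 0x111, 0x111, 0x511, 0x511, 0x511, 0x511,
        0x000, 0x111, 0x311, 0x311, 0x511, 0x511, 0x711, 0x711,
};

static pgprot_t table_vm_get_page_prot(unsigned long vm_flags)
{
        return protection_map[vm_flags &
                        (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];
}

/* Shape 2, the proposed scheme: a switch that the compiler typically
 * lowers to the "table of branches" seen in the ppc32 disassembly. */
static pgprot_t switch_vm_get_page_prot(unsigned long vm_flags)
{
        switch (vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) {
        case 0:
                return 0x000;
        case VM_READ:
        case VM_WRITE:
        case VM_WRITE | VM_READ:
                return 0x111;
        /* ... one case (or fall-through group) per combination ... */
        default:
                return 0x000;
        }
}

int main(void)
{
        /* The two shapes must agree for every flag combination. */
        printf("%x %x\n", table_vm_get_page_prot(VM_READ),
               switch_vm_get_page_prot(VM_READ));
        return 0;
}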

So, as requested by Russell, could you look at the disassembly for other 
architectures and show us that ARM and POWERPC are the only ones for 
which your change is not optimal?

See below the difference for POWERPC64.

Current implementation:

0a60 <.vm_get_page_prot>:
  a60:  3d 42 00 00     addis   r10,r2,0
            a62: R_PPC64_TOC16_HA   .data..ro_after_init
  a64:  78 89 1e 68     rldic   r9,r4,3,57
  a68:  39 4a 00 00     addi    r10,r10,0

Re: [PATCH V3 04/30] powerpc/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Michael Ellerman
Anshuman Khandual  writes:
> This defines and exports a platform specific custom vm_get_page_prot() via
> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> macros can be dropped which are no longer needed. While here, this also
> localizes arch_vm_get_page_prot() as powerpc_vm_get_page_prot() and moves
> it near vm_get_page_prot().
>
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/powerpc/Kconfig   |  1 +
>  arch/powerpc/include/asm/mman.h| 12 --
>  arch/powerpc/include/asm/pgtable.h | 19 --
>  arch/powerpc/mm/mmap.c | 59 ++
>  4 files changed, 60 insertions(+), 31 deletions(-)

Acked-by: Michael Ellerman  (powerpc)

cheers



Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Anshuman Khandual


On 3/1/22 1:46 PM, Christophe Leroy wrote:
> 
> 
> On 01/03/2022 at 01:31, Russell King (Oracle) wrote:
>> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>>> This defines and exports a platform specific custom vm_get_page_prot() via
>>>>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>>>>> macros can be dropped which are no longer needed.
>>>>
>>>> What I would really like to know is why having to run _code_ to work out
>>>> what the page protections need to be is better than looking it up in a
>>>> table.
>>>>
>>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>>> additional code size with it.
>>>>
>>>> I'm struggling to see what the benefit is.
>>>
>>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>>> protection values. Although that is being run in the core MM and from a
>>> platform perspective __SXXX, __PXXX are just being exported for a table.
>>> Looking it up in a table (and applying more constructs thereafter) is
>>> not much different than a clean switch case statement in terms of CPU
>>> usage. So this is not more expensive in terms of CPU cycles.
>>
>> I disagree.
> 
> So do I.
> 
>>
>> However, let's base this disagreement on some evidence. Here is the
>> present 32-bit ARM implementation:
>>
>> 0048 <vm_get_page_prot>:
>>   48:   e200000f    and     r0, r0, #15
>>   4c:   e3003000    movw    r3, #0
>>             4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
>>   50:   e3403000    movt    r3, #0
>>             50: R_ARM_MOVT_ABS  .LANCHOR1
>>   54:   e7930100    ldr     r0, [r3, r0, lsl #2]
>>   58:   e12fff1e    bx      lr
>>
>> That is five instructions long.
> 
> On ppc32 I get:
> 
> 0094 <vm_get_page_prot>:
>   94:  3d 20 00 00     lis     r9,0
>             96: R_PPC_ADDR16_HA .data..ro_after_init
>   98:  54 84 16 ba     rlwinm  r4,r4,2,26,29
>   9c:  39 29 00 00     addi    r9,r9,0
>             9e: R_PPC_ADDR16_LO .data..ro_after_init
>   a0:  7d 29 20 2e     lwzx    r9,r9,r4
>   a4:  91 23 00 00     stw     r9,0(r3)
>   a8:  4e 80 00 20     blr
> 
> 
>>
>> Please show that your new implementation is not more expensive on
>> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
>> the disassembly.
> 
> With your series I get:
> 
> 00000000 <vm_get_page_prot>:
>    0:  3d 20 00 00     lis     r9,0
>             2: R_PPC_ADDR16_HA  .rodata
>    4:  39 29 00 00     addi    r9,r9,0
>             6: R_PPC_ADDR16_LO  .rodata
>    8:  54 84 16 ba     rlwinm  r4,r4,2,26,29
>    c:  7d 49 20 2e     lwzx    r10,r9,r4
>   10:  7d 4a 4a 14     add     r10,r10,r9
>   14:  7d 49 03 a6     mtctr   r10
>   18:  4e 80 04 20     bctr
>   1c:  39 20 03 15     li      r9,789
>   20:  91 23 00 00     stw     r9,0(r3)
>   24:  4e 80 00 20     blr
>   28:  39 20 01 15     li      r9,277
>   2c:  91 23 00 00     stw     r9,0(r3)
>   30:  4e 80 00 20     blr
>   34:  39 20 07 15     li      r9,1813
>   38:  91 23 00 00     stw     r9,0(r3)
>   3c:  4e 80 00 20     blr
>   40:  39 20 05 15     li      r9,1301
>   44:  91 23 00 00     stw     r9,0(r3)
>   48:  4e 80 00 20     blr
>   4c:  39 20 01 11     li      r9,273
>   50:  4b ff ff d0     b       20 <vm_get_page_prot+0x20>
> 
> 
> That is definitely more expensive: it implements a table of branches.

Okay, will split out the PPC32 implementation so that it retains the
existing table lookup method. Also planning to keep that inside the same
file (arch/powerpc/mm/mmap.c), unless you have a different preference.



Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Anshuman Khandual



On 3/1/22 6:01 AM, Russell King (Oracle) wrote:
> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>> This defines and exports a platform specific custom vm_get_page_prot() via
>>>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>>>> macros can be dropped which are no longer needed.
>>> What I would really like to know is why having to run _code_ to work out
>>> what the page protections need to be is better than looking it up in a
>>> table.
>>>
>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>> additional code size with it.
>>>
>>> I'm struggling to see what the benefit is.
>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>> protection values. Although that is being run in the core MM and from a
>> platform perspective __SXXX, __PXXX are just being exported for a table.
>> Looking it up in a table (and applying more constructs thereafter) is
>> not much different than a clean switch case statement in terms of CPU
>> usage. So this is not more expensive in terms of CPU cycles.
> I disagree.
> 
> However, let's base this disagreement on some evidence. Here is the
> present 32-bit ARM implementation:
> 
> 0048 <vm_get_page_prot>:
>   48:   e200000f    and     r0, r0, #15
>   4c:   e3003000    movw    r3, #0
>             4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
>   50:   e3403000    movt    r3, #0
>             50: R_ARM_MOVT_ABS  .LANCHOR1
>   54:   e7930100    ldr     r0, [r3, r0, lsl #2]
>   58:   e12fff1e    bx      lr
> 
> That is five instructions long.
> 
> Please show that your new implementation is not more expensive on
> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
> the disassembly.
> 
> I think you will find way more than five instructions in your version -
> the compiler will have to issue code to decode the protection bits,
> probably using a table of branches or absolute PC values, or possibly
> the worst case of using multiple comparisons and branches. It then has
> to load constants that may be moved using movw on ARMv7, but on
> older architectures would have to be created from multiple instructions
> or loaded from the literal pool. Then there'll be instructions to load
> the address of "user_pgprot", retrieve its value, and bitwise or that.
> 
> Therefore, I fail to see how your approach of getting rid of the table
> is somehow "better" than what we currently have in terms of the effect
> on the resulting code.
> 
> If you don't like the __P and __S stuff and two arch_* hooks, you could
> move the table into arch code along with vm_get_page_prot() without the
> additional unnecessary hooks, while keeping all the benefits of the
> table lookup.
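
A minimal sketch of that suggestion, with placeholder table entries and a
hypothetical user_prot_extra variable standing in for whatever the arch
folds in (e.g. arm's user_pgprot bits):

/* Sketch of the suggestion above: keep the 16-entry table but make it
 * arch-private, and let the arch define vm_get_page_prot() directly,
 * with no __P/__S exports and no extra arch_* hooks. The entries shown
 * are placeholders and "user_prot_extra" is a hypothetical stand-in. */
static pgprot_t protection_map[16] __ro_after_init = {
        [0]       = PAGE_NONE,
        [VM_READ] = PAGE_READONLY,
        /* ... remaining 14 combinations ... */
};

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
        pgprot_t prot = protection_map[vm_flags &
                        (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];

        /* Arch-specific adjustment happens here, directly. */
        return __pgprot(pgprot_val(prot) | pgprot_val(user_prot_extra));
}
EXPORT_SYMBOL(vm_get_page_prot);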

Okay, will change ARM's vm_get_page_prot() implementation as suggested.



Re: [PATCH V3 19/30] csky/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Guo Ren
Acked-by: Guo Ren 

On Mon, Feb 28, 2022 at 7:10 PM Anshuman Khandual
 wrote:
>
> This defines and exports a platform specific custom vm_get_page_prot() via
> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> macros can be dropped which are no longer needed.
>
> Cc: Geert Uytterhoeven 
> Cc: linux-c...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/csky/Kconfig   |  1 +
>  arch/csky/include/asm/pgtable.h | 18 --
>  arch/csky/mm/init.c | 32 
>  3 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
> index 132f43f12dd8..209dac5686dd 100644
> --- a/arch/csky/Kconfig
> +++ b/arch/csky/Kconfig
> @@ -6,6 +6,7 @@ config CSKY
> select ARCH_HAS_GCOV_PROFILE_ALL
> select ARCH_HAS_SYNC_DMA_FOR_CPU
> select ARCH_HAS_SYNC_DMA_FOR_DEVICE
> +   select ARCH_HAS_VM_GET_PAGE_PROT
> select ARCH_USE_BUILTIN_BSWAP
> select ARCH_USE_QUEUED_RWLOCKS
> select ARCH_WANT_FRAME_POINTERS if !CPU_CK610 && $(cc-option,-mbacktrace)
> diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
> index 151607ed5158..2c6b1cfb1cce 100644
> --- a/arch/csky/include/asm/pgtable.h
> +++ b/arch/csky/include/asm/pgtable.h
> @@ -76,24 +76,6 @@
>  #define MAX_SWAPFILES_CHECK() \
> BUILD_BUG_ON(MAX_SWAPFILES_SHIFT != 5)
>
> -#define __P000 PAGE_NONE
> -#define __P001 PAGE_READ
> -#define __P010 PAGE_READ
> -#define __P011 PAGE_READ
> -#define __P100 PAGE_READ
> -#define __P101 PAGE_READ
> -#define __P110 PAGE_READ
> -#define __P111 PAGE_READ
> -
> -#define __S000 PAGE_NONE
> -#define __S001 PAGE_READ
> -#define __S010 PAGE_WRITE
> -#define __S011 PAGE_WRITE
> -#define __S100 PAGE_READ
> -#define __S101 PAGE_READ
> -#define __S110 PAGE_WRITE
> -#define __S111 PAGE_WRITE
> -
>  extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>  #define ZERO_PAGE(vaddr)   (virt_to_page(empty_zero_page))
>
> diff --git a/arch/csky/mm/init.c b/arch/csky/mm/init.c
> index bf2004aa811a..f9babbed17d4 100644
> --- a/arch/csky/mm/init.c
> +++ b/arch/csky/mm/init.c
> @@ -197,3 +197,35 @@ void __init fixaddr_init(void)
> vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
> fixrange_init(vaddr, vaddr + PMD_SIZE, swapper_pg_dir);
>  }
> +
> +pgprot_t vm_get_page_prot(unsigned long vm_flags)
> +{
> +        switch (vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) {
> +        case VM_NONE:
> +                return PAGE_NONE;
> +        case VM_READ:
> +        case VM_WRITE:
> +        case VM_WRITE | VM_READ:
> +        case VM_EXEC:
> +        case VM_EXEC | VM_READ:
> +        case VM_EXEC | VM_WRITE:
> +        case VM_EXEC | VM_WRITE | VM_READ:
> +                return PAGE_READ;
> +        case VM_SHARED:
> +                return PAGE_NONE;
> +        case VM_SHARED | VM_READ:
> +                return PAGE_READ;
> +        case VM_SHARED | VM_WRITE:
> +        case VM_SHARED | VM_WRITE | VM_READ:
> +                return PAGE_WRITE;
> +        case VM_SHARED | VM_EXEC:
> +        case VM_SHARED | VM_EXEC | VM_READ:
> +                return PAGE_READ;
> +        case VM_SHARED | VM_EXEC | VM_WRITE:
> +        case VM_SHARED | VM_EXEC | VM_WRITE | VM_READ:
> +                return PAGE_WRITE;
> +        default:
> +                BUILD_BUG();
> +        }
> +}
> +EXPORT_SYMBOL(vm_get_page_prot);
> --
> 2.25.1
>
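
For context, the generic vm_get_page_prot() that a per-arch override like
this replaces is, roughly (simplified from the mm/mmap.c of this era), a
single table lookup plus two optional arch hooks:

/* Roughly the then-generic implementation in mm/mmap.c: the __P/__S
 * macros each arch exports fill this table, and the lookup is a single
 * indexed load plus the two optional arch_* hooks. */
pgprot_t protection_map[16] __ro_after_init = {
        __P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
        __S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
        pgprot_t ret = __pgprot(pgprot_val(protection_map[vm_flags &
                                (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)]) |
                        pgprot_val(arch_vm_get_page_prot(vm_flags)));

        return arch_filter_pgprot(ret);
}
EXPORT_SYMBOL(vm_get_page_prot);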


-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/



Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Christophe Leroy


On 01/03/2022 at 01:31, Russell King (Oracle) wrote:
> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>> This defines and exports a platform specific custom vm_get_page_prot() via
>>>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>>>> macros can be dropped which are no longer needed.
>>>
>>> What I would really like to know is why having to run _code_ to work out
>>> what the page protections need to be is better than looking it up in a
>>> table.
>>>
>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>> additional code size with it.
>>>
>>> I'm struggling to see what the benefit is.
>>
>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>> protection values. Although that is being run in the core MM and from a
>> platform perspective __SXXX, __PXXX are just being exported for a table.
>> Looking it up in a table (and applying more constructs thereafter) is
>> not much different than a clean switch case statement in terms of CPU
>> usage. So this is not more expensive in terms of CPU cycles.
> 
> I disagree.

So do I.

> 
> However, let's base this disagreement on some evidence. Here is the
> present 32-bit ARM implementation:
> 
> 0048 <vm_get_page_prot>:
>   48:   e200000f    and     r0, r0, #15
>   4c:   e3003000    movw    r3, #0
>             4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
>   50:   e3403000    movt    r3, #0
>             50: R_ARM_MOVT_ABS  .LANCHOR1
>   54:   e7930100    ldr     r0, [r3, r0, lsl #2]
>   58:   e12fff1e    bx      lr
> 
> That is five instructions long.

On ppc32 I get:

0094 <vm_get_page_prot>:
   94:  3d 20 00 00     lis     r9,0
            96: R_PPC_ADDR16_HA .data..ro_after_init
   98:  54 84 16 ba     rlwinm  r4,r4,2,26,29
   9c:  39 29 00 00     addi    r9,r9,0
            9e: R_PPC_ADDR16_LO .data..ro_after_init
   a0:  7d 29 20 2e     lwzx    r9,r9,r4
   a4:  91 23 00 00     stw     r9,0(r3)
   a8:  4e 80 00 20     blr


> 
> Please show that your new implementation is not more expensive on
> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
> the disassembly.

With your series I get:

00000000 <vm_get_page_prot>:
    0:  3d 20 00 00     lis     r9,0
            2: R_PPC_ADDR16_HA  .rodata
    4:  39 29 00 00     addi    r9,r9,0
            6: R_PPC_ADDR16_LO  .rodata
    8:  54 84 16 ba     rlwinm  r4,r4,2,26,29
    c:  7d 49 20 2e     lwzx    r10,r9,r4
   10:  7d 4a 4a 14     add     r10,r10,r9
   14:  7d 49 03 a6     mtctr   r10
   18:  4e 80 04 20     bctr
   1c:  39 20 03 15     li      r9,789
   20:  91 23 00 00     stw     r9,0(r3)
   24:  4e 80 00 20     blr
   28:  39 20 01 15     li      r9,277
   2c:  91 23 00 00     stw     r9,0(r3)
   30:  4e 80 00 20     blr
   34:  39 20 07 15     li      r9,1813
   38:  91 23 00 00     stw     r9,0(r3)
   3c:  4e 80 00 20     blr
   40:  39 20 05 15     li      r9,1301
   44:  91 23 00 00     stw     r9,0(r3)
   48:  4e 80 00 20     blr
   4c:  39 20 01 11     li      r9,273
   50:  4b ff ff d0     b       20 <vm_get_page_prot+0x20>


That is definitely more expensive: it implements a table of branches.


Christophe