Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-09 Thread Anshuman Khandual




Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-02 Thread Russell King (Oracle)
On Wed, Mar 02, 2022 at 04:36:52PM +0530, Anshuman Khandual wrote:
> On 3/2/22 3:35 PM, Geert Uytterhoeven wrote:
> > I doubt the switch() variant would give better code on any platform.
> > 
> > What about using tables everywhere, using designated initializers
> > to improve readability?
> 
> Designated initializers? Could you please be more specific. A table look
> up on the arm platform would be something like this, and arm_protection_map[]
> needs to be updated with user_pgprot like before.

There is *absolutely* nothing wrong with that. Updating it once during
boot is way more efficient than having to compute the value each time
vm_get_page_prot() gets called.

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-02 Thread Geert Uytterhoeven
> 

Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-02 Thread Anshuman Khandual




Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-02 Thread Geert Uytterhoeven

Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-02 Thread Anshuman Khandual



On 3/2/22 12:35 PM, Christophe Leroy wrote:
> 
> 
> On 02/03/2022 at 04:22, Anshuman Khandual wrote:
>>
>>
>> On 3/1/22 1:46 PM, Christophe Leroy wrote:
>>>
>>>
>>> On 01/03/2022 at 01:31, Russell King (Oracle) wrote:
>>>> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>>>>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>>>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>>>>>>> This defines and exports a platform specific custom vm_get_page_prot() via
>>>>>>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>>>>>>> macros can be dropped which are no longer needed.
>>>>>>
>>>>>> What I would really like to know is why having to run _code_ to work out
>>>>>> what the page protections need to be is better than looking it up in a
>>>>>> table.
>>>>>>
>>>>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>>>>> additional code size with it.
>>>>>>
>>>>>> I'm struggling to see what the benefit is.
>>>>>
>>>>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>>>>> protection values. Although that is being run in the core MM and from a
>>>>> platform perspective __SXXX, __PXXX are just being exported for a table.
>>>>> Looking it up in a table (and applying more constructs thereafter) is
>>>>> not much different than a clean switch case statement in terms of CPU
>>>>> usage. So this is not more expensive in terms of CPU cycles.
>>>>
>>>> I disagree.
>>>
>>> So do I.
>>>

>>>> However, let's base this disagreement on some evidence. Here is the
>>>> present 32-bit ARM implementation:
>>>>
>>>> 0048 <vm_get_page_prot>:
>>>>   48:   e200000f        and     r0, r0, #15
>>>>   4c:   e3003000        movw    r3, #0
>>>>                         4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
>>>>   50:   e3403000        movt    r3, #0
>>>>                         50: R_ARM_MOVT_ABS      .LANCHOR1
>>>>   54:   e7930100        ldr     r0, [r3, r0, lsl #2]
>>>>   58:   e12fff1e        bx      lr
>>>>
>>>> That is five instructions long.
>>>
>>> On ppc32 I get:
>>>
>>> 0094 <vm_get_page_prot>:
>>>   94: 3d 20 00 00     lis     r9,0
>>>                       96: R_PPC_ADDR16_HA     .data..ro_after_init
>>>   98: 54 84 16 ba     rlwinm  r4,r4,2,26,29
>>>   9c: 39 29 00 00     addi    r9,r9,0
>>>                       9e: R_PPC_ADDR16_LO     .data..ro_after_init
>>>   a0: 7d 29 20 2e     lwzx    r9,r9,r4
>>>   a4: 91 23 00 00     stw     r9,0(r3)
>>>   a8: 4e 80 00 20     blr
>>>
>>>

>>>> Please show that your new implementation is not more expensive on
>>>> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
>>>> the disassembly.
>>>
>>> With your series I get:
>>>
>>> 00000000 <vm_get_page_prot>:
>>>    0: 3d 20 00 00     lis     r9,0
>>>                       2: R_PPC_ADDR16_HA      .rodata
>>>    4: 39 29 00 00     addi    r9,r9,0
>>>                       6: R_PPC_ADDR16_LO      .rodata
>>>    8: 54 84 16 ba     rlwinm  r4,r4,2,26,29
>>>    c: 7d 49 20 2e     lwzx    r10,r9,r4
>>>   10: 7d 4a 4a 14     add     r10,r10,r9
>>>   14: 7d 49 03 a6     mtctr   r10
>>>   18: 4e 80 04 20     bctr
>>>   1c: 39 20 03 15     li      r9,789
>>>   20: 91 23 00 00     stw     r9,0(r3)
>>>   24: 4e 80 00 20     blr
>>>   28: 39 20 01 15     li      r9,277
>>>   2c: 91 23 00 00     stw     r9,0(r3)
>>>   30: 4e 80 00 20     blr
>>>   34: 39 20 07 15     li      r9,1813
>>>   38: 91 23 00 00     stw     r9,0(r3)
>>>   3c: 4e 80 00 20     blr
>>>   40: 39 20 05 15     li      r9,1301
>>>   44: 91 23 00 00     stw     r9,0(r3)
>>>   48: 4e 80 00 20     blr
>>>   4c: 39 20 01 11     li      r9,273
>>>   50: 4b ff ff d0     b       20 <vm_get_page_prot+0x20>
>>>
>>>
>>> That is definitely more expensive, it implements a table of branches.
>>
>> Okay, will split out the PPC32 implementation that retains the existing
>> table look up method. Also planning to keep that inside the same file
>> (arch/powerpc/mm/mmap.c), unless you have a different preference.
> 
> My point was not to get something specific for PPC32, but to amplify
> Russell's objection.
> 
> As this is bad for ARM and bad for PPC32, do we have any evidence that
> your change is good for any other architecture?
> 
> I checked PPC64 and there is exactly the same drawback. With the current
> implementation it is a small function performing a table read and then a few
> adjustments. After your change it is a bigger function implementing a
> table of branches.

I am wondering if this would not be the case for any other switch case
statement on the platform. Is there something specific/different about the
vm_get_page_prot() implementation in particular? Are you suggesting that
switch case statements should just be avoided instead?

> 
> So, as requested by Russell, could you look at the disassembly for other 
> 

Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Christophe Leroy


On 02/03/2022 at 04:22, Anshuman Khandual wrote:
> 
> Okay, will split out the PPC32 implementation that retains existing
> table look up method. Also planning to keep that inside same file
> (arch/powerpc/mm/mmap.c), unless you have a difference preference.

My point was not to get something specific for PPC32, but to amplify
Russell's objection.

As this is bad for ARM and bad for PPC32, do we have any evidence that
your change is good for any other architecture?

I checked PPC64 and there is exactly the same drawback. With the current
implementation it is a small function performing a table read and then a few
adjustments. After your change it is a bigger function implementing a
table of branches.

So, as requested by Russell, could you look at the disassembly for other
architectures and show us that ARM and POWERPC are the only ones for
which your change is not optimal?

See below the difference for POWERPC64.

Current implementation:

0a60 <.vm_get_page_prot>:
  a60:  3d 42 00 00 addis   r10,r2,0
a62: R_PPC64_TOC16_HA   .data..ro_after_init
  a64:  78 89 1e 68 rldic   r9,r4,3,57
  a68:  39 4a 00 00 addi

Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Anshuman Khandual



On 3/1/22 1:46 PM, Christophe Leroy wrote:
> 
> That is definitely more expensive, it implements a table of branches.

Okay, will split out the PPC32 implementation that retains the existing
table look up method. Also planning to keep that inside the same file
(arch/powerpc/mm/mmap.c), unless you have a different preference.


Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Anshuman Khandual



On 3/1/22 6:01 AM, Russell King (Oracle) wrote:
> That is five instructions long.
> 
> Please show that your new implementation is not more expensive on
> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
> the disassembly.
> 
> I think you will find way more than five instructions in your version -
> the compiler will have to issue code to decode the protection bits,
> probably using a table of branches or absolute PC values, or possibly
> the worst case of using multiple comparisons and branches. It then has
> to load constants that may be moved using movw on ARMv7, but on
> older architectures would have to be created from multiple instructions
> or loaded from the literal pool. Then there'll be instructions to load
> the address of "user_pgprot", retrieve its value, and bitwise or that.
> 
> Therefore, I fail to see how your approach of getting rid of the table
> is somehow "better" than what we currently have in terms of the effect
> on the resulting code.
> 
> If you don't like the __P and __S stuff and two arch_* hooks, you could
> move the table into arch code along with vm_get_page_prot() without the
> additional unnecessary hooks, while keeping all the benefits of the
> table lookup.

Okay, will change the arm's vm_get_page_prot() implementation as suggested.


Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-03-01 Thread Christophe Leroy


Le 01/03/2022 à 01:31, Russell King (Oracle) a écrit :
> On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
>> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
>>> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
 This defines and exports a platform specific custom vm_get_page_prot() via
 subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
 macros can be dropped which are no longer needed.
>>>
>>> What I would really like to know is why having to run _code_ to work out
>>> what the page protections need to be is better than looking it up in a
>>> table.
>>>
>>> Not only is this more expensive in terms of CPU cycles, it also brings
>>> additional code size with it.
>>>
>>> I'm struggling to see what the benefit is.
>>
>> Currently vm_get_page_prot() is also being _run_ to fetch required page
>> protection values. Although that is being run in the core MM and from a
>> platform perspective __SXXX, __PXXX are just being exported for a table.
>> Looking it up in a table (and applying more constructs thereafter) is
>> not much different than a clean switch case statement in terms of CPU
>> usage. So this is not more expensive in terms of CPU cycles.
> 
> I disagree.

So do I.

> 
> However, let's base this disagreement on some evidence. Here is the
> present 32-bit ARM implementation:
> 
> 0048 <vm_get_page_prot>:
>   48:   e200000f    and     r0, r0, #15
>   4c:   e3003000    movw    r3, #0
>     4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
>   50:   e3403000    movt    r3, #0
>     50: R_ARM_MOVT_ABS  .LANCHOR1
>   54:   e7930100    ldr     r0, [r3, r0, lsl #2]
>   58:   e12fff1e    bx      lr
> 
> That is five instructions long.

On ppc32 I get:

0094 <vm_get_page_prot>:
  94:   3d 20 00 00     lis     r9,0
    96: R_PPC_ADDR16_HA .data..ro_after_init
  98:   54 84 16 ba     rlwinm  r4,r4,2,26,29
  9c:   39 29 00 00     addi    r9,r9,0
    9e: R_PPC_ADDR16_LO .data..ro_after_init
  a0:   7d 29 20 2e     lwzx    r9,r9,r4
  a4:   91 23 00 00     stw     r9,0(r3)
  a8:   4e 80 00 20     blr


> 
> Please show that your new implementation is not more expensive on
> 32-bit ARM. Please do so by building a 32-bit kernel, and providing
> the disassembly.

With your series I get:

00000000 <vm_get_page_prot>:
   0:   3d 20 00 00     lis     r9,0
    2: R_PPC_ADDR16_HA  .rodata
   4:   39 29 00 00     addi    r9,r9,0
    6: R_PPC_ADDR16_LO  .rodata
   8:   54 84 16 ba     rlwinm  r4,r4,2,26,29
   c:   7d 49 20 2e     lwzx    r10,r9,r4
  10:   7d 4a 4a 14     add     r10,r10,r9
  14:   7d 49 03 a6     mtctr   r10
  18:   4e 80 04 20     bctr
  1c:   39 20 03 15     li      r9,789
  20:   91 23 00 00     stw     r9,0(r3)
  24:   4e 80 00 20     blr
  28:   39 20 01 15     li      r9,277
  2c:   91 23 00 00     stw     r9,0(r3)
  30:   4e 80 00 20     blr
  34:   39 20 07 15     li      r9,1813
  38:   91 23 00 00     stw     r9,0(r3)
  3c:   4e 80 00 20     blr
  40:   39 20 05 15     li      r9,1301
  44:   91 23 00 00     stw     r9,0(r3)
  48:   4e 80 00 20     blr
  4c:   39 20 01 11     li      r9,273
  50:   4b ff ff d0     b       20 <vm_get_page_prot+0x20>


That is definitely more expensive, it implements a table of branches.


Christophe

Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-02-28 Thread Russell King (Oracle)
On Tue, Mar 01, 2022 at 05:30:41AM +0530, Anshuman Khandual wrote:
> On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
> > On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
> >> This defines and exports a platform specific custom vm_get_page_prot() via
> >> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> >> macros can be dropped which are no longer needed.
> > 
> > What I would really like to know is why having to run _code_ to work out
> > what the page protections need to be is better than looking it up in a
> > table.
> > 
> > Not only is this more expensive in terms of CPU cycles, it also brings
> > additional code size with it.
> > 
> > I'm struggling to see what the benefit is.
> 
> Currently vm_get_page_prot() is also being _run_ to fetch required page
> protection values. Although that is being run in the core MM and from a
> platform perspective __SXXX, __PXXX are just being exported for a table.
> Looking it up in a table (and applying more constructs thereafter) is
> not much different than a clean switch case statement in terms of CPU
> usage. So this is not more expensive in terms of CPU cycles.

I disagree.

However, let's base this disagreement on some evidence. Here is the
present 32-bit ARM implementation:

0048 <vm_get_page_prot>:
  48:   e200000f    and     r0, r0, #15
  4c:   e3003000    movw    r3, #0
    4c: R_ARM_MOVW_ABS_NC   .LANCHOR1
  50:   e3403000    movt    r3, #0
    50: R_ARM_MOVT_ABS  .LANCHOR1
  54:   e7930100    ldr     r0, [r3, r0, lsl #2]
  58:   e12fff1e    bx      lr

That is five instructions long.

Please show that your new implementation is not more expensive on
32-bit ARM. Please do so by building a 32-bit kernel, and providing
the disassembly.

I think you will find way more than five instructions in your version -
the compiler will have to issue code to decode the protection bits,
probably using a table of branches or absolute PC values, or possibly
the worst case of using multiple comparisons and branches. It then has
to load constants that may be moved using movw on ARMv7, but on
older architectures would have to be created from multiple instructions
or loaded from the literal pool. Then there'll be instructions to load
the address of "user_pgprot", retrieve its value, and bitwise or that.

Therefore, I fail to see how your approach of getting rid of the table
is somehow "better" than what we currently have in terms of the effect
on the resulting code.

If you don't like the __P and __S stuff and two arch_* hooks, you could
move the table into arch code along with vm_get_page_prot() without the
additional unnecessary hooks, while keeping all the benefits of the
table lookup.

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-02-28 Thread Anshuman Khandual



On 2/28/22 4:27 PM, Russell King (Oracle) wrote:
> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
>> This defines and exports a platform specific custom vm_get_page_prot() via
>> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
>> macros can be dropped which are no longer needed.
> 
> What I would really like to know is why having to run _code_ to work out
> what the page protections need to be is better than looking it up in a
> table.
> 
> Not only is this more expensive in terms of CPU cycles, it also brings
> additional code size with it.
> 
> I'm struggling to see what the benefit is.
> 

Currently vm_get_page_prot() is also being _run_ to fetch required page
protection values. Although that is being run in the core MM and from a
platform perspective __SXXX, __PXXX are just being exported for a table.
Looking it up in a table (and applying more constructs thereafter) is
not much different than a clean switch case statement in terms of CPU
usage. So this is not more expensive in terms of CPU cycles.

--
pgprot_t protection_map[16] __ro_after_init = {
	__P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
	__S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};

#ifndef CONFIG_ARCH_HAS_FILTER_PGPROT
static inline pgprot_t arch_filter_pgprot(pgprot_t prot)
{
	return prot;
}
#endif

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
	pgprot_t ret = __pgprot(pgprot_val(protection_map[vm_flags &
				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
			pgprot_val(arch_vm_get_page_prot(vm_flags)));

	return arch_filter_pgprot(ret);
}
EXPORT_SYMBOL(vm_get_page_prot);


There will be a single vm_get_page_prot() instance on a given platform
just like before. So this also does not bring any additional code size
with it.

As mentioned earlier on a previous version.

Remove multiple 'core MM <--> platform' abstraction layers to map
vm_flags access permission combinations into page protections.

From the cover letter ..

--
Currently there are multiple layers of abstraction i.e __SXXX/__PXXX macros
, protection_map[], arch_vm_get_page_prot() and arch_filter_pgprot() built
between the platform and generic MM, finally defining vm_get_page_prot().

Hence this series proposes to drop all these abstraction levels and instead
just move the responsibility of defining vm_get_page_prot() to the platform
itself making it clean and simple.
--

Benefits

1. For platforms using arch_vm_get_page_prot() and/or arch_filter_pgprot()

- A simplified vm_get_page_prot()
- Dropped arch_vm_get_page_prot() and arch_filter_pgprot()
- Dropped __SXXX, __PXXX macros

2. For platforms which just exported __SXXX, __PXXX

- A simplified vm_get_page_prot()
- Dropped __SXXX, __PXXX macros

3. For core MM

- Dropped a complex vm_get_page_prot() with multiple layers
  of abstraction i.e __SXXX/__PXXX macros, protection_map[],
  arch_vm_get_page_prot(), arch_filter_pgprot() etc.

- Anshuman


Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-02-28 Thread Geert Uytterhoeven
Hi Russell,

On Mon, Feb 28, 2022 at 11:57 AM Russell King (Oracle)
 wrote:
> On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
> > This defines and exports a platform specific custom vm_get_page_prot() via
> > subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> > macros can be dropped which are no longer needed.
>
> What I would really like to know is why having to run _code_ to work out
> what the page protections need to be is better than looking it up in a
> table.
>
> Not only is this more expensive in terms of CPU cycles, it also brings
> additional code size with it.
>
> I'm struggling to see what the benefit is.

I was wondering about that as well. But at least for code size on
m68k, this didn't have much impact.  Looking at the generated code,
the increase due to using code for the (few different) cases is offset
by a 16-bit jump table (which is to be credited to the compiler).

In terms of CPU cycles, it's indeed worse than before.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH V3 09/30] arm/mm: Enable ARCH_HAS_VM_GET_PAGE_PROT

2022-02-28 Thread Russell King (Oracle)
On Mon, Feb 28, 2022 at 04:17:32PM +0530, Anshuman Khandual wrote:
> This defines and exports a platform specific custom vm_get_page_prot() via
> subscribing ARCH_HAS_VM_GET_PAGE_PROT. Subsequently all __SXXX and __PXXX
> macros can be dropped which are no longer needed.

What I would really like to know is why having to run _code_ to work out
what the page protections need to be is better than looking it up in a
table.

Not only is this more expensive in terms of CPU cycles, it also brings
additional code size with it.

I'm struggling to see what the benefit is.
