Re: [RFC PATCH] powerpc/ftrace: Refactoring and support for -fpatchable-function-entry

2023-05-25 Thread Christophe Leroy


On 23/05/2023 at 11:31, Naveen N Rao wrote:
> Christophe Leroy wrote:
>>
>> That's better, but still more time than original implementation:
>>
>> +20% to activate function tracer (was +40% with your RFC)
>> +21% to activate nop tracer (was +24% with your RFC)
>>
>> perf record (without strict kernel rwx) :
>>
>>  17.75%  echo [kernel.kallsyms]   [k] ftrace_check_record
>>   9.76%  echo [kernel.kallsyms]   [k] ftrace_replace_code
>>   6.53%  echo [kernel.kallsyms]   [k] patch_instruction
>>   5.21%  echo [kernel.kallsyms]   [k] __ftrace_hash_rec_update
>>   4.26%  echo [kernel.kallsyms]   [k] ftrace_get_addr_curr
>>   4.18%  echo [kernel.kallsyms]   [k] ftrace_get_call_inst.isra.0
>>   3.45%  echo [kernel.kallsyms]   [k] ftrace_get_addr_new
>>   3.08%  echo [kernel.kallsyms]   [k] function_trace_call
>>   2.20%  echo [kernel.kallsyms]   [k] __rb_reserve_next.constprop.0
>>   2.05%  echo [kernel.kallsyms]   [k] copy_page
>>   1.91%  echo [kernel.kallsyms]   [k] ftrace_create_branch_inst.constprop.0
>>   1.83%  echo [kernel.kallsyms]   [k] ftrace_rec_iter_next
>>   1.83%  echo [kernel.kallsyms]   [k] rb_commit
>>   1.69%  echo [kernel.kallsyms]   [k] ring_buffer_lock_reserve
>>   1.54%  echo [kernel.kallsyms]   [k] trace_function
>>   1.39%  echo [kernel.kallsyms]   [k] __call_rcu_common.constprop.0
>>   1.25%  echo ld-2.23.so  [.] do_lookup_x
>>   1.17%  echo [kernel.kallsyms]   [k] ftrace_rec_iter_record
>>   1.03%  echo [kernel.kallsyms]   [k] unmap_page_range
>>   0.95%  echo [kernel.kallsyms]   [k] flush_dcache_icache_page
>>   0.95%  echo [kernel.kallsyms]   [k] ftrace_lookup_ip
> 
> Ok, I simplified this further, and this is as close to the previous fast 
> path as we can get (applies atop the original RFC). The only difference 
> left is the ftrace_rec iterator.

That's not better, that's even slightly worse (less than 1%).

I will try to investigate why.

Christophe


Re: [PATCH v5 12/18] watchdog/hardlockup: Rename some "NMI watchdog" constants/function

2023-05-25 Thread Doug Anderson
Hi,

On Wed, May 24, 2023 at 6:38 AM Petr Mladek  wrote:
>
> On Fri 2023-05-19 10:18:36, Douglas Anderson wrote:
> > Do a search and replace of:
> > - NMI_WATCHDOG_ENABLED => WATCHDOG_HARDLOCKUP_ENABLED
> > - SOFT_WATCHDOG_ENABLED => WATCHDOG_SOFTOCKUP_ENABLED
> > - watchdog_nmi_ => watchdog_hardlockup_
> > - nmi_watchdog_available => watchdog_hardlockup_available
> > - nmi_watchdog_user_enabled => watchdog_hardlockup_user_enabled
> > - soft_watchdog_user_enabled => watchdog_softlockup_user_enabled
> > - NMI_WATCHDOG_DEFAULT => WATCHDOG_HARDLOCKUP_DEFAULT
> >
> > Then update a few comments near where names were changed.
> >
> > This is specifically to make it less confusing when we want to
> > introduce the buddy hardlockup detector, which isn't using NMIs. As
> > part of this, we sanitized a few names for consistency.
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -30,17 +30,17 @@
> >  static DEFINE_MUTEX(watchdog_mutex);
> >
> >  #if defined(CONFIG_HARDLOCKUP_DETECTOR) || defined(CONFIG_HAVE_NMI_WATCHDOG)
> > -# define NMI_WATCHDOG_DEFAULT	1
> > +# define WATCHDOG_HARDLOCKUP_DEFAULT	1
> >  #else
> > -# define NMI_WATCHDOG_DEFAULT	0
> > +# define WATCHDOG_HARDLOCKUP_DEFAULT	0
> >  #endif
> >
> >  unsigned long __read_mostly watchdog_enabled;
> >  int __read_mostly watchdog_user_enabled = 1;
> > -int __read_mostly nmi_watchdog_user_enabled = NMI_WATCHDOG_DEFAULT;
> > -int __read_mostly soft_watchdog_user_enabled = 1;
> > +int __read_mostly watchdog_hardlockup_user_enabled = WATCHDOG_HARDLOCKUP_DEFAULT;
> > +int __read_mostly watchdog_softlockup_user_enabled = 1;
>
> I still see nmi_watchdog_user_enabled and soft_watchdog_user_enabled
> in include/linux/nmi.h. They are declared there and also mentioned
> in a comment.
>
> It seems that they do not actually need to be declared there.
> I guess that it was needed for the /proc stuff. But it was
> moved into kernel/watchdog.c by commit dd0693fdf054f2ed37
> ("watchdog: move watchdog sysctl interface to watchdog.c").
>
> >  int __read_mostly watchdog_thresh = 10;
> > -static int __read_mostly nmi_watchdog_available;
> > +static int __read_mostly watchdog_hardlockup_available;
> >
> >  struct cpumask watchdog_cpumask __read_mostly;
> >  unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
>
> Otherwise, I like the changes.
>
> With the following:
>
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 83076bf70ce8..d14fe345eae9 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -17,8 +17,6 @@ void lockup_detector_soft_poweroff(void);
>  void lockup_detector_cleanup(void);
>
>  extern int watchdog_user_enabled;
> -extern int nmi_watchdog_user_enabled;
> -extern int soft_watchdog_user_enabled;
>  extern int watchdog_thresh;
>  extern unsigned long watchdog_enabled;
>
> @@ -68,8 +66,8 @@ static inline void reset_hung_task_detector(void) { }
>   * 'watchdog_enabled' variable. Each lockup detector has its dedicated bit -
>   * bit 0 for the hard lockup detector and bit 1 for the soft lockup detector.
>   *
> - * 'watchdog_user_enabled', 'nmi_watchdog_user_enabled' and
> - * 'soft_watchdog_user_enabled' are variables that are only used as an
> + * 'watchdog_user_enabled', 'watchdog_hardlockup_user_enabled' and
> + * 'watchdog_softlockup_user_enabled' are variables that are only used as an
>   * 'interface' between the parameters in /proc/sys/kernel and the internal
>   * state bits in 'watchdog_enabled'. The 'watchdog_thresh' variable is
>   * handled differently because its value is not boolean, and the lockup
>
> Reviewed-by: Petr Mladek 
>
> Even better might be to remove the unused declaration in a separate
> patch. I think that more declarations are not needed after moving
> the /proc stuff into kernel/watchdog.c.
>
> But it might also be fixed later.

Breadcrumbs: I squashed your suggestion together with Tom's patch and
posted the result:

https://lore.kernel.org/r/20230525162822.1.I0fb41d138d158c9230573eaa37dc56afa2fb14ee@changeid

-Doug


Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-05-25 Thread Robert Richter
On 25.05.23 17:01:01, Bjorn Helgaas wrote:
> On Thu, May 25, 2023 at 11:29:58PM +0200, Robert Richter wrote:
> > On 24.05.23 16:32:35, Bjorn Helgaas wrote:
> > > On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote:
> > > > From: Robert Richter 
> > > > 
> > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> > > > RCiEP, but CXL downstream and upstream ports are not enumerated and
> > > > not visible in the PCIe hierarchy. Protocol and link errors are sent
> > > > to an RCEC.
> > > >
> > > > Restricted CXL host (RCH) downstream port-detected errors are signaled
> > > > as internal AER errors, either Uncorrectable Internal Error (UIE) or
> > > > Corrected Internal Errors (CIE). 
> > > 
> > > From the parallelism with RCD above, I first thought that RCH devices
> > > were non-RCD mode and *were* enumerated as part of the PCIe hierarchy,
> > > but actually I suspect it's more like the following?
> > > 
> > >   ... but CXL downstream and upstream ports are not enumerated and not
> > >   visible in the PCIe hierarchy.
> > > 
> > >   Protocol and link errors from these non-enumerated ports are
> > >   signaled as internal AER errors ... via a CXL RCEC.
> > 
> > Exactly, except the RCEC is standard PCIe and also need not
> > necessarily be on the same PCI bus as the CXL RCiEPs.
> 
> So make it "RCEC" instead of "CXL RCEC", I guess?  PCIe r6.0, sec
> 7.9.10.3, allows an RCEC to be associated with RCiEPs on different
> buses, so nothing to see there.

Yes, nothing special. This makes it more difficult to check if the
RCEC has CXL devices connected, but still it is feasible.

> 
> > > > The error source is the id of the RCEC.
> > > 
> > > This seems odd; I assume this refers to the RCEC's AER Error Source
> > > Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source
> > > Identification would ordinarily be the Requester ID of the RCiEP that
> > > "sent" the Error Message.  But you're saying it's actually the ID of
> > > the *RCEC*, not the RCiEP?
> > 
> > Right, the downstream port has its own AER ext capability in
> > non-config (io mapped) RCRB register range. Errors originating from
> > there are signaled as internal AER errors via the RCEC *with* the
> > RCEC's Requester ID. Code walks through all associated CXL endpoints,
> > determines the dport and checks its AER.
> > 
> > There is also an RDPAS structure defined in CXL but that is only a
> > different way to provide the RCEC to dport association instead of
> > using the RCEC's Endpoint Association Extended Capability. In the end
> > we get all associated RCHs and check the AER of all their dports.
> > 
> > The upstream port is signaled using the RCiEP's AER. CXL spec is
> > strict here: "Upstream Port RCRB shall not implement the AER Extended
> > Capability." The RCiEP's requestor ID is used then and its config
> > space the AER is in.
> > 
> > CXL.cachemem errors are reported with the RCiEP as requester
> > too. Status is in the CXL RAS cap and the UIE or CIE is set
> > respectively in the AER status of the RCiEP.
> >
> > > We're going to call pci_aer_handle_error() as well, to handle the
> > > non-internal errors, and I'm pretty sure that path expects the RCiEP
> > > ID there.
> > > 
> > > Whatever the answer, I'm not sure this sentence is actually relevant
> > > to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or
> > > look at struct aer_err_source.id.
> > 
> > The source id is used in aer_process_err_devices() which finally calls
> > handle_error_source() for the device with the requestor id. This is
> > the place where cxl_rch_handle_error() checks if it is an RCEC that
> > received an internal error and has cxl devices connected to it. Then,
> > the request is forwarded to the cxl_mem handler which also needs to
> > check the dport now. That is, pcie_walk_rcec() in
> > cxl_rch_handle_error() is called with the RCEC's pci handle,
> > cxl_rch_handle_error_iter() with the RCiEP's pci handle.
> 
> I'm still not sure this is relevant.  Isn't that last sentence just
> the way we always use pcie_walk_rcec()?
> 
> If there's something *different* here about CXL, and it's important to
> this patch, sure.  But I don't see that yet.  Maybe a comment in the
> code if you think it's important to clarify something there.

The importance I see is that internal errors of an RCEC indicate an
AER error in an RCH's downstream port. Thus, once that happens, all
involved dports must be checked. Internal errors are typically
non-standard and implementation defined, but here they are defined by
the CXL standard.
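
To make the flow concrete, here is a rough sketch of the forwarding
being discussed (simplified, not the actual patch; pcie_walk_rcec() is
the existing PCI core helper, and is_cxl_mem_dev(), is_internal_error()
and cxl_error_is_native() are the helper names used in this thread):

static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
{
	const struct pci_error_handlers *handler;

	/* Called for every RCiEP associated with the RCEC; only CXL.mem
	 * devices (class code PCI_CLASS_MEMORY_CXL) are forwarded for now. */
	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
		return 0;

	handler = dev->driver ? dev->driver->err_handler : NULL;
	if (handler && handler->error_detected)
		handler->error_detected(dev, pci_channel_io_normal);

	return 0;
}

static void cxl_rch_handle_error(struct pci_dev *rcec, struct aer_err_info *info)
{
	/* An internal error carrying the RCEC's Requester ID may really be
	 * an RCH downstream port error logged in the dport's RCRB/RAS
	 * registers, so walk all RCiEPs below this RCEC and let the CXL
	 * handler inspect the corresponding dport. */
	if (is_internal_error(info))
		pcie_walk_rcec(rcec, cxl_rch_handle_error_iter, info);
}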

-Robert


Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-05-25 Thread Bjorn Helgaas
On Thu, May 25, 2023 at 11:29:58PM +0200, Robert Richter wrote:
> On 24.05.23 16:32:35, Bjorn Helgaas wrote:
> > On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote:
> > > From: Robert Richter 
> > > 
> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> > > RCiEP, but CXL downstream and upstream ports are not enumerated and
> > > not visible in the PCIe hierarchy. Protocol and link errors are sent
> > > to an RCEC.
> > >
> > > Restricted CXL host (RCH) downstream port-detected errors are signaled
> > > as internal AER errors, either Uncorrectable Internal Error (UIE) or
> > > Corrected Internal Errors (CIE). 
> > 
> > From the parallelism with RCD above, I first thought that RCH devices
> > were non-RCD mode and *were* enumerated as part of the PCIe hierarchy,
> > but actually I suspect it's more like the following?
> > 
> >   ... but CXL downstream and upstream ports are not enumerated and not
> >   visible in the PCIe hierarchy.
> > 
> >   Protocol and link errors from these non-enumerated ports are
> >   signaled as internal AER errors ... via a CXL RCEC.
> 
> Exactly, except the RCEC is standard PCIe and also need not
> necessarily be on the same PCI bus as the CXL RCiEPs.

So make it "RCEC" instead of "CXL RCEC", I guess?  PCIe r6.0, sec
7.9.10.3, allows an RCEC to be associated with RCiEPs on different
buses, so nothing to see there.

> > > The error source is the id of the RCEC.
> > 
> > This seems odd; I assume this refers to the RCEC's AER Error Source
> > Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source
> > Identification would ordinarily be the Requester ID of the RCiEP that
> > "sent" the Error Message.  But you're saying it's actually the ID of
> > the *RCEC*, not the RCiEP?
> 
> Right, the downstream port has its own AER ext capability in
> non-config (io mapped) RCRB register range. Errors originating from
> there are signaled as internal AER errors via the RCEC *with* the
> RCEC's Requester ID. Code walks through all associated CXL endpoints,
> determines the dport and checks its AER.
> 
> There is also an RDPAS structure defined in CXL but that is only a
> different way to provide the RCEC to dport association instead of
> using the RCEC's Endpoint Association Extended Capability. In the end
> we get all associated RCHs and check the AER of all their dports.
> 
> The upstream port is signaled using the RCiEP's AER. CXL spec is
> strict here: "Upstream Port RCRB shall not implement the AER Extended
> Capability." The RCiEP's requestor ID is used then and its config
> space the AER is in.
> 
> CXL.cachemem errors are reported with the RCiEP as requester
> too. Status is in the CXL RAS cap and the UIE or CIE is set
> respectively in the AER status of the RCiEP.
>
> > We're going to call pci_aer_handle_error() as well, to handle the
> > non-internal errors, and I'm pretty sure that path expects the RCiEP
> > ID there.
> > 
> > Whatever the answer, I'm not sure this sentence is actually relevant
> > to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or
> > look at struct aer_err_source.id.
> 
> The source id is used in aer_process_err_devices() which finally calls
> handle_error_source() for the device with the requestor id. This is
> the place where cxl_rch_handle_error() checks if it is an RCEC that
> received an internal error and has cxl devices connected to it. Then,
> the request is forwarded to the cxl_mem handler which also needs to
> check the dport now. That is, pcie_walk_rcec() in
> cxl_rch_handle_error() is called with the RCEC's pci handle,
> cxl_rch_handle_error_iter() with the RCiEP's pci handle.

I'm still not sure this is relevant.  Isn't that last sentence just
the way we always use pcie_walk_rcec()?

If there's something *different* here about CXL, and it's important to
this patch, sure.  But I don't see that yet.  Maybe a comment in the
code if you think it's important to clarify something there.

Bjorn


Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-05-25 Thread Robert Richter
On 24.05.23 16:32:35, Bjorn Helgaas wrote:
> On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote:
> > From: Robert Richter 
> > 
> > In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> > RCiEP, but CXL downstream and upstream ports are not enumerated and
> > not visible in the PCIe hierarchy. Protocol and link errors are sent
> > to an RCEC.
> >
> > Restricted CXL host (RCH) downstream port-detected errors are signaled
> > as internal AER errors, either Uncorrectable Internal Error (UIE) or
> > Corrected Internal Errors (CIE). 
> 
> From the parallelism with RCD above, I first thought that RCH devices
> were non-RCD mode and *were* enumerated as part of the PCIe hierarchy,
> but actually I suspect it's more like the following?
> 
>   ... but CXL downstream and upstream ports are not enumerated and not
>   visible in the PCIe hierarchy.
> 
>   Protocol and link errors from these non-enumerated ports are
>   signaled as internal AER errors ... via a CXL RCEC.

Exactly, except the RCEC is standard PCIe and also need not
necessarily be on the same PCI bus as the CXL RCiEPs.

> 
> > The error source is the id of the RCEC.
> 
> This seems odd; I assume this refers to the RCEC's AER Error Source
> Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source
> Identification would ordinarily be the Requester ID of the RCiEP that
> "sent" the Error Message.  But you're saying it's actually the ID of
> the *RCEC*, not the RCiEP?

Right, the downstream port has its own AER ext capability in
non-config (io mapped) RCRB register range. Errors originating from
there are signaled as internal AER errors via the RCEC *with* the
RCEC's Requester ID. Code walks through all associated CXL endpoints,
determines the dport and checks its AER.

There is also an RDPAS structure defined in CXL but that is only a
different way to provide the RCEC to dport association instead of
using the RCEC's Endpoint Association Extended Capability. In the end
we get all associated RCHs and check the AER of all their dports.

The upstream port is signaled using the RCiEP's AER. CXL spec is
strict here: "Upstream Port RCRB shall not implement the AER Extended
Capability." The RCiEP's requestor ID is used then and it's config
space the AER is in.

CXL.cachemem errors are reported with the RCiEP as requester
too. Status is in the CXL RAS cap and the UIE or CIE is set
respectively in the AER status of the RCiEP.

> 
> We're going to call pci_aer_handle_error() as well, to handle the
> non-internal errors, and I'm pretty sure that path expects the RCiEP
> ID there.
> 
> Whatever the answer, I'm not sure this sentence is actually relevant
> to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or
> look at struct aer_err_source.id.

The source id is used in aer_process_err_devices() which finally calls
handle_error_source() for the device with the requestor id. This is
the place where cxl_rch_handle_error() checks if it is an RCEC that
received an internal error and has cxl devices connected to it. Then,
the request is forwarded to the cxl_mem handler which also needs to
check the dport now. That is, pcie_walk_rcec() in
cxl_rch_handle_error() is called with the RCEC's pci handle,
cxl_rch_handle_error_iter() with the RCiEP's pci handle.

> 
> > A CXL handler must then inspect the error status in various CXL
> > registers residing in the dport's component register space (CXL RAS
> > capability) or the dport's RCRB (PCIe AER extended capability). [1]
> > 
> > Errors showing up in the RCEC's error handler must be handled and
> > connected to the CXL subsystem. Implement this by forwarding the error
> > to all CXL devices below the RCEC. Since the entire CXL device is
> > controlled only using PCIe Configuration Space of device 0, function
> > 0, only pass it there [2]. The error handling is limited to currently
> > supported devices with the Memory Device class code set
> > (PCI_CLASS_MEMORY_CXL, 502h), where the handler can be implemented in
> > the existing cxl_pci driver. Support of CXL devices (e.g. a CXL.cache
> > device) can be enabled later.
> 
> I assume the Memory Devices are CXL devices, so maybe "Error handling
> for *other* CXL devices ... can be enabled later"?  
> 
> IIUC, this happens via cxl_rch_handle_error_iter() calling
> pci_error_handlers for CXL RCiEPs.  Maybe the is_cxl_mem_dev() check
> belongs inside those handlers, since that driver claimed the RCiEP and
> should know its functionality?  Maybe is_internal_error() and
> cxl_error_is_native(), too?

The check is outside the handlers on purpose. A corresponding handler
is needed; it is cxl_pci_driver, see the class code in
cxl_mem_pci_tbl. As the handler must handle other devices' sources,
only aware drivers may be called here. Otherwise a device's error
handler could be called for errors where the source is the RCEC.

> 
> > In addition to errors directed to the CXL endpoint device, a handler
> > must also inspect the CXL RAS and 

Re: [PATCH] sound: Switch i2c drivers back to use .probe()

2023-05-25 Thread Luca Ceresoli
Hi Uwe,

On Thu, 25 May 2023 22:36:40 +0200
Uwe Kleine-König  wrote:

> After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
> call-back type"), all drivers being converted to .probe_new() and then
> 03c835f498b5 ("i2c: Switch .probe() to not take an id parameter") convert
> back to (the new) .probe() to be able to eventually drop .probe_new() from
> struct i2c_driver.
> 
> Signed-off-by: Uwe Kleine-König 

Reviewed-by: Luca Ceresoli 

-- 
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Re: [PATCH v2 01/34] mm: Add PAGE_TYPE_OP folio functions

2023-05-25 Thread Matthew Wilcox
On Thu, May 25, 2023 at 01:38:54PM -0700, Vishal Moola wrote:
> On Thu, May 25, 2023 at 1:20 PM Mike Rapoport  wrote:
> >
> > On Thu, May 25, 2023 at 10:00:23AM -0700, Vishal Moola wrote:
> > > On Thu, May 25, 2023 at 1:56 AM Mike Rapoport  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Mon, May 01, 2023 at 12:27:56PM -0700, Vishal Moola (Oracle) wrote:
> > > > > No folio equivalents for page type operations have been defined, so
> > > > > define them for later folio conversions.
> > > >
> > > > Can you please elaborate why would we need folios for page table 
> > > > descriptors?
> > >
> > > Thanks for the review!
> > >
> > > These macros are for callers that care about the page type, i.e. Table and
> > > Buddy. Aside from accounting for those cases, the page tables don't use 
> > > folios.
> > > These are more for the cleanliness of those callers.
> >
> > But why using folio APIs for PageType will be cleaner than using page APIs?
> > Do you have an example?
> 
> Ah, for example in mm/memory-failure.c there are a couple uses of PageTable.
> Like the line :
> if (folio_test_slab(folio) || PageTable(&folio->page) ||
> folio_test_reserved(folio))
> where that PageTable(&folio->page) can now be written as folio_test_table(folio)
> instead.
> 
> Also there are numerous uses of PageBuddy in mm/compaction.c that will
> likely need to be converted to folios as well.

... and you can currently call PageTable() on the second/third/... page
of an allocation and it will return false, regardless of what the
first page is typed as.  For most architectures, this doesn't matter,
but /proc/kpageflags will underreport the amount of memory allocated
as page tables on architectures which use multi-page allocations for
their page tables as there's currently absolutely nothing to indicate
the size of the allocation.

To fix this, we need to use __GFP_COMP.
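
A minimal sketch of the undercounting (illustrative only, not from the
series; error handling omitted):

static void example_pagetype_undercount(void)
{
	/* order-2 page-table allocation, NOT compound: only the head
	 * page struct carries the Table page type */
	struct page *pt = alloc_pages(GFP_KERNEL, 2);

	__SetPageTable(pt);
	WARN_ON(!PageTable(pt));	/* head page: reported          */
	WARN_ON(PageTable(pt + 1));	/* tail pages: invisible to
					 * /proc/kpageflags accounting  */
	__ClearPageTable(pt);
	__free_pages(pt, 2);

	/* with __GFP_COMP the four pages form one compound page, so any
	 * constituent page leads back to the typed head */
	pt = alloc_pages(GFP_KERNEL | __GFP_COMP, 2);
	__SetPageTable(pt);
	WARN_ON(!PageTable(compound_head(pt + 1)));
	__ClearPageTable(pt);
	__free_pages(pt, 2);
}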


[PATCH] powerpc: Switch i2c drivers back to use .probe()

2023-05-25 Thread Uwe Kleine-König
After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
call-back type"), all drivers being converted to .probe_new() and then
03c835f498b5 ("i2c: Switch .probe() to not take an id parameter")
convert back to (the new) .probe() to be able to eventually drop
.probe_new() from struct i2c_driver.

Signed-off-by: Uwe Kleine-König 
---
 arch/powerpc/platforms/44x/ppc476.c| 2 +-
 arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/44x/ppc476.c 
b/arch/powerpc/platforms/44x/ppc476.c
index fbc6edad481f..164cbcd4588e 100644
--- a/arch/powerpc/platforms/44x/ppc476.c
+++ b/arch/powerpc/platforms/44x/ppc476.c
@@ -103,7 +103,7 @@ static struct i2c_driver avr_driver = {
.driver = {
.name = "akebono-avr",
},
-   .probe_new = avr_probe,
+   .probe = avr_probe,
.id_table = avr_id,
 };
 
diff --git a/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c 
b/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c
index 77ed61306a73..4d8fa9ed1a67 100644
--- a/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c
+++ b/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c
@@ -211,7 +211,7 @@ static struct i2c_driver mcu_driver = {
.name = "mcu-mpc8349emitx",
.of_match_table = mcu_of_match_table,
},
-   .probe_new = mcu_probe,
+   .probe = mcu_probe,
.remove = mcu_remove,
.id_table = mcu_ids,
 };

base-commit: ac9a78681b921877518763ba0e89202254349d1b
-- 
2.39.2



Re: [PATCH v2 05/34] mm: add utility functions for ptdesc

2023-05-25 Thread Vishal Moola
On Thu, May 25, 2023 at 1:26 PM Mike Rapoport  wrote:
>
> On Thu, May 25, 2023 at 11:04:28AM -0700, Vishal Moola wrote:
> > On Thu, May 25, 2023 at 2:10 AM Mike Rapoport  wrote:
> > > > +
> > > > +static inline struct ptdesc *ptdesc_alloc(gfp_t gfp, unsigned int 
> > > > order)
> > > > +{
> > > > + struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > > +
> > > > + return page_ptdesc(page);
> > > > +}
> > > > +
> > > > +static inline void ptdesc_free(struct ptdesc *pt)
> > > > +{
> > > > + struct page *page = ptdesc_page(pt);
> > > > +
> > > > + __free_pages(page, compound_order(page));
> > > > +}
> > >
> > > The ptdesc_{alloc,free} API does not sound right to me. The name
> > > ptdesc_alloc() implies the allocation of the ptdesc itself, rather than
> > > allocation of page table page. The same goes for free.
> >
> > I'm not sure I see the difference. Could you elaborate?
>
> I read ptdesc_alloc() as "allocate a ptdesc" rather than as "allocate a
> page for page table and return ptdesc pointing to that page". Seems very
> confusing to me already and it will be even more confusing when we start
> allocating actual ptdescs.

Hmm, I see what you're saying. I'm envisioning this function evolving into
one that allocates a ptdesc later. I don't see why we would need to have both a
page table page AND ptdesc at any point, but that may be a lack of knowledge
on my part.

I was thinking later, if necessary, we could make another function
(only to be used internally) to allocate page table pages.


Re: [PATCH v2 01/34] mm: Add PAGE_TYPE_OP folio functions

2023-05-25 Thread Vishal Moola
On Thu, May 25, 2023 at 1:20 PM Mike Rapoport  wrote:
>
> On Thu, May 25, 2023 at 10:00:23AM -0700, Vishal Moola wrote:
> > On Thu, May 25, 2023 at 1:56 AM Mike Rapoport  wrote:
> > >
> > > Hi,
> > >
> > > On Mon, May 01, 2023 at 12:27:56PM -0700, Vishal Moola (Oracle) wrote:
> > > > No folio equivalents for page type operations have been defined, so
> > > > define them for later folio conversions.
> > >
> > > Can you please elaborate why would we need folios for page table 
> > > descriptors?
> >
> > Thanks for the review!
> >
> > These macros are for callers that care about the page type, i.e. Table and
> > Buddy. Aside from accounting for those cases, the page tables don't use 
> > folios.
> > These are more for the cleanliness of those callers.
>
> But why using folio APIs for PageType will be cleaner than using page APIs?
> Do you have an example?

Ah, for example in mm/memory-failure.c there are a couple uses of PageTable.
Like the line :
if (folio_test_slab(folio) || PageTable(&folio->page) ||
folio_test_reserved(folio))
where that PageTable(&folio->page) can now be written as folio_test_table(folio)
instead.

Also there are numerous uses of PageBuddy in mm/compaction.c that will
likely need to be converted to folios as well.


[PATCH] sound: Switch i2c drivers back to use .probe()

2023-05-25 Thread Uwe Kleine-König
After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
call-back type"), all drivers being converted to .probe_new() and then
03c835f498b5 ("i2c: Switch .probe() to not take an id parameter") convert
back to (the new) .probe() to be able to eventually drop .probe_new() from
struct i2c_driver.

Signed-off-by: Uwe Kleine-König 
---
 sound/aoa/codecs/onyx.c | 2 +-
 sound/aoa/codecs/tas.c  | 2 +-
 sound/pci/hda/cs35l41_hda_i2c.c | 2 +-
 sound/ppc/keywest.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/sound/aoa/codecs/onyx.c b/sound/aoa/codecs/onyx.c
index 4c75381f5ab8..a8a59d71dcec 100644
--- a/sound/aoa/codecs/onyx.c
+++ b/sound/aoa/codecs/onyx.c
@@ -1048,7 +1048,7 @@ static struct i2c_driver onyx_driver = {
.driver = {
.name = "aoa_codec_onyx",
},
-   .probe_new = onyx_i2c_probe,
+   .probe = onyx_i2c_probe,
.remove = onyx_i2c_remove,
.id_table = onyx_i2c_id,
 };
diff --git a/sound/aoa/codecs/tas.c b/sound/aoa/codecs/tas.c
index f906e9aaddcf..ab1472390061 100644
--- a/sound/aoa/codecs/tas.c
+++ b/sound/aoa/codecs/tas.c
@@ -936,7 +936,7 @@ static struct i2c_driver tas_driver = {
.driver = {
.name = "aoa_codec_tas",
},
-   .probe_new = tas_i2c_probe,
+   .probe = tas_i2c_probe,
.remove = tas_i2c_remove,
.id_table = tas_i2c_id,
 };
diff --git a/sound/pci/hda/cs35l41_hda_i2c.c b/sound/pci/hda/cs35l41_hda_i2c.c
index 7826b1a12d7d..b44536fbba17 100644
--- a/sound/pci/hda/cs35l41_hda_i2c.c
+++ b/sound/pci/hda/cs35l41_hda_i2c.c
@@ -58,7 +58,7 @@ static struct i2c_driver cs35l41_i2c_driver = {
.pm = &cs35l41_hda_pm_ops,
},
.id_table   = cs35l41_hda_i2c_id,
-   .probe_new  = cs35l41_hda_i2c_probe,
+   .probe  = cs35l41_hda_i2c_probe,
.remove = cs35l41_hda_i2c_remove,
 };
 module_i2c_driver(cs35l41_i2c_driver);
diff --git a/sound/ppc/keywest.c b/sound/ppc/keywest.c
index 0c4f43963c75..dfc1fc9b701d 100644
--- a/sound/ppc/keywest.c
+++ b/sound/ppc/keywest.c
@@ -90,7 +90,7 @@ static struct i2c_driver keywest_driver = {
.driver = {
.name = "PMac Keywest Audio",
},
-   .probe_new = keywest_probe,
+   .probe = keywest_probe,
.remove = keywest_remove,
.id_table = keywest_i2c_id,
 };

base-commit: ac9a78681b921877518763ba0e89202254349d1b
-- 
2.39.2



Re: [PATCH v2 05/34] mm: add utility functions for ptdesc

2023-05-25 Thread Mike Rapoport
On Thu, May 25, 2023 at 11:04:28AM -0700, Vishal Moola wrote:
> On Thu, May 25, 2023 at 2:10 AM Mike Rapoport  wrote:
> > > +
> > > +static inline struct ptdesc *ptdesc_alloc(gfp_t gfp, unsigned int order)
> > > +{
> > > + struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > > +
> > > + return page_ptdesc(page);
> > > +}
> > > +
> > > +static inline void ptdesc_free(struct ptdesc *pt)
> > > +{
> > > + struct page *page = ptdesc_page(pt);
> > > +
> > > + __free_pages(page, compound_order(page));
> > > +}
> >
> > The ptdesc_{alloc,free} API does not sound right to me. The name
> > ptdesc_alloc() implies the allocation of the ptdesc itself, rather than
> > allocation of page table page. The same goes for free.
> 
> I'm not sure I see the difference. Could you elaborate?

I read ptdesc_alloc() as "allocate a ptdesc" rather than as "allocate a
page for page table and return ptdesc pointing to that page". Seems very
confusing to me already and it will be even more confusing when we start
allocating actual ptdescs.
 
-- 
Sincerely yours,
Mike.


Re: [PATCH v2 01/34] mm: Add PAGE_TYPE_OP folio functions

2023-05-25 Thread Mike Rapoport
On Thu, May 25, 2023 at 10:00:23AM -0700, Vishal Moola wrote:
> On Thu, May 25, 2023 at 1:56 AM Mike Rapoport  wrote:
> >
> > Hi,
> >
> > On Mon, May 01, 2023 at 12:27:56PM -0700, Vishal Moola (Oracle) wrote:
> > > No folio equivalents for page type operations have been defined, so
> > > define them for later folio conversions.
> >
> > Can you please elaborate why would we need folios for page table 
> > descriptors?
> 
> Thanks for the review!
> 
> These macros are for callers that care about the page type, i.e. Table and
> Buddy. Aside from accounting for those cases, the page tables don't use 
> folios.
> These are more for the cleanliness of those callers.

But why using folio APIs for PageType will be cleaner than using page APIs?
Do you have an example?

-- 
Sincerely yours,
Mike.


Re: [PATCH v5 14/18] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-25 Thread Doug Anderson
Hi,

On Thu, May 25, 2023 at 9:27 AM Petr Mladek  wrote:
>
> On Fri 2023-05-19 10:18:38, Douglas Anderson wrote:
> > Implement a hardlockup detector that doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -85,7 +85,7 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
> >
> >  #endif /* CONFIG_HARDLOCKUP_DETECTOR */
> >
> > -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> > +#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
> >
> >  static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
> >  static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
> > @@ -106,6 +106,14 @@ notrace void arch_touch_nmi_watchdog(void)
> >  }
> >  EXPORT_SYMBOL(arch_touch_nmi_watchdog);
> >
> > +void watchdog_hardlockup_touch_cpu(unsigned int cpu)
> > +{
> > + per_cpu(watchdog_hardlockup_touched, cpu) = true;
> > +
> > + /* Match with smp_rmb() in watchdog_hardlockup_check() */
> > + smp_wmb();
>
> It is great that you described where the related barrier is.
>
> Another important information is what exactly is synchronized.
> And I am actually not sure what we are synchronizing here.
>
> My understanding is that a write barrier should synchronize
> related writes, for example:
>
> X = ...;
> /* Make sure that X is modified before Y */
> smp_wmb();
> Y = ...;
>
> And the related read barrier should synchronize the related reads,
> for example:
>
> if (test(Y)) {
> /*
>  * Make sure that we use the updated X when
>  * we saw the updated Y.
>  */
>  smp_rmb();
>  do_something(X);
>  }
>
> IMHO, we do not need any barrier here because we have only
> one variable "watchdog_hardlockup_touched" here.
> watchdog_hardlockup_check() will either see the updated value
> or not. But it does not synchronize it against any other
> variables or values.

Fair. These barriers were present in the original buddy lockup
detector that we've been carrying in ChromeOS but that doesn't
necessarily mean that they were there for a good reason.

Reasoning about weakly ordered memory always makes my brain hurt and I
never feel confident at the end that I got the right answer, and, of
course, this is compounded by the fact that a logic error in my
reasoning might cause a rare / subtle bug. :( When possible I
try to write code that uses full blown locks to make it easier to
reason about (even if less efficient), but that's not necessarily
possible here. While we obviously don't just want to sprinkle barriers
all over the code, IMO it's not a terrible sin to put a barrier in a
case where it makes it easier to reason about the order of things.

In any case, I guess in this case I would worry about some sort of
ordering race when enabling / disabling the buddy lockup detector. At
the end of the buddy's watchdog_hardlockup_enable() /
watchdog_hardlockup_disable() we adjust the "watchdog_cpus" which
changes buddy assignments. Without a barrier, I _think_ it would be
possible for a new CPU to notice the change in buddies without
noticing the touch. Does that match your understanding? Now when
reasoning about CPUs going online/offline we need to consider this
extra case and we have to decide if there's any chance it could lead
to a false lockup detection. With the memory barriers here, it's a
little easier to think about.

Did the above convince you about keeping the barriers? If so, do you
have any suggested comment that would make it clearer?
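
To spell out the pairing I have in mind (an illustrative sketch only;
the per-cpu variable and the cpumask follow the patch, but the two
wrapper functions below are not real code):

/* enable/offline path: mark the prospective buddy as touched *before*
 * publishing the new buddy assignment */
static void example_publish_buddy(unsigned int cpu, unsigned int next_cpu)
{
	per_cpu(watchdog_hardlockup_touched, next_cpu) = true;
	smp_wmb();			/* touch visible before cpumask update */
	cpumask_set_cpu(cpu, &watchdog_cpus);
}

/* checking path: a CPU that already sees the new watchdog_cpus (and
 * thus a new buddy) must also see the touch, or it could report a
 * false lockup right after hotplug */
static bool example_buddy_touched(unsigned int buddy_cpu)
{
	smp_rmb();			/* pairs with smp_wmb() above */
	return per_cpu(watchdog_hardlockup_touched, buddy_cpu);
}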


> > +}
> > +
> >  static bool is_hardlockup(unsigned int cpu)
> >  {
> >   int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
> > @@ -443,11 +454,15 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > hrtimer *hrtimer)
> >   struct pt_regs *regs = get_irq_regs();
> >   int duration;
> >   int softlockup_all_cpu_backtrace = 
> > sysctl_softlockup_all_cpu_backtrace;
> > + unsigned long hrtimer_interrupts;
> >
> >   if (!watchdog_enabled)
> >   return HRTIMER_NORESTART;
> >
> > - watchdog_hardlockup_kick();
> > + hrtimer_interrupts = watchdog_hardlockup_kick();
> > +
> > + /* test for hardlockups */
>
> I would omit the comment. It is not valid when perf detector is used.
> And checking the buddy is clear from the function name.
>
> > + watchdog_buddy_check_hardlockup(hrtimer_interrupts);
>
> I would personally move this into watchdog_hardlockup_kick().
> watchdog_timer_fn() is already complex enough. And checking
> the buddy when kicking a CPU makes sense.

Sure, I'll add that to my list of things to follow-up 

Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED

2023-05-25 Thread Helge Deller

On 5/24/23 02:29, David Rientjes wrote:

On Tue, 23 May 2023, Vlastimil Babka wrote:


As discussed at LSF/MM [1] [2] and with no objections raised there,
deprecate the SLAB allocator. Rename the user-visible option so that
users with CONFIG_SLAB=y get a new prompt with explanation during make
oldconfig, while make olddefconfig will just switch to SLUB.

In all defconfigs with CONFIG_SLAB=y remove the line so those also
switch to SLUB. Regressions due to the switch should be reported to
linux-mm and slab maintainers.

[1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
[2] https://lwn.net/Articles/932201/

Signed-off-by: Vlastimil Babka 


Acked-by: David Rientjes 


I did test SLUB on parisc with 32- and 64-bit kernels, so you may add:

Acked-by: Helge Deller  # parisc

Helge


Re: [PATCH v2 13/34] mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}

2023-05-25 Thread Vishal Moola
On Thu, May 25, 2023 at 2:19 AM Mike Rapoport  wrote:
>
> On Mon, May 01, 2023 at 12:28:08PM -0700, Vishal Moola (Oracle) wrote:
> > Creates ptdesc_pte_ctor(), ptdesc_pmd_ctor(), ptdesc_pte_dtor(), and
> > ptdesc_pmd_dtor() and make the original pgtable constructor/destructors
> > wrappers.
>
> I think pgtable_pXY_ctor/dtor names would be better.

I have it as ptdesc to keep it consistent with the rest of the functions. I
also think it makes more sense as it's initializing stuff tracked by a ptdesc.

> > Signed-off-by: Vishal Moola (Oracle) 
> > ---
> >  include/linux/mm.h | 56 ++
> >  1 file changed, 42 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 58c911341a33..dc61aeca9077 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2847,20 +2847,34 @@ static inline bool ptlock_init(struct ptdesc 
> > *ptdesc) { return true; }
> >  static inline void ptlock_free(struct ptdesc *ptdesc) {}
> >  #endif /* USE_SPLIT_PTE_PTLOCKS */
> >
> > -static inline bool pgtable_pte_page_ctor(struct page *page)
> > +static inline bool ptdesc_pte_ctor(struct ptdesc *ptdesc)
> >  {
> > - if (!ptlock_init(page_ptdesc(page)))
> > + struct folio *folio = ptdesc_folio(ptdesc);
> > +
> > + if (!ptlock_init(ptdesc))
> >   return false;
> > - __SetPageTable(page);
> > - inc_lruvec_page_state(page, NR_PAGETABLE);
> > + __folio_set_table(folio);
> > + lruvec_stat_add_folio(folio, NR_PAGETABLE);
> >   return true;
> >  }
> >
> > +static inline bool pgtable_pte_page_ctor(struct page *page)
> > +{
> > + return ptdesc_pte_ctor(page_ptdesc(page));
> > +}
> > +
> > +static inline void ptdesc_pte_dtor(struct ptdesc *ptdesc)
> > +{
> > + struct folio *folio = ptdesc_folio(ptdesc);
> > +
> > + ptlock_free(ptdesc);
> > + __folio_clear_table(folio);
> > + lruvec_stat_sub_folio(folio, NR_PAGETABLE);
> > +}
> > +
> >  static inline void pgtable_pte_page_dtor(struct page *page)
> >  {
> > - ptlock_free(page_ptdesc(page));
> > - __ClearPageTable(page);
> > - dec_lruvec_page_state(page, NR_PAGETABLE);
> > + ptdesc_pte_dtor(page_ptdesc(page));
> >  }
> >
> >  #define pte_offset_map_lock(mm, pmd, address, ptlp)  \
> > @@ -2942,20 +2956,34 @@ static inline spinlock_t *pmd_lock(struct mm_struct 
> > *mm, pmd_t *pmd)
> >   return ptl;
> >  }
> >
> > -static inline bool pgtable_pmd_page_ctor(struct page *page)
> > +static inline bool ptdesc_pmd_ctor(struct ptdesc *ptdesc)
> >  {
> > - if (!pmd_ptlock_init(page_ptdesc(page)))
> > + struct folio *folio = ptdesc_folio(ptdesc);
> > +
> > + if (!pmd_ptlock_init(ptdesc))
> >   return false;
> > - __SetPageTable(page);
> > - inc_lruvec_page_state(page, NR_PAGETABLE);
> > + __folio_set_table(folio);
> > + lruvec_stat_add_folio(folio, NR_PAGETABLE);
> >   return true;
> >  }
> >
> > +static inline bool pgtable_pmd_page_ctor(struct page *page)
> > +{
> > + return ptdesc_pmd_ctor(page_ptdesc(page));
> > +}
> > +
> > +static inline void ptdesc_pmd_dtor(struct ptdesc *ptdesc)
> > +{
> > + struct folio *folio = ptdesc_folio(ptdesc);
> > +
> > + pmd_ptlock_free(ptdesc);
> > + __folio_clear_table(folio);
> > + lruvec_stat_sub_folio(folio, NR_PAGETABLE);
> > +}
> > +
> >  static inline void pgtable_pmd_page_dtor(struct page *page)
> >  {
> > - pmd_ptlock_free(page_ptdesc(page));
> > - __ClearPageTable(page);
> > - dec_lruvec_page_state(page, NR_PAGETABLE);
> > + ptdesc_pmd_dtor(page_ptdesc(page));
> >  }
> >
> >  /*
> > --
> > 2.39.2
> >
> >
>
> --
> Sincerely yours,
> Mike.


Re: [PATCH v2 05/34] mm: add utility functions for ptdesc

2023-05-25 Thread Vishal Moola
On Thu, May 25, 2023 at 2:10 AM Mike Rapoport  wrote:
>
> On Mon, May 01, 2023 at 12:28:00PM -0700, Vishal Moola (Oracle) wrote:
> > Introduce utility functions setting the foundation for ptdescs. These
> > will also assist in the splitting out of ptdesc from struct page.
> >
> > ptdesc_alloc() is defined to allocate new ptdesc pages as compound
> > pages. This is to standardize ptdescs by allowing for one allocation
> > and one free function, in contrast to 2 allocation and 2 free functions.
> >
> > Signed-off-by: Vishal Moola (Oracle) 
> > ---
> >  include/asm-generic/tlb.h | 11 ++
> >  include/linux/mm.h| 44 +++
> >  include/linux/pgtable.h   | 12 +++
> >  3 files changed, 67 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index b46617207c93..6bade9e0e799 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -481,6 +481,17 @@ static inline void tlb_remove_page(struct mmu_gather 
> > *tlb, struct page *page)
> >   return tlb_remove_page_size(tlb, page, PAGE_SIZE);
> >  }
> >
> > +static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
> > +{
> > + tlb_remove_table(tlb, pt);
> > +}
> > +
> > +/* Like tlb_remove_ptdesc, but for page-like page directories. */
> > +static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct 
> > ptdesc *pt)
> > +{
> > + tlb_remove_page(tlb, ptdesc_page(pt));
> > +}
> > +
> >  static inline void tlb_change_page_size(struct mmu_gather *tlb,
> >unsigned int page_size)
> >  {
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index b18848ae7e22..258f3b730359 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2744,6 +2744,45 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, 
> > pud_t *pud, unsigned long a
> >  }
> >  #endif /* CONFIG_MMU */
> >
> > +static inline struct ptdesc *virt_to_ptdesc(const void *x)
> > +{
> > + return page_ptdesc(virt_to_head_page(x));
>
> Do we ever use compound pages for page tables?

MIPS and s390 crst tables use multi-order (but not compound) pages.
The ptdesc api *should* change that, but until all the allocation/free paths
are changed it may cause problems.
Thanks for catching that, I'll change it in v3.

> > +}
> > +
> > +static inline void *ptdesc_to_virt(const struct ptdesc *pt)
> > +{
> > + return page_to_virt(ptdesc_page(pt));
> > +}
> > +
> > +static inline void *ptdesc_address(const struct ptdesc *pt)
> > +{
> > + return folio_address(ptdesc_folio(pt));
> > +}
> > +
> > +static inline bool ptdesc_is_reserved(struct ptdesc *pt)
> > +{
> > + return folio_test_reserved(ptdesc_folio(pt));
> > +}
> > +
> > +static inline struct ptdesc *ptdesc_alloc(gfp_t gfp, unsigned int order)
> > +{
> > + struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> > +
> > + return page_ptdesc(page);
> > +}
> > +
> > +static inline void ptdesc_free(struct ptdesc *pt)
> > +{
> > + struct page *page = ptdesc_page(pt);
> > +
> > + __free_pages(page, compound_order(page));
> > +}
>
> The ptdesc_{alloc,free} API does not sound right to me. The name
> ptdesc_alloc() implies the allocation of the ptdesc itself, rather than
> allocation of page table page. The same goes for free.

I'm not sure I see the difference. Could you elaborate?

> > +
> > +static inline void ptdesc_clear(void *x)
> > +{
> > + clear_page(x);
> > +}
> > +
> >  #if USE_SPLIT_PTE_PTLOCKS
> >  #if ALLOC_SPLIT_PTLOCKS
> >  void __init ptlock_cache_init(void);
> > @@ -2970,6 +3009,11 @@ static inline void mark_page_reserved(struct page 
> > *page)
> >   adjust_managed_page_count(page, -1);
> >  }
> >
> > +static inline void free_reserved_ptdesc(struct ptdesc *pt)
> > +{
> > + free_reserved_page(ptdesc_page(pt));
> > +}
> > +
> >  /*
> >   * Default method to free all the __init memory into the buddy system.
> >   * The freed pages will be poisoned with pattern "poison" if it's within
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 5e0f51308724..b067ac10f3dd 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1041,6 +1041,18 @@ TABLE_MATCH(ptl, ptl);
> >  #undef TABLE_MATCH
> >  static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
> >
> > +#define ptdesc_page(pt)  (_Generic((pt),   
> >   \
> > + const struct ptdesc *:  (const struct page *)(pt),  \
> > + struct ptdesc *:(struct page *)(pt)))
> > +
> > +#define ptdesc_folio(pt) (_Generic((pt), \
> > + const struct ptdesc *:  (const struct folio *)(pt), \
> > + struct ptdesc *:(struct folio *)(pt)))
> > +
> > +#define page_ptdesc(p)   (_Generic((p),
> >   \
> > + const struct page *:(const struct 

Re: [PATCH v2 02/34] s390: Use _pt_s390_gaddr for gmap address tracking

2023-05-25 Thread Vishal Moola
On Thu, May 25, 2023 at 1:58 AM Mike Rapoport  wrote:
>
> On Mon, May 01, 2023 at 12:27:57PM -0700, Vishal Moola (Oracle) wrote:
> > s390 uses page->index to keep track of page tables for the guest address
> > space. In an attempt to consolidate the usage of page fields in s390,
> > replace _pt_pad_2 with _pt_s390_gaddr to replace page->index in gmap.
> >
> > This will help with the splitting of struct ptdesc from struct page, as
> > well as allow s390 to use _pt_frag_refcount for fragmented page table
> > tracking.
> >
> > Since page->_pt_s390_gaddr aliases with mapping, ensure its set to NULL
> > before freeing the pages as well.
>
> Wouldn't it be easier to use _pt_pad_1 which is aliased with lru and that
> does not seem to be used by page tables at all?

I initially thought the same, but s390 page tables use lru.


Re: [PATCH v2 01/34] mm: Add PAGE_TYPE_OP folio functions

2023-05-25 Thread Vishal Moola
On Thu, May 25, 2023 at 1:56 AM Mike Rapoport  wrote:
>
> Hi,
>
> On Mon, May 01, 2023 at 12:27:56PM -0700, Vishal Moola (Oracle) wrote:
> > No folio equivalents for page type operations have been defined, so
> > define them for later folio conversions.
>
> Can you please elaborate why would we need folios for page table descriptors?

Thanks for the review!

These macros are for callers that care about the page type, i.e. Table and
Buddy. Aside from accounting for those cases, the page tables don't use folios.
These are more for the cleanliness of those callers.


Re: [PATCH] irq_work: consolidate arch_irq_work_raise prototypes

2023-05-25 Thread Catalin Marinas
On Tue, May 16, 2023 at 10:02:31PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The prototype was hidden on x86, which causes a warning:
> 
> kernel/irq_work.c:72:13: error: no previous prototype for 
> 'arch_irq_work_raise' [-Werror=missing-prototypes]
> 
> Fix this by providing it in only one place that is always visible.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 


Re: [PATCH 03/23] arm64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:45:57PM -0700, Hugh Dickins wrote:
> pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
> that: to keep balance in future, use the recently added pte_alloc_huge()
> instead; with pte_offset_huge() a better name for pte_offset_kernel().
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 


Re: [PATCH 02/23] arm64: allow pte_offset_map() to fail

2023-05-25 Thread Catalin Marinas
On Tue, May 09, 2023 at 09:43:47PM -0700, Hugh Dickins wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
> 
> Signed-off-by: Hugh Dickins 

Acked-by: Catalin Marinas 


Re: [PATCH v2 1/2] soc: fsl: cpm1: Fix TSA and QMC dependencies in case of COMPILE_TEST

2023-05-25 Thread Randy Dunlap



On 5/23/23 01:59, Herve Codina wrote:
> In order to compile tsa.c and qmc.c, CONFIG_CPM must be set.
> 
> Without this dependency, the linker fails with some missing
> symbols for COMPILE_TEST configurations that need QMC without
> enabling CPM.
> 
> Signed-off-by: Herve Codina 
> Reported-by: kernel test robot 
> Link: 
> https://lore.kernel.org/oe-kbuild-all/202305160221.9xgweobz-...@intel.com/

Fixes all of my CPM build issues. (with patch 2/2 also applied)
Thanks.

Acked-by: Randy Dunlap 
Tested-by: Randy Dunlap  # build-tested

> ---
>  drivers/soc/fsl/qe/Kconfig | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/soc/fsl/qe/Kconfig b/drivers/soc/fsl/qe/Kconfig
> index 7268c2fbcbc1..e0d096607fef 100644
> --- a/drivers/soc/fsl/qe/Kconfig
> +++ b/drivers/soc/fsl/qe/Kconfig
> @@ -36,7 +36,7 @@ config UCC
>  config CPM_TSA
>   tristate "CPM TSA support"
>   depends on OF && HAS_IOMEM
> - depends on CPM1 || COMPILE_TEST
> + depends on CPM1 || (CPM && COMPILE_TEST)
>   help
> Freescale CPM Time Slot Assigner (TSA)
> controller.
> @@ -47,7 +47,7 @@ config CPM_TSA
>  config CPM_QMC
>   tristate "CPM QMC support"
>   depends on OF && HAS_IOMEM
> - depends on CPM1 || (FSL_SOC && COMPILE_TEST)
> + depends on CPM1 || (FSL_SOC && CPM && COMPILE_TEST)
>   depends on CPM_TSA
>   help
> Freescale CPM QUICC Multichannel Controller

-- 
~Randy


Re: [PATCH v5 14/18] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-25 Thread Petr Mladek
On Fri 2023-05-19 10:18:38, Douglas Anderson wrote:
> Implement a hardlockup detector that doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
> 
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -85,7 +85,7 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
>  
>  #endif /* CONFIG_HARDLOCKUP_DETECTOR */
>  
> -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> +#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
>  
>  static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
>  static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
> @@ -106,6 +106,14 @@ notrace void arch_touch_nmi_watchdog(void)
>  }
>  EXPORT_SYMBOL(arch_touch_nmi_watchdog);
>  
> +void watchdog_hardlockup_touch_cpu(unsigned int cpu)
> +{
> + per_cpu(watchdog_hardlockup_touched, cpu) = true;
> +
> + /* Match with smp_rmb() in watchdog_hardlockup_check() */
> + smp_wmb();

It is great that you described where the related barrier is.

Another important information is what exactly is synchronized.
And I am actually not sure what we are synchronizing here.

My understanding is that a write barrier should synchronize
related writes, for example:

X = ...;
/* Make sure that X is modified before Y */
smp_wmb();
Y = ...;

And the related read barrier should synchronize the related reads,
for example:

if (test(Y)) {
/*
 * Make sure that we use the updated X when
 * we saw the updated Y.
 */
 smp_rmb();
 do_something(X);
 }

IMHO, we do not need any barrier here because we have only
one variable "watchdog_hardlockup_touched" here.
watchdog_hardlockup_check() will either see the updated value
or not. But it does not synchronize it against any other
variables or values.

> +}
> +
>  static bool is_hardlockup(unsigned int cpu)
>  {
>   int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
> @@ -443,11 +454,15 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> hrtimer *hrtimer)
>   struct pt_regs *regs = get_irq_regs();
>   int duration;
>   int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;
> + unsigned long hrtimer_interrupts;
>  
>   if (!watchdog_enabled)
>   return HRTIMER_NORESTART;
>  
> - watchdog_hardlockup_kick();
> + hrtimer_interrupts = watchdog_hardlockup_kick();
> +
> + /* test for hardlockups */

I would omit the comment. It is not valid when perf detector is used.
And checking the buddy is clear from the function name.

> + watchdog_buddy_check_hardlockup(hrtimer_interrupts);

I would personally move this into watchdog_hardlockup_kick().
watchdog_timer_fn() is already complex enough. And checking
the buddy when kicking a CPU makes sense.

Also I would not pass "hrtimer_interrupts". I guess that it is
just an optimization. It is an extra churn in the code. IMHO,
it is not worth it. This code does not need to be super optimized.

> +

>  
>   /* kick the softlockup detector */
> >   if (completion_done(this_cpu_ptr(&softlockup_completion))) {
> diff --git a/kernel/watchdog_buddy.c b/kernel/watchdog_buddy.c
> new file mode 100644
> index ..fee45af2e5bd
> --- /dev/null
> +++ b/kernel/watchdog_buddy.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static cpumask_t __read_mostly watchdog_cpus;
> +
> +static unsigned int watchdog_next_cpu(unsigned int cpu)
> +{
> + cpumask_t cpus = watchdog_cpus;

A copy should be done by cpumask_copy().

But the question is why a copy would be needed. When called from
watchdog_buddy_check_hardlockup(), this function is not synchronized
against the CPU hotplug. And even the copying would be racy.

IMHO, the copy does not help much and we do not need it.

The only important thing is that this function would return
a valid CPU. And I think that is guaranteed because
CPU0 cannot be disabled.
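
For reference, if a snapshot really were wanted, the idiomatic form
would be something like this (illustrative only; an on-stack cpumask_t
mirrors the code under review and is only OK for small NR_CPUS):

static unsigned int example_watchdog_next_cpu(unsigned int cpu)
{
	cpumask_t cpus;
	unsigned int next_cpu;

	cpumask_copy(&cpus, &watchdog_cpus);	/* instead of struct assignment */
	next_cpu = cpumask_next(cpu, &cpus);
	if (next_cpu >= nr_cpu_ids)
		next_cpu = cpumask_first(&cpus);

	if (next_cpu == cpu)
		return nr_cpu_ids;

	return next_cpu;
}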

> + unsigned int next_cpu;
> +
> + next_cpu = cpumask_next(cpu, &cpus);
> + if (next_cpu >= nr_cpu_ids)
> + next_cpu = cpumask_first(&cpus);
> +
> + if (next_cpu == cpu)
> + return nr_cpu_ids;
> +
> + return next_cpu;
> +}
> +
> +int __init watchdog_hardlockup_probe(void)
> +{
> + return 0;
> +}
> +
> +void watchdog_hardlockup_enable(unsigned int cpu)
> +{
> + unsigned int next_cpu;
> +
> + /*
> +  * The new CPU will be marked online before the hrtimer interrupt
> +  * gets a chance to run on it. If another CPU tests for a
> +  * hardlockup on the new CPU before it has run its the hrtimer
> +  * interrupt, it will get a false 

[PATCH] powerpc/crypto: Fix aes-gcm-p10 link errors

2023-05-25 Thread Michael Ellerman
The recently added P10 AES/GCM code added some files containing
CRYPTOGAMS perl-asm code which are near duplicates of the p8 files
found in drivers/crypto/vmx.

In particular the newly added files produce functions with identical
names to the existing code.

When the kernel is built with CONFIG_CRYPTO_AES_GCM_P10=y and
CONFIG_CRYPTO_DEV_VMX_ENCRYPT=y that leads to link errors, eg:

  ld: drivers/crypto/vmx/aesp8-ppc.o: in function `aes_p8_set_encrypt_key':
  (.text+0xa0): multiple definition of `aes_p8_set_encrypt_key'; 
arch/powerpc/crypto/aesp8-ppc.o:(.text+0xa0): first defined here
  ...
  ld: drivers/crypto/vmx/ghashp8-ppc.o: in function `gcm_ghash_p8':
  (.text+0x140): multiple definition of `gcm_ghash_p8'; 
arch/powerpc/crypto/ghashp8-ppc.o:(.text+0x2e4): first defined here

Fix it for now by renaming the newly added files and functions to use
"p10" instead of "p8" in the names.

Fixes: 45a4672b9a6e ("crypto: p10-aes-gcm - Update Kconfig and Makefile")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/crypto/Makefile   | 10 +-
 arch/powerpc/crypto/aes-gcm-p10-glue.c | 18 +-
 .../crypto/{aesp8-ppc.pl => aesp10-ppc.pl} |  2 +-
 .../crypto/{ghashp8-ppc.pl => ghashp10-ppc.pl} | 12 ++--
 4 files changed, 21 insertions(+), 21 deletions(-)
 rename arch/powerpc/crypto/{aesp8-ppc.pl => aesp10-ppc.pl} (99%)
 rename arch/powerpc/crypto/{ghashp8-ppc.pl => ghashp10-ppc.pl} (97%)

diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile
index 05c7486f42c5..7b4f516abec1 100644
--- a/arch/powerpc/crypto/Makefile
+++ b/arch/powerpc/crypto/Makefile
@@ -22,15 +22,15 @@ sha1-ppc-spe-y := sha1-spe-asm.o sha1-spe-glue.o
 sha256-ppc-spe-y := sha256-spe-asm.o sha256-spe-glue.o
 crc32c-vpmsum-y := crc32c-vpmsum_asm.o crc32c-vpmsum_glue.o
 crct10dif-vpmsum-y := crct10dif-vpmsum_asm.o crct10dif-vpmsum_glue.o
-aes-gcm-p10-crypto-y := aes-gcm-p10-glue.o aes-gcm-p10.o ghashp8-ppc.o 
aesp8-ppc.o
+aes-gcm-p10-crypto-y := aes-gcm-p10-glue.o aes-gcm-p10.o ghashp10-ppc.o 
aesp10-ppc.o
 
 quiet_cmd_perl = PERL$@
   cmd_perl = $(PERL) $< $(if $(CONFIG_CPU_LITTLE_ENDIAN), linux-ppc64le, 
linux-ppc64) > $@
 
-targets += aesp8-ppc.S ghashp8-ppc.S
+targets += aesp10-ppc.S ghashp10-ppc.S
 
-$(obj)/aesp8-ppc.S $(obj)/ghashp8-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE
+$(obj)/aesp10-ppc.S $(obj)/ghashp10-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE
$(call if_changed,perl)
 
-OBJECT_FILES_NON_STANDARD_aesp8-ppc.o := y
-OBJECT_FILES_NON_STANDARD_ghashp8-ppc.o := y
+OBJECT_FILES_NON_STANDARD_aesp10-ppc.o := y
+OBJECT_FILES_NON_STANDARD_ghashp10-ppc.o := y
diff --git a/arch/powerpc/crypto/aes-gcm-p10-glue.c 
b/arch/powerpc/crypto/aes-gcm-p10-glue.c
index bd3475f5348d..4b6e899895e7 100644
--- a/arch/powerpc/crypto/aes-gcm-p10-glue.c
+++ b/arch/powerpc/crypto/aes-gcm-p10-glue.c
@@ -30,15 +30,15 @@ MODULE_AUTHOR("Danny Tsen aadLen = alen;
i = alen & ~0xf;
if (i) {
-   gcm_ghash_p8(nXi, hash->Htable+32, aad, i);
+   gcm_ghash_p10(nXi, hash->Htable+32, aad, i);
aad += i;
alen -= i;
}
@@ -102,7 +102,7 @@ static void set_aad(struct gcm_ctx *gctx, struct Hash_ctx 
*hash,
nXi[i] ^= aad[i];
 
memset(gctx->aad_hash, 0, 16);
-   gcm_ghash_p8(gctx->aad_hash, hash->Htable+32, nXi, 16);
+   gcm_ghash_p10(gctx->aad_hash, hash->Htable+32, nXi, 16);
} else {
memcpy(gctx->aad_hash, nXi, 16);
}
@@ -115,7 +115,7 @@ static void gcmp10_init(struct gcm_ctx *gctx, u8 *iv, 
unsigned char *rdkey,
 {
__be32 counter = cpu_to_be32(1);
 
-   aes_p8_encrypt(hash->H, hash->H, rdkey);
+   aes_p10_encrypt(hash->H, hash->H, rdkey);
set_subkey(hash->H);
gcm_init_htable(hash->Htable+32, hash->H);
 
@@ -126,7 +126,7 @@ static void gcmp10_init(struct gcm_ctx *gctx, u8 *iv, 
unsigned char *rdkey,
/*
 * Encrypt counter vector as iv tag and increment counter.
 */
-   aes_p8_encrypt(iv, gctx->ivtag, rdkey);
+   aes_p10_encrypt(iv, gctx->ivtag, rdkey);
 
counter = cpu_to_be32(2);
*((__be32 *)(iv+12)) = counter;
@@ -160,7 +160,7 @@ static void finish_tag(struct gcm_ctx *gctx, struct 
Hash_ctx *hash, int len)
/*
 * hash (AAD len and len)
 */
-   gcm_ghash_p8(hash->Htable, hash->Htable+32, aclen, 16);
+   gcm_ghash_p10(hash->Htable, hash->Htable+32, aclen, 16);
 
for (i = 0; i < 16; i++)
hash->Htable[i] ^= gctx->ivtag[i];
@@ -192,7 +192,7 @@ static int p10_aes_gcm_setkey(struct crypto_aead *aead, 
const u8 *key,
int ret;
 
vsx_begin();
> -   ret = aes_p8_set_encrypt_key(key, keylen * 8, &ctx->enc_key);
> +   ret = aes_p10_set_encrypt_key(key, keylen * 8, &ctx->enc_key);
vsx_end();
 
return ret ? -EINVAL : 0;
diff --git a/arch/powerpc/crypto/aesp8-ppc.pl 

Re: [PATCH v2] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-25 Thread Gaurav Batra

Hello Michael,

Here are the changes in v2 of the patch

1. Added some wording to the changelog to specify why we are seeing the 
issue now. Also added "Cc: sta...@vger.kernel.org".


2. Changed "limit" to unsigned long. I have kept "rpages" as "long".

Thanks,

Gaurav

On 5/25/23 9:34 AM, Gaurav Batra wrote:

As of now, in tce_freemulti_pSeriesLP(), there is no limit on how many TCEs
are passed to H_STUFF_TCE hcall. This was not an issue until now. Newer
firmware releases have started enforcing this requirement.

The interface has been in its current form since the beginning.

Cc: sta...@vger.kernel.org

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
  arch/powerpc/platforms/pseries/iommu.c | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..f159a195101d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -306,13 +306,22 @@ static void tce_free_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
  static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, 
long npages)
  {
u64 rc;
+   long rpages = npages;
+   unsigned long limit;

if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
return tce_free_pSeriesLP(tbl->it_index, tcenum,
  tbl->it_page_shift, npages);

-   rc = plpar_tce_stuff((u64)tbl->it_index,
-(u64)tcenum << tbl->it_page_shift, 0, npages);
+   do {
+   limit = min_t(unsigned long, rpages, 512);
+
+   rc = plpar_tce_stuff((u64)tbl->it_index,
+   (u64)tcenum << tbl->it_page_shift, 0, limit);
+
+   rpages -= limit;
+   tcenum += limit;
+   } while (rpages > 0 && !rc);

if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");


[PATCH v2] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-25 Thread Gaurav Batra
As of now, in tce_freemulti_pSeriesLP(), there is no limit on how many TCEs
are passed to H_STUFF_TCE hcall. This was not an issue until now. Newer
firmware releases have started enforcing this requirement.

The interface has been in its current form since the beginning.

Cc: sta...@vger.kernel.org

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/platforms/pseries/iommu.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..f159a195101d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -306,13 +306,22 @@ static void tce_free_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
 static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long 
npages)
 {
u64 rc;
+   long rpages = npages;
+   unsigned long limit;
 
if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
return tce_free_pSeriesLP(tbl->it_index, tcenum,
  tbl->it_page_shift, npages);
 
-   rc = plpar_tce_stuff((u64)tbl->it_index,
-(u64)tcenum << tbl->it_page_shift, 0, npages);
+   do {
+   limit = min_t(unsigned long, rpages, 512);
+
+   rc = plpar_tce_stuff((u64)tbl->it_index,
+   (u64)tcenum << tbl->it_page_shift, 0, limit);
+
+   rpages -= limit;
+   tcenum += limit;
+   } while (rpages > 0 && !rc);
 
if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
-- 
2.39.2 (Apple Git-143)



Re: [PATCH] powerpc/crypto: Fix aes-gcm-p10 build when VSX=n

2023-05-25 Thread Michael Ellerman
Joel Stanley  writes:
> On Mon, 15 May 2023 at 12:48, Michael Ellerman  wrote:
>>
>> When VSX is disabled, eg. microwatt_defconfig, the build fails with:
>>
>>   In function ‘enable_kernel_vsx’,
>>   inlined from ‘vsx_begin’ at 
>> arch/powerpc/crypto/aes-gcm-p10-glue.c:68:2,
>>   inlined from ‘p10_aes_gcm_crypt.constprop’ at 
>> arch/powerpc/crypto/aes-gcm-p10-glue.c:244:2:
>>   ...
>>   arch/powerpc/include/asm/switch_to.h:86:9: note: in expansion of macro 
>> ‘BUILD_BUG’
>>  86 | BUILD_BUG();
>> | ^
>>
>> Fix it by making the p10-aes-gcm code depend on VSX.
>>
>> Signed-off-by: Michael Ellerman 
>
> Reviewed-by: Joel Stanley 
>
>> ---
>>  arch/powerpc/crypto/Kconfig | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/crypto/Kconfig b/arch/powerpc/crypto/Kconfig
>> index 7113f9355165..ad1872518992 100644
>> --- a/arch/powerpc/crypto/Kconfig
>> +++ b/arch/powerpc/crypto/Kconfig
>> @@ -96,7 +96,7 @@ config CRYPTO_AES_PPC_SPE
>>
>>  config CRYPTO_AES_GCM_P10
>> tristate "Stitched AES/GCM acceleration support on P10 or later CPU 
>> (PPC)"
>> -   depends on PPC64 && CPU_LITTLE_ENDIAN
>> +   depends on PPC64 && CPU_LITTLE_ENDIAN && VSX
>
> VSX depends on PPC_BOOK3S_64 but I guess there's no harm in keeping
> the PPC64 here?

I guess it's somewhat redundant. But also harmless, and possibly clearer
for folks who don't have the ingrained knowledge that VSX is 64-bit
Book3S only. So yeah I think I'll leave it as-is.

cheers


Re: [PATCH v2 13/34] mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}

2023-05-25 Thread Mike Rapoport
On Mon, May 01, 2023 at 12:28:08PM -0700, Vishal Moola (Oracle) wrote:
> Creates ptdesc_pte_ctor(), ptdesc_pmd_ctor(), ptdesc_pte_dtor(), and
> ptdesc_pmd_dtor() and make the original pgtable constructor/destructors
> wrappers.

I think pgtable_pXY_ctor/dtor names would be better.
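
(For illustration, the suggested renaming might look roughly as below; the
names are illustrative only, and the body matches ptdesc_pte_ctor() in the
patch quoted further down.)

static inline bool pgtable_pte_ctor(struct ptdesc *ptdesc)
{
	struct folio *folio = ptdesc_folio(ptdesc);

	if (!ptlock_init(ptdesc))
		return false;
	__folio_set_table(folio);
	lruvec_stat_add_folio(folio, NR_PAGETABLE);
	return true;
}

static inline bool pgtable_pte_page_ctor(struct page *page)
{
	return pgtable_pte_ctor(page_ptdesc(page));
}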
 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  include/linux/mm.h | 56 ++
>  1 file changed, 42 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 58c911341a33..dc61aeca9077 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2847,20 +2847,34 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) 
> { return true; }
>  static inline void ptlock_free(struct ptdesc *ptdesc) {}
>  #endif /* USE_SPLIT_PTE_PTLOCKS */
>  
> -static inline bool pgtable_pte_page_ctor(struct page *page)
> +static inline bool ptdesc_pte_ctor(struct ptdesc *ptdesc)
>  {
> - if (!ptlock_init(page_ptdesc(page)))
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + if (!ptlock_init(ptdesc))
>   return false;
> - __SetPageTable(page);
> - inc_lruvec_page_state(page, NR_PAGETABLE);
> + __folio_set_table(folio);
> + lruvec_stat_add_folio(folio, NR_PAGETABLE);
>   return true;
>  }
>  
> +static inline bool pgtable_pte_page_ctor(struct page *page)
> +{
> + return ptdesc_pte_ctor(page_ptdesc(page));
> +}
> +
> +static inline void ptdesc_pte_dtor(struct ptdesc *ptdesc)
> +{
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + ptlock_free(ptdesc);
> + __folio_clear_table(folio);
> + lruvec_stat_sub_folio(folio, NR_PAGETABLE);
> +}
> +
>  static inline void pgtable_pte_page_dtor(struct page *page)
>  {
> - ptlock_free(page_ptdesc(page));
> - __ClearPageTable(page);
> - dec_lruvec_page_state(page, NR_PAGETABLE);
> + ptdesc_pte_dtor(page_ptdesc(page));
>  }
>  
>  #define pte_offset_map_lock(mm, pmd, address, ptlp)  \
> @@ -2942,20 +2956,34 @@ static inline spinlock_t *pmd_lock(struct mm_struct 
> *mm, pmd_t *pmd)
>   return ptl;
>  }
>  
> -static inline bool pgtable_pmd_page_ctor(struct page *page)
> +static inline bool ptdesc_pmd_ctor(struct ptdesc *ptdesc)
>  {
> - if (!pmd_ptlock_init(page_ptdesc(page)))
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + if (!pmd_ptlock_init(ptdesc))
>   return false;
> - __SetPageTable(page);
> - inc_lruvec_page_state(page, NR_PAGETABLE);
> + __folio_set_table(folio);
> + lruvec_stat_add_folio(folio, NR_PAGETABLE);
>   return true;
>  }
>  
> +static inline bool pgtable_pmd_page_ctor(struct page *page)
> +{
> + return ptdesc_pmd_ctor(page_ptdesc(page));
> +}
> +
> +static inline void ptdesc_pmd_dtor(struct ptdesc *ptdesc)
> +{
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + pmd_ptlock_free(ptdesc);
> + __folio_clear_table(folio);
> + lruvec_stat_sub_folio(folio, NR_PAGETABLE);
> +}
> +
>  static inline void pgtable_pmd_page_dtor(struct page *page)
>  {
> - pmd_ptlock_free(page_ptdesc(page));
> - __ClearPageTable(page);
> - dec_lruvec_page_state(page, NR_PAGETABLE);
> + ptdesc_pmd_dtor(page_ptdesc(page));
>  }
>  
>  /*
> -- 
> 2.39.2
> 
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 05/34] mm: add utility functions for ptdesc

2023-05-25 Thread Mike Rapoport
On Mon, May 01, 2023 at 12:28:00PM -0700, Vishal Moola (Oracle) wrote:
> Introduce utility functions setting the foundation for ptdescs. These
> will also assist in the splitting out of ptdesc from struct page.
> 
> ptdesc_alloc() is defined to allocate new ptdesc pages as compound
> pages. This is to standardize ptdescs by allowing for one allocation
> and one free function, in contrast to 2 allocation and 2 free functions.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  include/asm-generic/tlb.h | 11 ++
>  include/linux/mm.h| 44 +++
>  include/linux/pgtable.h   | 12 +++
>  3 files changed, 67 insertions(+)
> 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index b46617207c93..6bade9e0e799 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -481,6 +481,17 @@ static inline void tlb_remove_page(struct mmu_gather 
> *tlb, struct page *page)
>   return tlb_remove_page_size(tlb, page, PAGE_SIZE);
>  }
>  
> +static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
> +{
> + tlb_remove_table(tlb, pt);
> +}
> +
> +/* Like tlb_remove_ptdesc, but for page-like page directories. */
> +static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct 
> ptdesc *pt)
> +{
> + tlb_remove_page(tlb, ptdesc_page(pt));
> +}
> +
>  static inline void tlb_change_page_size(struct mmu_gather *tlb,
>unsigned int page_size)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b18848ae7e22..258f3b730359 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2744,6 +2744,45 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, 
> pud_t *pud, unsigned long a
>  }
>  #endif /* CONFIG_MMU */
>  
> +static inline struct ptdesc *virt_to_ptdesc(const void *x)
> +{
> + return page_ptdesc(virt_to_head_page(x));

Do we ever use compound pages for page tables?

> +}
> +
> +static inline void *ptdesc_to_virt(const struct ptdesc *pt)
> +{
> + return page_to_virt(ptdesc_page(pt));
> +}
> +
> +static inline void *ptdesc_address(const struct ptdesc *pt)
> +{
> + return folio_address(ptdesc_folio(pt));
> +}
> +
> +static inline bool ptdesc_is_reserved(struct ptdesc *pt)
> +{
> + return folio_test_reserved(ptdesc_folio(pt));
> +}
> +
> +static inline struct ptdesc *ptdesc_alloc(gfp_t gfp, unsigned int order)
> +{
> + struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> +
> + return page_ptdesc(page);
> +}
> +
> +static inline void ptdesc_free(struct ptdesc *pt)
> +{
> + struct page *page = ptdesc_page(pt);
> +
> + __free_pages(page, compound_order(page));
> +}

The ptdesc_{alloc,free} API does not sound right to me. The name
ptdesc_alloc() implies the allocation of the ptdesc itself, rather than
the allocation of a page table page. The same goes for free.
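
For illustration, one way to address this would be to name the helpers after
what they allocate; the names below are purely illustrative and the bodies are
unchanged from the patch (editor's sketch, nothing decided in this thread):

static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}

static inline void pagetable_free(struct ptdesc *pt)
{
	struct page *page = ptdesc_page(pt);

	__free_pages(page, compound_order(page));
}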

> +
> +static inline void ptdesc_clear(void *x)
> +{
> + clear_page(x);
> +}
> +
>  #if USE_SPLIT_PTE_PTLOCKS
>  #if ALLOC_SPLIT_PTLOCKS
>  void __init ptlock_cache_init(void);
> @@ -2970,6 +3009,11 @@ static inline void mark_page_reserved(struct page 
> *page)
>   adjust_managed_page_count(page, -1);
>  }
>  
> +static inline void free_reserved_ptdesc(struct ptdesc *pt)
> +{
> + free_reserved_page(ptdesc_page(pt));
> +}
> +
>  /*
>   * Default method to free all the __init memory into the buddy system.
>   * The freed pages will be poisoned with pattern "poison" if it's within
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5e0f51308724..b067ac10f3dd 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1041,6 +1041,18 @@ TABLE_MATCH(ptl, ptl);
>  #undef TABLE_MATCH
>  static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
>  
> +#define ptdesc_page(pt)  (_Generic((pt), 
> \
> + const struct ptdesc *:  (const struct page *)(pt),  \
> + struct ptdesc *:(struct page *)(pt)))
> +
> +#define ptdesc_folio(pt) (_Generic((pt), \
> + const struct ptdesc *:  (const struct folio *)(pt), \
> + struct ptdesc *:(struct folio *)(pt)))
> +
> +#define page_ptdesc(p)   (_Generic((p),  
> \
> + const struct page *:(const struct ptdesc *)(p), \
> + struct page *:  (struct ptdesc *)(p)))
> +
>  /*
>   * No-op macros that just return the current protection value. Defined here
>   * because these macros can be used even if CONFIG_MMU is not defined.
> -- 
> 2.39.2
> 
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 02/34] s390: Use _pt_s390_gaddr for gmap address tracking

2023-05-25 Thread Mike Rapoport
On Mon, May 01, 2023 at 12:27:57PM -0700, Vishal Moola (Oracle) wrote:
> s390 uses page->index to keep track of page tables for the guest address
> space. In an attempt to consolidate the usage of page fields in s390,
> replace _pt_pad_2 with _pt_s390_gaddr, which takes over the role of
> page->index in gmap.
> 
> This will help with the splitting of struct ptdesc from struct page, as
> well as allow s390 to use _pt_frag_refcount for fragmented page table
> tracking.
> 
> Since page->_pt_s390_gaddr aliases with mapping, ensure it is set to NULL
> before freeing the pages as well.

Wouldn't it be easier to use _pt_pad_1 which is aliased with lru and that
does not seem to be used by page tables at all?
 
> This also reverts commit 7e25de77bc5ea ("s390/mm: use pmd_pgtable_page()
> helper in __gmap_segment_gaddr()") which had s390 use
> pmd_pgtable_page() to get a gmap page table, as pmd_pgtable_page()
> should be used for more generic process page tables.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  arch/s390/mm/gmap.c  | 56 +++-
>  include/linux/mm_types.h |  2 +-
>  2 files changed, 39 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index dfe905c7bd8e..a9e8b1805894 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -70,7 +70,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
>   page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
>   if (!page)
>   goto out_free;
> - page->index = 0;
> + page->_pt_s390_gaddr = 0;
>   list_add(&page->lru, &gmap->crst_list);
>   table = page_to_virt(page);
>   crst_table_init(table, etype);
> @@ -187,16 +187,20 @@ static void gmap_free(struct gmap *gmap)
>   if (!(gmap_is_shadow(gmap) && gmap->removed))
>   gmap_flush_tlb(gmap);
>   /* Free all segment & region tables. */
> - list_for_each_entry_safe(page, next, &gmap->crst_list, lru)
> + list_for_each_entry_safe(page, next, &gmap->crst_list, lru) {
> + page->_pt_s390_gaddr = 0;
>   __free_pages(page, CRST_ALLOC_ORDER);
> + }
>   gmap_radix_tree_free(&gmap->guest_to_host);
>   gmap_radix_tree_free(&gmap->host_to_guest);
>  
>   /* Free additional data for a shadow gmap */
>   if (gmap_is_shadow(gmap)) {
>   /* Free all page tables. */
> - list_for_each_entry_safe(page, next, &gmap->pt_list, lru)
> + list_for_each_entry_safe(page, next, &gmap->pt_list, lru) {
> + page->_pt_s390_gaddr = 0;
>   page_table_free_pgste(page);
> + }
>   gmap_rmap_radix_tree_free(&gmap->host_to_rmap);
>   /* Release reference to the parent */
>   gmap_put(gmap->parent);
> @@ -318,12 +322,14 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned 
> long *table,
>   list_add(&page->lru, &gmap->crst_list);
>   *table = __pa(new) | _REGION_ENTRY_LENGTH |
>   (*table & _REGION_ENTRY_TYPE_MASK);
> - page->index = gaddr;
> + page->_pt_s390_gaddr = gaddr;
>   page = NULL;
>   }
>   spin_unlock(&gmap->guest_table_lock);
> - if (page)
> + if (page) {
> + page->_pt_s390_gaddr = 0;
>   __free_pages(page, CRST_ALLOC_ORDER);
> + }
>   return 0;
>  }
>  
> @@ -336,12 +342,14 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned 
> long *table,
>  static unsigned long __gmap_segment_gaddr(unsigned long *entry)
>  {
>   struct page *page;
> - unsigned long offset;
> + unsigned long offset, mask;
>  
>   offset = (unsigned long) entry / sizeof(unsigned long);
>   offset = (offset & (PTRS_PER_PMD - 1)) * PMD_SIZE;
> - page = pmd_pgtable_page((pmd_t *) entry);
> - return page->index + offset;
> + mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
> + page = virt_to_page((void *)((unsigned long) entry & mask));
> +
> + return page->_pt_s390_gaddr + offset;
>  }
>  
>  /**
> @@ -1351,6 +1359,7 @@ static void gmap_unshadow_pgt(struct gmap *sg, unsigned 
> long raddr)
>   /* Free page table */
>   page = phys_to_page(pgt);
>   list_del(&page->lru);
> + page->_pt_s390_gaddr = 0;
>   page_table_free_pgste(page);
>  }
>  
> @@ -1379,6 +1388,7 @@ static void __gmap_unshadow_sgt(struct gmap *sg, 
> unsigned long raddr,
>   /* Free page table */
>   page = phys_to_page(pgt);
>   list_del(&page->lru);
> + page->_pt_s390_gaddr = 0;
>   page_table_free_pgste(page);
>   }
>  }
> @@ -1409,6 +1419,7 @@ static void gmap_unshadow_sgt(struct gmap *sg, unsigned 
> long raddr)
>   /* Free segment table */
>   page = phys_to_page(sgt);
>   list_del(&page->lru);
> + page->_pt_s390_gaddr = 0;
>   __free_pages(page, CRST_ALLOC_ORDER);
>  }
>  
> @@ -1437,6 +1448,7 @@ static void __gmap_unshadow_r3t(struct gmap *sg, 
> unsigned long raddr,
>   /* Free segment table */
>  

Re: [PATCH v2 01/34] mm: Add PAGE_TYPE_OP folio functions

2023-05-25 Thread Mike Rapoport
Hi,

On Mon, May 01, 2023 at 12:27:56PM -0700, Vishal Moola (Oracle) wrote:
> No folio equivalents for page type operations have been defined, so
> define them for later folio conversions.

Can you please elaborate on why we would need folios for page table descriptors? 
 
> Also changes the Page##uname macros to take in const struct page* since
> we only read the memory here.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  include/linux/page-flags.h | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1c68d67b832f..607b495d1b57 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -902,6 +902,8 @@ static inline bool is_page_hwpoison(struct page *page)
>  
>  #define PageType(page, flag) \
>   ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> +#define folio_test_type(folio, flag) \
> + ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
>  
>  static inline int page_type_has_type(unsigned int page_type)
>  {
> @@ -914,20 +916,34 @@ static inline int page_has_type(struct page *page)
>  }
>  
>  #define PAGE_TYPE_OPS(uname, lname)  \
> -static __always_inline int Page##uname(struct page *page)\
> +static __always_inline int Page##uname(const struct page *page)  
> \
>  {\
>   return PageType(page, PG_##lname);  \
>  }\
> +static __always_inline int folio_test_##lname(const struct folio *folio)\
> +{\
> + return folio_test_type(folio, PG_##lname);  \
> +}\
>  static __always_inline void __SetPage##uname(struct page *page)  
> \
>  {\
>   VM_BUG_ON_PAGE(!PageType(page, 0), page);   \
>   page->page_type &= ~PG_##lname; \
>  }\
> +static __always_inline void __folio_set_##lname(struct folio *folio) \
> +{\
> + VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio); \
> + folio->page.page_type &= ~PG_##lname;   \
> +}\
>  static __always_inline void __ClearPage##uname(struct page *page)\
>  {\
>   VM_BUG_ON_PAGE(!Page##uname(page), page);   \
>   page->page_type |= PG_##lname;  \
> -}
> +}\
> +static __always_inline void __folio_clear_##lname(struct folio *folio)   
> \
> +{\
> + VM_BUG_ON_FOLIO(!folio_test_##lname(folio), folio); \
> + folio->page.page_type |= PG_##lname;\
> +}\
>  
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
> -- 
> 2.39.2
> 
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH 00/21] dma-mapping: unify support for cache flushes

2023-05-25 Thread Lad, Prabhakar
Hi Arnd,

On Mon, Mar 27, 2023 at 1:14 PM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> After a long discussion about adding SoC specific semantics for when
> to flush caches in drivers/soc/ drivers that we determined to be
> fundamentally flawed[1], I volunteered to try to move that logic into
> architecture-independent code and make all existing architectures do
> the same thing.
>
> As we had determined earlier, the behavior is wildly different across
> architectures, but most of the differences come down to either bugs
> (when required flushes are missing) or extra flushes that are harmless
> but might hurt performance.
>
> I finally found the time to come up with an implementation of this, which
> starts by replacing every outlier with one of the three common options:
>
>  1. architectures without speculative prefetching (hexagon, m68k,
> openrisc, sh, sparc, and certain armv4 and xtensa implementations)
> only flush their caches before a DMA, by cleaning write-back caches
> (if any) before a DMA to the device, and by invalidating the caches
> before a DMA from a device
>
>  2. arc, microblaze, mips, nios2, sh and later xtensa now follow the
> normal 32-bit arm model and invalidate their writeback caches
> again after a DMA from the device, to remove stale cache lines
> that got prefetched during the DMA. arc, csky and mips used to
> invalidate buffers also before the bidirectional DMA, but this
> is now skipped whenever we know it gets invalidated again
> after the DMA.
>
>  3. parisc, powerpc and riscv already flushed buffers before
> a DMA_FROM_DEVICE, and these get moved to the arm64 behavior
> that does the writeback before and invalidate after both
> DMA_FROM_DEVICE and DMA_BIDIRECTIONAL in order to avoid the
> problem of accidentally leaking stale data if the DMA does
> not actually happen[2].
>
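
For illustration, an editor's rough sketch of the three behaviours listed
above, expressed through the generic arch_sync_dma_*() hooks. cache_clean()
and cache_inv() are placeholders for whatever writeback/invalidate primitives
an architecture provides, and the two bools stand in for the per-architecture
compile-time parameters mentioned later in the cover letter; none of this is
code from the series.

#include <linux/types.h>
#include <linux/dma-direction.h>

void cache_clean(phys_addr_t paddr, size_t size);	/* placeholder: write back dirty lines */
void cache_inv(phys_addr_t paddr, size_t size);		/* placeholder: invalidate lines */

static const bool arch_prefetches_speculatively = true;		/* false for model 1 */
static const bool arch_cleans_before_dma_from_device = true;	/* true for model 3 */

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
			      enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:
		cache_clean(paddr, size);
		break;
	case DMA_FROM_DEVICE:
		if (arch_cleans_before_dma_from_device)
			cache_clean(paddr, size);	/* model 3: never expose stale data */
		else
			cache_inv(paddr, size);		/* models 1 and 2 */
		break;
	case DMA_BIDIRECTIONAL:
		cache_clean(paddr, size);
		if (!arch_prefetches_speculatively)
			cache_inv(paddr, size);		/* model 1: no post-DMA invalidate */
		break;
	default:
		break;
	}
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
			   enum dma_data_direction dir)
{
	/*
	 * Models 2 and 3: invalidate again after a DMA from the device, to
	 * drop lines speculatively prefetched while the DMA was in flight.
	 * Model 1 has nothing to do here.
	 */
	if (arch_prefetches_speculatively && dir != DMA_TO_DEVICE)
		cache_inv(paddr, size);
}
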
> The last patch in the series replaces the architecture specific code
> with a shared version that implements all three based on architecture
> specific parameters that are almost always determined at compile time.
>
> The difference between cases 1. and 2. is hardware specific, while between
> 2. and 3. we need to decide which semantics we want, but I explicitly
> avoid this question in my series and leave it to be decided later.
>
> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU. If we end up
> always writing back dirty cache lines before a DMA (option 3 above),
> then this point becomes moot, otherwise we should probably address this
> in a follow-up series to document one behavior or the other and implement
> it consistently.
>
> Please review!
>
>   Arnd
>
> [1] 
> https://lore.kernel.org/all/20221212115505.36770-1-prabhakar.mahadev-lad...@bp.renesas.com/
> [2] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
>
> Arnd Bergmann (21):
>   openrisc: dma-mapping: flush bidirectional mappings
>   xtensa: dma-mapping: use normal cache invalidation rules
>   sparc32: flush caches in dma_sync_*for_device
>   microblaze: dma-mapping: skip extra DMA flushes
>   powerpc: dma-mapping: split out cache operation logic
>   powerpc: dma-mapping: minimize for_cpu flushing
>   powerpc: dma-mapping: always clean cache in _for_device() op
>   riscv: dma-mapping: only invalidate after DMA, not flush
>   riscv: dma-mapping: skip invalidation before bidirectional DMA
>   csky: dma-mapping: skip invalidating before DMA from device
>   mips: dma-mapping: skip invalidating before bidirectional DMA
>   mips: dma-mapping: split out cache operation logic
>   arc: dma-mapping: skip invalidating before bidirectional DMA
>   parisc: dma-mapping: use regular flush/invalidate ops
>   ARM: dma-mapping: always invalidate WT caches before DMA
>   ARM: dma-mapping: bring back dmac_{clean,inv}_range
>   ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally
>   ARM: drop SMP support for ARM11MPCore
>   ARM: dma-mapping: use generic form of arch_sync_dma_* helpers
>   ARM: dma-mapping: split out arch_dma_mark_clean() helper
>   dma-mapping: replace custom code with generic implementation
>
Do you plan to send v2 for this series?

Cheers,
Prabhakar

>  arch/arc/mm/dma.c  |  66 ++--
>  arch/arm/Kconfig   |   4 +
>  arch/arm/include/asm/cacheflush.h  |  21 +++
>  arch/arm/include/asm/glue-cache.h  |   4 +
>  arch/arm/mach-oxnas/Kconfig|   4 -
>  arch/arm/mach-oxnas/Makefile   |   1 -
>  arch/arm/mach-oxnas/headsmp.S  |  23 ---
>  arch/arm/mach-oxnas/platsmp.c  |  

Re: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

2023-05-25 Thread Claudio Imbrenda
On Tue, 23 May 2023 18:49:14 -0700 (PDT)
Hugh Dickins  wrote:

> On Tue, 23 May 2023, Claudio Imbrenda wrote:
> > 
> > so if I understand the above correctly, pte_offset_map_lock will only
> > fail if the whole page table has disappeared, and in that case, it will
> > never reappear with zero pages, therefore we can safely skip (in that
> > case just break). if we were to do a continue instead of a break, we
> > would most likely fail again anyway.  
> 
> Yes, that's the most likely; and you hold mmap_write_lock() there,
> and VM_NOHUGEPAGE on all vmas, so I think it's the only foreseeable
> possibility.
> 
> > 
> > in that case I would still like a small change in your patch: please
> > write a short (2~3 lines max) comment about why it's ok to do things
> > that way  
> 
> Sure.
> 
> But I now see that I've disobeyed you, and gone to 4 lines (but in the
> comment above the function, so as not to distract from the code itself):
> is this good wording to you?  I needed to research how they were stopped
> from coming in afterwards, so wanted to put something greppable in there.
> 
> And, unless I'm misunderstanding, that "after THP was enabled" was
> always supposed to say "after THP was disabled" (because splitting a
> huge zero page pmd inserts a a page table full of little zero ptes).

indeed, thanks for noticing and fixing it

> 
> Or would you prefer the comment in the commit message instead,
> or down just above the pte_offset_map_lock() line?
> 
> It would be much better if I could find one place at the mm end, to
> enforce its end of the contract; but cannot think how to do that.
> 
> Hugh
> 
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
>   * Remove all empty zero pages from the mapping for lazy refaulting
>   * - This must be called after mm->context.has_pgste is set, to avoid
>   *   future creation of zero pages
> - * - This must be called after THP was enabled
> + * - This must be called after THP was disabled.
> + *
> + * mm contracts with s390, that even if mm were to remove a page table,
> + * racing with the loop below and so causing pte_offset_map_lock() to fail,
> + * it will never insert a page table containing empty zero pages once
> + * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
>   */
>  static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
>  unsigned long end, struct mm_walk *walk)

looks good, thanks
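
For context, an editor's sketch (based on the discussion above, not the exact
patch) of how the walk ends up handling a failed pte_offset_map_lock(): the
page table is gone and, per the contract spelled out in the comment, cannot
come back populated with zero pages, so breaking out of the loop is safe.

static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
			    unsigned long end, struct mm_walk *walk)
{
	unsigned long addr;

	for (addr = start; addr != end; addr += PAGE_SIZE) {
		pte_t *ptep;
		spinlock_t *ptl;

		ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		if (!ptep)
			break;		/* page table vanished; see comment above */
		if (is_zero_pfn(pte_pfn(*ptep)))
			ptep_xchg_direct(walk->mm, addr, ptep,
					 __pte(_PAGE_INVALID));
		pte_unmap_unlock(ptep, ptl);
	}
	return 0;
}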