Re: [PATCH v3 00/10] Add sysfs interface files to hv_gpci device to expose system information

2023-07-24 Thread Athira Rajeev



> On 19-Jul-2023, at 11:42 AM, Kajol Jain  wrote:
> 
> The hcall H_GET_PERF_COUNTER_INFO can be used to get data related to
> chips, DIMMs and system topology, by passing different counter request
> values.
> This patchset adds sysfs files to "/sys/devices/hv_gpci/interface/"
> of the hv_gpci PMU driver, which will expose system topology information
> using the H_GET_PERF_COUNTER_INFO hcall. The added sysfs files are
> available on Power10 and later platforms and need root access
> to read the data.
> 
> Patches 1,3,5,7,9 add sysfs interface files to the hv_gpci
> PMU driver, to get system topology information.
> 
> List of added sysfs files:
> -> processor_bus_topology (Counter request value : 0xD0)
> -> processor_config (Counter request value : 0x90)
> -> affinity_domain_via_virtual_processor (Counter request value : 0xA0)
> -> affinity_domain_via_domain (Counter request value : 0xB0)
> -> affinity_domain_via_partition (Counter request value : 0xB1)
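
For illustration, a minimal user-space sketch (not part of the patchset)
that dumps one of the new files; the path comes from the cover letter
above, and reading requires root on a Power10 or later system:

/* Hedged sketch: dump the raw contents of one of the new hv_gpci
 * sysfs files to stdout. The record layout is defined by the ABI
 * documentation added in patches 2,4,6,8,10. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *path =
		"/sys/devices/hv_gpci/interface/processor_bus_topology";
	char buf[4096];
	size_t n;
	FILE *f = fopen(path, "rb");

	if (!f) {
		perror(path);	/* e.g. EACCES when not run as root */
		return EXIT_FAILURE;
	}
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
		fwrite(buf, 1, n, stdout);
	fclose(f);
	return EXIT_SUCCESS;
}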
> 
> Patches 2,4,6,8,10 add details of the newly added hv_gpci
> interface files listed above to the ABI documentation.
> 

Reviewed-by: Athira Rajeev 

Thanks
Athira
> 
> Changelog:
> v2 -> v3
> -> Make nit changes in documentation patches as suggested by Randy Dunlap.
> 
> v1 -> v2
> -> In case the HCALL fails with errors that can be resolved during
>   runtime, only then add the sysinfo interface attributes to the
>   interface_attrs attribute array. Even if the HCALL for one of the
>   counter request values fails, don't add any sysinfo attribute to the
>   interface_attrs attribute array. Add the code changes to make sure the
>   sysinfo interface is added only when all the requirements are met, as
>   suggested by Michael Ellerman.
> -> Make changes in the documentation, adding details of the error types
>   which can be resolved at runtime, as suggested by Michael Ellerman.
> -> Add a new enum and a sysinfo_counter_request array to get the
>   required counter request values in the hv-gpci.c file.
> -> Move the macros for the interface attribute array index to hv-gpci.c,
>   as these macros are currently only used in the hv-gpci.c file.
> 
> Kajol Jain (10):
>  powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show
>processor bus topology information
>  docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document
>processor_bus_topology sysfs interface file
>  powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show
>processor config information
>  docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document
>processor_config sysfs interface file
>  powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity
>domain via virtual processor information
>  docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document
>affinity_domain_via_virtual_processor sysfs interface file
>  powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity
>domain via domain information
>  docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document
>affinity_domain_via_domain sysfs interface file
>  powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity
>domain via partition information
>  docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document
>affinity_domain_via_partition sysfs interface file
> 
> .../sysfs-bus-event_source-devices-hv_gpci| 160 +
> arch/powerpc/perf/hv-gpci.c   | 640 +-
> 2 files changed, 798 insertions(+), 2 deletions(-)
> 
> -- 
> 2.39.3
> 



Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-24 Thread Alistair Popple


Michael Ellerman  writes:

> Alistair Popple  writes:
>> The invalidate_range() is going to become an architecture specific mmu
>> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
>> sync with the CPU page tables. Currently it is called from separate
>> code paths to the main CPU TLB invalidations. This can lead to a
>> secondary TLB not getting invalidated when required and makes it hard
>> to reason about when exactly the secondary TLB is invalidated.
>>
>> To fix this move the notifier call to the architecture specific TLB
>> maintenance functions for architectures that have secondary MMUs
>> requiring explicit software invalidations.
>>
>> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
>> require a TLB invalidation. This invalidation is done by the
>> arahitecutre specific ptep_set_access_flags() which calls
>   ^
>   architecture

Oh. I'd forgotten to apt install codespell ;-)
  
>> flush_tlb_page() if required. However this doesn't call the notifier
>> resulting in infinite faults being generated by devices using the SMMU
>> if it has previously cached a read-only PTE in its TLB.
>>
>> Moving the invalidations into the TLB invalidation functions ensures
>> all invalidations happen at the same time as the CPU invalidation. The
>> architecture specific flush_tlb_all() routines do not call the
>> notifier as none of the IOMMUs require this.
>>
>> Signed-off-by: Alistair Popple 
>> Suggested-by: Jason Gunthorpe 
>> 
> ...
>
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
>> b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index 0bd4866..9724b26 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -752,6 +752,8 @@ void radix__local_flush_tlb_page(struct vm_area_struct 
>> *vma, unsigned long vmadd
>>  return radix__local_flush_hugetlb_page(vma, vmaddr);
>>  #endif
>>  radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, 
>> mmu_virtual_psize);
>> +mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
>> +vmaddr + mmu_virtual_psize);
>>  }
>>  EXPORT_SYMBOL(radix__local_flush_tlb_page);
>
> I think we can skip calling the notifier there? It's explicitly a local flush.

I suspect you're correct. It's been a while since I last worked on the PPC
TLB invalidation code though, and it's changed a fair bit since then, so I
was being conservative and would appreciate any comments there. I was
worried I may have missed some clever optimisation that detects that a
local flush is all that's needed, but I see OCXL calls
mm_context_add_copro(), so that should be ok. Will respin and drop it.

> cheers



Re: [PATCH V2 00/26] tools/perf: Fix shellcheck coding/formatting issues of perf tool shell scripts

2023-07-24 Thread Athira Rajeev



> On 20-Jul-2023, at 10:48 AM, kajoljain  wrote:
> 
> 
> 
> On 7/20/23 10:42, Athira Rajeev wrote:
>> 
>> 
>>> On 19-Jul-2023, at 11:16 PM, Ian Rogers  wrote:
>>> 
>>> On Tue, Jul 18, 2023 at 11:17 PM kajoljain  wrote:
 
 Hi,
 
 Looking for review comments on this patchset.
 
 Thanks,
 Kajol Jain
 
 
 On 7/9/23 23:57, Athira Rajeev wrote:
> This patchset covers a set of fixes for coding/formatting issues observed
> while running the shellcheck tool on the perf shell scripts.
> 
> This cleanup is a prerequisite for including a build option for shellcheck
> discussed here: 
> https://www.spinics.net/lists/linux-perf-users/msg25553.html
> First set of patches were posted here:
> https://lore.kernel.org/linux-perf-users/53b7d823-1570-4289-a632-2205ee2b5...@linux.vnet.ibm.com/T/#t
> 
> This patchset covers the remaining set of shell scripts which need
> fixes. Patch 1 is a resubmission of patch 6 from the initial series.
> Patches 15, 16 and 22 touch code from tools/perf/trace/beauty.
> Other patches are fixes for scripts from tools/perf/tests.
> 
> Shellcheck is run at the severity level for errors and warnings.
> Command used:
> 
> # for F in $(find tests/shell/ -perm -o=x -name '*.sh'); do shellcheck -S 
> warning $F; done
> # echo $?
> 0
> 
>>> 
>>> I don't see anything objectionable in the changes so for the series:
>>> Acked-by: Ian Rogers 
>>> 
>>> Some thoughts:
>>> - Adding "#!/bin/bash" to scripts in tools/perf/tests/lib - I think
>>> we didn't do this to avoid these being included as tests. There are
>>> now extra checks when finding shell tests, so I can imagine doing this
>>> isn't a regression but just a heads up.
>>> - I think James' comment was addressed:
>>> https://lore.kernel.org/linux-perf-users/334989bf-5501-494c-f246-81878fd2f...@arm.com/
>>> - Why aren't these changes being mailed to LKML? The wider community
>>> on LKML have thoughts on shell scripts, plus it makes the changes miss
>>> my mail filters.
>>> - Can we automate this testing into the build? For example, following
>>> a similar kernel build pattern we run a python test and make the log
>>> output a requirement here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/pmu-events/Build?h=perf-tools-next#n30
>>>  I think we can translate:
>>> for F in $(find tests/shell/ -perm -o=x -name '*.sh'); do shellcheck
>>> -S warning $F; done
>>>  into a rule in make for log files that are then a dependency on the
>>> perf binary. We can then parallel shellcheck during the build and
>>> avoid regressions. We probably need a CONFIG_SHELLCHECK feature check
>>> in the build to avoid not having shellcheck breaking the build.
>> 
>> Hi Ian
>> 
>> Thanks for the comments.
>> Yes, the next step after this is to include a build option for shellcheck
>> by updating the Makefile.
>> We will surely get to that build option enablement patch once we have all
>> these corrections in place.
>> 
>> Thanks
>> Athira
>>> 
> 
> Hi Ian,
>   Thanks for reviewing the patches. As Athira mentioned, our next step
> is to include the build option. We will work on it once all the
> corrections are done.
> 
> Thanks,
> Kajol Jain

Hi Arnaldo,  Namhyung

Can you apply this patchset along with the Acked-by from Ian?
Our next step is to add a build option for shellcheck by updating the
Makefile, and we will be working on that.

Thanks
Athira 
> 
>>> Thanks,
>>> Ian
>>> 
> Changelog:
> v1 -> v2:
> - Rebased on top of perf-tools-next from:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools-next
> 
> - Fixed shellcheck errors and warnings reported for newly
>   added changes from perf-tools-next branch
> 
> - Addressed review comment from James clark for patch
>   number 13 from V1. The changes in patch 13 were not necessary
>   since the file "tests/shell/lib/coresight.sh" is sourced from
>   other test files.
> 
> Akanksha J N (1):
> tools/perf/tests: Fix shellcheck warnings for
>   trace+probe_vfs_getname.sh
> 
> Athira Rajeev (14):
> tools/perf/tests: fix test_arm_spe_fork.sh signal case issues
> tools/perf/tests: Fix unused variable references in
>   stat+csv_summary.sh testcase
> tools/perf/tests: fix shellcheck warning for
>   test_perf_data_converter_json.sh testcase
> tools/perf/tests: Fix shellcheck issue for stat_bpf_counters.sh
>   testcase
> tools/perf/tests: Fix shellcheck issues in
>   tests/shell/stat+shadow_stat.sh testcase
> tools/perf/tests: Fix shellcheck warnings for
>   thread_loop_check_tid_10.sh
> tools/perf/tests: Fix shellcheck warnings for unroll_loop_thread_10.sh
> tools/perf/tests: Fix shellcheck warnings for lib/probe_vfs_getname.sh
> tools/perf/tests: Fix the shellcheck warnings in lib/waiting.sh
> tools/perf/trace: Fix x86_arch_prctl.sh to 

Re: [PATCH mm-unstable v7 00/31] Split ptdesc from struct page

2023-07-24 Thread Hugh Dickins
On Mon, 24 Jul 2023, Vishal Moola (Oracle) wrote:

> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct ptdesc.
> 
> This patchset introduces ptdesc, splits ptdesc from struct page, and
> converts many callers of page table constructor/destructors to use ptdescs.
> 
> Ptdesc is a foundation to further standardize page tables, and eventually
> allow for dynamic allocation of page tables independent of struct page.
> However, the use of pages for page table tracking is quite deeply
> ingrained and varied across architectures, so there is still a lot of
> work to be done before that can happen.
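
For orientation, a condensed before/after sketch of the conversion
pattern the arch patches apply (modelled on the sparc64 pte_alloc_one()
hunk quoted later in this digest; pagetable_alloc()/pagetable_free()
and the pagetable_*_ctor()/_dtor() helpers come from the series itself):

/* Before: the page table is tracked through struct page. */
pgtable_t pte_alloc_one(struct mm_struct *mm)
{
	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

	if (!page)
		return NULL;
	if (!pgtable_pte_page_ctor(page)) {
		__free_page(page);
		return NULL;
	}
	return (pte_t *)page_address(page);
}

/* After: the page table is tracked through struct ptdesc. The
 * allocator calls become pagetable_alloc()/pagetable_free(), and the
 * ctor/dtor pair becomes pagetable_pte_ctor()/pagetable_pte_dtor(). */
pgtable_t pte_alloc_one(struct mm_struct *mm)
{
	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);

	if (!ptdesc)
		return NULL;
	if (!pagetable_pte_ctor(ptdesc)) {
		pagetable_free(ptdesc);
		return NULL;
	}
	return ptdesc_address(ptdesc);
}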

Others may differ, but it remains the case that I see no point to this
patchset, until the minimal descriptor that replaces struct page is
working, and struct page then becomes just overhead.  Until that time,
let architectures continue to use struct page as they do - whyever not?

Hugh

> 
> This is rebased on mm-unstable.
> 
> v7:
>   Drop s390 gmap ptdesc conversions - gmap is an unnecessary complication
> that can be dealt with later
>   Be more thorough with ptdesc struct sanity checks and comments
>   Rebase onto mm-unstable
> 
> Vishal Moola (Oracle) (31):
>   mm: Add PAGE_TYPE_OP folio functions
>   pgtable: Create struct ptdesc
>   mm: add utility functions for ptdesc
>   mm: Convert pmd_pgtable_page() callers to use pmd_ptdesc()
>   mm: Convert ptlock_alloc() to use ptdescs
>   mm: Convert ptlock_ptr() to use ptdescs
>   mm: Convert pmd_ptlock_init() to use ptdescs
>   mm: Convert ptlock_init() to use ptdescs
>   mm: Convert pmd_ptlock_free() to use ptdescs
>   mm: Convert ptlock_free() to use ptdescs
>   mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}
>   powerpc: Convert various functions to use ptdescs
>   x86: Convert various functions to use ptdescs
>   s390: Convert various pgalloc functions to use ptdescs
>   mm: Remove page table members from struct page
>   pgalloc: Convert various functions to use ptdescs
>   arm: Convert various functions to use ptdescs
>   arm64: Convert various functions to use ptdescs
>   csky: Convert __pte_free_tlb() to use ptdescs
>   hexagon: Convert __pte_free_tlb() to use ptdescs
>   loongarch: Convert various functions to use ptdescs
>   m68k: Convert various functions to use ptdescs
>   mips: Convert various functions to use ptdescs
>   nios2: Convert __pte_free_tlb() to use ptdescs
>   openrisc: Convert __pte_free_tlb() to use ptdescs
>   riscv: Convert alloc_{pmd, pte}_late() to use ptdescs
>   sh: Convert pte_free_tlb() to use ptdescs
>   sparc64: Convert various functions to use ptdescs
>   sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents
>   um: Convert {pmd, pte}_free_tlb() to use ptdescs
>   mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers
> 
>  Documentation/mm/split_page_table_lock.rst|  12 +-
>  .../zh_CN/mm/split_page_table_lock.rst|  14 +-
>  arch/arm/include/asm/tlb.h|  12 +-
>  arch/arm/mm/mmu.c |   7 +-
>  arch/arm64/include/asm/tlb.h  |  14 +-
>  arch/arm64/mm/mmu.c   |   7 +-
>  arch/csky/include/asm/pgalloc.h   |   4 +-
>  arch/hexagon/include/asm/pgalloc.h|   8 +-
>  arch/loongarch/include/asm/pgalloc.h  |  27 ++--
>  arch/loongarch/mm/pgtable.c   |   7 +-
>  arch/m68k/include/asm/mcf_pgalloc.h   |  47 +++---
>  arch/m68k/include/asm/sun3_pgalloc.h  |   8 +-
>  arch/m68k/mm/motorola.c   |   4 +-
>  arch/mips/include/asm/pgalloc.h   |  32 ++--
>  arch/mips/mm/pgtable.c|   8 +-
>  arch/nios2/include/asm/pgalloc.h  |   8 +-
>  arch/openrisc/include/asm/pgalloc.h   |   8 +-
>  arch/powerpc/mm/book3s64/mmu_context.c|  10 +-
>  arch/powerpc/mm/book3s64/pgtable.c|  32 ++--
>  arch/powerpc/mm/pgtable-frag.c|  56 +++
>  arch/riscv/include/asm/pgalloc.h  |   8 +-
>  arch/riscv/mm/init.c  |  16 +-
>  arch/s390/include/asm/pgalloc.h   |   4 +-
>  arch/s390/include/asm/tlb.h   |   4 +-
>  arch/s390/mm/pgalloc.c| 128 +++
>  arch/sh/include/asm/pgalloc.h |   9 +-
>  arch/sparc/mm/init_64.c   |  17 +-
>  arch/sparc/mm/srmmu.c |   5 +-
>  arch/um/include/asm/pgalloc.h |  18 +--
>  arch/x86/mm/pgtable.c |  47 +++---
>  arch/x86/xen/mmu_pv.c |   2 +-
>  include/asm-generic/pgalloc.h |  88 +-
>  include/asm-generic/tlb.h |  11 ++
>  include/linux/mm.h| 151 +-
>  include/linux/mm_types.h  |  18 ---
>  include/linux/page-flags.h

[Bug 217702] makedumpfile can not open /proc/vmcore

2023-07-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=217702

Michael Ellerman (mich...@ellerman.id.au) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |mich...@ellerman.id.au
         Resolution|---                         |CODE_FIX

--- Comment #1 from Michael Ellerman (mich...@ellerman.id.au) ---
This should be fixed in mainline due to the revert:

106ea7ffd56b ("Revert "powerpc/64s: Remove support for ELFv1 little endian
userspace"")

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH mm-unstable v7 31/31] mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers

2023-07-24 Thread Vishal Moola (Oracle)
These functions are no longer necessary. Remove them and clean up the
Documentation referencing them.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 Documentation/mm/split_page_table_lock.rst| 12 +--
 .../zh_CN/mm/split_page_table_lock.rst| 14 ++---
 include/linux/mm.h| 20 ---
 3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst 
b/Documentation/mm/split_page_table_lock.rst
index a834fad9de12..e4f6972eb6c0 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -58,7 +58,7 @@ Support of split page table lock by an architecture
 ===
 
 There's no need in special enabling of PTE split page table lock: everything
-required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
+required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which
 must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
@@ -68,8 +68,8 @@ This field shares storage with page->ptl.
 PMD split lock only makes sense if you have more than two page table
 levels.
 
-PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table
-allocation and pgtable_pmd_page_dtor() on freeing.
+PMD split lock enabling requires pagetable_pmd_ctor() call on PMD table
+allocation and pagetable_pmd_dtor() on freeing.
 
 Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
 pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
@@ -77,7 +77,7 @@ paths: i.e X86_PAE preallocate few PMDs on pgd_alloc().
 
 With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.
 
-NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must
+NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- it must
 be handled properly.
 
 page->ptl
@@ -97,7 +97,7 @@ trick:
split lock with enabled DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, but costs
one more cache line for indirect access;
 
-The spinlock_t allocated in pgtable_pte_page_ctor() for PTE table and in
-pgtable_pmd_page_ctor() for PMD table.
+The spinlock_t allocated in pagetable_pte_ctor() for PTE table and in
+pagetable_pmd_ctor() for PMD table.
 
 Please, never access page->ptl directly -- use appropriate helper.
diff --git a/Documentation/translations/zh_CN/mm/split_page_table_lock.rst 
b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
index 4fb7aa666037..a2c288670a24 100644
--- a/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
+++ b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
@@ -56,16 +56,16 @@ Hugetlb特定的辅助函数:
 架构对分页表锁的支持
 
 
-没有必要特别启用PTE分页表锁:所有需要的东西都由pgtable_pte_page_ctor()
-和pgtable_pte_page_dtor()完成,它们必须在PTE表分配/释放时被调用。
+没有必要特别启用PTE分页表锁:所有需要的东西都由pagetable_pte_ctor()
+和pagetable_pte_dtor()完成,它们必须在PTE表分配/释放时被调用。
 
 确保架构不使用slab分配器来分配页表:slab使用page->slab_cache来分配其页
 面。这个区域与page->ptl共享存储。
 
 PMD分页锁只有在你有两个以上的页表级别时才有意义。
 
-启用PMD分页锁需要在PMD表分配时调用pgtable_pmd_page_ctor(),在释放时调
-用pgtable_pmd_page_dtor()。
+启用PMD分页锁需要在PMD表分配时调用pagetable_pmd_ctor(),在释放时调
+用pagetable_pmd_dtor()。
 
 分配通常发生在pmd_alloc_one()中,释放发生在pmd_free()和pmd_free_tlb()
 中,但要确保覆盖所有的PMD表分配/释放路径:即X86_PAE在pgd_alloc()中预先
@@ -73,7 +73,7 @@ PMD分页锁只有在你有两个以上的页表级别时才有意义。
 
 一切就绪后,你可以设置CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK。
 
-注意:pgtable_pte_page_ctor()和pgtable_pmd_page_ctor()可能失败--必
+注意:pagetable_pte_ctor()和pagetable_pmd_ctor()可能失败--必
 须正确处理。
 
 page->ptl
@@ -90,7 +90,7 @@ page->ptl用于访问分割页表锁,其中'page'是包含该表的页面struc
的指针并动态分配它。这允许在启用DEBUG_SPINLOCK或DEBUG_LOCK_ALLOC的
情况下使用分页锁,但由于间接访问而多花了一个缓存行。
 
-PTE表的spinlock_t分配在pgtable_pte_page_ctor()中,PMD表的spinlock_t
-分配在pgtable_pmd_page_ctor()中。
+PTE表的spinlock_t分配在pagetable_pte_ctor()中,PMD表的spinlock_t
+分配在pagetable_pmd_ctor()中。
 
 请不要直接访问page->ptl - -使用适当的辅助函数。
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bd3d99d81984..e4e34ecbc2ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2913,11 +2913,6 @@ static inline bool pagetable_pte_ctor(struct ptdesc 
*ptdesc)
return true;
 }
 
-static inline bool pgtable_pte_page_ctor(struct page *page)
-{
-   return pagetable_pte_ctor(page_ptdesc(page));
-}
-
 static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
 {
struct folio *folio = ptdesc_folio(ptdesc);
@@ -2927,11 +2922,6 @@ static inline void pagetable_pte_dtor(struct ptdesc 
*ptdesc)
lruvec_stat_sub_folio(folio, NR_PAGETABLE);
 }
 
-static inline void pgtable_pte_page_dtor(struct page *page)
-{
-   pagetable_pte_dtor(page_ptdesc(page));
-}
-
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
 static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
 {
@@ -3038,11 +3028,6 @@ static inline bool pagetable_pmd_ctor(struct ptdesc 
*ptdesc)
return true;
 }
 

[PATCH mm-unstable v7 30/31] um: Convert {pmd, pte}_free_tlb() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents. Also cleans up some spacing issues.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/um/include/asm/pgalloc.h | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h
index 8ec7cd46dd96..de5e31c64793 100644
--- a/arch/um/include/asm/pgalloc.h
+++ b/arch/um/include/asm/pgalloc.h
@@ -25,19 +25,19 @@
  */
 extern pgd_t *pgd_alloc(struct mm_struct *);
 
-#define __pte_free_tlb(tlb,pte, address)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb),(pte));   \
+#define __pte_free_tlb(tlb, pte, address)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #ifdef CONFIG_3_LEVEL_PGTABLES
 
-#define __pmd_free_tlb(tlb, pmd, address)  \
-do {   \
-   pgtable_pmd_page_dtor(virt_to_page(pmd));   \
-   tlb_remove_page((tlb),virt_to_page(pmd));   \
-} while (0)\
+#define __pmd_free_tlb(tlb, pmd, address)  \
+do {   \
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));\
+   tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pmd)); \
+} while (0)
 
 #endif
 
-- 
2.40.1



[PATCH mm-unstable v7 29/31] sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable pte constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/sparc/mm/srmmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index 13f027afc875..8393faa3e596 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -355,7 +355,8 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
return NULL;
page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
spin_lock(&mm->page_table_lock);
-   if (page_ref_inc_return(page) == 2 && !pgtable_pte_page_ctor(page)) {
+   if (page_ref_inc_return(page) == 2 &&
+   !pagetable_pte_ctor(page_ptdesc(page))) {
page_ref_dec(page);
ptep = NULL;
}
@@ -371,7 +372,7 @@ void pte_free(struct mm_struct *mm, pgtable_t ptep)
page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
spin_lock(&mm->page_table_lock);
if (page_ref_dec_return(page) == 1)
-   pgtable_pte_page_dtor(page);
+   pagetable_pte_dtor(page_ptdesc(page));
spin_unlock(&mm->page_table_lock);
 
srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
-- 
2.40.1



[PATCH mm-unstable v7 28/31] sparc64: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/sparc/mm/init_64.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 0d7fd793924c..9a63a3e08e40 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2893,14 +2893,15 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-   struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-   if (!page)
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);
+
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
-   return (pte_t *) page_address(page);
+   return ptdesc_address(ptdesc);
 }
 
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
@@ -2910,10 +2911,10 @@ void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 static void __pte_free(pgtable_t pte)
 {
-   struct page *page = virt_to_page(pte);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pte);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 void pte_free(struct mm_struct *mm, pgtable_t pte)
-- 
2.40.1



[PATCH mm-unstable v7 27/31] sh: Convert pte_free_tlb() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents. Also cleans up some spacing issues.

Signed-off-by: Vishal Moola (Oracle) 
Reviewed-by: Geert Uytterhoeven 
Acked-by: John Paul Adrian Glaubitz 
Acked-by: Mike Rapoport (IBM) 
---
 arch/sh/include/asm/pgalloc.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index a9e98233c4d4..5d8577ab1591 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -2,6 +2,7 @@
 #ifndef __ASM_SH_PGALLOC_H
 #define __ASM_SH_PGALLOC_H
 
+#include 
 #include 
 
 #define __HAVE_ARCH_PMD_ALLOC_ONE
@@ -31,10 +32,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmd,
set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
 }
 
-#define __pte_free_tlb(tlb,pte,addr)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif /* __ASM_SH_PGALLOC_H */
-- 
2.40.1



[PATCH mm-unstable v7 26/31] riscv: Convert alloc_{pmd, pte}_late() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Palmer Dabbelt 
Acked-by: Mike Rapoport (IBM) 
---
 arch/riscv/include/asm/pgalloc.h |  8 
 arch/riscv/mm/init.c | 16 ++--
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 59dc12b5b7e8..d169a4f41a2e 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -153,10 +153,10 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-#define __pte_free_tlb(tlb, pte, buf)   \
-do {\
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, buf)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 #endif /* CONFIG_MMU */
 
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 9ce504737d18..430a3d05a841 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -353,12 +353,10 @@ static inline phys_addr_t __init 
alloc_pte_fixmap(uintptr_t va)
 
 static phys_addr_t __init alloc_pte_late(uintptr_t va)
 {
-   unsigned long vaddr;
-
-   vaddr = __get_free_page(GFP_KERNEL);
-   BUG_ON(!vaddr || !pgtable_pte_page_ctor(virt_to_page((void *)vaddr)));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   return __pa(vaddr);
+   BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc));
+   return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
 static void __init create_pte_mapping(pte_t *ptep,
@@ -436,12 +434,10 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
 
 static phys_addr_t __init alloc_pmd_late(uintptr_t va)
 {
-   unsigned long vaddr;
-
-   vaddr = __get_free_page(GFP_KERNEL);
-   BUG_ON(!vaddr || !pgtable_pmd_page_ctor(virt_to_page((void *)vaddr)));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   return __pa(vaddr);
+   BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc));
+   return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
 static void __init create_pmd_mapping(pmd_t *pmdp,
-- 
2.40.1



[PATCH mm-unstable v7 25/31] openrisc: Convert __pte_free_tlb() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/openrisc/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/openrisc/include/asm/pgalloc.h 
b/arch/openrisc/include/asm/pgalloc.h
index b7b2b8d16fad..c6a73772a546 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -66,10 +66,10 @@ extern inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif
-- 
2.40.1



[PATCH mm-unstable v7 24/31] nios2: Convert __pte_free_tlb() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Dinh Nguyen 
---
 arch/nios2/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index ecd1657bb2ce..ce6bb8e74271 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -28,10 +28,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmd,
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-   do {\
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+   do {\
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
} while (0)
 
 #endif /* _ASM_NIOS2_PGALLOC_H */
-- 
2.40.1



[PATCH mm-unstable v7 23/31] mips: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/mips/include/asm/pgalloc.h | 32 ++--
 arch/mips/mm/pgtable.c  |  8 +---
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index f72e737dda21..40e40a7eb94a 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -51,13 +51,13 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-   free_pages((unsigned long)pgd, PGD_TABLE_ORDER);
+   pagetable_free(virt_to_ptdesc(pgd));
 }
 
-#define __pte_free_tlb(tlb,pte,address)\
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, address)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -65,18 +65,18 @@ do {
\
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pmd_t *pmd;
-   struct page *pg;
+   struct ptdesc *ptdesc;
 
-   pg = alloc_pages(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
-   if (!pg)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
+   if (!ptdesc)
return NULL;
 
-   if (!pgtable_pmd_page_ctor(pg)) {
-   __free_pages(pg, PMD_TABLE_ORDER);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pmd = (pmd_t *)page_address(pg);
+   pmd = ptdesc_address(ptdesc);
pmd_init(pmd);
return pmd;
 }
@@ -90,10 +90,14 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pud_t *pud;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM,
+   PUD_TABLE_ORDER);
 
-   pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_TABLE_ORDER);
-   if (pud)
-   pud_init(pud);
+   if (!ptdesc)
+   return NULL;
+   pud = ptdesc_address(ptdesc);
+
+   pud_init(pud);
return pud;
 }
 
diff --git a/arch/mips/mm/pgtable.c b/arch/mips/mm/pgtable.c
index b13314be5d0e..1506e458040d 100644
--- a/arch/mips/mm/pgtable.c
+++ b/arch/mips/mm/pgtable.c
@@ -10,10 +10,12 @@
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   pgd_t *ret, *init;
+   pgd_t *init, *ret = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM,
+   PGD_TABLE_ORDER);
 
-   ret = (pgd_t *) __get_free_pages(GFP_KERNEL, PGD_TABLE_ORDER);
-   if (ret) {
+   if (ptdesc) {
+   ret = ptdesc_address(ptdesc);
init = pgd_offset(&init_mm, 0UL);
pgd_init(ret);
memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-- 
2.40.1



[PATCH mm-unstable v7 22/31] m68k: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Geert Uytterhoeven 
---
 arch/m68k/include/asm/mcf_pgalloc.h  | 47 ++--
 arch/m68k/include/asm/sun3_pgalloc.h |  8 ++---
 arch/m68k/mm/motorola.c  |  4 +--
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/arch/m68k/include/asm/mcf_pgalloc.h 
b/arch/m68k/include/asm/mcf_pgalloc.h
index 5c2c0a864524..302c5bf67179 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -5,22 +5,22 @@
 #include 
 #include 
 
-extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-   free_page((unsigned long) pte);
+   pagetable_free(virt_to_ptdesc(pte));
 }
 
 extern const char bad_pmd_string[];
 
-extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
-   unsigned long page = __get_free_page(GFP_DMA);
+   struct ptdesc *ptdesc = pagetable_alloc((GFP_DMA | __GFP_ZERO) &
+   ~__GFP_HIGHMEM, 0);
 
-   if (!page)
+   if (!ptdesc)
return NULL;
 
-   memset((void *)page, 0, PAGE_SIZE);
-   return (pte_t *) (page);
+   return ptdesc_address(ptdesc);
 }
 
 extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
@@ -35,36 +35,34 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned 
long address)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable,
  unsigned long address)
 {
-   struct page *page = virt_to_page(pgtable);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-   struct page *page = alloc_pages(GFP_DMA, 0);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_ZERO, 0);
pte_t *pte;
 
-   if (!page)
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pte = page_address(page);
-   clear_page(pte);
-
+   pte = ptdesc_address(ptdesc);
return pte;
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
 {
-   struct page *page = virt_to_page(pgtable);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 /*
@@ -75,16 +73,19 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
pgtable)
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-   free_page((unsigned long) pgd);
+   pagetable_free(virt_to_ptdesc(pgd));
 }
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
pgd_t *new_pgd;
+   struct ptdesc *ptdesc = pagetable_alloc((GFP_DMA | __GFP_NOWARN) &
+   ~__GFP_HIGHMEM, 0);
 
-   new_pgd = (pgd_t *)__get_free_page(GFP_DMA | __GFP_NOWARN);
-   if (!new_pgd)
+   if (!ptdesc)
return NULL;
+   new_pgd = ptdesc_address(ptdesc);
+
memcpy(new_pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
memset(new_pgd, 0, PAGE_OFFSET >> PGDIR_SHIFT);
return new_pgd;
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h 
b/arch/m68k/include/asm/sun3_pgalloc.h
index 198036aff519..ff48573db2c0 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -17,10 +17,10 @@
 
 extern const char bad_pmd_string[];
 
-#define __pte_free_tlb(tlb,pte,addr)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t 
*pte)
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index c75984e2d86b..594575a0780c 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -161,7 +161,7 @@ 

[PATCH mm-unstable v7 21/31] loongarch: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/loongarch/include/asm/pgalloc.h | 27 +++
 arch/loongarch/mm/pgtable.c  |  7 ---
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/loongarch/include/asm/pgalloc.h 
b/arch/loongarch/include/asm/pgalloc.h
index af1d1e4a6965..23f5b1107246 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -45,9 +45,9 @@ extern void pagetable_init(void);
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 #define __pte_free_tlb(tlb, pte, address)  \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -55,18 +55,18 @@ do {
\
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pmd_t *pmd;
-   struct page *pg;
+   struct ptdesc *ptdesc;
 
-   pg = alloc_page(GFP_KERNEL_ACCOUNT);
-   if (!pg)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, 0);
+   if (!ptdesc)
return NULL;
 
-   if (!pgtable_pmd_page_ctor(pg)) {
-   __free_page(pg);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pmd = (pmd_t *)page_address(pg);
+   pmd = ptdesc_address(ptdesc);
pmd_init(pmd);
return pmd;
 }
@@ -80,10 +80,13 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pud_t *pud;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   pud = (pud_t *) __get_free_page(GFP_KERNEL);
-   if (pud)
-   pud_init(pud);
+   if (!ptdesc)
+   return NULL;
+   pud = ptdesc_address(ptdesc);
+
+   pud_init(pud);
return pud;
 }
 
diff --git a/arch/loongarch/mm/pgtable.c b/arch/loongarch/mm/pgtable.c
index 36a6dc0148ae..5bd102b51f7c 100644
--- a/arch/loongarch/mm/pgtable.c
+++ b/arch/loongarch/mm/pgtable.c
@@ -11,10 +11,11 @@
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   pgd_t *ret, *init;
+   pgd_t *init, *ret = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   ret = (pgd_t *) __get_free_page(GFP_KERNEL);
-   if (ret) {
+   if (ptdesc) {
+   ret = (pgd_t *)ptdesc_address(ptdesc);
init = pgd_offset(&init_mm, 0UL);
pgd_init(ret);
memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-- 
2.40.1



[PATCH mm-unstable v7 20/31] hexagon: Convert __pte_free_tlb() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/hexagon/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/hexagon/include/asm/pgalloc.h 
b/arch/hexagon/include/asm/pgalloc.h
index f0c47e6a7427..55988625e6fb 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -87,10 +87,10 @@ static inline void pmd_populate_kernel(struct mm_struct 
*mm, pmd_t *pmd,
max_kernel_seg = pmdindex;
 }
 
-#define __pte_free_tlb(tlb, pte, addr) \
-do {   \
-   pgtable_pte_page_dtor((pte));   \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor((page_ptdesc(pte))); \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif
-- 
2.40.1



[PATCH mm-unstable v7 19/31] csky: Convert __pte_free_tlb() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Guo Ren 
Acked-by: Mike Rapoport (IBM) 
---
 arch/csky/include/asm/pgalloc.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index 7d57e5da0914..9c84c9012e53 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -63,8 +63,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #define __pte_free_tlb(tlb, pte, address)  \
 do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page(tlb, pte);  \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc(tlb, page_ptdesc(pte));  \
 } while (0)
 
 extern void pagetable_init(void);
-- 
2.40.1



[PATCH mm-unstable v7 18/31] arm64: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Catalin Marinas 
---
 arch/arm64/include/asm/tlb.h | 14 --
 arch/arm64/mm/mmu.c  |  7 ---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index c995d1f4594f..2c29239d05c3 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -75,18 +75,20 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
  unsigned long addr)
 {
-   pgtable_pte_page_dtor(pte);
-   tlb_remove_table(tlb, pte);
+   struct ptdesc *ptdesc = page_ptdesc(pte);
+
+   pagetable_pte_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
  unsigned long addr)
 {
-   struct page *page = virt_to_page(pmdp);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-   pgtable_pmd_page_dtor(page);
-   tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 #endif
 
@@ -94,7 +96,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, 
pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
  unsigned long addr)
 {
-   tlb_remove_table(tlb, virt_to_page(pudp));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pudp));
 }
 #endif
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 95d360805f8a..47781bec6171 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -426,6 +426,7 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
 static phys_addr_t pgd_pgtable_alloc(int shift)
 {
phys_addr_t pa = __pgd_pgtable_alloc(shift);
+   struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
 
/*
 * Call proper page table ctor in case later we need to
@@ -433,12 +434,12 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 * this pre-allocated page table.
 *
 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
-* folded, and if so pgtable_pmd_page_ctor() becomes nop.
+* folded, and if so pagetable_pte_ctor() becomes nop.
 */
if (shift == PAGE_SHIFT)
-   BUG_ON(!pgtable_pte_page_ctor(phys_to_page(pa)));
+   BUG_ON(!pagetable_pte_ctor(ptdesc));
else if (shift == PMD_SHIFT)
-   BUG_ON(!pgtable_pmd_page_ctor(phys_to_page(pa)));
+   BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
return pa;
 }
-- 
2.40.1



[PATCH mm-unstable v7 17/31] arm: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

late_alloc() also uses the __get_free_pages() helper function. Convert
this to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/arm/include/asm/tlb.h | 12 +++-
 arch/arm/mm/mmu.c  |  7 ---
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index b8cbe03ad260..f40d06ad5d2a 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -39,7 +39,9 @@ static inline void __tlb_remove_table(void *_table)
 static inline void
 __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
 {
-   pgtable_pte_page_dtor(pte);
+   struct ptdesc *ptdesc = page_ptdesc(pte);
+
+   pagetable_pte_dtor(ptdesc);
 
 #ifndef CONFIG_ARM_LPAE
/*
@@ -50,17 +52,17 @@ __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, 
unsigned long addr)
__tlb_adjust_range(tlb, addr - PAGE_SIZE, 2 * PAGE_SIZE);
 #endif
 
-   tlb_remove_table(tlb, pte);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 static inline void
 __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
 {
 #ifdef CONFIG_ARM_LPAE
-   struct page *page = virt_to_page(pmdp);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-   pgtable_pmd_page_dtor(page);
-   tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 #endif
 }
 
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 13fc4bb5f792..fdeaee30d167 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -737,11 +737,12 @@ static void __init *early_alloc(unsigned long sz)
 
 static void *__init late_alloc(unsigned long sz)
 {
-   void *ptr = (void *)__get_free_pages(GFP_PGTABLE_KERNEL, get_order(sz));
+   void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM,
+   get_order(sz));
 
-   if (!ptr || !pgtable_pte_page_ctor(virt_to_page(ptr)))
+   if (!ptdesc || !pagetable_pte_ctor(ptdesc))
BUG();
-   return ptr;
+   return ptdesc_to_virt(ptdesc);
 }
 
 static pte_t * __init arm_pte_alloc(pmd_t *pmd, unsigned long addr,
-- 
2.40.1



[PATCH mm-unstable v7 16/31] pgalloc: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/asm-generic/pgalloc.h | 88 +--
 1 file changed, 52 insertions(+), 36 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index a7cf825befae..c75d4a753849 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -8,7 +8,7 @@
 #define GFP_PGTABLE_USER   (GFP_PGTABLE_KERNEL | __GFP_ACCOUNT)
 
 /**
- * __pte_alloc_one_kernel - allocate a page for PTE-level kernel page table
+ * __pte_alloc_one_kernel - allocate memory for a PTE-level kernel page table
  * @mm: the mm_struct of the current context
  *
  * This function is intended for architectures that need
@@ -18,12 +18,17 @@
  */
 static inline pte_t *__pte_alloc_one_kernel(struct mm_struct *mm)
 {
-   return (pte_t *)__get_free_page(GFP_PGTABLE_KERNEL);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL &
+   ~__GFP_HIGHMEM, 0);
+
+   if (!ptdesc)
+   return NULL;
+   return ptdesc_address(ptdesc);
 }
 
 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL
 /**
- * pte_alloc_one_kernel - allocate a page for PTE-level kernel page table
+ * pte_alloc_one_kernel - allocate memory for a PTE-level kernel page table
  * @mm: the mm_struct of the current context
  *
  * Return: pointer to the allocated memory or %NULL on error
@@ -35,40 +40,40 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct 
*mm)
 #endif
 
 /**
- * pte_free_kernel - free PTE-level kernel page table page
+ * pte_free_kernel - free PTE-level kernel page table memory
  * @mm: the mm_struct of the current context
  * @pte: pointer to the memory containing the page table
  */
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-   free_page((unsigned long)pte);
+   pagetable_free(virt_to_ptdesc(pte));
 }
 
 /**
- * __pte_alloc_one - allocate a page for PTE-level user page table
+ * __pte_alloc_one - allocate memory for a PTE-level user page table
  * @mm: the mm_struct of the current context
  * @gfp: GFP flags to use for the allocation
  *
- * Allocates a page and runs the pgtable_pte_page_ctor().
+ * Allocate memory for a page table and ptdesc and runs pagetable_pte_ctor().
  *
  * This function is intended for architectures that need
  * anything beyond simple page allocation or must have custom GFP flags.
  *
- * Return: `struct page` initialized as page table or %NULL on error
+ * Return: `struct page` referencing the ptdesc or %NULL on error
  */
 static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, gfp_t gfp)
 {
-   struct page *pte;
+   struct ptdesc *ptdesc;
 
-   pte = alloc_page(gfp);
-   if (!pte)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(pte)) {
-   __free_page(pte);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   return pte;
+   return ptdesc_page(ptdesc);
 }
 
 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE
@@ -76,9 +81,9 @@ static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, 
gfp_t gfp)
  * pte_alloc_one - allocate a page for PTE-level user page table
  * @mm: the mm_struct of the current context
  *
- * Allocates a page and runs the pgtable_pte_page_ctor().
+ * Allocate memory for a page table and ptdesc and runs pagetable_pte_ctor().
  *
- * Return: `struct page` initialized as page table or %NULL on error
+ * Return: `struct page` referencing the ptdesc or %NULL on error
  */
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
@@ -92,14 +97,16 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
  */
 
 /**
- * pte_free - free PTE-level user page table page
+ * pte_free - free PTE-level user page table memory
  * @mm: the mm_struct of the current context
- * @pte_page: the `struct page` representing the page table
+ * @pte_page: the `struct page` referencing the ptdesc
  */
 static inline void pte_free(struct mm_struct *mm, struct page *pte_page)
 {
-   pgtable_pte_page_dtor(pte_page);
-   __free_page(pte_page);
+   struct ptdesc *ptdesc = page_ptdesc(pte_page);
+
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 
@@ -107,10 +114,11 @@ static inline void pte_free(struct mm_struct *mm, struct 
page *pte_page)
 
 #ifndef __HAVE_ARCH_PMD_ALLOC_ONE
 /**
- * pmd_alloc_one - allocate a page for PMD-level page table
+ * pmd_alloc_one - allocate memory for a PMD-level page table
  * @mm: the mm_struct of the current context
  *
- * Allocates a page and runs the 

[PATCH mm-unstable v7 15/31] mm: Remove page table members from struct page

2023-07-24 Thread Vishal Moola (Oracle)
The page table members are now split out into their own ptdesc struct.
Remove them from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm_types.h | 18 --
 include/linux/pgtable.h  |  3 ---
 2 files changed, 21 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index da538ff68953..aae6af098031 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -141,24 +141,6 @@ struct page {
struct {/* Tail pages of compound page */
unsigned long compound_head;/* Bit zero is set */
};
-   struct {/* Page table pages */
-   unsigned long _pt_pad_1;/* compound_head */
-   pgtable_t pmd_huge_pte; /* protected by page->ptl */
-   /*
-* A PTE page table page might be freed by use of
-* rcu_head: which overlays those two fields above.
-*/
-   unsigned long _pt_pad_2;/* mapping */
-   union {
-   struct mm_struct *pt_mm; /* x86 pgds only */
-   atomic_t pt_frag_refcount; /* powerpc */
-   };
-#if ALLOC_SPLIT_PTLOCKS
-   spinlock_t *ptl;
-#else
-   spinlock_t ptl;
-#endif
-   };
struct {/* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 250fdeba68f3..1a984c300d45 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1051,10 +1051,7 @@ struct ptdesc {
 TABLE_MATCH(flags, __page_flags);
 TABLE_MATCH(compound_head, pt_list);
 TABLE_MATCH(compound_head, _pt_pad_1);
-TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
 TABLE_MATCH(mapping, __page_mapping);
-TABLE_MATCH(pt_mm, pt_mm);
-TABLE_MATCH(ptl, ptl);
 TABLE_MATCH(rcu_head, pt_rcu_head);
 TABLE_MATCH(page_type, __page_type);
 TABLE_MATCH(_refcount, _refcount);
-- 
2.40.1
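
For orientation, an abridged sketch of struct ptdesc as it stands after
this patch, reconstructed from the TABLE_MATCH() assertions and the
fields removed from struct page above (the authoritative definition
lives in include/linux/pgtable.h):

/* Abridged sketch of struct ptdesc; each field overlays the struct
 * page field named in the TABLE_MATCH() assertions above. */
struct ptdesc {
	unsigned long __page_flags;		/* overlays page->flags */

	union {
		struct rcu_head pt_rcu_head;	/* overlays page->rcu_head */
		struct list_head pt_list;	/* overlays compound_head */
		struct {
			unsigned long _pt_pad_1;/* overlays compound_head */
			pgtable_t pmd_huge_pte;	/* protected by ptdesc->ptl */
		};
	};
	unsigned long __page_mapping;		/* overlays page->mapping */

	union {
		struct mm_struct *pt_mm;	/* x86 pgds only */
		atomic_t pt_frag_refcount;	/* powerpc */
	};

	union {
		unsigned long _pt_pad_2;
#if ALLOC_SPLIT_PTLOCKS
		spinlock_t *ptl;
#else
		spinlock_t ptl;
#endif
	};
	unsigned int __page_type;		/* overlays page->page_type */
	atomic_t _refcount;			/* overlays page->_refcount */
};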



[PATCH mm-unstable v7 14/31] s390: Convert various pgalloc functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/s390/include/asm/pgalloc.h |   4 +-
 arch/s390/include/asm/tlb.h |   4 +-
 arch/s390/mm/pgalloc.c  | 128 
 3 files changed, 69 insertions(+), 67 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 89a9d5ef94f8..376b4b23bdaa 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -86,7 +86,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long vmaddr)
if (!table)
return NULL;
crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
-   if (!pgtable_pmd_page_ctor(virt_to_page(table))) {
+   if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) {
crst_table_free(mm, table);
return NULL;
}
@@ -97,7 +97,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
if (mm_pmd_folded(mm))
return;
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
crst_table_free(mm, (unsigned long *) pmd);
 }
 
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index b91f4a9b044c..383b1f91442c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -89,12 +89,12 @@ static inline void pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
 {
if (mm_pmd_folded(tlb->mm))
return;
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
__tlb_adjust_range(tlb, address, PAGE_SIZE);
tlb->mm->context.flush_mm = 1;
tlb->freed_tables = 1;
tlb->cleared_puds = 1;
-   tlb_remove_table(tlb, pmd);
+   tlb_remove_ptdesc(tlb, pmd);
 }
 
 /*
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index d7374add7820..07fc660a24aa 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -43,17 +43,17 @@ __initcall(page_table_register_sysctl);
 
 unsigned long *crst_table_alloc(struct mm_struct *mm)
 {
-   struct page *page = alloc_pages(GFP_KERNEL, CRST_ALLOC_ORDER);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, CRST_ALLOC_ORDER);
 
-   if (!page)
+   if (!ptdesc)
return NULL;
-   arch_set_page_dat(page, CRST_ALLOC_ORDER);
-   return (unsigned long *) page_to_virt(page);
+   arch_set_page_dat(ptdesc_page(ptdesc), CRST_ALLOC_ORDER);
+   return (unsigned long *) ptdesc_to_virt(ptdesc);
 }
 
 void crst_table_free(struct mm_struct *mm, unsigned long *table)
 {
-   free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+   pagetable_free(virt_to_ptdesc(table));
 }
 
 static void __crst_table_upgrade(void *arg)
@@ -140,21 +140,21 @@ static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
 
 struct page *page_table_alloc_pgste(struct mm_struct *mm)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
u64 *table;
 
-   page = alloc_page(GFP_KERNEL);
-   if (page) {
-   table = (u64 *)page_to_virt(page);
+   ptdesc = pagetable_alloc(GFP_KERNEL, 0);
+   if (ptdesc) {
+   table = (u64 *)ptdesc_to_virt(ptdesc);
memset64(table, _PAGE_INVALID, PTRS_PER_PTE);
memset64(table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
}
-   return page;
+   return ptdesc_page(ptdesc);
 }
 
 void page_table_free_pgste(struct page *page)
 {
-   __free_page(page);
+   pagetable_free(page_ptdesc(page));
 }
 
 #endif /* CONFIG_PGSTE */
@@ -242,7 +242,7 @@ void page_table_free_pgste(struct page *page)
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
unsigned long *table;
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned int mask, bit;
 
/* Try to get a fragment of a 4K page as a 2K page table */
@@ -250,9 +250,9 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = NULL;
	spin_lock_bh(&mm->context.lock);
	if (!list_empty(&mm->context.pgtable_list)) {
-		page = list_first_entry(&mm->context.pgtable_list,
-					struct page, lru);
-		mask = atomic_read(&page->_refcount) >> 24;
+		ptdesc = list_first_entry(&mm->context.pgtable_list,
+					struct ptdesc, pt_list);
+		mask = atomic_read(&ptdesc->_refcount) >> 24;
/*
 * The pending removal bits must also be checked.

[PATCH mm-unstable v7 13/31] x86: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/x86/mm/pgtable.c | 47 ++-
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 15a8009a4480..d3a93e8766ee 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -52,7 +52,7 @@ early_param("userpte", setup_userpte);
 
 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
-   pgtable_pte_page_dtor(pte);
+   pagetable_pte_dtor(page_ptdesc(pte));
paravirt_release_pte(page_to_pfn(pte));
paravirt_tlb_remove_table(tlb, pte);
 }
@@ -60,7 +60,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 #if CONFIG_PGTABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
-   struct page *page = virt_to_page(pmd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
/*
 * NOTE! For PAE, any changes to the top page-directory-pointer-table
@@ -69,8 +69,8 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 #ifdef CONFIG_X86_PAE
tlb->need_flush_all = 1;
 #endif
-   pgtable_pmd_page_dtor(page);
-   paravirt_tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -92,16 +92,16 @@ void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 
 static inline void pgd_list_add(pgd_t *pgd)
 {
-   struct page *page = virt_to_page(pgd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgd);
 
-	list_add(&page->lru, &pgd_list);
+	list_add(&ptdesc->pt_list, &pgd_list);
 }
 
 static inline void pgd_list_del(pgd_t *pgd)
 {
-   struct page *page = virt_to_page(pgd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgd);
 
-	list_del(&page->lru);
+	list_del(&ptdesc->pt_list);
 }
 
 #define UNSHARED_PTRS_PER_PGD  \
@@ -112,12 +112,12 @@ static inline void pgd_list_del(pgd_t *pgd)
 
 static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
 {
-   virt_to_page(pgd)->pt_mm = mm;
+   virt_to_ptdesc(pgd)->pt_mm = mm;
 }
 
 struct mm_struct *pgd_page_get_mm(struct page *page)
 {
-   return page->pt_mm;
+   return page_ptdesc(page)->pt_mm;
 }
 
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
@@ -213,11 +213,14 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 {
int i;
+   struct ptdesc *ptdesc;
 
for (i = 0; i < count; i++)
if (pmds[i]) {
-   pgtable_pmd_page_dtor(virt_to_page(pmds[i]));
-   free_page((unsigned long)pmds[i]);
+   ptdesc = virt_to_ptdesc(pmds[i]);
+
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
mm_dec_nr_pmds(mm);
}
 }
@@ -230,18 +233,24 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 
	if (mm == &init_mm)
gfp &= ~__GFP_ACCOUNT;
+   gfp &= ~__GFP_HIGHMEM;
 
for (i = 0; i < count; i++) {
-   pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
-   if (!pmd)
+   pmd_t *pmd = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(gfp, 0);
+
+   if (!ptdesc)
failed = true;
-   if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
-   free_page((unsigned long)pmd);
-   pmd = NULL;
+   if (ptdesc && !pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
+   ptdesc = NULL;
failed = true;
}
-   if (pmd)
+   if (ptdesc) {
mm_inc_nr_pmds(mm);
+   pmd = ptdesc_address(ptdesc);
+   }
+
pmds[i] = pmd;
}
 
@@ -830,7 +839,7 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
free_page((unsigned long)pmd_sv);
 
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
free_page((unsigned long)pmd);
 
return 1;
-- 
2.40.1



[PATCH mm-unstable v7 12/31] powerpc: Convert various functions to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/powerpc/mm/book3s64/mmu_context.c | 10 ++---
 arch/powerpc/mm/book3s64/pgtable.c | 32 +++
 arch/powerpc/mm/pgtable-frag.c | 56 +-
 3 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index c766e4c26e42..1715b07c630c 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -246,15 +246,15 @@ static void destroy_contexts(mm_context_t *ctx)
 static void pmd_frag_destroy(void *pmd_frag)
 {
int count;
-   struct page *page;
+   struct ptdesc *ptdesc;
 
-   page = virt_to_page(pmd_frag);
+   ptdesc = virt_to_ptdesc(pmd_frag);
/* drop all the pending references */
count = ((unsigned long)pmd_frag & ~PAGE_MASK) >> PMD_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (atomic_sub_and_test(PMD_FRAG_NR - count, &page->pt_frag_refcount)) {
-   pgtable_pmd_page_dtor(page);
-   __free_page(page);
+	if (atomic_sub_and_test(PMD_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 85c84e89e3ea..1212deeabe15 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -306,22 +306,22 @@ static pmd_t *get_pmd_from_cache(struct mm_struct *mm)
 static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 {
void *ret = NULL;
-   struct page *page;
+   struct ptdesc *ptdesc;
gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
	if (mm == &init_mm)
gfp &= ~__GFP_ACCOUNT;
-   page = alloc_page(gfp);
-   if (!page)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pmd_page_ctor(page)) {
-   __free_pages(page, 0);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-	atomic_set(&page->pt_frag_refcount, 1);
+	atomic_set(&ptdesc->pt_frag_refcount, 1);
 
-   ret = page_address(page);
+   ret = ptdesc_address(ptdesc);
/*
 * if we support only one fragment just return the
 * allocated page.
@@ -331,12 +331,12 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 
	spin_lock(&mm->page_table_lock);
/*
-* If we find pgtable_page set, we return
+* If we find ptdesc_page set, we return
 * the allocated page with single fragment
 * count.
 */
if (likely(!mm->context.pmd_frag)) {
-		atomic_set(&page->pt_frag_refcount, PMD_FRAG_NR);
+		atomic_set(&ptdesc->pt_frag_refcount, PMD_FRAG_NR);
mm->context.pmd_frag = ret + PMD_FRAG_SIZE;
}
	spin_unlock(&mm->page_table_lock);
@@ -357,15 +357,15 @@ pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
 
 void pmd_fragment_free(unsigned long *pmd)
 {
-   struct page *page = virt_to_page(pmd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
 
-   if (PageReserved(page))
-   return free_reserved_page(page);
+   if (pagetable_is_reserved(ptdesc))
+   return free_reserved_ptdesc(ptdesc);
 
-	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
-	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-   pgtable_pmd_page_dtor(page);
-   __free_page(page);
+	BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0);
+	if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) {
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 0c6b68130025..4c899c9c0694 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -18,15 +18,15 @@
 void pte_frag_destroy(void *pte_frag)
 {
int count;
-   struct page *page;
+   struct ptdesc *ptdesc;
 
-   page = virt_to_page(pte_frag);
+   ptdesc = virt_to_ptdesc(pte_frag);
/* drop all the pending references */
count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (atomic_sub_and_test(PTE_FRAG_NR - count, &page->pt_frag_refcount)) {
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+	if (atomic_sub_and_test(PTE_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
@@ -55,25 +55,25 @@ static pte_t *get_pte_from_cache(struct mm_struct *mm)
 static pte_t 

[PATCH mm-unstable v7 11/31] mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}

2023-07-24 Thread Vishal Moola (Oracle)
Create pagetable_pte_ctor(), pagetable_pmd_ctor(), pagetable_pte_dtor(),
and pagetable_pmd_dtor() and make the original pgtable
constructor/destructors wrappers.
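
A short sketch of how the layering is meant to read after this patch
(illustrative caller, not taken from the patch itself):

	/* New code operates on the ptdesc directly... */
	static bool example_pte_init(struct page *page)
	{
		return pagetable_pte_ctor(page_ptdesc(page));
	}

	/*
	 * ...while the old pgtable_pte_page_ctor()/pgtable_pte_page_dtor()
	 * names survive as one-line wrappers around the ptdesc calls, so
	 * unconverted callers keep working unchanged.
	 */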

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 56 ++
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffddae95af78..bd3d99d81984 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2902,20 +2902,34 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
-static inline bool pgtable_pte_page_ctor(struct page *page)
+static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
 {
-   if (!ptlock_init(page_ptdesc(page)))
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   if (!ptlock_init(ptdesc))
return false;
-   __SetPageTable(page);
-   inc_lruvec_page_state(page, NR_PAGETABLE);
+   __folio_set_pgtable(folio);
+   lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
 }
 
+static inline bool pgtable_pte_page_ctor(struct page *page)
+{
+   return pagetable_pte_ctor(page_ptdesc(page));
+}
+
+static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
+{
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   ptlock_free(ptdesc);
+   __folio_clear_pgtable(folio);
+   lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+}
+
 static inline void pgtable_pte_page_dtor(struct page *page)
 {
-   ptlock_free(page_ptdesc(page));
-   __ClearPageTable(page);
-   dec_lruvec_page_state(page, NR_PAGETABLE);
+   pagetable_pte_dtor(page_ptdesc(page));
 }
 
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
@@ -3013,20 +3027,34 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
return ptl;
 }
 
-static inline bool pgtable_pmd_page_ctor(struct page *page)
+static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
 {
-   if (!pmd_ptlock_init(page_ptdesc(page)))
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   if (!pmd_ptlock_init(ptdesc))
return false;
-   __SetPageTable(page);
-   inc_lruvec_page_state(page, NR_PAGETABLE);
+   __folio_set_pgtable(folio);
+   lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
 }
 
+static inline bool pgtable_pmd_page_ctor(struct page *page)
+{
+   return pagetable_pmd_ctor(page_ptdesc(page));
+}
+
+static inline void pagetable_pmd_dtor(struct ptdesc *ptdesc)
+{
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   pmd_ptlock_free(ptdesc);
+   __folio_clear_pgtable(folio);
+   lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+}
+
 static inline void pgtable_pmd_page_dtor(struct page *page)
 {
-   pmd_ptlock_free(page_ptdesc(page));
-   __ClearPageTable(page);
-   dec_lruvec_page_state(page, NR_PAGETABLE);
+   pagetable_pmd_dtor(page_ptdesc(page));
 }
 
 /*
-- 
2.40.1



[PATCH mm-unstable v7 10/31] mm: Convert ptlock_free() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 10 +-
 mm/memory.c|  4 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 774fe83c0c16..ffddae95af78 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2842,7 +2842,7 @@ static inline void pagetable_free(struct ptdesc *pt)
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
 bool ptlock_alloc(struct ptdesc *ptdesc);
-extern void ptlock_free(struct page *page);
+void ptlock_free(struct ptdesc *ptdesc);
 
 static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
@@ -2858,7 +2858,7 @@ static inline bool ptlock_alloc(struct ptdesc *ptdesc)
return true;
 }
 
-static inline void ptlock_free(struct page *page)
+static inline void ptlock_free(struct ptdesc *ptdesc)
 {
 }
 
@@ -2899,7 +2899,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 }
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
-static inline void ptlock_free(struct page *page) {}
+static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
@@ -2913,7 +2913,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 
 static inline void pgtable_pte_page_dtor(struct page *page)
 {
-   ptlock_free(page);
+   ptlock_free(page_ptdesc(page));
__ClearPageTable(page);
dec_lruvec_page_state(page, NR_PAGETABLE);
 }
@@ -2987,7 +2987,7 @@ static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc));
 #endif
-   ptlock_free(ptdesc_page(ptdesc));
+   ptlock_free(ptdesc);
 }
 
 #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
diff --git a/mm/memory.c b/mm/memory.c
index 4fee273595e2..e5e370cdac23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6242,8 +6242,8 @@ bool ptlock_alloc(struct ptdesc *ptdesc)
return true;
 }
 
-void ptlock_free(struct page *page)
+void ptlock_free(struct ptdesc *ptdesc)
 {
-   kmem_cache_free(page_ptl_cachep, page->ptl);
+   kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
 }
 #endif
-- 
2.40.1



[PATCH mm-unstable v7 09/31] mm: Convert pmd_ptlock_free() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 675972d3f7e4..774fe83c0c16 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2982,12 +2982,12 @@ static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
return ptlock_init(ptdesc);
 }
 
-static inline void pmd_ptlock_free(struct page *page)
+static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+   VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc));
 #endif
-   ptlock_free(page);
+   ptlock_free(ptdesc_page(ptdesc));
 }
 
 #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
@@ -3000,7 +3000,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
+static inline void pmd_ptlock_free(struct ptdesc *ptdesc) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
 
@@ -3024,7 +3024,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page)
 
 static inline void pgtable_pmd_page_dtor(struct page *page)
 {
-   pmd_ptlock_free(page);
+   pmd_ptlock_free(page_ptdesc(page));
__ClearPageTable(page);
dec_lruvec_page_state(page, NR_PAGETABLE);
 }
-- 
2.40.1



[PATCH mm-unstable v7 08/31] mm: Convert ptlock_init() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52ef09c100a2..675972d3f7e4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2873,7 +2873,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
-static inline bool ptlock_init(struct page *page)
+static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
/*
 * prep_new_page() initialize page->private (and therefore page->ptl)
@@ -2882,10 +2882,10 @@ static inline bool ptlock_init(struct page *page)
 * It can happen if arch try to use slab for page table allocation:
 * slab code uses page->slab_cache, which share storage with page->ptl.
 */
-	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-   if (!ptlock_alloc(page_ptdesc(page)))
+	VM_BUG_ON_PAGE(*(unsigned long *)&ptdesc->ptl, ptdesc_page(ptdesc));
+   if (!ptlock_alloc(ptdesc))
return false;
-   spin_lock_init(ptlock_ptr(page_ptdesc(page)));
+   spin_lock_init(ptlock_ptr(ptdesc));
return true;
 }
 
@@ -2898,13 +2898,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
	return &mm->page_table_lock;
 }
 static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
+static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct page *page) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
-   if (!ptlock_init(page))
+   if (!ptlock_init(page_ptdesc(page)))
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
@@ -2979,7 +2979,7 @@ static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptdesc->pmd_huge_pte = NULL;
 #endif
-   return ptlock_init(ptdesc_page(ptdesc));
+   return ptlock_init(ptdesc);
 }
 
 static inline void pmd_ptlock_free(struct page *page)
-- 
2.40.1



[PATCH mm-unstable v7 07/31] mm: Convert pmd_ptlock_init() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c155f82dd2cc..52ef09c100a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2974,12 +2974,12 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return ptlock_ptr(pmd_ptdesc(pmd));
 }
 
-static inline bool pmd_ptlock_init(struct page *page)
+static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   page->pmd_huge_pte = NULL;
+   ptdesc->pmd_huge_pte = NULL;
 #endif
-   return ptlock_init(page);
+   return ptlock_init(ptdesc_page(ptdesc));
 }
 
 static inline void pmd_ptlock_free(struct page *page)
@@ -2999,7 +2999,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
	return &mm->page_table_lock;
 }
 
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void pmd_ptlock_free(struct page *page) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
@@ -3015,7 +3015,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
-   if (!pmd_ptlock_init(page))
+   if (!pmd_ptlock_init(page_ptdesc(page)))
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
-- 
2.40.1



[PATCH mm-unstable v7 06/31] mm: Convert ptlock_ptr() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/x86/xen/mmu_pv.c |  2 +-
 include/linux/mm.h| 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index e0a975165de7..8796ec310483 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -667,7 +667,7 @@ static spinlock_t *xen_pte_lock(struct page *page, struct mm_struct *mm)
spinlock_t *ptl = NULL;
 
 #if USE_SPLIT_PTE_PTLOCKS
-   ptl = ptlock_ptr(page);
+   ptl = ptlock_ptr(page_ptdesc(page));
	spin_lock_nest_lock(ptl, &mm->page_table_lock);
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b3fce0bfe201..c155f82dd2cc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2844,9 +2844,9 @@ void __init ptlock_cache_init(void);
 bool ptlock_alloc(struct ptdesc *ptdesc);
 extern void ptlock_free(struct page *page);
 
-static inline spinlock_t *ptlock_ptr(struct page *page)
+static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-   return page->ptl;
+   return ptdesc->ptl;
 }
 #else /* ALLOC_SPLIT_PTLOCKS */
 static inline void ptlock_cache_init(void)
@@ -2862,15 +2862,15 @@ static inline void ptlock_free(struct page *page)
 {
 }
 
-static inline spinlock_t *ptlock_ptr(struct page *page)
+static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-	return &page->ptl;
+	return &ptdesc->ptl;
 }
 #endif /* ALLOC_SPLIT_PTLOCKS */
 
 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(pmd_page(*pmd));
+   return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
 static inline bool ptlock_init(struct page *page)
@@ -2885,7 +2885,7 @@ static inline bool ptlock_init(struct page *page)
	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
if (!ptlock_alloc(page_ptdesc(page)))
return false;
-   spin_lock_init(ptlock_ptr(page));
+   spin_lock_init(ptlock_ptr(page_ptdesc(page)));
return true;
 }
 
@@ -2971,7 +2971,7 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
 
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(ptdesc_page(pmd_ptdesc(pmd)));
+   return ptlock_ptr(pmd_ptdesc(pmd));
 }
 
 static inline bool pmd_ptlock_init(struct page *page)
-- 
2.40.1



[PATCH mm-unstable v7 05/31] mm: Convert ptlock_alloc() to use ptdescs

2023-07-24 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 6 +++---
 mm/memory.c| 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf552a106e4a..b3fce0bfe201 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2841,7 +2841,7 @@ static inline void pagetable_free(struct ptdesc *pt)
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
+bool ptlock_alloc(struct ptdesc *ptdesc);
 extern void ptlock_free(struct page *page);
 
 static inline spinlock_t *ptlock_ptr(struct page *page)
@@ -2853,7 +2853,7 @@ static inline void ptlock_cache_init(void)
 {
 }
 
-static inline bool ptlock_alloc(struct page *page)
+static inline bool ptlock_alloc(struct ptdesc *ptdesc)
 {
return true;
 }
@@ -2883,7 +2883,7 @@ static inline bool ptlock_init(struct page *page)
 * slab code uses page->slab_cache, which share storage with page->ptl.
 */
	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-   if (!ptlock_alloc(page))
+   if (!ptlock_alloc(page_ptdesc(page)))
return false;
spin_lock_init(ptlock_ptr(page));
return true;
diff --git a/mm/memory.c b/mm/memory.c
index 2130bad76eb1..4fee273595e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6231,14 +6231,14 @@ void __init ptlock_cache_init(void)
SLAB_PANIC, NULL);
 }
 
-bool ptlock_alloc(struct page *page)
+bool ptlock_alloc(struct ptdesc *ptdesc)
 {
spinlock_t *ptl;
 
ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
if (!ptl)
return false;
-   page->ptl = ptl;
+   ptdesc->ptl = ptl;
return true;
 }
 
-- 
2.40.1



[PATCH mm-unstable v7 04/31] mm: Convert pmd_pgtable_page() callers to use pmd_ptdesc()

2023-07-24 Thread Vishal Moola (Oracle)
Converts internal pmd_pgtable_page() callers to use pmd_ptdesc(). This
removes some direct accesses to struct page, working towards splitting
out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3fda0ad41cf2..bf552a106e4a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2971,7 +2971,7 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
 
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(pmd_pgtable_page(pmd));
+   return ptlock_ptr(ptdesc_page(pmd_ptdesc(pmd)));
 }
 
 static inline bool pmd_ptlock_init(struct page *page)
@@ -2990,7 +2990,7 @@ static inline void pmd_ptlock_free(struct page *page)
ptlock_free(page);
 }
 
-#define pmd_huge_pte(mm, pmd) (pmd_pgtable_page(pmd)->pmd_huge_pte)
+#define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
 
 #else
 
-- 
2.40.1



[PATCH mm-unstable v7 03/31] mm: add utility functions for ptdesc

2023-07-24 Thread Vishal Moola (Oracle)
Introduce utility functions setting the foundation for ptdescs. These
will also assist in the splitting out of ptdesc from struct page.

Functions that focus on the descriptor are prefixed with ptdesc_* while
functions that focus on the pagetable are prefixed with pagetable_*.

pagetable_alloc() is defined to allocate new ptdesc pages as compound
pages. This is to standardize ptdescs by allowing for one allocation
and one free function, in contrast to 2 allocation and 2 free functions.
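
For example, a single alloc/free pair now covers both order-0 PTE pages
and larger multi-page tables (the order below is only for illustration):

	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 2);

	if (ptdesc) {
		void *table = ptdesc_address(ptdesc);

		/* ... use the four-page table at 'table' ... */
		pagetable_free(ptdesc);	/* compound_order() recovers the size */
	}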

Signed-off-by: Vishal Moola (Oracle) 
---
 include/asm-generic/tlb.h | 11 +++
 include/linux/mm.h| 61 +++
 include/linux/pgtable.h   | 12 
 3 files changed, 84 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bc32a2284c56..129a3a759976 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -480,6 +480,17 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
return tlb_remove_page_size(tlb, page, PAGE_SIZE);
 }
 
+static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
+{
+   tlb_remove_table(tlb, pt);
+}
+
+/* Like tlb_remove_ptdesc, but for page-like page directories. */
+static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
+{
+   tlb_remove_page(tlb, ptdesc_page(pt));
+}
+
 static inline void tlb_change_page_size(struct mmu_gather *tlb,
 unsigned int page_size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ba73f09ae4a..3fda0ad41cf2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2787,6 +2787,57 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* CONFIG_MMU */
 
+static inline struct ptdesc *virt_to_ptdesc(const void *x)
+{
+   return page_ptdesc(virt_to_page(x));
+}
+
+static inline void *ptdesc_to_virt(const struct ptdesc *pt)
+{
+   return page_to_virt(ptdesc_page(pt));
+}
+
+static inline void *ptdesc_address(const struct ptdesc *pt)
+{
+   return folio_address(ptdesc_folio(pt));
+}
+
+static inline bool pagetable_is_reserved(struct ptdesc *pt)
+{
+   return folio_test_reserved(ptdesc_folio(pt));
+}
+
+/**
+ * pagetable_alloc - Allocate pagetables
+ * @gfp:GFP flags
+ * @order:  desired pagetable order
+ *
+ * pagetable_alloc allocates memory for page tables as well as a page table
+ * descriptor to describe that memory.
+ *
+ * Return: The ptdesc describing the allocated page tables.
+ */
+static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
+{
+   struct page *page = alloc_pages(gfp | __GFP_COMP, order);
+
+   return page_ptdesc(page);
+}
+
+/**
+ * pagetable_free - Free pagetables
+ * @pt:The page table descriptor
+ *
+ * pagetable_free frees the memory of all page tables described by a page
+ * table descriptor and the memory for the descriptor itself.
+ */
+static inline void pagetable_free(struct ptdesc *pt)
+{
+   struct page *page = ptdesc_page(pt);
+
+   __free_pages(page, compound_order(page));
+}
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
@@ -2913,6 +2964,11 @@ static inline struct page *pmd_pgtable_page(pmd_t *pmd)
return virt_to_page((void *)((unsigned long) pmd & mask));
 }
 
+static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
+{
+   return page_ptdesc(pmd_pgtable_page(pmd));
+}
+
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
return ptlock_ptr(pmd_pgtable_page(pmd));
@@ -3025,6 +3081,11 @@ static inline void mark_page_reserved(struct page *page)
adjust_managed_page_count(page, -1);
 }
 
+static inline void free_reserved_ptdesc(struct ptdesc *pt)
+{
+   free_reserved_page(ptdesc_page(pt));
+}
+
 /*
  * Default method to free all the __init memory into the buddy system.
  * The freed pages will be poisoned with pattern "poison" if it's within
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1f92514d54b0..250fdeba68f3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1064,6 +1064,18 @@ TABLE_MATCH(memcg_data, pt_memcg_data);
 #undef TABLE_MATCH
 static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
 
+#define ptdesc_page(pt)			(_Generic((pt),			\
+	const struct ptdesc *:	(const struct page *)(pt),		\
+	struct ptdesc *:	(struct page *)(pt)))
+
+#define ptdesc_folio(pt)		(_Generic((pt),			\
+	const struct ptdesc *:	(const struct folio *)(pt),		\
+	struct ptdesc *:	(struct folio *)(pt)))
+
+#define page_ptdesc(p)			(_Generic((p),			\
+	const struct page *:	(const struct ptdesc *)(p),		\
+	struct page *:	(struct ptdesc *)(p)))
+
 /*
  * No-op macros 

[PATCH mm-unstable v7 02/31] pgtable: Create struct ptdesc

2023-07-24 Thread Vishal Moola (Oracle)
Currently, page table information is stored within struct page. As part
of simplifying struct page, create struct ptdesc for page table
information.
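
Because the new struct overlays struct page for now, every aliased field
must sit at the same offset as the struct page field it shadows. That is
exactly what the TABLE_MATCH() asserts in the patch below check at compile
time; conceptually, each one expands to something like:

	static_assert(offsetof(struct page, mapping) ==
		      offsetof(struct ptdesc, __page_mapping));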

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/pgtable.h | 71 +
 1 file changed, 71 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5f36c055794b..1f92514d54b0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -993,6 +993,77 @@ static inline void ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
 #endif /* CONFIG_MMU */
 
+
+/**
+ * struct ptdesc - Memory descriptor for page tables.
+ * @__page_flags:     Same as page flags. Unused for page tables.
+ * @pt_rcu_head:      For freeing page table pages.
+ * @pt_list:          List of used page tables. Used for s390 and x86.
+ * @_pt_pad_1:        Padding that aliases with page's compound head.
+ * @pmd_huge_pte:     Protected by ptdesc->ptl, used for THPs.
+ * @__page_mapping:   Aliases with page->mapping. Unused for page tables.
+ * @pt_mm:            Used for x86 pgds.
+ * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only.
+ * @_pt_pad_2:        Padding to ensure proper alignment.
+ * @ptl:              Lock for the page table.
+ * @__page_type:      Same as page->page_type. Unused for page tables.
+ * @_refcount:        Same as page refcount. Used for s390 page tables.
+ * @pt_memcg_data:    Memcg data. Tracked for page tables here.
+ *
+ * This struct overlays struct page for now. Do not modify without a good
+ * understanding of the issues.
+ */
+struct ptdesc {
+   unsigned long __page_flags;
+
+   union {
+   struct rcu_head pt_rcu_head;
+   struct list_head pt_list;
+   struct {
+   unsigned long _pt_pad_1;
+   pgtable_t pmd_huge_pte;
+   };
+   };
+   unsigned long __page_mapping;
+
+   union {
+   struct mm_struct *pt_mm;
+   atomic_t pt_frag_refcount;
+   };
+
+   union {
+   unsigned long _pt_pad_2;
+#if ALLOC_SPLIT_PTLOCKS
+   spinlock_t *ptl;
+#else
+   spinlock_t ptl;
+#endif
+   };
+   unsigned int __page_type;
+   atomic_t _refcount;
+#ifdef CONFIG_MEMCG
+   unsigned long pt_memcg_data;
+#endif
+};
+
+#define TABLE_MATCH(pg, pt)\
+   static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
+TABLE_MATCH(flags, __page_flags);
+TABLE_MATCH(compound_head, pt_list);
+TABLE_MATCH(compound_head, _pt_pad_1);
+TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
+TABLE_MATCH(mapping, __page_mapping);
+TABLE_MATCH(pt_mm, pt_mm);
+TABLE_MATCH(ptl, ptl);
+TABLE_MATCH(rcu_head, pt_rcu_head);
+TABLE_MATCH(page_type, __page_type);
+TABLE_MATCH(_refcount, _refcount);
+#ifdef CONFIG_MEMCG
+TABLE_MATCH(memcg_data, pt_memcg_data);
+#endif
+#undef TABLE_MATCH
+static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
+
 /*
  * No-op macros that just return the current protection value. Defined here
  * because these macros can be used even if CONFIG_MMU is not defined.
-- 
2.40.1



[PATCH mm-unstable v7 01/31] mm: Add PAGE_TYPE_OP folio functions

2023-07-24 Thread Vishal Moola (Oracle)
No folio equivalents for page type operations have been defined, so
define them for later folio conversions.

Also change the Page##uname macros to take a const struct page *, since
we only read the memory here.
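
For instance, PAGE_TYPE_OPS(Table, table, pgtable) below now emits folio
accessors alongside the page ones, so later conversions can write
(illustrative usage):

	static void example(struct folio *folio)
	{
		__folio_set_pgtable(folio);	/* counterpart of __SetPageTable() */
		if (folio_test_pgtable(folio))	/* counterpart of PageTable() */
			__folio_clear_pgtable(folio);
	}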

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/page-flags.h | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..9218028caf33 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -908,6 +908,8 @@ static inline bool is_page_hwpoison(struct page *page)
 
 #define PageType(page, flag)   \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+#define folio_test_type(folio, flag)   \
+   ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 
 static inline int page_type_has_type(unsigned int page_type)
 {
@@ -919,27 +921,41 @@ static inline int page_has_type(struct page *page)
return page_type_has_type(page->page_type);
 }
 
-#define PAGE_TYPE_OPS(uname, lname)\
-static __always_inline int Page##uname(struct page *page)  \
+#define PAGE_TYPE_OPS(uname, lname, fname) \
+static __always_inline int Page##uname(const struct page *page)	\
 {  \
return PageType(page, PG_##lname);  \
 }  \
+static __always_inline int folio_test_##fname(const struct folio *folio)\
+{  \
+   return folio_test_type(folio, PG_##lname);  \
+}  \
 static __always_inline void __SetPage##uname(struct page *page)	\
 {  \
VM_BUG_ON_PAGE(!PageType(page, 0), page);   \
page->page_type &= ~PG_##lname; \
 }  \
+static __always_inline void __folio_set_##fname(struct folio *folio)   \
+{  \
+   VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio); \
+   folio->page.page_type &= ~PG_##lname;   \
+}  \
 static __always_inline void __ClearPage##uname(struct page *page)  \
 {  \
VM_BUG_ON_PAGE(!Page##uname(page), page);   \
page->page_type |= PG_##lname;  \
-}
+}  \
+static __always_inline void __folio_clear_##fname(struct folio *folio) \
+{  \
+   VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio); \
+   folio->page.page_type |= PG_##lname;\
+}  \
 
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  */
-PAGE_TYPE_OPS(Buddy, buddy)
+PAGE_TYPE_OPS(Buddy, buddy, buddy)
 
 /*
  * PageOffline() indicates that the page is logically offline although the
@@ -963,7 +979,7 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * pages should check PageOffline() and synchronize with such drivers using
  * page_offline_freeze()/page_offline_thaw().
  */
-PAGE_TYPE_OPS(Offline, offline)
+PAGE_TYPE_OPS(Offline, offline, offline)
 
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
@@ -973,12 +989,12 @@ extern void page_offline_end(void);
 /*
  * Marks pages in use as page tables.
  */
-PAGE_TYPE_OPS(Table, table)
+PAGE_TYPE_OPS(Table, table, pgtable)
 
 /*
  * Marks guardpages used with debug_pagealloc.
  */
-PAGE_TYPE_OPS(Guard, guard)
+PAGE_TYPE_OPS(Guard, guard, guard)
 
 extern bool is_free_buddy_page(struct page *page);
 
-- 
2.40.1



[PATCH mm-unstable v7 00/31] Split ptdesc from struct page

2023-07-24 Thread Vishal Moola (Oracle)
The MM subsystem is trying to shrink struct page. This patchset
introduces a memory descriptor for page table tracking - struct ptdesc.

This patchset introduces ptdesc, splits ptdesc from struct page, and
converts many callers of page table constructor/destructors to use ptdescs.

Ptdesc is a foundation to further standardize page tables, and eventually
allow for dynamic allocation of page tables independent of struct page.
However, the use of pages for page table tracking is quite deeply
ingrained and varied across architectures, so there is still a lot of
work to be done before that can happen.

This is rebased on mm-unstable.

v7:
  Drop s390 gmap ptdesc conversions - gmap is an unnecessary complication
that can be dealt with later
  Be more thorough with ptdesc struct sanity checks and comments
  Rebase onto mm-unstable

Vishal Moola (Oracle) (31):
  mm: Add PAGE_TYPE_OP folio functions
  pgtable: Create struct ptdesc
  mm: add utility functions for ptdesc
  mm: Convert pmd_pgtable_page() callers to use pmd_ptdesc()
  mm: Convert ptlock_alloc() to use ptdescs
  mm: Convert ptlock_ptr() to use ptdescs
  mm: Convert pmd_ptlock_init() to use ptdescs
  mm: Convert ptlock_init() to use ptdescs
  mm: Convert pmd_ptlock_free() to use ptdescs
  mm: Convert ptlock_free() to use ptdescs
  mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}
  powerpc: Convert various functions to use ptdescs
  x86: Convert various functions to use ptdescs
  s390: Convert various pgalloc functions to use ptdescs
  mm: Remove page table members from struct page
  pgalloc: Convert various functions to use ptdescs
  arm: Convert various functions to use ptdescs
  arm64: Convert various functions to use ptdescs
  csky: Convert __pte_free_tlb() to use ptdescs
  hexagon: Convert __pte_free_tlb() to use ptdescs
  loongarch: Convert various functions to use ptdescs
  m68k: Convert various functions to use ptdescs
  mips: Convert various functions to use ptdescs
  nios2: Convert __pte_free_tlb() to use ptdescs
  openrisc: Convert __pte_free_tlb() to use ptdescs
  riscv: Convert alloc_{pmd, pte}_late() to use ptdescs
  sh: Convert pte_free_tlb() to use ptdescs
  sparc64: Convert various functions to use ptdescs
  sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents
  um: Convert {pmd, pte}_free_tlb() to use ptdescs
  mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers

 Documentation/mm/split_page_table_lock.rst|  12 +-
 .../zh_CN/mm/split_page_table_lock.rst|  14 +-
 arch/arm/include/asm/tlb.h|  12 +-
 arch/arm/mm/mmu.c |   7 +-
 arch/arm64/include/asm/tlb.h  |  14 +-
 arch/arm64/mm/mmu.c   |   7 +-
 arch/csky/include/asm/pgalloc.h   |   4 +-
 arch/hexagon/include/asm/pgalloc.h|   8 +-
 arch/loongarch/include/asm/pgalloc.h  |  27 ++--
 arch/loongarch/mm/pgtable.c   |   7 +-
 arch/m68k/include/asm/mcf_pgalloc.h   |  47 +++---
 arch/m68k/include/asm/sun3_pgalloc.h  |   8 +-
 arch/m68k/mm/motorola.c   |   4 +-
 arch/mips/include/asm/pgalloc.h   |  32 ++--
 arch/mips/mm/pgtable.c|   8 +-
 arch/nios2/include/asm/pgalloc.h  |   8 +-
 arch/openrisc/include/asm/pgalloc.h   |   8 +-
 arch/powerpc/mm/book3s64/mmu_context.c|  10 +-
 arch/powerpc/mm/book3s64/pgtable.c|  32 ++--
 arch/powerpc/mm/pgtable-frag.c|  56 +++
 arch/riscv/include/asm/pgalloc.h  |   8 +-
 arch/riscv/mm/init.c  |  16 +-
 arch/s390/include/asm/pgalloc.h   |   4 +-
 arch/s390/include/asm/tlb.h   |   4 +-
 arch/s390/mm/pgalloc.c| 128 +++
 arch/sh/include/asm/pgalloc.h |   9 +-
 arch/sparc/mm/init_64.c   |  17 +-
 arch/sparc/mm/srmmu.c |   5 +-
 arch/um/include/asm/pgalloc.h |  18 +--
 arch/x86/mm/pgtable.c |  47 +++---
 arch/x86/xen/mmu_pv.c |   2 +-
 include/asm-generic/pgalloc.h |  88 +-
 include/asm-generic/tlb.h |  11 ++
 include/linux/mm.h| 151 +-
 include/linux/mm_types.h  |  18 ---
 include/linux/page-flags.h|  30 +++-
 include/linux/pgtable.h   |  80 ++
 mm/memory.c   |   8 +-
 38 files changed, 585 insertions(+), 384 deletions(-)

-- 
2.40.1



Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-24 Thread Michael Ellerman
Alistair Popple  writes:
> The invalidate_range() is going to become an architecture specific mmu
> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
> sync with the CPU page tables. Currently it is called from separate
> code paths to the main CPU TLB invalidations. This can lead to a
> secondary TLB not getting invalidated when required and makes it hard
> to reason about when exactly the secondary TLB is invalidated.
>
> To fix this move the notifier call to the architecture specific TLB
> maintenance functions for architectures that have secondary MMUs
> requiring explicit software invalidations.
>
> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
> require a TLB invalidation. This invalidation is done by the
> architecutre specific ptep_set_access_flags() which calls
  ^
  architecture
  
> flush_tlb_page() if required. However this doesn't call the notifier
> resulting in infinite faults being generated by devices using the SMMU
> if it has previously cached a read-only PTE in it's TLB.
>
> Moving the invalidations into the TLB invalidation functions ensures
> all invalidations happen at the same time as the CPU invalidation. The
> architecture specific flush_tlb_all() routines do not call the
> notifier as none of the IOMMUs require this.
>
> Signed-off-by: Alistair Popple 
> Suggested-by: Jason Gunthorpe 
> 
...

> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
> b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 0bd4866..9724b26 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -752,6 +752,8 @@ void radix__local_flush_tlb_page(struct vm_area_struct 
> *vma, unsigned long vmadd
>   return radix__local_flush_hugetlb_page(vma, vmaddr);
>  #endif
>   radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, 
> mmu_virtual_psize);
> + mmu_notifier_invalidate_range(vma->vm_mm, vmaddr,
> + vmaddr + mmu_virtual_psize);
>  }
>  EXPORT_SYMBOL(radix__local_flush_tlb_page);

I think we can skip calling the notifier there? It's explicitly a local flush.

cheers


[Bug 217702] New: makedumpfile can not open /proc/vmcore

2023-07-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=217702

Bug ID: 217702
   Summary: makedumpfile can not open /proc/vmcore
   Product: Platform Specific/Hardware
   Version: 2.5
  Hardware: All
OS: Linux
Status: NEW
  Severity: normal
  Priority: P3
 Component: PPC-64
  Assignee: platform_ppc...@kernel-bugs.osdl.org
  Reporter: pi...@redhat.com
Regression: No

This bug appears to have been introduced by
commit 606787fed7268feb256957872586370b56af697a
Author: Nicholas Piggin 
Date:   Tue Jun 6 19:38:32 2023 +1000

powerpc/64s: Remove support for ELFv1 little endian userspace

ELFv2 was introduced together with little-endian. ELFv1 with LE has
never been a thing. The GNU toolchain can create such a beast, but
anyone doing that is a maniac who needs to be stopped so I consider
this patch a feature.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/20230606093832.199712-5-npig...@gmail.com


It can be worked around by the following draft patch:

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 110d28bede2a..6af49c90c4b2 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -782,6 +782,18 @@ static void update_backup_region_phdr(struct kimage *image, Elf64_Ehdr *ehdr)
}
 }

+/**
+ * 64le only supports ELFv2 64-bit binaries (64be supports v1 and v2).
+ */
+static inline void update_elfcorehdr_eflags(Elf64_Ehdr *ehdr)
+{
+#if defined(CONFIG_PPC64) && defined(CONFIG_CPU_LITTLE_ENDIAN)
+   ehdr->e_flags = 0x2;
+#endif
+}
+
+
+
 /**
  * load_elfcorehdr_segment - Setup crash memory ranges and initialize
elfcorehdr
  *   segment needed to load kdump kernel.
@@ -810,6 +822,7 @@ static int load_elfcorehdr_segment(struct kimage *image, struct kexec_buf *kbuf)

/* Fix the offset for backup region in the ELF header */
update_backup_region_phdr(image, headers);
+   update_elfcorehdr_eflags(headers);

kbuf->buffer = headers;
kbuf->mem = KEXEC_BUF_MEM_UNKNOWN;

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH v3 1/1] ASoC: imx-audmux: fix return value checks of clk_prepare_enable()

2023-07-24 Thread Yuanjun Gong
Check the return value of clk_prepare_enable(); if it fails,
imx_audmux_suspend() and imx_audmux_resume() should return the error
value instead of ignoring it.

Signed-off-by: Yuanjun Gong 
---
 sound/soc/fsl/imx-audmux.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/sound/soc/fsl/imx-audmux.c b/sound/soc/fsl/imx-audmux.c
index be003a117b39..096705ec2add 100644
--- a/sound/soc/fsl/imx-audmux.c
+++ b/sound/soc/fsl/imx-audmux.c
@@ -324,9 +324,11 @@ static void imx_audmux_remove(struct platform_device *pdev)
 #ifdef CONFIG_PM_SLEEP
 static int imx_audmux_suspend(struct device *dev)
 {
-   int i;
+   int i, ret;
 
-   clk_prepare_enable(audmux_clk);
+   ret = clk_prepare_enable(audmux_clk);
+   if (ret)
+   return ret;
 
for (i = 0; i < reg_max; i++)
regcache[i] = readl(audmux_base + i * 4);
@@ -338,9 +340,11 @@ static int imx_audmux_suspend(struct device *dev)
 
 static int imx_audmux_resume(struct device *dev)
 {
-   int i;
+   int i, ret;
 
-   clk_prepare_enable(audmux_clk);
+   ret = clk_prepare_enable(audmux_clk);
+   if (ret)
+   return ret;
 
for (i = 0; i < reg_max; i++)
writel(regcache[i], audmux_base + i * 4);
-- 
2.17.1



[PATCH v1 3/4] selftests/powerpc/ptrace: Fix typo in pid_max search error

2023-07-24 Thread Benjamin Gray
pid_max_addr() searches for the 'pid_max' symbol in /proc/kallsyms, and
prints an error if it cannot find it. The error message has a typo,
calling it pix_max.

Signed-off-by: Benjamin Gray 
---
 tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
index 16c653600124..d8a9e95fc03d 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
@@ -46,7 +46,7 @@ static unsigned long pid_max_addr(void)
	return strtoul(addr, &c, 16);
}
fclose(fp);
-   printf("Could not find pix_max. Exiting..\n");
+   printf("Could not find pid_max. Exiting..\n");
exit(EXIT_FAILURE);
return -1;
 }
-- 
2.41.0



[PATCH v1 2/4] selftests/powerpc/ptrace: Explain why tests are skipped

2023-07-24 Thread Benjamin Gray
Many tests require specific hardware features/configurations that a
typical machine might not have. As a result, it's common to see a test
being skipped. But it is tedious to find out why a test was skipped
when all it reports is the file location of the skip macro.

Convert SKIP_IF() to SKIP_IF_MSG(), with appropriate descriptions of why
the test is being skipped. This gives a general idea of why a test is
skipped, which can be looked into further if it doesn't make sense.

Signed-off-by: Benjamin Gray 
---
 tools/testing/selftests/powerpc/ptrace/child.h   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/core-pkey.c   | 2 +-
 tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c| 2 +-
 tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c  | 2 +-
 tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-pkey.c | 2 +-
 tools/testing/selftests/powerpc/ptrace/ptrace-tar.c  | 2 +-
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-gpr.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-spd-gpr.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-spd-tar.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-spd-vsx.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-spr.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-tar.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-tm-vsx.c   | 4 ++--
 tools/testing/selftests/powerpc/ptrace/ptrace-vsx.c  | 2 +-
 15 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/child.h b/tools/testing/selftests/powerpc/ptrace/child.h
index d7275b7b33dc..df62ff0735f7 100644
--- a/tools/testing/selftests/powerpc/ptrace/child.h
+++ b/tools/testing/selftests/powerpc/ptrace/child.h
@@ -48,12 +48,12 @@ struct child_sync {
}   \
} while (0)
 
-#define PARENT_SKIP_IF_UNSUPPORTED(x, sync)\
+#define PARENT_SKIP_IF_UNSUPPORTED(x, sync, msg)   \
do {\
if ((x) == -1 && (errno == ENODEV || errno == EINVAL)) { \
(sync)->parent_gave_up = true;  \
prod_child(sync);   \
-   SKIP_IF(1); \
+   SKIP_IF_MSG(1, msg);\
}   \
} while (0)
 
diff --git a/tools/testing/selftests/powerpc/ptrace/core-pkey.c b/tools/testing/selftests/powerpc/ptrace/core-pkey.c
index f6f8596ce8e1..f6da4cb30cd6 100644
--- a/tools/testing/selftests/powerpc/ptrace/core-pkey.c
+++ b/tools/testing/selftests/powerpc/ptrace/core-pkey.c
@@ -266,7 +266,7 @@ static int parent(struct shared_info *info, pid_t pid)
 * to the child.
 */
ret = ptrace_read_regs(pid, NT_PPC_PKEY, regs, 3);
-	PARENT_SKIP_IF_UNSUPPORTED(ret, &info->child_sync);
+	PARENT_SKIP_IF_UNSUPPORTED(ret, &info->child_sync, "PKEYs not supported");
	PARENT_FAIL_IF(ret, &info->child_sync);
 
info->amr = regs[0];
diff --git a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
index f75739bbad28..e374c6b7ace6 100644
--- a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
@@ -884,7 +884,7 @@ static int perf_hwbreak(void)
 {
srand ( time(NULL) );
 
-   SKIP_IF(!perf_breakpoint_supported());
+	SKIP_IF_MSG(!perf_breakpoint_supported(), "Perf breakpoints not supported");
 
return runtest();
 }
diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 1345e9b9af0f..a16239277a6f 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -603,7 +603,7 @@ static int ptrace_hwbreak(void)
wait(NULL);
 
	get_dbginfo(child_pid, &dbginfo);
-   SKIP_IF(dbginfo.num_data_bps == 0);
+   SKIP_IF_MSG(dbginfo.num_data_bps == 0, "No data breakpoints present");
 
dawr = dawr_present();
run_tests(child_pid, , dawr);
diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
index 3344e74a97b4..16c653600124 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
@@ -641,10 +641,10 @@ static int ptrace_perf_hwbreak(void)
wait(NULL); /* <-- child (SIGUSR1) */
 
	get_dbginfo(child_pid, &dbginfo);
-   SKIP_IF(dbginfo.num_data_bps <= 1);
+   

[PATCH v1 4/4] selftests/powerpc/ptrace: Declare test temporary variables as volatile

2023-07-24 Thread Benjamin Gray
While the target is volatile, the temporary variables used to access the
target cast away the volatile. This is undefined behaviour, and a
compiler may optimise away/reorder these accesses, breaking the test.

This was observed with GCC 13.1.1, but it can be difficult to reproduce
because of the dependency on compiler behaviour.
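
A minimal sketch of the problem and the fix (simplified from the test;
glvar stands in for the global variable watched by the breakpoint):

	volatile __u64 glvar;

	static void broken_write(void)
	{
		__u64 *p = (__u64 *)&glvar;	/* volatile cast away: undefined  */
		*p = 0xffffffffffffffffULL;	/* behaviour, store may be elided */
	}

	static void fixed_write(void)
	{
		volatile __u64 *p = &glvar;	/* access keeps volatile semantics */
		*p = 0xffffffffffffffffULL;	/* and must actually be emitted    */
	}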

Signed-off-by: Benjamin Gray 
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 24 +--
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index a16239277a6f..75d30d61ab0e 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -64,26 +64,26 @@ static bool dawr_present(struct ppc_debug_info *dbginfo)
 
 static void write_var(int len)
 {
-   __u8 *pcvar;
-   __u16 *psvar;
-   __u32 *pivar;
-   __u64 *plvar;
+   volatile __u8 *pcvar;
+   volatile __u16 *psvar;
+   volatile __u32 *pivar;
+   volatile __u64 *plvar;
 
switch (len) {
case 1:
-		pcvar = (__u8 *)&glvar;
+		pcvar = (volatile __u8 *)&glvar;
		*pcvar = 0xff;
		break;
	case 2:
-		psvar = (__u16 *)&glvar;
+		psvar = (volatile __u16 *)&glvar;
		*psvar = 0xffff;
		break;
	case 4:
-		pivar = (__u32 *)&glvar;
+		pivar = (volatile __u32 *)&glvar;
		*pivar = 0xffffffff;
		break;
	case 8:
-		plvar = (__u64 *)&glvar;
+		plvar = (volatile __u64 *)&glvar;
		*plvar = 0xffffffffffffffffLL;
		break;
}
@@ -98,16 +98,16 @@ static void read_var(int len)
 
switch (len) {
case 1:
-   cvar = (__u8)glvar;
+   cvar = (volatile __u8)glvar;
break;
case 2:
-   svar = (__u16)glvar;
+   svar = (volatile __u16)glvar;
break;
case 4:
-   ivar = (__u32)glvar;
+   ivar = (volatile __u32)glvar;
break;
case 8:
-   lvar = (__u64)glvar;
+   lvar = (volatile __u64)glvar;
break;
}
 }
-- 
2.41.0



[PATCH v1 1/4] Documentation/powerpc: Fix ptrace request names

2023-07-24 Thread Benjamin Gray
The documented ptrace request names are currently wrong/incomplete.
Fix this to improve correctness and searchability.

Signed-off-by: Benjamin Gray 
---
 Documentation/powerpc/ptrace.rst | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/powerpc/ptrace.rst b/Documentation/powerpc/ptrace.rst
index 77725d69eb4a..5629edf4d56e 100644
--- a/Documentation/powerpc/ptrace.rst
+++ b/Documentation/powerpc/ptrace.rst
@@ -15,7 +15,7 @@ that's extendable and that covers both BookE and server 
processors, so
 that GDB doesn't need to special-case each of them. We added the
 following 3 new ptrace requests.
 
-1. PTRACE_PPC_GETHWDEBUGINFO
+1. PPC_PTRACE_GETHWDBGINFO
 
 
 Query for GDB to discover the hardware debug features. The main info to
@@ -48,7 +48,7 @@ features will have bits indicating whether there is support 
for::
   #define PPC_DEBUG_FEATURE_DATA_BP_DAWR   0x10
  #define PPC_DEBUG_FEATURE_DATA_BP_ARCH_31   0x20
 
-2. PTRACE_SETHWDEBUG
+2. PPC_PTRACE_SETHWDEBUG
 
 Sets a hardware breakpoint or watchpoint, according to the provided structure::
 
@@ -88,7 +88,7 @@ that the BookE supports. COMEFROM breakpoints available in 
server processors
 are not contemplated, but that is out of the scope of this work.
 
 ptrace will return an integer (handle) uniquely identifying the breakpoint or
-watchpoint just created. This integer will be used in the PTRACE_DELHWDEBUG
+watchpoint just created. This integer will be used in the PPC_PTRACE_DELHWDEBUG
 request to ask for its removal. Return -ENOSPC if the requested breakpoint
 can't be allocated on the registers.
 
@@ -150,7 +150,7 @@ Some examples of using the structure to:
 p.addr2   = (uint64_t) end_range;
 p.condition_value = 0;
 
-3. PTRACE_DELHWDEBUG
+3. PPC_PTRACE_DELHWDEBUG
 
 Takes an integer which identifies an existing breakpoint or watchpoint
 (i.e., the value returned from PTRACE_SETHWDEBUG), and deletes the
-- 
2.41.0



[PATCH v1 0/4] Improve ptrace selftest usability

2023-07-24 Thread Benjamin Gray
While testing changes to the kernel's breakpoint implementation, I tried to
run the ptrace tests and met many unexplained skips and failures.

This series addresses the pain points of trying to run these tests and learn
what they are doing.

Benjamin Gray (4):
  Documentation/powerpc: Fix ptrace request names
  selftests/powerpc/ptrace: Explain why tests are skipped
  selftests/powerpc/ptrace: Fix typo in pid_max search error
  selftests/powerpc/ptrace: Declare test temporary variables as volatile

 Documentation/powerpc/ptrace.rst  |  8 +++---
 .../testing/selftests/powerpc/ptrace/child.h  |  4 +--
 .../selftests/powerpc/ptrace/core-pkey.c  |  2 +-
 .../selftests/powerpc/ptrace/perf-hwbreak.c   |  2 +-
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 26 +--
 .../powerpc/ptrace/ptrace-perf-hwbreak.c  |  6 ++---
 .../selftests/powerpc/ptrace/ptrace-pkey.c|  2 +-
 .../selftests/powerpc/ptrace/ptrace-tar.c |  2 +-
 .../selftests/powerpc/ptrace/ptrace-tm-gpr.c  |  4 +--
 .../powerpc/ptrace/ptrace-tm-spd-gpr.c|  4 +--
 .../powerpc/ptrace/ptrace-tm-spd-tar.c|  4 +--
 .../powerpc/ptrace/ptrace-tm-spd-vsx.c|  4 +--
 .../selftests/powerpc/ptrace/ptrace-tm-spr.c  |  4 +--
 .../selftests/powerpc/ptrace/ptrace-tm-tar.c  |  4 +--
 .../selftests/powerpc/ptrace/ptrace-tm-vsx.c  |  4 +--
 .../selftests/powerpc/ptrace/ptrace-vsx.c |  2 +-
 16 files changed, 41 insertions(+), 41 deletions(-)

--
2.41.0


[PATCH] powerpc: Use shared font data

2023-07-24 Thread linux
From: "Dr. David Alan Gilbert" 

PowerPC has a 'btext' font used for the console which is almost identical
to the shared font_sun8x16, so use it rather than duplicating the data.

They were actually identical until about a decade ago, when
   commit bcfbeecea11c ("drivers: console: font_: Change a glyph from
"broken bar" to "vertical line"")
changed the | in the shared font to be a solid bar rather than a
broken bar.  That's the only difference.

This was originally spotted by PMD, which noticed that sparc does
the same thing with the same data, and they also share a bunch
of functions to manipulate the data.  I've previously posted a
near-identical patch for sparc.

One difference I notice in PowerPC is that there are a number of compile
options for the early-code .c files that avoid various security
compilation features; it's not clear to me if this is a problem for
this font data.

Tested very lightly with a boot without FS in qemu.

Signed-off-by: Dr. David Alan Gilbert 
---
 arch/powerpc/Kconfig.debug  |   1 +
 arch/powerpc/kernel/btext.c | 360 +---
 2 files changed, 7 insertions(+), 354 deletions(-)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 2a54fadbeaf51..51930904e5108 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -147,6 +147,7 @@ config BDI_SWITCH
 config BOOTX_TEXT
bool "Support for early boot text console (BootX or OpenFirmware only)"
depends on PPC_BOOK3S
+   select FONT_SUN8x16
help
  Say Y here to see progress messages from the boot firmware in text
  mode. Requires either BootX or Open Firmware.
diff --git a/arch/powerpc/kernel/btext.c b/arch/powerpc/kernel/btext.c
index 19e46fd623b0d..7f63f1cdc6c39 100644
--- a/arch/powerpc/kernel/btext.c
+++ b/arch/powerpc/kernel/btext.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,10 +42,6 @@ static unsigned char *logicalDisplayBase __force_data;
 
 unsigned long disp_BAT[2] __initdata = {0, 0};
 
-#define cmapsz (16*256)
-
-static unsigned char vga_font[cmapsz];
-
 static int boot_text_mapped __force_data;
 
 extern void rmci_on(void);
@@ -407,7 +404,7 @@ static unsigned int expand_bits_16[4] = {
 };
 
 
-static void draw_byte_32(unsigned char *font, unsigned int *base, int rb)
+static void draw_byte_32(const unsigned char *font, unsigned int *base, int rb)
 {
int l, bits;
int fg = 0xUL;
@@ -428,7 +425,7 @@ static void draw_byte_32(unsigned char *font, unsigned int 
*base, int rb)
}
 }
 
-static inline void draw_byte_16(unsigned char *font, unsigned int *base, int 
rb)
+static inline void draw_byte_16(const unsigned char *font, unsigned int *base, 
int rb)
 {
int l, bits;
int fg = 0xUL;
@@ -446,7 +443,7 @@ static inline void draw_byte_16(unsigned char *font, 
unsigned int *base, int rb)
}
 }
 
-static inline void draw_byte_8(unsigned char *font, unsigned int *base, int rb)
+static inline void draw_byte_8(const unsigned char *font, unsigned int *base, 
int rb)
 {
int l, bits;
int fg = 0x0F0F0F0FUL;
@@ -465,7 +462,8 @@ static inline void draw_byte_8(unsigned char *font, 
unsigned int *base, int rb)
 static noinline void draw_byte(unsigned char c, long locX, long locY)
 {
unsigned char *base = calc_base(locX << 3, locY << 4);
-   unsigned char *font = &vga_font[((unsigned int)c) * 16];
+   unsigned int font_index = c * 16;
+   const unsigned char *font   = font_sun_8x16.data + font_index;
int rb  = dispDeviceRowBytes;
 
rmci_maybe_on();
@@ -583,349 +581,3 @@ void __init udbg_init_btext(void)
 */
udbg_putc = btext_drawchar;
 }
-
-static unsigned char vga_font[cmapsz] = {
-0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
-0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7e, 0x81, 0xa5, 0x81, 0x81, 0xbd,
-0x99, 0x81, 0x81, 0x7e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7e, 0xff,
-0xdb, 0xff, 0xff, 0xc3, 0xe7, 0xff, 0xff, 0x7e, 0x00, 0x00, 0x00, 0x00,
-0x00, 0x00, 0x00, 0x00, 0x6c, 0xfe, 0xfe, 0xfe, 0xfe, 0x7c, 0x38, 0x10,
-0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x38, 0x7c, 0xfe,
-0x7c, 0x38, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x18,
-0x3c, 0x3c, 0xe7, 0xe7, 0xe7, 0x18, 0x18, 0x3c, 0x00, 0x00, 0x00, 0x00,
-0x00, 0x00, 0x00, 0x18, 0x3c, 0x7e, 0xff, 0xff, 0x7e, 0x18, 0x18, 0x3c,
-0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x18, 0x3c,
-0x3c, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff,
-0xff, 0xff, 0xe7, 0xc3, 0xc3, 0xe7, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
-0x00, 0x00, 0x00, 0x00, 0x00, 0x3c, 0x66, 0x42, 0x42, 0x66, 0x3c, 0x00,
-0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xc3, 0x99, 0xbd,
-0xbd, 0x99, 0xc3, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x1e, 0x0e,
-0x1a, 0x32, 0x78, 0xcc, 0xcc, 

Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-24 Thread Alistair Popple


Luis Chamberlain  writes:

>> diff --git a/arch/x86/include/asm/tlbflush.h 
>> b/arch/x86/include/asm/tlbflush.h
>> index 837e4a50281a..79c46da919b9 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -4,6 +4,7 @@
>>  
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -282,6 +283,7 @@ static inline void arch_tlbbatch_add_pending(struct 
>> arch_tlbflush_unmap_batch *b
>>  {
>>  inc_mm_tlb_gen(mm);
>>  cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
>> +mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>>  }
>>  
>>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 0b990fb56b66..2d253919b3e8 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -1265,7 +1265,6 @@ void arch_tlbbatch_flush(struct 
>> arch_tlbflush_unmap_batch *batch)
>>  
>>  put_flush_tlb_info();
>>  put_cpu();
>> -mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
>>  }
>>  
>>  /*
>
> This patch also fixes a regression introduced in linux-next: the same
> crash in arch_tlbbatch_flush() is reproducible with fstests generic/176
> on XFS, and this patch fixes that regression [0]. It should also close
> out the syzbot crash [1].
>
> [0] https://gist.github.com/mcgrof/b37fc8cf7e6e1b3935242681de1a83e2
> [1] https://lore.kernel.org/all/3afcb4060135a...@google.com/
>
> Tested-by: Luis Chamberlain 

Thanks Luis. The above fix/respin is already in yesterday's linux-next
(next-20230724), so hopefully you are no longer seeing issues.

>   Luis



Re: linux-next: Tree for Jul 24 (arch/powerpc/platforms/embedded6xx/mvme5100.c)

2023-07-24 Thread Randy Dunlap


On 7/23/23 21:08, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20230721:
> 

on ppc32:

../arch/powerpc/platforms/embedded6xx/mvme5100.c: In function 
'mvme5100_add_bridge':
../arch/powerpc/platforms/embedded6xx/mvme5100.c:135:65: error: passing 
argument 5 of 'early_read_config_dword' from incompatible pointer type 
[-Werror=incompatible-pointer-types]
  135 | early_read_config_dword(hose, 0, 0, PCI_BASE_ADDRESS_1, &pci_membase);
  | 
^~~~
  | |
  | 
phys_addr_t * {aka long long unsigned int *}
In file included from ../arch/powerpc/platforms/embedded6xx/mvme5100.c:19:
../arch/powerpc/include/asm/pci-bridge.h:150:53: note: expected 'u32 *' {aka 
'unsigned int *'} but argument is of type 'phys_addr_t *' {aka 'long long 
unsigned int *'}
  150 | int dev_fn, int where, u32 *val);
  |~^~~
In file included from ../include/asm-generic/bug.h:22,
 from ../arch/powerpc/include/asm/bug.h:116,
 from ../include/linux/bug.h:5,
 from ../include/linux/thread_info.h:13,
 from ../include/asm-generic/preempt.h:5,
 from ./arch/powerpc/include/generated/asm/preempt.h:1,
 from ../include/linux/preempt.h:79,
 from ../include/linux/spinlock.h:56,
 from ../include/linux/irq.h:14,
 from ../include/linux/of_irq.h:7,
 from ../arch/powerpc/platforms/embedded6xx/mvme5100.c:15:
../include/linux/kern_levels.h:5:25: warning: format '%x' expects argument of 
type 'unsigned int', but argument 2 has type 'phys_addr_t' {aka 'long long 
unsigned int'} [-Wformat=]
5 | #define KERN_SOH"\001"  /* ASCII Start Of Header */
  | ^~
../include/linux/printk.h:427:25: note: in definition of macro 
'printk_index_wrap'
  427 | _p_func(_fmt, ##__VA_ARGS__);   
\
  | ^~~~
../include/linux/printk.h:528:9: note: in expansion of macro 'printk'
  528 | printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
  | ^~
../include/linux/kern_levels.h:14:25: note: in expansion of macro 'KERN_SOH'
   14 | #define KERN_INFO   KERN_SOH "6"/* informational */
  | ^~~~
../include/linux/printk.h:528:16: note: in expansion of macro 'KERN_INFO'
  528 | printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
  |^
../arch/powerpc/platforms/embedded6xx/mvme5100.c:142:9: note: in expansion of 
macro 'pr_info'
  142 | pr_info("mvme5100_pic_init: pci_membase: %x\n", pci_membase);
  | ^~~
cc1: some warnings being treated as errors
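
One way to make both complaints go away (an untested sketch, not taken from
this report) is to read into a local u32 and print the phys_addr_t with %pa:

    u32 membase;

    early_read_config_dword(hose, 0, 0, PCI_BASE_ADDRESS_1, &membase);
    pci_membase = membase;
    pr_info("mvme5100_pic_init: pci_membase: %pa\n", &pci_membase);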


Full randconfig file is attached.

-- 
~Randy

config-r3694.gz
Description: application/gzip


Re: [PATCH v5 14/25] iommu/msm: Implement an IDENTITY domain

2023-07-24 Thread Dmitry Baryshkov

On 24/07/2023 20:22, Jason Gunthorpe wrote:

What msm does during omap_iommu_set_platform_dma() is actually putting the
iommu into identity mode.

typo: msm driver doesn't use/provide omap_iommu_set_platform_dma().

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA, however it cannot be
compiled on ARM64 either. Most likely it is fine to support dma-iommu.c

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/msm_iommu.c | 23 +++
  1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 79d89bad5132b7..26ed81cfeee897 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -443,15 +443,20 @@ static int msm_iommu_attach_dev(struct iommu_domain 
*domain, struct device *dev)
return ret;
  }
  
-static void msm_iommu_set_platform_dma(struct device *dev)
+static int msm_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
  {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct msm_priv *priv = to_msm_priv(domain);
+   struct msm_priv *priv;
unsigned long flags;
struct msm_iommu_dev *iommu;
struct msm_iommu_ctx_dev *master;
-   int ret;
+   int ret = 0;
  
+	if (domain == identity_domain || !domain)
+   return 0;
+
+   priv = to_msm_priv(domain);
free_io_pgtable_ops(priv->iop);
  
 	spin_lock_irqsave(&msm_iommu_lock, flags);
@@ -468,8 +473,18 @@ static void msm_iommu_set_platform_dma(struct device *dev)
}
  fail:
 	spin_unlock_irqrestore(&msm_iommu_lock, flags);
+   return ret;
  }
  
+static struct iommu_domain_ops msm_iommu_identity_ops = {
+   .attach_dev = msm_iommu_identity_attach,
+};
+
+static struct iommu_domain msm_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &msm_iommu_identity_ops,
+};
+
  static int msm_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t pa, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -675,10 +690,10 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
  }
  
  static struct iommu_ops msm_iommu_ops = {
+   .identity_domain = &msm_iommu_identity_domain,
.domain_alloc = msm_iommu_domain_alloc,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = msm_iommu_set_platform_dma,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
.of_xlate = qcom_iommu_of_xlate,
.default_domain_ops = &(const struct iommu_domain_ops) {


--
With best wishes
Dmitry



Re: [PATCH] tools/perf: Fix addr location init during arch_skip_callchain_idx function

2023-07-24 Thread Arnaldo Carvalho de Melo
On Mon, Jul 24, 2023 at 10:28:15PM +0530, Athira Rajeev wrote:
> perf record with callchain recording fails as below
> in powerpc:
> 
> ./perf record -a -gR sleep 10
> ./perf report
> perf: Segmentation fault
> 
> gdb trace points to thread__find_map
> 
> 0  0x101df314 in atomic_cmpxchg (newval=1818846826, 
> oldval=1818846827, v=0x1001a8f3) at 
> /home/athira/linux/tools/include/asm-generic/atomic-gcc.h:70
> 1  refcount_sub_and_test (i=1, r=0x1001a8f3) at 
> /home/athira/linux/tools/include/linux/refcount.h:135
> 2  refcount_dec_and_test (r=0x1001a8f3) at 
> /home/athira/linux/tools/include/linux/refcount.h:148
> 3  map__put (map=0x1001a8b3) at util/map.c:311
> 4  0x1016842c in __map__zput (map=0x7fffa368) at 
> util/map.h:190
> 5  thread__find_map (thread=0x105b92f0, cpumode=, 
> addr=13835058055283572736, al=al@entry=0x7fffa358) at util/event.c:582
> 6  0x1016882c in thread__find_symbol (thread=, 
> cpumode=, addr=, al=0x7fffa358) at 
> util/event.c:656
> 7  0x102e12b4 in arch_skip_callchain_idx (thread=, 
> chain=) at arch/powerpc/util/skip-callchain-idx.c:255
> 8  0x101d3bf4 in thread__resolve_callchain_sample 
> (thread=0x105b92f0, cursor=0x1053d160, evsel=, 
> sample=0x7fffa908, parent=0x7fffa778, root_al=0x7fffa710,
> max_stack=) at util/machine.c:2940
> 9  0x101cd210 in sample__resolve_callchain (sample= out>, cursor=, parent=, evsel=, 
> al=, max_stack=)
> at util/callchain.c:1112
> 10 0x1022a9d8 in hist_entry_iter__add (iter=0x7fffa750, 
> al=0x7fffa710, max_stack_depth=, arg=0x7fffbbd0) at 
> util/hist.c:1232
> 11 0x10056d98 in process_sample_event (tool=0x7fffbbd0, 
> event=0x76223c38, sample=0x7fffa908, evsel=, 
> machine=0x10524ef8) at builtin-report.c:332
> 
> Here arch_skip_callchain_idx calls thread__find_symbol and which
> invokes thread__find_map with uninitialised "addr_location".
> Snippet:
> 
> thread__find_symbol(thread, PERF_RECORD_MISC_USER, ip, &al);
> 
> A recent change, commit 0dd5041c9a0ea ("perf addr_location:
> Add init/exit/copy functions"), introduced "maps__zput" in the
> function thread__find_map. This could result in a segfault while
> accessing an uninitialised map from "struct addr_location". Fix this
> by adding addr_location__init and addr_location__exit in
> arch_skip_callchain_idx.

Thanks, applied.
 
> Fixes: 0dd5041c9a0ea ("perf addr_location: Add init/exit/copy functions")

> Reported-by: Aneesh Kumar K.V 
> Signed-off-by: Athira Rajeev 

I'll also do an audit of all calls to thread__find_map() and its callers
to check for other such cases :-\

For instance, this one seem buggy as well, Adrian?

diff --git a/tools/perf/util/dlfilter.c b/tools/perf/util/dlfilter.c
index 46f74b2344dbb34c..798a53d7e6c9dfc5 100644
--- a/tools/perf/util/dlfilter.c
+++ b/tools/perf/util/dlfilter.c
@@ -166,6 +166,7 @@ static __s32 dlfilter__resolve_address(void *ctx, __u64 
address, struct perf_dlf
if (!thread)
return -1;
 
+   addr_location__init(&al);
	thread__find_symbol_fb(thread, d->sample->cpumode, address, &al);
 
	al_to_d_al(&al, &d_al);




[PATCH v3] powerpc: Explicitly include correct DT includes

2023-07-24 Thread Rob Herring
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it was merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.
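
For example (illustrative only, not taken from this patch), a driver that
needs platform devices and OF node accessors would spell both out directly:

    #include <linux/of.h>               /* struct device_node, of_* helpers */
    #include <linux/platform_device.h>  /* struct platform_device */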

Signed-off-by: Rob Herring 
---
v3:
- Add 83xx/mpc832x_rdb.c, 85xx/common.c, 86xx/common.c
v2:
- Fix double include of of.h
---
 arch/powerpc/include/asm/ibmebus.h  | 2 ++
 arch/powerpc/include/asm/macio.h| 3 ++-
 arch/powerpc/kernel/legacy_serial.c | 2 +-
 arch/powerpc/kernel/of_platform.c   | 4 +---
 arch/powerpc/kernel/setup-common.c  | 4 ++--
 arch/powerpc/kexec/file_load_64.c   | 2 +-
 arch/powerpc/kexec/ranges.c | 2 +-
 arch/powerpc/platforms/4xx/cpm.c| 2 +-
 arch/powerpc/platforms/4xx/hsta_msi.c   | 2 +-
 arch/powerpc/platforms/4xx/soc.c| 2 +-
 arch/powerpc/platforms/512x/mpc5121_ads.c   | 2 +-
 arch/powerpc/platforms/512x/mpc512x_generic.c   | 2 +-
 arch/powerpc/platforms/512x/mpc512x_lpbfifo.c   | 2 +-
 arch/powerpc/platforms/512x/pdm360ng.c  | 3 ++-
 arch/powerpc/platforms/52xx/mpc52xx_gpt.c   | 3 +--
 arch/powerpc/platforms/82xx/ep8248e.c   | 1 +
 arch/powerpc/platforms/83xx/km83xx.c| 4 ++--
 arch/powerpc/platforms/83xx/mpc832x_rdb.c   | 4 +++-
 arch/powerpc/platforms/83xx/suspend.c   | 2 +-
 arch/powerpc/platforms/85xx/bsc913x_qds.c   | 2 +-
 arch/powerpc/platforms/85xx/bsc913x_rdb.c   | 2 +-
 arch/powerpc/platforms/85xx/c293pcie.c  | 3 +--
 arch/powerpc/platforms/85xx/common.c| 1 +
 arch/powerpc/platforms/85xx/ge_imp3a.c  | 2 +-
 arch/powerpc/platforms/85xx/ksi8560.c   | 3 ++-
 arch/powerpc/platforms/85xx/mpc8536_ds.c| 2 +-
 arch/powerpc/platforms/85xx/mpc85xx_ds.c| 2 +-
 arch/powerpc/platforms/85xx/mpc85xx_mds.c   | 4 ++--
 arch/powerpc/platforms/85xx/mpc85xx_rdb.c   | 3 ++-
 arch/powerpc/platforms/85xx/p1010rdb.c  | 2 +-
 arch/powerpc/platforms/85xx/p1022_ds.c  | 2 +-
 arch/powerpc/platforms/85xx/p1022_rdk.c | 2 +-
 arch/powerpc/platforms/85xx/p1023_rdb.c | 3 +--
 arch/powerpc/platforms/85xx/socrates.c  | 2 +-
 arch/powerpc/platforms/85xx/socrates_fpga_pic.c | 1 -
 arch/powerpc/platforms/85xx/stx_gp3.c   | 2 +-
 arch/powerpc/platforms/85xx/tqm85xx.c   | 2 +-
 arch/powerpc/platforms/85xx/twr_p102x.c | 3 ++-
 arch/powerpc/platforms/85xx/xes_mpc85xx.c   | 2 +-
 arch/powerpc/platforms/86xx/common.c| 3 +++
 arch/powerpc/platforms/86xx/gef_ppc9a.c | 2 +-
 arch/powerpc/platforms/86xx/gef_sbc310.c| 2 +-
 arch/powerpc/platforms/86xx/gef_sbc610.c| 2 +-
 arch/powerpc/platforms/86xx/mvme7100.c  | 1 -
 arch/powerpc/platforms/86xx/pic.c   | 2 +-
 arch/powerpc/platforms/cell/axon_msi.c  | 3 ++-
 arch/powerpc/platforms/cell/cbe_regs.c  | 3 +--
 arch/powerpc/platforms/cell/iommu.c | 2 +-
 arch/powerpc/platforms/cell/setup.c | 1 +
 arch/powerpc/platforms/cell/spider-pci.c| 1 -
 arch/powerpc/platforms/embedded6xx/holly.c  | 2 +-
 arch/powerpc/platforms/maple/setup.c| 4 ++--
 arch/powerpc/platforms/pasemi/gpio_mdio.c   | 2 +-
 arch/powerpc/platforms/pasemi/setup.c   | 2 ++
 arch/powerpc/platforms/powermac/setup.c | 2 +-
 arch/powerpc/platforms/powernv/opal-imc.c   | 1 -
 arch/powerpc/platforms/powernv/opal-rtc.c   | 3 ++-
 arch/powerpc/platforms/powernv/opal-secvar.c| 2 +-
 arch/powerpc/platforms/powernv/opal-sensor.c| 2 ++
 arch/powerpc/platforms/pseries/ibmebus.c| 1 +
 arch/powerpc/sysdev/cpm_common.c| 2 --
 arch/powerpc/sysdev/cpm_gpio.c  | 3 ++-
 arch/powerpc/sysdev/fsl_pmc.c   | 4 ++--
 arch/powerpc/sysdev/fsl_rio.c   | 4 ++--
 arch/powerpc/sysdev/fsl_rmu.c   | 1 -
 arch/powerpc/sysdev/fsl_soc.c   | 1 -
 arch/powerpc/sysdev/mpic_msgr.c | 3 ++-
 arch/powerpc/sysdev/mpic_timer.c| 1 -
 arch/powerpc/sysdev/of_rtc.c| 4 ++--
 arch/powerpc/sysdev/pmi.c   | 4 ++--
 70 files changed, 86 insertions(+), 77 deletions(-)

diff --git a/arch/powerpc/include/asm/ibmebus.h 
b/arch/powerpc/include/asm/ibmebus.h
index 088f95b2e14f..6f33253a364a 100644
--- a/arch/powerpc/include/asm/ibmebus.h
+++ b/arch/powerpc/include/asm/ibmebus.h
@@ -46,6 +46,8 @@
 #include 
 #include 
 
+struct platform_driver;
+
 extern struct bus_type ibmebus_bus_type;
 
 int 

[PATCH v2] soc: fsl: Explicitly include correct DT includes

2023-07-24 Thread Rob Herring
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it as merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.

Signed-off-by: Rob Herring 

---
v2:
 - Add qe.c
---
 drivers/soc/fsl/dpaa2-console.c | 3 ++-
 drivers/soc/fsl/qe/qe.c | 3 ++-
 drivers/soc/fsl/qe/qe_common.c  | 1 -
 drivers/soc/fsl/qe/qe_tdm.c | 4 +---
 4 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/soc/fsl/dpaa2-console.c b/drivers/soc/fsl/dpaa2-console.c
index 53917410f2bd..1dca693b6b38 100644
--- a/drivers/soc/fsl/dpaa2-console.c
+++ b/drivers/soc/fsl/dpaa2-console.c
@@ -9,9 +9,10 @@
 #define pr_fmt(fmt) "dpaa2-console: " fmt
 
 #include 
-#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
index b3c226eb5292..95168b574627 100644
--- a/drivers/soc/fsl/qe/qe.c
+++ b/drivers/soc/fsl/qe/qe.c
@@ -25,7 +25,8 @@
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
 #include 
 #include 
 
diff --git a/drivers/soc/fsl/qe/qe_common.c b/drivers/soc/fsl/qe/qe_common.c
index a0cb8e746879..9729ce86db59 100644
--- a/drivers/soc/fsl/qe/qe_common.c
+++ b/drivers/soc/fsl/qe/qe_common.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/soc/fsl/qe/qe_tdm.c b/drivers/soc/fsl/qe/qe_tdm.c
index 7d7d78d3ee50..a3b691875c8e 100644
--- a/drivers/soc/fsl/qe/qe_tdm.c
+++ b/drivers/soc/fsl/qe/qe_tdm.c
@@ -9,9 +9,7 @@
  */
 #include 
 #include 
-#include 
-#include 
-#include 
+#include 
 #include 
 
 static int set_tdm_framer(const char *tdm_framer_type)
-- 
2.40.1



[PATCH v2] tty: Explicitly include correct DT includes

2023-07-24 Thread Rob Herring
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it as merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h. As a result, there's a pretty much random mix of those include
files used throughout the tree. In order to detangle these headers and
replace the implicit includes with struct declarations, users need to
explicitly include the correct includes.

Signed-off-by: Rob Herring 
---
v2:
 - Add mpc52xx_uart
---
 drivers/tty/hvc/hvc_opal.c | 2 +-
 drivers/tty/serial/8250/8250_early.c   | 1 -
 drivers/tty/serial/8250/8250_ingenic.c | 1 -
 drivers/tty/serial/8250/8250_omap.c| 1 -
 drivers/tty/serial/amba-pl011.c| 2 +-
 drivers/tty/serial/apbuart.c   | 3 ---
 drivers/tty/serial/atmel_serial.c  | 1 -
 drivers/tty/serial/fsl_linflexuart.c   | 2 +-
 drivers/tty/serial/fsl_lpuart.c| 2 +-
 drivers/tty/serial/imx.c   | 1 -
 drivers/tty/serial/lantiq.c| 3 ++-
 drivers/tty/serial/liteuart.c  | 3 +--
 drivers/tty/serial/ma35d1_serial.c | 2 +-
 drivers/tty/serial/mpc52xx_uart.c  | 2 +-
 drivers/tty/serial/mps2-uart.c | 1 -
 drivers/tty/serial/mxs-auart.c | 2 +-
 drivers/tty/serial/pic32_uart.c| 1 -
 drivers/tty/serial/qcom_geni_serial.c  | 1 -
 drivers/tty/serial/serial-tegra.c  | 1 -
 drivers/tty/serial/sh-sci.c| 1 -
 drivers/tty/serial/sunhv.c | 4 ++--
 drivers/tty/serial/sunsab.c| 3 ++-
 drivers/tty/serial/sunsu.c | 4 ++--
 drivers/tty/serial/sunzilog.c  | 4 ++--
 drivers/tty/serial/tegra-tcu.c | 1 -
 drivers/tty/serial/uartlite.c  | 3 ---
 drivers/tty/serial/ucc_uart.c  | 3 ++-
 drivers/tty/serial/vt8500_serial.c | 2 +-
 28 files changed, 21 insertions(+), 36 deletions(-)

diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index 794c7b18aa06..992e199e0ea8 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -14,7 +14,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 
diff --git a/drivers/tty/serial/8250/8250_early.c 
b/drivers/tty/serial/8250/8250_early.c
index 4299a8bd83d9..9837a27739fd 100644
--- a/drivers/tty/serial/8250/8250_early.c
+++ b/drivers/tty/serial/8250/8250_early.c
@@ -27,7 +27,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/8250/8250_ingenic.c 
b/drivers/tty/serial/8250/8250_ingenic.c
index 617b8ce60d6b..4c4c4da73ad0 100644
--- a/drivers/tty/serial/8250/8250_ingenic.c
+++ b/drivers/tty/serial/8250/8250_ingenic.c
@@ -13,7 +13,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/8250/8250_omap.c 
b/drivers/tty/serial/8250/8250_omap.c
index d48a82f1634e..26dd089d8e82 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index c5c3f4674459..a1e594b79890 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -36,7 +37,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/apbuart.c b/drivers/tty/serial/apbuart.c
index 915ee4b0d594..f3defc6da3df 100644
--- a/drivers/tty/serial/apbuart.c
+++ b/drivers/tty/serial/apbuart.c
@@ -22,9 +22,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/atmel_serial.c 
b/drivers/tty/serial/atmel_serial.c
index 3467a875641a..7ac477344aa3 100644
--- a/drivers/tty/serial/atmel_serial.c
+++ b/drivers/tty/serial/atmel_serial.c
@@ -21,7 +21,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/fsl_linflexuart.c 
b/drivers/tty/serial/fsl_linflexuart.c
index 6fc21b6684e6..f697751c2ad5 100644
--- a/drivers/tty/serial/fsl_linflexuart.c
+++ b/drivers/tty/serial/fsl_linflexuart.c
@@ -11,7 +11,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index 4d80fae20177..e1a8d5415718 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -18,9 +18,9 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 7341d060f85c..3ed5083a7108 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -25,7 +25,6 @@
 #include 
 

[PATCH v6 02/13] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg

2023-07-24 Thread Aneesh Kumar K.V
We will use this in a later patch to do tlb flush when clearing pud entries
on powerpc. This is similar to commit 93a98695f2f9 ("mm: change
pmdp_huge_get_and_clear_full take vm_area_struct as arg")

Reviewed-by: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 4 ++--
 mm/debug_vm_pgtable.c   | 2 +-
 mm/huge_memory.c| 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5eb6bdf30c62..124427ece520 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -456,11 +456,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct 
vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
-static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
+static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long address, pud_t *pudp,
int full)
 {
-   return pudp_huge_get_and_clear(mm, address, pudp);
+   return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index ee119e33fef1..ee2c4c1dcfc8 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -385,7 +385,7 @@ static void __init pud_advanced_tests(struct 
pgtable_debug_args *args)
WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-   pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1);
+   pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1);
pud = READ_ONCE(*args->pudp);
WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e0420de0e2e0..e371503f7746 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1981,7 +1981,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
if (!ptl)
return 0;
 
-   pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+   pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
if (vma_is_special_huge(vma)) {
spin_unlock(ptl);
-- 
2.41.0



[PATCH v6 06/13] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE

2023-07-24 Thread Aneesh Kumar K.V
The pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect
and move_huge_pmd helpers, use the architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set.

Reviewed-by: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 mm/mremap.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0af8bc4ce258..f34e0f2cb4d8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -564,6 +564,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
  unsigned long address, pud_t *pudp)
 {
@@ -577,6 +578,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 {
BUILD_BUG();
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #endif
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 11e06e4ab33b..056478c106ee 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -349,7 +349,7 @@ static inline bool move_normal_pud(struct vm_area_struct 
*vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pud_t *old_pud, pud_t 
*new_pud)
 {
-- 
2.41.0



[PATCH v6 12/13] powerpc/book3s64/radix: Remove mmu_vmemmap_psize

2023-07-24 Thread Aneesh Kumar K.V
This is not used by radix anymore.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 11 ---
 arch/powerpc/mm/init_64.c| 21 ++---
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index e5356ac37e99..25b46058f556 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -601,17 +601,6 @@ void __init radix__early_init_mmu(void)
 #else
mmu_virtual_psize = MMU_PAGE_4K;
 #endif
-
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-   /* vmemmap mapping */
-   if (mmu_psize_defs[MMU_PAGE_2M].shift) {
-   /*
-* map vmemmap using 2M if available
-*/
-   mmu_vmemmap_psize = MMU_PAGE_2M;
-   } else
-   mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
 #endif
/*
 * initialize page table size
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 5701faca39ef..6db7a063ba63 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, 
unsigned long start,
return false;
 }
 
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
-   struct vmem_altmap *altmap)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int 
node,
+struct vmem_altmap *altmap)
 {
bool altmap_alloc;
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (radix_enabled())
-   return radix__vmemmap_populate(start, end, node, altmap);
-#endif
-
/* Align to the page size of the linear mapping. */
start = ALIGN_DOWN(start, page_size);
 
@@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
return 0;
 }
 
+int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
+  struct vmem_altmap *altmap)
+{
+
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (radix_enabled())
+   return radix__vmemmap_populate(start, end, node, altmap);
+#endif
+
+   return __vmemmap_populate(start, end, node, altmap);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long vmemmap_list_free(unsigned long start)
 {
-- 
2.41.0



[PATCH v6 13/13] powerpc/book3s64/radix: Add debug message to give more details of vmemmap allocation

2023-07-24 Thread Aneesh Kumar K.V
Add some extra vmemmap pr_debug message that will indicate the type of
vmemmap allocations.

For ex: with DAX vmemmap optimization we can find the below details:
[  187.166580] radix-mmu: PAGE_SIZE vmemmap mapping
[  187.166587] radix-mmu: PAGE_SIZE vmemmap mapping
[  187.166591] radix-mmu: Tail page reuse vmemmap mapping
[  187.166594] radix-mmu: Tail page reuse vmemmap mapping
[  187.166598] radix-mmu: Tail page reuse vmemmap mapping
[  187.166601] radix-mmu: Tail page reuse vmemmap mapping
[  187.166604] radix-mmu: Tail page reuse vmemmap mapping
[  187.166608] radix-mmu: Tail page reuse vmemmap mapping
[  187.166611] radix-mmu: Tail page reuse vmemmap mapping
[  187.166614] radix-mmu: Tail page reuse vmemmap mapping
[  187.166617] radix-mmu: Tail page reuse vmemmap mapping
[  187.166620] radix-mmu: Tail page reuse vmemmap mapping
[  187.166623] radix-mmu: Tail page reuse vmemmap mapping
[  187.166626] radix-mmu: Tail page reuse vmemmap mapping
[  187.166629] radix-mmu: Tail page reuse vmemmap mapping
[  187.166632] radix-mmu: Tail page reuse vmemmap mapping

And without vmemmap optimization
[  293.549931] radix-mmu: PMD_SIZE vmemmap mapping
[  293.549984] radix-mmu: PMD_SIZE vmemmap mapping
[  293.550032] radix-mmu: PMD_SIZE vmemmap mapping
[  293.550076] radix-mmu: PMD_SIZE vmemmap mapping
[  293.550117] radix-mmu: PMD_SIZE vmemmap mapping

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 25b46058f556..59aaa30a7c0d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1033,6 +1033,7 @@ static pte_t * __meminit 
radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long
p = vmemmap_alloc_block_buf(PAGE_SIZE, node, 
NULL);
if (!p)
return NULL;
+   pr_debug("PAGE_SIZE vmemmap mapping\n");
} else {
/*
 * When a PTE/PMD entry is freed from the init_mm
@@ -1045,6 +1046,7 @@ static pte_t * __meminit 
radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long
 */
get_page(reuse);
p = page_to_virt(reuse);
+   pr_debug("Tail page reuse vmemmap mapping\n");
}
 
VM_BUG_ON(!PAGE_ALIGNED(addr));
@@ -1154,6 +1156,7 @@ int __meminit radix__vmemmap_populate(unsigned long 
start, unsigned long end, in
p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
if (p) {
vmemmap_set_pmd(pmd, p, node, addr, next);
+   pr_debug("PMD_SIZE vmemmap mapping\n");
continue;
} else if (altmap) {
/*
-- 
2.41.0



[PATCH v6 11/13] powerpc/book3s64/radix: Add support for vmemmap optimization for radix

2023-07-24 Thread Aneesh Kumar K.V
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap
page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence
with 64K page size, we don't use vmemmap deduplication for PMD-level
mapping.
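
To make the arithmetic concrete (assuming sizeof(struct page) == 64, as on
64-bit kernels):

    64K / 64  = 1024  struct pages per 64K vmemmap page
    2M  / 64K = 32    struct pages per PMD mapping -> fits in a single
                      vmemmap page, so there are no tail pages to reuse
    1G  / 64K = 16384 struct pages per PUD mapping -> 16384 / 1024 = 16
                      vmemmap pages, so tail-page reuse applies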

Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/mm/vmemmap_dedup.rst |   1 +
 Documentation/powerpc/index.rst|   1 +
 Documentation/powerpc/vmemmap_dedup.rst| 101 ++
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/book3s/64/radix.h |   9 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 203 +
 6 files changed, 316 insertions(+)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

diff --git a/Documentation/mm/vmemmap_dedup.rst 
b/Documentation/mm/vmemmap_dedup.rst
index a4b12ff906c4..c573e08b5043 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -210,6 +210,7 @@ the device (altmap).
 
 The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
 PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst
 
 The differences with HugeTLB are relatively minor.
 
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index d33b554ca7ba..a50834798454 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -36,6 +36,7 @@ powerpc
 ultravisor
 vas-api
 vcpudispatch_stats
+vmemmap_dedup
 
 features
 
diff --git a/Documentation/powerpc/vmemmap_dedup.rst 
b/Documentation/powerpc/vmemmap_dedup.rst
new file mode 100644
index ..dc4db59fdf87
--- /dev/null
+++ b/Documentation/powerpc/vmemmap_dedup.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Device DAX
+==
+
+The device-dax interface uses the tail deduplication technique explained in
+Documentation/mm/vmemmap_dedup.rst
+
+On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
+with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
+deduplication.
+
+With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
+page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
+vmemmap deduplication possible.
+
+With 1G PUD level mapping, we require 16384 struct pages and a single 64K
+vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
+require 16 64K pages in vmemmap to map the struct page for 1G PUD level 
mapping.
+
+Here's how things look on device-dax after the sections are populated::
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | +---+ | | | |
+ |   | | 4 | + | | |
+ |PUD| +---+   | | |
+ |   level   | | . | --+ | |
+ |  mapping  | +---+ | |
+ |   | | . | + |
+ |   | +---+   |
+ |   | | 15| --+
+ |   | +---+
+ |   |
+ |   |
+ |   |
+ +---+
+
+
+With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
+4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we
+require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.
+
+Here's how things look on device-dax after the sections are populated::
+
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | 

[PATCH v6 10/13] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-07-24 Thread Aneesh Kumar K.V
This is in preparation to update radix to implement vmemmap optimization
for devdax. Below are the rules w.r.t. radix vmemmap mapping:

1. First try to map things using PMD (2M)
2. With altmap if altmap cross-boundary check returns true, fall back to
   PAGE_SIZE
3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to
   PAGE_SIZE

On removing vmemmap mapping, check if every subsection that is using the
vmemmap area is invalid. If found to be invalid, that implies we can safely
free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86
because with 64K page size, we need to do the above check even at the
PAGE_SIZE granularity.
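
A simplified sketch of that mapping policy (illustrative only;
map_with_base_pages() is a made-up stand-in and the pmd lookup is elided,
the real logic lives in radix__vmemmap_populate()):

    unsigned long addr, next;
    void *p;

    for (addr = start; addr < end; addr = next) {
            next = pmd_addr_end(addr, end);

            /* rule 2: an altmap crossing the 2M boundary forces base pages */
            if (altmap && altmap_cross_boundary(altmap, addr, PMD_SIZE)) {
                    map_with_base_pages(addr, next, node, altmap);
                    continue;
            }

            /* rule 1: try to back the range with a 2M PMD mapping */
            p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
            if (p)
                    vmemmap_set_pmd(pmd, p, node, addr, next);
            else
                    /* rule 3: no PMD_SIZE backing memory, fall back */
                    map_with_base_pages(addr, next, node, altmap);
    }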

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/radix.h |   2 +
 arch/powerpc/include/asm/pgtable.h |   6 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 325 +++--
 arch/powerpc/mm/init_64.c  |  26 +-
 4 files changed, 328 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 2ef92f36340f..f1461289643a 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -331,6 +331,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned 
long start,
 unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
  int node, struct vmem_altmap *altmap);
+void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
+  struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
unsigned long page_size);
 
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 445a22987aa3..a4893b17705a 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -157,6 +157,12 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
return (pgtable_t)pmd_page_vaddr(pmd);
 }
 
+#ifdef CONFIG_PPC64
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size);
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+  unsigned long page_size);
+#endif /* CONFIG_PPC64 */
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 227fea53c217..53f8340e390c 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -744,8 +744,58 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
p4d_clear(p4d);
 }
 
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long 
end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+   return !vmemmap_populated(start, PMD_SIZE);
+}
+
+static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long 
end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+
+   return !vmemmap_populated(start, PAGE_SIZE);
+
+}
+#endif
+
+static void __meminit free_vmemmap_pages(struct page *page,
+struct vmem_altmap *altmap,
+int order)
+{
+   unsigned int nr_pages = 1 << order;
+
+   if (altmap) {
+   unsigned long alt_start, alt_end;
+   unsigned long base_pfn = page_to_pfn(page);
+
+   /*
+* with 2M vmemmap mapping we can have things set up
+* such that even though altmap is specified we never
+* use altmap.
+*/
+   alt_start = altmap->base_pfn;
+   alt_end = altmap->base_pfn + altmap->reserve + altmap->free;
+
+   if (base_pfn >= alt_start && base_pfn < alt_end) {
+   vmem_altmap_free(altmap, nr_pages);
+   return;
+   }
+   }
+
+   if (PageReserved(page)) {
+   /* allocated from memblock */
+   while (nr_pages--)
+   free_reserved_page(page++);
+   } else
+   free_pages((unsigned long)page_address(page), order);
+}
+
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end, bool direct)
+unsigned long end, bool direct,
+struct vmem_altmap *altmap)
 {
unsigned long next, pages = 0;
pte_t *pte;
@@ -759,24 +809,26 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
if (!pte_present(*pte))
continue;
 
-   if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) {
-   /*
-* The 

[PATCH v6 09/13] powerpc/book3s64/mm: Enable transparent pud hugepage

2023-07-24 Thread Aneesh Kumar K.V
This is enabled only with radix translation and 1G hugepage size. This will
be used with devdax device memory with a namespace alignment of 1G.

Anon transparent hugepage is not supported even though we do have helpers
checking pud_trans_huge(); we should never find it returning true. The only
expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.

Some of the helpers are never expected to get called on hash translation
and hence are marked to call BUG() in such a case.
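
A sketch of that convention (hypothetical helper name; the real stubs live
in the hash headers):

    static inline unsigned long hash__pud_hugepage_update(struct mm_struct *mm,
                    unsigned long addr, pud_t *pudp,
                    unsigned long clr, unsigned long set)
    {
            BUG();
            return pud_val(*pudp);
    }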

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash.h |   9 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 155 --
 arch/powerpc/include/asm/book3s/64/radix.h|  36 
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  28 
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 include/trace/events/thp.h|  10 ++
 10 files changed, 323 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 17e7a778c856..efce6ef3e2a9 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -133,7 +133,16 @@ static inline int get_region_id(unsigned long ea)
 }
 
 #definehash__pmd_bad(pmd)  (pmd_val(pmd) & H_PMD_BAD_BITS)
+
+/*
+ * pud comparison that will work with both pte and page table pointer.
+ */
+static inline int hash__pud_same(pud_t pud_a, pud_t pud_b)
+{
+   return (((pud_raw(pud_a) ^ pud_raw(pud_b)) & ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0);
+}
 #definehash__pud_bad(pud)  (pud_val(pud) & H_PUD_BAD_BITS)
+
 static inline int hash__p4d_bad(p4d_t p4d)
 {
return (p4d_val(p4d) == 0);
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..a8204566cfd0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte)
 {
return __pud_raw(pte_raw(pte));
 }
+
+static inline pte_t *pudp_ptep(pud_t *pud)
+{
+   return (pte_t *)pud;
+}
+
+#define pud_pfn(pud)   pte_pfn(pud_pte(pud))
+#define pud_dirty(pud) pte_dirty(pud_pte(pud))
+#define pud_young(pud) pte_young(pud_pte(pud))
+#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkdirty(pud)   pte_pud(pte_mkdirty(pud_pte(pud)))
+#define pud_mkclean(pud)   pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkyoung(pud)   pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite(pud)   pte_pud(pte_mkwrite(pud_pte(pud)))
 #define pud_write(pud) pte_write(pud_pte(pud))
 
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#define pud_soft_dirty(pud)	pte_soft_dirty(pud_pte(pud))
+#define pud_mksoft_dirty(pud)  pte_pud(pte_mksoft_dirty(pud_pte(pud)))
+#define pud_clear_soft_dirty(pud) pte_pud(pte_clear_soft_dirty(pud_pte(pud)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 static inline int pud_bad(pud_t pud)
 {
if (radix_enabled())
@@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool 
write)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
   pmd_t *pmdp, pmd_t pmd);
+extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
+  pud_t *pudp, pud_t pud);
+
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd)
 {
 }
 
+static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
+   unsigned long addr, pud_t *pud)
+{
+}
+
 extern int hash__has_transparent_hugepage(void);
 static inline int has_transparent_hugepage(void)
 {
@@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void)
 }
 #define has_transparent_hugepage has_transparent_hugepage
 
+static inline int has_transparent_pud_hugepage(void)
+{
+   if (radix_enabled())
+   return radix__has_transparent_pud_hugepage();
+   return 0;
+}
+#define has_transparent_pud_hugepage has_transparent_pud_hugepage
+
 static inline unsigned long
 pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
unsigned long clr, unsigned long set)
@@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long 
addr, pmd_t *pmdp,
 

[PATCH v6 08/13] powerpc/mm/trace: Convert trace event to trace event class

2023-07-24 Thread Aneesh Kumar K.V
A follow-up patch will add a pud variant for this same event.
Using an event class makes that addition simpler.

No functional change in this patch.
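
For illustration, with the class in place the pud variant the next patch
adds becomes a short DEFINE_EVENT (a sketch, not part of this patch):

    DEFINE_EVENT(hugepage_update, hugepage_update_pud,
            TP_PROTO(unsigned long addr, unsigned long pud,
                     unsigned long clr, unsigned long set),
            TP_ARGS(addr, pud, clr, set)
    );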

Reviewed-by: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/hash_pgtable.c  |  2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |  2 +-
 include/trace/events/thp.h   | 23 ---
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c 
b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 51f48984abca..988948d69bc1 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -214,7 +214,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct 
*mm, unsigned long addr
 
old = be64_to_cpu(old_be);
 
-   trace_hugepage_update(addr, old, clr, set);
+   trace_hugepage_update_pmd(addr, old, clr, set);
if (old & H_PAGE_HASHPTE)
hpte_do_hugepage_flush(mm, addr, pmdp, old);
return old;
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index e7ea492ac510..02e185d2e4d6 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -962,7 +962,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct 
*mm, unsigned long add
 #endif
 
old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1);
-   trace_hugepage_update(addr, old, clr, set);
+   trace_hugepage_update_pmd(addr, old, clr, set);
 
return old;
 }
diff --git a/include/trace/events/thp.h b/include/trace/events/thp.h
index 202b3e3e67ff..a95c78b10561 100644
--- a/include/trace/events/thp.h
+++ b/include/trace/events/thp.h
@@ -8,25 +8,29 @@
 #include 
 #include 
 
-TRACE_EVENT(hugepage_set_pmd,
+DECLARE_EVENT_CLASS(hugepage_set,
 
-   TP_PROTO(unsigned long addr, unsigned long pmd),
-   TP_ARGS(addr, pmd),
+   TP_PROTO(unsigned long addr, unsigned long pte),
+   TP_ARGS(addr, pte),
TP_STRUCT__entry(
__field(unsigned long, addr)
-   __field(unsigned long, pmd)
+   __field(unsigned long, pte)
),
 
TP_fast_assign(
__entry->addr = addr;
-   __entry->pmd = pmd;
+   __entry->pte = pte;
),
 
-   TP_printk("Set pmd with 0x%lx with 0x%lx", __entry->addr, 
__entry->pmd)
+   TP_printk("Set page table entry with 0x%lx with 0x%lx", 
__entry->addr, __entry->pte)
 );
 
+DEFINE_EVENT(hugepage_set, hugepage_set_pmd,
+   TP_PROTO(unsigned long addr, unsigned long pmd),
+   TP_ARGS(addr, pmd)
+);
 
-TRACE_EVENT(hugepage_update,
+DECLARE_EVENT_CLASS(hugepage_update,
 
TP_PROTO(unsigned long addr, unsigned long pte, unsigned long clr, 
unsigned long set),
TP_ARGS(addr, pte, clr, set),
@@ -48,6 +52,11 @@ TRACE_EVENT(hugepage_update,
TP_printk("hugepage update at addr 0x%lx and pte = 0x%lx clr = 
0x%lx, set = 0x%lx", __entry->addr, __entry->pte, __entry->clr, __entry->set)
 );
 
+DEFINE_EVENT(hugepage_update, hugepage_update_pmd,
+   TP_PROTO(unsigned long addr, unsigned long pmd, unsigned long clr, 
unsigned long set),
+   TP_ARGS(addr, pmd, clr, set)
+);
+
 DECLARE_EVENT_CLASS(migration_pmd,
 
TP_PROTO(unsigned long addr, unsigned long pmd),
-- 
2.41.0



[PATCH v6 07/13] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization

2023-07-24 Thread Aneesh Kumar K.V
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to
read-only) and the output address (pfn) of the vmemmap ptes. That is not
supported by some architectures without unmapping the pte (marking it
invalid).

With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb
vmemmap optimization disabled. Hence split DAX optimization support into a
different config.

s390, loongarch and riscv don't have devdax support, so the DAX config is
not enabled for them. With this change, arm64 should be able to select the
DAX optimization.

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable 
HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Signed-off-by: Aneesh Kumar K.V 
---
 arch/loongarch/Kconfig | 2 +-
 arch/riscv/Kconfig | 2 +-
 arch/s390/Kconfig  | 2 +-
 arch/x86/Kconfig   | 3 ++-
 fs/Kconfig | 2 +-
 include/linux/mm.h | 2 +-
 mm/Kconfig | 5 -
 7 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index e55511af4c77..537ca2a4005a 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -59,7 +59,7 @@ config LOONGARCH
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_NO_INSTR
select BUILDTIME_TABLE_SORT
select COMMON_CLK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 4c07b9189c86..6943d34c1ec1 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -53,7 +53,7 @@ config RISCV
select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE
select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select BUILDTIME_TABLE_SORT if MMU
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 290b6f93b816..8ff6d1c21e38 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -127,7 +127,7 @@ config S390
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_IPC_PARSE_VERSION
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select BUILDTIME_TABLE_SORT
select CLONE_BACKWARDS2
select DMA_OPS if PCI
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db409770..78224aa76409 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -128,7 +128,8 @@ config X86
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP   if X86_64
select ARCH_WANTS_THP_SWAP  if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
diff --git a/fs/Kconfig b/fs/Kconfig
index 19975b104bc3..f3be721bab6d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -254,7 +254,7 @@ config HUGETLB_PAGE
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
-   depends on ARCH_WANT_OPTIMIZE_VMEMMAP
+   depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fe6741539740..f8899bda941a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3650,7 +3650,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 #endif
 
 #define VMEMMAP_RESERVE_NR 2
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
  struct dev_pagemap *pgmap)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index da681dda8af1..5fe49c030961 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -486,7 +486,10 @@ config SPARSEMEM_VMEMMAP
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
 #
-config ARCH_WANT_OPTIMIZE_VMEMMAP
+config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
+   bool
+
+config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
bool
 
 config HAVE_MEMBLOCK_PHYS_MAP
-- 
2.41.0
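
As a sketch of what this split enables (a hypothetical hunk, not part of
this patch): an architecture such as arm64 could now select only the DAX
side while leaving hugetlb vmemmap optimization disabled:

# arch/arm64/Kconfig (hypothetical)
config ARM64
	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP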



[PATCH v6 05/13] mm: Add pud_same similar to __HAVE_ARCH_P4D_SAME

2023-07-24 Thread Aneesh Kumar K.V
This helps architectures to override pmd_same and pud_same independently.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 124427ece520..0af8bc4ce258 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -699,11 +699,14 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
 }
+#endif
 
+#ifndef pud_same
 static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
 }
+#define pud_same pud_same
 #endif
 
 #ifndef __HAVE_ARCH_P4D_SAME
-- 
2.41.0
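
A minimal sketch of an override enabled by the new guard (the helper body
and _PAGE_SW_BITS are illustrative, not from the patch). In an
architecture's asm/pgtable.h:

static inline int pud_same(pud_t pud_a, pud_t pud_b)
{
	/* e.g. ignore software-managed bits when comparing entries */
	return (pud_val(pud_a) & ~_PAGE_SW_BITS) ==
	       (pud_val(pud_b) & ~_PAGE_SW_BITS);
}
#define pud_same pud_same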



[PATCH v6 04/13] mm/vmemmap: Allow architectures to override how vmemmap optimization works

2023-07-24 Thread Aneesh Kumar K.V
Architectures like powerpc would like to use different page table allocators
and mapping mechanisms to implement vmemmap optimization. Similar to
vmemmap_populate, allow architectures to implement
vmemmap_populate_compound_pages.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/sparse-vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a044a130405b..a2cbe44c48e1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -358,6 +358,7 @@ int __meminit vmemmap_populate_hugepages(unsigned long 
start, unsigned long end,
return 0;
 }
 
+#ifndef vmemmap_populate_compound_pages
 /*
  * For compound pages bigger than section size (e.g. x86 1G compound
  * pages with 2M subsection size) fill the rest of sections as tail
@@ -446,6 +447,8 @@ static int __meminit 
vmemmap_populate_compound_pages(unsigned long start_pfn,
return 0;
 }
 
+#endif
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
-- 
2.41.0
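
A sketch of how an architecture is expected to use the new guard
(declaration placement is illustrative; powerpc does something along these
lines later in the series). In a header visible to mm/sparse-vmemmap.c:

int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
					      unsigned long start,
					      unsigned long end, int node,
					      struct dev_pagemap *pgmap);
#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages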



[PATCH v6 03/13] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override

2023-07-24 Thread Aneesh Kumar K.V
dax vmemmap optimization requires a minimum of two PAGE_SIZE areas within
the vmemmap such that the tail page mapping can point to the second
PAGE_SIZE area. Enforce that in the vmemmap_can_optimize() function.

Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation). Hence allow architecture
override.

Reviewed-by: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/mm.h | 27 +++
 mm/mm_init.c   |  2 +-
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5d68baea231..fe6741539740 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3649,13 +3649,32 @@ void vmemmap_free(unsigned long start, unsigned long 
end,
struct vmem_altmap *altmap);
 #endif
 
+#define VMEMMAP_RESERVE_NR 2
 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
-  struct dev_pagemap *pgmap)
+static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
 {
-   return is_power_of_2(sizeof(struct page)) &&
-   pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
+   unsigned long nr_pages;
+   unsigned long nr_vmemmap_pages;
+
+   if (!pgmap || !is_power_of_2(sizeof(struct page)))
+   return false;
+
+   nr_pages = pgmap_vmemmap_nr(pgmap);
+   nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT);
+   /*
+* For vmemmap optimization with DAX we need minimum 2 vmemmap
+* pages. See layout diagram in Documentation/mm/vmemmap_dedup.rst
+*/
+   return !altmap && (nr_vmemmap_pages > VMEMMAP_RESERVE_NR);
 }
+/*
+ * If we don't have an architecture override, use the generic rule
+ */
+#ifndef vmemmap_can_optimize
+#define vmemmap_can_optimize __vmemmap_can_optimize
+#endif
+
 #else
 static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
   struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index acb0ac194672..641c56fd08a2 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct 
vmem_altmap *altmap,
if (!vmemmap_can_optimize(altmap, pgmap))
return pgmap_vmemmap_nr(pgmap);
 
-   return 2 * (PAGE_SIZE / sizeof(struct page));
+   return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
 }
 
 static void __ref memmap_init_compound(struct page *head,
-- 
2.41.0
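
A worked example of the new check (assumed values: PAGE_SIZE = 64K,
sizeof(struct page) = 64) for a 1G device-dax compound page:

	nr_pages         = 1G / 64K           = 16384
	nr_vmemmap_pages = (16384 * 64) >> 16 = 16

16 > VMEMMAP_RESERVE_NR (2), so with no altmap in use
__vmemmap_can_optimize() returns true.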



[PATCH v6 01/13] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support

2023-07-24 Thread Aneesh Kumar K.V
Architectures like powerpc would like to enable transparent huge page pud
support only with radix translation. To support that, add a
has_transparent_pud_hugepage() helper that architectures can override.

Reviewed-by: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 drivers/nvdimm/pfn_devs.c | 2 +-
 include/linux/pgtable.h   | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index af7d9301520c..18ad315581ca 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned 
long *alignments)
 
if (has_transparent_hugepage()) {
alignments[1] = HPAGE_PMD_SIZE;
-   if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+   if (has_transparent_pud_hugepage())
alignments[2] = HPAGE_PUD_SIZE;
}
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5f36c055794b..5eb6bdf30c62 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1505,6 +1505,9 @@ typedef unsigned int pgtbl_mod_mask;
 #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
 #endif
 
+#ifndef has_transparent_pud_hugepage
+#define has_transparent_pud_hugepage() 
IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#endif
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.41.0
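
A minimal sketch of an arch override (hypothetical placement; a later patch
in this series makes powerpc gate this on radix translation):

/* arch header, e.g. asm/pgtable.h */
#define has_transparent_pud_hugepage() radix_enabled()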



[PATCH v6 00/13] Add support for DAX vmemmap optimization for ppc64

2023-07-24 Thread Aneesh Kumar K.V
This patch series implements changes required to support DAX vmemmap
optimization for ppc64. The vmemmap optimization is only enabled with radix MMU
translation and 1GB PUD mapping with 64K page size. The patch series also splits
hugetlb vmemmap optimization out into a separate Kconfig variable so that
architectures can enable DAX vmemmap optimization without enabling hugetlb
vmemmap optimization. This lets architectures like arm64 enable DAX vmemmap
optimization even though they can't enable hugetlb vmemmap optimization. More
details are in the patch "mm/vmemmap optimization: Split hugetlb and
devdax vmemmap optimization".

Changes from v5:
* rebase to mm-unstable branch

Changes from v4:
* Address review feedback
* Add the Reviewed-by:

Changes from v3:
* Rebase to latest linus tree
* Build fix with SPARSEMEM_VMEMMAP disabled
* Add hash_pud_same outside THP Kconfig

Changes from v2:
* Rebase to latest linus tree
* Address review feedback

Changes from V1:
* Fix make htmldocs warning
* Fix vmemmap allocation bugs with different alignment values.
* Correctly check for section validity before we free the vmemmap area



Aneesh Kumar K.V (13):
  mm/hugepage pud: Allow arch-specific helper function to check huge
page pud support
  mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
  mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to
override
  mm/vmemmap: Allow architectures to override how vmemmap optimization
works
  mm: Add pud_same similar to __HAVE_ARCH_P4D_SAME
  mm/huge pud: Use transparent huge pud helpers only with
CONFIG_TRANSPARENT_HUGEPAGE
  mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
  powerpc/mm/trace: Convert trace event to trace event class
  powerpc/book3s64/mm: Enable transparent pud hugepage
  powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap
handling function
  powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  powerpc/book3s64/radix: Remove mmu_vmemmap_psize
  powerpc/book3s64/radix: Add debug message to give more details of
vmemmap allocation

 Documentation/mm/vmemmap_dedup.rst|   1 +
 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/vmemmap_dedup.rst   | 101 
 arch/loongarch/Kconfig|   2 +-
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/book3s/64/hash.h |   9 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 155 -
 arch/powerpc/include/asm/book3s/64/radix.h|  47 ++
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/pgtable.h|   6 +
 arch/powerpc/mm/book3s64/hash_pgtable.c   |   2 +-
 arch/powerpc/mm/book3s64/pgtable.c|  78 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 572 --
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/mm/init_64.c |  37 +-
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 arch/riscv/Kconfig|   2 +-
 arch/s390/Kconfig |   2 +-
 arch/x86/Kconfig  |   3 +-
 drivers/nvdimm/pfn_devs.c |   2 +-
 fs/Kconfig|   2 +-
 include/linux/mm.h|  29 +-
 include/linux/pgtable.h   |  12 +-
 include/trace/events/thp.h|  33 +-
 mm/Kconfig|   5 +-
 mm/debug_vm_pgtable.c |   2 +-
 mm/huge_memory.c  |   2 +-
 mm/mm_init.c  |   2 +-
 mm/mremap.c   |   2 +-
 mm/sparse-vmemmap.c   |   3 +
 31 files changed, 1049 insertions(+), 82 deletions(-)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

-- 
2.41.0



Re: [PATCH v5 10/13] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-07-24 Thread Andrew Morton
On Mon, 24 Jul 2023 23:59:27 +0530 "Aneesh Kumar K.V" 
 wrote:

> Please take this diff on top of this patch when adding this series to
> -mm .

"[10/13] powerpc/book3s64/vmemmap: Switch radix to use a different
vmemmap handling function" has a conflict with "mm: move
is_ioremap_addr() into new header file" (in mm-unstable)
(https://lkml.kernel.org/r/20230706154520.11257-17-...@redhat.com), so
please can we have a v6 of this series, which includes this fixup?


Re: [PATCH v5 10/13] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-07-24 Thread Aneesh Kumar K.V
"Aneesh Kumar K.V"  writes:

> This is in preparation to update radix to implement vmemmap optimization
> for devdax. Below are the rules w.r.t radix vmemmap mapping
>
> 1. First try to map things using PMD (2M)
> 2. With altmap if altmap cross-boundary check returns true, fall back to
>PAGE_SIZE
> 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to
>PAGE_SIZE
>
> On removing vmemmap mapping, check if every subsection that is using the
> vmemmap area is invalid. If found to be invalid, that implies we can safely
> free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86
> because with 64K page size, we need to do the above check even at the
> PAGE_SIZE granularity.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h |   2 +
>  arch/powerpc/include/asm/pgtable.h |   4 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 326 +++--
>  arch/powerpc/mm/init_64.c  |  26 +-
>  4 files changed, 327 insertions(+), 31 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
> b/arch/powerpc/include/asm/book3s/64/radix.h
> index 2ef92f36340f..f1461289643a 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -331,6 +331,8 @@ extern int __meminit 
> radix__vmemmap_create_mapping(unsigned long start,
>unsigned long phys);
>  int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
> int node, struct vmem_altmap *altmap);
> +void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
> +struct vmem_altmap *altmap);
>  extern void radix__vmemmap_remove_mapping(unsigned long start,
>   unsigned long page_size);
>  
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 6a88bfdaa69b..68817ea7f994 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -165,6 +165,10 @@ static inline bool is_ioremap_addr(const void *x)
>  
>   return addr >= IOREMAP_BASE && addr < IOREMAP_END;
>  }
> +
> +int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
> vmemmap_map_size);
> +bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
> +unsigned long page_size);
>  #endif /* CONFIG_PPC64 */
>  
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 227fea53c217..9a7f3707b6fb 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -744,8 +744,59 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
>   p4d_clear(p4d);
>  }
>  
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned 
> long end)
> +{
> + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> + return !vmemmap_populated(start, PMD_SIZE);
> +}
> +
> +static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned 
> long end)
> +{
> + unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
> +
> + return !vmemmap_populated(start, PAGE_SIZE);
> +
> +}
> +#endif
> +
> +static void __meminit free_vmemmap_pages(struct page *page,
> +  struct vmem_altmap *altmap,
> +  int order)
> +{
> + unsigned int nr_pages = 1 << order;
> +
> + if (altmap) {
> + unsigned long alt_start, alt_end;
> + unsigned long base_pfn = page_to_pfn(page);
> +
> + /*
> +	 * with 2M vmemmap mapping we can have things set up
> +	 * such that even though altmap is specified we never
> +	 * use the altmap.
> +  */
> + alt_start = altmap->base_pfn;
> + alt_end = altmap->base_pfn + altmap->reserve +
> + altmap->free + altmap->alloc + altmap->align;
> +
> + if (base_pfn >= alt_start && base_pfn < alt_end) {
> + vmem_altmap_free(altmap, nr_pages);
> + return;
> + }
> + }
> +

Please take this diff on top of this patch when adding this series to
-mm .

commit 613569d9517be60611a86bf4b9821b150c4c4954
Author: Aneesh Kumar K.V 
Date:   Mon Jul 24 22:49:29 2023 +0530

powerpc/mm/altmap: Fix altmap boundary check

altmap->free includes the entire free space from which altmap blocks
can be allocated. So when checking whether the kernel is doing altmap
block free, compute the boundary correctly.

Signed-off-by: Aneesh Kumar K.V 

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 7761c2e93bff..ed63c2953b54 100644
--- 

Re: [PATCH] powerpc/mm/altmap: Fix altmap boundary check

2023-07-24 Thread David Hildenbrand

On 24.07.23 20:13, Aneesh Kumar K.V wrote:

altmap->free includes the entire free space from which altmap blocks
can be allocated. So when checking whether the kernel is doing altmap
block free, compute the boundary correctly.

Cc: David Hildenbrand 
Cc: Dan Williams 
Fixes: 9ef34630a461 ("powerpc/mm: Fallback to RAM if the altmap is unusable")
Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/init_64.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index fe1b83020e0d..0ec5b45b1e86 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -314,8 +314,7 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
start = ALIGN_DOWN(start, page_size);
if (altmap) {
alt_start = altmap->base_pfn;
-   alt_end = altmap->base_pfn + altmap->reserve +
- altmap->free + altmap->alloc + altmap->align;
+   alt_end = altmap->base_pfn + altmap->reserve + altmap->free;



Right, align is treated like allocated and align+alloc cannot exceed free.

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v2 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-24 Thread Luis Chamberlain
Cc'ing fsdevel + xfs folks as this fixes a regression tests with
XFS with generic/176.

On Thu, Jul 20, 2023 at 10:52:59AM +1000, Alistair Popple wrote:
> 
> SeongJae Park  writes:
> 
> > Hi Alistair,
> >
> > On Wed, 19 Jul 2023 22:18:44 +1000 Alistair Popple  
> > wrote:
> >
> >> The invalidate_range() is going to become an architecture specific mmu
> >> notifier used to keep the TLB of secondary MMUs such as an IOMMU in
> >> sync with the CPU page tables. Currently it is called from separate
> >> code paths to the main CPU TLB invalidations. This can lead to a
> >> secondary TLB not getting invalidated when required and makes it hard
> >> to reason about when exactly the secondary TLB is invalidated.
> >> 
> >> To fix this move the notifier call to the architecture specific TLB
> >> maintenance functions for architectures that have secondary MMUs
> >> requiring explicit software invalidations.
> >> 
> >> This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
> >> require a TLB invalidation. This invalidation is done by the
> >> architecture specific ptep_set_access_flags() which calls
> >> flush_tlb_page() if required. However this doesn't call the notifier
> >> resulting in infinite faults being generated by devices using the SMMU
> >> if it has previously cached a read-only PTE in its TLB.
> >> 
> >> Moving the invalidations into the TLB invalidation functions ensures
> >> all invalidations happen at the same time as the CPU invalidation. The
> >> architecture specific flush_tlb_all() routines do not call the
> >> notifier as none of the IOMMUs require this.
> >> 
> >> Signed-off-by: Alistair Popple 
> >> Suggested-by: Jason Gunthorpe 
> >
> > I found below kernel NULL-dereference issue on latest mm-unstable tree, and
> > bisect points me to the commit of this patch, namely
> > 75c400f82d347af1307010a3e06f3aa5d831d995.
> >
> > To reproduce, I use 'stress-ng --bigheap $(nproc)'.  The issue happens as 
> > soon
> > as it starts reclaiming memory.  I didn't dive deep into this yet, but
> > reporting this issue first, since you might have an idea already.
> 
> Thanks for the report SJ!
> 
> I see the problem - current->mm can (obviously!) be NULL which is what's
> leading to the NULL dereference. Instead I think on x86 I need to call
> the notifier when adding the invalidate to the tlbbatch in
> arch_tlbbatch_add_pending() which is equivalent to what ARM64 does.
> 
> The below should fix it. Will do a respin with this.
> 
> ---
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 837e4a50281a..79c46da919b9 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -4,6 +4,7 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/sched.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/processor.h>
>  #include <asm/cpufeature.h>
> @@ -282,6 +283,7 @@ static inline void arch_tlbbatch_add_pending(struct 
> arch_tlbflush_unmap_batch *b
>  {
>   inc_mm_tlb_gen(mm);
>   cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> + mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>  }
>  
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 0b990fb56b66..2d253919b3e8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1265,7 +1265,6 @@ void arch_tlbbatch_flush(struct 
> arch_tlbflush_unmap_batch *batch)
>  
>   put_flush_tlb_info();
>   put_cpu();
> - mmu_notifier_arch_invalidate_secondary_tlbs(current->mm, 0, -1UL);
>  }
>  
>  /*

This patch also fixes a regression introduced on linux-next: the same
crash on arch_tlbbatch_flush() is reproducible with fstests generic/176
on XFS, and this patch fixes that regression [0]. It should also close out
the syzbot crash [1].

[0] https://gist.github.com/mcgrof/b37fc8cf7e6e1b3935242681de1a83e2
[1] https://lore.kernel.org/all/3afcb4060135a...@google.com/

Tested-by: Luis Chamberlain 

  Luis


[PATCH] powerpc/mm/altmap: Fix altmap boundary check

2023-07-24 Thread Aneesh Kumar K.V
altmap->free includes the entire free space from which altmap blocks
can be allocated. So when checking whether the kernel is doing altmap
block free, compute the boundary correctly.

Cc: David Hildenbrand 
Cc: Dan Williams 
Fixes: 9ef34630a461 ("powerpc/mm: Fallback to RAM if the altmap is unusable")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index fe1b83020e0d..0ec5b45b1e86 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -314,8 +314,7 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
start = ALIGN_DOWN(start, page_size);
if (altmap) {
alt_start = altmap->base_pfn;
-   alt_end = altmap->base_pfn + altmap->reserve +
- altmap->free + altmap->alloc + altmap->align;
+   alt_end = altmap->base_pfn + altmap->reserve + altmap->free;
}
 
pr_debug("vmemmap_free %lx...%lx\n", start, end);
-- 
2.41.0
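
A sketch of the pfn layout this fix assumes (field names from struct
vmem_altmap; widths not to scale):

	base_pfn
	|-- reserve --|---------------- free ----------------|
	              |-- alloc --|-- align --|  available    |

alloc and align are consumed from within free, so the end of the
altmap-backed area is base_pfn + reserve + free; adding alloc and align on
top of free double-counts that space.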



Re: [PATCH] mm/hotplug: Enable runtime update of memmap_on_memory parameter

2023-07-24 Thread David Hildenbrand

On 24.07.23 19:31, Andrew Morton wrote:

On Fri, 21 Jul 2023 18:49:50 +0530 "Aneesh Kumar K.V" 
 wrote:


Signed-off-by: Aneesh Kumar K.V 
---
This is dependent on patches posted at
https://lore.kernel.org/linux-mm/20230718024409.95742-1-aneesh.ku...@linux.ibm.com/


It appears that the above-linked series is to be updated, so
would it be appropriate to append this patch to that series?

If not, please resend this once the above-linked series has landed
in mm-unstable, thanks.



Yes, let's include that patch in that series.

In general, LGTM. Toggling it at runtime, however, makes it harder for a 
driver to stabilize on the value. In the context of virtio-mem this 
would become important (once it supports memmap_on_memory).


--
Cheers,

David / dhildenb



Re: [PATCH] mm/hotplug: Enable runtime update of memmap_on_memory parameter

2023-07-24 Thread Andrew Morton
On Fri, 21 Jul 2023 18:49:50 +0530 "Aneesh Kumar K.V" 
 wrote:

> Signed-off-by: Aneesh Kumar K.V 
> ---
> This is dependent on patches posted at
> https://lore.kernel.org/linux-mm/20230718024409.95742-1-aneesh.ku...@linux.ibm.com/

It appears that the above-linked series is to be updated, so
would it be appropriate to append this patch to that series?

If not, please resend this once the above-linked series has landed
in mm-unstable, thanks.


[PATCH v5 00/25] iommu: Make default_domain's mandatory

2023-07-24 Thread Jason Gunthorpe
[ It would be good to get this in linux-next, we have some good test
coverage on the ARM side already, thanks! ]

It has been a long time coming; this series completes the default_domain
transition and makes it so that the core IOMMU code will always have a
non-NULL default_domain for every driver on every
platform. set_platform_dma_ops() turned out to be a bad idea, so it is
removed completely.

This is achieved by changing each driver to either:

1 - Convert the existing (or deleted) ops->detach_dev() into an
ops->attach_dev() of an IDENTITY domain.

This is based on the theory that the ARM32 HW is able to function when
the iommu is turned off and so the turned off state is an IDENTITY
translation.

2 - Use a new PLATFORM domain type. This is a hack to accommodate drivers
where we don't really know WTF they do. S390 is legitimately using this
to switch to its platform dma_ops implementation, which is where the
name comes from.

3 - Do #1 and force the default domain to be IDENTITY; this corrects
the tegra-smmu case where even an ARM64 system would have a NULL
default_domain.

Using this we can apply the rules:

a) ARM_DMA_USE_IOMMU mode always uses either the driver's
   ops->default_domain, ops->def_domain_type(), or an IDENTITY domain.
   All ARM32 drivers provide one of these three options.

b) dma-iommu.c mode uses either the driver's ops->default_domain,
   ops->def_domain_type or the usual DMA API policy logic based on the
   command line/etc to pick IDENTITY/DMA domain types

c) All other arch's (PPC/S390) use ops->default_domain always.

See the patch "Require a default_domain for all iommu drivers" for a
per-driver breakdown.
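
As a minimal sketch (hypothetical driver, not from the series) of the ops
hooks the rules above key off; a real driver would pick whichever hook
matches its hardware:

static int foo_identity_attach(struct iommu_domain *domain,
			       struct device *dev)
{
	/* put the hardware into bypass/identity for this device */
	return 0;
}

static struct iommu_domain_ops foo_identity_ops = {
	.attach_dev = foo_identity_attach,
};

static struct iommu_domain foo_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &foo_identity_ops,
};

static const struct iommu_ops foo_ops = {
	/* an always-attachable identity translation (rules a/b) */
	.identity_domain = &foo_identity_domain,
	/* or force one domain as the default everywhere (rule c) */
	.default_domain = &foo_identity_domain,
};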

The conversion broadly teaches a bunch of ARM32 drivers that they can do
IDENTITY domains. There is some educated guessing involved that these are
actual IDENTITY domains. If this turns out to be wrong the driver can be
trivially changed to use a BLOCKING domain type instead. Further, the
domain type only matters for drivers using ARM64's dma-iommu.c mode as it
will select IDENTITY based on the command line and expect IDENTITY to
work. For ARM32 and other arch cases it is purely documentation.

Finally, based on all the analysis in this series, we can purge
IOMMU_DOMAIN_UNMANAGED/DMA constants from most of the drivers. This
greatly simplifies understanding the driver contract to the core
code. IOMMU drivers should not be involved in policy for how the DMA API
works; that should be a core code decision.

The main gain from this work is to remove a lot of ARM_DMA_USE_IOMMU-
specific code and behaviors from drivers. All that remains in iommu
drivers after this series is the calls to arm_iommu_create_mapping().

This is a step toward removing ARM_DMA_USE_IOMMU.

The IDENTITY domains added to the ARM64 supporting drivers can be tested
by booting in ARM64 mode and enabling CONFIG_IOMMU_DEFAULT_PASSTHROUGH. If
the system still boots then most likely the implementation is an IDENTITY
domain. If not we can trivially change it to BLOCKING or at worst PLATFORM
if there is no detail what is going on in the HW.

I think this is pretty safe for the ARM32 drivers as they don't really
change; the code that was in detach_dev continues to be called in the same
places it was called before.

This is on github: https://github.com/jgunthorpe/linux/commits/iommu_all_defdom

v5:
 - Rebase on v6.5-rc1/Joerg's tree
 - Fix Dan's remark about 'gdev uninitialized' in patch 9
v4: 
https://lore.kernel.org/r/0-v4-874277bde66e+1a9f6-iommu_all_defdom_...@nvidia.com
 - Fix rebasing typo missing ops->alloc_domain_paging check
 - Rebase on latest Joerg tree
v3: 
https://lore.kernel.org/r/0-v3-89830a6c7841+43d-iommu_all_defdom_...@nvidia.com
 - FSL is back to a PLATFORM domain, with some fixing so its attach only
   does something when leaving an UNMANAGED domain, like it always did
 - Rebase on Joerg's tree, adjust for "alloc_type" change
 - Change the ARM32 untrusted check to a WARN_ON since no ARM32 system
   can currently set trusted
v2: 
https://lore.kernel.org/r/0-v2-8d1dc464eac9+10f-iommu_all_defdom_...@nvidia.com
 - FSL is an IDENTITY domain
 - Delete terga-gart instead of trying to carry it
 - Use the policy determination from iommu_get_default_domain_type() to
   drive the arm_iommu mode
 - Reorganize and introduce new patches to do the above:
* Split the ops->identity_domain to an independent earlier patch
* Remove the UNMANAGED return from def_domain_type in mtk_v1 earlier
  so the new iommu_get_default_domain_type() can work
* Make the driver's def_domain_type have higher policy priority than
  untrusted
* Merge the set_platfom_dma_ops hunk from mtk_v1 along with rockchip
  into the patch that forced IDENTITY on ARM32
 - Revise sun50i to be cleaner and have a non-NULL internal domain
 - Reword logging in exynos
 - Remove the gdev from the group alloc path, instead add a new
   function __iommu_group_domain_alloc() that takes in the group
 

Re: [PATCH v4 4/6] mm/hotplug: Allow pageblock alignment via altmap reservation

2023-07-24 Thread Aneesh Kumar K.V
David Hildenbrand  writes:

> On 24.07.23 18:02, Aneesh Kumar K V wrote:
>> On 7/24/23 9:11 PM, David Hildenbrand wrote:
>>> On 24.07.23 17:16, Aneesh Kumar K V wrote:
>>>
>
> /*
>    * In "forced" memmap_on_memory mode, we always align the vmemmap size 
> up to cover
>    * full pageblocks. That way, we can add memory even if the vmemmap 
> size is not properly
>    * aligned, however, we might waste memory.
>    */

 I am finding that confusing. We do want things to be pageblock_nr_pages 
 aligned both ways.
 With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for 
 memmap and
 in the default case we do that by making sure only memory blocks of 
 specific size supporting
 that alignment can use MEMMAP_ON_MEMORY feature.
>>>
>>> See the usage in mhp_supports_memmap_on_memory(), I guess that makes sense 
>>> then.
>>>
>>> But if you have any ideas on how to clarify that (terminology), I'm all 
>>> ears!
>>>
>> 
>> 
>> I updated the commit message
>> 
>> mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks
>> 
>> Currently, memmap_on_memory feature is only supported with memory block
>> sizes that result in vmemmap pages covering full page blocks. This is
>> because memory onlining/offlining code requires applicable ranges to be
>> pageblock-aligned, for example, to set the migratetypes properly.
>> 
>> This patch helps to lift that restriction by reserving more pages than
>> required for vmemmap space. This helps to align the start addr to be
>> page block aligned with different memory block sizes. This implies the
>> kernel will be reserving some pages for every memoryblock. This also
>> allows the memmap on memory feature to be widely useful with different
>> memory block size values.
>> 
>> For ex: with 64K page size and 256MiB memory block size, we require 4
>> pages to map the vmemmap pages. To align things correctly we end up adding
>> a reserve of 28 pages, i.e., for every 4096 pages 28 pages get reserved.
>> 
>> 
>
> Much better.
>
>> Also while implementing your suggestion to use 
>> memory_block_memmap_on_memory_size()
>> I am finding it not really useful because in mhp_supports_memmap_on_memory() 
>> we are checking
>> if remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use 
>> that helper
>> later).
>
> Let's focus on this patchset here first.
>
> Factoring out how manye memmap pages we actually need vs. how many pages 
> we need when aligning up sound very reasonable to me.
>
>
> Can you elaborate what the problem is?
>
>> Also I still think altmap.reserve is easier because of the start_pfn 
>> calculation.
>> (more on this below)
>
> Can you elaborate? Do you mean the try_remove_memory() change?
>
>> 
>> 
>>> [...]
>>>
>> +    return arch_supports_memmap_on_memory(size);
>>     }
>>       /*
>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct 
>> resource *res, mhp_t mhp_flags)
>>     {
>>     struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>     enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>> -    struct vmem_altmap mhp_altmap = {};
>> +    struct vmem_altmap mhp_altmap = {
>> +    .base_pfn =  PHYS_PFN(res->start),
>> +    .end_pfn  =  PHYS_PFN(res->end),
>> +    .reserve  = memory_block_align_base(resource_size(res)),
>
> Can you remind me why we have to set reserve here at all?
>
> IOW, can't we simply set
>
> .free = memory_block_memmap_on_memory_size();
>
> end then pass
>
> mhp_altmap.alloc + mhp_altmap.free
>
> to create_memory_block_devices() instead?
>

 But with the dax usage of altmap, altmap->reserve is what we use to 
 reserve things to get
 the required alignment. One difference is where we allocate the struct 
 page at. For this specific
 case it should not matter.

 static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap 
 *altmap)
 {
  return altmap->base_pfn + altmap->reserve + altmap->alloc
      + altmap->align;
 }

 And other is where we online a memory block

 We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;

 Considering altmap->reserve is what dax pfn_dev use, is there a reason you 
 want to use altmap->free for this?
>>>
>>> "Reserve" is all about "reserving that much memory for driver usage".
>>>
>>> We don't care about that. We simply want vmemmap allocations coming from 
>>> the pageblock(s) we set aside. Where exactly, we don't care.
>>>
 I find it confusing to update free when we haven't allocated any altmap 
 blocks yet.
>>>
>>> "
>>> @reserve: pages mapped, but reserved for driver use (relative to @base)"
>>> @free: free pages set aside in the mapping for memmap storage
>>> @alloc: track pages consumed, private to vmemmap_populate()
>>> "
>>>
>>> To 

[PATCH v5 02/25] iommu: Add IOMMU_DOMAIN_PLATFORM

2023-07-24 Thread Jason Gunthorpe
This is used when the iommu driver is taking control of the dma_ops,
currently only on S390 and power spapr. It is designed to preserve the
original ops->detach_dev() semantic that the S390 code was built around.

Provide an opaque domain type and a 'default_domain' ops value that allows
the driver to trivially force any single domain as the default domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 14 +-
 include/linux/iommu.h |  6 ++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5e3cdc9f3a9e78..c64365169b678d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1705,6 +1705,17 @@ iommu_group_alloc_default_domain(struct iommu_group 
*group, int req_type)
 
	lockdep_assert_held(&group->mutex);
 
+   /*
+* Allow legacy drivers to specify the domain that will be the default
+* domain. This should always be either an IDENTITY or PLATFORM domain.
+* Do not use in new drivers.
+*/
+   if (bus->iommu_ops->default_domain) {
+   if (req_type)
+   return ERR_PTR(-EINVAL);
+   return bus->iommu_ops->default_domain;
+   }
+
if (req_type)
return __iommu_group_alloc_default_domain(bus, group, req_type);
 
@@ -1967,7 +1978,8 @@ void iommu_domain_free(struct iommu_domain *domain)
if (domain->type == IOMMU_DOMAIN_SVA)
mmdrop(domain->mm);
iommu_put_dma_cookie(domain);
-   domain->ops->free(domain);
+   if (domain->ops->free)
+   domain->ops->free(domain);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_free);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e05c93b6c37fba..87aebba474e093 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -64,6 +64,7 @@ struct iommu_domain_geometry {
 #define __IOMMU_DOMAIN_DMA_FQ  (1U << 3)  /* DMA-API uses flush queue*/
 
 #define __IOMMU_DOMAIN_SVA (1U << 4)  /* Shared process address space */
+#define __IOMMU_DOMAIN_PLATFORM	(1U << 5)
 
 #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
 /*
@@ -81,6 +82,8 @@ struct iommu_domain_geometry {
  *   invalidation.
  * IOMMU_DOMAIN_SVA- DMA addresses are shared process addresses
  *   represented by mm_struct's.
+ * IOMMU_DOMAIN_PLATFORM   - Legacy domain for drivers that do their own
+ *   dma_api stuff. Do not use in new drivers.
  */
 #define IOMMU_DOMAIN_BLOCKED   (0U)
 #define IOMMU_DOMAIN_IDENTITY  (__IOMMU_DOMAIN_PT)
@@ -91,6 +94,7 @@ struct iommu_domain_geometry {
 __IOMMU_DOMAIN_DMA_API |   \
 __IOMMU_DOMAIN_DMA_FQ)
 #define IOMMU_DOMAIN_SVA   (__IOMMU_DOMAIN_SVA)
+#define IOMMU_DOMAIN_PLATFORM  (__IOMMU_DOMAIN_PLATFORM)
 
 struct iommu_domain {
unsigned type;
@@ -256,6 +260,7 @@ struct iommu_iotlb_gather {
  * @owner: Driver module providing these ops
  * @identity_domain: An always available, always attachable identity
  *   translation.
+ * @default_domain: If not NULL this will always be set as the default domain.
  */
 struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
@@ -290,6 +295,7 @@ struct iommu_ops {
unsigned long pgsize_bitmap;
struct module *owner;
struct iommu_domain *identity_domain;
+   struct iommu_domain *default_domain;
 };
 
 /**
-- 
2.41.0



[PATCH v5 11/25] iommu/tegra-smmu: Implement an IDENTITY domain

2023-07-24 Thread Jason Gunthorpe
What tegra-smmu does during tegra_smmu_set_platform_dma() is actually
putting the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/tegra-smmu.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 1cbf063ccf147a..f63f1d4f0bd10f 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -511,23 +511,39 @@ static int tegra_smmu_attach_dev(struct iommu_domain 
*domain,
return err;
 }
 
-static void tegra_smmu_set_platform_dma(struct device *dev)
+static int tegra_smmu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-   struct tegra_smmu_as *as = to_smmu_as(domain);
-   struct tegra_smmu *smmu = as->smmu;
+   struct tegra_smmu_as *as;
+   struct tegra_smmu *smmu;
unsigned int index;
 
if (!fwspec)
-   return;
+   return -ENODEV;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   as = to_smmu_as(domain);
+   smmu = as->smmu;
for (index = 0; index < fwspec->num_ids; index++) {
tegra_smmu_disable(smmu, fwspec->ids[index], as->id);
tegra_smmu_as_unprepare(smmu, as);
}
+   return 0;
 }
 
+static struct iommu_domain_ops tegra_smmu_identity_ops = {
+   .attach_dev = tegra_smmu_identity_attach,
+};
+
+static struct iommu_domain tegra_smmu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &tegra_smmu_identity_ops,
+};
+
 static void tegra_smmu_set_pde(struct tegra_smmu_as *as, unsigned long iova,
   u32 value)
 {
@@ -962,11 +978,22 @@ static int tegra_smmu_of_xlate(struct device *dev,
	return iommu_fwspec_add_ids(dev, &id, 1);
 }
 
+static int tegra_smmu_def_domain_type(struct device *dev)
+{
+   /*
+* FIXME: For now we want to run all translation in IDENTITY mode, due
+* to some device quirks. Better would be to just quirk the troubled
+* devices.
+*/
+   return IOMMU_DOMAIN_IDENTITY;
+}
+
 static const struct iommu_ops tegra_smmu_ops = {
+   .identity_domain = &tegra_smmu_identity_domain,
+   .def_domain_type = &tegra_smmu_def_domain_type,
.domain_alloc = tegra_smmu_domain_alloc,
.probe_device = tegra_smmu_probe_device,
.device_group = tegra_smmu_device_group,
-   .set_platform_dma_ops = tegra_smmu_set_platform_dma,
.of_xlate = tegra_smmu_of_xlate,
.pgsize_bitmap = SZ_4K,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.41.0



[PATCH v5 17/25] iommu/qcom_iommu: Add an IOMMU_IDENTITIY_DOMAIN

2023-07-24 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 39 +
 1 file changed, 39 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index a503ed758ec302..9d7b9d8b4386d4 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -387,6 +387,44 @@ static int qcom_iommu_attach_dev(struct iommu_domain 
*domain, struct device *dev
return 0;
 }
 
+static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct qcom_iommu_domain *qcom_domain;
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct qcom_iommu_dev *qcom_iommu = to_iommu(dev);
+   unsigned int i;
+
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   qcom_domain = to_qcom_iommu_domain(domain);
+   if (WARN_ON(!qcom_domain->iommu))
+   return -EINVAL;
+
+   pm_runtime_get_sync(qcom_iommu->dev);
+   for (i = 0; i < fwspec->num_ids; i++) {
+   struct qcom_iommu_ctx *ctx = to_ctx(qcom_domain, 
fwspec->ids[i]);
+
+   /* Disable the context bank: */
+   iommu_writel(ctx, ARM_SMMU_CB_SCTLR, 0);
+
+   ctx->domain = NULL;
+   }
+   pm_runtime_put_sync(qcom_iommu->dev);
+   return 0;
+}
+
+static struct iommu_domain_ops qcom_iommu_identity_ops = {
+   .attach_dev = qcom_iommu_identity_attach,
+};
+
+static struct iommu_domain qcom_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &qcom_iommu_identity_ops,
+};
+
 static int qcom_iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, size_t pgsize, size_t pgcount,
  int prot, gfp_t gfp, size_t *mapped)
@@ -553,6 +591,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct 
of_phandle_args *args)
 }
 
 static const struct iommu_ops qcom_iommu_ops = {
+   .identity_domain = &qcom_iommu_identity_domain,
.capable= qcom_iommu_capable,
.domain_alloc   = qcom_iommu_domain_alloc,
.probe_device   = qcom_iommu_probe_device,
-- 
2.41.0



[PATCH v5 23/25] iommu: Add ops->domain_alloc_paging()

2023-07-24 Thread Jason Gunthorpe
This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING
domain, so it saves a few lines in a lot of drivers needlessly checking
the type.

More critically, this allows us to sweep out all the
IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the
drivers, simplifying what is going on in the code and ultimately removing
the now-unused special cases in drivers where they did not support
IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is
functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forwards looking and pass in a 'struct device *' argument. We can
provide this when allocating the default_domain. No drivers will look at
this.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 13 ++---
 include/linux/iommu.h |  3 +++
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index bc8b35e31b5343..5b5cf74edc7e53 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1999,6 +1999,7 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
 static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+struct device *dev,
 unsigned int type)
 {
struct iommu_domain *domain;
@@ -2006,8 +2007,13 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct iommu_ops *ops,
 
if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
return ops->identity_domain;
+   else if (type & __IOMMU_DOMAIN_PAGING && ops->domain_alloc_paging) {
+   domain = ops->domain_alloc_paging(dev);
+   } else if (ops->domain_alloc)
+   domain = ops->domain_alloc(alloc_type);
+   else
+   return NULL;
 
-   domain = ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
 
@@ -2038,14 +2044,15 @@ __iommu_group_domain_alloc(struct iommu_group *group, 
unsigned int type)
 
	lockdep_assert_held(&group->mutex);
 
-   return __iommu_domain_alloc(dev_iommu_ops(dev), type);
+   return __iommu_domain_alloc(dev_iommu_ops(dev), dev, type);
 }
 
 struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
 {
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
-   return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+   return __iommu_domain_alloc(bus->iommu_ops, NULL,
+   IOMMU_DOMAIN_UNMANAGED);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index df54066c262db4..8f69866c868e04 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -233,6 +233,8 @@ struct iommu_iotlb_gather {
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *   UNMANAGED, DMA, and DMA_FQ domain types.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -264,6 +266,7 @@ struct iommu_ops {
 
/* Domain allocation and freeing by the iommu driver */
struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+   struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
 
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
-- 
2.41.0
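
A hypothetical driver-side sketch (the foo_* names and
foo_paging_domain_ops are illustrative, assumed to exist elsewhere): with
the new op the allocation path no longer checks the domain type, and the
core fills in domain->type afterwards.

struct foo_domain {
	struct iommu_domain domain;
	/* driver page-table state would live here */
};

static struct iommu_domain *foo_domain_alloc_paging(struct device *dev)
{
	struct foo_domain *fdom = kzalloc(sizeof(*fdom), GFP_KERNEL);

	if (!fdom)
		return NULL;
	fdom->domain.ops = &foo_paging_domain_ops;	/* assumed ops */
	return &fdom->domain;
}

static const struct iommu_ops foo_ops = {
	.domain_alloc_paging = foo_domain_alloc_paging,
};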



[PATCH v5 21/25] iommu: Require a default_domain for all iommu drivers

2023-07-24 Thread Jason Gunthorpe
At this point every iommu driver will cause a default_domain to be
selected, so we can finally remove this gap from the core code.

The following table explains what each driver supports and what the
resulting default_domain will be:

                        ops->default_domain
                  IDENTITY  DMA  PLATFORM  ARM32     dma-iommu  ARCH
amd/iommu.c           Y      Y             N/A       either
apple-dart.c          Y      Y             N/A       either
arm-smmu.c            Y      Y             IDENTITY  either
qcom_iommu.c          G      Y             IDENTITY  either
arm-smmu-v3.c         Y      Y             N/A       either
exynos-iommu.c        G      Y             IDENTITY  either
fsl_pamu_domain.c                 Y        N/A       N/A        PLATFORM
intel/iommu.c         Y      Y             N/A       either
ipmmu-vmsa.c          G      Y             IDENTITY  either
msm_iommu.c           G                    IDENTITY  N/A
mtk_iommu.c           G      Y             IDENTITY  either
mtk_iommu_v1.c        G                    IDENTITY  N/A
omap-iommu.c          G                    IDENTITY  N/A
rockchip-iommu.c      G      Y             IDENTITY  either
s390-iommu.c                 Y    Y        N/A       N/A        PLATFORM
sprd-iommu.c                 Y             N/A       DMA
sun50i-iommu.c        G      Y             IDENTITY  either
tegra-smmu.c          G      Y             IDENTITY  IDENTITY
virtio-iommu.c        Y      Y             N/A       either
spapr                        Y    Y        N/A       N/A        PLATFORM
 * G means ops->identity_domain is used
 * N/A means the driver will not compile in this configuration

ARM32 drivers select an IDENTITY default domain through either the
ops->identity_domain or directly requesting an IDENTITY domain through
alloc_domain().

In ARM64 mode tegra-smmu will still block the use of dma-iommu.c and
forces an IDENTITY domain.

S390 uses a PLATFORM domain to represent when the dma_ops are set to the
s390 iommu code.

fsl_pamu uses a PLATFORM domain.

POWER SPAPR uses PLATFORM and blocking to enable its weird VFIO mode.

The x86 drivers continue unchanged.

After this patch group->default_domain is only NULL for a short period
during bus iommu probing while all the groups are constituted. Otherwise
it is always !NULL.

This completes changing the iommu subsystem driver contract to a system
where the current iommu_domain always represents some form of translation
and the driver is continuously asserting a definable translation mode.

It resolves the confusion that the original ops->detach_dev() caused
around what translation, exactly, the IOMMU is performing after
detach. There were at least three different answers to that question in
the tree; they are all now clearly named with domain types.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index dada2c00d78ca4..1533e65d075bce 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1819,10 +1819,12 @@ static int iommu_get_default_domain_type(struct 
iommu_group *group,
 * ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
 * identity_domain and it will automatically become their default
 * domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
-* Override the selection to IDENTITY if we are sure the driver supports
-* it.
+* Override the selection to IDENTITY.
 */
-   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+   static_assert(!(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) &&
+   IS_ENABLED(CONFIG_IOMMU_DMA)));
+
type = IOMMU_DOMAIN_IDENTITY;
if (best_type && type && best_type != type)
return -1;
@@ -2920,18 +2922,9 @@ static int iommu_setup_default_domain(struct iommu_group 
*group,
if (req_type < 0)
return -EINVAL;
 
-   /*
-* There are still some drivers which don't support default domains, so
-* we ignore the failure and leave group->default_domain NULL.
-*/
dom = iommu_group_alloc_default_domain(group, req_type);
-   if (!dom) {
-   /* Once in default_domain mode we 

[PATCH v5 15/25] iommufd/selftest: Make the mock iommu driver into a real driver

2023-07-24 Thread Jason Gunthorpe
I've avoided doing this because there is no way to make this happen
without an intrusion into the core code. Up till now this has avoided
needing the core code's probe path with some hackery - but now that
default domains are becoming mandatory it is unavoidable. The core probe
path must be run to set the default_domain; only it can do that. Without
a default domain iommufd can't use the group.

Make it so that the iommufd selftest can create a real iommu driver and bind
it only to its own private bus. Add iommu_device_register_bus() as a core
code helper to make this possible. It simply sets the right pointers and
registers the notifier block. The mock driver then works like any normal
driver should, with probe triggered by the bus ops.

When the bus->iommu_ops stuff is fully unwound we can probably do better
here and remove this special case.

Remove set_platform_dma_ops from selftest and make it use a BLOCKED
default domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu-priv.h  |  16 +++
 drivers/iommu/iommu.c   |  43 
 drivers/iommu/iommufd/iommufd_private.h |   5 +-
 drivers/iommu/iommufd/main.c|   8 +-
 drivers/iommu/iommufd/selftest.c| 141 +---
 5 files changed, 144 insertions(+), 69 deletions(-)
 create mode 100644 drivers/iommu/iommu-priv.h

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
new file mode 100644
index 00..1cbc04b9cf7297
--- /dev/null
+++ b/drivers/iommu/iommu-priv.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef __IOMMU_PRIV_H
+#define __IOMMU_PRIV_H
+
+#include <linux/iommu.h>
+
+int iommu_device_register_bus(struct iommu_device *iommu,
+ const struct iommu_ops *ops, struct bus_type *bus,
+ struct notifier_block *nb);
+void iommu_device_unregister_bus(struct iommu_device *iommu,
+struct bus_type *bus,
+struct notifier_block *nb);
+
+#endif
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a1a93990b3a211..7fae866af0db7a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -36,6 +36,7 @@
 #include "dma-iommu.h"
 
 #include "iommu-sva.h"
+#include "iommu-priv.h"
 
 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
@@ -290,6 +291,48 @@ void iommu_device_unregister(struct iommu_device *iommu)
 }
 EXPORT_SYMBOL_GPL(iommu_device_unregister);
 
+#if IS_ENABLED(CONFIG_IOMMUFD_TEST)
+void iommu_device_unregister_bus(struct iommu_device *iommu,
+struct bus_type *bus,
+struct notifier_block *nb)
+{
+   bus_unregister_notifier(bus, nb);
+   iommu_device_unregister(iommu);
+}
+EXPORT_SYMBOL_GPL(iommu_device_unregister_bus);
+
+/*
+ * Register an iommu driver against a single bus. This is only used by iommufd
+ * selftest to create a mock iommu driver. The caller must provide
+ * some memory to hold a notifier_block.
+ */
+int iommu_device_register_bus(struct iommu_device *iommu,
+ const struct iommu_ops *ops, struct bus_type *bus,
+ struct notifier_block *nb)
+{
+   int err;
+
+   iommu->ops = ops;
+   nb->notifier_call = iommu_bus_notifier;
+   err = bus_register_notifier(bus, nb);
+   if (err)
+   return err;
+
+   spin_lock(&iommu_device_lock);
+   list_add_tail(&iommu->list, &iommu_device_list);
+   spin_unlock(&iommu_device_lock);
+
+   bus->iommu_ops = ops;
+   err = bus_iommu_probe(bus);
+   if (err) {
+   iommu_device_unregister_bus(iommu, bus, nb);
+   return err;
+   }
+   return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_device_register_bus);
+#endif
+
 static struct dev_iommu *dev_iommu_get(struct device *dev)
 {
struct dev_iommu *param = dev->iommu;
diff --git a/drivers/iommu/iommufd/iommufd_private.h 
b/drivers/iommu/iommufd/iommufd_private.h
index b38e67d1988bdb..368f66c63a239a 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -303,7 +303,7 @@ extern size_t iommufd_test_memory_limit;
 void iommufd_test_syz_conv_iova_id(struct iommufd_ucmd *ucmd,
   unsigned int ioas_id, u64 *iova, u32 *flags);
 bool iommufd_should_fail(void);
-void __init iommufd_test_init(void);
+int __init iommufd_test_init(void);
 void iommufd_test_exit(void);
 bool iommufd_selftest_is_mock_dev(struct device *dev);
 #else
@@ -316,8 +316,9 @@ static inline bool iommufd_should_fail(void)
 {
return false;
 }
-static inline void __init iommufd_test_init(void)
+static inline int __init iommufd_test_init(void)
 {
+   return 0;
 }
 static inline void iommufd_test_exit(void)
 {
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 3fbe636c3d8a69..042d45cc0b1c0d 100644
--- 
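
A usage sketch (hypothetical, mirroring what the selftest does; mock_ops
is an assumed const struct iommu_ops): a mock driver registers against its
own private bus so the normal core probe path runs without touching real
devices.

static struct bus_type mock_bus = { .name = "mock-iommu" };
static struct notifier_block mock_nb;
static struct iommu_device mock_iommu_dev;

static int __init mock_driver_init(void)
{
	int rc;

	rc = bus_register(&mock_bus);
	if (rc)
		return rc;

	rc = iommu_device_register_bus(&mock_iommu_dev, &mock_ops,
				       &mock_bus, &mock_nb);
	if (rc)
		bus_unregister(&mock_bus);
	return rc;
}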

[PATCH v5 05/25] iommu/fsl_pamu: Implement a PLATFORM domain

2023-07-24 Thread Jason Gunthorpe
This driver is nonsensical. To not block migrating the core API away from
NULL default_domains, give it a hacky PLATFORM domain that keeps it
working exactly as it always did.

Leave some comments around to warn away any future people looking at this.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/fsl_pamu_domain.c | 41 ++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 4ac0e247ec2b51..e9d2bff4659b7c 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -196,6 +196,13 @@ static struct iommu_domain *fsl_pamu_domain_alloc(unsigned 
type)
 {
struct fsl_dma_domain *dma_domain;
 
+   /*
+* FIXME: This isn't creating an unmanaged domain: since the
+* default_domain_ops do not have any map/unmap function it doesn't meet
+* the requirements for __IOMMU_DOMAIN_PAGING. The only purpose seems to
+* be to allow drivers/soc/fsl/qbman/qman_portal.c to do
+* fsl_pamu_configure_l1_stash()
+*/
if (type != IOMMU_DOMAIN_UNMANAGED)
return NULL;
 
@@ -283,15 +290,33 @@ static int fsl_pamu_attach_device(struct iommu_domain 
*domain,
return ret;
 }
 
-static void fsl_pamu_set_platform_dma(struct device *dev)
+/*
+ * FIXME: fsl/pamu is completely broken in terms of how it works with the iommu
+ * API. Immediately after probe the HW is left in an IDENTITY translation and
+ * the driver provides a non-working UNMANAGED domain that it can switch over
+ * to. However it cannot switch back to an IDENTITY translation, instead it
+ * switches to what looks like BLOCKING.
+ */
+static int fsl_pamu_platform_attach(struct iommu_domain *platform_domain,
+   struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+   struct fsl_dma_domain *dma_domain;
const u32 *prop;
int len;
struct pci_dev *pdev = NULL;
struct pci_controller *pci_ctl;
 
+   /*
+* Hack to keep things working as they always have, only leaving an
+* UNMANAGED domain makes it BLOCKING.
+*/
+   if (domain == platform_domain || !domain ||
+   domain->type != IOMMU_DOMAIN_UNMANAGED)
+   return 0;
+
+   dma_domain = to_fsl_dma_domain(domain);
+
/*
 * Use LIODN of the PCI controller while detaching a
 * PCI device.
@@ -312,8 +337,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
detach_device(dev, dma_domain);
else
pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+   return 0;
 }
 
+static struct iommu_domain_ops fsl_pamu_platform_ops = {
+   .attach_dev = fsl_pamu_platform_attach,
+};
+
+static struct iommu_domain fsl_pamu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &fsl_pamu_platform_ops,
+};
+
 /* Set the domain stash attribute */
 int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
 {
@@ -395,11 +430,11 @@ static struct iommu_device *fsl_pamu_probe_device(struct 
device *dev)
 }
 
 static const struct iommu_ops fsl_pamu_ops = {
+   .default_domain = &fsl_pamu_platform_domain,
.capable= fsl_pamu_capable,
.domain_alloc   = fsl_pamu_domain_alloc,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
-   .set_platform_dma_ops = fsl_pamu_set_platform_dma,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = fsl_pamu_attach_device,
.iova_to_phys   = fsl_pamu_iova_to_phys,
-- 
2.41.0



[PATCH v5 20/25] iommu/sun50i: Add an IOMMU_IDENTITY_DOMAIN

2023-07-24 Thread Jason Gunthorpe
Prior to commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") the
sun50i_iommu_detach_device() function was being called by
ops->detach_dev().

This is an IDENTITY domain, so convert sun50i_iommu_detach_device() into
sun50i_iommu_identity_attach() backed by a full IDENTITY domain, and hook
it back up the same way as the old ops->detach_dev().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/sun50i-iommu.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/sun50i-iommu.c b/drivers/iommu/sun50i-iommu.c
index 74c5cb93e90027..0bf08b120cf105 100644
--- a/drivers/iommu/sun50i-iommu.c
+++ b/drivers/iommu/sun50i-iommu.c
@@ -757,21 +757,32 @@ static void sun50i_iommu_detach_domain(struct 
sun50i_iommu *iommu,
iommu->domain = NULL;
 }
 
-static void sun50i_iommu_detach_device(struct iommu_domain *domain,
-  struct device *dev)
+static int sun50i_iommu_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
-   struct sun50i_iommu_domain *sun50i_domain = to_sun50i_domain(domain);
struct sun50i_iommu *iommu = dev_iommu_priv_get(dev);
+   struct sun50i_iommu_domain *sun50i_domain;
 
dev_dbg(dev, "Detaching from IOMMU domain\n");
 
-   if (iommu->domain != domain)
-   return;
+   if (iommu->domain == identity_domain)
+   return 0;
 
+   sun50i_domain = to_sun50i_domain(iommu->domain);
if (refcount_dec_and_test(&sun50i_domain->refcnt))
sun50i_iommu_detach_domain(iommu, sun50i_domain);
+   return 0;
 }
 
+static struct iommu_domain_ops sun50i_iommu_identity_ops = {
+   .attach_dev = sun50i_iommu_identity_attach,
+};
+
+static struct iommu_domain sun50i_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &sun50i_iommu_identity_ops,
+};
+
 static int sun50i_iommu_attach_device(struct iommu_domain *domain,
  struct device *dev)
 {
@@ -789,8 +800,7 @@ static int sun50i_iommu_attach_device(struct iommu_domain 
*domain,
if (iommu->domain == domain)
return 0;
 
-   if (iommu->domain)
-   sun50i_iommu_detach_device(iommu->domain, dev);
+   sun50i_iommu_identity_attach(&sun50i_iommu_identity_domain, dev);
 
sun50i_iommu_attach_domain(iommu, sun50i_domain);
 
@@ -827,6 +837,7 @@ static int sun50i_iommu_of_xlate(struct device *dev,
 }
 
 static const struct iommu_ops sun50i_iommu_ops = {
+   .identity_domain = &sun50i_iommu_identity_domain,
.pgsize_bitmap  = SZ_4K,
.device_group   = sun50i_iommu_device_group,
.domain_alloc   = sun50i_iommu_domain_alloc,
@@ -985,6 +996,7 @@ static int sun50i_iommu_probe(struct platform_device *pdev)
if (!iommu)
return -ENOMEM;
spin_lock_init(&iommu->iommu_lock);
+   iommu->domain = &sun50i_iommu_identity_domain;
platform_set_drvdata(pdev, iommu);
iommu->dev = &pdev->dev;
 
-- 
2.41.0



[PATCH v5 13/25] iommu/omap: Implement an IDENTITY domain

2023-07-24 Thread Jason Gunthorpe
What omap does during omap_iommu_set_platform_dma() is actually putting
the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA; however, it cannot be
compiled on ARM64 either. Most likely it would be fine to support dma-iommu.c.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/omap-iommu.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 537e402f9bba97..34340ef15241bc 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1555,16 +1555,31 @@ static void _omap_iommu_detach_dev(struct 
omap_iommu_domain *omap_domain,
omap_domain->dev = NULL;
 }
 
-static void omap_iommu_set_platform_dma(struct device *dev)
+static int omap_iommu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct omap_iommu_domain *omap_domain = to_omap_domain(domain);
+   struct omap_iommu_domain *omap_domain;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   omap_domain = to_omap_domain(domain);
spin_lock(&omap_domain->lock);
_omap_iommu_detach_dev(omap_domain, dev);
spin_unlock(&omap_domain->lock);
+   return 0;
 }
 
+static struct iommu_domain_ops omap_iommu_identity_ops = {
+   .attach_dev = omap_iommu_identity_attach,
+};
+
+static struct iommu_domain omap_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &omap_iommu_identity_ops,
+};
+
 static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
 {
struct omap_iommu_domain *omap_domain;
@@ -1732,11 +1747,11 @@ static struct iommu_group 
*omap_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops omap_iommu_ops = {
+   .identity_domain = &omap_iommu_identity_domain,
.domain_alloc   = omap_iommu_domain_alloc,
.probe_device   = omap_iommu_probe_device,
.release_device = omap_iommu_release_device,
.device_group   = omap_iommu_device_group,
-   .set_platform_dma_ops = omap_iommu_set_platform_dma,
.pgsize_bitmap  = OMAP_IOMMU_PGSIZES,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = omap_iommu_attach_dev,
-- 
2.41.0



[PATCH v5 14/25] iommu/msm: Implement an IDENTITY domain

2023-07-24 Thread Jason Gunthorpe
What msm does during msm_iommu_set_platform_dma() is actually putting the
iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA; however, it cannot be
compiled on ARM64 either. Most likely it would be fine to support dma-iommu.c.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/msm_iommu.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 79d89bad5132b7..26ed81cfeee897 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -443,15 +443,20 @@ static int msm_iommu_attach_dev(struct iommu_domain 
*domain, struct device *dev)
return ret;
 }
 
-static void msm_iommu_set_platform_dma(struct device *dev)
+static int msm_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct msm_priv *priv = to_msm_priv(domain);
+   struct msm_priv *priv;
unsigned long flags;
struct msm_iommu_dev *iommu;
struct msm_iommu_ctx_dev *master;
-   int ret;
+   int ret = 0;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   priv = to_msm_priv(domain);
free_io_pgtable_ops(priv->iop);
 
spin_lock_irqsave(&msm_iommu_lock, flags);
@@ -468,8 +473,18 @@ static void msm_iommu_set_platform_dma(struct device *dev)
}
 fail:
spin_unlock_irqrestore(&msm_iommu_lock, flags);
+   return ret;
 }
 
+static struct iommu_domain_ops msm_iommu_identity_ops = {
+   .attach_dev = msm_iommu_identity_attach,
+};
+
+static struct iommu_domain msm_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &msm_iommu_identity_ops,
+};
+
 static int msm_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t pa, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -675,10 +690,10 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 }
 
 static struct iommu_ops msm_iommu_ops = {
+   .identity_domain = &msm_iommu_identity_domain,
.domain_alloc = msm_iommu_domain_alloc,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = msm_iommu_set_platform_dma,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
.of_xlate = qcom_iommu_of_xlate,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.41.0



[PATCH v5 19/25] iommu/mtk_iommu: Add an IOMMU_IDENTITY_DOMAIN

2023-07-24 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/mtk_iommu.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e93906d6e112e8..fdb7f5162b1d64 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -753,6 +753,28 @@ static int mtk_iommu_attach_device(struct iommu_domain 
*domain,
return ret;
 }
 
+static int mtk_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct mtk_iommu_data *data = dev_iommu_priv_get(dev);
+
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   mtk_iommu_config(data, dev, false, 0);
+   return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_identity_ops = {
+   .attach_dev = mtk_iommu_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &mtk_iommu_identity_ops,
+};
+
 static int mtk_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -972,6 +994,7 @@ static void mtk_iommu_get_resv_regions(struct device *dev,
 }
 
 static const struct iommu_ops mtk_iommu_ops = {
+   .identity_domain = &mtk_iommu_identity_domain,
.domain_alloc   = mtk_iommu_domain_alloc,
.probe_device   = mtk_iommu_probe_device,
.release_device = mtk_iommu_release_device,
-- 
2.41.0



[PATCH v5 07/25] iommu/mtk_iommu_v1: Implement an IDENTITY domain

2023-07-24 Thread Jason Gunthorpe
What mtk does during mtk_iommu_v1_set_platform_dma() is actually putting
the iommu into identity mode. Make this available as a proper IDENTITY
domain.

The mtk_iommu_v1_def_domain_type() from
commit 8bbe13f52cb7 ("iommu/mediatek-v1: Add def_domain_type") explains
this was needed to allow probe_finalize() to be called, but now the
IDENTITY domain will do the same job so change the returned
def_domain_type.

mtk_v1 is the only driver that returns IOMMU_DOMAIN_UNMANAGED from
def_domain_type().  This allows the next patch to enforce an IDENTITY
domain policy for this driver.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/mtk_iommu_v1.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 8a0a5e5d049f4a..cc3e7d53d33ad9 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -319,11 +319,27 @@ static int mtk_iommu_v1_attach_device(struct iommu_domain 
*domain, struct device
return 0;
 }
 
-static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+static int mtk_iommu_v1_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
struct mtk_iommu_v1_data *data = dev_iommu_priv_get(dev);
 
mtk_iommu_v1_config(data, dev, false);
+   return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_v1_identity_ops = {
+   .attach_dev = mtk_iommu_v1_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_v1_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &mtk_iommu_v1_identity_ops,
+};
+
+static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+{
+   mtk_iommu_v1_identity_attach(&mtk_iommu_v1_identity_domain, dev);
 }
 
 static int mtk_iommu_v1_map(struct iommu_domain *domain, unsigned long iova,
@@ -443,7 +459,7 @@ static int mtk_iommu_v1_create_mapping(struct device *dev, 
struct of_phandle_arg
 
 static int mtk_iommu_v1_def_domain_type(struct device *dev)
 {
-   return IOMMU_DOMAIN_UNMANAGED;
+   return IOMMU_DOMAIN_IDENTITY;
 }
 
 static struct iommu_device *mtk_iommu_v1_probe_device(struct device *dev)
@@ -578,6 +594,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
 }
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
+   .identity_domain = &mtk_iommu_v1_identity_domain,
.domain_alloc   = mtk_iommu_v1_domain_alloc,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
-- 
2.41.0



[PATCH v5 08/25] iommu: Reorganize iommu_get_default_domain_type() to respect def_domain_type()

2023-07-24 Thread Jason Gunthorpe
Except for dart, every driver returns 0 or IDENTITY from def_domain_type().

The drivers that return IDENTITY have some kind of good reason, typically
that quirky hardware really can't support anything other than IDENTITY.

Arrange things so that if the driver says it needs IDENTITY then
iommu_get_default_domain_type() either fails or returns IDENTITY.  It will
never reject the driver's override to IDENTITY.

The only real functional difference is that the PCI untrusted flag is now
ignored for quirky HW instead of overriding the IOMMU driver.

This makes the next patch cleaner that wants to force IDENTITY always for
ARM_IOMMU because there is no support for DMA.
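
A driver-side override that is now respected unconditionally looks like
this (a sketch with hypothetical "quirky" names, wired in alongside the
driver's other ops):

        static int quirky_def_domain_type(struct device *dev)
        {
                /* quirky HW that can only pass DMA through untranslated */
                return IOMMU_DOMAIN_IDENTITY;
        }

        static const struct iommu_ops quirky_ops = {
                .def_domain_type = quirky_def_domain_type,
        };

With this patch iommu_get_default_domain_type() will either return IDENTITY
for such a group or fail outright; the untrusted flag can no longer silently
downgrade the device to a DMA domain.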

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 66 +--
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c64365169b678d..53174179102d17 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1669,19 +1669,6 @@ struct iommu_group *fsl_mc_device_group(struct device 
*dev)
 }
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
-static int iommu_get_def_domain_type(struct device *dev)
-{
-   const struct iommu_ops *ops = dev_iommu_ops(dev);
-
-   if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
-   return IOMMU_DOMAIN_DMA;
-
-   if (ops->def_domain_type)
-   return ops->def_domain_type(dev);
-
-   return 0;
-}
-
 static struct iommu_domain *
 __iommu_group_alloc_default_domain(const struct bus_type *bus,
   struct iommu_group *group, int req_type)
@@ -1775,36 +1762,49 @@ static int iommu_bus_notifier(struct notifier_block *nb,
 static int iommu_get_default_domain_type(struct iommu_group *group,
 int target_type)
 {
+   const struct iommu_ops *ops = dev_iommu_ops(
+   list_first_entry(&group->devices, struct group_device, list)
+   ->dev);
int best_type = target_type;
struct group_device *gdev;
struct device *last_dev;
+   int type;
 
lockdep_assert_held(&group->mutex);
-
for_each_group_device(group, gdev) {
-   unsigned int type = iommu_get_def_domain_type(gdev->dev);
-
-   if (best_type && type && best_type != type) {
-   if (target_type) {
-   dev_err_ratelimited(
-   gdev->dev,
-   "Device cannot be in %s domain\n",
-   iommu_domain_type_str(target_type));
-   return -1;
-   }
-
-   dev_warn(
-   gdev->dev,
-   "Device needs domain type %s, but device %s in 
the same iommu group requires type %s - using default\n",
-   iommu_domain_type_str(type), dev_name(last_dev),
-   iommu_domain_type_str(best_type));
-   return 0;
+   type = best_type;
+   if (ops->def_domain_type) {
+   type = ops->def_domain_type(gdev->dev);
+   if (best_type && type && best_type != type)
+   goto err;
}
-   if (!best_type)
-   best_type = type;
+
+   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
+   type = IOMMU_DOMAIN_DMA;
+   if (best_type && type && best_type != type)
+   goto err;
+   }
+   best_type = type;
last_dev = gdev->dev;
}
return best_type;
+
+err:
+   if (target_type) {
+   dev_err_ratelimited(
+   gdev->dev,
+   "Device cannot be in %s domain - it is forcing %s\n",
+   iommu_domain_type_str(target_type),
+   iommu_domain_type_str(type));
+   return -1;
+   }
+
+   dev_warn(
+   gdev->dev,
+   "Device needs domain type %s, but device %s in the same iommu 
group requires type %s - using default\n",
+   iommu_domain_type_str(type), dev_name(last_dev),
+   iommu_domain_type_str(best_type));
+   return 0;
 }
 
 static void iommu_group_do_probe_finalize(struct device *dev)
-- 
2.41.0



[PATCH v5 09/25] iommu: Allow an IDENTITY domain as the default_domain in ARM32

2023-07-24 Thread Jason Gunthorpe
Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the
same stuff, the way they relate to the IOMMU core is quite different.

dma-iommu.c expects the core code to set up an UNMANAGED domain (of type
IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This
becomes the default_domain for the group.

ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly
allocates an UNMANAGED domain and operates it just like an external
driver. In this case group->default_domain is NULL.

If the driver provides a global static identity_domain then automatically
use it as the default_domain when in ARM_DMA_USE_IOMMU mode.

This allows drivers that implemented default_domain == NULL as an IDENTITY
translation to trivially get a properly labeled non-NULL default_domain on
ARM32 configs.

With this arrangement, when ARM_DMA_USE_IOMMU wants to disconnect from the
device, the normal detach_domain flow will restore the IDENTITY domain as
the default domain. Overall this makes attach_dev() of the IDENTITY domain
called in the same places as detach_dev().

This effectively migrates these drivers to default_domain mode. Drivers
that support ARM64 will gain support for the IDENTITY translation mode for
the dma_api and behave in a uniform way.

Drivers use this by setting ops->identity_domain to a static singleton
iommu_domain that implements the identity attach. If the core detects
ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain
during probe.

Drivers can continue to prevent the use of DMA translation by returning
IOMMU_DOMAIN_IDENTITY from def_domain_type; this will completely prevent
IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU.

This allows removing the set_platform_dma_ops() from every remaining
driver.

Remove the set_platform_dma_ops from rockchip and mtk_v1 as all it does
is set an existing global static identity domain. mtk_v1 does not support
IOMMU_DOMAIN_DMA and it does not compile on ARM64 so this transformation
is safe.
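
Driver-side, opting in is just this (a sketch with placeholder "foo" names;
the per-driver patches in this series all follow the same shape):

        static int foo_identity_attach(struct iommu_domain *identity_domain,
                                       struct device *dev)
        {
                /* put the HW back into bypass for this device */
                return 0;
        }

        static struct iommu_domain_ops foo_identity_ops = {
                .attach_dev = foo_identity_attach,
        };

        static struct iommu_domain foo_identity_domain = {
                .type = IOMMU_DOMAIN_IDENTITY,
                .ops = &foo_identity_ops,
        };

        /* in the iommu_ops: .identity_domain = &foo_identity_domain */

On an ARM_DMA_USE_IOMMU kernel the core then installs this domain as the
group's default_domain during probe and re-attaches it whenever the arm
mapping code lets go of its UNMANAGED domain.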

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c  | 26 +++---
 drivers/iommu/mtk_iommu_v1.c   | 12 
 drivers/iommu/rockchip-iommu.c | 10 --
 3 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 53174179102d17..a1a93990b3a211 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1771,15 +1771,35 @@ static int iommu_get_default_domain_type(struct 
iommu_group *group,
int type;
 
lockdep_assert_held(&group->mutex);
+
+   /*
+* ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
+* identity_domain and it will automatically become their default
+* domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
+* Override the selection to IDENTITY if we are sure the driver supports
+* it.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {
+   type = IOMMU_DOMAIN_IDENTITY;
+   if (best_type && type && best_type != type)
+   return -1;
+   best_type = target_type = IOMMU_DOMAIN_IDENTITY;
+   }
+
for_each_group_device(group, gdev) {
type = best_type;
if (ops->def_domain_type) {
type = ops->def_domain_type(gdev->dev);
-   if (best_type && type && best_type != type)
+   if (best_type && type && best_type != type) {
+   /* Stick with the last driver override we saw */
+   best_type = type;
goto err;
+   }
}
 
-   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
+   /* No ARM32 using systems will set untrusted, it cannot work. */
+   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted &&
+   !WARN_ON(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU))) {
type = IOMMU_DOMAIN_DMA;
if (best_type && type && best_type != type)
goto err;
@@ -1804,7 +1824,7 @@ static int iommu_get_default_domain_type(struct 
iommu_group *group,
"Device needs domain type %s, but device %s in the same iommu 
group requires type %s - using default\n",
iommu_domain_type_str(type), dev_name(last_dev),
iommu_domain_type_str(best_type));
-   return 0;
+   return best_type;
 }
 
 static void iommu_group_do_probe_finalize(struct device *dev)
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index cc3e7d53d33ad9..7c0c1d50df5f75 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -337,11 +337,6 @@ static struct 

[PATCH v5 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()

2023-07-24 Thread Jason Gunthorpe
These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively
allows them to support that mode.

The prior work to require default_domains makes this safe because every
one of these drivers is either compilation incompatible with dma-iommu.c,
or already establishing a default_domain. In both cases alloc_domain()
will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe
to drop the test.

Removing these tests clarifies that the domain allocation path is only
about the functionality of a paging domain and has nothing to do with
policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.
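
The per-driver conversion shape is the same everywhere (sketch;
alloc_foo_domain() stands in for each driver's existing allocation code):

        /* before: policy check mixed into the allocation */
        static struct iommu_domain *foo_domain_alloc(unsigned type)
        {
                if (type != IOMMU_DOMAIN_UNMANAGED)
                        return NULL;
                return alloc_foo_domain();
        }

        /* after: allocation only, the core owns the policy */
        static struct iommu_domain *foo_domain_alloc_paging(struct device *dev)
        {
                return alloc_foo_domain();
        }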

Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/msm_iommu.c| 7 ++-
 drivers/iommu/mtk_iommu_v1.c | 7 ++-
 drivers/iommu/omap-iommu.c   | 7 ++-
 drivers/iommu/s390-iommu.c   | 7 ++-
 4 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
SET_M(base, ctx, 1);
 }
 
-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
 {
struct msm_priv *priv;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 
 static struct iommu_ops msm_iommu_ops = {
.identity_domain = &msm_iommu_identity_domain,
-   .domain_alloc = msm_iommu_domain_alloc,
+   .domain_alloc_paging = msm_iommu_domain_alloc_paging,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct 
mtk_iommu_v1_data *data)
return 0;
 }
 
-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)
+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device 
*dev)
 {
struct mtk_iommu_v1_domain *dom;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!dom)
return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
.identity_domain = &mtk_iommu_v1_identity_domain,
-   .domain_alloc   = mtk_iommu_v1_domain_alloc,
+   .domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13 +1580,10 @@ static struct iommu_domain omap_iommu_identity_domain = 
{
.ops = _iommu_identity_ops,
 };
 
-static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *omap_iommu_domain_alloc_paging(struct device *dev)
 {
struct omap_iommu_domain *omap_domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
omap_domain = kzalloc(sizeof(*omap_domain), GFP_KERNEL);
if (!omap_domain)
return NULL;
@@ -1748,7 +1745,7 @@ static struct iommu_group *omap_iommu_device_group(struct 
device *dev)
 
 static const struct iommu_ops omap_iommu_ops = {
.identity_domain = &omap_iommu_identity_domain,
-   .domain_alloc   = omap_iommu_domain_alloc,
+   .domain_alloc_paging = omap_iommu_domain_alloc_paging,
.probe_device   = omap_iommu_probe_device,
.release_device = omap_iommu_release_device,
.device_group   = omap_iommu_device_group,
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index f0c867c57a5b9b..5695ad71d60e24 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -39,13 +39,10 @@ static bool s390_iommu_capable(struct device *dev, enum 
iommu_cap cap)
}
 }
 
-static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
+static struct iommu_domain *s390_domain_alloc_paging(struct device *dev)
 {
struct s390_domain *s390_domain;
 
-   if (domain_type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
s390_domain = kzalloc(sizeof(*s390_domain), 

[PATCH v5 24/25] iommu: Convert simple drivers with DOMAIN_DMA to domain_alloc_paging()

2023-07-24 Thread Jason Gunthorpe
These drivers are all trivially converted since the function is only
called if the domain type is going to be
IOMMU_DOMAIN_UNMANAGED/DMA.
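
This works because both requested types are paging domains; from
include/linux/iommu.h:

        #define IOMMU_DOMAIN_UNMANAGED  (__IOMMU_DOMAIN_PAGING)
        #define IOMMU_DOMAIN_DMA        (__IOMMU_DOMAIN_PAGING | __IOMMU_DOMAIN_DMA_API)

so a single domain_alloc_paging() callback serves both, and the choice
between UNMANAGED, DMA and DMA_FQ stays a core-code policy decision.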

Tested-by: Heiko Stuebner 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 6 ++
 drivers/iommu/exynos-iommu.c| 7 ++-
 drivers/iommu/ipmmu-vmsa.c  | 7 ++-
 drivers/iommu/mtk_iommu.c   | 7 ++-
 drivers/iommu/rockchip-iommu.c  | 7 ++-
 drivers/iommu/sprd-iommu.c  | 7 ++-
 drivers/iommu/sun50i-iommu.c| 9 +++--
 drivers/iommu/tegra-smmu.c  | 7 ++-
 8 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 9d7b9d8b4386d4..a2140fdc65ed58 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -319,12 +319,10 @@ static int qcom_iommu_init_domain(struct iommu_domain 
*domain,
return ret;
 }
 
-static struct iommu_domain *qcom_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *qcom_iommu_domain_alloc_paging(struct device *dev)
 {
struct qcom_iommu_domain *qcom_domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-   return NULL;
/*
 * Allocate the domain and initialise some of its data structures.
 * We can't really do anything meaningful until we've added a
@@ -593,7 +591,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct 
of_phandle_args *args)
 static const struct iommu_ops qcom_iommu_ops = {
.identity_domain = &qcom_iommu_identity_domain,
.capable= qcom_iommu_capable,
-   .domain_alloc   = qcom_iommu_domain_alloc,
+   .domain_alloc_paging = qcom_iommu_domain_alloc_paging,
.probe_device   = qcom_iommu_probe_device,
.device_group   = generic_device_group,
.of_xlate   = qcom_iommu_of_xlate,
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 5e12b85dfe8705..d6dead2ed10c11 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -887,7 +887,7 @@ static inline void exynos_iommu_set_pte(sysmmu_pte_t *ent, 
sysmmu_pte_t val)
   DMA_TO_DEVICE);
 }
 
-static struct iommu_domain *exynos_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *exynos_iommu_domain_alloc_paging(struct device 
*dev)
 {
struct exynos_iommu_domain *domain;
dma_addr_t handle;
@@ -896,9 +896,6 @@ static struct iommu_domain 
*exynos_iommu_domain_alloc(unsigned type)
/* Check if correct PTE offsets are initialized */
BUG_ON(PG_ENT_SHIFT < 0 || !dma_dev);
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!domain)
return NULL;
@@ -1472,7 +1469,7 @@ static int exynos_iommu_of_xlate(struct device *dev,
 
 static const struct iommu_ops exynos_iommu_ops = {
.identity_domain = &exynos_identity_domain,
-   .domain_alloc = exynos_iommu_domain_alloc,
+   .domain_alloc_paging = exynos_iommu_domain_alloc_paging,
.device_group = generic_device_group,
.probe_device = exynos_iommu_probe_device,
.release_device = exynos_iommu_release_device,
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index de958e411a92e0..27d36347e0fced 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -566,13 +566,10 @@ static irqreturn_t ipmmu_irq(int irq, void *dev)
  * IOMMU Operations
  */
 
-static struct iommu_domain *ipmmu_domain_alloc(unsigned type)
+static struct iommu_domain *ipmmu_domain_alloc_paging(struct device *dev)
 {
struct ipmmu_vmsa_domain *domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-   return NULL;
-
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!domain)
return NULL;
@@ -891,7 +888,7 @@ static struct iommu_group *ipmmu_find_group(struct device 
*dev)
 
 static const struct iommu_ops ipmmu_ops = {
.identity_domain = &ipmmu_iommu_identity_domain,
-   .domain_alloc = ipmmu_domain_alloc,
+   .domain_alloc_paging = ipmmu_domain_alloc_paging,
.probe_device = ipmmu_probe_device,
.release_device = ipmmu_release_device,
.probe_finalize = ipmmu_probe_finalize,
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index fdb7f5162b1d64..3590d3399add32 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -667,13 +667,10 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom,
return 0;
 }
 
-static struct iommu_domain *mtk_iommu_domain_alloc(unsigned type)
+static struct iommu_domain 

[PATCH v5 18/25] iommu/ipmmu: Add an IOMMU_IDENTITY_DOMAIN

2023-07-24 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Also reverts commit 584d334b1393 ("iommu/ipmmu-vmsa: Remove
ipmmu_utlb_disable()")

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/ipmmu-vmsa.c | 43 ++
 1 file changed, 43 insertions(+)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 9f64c5c9f5b90a..de958e411a92e0 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -298,6 +298,18 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain 
*domain,
mmu->utlb_ctx[utlb] = domain->context_id;
 }
 
+/*
+ * Disable MMU translation for the microTLB.
+ */
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
+  unsigned int utlb)
+{
+   struct ipmmu_vmsa_device *mmu = domain->mmu;
+
+   ipmmu_imuctr_write(mmu, utlb, 0);
+   mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
+}
+
 static void ipmmu_tlb_flush_all(void *cookie)
 {
struct ipmmu_vmsa_domain *domain = cookie;
@@ -630,6 +642,36 @@ static int ipmmu_attach_device(struct iommu_domain 
*io_domain,
return 0;
 }
 
+static int ipmmu_iommu_identity_attach(struct iommu_domain *identity_domain,
+  struct device *dev)
+{
+   struct iommu_domain *io_domain = iommu_get_domain_for_dev(dev);
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct ipmmu_vmsa_domain *domain;
+   unsigned int i;
+
+   if (io_domain == identity_domain || !io_domain)
+   return 0;
+
+   domain = to_vmsa_domain(io_domain);
+   for (i = 0; i < fwspec->num_ids; ++i)
+   ipmmu_utlb_disable(domain, fwspec->ids[i]);
+
+   /*
+* TODO: Optimize by disabling the context when no device is attached.
+*/
+   return 0;
+}
+
+static struct iommu_domain_ops ipmmu_iommu_identity_ops = {
+   .attach_dev = ipmmu_iommu_identity_attach,
+};
+
+static struct iommu_domain ipmmu_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &ipmmu_iommu_identity_ops,
+};
+
 static int ipmmu_map(struct iommu_domain *io_domain, unsigned long iova,
 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -848,6 +890,7 @@ static struct iommu_group *ipmmu_find_group(struct device 
*dev)
 }
 
 static const struct iommu_ops ipmmu_ops = {
+   .identity_domain = &ipmmu_iommu_identity_domain,
.domain_alloc = ipmmu_domain_alloc,
.probe_device = ipmmu_probe_device,
.release_device = ipmmu_release_device,
-- 
2.41.0



[PATCH v5 06/25] iommu/tegra-gart: Remove tegra-gart

2023-07-24 Thread Jason Gunthorpe
Thierry says this is not used anymore, and doesn't think it makes sense as
an iommu driver. The HW it supports is about 10 years old now and newer HW
uses different IOMMU drivers.

As this is the only driver with a GART approach, and it doesn't really
meet the driver expectations from the IOMMU core, let's just remove it
so we don't have to think about how to make it fit in.

It has a number of identified problems:
 - The assignment of iommu_groups doesn't match the HW behavior

 - It claims to have an UNMANAGED domain but it is really an IDENTITY
   domain with a translation aperture. This is inconsistent with the core
   expectation for security sensitive operations

 - It doesn't implement a SW page table under struct iommu_domain so
   * It can't accept a map until the domain is attached
   * It forgets about all maps after the domain is detached
   * It doesn't clear the HW of maps once the domain is detached
 (made worse by having the wrong groups)

Cc: Thierry Reding 
Cc: Dmitry Osipenko 
Acked-by: Thierry Reding 
Signed-off-by: Jason Gunthorpe 
---
 arch/arm/configs/multi_v7_defconfig |   1 -
 arch/arm/configs/tegra_defconfig|   1 -
 drivers/iommu/Kconfig   |  11 -
 drivers/iommu/Makefile  |   1 -
 drivers/iommu/tegra-gart.c  | 371 
 drivers/memory/tegra/mc.c   |  34 ---
 drivers/memory/tegra/tegra20.c  |  28 ---
 include/soc/tegra/mc.h  |  26 --
 8 files changed, 473 deletions(-)
 delete mode 100644 drivers/iommu/tegra-gart.c

diff --git a/arch/arm/configs/multi_v7_defconfig 
b/arch/arm/configs/multi_v7_defconfig
index f0800f806b5f62..c7e63e54a400e9 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -1063,7 +1063,6 @@ CONFIG_BCM2835_MBOX=y
 CONFIG_QCOM_APCS_IPC=y
 CONFIG_QCOM_IPCC=y
 CONFIG_ROCKCHIP_IOMMU=y
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_EXYNOS_IOMMU=y
 CONFIG_QCOM_IOMMU=y
diff --git a/arch/arm/configs/tegra_defconfig b/arch/arm/configs/tegra_defconfig
index 3c6af935e9328a..79141dddb037a9 100644
--- a/arch/arm/configs/tegra_defconfig
+++ b/arch/arm/configs/tegra_defconfig
@@ -292,7 +292,6 @@ CONFIG_CHROME_PLATFORMS=y
 CONFIG_CROS_EC=y
 CONFIG_CROS_EC_I2C=m
 CONFIG_CROS_EC_SPI=m
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_ARCH_TEGRA_2x_SOC=y
 CONFIG_ARCH_TEGRA_3x_SOC=y
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 2b12b583ef4b1e..cd6727898b1175 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -236,17 +236,6 @@ config SUN50I_IOMMU
help
  Support for the IOMMU introduced in the Allwinner H6 SoCs.
 
-config TEGRA_IOMMU_GART
-   bool "Tegra GART IOMMU Support"
-   depends on ARCH_TEGRA_2x_SOC
-   depends on TEGRA_MC
-   select IOMMU_API
-   help
- Enables support for remapping discontiguous physical memory
- shared with the operating system into contiguous I/O virtual
- space through the GART (Graphics Address Relocation Table)
- hardware included on Tegra SoCs.
-
 config TEGRA_IOMMU_SMMU
bool "NVIDIA Tegra SMMU Support"
depends on ARCH_TEGRA
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 769e43d780ce89..95ad9dbfbda022 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -20,7 +20,6 @@ obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o
 obj-$(CONFIG_OMAP_IOMMU_DEBUG) += omap-iommu-debug.o
 obj-$(CONFIG_ROCKCHIP_IOMMU) += rockchip-iommu.o
 obj-$(CONFIG_SUN50I_IOMMU) += sun50i-iommu.o
-obj-$(CONFIG_TEGRA_IOMMU_GART) += tegra-gart.o
 obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
 obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
deleted file mode 100644
index a482ff838b5331..00
--- a/drivers/iommu/tegra-gart.c
+++ /dev/null
@@ -1,371 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * IOMMU API for Graphics Address Relocation Table on Tegra20
- *
- * Copyright (c) 2010-2012, NVIDIA CORPORATION.  All rights reserved.
- *
- * Author: Hiroshi DOYU 
- */
-
-#define dev_fmt(fmt)   "gart: " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-#define GART_REG_BASE  0x24
-#define GART_CONFIG(0x24 - GART_REG_BASE)
-#define GART_ENTRY_ADDR(0x28 - GART_REG_BASE)
-#define GART_ENTRY_DATA(0x2c - GART_REG_BASE)
-
-#define GART_ENTRY_PHYS_ADDR_VALID BIT(31)
-
-#define GART_PAGE_SHIFT12
-#define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT)
-#define GART_PAGE_MASK GENMASK(30, GART_PAGE_SHIFT)
-
-/* bitmap of the page sizes currently supported */
-#define GART_IOMMU_PGSIZES (GART_PAGE_SIZE)
-
-struct gart_device {
-   void __iomem*regs;
-   u32 *savedata;
-   unsigned long   

[PATCH v5 16/25] iommu: Remove ops->set_platform_dma_ops()

2023-07-24 Thread Jason Gunthorpe
All drivers are now using IDENTITY or PLATFORM domains for what this did,
so we can remove it now. It is no longer possible to attach to a NULL domain.
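
For callers of the group API the visible rule becomes (sketch of e.g. the
VFIO flow):

        ret = iommu_attach_group(domain, group);
        if (!ret)
                iommu_detach_group(domain, group);
        /* detach no longer leaves a NULL domain behind: the core re-attaches
         * the group's default/blocking domain, and internally a NULL
         * new_domain is now WARN_ON + -EINVAL (see the hunk below)
         */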

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 30 +-
 include/linux/iommu.h |  4 
 2 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7fae866af0db7a..dada2c00d78ca4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2264,21 +2264,8 @@ static int __iommu_group_set_domain_internal(struct 
iommu_group *group,
if (group->domain == new_domain)
return 0;
 
-   /*
-* New drivers should support default domains, so set_platform_dma()
-* op will never be called. Otherwise the NULL domain represents some
-* platform specific behavior.
-*/
-   if (!new_domain) {
-   for_each_group_device(group, gdev) {
-   const struct iommu_ops *ops = dev_iommu_ops(gdev->dev);
-
-   if (!WARN_ON(!ops->set_platform_dma_ops))
-   ops->set_platform_dma_ops(gdev->dev);
-   }
-   group->domain = NULL;
-   return 0;
-   }
+   if (WARN_ON(!new_domain))
+   return -EINVAL;
 
/*
 * Changing the domain is done by calling attach_dev() on the new
@@ -2314,19 +2301,15 @@ static int __iommu_group_set_domain_internal(struct 
iommu_group *group,
 */
last_gdev = gdev;
for_each_group_device(group, gdev) {
-   const struct iommu_ops *ops = dev_iommu_ops(gdev->dev);
-
/*
-* If set_platform_dma_ops is not present a NULL domain can
-* happen only for first probe, in which case we leave
-* group->domain as NULL and let release clean everything up.
+* A NULL domain can happen only for first probe, in which case
+* we leave group->domain as NULL and let release clean
+* everything up.
 */
if (group->domain)
WARN_ON(__iommu_device_set_domain(
group, gdev->dev, group->domain,
IOMMU_SET_DOMAIN_MUST_SUCCEED));
-   else if (ops->set_platform_dma_ops)
-   ops->set_platform_dma_ops(gdev->dev);
if (gdev == last_gdev)
break;
}
@@ -2940,9 +2923,6 @@ static int iommu_setup_default_domain(struct iommu_group 
*group,
/*
 * There are still some drivers which don't support default domains, so
 * we ignore the failure and leave group->default_domain NULL.
-*
-* We assume that the iommu driver starts up the device in
-* 'set_platform_dma_ops' mode if it does not support default domains.
 */
dom = iommu_group_alloc_default_domain(group, req_type);
if (!dom) {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 87aebba474e093..df54066c262db4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -237,9 +237,6 @@ struct iommu_iotlb_gather {
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
  *  group and attached to the groups domain
- * @set_platform_dma_ops: Returning control back to the platform DMA ops. This 
op
- *is to support old IOMMU drivers, new drivers should 
use
- *default domains, and the common IOMMU DMA ops.
  * @device_group: find iommu group for a particular device
  * @get_resv_regions: Request list of reserved regions for a device
  * @of_xlate: add OF master IDs to iommu grouping
@@ -271,7 +268,6 @@ struct iommu_ops {
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
void (*probe_finalize)(struct device *dev);
-   void (*set_platform_dma_ops)(struct device *dev);
struct iommu_group *(*device_group)(struct device *dev);
 
/* Request/Free a list of reserved regions for a device */
-- 
2.41.0



[PATCH v5 10/25] iommu/exynos: Implement an IDENTITY domain

2023-07-24 Thread Jason Gunthorpe
What exynos calls exynos_iommu_detach_device is actually putting the iommu
into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

Tested-by: Marek Szyprowski 
Acked-by: Marek Szyprowski 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/exynos-iommu.c | 66 +---
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index c275fe71c4db32..5e12b85dfe8705 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -24,6 +24,7 @@
 
 typedef u32 sysmmu_iova_t;
 typedef u32 sysmmu_pte_t;
+static struct iommu_domain exynos_identity_domain;
 
 /* We do not consider super section mapping (16MB) */
 #define SECT_ORDER 20
@@ -829,7 +830,7 @@ static int __maybe_unused exynos_sysmmu_suspend(struct 
device *dev)
struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
mutex_lock(&owner->rpm_lock);
-   if (data->domain) {
+   if (&data->domain->domain != &exynos_identity_domain) {
dev_dbg(data->sysmmu, "saving state\n");
__sysmmu_disable(data);
}
@@ -847,7 +848,7 @@ static int __maybe_unused exynos_sysmmu_resume(struct 
device *dev)
struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
mutex_lock(&owner->rpm_lock);
-   if (data->domain) {
+   if (&data->domain->domain != &exynos_identity_domain) {
dev_dbg(data->sysmmu, "restoring state\n");
__sysmmu_enable(data);
}
@@ -980,17 +981,20 @@ static void exynos_iommu_domain_free(struct iommu_domain 
*iommu_domain)
kfree(domain);
 }
 
-static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain,
-   struct device *dev)
+static int exynos_iommu_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
-   struct exynos_iommu_domain *domain = to_exynos_domain(iommu_domain);
struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
-   phys_addr_t pagetable = virt_to_phys(domain->pgtable);
+   struct exynos_iommu_domain *domain;
+   phys_addr_t pagetable;
struct sysmmu_drvdata *data, *next;
unsigned long flags;
 
-   if (!has_sysmmu(dev) || owner->domain != iommu_domain)
-   return;
+   if (owner->domain == identity_domain)
+   return 0;
+
+   domain = to_exynos_domain(owner->domain);
+   pagetable = virt_to_phys(domain->pgtable);
 
mutex_lock(&owner->rpm_lock);
 
@@ -1009,15 +1013,25 @@ static void exynos_iommu_detach_device(struct 
iommu_domain *iommu_domain,
list_del_init(&data->domain_node);
spin_unlock(&data->lock);
}
-   owner->domain = NULL;
+   owner->domain = identity_domain;
spin_unlock_irqrestore(&owner->lock, flags);
 
mutex_unlock(&owner->rpm_lock);
 
-   dev_dbg(dev, "%s: Detached IOMMU with pgtable %pa\n", __func__,
-   &pagetable);
+   dev_dbg(dev, "%s: Restored IOMMU to IDENTITY from pgtable %pa\n",
+   __func__, &pagetable);
+   return 0;
 }
 
+static struct iommu_domain_ops exynos_identity_ops = {
+   .attach_dev = exynos_iommu_identity_attach,
+};
+
+static struct iommu_domain exynos_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &exynos_identity_ops,
+};
+
 static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain,
   struct device *dev)
 {
@@ -1026,12 +1040,11 @@ static int exynos_iommu_attach_device(struct 
iommu_domain *iommu_domain,
struct sysmmu_drvdata *data;
phys_addr_t pagetable = virt_to_phys(domain->pgtable);
unsigned long flags;
+   int err;
 
-   if (!has_sysmmu(dev))
-   return -ENODEV;
-
-   if (owner->domain)
-   exynos_iommu_detach_device(owner->domain, dev);
+   err = exynos_iommu_identity_attach(&exynos_identity_domain, dev);
+   if (err)
+   return err;
 
mutex_lock(&owner->rpm_lock);
 
@@ -1407,26 +1420,12 @@ static struct iommu_device 
*exynos_iommu_probe_device(struct device *dev)
return &data->iommu;
 }
 
-static void exynos_iommu_set_platform_dma(struct device *dev)
-{
-   struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
-
-   if (owner->domain) {
-   struct iommu_group *group = iommu_group_get(dev);
-
-   if (group) {
-   exynos_iommu_detach_device(owner->domain, dev);
-   iommu_group_put(group);
-   }
-   }
-}
-
 static void exynos_iommu_release_device(struct device *dev)
 {
struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
struct sysmmu_drvdata *data;
 
-   exynos_iommu_set_platform_dma(dev);
+   

[PATCH v5 03/25] powerpc/iommu: Setup a default domain and remove set_platform_dma_ops

2023-07-24 Thread Jason Gunthorpe
POWER is using the set_platform_dma_ops() callback to hook up its private
dma_ops, but this is buried under some indirection and is weirdly
happening for a BLOCKED domain as well.

For better documentation create a PLATFORM domain to manage the dma_ops,
since that is what it is for, and make the BLOCKED domain an alias for
it. BLOCKED is required for VFIO.

Also removes the leaky allocation of the BLOCKED domain by using a global
static.
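
The leak referred to above: the old spapr_tce_iommu_domain_alloc() did

        dom = kzalloc(sizeof(*dom), GFP_KERNEL);

for every IOMMU_DOMAIN_BLOCKED request, but the domain ops carry no .free
callback, so nothing ever released the allocation. The global static removes
the allocation entirely and is safe because the domain holds no per-attach
state.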

Signed-off-by: Jason Gunthorpe 
---
 arch/powerpc/kernel/iommu.c | 38 +
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c52449ae6936ad..ffe8d1411a9d56 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1269,7 +1269,7 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = {
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
-static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom,
+static int spapr_tce_platform_iommu_attach_dev(struct iommu_domain *dom,
   struct device *dev)
 {
struct iommu_group *grp = iommu_group_get(dev);
@@ -1286,17 +1286,22 @@ static int spapr_tce_blocking_iommu_attach_dev(struct 
iommu_domain *dom,
return ret;
 }
 
-static void spapr_tce_blocking_iommu_set_platform_dma(struct device *dev)
-{
-   struct iommu_group *grp = iommu_group_get(dev);
-   struct iommu_table_group *table_group;
+static const struct iommu_domain_ops spapr_tce_platform_domain_ops = {
+   .attach_dev = spapr_tce_platform_iommu_attach_dev,
+};
 
-   table_group = iommu_group_get_iommudata(grp);
-   table_group->ops->release_ownership(table_group);
-}
+static struct iommu_domain spapr_tce_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &spapr_tce_platform_domain_ops,
+};
 
-static const struct iommu_domain_ops spapr_tce_blocking_domain_ops = {
-   .attach_dev = spapr_tce_blocking_iommu_attach_dev,
+static struct iommu_domain spapr_tce_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   /*
+* FIXME: SPAPR mixes blocked and platform behaviors, the blocked domain
+* also sets the dma_api ops
+*/
+   .ops = &spapr_tce_platform_domain_ops,
 };
 
 static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)
@@ -1313,18 +1318,9 @@ static bool spapr_tce_iommu_capable(struct device *dev, 
enum iommu_cap cap)
 
 static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
 {
-   struct iommu_domain *dom;
-
if (type != IOMMU_DOMAIN_BLOCKED)
return NULL;
-
-   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
-   if (!dom)
-   return NULL;
-
-   dom->ops = &spapr_tce_blocking_domain_ops;
-
-   return dom;
+   return &spapr_tce_blocked_domain;
 }
 
 static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
@@ -1360,12 +1356,12 @@ static struct iommu_group 
*spapr_tce_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops spapr_tce_iommu_ops = {
+   .default_domain = &spapr_tce_platform_domain,
.capable = spapr_tce_iommu_capable,
.domain_alloc = spapr_tce_iommu_domain_alloc,
.probe_device = spapr_tce_iommu_probe_device,
.release_device = spapr_tce_iommu_release_device,
.device_group = spapr_tce_iommu_device_group,
-   .set_platform_dma_ops = spapr_tce_blocking_iommu_set_platform_dma,
 };
 
 static struct attribute *spapr_tce_iommu_attrs[] = {
-- 
2.41.0



[PATCH v5 22/25] iommu: Add __iommu_group_domain_alloc()

2023-07-24 Thread Jason Gunthorpe
Allocate a domain from a group. Automatically obtains the iommu_ops to use
from the device list of the group. Convert the internal callers to use it.
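
The helper itself boils down to this (schematic; the hunks below show the
converted callers):

        static struct iommu_domain *
        __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)
        {
                /* All devices in a group are served by one iommu driver, so
                 * any member device can supply the ops.
                 */
                struct device *dev =
                        list_first_entry(&group->devices,
                                         struct group_device, list)->dev;

                lockdep_assert_held(&group->mutex);
                return __iommu_domain_alloc(dev_iommu_ops(dev), type);
        }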

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 66 ---
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1533e65d075bce..bc8b35e31b5343 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -94,8 +94,8 @@ static const char * const iommu_group_resv_type_string[] = {
 static int iommu_bus_notifier(struct notifier_block *nb,
  unsigned long action, void *data);
 static void iommu_release_device(struct device *dev);
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-unsigned type);
+static struct iommu_domain *
+__iommu_group_domain_alloc(struct iommu_group *group, unsigned int type);
 static int __iommu_attach_device(struct iommu_domain *domain,
 struct device *dev);
 static int __iommu_attach_group(struct iommu_domain *domain,
@@ -1713,12 +1713,11 @@ struct iommu_group *fsl_mc_device_group(struct device 
*dev)
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
 static struct iommu_domain *
-__iommu_group_alloc_default_domain(const struct bus_type *bus,
-  struct iommu_group *group, int req_type)
+__iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
if (group->default_domain && group->default_domain->type == req_type)
return group->default_domain;
-   return __iommu_domain_alloc(bus, req_type);
+   return __iommu_group_domain_alloc(group, req_type);
 }
 
 /*
@@ -1728,9 +1727,10 @@ __iommu_group_alloc_default_domain(const struct bus_type 
*bus,
 static struct iommu_domain *
 iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
-   const struct bus_type *bus =
+   struct device *dev =
+   list_first_entry(&group->devices, struct group_device, list)
-   ->dev->bus;
+   ->dev;
+   const struct iommu_ops *ops = dev_iommu_ops(dev);
struct iommu_domain *dom;
 
lockdep_assert_held(&group->mutex);
@@ -1740,24 +1740,24 @@ iommu_group_alloc_default_domain(struct iommu_group 
*group, int req_type)
 * domain. This should always be either an IDENTITY or PLATFORM domain.
 * Do not use in new drivers.
 */
-   if (bus->iommu_ops->default_domain) {
+   if (ops->default_domain) {
if (req_type)
return ERR_PTR(-EINVAL);
-   return bus->iommu_ops->default_domain;
+   return ops->default_domain;
}
 
if (req_type)
-   return __iommu_group_alloc_default_domain(bus, group, req_type);
+   return __iommu_group_alloc_default_domain(group, req_type);
 
/* The driver gave no guidance on what type to use, try the default */
-   dom = __iommu_group_alloc_default_domain(bus, group, 
iommu_def_domain_type);
+   dom = __iommu_group_alloc_default_domain(group, iommu_def_domain_type);
if (dom)
return dom;
 
/* Otherwise IDENTITY and DMA_FQ defaults will try DMA */
if (iommu_def_domain_type == IOMMU_DOMAIN_DMA)
return NULL;
-   dom = __iommu_group_alloc_default_domain(bus, group, IOMMU_DOMAIN_DMA);
+   dom = __iommu_group_alloc_default_domain(group, IOMMU_DOMAIN_DMA);
if (!dom)
return NULL;
 
@@ -1998,19 +1998,16 @@ void iommu_set_fault_handler(struct iommu_domain 
*domain,
 }
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-unsigned type)
+static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+unsigned int type)
 {
struct iommu_domain *domain;
unsigned int alloc_type = type & IOMMU_DOMAIN_ALLOC_FLAGS;
 
-   if (bus == NULL || bus->iommu_ops == NULL)
-   return NULL;
+   if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
+   return ops->identity_domain;
 
-   if (alloc_type == IOMMU_DOMAIN_IDENTITY && 
bus->iommu_ops->identity_domain)
-   return bus->iommu_ops->identity_domain;
-
-   domain = bus->iommu_ops->domain_alloc(alloc_type);
+   domain = ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
 
@@ -2020,10 +2017,10 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct bus_type *bus,
 * may override this later
 */
if (!domain->pgsize_bitmap)
-   domain->pgsize_bitmap = 

[PATCH v5 01/25] iommu: Add iommu_ops->identity_domain

2023-07-24 Thread Jason Gunthorpe
This allows a driver to set a global static to an IDENTITY domain and
the core code will automatically use it whenever an IDENTITY domain
is requested.

By making it always available it means the IDENTITY can be used in error
handling paths to force the iommu driver into a known state. Drivers
implementing global static identity domains should avoid failing their
attach_dev ops.

Convert rockchip to use the new mechanism.
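
The "always available" property is what enables error-path usage like this
(an illustrative sketch, not a hunk from this patch):

        ret = __iommu_device_set_domain(group, dev, new_domain, 0);
        if (ret) {
                /* force the device into a known state; a static identity
                 * domain's attach_dev is expected not to fail
                 */
                __iommu_device_set_domain(group, dev, ops->identity_domain,
                                          IOMMU_SET_DOMAIN_MUST_SUCCEED);
        }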

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c  | 3 +++
 drivers/iommu/rockchip-iommu.c | 9 +
 include/linux/iommu.h  | 3 +++
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4352a149a935e8..5e3cdc9f3a9e78 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1931,6 +1931,9 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct bus_type *bus,
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
 
+   if (alloc_type == IOMMU_DOMAIN_IDENTITY && 
bus->iommu_ops->identity_domain)
+   return bus->iommu_ops->identity_domain;
+
domain = bus->iommu_ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 8ff69fbf9f65db..033678f2f8b3ab 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -989,13 +989,8 @@ static int rk_iommu_identity_attach(struct iommu_domain 
*identity_domain,
return 0;
 }
 
-static void rk_iommu_identity_free(struct iommu_domain *domain)
-{
-}
-
 static struct iommu_domain_ops rk_identity_ops = {
.attach_dev = rk_iommu_identity_attach,
-   .free = rk_iommu_identity_free,
 };
 
 static struct iommu_domain rk_identity_domain = {
@@ -1059,9 +1054,6 @@ static struct iommu_domain 
*rk_iommu_domain_alloc(unsigned type)
 {
struct rk_iommu_domain *rk_domain;
 
-   if (type == IOMMU_DOMAIN_IDENTITY)
-   return &rk_identity_domain;
-
if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
return NULL;
 
@@ -1186,6 +1178,7 @@ static int rk_iommu_of_xlate(struct device *dev,
 }
 
 static const struct iommu_ops rk_iommu_ops = {
+   .identity_domain = &rk_identity_domain,
.domain_alloc = rk_iommu_domain_alloc,
.probe_device = rk_iommu_probe_device,
.release_device = rk_iommu_release_device,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b1dcb1b9b17040..e05c93b6c37fba 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -254,6 +254,8 @@ struct iommu_iotlb_gather {
  *will be blocked by the hardware.
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @owner: Driver module providing these ops
+ * @identity_domain: An always available, always attachable identity
+ *   translation.
  */
 struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
@@ -287,6 +289,7 @@ struct iommu_ops {
const struct iommu_domain_ops *default_domain_ops;
unsigned long pgsize_bitmap;
struct module *owner;
+   struct iommu_domain *identity_domain;
 };
 
 /**
-- 
2.41.0



[PATCH v5 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390

2023-07-24 Thread Jason Gunthorpe
The PLATFORM domain will be set as the default domain and attached as
normal during probe. The driver will ignore the initial attach from a NULL
domain to the PLATFORM domain.

After this, the PLATFORM domain's attach_dev will be called whenever we
detach from an UNMANAGED domain (eg for VFIO). This is the same time the
original design would have called op->detach_dev().

This is temporary until the S390 dma-iommu.c conversion is merged.
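
The resulting lifecycle on s390, in attach order (schematic):

        /*
         * probe:       NULL -> PLATFORM      attach ignored, no s390_domain yet
         * VFIO attach: PLATFORM -> UNMANAGED dma_api ops torn down
         * VFIO detach: UNMANAGED -> PLATFORM the old detach_dev() point:
         *              __s390_iommu_detach_device() + zpci_dma_init_device()
         */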

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/s390-iommu.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain 
*domain,
return 0;
 }
 
-static void s390_iommu_set_platform_dma(struct device *dev)
+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+ struct device *dev)
 {
struct zpci_dev *zdev = to_zpci_dev(dev);
 
+   if (!zdev->s390_domain)
+   return 0;
+
__s390_iommu_detach_device(zdev);
zpci_dma_init_device(zdev);
+   return 0;
 }
 
+static struct iommu_domain_ops s390_iommu_platform_ops = {
+   .attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &s390_iommu_platform_ops,
+};
+
 static void s390_iommu_get_resv_regions(struct device *dev,
struct list_head *list)
 {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
 }
 
 static const struct iommu_ops s390_iommu_ops = {
+   .default_domain = &s390_iommu_platform_domain,
.capable = s390_iommu_capable,
.domain_alloc = s390_domain_alloc,
.probe_device = s390_iommu_probe_device,
.release_device = s390_iommu_release_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = s390_iommu_set_platform_dma,
.pgsize_bitmap = SZ_4K,
.get_resv_regions = s390_iommu_get_resv_regions,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.41.0


