[PATCH mm 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins
---
This is the version which applies to mm-unstable or linux-next.

 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struc
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -758,8 +743,6 @@ static inline bool vma_start_read(struct
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 						bool detached) {}
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killab
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
[PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]()
Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.

Signed-off-by: Hugh Dickins
---
 mm/pgtable-generic.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index fa9d4d084291..4fcd959dcc4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -315,6 +315,50 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+/*
+ * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
+ * __pte_offset_map_lock() below, is usually called with the pmd pointer for
+ * addr, reached by walking down the mm's pgd, p4d, pud for addr: either while
+ * holding mmap_lock or vma lock for read or for write; or in truncate or rmap
+ * context, while holding file's i_mmap_lock or anon_vma lock for read (or for
+ * write).  In a few cases, it may be used with pmd pointing to a pmd_t already
+ * copied to or constructed on the stack.
+ *
+ * When successful, it returns the pte pointer for addr, with its page table
+ * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent
+ * modification by software, with a pointer to that spinlock in ptlp (in some
+ * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's
+ * struct page).  pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards.
+ *
+ * But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no
+ * page table at *pmd: if, for example, the page table has just been removed,
+ * or replaced by the huge pmd of a THP.  (When successful, *pmd is rechecked
+ * after acquiring the ptlock, and retried internally if it changed: so that a
+ * page table can be safely removed or replaced by THP while holding its lock.)
+ *
+ * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above,
+ * just returns the pte pointer for addr, its page table kmapped if necessary;
+ * or NULL if there is no page table at *pmd.  It does not attempt to lock the
+ * page table, so cannot normally be used when the page table is to be updated,
+ * or when entries read must be stable.  But it does take rcu_read_lock(): so
+ * that even when page table is racily removed, it remains a valid though empty
+ * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
+ * afterwards.
+ *
+ * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it.  This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
+ * pointer for the page table that it returns.  In principle, the caller should
+ * recheck *pmd once the lock is taken; in practice, no callsite needs that -
+ * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ *
+ * Note that free_pgtables(), used after unmapping detached vmas, or when
+ * exiting the whole mm, does not take page table lock before freeing a page
+ * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
+ * pte_offset_map() and co once the vma is detached from mm or mm_users is zero.
+ */
 pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, spinlock_t **ptlp)
 {
-- 
2.35.3
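The recheck-after-acquiring pattern the comments describe (load *pmd locklessly, take the ptlock, re-validate *pmd, retry if it changed underneath) can be sketched in userspace. This is only a shape, not kernel code: `struct slot` and `slot_map_lock()` are invented names, with a pthread mutex standing in for the ptlock.

```c
#include <pthread.h>
#include <assert.h>

/* Userspace sketch of the __pte_offset_map_lock() shape: read the table
 * pointer without the lock, take the lock, then confirm the pointer is
 * unchanged; if it changed (table removed or replaced meanwhile), drop
 * the lock and retry from the top.  Illustrative names, not kernel API. */
struct slot {
	pthread_mutex_t lock;
	int *table;		/* stands in for the page table the pmd points at */
};

/* Returns the table with slot->lock held, or NULL if there is none */
static int *slot_map_lock(struct slot *s)
{
	for (;;) {
		int *seen = __atomic_load_n(&s->table, __ATOMIC_ACQUIRE);

		if (!seen)
			return NULL;	/* like pmd_none(): fail, lock not taken */
		pthread_mutex_lock(&s->lock);
		if (__atomic_load_n(&s->table, __ATOMIC_RELAXED) == seen)
			return seen;	/* still the same table: success */
		pthread_mutex_unlock(&s->lock);	/* raced with a change: retry */
	}
}
```

The retry loop is what lets a page table be removed or replaced by a THP while its lock is held: any concurrent mapper either sees the old pointer and fails the recheck, or sees the new state up front.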
[PATCH v2] powerpc: platforms: Fix a NULL vs IS_ERR() bug for debugfs_create_dir()
The debugfs_create_dir() function returns error pointers.
It never returns NULL.  Most incorrect error checks were fixed,
but the ones in scom_debug_init() and scom_debug_init_one()
were overlooked.  Fix the remaining error checks.

Signed-off-by: Wang Ming
Fixes: bfd2f0d49aef ("powerpc/powernv: Get rid of old scom_controller abstraction")
---
 arch/powerpc/platforms/powernv/opal-xscom.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-xscom.c b/arch/powerpc/platforms/powernv/opal-xscom.c
index 6b4eed2ef4fa..262cd6fac907 100644
--- a/arch/powerpc/platforms/powernv/opal-xscom.c
+++ b/arch/powerpc/platforms/powernv/opal-xscom.c
@@ -168,7 +168,7 @@ static int scom_debug_init_one(struct dentry *root, struct device_node *dn,
 	ent->path.size = strlen((char *)ent->path.data);
 
 	dir = debugfs_create_dir(ent->name, root);
-	if (!dir) {
+	if (IS_ERR(dir)) {
 		kfree(ent->path.data);
 		kfree(ent);
 		return -1;
@@ -190,7 +190,7 @@ static int scom_debug_init(void)
 		return 0;
 
 	root = debugfs_create_dir("scom", arch_debugfs_dir);
-	if (!root)
+	if (IS_ERR(root))
 		return -1;
 
 	rc = 0;
-- 
2.25.1
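Why `if (!dir)` never triggers can be shown with a userspace re-creation of the kernel's ERR_PTR()/IS_ERR() helpers (modeled on include/linux/err.h; the casts and the test values below are illustrative): an error pointer encodes a small negative errno into the top of the address space, so it is non-NULL and only IS_ERR() catches it.

```c
#include <errno.h>
#include <assert.h>

/* Userspace re-creation of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()
 * (see include/linux/err.h): an errno in -1..-4095 is encoded as a
 * pointer into the last page of the address space. */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;		/* e.g. -ENODEV becomes 0xff..ffed */
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* non-NULL, yet recognizably an error */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

So a caller that writes `if (!dir)` after debugfs_create_dir() accepts the error pointer as valid, while `if (IS_ERR(dir))` rejects it and `PTR_ERR(dir)` recovers the errno.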
[PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins
---
 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..b7b45be616ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -731,8 +716,6 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 						bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
-- 
2.35.3
[PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to
be tried for retraction later - catching, for example, page tables where
a reversible mprotect() of a portion had required splitting the pmd, but
now it can be recollapsed.  Call collapse_pte_mapped_thp() directly in
this case (why was it deferred before?  I assume an issue with needing
mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 125 +++---
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46986eb4eebb..7c7aaddbe130 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
 	struct mm_slot slot;
-
-	/* pte-mapped THP in this mm */
-	int nr_pte_mapped_thp;
-	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1439,50 +1431,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- *     emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- *     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- *     (at virtual address X) and adds an entry (for X) into mm_struct A's
- *     ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- *     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- *     (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	struct khugepaged_mm_slot *mm_slot;
-	struct mm_slot *slot;
-	bool ret = false;
-
-	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-	spin_lock(&khugepaged_mm_lock);
-	slot = mm_slot_lookup(mm_slots_hash, mm);
-	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
-		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-		ret = true;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-	return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-	struct mm_slot *slot = &mm_slot->slot;
-	struct mm_struct *mm = slot->mm;
-	int i;
-
-	if (likely(mm_slot->nr_pte_mapped_thp == 0))
-		return;
-
-	if (!mmap_write_trylock(mm))
-		return;
-
-	if (unlikely(hpage_collapse_test_exit(mm)))
-		goto out;
-
-	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-	mm_slot->nr_pte_mapped_thp = 0;
-	mmap_write_unlock(mm);
-}
-
 static void retract_page_tables(struct
[PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB flush
can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.

Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 172 ++
 1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3bb05147961b..46986eb4eebb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
 {
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	};
 
 	VM_BUG_ON(!PageTransHuge(hpage));
-	mmap_assert_write_locked(vma->vm_mm);
+	mmap_assert_locked(vma->vm_mm);
 
 	if (do_set_pmd(&vmf, hpage))
 		return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to another
- *    page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	struct mmu_notifier_range range;
-
-	mmap_assert_write_locked(mm);
-	if (vma->vm_file)
-		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-	/*
-	 * All anon_vmas attached to the VMA have the same root and are
-	 * therefore locked by the same lock.
-	 */
-	if (vma->anon_vma)
-		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + HPAGE_PMD_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-	pmd = pmdp_collapse_flush(vma, addr, pmdp);
-	tlb_remove_table_sync_one();
-	mmu_notifier_invalidate_range_end(&range);
-	mm_dec_nr_ptes(mm);
-	page_table_check_pte_clear_range(mm, addr, pmd);
-	pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			    bool install_pmd)
 {
+	struct mmu_notifier_range range;
+	bool notified = false;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
-	pmd_t *pmd;
-	spinlock_t *ptl;
-
[PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail: so
retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
when PAE, and pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that enhancement
does raise some more questions: leave it until a later release.

Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 184 --
 1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..3bb05147961b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1615,9 +1615,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		break;
 	case SCAN_PMD_NONE:
 		/*
-		 * In MADV_COLLAPSE path, possible race with khugepaged where
-		 * all pte entries have been removed and pmd cleared.  If so,
-		 * skip all the pte checks and just update the pmd mapping.
+		 * All pte entries have been removed and pmd cleared.
+		 * Skip all the pte checks and just update the pmd mapping.
 		 */
 		goto maybe_install_pmd;
 	default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
 	mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-			       struct mm_struct *target_mm,
-			       unsigned long target_addr, struct page *hpage,
-			       struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
-	int target_result = SCAN_FAIL;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		int result = SCAN_FAIL;
-		struct mm_struct *mm = NULL;
-		unsigned long addr = 0;
-		pmd_t *pmd;
-		bool is_target = false;
+		struct mmu_notifier_range range;
+		struct mm_struct *mm;
+		unsigned long addr;
+		pmd_t *pmd, pgt_pmd;
+		spinlock_t *pml;
+		spinlock_t *ptl;
+		bool skipped_uffd = false;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth investing
-		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-		 * later.
-		 *
-		 * Note that vma->anon_vma check is racy: it can be set up after
-		 * the check but before we took mmap_lock by the fault path.
-		 * But page lock would prevent establishing any new ptes of the
-		 * page, so we are safe.
-		 *
-		 * An alternative would be drop the check, but check that page
-		 * table is clear before calling pmdp_collapse_flush() under
-		 * ptl.  It has higher chance to recover THP for the VMA, but
-		 * has higher cost too. It would also probably
[PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page
Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).

Signed-off-by: Hugh Dickins
---
 include/linux/mm_types.h |  4 ++++
 include/linux/pgtable.h  |  2 ++
 mm/pgtable-generic.c     | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..17a7868f00bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			/*
+			 * A PTE page table page might be freed by use of
+			 * rcu_head: which overlays those two fields above.
+			 */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7f2db400f653..9fa34be65159 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b9a0c2137cc1..fa9d4d084291 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mm_inline.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = pgtable;
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3
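The container_of()-based callback shape used by the generic pte_free_defer() above can be sketched in userspace: an embedded callback head, container_of() recovering the enclosing object, and a plain deferral queue standing in for call_rcu()'s grace period. `cb_head`, `defer_free()` and `run_deferred()` are invented names for illustration, not kernel API.

```c
#include <stddef.h>

/* Userspace sketch of the pte_free_defer() shape: the callback receives
 * only a pointer to the head embedded in the object, and uses
 * container_of() to get back to the object itself - just as the kernel
 * callback recovers the struct page from its overlaid rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct pgtable_page {
	int freed;		/* stands in for actually freeing the page */
	struct cb_head rcu;	/* like struct page's overlaid rcu_head */
};

static struct cb_head *deferred;

static void defer_free(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;	/* like call_rcu(): only queue, free later */
	head->next = deferred;
	deferred = head;
}

static void run_deferred(void)	/* stands in for the grace period ending */
{
	while (deferred) {
		struct cb_head *head = deferred;

		deferred = head->next;
		head->func(head);
	}
}

static void pgtable_free_now(struct cb_head *head)
{
	struct pgtable_page *page = container_of(head, struct pgtable_page, rcu);

	page->freed = 1;	/* the kernel callback calls pte_free() here */
}
```

The point of the deferral is visible in the ordering: between `defer_free()` and `run_deferred()`, a lockless reader holding the old table pointer still finds valid (if empty) memory, exactly what pte_offset_map()'s rcu_read_lock() relies on.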
[PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending on
how long the allocated fragment will stay in use.  In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later in
__tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins
Reviewed-by: Gerald Schaefer
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c          | 80 +-
 2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 					table += PTRS_PER_PTE;
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del_init(&page->lru);
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
+		INIT_LIST_HEAD(&page->lru);
 		atomic_xor_bits(&page->_refcount, 0x03U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
 		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
 {
 	char msg[128];
 
-	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+	if (!mask && list_empty(&page->lru))
 		return;
 	snprintf(msg, sizeof(msg),
 		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
 	dump_page(page, msg);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void page_table_free(struct mm_struct *mm, unsigned long
[PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t
Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins
---
 arch/sparc/include/asm/pgalloc_64.h |  4 ++++
 arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	__pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-- 
2.35.3
[PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page
Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe
Signed-off-by: Hugh Dickins
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 29 ++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
 	return __alloc_for_ptecache(mm, kernel);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		if (!kernel)
-			pgtable_pte_page_dtor(page);
-		__free_page(page);
+		if (kernel)
+			__free_page(page);
+		else if (TestClearPageActive(page))
+			call_rcu(&page->rcu_head, pte_free_now);
+		else
+			pte_free_now(&page->rcu_head);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3
[PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock() in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is stricter than the previous implementation, which skipped when pmd_none() (with a comment on khugepaged collapse transitions): but wouldn't we want to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins
---
 arch/powerpc/mm/pgtable.c | 16 ++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (mm == &init_mm)
 		return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(p4d, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
-	/*
-	 * khugepaged to collapse normal pages to hugepage, first set
-	 * pmd to none to force page fault/gup to take mmap_lock. After
-	 * pmd is set to none, we do a pte_clear which does this assertion
-	 * so if we find pmd none, return.
-	 */
-	if (pmd_none(*pmd))
-		return;
-	BUG_ON(!pmd_present(*pmd));
-	assert_spin_locked(pte_lockptr(mm, pmd));
+	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	BUG_ON(!pte);
+	assert_spin_locked(ptl);
+	pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
-- 
2.35.3
[PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock()
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock() in adjust_pte(): because it gives the not-locked ptl for precisely that pte, which the caller can then safely lock; whereas pte_lockptr() is not so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version. This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
 	if (!pte)
 		return 0;
 
-	ptl = pte_lockptr(vma->vm_mm, pmd);
 	do_pte_lock(ptl);
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3
[PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a pmdval assembled from a pmd_low and a pmd_high which never belonged together: their combination not pointing to a page table at all, perhaps not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB flush between present updates, as linux/pgtable.h suggests. It's only needed around the pmdp_get_lockless() in __pte_offset_map(): a race when __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the lock, would just send it back to __pte_offset_map() again.

Complement these pmdp_get_lockless_start() and pmdp_get_lockless_end(), used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync() synonym for tlb_remove_table_sync_one(): to send the necessary interrupt at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86. It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a little more work, to retry if pmd_low good for page table, but pmd_high non-zero from THP (and that might be making x86-specific assumptions).
Signed-off-by: Hugh Dickins
---
 include/linux/pgtable.h |  4 ++++
 mm/pgtable-generic.c    | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5134edcec668..7f2db400f653 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 	return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
 	return pmdp_get(pmdp);
 }
+static inline void pmdp_get_lockless_sync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 400e5a045848..b9a0c2137cc1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+	unsigned long irqflags;
+
+	local_irq_save(irqflags);
+	return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+	local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long irqflags;
 	pmd_t pmdval;
 
 	rcu_read_lock();
+	irqflags = pmdp_get_lockless_start();
 	pmdval = pmdp_get_lockless(pmd);
+	pmdp_get_lockless_end(irqflags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3
[PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
Before putting them to use (several commits later), add rcu_read_lock() to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a separate commit, since it risks exposing imbalances: prior commits have fixed all the known imbalances, but we may find some have been missed.

Signed-off-by: Hugh Dickins
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..5134edcec668 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte)	do {	\
 	kunmap_local((pte));	\
-	/* rcu_read_unlock() to be added later */	\
+	rcu_read_unlock();	\
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 }
 #endif
 diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4d454953046f..400e5a045848 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
 	pmd_t pmdval;
 
-	/* rcu_read_lock() to be added later */
+	rcu_read_lock();
 	pmdval = pmdp_get_lockless(pmd);
 	if (pmdvalp)
 		*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	}
 	return __pte_map(&pmdval, addr);
 nomap:
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 	return NULL;
 }
-- 
2.35.3
[PATCH v3 00/13] mm: free retracted page table by RCU
Here is v3 of the series of patches to mm (and a few architectures), based on v6.5-rc1 which includes the preceding two series (thank you!): in which khugepaged takes advantage of pte_offset_map[_lock]() allowing for pmd transitions. Differences from v1 and v2 are noted patch by patch below.

This replaces the v2 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/54cb04f-3762-987f-8294-91dafd8eb...@google.com/
series of 12 posted on 2023-06-20.

What is it all about? Some mmap_lock avoidance i.e. latency reduction. Initially just for the case of collapsing shmem or file pages to THPs: the usefulness of MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it currently requires. Likely to be relied upon later in other contexts e.g. freeing of empty page tables (but that's not work I'm doing). mmap_write_lock avoidance when collapsing to anon THPs? Perhaps, but again that's not work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in Google's data centre kernel for three years now: we do rely upon them.

Based on v6.5-rc1; and almost good on current mm-unstable or current linux-next - just one patch conflicts, the 12/13: I'll reply to that one with its mm-unstable or linux-next equivalent (vma_assert_locked() has been added next to where vma_try_start_write() is being removed).

01/13 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
v3: same as v1

02/13 mm/pgtable: add PAE safety to __pte_offset_map()
v3: same as v2
v2: rename to pmdp_get_lockless_start/end() per Matthew;
    so use inlines without _irq_save(flags) macro oddity;
    add pmdp_get_lockless_sync() for use later in 09/13.
03/13 arm: adjust_pte() use pte_offset_map_nolock()
v3: same as v1

04/13 powerpc: assert_pte_locked() use pte_offset_map_nolock()
v3: same as v1

05/13 powerpc: add pte_free_defer() for pgtables sharing page
v3: much simpler version, following suggestion by Jason
v2: fix rcu_head usage to cope with concurrent deferrals;
    add para to commit message explaining rcu_head issue.

06/13 sparc: add pte_free_defer() for pte_t *pgtable_t
v3: same as v2
v2: use page_address() instead of less common page_to_virt();
    add para to commit message explaining simple conversion;
    changed title since sparc64 pgtables do not share page.

07/13 s390: add pte_free_defer() for pgtables sharing page
v3: much simpler version, following suggestion by Gerald
v2: complete rewrite, integrated with s390's existing pgtable management;
    temporarily using a global mm_pgtable_list_lock, to be restored to
    per-mm spinlock in a later followup patch.

08/13 mm/pgtable: add pte_free_defer() for pgtable as page
v3: same as v2
v2: add comment on rcu_head to "Page table pages", per JannH

09/13 mm/khugepaged: retract_page_tables() without mmap or vma lock
v3: same as v2
v2: repeat checks under ptl because UFFD, per PeterX and JannH;
    bring back mmu_notifier calls for PMD, per JannH and Jason;
    pmdp_get_lockless_sync() to issue missing interrupt if PAE.

10/13 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
v3: updated to using ptent instead of *pte
v2: first check VMA, in case page tables torn down, per JannH;
    pmdp_get_lockless_sync() to issue missing interrupt if PAE;
    moved mmu_notifier after step 1, reworked final goto labels.
11/13 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
v3: rediffed
v2: same as v1

12/13 mm: delete mmap_write_trylock() and vma_try_start_write()
v3: rediffed (different diff needed for mm-unstable or linux-next)
v2: same as v1

13/13 mm/pgtable: notes on pte_offset_map[_lock]()
v3: new: JannH asked for more helpful comment, this is my attempt;
    could be moved to be the first in the series.

 arch/arm/mm/fault-armv.c            |   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c      |  29 ++-
 arch/powerpc/mm/pgtable.c           |  16 +-
 arch/s390/include/asm/pgalloc.h     |   4 +
 arch/s390/mm/pgalloc.c              |  80 -
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c             |  16 +
 include/linux/mm.h                  |  17 --
 include/linux/mm_types.h            |   4 +
 include/linux/mmap_lock.h           |  10 -
 include/linux/pgtable.h             |  10 +-
 mm/khugepaged.c                     | 481 +++---
 mm/pgtable-generic.c                |  97 +-
 14 files changed, 404 insertions(+), 371 deletions(-)

Hugh
Re: [PATCH net-next v3 0/8] net: freescale: Convert to platform remove callback returning void
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski :

On Mon, 10 Jul 2023 09:19:38 +0200 you wrote:
> Hello,
>
> v2 of this series was sent in June[1], code changes since then only affect
> patch #1 where the dev_err invocation was adapted to emit the error code of
> dpaa_fq_free(). Thanks for feedback by Maciej Fijalkowski and Russell King.
> Other than that I added Reviewed-by tags for Simon Horman and Wei Fang and
> rebased to v6.5-rc1.
>
> [...]

Here is the summary with links:
  - [net-next,v3,1/8] net: dpaa: Improve error reporting
    https://git.kernel.org/netdev/net-next/c/1e679b957ae2
  - [net-next,v3,2/8] net: dpaa: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/9c3ddc44d0c0
  - [net-next,v3,3/8] net: fec: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/12d6cc19f29b
  - [net-next,v3,4/8] net: fman: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/4875b2a362e9
  - [net-next,v3,5/8] net: fs_enet: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/ead29c5e0888
  - [net-next,v3,6/8] net: fsl_pq_mdio: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/f833635589ae
  - [net-next,v3,7/8] net: gianfar: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/4be0ebc33f39
  - [net-next,v3,8/8] net: ucc_geth: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/ae18facf566c

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation
On 7/11/23 10:49 PM, David Hildenbrand wrote:
> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>> Add a new kconfig option that can be selected if we want to allow
>> pageblock alignment by reserving pages in the vmemmap altmap area.
>> This implies we will be reserving some pages for every memoryblock
>> This also allows the memmap on memory feature to be widely useful
>> with different memory block size values.
>
> "reserving pages" is a nice way of saying "wasting memory". :) Let's spell
> that out.
>
> I think we have to find a better name for this, and I think we should have a
> toggle similar to memory_hotplug.memmap_on_memory. This should be an admin
> decision, not some kernel config option.
>
>
> memory_hotplug.force_memmap_on_memory
>
> "Enable the memmap on memory feature even if it could result in memory waste
> due to memmap size limitations. For example, if the memmap for a memory block
> requires 1 MiB, but the pageblock size is 2 MiB, 1 MiB
> of hotplugged memory will be wasted. Note that there are still cases where
> the feature cannot be enforced: for example, if the memmap is smaller than a
> single page, or if the architecture does not support the forced mode in all
> configurations."
>
> Thoughts?
>

With module parameter, do we still need the Kconfig option?

-aneesh
Re: [PATCH] soc: fsl: qe: Replace all non-returning strlcpy with strscpy
> Sorry for the late response. But I found some old discussions with the
> conclusion to be not converting old users. Has this been changed later on?
> https://lwn.net/Articles/659214/
> @Kees Cook what's your advice here?
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Tue, Jul 11, 2023, Zhi Wang wrote:
> On Thu, 6 Jul 2023 15:49:39 +0900
> David Stevens wrote:
>
> > On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang wrote:
> > >
> > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > David Stevens wrote:
> > > If yes, do we have to use FOLL_GET to resolve GFN associated with a
> > > tail page? It seems gup can tolerate gup_flags without FOLL_GET, but
> > > it is more like a temporary solution. I don't think it is a good idea
> > > to play tricks with a temporary solution, more like we are abusing
> > > the toleration.
> >
> > I'm not sure I understand what you're getting at. This series never
> > calls gup without FOLL_GET.
> >
> > This series aims to provide kvm_follow_pfn as a unified API on top of
> > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > notifier, it makes sense to support returning a pfn without taking a
> > reference. And we indeed need to do that for certain types of memory.
> >
>
> I am not having prob with taking a pfn without taking a ref. I am
> questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> a pfn without a ref is a good idea or not, while there is another flag
> actually showing it.
>
> I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> translation between struct kvm_follow_pfn.{write, async, } and GUP
> flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> requirements of GUP in the code path that going to call GUP is reasonable.
>
> But using FOLL_XXX with purposes that are not related to GUP call really
> feels off.

I agree, assuming you're talking specifically about the logic in
hva_to_pfn_remapped() that handles non-refcounted pages, i.e.
this

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (foll->flags & FOLL_GET) {
		r = -EFAULT;
	}

should be

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
		else if (!foll->guarded_by_mmu_notifier)
			r = -EFAULT;
	}

because it's not the desire to grab a reference that makes getting non-refcounted pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.

Though that highlights that checking guarded_by_mmu_notifier should be done for *all* non-refcounted pfns, not just non-refcounted struct page memory.

As for the other usage of FOLL_GET in this series (using it to conditionally do put_page()), IMO that's very much related to the GUP call. Invoking put_page() is a hack to workaround the fact that GUP doesn't provide a way to get the pfn without grabbing a reference to the page. In an ideal world, KVM would NOT pass FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.

I do think it's worth providing a helper to consolidate and document that hacky code, e.g. add a kvm_follow_refcounted_pfn() helper.

All in all, I think the below (completely untested) is what we want?

David (and others), I am planning on doing a full review of this series "soon", but it will likely be a few weeks until that happens. I jumped in on this specific thread because this caught my eye and I really don't want to throw out *all* of the FOLL_GET usage.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5afd70f239..90d424990e0a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 	return rc == -EHWPOISON;
 }
 
+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+					   struct page *page)
+{
+	kvm_pfn_t pfn = page_to_pfn(page);
+
+	foll->is_refcounted_page = true;
+
+	/*
+	 * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+	 * doesn't want to grab a reference, but gup() doesn't support getting
+	 * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
+	 * changes, drop this and simply don't pass FOLL_GET to gup().
+	 */
+	if (!(foll->flags & FOLL_GET))
+		put_page(page);
+
+	return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
@@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return false;
 
 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
 		foll->writable = foll->allow_write_mapping;
-		foll->is_refcounted_page = true;
-		if (!(foll->flags &
Re: [PATCH v2 1/2] powerpc/tpm: Create linux,sml-base/size as big endian
On Tue, 2023-07-11 at 08:47 -0400, Stefan Berger wrote:
>
> On 7/10/23 17:23, Jarkko Sakkinen wrote:
> > On Thu, 2023-06-15 at 22:37 +1000, Michael Ellerman wrote:
> > > There's code in prom_instantiate_sml() to do a "SML handover" (Stored
> > > Measurement Log) from OF to Linux, before Linux shuts down Open
> > > Firmware.
> > >
> > > This involves creating a buffer to hold the SML, and creating two device
> > > tree properties to record its base address and size. The kernel then
> > > later reads those properties from the device tree to find the SML.
> > >
> > > When the code was initially added in commit 4a727429abec ("PPC64: Add
> > > support for instantiating SML from Open Firmware") the powerpc kernel
> > > was always built big endian, so the properties were created big endian
> > > by default.
> > >
> > > However since then little endian support was added to powerpc, and now
> > > the code lacks conversions to big endian when creating the properties.
> > >
> > > This means on little endian kernels the device tree properties are
> > > little endian, which is contrary to the device tree spec, and in
> > > contrast to all other device tree properties.
> > >
> > > To cope with that a workaround was added in tpm_read_log_of() to skip
> > > the endian conversion if the properties were created via the SML
> > > handover.
> > >
> > > A better solution is to encode the properties as big endian as they
> > > should be, and remove the workaround.
> > >
> > > Typically changing the encoding of a property like this would present
> > > problems for kexec. However the SML is not propagated across kexec, so
> > > changing the encoding of the properties is a non-issue.
> > >
> > > Fixes: e46e22f12b19 ("tpm: enhance read_log_of() to support Physical TPM
> > > event log")
> > > Signed-off-by: Michael Ellerman
> > > Reviewed-by: Stefan Berger
> > > ---
> > >  arch/powerpc/kernel/prom_init.c | 8 ++--
> > >  drivers/char/tpm/eventlog/of.c  | 23 ---
> > >  2 files changed, 10 insertions(+), 21 deletions(-)
> >
> > Split into two patches (producer and consumer).
>
> I think this wouldn't be right since it would break the system when only one
> patch is applied since it would be reading the fields in the wrong endianess.

I think it would help if the commit message would better explain what is
going on. It is somewhat difficult to decipher, if you don't have deep
knowledge of the powerpc architecture.

BR, Jarkko
Re: [PATCH v2 10/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document affinity_domain_via_partition sysfs interface file
Hi,

Same correction comments as in the other 4 patches (not repeated here).

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "affinity_domain_via_partition" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index d8e65b93d1f7..b03b2bd4b081 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -208,3 +208,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/affinity_domain_via_partition
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		AFFINITY_DOMAIN_INFORMATION_BY_PARTITION(0xB1).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",
> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.
> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during
> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for
> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 08/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document affinity_domain_via_domain sysfs interface file
Hi,

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "affinity_domain_via_domain" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index 3b63d66658fe..d8e65b93d1f7 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -176,3 +176,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/affinity_domain_via_domain
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		AFFINITY_DOMAIN_INFORMATION_BY_DOMAIN(0xB0).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",

typo

> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.

s/and/or/

> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during

Drop the comma:                            ^

> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for

Refer to

> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 06/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document affinity_domain_via_virtual_processor sysfs interface file
Hi--

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "affinity_domain_via_virtual_processor" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index aff52dc3b05c..3b63d66658fe 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -144,3 +144,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/affinity_domain_via_virtual_processor
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		AFFINITY_DOMAIN_INFORMATION_BY_VIRTUAL_PROCESSOR(0xA0).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",

H_SUCCESS

> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.

s/and/or/

> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during

Drop the comma:                            ^

> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for

Refer to

> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 04/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document processor_config sysfs interface file
Hi--

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "processor_config" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index 2eeeab9a20fa..aff52dc3b05c 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -112,3 +112,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/processor_config
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		PROCESSOR_CONFIG(0x90).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",

H_SUCCESS

> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.

s/and/or/

> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during

Drop the comma:                            ^

> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for

Refer to

> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 02/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document processor_bus_topology sysfs interface file
Hi-- On 7/10/23 02:27, Kajol Jain wrote: > Add details of the new hv-gpci interface file called > "processor_bus_topology" in the ABI documentation. > > Signed-off-by: Kajol Jain > --- > .../sysfs-bus-event_source-devices-hv_gpci| 32 +++ > 1 file changed, 32 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > index 12e2bf92783f..2eeeab9a20fa 100644 > --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > @@ -80,3 +80,35 @@ Contact: Linux on PowerPC Developer List > > Description: read only > This sysfs file exposes the cpumask which is designated to make > HCALLs to retrieve hv-gpci pmu event counter data. > + > +What:/sys/devices/hv_gpci/interface/processor_bus_topology > +Date:July 2023 > +Contact: Linux on PowerPC Developer List > +Description: admin read only > + This sysfs file exposes the system topology information by > making HCALL > + H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request > value > + PROCESSOR_BUS_TOPOLOGY(0xD0). > + > + * This sysfs file will be created only for power10 and above > platforms. > + > + * User needs root privileges to read data from this sysfs file. > + > + * This sysfs file will be created, only when the HCALL returns > "H_SUCESS", H_SUCCESS > + "H_AUTHORITY" and "H_PARAMETER" as the return type. s/and/or/ > + > + HCALL with return error type "H_AUTHORITY", can be resolved > during Drop the comma ^ > + runtime by setting "Enable Performance Information > Collection" option. > + > + * The end user reading this sysfs file must decode the content > as per > + underlying platform/firmware. > + > + Possible error codes while reading this sysfs file: > + > + * "-EPERM" : Partition is not permitted to retrieve performance > information, > + required to set "Enable Performance Information > Collection" option. 
> + > + * "-EIO" : Can't retrieve system information because of invalid > buffer length/invalid address > +or because of some hardware error. Refer > getPerfCountInfo documentation for Refer to > +more information. > + > + * "-EFBIG" : System information exceeds PAGE_SIZE. -- ~Randy
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 04:01:03PM +0200, Christian Brauner wrote: > On Tue, Jul 11, 2023 at 02:51:01PM +0200, Alexey Gladkov wrote: > > On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote: > > > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > > > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > > > > From: Palmer Dabbelt > > > > > > > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > > > > function which implements the POSIX-specified interface. This > > > > > interface differs from the underlying kernel system call, which does > > > > > not > > > > > have a flags argument. Most implementations require procfs [1][2]. > > > > > > > > > > There doesn't appear to be a good userspace workaround for this issue > > > > > but the implementation in the kernel is pretty straight-forward. > > > > > > > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW > > > > > flag, > > > > > unlike existing fchmodat. > > > > > > > > > > [1] > > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > > > > [2] > > > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > > > > > > > Signed-off-by: Palmer Dabbelt > > > > > Signed-off-by: Alexey Gladkov > > > > > > > > I don't know the history of why we ended up with the different > > > > interface, or whether this was done intentionally in the kernel > > > > or if we want this syscall. > > > > > > > > Assuming this is in fact needed, I double-checked that the > > > > implementation looks correct to me and is portable to all the > > > > architectures, without the need for a compat wrapper. > > > > > > > > Acked-by: Arnd Bergmann > > > > > > The system call itself is useful afaict. But please, > > > > > > s/fchmodat4/fchmodat2/ > > > > Sure. I will. > > Thanks. 
> Can you also wire this up for every architecture, please?
> I don't see that this has been done in this series.

Sure. I have already added it in all architectures as far as I can tell:

$ diff -s <(find arch/ -name '*.tbl' |sort -u) <(git grep -lw fchmodat2 arch/ |sort -u)
Files /dev/fd/63 and /dev/fd/62 are identical

--
Rgrds, legion
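For context on why the libc wrappers need procfs, here is a minimal userspace sketch of the fallback that glibc and musl use today, and that the new syscall makes unnecessary: open the object with O_PATH|O_NOFOLLOW and chmod it through the /proc/self/fd magic symlink. The function name is invented for illustration; this only mirrors the approach in the referenced glibc/musl sources.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Emulate fchmodat(dfd, path, mode, AT_SYMLINK_NOFOLLOW) in userspace.
 * Requires a mounted /proc -- exactly the dependency the kernel-side
 * fchmodat2()/fchmodat4() syscall removes.
 */
static int fchmodat_nofollow_emul(int dfd, const char *path, mode_t mode)
{
	struct stat st;
	char procpath[64];
	int pathfd, ret;

	/* O_PATH|O_NOFOLLOW opens the symlink itself, not its target. */
	pathfd = openat(dfd, path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
	if (pathfd < 0)
		return -1;

	/* Refuse symlinks up front, as the glibc fallback does. */
	if (fstat(pathfd, &st) < 0 || S_ISLNK(st.st_mode)) {
		close(pathfd);
		errno = EOPNOTSUPP;
		return -1;
	}

	/* chmod through the /proc magic symlink for the held fd. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", pathfd);
	ret = chmod(procpath, mode);
	close(pathfd);
	return ret;
}
```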
Re: [PATCH v3 0/5] Add a new fchmodat4() syscall
On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote: > * Alexey Gladkov: > > > This patch set adds fchmodat4(), a new syscall. The actual > > implementation is super simple: essentially it's just the same as > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > > I've attempted to make this match "man 2 fchmodat" as closely as > > possible, which says EINVAL is returned for invalid flags (as opposed to > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > > I have a sketch of a glibc patch that I haven't even compiled yet, but > > seems fairly straight-forward: > > > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c > > b/sysdeps/unix/sysv/linux/fchmodat.c > > index 6d9cbc1ce9e0..b1beab76d56c 100644 > > --- a/sysdeps/unix/sysv/linux/fchmodat.c > > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > > @@ -29,12 +29,36 @@ > > int > > fchmodat (int fd, const char *file, mode_t mode, int flag) > > { > > - if (flag & ~AT_SYMLINK_NOFOLLOW) > > -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > > -#ifndef __NR_lchmod/* Linux so far has no lchmod syscall. > > */ > > + /* There are four paths through this code: > > + - The flags are zero. In this case it's fine to call fchmodat. > > + - The flags are non-zero and glibc doesn't have access to > > + __NR_fchmodat4. In this case all we can do is emulate the > > error codes > > + defined by the glibc interface from userspace. > > + - The flags are non-zero, glibc has __NR_fchmodat4, and the > > kernel has > > + fchmodat4. This is the simplest case, as the fchmodat4 syscall > > exactly > > + matches glibc's library interface so it can be called directly. > > + - The flags are non-zero, glibc has __NR_fchmodat4, but the > > kernel does > > If you define __NR_fchmodat4 on all architectures, we can use these > constants directly in glibc. 
> We no longer depend on the UAPI definitions of those constants, to cut
> down the number of code variants, and to make glibc's system call
> profile independent of the kernel header version at build time.
>
> Your version is based on 2.31, more recent versions have some reasonable
> emulation for fchmodat based on /proc/self/fd. I even wrote a comment
> describing the same buggy behavior that you witnessed:
>
> +  /* Some Linux versions with some file systems can actually
> +     change symbolic link permissions via /proc, but this is not
> +     intentional, and it gives inconsistent results (e.g., error
> +     return despite mode change). The expected behavior is that
> +     symbolic link modes cannot be changed at all, and this check
> +     enforces that. */
> +  if (S_ISLNK (st.st_mode))
> +    {
> +      __close_nocancel (pathfd);
> +      __set_errno (EOPNOTSUPP);
> +      return -1;
> +    }
>
> I think there was some kernel discussion about that behavior before, but
> apparently, it hasn't led to fixes.

I think I've explained this somewhere else a couple of months ago but
just in case you weren't on that thread or don't remember, and apologies
if you should already know.

A lot of filesystems will happily update the mode of a symlink. The VFS
doesn't do anything to prevent this from happening. This is filesystem
specific.

The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
Specifically it comes from filesystems that call posix_acl_chmod(),
e.g., btrfs via

        if (!err && attr->ia_valid & ATTR_MODE)
                err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Most filesystems don't implement i_op->set_acl() for POSIX ACLs. So
posix_acl_chmod() will report EOPNOTSUPP. By the time posix_acl_chmod()
is called, most filesystems will have finished updating the inode.

POSIX ACLs also often aren't integrated into transactions so a rollback
wouldn't even be possible on some filesystems. Any filesystem that
doesn't implement POSIX ACLs at all will obviously never fail unless it
blocks mode changes on symlinks.
Or filesystems that do have a way to roll back failures from
posix_acl_chmod(), or filesystems that do return an error on chmod() on
symlinks, such as 9p, ntfs, ocfs2.

> I wonder if it makes sense to add a similar error return to the system
> call implementation?

Hm, blocking symlink mode changes is pretty regression prone. And just
blocking it through one interface seems weird and makes things even more
inconsistent. So two options I see:

(1) minimally invasive:
    Filesystems that do call posix_acl_chmod() on symlinks need to be
    changed to stop doing that.

(2) might hit us on the head invasive:
    Try and block symlink mode changes in chmod_common().

Thoughts?
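A toy model (all names invented) of the ordering described above: the inode's mode is updated first, and the posix_acl_chmod() hook only fails afterwards, with no rollback, which produces exactly the "error return despite mode change" symptom seen on btrfs/xfs/ext4.

```c
#include <errno.h>

/* Toy stand-in for an inode; has_set_acl models whether the filesystem
 * implements i_op->set_acl() for POSIX ACLs. */
struct toy_inode {
	unsigned int i_mode;
	int has_set_acl;
};

/* Toy posix_acl_chmod(): fails with EOPNOTSUPP when set_acl is absent. */
static int toy_posix_acl_chmod(struct toy_inode *inode)
{
	return inode->has_set_acl ? 0 : -EOPNOTSUPP;
}

/* Toy setattr path: the mode is committed before the ACL hook runs,
 * so a late EOPNOTSUPP is reported after the change already stuck. */
static int toy_setattr_mode(struct toy_inode *inode, unsigned int mode)
{
	inode->i_mode = mode;              /* inode updated first... */
	return toy_posix_acl_chmod(inode); /* ...then the hook may fail */
}
```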
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 02:51:01PM +0200, Alexey Gladkov wrote: > On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote: > > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > > > From: Palmer Dabbelt > > > > > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > > > function which implements the POSIX-specified interface. This > > > > interface differs from the underlying kernel system call, which does not > > > > have a flags argument. Most implementations require procfs [1][2]. > > > > > > > > There doesn't appear to be a good userspace workaround for this issue > > > > but the implementation in the kernel is pretty straight-forward. > > > > > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > > > > unlike existing fchmodat. > > > > > > > > [1] > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > > > [2] > > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > > > > > Signed-off-by: Palmer Dabbelt > > > > Signed-off-by: Alexey Gladkov > > > > > > I don't know the history of why we ended up with the different > > > interface, or whether this was done intentionally in the kernel > > > or if we want this syscall. > > > > > > Assuming this is in fact needed, I double-checked that the > > > implementation looks correct to me and is portable to all the > > > architectures, without the need for a compat wrapper. > > > > > > Acked-by: Arnd Bergmann > > > > The system call itself is useful afaict. But please, > > > > s/fchmodat4/fchmodat2/ > > Sure. I will. Thanks. Can you also wire this up for every architecture, please? I don't see that this has been done in this series.
Re: [PATCH v3 5/5] selftests: add fchmodat4(2) selftest
On Tue, Jul 11, 2023 at 02:10:58PM +0200, Florian Weimer wrote: > * Alexey Gladkov: > > > The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag > > fails. This is because not all filesystems support changing the mode > > bits of symlinks properly. These filesystems return an error but change > > the mode bits: > > > > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > > AT_SYMLINK_NOFOLLOW) = 0 > > newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, > > AT_SYMLINK_NOFOLLOW) = 0 > > syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 > > EOPNOTSUPP (Operation not supported) > > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > > AT_SYMLINK_NOFOLLOW) = 0 > > > > This happens with btrfs and xfs: > > > > $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test > > TAP version 13 > > 1..1 > > ok 1 # SKIP fchmodat4(symlink) > > # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 > > > > $ stat /tmp/ksft-fchmodat4.*/symlink > >File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile > >Size: 7 Blocks: 0 IO Block: 4096 symbolic link > > Device: 7,0 Inode: 133 Links: 1 > > Access: (0600/lrw---) Uid: (0/root) Gid: (0/root) > > > > Signed-off-by: Alexey Gladkov > > This looks like a bug in those file systems? To me this looks like a bug. I'm fine if the operation ends with EOPNOTSUPP, but in that case the mode bits shouldn't change. > As an extra test, “echo 3 > /proc/sys/vm/drop_caches” sometimes has > strange effects in such cases because the bits are not actually stored > on disk, only in the dentry cache. 
tmpfs:

  syscall_0x1c3(0xff9c, 0x7ffd58758574, 0, 0x100, 0x7f6cf18adc70, 0x7ffd58756ad8) = 0
  +++ exited with 0 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

ext4:

  syscall_0x1c3(0xff9c, 0x7ffedfdb4574, 0, 0x100, 0x7f7f40b45c70, 0x7ffedfdb3ae8) = -1 EOPNOTSUPP (Operation not supported)
  +++ exited with 1 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

xfs:

  syscall_0x1c3(0xff9c, 0x7ffcd03ce574, 0, 0x100, 0x7ff2f2980c70, 0x7ffcd03cdd38) = -1 EOPNOTSUPP (Operation not supported)
  +++ exited with 1 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

btrfs:

  syscall_0x1c3(0xff9c, 0x7fff13d2e574, 0, 0x100, 0x7f9b67f59c70, 0x7fff13d2ca88) = -1 EOPNOTSUPP (Operation not supported)
  +++ exited with 1 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

reiserfs:

  syscall_0x1c3(0xff9c, 0x7ffdf75af574, 0, 0x100, 0x7f7ad0634c70, 0x7ffdf75ae478) = 0
  +++ exited with 0 +++
  l- 1 root root 1 Jul 11 16:43 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:43 /tmp/dir/link -> f

--
Rgrds, legion
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote: > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > > From: Palmer Dabbelt > > > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > > function which implements the POSIX-specified interface. This > > > interface differs from the underlying kernel system call, which does not > > > have a flags argument. Most implementations require procfs [1][2]. > > > > > > There doesn't appear to be a good userspace workaround for this issue > > > but the implementation in the kernel is pretty straight-forward. > > > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > > > unlike existing fchmodat. > > > > > > [1] > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > > [2] > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > > > Signed-off-by: Palmer Dabbelt > > > Signed-off-by: Alexey Gladkov > > > > I don't know the history of why we ended up with the different > > interface, or whether this was done intentionally in the kernel > > or if we want this syscall. > > > > Assuming this is in fact needed, I double-checked that the > > implementation looks correct to me and is portable to all the > > architectures, without the need for a compat wrapper. > > > > Acked-by: Arnd Bergmann > > The system call itself is useful afaict. But please, > > s/fchmodat4/fchmodat2/ Sure. I will. 
> With very few exceptions we don't version by argument number but by
> revision and we should stick to one scheme:
>
> openat() -> openat2()
> eventfd() -> eventfd2()
> clone()/clone2() -> clone3()
> dup() -> dup2() -> dup3()  // coincides with nr of arguments
> pipe() -> pipe2()          // coincides with nr of arguments
> renameat() -> renameat2()

--
Rgrds, legion
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:28:04PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 11, 2023 at 01:25:43PM +0200, Alexey Gladkov wrote:
> > -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
> > +static int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int lookup_flags)
>
> This function can still be called do_fchmodat(); we don't need to
> version internal functions.

Yes. I tried not to change too much when adopting a patch. In the new
version, I will return the old name.

Thanks.

--
Rgrds, legion
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, 5 Jul 2023 18:08:17 +0900 David Stevens wrote: > On Wed, Jul 5, 2023 at 5:47___PM Zhi Wang wrote: > > > > On Tue, 4 Jul 2023 16:50:47 +0900 > > David Stevens wrote: > > > > > From: David Stevens > > > > > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > > > __kvm_follow_pfn refactors the old API's arguments into a struct and, > > > where possible, combines the boolean arguments into a single flags > > > argument. > > > > > > Signed-off-by: David Stevens > > > --- > > > include/linux/kvm_host.h | 16 > > > virt/kvm/kvm_main.c | 171 ++- > > > virt/kvm/kvm_mm.h| 3 +- > > > virt/kvm/pfncache.c | 8 +- > > > 4 files changed, 122 insertions(+), 76 deletions(-) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index 9d3ac7720da9..ef2763c2b12e 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -97,6 +97,7 @@ > > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > > > > > /* > > > * error pfns indicate that the gfn is in slot but faild to > > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > > > kvm_memory_slot *slot, gfn_t gfn, > > > void kvm_release_page_clean(struct page *page); > > > void kvm_release_page_dirty(struct page *page); > > > > > > +struct kvm_follow_pfn { > > > + const struct kvm_memory_slot *slot; > > > + gfn_t gfn; > > > + unsigned int flags; > > > + bool atomic; > > > + /* Allow a read fault to create a writeable mapping. 
*/ > > > + bool allow_write_mapping; > > > + > > > + /* Outputs of __kvm_follow_pfn */ > > > + hva_t hva; > > > + bool writable; > > > +}; > > > + > > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > > > + > > > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > > > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > > > bool *writable); > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 371bd783ff2b..b13f22861d2f 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -2486,24 +2486,22 @@ static inline int > > > check_user_page_hwpoison(unsigned long addr) > > > * true indicates success, otherwise false is returned. It's also the > > > * only part that runs if we can in atomic context. > > > */ > > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > > > - bool *writable, kvm_pfn_t *pfn) > > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > { > > > struct page *page[1]; > > > + bool write_fault = foll->flags & FOLL_WRITE; > > > > > > /* > > >* Fast pin a writable pfn only if it is a write fault request > > >* or the caller allows to map a writable pfn for a read fault > > >* request. > > >*/ > > > - if (!(write_fault || writable)) > > > + if (!(write_fault || foll->allow_write_mapping)) > > > return false; > > > > > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > > > *pfn = page_to_pfn(page[0]); > > > - > > > - if (writable) > > > - *writable = true; > > > + foll->writable = foll->allow_write_mapping; > > > return true; > > > } > > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > > bool write_fault, > > > * The slow path to get the pfn of the specified host virtual address, > > > * 1 indicates success, -errno is returned if error is detected. 
> > > */ > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > write_fault, > > > -bool interruptible, bool *writable, kvm_pfn_t > > > *pfn) > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > { > > > - unsigned int flags = FOLL_HWPOISON; > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > struct page *page; > > > int npages; > > > > > > might_sleep(); > > > > > > - if (writable) > > > - *writable = write_fault; > > > - > > > - if (write_fault) > > > - flags |= FOLL_WRITE; > > > - if (async) > > > - flags |= FOLL_NOWAIT; > > > - if (interruptible) > > > - flags |= FOLL_INTERRUPTIBLE; > > > - > > > - npages = get_user_pages_unlocked(addr, 1, , flags); > > > + npages = get_user_pages_unlocked(foll->hva, 1, , flags); > > > if (npages != 1) > > > return npages; > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > >
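Stepping back from the diff, the refactoring pattern the series applies, folding a pile of boolean parameters into a flags word carried in an argument struct with dedicated output fields, can be sketched as a standalone toy. All names here are invented; they only mirror the shape of struct kvm_follow_pfn, not its actual definition.

```c
#include <stdbool.h>

/* Toy request flags standing in for FOLL_WRITE / FOLL_NOWAIT /
 * FOLL_INTERRUPTIBLE in the real code. */
#define TOY_WRITE         0x1
#define TOY_NOWAIT        0x2
#define TOY_INTERRUPTIBLE 0x4

/* One struct replaces the long (addr, write_fault, async, interruptible,
 * writable) parameter list, and carries outputs explicitly. */
struct toy_follow {
	unsigned long hva;
	unsigned int flags;   /* TOY_* requests, replacing bool args */
	/* outputs */
	bool writable;
};

static bool toy_follow(struct toy_follow *foll)
{
	/* A real implementation would resolve foll->hva here; the toy
	 * only records whether a writable mapping was requested. */
	foll->writable = (foll->flags & TOY_WRITE) != 0;
	return true;
}

/* The old boolean-argument API survives as a thin wrapper, which is how
 * such refactors keep existing call sites working during conversion. */
static bool toy_follow_old(unsigned long hva, bool write, bool nowait)
{
	struct toy_follow foll = {
		.hva = hva,
		.flags = (write ? TOY_WRITE : 0) | (nowait ? TOY_NOWAIT : 0),
	};
	return toy_follow(&foll);
}
```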
Re: (subset) [PATCH v4 0/5] Add a new fchmodat2() syscall
On Tue, 11 Jul 2023 18:16:02 +0200, Alexey Gladkov wrote:
> In glibc, the fchmodat(3) function has a flags argument according to the
> POSIX specification [1], but kernel syscalls has no such argument.
> Therefore, libc implementations do workarounds using /proc. However,
> this requires procfs to be mounted and accessible.
>
> This patch set adds fchmodat2(), a new syscall. The syscall allows to
> pass the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other
> respects, this syscall is no different from fchmodat().
>
> [...]

Tools updates usually go separately. Flags argument ported to unsigned
int; otherwise unchanged.

---

Applied to the master branch of the vfs/vfs.git tree. Patches in the
master branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: master

[1/5] Non-functional cleanup of a "__user * filename"
      https://git.kernel.org/vfs/vfs/c/0f05a6af6b7e
[2/5] fs: Add fchmodat2()
      https://git.kernel.org/vfs/vfs/c/8d593559ec09
[3/5] arch: Register fchmodat2, usually as syscall 452
      https://git.kernel.org/vfs/vfs/c/2ee63b04f206
[5/5] selftests: Add fchmodat2 selftest
      https://git.kernel.org/vfs/vfs/c/f175b92081ec
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Thu, 6 Jul 2023 15:49:39 +0900 David Stevens wrote: > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang wrote: > > > > On Tue, 4 Jul 2023 16:50:48 +0900 > > David Stevens wrote: > > > > > From: David Stevens > > > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > > callers to resolve a gfn when the associated pfn has a valid struct page > > > that isn't being actively refcounted (e.g. tail pages of non-compound > > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > > the returned pfn must be guarded by a mmu notifier. > > > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > > is set when the returned pfn has an associated struct page with a valid > > > refcount. Callers that don't pass FOLL_GET should remember this value > > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > > non-zero refcount. > > > > > > Signed-off-by: David Stevens > > > --- > > > include/linux/kvm_host.h | 10 ++ > > > virt/kvm/kvm_main.c | 67 +--- > > > virt/kvm/pfncache.c | 2 +- > > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index ef2763c2b12e..a45308c7d2d9 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > > kvm_memory_slot *slot, gfn_t gfn, > > > void kvm_release_page_clean(struct page *page); > > > void kvm_release_page_dirty(struct page *page); > > > > > > +void kvm_set_page_accessed(struct page *page); > > > +void kvm_set_page_dirty(struct page *page); > > > + > > > struct kvm_follow_pfn { > > > const struct kvm_memory_slot *slot; > > > gfn_t gfn; > > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > > bool atomic; > > > /* Allow a read fault to create a writeable mapping. 
*/ > > > bool allow_write_mapping; > > > + /* > > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > > > > > ^guarded > > > + * be true if FOLL_GET is not set. > > > + */ > > > + bool guarded_by_mmu_notifier; > > > > > It seems no one sets the guraded_by_mmu_notifier in this patch. Is > > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the > > caller of __kvm_follow_pfn()? > > Yes, this is the case. > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail > > page? > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a > > temporary solution. I don't think it is a good idea to play tricks with > > a temporary solution, more like we are abusing the toleration. > > I'm not sure I understand what you're getting at. This series never > calls gup without FOLL_GET. > > This series aims to provide kvm_follow_pfn as a unified API on top of > gup+follow_pte. Since one of the major clients of this API uses an mmu > notifier, it makes sense to support returning a pfn without taking a > reference. And we indeed need to do that for certain types of memory. > I am not having prob with taking a pfn without taking a ref. I am questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking a pfn without a ref is a good idea or not, while there is another flag actually showing it. I can understand that using FOLL_XXX in kvm_follow_pfn saves some translation between struct kvm_follow_pfn.{write, async, } and GUP flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the requirements of GUP in the code path that going to call GUP is reasonable. But using FOLL_XXX with purposes that are not related to GUP call really feels off. Those flags can be changed in future because of GUP requirements. Then people have to figure out what actually is happening with FOLL_GET here as it is not actually tied to GUP calls. 
> > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough
> > to indicate a tail page?
>
> What do you mean by to indicate a tail page? Do you mean to indicate
> that the returned pfn refers to a non-refcounted page? That's specified
> by is_refcounted_page.

I figured out the reason why I got confused.

+	 * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages but be allocated without refcounting e.g., tail pages of
+	 * non-compound higher order allocations. If FOLL_GET is set and we
+	 * increment such a refcount, then when that pfn is eventually passed to
+	 * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+	 * freed. Therefore don't allow those pages here when FOLL_GET is set. */

The above statements only explain the wrong behavior, but don't explain
the expected behavior. It would be better to explain that for
manipulating mmu notifier guard
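The invariant under debate, that a caller which does not take a reference (no FOLL_GET) must promise its use of the pfn is guarded by an mmu notifier, can be sketched as an explicit validity check. Names are invented; the real series encodes this as a documented requirement on struct kvm_follow_pfn rather than a literal check like this.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_FOLL_GET 0x1  /* stand-in for the real FOLL_GET */

struct toy_kfp {
	unsigned int flags;
	/* Must be true if TOY_FOLL_GET is not set: uses of the returned
	 * pfn are then guarded by an mmu notifier instead of a refcount. */
	bool guarded_by_mmu_notifier;
};

/* Reject requests that would hand out a refcount-less pfn with no
 * notifier guard -- nothing would keep the page from being freed. */
static int toy_follow_pfn_check(const struct toy_kfp *foll)
{
	if (!(foll->flags & TOY_FOLL_GET) && !foll->guarded_by_mmu_notifier)
		return -EINVAL;
	return 0;
}
```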
Re: [PATCH v4 4/5] tools headers UAPI: Sync files changed by new fchmodat2 syscall
On Tue, Jul 11, 2023 at 10:19:35AM -0700, Namhyung Kim wrote: > Hello, > > On Tue, Jul 11, 2023 at 9:18 AM Alexey Gladkov wrote: > > > > From: Palmer Dabbelt > > > > That add support for this new syscall in tools such as 'perf trace'. > > > > Signed-off-by: Palmer Dabbelt > > Signed-off-by: Alexey Gladkov > > --- > > tools/include/uapi/asm-generic/unistd.h | 5 - > > tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++ > > tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 ++ > > tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 ++ > > tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ > > 5 files changed, 12 insertions(+), 1 deletion(-) > > It'd be nice if you route this patch separately through the > perf tools tree. We can add this after the kernel change > is accepted. Sure. No problem. > > > > diff --git a/tools/include/uapi/asm-generic/unistd.h > > b/tools/include/uapi/asm-generic/unistd.h > > index dd7d8e10f16d..76b5922b0d39 100644 > > --- a/tools/include/uapi/asm-generic/unistd.h > > +++ b/tools/include/uapi/asm-generic/unistd.h > > @@ -817,8 +817,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > > #define __NR_set_mempolicy_home_node 450 > > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) > > > > +#define __NR_fchmodat2 452 > > +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) > > + > > #undef __NR_syscalls > > -#define __NR_syscalls 451 > > +#define __NR_syscalls 453 > > > > /* > > * 32 bit systems traditionally used different > > diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > index 3f1886ad9d80..434728af4eaa 100644 > > --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > @@ -365,3 +365,5 @@ > > 448n64 process_mreleasesys_process_mrelease > > 449n64 futex_waitv sys_futex_waitv > > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > > +# 451 reserved for cachestat 
> > +452n64 fchmodat2 sys_fchmodat2 > > diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > index a0be127475b1..6b70b6705bd7 100644 > > --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > @@ -537,3 +537,5 @@ > > 448common process_mreleasesys_process_mrelease > > 449common futex_waitv sys_futex_waitv > > 450nospu set_mempolicy_home_node sys_set_mempolicy_home_node > > +# 451 reserved for cachestat > > +452common fchmodat2 sys_fchmodat2 > > diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > index b68f47541169..0ed90c9535b0 100644 > > --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > @@ -453,3 +453,5 @@ > > 448 commonprocess_mreleasesys_process_mrelease > > sys_process_mrelease > > 449 commonfutex_waitv sys_futex_waitv > > sys_futex_waitv > > 450 commonset_mempolicy_home_node sys_set_mempolicy_home_node > > sys_set_mempolicy_home_node > > +# 451 reserved for cachestat > > +452 commonfchmodat2 sys_fchmodat2 > > sys_fchmodat2 > > diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > index c84d12608cd2..a008724a1f48 100644 > > --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -372,6 +372,8 @@ > > 448common process_mreleasesys_process_mrelease > > 449common futex_waitv sys_futex_waitv > > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > > +# 451 reserved for cachestat > > +452common fchmodat2 sys_fchmodat2 > > > > # > > # Due to a historical design error, certain syscalls are numbered > > differently > > -- > > 2.33.8 > > > -- Rgrds, legion
Re: [PATCH v4 4/5] tools headers UAPI: Sync files changed by new fchmodat2 syscall
Hello, On Tue, Jul 11, 2023 at 9:18 AM Alexey Gladkov wrote: > > From: Palmer Dabbelt > > That add support for this new syscall in tools such as 'perf trace'. > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov > --- > tools/include/uapi/asm-generic/unistd.h | 5 - > tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++ > tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 ++ > tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 ++ > tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ > 5 files changed, 12 insertions(+), 1 deletion(-) It'd be nice if you route this patch separately through the perf tools tree. We can add this after the kernel change is accepted. Thanks, Namhyung > > diff --git a/tools/include/uapi/asm-generic/unistd.h > b/tools/include/uapi/asm-generic/unistd.h > index dd7d8e10f16d..76b5922b0d39 100644 > --- a/tools/include/uapi/asm-generic/unistd.h > +++ b/tools/include/uapi/asm-generic/unistd.h > @@ -817,8 +817,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > #define __NR_set_mempolicy_home_node 450 > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) > > +#define __NR_fchmodat2 452 > +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) > + > #undef __NR_syscalls > -#define __NR_syscalls 451 > +#define __NR_syscalls 453 > > /* > * 32 bit systems traditionally used different > diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > index 3f1886ad9d80..434728af4eaa 100644 > --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > @@ -365,3 +365,5 @@ > 448n64 process_mreleasesys_process_mrelease > 449n64 futex_waitv sys_futex_waitv > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452n64 fchmodat2 sys_fchmodat2 > diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > index 
a0be127475b1..6b70b6705bd7 100644 > --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > @@ -537,3 +537,5 @@ > 448common process_mreleasesys_process_mrelease > 449common futex_waitv sys_futex_waitv > 450nospu set_mempolicy_home_node sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452common fchmodat2 sys_fchmodat2 > diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > index b68f47541169..0ed90c9535b0 100644 > --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > @@ -453,3 +453,5 @@ > 448 commonprocess_mreleasesys_process_mrelease > sys_process_mrelease > 449 commonfutex_waitv sys_futex_waitv > sys_futex_waitv > 450 commonset_mempolicy_home_node sys_set_mempolicy_home_node > sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452 commonfchmodat2 sys_fchmodat2 > sys_fchmodat2 > diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > index c84d12608cd2..a008724a1f48 100644 > --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > @@ -372,6 +372,8 @@ > 448common process_mreleasesys_process_mrelease > 449common futex_waitv sys_futex_waitv > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452common fchmodat2 sys_fchmodat2 > > # > # Due to a historical design error, certain syscalls are numbered differently > -- > 2.33.8 >
Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation
On 11.07.23 06:48, Aneesh Kumar K.V wrote: Add a new kconfig option that can be selected if we want to allow pageblock alignment by reserving pages in the vmemmap altmap area. This implies we will be reserving some pages for every memory block. This also allows the memmap on memory feature to be widely useful with different memory block size values. "reserving pages" is a nice way of saying "wasting memory". :) Let's spell that out. I think we have to find a better name for this, and I think we should have a toggle similar to memory_hotplug.memmap_on_memory. This should be an admin decision, not some kernel config option. memory_hotplug.force_memmap_on_memory "Enable the memmap on memory feature even if it could result in memory waste due to memmap size limitations. For example, if the memmap for a memory block requires 1 MiB, but the pageblock size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are still cases where the feature cannot be enforced: for example, if the memmap is smaller than a single page, or if the architecture does not support the forced mode in all configurations." Thoughts? Signed-off-by: Aneesh Kumar K.V --- mm/Kconfig | 9 +++ mm/memory_hotplug.c | 59 + 2 files changed, 58 insertions(+), 10 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 932349271e28..88a1472b2086 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -570,6 +570,15 @@ config MHP_MEMMAP_ON_MEMORY depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE +config MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY + bool "Allow reserving pages for pageblock alignment" + depends on MHP_MEMMAP_ON_MEMORY + help + This option allows the memmap on memory feature to be more useful + with different memory block sizes. This is achieved by marking some pages + in each memory block as reserved so that we can get pageblock alignment + for the remaining pages.
+ endif # MEMORY_HOTPLUG config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 07c99b0cc371..f36aec1f7626 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1252,15 +1252,17 @@ static inline bool arch_supports_memmap_on_memory(unsigned long size) { unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); - unsigned long remaining_size = size - vmemmap_size; - return IS_ALIGNED(vmemmap_size, PMD_SIZE) && - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); + return IS_ALIGNED(vmemmap_size, PMD_SIZE); } #endif static bool mhp_supports_memmap_on_memory(unsigned long size) { + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); + unsigned long remaining_size = size - vmemmap_size; + /* * Besides having arch support and the feature enabled at runtime, we * need a few more assumptions to hold true: @@ -1287,9 +1289,30 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ - return mhp_memmap_on_memory() && - size == memory_block_size_bytes() && - arch_supports_memmap_on_memory(size); + if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) + return false; +/* + * Without page reservation remaining pages should be pageblock aligned. 
+ */ + if (!IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY) && + !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT))) + return false; + + return arch_supports_memmap_on_memory(size); +} + +static inline unsigned long memory_block_align_base(unsigned long size) +{ + if (IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY)) { + unsigned long align; + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size; + + vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> PAGE_SHIFT; + align = pageblock_align(vmemmap_size) - vmemmap_size; We should probably have a helper to calculate a) the unaligned vmemmap size, for example used in arch_supports_memmap_on_memory() b) the pageblock-aligned vmemmap size. -- Cheers, David / dhildenb
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 09:35:13AM -0700, Suren Baghdasaryan wrote: > On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote: > > > > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote: > > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote: > > > > On 7/11/23 12:35, Leon Romanovsky wrote: > > > > > > > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote: > > > > > > > > > > <...> > > > > > > > > > >> Laurent Dufour (1): > > > > >> powerc/mm: try VMA lock-based page fault handling first > > > > > > > > > > Hi, > > > > > > > > > > This series and specifically the commit above broke docker over PPC. > > > > > It causes to docker service stuck while trying to activate. Revert of > > > > > this commit allows us to use docker again. > > > > > > > > Hi, > > > > > > > > there have been follow-up fixes, that are part of 6.4.3 stable (also > > > > 6.5-rc1) Does that version work for you? > > > > > > I'll recheck it again on clean system, but for the record: > > > 1. We are running 6.5-rc1 kernels. > > > 2. PPC doesn't compile for us on -rc1 without this fix. > > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/ > > > > Ohh, I see it in -rc1, let's recheck. > > Hi Leon, > Please let us know how it goes. Once, we rebuilt clean -rc1, docker worked for us. Sorry for the noise. > > > > > > 3. I didn't see anything relevant -rc1 with "git log > > > arch/powerpc/mm/fault.c". > > The fixes Vlastimil was referring to are not in the fault.c, they are > in the main mm and fork code. More specifically, check for these > patches to exist in the branch you are testing: > > mm: lock newly mapped VMA with corrected ordering > fork: lock VMAs of the parent process when forking > mm: lock newly mapped VMA which can be modified after it becomes visible > mm: lock a vma before stack expansion Thanks > > Thanks, > Suren. > > > > > > > Do you have in mind anything specific to check? 
> > > > > > Thanks
Re: [PATCH v4 2/5] fs: Add fchmodat2()
On Tue, Jul 11, 2023 at 06:16:04PM +0200, Alexey Gladkov wrote: > On the userspace side fchmodat(3) is implemented as a wrapper > function which implements the POSIX-specified interface. This > interface differs from the underlying kernel system call, which does not > have a flags argument. Most implementations require procfs [1][2]. > > There doesn't appear to be a good userspace workaround for this issue > but the implementation in the kernel is pretty straight-forward. > > The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > unlike existing fchmodat. > > [1] > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > [2] > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > Co-developed-by: Palmer Dabbelt > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov > Acked-by: Arnd Bergmann > --- > fs/open.c| 18 ++ > include/linux/syscalls.h | 2 ++ > 2 files changed, 16 insertions(+), 4 deletions(-) > > diff --git a/fs/open.c b/fs/open.c > index 0c55c8e7f837..39a7939f0d00 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) > return err; > } > > -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) > +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, > int lookup_flags) Should all be unsigned instead of int here for flags. We also had a documentation update to that effect but smh never sent it. user_path_at() itself takes an unsigned as well. I'll fix that up though.
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote: > > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote: > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote: > > > On 7/11/23 12:35, Leon Romanovsky wrote: > > > > > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote: > > > > > > > > <...> > > > > > > > >> Laurent Dufour (1): > > > >> powerc/mm: try VMA lock-based page fault handling first > > > > > > > > Hi, > > > > > > > > This series and specifically the commit above broke docker over PPC. > > > > It causes to docker service stuck while trying to activate. Revert of > > > > this commit allows us to use docker again. > > > > > > Hi, > > > > > > there have been follow-up fixes, that are part of 6.4.3 stable (also > > > 6.5-rc1) Does that version work for you? > > > > I'll recheck it again on clean system, but for the record: > > 1. We are running 6.5-rc1 kernels. > > 2. PPC doesn't compile for us on -rc1 without this fix. > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/ > > Ohh, I see it in -rc1, let's recheck. Hi Leon, Please let us know how it goes. > > > 3. I didn't see anything relevant -rc1 with "git log > > arch/powerpc/mm/fault.c". The fixes Vlastimil was referring to are not in the fault.c, they are in the main mm and fork code. More specifically, check for these patches to exist in the branch you are testing: mm: lock newly mapped VMA with corrected ordering fork: lock VMAs of the parent process when forking mm: lock newly mapped VMA which can be modified after it becomes visible mm: lock a vma before stack expansion Thanks, Suren. > > > > Do you have in mind anything specific to check? > > > > Thanks > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
[PATCH v4 4/5] tools headers UAPI: Sync files changed by new fchmodat2 syscall
From: Palmer Dabbelt This adds support for the new syscall in tools such as 'perf trace'. Signed-off-by: Palmer Dabbelt Signed-off-by: Alexey Gladkov --- tools/include/uapi/asm-generic/unistd.h | 5 - tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++ tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 ++ tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 ++ tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ 5 files changed, 12 insertions(+), 1 deletion(-) diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h index dd7d8e10f16d..76b5922b0d39 100644 --- a/tools/include/uapi/asm-generic/unistd.h +++ b/tools/include/uapi/asm-generic/unistd.h @@ -817,8 +817,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_fchmodat2 452 +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 453 /* * 32 bit systems traditionally used different diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl index 3f1886ad9d80..434728af4eaa 100644 --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl @@ -365,3 +365,5 @@ 448n64 process_mreleasesys_process_mrelease 449n64 futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node +# 451 reserved for cachestat +452n64 fchmodat2 sys_fchmodat2 diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl index a0be127475b1..6b70b6705bd7 100644 --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl @@ -537,3 +537,5 @@ 448common process_mreleasesys_process_mrelease 449common futex_waitv sys_futex_waitv 450nospu set_mempolicy_home_node
sys_set_mempolicy_home_node +# 451 reserved for cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl index b68f47541169..0ed90c9535b0 100644 --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl @@ -453,3 +453,5 @@ 448 commonprocess_mreleasesys_process_mrelease sys_process_mrelease 449 commonfutex_waitv sys_futex_waitv sys_futex_waitv 450 commonset_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node +# 451 reserved for cachestat +452 commonfchmodat2 sys_fchmodat2 sys_fchmodat2 diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..a008724a1f48 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,8 @@ 448common process_mreleasesys_process_mrelease 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node +# 451 reserved for cachestat +452common fchmodat2 sys_fchmodat2 # # Due to a historical design error, certain syscalls are numbered differently -- 2.33.8
Re: [PATCH v4 3/5] arch: Register fchmodat2, usually as syscall 452
On Tue, Jul 11, 2023, at 18:16, Alexey Gladkov wrote: > From: Palmer Dabbelt > > This registers the new fchmodat2 syscall in most places as number 452, > with alpha being the exception where it's 562. I found all these sites > by grepping for fspick, which I assume has found me everything. > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov Acked-by: Arnd Bergmann
[PATCH v4 5/5] selftests: Add fchmodat2 selftest
The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag fails. This is because not all filesystems support changing the mode bits of symlinks properly. These filesystems return an error but change the mode bits: newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, AT_SYMLINK_NOFOLLOW) = 0 syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 EOPNOTSUPP (Operation not supported) newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 This happens with btrfs and xfs: $ tools/testing/selftests/fchmodat2/fchmodat2_test TAP version 13 1..1 ok 1 # SKIP fchmodat2(symlink) # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 $ stat /tmp/ksft-fchmodat2.*/symlink File: /tmp/ksft-fchmodat2.3NCqlE/symlink -> regfile Size: 7 Blocks: 0 IO Block: 4096 symbolic link Device: 7,0 Inode: 133 Links: 1 Access: (0600/lrw---) Uid: (0/root) Gid: (0/root) Signed-off-by: Alexey Gladkov --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat2/.gitignore | 2 + tools/testing/selftests/fchmodat2/Makefile| 6 + .../selftests/fchmodat2/fchmodat2_test.c | 162 ++ 4 files changed, 171 insertions(+) create mode 100644 tools/testing/selftests/fchmodat2/.gitignore create mode 100644 tools/testing/selftests/fchmodat2/Makefile create mode 100644 tools/testing/selftests/fchmodat2/fchmodat2_test.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 666b56f22a41..8dca8acdb671 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -18,6 +18,7 @@ TARGETS += drivers/net/bonding TARGETS += drivers/net/team TARGETS += efivarfs TARGETS += exec +TARGETS += fchmodat2 TARGETS += filesystems TARGETS += filesystems/binderfs TARGETS += filesystems/epoll diff --git a/tools/testing/selftests/fchmodat2/.gitignore b/tools/testing/selftests/fchmodat2/.gitignore new 
file mode 100644 index ..82a4846cbc4b --- /dev/null +++ b/tools/testing/selftests/fchmodat2/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +/*_test diff --git a/tools/testing/selftests/fchmodat2/Makefile b/tools/testing/selftests/fchmodat2/Makefile new file mode 100644 index ..45b519eab851 --- /dev/null +++ b/tools/testing/selftests/fchmodat2/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined +TEST_GEN_PROGS := fchmodat2_test + +include ../lib.mk diff --git a/tools/testing/selftests/fchmodat2/fchmodat2_test.c b/tools/testing/selftests/fchmodat2/fchmodat2_test.c new file mode 100644 index ..2d98eb215bc6 --- /dev/null +++ b/tools/testing/selftests/fchmodat2/fchmodat2_test.c @@ -0,0 +1,162 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#ifndef __NR_fchmodat2 + #if defined __alpha__ + #define __NR_fchmodat2 562 + #elif defined _MIPS_SIM + #if _MIPS_SIM == _MIPS_SIM_ABI32/* o32 */ + #define __NR_fchmodat2 (452 + 4000) + #endif + #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ + #define __NR_fchmodat2 (452 + 6000) + #endif + #if _MIPS_SIM == _MIPS_SIM_ABI64/* n64 */ + #define __NR_fchmodat2 (452 + 5000) + #endif + #elif defined __ia64__ + #define __NR_fchmodat2 (452 + 1024) + #else + #define __NR_fchmodat2 452 + #endif +#endif + +int sys_fchmodat2(int dfd, const char *filename, mode_t mode, int flags) +{ + int ret = syscall(__NR_fchmodat2, dfd, filename, mode, flags); + + return ret >= 0 ? ret : -errno; +} + +int setup_testdir(void) +{ + int dfd, ret; + char dirname[] = "/tmp/ksft-fchmodat2.XX"; + + /* Make the top-level directory. 
*/ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("%s: failed to create tmpdir\n", __func__); + + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("%s: failed to open tmpdir\n", __func__); + + ret = openat(dfd, "regfile", O_CREAT | O_WRONLY | O_TRUNC, 0644); + if (ret < 0) + ksft_exit_fail_msg("%s: failed to create file in tmpdir\n", + __func__); + close(ret); + + ret = symlinkat("regfile", dfd, "symlink"); + if (ret < 0) + ksft_exit_fail_msg("%s: failed to create symlink in tmpdir\n", + __func__); + + return dfd; +} + +int expect_mode(int dfd, const char *filename,
[PATCH v4 3/5] arch: Register fchmodat2, usually as syscall 452
From: Palmer Dabbelt This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grepping for fspick, which I assume has found me everything. Signed-off-by: Palmer Dabbelt Signed-off-by: Alexey Gladkov --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 ++ arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl| 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/uapi/asm-generic/unistd.h | 5 - 19 files changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 1f13995d00d7..ad37569d0507 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -491,3 +491,4 @@ 559common futex_waitv sys_futex_waitv 560common set_mempolicy_home_node sys_ni_syscall 561common cachestat sys_cachestat +562common fchmodat2 sys_fchmodat2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 8ebed8a13874..c572d6c3dee0 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -465,3 +465,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/arm64/include/asm/unistd.h 
b/arch/arm64/include/asm/unistd.h index 64a514f90131..bd77253b62e0 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 452 +#define __NR_compat_syscalls 453 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index d952a28463e0..78b68311ec81 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -909,6 +909,8 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) #define __NR_cachestat 451 __SYSCALL(__NR_cachestat, sys_cachestat) +#define __NR_fchmodat2 452 +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index f8c74ffeeefb..83d8609aec03 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -372,3 +372,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 4f504783371f..259ceb125367 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -451,3 +451,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 858d22bf275c..a3798c2637fd 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -457,3 
+457,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 1976317d4e8b..152034b8e0a0 100644 ---
[PATCH v4 2/5] fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2]. There doesn't appear to be a good userspace workaround for this issue, but the implementation in the kernel is pretty straightforward. The new fchmodat2() syscall allows passing the AT_SYMLINK_NOFOLLOW flag, unlike the existing fchmodat(). [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 Co-developed-by: Palmer Dabbelt Signed-off-by: Palmer Dabbelt Signed-off-by: Alexey Gladkov Acked-by: Arnd Bergmann --- fs/open.c| 18 ++ include/linux/syscalls.h | 2 ++ 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/open.c b/fs/open.c index 0c55c8e7f837..39a7939f0d00 100644 --- a/fs/open.c +++ b/fs/open.c @@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) return err; } -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int lookup_flags) { struct path path; int error; - unsigned int lookup_flags = LOOKUP_FOLLOW; + retry: error = user_path_at(dfd, filename, lookup_flags, &path); if (!error) { @@ -689,15 +689,25 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) return error; } +SYSCALL_DEFINE4(fchmodat2, int, dfd, const char __user *, filename, + umode_t, mode, int, flags) +{ + if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW)) + return -EINVAL; + + return do_fchmodat(dfd, filename, mode, + flags & AT_SYMLINK_NOFOLLOW ? 
0 : LOOKUP_FOLLOW); +} + SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, umode_t, mode) { - return do_fchmodat(dfd, filename, mode); + return do_fchmodat(dfd, filename, mode, LOOKUP_FOLLOW); } SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode) { - return do_fchmodat(AT_FDCWD, filename, mode); + return do_fchmodat(AT_FDCWD, filename, mode, LOOKUP_FOLLOW); } /* diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 584f404bf868..6e852279fbc3 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -440,6 +440,8 @@ asmlinkage long sys_chroot(const char __user *filename); asmlinkage long sys_fchmod(unsigned int fd, umode_t mode); asmlinkage long sys_fchmodat(int dfd, const char __user *filename, umode_t mode); +asmlinkage long sys_fchmodat2(int dfd, const char __user *filename, +umode_t mode, int flags); asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group, int flag); asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group); -- 2.33.8
[PATCH v4 1/5] Non-functional cleanup of a "__user * filename"
From: Palmer Dabbelt The next patch defines a very similar interface, which I copied from this definition. Since I'm touching it anyway I don't see any reason not to just go fix this one up. Signed-off-by: Palmer Dabbelt Acked-by: Arnd Bergmann --- include/linux/syscalls.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 03e3d0121d5e..584f404bf868 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -438,7 +438,7 @@ asmlinkage long sys_chdir(const char __user *filename); asmlinkage long sys_fchdir(unsigned int fd); asmlinkage long sys_chroot(const char __user *filename); asmlinkage long sys_fchmod(unsigned int fd, umode_t mode); -asmlinkage long sys_fchmodat(int dfd, const char __user * filename, +asmlinkage long sys_fchmodat(int dfd, const char __user *filename, umode_t mode); asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group, int flag); -- 2.33.8
[PATCH v4 0/5] Add a new fchmodat2() syscall
In glibc, the fchmodat(3) function has a flags argument according to the POSIX specification [1], but the kernel syscall has no such argument. Therefore, libc implementations fall back to workarounds using /proc. However, this requires procfs to be mounted and accessible. This patch set adds fchmodat2(), a new syscall. The syscall allows passing the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other respects, this syscall is no different from fchmodat(). [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/chmod.html Changes since v3 [cover.1689074739.git.leg...@kernel.org]: * Rebased to master because a new syscall has appeared in master. * Increased __NR_compat_syscalls as pointed out by Arnd Bergmann. * Syscall renamed fchmodat4 -> fchmodat2 as suggested by Christian Brauner. * Gave do_fchmodat4() back its original name. We don't need to version internal functions. * Fixed warnings found by checkpatch.pl. Changes since v2 [20190717012719.5524-1-pal...@sifive.com]: * Rebased to master. * The lookup_flags passed to sys_fchmodat4 as suggested by Al Viro. * Selftest added. Changes since v1 [20190531191204.4044-1-pal...@sifive.com]: * All architectures are now supported, with the support squashed into a single patch. * The do_fchmodat() helper function has been removed, in favor of directly calling do_fchmodat4(). * The patches are based on 5.2 instead of 5.1. 
--- Alexey Gladkov (2): fs: Add fchmodat2() selftests: Add fchmodat2 selftest Palmer Dabbelt (3): Non-functional cleanup of a "__user * filename" arch: Register fchmodat2, usually as syscall 452 tools headers UAPI: Sync files changed by new fchmodat2 syscall arch/alpha/kernel/syscalls/syscall.tbl| 1 + arch/arm/tools/syscall.tbl| 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl| 1 + arch/x86/entry/syscalls/syscall_32.tbl| 1 + arch/x86/entry/syscalls/syscall_64.tbl| 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/open.c | 18 +- include/linux/syscalls.h | 4 +- include/uapi/asm-generic/unistd.h | 5 +- tools/include/uapi/asm-generic/unistd.h | 5 +- .../arch/mips/entry/syscalls/syscall_n64.tbl | 2 + .../arch/powerpc/entry/syscalls/syscall.tbl | 2 + .../perf/arch/s390/entry/syscalls/syscall.tbl | 2 + .../arch/x86/entry/syscalls/syscall_64.tbl| 2 + tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat2/.gitignore | 2 + tools/testing/selftests/fchmodat2/Makefile| 6 + .../selftests/fchmodat2/fchmodat2_test.c | 162 ++ 30 files changed, 223 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/fchmodat2/.gitignore create mode 100644 tools/testing/selftests/fchmodat2/Makefile create mode 100644 tools/testing/selftests/fchmodat2/fchmodat2_test.c -- 2.33.8
[PATCH v4 15/15] powerpc: Implement UACCESS validation on PPC32
In order to implement UACCESS validation, objtool support for powerpc needs to be enhanced to decode more instructions. It also requires implementing switch table discovery. On PPC32 this is similar to x86: switch tables are anonymous in .rodata; the difference is that each value is relative to its index in the table. But several switch tables can be nested, so the register containing the table base address also needs to be tracked and taken into account. Don't activate it for Clang for now, because its switch tables are different from GCC's.

Then come the UACCESS enabling/disabling instructions. On booke and the 8xx this is done with a mtspr instruction. On the 8xx it targets SPRN_MD_AP, on booke SPRN_PID. Annotate those instructions.

No work has been done for ASM files; they are not used for UACCESS, so for the moment just tell objtool to ignore ASM files.

For relocatable code, the .got2 relocation preceding each global function needs to be marked as ignored, because some versions of GCC do this:

 120:	00 00 00 00	.long 0x0
			120: R_PPC_REL32	.got2+0x7ff0

0124 :
 124:	94 21 ff f0	stwu    r1,-16(r1)
 128:	7c 08 02 a6	mflr    r0
 12c:	42 9f 00 05	bcl     20,4*cr7+so,130
 130:	39 00 00 00	li      r8,0
 134:	39 20 00 08	li      r9,8
 138:	93 c1 00 08	stw     r30,8(r1)
 13c:	7f c8 02 a6	mflr    r30
 140:	90 01 00 14	stw     r0,20(r1)
 144:	80 1e ff f0	lwz     r0,-16(r30)
 148:	7f c0 f2 14	add     r30,r0,r30
 14c:	81 5e 80 00	lwz     r10,-32768(r30)
 150:	80 fe 80 04	lwz     r7,-32764(r30)

Also declare longjmp() and start_secondary_resume() as global noreturn functions, and declare __copy_tofrom_user() and __arch_clear_user() as UACCESS safe.
Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/book3s/32/kup.h | 2 + arch/powerpc/include/asm/nohash/32/kup-8xx.h | 4 +- arch/powerpc/include/asm/nohash/kup-booke.h | 4 +- arch/powerpc/kexec/core_32.c | 4 +- arch/powerpc/mm/nohash/kup.c | 2 + tools/objtool/arch/powerpc/decode.c | 155 +- .../arch/powerpc/include/arch/noreturns.h | 11 ++ tools/objtool/arch/powerpc/special.c | 36 +++- tools/objtool/check.c | 6 +- 10 files changed, 211 insertions(+), 15 deletions(-) create mode 100644 tools/objtool/arch/powerpc/include/arch/noreturns.h diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 0b1172cbeccb..cdaca38868e1 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -159,6 +159,7 @@ config PPC select ARCH_KEEP_MEMBLOCK select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO + select ARCH_OBJTOOL_SKIP_ASM select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT select ARCH_SPLIT_ARG64 if PPC32 @@ -257,6 +258,7 @@ config PPC select HAVE_OPTPROBES select HAVE_OBJTOOL if PPC32 || MPROFILE_KERNEL select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL + select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL && PPC_KUAP && PPC32 && CC_IS_GCC select HAVE_PERF_EVENTS select HAVE_PERF_EVENTS_NMI if PPC64 select HAVE_PERF_REGS diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 4e14a5427a63..842d9a6f4b7a 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -34,6 +34,7 @@ static __always_inline void uaccess_begin_32s(unsigned long addr) asm volatile(ASM_MMU_FTR_IFSET( "mfsrin %0, %1;" "rlwinm %0, %0, 0, %2;" + ASM_UACCESS_BEGIN "mtsrin %0, %1;" "isync", "", %3) : "="(tmp) @@ -48,6 +49,7 @@ static __always_inline void uaccess_end_32s(unsigned long addr) asm volatile(ASM_MMU_FTR_IFSET( "mfsrin %0, %1;" "oris %0, %0, %2;" + ASM_UACCESS_END "mtsrin %0, %1;" "isync", "", %3) : "="(tmp) diff 
--git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h index 46bc5925e5fd..38c7ed766445 100644 --- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h @@ -39,13 +39,13 @@ static __always_inline unsigned long __kuap_get_and_assert_locked(void) static __always_inline void uaccess_begin_8xx(unsigned long val) { - asm(ASM_MMU_FTR_IFSET("mtspr %0, %1", "", %2) : : + asm(ASM_UACCESS_BEGIN ASM_MMU_FTR_IFSET("mtspr %0, %1", "", %2) : :
[PATCH v4 08/15] objtool: Track general purpose register used for switch table base
A function can contain nested switch tables using different registers as base address. In order to avoid failures in tracking those switch tables, the register containing the base address needs to be taken into account.

To do so, add a 5-bit field in struct instruction that holds the ID of the register containing the base address of the switch table, and take that register into account during the backward search so as not to stop the walk when encountering a jump related to another switch table. On architectures not handling it, the ID stays zero and has no impact on the search.

To enable that, also provide arch_find_switch_table() with the dynamic jump instruction the table search relates to.

Also allow prev_insn_same_sec() to be used outside check.c so that architectures can walk backward through instructions to find out which register is used as base address for a switch table.

Signed-off-by: Christophe Leroy --- tools/objtool/arch/powerpc/special.c| 3 ++- tools/objtool/arch/x86/special.c| 3 ++- tools/objtool/check.c | 9 + tools/objtool/include/objtool/check.h | 6 -- tools/objtool/include/objtool/special.h | 3 ++- 5 files changed, 15 insertions(+), 9 deletions(-) diff --git a/tools/objtool/arch/powerpc/special.c b/tools/objtool/arch/powerpc/special.c index d33868147196..a7dd2559b536 100644 --- a/tools/objtool/arch/powerpc/special.c +++ b/tools/objtool/arch/powerpc/special.c @@ -13,7 +13,8 @@ bool arch_support_alt_relocation(struct special_alt *special_alt, } struct reloc *arch_find_switch_table(struct objtool_file *file, - struct instruction *insn) +struct instruction *insn, +struct instruction *orig_insn) { exit(-1); } diff --git a/tools/objtool/arch/x86/special.c b/tools/objtool/arch/x86/special.c index 8e8302fe909f..8cf17d94c69b 100644 --- a/tools/objtool/arch/x86/special.c +++ b/tools/objtool/arch/x86/special.c @@ -86,7 +86,8 @@ bool arch_support_alt_relocation(struct special_alt *special_alt, *NOTE: RETPOLINE made it harder still to decode dynamic jumps.
*/ struct reloc *arch_find_switch_table(struct objtool_file *file, - struct instruction *insn) +struct instruction *insn, +struct instruction *orig_insn) { struct reloc *text_reloc, *rodata_reloc; struct section *table_sec; diff --git a/tools/objtool/check.c b/tools/objtool/check.c index d51f47c4a3bd..be413c578588 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -80,8 +80,8 @@ static struct instruction *next_insn_same_func(struct objtool_file *file, return find_insn(file, func->cfunc->sec, func->cfunc->offset); } -static struct instruction *prev_insn_same_sec(struct objtool_file *file, - struct instruction *insn) +struct instruction *prev_insn_same_sec(struct objtool_file *file, + struct instruction *insn) { if (insn->idx == 0) { if (insn->prev_len) @@ -2064,7 +2064,8 @@ static struct reloc *find_jump_table(struct objtool_file *file, insn && insn_func(insn) && insn_func(insn)->pfunc == func; insn = insn->first_jump_src ?: prev_insn_same_sym(file, insn)) { - if (insn != orig_insn && insn->type == INSN_JUMP_DYNAMIC) + if (insn != orig_insn && insn->type == INSN_JUMP_DYNAMIC && + insn->gpr == orig_insn->gpr) break; /* allow small jumps within the range */ @@ -2074,7 +2075,7 @@ static struct reloc *find_jump_table(struct objtool_file *file, insn->jump_dest->offset > orig_insn->offset)) break; - table_reloc = arch_find_switch_table(file, insn); + table_reloc = arch_find_switch_table(file, insn, orig_insn); if (!table_reloc) continue; diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h index daa46f1f0965..660ea9d0393e 100644 --- a/tools/objtool/include/objtool/check.h +++ b/tools/objtool/include/objtool/check.h @@ -63,8 +63,9 @@ struct instruction { noendbr : 1, unret : 1, visited : 4, - no_reloc: 1; - /* 10 bit hole */ + no_reloc: 1, + gpr : 5; + /* 5 bit hole */ struct alt_group *alt_group; struct instruction *jump_dest; @@ -115,6 +116,7 @@ struct instruction *find_insn(struct objtool_file *file, struct section 
*sec, unsigned long offset); struct instruction *next_insn_same_sec(struct objtool_file *file, struct instruction *insn);
[PATCH v4 14/15] powerpc/bug: Annotate reachable after warning trap
This commit is copied from commit bfb1a7c91fb7 ("x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm") 'twi 31,0,0' is a BUG instruction, which is by default a dead end. But the same instruction is used for WARNINGs and the execution resumes with the following instruction. Mark it reachable so that objtool knows that it is not a dead end in that case. Also change the unreachable() annotation by __builtin_unreachable() since objtool already knows that a BUG instruction is a dead end. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/bug.h | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h index abb608dff15a..1c204ee4cc03 100644 --- a/arch/powerpc/include/asm/bug.h +++ b/arch/powerpc/include/asm/bug.h @@ -4,6 +4,7 @@ #ifdef __KERNEL__ #include +#include #ifdef CONFIG_BUG @@ -51,10 +52,11 @@ ".previous\n" #endif -#define BUG_ENTRY(insn, flags, ...)\ +#define BUG_ENTRY(insn, flags, extra, ...) \ __asm__ __volatile__( \ "1: " insn "\n" \ _EMIT_BUG_ENTRY \ + extra \ : : "i" (__FILE__), "i" (__LINE__), \ "i" (flags), \ "i" (sizeof(struct bug_entry)), \ @@ -67,12 +69,12 @@ */ #define BUG() do { \ - BUG_ENTRY("twi 31, 0, 0", 0); \ - unreachable(); \ + BUG_ENTRY("twi 31, 0, 0", 0, ""); \ + __builtin_unreachable();\ } while (0) #define HAVE_ARCH_BUG -#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags)) +#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags), ASM_REACHABLE) #ifdef CONFIG_PPC64 #define BUG_ON(x) do { \ @@ -80,7 +82,7 @@ if (x) \ BUG(); \ } else {\ - BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "r" ((__force long)(x))); \ + BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "", "r" ((__force long)(x))); \ } \ } while (0) @@ -92,7 +94,7 @@ } else {\ BUG_ENTRY(PPC_TLNEI " %4, 0", \ BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \ - "r" (__ret_warn_on)); \ + "", "r" (__ret_warn_on)); \ } \ unlikely(__ret_warn_on);\ }) -- 2.41.0
[PATCH v4 07/15] objtool: Merge mark_func_jump_tables() and add_func_jump_tables()
Those two functions loop over the instructions of a function. Merge the two loops in order to ease enhancement of table end in a following patch. Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 5a6a87ddbf27..d51f47c4a3bd 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2097,11 +2097,12 @@ static struct reloc *find_jump_table(struct objtool_file *file, * First pass: Mark the head of each jump table so that in the next pass, * we know when a given jump table ends and the next one starts. */ -static void mark_func_jump_tables(struct objtool_file *file, - struct symbol *func) +static int mark_add_func_jump_tables(struct objtool_file *file, +struct symbol *func) { - struct instruction *insn, *last = NULL; + struct instruction *insn, *last = NULL, *insn_t1 = NULL, *insn_t2; struct reloc *reloc; + int ret = 0; func_for_each_insn(file, func, insn) { if (!last) @@ -2127,17 +2128,7 @@ static void mark_func_jump_tables(struct objtool_file *file, reloc = find_jump_table(file, func, insn); if (reloc) insn->_jump_table = reloc; - } -} - -static int add_func_jump_tables(struct objtool_file *file, - struct symbol *func) -{ - struct instruction *insn, *insn_t1 = NULL, *insn_t2; - int ret = 0; - - func_for_each_insn(file, func, insn) { - if (!insn_jump_table(insn)) + else continue; if (!insn_t1) { @@ -2177,8 +2168,7 @@ static int add_jump_table_alts(struct objtool_file *file) if (func->type != STT_FUNC) continue; - mark_func_jump_tables(file, func); - ret = add_func_jump_tables(file, func); + ret = mark_add_func_jump_tables(file, func); if (ret) return ret; } -- 2.41.0
[PATCH v4 04/15] objtool: Fix JUMP_ENTRY_SIZE for bi-arch like powerpc
struct jump_entry { s32 code; s32 target; long key; }; It means that the size of the third argument depends on whether we are building a 32 bits or 64 bits kernel. Therefore JUMP_ENTRY_SIZE must depend on elf_class_addrsize(elf). To allow that, entries[] table must be initialised at runtime. This is easily done by moving it into its only user which is special_get_alts(). Signed-off-by: Christophe Leroy Acked-by: Peter Zijlstra (Intel) --- .../arch/powerpc/include/arch/special.h | 2 +- tools/objtool/special.c | 55 +-- 2 files changed, 28 insertions(+), 29 deletions(-) diff --git a/tools/objtool/arch/powerpc/include/arch/special.h b/tools/objtool/arch/powerpc/include/arch/special.h index ffef9ada7133..b17802dcf436 100644 --- a/tools/objtool/arch/powerpc/include/arch/special.h +++ b/tools/objtool/arch/powerpc/include/arch/special.h @@ -6,7 +6,7 @@ #define EX_ORIG_OFFSET 0 #define EX_NEW_OFFSET 4 -#define JUMP_ENTRY_SIZE 16 +#define JUMP_ENTRY_SIZE (8 + elf_addr_size(elf)) /* 12 on PPC32, 16 on PPC64 */ #define JUMP_ORIG_OFFSET 0 #define JUMP_NEW_OFFSET 4 #define JUMP_KEY_OFFSET 8 diff --git a/tools/objtool/special.c b/tools/objtool/special.c index 91b1950f5bd8..b3f07e8beb85 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -26,34 +26,6 @@ struct special_entry { unsigned char key; /* jump_label key */ }; -static const struct special_entry entries[] = { - { - .sec = ".altinstructions", - .group = true, - .size = ALT_ENTRY_SIZE, - .orig = ALT_ORIG_OFFSET, - .orig_len = ALT_ORIG_LEN_OFFSET, - .new = ALT_NEW_OFFSET, - .new_len = ALT_NEW_LEN_OFFSET, - .feature = ALT_FEATURE_OFFSET, - }, - { - .sec = "__jump_table", - .jump_or_nop = true, - .size = JUMP_ENTRY_SIZE, - .orig = JUMP_ORIG_OFFSET, - .new = JUMP_NEW_OFFSET, - .key = JUMP_KEY_OFFSET, - }, - { - .sec = "__ex_table", - .size = EX_ENTRY_SIZE, - .orig = EX_ORIG_OFFSET, - .new = EX_NEW_OFFSET, - }, - {}, -}; - void __weak arch_handle_alternative(unsigned short feature, struct special_alt *alt) { } @@ 
-144,6 +116,33 @@ int special_get_alts(struct elf *elf, struct list_head *alts) unsigned int nr_entries; struct special_alt *alt; int idx, ret; + const struct special_entry entries[] = { + { + .sec = ".altinstructions", + .group = true, + .size = ALT_ENTRY_SIZE, + .orig = ALT_ORIG_OFFSET, + .orig_len = ALT_ORIG_LEN_OFFSET, + .new = ALT_NEW_OFFSET, + .new_len = ALT_NEW_LEN_OFFSET, + .feature = ALT_FEATURE_OFFSET, + }, + { + .sec = "__jump_table", + .jump_or_nop = true, + .size = JUMP_ENTRY_SIZE, + .orig = JUMP_ORIG_OFFSET, + .new = JUMP_NEW_OFFSET, + .key = JUMP_KEY_OFFSET, + }, + { + .sec = "__ex_table", + .size = EX_ENTRY_SIZE, + .orig = EX_ORIG_OFFSET, + .new = EX_NEW_OFFSET, + }, + {}, + }; INIT_LIST_HEAD(alts); -- 2.41.0
[PATCH v4 02/15] objtool: Move back misplaced comment
A comment was introduced by commit 113d4bc90483 ("objtool: Fix clang switch table edge case") and wrongly moved by commit d871f7b5a6a2 ("objtool: Refactor jump table code to support other architectures") without the piece of code added with the comment in the original commit. Fixes: d871f7b5a6a2 ("objtool: Refactor jump table code to support other architectures") Signed-off-by: Christophe Leroy --- tools/objtool/arch/x86/special.c | 5 - tools/objtool/check.c| 6 ++ 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/tools/objtool/arch/x86/special.c b/tools/objtool/arch/x86/special.c index 29e949579ede..8e8302fe909f 100644 --- a/tools/objtool/arch/x86/special.c +++ b/tools/objtool/arch/x86/special.c @@ -118,11 +118,6 @@ struct reloc *arch_find_switch_table(struct objtool_file *file, strcmp(table_sec->name, C_JUMP_TABLE_SECTION)) return NULL; - /* -* Each table entry has a rela associated with it. The rela -* should reference text in the same function as the original -* instruction. -*/ rodata_reloc = find_reloc_by_dest(file->elf, table_sec, table_offset); if (!rodata_reloc) return NULL; diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 8936a05f0e5a..25f6df4713ed 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2072,6 +2072,12 @@ static struct reloc *find_jump_table(struct objtool_file *file, table_reloc = arch_find_switch_table(file, insn); if (!table_reloc) continue; + + /* +* Each table entry has a rela associated with it. The rela +* should reference text in the same function as the original +* instruction. +*/ dest_insn = find_insn(file, table_reloc->sym->sec, reloc_addend(table_reloc)); if (!dest_insn || !insn_func(dest_insn) || insn_func(dest_insn)->pfunc != func) continue; -- 2.41.0
[PATCH v4 06/15] objtool: Add support for relative switch tables
On powerpc, switch tables are relative, that means the address of the table is added to the value of the entry in order to get the pointed address: (r10 is the table address, r4 the index in the table)

	lis	r10,0		<== Load r10 with upper part of .rodata address
			R_PPC_ADDR16_HA	.rodata
	addi	r10,r10,0	<== Add lower part of .rodata address
			R_PPC_ADDR16_LO	.rodata
	lwzx	r8,r10,r4	<== Read table entry at r10 + r4 into r8
	add	r10,r8,r10	<== Add table address to read value
	mtctr	r10		<== Save calculated address in CTR
	bctr			<== Branch to address in CTR

RELOCATION RECORDS FOR [.rodata]:
OFFSET	TYPE		VALUE
	R_PPC_REL32	.text+0x054c
0004	R_PPC_REL32	.text+0x03d0
...

But for c_jump_tables it is not the case, they contain the pointed address directly:

	lis	r28,0		<== Load r28 with upper .rodata..c_jump_table
			R_PPC_ADDR16_HA	.rodata..c_jump_table
	addi	r28,r28,0	<== Add lower part of .rodata..c_jump_table
			R_PPC_ADDR16_LO	.rodata..c_jump_table
	lwzx	r10,r28,r10	<== Read table entry at r10 + r28 into r10
	mtctr	r10		<== Save read value in CTR
	bctr			<== Branch to address in CTR

RELOCATION RECORDS FOR [.rodata..c_jump_table]:
OFFSET	TYPE		VALUE
	R_PPC_ADDR32	.text+0x0dc8
0004	R_PPC_ADDR32	.text+0x0dc8
...

Add support to objtool for relative tables, based on the relocation type, which is R_PPC_REL32 for switch tables and R_PPC_ADDR32 for C jump tables. Do the comparison using R_ABS32 and R_ABS64, which are architecture agnostic. Also use the correct size for 'long' instead of hard-coding a size of 8.
Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index ae0019412123..5a6a87ddbf27 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1988,7 +1988,7 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, struct symbol *pfunc = insn_func(insn)->pfunc; struct reloc *table = insn_jump_table(insn); struct instruction *dest_insn; - unsigned int prev_offset = 0; + unsigned int offset, prev_offset = 0; struct reloc *reloc = table; struct alternative *alt; @@ -2003,7 +2003,7 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, break; /* Make sure the table entries are consecutive: */ - if (prev_offset && reloc_offset(reloc) != prev_offset + 8) + if (prev_offset && reloc_offset(reloc) != prev_offset + elf_addr_size(file->elf)) break; /* Detect function pointers from contiguous objects: */ @@ -2011,7 +2011,12 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, reloc_addend(reloc) == pfunc->offset) break; - dest_insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc)); + if (reloc_type(reloc) == R_ABS32 || reloc_type(reloc) == R_ABS64) + offset = reloc_addend(reloc); + else + offset = reloc_addend(reloc) + reloc_offset(table) - reloc_offset(reloc); + + dest_insn = find_insn(file, reloc->sym->sec, offset); if (!dest_insn) break; -- 2.41.0
[PATCH v4 12/15] objtool: Add support for more complex UACCESS control
On x86, UACCESS is controlled by two instructions: STAC and CLAC. The STAC instruction enables UACCESS while CLAC disables it. This is simple enough for objtool to locate UACCESS enable and disable. But on powerpc it is a bit more complex: the same instruction is used for enabling and disabling UACCESS, and that instruction can also be used for many other things. Relying exclusively on instruction decoding would be too complex.

To help objtool, mark such instructions in .discard.uaccess_begin and .discard.uaccess_end sections, on the same principle as for reachable/unreachable instructions. And add ASM_UACCESS_BEGIN and ASM_UACCESS_END macros to be used in inline assembly code to annotate UACCESS enable and UACCESS disable instructions.

Signed-off-by: Christophe Leroy --- include/linux/objtool.h | 14 ++ tools/objtool/check.c | 33 + 2 files changed, 47 insertions(+) diff --git a/include/linux/objtool.h b/include/linux/objtool.h index 03f82c2c2ebf..d8fde4158a40 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -57,6 +57,18 @@ ".long 998b - .\n\t"\ ".popsection\n\t" +#define ASM_UACCESS_BEGIN \ + "998:\n\t" \ + ".pushsection .discard.uaccess_begin\n\t" \ + ".long 998b - .\n\t"\ + ".popsection\n\t" + +#define ASM_UACCESS_END \ + "998:\n\t" \ + ".pushsection .discard.uaccess_end\n\t" \ + ".long 998b - .\n\t"\ + ".popsection\n\t" + #else /* __ASSEMBLY__ */ /* @@ -156,6 +168,8 @@ #define STACK_FRAME_NON_STANDARD_FP(func) #define ANNOTATE_NOENDBR #define ASM_REACHABLE +#define ASM_UACCESS_BEGIN +#define ASM_UACCESS_END #else #define ANNOTATE_INTRA_FUNCTION_CALL .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0 diff --git a/tools/objtool/check.c b/tools/objtool/check.c index d2a0dfec5909..5af6c6c3fbed 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1052,6 +1052,38 @@ static void add_ignores(struct objtool_file *file) } } +static void __add_uaccess(struct objtool_file *file, const char *name, + int type, const char *action) +{ +
struct section *rsec; + struct reloc *reloc; + struct instruction *insn; + + rsec = find_section_by_name(file->elf, name); + if (!rsec) + return; + + for_each_reloc(rsec, reloc) { + if (reloc->sym->type != STT_SECTION) { + WARN("unexpected relocation symbol type in %s: ", rsec->name); + continue; + } + insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc)); + if (!insn) { + WARN("can't find UACCESS %s insn at %s+0x%" PRIx64, +action, reloc->sym->sec->name, reloc_addend(reloc)); + continue; + } + insn->type = type; + } +} + +static void add_uaccess(struct objtool_file *file) +{ + __add_uaccess(file, ".rela.discard.uaccess_begin", INSN_STAC, "enable"); + __add_uaccess(file, ".rela.discard.uaccess_end", INSN_CLAC, "disable"); +} + /* * This is a whitelist of functions that is allowed to be called with AC set. * The list is meant to be minimal and only contains compiler instrumentation @@ -2597,6 +2629,7 @@ static int decode_sections(struct objtool_file *file) return ret; add_ignores(file); + add_uaccess(file); add_uaccess_safe(file); ret = add_ignore_alternatives(file); -- 2.41.0
[PATCH v4 11/15] objtool: .rodata.cst{2/4/8/16} are not switch tables
Exclude sections named .rodata.cst2 .rodata.cst4 .rodata.cst8 .rodata.cst16 as they won't contain switch tables. Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index ea0945f2195f..d2a0dfec5909 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2565,7 +2565,8 @@ static void mark_rodata(struct objtool_file *file) */ for_each_sec(file, sec) { if (!strncmp(sec->name, ".rodata", 7) && - !strstr(sec->name, ".str1.")) { + !strstr(sec->name, ".str1.") && + !strstr(sec->name, ".cst")) { sec->rodata = true; found = true; } -- 2.41.0
[PATCH v4 10/15] objtool: When looking for switch tables also follow conditional and dynamic jumps
When walking backward to find the base address of a switch table, also take into account conditional branches and dynamic jumps from a previous switch table. To avoid mis-routing, break when stumbling on a function return.

Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 361c832aefc8..ea0945f2195f 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2034,6 +2034,8 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, alt->next = insn->alts; insn->alts = alt; prev_offset = reloc_offset(reloc); + if (!dest_insn->first_jump_src) + dest_insn->first_jump_src = insn; } if (!prev_offset) { @@ -2068,6 +2070,9 @@ static struct reloc *find_jump_table(struct objtool_file *file, insn->gpr == orig_insn->gpr) break; + if (insn->type == INSN_RETURN) + break; + /* allow small jumps within the range */ if (insn->type == INSN_JUMP_UNCONDITIONAL && insn->jump_dest && @@ -2130,8 +2135,7 @@ static int mark_add_func_jump_tables(struct objtool_file *file, * that find_jump_table() can back-track using those and * avoid some potentially confusing code. */ - if (insn->type == INSN_JUMP_UNCONDITIONAL && insn->jump_dest && - insn->offset > last->offset && + if (is_static_jump(insn) && insn->jump_dest && insn->jump_dest->offset > insn->offset && !insn->jump_dest->first_jump_src) {
[PATCH v4 05/15] objtool: Add INSN_RETURN_CONDITIONAL
Most functions have an unconditional return at the end, like this one:

 :
   0:	81 22 04 d0	lwz     r9,1232(r2)
   4:	38 60 00 00	li      r3,0
   8:	2c 09 00 00	cmpwi   r9,0
   c:	4d 82 00 20	beqlr		<== Conditional return
  10:	80 69 00 a0	lwz     r3,160(r9)
  14:	54 63 00 36	clrrwi  r3,r3,4
  18:	68 63 04 00	xori    r3,r3,1024
  1c:	7c 63 00 34	cntlzw  r3,r3
  20:	54 63 d9 7e	srwi    r3,r3,5
  24:	4e 80 00 20	blr		<== Unconditional return

But other functions, like the one below, only have conditional returns:

0028 :
  28:	81 25 00 00	lwz     r9,0(r5)
  2c:	2c 08 00 00	cmpwi   r8,0
  30:	7d 29 30 78	andc    r9,r9,r6
  34:	7d 27 3b 78	or      r7,r9,r7
  38:	54 84 65 3a	rlwinm  r4,r4,12,20,29
  3c:	81 23 00 18	lwz     r9,24(r3)
  40:	41 82 00 58	beq     98
  44:	7d 29 20 2e	lwzx    r9,r9,r4
  48:	55 29 07 3a	rlwinm  r9,r9,0,28,29
  4c:	2c 09 00 0c	cmpwi   r9,12
  50:	41 82 00 08	beq     58
  54:	39 00 00 80	li      r8,128
  58:	2c 08 00 01	cmpwi   r8,1
  5c:	90 e5 00 00	stw     r7,0(r5)
  60:	4d a2 00 20	beqlr+		<== Conditional return
  64:	7c e9 3b 78	mr      r9,r7
  68:	39 40 00 00	li      r10,0
  6c:	39 4a 00 04	addi    r10,r10,4
  70:	7c 0a 40 00	cmpw    r10,r8
  74:	91 25 00 04	stw     r9,4(r5)
  78:	91 25 00 08	stw     r9,8(r5)
  7c:	38 a5 00 10	addi    r5,r5,16
  80:	91 25 ff fc	stw     r9,-4(r5)
  84:	4c 80 00 20	bgelr		<== Conditional return
  88:	55 49 60 26	slwi    r9,r10,12
  8c:	7d 29 3a 14	add     r9,r9,r7
  90:	91 25 00 00	stw     r9,0(r5)
  94:	4b ff ff d8	b       6c
  98:	39 00 00 04	li      r8,4
  9c:	4b ff ff bc	b       58

If conditional returns are decoded as INSN_OTHER, objtool considers that the second function never returns. If conditional returns are decoded as INSN_RETURN, objtool considers that code after a conditional return is dead.

To overcome this situation, introduce INSN_RETURN_CONDITIONAL, which is taken as confirmation that a function is not noreturn while still treating the following code as reachable.
Signed-off-by: Christophe Leroy Acked-by: Peter Zijlstra (Intel) --- tools/objtool/check.c| 2 +- tools/objtool/include/objtool/arch.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 25f6df4713ed..ae0019412123 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -219,7 +219,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func, func_for_each_insn(file, func, insn) { empty = false; - if (insn->type == INSN_RETURN) + if (insn->type == INSN_RETURN || insn->type == INSN_RETURN_CONDITIONAL) return false; } diff --git a/tools/objtool/include/objtool/arch.h b/tools/objtool/include/objtool/arch.h index 2b6d2ce4f9a5..84ba75112934 100644 --- a/tools/objtool/include/objtool/arch.h +++ b/tools/objtool/include/objtool/arch.h @@ -19,6 +19,7 @@ enum insn_type { INSN_CALL, INSN_CALL_DYNAMIC, INSN_RETURN, + INSN_RETURN_CONDITIONAL, INSN_CONTEXT_SWITCH, INSN_BUG, INSN_NOP, -- 2.41.0
[PATCH v4 13/15] objtool: Prepare noreturns.h for more architectures
noreturns.h is a mix of x86-specific functions and more generic core functions. In preparation for including powerpc, split the x86 functions out of noreturns.h into arch/noreturns.h.

Signed-off-by: Christophe Leroy --- .../objtool/arch/x86/include/arch/noreturns.h | 20 +++ tools/objtool/noreturns.h | 14 ++--- 2 files changed, 22 insertions(+), 12 deletions(-) create mode 100644 tools/objtool/arch/x86/include/arch/noreturns.h diff --git a/tools/objtool/arch/x86/include/arch/noreturns.h b/tools/objtool/arch/x86/include/arch/noreturns.h new file mode 100644 index ..a4262aff3917 --- /dev/null +++ b/tools/objtool/arch/x86/include/arch/noreturns.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * This is a (sorted!) list of all known __noreturn functions in arch/x86. + * It's needed for objtool to properly reverse-engineer the control flow graph. + * + * Yes, this is unfortunate. A better solution is in the works. + */ +NORETURN(cpu_bringup_and_idle) +NORETURN(ex_handler_msr_mce) +NORETURN(hlt_play_dead) +NORETURN(hv_ghcb_terminate) +NORETURN(machine_real_restart) +NORETURN(rewind_stack_and_make_dead) +NORETURN(sev_es_terminate) +NORETURN(snp_abort) +NORETURN(x86_64_start_kernel) +NORETURN(x86_64_start_reservations) +NORETURN(xen_cpu_bringup_again) +NORETURN(xen_start_kernel) diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h index e45c7cb1d5bc..b5e0f078dbb6 100644 --- a/tools/objtool/noreturns.h +++ b/tools/objtool/noreturns.h @@ -1,5 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ +#include + /* * This is a (sorted!) list of all known __noreturn functions in the kernel. * It's needed for objtool to properly reverse-engineer the control flow graph.
@@ -14,32 +16,20 @@ NORETURN(__stack_chk_fail) NORETURN(__ubsan_handle_builtin_unreachable) NORETURN(arch_call_rest_init) NORETURN(arch_cpu_idle_dead) -NORETURN(cpu_bringup_and_idle) NORETURN(cpu_startup_entry) NORETURN(do_exit) NORETURN(do_group_exit) NORETURN(do_task_dead) -NORETURN(ex_handler_msr_mce) NORETURN(fortify_panic) -NORETURN(hlt_play_dead) -NORETURN(hv_ghcb_terminate) NORETURN(kthread_complete_and_exit) NORETURN(kthread_exit) NORETURN(kunit_try_catch_throw) -NORETURN(machine_real_restart) NORETURN(make_task_dead) NORETURN(mpt_halt_firmware) NORETURN(nmi_panic_self_stop) NORETURN(panic) NORETURN(panic_smp_self_stop) NORETURN(rest_init) -NORETURN(rewind_stack_and_make_dead) -NORETURN(sev_es_terminate) -NORETURN(snp_abort) NORETURN(start_kernel) NORETURN(stop_this_cpu) NORETURN(usercopy_abort) -NORETURN(x86_64_start_kernel) -NORETURN(x86_64_start_reservations) -NORETURN(xen_cpu_bringup_again) -NORETURN(xen_start_kernel) -- 2.41.0
Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check
On 11.07.23 18:07, Aneesh Kumar K V wrote:
> On 7/11/23 4:06 PM, David Hildenbrand wrote:
>> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>>> Some architectures would want different restrictions. Hence add an
>>> architecture-specific override. Both the PMD_SIZE check and pageblock
>>> alignment check are moved there.
>>>
>>> Signed-off-by: Aneesh Kumar K.V
>>> ---
>>>  mm/memory_hotplug.c | 17 ++++++++++++-----
>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 1b19462f4e72..07c99b0cc371 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1247,12 +1247,20 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>>>  	return device_online(&mem->dev);
>>>  }
>>>
>>> -static bool mhp_supports_memmap_on_memory(unsigned long size)
>>> +#ifndef arch_supports_memmap_on_memory
>>
>> Can we make that a __weak function instead?
>
> We can. It is confusing because we do have these two patterns within the
> kernel where we use
>
> #ifndef x
> #endif
>
> vs
>
> __weak x
>
> What is the recommended way to override? I have mostly been using
> #ifndef for most of the arch overrides till now.

I think when placing the implementation in a C file, it's __weak. But don't ask me :)

We do this already for arch_get_mappable_range() in mm/memory_hotplug.c and IMHO it looks quite nice.

-- 
Cheers,

David / dhildenb
[PATCH v4 09/15] objtool: Find end of switch table directly
At present, the end of a switch table can only be known once the start of the following switch table has been located. This is a problem when switch tables are nested because, until the first switch table is properly added, the second one cannot be located, as the backward walk will abut on the dynamic switch of the previous one. So perform a first forward walk through the code in order to locate all possible relocations to switch tables and build a local table with those relocations. Later on, once a switch table is found, go through this local table to know where the next switch table starts. Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 62 --- 1 file changed, 46 insertions(+), 16 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index be413c578588..361c832aefc8 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2094,14 +2094,30 @@ static struct reloc *find_jump_table(struct objtool_file *file, return NULL; } +static struct reloc *find_next_table(struct instruction *insn, +struct reloc **table, unsigned int size) +{ + unsigned long offset = reloc_offset(insn_jump_table(insn)); + int i; + struct reloc *reloc = NULL; + + for (i = 0; i < size; i++) { + if (reloc_offset(table[i]) > offset && + (!reloc || reloc_offset(table[i]) < reloc_offset(reloc))) + reloc = table[i]; + } + return reloc; +} + /* * First pass: Mark the head of each jump table so that in the next pass, * we know when a given jump table ends and the next one starts. 
*/ static int mark_add_func_jump_tables(struct objtool_file *file, -struct symbol *func) +struct symbol *func, +struct reloc **table, unsigned int size) { - struct instruction *insn, *last = NULL, *insn_t1 = NULL, *insn_t2; + struct instruction *insn, *last = NULL; struct reloc *reloc; int ret = 0; @@ -2132,23 +2148,11 @@ static int mark_add_func_jump_tables(struct objtool_file *file, else continue; - if (!insn_t1) { - insn_t1 = insn; - continue; - } - - insn_t2 = insn; - - ret = add_jump_table(file, insn_t1, insn_jump_table(insn_t2)); + ret = add_jump_table(file, insn, find_next_table(insn, table, size)); if (ret) return ret; - - insn_t1 = insn_t2; } - if (insn_t1) - ret = add_jump_table(file, insn_t1, NULL); - return ret; } @@ -2161,15 +2165,41 @@ static int add_jump_table_alts(struct objtool_file *file) { struct symbol *func; int ret; + struct instruction *insn; + unsigned int size = 0, i = 0; + struct reloc **table = NULL; if (!file->rodata) return 0; + for_each_insn(file, insn) { + struct instruction *dest_insn; + struct reloc *reloc; + + func = insn_func(insn) ? insn_func(insn)->pfunc : NULL; + reloc = arch_find_switch_table(file, insn, NULL); + /* +* Each table entry has a rela associated with it. The rela +* should reference text in the same function as the original +* instruction. +*/ + if (!reloc) + continue; + dest_insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc)); + if (!dest_insn || !insn_func(dest_insn) || insn_func(dest_insn)->pfunc != func) + continue; + if (i == size) { + size += 1024; + table = realloc(table, size * sizeof(*table)); + } + table[i++] = reloc; + } + for_each_sym(file, func) { if (func->type != STT_FUNC) continue; - ret = mark_add_func_jump_tables(file, func); + ret = mark_add_func_jump_tables(file, func, table, i); if (ret) return ret; } -- 2.41.0
[PATCH v4 03/15] objtool: Allow an architecture to disable objtool on ASM files
Supporting objtool on ASM files requires quite an effort. Features like UACCESS validation don't require ASM files validation. In order to allow architectures to enable objtool validation without spending unnecessary effort on cleaning up ASM files, provide an option to disable objtool validation on ASM files. Suggested-by: Naveen N Rao Signed-off-by: Christophe Leroy --- arch/Kconfig | 5 + scripts/Makefile.build | 4 2 files changed, 9 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index aff2746c8af2..3330ed761260 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -,6 +,11 @@ config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT config HAVE_OBJTOOL bool +config ARCH_OBJTOOL_SKIP_ASM + bool + help + Architecture doesn't support objtool on ASM files + config HAVE_JUMP_LABEL_HACK bool diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 6413342a03f4..5818baddfb27 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -342,7 +342,11 @@ $(obj)/%.s: $(src)/%.S FORCE $(call if_changed_dep,cpp_s_S) quiet_cmd_as_o_S = AS $(quiet_modtag) $@ +ifndef CONFIG_ARCH_OBJTOOL_SKIP_ASM cmd_as_o_S = $(CC) $(a_flags) -c -o $@ $< $(cmd_objtool) +else + cmd_as_o_S = $(CC) $(a_flags) -c -o $@ $< +endif ifdef CONFIG_ASM_MODVERSIONS -- 2.41.0
[PATCH v4 01/15] Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"
This reverts commit 1e688dd2a3d6759d416616ff07afc4bb836c4213. That commit aimed at optimising the code around generation of WARN_ON/BUG_ON but it leads to a lot of dead code erroneously generated by GCC. That dead code becomes a problem when we start using objtool validation because objtool will abort validation with a warning as soon as it detects unreachable code. This is because unreachable code might be the indication that objtool doesn't properly decode object text. text data bss dec hex filename 9551585 3627834 224376 13403795 cc8693 vmlinux.before 9535281 3628358 224376 13388015 cc48ef vmlinux.after Once this change is reverted, in a standard configuration (pmac32 + function tracer) the text is reduced by 16k, which is around 1.7%. We already had problems with it when starting to use objtool on powerpc as a replacement for recordmcount, see commit 93e3f45a2631 ("powerpc: Fix __WARN_FLAGS() for use with Objtool") There is also a problem with at least GCC 12, on ppc64_defconfig + CONFIG_CC_OPTIMIZE_FOR_SIZE=y + CONFIG_DEBUG_SECTION_MISMATCH=y : LD .tmp_vmlinux.kallsyms1 powerpc64-linux-ld: net/ipv4/tcp_input.o:(__ex_table+0xc4): undefined reference to `.L2136' make[2]: *** [scripts/Makefile.vmlinux:36: vmlinux] Error 1 make[1]: *** [/home/chleroy/linux-powerpc/Makefile:1238: vmlinux] Error 2 Taking into account that other problems are encountered with that 'asm goto' in WARN_ON(), including build failures, keeping that change is not worth it, although it is primarily a compiler bug. Revert it for now. 
Signed-off-by: Christophe Leroy Acked-by: Naveen N Rao --- arch/powerpc/include/asm/book3s/64/kup.h | 2 +- arch/powerpc/include/asm/bug.h | 67 arch/powerpc/kernel/misc_32.S| 2 +- arch/powerpc/kernel/traps.c | 9 +--- 4 files changed, 15 insertions(+), 65 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 497a7bd31ecc..e875cb7e68dc 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -90,7 +90,7 @@ /* Prevent access to userspace using any key values */ LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED) 999: tdne\gpr1, \gpr2 - EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) + EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) #endif .endm diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h index 492530adecc2..abb608dff15a 100644 --- a/arch/powerpc/include/asm/bug.h +++ b/arch/powerpc/include/asm/bug.h @@ -4,14 +4,13 @@ #ifdef __KERNEL__ #include -#include #ifdef CONFIG_BUG #ifdef __ASSEMBLY__ #include #ifdef CONFIG_DEBUG_BUGVERBOSE -.macro __EMIT_BUG_ENTRY addr,file,line,flags +.macro EMIT_BUG_ENTRY addr,file,line,flags .section __bug_table,"aw" 5001: .4byte \addr - . .4byte 5002f - . @@ -23,7 +22,7 @@ .previous .endm #else -.macro __EMIT_BUG_ENTRY addr,file,line,flags +.macro EMIT_BUG_ENTRY addr,file,line,flags .section __bug_table,"aw" 5001: .4byte \addr - . 
.short \flags @@ -32,18 +31,6 @@ .endm #endif /* verbose */ -.macro EMIT_WARN_ENTRY addr,file,line,flags - EX_TABLE(\addr,\addr+4) - __EMIT_BUG_ENTRY \addr,\file,\line,\flags -.endm - -.macro EMIT_BUG_ENTRY addr,file,line,flags - .if \flags & 1 /* BUGFLAG_WARNING */ - .err /* Use EMIT_WARN_ENTRY for warnings */ - .endif - __EMIT_BUG_ENTRY \addr,\file,\line,\flags -.endm - #else /* !__ASSEMBLY__ */ /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and sizeof(struct bug_entry), respectively */ @@ -73,16 +60,6 @@ "i" (sizeof(struct bug_entry)), \ ##__VA_ARGS__) -#define WARN_ENTRY(insn, flags, label, ...)\ - asm_volatile_goto( \ - "1: " insn "\n" \ - EX_TABLE(1b, %l[label]) \ - _EMIT_BUG_ENTRY \ - : : "i" (__FILE__), "i" (__LINE__), \ - "i" (flags), \ - "i" (sizeof(struct bug_entry)), \ - ##__VA_ARGS__ : : label) - /* * BUG_ON() and WARN_ON() do their best to cooperate with compile-time * optimisations. However depending on the complexity of the condition @@ -95,16 +72,7 @@ } while (0) #define HAVE_ARCH_BUG -#define __WARN_FLAGS(flags) do { \ - __label__ __label_warn_on; \ - \ - WARN_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags), __label_warn_on); \ -
[PATCH v4 00/15] powerpc/objtool: uaccess validation for PPC32 (v4)
This series adds UACCESS validation for PPC32. It includes a dozen changes to objtool core. It applies on top of series "Cleanup/Optimise KUAP (v3)" https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=363368=* It is almost mature and performs code analysis for all of PPC32. In this version objtool switch table lookup has been enhanced to handle nested switch tables. Most object files are correctly decoded; only a few 'unreachable instruction' warnings remain, due to more complex functions which include back-and-forth jumps or branches. It made it possible to detect some UACCESS mess in a few files. They've been fixed through other patches. Changes in v4: - Split the series in two parts; the powerpc uaccess rework is submitted separately, see https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=363368=* - Support of UACCESS on all PPC32 including book3s/32 which was missing in v3. - More elaborate switch table lookup. - Patches 2, 7, 8, 9, 10, 11 are new - Patch 11 in series v3 is now removed. 
Changes in v3: - Rebased on top of a merge of powerpc tree and tip/objtool/core tree - Simplified support for relative switch tables based on relocation type - Taken comments from Peter Christophe Leroy (15): Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto" objtool: Move back misplaced comment objtool: Allow an architecture to disable objtool on ASM files objtool: Fix JUMP_ENTRY_SIZE for bi-arch like powerpc objtool: Add INSN_RETURN_CONDITIONAL objtool: Add support for relative switch tables objtool: Merge mark_func_jump_tables() and add_func_jump_tables() objtool: Track general purpose register used for switch table base objtool: Find end of switch table directly objtool: When looking for switch tables also follow conditional and dynamic jumps objtool: .rodata.cst{2/4/8/16} are not switch tables objtool: Add support for more complex UACCESS control objtool: Prepare noreturns.h for more architectures powerpc/bug: Annotate reachable after warning trap powerpc: Implement UACCESS validation on PPC32 arch/Kconfig | 5 + arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/book3s/32/kup.h | 2 + arch/powerpc/include/asm/book3s/64/kup.h | 2 +- arch/powerpc/include/asm/bug.h| 77 ++--- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 4 +- arch/powerpc/include/asm/nohash/kup-booke.h | 4 +- arch/powerpc/kernel/misc_32.S | 2 +- arch/powerpc/kernel/traps.c | 9 +- arch/powerpc/kexec/core_32.c | 4 +- arch/powerpc/mm/nohash/kup.c | 2 + include/linux/objtool.h | 14 ++ scripts/Makefile.build| 4 + tools/objtool/arch/powerpc/decode.c | 155 +- .../arch/powerpc/include/arch/noreturns.h | 11 ++ .../arch/powerpc/include/arch/special.h | 2 +- tools/objtool/arch/powerpc/special.c | 39 - .../objtool/arch/x86/include/arch/noreturns.h | 20 +++ tools/objtool/arch/x86/special.c | 8 +- tools/objtool/check.c | 154 - tools/objtool/include/objtool/arch.h | 1 + tools/objtool/include/objtool/check.h | 6 +- tools/objtool/include/objtool/special.h | 3 +- 
tools/objtool/noreturns.h | 14 +- tools/objtool/special.c | 55 +++ 25 files changed, 425 insertions(+), 174 deletions(-) create mode 100644 tools/objtool/arch/powerpc/include/arch/noreturns.h create mode 100644 tools/objtool/arch/x86/include/arch/noreturns.h -- 2.41.0
Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check
On 7/11/23 4:06 PM, David Hildenbrand wrote: > On 11.07.23 06:48, Aneesh Kumar K.V wrote: >> Some architectures would want different restrictions. Hence add an >> architecture-specific override. >> >> Both the PMD_SIZE check and pageblock alignment check are moved there. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> mm/memory_hotplug.c | 17 - >> 1 file changed, 12 insertions(+), 5 deletions(-) >> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >> index 1b19462f4e72..07c99b0cc371 100644 >> --- a/mm/memory_hotplug.c >> +++ b/mm/memory_hotplug.c >> @@ -1247,12 +1247,20 @@ static int online_memory_block(struct memory_block >> *mem, void *arg) >> return device_online(&mem->dev); >> } >> -static bool mhp_supports_memmap_on_memory(unsigned long size) >> +#ifndef arch_supports_memmap_on_memory > > Can we make that a __weak function instead? We can. It is confusing because we do have these two patterns within the kernel where we use #ifndef x #endif vs __weak x What is the recommended way to override ? I have mostly been using #ifndef for most of the arch overrides till now. > >> +static inline bool arch_supports_memmap_on_memory(unsigned long size) >> { >> - unsigned long nr_vmemmap_pages = size / PAGE_SIZE; >> + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; >> unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); >> unsigned long remaining_size = size - vmemmap_size; >> + return IS_ALIGNED(vmemmap_size, PMD_SIZE) && >> + IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); > > You're moving that check back to mhp_supports_memmap_on_memory() in the > following patch, where it actually belongs. So this check should stay in > mhp_supports_memmap_on_memory(). Might be reasonable to factor out the > vmemmap_size calculation. > > > Also, let's add a comment > > /* > * As default, we want the vmemmap to span a complete PMD such that we > * can map the vmemmap using a single PMD if supported by the > * architecture. 
> */ > return IS_ALIGNED(vmemmap_size, PMD_SIZE); > >> +} >> +#endif >> + >> +static bool mhp_supports_memmap_on_memory(unsigned long size) >> +{ >> /* >> * Besides having arch support and the feature enabled at runtime, we >> * need a few more assumptions to hold true: >> @@ -1280,9 +1288,8 @@ static bool mhp_supports_memmap_on_memory(unsigned >> long size) >> * populate a single PMD. >> */ >> return mhp_memmap_on_memory() && >> - size == memory_block_size_bytes() && >> - IS_ALIGNED(vmemmap_size, PMD_SIZE) && >> - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); >> + size == memory_block_size_bytes() && > > If you keep the properly aligned indentation, this will not be detected as a > change by git. > >> + arch_supports_memmap_on_memory(size); >> } >> /* > Will update the code based on the above feedback. -aneesh
Re: [PATCH 00/17] fbdev: Remove FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT flags
Hi Helge, On Tue, Jul 11, 2023 at 5:26 PM Helge Deller wrote: > On 7/11/23 16:47, Sam Ravnborg wrote: > > On Tue, Jul 11, 2023 at 08:24:40AM +0200, Thomas Zimmermann wrote: > >> On 10.07.23 at 19:19, Sam Ravnborg wrote: > >>> On Mon, Jul 10, 2023 at 02:50:04PM +0200, Thomas Zimmermann wrote: > Remove the unused flags FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT from > fbdev and drivers, as briefly discussed at [1]. Both flags were maybe > useful when fbdev had special handling for driver modules. With > commit 376b3ff54c9a ("fbdev: Nuke FBINFO_MODULE"), they are both 0 > and have no further effect. > > Patches 1 to 7 remove FBINFO_DEFAULT from drivers. Patches 2 to 5 > split this by the way the fb_info struct is being allocated. All flags > are cleared to zero during the allocation. > > Patches 8 to 16 do the same for FBINFO_FLAG_DEFAULT. Patch 8 fixes > an actual bug in how arch/sh uses the token for struct fb_videomode, > which is unrelated. > > Patch 17 removes both flag constants from > >>> > >>> We have a few more flags that are unused - should they be nuked too? > >>> FBINFO_HWACCEL_FILLRECT > >>> FBINFO_HWACCEL_ROTATE > >>> FBINFO_HWACCEL_XPAN > >> > >> It seems those are there for completeness. Nothing sets _ROTATE, > > I think some fbdev drivers had hardware acceleration for ROTATE in the > past. HWACCEL_XPAN is still in some drivers. > > >> the others are simply never checked. According to the comments, > >> some are required, some are optional. I don't know what that > >> means. > > I think it's OK if you remove those flags which aren't used anywhere, > e.g. FBINFO_HWACCEL_ROTATE. Indeed. > >> IIRC there were complaints about performance when Daniel tried to remove > >> fbcon acceleration, so not all _HWACCEL_ flags are unneeded. > > Correct. 
I think COPYAREA and FILLRECT are the bare minimum to accelerate > fbcon, IMAGEBLIT is for showing the tux penguin (?), > XPAN/YPAN and YWRAP for some hardware screen panning needed by some drivers > (not sure if this is still used as I don't have such hardware, Geert?). Yes, they are used. Anything that is handled in drivers/video/fbdev/core/ is used: $ git grep HWACCEL_ -- drivers/video/fbdev/core/ drivers/video/fbdev/core/fbcon.c: if ((info->flags & FBINFO_HWACCEL_COPYAREA) && drivers/video/fbdev/core/fbcon.c: !(info->flags & FBINFO_HWACCEL_DISABLED)) drivers/video/fbdev/core/fbcon.c: int good_pan = (cap & FBINFO_HWACCEL_YPAN) && drivers/video/fbdev/core/fbcon.c: int good_wrap = (cap & FBINFO_HWACCEL_YWRAP) && drivers/video/fbdev/core/fbcon.c: int fast_copyarea = (cap & FBINFO_HWACCEL_COPYAREA) && drivers/video/fbdev/core/fbcon.c: !(cap & FBINFO_HWACCEL_DISABLED); drivers/video/fbdev/core/fbcon.c: int fast_imageblit = (cap & FBINFO_HWACCEL_IMAGEBLIT) && drivers/video/fbdev/core/fbcon.c: !(cap & FBINFO_HWACCEL_DISABLED); BTW, I'm surprised FBINFO_HWACCEL_FILLRECT is not handled. But looking at the full history, it never was... > >> Leaving them in for reference/completeness might be an option; or not. I > >> have no strong feelings about those flags. > > I'd say drop FBINFO_HWACCEL_ROTATE at least ? Agreed. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
[PATCH v3 8/9] powerpc/kuap: KUAP enabling/disabling functions must be __always_inline
Objtool reports the following warnings: arch/powerpc/kernel/signal_32.o: warning: objtool: __prevent_user_access.constprop.0+0x4 (.text+0x4): redundant UACCESS disable arch/powerpc/kernel/signal_32.o: warning: objtool: user_access_begin+0x2c (.text+0x4c): return with UACCESS enabled arch/powerpc/kernel/signal_32.o: warning: objtool: handle_rt_signal32+0x188 (.text+0x360): call to __prevent_user_access.constprop.0() with UACCESS enabled arch/powerpc/kernel/signal_32.o: warning: objtool: handle_signal32+0x150 (.text+0x4d4): call to __prevent_user_access.constprop.0() with UACCESS enabled This is due to some KUAP enabling/disabling functions being generated out of line, although they are marked inline. Use __always_inline instead. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 18 +++ arch/powerpc/include/asm/book3s/64/kup.h | 23 ++-- arch/powerpc/include/asm/kup.h | 16 +++--- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 20 - arch/powerpc/include/asm/nohash/kup-booke.h | 22 +-- arch/powerpc/include/asm/uaccess.h | 6 ++--- 6 files changed, 53 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 452d4efa84f5..931d200afe56 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -15,19 +15,19 @@ #define KUAP_NONE (~0UL) -static inline void kuap_lock_one(unsigned long addr) +static __always_inline void kuap_lock_one(unsigned long addr) { mtsr(mfsr(addr) | SR_KS, addr); isync();/* Context sync required after mtsr() */ } -static inline void kuap_unlock_one(unsigned long addr) +static __always_inline void kuap_unlock_one(unsigned long addr) { mtsr(mfsr(addr) & ~SR_KS, addr); isync();/* Context sync required after mtsr() */ } -static inline void __kuap_save_and_lock(struct pt_regs *regs) +static __always_inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -40,11 +40,11 @@ static inline 
void __kuap_save_and_lock(struct pt_regs *regs) } #define __kuap_save_and_lock __kuap_save_and_lock -static inline void kuap_user_restore(struct pt_regs *regs) +static __always_inline void kuap_user_restore(struct pt_regs *regs) { } -static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kuap) +static __always_inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kuap) { if (unlikely(kuap != KUAP_NONE)) { current->thread.kuap = KUAP_NONE; @@ -59,7 +59,7 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua kuap_unlock_one(regs->kuap); } -static inline unsigned long __kuap_get_and_assert_locked(void) +static __always_inline unsigned long __kuap_get_and_assert_locked(void) { unsigned long kuap = current->thread.kuap; @@ -94,7 +94,7 @@ static __always_inline void __prevent_user_access(unsigned long dir) kuap_lock_one(kuap); } -static inline unsigned long __prevent_user_access_return(void) +static __always_inline unsigned long __prevent_user_access_return(void) { unsigned long flags = current->thread.kuap; @@ -106,7 +106,7 @@ static inline unsigned long __prevent_user_access_return(void) return flags; } -static inline void __restore_user_access(unsigned long flags) +static __always_inline void __restore_user_access(unsigned long flags) { if (flags != KUAP_NONE) { current->thread.kuap = flags; @@ -114,7 +114,7 @@ static inline void __restore_user_access(unsigned long flags) } } -static inline bool +static __always_inline bool __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) { unsigned long kuap = regs->kuap; diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index a014f4d9a2aa..497a7bd31ecc 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -213,14 +213,14 @@ extern u64 __ro_after_init default_iamr; * access restrictions. 
Because of this ignore AMR value when accessing * userspace via kernel thread. */ -static inline u64 current_thread_amr(void) +static __always_inline u64 current_thread_amr(void) { if (current->thread.regs) return current->thread.regs->amr; return default_amr; } -static inline u64 current_thread_iamr(void) +static __always_inline u64 current_thread_iamr(void) { if (current->thread.regs) return current->thread.regs->iamr; @@ -230,7 +230,7 @@ static inline u64 current_thread_iamr(void) #ifdef CONFIG_PPC_KUAP -static inline void kuap_user_restore(struct pt_regs *regs) +static __always_inline void
[PATCH v3 9/9] powerpc/kuap: Use ASM feature fixups instead of static branches
To avoid a useless nop on top of every uaccess enable/disable and make life easier for objtool, replace static branches by ASM feature fixups that will nop KUAP enabling instructions out in the unlikely case KUAP is disabled at boot time. Leave it as is on book3s/64 for now; it will be handled later when objtool is activated on PPC64. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 46 arch/powerpc/include/asm/kup.h | 45 +++ arch/powerpc/include/asm/nohash/32/kup-8xx.h | 30 + arch/powerpc/include/asm/nohash/kup-booke.h | 38 +--- arch/powerpc/mm/nohash/kup.c | 2 +- 5 files changed, 87 insertions(+), 74 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 931d200afe56..4e14a5427a63 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -27,6 +27,34 @@ static __always_inline void kuap_unlock_one(unsigned long addr) isync();/* Context sync required after mtsr() */ } +static __always_inline void uaccess_begin_32s(unsigned long addr) +{ + unsigned long tmp; + + asm volatile(ASM_MMU_FTR_IFSET( + "mfsrin %0, %1;" + "rlwinm %0, %0, 0, %2;" + "mtsrin %0, %1;" + "isync", "", %3) + : "=&r"(tmp) + : "r"(addr), "i"(~SR_KS), "i"(MMU_FTR_KUAP) + : "memory"); +} + +static __always_inline void uaccess_end_32s(unsigned long addr) +{ + unsigned long tmp; + + asm volatile(ASM_MMU_FTR_IFSET( + "mfsrin %0, %1;" + "oris %0, %0, %2;" + "mtsrin %0, %1;" + "isync", "", %3) + : "=&r"(tmp) + : "r"(addr), "i"(SR_KS >> 16), "i"(MMU_FTR_KUAP) + : "memory"); +} + static __always_inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -69,8 +97,8 @@ static __always_inline unsigned long __kuap_get_and_assert_locked(void) } #define __kuap_get_and_assert_locked __kuap_get_and_assert_locked -static __always_inline void __allow_user_access(void __user *to, const void __user *from, - u32 size, unsigned long dir) +static 
__always_inline void allow_user_access(void __user *to, const void __user *from, + u32 size, unsigned long dir) { BUILD_BUG_ON(!__builtin_constant_p(dir)); @@ -78,10 +106,10 @@ static __always_inline void __allow_user_access(void __user *to, const void __us return; current->thread.kuap = (__force u32)to; - kuap_unlock_one((__force u32)to); + uaccess_begin_32s((__force u32)to); } -static __always_inline void __prevent_user_access(unsigned long dir) +static __always_inline void prevent_user_access(unsigned long dir) { u32 kuap = current->thread.kuap; @@ -91,26 +119,26 @@ static __always_inline void __prevent_user_access(unsigned long dir) return; current->thread.kuap = KUAP_NONE; - kuap_lock_one(kuap); + uaccess_end_32s(kuap); } -static __always_inline unsigned long __prevent_user_access_return(void) +static __always_inline unsigned long prevent_user_access_return(void) { unsigned long flags = current->thread.kuap; if (flags != KUAP_NONE) { current->thread.kuap = KUAP_NONE; - kuap_lock_one(flags); + uaccess_end_32s(flags); } return flags; } -static __always_inline void __restore_user_access(unsigned long flags) +static __always_inline void restore_user_access(unsigned long flags) { if (flags != KUAP_NONE) { current->thread.kuap = flags; - kuap_unlock_one(flags); + uaccess_begin_32s(flags); } } diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h index 77adb9cd2da5..ad7e8c5aec3f 100644 --- a/arch/powerpc/include/asm/kup.h +++ b/arch/powerpc/include/asm/kup.h @@ -72,11 +72,11 @@ static __always_inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned * platforms. 
*/ #ifndef CONFIG_PPC_BOOK3S_64 -static __always_inline void __allow_user_access(void __user *to, const void __user *from, - unsigned long size, unsigned long dir) { } -static __always_inline void __prevent_user_access(unsigned long dir) { } -static __always_inline unsigned long __prevent_user_access_return(void) { return 0UL; } -static __always_inline void __restore_user_access(unsigned long flags) { } +static __always_inline void allow_user_access(void __user *to, const void __user *from, + unsigned long size, unsigned long dir) { } +static __always_inline void
[PATCH v3 5/9] powerpc/kuap: MMU_FTR_BOOK3S_KUAP becomes MMU_FTR_KUAP
In order to reuse MMU_FTR_BOOK3S_KUAP for other targets than BOOK3S, rename it MMU_FTR_KUAP. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/hash-pkey.h | 2 +- arch/powerpc/include/asm/book3s/64/kup.h | 18 +- arch/powerpc/include/asm/mmu.h | 4 ++-- arch/powerpc/kernel/syscall.c | 2 +- arch/powerpc/mm/book3s64/pkeys.c | 2 +- 5 files changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-pkey.h b/arch/powerpc/include/asm/book3s/64/hash-pkey.h index f1e60d579f6c..6c5564c4fae4 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-pkey.h +++ b/arch/powerpc/include/asm/book3s/64/hash-pkey.h @@ -24,7 +24,7 @@ static inline u64 pte_to_hpte_pkey_bits(u64 pteflags, unsigned long flags) ((pteflags & H_PTE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | ((pteflags & H_PTE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL)); - if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP) || + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_BOOK3S_KUEP)) { if ((pte_pkey == 0) && (flags & HPTE_USE_KERNEL_KEY)) return HASH_DEFAULT_KERNEL_KEY; diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 2a7bd3ecc556..72fc4263ed26 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -31,7 +31,7 @@ mfspr \gpr2, SPRN_AMR cmpd\gpr1, \gpr2 beq 99f - END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_BOOK3S_KUAP, 68) + END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_KUAP, 68) isync mtspr SPRN_AMR, \gpr1 @@ -78,7 +78,7 @@ * No need to restore IAMR when returning to kernel space. 
*/ 100: - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67) + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) #endif .endm @@ -91,7 +91,7 @@ LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED) 999: tdne\gpr1, \gpr2 EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67) + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) #endif .endm #endif @@ -130,7 +130,7 @@ */ BEGIN_MMU_FTR_SECTION_NESTED(68) b 100f // skip_save_amr - END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_BOOK3S_KUAP, 68) + END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_KUAP, 68) /* * if pkey is disabled and we are entering from userspace @@ -166,7 +166,7 @@ mtspr SPRN_AMR, \gpr2 isync 102: - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 69) + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 69) /* * if entering from kernel we don't need save IAMR @@ -232,7 +232,7 @@ static inline u64 current_thread_iamr(void) static __always_inline bool kuap_is_disabled(void) { - return !mmu_has_feature(MMU_FTR_BOOK3S_KUAP); + return !mmu_has_feature(MMU_FTR_KUAP); } static inline void kuap_user_restore(struct pt_regs *regs) @@ -243,7 +243,7 @@ static inline void kuap_user_restore(struct pt_regs *regs) if (!mmu_has_feature(MMU_FTR_PKEY)) return; - if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) { + if (!mmu_has_feature(MMU_FTR_KUAP)) { amr = mfspr(SPRN_AMR); if (amr != regs->amr) restore_amr = true; @@ -317,7 +317,7 @@ static inline unsigned long get_kuap(void) * This has no effect in terms of actually blocking things on hash, * so it doesn't break anything. 
*/ - if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) + if (!mmu_has_feature(MMU_FTR_KUAP)) return AMR_KUAP_BLOCKED; return mfspr(SPRN_AMR); @@ -325,7 +325,7 @@ static inline unsigned long get_kuap(void) static __always_inline void set_kuap(unsigned long value) { - if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) + if (!mmu_has_feature(MMU_FTR_KUAP)) return; /* diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h index 94b981152667..82af2e2c5eca 100644 --- a/arch/powerpc/include/asm/mmu.h +++ b/arch/powerpc/include/asm/mmu.h @@ -33,7 +33,7 @@ * key 0 controlling userspace addresses on radix * Key 3 on hash */ -#define MMU_FTR_BOOK3S_KUAPASM_CONST(0x0200) +#define MMU_FTR_KUAP ASM_CONST(0x0200) /* * Supports KUEP feature @@ -188,7 +188,7 @@ enum { #endif /* CONFIG_PPC_RADIX_MMU */ #endif #ifdef CONFIG_PPC_KUAP - MMU_FTR_BOOK3S_KUAP | + MMU_FTR_KUAP | #endif /* CONFIG_PPC_KUAP */ #ifdef CONFIG_PPC_MEM_KEYS MMU_FTR_PKEY | diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c index
[PATCH v3 7/9] powerpc/kuap: Simplify KUAP lock/unlock on BOOK3S/32
On book3s/32 KUAP is performed at segment level. At the moment, when enabling userspace access, only the current segment is modified. Then if a write is performed on another user segment, a fault is taken and all other user segments get enabled for userspace access. This then requires special attention when disabling userspace access. Having a userspace write access cross a segment boundary is unlikely. Having a userspace write access cross a segment boundary back and forth is even more unlikely. So, instead of enabling userspace access on all segments when a write fault occurs, just change which segment has userspace access enabled, in order to eliminate the case where more than one segment has userspace access enabled. That simplifies userspace access deactivation. There is however a corner case which is even more unlikely but has to be handled anyway: an unaligned access crossing a segment boundary. That would definitely require userspace access enabled on at least the two segments involved. To avoid complicating the likely case for such an unlikely event, handle that situation like an alignment exception and emulate the store.
Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 65 +++- arch/powerpc/include/asm/bug.h | 1 + arch/powerpc/kernel/traps.c | 2 +- arch/powerpc/mm/book3s32/kuap.c | 15 +- 4 files changed, 23 insertions(+), 60 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 4ca6122ef0e1..452d4efa84f5 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -14,7 +14,6 @@ #include #define KUAP_NONE (~0UL) -#define KUAP_ALL (~1UL) static inline void kuap_lock_one(unsigned long addr) { @@ -28,41 +27,6 @@ static inline void kuap_unlock_one(unsigned long addr) isync();/* Context sync required after mtsr() */ } -static inline void kuap_lock_all(void) -{ - update_user_segments(mfsr(0) | SR_KS); - isync();/* Context sync required after mtsr() */ -} - -static inline void kuap_unlock_all(void) -{ - update_user_segments(mfsr(0) & ~SR_KS); - isync();/* Context sync required after mtsr() */ -} - -void kuap_lock_all_ool(void); -void kuap_unlock_all_ool(void); - -static inline void kuap_lock_addr(unsigned long addr, bool ool) -{ - if (likely(addr != KUAP_ALL)) - kuap_lock_one(addr); - else if (!ool) - kuap_lock_all(); - else - kuap_lock_all_ool(); -} - -static inline void kuap_unlock(unsigned long addr, bool ool) -{ - if (likely(addr != KUAP_ALL)) - kuap_unlock_one(addr); - else if (!ool) - kuap_unlock_all(); - else - kuap_unlock_all_ool(); -} - static inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -72,7 +36,7 @@ static inline void __kuap_save_and_lock(struct pt_regs *regs) return; current->thread.kuap = KUAP_NONE; - kuap_lock_addr(kuap, false); + kuap_lock_one(kuap); } #define __kuap_save_and_lock __kuap_save_and_lock @@ -84,7 +48,7 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua { if (unlikely(kuap != KUAP_NONE)) { current->thread.kuap = KUAP_NONE; - 
kuap_lock_addr(kuap, false); + kuap_lock_one(kuap); } if (likely(regs->kuap == KUAP_NONE)) @@ -92,7 +56,7 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua current->thread.kuap = regs->kuap; - kuap_unlock(regs->kuap, false); + kuap_unlock_one(regs->kuap); } static inline unsigned long __kuap_get_and_assert_locked(void) @@ -127,7 +91,7 @@ static __always_inline void __prevent_user_access(unsigned long dir) return; current->thread.kuap = KUAP_NONE; - kuap_lock_addr(kuap, true); + kuap_lock_one(kuap); } static inline unsigned long __prevent_user_access_return(void) @@ -136,7 +100,7 @@ static inline unsigned long __prevent_user_access_return(void) if (flags != KUAP_NONE) { current->thread.kuap = KUAP_NONE; - kuap_lock_addr(flags, true); + kuap_lock_one(flags); } return flags; @@ -146,7 +110,7 @@ static inline void __restore_user_access(unsigned long flags) { if (flags != KUAP_NONE) { current->thread.kuap = flags; - kuap_unlock(flags, true); + kuap_unlock_one(flags); } } @@ -155,14 +119,23 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) { unsigned long kuap = regs->kuap; - if (!is_write || kuap == KUAP_ALL) + if (!is_write) return
[PATCH v3 6/9] powerpc/kuap: Use MMU_FTR_KUAP on all and refactor disabling kuap
All but book3s/64 use a static branch key for disabling kuap. book3s/64 uses an mmu feature. Refactor all targets to use MMU_FTR_KUAP like book3s/64. For PPC32 that implies updating mmu features fixups once KUAP has been initialised. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 9 - arch/powerpc/include/asm/book3s/64/kup.h | 5 - arch/powerpc/include/asm/kup.h | 11 +++ arch/powerpc/include/asm/nohash/32/kup-8xx.h | 9 - arch/powerpc/include/asm/nohash/kup-booke.h | 8 arch/powerpc/kernel/cputable.c | 4 arch/powerpc/mm/book3s32/kuap.c | 5 + arch/powerpc/mm/init_32.c| 2 ++ arch/powerpc/mm/nohash/kup.c | 6 +- 9 files changed, 19 insertions(+), 40 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 0da0dea76c47..4ca6122ef0e1 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -9,10 +9,6 @@ #ifndef __ASSEMBLY__ -#include - -extern struct static_key_false disable_kuap_key; - #ifdef CONFIG_PPC_KUAP #include @@ -20,11 +16,6 @@ extern struct static_key_false disable_kuap_key; #define KUAP_NONE (~0UL) #define KUAP_ALL (~1UL) -static __always_inline bool kuap_is_disabled(void) -{ - return static_branch_unlikely(_kuap_key); -} - static inline void kuap_lock_one(unsigned long addr) { mtsr(mfsr(addr) | SR_KS, addr); diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 72fc4263ed26..a014f4d9a2aa 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -230,11 +230,6 @@ static inline u64 current_thread_iamr(void) #ifdef CONFIG_PPC_KUAP -static __always_inline bool kuap_is_disabled(void) -{ - return !mmu_has_feature(MMU_FTR_KUAP); -} - static inline void kuap_user_restore(struct pt_regs *regs) { bool restore_amr = false, restore_iamr = false; diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h index 
24cde16c4fbe..bab161b609c1 100644 --- a/arch/powerpc/include/asm/kup.h +++ b/arch/powerpc/include/asm/kup.h @@ -6,6 +6,12 @@ #define KUAP_WRITE 2 #define KUAP_READ_WRITE(KUAP_READ | KUAP_WRITE) +#ifndef __ASSEMBLY__ +#include + +static __always_inline bool kuap_is_disabled(void); +#endif + #ifdef CONFIG_PPC_BOOK3S_64 #include #endif @@ -41,6 +47,11 @@ void setup_kuep(bool disabled); #ifdef CONFIG_PPC_KUAP void setup_kuap(bool disabled); + +static __always_inline bool kuap_is_disabled(void) +{ + return !mmu_has_feature(MMU_FTR_KUAP); +} #else static inline void setup_kuap(bool disabled) { } diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h index a372cd822887..d0601859c45a 100644 --- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h @@ -9,17 +9,8 @@ #ifndef __ASSEMBLY__ -#include - #include -extern struct static_key_false disable_kuap_key; - -static __always_inline bool kuap_is_disabled(void) -{ - return static_branch_unlikely(_kuap_key); -} - static inline void __kuap_save_and_lock(struct pt_regs *regs) { regs->kuap = mfspr(SPRN_MD_AP); diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h b/arch/powerpc/include/asm/nohash/kup-booke.h index 71182cbe20c3..8e4734c8fef1 100644 --- a/arch/powerpc/include/asm/nohash/kup-booke.h +++ b/arch/powerpc/include/asm/nohash/kup-booke.h @@ -13,18 +13,10 @@ #else -#include #include #include -extern struct static_key_false disable_kuap_key; - -static __always_inline bool kuap_is_disabled(void) -{ - return static_branch_unlikely(_kuap_key); -} - static inline void __kuap_lock(void) { mtspr(SPRN_PID, 0); diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index 8a32bffefa5b..e97a0fd0ae90 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -75,6 +75,10 @@ static struct cpu_spec * __init setup_cpu_spec(unsigned long offset, t->cpu_features |= old.cpu_features & 
CPU_FTR_PMAO_BUG; } + /* Set kuap ON at startup, will be disabled later if cmdline has 'nosmap' */ + if (IS_ENABLED(CONFIG_PPC_KUAP) && IS_ENABLED(CONFIG_PPC32)) + t->mmu_features |= MMU_FTR_KUAP; + *PTRRELOC(_cpu_spec) = _cpu_spec; /* diff --git a/arch/powerpc/mm/book3s32/kuap.c b/arch/powerpc/mm/book3s32/kuap.c index 28676cabb005..24c1c686e6b9 100644 --- a/arch/powerpc/mm/book3s32/kuap.c +++ b/arch/powerpc/mm/book3s32/kuap.c @@ -3,9 +3,6 @@ #include #include -struct static_key_false disable_kuap_key; -EXPORT_SYMBOL(disable_kuap_key); - void kuap_lock_all_ool(void) { kuap_lock_all(); @@ -30,7 +27,7 @@ void
[PATCH v3 1/9] powerpc/kuap: Avoid unnecessary reads of MD_AP
A disassembly of interrupt_exit_kernel_prepare() shows a useless read of the MD_AP register. This is shown by r9 being re-used immediately without doing anything with the value read. c000e0e0: 60 00 00 00 nop c000e0e4: ===> 7d 3a c2 a6 mfmd_ap r9 < c000e0e8: 7d 20 00 a6 mfmsr r9 c000e0ec: 7c 51 13 a6 mtspr 81,r2 c000e0f0: 81 3f 00 84 lwz r9,132(r31) c000e0f4: 71 29 80 00 andi. r9,r9,32768 kuap_get_and_assert_locked() is paired with kuap_kernel_restore(), and both are only used in interrupt_exit_kernel_prepare(). The value returned by kuap_get_and_assert_locked() is only used by kuap_kernel_restore(). On 8xx, kuap_kernel_restore() doesn't use the value read by kuap_get_and_assert_locked(), so modify kuap_get_and_assert_locked() to not perform the read of MD_AP and return 0 instead. The same applies on BOOKE. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 8 ++-- arch/powerpc/include/asm/nohash/kup-booke.h | 6 ++ 2 files changed, 4 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h index c44d97751723..8579210f2a6a 100644 --- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h @@ -41,14 +41,10 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua static inline unsigned long __kuap_get_and_assert_locked(void) { - unsigned long kuap; - - kuap = mfspr(SPRN_MD_AP); - if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG)) - WARN_ON_ONCE(kuap >> 16 != MD_APG_KUAP >> 16); + WARN_ON_ONCE(mfspr(SPRN_MD_AP) >> 16 != MD_APG_KUAP >> 16); - return kuap; + return 0; } static inline void __allow_user_access(void __user *to, const void __user *from, diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h b/arch/powerpc/include/asm/nohash/kup-booke.h index 49bb41ed0816..823c5a3a96d8 100644 --- a/arch/powerpc/include/asm/nohash/kup-booke.h +++ b/arch/powerpc/include/asm/nohash/kup-booke.h @@ -58,12 +58,10 @@ static
inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua static inline unsigned long __kuap_get_and_assert_locked(void) { - unsigned long kuap = mfspr(SPRN_PID); - if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG)) - WARN_ON_ONCE(kuap); + WARN_ON_ONCE(mfspr(SPRN_PID)); - return kuap; + return 0; } static inline void __allow_user_access(void __user *to, const void __user *from, -- 2.41.0
[PATCH v3 4/9] powerpc/features: Add capability to update mmu features later
On powerpc32, features fixup is performed very early and that's too early to read the cmdline and take into account 'nosmap' parameter. On the other hand, no userspace access is performed that early and KUAP feature fixup can be performed later. Add a function to update mmu features. The function is passed a mask with the features that can be updated. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/feature-fixups.h | 1 + arch/powerpc/lib/feature-fixups.c | 31 --- 2 files changed, 28 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/feature-fixups.h b/arch/powerpc/include/asm/feature-fixups.h index ac605fc369c4..77824bd289a3 100644 --- a/arch/powerpc/include/asm/feature-fixups.h +++ b/arch/powerpc/include/asm/feature-fixups.h @@ -292,6 +292,7 @@ extern long __start___barrier_nospec_fixup, __stop___barrier_nospec_fixup; extern long __start__btb_flush_fixup, __stop__btb_flush_fixup; void apply_feature_fixups(void); +void update_mmu_feature_fixups(unsigned long mask); void setup_feature_keys(void); #endif diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c index 80def1c2afcb..4f82581ca203 100644 --- a/arch/powerpc/lib/feature-fixups.c +++ b/arch/powerpc/lib/feature-fixups.c @@ -67,7 +67,8 @@ static int patch_alt_instruction(u32 *src, u32 *dest, u32 *alt_start, u32 *alt_e return 0; } -static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) +static int patch_feature_section_mask(unsigned long value, unsigned long mask, + struct fixup_entry *fcur) { u32 *start, *end, *alt_start, *alt_end, *src, *dest; @@ -79,7 +80,7 @@ static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) if ((alt_end - alt_start) > (end - start)) return 1; - if ((value & fcur->mask) == fcur->value) + if ((value & fcur->mask & mask) == (fcur->value & mask)) return 0; src = alt_start; @@ -97,7 +98,8 @@ static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) return 0; } 
-void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) +static void do_feature_fixups_mask(unsigned long value, unsigned long mask, + void *fixup_start, void *fixup_end) { struct fixup_entry *fcur, *fend; @@ -105,7 +107,7 @@ void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) fend = fixup_end; for (; fcur < fend; fcur++) { - if (patch_feature_section(value, fcur)) { + if (patch_feature_section_mask(value, mask, fcur)) { WARN_ON(1); printk("Unable to patch feature section at %p - %p" \ " with %p - %p\n", @@ -117,6 +119,11 @@ void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) } } +void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) +{ + do_feature_fixups_mask(value, ~0, fixup_start, fixup_end); +} + #ifdef CONFIG_PPC_BARRIER_NOSPEC static bool is_fixup_addr_valid(void *dest, size_t size) { @@ -651,6 +658,17 @@ void __init apply_feature_fixups(void) do_final_fixups(); } +void __init update_mmu_feature_fixups(unsigned long mask) +{ + saved_mmu_features &= ~mask; + saved_mmu_features |= cur_cpu_spec->mmu_features & mask; + + do_feature_fixups_mask(cur_cpu_spec->mmu_features, mask, + PTRRELOC(&__start___mmu_ftr_fixup), + PTRRELOC(&__stop___mmu_ftr_fixup)); + mmu_feature_keys_init(); +} + void __init setup_feature_keys(void) { /* @@ -683,6 +701,11 @@ late_initcall(check_features); #define check(x) \ if (!(x)) printk("feature-fixups: test failed at line %d\n", __LINE__); +static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) +{ + return patch_feature_section_mask(value, ~0, fcur); +} + /* This must be after the text it fixes up, vmlinux.lds.S enforces that atm */ static struct fixup_entry fixup; -- 2.41.0
[PATCH v3 2/9] powerpc/kuap: Avoid useless jump_label on empty function
Disassembly of interrupt_enter_prepare() shows a pointless nop before the mftb c000abf0 : c000abf0: 81 23 00 84 lwz r9,132(r3) c000abf4: 71 29 40 00 andi. r9,r9,16384 c000abf8: 41 82 00 28 beq- c000ac20 c000abfc: ===> 60 00 00 00 nop < c000ac00: 7d 0c 42 e6 mftb r8 c000ac04: 80 e2 00 08 lwz r7,8(r2) c000ac08: 81 22 00 28 lwz r9,40(r2) c000ac0c: 91 02 00 24 stw r8,36(r2) c000ac10: 7d 29 38 50 subf r9,r9,r7 c000ac14: 7d 29 42 14 add r9,r9,r8 c000ac18: 91 22 00 08 stw r9,8(r2) c000ac1c: 4e 80 00 20 blr c000ac20: 60 00 00 00 nop c000ac24: 7d 5a c2 a6 mfmd_ap r10 c000ac28: 3d 20 de 00 lis r9,-8704 c000ac2c: 91 43 00 b0 stw r10,176(r3) c000ac30: 7d 3a c3 a6 mtspr 794,r9 c000ac34: 4e 80 00 20 blr That comes from the call to kuap_lock(), although __kuap_lock() is an empty function on the 8xx. To avoid that, only perform the kuap_is_disabled() check when there is something to do with __kuap_lock(). Do the same with __kuap_save_and_lock() and __kuap_get_and_assert_locked(). Signed-off-by: Christophe Leroy Reviewed-by: Nicholas Piggin --- v2: Add back comment about __kuap_lock() not needed on 64s --- arch/powerpc/include/asm/book3s/32/kup.h | 6 ++-- arch/powerpc/include/asm/book3s/64/kup.h | 10 ++ arch/powerpc/include/asm/kup.h | 33 +--- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 11 +++ arch/powerpc/include/asm/nohash/kup-booke.h | 8 +++-- 5 files changed, 29 insertions(+), 39 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 678f9c9d89b6..466a19cfb4df 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -77,10 +77,6 @@ static inline void kuap_unlock(unsigned long addr, bool ool) kuap_unlock_all_ool(); } -static inline void __kuap_lock(void) -{ -} - static inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -92,6 +88,7 @@ static inline void __kuap_save_and_lock(struct pt_regs *regs) current->thread.kuap = 
KUAP_NONE; kuap_lock_addr(kuap, false); } +#define __kuap_save_and_lock __kuap_save_and_lock static inline void kuap_user_restore(struct pt_regs *regs) { @@ -120,6 +117,7 @@ static inline unsigned long __kuap_get_and_assert_locked(void) return kuap; } +#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked static __always_inline void __allow_user_access(void __user *to, const void __user *from, u32 size, unsigned long dir) diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 84c09e546115..2a7bd3ecc556 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -298,15 +298,9 @@ static inline unsigned long __kuap_get_and_assert_locked(void) WARN_ON_ONCE(amr != AMR_KUAP_BLOCKED); return amr; } +#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked -/* Do nothing, book3s/64 does that in ASM */ -static inline void __kuap_lock(void) -{ -} - -static inline void __kuap_save_and_lock(struct pt_regs *regs) -{ -} +/* __kuap_lock() not required, book3s/64 does that in ASM */ /* * We support individually allowing read or write, but we don't support nesting diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h index d751ddd08110..24cde16c4fbe 100644 --- a/arch/powerpc/include/asm/kup.h +++ b/arch/powerpc/include/asm/kup.h @@ -52,16 +52,9 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) return false; } -static inline void __kuap_lock(void) { } -static inline void __kuap_save_and_lock(struct pt_regs *regs) { } static inline void kuap_user_restore(struct pt_regs *regs) { } static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long amr) { } -static inline unsigned long __kuap_get_and_assert_locked(void) -{ - return 0; -} - /* * book3s/64/kup-radix.h defines these functions for the !KUAP case to flush * the L1D cache after user accesses. 
Only include the empty stubs for other @@ -85,29 +78,24 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) return __bad_kuap_fault(regs, address, is_write); } -static __always_inline void kuap_assert_locked(void) -{ - if (kuap_is_disabled()) - return; - - if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG)) - __kuap_get_and_assert_locked(); -} - static __always_inline void kuap_lock(void) { +#ifdef __kuap_lock if (kuap_is_disabled()) return;
[PATCH v3 3/9] powerpc/kuap: Fold kuep_is_disabled() into its only user
kuep_is_disabled() was introduced by commit 91bb30822a2e ("powerpc/32s: Refactor update of user segment registers") but then all users but one were removed by commit 526d4a4c77ae ("powerpc/32s: Do kuep_lock() and kuep_unlock() in assembly"). Fold kuep_is_disabled() into init_new_context() which is its only user. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 5 - arch/powerpc/mm/book3s32/mmu_context.c | 2 +- 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 466a19cfb4df..0da0dea76c47 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -13,11 +13,6 @@ extern struct static_key_false disable_kuap_key; -static __always_inline bool kuep_is_disabled(void) -{ - return !IS_ENABLED(CONFIG_PPC_KUEP); -} - #ifdef CONFIG_PPC_KUAP #include diff --git a/arch/powerpc/mm/book3s32/mmu_context.c b/arch/powerpc/mm/book3s32/mmu_context.c index 269a3eb25a73..1922f9a6b058 100644 --- a/arch/powerpc/mm/book3s32/mmu_context.c +++ b/arch/powerpc/mm/book3s32/mmu_context.c @@ -71,7 +71,7 @@ int init_new_context(struct task_struct *t, struct mm_struct *mm) mm->context.id = __init_new_context(); mm->context.sr0 = CTX_TO_VSID(mm->context.id, 0); - if (!kuep_is_disabled()) + if (IS_ENABLED(CONFIG_PPC_KUEP)) mm->context.sr0 |= SR_NX; if (!kuap_is_disabled()) mm->context.sr0 |= SR_KS; -- 2.41.0
[PATCH v3 0/9] Cleanup/Optimise KUAP (v3)
This series cleans up KUAP a bit in preparation for using objtool to validate UACCESS. There are two main changes in this series: 1/ Simplification of KUAP on book3s/32 2/ Using ASM features on 32 bits and booke as suggested by Nic. Those changes will be required for objtool UACCESS validation, but they are worth it even before that, especially the simplification on 32s. Changes in v3: - Rearranged book3s/32 simplification in order to ease objtool UACCESS check implementation (patches 7 and 9) Christophe Leroy (9): powerpc/kuap: Avoid unnecessary reads of MD_AP powerpc/kuap: Avoid useless jump_label on empty function powerpc/kuap: Fold kuep_is_disabled() into its only user powerpc/features: Add capability to update mmu features later powerpc/kuap: MMU_FTR_BOOK3S_KUAP becomes MMU_FTR_KUAP powerpc/kuap: Use MMU_FTR_KUAP on all and refactor disabling kuap powerpc/kuap: Simplify KUAP lock/unlock on BOOK3S/32 powerpc/kuap: KUAP enabling/disabling functions must be __always_inline powerpc/kuap: Use ASM feature fixups instead of static branches arch/powerpc/include/asm/book3s/32/kup.h | 123 -- .../powerpc/include/asm/book3s/64/hash-pkey.h | 2 +- arch/powerpc/include/asm/book3s/64/kup.h | 54 arch/powerpc/include/asm/bug.h| 1 + arch/powerpc/include/asm/feature-fixups.h | 1 + arch/powerpc/include/asm/kup.h| 91 + arch/powerpc/include/asm/mmu.h| 4 +- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 62 + arch/powerpc/include/asm/nohash/kup-booke.h | 68 +- arch/powerpc/include/asm/uaccess.h| 6 +- arch/powerpc/kernel/cputable.c| 4 + arch/powerpc/kernel/syscall.c | 2 +- arch/powerpc/kernel/traps.c | 2 +- arch/powerpc/lib/feature-fixups.c | 31 - arch/powerpc/mm/book3s32/kuap.c | 20 +-- arch/powerpc/mm/book3s32/mmu_context.c| 2 +- arch/powerpc/mm/book3s64/pkeys.c | 2 +- arch/powerpc/mm/init_32.c | 2 + arch/powerpc/mm/nohash/kup.c | 8 +- 19 files changed, 222 insertions(+), 263 deletions(-) -- 2.41.0
Re: [PATCH v3 2/7] mm/hotplug: Allow memmap on memory hotplug request to fallback
On 7/11/23 3:53 PM, David Hildenbrand wrote: >> -bool mhp_supports_memmap_on_memory(unsigned long size) >> +static bool mhp_supports_memmap_on_memory(unsigned long size) >> { >> unsigned long nr_vmemmap_pages = size / PAGE_SIZE; >> unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); >> @@ -1339,13 +1339,12 @@ int __ref add_memory_resource(int nid, struct >> resource *res, mhp_t mhp_flags) >> * Self hosted memmap array >> */ >> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { >> - if (!mhp_supports_memmap_on_memory(size)) { >> - ret = -EINVAL; >> - goto error; >> + if (mhp_supports_memmap_on_memory(size)) { >> + mhp_altmap.free = PHYS_PFN(size); >> + mhp_altmap.base_pfn = PHYS_PFN(start); >> + params.altmap = _altmap; >> } >> - mhp_altmap.free = PHYS_PFN(size); >> - mhp_altmap.base_pfn = PHYS_PFN(start); >> - params.altmap = _altmap; >> + /* fallback to not using altmap */ >> } >> /* call arch's memory hotadd */ > > In general, LGTM, but please extend the documentation of the parameter in > memory_hotplug.h, stating that this is just a hint and that the core can > decide to no do that. > will update modified include/linux/memory_hotplug.h @@ -97,6 +97,8 @@ typedef int __bitwise mhp_t; * To do so, we will use the beginning of the hot-added range to build * the page tables for the memmap array that describes the entire range. * Only selected architectures support it with SPARSE_VMEMMAP. + * This is only a hint, core kernel can decide to not do this based on + * different alignment checks. */ #define MHP_MEMMAP_ON_MEMORY ((__force mhp_t)BIT(1))
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 7/11/23 9:14 PM, David Hildenbrand wrote: > On 11.07.23 17:40, Aneesh Kumar K V wrote: >> On 7/11/23 8:56 PM, David Hildenbrand wrote: >>> On 11.07.23 06:48, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. Hence we skip the restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also makes the feature widely useful because to use PMD_SIZE vmemmap area we require a memory block size of 2GiB We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can work with a memory block size of 256MB. Using altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with a 256MiB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgtable.h | 28 +++ .../platforms/pseries/hotplug-memory.c | 3 +- mm/memory_hotplug.c | 2 ++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 68817ea7f994..8e6c92dde6ad 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size); bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, unsigned long page_size); +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory +static inline bool arch_supports_memmap_on_memory(unsigned long size) +{ + unsigned long nr_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_pages * sizeof(struct page); + + if (!radix_enabled()) + return false; + +#ifdef CONFIG_PPC_4K_PAGES + return IS_ALIGNED(vmemmap_size, PMD_SIZE); +#else + /* + * Make sure the vmemmap allocation is fully contianed + * so that we always allocate vmemmap memory from altmap area. + * The pageblock alignment requirement is met by using + * reserve blocks in altmap. + */ + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); >>> >>> Can we move that check into common code as well? >>> >>> If our (original) vmemmap size would not fit into a single page, we would >>> be in trouble on any architecture. Did not check if it would be an issue >>> for arm64 as well in case we would allow eventually wasting memory. >>> >> >> >> For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in >> arch_supports_memmap_on_memory(). That should imply PAGE_SIZE alignment. >> If arm64 allow the usage of altmap.reserve, I would expect the >> arch_supports_memmap_on_memory to have the PAGE_SIZE check. >> >> Adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() makes it >> redundant check for x86 and arm currently? > > IMHO not an issue. The common code check is a bit weaker and the arch check a > bit stronger. 
> >> ok will update >> modified mm/memory_hotplug.c >> @@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned >> long size) >> */ >> if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) >> return false; >> + >> + /* >> + * Make sure the vmemmap allocation is fully contianed > > s/contianed/contained/ > >> + * so that we always allocate vmemmap memory from altmap area. > > In theory, it's not only the vmemmap size, but also the vmemmap start (that > it doesn't start somewhere in between a page,
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 11.07.23 17:40, Aneesh Kumar K V wrote: On 7/11/23 8:56 PM, David Hildenbrand wrote: On 11.07.23 06:48, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. Hence we skip the restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also makes the feature widely useful because to use PMD_SIZE vmemmap area we require a memory block size of 2GiB We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can work with a memory block size of 256MB. Using altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with a 256MiB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgtable.h | 28 +++ .../platforms/pseries/hotplug-memory.c | 3 +- mm/memory_hotplug.c | 2 ++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 68817ea7f994..8e6c92dde6ad 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size); bool altmap_cross_boundary(struct 
vmem_altmap *altmap, unsigned long start, unsigned long page_size); +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory +static inline bool arch_supports_memmap_on_memory(unsigned long size) +{ + unsigned long nr_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_pages * sizeof(struct page); + + if (!radix_enabled()) + return false; + +#ifdef CONFIG_PPC_4K_PAGES + return IS_ALIGNED(vmemmap_size, PMD_SIZE); +#else + /* + * Make sure the vmemmap allocation is fully contianed + * so that we always allocate vmemmap memory from altmap area. + * The pageblock alignment requirement is met by using + * reserve blocks in altmap. + */ + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); Can we move that check into common code as well? If our (original) vmemmap size would not fit into a single page, we would be in trouble on any architecture. Did not check if it would be an issue for arm64 as well in case we would allow eventually wasting memory. For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in arch_supports_memmap_on_memory(). That should imply PAGE_SIZE alignment. If arm64 allow the usage of altmap.reserve, I would expect the arch_supports_memmap_on_memory to have the PAGE_SIZE check. Adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() makes it redundant check for x86 and arm currently? IMHO not an issue. The common code check is a bit weaker and the arch check a bit stronger. 
modified mm/memory_hotplug.c @@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) */ if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) return false; + + /* +* Make sure the vmemmap allocation is fully contianed s/contianed/contained/ +* so that we always allocate vmemmap memory from altmap area. In theory, it's not only the vmemmap size, but also the vmemmap start (that it doesn't start somewhere in between a page, crossing a page). I suspect the start is always guaranteed to be aligned (of the vmemmap size is aligned), correct? +*/ + if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE)) + return false; /* * Without page reservation remaining pages should be pageblock aligned. */ -- Cheers, David / dhildenb
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 7/11/23 8:56 PM, David Hildenbrand wrote: > On 11.07.23 06:48, Aneesh Kumar K.V wrote: >> Radix vmemmap mapping can map things correctly at the PMD level or PTE >> level based on different device boundary checks. Hence we skip the >> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also >> makes the feature widely useful because to use PMD_SIZE vmemmap area we >> require a memory block size of 2GiB >> >> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature >> can work with a memory block size of 256MB. Using altmap.reserve feature >> to align things correctly at pageblock granularity. We can end up >> losing some pages in memory with this. For ex: with a 256MiB memory block >> size, we require 4 pages to map vmemmap pages, In order to align things >> correctly we end up adding a reserve of 28 pages. ie, for every 4096 >> pages 28 pages get reserved. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/Kconfig | 1 + >> arch/powerpc/include/asm/pgtable.h | 28 +++ >> .../platforms/pseries/hotplug-memory.c | 3 +- >> mm/memory_hotplug.c | 2 ++ >> 4 files changed, 33 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig >> index 116d6add0bb0..f890907e5bbf 100644 >> --- a/arch/powerpc/Kconfig >> +++ b/arch/powerpc/Kconfig >> @@ -157,6 +157,7 @@ config PPC >> select ARCH_HAS_UBSAN_SANITIZE_ALL >> select ARCH_HAVE_NMI_SAFE_CMPXCHG >> select ARCH_KEEP_MEMBLOCK >> + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU >> select ARCH_MIGHT_HAVE_PC_PARPORT >> select ARCH_MIGHT_HAVE_PC_SERIO >> select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX >> diff --git a/arch/powerpc/include/asm/pgtable.h >> b/arch/powerpc/include/asm/pgtable.h >> index 68817ea7f994..8e6c92dde6ad 100644 >> --- a/arch/powerpc/include/asm/pgtable.h >> +++ b/arch/powerpc/include/asm/pgtable.h >> @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) >> int __meminit vmemmap_populated(unsigned long 
vmemmap_addr, int >> vmemmap_map_size); >> bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, >> unsigned long page_size); >> +/* >> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details >> + * some of the restrictions. We don't check for PMD_SIZE because our >> + * vmemmap allocation code can fallback correctly. The pageblock >> + * alignment requirement is met using altmap->reserve blocks. >> + */ >> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory >> +static inline bool arch_supports_memmap_on_memory(unsigned long size) >> +{ >> + unsigned long nr_pages = size >> PAGE_SHIFT; >> + unsigned long vmemmap_size = nr_pages * sizeof(struct page); >> + >> + if (!radix_enabled()) >> + return false; >> + >> +#ifdef CONFIG_PPC_4K_PAGES >> + return IS_ALIGNED(vmemmap_size, PMD_SIZE); >> +#else >> + /* >> + * Make sure the vmemmap allocation is fully contianed >> + * so that we always allocate vmemmap memory from altmap area. >> + * The pageblock alignment requirement is met by using >> + * reserve blocks in altmap. >> + */ >> + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); > > Can we move that check into common code as well? > > If our (original) vmemmap size would not fit into a single page, we would be > in trouble on any architecture. Did not check if it would be an issue for > arm64 as well in case we would allow eventually wasting memory. > For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in arch_supports_memmap_on_memory(). That should imply PAGE_SIZE alignment. If arm64 allow the usage of altmap.reserve, I would expect the arch_supports_memmap_on_memory to have the PAGE_SIZE check. Adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() makes it redundant check for x86 and arm currently? 
modified mm/memory_hotplug.c @@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) */ if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) return false; + + /* +* Make sure the vmemmap allocation is fully contianed +* so that we always allocate vmemmap memory from altmap area. +*/ + if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE)) + return false; /* * Without page reservation remaining pages should be pageblock aligned. */
Re: [PATCH 00/17] fbdev: Remove FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT flags
On 7/11/23 16:47, Sam Ravnborg wrote: Hi Thomas, On Tue, Jul 11, 2023 at 08:24:40AM +0200, Thomas Zimmermann wrote: Hi Sam Am 10.07.23 um 19:19 schrieb Sam Ravnborg: Hi Thomas, On Mon, Jul 10, 2023 at 02:50:04PM +0200, Thomas Zimmermann wrote: Remove the unused flags FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT from fbdev and drivers, as briefly discussed at [1]. Both flags were maybe useful when fbdev had special handling for driver modules. With commit 376b3ff54c9a ("fbdev: Nuke FBINFO_MODULE"), they are both 0 and have no further effect. Patches 1 to 7 remove FBINFO_DEFAULT from drivers. Patches 2 to 5 split this by the way the fb_info struct is being allocated. All flags are cleared to zero during the allocation. Patches 8 to 16 do the same for FBINFO_FLAG_DEFAULT. Patch 8 fixes an actual bug in how arch/sh uses the token for struct fb_videomode, which is unrelated. Patch 17 removes both flag constants from <linux/fb.h>. We have a few more flags that are unused - should they be nuked too? FBINFO_HWACCEL_FILLRECT FBINFO_HWACCEL_ROTATE FBINFO_HWACCEL_XPAN It seems those are there for completeness. Nothing sets _ROTATE, I think some fbdev drivers had hardware acceleration for ROTATE in the past. HWACCEL_XPAN is still in some drivers. the others are simply never checked. According to the comments, some are required, some are optional. I don't know what that means. I think it's OK if you remove those flags which aren't used anywhere, e.g. FBINFO_HWACCEL_ROTATE. IIRC there were complaints about performance when Daniel tried to remove fbcon acceleration, so not all _HWACCEL_ flags are unneeded. Correct. I think COPYAREA and FILLRECT are the bare minimum to accelerate fbcon, IMAGEBLIT is for showing the tux penguin (?), XPAN/YPAN and YWRAP for some hardware screen panning needed by some drivers (not sure if this is still used as I don't have such hardware, Geert?). Leaving them in for reference/completeness might be an option; or not. I have no strong feelings about those flags.
I'd say drop FBINFO_HWACCEL_ROTATE at least ? Unused as in no references from fbdev/core/* I would rather see one series nuke all unused FBINFO flags in one go. Assuming my quick grep are right and the above can be dropped. I would not want to extend this series. I'm removing _DEFAULT as it's absolutely pointless and confusing. Yes, Ok. Helge
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 11.07.23 06:48, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. Hence we skip the restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also makes the feature widely useful because to use PMD_SIZE vmemmap area we require a memory block size of 2GiB We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can work with a memory block size of 256MB. Using altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with a 256MiB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgtable.h| 28 +++ .../platforms/pseries/hotplug-memory.c| 3 +- mm/memory_hotplug.c | 2 ++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 68817ea7f994..8e6c92dde6ad 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size); bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, unsigned long page_size); +/* + * 
mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory +static inline bool arch_supports_memmap_on_memory(unsigned long size) +{ + unsigned long nr_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_pages * sizeof(struct page); + + if (!radix_enabled()) + return false; + +#ifdef CONFIG_PPC_4K_PAGES + return IS_ALIGNED(vmemmap_size, PMD_SIZE); +#else + /* +* Make sure the vmemmap allocation is fully contianed +* so that we always allocate vmemmap memory from altmap area. +* The pageblock alignment requirement is met by using +* reserve blocks in altmap. +*/ + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); Can we move that check into common code as well? If our (original) vmemmap size would not fit into a single page, we would be in trouble on any architecture. Did not check if it would be an issue for arm64 as well in case we would allow eventually wasting memory. -- Cheers, David / dhildenb
Re: [PATCH 00/17] fbdev: Remove FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT flags
Hi Thomas, On Tue, Jul 11, 2023 at 08:24:40AM +0200, Thomas Zimmermann wrote: > Hi Sam > > Am 10.07.23 um 19:19 schrieb Sam Ravnborg: > > Hi Thomas, > > > > On Mon, Jul 10, 2023 at 02:50:04PM +0200, Thomas Zimmermann wrote: > > > Remove the unused flags FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT from > > > fbdev and drivers, as briefly discussed at [1]. Both flags were maybe > > > useful when fbdev had special handling for driver modules. With > > > commit 376b3ff54c9a ("fbdev: Nuke FBINFO_MODULE"), they are both 0 > > > and have no further effect. > > > > > > Patches 1 to 7 remove FBINFO_DEFAULT from drivers. Patches 2 to 5 > > > split this by the way the fb_info struct is being allocated. All flags > > > are cleared to zero during the allocation. > > > > > > Patches 8 to 16 do the same for FBINFO_FLAG_DEFAULT. Patch 8 fixes > > > an actual bug in how arch/sh uses the token for struct fb_videomode, > > > which is unrelated. > > > > > > Patch 17 removes both flag constants from <linux/fb.h>. > > > > We have a few more flags that are unused - should they be nuked too? > > FBINFO_HWACCEL_FILLRECT > > FBINFO_HWACCEL_ROTATE > > FBINFO_HWACCEL_XPAN > > It seems those are there for completeness. Nothing sets _ROTATE, the others > are simply never checked. According to the comments, some are required, some > are optional. I don't know what that means. > > IIRC there were complaints about performance when Daniel tried to remove > fbcon acceleration, so not all _HWACCEL_ flags are unneeded. > > Leaving them in for reference/completeness might be an option; or not. I > have no strong feelings about those flags. > > > > > Unused as in no references from fbdev/core/* > > > > I would rather see one series nuke all unused FBINFO flags in one go. > > Assuming my quick grep are right and the above can be dropped. > > I would not want to extend this series. I'm removing _DEFAULT as it's > absolutely pointless and confusing. OK, makes sense and thanks for the explanation.
The series is: Acked-by: Sam Ravnborg
[PATCH v2] powerpc/512x: lpbfifo: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new() which already returns void. Eventually after all drivers are converted, .remove_new() is renamed to .remove().

Trivially convert this driver from always returning zero in the remove callback to the void returning variant.

Signed-off-by: Uwe Kleine-König
---
Changes since (implicit) v1:
- provide an actually compilable patch :-\

Best regards
Uwe

 arch/powerpc/platforms/512x/mpc512x_lpbfifo.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
index 1bfb29574caa..c1e981649bd9 100644
--- a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
+++ b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
@@ -477,7 +477,7 @@ static int mpc512x_lpbfifo_probe(struct platform_device *pdev)
 	return ret;
 }
 
-static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
+static void mpc512x_lpbfifo_remove(struct platform_device *pdev)
 {
 	unsigned long flags;
 	struct dma_device *dma_dev = lpbfifo.chan->device;
@@ -494,8 +494,6 @@ static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
 	free_irq(lpbfifo.irq, &pdev->dev);
 	irq_dispose_mapping(lpbfifo.irq);
 	dma_release_channel(lpbfifo.chan);
-
-	return 0;
 }
 
 static const struct of_device_id mpc512x_lpbfifo_match[] = {
@@ -506,7 +504,7 @@ MODULE_DEVICE_TABLE(of, mpc512x_lpbfifo_match);
 
 static struct platform_driver mpc512x_lpbfifo_driver = {
 	.probe = mpc512x_lpbfifo_probe,
-	.remove = mpc512x_lpbfifo_remove,
+	.remove_new = mpc512x_lpbfifo_remove,
 	.driver = {
 		.name = DRV_NAME,
 		.of_match_table = mpc512x_lpbfifo_match,
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5 -- 2.39.2
[PATCH] powerpc/512x: lpbfifo: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new() which already returns void. Eventually after all drivers are converted, .remove_new() is renamed to .remove().

Trivially convert this driver from always returning zero in the remove callback to the void returning variant.

Signed-off-by: Uwe Kleine-König
---
 arch/powerpc/platforms/512x/mpc512x_lpbfifo.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
index 1bfb29574caa..dbe722f7b855 100644
--- a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
+++ b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
@@ -477,7 +477,7 @@ static int mpc512x_lpbfifo_probe(struct platform_device *pdev)
 	return ret;
 }
 
-static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
+static void pc512x_lpbfifo_remove(struct platform_device *pdev)
 {
 	unsigned long flags;
 	struct dma_device *dma_dev = lpbfifo.chan->device;
@@ -494,8 +494,6 @@ static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
 	free_irq(lpbfifo.irq, &pdev->dev);
 	irq_dispose_mapping(lpbfifo.irq);
 	dma_release_channel(lpbfifo.chan);
-
-	return 0;
 }
 
 static const struct of_device_id mpc512x_lpbfifo_match[] = {
@@ -506,7 +504,7 @@ MODULE_DEVICE_TABLE(of, mpc512x_lpbfifo_match);
 
 static struct platform_driver mpc512x_lpbfifo_driver = {
	.probe = mpc512x_lpbfifo_probe,
-	.remove = mpc512x_lpbfifo_remove,
+	.remove_new = mpc512x_lpbfifo_remove,
 	.driver = {
 		.name = DRV_NAME,
 		.of_match_table = mpc512x_lpbfifo_match,

base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
--
2.39.2
Re: [PATCH v2 1/2] powerpc/tpm: Create linux,sml-base/size as big endian
On 7/10/23 17:23, Jarkko Sakkinen wrote: On Thu, 2023-06-15 at 22:37 +1000, Michael Ellerman wrote: There's code in prom_instantiate_sml() to do a "SML handover" (Stored Measurement Log) from OF to Linux, before Linux shuts down Open Firmware. This involves creating a buffer to hold the SML, and creating two device tree properties to record its base address and size. The kernel then later reads those properties from the device tree to find the SML. When the code was initially added in commit 4a727429abec ("PPC64: Add support for instantiating SML from Open Firmware") the powerpc kernel was always built big endian, so the properties were created big endian by default. However since then little endian support was added to powerpc, and now the code lacks conversions to big endian when creating the properties. This means on little endian kernels the device tree properties are little endian, which is contrary to the device tree spec, and in contrast to all other device tree properties. To cope with that a workaround was added in tpm_read_log_of() to skip the endian conversion if the properties were created via the SML handover. A better solution is to encode the properties as big endian as they should be, and remove the workaround. Typically changing the encoding of a property like this would present problems for kexec. However the SML is not propagated across kexec, so changing the encoding of the properties is a non-issue. Fixes: e46e22f12b19 ("tpm: enhance read_log_of() to support Physical TPM event log") Signed-off-by: Michael Ellerman Reviewed-by: Stefan Berger --- arch/powerpc/kernel/prom_init.c | 8 ++-- drivers/char/tpm/eventlog/of.c | 23 --- 2 files changed, 10 insertions(+), 21 deletions(-) Split into two patches (producer and consumer). I think this wouldn't be right since it would break the system when only one patch is applied since it would be reading the fields in the wrong endianess. Stefan BR, Jarkko v2: Add Stefan's reviewed-by. 
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index d464ba412084..72fe306b6820 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1900,6 +1900,7 @@ static void __init prom_instantiate_sml(void)
 	u32 entry = 0, size = 0, succ = 0;
 	u64 base;
 	__be32 val;
+	__be64 val64;
 
 	prom_debug("prom_instantiate_sml: start...\n");
 
@@ -1956,10 +1957,13 @@ static void __init prom_instantiate_sml(void)
 
 	reserve_mem(base, size);
 
+	val64 = cpu_to_be64(base);
 	prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-base",
-		     &base, sizeof(base));
+		     &val64, sizeof(val64));
+
+	val = cpu_to_be32(size);
 	prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-size",
-		     &size, sizeof(size));
+		     &val, sizeof(val));
 
 	prom_debug("sml base = 0x%llx\n", base);
 	prom_debug("sml size = 0x%x\n", size);
diff --git a/drivers/char/tpm/eventlog/of.c b/drivers/char/tpm/eventlog/of.c
index 930fe43d5daf..0bc0cb6333c6 100644
--- a/drivers/char/tpm/eventlog/of.c
+++ b/drivers/char/tpm/eventlog/of.c
@@ -51,8 +51,8 @@ static int tpm_read_log_memory_region(struct tpm_chip *chip)
 int tpm_read_log_of(struct tpm_chip *chip)
 {
 	struct device_node *np;
-	const u32 *sizep;
-	const u64 *basep;
+	const __be32 *sizep;
+	const __be64 *basep;
 	struct tpm_bios_log *log;
 	u32 size;
 	u64 base;
@@ -73,23 +73,8 @@ int tpm_read_log_of(struct tpm_chip *chip)
 	if (sizep == NULL || basep == NULL)
 		return -EIO;
 
-	/*
-	 * For both vtpm/tpm, firmware has log addr and log size in big
-	 * endian format. But in case of vtpm, there is a method called
-	 * sml-handover which is run during kernel init even before
-	 * device tree is setup. This sml-handover function takes care
-	 * of endianness and writes to sml-base and sml-size in little
-	 * endian format. For this reason, vtpm doesn't need conversion
-	 * but physical tpm needs the conversion.
-	 */
-	if (of_property_match_string(np, "compatible", "IBM,vtpm") < 0 &&
-	    of_property_match_string(np, "compatible", "IBM,vtpm20") < 0) {
-		size = be32_to_cpup((__force __be32 *)sizep);
-		base = be64_to_cpup((__force __be64 *)basep);
-	} else {
-		size = *sizep;
-		base = *basep;
-	}
+	size = be32_to_cpup(sizep);
+	base = be64_to_cpup(basep);
 
 	if (size == 0) {
 		dev_warn(&chip->dev, "%s: Event log area empty\n", __func__);
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:25:43PM +0200, Alexey Gladkov wrote: > -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) > +static int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, > int lookup_flags) This function can still be called do_fchmodat(); we don't need to version internal functions.
Re: [PATCH v3 0/5] Add a new fchmodat4() syscall
* Alexey Gladkov: > This patch set adds fchmodat4(), a new syscall. The actual > implementation is super simple: essentially it's just the same as > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > I've attempted to make this match "man 2 fchmodat" as closely as > possible, which says EINVAL is returned for invalid flags (as opposed to > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > I have a sketch of a glibc patch that I haven't even compiled yet, but > seems fairly straight-forward: > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c > b/sysdeps/unix/sysv/linux/fchmodat.c > index 6d9cbc1ce9e0..b1beab76d56c 100644 > --- a/sysdeps/unix/sysv/linux/fchmodat.c > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > @@ -29,12 +29,36 @@ > int > fchmodat (int fd, const char *file, mode_t mode, int flag) > { > - if (flag & ~AT_SYMLINK_NOFOLLOW) > -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. > */ > + /* There are four paths through this code: > + - The flags are zero. In this case it's fine to call fchmodat. > + - The flags are non-zero and glibc doesn't have access to > + __NR_fchmodat4. In this case all we can do is emulate the error codes > + defined by the glibc interface from userspace. > + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel > has > + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly > + matches glibc's library interface so it can be called directly. > + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel > does If you define __NR_fchmodat4 on all architectures, we can use these constants directly in glibc. We no longer depend on the UAPI definitions of those constants, to cut down the number of code variants, and to make glibc's system call profile independent of the kernel header version at build time. 
Your version is based on 2.31, more recent versions have some reasonable emulation for fchmodat based on /proc/self/fd. I even wrote a comment describing the same buggy behavior that you witnessed: + /* Some Linux versions with some file systems can actually +change symbolic link permissions via /proc, but this is not +intentional, and it gives inconsistent results (e.g., error +return despite mode change). The expected behavior is that +symbolic link modes cannot be changed at all, and this check +enforces that. */ + if (S_ISLNK (st.st_mode)) + { + __close_nocancel (pathfd); + __set_errno (EOPNOTSUPP); + return -1; + } I think there was some kernel discussion about that behavior before, but apparently, it hasn't led to fixes. I wonder if it makes sense to add a similar error return to the system call implementation? > + not. In this case we must respect the error codes defined by the glibc > + interface instead of returning ENOSYS. > +The intent here is to ensure that the kernel is called at most once > per > +library call, and that the error types defined by glibc are always > +respected. */ > + > +#ifdef __NR_fchmodat4 > + long result; > +#endif > + > + if (flag == 0) > +return INLINE_SYSCALL (fchmodat, 3, fd, file, mode); > + > +#ifdef __NR_fchmodat4 > + result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag); > + if (result == 0 || errno != ENOSYS) > +return result; > +#endif The last if condition is the recommended approach, but in the past, it broke container host compatibility pretty badly due to seccomp filters that return EPERM instead of ENOSYS. I guess we'll learn soon enough if that's been fixed by now. 8-P Thanks, Florian
Re: [PATCH v3 5/5] selftests: add fchmodat4(2) selftest
* Alexey Gladkov: > The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag > fails. This is because not all filesystems support changing the mode > bits of symlinks properly. These filesystems return an error but change > the mode bits: > > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > AT_SYMLINK_NOFOLLOW) = 0 > newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, > AT_SYMLINK_NOFOLLOW) = 0 > syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 > EOPNOTSUPP (Operation not supported) > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > AT_SYMLINK_NOFOLLOW) = 0 > > This happens with btrfs and xfs: > > $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test > TAP version 13 > 1..1 > ok 1 # SKIP fchmodat4(symlink) > # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 > > $ stat /tmp/ksft-fchmodat4.*/symlink >File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile >Size: 7 Blocks: 0 IO Block: 4096 symbolic link > Device: 7,0 Inode: 133 Links: 1 > Access: (0600/lrw---) Uid: (0/root) Gid: (0/root) > > Signed-off-by: Alexey Gladkov This looks like a bug in those file systems? As an extra test, “echo 3 > /proc/sys/vm/drop_caches” sometimes has strange effects in such cases because the bits are not actually stored on disk, only in the dentry cache. Thanks, Florian
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > From: Palmer Dabbelt > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > function which implements the POSIX-specified interface. This > > interface differs from the underlying kernel system call, which does not > > have a flags argument. Most implementations require procfs [1][2]. > > > > There doesn't appear to be a good userspace workaround for this issue > > but the implementation in the kernel is pretty straight-forward. > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > > unlike existing fchmodat. > > > > [1] > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > [2] > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > Signed-off-by: Palmer Dabbelt > > Signed-off-by: Alexey Gladkov > > I don't know the history of why we ended up with the different > interface, or whether this was done intentionally in the kernel > or if we want this syscall. > > Assuming this is in fact needed, I double-checked that the > implementation looks correct to me and is portable to all the > architectures, without the need for a compat wrapper. > > Acked-by: Arnd Bergmann The system call itself is useful afaict. But please, s/fchmodat4/fchmodat2/ With very few exceptions we don't version by argument number but by revision and we should stick to one scheme: openat()->openat2() eventfd()->eventfd2() clone()/clone2()->clone3() dup()->dup2()->dup3() // coincides with nr of arguments pipe()->pipe2() // coincides with nr of arguments renameat()->renameat2()
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > From: Palmer Dabbelt > > On the userspace side fchmodat(3) is implemented as a wrapper > function which implements the POSIX-specified interface. This > interface differs from the underlying kernel system call, which does not > have a flags argument. Most implementations require procfs [1][2]. > > There doesn't appear to be a good userspace workaround for this issue > but the implementation in the kernel is pretty straight-forward. > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > unlike existing fchmodat. > > [1] > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > [2] > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov I don't know the history of why we ended up with the different interface, or whether this was done intentionally in the kernel or if we want this syscall. Assuming this is in fact needed, I double-checked that the implementation looks correct to me and is portable to all the architectures, without the need for a compat wrapper. Acked-by: Arnd Bergmann
Re: [PATCH v3 1/5] Non-functional cleanup of a "__user * filename"
On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > From: Palmer Dabbelt > > The next patch defines a very similar interface, which I copied from > this definition. Since I'm touching it anyway I don't see any reason > not to just go fix this one up. > > Signed-off-by: Palmer Dabbelt Acked-by: Arnd Bergmann
Re: [PATCH v3 3/5] arch: Register fchmodat4, usually as syscall 451
On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > From: Palmer Dabbelt > > This registers the new fchmodat4 syscall in most places as nuber 451, > with alpha being the exception where it's 561. I found all these sites > by grepping for fspick, which I assume has found me everything. > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov In linux-6.5-rc1, number 451 is used for __NR_cachestat, the next free one at the moment is 452. > arch/arm/tools/syscall.tbl | 1 + > arch/arm64/include/asm/unistd32.h | 2 ++ Unfortunately, you still also need to change __NR_compat_syscalls in arch/arm64/include/asm/unistd.h. Aside from these two issues, your patch is the correct way to hook up a new syscall. Arnd
[PATCH v3 5/5] selftests: add fchmodat4(2) selftest
The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag fails. This is because not all filesystems support changing the mode bits of symlinks properly. These filesystems return an error but change the mode bits: newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, AT_SYMLINK_NOFOLLOW) = 0 syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 EOPNOTSUPP (Operation not supported) newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 This happens with btrfs and xfs: $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test TAP version 13 1..1 ok 1 # SKIP fchmodat4(symlink) # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 $ stat /tmp/ksft-fchmodat4.*/symlink File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile Size: 7 Blocks: 0 IO Block: 4096 symbolic link Device: 7,0 Inode: 133 Links: 1 Access: (0600/lrw-------) Uid: (0/root) Gid: (0/root) Signed-off-by: Alexey Gladkov --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat4/.gitignore | 2 + tools/testing/selftests/fchmodat4/Makefile| 6 + .../selftests/fchmodat4/fchmodat4_test.c | 151 ++ 4 files changed, 160 insertions(+) create mode 100644 tools/testing/selftests/fchmodat4/.gitignore create mode 100644 tools/testing/selftests/fchmodat4/Makefile create mode 100644 tools/testing/selftests/fchmodat4/fchmodat4_test.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 90a62cf75008..fe61fa55412d 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -17,6 +17,7 @@ TARGETS += drivers/net/bonding TARGETS += drivers/net/team TARGETS += efivarfs TARGETS += exec +TARGETS += fchmodat4 TARGETS += filesystems TARGETS += filesystems/binderfs TARGETS += filesystems/epoll diff --git a/tools/testing/selftests/fchmodat4/.gitignore
b/tools/testing/selftests/fchmodat4/.gitignore new file mode 100644 index ..82a4846cbc4b --- /dev/null +++ b/tools/testing/selftests/fchmodat4/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +/*_test diff --git a/tools/testing/selftests/fchmodat4/Makefile b/tools/testing/selftests/fchmodat4/Makefile new file mode 100644 index ..3d38a69c3c12 --- /dev/null +++ b/tools/testing/selftests/fchmodat4/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined +TEST_GEN_PROGS := fchmodat4_test + +include ../lib.mk diff --git a/tools/testing/selftests/fchmodat4/fchmodat4_test.c b/tools/testing/selftests/fchmodat4/fchmodat4_test.c new file mode 100644 index ..50beb731d8ba --- /dev/null +++ b/tools/testing/selftests/fchmodat4/fchmodat4_test.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#ifndef __NR_fchmodat4 + #if defined __alpha__ + #define __NR_fchmodat4 561 + #elif defined _MIPS_SIM + #if _MIPS_SIM == _MIPS_SIM_ABI32/* o32 */ + #define __NR_fchmodat4 (451 + 4000) + #endif + #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ + #define __NR_fchmodat4 (451 + 6000) + #endif + #if _MIPS_SIM == _MIPS_SIM_ABI64/* n64 */ + #define __NR_fchmodat4 (451 + 5000) + #endif + #elif defined __ia64__ + #define __NR_fchmodat4 (451 + 1024) + #else + #define __NR_fchmodat4 451 + #endif +#endif + +int sys_fchmodat4(int dfd, const char *filename, mode_t mode, int flags) +{ + int ret = syscall(__NR_fchmodat4, dfd, filename, mode, flags); + return ret >= 0 ? ret : -errno; +} + +int setup_testdir(void) +{ + int dfd, ret; + char dirname[] = "/tmp/ksft-fchmodat4.XX"; + + /* Make the top-level directory. 
*/ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + ret = openat(dfd, "regfile", O_CREAT | O_WRONLY | O_TRUNC, 0644); + if (ret < 0) + ksft_exit_fail_msg("setup_testdir: failed to create file in tmpdir\n"); + close(ret); + + ret = symlinkat("regfile", dfd, "symlink"); + if (ret < 0) + ksft_exit_fail_msg("setup_testdir: failed to create symlink in tmpdir\n"); + + return dfd; +} + +int expect_mode(int dfd, const char *filename, mode_t expect_mode) +{ + struct stat st; +
[PATCH v3 4/5] tools headers UAPI: Sync files changed by new fchmodat4 syscall
From: Palmer Dabbelt

This adds support for the new syscall to tools such as 'perf trace'.

Signed-off-by: Palmer Dabbelt
Signed-off-by: Alexey Gladkov
---
 tools/include/uapi/asm-generic/unistd.h             | 5 ++++-
 tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 1 +
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl  | 1 +
 tools/perf/arch/s390/entry/syscalls/syscall.tbl     | 1 +
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl   | 1 +
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b7978b3ce3f1 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_fchmodat4 451
+__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 3f1886ad9d80..6356c0a6cda0 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -365,3 +365,4 @@
 448	n64	process_mrelease	sys_process_mrelease
 449	n64	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	n64	fchmodat4	sys_fchmodat4
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index a0be127475b1..ee23866fa1c8 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -537,3 +537,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	nospu	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index b68f47541169..d5ce80065ece 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
 448  common	process_mrelease	sys_process_mrelease	sys_process_mrelease
 449  common	futex_waitv	sys_futex_waitv	sys_futex_waitv
 450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
+451  common	fchmodat4	sys_fchmodat4	sys_fchmodat4
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..17047878293c 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.33.8
[PATCH v3 1/5] Non-functional cleanup of a "__user * filename"
From: Palmer Dabbelt

The next patch defines a very similar interface, which I copied from this definition. Since I'm touching it anyway I don't see any reason not to just go fix this one up.

Signed-off-by: Palmer Dabbelt
---
 include/linux/syscalls.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..497bdd968c32 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -464,7 +464,7 @@ asmlinkage long sys_chdir(const char __user *filename);
 asmlinkage long sys_fchdir(unsigned int fd);
 asmlinkage long sys_chroot(const char __user *filename);
 asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
-asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
+asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
 			     umode_t mode);
 asmlinkage long sys_fchownat(int dfd, const char __user *filename,
 			     uid_t user, gid_t group, int flag);
-- 
2.33.8
[PATCH v3 2/5] fs: Add fchmodat4()
From: Palmer Dabbelt

On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2].

There doesn't appear to be a good userspace workaround for this issue, but the implementation in the kernel is pretty straightforward.

The new fchmodat4() syscall allows passing the AT_SYMLINK_NOFOLLOW flag, unlike the existing fchmodat.

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
[2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28

Signed-off-by: Palmer Dabbelt
Signed-off-by: Alexey Gladkov
---
 fs/open.c                | 18 ++++++++++++------
 include/linux/syscalls.h |  2 ++
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 4478adcc4f3a..58bb88c6afb6 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
 	return err;
 }
 
-static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
+static int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int lookup_flags)
 {
 	struct path path;
 	int error;
-	unsigned int lookup_flags = LOOKUP_FOLLOW;
+
 retry:
 	error = user_path_at(dfd, filename, lookup_flags, &path);
 	if (!error) {
@@ -689,15 +689,25 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
 	return error;
 }
 
+SYSCALL_DEFINE4(fchmodat4, int, dfd, const char __user *, filename,
+		umode_t, mode, int, flags)
+{
+	if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW))
+		return -EINVAL;
+
+	return do_fchmodat4(dfd, filename, mode,
+			    flags & AT_SYMLINK_NOFOLLOW ? 0 : LOOKUP_FOLLOW);
+}
+
 SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
 		umode_t, mode)
 {
-	return do_fchmodat(dfd, filename, mode);
+	return do_fchmodat4(dfd, filename, mode, LOOKUP_FOLLOW);
 }
 
 SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
 {
-	return do_fchmodat(AT_FDCWD, filename, mode);
+	return do_fchmodat4(AT_FDCWD, filename, mode, LOOKUP_FOLLOW);
 }
 
 /**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 497bdd968c32..b17d37d2bad6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -466,6 +466,8 @@ asmlinkage long sys_chroot(const char __user *filename);
 asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
 asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
 			     umode_t mode);
+asmlinkage long sys_fchmodat4(int dfd, const char __user *filename,
+			      umode_t mode, int flags);
 asmlinkage long sys_fchownat(int dfd, const char __user *filename,
 			     uid_t user, gid_t group, int flag);
 asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
-- 
2.33.8
[PATCH v3 3/5] arch: Register fchmodat4, usually as syscall 451
From: Palmer Dabbelt

This registers the new fchmodat4 syscall in most places as number 451, with alpha being the exception where it's 561. I found all these sites by grepping for fspick, which I assume has found me everything.

Signed-off-by: Palmer Dabbelt
Signed-off-by: Alexey Gladkov
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 18 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ebacf37a8cf..00ceeffec7ff 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -490,3 +490,4 @@
 558	common	process_mrelease	sys_process_mrelease
 559	common	futex_waitv	sys_futex_waitv
 560	common	set_mempolicy_home_node	sys_ni_syscall
+561	common	fchmodat4	sys_fchmodat4
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index ac964612d8b0..0b9702d5c425 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -464,3 +464,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 604a2053d006..49c65d935049 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -907,6 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_fchmodat4 451
+__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 72c929d9902b..b35225c64781 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -371,3 +371,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b1f3940bc298..4d80cd87e089 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 820145e47350..306bd18e5b52 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 253ff994ed2e..2ef47a546fd3 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -389,3 +389,4 @@
 448	n32	process_mrelease	sys_process_mrelease
 449	n32	futex_waitv	sys_futex_waitv
 450	n32	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	n32	fchmodat4	sys_fchmodat4
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl
[PATCH v3 0/5] Add a new fchmodat4() syscall
This patch set adds fchmodat4(), a new syscall. The actual implementation is super simple: essentially it's just the same as fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. I've attempted to make this match "man 2 fchmodat" as closely as possible, which says EINVAL is returned for invalid flags (as opposed to ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).

I have a sketch of a glibc patch that I haven't even compiled yet, but seems fairly straightforward:

diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
index 6d9cbc1ce9e0..b1beab76d56c 100644
--- a/sysdeps/unix/sysv/linux/fchmodat.c
+++ b/sysdeps/unix/sysv/linux/fchmodat.c
@@ -29,12 +29,36 @@
 int
 fchmodat (int fd, const char *file, mode_t mode, int flag)
 {
-  if (flag & ~AT_SYMLINK_NOFOLLOW)
-    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
-#ifndef __NR_lchmod	/* Linux so far has no lchmod syscall. */
+  /* There are four paths through this code:
+      - The flags are zero.  In this case it's fine to call fchmodat.
+      - The flags are non-zero and glibc doesn't have access to
+        __NR_fchmodat4.  In this case all we can do is emulate the error codes
+        defined by the glibc interface from userspace.
+      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
+        fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
+        matches glibc's library interface so it can be called directly.
+      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
+        not.  In this case we must respect the error codes defined by the glibc
+        interface instead of returning ENOSYS.
+     The intent here is to ensure that the kernel is called at most once per
+     library call, and that the error types defined by glibc are always
+     respected.  */
+
+#ifdef __NR_fchmodat4
+  long result;
+#endif
+
+  if (flag == 0)
+    return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
+
+#ifdef __NR_fchmodat4
+  result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag);
+  if (result == 0 || errno != ENOSYS)
+    return result;
+#endif
+
   if (flag & AT_SYMLINK_NOFOLLOW)
     return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP);
-#endif
-  return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
+  return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
 }

I've never added a new syscall before so I'm not really sure what the proper procedure to follow is. Based on the feedback from my v1 patch set it seems this is somewhat uncontroversial. At this point I don't think there's anything I'm missing, though note that I haven't gotten around to testing it this time because the diff from v1 is trivial for any platform I could reasonably test on. The v1 patches suggest a simple test case, but I didn't re-run it because I don't want to reboot my laptop.

Changes since v2 [20190717012719.5524-1-pal...@sifive.com]:
* Rebased to master.
* The lookup_flags are passed to sys_fchmodat4, as suggested by Al Viro.
* Selftest added.

Changes since v1 [20190531191204.4044-1-pal...@sifive.com]:
* All architectures are now supported, with support squashed into a single patch.
* The do_fchmodat() helper function has been removed, in favor of directly calling do_fchmodat4().
* The patches are based on 5.2 instead of 5.1.
---

Alexey Gladkov (1):
  selftests: add fchmodat4(2) selftest

Palmer Dabbelt (4):
  Non-functional cleanup of a "__user * filename"
  fs: Add fchmodat4()
  arch: Register fchmodat4, usually as syscall 451
  tools headers UAPI: Sync files changed by new fchmodat4 syscall

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/open.c                                   | 18 ++-
 include/linux/syscalls.h                    |  4 +-
 include/uapi/asm-generic/unistd.h           |  5 +-
 tools/include/uapi/asm-generic/unistd.h     |  5 +-