[PATCH mm 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins
---
This is the version which applies to mm-unstable or linux-next.

 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struc
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -758,8 +743,6 @@ static inline bool vma_start_read(struct
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 						bool detached) {}
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killab
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
[PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]()
Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.

Signed-off-by: Hugh Dickins
---
 mm/pgtable-generic.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index fa9d4d084291..4fcd959dcc4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -315,6 +315,50 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+/*
+ * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
+ * __pte_offset_map_lock() below, is usually called with the pmd pointer for
+ * addr, reached by walking down the mm's pgd, p4d, pud for addr: either while
+ * holding mmap_lock or vma lock for read or for write; or in truncate or rmap
+ * context, while holding file's i_mmap_lock or anon_vma lock for read (or for
+ * write).  In a few cases, it may be used with pmd pointing to a pmd_t already
+ * copied to or constructed on the stack.
+ *
+ * When successful, it returns the pte pointer for addr, with its page table
+ * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent
+ * modification by software, with a pointer to that spinlock in ptlp (in some
+ * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's
+ * struct page).  pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards.
+ *
+ * But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no
+ * page table at *pmd: if, for example, the page table has just been removed,
+ * or replaced by the huge pmd of a THP.  (When successful, *pmd is rechecked
+ * after acquiring the ptlock, and retried internally if it changed: so that a
+ * page table can be safely removed or replaced by THP while holding its lock.)
+ *
+ * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above,
+ * just returns the pte pointer for addr, its page table kmapped if necessary;
+ * or NULL if there is no page table at *pmd.  It does not attempt to lock the
+ * page table, so cannot normally be used when the page table is to be updated,
+ * or when entries read must be stable.  But it does take rcu_read_lock(): so
+ * that even when page table is racily removed, it remains a valid though empty
+ * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
+ * afterwards.
+ *
+ * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it.  This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
+ * pointer for the page table that it returns.  In principle, the caller should
+ * recheck *pmd once the lock is taken; in practice, no callsite needs that -
+ * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ *
+ * Note that free_pgtables(), used after unmapping detached vmas, or when
+ * exiting the whole mm, does not take page table lock before freeing a page
+ * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
+ * pte_offset_map() and co once the vma is detached from mm or mm_users is zero.
+ */
 pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, spinlock_t **ptlp)
 {
-- 
2.35.3
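The recheck-after-acquiring pattern the comments describe (load *pmd locklessly, take the ptlock, re-validate *pmd, retry if it changed underneath) can be sketched in userspace. This is only a shape, not kernel code: `struct slot` and `slot_map_lock()` are invented names, with a pthread mutex standing in for the ptlock.

```c
#include <pthread.h>
#include <assert.h>

/* Userspace sketch of the __pte_offset_map_lock() shape: read the table
 * pointer without the lock, take the lock, then confirm the pointer is
 * unchanged; if it changed (table removed or replaced meanwhile), drop
 * the lock and retry from the top.  Illustrative names, not kernel API. */
struct slot {
	pthread_mutex_t lock;
	int *table;		/* stands in for the page table the pmd points at */
};

/* Returns the table with slot->lock held, or NULL if there is none */
static int *slot_map_lock(struct slot *s)
{
	for (;;) {
		int *seen = __atomic_load_n(&s->table, __ATOMIC_ACQUIRE);

		if (!seen)
			return NULL;	/* like pmd_none(): fail, lock not taken */
		pthread_mutex_lock(&s->lock);
		if (__atomic_load_n(&s->table, __ATOMIC_RELAXED) == seen)
			return seen;	/* still the same table: success */
		pthread_mutex_unlock(&s->lock);	/* raced with a change: retry */
	}
}
```

The retry loop is what lets a page table be removed or replaced by a THP while its lock is held: any concurrent mapper either sees the old pointer and fails the recheck, or sees the new state up front.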
[PATCH v2] powerpc: platforms: Fix a NULL vs IS_ERR() bug for debugfs_create_dir()
The debugfs_create_dir() function returns error pointers.
It never returns NULL.  Most incorrect error checks were fixed,
but the ones in scom_debug_init() and scom_debug_init_one()
were overlooked.  Fix the remaining error checks.

Signed-off-by: Wang Ming
Fixes: bfd2f0d49aef ("powerpc/powernv: Get rid of old scom_controller abstraction")
---
 arch/powerpc/platforms/powernv/opal-xscom.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-xscom.c b/arch/powerpc/platforms/powernv/opal-xscom.c
index 6b4eed2ef4fa..262cd6fac907 100644
--- a/arch/powerpc/platforms/powernv/opal-xscom.c
+++ b/arch/powerpc/platforms/powernv/opal-xscom.c
@@ -168,7 +168,7 @@ static int scom_debug_init_one(struct dentry *root, struct device_node *dn,
 	ent->path.size = strlen((char *)ent->path.data);
 
 	dir = debugfs_create_dir(ent->name, root);
-	if (!dir) {
+	if (IS_ERR(dir)) {
 		kfree(ent->path.data);
 		kfree(ent);
 		return -1;
@@ -190,7 +190,7 @@ static int scom_debug_init(void)
 		return 0;
 
 	root = debugfs_create_dir("scom", arch_debugfs_dir);
-	if (!root)
+	if (IS_ERR(root))
 		return -1;
 
 	rc = 0;
-- 
2.25.1
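Why `if (!dir)` never triggers can be shown with a userspace re-creation of the kernel's ERR_PTR()/IS_ERR() helpers (modeled on include/linux/err.h; the casts and the test values below are illustrative): an error pointer encodes a small negative errno into the top of the address space, so it is non-NULL and only IS_ERR() catches it.

```c
#include <errno.h>
#include <assert.h>

/* Userspace re-creation of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()
 * (see include/linux/err.h): an errno in -1..-4095 is encoded as a
 * pointer into the last page of the address space. */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;		/* e.g. -ENODEV becomes 0xff..ffed */
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* non-NULL, yet recognizably an error */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

So a caller that writes `if (!dir)` after debugfs_create_dir() accepts the error pointer as valid, while `if (IS_ERR(dir))` rejects it and `PTR_ERR(dir)` recovers the errno.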
[PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins
---
 include/linux/mm.h        | 17 -----------------
 include/linux/mmap_lock.h | 10 ----------
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..b7b45be616ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-	int mm_lock_seq;
-
-	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return true;
-
-	if (!down_write_trylock(&vma->vm_lock->lock))
-		return false;
-
-	vma->vm_lock_seq = mm_lock_seq;
-	up_write(&vma->vm_lock->lock);
-	return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 	int mm_lock_seq;
@@ -731,8 +716,6 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
 		{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-		{ return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 						bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct mm_struct *mm)
 	return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-	bool ret;
-
-	__mmap_lock_trace_start_locking(mm, true);
-	ret = down_write_trylock(&mm->mmap_lock) != 0;
-	__mmap_lock_trace_acquire_returned(mm, true, ret);
-	return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, true);
-- 
2.35.3
[PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to
be tried for retraction later - catching, for example, page tables where
a reversible mprotect() of a portion had required splitting the pmd, but
now it can be recollapsed.  Call collapse_pte_mapped_thp() directly in
this case (why was it deferred before?  I assume an issue with needing
mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 125 +++---
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46986eb4eebb..7c7aaddbe130 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
 	struct mm_slot slot;
-
-	/* pte-mapped THP in this mm */
-	int nr_pte_mapped_thp;
-	unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1439,50 +1431,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- *     emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- *     retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- *     (at virtual address X) and adds an entry (for X) into mm_struct A's
- *     ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- *     sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- *     (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
-					  unsigned long addr)
-{
-	struct khugepaged_mm_slot *mm_slot;
-	struct mm_slot *slot;
-	bool ret = false;
-
-	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-	spin_lock(&khugepaged_mm_lock);
-	slot = mm_slot_lookup(mm_slots_hash, mm);
-	mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-	if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) {
-		mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-		ret = true;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-	return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
-{
-	struct mm_slot *slot = &mm_slot->slot;
-	struct mm_struct *mm = slot->mm;
-	int i;
-
-	if (likely(mm_slot->nr_pte_mapped_thp == 0))
-		return;
-
-	if (!mmap_write_trylock(mm))
-		return;
-
-	if (unlikely(hpage_collapse_test_exit(mm)))
-		goto out;
-
-	for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-		collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-	mm_slot->nr_pte_mapped_thp = 0;
-	mmap_write_unlock(mm);
-}
-
 static void retract_page_tables(struct
[PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB flush
can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.

Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 172 ++
 1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3bb05147961b..46986eb4eebb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
 {
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	};
 
 	VM_BUG_ON(!PageTransHuge(hpage));
-	mmap_assert_write_locked(vma->vm_mm);
+	mmap_assert_locked(vma->vm_mm);
 
 	if (do_set_pmd(&vmf, hpage))
 		return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to another
- *    page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	struct mmu_notifier_range range;
-
-	mmap_assert_write_locked(mm);
-	if (vma->vm_file)
-		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-	/*
-	 * All anon_vmas attached to the VMA have the same root and are
-	 * therefore locked by the same lock.
-	 */
-	if (vma->anon_vma)
-		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + HPAGE_PMD_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-	pmd = pmdp_collapse_flush(vma, addr, pmdp);
-	tlb_remove_table_sync_one();
-	mmu_notifier_invalidate_range_end(&range);
-	mm_dec_nr_ptes(mm);
-	page_table_check_pte_clear_range(mm, addr, pmd);
-	pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			    bool install_pmd)
 {
+	struct mmu_notifier_range range;
+	bool notified = false;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
-	pmd_t *pmd;
-	spinlock_t *ptl;
-
[PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail: so
retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
when PAE, and pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that enhancement
does raise some more questions: leave it until a later release.

Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 184 --
 1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..3bb05147961b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1615,9 +1615,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		break;
 	case SCAN_PMD_NONE:
 		/*
-		 * In MADV_COLLAPSE path, possible race with khugepaged where
-		 * all pte entries have been removed and pmd cleared.  If so,
-		 * skip all the pte checks and just update the pmd mapping.
+		 * All pte entries have been removed and pmd cleared.
+		 * Skip all the pte checks and just update the pmd mapping.
 		 */
 		goto maybe_install_pmd;
 	default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
 	mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-			       struct mm_struct *target_mm,
-			       unsigned long target_addr, struct page *hpage,
-			       struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
-	int target_result = SCAN_FAIL;
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-		int result = SCAN_FAIL;
-		struct mm_struct *mm = NULL;
-		unsigned long addr = 0;
-		pmd_t *pmd;
-		bool is_target = false;
+		struct mmu_notifier_range range;
+		struct mm_struct *mm;
+		unsigned long addr;
+		pmd_t *pmd, pgt_pmd;
+		spinlock_t *pml;
+		spinlock_t *ptl;
+		bool skipped_uffd = false;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth investing
-		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
-		 * later.
-		 *
-		 * Note that vma->anon_vma check is racy: it can be set up after
-		 * the check but before we took mmap_lock by the fault path.
-		 * But page lock would prevent establishing any new ptes of the
-		 * page, so we are safe.
-		 *
-		 * An alternative would be drop the check, but check that page
-		 * table is clear before calling pmdp_collapse_flush() under
-		 * ptl.  It has higher chance to recover THP for the VMA, but
-		 * has higher cost too. It would also probably
[PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page
Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).

Signed-off-by: Hugh Dickins
---
 include/linux/mm_types.h |  4 ++++
 include/linux/pgtable.h  |  2 ++
 mm/pgtable-generic.c     | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..17a7868f00bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			/*
+			 * A PTE page table page might be freed by use of
+			 * rcu_head: which overlays those two fields above.
+			 */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7f2db400f653..9fa34be65159 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b9a0c2137cc1..fa9d4d084291 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mm_inline.h>
+#include <linux/rcupdate.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = pgtable;
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3
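The container_of()-based callback shape used by the generic pte_free_defer() above can be sketched in userspace: an embedded callback head, container_of() recovering the enclosing object, and a plain deferral queue standing in for call_rcu()'s grace period. `cb_head`, `defer_free()` and `run_deferred()` are invented names for illustration, not kernel API.

```c
#include <stddef.h>

/* Userspace sketch of the pte_free_defer() shape: the callback receives
 * only a pointer to the head embedded in the object, and uses
 * container_of() to get back to the object itself - just as the kernel
 * callback recovers the struct page from its overlaid rcu_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct pgtable_page {
	int freed;		/* stands in for actually freeing the page */
	struct cb_head rcu;	/* like struct page's overlaid rcu_head */
};

static struct cb_head *deferred;

static void defer_free(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;	/* like call_rcu(): only queue, free later */
	head->next = deferred;
	deferred = head;
}

static void run_deferred(void)	/* stands in for the grace period ending */
{
	while (deferred) {
		struct cb_head *head = deferred;

		deferred = head->next;
		head->func(head);
	}
}

static void pgtable_free_now(struct cb_head *head)
{
	struct pgtable_page *page = container_of(head, struct pgtable_page, rcu);

	page->freed = 1;	/* the kernel callback calls pte_free() here */
}
```

The point of the deferral is visible in the ordering: between `defer_free()` and `run_deferred()`, a lockless reader holding the old table pointer still finds valid (if empty) memory, exactly what pte_offset_map()'s rcu_read_lock() relies on.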
[PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page
Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending on
how long the allocated fragment will stay in use.  In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later in
__tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins
Reviewed-by: Gerald Schaefer
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c          | 80 +-
 2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 					table += PTRS_PER_PTE;
 				atomic_xor_bits(&page->_refcount,
 							0x01U << (bit + 24));
-				list_del(&page->lru);
+				list_del_init(&page->lru);
 			}
 		}
 		spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	table = (unsigned long *) page_to_virt(page);
 	if (mm_alloc_pgste(mm)) {
 		/* Return 4K page table with PGSTEs */
+		INIT_LIST_HEAD(&page->lru);
 		atomic_xor_bits(&page->_refcount, 0x03U << 24);
 		memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
 		memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, void *table,
 {
 	char msg[128];
 
-	if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+	if (!IS_ENABLED(CONFIG_DEBUG_VM))
+		return;
+	if (!mask && list_empty(&page->lru))
 		return;
 	snprintf(msg, sizeof(msg),
 		 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, void *table,
 	dump_page(page, msg);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void page_table_free(struct mm_struct *mm, unsigned long
[PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t
Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins
---
 arch/sparc/include/asm/pgalloc_64.h |  4 ++++
 arch/sparc/mm/init_64.c             | 16 ++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	__pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd)
 {
-- 
2.35.3
[PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page
Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe
Signed-off-by: Hugh Dickins
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 29 ++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
 	return __alloc_for_ptecache(mm, kernel);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		if (!kernel)
-			pgtable_pte_page_dtor(page);
-		__free_page(page);
+		if (kernel)
+			__free_page(page);
+		else if (TestClearPageActive(page))
+			call_rcu(&page->rcu_head, pte_free_now);
+		else
+			pte_free_now(&page->rcu_head);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3
[PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock() in assert_pte_locked(). BUG if pte_offset_map_nolock() fails: this is stricter than the previous implementation, which skipped when pmd_none() (with a comment on khugepaged collapse transitions): but wouldn't we want to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins
---
 arch/powerpc/mm/pgtable.c | 16 ++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (mm == &init_mm)
 		return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pud = pud_offset(p4d, addr);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, addr);
-	/*
-	 * khugepaged to collapse normal pages to hugepage, first set
-	 * pmd to none to force page fault/gup to take mmap_lock. After
-	 * pmd is set to none, we do a pte_clear which does this assertion
-	 * so if we find pmd none, return.
-	 */
-	if (pmd_none(*pmd))
-		return;
-	BUG_ON(!pmd_present(*pmd));
-	assert_spin_locked(pte_lockptr(mm, pmd));
+	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	BUG_ON(!pte);
+	assert_spin_locked(ptl);
+	pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
-- 
2.35.3
[PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock()
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock() in adjust_pte(): because it gives the not-locked ptl for precisely that pte, which the caller can then safely lock; whereas pte_lockptr() is not so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	 * must use the nested version. This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
 	if (!pte)
 		return 0;
 
-	ptl = pte_lockptr(vma->vm_mm, pmd);
 	do_pte_lock(ptl);
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3
[PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a pmdval assembled from a pmd_low and a pmd_high which never belonged together: their combination not pointing to a page table at all, perhaps not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB flush between present updates, as linux/pgtable.h suggests. It's only needed around the pmdp_get_lockless() in __pte_offset_map(): a race when __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the lock, would just send it back to __pte_offset_map() again.

Complement these pmdp_get_lockless_start() and pmdp_get_lockless_end(), used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync() synonym for tlb_remove_table_sync_one(): to send the necessary interrupt at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86. It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a little more work, to retry if pmd_low good for page table, but pmd_high non-zero from THP (and that might be making x86-specific assumptions).
Signed-off-by: Hugh Dickins
---
 include/linux/pgtable.h |  4 ++++
 mm/pgtable-generic.c    | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5134edcec668..7f2db400f653 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 	return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
 	return pmdp_get(pmdp);
 }
+static inline void pmdp_get_lockless_sync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 400e5a045848..b9a0c2137cc1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+	(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+	unsigned long irqflags;
+
+	local_irq_save(irqflags);
+	return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+	local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+	unsigned long irqflags;
 	pmd_t pmdval;
 
 	rcu_read_lock();
+	irqflags = pmdp_get_lockless_start();
 	pmdval = pmdp_get_lockless(pmd);
+	pmdp_get_lockless_end(irqflags);
+
 	if (pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3
[PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
Before putting them to use (several commits later), add rcu_read_lock() to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this a separate commit, since it risks exposing imbalances: prior commits have fixed all the known imbalances, but we may find some have been missed.

Signed-off-by: Hugh Dickins
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..5134edcec668 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte)	do {	\
 	kunmap_local((pte));	\
-	/* rcu_read_unlock() to be added later */	\
+	rcu_read_unlock();	\
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 }
 #endif
 diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4d454953046f..400e5a045848 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
 	pmd_t pmdval;
 
-	/* rcu_read_lock() to be added later */
+	rcu_read_lock();
 	pmdval = pmdp_get_lockless(pmd);
 	if (pmdvalp)
 		*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	}
 	return __pte_map(&pmdval, addr);
 nomap:
-	/* rcu_read_unlock() to be added later */
+	rcu_read_unlock();
 	return NULL;
 }
-- 
2.35.3
[PATCH v3 00/13] mm: free retracted page table by RCU
Here is v3 of the series of patches to mm (and a few architectures), based on v6.5-rc1 which includes the preceding two series (thank you!): in which khugepaged takes advantage of pte_offset_map[_lock]() allowing for pmd transitions. Differences from v1 and v2 are noted patch by patch below.

This replaces the v2 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/54cb04f-3762-987f-8294-91dafd8eb...@google.com/
series of 12 posted on 2023-06-20.

What is it all about? Some mmap_lock avoidance i.e. latency reduction. Initially just for the case of collapsing shmem or file pages to THPs: the usefulness of MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it currently requires. Likely to be relied upon later in other contexts e.g. freeing of empty page tables (but that's not work I'm doing). mmap_write_lock avoidance when collapsing to anon THPs? Perhaps, but again that's not work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in Google's data centre kernel for three years now: we do rely upon them.

Based on v6.5-rc1; and almost good on current mm-unstable or current linux-next - just one patch conflicts, the 12/13: I'll reply to that one with its mm-unstable or linux-next equivalent (vma_assert_locked() has been added next to where vma_try_start_write() is being removed).

01/13 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
v3: same as v1

02/13 mm/pgtable: add PAE safety to __pte_offset_map()
v3: same as v2
v2: rename to pmdp_get_lockless_start/end() per Matthew;
    so use inlines without _irq_save(flags) macro oddity;
    add pmdp_get_lockless_sync() for use later in 09/13.
03/13 arm: adjust_pte() use pte_offset_map_nolock()
v3: same as v1

04/13 powerpc: assert_pte_locked() use pte_offset_map_nolock()
v3: same as v1

05/13 powerpc: add pte_free_defer() for pgtables sharing page
v3: much simpler version, following suggestion by Jason
v2: fix rcu_head usage to cope with concurrent deferrals;
    add para to commit message explaining rcu_head issue.

06/13 sparc: add pte_free_defer() for pte_t *pgtable_t
v3: same as v2
v2: use page_address() instead of less common page_to_virt();
    add para to commit message explaining simple conversion;
    changed title since sparc64 pgtables do not share page.

07/13 s390: add pte_free_defer() for pgtables sharing page
v3: much simpler version, following suggestion by Gerald
v2: complete rewrite, integrated with s390's existing pgtable management;
    temporarily using a global mm_pgtable_list_lock, to be restored to
    per-mm spinlock in a later followup patch.

08/13 mm/pgtable: add pte_free_defer() for pgtable as page
v3: same as v2
v2: add comment on rcu_head to "Page table pages", per JannH

09/13 mm/khugepaged: retract_page_tables() without mmap or vma lock
v3: same as v2
v2: repeat checks under ptl because UFFD, per PeterX and JannH;
    bring back mmu_notifier calls for PMD, per JannH and Jason;
    pmdp_get_lockless_sync() to issue missing interrupt if PAE.

10/13 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
v3: updated to using ptent instead of *pte
v2: first check VMA, in case page tables torn down, per JannH;
    pmdp_get_lockless_sync() to issue missing interrupt if PAE;
    moved mmu_notifier after step 1, reworked final goto labels.
11/13 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
v3: rediffed
v2: same as v1

12/13 mm: delete mmap_write_trylock() and vma_try_start_write()
v3: rediffed (different diff needed for mm-unstable or linux-next)
v2: same as v1

13/13 mm/pgtable: notes on pte_offset_map[_lock]()
v3: new: JannH asked for more helpful comment, this is my attempt;
    could be moved to be the first in the series.

 arch/arm/mm/fault-armv.c            |   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c      |  29 ++-
 arch/powerpc/mm/pgtable.c           |  16 +-
 arch/s390/include/asm/pgalloc.h     |   4 +
 arch/s390/mm/pgalloc.c              |  80 -
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c             |  16 +
 include/linux/mm.h                  |  17 --
 include/linux/mm_types.h            |   4 +
 include/linux/mmap_lock.h           |  10 -
 include/linux/pgtable.h             |  10 +-
 mm/khugepaged.c                     | 481 +++---
 mm/pgtable-generic.c                |  97 +-
 14 files changed, 404 insertions(+), 371 deletions(-)

Hugh
Re: [PATCH net-next v3 0/8] net: freescale: Convert to platform remove callback returning void
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski :

On Mon, 10 Jul 2023 09:19:38 +0200 you wrote:
> Hello,
>
> v2 of this series was sent in June[1], code changes since then only affect
> patch #1 where the dev_err invocation was adapted to emit the error code of
> dpaa_fq_free(). Thanks for feedback by Maciej Fijalkowski and Russell King.
> Other than that I added Reviewed-by tags for Simon Horman and Wei Fang and
> rebased to v6.5-rc1.
>
> [...]

Here is the summary with links:
  - [net-next,v3,1/8] net: dpaa: Improve error reporting
    https://git.kernel.org/netdev/net-next/c/1e679b957ae2
  - [net-next,v3,2/8] net: dpaa: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/9c3ddc44d0c0
  - [net-next,v3,3/8] net: fec: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/12d6cc19f29b
  - [net-next,v3,4/8] net: fman: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/4875b2a362e9
  - [net-next,v3,5/8] net: fs_enet: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/ead29c5e0888
  - [net-next,v3,6/8] net: fsl_pq_mdio: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/f833635589ae
  - [net-next,v3,7/8] net: gianfar: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/4be0ebc33f39
  - [net-next,v3,8/8] net: ucc_geth: Convert to platform remove callback returning void
    https://git.kernel.org/netdev/net-next/c/ae18facf566c

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation
On 7/11/23 10:49 PM, David Hildenbrand wrote:
> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>> Add a new kconfig option that can be selected if we want to allow
>> pageblock alignment by reserving pages in the vmemmap altmap area.
>> This implies we will be reserving some pages for every memoryblock
>> This also allows the memmap on memory feature to be widely useful
>> with different memory block size values.
>
> "reserving pages" is a nice way of saying "wasting memory". :) Let's spell
> that out.
>
> I think we have to find a better name for this, and I think we should have a
> toggle similar to memory_hotplug.memmap_on_memory. This should be an admin
> decision, not some kernel config option.
>
>
> memory_hotplug.force_memmap_on_memory
>
> "Enable the memmap on memory feature even if it could result in memory waste
> due to memmap size limitations. For example, if the memmap for a memory block
> requires 1 MiB, but the pageblock size is 2 MiB, 1 MiB
> of hotplugged memory will be wasted. Note that there are still cases where
> the feature cannot be enforced: for example, if the memmap is smaller than a
> single page, or if the architecture does not support the forced mode in all
> configurations."
>
> Thoughts?
>

With module parameter, do we still need the Kconfig option?

-aneesh
Re: [PATCH] soc: fsl: qe: Replace all non-returning strlcpy with strscpy
> Sorry for the late response. But I found some old discussions with the
> conclusion to be not converting old users. Has this been changed later on?
> https://lwn.net/Articles/659214/
> @Kees Cook what's your advice here?
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Tue, Jul 11, 2023, Zhi Wang wrote:
> On Thu, 6 Jul 2023 15:49:39 +0900
> David Stevens wrote:
>
> > On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang wrote:
> > >
> > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > David Stevens wrote:
> > > If yes, do we have to use FOLL_GET to resolve GFN associated with a
> > > tail page? It seems gup can tolerate gup_flags without FOLL_GET, but
> > > it is more like a temporary solution. I don't think it is a good idea
> > > to play tricks with a temporary solution, more like we are abusing
> > > the toleration.
> >
> > I'm not sure I understand what you're getting at. This series never
> > calls gup without FOLL_GET.
> >
> > This series aims to provide kvm_follow_pfn as a unified API on top of
> > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > notifier, it makes sense to support returning a pfn without taking a
> > reference. And we indeed need to do that for certain types of memory.
> >
>
> I am not having prob with taking a pfn without taking a ref. I am
> questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> a pfn without a ref is a good idea or not, while there is another flag
> actually showing it.
>
> I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> translation between struct kvm_follow_pfn.{write, async, } and GUP
> flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> requirements of GUP in the code path that going to call GUP is reasonable.
>
> But using FOLL_XXX with purposes that are not related to GUP call really
> feels off.

I agree, assuming you're talking specifically about the logic in
hva_to_pfn_remapped() that handles non-refcounted pages, i.e.
this

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (foll->flags & FOLL_GET) {
		r = -EFAULT;
	}

should be

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
		else if (!foll->guarded_by_mmu_notifier)
			r = -EFAULT;
	}

because it's not the desire to grab a reference that makes getting non-refcounted pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.

Though that highlights that checking guarded_by_mmu_notifier should be done for *all* non-refcounted pfns, not just non-refcounted struct page memory.

As for the other usage of FOLL_GET in this series (using it to conditionally do put_page()), IMO that's very much related to the GUP call. Invoking put_page() is a hack to workaround the fact that GUP doesn't provide a way to get the pfn without grabbing a reference to the page. In an ideal world, KVM would NOT pass FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.

I do think it's worth providing a helper to consolidate and document that hacky code, e.g. add a kvm_follow_refcounted_pfn() helper.

All in all, I think the below (completely untested) is what we want?

David (and others), I am planning on doing a full review of this series "soon", but it will likely be a few weeks until that happens. I jumped in on this specific thread because this caught my eye and I really don't want to throw out *all* of the FOLL_GET usage.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5afd70f239..90d424990e0a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 	return rc == -EHWPOISON;
 }
 
+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+					   struct page *page)
+{
+	kvm_pfn_t pfn = page_to_pfn(page);
+
+	foll->is_refcounted_page = true;
+
+	/*
+	 * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+	 * doesn't want to grab a reference, but gup() doesn't support getting
+	 * just the pfn, i.e. FOLL_GET is effectively mandatory.  If that ever
+	 * changes, drop this and simply don't pass FOLL_GET to gup().
+	 */
+	if (!(foll->flags & FOLL_GET))
+		put_page(page);
+
+	return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned.  It's also the
@@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return false;
 
 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
 		foll->writable = foll->allow_write_mapping;
-		foll->is_refcounted_page = true;
-		if (!(foll->flags &
Re: [PATCH v2 1/2] powerpc/tpm: Create linux,sml-base/size as big endian
On Tue, 2023-07-11 at 08:47 -0400, Stefan Berger wrote:
>
> On 7/10/23 17:23, Jarkko Sakkinen wrote:
> > On Thu, 2023-06-15 at 22:37 +1000, Michael Ellerman wrote:
> > > There's code in prom_instantiate_sml() to do a "SML handover" (Stored
> > > Measurement Log) from OF to Linux, before Linux shuts down Open
> > > Firmware.
> > >
> > > This involves creating a buffer to hold the SML, and creating two device
> > > tree properties to record its base address and size. The kernel then
> > > later reads those properties from the device tree to find the SML.
> > >
> > > When the code was initially added in commit 4a727429abec ("PPC64: Add
> > > support for instantiating SML from Open Firmware") the powerpc kernel
> > > was always built big endian, so the properties were created big endian
> > > by default.
> > >
> > > However since then little endian support was added to powerpc, and now
> > > the code lacks conversions to big endian when creating the properties.
> > >
> > > This means on little endian kernels the device tree properties are
> > > little endian, which is contrary to the device tree spec, and in
> > > contrast to all other device tree properties.
> > >
> > > To cope with that a workaround was added in tpm_read_log_of() to skip
> > > the endian conversion if the properties were created via the SML
> > > handover.
> > >
> > > A better solution is to encode the properties as big endian as they
> > > should be, and remove the workaround.
> > >
> > > Typically changing the encoding of a property like this would present
> > > problems for kexec. However the SML is not propagated across kexec, so
> > > changing the encoding of the properties is a non-issue.
> > >
> > > Fixes: e46e22f12b19 ("tpm: enhance read_log_of() to support Physical TPM
> > > event log")
> > > Signed-off-by: Michael Ellerman
> > > Reviewed-by: Stefan Berger
> > > ---
> > >  arch/powerpc/kernel/prom_init.c | 8 ++--
> > >  drivers/char/tpm/eventlog/of.c  | 23 ---
> > >  2 files changed, 10 insertions(+), 21 deletions(-)
> >
> > Split into two patches (producer and consumer).
>
> I think this wouldn't be right since it would break the system when only one
> patch is applied since it would be reading the fields in the wrong endianess.

I think it would help if the commit message would better explain what is
going on. It is somewhat difficult to decipher, if you don't have deep
knowledge of the powerpc architecture.

BR, Jarkko
Re: [PATCH v2 10/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document affinity_domain_via_partition sysfs interface file
Hi,

Same correction comments as in the other 4 patches (not repeated here).

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "affinity_domain_via_partition" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index d8e65b93d1f7..b03b2bd4b081 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -208,3 +208,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/affinity_domain_via_partition
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		AFFINITY_DOMAIN_INFORMATION_BY_PARTITION(0xB1).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",
> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.
> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during
> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for
> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 08/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document affinity_domain_via_domain sysfs interface file
Hi,

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "affinity_domain_via_domain" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index 3b63d66658fe..d8e65b93d1f7 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -176,3 +176,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/affinity_domain_via_domain
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		AFFINITY_DOMAIN_INFORMATION_BY_DOMAIN(0xB0).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",

typo

> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.

s/and/or/

> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during

Drop the comma:                            ^

> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for

Refer to

> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 06/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document affinity_domain_via_virtual_processor sysfs interface file
Hi--

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "affinity_domain_via_virtual_processor" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index aff52dc3b05c..3b63d66658fe 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -144,3 +144,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/affinity_domain_via_virtual_processor
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		AFFINITY_DOMAIN_INFORMATION_BY_VIRTUAL_PROCESSOR(0xA0).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",

H_SUCCESS

> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.

s/and/or/

> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during

Drop the comma:                            ^

> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for

Refer to

> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 04/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document processor_config sysfs interface file
Hi--

On 7/10/23 02:27, Kajol Jain wrote:
> Add details of the new hv-gpci interface file called
> "processor_config" in the ABI documentation.
>
> Signed-off-by: Kajol Jain
> ---
>  .../sysfs-bus-event_source-devices-hv_gpci | 32 +++
>  1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> index 2eeeab9a20fa..aff52dc3b05c 100644
> --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
> @@ -112,3 +112,35 @@ Description:	admin read only
>  		more information.
>
>  		* "-EFBIG" : System information exceeds PAGE_SIZE.
> +
> +What:		/sys/devices/hv_gpci/interface/processor_config
> +Date:		July 2023
> +Contact:	Linux on PowerPC Developer List
> +Description:	admin read only
> +		This sysfs file exposes the system topology information by making HCALL
> +		H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request value
> +		PROCESSOR_CONFIG(0x90).
> +
> +		* This sysfs file will be created only for power10 and above platforms.
> +
> +		* User needs root privileges to read data from this sysfs file.
> +
> +		* This sysfs file will be created, only when the HCALL returns "H_SUCESS",

H_SUCCESS

> +		"H_AUTHORITY" and "H_PARAMETER" as the return type.

s/and/or/

> +
> +		HCALL with return error type "H_AUTHORITY", can be resolved during

Drop the comma:                            ^

> +		runtime by setting "Enable Performance Information Collection" option.
> +
> +		* The end user reading this sysfs file must decode the content as per
> +		underlying platform/firmware.
> +
> +		Possible error codes while reading this sysfs file:
> +
> +		* "-EPERM" : Partition is not permitted to retrieve performance information,
> +		required to set "Enable Performance Information Collection" option.
> +
> +		* "-EIO" : Can't retrieve system information because of invalid buffer length/invalid address
> +		or because of some hardware error. Refer getPerfCountInfo documentation for

Refer to

> +		more information.
> +
> +		* "-EFBIG" : System information exceeds PAGE_SIZE.

--
~Randy
Re: [PATCH v2 02/10] docs: ABI: sysfs-bus-event_source-devices-hv_gpci: Document processor_bus_topology sysfs interface file
Hi-- On 7/10/23 02:27, Kajol Jain wrote: > Add details of the new hv-gpci interface file called > "processor_bus_topology" in the ABI documentation. > > Signed-off-by: Kajol Jain > --- > .../sysfs-bus-event_source-devices-hv_gpci| 32 +++ > 1 file changed, 32 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > index 12e2bf92783f..2eeeab9a20fa 100644 > --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci > @@ -80,3 +80,35 @@ Contact: Linux on PowerPC Developer List > > Description: read only > This sysfs file exposes the cpumask which is designated to make > HCALLs to retrieve hv-gpci pmu event counter data. > + > +What:/sys/devices/hv_gpci/interface/processor_bus_topology > +Date:July 2023 > +Contact: Linux on PowerPC Developer List > +Description: admin read only > + This sysfs file exposes the system topology information by > making HCALL > + H_GET_PERF_COUNTER_INFO. The HCALL is made with counter request > value > + PROCESSOR_BUS_TOPOLOGY(0xD0). > + > + * This sysfs file will be created only for power10 and above > platforms. > + > + * User needs root privileges to read data from this sysfs file. > + > + * This sysfs file will be created, only when the HCALL returns > "H_SUCESS", H_SUCCESS > + "H_AUTHORITY" and "H_PARAMETER" as the return type. s/and/or/ > + > + HCALL with return error type "H_AUTHORITY", can be resolved > during Drop the comma ^ > + runtime by setting "Enable Performance Information > Collection" option. > + > + * The end user reading this sysfs file must decode the content > as per > + underlying platform/firmware. > + > + Possible error codes while reading this sysfs file: > + > + * "-EPERM" : Partition is not permitted to retrieve performance > information, > + required to set "Enable Performance Information > Collection" option. 
> + > + * "-EIO" : Can't retrieve system information because of invalid > buffer length/invalid address > +or because of some hardware error. Refer > getPerfCountInfo documentation for Refer to > +more information. > + > + * "-EFBIG" : System information exceeds PAGE_SIZE. -- ~Randy
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 04:01:03PM +0200, Christian Brauner wrote: > On Tue, Jul 11, 2023 at 02:51:01PM +0200, Alexey Gladkov wrote: > > On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote: > > > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > > > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > > > > From: Palmer Dabbelt > > > > > > > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > > > > function which implements the POSIX-specified interface. This > > > > > interface differs from the underlying kernel system call, which does > > > > > not > > > > > have a flags argument. Most implementations require procfs [1][2]. > > > > > > > > > > There doesn't appear to be a good userspace workaround for this issue > > > > > but the implementation in the kernel is pretty straight-forward. > > > > > > > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW > > > > > flag, > > > > > unlike existing fchmodat. > > > > > > > > > > [1] > > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > > > > [2] > > > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > > > > > > > Signed-off-by: Palmer Dabbelt > > > > > Signed-off-by: Alexey Gladkov > > > > > > > > I don't know the history of why we ended up with the different > > > > interface, or whether this was done intentionally in the kernel > > > > or if we want this syscall. > > > > > > > > Assuming this is in fact needed, I double-checked that the > > > > implementation looks correct to me and is portable to all the > > > > architectures, without the need for a compat wrapper. > > > > > > > > Acked-by: Arnd Bergmann > > > > > > The system call itself is useful afaict. But please, > > > > > > s/fchmodat4/fchmodat2/ > > > > Sure. I will. > > Thanks. 
> Can you also wire this up for every architecture, please?
> I don't see that this has been done in this series.

Sure. I have already added it in all architectures as far as I can tell:

$ diff -s <(find arch/ -name '*.tbl' |sort -u) <(git grep -lw fchmodat2 arch/ |sort -u)
Files /dev/fd/63 and /dev/fd/62 are identical

--
Rgrds, legion
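For context on why the libc wrappers need procfs, here is a minimal userspace sketch of the fallback that glibc and musl use today, and that the new syscall makes unnecessary: open the object with O_PATH|O_NOFOLLOW and chmod it through the /proc/self/fd magic symlink. The function name is invented for illustration; this only mirrors the approach in the referenced glibc/musl sources.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Emulate fchmodat(dfd, path, mode, AT_SYMLINK_NOFOLLOW) in userspace.
 * Requires a mounted /proc -- exactly the dependency the kernel-side
 * fchmodat2()/fchmodat4() syscall removes.
 */
static int fchmodat_nofollow_emul(int dfd, const char *path, mode_t mode)
{
	struct stat st;
	char procpath[64];
	int pathfd, ret;

	/* O_PATH|O_NOFOLLOW opens the symlink itself, not its target. */
	pathfd = openat(dfd, path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
	if (pathfd < 0)
		return -1;

	/* Refuse symlinks up front, as the glibc fallback does. */
	if (fstat(pathfd, &st) < 0 || S_ISLNK(st.st_mode)) {
		close(pathfd);
		errno = EOPNOTSUPP;
		return -1;
	}

	/* chmod through the /proc magic symlink for the held fd. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", pathfd);
	ret = chmod(procpath, mode);
	close(pathfd);
	return ret;
}
```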
Re: [PATCH v3 0/5] Add a new fchmodat4() syscall
On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote: > * Alexey Gladkov: > > > This patch set adds fchmodat4(), a new syscall. The actual > > implementation is super simple: essentially it's just the same as > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > > I've attempted to make this match "man 2 fchmodat" as closely as > > possible, which says EINVAL is returned for invalid flags (as opposed to > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > > I have a sketch of a glibc patch that I haven't even compiled yet, but > > seems fairly straight-forward: > > > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c > > b/sysdeps/unix/sysv/linux/fchmodat.c > > index 6d9cbc1ce9e0..b1beab76d56c 100644 > > --- a/sysdeps/unix/sysv/linux/fchmodat.c > > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > > @@ -29,12 +29,36 @@ > > int > > fchmodat (int fd, const char *file, mode_t mode, int flag) > > { > > - if (flag & ~AT_SYMLINK_NOFOLLOW) > > -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > > -#ifndef __NR_lchmod/* Linux so far has no lchmod syscall. > > */ > > + /* There are four paths through this code: > > + - The flags are zero. In this case it's fine to call fchmodat. > > + - The flags are non-zero and glibc doesn't have access to > > + __NR_fchmodat4. In this case all we can do is emulate the > > error codes > > + defined by the glibc interface from userspace. > > + - The flags are non-zero, glibc has __NR_fchmodat4, and the > > kernel has > > + fchmodat4. This is the simplest case, as the fchmodat4 syscall > > exactly > > + matches glibc's library interface so it can be called directly. > > + - The flags are non-zero, glibc has __NR_fchmodat4, but the > > kernel does > > If you define __NR_fchmodat4 on all architectures, we can use these > constants directly in glibc. 
> We no longer depend on the UAPI definitions of those constants, to cut
> down the number of code variants, and to make glibc's system call
> profile independent of the kernel header version at build time.
>
> Your version is based on 2.31, more recent versions have some reasonable
> emulation for fchmodat based on /proc/self/fd. I even wrote a comment
> describing the same buggy behavior that you witnessed:
>
> +  /* Some Linux versions with some file systems can actually
> +     change symbolic link permissions via /proc, but this is not
> +     intentional, and it gives inconsistent results (e.g., error
> +     return despite mode change). The expected behavior is that
> +     symbolic link modes cannot be changed at all, and this check
> +     enforces that. */
> +  if (S_ISLNK (st.st_mode))
> +    {
> +      __close_nocancel (pathfd);
> +      __set_errno (EOPNOTSUPP);
> +      return -1;
> +    }
>
> I think there was some kernel discussion about that behavior before, but
> apparently, it hasn't led to fixes.

I think I've explained this somewhere else a couple of months ago but
just in case you weren't on that thread or don't remember, and apologies
if you should already know.

A lot of filesystems will happily update the mode of a symlink. The VFS
doesn't do anything to prevent this from happening. This is filesystem
specific.

The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
Specifically it comes from filesystems that call posix_acl_chmod(),
e.g., btrfs via

        if (!err && attr->ia_valid & ATTR_MODE)
                err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Most filesystems don't implement i_op->set_acl() for POSIX ACLs. So
posix_acl_chmod() will report EOPNOTSUPP. By the time posix_acl_chmod()
is called, most filesystems will have finished updating the inode.

POSIX ACLs also often aren't integrated into transactions so a rollback
wouldn't even be possible on some filesystems. Any filesystem that
doesn't implement POSIX ACLs at all will obviously never fail unless it
blocks mode changes on symlinks.
Or filesystems that do have a way to roll back failures from
posix_acl_chmod(), or filesystems that do return an error on chmod() on
symlinks, such as 9p, ntfs, ocfs2.

> I wonder if it makes sense to add a similar error return to the system
> call implementation?

Hm, blocking symlink mode changes is pretty regression prone. And just
blocking it through one interface seems weird and makes things even more
inconsistent. So two options I see:

(1) minimally invasive:
    Filesystems that do call posix_acl_chmod() on symlinks need to be
    changed to stop doing that.

(2) might hit us on the head invasive:
    Try and block symlink mode changes in chmod_common().

Thoughts?
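A toy model (all names invented) of the ordering described above: the inode's mode is updated first, and the posix_acl_chmod() hook only fails afterwards, with no rollback, which produces exactly the "error return despite mode change" symptom seen on btrfs/xfs/ext4.

```c
#include <errno.h>

/* Toy stand-in for an inode; has_set_acl models whether the filesystem
 * implements i_op->set_acl() for POSIX ACLs. */
struct toy_inode {
	unsigned int i_mode;
	int has_set_acl;
};

/* Toy posix_acl_chmod(): fails with EOPNOTSUPP when set_acl is absent. */
static int toy_posix_acl_chmod(struct toy_inode *inode)
{
	return inode->has_set_acl ? 0 : -EOPNOTSUPP;
}

/* Toy setattr path: the mode is committed before the ACL hook runs,
 * so a late EOPNOTSUPP is reported after the change already stuck. */
static int toy_setattr_mode(struct toy_inode *inode, unsigned int mode)
{
	inode->i_mode = mode;              /* inode updated first... */
	return toy_posix_acl_chmod(inode); /* ...then the hook may fail */
}
```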
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 02:51:01PM +0200, Alexey Gladkov wrote: > On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote: > > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > > > From: Palmer Dabbelt > > > > > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > > > function which implements the POSIX-specified interface. This > > > > interface differs from the underlying kernel system call, which does not > > > > have a flags argument. Most implementations require procfs [1][2]. > > > > > > > > There doesn't appear to be a good userspace workaround for this issue > > > > but the implementation in the kernel is pretty straight-forward. > > > > > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > > > > unlike existing fchmodat. > > > > > > > > [1] > > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > > > [2] > > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > > > > > Signed-off-by: Palmer Dabbelt > > > > Signed-off-by: Alexey Gladkov > > > > > > I don't know the history of why we ended up with the different > > > interface, or whether this was done intentionally in the kernel > > > or if we want this syscall. > > > > > > Assuming this is in fact needed, I double-checked that the > > > implementation looks correct to me and is portable to all the > > > architectures, without the need for a compat wrapper. > > > > > > Acked-by: Arnd Bergmann > > > > The system call itself is useful afaict. But please, > > > > s/fchmodat4/fchmodat2/ > > Sure. I will. Thanks. Can you also wire this up for every architecture, please? I don't see that this has been done in this series.
Re: [PATCH v3 5/5] selftests: add fchmodat4(2) selftest
On Tue, Jul 11, 2023 at 02:10:58PM +0200, Florian Weimer wrote: > * Alexey Gladkov: > > > The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag > > fails. This is because not all filesystems support changing the mode > > bits of symlinks properly. These filesystems return an error but change > > the mode bits: > > > > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > > AT_SYMLINK_NOFOLLOW) = 0 > > newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, > > AT_SYMLINK_NOFOLLOW) = 0 > > syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 > > EOPNOTSUPP (Operation not supported) > > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > > AT_SYMLINK_NOFOLLOW) = 0 > > > > This happens with btrfs and xfs: > > > > $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test > > TAP version 13 > > 1..1 > > ok 1 # SKIP fchmodat4(symlink) > > # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 > > > > $ stat /tmp/ksft-fchmodat4.*/symlink > >File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile > >Size: 7 Blocks: 0 IO Block: 4096 symbolic link > > Device: 7,0 Inode: 133 Links: 1 > > Access: (0600/lrw---) Uid: (0/root) Gid: (0/root) > > > > Signed-off-by: Alexey Gladkov > > This looks like a bug in those file systems? To me this looks like a bug. I'm fine if the operation ends with EOPNOTSUPP, but in that case the mode bits shouldn't change. > As an extra test, “echo 3 > /proc/sys/vm/drop_caches” sometimes has > strange effects in such cases because the bits are not actually stored > on disk, only in the dentry cache. 
tmpfs:

  syscall_0x1c3(0xff9c, 0x7ffd58758574, 0, 0x100, 0x7f6cf18adc70, 0x7ffd58756ad8) = 0
  +++ exited with 0 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

ext4:

  syscall_0x1c3(0xff9c, 0x7ffedfdb4574, 0, 0x100, 0x7f7f40b45c70, 0x7ffedfdb3ae8) = -1 EOPNOTSUPP (Operation not supported)
  +++ exited with 1 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

xfs:

  syscall_0x1c3(0xff9c, 0x7ffcd03ce574, 0, 0x100, 0x7ff2f2980c70, 0x7ffcd03cdd38) = -1 EOPNOTSUPP (Operation not supported)
  +++ exited with 1 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

btrfs:

  syscall_0x1c3(0xff9c, 0x7fff13d2e574, 0, 0x100, 0x7f9b67f59c70, 0x7fff13d2ca88) = -1 EOPNOTSUPP (Operation not supported)
  +++ exited with 1 +++
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:36 /tmp/dir/link -> f

reiserfs:

  syscall_0x1c3(0xff9c, 0x7ffdf75af574, 0, 0x100, 0x7f7ad0634c70, 0x7ffdf75ae478) = 0
  +++ exited with 0 +++
  l- 1 root root 1 Jul 11 16:43 /tmp/dir/link -> f
  === dropping caches ===
  l- 1 root root 1 Jul 11 16:43 /tmp/dir/link -> f

--
Rgrds, legion
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote: > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > > From: Palmer Dabbelt > > > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > > function which implements the POSIX-specified interface. This > > > interface differs from the underlying kernel system call, which does not > > > have a flags argument. Most implementations require procfs [1][2]. > > > > > > There doesn't appear to be a good userspace workaround for this issue > > > but the implementation in the kernel is pretty straight-forward. > > > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > > > unlike existing fchmodat. > > > > > > [1] > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > > [2] > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > > > Signed-off-by: Palmer Dabbelt > > > Signed-off-by: Alexey Gladkov > > > > I don't know the history of why we ended up with the different > > interface, or whether this was done intentionally in the kernel > > or if we want this syscall. > > > > Assuming this is in fact needed, I double-checked that the > > implementation looks correct to me and is portable to all the > > architectures, without the need for a compat wrapper. > > > > Acked-by: Arnd Bergmann > > The system call itself is useful afaict. But please, > > s/fchmodat4/fchmodat2/ Sure. I will. 
> With very few exceptions we don't version by argument number but by
> revision and we should stick to one scheme:
>
> openat() -> openat2()
> eventfd() -> eventfd2()
> clone()/clone2() -> clone3()
> dup() -> dup2() -> dup3()  // coincides with nr of arguments
> pipe() -> pipe2()          // coincides with nr of arguments
> renameat() -> renameat2()

--
Rgrds, legion
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:28:04PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 11, 2023 at 01:25:43PM +0200, Alexey Gladkov wrote:
> > -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
> > +static int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int lookup_flags)
>
> This function can still be called do_fchmodat(); we don't need to
> version internal functions.

Yes. I tried not to change too much when adopting a patch. In the new
version, I will return the old name.

Thanks.

--
Rgrds, legion
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, 5 Jul 2023 18:08:17 +0900 David Stevens wrote: > On Wed, Jul 5, 2023 at 5:47___PM Zhi Wang wrote: > > > > On Tue, 4 Jul 2023 16:50:47 +0900 > > David Stevens wrote: > > > > > From: David Stevens > > > > > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > > > __kvm_follow_pfn refactors the old API's arguments into a struct and, > > > where possible, combines the boolean arguments into a single flags > > > argument. > > > > > > Signed-off-by: David Stevens > > > --- > > > include/linux/kvm_host.h | 16 > > > virt/kvm/kvm_main.c | 171 ++- > > > virt/kvm/kvm_mm.h| 3 +- > > > virt/kvm/pfncache.c | 8 +- > > > 4 files changed, 122 insertions(+), 76 deletions(-) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index 9d3ac7720da9..ef2763c2b12e 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -97,6 +97,7 @@ > > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > > > > > /* > > > * error pfns indicate that the gfn is in slot but faild to > > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > > > kvm_memory_slot *slot, gfn_t gfn, > > > void kvm_release_page_clean(struct page *page); > > > void kvm_release_page_dirty(struct page *page); > > > > > > +struct kvm_follow_pfn { > > > + const struct kvm_memory_slot *slot; > > > + gfn_t gfn; > > > + unsigned int flags; > > > + bool atomic; > > > + /* Allow a read fault to create a writeable mapping. 
*/ > > > + bool allow_write_mapping; > > > + > > > + /* Outputs of __kvm_follow_pfn */ > > > + hva_t hva; > > > + bool writable; > > > +}; > > > + > > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > > > + > > > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > > > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > > > bool *writable); > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index 371bd783ff2b..b13f22861d2f 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -2486,24 +2486,22 @@ static inline int > > > check_user_page_hwpoison(unsigned long addr) > > > * true indicates success, otherwise false is returned. It's also the > > > * only part that runs if we can in atomic context. > > > */ > > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > > > - bool *writable, kvm_pfn_t *pfn) > > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > { > > > struct page *page[1]; > > > + bool write_fault = foll->flags & FOLL_WRITE; > > > > > > /* > > >* Fast pin a writable pfn only if it is a write fault request > > >* or the caller allows to map a writable pfn for a read fault > > >* request. > > >*/ > > > - if (!(write_fault || writable)) > > > + if (!(write_fault || foll->allow_write_mapping)) > > > return false; > > > > > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > > > *pfn = page_to_pfn(page[0]); > > > - > > > - if (writable) > > > - *writable = true; > > > + foll->writable = foll->allow_write_mapping; > > > return true; > > > } > > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > > bool write_fault, > > > * The slow path to get the pfn of the specified host virtual address, > > > * 1 indicates success, -errno is returned if error is detected. 
> > > */ > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > write_fault, > > > -bool interruptible, bool *writable, kvm_pfn_t > > > *pfn) > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > { > > > - unsigned int flags = FOLL_HWPOISON; > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > struct page *page; > > > int npages; > > > > > > might_sleep(); > > > > > > - if (writable) > > > - *writable = write_fault; > > > - > > > - if (write_fault) > > > - flags |= FOLL_WRITE; > > > - if (async) > > > - flags |= FOLL_NOWAIT; > > > - if (interruptible) > > > - flags |= FOLL_INTERRUPTIBLE; > > > - > > > - npages = get_user_pages_unlocked(addr, 1, , flags); > > > + npages = get_user_pages_unlocked(foll->hva, 1, , flags); > > > if (npages != 1) > > > return npages; > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > >
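Stepping back from the diff, the refactoring pattern the series applies, folding a pile of boolean parameters into a flags word carried in an argument struct with dedicated output fields, can be sketched as a standalone toy. All names here are invented; they only mirror the shape of struct kvm_follow_pfn, not its actual definition.

```c
#include <stdbool.h>

/* Toy request flags standing in for FOLL_WRITE / FOLL_NOWAIT /
 * FOLL_INTERRUPTIBLE in the real code. */
#define TOY_WRITE         0x1
#define TOY_NOWAIT        0x2
#define TOY_INTERRUPTIBLE 0x4

/* One struct replaces the long (addr, write_fault, async, interruptible,
 * writable) parameter list, and carries outputs explicitly. */
struct toy_follow {
	unsigned long hva;
	unsigned int flags;   /* TOY_* requests, replacing bool args */
	/* outputs */
	bool writable;
};

static bool toy_follow(struct toy_follow *foll)
{
	/* A real implementation would resolve foll->hva here; the toy
	 * only records whether a writable mapping was requested. */
	foll->writable = (foll->flags & TOY_WRITE) != 0;
	return true;
}

/* The old boolean-argument API survives as a thin wrapper, which is how
 * such refactors keep existing call sites working during conversion. */
static bool toy_follow_old(unsigned long hva, bool write, bool nowait)
{
	struct toy_follow foll = {
		.hva = hva,
		.flags = (write ? TOY_WRITE : 0) | (nowait ? TOY_NOWAIT : 0),
	};
	return toy_follow(&foll);
}
```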
Re: (subset) [PATCH v4 0/5] Add a new fchmodat2() syscall
On Tue, 11 Jul 2023 18:16:02 +0200, Alexey Gladkov wrote:
> In glibc, the fchmodat(3) function has a flags argument according to the
> POSIX specification [1], but kernel syscalls has no such argument.
> Therefore, libc implementations do workarounds using /proc. However,
> this requires procfs to be mounted and accessible.
>
> This patch set adds fchmodat2(), a new syscall. The syscall allows to
> pass the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other
> respects, this syscall is no different from fchmodat().
>
> [...]

Tools updates usually go separately. Flags argument ported to unsigned
int; otherwise unchanged.

---

Applied to the master branch of the vfs/vfs.git tree. Patches in the
master branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: master

[1/5] Non-functional cleanup of a "__user * filename"
      https://git.kernel.org/vfs/vfs/c/0f05a6af6b7e
[2/5] fs: Add fchmodat2()
      https://git.kernel.org/vfs/vfs/c/8d593559ec09
[3/5] arch: Register fchmodat2, usually as syscall 452
      https://git.kernel.org/vfs/vfs/c/2ee63b04f206
[5/5] selftests: Add fchmodat2 selftest
      https://git.kernel.org/vfs/vfs/c/f175b92081ec
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Thu, 6 Jul 2023 15:49:39 +0900 David Stevens wrote: > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang wrote: > > > > On Tue, 4 Jul 2023 16:50:48 +0900 > > David Stevens wrote: > > > > > From: David Stevens > > > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > > callers to resolve a gfn when the associated pfn has a valid struct page > > > that isn't being actively refcounted (e.g. tail pages of non-compound > > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > > the returned pfn must be guarded by a mmu notifier. > > > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > > is set when the returned pfn has an associated struct page with a valid > > > refcount. Callers that don't pass FOLL_GET should remember this value > > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > > non-zero refcount. > > > > > > Signed-off-by: David Stevens > > > --- > > > include/linux/kvm_host.h | 10 ++ > > > virt/kvm/kvm_main.c | 67 +--- > > > virt/kvm/pfncache.c | 2 +- > > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index ef2763c2b12e..a45308c7d2d9 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > > kvm_memory_slot *slot, gfn_t gfn, > > > void kvm_release_page_clean(struct page *page); > > > void kvm_release_page_dirty(struct page *page); > > > > > > +void kvm_set_page_accessed(struct page *page); > > > +void kvm_set_page_dirty(struct page *page); > > > + > > > struct kvm_follow_pfn { > > > const struct kvm_memory_slot *slot; > > > gfn_t gfn; > > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > > bool atomic; > > > /* Allow a read fault to create a writeable mapping. 
*/ > > > bool allow_write_mapping; > > > + /* > > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > > > > > ^guarded > > > + * be true if FOLL_GET is not set. > > > + */ > > > + bool guarded_by_mmu_notifier; > > > > > It seems no one sets the guraded_by_mmu_notifier in this patch. Is > > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the > > caller of __kvm_follow_pfn()? > > Yes, this is the case. > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail > > page? > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a > > temporary solution. I don't think it is a good idea to play tricks with > > a temporary solution, more like we are abusing the toleration. > > I'm not sure I understand what you're getting at. This series never > calls gup without FOLL_GET. > > This series aims to provide kvm_follow_pfn as a unified API on top of > gup+follow_pte. Since one of the major clients of this API uses an mmu > notifier, it makes sense to support returning a pfn without taking a > reference. And we indeed need to do that for certain types of memory. > I am not having prob with taking a pfn without taking a ref. I am questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking a pfn without a ref is a good idea or not, while there is another flag actually showing it. I can understand that using FOLL_XXX in kvm_follow_pfn saves some translation between struct kvm_follow_pfn.{write, async, } and GUP flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the requirements of GUP in the code path that going to call GUP is reasonable. But using FOLL_XXX with purposes that are not related to GUP call really feels off. Those flags can be changed in future because of GUP requirements. Then people have to figure out what actually is happening with FOLL_GET here as it is not actually tied to GUP calls. 
> > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough
> > to indicate a tail page?
>
> What do you mean by to indicate a tail page? Do you mean to indicate
> that the returned pfn refers to a non-refcounted page? That's specified
> by is_refcounted_page.

I figured out the reason why I got confused.

+	 * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages but be allocated without refcounting e.g., tail pages of
+	 * non-compound higher order allocations. If FOLL_GET is set and we
+	 * increment such a refcount, then when that pfn is eventually passed to
+	 * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+	 * freed. Therefore don't allow those pages here when FOLL_GET is set. */

The above statements only explain the wrong behavior, but don't explain
the expected behavior. It would be better to explain that for
manipulating mmu notifier guard
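The invariant under debate, that a caller which does not take a reference (no FOLL_GET) must promise its use of the pfn is guarded by an mmu notifier, can be sketched as an explicit validity check. Names are invented; the real series encodes this as a documented requirement on struct kvm_follow_pfn rather than a literal check like this.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_FOLL_GET 0x1  /* stand-in for the real FOLL_GET */

struct toy_kfp {
	unsigned int flags;
	/* Must be true if TOY_FOLL_GET is not set: uses of the returned
	 * pfn are then guarded by an mmu notifier instead of a refcount. */
	bool guarded_by_mmu_notifier;
};

/* Reject requests that would hand out a refcount-less pfn with no
 * notifier guard -- nothing would keep the page from being freed. */
static int toy_follow_pfn_check(const struct toy_kfp *foll)
{
	if (!(foll->flags & TOY_FOLL_GET) && !foll->guarded_by_mmu_notifier)
		return -EINVAL;
	return 0;
}
```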
Re: [PATCH v4 4/5] tools headers UAPI: Sync files changed by new fchmodat2 syscall
On Tue, Jul 11, 2023 at 10:19:35AM -0700, Namhyung Kim wrote: > Hello, > > On Tue, Jul 11, 2023 at 9:18 AM Alexey Gladkov wrote: > > > > From: Palmer Dabbelt > > > > That add support for this new syscall in tools such as 'perf trace'. > > > > Signed-off-by: Palmer Dabbelt > > Signed-off-by: Alexey Gladkov > > --- > > tools/include/uapi/asm-generic/unistd.h | 5 - > > tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++ > > tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 ++ > > tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 ++ > > tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ > > 5 files changed, 12 insertions(+), 1 deletion(-) > > It'd be nice if you route this patch separately through the > perf tools tree. We can add this after the kernel change > is accepted. Sure. No problem. > > > > diff --git a/tools/include/uapi/asm-generic/unistd.h > > b/tools/include/uapi/asm-generic/unistd.h > > index dd7d8e10f16d..76b5922b0d39 100644 > > --- a/tools/include/uapi/asm-generic/unistd.h > > +++ b/tools/include/uapi/asm-generic/unistd.h > > @@ -817,8 +817,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > > #define __NR_set_mempolicy_home_node 450 > > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) > > > > +#define __NR_fchmodat2 452 > > +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) > > + > > #undef __NR_syscalls > > -#define __NR_syscalls 451 > > +#define __NR_syscalls 453 > > > > /* > > * 32 bit systems traditionally used different > > diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > index 3f1886ad9d80..434728af4eaa 100644 > > --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > > @@ -365,3 +365,5 @@ > > 448n64 process_mreleasesys_process_mrelease > > 449n64 futex_waitv sys_futex_waitv > > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > > +# 451 reserved for cachestat 
> > +452n64 fchmodat2 sys_fchmodat2 > > diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > index a0be127475b1..6b70b6705bd7 100644 > > --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > > @@ -537,3 +537,5 @@ > > 448common process_mreleasesys_process_mrelease > > 449common futex_waitv sys_futex_waitv > > 450nospu set_mempolicy_home_node sys_set_mempolicy_home_node > > +# 451 reserved for cachestat > > +452common fchmodat2 sys_fchmodat2 > > diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > index b68f47541169..0ed90c9535b0 100644 > > --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > > @@ -453,3 +453,5 @@ > > 448 commonprocess_mreleasesys_process_mrelease > > sys_process_mrelease > > 449 commonfutex_waitv sys_futex_waitv > > sys_futex_waitv > > 450 commonset_mempolicy_home_node sys_set_mempolicy_home_node > > sys_set_mempolicy_home_node > > +# 451 reserved for cachestat > > +452 commonfchmodat2 sys_fchmodat2 > > sys_fchmodat2 > > diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > index c84d12608cd2..a008724a1f48 100644 > > --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -372,6 +372,8 @@ > > 448common process_mreleasesys_process_mrelease > > 449common futex_waitv sys_futex_waitv > > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > > +# 451 reserved for cachestat > > +452common fchmodat2 sys_fchmodat2 > > > > # > > # Due to a historical design error, certain syscalls are numbered > > differently > > -- > > 2.33.8 > > > -- Rgrds, legion
Re: [PATCH v4 4/5] tools headers UAPI: Sync files changed by new fchmodat2 syscall
Hello, On Tue, Jul 11, 2023 at 9:18 AM Alexey Gladkov wrote: > > From: Palmer Dabbelt > > That add support for this new syscall in tools such as 'perf trace'. > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov > --- > tools/include/uapi/asm-generic/unistd.h | 5 - > tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++ > tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 ++ > tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 ++ > tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ > 5 files changed, 12 insertions(+), 1 deletion(-) It'd be nice if you route this patch separately through the perf tools tree. We can add this after the kernel change is accepted. Thanks, Namhyung > > diff --git a/tools/include/uapi/asm-generic/unistd.h > b/tools/include/uapi/asm-generic/unistd.h > index dd7d8e10f16d..76b5922b0d39 100644 > --- a/tools/include/uapi/asm-generic/unistd.h > +++ b/tools/include/uapi/asm-generic/unistd.h > @@ -817,8 +817,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > #define __NR_set_mempolicy_home_node 450 > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) > > +#define __NR_fchmodat2 452 > +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) > + > #undef __NR_syscalls > -#define __NR_syscalls 451 > +#define __NR_syscalls 453 > > /* > * 32 bit systems traditionally used different > diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > index 3f1886ad9d80..434728af4eaa 100644 > --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl > @@ -365,3 +365,5 @@ > 448n64 process_mreleasesys_process_mrelease > 449n64 futex_waitv sys_futex_waitv > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452n64 fchmodat2 sys_fchmodat2 > diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > index 
a0be127475b1..6b70b6705bd7 100644 > --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl > @@ -537,3 +537,5 @@ > 448common process_mreleasesys_process_mrelease > 449common futex_waitv sys_futex_waitv > 450nospu set_mempolicy_home_node sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452common fchmodat2 sys_fchmodat2 > diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > index b68f47541169..0ed90c9535b0 100644 > --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl > +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl > @@ -453,3 +453,5 @@ > 448 commonprocess_mreleasesys_process_mrelease > sys_process_mrelease > 449 commonfutex_waitv sys_futex_waitv > sys_futex_waitv > 450 commonset_mempolicy_home_node sys_set_mempolicy_home_node > sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452 commonfchmodat2 sys_fchmodat2 > sys_fchmodat2 > diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > index c84d12608cd2..a008724a1f48 100644 > --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl > @@ -372,6 +372,8 @@ > 448common process_mreleasesys_process_mrelease > 449common futex_waitv sys_futex_waitv > 450common set_mempolicy_home_node sys_set_mempolicy_home_node > +# 451 reserved for cachestat > +452common fchmodat2 sys_fchmodat2 > > # > # Due to a historical design error, certain syscalls are numbered differently > -- > 2.33.8 >
Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation
On 11.07.23 06:48, Aneesh Kumar K.V wrote: Add a new kconfig option that can be selected if we want to allow pageblock alignment by reserving pages in the vmemmap altmap area. This implies we will be reserving some pages for every memory block. This also allows the memmap on memory feature to be widely useful with different memory block size values. "reserving pages" is a nice way of saying "wasting memory". :) Let's spell that out. I think we have to find a better name for this, and I think we should have a toggle similar to memory_hotplug.memmap_on_memory. This should be an admin decision, not some kernel config option. memory_hotplug.force_memmap_on_memory "Enable the memmap on memory feature even if it could result in memory waste due to memmap size limitations. For example, if the memmap for a memory block requires 1 MiB, but the pageblock size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are still cases where the feature cannot be enforced: for example, if the memmap is smaller than a single page, or if the architecture does not support the forced mode in all configurations." Thoughts? Signed-off-by: Aneesh Kumar K.V --- mm/Kconfig | 9 +++ mm/memory_hotplug.c | 59 + 2 files changed, 58 insertions(+), 10 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 932349271e28..88a1472b2086 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -570,6 +570,15 @@ config MHP_MEMMAP_ON_MEMORY depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE +config MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY + bool "Allow reserving pages for pageblock alignment" + depends on MHP_MEMMAP_ON_MEMORY + help + This option allows the memmap on memory feature to be more useful + with different memory block sizes. This is achieved by marking some pages + in each memory block as reserved so that we can get pageblock alignment + for the remaining pages.
+ endif # MEMORY_HOTPLUG config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 07c99b0cc371..f36aec1f7626 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1252,15 +1252,17 @@ static inline bool arch_supports_memmap_on_memory(unsigned long size) { unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); - unsigned long remaining_size = size - vmemmap_size; - return IS_ALIGNED(vmemmap_size, PMD_SIZE) && - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); + return IS_ALIGNED(vmemmap_size, PMD_SIZE); } #endif static bool mhp_supports_memmap_on_memory(unsigned long size) { + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); + unsigned long remaining_size = size - vmemmap_size; + /* * Besides having arch support and the feature enabled at runtime, we * need a few more assumptions to hold true: @@ -1287,9 +1289,30 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ - return mhp_memmap_on_memory() && - size == memory_block_size_bytes() && - arch_supports_memmap_on_memory(size); + if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) + return false; +/* + * Without page reservation remaining pages should be pageblock aligned. 
+ */ + if (!IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY) && + !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT))) + return false; + + return arch_supports_memmap_on_memory(size); +} + +static inline unsigned long memory_block_align_base(unsigned long size) +{ + if (IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY)) { + unsigned long align; + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size; + + vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> PAGE_SHIFT; + align = pageblock_align(vmemmap_size) - vmemmap_size; We should probably have a helper to calculate a) the unaligned vmemmap size, for example used in arch_supports_memmap_on_memory() b) the pageblock-aligned vmemmap size. -- Cheers, David / dhildenb
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 09:35:13AM -0700, Suren Baghdasaryan wrote: > On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote: > > > > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote: > > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote: > > > > On 7/11/23 12:35, Leon Romanovsky wrote: > > > > > > > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote: > > > > > > > > > > <...> > > > > > > > > > >> Laurent Dufour (1): > > > > >> powerc/mm: try VMA lock-based page fault handling first > > > > > > > > > > Hi, > > > > > > > > > > This series and specifically the commit above broke docker over PPC. > > > > > It causes to docker service stuck while trying to activate. Revert of > > > > > this commit allows us to use docker again. > > > > > > > > Hi, > > > > > > > > there have been follow-up fixes, that are part of 6.4.3 stable (also > > > > 6.5-rc1) Does that version work for you? > > > > > > I'll recheck it again on clean system, but for the record: > > > 1. We are running 6.5-rc1 kernels. > > > 2. PPC doesn't compile for us on -rc1 without this fix. > > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/ > > > > Ohh, I see it in -rc1, let's recheck. > > Hi Leon, > Please let us know how it goes. Once, we rebuilt clean -rc1, docker worked for us. Sorry for the noise. > > > > > > 3. I didn't see anything relevant -rc1 with "git log > > > arch/powerpc/mm/fault.c". > > The fixes Vlastimil was referring to are not in the fault.c, they are > in the main mm and fork code. More specifically, check for these > patches to exist in the branch you are testing: > > mm: lock newly mapped VMA with corrected ordering > fork: lock VMAs of the parent process when forking > mm: lock newly mapped VMA which can be modified after it becomes visible > mm: lock a vma before stack expansion Thanks > > Thanks, > Suren. > > > > > > > Do you have in mind anything specific to check? 
> > > > > > Thanks
Re: [PATCH v4 2/5] fs: Add fchmodat2()
On Tue, Jul 11, 2023 at 06:16:04PM +0200, Alexey Gladkov wrote: > On the userspace side fchmodat(3) is implemented as a wrapper > function which implements the POSIX-specified interface. This > interface differs from the underlying kernel system call, which does not > have a flags argument. Most implementations require procfs [1][2]. > > There doesn't appear to be a good userspace workaround for this issue > but the implementation in the kernel is pretty straight-forward. > > The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > unlike existing fchmodat. > > [1] > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > [2] > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > Co-developed-by: Palmer Dabbelt > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov > Acked-by: Arnd Bergmann > --- > fs/open.c| 18 ++ > include/linux/syscalls.h | 2 ++ > 2 files changed, 16 insertions(+), 4 deletions(-) > > diff --git a/fs/open.c b/fs/open.c > index 0c55c8e7f837..39a7939f0d00 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) > return err; > } > > -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) > +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, > int lookup_flags) Should all be unsigned instead of int here for flags. We also had a documentation update to that effect but smh never sent it. user_path_at() itself takes an unsigned as well. I'll fix that up though.
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote: > > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote: > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote: > > > On 7/11/23 12:35, Leon Romanovsky wrote: > > > > > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote: > > > > > > > > <...> > > > > > > > >> Laurent Dufour (1): > > > >> powerc/mm: try VMA lock-based page fault handling first > > > > > > > > Hi, > > > > > > > > This series and specifically the commit above broke docker over PPC. > > > > It causes to docker service stuck while trying to activate. Revert of > > > > this commit allows us to use docker again. > > > > > > Hi, > > > > > > there have been follow-up fixes, that are part of 6.4.3 stable (also > > > 6.5-rc1) Does that version work for you? > > > > I'll recheck it again on clean system, but for the record: > > 1. We are running 6.5-rc1 kernels. > > 2. PPC doesn't compile for us on -rc1 without this fix. > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/ > > Ohh, I see it in -rc1, let's recheck. Hi Leon, Please let us know how it goes. > > > 3. I didn't see anything relevant -rc1 with "git log > > arch/powerpc/mm/fault.c". The fixes Vlastimil was referring to are not in the fault.c, they are in the main mm and fork code. More specifically, check for these patches to exist in the branch you are testing: mm: lock newly mapped VMA with corrected ordering fork: lock VMAs of the parent process when forking mm: lock newly mapped VMA which can be modified after it becomes visible mm: lock a vma before stack expansion Thanks, Suren. > > > > Do you have in mind anything specific to check? > > > > Thanks > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
[PATCH v4 4/5] tools headers UAPI: Sync files changed by new fchmodat2 syscall
From: Palmer Dabbelt This adds support for the new syscall in tools such as 'perf trace'. Signed-off-by: Palmer Dabbelt Signed-off-by: Alexey Gladkov --- tools/include/uapi/asm-generic/unistd.h | 5 - tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 2 ++ tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 ++ tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 ++ tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ 5 files changed, 12 insertions(+), 1 deletion(-) diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h index dd7d8e10f16d..76b5922b0d39 100644 --- a/tools/include/uapi/asm-generic/unistd.h +++ b/tools/include/uapi/asm-generic/unistd.h @@ -817,8 +817,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_fchmodat2 452 +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 453 /* * 32 bit systems traditionally used different diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl index 3f1886ad9d80..434728af4eaa 100644 --- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl +++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl @@ -365,3 +365,5 @@ 448n64 process_mreleasesys_process_mrelease 449n64 futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node +# 451 reserved for cachestat +452n64 fchmodat2 sys_fchmodat2 diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl index a0be127475b1..6b70b6705bd7 100644 --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl @@ -537,3 +537,5 @@ 448common process_mreleasesys_process_mrelease 449common futex_waitv sys_futex_waitv 450nospu set_mempolicy_home_node
sys_set_mempolicy_home_node +# 451 reserved for cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl index b68f47541169..0ed90c9535b0 100644 --- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl @@ -453,3 +453,5 @@ 448 commonprocess_mreleasesys_process_mrelease sys_process_mrelease 449 commonfutex_waitv sys_futex_waitv sys_futex_waitv 450 commonset_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node +# 451 reserved for cachestat +452 commonfchmodat2 sys_fchmodat2 sys_fchmodat2 diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..a008724a1f48 100644 --- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl +++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,8 @@ 448common process_mreleasesys_process_mrelease 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node +# 451 reserved for cachestat +452common fchmodat2 sys_fchmodat2 # # Due to a historical design error, certain syscalls are numbered differently -- 2.33.8
Re: [PATCH v4 3/5] arch: Register fchmodat2, usually as syscall 452
On Tue, Jul 11, 2023, at 18:16, Alexey Gladkov wrote: > From: Palmer Dabbelt > > This registers the new fchmodat2 syscall in most places as number 452, > with alpha being the exception where it's 562. I found all these sites > by grepping for fspick, which I assume has found me everything. > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov Acked-by: Arnd Bergmann
[PATCH v4 5/5] selftests: Add fchmodat2 selftest
The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag fails. This is because not all filesystems support changing the mode bits of symlinks properly. These filesystems return an error but change the mode bits: newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, AT_SYMLINK_NOFOLLOW) = 0 syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 EOPNOTSUPP (Operation not supported) newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 This happens with btrfs and xfs: $ tools/testing/selftests/fchmodat2/fchmodat2_test TAP version 13 1..1 ok 1 # SKIP fchmodat2(symlink) # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 $ stat /tmp/ksft-fchmodat2.*/symlink File: /tmp/ksft-fchmodat2.3NCqlE/symlink -> regfile Size: 7 Blocks: 0 IO Block: 4096 symbolic link Device: 7,0 Inode: 133 Links: 1 Access: (0600/lrw---) Uid: (0/root) Gid: (0/root) Signed-off-by: Alexey Gladkov --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat2/.gitignore | 2 + tools/testing/selftests/fchmodat2/Makefile| 6 + .../selftests/fchmodat2/fchmodat2_test.c | 162 ++ 4 files changed, 171 insertions(+) create mode 100644 tools/testing/selftests/fchmodat2/.gitignore create mode 100644 tools/testing/selftests/fchmodat2/Makefile create mode 100644 tools/testing/selftests/fchmodat2/fchmodat2_test.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 666b56f22a41..8dca8acdb671 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -18,6 +18,7 @@ TARGETS += drivers/net/bonding TARGETS += drivers/net/team TARGETS += efivarfs TARGETS += exec +TARGETS += fchmodat2 TARGETS += filesystems TARGETS += filesystems/binderfs TARGETS += filesystems/epoll diff --git a/tools/testing/selftests/fchmodat2/.gitignore b/tools/testing/selftests/fchmodat2/.gitignore new 
file mode 100644 index ..82a4846cbc4b --- /dev/null +++ b/tools/testing/selftests/fchmodat2/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +/*_test diff --git a/tools/testing/selftests/fchmodat2/Makefile b/tools/testing/selftests/fchmodat2/Makefile new file mode 100644 index ..45b519eab851 --- /dev/null +++ b/tools/testing/selftests/fchmodat2/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined +TEST_GEN_PROGS := fchmodat2_test + +include ../lib.mk diff --git a/tools/testing/selftests/fchmodat2/fchmodat2_test.c b/tools/testing/selftests/fchmodat2/fchmodat2_test.c new file mode 100644 index ..2d98eb215bc6 --- /dev/null +++ b/tools/testing/selftests/fchmodat2/fchmodat2_test.c @@ -0,0 +1,162 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#ifndef __NR_fchmodat2 + #if defined __alpha__ + #define __NR_fchmodat2 562 + #elif defined _MIPS_SIM + #if _MIPS_SIM == _MIPS_SIM_ABI32/* o32 */ + #define __NR_fchmodat2 (452 + 4000) + #endif + #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ + #define __NR_fchmodat2 (452 + 6000) + #endif + #if _MIPS_SIM == _MIPS_SIM_ABI64/* n64 */ + #define __NR_fchmodat2 (452 + 5000) + #endif + #elif defined __ia64__ + #define __NR_fchmodat2 (452 + 1024) + #else + #define __NR_fchmodat2 452 + #endif +#endif + +int sys_fchmodat2(int dfd, const char *filename, mode_t mode, int flags) +{ + int ret = syscall(__NR_fchmodat2, dfd, filename, mode, flags); + + return ret >= 0 ? ret : -errno; +} + +int setup_testdir(void) +{ + int dfd, ret; + char dirname[] = "/tmp/ksft-fchmodat2.XX"; + + /* Make the top-level directory. 
*/ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("%s: failed to create tmpdir\n", __func__); + + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("%s: failed to open tmpdir\n", __func__); + + ret = openat(dfd, "regfile", O_CREAT | O_WRONLY | O_TRUNC, 0644); + if (ret < 0) + ksft_exit_fail_msg("%s: failed to create file in tmpdir\n", + __func__); + close(ret); + + ret = symlinkat("regfile", dfd, "symlink"); + if (ret < 0) + ksft_exit_fail_msg("%s: failed to create symlink in tmpdir\n", + __func__); + + return dfd; +} + +int expect_mode(int dfd, const char *filename,
[PATCH v4 3/5] arch: Register fchmodat2, usually as syscall 452
From: Palmer Dabbelt This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grepping for fspick, which I assume has found me everything. Signed-off-by: Palmer Dabbelt Signed-off-by: Alexey Gladkov --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 ++ arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl| 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/uapi/asm-generic/unistd.h | 5 - 19 files changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 1f13995d00d7..ad37569d0507 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -491,3 +491,4 @@ 559common futex_waitv sys_futex_waitv 560common set_mempolicy_home_node sys_ni_syscall 561common cachestat sys_cachestat +562common fchmodat2 sys_fchmodat2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 8ebed8a13874..c572d6c3dee0 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -465,3 +465,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/arm64/include/asm/unistd.h 
b/arch/arm64/include/asm/unistd.h index 64a514f90131..bd77253b62e0 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 452 +#define __NR_compat_syscalls 453 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index d952a28463e0..78b68311ec81 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -909,6 +909,8 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) #define __NR_cachestat 451 __SYSCALL(__NR_cachestat, sys_cachestat) +#define __NR_fchmodat2 452 +__SYSCALL(__NR_fchmodat2, sys_fchmodat2) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index f8c74ffeeefb..83d8609aec03 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -372,3 +372,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 4f504783371f..259ceb125367 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -451,3 +451,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 858d22bf275c..a3798c2637fd 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -457,3 
+457,4 @@ 449common futex_waitv sys_futex_waitv 450common set_mempolicy_home_node sys_set_mempolicy_home_node 451common cachestat sys_cachestat +452common fchmodat2 sys_fchmodat2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 1976317d4e8b..152034b8e0a0 100644 ---
[PATCH v4 2/5] fs: Add fchmodat2()
On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2]. There doesn't appear to be a good userspace workaround for this issue, but the implementation in the kernel is pretty straightforward. The new fchmodat2() syscall allows passing the AT_SYMLINK_NOFOLLOW flag, unlike the existing fchmodat(). [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 Co-developed-by: Palmer Dabbelt Signed-off-by: Palmer Dabbelt Signed-off-by: Alexey Gladkov Acked-by: Arnd Bergmann --- fs/open.c| 18 ++ include/linux/syscalls.h | 2 ++ 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/open.c b/fs/open.c index 0c55c8e7f837..39a7939f0d00 100644 --- a/fs/open.c +++ b/fs/open.c @@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode) return err; } -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int lookup_flags) { struct path path; int error; - unsigned int lookup_flags = LOOKUP_FOLLOW; + retry: error = user_path_at(dfd, filename, lookup_flags, &path); if (!error) { @@ -689,15 +689,25 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) return error; } +SYSCALL_DEFINE4(fchmodat2, int, dfd, const char __user *, filename, + umode_t, mode, int, flags) +{ + if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW)) + return -EINVAL; + + return do_fchmodat(dfd, filename, mode, + flags & AT_SYMLINK_NOFOLLOW ? 
0 : LOOKUP_FOLLOW); +} + SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, umode_t, mode) { - return do_fchmodat(dfd, filename, mode); + return do_fchmodat(dfd, filename, mode, LOOKUP_FOLLOW); } SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode) { - return do_fchmodat(AT_FDCWD, filename, mode); + return do_fchmodat(AT_FDCWD, filename, mode, LOOKUP_FOLLOW); } /* diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 584f404bf868..6e852279fbc3 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -440,6 +440,8 @@ asmlinkage long sys_chroot(const char __user *filename); asmlinkage long sys_fchmod(unsigned int fd, umode_t mode); asmlinkage long sys_fchmodat(int dfd, const char __user *filename, umode_t mode); +asmlinkage long sys_fchmodat2(int dfd, const char __user *filename, +umode_t mode, int flags); asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group, int flag); asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group); -- 2.33.8
[PATCH v4 1/5] Non-functional cleanup of a "__user * filename"
From: Palmer Dabbelt The next patch defines a very similar interface, which I copied from this definition. Since I'm touching it anyway I don't see any reason not to just go fix this one up. Signed-off-by: Palmer Dabbelt Acked-by: Arnd Bergmann --- include/linux/syscalls.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 03e3d0121d5e..584f404bf868 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -438,7 +438,7 @@ asmlinkage long sys_chdir(const char __user *filename); asmlinkage long sys_fchdir(unsigned int fd); asmlinkage long sys_chroot(const char __user *filename); asmlinkage long sys_fchmod(unsigned int fd, umode_t mode); -asmlinkage long sys_fchmodat(int dfd, const char __user * filename, +asmlinkage long sys_fchmodat(int dfd, const char __user *filename, umode_t mode); asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group, int flag); -- 2.33.8
[PATCH v4 0/5] Add a new fchmodat2() syscall
In glibc, the fchmodat(3) function has a flags argument according to the POSIX specification [1], but the kernel syscall has no such argument. Therefore, libc implementations fall back to workarounds using /proc. However, this requires procfs to be mounted and accessible. This patch set adds fchmodat2(), a new syscall. The syscall allows passing the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other respects, this syscall is no different from fchmodat(). [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/chmod.html Changes since v3 [cover.1689074739.git.leg...@kernel.org]: * Rebased to master because a new syscall has appeared in master. * Increased __NR_compat_syscalls as pointed out by Arnd Bergmann. * Syscall renamed fchmodat4 -> fchmodat2 as suggested by Christian Brauner. * Gave do_fchmodat4() back its original name. We don't need to version internal functions. * Fixed warnings found by checkpatch.pl. Changes since v2 [20190717012719.5524-1-pal...@sifive.com]: * Rebased to master. * The lookup_flags passed to sys_fchmodat4 as suggested by Al Viro. * Selftest added. Changes since v1 [20190531191204.4044-1-pal...@sifive.com]: * All architectures are now supported, with the support squashed into a single patch. * The do_fchmodat() helper function has been removed, in favor of directly calling do_fchmodat4(). * The patches are based on 5.2 instead of 5.1. 
--- Alexey Gladkov (2): fs: Add fchmodat2() selftests: Add fchmodat2 selftest Palmer Dabbelt (3): Non-functional cleanup of a "__user * filename" arch: Register fchmodat2, usually as syscall 452 tools headers UAPI: Sync files changed by new fchmodat2 syscall arch/alpha/kernel/syscalls/syscall.tbl| 1 + arch/arm/tools/syscall.tbl| 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl| 1 + arch/x86/entry/syscalls/syscall_32.tbl| 1 + arch/x86/entry/syscalls/syscall_64.tbl| 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/open.c | 18 +- include/linux/syscalls.h | 4 +- include/uapi/asm-generic/unistd.h | 5 +- tools/include/uapi/asm-generic/unistd.h | 5 +- .../arch/mips/entry/syscalls/syscall_n64.tbl | 2 + .../arch/powerpc/entry/syscalls/syscall.tbl | 2 + .../perf/arch/s390/entry/syscalls/syscall.tbl | 2 + .../arch/x86/entry/syscalls/syscall_64.tbl| 2 + tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat2/.gitignore | 2 + tools/testing/selftests/fchmodat2/Makefile| 6 + .../selftests/fchmodat2/fchmodat2_test.c | 162 ++ 30 files changed, 223 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/fchmodat2/.gitignore create mode 100644 tools/testing/selftests/fchmodat2/Makefile create mode 100644 tools/testing/selftests/fchmodat2/fchmodat2_test.c -- 2.33.8
[PATCH v4 15/15] powerpc: Implement UACCESS validation on PPC32
In order to implement UACCESS validation, objtool support for powerpc needs to be enhanced to decode more instructions. It also requires implementing switch table discovery. On PPC32 this is similar to x86: switch tables are anonymous in .rodata; the difference is that each value is relative to its index in the table. But several switch tables can be nested, so the register containing the table base address also needs to be tracked and taken into account. Don't activate it for Clang for now, because its switch tables are different from GCC's.

Then come the UACCESS enabling/disabling instructions. On booke and the 8xx this is done with a mtspr instruction. On the 8xx it targets SPRN_MD_AP, on booke SPRN_PID. Annotate those instructions.

No work has been done for ASM files; they are not used for UACCESS, so for the moment just tell objtool to ignore ASM files.

For relocatable code, the .got2 relocation preceding each global function needs to be marked as ignored, because some versions of GCC do this:

 120:	00 00 00 00	.long 0x0
			120: R_PPC_REL32	.got2+0x7ff0

0124 :
 124:	94 21 ff f0	stwu    r1,-16(r1)
 128:	7c 08 02 a6	mflr    r0
 12c:	42 9f 00 05	bcl     20,4*cr7+so,130
 130:	39 00 00 00	li      r8,0
 134:	39 20 00 08	li      r9,8
 138:	93 c1 00 08	stw     r30,8(r1)
 13c:	7f c8 02 a6	mflr    r30
 140:	90 01 00 14	stw     r0,20(r1)
 144:	80 1e ff f0	lwz     r0,-16(r30)
 148:	7f c0 f2 14	add     r30,r0,r30
 14c:	81 5e 80 00	lwz     r10,-32768(r30)
 150:	80 fe 80 04	lwz     r7,-32764(r30)

Also declare longjmp() and start_secondary_resume() as global noreturn functions, and declare __copy_tofrom_user() and __arch_clear_user() as UACCESS safe.
Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/book3s/32/kup.h | 2 + arch/powerpc/include/asm/nohash/32/kup-8xx.h | 4 +- arch/powerpc/include/asm/nohash/kup-booke.h | 4 +- arch/powerpc/kexec/core_32.c | 4 +- arch/powerpc/mm/nohash/kup.c | 2 + tools/objtool/arch/powerpc/decode.c | 155 +- .../arch/powerpc/include/arch/noreturns.h | 11 ++ tools/objtool/arch/powerpc/special.c | 36 +++- tools/objtool/check.c | 6 +- 10 files changed, 211 insertions(+), 15 deletions(-) create mode 100644 tools/objtool/arch/powerpc/include/arch/noreturns.h diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 0b1172cbeccb..cdaca38868e1 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -159,6 +159,7 @@ config PPC select ARCH_KEEP_MEMBLOCK select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO + select ARCH_OBJTOOL_SKIP_ASM select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT select ARCH_SPLIT_ARG64 if PPC32 @@ -257,6 +258,7 @@ config PPC select HAVE_OPTPROBES select HAVE_OBJTOOL if PPC32 || MPROFILE_KERNEL select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL + select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL && PPC_KUAP && PPC32 && CC_IS_GCC select HAVE_PERF_EVENTS select HAVE_PERF_EVENTS_NMI if PPC64 select HAVE_PERF_REGS diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 4e14a5427a63..842d9a6f4b7a 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -34,6 +34,7 @@ static __always_inline void uaccess_begin_32s(unsigned long addr) asm volatile(ASM_MMU_FTR_IFSET( "mfsrin %0, %1;" "rlwinm %0, %0, 0, %2;" + ASM_UACCESS_BEGIN "mtsrin %0, %1;" "isync", "", %3) : "="(tmp) @@ -48,6 +49,7 @@ static __always_inline void uaccess_end_32s(unsigned long addr) asm volatile(ASM_MMU_FTR_IFSET( "mfsrin %0, %1;" "oris %0, %0, %2;" + ASM_UACCESS_END "mtsrin %0, %1;" "isync", "", %3) : "="(tmp) diff 
--git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h index 46bc5925e5fd..38c7ed766445 100644 --- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h @@ -39,13 +39,13 @@ static __always_inline unsigned long __kuap_get_and_assert_locked(void) static __always_inline void uaccess_begin_8xx(unsigned long val) { - asm(ASM_MMU_FTR_IFSET("mtspr %0, %1", "", %2) : : + asm(ASM_UACCESS_BEGIN ASM_MMU_FTR_IFSET("mtspr %0, %1", "", %2) : :
[PATCH v4 08/15] objtool: Track general purpose register used for switch table base
A function can contain nested switch tables using different registers as base address. In order to avoid failures in tracking those switch tables, the register containing the base address needs to be taken into account.

To do so, add a 5-bit field in struct instruction that holds the ID of the register containing the base address of the switch table, and take that register into account during the backward search so as not to stop the walk when encountering a jump related to another switch table. On architectures not handling it, the ID stays zero and has no impact on the search.

To enable that, also provide arch_find_switch_table() with the dynamic jump instruction the table search relates to.

Also allow prev_insn_same_sec() to be used outside check.c so that architectures can walk backward through instructions to find out which register is used as base address for a switch table.

Signed-off-by: Christophe Leroy --- tools/objtool/arch/powerpc/special.c| 3 ++- tools/objtool/arch/x86/special.c| 3 ++- tools/objtool/check.c | 9 + tools/objtool/include/objtool/check.h | 6 -- tools/objtool/include/objtool/special.h | 3 ++- 5 files changed, 15 insertions(+), 9 deletions(-) diff --git a/tools/objtool/arch/powerpc/special.c b/tools/objtool/arch/powerpc/special.c index d33868147196..a7dd2559b536 100644 --- a/tools/objtool/arch/powerpc/special.c +++ b/tools/objtool/arch/powerpc/special.c @@ -13,7 +13,8 @@ bool arch_support_alt_relocation(struct special_alt *special_alt, } struct reloc *arch_find_switch_table(struct objtool_file *file, - struct instruction *insn) +struct instruction *insn, +struct instruction *orig_insn) { exit(-1); } diff --git a/tools/objtool/arch/x86/special.c b/tools/objtool/arch/x86/special.c index 8e8302fe909f..8cf17d94c69b 100644 --- a/tools/objtool/arch/x86/special.c +++ b/tools/objtool/arch/x86/special.c @@ -86,7 +86,8 @@ bool arch_support_alt_relocation(struct special_alt *special_alt, *NOTE: RETPOLINE made it harder still to decode dynamic jumps.
*/ struct reloc *arch_find_switch_table(struct objtool_file *file, - struct instruction *insn) +struct instruction *insn, +struct instruction *orig_insn) { struct reloc *text_reloc, *rodata_reloc; struct section *table_sec; diff --git a/tools/objtool/check.c b/tools/objtool/check.c index d51f47c4a3bd..be413c578588 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -80,8 +80,8 @@ static struct instruction *next_insn_same_func(struct objtool_file *file, return find_insn(file, func->cfunc->sec, func->cfunc->offset); } -static struct instruction *prev_insn_same_sec(struct objtool_file *file, - struct instruction *insn) +struct instruction *prev_insn_same_sec(struct objtool_file *file, + struct instruction *insn) { if (insn->idx == 0) { if (insn->prev_len) @@ -2064,7 +2064,8 @@ static struct reloc *find_jump_table(struct objtool_file *file, insn && insn_func(insn) && insn_func(insn)->pfunc == func; insn = insn->first_jump_src ?: prev_insn_same_sym(file, insn)) { - if (insn != orig_insn && insn->type == INSN_JUMP_DYNAMIC) + if (insn != orig_insn && insn->type == INSN_JUMP_DYNAMIC && + insn->gpr == orig_insn->gpr) break; /* allow small jumps within the range */ @@ -2074,7 +2075,7 @@ static struct reloc *find_jump_table(struct objtool_file *file, insn->jump_dest->offset > orig_insn->offset)) break; - table_reloc = arch_find_switch_table(file, insn); + table_reloc = arch_find_switch_table(file, insn, orig_insn); if (!table_reloc) continue; diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h index daa46f1f0965..660ea9d0393e 100644 --- a/tools/objtool/include/objtool/check.h +++ b/tools/objtool/include/objtool/check.h @@ -63,8 +63,9 @@ struct instruction { noendbr : 1, unret : 1, visited : 4, - no_reloc: 1; - /* 10 bit hole */ + no_reloc: 1, + gpr : 5; + /* 5 bit hole */ struct alt_group *alt_group; struct instruction *jump_dest; @@ -115,6 +116,7 @@ struct instruction *find_insn(struct objtool_file *file, struct section 
*sec, unsigned long offset); struct instruction *next_insn_same_sec(struct objtool_file *file, struct instruction *insn);
[PATCH v4 14/15] powerpc/bug: Annotate reachable after warning trap
This commit is copied from commit bfb1a7c91fb7 ("x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm") 'twi 31,0,0' is a BUG instruction, which is by default a dead end. But the same instruction is used for WARNINGs and the execution resumes with the following instruction. Mark it reachable so that objtool knows that it is not a dead end in that case. Also change the unreachable() annotation by __builtin_unreachable() since objtool already knows that a BUG instruction is a dead end. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/bug.h | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h index abb608dff15a..1c204ee4cc03 100644 --- a/arch/powerpc/include/asm/bug.h +++ b/arch/powerpc/include/asm/bug.h @@ -4,6 +4,7 @@ #ifdef __KERNEL__ #include +#include #ifdef CONFIG_BUG @@ -51,10 +52,11 @@ ".previous\n" #endif -#define BUG_ENTRY(insn, flags, ...)\ +#define BUG_ENTRY(insn, flags, extra, ...) \ __asm__ __volatile__( \ "1: " insn "\n" \ _EMIT_BUG_ENTRY \ + extra \ : : "i" (__FILE__), "i" (__LINE__), \ "i" (flags), \ "i" (sizeof(struct bug_entry)), \ @@ -67,12 +69,12 @@ */ #define BUG() do { \ - BUG_ENTRY("twi 31, 0, 0", 0); \ - unreachable(); \ + BUG_ENTRY("twi 31, 0, 0", 0, ""); \ + __builtin_unreachable();\ } while (0) #define HAVE_ARCH_BUG -#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags)) +#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags), ASM_REACHABLE) #ifdef CONFIG_PPC64 #define BUG_ON(x) do { \ @@ -80,7 +82,7 @@ if (x) \ BUG(); \ } else {\ - BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "r" ((__force long)(x))); \ + BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "", "r" ((__force long)(x))); \ } \ } while (0) @@ -92,7 +94,7 @@ } else {\ BUG_ENTRY(PPC_TLNEI " %4, 0", \ BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \ - "r" (__ret_warn_on)); \ + "", "r" (__ret_warn_on)); \ } \ unlikely(__ret_warn_on);\ }) -- 2.41.0
[PATCH v4 07/15] objtool: Merge mark_func_jump_tables() and add_func_jump_tables()
Those two functions loop over the instructions of a function. Merge the two loops in order to ease enhancement of table end in a following patch. Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 5a6a87ddbf27..d51f47c4a3bd 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2097,11 +2097,12 @@ static struct reloc *find_jump_table(struct objtool_file *file, * First pass: Mark the head of each jump table so that in the next pass, * we know when a given jump table ends and the next one starts. */ -static void mark_func_jump_tables(struct objtool_file *file, - struct symbol *func) +static int mark_add_func_jump_tables(struct objtool_file *file, +struct symbol *func) { - struct instruction *insn, *last = NULL; + struct instruction *insn, *last = NULL, *insn_t1 = NULL, *insn_t2; struct reloc *reloc; + int ret = 0; func_for_each_insn(file, func, insn) { if (!last) @@ -2127,17 +2128,7 @@ static void mark_func_jump_tables(struct objtool_file *file, reloc = find_jump_table(file, func, insn); if (reloc) insn->_jump_table = reloc; - } -} - -static int add_func_jump_tables(struct objtool_file *file, - struct symbol *func) -{ - struct instruction *insn, *insn_t1 = NULL, *insn_t2; - int ret = 0; - - func_for_each_insn(file, func, insn) { - if (!insn_jump_table(insn)) + else continue; if (!insn_t1) { @@ -2177,8 +2168,7 @@ static int add_jump_table_alts(struct objtool_file *file) if (func->type != STT_FUNC) continue; - mark_func_jump_tables(file, func); - ret = add_func_jump_tables(file, func); + ret = mark_add_func_jump_tables(file, func); if (ret) return ret; } -- 2.41.0
[PATCH v4 04/15] objtool: Fix JUMP_ENTRY_SIZE for bi-arch like powerpc
struct jump_entry { s32 code; s32 target; long key; }; It means that the size of the third argument depends on whether we are building a 32 bits or 64 bits kernel. Therefore JUMP_ENTRY_SIZE must depend on elf_class_addrsize(elf). To allow that, entries[] table must be initialised at runtime. This is easily done by moving it into its only user which is special_get_alts(). Signed-off-by: Christophe Leroy Acked-by: Peter Zijlstra (Intel) --- .../arch/powerpc/include/arch/special.h | 2 +- tools/objtool/special.c | 55 +-- 2 files changed, 28 insertions(+), 29 deletions(-) diff --git a/tools/objtool/arch/powerpc/include/arch/special.h b/tools/objtool/arch/powerpc/include/arch/special.h index ffef9ada7133..b17802dcf436 100644 --- a/tools/objtool/arch/powerpc/include/arch/special.h +++ b/tools/objtool/arch/powerpc/include/arch/special.h @@ -6,7 +6,7 @@ #define EX_ORIG_OFFSET 0 #define EX_NEW_OFFSET 4 -#define JUMP_ENTRY_SIZE 16 +#define JUMP_ENTRY_SIZE (8 + elf_addr_size(elf)) /* 12 on PPC32, 16 on PPC64 */ #define JUMP_ORIG_OFFSET 0 #define JUMP_NEW_OFFSET 4 #define JUMP_KEY_OFFSET 8 diff --git a/tools/objtool/special.c b/tools/objtool/special.c index 91b1950f5bd8..b3f07e8beb85 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -26,34 +26,6 @@ struct special_entry { unsigned char key; /* jump_label key */ }; -static const struct special_entry entries[] = { - { - .sec = ".altinstructions", - .group = true, - .size = ALT_ENTRY_SIZE, - .orig = ALT_ORIG_OFFSET, - .orig_len = ALT_ORIG_LEN_OFFSET, - .new = ALT_NEW_OFFSET, - .new_len = ALT_NEW_LEN_OFFSET, - .feature = ALT_FEATURE_OFFSET, - }, - { - .sec = "__jump_table", - .jump_or_nop = true, - .size = JUMP_ENTRY_SIZE, - .orig = JUMP_ORIG_OFFSET, - .new = JUMP_NEW_OFFSET, - .key = JUMP_KEY_OFFSET, - }, - { - .sec = "__ex_table", - .size = EX_ENTRY_SIZE, - .orig = EX_ORIG_OFFSET, - .new = EX_NEW_OFFSET, - }, - {}, -}; - void __weak arch_handle_alternative(unsigned short feature, struct special_alt *alt) { } @@ 
-144,6 +116,33 @@ int special_get_alts(struct elf *elf, struct list_head *alts) unsigned int nr_entries; struct special_alt *alt; int idx, ret; + const struct special_entry entries[] = { + { + .sec = ".altinstructions", + .group = true, + .size = ALT_ENTRY_SIZE, + .orig = ALT_ORIG_OFFSET, + .orig_len = ALT_ORIG_LEN_OFFSET, + .new = ALT_NEW_OFFSET, + .new_len = ALT_NEW_LEN_OFFSET, + .feature = ALT_FEATURE_OFFSET, + }, + { + .sec = "__jump_table", + .jump_or_nop = true, + .size = JUMP_ENTRY_SIZE, + .orig = JUMP_ORIG_OFFSET, + .new = JUMP_NEW_OFFSET, + .key = JUMP_KEY_OFFSET, + }, + { + .sec = "__ex_table", + .size = EX_ENTRY_SIZE, + .orig = EX_ORIG_OFFSET, + .new = EX_NEW_OFFSET, + }, + {}, + }; INIT_LIST_HEAD(alts); -- 2.41.0
[PATCH v4 02/15] objtool: Move back misplaced comment
A comment was introduced by commit 113d4bc90483 ("objtool: Fix clang switch table edge case") and wrongly moved by commit d871f7b5a6a2 ("objtool: Refactor jump table code to support other architectures") without the piece of code added with the comment in the original commit. Fixes: d871f7b5a6a2 ("objtool: Refactor jump table code to support other architectures") Signed-off-by: Christophe Leroy --- tools/objtool/arch/x86/special.c | 5 - tools/objtool/check.c| 6 ++ 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/tools/objtool/arch/x86/special.c b/tools/objtool/arch/x86/special.c index 29e949579ede..8e8302fe909f 100644 --- a/tools/objtool/arch/x86/special.c +++ b/tools/objtool/arch/x86/special.c @@ -118,11 +118,6 @@ struct reloc *arch_find_switch_table(struct objtool_file *file, strcmp(table_sec->name, C_JUMP_TABLE_SECTION)) return NULL; - /* -* Each table entry has a rela associated with it. The rela -* should reference text in the same function as the original -* instruction. -*/ rodata_reloc = find_reloc_by_dest(file->elf, table_sec, table_offset); if (!rodata_reloc) return NULL; diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 8936a05f0e5a..25f6df4713ed 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2072,6 +2072,12 @@ static struct reloc *find_jump_table(struct objtool_file *file, table_reloc = arch_find_switch_table(file, insn); if (!table_reloc) continue; + + /* +* Each table entry has a rela associated with it. The rela +* should reference text in the same function as the original +* instruction. +*/ dest_insn = find_insn(file, table_reloc->sym->sec, reloc_addend(table_reloc)); if (!dest_insn || !insn_func(dest_insn) || insn_func(dest_insn)->pfunc != func) continue; -- 2.41.0
[PATCH v4 06/15] objtool: Add support for relative switch tables
On powerpc, switch tables are relative, that means the address of the table is added to the value of the entry in order to get the pointed address: (r10 is the table address, r4 the index in the table)

	lis	r10,0		<== Load r10 with upper part of .rodata address
			R_PPC_ADDR16_HA	.rodata
	addi	r10,r10,0	<== Add lower part of .rodata address
			R_PPC_ADDR16_LO	.rodata
	lwzx	r8,r10,r4	<== Read table entry at r10 + r4 into r8
	add	r10,r8,r10	<== Add table address to read value
	mtctr	r10		<== Save calculated address in CTR
	bctr			<== Branch to address in CTR

RELOCATION RECORDS FOR [.rodata]:
OFFSET	TYPE		VALUE
	R_PPC_REL32	.text+0x054c
0004	R_PPC_REL32	.text+0x03d0
...

But for c_jump_tables it is not the case, they contain the pointed address directly:

	lis	r28,0		<== Load r28 with upper .rodata..c_jump_table
			R_PPC_ADDR16_HA	.rodata..c_jump_table
	addi	r28,r28,0	<== Add lower part of .rodata..c_jump_table
			R_PPC_ADDR16_LO	.rodata..c_jump_table
	lwzx	r10,r28,r10	<== Read table entry at r10 + r28 into r10
	mtctr	r10		<== Save read value in CTR
	bctr			<== Branch to address in CTR

RELOCATION RECORDS FOR [.rodata..c_jump_table]:
OFFSET	TYPE		VALUE
	R_PPC_ADDR32	.text+0x0dc8
0004	R_PPC_ADDR32	.text+0x0dc8
...

Add support to objtool for relative tables, based on the relocation type, which is R_PPC_REL32 for switch tables and R_PPC_ADDR32 for C jump tables. Do the comparison using R_ABS32 and R_ABS64, which are architecture agnostic. Also use the correct size for 'long' instead of hard-coding a size of 8.
Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index ae0019412123..5a6a87ddbf27 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1988,7 +1988,7 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, struct symbol *pfunc = insn_func(insn)->pfunc; struct reloc *table = insn_jump_table(insn); struct instruction *dest_insn; - unsigned int prev_offset = 0; + unsigned int offset, prev_offset = 0; struct reloc *reloc = table; struct alternative *alt; @@ -2003,7 +2003,7 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, break; /* Make sure the table entries are consecutive: */ - if (prev_offset && reloc_offset(reloc) != prev_offset + 8) + if (prev_offset && reloc_offset(reloc) != prev_offset + elf_addr_size(file->elf)) break; /* Detect function pointers from contiguous objects: */ @@ -2011,7 +2011,12 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, reloc_addend(reloc) == pfunc->offset) break; - dest_insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc)); + if (reloc_type(reloc) == R_ABS32 || reloc_type(reloc) == R_ABS64) + offset = reloc_addend(reloc); + else + offset = reloc_addend(reloc) + reloc_offset(table) - reloc_offset(reloc); + + dest_insn = find_insn(file, reloc->sym->sec, offset); if (!dest_insn) break; -- 2.41.0
[PATCH v4 12/15] objtool: Add support for more complex UACCESS control
On x86, UACCESS is controlled by two instructions: STAC and CLAC. The STAC instruction enables UACCESS while CLAC disables it. This is simple enough for objtool to locate UACCESS enable and disable. But on powerpc it is a bit more complex: the same instruction is used for enabling and disabling UACCESS, and that instruction can also be used for many other things. Relying exclusively on instruction decoding would be too complex.

To help objtool, mark such instructions in .discard.uaccess_begin and .discard.uaccess_end sections, on the same principle as for reachable/unreachable instructions. And add ASM_UACCESS_BEGIN and ASM_UACCESS_END macros to be used in inline assembly code to annotate UACCESS enable and UACCESS disable instructions.

Signed-off-by: Christophe Leroy --- include/linux/objtool.h | 14 ++ tools/objtool/check.c | 33 + 2 files changed, 47 insertions(+) diff --git a/include/linux/objtool.h b/include/linux/objtool.h index 03f82c2c2ebf..d8fde4158a40 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -57,6 +57,18 @@ ".long 998b - .\n\t"\ ".popsection\n\t" +#define ASM_UACCESS_BEGIN \ + "998:\n\t" \ + ".pushsection .discard.uaccess_begin\n\t" \ + ".long 998b - .\n\t"\ + ".popsection\n\t" + +#define ASM_UACCESS_END \ + "998:\n\t" \ + ".pushsection .discard.uaccess_end\n\t" \ + ".long 998b - .\n\t"\ + ".popsection\n\t" + #else /* __ASSEMBLY__ */ /* @@ -156,6 +168,8 @@ #define STACK_FRAME_NON_STANDARD_FP(func) #define ANNOTATE_NOENDBR #define ASM_REACHABLE +#define ASM_UACCESS_BEGIN +#define ASM_UACCESS_END #else #define ANNOTATE_INTRA_FUNCTION_CALL .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0 diff --git a/tools/objtool/check.c b/tools/objtool/check.c index d2a0dfec5909..5af6c6c3fbed 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1052,6 +1052,38 @@ static void add_ignores(struct objtool_file *file) } } +static void __add_uaccess(struct objtool_file *file, const char *name, + int type, const char *action) +{ +
struct section *rsec; + struct reloc *reloc; + struct instruction *insn; + + rsec = find_section_by_name(file->elf, name); + if (!rsec) + return; + + for_each_reloc(rsec, reloc) { + if (reloc->sym->type != STT_SECTION) { + WARN("unexpected relocation symbol type in %s: ", rsec->name); + continue; + } + insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc)); + if (!insn) { + WARN("can't find UACCESS %s insn at %s+0x%" PRIx64, +action, reloc->sym->sec->name, reloc_addend(reloc)); + continue; + } + insn->type = type; + } +} + +static void add_uaccess(struct objtool_file *file) +{ + __add_uaccess(file, ".rela.discard.uaccess_begin", INSN_STAC, "enable"); + __add_uaccess(file, ".rela.discard.uaccess_end", INSN_CLAC, "disable"); +} + /* * This is a whitelist of functions that is allowed to be called with AC set. * The list is meant to be minimal and only contains compiler instrumentation @@ -2597,6 +2629,7 @@ static int decode_sections(struct objtool_file *file) return ret; add_ignores(file); + add_uaccess(file); add_uaccess_safe(file); ret = add_ignore_alternatives(file); -- 2.41.0
[PATCH v4 11/15] objtool: .rodata.cst{2/4/8/16} are not switch tables
Exclude sections named .rodata.cst2 .rodata.cst4 .rodata.cst8 .rodata.cst16 as they won't contain switch tables. Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index ea0945f2195f..d2a0dfec5909 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2565,7 +2565,8 @@ static void mark_rodata(struct objtool_file *file) */ for_each_sec(file, sec) { if (!strncmp(sec->name, ".rodata", 7) && - !strstr(sec->name, ".str1.")) { + !strstr(sec->name, ".str1.") && + !strstr(sec->name, ".cst")) { sec->rodata = true; found = true; } -- 2.41.0
[PATCH v4 10/15] objtool: When looking for switch tables also follow conditional and dynamic jumps
When walking backward to find the base address of a switch table, also take into account conditional branches and dynamic jumps from a previous switch table. To avoid mis-routing, break when stumbling on a function return.

Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 361c832aefc8..ea0945f2195f 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2034,6 +2034,8 @@ static int add_jump_table(struct objtool_file *file, struct instruction *insn, alt->next = insn->alts; insn->alts = alt; prev_offset = reloc_offset(reloc); + if (!dest_insn->first_jump_src) + dest_insn->first_jump_src = insn; } if (!prev_offset) { @@ -2068,6 +2070,9 @@ static struct reloc *find_jump_table(struct objtool_file *file, insn->gpr == orig_insn->gpr) break; + if (insn->type == INSN_RETURN) + break; + /* allow small jumps within the range */ if (insn->type == INSN_JUMP_UNCONDITIONAL && insn->jump_dest && @@ -2130,8 +2135,7 @@ static int mark_add_func_jump_tables(struct objtool_file *file, * that find_jump_table() can back-track using those and * avoid some potentially confusing code. */ - if (insn->type == INSN_JUMP_UNCONDITIONAL && insn->jump_dest && - insn->offset > last->offset && + if (is_static_jump(insn) && insn->jump_dest && insn->jump_dest->offset > insn->offset && !insn->jump_dest->first_jump_src) {
[PATCH v4 05/15] objtool: Add INSN_RETURN_CONDITIONAL
Most functions have an unconditional return at the end, like this one:

 :
   0:	81 22 04 d0	lwz     r9,1232(r2)
   4:	38 60 00 00	li      r3,0
   8:	2c 09 00 00	cmpwi   r9,0
   c:	4d 82 00 20	beqlr		<== Conditional return
  10:	80 69 00 a0	lwz     r3,160(r9)
  14:	54 63 00 36	clrrwi  r3,r3,4
  18:	68 63 04 00	xori    r3,r3,1024
  1c:	7c 63 00 34	cntlzw  r3,r3
  20:	54 63 d9 7e	srwi    r3,r3,5
  24:	4e 80 00 20	blr		<== Unconditional return

But other functions, like the one below, only have conditional returns:

0028 :
  28:	81 25 00 00	lwz     r9,0(r5)
  2c:	2c 08 00 00	cmpwi   r8,0
  30:	7d 29 30 78	andc    r9,r9,r6
  34:	7d 27 3b 78	or      r7,r9,r7
  38:	54 84 65 3a	rlwinm  r4,r4,12,20,29
  3c:	81 23 00 18	lwz     r9,24(r3)
  40:	41 82 00 58	beq     98
  44:	7d 29 20 2e	lwzx    r9,r9,r4
  48:	55 29 07 3a	rlwinm  r9,r9,0,28,29
  4c:	2c 09 00 0c	cmpwi   r9,12
  50:	41 82 00 08	beq     58
  54:	39 00 00 80	li      r8,128
  58:	2c 08 00 01	cmpwi   r8,1
  5c:	90 e5 00 00	stw     r7,0(r5)
  60:	4d a2 00 20	beqlr+		<== Conditional return
  64:	7c e9 3b 78	mr      r9,r7
  68:	39 40 00 00	li      r10,0
  6c:	39 4a 00 04	addi    r10,r10,4
  70:	7c 0a 40 00	cmpw    r10,r8
  74:	91 25 00 04	stw     r9,4(r5)
  78:	91 25 00 08	stw     r9,8(r5)
  7c:	38 a5 00 10	addi    r5,r5,16
  80:	91 25 ff fc	stw     r9,-4(r5)
  84:	4c 80 00 20	bgelr		<== Conditional return
  88:	55 49 60 26	slwi    r9,r10,12
  8c:	7d 29 3a 14	add     r9,r9,r7
  90:	91 25 00 00	stw     r9,0(r5)
  94:	4b ff ff d8	b       6c
  98:	39 00 00 04	li      r8,4
  9c:	4b ff ff bc	b       58

If conditional returns are decoded as INSN_OTHER, objtool considers that the second function never returns. If conditional returns are decoded as INSN_RETURN, objtool considers that code after a conditional return is dead.

To overcome this situation, introduce INSN_RETURN_CONDITIONAL, which is taken as confirmation that a function is not noreturn while still treating the following code as reachable.
Signed-off-by: Christophe Leroy Acked-by: Peter Zijlstra (Intel) --- tools/objtool/check.c| 2 +- tools/objtool/include/objtool/arch.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 25f6df4713ed..ae0019412123 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -219,7 +219,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func, func_for_each_insn(file, func, insn) { empty = false; - if (insn->type == INSN_RETURN) + if (insn->type == INSN_RETURN || insn->type == INSN_RETURN_CONDITIONAL) return false; } diff --git a/tools/objtool/include/objtool/arch.h b/tools/objtool/include/objtool/arch.h index 2b6d2ce4f9a5..84ba75112934 100644 --- a/tools/objtool/include/objtool/arch.h +++ b/tools/objtool/include/objtool/arch.h @@ -19,6 +19,7 @@ enum insn_type { INSN_CALL, INSN_CALL_DYNAMIC, INSN_RETURN, + INSN_RETURN_CONDITIONAL, INSN_CONTEXT_SWITCH, INSN_BUG, INSN_NOP, -- 2.41.0
[PATCH v4 13/15] objtool: Prepare noreturns.h for more architectures
noreturns.h is a mix of x86-specific functions and more generic core functions. In preparation for including powerpc, split the x86 functions out of noreturns.h into arch/noreturns.h.

Signed-off-by: Christophe Leroy --- .../objtool/arch/x86/include/arch/noreturns.h | 20 +++ tools/objtool/noreturns.h | 14 ++--- 2 files changed, 22 insertions(+), 12 deletions(-) create mode 100644 tools/objtool/arch/x86/include/arch/noreturns.h diff --git a/tools/objtool/arch/x86/include/arch/noreturns.h b/tools/objtool/arch/x86/include/arch/noreturns.h new file mode 100644 index ..a4262aff3917 --- /dev/null +++ b/tools/objtool/arch/x86/include/arch/noreturns.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * This is a (sorted!) list of all known __noreturn functions in arch/x86. + * It's needed for objtool to properly reverse-engineer the control flow graph. + * + * Yes, this is unfortunate. A better solution is in the works. + */ +NORETURN(cpu_bringup_and_idle) +NORETURN(ex_handler_msr_mce) +NORETURN(hlt_play_dead) +NORETURN(hv_ghcb_terminate) +NORETURN(machine_real_restart) +NORETURN(rewind_stack_and_make_dead) +NORETURN(sev_es_terminate) +NORETURN(snp_abort) +NORETURN(x86_64_start_kernel) +NORETURN(x86_64_start_reservations) +NORETURN(xen_cpu_bringup_again) +NORETURN(xen_start_kernel) diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h index e45c7cb1d5bc..b5e0f078dbb6 100644 --- a/tools/objtool/noreturns.h +++ b/tools/objtool/noreturns.h @@ -1,5 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ +#include + /* * This is a (sorted!) list of all known __noreturn functions in the kernel. * It's needed for objtool to properly reverse-engineer the control flow graph.
@@ -14,32 +16,20 @@ NORETURN(__stack_chk_fail) NORETURN(__ubsan_handle_builtin_unreachable) NORETURN(arch_call_rest_init) NORETURN(arch_cpu_idle_dead) -NORETURN(cpu_bringup_and_idle) NORETURN(cpu_startup_entry) NORETURN(do_exit) NORETURN(do_group_exit) NORETURN(do_task_dead) -NORETURN(ex_handler_msr_mce) NORETURN(fortify_panic) -NORETURN(hlt_play_dead) -NORETURN(hv_ghcb_terminate) NORETURN(kthread_complete_and_exit) NORETURN(kthread_exit) NORETURN(kunit_try_catch_throw) -NORETURN(machine_real_restart) NORETURN(make_task_dead) NORETURN(mpt_halt_firmware) NORETURN(nmi_panic_self_stop) NORETURN(panic) NORETURN(panic_smp_self_stop) NORETURN(rest_init) -NORETURN(rewind_stack_and_make_dead) -NORETURN(sev_es_terminate) -NORETURN(snp_abort) NORETURN(start_kernel) NORETURN(stop_this_cpu) NORETURN(usercopy_abort) -NORETURN(x86_64_start_kernel) -NORETURN(x86_64_start_reservations) -NORETURN(xen_cpu_bringup_again) -NORETURN(xen_start_kernel) -- 2.41.0
Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check
On 11.07.23 18:07, Aneesh Kumar K V wrote:
> On 7/11/23 4:06 PM, David Hildenbrand wrote:
>> On 11.07.23 06:48, Aneesh Kumar K.V wrote:
>>> Some architectures would want different restrictions. Hence add an
>>> architecture-specific override. Both the PMD_SIZE check and pageblock
>>> alignment check are moved there.
>>>
>>> Signed-off-by: Aneesh Kumar K.V
>>> ---
>>>  mm/memory_hotplug.c | 17 ++++++++++++-----
>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 1b19462f4e72..07c99b0cc371 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1247,12 +1247,20 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>>>  	return device_online(&mem->dev);
>>>  }
>>>
>>> -static bool mhp_supports_memmap_on_memory(unsigned long size)
>>> +#ifndef arch_supports_memmap_on_memory
>>
>> Can we make that a __weak function instead?
>
> We can. It is confusing because we do have these two patterns within the
> kernel where we use
>
> #ifndef x
> #endif
>
> vs
>
> __weak x
>
> What is the recommended way to override? I have mostly been using
> #ifndef for most of the arch overrides till now.

I think when placing the implementation in a C file, it's __weak. But don't ask me :)

We do this already for arch_get_mappable_range() in mm/memory_hotplug.c and IMHO it looks quite nice.

-- 
Cheers,

David / dhildenb
[PATCH v4 09/15] objtool: Find end of switch table directly
At present, the end of a switch table can only be known once the start of the following switch table has been located. This is a problem when switch tables are nested because, until the first switch table is properly added, the second one cannot be located, as the backward walk will abut on the dynamic switch of the previous one. So perform a first forward walk through the code in order to locate all possible relocations to switch tables and build a local table with those relocations. Later on, once a switch table is found, go through this local table to know where the next switch table starts. Signed-off-by: Christophe Leroy --- tools/objtool/check.c | 62 --- 1 file changed, 46 insertions(+), 16 deletions(-) diff --git a/tools/objtool/check.c b/tools/objtool/check.c index be413c578588..361c832aefc8 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2094,14 +2094,30 @@ static struct reloc *find_jump_table(struct objtool_file *file, return NULL; } +static struct reloc *find_next_table(struct instruction *insn, +struct reloc **table, unsigned int size) +{ + unsigned long offset = reloc_offset(insn_jump_table(insn)); + int i; + struct reloc *reloc = NULL; + + for (i = 0; i < size; i++) { + if (reloc_offset(table[i]) > offset && + (!reloc || reloc_offset(table[i]) < reloc_offset(reloc))) + reloc = table[i]; + } + return reloc; +} + /* * First pass: Mark the head of each jump table so that in the next pass, * we know when a given jump table ends and the next one starts. 
*/ static int mark_add_func_jump_tables(struct objtool_file *file, -struct symbol *func) +struct symbol *func, +struct reloc **table, unsigned int size) { - struct instruction *insn, *last = NULL, *insn_t1 = NULL, *insn_t2; + struct instruction *insn, *last = NULL; struct reloc *reloc; int ret = 0; @@ -2132,23 +2148,11 @@ static int mark_add_func_jump_tables(struct objtool_file *file, else continue; - if (!insn_t1) { - insn_t1 = insn; - continue; - } - - insn_t2 = insn; - - ret = add_jump_table(file, insn_t1, insn_jump_table(insn_t2)); + ret = add_jump_table(file, insn, find_next_table(insn, table, size)); if (ret) return ret; - - insn_t1 = insn_t2; } - if (insn_t1) - ret = add_jump_table(file, insn_t1, NULL); - return ret; } @@ -2161,15 +2165,41 @@ static int add_jump_table_alts(struct objtool_file *file) { struct symbol *func; int ret; + struct instruction *insn; + unsigned int size = 0, i = 0; + struct reloc **table = NULL; if (!file->rodata) return 0; + for_each_insn(file, insn) { + struct instruction *dest_insn; + struct reloc *reloc; + + func = insn_func(insn) ? insn_func(insn)->pfunc : NULL; + reloc = arch_find_switch_table(file, insn, NULL); + /* +* Each table entry has a rela associated with it. The rela +* should reference text in the same function as the original +* instruction. +*/ + if (!reloc) + continue; + dest_insn = find_insn(file, reloc->sym->sec, reloc_addend(reloc)); + if (!dest_insn || !insn_func(dest_insn) || insn_func(dest_insn)->pfunc != func) + continue; + if (i == size) { + size += 1024; + table = realloc(table, size * sizeof(*table)); + } + table[i++] = reloc; + } + for_each_sym(file, func) { if (func->type != STT_FUNC) continue; - ret = mark_add_func_jump_tables(file, func); + ret = mark_add_func_jump_tables(file, func, table, i); if (ret) return ret; } -- 2.41.0
[PATCH v4 03/15] objtool: Allow an architecture to disable objtool on ASM files
Supporting objtool on ASM files requires quite an effort. Features like UACCESS validation don't require ASM files validation. In order to allow architectures to enable objtool validation without spending unnecessary effort on cleaning up ASM files, provide an option to disable objtool validation on ASM files. Suggested-by: Naveen N Rao Signed-off-by: Christophe Leroy --- arch/Kconfig | 5 + scripts/Makefile.build | 4 2 files changed, 9 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index aff2746c8af2..3330ed761260 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -,6 +,11 @@ config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT config HAVE_OBJTOOL bool +config ARCH_OBJTOOL_SKIP_ASM + bool + help + Architecture doesn't support objtool on ASM files + config HAVE_JUMP_LABEL_HACK bool diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 6413342a03f4..5818baddfb27 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -342,7 +342,11 @@ $(obj)/%.s: $(src)/%.S FORCE $(call if_changed_dep,cpp_s_S) quiet_cmd_as_o_S = AS $(quiet_modtag) $@ +ifndef CONFIG_ARCH_OBJTOOL_SKIP_ASM cmd_as_o_S = $(CC) $(a_flags) -c -o $@ $< $(cmd_objtool) +else + cmd_as_o_S = $(CC) $(a_flags) -c -o $@ $< +endif ifdef CONFIG_ASM_MODVERSIONS -- 2.41.0
[PATCH v4 01/15] Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto"
This reverts commit 1e688dd2a3d6759d416616ff07afc4bb836c4213. That commit aimed at optimising the code around generation of WARN_ON/BUG_ON but it leads to a lot of dead code erroneously generated by GCC. That dead code becomes a problem when we start using objtool validation because objtool will abort validation with a warning as soon as it detects unreachable code. This is because unreachable code might be the indication that objtool doesn't properly decode object text. text data bss dec hex filename 9551585 3627834 224376 13403795 cc8693 vmlinux.before 9535281 3628358 224376 13388015 cc48ef vmlinux.after Once this change is reverted, in a standard configuration (pmac32 + function tracer) the text is reduced by 16k, which is around 1.7%. We already had problems with it when starting to use objtool on powerpc as a replacement for recordmcount, see commit 93e3f45a2631 ("powerpc: Fix __WARN_FLAGS() for use with Objtool") There is also a problem with at least GCC 12, on ppc64_defconfig + CONFIG_CC_OPTIMIZE_FOR_SIZE=y + CONFIG_DEBUG_SECTION_MISMATCH=y : LD .tmp_vmlinux.kallsyms1 powerpc64-linux-ld: net/ipv4/tcp_input.o:(__ex_table+0xc4): undefined reference to `.L2136' make[2]: *** [scripts/Makefile.vmlinux:36: vmlinux] Error 1 make[1]: *** [/home/chleroy/linux-powerpc/Makefile:1238: vmlinux] Error 2 Taking into account that other problems are encountered with that 'asm goto' in WARN_ON(), including build failures, keeping that change is not worth it, although it is primarily a compiler bug. Revert it for now. 
Signed-off-by: Christophe Leroy Acked-by: Naveen N Rao --- arch/powerpc/include/asm/book3s/64/kup.h | 2 +- arch/powerpc/include/asm/bug.h | 67 arch/powerpc/kernel/misc_32.S| 2 +- arch/powerpc/kernel/traps.c | 9 +--- 4 files changed, 15 insertions(+), 65 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 497a7bd31ecc..e875cb7e68dc 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -90,7 +90,7 @@ /* Prevent access to userspace using any key values */ LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED) 999: tdne\gpr1, \gpr2 - EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) + EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) #endif .endm diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h index 492530adecc2..abb608dff15a 100644 --- a/arch/powerpc/include/asm/bug.h +++ b/arch/powerpc/include/asm/bug.h @@ -4,14 +4,13 @@ #ifdef __KERNEL__ #include -#include #ifdef CONFIG_BUG #ifdef __ASSEMBLY__ #include #ifdef CONFIG_DEBUG_BUGVERBOSE -.macro __EMIT_BUG_ENTRY addr,file,line,flags +.macro EMIT_BUG_ENTRY addr,file,line,flags .section __bug_table,"aw" 5001: .4byte \addr - . .4byte 5002f - . @@ -23,7 +22,7 @@ .previous .endm #else -.macro __EMIT_BUG_ENTRY addr,file,line,flags +.macro EMIT_BUG_ENTRY addr,file,line,flags .section __bug_table,"aw" 5001: .4byte \addr - . 
.short \flags @@ -32,18 +31,6 @@ .endm #endif /* verbose */ -.macro EMIT_WARN_ENTRY addr,file,line,flags - EX_TABLE(\addr,\addr+4) - __EMIT_BUG_ENTRY \addr,\file,\line,\flags -.endm - -.macro EMIT_BUG_ENTRY addr,file,line,flags - .if \flags & 1 /* BUGFLAG_WARNING */ - .err /* Use EMIT_WARN_ENTRY for warnings */ - .endif - __EMIT_BUG_ENTRY \addr,\file,\line,\flags -.endm - #else /* !__ASSEMBLY__ */ /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and sizeof(struct bug_entry), respectively */ @@ -73,16 +60,6 @@ "i" (sizeof(struct bug_entry)), \ ##__VA_ARGS__) -#define WARN_ENTRY(insn, flags, label, ...)\ - asm_volatile_goto( \ - "1: " insn "\n" \ - EX_TABLE(1b, %l[label]) \ - _EMIT_BUG_ENTRY \ - : : "i" (__FILE__), "i" (__LINE__), \ - "i" (flags), \ - "i" (sizeof(struct bug_entry)), \ - ##__VA_ARGS__ : : label) - /* * BUG_ON() and WARN_ON() do their best to cooperate with compile-time * optimisations. However depending on the complexity of the condition @@ -95,16 +72,7 @@ } while (0) #define HAVE_ARCH_BUG -#define __WARN_FLAGS(flags) do { \ - __label__ __label_warn_on; \ - \ - WARN_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags), __label_warn_on); \ -
[PATCH v4 00/15] powerpc/objtool: uaccess validation for PPC32 (v4)
This series adds UACCESS validation for PPC32. It includes a dozen changes to objtool core. It applies on top of series "Cleanup/Optimise KUAP (v3)" https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=363368=* It is almost mature and performs code analysis for all of PPC32. In this version objtool switch table lookup has been enhanced to handle nested switch tables. Most object files are correctly decoded; only a few 'unreachable instruction' warnings remain, due to more complex functions which include back-and-forth jumps or branches. It made it possible to detect some UACCESS mess in a few files. They've been fixed through other patches. Changes in v4: - Split the series in two parts; the powerpc uaccess rework is submitted separately, see https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=363368=* - Support of UACCESS on all PPC32 including book3s/32 which was missing in v3. - More elaborate switch table lookup. - Patches 2, 7, 8, 9, 10, 11 are new - Patch 11 in series v3 is now removed. 
Changes in v3: - Rebased on top of a merge of powerpc tree and tip/objtool/core tree - Simplified support for relative switch tables based on relocation type - Taken comments from Peter Christophe Leroy (15): Revert "powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto" objtool: Move back misplaced comment objtool: Allow an architecture to disable objtool on ASM files objtool: Fix JUMP_ENTRY_SIZE for bi-arch like powerpc objtool: Add INSN_RETURN_CONDITIONAL objtool: Add support for relative switch tables objtool: Merge mark_func_jump_tables() and add_func_jump_tables() objtool: Track general purpose register used for switch table base objtool: Find end of switch table directly objtool: When looking for switch tables also follow conditional and dynamic jumps objtool: .rodata.cst{2/4/8/16} are not switch tables objtool: Add support for more complex UACCESS control objtool: Prepare noreturns.h for more architectures powerpc/bug: Annotate reachable after warning trap powerpc: Implement UACCESS validation on PPC32 arch/Kconfig | 5 + arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/book3s/32/kup.h | 2 + arch/powerpc/include/asm/book3s/64/kup.h | 2 +- arch/powerpc/include/asm/bug.h| 77 ++--- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 4 +- arch/powerpc/include/asm/nohash/kup-booke.h | 4 +- arch/powerpc/kernel/misc_32.S | 2 +- arch/powerpc/kernel/traps.c | 9 +- arch/powerpc/kexec/core_32.c | 4 +- arch/powerpc/mm/nohash/kup.c | 2 + include/linux/objtool.h | 14 ++ scripts/Makefile.build| 4 + tools/objtool/arch/powerpc/decode.c | 155 +- .../arch/powerpc/include/arch/noreturns.h | 11 ++ .../arch/powerpc/include/arch/special.h | 2 +- tools/objtool/arch/powerpc/special.c | 39 - .../objtool/arch/x86/include/arch/noreturns.h | 20 +++ tools/objtool/arch/x86/special.c | 8 +- tools/objtool/check.c | 154 - tools/objtool/include/objtool/arch.h | 1 + tools/objtool/include/objtool/check.h | 6 +- tools/objtool/include/objtool/special.h | 3 +- 
tools/objtool/noreturns.h | 14 +- tools/objtool/special.c | 55 +++ 25 files changed, 425 insertions(+), 174 deletions(-) create mode 100644 tools/objtool/arch/powerpc/include/arch/noreturns.h create mode 100644 tools/objtool/arch/x86/include/arch/noreturns.h -- 2.41.0
Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check
On 7/11/23 4:06 PM, David Hildenbrand wrote: > On 11.07.23 06:48, Aneesh Kumar K.V wrote: >> Some architectures would want different restrictions. Hence add an >> architecture-specific override. >> >> Both the PMD_SIZE check and pageblock alignment check are moved there. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> mm/memory_hotplug.c | 17 - >> 1 file changed, 12 insertions(+), 5 deletions(-) >> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >> index 1b19462f4e72..07c99b0cc371 100644 >> --- a/mm/memory_hotplug.c >> +++ b/mm/memory_hotplug.c >> @@ -1247,12 +1247,20 @@ static int online_memory_block(struct memory_block >> *mem, void *arg) >> return device_online(&mem->dev); >> } >> -static bool mhp_supports_memmap_on_memory(unsigned long size) >> +#ifndef arch_supports_memmap_on_memory > > Can we make that a __weak function instead? We can. It is confusing because we do have these two patterns within the kernel where we use #ifndef x #endif vs __weak x What is the recommended way to override ? I have mostly been using #ifndef for most of the arch overrides till now. > >> +static inline bool arch_supports_memmap_on_memory(unsigned long size) >> { >> - unsigned long nr_vmemmap_pages = size / PAGE_SIZE; >> + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; >> unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); >> unsigned long remaining_size = size - vmemmap_size; >> + return IS_ALIGNED(vmemmap_size, PMD_SIZE) && >> + IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); > > You're moving that check back to mhp_supports_memmap_on_memory() in the > following patch, where it actually belongs. So this check should stay in > mhp_supports_memmap_on_memory(). Might be reasonable to factor out the > vmemmap_size calculation. > > > Also, let's add a comment > > /* > * As default, we want the vmemmap to span a complete PMD such that we > * can map the vmemmap using a single PMD if supported by the > * architecture. 
> */ > return IS_ALIGNED(vmemmap_size, PMD_SIZE); > >> +} >> +#endif >> + >> +static bool mhp_supports_memmap_on_memory(unsigned long size) >> +{ >> /* >> * Besides having arch support and the feature enabled at runtime, we >> * need a few more assumptions to hold true: >> @@ -1280,9 +1288,8 @@ static bool mhp_supports_memmap_on_memory(unsigned >> long size) >> * populate a single PMD. >> */ >> return mhp_memmap_on_memory() && >> - size == memory_block_size_bytes() && >> - IS_ALIGNED(vmemmap_size, PMD_SIZE) && >> - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); >> + size == memory_block_size_bytes() && > > If you keep the properly aligned indentation, this will not be detected as a > change by git. > >> + arch_supports_memmap_on_memory(size); >> } >> /* > Will update the code based on the above feedback. -aneesh
Re: [PATCH 00/17] fbdev: Remove FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT flags
Hi Helge, On Tue, Jul 11, 2023 at 5:26 PM Helge Deller wrote: > On 7/11/23 16:47, Sam Ravnborg wrote: > > On Tue, Jul 11, 2023 at 08:24:40AM +0200, Thomas Zimmermann wrote: > >> On 10.07.23 at 19:19, Sam Ravnborg wrote: > >>> On Mon, Jul 10, 2023 at 02:50:04PM +0200, Thomas Zimmermann wrote: > Remove the unused flags FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT from > fbdev and drivers, as briefly discussed at [1]. Both flags were maybe > useful when fbdev had special handling for driver modules. With > commit 376b3ff54c9a ("fbdev: Nuke FBINFO_MODULE"), they are both 0 > and have no further effect. > > Patches 1 to 7 remove FBINFO_DEFAULT from drivers. Patches 2 to 5 > split this by the way the fb_info struct is being allocated. All flags > are cleared to zero during the allocation. > > Patches 8 to 16 do the same for FBINFO_FLAG_DEFAULT. Patch 8 fixes > an actual bug in how arch/sh uses the token for struct fb_videomode, > which is unrelated. > > Patch 17 removes both flag constants from > >>> > >>> We have a few more flags that are unused - should they be nuked too? > >>> FBINFO_HWACCEL_FILLRECT > >>> FBINFO_HWACCEL_ROTATE > >>> FBINFO_HWACCEL_XPAN > >> > >> It seems those are there for completeness. Nothing sets _ROTATE, > > I think some fbdev drivers had hardware acceleration for ROTATE in the > past. HWACCEL_XPAN is still in some drivers. > > >> the others are simply never checked. According to the comments, > >> some are required, some are optional. I don't know what that > >> means. > > I think it's OK if you remove those flags which aren't used anywhere, > e.g. FBINFO_HWACCEL_ROTATE. Indeed. > >> IIRC there were complaints about performance when Daniel tried to remove > >> fbcon acceleration, so not all _HWACCEL_ flags are unneeded. > > Correct. 
I think COPYAREA and FILLRECT are the bare minimum to accelerate > fbcon, IMAGEBLIT is for showing the tux penguin (?), > XPAN/YPAN and YWRAP for some hardware screen panning needed by some drivers > (not sure if this is still used as I don't have such hardware, Geert?). Yes, they are used. Anything that is handled in drivers/video/fbdev/core/ is used: $ git grep HWACCEL_ -- drivers/video/fbdev/core/ drivers/video/fbdev/core/fbcon.c: if ((info->flags & FBINFO_HWACCEL_COPYAREA) && drivers/video/fbdev/core/fbcon.c: !(info->flags & FBINFO_HWACCEL_DISABLED)) drivers/video/fbdev/core/fbcon.c: int good_pan = (cap & FBINFO_HWACCEL_YPAN) && drivers/video/fbdev/core/fbcon.c: int good_wrap = (cap & FBINFO_HWACCEL_YWRAP) && drivers/video/fbdev/core/fbcon.c: int fast_copyarea = (cap & FBINFO_HWACCEL_COPYAREA) && drivers/video/fbdev/core/fbcon.c: !(cap & FBINFO_HWACCEL_DISABLED); drivers/video/fbdev/core/fbcon.c: int fast_imageblit = (cap & FBINFO_HWACCEL_IMAGEBLIT) && drivers/video/fbdev/core/fbcon.c: !(cap & FBINFO_HWACCEL_DISABLED); BTW, I'm surprised FBINFO_HWACCEL_FILLRECT is not handled. But looking at the full history, it never was... > >> Leaving them in for reference/completeness might be an option; or not. I > >> have no strong feelings about those flags. > > I'd say drop FBINFO_HWACCEL_ROTATE at least ? Agreed. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
[PATCH v3 8/9] powerpc/kuap: KUAP enabling/disabling functions must be __always_inline
Objtool reports the following warnings: arch/powerpc/kernel/signal_32.o: warning: objtool: __prevent_user_access.constprop.0+0x4 (.text+0x4): redundant UACCESS disable arch/powerpc/kernel/signal_32.o: warning: objtool: user_access_begin+0x2c (.text+0x4c): return with UACCESS enabled arch/powerpc/kernel/signal_32.o: warning: objtool: handle_rt_signal32+0x188 (.text+0x360): call to __prevent_user_access.constprop.0() with UACCESS enabled arch/powerpc/kernel/signal_32.o: warning: objtool: handle_signal32+0x150 (.text+0x4d4): call to __prevent_user_access.constprop.0() with UACCESS enabled This is due to some KUAP enabling/disabling functions being generated out of line, although they are marked inline. Use __always_inline instead. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 18 +++ arch/powerpc/include/asm/book3s/64/kup.h | 23 ++-- arch/powerpc/include/asm/kup.h | 16 +++--- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 20 - arch/powerpc/include/asm/nohash/kup-booke.h | 22 +-- arch/powerpc/include/asm/uaccess.h | 6 ++--- 6 files changed, 53 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 452d4efa84f5..931d200afe56 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -15,19 +15,19 @@ #define KUAP_NONE (~0UL) -static inline void kuap_lock_one(unsigned long addr) +static __always_inline void kuap_lock_one(unsigned long addr) { mtsr(mfsr(addr) | SR_KS, addr); isync();/* Context sync required after mtsr() */ } -static inline void kuap_unlock_one(unsigned long addr) +static __always_inline void kuap_unlock_one(unsigned long addr) { mtsr(mfsr(addr) & ~SR_KS, addr); isync();/* Context sync required after mtsr() */ } -static inline void __kuap_save_and_lock(struct pt_regs *regs) +static __always_inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -40,11 +40,11 @@ static inline 
void __kuap_save_and_lock(struct pt_regs *regs) } #define __kuap_save_and_lock __kuap_save_and_lock -static inline void kuap_user_restore(struct pt_regs *regs) +static __always_inline void kuap_user_restore(struct pt_regs *regs) { } -static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kuap) +static __always_inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kuap) { if (unlikely(kuap != KUAP_NONE)) { current->thread.kuap = KUAP_NONE; @@ -59,7 +59,7 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua kuap_unlock_one(regs->kuap); } -static inline unsigned long __kuap_get_and_assert_locked(void) +static __always_inline unsigned long __kuap_get_and_assert_locked(void) { unsigned long kuap = current->thread.kuap; @@ -94,7 +94,7 @@ static __always_inline void __prevent_user_access(unsigned long dir) kuap_lock_one(kuap); } -static inline unsigned long __prevent_user_access_return(void) +static __always_inline unsigned long __prevent_user_access_return(void) { unsigned long flags = current->thread.kuap; @@ -106,7 +106,7 @@ static inline unsigned long __prevent_user_access_return(void) return flags; } -static inline void __restore_user_access(unsigned long flags) +static __always_inline void __restore_user_access(unsigned long flags) { if (flags != KUAP_NONE) { current->thread.kuap = flags; @@ -114,7 +114,7 @@ static inline void __restore_user_access(unsigned long flags) } } -static inline bool +static __always_inline bool __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) { unsigned long kuap = regs->kuap; diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index a014f4d9a2aa..497a7bd31ecc 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -213,14 +213,14 @@ extern u64 __ro_after_init default_iamr; * access restrictions. 
Because of this ignore AMR value when accessing * userspace via kernel thread. */ -static inline u64 current_thread_amr(void) +static __always_inline u64 current_thread_amr(void) { if (current->thread.regs) return current->thread.regs->amr; return default_amr; } -static inline u64 current_thread_iamr(void) +static __always_inline u64 current_thread_iamr(void) { if (current->thread.regs) return current->thread.regs->iamr; @@ -230,7 +230,7 @@ static inline u64 current_thread_iamr(void) #ifdef CONFIG_PPC_KUAP -static inline void kuap_user_restore(struct pt_regs *regs) +static __always_inline void
[PATCH v3 9/9] powerpc/kuap: Use ASM feature fixups instead of static branches
To avoid a useless nop on top of every uaccess enable/disable and make life easier for objtool, replace static branches by ASM feature fixups that will nop KUAP enabling instructions out in the unlikely case KUAP is disabled at boot time. Leave it as is on book3s/64 for now; it will be handled later when objtool is activated on PPC64. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 46 arch/powerpc/include/asm/kup.h | 45 +++ arch/powerpc/include/asm/nohash/32/kup-8xx.h | 30 + arch/powerpc/include/asm/nohash/kup-booke.h | 38 +--- arch/powerpc/mm/nohash/kup.c | 2 +- 5 files changed, 87 insertions(+), 74 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 931d200afe56..4e14a5427a63 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -27,6 +27,34 @@ static __always_inline void kuap_unlock_one(unsigned long addr) isync();/* Context sync required after mtsr() */ } +static __always_inline void uaccess_begin_32s(unsigned long addr) +{ + unsigned long tmp; + + asm volatile(ASM_MMU_FTR_IFSET( + "mfsrin %0, %1;" + "rlwinm %0, %0, 0, %2;" + "mtsrin %0, %1;" + "isync", "", %3) + : "=&r"(tmp) + : "r"(addr), "i"(~SR_KS), "i"(MMU_FTR_KUAP) + : "memory"); +} + +static __always_inline void uaccess_end_32s(unsigned long addr) +{ + unsigned long tmp; + + asm volatile(ASM_MMU_FTR_IFSET( + "mfsrin %0, %1;" + "oris %0, %0, %2;" + "mtsrin %0, %1;" + "isync", "", %3) + : "=&r"(tmp) + : "r"(addr), "i"(SR_KS >> 16), "i"(MMU_FTR_KUAP) + : "memory"); +} + static __always_inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -69,8 +97,8 @@ static __always_inline unsigned long __kuap_get_and_assert_locked(void) } #define __kuap_get_and_assert_locked __kuap_get_and_assert_locked -static __always_inline void __allow_user_access(void __user *to, const void __user *from, - u32 size, unsigned long dir) +static 
__always_inline void allow_user_access(void __user *to, const void __user *from, + u32 size, unsigned long dir) { BUILD_BUG_ON(!__builtin_constant_p(dir)); @@ -78,10 +106,10 @@ static __always_inline void __allow_user_access(void __user *to, const void __us return; current->thread.kuap = (__force u32)to; - kuap_unlock_one((__force u32)to); + uaccess_begin_32s((__force u32)to); } -static __always_inline void __prevent_user_access(unsigned long dir) +static __always_inline void prevent_user_access(unsigned long dir) { u32 kuap = current->thread.kuap; @@ -91,26 +119,26 @@ static __always_inline void __prevent_user_access(unsigned long dir) return; current->thread.kuap = KUAP_NONE; - kuap_lock_one(kuap); + uaccess_end_32s(kuap); } -static __always_inline unsigned long __prevent_user_access_return(void) +static __always_inline unsigned long prevent_user_access_return(void) { unsigned long flags = current->thread.kuap; if (flags != KUAP_NONE) { current->thread.kuap = KUAP_NONE; - kuap_lock_one(flags); + uaccess_end_32s(flags); } return flags; } -static __always_inline void __restore_user_access(unsigned long flags) +static __always_inline void restore_user_access(unsigned long flags) { if (flags != KUAP_NONE) { current->thread.kuap = flags; - kuap_unlock_one(flags); + uaccess_begin_32s(flags); } } diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h index 77adb9cd2da5..ad7e8c5aec3f 100644 --- a/arch/powerpc/include/asm/kup.h +++ b/arch/powerpc/include/asm/kup.h @@ -72,11 +72,11 @@ static __always_inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned * platforms. 
*/ #ifndef CONFIG_PPC_BOOK3S_64 -static __always_inline void __allow_user_access(void __user *to, const void __user *from, - unsigned long size, unsigned long dir) { } -static __always_inline void __prevent_user_access(unsigned long dir) { } -static __always_inline unsigned long __prevent_user_access_return(void) { return 0UL; } -static __always_inline void __restore_user_access(unsigned long flags) { } +static __always_inline void allow_user_access(void __user *to, const void __user *from, + unsigned long size, unsigned long dir) { } +static __always_inline void
[PATCH v3 5/9] powerpc/kuap: MMU_FTR_BOOK3S_KUAP becomes MMU_FTR_KUAP
In order to reuse MMU_FTR_BOOK3S_KUAP for other targets than BOOK3S, rename it MMU_FTR_KUAP. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/64/hash-pkey.h | 2 +- arch/powerpc/include/asm/book3s/64/kup.h | 18 +- arch/powerpc/include/asm/mmu.h | 4 ++-- arch/powerpc/kernel/syscall.c | 2 +- arch/powerpc/mm/book3s64/pkeys.c | 2 +- 5 files changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-pkey.h b/arch/powerpc/include/asm/book3s/64/hash-pkey.h index f1e60d579f6c..6c5564c4fae4 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-pkey.h +++ b/arch/powerpc/include/asm/book3s/64/hash-pkey.h @@ -24,7 +24,7 @@ static inline u64 pte_to_hpte_pkey_bits(u64 pteflags, unsigned long flags) ((pteflags & H_PTE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | ((pteflags & H_PTE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL)); - if (mmu_has_feature(MMU_FTR_BOOK3S_KUAP) || + if (mmu_has_feature(MMU_FTR_KUAP) || mmu_has_feature(MMU_FTR_BOOK3S_KUEP)) { if ((pte_pkey == 0) && (flags & HPTE_USE_KERNEL_KEY)) return HASH_DEFAULT_KERNEL_KEY; diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 2a7bd3ecc556..72fc4263ed26 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -31,7 +31,7 @@ mfspr \gpr2, SPRN_AMR cmpd\gpr1, \gpr2 beq 99f - END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_BOOK3S_KUAP, 68) + END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_KUAP, 68) isync mtspr SPRN_AMR, \gpr1 @@ -78,7 +78,7 @@ * No need to restore IAMR when returning to kernel space. 
*/ 100: - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67) + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) #endif .endm @@ -91,7 +91,7 @@ LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED) 999: tdne\gpr1, \gpr2 EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | BUGFLAG_ONCE) - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67) + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 67) #endif .endm #endif @@ -130,7 +130,7 @@ */ BEGIN_MMU_FTR_SECTION_NESTED(68) b 100f // skip_save_amr - END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_BOOK3S_KUAP, 68) + END_MMU_FTR_SECTION_NESTED_IFCLR(MMU_FTR_PKEY | MMU_FTR_KUAP, 68) /* * if pkey is disabled and we are entering from userspace @@ -166,7 +166,7 @@ mtspr SPRN_AMR, \gpr2 isync 102: - END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 69) + END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_KUAP, 69) /* * if entering from kernel we don't need save IAMR @@ -232,7 +232,7 @@ static inline u64 current_thread_iamr(void) static __always_inline bool kuap_is_disabled(void) { - return !mmu_has_feature(MMU_FTR_BOOK3S_KUAP); + return !mmu_has_feature(MMU_FTR_KUAP); } static inline void kuap_user_restore(struct pt_regs *regs) @@ -243,7 +243,7 @@ static inline void kuap_user_restore(struct pt_regs *regs) if (!mmu_has_feature(MMU_FTR_PKEY)) return; - if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) { + if (!mmu_has_feature(MMU_FTR_KUAP)) { amr = mfspr(SPRN_AMR); if (amr != regs->amr) restore_amr = true; @@ -317,7 +317,7 @@ static inline unsigned long get_kuap(void) * This has no effect in terms of actually blocking things on hash, * so it doesn't break anything. 
*/ - if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) + if (!mmu_has_feature(MMU_FTR_KUAP)) return AMR_KUAP_BLOCKED; return mfspr(SPRN_AMR); @@ -325,7 +325,7 @@ static inline unsigned long get_kuap(void) static __always_inline void set_kuap(unsigned long value) { - if (!mmu_has_feature(MMU_FTR_BOOK3S_KUAP)) + if (!mmu_has_feature(MMU_FTR_KUAP)) return; /* diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h index 94b981152667..82af2e2c5eca 100644 --- a/arch/powerpc/include/asm/mmu.h +++ b/arch/powerpc/include/asm/mmu.h @@ -33,7 +33,7 @@ * key 0 controlling userspace addresses on radix * Key 3 on hash */ -#define MMU_FTR_BOOK3S_KUAPASM_CONST(0x0200) +#define MMU_FTR_KUAP ASM_CONST(0x0200) /* * Supports KUEP feature @@ -188,7 +188,7 @@ enum { #endif /* CONFIG_PPC_RADIX_MMU */ #endif #ifdef CONFIG_PPC_KUAP - MMU_FTR_BOOK3S_KUAP | + MMU_FTR_KUAP | #endif /* CONFIG_PPC_KUAP */ #ifdef CONFIG_PPC_MEM_KEYS MMU_FTR_PKEY | diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c index
[PATCH v3 7/9] powerpc/kuap: Simplify KUAP lock/unlock on BOOK3S/32
On book3s/32 KUAP is performed at segment level. At the moment, when enabling userspace access, only the current segment is modified. Then if a write is performed on another user segment, a fault is taken and all other user segments get enabled for userspace access. This then requires special attention when disabling userspace access. Having a userspace write access cross a segment boundary is unlikely. Having a userspace write access cross a segment boundary back and forth is even more unlikely. So, instead of enabling userspace access on all segments when a write fault occurs, just change which segment has userspace access enabled, in order to eliminate the case where more than one segment has userspace access enabled. That simplifies userspace access deactivation. There is however a corner case which is even more unlikely but has to be handled anyway: an unaligned access crossing a segment boundary. That would definitely require userspace access enabled on at least the two segments involved. To avoid complicating the likely case for such an unlikely event, handle that situation like an alignment exception and emulate the store.
Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 65 +++- arch/powerpc/include/asm/bug.h | 1 + arch/powerpc/kernel/traps.c | 2 +- arch/powerpc/mm/book3s32/kuap.c | 15 +- 4 files changed, 23 insertions(+), 60 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 4ca6122ef0e1..452d4efa84f5 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -14,7 +14,6 @@ #include #define KUAP_NONE (~0UL) -#define KUAP_ALL (~1UL) static inline void kuap_lock_one(unsigned long addr) { @@ -28,41 +27,6 @@ static inline void kuap_unlock_one(unsigned long addr) isync();/* Context sync required after mtsr() */ } -static inline void kuap_lock_all(void) -{ - update_user_segments(mfsr(0) | SR_KS); - isync();/* Context sync required after mtsr() */ -} - -static inline void kuap_unlock_all(void) -{ - update_user_segments(mfsr(0) & ~SR_KS); - isync();/* Context sync required after mtsr() */ -} - -void kuap_lock_all_ool(void); -void kuap_unlock_all_ool(void); - -static inline void kuap_lock_addr(unsigned long addr, bool ool) -{ - if (likely(addr != KUAP_ALL)) - kuap_lock_one(addr); - else if (!ool) - kuap_lock_all(); - else - kuap_lock_all_ool(); -} - -static inline void kuap_unlock(unsigned long addr, bool ool) -{ - if (likely(addr != KUAP_ALL)) - kuap_unlock_one(addr); - else if (!ool) - kuap_unlock_all(); - else - kuap_unlock_all_ool(); -} - static inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -72,7 +36,7 @@ static inline void __kuap_save_and_lock(struct pt_regs *regs) return; current->thread.kuap = KUAP_NONE; - kuap_lock_addr(kuap, false); + kuap_lock_one(kuap); } #define __kuap_save_and_lock __kuap_save_and_lock @@ -84,7 +48,7 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua { if (unlikely(kuap != KUAP_NONE)) { current->thread.kuap = KUAP_NONE; - 
kuap_lock_addr(kuap, false); + kuap_lock_one(kuap); } if (likely(regs->kuap == KUAP_NONE)) @@ -92,7 +56,7 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua current->thread.kuap = regs->kuap; - kuap_unlock(regs->kuap, false); + kuap_unlock_one(regs->kuap); } static inline unsigned long __kuap_get_and_assert_locked(void) @@ -127,7 +91,7 @@ static __always_inline void __prevent_user_access(unsigned long dir) return; current->thread.kuap = KUAP_NONE; - kuap_lock_addr(kuap, true); + kuap_lock_one(kuap); } static inline unsigned long __prevent_user_access_return(void) @@ -136,7 +100,7 @@ static inline unsigned long __prevent_user_access_return(void) if (flags != KUAP_NONE) { current->thread.kuap = KUAP_NONE; - kuap_lock_addr(flags, true); + kuap_lock_one(flags); } return flags; @@ -146,7 +110,7 @@ static inline void __restore_user_access(unsigned long flags) { if (flags != KUAP_NONE) { current->thread.kuap = flags; - kuap_unlock(flags, true); + kuap_unlock_one(flags); } } @@ -155,14 +119,23 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) { unsigned long kuap = regs->kuap; - if (!is_write || kuap == KUAP_ALL) + if (!is_write) return
[PATCH v3 6/9] powerpc/kuap: Use MMU_FTR_KUAP on all and refactor disabling kuap
All but book3s/64 use a static branch key for disabling kuap. book3s/64 uses an mmu feature. Refactor all targets to use MMU_FTR_KUAP like book3s/64. For PPC32 that implies updating mmu features fixups once KUAP has been initialised. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 9 - arch/powerpc/include/asm/book3s/64/kup.h | 5 - arch/powerpc/include/asm/kup.h | 11 +++ arch/powerpc/include/asm/nohash/32/kup-8xx.h | 9 - arch/powerpc/include/asm/nohash/kup-booke.h | 8 arch/powerpc/kernel/cputable.c | 4 arch/powerpc/mm/book3s32/kuap.c | 5 + arch/powerpc/mm/init_32.c| 2 ++ arch/powerpc/mm/nohash/kup.c | 6 +- 9 files changed, 19 insertions(+), 40 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 0da0dea76c47..4ca6122ef0e1 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -9,10 +9,6 @@ #ifndef __ASSEMBLY__ -#include - -extern struct static_key_false disable_kuap_key; - #ifdef CONFIG_PPC_KUAP #include @@ -20,11 +16,6 @@ extern struct static_key_false disable_kuap_key; #define KUAP_NONE (~0UL) #define KUAP_ALL (~1UL) -static __always_inline bool kuap_is_disabled(void) -{ - return static_branch_unlikely(_kuap_key); -} - static inline void kuap_lock_one(unsigned long addr) { mtsr(mfsr(addr) | SR_KS, addr); diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 72fc4263ed26..a014f4d9a2aa 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -230,11 +230,6 @@ static inline u64 current_thread_iamr(void) #ifdef CONFIG_PPC_KUAP -static __always_inline bool kuap_is_disabled(void) -{ - return !mmu_has_feature(MMU_FTR_KUAP); -} - static inline void kuap_user_restore(struct pt_regs *regs) { bool restore_amr = false, restore_iamr = false; diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h index 
24cde16c4fbe..bab161b609c1 100644 --- a/arch/powerpc/include/asm/kup.h +++ b/arch/powerpc/include/asm/kup.h @@ -6,6 +6,12 @@ #define KUAP_WRITE 2 #define KUAP_READ_WRITE(KUAP_READ | KUAP_WRITE) +#ifndef __ASSEMBLY__ +#include + +static __always_inline bool kuap_is_disabled(void); +#endif + #ifdef CONFIG_PPC_BOOK3S_64 #include #endif @@ -41,6 +47,11 @@ void setup_kuep(bool disabled); #ifdef CONFIG_PPC_KUAP void setup_kuap(bool disabled); + +static __always_inline bool kuap_is_disabled(void) +{ + return !mmu_has_feature(MMU_FTR_KUAP); +} #else static inline void setup_kuap(bool disabled) { } diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h index a372cd822887..d0601859c45a 100644 --- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h @@ -9,17 +9,8 @@ #ifndef __ASSEMBLY__ -#include - #include -extern struct static_key_false disable_kuap_key; - -static __always_inline bool kuap_is_disabled(void) -{ - return static_branch_unlikely(_kuap_key); -} - static inline void __kuap_save_and_lock(struct pt_regs *regs) { regs->kuap = mfspr(SPRN_MD_AP); diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h b/arch/powerpc/include/asm/nohash/kup-booke.h index 71182cbe20c3..8e4734c8fef1 100644 --- a/arch/powerpc/include/asm/nohash/kup-booke.h +++ b/arch/powerpc/include/asm/nohash/kup-booke.h @@ -13,18 +13,10 @@ #else -#include #include #include -extern struct static_key_false disable_kuap_key; - -static __always_inline bool kuap_is_disabled(void) -{ - return static_branch_unlikely(_kuap_key); -} - static inline void __kuap_lock(void) { mtspr(SPRN_PID, 0); diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c index 8a32bffefa5b..e97a0fd0ae90 100644 --- a/arch/powerpc/kernel/cputable.c +++ b/arch/powerpc/kernel/cputable.c @@ -75,6 +75,10 @@ static struct cpu_spec * __init setup_cpu_spec(unsigned long offset, t->cpu_features |= old.cpu_features & 
CPU_FTR_PMAO_BUG; } + /* Set kuap ON at startup, will be disabled later if cmdline has 'nosmap' */ + if (IS_ENABLED(CONFIG_PPC_KUAP) && IS_ENABLED(CONFIG_PPC32)) + t->mmu_features |= MMU_FTR_KUAP; + *PTRRELOC(_cpu_spec) = _cpu_spec; /* diff --git a/arch/powerpc/mm/book3s32/kuap.c b/arch/powerpc/mm/book3s32/kuap.c index 28676cabb005..24c1c686e6b9 100644 --- a/arch/powerpc/mm/book3s32/kuap.c +++ b/arch/powerpc/mm/book3s32/kuap.c @@ -3,9 +3,6 @@ #include #include -struct static_key_false disable_kuap_key; -EXPORT_SYMBOL(disable_kuap_key); - void kuap_lock_all_ool(void) { kuap_lock_all(); @@ -30,7 +27,7 @@ void
[PATCH v3 1/9] powerpc/kuap: Avoid unnecessary reads of MD_AP
A disassembly of interrupt_exit_kernel_prepare() shows a useless read of the MD_AP register. This is shown by r9 being re-used immediately without doing anything with the value read. c000e0e0: 60 00 00 00 nop c000e0e4: ===> 7d 3a c2 a6 mfmd_ap r9 < c000e0e8: 7d 20 00 a6 mfmsr r9 c000e0ec: 7c 51 13 a6 mtspr 81,r2 c000e0f0: 81 3f 00 84 lwz r9,132(r31) c000e0f4: 71 29 80 00 andi. r9,r9,32768 kuap_get_and_assert_locked() is paired with kuap_kernel_restore(), and both are only used in interrupt_exit_kernel_prepare(). The value returned by kuap_get_and_assert_locked() is only used by kuap_kernel_restore(). On 8xx, kuap_kernel_restore() doesn't use the value read by kuap_get_and_assert_locked(), so modify kuap_get_and_assert_locked() to not perform the read of MD_AP and return 0 instead. The same applies on BOOKE. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 8 ++-- arch/powerpc/include/asm/nohash/kup-booke.h | 6 ++ 2 files changed, 4 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h index c44d97751723..8579210f2a6a 100644 --- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h @@ -41,14 +41,10 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua static inline unsigned long __kuap_get_and_assert_locked(void) { - unsigned long kuap; - - kuap = mfspr(SPRN_MD_AP); - if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG)) - WARN_ON_ONCE(kuap >> 16 != MD_APG_KUAP >> 16); + WARN_ON_ONCE(mfspr(SPRN_MD_AP) >> 16 != MD_APG_KUAP >> 16); - return kuap; + return 0; } static inline void __allow_user_access(void __user *to, const void __user *from, diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h b/arch/powerpc/include/asm/nohash/kup-booke.h index 49bb41ed0816..823c5a3a96d8 100644 --- a/arch/powerpc/include/asm/nohash/kup-booke.h +++ b/arch/powerpc/include/asm/nohash/kup-booke.h @@ -58,12 +58,10 @@ static
inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua static inline unsigned long __kuap_get_and_assert_locked(void) { - unsigned long kuap = mfspr(SPRN_PID); - if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG)) - WARN_ON_ONCE(kuap); + WARN_ON_ONCE(mfspr(SPRN_PID)); - return kuap; + return 0; } static inline void __allow_user_access(void __user *to, const void __user *from, -- 2.41.0
[PATCH v3 4/9] powerpc/features: Add capability to update mmu features later
On powerpc32, features fixup is performed very early and that's too early to read the cmdline and take into account 'nosmap' parameter. On the other hand, no userspace access is performed that early and KUAP feature fixup can be performed later. Add a function to update mmu features. The function is passed a mask with the features that can be updated. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/feature-fixups.h | 1 + arch/powerpc/lib/feature-fixups.c | 31 --- 2 files changed, 28 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/feature-fixups.h b/arch/powerpc/include/asm/feature-fixups.h index ac605fc369c4..77824bd289a3 100644 --- a/arch/powerpc/include/asm/feature-fixups.h +++ b/arch/powerpc/include/asm/feature-fixups.h @@ -292,6 +292,7 @@ extern long __start___barrier_nospec_fixup, __stop___barrier_nospec_fixup; extern long __start__btb_flush_fixup, __stop__btb_flush_fixup; void apply_feature_fixups(void); +void update_mmu_feature_fixups(unsigned long mask); void setup_feature_keys(void); #endif diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c index 80def1c2afcb..4f82581ca203 100644 --- a/arch/powerpc/lib/feature-fixups.c +++ b/arch/powerpc/lib/feature-fixups.c @@ -67,7 +67,8 @@ static int patch_alt_instruction(u32 *src, u32 *dest, u32 *alt_start, u32 *alt_e return 0; } -static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) +static int patch_feature_section_mask(unsigned long value, unsigned long mask, + struct fixup_entry *fcur) { u32 *start, *end, *alt_start, *alt_end, *src, *dest; @@ -79,7 +80,7 @@ static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) if ((alt_end - alt_start) > (end - start)) return 1; - if ((value & fcur->mask) == fcur->value) + if ((value & fcur->mask & mask) == (fcur->value & mask)) return 0; src = alt_start; @@ -97,7 +98,8 @@ static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) return 0; } 
-void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) +static void do_feature_fixups_mask(unsigned long value, unsigned long mask, + void *fixup_start, void *fixup_end) { struct fixup_entry *fcur, *fend; @@ -105,7 +107,7 @@ void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) fend = fixup_end; for (; fcur < fend; fcur++) { - if (patch_feature_section(value, fcur)) { + if (patch_feature_section_mask(value, mask, fcur)) { WARN_ON(1); printk("Unable to patch feature section at %p - %p" \ " with %p - %p\n", @@ -117,6 +119,11 @@ void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) } } +void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) +{ + do_feature_fixups_mask(value, ~0, fixup_start, fixup_end); +} + #ifdef CONFIG_PPC_BARRIER_NOSPEC static bool is_fixup_addr_valid(void *dest, size_t size) { @@ -651,6 +658,17 @@ void __init apply_feature_fixups(void) do_final_fixups(); } +void __init update_mmu_feature_fixups(unsigned long mask) +{ + saved_mmu_features &= ~mask; + saved_mmu_features |= cur_cpu_spec->mmu_features & mask; + + do_feature_fixups_mask(cur_cpu_spec->mmu_features, mask, + PTRRELOC(&__start___mmu_ftr_fixup), + PTRRELOC(&__stop___mmu_ftr_fixup)); + mmu_feature_keys_init(); +} + void __init setup_feature_keys(void) { /* @@ -683,6 +701,11 @@ late_initcall(check_features); #define check(x) \ if (!(x)) printk("feature-fixups: test failed at line %d\n", __LINE__); +static int patch_feature_section(unsigned long value, struct fixup_entry *fcur) +{ + return patch_feature_section_mask(value, ~0, fcur); +} + /* This must be after the text it fixes up, vmlinux.lds.S enforces that atm */ static struct fixup_entry fixup; -- 2.41.0
[PATCH v3 2/9] powerpc/kuap: Avoid useless jump_label on empty function
Disassembly of interrupt_enter_prepare() shows a pointless nop before the mftb c000abf0 : c000abf0: 81 23 00 84 lwz r9,132(r3) c000abf4: 71 29 40 00 andi. r9,r9,16384 c000abf8: 41 82 00 28 beq- c000ac20 c000abfc: ===> 60 00 00 00 nop < c000ac00: 7d 0c 42 e6 mftb r8 c000ac04: 80 e2 00 08 lwz r7,8(r2) c000ac08: 81 22 00 28 lwz r9,40(r2) c000ac0c: 91 02 00 24 stw r8,36(r2) c000ac10: 7d 29 38 50 subf r9,r9,r7 c000ac14: 7d 29 42 14 add r9,r9,r8 c000ac18: 91 22 00 08 stw r9,8(r2) c000ac1c: 4e 80 00 20 blr c000ac20: 60 00 00 00 nop c000ac24: 7d 5a c2 a6 mfmd_ap r10 c000ac28: 3d 20 de 00 lis r9,-8704 c000ac2c: 91 43 00 b0 stw r10,176(r3) c000ac30: 7d 3a c3 a6 mtspr 794,r9 c000ac34: 4e 80 00 20 blr That comes from the call to kuap_lock(), although __kuap_lock() is an empty function on the 8xx. To avoid that, only perform the kuap_is_disabled() check when there is something to do with __kuap_lock(). Do the same with __kuap_save_and_lock() and __kuap_get_and_assert_locked(). Signed-off-by: Christophe Leroy Reviewed-by: Nicholas Piggin --- v2: Add back comment about __kuap_lock() not needed on 64s --- arch/powerpc/include/asm/book3s/32/kup.h | 6 ++-- arch/powerpc/include/asm/book3s/64/kup.h | 10 ++ arch/powerpc/include/asm/kup.h | 33 +--- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 11 +++ arch/powerpc/include/asm/nohash/kup-booke.h | 8 +++-- 5 files changed, 29 insertions(+), 39 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 678f9c9d89b6..466a19cfb4df 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -77,10 +77,6 @@ static inline void kuap_unlock(unsigned long addr, bool ool) kuap_unlock_all_ool(); } -static inline void __kuap_lock(void) -{ -} - static inline void __kuap_save_and_lock(struct pt_regs *regs) { unsigned long kuap = current->thread.kuap; @@ -92,6 +88,7 @@ static inline void __kuap_save_and_lock(struct pt_regs *regs) current->thread.kuap = 
KUAP_NONE; kuap_lock_addr(kuap, false); } +#define __kuap_save_and_lock __kuap_save_and_lock static inline void kuap_user_restore(struct pt_regs *regs) { @@ -120,6 +117,7 @@ static inline unsigned long __kuap_get_and_assert_locked(void) return kuap; } +#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked static __always_inline void __allow_user_access(void __user *to, const void __user *from, u32 size, unsigned long dir) diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h index 84c09e546115..2a7bd3ecc556 100644 --- a/arch/powerpc/include/asm/book3s/64/kup.h +++ b/arch/powerpc/include/asm/book3s/64/kup.h @@ -298,15 +298,9 @@ static inline unsigned long __kuap_get_and_assert_locked(void) WARN_ON_ONCE(amr != AMR_KUAP_BLOCKED); return amr; } +#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked -/* Do nothing, book3s/64 does that in ASM */ -static inline void __kuap_lock(void) -{ -} - -static inline void __kuap_save_and_lock(struct pt_regs *regs) -{ -} +/* __kuap_lock() not required, book3s/64 does that in ASM */ /* * We support individually allowing read or write, but we don't support nesting diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h index d751ddd08110..24cde16c4fbe 100644 --- a/arch/powerpc/include/asm/kup.h +++ b/arch/powerpc/include/asm/kup.h @@ -52,16 +52,9 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) return false; } -static inline void __kuap_lock(void) { } -static inline void __kuap_save_and_lock(struct pt_regs *regs) { } static inline void kuap_user_restore(struct pt_regs *regs) { } static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long amr) { } -static inline unsigned long __kuap_get_and_assert_locked(void) -{ - return 0; -} - /* * book3s/64/kup-radix.h defines these functions for the !KUAP case to flush * the L1D cache after user accesses. 
Only include the empty stubs for other @@ -85,29 +78,24 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write) return __bad_kuap_fault(regs, address, is_write); } -static __always_inline void kuap_assert_locked(void) -{ - if (kuap_is_disabled()) - return; - - if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG)) - __kuap_get_and_assert_locked(); -} - static __always_inline void kuap_lock(void) { +#ifdef __kuap_lock if (kuap_is_disabled()) return;
[PATCH v3 3/9] powerpc/kuap: Fold kuep_is_disabled() into its only user
kuep_is_disabled() was introduced by commit 91bb30822a2e ("powerpc/32s: Refactor update of user segment registers") but then all users but one were removed by commit 526d4a4c77ae ("powerpc/32s: Do kuep_lock() and kuep_unlock() in assembly"). Fold kuep_is_disabled() into init_new_context() which is its only user. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/kup.h | 5 - arch/powerpc/mm/book3s32/mmu_context.c | 2 +- 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h index 466a19cfb4df..0da0dea76c47 100644 --- a/arch/powerpc/include/asm/book3s/32/kup.h +++ b/arch/powerpc/include/asm/book3s/32/kup.h @@ -13,11 +13,6 @@ extern struct static_key_false disable_kuap_key; -static __always_inline bool kuep_is_disabled(void) -{ - return !IS_ENABLED(CONFIG_PPC_KUEP); -} - #ifdef CONFIG_PPC_KUAP #include diff --git a/arch/powerpc/mm/book3s32/mmu_context.c b/arch/powerpc/mm/book3s32/mmu_context.c index 269a3eb25a73..1922f9a6b058 100644 --- a/arch/powerpc/mm/book3s32/mmu_context.c +++ b/arch/powerpc/mm/book3s32/mmu_context.c @@ -71,7 +71,7 @@ int init_new_context(struct task_struct *t, struct mm_struct *mm) mm->context.id = __init_new_context(); mm->context.sr0 = CTX_TO_VSID(mm->context.id, 0); - if (!kuep_is_disabled()) + if (IS_ENABLED(CONFIG_PPC_KUEP)) mm->context.sr0 |= SR_NX; if (!kuap_is_disabled()) mm->context.sr0 |= SR_KS; -- 2.41.0
[PATCH v3 0/9] Cleanup/Optimise KUAP (v3)
This series cleans up KUAP a bit in preparation for using objtool to validate UACCESS. There are two main changes in this series: 1/ Simplification of KUAP on book3s/32 2/ Using ASM features on 32 bits and booke as suggested by Nic. Those changes will be required for objtool UACCESS validation, but they are worth it even before that, especially the simplification on 32s. Changes in v3: - Rearranged book3s/32 simplification in order to ease objtool UACCESS check implementation (patches 7 and 9) Christophe Leroy (9): powerpc/kuap: Avoid unnecessary reads of MD_AP powerpc/kuap: Avoid useless jump_label on empty function powerpc/kuap: Fold kuep_is_disabled() into its only user powerpc/features: Add capability to update mmu features later powerpc/kuap: MMU_FTR_BOOK3S_KUAP becomes MMU_FTR_KUAP powerpc/kuap: Use MMU_FTR_KUAP on all and refactor disabling kuap powerpc/kuap: Simplify KUAP lock/unlock on BOOK3S/32 powerpc/kuap: KUAP enabling/disabling functions must be __always_inline powerpc/kuap: Use ASM feature fixups instead of static branches arch/powerpc/include/asm/book3s/32/kup.h | 123 -- .../powerpc/include/asm/book3s/64/hash-pkey.h | 2 +- arch/powerpc/include/asm/book3s/64/kup.h | 54 arch/powerpc/include/asm/bug.h| 1 + arch/powerpc/include/asm/feature-fixups.h | 1 + arch/powerpc/include/asm/kup.h| 91 + arch/powerpc/include/asm/mmu.h| 4 +- arch/powerpc/include/asm/nohash/32/kup-8xx.h | 62 + arch/powerpc/include/asm/nohash/kup-booke.h | 68 +- arch/powerpc/include/asm/uaccess.h| 6 +- arch/powerpc/kernel/cputable.c| 4 + arch/powerpc/kernel/syscall.c | 2 +- arch/powerpc/kernel/traps.c | 2 +- arch/powerpc/lib/feature-fixups.c | 31 - arch/powerpc/mm/book3s32/kuap.c | 20 +-- arch/powerpc/mm/book3s32/mmu_context.c| 2 +- arch/powerpc/mm/book3s64/pkeys.c | 2 +- arch/powerpc/mm/init_32.c | 2 + arch/powerpc/mm/nohash/kup.c | 8 +- 19 files changed, 222 insertions(+), 263 deletions(-) -- 2.41.0
Re: [PATCH v3 2/7] mm/hotplug: Allow memmap on memory hotplug request to fallback
On 7/11/23 3:53 PM, David Hildenbrand wrote: >> -bool mhp_supports_memmap_on_memory(unsigned long size) >> +static bool mhp_supports_memmap_on_memory(unsigned long size) >> { >> unsigned long nr_vmemmap_pages = size / PAGE_SIZE; >> unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); >> @@ -1339,13 +1339,12 @@ int __ref add_memory_resource(int nid, struct >> resource *res, mhp_t mhp_flags) >> * Self hosted memmap array >> */ >> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { >> - if (!mhp_supports_memmap_on_memory(size)) { >> - ret = -EINVAL; >> - goto error; >> + if (mhp_supports_memmap_on_memory(size)) { >> + mhp_altmap.free = PHYS_PFN(size); >> + mhp_altmap.base_pfn = PHYS_PFN(start); >> + params.altmap = _altmap; >> } >> - mhp_altmap.free = PHYS_PFN(size); >> - mhp_altmap.base_pfn = PHYS_PFN(start); >> - params.altmap = _altmap; >> + /* fallback to not using altmap */ >> } >> /* call arch's memory hotadd */ > > In general, LGTM, but please extend the documentation of the parameter in > memory_hotplug.h, stating that this is just a hint and that the core can > decide to no do that. > will update modified include/linux/memory_hotplug.h @@ -97,6 +97,8 @@ typedef int __bitwise mhp_t; * To do so, we will use the beginning of the hot-added range to build * the page tables for the memmap array that describes the entire range. * Only selected architectures support it with SPARSE_VMEMMAP. + * This is only a hint, core kernel can decide to not do this based on + * different alignment checks. */ #define MHP_MEMMAP_ON_MEMORY ((__force mhp_t)BIT(1))
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 7/11/23 9:14 PM, David Hildenbrand wrote: > On 11.07.23 17:40, Aneesh Kumar K V wrote: >> On 7/11/23 8:56 PM, David Hildenbrand wrote: >>> On 11.07.23 06:48, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. Hence we skip the restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also makes the feature widely useful because to use PMD_SIZE vmemmap area we require a memory block size of 2GiB We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can work with a memory block size of 256MB. Using altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with a 256MiB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgtable.h | 28 +++ .../platforms/pseries/hotplug-memory.c | 3 +- mm/memory_hotplug.c | 2 ++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 68817ea7f994..8e6c92dde6ad 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size); bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, unsigned long page_size); +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory +static inline bool arch_supports_memmap_on_memory(unsigned long size) +{ + unsigned long nr_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_pages * sizeof(struct page); + + if (!radix_enabled()) + return false; + +#ifdef CONFIG_PPC_4K_PAGES + return IS_ALIGNED(vmemmap_size, PMD_SIZE); +#else + /* + * Make sure the vmemmap allocation is fully contianed + * so that we always allocate vmemmap memory from altmap area. + * The pageblock alignment requirement is met by using + * reserve blocks in altmap. + */ + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); >>> >>> Can we move that check into common code as well? >>> >>> If our (original) vmemmap size would not fit into a single page, we would >>> be in trouble on any architecture. Did not check if it would be an issue >>> for arm64 as well in case we would allow eventually wasting memory. >>> >> >> >> For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in >> arch_supports_memmap_on_memory(). That should imply PAGE_SIZE alignment. >> If arm64 allow the usage of altmap.reserve, I would expect the >> arch_supports_memmap_on_memory to have the PAGE_SIZE check. >> >> Adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() makes it >> redundant check for x86 and arm currently? > > IMHO not an issue. The common code check is a bit weaker and the arch check a > bit stronger. 
> >> ok will update >> modified mm/memory_hotplug.c >> @@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned >> long size) >> */ >> if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) >> return false; >> + >> + /* >> + * Make sure the vmemmap allocation is fully contianed > > s/contianed/contained/ > >> + * so that we always allocate vmemmap memory from altmap area. > > In theory, it's not only the vmemmap size, but also the vmemmap start (that > it doesn't start somewhere in between a page,
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 11.07.23 17:40, Aneesh Kumar K V wrote: On 7/11/23 8:56 PM, David Hildenbrand wrote: On 11.07.23 06:48, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. Hence we skip the restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also makes the feature widely useful because to use PMD_SIZE vmemmap area we require a memory block size of 2GiB We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can work with a memory block size of 256MB. Using altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with a 256MiB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgtable.h | 28 +++ .../platforms/pseries/hotplug-memory.c | 3 +- mm/memory_hotplug.c | 2 ++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 68817ea7f994..8e6c92dde6ad 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size); bool altmap_cross_boundary(struct 
vmem_altmap *altmap, unsigned long start, unsigned long page_size); +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory +static inline bool arch_supports_memmap_on_memory(unsigned long size) +{ + unsigned long nr_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_pages * sizeof(struct page); + + if (!radix_enabled()) + return false; + +#ifdef CONFIG_PPC_4K_PAGES + return IS_ALIGNED(vmemmap_size, PMD_SIZE); +#else + /* + * Make sure the vmemmap allocation is fully contianed + * so that we always allocate vmemmap memory from altmap area. + * The pageblock alignment requirement is met by using + * reserve blocks in altmap. + */ + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); Can we move that check into common code as well? If our (original) vmemmap size would not fit into a single page, we would be in trouble on any architecture. Did not check if it would be an issue for arm64 as well in case we would allow eventually wasting memory. For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in arch_supports_memmap_on_memory(). That should imply PAGE_SIZE alignment. If arm64 allow the usage of altmap.reserve, I would expect the arch_supports_memmap_on_memory to have the PAGE_SIZE check. Adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() makes it redundant check for x86 and arm currently? IMHO not an issue. The common code check is a bit weaker and the arch check a bit stronger. 
modified mm/memory_hotplug.c @@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) */ if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) return false; + + /* +* Make sure the vmemmap allocation is fully contianed s/contianed/contained/ +* so that we always allocate vmemmap memory from altmap area. In theory, it's not only the vmemmap size, but also the vmemmap start (that it doesn't start somewhere in between a page, crossing a page). I suspect the start is always guaranteed to be aligned (of the vmemmap size is aligned), correct? +*/ + if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE)) + return false; /* * Without page reservation remaining pages should be pageblock aligned. */ -- Cheers, David / dhildenb
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 7/11/23 8:56 PM, David Hildenbrand wrote: > On 11.07.23 06:48, Aneesh Kumar K.V wrote: >> Radix vmemmap mapping can map things correctly at the PMD level or PTE >> level based on different device boundary checks. Hence we skip the >> restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also >> makes the feature widely useful because to use PMD_SIZE vmemmap area we >> require a memory block size of 2GiB >> >> We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature >> can work with a memory block size of 256MB. Using altmap.reserve feature >> to align things correctly at pageblock granularity. We can end up >> losing some pages in memory with this. For ex: with a 256MiB memory block >> size, we require 4 pages to map vmemmap pages, In order to align things >> correctly we end up adding a reserve of 28 pages. ie, for every 4096 >> pages 28 pages get reserved. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/Kconfig | 1 + >> arch/powerpc/include/asm/pgtable.h | 28 +++ >> .../platforms/pseries/hotplug-memory.c | 3 +- >> mm/memory_hotplug.c | 2 ++ >> 4 files changed, 33 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig >> index 116d6add0bb0..f890907e5bbf 100644 >> --- a/arch/powerpc/Kconfig >> +++ b/arch/powerpc/Kconfig >> @@ -157,6 +157,7 @@ config PPC >> select ARCH_HAS_UBSAN_SANITIZE_ALL >> select ARCH_HAVE_NMI_SAFE_CMPXCHG >> select ARCH_KEEP_MEMBLOCK >> + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU >> select ARCH_MIGHT_HAVE_PC_PARPORT >> select ARCH_MIGHT_HAVE_PC_SERIO >> select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX >> diff --git a/arch/powerpc/include/asm/pgtable.h >> b/arch/powerpc/include/asm/pgtable.h >> index 68817ea7f994..8e6c92dde6ad 100644 >> --- a/arch/powerpc/include/asm/pgtable.h >> +++ b/arch/powerpc/include/asm/pgtable.h >> @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) >> int __meminit vmemmap_populated(unsigned long 
vmemmap_addr, int >> vmemmap_map_size); >> bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, >> unsigned long page_size); >> +/* >> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details >> + * some of the restrictions. We don't check for PMD_SIZE because our >> + * vmemmap allocation code can fallback correctly. The pageblock >> + * alignment requirement is met using altmap->reserve blocks. >> + */ >> +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory >> +static inline bool arch_supports_memmap_on_memory(unsigned long size) >> +{ >> + unsigned long nr_pages = size >> PAGE_SHIFT; >> + unsigned long vmemmap_size = nr_pages * sizeof(struct page); >> + >> + if (!radix_enabled()) >> + return false; >> + >> +#ifdef CONFIG_PPC_4K_PAGES >> + return IS_ALIGNED(vmemmap_size, PMD_SIZE); >> +#else >> + /* >> + * Make sure the vmemmap allocation is fully contianed >> + * so that we always allocate vmemmap memory from altmap area. >> + * The pageblock alignment requirement is met by using >> + * reserve blocks in altmap. >> + */ >> + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); > > Can we move that check into common code as well? > > If our (original) vmemmap size would not fit into a single page, we would be > in trouble on any architecture. Did not check if it would be an issue for > arm64 as well in case we would allow eventually wasting memory. > For x86 and arm we already do IS_ALIGNED(vmemmap_size, PMD_SIZE); in arch_supports_memmap_on_memory(). That should imply PAGE_SIZE alignment. If arm64 allow the usage of altmap.reserve, I would expect the arch_supports_memmap_on_memory to have the PAGE_SIZE check. Adding the PAGE_SIZE check in mhp_supports_memmap_on_memory() makes it redundant check for x86 and arm currently? 
modified mm/memory_hotplug.c @@ -1293,6 +1293,13 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) */ if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) return false; + + /* +* Make sure the vmemmap allocation is fully contianed +* so that we always allocate vmemmap memory from altmap area. +*/ + if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE)) + return false; /* * Without page reservation remaining pages should be pageblock aligned. */
Re: [PATCH 00/17] fbdev: Remove FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT flags
On 7/11/23 16:47, Sam Ravnborg wrote: Hi Thomas, On Tue, Jul 11, 2023 at 08:24:40AM +0200, Thomas Zimmermann wrote: Hi Sam Am 10.07.23 um 19:19 schrieb Sam Ravnborg: Hi Thomas, On Mon, Jul 10, 2023 at 02:50:04PM +0200, Thomas Zimmermann wrote: Remove the unused flags FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT from fbdev and drivers, as briefly discussed at [1]. Both flags were maybe useful when fbdev had special handling for driver modules. With commit 376b3ff54c9a ("fbdev: Nuke FBINFO_MODULE"), they are both 0 and have no further effect. Patches 1 to 7 remove FBINFO_DEFAULT from drivers. Patches 2 to 5 split this by the way the fb_info struct is being allocated. All flags are cleared to zero during the allocation. Patches 8 to 16 do the same for FBINFO_FLAG_DEFAULT. Patch 8 fixes an actual bug in how arch/sh uses the token for struct fb_videomode, which is unrelated. Patch 17 removes both flag constants from <linux/fb.h>. We have a few more flags that are unused - should they be nuked too? FBINFO_HWACCEL_FILLRECT FBINFO_HWACCEL_ROTATE FBINFO_HWACCEL_XPAN It seems those are there for completeness. Nothing sets _ROTATE, I think some fbdev drivers had hardware acceleration for ROTATE in the past. HWACCEL_XPAN is still in some drivers. the others are simply never checked. According to the comments, some are required, some are optional. I don't know what that means. I think it's OK if you remove those flags which aren't used anywhere, e.g. FBINFO_HWACCEL_ROTATE. IIRC there were complaints about performance when Daniel tried to remove fbcon acceleration, so not all _HWACCEL_ flags are unneeded. Correct. I think COPYAREA and FILLRECT are the bare minimum to accelerate fbcon, IMAGEBLIT is for showing the tux penguin (?), XPAN/YPAN and YWRAP for some hardware screen panning needed by some drivers (not sure if this is still used as I don't have such hardware, Geert?). Leaving them in for reference/completeness might be an option; or not. I have no strong feelings about those flags.
I'd say drop FBINFO_HWACCEL_ROTATE at least ? Unused as in no references from fbdev/core/* I would rather see one series nuke all unused FBINFO flags in one go. Assuming my quick grep are right and the above can be dropped. I would not want to extend this series. I'm removing _DEFAULT as it's absolutely pointless and confusing. Yes, Ok. Helge
Re: [PATCH v3 5/7] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 11.07.23 06:48, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. Hence we skip the restrictions w.r.t vmemmap size to be multiple of PMD_SIZE. This also makes the feature widely useful because to use PMD_SIZE vmemmap area we require a memory block size of 2GiB We can also use MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY to that the feature can work with a memory block size of 256MB. Using altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with a 256MiB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/pgtable.h| 28 +++ .../platforms/pseries/hotplug-memory.c| 3 +- mm/memory_hotplug.c | 2 ++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 68817ea7f994..8e6c92dde6ad 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -169,6 +169,34 @@ static inline bool is_ioremap_addr(const void *x) int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size); bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, unsigned long page_size); +/* + * 
mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +#define arch_supports_memmap_on_memory arch_supports_memmap_on_memory +static inline bool arch_supports_memmap_on_memory(unsigned long size) +{ + unsigned long nr_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = nr_pages * sizeof(struct page); + + if (!radix_enabled()) + return false; + +#ifdef CONFIG_PPC_4K_PAGES + return IS_ALIGNED(vmemmap_size, PMD_SIZE); +#else + /* +* Make sure the vmemmap allocation is fully contianed +* so that we always allocate vmemmap memory from altmap area. +* The pageblock alignment requirement is met by using +* reserve blocks in altmap. +*/ + return IS_ALIGNED(vmemmap_size, PAGE_SIZE); Can we move that check into common code as well? If our (original) vmemmap size would not fit into a single page, we would be in trouble on any architecture. Did not check if it would be an issue for arm64 as well in case we would allow eventually wasting memory. -- Cheers, David / dhildenb
Re: [PATCH 00/17] fbdev: Remove FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT flags
Hi Thomas, On Tue, Jul 11, 2023 at 08:24:40AM +0200, Thomas Zimmermann wrote: > Hi Sam > > Am 10.07.23 um 19:19 schrieb Sam Ravnborg: > > Hi Thomas, > > > > On Mon, Jul 10, 2023 at 02:50:04PM +0200, Thomas Zimmermann wrote: > > > Remove the unused flags FBINFO_DEFAULT and FBINFO_FLAG_DEFAULT from > > > fbdev and drivers, as briefly discussed at [1]. Both flags were maybe > > > useful when fbdev had special handling for driver modules. With > > > commit 376b3ff54c9a ("fbdev: Nuke FBINFO_MODULE"), they are both 0 > > > and have no further effect. > > > > > > Patches 1 to 7 remove FBINFO_DEFAULT from drivers. Patches 2 to 5 > > > split this by the way the fb_info struct is being allocated. All flags > > > are cleared to zero during the allocation. > > > > > > Patches 8 to 16 do the same for FBINFO_FLAG_DEFAULT. Patch 8 fixes > > > an actual bug in how arch/sh uses the token for struct fb_videomode, > > > which is unrelated. > > > > > > Patch 17 removes both flag constants from <linux/fb.h>. > > > > We have a few more flags that are unused - should they be nuked too? > > FBINFO_HWACCEL_FILLRECT > > FBINFO_HWACCEL_ROTATE > > FBINFO_HWACCEL_XPAN > > It seems those are there for completeness. Nothing sets _ROTATE, the others > are simply never checked. According to the comments, some are required, some > are optional. I don't know what that means. > > IIRC there were complaints about performance when Daniel tried to remove > fbcon acceleration, so not all _HWACCEL_ flags are unneeded. > > Leaving them in for reference/completeness might be an option; or not. I > have no strong feelings about those flags. > > > > > Unused as in no references from fbdev/core/* > > > > I would rather see one series nuke all unused FBINFO flags in one go. > > Assuming my quick grep are right and the above can be dropped. > > I would not want to extend this series. I'm removing _DEFAULT as it's > absolutely pointless and confusing. OK, makes sense and thanks for the explanation.
The series is: Acked-by: Sam Ravnborg
[PATCH v2] powerpc/512x: lpbfifo: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new() which already returns void. Eventually after all drivers are converted, .remove_new() is renamed to .remove().

Trivially convert this driver from always returning zero in the remove callback to the void returning variant.

Signed-off-by: Uwe Kleine-König
---
Changes since (implicit) v1:
- provide an actually compilable patch :-\

Best regards
Uwe

 arch/powerpc/platforms/512x/mpc512x_lpbfifo.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
index 1bfb29574caa..c1e981649bd9 100644
--- a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
+++ b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
@@ -477,7 +477,7 @@ static int mpc512x_lpbfifo_probe(struct platform_device *pdev)
 	return ret;
 }
 
-static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
+static void mpc512x_lpbfifo_remove(struct platform_device *pdev)
 {
 	unsigned long flags;
 	struct dma_device *dma_dev = lpbfifo.chan->device;
@@ -494,8 +494,6 @@ static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
 	free_irq(lpbfifo.irq, &pdev->dev);
 	irq_dispose_mapping(lpbfifo.irq);
 	dma_release_channel(lpbfifo.chan);
-
-	return 0;
 }
 
 static const struct of_device_id mpc512x_lpbfifo_match[] = {
@@ -506,7 +504,7 @@ MODULE_DEVICE_TABLE(of, mpc512x_lpbfifo_match);
 
 static struct platform_driver mpc512x_lpbfifo_driver = {
 	.probe = mpc512x_lpbfifo_probe,
-	.remove = mpc512x_lpbfifo_remove,
+	.remove_new = mpc512x_lpbfifo_remove,
 	.driver = {
 		.name = DRV_NAME,
 		.of_match_table = mpc512x_lpbfifo_match,
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5 -- 2.39.2
[PATCH] powerpc/512x: lpbfifo: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new() which already returns void. Eventually after all drivers are converted, .remove_new() is renamed to .remove().

Trivially convert this driver from always returning zero in the remove callback to the void returning variant.

Signed-off-by: Uwe Kleine-König
---
 arch/powerpc/platforms/512x/mpc512x_lpbfifo.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
index 1bfb29574caa..dbe722f7b855 100644
--- a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
+++ b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
@@ -477,7 +477,7 @@ static int mpc512x_lpbfifo_probe(struct platform_device *pdev)
 	return ret;
 }
 
-static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
+static void pc512x_lpbfifo_remove(struct platform_device *pdev)
 {
 	unsigned long flags;
 	struct dma_device *dma_dev = lpbfifo.chan->device;
@@ -494,8 +494,6 @@ static int mpc512x_lpbfifo_remove(struct platform_device *pdev)
 	free_irq(lpbfifo.irq, &pdev->dev);
 	irq_dispose_mapping(lpbfifo.irq);
 	dma_release_channel(lpbfifo.chan);
-
-	return 0;
 }
 
 static const struct of_device_id mpc512x_lpbfifo_match[] = {
@@ -506,7 +504,7 @@ MODULE_DEVICE_TABLE(of, mpc512x_lpbfifo_match);
 
 static struct platform_driver mpc512x_lpbfifo_driver = {
	.probe = mpc512x_lpbfifo_probe,
-	.remove = mpc512x_lpbfifo_remove,
+	.remove_new = mpc512x_lpbfifo_remove,
 	.driver = {
 		.name = DRV_NAME,
 		.of_match_table = mpc512x_lpbfifo_match,

base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
--
2.39.2
Re: [PATCH v2 1/2] powerpc/tpm: Create linux,sml-base/size as big endian
On 7/10/23 17:23, Jarkko Sakkinen wrote: On Thu, 2023-06-15 at 22:37 +1000, Michael Ellerman wrote: There's code in prom_instantiate_sml() to do a "SML handover" (Stored Measurement Log) from OF to Linux, before Linux shuts down Open Firmware. This involves creating a buffer to hold the SML, and creating two device tree properties to record its base address and size. The kernel then later reads those properties from the device tree to find the SML. When the code was initially added in commit 4a727429abec ("PPC64: Add support for instantiating SML from Open Firmware") the powerpc kernel was always built big endian, so the properties were created big endian by default. However since then little endian support was added to powerpc, and now the code lacks conversions to big endian when creating the properties. This means on little endian kernels the device tree properties are little endian, which is contrary to the device tree spec, and in contrast to all other device tree properties. To cope with that a workaround was added in tpm_read_log_of() to skip the endian conversion if the properties were created via the SML handover. A better solution is to encode the properties as big endian as they should be, and remove the workaround. Typically changing the encoding of a property like this would present problems for kexec. However the SML is not propagated across kexec, so changing the encoding of the properties is a non-issue. Fixes: e46e22f12b19 ("tpm: enhance read_log_of() to support Physical TPM event log") Signed-off-by: Michael Ellerman Reviewed-by: Stefan Berger --- arch/powerpc/kernel/prom_init.c | 8 ++-- drivers/char/tpm/eventlog/of.c | 23 --- 2 files changed, 10 insertions(+), 21 deletions(-) Split into two patches (producer and consumer). I think this wouldn't be right since it would break the system when only one patch is applied since it would be reading the fields in the wrong endianess. Stefan BR, Jarkko v2: Add Stefan's reviewed-by. 
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index d464ba412084..72fe306b6820 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1900,6 +1900,7 @@ static void __init prom_instantiate_sml(void)
 	u32 entry = 0, size = 0, succ = 0;
 	u64 base;
 	__be32 val;
+	__be64 val64;
 
 	prom_debug("prom_instantiate_sml: start...\n");
 
@@ -1956,10 +1957,13 @@ static void __init prom_instantiate_sml(void)
 
 	reserve_mem(base, size);
 
+	val64 = cpu_to_be64(base);
 	prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-base",
-		     &base, sizeof(base));
+		     &val64, sizeof(val64));
+
+	val = cpu_to_be32(size);
 	prom_setprop(ibmvtpm_node, "/vdevice/vtpm", "linux,sml-size",
-		     &size, sizeof(size));
+		     &val, sizeof(val));
 
 	prom_debug("sml base = 0x%llx\n", base);
 	prom_debug("sml size = 0x%x\n", size);
diff --git a/drivers/char/tpm/eventlog/of.c b/drivers/char/tpm/eventlog/of.c
index 930fe43d5daf..0bc0cb6333c6 100644
--- a/drivers/char/tpm/eventlog/of.c
+++ b/drivers/char/tpm/eventlog/of.c
@@ -51,8 +51,8 @@ static int tpm_read_log_memory_region(struct tpm_chip *chip)
 int tpm_read_log_of(struct tpm_chip *chip)
 {
 	struct device_node *np;
-	const u32 *sizep;
-	const u64 *basep;
+	const __be32 *sizep;
+	const __be64 *basep;
 	struct tpm_bios_log *log;
 	u32 size;
 	u64 base;
@@ -73,23 +73,8 @@ int tpm_read_log_of(struct tpm_chip *chip)
 	if (sizep == NULL || basep == NULL)
 		return -EIO;
 
-	/*
-	 * For both vtpm/tpm, firmware has log addr and log size in big
-	 * endian format. But in case of vtpm, there is a method called
-	 * sml-handover which is run during kernel init even before
-	 * device tree is setup. This sml-handover function takes care
-	 * of endianness and writes to sml-base and sml-size in little
-	 * endian format. For this reason, vtpm doesn't need conversion
-	 * but physical tpm needs the conversion.
-	 */
-	if (of_property_match_string(np, "compatible", "IBM,vtpm") < 0 &&
-	    of_property_match_string(np, "compatible", "IBM,vtpm20") < 0) {
-		size = be32_to_cpup((__force __be32 *)sizep);
-		base = be64_to_cpup((__force __be64 *)basep);
-	} else {
-		size = *sizep;
-		base = *basep;
-	}
+	size = be32_to_cpup(sizep);
+	base = be64_to_cpup(basep);
 
 	if (size == 0) {
 		dev_warn(&chip->dev, "%s: Event log area empty\n", __func__);
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:25:43PM +0200, Alexey Gladkov wrote: > -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode) > +static int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, > int lookup_flags) This function can still be called do_fchmodat(); we don't need to version internal functions.
Re: [PATCH v3 0/5] Add a new fchmodat4() syscall
* Alexey Gladkov: > This patch set adds fchmodat4(), a new syscall. The actual > implementation is super simple: essentially it's just the same as > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. > I've attempted to make this match "man 2 fchmodat" as closely as > possible, which says EINVAL is returned for invalid flags (as opposed to > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW). > I have a sketch of a glibc patch that I haven't even compiled yet, but > seems fairly straight-forward: > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c > b/sysdeps/unix/sysv/linux/fchmodat.c > index 6d9cbc1ce9e0..b1beab76d56c 100644 > --- a/sysdeps/unix/sysv/linux/fchmodat.c > +++ b/sysdeps/unix/sysv/linux/fchmodat.c > @@ -29,12 +29,36 @@ > int > fchmodat (int fd, const char *file, mode_t mode, int flag) > { > - if (flag & ~AT_SYMLINK_NOFOLLOW) > -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL); > -#ifndef __NR_lchmod /* Linux so far has no lchmod syscall. > */ > + /* There are four paths through this code: > + - The flags are zero. In this case it's fine to call fchmodat. > + - The flags are non-zero and glibc doesn't have access to > + __NR_fchmodat4. In this case all we can do is emulate the error codes > + defined by the glibc interface from userspace. > + - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel > has > + fchmodat4. This is the simplest case, as the fchmodat4 syscall exactly > + matches glibc's library interface so it can be called directly. > + - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel > does If you define __NR_fchmodat4 on all architectures, we can use these constants directly in glibc. We no longer depend on the UAPI definitions of those constants, to cut down the number of code variants, and to make glibc's system call profile independent of the kernel header version at build time. 
Your version is based on 2.31, more recent versions have some reasonable emulation for fchmodat based on /proc/self/fd. I even wrote a comment describing the same buggy behavior that you witnessed: + /* Some Linux versions with some file systems can actually +change symbolic link permissions via /proc, but this is not +intentional, and it gives inconsistent results (e.g., error +return despite mode change). The expected behavior is that +symbolic link modes cannot be changed at all, and this check +enforces that. */ + if (S_ISLNK (st.st_mode)) + { + __close_nocancel (pathfd); + __set_errno (EOPNOTSUPP); + return -1; + } I think there was some kernel discussion about that behavior before, but apparently, it hasn't led to fixes. I wonder if it makes sense to add a similar error return to the system call implementation? > + not. In this case we must respect the error codes defined by the glibc > + interface instead of returning ENOSYS. > +The intent here is to ensure that the kernel is called at most once > per > +library call, and that the error types defined by glibc are always > +respected. */ > + > +#ifdef __NR_fchmodat4 > + long result; > +#endif > + > + if (flag == 0) > +return INLINE_SYSCALL (fchmodat, 3, fd, file, mode); > + > +#ifdef __NR_fchmodat4 > + result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag); > + if (result == 0 || errno != ENOSYS) > +return result; > +#endif The last if condition is the recommended approach, but in the past, it broke container host compatibility pretty badly due to seccomp filters that return EPERM instead of ENOSYS. I guess we'll learn soon enough if that's been fixed by now. 8-P Thanks, Florian
Re: [PATCH v3 5/5] selftests: add fchmodat4(2) selftest
* Alexey Gladkov: > The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag > fails. This is because not all filesystems support changing the mode > bits of symlinks properly. These filesystems return an error but change > the mode bits: > > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > AT_SYMLINK_NOFOLLOW) = 0 > newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, > AT_SYMLINK_NOFOLLOW) = 0 > syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 > EOPNOTSUPP (Operation not supported) > newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, > AT_SYMLINK_NOFOLLOW) = 0 > > This happens with btrfs and xfs: > > $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test > TAP version 13 > 1..1 > ok 1 # SKIP fchmodat4(symlink) > # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 > > $ stat /tmp/ksft-fchmodat4.*/symlink >File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile >Size: 7 Blocks: 0 IO Block: 4096 symbolic link > Device: 7,0 Inode: 133 Links: 1 > Access: (0600/lrw---) Uid: (0/root) Gid: (0/root) > > Signed-off-by: Alexey Gladkov This looks like a bug in those file systems? As an extra test, “echo 3 > /proc/sys/vm/drop_caches” sometimes has strange effects in such cases because the bits are not actually stored on disk, only in the dentry cache. Thanks, Florian
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote: > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > > From: Palmer Dabbelt > > > > On the userspace side fchmodat(3) is implemented as a wrapper > > function which implements the POSIX-specified interface. This > > interface differs from the underlying kernel system call, which does not > > have a flags argument. Most implementations require procfs [1][2]. > > > > There doesn't appear to be a good userspace workaround for this issue > > but the implementation in the kernel is pretty straight-forward. > > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > > unlike existing fchmodat. > > > > [1] > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > > [2] > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > > > Signed-off-by: Palmer Dabbelt > > Signed-off-by: Alexey Gladkov > > I don't know the history of why we ended up with the different > interface, or whether this was done intentionally in the kernel > or if we want this syscall. > > Assuming this is in fact needed, I double-checked that the > implementation looks correct to me and is portable to all the > architectures, without the need for a compat wrapper. > > Acked-by: Arnd Bergmann The system call itself is useful afaict. But please, s/fchmodat4/fchmodat2/ With very few exceptions we don't version by argument number but by revision and we should stick to one scheme: openat()->openat2() eventfd()->eventfd2() clone()/clone2()->clone3() dup()->dup2()->dup3() // coincides with nr of arguments pipe()->pipe2() // coincides with nr of arguments renameat()->renameat2()
Re: [PATCH v3 2/5] fs: Add fchmodat4()
On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > From: Palmer Dabbelt > > On the userspace side fchmodat(3) is implemented as a wrapper > function which implements the POSIX-specified interface. This > interface differs from the underlying kernel system call, which does not > have a flags argument. Most implementations require procfs [1][2]. > > There doesn't appear to be a good userspace workaround for this issue > but the implementation in the kernel is pretty straight-forward. > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag, > unlike existing fchmodat. > > [1] > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 > [2] > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov I don't know the history of why we ended up with the different interface, or whether this was done intentionally in the kernel or if we want this syscall. Assuming this is in fact needed, I double-checked that the implementation looks correct to me and is portable to all the architectures, without the need for a compat wrapper. Acked-by: Arnd Bergmann
Re: [PATCH v3 1/5] Non-functional cleanup of a "__user * filename"
On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > From: Palmer Dabbelt > > The next patch defines a very similar interface, which I copied from > this definition. Since I'm touching it anyway I don't see any reason > not to just go fix this one up. > > Signed-off-by: Palmer Dabbelt Acked-by: Arnd Bergmann
Re: [PATCH v3 3/5] arch: Register fchmodat4, usually as syscall 451
On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote: > From: Palmer Dabbelt > > This registers the new fchmodat4 syscall in most places as nuber 451, > with alpha being the exception where it's 561. I found all these sites > by grepping for fspick, which I assume has found me everything. > > Signed-off-by: Palmer Dabbelt > Signed-off-by: Alexey Gladkov In linux-6.5-rc1, number 451 is used for __NR_cachestat, the next free one at the moment is 452. > arch/arm/tools/syscall.tbl | 1 + > arch/arm64/include/asm/unistd32.h | 2 ++ Unfortunately, you still also need to change __NR_compat_syscalls in arch/arm64/include/asm/unistd.h. Aside from these two issues, your patch is the correct way to hook up a new syscall. Arnd
[PATCH v3 5/5] selftests: add fchmodat4(2) selftest
The test marks as skipped if a syscall with the AT_SYMLINK_NOFOLLOW flag fails. This is because not all filesystems support changing the mode bits of symlinks properly. These filesystems return an error but change the mode bits: newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 newfstatat(4, "symlink", {st_mode=S_IFLNK|0777, st_size=7, ...}, AT_SYMLINK_NOFOLLOW) = 0 syscall_0x1c3(0x4, 0x55fa1f244396, 0x180, 0x100, 0x55fa1f24438e, 0x34) = -1 EOPNOTSUPP (Operation not supported) newfstatat(4, "regfile", {st_mode=S_IFREG|0640, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0 This happens with btrfs and xfs: $ /kernel/tools/testing/selftests/fchmodat4/fchmodat4_test TAP version 13 1..1 ok 1 # SKIP fchmodat4(symlink) # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0 $ stat /tmp/ksft-fchmodat4.*/symlink File: /tmp/ksft-fchmodat4.3NCqlE/symlink -> regfile Size: 7 Blocks: 0 IO Block: 4096 symbolic link Device: 7,0 Inode: 133 Links: 1 Access: (0600/lrw-------) Uid: (0/root) Gid: (0/root) Signed-off-by: Alexey Gladkov --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/fchmodat4/.gitignore | 2 + tools/testing/selftests/fchmodat4/Makefile| 6 + .../selftests/fchmodat4/fchmodat4_test.c | 151 ++ 4 files changed, 160 insertions(+) create mode 100644 tools/testing/selftests/fchmodat4/.gitignore create mode 100644 tools/testing/selftests/fchmodat4/Makefile create mode 100644 tools/testing/selftests/fchmodat4/fchmodat4_test.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 90a62cf75008..fe61fa55412d 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -17,6 +17,7 @@ TARGETS += drivers/net/bonding TARGETS += drivers/net/team TARGETS += efivarfs TARGETS += exec +TARGETS += fchmodat4 TARGETS += filesystems TARGETS += filesystems/binderfs TARGETS += filesystems/epoll diff --git a/tools/testing/selftests/fchmodat4/.gitignore
b/tools/testing/selftests/fchmodat4/.gitignore new file mode 100644 index ..82a4846cbc4b --- /dev/null +++ b/tools/testing/selftests/fchmodat4/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +/*_test diff --git a/tools/testing/selftests/fchmodat4/Makefile b/tools/testing/selftests/fchmodat4/Makefile new file mode 100644 index ..3d38a69c3c12 --- /dev/null +++ b/tools/testing/selftests/fchmodat4/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined +TEST_GEN_PROGS := fchmodat4_test + +include ../lib.mk diff --git a/tools/testing/selftests/fchmodat4/fchmodat4_test.c b/tools/testing/selftests/fchmodat4/fchmodat4_test.c new file mode 100644 index ..50beb731d8ba --- /dev/null +++ b/tools/testing/selftests/fchmodat4/fchmodat4_test.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#ifndef __NR_fchmodat4 + #if defined __alpha__ + #define __NR_fchmodat4 561 + #elif defined _MIPS_SIM + #if _MIPS_SIM == _MIPS_SIM_ABI32/* o32 */ + #define __NR_fchmodat4 (451 + 4000) + #endif + #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ + #define __NR_fchmodat4 (451 + 6000) + #endif + #if _MIPS_SIM == _MIPS_SIM_ABI64/* n64 */ + #define __NR_fchmodat4 (451 + 5000) + #endif + #elif defined __ia64__ + #define __NR_fchmodat4 (451 + 1024) + #else + #define __NR_fchmodat4 451 + #endif +#endif + +int sys_fchmodat4(int dfd, const char *filename, mode_t mode, int flags) +{ + int ret = syscall(__NR_fchmodat4, dfd, filename, mode, flags); + return ret >= 0 ? ret : -errno; +} + +int setup_testdir(void) +{ + int dfd, ret; + char dirname[] = "/tmp/ksft-fchmodat4.XX"; + + /* Make the top-level directory. 
*/ + if (!mkdtemp(dirname)) + ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n"); + + dfd = open(dirname, O_PATH | O_DIRECTORY); + if (dfd < 0) + ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n"); + + ret = openat(dfd, "regfile", O_CREAT | O_WRONLY | O_TRUNC, 0644); + if (ret < 0) + ksft_exit_fail_msg("setup_testdir: failed to create file in tmpdir\n"); + close(ret); + + ret = symlinkat("regfile", dfd, "symlink"); + if (ret < 0) + ksft_exit_fail_msg("setup_testdir: failed to create symlink in tmpdir\n"); + + return dfd; +} + +int expect_mode(int dfd, const char *filename, mode_t expect_mode) +{ + struct stat st; +
[PATCH v3 4/5] tools headers UAPI: Sync files changed by new fchmodat4 syscall
From: Palmer Dabbelt

This adds support for the new syscall to tools such as 'perf trace'.

Signed-off-by: Palmer Dabbelt
Signed-off-by: Alexey Gladkov
---
 tools/include/uapi/asm-generic/unistd.h             | 5 ++++-
 tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl | 1 +
 tools/perf/arch/powerpc/entry/syscalls/syscall.tbl  | 1 +
 tools/perf/arch/s390/entry/syscalls/syscall.tbl     | 1 +
 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl   | 1 +
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b7978b3ce3f1 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_fchmodat4 451
+__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 3f1886ad9d80..6356c0a6cda0 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -365,3 +365,4 @@
 448	n64	process_mrelease	sys_process_mrelease
 449	n64	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	n64	fchmodat4	sys_fchmodat4
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index a0be127475b1..ee23866fa1c8 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -537,3 +537,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	nospu	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index b68f47541169..d5ce80065ece 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -453,3 +453,4 @@
 448  common	process_mrelease	sys_process_mrelease	sys_process_mrelease
 449  common	futex_waitv	sys_futex_waitv	sys_futex_waitv
 450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
+451  common	fchmodat4	sys_fchmodat4	sys_fchmodat4
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..17047878293c 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.33.8
[PATCH v3 1/5] Non-functional cleanup of a "__user * filename"
From: Palmer Dabbelt

The next patch defines a very similar interface, which I copied from this definition. Since I'm touching it anyway I don't see any reason not to just go fix this one up.

Signed-off-by: Palmer Dabbelt
---
 include/linux/syscalls.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..497bdd968c32 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -464,7 +464,7 @@ asmlinkage long sys_chdir(const char __user *filename);
 asmlinkage long sys_fchdir(unsigned int fd);
 asmlinkage long sys_chroot(const char __user *filename);
 asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
-asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
+asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
 			     umode_t mode);
 asmlinkage long sys_fchownat(int dfd, const char __user *filename,
 			     uid_t user, gid_t group, int flag);
-- 
2.33.8
[PATCH v3 2/5] fs: Add fchmodat4()
From: Palmer Dabbelt

On the userspace side fchmodat(3) is implemented as a wrapper function which implements the POSIX-specified interface. This interface differs from the underlying kernel system call, which does not have a flags argument. Most implementations require procfs [1][2].

There doesn't appear to be a good userspace workaround for this issue, but the implementation in the kernel is pretty straightforward.

The new fchmodat4() syscall allows passing the AT_SYMLINK_NOFOLLOW flag, unlike the existing fchmodat.

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
[2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28

Signed-off-by: Palmer Dabbelt
Signed-off-by: Alexey Gladkov
---
 fs/open.c                | 18 ++++++++++++------
 include/linux/syscalls.h |  2 ++
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 4478adcc4f3a..58bb88c6afb6 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
 	return err;
 }
 
-static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
+static int do_fchmodat4(int dfd, const char __user *filename, umode_t mode, int lookup_flags)
 {
 	struct path path;
 	int error;
-	unsigned int lookup_flags = LOOKUP_FOLLOW;
+
 retry:
 	error = user_path_at(dfd, filename, lookup_flags, &path);
 	if (!error) {
@@ -689,15 +689,25 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
 	return error;
 }
 
+SYSCALL_DEFINE4(fchmodat4, int, dfd, const char __user *, filename,
+		umode_t, mode, int, flags)
+{
+	if (unlikely(flags & ~AT_SYMLINK_NOFOLLOW))
+		return -EINVAL;
+
+	return do_fchmodat4(dfd, filename, mode,
+			    flags & AT_SYMLINK_NOFOLLOW ? 0 : LOOKUP_FOLLOW);
+}
+
 SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
 		umode_t, mode)
 {
-	return do_fchmodat(dfd, filename, mode);
+	return do_fchmodat4(dfd, filename, mode, LOOKUP_FOLLOW);
 }
 
 SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
 {
-	return do_fchmodat(AT_FDCWD, filename, mode);
+	return do_fchmodat4(AT_FDCWD, filename, mode, LOOKUP_FOLLOW);
 }
 
 /**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 497bdd968c32..b17d37d2bad6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -466,6 +466,8 @@ asmlinkage long sys_chroot(const char __user *filename);
 asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
 asmlinkage long sys_fchmodat(int dfd, const char __user *filename,
 			     umode_t mode);
+asmlinkage long sys_fchmodat4(int dfd, const char __user *filename,
+			      umode_t mode, int flags);
 asmlinkage long sys_fchownat(int dfd, const char __user *filename,
 			     uid_t user, gid_t group, int flag);
 asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
-- 
2.33.8
[PATCH v3 3/5] arch: Register fchmodat4, usually as syscall 451
From: Palmer Dabbelt

This registers the new fchmodat4 syscall in most places as number 451, with alpha being the exception where it's 561. I found all these sites by grepping for fspick, which I assume has found me everything.

Signed-off-by: Palmer Dabbelt
Signed-off-by: Alexey Gladkov
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 18 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ebacf37a8cf..00ceeffec7ff 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -490,3 +490,4 @@
 558	common	process_mrelease	sys_process_mrelease
 559	common	futex_waitv	sys_futex_waitv
 560	common	set_mempolicy_home_node	sys_ni_syscall
+561	common	fchmodat4	sys_fchmodat4
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index ac964612d8b0..0b9702d5c425 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -464,3 +464,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 604a2053d006..49c65d935049 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -907,6 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_fchmodat4 451
+__SYSCALL(__NR_fchmodat4, sys_fchmodat4)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 72c929d9902b..b35225c64781 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -371,3 +371,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b1f3940bc298..4d80cd87e089 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 820145e47350..306bd18e5b52 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv	sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	fchmodat4	sys_fchmodat4
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 253ff994ed2e..2ef47a546fd3 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -389,3 +389,4 @@
 448	n32	process_mrelease	sys_process_mrelease
 449	n32	futex_waitv	sys_futex_waitv
 450	n32	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	n32	fchmodat4	sys_fchmodat4
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl
[PATCH v3 0/5] Add a new fchmodat4() syscall
This patch set adds fchmodat4(), a new syscall. The actual implementation is super simple: essentially it's just the same as fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags. I've attempted to make this match "man 2 fchmodat" as closely as possible, which says EINVAL is returned for invalid flags (as opposed to ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).

I have a sketch of a glibc patch that I haven't even compiled yet, but seems fairly straightforward:

diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
index 6d9cbc1ce9e0..b1beab76d56c 100644
--- a/sysdeps/unix/sysv/linux/fchmodat.c
+++ b/sysdeps/unix/sysv/linux/fchmodat.c
@@ -29,12 +29,36 @@
 int
 fchmodat (int fd, const char *file, mode_t mode, int flag)
 {
-  if (flag & ~AT_SYMLINK_NOFOLLOW)
-    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
-#ifndef __NR_lchmod	/* Linux so far has no lchmod syscall. */
+  /* There are four paths through this code:
+      - The flags are zero.  In this case it's fine to call fchmodat.
+      - The flags are non-zero and glibc doesn't have access to
+        __NR_fchmodat4.  In this case all we can do is emulate the error codes
+        defined by the glibc interface from userspace.
+      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
+        fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
+        matches glibc's library interface so it can be called directly.
+      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
+        not.  In this case we must respect the error codes defined by the glibc
+        interface instead of returning ENOSYS.
+     The intent here is to ensure that the kernel is called at most once per
+     library call, and that the error types defined by glibc are always
+     respected.  */
+
+#ifdef __NR_fchmodat4
+  long result;
+#endif
+
+  if (flag == 0)
+    return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
+
+#ifdef __NR_fchmodat4
+  result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag);
+  if (result == 0 || errno != ENOSYS)
+    return result;
+#endif
+
   if (flag & AT_SYMLINK_NOFOLLOW)
     return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP);
-#endif
-  return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
+  return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
 }

I've never added a new syscall before so I'm not really sure what the proper procedure to follow is. Based on the feedback from my v1 patch set it seems this is somewhat uncontroversial. At this point I don't think there's anything I'm missing, though note that I haven't gotten around to testing it this time because the diff from v1 is trivial for any platform I could reasonably test on. The v1 patches suggest a simple test case, but I didn't re-run it because I don't want to reboot my laptop.

Changes since v2 [20190717012719.5524-1-pal...@sifive.com]:
* Rebased to master.
* The lookup_flags are passed to sys_fchmodat4, as suggested by Al Viro.
* Selftest added.

Changes since v1 [20190531191204.4044-1-pal...@sifive.com]:
* All architectures are now supported, with support squashed into a single patch.
* The do_fchmodat() helper function has been removed, in favor of directly calling do_fchmodat4().
* The patches are based on 5.2 instead of 5.1.
---

Alexey Gladkov (1):
  selftests: add fchmodat4(2) selftest

Palmer Dabbelt (4):
  Non-functional cleanup of a "__user * filename"
  fs: Add fchmodat4()
  arch: Register fchmodat4, usually as syscall 451
  tools headers UAPI: Sync files changed by new fchmodat4 syscall

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/open.c                                   | 18 ++-
 include/linux/syscalls.h                    |  4 +-
 include/uapi/asm-generic/unistd.h           |  5 +-
 tools/include/uapi/asm-generic/unistd.h     |  5 +-