Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED
On Tue, 23 May 2023, Vlastimil Babka wrote:

> As discussed at LSF/MM [1] [2] and with no objections raised there,
> deprecate the SLAB allocator. Rename the user-visible option so that
> users with CONFIG_SLAB=y get a new prompt with explanation during make
> oldconfig, while make olddefconfig will just switch to SLUB.
>
> In all defconfigs with CONFIG_SLAB=y remove the line so those also
> switch to SLUB. Regressions due to the switch should be reported to
> linux-mm and slab maintainers.
>
> [1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
> [2] https://lwn.net/Articles/932201/
>
> Signed-off-by: Vlastimil Babka

Acked-by: David Rientjes

The Kconfig option says that SLAB will be removed in a few cycles. I think
we should wait until at least the next LTS kernel is forked at the end of
the year so that users who upgrade only to the LTS releases can be
prompted for this change and surface any concerns. Slab allocation is a
critical subsystem, so I presume this is the safest and most responsible
way to do the SLAB deprecation.

Hopefully that timeline works for everybody.
Re: [PATCH] mm: Remove HARDENED_USERCOPY_FALLBACK
On Tue, 21 Sep 2021, Stephen Kitt wrote:

> This has served its purpose and is no longer used. All usercopy
> violations appear to have been handled by now, any remaining
> instances (or new bugs) will cause copies to be rejected.
>
> This isn't a direct revert of commit 2d891fbc3bb6 ("usercopy: Allow
> strict enforcement of whitelists"); since usercopy_fallback is
> effectively 0, the fallback handling is removed too.
>
> This also removes the usercopy_fallback module parameter on
> slab_common.
>
> Link: https://github.com/KSPP/linux/issues/153
> Signed-off-by: Stephen Kitt
> Suggested-by: Kees Cook

Acked-by: David Rientjes
Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
On Tue, 10 Apr 2018, Laurent Dufour wrote:

> > On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
> >>  arch/powerpc/include/asm/pte-common.h | 3 ---
> >>  arch/riscv/Kconfig                    | 1 +
> >>  arch/s390/Kconfig                     | 1 +
> >
> > You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> > arch/riscv/include/asm/pgtable-bits.h
>
> Damned!
> Thanks for catching it.
> Squashing the two patches together at least allowed it to be caught
> easily.

After it's fixed, feel free to add

	Acked-by: David Rientjes <rient...@google.com>

Thanks for doing this!
Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event
On Mon, 26 Mar 2018, Andi Kleen wrote:

> > Aside: should there be a new spec_flt field for struct task_struct that
> > complements maj_flt and min_flt and be exported through /proc/pid/stat?
>
> No. task_struct is already too bloated. If you need per process tracking
> you can always get it through trace points.

Hi Andi,

We have

	count_vm_event(PGFAULT);
	count_memcg_event_mm(vma->vm_mm, PGFAULT);

in handle_mm_fault() but no counterpart for spf. If there is no
per-process tracking, I think it would be helpful to be able to determine
how much faulting can be done speculatively without resorting to tracing.
Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL
On Mon, 9 Apr 2018, Christoph Hellwig wrote:

> > -#ifdef __HAVE_ARCH_PTE_SPECIAL
> > +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >  # define HAVE_PTE_SPECIAL 1
> >  #else
> >  # define HAVE_PTE_SPECIAL 0
>
> I'd say kill this odd indirection and just use the
> CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly.

Agree, and I think it would be easier to audit/review if patches 1 and 3
were folded together to see the relationship between the newly added
selects and the #define's they replace. Otherwise, looks good!
Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key
On Wed, 4 Apr 2018, Laurent Dufour wrote:

> > I also think the following is needed:
> >
> > diff --git a/fs/exec.c b/fs/exec.c
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
> >  	vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
> >  	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> >  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > +	seqcount_init(&vma->vm_sequence);
> > +	atomic_set(&vma->vm_ref_count, 0);
> > +#endif
> >
> >  	err = insert_vm_struct(mm, vma);
> >  	if (err)
>
> No, this is not needed because the vma is allocated with
> kmem_cache_zalloc() so vm_ref_count is 0, and insert_vm_struct() will
> later call __vma_link_rb() which will call seqcount_init().
>
> Furthermore, in case of error, the vma structure is freed without
> calling get_vma() so there is no risk of lockdep warning.

Perhaps you're working from a different tree than I am, or you fixed the
lockdep warning differently when adding to dup_mmap() and mmap_region().
I got the following two lockdep errors. I fixed it locally by doing the
seqcount_init() and atomic_set() everywhere a vma could be initialized.

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 12 PID: 1 Comm: init Not tainted
Call Trace:
 [] dump_stack+0x67/0x98
 [] register_lock_class+0x1e6/0x4e0
 [] __lock_acquire+0xb9/0x1710
 [] lock_acquire+0xba/0x200
 [] mprotect_fixup+0x10f/0x310
 [] setup_arg_pages+0x12d/0x230
 [] load_elf_binary+0x44a/0x1740
 [] search_binary_handler+0x9b/0x1e0
 [] load_script+0x206/0x270
 [] search_binary_handler+0x9b/0x1e0
 [] do_execveat_common.isra.32+0x6b5/0x9d0
 [] do_execve+0x2c/0x30
 [] run_init_process+0x2b/0x30
 [] kernel_init+0x54/0x110
 [] ret_from_fork+0x3a/0x50

and

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 21 PID: 1926 Comm: mkdir Not tainted
Call Trace:
 [] dump_stack+0x67/0x98
 [] register_lock_class+0x1e6/0x4e0
 [] __lock_acquire+0xb9/0x1710
 [] lock_acquire+0xba/0x200
 [] unmap_page_range+0x89/0xaa0
 [] unmap_single_vma+0x8f/0x100
 [] unmap_vmas+0x4b/0x90
 [] exit_mmap+0xa3/0x1c0
 [] mmput+0x73/0x120
 [] do_exit+0x2bd/0xd60
 [] SyS_exit+0x17/0x20
 [] do_syscall_64+0x6d/0x1a0
 [] entry_SYSCALL_64_after_hwframe+0x26/0x9b

I think it would just be better to generalize vma allocation to
initialize certain fields and init both spf fields properly for
CONFIG_SPECULATIVE_PAGE_FAULT. It's obviously too delicate as is.
Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key
On Tue, 3 Apr 2018, David Rientjes wrote:

> > >> I found the root cause of this lockdep warning.
> > >>
> > >> In mmap_region(), unmap_region() may be called while vma_link() has
> > >> not been called. This happens during the error path if call_mmap()
> > >> failed.
> > >>
> > >> The only way to fix that particular case is to call
> > >> seqcount_init(&vma->vm_sequence) when initializing the vma in
> > >> mmap_region().
> > >
> > > Ack, although that would require a fixup to dup_mmap() as well.
> >
> > You're right, I'll fix that too.
>
> I also think the following is needed:
>
> diff --git a/fs/exec.c b/fs/exec.c
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
>  	vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
>  	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	seqcount_init(&vma->vm_sequence);
> +	atomic_set(&vma->vm_ref_count, 0);
> +#endif
>
>  	err = insert_vm_struct(mm, vma);
>  	if (err)

Ugh, I think there are a number of other places where this is needed as
well in mm/mmap.c. I think it would be better to just create a new
alloc_vma(unsigned long flags) that all vma allocators can use and that
for CONFIG_SPECULATIVE_PAGE_FAULT will initialize the seqcount_t and
atomic_t.
Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> On 26/03/2018 00:10, David Rientjes wrote:
> > On Wed, 21 Mar 2018, Laurent Dufour wrote:
> >
> >> I found the root cause of this lockdep warning.
> >>
> >> In mmap_region(), unmap_region() may be called while vma_link() has
> >> not been called. This happens during the error path if call_mmap()
> >> failed.
> >>
> >> The only way to fix that particular case is to call
> >> seqcount_init(&vma->vm_sequence) when initializing the vma in
> >> mmap_region().
> >
> > Ack, although that would require a fixup to dup_mmap() as well.
>
> You're right, I'll fix that too.

I also think the following is needed:

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -312,6 +312,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	seqcount_init(&vma->vm_sequence);
+	atomic_set(&vma->vm_ref_count, 0);
+#endif
 
 	err = insert_vm_struct(mm, vma);
 	if (err)
Re: [PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 4d02524a7998..2f3e98edc94a 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -300,6 +300,7 @@ extern pgprot_t protection_map[16];
> >>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
> >>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
> >>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
> >> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
> >>
> >>  #define FAULT_FLAG_TRACE \
> >>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
> >
> > I think FAULT_FLAG_SPECULATIVE should be introduced in the patch that
> > actually uses it.
>
> I think you're right, I'll move down this define in the series.
>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0ae4999c824..8ac241b9f370 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> >>  }
> >>  EXPORT_SYMBOL_GPL(apply_to_page_range);
> >>
> >> +static bool pte_map_lock(struct vm_fault *vmf)
> >
> > inline?
>
> Agreed.

Ignore this, the final form of the function after the full patchset
shouldn't be inline.

> >> +{
> >> +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> >> +				       vmf->address, &vmf->ptl);
> >> +	return true;
> >> +}
> >> +
> >>  /*
> >>   * handle_pte_fault chooses page fault handler according to an entry which was
> >>   * read non-atomically. Before making any commitment, on those architectures
> >> @@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf)
> >>  	const unsigned long mmun_start = vmf->address & PAGE_MASK;
> >>  	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
> >>  	struct mem_cgroup *memcg;
> >> +	int ret = VM_FAULT_OOM;
> >>
> >>  	if (unlikely(anon_vma_prepare(vma)))
> >>  		goto oom;
> >> @@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf)
> >>  	/*
> >>  	 * Re-check the pte - we dropped the lock
> >>  	 */
> >> -	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
> >> +	if (!pte_map_lock(vmf)) {
> >> +		mem_cgroup_cancel_charge(new_page, memcg, false);
> >> +		ret = VM_FAULT_RETRY;
> >> +		goto oom_free_new;
> >> +	}
> >
> > Ugh, but we aren't oom here, so maybe rename oom_free_new so that it
> > makes sense for return values other than VM_FAULT_OOM?
>
> You're right, now this label name is not correct, I'll rename it to
> "out_free_new" and rename also the label "oom" to "out" since it is
> generic too now.

I think it would just be better to introduce an out_uncharge that handles
the mem_cgroup_cancel_charge() in the exit path.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,9 +2645,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 	 * Re-check the pte - we dropped the lock
 	 */
 	if (!pte_map_lock(vmf)) {
-		mem_cgroup_cancel_charge(new_page, memcg, false);
 		ret = VM_FAULT_RETRY;
-		goto oom_free_new;
+		goto out_uncharge;
 	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
@@ -2735,6 +2734,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 		put_page(old_page);
 	}
 	return page_copied ? VM_FAULT_WRITE : 0;
+out_uncharge:
+	mem_cgroup_cancel_charge(new_page, memcg, false);
 oom_free_new:
 	put_page(new_page);
 oom:
Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()
On Tue, 3 Apr 2018, Jerome Glisse wrote:

> > When dealing with the speculative fault path we should use the VMA's
> > field cached value stored in the vm_fault structure.
> >
> > Currently vm_normal_page() is using the pointer to the VMA to fetch the
> > vm_flags value. This patch provides a new __vm_normal_page() which is
> > receiving the vm_flags flags value as parameter.
> >
> > Note: The speculative path is turned on for architecture providing
> > support for special PTE flag. So only the first block of vm_normal_page
> > is used during the speculative path.
>
> Might be a good idea to explicitly have SPECULATIVE Kconfig option depends
> on ARCH_PTE_SPECIAL and a comment for !HAVE_PTE_SPECIAL in the function
> explaining that speculative page fault should never reach that point.

Yeah, I think that's appropriate, but in a follow-up patch since this is
only propagating vma_flags. It will require that __HAVE_ARCH_PTE_SPECIAL
become an actual Kconfig entry, however.
Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF
On Tue, 3 Apr 2018, Jerome Glisse wrote:

> > diff --git a/mm/memory.c b/mm/memory.c
> > index 21b1212a0892..4bc7b0bdcb40 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
> >   * parts, do_swap_page must check under lock before unmapping the pte and
> >   * proceeding (but do_wp_page is only called after already making such a check;
> >   * and do_anonymous_page can safely check later on).
> > + *
> > + * pte_unmap_same() returns:
> > + *	0			if the PTE are the same
> > + *	VM_FAULT_PTNOTSAME	if the PTE are different
> > + *	VM_FAULT_RETRY		if the VMA has changed in our back during
> > + *				a speculative page fault handling.
> >   */
> > -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> > -				pte_t *page_table, pte_t orig_pte)
> > +static inline int pte_unmap_same(struct vm_fault *vmf)
> >  {
> > -	int same = 1;
> > +	int ret = 0;
> > +
> >  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
> >  	if (sizeof(pte_t) > sizeof(unsigned long)) {
> > -		spinlock_t *ptl = pte_lockptr(mm, pmd);
> > -		spin_lock(ptl);
> > -		same = pte_same(*page_table, orig_pte);
> > -		spin_unlock(ptl);
> > +		if (pte_spinlock(vmf)) {
> > +			if (!pte_same(*vmf->pte, vmf->orig_pte))
> > +				ret = VM_FAULT_PTNOTSAME;
> > +			spin_unlock(vmf->ptl);
> > +		} else
> > +			ret = VM_FAULT_RETRY;
> >  	}
> >  #endif
> > -	pte_unmap(page_table);
> > -	return same;
> > +	pte_unmap(vmf->pte);
> > +	return ret;
> >  }
> >
> >  static inline void cow_user_page(struct page *dst, struct page *src,
> >  				 unsigned long va, struct vm_area_struct *vma)
> > @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
> >  	int exclusive = 0;
> >  	int ret = 0;
> >
> > -	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> > +	ret = pte_unmap_same(vmf);
> > +	if (ret)
> >  		goto out;
>
> This changes what do_swap_page() returns, ie before it was returning 0
> when the locked pte lookup was different from orig_pte. After this patch
> it returns VM_FAULT_PTNOTSAME, but this is a new return value for
> handle_mm_fault() (the do_swap_page() return value is what ultimately
> gets returned by handle_mm_fault()).
>
> Do we really want that? This might confuse some existing users of
> handle_mm_fault() and i am not sure of the value of that information
> to the caller.
>
> Note i do understand that you want to return retry if anything did
> change from underneath and thus need to differentiate from when the
> pte values are not the same.

I think VM_FAULT_RETRY should be handled appropriately for any user of
handle_mm_fault() already, and would be surprised to learn differently.
Khugepaged has the appropriate handling. I think the concern is whether a
user is handling anything other than VM_FAULT_RETRY and VM_FAULT_ERROR
(which VM_FAULT_PTNOTSAME is not set in)? I haven't done a full audit of
the users.
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> This change is inspired by Peter's proposal patch [1] which was
> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well
> in that particular case, and it is introducing major performance
> degradation due to excessive scheduling operations.
>
> To allow access to the mm_rb tree without grabbing the mmap_sem, this
> patch protects its access using a rwlock. As the mm_rb tree is a
> O(log n) search it is safe to protect it using such a lock. The VMA
> cache is not protected by the new rwlock and it should not be used
> without holding the mmap_sem.
>
> To allow the picked VMA structure to be used once the rwlock is
> released, a use count is added to the VMA structure. When the VMA is
> allocated it is set to 1. Each time the VMA is picked with the rwlock
> held its use count is incremented. Each time the VMA is released it is
> decremented. When the use count hits zero, this means that the VMA is
> no longer used and should be freed.
>
> This patch is preparing for 2 kinds of VMA access:
>  - as usual, under the control of the mmap_sem,
>  - without holding the mmap_sem for the speculative page fault handler.
>
> Access done under the control of the mmap_sem doesn't require grabbing
> the rwlock to protect read access to the mm_rb tree, but access in
> write must be done under the protection of the rwlock too. This
> affects inserting and removing elements in the RB tree.
>
> The patch introduces 2 new functions:
>  - vma_get() to find a VMA based on an address by holding the new
>    rwlock.
>  - vma_put() to release the VMA when it is no longer used.
> These services are designed to be used when accesses are made to the
> RB tree without holding the mmap_sem.
>
> When a VMA is removed from the RB tree, its vma->vm_rb field is
> cleared and we rely on the WMB done when releasing the rwlock to
> serialize the write with the RMB done in a later patch to check for
> the VMA's validity.
>
> When free_vma is called, the file associated with the VMA is closed
> immediately, but the policy and the file structure remain in use until
> the VMA's use count reaches 0, which may happen later when exiting an
> in progress speculative page fault.
>
> [1] https://patchwork.kernel.org/patch/5108281/
>
> Cc: Peter Zijlstra (Intel)
> Cc: Matthew Wilcox
> Signed-off-by: Laurent Dufour

Can __free_vma() be generalized for mm/nommu.c's delete_vma() and
do_mmap()?
Re: [PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> When dealing with the speculative page fault handler, we may race with
> a VMA being split or merged. In this case the vma->vm_start and
> vma->vm_end fields may not match the address the page fault is
> occurring at.
>
> This can only happen when the VMA is split, but in that case, the
> anon_vma pointer of the new VMA will be the same as the original one,
> because in __split_vma the new->anon_vma is set to src->anon_vma when
> *new = *vma.
>
> So even if the VMA boundaries are not correct, the anon_vma pointer is
> still valid.
>
> If the VMA has been merged, then the VMA in which it has been merged
> must have the same anon_vma pointer, otherwise the merge can't be
> done.
>
> So in all cases we know that the anon_vma is valid, since we have
> checked before starting the speculative page fault that the anon_vma
> pointer is valid for this VMA, and since there is an anon_vma this
> means that at one time a page has been backed and that before the VMA
> is cleaned, the page table lock would have to be grabbed to clean the
> PTE, and the anon_vma field is checked once the PTE is locked.
>
> This patch introduces a new __page_add_new_anon_rmap() service which
> doesn't check for the VMA boundaries, and creates a new inline one
> which does the check.
>
> When called from a page fault handler, if this is not a speculative
> one, there is a guarantee that vm_start and vm_end match the faulting
> address, so this check is useless. In the context of the speculative
> page fault handler, this check may be wrong but anon_vma is still
> valid as explained above.
>
> Signed-off-by: Laurent Dufour

I'm indifferent on this: it could be argued both ways that the new
function and its variant for a simple VM_BUG_ON() isn't worth it and
that the check should rather be done in the callers of
page_add_new_anon_rmap(). It feels like it would be better left to the
caller and to add a comment to page_add_anon_rmap() itself in mm/rmap.c.
Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a84ddc218bbd..73b8b99f482b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1263,8 +1263,11 @@ struct zap_details {
>  	pgoff_t last_index;			/* Highest page->index to unmap */
>  };
>
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte, bool with_public_device);
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +			      pte_t pte, bool with_public_device,
> +			      unsigned long vma_flags);
> +#define _vm_normal_page(vma, addr, pte, with_public_device) \
> +	__vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags)
>  #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
>
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

If _vm_normal_page() is a static inline function does it break somehow?
It's nice to avoid the #define's.

> diff --git a/mm/memory.c b/mm/memory.c
> index af0338fbc34d..184a0d663a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -826,8 +826,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  #else
>  # define HAVE_PTE_SPECIAL 0
>  #endif
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte, bool with_public_device)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +			      pte_t pte, bool with_public_device,
> +			      unsigned long vma_flags)
>  {
>  	unsigned long pfn = pte_pfn(pte);

Would it be possible to update the comment since the function itself is
no longer named vm_normal_page?
Re: [PATCH v9 14/24] mm: Introduce __maybe_mkwrite()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dfa81a638b7c..a84ddc218bbd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,13 +684,18 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> -	if (likely(vma->vm_flags & VM_WRITE))
> +	if (likely(vma_flags & VM_WRITE))
>  		pte = pte_mkwrite(pte);
>  	return pte;
>  }
>
> +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	return __maybe_mkwrite(pte, vma->vm_flags);
> +}
> +
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>  		struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> diff --git a/mm/memory.c b/mm/memory.c
> index 0a0a483d9a65..af0338fbc34d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  	entry = pte_mkyoung(vmf->orig_pte);
> -	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>  	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
>  		update_mmu_cache(vma, vmf->address, vmf->pte);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>  		inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	}
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> -	entry = mk_pte(new_page, vma->vm_page_prot);
> -	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	entry = mk_pte(new_page, vmf->vma_page_prot);
> +	entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>  	/*
>  	 * Clear the pte entry and flush it first, before updating the
>  	 * pte with the new entry. This will avoid a race condition

Don't you also need to do this in do_swap_page()?

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3067,9 +3067,9 @@ int do_swap_page(struct vm_fault *vmf)
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+	pte = mk_pte(page, vmf->vma_page_prot);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = __maybe_mkwrite(pte_mkdirty(pte), vmf->vma_flags);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = RMAP_EXCLUSIVE;
Re: [PATCH v9 13/24] mm: Introduce __lru_cache_add_active_or_unevictable
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> The speculative page fault handler which is run without holding the
> mmap_sem is calling lru_cache_add_active_or_unevictable() but the
> vm_flags is not guaranteed to remain constant.
> Introducing __lru_cache_add_active_or_unevictable() which has the vma
> flags value parameter instead of the vma pointer.
>
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>
Re: [PATCH v9 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> migrate_misplaced_page() is only called during the page fault handling
> so it's better to pass the pointer to the struct vm_fault instead of
> the vma.
>
> This way during the speculative page fault path the saved
> vma->vm_flags could be used.
>
> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

Acked-by: David Rientjes <rient...@google.com>
Re: [PATCH v9 11/24] mm: Cache some VMA fields in the vm_fault structure
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef6ef0627090..dfa81a638b7c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -359,6 +359,12 @@ struct vm_fault {
>  					 * page table to avoid allocation from
>  					 * atomic context.
>  					 */
> +	/*
> +	 * These entries are required when handling speculative page fault.
> +	 * This way the page handling is done using consistent field values.
> +	 */
> +	unsigned long vma_flags;
> +	pgprot_t vma_page_prot;
>  };
>
>  /* page entry size for vm->huge_fault() */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 446427cafa19..f71db2b42b30 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3717,6 +3717,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		.vma = vma,
>  		.address = address,
>  		.flags = flags,
> +		.vma_flags = vma->vm_flags,
> +		.vma_page_prot = vma->vm_page_prot,
>  		/*
>  		 * Hard to debug if it ends up being
>  		 * used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 32314e9e48dd..a946d5306160 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  		.flags = FAULT_FLAG_ALLOW_RETRY,
>  		.pmd = pmd,
>  		.pgoff = linear_page_index(vma, address),
> +		.vma_flags = vma->vm_flags,
> +		.vma_page_prot = vma->vm_page_prot,
>  	};
>
>  	/* we only decide to swapin, if there is enough young ptes */
> diff --git a/mm/memory.c b/mm/memory.c
> index 0200340ef089..46fe92b93682 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>  		 * Don't let another task, with possibly unlocked vma,
>  		 * keep the mlocked page.
>  		 */
> -		if (page_copied && (vma->vm_flags & VM_LOCKED)) {
> +		if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
>  			lock_page(old_page);	/* LRU manipulation */
>  			if (PageMlocked(old_page))
>  				munlock_vma_page(old_page);

Doesn't wp_page_copy() also need to pass this to anon_vma_prepare() so
that find_mergeable_anon_vma() works correctly?

> @@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>   */
>  int finish_mkwrite_fault(struct vm_fault *vmf)
>  {
> -	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
> +	WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
>  	if (!pte_map_lock(vmf))
>  		return VM_FAULT_RETRY;
>  	/*
> @@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
>  	 * We should not cow pages in a shared writeable mapping.
>  	 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>  	 */
> -	if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> +	if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>  	     (VM_WRITE|VM_SHARED))
>  		return wp_pfn_shared(vmf);
>
> @@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
>  			return VM_FAULT_WRITE;
>  		}
>  		unlock_page(vmf->page);
> -	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> +	} else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
>  		return wp_page_shared(vmf);
>  	}
> @@ -3067,7 +3067,7 @@ int do_swap_page(struct vm_fault *vmf)
>
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +	pte = mk_pte(page, vmf->vma_page_prot);
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -3093,7 +3093,7 @@ int do_swap_page(struct vm_fault *vmf)
>
>  	swap_free(entry);
>  	if (mem_cgroup_swap_full(page) ||
> -	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> +	    (vmf->vma_flags & VM_LOCKED) || PageMlocked(page))
>  		try_to_free_swap(page);
>  	unlock_page(page);
>  	if (page != swapcache && swapcache) {
> @@ -3150,7 +3150,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	pte_t entry;
>
>  	/* File mapping without ->vm_ops ? */
> -	if (vma->vm_flags & VM_SHARED)
> +	if (vmf->vma_flags & VM_SHARED)
Re: [PATCH v9 09/24] mm: protect mremap() against SPF hanlder
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> >> @@ -326,7 +336,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> >>  		mremap_userfaultfd_prep(new_vma, uf);
> >>  		arch_remap(mm, old_addr, old_addr + old_len,
> >>  			   new_addr, new_addr + new_len);
> >> +		if (vma != new_vma)
> >> +			vm_raw_write_end(vma);
> >>  	}
> >> +	vm_raw_write_end(new_vma);
> >
> > Just do
> >
> > 	vm_raw_write_end(vma);
> > 	vm_raw_write_end(new_vma);
> >
> > here.
>
> Are you sure? We can have vma = new_vma done if (unlikely(err)).

Sorry, what I meant was do

	if (vma != new_vma)
		vm_raw_write_end(vma);
	vm_raw_write_end(new_vma);

after the conditional. Having the locking unnecessarily embedded in the
conditional has been an issue in the past with other areas of core code,
unless you have a strong reason for it.
Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT
On Wed, 28 Mar 2018, Laurent Dufour wrote:

> > Putting this in mm/Kconfig is definitely the right way to go about it
> > instead of any generic option in arch/*.
> >
> > My question, though, was making this configurable by the user:
> >
> > config SPECULATIVE_PAGE_FAULT
> > 	bool "Speculative page faults"
> > 	depends on X86_64 || PPC
> > 	default y
> > 	help
> > 	  ..
> >
> > It's a question about whether we want this always enabled on x86_64
> > and power or whether the user should be able to disable it (right now
> > they can't). With a large feature like this, you may want to offer
> > something simple (disable CONFIG_SPECULATIVE_PAGE_FAULT) if someone
> > runs into regressions.
>
> I agree, but I think it would be important to get the per architecture
> enablement to avoid complex checks here. For instance in the case of
> powerpc this is only supported for PPC_BOOK3S_64.
>
> To avoid exposing such per architecture defines here, what do you think
> about having supporting architectures set
> ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT and having SPECULATIVE_PAGE_FAULT
> depend on this, like this:
>
> In mm/Kconfig:
> config SPECULATIVE_PAGE_FAULT
> 	bool "Speculative page faults"
> 	depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT && SMP
> 	default y
> 	help
> 	  ...
>
> In arch/powerpc/Kconfig:
> config PPC
> 	...
> 	select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
>
> In arch/x86/Kconfig:
> config X86_64
> 	...
> 	select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

Looks good to me! It feels like this will add more assurance that if
things regress for certain workloads it can be disabled. I don't feel
strongly about the default value; I'm ok with it being enabled by
default.
Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF
On Wed, 28 Mar 2018, Laurent Dufour wrote:
> >> @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
> >> 	int exclusive = 0;
> >> 	int ret = 0;
> >
> > Initialization is now unneeded.
>
> I'm sorry, what "initialization" are you talking about here ?

The initialization of the ret variable.

@@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+	ret = pte_unmap_same(vmf);
+	if (ret)
 		goto out;
 
 	entry = pte_to_swp_entry(vmf->orig_pte);

"ret" is immediately set to the return value of pte_unmap_same(), so there
is no need to initialize it to 0.
Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT
On Wed, 28 Mar 2018, Laurent Dufour wrote: > >> This configuration variable will be used to build the code needed to > >> handle speculative page fault. > >> > >> By default it is turned off, and activated depending on architecture > >> support. > >> > >> Suggested-by: Thomas Gleixner> >> Signed-off-by: Laurent Dufour > >> --- > >> mm/Kconfig | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> diff --git a/mm/Kconfig b/mm/Kconfig > >> index abefa573bcd8..07c566c88faf 100644 > >> --- a/mm/Kconfig > >> +++ b/mm/Kconfig > >> @@ -759,3 +759,6 @@ config GUP_BENCHMARK > >> performance of get_user_pages_fast(). > >> > >> See tools/testing/selftests/vm/gup_benchmark.c > >> + > >> +config SPECULATIVE_PAGE_FAULT > >> + bool > > > > Should this be configurable even if the arch supports it? > > Actually, this is not configurable unless by manually editing the .config > file. > > I made it this way on the Thomas's request : > https://lkml.org/lkml/2018/1/15/969 > > That sounds to be the smarter way to achieve that, isn't it ? > Putting this in mm/Kconfig is definitely the right way to go about it instead of any generic option in arch/*. My question, though, was making this configurable by the user: config SPECULATIVE_PAGE_FAULT bool "Speculative page faults" depends on X86_64 || PPC default y help .. It's a question about whether we want this always enabled on x86_64 and power or whether the user should be able to disable it (right now they can't). With a large feature like this, you may want to offer something simple (disable CONFIG_SPECULATIVE_PAGE_FAULT) if someone runs into regressions.
Re: [PATCH v9 09/24] mm: protect mremap() against SPF handler
On Tue, 13 Mar 2018, Laurent Dufour wrote: > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 88042d843668..ef6ef0627090 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct > anon_vma_chain *node); > extern int __vm_enough_memory(struct mm_struct *mm, long pages, int > cap_sys_admin); > extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start, > unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert, > - struct vm_area_struct *expand); > + struct vm_area_struct *expand, bool keep_locked); > static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start, > unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert) > { > - return __vma_adjust(vma, start, end, pgoff, insert, NULL); > + return __vma_adjust(vma, start, end, pgoff, insert, NULL, false); > } > -extern struct vm_area_struct *vma_merge(struct mm_struct *, > +extern struct vm_area_struct *__vma_merge(struct mm_struct *, > struct vm_area_struct *prev, unsigned long addr, unsigned long end, > unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, > - struct mempolicy *, struct vm_userfaultfd_ctx); > + struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked); > +static inline struct vm_area_struct *vma_merge(struct mm_struct *vma, > + struct vm_area_struct *prev, unsigned long addr, unsigned long end, > + unsigned long vm_flags, struct anon_vma *anon, struct file *file, > + pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff) > +{ > + return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off, > +pol, uff, false); > +} The first formal to vma_merge() is an mm, not a vma. This area could use an uncluttering. 
> extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); > extern int __split_vma(struct mm_struct *, struct vm_area_struct *, > unsigned long addr, int new_below); > diff --git a/mm/mmap.c b/mm/mmap.c > index d6533cb85213..ac32b577a0c9 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm, > */ > int __vma_adjust(struct vm_area_struct *vma, unsigned long start, > unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert, > - struct vm_area_struct *expand) > + struct vm_area_struct *expand, bool keep_locked) > { > struct mm_struct *mm = vma->vm_mm; > struct vm_area_struct *next = vma->vm_next, *orig_vma = vma; > @@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned > long start, > > if (next && next != vma) > vm_raw_write_end(next); > - vm_raw_write_end(vma); > + if (!keep_locked) > + vm_raw_write_end(vma); > > validate_mm(mm); > This will require a fixup for the following patch where a retval from anon_vma_close() can also return without vma locked even though keep_locked == true. How does vma_merge() handle that error wrt vm_raw_write_begin(vma)? > @@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, > unsigned long vm_flags, > * parameter) may establish ptes with the wrong permissions of > * instead of the right permissions of . 
> */ > -struct vm_area_struct *vma_merge(struct mm_struct *mm, > +struct vm_area_struct *__vma_merge(struct mm_struct *mm, > struct vm_area_struct *prev, unsigned long addr, > unsigned long end, unsigned long vm_flags, > struct anon_vma *anon_vma, struct file *file, > pgoff_t pgoff, struct mempolicy *policy, > - struct vm_userfaultfd_ctx vm_userfaultfd_ctx) > + struct vm_userfaultfd_ctx vm_userfaultfd_ctx, > + bool keep_locked) > { > pgoff_t pglen = (end - addr) >> PAGE_SHIFT; > struct vm_area_struct *area, *next; > @@ -1185,10 +1187,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, > /* cases 1, 6 */ > err = __vma_adjust(prev, prev->vm_start, >next->vm_end, prev->vm_pgoff, NULL, > - prev); > + prev, keep_locked); > } else /* cases 2, 5, 7 */ > err = __vma_adjust(prev, prev->vm_start, > - end, prev->vm_pgoff, NULL, prev); > +end, prev->vm_pgoff, NULL, prev, > +keep_locked); > if (err) > return NULL; > khugepaged_enter_vma_merge(prev, vm_flags); > @@ -1205,10 +1208,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm, >
Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count
On Tue, 13 Mar 2018, Laurent Dufour wrote: > diff --git a/mm/mmap.c b/mm/mmap.c > index 5898255d0aeb..d6533cb85213 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -847,17 +847,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned > long start, > } > > if (start != vma->vm_start) { > - vma->vm_start = start; > + WRITE_ONCE(vma->vm_start, start); > start_changed = true; > } > if (end != vma->vm_end) { > - vma->vm_end = end; > + WRITE_ONCE(vma->vm_end, end); > end_changed = true; > } > - vma->vm_pgoff = pgoff; > + WRITE_ONCE(vma->vm_pgoff, pgoff); > if (adjust_next) { > - next->vm_start += adjust_next << PAGE_SHIFT; > - next->vm_pgoff += adjust_next; > + WRITE_ONCE(next->vm_start, > +next->vm_start + (adjust_next << PAGE_SHIFT)); > + WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next); > } > > if (root) { > @@ -1781,6 +1782,7 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, > out: > perf_event_mmap(vma); > > + vm_write_begin(vma); > vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); > if (vm_flags & VM_LOCKED) { > if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || > @@ -1803,6 +1805,7 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, > vma->vm_flags |= VM_SOFTDIRTY; > > vma_set_page_prot(vma); > + vm_write_end(vma); > > return addr; > Shouldn't this also protect vma->vm_flags? diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1796,7 +1796,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma == get_gate_vma(current->mm))) mm->locked_vm += (len >> PAGE_SHIFT); else - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; + WRITE_ONCE(vma->vm_flags, + vma->vm_flags & VM_LOCKED_CLEAR_MASK); } if (file) @@ -1809,7 +1810,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * then new mapped in-place (which must be aimed as * a completely new data area). 
 	 */
-	vma->vm_flags |= VM_SOFTDIRTY;
+	WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY);
 
 	vma_set_page_prot(vma);
 	vm_write_end(vma);
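The pattern being requested in this review — publishing each `vm_flags` update with a single WRITE_ONCE()-style store inside a sequence-count write section, so a lockless (speculative) reader never observes a torn or mid-update value — can be sketched in user-space C. Everything below (the struct, the helper names, the bit value) is an illustrative stand-in, not the kernel's actual implementation:

```c
/* User-space model of WRITE_ONCE()-style flag updates guarded by a
 * sequence count.  Relaxed atomic stores stand in for WRITE_ONCE();
 * the odd/even counter stands in for vma->vm_sequence. */
#include <assert.h>
#include <stdatomic.h>

struct vma_model {
    atomic_ulong vm_flags;     /* stands in for vma->vm_flags */
    atomic_uint  vm_sequence;  /* stands in for vma->vm_sequence */
};

static void vm_write_begin(struct vma_model *v)
{
    /* odd count = write in progress */
    atomic_fetch_add_explicit(&v->vm_sequence, 1, memory_order_release);
}

static void vm_write_end(struct vma_model *v)
{
    atomic_fetch_add_explicit(&v->vm_sequence, 1, memory_order_release);
}

static void set_flag(struct vma_model *v, unsigned long bit)
{
    vm_write_begin(v);
    /* single whole-word store, like WRITE_ONCE(vma->vm_flags, ...) */
    atomic_store_explicit(&v->vm_flags,
        atomic_load_explicit(&v->vm_flags, memory_order_relaxed) | bit,
        memory_order_relaxed);
    vm_write_end(v);
}

/* A speculative reader samples the sequence, reads, and retries if a
 * writer was active or the count moved underneath it. */
static unsigned long read_flags(struct vma_model *v)
{
    unsigned int seq;
    unsigned long flags;

    do {
        seq = atomic_load_explicit(&v->vm_sequence, memory_order_acquire);
        flags = atomic_load_explicit(&v->vm_flags, memory_order_relaxed);
    } while ((seq & 1) ||
             seq != atomic_load_explicit(&v->vm_sequence,
                                         memory_order_acquire));
    return flags;
}
```

The design point of the review comment is that a plain `vma->vm_flags |= VM_SOFTDIRTY;` compiles to a read-modify-write the reader can race with, whereas the single store above always publishes a complete value.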
Re: [PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count
On Tue, 13 Mar 2018, Laurent Dufour wrote: > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 65ae54659833..a2d9c87b7b0b 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, > const char __user *buf, > goto out_mm; > } > for (vma = mm->mmap; vma; vma = vma->vm_next) { > - vma->vm_flags &= ~VM_SOFTDIRTY; > + vm_write_begin(vma); > + WRITE_ONCE(vma->vm_flags, > +vma->vm_flags & > ~VM_SOFTDIRTY); > vma_set_page_prot(vma); > + vm_write_end(vma); > } > downgrade_write(>mmap_sem); > break; > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index cec550c8468f..b8212ba17695 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct > list_head *fcs) > > octx = vma->vm_userfaultfd_ctx.ctx; > if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { > + vm_write_begin(vma); > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; > - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); > + WRITE_ONCE(vma->vm_flags, > +vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING)); > + vm_write_end(vma); > return 0; > } > In several locations in this patch vm_write_begin(vma) -> vm_write_end(vma) is nesting things other than vma->vm_flags, vma->vm_policy, etc. I think it's better to do vm_write_end(vma) as soon as the members that the seqcount protects are modified. In other words, this isn't offering protection for vma->vm_userfaultfd_ctx. There are several examples of this in the patch. 
> @@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, > struct file *file) > vma = prev; > else > prev = vma; > - vma->vm_flags = new_flags; > + vm_write_begin(vma); > + WRITE_ONCE(vma->vm_flags, new_flags); > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; > + vm_write_end(vma); > } > up_write(>mmap_sem); > mmput(mm); > @@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx > *ctx, >* the next vma was merged into the current one and >* the current one has not been updated yet. >*/ > - vma->vm_flags = new_flags; > + vm_write_begin(vma); > + WRITE_ONCE(vma->vm_flags, new_flags); > vma->vm_userfaultfd_ctx.ctx = ctx; > + vm_write_end(vma); > > skip: > prev = vma; > @@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct > userfaultfd_ctx *ctx, >* the next vma was merged into the current one and >* the current one has not been updated yet. >*/ > - vma->vm_flags = new_flags; > + vm_write_begin(vma); > + WRITE_ONCE(vma->vm_flags, new_flags); > vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; > + vm_write_end(vma); > > skip: > prev = vma; > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index b7e2268dfc9a..32314e9e48dd 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm, > if (mm_find_pmd(mm, address) != pmd) > goto out; > > + vm_write_begin(vma); > anon_vma_lock_write(vma->anon_vma); > > pte = pte_offset_map(pmd, address); > @@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm, > pmd_populate(mm, pmd, pmd_pgtable(_pmd)); > spin_unlock(pmd_ptl); > anon_vma_unlock_write(vma->anon_vma); > + vm_write_end(vma); > result = SCAN_FAIL; > goto out; > } > @@ -1075,6 +1077,7 @@ static void collapse_huge_page(struct mm_struct *mm, > set_pmd_at(mm, address, pmd, _pmd); > update_mmu_cache_pmd(vma, address, pmd); > spin_unlock(pmd_ptl); > + vm_write_end(vma); > > *hpage = NULL; > > diff --git a/mm/madvise.c b/mm/madvise.c > index 
4d3c922ea1a1..e328f7ab5942 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -184,7 +184,9 @@ static long madvise_behavior(struct vm_area_struct *vma, > /* >* vm_flags is protected by the mmap_sem held in write mode. >*/ > - vma->vm_flags = new_flags; > + vm_write_begin(vma); > + WRITE_ONCE(vma->vm_flags,
Re: [PATCH v9 07/24] mm: VMA sequence count
On Tue, 13 Mar 2018, Laurent Dufour wrote: > diff --git a/mm/mmap.c b/mm/mmap.c > index faf85699f1a1..5898255d0aeb 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -558,6 +558,10 @@ void __vma_link_rb(struct mm_struct *mm, struct > vm_area_struct *vma, > else > mm->highest_vm_end = vm_end_gap(vma); > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT > + seqcount_init(>vm_sequence); > +#endif > + > /* >* vma->vm_prev wasn't known when we followed the rbtree to find the >* correct insertion point for that vma. As a result, we could not > @@ -692,6 +696,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned > long start, > long adjust_next = 0; > int remove_next = 0; > > + /* > + * Why using vm_raw_write*() functions here to avoid lockdep's warning ? > + * > + * Locked is complaining about a theoretical lock dependency, involving > + * 3 locks: > + * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim > + * > + * Here are the major path leading to this dependency : > + * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem > + * 2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim > + * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem > + * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence > + * > + * So there is no way to solve this easily, especially because in > + * unmap_mapping_range() the i_mmap_rwsem is grab while the impacted > + * VMAs are not yet known. > + * However, the way the vm_seq is used is guarantying that we will > + * never block on it since we just check for its value and never wait > + * for it to move, see vma_has_changed() and handle_speculative_fault(). > + */ > + vm_raw_write_begin(vma); > + if (next) > + vm_raw_write_begin(next); > + > if (next && !insert) { > struct vm_area_struct *exporter = NULL, *importer = NULL; > Eek, what about later on: /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the * shrinking vma had, to cover any anon pages imported. 
	 */
	if (exporter && exporter->anon_vma && !importer->anon_vma) {
		int error;

		importer->anon_vma = exporter->anon_vma;
		error = anon_vma_clone(importer, exporter);
		if (error)
			return error;
	}

This needs

	if (error) {
		if (next && next != vma)
			vm_raw_write_end(next);
		vm_raw_write_end(vma);
		return error;
	}
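The fix being asked for — once vm_raw_write_begin() has been taken on vma (and possibly next), every return path, including the anon_vma_clone() failure, must drop the write sequence on both — can be modelled in user-space C. The types and helpers below are stand-ins for illustration only:

```c
/* Model of the error-path unwind: raw_write_depth counts the write
 * sequence nesting; it must be back to zero on every return. */
#include <assert.h>
#include <stddef.h>

struct vma_model {
    int raw_write_depth;  /* models vm_sequence write nesting */
};

static void vm_raw_write_begin(struct vma_model *v) { v->raw_write_depth++; }
static void vm_raw_write_end(struct vma_model *v)   { v->raw_write_depth--; }

/* Returns 0 on success, a negative errno-style value on failure. */
static int vma_adjust_model(struct vma_model *vma, struct vma_model *next,
                            int clone_fails)
{
    int error;

    vm_raw_write_begin(vma);
    if (next)
        vm_raw_write_begin(next);

    /* ... imagine the importer/exporter anon_vma dance here ... */
    error = clone_fails ? -12 /* -ENOMEM */ : 0;
    if (error) {
        /* the fix: unwind the write sequences before bailing out */
        if (next && next != vma)
            vm_raw_write_end(next);
        vm_raw_write_end(vma);
        return error;
    }

    /* normal completion path */
    if (next && next != vma)
        vm_raw_write_end(next);
    vm_raw_write_end(vma);
    return 0;
}
```

A bare `return error;` from the middle of the function, as in the quoted hunk, would leave both counts elevated — the model makes that leak directly observable.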
Re: [PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF
On Tue, 13 Mar 2018, Laurent Dufour wrote: > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2f3e98edc94a..b6432a261e63 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1199,6 +1199,7 @@ static inline void clear_page_pfmemalloc(struct page > *page) > #define VM_FAULT_NEEDDSYNC 0x2000 /* ->fault did not modify page tables >* and needs fsync() to complete (for >* synchronous page faults in DAX) */ > +#define VM_FAULT_PTNOTSAME 0x4000/* Page table entries have changed */ > > #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | > VM_FAULT_SIGSEGV | \ >VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \ > diff --git a/mm/memory.c b/mm/memory.c > index 21b1212a0892..4bc7b0bdcb40 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf) > * parts, do_swap_page must check under lock before unmapping the pte and > * proceeding (but do_wp_page is only called after already making such a > check; > * and do_anonymous_page can safely check later on). > + * > + * pte_unmap_same() returns: > + * 0 if the PTE are the same > + * VM_FAULT_PTNOTSAME if the PTE are different > + * VM_FAULT_RETRY if the VMA has changed in our back during > + * a speculative page fault handling. 
> */ > -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, > - pte_t *page_table, pte_t orig_pte) > +static inline int pte_unmap_same(struct vm_fault *vmf) > { > - int same = 1; > + int ret = 0; > + > #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT) > if (sizeof(pte_t) > sizeof(unsigned long)) { > - spinlock_t *ptl = pte_lockptr(mm, pmd); > - spin_lock(ptl); > - same = pte_same(*page_table, orig_pte); > - spin_unlock(ptl); > + if (pte_spinlock(vmf)) { > + if (!pte_same(*vmf->pte, vmf->orig_pte)) > + ret = VM_FAULT_PTNOTSAME; > + spin_unlock(vmf->ptl); > + } else > + ret = VM_FAULT_RETRY; > } > #endif > - pte_unmap(page_table); > - return same; > + pte_unmap(vmf->pte); > + return ret; > } > > static inline void cow_user_page(struct page *dst, struct page *src, > unsigned long va, struct vm_area_struct *vma) > @@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf) > int exclusive = 0; > int ret = 0; Initialization is now unneeded. Otherwise: Acked-by: David Rientjes <rient...@google.com>
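The calling-convention change reviewed above — pte_unmap_same() returning a fault code (0, "PTE not same", or "retry") instead of a boolean, so the caller can assign its return value directly and the old `int ret = 0;` initialization becomes dead — can be sketched in user-space C. The constants and struct below are illustrative stand-ins for the kernel's VM_FAULT_* codes, not the real values' semantics:

```c
/* Model of a validation helper that reports *why* it failed via a
 * fault code, rather than collapsing everything into a boolean. */
#include <assert.h>

#define FAULT_OK         0x0
#define FAULT_PTNOTSAME  0x4000  /* page table entry changed under us */
#define FAULT_RETRY      0x0400  /* VMA changed behind our back (SPF) */

struct fault_model {
    unsigned long pte;       /* current PTE value */
    unsigned long orig_pte;  /* value sampled earlier */
    int vma_changed;         /* models a failed speculative pte_spinlock() */
};

static int pte_unmap_same_model(struct fault_model *f)
{
    if (f->vma_changed)
        return FAULT_RETRY;
    if (f->pte != f->orig_pte)
        return FAULT_PTNOTSAME;
    return FAULT_OK;
}

/* Caller pattern from the hunk: assign directly, no "ret = 0" needed. */
static int do_swap_model(struct fault_model *f)
{
    int ret = pte_unmap_same_model(f);

    if (ret)
        return ret;  /* "goto out" in the real code */
    /* ... proceed with the swap-in ... */
    return FAULT_OK;
}
```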
Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event
On Tue, 13 Mar 2018, Laurent Dufour wrote: > Add support for the new speculative faults event. > > Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com> Acked-by: David Rientjes <rient...@google.com> Aside: should there be a new spec_flt field for struct task_struct that complements maj_flt and min_flt and be exported through /proc/pid/stat?
Re: [PATCH v9 20/24] perf: Add a speculative page fault sw event
On Tue, 13 Mar 2018, Laurent Dufour wrote: > Add a new software event to count succeeded speculative page faults. > > Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com> Acked-by: David Rientjes <rient...@google.com>
Re: [PATCH v9 23/24] x86/mm: Add speculative pagefault handling
On Tue, 13 Mar 2018, Laurent Dufour wrote:
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index e6af2b464c3d..a73cf227edd6 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1239,6 +1239,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address)
>  {
>  	struct vm_area_struct *vma;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	struct vm_area_struct *spf_vma = NULL;
> +#endif
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
>  	int fault, major = 0;
> @@ -1332,6 +1335,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
>  	if (error_code & X86_PF_INSTR)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	if ((error_code & X86_PF_USER) && (atomic_read(&mm->mm_users) > 1)) {
> +		fault = handle_speculative_fault(mm, address, flags,
> +						 &spf_vma);
> +
> +		if (!(fault & VM_FAULT_RETRY)) {
> +			if (!(fault & VM_FAULT_ERROR)) {
> +				perf_sw_event(PERF_COUNT_SW_SPF, 1,
> +					      regs, address);
> +				goto done;
> +			}
> +			/*
> +			 * In case of error we need the pkey value, but
> +			 * can't get it from the spf_vma as it is only returned
> +			 * when VM_FAULT_RETRY is returned. So we have to
> +			 * retry the page fault with the mmap_sem grabbed.
> +			 */
> +		}
> +	}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

All the comments from the powerpc version will apply here as well; the only
interesting point is whether VM_FAULT_FALLBACK can be returned from
handle_speculative_fault() to indicate it's not possible.

> +
> 	/*
> 	 * When running in the kernel we expect faults to occur only to
> 	 * addresses in user space.
All other faults represent errors in > @@ -1365,7 +1389,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long > error_code, > might_sleep(); > } > > - vma = find_vma(mm, address); > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT > + if (spf_vma) { > + if (can_reuse_spf_vma(spf_vma, address)) > + vma = spf_vma; > + else > + vma = find_vma(mm, address); > + spf_vma = NULL; > + } else > +#endif > + vma = find_vma(mm, address); > if (unlikely(!vma)) { > bad_area(regs, error_code, address); > return; > @@ -1451,6 +1484,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long > error_code, > return; > } > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT > +done: > +#endif > /* >* Major/minor page fault accounting. If any of the events >* returned VM_FAULT_MAJOR, we account it as a major fault.
Re: [PATCH v9 24/24] powerpc/mm: Add speculative page fault
On Tue, 13 Mar 2018, Laurent Dufour wrote:
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 866446cf2d9a..104f3cc86b51 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -392,6 +392,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
>  			   unsigned long error_code)
>  {
>  	struct vm_area_struct * vma;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	struct vm_area_struct *spf_vma = NULL;
> +#endif
>  	struct mm_struct *mm = current->mm;
>  	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>  	int is_exec = TRAP(regs) == 0x400;
> @@ -459,6 +462,20 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
>  	if (is_exec)
>  		flags |= FAULT_FLAG_INSTRUCTION;
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	if (is_user && (atomic_read(&mm->mm_users) > 1)) {
> +		/* let's try a speculative page fault without grabbing the
> +		 * mmap_sem.
> +		 */
> +		fault = handle_speculative_fault(mm, address, flags, &spf_vma);
> +		if (!(fault & VM_FAULT_RETRY)) {
> +			perf_sw_event(PERF_COUNT_SW_SPF, 1,
> +				      regs, address);
> +			goto done;
> +		}
> +	}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +

Can't you eliminate all #ifdef's in this patch if handle_speculative_fault()
can be passed is_user and return some error code that fallback is needed?
Maybe reuse VM_FAULT_FALLBACK?

> 	/* When running in the kernel we expect faults to occur only to
> 	 * addresses in user space.  All other faults represent errors in the
> 	 * kernel and should generate an OOPS.
> 	 * Unfortunately, in the case of an
> @@ -489,7 +506,16 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
>  		might_sleep();
>  	}
>  
> -	vma = find_vma(mm, address);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	if (spf_vma) {
> +		if (can_reuse_spf_vma(spf_vma, address))
> +			vma = spf_vma;
> +		else
> +			vma = find_vma(mm, address);
> +		spf_vma = NULL;
> +	} else
> +#endif
> +	vma = find_vma(mm, address);
>  	if (unlikely(!vma))
>  		return bad_area(regs, address);
>  	if (likely(vma->vm_start <= address))

I think the code quality here could be improved such that you can pass mm,
&spf_vma, and address, and some helper function would return spf_vma if
can_reuse_spf_vma() is true (and set *spf_vma to NULL) or otherwise return
find_vma(mm, address).

Also, spf_vma is being set to NULL because of VM_FAULT_RETRY, but does it
make sense to retry handle_speculative_fault() in this case since we've
dropped mm->mmap_sem and there may have been a writer queued behind it?

> @@ -568,6 +594,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
>  
>  	up_read(&current->mm->mmap_sem);
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +done:
> +#endif
>  	if (unlikely(fault & VM_FAULT_ERROR))
>  		return mm_fault_error(regs, address, fault);
>  

And things like this are trivially handled by doing

	done: __maybe_unused
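The helper being suggested — hide the "reuse the VMA returned by the speculative path, else fall back to find_vma()" logic behind one call, clearing the speculative pointer so it can never be reused stale — can be sketched in user-space C. The types, the helper name, and the `can_reuse` check below are all hypothetical stand-ins for illustration:

```c
/* Model of consolidating the spf_vma-or-find_vma pattern into one
 * helper, removing the #ifdef from the fault-handler call site. */
#include <assert.h>
#include <stddef.h>

struct vma_model {
    unsigned long vm_start, vm_end;
};

struct mm_model {
    struct vma_model *map;  /* stand-in for the mm's VMA tree */
};

static int can_reuse_spf_vma(struct vma_model *v, unsigned long addr)
{
    return v && addr >= v->vm_start && addr < v->vm_end;
}

static struct vma_model *find_vma_model(struct mm_model *mm,
                                        unsigned long addr)
{
    (void)addr;
    return mm->map;  /* stand-in for the real tree walk */
}

/* The suggested consolidation: one helper, no #ifdef at the call site. */
static struct vma_model *spf_or_find_vma(struct mm_model *mm,
                                         struct vma_model **spf_vma,
                                         unsigned long addr)
{
    struct vma_model *vma;

    if (*spf_vma && can_reuse_spf_vma(*spf_vma, addr))
        vma = *spf_vma;
    else
        vma = find_vma_model(mm, addr);
    *spf_vma = NULL;  /* never reuse a stale speculative VMA */
    return vma;
}
```

Clearing `*spf_vma` unconditionally inside the helper is the point: the caller cannot forget it, which is exactly the bookkeeping the open-coded version duplicates at each fault site.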
Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key
On Wed, 21 Mar 2018, Laurent Dufour wrote:
> I found the root cause of this lockdep warning.
>
> In mmap_region(), unmap_region() may be called while vma_link() has not
> been called. This happens during the error path if call_mmap() failed.
>
> The only way to fix that particular case is to call
> seqcount_init(&vma->vm_sequence) when initializing the vma in
> mmap_region().

Ack, although that would require a fixup to dup_mmap() as well.
Re: [PATCH v9 05/24] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
On Tue, 13 Mar 2018, Laurent Dufour wrote: > When handling page fault without holding the mmap_sem the fetch of the > pte lock pointer and the locking will have to be done while ensuring > that the VMA is not touched in our back. > > So move the fetch and locking operations in a dedicated function. > > Signed-off-by: Laurent Dufour> --- > mm/memory.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 8ac241b9f370..21b1212a0892 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned > long addr, > } > EXPORT_SYMBOL_GPL(apply_to_page_range); > > +static bool pte_spinlock(struct vm_fault *vmf) inline? > +{ > + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); > + spin_lock(vmf->ptl); > + return true; > +} > + > static bool pte_map_lock(struct vm_fault *vmf) > { > vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, Shouldn't pte_unmap_same() take struct vm_fault * and use the new pte_spinlock()?
Re: [PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE
On Tue, 13 Mar 2018, Laurent Dufour wrote: > From: Peter Zijlstra> > When speculating faults (without holding mmap_sem) we need to validate > that the vma against which we loaded pages is still valid when we're > ready to install the new PTE. > > Therefore, replace the pte_offset_map_lock() calls that (re)take the > PTL with pte_map_lock() which can fail in case we find the VMA changed > since we started the fault. > Based on how its used, I would have suspected this to be named pte_map_trylock(). > Signed-off-by: Peter Zijlstra (Intel) > > [Port to 4.12 kernel] > [Remove the comment about the fault_env structure which has been > implemented as the vm_fault structure in the kernel] > [move pte_map_lock()'s definition upper in the file] > Signed-off-by: Laurent Dufour > --- > include/linux/mm.h | 1 + > mm/memory.c| 56 > ++ > 2 files changed, 41 insertions(+), 16 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 4d02524a7998..2f3e98edc94a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -300,6 +300,7 @@ extern pgprot_t protection_map[16]; > #define FAULT_FLAG_USER 0x40/* The fault originated in > userspace */ > #define FAULT_FLAG_REMOTE0x80/* faulting for non current tsk/mm */ > #define FAULT_FLAG_INSTRUCTION 0x100/* The fault was during an > instruction fetch */ > +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not > holding mmap_sem */ > > #define FAULT_FLAG_TRACE \ > { FAULT_FLAG_WRITE, "WRITE" }, \ I think FAULT_FLAG_SPECULATIVE should be introduced in the patch that actually uses it. > diff --git a/mm/memory.c b/mm/memory.c > index e0ae4999c824..8ac241b9f370 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned > long addr, > } > EXPORT_SYMBOL_GPL(apply_to_page_range); > > +static bool pte_map_lock(struct vm_fault *vmf) inline? 
> +{ > + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, > +vmf->address, >ptl); > + return true; > +} > + > /* > * handle_pte_fault chooses page fault handler according to an entry which > was > * read non-atomically. Before making any commitment, on those architectures > @@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf) > const unsigned long mmun_start = vmf->address & PAGE_MASK; > const unsigned long mmun_end = mmun_start + PAGE_SIZE; > struct mem_cgroup *memcg; > + int ret = VM_FAULT_OOM; > > if (unlikely(anon_vma_prepare(vma))) > goto oom; > @@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf) > /* >* Re-check the pte - we dropped the lock >*/ > - vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, >ptl); > + if (!pte_map_lock(vmf)) { > + mem_cgroup_cancel_charge(new_page, memcg, false); > + ret = VM_FAULT_RETRY; > + goto oom_free_new; > + } Ugh, but we aren't oom here, so maybe rename oom_free_new so that it makes sense for return values other than VM_FAULT_OOM? > if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { > if (old_page) { > if (!PageAnon(old_page)) { > @@ -2596,7 +2608,7 @@ static int wp_page_copy(struct vm_fault *vmf) > oom: > if (old_page) > put_page(old_page); > - return VM_FAULT_OOM; > + return ret; > } > > /** > @@ -2617,8 +2629,8 @@ static int wp_page_copy(struct vm_fault *vmf) > int finish_mkwrite_fault(struct vm_fault *vmf) > { > WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED)); > - vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, > ->ptl); > + if (!pte_map_lock(vmf)) > + return VM_FAULT_RETRY; > /* >* We might have raced with another page fault while we released the >* pte_offset_map_lock. 
> @@ -2736,8 +2748,11 @@ static int do_wp_page(struct vm_fault *vmf) > get_page(vmf->page); > pte_unmap_unlock(vmf->pte, vmf->ptl); > lock_page(vmf->page); > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, > - vmf->address, >ptl); > + if (!pte_map_lock(vmf)) { > + unlock_page(vmf->page); > + put_page(vmf->page); > + return VM_FAULT_RETRY; > + } > if (!pte_same(*vmf->pte, vmf->orig_pte)) { > unlock_page(vmf->page); > pte_unmap_unlock(vmf->pte, vmf->ptl); > @@ -2947,8 +2962,10 @@ int
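The trylock-style pte_map_lock() pattern under review — the lock attempt can fail in the speculative case because the VMA changed, and every caller must unwind whatever it took (page references, page lock) before returning a retry code — can be modelled in user-space C. All names and values below are illustrative stand-ins, not the kernel's:

```c
/* Model of a fallible lock helper and a caller that must balance its
 * page reference and lock state on the failure path. */
#include <assert.h>

#define FAULT_RETRY 0x0400

struct fault_model {
    int speculative;      /* models FAULT_FLAG_SPECULATIVE */
    int vma_seq_changed;  /* models vma_has_changed() firing */
    int pte_locked;
    int page_refs;        /* models get_page()/put_page() balance */
};

static int pte_map_lock_model(struct fault_model *f)
{
    if (f->speculative && f->vma_seq_changed)
        return 0;  /* trylock failed: VMA changed under us */
    f->pte_locked = 1;
    return 1;
}

static int wp_page_model(struct fault_model *f)
{
    f->page_refs++;  /* get_page(vmf->page) */
    if (!pte_map_lock_model(f)) {
        f->page_refs--;  /* put_page() on the failure path */
        return FAULT_RETRY;
    }
    /* ... re-check pte_same() and proceed with the write fault ... */
    f->pte_locked = 0;  /* pte_unmap_unlock() */
    f->page_refs--;
    return 0;
}
```

This is why the rename to something like pte_map_trylock() was floated in the review: a name that can fail forces callers to think about exactly this unwind.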
Re: [PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT
On Tue, 13 Mar 2018, Laurent Dufour wrote: > This configuration variable will be used to build the code needed to > handle speculative page fault. > > By default it is turned off, and activated depending on architecture > support. > > Suggested-by: Thomas Gleixner> Signed-off-by: Laurent Dufour > --- > mm/Kconfig | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/Kconfig b/mm/Kconfig > index abefa573bcd8..07c566c88faf 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -759,3 +759,6 @@ config GUP_BENCHMARK > performance of get_user_pages_fast(). > > See tools/testing/selftests/vm/gup_benchmark.c > + > +config SPECULATIVE_PAGE_FAULT > + bool Should this be configurable even if the arch supports it?
Re: [PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more
On Wed, 12 Apr 2017, Oliver O'Halloran wrote:
> Depending on the flags of the PMD being zapped there may or may not be a
> deposited pgtable to be freed. In two of the three cases this is open
> coded while the third uses the zap_deposited_table() helper. This patch
> converts the others to use the helper to clean things up a bit.
>
> Cc: "Aneesh Kumar K.V" <aneesh.ku...@linux.vnet.ibm.com>
> Cc: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
> Cc: linux...@kvack.org
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>

Acked-by: David Rientjes <rient...@google.com>
[no subject]
--- Begin Message --- On Thu, 28 Jan 2016, David Rientjes wrote: > On Thu, 28 Jan 2016, Christian Borntraeger wrote: > > > Indeed, I only touched the identity mapping and dump stack. > > The question is do we really want to change free_init_pages as well? > > The unmapping during runtime causes significant overhead, but the > > unmapping after init imposes almost no runtime overhead. Of course, > > things get fishy now as what is enabled and what not. > > > > Kconfig after my patch "mm/debug_pagealloc: Ask users for default setting > > of debug_pagealloc" > > (in mm) now states > > snip > > By default this option will have a small overhead, e.g. by not > > allowing the kernel mapping to be backed by large pages on some > > architectures. Even bigger overhead comes when the debugging is > > enabled by DEBUG_PAGEALLOC_ENABLE_DEFAULT or the debug_pagealloc > > command line parameter. > > snip > > > > So I am tempted to NOT change free_init_pages, but the x86 maintainers > > can certainly decide differently. Ingo, Thomas, H. Peter, please advise. > > > > I'm sorry, but I thought the discussion of the previous version of the > patchset led to deciding that all CONFIG_DEBUG_PAGEALLOC behavior would be > controlled by being enabled on the commandline and checked with > debug_pagealloc_enabled(). > > I don't think we should have a CONFIG_DEBUG_PAGEALLOC that does some stuff > and then a commandline parameter or CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT > to enable more stuff. It should either be all enabled by the commandline > (or config option) or split into a separate entity. > CONFIG_DEBUG_PAGEALLOC_LIGHT and CONFIG_DEBUG_PAGEALLOC would be fine, but > the current state is very confusing about what is being done and what > isn't. > Ping? --- End Message --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
--- Begin Message --- On Thu, 28 Jan 2016, Christian Borntraeger wrote: > Indeed, I only touched the identity mapping and dump stack. > The question is do we really want to change free_init_pages as well? > The unmapping during runtime causes significant overhead, but the > unmapping after init imposes almost no runtime overhead. Of course, > things get fishy now as what is enabled and what not. > > Kconfig after my patch "mm/debug_pagealloc: Ask users for default setting of > debug_pagealloc" > (in mm) now states > snip > By default this option will have a small overhead, e.g. by not > allowing the kernel mapping to be backed by large pages on some > architectures. Even bigger overhead comes when the debugging is > enabled by DEBUG_PAGEALLOC_ENABLE_DEFAULT or the debug_pagealloc > command line parameter. > snip > > So I am tempted to NOT change free_init_pages, but the x86 maintainers > can certainly decide differently. Ingo, Thomas, H. Peter, please advise. > I'm sorry, but I thought the discussion of the previous version of the patchset led to deciding that all CONFIG_DEBUG_PAGEALLOC behavior would be controlled by being enabled on the commandline and checked with debug_pagealloc_enabled(). I don't think we should have a CONFIG_DEBUG_PAGEALLOC that does some stuff and then a commandline parameter or CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT to enable more stuff. It should either be all enabled by the commandline (or config option) or split into a separate entity. CONFIG_DEBUG_PAGEALLOC_LIGHT and CONFIG_DEBUG_PAGEALLOC would be fine, but the current state is very confusing about what is being done and what isn't. It also wouldn't hurt to enumerate what is enabled and what isn't enabled in the Kconfig entry. --- End Message --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
--- Begin Message --- On Wed, 27 Jan 2016, Christian Borntraeger wrote: > We can use debug_pagealloc_enabled() to check if we can map > the identity mapping with 2MB pages. We can also add the state > into the dump_stack output. > > The patch does not touch the code for the 1GB pages, which ignored > CONFIG_DEBUG_PAGEALLOC. Do we need to fence this as well? > > Signed-off-by: Christian Borntraeger> Reviewed-by: Thomas Gleixner > --- > arch/x86/kernel/dumpstack.c | 5 ++--- > arch/x86/mm/init.c | 7 --- > arch/x86/mm/pageattr.c | 14 -- > 3 files changed, 10 insertions(+), 16 deletions(-) > > diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c > index 9c30acf..32e5699 100644 > --- a/arch/x86/kernel/dumpstack.c > +++ b/arch/x86/kernel/dumpstack.c > @@ -265,9 +265,8 @@ int __die(const char *str, struct pt_regs *regs, long err) > #ifdef CONFIG_SMP > printk("SMP "); > #endif > -#ifdef CONFIG_DEBUG_PAGEALLOC > - printk("DEBUG_PAGEALLOC "); > -#endif > + if (debug_pagealloc_enabled()) > + printk("DEBUG_PAGEALLOC "); > #ifdef CONFIG_KASAN > printk("KASAN"); > #endif > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c > index 493f541..39823fd 100644 > --- a/arch/x86/mm/init.c > +++ b/arch/x86/mm/init.c > @@ -150,13 +150,14 @@ static int page_size_mask; > > static void __init probe_page_size_mask(void) > { > -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) > +#if !defined(CONFIG_KMEMCHECK) > /* > - * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages. > + * For CONFIG_KMEMCHECK or pagealloc debugging, identity mapping will > + * use small pages. >* This will simplify cpa(), which otherwise needs to support splitting >* large pages into small in interrupt context, etc. >*/ > - if (cpu_has_pse) > + if (cpu_has_pse && !debug_pagealloc_enabled()) > page_size_mask |= 1 << PG_LEVEL_2M; > #endif > I would have thought free_init_pages() would be modified to use debug_pagealloc_enabled() as well? 
--- End Message --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
--- Begin Message --- On Wed, 27 Jan 2016, Christian Borntraeger wrote: > We can use debug_pagealloc_enabled() to check if we can map > the identity mapping with 1MB/2GB pages as well as to print > the current setting in dump_stack. > > Signed-off-by: Christian Borntraeger <borntrae...@de.ibm.com> > Reviewed-by: Heiko Carstens <heiko.carst...@de.ibm.com> Acked-by: David Rientjes <rient...@google.com> --- End Message --- ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node
On Thu, 30 Jul 2015, Vlastimil Babka wrote: diff --git a/mm/huge_memory.c b/mm/huge_memory.c index aa58a32..56355f2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2469,7 +2469,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm, */ up_read(&mm->mmap_sem); - *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER); + *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER); if (unlikely(!*hpage)) { count_vm_event(THP_COLLAPSE_ALLOC_FAILED); *hpage = ERR_PTR(-ENOMEM); @@ -2568,9 +2568,7 @@ static void collapse_huge_page(struct mm_struct *mm, VM_BUG_ON(address & ~HPAGE_PMD_MASK); - /* Only allocate from the target node */ - gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) | - __GFP_THISNODE; + gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), 0); /* release the mmap_sem read lock. */ new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node); Hmm, where is the __GFP_THISNODE enforcement in khugepaged_alloc_page() that is removed in collapse_huge_page()? I also don't see what happened to the __GFP_OTHER_NODE.
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Thu, 23 Jul 2015, Christoph Lameter wrote: The only possible downside would be existing users of alloc_pages_node() that are calling it with an offline node. Since it's a VM_BUG_ON() that would catch that, I think it should be changed to a VM_WARN_ON() and eventually fixed up because it's nonsensical. VM_BUG_ON() here should be avoided. The offline node thing could be addressed by using numa_mem_id()? I was concerned about any callers that were passing an offline node, not NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON() for it, the other silently calls node_zonelist() on it. I suppose the final alloc_pages_node() implementation could be if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid))) nid = numa_mem_id(); VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); though.
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Wed, 22 Jul 2015, Paolo Bonzini wrote: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2d73807..a8723a8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu) struct page *pages; struct vmcs *vmcs; - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order); + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order); if (!pages) return NULL; vmcs = page_address(pages); Even though there's a pretty strong preference for the right node, things can work if the node is the wrong one. The order is always zero in practice, so the allocation should succeed. Your code is fine both before and after the patch since __GFP_THISNODE isn't set. The allocation will eventually succeed but, as you said, may be from remote memory (and the success of allocating on node may be influenced by the global setting of zone_reclaim_mode).
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Wed, 22 Jul 2015, Vlastimil Babka wrote: alloc_pages_exact_node(), as you said, connotates that the allocation will take place on that node or will fail. So why not go beyond this patch and actually make alloc_pages_exact_node() set __GFP_THISNODE and then call into a new alloc_pages_prefer_node(), which would be the current alloc_pages_exact_node() implementation, and then fix up the callers? OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and alloc_pages_exact_node(). Isn't that a bit too much? The first two differ only in a tiny bit: static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { /* Unknown node is current node */ if (nid < 0) nid = numa_node_id(); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); } static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask, unsigned int order) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); } Eek, yeah, that does look bad. I'm not even sure the if (nid < 0) nid = numa_node_id(); is correct; I think this should be comparing to NUMA_NO_NODE rather than all negative numbers, otherwise we silently ignore overflow and nobody ever knows. So _prefer_node is just a tiny optimization over the other one. It should be maybe called __alloc_pages_node() then? This would perhaps discourage users outside of mm/arch code (where it may matter). The savings of a skipped branch is likely dubious anyway... It would be also nice if alloc_pages_node() could use __alloc_pages_node() internally, but I'm not sure if all callers are safe wrt the VM_BUG_ON(!node_online(nid)) part.
I'm not sure how large you want to make your patch :) In a perfect world I would think that we wouldn't have an alloc_pages_prefer_node() at all and everything would be converted to alloc_pages_node() which would do if (nid == NUMA_NO_NODE) nid = numa_mem_id(); VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); and then alloc_pages_exact_node() would do return alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, order); and existing alloc_pages_exact_node() callers fixed up depending on whether they set the bit or not. The only possible downside would be existing users of alloc_pages_node() that are calling it with an offline node. Since it's a VM_BUG_ON() that would catch that, I think it should be changed to a VM_WARN_ON() and eventually fixed up because it's nonsensical. VM_BUG_ON() here should be avoided. Or just go with a single alloc_pages_node() and rename __GFP_THISNODE to __GFP_EXACT_NODE which may accomplish the same thing :)
Re: [PATCH] mm: rename and document alloc_pages_exact_node
On Tue, 21 Jul 2015, Vlastimil Babka wrote: The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 (page allocator: do not check NUMA node ID when the caller knows the node is valid) as an optimized variant of alloc_pages_node(), that doesn't allow the node id to be -1. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node. In truth, the node is only preferred, unless __GFP_THISNODE is among the gfp flags. The misleading name has led to mistakes in the past, see 5265047ac301 (mm, thp: really limit transparent hugepage allocation to local node) and b360edb43f8e (mm, mempolicy: migrate_to_node should only migrate to node). To prevent further mistakes, this patch renames the function to alloc_pages_prefer_node() and documents it together with alloc_pages_node(). alloc_pages_exact_node(), as you said, connotates that the allocation will take place on that node or will fail. So why not go beyond this patch and actually make alloc_pages_exact_node() set __GFP_THISNODE and then call into a new alloc_pages_prefer_node(), which would be the current alloc_pages_exact_node() implementation, and then fix up the callers?
Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
On Fri, 10 Jul 2015, Nishanth Aravamudan wrote: After the percpu areas are initialized and cpu_to_node() is correct, it would be really nice to be able to make numa_cpu_lookup_table[] be __initdata since it shouldn't be necessary anymore. That probably has cpu callbacks that need to be modified to no longer look at numa_cpu_lookup_table[] or pass the value in, but it would make it much cleaner. Then nobody will have to worry about figuring out whether early_cpu_to_node() or cpu_to_node() is the right one to call. When I worked on the original pcpu patches for power, I wanted to do this, but got myself confused and never came back to it. Thank you for suggesting it and I'll work on it soon. Great, thanks for taking it on! I have powerpc machines so I can test this and try to help where possible.
Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot
On Thu, 2 Jul 2015, Nishanth Aravamudan wrote: Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot with early calls to cpu_to_node(). The value returned by those calls now depend on the per-cpu area being setup, but that is not guaranteed to be the case during boot. Instead, we need to add an early_cpu_to_node() which doesn't use the per-CPU area and call that from certain spots that are known to invoke cpu_to_node() before the per-CPU areas are not configured. On an example 2-node NUMA system with the following topology: available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 2029 MB node 0 free: 1753 MB node 1 cpus: 4 5 6 7 node 1 size: 2045 MB node 1 free: 1945 MB node distances: node 0 1 0: 10 40 1: 40 10 we currently emit at boot: [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 After this commit, we correctly emit: [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 5f1048e..f2c4c89 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus) extern int __node_distance(int, int); #define node_distance(a, b) __node_distance(a, b) +extern int early_cpu_to_node(int); + extern void __init dump_numa_cpu_topology(void); extern int sysfs_add_device_to_node(struct device *dev, int nid); diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c index c69671c..23a2cf3 100644 --- a/arch/powerpc/kernel/setup_64.c +++ b/arch/powerpc/kernel/setup_64.c @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p) static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align) { - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align, - __pa(MAX_DMA_ADDRESS)); + return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size, + 
align, __pa(MAX_DMA_ADDRESS)); } static void __init pcpu_fc_free(void *ptr, size_t size) @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size) static int pcpu_cpu_distance(unsigned int from, unsigned int to) { - if (cpu_to_node(from) == cpu_to_node(to)) + if (early_cpu_to_node(from) == early_cpu_to_node(to)) return LOCAL_DISTANCE; else return REMOTE_DISTANCE; diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 5e80621..9ffabf4 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node) cpumask_set_cpu(cpu, node_to_cpumask_map[node]); } +int early_cpu_to_node(int cpu) +{ + return numa_cpu_lookup_table[cpu]; +} + #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { early_cpu_to_node() looks like it's begging to be __init since we shouldn't have a need to reference numa_cpu_lookup_table after boot and that appears like it can be done if pcpu_cpu_distance() is made __init in this patch and smp_prepare_boot_cpu() is made __init in the next patch. So I think this is fine, but those functions and things like reset_numa_cpu_lookup_table() should be in init.text. After the percpu areas are initialized and cpu_to_node() is correct, it would be really nice to be able to make numa_cpu_lookup_table[] be __initdata since it shouldn't be necessary anymore. That probably has cpu callbacks that need to be modified to no longer look at numa_cpu_lookup_table[] or pass the value in, but it would make it much cleaner. Then nobody will have to worry about figuring out whether early_cpu_to_node() or cpu_to_node() is the right one to call.
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On Wed, 8 Jul 2015, Nishanth Aravamudan wrote: It looks like the symptom is that the per-cpu areas are all allocated on node 0, is that all that goes wrong? Yes, that's the symptom. I cc'd a few folks to see if they could help indicate the performance implications of such a setup -- sorry, I should have been more explicit about that. Yeah, not sure it's really a bugfix but rather a performance optimization since cpu_to_node() with CONFIG_USE_PERCPU_NUMA_NODE_ID is only about performance. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table
On Thu, 2 Jul 2015, Nishanth Aravamudan wrote: A simple move to a wrapper function to numa_cpu_lookup_table, now that power has the early_cpu_to_node() API. Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com When early_cpu_to_node() is __init: Acked-by: David Rientjes rient...@google.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On Thu, 5 Mar 2015, Nishanth Aravamudan wrote: diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 0257a7d659ef..24de29b3651b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -958,9 +958,17 @@ void __init initmem_init(void) memblock_dump_all(); + /* + * zero out the possible nodes after we parse the device-tree, + * so that we lower the maximum NUMA node ID to what is actually + * present. + */ + nodes_clear(node_possible_map); + for_each_online_node(nid) { unsigned long start_pfn, end_pfn; + node_set(nid, node_possible_map); get_pfn_range_for_nid(nid, start_pfn, end_pfn); setup_node_data(nid, start_pfn, end_pfn); sparse_memory_present_with_active_regions(nid); This seems a bit strange, node_possible_map is supposed to be a superset of node_online_map and this loop is iterating over node_online_map to set nodes in node_possible_map. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On Thu, 5 Mar 2015, Nishanth Aravamudan wrote: So if we compare to x86: arch/x86/mm/numa.c::numa_init(): nodes_clear(numa_nodes_parsed); nodes_clear(node_possible_map); nodes_clear(node_online_map); ... numa_register_memblks(...); arch/x86/mm/numa.c::numa_register_memblks(): node_possible_map = numa_nodes_parsed; Basically, it looks like x86 NUMA init clears out the possible map and online map, probably for a similar reason to what I gave in the changelog: that by default, the possible map seems to be based off MAX_NUMNODES, rather than nr_node_ids or anything dynamic. My patch was an attempt to emulate the same thing on powerpc. You are right that there is a window in which the node_possible_map and node_online_map are out of sync with my patch. It seems like it shouldn't matter given how early in boot we are, but perhaps the following would have been clearer: diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 0257a7d659ef..1a118b08fad2 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -958,6 +958,13 @@ void __init initmem_init(void) memblock_dump_all(); + /* + * Reduce the possible NUMA nodes to the online NUMA nodes, + * since we do not support node hotplug. This ensures that we + * lower the maximum NUMA node ID to what is actually present. + */ + nodes_and(node_possible_map, node_possible_map, node_online_map); If you don't support node hotplug, then a node should always be possible if it's online unless there are other tricks powerpc plays with node_possible_map. Shouldn't this just be node_possible_map = node_online_map? + for_each_online_node(nid) { unsigned long start_pfn, end_pfn;
Re: [RFC PATCH] powerpc/numa: reset node_possible_map to only node_online_map
On Fri, 6 Mar 2015, Michael Ellerman wrote: diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 0257a7d659ef..24de29b3651b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -958,9 +958,17 @@ void __init initmem_init(void) memblock_dump_all(); + /* + * zero out the possible nodes after we parse the device-tree, + * so that we lower the maximum NUMA node ID to what is actually + * present. + */ + nodes_clear(node_possible_map); + for_each_online_node(nid) { unsigned long start_pfn, end_pfn; + node_set(nid, node_possible_map); get_pfn_range_for_nid(nid, start_pfn, end_pfn); setup_node_data(nid, start_pfn, end_pfn); sparse_memory_present_with_active_regions(nid); This seems a bit strange, node_possible_map is supposed to be a superset of node_online_map and this loop is iterating over node_online_map to set nodes in node_possible_map. Yeah. Though at this point in boot I don't think it matters that the two maps are out-of-sync temporarily. But it would be simpler to just set the possible map to be the online map. That would also maintain the invariant that the possible map is always a superset of the online map. Or did I miss a detail there (sleep deprived parent mode). I think reset_numa_cpu_lookup_table(), which iterates over the possible map, and thus only a subset of nodes now, may be concerning. I'm not sure why this is being proposed as a powerpc patch and not a patch for mem_cgroup_css_alloc(). In other words, why do we have to allocate for all possible nodes? We should only be allocating for online nodes in N_MEMORY with mem hotplug disabled initially and then have a mem hotplug callback implemented to alloc_mem_cgroup_per_zone_info() for nodes that transition from memoryless -> memory. The extra bonus is that alloc_mem_cgroup_per_zone_info() need never allocate remote memory and the TODO in that function can be removed.
Re: [PATCH v2] slab: Fix nodeid bounds check for non-contiguous node IDs
On Mon, 1 Dec 2014, Paul Mackerras wrote: The bounds check for nodeid in cache_alloc_node gives false positives on machines where the node IDs are not contiguous, leading to a panic at boot time. For example, on a POWER8 machine the node IDs are typically 0, 1, 16 and 17. This means that num_online_nodes() returns 4, so when cache_alloc_node is called with nodeid = 16 the VM_BUG_ON triggers, like this: kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=1024 NUMA PowerNV Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.0-rc5-kvm+ #17 task: c13ba230 ti: c1494000 task.ti: c1494000 NIP: c0264f6c LR: c0264f5c CTR: REGS: c14979a0 TRAP: 0700 Not tainted (3.18.0-rc5-kvm+) MSR: 92021032 SF,HV,VEC,ME,IR,DR,RI CR: 28000448 XER: 2000 CFAR: c047e978 SOFTE: 0 GPR00: c0264f5c c1497c20 c1499d48 0004 GPR04: 0100 0010 0068 GPR08: 0001 082d c0cca5a8 GPR12: 48000448 cfda 01003bd44ff0 10020578 GPR16: 01003bd44ff8 01003bd45000 0001 GPR20: 0010 GPR24: c00ffe80 c0c824ec 0068 c00ffe80 GPR28: 0010 c00ffe80 0010 NIP [c0264f6c] .cache_alloc_node+0x6c/0x270 LR [c0264f5c] .cache_alloc_node+0x5c/0x270 Call Trace: [c1497c20] [c0264f5c] .cache_alloc_node+0x5c/0x270 (unreliable) [c1497cf0] [c026552c] .kmem_cache_alloc_node_trace+0xdc/0x360 [c1497dc0] [c0c824ec] .init_list+0x3c/0x128 [c1497e50] [c0c827b4] .kmem_cache_init+0x1dc/0x258 [c1497ef0] [c0c54090] .start_kernel+0x2a0/0x568 [c1497f90] [c0008c6c] start_here_common+0x20/0xa8 Instruction dump: 7c7d1b78 7c962378 4bda4e91 6000 3c620004 38800100 386370d8 48219959 6000 7f83e000 7d301026 5529effe 0b09 393c0010 79291f24 7d3d4a14 To fix this, we instead compare the nodeid with MAX_NUMNODES, and additionally make sure it isn't negative (since nodeid is an int). The check is there mainly to protect the array dereference in the get_node() call in the next line, and the array being dereferenced is of size MAX_NUMNODES. 
If the nodeid is in range but invalid (for example if the node is off-line), the BUG_ON in the next line will catch that. Signed-off-by: Paul Mackerras pau...@samba.org Acked-by: David Rientjes rient...@google.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Tue, 22 Jul 2014, Nishanth Aravamudan wrote: I think there's two use cases of interest: - allocating from a memoryless node where numa_node_id() is memoryless, and - using node_to_mem_node() for a possibly-memoryless node for kmalloc(). I believe the first should have its own node_zonelist[0], whether it's memoryless or not, that points to a list of zones that start with those with the smallest distance. Ok, and that would be used for falling back in the appropriate priority? There's no real fallback since there's never a case when you can allocate on a memoryless node. The zonelist defines the appropriate order in which to try to allocate from zones, so it depends on things like the numa_node_id() in alloc_pages_current() and whether the zonelist for a memoryless node is properly initialized or whether this needs to be numa_mem_id(). It depends on the intended behavior of calling alloc_pages_{node,vma}() with a memoryless node, the complexity of (re-)building the zonelists at bootstrap and for memory hotplug isn't a hotpath. This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL is set. I think its own node_zonelist[1], for __GFP_THISNODE allocations, should point to the node with present memory that has the smallest distance. And so would this, but with the caveat that we can fail here and don't go further? Semantically, __GFP_THISNODE then means as close as physically possible ignoring run-time memory constraints. I say that because obviously we might get off-node memory without memoryless nodes, but that shouldn't be used to satisfy __GFP_THISNODE allocations. alloc_pages_current() substitutes any existing mempolicy for the default local policy when __GFP_THISNODE is set, and that would require local allocation. That, currently, is numa_node_id() and not numa_mem_id(). The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so it will allocate remotely anyway.
Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
On Mon, 21 Jul 2014, Nishanth Aravamudan wrote: Sorry for bringing up this old thread again, but I had a question for you, David. node_to_mem_node(), which does seem like a useful API, doesn't seem like it can just node_distance() solely, right? Because that just tells us the relative cost (or so I think about it) of using resources from that node. But we also need to know if that node itself has memory, etc. So using the zonelists is required no matter what? And upon memory hotplug (or unplug), the topology can change in a way that affects things, so node online time isn't right either? I think there's two use cases of interest: - allocating from a memoryless node where numa_node_id() is memoryless, and - using node_to_mem_node() for a possibly-memoryless node for kmalloc(). I believe the first should have its own node_zonelist[0], whether it's memoryless or not, that points to a list of zones that start with those with the smallest distance. I think its own node_zonelist[1], for __GFP_THISNODE allocations, should point to the node with present memory that has the smallest distance. For sure node_zonelist[0] cannot be NULL since things like first_online_pgdat() would break and it should be unnecessary to do node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES since the zonelists should already be defined properly. All nodes, regardless of whether they have memory or not, should probably end up having a struct pglist_data unless there's a reason for another level of indirection. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: Fix build warning
On Fri, 13 Jun 2014, Guenter Roeck wrote: If compiled with W=1, the following warning is seen in powerpc builds. arch/powerpc/kernel/smp.c:750:18: warning: type qualifiers ignored on function return type static const int powerpc_smt_flags(void) ^ This is caused by a function returning 'const int', which doesn't make sense to gcc. Drop 'const' to fix the problem. Reported-by: Vincent Guittot vincent.guit...@linaro.org Signed-off-by: Guenter Roeck li...@roeck-us.net Acked-by: David Rientjes rient...@google.com Although it's strange you report this happening on line 750 in the changelog but the patch shows it differently. --- arch/powerpc/kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 10e..49d5d4e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -768,7 +768,7 @@ int setup_profiling_timer(unsigned int multiplier) #ifdef CONFIG_SCHED_SMT /* cpumask of CPUs with asymetric SMT dependancy */ -static const int powerpc_smt_flags(void) +static int powerpc_smt_flags(void) { int flags = SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: NUMA topology question wrt. d4edc5b6
On Fri, 23 May 2014, Srivatsa S. Bhat wrote: diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index c920215..58e6469 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -18,6 +18,7 @@ struct device_node; */ #define RECLAIM_DISTANCE 10 +#include <linux/nodemask.h> #include <asm/mmzone.h> static inline int cpu_to_node(int cpu) @@ -30,7 +31,7 @@ static inline int cpu_to_node(int cpu) * During early boot, the numa-cpu lookup table might not have been * setup for all CPUs yet. In such cases, default to node 0. */ - return (nid < 0) ? 0 : nid; + return (nid < 0) ? first_online_node : nid; } #define parent_node(node) (node) I wonder what would happen on ppc if we just returned NUMA_NO_NODE here for cpus that have not been mapped (they shouldn't even be possible). This would at least allow callers that do kmalloc_node(..., cpu_to_node(cpu)) to be allocated on the local cpu rather than on a perhaps offline or remote node 0. It would seem better to catch callers that do cpu_to_node(not-possible-cpu) rather than blindly return an online node.
Re: Node 0 not necessary for powerpc?
On Wed, 21 May 2014, Nishanth Aravamudan wrote: For context: I was looking at why N_ONLINE was statically setting Node 0 to be online, whether or not the topology is that way -- I've been getting several bugs lately where Node 0 is online, but has no CPUs and no memory on it, on powerpc. On powerpc, setup_per_cpu_areas calls into ___alloc_bootmem_node using NODE_DATA(cpu_to_node(cpu)). Currently, cpu_to_node() in arch/powerpc/include/asm/topology.h does: /* * During early boot, the numa-cpu lookup table might not have been * setup for all CPUs yet. In such cases, default to node 0. */ return (nid < 0) ? 0 : nid; And so early at boot, if node 0 is not present, we end up accessing an uninitialized NODE_DATA(). So this seems buggy (I'll contact the powerpc developers separately on that). I think what this really wants to do is NODE_DATA(cpu_to_mem(cpu)) and I thought ppc had the cpu-to-local-memory-node mappings correct?
Re: [RESEND PATCH] slub: search partial list on numa_mem_id(), instead of numa_node_id()
On Wed, 21 May 2014, Joonsoo Kim wrote: Currently, if the allocation constraint to node is NUMA_NO_NODE, we search for a partial slab on the numa_node_id() node. This doesn't work properly on a system having memoryless nodes, since such a node can have no memory and so no partial slab on it. On that node, page allocation always falls back to numa_mem_id() first. So searching for a partial slab on numa_mem_id() in that case is the proper solution for the memoryless node case. Acked-by: Nishanth Aravamudan n...@linux.vnet.ibm.com Acked-by: David Rientjes rient...@google.com Acked-by: Christoph Lameter c...@linux.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index 545a170..cc1f995 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1698,7 +1698,7 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node, struct kmem_cache_cpu *c) { void *object; - int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node; + int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node; object = get_partial_node(s, get_node(s, searchnode), c, flags); if (object || node != NUMA_NO_NODE) Andrew, can you merge this please? It's still not in linux-next.
Re: Node 0 not necessary for powerpc?
On Tue, 11 Mar 2014, Nishanth Aravamudan wrote: I have a P7 system that has no node0, but a node0 shows up in numactl --hardware, which has no cpus and no memory (and no PCI devices):

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 7935 MB
node 2 free: 7716 MB
node 3 cpus:
node 3 size: 8395 MB
node 3 free: 8015 MB
node distances:
node   0   1   2   3
  0:  10  20  10  20
  1:  20  10  20  20
  2:  10  20  10  20
  3:  20  20  20  10

This is because we statically initialize N_ONLINE to be [0] in mm/page_alloc.c: [N_ONLINE] = { { [0] = 1UL } }, I'm not sure what the architectural requirements are here, but at least on this test system, removing this initialization, it boots fine and is running. I've not yet tried stress tests, but it's survived the beginnings of kernbench so far.

numactl --hardware
available: 3 nodes (1-3)
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus:
node 2 size: 7935 MB
node 2 free: 7479 MB
node 3 cpus:
node 3 size: 8396 MB
node 3 free: 8375 MB
node distances:
node   1   2   3
  1:  10  20  20
  2:  20  10  20
  3:  20  20  10

Perhaps we could put in an ARCH_DOES_NOT_NEED_NODE0 and only define it on powerpc for now, conditionalizing the above initialization on that? I don't know if anything has recently changed in the past year or so, but I've booted x86 machines with a hacked BIOS so that all memory on node 0 is hotpluggable and offline, so I believe this is possible on x86 as well.
[patch] mm, hotplug: avoid compiling memory hotremove functions when disabled
__remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC pseries will return -EOPNOTSUPP if unsupported. Adding an #ifdef causes several other functions it depends on to also become unnecessary, which saves in .text when disabled (it's disabled in most defconfigs besides powerpc, including x86). remove_memory_block() becomes static since it is not referenced outside of drivers/base/memory.c. Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled and disabled. Signed-off-by: David Rientjes rient...@google.com --- arch/powerpc/platforms/pseries/hotplug-memory.c | 12 + drivers/base/memory.c | 44 +++ include/linux/memory.h | 3 +- include/linux/memory_hotplug.h | 4 +- mm/memory_hotplug.c | 68 +++ mm/sparse.c | 72 + 6 files changed, 113 insertions(+), 90 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -72,6 +72,7 @@ unsigned long memory_block_size_bytes(void) return get_memblock_size(); } +#ifdef CONFIG_MEMORY_HOTREMOVE static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size) { unsigned long start, start_pfn; @@ -153,6 +154,17 @@ static int pseries_remove_memory(struct device_node *np) ret = pseries_remove_memblock(base, lmb_size); return ret; } +#else +static inline int pseries_remove_memblock(unsigned long base, + unsigned int memblock_size) +{ + return -EOPNOTSUPP; +} +static inline int pseries_remove_memory(struct device_node *np) +{ + return -EOPNOTSUPP; +} +#endif /* CONFIG_MEMORY_HOTREMOVE */ static int pseries_add_memory(struct device_node *np) { diff --git a/drivers/base/memory.c b/drivers/base/memory.c --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -93,16 +93,6 @@ int register_memory(struct memory_block *memory) return error; } -static void -unregister_memory(struct memory_block *memory) -{ - 
BUG_ON(memory->dev.bus != &memory_subsys); - - /* drop the ref. we got in remove_memory_block() */ - kobject_put(&memory->dev.kobj); - device_unregister(&memory->dev); -} - unsigned long __weak memory_block_size_bytes(void) { return MIN_MEMORY_BLOCK_SIZE; } @@ -637,8 +627,28 @@ static int add_memory_section(int nid, struct mem_section *section, return ret; } -int remove_memory_block(unsigned long node_id, struct mem_section *section, - int phys_device) +/* + * need an interface for the VM to add new memory regions, + * but without onlining it. + */ +int register_new_memory(int nid, struct mem_section *section) +{ + return add_memory_section(nid, section, NULL, MEM_OFFLINE, HOTPLUG); +} + +#ifdef CONFIG_MEMORY_HOTREMOVE +static void +unregister_memory(struct memory_block *memory) +{ + BUG_ON(memory->dev.bus != &memory_subsys); + + /* drop the ref. we got in remove_memory_block() */ + kobject_put(&memory->dev.kobj); + device_unregister(&memory->dev); +} + +static int remove_memory_block(unsigned long node_id, + struct mem_section *section, int phys_device) { struct memory_block *mem; @@ -661,15 +671,6 @@ int remove_memory_block(unsigned long node_id, struct mem_section *section, return 0; } -/* - * need an interface for the VM to add new memory regions, - * but without onlining it. - */ -int register_new_memory(int nid, struct mem_section *section) -{ - return add_memory_section(nid, section, NULL, MEM_OFFLINE, HOTPLUG); -} - int unregister_memory_section(struct mem_section *section) { if (!present_section(section)) @@ -677,6 +678,7 @@ int unregister_memory_section(struct mem_section *section) return remove_memory_block(0, section, 0); } +#endif /* CONFIG_MEMORY_HOTREMOVE */ /* * offline one memory block. If the memory block has been offlined, do nothing.
diff --git a/include/linux/memory.h b/include/linux/memory.h --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -114,9 +114,10 @@ extern void unregister_memory_notifier(struct notifier_block *nb); extern int register_memory_isolate_notifier(struct notifier_block *nb); extern void unregister_memory_isolate_notifier(struct notifier_block *nb); extern int register_new_memory(int, struct mem_section *); +#ifdef CONFIG_MEMORY_HOTREMOVE extern int unregister_memory_section(struct mem_section *); +#endif extern int memory_dev_init(void); -extern int remove_memory_block(unsigned long, struct mem_section *, int); extern int memory_notify(unsigned long val, void *v); extern int memory_isolate_notify(unsigned long val, void *v); extern
[patch 4/4] mm, oom: remove statically defined arch functions of same name
out_of_memory() is a globally defined function to call the oom killer. x86, sh, and powerpc all use a function of the same name within file scope in their respective fault.c unnecessarily. Inline the functions into the pagefault handlers to clean the code up. Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: Thomas Gleixner t...@linutronix.de Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: Paul Mackerras pau...@samba.org Cc: Paul Mundt let...@linux-sh.org Signed-off-by: David Rientjes rient...@google.com --- arch/powerpc/mm/fault.c | 27 --- arch/sh/mm/fault.c | 19 +++ arch/x86/mm/fault.c | 23 --- 3 files changed, 27 insertions(+), 42 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -113,19 +113,6 @@ static int store_updates_sp(struct pt_regs *regs) #define MM_FAULT_CONTINUE -1 #define MM_FAULT_ERR(sig) (sig) -static int out_of_memory(struct pt_regs *regs) -{ - /* -* We ran out of memory, or some other thing happened to us that made -* us unable to handle the page fault gracefully. -*/ - up_read(&current->mm->mmap_sem); - if (!user_mode(regs)) - return MM_FAULT_ERR(SIGKILL); - pagefault_out_of_memory(); - return MM_FAULT_RETURN; -} - static int do_sigbus(struct pt_regs *regs, unsigned long address) { siginfo_t info; @@ -169,8 +156,18 @@ static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault) return MM_FAULT_CONTINUE; /* Out of memory */ - if (fault & VM_FAULT_OOM) - return out_of_memory(regs); + if (fault & VM_FAULT_OOM) { + up_read(&current->mm->mmap_sem); + + /* +* We ran out of memory, or some other thing happened to us that +* made us unable to handle the page fault gracefully. +*/ + if (!user_mode(regs)) + return MM_FAULT_ERR(SIGKILL); + pagefault_out_of_memory(); + return MM_FAULT_RETURN; + } /* Bus error.
x86 handles HWPOISON here, we'll add this if/when * we support the feature in HW diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -301,17 +301,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code, __bad_area(regs, error_code, address, SEGV_ACCERR); } -static void out_of_memory(void) -{ - /* -* We ran out of memory, call the OOM killer, and return the userspace -* (which will retry the fault, or kill us if we got oom-killed): -*/ - up_read(current-mm-mmap_sem); - - pagefault_out_of_memory(); -} - static void do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address) { @@ -353,8 +342,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, no_context(regs, error_code, address); return 1; } + up_read(current-mm-mmap_sem); - out_of_memory(); + /* +* We ran out of memory, call the OOM killer, and return the +* userspace (which will retry the fault, or kill us if we got +* oom-killed): +*/ + pagefault_out_of_memory(); } else { if (fault VM_FAULT_SIGBUS) do_sigbus(regs, error_code, address); diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -803,20 +803,6 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code, __bad_area(regs, error_code, address, SEGV_ACCERR); } -/* TODO: fixup for mm-invoke-oom-killer-from-page-fault.patch */ -static void -out_of_memory(struct pt_regs *regs, unsigned long error_code, - unsigned long address) -{ - /* -* We ran out of memory, call the OOM killer, and return the userspace -* (which will retry the fault, or kill us if we got oom-killed): -*/ - up_read(current-mm-mmap_sem); - - pagefault_out_of_memory(); -} - static void do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, unsigned int fault) @@ -879,7 +865,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code, return 1; } - out_of_memory(regs, error_code, 
address); + up_read(current-mm-mmap_sem); + + /* +* We ran out of memory, call the OOM killer, and return the +* userspace (which will retry the fault, or kill us if we got +* oom-killed
Re: [RFC PATCH 1/12] memory-hotplug : rename remove_memory to offline_memory
On Wed, 27 Jun 2012, Yasuaki Ishimatsu wrote: remove_memory() does not remove memory but just offlines memory. The patch changes its name to offline_memory(). The kernel is never going to physically remove the memory itself, so I don't see the big problem with calling it remove_memory(). If you're going to change it to offline_memory(), which is just as good but not better, then I'd suggest changing add_memory() to online_memory() for completeness.
Re: [RFC PATCH 2/12] memory-hogplug : check memory offline in offline_pages
On Wed, 27 Jun 2012, Yasuaki Ishimatsu wrote: Index: linux-3.5-rc4/mm/memory_hotplug.c === --- linux-3.5-rc4.orig/mm/memory_hotplug.c 2012-06-26 13:28:16.743211538 +0900 +++ linux-3.5-rc4/mm/memory_hotplug.c 2012-06-26 13:48:38.264940468 +0900 @@ -887,6 +887,11 @@ static int __ref offline_pages(unsigned lock_memory_hotplug(); + if (memory_is_offline(start_pfn, end_pfn)) { + ret = 0; + goto out; + } + zone = page_zone(pfn_to_page(start_pfn)); node = zone_to_nid(zone); nr_pages = end_pfn - start_pfn; Are there additional prerequisites for this patch? Otherwise it changes the return value of offline_memory() which will now call acpi_memory_powerdown_device() in the acpi memhotplug case when disabling. Is that a problem?
[patch] sched, numa: Allocate node_queue on any node for offline nodes
struct node_queue must be allocated with NUMA_NO_NODE for nodes that are not (yet) online, otherwise the page allocator has a bad zonelist and results in an early crash. Tested-by: Stephen Rothwell s...@canb.auug.org.au Signed-off-by: David Rientjes rient...@google.com --- kernel/sched/numa.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c --- a/kernel/sched/numa.c +++ b/kernel/sched/numa.c @@ -885,7 +885,8 @@ static __init int numa_init(void) for_each_node(node) { struct node_queue *nq = kmalloc_node(sizeof(*nq), - GFP_KERNEL | __GFP_ZERO, node); + GFP_KERNEL | __GFP_ZERO, + node_online(node) ? node : NUMA_NO_NODE); BUG_ON(!nq); spin_lock_init(&nq->lock);
Re: linux-next: PowerPC boot failures in next-20120521
On Tue, 22 May 2012, Stephen Rothwell wrote: Unable to handle kernel paging request for data at address 0x1688 Faulting instruction address: 0xc016e154 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=32 NUMA pSeries Modules linked in: NIP: c016e154 LR: c01b9140 CTR: REGS: c003fc8c76d0 TRAP: 0300 Not tainted (3.4.0-autokern1) MSR: 80009032 SF,EE,ME,IR,DR,RI CR: 24044022 XER: 0003 SOFTE: 1 CFAR: 562c DAR: 1688, DSISR: 4000 TASK = c003fc8c8000[1] 'swapper/0' THREAD: c003fc8c4000 CPU: 0 GPR00: c003fc8c7950 c0d05b30 12d0 GPR04: 1680 c003fe032f60 GPR08: 000400540001 c980 c0d24fe0 GPR12: 24044024 cf33b000 01a3fa78 009bac00 GPR16: 00e1f338 02d513f0 1680 GPR20: 0001 c003fc8c7c00 0001 GPR24: 0001 c0d1b490 1680 GPR28: c0c7ce58 c003fe009200 NIP [c016e154] .__alloc_pages_nodemask+0xc4/0x8f0 LR [c01b9140] .new_slab+0xd0/0x3c0 Call Trace: [c003fc8c7950] [2e6e756d615f696e] 0x2e6e756d615f696e (unreliable) [c003fc8c7ae0] [c01b9140] .new_slab+0xd0/0x3c0 [c003fc8c7b90] [c01b9844] .__slab_alloc+0x254/0x5b0 [c003fc8c7cd0] [c01bb7a4] .kmem_cache_alloc_node_trace+0x94/0x260 [c003fc8c7d80] [c0ba36d0] .numa_init+0x98/0x1dc [c003fc8c7e10] [c000ace4] .do_one_initcall+0x1a4/0x1e0 [c003fc8c7ed0] [c0b7b354] .kernel_init+0x124/0x2e0 [c003fc8c7f90] [c00211c8] .kernel_thread+0x54/0x70 Instruction dump: 5400d97e 7b170020 0b00 eb3e8000 3b80 80190088 2f80 40de0014 7860efe2 787c6fe2 78000fa4 7f9c0378 e81b0008 83f9 2fa0 7fff1838 ---[ end trace 31fd0ba7d8756001 ]--- swapper/0 (1) used greatest stack depth: 10864 bytes left Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b I may be completely wrong, but I guess the obvious target would be the sched/numa branch that came in via the tip tree. Config file attached. I haven't had a chance to try to bisect this yet. Anyone have any ideas? Yeah, it's sched/numa since that's what introduced numa_init(). It does for_each_node() for each node and does a kmalloc_node() even though that node may not be online. 
Slub ends up passing this node to the page allocator through alloc_pages_exact_node(). CONFIG_DEBUG_VM would have caught this and your config confirms it's not enabled. sched/numa either needs a memory hotplug notifier or it needs to pass NUMA_NO_NODE for nodes that aren't online. Until we get the former, the following should fix it. sched, numa: Allocate node_queue on any node for offline nodes struct node_queue must be allocated with NUMA_NO_NODE for nodes that are not (yet) online, otherwise the page allocator has a bad zonelist. Signed-off-by: David Rientjes rient...@google.com --- diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c --- a/kernel/sched/numa.c +++ b/kernel/sched/numa.c @@ -885,7 +885,8 @@ static __init int numa_init(void) for_each_node(node) { struct node_queue *nq = kmalloc_node(sizeof(*nq), - GFP_KERNEL | __GFP_ZERO, node); + GFP_KERNEL | __GFP_ZERO, + node_online(node) ? node : NUMA_NO_NODE); BUG_ON(!nq); spin_lock_init(&nq->lock);
Re: linux-next: PowerPC boot failures in next-20120521
On Tue, 22 May 2012, Michael Neuling wrote: console [tty0] enabled console [hvc0] enabled pid_max: default: 32768 minimum: 301 Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes) Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes) Mount-cache hash table entries: 4096 Initializing cgroup subsys cpuacct Initializing cgroup subsys devices Initializing cgroup subsys freezer POWER7 performance monitor hardware support registered Unable to handle kernel paging request for data at address 0x1388 Faulting instruction address: 0xc014a070 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=1024 NUMA pSeries Modules linked in: NIP: c014a070 LR: c01978cc CTR: c00b6870 REGS: c0007e5836b0 TRAP: 0300 Tainted: GW (3.4.0-rc6-mikey) MSR: 90009032 SF,HV,EE,ME,IR,DR,RI CR: 28004022 XER: 0200 SOFTE: 1 CFAR: 50fc DAR: 1388, DSISR: 4000 TASK = c0007e56[1] 'swapper/0' THREAD: c0007e58 CPU: 0 GPR00: c0007e583930 c0c034d8 12d0 GPR04: 1380 0001 GPR08: c0007e0dff60 c0ca05a0 GPR12: 28004024 cff2 GPR16: 0001 1380 GPR20: 0001 c0e14900 c0e148f0 0001 GPR24: c0c6f378 1380 02aa GPR28: c0b576b0 c0007e021200 NIP [c014a070] .__alloc_pages_nodemask+0xd0/0x910 LR [c01978cc] .new_slab+0xcc/0x3d0 Call Trace: [c0007e583930] [c0007e5839c0] 0xc0007e5839c0 (unreliable) [c0007e583ac0] [c01978cc] .new_slab+0xcc/0x3d0 [c0007e583b70] [c072ae98] .__slab_alloc+0x38c/0x4f8 [c0007e583cb0] [c0198190] .kmem_cache_alloc_node_trace+0x90/0x260 [c0007e583d60] [c0a5a404] .numa_init+0x9c/0x188 [c0007e583e00] [c000aa30] .do_one_initcall+0x60/0x1e0 [c0007e583ec0] [c0a40b60] .kernel_init+0x128/0x294 [c0007e583f90] [c0020788] .kernel_thread+0x54/0x70 Instruction dump: 0b00 eb1e8000 3b80 801800a8 2f80 409e001c 7860efe3 3800 41820008 3802 787c6fe2 7f9c0378 e93a0008 801800a4 3b60 2fa9 ---[ end trace 31fd0ba7d8756002 ]--- Which seems to be this code in __alloc_pages_nodemask --- /* * Check the zones suitable for the gfp_mask contain at least one * valid zone. 
It's possible to have an empty zonelist as a result * of GFP_THISNODE and a memoryless node */ if (unlikely(!zonelist->_zonerefs->zone)) c014a070: e9 3a 00 08 ld r9,8(r26) --- r26 is coming from r5 which is the struct zonelist *zonelist parameter to __alloc_pages_nodemask. Having 1380 in there is clearly a bogus pointer. Bisecting it points to b4cdf91668c27a5a6a5a3ed4234756c042dd8288 b4cdf91 sched/numa: Implement numa balancer Trying David's patch just posted doesn't fix it. Hmm, what does CONFIG_DEBUG_VM say?
Re: linux-next: PowerPC boot failures in next-20120521
On Tue, 22 May 2012, Michael Neuling wrote: Trying David's patch just posted doesn't fix it. Hmm, what does CONFIG_DEBUG_VM say? No set. Sorry, should have read "Not set" I mean if it's set, what does it emit to the kernel log with my patch applied? I made CONFIG_DEBUG_VM catch !node_online(node) about six months ago, so I was thinking it would have caught this if either you or Stephen enabled it.
Re: linux-next: PowerPC boot failures in next-20120521
On Tue, 22 May 2012, Michael Neuling wrote: Sorry, got it... CONFIG_DEBUG_VM enabled below... pid_max: default: 32768 minimum: 301 Dentry cache hash table entries: 262144 (order: 5, 2097152 bytes) Inode-cache hash table entries: 131072 (order: 4, 1048576 bytes) Mount-cache hash table entries: 4096 Initializing cgroup subsys cpuacct Initializing cgroup subsys devices Initializing cgroup subsys freezer POWER7 performance monitor hardware support registered [ cut here ] kernel BUG at /scratch/mikey/src/linux-next/include/linux/gfp.h:318! Yeah, this is what I was expecting, it's tripping on VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); and slub won't pass nid < 0. You're sure my patch is applied? :)
[patch] powerpc, mm: fix section mismatch for mark_reserved_regions_for_nid
On Fri, 9 Dec 2011, Subrata Modak wrote: WARNING: vmlinux.o(.text+0x4c760): Section mismatch in reference from the function .mark_reserved_regions_for_nid() to the function .meminit.text:.early_pfn_to_nid() The function .mark_reserved_regions_for_nid() references the function __meminit .early_pfn_to_nid(). This is often because .mark_reserved_regions_for_nid lacks a __meminit annotation or the annotation of .early_pfn_to_nid is wrong. WARNING: vmlinux.o(.text+0x4c780): Section mismatch in reference from the function .mark_reserved_regions_for_nid() to the function .init.text:.work_with_active_regions() The function .mark_reserved_regions_for_nid() references the function __init .work_with_active_regions(). This is often because .mark_reserved_regions_for_nid lacks a __init annotation or the annotation of .work_with_active_regions is wrong. WARNING: vmlinux.o(.text+0x4c7d4): Section mismatch in reference from the function .mark_reserved_regions_for_nid() to the function .meminit.text:.early_pfn_to_nid() The function .mark_reserved_regions_for_nid() references the function __meminit .early_pfn_to_nid(). This is often because .mark_reserved_regions_for_nid lacks a __meminit annotation or the annotation of .early_pfn_to_nid is wrong. WARNING: vmlinux.o(.text+0x4c7f0): Section mismatch in reference from the function .mark_reserved_regions_for_nid() to the function .init.text:.work_with_active_regions() The function .mark_reserved_regions_for_nid() references the function __init .work_with_active_regions(). This is often because .mark_reserved_regions_for_nid lacks a __init annotation or the annotation of .work_with_active_regions is wrong. WARNING: vmlinux.o(.text+0x4c828): Section mismatch in reference from the function .mark_reserved_regions_for_nid() to the function .init.text:.reserve_bootmem_node() The function .mark_reserved_regions_for_nid() references the function __init .reserve_bootmem_node(). 
This is often because .mark_reserved_regions_for_nid lacks a __init annotation or the annotation of .reserve_bootmem_node is wrong. Wow, lots of ibm folks on the cc :) I can only talk about the mm related section mismatches, but these five can easily be solved with the following. powerpc, mm: fix section mismatch for mark_reserved_regions_for_nid mark_reserved_regions_for_nid() is only called from do_init_bootmem(), which is in .init.text, so it must be in the same section to avoid a section mismatch warning. Reported-by: Subrata Modak subr...@linux.vnet.ibm.com Signed-off-by: David Rientjes rient...@google.com --- arch/powerpc/mm/numa.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -969,7 +969,7 @@ static struct notifier_block __cpuinitdata ppc64_numa_nb = { .priority = 1 /* Must run before sched domains notifier. */ }; -static void mark_reserved_regions_for_nid(int nid) +static void __init mark_reserved_regions_for_nid(int nid) { struct pglist_data *node = NODE_DATA(nid); struct memblock_region *reg; ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[patch] powerpc, mm: fix section mismatch for read_n_cells
read_n_cells() cannot be marked as .devinit.text since it is referenced from two functions that are not in that section: of_get_lmb_size() and hot_add_drconf_scn_to_nid(). Signed-off-by: David Rientjes rient...@google.com --- arch/powerpc/mm/numa.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -406,7 +406,7 @@ static void __init get_n_mem_cells(int *n_addr_cells, int *n_size_cells) of_node_put(memory); } -static unsigned long __devinit read_n_cells(int n, const unsigned int **buf) +static unsigned long read_n_cells(int n, const unsigned int **buf) { unsigned long result = 0;
Re: Badness with the kernel version 2.6.35-rc1-git1 running on P6 box
On Fri, 16 Jul 2010, Dave Hansen wrote: SLUB: Unable to allocate memory on node -1 (gfp=0x20) cache: kmalloc-16384, object size: 16384, buffer size: 16384, default order: 2, min order: 0 node 0: slabs: 28, objs: 292, free: 0 ip: page allocation failure. order:0, mode:0x8020 Call Trace: [c6a0eb40] [c0011c30] .show_stack+0x6c/0x16c (unreliable) [c6a0ebf0] [c012129c] .__alloc_pages_nodemask+0x6a0/0x75c [c6a0ed70] [c01527cc] .alloc_pages_current+0xc4/0x104 [c6a0ee10] [c011fca4] .__get_free_pages+0x18/0x90 [c6a0ee90] [c04f7058] .ehea_get_stats+0x4c/0x1bc [c6a0ef30] [c05a0a04] .dev_get_stats+0x38/0x64 [c6a0efc0] [c05b456c] .rtnl_fill_ifinfo+0x35c/0x85c [c6a0f150] [c05b5920] .rtmsg_ifinfo+0x164/0x204 [c6a0f210] [c05a6d6c] .dev_change_flags+0x4c/0x7c [c6a0f2a0] [c05b50b4] .do_setlink+0x31c/0x750 [c6a0f3b0] [c05b6724] .rtnl_newlink+0x388/0x618 [c6a0f5f0] [c05b6350] .rtnetlink_rcv_msg+0x268/0x2b4 [c6a0f6a0] [c05cfdc0] .netlink_rcv_skb+0x74/0x108 [c6a0f730] [c05b60c4] .rtnetlink_rcv+0x38/0x5c [c6a0f7c0] [c05cf8c8] .netlink_unicast+0x318/0x3f4 [c6a0f890] [c05d05b4] .netlink_sendmsg+0x2d0/0x310 [c6a0f970] [c058e1e8] .sock_sendmsg+0xd4/0x110 [c6a0fb50] [c058e514] .SyS_sendmsg+0x1f4/0x288 [c6a0fd70] [c058c2b8] .SyS_socketcall+0x214/0x280 [c6a0fe30] [c00085b4] syscall_exit+0x0/0x40 Mem-Info: Node 0 DMA per-cpu: CPU0: hi:0, btch: 1 usd: 0 CPU1: hi:0, btch: 1 usd: 0 CPU2: hi:0, btch: 1 usd: 0 CPU3: hi:0, btch: 1 usd: 0 The mainline 2.6.35-rc5 worked fine. Maybe you were lucky with 2.6.35-rc5 Anyway ehea should not use GFP_ATOMIC in its ehea_get_stats() method, called in process context, but GFP_KERNEL. Another patch is needed for ehea_refill_rq_def() as well. You're right that this is abusing GFP_ATOMIC. But is, this is just a normal GFP_ATOMIC allocation failure? SLUB: Unable to allocate memory on node -1 seems like a somewhat inappropriate error message for that. 
The slub message is separate and doesn't generate a call trace, even though it is a (minimum) order-0 GFP_ATOMIC allocation as well. The page allocation failure is a separate instance that is calling the page allocator, not the slab allocator. It isn't immediately obvious where the -1 is coming from. Does it truly mean allocate from any node here, or is that a buglet in and of itself? Yes, slub uses -1 to indicate that the allocation need not come from a specific node.
Re: [patch 11/14] powerpc: invoke oom-killer from page fault
On Fri, 23 Apr 2010, npig...@suse.de wrote: As explained in commit 1c0fe6e3bd, we want to call the architecture independent oom killer when getting an unexplained OOM from handle_mm_fault, rather than simply killing current. Cc: linuxppc-...@ozlabs.org Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: linux-a...@vger.kernel.org Signed-off-by: Nick Piggin npig...@suse.de --- Index: linux-2.6/arch/powerpc/mm/fault.c === --- linux-2.6.orig/arch/powerpc/mm/fault.c +++ linux-2.6/arch/powerpc/mm/fault.c @@ -359,15 +359,10 @@ bad_area_nosemaphore: */ out_of_memory: up_read(&mm->mmap_sem); - if (is_global_init(current)) { - yield(); - down_read(&mm->mmap_sem); - goto survive; - } - printk("VM: killing process %s\n", current->comm); - if (user_mode(regs)) - do_group_exit(SIGKILL); - return SIGKILL; + if (!user_mode(regs)) + return SIGKILL; + pagefault_out_of_memory(); + return 0; Do we really want to return 0 and indicate that the fault was handled? It seems more consistent to do if (user_mode(regs)) pagefault_out_of_memory(); return SIGKILL;
Re: [PATCH] Fix fake numa on ppc
On Wed, 2 Sep 2009, Ankita Garg wrote: With the patch, # cat /proc/cmdline root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G # cat /sys/devices/system/node/node0/cpulist 0-3 # cat /sys/devices/system/node/node1/cpulist Oh! interesting.. cpuless nodes :) I think we need to fix this in the longer run and distribute cpus between fake numa nodes of a real node using some acceptable heuristic. True. Presently this is broken on both x86 and ppc systems. It would be interesting to find a way to map, for example, 4 cpus to 4 fake nodes created from a single real numa node! We've done it for years on x86_64. It's quite trivial to map all fake nodes within a physical node to the cpus to which they have affinity both via node_to_cpumask_map() and cpu_to_node_map(). There should be no kernel space dependencies on a cpu appearing in only a single node's cpumask and if you map each fake node to its physical node's pxm, you can index into the slit and generate local NUMA distances amongst fake nodes. So if you map the apicids and pxms appropriately depending on the physical topology of the machine, that is the only emulation necessary on x86_64 for the page allocator zonelist ordering, task migration, etc. (If you use CONFIG_SLAB, you'll need to avoid the exponential growth of alien caches, but that's an implementation detail and isn't really within the scope of numa=fake's purpose to modify.)
Re: [PATCH] Fix fake numa on ppc
On Wed, 2 Sep 2009, Benjamin Herrenschmidt wrote: Since I'm pretty sure there could be CPU-less nodes just like there could be memory-less nodes, it would be good if fake numa could simulate them too :-) You don't want to simulate cpu-less nodes since they do have affinity to ranges of memory, you want to map each fake node to a cpumask including all cpus with affinity to its memory, map each cpu to one fake node (with memory) that it has physical affinity to, and then give all fake nodes local NUMA distance to those on the same physical node. Memoryless nodes take care of themselves since they rely purely on node_distance(), so the index into the slit for all fake nodes to those without memory will be the same.
Re: [PATCH v2] Fix fake numa on ppc
On Wed, 2 Sep 2009, Ankita Garg wrote: Hi, Below is a patch to fix a couple of issues with fake numa node creation on ppc: 1) Presently, fake nodes could be created such that real numa node boundaries are not respected. So a node could have lmbs that belong to different real nodes. On x86_64, we can use numa=off to completely disable NUMA so that all memory and all cpus are mapped to a single node 0. That's an extreme example of the above and is totally permissible. 2) The cpu association is broken. On a JS22 blade for example, which is a 2-node numa machine, I get the following: # cat /proc/cmdline root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G # cat /sys/devices/system/node/node0/cpulist 0-3 # cat /sys/devices/system/node/node1/cpulist 4-7 # cat /sys/devices/system/node/node4/cpulist # This doesn't show what the true NUMA topology of the machine is, could you please post the output of $ cat /sys/devices/system/node/node*/cpulist $ cat /sys/devices/system/node/node*/distance $ ls -d /sys/devices/system/node/node*/cpu[0-8] from a normal boot without any numa=fake? So, though the cpus 4-7 should have been associated with node4, they still belong to node1. The patch works by recording a real numa node boundary and incrementing the fake node count. At the same time, a mapping is stored from the real numa node to the first fake node that gets created on it. If there are multiple fake nodes on a real physical node, all cpus in that node should appear in the cpulist for each fake node for which it has local distance.
Re: [PATCH v2] Fix fake numa on ppc
On Wed, 2 Sep 2009, Ankita Garg wrote:

> Currently, the behavior of fake numa is not so on x86 as well? Below is
> a sample output from a single node x86 system booted with numa=fake=8:
>
> # cat node0/cpulist
> # cat node1/cpulist
> ...
> # cat node6/cpulist
> # cat node7/cpulist
> 0-7
>
> Presently, just fixing the cpu association issue with ppc, as explained
> in my previous mail.

Right, I'm proposing an alternate mapping scheme (which we've used for years) for both platforms such that a cpu is bound (and is set in cpumask_of_node()) to each fake node with which it has physical affinity. That is the only way to get correct zonelist ordering in node order, task migration from offlined cpus, correct sched domains, etc. I can propose a patchset for x86_64 to do exactly this if there aren't any objections, and I hope you'll help do ppc.
Re: [PATCH v2] Fix fake numa on ppc
On Thu, 3 Sep 2009, Balbir Singh wrote:

> > Right, I'm proposing an alternate mapping scheme (which we've used
> > for years) for both platforms such that a cpu is bound (and is set in
> > cpumask_of_node()) to each fake node with which it has physical
> > affinity. That is the only way for zonelist ordering in node order,
> > task migration from offlined cpus, correct sched domains, etc. I can
> > propose a patchset for x86_64 to do exactly this if there aren't any
> > objections and I hope you'll help do ppc.
>
> Sounds interesting, I'd definitely be interested in seeing your
> proposal, but I would think of that as additional development on top of
> this patch

Absolutely.

I'm not familiar with numa=fake on ppc, but if cpus are being bound to nodes with which they don't have affinity, it definitely warrants a fix such as this (although the initial value for fake_enabled looks wrong and fake_numa_node_mapping[] can be __cpuinitdata). I'll cc you, Ben, and Ankita on the x86_64 patches.

Thanks.
Re: [PATCH] Fake NUMA emulation for PowerPC
On Sat, 8 Dec 2007, Balbir Singh wrote:

> > You're going to want to distribute the cpu's based on how they match
> > up physically with the actual platform that you're running on.
> > x86_64 does
>
> Could you explain this better? How does it match up CPU's with fake
> NUMA memory? Is there some smartness there? I'll try and look at the
> code and also see what I can do for PowerPC.

numa_cpumask_lookup_table[] would return the correct cpumask for the fake node index. Then all the code that uses node_to_cpumask() in generic kernel code, like the scheduler and VM, still preserves its true NUMA affinity matching the underlying hardware. I tried to make x86_64 fake NUMA as close to the real thing as possible.

You also probably want to make all your changes dependent on CONFIG_NUMA_EMU, like the x86_64 case. That'll probably be helpful as you extend this tool more and more.

		David
Re: [PATCH] Fake NUMA emulation for PowerPC
On Sat, 8 Dec 2007, Balbir Singh wrote:

> Yes, they all appear on node 0. We could have tweaks to distribute
> CPU's as well.

You're going to want to distribute the cpu's based on how they match up physically with the actual platform that you're running on. x86_64 does this already, and it makes fake NUMA more useful because it matches the real-life case more often.
Re: [PATCH] Fake NUMA emulation for PowerPC
On Sat, 8 Dec 2007, Balbir Singh wrote:

> To be able to test the memory controller under NUMA, I use fake NUMA
> nodes. x86-64 has a similar feature; the code I have here is the
> simplest I could come up with for PowerPC.

Magnus Damm had patches from over a year ago that, I believe, made much of the x86_64 fake NUMA code generic so that it could be extended for architectures such as i386. Perhaps he could resurrect those patches if there is wider interest in such a tool.
Re: [PATCH] Fake NUMA emulation for PowerPC
On Fri, 7 Dec 2007, Olof Johansson wrote:

> > Comments are as always welcome!
>
> Care to explain what this is useful for? (Not saying it's a stupid
> idea, just wondering what the reason for doing it is).

Fake NUMA has always been useful for testing NUMA code without having to have a wide range of hardware available to you. It's a clever tool on x86_64 intended for kernel developers that simply makes it easier to test code and adds an increased level of robustness to the kernel. I think it's a valuable addition.