Re: [Question] Is preempt_disable/enable needed in non-preemption code path
On 4/17/21 1:26 AM, Paul E. McKenney wrote: [Please note: This e-mail is from an EXTERNAL e-mail address]

On Fri, Apr 16, 2021 at 06:51:10PM +0800, Xu, Yanfei wrote:

On 4/16/21 1:07 AM, Paul E. McKenney wrote:

On Fri, Apr 16, 2021 at 12:18:42AM +0800, Xu, Yanfei wrote:

On 4/15/21 11:43 PM, Paul E. McKenney wrote:

On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote:

Hi experts,

I am learning the RCU mechanism and its code. When looking at rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations in a non-preemption code path, and they have been there for a long time. I can't understand why we need them. Is there something I missed? If not, can we remove the unnecessary operations as below?

Good point, you are right that preemption is disabled anyway in that block of code. However, preempt_disable() and preempt_enable() also prevent the compiler from moving that READ_ONCE() around. So my question to you is whether it is safe to remove those statements entirely or whether they should instead be replaced by barrier() or similar.

Thanks for your reply! :)

Yes, preempt_disable() and preempt_enable() as defined under !preemption are barrier(). barrier() can prevent reordering of that READ_ONCE(), but based on my current understanding, the volatile in READ_ONCE can also tell the compiler not to reorder it. So I think it's safe?

Maybe. Please keep in mind that although the compiler is prohibited from reordering volatile accesses with each other, there is nothing stopping it from reordering volatile accesses with non-volatile accesses.

Thanks for your patient explanation! I am trying to absorb what you said. Below is my understanding:

1.
"the compiler is prohibited from reordering volatile accesses with each other" means these situations: int a; foo() { for(;;) READ_ONCE(a); } or int a,b; foo() { int c,d; c = READ_ONCE(a); d = READ_ONCE(b); } Yes, in both cases the load instructions emitted for the READ_ONCE() macros must be emitted in order. The underlying hardware is free to reorder. Got it. 2. "volatile accesses with non-volatile accesses" means d=b may happen before c=READ_ONCE(a) : int a; foo() { int b = 2 int c,d; c = READ_ONCE(a); d = b; } if we want to keep the ordering of volatile access "c=READ_ONCE(a)" and non-volatile access "d=b", we should use stronger barrier like barrier(). Or an additional READ_ONCE() for b or a WRITE_ONCE() for d. But again, this would constrain only the compiler, not the hardware. But this wouldn't matter in most cases, because both b and d are local variables whose addresses were never taken. So someone would need to be using something crazy to poke into others' stacks for this to matter. Agree. Hope I didn't misunderstand. It looks like you have most of it. Back to rcu_blocking_is_gp(), I find this link today https://www.spinics.net/lists/rcu/msg03985.html With the content in this link, I still haven't got the meaning of these two barrier(). I think I should learn knowledge about cpu-hotplug and things which talked in the link first to make sure if I am missing something, and then consult you. :) That sounds like a very good approach! Keep in mind that I am worried not just about the current state of the code and compilers, but also their possible future states. I see. Thanks again. 
Best regards,
Yanfei

Thanx, Paul

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
 	if (IS_ENABLED(CONFIG_PREEMPTION))
 		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
 	might_sleep();  /* Check for RCU read-side critical section. */
-	preempt_disable();
 	/*
 	 * If the rcu_state.n_online_cpus counter is equal to one,
 	 * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
 	 * Those memory barriers are provided by CPU-hotplug code.
 	 */
 	ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-	preempt_enable();
 	return ret;
 }

Best regards,
Yanfei
Re: [Question] Is preempt_disable/enable needed in non-preemption code path
On 4/16/21 1:07 AM, Paul E. McKenney wrote: [Please note: This e-mail is from an EXTERNAL e-mail address] On Fri, Apr 16, 2021 at 12:18:42AM +0800, Xu, Yanfei wrote: On 4/15/21 11:43 PM, Paul E. McKenney wrote: [Please note: This e-mail is from an EXTERNAL e-mail address] On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote: Hi experts, I am learning rcu mechanism and its codes. When looking at the rcu_blocking_is_gp(), I found there is a pair preemption disable/enable operation in non-preemption code path. And it has been a long time. I can't understand why we need it? Is there some thing I missed? If not, can we remove the unnecessary operation like blow? Good point, you are right that preemption is disabled anyway in that block of code. However, preempt_disable() and preempt_enable() also prevent the compiler from moving that READ_ONCE() around. So my question to you is whether it is safe to remove those statements entirely or whether they should instead be replaced by barrier() or similar. Thanks for your reply! :) Yes, preempt_disable() and preempt_enable() defined in !preemption are barrier(). barrier can prevent from reordering that READ_ONCE(), but base on my current understanding, volatile in READ_ONCE can also tell the compiler not to reorder it. So, I think it's safe? Maybe. Please keep in mind that although the compiler is prohibited from reordering volatile accesses with each other, there is nothing stopping it from reordering volatile accesses with non-volatile accesses. Thanks for your patient explanation! I am trying to absorb what you said. Blow are my understanding: 1. "the compiler is prohibited from reordering volatile accesses with each other" means these situations: int a; foo() { for(;;) READ_ONCE(a); } or int a,b; foo() { int c,d; c = READ_ONCE(a); d = READ_ONCE(b); } 2. 
"volatile accesses with non-volatile accesses" means d=b may happen before c=READ_ONCE(a) : int a; foo() { int b = 2 int c,d; c = READ_ONCE(a); d = b; } if we want to keep the ordering of volatile access "c=READ_ONCE(a)" and non-volatile access "d=b", we should use stronger barrier like barrier(). Hope I didn't misunderstand. Back to rcu_blocking_is_gp(), I find this link today https://www.spinics.net/lists/rcu/msg03985.html With the content in this link, I still haven't got the meaning of these two barrier(). I think I should learn knowledge about cpu-hotplug and things which talked in the link first to make sure if I am missing something, and then consult you. :) Best regards, Yanfei Thanx, Paul Best regards, Yanfei Thanx, Paul diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f5213fb74..c6d95a00715e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void) if (IS_ENABLED(CONFIG_PREEMPTION)) return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE; might_sleep(); /* Check for RCU read-side critical section. */ - preempt_disable(); /* * If the rcu_state.n_online_cpus counter is equal to one, * there is only one CPU, and that CPU sees all prior accesses @@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void) * Those memory barriers are provided by CPU-hotplug code. */ ret = READ_ONCE(rcu_state.n_online_cpus) <= 1; - preempt_enable(); return ret; } Best regards, Yanfei
Re: [Question] Is preempt_disable/enable needed in non-preemption code path
On 4/16/21 12:18 AM, Xu, Yanfei wrote: On 4/15/21 11:43 PM, Paul E. McKenney wrote: [Please note: This e-mail is from an EXTERNAL e-mail address] On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote: Hi experts, I am learning rcu mechanism and its codes. When looking at the rcu_blocking_is_gp(), I found there is a pair preemption disable/enable operation in non-preemption code path. And it has been a long time. I can't understand why we need it? Is there some thing I missed? If not, can we remove the unnecessary operation like blow? Good point, you are right that preemption is disabled anyway in that block of code. However, preempt_disable() and preempt_enable() also prevent the compiler from moving that READ_ONCE() around. So my question to you is whether it is safe to remove those statements entirely or whether they should instead be replaced by barrier() or similar. Thanks for your reply! :) Yes, preempt_disable() and preempt_enable() defined in !preemption are barrier(). barrier can prevent from reordering that READ_ONCE(), but base on my current understanding, volatile in READ_ONCE can also tell the compiler not to reorder it. So, I think it's safe? 
Best regards,
Yanfei

Hi Paul,

I objdumped the function rcu_blocking_is_gp().

After dropping the barrier():

81107c50 <rcu_blocking_is_gp>:
81107c50:	e8 7b 2a f5 ff       	callq  8105a6d0 <__fentry__>
81107c55:	8b 05 41 fe 7c 01    	mov    0x17cfe41(%rip),%eax	# 828d7a9c
81107c5b:	55                   	push   %rbp
81107c5c:	48 89 e5             	mov    %rsp,%rbp
81107c5f:	5d                   	pop    %rbp
81107c60:	83 f8 01             	cmp    $0x1,%eax
81107c63:	0f 9e c0             	setle  %al
81107c66:	0f b6 c0             	movzbl %al,%eax
81107c69:	c3                   	retq
81107c6a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

The original code:

81107ba0 <rcu_blocking_is_gp>:
81107ba0:	e8 2b 2b f5 ff       	callq  8105a6d0 <__fentry__>
81107ba5:	55                   	push   %rbp
81107ba6:	48 89 e5             	mov    %rsp,%rbp
81107ba9:	8b 05 ed fe 7c 01    	mov    0x17cfeed(%rip),%eax	# 828d7a9c
81107baf:	83 f8 01             	cmp    $0x1,%eax
81107bb2:	5d                   	pop    %rbp
81107bb3:	0f 9e c0             	setle  %al
81107bb6:	0f b6 c0             	movzbl %al,%eax
81107bb9:	c3                   	retq
81107bba:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

Umm... it did get reordered by the compiler after dropping the barrier(); however, I think the result will not be affected. Right?

Best regards,
Yanfei

Thanx, Paul

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
 	if (IS_ENABLED(CONFIG_PREEMPTION))
 		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
 	might_sleep();  /* Check for RCU read-side critical section. */
-	preempt_disable();
 	/*
 	 * If the rcu_state.n_online_cpus counter is equal to one,
 	 * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
 	 * Those memory barriers are provided by CPU-hotplug code.
 	 */
 	ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-	preempt_enable();
 	return ret;
 }

Best regards,
Yanfei
Re: [Question] Is preempt_disable/enable needed in non-preemption code path
On 4/15/21 11:43 PM, Paul E. McKenney wrote: [Please note: This e-mail is from an EXTERNAL e-mail address] On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote: Hi experts, I am learning rcu mechanism and its codes. When looking at the rcu_blocking_is_gp(), I found there is a pair preemption disable/enable operation in non-preemption code path. And it has been a long time. I can't understand why we need it? Is there some thing I missed? If not, can we remove the unnecessary operation like blow? Good point, you are right that preemption is disabled anyway in that block of code. However, preempt_disable() and preempt_enable() also prevent the compiler from moving that READ_ONCE() around. So my question to you is whether it is safe to remove those statements entirely or whether they should instead be replaced by barrier() or similar. Thanks for your reply! :) Yes, preempt_disable() and preempt_enable() defined in !preemption are barrier(). barrier can prevent from reordering that READ_ONCE(), but base on my current understanding, volatile in READ_ONCE can also tell the compiler not to reorder it. So, I think it's safe? Best regards, Yanfei Thanx, Paul diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f5213fb74..c6d95a00715e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void) if (IS_ENABLED(CONFIG_PREEMPTION)) return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE; might_sleep(); /* Check for RCU read-side critical section. */ - preempt_disable(); /* * If the rcu_state.n_online_cpus counter is equal to one, * there is only one CPU, and that CPU sees all prior accesses @@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void) * Those memory barriers are provided by CPU-hotplug code. */ ret = READ_ONCE(rcu_state.n_online_cpus) <= 1; - preempt_enable(); return ret; } Best regards, Yanfei
[Question] Is preempt_disable/enable needed in non-preemption code path
Hi experts,

I am learning the RCU mechanism and its code. When looking at rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations in a non-preemption code path, and they have been there for a long time. I can't understand why we need them. Is there something I missed? If not, can we remove the unnecessary operations as below?

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
 	if (IS_ENABLED(CONFIG_PREEMPTION))
 		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
 	might_sleep();  /* Check for RCU read-side critical section. */
-	preempt_disable();
 	/*
 	 * If the rcu_state.n_online_cpus counter is equal to one,
 	 * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
 	 * Those memory barriers are provided by CPU-hotplug code.
 	 */
 	ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-	preempt_enable();
 	return ret;
 }

Best regards,
Yanfei
Re: [PATCH v2 0/2] mm: khugepaged: cleanup and a minor tuning in THP
Gentle ping.

On 4/7/21 11:05 AM, yanfei...@windriver.com wrote:

From: Yanfei Xu

v1 --> v2:
1. Correct the wrong location that the goto jumps to.
2. Keep the cond_resched() dropped, as in v1.

Thanks for Yang's review.

Yanfei Xu (2):
  mm: khugepaged: use macro to align addresses
  mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

 mm/khugepaged.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)
Re: [PATCH 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas
On 4/6/21 10:51 AM, Xu, Yanfei wrote:

On 4/6/21 2:20 AM, Yang Shi wrote: [Please note: This e-mail is from an EXTERNAL e-mail address]

On Sun, Apr 4, 2021 at 8:33 AM wrote:

From: Yanfei Xu

We could check MMF_DISABLE_THP ahead of iterating over all of the VMAs. Otherwise, if some mm_struct contains a large number of VMAs, a meaningless amount of CPU cycles will be spent. BTW, drop an unnecessary cond_resched(), because another cond_resched() follows it with no time-consuming work between them.

Signed-off-by: Yanfei Xu

---
 mm/khugepaged.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2efe1d0c92ed..c293ec4a94ea 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2094,6 +2094,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	 */
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
+	if (test_bit(MMF_DISABLE_THP, &mm->flags))
+		goto breakouterloop_mmap_lock;

It is fine to check this flag. But mmap_lock has been acquired, so you should jump to breakouterloop.

Oops! It's my fault. Thank you for pointing this out. Will fix it in v2.

 	if (likely(!khugepaged_test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);
@@ -2101,7 +2103,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	for (; vma; vma = vma->vm_next) {
 		unsigned long hstart, hend;
-		cond_resched();

I don't have a strong opinion on removing this cond_resched(). But IIUC khugepaged is a best-effort job, so there is no harm in keeping it IMHO.

Yes, keeping it does no harm. But I think we should add it only where we need it. Look at the code below: there are only some simple checks between these two cond_resched() calls. And we still have cond_resched() calls in khugepaged_scan_file() and khugepaged_scan_pmd(), which do the actual work of collapsing. So I think it is unnecessary.
:)

BTW, the original author might have added this cond_resched() out of worry that hugepage_vma_check() would always return false due to MMF_DISABLE_THP. But now we have moved that check out of the for loop iterating over the VMAs. Umm... that is my guess.

Thanks,
Yanfei

	for (; vma; vma = vma->vm_next) {
		unsigned long hstart, hend;

		cond_resched();  /* here */
		if (unlikely(khugepaged_test_exit(mm))) {
			progress++;
			break;
		}
		if (!hugepage_vma_check(vma, vma->vm_flags)) {
skip:
			progress++;
			continue;
		}
		hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
		hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE);
		if (hstart >= hend)
			goto skip;
		if (khugepaged_scan.address > hend)
			goto skip;
		if (khugepaged_scan.address < hstart)
			khugepaged_scan.address = hstart;
		VM_BUG_ON(!IS_ALIGNED(khugepaged_scan.address, HPAGE_PMD_SIZE));
		if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
			goto skip;

		while (khugepaged_scan.address < hend) {
			int ret;

			cond_resched();  /* here */
			if (unlikely(khugepaged_test_exit(mm))) {
				progress++;
				break;

-- 
2.27.0
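The idea of the patch above — one cheap per-mm flag test instead of failing hugepage_vma_check() once per VMA — can be sketched outside the kernel. The struct and function names below are simplified stand-ins, not the kernel's data structures:

```c
#include <assert.h>
#include <stddef.h>

#define MMF_DISABLE_THP (1u << 0)   /* stand-in for the kernel flag bit */

struct vma { struct vma *next; };
struct mm  { unsigned flags; struct vma *vma_list; };

/* Returns the number of VMAs scanned.  Testing the per-mm flag first
 * turns "reject every VMA one by one inside the loop" into a single
 * O(1) early exit, which is the whole point of the optimization. */
int scan_mm(struct mm *mm)
{
	int scanned = 0;

	if (mm->flags & MMF_DISABLE_THP)    /* cheap O(1) check ... */
		return 0;                   /* ... instead of an O(n) walk */

	for (struct vma *v = mm->vma_list; v; v = v->next)
		scanned++;
	return scanned;
}
```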
Re: [PATCH 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas
On 4/6/21 2:20 AM, Yang Shi wrote: [Please note: This e-mail is from an EXTERNAL e-mail address] On Sun, Apr 4, 2021 at 8:33 AM wrote: From: Yanfei Xu We could check MMF_DISABLE_THP ahead of iterating over all of vma. Otherwise if some mm_struct contain a large number of vma, there will be amounts meaningless cpu cycles cost. BTW, drop an unnecessary cond_resched(), because there is a another cond_resched() followed it and no consumed invocation between them. Signed-off-by: Yanfei Xu --- mm/khugepaged.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2efe1d0c92ed..c293ec4a94ea 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2094,6 +2094,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, */ if (unlikely(!mmap_read_trylock(mm))) goto breakouterloop_mmap_lock; + if (test_bit(MMF_DISABLE_THP, >flags)) + goto breakouterloop_mmap_lock; It is fine to check this flag. But mmap_lock has been acquired so you should jump to breakouterloop. Oops! It's my fault. Thank you for pointing out this. Will fix it in v2. if (likely(!khugepaged_test_exit(mm))) vma = find_vma(mm, khugepaged_scan.address); @@ -2101,7 +2103,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, for (; vma; vma = vma->vm_next) { unsigned long hstart, hend; - cond_resched(); I don't have a strong opinion for removing this cond_resched(). But IIUC khugepaged is a best effort job there is no harm to keep it IMHO. Yes, keeping it is no harm. But I think we should add it when we need. Look at the blow codes, there are only some simple check between these two cond_resched(). And we still have some cond_resched() in the khugepaged_scan_file() and khugepaged_scan_pmd() which is the actual wrok about collapsing. So I think it is unnecessary. 
:) for (; vma; vma = vma->vm_next) { unsigned long hstart, hend; cond_resched(); //here if (unlikely(khugepaged_test_exit(mm))) { progress++; break; } if (!hugepage_vma_check(vma, vma->vm_flags)) { skip: progress++; continue; } hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE); hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE); if (hstart >= hend) goto skip; if (khugepaged_scan.address > hend) goto skip; if (khugepaged_scan.address < hstart) khugepaged_scan.address = hstart; VM_BUG_ON(!IS_ALIGNED(khugepaged_scan.address, HPAGE_PMD_SIZE)); if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma)) goto skip; while (khugepaged_scan.address < hend) { int ret; cond_resched();//here if (unlikely(khugepaged_test_exit(mm))) { progress++; break; -- 2.27.0
Re: [PATCH] block: export disk_part_iter_* helpers
On 3/26/21 8:13 PM, Christoph Hellwig wrote: [Please note: This e-mail is from an EXTERNAL e-mail address]

On Fri, Mar 26, 2021 at 08:10:59PM +0800, yanfei...@windriver.com wrote:

From: Yanfei Xu

The disk_part_iter_* helpers might be used by other external modules, like lttng-modules, but they were unexported in commit bc359d03c7ec ("block: add a disk_uevent helper"). Export them again here.

Err, no. We never export things for out-of-tree modules. And any kind of driver code has absolutely no business looking at the partition tables to start with, modular or not.

I see. Thanks.

//Yanfei
Re: [PATCH] mm/hugetlb: remove duplicate codes of setting compound_nr
Sorry. Please ignore this patch, it's incorrect.

Thanks,
Yanfei

On 2/3/21 12:40 PM, yanfei...@windriver.com wrote:

From: Yanfei Xu

set_compound_order() sets both the page's compound_order and compound_nr. There is no need to assign to compound_nr again, so remove it.

Signed-off-by: Yanfei Xu

---
 mm/hugetlb.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3e4fa2c5e94..ac249b1583de 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1228,7 +1228,6 @@ static void destroy_compound_gigantic_page(struct page *page,
 	}

 	set_compound_order(page, 0);
-	page[1].compound_nr = 0;
 	__ClearPageHead(page);
 }
Re: [PATCH v2] mm/hugetlb: remove redundant check in preparing and destroying gigantic page
I'm sorry for forgetting to add David. Now adding David. :)

Thanks,
Yanfei

On 2/2/21 7:20 PM, yanfei...@windriver.com wrote:

From: Yanfei Xu

A gigantic page is a compound page whose order is greater than 1, thus hpage_pincount must be available for it. Let's remove the redundant check for gigantic pages.

Signed-off-by: Yanfei Xu

---
 mm/hugetlb.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3e4fa2c5e94..dac5db569ccb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1219,8 +1219,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;

 	atomic_set(compound_mapcount_ptr(page), 0);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	atomic_set(compound_pincount_ptr(page), 0);

 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
@@ -1501,9 +1500,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
-
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	atomic_set(compound_pincount_ptr(page), 0);
 }

 /*
Re: [PATCH] mm/hugetlb: remove a meaningless if statement in gigantic page initialization
On 2/2/21 6:06 PM, David Hildenbrand wrote: [Please note: This e-mail is from an EXTERNAL e-mail address]

On 02.02.21 11:12, yanfei...@windriver.com wrote:

From: Yanfei Xu

A gigantic page is a compound page whose order is greater than 1, thus hpage_pincount must be available for it. Let's remove this meaningless if statement.

Signed-off-by: Yanfei Xu

---
 mm/hugetlb.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3e4fa2c5e94..73d602f8c7e2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1501,9 +1501,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
-
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	atomic_set(compound_pincount_ptr(page), 0);
 }

 /*

I can spot similar handling in destroy_compound_gigantic_page(). If this is correct (which I think it is), we should tackle both occurrences at once.

Agree. Will do it in v2.

Thanks,
Yanfei

--
Thanks,
David / dhildenb
Re: KASAN: invalid-free in p9_client_create
How about this patch? If it is appropriate, I will send a real one.

mm/slub: fix slab double-free when release callback of sysfs triggers

Signed-off-by: Yanfei Xu

diff --git a/mm/slub.c b/mm/slub.c
index 4148235ba554..d10c4fbf8c84 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5653,7 +5653,7 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	s->kobj.kset = kset;
 	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
 	if (err) {
-		kobject_put(&s->kobj);
+		kfree_const(s->kobj.name);
 		goto out;
 	}
--

Because a previous patch, dde3c6b7 ("mm/slub: fix a memory leak in sysfs_slab_add()"), added a kobject_put() in the error path of kobject_init_and_add(), the release callback of sysfs, which is kmem_cache_release(), is triggered when we get into that error path. However, we also free the 'kmem_cache', 'kmem_cache_node', and 'kmem_cache->name' in the error path of create_cache() when kobject_init_and_add() returns failure. That makes a double free. As for the patch which introduced this issue, I think we could fix it by freeing the leaked memory directly instead of adding kobject_put().

Regards,
Yanfei

On 11/16/20 7:11 PM, Dmitry Vyukov wrote: [Please note this e-mail is from an EXTERNAL e-mail address]

On Mon, Nov 16, 2020 at 11:30 AM syzbot wrote:

Hello, syzbot found the following issue on:

HEAD commit: 92edc4ae Add linux-next specific files for 20201113
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=142f881650
kernel config: https://syzkaller.appspot.com/x/.config?x=79ad4f8ad2d96176
dashboard link: https://syzkaller.appspot.com/bug?extid=3a0f6c96e37e347c6ba9
compiler: gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+3a0f6c96e37e347c6...@syzkaller.appspotmail.com

Looks like a real double free in slab code. +MM maintainers

Note there was a preceding kmalloc failure in sysfs_slab_add.
RBP: 7fa358076ca0 R08: 2080 R09: R10: R11: 0246 R12: 001f R13: 7fff7dcf224f R14: 7fa3580779c0 R15: 0118bf2c kobject_add_internal failed for 9p-fcall-cache (error: -12 parent: slab) == BUG: KASAN: double-free or invalid-free in slab_free mm/slub.c:3157 [inline] BUG: KASAN: double-free or invalid-free in kmem_cache_free+0x82/0x350 mm/slub.c:3173 CPU: 0 PID: 15981 Comm: syz-executor.5 Not tainted 5.10.0-rc3-next-20201113-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230 kasan_report_invalid_free+0x51/0x80 mm/kasan/report.c:355 kasan_slab_free+0x100/0x110 mm/kasan/common.c:352 kasan_slab_free include/linux/kasan.h:194 [inline] slab_free_hook mm/slub.c:1548 [inline] slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1586 slab_free mm/slub.c:3157 [inline] kmem_cache_free+0x82/0x350 mm/slub.c:3173 create_cache mm/slab_common.c:274 [inline] kmem_cache_create_usercopy+0x2ab/0x300 mm/slab_common.c:357 p9_client_create+0xc4d/0x10c0 net/9p/client.c:1063 v9fs_session_init+0x1dd/0x1770 fs/9p/v9fs.c:406 v9fs_mount+0x79/0x9b0 fs/9p/vfs_super.c:126 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1549 do_new_mount fs/namespace.c:2896 [inline] path_mount+0x12ae/0x1e70 fs/namespace.c:3227 do_mount fs/namespace.c:3240 [inline] __do_sys_mount fs/namespace.c:3448 [inline] __se_sys_mount fs/namespace.c:3425 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45deb9 Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7fa358076c78 EFLAGS: 0246 ORIG_RAX: 00a5 RAX: ffda RBX: 
00021800 RCX: 0045deb9 RDX: 2100 RSI: 2040 RDI: RBP: 7fa358076ca0 R08: 2080 R09: R10: R11: 0246 R12: 001f R13: 7fff7dcf224f R14: 7fa3580779c0 R15: 0118bf2c Allocated by task 15981: kasan_save_stack+0x1b/0x40 mm/kasan/common.c:39 kasan_set_track mm/kasan/common.c:47 [inline] set_alloc_info mm/kasan/common.c:403 [inline] kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:434 kasan_slab_alloc
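The root cause Yanfei identifies above — a callee that frees caller-owned memory on failure while the caller frees it again afterwards — generalizes beyond slub. A minimal stand-alone sketch of the fixed ownership rule (hypothetical helper names, not the kernel API):

```c
#include <assert.h>
#include <stdlib.h>

int frees;  /* counts release() calls so the behavior is observable */

void release(char *name)
{
	free(name);
	frees++;
}

/* Hypothetical stand-in for kobject_init_and_add(): may fail.  The
 * buggy pattern had this callee free the caller-owned 'name' on
 * failure AND the caller free it again — a double free.  Fixed shape:
 * the callee only undoes its own work and leaves 'name' alone. */
int init_and_add(char *name, int fail)
{
	(void)name;              /* does not take ownership of 'name' */
	return fail ? -1 : 0;
}

/* Exactly one owner performs the free, on every path. */
int create_cache(int fail)
{
	char *name = malloc(8);
	int err = init_and_add(name, fail);

	release(name);           /* caller owns 'name'; freed exactly once */
	return err;
}
```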
Re: [External] Re: [PATCH] mm/memory.c: Introduce non-atomic __{Set,Clear}PageSwapCache
On 10/20/20 9:08 PM, Muchun Song wrote: On Tue, Oct 20, 2020 at 7:51 PM Xu, Yanfei wrote: On 10/19/20 10:58 PM, Muchun Song wrote: On Mon, Oct 19, 2020 at 8:31 PM Michal Hocko wrote: On Mon 19-10-20 18:15:20, Muchun Song wrote: For the exclusive reference page, the non-atomic operations is enough, so replace them to non-atomic operations. I do expect you do not see any difference in runtime and this is mostly driven by the code reading, right? Being explicit about this in the code would be preferred. Yeah, just code reading.And the set_bit and __set_bit is actually different on some architectures. Thanks. No objection to the change. Signed-off-by: Muchun Song With an improved changelog Acked-by: Michal Hocko --- include/linux/page-flags.h | 2 ++ mm/memory.c| 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index fbbb841a9346..ec039dde5e4b 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -401,6 +401,8 @@ static __always_inline int PageSwapCache(struct page *page) } SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) +__SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) +__CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) #else PAGEFLAG_FALSE(SwapCache) #endif diff --git a/mm/memory.c b/mm/memory.c index 2d267ef6621a..02dd62da26e0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3128,10 +3128,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) set_page_private(page, entry.val); /* Tell memcg to use swap ownership records */ - SetPageSwapCache(page); + __SetPageSwapCache(page); Good evening, Muchun. I found there are still some places could be replaced with __SetPageSwapCache(). Such as shmem_replace_page(), why Yeah, thanks for your suggestion. PG_locked has been set before SetPageSwapCache() is involved. In this case, It doesn't matter whether PG_locked is set before SetPageSwapCache. Sorry for this mistake. 
PG_locked is used for disk I/O. Would you please check the rest of the places? :)

OK, I'll take a look. Thanks.

Thanks.

Acked-by: Yanfei Xu

 	err = mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL);
-	ClearPageSwapCache(page);
+	__ClearPageSwapCache(page);
 	if (err) {
 		ret = VM_FAULT_OOM;
 		goto out_page;

--
2.20.1

--
Yours,
Muchun
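The SetPageSwapCache()/__SetPageSwapCache() distinction discussed above boils down to set_bit() versus __set_bit(): an atomic read-modify-write versus a plain one that is only safe while the caller holds an exclusive reference to the page. A user-space approximation with C11 atomics (the kernel's per-architecture implementations differ; the bit position is illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

#define PG_swapcache (1u << 3)   /* illustrative bit position, not the kernel's */

/* Atomic variant: safe even when other CPUs may update flags concurrently. */
void set_flag_atomic(atomic_uint *flags, unsigned bit)
{
	atomic_fetch_or(flags, bit);
}

/* Non-atomic variant: a plain read-modify-write.  Only valid when the
 * caller has exclusive access — exactly the condition the patch above
 * relies on for a freshly allocated, not-yet-visible page. */
void set_flag_nonatomic(unsigned *flags, unsigned bit)
{
	*flags |= bit;
}
```

On some architectures the atomic form is noticeably more expensive, which is why the kernel bothers to provide both.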
Re: [External] Re: [PATCH] mm/memory.c: Introduce non-atomic __{Set,Clear}PageSwapCache
On 10/19/20 10:58 PM, Muchun Song wrote: On Mon, Oct 19, 2020 at 8:31 PM Michal Hocko wrote: On Mon 19-10-20 18:15:20, Muchun Song wrote: For the exclusive reference page, the non-atomic operations is enough, so replace them to non-atomic operations. I do expect you do not see any difference in runtime and this is mostly driven by the code reading, right? Being explicit about this in the code would be preferred. Yeah, just code reading.And the set_bit and __set_bit is actually different on some architectures. Thanks. No objection to the change. Signed-off-by: Muchun Song With an improved changelog Acked-by: Michal Hocko --- include/linux/page-flags.h | 2 ++ mm/memory.c| 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index fbbb841a9346..ec039dde5e4b 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -401,6 +401,8 @@ static __always_inline int PageSwapCache(struct page *page) } SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) +__SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) +__CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) #else PAGEFLAG_FALSE(SwapCache) #endif diff --git a/mm/memory.c b/mm/memory.c index 2d267ef6621a..02dd62da26e0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3128,10 +3128,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) set_page_private(page, entry.val); /* Tell memcg to use swap ownership records */ - SetPageSwapCache(page); + __SetPageSwapCache(page); Good evening, Muchun. I found there are still some places could be replaced with __SetPageSwapCache(). Such as shmem_replace_page(), why PG_locked has been set before SetPageSwapCache() is involved. Would you please to check the rest places? 
:) Thanks Acked-by: Yanfei Xu err = mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL); - ClearPageSwapCache(page); + __ClearPageSwapCache(page); if (err) { ret = VM_FAULT_OOM; goto out_page; -- 2.20.1 -- Michal Hocko SUSE Labs
Re: [PATCH v2] mm/compaction: Rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()
On 10/19/20 6:48 PM, Vlastimil Babka wrote: On 10/19/20 12:29 PM, Xu, Yanfei wrote: On 10/19/20 5:40 PM, Vlastimil Babka wrote: On 10/19/20 10:36 AM, yanfei...@windriver.com wrote: From: Yanfei Xu There are two 'start_pfn' declared in compact_zone() which have different meanings. Rename the second one to 'iteration_start_pfn' to prevent trace_mm_compaction_end() from tracing an undesirable value. "to prevent confusion.", because trace_mm_compaction_end() has the correct value even before the patch - the second start_pfn is out of scope at that point. Thanks In the while statement, the second start_pfn is always reassigned the value of cc->migrate_pfn on every iteration, and cc->migrate_pfn might be changed in the loop. Does trace_mm_compaction_end() really want to trace the newly assigned start_pfn? compact_zone() { unsigned long start_pfn = cc->zone->zone_start_pfn; while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) { unsigned long start_pfn = cc->migrate_pfn; ... } trace_mm_compaction_end(start_pfn, cc->migrate_pfn, ...) } Unless my C knowledge fails me completely, the start_pfn in the while loop is a new, different local variable that shadows the start_pfn from the compact_zone() level, but does not modify its value. After the while loop finishes, start_pfn still has the value assigned at the beginning of compact_zone(), and that's what the tracepoint sees. You are right! And I got your point. As you said, the second start_pfn is out of scope at that point. Will send v3. Many thanks, Yanfei So renaming the variable in the while loop is not a bug fix, but removes confusion. Before commit 566e54e11 ("mm, compaction: remove last_migrated_pfn from compact_control"), there was only one start_pfn, which had a fixed value, and trace_mm_compaction_end() traced it too. Thus, I think the tracepoint might get an undesirable value. :) Thanks, Yanfei BTW, remove a useless semicolon.
Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Signed-off-by: Yanfei Xu --- v1->v2: Rename 'start_pfn' to 'iteration_start_pfn' and change commit messages. mm/compaction.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 176dcded298e..ccd27c739fd6 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) { int err; - unsigned long start_pfn = cc->migrate_pfn; + unsigned long iteration_start_pfn = cc->migrate_pfn; /* * Avoid multiple rescans which can happen if a page cannot be @@ -2284,7 +2284,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) */ cc->rescan = false; if (pageblock_start_pfn(last_migrated_pfn) == - pageblock_start_pfn(start_pfn)) { + pageblock_start_pfn(iteration_start_pfn)) { cc->rescan = true; } @@ -2308,8 +2308,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) goto check_drain; case ISOLATE_SUCCESS: update_cached = false; - last_migrated_pfn = start_pfn; - ; + last_migrated_pfn = iteration_start_pfn; } err = migrate_pages(&cc->migratepages, compaction_alloc,
Re: [PATCH v2] mm/compaction: Rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()
On 10/19/20 5:40 PM, Vlastimil Babka wrote: On 10/19/20 10:36 AM, yanfei...@windriver.com wrote: From: Yanfei Xu There are two 'start_pfn' declared in compact_zone() which have different meanings. Rename the second one to 'iteration_start_pfn' to prevent trace_mm_compaction_end() from tracing an undesirable value. "to prevent confusion.", because trace_mm_compaction_end() has the correct value even before the patch - the second start_pfn is out of scope at that point. Thanks In the while statement, the second start_pfn is always reassigned the value of cc->migrate_pfn on every iteration, and cc->migrate_pfn might be changed in the loop. Does trace_mm_compaction_end() really want to trace the newly assigned start_pfn? Before commit 566e54e11 ("mm, compaction: remove last_migrated_pfn from compact_control"), there was only one start_pfn, which had a fixed value, and trace_mm_compaction_end() traced it too. Thus, I think the tracepoint might get an undesirable value. :) Thanks, Yanfei BTW, remove a useless semicolon. Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Signed-off-by: Yanfei Xu --- v1->v2: Rename 'start_pfn' to 'iteration_start_pfn' and change commit messages.
mm/compaction.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 176dcded298e..ccd27c739fd6 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) { int err; - unsigned long start_pfn = cc->migrate_pfn; + unsigned long iteration_start_pfn = cc->migrate_pfn; /* * Avoid multiple rescans which can happen if a page cannot be @@ -2284,7 +2284,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) */ cc->rescan = false; if (pageblock_start_pfn(last_migrated_pfn) == - pageblock_start_pfn(start_pfn)) { + pageblock_start_pfn(iteration_start_pfn)) { cc->rescan = true; } @@ -2308,8 +2308,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) goto check_drain; case ISOLATE_SUCCESS: update_cached = false; - last_migrated_pfn = start_pfn; - ; + last_migrated_pfn = iteration_start_pfn; } err = migrate_pages(&cc->migratepages, compaction_alloc,
Re: [PATCH] mm/compaction: Remove some useless code in compact_zone()
On 10/16/20 11:05 PM, Vlastimil Babka wrote: On 10/14/20 2:28 PM, David Hildenbrand wrote: On 14.10.20 09:23, yanfei...@windriver.com wrote: From: Yanfei Xu start_pfn has already been declared at the beginning of compact_zone(), so there is no need to declare it again. Also remove a useless semicolon. Signed-off-by: Yanfei Xu --- mm/compaction.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 176dcded298e..5e69c1f94d96 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) { int err; - unsigned long start_pfn = cc->migrate_pfn; + start_pfn = cc->migrate_pfn; There is a user in trace_mm_compaction_end(start_pfn, cc->migrate_pfn, cc->free_pfn, end_pfn, sync, ret); we would now trace a different value, no? Agreed. We should rather rename the while-local one to avoid confusion. Something like "iteration_start_pfn"? Thank you, David and Vlastimil, for pointing out the impact on the tracepoint. I think "iteration_start_pfn" is appropriate and will send v2. /* * Avoid multiple rescans which can happen if a page cannot be @@ -2309,7 +2309,6 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) case ISOLATE_SUCCESS: update_cached = false; last_migrated_pfn = start_pfn; - ; Huh, how does something like that happen :) Looks like "case ISOLATE_SUCCESS:" used to be an empty implementation, then statements got added, but the semicolon was not removed. Yup, this case used to be empty.
Re: [PATCH] Bluetooth: Use lock_sock() when acquiring lock in sco_conn_del
On 10/16/20 12:10 PM, Hillf Danton wrote: On Fri, 16 Oct 2020 11:15:27 +0800 Yanfei Xu wrote: On 10/14/20 8:31 PM, Hillf Danton wrote: On Wed, 14 Oct 2020 15:17:31 +0800 From: Yanfei Xu Locking slock-AF_BLUETOOTH-BTPROTO_SCO may happen in process context or BH context. If in process context, we should use lock_sock(). As the warning below shows, sco_conn_del() is called in process context, so let's use lock_sock() instead of bh_lock_sock(). Sounds opposite because blocking BH in BH context provides no extra protection while it makes sense in the less critical context particularly wrt sock lock. WARNING: inconsistent lock state 5.9.0-rc4-syzkaller #0 Not tainted inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. syz-executor675/31233 [HC0[0]:SC0[0]:HE1:SE1] takes: 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: sco_conn_del+0x128/0x270 net/bluetooth/sco.c:176 {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] sco_sock_timeout+0x24/0x140 net/bluetooth/sco.c:83 call_timer_fn+0x1ac/0x760 kernel/time/timer.c:1413 expire_timers kernel/time/timer.c:1458 [inline] __run_timers.part.0+0x67c/0xaa0 kernel/time/timer.c:1755 __run_timers kernel/time/timer.c:1736 [inline] run_timer_softirq+0xae/0x1a0 kernel/time/timer.c:1768 __do_softirq+0x1f7/0xa91 kernel/softirq.c:298 asm_call_on_stack+0xf/0x20 arch/x86/entry/entry_64.S:706 __run_on_irqstack arch/x86/include/asm/irq_stack.h:22 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:48 [inline] do_softirq_own_stack+0x9d/0xd0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:393 [inline] __irq_exit_rcu kernel/softirq.c:423 [inline] irq_exit_rcu+0x235/0x280 kernel/softirq.c:435
sysvec_apic_timer_interrupt+0x51/0xf0 arch/x86/kernel/apic/apic.c:1091 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:581 unwind_next_frame+0x139a/0x1f90 arch/x86/kernel/unwind_orc.c:607 arch_stack_walk+0x81/0xf0 arch/x86/kernel/stacktrace.c:25 stack_trace_save+0x8c/0xc0 kernel/stacktrace.c:123 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 slab_post_alloc_hook mm/slab.h:518 [inline] slab_alloc mm/slab.c:3312 [inline] kmem_cache_alloc+0x13a/0x3a0 mm/slab.c:3482 __d_alloc+0x2a/0x950 fs/dcache.c:1709 d_alloc+0x4a/0x230 fs/dcache.c:1788 d_alloc_parallel+0xe9/0x18e0 fs/dcache.c:2540 lookup_open.isra.0+0x9ac/0x1350 fs/namei.c:3030 open_last_lookups fs/namei.c:3177 [inline] path_openat+0x96d/0x2730 fs/namei.c:3365 do_filp_open+0x17e/0x3c0 fs/namei.c:3395 do_sys_openat2+0x16d/0x420 fs/open.c:1168 do_sys_open fs/open.c:1184 [inline] __do_sys_open fs/open.c:1192 [inline] __se_sys_open fs/open.c:1188 [inline] __x64_sys_open+0x119/0x1c0 fs/open.c:1188 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 irq event stamp: 853 hardirqs last enabled at (853): [] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline] hardirqs last enabled at (853): [] _raw_spin_unlock_irq+0x1f/0x80 kernel/locking/spinlock.c:199 hardirqs last disabled at (852): [] __raw_spin_lock_irq include/linux/spinlock_api_smp.h:126 [inline] hardirqs last disabled at (852): [] _raw_spin_lock_irq+0xa4/0xd0 kernel/locking/spinlock.c:167 softirqs last enabled at (0): [] copy_process+0x1a99/0x6920 kernel/fork.c:2018 softirqs last disabled at (0): [<>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(slock-AF_BLUETOOTH-BTPROTO_SCO); lock(slock-AF_BLUETOOTH-BTPROTO_SCO); *** DEADLOCK *** 3 locks held by syz-executor675/31233: #0: 88809f104f40 (>req_lock){+.+.}-{3:3}, at: 
hci_dev_do_close+0xf5/0x1080 net/bluetooth/hci_core.c:1720 #1: 88809f104078 (>lock){+.+.}-{3:3}, at: hci_dev_do_close+0x253/0x1080 net/bluetooth/hci_core.c:1757 #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_disconn_cfm include/net/bluetooth/hci_core.h:1435 [inline] #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_conn_hash_flush+0xc7/0x220 net/bluetooth/hci_conn.c:1557 stack backtrace: CPU: 1 PID: 31233 Comm: syz-executor675 Not tainted 5.9.0-rc4-syzkaller Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118
Re: [PATCH] Bluetooth: Use lock_sock() when acquiring lock in sco_conn_del
On 10/14/20 8:31 PM, Hillf Danton wrote: On Wed, 14 Oct 2020 15:17:31 +0800 From: Yanfei Xu Locking slock-AF_BLUETOOTH-BTPROTO_SCO may happen in process context or BH context. If in process context, we should use lock_sock(). As the warning below shows, sco_conn_del() is called in process context, so let's use lock_sock() instead of bh_lock_sock(). Sounds opposite because blocking BH in BH context provides no extra protection while it makes sense in the less critical context particularly wrt sock lock. WARNING: inconsistent lock state 5.9.0-rc4-syzkaller #0 Not tainted inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. syz-executor675/31233 [HC0[0]:SC0[0]:HE1:SE1] takes: 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: sco_conn_del+0x128/0x270 net/bluetooth/sco.c:176 {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] sco_sock_timeout+0x24/0x140 net/bluetooth/sco.c:83 call_timer_fn+0x1ac/0x760 kernel/time/timer.c:1413 expire_timers kernel/time/timer.c:1458 [inline] __run_timers.part.0+0x67c/0xaa0 kernel/time/timer.c:1755 __run_timers kernel/time/timer.c:1736 [inline] run_timer_softirq+0xae/0x1a0 kernel/time/timer.c:1768 __do_softirq+0x1f7/0xa91 kernel/softirq.c:298 asm_call_on_stack+0xf/0x20 arch/x86/entry/entry_64.S:706 __run_on_irqstack arch/x86/include/asm/irq_stack.h:22 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:48 [inline] do_softirq_own_stack+0x9d/0xd0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:393 [inline] __irq_exit_rcu kernel/softirq.c:423 [inline] irq_exit_rcu+0x235/0x280 kernel/softirq.c:435 sysvec_apic_timer_interrupt+0x51/0xf0 arch/x86/kernel/apic/apic.c:1091
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:581 unwind_next_frame+0x139a/0x1f90 arch/x86/kernel/unwind_orc.c:607 arch_stack_walk+0x81/0xf0 arch/x86/kernel/stacktrace.c:25 stack_trace_save+0x8c/0xc0 kernel/stacktrace.c:123 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 slab_post_alloc_hook mm/slab.h:518 [inline] slab_alloc mm/slab.c:3312 [inline] kmem_cache_alloc+0x13a/0x3a0 mm/slab.c:3482 __d_alloc+0x2a/0x950 fs/dcache.c:1709 d_alloc+0x4a/0x230 fs/dcache.c:1788 d_alloc_parallel+0xe9/0x18e0 fs/dcache.c:2540 lookup_open.isra.0+0x9ac/0x1350 fs/namei.c:3030 open_last_lookups fs/namei.c:3177 [inline] path_openat+0x96d/0x2730 fs/namei.c:3365 do_filp_open+0x17e/0x3c0 fs/namei.c:3395 do_sys_openat2+0x16d/0x420 fs/open.c:1168 do_sys_open fs/open.c:1184 [inline] __do_sys_open fs/open.c:1192 [inline] __se_sys_open fs/open.c:1188 [inline] __x64_sys_open+0x119/0x1c0 fs/open.c:1188 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 irq event stamp: 853 hardirqs last enabled at (853): [] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline] hardirqs last enabled at (853): [] _raw_spin_unlock_irq+0x1f/0x80 kernel/locking/spinlock.c:199 hardirqs last disabled at (852): [] __raw_spin_lock_irq include/linux/spinlock_api_smp.h:126 [inline] hardirqs last disabled at (852): [] _raw_spin_lock_irq+0xa4/0xd0 kernel/locking/spinlock.c:167 softirqs last enabled at (0): [] copy_process+0x1a99/0x6920 kernel/fork.c:2018 softirqs last disabled at (0): [<>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(slock-AF_BLUETOOTH-BTPROTO_SCO); lock(slock-AF_BLUETOOTH-BTPROTO_SCO); *** DEADLOCK *** 3 locks held by syz-executor675/31233: #0: 88809f104f40 (>req_lock){+.+.}-{3:3}, at: hci_dev_do_close+0xf5/0x1080 net/bluetooth/hci_core.c:1720 #1: 88809f104078 
(>lock){+.+.}-{3:3}, at: hci_dev_do_close+0x253/0x1080 net/bluetooth/hci_core.c:1757 #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_disconn_cfm include/net/bluetooth/hci_core.h:1435 [inline] #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_conn_hash_flush+0xc7/0x220 net/bluetooth/hci_conn.c:1557 stack backtrace: CPU: 1 PID: 31233 Comm: syz-executor675 Not tainted 5.9.0-rc4-syzkaller Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 print_usage_bug kernel/locking/lockdep.c:4020 [inline] valid_state kernel/locking/lockdep.c:3361 [inline] mark_lock_irq kernel/locking/lockdep.c:3560
Re: inconsistent lock state in sco_conn_del
> syzbot has found a reproducer for the following issue on: > > HEAD commit:e8878ab8 Merge tag 'spi-fix-v5.9-rc4' of git://git.kernel... > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=1213075990 > kernel config: https://syzkaller.appspot.com/x/.config?x=c61610091f4ca8c4 > dashboard link: https://syzkaller.appspot.com/bug?extid=65684128cd7c35bc66a1 > compiler: gcc (GCC) 10.1.0-syz 20200507 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=121ef0fd90 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=16c3a85390 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+65684128cd7c35bc6...@syzkaller.appspotmail.com > > > WARNING: inconsistent lock state > 5.9.0-rc4-syzkaller #0 Not tainted > > inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. > syz-executor675/31233 [HC0[0]:SC0[0]:HE1:SE1] takes: > 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] > 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: sco_conn_del+0x128/0x270 net/bluetooth/sco.c:176 > {IN-SOFTIRQ-W} state was registered at: >lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 >__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] >_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 >spin_lock include/linux/spinlock.h:354 [inline] >sco_sock_timeout+0x24/0x140 net/bluetooth/sco.c:83 >call_timer_fn+0x1ac/0x760 kernel/time/timer.c:1413 >expire_timers kernel/time/timer.c:1458 [inline] >__run_timers.part.0+0x67c/0xaa0 kernel/time/timer.c:1755 >__run_timers kernel/time/timer.c:1736 [inline] >run_timer_softirq+0xae/0x1a0 kernel/time/timer.c:1768 >__do_softirq+0x1f7/0xa91 kernel/softirq.c:298 >asm_call_on_stack+0xf/0x20 arch/x86/entry/entry_64.S:706 >__run_on_irqstack arch/x86/include/asm/irq_stack.h:22 [inline] >run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:48 [inline] >do_softirq_own_stack+0x9d/0xd0 
arch/x86/kernel/irq_64.c:77 >invoke_softirq kernel/softirq.c:393 [inline] >__irq_exit_rcu kernel/softirq.c:423 [inline] >irq_exit_rcu+0x235/0x280 kernel/softirq.c:435 >sysvec_apic_timer_interrupt+0x51/0xf0 arch/x86/kernel/apic/apic.c:1091 >asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:581 >unwind_next_frame+0x139a/0x1f90 arch/x86/kernel/unwind_orc.c:607 >arch_stack_walk+0x81/0xf0 arch/x86/kernel/stacktrace.c:25 >stack_trace_save+0x8c/0xc0 kernel/stacktrace.c:123 >kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48 >kasan_set_track mm/kasan/common.c:56 [inline] >__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461 >slab_post_alloc_hook mm/slab.h:518 [inline] >slab_alloc mm/slab.c:3312 [inline] >kmem_cache_alloc+0x13a/0x3a0 mm/slab.c:3482 >__d_alloc+0x2a/0x950 fs/dcache.c:1709 >d_alloc+0x4a/0x230 fs/dcache.c:1788 >d_alloc_parallel+0xe9/0x18e0 fs/dcache.c:2540 >lookup_open.isra.0+0x9ac/0x1350 fs/namei.c:3030 >open_last_lookups fs/namei.c:3177 [inline] >path_openat+0x96d/0x2730 fs/namei.c:3365 >do_filp_open+0x17e/0x3c0 fs/namei.c:3395 >do_sys_openat2+0x16d/0x420 fs/open.c:1168 >do_sys_open fs/open.c:1184 [inline] >__do_sys_open fs/open.c:1192 [inline] >__se_sys_open fs/open.c:1188 [inline] >__x64_sys_open+0x119/0x1c0 fs/open.c:1188 >do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 >entry_SYSCALL_64_after_hwframe+0x44/0xa9 > irq event stamp: 853 > hardirqs last enabled at (853): [] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline] > hardirqs last enabled at (853): [] _raw_spin_unlock_irq+0x1f/0x80 kernel/locking/spinlock.c:199 > hardirqs last disabled at (852): [] __raw_spin_lock_irq include/linux/spinlock_api_smp.h:126 [inline] > hardirqs last disabled at (852): [] _raw_spin_lock_irq+0xa4/0xd0 kernel/locking/spinlock.c:167 > softirqs last enabled at (0): [] copy_process+0x1a99/0x6920 kernel/fork.c:2018 > softirqs last disabled at (0): [<>] 0x0 > > other info that might help us debug this: > Possible unsafe locking 
scenario: > > CPU0 > >lock(slock-AF_BLUETOOTH-BTPROTO_SCO); > > lock(slock-AF_BLUETOOTH-BTPROTO_SCO); > > *** DEADLOCK *** > > 3 locks held by syz-executor675/31233: > #0: 88809f104f40 (>req_lock){+.+.}-{3:3}, at: hci_dev_do_close+0xf5/0x1080 net/bluetooth/hci_core.c:1720 > #1: 88809f104078 (>lock){+.+.}-{3:3}, at: hci_dev_do_close+0x253/0x1080 net/bluetooth/hci_core.c:1757 > #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_disconn_cfm include/net/bluetooth/hci_core.h:1435 [inline] > #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: hci_conn_hash_flush+0xc7/0x220 net/bluetooth/hci_conn.c:1557 > > stack backtrace: > CPU: 1
Re: [PATCH v2] sched, mm: Optimize current_gfp_context()
On 9/18/20 11:18 PM, Waiman Long wrote: On 9/18/20 2:44 AM, Xu, Yanfei wrote: Hi Waiman, On 8/12/20 6:29 AM, Andrew Morton wrote: On Thu, 18 Jun 2020 17:29:36 -0400 Waiman Long wrote: The current_gfp_context() converts a number of PF_MEMALLOC_* per-process flags into the corresponding GFP_* flags for memory allocation. In that function, current->flags is accessed 3 times. That may lead to duplicated access of the same memory location. I am puzzled by this comment: what does "That may lead to duplicated access of the same memory location" mean? After using the variable 'pflags', won't it still access the same memory location? Looking forward to your reply :) That condition usually won't happen on a non-debug kernel. However, if certain debugging capabilities are turned on, access to current will be compiled into a bunch of checks and memory accesses. So if current is used multiple times, the same set of code will be duplicated the same number of times, slowing down the operation and increasing code size. By accessing current once, we avoid this overhead in a debug kernel. Cheers, Longman Hah, got it. Thanks for your detailed explanation! cheers, Yanfei
Re: [PATCH v2] sched, mm: Optimize current_gfp_context()
Hi Waiman, On 8/12/20 6:29 AM, Andrew Morton wrote: On Thu, 18 Jun 2020 17:29:36 -0400 Waiman Long wrote: The current_gfp_context() converts a number of PF_MEMALLOC_* per-process flags into the corresponding GFP_* flags for memory allocation. In that function, current->flags is accessed 3 times. That may lead to duplicated access of the same memory location. I am puzzled by this comment: what does "That may lead to duplicated access of the same memory location" mean? After using the variable 'pflags', won't it still access the same memory location? Looking forward to your reply :) Thanks, Yanfei This is not usually a problem with minimal debug config options on, as the compiler can optimize away the duplicated memory accesses. With most of the debug config options on, however, that may not be the case. For example, the x86-64 object size of __need_fs_reclaim() in a debug kernel that calls current_gfp_context() was 309 bytes. With this patch applied, the object size is reduced to 202 bytes. This is a saving of 107 bytes and will probably be slightly faster too. ... --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -181,18 +181,20 @@ static inline bool in_vfork(struct task_struct *tsk) */ static inline gfp_t current_gfp_context(gfp_t flags) { - if (unlikely(current->flags & + unsigned int pflags = READ_ONCE(current->flags); Why use READ_ONCE() here? + if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) { /* * NOIO implies both NOIO and NOFS and it is a weaker context * so always make sure it makes precedence */ - if (current->flags & PF_MEMALLOC_NOIO) + if (pflags & PF_MEMALLOC_NOIO) flags &= ~(__GFP_IO | __GFP_FS); - else if (current->flags & PF_MEMALLOC_NOFS) + else if (pflags & PF_MEMALLOC_NOFS) flags &= ~__GFP_FS; #ifdef CONFIG_CMA - if (current->flags & PF_MEMALLOC_NOCMA) + if (pflags & PF_MEMALLOC_NOCMA) flags &= ~__GFP_MOVABLE; #endif }
Re: [PATCH] mm/page_alloc.c: avoid inheritting current's flags when invoked in interrupt
On 9/16/20 9:17 AM, Andrew Morton wrote: On Tue, 15 Sep 2020 15:56:35 +0800 wrote: From: Yanfei Xu alloc_mask shouldn't inherit the current task's flags when __alloc_pages_nodemask() is invoked in interrupt context. ... --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4889,7 +4889,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, * from a particular context which has been marked by * memalloc_no{fs,io}_{save,restore}. */ - alloc_mask = current_gfp_context(gfp_mask); + if (!in_interrupt()) + alloc_mask = current_gfp_context(gfp_mask); ac.spread_dirty_pages = false; /* hm, yes, and perhaps other callsites in page_alloc.c. I assume this doesn't actually make any runtime difference? Because gfp_mask in interrupt contexts isn't going to have __GFP_IO or __GFP_FS anyway. Thanks for your reply! Yes, it doesn't make any runtime difference. Theoretically, GFP_ATOMIC or GFP_NOWAIT should be used for allocating pages in interrupt context, so gfp_mask isn't going to have __GFP_IO or __GFP_FS. But if somebody uses a wrong gfp_mask, __GFP_IO or __GFP_FS will be introduced, and if the interrupted process has PF_MEMALLOC_NOIO or PF_MEMALLOC_NOFS set, current_gfp_context() may help to hide these wrong usages. I don't think that is the original purpose of that piece of code. And how about adding a BUG_ON or WARN_ON to catch the situation where __GFP_IO or __GFP_FS is passed in interrupt context? Regards, Yanfei
Re: [PATCH] mm/page_alloc.c: variable type of 'progress' should be 'unsigned long'
On 9/16/20 8:48 AM, Andrew Morton wrote: On Tue, 15 Sep 2020 18:46:20 +0800 wrote: From: Yanfei Xu try_to_free_pages() returns the number of pages reclaimed, and its return type is 'unsigned long'. So we should use a matching type for storing it. __perform_reclaim() returns an int, so this change is fairly pointless. However, __perform_reclaim()'s single caller expects it to return unsigned long, so including that change in this patch would make more sense. Yeah, thanks for the reminder. I will add that and send a v2. :) Regards, Yanfei
Re: [PATCH] USB: integrate macro definitions into include/linux/usb.h
I just think it is a cleanup to gather together these macros, which have similar attributes. That's why :) Thanks, Yanfei On 8/28/20 3:48 PM, Greg KH wrote: On Tue, Aug 25, 2020 at 11:44:21PM +0800, yanfei...@windriver.com wrote: From: Yanfei Xu include/linux/usb.h also contains 'Hard limit' and 'Arbitrary limit' macro definitions, hence we can integrate these from config.c into include/linux/usb.h Why? No one uses these values outside of this .c file, so why put a value in a global .h file? Who else wants to use these values? If something else needs it, then sure, it could be moved, but until then, there's nothing wrong with the existing code as-is from what I can tell. thanks, greg k-h
Re: [PATCH] USB: core: limit access to rawdescriptors which were not allocated
On 8/26/20 2:00 AM, Alan Stern wrote: On Wed, Aug 26, 2020 at 12:16:59AM +0800, yanfei...@windriver.com wrote: From: Yanfei Xu When using a syscall to read the raw descriptors, make sure we won't access raw descriptors that were never allocated, i.e. those whose number exceeds USB_MAXCONFIG. Reported-by: syzbot+256e56ddde8b8957e...@syzkaller.appspotmail.com Signed-off-by: Yanfei Xu --- drivers/usb/core/sysfs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/usb/core/sysfs.c b/drivers/usb/core/sysfs.c index a2ca38e25e0c..1a7a625e5f55 100644 --- a/drivers/usb/core/sysfs.c +++ b/drivers/usb/core/sysfs.c @@ -895,7 +895,8 @@ read_descriptors(struct file *filp, struct kobject *kobj, * configurations (config plus subsidiary descriptors). */ for (cfgno = -1; cfgno < udev->descriptor.bNumConfigurations && - nleft > 0; ++cfgno) { + nleft > 0 && + cfgno < USB_MAXCONFIG; ++cfgno) { if (cfgno < 0) { src = &udev->descriptor; srclen = sizeof(struct usb_device_descriptor); This is not the right way to fix the problem. Instead, we should make sure that udev->descriptor.bNumConfigurations is always <= USB_MAXCONFIG. That's what this code in usb_get_configuration() is supposed to do: int ncfg = dev->descriptor.bNumConfigurations; ... if (ncfg > USB_MAXCONFIG) { dev_warn(ddev, "too many configurations: %d, " "using maximum allowed: %d\n", ncfg, USB_MAXCONFIG); dev->descriptor.bNumConfigurations = ncfg = USB_MAXCONFIG; } If you want to fix the bug, you need to figure out why this isn't working. Thanks for your suggestion. The patch is not right. I'll try to look into the root cause. Regards, Yanfei. Alan Stern
Re: [PATCH] mm/memory.c: Correct the function name in comment
On 8/18/20 3:34 PM, David Hildenbrand wrote: On 18.08.20 09:29, yanfei...@windriver.com wrote: From: Yanfei Xu Correct the function name "pte_alloc_pne" to "pte_alloc_one". I'd have phrased this like "mm/memory: Fix typo in __do_fault() comment It's "pte_alloc_one", not "pte_alloc_pne". Let's fix that. " Hah, yours is clearer. I'll send a v2. Thanks for your suggestion. :) //Yanfei Reviewed-by: David Hildenbrand Signed-off-by: Yanfei Xu --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index c3a83f4ca851..9cc3d0dc816c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3446,7 +3446,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) * unlock_page(A) * lock_page(B) * lock_page(B) -* pte_alloc_pne +* pte_alloc_one * shrink_page_list * wait_on_page_writeback(A) * SetPageWriteback(B)
Re: [PATCH] userfaultfd: avoid the duplicated release for userfaultfd_ctx
On 7/20/20 12:57 AM, Al Viro wrote: On Sun, Jul 19, 2020 at 09:58:34PM +0800, Xu, Yanfei wrote: ping Al Viro Could you please help to review this patch? Thanks a lot. That's -next, right? As for the patch itself... Frankly, Yes, it's -next. Daniel's patch looks seriously wrong. Got it. Regards, Yanfei * why has O_CLOEXEC been quietly smuggled in? It's a userland ABI change, for fsck sake... * the double-put you've spotted * the whole out: thing - just make it if (IS_ERR(file)) { userfaultfd_ctx_put(ctx); return PTR_ERR(file); } and be done with that.
Re: [PATCH] userfaultfd: avoid the duplicated release for userfaultfd_ctx
ping Al Viro Could you please help to review this patch? Thanks a lot. Yanfei On 7/15/20 12:12 AM, yanfei...@windriver.com wrote: From: Yanfei Xu When get_unused_fd_flags() fails, the ctx allocated from userfaultfd_ctx_cachep will be freed by userfaultfd_fops' release function, which is userfaultfd_release. So we can return directly after fput(). userfaultfd_release()->userfaultfd_ctx_put(ctx) Fixes: d08ac70b1e0d (Wire UFFD up to SELinux) Reported-by: syzbot+75867c44841cb6373...@syzkaller.appspotmail.com Signed-off-by: Yanfei Xu --- fs/userfaultfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 3a4d6ac5a81a..e98317c15530 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -2049,7 +2049,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC); if (fd < 0) { fput(file); - goto out; + return fd; } ctx->owner = file_inode(file);
Re: [PATCH] userfaultfd: avoid the duplicated release for userfaultfd_ctx
Add maintainer Alexander Viro :)
On 7/15/20 12:12 AM, yanfei...@windriver.com wrote: From: Yanfei Xu
When get_unused_fd_flags() fails, userfaultfd_ctx_cachep will be freed by userfaultfd_fops's release function, userfaultfd_release(). So we can return directly after fput().
userfaultfd_release()->userfaultfd_ctx_put(ctx)
Fixes: d08ac70b1e0d (Wire UFFD up to SELinux)
Reported-by: syzbot+75867c44841cb6373...@syzkaller.appspotmail.com
Signed-off-by: Yanfei Xu
---
 fs/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3a4d6ac5a81a..e98317c15530 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -2049,7 +2049,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
 	if (fd < 0) {
 		fput(file);
-		goto out;
+		return fd;
 	}
 	ctx->owner = file_inode(file);
Re: BUG:loop:blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES)
On 5/14/20 12:41 AM, Darrick J. Wong wrote: [add fsdevel to cc] On Tue, May 12, 2020 at 08:22:08PM -0600, Jens Axboe wrote: On 5/12/20 8:14 PM, Xu, Yanfei wrote:
Hi, after operating the /dev/loop which was losetup'd with an image placed in tmpfs, I got the following ERROR messages:
[cut here]
[ 183.110770] blk_update_request: I/O error, dev loop6, sector 524160 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.123949] blk_update_request: I/O error, dev loop6, sector 522 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.137123] blk_update_request: I/O error, dev loop6, sector 16906 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.150314] blk_update_request: I/O error, dev loop6, sector 32774 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.163551] blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.176824] blk_update_request: I/O error, dev loop6, sector 65542 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.190029] blk_update_request: I/O error, dev loop6, sector 82442 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.203281] blk_update_request: I/O error, dev loop6, sector 98310 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.216531] blk_update_request: I/O error, dev loop6, sector 115210 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.229914] blk_update_request: I/O error, dev loop6, sector 131078 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
I have found the commit which introduced this issue by git bisect:
commit efcfec57 ("loop: fix no-unmap write-zeroes request behavior")
Please CC the author of that commit too. Leaving the rest quoted below.
Kernel version: Linux version 5.6.0
Frequency: every time
Steps to reproduce:
1. git clone the mainline kernel
2. compile the kernel with ARCH=x86_64, then boot the system with it (other arches seem to reproduce it as well)
3. make an image with "dd of=/tmp/image if=/dev/zero bs=1M count=256"
4. place the image in a tmpfs directory
5. losetup /dev/loop6 /PATH/TO/image
6. mkfs.ext2 /dev/loop6
Any comments will be appreciated.
Hm, you got I/O failures here because shmem_fallocate doesn't support the FALLOC_FL_ZERO_RANGE mode. That might not be too hard to add, but there's a broader problem of detecting fallocate support -- the loop driver assumes that if the file has an fallocate method then it's safe to set max_discard_sectors (and now max_write_zeroes_sectors) to UINT_MAX>>9. There's currently no good way to detect which modes are supported by a filesystem's ->fallocate function, or to discover the required granularity. Right now we tell application developers that the way to discover the conditions under which fallocate will work is to try it and see if they get EOPNOTSUPP. One way to "fix" this would be to fix lo_fallocate to set RQF_QUIET if the filesystem returns EOPNOTSUPP, which gets rid of the log messages. We probably ought to zero out the appropriate max_*_sectors if we get EOPNOTSUPP.
Many thanks for your detailed reply :) Having no good method for detecting fallocate support is a real problem, and the "fix" you mentioned is a good workaround for the current situation.
Best regards, Yanfei
--D
Thanks, Yanfei
-- Jens Axboe
BUG:loop:blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES)
Hi, after operating the /dev/loop which was losetup'd with an image placed in tmpfs, I got the following ERROR messages:
[cut here]
[ 183.110770] blk_update_request: I/O error, dev loop6, sector 524160 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.123949] blk_update_request: I/O error, dev loop6, sector 522 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.137123] blk_update_request: I/O error, dev loop6, sector 16906 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.150314] blk_update_request: I/O error, dev loop6, sector 32774 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.163551] blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.176824] blk_update_request: I/O error, dev loop6, sector 65542 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.190029] blk_update_request: I/O error, dev loop6, sector 82442 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.203281] blk_update_request: I/O error, dev loop6, sector 98310 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.216531] blk_update_request: I/O error, dev loop6, sector 115210 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.229914] blk_update_request: I/O error, dev loop6, sector 131078 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
I have found the commit which introduced this issue by git bisect:
commit efcfec57 ("loop: fix no-unmap write-zeroes request behavior")
Tips: my reproduce command: mkfs.ext2 /dev/loop6
Thanks, Yanfei
[loop]efcfec579: BUG:blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES)
Hi, after operating the /dev/loop which was losetup'd with an image placed in tmpfs, I got the following ERROR messages:
[cut here]
[ 183.110770] blk_update_request: I/O error, dev loop6, sector 524160 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.123949] blk_update_request: I/O error, dev loop6, sector 522 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.137123] blk_update_request: I/O error, dev loop6, sector 16906 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.150314] blk_update_request: I/O error, dev loop6, sector 32774 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.163551] blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.176824] blk_update_request: I/O error, dev loop6, sector 65542 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.190029] blk_update_request: I/O error, dev loop6, sector 82442 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.203281] blk_update_request: I/O error, dev loop6, sector 98310 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.216531] blk_update_request: I/O error, dev loop6, sector 115210 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[ 183.229914] blk_update_request: I/O error, dev loop6, sector 131078 op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
I have found the commit which introduced this issue by git bisect:
commit efcfec57 ("loop: fix no-unmap write-zeroes request behavior")
Kernel version: Linux version 5.6.0
Frequency: every time
Steps to reproduce:
1. git clone the mainline kernel
2. compile the kernel with ARCH=x86_64, then boot the system with it (other arches seem to reproduce it as well)
3. make an image with "dd of=/tmp/image if=/dev/zero bs=1M count=256"
4. place the image in a tmpfs directory
5. losetup /dev/loop6 /PATH/image
6. mkfs.ext2 /dev/loop6
Any comments will be appreciated.
Thanks, Yanfei