Re: [Question] Is preempt_disable/enable needed in non-preemption code path

2021-04-18 Thread Xu, Yanfei




On 4/17/21 1:26 AM, Paul E. McKenney wrote:


On Fri, Apr 16, 2021 at 06:51:10PM +0800, Xu, Yanfei wrote:



On 4/16/21 1:07 AM, Paul E. McKenney wrote:


On Fri, Apr 16, 2021 at 12:18:42AM +0800, Xu, Yanfei wrote:



On 4/15/21 11:43 PM, Paul E. McKenney wrote:


On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote:

Hi experts,

I am learning the RCU mechanism and its code. When looking at
rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations
in a non-preemption code path, and they have been there for a long time. I
can't understand why we need them. Is there something I missed? If not, can
we remove the unnecessary operations as below?


Good point, you are right that preemption is disabled anyway in that block
of code.  However, preempt_disable() and preempt_enable() also prevent the
compiler from moving that READ_ONCE() around.  So my question to you is
whether it is safe to remove those statements entirely or whether they
should instead be replaced by barrier() or similar.


Thanks for your reply! :)

Yes, preempt_disable() and preempt_enable() are defined as barrier() under
!preemption. barrier() can prevent that READ_ONCE() from being reordered, but
based on my current understanding, the volatile in READ_ONCE() also tells the
compiler not to reorder it. So, I think it's safe?
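
For reference, a simplified sketch of what include/linux/preempt.h boils down
to for !CONFIG_PREEMPT_COUNT kernels; this is an approximation, not the
verbatim source:

#define barrier()		__asm__ __volatile__("" ::: "memory")
#define preempt_disable()	barrier()
#define preempt_enable()	barrier()

So in this configuration the pair contributes nothing beyond two compiler
barriers.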


Maybe.

Please keep in mind that although the compiler is prohibited from
reordering volatile accesses with each other, there is nothing stopping
it from reordering volatile accesses with non-volatile accesses.
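
For reference, the essence of READ_ONCE()/WRITE_ONCE() is roughly the
following; this is a simplification of the kernel's rwonce.h, which adds type
checking on top:

#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

The volatile qualifier keeps volatile accesses ordered against each other at
compile time but, as noted above, says nothing about the plain accesses
around them.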


Thanks for your patient explanation!

I am trying to absorb what you said. Below is my understanding:
1. "the compiler is prohibited from reordering volatile accesses with each
other" means these situations:
int a;
void foo(void)
{
	for (;;)
		READ_ONCE(a);
}

or

int a, b;
void foo(void)
{
	int c, d;
	c = READ_ONCE(a);
	d = READ_ONCE(b);
}


Yes, in both cases the load instructions emitted for the READ_ONCE()
macros must be emitted in order.  The underlying hardware is free
to reorder.


Got it.



2. "volatile accesses with non-volatile accesses" means d=b may happen
before c=READ_ONCE(a) :
int a;
foo()
{
 int b = 2
 int c,d;
 c = READ_ONCE(a);
 d = b;
}
if we want to keep the ordering of volatile access "c=READ_ONCE(a)" and
non-volatile access "d=b", we should use stronger barrier like barrier().


Or an additional READ_ONCE() for b or a WRITE_ONCE() for d.  But again,
this would constrain only the compiler, not the hardware.

But this wouldn't matter in most cases, because both b and d are local
variables whose addresses were never taken.  So someone would need to
be using something crazy to poke into others' stacks for this to matter.
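
A minimal sketch of the three compiler-ordering fixes mentioned above, using
the READ_ONCE()/WRITE_ONCE()/barrier() sketches from earlier; all three
constrain only the compiler, not the hardware:

int a;

void foo(void)
{
	int b = 2;
	int c, d;

	c = READ_ONCE(a);
	barrier();		/* option 1: full compiler barrier */
	d = b;

	c = READ_ONCE(a);
	d = READ_ONCE(b);	/* option 2: volatile load of b */

	c = READ_ONCE(a);
	WRITE_ONCE(d, b);	/* option 3: volatile store to d */
}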


Agree.



Hope I didn't misunderstand.


It looks like you have most of it.


Back to rcu_blocking_is_gp(): I found this link today:
https://www.spinics.net/lists/rcu/msg03985.html
Even with the content in this link, I still haven't grasped the meaning of
these two barrier() calls. I think I should first learn about CPU hotplug and
the things discussed in the link, to make sure I am not missing something,
and then consult you. :)


That sounds like a very good approach!

Keep in mind that I am worried not just about the current state of
the code and compilers, but also their possible future states.


I see.

Thanks again.

Best regards,
Yanfei


 Thanx, Paul


Best regards,
Yanfei



  Thanx, Paul


Best regards,
Yanfei



   Thanx, Paul


diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
   if (IS_ENABLED(CONFIG_PREEMPTION))
   return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
   might_sleep();  /* Check for RCU read-side critical section. */
-   preempt_disable();
   /*
* If the rcu_state.n_online_cpus counter is equal to one,
* there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
* Those memory barriers are provided by CPU-hotplug code.
*/
   ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-   preempt_enable();
   return ret;
}



Best regards,
Yanfei


Re: [Question] Is preempt_disable/enable needed in non-preemption code path

2021-04-16 Thread Xu, Yanfei




On 4/16/21 1:07 AM, Paul E. McKenney wrote:


On Fri, Apr 16, 2021 at 12:18:42AM +0800, Xu, Yanfei wrote:



On 4/15/21 11:43 PM, Paul E. McKenney wrote:


On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote:

Hi experts,

I am learning the RCU mechanism and its code. When looking at
rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations
in a non-preemption code path, and they have been there for a long time. I
can't understand why we need them. Is there something I missed? If not, can
we remove the unnecessary operations as below?


Good point, you are right that preemption is disabled anyway in that block
of code.  However, preempt_disable() and preempt_enable() also prevent the
compiler from moving that READ_ONCE() around.  So my question to you is
whether it is safe to remove those statements entirely or whether they
should instead be replaced by barrier() or similar.


Thanks for your reply! :)

Yes, preempt_disable() and preempt_enable() are defined as barrier() under
!preemption. barrier() can prevent that READ_ONCE() from being reordered, but
based on my current understanding, the volatile in READ_ONCE() also tells the
compiler not to reorder it. So, I think it's safe?


Maybe.

Please keep in mind that although the compiler is prohibited from
reordering volatile accesses with each other, there is nothing stopping
it from reordering volatile accesses with non-volatile accesses.


Thanks for your patient explanation!

I am trying to absorb what you said. Below is my understanding:
1. "the compiler is prohibited from reordering volatile accesses with each
other" means these situations:

int a;
void foo(void)
{
	for (;;)
		READ_ONCE(a);
}

or

int a, b;
void foo(void)
{
	int c, d;
	c = READ_ONCE(a);
	d = READ_ONCE(b);
}

2. "volatile accesses with non-volatile accesses" means d=b may happen 
before c=READ_ONCE(a) :

int a;
foo()
{
int b = 2
int c,d;
c = READ_ONCE(a);
d = b;
}
if we want to keep the ordering of volatile access "c=READ_ONCE(a)" and 
non-volatile access "d=b", we should use stronger barrier like barrier().


Hope I didn't misunderstand.

Back to rcu_blocking_is_gp(): I found this link today:
https://www.spinics.net/lists/rcu/msg03985.html
Even with the content in this link, I still haven't grasped the meaning of
these two barrier() calls. I think I should first learn about CPU hotplug and
the things discussed in the link, to make sure I am not missing something,
and then consult you. :)


Best regards,
Yanfei



 Thanx, Paul


Best regards,
Yanfei



  Thanx, Paul


diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
  if (IS_ENABLED(CONFIG_PREEMPTION))
  return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
  might_sleep();  /* Check for RCU read-side critical section. */
-   preempt_disable();
  /*
   * If the rcu_state.n_online_cpus counter is equal to one,
   * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
   * Those memory barriers are provided by CPU-hotplug code.
   */
  ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-   preempt_enable();
  return ret;
   }



Best regards,
Yanfei


Re: [Question] Is preempt_disable/enable needed in non-preemption code path

2021-04-15 Thread Xu, Yanfei




On 4/16/21 12:18 AM, Xu, Yanfei wrote:



On 4/15/21 11:43 PM, Paul E. McKenney wrote:


On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote:

Hi experts,

I am learning the RCU mechanism and its code. When looking at
rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations
in a non-preemption code path, and they have been there for a long time. I
can't understand why we need them. Is there something I missed? If not, can
we remove the unnecessary operations as below?


Good point, you are right that preemption is disabled anyway in that block
of code.  However, preempt_disable() and preempt_enable() also prevent the
compiler from moving that READ_ONCE() around.  So my question to you is
whether it is safe to remove those statements entirely or whether they
should instead be replaced by barrier() or similar.


Thanks for your reply! :)

Yes, preempt_disable() and preempt_enable() are defined as barrier() under
!preemption. barrier() can prevent that READ_ONCE() from being reordered, but
based on my current understanding, the volatile in READ_ONCE() also tells the
compiler not to reorder it. So, I think it's safe?


Best regards,
Yanfei


Hi Paul,
I ran objdump on the function rcu_blocking_is_gp():

after dropping the barrier():
81107c50 <rcu_blocking_is_gp>:
81107c50:   e8 7b 2a f5 ff          callq  8105a6d0 <__fentry__>
81107c55:   8b 05 41 fe 7c 01       mov    0x17cfe41(%rip),%eax   # 828d7a9c
81107c5b:   55                      push   %rbp
81107c5c:   48 89 e5                mov    %rsp,%rbp
81107c5f:   5d                      pop    %rbp
81107c60:   83 f8 01                cmp    $0x1,%eax
81107c63:   0f 9e c0                setle  %al
81107c66:   0f b6 c0                movzbl %al,%eax
81107c69:   c3                      retq
81107c6a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

the original code:
81107ba0 <rcu_blocking_is_gp>:
81107ba0:   e8 2b 2b f5 ff          callq  8105a6d0 <__fentry__>
81107ba5:   55                      push   %rbp
81107ba6:   48 89 e5                mov    %rsp,%rbp
81107ba9:   8b 05 ed fe 7c 01       mov    0x17cfeed(%rip),%eax   # 828d7a9c
81107baf:   83 f8 01                cmp    $0x1,%eax
81107bb2:   5d                      pop    %rbp
81107bb3:   0f 9e c0                setle  %al
81107bb6:   0f b6 c0                movzbl %al,%eax
81107bb9:   c3                      retq
81107bba:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

Umm... it did get reordered by the compiler after dropping the barrier()
(the load of rcu_state.n_online_cpus was hoisted above the frame-pointer
push); however, I think the result will not be affected. Right?


Best regards,
Yanfei





 Thanx, Paul


diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
 if (IS_ENABLED(CONFIG_PREEMPTION))
 return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
 might_sleep();  /* Check for RCU read-side critical section. */
-   preempt_disable();
 /*
  * If the rcu_state.n_online_cpus counter is equal to one,
  * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
  * Those memory barriers are provided by CPU-hotplug code.
  */
 ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-   preempt_enable();
 return ret;
  }



Best regards,
Yanfei


Re: [Question] Is preempt_disable/enable needed in non-preemption code path

2021-04-15 Thread Xu, Yanfei




On 4/15/21 11:43 PM, Paul E. McKenney wrote:


On Thu, Apr 15, 2021 at 11:04:05PM +0800, Xu, Yanfei wrote:

Hi experts,

I am learning the RCU mechanism and its code. When looking at
rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations
in a non-preemption code path, and they have been there for a long time. I
can't understand why we need them. Is there something I missed? If not, can
we remove the unnecessary operations as below?


Good point, you are right that preemption is disabled anyway in that block
of code.  However, preempt_disable() and preempt_enable() also prevent the
compiler from moving that READ_ONCE() around.  So my question to you is
whether it is safe to remove those statements entirely or whether they
should instead be replaced by barrier() or similar.


Thanks for your reply! :)

Yes, preempt_disable() and preempt_enable() are defined as barrier() under
!preemption. barrier() can prevent that READ_ONCE() from being reordered, but
based on my current understanding, the volatile in READ_ONCE() also tells the
compiler not to reorder it. So, I think it's safe?


Best regards,
Yanfei



 Thanx, Paul


diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
 if (IS_ENABLED(CONFIG_PREEMPTION))
 return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
 might_sleep();  /* Check for RCU read-side critical section. */
-   preempt_disable();
 /*
  * If the rcu_state.n_online_cpus counter is equal to one,
  * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
  * Those memory barriers are provided by CPU-hotplug code.
  */
 ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-   preempt_enable();
 return ret;
  }



Best regards,
Yanfei


[Question] Is preempt_disable/enable needed in non-preemption code path

2021-04-15 Thread Xu, Yanfei

Hi experts,

I am learning the RCU mechanism and its code. When looking at
rcu_blocking_is_gp(), I found a pair of preemption disable/enable operations
in a non-preemption code path, and they have been there for a long time. I
can't understand why we need them. Is there something I missed? If not, can
we remove the unnecessary operations as below?



diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f5213fb74..c6d95a00715e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3703,7 +3703,6 @@ static int rcu_blocking_is_gp(void)
if (IS_ENABLED(CONFIG_PREEMPTION))
return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
might_sleep();  /* Check for RCU read-side critical section. */
-   preempt_disable();
/*
 * If the rcu_state.n_online_cpus counter is equal to one,
 * there is only one CPU, and that CPU sees all prior accesses
@@ -3718,7 +3717,6 @@ static int rcu_blocking_is_gp(void)
 * Those memory barriers are provided by CPU-hotplug code.
 */
ret = READ_ONCE(rcu_state.n_online_cpus) <= 1;
-   preempt_enable();
return ret;
 }



Best regards,
Yanfei


Re: [PATCH v2 0/2] mm: khugepaged: cleanup and a minor tuning in THP

2021-04-13 Thread Xu, Yanfei

Gentle ping.

On 4/7/21 11:05 AM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

v1-->v2:
1. Correct the wrong location that the goto jumps to.
2. Keep the cond_resched() that was dropped in v1 in place.

Thanks for Yang's review.

Yanfei Xu (2):
   mm: khugepaged: use macro to align addresses
   mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

  mm/khugepaged.c | 29 +++--
  1 file changed, 15 insertions(+), 14 deletions(-)



Re: [PATCH 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

2021-04-05 Thread Xu, Yanfei




On 4/6/21 10:51 AM, Xu, Yanfei wrote:



On 4/6/21 2:20 AM, Yang Shi wrote:


On Sun, Apr 4, 2021 at 8:33 AM  wrote:


From: Yanfei Xu 

We could check MMF_DISABLE_THP ahead of iterating over all of the vmas.
Otherwise, if some mm_struct contains a large number of vmas, a large
number of CPU cycles will be wasted for nothing.

BTW, drop an unnecessary cond_resched(), because another cond_resched()
follows it and there is no time-consuming work between them.

Signed-off-by: Yanfei Xu 
---
  mm/khugepaged.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2efe1d0c92ed..c293ec4a94ea 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2094,6 +2094,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
  */
 if (unlikely(!mmap_read_trylock(mm)))
 goto breakouterloop_mmap_lock;
+   if (test_bit(MMF_DISABLE_THP, &mm->flags))
+   goto breakouterloop_mmap_lock;


It is fine to check this flag. But mmap_lock has been acquired so you
should jump to breakouterloop.


Oops! It's my fault. Thank you for pointing this out.
Will fix it in v2.
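
For reference, a sketch of what the corrected placement would look like; this
is my reading of the fix, and it assumes (as Yang describes) that the
breakouterloop label is where the already-acquired mmap_lock gets released:

	if (unlikely(!mmap_read_trylock(mm)))
		goto breakouterloop_mmap_lock;
	if (test_bit(MMF_DISABLE_THP, &mm->flags))
		goto breakouterloop;	/* mmap_lock is held here */
	if (likely(!khugepaged_test_exit(mm)))
		vma = find_vma(mm, khugepaged_scan.address);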




 if (likely(!khugepaged_test_exit(mm)))
 vma = find_vma(mm, khugepaged_scan.address);

@@ -2101,7 +2103,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,

 for (; vma; vma = vma->vm_next) {
 unsigned long hstart, hend;

-   cond_resched();


I don't have a strong opinion on removing this cond_resched(). But
IIUC khugepaged is a best-effort job; there is no harm in keeping it IMHO.



Yes, keeping it does no harm. But I think we should add it only where we
need it. Looking at the code below, there are only some simple checks
between these two cond_resched() calls. And we still have cond_resched()
in khugepaged_scan_file() and khugepaged_scan_pmd(), which do the actual
collapsing work. So I think it is unnecessary.  :)



BTW, the original author might have added this cond_resched() out of worry
that hugepage_vma_check() would always return false due to MMF_DISABLE_THP.
But now we have moved that check out of the for loop that iterates over the
vmas.


um.. That is my guess..

Thanks,
Yanfei


	for (; vma; vma = vma->vm_next) {
		unsigned long hstart, hend;

		cond_resched(); //here
		if (unlikely(khugepaged_test_exit(mm))) {
			progress++;
			break;
		}
		if (!hugepage_vma_check(vma, vma->vm_flags)) {
skip:
			progress++;
			continue;
		}
		hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
		hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE);
		if (hstart >= hend)
			goto skip;
		if (khugepaged_scan.address > hend)
			goto skip;
		if (khugepaged_scan.address < hstart)
			khugepaged_scan.address = hstart;
		VM_BUG_ON(!IS_ALIGNED(khugepaged_scan.address, HPAGE_PMD_SIZE));

		if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
			goto skip;

		while (khugepaged_scan.address < hend) {
			int ret;
			cond_resched();    //here



 if (unlikely(khugepaged_test_exit(mm))) {
 progress++;
 break;
--
2.27.0




Re: [PATCH 2/2] mm: khugepaged: check MMF_DISABLE_THP ahead of iterating over vmas

2021-04-05 Thread Xu, Yanfei




On 4/6/21 2:20 AM, Yang Shi wrote:


On Sun, Apr 4, 2021 at 8:33 AM  wrote:


From: Yanfei Xu 

We could check MMF_DISABLE_THP ahead of iterating over all of the vmas.
Otherwise, if some mm_struct contains a large number of vmas, a large
number of CPU cycles will be wasted for nothing.

BTW, drop an unnecessary cond_resched(), because another cond_resched()
follows it and there is no time-consuming work between them.

Signed-off-by: Yanfei Xu 
---
  mm/khugepaged.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2efe1d0c92ed..c293ec4a94ea 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2094,6 +2094,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
  */
 if (unlikely(!mmap_read_trylock(mm)))
 goto breakouterloop_mmap_lock;
+   if (test_bit(MMF_DISABLE_THP, &mm->flags))
+   goto breakouterloop_mmap_lock;


It is fine to check this flag. But mmap_lock has been acquired so you
should jump to breakouterloop.


Oops! It's my fault. Thank you for pointing this out.
Will fix it in v2.




 if (likely(!khugepaged_test_exit(mm)))
 vma = find_vma(mm, khugepaged_scan.address);

@@ -2101,7 +2103,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 for (; vma; vma = vma->vm_next) {
 unsigned long hstart, hend;

-   cond_resched();


I don't have a strong opinion on removing this cond_resched(). But
IIUC khugepaged is a best-effort job; there is no harm in keeping it IMHO.



Yes, keeping it does no harm. But I think we should add it only where we
need it. Looking at the code below, there are only some simple checks
between these two cond_resched() calls. And we still have cond_resched()
in khugepaged_scan_file() and khugepaged_scan_pmd(), which do the actual
collapsing work. So I think it is unnecessary.  :)

for (; vma; vma = vma->vm_next) {
unsigned long hstart, hend;

cond_resched(); //here
if (unlikely(khugepaged_test_exit(mm))) {
progress++;
break;
}
if (!hugepage_vma_check(vma, vma->vm_flags)) {
skip:
progress++;
continue;
}
hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
hend = ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE);
if (hstart >= hend)
goto skip;
if (khugepaged_scan.address > hend)
goto skip;
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
VM_BUG_ON(!IS_ALIGNED(khugepaged_scan.address, 
HPAGE_PMD_SIZE));


if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma))
goto skip;

while (khugepaged_scan.address < hend) {
int ret;
cond_resched();//here



 if (unlikely(khugepaged_test_exit(mm))) {
 progress++;
 break;
--
2.27.0




Re: [PATCH] block: export disk_part_iter_* helpers

2021-03-26 Thread Xu, Yanfei




On 3/26/21 8:13 PM, Christoph Hellwig wrote:


On Fri, Mar 26, 2021 at 08:10:59PM +0800, yanfei...@windriver.com wrote:

From: Yanfei Xu 

disk_part_iter_* helpers might be used by other external modules, like
lttng-modules. But they were unexported in commit bc359d03c7ec ("block: add
a disk_uevent helper"). Export them again here.


Err, no.  We never export things for out of tree modules.  And any kind
of driver code has absolutely no business looking at the partition tables
to start with, modular or not.

I see. Thanks.


//Yanfei


Re: [PATCH] mm/hugetlb: remove duplicate codes of setting compound_nr

2021-02-02 Thread Xu, Yanfei

Sorry. Please ignore this patch, it's incorrect.

Thanks,
Yanfei

On 2/3/21 12:40 PM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

set_compound_order() sets both the page's compound_order and
compound_nr. There is no need to assign compound_nr again, so
remove the assignment.

Signed-off-by: Yanfei Xu 
---
  mm/hugetlb.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3e4fa2c5e94..ac249b1583de 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1228,7 +1228,6 @@ static void destroy_compound_gigantic_page(struct page *page,
 	}
 
 	set_compound_order(page, 0);
-	page[1].compound_nr = 0;
 	__ClearPageHead(page);
 }
  



Re: [PATCH v2] mm/hugetlb: remove redundant check in preparing and destroying gigantic page

2021-02-02 Thread Xu, Yanfei

I'm sorry for forgetting to add David.
Now adding David. :)

Thanks,
Yanfei

On 2/2/21 7:20 PM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

A gigantic page is a compound page whose order is greater than 1.
Thus hpage_pincount is always available for it. Let's remove the
redundant check for gigantic pages.

Signed-off-by: Yanfei Xu 
---
  mm/hugetlb.c | 7 ++-----
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3e4fa2c5e94..dac5db569ccb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1219,8 +1219,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
@@ -1501,9 +1500,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
-
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	atomic_set(compound_pincount_ptr(page), 0);
 }
  
  /*




Re: [PATCH] mm/hugetlb: remove a meaningless if statement in gigantic page initialization

2021-02-02 Thread Xu, Yanfei




On 2/2/21 6:06 PM, David Hildenbrand wrote:


On 02.02.21 11:12, yanfei...@windriver.com wrote:

From: Yanfei Xu 

A gigantic page is a compound page whose order is greater than 1.
Thus hpage_pincount is always available for it. Let's remove this
meaningless if statement.

Signed-off-by: Yanfei Xu 
---
  mm/hugetlb.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3e4fa2c5e94..73d602f8c7e2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1501,9 +1501,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
-
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	atomic_set(compound_pincount_ptr(page), 0);
 }

  /*



I can spot similar handling in destroy_compound_gigantic_page(). If this
is correct (which I think it is), we should tackle both occurrences at once.



Agree. Will do it in v2.

Thanks,
Yanfei

--
Thanks,

David / dhildenb



Re: KASAN: invalid-free in p9_client_create

2020-11-17 Thread Xu, Yanfei

How about this patch? If it is appropriate, I will send a real one.

mm/slub: fix slab double free when the sysfs release callback triggers

Signed-off-by: Yanfei Xu 

diff --git a/mm/slub.c b/mm/slub.c
index 4148235ba554..d10c4fbf8c84 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5653,7 +5653,7 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	s->kobj.kset = kset;
 	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
 	if (err) {
-		kobject_put(&s->kobj);
+		kfree_const(s->kobj.name);
 		goto out;
 	}
--
This is because a previous patch, dde3c6b7 ("mm/slub: fix a memory leak in
sysfs_slab_add()"), added a kobject_put() in the error path of
kobject_init_and_add(). As a result, the sysfs release callback, which is
kmem_cache_release(), is triggered when we enter the error path.

However, we also free the 'kmem_cache', 'kmem_cache_node' and
'kmem_cache->name' in the error path of create_cache() when
kobject_init_and_add() returns failure. That makes a double free.


As for the patch that introduced this issue, I think we could fix it by
freeing the leaked memory directly instead of adding the kobject_put().


Regards,
Yanfei



On 11/16/20 7:11 PM, Dmitry Vyukov wrote:


On Mon, Nov 16, 2020 at 11:30 AM syzbot
 wrote:


Hello,

syzbot found the following issue on:

HEAD commit:92edc4ae Add linux-next specific files for 20201113
git tree:   linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=142f881650
kernel config:  https://syzkaller.appspot.com/x/.config?x=79ad4f8ad2d96176
dashboard link: https://syzkaller.appspot.com/bug?extid=3a0f6c96e37e347c6ba9
compiler:   gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+3a0f6c96e37e347c6...@syzkaller.appspotmail.com


Looks like a real double free in slab code. +MM maintainers
Note there was a preceding kmalloc failure in sysfs_slab_add.



RBP: 7fa358076ca0 R08: 2080 R09: 
R10:  R11: 0246 R12: 001f
R13: 7fff7dcf224f R14: 7fa3580779c0 R15: 0118bf2c
kobject_add_internal failed for 9p-fcall-cache (error: -12 parent: slab)
==
BUG: KASAN: double-free or invalid-free in slab_free mm/slub.c:3157 [inline]
BUG: KASAN: double-free or invalid-free in kmem_cache_free+0x82/0x350 
mm/slub.c:3173

CPU: 0 PID: 15981 Comm: syz-executor.5 Not tainted 
5.10.0-rc3-next-20201113-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack+0x107/0x163 lib/dump_stack.c:120
  print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230
  kasan_report_invalid_free+0x51/0x80 mm/kasan/report.c:355
  kasan_slab_free+0x100/0x110 mm/kasan/common.c:352
  kasan_slab_free include/linux/kasan.h:194 [inline]
  slab_free_hook mm/slub.c:1548 [inline]
  slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1586
  slab_free mm/slub.c:3157 [inline]
  kmem_cache_free+0x82/0x350 mm/slub.c:3173
  create_cache mm/slab_common.c:274 [inline]
  kmem_cache_create_usercopy+0x2ab/0x300 mm/slab_common.c:357
  p9_client_create+0xc4d/0x10c0 net/9p/client.c:1063
  v9fs_session_init+0x1dd/0x1770 fs/9p/v9fs.c:406
  v9fs_mount+0x79/0x9b0 fs/9p/vfs_super.c:126
  legacy_get_tree+0x105/0x220 fs/fs_context.c:592
  vfs_get_tree+0x89/0x2f0 fs/super.c:1549
  do_new_mount fs/namespace.c:2896 [inline]
  path_mount+0x12ae/0x1e70 fs/namespace.c:3227
  do_mount fs/namespace.c:3240 [inline]
  __do_sys_mount fs/namespace.c:3448 [inline]
  __se_sys_mount fs/namespace.c:3425 [inline]
  __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 
48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 
fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7fa358076c78 EFLAGS: 0246 ORIG_RAX: 00a5
RAX: ffda RBX: 00021800 RCX: 0045deb9
RDX: 2100 RSI: 2040 RDI: 
RBP: 7fa358076ca0 R08: 2080 R09: 
R10:  R11: 0246 R12: 001f
R13: 7fff7dcf224f R14: 7fa3580779c0 R15: 0118bf2c

Allocated by task 15981:
  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:39
  kasan_set_track mm/kasan/common.c:47 [inline]
  set_alloc_info mm/kasan/common.c:403 [inline]
  kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:434
  kasan_slab_alloc 

Re: [External] Re: [PATCH] mm/memory.c: Introduce non-atomic __{Set,Clear}PageSwapCache

2020-10-20 Thread Xu, Yanfei




On 10/20/20 9:08 PM, Muchun Song wrote:

On Tue, Oct 20, 2020 at 7:51 PM Xu, Yanfei  wrote:




On 10/19/20 10:58 PM, Muchun Song wrote:

On Mon, Oct 19, 2020 at 8:31 PM Michal Hocko  wrote:


On Mon 19-10-20 18:15:20, Muchun Song wrote:

For an exclusively referenced page, the non-atomic operations are enough,
so replace the atomic operations with non-atomic ones.


I do expect you do not see any difference in runtime and this is mostly
driven by the code reading, right? Being explicit about this in the code
would be preferred.


Yeah, just code reading. And set_bit() and __set_bit() are actually
different on some architectures. Thanks.
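
A conceptual sketch of that difference; the real implementations are
per-architecture assembly, so this is only an approximation using compiler
builtins:

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static inline void set_bit_sketch(long nr, volatile unsigned long *addr)
{
	/* atomic read-modify-write (e.g. a lock-prefixed instruction on
	 * x86): safe against concurrent updates to the same word */
	__atomic_fetch_or(addr + nr / BITS_PER_LONG,
			  1UL << (nr % BITS_PER_LONG), __ATOMIC_RELAXED);
}

static inline void __set_bit_sketch(long nr, unsigned long *addr)
{
	/* non-atomic load/modify/store: cheaper, but only safe while the
	 * caller holds an exclusive reference to the word */
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}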



No objection to the change.


Signed-off-by: Muchun Song 


With an improved changelog
Acked-by: Michal Hocko 


---
   include/linux/page-flags.h | 2 ++
   mm/memory.c | 4 ++--
   2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fbbb841a9346..ec039dde5e4b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -401,6 +401,8 @@ static __always_inline int PageSwapCache(struct page *page)
   }
   SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
   CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
+__SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
+__CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
   #else
   PAGEFLAG_FALSE(SwapCache)
   #endif
diff --git a/mm/memory.c b/mm/memory.c
index 2d267ef6621a..02dd62da26e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3128,10 +3128,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
set_page_private(page, entry.val);

/* Tell memcg to use swap ownership records */
- SetPageSwapCache(page);
+ __SetPageSwapCache(page);


Good evening, Muchun. I found there are still some places that could be
replaced with __SetPageSwapCache(), such as shmem_replace_page(), where


Yeah, thanks for your suggestion.


PG_locked has been set before SetPageSwapCache() is involved.


In this case, it doesn't matter whether PG_locked is set before
SetPageSwapCache().


Sorry for this mistake. PG_locked is used for disk I/O.


Would you please check the rest of the places? :)


Ok, I'll take a look. Thanks.



Thanks

Acked-by: Yanfei Xu 


err = mem_cgroup_charge(page, vma->vm_mm,
GFP_KERNEL);
- ClearPageSwapCache(page);
+ __ClearPageSwapCache(page);
if (err) {
ret = VM_FAULT_OOM;
goto out_page;
--
2.20.1



--
Michal Hocko
SUSE Labs








--
Yours,
Muchun



Re: [External] Re: [PATCH] mm/memory.c: Introduce non-atomic __{Set,Clear}PageSwapCache

2020-10-20 Thread Xu, Yanfei




On 10/19/20 10:58 PM, Muchun Song wrote:

On Mon, Oct 19, 2020 at 8:31 PM Michal Hocko  wrote:


On Mon 19-10-20 18:15:20, Muchun Song wrote:

For an exclusively referenced page, the non-atomic operations are enough,
so replace the atomic operations with non-atomic ones.


I do expect you do not see any difference in runtime and this is mostly
driven by the code reading, right? Being explicit about this in the code
would be preferred.


Yeah, just code reading. And set_bit() and __set_bit() are actually
different on some architectures. Thanks.



No objection to the change.


Signed-off-by: Muchun Song 


With an improved changelog
Acked-by: Michal Hocko 


---
  include/linux/page-flags.h | 2 ++
  mm/memory.c | 4 ++--
  2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fbbb841a9346..ec039dde5e4b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -401,6 +401,8 @@ static __always_inline int PageSwapCache(struct page *page)
  }
  SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
  CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
+__SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
+__CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
  #else
  PAGEFLAG_FALSE(SwapCache)
  #endif
diff --git a/mm/memory.c b/mm/memory.c
index 2d267ef6621a..02dd62da26e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3128,10 +3128,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
   set_page_private(page, entry.val);

   /* Tell memcg to use swap ownership records */
- SetPageSwapCache(page);
+ __SetPageSwapCache(page);


Good evening, Muchun. I found there are still some places that could be
replaced with __SetPageSwapCache(), such as shmem_replace_page(), where
PG_locked has been set before SetPageSwapCache() is involved.


Would you please check the rest of the places? :)

Thanks

Acked-by: Yanfei Xu 


   err = mem_cgroup_charge(page, vma->vm_mm,
   GFP_KERNEL);
- ClearPageSwapCache(page);
+ __ClearPageSwapCache(page);
   if (err) {
   ret = VM_FAULT_OOM;
   goto out_page;
--
2.20.1



--
Michal Hocko
SUSE Labs






Re: [PATCH v2] mm/compaction: Rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()

2020-10-19 Thread Xu, Yanfei




On 10/19/20 6:48 PM, Vlastimil Babka wrote:

On 10/19/20 12:29 PM, Xu, Yanfei wrote:



On 10/19/20 5:40 PM, Vlastimil Babka wrote:

On 10/19/20 10:36 AM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

There are two 'start_pfn' variables declared in compact_zone() which have
different meanings. Rename the second one to 'iteration_start_pfn'
to prevent trace_mm_compaction_end() from tracing an undesirable
value.


"to prevent confusion.", because trace_mm_compaction_end() has the
correct value even before the patch - the second start_pfn is out
of scope at that point.

Thanks


In the while statement, the second start_pfn is always reassigned the
value of cc->migrate_pfn on every iteration, and cc->migrate_pfn might
be changed within the loop. Does trace_mm_compaction_end() really want to
trace the newly assigned start_pfn?


compact_zone()
{
	unsigned long start_pfn = cc->zone->zone_start_pfn;

	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
		unsigned long start_pfn = cc->migrate_pfn;
		...
	}

	trace_mm_compaction_end(start_pfn, cc->migrate_pfn, ...);
}

Unless my C knowledge fails me completely, the start_pfn in the while
loop is a new, different local variable that shadows the start_pfn at
compact_zone() level, but does not modify its value. After the while loop
finishes, start_pfn still has the value assigned at the beginning of
compact_zone(), and that's what the tracepoint sees.
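
A self-contained demo of the shadowing described above; illustrative only,
not kernel code:

#include <stdio.h>

int main(void)
{
	unsigned long start_pfn = 100;		/* outer variable */

	for (int i = 0; i < 3; i++) {
		unsigned long start_pfn = 200;	/* shadows the outer one */
		start_pfn += i;			/* modifies only the inner */
	}

	printf("%lu\n", start_pfn);		/* prints 100 */
	return 0;
}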

You are right, and I got your point. As you said, the second start_pfn
is out of scope at that point.

Will send v3.

Many thanks,
Yanfei


So renaming the variable in the while loop is not a bug fix, but removes
confusion.



Before commit 566e54e11 ("mm, compaction: remove last_migrated_pfn
from compact_control"), there was only one start_pfn, which had a fixed
value, and trace_mm_compaction_end() traced it too.

Thus, I think the tracepoint might get an undesirable value. :)

Thanks,
Yanfei


BTW, remove a useless semicolon.

Acked-by: David Hildenbrand 
Acked-by: Vlastimil Babka 
Signed-off-by: Yanfei Xu 
---
v1->v2:
Rename 'start_pfn' to 'iteration_start_pfn' and change commit messages.

  mm/compaction.c | 7 +++----
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 176dcded298e..ccd27c739fd6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
 		int err;
-		unsigned long start_pfn = cc->migrate_pfn;
+		unsigned long iteration_start_pfn = cc->migrate_pfn;
 		/*
 		 * Avoid multiple rescans which can happen if a page cannot be
@@ -2284,7 +2284,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		 */
 		cc->rescan = false;
 		if (pageblock_start_pfn(last_migrated_pfn) ==
-		    pageblock_start_pfn(start_pfn)) {
+		    pageblock_start_pfn(iteration_start_pfn)) {
 			cc->rescan = true;
 		}
@@ -2308,8 +2308,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 			goto check_drain;
 		case ISOLATE_SUCCESS:
 			update_cached = false;
-			last_migrated_pfn = start_pfn;
-			;
+			last_migrated_pfn = iteration_start_pfn;
 		}
 		err = migrate_pages(&cc->migratepages, compaction_alloc,










Re: [PATCH v2] mm/compaction: Rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()

2020-10-19 Thread Xu, Yanfei




On 10/19/20 5:40 PM, Vlastimil Babka wrote:

On 10/19/20 10:36 AM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

There are two 'start_pfn' variables declared in compact_zone() which have
different meanings. Rename the second one to 'iteration_start_pfn'
to prevent trace_mm_compaction_end() from tracing an undesirable
value.


"to prevent confusion.", because trace_mm_compaction_end() has the
correct value even before the patch - the second start_pfn is out
of scope at that point.

Thanks

In the while statement, the second start_pfn is always reassigned the
value of cc->migrate_pfn on every iteration, and cc->migrate_pfn might
be changed within the loop. Does trace_mm_compaction_end() really want to
trace the newly assigned start_pfn?


Before commit 566e54e11 ("mm, compaction: remove last_migrated_pfn
from compact_control"), there was only one start_pfn, which had a fixed
value, and trace_mm_compaction_end() traced it too.

Thus, I think the tracepoint might get an undesirable value. :)

Thanks,
Yanfei


BTW, remove a useless semicolon.

Acked-by: David Hildenbrand 
Acked-by: Vlastimil Babka 
Signed-off-by: Yanfei Xu 
---
v1->v2:
Rename 'start_pfn' to 'iteration_start_pfn' and change commit messages.

  mm/compaction.c | 7 +++----
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 176dcded298e..ccd27c739fd6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
 		int err;
-		unsigned long start_pfn = cc->migrate_pfn;
+		unsigned long iteration_start_pfn = cc->migrate_pfn;
 		/*
 		 * Avoid multiple rescans which can happen if a page cannot be
@@ -2284,7 +2284,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		 */
 		cc->rescan = false;
 		if (pageblock_start_pfn(last_migrated_pfn) ==
-		    pageblock_start_pfn(start_pfn)) {
+		    pageblock_start_pfn(iteration_start_pfn)) {
 			cc->rescan = true;
 		}
@@ -2308,8 +2308,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 			goto check_drain;
 		case ISOLATE_SUCCESS:
 			update_cached = false;
-			last_migrated_pfn = start_pfn;
-			;
+			last_migrated_pfn = iteration_start_pfn;
 		}
 		err = migrate_pages(&cc->migratepages, compaction_alloc,






Re: [PATCH] mm/compaction: Remove some useless code in compact_zone()

2020-10-19 Thread Xu, Yanfei




On 10/16/20 11:05 PM, Vlastimil Babka wrote:

On 10/14/20 2:28 PM, David Hildenbrand wrote:

On 14.10.20 09:23, yanfei...@windriver.com wrote:

From: Yanfei Xu 

start_pfn has already been declared at the beginning of compact_zone();
there is no need to declare it again. Also remove a useless semicolon.

Signed-off-by: Yanfei Xu 
---
 mm/compaction.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 176dcded298e..5e69c1f94d96 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2272,7 +2272,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
 		int err;
-		unsigned long start_pfn = cc->migrate_pfn;
+		start_pfn = cc->migrate_pfn;


There is a user in

trace_mm_compaction_end(start_pfn, cc->migrate_pfn, cc->free_pfn,
end_pfn, sync, ret);

we would now trace a different value, no?


Agreed. We should rather rename the while-local one to avoid confusion. 
Something like "iteration_start_pfn" ?




Thank you, David and Vlastimil, for pointing out the impact on the
tracepoint. I think "iteration_start_pfn" is appropriate and will send v2.




 /*
  * Avoid multiple rescans which can happen if a page cannot be
@@ -2309,7 +2309,6 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)

 case ISOLATE_SUCCESS:
 update_cached = false;
 last_migrated_pfn = start_pfn;
-    ;


Huh, how does something like that happen :)


Looks like "case ISOLATE_SUCCESS:" used to be an empty implementation, 
then statements got added, but semicolon not removed.


Yup, this case used to be empty.


Re: [PATCH] Bluetooth: Use lock_sock() when acquiring lock in sco_conn_del

2020-10-18 Thread Xu, Yanfei




On 10/16/20 12:10 PM, Hillf Danton wrote:

On Fri, 16 Oct 2020 11:15:27 +0800 Yanfei Xu wrote:

On 10/14/20 8:31 PM, Hillf Danton wrote:


On Wed, 14 Oct 2020 15:17:31 +0800

From: Yanfei Xu 

Locking slock-AF_BLUETOOTH-BTPROTO_SCO may happen in process context or
in BH context. If in process context, we should use lock_sock(). As the
warning below shows, sco_conn_del() is called in process context, so let's
use lock_sock() instead of bh_lock_sock().
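
A minimal sketch of the change being proposed; the hunk below is an
assumption based on the description above (the actual diff is not quoted in
this message), with the sco_conn_del() body abbreviated from
net/bluetooth/sco.c:

 	if (sk) {
 		sock_hold(sk);
-		bh_lock_sock(sk);
+		lock_sock(sk);
 		sco_sock_clear_timer(sk);
 		sco_chan_del(sk, err);
-		bh_unlock_sock(sk);
+		release_sock(sk);
 		sco_sock_kill(sk);
 		sock_put(sk);
 	}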


Sounds opposite because blocking BH in BH context provides no extra
protection while it makes sense in the less critical context particularly
wrt sock lock.



WARNING: inconsistent lock state
5.9.0-rc4-syzkaller #0 Not tainted

inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
syz-executor675/31233 [HC0[0]:SC0[0]:HE1:SE1] takes:
8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at:
spin_lock include/linux/spinlock.h:354 [inline]
8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at:
sco_conn_del+0x128/0x270 net/bluetooth/sco.c:176
{IN-SOFTIRQ-W} state was registered at:
lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
sco_sock_timeout+0x24/0x140 net/bluetooth/sco.c:83
call_timer_fn+0x1ac/0x760 kernel/time/timer.c:1413
expire_timers kernel/time/timer.c:1458 [inline]
__run_timers.part.0+0x67c/0xaa0 kernel/time/timer.c:1755
__run_timers kernel/time/timer.c:1736 [inline]
run_timer_softirq+0xae/0x1a0 kernel/time/timer.c:1768
__do_softirq+0x1f7/0xa91 kernel/softirq.c:298
asm_call_on_stack+0xf/0x20 arch/x86/entry/entry_64.S:706
__run_on_irqstack arch/x86/include/asm/irq_stack.h:22 [inline]
run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:48 [inline]
do_softirq_own_stack+0x9d/0xd0 arch/x86/kernel/irq_64.c:77
invoke_softirq kernel/softirq.c:393 [inline]
__irq_exit_rcu kernel/softirq.c:423 [inline]
irq_exit_rcu+0x235/0x280 kernel/softirq.c:435
sysvec_apic_timer_interrupt+0x51/0xf0 arch/x86/kernel/apic/apic.c:1091
asm_sysvec_apic_timer_interrupt+0x12/0x20
arch/x86/include/asm/idtentry.h:581
unwind_next_frame+0x139a/0x1f90 arch/x86/kernel/unwind_orc.c:607
arch_stack_walk+0x81/0xf0 arch/x86/kernel/stacktrace.c:25
stack_trace_save+0x8c/0xc0 kernel/stacktrace.c:123
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
slab_post_alloc_hook mm/slab.h:518 [inline]
slab_alloc mm/slab.c:3312 [inline]
kmem_cache_alloc+0x13a/0x3a0 mm/slab.c:3482
__d_alloc+0x2a/0x950 fs/dcache.c:1709
d_alloc+0x4a/0x230 fs/dcache.c:1788
d_alloc_parallel+0xe9/0x18e0 fs/dcache.c:2540
lookup_open.isra.0+0x9ac/0x1350 fs/namei.c:3030
open_last_lookups fs/namei.c:3177 [inline]
path_openat+0x96d/0x2730 fs/namei.c:3365
do_filp_open+0x17e/0x3c0 fs/namei.c:3395
do_sys_openat2+0x16d/0x420 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_sys_open fs/open.c:1192 [inline]
__se_sys_open fs/open.c:1188 [inline]
__x64_sys_open+0x119/0x1c0 fs/open.c:1188
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
irq event stamp: 853
hardirqs last  enabled at (853): []
__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
hardirqs last  enabled at (853): []
_raw_spin_unlock_irq+0x1f/0x80 kernel/locking/spinlock.c:199
hardirqs last disabled at (852): []
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:126 [inline]
hardirqs last disabled at (852): []
_raw_spin_lock_irq+0xa4/0xd0 kernel/locking/spinlock.c:167
softirqs last  enabled at (0): []
copy_process+0x1a99/0x6920 kernel/fork.c:2018
softirqs last disabled at (0): [<>] 0x0

other info that might help us debug this:
   Possible unsafe locking scenario:

 CPU0
 
lock(slock-AF_BLUETOOTH-BTPROTO_SCO);

  lock(slock-AF_BLUETOOTH-BTPROTO_SCO);

   *** DEADLOCK ***

3 locks held by syz-executor675/31233:
   #0: 88809f104f40 (&hdev->req_lock){+.+.}-{3:3}, at:
hci_dev_do_close+0xf5/0x1080 net/bluetooth/hci_core.c:1720
   #1: 88809f104078 (&hdev->lock){+.+.}-{3:3}, at:
hci_dev_do_close+0x253/0x1080 net/bluetooth/hci_core.c:1757
   #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at:
hci_disconn_cfm include/net/bluetooth/hci_core.h:1435 [inline]
   #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at:
hci_conn_hash_flush+0xc7/0x220 net/bluetooth/hci_conn.c:1557

stack backtrace:
CPU: 1 PID: 31233 Comm: syz-executor675 Not tainted 5.9.0-rc4-syzkaller
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x198/0x1fd lib/dump_stack.c:118
   

Re: [PATCH] Bluetooth: Use lock_sock() when acquiring lock in sco_conn_del

2020-10-15 Thread Xu, Yanfei




On 10/14/20 8:31 PM, Hillf Danton wrote:


On Wed, 14 Oct 2020 15:17:31 +0800

From: Yanfei Xu 

Locking slock-AF_BLUETOOTH-BTPROTO_SCO may happen in process context or
in BH context. If in process context, we should use lock_sock(). As the
warning below shows, sco_conn_del() is called in process context, so let's
use lock_sock() instead of bh_lock_sock().


Sounds opposite because blocking BH in BH context provides no extra
protection while it makes sense in the less critical context particularly
wrt sock lock.



WARNING: inconsistent lock state
5.9.0-rc4-syzkaller #0 Not tainted

inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
syz-executor675/31233 [HC0[0]:SC0[0]:HE1:SE1] takes:
8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at:
spin_lock include/linux/spinlock.h:354 [inline]
8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at:
sco_conn_del+0x128/0x270 net/bluetooth/sco.c:176
{IN-SOFTIRQ-W} state was registered at:
   lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
   _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
   spin_lock include/linux/spinlock.h:354 [inline]
   sco_sock_timeout+0x24/0x140 net/bluetooth/sco.c:83
   call_timer_fn+0x1ac/0x760 kernel/time/timer.c:1413
   expire_timers kernel/time/timer.c:1458 [inline]
   __run_timers.part.0+0x67c/0xaa0 kernel/time/timer.c:1755
   __run_timers kernel/time/timer.c:1736 [inline]
   run_timer_softirq+0xae/0x1a0 kernel/time/timer.c:1768
   __do_softirq+0x1f7/0xa91 kernel/softirq.c:298
   asm_call_on_stack+0xf/0x20 arch/x86/entry/entry_64.S:706
   __run_on_irqstack arch/x86/include/asm/irq_stack.h:22 [inline]
   run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:48 [inline]
   do_softirq_own_stack+0x9d/0xd0 arch/x86/kernel/irq_64.c:77
   invoke_softirq kernel/softirq.c:393 [inline]
   __irq_exit_rcu kernel/softirq.c:423 [inline]
   irq_exit_rcu+0x235/0x280 kernel/softirq.c:435
   sysvec_apic_timer_interrupt+0x51/0xf0 arch/x86/kernel/apic/apic.c:1091
   asm_sysvec_apic_timer_interrupt+0x12/0x20
   arch/x86/include/asm/idtentry.h:581
   unwind_next_frame+0x139a/0x1f90 arch/x86/kernel/unwind_orc.c:607
   arch_stack_walk+0x81/0xf0 arch/x86/kernel/stacktrace.c:25
   stack_trace_save+0x8c/0xc0 kernel/stacktrace.c:123
   kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
   kasan_set_track mm/kasan/common.c:56 [inline]
   __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
   slab_post_alloc_hook mm/slab.h:518 [inline]
   slab_alloc mm/slab.c:3312 [inline]
   kmem_cache_alloc+0x13a/0x3a0 mm/slab.c:3482
   __d_alloc+0x2a/0x950 fs/dcache.c:1709
   d_alloc+0x4a/0x230 fs/dcache.c:1788
   d_alloc_parallel+0xe9/0x18e0 fs/dcache.c:2540
   lookup_open.isra.0+0x9ac/0x1350 fs/namei.c:3030
   open_last_lookups fs/namei.c:3177 [inline]
   path_openat+0x96d/0x2730 fs/namei.c:3365
   do_filp_open+0x17e/0x3c0 fs/namei.c:3395
   do_sys_openat2+0x16d/0x420 fs/open.c:1168
   do_sys_open fs/open.c:1184 [inline]
   __do_sys_open fs/open.c:1192 [inline]
   __se_sys_open fs/open.c:1188 [inline]
   __x64_sys_open+0x119/0x1c0 fs/open.c:1188
   do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
irq event stamp: 853
hardirqs last  enabled at (853): []
__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
hardirqs last  enabled at (853): []
_raw_spin_unlock_irq+0x1f/0x80 kernel/locking/spinlock.c:199
hardirqs last disabled at (852): []
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:126 [inline]
hardirqs last disabled at (852): []
_raw_spin_lock_irq+0xa4/0xd0 kernel/locking/spinlock.c:167
softirqs last  enabled at (0): []
copy_process+0x1a99/0x6920 kernel/fork.c:2018
softirqs last disabled at (0): [<>] 0x0

other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock(slock-AF_BLUETOOTH-BTPROTO_SCO);
   
 lock(slock-AF_BLUETOOTH-BTPROTO_SCO);

  *** DEADLOCK ***

3 locks held by syz-executor675/31233:
  #0: 88809f104f40 (&hdev->req_lock){+.+.}-{3:3}, at:
hci_dev_do_close+0xf5/0x1080 net/bluetooth/hci_core.c:1720
  #1: 88809f104078 (&hdev->lock){+.+.}-{3:3}, at:
hci_dev_do_close+0x253/0x1080 net/bluetooth/hci_core.c:1757
  #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at:
hci_disconn_cfm include/net/bluetooth/hci_core.h:1435 [inline]
  #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at:
hci_conn_hash_flush+0xc7/0x220 net/bluetooth/hci_conn.c:1557

stack backtrace:
CPU: 1 PID: 31233 Comm: syz-executor675 Not tainted 5.9.0-rc4-syzkaller
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x198/0x1fd lib/dump_stack.c:118
  print_usage_bug kernel/locking/lockdep.c:4020 [inline]
  valid_state kernel/locking/lockdep.c:3361 [inline]
  mark_lock_irq kernel/locking/lockdep.c:3560 

Re: inconsistent lock state in sco_conn_del

2020-10-10 Thread Xu, Yanfei

> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:e8878ab8 Merge tag 'spi-fix-v5.9-rc4' of 
git://git.kernel...

> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1213075990
> kernel config: 
https://syzkaller.appspot.com/x/.config?x=c61610091f4ca8c4
> dashboard link: 
https://syzkaller.appspot.com/bug?extid=65684128cd7c35bc66a1

> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro: 
https://syzkaller.appspot.com/x/repro.syz?x=121ef0fd90

> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16c3a85390
>
> IMPORTANT: if you fix the issue, please add the following tag to the 
commit:

> Reported-by: syzbot+65684128cd7c35bc6...@syzkaller.appspotmail.com
>
> 
> WARNING: inconsistent lock state
> 5.9.0-rc4-syzkaller #0 Not tainted
> 
> inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> syz-executor675/31233 [HC0[0]:SC0[0]:HE1:SE1] takes:
> 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: 
spin_lock include/linux/spinlock.h:354 [inline]
> 8880a75c50a0 (slock-AF_BLUETOOTH-BTPROTO_SCO){+.?.}-{2:2}, at: 
sco_conn_del+0x128/0x270 net/bluetooth/sco.c:176

> {IN-SOFTIRQ-W} state was registered at:
>lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
>__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
>_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
>spin_lock include/linux/spinlock.h:354 [inline]
>sco_sock_timeout+0x24/0x140 net/bluetooth/sco.c:83
>call_timer_fn+0x1ac/0x760 kernel/time/timer.c:1413
>expire_timers kernel/time/timer.c:1458 [inline]
>__run_timers.part.0+0x67c/0xaa0 kernel/time/timer.c:1755
>__run_timers kernel/time/timer.c:1736 [inline]
>run_timer_softirq+0xae/0x1a0 kernel/time/timer.c:1768
>__do_softirq+0x1f7/0xa91 kernel/softirq.c:298
>asm_call_on_stack+0xf/0x20 arch/x86/entry/entry_64.S:706
>__run_on_irqstack arch/x86/include/asm/irq_stack.h:22 [inline]
>run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:48 [inline]
>do_softirq_own_stack+0x9d/0xd0 arch/x86/kernel/irq_64.c:77
>invoke_softirq kernel/softirq.c:393 [inline]
>__irq_exit_rcu kernel/softirq.c:423 [inline]
>irq_exit_rcu+0x235/0x280 kernel/softirq.c:435
>sysvec_apic_timer_interrupt+0x51/0xf0 arch/x86/kernel/apic/apic.c:1091
>asm_sysvec_apic_timer_interrupt+0x12/0x20 
arch/x86/include/asm/idtentry.h:581

>unwind_next_frame+0x139a/0x1f90 arch/x86/kernel/unwind_orc.c:607
>arch_stack_walk+0x81/0xf0 arch/x86/kernel/stacktrace.c:25
>stack_trace_save+0x8c/0xc0 kernel/stacktrace.c:123
>kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
>kasan_set_track mm/kasan/common.c:56 [inline]
>__kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
>slab_post_alloc_hook mm/slab.h:518 [inline]
>slab_alloc mm/slab.c:3312 [inline]
>kmem_cache_alloc+0x13a/0x3a0 mm/slab.c:3482
>__d_alloc+0x2a/0x950 fs/dcache.c:1709
>d_alloc+0x4a/0x230 fs/dcache.c:1788
>d_alloc_parallel+0xe9/0x18e0 fs/dcache.c:2540
>lookup_open.isra.0+0x9ac/0x1350 fs/namei.c:3030
>open_last_lookups fs/namei.c:3177 [inline]
>path_openat+0x96d/0x2730 fs/namei.c:3365
>do_filp_open+0x17e/0x3c0 fs/namei.c:3395
>do_sys_openat2+0x16d/0x420 fs/open.c:1168
>do_sys_open fs/open.c:1184 [inline]
>__do_sys_open fs/open.c:1192 [inline]
>__se_sys_open fs/open.c:1188 [inline]
>__x64_sys_open+0x119/0x1c0 fs/open.c:1188
>do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>entry_SYSCALL_64_after_hwframe+0x44/0xa9
> irq event stamp: 853
> hardirqs last  enabled at (853): [] 
__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
> hardirqs last  enabled at (853): [] 
_raw_spin_unlock_irq+0x1f/0x80 kernel/locking/spinlock.c:199
> hardirqs last disabled at (852): [] 
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:126 [inline]
> hardirqs last disabled at (852): [] 
_raw_spin_lock_irq+0xa4/0xd0 kernel/locking/spinlock.c:167
> softirqs last  enabled at (0): [] 
copy_process+0x1a99/0x6920 kernel/fork.c:2018

> softirqs last disabled at (0): [<>] 0x0
>
> other info that might help us debug this:
>   Possible unsafe locking scenario:
>
> CPU0
> 
>lock(slock-AF_BLUETOOTH-BTPROTO_SCO);
>
>  lock(slock-AF_BLUETOOTH-BTPROTO_SCO);
>
>   *** DEADLOCK ***
>
> 3 locks held by syz-executor675/31233:
>   #0: 88809f104f40 (&hdev->req_lock){+.+.}-{3:3}, at: 
hci_dev_do_close+0xf5/0x1080 net/bluetooth/hci_core.c:1720
>   #1: 88809f104078 (&hdev->lock){+.+.}-{3:3}, at: 
hci_dev_do_close+0x253/0x1080 net/bluetooth/hci_core.c:1757
>   #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: 
hci_disconn_cfm include/net/bluetooth/hci_core.h:1435 [inline]
>   #2: 8a9188c8 (hci_cb_list_lock){+.+.}-{3:3}, at: 
hci_conn_hash_flush+0xc7/0x220 net/bluetooth/hci_conn.c:1557

>
> stack backtrace:
> CPU: 1 

Re: [PATCH v2] sched, mm: Optimize current_gfp_context()

2020-09-19 Thread Xu, Yanfei




On 9/18/20 11:18 PM, Waiman Long wrote:

On 9/18/20 2:44 AM, Xu, Yanfei wrote:

Hi Waiman,

On 8/12/20 6:29 AM, Andrew Morton wrote:
On Thu, 18 Jun 2020 17:29:36 -0400 Waiman Long  
wrote:


The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation. In
that function, current->flags is accessed 3 times. That may lead to
duplicated access of the same memory location.


I'm puzzled by this comment. What is the meaning of "That may
lead to duplicated access of the same memory location"? After using the
variable 'pflags', won't it still access the same memory
location?
Looking forward to your reply :)


That condition usually won't happen on a non-debug kernel. However, if 
certain debugging capabilities are turned on, each access to current is 
compiled into a bunch of checks and memory accesses. So if current is 
used multiple times, the same set of code is duplicated the same number 
of times, slowing down the operation and increasing code size. By 
accessing current once, we avoid this overhead in a debug kernel.
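
A minimal sketch of the pattern being described (simplified; the actual
patch is quoted further down the thread):

	/* Read current->flags once into a local and test the cached copy,
	 * so a debug kernel emits the expensive 'current' access only once. */
	static inline gfp_t current_gfp_context_sketch(gfp_t flags)
	{
		unsigned int pflags = READ_ONCE(current->flags);	/* one access */

		if (pflags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (pflags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;
		return flags;
	}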


Cheers,
Longman


Hah, got it.
Thanks for your detailed explanation!

cheers,
Yanfei


Re: [PATCH v2] sched, mm: Optimize current_gfp_context()

2020-09-18 Thread Xu, Yanfei

Hi Waiman,

On 8/12/20 6:29 AM, Andrew Morton wrote:

On Thu, 18 Jun 2020 17:29:36 -0400 Waiman Long  wrote:


The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation. In
that function, current->flags is accessed 3 times. That may lead to
duplicated access of the same memory location.


I'm puzzled by this comment: what does "That may lead to duplicated
access of the same memory location" mean? After switching to the
variable 'pflags', is the duplicated access of the same memory location
really avoided?
Looking forward to your reply :)

Thanks,
Yanfei


This is not usually a problem with minimal debug config options on, as the
compiler can optimize away the duplicated memory accesses.  With most
of the debug config options on, however, that may not be the case.
For example, the x86-64 object size of __need_fs_reclaim() in a
debug kernel that calls current_gfp_context() was 309 bytes. With this
patch applied, the object size is reduced to 202 bytes. This is a saving
of 107 bytes and will probably be slightly faster too.

...

--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -181,18 +181,20 @@ static inline bool in_vfork(struct task_struct *tsk)
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely(current->flags &
+	unsigned int pflags = READ_ONCE(current->flags);


Why use READ_ONCE() here?


+	if (unlikely(pflags &
 		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
 		 */
-		if (current->flags & PF_MEMALLOC_NOIO)
+		if (pflags & PF_MEMALLOC_NOIO)
 			flags &= ~(__GFP_IO | __GFP_FS);
-		else if (current->flags & PF_MEMALLOC_NOFS)
+		else if (pflags & PF_MEMALLOC_NOFS)
 			flags &= ~__GFP_FS;
 #ifdef CONFIG_CMA
-		if (current->flags & PF_MEMALLOC_NOCMA)
+		if (pflags & PF_MEMALLOC_NOCMA)
 			flags &= ~__GFP_MOVABLE;
 #endif
 	}




Re: [PATCH] mm/page_alloc.c: avoid inheritting current's flags when invoked in interrupt

2020-09-15 Thread Xu, Yanfei




On 9/16/20 9:17 AM, Andrew Morton wrote:

On Tue, 15 Sep 2020 15:56:35 +0800  wrote:


From: Yanfei Xu 

alloc_mask shouldn't inherit the current task's flags when
__alloc_pages_nodemask is invoked in interrupt.

...

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4889,7 +4889,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * from a particular context which has been marked by
 	 * memalloc_no{fs,io}_{save,restore}.
 	 */
-	alloc_mask = current_gfp_context(gfp_mask);
+	if (!in_interrupt())
+		alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*

hm, yes, and perhaps other callsites in page_alloc.c.

I assume this doesn't actually make any runtime difference?  Because
gfp_mask in interrupt contexts isn't going to have __GFP_IO or __GFP_FS
anyway.


Thanks for your reply!

Yes, it doesn't make any runtime difference. Theoretically, GFP_ATOMIC 
or GFP_NOWAIT should be used in interrupt context to allocate pages, so
gfp_mask isn't going to have __GFP_IO or __GFP_FS.

But if somebody uses a wrong gfp_mask, __GFP_IO or __GFP_FS can be 
introduced, and if the interrupted process has PF_MEMALLOC_NOIO or 
PF_MEMALLOC_NOFS set, current_gfp_context() may help to hide that wrong 
usage. I don't think that is the original purpose of this piece of
code.

And how about adding a BUG_ON or WARN_ON to catch the case where
__GFP_IO or __GFP_FS is passed in interrupt context?
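
A minimal sketch of what such a check could look like (hypothetical,
not a posted patch):

	/* Hypothetical: warn once if a reclaim-capable mask leaks into
	 * interrupt context instead of silently masking it. */
	if (in_interrupt())
		WARN_ON_ONCE(gfp_mask & (__GFP_IO | __GFP_FS));
	else
		alloc_mask = current_gfp_context(gfp_mask);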

Regards,
Yanfei


Re: [PATCH] mm/page_alloc.c: variable type of 'progress' should be 'unsigned long'

2020-09-15 Thread Xu, Yanfei




On 9/16/20 8:48 AM, Andrew Morton wrote:

On Tue, 15 Sep 2020 18:46:20 +0800  wrote:


From: Yanfei Xu 

try_to_free_pages() returns the number of pages reclaimed, and its
return type is 'unsigned long'. So we should use a matching type to
store the result.



__perform_reclaim() returns an int, so this change is fairly pointless.

However __perform_reclaim()'s single caller expects it to return
unsigned long, so including that change in this patch would make more
sense.
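
A simplified sketch of the type flow in question (function names as in
mm/page_alloc.c around that time; details may differ):

	/* try_to_free_pages() returns unsigned long ... */
	static int __perform_reclaim(gfp_t gfp_mask, unsigned int order,
				     const struct alloc_context *ac)
	{
		unsigned long progress;	/* the patch: match try_to_free_pages() */

		progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
					     ac->nodemask);
		return progress;	/* ... but it is narrowed back to int here */
	}

	/* ... while the single caller stores the result in an unsigned long:
	 *	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
	 * hence the suggestion to change the return type as well. */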


Yeah, thanks for the reminder. I will add that and send a v2. :)

Regards,
Yanfei


Re: [PATCH] USB: integrate macro definitions into include/linux/usb.h

2020-08-29 Thread Xu, Yanfei

I just think it is a clean-up to gather these macros together,
since they have similar attributes. That's why :)

Thanks,
Yanfei

On 8/28/20 3:48 PM, Greg KH wrote:

On Tue, Aug 25, 2020 at 11:44:21PM +0800, yanfei...@windriver.com wrote:

From: Yanfei Xu 

include/linux/usb.h also contains 'Hard limit' and 'Arbitrary limit'
macro definitions, hence we can integrate the ones from config.c
into include/linux/usb.h.


Why?  No one uses these values outside of this .c file, so why put a
value in a global .h file?

Who else wants to use these values?  If something else needs it, then
sure, it could be moved, but until then, there's nothing wrong with the
existing code as-is from what I can tell.

thanks,

greg k-h



Re: [PATCH] USB: core: limit access to rawdescriptors which were not allocated

2020-08-26 Thread Xu, Yanfei




On 8/26/20 2:00 AM, Alan Stern wrote:

On Wed, Aug 26, 2020 at 12:16:59AM +0800, yanfei...@windriver.com wrote:

From: Yanfei Xu 

When using a syscall to read the rawdescriptors, make sure we won't
access rawdescriptors that were never allocated, i.e. those whose
config number exceeds USB_MAXCONFIG.

Reported-by: syzbot+256e56ddde8b8957e...@syzkaller.appspotmail.com
Signed-off-by: Yanfei Xu 
---
  drivers/usb/core/sysfs.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/core/sysfs.c b/drivers/usb/core/sysfs.c
index a2ca38e25e0c..1a7a625e5f55 100644
--- a/drivers/usb/core/sysfs.c
+++ b/drivers/usb/core/sysfs.c
@@ -895,7 +895,8 @@ read_descriptors(struct file *filp, struct kobject *kobj,
 	 * configurations (config plus subsidiary descriptors).
 	 */
 	for (cfgno = -1; cfgno < udev->descriptor.bNumConfigurations &&
-			nleft > 0; ++cfgno) {
+			nleft > 0 &&
+			cfgno < USB_MAXCONFIG; ++cfgno) {
 		if (cfgno < 0) {
 			src = &udev->descriptor;
 			srclen = sizeof(struct usb_device_descriptor);


This is not the right way to fix the problem.

Instead, we should make sure that udev->descriptor.bNumConfigurations is
always <= USB_MAXCONFIG.  That's what this code in
usb_get_configuration() is supposed to do:

	int ncfg = dev->descriptor.bNumConfigurations;
	...

	if (ncfg > USB_MAXCONFIG) {
		dev_warn(ddev, "too many configurations: %d, "
		    "using maximum allowed: %d\n", ncfg, USB_MAXCONFIG);
		dev->descriptor.bNumConfigurations = ncfg = USB_MAXCONFIG;
	}

If you want to fix the bug, you need to figure out why this isn't
working.

Thanks for your suggestion. The patch is not right. I'll try to look
into the root cause.

Regards,
Yanfei


Alan Stern



Re: [PATCH] mm/memory.c: Correct the function name in comment

2020-08-18 Thread Xu, Yanfei




On 8/18/20 3:34 PM, David Hildenbrand wrote:

On 18.08.20 09:29, yanfei...@windriver.com wrote:

From: Yanfei Xu 

Correct the function name "pte_alloc_pne" in the comment to "pte_alloc_one".



I'd have phrased this like

"mm/memory: Fix typo in __do_fault() comment

It's "pte_alloc_one", not "pte_alloc_pne". Let's fix that.
"


Hah, yours is clearer. I'll send a v2.
Thanks for your suggestion. :)

//Yanfei


Reviewed-by: David Hildenbrand 


Signed-off-by: Yanfei Xu 
---
  mm/memory.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index c3a83f4ca851..9cc3d0dc816c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3446,7 +3446,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 *  unlock_page(A)
 * lock_page(B)
 *  lock_page(B)
-* pte_alloc_pne
+* pte_alloc_one
 *   shrink_page_list
 * wait_on_page_writeback(A)
 *  SetPageWriteback(B)






Re: [PATCH] userfaultfd: avoid the duplicated release for userfaultfd_ctx

2020-07-19 Thread Xu, Yanfei




On 7/20/20 12:57 AM, Al Viro wrote:

On Sun, Jul 19, 2020 at 09:58:34PM +0800, Xu, Yanfei wrote:

ping Al Viro

Could you please help to review this patch? Thanks a lot.


That's -next, right?  As for the patch itself...  Frankly,

Yes, it's -next.

Daniel's patch looks seriously wrong.

Got it.

Regards,
Yanfei

* why has O_CLOEXEC been quietly smuggled in?  It's
a userland ABI change, for fsck sake...
* the double-put you've spotted
* the whole out: thing - just make it
	if (IS_ERR(file)) {
		userfaultfd_ctx_put(ctx);
		return PTR_ERR(file);
	}
and be done with that.



Re: [PATCH] userfaultfd: avoid the duplicated release for userfaultfd_ctx

2020-07-19 Thread Xu, Yanfei

ping Al Viro

Could you please help to review this patch? Thanks a lot.

Yanfei

On 7/15/20 12:12 AM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

When get_unused_fd_flags() fails, the userfaultfd ctx will be freed
by userfaultfd_fops' release function, which is userfaultfd_release().
So we can return directly after fput().

userfaultfd_release()->userfaultfd_ctx_put(ctx)

Fixes: d08ac70b1e0d (Wire UFFD up to SELinux)
Reported-by: syzbot+75867c44841cb6373...@syzkaller.appspotmail.com
Signed-off-by: Yanfei Xu 
---
  fs/userfaultfd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3a4d6ac5a81a..e98317c15530 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -2049,7 +2049,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
if (fd < 0) {
fput(file);
-   goto out;
+   return fd;
}
  
  	ctx->owner = file_inode(file);
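
For readers following the thread, a short annotated sketch of the error
path being fixed (based on the -next code quoted above):

	fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
	if (fd < 0) {
		fput(file);	/* drops the last reference, which calls
				 * userfaultfd_release() and thus
				 * userfaultfd_ctx_put(ctx) already */
		return fd;	/* falling through to 'out:' would put the
				 * ctx a second time -- the double release */
	}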




Re: [PATCH] userfaultfd: avoid the duplicated release for userfaultfd_ctx

2020-07-14 Thread Xu, Yanfei

Add maintainer Alexander Viro  :)

On 7/15/20 12:12 AM, yanfei...@windriver.com wrote:

From: Yanfei Xu 

When get_unused_fd_flags() fails, the userfaultfd ctx will be freed
by userfaultfd_fops' release function, which is userfaultfd_release().
So we can return directly after fput().

userfaultfd_release()->userfaultfd_ctx_put(ctx)

Fixes: d08ac70b1e0d (Wire UFFD up to SELinux)
Reported-by: syzbot+75867c44841cb6373...@syzkaller.appspotmail.com
Signed-off-by: Yanfei Xu 
---
  fs/userfaultfd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3a4d6ac5a81a..e98317c15530 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -2049,7 +2049,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
if (fd < 0) {
fput(file);
-   goto out;
+   return fd;
}
  
  	ctx->owner = file_inode(file);




Re: BUG:loop:blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES)

2020-05-14 Thread Xu, Yanfei




On 5/14/20 12:41 AM, Darrick J. Wong wrote:

[add fsdevel to cc]

On Tue, May 12, 2020 at 08:22:08PM -0600, Jens Axboe wrote:

On 5/12/20 8:14 PM, Xu, Yanfei wrote:

Hi,

After operating on /dev/loop6, which was set up with losetup over an
image placed in tmpfs, I got the following ERROR messages:

[cut here]-

[  183.110770] blk_update_request: I/O error, dev loop6, sector 524160 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.123949] blk_update_request: I/O error, dev loop6, sector 522 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.137123] blk_update_request: I/O error, dev loop6, sector 16906 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.150314] blk_update_request: I/O error, dev loop6, sector 32774 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.163551] blk_update_request: I/O error, dev loop6, sector 49674 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.176824] blk_update_request: I/O error, dev loop6, sector 65542 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.190029] blk_update_request: I/O error, dev loop6, sector 82442 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.203281] blk_update_request: I/O error, dev loop6, sector 98310 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.216531] blk_update_request: I/O error, dev loop6, sector 115210 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.229914] blk_update_request: I/O error, dev loop6, sector 131078 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0


I have found the commit which introduced this issue by git bisect:

    commit efcfec57 ("loop: fix no-unmap write-zeroes request behavior")


Please CC the author of that commit too. Leaving the rest quoted below.


Kernel version: Linux version 5.6.0

Frequency: every time

Steps to reproduce:

  1. git clone the mainline kernel

  2. compile the kernel with ARCH=x86_64, then boot the system with it
     (it seems other arches can also reproduce it)

  3. make an image with "dd of=/tmp/image if=/dev/zero bs=1M count=256"

  4. place the image in a tmpfs directory

  5. losetup /dev/loop6 /PATH/TO/image

  6. mkfs.ext2 /dev/loop6


Any comments will be appreciated.


Hm, you got IO failures here because shmem_fallocate doesn't support
the FALLOC_FL_ZERO_RANGE mode.  That might not be too hard to add, but
there's a broader problem of detecting fallocate support--

The loop driver assumes that if the file has an fallocate method then
it's safe to set max_discard_sectors (and now max_write_zeroes_sectors)
to UINT_MAX>>9.  There's currently no good way to detect which modes are
supported by a filesystem's ->fallocate function, or to discover the
required granularity.

Right now we tell application developers that the way to discover the
conditions under which fallocate will work is to try it and see if they
get EOPNOTSUPP.

One way to "fix" this would be to fix lo_fallocate to set RQF_QUIET if
the filesystem returns EOPNOTSUPP, which gets rid of the log messages.
We probably ought to zero out the appropriate max_*_sectors if we get
EOPNOTSUPP.
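
A hedged sketch of that workaround (hypothetical, not a posted patch;
it assumes the loop driver's lo_fallocate() helper and the block
layer's RQF_QUIET request flag):

	/* Hypothetical sketch: silence the block-layer error messages when
	 * the backing filesystem doesn't support this fallocate mode. */
	static int lo_fallocate(struct loop_device *lo, struct request *rq,
				loff_t pos, int mode)
	{
		struct file *file = lo->lo_backing_file;
		int ret;

		if (!file->f_op->fallocate)
			return -EOPNOTSUPP;

		ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
		if (ret == -EOPNOTSUPP)
			rq->rq_flags |= RQF_QUIET;	/* no "I/O error" log spam */
		return ret;
	}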


Many thanks for your detailed reply :) Having no good method for
detecting fallocate support is a real problem. And the "fix" you
mentioned is a good workaround for the current situation.


Best regards,
Yanfei



--D




Thanks,

Yanfei








--
Jens Axboe



BUG:loop:blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES)

2020-05-13 Thread Xu, Yanfei

Hi,

After operating on /dev/loop6, which was set up with losetup over an
image placed in tmpfs, I got the following ERROR messages:


[cut here]-

[  183.110770] blk_update_request: I/O error, dev loop6, sector 524160 
op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.123949] blk_update_request: I/O error, dev loop6, sector 522 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.137123] blk_update_request: I/O error, dev loop6, sector 16906 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.150314] blk_update_request: I/O error, dev loop6, sector 32774 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.163551] blk_update_request: I/O error, dev loop6, sector 49674 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.176824] blk_update_request: I/O error, dev loop6, sector 65542 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.190029] blk_update_request: I/O error, dev loop6, sector 82442 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.203281] blk_update_request: I/O error, dev loop6, sector 98310 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.216531] blk_update_request: I/O error, dev loop6, sector 115210 
op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.229914] blk_update_request: I/O error, dev loop6, sector 131078 
op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0



I have found the commit which introduced this issue by git bisect:

commit efcfec57 ("loop: fix no-unmap write-zeroes request behavior")

Tip: my reproducing command is: mkfs.ext2 /dev/loop6


Thanks,
Yanfei


[loop]efcfec579: BUG:blk_update_request: I/O error, dev loop6, sector 49674 op 0x9:(WRITE_ZEROES)

2020-05-12 Thread Xu, Yanfei

Hi,

After operating on /dev/loop6, which was set up with losetup over an
image placed in tmpfs, I got the following ERROR messages:

[cut here]-

[  183.110770] blk_update_request: I/O error, dev loop6, sector 524160 
op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.123949] blk_update_request: I/O error, dev loop6, sector 522 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.137123] blk_update_request: I/O error, dev loop6, sector 16906 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.150314] blk_update_request: I/O error, dev loop6, sector 32774 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.163551] blk_update_request: I/O error, dev loop6, sector 49674 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.176824] blk_update_request: I/O error, dev loop6, sector 65542 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.190029] blk_update_request: I/O error, dev loop6, sector 82442 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.203281] blk_update_request: I/O error, dev loop6, sector 98310 op 
0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.216531] blk_update_request: I/O error, dev loop6, sector 115210 
op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0
[  183.229914] blk_update_request: I/O error, dev loop6, sector 131078 
op 0x9:(WRITE_ZEROES) flags 0x1000800 phys_seg 0 prio class 0



I have found the commit which introduced this issue by git bisect:

commit efcfec57 ("loop: fix no-unmap write-zeroes request behavior")


Kernel version: Linux version 5.6.0

Frequency: every time

Steps to reproduce:

  1. git clone the mainline kernel

  2. compile the kernel with ARCH=x86_64, then boot the system with it
     (it seems other arches can also reproduce it)

  3. make an image with "dd of=/tmp/image if=/dev/zero bs=1M count=256"

  4. place the image in a tmpfs directory

  5. losetup /dev/loop6 /PATH/image

  6. mkfs.ext2 /dev/loop6


Any comments will be appreciated.


Thanks,

Yanfei