Re: [PATCH] [V2]r8152: Add support for MAC address pass through on RTL8153-BD

2019-02-21 Thread David Chen
Hi Jiri,

Thanks for your comments.  BND_MASK and BD_MASK are defined separately, and I 
have tested RTL8153-BD and RTL8153-BND devices, both work as expected.

Thanks and Regards,
-David


> Jiri Slaby  於 2019年2月18日 下午10:22 寫道:
> 
>> On 18. 02. 19, 15:00, David Chen wrote:
>> From: David Chen 
>> 
>> RTL8153-BD is used in Dell DA300 type-C dongle.
>> It should be added to the whitelist of devices to activate MAC address
>> pass through.
>> 
>> Per confirming with Realtek all devices containing RTL8153-BD should
>> activate MAC pass through and there won't use pass through bit on efuse
>> like in RTL8153-AD.
>> 
>> Signed-off-by: David Chen 
>> ---
>> drivers/net/usb/r8152.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
>> index 60dd1ec1665f..86c8c64fbb0f 100644
>> --- a/drivers/net/usb/r8152.c
>> +++ b/drivers/net/usb/r8152.c
>> @@ -557,6 +557,7 @@ enum spd_duplex {
>> /* MAC PASSTHRU */
>> #define AD_MASK0xfee0
>> #define BND_MASK0x0004
>> +#define BD_MASK0x0001
>> #define EFUSE0xcfdb
>> #define PASS_THRU_MASK0x1
>> 
>> @@ -1176,9 +1177,9 @@ static int vendor_mac_passthru_addr_read(struct r8152 
>> *tp, struct sockaddr *sa)
>>return -ENODEV;
>>}
>>} else {
>> -/* test for RTL8153-BND */
>> +/* test for RTL8153-BND and RTL8153-BD */
>>ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_MISC_1);
>> -if ((ocp_data & BND_MASK) == 0) {
>> +if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK) == 0) {
> 
> Can you ensure that BND won't have the BD's bit set and vice versa? I.e.
> should the check be something like:
>  if (isBND() && (ocp_data & BND_MASK) == 0 ||
>  isBD() && (ocp_data & BD_MASK) == 0)
> instead?
> 
> thanks,
> -- 
> js
> suse labs


[PATCH] r8152: Fix an error on RTL8153-BD MAC Address Passthrough support

2019-02-19 Thread David Chen
From: David Chen 

RTL8153-BD is used in Dell DA300 type-C dongle.
Added RTL8153-BD support to activate MAC address pass through on DA300.
Apply correction on previously submitted patch in net.git tree.

Signed-off-by: David Chen 
---
 drivers/net/usb/r8152.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index ada6baf8847a..86c8c64fbb0f 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1179,7 +1179,7 @@ static int vendor_mac_passthru_addr_read(struct r8152 
*tp, struct sockaddr *sa)
} else {
/* test for RTL8153-BND and RTL8153-BD */
ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_MISC_1);
-   if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK)) {
+   if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK) == 0) {
netif_dbg(tp, probe, tp->netdev,
  "Invalid variant for MAC pass through\n");
return -ENODEV;
-- 
2.19.1



[PATCH] [V2]r8152: Add support for MAC address pass through on RTL8153-BD

2019-02-18 Thread David Chen
From: David Chen 

RTL8153-BD is used in Dell DA300 type-C dongle.
It should be added to the whitelist of devices to activate MAC address
pass through.

Per confirming with Realtek all devices containing RTL8153-BD should
activate MAC pass through and there won't use pass through bit on efuse
like in RTL8153-AD.

Signed-off-by: David Chen 
---
 drivers/net/usb/r8152.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 60dd1ec1665f..86c8c64fbb0f 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -557,6 +557,7 @@ enum spd_duplex {
 /* MAC PASSTHRU */
 #define AD_MASK0xfee0
 #define BND_MASK   0x0004
+#define BD_MASK0x0001
 #define EFUSE  0xcfdb
 #define PASS_THRU_MASK 0x1
 
@@ -1176,9 +1177,9 @@ static int vendor_mac_passthru_addr_read(struct r8152 
*tp, struct sockaddr *sa)
return -ENODEV;
}
} else {
-   /* test for RTL8153-BND */
+   /* test for RTL8153-BND and RTL8153-BD */
ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_MISC_1);
-   if ((ocp_data & BND_MASK) == 0) {
+   if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK) == 0) {
netif_dbg(tp, probe, tp->netdev,
  "Invalid variant for MAC pass through\n");
return -ENODEV;
-- 
2.19.1



Re: [PATCH 2/2] RTL8153-BD is used in Dell DA300 type-C dongle. It should be added to the whitelist of devices to activate MAC address pass through.

2019-02-18 Thread David Chen
Hi Oliver,

Thanks for reviewing.  I have made some mistake, will correct it and resend.

Thanks and Regards,
-David

> Oliver Neukum  於 2019年2月18日 下午4:04 寫道:
> 
>> On Mo, 2019-02-18 at 11:48 +0800, David Chen wrote:
>> From: David Chen 
>> 
>> Per confirming with Realtek all devices containing RTL8153-BD should
>> activate MAC pass through and there won't use pass through bit on efuse
>> like in RTL8153-AD.
>> 
>> Signed-off-by: David Chen 
>> ---
>> drivers/net/usb/r8152.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
>> index ada6baf8847a..86c8c64fbb0f 100644
>> --- a/drivers/net/usb/r8152.c
>> +++ b/drivers/net/usb/r8152.c
>> @@ -1179,7 +1179,7 @@ static int vendor_mac_passthru_addr_read(struct r8152 
>> *tp, struct sockaddr *sa)
>>} else {
>>/* test for RTL8153-BND and RTL8153-BD */
>>ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_MISC_1);
>> -if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK)) {
>> +if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK) == 0) {
> 
> You are inverting the second half of the test. How can this possibly be
> right? Had you dropped it, I would understand. But this? Are you sure?
> 
>Regards
>Oliver
> 


[PATCH 2/2] RTL8153-BD is used in Dell DA300 type-C dongle. It should be added to the whitelist of devices to activate MAC address pass through.

2019-02-17 Thread David Chen
From: David Chen 

Per confirming with Realtek all devices containing RTL8153-BD should
activate MAC pass through and there won't use pass through bit on efuse
like in RTL8153-AD.

Signed-off-by: David Chen 
---
 drivers/net/usb/r8152.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index ada6baf8847a..86c8c64fbb0f 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1179,7 +1179,7 @@ static int vendor_mac_passthru_addr_read(struct r8152 
*tp, struct sockaddr *sa)
} else {
/* test for RTL8153-BND and RTL8153-BD */
ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_MISC_1);
-   if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK)) {
+   if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK) == 0) {
netif_dbg(tp, probe, tp->netdev,
  "Invalid variant for MAC pass through\n");
return -ENODEV;
-- 
2.19.1



[PATCH] r8152: Add support for MAC address pass through on RTL8153-BD

2019-02-16 Thread David Chen
From: David Chen 

RTL8153-BD is used in Dell DA300 type-C dongle.
It should be added to the whitelist of devices to activate MAC address
pass through.

Per confirming with Realtek all devices containing RTL8153-BD should
activate MAC pass through and there won't use pass through bit on efuse
like in RTL8153-AD.

Signed-off-by: David Chen 
---
 drivers/net/usb/r8152.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 60dd1ec1665f..ada6baf8847a 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -557,6 +557,7 @@ enum spd_duplex {
 /* MAC PASSTHRU */
 #define AD_MASK0xfee0
 #define BND_MASK   0x0004
+#define BD_MASK0x0001
 #define EFUSE  0xcfdb
 #define PASS_THRU_MASK 0x1
 
@@ -1176,9 +1177,9 @@ static int vendor_mac_passthru_addr_read(struct r8152 
*tp, struct sockaddr *sa)
return -ENODEV;
}
} else {
-   /* test for RTL8153-BND */
+   /* test for RTL8153-BND and RTL8153-BD */
ocp_data = ocp_read_byte(tp, MCU_TYPE_USB, USB_MISC_1);
-   if ((ocp_data & BND_MASK) == 0) {
+   if ((ocp_data & BND_MASK) == 0 && (ocp_data & BD_MASK)) {
netif_dbg(tp, probe, tp->netdev,
  "Invalid variant for MAC pass through\n");
return -ENODEV;
-- 
2.19.1



PROBLEM: SCSI write error not seen by Linux AIO?

2018-10-10 Thread David Chen
[1.] SCSI write error not seen by Linux AIO?

[2.] Full description of the problem/report:
I filed a bug ( https://bugzilla.kernel.org/show_bug.cgi?id=201257 )
that looks pretty serious but quite possibly I've made some mistake. I
emailed the Linux AIO mailing list about it over a week ago but haven't
received any response.

The bug (as I see it) is: I do Linux AIO to write to a SCSI device, and
if the SCSI device returns an error (e.g. WRITE ERROR - AUTO
REALLOCATION FAILED) then I get a success from Linux AIO! Of course I
expect Linux AIO to indicate to me in some way that the write didn't
succeed, but AFAICT I get no such indication. I'm seeing this on the 4.9
LTS kernel but not on 4.18.10 (i.e. on 4.18.10 Linux AIO reports an I/O
error as expected), and I've reproduced this on a VMware VM as well as a
bare metal x86 server. I've reproduced this using both a fake SCSI
device created using the scsi_debug module as the SCSI device, and
ClearSky storage as the SCSI device. The bugzilla entry includes more
details, including reproduction steps.

[3.] Keywords (i.e., modules, networking, kernel): SCSI, Linux AIO
[4.1.] Kernel version (from /proc/version):
Linux version 4.9.129+ (dc...@qf-l208-a.dev-bos.csdops.net) (gcc
version 4.9.1 (GCC) ) #1 SMP Mon Oct 1 21:10:31 UTC 2018

[5.] Most recent kernel version which did not have the bug:
I only see it on 4.9 kernels. I see it on 4.9.0 and 4.9.129
(i.e. almost the most recent 4.9 kernel).

[7.] A small shell script or example program which triggers the
 problem (if possible)
There are test logs in the bugzilla entry (and pasted below),
but the steps I take to reproduce this are:
1) Boot a 4.9 kernel with the scsi_debug kludge applied.
2) Create a scsi_debug fake scsi device (i.e. "modprobe scsi_debug").
3) Enable write faults on that device
(e.g. "echo 2 > /sys/module/scsi_debug/parameters/opts")
4) Run the test program and observe that it doesn't fail an assert().
(e.g. "./aio_test /dev/sdb")
5) Check /var/log/messages to verify that SCSI saw an error.

Is this a serious bug, or does anyone see where I've made a mistake?

Thanks!

The following is information from the bugzilla entry for anyone who
prefers to see the info in email as opposed to bugzilla. For people
who prefer bugzilla there's nothing interesting below:

My test program (a slightly edited version of
https://gist.github.com/larytet/87f90b08643ac3de934df2cadff4989c) uses
Linux AIO to do a single write of 512B. I've included the program below.
The device I'm writing to is a fake SCSI device created by the
scsi_debug kernel module. I've added a small kludge to the scsi_debug
kernel module to return a WRITE ERROR - AUTO REALLOCATION FAILED error
when the SCSI device is written to. I've included a patch for the kludge
below.

I get this unexpected (to me anyway) behavior on kernel versions 4.9.129
and 4.9.0. I do get an error as I expect on kernel versions 4.18.10,
4.14.7, and 4.1.8.

I'm checking that the result code is zero for each of: the io_event
filled in by io_getevents(), the return code of io_setup(),
io_destroy(), fsync(), and close(). I'm checking that the result code
is 1 for each of io_submit() and io_getevents().

When I use dd to do a standard write() to the SCSI device, instead of
using Linux AIO, then I get an I/O error as expected. Also when I use
device mapper to create a flakey device
(e.g. "dmsetup create dchen-test --table="0 `blockdev --getsz /dev/sdg`
flakey /dev/sdg 0 9 1"), instead of using a SCSI device, then I get an
I/O error as expected.

This looks like a bug to me, but certainly I could have missed checking
a return code somewhere, or maybe I'm not supposed to get an error back
for a reason I don't know about yet, or...?

It could be that I'm seeing this unexpected (to me) behavior because of
some quirk with the scsi_debug fake SCSI device. However, I originally
ran into this behavior when testing against my company's ClearSky
Storage, using both iSCSI and Fibre Channel targets. Since I've seen it
on these different targets, I think it's unlikely to be some quirk of
scsi_debug.

Here are logs from my tests:

Test log for unexpected behavior on 4.9.129:

[root@vm-dchen ~]# uname -r
4.9.129+
[root@vm-dchen ~]# modprobe scsi_debug
[root@vm-dchen ~]# lsscsi
[1:0:0:0]diskVMware   Virtual disk 1.0   /dev/sda
[3:0:0:0]diskLinuxscsi_debug   0186  /dev/sdb
[root@vm-dchen ~]# dd if=/dev/zero of=/dev/sdb oflag=direct bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00116091 s, 441 kB/s
[root@vm-dchen ~]#
[root@vm-dchen ~]# date
Thu Sep 27 17:00:45 EDT 2018
[root@vm-dchen ~]# echo 2 > /sys/module/scsi_debug/parameters/opts
[root@vm-dchen ~]# dd if=/dev/zero of=/dev/sdb oflag=direct bs=512 count=1
dd: error writing ?/dev/sdb?: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0459316 s, 0.0 kB/s
[root@vm-dchen ~]#
[root@vm-dchen ~]# date
Thu Sep 27 17:01:01 EDT 2018
[root@vm-dchen ~]# ./aio_test 

PROBLEM: SCSI write error not seen by Linux AIO?

2018-10-10 Thread David Chen
[1.] SCSI write error not seen by Linux AIO?

[2.] Full description of the problem/report:
I filed a bug ( https://bugzilla.kernel.org/show_bug.cgi?id=201257 )
that looks pretty serious but quite possibly I've made some mistake. I
emailed the Linux AIO mailing list about it over a week ago but haven't
received any response.

The bug (as I see it) is: I do Linux AIO to write to a SCSI device, and
if the SCSI device returns an error (e.g. WRITE ERROR - AUTO
REALLOCATION FAILED) then I get a success from Linux AIO! Of course I
expect Linux AIO to indicate to me in some way that the write didn't
succeed, but AFAICT I get no such indication. I'm seeing this on the 4.9
LTS kernel but not on 4.18.10 (i.e. on 4.18.10 Linux AIO reports an I/O
error as expected), and I've reproduced this on a VMware VM as well as a
bare metal x86 server. I've reproduced this using both a fake SCSI
device created using the scsi_debug module as the SCSI device, and
ClearSky storage as the SCSI device. The bugzilla entry includes more
details, including reproduction steps.

[3.] Keywords (i.e., modules, networking, kernel): SCSI, Linux AIO
[4.1.] Kernel version (from /proc/version):
Linux version 4.9.129+ (dc...@qf-l208-a.dev-bos.csdops.net) (gcc
version 4.9.1 (GCC) ) #1 SMP Mon Oct 1 21:10:31 UTC 2018

[5.] Most recent kernel version which did not have the bug:
I only see it on 4.9 kernels. I see it on 4.9.0 and 4.9.129
(i.e. almost the most recent 4.9 kernel).

[7.] A small shell script or example program which triggers the
 problem (if possible)
There are test logs in the bugzilla entry (and pasted below),
but the steps I take to reproduce this are:
1) Boot a 4.9 kernel with the scsi_debug kludge applied.
2) Create a scsi_debug fake scsi device (i.e. "modprobe scsi_debug").
3) Enable write faults on that device
(e.g. "echo 2 > /sys/module/scsi_debug/parameters/opts")
4) Run the test program and observe that it doesn't fail an assert().
(e.g. "./aio_test /dev/sdb")
5) Check /var/log/messages to verify that SCSI saw an error.

Is this a serious bug, or does anyone see where I've made a mistake?

Thanks!

The following is information from the bugzilla entry for anyone who
prefers to see the info in email as opposed to bugzilla. For people
who prefer bugzilla there's nothing interesting below:

My test program (a slightly edited version of
https://gist.github.com/larytet/87f90b08643ac3de934df2cadff4989c) uses
Linux AIO to do a single write of 512B. I've included the program below.
The device I'm writing to is a fake SCSI device created by the
scsi_debug kernel module. I've added a small kludge to the scsi_debug
kernel module to return a WRITE ERROR - AUTO REALLOCATION FAILED error
when the SCSI device is written to. I've included a patch for the kludge
below.

I get this unexpected (to me anyway) behavior on kernel versions 4.9.129
and 4.9.0. I do get an error as I expect on kernel versions 4.18.10,
4.14.7, and 4.1.8.

I'm checking that the result code is zero for each of: the io_event
filled in by io_getevents(), the return code of io_setup(),
io_destroy(), fsync(), and close(). I'm checking that the result code
is 1 for each of io_submit() and io_getevents().

When I use dd to do a standard write() to the SCSI device, instead of
using Linux AIO, then I get an I/O error as expected. Also when I use
device mapper to create a flakey device
(e.g. "dmsetup create dchen-test --table="0 `blockdev --getsz /dev/sdg`
flakey /dev/sdg 0 9 1"), instead of using a SCSI device, then I get an
I/O error as expected.

This looks like a bug to me, but certainly I could have missed checking
a return code somewhere, or maybe I'm not supposed to get an error back
for a reason I don't know about yet, or...?

It could be that I'm seeing this unexpected (to me) behavior because of
some quirk with the scsi_debug fake SCSI device. However, I originally
ran into this behavior when testing against my company's ClearSky
Storage, using both iSCSI and Fibre Channel targets. Since I've seen it
on these different targets, I think it's unlikely to be some quirk of
scsi_debug.

Here are logs from my tests:

Test log for unexpected behavior on 4.9.129:

[root@vm-dchen ~]# uname -r
4.9.129+
[root@vm-dchen ~]# modprobe scsi_debug
[root@vm-dchen ~]# lsscsi
[1:0:0:0]diskVMware   Virtual disk 1.0   /dev/sda
[3:0:0:0]diskLinuxscsi_debug   0186  /dev/sdb
[root@vm-dchen ~]# dd if=/dev/zero of=/dev/sdb oflag=direct bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00116091 s, 441 kB/s
[root@vm-dchen ~]#
[root@vm-dchen ~]# date
Thu Sep 27 17:00:45 EDT 2018
[root@vm-dchen ~]# echo 2 > /sys/module/scsi_debug/parameters/opts
[root@vm-dchen ~]# dd if=/dev/zero of=/dev/sdb oflag=direct bs=512 count=1
dd: error writing ?/dev/sdb?: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0459316 s, 0.0 kB/s
[root@vm-dchen ~]#
[root@vm-dchen ~]# date
Thu Sep 27 17:01:01 EDT 2018
[root@vm-dchen ~]# ./aio_test 

Backport 35a2897c2a (sched/wait: Remove the lockless swait_active() check in swake_up*()) to 4.9 branch

2018-07-31 Thread David Chen
Hi Peter,

In 4.9 branch, we hit an issue in RCU, where the NOCB follower list not getting
reclaimed and causing OOM.

In discussion with Paul, we were able to figure out the problem was because of
missed wake up resulted from lack of proper memory barrier between setting
wake up condition and swake_up().

nocb_leader_wait()
{
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
swake_up(>nocb_wq);

Note, that the smp_mb__after_atomic() is only a compiler barrier on x86.
Originally I was going to change the barrier to smp_mb(). But then I found out
master has this following patch that solves the same class of problem by
removing the lockless check inside swake_up().

35a2897c2a (sched/wait: Remove the lockless swait_active() check in swake_up*())

So I'm wonder if we can backport this patch to 4.9 branch to solve this issue,
and maybe solve other potential missed wake up issue as well.

Thanks,
David

Backport 35a2897c2a (sched/wait: Remove the lockless swait_active() check in swake_up*()) to 4.9 branch

2018-07-31 Thread David Chen
Hi Peter,

In 4.9 branch, we hit an issue in RCU, where the NOCB follower list not getting
reclaimed and causing OOM.

In discussion with Paul, we were able to figure out the problem was because of
missed wake up resulted from lack of proper memory barrier between setting
wake up condition and swake_up().

nocb_leader_wait()
{
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
swake_up(>nocb_wq);

Note, that the smp_mb__after_atomic() is only a compiler barrier on x86.
Originally I was going to change the barrier to smp_mb(). But then I found out
master has this following patch that solves the same class of problem by
removing the lockless check inside swake_up().

35a2897c2a (sched/wait: Remove the lockless swait_active() check in swake_up*())

So I'm wonder if we can backport this patch to 4.9 branch to solve this issue,
and maybe solve other potential missed wake up issue as well.

Thanks,
David

[PATCH] rcu: Fix NOCB follower not waking up causing OOM

2018-07-30 Thread David Chen
From: Chunwei Chen 

In 4.9 stable branch, we hit an issue where one of the NOCB follower
thread wasn't waking up, even though the follower list is not empty.
The follower list just kept on growing and never got reclaimed, and
finally caused the system to run out of memory.

This issue is similar to the issue fixed by 6b5fc3a13318, both are
caused by lacking proper memory barrier before swake_up, and causing
wake up to be missed. While we do have smp_mb__after_atomic here, but
smp_mb__after_atomic is only a compiler barrier on x86, so it doesn't
prevent this at all. So we fix this issue by changing it to smp_mb.

Note, this patch doesn't apply to master, because in master the follower
list is changed to spinlock protected list. So there's no such issue
there.

Signed-off-by: Chunwei Chen 
Cc: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 554ea54..b2d663d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2090,7 +2090,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
/* Append callbacks to follower's "done" list. */
tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
-   smp_mb__after_atomic(); /* Store *tail before wakeup. */
+   smp_mb(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
/*
 * List was empty, wake up the follower.
-- 
1.9.4



[PATCH] rcu: Fix NOCB follower not waking up causing OOM

2018-07-30 Thread David Chen
From: Chunwei Chen 

In 4.9 stable branch, we hit an issue where one of the NOCB follower
thread wasn't waking up, even though the follower list is not empty.
The follower list just kept on growing and never got reclaimed, and
finally caused the system to run out of memory.

This issue is similar to the issue fixed by 6b5fc3a13318, both are
caused by lacking proper memory barrier before swake_up, and causing
wake up to be missed. While we do have smp_mb__after_atomic here, but
smp_mb__after_atomic is only a compiler barrier on x86, so it doesn't
prevent this at all. So we fix this issue by changing it to smp_mb.

Note, this patch doesn't apply to master, because in master the follower
list is changed to spinlock protected list. So there's no such issue
there.

Signed-off-by: Chunwei Chen 
Cc: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 554ea54..b2d663d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2090,7 +2090,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
/* Append callbacks to follower's "done" list. */
tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
-   smp_mb__after_atomic(); /* Store *tail before wakeup. */
+   smp_mb(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
/*
 * List was empty, wake up the follower.
-- 
1.9.4



Re: RCU nocb list not reclaiming causing OOM

2018-07-30 Thread David Chen
Hi Paul,

Just want to know what's your plan on stable branches regarding this issue.
Do you intend to backport the ->nocb_lock? Or are you going with just the
memory barrier change?

Thanks,
David


From: Paul E. McKenney 
Sent: Saturday, July 28, 2018 1:29 AM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 06:47:00PM -0700, Paul E. McKenney wrote:
> On Sat, Jul 28, 2018 at 12:07:19AM +, David Chen wrote:
> > Hi Paul,
> > 
> > I wasn't talking about the xchg() though.
> > 
> > The smp_mb__after_atomic() is not for xchg(), it's for `*tail = 
> > rdp->nocb_gp_head;`
> > it's stated so in the comment. And I do think we need ordering between
> > `*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on 
> > head not tail.
> > 
> >  swait_event_interruptible(rdp->nocb_wq,
> >   
> >READ_ONCE(rdp->nocb_follower_head));
> > 
> > So what I'm saying is that since we need to maintain ordering between 
> > `*tail = rdp->nocb_gp_head;`
> > and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). 
> > Because
> > smp_mb__after_atomic() wouldn't guarantee 
> > 
> > So this is what I'm proposing.
> 
> Good eyes!
> 
> Hmmm...  What do I do about this in mainline?  Ah, I introduced a
> ->nocb_lock in mainline to prevent this from happening.  In that same
> commit that you didn't want to use because it is hard to backport.  ;-)
> 
> So yes, but there might well be other misorderings fixed by the
> hard-to-backport commit that your change below does not cover.  Still I
> do agree that you need full ordering at that point.

Another approach would be to backport only the ->nocb_lock portions
of that patch.  This would still potentially leave failure-to-wake
(as opposed to misordering-on-wake) issues, but it should cover all
of the misordering-on-wake issues.

    Thanx, Paul

> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
> > linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 
> > 06:42:41.0 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 16:23:41.349044259 
> > -0700
> > @@ -2076,7 +2076,7 @@
> >  /* Append callbacks to follower's "done" list. */
> >  tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
> >  *tail = rdp->nocb_gp_head;
> > -   smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > +   smp_mb(); /* Store *tail before wakeup. */
> >  if (rdp != my_rdp && tail == >nocb_follower_head) {
> >  /*
> >   * List was empty, wake up the follower.
> > 
> > Thanks,
> > David
> > 
> > From: Paul E. McKenney 
> > Sent: Friday, July 27, 2018 4:47 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 27, 2018 at 11:16:39PM +, David Chen wrote:
> > > Hi Paul,
> > > 
> > > Thanks for the advice.
> > > The bug is kind of hard to hit, so I can't say for certain yet.
> > 
> > Well, you can always remove the "tail == >nocb_follower_head" as an
> > extra belt-and-suspenders safety net.  I am not putting that in mainline,
> > but in the privacy of your own copy of the kernel, I don't see any really
> > serious problem with it.  (As long as you aren't going for absolute maximum
> > performance, but even then there are other more important tuning actions
> > and code changes you could make.)
> > 
> > > Though after another look at the code, I found out the 
> > > `smp_mb__after_atomic();`
> > > seems to be only a compiler barrier on x86.
> > 
> > Yes, and that is because the locked xchg instruction used on x86 to
> > implement xchg() already provides full ordering.  ;-)
> > 
> > >    tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
> > >    *tail = rdp->nocb_gp_head;
> > >    smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > >    if (rdp != my_rdp && tail == >nocb_follower_head) {
> > >    swake_up(>nocb_wq);
> > > 
> > > But that wouldn't be enough right? Because from 6b5fc3a13318, it stated 
> > > that
> > > wakeu

Re: RCU nocb list not reclaiming causing OOM

2018-07-30 Thread David Chen
Hi Paul,

Just want to know what's your plan on stable branches regarding this issue.
Do you intend to backport the ->nocb_lock? Or are you going with just the
memory barrier change?

Thanks,
David


From: Paul E. McKenney 
Sent: Saturday, July 28, 2018 1:29 AM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 06:47:00PM -0700, Paul E. McKenney wrote:
> On Sat, Jul 28, 2018 at 12:07:19AM +, David Chen wrote:
> > Hi Paul,
> > 
> > I wasn't talking about the xchg() though.
> > 
> > The smp_mb__after_atomic() is not for xchg(), it's for `*tail = 
> > rdp->nocb_gp_head;`
> > it's stated so in the comment. And I do think we need ordering between
> > `*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on 
> > head not tail.
> > 
> >  swait_event_interruptible(rdp->nocb_wq,
> >   
> >READ_ONCE(rdp->nocb_follower_head));
> > 
> > So what I'm saying is that since we need to maintain ordering between 
> > `*tail = rdp->nocb_gp_head;`
> > and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). 
> > Because
> > smp_mb__after_atomic() wouldn't guarantee 
> > 
> > So this is what I'm proposing.
> 
> Good eyes!
> 
> Hmmm...  What do I do about this in mainline?  Ah, I introduced a
> ->nocb_lock in mainline to prevent this from happening.  In that same
> commit that you didn't want to use because it is hard to backport.  ;-)
> 
> So yes, but there might well be other misorderings fixed by the
> hard-to-backport commit that your change below does not cover.  Still I
> do agree that you need full ordering at that point.

Another approach would be to backport only the ->nocb_lock portions
of that patch.  This would still potentially leave failure-to-wake
(as opposed to misordering-on-wake) issues, but it should cover all
of the misordering-on-wake issues.

    Thanx, Paul

> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
> > linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 
> > 06:42:41.0 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 16:23:41.349044259 
> > -0700
> > @@ -2076,7 +2076,7 @@
> >  /* Append callbacks to follower's "done" list. */
> >  tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
> >  *tail = rdp->nocb_gp_head;
> > -   smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > +   smp_mb(); /* Store *tail before wakeup. */
> >  if (rdp != my_rdp && tail == >nocb_follower_head) {
> >  /*
> >   * List was empty, wake up the follower.
> > 
> > Thanks,
> > David
> > 
> > From: Paul E. McKenney 
> > Sent: Friday, July 27, 2018 4:47 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 27, 2018 at 11:16:39PM +, David Chen wrote:
> > > Hi Paul,
> > > 
> > > Thanks for the advice.
> > > The bug is kind of hard to hit, so I can't say for certain yet.
> > 
> > Well, you can always remove the "tail == >nocb_follower_head" as an
> > extra belt-and-suspenders safety net.  I am not putting that in mainline,
> > but in the privacy of your own copy of the kernel, I don't see any really
> > serious problem with it.  (As long as you aren't going for absolute maximum
> > performance, but even then there are other more important tuning actions
> > and code changes you could make.)
> > 
> > > Though after another look at the code, I found out the 
> > > `smp_mb__after_atomic();`
> > > seems to be only a compiler barrier on x86.
> > 
> > Yes, and that is because the locked xchg instruction used on x86 to
> > implement xchg() already provides full ordering.  ;-)
> > 
> > >    tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
> > >    *tail = rdp->nocb_gp_head;
> > >    smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > >    if (rdp != my_rdp && tail == >nocb_follower_head) {
> > >    swake_up(>nocb_wq);
> > > 
> > > But that wouldn't be enough right? Because from 6b5fc3a13318, it stated 
> > > that
> > > wakeu

Re: RCU nocb list not reclaiming causing OOM

2018-07-27 Thread David Chen
Hi Paul,

I wasn't talking about the xchg() though.

The smp_mb__after_atomic() is not for xchg(), it's for `*tail = 
rdp->nocb_gp_head;`
it's stated so in the comment. And I do think we need ordering between
`*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on 
head not tail.

swait_event_interruptible(rdp->nocb_wq,
 
READ_ONCE(rdp->nocb_follower_head));

So what I'm saying is that since we need to maintain ordering between `*tail = 
rdp->nocb_gp_head;`
and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
smp_mb__after_atomic() wouldn't guarantee 

So this is what I'm proposing.

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 06:42:41.0 
-0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 16:23:41.349044259 
-0700
@@ -2076,7 +2076,7 @@
/* Append callbacks to follower's "done" list. */
tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
-   smp_mb__after_atomic(); /* Store *tail before wakeup. */
+   smp_mb(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
/*
 * List was empty, wake up the follower.

Thanks,
David

From: Paul E. McKenney 
Sent: Friday, July 27, 2018 4:47 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 11:16:39PM +, David Chen wrote:
> Hi Paul,
> 
> Thanks for the advice.
> The bug is kind of hard to hit, so I can't say for certain yet.

Well, you can always remove the "tail == >nocb_follower_head" as an
extra belt-and-suspenders safety net.  I am not putting that in mainline,
but in the privacy of your own copy of the kernel, I don't see any really
serious problem with it.  (As long as you aren't going for absolute maximum
performance, but even then there are other more important tuning actions
and code changes you could make.)

> Though after another look at the code, I found out the 
> `smp_mb__after_atomic();`
> seems to be only a compiler barrier on x86.

Yes, and that is because the locked xchg instruction used on x86 to
implement xchg() already provides full ordering.  ;-)

>    tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
>    *tail = rdp->nocb_gp_head;
>    smp_mb__after_atomic(); /* Store *tail before wakeup. */
>    if (rdp != my_rdp && tail == >nocb_follower_head) {
>    swake_up(>nocb_wq);
> 
> But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> wakeup operation don't guarantee ordering. And when the follower wakes up, it 
> checks
> for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` 
> which doesn't
> have LOCK prefix. So it's possible for follower to wake up and see the list 
> is empty and go
> back to sleep.

Again, xchg() is defined to provide full ordering against all operations
before and after it.  Each architecture is required to do whatever is
necessary to implement that full ordering, and x86 need only provide
its "lock xchg" instruction.

The smp_mb__after_atomic() has effect only after atomic read-modify-write
operations that do not return a value, for example, atomic_inc().
If you use it after a value-returning atomic read-modify-write operation
like xchg(), all you do is needlessly slow things down on platforms
that provide non-empty smp_mb__after_atomic() definitions.  So again,
smp_mb__after_atomic() after xchg() is pointless.

Please take a look at Documentation/core-api/atomic_ops.rst in the
Linux-kernel source tree for more information.  Or get a v4.17 kernel
source tree and check this using the memory model (tools/memory-model
in that version).

        Thanx, Paul

> Thanks,
> David
> 
> From: Paul E. McKenney 
> Sent: Friday, July 27, 2018 3:31 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 27, 2018 at 07:07:46PM +, David Chen wrote:
> > Hi Paul,
> > 
> > I'd like to opinion again on this subject.
> > 
> > So we are going to backport this patch:
> > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> 
> Does this one solve the problem, or are you still seeing hangs?
> If you are no longer seeing hangs, my advice is "hands off keyboard",
> though some would no doubt point out that I should follow tha

Re: RCU nocb list not reclaiming causing OOM

2018-07-27 Thread David Chen
Hi Paul,

I wasn't talking about the xchg() though.

The smp_mb__after_atomic() is not for xchg(), it's for `*tail = 
rdp->nocb_gp_head;`
it's stated so in the comment. And I do think we need ordering between
`*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on 
head not tail.

swait_event_interruptible(rdp->nocb_wq,
 
READ_ONCE(rdp->nocb_follower_head));

So what I'm saying is that since we need to maintain ordering between `*tail = 
rdp->nocb_gp_head;`
and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
smp_mb__after_atomic() wouldn't guarantee 

So this is what I'm proposing.

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 06:42:41.0 
-0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 16:23:41.349044259 
-0700
@@ -2076,7 +2076,7 @@
/* Append callbacks to follower's "done" list. */
tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
-   smp_mb__after_atomic(); /* Store *tail before wakeup. */
+   smp_mb(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
/*
 * List was empty, wake up the follower.

Thanks,
David

From: Paul E. McKenney 
Sent: Friday, July 27, 2018 4:47 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 11:16:39PM +, David Chen wrote:
> Hi Paul,
> 
> Thanks for the advice.
> The bug is kind of hard to hit, so I can't say for certain yet.

Well, you can always remove the "tail == >nocb_follower_head" as an
extra belt-and-suspenders safety net.  I am not putting that in mainline,
but in the privacy of your own copy of the kernel, I don't see any really
serious problem with it.  (As long as you aren't going for absolute maximum
performance, but even then there are other more important tuning actions
and code changes you could make.)

> Though after another look at the code, I found out the 
> `smp_mb__after_atomic();`
> seems to be only a compiler barrier on x86.

Yes, and that is because the locked xchg instruction used on x86 to
implement xchg() already provides full ordering.  ;-)

>    tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
>    *tail = rdp->nocb_gp_head;
>    smp_mb__after_atomic(); /* Store *tail before wakeup. */
>    if (rdp != my_rdp && tail == >nocb_follower_head) {
>    swake_up(>nocb_wq);
> 
> But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> wakeup operation don't guarantee ordering. And when the follower wakes up, it 
> checks
> for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` 
> which doesn't
> have LOCK prefix. So it's possible for follower to wake up and see the list 
> is empty and go
> back to sleep.

Again, xchg() is defined to provide full ordering against all operations
before and after it.  Each architecture is required to do whatever is
necessary to implement that full ordering, and x86 need only provide
its "lock xchg" instruction.

The smp_mb__after_atomic() has effect only after atomic read-modify-write
operations that do not return a value, for example, atomic_inc().
If you use it after a value-returning atomic read-modify-write operation
like xchg(), all you do is needlessly slow things down on platforms
that provide non-empty smp_mb__after_atomic() definitions.  So again,
smp_mb__after_atomic() after xchg() is pointless.

Please take a look at Documentation/core-api/atomic_ops.rst in the
Linux-kernel source tree for more information.  Or get a v4.17 kernel
source tree and check this using the memory model (tools/memory-model
in that version).

        Thanx, Paul

> Thanks,
> David
> 
> From: Paul E. McKenney 
> Sent: Friday, July 27, 2018 3:31 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 27, 2018 at 07:07:46PM +, David Chen wrote:
> > Hi Paul,
> > 
> > I'd like to opinion again on this subject.
> > 
> > So we are going to backport this patch:
> > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> 
> Does this one solve the problem, or are you still seeing hangs?
> If you are no longer seeing hangs, my advice is "hands off keyboard",
> though some would no doubt point out that I should follow tha

Re: RCU nocb list not reclaiming causing OOM

2018-07-27 Thread David Chen
Hi Paul,

Thanks for the advice.
The bug is kind of hard to hit, so I can't say for certain yet.

Though after another look at the code, I found out the `smp_mb__after_atomic();`
seems to be only a compiler barrier on x86.

tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
swake_up(>nocb_wq);

But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
wakeup operation don't guarantee ordering. And when the follower wakes up, it 
checks
for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which 
doesn't
have LOCK prefix. So it's possible for follower to wake up and see the list is 
empty and go
back to sleep.

Thanks,
David

From: Paul E. McKenney 
Sent: Friday, July 27, 2018 3:31 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 07:07:46PM +, David Chen wrote:
> Hi Paul,
> 
> I'd like to opinion again on this subject.
> 
> So we are going to backport this patch:
> 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

Does this one solve the problem, or are you still seeing hangs?
If you are no longer seeing hangs, my advice is "hands off keyboard",
though some would no doubt point out that I should follow that advice
more myself.  ;-)

> But the other one:
> 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> It doesn't apply cleanly, and I'm not too comfortable porting it myself.

Yeah, that one is a bit on the non-trivial side, no two ways about it.

> So I'm wondering if I use the following change to always wake up follower 
> thread
> regardless if the list was empty or not, just to be on the safe side. Do you 
> think
> this change is reasonable? Do you see any problem it might cause?
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
> linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h    2017-07-12 
> 06:42:41.0 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h 2018-07-27 11:57:03.582134519 
> -0700
> @@ -2077,7 +2077,7 @@
>    tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
>    *tail = rdp->nocb_gp_head;
>    smp_mb__after_atomic(); /* Store *tail before wakeup. */
> - if (rdp != my_rdp && tail == >nocb_follower_head) {
> + if (rdp != my_rdp) {

This will burn a bit of extra CPU time, but it should be fine other than
that.

    Thanx, Paul

>    /*
> * List was empty, wake up the follower.
> * Memory barriers supplied by atomic_long_add().
> 
> 
> From: David Chen
> Sent: Friday, July 20, 2018 5:12 PM
> To: paul...@linux.vnet.ibm.com
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>    
> 
> Hi Paul,
> 
> Ok, I'll try those patches.
> 
> Thanks,
> David
>   
> From: Paul E. McKenney 
> Sent: Friday, July 20, 2018 4:32:12 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 20, 2018 at 11:05:52PM +, David Chen wrote:
> > Hi Paul,
> > 
> > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows 
> > too
> > large, and not getting reclaimed, causing the system to OOM.
> > 
> > Printing the culprit rcu_sched_data:
> > 
> >   nocb_q_count = {
> > counter = 32369635
> >   },
> >   nocb_follower_head = 0x88ae901c0a00,
> >   nocb_follower_tail = 0x88af1538b8d8,
> >   nocb_kthread = 0x88b06d29,
> > 
> > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > nocb_kthread shouldn't go to sleep. However, if dump the stack of the 
> > kthread:
> > 
> > crash> bt 0x88b06d29
> > PID: 21 TASK: 88b06d29  CPU: 3   COMMAND: "rcuos/1"
> >  #0 [afc9020b7dc0] __schedule at 8d8789dc
> >  #1 [afc9020b7e38] schedule at 8d878e76
> >  #2 [afc9020b7e50] rcu_nocb_kthread at 8d112337
> >  #3 [afc9020b7ec8] kthread at 8d0c6ce7
> >  #4 [afc9020b7f50] ret_from_fork at 8d87d755
> > 
> > And if we dis the address at 8d112337:
> > 
> > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.0

Re: RCU nocb list not reclaiming causing OOM

2018-07-27 Thread David Chen
Hi Paul,

Thanks for the advice.
The bug is kind of hard to hit, so I can't say for certain yet.

Though after another look at the code, I found out the `smp_mb__after_atomic();`
seems to be only a compiler barrier on x86.

tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
if (rdp != my_rdp && tail == >nocb_follower_head) {
swake_up(>nocb_wq);

But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
wakeup operation don't guarantee ordering. And when the follower wakes up, it 
checks
for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which 
doesn't
have LOCK prefix. So it's possible for follower to wake up and see the list is 
empty and go
back to sleep.

Thanks,
David

From: Paul E. McKenney 
Sent: Friday, July 27, 2018 3:31 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 07:07:46PM +, David Chen wrote:
> Hi Paul,
> 
> I'd like to opinion again on this subject.
> 
> So we are going to backport this patch:
> 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

Does this one solve the problem, or are you still seeing hangs?
If you are no longer seeing hangs, my advice is "hands off keyboard",
though some would no doubt point out that I should follow that advice
more myself.  ;-)

> But the other one:
> 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> It doesn't apply cleanly, and I'm not too comfortable porting it myself.

Yeah, that one is a bit on the non-trivial side, no two ways about it.

> So I'm wondering if I use the following change to always wake up follower 
> thread
> regardless if the list was empty or not, just to be on the safe side. Do you 
> think
> this change is reasonable? Do you see any problem it might cause?
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
> linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h    2017-07-12 
> 06:42:41.0 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h 2018-07-27 11:57:03.582134519 
> -0700
> @@ -2077,7 +2077,7 @@
>    tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
>    *tail = rdp->nocb_gp_head;
>    smp_mb__after_atomic(); /* Store *tail before wakeup. */
> - if (rdp != my_rdp && tail == >nocb_follower_head) {
> + if (rdp != my_rdp) {

This will burn a bit of extra CPU time, but it should be fine other than
that.

    Thanx, Paul

>    /*
> * List was empty, wake up the follower.
> * Memory barriers supplied by atomic_long_add().
> 
> 
> From: David Chen
> Sent: Friday, July 20, 2018 5:12 PM
> To: paul...@linux.vnet.ibm.com
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>    
> 
> Hi Paul,
> 
> Ok, I'll try those patches.
> 
> Thanks,
> David
>   
> From: Paul E. McKenney 
> Sent: Friday, July 20, 2018 4:32:12 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 20, 2018 at 11:05:52PM +, David Chen wrote:
> > Hi Paul,
> > 
> > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows 
> > too
> > large, and not getting reclaimed, causing the system to OOM.
> > 
> > Printing the culprit rcu_sched_data:
> > 
> >   nocb_q_count = {
> > counter = 32369635
> >   },
> >   nocb_follower_head = 0x88ae901c0a00,
> >   nocb_follower_tail = 0x88af1538b8d8,
> >   nocb_kthread = 0x88b06d29,
> > 
> > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > nocb_kthread shouldn't go to sleep. However, if dump the stack of the 
> > kthread:
> > 
> > crash> bt 0x88b06d29
> > PID: 21 TASK: 88b06d29  CPU: 3   COMMAND: "rcuos/1"
> >  #0 [afc9020b7dc0] __schedule at 8d8789dc
> >  #1 [afc9020b7e38] schedule at 8d878e76
> >  #2 [afc9020b7e50] rcu_nocb_kthread at 8d112337
> >  #3 [afc9020b7ec8] kthread at 8d0c6ce7
> >  #4 [afc9020b7f50] ret_from_fork at 8d87d755
> > 
> > And if we dis the address at 8d112337:
> > 
> > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.0

Re: RCU nocb list not reclaiming causing OOM

2018-07-27 Thread David Chen
Hi Paul,

I'd like to opinion again on this subject.

So we are going to backport this patch:
6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

But the other one:
8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
It doesn't apply cleanly, and I'm not too comfortable porting it myself.

So I'm wondering if I use the following change to always wake up follower thread
regardless if the list was empty or not, just to be on the safe side. Do you 
think
this change is reasonable? Do you see any problem it might cause?

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 06:42:41.0 
-0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 11:57:03.582134519 
-0700
@@ -2077,7 +2077,7 @@
tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
-   if (rdp != my_rdp && tail == >nocb_follower_head) {
+   if (rdp != my_rdp) {
/*
 * List was empty, wake up the follower.
 * Memory barriers supplied by atomic_long_add().


From: David Chen
Sent: Friday, July 20, 2018 5:12 PM
To: paul...@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
   

Hi Paul,

Ok, I'll try those patches.

Thanks,
David
  
From: Paul E. McKenney 
Sent: Friday, July 20, 2018 4:32:12 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 20, 2018 at 11:05:52PM +, David Chen wrote:
> Hi Paul,
> 
> We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> large, and not getting reclaimed, causing the system to OOM.
> 
> Printing the culprit rcu_sched_data:
> 
>   nocb_q_count = {
> counter = 32369635
>   },
>   nocb_follower_head = 0x88ae901c0a00,
>   nocb_follower_tail = 0x88af1538b8d8,
>   nocb_kthread = 0x88b06d29,
> 
> As you can see here, the nocb_follower_head is not empty, so in theory, the
> nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> 
> crash> bt 0x88b06d29
> PID: 21 TASK: 88b06d29  CPU: 3   COMMAND: "rcuos/1"
>  #0 [afc9020b7dc0] __schedule at 8d8789dc
>  #1 [afc9020b7e38] schedule at 8d878e76
>  #2 [afc9020b7e50] rcu_nocb_kthread at 8d112337
>  #3 [afc9020b7ec8] kthread at 8d0c6ce7
>  #4 [afc9020b7f50] ret_from_fork at 8d87d755
> 
> And if we dis the address at 8d112337:
> 
> /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h:
>  2106
> 0x8d11232d :  test   %rax,%rax
> 0x8d112330 :  jne    0x8d112355 
> 
> 0x8d112332 :  callq  0x8d878e40 
> 
> 0x8d112337 :  lea    -0x40(%rbp),%rsi
> 
> So the kthread is blocked at swait_event_interruptible in the 
> nocb_follower_wait.
> This contradict with the fact that nocb_follower_head was not empty. So I
> wonder if this is caused by the lack of memory barrier in the place shown 
> below.
> If the head is set to NULL after doing xchg, it will overwrite the head set
> by leader. This caused the kthread to sleep the next iteration, and the leader
> won't wake him up as the tail doesn't point to head.
> 
> Please tell me what do you think.
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
> linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h    2017-07-12 
> 06:42:41.0 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h 2018-07-20 15:25:57.311206343 
> -0700
> @@ -2149,6 +2149,7 @@
>    BUG_ON(!list);
>    trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
>    WRITE_ONCE(rdp->nocb_follower_head, NULL);
> + smp_mb();
>    tail = xchg(>nocb_follower_tail, 
>>nocb_follower_head);

The xchg() operation implies full memory barriers both before and after,
so adding the smp_mb() before would have no effect.

But let me take a look at post-4.9 changes to this code...

I suggest trying out the following commit:

6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

If that one doesn't help, the following might be worth trying, but probably
a lot harder to backport:

8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")

Please let me know how it goes!

   

Re: RCU nocb list not reclaiming causing OOM

2018-07-27 Thread David Chen
Hi Paul,

I'd like to opinion again on this subject.

So we are going to backport this patch:
6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

But the other one:
8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
It doesn't apply cleanly, and I'm not too comfortable porting it myself.

So I'm wondering if I use the following change to always wake up follower thread
regardless if the list was empty or not, just to be on the safe side. Do you 
think
this change is reasonable? Do you see any problem it might cause?

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 06:42:41.0 
-0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 11:57:03.582134519 
-0700
@@ -2077,7 +2077,7 @@
tail = xchg(>nocb_follower_tail, rdp->nocb_gp_tail);
*tail = rdp->nocb_gp_head;
smp_mb__after_atomic(); /* Store *tail before wakeup. */
-   if (rdp != my_rdp && tail == >nocb_follower_head) {
+   if (rdp != my_rdp) {
/*
 * List was empty, wake up the follower.
 * Memory barriers supplied by atomic_long_add().


From: David Chen
Sent: Friday, July 20, 2018 5:12 PM
To: paul...@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
   

Hi Paul,

Ok, I'll try those patches.

Thanks,
David
  
From: Paul E. McKenney 
Sent: Friday, July 20, 2018 4:32:12 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 20, 2018 at 11:05:52PM +, David Chen wrote:
> Hi Paul,
> 
> We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> large, and not getting reclaimed, causing the system to OOM.
> 
> Printing the culprit rcu_sched_data:
> 
>   nocb_q_count = {
> counter = 32369635
>   },
>   nocb_follower_head = 0x88ae901c0a00,
>   nocb_follower_tail = 0x88af1538b8d8,
>   nocb_kthread = 0x88b06d29,
> 
> As you can see here, the nocb_follower_head is not empty, so in theory, the
> nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> 
> crash> bt 0x88b06d29
> PID: 21 TASK: 88b06d29  CPU: 3   COMMAND: "rcuos/1"
>  #0 [afc9020b7dc0] __schedule at 8d8789dc
>  #1 [afc9020b7e38] schedule at 8d878e76
>  #2 [afc9020b7e50] rcu_nocb_kthread at 8d112337
>  #3 [afc9020b7ec8] kthread at 8d0c6ce7
>  #4 [afc9020b7f50] ret_from_fork at 8d87d755
> 
> And if we dis the address at 8d112337:
> 
> /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h:
>  2106
> 0x8d11232d :  test   %rax,%rax
> 0x8d112330 :  jne    0x8d112355 
> 
> 0x8d112332 :  callq  0x8d878e40 
> 
> 0x8d112337 :  lea    -0x40(%rbp),%rsi
> 
> So the kthread is blocked at swait_event_interruptible in the 
> nocb_follower_wait.
> This contradict with the fact that nocb_follower_head was not empty. So I
> wonder if this is caused by the lack of memory barrier in the place shown 
> below.
> If the head is set to NULL after doing xchg, it will overwrite the head set
> by leader. This caused the kthread to sleep the next iteration, and the leader
> won't wake him up as the tail doesn't point to head.
> 
> Please tell me what do you think.
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
> linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h    2017-07-12 
> 06:42:41.0 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h 2018-07-20 15:25:57.311206343 
> -0700
> @@ -2149,6 +2149,7 @@
>    BUG_ON(!list);
>    trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
>    WRITE_ONCE(rdp->nocb_follower_head, NULL);
> + smp_mb();
>    tail = xchg(>nocb_follower_tail, 
>>nocb_follower_head);

The xchg() operation implies full memory barriers both before and after,
so adding the smp_mb() before would have no effect.

But let me take a look at post-4.9 changes to this code...

I suggest trying out the following commit:

6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

If that one doesn't help, the following might be worth trying, but probably
a lot harder to backport:

8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")

Please let me know how it goes!

   

RCU nocb list not reclaiming causing OOM

2018-07-20 Thread David Chen
Hi Paul,

We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
large, and not getting reclaimed, causing the system to OOM.

Printing the culprit rcu_sched_data:

  nocb_q_count = {
counter = 32369635
  },
  nocb_follower_head = 0x88ae901c0a00,
  nocb_follower_tail = 0x88af1538b8d8,
  nocb_kthread = 0x88b06d29,

As you can see here, the nocb_follower_head is not empty, so in theory, the
nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:

crash> bt 0x88b06d29
PID: 21 TASK: 88b06d29  CPU: 3   COMMAND: "rcuos/1"
 #0 [afc9020b7dc0] __schedule at 8d8789dc
 #1 [afc9020b7e38] schedule at 8d878e76
 #2 [afc9020b7e50] rcu_nocb_kthread at 8d112337
 #3 [afc9020b7ec8] kthread at 8d0c6ce7
 #4 [afc9020b7f50] ret_from_fork at 8d87d755

And if we dis the address at 8d112337:

/usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h:
 2106
0x8d11232d :  test   %rax,%rax
0x8d112330 :  jne0x8d112355 

0x8d112332 :  callq  0x8d878e40 

0x8d112337 :  lea-0x40(%rbp),%rsi

So the kthread is blocked at swait_event_interruptible in the 
nocb_follower_wait.
This contradict with the fact that nocb_follower_head was not empty. So I
wonder if this is caused by the lack of memory barrier in the place shown below.
If the head is set to NULL after doing xchg, it will overwrite the head set
by leader. This caused the kthread to sleep the next iteration, and the leader
won't wake him up as the tail doesn't point to head.

Please tell me what do you think.

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 06:42:41.0 
-0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-20 15:25:57.311206343 
-0700
@@ -2149,6 +2149,7 @@
BUG_ON(!list);
trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
WRITE_ONCE(rdp->nocb_follower_head, NULL);
+   smp_mb();
tail = xchg(>nocb_follower_tail, >nocb_follower_head);
 
/* Each pass through the following loop invokes a callback. */


RCU nocb list not reclaiming causing OOM

2018-07-20 Thread David Chen
Hi Paul,

We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
large, and not getting reclaimed, causing the system to OOM.

Printing the culprit rcu_sched_data:

  nocb_q_count = {
counter = 32369635
  },
  nocb_follower_head = 0x88ae901c0a00,
  nocb_follower_tail = 0x88af1538b8d8,
  nocb_kthread = 0x88b06d29,

As you can see here, the nocb_follower_head is not empty, so in theory, the
nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:

crash> bt 0x88b06d29
PID: 21 TASK: 88b06d29  CPU: 3   COMMAND: "rcuos/1"
 #0 [afc9020b7dc0] __schedule at 8d8789dc
 #1 [afc9020b7e38] schedule at 8d878e76
 #2 [afc9020b7e50] rcu_nocb_kthread at 8d112337
 #3 [afc9020b7ec8] kthread at 8d0c6ce7
 #4 [afc9020b7f50] ret_from_fork at 8d87d755

And if we dis the address at 8d112337:

/usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h:
 2106
0x8d11232d :  test   %rax,%rax
0x8d112330 :  jne0x8d112355 

0x8d112332 :  callq  0x8d878e40 

0x8d112337 :  lea-0x40(%rbp),%rsi

So the kthread is blocked at swait_event_interruptible in the 
nocb_follower_wait.
This contradict with the fact that nocb_follower_head was not empty. So I
wonder if this is caused by the lack of memory barrier in the place shown below.
If the head is set to NULL after doing xchg, it will overwrite the head set
by leader. This caused the kthread to sleep the next iteration, and the leader
won't wake him up as the tail doesn't point to head.

Please tell me what do you think.

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h 
linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h  2017-07-12 06:42:41.0 
-0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-20 15:25:57.311206343 
-0700
@@ -2149,6 +2149,7 @@
BUG_ON(!list);
trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
WRITE_ONCE(rdp->nocb_follower_head, NULL);
+   smp_mb();
tail = xchg(>nocb_follower_tail, >nocb_follower_head);
 
/* Each pass through the following loop invokes a callback. */


Re: blk_mq_freeze_queue hang and possible race in percpu-refcount

2018-03-14 Thread David Chen
Hi Tejun,

Thanks, I see I missed the RCU part.
I'll try the force atomic thing.
Though so far I haven't been able to reproduce it yet.

Thanks,
David


2018-03-14 8:43 GMT-07:00 Tejun Heo <t...@kernel.org>:
> Hello, David.
>
> On Tue, Mar 13, 2018 at 03:50:47PM -0700, David Chen wrote:
>> 
>> CPU A   CPU B
>> -   -
>> percpu_ref_kill()   percpu_ref_tryget_live()
>> {
>> if (__ref_is_percpu())
>>   set __PERCPU_REF_DEAD;
>>   __percpu_ref_switch_mode();
>>^ sum up current percpu_count
>> this_cpu_inc(*percpu_count); <- this
>> increment got leaked.
>>
>> 
>>
>> So if later CPU B later does percpu_ref_put, it will cause ref->count
>> to drop to -1.
>> And thus causing the above hung task issue.
>>
>> Do you think this theory is correct, or am I missing something?
>> Please tell me what do you think.
>
> The switching to atomic mode does something like the following.
>
> 1. Mark the refcnt so that __ref_is_percpu() is false.
>
> 2. Wait for RCU grace period so that everyone including
>percpu_ref_tryget_live() which has seen true __ref_is_percpu() is
>done with its operation.
>
> 3. Now that it knows nobody is operating on the assumption that the
>counter is in percpu mode, it adds up all the percpu counters.
>
> So, provided there aren't some silly bugs, what you described
> shouldn't happen.  Can you force the refcnt into atomic mode w/
> PERCPU_REF_INIT_ATOMIC and see whether the problem persists?
>
> Thanks.
>
> --
> tejun


Re: blk_mq_freeze_queue hang and possible race in percpu-refcount

2018-03-14 Thread David Chen
Hi Tejun,

Thanks, I see I missed the RCU part.
I'll try the force atomic thing.
Though so far I haven't been able to reproduce it yet.

Thanks,
David


2018-03-14 8:43 GMT-07:00 Tejun Heo :
> Hello, David.
>
> On Tue, Mar 13, 2018 at 03:50:47PM -0700, David Chen wrote:
>> 
>> CPU A   CPU B
>> -   -
>> percpu_ref_kill()   percpu_ref_tryget_live()
>> {
>> if (__ref_is_percpu())
>>   set __PERCPU_REF_DEAD;
>>   __percpu_ref_switch_mode();
>>^ sum up current percpu_count
>> this_cpu_inc(*percpu_count); <- this
>> increment got leaked.
>>
>> 
>>
>> So if later CPU B later does percpu_ref_put, it will cause ref->count
>> to drop to -1.
>> And thus causing the above hung task issue.
>>
>> Do you think this theory is correct, or am I missing something?
>> Please tell me what do you think.
>
> The switching to atomic mode does something like the following.
>
> 1. Mark the refcnt so that __ref_is_percpu() is false.
>
> 2. Wait for RCU grace period so that everyone including
>percpu_ref_tryget_live() which has seen true __ref_is_percpu() is
>done with its operation.
>
> 3. Now that it knows nobody is operating on the assumption that the
>counter is in percpu mode, it adds up all the percpu counters.
>
> So, provided there aren't some silly bugs, what you described
> shouldn't happen.  Can you force the refcnt into atomic mode w/
> PERCPU_REF_INIT_ATOMIC and see whether the problem persists?
>
> Thanks.
>
> --
> tejun


blk_mq_freeze_queue hang and possible race in percpu-refcount

2018-03-13 Thread David Chen
Hi Tejun,

We recently hit an issue where several tasks hung in blk_mq_freeze_queue_wait.
All the task have the same stack trace as the one below, which is
doing fput on loop device.
Which is strange because the task should be no more than a open(2), a
couple pread(2),
and close(2) on a loop device.


"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zpool   D0 32416  32156 0x0080
967e4b0a1740 967e4b0a1740 967dcdd5 967dcd8825c0
967eafad9780 b7ab84c5fd18 938789dc 0286
0286 967dcdd5 967ea7d691e8 967ea6df2528
Call Trace:
[] ? __schedule+0x21c/0x680
[] schedule+0x36/0x80
[] blk_mq_freeze_queue_wait+0x3c/0x90
[] ? prepare_to_wait_event+0x110/0x110
[] blk_mq_freeze_queue+0x1a/0x20
[] lo_release+0x61/0x70
[] __blkdev_put+0x225/0x280
[] blkdev_put+0x4c/0x110
[] blkdev_close+0x25/0x30
[] __fput+0xe7/0x220
[] fput+0xe/0x10
[] task_work_run+0x83/0xb0
[] exit_to_usermode_loop+0x66/0x92
[] do_syscall_64+0x165/0x180
[] entry_SYSCALL64_slow_path+0x25/0x25


I look into the percpu-refcount code, and found a place that might
have a race that might cause this.
Consider a concurrent percpu_ref_kill and percpu_ref_tryget_live:


CPU A   CPU B
-   -
percpu_ref_kill()   percpu_ref_tryget_live()
{
if (__ref_is_percpu())
  set __PERCPU_REF_DEAD;
  __percpu_ref_switch_mode();
   ^ sum up current percpu_count
this_cpu_inc(*percpu_count); <- this
increment got leaked.



So if later CPU B later does percpu_ref_put, it will cause ref->count
to drop to -1.
And thus causing the above hung task issue.

Do you think this theory is correct, or am I missing something?
Please tell me what do you think.

Thanks,
David


blk_mq_freeze_queue hang and possible race in percpu-refcount

2018-03-13 Thread David Chen
Hi Tejun,

We recently hit an issue where several tasks hung in blk_mq_freeze_queue_wait.
All the task have the same stack trace as the one below, which is
doing fput on loop device.
Which is strange because the task should be no more than a open(2), a
couple pread(2),
and close(2) on a loop device.


"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zpool   D0 32416  32156 0x0080
967e4b0a1740 967e4b0a1740 967dcdd5 967dcd8825c0
967eafad9780 b7ab84c5fd18 938789dc 0286
0286 967dcdd5 967ea7d691e8 967ea6df2528
Call Trace:
[] ? __schedule+0x21c/0x680
[] schedule+0x36/0x80
[] blk_mq_freeze_queue_wait+0x3c/0x90
[] ? prepare_to_wait_event+0x110/0x110
[] blk_mq_freeze_queue+0x1a/0x20
[] lo_release+0x61/0x70
[] __blkdev_put+0x225/0x280
[] blkdev_put+0x4c/0x110
[] blkdev_close+0x25/0x30
[] __fput+0xe7/0x220
[] fput+0xe/0x10
[] task_work_run+0x83/0xb0
[] exit_to_usermode_loop+0x66/0x92
[] do_syscall_64+0x165/0x180
[] entry_SYSCALL64_slow_path+0x25/0x25


I look into the percpu-refcount code, and found a place that might
have a race that might cause this.
Consider a concurrent percpu_ref_kill and percpu_ref_tryget_live:


CPU A   CPU B
-   -
percpu_ref_kill()   percpu_ref_tryget_live()
{
if (__ref_is_percpu())
  set __PERCPU_REF_DEAD;
  __percpu_ref_switch_mode();
   ^ sum up current percpu_count
this_cpu_inc(*percpu_count); <- this
increment got leaked.



So if later CPU B later does percpu_ref_put, it will cause ref->count
to drop to -1.
And thus causing the above hung task issue.

Do you think this theory is correct, or am I missing something?
Please tell me what do you think.

Thanks,
David


Re: [PATCH] vfs: check i_count under lock in evict_inodes

2016-07-11 Thread David Chen
Hi Al,

I'm not sure about the in-tree fs, but in zfsonlinux, it would offload
iput to a thread, so this would happen there. And it would wait for
the thread in put_super(), so that part is not a problem...

Thanks

2016-07-11 17:46 GMT-07:00 Al Viro :
> On Mon, Jul 11, 2016 at 05:15:04PM -0700, Chunwei Chen wrote:
>> We need to check i_count again with i_lock held, because iput might re-add
>> i_count when lazytime is on. Without this check, we could end up with
>> double-free or use-after-free.
>
> Details, please.  Ideally - with a reproducer.  Who is calling that iput()
> at that point of generic_shutdown_super() (has to be another thread) and
> just what will happen if the same iput() is delayed until *after*
> evict_inodes(), all the way into ->put_super().  At which point there's
> no promise whatsoever that the data structures used by ->evict_inode()
> hadn't been already freed...
>


Re: [PATCH] vfs: check i_count under lock in evict_inodes

2016-07-11 Thread David Chen
Hi Al,

I'm not sure about the in-tree fs, but in zfsonlinux, it would offload
iput to a thread, so this would happen there. And it would wait for
the thread in put_super(), so that part is not a problem...

Thanks

2016-07-11 17:46 GMT-07:00 Al Viro :
> On Mon, Jul 11, 2016 at 05:15:04PM -0700, Chunwei Chen wrote:
>> We need to check i_count again with i_lock held, because iput might re-add
>> i_count when lazytime is on. Without this check, we could end up with
>> double-free or use-after-free.
>
> Details, please.  Ideally - with a reproducer.  Who is calling that iput()
> at that point of generic_shutdown_super() (has to be another thread) and
> just what will happen if the same iput() is delayed until *after*
> evict_inodes(), all the way into ->put_super().  At which point there's
> no promise whatsoever that the data structures used by ->evict_inode()
> hadn't been already freed...
>


Re: [PATCH] perf config: ignore generated files in feature-checks

2013-12-19 Thread David Chen
How about this:

test-*
!*.c

So that we don't ignore *.c and we don't have to change file name.
Anyway, I'll reupload tomorrow when I have my computer.

Thanks,
Chunwei Chen
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf config: ignore generated files in feature-checks

2013-12-19 Thread David Chen
Why not? They're generated files aren't they?

By the way, I just found out that test-* will be built depend on the
detected features.
Should I include them all?

Thanks,
Chunwei Chen

2013/12/19 Ingo Molnar :
>
> * Chunwei Chen  wrote:
>
>> Signed-off-by: Chunwei Chen 
>> ---
>>  tools/perf/config/feature-checks/.gitignore |2 ++
>>  1 file changed, 2 insertions(+)
>>  create mode 100644 tools/perf/config/feature-checks/.gitignore
>>
>> diff --git a/tools/perf/config/feature-checks/.gitignore 
>> b/tools/perf/config/feature-checks/.gitignore
>> new file mode 100644
>> index 000..9662c68
>> --- /dev/null
>> +++ b/tools/perf/config/feature-checks/.gitignore
>> @@ -0,0 +1,2 @@
>> +test-all
>> +*.d
>
> Why? These are cleaned out on 'make clean'.
>
> Thanks,
>
> Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf config: ignore generated files in feature-checks

2013-12-19 Thread David Chen
Why not? They're generated files aren't they?

By the way, I just found out that test-* will be built depend on the
detected features.
Should I include them all?

Thanks,
Chunwei Chen

2013/12/19 Ingo Molnar mi...@kernel.org:

 * Chunwei Chen tux...@gmail.com wrote:

 Signed-off-by: Chunwei Chen tux...@gmail.com
 ---
  tools/perf/config/feature-checks/.gitignore |2 ++
  1 file changed, 2 insertions(+)
  create mode 100644 tools/perf/config/feature-checks/.gitignore

 diff --git a/tools/perf/config/feature-checks/.gitignore 
 b/tools/perf/config/feature-checks/.gitignore
 new file mode 100644
 index 000..9662c68
 --- /dev/null
 +++ b/tools/perf/config/feature-checks/.gitignore
 @@ -0,0 +1,2 @@
 +test-all
 +*.d

 Why? These are cleaned out on 'make clean'.

 Thanks,

 Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf config: ignore generated files in feature-checks

2013-12-19 Thread David Chen
How about this:

test-*
!*.c

So that we don't ignore *.c and we don't have to change file name.
Anyway, I'll reupload tomorrow when I have my computer.

Thanks,
Chunwei Chen
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/