Re: [LKP] Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-11-02 Thread Oliver Sang
On Mon, Nov 02, 2020 at 09:59:29AM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 02, 2020 at 10:02:36PM +0800, Sang, Oliver wrote:
> > Hi,
> > 
> > want to consult if all fix merged into mainline?
> > 
> > we found below commit merged rdma updates into mainline
> 
> rc2 probably fixes the error these logs have
> 
> But I think you'll hit a WARN_ON that isn't fixed yet

Thanks a lot for information! we'll check on rc2. and back to you
if need more help. Thanks

> 
> Jason


Re: [LKP] Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-11-02 Thread Jason Gunthorpe
On Mon, Nov 02, 2020 at 10:02:36PM +0800, Sang, Oliver wrote:
> Hi,
> 
> want to consult if all fix merged into mainline?
> 
> we found below commit merged rdma updates into mainline

rc2 probably fixes the error these logs have

But I think you'll hit a WARN_ON that isn't fixed yet

Jason


Re: [LKP] Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-11-02 Thread Sang, Oliver
Hi,

want to consult if all fix merged into mainline?

we found below commit merged rdma updates into mainline

commit a1e16bc7d5f7ca3599d8a7f061841c93a563665e
Merge: 2a934b38c066f c7a198c700763
Author: Linus Torvalds 
Date:   Sat Oct 17 11:18:18 2020 -0700

Merge tag 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

however, on commit c7a198c700763, the hung is still existing. (detail is in 
attached kmsg-c7a198c700.xz)
user  :notice: [   32.708205] 2020-10-21 18:38:19 ./check srp/001 srp/002 
srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 
srp/0
13 srp/015
user  :notice: [   32.724650] srp/001 (Create and remove LUNs)
user  :notice: [   81.696806] IPMI BMC is not supported on this machine, skip 
bmc-watchdog setup!
user  :notice: [ 3681.698286] Wed Oct 21 19:39:18 UTC 2020 detected soft_timeout
user  :notice: [ 3681.832613] kill 886 /usr/bin/time -v -o 
/tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests

and after this merge, the mainline test of blktests changed from PASS to FAIL.
on 2a934b38c066f (detail is in attached kmsg-2a934b38c0.xz), still PASS.
user  :notice: [   36.669645] 2020-10-20 08:03:59 ./check srp/001 srp/002 
srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 
srp/0
13 srp/015
user  :notice: [   36.686114] srp/001 (Create and remove LUNs)
user  :notice: [   36.695753] srp/001 (Create and remove LUNs)  
   [passed]
user  :notice: [   36.705332] runtime  ...  1.716s
user  :notice: [   36.711993] srp/002 (File I/O on top of multipath 
concurrently with logout and login (mq))
kern  :debug : [   38.273376] ib_srp:srp_parse_in: ib_srp: 192.168.3.94 -> 
192.168.3.94:0
...

but on a1e16bc7d5f7ca3599d8a7f061841c93a563665e,  (detail is in attached 
kmsg-a1e16bc7d5.xz), FAIL.
user  :notice: [   33.392439] 2020-10-20 09:39:24 ./check srp/001 srp/002 
srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 
srp/0
13 srp/015
kern  :debug : [   33.392640] ib_srp:srp_parse_in: ib_srp: 127.0.0.1 -> 
127.0.0.1:0
user  :notice: [   33.407416] srp/001 (Create and remove LUNs)
kern  :debug : [   33.407484] ib_srp:srp_parse_in: ib_srp: 127.0.0.1: -> 
127.0.0.1:
user  :notice: [   33.415088] srp/001 (Create and remove LUNs)  
   [failed]
kern  :debug : [   33.420262] ib_srp:srp_parse_in: ib_srp: [::1%1] -> 
[::1]:0/251658241%0

and the tests still keep FAIL on v5.10-rc1.


kmsg-a1e16bc7d5.xz
Description: kmsg-a1e16bc7d5.xz


kmsg-2a934b38c0.xz
Description: kmsg-2a934b38c0.xz


kmsg-c7a198c700.xz
Description: kmsg-c7a198c700.xz


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-09-14 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 03:00:30PM -0700, Bart Van Assche wrote:
> On 2020-09-08 19:01, Bart Van Assche wrote:
> > The above patch didn't compile, but the patch below does and makes the hang
> > disappear. So feel free to add the following to the patch below:
> > 
> > Tested-by: Bart Van Assche 
> > 
> > diff --git a/drivers/infiniband/core/device.c 
> > b/drivers/infiniband/core/device.c
> > index c36b4d2b61e0..23ee65a9185f 100644
> > +++ b/drivers/infiniband/core/device.c
> > @@ -1285,6 +1285,8 @@ static void disable_device(struct ib_device *device)
> > remove_client_context(device, cid);
> > }
> > 
> > +   ib_cq_pool_destroy(device);
> > +
> > /* Pairs with refcount_set in enable_device */
> > ib_device_put(device);
> > wait_for_completion(>unreg_completion);
> > @@ -1328,6 +1330,8 @@ static int enable_device_and_get(struct ib_device 
> > *device)
> > goto out;
> > }
> > 
> > +   ib_cq_pool_init(device);
> > +
> > down_read(_rwsem);
> > xa_for_each_marked (, index, client, CLIENT_REGISTERED) {
> > ret = add_client_context(device, client);
> > @@ -1400,7 +1404,6 @@ int ib_register_device(struct ib_device *device, 
> > const char *name)
> > goto dev_cleanup;
> > }
> > 
> > -   ib_cq_pool_init(device);
> > ret = enable_device_and_get(device);
> > dev_set_uevent_suppress(>dev, false);
> > /* Mark for userspace that device is ready */
> > @@ -1455,7 +1458,6 @@ static void __ib_unregister_device(struct ib_device 
> > *ib_dev)
> > goto out;
> > 
> > disable_device(ib_dev);
> > -   ib_cq_pool_destroy(ib_dev);
> > 
> > /* Expedite removing unregistered pointers from the hash table */
> > free_netdevs(ib_dev);
> 
> Hi Jason,
> 
> Please let me know how you want to proceed with this patch.

Applied to for-rc

Thanks,
Jason


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-09-11 Thread Yi Zhang

Tested-by: Yi Zhang 


This patch fixed the issue I filed[1] which use rdma_rxe for nvme-rdma 
testing.


[1]
http://lists.infradead.org/pipermail/linux-nvme/2020-August/018988.html

Thanks
Yi

On 9/12/20 6:00 AM, Bart Van Assche wrote:

On 2020-09-08 19:01, Bart Van Assche wrote:

The above patch didn't compile, but the patch below does and makes the hang
disappear. So feel free to add the following to the patch below:

Tested-by: Bart Van Assche 

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c36b4d2b61e0..23ee65a9185f 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1285,6 +1285,8 @@ static void disable_device(struct ib_device *device)
remove_client_context(device, cid);
}

+   ib_cq_pool_destroy(device);
+
/* Pairs with refcount_set in enable_device */
ib_device_put(device);
wait_for_completion(>unreg_completion);
@@ -1328,6 +1330,8 @@ static int enable_device_and_get(struct ib_device *device)
goto out;
}

+   ib_cq_pool_init(device);
+
down_read(_rwsem);
xa_for_each_marked (, index, client, CLIENT_REGISTERED) {
ret = add_client_context(device, client);
@@ -1400,7 +1404,6 @@ int ib_register_device(struct ib_device *device, const 
char *name)
goto dev_cleanup;
}

-   ib_cq_pool_init(device);
ret = enable_device_and_get(device);
dev_set_uevent_suppress(>dev, false);
/* Mark for userspace that device is ready */
@@ -1455,7 +1458,6 @@ static void __ib_unregister_device(struct ib_device 
*ib_dev)
goto out;

disable_device(ib_dev);
-   ib_cq_pool_destroy(ib_dev);

/* Expedite removing unregistered pointers from the hash table */
free_netdevs(ib_dev);

Hi Jason,

Please let me know how you want to proceed with this patch.

Thanks,

Bart.





Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-09-11 Thread Bart Van Assche
On 2020-09-08 19:01, Bart Van Assche wrote:
> The above patch didn't compile, but the patch below does and makes the hang
> disappear. So feel free to add the following to the patch below:
> 
> Tested-by: Bart Van Assche 
> 
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index c36b4d2b61e0..23ee65a9185f 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -1285,6 +1285,8 @@ static void disable_device(struct ib_device *device)
>   remove_client_context(device, cid);
>   }
> 
> + ib_cq_pool_destroy(device);
> +
>   /* Pairs with refcount_set in enable_device */
>   ib_device_put(device);
>   wait_for_completion(>unreg_completion);
> @@ -1328,6 +1330,8 @@ static int enable_device_and_get(struct ib_device 
> *device)
>   goto out;
>   }
> 
> + ib_cq_pool_init(device);
> +
>   down_read(_rwsem);
>   xa_for_each_marked (, index, client, CLIENT_REGISTERED) {
>   ret = add_client_context(device, client);
> @@ -1400,7 +1404,6 @@ int ib_register_device(struct ib_device *device, const 
> char *name)
>   goto dev_cleanup;
>   }
> 
> - ib_cq_pool_init(device);
>   ret = enable_device_and_get(device);
>   dev_set_uevent_suppress(>dev, false);
>   /* Mark for userspace that device is ready */
> @@ -1455,7 +1458,6 @@ static void __ib_unregister_device(struct ib_device 
> *ib_dev)
>   goto out;
> 
>   disable_device(ib_dev);
> - ib_cq_pool_destroy(ib_dev);
> 
>   /* Expedite removing unregistered pointers from the hash table */
>   free_netdevs(ib_dev);

Hi Jason,

Please let me know how you want to proceed with this patch.

Thanks,

Bart.


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-09-08 Thread Bart Van Assche
On 2020-09-08 11:22, Jason Gunthorpe wrote:
> It is reasonable to consider the cq_pool as a built-in client, so I
> would suggest moving it to right around the time the dynamic clients
> are handled. Something like this:
> 
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index c36b4d2b61e0c0..e3651dacad1da6 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -1285,6 +1285,8 @@ static void disable_device(struct ib_device *device)
>   remove_client_context(device, cid);
>   }
>  
> + ib_cq_pool_destroy(ib_dev);
> +
>   /* Pairs with refcount_set in enable_device */
>   ib_device_put(device);
>   wait_for_completion(>unreg_completion);
> @@ -1328,6 +1330,8 @@ static int enable_device_and_get(struct ib_device 
> *device)
>   goto out;
>   }
>  
> + ib_cq_pool_init(device);
> +
>   down_read(_rwsem);
>   xa_for_each_marked (, index, client, CLIENT_REGISTERED) {
>   ret = add_client_context(device, client);
> @@ -1400,7 +1404,6 @@ int ib_register_device(struct ib_device *device, const 
> char *name)
>   goto dev_cleanup;
>   }
>  
> - ib_cq_pool_init(device);
>   ret = enable_device_and_get(device);
>   dev_set_uevent_suppress(>dev, false);
>   /* Mark for userspace that device is ready */
> @@ -1455,7 +1458,6 @@ static void __ib_unregister_device(struct ib_device 
> *ib_dev)
>   goto out;
>  
>   disable_device(ib_dev);
> - ib_cq_pool_destroy(ib_dev);
>  
>   /* Expedite removing unregistered pointers from the hash table */
>   free_netdevs(ib_dev);

The above patch didn't compile, but the patch below does and makes the hang
disappear. So feel free to add the following to the patch below:

Tested-by: Bart Van Assche 

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c36b4d2b61e0..23ee65a9185f 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1285,6 +1285,8 @@ static void disable_device(struct ib_device *device)
remove_client_context(device, cid);
}

+   ib_cq_pool_destroy(device);
+
/* Pairs with refcount_set in enable_device */
ib_device_put(device);
wait_for_completion(>unreg_completion);
@@ -1328,6 +1330,8 @@ static int enable_device_and_get(struct ib_device *device)
goto out;
}

+   ib_cq_pool_init(device);
+
down_read(_rwsem);
xa_for_each_marked (, index, client, CLIENT_REGISTERED) {
ret = add_client_context(device, client);
@@ -1400,7 +1404,6 @@ int ib_register_device(struct ib_device *device, const 
char *name)
goto dev_cleanup;
}

-   ib_cq_pool_init(device);
ret = enable_device_and_get(device);
dev_set_uevent_suppress(>dev, false);
/* Mark for userspace that device is ready */
@@ -1455,7 +1458,6 @@ static void __ib_unregister_device(struct ib_device 
*ib_dev)
goto out;

disable_device(ib_dev);
-   ib_cq_pool_destroy(ib_dev);

/* Expedite removing unregistered pointers from the hash table */
free_netdevs(ib_dev);



Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-09-08 Thread Jason Gunthorpe
On Sun, Sep 06, 2020 at 07:58:12PM -0700, Bart Van Assche wrote:

> The following appeared:
> 
> WARNING: CPU: 5 PID: 1760 at drivers/infiniband/core/device.c:335 
> ib_device_put+0xf2/0x100 [ib_core]
> 
> Call Trace:
>  rxe_elem_release+0x76/0x90 [rdma_rxe]
>  rxe_destroy_cq+0x4f/0x70 [rdma_rxe]
>  ib_free_cq_user+0x12b/0x2b0 [ib_core]
>  ib_cq_pool_destroy+0xa8/0x140 [ib_core]
>  __ib_unregister_device+0x9c/0x1c0 [ib_core]
>  ib_unregister_driver+0x181/0x1a0 [ib_core]
>  rxe_module_exit+0x31/0x50 [rdma_rxe]
>  __x64_sys_delete_module+0x22a/0x310
>  do_syscall_64+0x36/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Oh interesting..

> Do you agree that the above proves that the hang-on-unload is a
> regression that has been introduced by the cq-pool patches? Is the patch
> below a good way to fix this?

> +++ b/drivers/infiniband/core/device.c
> @@ -1454,8 +1454,8 @@ static void __ib_unregister_device(struct ib_device 
> *ib_dev)
>   if (!refcount_read(_dev->refcount))
>   goto out;
> 
> - disable_device(ib_dev);
>   ib_cq_pool_destroy(ib_dev);
> + disable_device(ib_dev);
> 
>   /* Expedite removing unregistered pointers from the hash table */
>   free_netdevs(ib_dev);

Hum. Not quite..

disable_device() ensures that all ib_clients have disconnected from
the device, up until the clients have disconnected the cq_pool must
remain operational.

It is reasonable to consider the cq_pool as a built-in client, so I
would suggest moving it to right around the time the dynamic clients
are handled. Something like this:

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c36b4d2b61e0c0..e3651dacad1da6 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1285,6 +1285,8 @@ static void disable_device(struct ib_device *device)
remove_client_context(device, cid);
}
 
+   ib_cq_pool_destroy(ib_dev);
+
/* Pairs with refcount_set in enable_device */
ib_device_put(device);
wait_for_completion(>unreg_completion);
@@ -1328,6 +1330,8 @@ static int enable_device_and_get(struct ib_device *device)
goto out;
}
 
+   ib_cq_pool_init(device);
+
down_read(_rwsem);
xa_for_each_marked (, index, client, CLIENT_REGISTERED) {
ret = add_client_context(device, client);
@@ -1400,7 +1404,6 @@ int ib_register_device(struct ib_device *device, const 
char *name)
goto dev_cleanup;
}
 
-   ib_cq_pool_init(device);
ret = enable_device_and_get(device);
dev_set_uevent_suppress(>dev, false);
/* Mark for userspace that device is ready */
@@ -1455,7 +1458,6 @@ static void __ib_unregister_device(struct ib_device 
*ib_dev)
goto out;
 
disable_device(ib_dev);
-   ib_cq_pool_destroy(ib_dev);
 
/* Expedite removing unregistered pointers from the hash table */
free_netdevs(ib_dev);


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-09-06 Thread Bart Van Assche
On 2020-08-24 06:30, Jason Gunthorpe wrote:
> On Sun, Aug 23, 2020 at 02:18:41PM -0700, Bart Van Assche wrote:>  
>> The patch below is sufficient to unbreak blktests. I think that the
>> deadlock while unloading rdma_rxe happens because the RDMA core waits for
>> all ib_dev references to be dropped before dealloc_driver is called.
> 
> Which is required, yes
> 
>> The rdma_rxe dealloc_driver implementation drops an ib_dev
>> reference. 
> 
> Where does it do that? I didn't notice it?

That last statement was wrong.

Anyway, with the following debugging patch applied:

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c36b4d2b61e0..b976dd30f727 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -332,6 +332,8 @@ struct ib_device *ib_device_get_by_index(const struct net 
*net, u32 index)
  */
 void ib_device_put(struct ib_device *device)
 {
+   WARN(device->warn_on_refcount_drop, "%s refcnt = %d\n",
+dev_name(>dev), refcount_read(>refcount));
if (refcount_dec_and_test(>refcount))
complete(>unreg_completion);
 }
@@ -1287,7 +1289,10 @@ static void disable_device(struct ib_device *device)

/* Pairs with refcount_set in enable_device */
ib_device_put(device);
+   device->warn_on_refcount_drop = true;
+#if 0
wait_for_completion(>unreg_completion);
+#endif

/*
 * compat devices must be removed after device refcount drops to zero.
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c868609a4ffa..2d050e3ee55a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2704,6 +2704,7 @@ struct ib_device {
 * registered and cannot be unregistered.
 */
refcount_t refcount;
+   boolwarn_on_refcount_drop;
struct completion unreg_completion;
struct work_struct unregistration_work;

The following appeared:

WARNING: CPU: 5 PID: 1760 at drivers/infiniband/core/device.c:335 
ib_device_put+0xf2/0x100 [ib_core]

Call Trace:
 rxe_elem_release+0x76/0x90 [rdma_rxe]
 rxe_destroy_cq+0x4f/0x70 [rdma_rxe]
 ib_free_cq_user+0x12b/0x2b0 [ib_core]
 ib_cq_pool_destroy+0xa8/0x140 [ib_core]
 __ib_unregister_device+0x9c/0x1c0 [ib_core]
 ib_unregister_driver+0x181/0x1a0 [ib_core]
 rxe_module_exit+0x31/0x50 [rdma_rxe]
 __x64_sys_delete_module+0x22a/0x310
 do_syscall_64+0x36/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Do you agree that the above proves that the hang-on-unload is a
regression that has been introduced by the cq-pool patches? Is the patch
below a good way to fix this?

--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1454,8 +1454,8 @@ static void __ib_unregister_device(struct ib_device 
*ib_dev)
if (!refcount_read(_dev->refcount))
goto out;

-   disable_device(ib_dev);
ib_cq_pool_destroy(ib_dev);
+   disable_device(ib_dev);

/* Expedite removing unregistered pointers from the hash table */
free_netdevs(ib_dev);

Thanks,

Bart.


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-08-24 Thread Jason Gunthorpe
On Sun, Aug 23, 2020 at 02:18:41PM -0700, Bart Van Assche wrote:
 
> The patch below is sufficient to unbreak blktests. I think that the
> deadlock while unloading rdma_rxe happens because the RDMA core waits for
> all ib_dev references to be dropped before dealloc_driver is called.

Which is required, yes

> The rdma_rxe dealloc_driver implementation drops an ib_dev
> reference. 

Where does it do that? I didn't notice it?

Jason


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-08-23 Thread Bart Van Assche
On 2020-08-03 00:27, Sagi Grimberg wrote:
> 
 Greeting,

 FYI, we noticed the following commit (built with gcc-9):

 commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new shared 
 CQ mechanism")
 https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next


 in testcase: blktests
 with following parameters:

 test: srp-group1
 ucode: 0x21



 on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G 
 memory

 caused below changes (please refer to attached dmesg/kmsg for entire 
 log/backtrace):




 If you fix the issue, kindly add following tag
 Reported-by: kernel test robot 


 user  :notice: [   44.688140] 2020-08-01 16:10:22 ./check srp/001 srp/002 
 srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 
 srp/012 srp/013 srp/015
 user  :notice: [   44.706657] srp/001 (Create and remove LUNs)
 user  :notice: [   44.718405] srp/001 (Create and remove LUNs) 
     [passed]
 user  :notice: [   44.729902] runtime  ...  1.972s
 user  :notice: [   99.038748] IPMI BMC is not supported on this machine, 
 skip bmc-watchdog setup!
 user  :notice: [ 3699.039790] Sat Aug  1 17:11:22 UTC 2020 detected 
 soft_timeout
 user  :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o 
 /tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests
>>> Yamin and Max, can you take a look at this? The SRP tests from the
>>> blktests repository pass reliably with kernel version v5.7 and before.
>>> With label next-20200731 from linux-next however that test triggers the
>>> following hang:
>>
>> I will look into it.
> 
> FWIW, I ran into this as well with nvme-rdma, but it also reproduces
> when I revert the shared CQ patch from nvme-rdma. Another data point
> is that my tests passes with siw.

Hi Jason,

The patch below is sufficient to unbreak blktests. I think that the
deadlock while unloading rdma_rxe happens because the RDMA core waits for
all ib_dev references to be dropped before dealloc_driver is called.
The rdma_rxe dealloc_driver implementation drops an ib_dev reference. The
dealloc_driver callback was introduced by commit d0899892edd0
("RDMA/device: Provide APIs from the core code to help unregistration").
Do you agree that this regression has been introduced by commits
d0899892edd0 and c367074b6c37 ("RDMA/rxe: Use driver_unregister and new
unregistration API")?

Thanks,

Bart.

---
 drivers/infiniband/core/device.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index dca2842a7872..5192f305b253 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1287,13 +1287,8 @@ static void disable_device(struct ib_device *device)

/* Pairs with refcount_set in enable_device */
ib_device_put(device);
-   wait_for_completion(>unreg_completion);

-   /*
-* compat devices must be removed after device refcount drops to zero.
-* Otherwise init_net() may add more compatdevs after removing compat
-* devices and before device is disabled.
-*/
+   /* To do: prevent init_net() to add more compat_devs. */
remove_compat_devs(device);
 }




Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-08-03 Thread Sagi Grimberg




Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new 
shared CQ mechanism")

https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next


in testcase: blktests
with following parameters:

test: srp-group1
ucode: 0x21



on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz 
with 4G memory


caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):





If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


user  :notice: [   44.688140] 2020-08-01 16:10:22 ./check srp/001 
srp/002 srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 
srp/010 srp/011 srp/012 srp/013 srp/015

user  :notice: [   44.706657] srp/001 (Create and remove LUNs)
user  :notice: [   44.718405] srp/001 (Create and remove 
LUNs) [passed]

user  :notice: [   44.729902] runtime  ...  1.972s
user  :notice: [   99.038748] IPMI BMC is not supported on this 
machine, skip bmc-watchdog setup!
user  :notice: [ 3699.039790] Sat Aug  1 17:11:22 UTC 2020 detected 
soft_timeout
user  :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o 
/tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests

Yamin and Max, can you take a look at this? The SRP tests from the
blktests repository pass reliably with kernel version v5.7 and before.
With label next-20200731 from linux-next however that test triggers the
following hang:


I will look into it.


FWIW, I ran into this as well with nvme-rdma, but it also reproduces
when I revert the shared CQ patch from nvme-rdma. Another data point
is that my tests passes with siw.


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-08-03 Thread Yamin Friedman



On 8/2/2020 6:05 PM, Bart Van Assche wrote:

On 2020-08-01 23:09, kernel test robot wrote:

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new shared CQ 
mechanism")
https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next


in testcase: blktests
with following parameters:

test: srp-group1
ucode: 0x21



on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G 
memory

caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):




If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


user  :notice: [   44.688140] 2020-08-01 16:10:22 ./check srp/001 srp/002 
srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 
srp/013 srp/015
user  :notice: [   44.706657] srp/001 (Create and remove LUNs)
user  :notice: [   44.718405] srp/001 (Create and remove LUNs)  
   [passed]
user  :notice: [   44.729902] runtime  ...  1.972s
user  :notice: [   99.038748] IPMI BMC is not supported on this machine, skip 
bmc-watchdog setup!
user  :notice: [ 3699.039790] Sat Aug  1 17:11:22 UTC 2020 detected soft_timeout
user  :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o 
/tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests

Yamin and Max, can you take a look at this? The SRP tests from the
blktests repository pass reliably with kernel version v5.7 and before.
With label next-20200731 from linux-next however that test triggers the
following hang:


I will look into it.



sd 8:0:0:0: [sda] Synchronizing SCSI cache
rdma_rxe: not configured on eth0
rdma_rxe: not configured on lo
INFO: task modprobe:1894 blocked for more than 122 seconds.
   Not tainted 5.8.0-rc7-next-20200731-dbg+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobeD27624  1894   1081 0x4000
Call Trace:
  __schedule+0x4ee/0x1170
  schedule+0x7f/0x170
  schedule_timeout+0x453/0x6f0
  wait_for_completion+0x126/0x1b0
  disable_device+0x12a/0x1c0 [ib_core]
  __ib_unregister_device+0x64/0x100 [ib_core]
  ib_unregister_driver+0x11c/0x180 [ib_core]
  rxe_module_exit+0x1e/0x36 [rdma_rxe]
  __x64_sys_delete_module+0x22a/0x310
  do_syscall_64+0x36/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d0e0c8a3b
Code: Bad RIP value.
RSP: 002b:7ffe8238f798 EFLAGS: 0206 ORIG_RAX: 00b0
RAX: ffda RBX: 55a2c7d9be80 RCX: 7f8d0e0c8a3b
RDX:  RSI: 0800 RDI: 55a2c7d9bee8
RBP: 55a2c7d9be80 R08:  R09: 
R10: 7f8d0e144ac0 R11: 0206 R12: 55a2c7d9bee8
R13:  R14: 55a2c7d9bee8 R15: 55a2c7d9be80

Showing all locks held in the system:
1 lock held by khungtaskd/53:
  #0: 82895880 (rcu_read_lock){}-{1:2}, at: 
debug_show_all_locks+0x37/0x20f
1 lock held by modprobe/1894:
  #0: 8881c5e8c660 (>unregistration_lock){+.+.}-{3:3}, at: 
__ib_unregister_device+0x23/0x100 [ib_core]

=
INFO: task modprobe:1894 blocked for more than 245 seconds.
   Not tainted 5.8.0-rc7-next-20200731-dbg+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobeD27624  1894   1081 0x4000
Call Trace:
  __schedule+0x4ee/0x1170
  schedule+0x7f/0x170
  schedule_timeout+0x453/0x6f0
  wait_for_completion+0x126/0x1b0
  disable_device+0x12a/0x1c0 [ib_core]
  __ib_unregister_device+0x64/0x100 [ib_core]
  ib_unregister_driver+0x11c/0x180 [ib_core]
  rxe_module_exit+0x1e/0x36 [rdma_rxe]
  __x64_sys_delete_module+0x22a/0x310
  do_syscall_64+0x36/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d0e0c8a3b
Code: Bad RIP value.
RSP: 002b:7ffe8238f798 EFLAGS: 0206 ORIG_RAX: 00b0
RAX: ffda RBX: 55a2c7d9be80 RCX: 7f8d0e0c8a3b
RDX:  RSI: 0800 RDI: 55a2c7d9bee8
RBP: 55a2c7d9be80 R08:  R09: 
R10: 7f8d0e144ac0 R11: 0206 R12: 55a2c7d9bee8
R13:  R14: 55a2c7d9bee8 R15: 55a2c7d9be80

Showing all locks held in the system:
1 lock held by khungtaskd/53:
  #0: 82895880 (rcu_read_lock){}-{1:2}, at: 
debug_show_all_locks+0x37/0x20f
no locks held by systemd-journal/241.
1 lock held by modprobe/1894:
  #0: 8881c5e8c660 (>unregistration_lock){+.+.}-{3:3}, at: 
__ib_unregister_device+0x23/0x100 [ib_core]

=


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-08-02 Thread Bart Van Assche
On 2020-08-01 23:09, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new shared CQ 
> mechanism")
> https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next
> 
> 
> in testcase: blktests
> with following parameters:
> 
>   test: srp-group1
>   ucode: 0x21
> 
> 
> 
> on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G 
> memory
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> user  :notice: [   44.688140] 2020-08-01 16:10:22 ./check srp/001 srp/002 
> srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 
> srp/012 srp/013 srp/015
> user  :notice: [   44.706657] srp/001 (Create and remove LUNs)
> 
> user  :notice: [   44.718405] srp/001 (Create and remove LUNs)
>  [passed]
> user  :notice: [   44.729902] runtime  ...  1.972s
> user  :notice: [   99.038748] IPMI BMC is not supported on this machine, skip 
> bmc-watchdog setup!
> user  :notice: [ 3699.039790] Sat Aug  1 17:11:22 UTC 2020 detected 
> soft_timeout
> user  :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o 
> /tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests 

Yamin and Max, can you take a look at this? The SRP tests from the
blktests repository pass reliably with kernel version v5.7 and before.
With label next-20200731 from linux-next however that test triggers the
following hang:

sd 8:0:0:0: [sda] Synchronizing SCSI cache
rdma_rxe: not configured on eth0
rdma_rxe: not configured on lo
INFO: task modprobe:1894 blocked for more than 122 seconds.
  Not tainted 5.8.0-rc7-next-20200731-dbg+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobeD27624  1894   1081 0x4000
Call Trace:
 __schedule+0x4ee/0x1170
 schedule+0x7f/0x170
 schedule_timeout+0x453/0x6f0
 wait_for_completion+0x126/0x1b0
 disable_device+0x12a/0x1c0 [ib_core]
 __ib_unregister_device+0x64/0x100 [ib_core]
 ib_unregister_driver+0x11c/0x180 [ib_core]
 rxe_module_exit+0x1e/0x36 [rdma_rxe]
 __x64_sys_delete_module+0x22a/0x310
 do_syscall_64+0x36/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d0e0c8a3b
Code: Bad RIP value.
RSP: 002b:7ffe8238f798 EFLAGS: 0206 ORIG_RAX: 00b0
RAX: ffda RBX: 55a2c7d9be80 RCX: 7f8d0e0c8a3b
RDX:  RSI: 0800 RDI: 55a2c7d9bee8
RBP: 55a2c7d9be80 R08:  R09: 
R10: 7f8d0e144ac0 R11: 0206 R12: 55a2c7d9bee8
R13:  R14: 55a2c7d9bee8 R15: 55a2c7d9be80

Showing all locks held in the system:
1 lock held by khungtaskd/53:
 #0: 82895880 (rcu_read_lock){}-{1:2}, at: 
debug_show_all_locks+0x37/0x20f
1 lock held by modprobe/1894:
 #0: 8881c5e8c660 (>unregistration_lock){+.+.}-{3:3}, at: 
__ib_unregister_device+0x23/0x100 [ib_core]

=
INFO: task modprobe:1894 blocked for more than 245 seconds.
  Not tainted 5.8.0-rc7-next-20200731-dbg+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobeD27624  1894   1081 0x4000
Call Trace:
 __schedule+0x4ee/0x1170
 schedule+0x7f/0x170
 schedule_timeout+0x453/0x6f0
 wait_for_completion+0x126/0x1b0
 disable_device+0x12a/0x1c0 [ib_core]
 __ib_unregister_device+0x64/0x100 [ib_core]
 ib_unregister_driver+0x11c/0x180 [ib_core]
 rxe_module_exit+0x1e/0x36 [rdma_rxe]
 __x64_sys_delete_module+0x22a/0x310
 do_syscall_64+0x36/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f8d0e0c8a3b
Code: Bad RIP value.
RSP: 002b:7ffe8238f798 EFLAGS: 0206 ORIG_RAX: 00b0
RAX: ffda RBX: 55a2c7d9be80 RCX: 7f8d0e0c8a3b
RDX:  RSI: 0800 RDI: 55a2c7d9bee8
RBP: 55a2c7d9be80 R08:  R09: 
R10: 7f8d0e144ac0 R11: 0206 R12: 55a2c7d9bee8
R13:  R14: 55a2c7d9bee8 R15: 55a2c7d9be80

Showing all locks held in the system:
1 lock held by khungtaskd/53:
 #0: 82895880 (rcu_read_lock){}-{1:2}, at: 
debug_show_all_locks+0x37/0x20f
no locks held by systemd-journal/241.
1 lock held by modprobe/1894:
 #0: 8881c5e8c660 (>unregistration_lock){+.+.}-{3:3}, at: 
__ib_unregister_device+0x23/0x100 [ib_core]

=