Re: mpt3sas heavy I/O load causes kernel BUG at block/blk-core.c:2695

2018-06-07 Thread Douglas Miller

Thanks, Suganath,

That commit was introduced with driver version 12.100.00.00. The distro 
version we're running, 15.100.00.01 (RHEL-ALT 7.5), appears to include 
this fix - although the code is not identical, probably due to the 
effects of backported patches. This driver also does not include 
commit 9961c9bbf2b43acaaf030a0fbabc9954d937ad8c, which was added much 
later (on top of driver 17.100.00.00). So I am still looking for a 
companion patch to 9961c9bbf2b43acaaf030a0fbabc9954d937ad8c that covers 
the opposite scenario.


Do you have any reason to believe that both situations (normal 
completion before abort, and abort before normal completion) do not need 
to be handled?


Thanks,

Doug


On 06/07/2018 01:24 AM, Suganath Prabu Subramani wrote:

Hi Douglas,

Can you check whether this patch is already part of the driver? If not,
please try with the patch below.
This patch fixes the case where the abort completes before the IO completion.
With it, the driver processes the IO's reply first, followed by the TM.

author    Suganath prabu Subramani   2016-01-28 12:07:06 +0530
committer Martin K. Petersen         2016-02-23 21:27:02 -0500
commit    03d1fb3a65783979f23bd58b5a0387e6992d9e26
tree      6aca275e2ebe7fbcd5fac1654cedd8f56d0947d0  /drivers/scsi/mpt3sas
parent    5c739b6157bd090942e5847ddd12bfb99cd4240d
download  linux-03d1fb3a65783979f23bd58b5a0387e6992d9e26.tar.gz

mpt3sas: Fix for Asynchronous completion of timedout IO and task abort
of timedout IO.
Track msix of each IO and use the same msix for issuing abort to timed
out IO. With this driver will process IO's reply first followed by TM.
Signed-off-by: Suganath prabu Subramani
 Signed-off-by: Chaitra P B
 Reviewed-by: Tomas Henzl
 Signed-off-by: Martin K. Petersen
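
For illustration, the approach described in the commit message above can be
sketched as a small standalone C program. The structure and function names
below are hypothetical stand-ins, not the real mpt3sas symbols: the point is
only that the reply-queue (MSI-X) index chosen at submission time is recorded
and reused when the abort is issued, so both replies are processed on the same
queue and the IO's reply is handled before the TM's.

#include <stdint.h>
#include <stdio.h>

struct io_tracker {
	uint16_t smid;        /* message id of the outstanding IO */
	uint8_t  msix_index;  /* reply queue the IO was assigned to */
};

/* Stubs standing in for the real submission paths. */
static void fire_scsi_io(uint16_t smid, uint8_t msix)
{
	printf("SCSI IO smid=%u posted, replies on queue %u\n", smid, msix);
}

static void fire_tm_abort(uint16_t smid, uint8_t msix)
{
	printf("TM ABORT for smid=%u posted, replies on queue %u\n", smid, msix);
}

static void submit_io(struct io_tracker *t, uint16_t smid, uint8_t msix)
{
	t->smid = smid;
	t->msix_index = msix;   /* remember which reply queue was chosen */
	fire_scsi_io(smid, msix);
}

static void abort_timed_out_io(const struct io_tracker *t)
{
	/* Reuse the recorded queue: replies on a single queue are processed
	 * in order, so the IO reply is handled before the TM reply. */
	fire_tm_abort(t->smid, t->msix_index);
}

int main(void)
{
	struct io_tracker t;

	submit_io(&t, 42, 3);
	abort_timed_out_io(&t);
	return 0;
}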



Thanks,
Suganath Prabu S

On Wed, Jun 6, 2018 at 7:50 PM, Douglas Miller
 wrote:

Running a heavy I/O load on multipath/dual-ported SSD disks attached to a
SAS3008 adapter (mpt3sas driver), we are seeing I/Os get aborted and tasks
stuck in blk_complete_request() and this sometimes results in hitting a
BUG_ON in blk_start_request(). It would appear that we are seeing two
completions performed on an I/O, and the second completion is racing with
re-use of the request for a new I/O.

I saw this upstream commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=9961c9bbf2b43acaaf030a0fbabc9954d937ad8c

which addresses the case where the normal completion occurs before the abort
completion. But the situation I am seeing appears to be that the abort
completion occurs before the normal completion (due to tasks getting delayed
in blk_complete_request()). I don't find any commit to fix this second case.

Of course, tasks being delayed like this is a concern, and is being worked
separately. But it seems that the alternate double-completion case is being
ignored here.

Does everyone concur that this second case needs to be addressed? Is there a
proposed fix?

Thanks,

Doug

FYI, system is a Power9 running RHEL-ALT 7.5, two SAS3008 adapters connected
to an IBM EXP24SX SAS Storage Enclosure with 24 HUSMM8040ASS201 drives. FIO
was being used to drive the I/O load.






mpt3sas heavy I/O load causes kernel BUG at block/blk-core.c:2695

2018-06-06 Thread Douglas Miller
Running a heavy I/O load on multipath/dual-ported SSD disks attached to 
a SAS3008 adapter (mpt3sas driver), we are seeing I/Os get aborted and 
tasks stuck in blk_complete_request() and this sometimes results in 
hitting a BUG_ON in blk_start_request(). It would appear that we are 
seeing two completions performed on an I/O, and the second completion is 
racing with re-use of the request for a new I/O.
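
As a rough standalone illustration of the double-completion hazard described
above (this is not the block layer code, just a model): once a request has
been completed, a second completion either trips a sanity check or, worse,
frees a request that has already been recycled for a new I/O, which is the
race that ends in the BUG_ON.

#include <assert.h>
#include <stdio.h>

enum rq_state { RQ_FREE, RQ_STARTED };

struct request { enum rq_state state; };

static void start_request(struct request *rq)
{
	assert(rq->state == RQ_FREE);    /* stand-in for the BUG_ON check */
	rq->state = RQ_STARTED;
}

static void complete_request(struct request *rq)
{
	assert(rq->state == RQ_STARTED); /* a second completion trips this */
	rq->state = RQ_FREE;             /* request returns to the pool */
}

int main(void)
{
	struct request rq = { RQ_FREE };

	start_request(&rq);      /* I/O is issued */
	complete_request(&rq);   /* abort path completes it ... */
	complete_request(&rq);   /* ... late normal completion fires again:
	                          * here the assert catches it; in the real
	                          * driver it instead races with reuse of the
	                          * request for a new I/O */
	printf("not reached\n");
	return 0;
}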


I saw this upstream commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=9961c9bbf2b43acaaf030a0fbabc9954d937ad8c

which addresses the case where the normal completion occurs before the 
abort completion. But the situation I am seeing appears to be that the 
abort completion occurs before the normal completion (due to tasks 
getting delayed in blk_complete_request()). I don't find any commit to 
fix this second case.


Of course, tasks being delayed like this is a concern, and is being 
worked separately. But it seems that the alternate double-completion 
case is being ignored here.


Does everyone concur that this second case needs to be addressed? Is 
there a proposed fix?


Thanks,

Doug

FYI, system is a Power9 running RHEL-ALT 7.5, two SAS3008 adapters 
connected to an IBM EXP24SX SAS Storage Enclosure with 24 
HUSMM8040ASS201 drives. FIO was being used to drive the I/O load.





[PATCH 1/1] qla2xxx: Fix oops in qla2x00_probe_one error path

2017-10-20 Thread Douglas Miller
On error, kthread_create() returns an errno-encoded pointer, not NULL.
The routine qla2x00_probe_one() detects the error case and jumps
to probe_failed, but has already assigned the return value from
kthread_create() to ha->dpc_thread.  Then probe_failed checks to see
if ha->dpc_thread is not NULL before doing cleanup on it. Since in the
error case this is also not NULL, it ends up trying to access an invalid
task pointer.

The solution is to assign NULL to ha->dpc_thread in the error path so that
the kthread cleanup is skipped in that case.

Signed-off-by: Douglas Miller <dougm...@linux.vnet.ibm.com>
---
 drivers/scsi/qla2xxx/qla_os.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 9372098..bd39bf2 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3212,6 +3212,7 @@ static void qla2x00_iocb_work_fn(struct work_struct *work)
ql_log(ql_log_fatal, base_vha, 0x00ed,
"Failed to start DPC thread.\n");
ret = PTR_ERR(ha->dpc_thread);
+   ha->dpc_thread = NULL;
goto probe_failed;
}
ql_dbg(ql_dbg_init, base_vha, 0x00ee,
-- 
1.7.1
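
The pattern being fixed can be shown with a small standalone program.
IS_ERR()/PTR_ERR() are re-implemented here only so the example builds outside
the kernel (in the driver they come from <linux/err.h>), and the adapter
structure below is a stand-in for the real one:

#include <errno.h>
#include <stdio.h>

#define MAX_ERRNO 4095
static void *ERR_PTR(long error) { return (void *)error; }
static long PTR_ERR(const void *ptr) { return (long)ptr; }
static int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct adapter { void *dpc_thread; };

static void *fake_kthread_create(int fail)
{
	static int dummy_task;
	return fail ? ERR_PTR(-ENOMEM) : (void *)&dummy_task;
}

static int probe_one(struct adapter *ha, int fail)
{
	int ret = 0;

	ha->dpc_thread = fake_kthread_create(fail);
	if (IS_ERR(ha->dpc_thread)) {
		ret = PTR_ERR(ha->dpc_thread);
		ha->dpc_thread = NULL;  /* the fix: cleanup must not see an
		                         * errno-encoded, non-NULL pointer */
		goto probe_failed;
	}
	return 0;

probe_failed:
	if (ha->dpc_thread)     /* without the fix this branch would try to
	                         * stop an invalid task pointer */
		printf("stopping dpc thread\n");
	return ret;
}

int main(void)
{
	struct adapter ha;

	printf("probe returned %d\n", probe_one(&ha, 1));
	return 0;
}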



[PATCH 0/1] qla2xxx: Fix oops in qla2x00_probe_one error path

2017-10-20 Thread Douglas Miller
See [PATCH 1/1] qla2xxx: Fix oops in qla2x00_probe_one error path



Re: [PATCH] ses: do not add a device to an enclosure if enclosure_add_links() fails.

2017-07-10 Thread Douglas Miller

On 06/27/2017 07:50 AM, Douglas Miller wrote:

On 06/27/2017 04:53 AM, Maurizio Lombardi wrote:

The enclosure_add_device() function should fail if it can't
create the relevant sysfs links.

Signed-off-by: Maurizio Lombardi <mlomb...@redhat.com>
---
  drivers/misc/enclosure.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index d3fe3ea..eb29113 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 			      struct device *dev)
 {
 	struct enclosure_component *cdev;
+	int err;
 
 	if (!edev || component >= edev->components)
 		return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 	if (cdev->dev == dev)
 		return -EEXIST;
 
-	if (cdev->dev)
+	if (cdev->dev) {
 		enclosure_remove_links(cdev);
-
-	put_device(cdev->dev);
+		put_device(cdev->dev);
+	}
 	cdev->dev = get_device(dev);
-	return enclosure_add_links(cdev);
+	err = enclosure_add_links(cdev);
+	if (err) {
+		put_device(cdev->dev);
+		cdev->dev = NULL;
+	}
+	return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);


Tested-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

This fixes a problem where udevd (insmod ses) races with/overtakes 
do_scan_async(), which creates the directory target of the symlink, 
resulting in missing enclosure symlinks. This patch relaxes the 
symlink creation allowing for delayed addition to enclosure and 
creation of symlinks after do_scan_async() has created the target 
directory.
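
As a standalone sketch of the behaviour described above (not the kernel code):
with the fix, a failed symlink creation is reported to the caller instead of
the device being recorded, so a later enclosure_add_device() call for the same
device can succeed once do_scan_async() has created the target directory.

#include <stdbool.h>
#include <stdio.h>

static bool target_dir_exists;   /* created later by the scanning thread */
static bool device_recorded;     /* device already attached with links   */

/* Model of enclosure_add_device() after the fix: the device is only
 * recorded when the links could actually be created. */
static int add_device(void)
{
	if (device_recorded)
		return -17;              /* -EEXIST */
	if (!target_dir_exists)
		return -2;               /* link creation failed; retry later */
	device_recorded = true;
	return 0;
}

int main(void)
{
	printf("first add:  %d\n", add_device()); /* udevd got ahead: fails   */
	target_dir_exists = true;                 /* do_scan_async catches up */
	printf("second add: %d\n", add_device()); /* retry now succeeds (0)   */
	return 0;
}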



Has there been any progress with getting this patch accepted?



Re: [PATCH] ses: do not add a device to an enclosure if enclosure_add_links() fails.

2017-06-27 Thread Douglas Miller

On 06/27/2017 04:53 AM, Maurizio Lombardi wrote:

The enclosure_add_device() function should fail if it can't
create the relevant sysfs links.

Signed-off-by: Maurizio Lombardi <mlomb...@redhat.com>
---
  drivers/misc/enclosure.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index d3fe3ea..eb29113 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int 
component,
 struct device *dev)
  {
struct enclosure_component *cdev;
+   int err;

if (!edev || component >= edev->components)
return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device *edev, 
int component,
if (cdev->dev == dev)
return -EEXIST;

-   if (cdev->dev)
+   if (cdev->dev) {
enclosure_remove_links(cdev);
-
-   put_device(cdev->dev);
+   put_device(cdev->dev);
+   }
cdev->dev = get_device(dev);
-   return enclosure_add_links(cdev);
+   err = enclosure_add_links(cdev);
+   if (err) {
+   put_device(cdev->dev);
+   cdev->dev = NULL;
+   }
+   return err;
  }
  EXPORT_SYMBOL_GPL(enclosure_add_device);


Tested-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

This fixes a problem where udevd (insmod ses) races with/overtakes 
do_scan_async(), which creates the directory target of the symlink, 
resulting in missing enclosure symlinks. This patch relaxes the symlink 
creation allowing for delayed addition to enclosure and creation of 
symlinks after do_scan_async() has created the target directory.




Re: enclosure: fix sysfs symlinks creation when using multipath

2017-06-26 Thread Douglas Miller

On 06/20/2017 06:38 AM, Maurizio Lombardi wrote:


On 16.6.2017 at 18:08, Douglas Miller wrote:

Just to respond to James' question on the cause. What I observed was a race 
condition between udevd (ses_init()) and a worker thread (do_scan_async()),
where the worker thread is creating the directories that are the target of the 
symlinks being created by udevd.
Something was happening when udevd caught up with the worker thread (so the 
target directory did not exist) and it seemed the worker thread either got 
preempted or
else just could not stay ahead of udevd. This means that udevd started failing 
to create symlinks even though the worker thread eventually got them all 
created.
I did observe what appeared to be preemption, as the creation of directories 
stopped until udevd finished failing all the (rest of the) symlinks.
Although there may have been other explanations for what I saw.

I am able to pass my testing with this patch. I don't see an official submit of 
this patch, but will respond to it when I see one.

Thanks, Douglas, for testing it. I will resubmit the patch if no one has any 
objections.

Maurizio.

I did not see any additional comments, and no objections. Is it time to 
submit the new patch?


Thanks,
Doug



Re: enclosure: fix sysfs symlinks creation when using multipath

2017-06-16 Thread Douglas Miller

On 06/16/2017 10:41 AM, Douglas Miller wrote:

On 03/16/2017 01:49 PM, James Bottomley wrote:

On Wed, 2017-03-15 at 19:39 -0400, Martin K. Petersen wrote:

Maurizio Lombardi <mlomb...@redhat.com> writes:


With multipath, it may happen that the same device is passed to
enclosure_add_device() multiple times and that the
enclosure_add_links() function fails to create the symlinks because
the device's sysfs directory entry is still NULL.  In this case,
the
links will never be created because all the subsequent calls to
enclosure_add_device() will immediately fail with EEXIST.

James?

Well I don't think the patch is the correct way to do this.  The
problem is that if we encounter an error creating the links, we
shouldn't add the device to the enclosure.  There's no need of a
links_created variable (see below).

However, more interesting is why the link creation failed in the first
place.  The device clearly seems to exist because it was added to sysfs
at time index 19.2 and the enclosure didn't try to use it until 60.0.
  Can you debug this a bit more, please?  I can't see anything specific
to multipath in the trace, so whatever this is looks like it could
happen in the single path case as well.

James

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..ae89082 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 			      struct device *dev)
 {
 	struct enclosure_component *cdev;
+	int err;
 
 	if (!edev || component >= edev->components)
 		return -EINVAL;
@@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 	if (cdev->dev == dev)
 		return -EEXIST;
 
-	if (cdev->dev)
+	if (cdev->dev) {
 		enclosure_remove_links(cdev);
-
-	put_device(cdev->dev);
-	cdev->dev = get_device(dev);
-	return enclosure_add_links(cdev);
+		put_device(cdev->dev);
+		cdev->dev = NULL;
+	}
+	err = enclosure_add_links(cdev);
+	if (!err)
+		cdev->dev = get_device(dev);
+	return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

After stumbling across the NULL pointer panic, I was able to use 
Maurizio's second patch below:


diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..6ac07ea 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device 
*edev, int component,

 struct device *dev)
 {
struct enclosure_component *cdev;
+   int err;

if (!edev || component >= edev->components)
return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device 
*edev, int component,

if (cdev->dev == dev)
return -EEXIST;

-   if (cdev->dev)
+   if (cdev->dev) {
enclosure_remove_links(cdev);
-
-   put_device(cdev->dev);
+   put_device(cdev->dev);
+   }
cdev->dev = get_device(dev);
-   return enclosure_add_links(cdev);
+   err = enclosure_add_links(cdev);
+   if (err) {
+   cdev->dev = NULL;
+   put_device(cdev->dev);
+   }
+   return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);


I am able to pass my testing with this patch. I don't see an official 
submit of this patch, but will respond to it when I see one. Again, I 
am seeing the problem even without multipath.


Just to respond to James' question on the cause. What I observed was a 
race condition between udevd (ses_init()) and a worker thread 
(do_scan_async()), where the worker thread is creating the directories 
that are the target of the symlinks being created by udevd. Something 
was happening when udevd caught up with the worker thread (so the target 
directory did not exist) and it seemed the worker thread either got 
preempted or else just could not stay ahead of udevd. This means that 
udevd started failing to create symlinks even though the worker thread 
eventually got them all created. I did observe what appeared to be 
preemption, as the creation of directories stopped until udevd finished 
failing all the (rest of the) symlinks. Although there may have been 
other explanations for what I saw.




Re: enclosure: fix sysfs symlinks creation when using multipath

2017-06-16 Thread Douglas Miller

On 03/16/2017 01:49 PM, James Bottomley wrote:

On Wed, 2017-03-15 at 19:39 -0400, Martin K. Petersen wrote:

Maurizio Lombardi  writes:


With multipath, it may happen that the same device is passed to
enclosure_add_device() multiple times and that the
enclosure_add_links() function fails to create the symlinks because
the device's sysfs directory entry is still NULL.  In this case,
the
links will never be created because all the subsequent calls to
enclosure_add_device() will immediately fail with EEXIST.

James?

Well I don't think the patch is the correct way to do this.  The
problem is that if we encounter an error creating the links, we
shouldn't add the device to the enclosure.  There's no need of a
links_created variable (see below).

However, more interesting is why the link creation failed in the first
place.  The device clearly seems to exist because it was added to sysfs
at time index 19.2 and the enclosure didn't try to use it until 60.0.
  Can you debug this a bit more, please?  I can't see anything specific
to multipath in the trace, so whatever this is looks like it could
happen in the single path case as well.

James

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..ae89082 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 			      struct device *dev)
 {
 	struct enclosure_component *cdev;
+	int err;
 
 	if (!edev || component >= edev->components)
 		return -EINVAL;
@@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 	if (cdev->dev == dev)
 		return -EEXIST;
 
-	if (cdev->dev)
+	if (cdev->dev) {
 		enclosure_remove_links(cdev);
-
-	put_device(cdev->dev);
-	cdev->dev = get_device(dev);
-	return enclosure_add_links(cdev);
+		put_device(cdev->dev);
+		cdev->dev = NULL;
+	}
+	err = enclosure_add_links(cdev);
+	if (!err)
+		cdev->dev = get_device(dev);
+	return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);
  
After stumbling across the NULL pointer panic, I was able to use 
Maurizio's second patch below:


diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..6ac07ea 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device 
*edev, int component,

 struct device *dev)
 {
struct enclosure_component *cdev;
+   int err;

if (!edev || component >= edev->components)
return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device 
*edev, int component,

if (cdev->dev == dev)
return -EEXIST;

-   if (cdev->dev)
+   if (cdev->dev) {
enclosure_remove_links(cdev);
-
-   put_device(cdev->dev);
+   put_device(cdev->dev);
+   }
cdev->dev = get_device(dev);
-   return enclosure_add_links(cdev);
+   err = enclosure_add_links(cdev);
+   if (err) {
+   cdev->dev = NULL;
+   put_device(cdev->dev);
+   }
+   return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);


I am able to pass my testing with this patch. I don't see an official 
submit of this patch, but will respond to it when I see one. Again, I am 
seeing the problem even without multipath.




Re: [RFC] enclosure: fix sysfs symlinks creation when using multipath

2017-06-16 Thread Douglas Miller

On 06/16/2017 07:48 AM, Maurizio Lombardi wrote:


On 16.6.2017 at 14:40, Douglas Miller wrote:

I'd like to add that we are seeing this problem with singlepath installations 
and need to get this fixed upstream as soon as possible. The new RHEL product 
contains this fix and it is working for us, but we need to be able to offer other 
distros as well. I am currently running this patch on a custom-built Ubuntu 
16.04.2 kernel and it is fixing the problem there.

What needs to be done to get this patch accepted?


Note that James proposed a different patch to fix this bug.

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..ae89082 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int 
component,
 struct device *dev)
  {
struct enclosure_component *cdev;
+   int err;

if (!edev || component >= edev->components)
return -EINVAL;
@@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, 
int component,
if (cdev->dev == dev)
return -EEXIST;

-   if (cdev->dev)
+   if (cdev->dev) {
enclosure_remove_links(cdev);
-
-   put_device(cdev->dev);
-   cdev->dev = get_device(dev);
-   return enclosure_add_links(cdev);
+   put_device(cdev->dev);
+   cdev->dev = NULL;
+   }
+   err = enclosure_add_links(cdev);
+   if (!err)
+   cdev->dev = get_device(dev);
+   return err;
  }
  EXPORT_SYMBOL_GPL(enclosure_add_device);



I will test this out. Thanks.



Re: [RFC] enclosure: fix sysfs symlinks creation when using multipath

2017-06-16 Thread Douglas Miller

On 02/07/2017 08:08 AM, Maurizio Lombardi wrote:

With multipath, it may happen that the same device is passed
to enclosure_add_device() multiple times and that the enclosure_add_links()
function fails to create the symlinks because the device's sysfs
directory entry is still NULL.
In this case, the links will never be created because all the subsequent
calls to enclosure_add_device() will immediately fail with EEXIST.

This patch modifies the code so the driver will detect this condition
and will retry to create the symlinks when enclosure_add_device() is called.

Signed-off-by: Maurizio Lombardi <mlomb...@redhat.com>
---
  drivers/misc/enclosure.c  | 16 ++--
  include/linux/enclosure.h |  1 +
  2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..a856c98 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,21 +375,33 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
 			      struct device *dev)
 {
 	struct enclosure_component *cdev;
+	int error;
 
 	if (!edev || component >= edev->components)
 		return -EINVAL;
 
 	cdev = &edev->component[component];
 
-	if (cdev->dev == dev)
+	if (cdev->dev == dev) {
+		if (!cdev->links_created) {
+			error = enclosure_add_links(cdev);
+			if (!error)
+				cdev->links_created = 1;
+		}
 		return -EEXIST;
+	}
 
 	if (cdev->dev)
 		enclosure_remove_links(cdev);
 
 	put_device(cdev->dev);
 	cdev->dev = get_device(dev);
-	return enclosure_add_links(cdev);
+	error = enclosure_add_links(cdev);
+	if (!error)
+		cdev->links_created = 1;
+	else
+		cdev->links_created = 0;
+	return error;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);
 
diff --git a/include/linux/enclosure.h b/include/linux/enclosure.h
index a4cf57c..c3bdc4c 100644
--- a/include/linux/enclosure.h
+++ b/include/linux/enclosure.h
@@ -97,6 +97,7 @@ struct enclosure_component {
 	struct device cdev;
 	struct device *dev;
 	enum enclosure_component_type type;
+	int links_created;
 	int number;
 	int fault;
 	int active;


Tested-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

I'd like to add that we are seeing this problem with singlepath 
installations and need to get this fixed upstream as soon as possible. 
The new RHEL product contains this fix and it is working for us, but we need to 
be able to offer other distros as well. I am currently running this 
patch on a custom-built Ubuntu 16.04.2 kernel and it is fixing the 
problem there.


What needs to be done to get this patch accepted?

Thanks,
Doug



Re: [PATCH] block: Fix kernel panic occurs while creating second raid disk

2017-01-24 Thread Douglas Miller

On 11/03/2016 12:15 AM, Sreekanth Reddy wrote:

On Tue, Nov 1, 2016 at 11:52 PM, Douglas Miller
<dougm...@linux.vnet.ibm.com> wrote:

On 10/24/2016 01:54 PM, Sreekanth Reddy wrote:

Observing below kernel panic while creating second raid disk
on LSI SAS3008 HBA card.

[  +0.55] [ cut here ]
[  +0.07] WARNING: CPU: 2 PID: 281 at fs/sysfs/dir.c:31
sysfs_warn_dup+0x62/0x80
[  +0.02] sysfs: cannot create duplicate filename
'/devices/virtual/bdi/8:32'
[  +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge
stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl
sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt
ipmi_ssif mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich
mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd
grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore raid_class
nvme_core scsi_transport_sas dca
[  +0.67] CPU: 2 PID: 281 Comm: kworker/u49:5 Not tainted 4.9.0-rc2 #1
[  +0.02] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS
1.1 07/22/2015
[  +0.05] Workqueue: events_unbound async_run_entry_fn
[  +0.04] Call Trace:
[  +0.09]  [] dump_stack+0x63/0x85
[  +0.05]  [] __warn+0xcb/0xf0
[  +0.04]  [] warn_slowpath_fmt+0x5f/0x80
[  +0.06]  [] ? kernfs_path_from_node+0x4f/0x60
[  +0.02]  [] sysfs_warn_dup+0x62/0x80
[  +0.02]  [] sysfs_create_dir_ns+0x77/0x90
[  +0.04]  [] kobject_add_internal+0x99/0x330
[  +0.03]  [] ? vsnprintf+0x35b/0x4c0
[  +0.03]  [] kobject_add+0x75/0xd0
[  +0.06]  [] ? device_private_init+0x23/0x70
[  +0.07]  [] ? mutex_lock+0x12/0x30
[  +0.03]  [] device_add+0x119/0x670
[  +0.04]  [] device_create_groups_vargs+0xe0/0xf0
[  +0.03]  [] device_create_vargs+0x1c/0x20
[  +0.06]  [] bdi_register+0x8c/0x180
[  +0.03]  [] bdi_register_owner+0x36/0x60
[  +0.06]  [] device_add_disk+0x168/0x480
[  +0.05]  [] ? update_autosuspend+0x51/0x60
[  +0.05]  [] sd_probe_async+0x110/0x1c0
[  +0.02]  [] async_run_entry_fn+0x39/0x140
[  +0.03]  [] process_one_work+0x15f/0x430
[  +0.02]  [] worker_thread+0x4e/0x490
[  +0.02]  [] ? process_one_work+0x430/0x430
[  +0.03]  [] kthread+0xd9/0xf0
[  +0.03]  [] ? kthread_park+0x60/0x60
[  +0.03]  [] ret_from_fork+0x25/0x30
[  +0.02] [ cut here ]
[  +0.04] WARNING: CPU: 2 PID: 281 at lib/kobject.c:240
kobject_add_internal+0x2bd/0x330
[  +0.01] kobject_add_internal failed for 8:32 with -EEXIST, don't try
to register things with the same name in the same
[  +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge
stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl
sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt
ipmi_ssif mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich
mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd
grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore raid_class
nvme_core scsi_transport_sas dca
[  +0.43] CPU: 2 PID: 281 Comm: kworker/u49:5 Tainted: GW
4.9.0-rc2 #1
[  +0.01] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS
1.1 07/22/2015
[  +0.02] Workqueue: events_unbound async_run_entry_fn
[  +0.03] Call Trace:
[  +0.03]  [] dump_stack+0x63/0x85
[  +0.03]  [] __warn+0xcb/0xf0
[  +0.04]  [] warn_slowpath_fmt+0x5f/0x80
[  +0.02]  [] ? sysfs_warn_dup+0x6a/0x80
[  +0.03]  [] kobject_add_internal+0x2bd/0x330
[  +0.03]  [] ? vsnprintf+0x35b/0x4c0
[  +0.03]  [] kobject_add+0x75/0xd0
[  +0.03]  [] ? device_private_init+0x23/0x70
[  +0.04]  [] ? mutex_lock+0x12/0x30
[  +0.02]  [] device_add+0x119/0x670
[  +0.04]  [] device_create_groups_vargs+0xe0/0xf0
[  +0.03]  [] device_create_vargs+0x1c/0x20
[  +0.03]  [] bdi_register+0x8c/0x180
[  +0.03]  [] bdi_register_owner+0x36/0x60
[  +0.04]  [] device_add_disk+0x168/0x480
[  +0.03]  [] ? update_autosuspend+0x51/0x60
[  +0.02]  [] sd_probe_async+0x110/0x1c0
[  +0.02]  [] async_run_entry_fn+0x39/0x140
[  +0.02]  [] process_one_work+0x15f/0x430
[  +0.02]  [] worker_thread+0x4e/0x490
[  +0.02]  [] ? process_one_work+0x430/0x430
[  +0.03]  [] kthread+0xd9/0xf0
[  +0.03]  [] ? kthread_park+0x60/0x60
[  +0.03]  [] ret_from_fork+0x25/0x30
[  +0.000949] BUG: unable to handle kernel
[  +0.005263] NULL pointer dereference
[  +0.002853] IP: []
sysfs_do_create_link_sd.isra.2+0x34/0xb0
[  +0.008584] PGD 0

[  +0.006115] Oops:  [#1] SMP
[  +0.004531] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge
stp llc ebtable_filter ebtables ip6table_filter ip6_

Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-12-07 Thread Douglas Miller

On 12/07/2016 02:06 PM, Douglas Miller wrote:

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep its tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization altogether, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
 pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
 lr: c0575328: __bt_get+0x48/0xd0
 sp: c00fe99eb390
msr: 90010280b033
dar: 28
  dsisr: 4000
   current = 0xc00fe9966800
   paca= 0xc7e80300   softe: 0irq_happened: 0x01
 pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 
2016

1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 
__filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 
filemap_write_and_wait_range+0x70/0xf0

[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
  block/blk-mq.c | 21 +++--
  1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fb94bd69375..6718f894fbe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1870,7 +1870,7 @@ static void blk_mq_init_cpu_queues(struct 
request_queue *q,

  static void blk_mq_map_swqueue(struct request_queue *q,
 const struct cpumask *online_mask)
  {
-unsigned int i;
+unsigned int i, hctx_idx;
  struct blk_mq_hw_ctx *hctx;
  struct blk_mq_ctx *ctx;
  struct blk_mq_tag_set *set = q->tag_set;
@@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct 
request_queue *q,

  if (!cpumask_test_cpu(i, online_mask))
  continue;

+hctx_idx = q->mq_map[i];
+/* unmapped hw queue can be remapped after CPU topo changed */
+if (!set->tags[hctx_idx]) {
+set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+if (!set->tags[hctx_idx])
+q->mq_map[i] = 0;
+}
+
  ctx = per_cpu_ptr(q->queue_ctx, i);
  hctx = blk_mq_map_queue(q, i);

@@ -1909,7 +1918,10 @@ static void blk_mq_map_swqueue(struct 
request_queue *q,

   * disable it and free the request entries.
   */
  if (!hctx->nr_ctx) {
-if (set->tags[i]) {
+/* Never unmap queue 0.  We need it as a
+ * fallback in case of a new remap fails
+ * allocation. */
+if (i && set->tags[i]) {
  blk_mq_free_rq_map(set, set->tags[i], i);
  set->tags[i] = NULL;
  }
@@ -1917,1
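
The fallback in the patch above (remap software contexts to hctx 0 when tag
allocation for their hardware queue fails) can be modelled with a small
standalone program. This is not the real blk-mq code, only an illustration of
the mapping logic:

#include <stdbool.h>
#include <stdio.h>

#define NR_CTX  4
#define NR_HCTX 3

static int  mq_map[NR_CTX] = { 0, 1, 2, 2 };            /* ctx -> hw queue   */
static bool tags[NR_HCTX]  = { true, false, false };    /* queue 0 kept tags */

static bool alloc_tags(int hctx)
{
	/* pretend the allocation for queue 2 fails (e.g. memory pressure) */
	return hctx != 2;
}

int main(void)
{
	for (int i = 0; i < NR_CTX; i++) {
		int h = mq_map[i];

		if (!tags[h])
			tags[h] = alloc_tags(h);
		if (!tags[h])
			mq_map[i] = 0;  /* fallback: queue 0 always has tags */

		printf("ctx %d -> hctx %d\n", i, mq_map[i]);
	}
	return 0;
}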

Re: [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues

2016-12-07 Thread Douglas Miller

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

While stressing memory and IO at the same time as changing SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remapping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
issue anymore.

This should apply on top of Jens' for-next branch cleanly.

Changes since v1:
   - Use GFP_NOIO instead of GFP_NOWAIT.

  Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
  block/blk-mq.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6718f894fbe1..5f4e452eef72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1605,7 +1605,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
INIT_LIST_HEAD(>page_list);

tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 set->numa_node);
if (!tags->rqs) {
blk_mq_free_tags(tags);
@@ -1631,7 +1631,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,

do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
+   GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1652,7 +1652,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 * Allow kmemleak to scan these pages as they contain pointers
 * to additional allocations like via ops->init_request().
 */
-   kme

Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues

2016-12-07 Thread Douglas Miller

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep its tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization altogether, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
 pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
 lr: c0575328: __bt_get+0x48/0xd0
 sp: c00fe99eb390
msr: 90010280b033
dar: 28
  dsisr: 4000
   current = 0xc00fe9966800
   paca= 0xc7e80300   softe: 0irq_happened: 0x01
 pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 __filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 filemap_write_and_wait_range+0x70/0xf0
[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
  block/blk-mq.c | 21 +++--
  1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fb94bd69375..6718f894fbe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1870,7 +1870,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
  static void blk_mq_map_swqueue(struct request_queue *q,
   const struct cpumask *online_mask)
  {
-   unsigned int i;
+   unsigned int i, hctx_idx;
struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
@@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
if (!cpumask_test_cpu(i, online_mask))
continue;

+   hctx_idx = q->mq_map[i];
+   /* unmapped hw queue can be remapped after CPU topo changed */
+   if (!set->tags[hctx_idx]) {
+   set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+   if (!set->tags[hctx_idx])
+   q->mq_map[i] = 0;
+   }
+
ctx = per_cpu_ptr(q->queue_ctx, i);
hctx = blk_mq_map_queue(q, i);

@@ -1909,7 +1918,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 * disable it and free the request entries.
 */
if (!hctx->nr_ctx) {
-   if (set->tags[i]) {
+   /* Never unmap queue 0.  We need it as a
+* fallback in case of a new remap fails
+* allocation. */
+   if (i && set->tags[i])

Re: [PATCH] block: Fix kernel panic occurs while creating second raid disk

2016-11-01 Thread Douglas Miller

On 10/24/2016 01:54 PM, Sreekanth Reddy wrote:

Observing below kernel panic while creating second raid disk
on LSI SAS3008 HBA card.

[  +0.55] [ cut here ]
[  +0.07] WARNING: CPU: 2 PID: 281 at fs/sysfs/dir.c:31 
sysfs_warn_dup+0x62/0x80
[  +0.02] sysfs: cannot create duplicate filename 
'/devices/virtual/bdi/8:32'
[  +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_conntrack tun bridge stp llc ebtable_filter ebtables 
ip6table_filter ip6_tables intel_rapl sb_edac edac_core 
x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif 
mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich 
mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl 
lockd grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore 
raid_class nvme_core scsi_transport_sas dca
[  +0.67] CPU: 2 PID: 281 Comm: kworker/u49:5 Not tainted 
4.9.0-rc2 #1
[  +0.02] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, 
BIOS 1.1 07/22/2015

[  +0.05] Workqueue: events_unbound async_run_entry_fn
[  +0.04] Call Trace:
[  +0.09]  [] dump_stack+0x63/0x85
[  +0.05]  [] __warn+0xcb/0xf0
[  +0.04]  [] warn_slowpath_fmt+0x5f/0x80
[  +0.06]  [] ? kernfs_path_from_node+0x4f/0x60
[  +0.02]  [] sysfs_warn_dup+0x62/0x80
[  +0.02]  [] sysfs_create_dir_ns+0x77/0x90
[  +0.04]  [] kobject_add_internal+0x99/0x330
[  +0.03]  [] ? vsnprintf+0x35b/0x4c0
[  +0.03]  [] kobject_add+0x75/0xd0
[  +0.06]  [] ? device_private_init+0x23/0x70
[  +0.07]  [] ? mutex_lock+0x12/0x30
[  +0.03]  [] device_add+0x119/0x670
[  +0.04]  [] device_create_groups_vargs+0xe0/0xf0
[  +0.03]  [] device_create_vargs+0x1c/0x20
[  +0.06]  [] bdi_register+0x8c/0x180
[  +0.03]  [] bdi_register_owner+0x36/0x60
[  +0.06]  [] device_add_disk+0x168/0x480
[  +0.05]  [] ? update_autosuspend+0x51/0x60
[  +0.05]  [] sd_probe_async+0x110/0x1c0
[  +0.02]  [] async_run_entry_fn+0x39/0x140
[  +0.03]  [] process_one_work+0x15f/0x430
[  +0.02]  [] worker_thread+0x4e/0x490
[  +0.02]  [] ? process_one_work+0x430/0x430
[  +0.03]  [] kthread+0xd9/0xf0
[  +0.03]  [] ? kthread_park+0x60/0x60
[  +0.03]  [] ret_from_fork+0x25/0x30
[  +0.02] [ cut here ]
[  +0.04] WARNING: CPU: 2 PID: 281 at lib/kobject.c:240 
kobject_add_internal+0x2bd/0x330
[  +0.01] kobject_add_internal failed for 8:32 with -EEXIST, don't 
try to register things with the same name in the same
[  +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_conntrack tun bridge stp llc ebtable_filter ebtables 
ip6table_filter ip6_tables intel_rapl sb_edac edac_core 
x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif 
mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich 
mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl 
lockd grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore 
raid_class nvme_core scsi_transport_sas dca
[  +0.43] CPU: 2 PID: 281 Comm: kworker/u49:5 Tainted: G
W   4.9.0-rc2 #1
[  +0.01] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, 
BIOS 1.1 07/22/2015

[  +0.02] Workqueue: events_unbound async_run_entry_fn
[  +0.03] Call Trace:
[  +0.03]  [] dump_stack+0x63/0x85
[  +0.03]  [] __warn+0xcb/0xf0
[  +0.04]  [] warn_slowpath_fmt+0x5f/0x80
[  +0.02]  [] ? sysfs_warn_dup+0x6a/0x80
[  +0.03]  [] kobject_add_internal+0x2bd/0x330
[  +0.03]  [] ? vsnprintf+0x35b/0x4c0
[  +0.03]  [] kobject_add+0x75/0xd0
[  +0.03]  [] ? device_private_init+0x23/0x70
[  +0.04]  [] ? mutex_lock+0x12/0x30
[  +0.02]  [] device_add+0x119/0x670
[  +0.04]  [] device_create_groups_vargs+0xe0/0xf0
[  +0.03]  [] device_create_vargs+0x1c/0x20
[  +0.03]  [] bdi_register+0x8c/0x180
[  +0.03]  [] bdi_register_owner+0x36/0x60
[  +0.04]  [] device_add_disk+0x168/0x480
[  +0.03]  [] ? update_autosuspend+0x51/0x60
[  +0.02]  [] sd_probe_async+0x110/0x1c0
[  +0.02]  [] async_run_entry_fn+0x39/0x140
[  +0.02]  [] process_one_work+0x15f/0x430
[  +0.02]  [] worker_thread+0x4e/0x490
[  +0.02]  [] ? process_one_work+0x430/0x430
[  +0.03]  [] kthread+0xd9/0xf0
[  +0.03]  [] ? kthread_park+0x60/0x60
[  +0.03]  [] ret_from_fork+0x25/0x30
[  +0.000949] BUG: unable to handle kernel
[  +0.005263] NULL pointer dereference
[  +0.002853] IP: [] 
sysfs_do_create_link_sd.isra.2+0x34/0xb0

[  +0.008584] PGD 0

[  +0.006115] Oops:  [#1] SMP
[  +0.004531] Modules linked in: mptctl mptbase xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_conntrack tun bridge stp llc ebtable_filter ebtables 
ip6table_filter ip6_tables intel_rapl sb_edac edac_core 
x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif 

Issues with LSI-3008 adapters, mpt3sas driver

2016-08-31 Thread Douglas Miller

Hi all,

I am seeing an issue while using an LSI-3008-based adapter (mpt3sas 
driver) on a PowerPC system (although I am not yet convinced it is 
architecture dependent). When I create a RAID1 volume, the physical disk 
devices get "hidden" as expected however the various kernel objects are 
out of sync. The corresponding bits in the "sd_index_ida" bitmap gets 
cleared, and the symlink in /sys/dev/block for this major:minor pair 
gets removed, but none of the other major:minor entries in sysfs get 
removed. The next time a new device is added (for example, during 
another RAID volume create or delete), the recently-freed major:minor 
number is picked up from the "sd_index_ida" bitmap but the attempt to 
create sysfs entries fails EEXIST due to an entry by the same name 
already (still) existing. This failure goes unhandled and later the 
kernel panics in sd_probe_async while dereferencing an (apparently) 
invalid backing_dev_info structure (presumably left invalid due to the 
EEXIST error).


A reboot clears this (bitmaps and sysfs) up and the second RAID volume 
(if a create was done) shows up normally. However, even if the panic 
were avoided by better error handling in sd_probe_async there would 
still be the problem of being able to create more than one RAID volume 
without rebooting.


I am wondering if this issue has been seen elsewhere, and also just what 
might be going wrong. For mpt3sas, it appears that the firmware largely 
drives the hiding/exposing of devices but I don't see an issue with the 
ordering of those events. I am wondering if the driver is failing to 
setup the device attributes correctly in order to get the proper sysfs 
handling.


I am seeing this on Ubuntu 16.04, but also see it on the upstream 
kernel. Oddly, it does not happen on RHEL 7.2 (an older kernel).


A possibly-related issue we see is that when a RAID volume is deleted, 
none of the RAID device nodes (/dev as well as /sys/) get removed - 
although they are unusable. Deleting before creating does not produce 
the panic, so I believe the "sd_index_ida" bitmap is not getting updated 
by the delete.



Any help would be appreciated.

Thanks,

Doug
