Re: mpt3sas heavy I/O load causes kernel BUG at block/blk-core.c:2695
Thanks, Suganath. That commit was introduced with version 12.100.00.00, and the distro version we're running is 15.100.00.01 (RHEL-ALT 7.5), which appears to include this fix - although the code is not identical, probably due to the effects of backporting patches. This driver also does not include commit 9961c9bbf2b43acaaf030a0fbabc9954d937ad8c, which was added much later (on top of driver 17.100.00.00).

So, I guess I am still looking for a companion (opposite-scenario) patch to 9961c9bbf2b43acaaf030a0fbabc9954d937ad8c. Do you have any reason to believe that both situations (normal completion before abort, and abort before normal completion) do not need to be handled?

Thanks,
Doug

On 06/07/2018 01:24 AM, Suganath Prabu Subramani wrote:

Hi Douglas,

Can you check whether this patch is already part of the driver; if not, please try with the patch below. This patch fixes the case of the abort completing before the I/O completion. With this, the driver will process the I/O's reply first, followed by the TM.

commit 03d1fb3a65783979f23bd58b5a0387e6992d9e26
Author:    Suganath prabu Subramani  2016-01-28 12:07:06 +0530
Committer: Martin K. Petersen        2016-02-23 21:27:02 -0500

    mpt3sas: Fix for Asynchronous completion of timedout IO and task abort of timedout IO.

    Track msix of each IO and use the same msix for issuing abort to timed out IO. With this, the driver will process the IO's reply first, followed by the TM.

    Signed-off-by: Suganath prabu Subramani
    Signed-off-by: Chaitra P B
    Reviewed-by: Tomas Henzl
    Signed-off-by: Martin K. Petersen

Thanks,
Suganath Prabu S

On Wed, Jun 6, 2018 at 7:50 PM, Douglas Miller wrote:

Running a heavy I/O load on multipath/dual-ported SSD disks attached to a SAS3008 adapter (mpt3sas driver), we are seeing I/Os get aborted and tasks stuck in blk_complete_request(), and this sometimes results in hitting a BUG_ON in blk_start_request(). It would appear that we are seeing two completions performed on an I/O, and the second completion is racing with re-use of the request for a new I/O.

I saw this upstream commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=9961c9bbf2b43acaaf030a0fbabc9954d937ad8c
which addresses the case where the normal completion occurs before the abort completion. But the situation I am seeing appears to be that the abort completion occurs before the normal completion (due to tasks getting delayed in blk_complete_request()). I don't find any commit to fix this second case.

Of course, tasks being delayed like this is a concern, and is being worked separately. But it seems that the alternate double-completion case is being ignored here. Does everyone concur that this second case needs to be addressed? Is there a proposed fix?

Thanks,
Doug

FYI, the system is a Power9 running RHEL-ALT 7.5, with two SAS3008 adapters connected to an IBM EXP24SX SAS Storage Enclosure with 24 HUSMM8040ASS201 drives. FIO was being used to drive the I/O load.
mpt3sas heavy I/O load causes kernel BUG at block/blk-core.c:2695
Running a heavy I/O load on multipath/dual-ported SSD disks attached to a SAS3008 adapter (mpt3sas driver), we are seeing I/Os get aborted and tasks stuck in blk_complete_request(), and this sometimes results in hitting a BUG_ON in blk_start_request(). It would appear that we are seeing two completions performed on an I/O, and the second completion is racing with re-use of the request for a new I/O.

I saw this upstream commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.17-rc3&id=9961c9bbf2b43acaaf030a0fbabc9954d937ad8c
which addresses the case where the normal completion occurs before the abort completion. But the situation I am seeing appears to be that the abort completion occurs before the normal completion (due to tasks getting delayed in blk_complete_request()). I don't find any commit to fix this second case.

Of course, tasks being delayed like this is a concern, and is being worked separately. But it seems that the alternate double-completion case is being ignored here. Does everyone concur that this second case needs to be addressed? Is there a proposed fix?

Thanks,
Doug

FYI, the system is a Power9 running RHEL-ALT 7.5, with two SAS3008 adapters connected to an IBM EXP24SX SAS Storage Enclosure with 24 HUSMM8040ASS201 drives. FIO was being used to drive the I/O load.
[PATCH 1/1] qla2xxx: Fix oops in qla2x00_probe_one error path
On error, kthread_create() returns an errno-encoded pointer, not NULL. The routine qla2x00_probe_one() detects the error case and jumps to probe_failed, but has already assigned the return value from kthread_create() to ha->dpc_thread. probe_failed then checks whether ha->dpc_thread is non-NULL before doing cleanup on it. Since in the error case it is also non-NULL, the cleanup ends up accessing an invalid task pointer.

The solution is to assign NULL to ha->dpc_thread in the error path, to avoid kthread cleanup in that case.

Signed-off-by: Douglas Miller <dougm...@linux.vnet.ibm.com>
---
 drivers/scsi/qla2xxx/qla_os.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 9372098..bd39bf2 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3212,6 +3212,7 @@ static void qla2x00_iocb_work_fn(struct work_struct *work)
         ql_log(ql_log_fatal, base_vha, 0x00ed,
             "Failed to start DPC thread.\n");
         ret = PTR_ERR(ha->dpc_thread);
+        ha->dpc_thread = NULL;
         goto probe_failed;
     }
     ql_dbg(ql_dbg_init, base_vha, 0x00ee,
--
1.7.1
[PATCH 0/1] qla2xxx: Fix oops in qla2x00_probe_one error path
See [PATCH 1/1] qla2xxx: Fix oops in qla2x00_probe_one error path
Re: [PATCH] ses: do not add a device to an enclosure if enclosure_add_links() fails.
On 06/27/2017 07:50 AM, Douglas Miller wrote:

On 06/27/2017 04:53 AM, Maurizio Lombardi wrote:

The enclosure_add_device() function should fail if it can't create the relevant sysfs links.

Signed-off-by: Maurizio Lombardi <mlomb...@redhat.com>
---
 drivers/misc/enclosure.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index d3fe3ea..eb29113 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
+        put_device(cdev->dev);
+    }
     cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+    err = enclosure_add_links(cdev);
+    if (err) {
+        put_device(cdev->dev);
+        cdev->dev = NULL;
+    }
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

Tested-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

This fixes a problem where udevd (insmod ses) races with/overtakes do_scan_async(), which creates the directory target of the symlink, resulting in missing enclosure symlinks. This patch relaxes the symlink creation, allowing for delayed addition to the enclosure and creation of the symlinks after do_scan_async() has created the target directory.

Has there been any progress with getting this patch accepted?
Re: [PATCH] ses: do not add a device to an enclosure if enclosure_add_links() fails.
On 06/27/2017 04:53 AM, Maurizio Lombardi wrote:

The enclosure_add_device() function should fail if it can't create the relevant sysfs links.

Signed-off-by: Maurizio Lombardi <mlomb...@redhat.com>
---
 drivers/misc/enclosure.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index d3fe3ea..eb29113 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
+        put_device(cdev->dev);
+    }
     cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+    err = enclosure_add_links(cdev);
+    if (err) {
+        put_device(cdev->dev);
+        cdev->dev = NULL;
+    }
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

Tested-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

This fixes a problem where udevd (insmod ses) races with/overtakes do_scan_async(), which creates the directory target of the symlink, resulting in missing enclosure symlinks. This patch relaxes the symlink creation, allowing for delayed addition to the enclosure and creation of the symlinks after do_scan_async() has created the target directory.
Re: enclosure: fix sysfs symlinks creation when using multipath
On 06/20/2017 06:38 AM, Maurizio Lombardi wrote:

Dne 16.6.2017 v 18:08 Douglas Miller napsal(a):

Just to respond to James' question on the cause. What I observed was a race condition between udevd (ses_init()) and a worker thread (do_scan_async()), where the worker thread is creating the directories that are the target of the symlinks being created by udevd. Something was happening when udevd caught up with the worker thread (so the target directory did not exist), and it seemed the worker thread either got preempted or else just could not stay ahead of udevd. This means that udevd started failing to create symlinks even though the worker thread eventually got them all created. I did observe what appeared to be preemption, as the creation of directories stopped until udevd finished failing all the (rest of the) symlinks. Although there may have been other explanations for what I saw.

I am able to pass my testing with this patch. I don't see an official submission of this patch, but will respond to it when I see one.

Thanks Douglas for testing it, I will resubmit the patch if no one has any objections.

Maurizio.

I did not see any additional comments, and no objections. Is it time to submit the new patch?

Thanks,
Doug
Re: enclosure: fix sysfs symlinks creation when using multipath
On 06/16/2017 10:41 AM, Douglas Miller wrote:

On 03/16/2017 01:49 PM, James Bottomley wrote:

On Wed, 2017-03-15 at 19:39 -0400, Martin K. Petersen wrote:

Maurizio Lombardi <mlomb...@redhat.com> writes:

With multipath, it may happen that the same device is passed to enclosure_add_device() multiple times and that the enclosure_add_links() function fails to create the symlinks because the device's sysfs directory entry is still NULL. In this case, the links will never be created because all the subsequent calls to enclosure_add_device() will immediately fail with EEXIST.

James?

Well, I don't think the patch is the correct way to do this. The problem is that if we encounter an error creating the links, we shouldn't add the device to the enclosure. There's no need for a links_created variable (see below).

However, more interesting is why the link creation failed in the first place. The device clearly seems to exist, because it was added to sysfs at time index 19.2 and the enclosure didn't try to use it until 60.0. Can you debug this a bit more, please? I can't see anything specific to multipath in the trace, so whatever this is looks like it could happen in the single-path case as well.

James

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..ae89082 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
-    cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+        put_device(cdev->dev);
+        cdev->dev = NULL;
+    }
+    err = enclosure_add_links(cdev);
+    if (!err)
+        cdev->dev = get_device(dev);
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

After stumbling across the NULL pointer panic, I was able to use Maurizio's second patch below:

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..6ac07ea 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
+        put_device(cdev->dev);
+    }
     cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+    err = enclosure_add_links(cdev);
+    if (err) {
+        cdev->dev = NULL;
+        put_device(cdev->dev);
+    }
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

I am able to pass my testing with this patch. I don't see an official submission of this patch, but will respond to it when I see one.

Again, I am seeing the problem even without multipath.
Just to respond to James' question on the cause. What I observed was a race condition between udevd (ses_init()) and a worker thread (do_scan_async()), where the worker thread is creating the directories that are the target of the symlinks being created by udevd. Something was happening when udevd caught up with the worker thread (so the target directory did not exist) and it seemed the worker thread either got preempted or else just could not stay ahead of udevd. This means that udevd started failing to create symlinks even though the worker thread eventually got them all created. I did observe what appeared to be preemption, as the creation of directories stopped until udevd finished failing all the (rest of the) symlinks. Although there may have been other explanations for what I saw.
Re: enclosure: fix sysfs symlinks creation when using multipath
On 03/16/2017 01:49 PM, James Bottomley wrote:

On Wed, 2017-03-15 at 19:39 -0400, Martin K. Petersen wrote:

Maurizio Lombardi writes:

With multipath, it may happen that the same device is passed to enclosure_add_device() multiple times and that the enclosure_add_links() function fails to create the symlinks because the device's sysfs directory entry is still NULL. In this case, the links will never be created because all the subsequent calls to enclosure_add_device() will immediately fail with EEXIST.

James?

Well, I don't think the patch is the correct way to do this. The problem is that if we encounter an error creating the links, we shouldn't add the device to the enclosure. There's no need for a links_created variable (see below).

However, more interesting is why the link creation failed in the first place. The device clearly seems to exist, because it was added to sysfs at time index 19.2 and the enclosure didn't try to use it until 60.0. Can you debug this a bit more, please? I can't see anything specific to multipath in the trace, so whatever this is looks like it could happen in the single-path case as well.

James

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..ae89082 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
-    cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+        put_device(cdev->dev);
+        cdev->dev = NULL;
+    }
+    err = enclosure_add_links(cdev);
+    if (!err)
+        cdev->dev = get_device(dev);
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

After stumbling across the NULL pointer panic, I was able to use Maurizio's second patch below:

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..6ac07ea 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,17 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
+        put_device(cdev->dev);
+    }
     cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+    err = enclosure_add_links(cdev);
+    if (err) {
+        cdev->dev = NULL;
+        put_device(cdev->dev);
+    }
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

I am able to pass my testing with this patch. I don't see an official submission of this patch, but will respond to it when I see one.

Again, I am seeing the problem even without multipath.
Re: [RFC] enclosure: fix sysfs symlinks creation when using multipath
On 06/16/2017 07:48 AM, Maurizio Lombardi wrote:

Dne 16.6.2017 v 14:40 Douglas Miller napsal(a):

I'd like to add that we are seeing this problem with singlepath installations and need to get this fixed upstream as soon as possible. The new RHEL product contains this fix and is working for us, but we need to be able to offer other distros as well. I am currently running this patch on a custom-built Ubuntu 16.04.2 kernel and it is fixing the problem there. What needs to be done to get this patch accepted?

Note that James proposed a different patch to fix this bug.

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..ae89082 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,6 +375,7 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int err;

     if (!edev || component >= edev->components)
         return -EINVAL;
@@ -384,12 +385,15 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
     if (cdev->dev == dev)
         return -EEXIST;

-    if (cdev->dev)
+    if (cdev->dev) {
         enclosure_remove_links(cdev);
-
-    put_device(cdev->dev);
-    cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+        put_device(cdev->dev);
+        cdev->dev = NULL;
+    }
+    err = enclosure_add_links(cdev);
+    if (!err)
+        cdev->dev = get_device(dev);
+    return err;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);

I will test this out. Thanks.
Re: [RFC] enclosure: fix sysfs symlinks creation when using multipath
On 02/07/2017 08:08 AM, Maurizio Lombardi wrote:

With multipath, it may happen that the same device is passed to enclosure_add_device() multiple times and that the enclosure_add_links() function fails to create the symlinks because the device's sysfs directory entry is still NULL. In this case, the links will never be created because all the subsequent calls to enclosure_add_device() will immediately fail with EEXIST.

This patch modifies the code so the driver will detect this condition and will retry creating the symlinks when enclosure_add_device() is called.

Signed-off-by: Maurizio Lombardi <mlomb...@redhat.com>
---
 drivers/misc/enclosure.c  | 16 ++++++++++++++--
 include/linux/enclosure.h |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/enclosure.c b/drivers/misc/enclosure.c
index 65fed71..a856c98 100644
--- a/drivers/misc/enclosure.c
+++ b/drivers/misc/enclosure.c
@@ -375,21 +375,33 @@ int enclosure_add_device(struct enclosure_device *edev, int component,
              struct device *dev)
 {
     struct enclosure_component *cdev;
+    int error;

     if (!edev || component >= edev->components)
         return -EINVAL;

     cdev = &edev->component[component];

-    if (cdev->dev == dev)
+    if (cdev->dev == dev) {
+        if (!cdev->links_created) {
+            error = enclosure_add_links(cdev);
+            if (!error)
+                cdev->links_created = 1;
+        }
         return -EEXIST;
+    }

     if (cdev->dev)
         enclosure_remove_links(cdev);

     put_device(cdev->dev);
     cdev->dev = get_device(dev);
-    return enclosure_add_links(cdev);
+    error = enclosure_add_links(cdev);
+    if (!error)
+        cdev->links_created = 1;
+    else
+        cdev->links_created = 0;
+    return error;
 }
 EXPORT_SYMBOL_GPL(enclosure_add_device);
diff --git a/include/linux/enclosure.h b/include/linux/enclosure.h
index a4cf57c..c3bdc4c 100644
--- a/include/linux/enclosure.h
+++ b/include/linux/enclosure.h
@@ -97,6 +97,7 @@ struct enclosure_component {
     struct device cdev;
     struct device *dev;
     enum enclosure_component_type type;
+    int links_created;
     int number;
     int fault;
     int active;

Tested-by: Douglas Miller <dougm...@linux.vnet.ibm.com>

I'd like to add that we are seeing this problem with singlepath installations and need to get this fixed upstream as soon as possible. The new RHEL product contains this fix and is working for us, but we need to be able to offer other distros as well. I am currently running this patch on a custom-built Ubuntu 16.04.2 kernel and it is fixing the problem there. What needs to be done to get this patch accepted?

Thanks,
Doug
Re: [PATCH] block: Fix kernel panic occurs while creating second raid disk
On 11/03/2016 12:15 AM, Sreekanth Reddy wrote:

On Tue, Nov 1, 2016 at 11:52 PM, Douglas Miller <dougm...@linux.vnet.ibm.com> wrote:

On 10/24/2016 01:54 PM, Sreekanth Reddy wrote:

Observing the below kernel panic while creating a second raid disk on an LSI SAS3008 HBA card.

[ +0.55] [ cut here ]
[ +0.07] WARNING: CPU: 2 PID: 281 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
[ +0.02] sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:32'
[ +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore raid_class nvme_core scsi_transport_sas dca
[ +0.67] CPU: 2 PID: 281 Comm: kworker/u49:5 Not tainted 4.9.0-rc2 #1
[ +0.02] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS 1.1 07/22/2015
[ +0.05] Workqueue: events_unbound async_run_entry_fn
[ +0.04] Call Trace:
[ +0.09] [] dump_stack+0x63/0x85
[ +0.05] [] __warn+0xcb/0xf0
[ +0.04] [] warn_slowpath_fmt+0x5f/0x80
[ +0.06] [] ? kernfs_path_from_node+0x4f/0x60
[ +0.02] [] sysfs_warn_dup+0x62/0x80
[ +0.02] [] sysfs_create_dir_ns+0x77/0x90
[ +0.04] [] kobject_add_internal+0x99/0x330
[ +0.03] [] ? vsnprintf+0x35b/0x4c0
[ +0.03] [] kobject_add+0x75/0xd0
[ +0.06] [] ? device_private_init+0x23/0x70
[ +0.07] [] ? mutex_lock+0x12/0x30
[ +0.03] [] device_add+0x119/0x670
[ +0.04] [] device_create_groups_vargs+0xe0/0xf0
[ +0.03] [] device_create_vargs+0x1c/0x20
[ +0.06] [] bdi_register+0x8c/0x180
[ +0.03] [] bdi_register_owner+0x36/0x60
[ +0.06] [] device_add_disk+0x168/0x480
[ +0.05] [] ? update_autosuspend+0x51/0x60
[ +0.05] [] sd_probe_async+0x110/0x1c0
[ +0.02] [] async_run_entry_fn+0x39/0x140
[ +0.03] [] process_one_work+0x15f/0x430
[ +0.02] [] worker_thread+0x4e/0x490
[ +0.02] [] ? process_one_work+0x430/0x430
[ +0.03] [] kthread+0xd9/0xf0
[ +0.03] [] ? kthread_park+0x60/0x60
[ +0.03] [] ret_from_fork+0x25/0x30
[ +0.02] [ cut here ]
[ +0.04] WARNING: CPU: 2 PID: 281 at lib/kobject.c:240 kobject_add_internal+0x2bd/0x330
[ +0.01] kobject_add_internal failed for 8:32 with -EEXIST, don't try to register things with the same name in the same
[ +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore raid_class nvme_core scsi_transport_sas dca
[ +0.43] CPU: 2 PID: 281 Comm: kworker/u49:5 Tainted: GW 4.9.0-rc2 #1
[ +0.01] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS 1.1 07/22/2015
[ +0.02] Workqueue: events_unbound async_run_entry_fn
[ +0.03] Call Trace:
[ +0.03] [] dump_stack+0x63/0x85
[ +0.03] [] __warn+0xcb/0xf0
[ +0.04] [] warn_slowpath_fmt+0x5f/0x80
[ +0.02] [] ? sysfs_warn_dup+0x6a/0x80
[ +0.03] [] kobject_add_internal+0x2bd/0x330
[ +0.03] [] ? vsnprintf+0x35b/0x4c0
[ +0.03] [] kobject_add+0x75/0xd0
[ +0.03] [] ? device_private_init+0x23/0x70
[ +0.04] [] ? mutex_lock+0x12/0x30
[ +0.02] [] device_add+0x119/0x670
[ +0.04] [] device_create_groups_vargs+0xe0/0xf0
[ +0.03] [] device_create_vargs+0x1c/0x20
[ +0.03] [] bdi_register+0x8c/0x180
[ +0.03] [] bdi_register_owner+0x36/0x60
[ +0.04] [] device_add_disk+0x168/0x480
[ +0.03] [] ? update_autosuspend+0x51/0x60
[ +0.02] [] sd_probe_async+0x110/0x1c0
[ +0.02] [] async_run_entry_fn+0x39/0x140
[ +0.02] [] process_one_work+0x15f/0x430
[ +0.02] [] worker_thread+0x4e/0x490
[ +0.02] [] ? process_one_work+0x430/0x430
[ +0.03] [] kthread+0xd9/0xf0
[ +0.03] [] ? kthread_park+0x60/0x60
[ +0.03] [] ret_from_fork+0x25/0x30
[ +0.000949] BUG: unable to handle kernel
[ +0.005263] NULL pointer dereference
[ +0.002853] IP: [] sysfs_do_create_link_sd.isra.2+0x34/0xb0
[ +0.008584] PGD 0
[ +0.006115] Oops: [#1] SMP
[ +0.004531] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_
Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
On 12/07/2016 02:06 PM, Douglas Miller wrote:

On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

In blk_mq_map_swqueue, there is a memory optimization that frees the tags of a queue that has gone unmapped. Later, if that hctx is remapped after another topology change, the tags need to be reallocated. If this allocation fails, a simple WARN_ON triggers, but the block layer ends up with an active hctx without any corresponding set of tags. Then, any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off, and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any hctx's tags, we remap all the ctxs of that queue to hctx_0, which should always keep its tags. There is a minor performance hit, since our mapping just got worse after the error path, but this is the simplest solution to handle this error path. The performance hit will disappear after another successful remap. I considered dropping the memory optimization altogether, but it seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
    pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
    lr: c0575328: __bt_get+0x48/0xd0
    sp: c00fe99eb390
   msr: 90010280b033
   dar: 28
 dsisr: 4000
  current = 0xc00fe9966800
  paca    = 0xc7e80300
 softe: 0  irq_happened: 0x01
   pid = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 __filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 filemap_write_and_wait_range+0x70/0xf0
[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
 block/blk-mq.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fb94bd69375..6718f894fbe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1870,7 +1870,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 static void blk_mq_map_swqueue(struct request_queue *q,
                                const struct cpumask *online_mask)
 {
-    unsigned int i;
+    unsigned int i, hctx_idx;
     struct blk_mq_hw_ctx *hctx;
     struct blk_mq_ctx *ctx;
     struct blk_mq_tag_set *set = q->tag_set;
@@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
         if (!cpumask_test_cpu(i, online_mask))
             continue;

+        hctx_idx = q->mq_map[i];
+        /* unmapped hw queue can be remapped after CPU topo changed */
+        if (!set->tags[hctx_idx]) {
+            set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+            if (!set->tags[hctx_idx])
+                q->mq_map[i] = 0;
+        }
+
         ctx = per_cpu_ptr(q->queue_ctx, i);
         hctx = blk_mq_map_queue(q, i);

@@ -1909,7 +1918,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
          * disable it and free the request entries.
          */
         if (!hctx->nr_ctx) {
-            if (set->tags[i]) {
+            /* Never unmap queue 0.  We need it as a
+             * fallback in case of a new remap fails
+             * allocation. */
+            if (i && set->tags[i]) {
                 blk_mq_free_rq_map(set, set->tags[i], i);
                 set->tags[i] = NULL;
             }
@@ -1917,1
Re: [PATCH RESEND v2 2/2] blk-mq: Avoid memory reclaim when remapping queues
On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

While stressing memory and IO at the same time we changed SMT settings, we
were able to consistently trigger deadlocks in the mm system, which froze
the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls waiting
on the block layer remapping completion, thus deadlocking the system. The
trace below was collected after the machine stalled, waiting for the
hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO. With this patch, we couldn't hit the issue
anymore.

This should apply cleanly on top of Jens' for-next branch.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

Call Trace:
[c00f0160aaf0] [c00f0160ab50] 0xc00f0160ab50 (unreliable)
[c00f0160acc0] [c0016624] __switch_to+0x2e4/0x430
[c00f0160ad20] [c0b1a880] __schedule+0x310/0x9b0
[c00f0160ae00] [c0b1af68] schedule+0x48/0xc0
[c00f0160ae30] [c0b1b4b0] schedule_preempt_disabled+0x20/0x30
[c00f0160ae50] [c0b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c00f0160aed0] [c0b1d678] mutex_lock+0x78/0xa0
[c00f0160af00] [d00019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c00f0160b0b0] [d00019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c00f0160b0f0] [d000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c00f0160b120] [c03172c8] super_cache_scan+0x1f8/0x210
[c00f0160b190] [c026301c] shrink_slab.part.13+0x21c/0x4c0
[c00f0160b2d0] [c0268088] shrink_zone+0x2d8/0x3c0
[c00f0160b380] [c026834c] do_try_to_free_pages+0x1dc/0x520
[c00f0160b450] [c026876c] try_to_free_pages+0xdc/0x250
[c00f0160b4e0] [c0251978] __alloc_pages_nodemask+0x868/0x10d0
[c00f0160b6f0] [c0567030] blk_mq_init_rq_map+0x160/0x380
[c00f0160b7a0] [c056758c] blk_mq_map_swqueue+0x33c/0x360
[c00f0160b820] [c0567904] blk_mq_queue_reinit+0x64/0xb0
[c00f0160b850] [c056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c00f0160b8a0] [c00f5d38] notifier_call_chain+0x98/0x100
[c00f0160b8f0] [c00c5fb0] __cpu_notify+0x70/0xe0
[c00f0160b930] [c00c63c4] notify_prepare+0x44/0xb0
[c00f0160b9b0] [c00c52f4] cpuhp_invoke_callback+0x84/0x250
[c00f0160ba10] [c00c570c] cpuhp_up_callbacks+0x5c/0x120
[c00f0160ba60] [c00c7cb8] _cpu_up+0xf8/0x1d0
[c00f0160bac0] [c00c7eb0] do_cpu_up+0x120/0x150
[c00f0160bb40] [c06fe024] cpu_subsys_online+0x64/0xe0
[c00f0160bb90] [c06f5124] device_online+0xb4/0x120
[c00f0160bbd0] [c06f5244] online_store+0xb4/0xc0
[c00f0160bc20] [c06f0a68] dev_attr_store+0x68/0xa0
[c00f0160bc60] [c03ccc30] sysfs_kf_write+0x80/0xb0
[c00f0160bca0] [c03cbabc] kernfs_fop_write+0x17c/0x250
[c00f0160bcf0] [c030fe6c] __vfs_write+0x6c/0x1e0
[c00f0160bd90] [c0311490] vfs_write+0xd0/0x270
[c00f0160bde0] [c03131fc] SyS_write+0x6c/0x110
[c00f0160be30] [c0009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6718f894fbe1..5f4e452eef72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1605,7 +1605,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	INIT_LIST_HEAD(&tags->page_list);
 
 	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
-				 GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+				 GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
 	if (!tags->rqs) {
 		blk_mq_free_tags(tags);
@@ -1631,7 +1631,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	do {
 		page = alloc_pages_node(set->numa_node,
-			GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
 			this_order);
 		if (page)
 			break;
@@ -1652,7 +1652,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
	 * Allow kmemleak to scan these pages as they contain pointers
	 * to additional allocations like via ops->init_request().
	 */
-	kme
Re: [PATCH RESEND v2 1/2] blk-mq: Fix failed allocation path when mapping queues
On 12/06/2016 09:31 AM, Gabriel Krisman Bertazi wrote:

In blk_mq_map_swqueue, there is a memory optimization that frees the tags
of a queue that has gone unmapped. Later, if that hctx is remapped after
another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags. Then,
any incoming IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to hctx_0, which should
always keep its tags. There is a minor performance hit, since our mapping
just got worse after the error path, but this is the simplest solution to
handle this error path. The performance hit will disappear after another
successful remap.

I considered dropping the memory optimization altogether, but it seemed a
bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c00fe99eb110]
    pc: c05e868c: __sbitmap_queue_get+0x2c/0x180
    lr: c0575328: __bt_get+0x48/0xd0
    sp: c00fe99eb390
   msr: 90010280b033
   dar: 28
 dsisr: 4000
  current = 0xc00fe9966800
  paca    = 0xc7e80300   softe: 0   irq_happened: 0x01
    pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c00fe99eb3d0] c0575328 __bt_get+0x48/0xd0
[c00fe99eb400] c0575838 bt_get.isra.1+0x78/0x2d0
[c00fe99eb480] c0575cb4 blk_mq_get_tag+0x44/0x100
[c00fe99eb4b0] c056f6f4 __blk_mq_alloc_request+0x44/0x220
[c00fe99eb500] c0570050 blk_mq_map_request+0x100/0x1f0
[c00fe99eb580] c0574650 blk_mq_make_request+0xf0/0x540
[c00fe99eb640] c0561c44 generic_make_request+0x144/0x230
[c00fe99eb690] c0561e00 submit_bio+0xd0/0x200
[c00fe99eb740] c03ef740 ext4_io_submit+0x90/0xb0
[c00fe99eb770] c03e95d8 ext4_writepages+0x588/0xdd0
[c00fe99eb910] c025a9f0 do_writepages+0x60/0xc0
[c00fe99eb940] c0246c88 __filemap_fdatawrite_range+0xf8/0x180
[c00fe99eb9e0] c0246f90 filemap_write_and_wait_range+0x70/0xf0
[c00fe99eba20] c03dd844 ext4_sync_file+0x214/0x540
[c00fe99eba80] c0364718 vfs_fsync_range+0x78/0x130
[c00fe99ebad0] c03dd46c ext4_file_write_iter+0x35c/0x430
[c00fe99ebb90] c038c280 aio_run_iocb+0x3b0/0x450
[c00fe99ebce0] c038dc28 do_io_submit+0x368/0x730
[c00fe99ebe30] c0009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <kris...@linux.vnet.ibm.com>
Cc: Brian King <brk...@linux.vnet.ibm.com>
Cc: Douglas Miller <dougm...@linux.vnet.ibm.com>
Cc: linux-bl...@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
---
 block/blk-mq.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fb94bd69375..6718f894fbe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1870,7 +1870,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 static void blk_mq_map_swqueue(struct request_queue *q,
 			       const struct cpumask *online_mask)
 {
-	unsigned int i;
+	unsigned int i, hctx_idx;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -1893,6 +1893,15 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		if (!cpumask_test_cpu(i, online_mask))
 			continue;
 
+		hctx_idx = q->mq_map[i];
+		/* unmapped hw queue can be remapped after CPU topo changed */
+		if (!set->tags[hctx_idx]) {
+			set->tags[hctx_idx] = blk_mq_init_rq_map(set, hctx_idx);
+
+			if (!set->tags[hctx_idx])
+				q->mq_map[i] = 0;
+		}
+
 		ctx = per_cpu_ptr(q->queue_ctx, i);
 		hctx = blk_mq_map_queue(q, i);
@@ -1909,7 +1918,10 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		 * disable it and free the request entries.
 		 */
 		if (!hctx->nr_ctx) {
-			if (set->tags[i]) {
+			/* Never unmap queue 0.  We need it as a
+			 * fallback in case of a new remap fails
+			 * allocation. */
+			if (i && set->tags[i])
Re: [PATCH] block: Fix kernel panic occurs while creating second raid disk
On 10/24/2016 01:54 PM, Sreekanth Reddy wrote:

Observing below kernel panic while creating second raid disk on LSI SAS3008 HBA card.

[ +0.55] ------------[ cut here ]------------
[ +0.07] WARNING: CPU: 2 PID: 281 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
[ +0.02] sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:32'
[ +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore raid_class nvme_core scsi_transport_sas dca
[ +0.67] CPU: 2 PID: 281 Comm: kworker/u49:5 Not tainted 4.9.0-rc2 #1
[ +0.02] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS 1.1 07/22/2015
[ +0.05] Workqueue: events_unbound async_run_entry_fn
[ +0.04] Call Trace:
[ +0.09] [] dump_stack+0x63/0x85
[ +0.05] [] __warn+0xcb/0xf0
[ +0.04] [] warn_slowpath_fmt+0x5f/0x80
[ +0.06] [] ? kernfs_path_from_node+0x4f/0x60
[ +0.02] [] sysfs_warn_dup+0x62/0x80
[ +0.02] [] sysfs_create_dir_ns+0x77/0x90
[ +0.04] [] kobject_add_internal+0x99/0x330
[ +0.03] [] ? vsnprintf+0x35b/0x4c0
[ +0.03] [] kobject_add+0x75/0xd0
[ +0.06] [] ? device_private_init+0x23/0x70
[ +0.07] [] ? mutex_lock+0x12/0x30
[ +0.03] [] device_add+0x119/0x670
[ +0.04] [] device_create_groups_vargs+0xe0/0xf0
[ +0.03] [] device_create_vargs+0x1c/0x20
[ +0.06] [] bdi_register+0x8c/0x180
[ +0.03] [] bdi_register_owner+0x36/0x60
[ +0.06] [] device_add_disk+0x168/0x480
[ +0.05] [] ? update_autosuspend+0x51/0x60
[ +0.05] [] sd_probe_async+0x110/0x1c0
[ +0.02] [] async_run_entry_fn+0x39/0x140
[ +0.03] [] process_one_work+0x15f/0x430
[ +0.02] [] worker_thread+0x4e/0x490
[ +0.02] [] ? process_one_work+0x430/0x430
[ +0.03] [] kthread+0xd9/0xf0
[ +0.03] [] ? kthread_park+0x60/0x60
[ +0.03] [] ret_from_fork+0x25/0x30
[ +0.02] ------------[ cut here ]------------
[ +0.04] WARNING: CPU: 2 PID: 281 at lib/kobject.c:240 kobject_add_internal+0x2bd/0x330
[ +0.01] kobject_add_internal failed for 8:32 with -EEXIST, don't try to register things with the same name in the same
[ +0.01] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif mei_me pcspkr mei iTCO_vendor_support ipmi_si i2c_i801 lpc_ich mfd_corema acpi_pad wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace binfmt_misc sunrpc xfs libcrc32c ast i2c_algo_bit drm_kore raid_class nvme_core scsi_transport_sas dca
[ +0.43] CPU: 2 PID: 281 Comm: kworker/u49:5 Tainted: G W 4.9.0-rc2 #1
[ +0.01] Hardware name: Supermicro SYS-2028U-TNRT+/X10DRU-i+, BIOS 1.1 07/22/2015
[ +0.02] Workqueue: events_unbound async_run_entry_fn
[ +0.03] Call Trace:
[ +0.03] [] dump_stack+0x63/0x85
[ +0.03] [] __warn+0xcb/0xf0
[ +0.04] [] warn_slowpath_fmt+0x5f/0x80
[ +0.02] [] ? sysfs_warn_dup+0x6a/0x80
[ +0.03] [] kobject_add_internal+0x2bd/0x330
[ +0.03] [] ? vsnprintf+0x35b/0x4c0
[ +0.03] [] kobject_add+0x75/0xd0
[ +0.03] [] ? device_private_init+0x23/0x70
[ +0.04] [] ? mutex_lock+0x12/0x30
[ +0.02] [] device_add+0x119/0x670
[ +0.04] [] device_create_groups_vargs+0xe0/0xf0
[ +0.03] [] device_create_vargs+0x1c/0x20
[ +0.03] [] bdi_register+0x8c/0x180
[ +0.03] [] bdi_register_owner+0x36/0x60
[ +0.04] [] device_add_disk+0x168/0x480
[ +0.03] [] ? update_autosuspend+0x51/0x60
[ +0.02] [] sd_probe_async+0x110/0x1c0
[ +0.02] [] async_run_entry_fn+0x39/0x140
[ +0.02] [] process_one_work+0x15f/0x430
[ +0.02] [] worker_thread+0x4e/0x490
[ +0.02] [] ? process_one_work+0x430/0x430
[ +0.03] [] kthread+0xd9/0xf0
[ +0.03] [] ? kthread_park+0x60/0x60
[ +0.03] [] ret_from_fork+0x25/0x30
[ +0.000949] BUG: unable to handle kernel
[ +0.005263] NULL pointer dereference
[ +0.002853] IP: [] sysfs_do_create_link_sd.isra.2+0x34/0xb0
[ +0.008584] PGD 0
[ +0.006115] Oops: [#1] SMP
[ +0.004531] Modules linked in: mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl sb_edac edac_core x86_pkg_temp_pclmul joydev ghash_clmulni_intel iTCO_wdt ipmi_ssif
Issues with LSI-3008 adapters, mpt3sas driver
Hi all,

I am seeing an issue while using an LSI-3008-based adapter (mpt3sas driver) on a PowerPC system (although I am not yet convinced it is architecture dependent).

When I create a RAID1 volume, the physical disk devices get "hidden" as expected; however, the various kernel objects are out of sync. The corresponding bits in the "sd_index_ida" bitmap get cleared, and the symlink in /sys/dev/block for this major:minor pair gets removed, but none of the other major:minor entries in sysfs get removed. The next time a new device is added (for example, during another RAID volume create or delete), the recently-freed major:minor number is picked up from the "sd_index_ida" bitmap, but the attempt to create sysfs entries fails with EEXIST because an entry by the same name already (still) exists. This failure goes unhandled, and later the kernel panics in sd_probe_async while dereferencing an (apparently) invalid backing_dev_info structure (presumably left invalid because of the EEXIST error).

A reboot cleans this up (bitmaps and sysfs), and the second RAID volume (if a create was done) shows up normally. However, even if the panic were avoided by better error handling in sd_probe_async, there would still be the problem of not being able to create more than one RAID volume without rebooting.

I am wondering if this issue has been seen elsewhere, and also just what might be going wrong. For mpt3sas, it appears that the firmware largely drives the hiding/exposing of devices, but I don't see an issue with the ordering of those events. I am wondering if the driver is failing to set up the device attributes correctly in order to get the proper sysfs handling.

I am seeing this on Ubuntu 16.04, but also see it on the upstream kernel. Oddly, it does not happen on RHEL 7.2 (an older kernel).

A possibly-related issue we see is that when a RAID volume is deleted, none of the RAID device nodes (/dev as well as /sys/) get removed - although they are unusable.
Deleting before creating does not produce the panic, so I believe the "sd_index_ida" bitmap is not getting updated by the delete.

Any help would be appreciated.

Thanks,
Doug