Re: linux-next scsi-mq hang in suspend-resume
On 17 July 2017 at 18:18, Evangelos Foutras wrote:
> On 17/07/17 10:53, Christoph Hellwig wrote:
>> But I did some audit of the code, and it seems blk-mq is lacking
>> support for the RQF_PM flag. While I can't directly see how
>> this would cause the hang you're seeing, it's at least easy to test.
>>
>> Can you apply the patch below and test with the use_blk_mq=0 parameter?
>
> I think the patch needs to be tested with scsi_mod.use_blk_mq=1 (which I
> will try to do and report back).

I briefly tested the patch (on top of Linux 4.12.2) and it appears to
successfully work around the issue; my laptop happily resumes from S3
and can access the HDD.
Re: linux-next scsi-mq hang in suspend-resume
(Hopefully I got the In-Reply-To header right and won't mess up the
thread.)

On 17/07/17 10:53, Christoph Hellwig wrote:
> I still haven't gotten hold of an i915 machine where I could
> run the actual test suite.

At the risk of posting an unproductive "me too" reply, I also got bitten
by the dead disk on resume from S3 when Arch Linux enabled MQ by default
in the 4.12 kernel (CONFIG_SCSI_MQ_DEFAULT=y). The configuration change
was later reverted due to this issue.

For me the hang occurs pretty reliably (tested about 5-6 times) on an
Intel laptop and an AMD desktop, both with HDDs and ext4 on top of LUKS.
It feels as if the disk stops responding to commands. The machine itself
wakes up from sleep but even a simple `ls` will hang and do nothing.

> But I did some audit of the code, and it seems blk-mq is lacking
> support for the RQF_PM flag. While I can't directly see how
> this would cause the hang you're seeing, it's at least easy to test.
>
> Can you apply the patch below and test with the use_blk_mq=0 parameter?

I think the patch needs to be tested with scsi_mod.use_blk_mq=1 (which I
will try to do and report back).
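Since the question above is which value of scsi_mod.use_blk_mq a given test boot actually used, one way to double-check is to parse the kernel command line. The following is only a sketch: the function name `blk_mq_cmdline_value` is hypothetical (it does not come from this thread), and it assumes that the last occurrence of the option on the command line is the one that takes effect.

```shell
# Hypothetical helper, not from the thread: print the value given to
# scsi_mod.use_blk_mq on a kernel command line, or "unset" if absent.
# $1 is the command line to inspect; without it, /proc/cmdline is read.
blk_mq_cmdline_value() {
    if [ "$#" -gt 0 ]; then
        cmdline=$1
    else
        cmdline=$(cat /proc/cmdline)
    fi
    value=unset
    # Unquoted on purpose: split the command line into words.
    for arg in $cmdline; do
        case $arg in
            # Assumption: a later occurrence overrides an earlier one.
            scsi_mod.use_blk_mq=*) value=${arg#scsi_mod.use_blk_mq=} ;;
        esac
    done
    printf '%s\n' "$value"
}
```

When the result is "unset", the built-in default (CONFIG_SCSI_MQ_DEFAULT) applies. On a running system the effective value should also be readable from /sys/module/scsi_mod/parameters/use_blk_mq, assuming the kernel exposes the parameter there.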
Re: linux-next scsi-mq hang in suspend-resume
On Mon, Jul 17, 2017 at 01:30:00PM +0300, Tomi Sarvela wrote:
> First, tested that next-20170717 still triggers the problem when no extra
> options given. Adding scsi_mod.use_blk_mq=0 makes tests work.
>
> Then I tried with sd.diff patched next-20170717. Works (still) with
> use_blk_mq=0. Also works when no options given, so this patch avoids the
> hang when using the new block-mq.
>
> These tests on generic Haswell 4790K desktop machine.

Thanks Tomi,

this seems to confirm it's runtime PM related, although I don't really
understand why that's an issue. Let me spin up an implementation of
RQF_PM for blk-mq and give it to you for testing.
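Because the result above points at runtime PM, it may be useful to record a disk's runtime PM attributes next to each test result. A minimal sketch, assuming the standard per-device power/control and power/runtime_status sysfs attributes; the helper name `runtime_pm_state` and its optional sysfs-root argument are hypothetical and exist only for illustration:

```shell
# Hypothetical helper, not from the thread: print the runtime PM
# attributes of a disk, e.g. "runtime_pm_state sda".
# $1 is the disk name; $2 is an optional sysfs root (default /sys).
runtime_pm_state() {
    disk=$1
    sysfs=${2:-/sys}
    dir="$sysfs/block/$disk/device/power"
    for attr in control runtime_status; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            # Attribute missing or unreadable on this kernel or device.
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```

This only reads the attributes; it makes no claim about which values the patched and unpatched kernels report.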
Re: linux-next scsi-mq hang in suspend-resume
On 17/07/17 10:53, Christoph Hellwig wrote:
> I still haven't gotten hold of an i915 machine where I could
> run the actual test suite.
>
> But I did some audit of the code, and it seems blk-mq is lacking
> support for the RQF_PM flag. While I can't directly see how
> this would cause the hang you're seeing, it's at least easy to test.
>
> Can you apply the patch below and test with the use_blk_mq=0 parameter?
>
> Note that implementing RQF_PM for blk-mq shouldn't be too hard either,
> but if we don't get rid of the nr_pending counter somehow it would be
> a severe performance penalty for all scsi devices.

First, tested that next-20170717 still triggers the problem when no extra
options given. Adding scsi_mod.use_blk_mq=0 makes tests work.

Then I tried with sd.diff patched next-20170717. Works (still) with
use_blk_mq=0. Also works when no options given, so this patch avoids the
hang when using the new block-mq.

These tests on generic Haswell 4790K desktop machine.

Best regards,

Tomi
-- 
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
Re: linux-next scsi-mq hang in suspend-resume
I still haven't gotten hold of an i915 machine where I could
run the actual test suite.

But I did some audit of the code, and it seems blk-mq is lacking
support for the RQF_PM flag. While I can't directly see how
this would cause the hang you're seeing, it's at least easy to test.

Can you apply the patch below and test with the use_blk_mq=0 parameter?

Note that implementing RQF_PM for blk-mq shouldn't be too hard either,
but if we don't get rid of the nr_pending counter somehow it would be
a severe performance penalty for all scsi devices.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bea36adeee17..5c3818ebee9c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -554,7 +554,7 @@ static struct scsi_driver sd_template = {
 		.probe		= sd_probe,
 		.remove		= sd_remove,
 		.shutdown	= sd_shutdown,
-		.pm		= &sd_pm_ops,
+//		.pm		= &sd_pm_ops,
 	},
 	.rescan			= sd_rescan,
 	.init_command		= sd_init_command,
@@ -3249,7 +3249,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 		gd->events |= DISK_EVENT_MEDIA_CHANGE;
 	}
 
-	blk_pm_runtime_init(sdp->request_queue, dev);
+//	blk_pm_runtime_init(sdp->request_queue, dev);
 	device_add_disk(dev, gd);
 	if (sdkp->capacity)
 		sd_dif_config_host(sdkp);
Re: linux-next scsi-mq hang in suspend-resume
On 14/07/17 15:44, Christoph Hellwig wrote:
> can you please report what hardware this is on (e.g. libata or real
> scsi, which driver), a kernel config and the actual command used to
> suspend the system (to ram, to disk?) so that I can try to reproduce it?

The hardware I used to bisect the problem is Broxton: Asrock ITX-J3455
motherboard with Intel J3455 SoC (about Skylake Gen). Disk is Intel SATA
SSD. Issue also happens with Samsung SSD on another testhost.

Note that there are half a dozen other hosts indicating the same problem,
and traces are available starting from ILK to Skylake. None of the Kaby
Lakes triggers the issue (the KBL issue is probably NVMe-related instead).
Usual setup is one SATA SSD disk on port 0 on the motherboard.

Kernel config is available at:
https://intel-gfx-ci.01.org/CI/next-20170711/kernel.config.bz2

Kernel options:
BOOT_IMAGE=/boot/drm_intel root=/dev/sda2 console=ttyS0,115200n8 console=tty0 intel_iommu=igfx_off drm.debug=0xe nmi_watchdog=panic,auto panic=1 softdog.soft_panic=1 rootwait ro 3

To reproduce the problem on Broxton, i-g-t was used:
https://cgit.freedesktop.org/xorg/app/intel-gpu-tools/

From i-g-t, the binaries could be run with:

tests/gem_exec_gttfill --r basic
tests/gem_exec_suspend --r basic-s3

but, from my experience, this issue pops up much more easily if there is
the piglit framework capturing logs to disk:
https://cgit.freedesktop.org/piglit

With IGT/piglit the testlist file would be (e.g. scsi-mq.testlist):

#
igt@gem_exec_gttfill@basic
igt@gem_exec_suspend@basic-s3
#

and the command to run i-g-t through piglit is

/opt/igt/scripts/run-tests.sh -vT scsi-mq.testlist

I can try to reproduce the issue without i-g-t/piglit, but it might take
some trying. Definitely suspend-to-ram and writes to disk are needed to
trigger this; gem_exec_suspend/basic-s3 on its own can loop quite well
without panicking.

Tomi
-- 
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
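The piglit setup described above can be scripted roughly as follows. This is a sketch only: the function name `write_scsi_mq_testlist` is hypothetical, the two test names and the run-tests.sh path are the ones quoted in this mail, and the run step is left commented out because it suspends the machine to RAM.

```shell
# Hypothetical helper, not from the thread: write the two-test list
# described above to the file named by $1.
write_scsi_mq_testlist() {
    printf '%s\n' \
        'igt@gem_exec_gttfill@basic' \
        'igt@gem_exec_suspend@basic-s3' > "$1"
}

write_scsi_mq_testlist scsi-mq.testlist

# Run the list through piglit (path as given in this mail).
# Left commented out: this suspends the machine to RAM.
# /opt/igt/scripts/run-tests.sh -vT scsi-mq.testlist
```

Keeping gem_exec_gttfill ahead of the suspend test matches the order given above; whether that order matters for triggering the hang is not established in this thread.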
Re: linux-next scsi-mq hang in suspend-resume
Tomi,

can you please report what hardware this is on (e.g. libata or real
scsi, which driver), a kernel config and the actual command used to
suspend the system (to ram, to disk?) so that I can try to reproduce it?
Re: linux-next scsi-mq hang in suspend-resume
On Wed, Jul 12, 2017 at 10:50:19AM -0600, Jens Axboe wrote:
> On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> > Hello there,
> >
> > I've been running Intel GFX CI testing for linux DRM-Tip i915 driver,
> > and couple of weeks ago we took linux-next for a ride to see what kind
> > of integration problems there might pop up when pulling 4.13-rc1.
> > Latest results can be seen at
> >
> > https://intel-gfx-ci.01.org/CI/next-issues.html
> > https://intel-gfx-ci.01.org/CI/next-all.html
> >
> > The purple blocks are hangs, starting from 20170628 (20170627 was
> > untestable due to locking changes which were reverted). Traces were
> > pointing to ext4 but bisecting between good 20170626 and bad 20170628
> > pointed to:
> >
> > commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> > Date: Fri Jun 16 10:27:55 2017 +0200
> >
> > scsi: default to scsi-mq
> >
> > Reproduction is 100% or close to it when running two i-g-t tests as a
> > testlist. I'm assuming that it creates the correct amount or pattern
> > of actions to the device. The testlist consists of the following
> > lines:
> >
> > igt@gem_exec_gttfill@basic
> > igt@gem_exec_suspend@basic-s3
> >
> > Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts.
> > Configuration option was copied over on testhosts and 20170712 was
> > re-tested, that's why today looks so much greener.
> >
> > More information including traces and reproduction instructions at
> > https://bugzilla.kernel.org/show_bug.cgi?id=196223
> >
> > I can run patchsets through the farm, if needed. In addition, daily
> > linux-next tags are automatically tested and results published.
>
> Christoph, any ideas? Smells like something in SCSI, my notebook
> with nvme/blk-mq suspend/resumes just fine.

There isn't much mq-specific scsi code, so it's probably an interaction
of both. I'll see if the bugzilla has enough data to reproduce it
locally.

Although I really wish people wouldn't use #TY^$Y^$ bugzilla and just
post the important data to the list :(

>
> --
> Jens Axboe
>
---end quoted text---
Re: linux-next scsi-mq hang in suspend-resume
On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> Hello there,
>
> I've been running Intel GFX CI testing for linux DRM-Tip i915 driver,
> and couple of weeks ago we took linux-next for a ride to see what kind
> of integration problems there might pop up when pulling 4.13-rc1.
> Latest results can be seen at
>
> https://intel-gfx-ci.01.org/CI/next-issues.html
> https://intel-gfx-ci.01.org/CI/next-all.html
>
> The purple blocks are hangs, starting from 20170628 (20170627 was
> untestable due to locking changes which were reverted). Traces were
> pointing to ext4 but bisecting between good 20170626 and bad 20170628
> pointed to:
>
> commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> Date: Fri Jun 16 10:27:55 2017 +0200
>
> scsi: default to scsi-mq
>
> Reproduction is 100% or close to it when running two i-g-t tests as a
> testlist. I'm assuming that it creates the correct amount or pattern
> of actions to the device. The testlist consists of the following
> lines:
>
> igt@gem_exec_gttfill@basic
> igt@gem_exec_suspend@basic-s3
>
> Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts.
> Configuration option was copied over on testhosts and 20170712 was
> re-tested, that's why today looks so much greener.
>
> More information including traces and reproduction instructions at
> https://bugzilla.kernel.org/show_bug.cgi?id=196223
>
> I can run patchsets through the farm, if needed. In addition, daily
> linux-next tags are automatically tested and results published.

Christoph, any ideas? Smells like something in SCSI, my notebook
with nvme/blk-mq suspend/resumes just fine.

-- 
Jens Axboe