Re: linux-next scsi-mq hang in suspend-resume

2017-07-17 Thread Evangelos Foutras
On 17 July 2017 at 18:18, Evangelos Foutras wrote:
> On 17/07/17 10:53, Christoph Hellwig wrote:
>> But I did some audit of the code, and it seems blk-mq is lacking
>> support for the RQF_PM flag.  While I can't directly see how
>> this would cause the hang you're seeing, it's at least easy to test.
>>
>> Can you apply the patch below and test with the use_blk_mq=0 parameter?
>
> I think the patch needs to be tested with scsi_mod.use_blk_mq=1 (which I
> will try to do and report back).

I briefly tested the patch (on top of Linux 4.12.2) and it appears to
successfully work around the issue; my laptop happily resumes from S3
and can access the HDD.


Re: linux-next scsi-mq hang in suspend-resume

2017-07-17 Thread Evangelos Foutras
(Hopefully I got the In-Reply-To header right and won't mess up the thread.)

On 17/07/17 10:53, Christoph Hellwig wrote:
> I still haven't gotten hold of an i915 machine where I could
> run the actual test suite.

At the risk of posting an unproductive "me too" reply, I also got bit by
the dead disk on resume from S3 when Arch Linux enabled MQ by default in
the 4.12 kernel (CONFIG_SCSI_MQ_DEFAULT=y). The configuration change was
later reverted due to this issue.

For me the hang occurs pretty reliably (tested about 5-6 times) on an
Intel laptop and an AMD desktop, both with HDDs and ext4 on top of LUKS.
It feels as if the disk stops responding to commands. The machine itself
wakes up from sleep but even a simple `ls` will hang and do nothing.

> But I did some audit of the code, and it seems blk-mq is lacking
> support for the RQF_PM flag.  While I can't directly see how
> this would cause the hang you're seeing, it's at least easy to test.
>
> Can you apply the patch below and test with the use_blk_mq=0 parameter?

I think the patch needs to be tested with scsi_mod.use_blk_mq=1 (which I
will try to do and report back).
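
Since the result of the test depends on which value of the parameter is actually in effect, it may help to check it before and after rebooting. The following is a minimal sketch, not something posted in the thread: the helper name cmdline_use_blk_mq is made up for illustration, and the sysfs path assumes scsi_mod exposes the parameter under /sys/module, where a boolean parameter reads back as Y or N.

```shell
#!/bin/sh
# Report whether scsi-mq was requested on the kernel command line and,
# where the sysfs file is readable, what the running kernel reports.

# cmdline_use_blk_mq "<command line string>"
# Prints the value given to scsi_mod.use_blk_mq, or "unset" if the option
# does not appear. If it appears more than once, the last value is printed.
cmdline_use_blk_mq() {
    value=unset
    # Unquoted on purpose: split the command line into words.
    for word in $1; do
        case "$word" in
            scsi_mod.use_blk_mq=*) value=${word#scsi_mod.use_blk_mq=} ;;
        esac
    done
    printf '%s\n' "$value"
}

if [ -r /proc/cmdline ]; then
    echo "cmdline: $(cmdline_use_blk_mq "$(cat /proc/cmdline)")"
fi

# Assumed location of the module parameter in sysfs.
param=/sys/module/scsi_mod/parameters/use_blk_mq
if [ -r "$param" ]; then
    echo "running: $(cat "$param")"
fi
```

With the patch applied, the interesting case is the one where the second line reports that scsi-mq is enabled.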


Re: linux-next scsi-mq hang in suspend-resume

2017-07-17 Thread Christoph Hellwig
On Mon, Jul 17, 2017 at 01:30:00PM +0300, Tomi Sarvela wrote:
> First, tested that next-20170717 still triggers the problem when no extra
> options given. Adding scsi_mod.use_blk_mq=0 makes tests work.
> 
> Then I tried with sd.diff patched next-20170717. Works (still) with
> use_blk_mq=0. Also works when no options given, so this patch avoids the
> hang when using the new block-mq.
> 
> These tests on generic Haswell 4790K desktop machine.

Thanks Tomi,

this seems to confirm it's runtime PM related, although I don't
really understand why that's an issue.  Let me spin up an implementation
of RQF_PM for blk-mq and give it to you for testing.


Re: linux-next scsi-mq hang in suspend-resume

2017-07-17 Thread Tomi Sarvela

On 17/07/17 10:53, Christoph Hellwig wrote:
> I still haven't gotten hold of an i915 machine where I could
> run the actual test suite.
>
> But I did some audit of the code, and it seems blk-mq is lacking
> support for the RQF_PM flag.  While I can't directly see how
> this would cause the hang you're seeing, it's at least easy to test.
>
> Can you apply the patch below and test with the use_blk_mq=0 parameter?
>
> Note that implementing RQF_PM for blk-mq shouldn't be too hard either,
> but if we don't get rid of the nr_pending counter somehow it would
> be a severe performance penalty for all scsi devices.


First, tested that next-20170717 still triggers the problem when no
extra options are given. Adding scsi_mod.use_blk_mq=0 makes the tests work.

Then I tried next-20170717 patched with sd.diff. It works (still) with
use_blk_mq=0. It also works when no options are given, so this patch avoids
the hang when using the new block-mq.

These tests were run on a generic Haswell 4790K desktop machine.

Best regards,

Tomi
--
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo


Re: linux-next scsi-mq hang in suspend-resume

2017-07-17 Thread Christoph Hellwig
I still haven't gotten hold of an i915 machine where I could
run the actual test suite.

But I did some audit of the code, and it seems blk-mq is lacking
support for the RQF_PM flag.  While I can't directly see how
this would cause the hang you're seeing, it's at least easy to test.

Can you apply the patch below and test with the use_blk_mq=0 parameter?

Note that implementing RQF_PM for blk-mq shouldn't be too hard either,
but if we don't get rid of the nr_pending counter somehow it would
be a severe performance penalty for all scsi devices.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bea36adeee17..5c3818ebee9c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -554,7 +554,7 @@ static struct scsi_driver sd_template = {
 		.probe		= sd_probe,
 		.remove		= sd_remove,
 		.shutdown	= sd_shutdown,
-		.pm		= &sd_pm_ops,
+//		.pm		= &sd_pm_ops,
 	},
 	.rescan			= sd_rescan,
 	.init_command		= sd_init_command,
@@ -3249,7 +3249,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 		gd->events |= DISK_EVENT_MEDIA_CHANGE;
 	}
 
-	blk_pm_runtime_init(sdp->request_queue, dev);
+//	blk_pm_runtime_init(sdp->request_queue, dev);
 	device_add_disk(dev, gd);
 	if (sdkp->capacity)
 		sd_dif_config_host(sdkp);
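
For readers unfamiliar with the RQF_PM flag mentioned in the message above, the idea can be shown with a toy model. This is only a sketch of how the legacy (non-mq) request path of that era appears to gate requests on the queue's runtime PM status; it is not kernel code, and the function name and status strings are invented for illustration.

```shell
#!/bin/sh
# Toy model of RQF_PM gating in the legacy request path; not kernel code.
#
# allow_request STATUS IS_PM
#   STATUS: active | suspending | resuming | suspended
#   IS_PM:  1 if the request carries the PM flag, 0 otherwise
# Returns 0 (success) when the request may be dispatched.
allow_request() {
    case "$1" in
        active)    return 0 ;;       # normal operation: everything passes
        suspended) return 1 ;;       # fully suspended: nothing passes
        *)         [ "$2" = 1 ] ;;   # in transition: PM requests only
    esac
}

# Print the verdict for every combination of status and flag.
for status in active suspending resuming suspended; do
    for is_pm in 0 1; do
        if allow_request "$status" "$is_pm"; then
            verdict=dispatch
        else
            verdict=hold
        fi
        echo "$status pm=$is_pm -> $verdict"
    done
done
```

The point of the model is the middle case: while the device is suspending or resuming, the commands that perform the transition have to get through while ordinary I/O is held back.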


Re: linux-next scsi-mq hang in suspend-resume

2017-07-14 Thread Tomi Sarvela

On 14/07/17 15:44, Christoph Hellwig wrote:
> can you please report what hardware this is on (e.g. libata or
> real scsi, which driver), a kernel config and the actual command
> used to suspend the system (to ram, to disk?) so that I can try to
> reproduce it?


The hardware I used to bisect the problem is Broxton: Asrock
ITX-J3455 motherboard with Intel J3455 SoC (about Skylake Gen). Disk is
Intel SATA SSD. Issue also happens with Samsung SSD on other testhost.


Note that there are half a dozen other hosts indicating the same problem,
and traces are available starting from ILK to Skylake. None of the Kaby
Lakes triggers the issue (the KBL issue is probably NVMe-related
instead). Usual setup is one SATA SSD disk on port 0 on motherboard.


Kernel config is available at:
https://intel-gfx-ci.01.org/CI/next-20170711/kernel.config.bz2

Kernel options:
BOOT_IMAGE=/boot/drm_intel root=/dev/sda2 console=ttyS0,115200n8 
console=tty0 intel_iommu=igfx_off drm.debug=0xe nmi_watchdog=panic,auto 
panic=1 softdog.soft_panic=1 rootwait ro 3


To reproduce the problem on Broxton, i-g-t was used:
https://cgit.freedesktop.org/xorg/app/intel-gpu-tools/

From i-g-t, the binaries could be run with:

tests/gem_exec_gttfill --r basic
tests/gem_exec_suspend --r basic-s3

but, in my experience, this issue pops up much more easily if the
piglit framework is capturing logs to disk:

https://cgit.freedesktop.org/piglit

With IGT/piglit, the testlist file would be (e.g. scsi-mq.testlist):
#
igt@gem_exec_gttfill@basic
igt@gem_exec_suspend@basic-s3
#
and command to run i-g-t through piglit is
/opt/igt/scripts/run-tests.sh -vT scsi-mq.testlist

I can try to reproduce the issue without i-g-t/piglit, but it might take 
some trying. Definitely suspend-to-ram and writes to disk are needed to 
trigger this, gem_exec_suspend/basic-s3 can loop quite well without 
panicking.
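
For anyone without i-g-t/piglit at hand, the two ingredients named above (writes to disk plus suspend-to-ram) could be combined roughly as follows. This is a hypothetical sketch only: nothing in this thread confirms that it triggers the hang. It assumes rtcwake from util-linux and a dd that accepts conv=fsync, needs root to suspend, and the file path, cycle count and RUN_SUSPEND_TEST variable are invented for illustration.

```shell
#!/bin/sh
# Hypothetical stress loop: keep writes in flight, then suspend to RAM.
# Not confirmed to reproduce the hang discussed in this thread.

# suspend_cmd SECONDS
# Prints the command used to suspend to RAM and wake up after SECONDS.
suspend_cmd() {
    printf 'rtcwake -m mem -s %s\n' "$1"
}

# run_cycle FILE SECONDS
# Starts a background writer, suspends, then waits for the writer to finish.
run_cycle() {
    dd if=/dev/zero of="$1" bs=1M count=512 conv=fsync 2>/dev/null &
    writer=$!
    $(suspend_cmd "$2")   # runs the printed rtcwake command
    wait "$writer"        # would block here if the disk stops responding
    rm -f "$1"
}

# Do nothing unless explicitly asked to, since this suspends the machine.
if [ "${RUN_SUSPEND_TEST:-0}" = 1 ]; then
    i=0
    while [ "$i" -lt 10 ]; do
        run_cycle /var/tmp/scsi-mq-stress.bin 20
        i=$((i + 1))
    done
fi
```

It would be started as root with RUN_SUSPEND_TEST=1 set in the environment; without that variable the script only defines the helpers.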


Tomi
--
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo


Re: linux-next scsi-mq hang in suspend-resume

2017-07-14 Thread Christoph Hellwig

Tomi,

can you please report what hardware this is on (e.g. libata or
real scsi, which driver), a kernel config and the actual command
used to suspend the system (to ram, to disk?) so that I can try to
reproduce it?


Re: linux-next scsi-mq hang in suspend-resume

2017-07-13 Thread Christoph Hellwig
On Wed, Jul 12, 2017 at 10:50:19AM -0600, Jens Axboe wrote:
> On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> > Hello there,
> > 
> > I've been running Intel GFX CI testing for linux DRM-Tip i915 driver, 
> > and couple of weeks ago we took linux-next for a ride to see what kind 
> > of integration problems there might pop up when pulling 4.13-rc1. 
> > Latest results can be seen at
> > 
> > https://intel-gfx-ci.01.org/CI/next-issues.html
> > https://intel-gfx-ci.01.org/CI/next-all.html
> > 
> > The purple blocks are hangs, starting from 20170628 (20170627 was 
> > untestable due to locking changes which were reverted). Traces were 
> > pointing to ext4 but bisecting between good 20170626 and bad 20170628 
> > pointed to:
> > 
> > commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> > Date:   Fri Jun 16 10:27:55 2017 +0200
> > 
> > scsi: default to scsi-mq
> > 
> > Reproduction is 100% or close to it when running two i-g-t tests as a 
> > testlist. I'm assuming that it creates the correct amount or pattern 
> > of actions to the device. The testlist consists of the following 
> > lines:
> > 
> > igt@gem_exec_gttfill@basic
> > igt@gem_exec_suspend@basic-s3
> > 
> > Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts. 
> > Configuration option was copied over on testhosts and 20170712 was re-
> > tested, that's why today looks so much greener.
> > 
> > More information including traces and reproduction instructions at
> > https://bugzilla.kernel.org/show_bug.cgi?id=196223
> > 
> > I can run patchsets through the farm, if needed. In addition, daily 
> > linux-next tags are automatically tested and results published.
> 
> Christoph, any ideas? Smells like something in SCSI, my notebook
> with nvme/blk-mq suspend/resumes just fine.

There isn't much mq-specific scsi code, so it's probably an interaction
of both.  I'll see if the bugzilla has enough data to reproduce it
locally.

Although I really wish people wouldn't use #TY^$Y^$ bugzilla and just
post the important data to the list :(

> 
> -- 
> Jens Axboe
> 
---end quoted text---


Re: linux-next scsi-mq hang in suspend-resume

2017-07-12 Thread Jens Axboe
On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> Hello there,
> 
> I've been running Intel GFX CI testing for linux DRM-Tip i915 driver, 
> and couple of weeks ago we took linux-next for a ride to see what kind 
> of integration problems there might pop up when pulling 4.13-rc1. 
> Latest results can be seen at
> 
> https://intel-gfx-ci.01.org/CI/next-issues.html
> https://intel-gfx-ci.01.org/CI/next-all.html
> 
> The purple blocks are hangs, starting from 20170628 (20170627 was 
> untestable due to locking changes which were reverted). Traces were 
> pointing to ext4 but bisecting between good 20170626 and bad 20170628 
> pointed to:
> 
> commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> Date:   Fri Jun 16 10:27:55 2017 +0200
> 
> scsi: default to scsi-mq
> 
> Reproduction is 100% or close to it when running two i-g-t tests as a 
> testlist. I'm assuming that it creates the correct amount or pattern 
> of actions to the device. The testlist consists of the following 
> lines:
> 
> igt@gem_exec_gttfill@basic
> igt@gem_exec_suspend@basic-s3
> 
> Kernel option scsi_mod.use_blk_mq=0 hides the issue on testhosts. 
> Configuration option was copied over on testhosts and 20170712 was re-
> tested, that's why today looks so much greener.
> 
> More information including traces and reproduction instructions at
> https://bugzilla.kernel.org/show_bug.cgi?id=196223
> 
> I can run patchsets through the farm, if needed. In addition, daily 
> linux-next tags are automatically tested and results published.

Christoph, any ideas? Smells like something in SCSI, my notebook
with nvme/blk-mq suspend/resumes just fine.

-- 
Jens Axboe