Re: disk-io lockup in 4.14.13 kernel

2018-03-27 Thread Jaco Kroon
Hi Bart,
> The above call trace means that SysRq-l was triggered, either via the keyboard
> or through procfs. I don't think that there is any information in the above
> that reveals the root cause of why a reboot was necessary.
I triggered it hoping to get a stack trace of the deadlocking process, to
find where the lock is being taken that ends up blocking.  But then I
realized you mentioned sleeping, which may mean there is no stack trace to
capture because no process is actually running on a CPU?  (I'm not that
intimately familiar with kernel internals - I mostly do user-space
development, so the kernel is an interesting environment for me, but not
really part of my day-to-day work.)
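For the record, the trigger was through procfs rather than the keyboard -
roughly the following (assuming sysrq functions are enabled via
/proc/sys/kernel/sysrq):

  # enable all sysrq functions (the value is a bitmask; 1 enables everything)
  echo 1 > /proc/sys/kernel/sysrq
  # 'l' requests a backtrace of all active CPUs
  echo l > /proc/sysrq-trigger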
>
> What I do myself to identify the root cause of weird kernel behavior is to
> rebuild the kernel with a bunch of debugging options enabled and then try to
> repeat the trigger that caused the weird behavior. If this causes the kernel
> debugging code to produce additional output, that output can be very helpful
> for identifying what is going on. This approach does not always work, however.
Yeah, I find debugging is often more of an art than a science.  I saw a
patch for the mpt3sas issue, but the email was munged and I wasn't able
to apply the patch from it - I've emailed w.r.t. that but have not
received a response.  Have you managed to get anything more sensible
from them?
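For what it's worth, when I do get to the debug rebuild I assume it boils
down to something like the following (a rough sketch against my 4.14 tree;
option names may differ between versions, and I'm leaving
CONFIG_DEBUG_ATOMIC_SLEEP out for now given your mpt3sas warning below):

  # from the kernel source tree: switch on a few debug options and rebuild
  ./scripts/config --enable DEBUG_KERNEL \
                   --enable PROVE_LOCKING \
                   --enable DEBUG_LIST
  make olddefconfig && make -j$(nproc)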

Kind Regards,
Jaco


Re: disk-io lockup in 4.14.13 kernel

2018-03-24 Thread Jaco Kroon
Hi Bart,

Does the following go with your theory:

[452545.945561] sysrq: SysRq : Show backtrace of all active CPUs
[452545.946182] NMI backtrace for cpu 5
[452545.946185] CPU: 5 PID: 31921 Comm: bash Tainted: G  I
4.14.13-uls #2
[452545.946186] Hardware name: Supermicro
SSG-5048R-E1CR36L/X10SRH-CLN4F, BIOS T20140520103247 05/20/2014
[452545.946187] Call Trace:
[452545.946196]  dump_stack+0x46/0x5a
[452545.946200]  nmi_cpu_backtrace+0xb3/0xc0
[452545.946205]  ? irq_force_complete_move+0xd0/0xd0
[452545.946208]  nmi_trigger_cpumask_backtrace+0x8f/0xc0
[452545.946212]  __handle_sysrq+0xec/0x140
[452545.946216]  write_sysrq_trigger+0x26/0x30
[452545.946219]  proc_reg_write+0x38/0x60
[452545.946222]  __vfs_write+0x1e/0x130
[452545.946225]  vfs_write+0xab/0x190
[452545.946228]  SyS_write+0x3d/0xa0
[452545.946233]  entry_SYSCALL_64_fastpath+0x13/0x6c
[452545.946236] RIP: 0033:0x7f6b85db52d0
[452545.946238] RSP: 002b:7fff6f9479e8 EFLAGS: 0246
[452545.946241] Sending NMI from CPU 5 to CPUs 0-4:
[452545.946272] NMI backtrace for cpu 0 skipped: idling at pc
0x8162b0a0
[452545.946275] NMI backtrace for cpu 3 skipped: idling at pc
0x8162b0a0
[452545.946279] NMI backtrace for cpu 4 skipped: idling at pc
0x8162b0a0
[452545.946283] NMI backtrace for cpu 2 skipped: idling at pc
0x8162b0a0
[452545.946287] NMI backtrace for cpu 1 skipped: idling at pc
0x8162b0a0

I'm not sure how to map that address back to a function or something, and
I had to reboot, so I'm not sure whether that can still be done.
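(I assume the address could still be resolved against the vmlinux or
System.map from that exact kernel build, even after the reboot - something
along these lines, with <address> standing in for the value from the trace:

  # with the vmlinux from the same build (needs debug info for file:line output)
  addr2line -f -e vmlinux <address>
  # or via gdb's symbol table
  gdb -batch -ex 'info symbol <address>' vmlinux

On the running system /proc/kallsyms would have worked too, but only for
that same boot.)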

Kind Regards,
Jaco

On 13/03/2018 19:24, Bart Van Assche wrote:
> On Tue, 2018-03-13 at 19:16 +0200, Jaco Kroon wrote:
>> The server in question is the destination of numerous rsync/ssh sessions
>> (used primarily for backups) and is not intended as a real-time system.
>> I'm happy to enable whichever of the options below you indicate would be
>> helpful in pinpointing the problem (assuming we're not looking at an 8x
>> more-CPU-required kind of degradation, as I've recently seen with asterisk
>> lock debugging enabled). I've marked in bold below what I assume would
>> be helpful.  If you don't mind confirming for me, I'll enable them and
>> schedule a reboot.
> Hello Jaco,
>
> My recommendation is to wait until the mpt3sas maintainers post a fix
> for what I reported yesterday on the linux-scsi mailing list. Enabling
> CONFIG_DEBUG_ATOMIC_SLEEP has a very annoying consequence for the mpt3sas
> driver, namely that the first process that hits the "sleep in atomic context"
> bug gets killed.
> production setup.
>
> Bart.



Re: disk-io lockup in 4.14.13 kernel

2018-03-13 Thread Jaco Kroon
Hi Bart,


On 13/03/2018 16:10, Bart Van Assche wrote:
> On Tue, 2018-03-13 at 11:30 +0200, Jaco Kroon wrote:
>> On 11/03/2018 07:00, Bart Van Assche wrote:
>>> Did I see correctly that /dev/sdm is behind a MPT SAS controller? You may
>>> want to contact the authors of this driver and Cc the linux-scsi mailing
>>> list. Sorry but I'm not familiar with the mpt3sas driver myself.
>> You did see correctly, all drives are behind the MPT SAS.  If there is
>> in fact a problem with the driver (or the controller itself for that
>> matter) it would explain things.  It would also explain why we don't see
>> this problem on other hosts.
> You may want to have a look at the following report, something I ran into
> myself yesterday: https://marc.info/?l=linux-scsi&m=152087904024146.
I quickly checked my dmesg logs and I'm not seeing that particular
message; could it be that only newer kernels started warning about it?

I'm not seeing any references to the
scsih_get_enclosure_logicalid_chassis_slot() function in the code here,
so there is obviously a newer driver in newer kernels.  The scsih_get_*
calls I do see are in mpt3sas_scsih.c, and I'm not seeing (direct) calls
to _config_request at all.  At least there is activity on the driver,
which is always a good sign that someone is attending to problems.

In the meantime I'll queue an upgrade to 4.16 for when it's released in a
couple of days/weeks.

Kind Regards,
Jaco


Re: disk-io lockup in 4.14.13 kernel

2018-03-13 Thread Jaco Kroon
Hi Bart,

On 11/03/2018 07:00, Bart Van Assche wrote:
> On Sun, 2018-03-11 at 06:33 +0200, Jaco Kroon wrote:
>> crowsnest ~ # find /sys -name sdm
>> /sys/kernel/debug/block/sdm
>> /sys/devices/pci:00/:00:01.0/:01:00.0/host0/port-0:0/expander-0:0/port-0:0:0/expander-0:1/port-0:1:0/end_device-0:1:0/target0:0:13/0:0:13:0/block/sdm
>> /sys/class/block/sdm
>> /sys/block/sdm
>>
>>> lspci
>> crowsnest ~ # lspci
>> [ ... ]
>> 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic
>> SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
>> [ ... ]
> Hi Jaco,
>
> Recently a bug fix for the mq-deadline scheduler was posted but I don't
> think that that patch will change the behavior on your setup since you are
> not using ZBC disks. See also "mq-deadline: Make sure to always unlock
> zones" (https://marc.info/?l=linux-block&m=151983933714492).
From that link:

In case of a failed write request (all retries failed) and when using
libata, the SCSI error handler calls scsi_finish_command(). In the
case of blk-mq this means that scsi_mq_done() does not get called,
that blk_mq_complete_request() does not get called and also that the
mq-deadline .completed_request() method is not called. This results in
the target zone of the failed write request being left in a locked
state, preventing that any new write requests are issued to the same
zone.

Why do you say that this won't make a difference?  To me it sounds like
it could very well be related.  You're talking about "ZBC" disks - I'm
going to assume that ZBC is Zoned Block something-or-other, and reading
up on it I get really confused.

Either way, the source version onto which the patch applies is not
4.14.13 code (the patch references line 756 and the source in 4.14.13
only has 679 lines of code).  I also can't find any kind of locking that
I can imagine causing a problem, unless there are problems inside
__dd_dispatch_request, blk_mq_sched_try_merge or dd_insert_request (none
of which contains any loops that I can see at a quick glance, at least
down to elv_merge; from there it gets more complicated).
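(For what it's worth, the quickest sanity check I know of is a dry run of
the patch against the 4.14.13 tree - untested here, and the patch file name
is just a placeholder:

  # from the root of the 4.14.13 source tree
  git apply --check --verbose mq-deadline-unlock-zones.patch
  # or without git:
  patch -p1 --dry-run < mq-deadline-unlock-zones.patch

If that fails, it at least confirms the patch targets a newer tree.)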

>
> Did I see correctly that /dev/sdm is behind a MPT SAS controller? You may
> want to contact the authors of this driver and Cc the linux-scsi mailing
> list. Sorry but I'm not familiar with the mpt3sas driver myself.
You did see correctly, all drives are behind the MPT SAS.  If there is
in fact a problem with the driver (or the controller itself for that
matter) it would explain things.  It would also explain why we don't see
this problem on other hosts.

I'll contact them as well.

Kind Regards,
Jaco


Re: disk-io lockup in 4.14.13 kernel

2018-03-10 Thread Jaco Kroon
Hi Bart,

On 11/03/2018 05:08, Bart Van Assche wrote:
> On Sat, 2018-03-10 at 22:56 +0200, Jaco Kroon wrote:
>> On 22/02/2018 18:46, Bart Van Assche wrote:
>>> On 02/22/18 02:58, Jaco Kroon wrote:
>>>> We've been seeing sporadic IO lockups on recent kernels.
>>> Are you using the legacy I/O stack or blk-mq? If you are not yet using
>>> blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is
>>> reproducible with blk-mq, can you share the output of the following
>>> command:
>>>
>>> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
>> Looks like the lockups are far more frequent with everything on mq. 
>> Just to confirm:
>>
>> CONFIG_SCSI_MQ_DEFAULT=y
>> CONFIG_DM_MQ_DEFAULT=y
>>
>>
>> Please find attached the output from the requested.
>>
>> http://downloads.uls.co.za/lockup/lockup-20180310-223036/ contains
>> additional stuff, surrounding that.
> Thanks, that helps. In block_debug.txt I see that only for /dev/sdm a
> request got stuck:
>
> $ grep 'busy=[^0]' block_debug.txt  
> ./sdm/hctx0/tags:busy=9
>
> But I can't see in the output that has been shared which I/O scheduler has
> been configured nor which SCSI LLD is involved. Can you please also share
> that information, e.g. by providing the output of the following commands:
I had to reboot, but I trust the information should still be valid.  It
could be scheduler related, now that you mention it.
> cat /sys/block/sdm/queue/scheduler
crowsnest ~ # cat /sys/block/sdm/queue/scheduler
[mq-deadline] kyber bfq none

I used to get better performance (on average) from deadline than from the
others, but I don't see the elevator and other options that used to look
familiar any more.  Having said that, I just cross-referenced with a few
other systems: they all use deadline too, and I'm not seeing similar
behavior there.  The most notable difference with those systems is that
they use raid1 rather than raid6.  Could just be coincidence.
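If it does turn out to be scheduler related, I can at least switch the
scheduler per device at runtime to compare - something like (taking sdm as
the example):

  # switch just this device, no reboot required
  echo kyber > /sys/block/sdm/queue/scheduler
  # and back again
  echo mq-deadline > /sys/block/sdm/queue/scheduler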
> find /sys -name sdm # provides the PCI ID
crowsnest ~ # find /sys -name sdm
/sys/kernel/debug/block/sdm
/sys/devices/pci:00/:00:01.0/:01:00.0/host0/port-0:0/expander-0:0/port-0:0:0/expander-0:1/port-0:1:0/end_device-0:1:0/target0:0:13/0:0:13:0/block/sdm
/sys/class/block/sdm
/sys/block/sdm

> lspci
crowsnest ~ # lspci
00:00.0 Host bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMI2 (rev 02)
00:01.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 1 (rev 02)
00:02.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 2 (rev 02)
00:02.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 2 (rev 02)
00:03.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 3 (rev 02)
00:03.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 3 (rev 02)
00:03.3 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 3 (rev 02)
00:04.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 0 (rev 02)
00:04.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 1 (rev 02)
00:04.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 2 (rev 02)
00:04.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 3 (rev 02)
00:04.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 4 (rev 02)
00:04.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 5 (rev 02)
00:04.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 6 (rev 02)
00:04.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 7 (rev 02)
00:05.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Address Map, VTd_Misc, System Management (rev 02)
00:05.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Hot Plug (rev 02)
00:05.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 RAS, Control Status and Global Errors (rev 02)
00:05.4 PIC: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
(rev 02)
00:11.0 Unassigned class [ff00]: Intel Corporation C610/X99 series
chipset SPSR (rev 05)
00:11.4 SATA controller: Intel Corporation C610/X99 series chipset sSATA
Controller [AHCI mode] (rev 05)
00:14.0 USB controller: Intel Corporation C610/X99 series chipset USB
xHCI Host Controller (rev 05)
00:16.0 Communication controller: Intel Corporation C610/X99 series
chipset MEI Controller #1 (rev 05)
00:16.1 Communication controller: Intel Corporation C610/X99 series
chipset MEI Controller #2 (rev 05)
00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB
Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation C610/X

Re: disk-io lockup in 4.14.13 kernel

2018-02-23 Thread Jaco Kroon
Hi Bart,

Thank you for your response.

On 22/02/2018 18:46, Bart Van Assche wrote:
> On 02/22/18 02:58, Jaco Kroon wrote:
>> We've been seeing sporadic IO lockups on recent kernels.
>
> Are you using the legacy I/O stack or blk-mq? If you are not yet using
> blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is
> reproducible with blk-mq, can you share the output of the following
> command:
crowsnest ~ # zgrep MQ /proc/config.gz

CONFIG_SCSI_MQ_DEFAULT=y
# CONFIG_DM_MQ_DEFAULT is not set

... oi, so that's a very valid question.

blk-mq is thus off by default for dm; I've now enabled it on the "live"
system with "echo 1 > /sys/module/dm_mod/parameters/use_blk_mq".

I've also modified the kernel config to set CONFIG_DM_MQ_DEFAULT (I know
I can just set this on cmdline too).
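For reference, I believe the equivalent on the kernel command line (for the
next boot) would be something along the lines of:

  # appended to the kernel command line in the bootloader configuration
  scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1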

The only immediately visible effect is that I seem to more consistently
get >300MB/s (and more frequently >400MB/s) off the array in terms of
read speed, where normally I'd expect a consistent 200MB/s with spikes
just over 300MB/s and very infrequently over 400MB/s.  This is a very
simple spot check with iotop over approximately a minute.

I am seeing I/O errors in dmesg from time to time; to me this hints that
the problem may be related to some error path.

Just so we're clear, we're seeing this happen approximately once a
month, so if switching on dm_mod.use_blk_mq solves it then I won't
really know beyond a shadow of a doubt, and the only way of "knowing" is
if we can get to an uptime of three months or so ...

>
> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
I don't have a /sys/kernel/debug folder - I've enabled CONFIG_DEBUG_FS
and BLK_DEBUG_FS and will reboot at the first opportunity.  As a general
rule, is there additional overhead to having debugfs enabled?  Any other
risks that I should be aware of?  In essence, are there any disadvantages
to simply leaving DEBUG_FS enabled?  I did note that a few extra DEBUG
options pop up for other modules ... so my gut says to leave it disabled
by default and enable it when needed.
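I also realise that even with the option compiled in, debugfs may still
need to be mounted if the init scripts don't do it - presumably something
like:

  # mount debugfs if it isn't mounted already
  mountpoint -q /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug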

>
> Thanks,
Thank you!

Kind Regards,
Jaco