Bug#913138: linux: I/O on md RAID 6 hangs completely

2019-01-09 Thread Florian Schmidt

i have the same problem here with one of my boxes.

it happens with disk I/O going via raid1 to my SATA SSDs (sda & sdb).

disabling blk_mq solves this issue for me.

i am using 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1
but i had the same problems with 4.18.
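
To check whether blk-mq is actually in effect on a running system, one option is to read the module parameters back from sysfs. The following is a minimal sketch, assuming the kernel exposes `use_blk_mq` under `/sys/module/<module>/parameters/` (kernels that have dropped the parameter will not have these files); the function name `blk_mq_status` is made up for illustration.

```python
from pathlib import Path

# Modules whose use_blk_mq parameter is discussed in this thread.
MODULES = ("scsi_mod", "dm_mod")


def blk_mq_status(sys_module=Path("/sys/module")):
    """Return {module: 'Y', 'N' or None} for each module's use_blk_mq value.

    None means the file could not be read: the module is not loaded or the
    kernel does not expose this parameter (an assumption of this sketch).
    """
    status = {}
    for module in MODULES:
        param = Path(sys_module) / module / "parameters" / "use_blk_mq"
        try:
            status[module] = param.read_text().strip()
        except OSError:
            status[module] = None
    return status


if __name__ == "__main__":
    for module, value in blk_mq_status().items():
        shown = value if value is not None else "not available"
        print(f"{module}.use_blk_mq = {shown}")
```

The base directory is a parameter only so the function can be pointed at a test directory instead of the real sysfs.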

my raid setup:
$ pvdisplay
 --- Physical volume ---
  PV Name   /dev/sdb
  VG Name   flo_data
  PV Size   465.76 GiB / not usable 4.02 MiB
  Allocatable   yes
  PE Size   4.00 MiB
  Total PE  119234
  Free PE   9153
  Allocated PE  110081
  PV UUID   7vbn06-H2vv-6O08-ndZa-Z1LF-xKN1-QcTBGh
   
  --- Physical volume ---
  PV Name   /dev/sda
  VG Name   flo_data
  PV Size   465.76 GiB / not usable 4.02 MiB
  Allocatable   yes
  PE Size   4.00 MiB
  Total PE  119234
  Free PE   9153
  Allocated PE  110081
  PV UUID   xYIXB0-7SF6-KexY-Aw4S-31uL-7mAF-7sHTqj
   
$ lvdisplay

  --- Logical volume ---
  LV Path                /dev/flo_data/home
  LV Name                home
  VG Name                flo_data
  LV UUID                Mf5mBg-Gj1c-Gclf-e3WG-mK2g-icRN-NO0SZg
  LV Write Access        read/write
  LV Creation host, time uter, 2018-05-24 19:57:17 +0200
  LV Status              available
  # open                 2
  LV Size                430.00 GiB
  Current LE             110080
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4


i get "blocked for more than 120 seconds" kernel messages and all stacks seem 
to have in common:
... proc a:
[ 5076.429209]  schedule+0x28/0x80
[ 5076.429218]  md_super_wait+0x6e/0xa0 [md_mod]
[ 5076.429220]  ? finish_wait+0x80/0x80
[ 5076.429229]  md_bitmap_wait_writes+0x93/0xa0 [md_mod]
[ 5076.429233]  ? __wake_up_common_lock+0x89/0xc0
[ 5076.429242]  md_bitmap_unplug+0xc7/0x110 [md_mod]
[ 5076.429246]  flush_bio_list+0x1c/0xd0 [raid1]
[ 5076.429249]  raid1_unplug+0xb9/0xd0 [raid1]
[ 5076.429254]  blk_flush_plug_list+0xcf/0x240
[ 5076.429257]  blk_finish_plug+0x21/0x2e
[ 5076.429286]  ext4_writepages+0x68f/0xf00 [ext4]
... proc b:
[ 5076.429429]  schedule+0x28/0x80
[ 5076.429438]  md_super_wait+0x6e/0xa0 [md_mod]
[ 5076.429440]  ? finish_wait+0x80/0x80
[ 5076.429449]  md_bitmap_wait_writes+0x93/0xa0 [md_mod]
[ 5076.429452]  ? __wake_up_common_lock+0x89/0xc0
[ 5076.429461]  md_bitmap_unplug+0xc7/0x110 [md_mod]
[ 5076.429465]  flush_bio_list+0x1c/0xd0 [raid1]
[ 5076.429469]  raid1_unplug+0xb9/0xd0 [raid1]
[ 5076.429472]  blk_flush_plug_list+0xcf/0x240
[ 5076.429475]  blk_finish_plug+0x21/0x2e
[ 5076.429497]  ext4_writepages+0x68f/0xf00 [ext4]
... proc c:
[ 5076.429633]  schedule+0x28/0x80
[ 5076.429641]  md_super_wait+0x6e/0xa0 [md_mod]
[ 5076.429644]  ? finish_wait+0x80/0x80
[ 5076.429652]  md_bitmap_wait_writes+0x93/0xa0 [md_mod]
[ 5076.429656]  ? __wake_up_common_lock+0x89/0xc0
[ 5076.429665]  md_bitmap_unplug+0xc7/0x110 [md_mod]
[ 5076.429669]  flush_bio_list+0x1c/0xd0 [raid1]
[ 5076.429672]  raid1_unplug+0xb9/0xd0 [raid1]
[ 5076.429675]  blk_flush_plug_list+0xcf/0x240
[ 5076.429678]  blk_finish_plug+0x21/0x2e
[ 5076.429701]  ext4_writepages+0x68f/0xf00 [ext4]
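
The comparison of the three stacks above can also be done mechanically. This is a rough sketch, assuming traces in the dmesg format shown here (`[ time]  function+0xoff/0xsize [module]`); frames prefixed with `?` are skipped because the kernel marks them as unreliable. The helper names `frames` and `common_frames` are hypothetical.

```python
import re

# Matches e.g. "[ 5076.429218]  md_super_wait+0x6e/0xa0 [md_mod]".
# Group 1 is set for "? name+0x..." frames, which the kernel marks unreliable.
FRAME = re.compile(
    r"\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(trace_text):
    """Return the function names of the reliable frames, innermost first."""
    names = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match and not match.group(1):
            names.append(match.group(2))
    return names


def common_frames(traces):
    """Return frames present in every trace, ordered as in the first trace."""
    lists = [frames(t) for t in traces]
    if not lists:
        return []
    shared = set(lists[0]).intersection(*(set(l) for l in lists[1:]))
    result = []
    for name in lists[0]:
        if name in shared and name not in result:
            result.append(name)
    return result
```

Calling `common_frames` with the text of each "proc" block as a separate string returns the frames they all share, which makes it easier to compare traces from different reports.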

my journalctl -a output:
Jan 09 19:27:48 uter kernel: Linux version 4.19.0-1-amd64 (debian-ker...@lists.debian.org) (gcc version 8.2.0 (Debian 8.2.0-13)) #1 SMP Debian 4.19.12-1 (2018-12-22)
Jan 09 19:27:48 uter kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.0-1-amd64 root=/dev/sdc1 ro quiet
Jan 09 19:27:48 uter kernel: x86/fpu: x87 FPU will use FXSAVE
Jan 09 19:27:48 uter kernel: BIOS-provided physical RAM map:
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0x-0x0009e7ff] usable
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0x0009f800-0x0009] reserved
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0x000f-0x000f] reserved
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0x0010-0xcfed] usable
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0xcfee-0xcfee2fff] ACPI NVS
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0xcfee3000-0xcfee] ACPI data
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0xcfef-0xcfef] reserved
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0xf000-0xf3ff] reserved
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0xfec0-0x] reserved
Jan 09 19:27:48 uter kernel: BIOS-e820: [mem 0x0001-0x00022fff] usable
Jan 09 19:27:48 uter kernel: NX (Execute Disable) protection: active
Jan 09 19:27:48 uter kernel: SMBIOS 2.4 present.
Jan 09 19:27:48 uter kernel: DMI: Gigabyte Technology Co., Ltd. X38-DS4/X38-DS4, BIOS F1 11/23/2007
Jan 09 19:27:48 uter kernel: tsc: Fast TSC calibration using PIT
Jan 09 19:27:48 uter 

Bug#913138: linux: I/O on md RAID 6 hangs completely

2018-11-22 Thread Cesare Leonardi
On Thu, 08 Nov 2018 23:28:16 +0100 Stanisław wrote:

> I suffer the same problem while running RAID1 with kernel 4.18.10-2.


Me too.
For me this has been happening since the switch from 4.16 to 4.17.x, on
two different PCs, both with LVM-based RAID1. I had already opened bug
#913119 when I found this bug report, and the reply from Stanisław
was really helpful for me.
To me, this bug, mine, and the already closed bug #904822 have the same
root cause: the stack traces reported by dmesg are very similar. The
common denominators are some sort of LVM RAID and the range of kernels used.

> ...Someone else suggested this might be related to using "blk-mq", so
> could you try with these parameters:
>
> dm_mod.use_blk_mq=0 scsi_mod.use_blk_mq=0


This seems to have solved the problem for me.
I've tested these boot parameters on one of the affected PCs and it has
now been running for more than three days. Before, with kernels from
4.17.x to the current Debian 4.18.10-2+b1, the system showed an oops
within half a day to a day.
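
To confirm that the two parameters really reached the kernel after a reboot, the boot command line can be inspected. A minimal sketch, assuming the command line is available as a string (on Linux it can be read from `/proc/cmdline`); the function name `blk_mq_params` is hypothetical.

```python
def blk_mq_params(cmdline):
    """Return every '<module>.use_blk_mq=<value>' setting on a kernel command line.

    If a parameter is given more than once, the last value is kept.
    """
    settings = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key.endswith(".use_blk_mq"):
            settings[key] = value
    return settings


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            print(blk_mq_params(handle.read()))
    except OSError:
        print("could not read /proc/cmdline")
```

An empty result means neither parameter was passed, so the kernel's built-in defaults apply.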


That disabling these parameters helps is plausible, since Debian's kernel
enabled SCSI_MQ_DEFAULT and DM_MQ_DEFAULT with 4.17~rc7-1~exp1.



> Also, do you have laptop-mode-tools installed?


No, not installed here.

I've checked with two other distributions I have here, to see what they 
have done with SCSI_MQ_DEFAULT and DM_MQ_DEFAULT parameters:


- Arch Linux (kernel 4.18.16-arch1-1-ARCH): both disabled.
- Arch Linux (kernel 4.19.2-arch1-1-ARCH): both enabled.

- Fedora server 29 (kernel 4.18.17-300.fc29.x86_64): both disabled.
- Fedora server 29 (kernel 4.19.2-301.fc29.x86_64): both disabled.
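
A comparison like the one above can be repeated on any installed kernel by looking at its build configuration. The sketch below assumes the configuration text is at hand; Debian ships it as `/boot/config-<release>`, while other distributions may keep it elsewhere. The function name `mq_defaults` is made up for illustration.

```python
import re

OPTIONS = ("CONFIG_SCSI_MQ_DEFAULT", "CONFIG_DM_MQ_DEFAULT")
NOT_SET = re.compile(r"#\s*(CONFIG_\w+) is not set")


def mq_defaults(config_text):
    """Return {option: 'y', 'n' or None} for the two blk-mq default options.

    'n' is reported for '# CONFIG_X is not set'; None means the option does
    not appear in the configuration at all.
    """
    found = {name: None for name in OPTIONS}
    for raw in config_text.splitlines():
        line = raw.strip()
        disabled = NOT_SET.fullmatch(line)
        if disabled:
            if disabled.group(1) in found:
                found[disabled.group(1)] = "n"
            continue
        key, sep, value = line.partition("=")
        if sep and key in found:
            found[key] = value
    return found


if __name__ == "__main__":
    import platform
    path = f"/boot/config-{platform.uname().release}"  # Debian layout, an assumption
    try:
        with open(path) as handle:
            print(mq_defaults(handle.read()))
    except OSError:
        print(f"no kernel config found at {path}")
```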

But I was unable to find out whether upstream is aware of this problem
and whether it's already resolved in 4.19.


Cesare.



Bug#913138: linux: I/O on md RAID 6 hangs completely

2018-11-08 Thread Stanisław
I suffer the same problem while running RAID1 with kernel 4.18.10-2.

I have found a hint in the Debian kernel mailing list; however, I haven't
tested it yet:

...Someone else suggested this might be related to using blk-mq, so
could you try with these parameters:

dm_mod.use_blk_mq=0 scsi_mod.use_blk_mq=0

Also, do you have laptop-mode-tools installed?

Ben...

Please read: lists.debian.org lists.debian.org
lists.debian.org lists.debian.org lists.debian.org lists.debian.org


Bug#913138: linux: I/O on md RAID 6 hangs completely

2018-11-07 Thread Thorsten Glaser
On Wed, 7 Nov 2018, Thorsten Glaser wrote:

> Normally, if I leave the system alone for a while (half an hour or
> so), it resolves itself, but that’s unacceptable for a work system,

The system hasn’t recovered yet today. There’s nothing new in dmesg.

bye,
//mirabilos
-- 
tarent solutions GmbH
Rochusstraße 2-4, D-53123 Bonn • http://www.tarent.de/
Tel: +49 228 54881-393 • Fax: +49 228 54881-235
HRB 5168 (AG Bonn) • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg



Bug#913138: linux: I/O on md RAID 6 hangs completely

2018-11-07 Thread Thorsten Glaser
Package: linux-image-4.18.0-2-amd64
Version: 4.18.10-2
Severity: normal

Occasionally, my system begins freezing (processes doing a lot of
I/O enter D state). It is still somewhat usable for already cached
stuff (starting a new shell tab in GNU screen works, lynx does, …
but e.g. the debsums verify of reportbug freezes it, a new reportbug
with --no-verify again works though).
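
When the freeze occurs, it can help to record which processes are stuck in D state (uninterruptible sleep). A small sketch, assuming a Linux `/proc` filesystem in which the state is the first field after the parenthesised command name in `/proc/<pid>/stat`; the helper names are hypothetical.

```python
from pathlib import Path


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split at the LAST ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def blocked_processes(proc=Path("/proc")):
    """Return a sorted list of (pid, comm) for processes currently in D state."""
    blocked = []
    for stat in Path(proc).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat(stat.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

A process that shows up here only briefly is normal; the hang described in this report would keep the same processes listed for minutes.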

Normally, if I leave the system alone for a while (half an hour or
so), it resolves itself, but that’s unacceptable for a work system,
especially if I can’t engage the screen lock (most of the time, I
can, though).

This has happened a few times over the last weeks, sometimes twice
in a single day, oftentimes not at all, and with multiple recent
kernel images.

Today’s occurrence is from
Linux tglase.lan.tarent.de 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64 GNU/Linux


dmesg:

[0.00] microcode: microcode updated early to revision 0x1d, date = 2018-05-11
[0.00] Linux version 4.18.0-2-amd64 (debian-ker...@lists.debian.org) (gcc version 7.3.0 (Debian 7.3.0-29)) #1 SMP Debian 4.18.10-2 (2018-10-07)
[0.00] Command line: BOOT_IMAGE=/vmlinuz-4.18.0-2-amd64 root=/dev/mapper/vg--tglase-lv--tglase ro rootdelay=5 net.ifnames=0 syscall.x32=y vsyscall=emulate kaslr
[0.00] x86/fpu: x87 FPU will use FXSAVE
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009dbff] usable
[0.00] BIOS-e820: [mem 0x0009f800-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xcfec] usable
[0.00] BIOS-e820: [mem 0xcfed-0xcfed0fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xcfed1000-0xcfed] ACPI data
[0.00] BIOS-e820: [mem 0xcfee-0xcfef] reserved
[0.00] BIOS-e820: [mem 0xf400-0xf7ff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00062fff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Gigabyte Technology Co., Ltd. X58-USB3/X58-USB3, BIOS F5 09/07/2011
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] last_pfn = 0x63 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-CDFFF write-protect
[0.00]   CE000-E uncachable
[0.00]   F-F write-through
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 0 mask F write-back
[0.00]   1 base 0E000 mask FE000 uncachable
[0.00]   2 base 0D000 mask FF000 uncachable
[0.00]   3 base 1 mask F write-back
[0.00]   4 base 2 mask E write-back
[0.00]   5 base 4 mask C write-back
[0.00]   6 base 5 mask F write-back
[0.00]   7 base 6 mask FC000 write-back
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[0.00] e820: update [mem 0xd000-0x] usable ==> reserved
[0.00] last_pfn = 0xcfed0 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000f5a60-0x000f5a6f] mapped at [(ptrval)]
[0.00] Base memory trampoline at [(ptrval)] 97000 size 24576
[0.00] BRK [0x2ba84b000, 0x2ba84bfff] PGTABLE
[0.00] BRK [0x2ba84c000, 0x2ba84cfff] PGTABLE
[0.00] BRK [0x2ba84d000, 0x2ba84dfff] PGTABLE
[0.00] BRK [0x2ba84e000, 0x2ba84efff] PGTABLE
[0.00] BRK [0x2ba84f000, 0x2ba84] PGTABLE
[0.00] BRK [0x2ba85, 0x2ba850fff] PGTABLE
[0.00] BRK [0x2ba851000, 0x2ba851fff] PGTABLE
[0.00] BRK [0x2ba852000, 0x2ba852fff] PGTABLE
[0.00] BRK [0x2ba853000, 0x2ba853fff] PGTABLE
[0.00] RAMDISK: [mem 0x34928000-0x3648bfff]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F7200 14 (v00 GBT   )
[0.00] ACPI: RSDT 0xCFED1040 48 (v01 GBTGBTUACPI 42302E31 GBTU 01010101)
[0.00] ACPI: FACP 0xCFED1100 74 (v01 GBTGBTUACPI 42302E31 GBTU 01010101)
[0.00] ACPI: DSDT 0xCFED11C0 00391C (v01 GBTGBTUACPI 1000 MSFT 010C)
[0.00] ACPI: FACS 0xCFED 40
[0.00] ACPI: MSDM 0xCFED4CC0 55 (v03 GBTGBTUACPI 42302E31 GBTU 01010101)
[0.00] ACPI: HPET 0xCFED4D80 38 (v01 GBTGBTUACPI 42302E31 GBTU 0098)
[0.00] ACPI: MCFG