Re: dump -X of large LVM based FFSv2 with WAPBL panics

2017-11-17 Thread Matthias Petermann

Hello Jaromir,

I had in fact already run a forced fsck on the file system in question 
beforehand, while it was unmounted. To be sure, I just ran the command 
again: it passes with no errors the second time. When I run dump -X 
again, the panic still occurs.


Best regards,
Matthias


nuc# fsck -P /dev/mapper/vg0-photo
** /dev/mapper/rvg0-photo
** File system is clean; not checking
nuc# fsck -P -f /dev/mapper/vg0-photo
** /dev/mapper/rvg0-photo
** File system is already clean
** Last Mounted on /p
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK

SALVAGE? [yn] y

59411 files, 63408414 used, 35694535 free (2079 frags, 4461557 blocks, 
0.0% fragmentation)


* FILE SYSTEM WAS MODIFIED *
nuc# fsck -P -f /dev/mapper/vg0-photo
** /dev/mapper/rvg0-photo
** File system is already clean
** Last Mounted on /p
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
59411 files, 63408414 used, 35694535 free (2079 frags, 4461557 blocks, 
0.0% fragmentation)

nuc# mount /p
nuc# touch /p/test.ignore
nuc# umount /p
nuc# fsck -P -f /dev/mapper/vg0-photo
** /dev/mapper/rvg0-photo
** File system is already clean
** Last Mounted on /p
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
59412 files, 63408414 used, 35694535 free (2079 frags, 4461557 blocks, 
0.0% fragmentation)

nuc#
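
The summary lines in the transcript above can be cross-checked with a few lines of Python. This is only a sketch: the regular expression is written against the output exactly as pasted here, `parse_summary` and `free_is_consistent` are made-up helper names, and the factor of 8 fragments per block is taken from the "frag 8" field of the dumpfs output quoted further down in the thread.

```python
import re

# Matches the fsck summary as it appears in the transcript above.
SUMMARY_RE = re.compile(
    r"(\d+) files, (\d+) used, (\d+) free "
    r"\((\d+) frags, (\d+) blocks, [\d.]+% fragmentation\)"
)

FRAGS_PER_BLOCK = 8  # assumption: "frag 8" from the dumpfs -s output in this thread


def parse_summary(text):
    """Return the counters of one fsck summary as a dict of ints."""
    # The mail client wrapped the summary line, so collapse whitespace first.
    flat = " ".join(text.split())
    match = SUMMARY_RE.search(flat)
    if match is None:
        raise ValueError("no fsck summary found")
    keys = ("files", "used", "free", "frags", "blocks")
    return dict(zip(keys, map(int, match.groups())))


def free_is_consistent(summary, frags_per_block=FRAGS_PER_BLOCK):
    """Check that 'free' equals whole free blocks plus loose free frags."""
    expected = summary["blocks"] * frags_per_block + summary["frags"]
    return summary["free"] == expected


# Summary before and after "touch /p/test.ignore", copied from the transcript.
before = parse_summary("""59411 files, 63408414 used, 35694535 free (2079 frags, 4461557 blocks,
0.0% fragmentation)""")
after = parse_summary("""59412 files, 63408414 used, 35694535 free (2079 frags, 4461557 blocks,
0.0% fragmentation)""")

print(free_is_consistent(before), free_is_consistent(after))
print(after["files"] - before["files"])
```

With the numbers above, the free count is internally consistent in both runs, and the only difference after the touch is one additional file, as one would expect from a healthy file system.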

On 15.11.2017 at 20:29, Jaromír Doleček wrote:

Hi,

can you check whether a full forced fsck (fsck -f) resolves this?

I've seen several such persistent panics when I was debugging WAPBL. 
Even after kernel fixes I had persistent panics around ffs_newvnode() 
due to disk data corruption from previous runs. This is worth trying.


Some day I plan to add a counter so that boot forces an fsck every X 
boots even when the file system is clean, similar to what Linux does 
with ext3/4.


Jaromir

2017-11-15 12:56 GMT+01:00 Matthias Petermann:


Hello,

on my system I have observed a serious panic when doing FFSv2 dumps
under certain conditions. I did some googling on my own and found
some references regarding the lead symptom

         "ffs_newvnode: ino=113 on /p: gen 55fd2f1f/55fd2f1f has non
zero blocks ff00 or size 0"

but all of them ended up as solved back in 2016. So I wanted to
share my observation here, in the hope somebody can give me some
pointers how the issue could be narrowed down further.

1) Given:

- NetBSD 8.0_BETA (Kernel built from branches/netbsd-8 around
2017-11-06)

         NetBSD nuc.local 8.0_BETA NetBSD 8.0_BETA (XEN3_DOM0_XHCI)
#0: Mon Nov 6 14:31:17 CET 2017
admin@nuc.local:/s/src/sys/arch/amd64/compile/XEN3_DOM0_XHCI amd64

- A large (392 GB) LVM volume hosting a FFSv2 filesystem with WAPBL
enabled
   (/dev/mapper/vg0-photo mounted at /p)

- (An external USB 3.0 Drive)

2) What I tried:

- make a dump of the aforementioned filesystem, using snapshots

     # dump -X -0auf /mnt/photo.0.dump /p

3) What happens then:

- the system crashes, leaving a coredump with the following
indication:

     ffs_newvnode: ino=113 on /p: gen 55fd2f1f/55fd2f1f has non zero
blocks ff00 or size 0
     fatal page fault in supervisor mode
     trap type 6 code 0x2 rip 0x8022c0cc cs 0x8 rflags
0x10246 cr2 0xfe82deaddf1d ilevel 0x3 rsp 0xfe810e6b1eb8
     curlwp 0xfe827f736000 pid 0.4 lowest kstack 0xfe810e6ae2c0
     panic: trap
     cpu0: Begin traceback...
     vpanic() at netbsd:vpanic+0x140
     snprintf() at netbsd:snprintf
     trap() at netbsd:trap+0xc6b
     --- trap (number 6) ---
     mutex_enter() at netbsd:mutex_enter+0xc
     biodone2() at netbsd:biodone2+0x9b
     biodone2() at netbsd:biodone2+0x9b
     biointr() at netbsd:biointr+0x3a
     softint_dispatch() at netbsd:softint_dispatch+0xd3
     DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe810e6b1ff0
     Xsoftintr() at netbsd:Xsoftintr+0x4f
     --- interrupt ---
     0:
     cpu0: End traceback...

     dumping to dev 0,1 (offset=168119, size=2076255):
     dump

- gdb backtrace shows:

     (gdb) target kvm netbsd.3.core
     0x80229545 in cpu_reboot ()
     (gdb) bt
     #0  0x80229545 in cpu_reboot ()
     #1  0x809a4afc in vpanic ()
     #2  0x809a4bb0 in panic ()
     #3  0x8022b176 in trap ()

Re: dump -X of large LVM based FFSv2 with WAPBL panics

2017-11-16 Thread Manuel Bouyer
On Wed, Nov 15, 2017 at 08:29:51PM +0100, Jaromír Doleček wrote:
> Hi,
> 
> can you check whether a full forced fsck (fsck -f) resolves this?
> 
> I've seen several such persistent panics when I was debugging WAPBL. Even
> after kernel fixes I had persistent panics around ffs_newvnode() due to
> disk data corruption from previous runs. This is worth trying.
> 
> Some day I plan to add a counter so that boot forces an fsck every X
> boots even when the file system is clean, similar to what Linux does with
> ext3/4.

I hope it will be configurable. On Linux I always turn it off
(you don't want a multi-hour fsck following a "quick reboot for kernel
update").

I'd prefer a forced fsck when the kernel has detected file system
corruption. This indeed needs a write to the superblock ...

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
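
The two policies discussed above (a periodic forced check every X boots, and a forced check only once the kernel has flagged corruption in the superblock) can be put side by side in a small decision function. This is purely an illustration: `FsState`, `should_force_fsck`, `mounts_since_fsck`, `corruption_flagged` and `max_mounts` are hypothetical names, not existing NetBSD fields or interfaces, and `max_mounts=0` stands in for the "configurable, can be turned off" behaviour asked for here.

```python
from dataclasses import dataclass


@dataclass
class FsState:
    """Hypothetical per-file-system state; not an actual NetBSD structure."""
    clean: bool                       # clean flag as recorded at last unmount
    mounts_since_fsck: int            # boots/mounts since the last full check
    corruption_flagged: bool = False  # set if the kernel detected corruption


def should_force_fsck(state, max_mounts=0):
    """Decide whether a full check should run before mounting.

    max_mounts <= 0 disables the periodic check, so only a dirty or
    corruption-flagged file system triggers a full fsck.
    """
    if not state.clean:
        return True   # ordinary case: file system was not cleanly unmounted
    if state.corruption_flagged:
        return True   # policy preferred above: check after detected corruption
    # Periodic policy: force a check every max_mounts boots even when clean.
    return max_mounts > 0 and state.mounts_since_fsck >= max_mounts


# Quick reboot for a kernel update, periodic check turned off: no fsck.
print(should_force_fsck(FsState(clean=True, mounts_since_fsck=500)))
# Same file system with a limit of 30 boots: the periodic check kicks in.
print(should_force_fsck(FsState(clean=True, mounts_since_fsck=500), max_mounts=30))
# Kernel recorded corruption earlier: check regardless of the counter.
print(should_force_fsck(FsState(clean=True, mounts_since_fsck=1, corruption_flagged=True)))
```

The corruption flag is the part that requires the superblock write mentioned above, since the information has to survive the panic and reboot.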


Re: dump -X of large LVM based FFSv2 with WAPBL panics

2017-11-15 Thread Jaromír Doleček
Hi,

can you check whether a full forced fsck (fsck -f) resolves this?

I've seen several such persistent panics when I was debugging WAPBL. Even
after kernel fixes I had persistent panics around ffs_newvnode() due to
disk data corruption from previous runs. This is worth trying.

Some day I plan to add a counter so that boot forces an fsck every X
boots even when the file system is clean, similar to what Linux does with
ext3/4.

Jaromir

2017-11-15 12:56 GMT+01:00 Matthias Petermann:

> Hello,
>
> on my system I have observed a serious panic when doing FFSv2 dumps under
> certain conditions. I did some googling on my own and found some references
> regarding the lead symptom
>
> "ffs_newvnode: ino=113 on /p: gen 55fd2f1f/55fd2f1f has non zero
> blocks ff00 or size 0"
>
> but all of them ended up as solved back in 2016. So I wanted to share my
> observation here, in the hope somebody can give me some pointers how the
> issue could be narrowed down further.
>
> 1) Given:
>
> - NetBSD 8.0_BETA (Kernel built from branches/netbsd-8 around 2017-11-06)
>
> NetBSD nuc.local 8.0_BETA NetBSD 8.0_BETA (XEN3_DOM0_XHCI) #0: Mon
> Nov 6 14:31:17 CET 2017 
> admin@nuc.local:/s/src/sys/arch/amd64/compile/XEN3_DOM0_XHCI
> amd64
>
> - A large (392 GB) LVM volume hosting a FFSv2 filesystem with WAPBL enabled
>   (/dev/mapper/vg0-photo mounted at /p)
>
> - (An external USB 3.0 Drive)
>
> 2) What I tried:
>
> - make a dump of the aforementioned filesystem, using snapshots
>
> # dump -X -0auf /mnt/photo.0.dump /p
>
> 3) What happens then:
>
> - the system crashes, leaving a coredump with the following
> indication:
>
> ffs_newvnode: ino=113 on /p: gen 55fd2f1f/55fd2f1f has non zero blocks
> ff00 or size 0
> fatal page fault in supervisor mode
> trap type 6 code 0x2 rip 0x8022c0cc cs 0x8 rflags 0x10246 cr2
> 0xfe82deaddf1d ilevel 0x3 rsp 0xfe810e6b1eb8
> curlwp 0xfe827f736000 pid 0.4 lowest kstack 0xfe810e6ae2c0
> panic: trap
> cpu0: Begin traceback...
> vpanic() at netbsd:vpanic+0x140
> snprintf() at netbsd:snprintf
> trap() at netbsd:trap+0xc6b
> --- trap (number 6) ---
> mutex_enter() at netbsd:mutex_enter+0xc
> biodone2() at netbsd:biodone2+0x9b
> biodone2() at netbsd:biodone2+0x9b
> biointr() at netbsd:biointr+0x3a
> softint_dispatch() at netbsd:softint_dispatch+0xd3
> DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe810e6b1ff0
> Xsoftintr() at netbsd:Xsoftintr+0x4f
> --- interrupt ---
> 0:
> cpu0: End traceback...
>
> dumping to dev 0,1 (offset=168119, size=2076255):
> dump
>
> - gdb backtrace shows:
>
> (gdb) target kvm netbsd.3.core
> 0x80229545 in cpu_reboot ()
> (gdb) bt
> #0  0x80229545 in cpu_reboot ()
> #1  0x809a4afc in vpanic ()
> #2  0x809a4bb0 in panic ()
> #3  0x8022b176 in trap ()
> #4  0x8020113e in alltraps ()
> #5  0x8022c0cc in mutex_enter ()
> #6  0x80a029f5 in wapbl_biodone ()
> #7  0x809e2f20 in biodone2 ()
> #8  0x809e2f20 in biodone2 ()
> #9  0x809e303e in biointr ()
> #10 0x8097bc1d in softint_dispatch ()
> #11 0x80223eef in Xsoftintr ()
> (gdb)
>
> 4) What I tried afterwards:
>
> - make a dump of the aforementioned filesystem, using NO snapshots
>
> # dump -0auf /mnt/photo.0.dump /p
>
> -> works
>
> - umount the filesystem, enforcing a manual fsck
>
> -> no problems
>
> - dumpfs -s /dev/mapper/vg0-photo
>
> nuc# dumpfs -s /dev/mapper/vg0-photo
> file system: /dev/mapper/vg0-photo
> format  FFSv2
> endian  little-endian
> location 65536  (-b 128)
> magic   19540119        time    Wed Nov 15 12:26:52 2017
> superblock location 65536   id  [ 59f8026a 16319237 ]
> cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
> nbfree  4461561 ndir    1865    nifree  24770027        nffree  2079
> ncg     530     size    100663296       blocks  99102949
> bsize   32768   shift   15      mask    0x8000
> fsize   4096    shift   12      mask    0xf000
> frag    8       shift   3       fsbtodb 3
> bpg     23742   fpg     189936  ipg     46848
> minfree 5%      optim   time    maxcontig 2     maxbpg  4096
> symlinklen 120  contigsumsize 2
> maxfilesize 0x000800800805
> nindir  4096    inopb   128
> avgfilesize 16384   avgfpdir 64
> sblkno  24      cblkno  32      iblkno  40      dblkno  2968
> sbsize  4096    cgsize  32768
> csaddr  2968    cssize  12288
> cgrotor 0       fmod    0       ronly   0       clean   0x01
> wapbl version 0x1   location 2  flags 0x0
> wapbl loc0 402688128    loc1 131072     loc2 512        loc3 3
> flags   none
> fsmnt   /p
> volname swuid   0
>
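
As a quick sanity check of the superblock values in the dumpfs output above, the geometry fields can be tested against each other. The sketch below is an assumption-laden illustration, not a replacement for fsck: the relations are the ones I would expect to hold for FFS (block size = fragment size x frags per block, frags per group = blocks per group x frags per block, the cylinder groups just covering the file system with a possibly short last group), and `check_geometry` is a made-up helper. The used/free figures come from the fsck summary posted later in the thread.

```python
# Values copied from the dumpfs -s output above.
SB = {
    "bsize": 32768, "fsize": 4096, "frag": 8,
    "bpg": 23742, "fpg": 189936, "ncg": 530,
    "size": 100663296, "blocks": 99102949,
}

# Counters from the fsck summary posted later in the thread (in fragments).
FSCK = {"used": 63408414, "free": 35694535}


def check_geometry(sb, fsck=None):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if sb["bsize"] != sb["fsize"] * sb["frag"]:
        problems.append("bsize != fsize * frag")
    if sb["fpg"] != sb["bpg"] * sb["frag"]:
        problems.append("fpg != bpg * frag")
    # All groups but the last are full; the last one may be shorter.
    low = (sb["ncg"] - 1) * sb["fpg"]
    high = sb["ncg"] * sb["fpg"]
    if not low < sb["size"] <= high:
        problems.append("ncg cylinder groups do not cover size")
    # Observation from this thread: used + free adds up to the 'blocks' field.
    if fsck is not None and fsck["used"] + fsck["free"] != sb["blocks"]:
        problems.append("used + free != blocks")
    return problems


print(check_geometry(SB, FSCK))
```

For the values posted here all four relations hold, so the static geometry in the superblock at least looks sane; this says nothing about the state of individual inodes such as ino 113.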
> 5) Further