Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Dave Chinner
On Thu, Oct 10, 2013 at 04:23:50PM +0800, Fengguang Wu wrote:
> Dave,
> 
> >> This is an easily reproducible bug. And I further confirmed it in
> >> two ways:
> >>
> >> 1) turn off XFS, build 39 commits and boot them 2000+ times
> >>
> >> => no single mount error
> >
> >That doesn't tell you it is an XFS error. Absence of symptoms !=
> >absence of bug.
> 
> True.
> 
> >> 2) turn off all other filesystems, build 2 kernels on v3.12-rc3
> >>v3.12-rc4 and boot them
> >> 
> >> => half boots have oops
> >
> >Again, it doesn't tell you that it is an XFS bug. XFS is well known
> >for exposing bugs in less used block devices, and you are definitely
> >using devices that are unusual and not commonly tested by filesystem
> >developers (e.g. zram, nbd, etc).
> >
> 
> Yeah, it's possible that your commit exposed a bug in the less used
> nbd/zram devices.

So please reproduce it on a brd/scsi/sata/virtio block device before
going any further. Preferably with a bash script I can point at a
single block device, not a binary initrd blob that I have to
deconstruct to try to work out what your test is doing.

because this:

> [7.707009] end_request: I/O error, dev fd0, sector 0
> [   10.475988] block nbd4: Attempted send on closed socket
> [   10.478272] end_request: I/O error, dev nbd4, sector 0
> [   10.492950] block nbd15: Attempted send on closed socket
> [   10.498283] end_request: I/O error, dev nbd15, sector 0

says that nbd is going through I/O error land, and that's the most
likely cause of problems being seen by higher level IO completion
operations

> [   10.504236] BUG: unable to handle kernel NULL pointer dereference at 
> 0004
> [   10.507558] IP: [] pool_mayday_timeout+0x5f/0x9c

And that's deep inside the workqueue infrastructure, indicating that
rescues are being used (allocation deadlock?) which is also less
tested error handling code path

> [   10.507558] *pdpt = 0ce6a001 *pde = 
> [   10.507558] Oops:  [#1]
> [   10.507558] CPU: 0 PID: 516 Comm: mount Not tainted 3.12.0-rc4 #2
> [   10.507558] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   10.507558] task: ccda7440 ti: cf40a000 task.ti: cce2e000
> [   10.507558] EIP: 0060:[] EFLAGS: 00010046 CPU: 0
> [   10.507558] EIP is at pool_mayday_timeout+0x5f/0x9c
> [   10.507558] EAX:  EBX: c1931d50 ECX:  EDX: 
> [   10.507558] ESI: c10343ba EDI: cd5a3258 EBP: cf40bf94 ESP: cf40bf80
> [   10.507558]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
> [   10.507558] CR0: 8005003b CR2: 0004 CR3: 0cdbd000 CR4: 06b0
> [   10.507558] Stack:
> [   10.507558]  c1931d60 cf40bf90 0100 c10343ba cf40bfc0 cf40bfa4 
> c102cd96 c1a52700
> [   10.507558]  cf40bfc0 cf40bfd4 c102cf7e c1931d50 c1a53110 c1a52f10 
> cf40bfc0 c10343ba
> [   10.507558]  cf40bfc0 cf40bfc0 0001 c1a52588 0100 cf40bff8 
> c1028f61 0001
> [   10.507558] Call Trace:
> [   10.507558]  [] ? need_to_create_worker+0x32/0x32
> [   10.507558]  [] call_timer_fn.isra.39+0x16/0x60
> [   10.507558]  [] run_timer_softirq+0x144/0x15e
> [   10.507558]  [] ? need_to_create_worker+0x32/0x32
> [   10.507558]  [] __do_softirq+0x87/0x12b
> [   10.507558]  [] ? local_bh_enable_ip+0xa/0xa
> [   10.507558]  
> [   10.507558]  [] ? irq_exit+0x3a/0x48
> [   10.507558]  [] ? smp_apic_timer_interrupt+0x23/0x2c
> [   10.507558]  [] ? apic_timer_interrupt+0x2d/0x34
> [   10.507558]  [] ? arch_local_irq_restore+0x5/0xb
> [   10.507558]  [] ? spin_unlock_irqrestore.isra.4+0x8/0x14
> [   10.507558]  [] ? nbd_end_request+0x65/0x6d
> [   10.507558]  [] ? do_nbd_request+0x77/0xc1
> [   10.507558]  [] ? __blk_run_queue_uncond+0x1e/0x27
> [   10.507558]  [] ? __blk_run_queue+0x13/0x15
> [   10.507558]  [] ? queue_unplugged.isra.56+0x13/0x1f
> [   10.507558]  [] ? blk_flush_plug_list+0x140/0x14f
> [   10.507558]  [] ? blk_finish_plug+0xd/0x27
> [   10.507558]  [] ? _xfs_buf_ioapply+0x236/0x24e

and it has happened deep inside the nbd IO path in the context of
the xfs_buf allocation that has seen corruptions in previous dumps.

So before I look any further at this, you need to rule out nbd as
the cause of the problems because the XFS code paths on scsi, sata,
brd and virtio block device don't cause any problems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Fengguang Wu
Dave,

>> This is an easily reproducible bug. And I further confirmed it in
>> two ways:
>>
>> 1) turn off XFS, build 39 commits and boot them 2000+ times
>>
>> => no single mount error
>
>That doesn't tell you it is an XFS error. Absence of symptoms !=
>absence of bug.

True.

>> 2) turn off all other filesystems, build 2 kernels on v3.12-rc3
>>v3.12-rc4 and boot them
>> 
>> => half boots have oops
>
>Again, it doesn't tell you that it is an XFS bug. XFS is well known
>for exposing bugs in less used block devices, and you are definitely
>using devices that are unusual and not commonly tested by filesystem
>developers (e.g. zram, nbd, etc).
>

Yeah, it's possible that your commit exposed a bug in the less used
nbd/zram devices.

> You need to refine the test down from "throw shit at the wall, look
> at what sticks" to a simple, reproducable test case. I can't
> reproduce your systems or testing, so you need to provide a test
> case I can use. Otherwise we're just wasting time

You may try the attached script. The initrd used in the script will be
sent to you in a private email. Here is an example run on my side:

wfg@bee /kernel/i386-randconfig-c4-0920-XFS/v3.12-rc4% kvm-0day.sh 
vmlinuz-3.12.0-rc4

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.12.0-rc4 (kbuild@cairo) (gcc version 4.8.1 
(Debian 4.8.1-8) ) #2 Thu Oct 10 12:55:12 CST 2013
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x0fffdfff] usable
[0.00] BIOS-e820: [mem 0x0fffe000-0x0fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] debug: ignoring loglevel setting.
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Bochs Bochs, BIOS Bochs 01/01/2011
[0.00] Hypervisor detected: KVM
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0xfffe max_arch_pfn = 0x100
[0.00] MTRR default type: write-back
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 008000 mask FF8000 uncachable
[0.00]   1 disabled
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] x86 PAT enabled: cpu 0, old 0x70406, new 0x7010600070106
[0.00] Scan for SMP in [mem 0x-0x03ff]
[0.00] Scan for SMP in [mem 0x0009fc00-0x0009]
[0.00] Scan for SMP in [mem 0x000f-0x000f]
[0.00] found SMP MP-table at [mem 0x000f1840-0x000f184f] mapped at 
[c00f1840]
[0.00]   mpc: f1850-f193c
[0.00] initial memory mapped: [mem 0x-0x01ff]
[0.00] Base memory trampoline at [c009b000] 9b000 size 16384
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00]  [mem 0x-0x000f] page 4k
[0.00] init_memory_mapping: [mem 0x0fa0-0x0fbf]
[0.00]  [mem 0x0fa0-0x0fbf] page 2M
[0.00] init_memory_mapping: [mem 0x0c00-0x0f9f]
[0.00]  [mem 0x0c00-0x0f9f] page 2M
[0.00] init_memory_mapping: [mem 0x0010-0x0bff]
[0.00]  [mem 0x0010-0x001f] page 4k
[0.00]  [mem 0x0020-0x0bff] page 2M
[0.00] init_memory_mapping: [mem 0x0fc0-0x0fffdfff]
[0.00]  [mem 0x0fc0-0x0fdf] page 2M
[0.00]  [mem 0x0fe0-0x0fffdfff] page 4k
[0.00] BRK [0x01ab4000, 0x01ab4fff] PGTABLE
[0.00] RAMDISK: [mem 0x0fce4000-0x0ffe]
[0.00] ACPI: RSDP 000f16b0 00014 (v00 BOCHS )
[0.00] ACPI: RSDT 0fffe3f0 00034 (v01 BOCHS  BXPCRSDT 0001 BXPC 
0001)
[0.00] ACPI: FACP 0f80 00074 (v01 BOCHS  BXPCFACP 0001 BXPC 
0001)
[0.00] ACPI: DSDT 0fffe430 01137 (v01   BXPC   BXDSDT 0001 INTL 
20100528)
[0.00] ACPI: FACS 0f40 00040
[0.00] ACPI: SSDT 06a0 00899 (v01 BOCHS  BXPCSSDT 0001 BXPC 
0001)
[0.00] ACPI: APIC 05b0 00080 (v01 BOCHS  BXPCAPIC 0001 BXPC 
0001)
[0.00] ACPI: HPET 0570 00038 (v01 BOCHS  BXPCHPET 0001 BXPC 
0001)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] mapped APIC to 

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Dave Chinner
On Thu, Oct 10, 2013 at 02:03:34PM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 03:28:20PM +1100, Dave Chinner wrote:
> > On Thu, Oct 10, 2013 at 11:38:34AM +0800, Fengguang Wu wrote:
> > > On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
> > > > On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
> > > > > Dave,
> > > > > 
> > > > > > I note that you have CONFIG_SLUB=y, which means that the cache slabs
> > > > > > are shared with objects of other types. That means that the memory
> > > > > > corruption problem is likely to be caused by one of the other
> > > > > > filesystems that is probing the block device(s), not XFS.
> > > > > 
> > > > > Good to know that, it would easy to test then: just turn off every
> > > > > other filesystems. I'll try it right away.
> > > > 
> > > > Seems that we don't even need to do that. A dig through the oops
> > > > database and I find stack dumps from other FS.
> > > > 
> > > > This happens in the kernel with same kconfig and commit 3.12-rc1.
> > > 
> > > Here is a summary of all FS with oops:
> > > 
> > > 411 ocfs2_fill_super
> > > 189 xfs_fs_fill_super
> > >  86 jfs_fill_super
> > >  50 isofs_fill_super
> > >  33 fat_fill_super
> > >  18 vfat_fill_super
> > >  15 msdos_fill_super
> > >  11 ext2_fill_super
> > >  10 ext3_fill_super
> > >   3 reiserfs_fill_super
> > 
> > The order of probing on the original dmesg output you reported is:
> > 
> > ext3
> > ext2
> > fatfs
> > reiserfs
> > gfs2
> > isofs
> > ocfs2
> 
> There are effectively no particular order, because there are many
> superblocks for these filesystems to scan.
> 
> for superblocks:
> for filesystems:
> scan super block

Sure, but if XFs is at the end of the list of filesystems to try to
mount, then you'll get allt he other filesystems attempted first,
lik eis being seen. And the absence of a single message in dmesg
from XFS is kind of suspicious, because XFs is by far the noisest of
all filesystems when it comes to warning about bad superblocks

> 
> In the end, any filesystem may impact the other (and perhaps a later
> run of itself).

No filesystem should impact on any other filesystem.

However, we have seen in the past that when filesystems share slab
caches that a bug in one filesystem can cause problems in another.
For example, years ago there was a bug in Reiserfs causing bufferhead
corruption that only affected other XFS filesystems on the same
machine.

> > which means that no XFS filesystem was mounted in the original bug
> > report, and hence that further indicates that XFS is not responsible
> > for the problem and that perhaps the original bisect was not
> > reliable...
> 
> This is an easily reproducible bug. And I further confirmed it in
> two ways:
> 
> 1) turn off XFS, build 39 commits and boot them 2000+ times
> 
> => no single mount error

That doesn't tell you it is an XFS error. Absence of symptoms !=
absence of bug.

> 2) turn off all other filesystems, build 2 kernels on v3.12-rc3
>v3.12-rc4 and boot them
> 
> => half boots have oops

Again, it doesn't tell you that it is an XFS bug. XFS is well known
for exposing bugs in less used block devices, and you are definitely
using devices that are unusual and not commonly tested by filesystem
developers (e.g. zram, nbd, etc).

You need to refine the test down from "throw shit at the wall, look
at what sticks" to a simple, reproducable test case. I can't
reproduce your systems or testing, so you need to provide a test
case I can use. Otherwise we're just wasting time

> So it may well be that XFS is impacted by an early run of itself.

You haven't provided any evidence that XFS is even finding bad
superblocks. As I said before, XFS is extremely loud when you
attempt to mount a corrupt image. I test this regularly on real
block devices, and I've never, ever had it fall over.

e.g:

$ sudo umount /dev/vda
$ sudo dd if=/dev/zero of=/dev/vda bs=512 count=128
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.0205057 s, 3.2 MB/s
$ sync
$ sudo !!
sudo mount /dev/vda /mnt/test
mount: block device /dev/vda is write-protected, mounting read-only
mount: you must specify the filesystem type
$ dmesg

[121196.435480] REISERFS warning (device vda): sh-2021 reiserfs_fill_super: can 
not find reiserfs on vda
[121196.440097] EXT3-fs (vda): error: can't find ext3 filesystem on dev vda.
[121196.443278] EXT2-fs (vda): error: can't find an ext2 filesystem on dev vda.
[121196.445941] EXT4-fs (vda): VFS: Can't find ext4 filesystem
[121196.449151] cramfs: wrong magic
[121196.450436] SQUASHFS error: Can't find a SQUASHFS superblock on vda
[121196.452453] VFS: Can't find a Minix filesystem V1 | V2 | V3 on device vda.
[121196.454745] FAT-fs (vda): bogus number of reserved sectors
[121196.456275] FAT-fs (vda): Can't find a valid FAT filesystem
[121196.458394] FAT-fs (vda): bogus number of 

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 02:23:13PM +0800, Fengguang Wu wrote:
> Dave,
> 
> Here are the first oops chunks that show up in the 3.12-rc4 kernel
> with only XFS build in. Attached is the kconfig and one full dmesg.
> 
> Hope there are more clues in them. I'll further test whether the
> problems disappear if further disabling XFS..

Yeah I just confirmed that disabling XFS quiets the error messages.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 03:28:20PM +1100, Dave Chinner wrote:
> On Thu, Oct 10, 2013 at 11:38:34AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
> > > On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
> > > > Dave,
> > > > 
> > > > > I note that you have CONFIG_SLUB=y, which means that the cache slabs
> > > > > are shared with objects of other types. That means that the memory
> > > > > corruption problem is likely to be caused by one of the other
> > > > > filesystems that is probing the block device(s), not XFS.
> > > > 
> > > > Good to know that, it would easy to test then: just turn off every
> > > > other filesystems. I'll try it right away.
> > > 
> > > Seems that we don't even need to do that. A dig through the oops
> > > database and I find stack dumps from other FS.
> > > 
> > > This happens in the kernel with same kconfig and commit 3.12-rc1.
> > 
> > Here is a summary of all FS with oops:
> > 
> > 411 ocfs2_fill_super
> > 189 xfs_fs_fill_super
> >  86 jfs_fill_super
> >  50 isofs_fill_super
> >  33 fat_fill_super
> >  18 vfat_fill_super
> >  15 msdos_fill_super
> >  11 ext2_fill_super
> >  10 ext3_fill_super
> >   3 reiserfs_fill_super
> 
> The order of probing on the original dmesg output you reported is:
> 
>   ext3
>   ext2
>   fatfs
>   reiserfs
>   gfs2
>   isofs
>   ocfs2

There are effectively no particular order, because there are many
superblocks for these filesystems to scan.

for superblocks:
for filesystems:
scan super block

In the end, any filesystem may impact the other (and perhaps a later
run of itself).

> which means that no XFS filesystem was mounted in the original bug
> report, and hence that further indicates that XFS is not responsible
> for the problem and that perhaps the original bisect was not
> reliable...

This is an easily reproducible bug. And I further confirmed it in
two ways:

1) turn off XFS, build 39 commits and boot them 2000+ times

=> no single mount error

2) turn off all other filesystems, build 2 kernels on v3.12-rc3
   v3.12-rc4 and boot them

=> half boots have oops

So it may well be that XFS is impacted by an early run of itself.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 03:28:20PM +1100, Dave Chinner wrote:
 On Thu, Oct 10, 2013 at 11:38:34AM +0800, Fengguang Wu wrote:
  On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
   On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
Dave,

 I note that you have CONFIG_SLUB=y, which means that the cache slabs
 are shared with objects of other types. That means that the memory
 corruption problem is likely to be caused by one of the other
 filesystems that is probing the block device(s), not XFS.

Good to know that, it would easy to test then: just turn off every
other filesystems. I'll try it right away.
   
   Seems that we don't even need to do that. A dig through the oops
   database and I find stack dumps from other FS.
   
   This happens in the kernel with same kconfig and commit 3.12-rc1.
  
  Here is a summary of all FS with oops:
  
  411 ocfs2_fill_super
  189 xfs_fs_fill_super
   86 jfs_fill_super
   50 isofs_fill_super
   33 fat_fill_super
   18 vfat_fill_super
   15 msdos_fill_super
   11 ext2_fill_super
   10 ext3_fill_super
3 reiserfs_fill_super
 
 The order of probing on the original dmesg output you reported is:
 
   ext3
   ext2
   fatfs
   reiserfs
   gfs2
   isofs
   ocfs2

There are effectively no particular order, because there are many
superblocks for these filesystems to scan.

for superblocks:
for filesystems:
scan super block

In the end, any filesystem may impact the other (and perhaps a later
run of itself).

 which means that no XFS filesystem was mounted in the original bug
 report, and hence that further indicates that XFS is not responsible
 for the problem and that perhaps the original bisect was not
 reliable...

This is an easily reproducible bug. And I further confirmed it in
two ways:

1) turn off XFS, build 39 commits and boot them 2000+ times

= no single mount error

2) turn off all other filesystems, build 2 kernels on v3.12-rc3
   v3.12-rc4 and boot them

= half boots have oops

So it may well be that XFS is impacted by an early run of itself.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 02:23:13PM +0800, Fengguang Wu wrote:
 Dave,
 
 Here are the first oops chunks that show up in the 3.12-rc4 kernel
 with only XFS build in. Attached is the kconfig and one full dmesg.
 
 Hope there are more clues in them. I'll further test whether the
 problems disappear if further disabling XFS..

Yeah I just confirmed that disabling XFS quiets the error messages.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Dave Chinner
On Thu, Oct 10, 2013 at 02:03:34PM +0800, Fengguang Wu wrote:
 On Thu, Oct 10, 2013 at 03:28:20PM +1100, Dave Chinner wrote:
  On Thu, Oct 10, 2013 at 11:38:34AM +0800, Fengguang Wu wrote:
   On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
 Dave,
 
  I note that you have CONFIG_SLUB=y, which means that the cache slabs
  are shared with objects of other types. That means that the memory
  corruption problem is likely to be caused by one of the other
  filesystems that is probing the block device(s), not XFS.
 
 Good to know that, it would easy to test then: just turn off every
 other filesystems. I'll try it right away.

Seems that we don't even need to do that. A dig through the oops
database and I find stack dumps from other FS.

This happens in the kernel with same kconfig and commit 3.12-rc1.
   
   Here is a summary of all FS with oops:
   
   411 ocfs2_fill_super
   189 xfs_fs_fill_super
86 jfs_fill_super
50 isofs_fill_super
33 fat_fill_super
18 vfat_fill_super
15 msdos_fill_super
11 ext2_fill_super
10 ext3_fill_super
 3 reiserfs_fill_super
  
  The order of probing on the original dmesg output you reported is:
  
  ext3
  ext2
  fatfs
  reiserfs
  gfs2
  isofs
  ocfs2
 
 There are effectively no particular order, because there are many
 superblocks for these filesystems to scan.
 
 for superblocks:
 for filesystems:
 scan super block

Sure, but if XFs is at the end of the list of filesystems to try to
mount, then you'll get allt he other filesystems attempted first,
lik eis being seen. And the absence of a single message in dmesg
from XFS is kind of suspicious, because XFs is by far the noisest of
all filesystems when it comes to warning about bad superblocks

 
 In the end, any filesystem may impact the other (and perhaps a later
 run of itself).

No filesystem should impact on any other filesystem.

However, we have seen in the past that when filesystems share slab
caches that a bug in one filesystem can cause problems in another.
For example, years ago there was a bug in Reiserfs causing bufferhead
corruption that only affected other XFS filesystems on the same
machine.

  which means that no XFS filesystem was mounted in the original bug
  report, and hence that further indicates that XFS is not responsible
  for the problem and that perhaps the original bisect was not
  reliable...
 
 This is an easily reproducible bug. And I further confirmed it in
 two ways:
 
 1) turn off XFS, build 39 commits and boot them 2000+ times
 
 = no single mount error

That doesn't tell you it is an XFS error. Absence of symptoms !=
absence of bug.

 2) turn off all other filesystems, build 2 kernels on v3.12-rc3
v3.12-rc4 and boot them
 
 = half boots have oops

Again, it doesn't tell you that it is an XFS bug. XFS is well known
for exposing bugs in less used block devices, and you are definitely
using devices that are unusual and not commonly tested by filesystem
developers (e.g. zram, nbd, etc).

You need to refine the test down from throw shit at the wall, look
at what sticks to a simple, reproducable test case. I can't
reproduce your systems or testing, so you need to provide a test
case I can use. Otherwise we're just wasting time

 So it may well be that XFS is impacted by an early run of itself.

You haven't provided any evidence that XFS is even finding bad
superblocks. As I said before, XFS is extremely loud when you
attempt to mount a corrupt image. I test this regularly on real
block devices, and I've never, ever had it fall over.

e.g:

$ sudo umount /dev/vda
$ sudo dd if=/dev/zero of=/dev/vda bs=512 count=128
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.0205057 s, 3.2 MB/s
$ sync
$ sudo !!
sudo mount /dev/vda /mnt/test
mount: block device /dev/vda is write-protected, mounting read-only
mount: you must specify the filesystem type
$ dmesg

[121196.435480] REISERFS warning (device vda): sh-2021 reiserfs_fill_super: can 
not find reiserfs on vda
[121196.440097] EXT3-fs (vda): error: can't find ext3 filesystem on dev vda.
[121196.443278] EXT2-fs (vda): error: can't find an ext2 filesystem on dev vda.
[121196.445941] EXT4-fs (vda): VFS: Can't find ext4 filesystem
[121196.449151] cramfs: wrong magic
[121196.450436] SQUASHFS error: Can't find a SQUASHFS superblock on vda
[121196.452453] VFS: Can't find a Minix filesystem V1 | V2 | V3 on device vda.
[121196.454745] FAT-fs (vda): bogus number of reserved sectors
[121196.456275] FAT-fs (vda): Can't find a valid FAT filesystem
[121196.458394] FAT-fs (vda): bogus number of reserved sectors
[121196.459885] FAT-fs (vda): Can't find a valid FAT filesystem
[121196.461918] BFS-fs: bfs_fill_super(): No BFS filesystem on vda 
(magic=)

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Fengguang Wu
Dave,

 This is an easily reproducible bug. And I further confirmed it in
 two ways:

 1) turn off XFS, build 39 commits and boot them 2000+ times

 = no single mount error

That doesn't tell you it is an XFS error. Absence of symptoms !=
absence of bug.

True.

 2) turn off all other filesystems, build 2 kernels on v3.12-rc3
v3.12-rc4 and boot them
 
 = half boots have oops

Again, it doesn't tell you that it is an XFS bug. XFS is well known
for exposing bugs in less used block devices, and you are definitely
using devices that are unusual and not commonly tested by filesystem
developers (e.g. zram, nbd, etc).


Yeah, it's possible that your commit exposed a bug in the less used
nbd/zram devices.

 You need to refine the test down from throw shit at the wall, look
 at what sticks to a simple, reproducable test case. I can't
 reproduce your systems or testing, so you need to provide a test
 case I can use. Otherwise we're just wasting time

You may try the attached script. The initrd used in the script will be
sent to you in a private email. Here is an example run on my side:

wfg@bee /kernel/i386-randconfig-c4-0920-XFS/v3.12-rc4% kvm-0day.sh 
vmlinuz-3.12.0-rc4

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.12.0-rc4 (kbuild@cairo) (gcc version 4.8.1 
(Debian 4.8.1-8) ) #2 Thu Oct 10 12:55:12 CST 2013
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x0fffdfff] usable
[0.00] BIOS-e820: [mem 0x0fffe000-0x0fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] debug: ignoring loglevel setting.
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Bochs Bochs, BIOS Bochs 01/01/2011
[0.00] Hypervisor detected: KVM
[0.00] e820: update [mem 0x-0x0fff] usable == reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] e820: last_pfn = 0xfffe max_arch_pfn = 0x100
[0.00] MTRR default type: write-back
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 008000 mask FF8000 uncachable
[0.00]   1 disabled
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] x86 PAT enabled: cpu 0, old 0x70406, new 0x7010600070106
[0.00] Scan for SMP in [mem 0x-0x03ff]
[0.00] Scan for SMP in [mem 0x0009fc00-0x0009]
[0.00] Scan for SMP in [mem 0x000f-0x000f]
[0.00] found SMP MP-table at [mem 0x000f1840-0x000f184f] mapped at 
[c00f1840]
[0.00]   mpc: f1850-f193c
[0.00] initial memory mapped: [mem 0x-0x01ff]
[0.00] Base memory trampoline at [c009b000] 9b000 size 16384
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00]  [mem 0x-0x000f] page 4k
[0.00] init_memory_mapping: [mem 0x0fa0-0x0fbf]
[0.00]  [mem 0x0fa0-0x0fbf] page 2M
[0.00] init_memory_mapping: [mem 0x0c00-0x0f9f]
[0.00]  [mem 0x0c00-0x0f9f] page 2M
[0.00] init_memory_mapping: [mem 0x0010-0x0bff]
[0.00]  [mem 0x0010-0x001f] page 4k
[0.00]  [mem 0x0020-0x0bff] page 2M
[0.00] init_memory_mapping: [mem 0x0fc0-0x0fffdfff]
[0.00]  [mem 0x0fc0-0x0fdf] page 2M
[0.00]  [mem 0x0fe0-0x0fffdfff] page 4k
[0.00] BRK [0x01ab4000, 0x01ab4fff] PGTABLE
[0.00] RAMDISK: [mem 0x0fce4000-0x0ffe]
[0.00] ACPI: RSDP 000f16b0 00014 (v00 BOCHS )
[0.00] ACPI: RSDT 0fffe3f0 00034 (v01 BOCHS  BXPCRSDT 0001 BXPC 
0001)
[0.00] ACPI: FACP 0f80 00074 (v01 BOCHS  BXPCFACP 0001 BXPC 
0001)
[0.00] ACPI: DSDT 0fffe430 01137 (v01   BXPC   BXDSDT 0001 INTL 
20100528)
[0.00] ACPI: FACS 0f40 00040
[0.00] ACPI: SSDT 06a0 00899 (v01 BOCHS  BXPCSSDT 0001 BXPC 
0001)
[0.00] ACPI: APIC 05b0 00080 (v01 BOCHS  BXPCAPIC 0001 BXPC 
0001)
[0.00] ACPI: HPET 0570 00038 (v01 BOCHS  BXPCHPET 0001 BXPC 
0001)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] mapped APIC to b000 (fee0)
[

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-10 Thread Dave Chinner
On Thu, Oct 10, 2013 at 04:23:50PM +0800, Fengguang Wu wrote:
 Dave,
 
  This is an easily reproducible bug. And I further confirmed it in
  two ways:
 
  1) turn off XFS, build 39 commits and boot them 2000+ times
 
  = no single mount error
 
 That doesn't tell you it is an XFS error. Absence of symptoms !=
 absence of bug.
 
 True.
 
  2) turn off all other filesystems, build 2 kernels on v3.12-rc3
 v3.12-rc4 and boot them
  
  = half boots have oops
 
 Again, it doesn't tell you that it is an XFS bug. XFS is well known
 for exposing bugs in less used block devices, and you are definitely
 using devices that are unusual and not commonly tested by filesystem
 developers (e.g. zram, nbd, etc).
 
 
 Yeah, it's possible that your commit exposed a bug in the less used
 nbd/zram devices.

So please reproduce it on a brd/scsi/sata/virtio block device before
going any further. Preferably with a bash script I can point at a
single block device, not a binary initrd blob that I have to
deconstruct to try to work out what your test is doing.

because this:

 [7.707009] end_request: I/O error, dev fd0, sector 0
 [   10.475988] block nbd4: Attempted send on closed socket
 [   10.478272] end_request: I/O error, dev nbd4, sector 0
 [   10.492950] block nbd15: Attempted send on closed socket
 [   10.498283] end_request: I/O error, dev nbd15, sector 0

says that nbd is going through I/O error land, and that's the most
likely cause of problems being seen by higher level IO completion
operations

 [   10.504236] BUG: unable to handle kernel NULL pointer dereference at 
 0004
 [   10.507558] IP: [c1034419] pool_mayday_timeout+0x5f/0x9c

And that's deep inside the workqueue infrastructure, indicating that
rescues are being used (allocation deadlock?) which is also less
tested error handling code path

 [   10.507558] *pdpt = 0ce6a001 *pde = 
 [   10.507558] Oops:  [#1]
 [   10.507558] CPU: 0 PID: 516 Comm: mount Not tainted 3.12.0-rc4 #2
 [   10.507558] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [   10.507558] task: ccda7440 ti: cf40a000 task.ti: cce2e000
 [   10.507558] EIP: 0060:[c1034419] EFLAGS: 00010046 CPU: 0
 [   10.507558] EIP is at pool_mayday_timeout+0x5f/0x9c
 [   10.507558] EAX:  EBX: c1931d50 ECX:  EDX: 
 [   10.507558] ESI: c10343ba EDI: cd5a3258 EBP: cf40bf94 ESP: cf40bf80
 [   10.507558]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
 [   10.507558] CR0: 8005003b CR2: 0004 CR3: 0cdbd000 CR4: 06b0
 [   10.507558] Stack:
 [   10.507558]  c1931d60 cf40bf90 0100 c10343ba cf40bfc0 cf40bfa4 
 c102cd96 c1a52700
 [   10.507558]  cf40bfc0 cf40bfd4 c102cf7e c1931d50 c1a53110 c1a52f10 
 cf40bfc0 c10343ba
 [   10.507558]  cf40bfc0 cf40bfc0 0001 c1a52588 0100 cf40bff8 
 c1028f61 0001
 [   10.507558] Call Trace:
 [   10.507558]  [c10343ba] ? need_to_create_worker+0x32/0x32
 [   10.507558]  [c102cd96] call_timer_fn.isra.39+0x16/0x60
 [   10.507558]  [c102cf7e] run_timer_softirq+0x144/0x15e
 [   10.507558]  [c10343ba] ? need_to_create_worker+0x32/0x32
 [   10.507558]  [c1028f61] __do_softirq+0x87/0x12b
 [   10.507558]  [c1028eda] ? local_bh_enable_ip+0xa/0xa
 [   10.507558]  IRQ
 [   10.507558]  [c10290c4] ? irq_exit+0x3a/0x48
 [   10.507558]  [c1018818] ? smp_apic_timer_interrupt+0x23/0x2c
 [   10.507558]  [c164f38d] ? apic_timer_interrupt+0x2d/0x34
 [   10.507558]  [c126f5e2] ? arch_local_irq_restore+0x5/0xb
 [   10.507558]  [c126f694] ? spin_unlock_irqrestore.isra.4+0x8/0x14
 [   10.507558]  [c126f705] ? nbd_end_request+0x65/0x6d
 [   10.507558]  [c126f784] ? do_nbd_request+0x77/0xc1
 [   10.507558]  [c11abe4f] ? __blk_run_queue_uncond+0x1e/0x27
 [   10.507558]  [c11abe6b] ? __blk_run_queue+0x13/0x15
 [   10.507558]  [c11abfe8] ? queue_unplugged.isra.56+0x13/0x1f
 [   10.507558]  [c11ad70b] ? blk_flush_plug_list+0x140/0x14f
 [   10.507558]  [c11ad95f] ? blk_finish_plug+0xd/0x27
 [   10.507558]  [c11051b6] ? _xfs_buf_ioapply+0x236/0x24e

and it has happened deep inside the nbd IO path in the context of
the xfs_buf allocation that has seen corruptions in previous dumps.

So before I look any further at this, you need to rule out nbd as
the cause of the problems because the XFS code paths on scsi, sata,
brd and virtio block device don't cause any problems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Dave Chinner
On Thu, Oct 10, 2013 at 11:38:34AM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
> > > Dave,
> > > 
> > > > I note that you have CONFIG_SLUB=y, which means that the cache slabs
> > > > are shared with objects of other types. That means that the memory
> > > > corruption problem is likely to be caused by one of the other
> > > > filesystems that is probing the block device(s), not XFS.
> > > 
> > > Good to know that, it would easy to test then: just turn off every
> > > other filesystems. I'll try it right away.
> > 
> > Seems that we don't even need to do that. A dig through the oops
> > database and I find stack dumps from other FS.
> > 
> > This happens in the kernel with same kconfig and commit 3.12-rc1.
> 
> Here is a summary of all FS with oops:
> 
> 411 ocfs2_fill_super
> 189 xfs_fs_fill_super
>  86 jfs_fill_super
>  50 isofs_fill_super
>  33 fat_fill_super
>  18 vfat_fill_super
>  15 msdos_fill_super
>  11 ext2_fill_super
>  10 ext3_fill_super
>   3 reiserfs_fill_super

The order of probing on the original dmesg output you reported is:

ext3
ext2
fatfs
reiserfs
gfs2
isofs
ocfs2

which means that no XFS filesystem was mounted in the original bug
report, and hence that further indicates that XFS is not responsible
for the problem and that perhaps the original bisect was not
reliable...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
> > Dave,
> > 
> > > I note that you have CONFIG_SLUB=y, which means that the cache slabs
> > > are shared with objects of other types. That means that the memory
> > > corruption problem is likely to be caused by one of the other
> > > filesystems that is probing the block device(s), not XFS.
> > 
> > Good to know that, it would easy to test then: just turn off every
> > other filesystems. I'll try it right away.
> 
> Seems that we don't even need to do that. A dig through the oops
> database and I find stack dumps from other FS.
> 
> This happens in the kernel with same kconfig and commit 3.12-rc1.

Here is a summary of all FS with oops:

411 ocfs2_fill_super
189 xfs_fs_fill_super
 86 jfs_fill_super
 50 isofs_fill_super
 33 fat_fill_super
 18 vfat_fill_super
 15 msdos_fill_super
 11 ext2_fill_super
 10 ext3_fill_super
  3 reiserfs_fill_super

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
> Dave,
> 
> > I note that you have CONFIG_SLUB=y, which means that the cache slabs
> > are shared with objects of other types. That means that the memory
> > corruption problem is likely to be caused by one of the other
> > filesystems that is probing the block device(s), not XFS.
> 
> Good to know that, it would easy to test then: just turn off every
> other filesystems. I'll try it right away.

Seems that we don't even need to do that. A dig through the oops
database and I find stack dumps from other FS.

This happens in the kernel with same kconfig and commit 3.12-rc1.

[   51.205369] block nbd1: Attempted send on closed socket
[   51.214126] BUG: unable to handle kernel NULL pointer dereference at 0004
[   51.215640] IP: [] pool_mayday_timeout+0x5f/0x9c
[   51.216262] *pdpt = 0ca90001 *pde =  
[   51.216262] Oops:  [#1] 
[   51.216262] CPU: 0 PID: 644 Comm: mount Not tainted 3.12.0-rc1 #2
[   51.216262] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   51.216262] task: ccffd7a0 ti: cca54000 task.ti: cca54000
[   51.216262] EIP: 0060:[] EFLAGS: 0046 CPU: 0
[   51.216262] EIP is at pool_mayday_timeout+0x5f/0x9c
[   51.216262] EAX:  EBX: c1a81d50 ECX:  EDX: 
[   51.216262] ESI: cd0d303c EDI: cfff7054 EBP: cca55d2c ESP: cca55d18
[   51.216262]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
[   51.216262] CR0: 8005003b CR2: 0004 CR3: 0ca0b000 CR4: 06b0
[   51.216262] DR0:  DR1:  DR2:  DR3: 
[   51.216262] DR6:  DR7: 
[   51.216262] Stack:
[   51.216262]  c1a81d60 cd0d303c 0100 c103439c cca55d58 cca55d3c c102cd96 
c1ba4700
[   51.216262]  cca55d58 cca55d6c c102cf7e c1a81d50 c1ba5110 c1ba4f10 cca55d58 
c103439c
[   51.216262]  cca55d58 cca55d58 0001 c1ba4588 0100 cca55d90 c1028f61 
0001
[   51.216262] Call Trace:
[   51.216262]  [] ? need_to_create_worker+0x32/0x32
[   51.216262]  [] call_timer_fn.isra.39+0x16/0x60
[   51.216262]  [] run_timer_softirq+0x144/0x15e
[   51.216262]  [] ? need_to_create_worker+0x32/0x32
[   51.216262]  [] __do_softirq+0x87/0x12b
[   51.216262]  [] irq_exit+0x3a/0x48
[   51.216262]  [] do_IRQ+0x64/0x77
[   51.216262]  [] common_interrupt+0x2c/0x31
[   51.216262]  [] ? ocfs2_get_sector+0x14/0x1cd
[   51.216262]  [] ocfs2_sb_probe+0xcb/0x7ca
[   51.216262]  [] ? bdi_lock_two+0x8/0x14
[   51.216262]  [] ? string.isra.4+0x26/0x89
[   51.216262]  [] ocfs2_fill_super+0x39/0xe84
[   51.216262]  [] ? pointer.isra.15+0x23f/0x25b
[   51.216262]  [] ? disk_name+0x20/0x65
[   51.216262]  [] mount_bdev+0x105/0x14d
[   51.216262]  [] ? slab_pre_alloc_hook.isra.66+0x1e/0x25
[   51.216262]  [] ? __kmalloc_track_caller+0xb8/0xe4
[   51.216262]  [] ? alloc_vfsmnt+0xdc/0xff
[   51.216262]  [] ocfs2_mount+0x10/0x12
[   51.216262]  [] ? ocfs2_handle_error+0xa2/0xa2
[   51.216262]  [] mount_fs+0x55/0x123
[   51.216262]  [] vfs_kern_mount+0x44/0xac
[   51.216262]  [] do_mount+0x647/0x768
[   51.216262]  [] ? strndup_user+0x2c/0x3d
[   51.216262]  [] SyS_mount+0x71/0xa0
[   51.216262]  [] syscall_call+0x7/0xb
[   51.216262] Code: 43 44 e8 7a 8c ff ff 58 5a 5b 5e 5f 5d c3 8b 43 10 8d 78 
fc 8d 43 10 89 45 ec 8d 47 04 3b 45 ec 74 ca 89 f8 e8 44 f0 ff ff 89 c1 <8b> 50 
04 83 7a 44 00 74 2c 8b 40 68 8d 71 68 39 f0 75 22 8b 72
[   51.216262] EIP: [] pool_mayday_timeout+0x5f/0x9c SS:ESP 
0068:cca55d18
[   51.216262] CR2: 0004
[   51.216262] ---[ end trace 267272283b2d7610 ]---
[   51.216262] Kernel panic - not syncing: Fatal exception in interrupt

[3.244964] block nbd1: Attempted send on closed socket
[3.246243] block nbd1: Attempted send on closed socket
[3.247508] (mount,661,0):ocfs2_get_sector:1861 ERROR: status = -5
[3.248906] (mount,661,0):ocfs2_sb_probe:770 ERROR: status = -5
[3.250269] (mount,661,0):ocfs2_fill_super:1038 ERROR: superblock probe 
failed!
[3.252100] (mount,661,0):ocfs2_fill_super:1229 ERROR: status = -5
[3.253569] BUG: unable to handle kernel NULL pointer dereference at 0004
[3.255322] IP: [] process_one_work+0x1a/0x1cc
[3.256681] *pdpt = 0c950001 *pde =  
[3.256833] Oops:  [#1] 
[3.256833] CPU: 0 PID: 5 Comm: kworker/0:0H Not tainted 3.12.0-rc1 #2
[3.256833] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[3.256833] task: cec44d80 ti: cec54000 task.ti: cec54000
[3.256833] EIP: 0060:[] EFLAGS: 00010046 CPU: 0
[3.256833] EIP is at process_one_work+0x1a/0x1cc
[3.256833] EAX:  EBX: cec1b900 ECX: ccdf0700 EDX: ccdf0700
[3.256833] ESI: ccdf0754 EDI: c1a81d50 EBP: cec55f44 ESP: cec55f2c
[3.256833]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[3.256833] CR0: 8005003b CR2: 005c CR3: 0cfc5000 CR4: 06b0
[3.256833] Stack:
[3.256833]  c1a81d50  c10345b0 cec1b900 cec1b918 cec1b918 cec55f54 
c1034a1d
[3.256833]  cec1b900 c1a81d50 cec55f70 c1034d3b 

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
Dave,

> I note that you have CONFIG_SLUB=y, which means that the cache slabs
> are shared with objects of other types. That means that the memory
> corruption problem is likely to be caused by one of the other
> filesystems that is probing the block device(s), not XFS.

Good to know that, it would easy to test then: just turn off every
other filesystems. I'll try it right away.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Dave Chinner
On Thu, Oct 10, 2013 at 09:41:17AM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 09:16:40AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
> > > [add x...@oss.sgi.com to cc]
> > 
> > Thanks.
> > 
> > To help debug the problem, I searched XFS in my tests' oops database
> > and find one kernel that failed 4 times (out of 12 total boots) with
> > basically the same error:
> > 
> >   4 BUG: sleeping function called from invalid context at 
> > kernel/workqueue.c:2810
> >   1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
> > debug_print_object+0x94/0xa2()
> >   1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 
> > debug_print_object+0x94/0xa2()
> >   1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
> > debug_print_object+0x94/0xa2()
> >   1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 
> > debug_print_object+0x94/0xa2()
> 

Fenguang, I'll having real trouble associating these with the XFS
code path that is seeing the problems. These look like a use after
free or a double free, but that isn't possible in the XFS code paths
that are showing up in the traces.

> And some other messages in an older kernel:
> 
> [   39.004416] F2FS-fs (nbd2): unable to read second superblock
> [   39.005088] XFS: Assertion failed: read && bp->b_ops, file: 
> fs/xfs/xfs_buf.c, line: 1036

This can not possibily occur on the superblock read path, as
bp->b_ops in that case is *always* initialised, as is XBF_READ.

So this implies something else has modified the struct xfs_buf.

> [   41.550471] [ cut here ]
> [   41.550476] WARNING: CPU: 1 PID: 878 at lib/list_debug.c:33 
> __list_add+0xac/0xc0()
> [   41.550478] list_add corruption. prev->next should be next 
> (88000f3d7360), but was   (null). (prev=880008786a30).

And this is a smoking gun - list corruption...

> [   41.550481] CPU: 1 PID: 878 Comm: mount Not tainted 
> 3.11.0-rc1-00667-gf70eb07 #64
> [   41.550482] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   41.550485]  0009 880007d6fb08 824044a1 
> 880007d6fb50
> [   41.550488]  880007d6fb40 8109a0a8 880007c6b530 
> 88000f3d7360
> [   41.550491]  880008786a30 0007  
> 880007d6fba0
> [   41.550491] Call Trace:
> [   41.550499]  [] dump_stack+0x4e/0x82
> [   41.550503]  [] warn_slowpath_common+0x78/0xa0
> [   41.550505]  [] warn_slowpath_fmt+0x4c/0x50
> [   41.550509]  [] ? get_lock_stats+0x19/0x60
> [   41.550511]  [] __list_add+0xac/0xc0
> [   41.550515]  [] insert_work+0x43/0xa0
> [   41.550518]  [] __queue_work+0x11b/0x510
> [   41.550520]  [] queue_work_on+0x96/0xa0
> [   41.550526]  [] ? _xfs_buf_ioend.constprop.15+0x26/0x30
> [   41.550529]  [] xfs_buf_ioend+0x15c/0x260

... in the workqueue code on a work item in the the struct xfs_buf .

> [   41.550531]  [] ? xfsbdstrat+0x22/0x170
> [   41.550534]  [] _xfs_buf_ioend.constprop.15+0x26/0x30
> [   41.550537]  [] xfs_buf_iorequest+0x73/0x1a0
> [   41.550539]  [] xfsbdstrat+0x22/0x170
> [   41.550542]  [] xfs_buf_read_uncached+0x72/0xa0
> [   41.550546]  [] xfs_readsb+0x176/0x250

... in the very context that we allocated the struct xfs_buf. It's
not a use after free or memory corruption caused by XFS you are
seeing here.

I note that you have CONFIG_SLUB=y, which means that the cache slabs
are shared with objects of other types. That means that the memory
corruption problem is likely to be caused by one of the other
filesystems that is probing the block device(s), not XFS.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 09:16:40AM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
> > [add x...@oss.sgi.com to cc]
> 
> Thanks.
> 
> To help debug the problem, I searched XFS in my tests' oops database
> and find one kernel that failed 4 times (out of 12 total boots) with
> basically the same error:
> 
>   4 BUG: sleeping function called from invalid context at 
> kernel/workqueue.c:2810
>   1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
> debug_print_object+0x94/0xa2()
>   1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 
> debug_print_object+0x94/0xa2()
>   1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
> debug_print_object+0x94/0xa2()
>   1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 
> debug_print_object+0x94/0xa2()

And some other messages in an older kernel:

[   39.004416] F2FS-fs (nbd2): unable to read second superblock
[   39.005088] XFS: Assertion failed: read && bp->b_ops, file: 
fs/xfs/xfs_buf.c, line: 1036
[   39.005089] [ cut here ]
[   39.005096] WARNING: CPU: 1 PID: 20 at fs/xfs/xfs_message.c:100 
asswarn+0x33/0x40()
[   39.005099] CPU: 1 PID: 20 Comm: kworker/1:0H Not tainted 
3.11.0-rc1-00667-gf70eb07 #64
[   39.005100] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   39.005106] Workqueue: xfslogd xfs_buf_iodone_work
[   39.005110]  0009 88000ed91ce0 824044a1 

[   39.005113]  88000ed91d18 8109a0a8 880007d6cea8 
880007d6cd80
[   39.005116]    0060 
88000ed91d28
[   39.005116] Call Trace:
[   39.005122]  [] dump_stack+0x4e/0x82
[   39.005126]  [] warn_slowpath_common+0x78/0xa0
[   39.005129]  [] warn_slowpath_null+0x1a/0x20
[   39.005131]  [] asswarn+0x33/0x40
[   39.005134]  [] xfs_buf_iodone_work+0x92/0x200
[   39.005138]  [] process_one_work+0x1fa/0x6f0
[   39.005141]  [] ? process_one_work+0x198/0x6f0
[   39.005144]  [] worker_thread+0x11d/0x3a0
[   39.005146]  [] ? process_one_work+0x6f0/0x6f0
[   39.005149]  [] kthread+0xed/0x100
[   39.005153]  [] ? trace_hardirqs_on_caller+0xfd/0x1c0
[   39.005156]  [] ? insert_kthread_work+0x80/0x80
[   39.005159]  [] ret_from_fork+0x7c/0xb0
[   39.005162]  [] ? insert_kthread_work+0x80/0x80
[   39.005164] ---[ end trace be5b205ae29d07cd ]---
[   39.028229] block nbd6: Attempted send on closed socket
[   39.028988] block nbd6: Attempted send on closed socket
[   39.029109] block nbd6: Attempted send on closed socket

[   41.550438] block nbd7: Attempted send on closed socket
[   41.550471] [ cut here ]
[   41.550476] WARNING: CPU: 1 PID: 878 at lib/list_debug.c:33 
__list_add+0xac/0xc0()
[   41.550478] list_add corruption. prev->next should be next 
(88000f3d7360), but was   (null). (prev=880008786a30).
[   41.550481] CPU: 1 PID: 878 Comm: mount Not tainted 
3.11.0-rc1-00667-gf70eb07 #64
[   41.550482] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   41.550485]  0009 880007d6fb08 824044a1 
880007d6fb50
[   41.550488]  880007d6fb40 8109a0a8 880007c6b530 
88000f3d7360
[   41.550491]  880008786a30 0007  
880007d6fba0
[   41.550491] Call Trace:
[   41.550499]  [] dump_stack+0x4e/0x82
[   41.550503]  [] warn_slowpath_common+0x78/0xa0
[   41.550505]  [] warn_slowpath_fmt+0x4c/0x50
[   41.550509]  [] ? get_lock_stats+0x19/0x60
[   41.550511]  [] __list_add+0xac/0xc0
[   41.550515]  [] insert_work+0x43/0xa0
[   41.550518]  [] __queue_work+0x11b/0x510
[   41.550520]  [] queue_work_on+0x96/0xa0
[   41.550526]  [] ? _xfs_buf_ioend.constprop.15+0x26/0x30
[   41.550529]  [] xfs_buf_ioend+0x15c/0x260
[   41.550531]  [] ? xfsbdstrat+0x22/0x170
[   41.550534]  [] _xfs_buf_ioend.constprop.15+0x26/0x30
[   41.550537]  [] xfs_buf_iorequest+0x73/0x1a0
[   41.550539]  [] xfsbdstrat+0x22/0x170
[   41.550542]  [] xfs_buf_read_uncached+0x72/0xa0
[   41.550546]  [] xfs_readsb+0x176/0x250
[   41.550550]  [] xfs_fs_fill_super+0x24a/0x3a0
[   41.550553]  [] mount_bdev+0x1c0/0x200
[   41.550556]  [] ? xfs_parseargs+0xc10/0xc10
[   41.550559]  [] ? lock_release_holdtime.part.29+0xe6/0x160
[   41.550561]  [] xfs_fs_mount+0x15/0x20
[   41.550563]  [] mount_fs+0x39/0x1b0
[   41.550567]  [] ? __alloc_percpu+0x10/0x20
[   41.550571]  [] vfs_kern_mount+0x63/0xf0
[   41.550574]  [] do_mount+0x23e/0xa20
[   41.550577]  [] ? strndup_user+0x4b/0x60
[   41.550579]  [] SyS_mount+0x83/0xc0
[   41.550583]  [] tracesys+0xdd/0xe2
[   41.550584] ---[ end trace 5193f938804cbcca ]---
[   41.550820] block nbd15: Attempted send on closed socket

[   38.738751] block nbd9: Attempted send on closed socket
[   38.738775] F2FS-fs (nbd9): unable to read first superblock
[   38.738852] block nbd9: Attempted send on closed socket
[   38.738875] F2FS-fs (nbd9): unable to read second superblock
 881cbde0 810b9ddf 8800081dfe18
[   

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
> [add x...@oss.sgi.com to cc]

Thanks.

To help debug the problem, I searched XFS in my tests' oops database
and find one kernel that failed 4 times (out of 12 total boots) with
basically the same error:

  4 BUG: sleeping function called from invalid context at 
kernel/workqueue.c:2810
  1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
  1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
  1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
  1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()

dmesg-yocto-cairo-28:20131004225501:i386-randconfig-j0-10042023:3.12.0-rc2-next-20130927-03100-ga0cf1ab:3

[   29.709162] qnx6: unable to set blocksize
[   29.716037] ADFS-fs error (device zram0): adfs_fill_super: unable to read 
superblock
[   29.719395] [ cut here ]
[   29.719838] WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
[   29.720016] ODEBUG: free active (active state 0) object type: work_struct 
hint: xfs_buf_iodone_work+0x0/0xfe
[   29.720016] CPU: 1 PID: 372 Comm: mount Not tainted 
3.12.0-rc2-next-20130927-03100-ga0cf1ab #3
[   29.720016]    8a3efd30 8275c9a2 8a3efd70  0104 
8a3efd60
[   29.720016]  8103f5e3 8334fafe 8a3efd8c 0174 8334fb3b 0104 817001fb 
817001fb
[   29.720016]  8a121e90 8355b58c 83323588 8a3efd78 8103f641 0009 8a3efd70 
8334fafe
[   29.720016] Call Trace:
[   29.720016]  [<8275c9a2>] dump_stack+0x7a/0xaa
[   29.720016]  [<8103f5e3>] warn_slowpath_common+0x8c/0xa3
[   29.720016]  [<817001fb>] ? debug_print_object+0x94/0xa2
[   29.720016]  [<817001fb>] ? debug_print_object+0x94/0xa2
[   29.720016]  [<8103f641>] warn_slowpath_fmt+0x2e/0x30
[   29.720016]  [<817001fb>] debug_print_object+0x94/0xa2
[   29.720016]  [<81344703>] ? xfs_buf_unlock+0x17f/0x17f
[   29.720016]  [<81700f1a>] debug_check_no_obj_freed+0xe2/0x1f8
[   29.720016]  [<8115aaa1>] kmem_cache_free+0xfe/0x32b
[   29.720016]  [<81342be0>] xfs_buf_free+0x263/0x26f
[   29.720016]  [<81343a71>] xfs_buf_rele+0x3e0/0x3f0
[   29.720016]  [<813635f6>] ? xfs_readsb+0x25e/0x268
[   29.720016]  [<813635f6>] xfs_readsb+0x25e/0x268
[   29.720016]  [<827243a5>] ? register_cpu_notifier+0x20/0x27
[   29.720016]  [<81367344>] xfs_fs_fill_super+0x27d/0x495
[   29.720016]  [<813670c7>] ? xfs_finish_flags+0x24e/0x24e
[   29.720016]  [<811666bf>] mount_bdev+0x1d6/0x256
[   29.720016]  [<813670c7>] ? xfs_finish_flags+0x24e/0x24e
[   29.720016]  [<8115a80b>] ? __kmalloc_track_caller+0x21a/0x3b2
[   29.720016]  [<816d87d6>] ? ida_get_new_above+0x24a/0x25b
[   29.720016]  [<81364b69>] xfs_fs_mount+0x1a/0x1f
[   29.720016]  [<813670c7>] ? xfs_finish_flags+0x24e/0x24e
[   29.720016]  [<81166a09>] mount_fs+0x38/0x1b3
[   29.720016]  [<811395a4>] ? __alloc_percpu+0xa/0xc
[   29.720016]  [<8118b7ff>] ? alloc_vfsmnt+0x108/0x1d6
[   29.720016]  [<8118ca68>] vfs_kern_mount+0x97/0x123
[   29.720016]  [<8118ef8f>] do_mount+0xc8e/0xe0d
[   29.720016]  [<8118e2e3>] ? copy_mount_string+0x51/0x6f
[   29.720016]  [<8118f1ee>] SyS_mount+0xe0/0x10d
[   29.720016]  [<827889ff>] syscall_call+0x7/0xb
[   29.720016] ---[ end trace dbf8d35123ab64b3 ]---
[   29.720016] BUG: sleeping function called from invalid context at 
kernel/workqueue.c:2810

dmesg-yocto-cairo-33:20131004225518:i386-randconfig-j0-10042023:3.12.0-rc2-next-20130927-03100-ga0cf1ab:3

[   30.423344] qnx6: unable to set blocksize
[   30.424465] ADFS-fs error (device zram0): adfs_fill_super: unable to read 
superblock
[   30.428281] [ cut here ]
[   30.428765] WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
[   30.429826] ODEBUG: free active (active state 0) object type: work_struct 
hint: xfs_buf_iodone_work+0x0/0xfe
[   30.430014] CPU: 0 PID: 381 Comm: mount Not tainted 
3.12.0-rc2-next-20130927-03100-ga0cf1ab #3
[   30.430014]    8982dd30 8275c9a2 8982dd70  0104 
8982dd60
[   30.430014]  8103f5e3 8334fafe 8982dd8c 017d 8334fb3b 0104 817001fb 
817001fb
[   30.430014]  8a7ba8c0 8355b58c 83323588 8982dd78 8103f641 0009 8982dd70 
8334fafe
[   30.434063] Call Trace:
[   30.434063]  [<8275c9a2>] dump_stack+0x7a/0xaa
[   30.434063]  [<8103f5e3>] warn_slowpath_common+0x8c/0xa3
[   30.434063]  [<817001fb>] ? debug_print_object+0x94/0xa2
[   30.434063]  [<817001fb>] ? debug_print_object+0x94/0xa2
[   30.434063]  [<8103f641>] warn_slowpath_fmt+0x2e/0x30
[   30.434063]  [<817001fb>] debug_print_object+0x94/0xa2
[   30.434063]  [<81344703>] ? xfs_buf_unlock+0x17f/0x17f
[   30.434063]  [<81700f1a>] debug_check_no_obj_freed+0xe2/0x1f8
[   30.434063]  [<8115aaa1>] kmem_cache_free+0xfe/0x32b
[   30.434063]  [<81342be0>] xfs_buf_free+0x263/0x26f
[   30.434063]  [<81343a71>] xfs_buf_rele+0x3e0/0x3f0
[   

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
 [add x...@oss.sgi.com to cc]

Thanks.

To help debug the problem, I searched XFS in my tests' oops database
and find one kernel that failed 4 times (out of 12 total boots) with
basically the same error:

  4 BUG: sleeping function called from invalid context at 
kernel/workqueue.c:2810
  1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
  1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
  1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
  1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()

dmesg-yocto-cairo-28:20131004225501:i386-randconfig-j0-10042023:3.12.0-rc2-next-20130927-03100-ga0cf1ab:3

[   29.709162] qnx6: unable to set blocksize
[   29.716037] ADFS-fs error (device zram0): adfs_fill_super: unable to read 
superblock
[   29.719395] [ cut here ]
[   29.719838] WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
[   29.720016] ODEBUG: free active (active state 0) object type: work_struct 
hint: xfs_buf_iodone_work+0x0/0xfe
[   29.720016] CPU: 1 PID: 372 Comm: mount Not tainted 
3.12.0-rc2-next-20130927-03100-ga0cf1ab #3
[   29.720016]    8a3efd30 8275c9a2 8a3efd70  0104 
8a3efd60
[   29.720016]  8103f5e3 8334fafe 8a3efd8c 0174 8334fb3b 0104 817001fb 
817001fb
[   29.720016]  8a121e90 8355b58c 83323588 8a3efd78 8103f641 0009 8a3efd70 
8334fafe
[   29.720016] Call Trace:
[   29.720016]  [8275c9a2] dump_stack+0x7a/0xaa
[   29.720016]  [8103f5e3] warn_slowpath_common+0x8c/0xa3
[   29.720016]  [817001fb] ? debug_print_object+0x94/0xa2
[   29.720016]  [817001fb] ? debug_print_object+0x94/0xa2
[   29.720016]  [8103f641] warn_slowpath_fmt+0x2e/0x30
[   29.720016]  [817001fb] debug_print_object+0x94/0xa2
[   29.720016]  [81344703] ? xfs_buf_unlock+0x17f/0x17f
[   29.720016]  [81700f1a] debug_check_no_obj_freed+0xe2/0x1f8
[   29.720016]  [8115aaa1] kmem_cache_free+0xfe/0x32b
[   29.720016]  [81342be0] xfs_buf_free+0x263/0x26f
[   29.720016]  [81343a71] xfs_buf_rele+0x3e0/0x3f0
[   29.720016]  [813635f6] ? xfs_readsb+0x25e/0x268
[   29.720016]  [813635f6] xfs_readsb+0x25e/0x268
[   29.720016]  [827243a5] ? register_cpu_notifier+0x20/0x27
[   29.720016]  [81367344] xfs_fs_fill_super+0x27d/0x495
[   29.720016]  [813670c7] ? xfs_finish_flags+0x24e/0x24e
[   29.720016]  [811666bf] mount_bdev+0x1d6/0x256
[   29.720016]  [813670c7] ? xfs_finish_flags+0x24e/0x24e
[   29.720016]  [8115a80b] ? __kmalloc_track_caller+0x21a/0x3b2
[   29.720016]  [816d87d6] ? ida_get_new_above+0x24a/0x25b
[   29.720016]  [81364b69] xfs_fs_mount+0x1a/0x1f
[   29.720016]  [813670c7] ? xfs_finish_flags+0x24e/0x24e
[   29.720016]  [81166a09] mount_fs+0x38/0x1b3
[   29.720016]  [811395a4] ? __alloc_percpu+0xa/0xc
[   29.720016]  [8118b7ff] ? alloc_vfsmnt+0x108/0x1d6
[   29.720016]  [8118ca68] vfs_kern_mount+0x97/0x123
[   29.720016]  [8118ef8f] do_mount+0xc8e/0xe0d
[   29.720016]  [8118e2e3] ? copy_mount_string+0x51/0x6f
[   29.720016]  [8118f1ee] SyS_mount+0xe0/0x10d
[   29.720016]  [827889ff] syscall_call+0x7/0xb
[   29.720016] ---[ end trace dbf8d35123ab64b3 ]---
[   29.720016] BUG: sleeping function called from invalid context at 
kernel/workqueue.c:2810

dmesg-yocto-cairo-33:20131004225518:i386-randconfig-j0-10042023:3.12.0-rc2-next-20130927-03100-ga0cf1ab:3

[   30.423344] qnx6: unable to set blocksize
[   30.424465] ADFS-fs error (device zram0): adfs_fill_super: unable to read 
superblock
[   30.428281] [ cut here ]
[   30.428765] WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
debug_print_object+0x94/0xa2()
[   30.429826] ODEBUG: free active (active state 0) object type: work_struct 
hint: xfs_buf_iodone_work+0x0/0xfe
[   30.430014] CPU: 0 PID: 381 Comm: mount Not tainted 
3.12.0-rc2-next-20130927-03100-ga0cf1ab #3
[   30.430014]    8982dd30 8275c9a2 8982dd70  0104 
8982dd60
[   30.430014]  8103f5e3 8334fafe 8982dd8c 017d 8334fb3b 0104 817001fb 
817001fb
[   30.430014]  8a7ba8c0 8355b58c 83323588 8982dd78 8103f641 0009 8982dd70 
8334fafe
[   30.434063] Call Trace:
[   30.434063]  [8275c9a2] dump_stack+0x7a/0xaa
[   30.434063]  [8103f5e3] warn_slowpath_common+0x8c/0xa3
[   30.434063]  [817001fb] ? debug_print_object+0x94/0xa2
[   30.434063]  [817001fb] ? debug_print_object+0x94/0xa2
[   30.434063]  [8103f641] warn_slowpath_fmt+0x2e/0x30
[   30.434063]  [817001fb] debug_print_object+0x94/0xa2
[   30.434063]  [81344703] ? xfs_buf_unlock+0x17f/0x17f
[   30.434063]  [81700f1a] debug_check_no_obj_freed+0xe2/0x1f8
[   30.434063]  [8115aaa1] kmem_cache_free+0xfe/0x32b
[   30.434063]  [81342be0] xfs_buf_free+0x263/0x26f
[   30.434063]  [81343a71] xfs_buf_rele+0x3e0/0x3f0
[   30.434063]  [813635f6] ? xfs_readsb+0x25e/0x268
[   30.434063]  [813635f6] 

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 09:16:40AM +0800, Fengguang Wu wrote:
 On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
  [add x...@oss.sgi.com to cc]
 
 Thanks.
 
 To help debug the problem, I searched XFS in my tests' oops database
 and find one kernel that failed 4 times (out of 12 total boots) with
 basically the same error:
 
   4 BUG: sleeping function called from invalid context at 
 kernel/workqueue.c:2810
   1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
 debug_print_object+0x94/0xa2()
   1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 
 debug_print_object+0x94/0xa2()
   1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
 debug_print_object+0x94/0xa2()
   1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 
 debug_print_object+0x94/0xa2()

And some other messages in an older kernel:

[   39.004416] F2FS-fs (nbd2): unable to read second superblock
[   39.005088] XFS: Assertion failed: read  bp-b_ops, file: 
fs/xfs/xfs_buf.c, line: 1036
[   39.005089] [ cut here ]
[   39.005096] WARNING: CPU: 1 PID: 20 at fs/xfs/xfs_message.c:100 
asswarn+0x33/0x40()
[   39.005099] CPU: 1 PID: 20 Comm: kworker/1:0H Not tainted 
3.11.0-rc1-00667-gf70eb07 #64
[   39.005100] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   39.005106] Workqueue: xfslogd xfs_buf_iodone_work
[   39.005110]  0009 88000ed91ce0 824044a1 

[   39.005113]  88000ed91d18 8109a0a8 880007d6cea8 
880007d6cd80
[   39.005116]    0060 
88000ed91d28
[   39.005116] Call Trace:
[   39.005122]  [824044a1] dump_stack+0x4e/0x82
[   39.005126]  [8109a0a8] warn_slowpath_common+0x78/0xa0
[   39.005129]  [8109a1ba] warn_slowpath_null+0x1a/0x20
[   39.005131]  [813e70d3] asswarn+0x33/0x40
[   39.005134]  [813d0f82] xfs_buf_iodone_work+0x92/0x200
[   39.005138]  [810bbdfa] process_one_work+0x1fa/0x6f0
[   39.005141]  [810bbd98] ? process_one_work+0x198/0x6f0
[   39.005144]  [810bc40d] worker_thread+0x11d/0x3a0
[   39.005146]  [810bc2f0] ? process_one_work+0x6f0/0x6f0
[   39.005149]  [810c80dd] kthread+0xed/0x100
[   39.005153]  [8110391d] ? trace_hardirqs_on_caller+0xfd/0x1c0
[   39.005156]  [810c7ff0] ? insert_kthread_work+0x80/0x80
[   39.005159]  [82418efc] ret_from_fork+0x7c/0xb0
[   39.005162]  [810c7ff0] ? insert_kthread_work+0x80/0x80
[   39.005164] ---[ end trace be5b205ae29d07cd ]---
[   39.028229] block nbd6: Attempted send on closed socket
[   39.028988] block nbd6: Attempted send on closed socket
[   39.029109] block nbd6: Attempted send on closed socket

[   41.550438] block nbd7: Attempted send on closed socket
[   41.550471] [ cut here ]
[   41.550476] WARNING: CPU: 1 PID: 878 at lib/list_debug.c:33 
__list_add+0xac/0xc0()
[   41.550478] list_add corruption. prev-next should be next 
(88000f3d7360), but was   (null). (prev=880008786a30).
[   41.550481] CPU: 1 PID: 878 Comm: mount Not tainted 
3.11.0-rc1-00667-gf70eb07 #64
[   41.550482] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   41.550485]  0009 880007d6fb08 824044a1 
880007d6fb50
[   41.550488]  880007d6fb40 8109a0a8 880007c6b530 
88000f3d7360
[   41.550491]  880008786a30 0007  
880007d6fba0
[   41.550491] Call Trace:
[   41.550499]  [824044a1] dump_stack+0x4e/0x82
[   41.550503]  [8109a0a8] warn_slowpath_common+0x78/0xa0
[   41.550505]  [8109a14c] warn_slowpath_fmt+0x4c/0x50
[   41.550509]  [81101359] ? get_lock_stats+0x19/0x60
[   41.550511]  [8163434c] __list_add+0xac/0xc0
[   41.550515]  [810ba453] insert_work+0x43/0xa0
[   41.550518]  [810bb22b] __queue_work+0x11b/0x510
[   41.550520]  [810bb936] queue_work_on+0x96/0xa0
[   41.550526]  [813d2096] ? _xfs_buf_ioend.constprop.15+0x26/0x30
[   41.550529]  [813d1f6c] xfs_buf_ioend+0x15c/0x260
[   41.550531]  [813d2f92] ? xfsbdstrat+0x22/0x170
[   41.550534]  [813d2096] _xfs_buf_ioend.constprop.15+0x26/0x30
[   41.550537]  [813d2873] xfs_buf_iorequest+0x73/0x1a0
[   41.550539]  [813d2f92] xfsbdstrat+0x22/0x170
[   41.550542]  [813d3832] xfs_buf_read_uncached+0x72/0xa0
[   41.550546]  [81445846] xfs_readsb+0x176/0x250
[   41.550550]  [813ea2da] xfs_fs_fill_super+0x24a/0x3a0
[   41.550553]  [811b9f40] mount_bdev+0x1c0/0x200
[   41.550556]  [813ea090] ? xfs_parseargs+0xc10/0xc10
[   41.550559]  [81101ff6] ? lock_release_holdtime.part.29+0xe6/0x160
[   41.550561]  [813e84f5] xfs_fs_mount+0x15/0x20
[   41.550563]  [811ba199] mount_fs+0x39/0x1b0
[   41.550567]  [81190980] ? __alloc_percpu+0x10/0x20
[   41.550571]  [811d8ff3] 

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Dave Chinner
On Thu, Oct 10, 2013 at 09:41:17AM +0800, Fengguang Wu wrote:
 On Thu, Oct 10, 2013 at 09:16:40AM +0800, Fengguang Wu wrote:
  On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
   [add x...@oss.sgi.com to cc]
  
  Thanks.
  
  To help debug the problem, I searched XFS in my tests' oops database
  and find one kernel that failed 4 times (out of 12 total boots) with
  basically the same error:
  
4 BUG: sleeping function called from invalid context at 
  kernel/workqueue.c:2810
1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 
  debug_print_object+0x94/0xa2()
1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 
  debug_print_object+0x94/0xa2()
1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 
  debug_print_object+0x94/0xa2()
1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 
  debug_print_object+0x94/0xa2()
 

Fenguang, I'll having real trouble associating these with the XFS
code path that is seeing the problems. These look like a use after
free or a double free, but that isn't possible in the XFS code paths
that are showing up in the traces.

 And some other messages in an older kernel:
 
 [   39.004416] F2FS-fs (nbd2): unable to read second superblock
 [   39.005088] XFS: Assertion failed: read  bp-b_ops, file: 
 fs/xfs/xfs_buf.c, line: 1036

This can not possibily occur on the superblock read path, as
bp-b_ops in that case is *always* initialised, as is XBF_READ.

So this implies something else has modified the struct xfs_buf.

 [   41.550471] [ cut here ]
 [   41.550476] WARNING: CPU: 1 PID: 878 at lib/list_debug.c:33 
 __list_add+0xac/0xc0()
 [   41.550478] list_add corruption. prev-next should be next 
 (88000f3d7360), but was   (null). (prev=880008786a30).

And this is a smoking gun - list corruption...

 [   41.550481] CPU: 1 PID: 878 Comm: mount Not tainted 
 3.11.0-rc1-00667-gf70eb07 #64
 [   41.550482] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [   41.550485]  0009 880007d6fb08 824044a1 
 880007d6fb50
 [   41.550488]  880007d6fb40 8109a0a8 880007c6b530 
 88000f3d7360
 [   41.550491]  880008786a30 0007  
 880007d6fba0
 [   41.550491] Call Trace:
 [   41.550499]  [824044a1] dump_stack+0x4e/0x82
 [   41.550503]  [8109a0a8] warn_slowpath_common+0x78/0xa0
 [   41.550505]  [8109a14c] warn_slowpath_fmt+0x4c/0x50
 [   41.550509]  [81101359] ? get_lock_stats+0x19/0x60
 [   41.550511]  [8163434c] __list_add+0xac/0xc0
 [   41.550515]  [810ba453] insert_work+0x43/0xa0
 [   41.550518]  [810bb22b] __queue_work+0x11b/0x510
 [   41.550520]  [810bb936] queue_work_on+0x96/0xa0
 [   41.550526]  [813d2096] ? _xfs_buf_ioend.constprop.15+0x26/0x30
 [   41.550529]  [813d1f6c] xfs_buf_ioend+0x15c/0x260

... in the workqueue code on a work item in the the struct xfs_buf .

 [   41.550531]  [813d2f92] ? xfsbdstrat+0x22/0x170
 [   41.550534]  [813d2096] _xfs_buf_ioend.constprop.15+0x26/0x30
 [   41.550537]  [813d2873] xfs_buf_iorequest+0x73/0x1a0
 [   41.550539]  [813d2f92] xfsbdstrat+0x22/0x170
 [   41.550542]  [813d3832] xfs_buf_read_uncached+0x72/0xa0
 [   41.550546]  [81445846] xfs_readsb+0x176/0x250

... in the very context that we allocated the struct xfs_buf. It's
not a use after free or memory corruption caused by XFS you are
seeing here.

I note that you have CONFIG_SLUB=y, which means that the cache slabs
are shared with objects of other types. That means that the memory
corruption problem is likely to be caused by one of the other
filesystems that is probing the block device(s), not XFS.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
Dave,

 I note that you have CONFIG_SLUB=y, which means that the cache slabs
 are shared with objects of other types. That means that the memory
 corruption problem is likely to be caused by one of the other
 filesystems that is probing the block device(s), not XFS.

Good to know that, it would easy to test then: just turn off every
other filesystems. I'll try it right away.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
 Dave,
 
  I note that you have CONFIG_SLUB=y, which means that the cache slabs
  are shared with objects of other types. That means that the memory
  corruption problem is likely to be caused by one of the other
  filesystems that is probing the block device(s), not XFS.
 
 Good to know that, it would easy to test then: just turn off every
 other filesystems. I'll try it right away.

Seems that we don't even need to do that. A dig through the oops
database and I find stack dumps from other FS.

This happens in the kernel with same kconfig and commit 3.12-rc1.

[   51.205369] block nbd1: Attempted send on closed socket
[   51.214126] BUG: unable to handle kernel NULL pointer dereference at 0004
[   51.215640] IP: [c10343fb] pool_mayday_timeout+0x5f/0x9c
[   51.216262] *pdpt = 0ca90001 *pde =  
[   51.216262] Oops:  [#1] 
[   51.216262] CPU: 0 PID: 644 Comm: mount Not tainted 3.12.0-rc1 #2
[   51.216262] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   51.216262] task: ccffd7a0 ti: cca54000 task.ti: cca54000
[   51.216262] EIP: 0060:[c10343fb] EFLAGS: 0046 CPU: 0
[   51.216262] EIP is at pool_mayday_timeout+0x5f/0x9c
[   51.216262] EAX:  EBX: c1a81d50 ECX:  EDX: 
[   51.216262] ESI: cd0d303c EDI: cfff7054 EBP: cca55d2c ESP: cca55d18
[   51.216262]  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
[   51.216262] CR0: 8005003b CR2: 0004 CR3: 0ca0b000 CR4: 06b0
[   51.216262] DR0:  DR1:  DR2:  DR3: 
[   51.216262] DR6:  DR7: 
[   51.216262] Stack:
[   51.216262]  c1a81d60 cd0d303c 0100 c103439c cca55d58 cca55d3c c102cd96 
c1ba4700
[   51.216262]  cca55d58 cca55d6c c102cf7e c1a81d50 c1ba5110 c1ba4f10 cca55d58 
c103439c
[   51.216262]  cca55d58 cca55d58 0001 c1ba4588 0100 cca55d90 c1028f61 
0001
[   51.216262] Call Trace:
[   51.216262]  [c103439c] ? need_to_create_worker+0x32/0x32
[   51.216262]  [c102cd96] call_timer_fn.isra.39+0x16/0x60
[   51.216262]  [c102cf7e] run_timer_softirq+0x144/0x15e
[   51.216262]  [c103439c] ? need_to_create_worker+0x32/0x32
[   51.216262]  [c1028f61] __do_softirq+0x87/0x12b
[   51.216262]  [c10290c4] irq_exit+0x3a/0x48
[   51.216262]  [c1002918] do_IRQ+0x64/0x77
[   51.216262]  [c175fbac] common_interrupt+0x2c/0x31
[   51.216262]  [c12188ee] ? ocfs2_get_sector+0x14/0x1cd
[   51.216262]  [c1218b72] ocfs2_sb_probe+0xcb/0x7ca
[   51.216262]  [c107bb1c] ? bdi_lock_two+0x8/0x14
[   51.216262]  [c12cfc11] ? string.isra.4+0x26/0x89
[   51.216262]  [c121a7ba] ocfs2_fill_super+0x39/0xe84
[   51.216262]  [c12d1000] ? pointer.isra.15+0x23f/0x25b
[   51.216262]  [c12c3660] ? disk_name+0x20/0x65
[   51.216262]  [c109d8f6] mount_bdev+0x105/0x14d
[   51.216262]  [c1092aaa] ? slab_pre_alloc_hook.isra.66+0x1e/0x25
[   51.216262]  [c1095353] ? __kmalloc_track_caller+0xb8/0xe4
[   51.216262]  [c10ae5da] ? alloc_vfsmnt+0xdc/0xff
[   51.216262]  [c1217173] ocfs2_mount+0x10/0x12
[   51.216262]  [c121a781] ? ocfs2_handle_error+0xa2/0xa2
[   51.216262]  [c109dad1] mount_fs+0x55/0x123
[   51.216262]  [c10aef24] vfs_kern_mount+0x44/0xac
[   51.216262]  [c10b030a] do_mount+0x647/0x768
[   51.216262]  [c107b043] ? strndup_user+0x2c/0x3d
[   51.216262]  [c10b049c] SyS_mount+0x71/0xa0
[   51.216262]  [c175f074] syscall_call+0x7/0xb
[   51.216262] Code: 43 44 e8 7a 8c ff ff 58 5a 5b 5e 5f 5d c3 8b 43 10 8d 78 
fc 8d 43 10 89 45 ec 8d 47 04 3b 45 ec 74 ca 89 f8 e8 44 f0 ff ff 89 c1 8b 50 
04 83 7a 44 00 74 2c 8b 40 68 8d 71 68 39 f0 75 22 8b 72
[   51.216262] EIP: [c10343fb] pool_mayday_timeout+0x5f/0x9c SS:ESP 
0068:cca55d18
[   51.216262] CR2: 0004
[   51.216262] ---[ end trace 267272283b2d7610 ]---
[   51.216262] Kernel panic - not syncing: Fatal exception in interrupt

[3.244964] block nbd1: Attempted send on closed socket
[3.246243] block nbd1: Attempted send on closed socket
[3.247508] (mount,661,0):ocfs2_get_sector:1861 ERROR: status = -5
[3.248906] (mount,661,0):ocfs2_sb_probe:770 ERROR: status = -5
[3.250269] (mount,661,0):ocfs2_fill_super:1038 ERROR: superblock probe 
failed!
[3.252100] (mount,661,0):ocfs2_fill_super:1229 ERROR: status = -5
[3.253569] BUG: unable to handle kernel NULL pointer dereference at 0004
[3.255322] IP: [c1034850] process_one_work+0x1a/0x1cc
[3.256681] *pdpt = 0c950001 *pde =  
[3.256833] Oops:  [#1] 
[3.256833] CPU: 0 PID: 5 Comm: kworker/0:0H Not tainted 3.12.0-rc1 #2
[3.256833] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[3.256833] task: cec44d80 ti: cec54000 task.ti: cec54000
[3.256833] EIP: 0060:[c1034850] EFLAGS: 00010046 CPU: 0
[3.256833] EIP is at process_one_work+0x1a/0x1cc
[3.256833] EAX:  EBX: cec1b900 ECX: ccdf0700 EDX: ccdf0700
[3.256833] ESI: ccdf0754 EDI: c1a81d50 EBP: cec55f44 ESP: cec55f2c
[3.256833]  DS: 007b ES: 007b FS:  GS:  SS: 

Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Fengguang Wu
On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
 On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
  Dave,
  
   I note that you have CONFIG_SLUB=y, which means that the cache slabs
   are shared with objects of other types. That means that the memory
   corruption problem is likely to be caused by one of the other
   filesystems that is probing the block device(s), not XFS.
  
  Good to know that, it would easy to test then: just turn off every
  other filesystems. I'll try it right away.
 
 Seems that we don't even need to do that. A dig through the oops
 database and I find stack dumps from other FS.
 
 This happens in the kernel with same kconfig and commit 3.12-rc1.

Here is a summary of all FS with oops:

411 ocfs2_fill_super
189 xfs_fs_fill_super
 86 jfs_fill_super
 50 isofs_fill_super
 33 fat_fill_super
 18 vfat_fill_super
 15 msdos_fill_super
 11 ext2_fill_super
 10 ext3_fill_super
  3 reiserfs_fill_super

Thanks,
Fengguang
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003

2013-10-09 Thread Dave Chinner
On Thu, Oct 10, 2013 at 11:38:34AM +0800, Fengguang Wu wrote:
 On Thu, Oct 10, 2013 at 11:33:00AM +0800, Fengguang Wu wrote:
  On Thu, Oct 10, 2013 at 11:26:37AM +0800, Fengguang Wu wrote:
   Dave,
   
I note that you have CONFIG_SLUB=y, which means that the cache slabs
are shared with objects of other types. That means that the memory
corruption problem is likely to be caused by one of the other
filesystems that is probing the block device(s), not XFS.
   
   Good to know that, it would easy to test then: just turn off every
   other filesystems. I'll try it right away.
  
  Seems that we don't even need to do that. A dig through the oops
  database and I find stack dumps from other FS.
  
  This happens in the kernel with same kconfig and commit 3.12-rc1.
 
 Here is a summary of all FS with oops:
 
 411 ocfs2_fill_super
 189 xfs_fs_fill_super
  86 jfs_fill_super
  50 isofs_fill_super
  33 fat_fill_super
  18 vfat_fill_super
  15 msdos_fill_super
  11 ext2_fill_super
  10 ext3_fill_super
   3 reiserfs_fill_super

The order of probing on the original dmesg output you reported is:

ext3
ext2
fatfs
reiserfs
gfs2
isofs
ocfs2

which means that no XFS filesystem was mounted in the original bug
report, and hence that further indicates that XFS is not responsible
for the problem and that perhaps the original bisect was not
reliable...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/