Re: xfslogd-spinlock bug?

2006-12-16 Thread Haar János
+0x45/0xa6
Dec 16 12:08:37 dy-base  [8020a6a8] child_rip+0xa/0x12
Dec 16 12:08:37 dy-base  [802399d8] worker_thread+0x0/0x134
Dec 16 12:08:37 dy-base  [8023c60d] kthread+0x0/0x10b
Dec 16 12:08:38 dy-base  [8020a69e] child_rip+0x0/0x12
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base Code: 8b 83 0c 01 00 00 48 8d 8b 98 02 00 00 41 8b 54 24 04 41 89
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base RIP
Dec 16 12:08:38 dy-base  [803f3aba] spin_bug+0x69/0xdf
Dec 16 12:08:38 dy-base  RSP 81011fdedbc0
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base Kernel panic - not syncing: Fatal exception
Dec 16 12:08:38 dy-base
Dec 16 12:08:38 dy-base Rebooting in 5 seconds..




- Original Message - 
From: Haar János [EMAIL PROTECTED]
To: Justin Piszcz [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Sent: Wednesday, December 13, 2006 2:11 AM
Subject: Re: xfslogd-spinlock bug?


 Hello, Justin,

 This is a 64-bit system.

 But I cannot understand: what is so curious about that? :-)

 I am not a kernel developer or a C programmer, but the long pointers
 tell me it is 64-bit.
 Or am I on the wrong track? :-)

 Anyway, this issue hits me about daily, or at most every 2-3 days.
 But I have no idea what exactly causes it.
 2.6.16.18 was stable for me for a long time, then one day it started
 tricking me, and it happens more and more often.
 That's why I suspect some bad part of the (14TB) FS on the disks.

 (This fs has a lot of errors that xfs_repair cannot correct, but it is a
 production system anyway, and it works well except for this issue.
 Some months ago I replaced the parity disk in one of the RAID4 arrays,
 and the next day, during the resync, another disk died.
 I almost lost everything, but thanks to RAID4 and mdadm I successfully
 recovered a lot of data from the one-day-old parity-only drive.
 That was really bad, and it left some scars on the fs.)

 This issue looks to me like a race condition between the CPUs. (2x Xeon HT)

 Am I right? :-)

 Thanks,

 Janos




 - Original Message - 
 From: Justin Piszcz [EMAIL PROTECTED]
 To: Haar János [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
 Sent: Tuesday, December 12, 2006 3:32 PM
 Subject: Re: xfslogd-spinlock bug?


 I'm not sure what is causing this problem, but I was curious: is this on a
 32-bit or 64-bit platform?

 Justin.

 On Tue, 12 Dec 2006, Haar János wrote:

  Hello, list,
 
  I am the "big red button man" with the one big 14TB XFS, if somebody can
  remember me. :-)

  Now I found something in 2.6.16.18, and I have tried 2.6.18.4 and
  2.6.19, but the bug still exists:
 
  Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
  Dec 11 22:47:21 dy-base general protection fault:  [1]
  Dec 11 22:47:21 dy-base SMP
  Dec 11 22:47:21 dy-base
  Dec 11 22:47:21 dy-base CPU 3
  Dec 11 22:47:21 dy-base
  Dec 11 22:47:21 dy-base Modules linked in:
  Dec 11 22:47:21 dy-base  nbd
  Dec 11 22:47:21 dy-base  rd
  Dec 11 22:47:21 dy-base  netconsole
  Dec 11 22:47:21 dy-base  e1000
  Dec 11 22:47:21 dy-base  video
  Dec 11 22:47:21 dy-base
  Dec 11 22:47:21 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
  Dec 11 22:47:21 dy-base RIP: 0010:[803f3aba]
  Dec 11 22:47:21 dy-base  [803f3aba] spin_bug+0x69/0xdf
  Dec 11 22:47:21 dy-base RSP: 0018:81011fb89bc0  EFLAGS: 00010002
  Dec 11 22:47:21 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
  
  Dec 11 22:47:21 dy-base RDX: 808137a0 RSI: 0082 RDI:
  0001
  Dec 11 22:47:21 dy-base RBP: 81011fb89be0 R08: 00026a70 R09:
  6b6b6b6b
  Dec 11 22:47:21 dy-base R10: 0082 R11: 81000584d380 R12:
  8100db92ad80
  Dec 11 22:47:21 dy-base R13: 80642dc6 R14:  R15:
  0003
  Dec 11 22:47:21 dy-base FS:  ()
  GS:81011fc76b90() knlGS:
  Dec 11 22:47:21 dy-base CS:  0010 DS: 0018 ES: 0018 CR0:
8005003b
  Dec 11 22:47:21 dy-base CR2: 2ba00770 CR3: 000108c05000 CR4:
  06e0
  Dec 11 22:47:21 dy-base Process xfslogd/3 (pid: 317, threadinfo
  81011fb88000, task 81011fa7f830)
  Dec 11 22:47:21 dy-base Stack:
  Dec 11 22:47:21 dy-base  81011fb89be0
  Dec 11 22:47:21 dy-base  8100db92ad80
  Dec 11 22:47:21 dy-base  
  Dec 11 22:47:21 dy-base

Re: xfslogd-spinlock bug?

2006-12-17 Thread Haar János

- Original Message - 
From: David Chinner [EMAIL PROTECTED]
To: Haar János [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Sent: Sunday, December 17, 2006 11:44 PM
Subject: Re: xfslogd-spinlock bug?


 On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
  Hi
 
  I have some news.
 
  I don't know if there is a connection between the 2 messages, but I can see
  that the spinlock bug always comes on CPU #3.

  Does somebody have any idea?

 Your disk interrupts are directed to CPU 3, and so log I/O completion
 occurs on that CPU.

           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0    3878668   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          3   IO-APIC-edge      i8042
 14:    3072118          0          0        181   IO-APIC-edge      ide0
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
NMI:      42271      43187      42234      43168
LOC:    4584247    4584219    4584215    4584198
ERR:          0

Maybe this is related:
I have 3 XFS filesystems on this system, on 3 different sources.

1. 200G, one IDE hdd.
2. 2x200G mirror on 1 IDE + 1 SATA hdd.
3. 4x3.3TB stripe on NBD.

The NBD is served through eth1, which is on CPU3, but ide0 is on CPU0.
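
[Editorial aside: the IRQ-to-CPU binding David is pointing at is visible in the table above and is controlled through /proc/irq/<nr>/smp_affinity. The sketch below only illustrates that interface; the IRQ number 52 (eth1) is taken from the listing above, it needs root, and it assumes no irqbalance daemon immediately rewrites the mask. It is not something proposed in the thread.]

#include <stdio.h>

/* Write a CPU bitmask to /proc/irq/<nr>/smp_affinity.
 * mask is a hex bitmask of allowed CPUs: 0x1 = CPU0, 0x8 = CPU3, ... */
static int set_irq_affinity(int irq, unsigned int mask)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%x\n", mask);
    return fclose(f);                 /* 0 on success */
}

int main(void)
{
    /* IRQ 52 is eth1 in the /proc/interrupts listing above;
     * 0x1 restricts its delivery to CPU0. */
    return set_irq_affinity(52, 0x1) ? 1 : 0;
}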



  Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
  Dec 16 12:08:36 dy-base general protection fault:  [1]
  Dec 16 12:08:36 dy-base SMP
  Dec 16 12:08:36 dy-base
  Dec 16 12:08:36 dy-base CPU 3
  Dec 16 12:08:36 dy-base
  Dec 16 12:08:36 dy-base Modules linked in:
  Dec 16 12:08:36 dy-base  nbd

 Are you using XFS on a NBD?

Yes, on source #3.
I have used it for about 1.5 years.

(The nbd deadlock is fixed on my system, thanks to Herbert Xu, since 2.6.14.)


  Dec 16 12:08:36 dy-base  rd
  Dec 16 12:08:36 dy-base  netconsole
  Dec 16 12:08:36 dy-base  e1000
  Dec 16 12:08:36 dy-base  video
  Dec 16 12:08:36 dy-base
  Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
  Dec 16 12:08:36 dy-base RIP: 0010:[803f3aba]
  Dec 16 12:08:36 dy-base  [803f3aba] spin_bug+0x69/0xdf
  Dec 16 12:08:36 dy-base RSP: 0018:81011fdedbc0  EFLAGS: 00010002
  Dec 16 12:08:36 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
  
 Anyone recognise that pattern?

I think I have one idea.
This issue sometimes prevents the 5-second automatic restart after the crash,
which points to possible memory corruption, and perhaps to the bug occurring
in the IRQ handling. :-)
I have a lot of logs of this issue, and RAX and RBX are always the same.


  Dec 16 12:08:36 dy-base Call Trace:
  Dec 16 12:08:36 dy-base  [803f3bdc] _raw_spin_lock+0x23/0xf1
  Dec 16 12:08:36 dy-base  [805e7f2b]
_spin_lock_irqsave+0x11/0x18
  Dec 16 12:08:36 dy-base  [80222aab] __wake_up+0x22/0x50
  Dec 16 12:08:36 dy-base  [803c97f9] xfs_buf_unpin+0x21/0x23
  Dec 16 12:08:36 dy-base  [803970a4]
xfs_buf_item_unpin+0x2e/0xa6

 This implies a spinlock inside a wait_queue_head_t is corrupt.

 What type of system do you have, and what sort of
 workload are you running?

OS: Fedora 5, 64-bit.
HW: dual Xeon with HT, 4GB RAM.
(The min_free_kbytes limit is set to 128000, because the e1000 driver sometimes
runs out of reserved memory during IRQ handling.)
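
[Editorial aside: that reserve is the vm.min_free_kbytes sysctl, exposed as /proc/sys/vm/min_free_kbytes. The minimal sketch below just shows the knob being described; the value 128000 is simply the one quoted above, and doing it from a tiny C program rather than sysctl(8) is only for illustration.]

#include <stdio.h>

/* Raise the kernel's pool of reserved free memory so that atomic
 * allocations made in IRQ context (such as e1000 RX buffer refills)
 * are less likely to fail under pressure.
 * Equivalent to: sysctl -w vm.min_free_kbytes=128000 */
int main(void)
{
    FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "w");

    if (!f) {
        perror("/proc/sys/vm/min_free_kbytes");
        return 1;
    }
    fprintf(f, "%d\n", 128000);
    return fclose(f) ? 1 : 0;
}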

Workload:

I use this system for free web storage.
(2x Apache 2.0.xx, 12x pure-ftpd, 2x MySQL, but SQL only uses the source #2 fs.)

The normal system load is ~20-40, but currently I have a little problem with
Apache: it sometimes starts to read a lot from the big XFS device and eats all
the memory, and the load rises to 700-800.
At that point I restart httpd and everything goes back to normal, but if I
don't, the machine goes offline.

Thanks a lot!

Janos


 Cheers,

 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group



Re: xfslogd-spinlock bug?

2006-12-18 Thread Haar János

- Original Message - 
From: David Chinner [EMAIL PROTECTED]
To: Haar János [EMAIL PROTECTED]
Cc: David Chinner [EMAIL PROTECTED]; [EMAIL PROTECTED];
linux-kernel@vger.kernel.org
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?


 On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
   On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
I dont know there is a context between 2 messages, but i can see,
the
spinlock bug comes always on cpu #3.
   
Somebody have any idea?
  
   Your disk interrupts are directed to CPU 3, and so log I/O completion
   occurs on that CPU.
 
           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0    3878668   IO-APIC-edge      serial
 .
 14:    3072118          0          0        181   IO-APIC-edge      ide0
 .
 52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
 
 
  Maybe
  I have 3 XFS on this system, with 3 source.
 
  1. 200G one ide hdd.
  2. 2x200G mirror on 1 ide + 1 sata hdd.
  3. 4x3.3TB strip on NBD.
 
  The NBD serves through eth1, and it is on the CPU3, but the ide0 is on
the
  CPU0.

 I'd say your NBD based XFS filesystem is having trouble.

   Are you using XFS on a NBD?
 
  Yes, on the 3. source.

 Ok, I've never heard of a problem like this before, and you are doing
 something that very few ppl are doing (i.e. XFS on NBD). Hence I'd start
 by suspecting a bug in the NBD driver.

OK, if you are right, this could also be connected with the following issue:

http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)



Dec 16 12:08:36 dy-base RSP: 0018:81011fdedbc0  EFLAGS: 00010002
Dec 16 12:08:36 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b
RCX:

   Anyone recognise that pattern?
 
  I think i have one idea.
  This issue can stops sometimes the 5sec automatic restart on crash, and
this
  shows possible memory corruption, and if the bug occurs in the IRQ
  handling :-)
  I have a lot of logs about this issue, and the RAX, RBX always the same.

 And is this the only place where you see the problem? Or are there
 other stack traces that you see this in as well?

I used 2.6.16.18 for a long time, and it was stable except for this
issue (~20 dumps with xfslogd).
Then I tried the new releases, and now I get even more. :-)

What exactly do you mean?
I can look in the logs, but search for what?
The RAX/RBX thing, the xfslogd-spinlock problem, or the old nbd-deadlock
+ memory corruption?

[EMAIL PROTECTED] netlog]# grep "0033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.1:Dec 12 18:16:35 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.1:Dec 13 11:40:05 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.1:Dec 14 22:25:32 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.1:Dec 15 06:24:44 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.1:Dec 16 12:08:36 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.11:Oct  3 19:49:44 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.11:Oct  7 01:11:17 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.13:Sep 21 15:35:31 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.15:Sep  3 16:13:35 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.15:Sep  5 21:00:38 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.2:Dec  9 00:10:47 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.2:Dec  9 14:07:01 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.2:Dec 10 04:44:48 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.3:Nov 30 10:59:21 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.3:Dec  2 00:54:23 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.5:Nov 13 10:44:49 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.5:Nov 14 03:14:14 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.5:Nov 14 03:37:07 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.5:Nov 15 01:39:54 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
messages.6:Nov  6

Re: xfslogd-spinlock bug?

2006-12-18 Thread Haar János

- Original Message - 
From: David Chinner [EMAIL PROTECTED]
To: Haar János [EMAIL PROTECTED]
Cc: David Chinner [EMAIL PROTECTED]; [EMAIL PROTECTED];
linux-kernel@vger.kernel.org
Sent: Monday, December 18, 2006 11:36 PM
Subject: Re: xfslogd-spinlock bug?


 On Mon, Dec 18, 2006 at 09:17:50AM +0100, Haar János wrote:
  From: David Chinner [EMAIL PROTECTED]
The NBD serves through eth1, and it is on the CPU3, but the ide0 is
on
  the
CPU0.
  
   I'd say your NBD based XFS filesystem is having trouble.
  
 Are you using XFS on a NBD?
   
Yes, on the 3. source.
  
   Ok, I've never heard of a problem like this before and you are doing
   something that very few ppl are doing (i.e. XFS on NBD). I'd start
   Hence  I'd start by suspecting a bug in the NBD driver.
 
  Ok, if you have right, this also can be in context with the following
issue:
 
  http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)

 Which appears to be a crash in wake_up_process() when doing memory
 reclaim (waking the xfsbufd).

Sorry, can you translate that into "poor man's language"? :-)
Is this a different bug?



  Dec 16 12:08:36 dy-base RSP: 0018:81011fdedbc0  EFLAGS:
00010002
  Dec 16 12:08:36 dy-base RAX: 0033 RBX:
6b6b6b6b6b6b6b6b
  RCX:


 Anyone recognise that pattern?

 Ok, I've found this pattern:

 #define POISON_FREE 0x6b

 Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?

Yes, I built the kernel with this option enabled.
Is that wrong?
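
[Editorial aside: the poison values David quotes come from include/linux/poison.h; with CONFIG_DEBUG_SLAB=y a freed object is typically overwritten with 0x6b bytes, with 0xa5 as the last byte, which is exactly why RBX=6b6b6b6b6b6b6b6b is the classic fingerprint of touching memory after it has been freed. The small user-space program below only illustrates that fingerprint; it is not kernel code.]

#include <stdint.h>
#include <stdio.h>

/* Slab-debug poison bytes, as defined in include/linux/poison.h
 * (copied here purely for illustration). */
#define POISON_INUSE 0x5a   /* for use-uninitialised poisoning */
#define POISON_FREE  0x6b   /* for use-after-free poisoning */
#define POISON_END   0xa5   /* end-byte of the poisoned area */

/* A 64-bit value made entirely of POISON_FREE bytes means the word was
 * loaded from an object the slab allocator has already freed. */
static int looks_like_freed_slab(uint64_t word)
{
    return word == 0x6b6b6b6b6b6b6b6bULL;
}

int main(void)
{
    uint64_t rbx = 0x6b6b6b6b6b6b6b6bULL;   /* the value seen in every oops above */

    if (looks_like_freed_slab(rbx))
        printf("RBX is all POISON_FREE (0x%02x) bytes: use-after-free suspected\n",
               POISON_FREE);
    return 0;
}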


 If so, we have a use after free occurring here and it would also
 explain why no-one has reported it before.

 FWIW, can you turn on CONFIG_XFS_DEBUG=y and see if that triggers
 a different bug check prior to the above dump?

[EMAIL PROTECTED] linux-2.6.19]# make bzImage
scripts/kconfig/conf -s arch/x86_64/Kconfig
.config:7:warning: trying to assign nonexistent symbol XFS_DEBUG

Have I missed something?


Thanks,

Janos


 Cheers,

 Dave.
 -- 
 Dave Chinner
 Principal Engineer
 SGI Australian Software Group



Re: xfslogd-spinlock bug?

2006-12-27 Thread Haar János
Hello,

- Original Message - 
From: David Chinner [EMAIL PROTECTED]
To: David Chinner [EMAIL PROTECTED]
Cc: Haar János [EMAIL PROTECTED]; [EMAIL PROTECTED];
linux-kernel@vger.kernel.org
Sent: Tuesday, December 19, 2006 5:47 AM
Subject: Re: xfslogd-spinlock bug?


 On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
  On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
   From: David Chinner [EMAIL PROTECTED]
#define POISON_FREE 0x6b
   
Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
  
   Yes, i build with this option enabled.

 ..

  FWIW, I've run XFSQA twice now on a scsi disk with slab debugging turned
  on and I haven't seen this problem. I'm not sure how to track down
  the source of the problem without a test case, but as a quick test, can
  you try the following patch?

 Third try and I got a crash on a poisoned object:

 [1]kdb md8c40 e0300d7d5100
 0xe0300d7d5100 5a2cf071    q.,Z
 0xe0300d7d5110 5a2cf071 6b6b6b6b6b6b6b6b   q.,Z
 0xe0300d7d5120 e039eb7b6320 6b6b6b6b6b6b6b6bc{.9...
 0xe0300d7d5130 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d5140 6b6b6b6f6b6b6b6b 6b6b6b6b6b6b6b6b   okkk
 0xe0300d7d5150 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d5160 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d5170 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d5180 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d5190 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d51a0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d51b0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d51c0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   
 0xe0300d7d51d0 6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b   kkk.
 0xe0300d7d51e0 5a2cf071 a00100468c30   q.,Z0.F.
 [1]kdb mds 0xe0300d7d51e0
 0xe0300d7d51e0 5a2cf071   q.,Z
 0xe0300d7d51e8 a00100468c30 xfs_inode_item_destroy+0x30

 So the use-after-free here is on an inode item. You're tripping
 over a buffer item.

 Unfortunately, it is not the same problem - the problem I've just
 hit is to do with a QA test that does a forced shutdown on an active
 filesystem, and:

 [1]kdb xmount 0xe0304393e238
 .
 flags 0x440010 FSSHUTDOWN IDELETE COMPAT_IOSIZE 

 The filesystem was being shutdown so xfs_inode_item_destroy() just
 frees the inode log item without removing it from the AIL. I'll fix that,
 and see if i have any luck

 So I'd still try that patch i sent in the previous email...

I am still using the patch, but it has not shown any messages so far.

I've had 3 crashes/reboots; 2 were caused by nbd disconnects, and then there is this one:

Dec 27 13:41:29 dy-base BUG: warning at
kernel/mutex.c:220/__mutex_unlock_common_slowpath()
Dec 27 13:41:29 dy-base Unable to handle kernel paging request at
66604480 RIP:
Dec 27 13:41:29 dy-base  [80222c64] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
Dec 27 13:41:29 dy-base Oops:  [1] SMP
Dec 27 13:41:29 dy-base CPU 1
Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
Dec 27 13:41:29 dy-base RIP: 0010:[80222c64]  [80222c64]
resched_task+0x12/0x64
Dec 27 13:41:29 dy-base RSP: 0018:810105c01b78  EFLAGS: 00010083
Dec 27 13:41:29 dy-base RAX: 807d5800 RBX: 1749fd97c214 RCX:
81001cbd
Dec 27 13:41:29 dy-base RDX: 1cbd0048 RSI: 810005834068 RDI:
8101047bf040
Dec 27 13:41:29 dy-base RBP: 810105c01b78 R08: 0001 R09:

Dec 27 13:41:29 dy-base R10: 0057 R11: 81000583cd80 R12:
810116693140
Dec 27 13:41:29 dy-base R13: 0001 R14:  R15:

Dec 27 13:41:29 dy-base FS:  2ba3c1ad07d0()
GS:81011fc769c8() knlGS:
Dec 27 13:41:29 dy-base CS:  0010 DS:  ES:  CR0: 80050033
Dec 27 13:41:29 dy-base CR2: 66604480 CR3: 000118196000 CR4:
06e0
Dec 27 13:41:29 dy-base Process httpd (pid: 4069, threadinfo
810105c0, task 8101166e1040)
Dec 27 13:41:29 dy-base Stack:  810105c01bf8 80223f37
810002996a00 02cba600
Dec 27 13:41:29 dy-base  000f 0001 810005833700
0001
Dec 27 13:41:29 dy-base  0005 0296 810105c01bd8
810117fef690
Dec 27 13:41:29 dy-base Call Trace:
Dec 27 13:41:29 dy-base  [80223f37] try_to_wake_up+0x3a7/0x3dc
Dec 27 13:41:29 dy-base  [80223f98] wake_up_process+0x10/0x12
Dec 27 13:41:29 dy-base  [803c9186] xfsbufd_wakeup+0x34/0x61
Dec 27 13:41:29 dy-base  [8025cdf9] shrink_slab+0x64/0x163
Dec 27 13:41:29 dy-base

xfslogd-spinlock bug?

2006-12-11 Thread Haar János
Hello, list,

I am the "big red button man" with the one big 14TB XFS, if somebody
remembers me. :-)

Now I found something in 2.6.16.18, and I have tried 2.6.18.4 and
2.6.19, but the bug still exists:

Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
Dec 11 22:47:21 dy-base general protection fault:  [1]
Dec 11 22:47:21 dy-base SMP
Dec 11 22:47:21 dy-base
Dec 11 22:47:21 dy-base CPU 3
Dec 11 22:47:21 dy-base
Dec 11 22:47:21 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 11 22:47:21 dy-base
Dec 11 22:47:21 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
Dec 11 22:47:21 dy-base RIP: 0010:[803f3aba]
Dec 11 22:47:21 dy-base  [803f3aba] spin_bug+0x69/0xdf
Dec 11 22:47:21 dy-base RSP: 0018:81011fb89bc0  EFLAGS: 00010002
Dec 11 22:47:21 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:

Dec 11 22:47:21 dy-base RDX: 808137a0 RSI: 0082 RDI:
0001
Dec 11 22:47:21 dy-base RBP: 81011fb89be0 R08: 00026a70 R09:
6b6b6b6b
Dec 11 22:47:21 dy-base R10: 0082 R11: 81000584d380 R12:
8100db92ad80
Dec 11 22:47:21 dy-base R13: 80642dc6 R14:  R15:
0003
Dec 11 22:47:21 dy-base FS:  ()
GS:81011fc76b90() knlGS:
Dec 11 22:47:21 dy-base CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
Dec 11 22:47:21 dy-base CR2: 2ba00770 CR3: 000108c05000 CR4:
06e0
Dec 11 22:47:21 dy-base Process xfslogd/3 (pid: 317, threadinfo
81011fb88000, task 81011fa7f830)
Dec 11 22:47:21 dy-base Stack:
Dec 11 22:47:21 dy-base  81011fb89be0
Dec 11 22:47:21 dy-base  8100db92ad80
Dec 11 22:47:21 dy-base  
Dec 11 22:47:21 dy-base  
Dec 11 22:47:21 dy-base
Dec 11 22:47:21 dy-base  81011fb89c10
Dec 11 22:47:21 dy-base  803f3bdc
Dec 11 22:47:21 dy-base  0282
Dec 11 22:47:21 dy-base  
Dec 11 22:47:21 dy-base
Dec 11 22:47:21 dy-base  
Dec 11 22:47:21 dy-base  
Dec 11 22:47:21 dy-base  81011fb89c30
Dec 11 22:47:21 dy-base  805e7f2b
Dec 11 22:47:21 dy-base
Dec 11 22:47:21 dy-base Call Trace:
Dec 11 22:47:21 dy-base  [803f3bdc] _raw_spin_lock+0x23/0xf1
Dec 11 22:47:21 dy-base  [805e7f2b] _spin_lock_irqsave+0x11/0x18
Dec 11 22:47:21 dy-base  [80222aab] __wake_up+0x22/0x50
Dec 11 22:47:21 dy-base  [803c97f9] xfs_buf_unpin+0x21/0x23
Dec 11 22:47:21 dy-base  [803970a4] xfs_buf_item_unpin+0x2e/0xa6
Dec 11 22:47:21 dy-base  [803bc460]
xfs_trans_chunk_committed+0xc3/0xf7
Dec 11 22:47:21 dy-base  [803bc4dd] xfs_trans_committed+0x49/0xde
Dec 11 22:47:21 dy-base  [803b1bde]
xlog_state_do_callback+0x185/0x33f
Dec 11 22:47:21 dy-base  [803b1e9c] xlog_iodone+0x104/0x131
Dec 11 22:47:22 dy-base  [803c9dae] xfs_buf_iodone_work+0x1a/0x3e
Dec 11 22:47:22 dy-base  [802399d8] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [8023937e] run_workqueue+0xa8/0xf8
Dec 11 22:47:22 dy-base  [803c9d94] xfs_buf_iodone_work+0x0/0x3e
Dec 11 22:47:22 dy-base  [802399d8] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [80239ad3] worker_thread+0xfb/0x134
Dec 11 22:47:22 dy-base  [80223f6c] default_wake_function+0x0/0xf
Dec 11 22:47:22 dy-base  [802399d8] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [8023c6e5] kthread+0xd8/0x10b
Dec 11 22:47:22 dy-base  [802256ac] schedule_tail+0x45/0xa6
Dec 11 22:47:22 dy-base  [8020a6a8] child_rip+0xa/0x12
Dec 11 22:47:22 dy-base  [802399d8] worker_thread+0x0/0x134
Dec 11 22:47:22 dy-base  [8023c60d] kthread+0x0/0x10b
Dec 11 22:47:22 dy-base  [8020a69e] child_rip+0x0/0x12
Dec 11 22:47:22 dy-base
Dec 11 22:47:22 dy-base
Dec 11 22:47:22 dy-base Code: 8b 83 0c 01 00 00 48 8d 8b 98 02 00 00 41 8b 54 24 04 41 89
Dec 11 22:47:22 dy-base
Dec 11 22:47:22 dy-base RIP
Dec 11 22:47:22 dy-base  [803f3aba] spin_bug+0x69/0xdf
Dec 11 22:47:22 dy-base  RSP 81011fb89bc0
Dec 11 22:47:22 dy-base
Dec 11 22:47:22 dy-base Kernel panic - not syncing: Fatal exception
Dec 11 22:47:22 dy-base
Dec 11 22:47:22 dy-base Rebooting in 5 seconds..

After this, sometimes the server reboots normally, but sometimes it hangs: no
console, no sysrq, nothing.

This is a "simple" crash, with no major data loss or anything like that.

Can somebody help me track down the problem?

Thanks,
Janos Haar
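
[Editorial aside: the trace above shows xfs_buf_unpin() ending up in __wake_up(), and the first thing __wake_up() does is take the spinlock embedded in the wait_queue_head_t it was handed; the "spinlock bad magic" report fires because that lock's magic field no longer holds the expected value. The user-space model below is only a sketch of that failure mode; all structure and function names here are simplified stand-ins, not the real kernel or XFS definitions, and it assumes the object containing the wait queue was freed and slab-poisoned while a stale pointer to it was still in use. This matches David's later observation in the thread that the corrupt lock lives inside a wait_queue_head_t.]

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define SPINLOCK_MAGIC 0xdead4ead   /* value checked by CONFIG_DEBUG_SPINLOCK */
#define POISON_FREE    0x6b

struct spinlock_model {
    uint32_t magic;                 /* spin_bug() fires when this is wrong */
    uint32_t raw_lock;
};

struct waitqueue_head_model {
    struct spinlock_model lock;     /* the wakeup path takes this lock first */
    /* list of waiters would follow here */
};

struct buf_model {                  /* stand-in for the buffer being unpinned */
    int pin_count;
    struct waitqueue_head_model waiters;
};

/* Model of the unpin path: drop the pin count and, if it reaches zero,
 * wake anyone sleeping on the buffer's wait queue.  The very first step
 * of the wakeup is locking waiters.lock, so if *bp has already been
 * freed and poisoned, the magic check trips exactly like the
 * "spinlock bad magic" reports in this thread. */
static void buf_unpin_model(struct buf_model *bp)
{
    if (--bp->pin_count == 0) {
        if (bp->waiters.lock.magic != SPINLOCK_MAGIC) {
            printf("BUG: spinlock bad magic (magic=%08x)\n",
                   (unsigned)bp->waiters.lock.magic);
            return;
        }
        /* ...wake up waiters... */
    }
}

int main(void)
{
    struct buf_model *bp = malloc(sizeof(*bp));

    bp->pin_count = 1;
    bp->waiters.lock.magic = SPINLOCK_MAGIC;

    /* Simulate a use-after-free: the object is freed and slab poisoning
     * overwrites it with 0x6b bytes... */
    memset(bp, POISON_FREE, sizeof(*bp));
    bp->pin_count = 1;              /* ...but a stale reference still unpins it */

    buf_unpin_model(bp);            /* prints the "bad magic" diagnostic */
    free(bp);
    return 0;
}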

Re: xfslogd-spinlock bug?

2006-12-12 Thread Haar János
Hello, Justin,

This is a 64-bit system.

But I cannot understand: what is so curious about that? :-)

I am not a kernel developer or a C programmer, but the long pointers
tell me it is 64-bit.
Or am I on the wrong track? :-)

Anyway, this issue hits me about daily, or at most every 2-3 days.
But I have no idea what exactly causes it.
2.6.16.18 was stable for me for a long time, then one day it started tricking
me, and it happens more and more often.
That's why I suspect some bad part of the (14TB) FS on the disks.

(This fs has a lot of errors that xfs_repair cannot correct, but it is a
production system anyway, and it works well except for this issue.
Some months ago I replaced the parity disk in one of the RAID4 arrays,
and the next day, during the resync, another disk died.
I almost lost everything, but thanks to RAID4 and mdadm I successfully
recovered a lot of data from the one-day-old parity-only drive.
That was really bad, and it left some scars on the fs.)

This issue looks to me like a race condition between the CPUs. (2x Xeon HT)

Am I right? :-)

Thanks,

Janos




- Original Message - 
From: Justin Piszcz [EMAIL PROTECTED]
To: Haar János [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Sent: Tuesday, December 12, 2006 3:32 PM
Subject: Re: xfslogd-spinlock bug?


I'm not sure what is causing this problem, but I was curious: is this on a
32-bit or 64-bit platform?

Justin.

On Tue, 12 Dec 2006, Haar János wrote:

 Hello, list,

 I am the "big red button man" with the one big 14TB XFS, if somebody can
 remember me. :-)

 Now I found something in 2.6.16.18, and I have tried 2.6.18.4 and
 2.6.19, but the bug still exists:

 Dec 11 22:47:21 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
 Dec 11 22:47:21 dy-base general protection fault:  [1]
 Dec 11 22:47:21 dy-base SMP
 Dec 11 22:47:21 dy-base
 Dec 11 22:47:21 dy-base CPU 3
 Dec 11 22:47:21 dy-base
 Dec 11 22:47:21 dy-base Modules linked in:
 Dec 11 22:47:21 dy-base  nbd
 Dec 11 22:47:21 dy-base  rd
 Dec 11 22:47:21 dy-base  netconsole
 Dec 11 22:47:21 dy-base  e1000
 Dec 11 22:47:21 dy-base  video
 Dec 11 22:47:21 dy-base
 Dec 11 22:47:21 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
 Dec 11 22:47:21 dy-base RIP: 0010:[803f3aba]
 Dec 11 22:47:21 dy-base  [803f3aba] spin_bug+0x69/0xdf
 Dec 11 22:47:21 dy-base RSP: 0018:81011fb89bc0  EFLAGS: 00010002
 Dec 11 22:47:21 dy-base RAX: 0033 RBX: 6b6b6b6b6b6b6b6b RCX:
 
 Dec 11 22:47:21 dy-base RDX: 808137a0 RSI: 0082 RDI:
 0001
 Dec 11 22:47:21 dy-base RBP: 81011fb89be0 R08: 00026a70 R09:
 6b6b6b6b
 Dec 11 22:47:21 dy-base R10: 0082 R11: 81000584d380 R12:
 8100db92ad80
 Dec 11 22:47:21 dy-base R13: 80642dc6 R14:  R15:
 0003
 Dec 11 22:47:21 dy-base FS:  ()
 GS:81011fc76b90() knlGS:
 Dec 11 22:47:21 dy-base CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
 Dec 11 22:47:21 dy-base CR2: 2ba00770 CR3: 000108c05000 CR4:
 06e0
 Dec 11 22:47:21 dy-base Process xfslogd/3 (pid: 317, threadinfo
 81011fb88000, task 81011fa7f830)
 Dec 11 22:47:21 dy-base Stack:
 Dec 11 22:47:21 dy-base  81011fb89be0
 Dec 11 22:47:21 dy-base  8100db92ad80
 Dec 11 22:47:21 dy-base  
 Dec 11 22:47:21 dy-base  
 Dec 11 22:47:21 dy-base
 Dec 11 22:47:21 dy-base  81011fb89c10
 Dec 11 22:47:21 dy-base  803f3bdc
 Dec 11 22:47:21 dy-base  0282
 Dec 11 22:47:21 dy-base  
 Dec 11 22:47:21 dy-base
 Dec 11 22:47:21 dy-base  
 Dec 11 22:47:21 dy-base  
 Dec 11 22:47:21 dy-base  81011fb89c30
 Dec 11 22:47:21 dy-base  805e7f2b
 Dec 11 22:47:21 dy-base
 Dec 11 22:47:21 dy-base Call Trace:
 Dec 11 22:47:21 dy-base  [803f3bdc] _raw_spin_lock+0x23/0xf1
 Dec 11 22:47:21 dy-base  [805e7f2b] _spin_lock_irqsave+0x11/0x18
 Dec 11 22:47:21 dy-base  [80222aab] __wake_up+0x22/0x50
 Dec 11 22:47:21 dy-base  [803c97f9] xfs_buf_unpin+0x21/0x23
 Dec 11 22:47:21 dy-base  [803970a4] xfs_buf_item_unpin+0x2e/0xa6
 Dec 11 22:47:21 dy-base  [803bc460]
 xfs_trans_chunk_committed+0xc3/0xf7
 Dec 11 22:47:21 dy-base  [803bc4dd]
xfs_trans_committed+0x49/0xde
 Dec 11 22:47:21 dy-base  [803b1bde]
 xlog_state_do_callback+0x185/0x33f
 Dec 11 22:47:21 dy-base  [803b1e9c] xlog_iodone+0x104/0x131
 Dec 11 22:47:22 dy-base  [803c9dae]
xfs_buf_iodone_work+0x1a/0x3e
 Dec 11 22:47:22 dy-base  [802399d8] worker_thread+0x0/0x134
 Dec 11 22:47:22 dy-base  [8023937e] run_workqueue+0xa8/0xf8
 Dec 11 22:47:22 dy-base  [803c9d94] xfs_buf_iodone_work+0x0/0x3e
 Dec 11 22:47:22 dy-base  [802399d8] worker_thread+0x0/0x134
 Dec 11 22:47:22 dy-base
