Re: [zfs-discuss] x4500 panic report.

2008-07-11 Thread Jorgen Lundman

Today we had another panic; at least this one was during work hours :) Just a 
shame the 999GB UFS takes 80+ minutes to fsck. (Yes, it is mounted 'logging'.)
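
For the archives, what we end up running while we wait looks roughly like 
this (from memory, so treat it as a sketch; the rdsk path is inferred from 
the dsk path in the mount output quoted below):

# mount -v | grep /export/saba1
# fsck -F ufs -o p /dev/zvol/rdsk/zpool1/saba1

The first command just confirms 'logging' really is in effect; the second 
is fsck in preen mode, which can still drop to manual repair on anything 
unexpected.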

panic[cpu3]/thread=ff001e70dc80:
free: freeing free block, dev:0xb60024, block:13144, ino:1737885,
fs:/export/saba1


ff001e70d500 genunix:vcmn_err+28 ()
ff001e70d550 ufs:real_panic_v+f7 ()
ff001e70d5b0 ufs:ufs_fault_v+1d0 ()
ff001e70d6a0 ufs:ufs_fault+a0 ()
ff001e70d770 ufs:free+38f ()
ff001e70d830 ufs:indirtrunc+260 ()
ff001e70dab0 ufs:ufs_itrunc+738 ()
ff001e70db60 ufs:ufs_trans_itrunc+128 ()
ff001e70dbf0 ufs:ufs_delete+3b0 ()
ff001e70dc60 ufs:ufs_thread_delete+da ()
ff001e70dc70 unix:thread_start+8 ()

syncing file systems...

panic[cpu3]/thread=ff001e70dc80:
panic sync timeout

dumping to /dev/dsk/c6t0d0s1, offset 65536, content: kernel


> $c
vpanic()
vcmn_err+0x28(3, f783a128, ff001e70d678)
real_panic_v+0xf7(0, f783a128, ff001e70d678)
ufs_fault_v+0x1d0(ff04facf65c0, f783a128, ff001e70d678)
ufs_fault+0xa0()
free+0x38f(ff001e70d8d0, a6a7358, 2000, 89)
indirtrunc+0x260(ff001e70d8d0, a6a42b8, , 0, 89)
ufs_itrunc+0x738(ff0550b9fde0, 0, 81, fffec0594db0)
ufs_trans_itrunc+0x128(ff0550b9fde0, 0, 81, fffec0594db0)
ufs_delete+0x3b0(fffed20e2a00, ff0550b9fde0, 1)
ufs_thread_delete+0xda(64704840)
thread_start+8()

> ::panicinfo
             cpu 3
          thread ff001e70dc80
         message free: freeing free block, dev:0xb60024, block:13144, ino:1737885, fs:/export/saba1
             rdi f783a128
             rsi ff001e70d678
             rdx f783a128
             rcx ff001e70d678
              r8 f783a128
              r9 0
             rax 3
             rbx 0
             rbp ff001e70d4d0
             r10 fffec3d40580
             r11 ff001e70dc80
             r12 f783a128
             r13 ff001e70d678
             r14 3
             r15 f783a128
          fsbase 0
          gsbase fffec3d40580
              ds 4b
              es 4b
              fs 0
              gs 1c3
          trapno 0
             err 0
             rip fb83c860
              cs 30
          rflags 246
             rsp ff001e70d488
              ss 38
          gdt_hi 0
          gdt_lo 81ef
          idt_hi 0
          idt_lo 7fff
             ldt 0
            task 70
             cr0 8005003b
             cr2 fed0e010
             cr3 2c0
             cr4 6f8

Jorgen Lundman wrote:
> On Saturday the X4500 system paniced, and rebooted. For some reason the
> /export/saba1 UFS partition was corrupt, and needed fsck. This is why
> it did not come back online. /export/saba1 is mounted logging,noatime,
> so fsck should never (-ish) be needed.
>
> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>
> /export/saba1 on /dev/zvol/dsk/zpool1/saba1
> read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
> on Sat Jul  5 08:48:54 2008
>
>
> One possible related bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138
>
>
> What would be the best solution? Go back to latest Solaris 10 and pass
> it on to Sun support, or find a patch for this problem?
>
>
>
> Panic dump follows:
>
>
> -rw-r--r--   1 root root 2529300 Jul  5 08:48 unix.2
> -rw-r--r--   1 root root 10133225472 Jul  5 09:10 vmcore.2
>
>
> # mdb unix.2 vmcore.2
> Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15 uppc
> pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba uhci s1394 qlc fctl
> nca lofs zfs random cpc crypto fcip fcp logindmux nsctl sdbc ptm sv ii
> sppp rdc nfs ]
>
> > $c
> vpanic()
> vcmn_err+0x28(3, f783ade0, ff001e737aa8)
> real_panic_v+0xf7(0, f783ade0, ff001e737aa8)
> ufs_fault_v+0x1d0(fffed0bfb980, f783ade0, ff001e737aa8)
> ufs_fault+0xa0()
> dqput+0xce(1db26ef0)
> dqrele+0x48(1db26ef0)
> ufs_trans_dqrele+0x6f(1db26ef0)
> ufs_idle_free+0x16d(ff04f17b1e00)
> ufs_idle_some+0x152(3f60)
> ufs_thread_idle+0x1a1()
> thread_start+8()
>
>
> > ::cpuinfo
>  ID ADDR         FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD       PROC
>   0 fbc2fc10      1b    0    0  60   no    no t-0    ff001e737c80 sched
>   1 fffec3a0a000  1f    1    0  -1   no    no t-0    ff001e971c80 (idle)
>   2 fffec3a02ac0  1f    0    0  -1   no    no t-1    ff001e9dbc80 (idle)
>   3 fffec3d60580  1f    0    0  -1   no    no t-1    ff001ea50c80 (idle)

[zfs-discuss] x4500 panic report.

2008-07-06 Thread Jorgen Lundman

On Saturday the X4500 system paniced, and rebooted. For some reason the 
/export/saba1 UFS partition was corrupt, and needed fsck. This is why 
it did not come back online. /export/saba1 is mounted logging,noatime, 
so fsck should never (-ish) be needed.

SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc

/export/saba1 on /dev/zvol/dsk/zpool1/saba1
read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
on Sat Jul  5 08:48:54 2008
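
For context, this is UFS sitting on a ZFS zvol (we need per-user quotas, 
hence UFS). Roughly how such a filesystem gets built, from memory and 
untested; the 1t size is made up, and the quotas file is the standard UFS 
one:

# zfs create -V 1t zpool1/saba1
# newfs /dev/zvol/rdsk/zpool1/saba1
# mount -F ufs -o logging,noatime,quota,onerror=panic /dev/zvol/dsk/zpool1/saba1 /export/saba1
# touch /export/saba1/quotas
# quotacheck /export/saba1 && quotaon /export/saba1

with edquota used afterwards to set the per-user limits.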


One possible related bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138


What would be the best solution? Go back to latest Solaris 10 and pass 
it on to Sun support, or find a patch for this problem?



Panic dump follows:


-rw-r--r--   1 root root 2529300 Jul  5 08:48 unix.2
-rw-r--r--   1 root root 10133225472 Jul  5 09:10 vmcore.2


# mdb unix.2 vmcore.2
Loading modules: [ unix genunix specfs dtrace cpu.AuthenticAMD.15 uppc 
pcplusmp scsi_vhci ufs md ip hook neti sctp arp usba uhci s1394 qlc fctl 
nca lofs zfs random cpc crypto fcip fcp logindmux nsctl sdbc ptm sv ii 
sppp rdc nfs ]

> $c
vpanic()
vcmn_err+0x28(3, f783ade0, ff001e737aa8)
real_panic_v+0xf7(0, f783ade0, ff001e737aa8)
ufs_fault_v+0x1d0(fffed0bfb980, f783ade0, ff001e737aa8)
ufs_fault+0xa0()
dqput+0xce(1db26ef0)
dqrele+0x48(1db26ef0)
ufs_trans_dqrele+0x6f(1db26ef0)
ufs_idle_free+0x16d(ff04f17b1e00)
ufs_idle_some+0x152(3f60)
ufs_thread_idle+0x1a1()
thread_start+8()


> ::cpuinfo
 ID ADDR         FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD       PROC
  0 fbc2fc10      1b    0    0  60   no    no t-0    ff001e737c80 sched
  1 fffec3a0a000  1f    1    0  -1   no    no t-0    ff001e971c80 (idle)
  2 fffec3a02ac0  1f    0    0  -1   no    no t-1    ff001e9dbc80 (idle)
  3 fffec3d60580  1f    0    0  -1   no    no t-1    ff001ea50c80 (idle)

> ::panicinfo
             cpu 0
          thread ff001e737c80
         message dqput: dqp->dq_cnt == 0
             rdi f783ade0
             rsi ff001e737aa8
             rdx f783ade0
             rcx ff001e737aa8
              r8 f783ade0
              r9 0
             rax 3
             rbx 0
             rbp ff001e737900
             r10 fbc26fb0
             r11 ff001e737c80
             r12 f783ade0
             r13 ff001e737aa8
             r14 3
             r15 f783ade0
          fsbase 0
          gsbase fbc26fb0
              ds 4b
              es 4b
              fs 0
              gs 1c3
          trapno 0
             err 0
             rip fb83c860
              cs 30
          rflags 246
             rsp ff001e7378b8
              ss 38
          gdt_hi 0
          gdt_lo e1ef
          idt_hi 0
          idt_lo 77c00fff
             ldt 0
            task 70
             cr0 8005003b
             cr2 fee7d650
             cr3 2c0
             cr4 6f8

> ::msgbuf
quota_ufs: over hard disk limit (pid 600, uid 178199, inum 941499, fs 
/export/zero1)
quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs 
/export/zero1)

panic[cpu0]/thread=ff001e737c80:
dqput: dqp->dq_cnt == 0


ff001e737930 genunix:vcmn_err+28 ()
ff001e737980 ufs:real_panic_v+f7 ()
ff001e7379e0 ufs:ufs_fault_v+1d0 ()
ff001e737ad0 ufs:ufs_fault+a0 ()
ff001e737b00 ufs:dqput+ce ()
ff001e737b30 ufs:dqrele+48 ()
ff001e737b70 ufs:ufs_trans_dqrele+6f ()
ff001e737bc0 ufs:ufs_idle_free+16d ()
ff001e737c10 ufs:ufs_idle_some+152 ()
ff001e737c60 ufs:ufs_thread_idle+1a1 ()
ff001e737c70 unix:thread_start+8 ()

syncing file systems...
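
(Side note in case anyone wants to script the above: mdb also takes dcmds 
on stdin, so something like

# echo '::panicinfo' | mdb unix.2 vmcore.2

should pull out the same summary non-interactively.)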




-- 
Jorgen Lundman       | [EMAIL PROTECTED]
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)


Re: [zfs-discuss] x4500 panic report.

2008-07-06 Thread James C. McPherson
Jorgen Lundman wrote:
> On Saturday the X4500 system paniced, and rebooted. For some reason the
> /export/saba1 UFS partition was corrupt, and needed fsck. This is why
> it did not come back online. /export/saba1 is mounted logging,noatime,
> so fsck should never (-ish) be needed.
>
> SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
>
> /export/saba1 on /dev/zvol/dsk/zpool1/saba1
> read/write/setuid/devices/intr/largefiles/logging/quota/xattr/noatime/onerror=panic/dev=2d80024
> on Sat Jul  5 08:48:54 2008
>
>
> One possible related bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4884138

Yes, that bug is possibly related. However, the panic stacks listed
in it do not match yours.

> What would be the best solution? Go back to latest Solaris 10 and pass
> it on to Sun support, or find a patch for this problem?

Since the panic stack only ever goes through ufs, you should
log a call with Sun support.
...
> > ::msgbuf
> quota_ufs: over hard disk limit (pid 600, uid 178199, inum 941499, fs
> /export/zero1)
> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134, fs
> /export/zero1)
>
> panic[cpu0]/thread=ff001e737c80:
> dqput: dqp->dq_cnt == 0
>
>
> ff001e737930 genunix:vcmn_err+28 ()
> ff001e737980 ufs:real_panic_v+f7 ()
> ff001e7379e0 ufs:ufs_fault_v+1d0 ()
> ff001e737ad0 ufs:ufs_fault+a0 ()
> ff001e737b00 ufs:dqput+ce ()
> ff001e737b30 ufs:dqrele+48 ()
> ff001e737b70 ufs:ufs_trans_dqrele+6f ()
> ff001e737bc0 ufs:ufs_idle_free+16d ()
> ff001e737c10 ufs:ufs_idle_some+152 ()
> ff001e737c60 ufs:ufs_thread_idle+1a1 ()
> ff001e737c70 unix:thread_start+8 ()

Although given the entry in the msgbuf, perhaps
you might want to fix up your quota settings on that
particular filesystem.
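
Something along these lines would show where each user stands (the standard
UFS quota tools; the username is just an example):

# repquota -v /export/zero1
# quota -v someuser
# edquota someuser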



James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] x4500 panic report.

2008-07-06 Thread Jorgen Lundman
> Since the panic stack only ever goes through ufs, you should
> log a call with Sun support.

We do have support, but they only speak Japanese, and I'm still quite 
poor at it. But I have started the process of having it translated and 
passed along to the next person. It is always fun to see what it becomes 
at the other end. Meanwhile, I like to research and see if it is an 
already known problem, rather than just sit around and wait.



>> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134,
>> fs /export/zero1)
>
> Although given the entry in the msgbuf, perhaps
> you might want to fix up your quota settings on that
> particular filesystem.

Customers pay for a certain amount of disk-quota, and being users, 
always stay close to the edge. Those messages are as constant as 
precipitation in the rainy season.

Are you suggesting that indicates a problem, beyond the user being out 
of space?

Lund


-- 
Jorgen Lundman       | [EMAIL PROTECTED]
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)


Re: [zfs-discuss] x4500 panic report.

2008-07-06 Thread James C. McPherson
Jorgen Lundman wrote:
>> Since the panic stack only ever goes through ufs, you should
>> log a call with Sun support.
>
> We do have support, but they only speak Japanese, and I'm still quite
> poor at it. But I have started the process of having it translated and
> passed along to the next person. It is always fun to see what it becomes
> at the other end. Meanwhile, I like to research and see if it is an
> already known problem, rather than just sit around and wait.

That sounds like a learning opportunity :-)

>>> quota_ufs: over hard disk limit (pid 600, uid 33647, inum 29504134,
>>> fs /export/zero1)
>>
>> Although given the entry in the msgbuf, perhaps
>> you might want to fix up your quota settings on that
>> particular filesystem.
>
> Customers pay for a certain amount of disk-quota, and being users,
> always stay close to the edge. Those messages are as constant as
> precipitation in the rainy season.
>
> Are you suggesting that indicates a problem, beyond the user being out
> of space?

I don't know, I'm not a UFS expert (heck, I'm not an expert
on _anything_). Have you investigated putting your paying
customers onto zfs and managing quotas with zfs properties
instead of ufs?
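
A minimal sketch of what I mean, with invented dataset names:

# zfs create zpool1/home
# zfs create zpool1/home/user123
# zfs set quota=10G zpool1/home/user123
# zfs get quota zpool1/home/user123

i.e. one filesystem per customer, with the limit enforced as a dataset
property instead of a quotas file.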




James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] x4500 panic report.

2008-07-06 Thread Jorgen Lundman
> I don't know, I'm not a UFS expert (heck, I'm not an expert
> on _anything_). Have you investigated putting your paying
> customers onto zfs and managing quotas with zfs properties
> instead of ufs?

Yep, we spent about six weeks during the trial period of the x4500 
trying to find a way for ZFS to replace the current NetApps. The history 
of this mailing-list should have it, and thanks to everyone who helped.

But it was just not possible. Perhaps now it can be done, using 
mirror-mounts, but the 50-odd servers hanging off the x4500 don't all 
support it, so it would still not be feasible.

Unless there has been some advancement in ZFS in the last 6 months I am 
not aware of... like user quotas?

Thanks for your assistance.

Lund

-- 
Jorgen Lundman       | [EMAIL PROTECTED]
Unix Administrator   | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3-3375-1767 (home)