[zfs-discuss] Panic while scrubbing

2006-10-24 Thread Siegfried Nikolaivich
Hello,

I am not sure if I am posting in the correct forum, but this seems somewhat
ZFS-related, so I thought I'd share it.

While the machine was idle, I started a scrub.  Around the time the scrubbing 
was supposed to be finished, the machine panicked.

This might be related to the 'metadata corruption' that happened to me earlier.
Here is the log. Any ideas?


Oct 24 20:13:51 FServe unix: [ID 836849 kern.notice]
Oct 24 20:13:51 FServe ^Mpanic[cpu0]/thread=fe8000311c80:
Oct 24 20:13:51 FServe genunix: [ID 683410 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=fe80003119c0 addr=fe00e24c6218
Oct 24 20:13:51 FServe unix: [ID 10 kern.notice]
Oct 24 20:13:51 FServe unix: [ID 839527 kern.notice] sched:
Oct 24 20:13:51 FServe unix: [ID 753105 kern.notice] #pf Page fault
Oct 24 20:13:51 FServe unix: [ID 532287 kern.notice] Bad kernel fault at addr=0xfe00e24c6218
Oct 24 20:13:51 FServe unix: [ID 243837 kern.notice] pid=0, pc=0xfb92c360, sp=0xfe8000311ab0, eflags=0x10282
Oct 24 20:13:51 FServe unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
Oct 24 20:13:51 FServe unix: [ID 354241 kern.notice] cr2: fe00e24c6218 cr3: a22b000 cr8: c
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]rdi: 84233e88 rsi: fe00e24c6208 rdx: 3f8038931883
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]rcx:0  r8:1  r9:
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]rax:2 rbx: fe80eb90f7c0 rbp: fe8000311ab0
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]r10: a5de7488 r11:1 r12: 84233e88
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]r13:2 r14: fe80eb90f7c0 r15: 84233dd8
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]fsb: 8000 gsb: fbc24060  ds:   43
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] es:   43  fs:0  gs:  1c3
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice]trp:e err:0 rip: fb92c360
Oct 24 20:13:51 FServe unix: [ID 592667 kern.notice] cs:   28 rfl:10282 rsp: fe8000311ab0
Oct 24 20:13:51 FServe unix: [ID 266532 kern.notice] ss:   30
Oct 24 20:13:51 FServe unix: [ID 10 kern.notice]
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe80003118d0 unix:real_mode_end+58d1 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe80003119b0 unix:trap+d77 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe80003119c0 unix:_cmntrap+13f ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311ab0 genunix:avl_insert+60 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311ae0 genunix:avl_add+33 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311b60 zfs:vdev_queue_io_to_issue+1ec ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311ba0 zfs:zfsctl_ops_root+33c6e7a1 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311bc0 zfs:vdev_disk_io_done+11 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311bd0 zfs:vdev_io_done+12 ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311be0 zfs:zio_vdev_io_done+1b ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311c60 genunix:taskq_thread+bc ()
Oct 24 20:13:51 FServe genunix: [ID 655072 kern.notice] fe8000311c70 unix:thread_start+8 ()
Oct 24 20:13:51 FServe unix: [ID 10 kern.notice]
Oct 24 20:13:51 FServe genunix: [ID 672855 kern.notice] syncing file systems...
Oct 24 20:13:51 FServe genunix: [ID 904073 kern.notice]  done
Oct 24 20:13:52 FServe genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c0t3d0s1, offset 860356608, content: kernel
Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   device disconnected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   device connected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   SError interrupt
Oct 24 20:13:52 FServe marvell88sx: [ID 131198 kern.info]   SErrors:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   Recovered communication error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   PHY ready change
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   10-bit to 8-bit decode error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   Disparity error
Oct 24 20:13:57 FServe genunix: [ID 409368 kern.notice] ^M100% done: 150751 pages dumped, compression ratio 4.23,
Oct 24 20:13:57 FServe genunix: [ID 851671 kern.notice] dump succeeded


Thanks,

Re: [zfs-discuss] Panic while scrubbing

2006-10-24 Thread James McPherson

On 10/25/06, Siegfried Nikolaivich [EMAIL PROTECTED] wrote:
...

While the machine was idle, I started a scrub.  Around the time the scrubbing 
was supposed to be finished, the machine panicked.
This might be related to the 'metadata corruption' that happened earlier to me. 
 Here is the log, any ideas?

...

Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   device disconnected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   device connected
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   SError interrupt
Oct 24 20:13:52 FServe marvell88sx: [ID 131198 kern.info]   SErrors:
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   Recovered communication error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   PHY ready change
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   10-bit to 8-bit decode error
Oct 24 20:13:52 FServe marvell88sx: [ID 517869 kern.info]   Disparity error



Hi Siegfried,
This error from the marvell88sx driver is of concern. The 10b8b decode
and disparity error messages make me think that you have a bad piece
of hardware. I hope it's not your controller, but I can't tell without more
data. You should have a look at the iostat -En output for the device
on marvell88sx instance #0, attached as port 3. If there are any error
counts above 0 then, after checking /var/adm/messages for medium
errors, you should probably replace the disk.

However, don't discount the possibility that the controller and/or the
cable is at fault.

cheers,
James
--
Solaris kernel software engineer, system admin and troubleshooter
 http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Panic while scrubbing

2006-10-24 Thread Siegfried Nikolaivich


On 24-Oct-06, at 9:11 PM, James McPherson wrote:


This error from the marvell88sx driver is of concern. The 10b8b decode
and disparity error messages make me think that you have a bad piece
of hardware. I hope it's not your controller, but I can't tell without more
data. You should have a look at the iostat -En output for the device
on marvell88sx instance #0, attached as port 3. If there are any error
counts above 0 then, after checking /var/adm/messages for medium
errors, you should probably replace the disk.



I have just tried to do a 'zpool scrub' and I got the same result - a  
panic right when the scrub finishes (no errors found during / after  
panic).  So I guess this problem is reproducible (and might not be an  
intermittent hardware malfunction).


It is funny that I get the marvell88sx driver error for port 3, as that is
the Solaris UFS drive, whereas the rest of the ports are set up for
ZFS.  Since the scrub seems to be causing the panic, I don't see why
an error on the root drive would be the root cause.


Note that this error appears in the log after the system starts writing the
panic dump: genunix: [ID 111219 kern.notice] dumping to
/dev/dsk/c0t3d0s1, offset 860356608, content: kernel
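
The ordering argument above can be checked mechanically. The following is a minimal sketch, not a tool anyone in this thread used: it takes three lines from the log in the first message (rejoined where the mail client wrapped them) and reports which event appears first. The function name first_seen and the hard-coded year 2006 (syslog timestamps carry no year; 2006 is the date of this thread) are assumptions made for illustration, and %b month parsing assumes an English or C locale.

```python
from datetime import datetime

# Three lines taken from the panic log in the first message of this thread.
LOG = """\
Oct 24 20:13:51 FServe ^Mpanic[cpu0]/thread=fe8000311c80:
Oct 24 20:13:52 FServe genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c0t3d0s1, offset 860356608, content: kernel
Oct 24 20:13:52 FServe marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx0: error on port 3:
"""


def first_seen(log_text, needle, year=2006):
    """Return (line_index, timestamp) for the first line containing needle.

    Returns None if no line contains needle. The year is an assumption,
    because syslog timestamps do not include one.
    """
    for index, line in enumerate(log_text.splitlines()):
        if needle in line:
            stamp = datetime.strptime(
                "%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S"
            )
            return index, stamp
    return None


if __name__ == "__main__":
    for label, needle in [
        ("panic", "panic["),
        ("dump start", "dumping to"),
        ("controller error", "marvell88sx"),
    ]:
        print(label, first_seen(LOG, needle))
```

If the controller error is only ever logged after the "dumping to" line, that is consistent with the error being triggered by the dump to c0t3d0s1 rather than by the scrub, though it does not prove it.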



By the way, this is what iostat -En shows for port 3:
c0t3d0   Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0


And this is shown on the rest of the ports:
c0t?d0   Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 6 Predictive Failure Analysis: 0
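
The check suggested earlier in the thread (look for error counts above 0) can be sketched as a small parser. This is an illustration under stated assumptions, not an existing utility: the function names parse_iostat_en and nonzero_counters are hypothetical, the counter names are only those visible in the output above, and the sample text is the port 3 output with the Vendor line omitted.

```python
import re

# Counter names as they appear in the iostat -En output quoted in this thread.
COUNTERS = [
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device", "Recoverable",
    "Illegal Request", "Predictive Failure Analysis",
]
COUNTER_RE = re.compile(r"(%s):\s*(\d+)" % "|".join(map(re.escape, COUNTERS)))
# A device block starts with the device name followed by "Soft Errors:".
DEVICE_RE = re.compile(r"^(\S+)\s+Soft Errors:")


def parse_iostat_en(text):
    """Return {device: {counter: value}} from iostat -En style text."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = devices.setdefault(header.group(1), {})
        if current is None:
            continue  # ignore anything before the first device header
        for name, value in COUNTER_RE.findall(line):
            current[name] = int(value)
    return devices


def nonzero_counters(devices):
    """Keep only the counters that are above zero, per device."""
    return {
        dev: {name: n for name, n in counters.items() if n > 0}
        for dev, counters in devices.items()
    }


# Port 3 output from this message (Vendor line omitted).
SAMPLE = """\
c0t3d0   Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
"""

if __name__ == "__main__":
    print(nonzero_counters(parse_iostat_en(SAMPLE)))
```

For this sample, the only nonzero counters are Soft Errors and Illegal Request. Media, hard, and transport errors are all 0.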


Thanks,
Siegfried


Re: [zfs-discuss] Panic while scrubbing

2006-10-24 Thread Siegfried Nikolaivich


On 24-Oct-06, at 9:47 PM, James McPherson wrote:


On 10/25/06, Siegfried Nikolaivich [EMAIL PROTECTED] wrote:

And this is shown on the rest of the ports:
c0t?d0   Soft Errors: 6 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 6 Predictive Failure Analysis: 0


Hmm. All your disks are attached to the same controller and are showing
entries in the Illegal Request field. What's the common component
between them - the cable?


I guess the common component between them is the power supply.  Each  
drive has its own SATA cable connected directly to the controller.



Could you look through your msgbuf and/or /var/adm/messages and
find the full text of when these Illegal Request errors were logged?
That will give an idea of where to look next.


That is the part I can't figure out.  Nowhere does it say Illegal  
Request except when I run iostat -nE.


I found out that the Illegal Request count can be incremented on  
the ZFS drives by starting a scrub.


For example:
# iostat -nE
...
c0t2d0   Soft Errors: 8 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 8 Predictive Failure Analysis: 0
c0t3d0   Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
...

# zpool scrub tank

# iostat -nE
...
c0t2d0   Soft Errors: 9 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 9 Predictive Failure Analysis: 0
c0t3d0   Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: ST3320620AS  Revision: C Serial No:
Size: 320.07GB 320072932864 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 24 Predictive Failure Analysis: 0
...

# zpool scrub -s tank
(no panic at this point)

Happens every time.
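
The before/after comparison above can also be done programmatically. The following is a rough sketch, assuming the two iostat -nE outputs have been saved as text; the function names snapshot and deltas are hypothetical, and the sample data contains only the counter lines for c0t2d0 and c0t3d0 shown above.

```python
import re

# Device names in this thread look like c0t2d0; other naming schemes
# would need a different pattern.
DEVICE_RE = re.compile(r"^(c\d+t\d+d\d+)\s")
COUNTER_RE = re.compile(r"(Soft Errors|Illegal Request):\s*(\d+)")


def snapshot(text):
    """Map each device to its Soft Errors and Illegal Request counts."""
    counts = {}
    device = None
    for line in text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            device = header.group(1)
            counts[device] = {}
        if device is not None:
            for name, value in COUNTER_RE.findall(line):
                counts[device][name] = int(value)
    return counts


def deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    changed = {}
    for device, counters in after.items():
        old = before.get(device, {})
        diff = {
            name: n - old.get(name, 0)
            for name, n in counters.items()
            if n != old.get(name, 0)
        }
        if diff:
            changed[device] = diff
    return changed


# Counter lines from the iostat -nE output before the scrub was started.
BEFORE = """\
c0t2d0   Soft Errors: 8 Hard Errors: 0 Transport Errors: 0
Illegal Request: 8 Predictive Failure Analysis: 0
c0t3d0   Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Illegal Request: 24 Predictive Failure Analysis: 0
"""

# Counter lines from the iostat -nE output after the scrub was started.
AFTER = """\
c0t2d0   Soft Errors: 9 Hard Errors: 0 Transport Errors: 0
Illegal Request: 9 Predictive Failure Analysis: 0
c0t3d0   Soft Errors: 24 Hard Errors: 0 Transport Errors: 0
Illegal Request: 24 Predictive Failure Analysis: 0
"""

if __name__ == "__main__":
    print(deltas(snapshot(BEFORE), snapshot(AFTER)))
```

With these two snapshots, only c0t2d0 changes (both counters go up by one), while the UFS drive c0t3d0 is untouched by the scrub.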



Thanks,
Siegfried