Re: [zfs-discuss] ZIL errors but device seems OK

2010-04-15 Thread Richard Skelton
Hi,
After a little bit more digging I found in /var/adm/messages:-
Mar 25 13:13:08 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:13:08 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:13:08 brszfs02        Error for command 'write sector'        Error Level: Informational
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:13:43 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:13:43 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:43 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:13:43 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:13:43 brszfs02        Error for command 'read sector'         Error Level: Informational
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:13:43 brszfs02        Error for command 'read sector'         Error Level: Informational
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:14:18 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:18 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:14:18 brszfs02        Error for command 'read sector'         Error Level: Informational
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: abort request, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: abort device, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: reset target, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: reset bus, target=0 lun=0
Mar 25 13:14:34 brszfs02 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Mar 25 13:14:34 brszfs02 EVENT-TIME: Thu Mar 25 13:14:34 GMT 2010
Mar 25 13:14:34 brszfs02 PLATFORM: HP-Compaq-dc7700-Convertible-Minitower, CSN: CZC7264JN4, HOSTNAME: brszfs02
Mar 25 13:14:34 brszfs02 SOURCE: zfs-diagnosis, REV: 1.0
Mar 25 13:14:34 brszfs02 EVENT-ID: 6c0bd163-56bf-ee92-e393-ce2063355b52
Mar 25 13:14:34 brszfs02 DESC: The number of I/O errors associated with a ZFS device exceeded acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Mar 25 13:14:34 brszfs02 AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt will be made to activate a hot spare if available.
Mar 25 13:14:34 brszfs02 IMPACT: Fault tolerance of the pool may be compromised.
Mar 25 13:14:34 brszfs02 REC-ACTION: Run 'zpool status -x' and replace the bad device.

If I remember correctly, I was thrashing this pool with Bonnie++ at the time.

Cheers
Richard.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZIL errors but device seems OK

2010-04-14 Thread Richard Skelton
Hi,
I have installed OpenSolaris snv_134 from the iso at genunix.org.
Mon Mar 8 2010 New OpenSolaris preview, based on build 134
I created a zpool:-
NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  c7t4d0    ONLINE       0     0     0
  c7t5d0    ONLINE       0     0     0
  c7t6d0    ONLINE       0     0     0
  c7t8d0    ONLINE       0     0     0
  c7t9d0    ONLINE       0     0     0
logs
  c5d1p1    ONLINE       0     0     0
cache
  c5d1p2    ONLINE       0     0     0

The log device and cache are each one half of a 128GB OCZ VERTEX-TURBO flash card.
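For reference, a pool with this layout could be assembled roughly like so (a hedged sketch, not the poster's actual commands; the pool and device names are taken from the status output above):

```shell
# Create the striped data pool from the five disks (no redundancy,
# matching the layout shown in 'zpool status' above)
zpool create tank c7t4d0 c7t5d0 c7t6d0 c7t8d0 c7t9d0

# Attach one partition of the flash card as a separate intent log (ZIL)
# and the other partition as an L2ARC cache device
zpool add tank log c5d1p1
zpool add tank cache c5d1p2
```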

I am getting good NFS performance but have seen this error:-
r...@brszfs02:~# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  c7t4d0    ONLINE       0     0     0
  c7t5d0    ONLINE       0     0     0
  c7t6d0    ONLINE       0     0     0
  c7t8d0    ONLINE       0     0     0
  c7t9d0    ONLINE       0     0     0
logs
  c5d1p1    FAULTED      0     4     0  too many errors
cache
  c5d1p2    ONLINE       0     0     0

errors: No known data errors

r...@brszfs02:~# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 25 13:14:34 6c0bd163-56bf-ee92-e393-ce2063355b52  ZFS-8000-FD    Major

Host        : brszfs02
Platform    : HP-Compaq-dc7700-Convertible-Minitower  Chassis_id  : CZC7264JN4
Product_sn  :

Fault class : fault.fs.zfs.vdev.io
Affects : zfs://pool=tank/vdev=4ec464b5bf74a898
  faulted but still in service
Problem in  : zfs://pool=tank/vdev=4ec464b5bf74a898
  faulted but still in service

Description : The number of I/O errors associated with a ZFS device exceeded
 acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
  for more information.

Response: The device has been offlined and marked as faulted.  An attempt
 will be made to activate a hot spare if available.

Impact  : Fault tolerance of the pool may be compromised.

Action  : Run 'zpool status -x' and replace the bad device.

r...@brszfs02:~# iostat -En c5d1
c5d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: OCZ VERTEX-TURB Revision:  Serial No: 062F97G71C5T676 Size: 128.04GB 128035160064 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0


As there seem to be no hardware errors reported by iostat, I ran 'zpool clear tank'
and a scrub on Monday.
Up to now I have seen no new errors; I have set up a cron job to scrub at 01:30
each day.
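The nightly scrub can be scheduled with a one-line crontab entry like the following (a sketch, assuming it goes in root's crontab via 'crontab -e'):

```shell
# Scrub 'tank' at 01:30 every day (fields: minute hour day-of-month month day-of-week)
30 1 * * * /usr/sbin/zpool scrub tank
```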

Is the flash card faulty or is this a ZFS problem?

Cheers
Richard


Re: [zfs-discuss] ZIL errors but device seems OK

2010-04-14 Thread Richard Elling
comment below...

On Apr 14, 2010, at 1:49 AM, Richard Skelton wrote:

 Hi,
 I have installed OpenSolaris snv_134 from the iso at genunix.org.
 Mon Mar 8 2010 New OpenSolaris preview, based on build 134
 I created a zpool:-
 NAME        STATE     READ WRITE CKSUM
 tank        ONLINE       0     0     0
   c7t4d0    ONLINE       0     0     0
   c7t5d0    ONLINE       0     0     0
   c7t6d0    ONLINE       0     0     0
   c7t8d0    ONLINE       0     0     0
   c7t9d0    ONLINE       0     0     0
 logs
   c5d1p1    ONLINE       0     0     0
 cache
   c5d1p2    ONLINE       0     0     0
 
 The log device and cache are each one half of a 128GB OCZ VERTEX-TURBO flash card.
 
 I am getting good NFS performance but have seen this error:-
 r...@brszfs02:~# zpool status tank
  pool: tank
 state: DEGRADED
 status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
 action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: none requested
 config:
 
 NAME        STATE     READ WRITE CKSUM
 tank        DEGRADED     0     0     0
   c7t4d0    ONLINE       0     0     0
   c7t5d0    ONLINE       0     0     0
   c7t6d0    ONLINE       0     0     0
   c7t8d0    ONLINE       0     0     0
   c7t9d0    ONLINE       0     0     0
 logs
   c5d1p1    FAULTED      0     4     0  too many errors
 cache
   c5d1p2    ONLINE       0     0     0
 
 errors: No known data errors
 
 r...@brszfs02:~# fmadm faulty
 --------------- ------------------------------------  -------------- ---------
 TIME            EVENT-ID                              MSG-ID         SEVERITY
 --------------- ------------------------------------  -------------- ---------
 Mar 25 13:14:34 6c0bd163-56bf-ee92-e393-ce2063355b52  ZFS-8000-FD    Major
 
 Host        : brszfs02
 Platform    : HP-Compaq-dc7700-Convertible-Minitower  Chassis_id  : CZC7264JN4
 Product_sn  :
 
 Fault class : fault.fs.zfs.vdev.io
 Affects : zfs://pool=tank/vdev=4ec464b5bf74a898
  faulted but still in service
 Problem in  : zfs://pool=tank/vdev=4ec464b5bf74a898
  faulted but still in service
 
 Description : The number of I/O errors associated with a ZFS device exceeded
 acceptable levels.  Refer to 
 http://sun.com/msg/ZFS-8000-FD
  for more information.
 
 Response: The device has been offlined and marked as faulted.  An attempt
 will be made to activate a hot spare if available.
 
 Impact  : Fault tolerance of the pool may be compromised.
 
 Action  : Run 'zpool status -x' and replace the bad device.
 
 r...@brszfs02:~# iostat -En c5d1
 c5d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
 Model: OCZ VERTEX-TURB Revision:  Serial No: 062F97G71C5T676 Size: 128.04GB 128035160064 bytes
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 Illegal Request: 0
 
 
 As there seem to be no hardware errors reported by iostat, I ran 'zpool clear tank'
 and a scrub on Monday.
 Up to now I have seen no new errors; I have set up a cron job to scrub at 01:30
 each day.
 
 Is the flash card faulty or is this a ZFS problem?

In my testing of Flash-based SSDs, this is the most common error.
Since the drive is not reporting media errors or hard errors, the only
interim conclusion is that something in the data path corrupted the
data. This can mean the drive doesn't report these errors, the errors
are transient, or an error occurred which is not related to the data
(e.g. phantom writes).
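One way to dig further is to pull the underlying FMA error reports that fed the diagnosis (a sketch using the standard Solaris fmdump tool; the time window below matches the fault timestamps quoted earlier in this thread):

```shell
# Dump the error (ereport) log in full detail
fmdump -eV

# Restrict the listing to the window around the reported fault
# (-t start time, -T end time, in fmdump's mm/dd/yy hh:mm:ss format)
fmdump -e -t "03/25/10 13:00:00" -T "03/25/10 13:20:00"
```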

For example, my current bad-boy says:
$ iostat -En
...
c7t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: USB2.0 Product: VAULT DRIVE Revision: 1100 Serial No: Size: 8.12GB 8120172544 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 103
Predictive Failure Analysis: 0
...
$ pfexec zpool status -v syspool

  pool: syspool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h1m with 325 errors on Wed Apr 14 11:06:58 2010
config:

NAME        STATE     READ WRITE CKSUM
syspool     ONLINE       0     0   330
  c7t0d0s0  ONLINE       0     0   690

errors: Permanent errors have been