Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-17 Thread Carson Gaspar

On 2/16/11 9:58 PM, Krunal Desai wrote:


When I try to do a SMART status read (more than just a simple
identify), looks like the 1068E drops the drive for a little bit. I
bought the Intel-branded LSI SAS3081E:
Current active firmware version is 0120 (1.32.00)
Firmware image's version is MPTFW-01.32.00.00-IT
   LSI Logic
x86 BIOS image's version is MPTBIOS-6.34.00.00 (2010.12.07)

...

Fault management records some transport errors followed by recovery.
Any ideas? Disks are ST32000542AS.


Please give the _exact_ command you are running. I see the same thing,
but only if I try to retrieve some of the extended info (-x...). I
don't see it with -a.
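
For example (the device path below is just an illustration, substitute your own):

# smartctl -a -d sat /dev/rdsk/c7t0d0    # fine here
# smartctl -x -d sat /dev/rdsk/c7t0d0    # this is what provokes the drop for me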


--
Carson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-17 Thread Krunal Desai
On Thu, Feb 17, 2011 at 10:52 AM, Carson Gaspar car...@taltos.org wrote:
 Please give the _exact_ command you are running. I see the same thing, but
 only if I try to retrieve some of the extended info (-x...). I don't see
 it with -a.

Sure, here it is (apologies in advance if GMail applies its forced wrapping):


movax@megatron:~/downloads# smartctl -a -d sat /dev/rdsk/c1t0d0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda LP
Device Model: ST32000542AS
Serial Number:    redacted
Firmware Version: CC34
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Feb 17 00:52:56 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[drive drops/resets here]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-16 Thread Krunal Desai
On Wed, Feb 2, 2011 at 8:38 PM, Carson Gaspar car...@taltos.org wrote:
 Works For Me (TM).

 c7t0d0 is hanging off an LSI SAS3081E-R (SAS1068E chip) rev B3 MPT rev 105
 Firmware rev 011d (1.29.00.00) (IT FW)

 This is a SATA disk - I don't have any SAS disks behind a LSI1068E to test.

When I try to do a SMART status read (more than just a simple
identify), looks like the 1068E drops the drive for a little bit. I
bought the Intel-branded LSI SAS3081E:
Current active firmware version is 0120 (1.32.00)
Firmware image's version is MPTFW-01.32.00.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.34.00.00 (2010.12.07)

kernel log messages:
Feb 17 00:54:05 megatron scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:05 megatron    Disconnected command timeout for Target 0
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    Log info 0x3114 received for target 0.
Feb 17 00:54:06 megatron    scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    Log info 0x3113 received for target 0.
Feb 17 00:54:06 megatron    scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    Log info 0x3113 received for target 0.
Feb 17 00:54:06 megatron    scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    Log info 0x3113 received for target 0.
Feb 17 00:54:06 megatron    scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    Log info 0x3113 received for target 0.
Feb 17 00:54:06 megatron    scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Feb 17 00:54:06 megatron scsi: [ID 107833 kern.notice]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    mpt_flush_target discovered non-NULL
cmd in slot 33, tasktype 0x3
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    Cmd (0xff02dea63a40) dump for
Target 0 Lun 0:
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    cdb=[ ]
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    pkt_flags=0x8000 pkt_statistics=0x0
pkt_state=0x0
Feb 17 00:54:06 megatron scsi: [ID 365881 kern.info]
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    pkt_scbp=0x0 cmd_flags=0x2800024
Feb 17 00:54:06 megatron scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,2e29@6/pci1000,3140@0 (mpt4):
Feb 17 00:54:06 megatron    ioc reset abort passthru

Fault management records some transport errors followed by recovery.
Any ideas? Disks are ST32000542AS.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Richard Elling
On Feb 1, 2011, at 8:54 PM, Krunal Desai wrote:

 On Tue, Feb 1, 2011 at 11:34 PM, Richard Elling
 richard.ell...@gmail.com wrote:
 There is a failure going on here.  It could be a cable or it could be a bad
 disk or firmware. The actual fault might not be in the disk reporting the 
 errors (!)
 It is not a media error.
 
 
 Errors were as follows:
 Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered  0x269213b01d700401
 Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered  0x269213b01d700401
 Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered  0x269213b01d700401
 Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran       0x269f99ef0b300401
 Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran       0x269f9a165a400401
 
 Verbose of a message:
 Feb 01 2011 19:33:04.996932283 ereport.io.scsi.cmd.disk.tran
 nvlist version: 0
class = ereport.io.scsi.cmd.disk.tran
ena = 0x269f99ef0b300401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0
(end detector)
 
devid = id1,sd@n5000c50010ed6a31
driver-assessment = fail
op-code = 0x0
cdb = 0x0 0x0 0x0 0x0 0x0 0x0
pkt-reason = 0x18

This error code means the device is gone.

pkt-state = 0x1

The command got the bus, but could not access the target.

pkt-stats = 0x0
__ttl = 0x1
__tod = 0x4d48a640 0x3b6bfabb
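
For reference when decoding these fields by hand: per scsi_pkt(9S), a pkt-reason of
0x18 is CMD_DEV_GONE, and a pkt-state of 0x1 means only STATE_GOT_BUS was set, i.e.
arbitration succeeded but the target was never reached. You can sanity-check those
constants on your own box with something like:

# grep -E 'CMD_DEV_GONE|STATE_GOT_BUS' /usr/include/sys/scsi/scsi_pkt.h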
 
 It was a cable error, but why didn't fault management tell me about
 it? What do you mean by "The actual fault might not be in the disk
 reporting the errors (!) It is not a media error."? Could the fault be
 coming from my SATA controller or something like that?

Possibly.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Oyvind Syljuasen
 I agree that we need to get email updates for failing
 devices.
 

If FMA discovers it, email can be sent, at least in Solaris 11 Express:
http://blogs.sun.com/robj/entry/fma_and_email_notifications
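
Roughly, from memory, so treat this as a sketch (it assumes the smtp-notify
service is installed, and the address is just a placeholder):

# svccfg setnotify problem-diagnosed mailto:root@example.com
# svccfg listnotify problem-diagnosed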

br,
syljua
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Carson Gaspar

On 2/1/11 5:52 PM, Krunal Desai wrote:


SMART status was reported healthy as well (got smartctl kind of
working), but I cannot read the SMART data of my disks behind the
1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
the X4500 Thumper and family? I'd like to see if it works with the
1068E.


Works For Me (TM).

c7t0d0 is hanging off an LSI SAS3081E-R (SAS1068E chip) rev B3 MPT rev 
105 Firmware rev 011d (1.29.00.00) (IT FW)


This is a SATA disk - I don't have any SAS disks behind a LSI1068E to test.

# uname -a
SunOS gandalf.taltos.org 5.11 snv_151a i86pc i386 i86pc

# /usr/local/sbin/smartctl -H -i -d sat /dev/rdsk/c7t0d0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.11] (local build)

Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11 family
Device Model: ST31500341AS
Serial Number:    9VS4HDYH
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  2 17:37:56 2011 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Krunal Desai
 This error code means the device is gone.
 The command got the bus, but could not access the target.

Thanks for that!

I updated the firmware on both of my USAS-L8i (LSI1068E-based) cards, and while
the controller numbering has shifted around in Solaris (went from c10/c11
to c11/c12, not a big deal I think), suddenly smartctl is able to
pull temperatures. I can't get a full SMART listing, but temperatures
are readable now. Oddly enough, my second LSI controller has skipped
c12t0d0 and jumped straight to c12t1d0 and onwards. It's a
good thing that ZFS can figure out what is what, but it will make
configuring power management tricky.
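
If anyone else hits the renumbering, a quick way to map the new cXtYdZ names back
to drive serial numbers (purely illustrative, the controller numbers in the pattern
are just my own):

# iostat -En | egrep 'c1[12]t[0-9]|Serial No'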

I'll post in pm-discuss about the kernel panics I was getting after
enabling drive power management.

-- 
--khd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Krunal Desai
 # uname -a
 SunOS gandalf.taltos.org 5.11 snv_151a i86pc i386 i86pc

movax@megatron:~# uname -a
SunOS megatron 5.11 snv_151a i86pc i386 i86pc


 # /usr/local/sbin/smartctl -H -i -d sat /dev/rdsk/c7t0d0
                                       smartctl 5.40 2010-10-16 r3189
 [i386-pc-solaris2.11] (local build)
 Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net


Fails for me, my version does not recognize the 'sat' option. I've
been using -d scsi:

movax@megatron:~# smartctl -h
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen

but,

movax@megatron:~# smartctl -a -d scsi /dev/rdsk/c11t0d0
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA  ST31500341AS Version: CC1H
Serial number: 9VS14DJD
Device type: disk
Local Time is: Wed Feb  2 20:45:00 2011 EST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature: 49 C

Error Counter logging not supported
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Carson Gaspar

On 2/2/11 5:47 PM, Krunal Desai wrote:


Fails for me, my version does not recognize the 'sat' option. I've
been using -d scsi:

movax@megatron:~# smartctl -h
smartctl version 5.36 [i386-pc-solaris2.8] Copyright (C) 2002-6 Bruce Allen


So build the current version of smartmontools. As you should have seen 
in my original response, I'm using 5.40. Bugs in 5.36 are unlikely to be 
interesting to the maintainers of the package ;-)
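
The usual autoconf dance works as far as I recall (version and paths approximate;
assumes gcc/g++ and gmake are installed). Grab the 5.40 tarball from the
smartmontools site, then:

$ gtar xzf smartmontools-5.40.tar.gz
$ cd smartmontools-5.40
$ ./configure && gmake
$ pfexec gmake install    # lands under /usr/local by default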


--
Carson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Krunal Desai
 So build the current version of smartmontools. As you should have seen in my 
 original response, I'm using 5.40. Bugs in 5.36 are unlikely to be 
 interesting to the maintainers of the package ;-)

Oops, missed that in your log. Will try compiling from source and see what 
happens.

Also, recently it seems like all the links to tools I need are broken. Where 
can I find a lsiutil binary for Solaris?

--khd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Eric D. Mudama

On Wed, Feb  2 at 21:05, Krunal Desai wrote:

So build the current version of smartmontools. As you should have seen in my 
original response, I'm using 5.40. Bugs in 5.36 are unlikely to be interesting 
to the maintainers of the package ;-)


Oops, missed that in your log. Will try compiling from source and see what 
happens.

Also, recently it seems like all the links to tools I need are broken. Where 
can I find a lsiutil binary for Solaris?


If you search for 'lsiutil solaris' on lsi.com, it'll direct you to a
zipfile that includes a Solaris binary for x86 Solaris.

At home now so can't test it.

--
Eric D. Mudama
edmud...@bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Krunal Desai
 If you search for 'lsiutil solaris' on lsi.com, it'll direct you to a
 zipfile that includes a Solaris binary for x86 Solaris.

Yep, that worked, grabbed it off some other adapter's page. Thanks!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Richard Elling
On Feb 2, 2011, at 8:59 AM, Oyvind Syljuasen wrote:

 I agree that we need to get email updates for failing
 devices.
 
 
 If FMA discovers it, email can be sent, at least in Solaris 11 Express:
 http://blogs.sun.com/robj/entry/fma_and_email_notifications

For NexentaStor we have a slightly different email delivery of system
fault notices. For those who are using the current version, please note that
there are improvements coming in configuration and reporting so that we
can help detect some specific pathologies often associated with transport
errors :-). There is always room for improvement in fault management...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Krunal Desai
I recently discovered a drive failure (either that or a loose cable, I
need to investigate further) on my home fileserver. 'fmadm faulty'
returns no output, but I can clearly see a failure when I do zpool
status -v:

pool: tank
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub canceled on Tue Feb  1 11:51:58 2011
config:

NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
  raidz2-0   DEGRADED 0 0 0
c10t0d0  ONLINE   0 0 0
c10t1d0  ONLINE   0 0 0
c10t2d0  ONLINE   0 0 0
c10t3d0  REMOVED  0 0 0
c10t4d0  ONLINE   0 0 0
c10t5d0  ONLINE   0 0 0
c10t6d0  ONLINE   0 0 0
c10t7d0  ONLINE   0 0 0

In dmesg, I see:
Feb  1 11:14:33 megatron scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0 (sd8):
Feb  1 11:14:33 megatron    Command failed to complete...Device is gone

I never had any problems with these drives + mpt in snv_134 (I'm on snv_151a
now); the only change was adding a second 1068E-IT that's currently
unpopulated with drives. But more importantly, why can't I see
this failure in fmadm (and how would I go about setting things up so an
e-mail is automatically dispatched to me when something like this
happens)? Is a pool going DEGRADED not considered a failure?

-- 
--khd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Cindy Swearingen

Hi Krunal,

It looks to me like FMA thinks that you removed the disk, so you'll need
to confirm whether the cable dropped or something else happened.

I agree that we need to get email updates for failing devices.

See if fmdump generated an error report using the commands below.

Thanks,

Cindy

# fmdump
TIME                 UUID                                 SUNW-MSG-ID   EVENT
Jan 07 14:01:14.7839 04ee736a-b2cb-612f-ce5e-a0e43d666762 ZFS-8000-GH   Diagnosed
Jan 13 10:34:32.2301 04ee736a-b2cb-612f-ce5e-a0e43d666762 FMD-8000-58   Updated


Then, review the contents:

fmdump -u 04ee736a-b2cb-612f-ce5e-a0e43d666762 -v
TIME                 UUID                                 SUNW-MSG-ID   EVENT
Jan 07 14:01:14.7839 04ee736a-b2cb-612f-ce5e-a0e43d666762 ZFS-8000-GH   Diagnosed

  100%  fault.fs.zfs.vdev.checksum

Problem in: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
   Affects: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
   FRU: -
  Location: -

Jan 13 10:34:32.2301 04ee736a-b2cb-612f-ce5e-a0e43d666762 FMD-8000-58   Updated

  100%  fault.fs.zfs.vdev.checksum

Problem in: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
   Affects: zfs://pool=c4538d8607c1e030/vdev=7954b2ff7a8383
   FRU: -
  Location: -

Thanks,

Cindy



On 02/01/11 09:55, Krunal Desai wrote:

I recently discovered a drive failure (either that or a loose cable, I
need to investigate further) on my home fileserver. 'fmadm faulty'
returns no output, but I can clearly see a failure when I do zpool
status -v:

pool: tank
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub canceled on Tue Feb  1 11:51:58 2011
config:

NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
  raidz2-0   DEGRADED 0 0 0
c10t0d0  ONLINE   0 0 0
c10t1d0  ONLINE   0 0 0
c10t2d0  ONLINE   0 0 0
c10t3d0  REMOVED  0 0 0
c10t4d0  ONLINE   0 0 0
c10t5d0  ONLINE   0 0 0
c10t6d0  ONLINE   0 0 0
c10t7d0  ONLINE   0 0 0

In dmesg, I see:
Feb  1 11:14:33 megatron scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0 (sd8):
Feb  1 11:14:33 megatron    Command failed to complete...Device is gone

I never had any problems with these drives + mpt in snv_134 (I'm on snv_151a
now); the only change was adding a second 1068E-IT that's currently
unpopulated with drives. But more importantly, why can't I see
this failure in fmadm (and how would I go about setting things up so an
e-mail is automatically dispatched to me when something like this
happens)? Is a pool going DEGRADED not considered a failure?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Krunal Desai
On Tue, Feb 1, 2011 at 1:29 PM, Cindy Swearingen
cindy.swearin...@oracle.com wrote:
 I agree that we need to get email updates for failing devices.

Definitely!

 See if fmdump generated an error report using the commands below.

Unfortunately not, see below:

movax@megatron:/root# fmdump
TIME UUID SUNW-MSG-ID EVENT
fmdump: warning: /var/fm/fmd/fltlog is empty

--khd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Cindy Swearingen

I misspoke and should clarify:

1. fmdump identifies fault reports that explain system issues

2. fmdump -eV identifies errors or problem symptoms
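
Or, in command form (output omitted here):

# fmdump        # summarizes diagnosed faults (reads the fault log)
# fmdump -eV    # dumps the underlying error reports in full detail (reads the error log)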

I'm unclear about your REMOVED status. I don't see it very often.

The ZFS Admin Guide says:

REMOVED

The device was physically removed while the system was running. Device 
removal detection is hardware-dependent and might not be supported on 
all platforms.


I need to check if FMA generally reports on devices that are REMOVED
by the administrator, as ZFS seems to think in this case.

Thanks,

Cindy



On 02/01/11 15:47, Krunal Desai wrote:

On Tue, Feb 1, 2011 at 1:29 PM, Cindy Swearingen
cindy.swearin...@oracle.com wrote:

I agree that we need to get email updates for failing devices.


Definitely!


See if fmdump generated an error report using the commands below.


Unfortunately not, see below:

movax@megatron:/root# fmdump
TIME UUID SUNW-MSG-ID EVENT
fmdump: warning: /var/fm/fmd/fltlog is empty

--khd

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Krunal Desai
On Tue, Feb 1, 2011 at 6:11 PM, Cindy Swearingen
cindy.swearin...@oracle.com wrote:
 I misspoke and should clarify:

 1. fmdump identifies fault reports that explain system issues

 2. fmdump -eV identifies errors or problem symptoms

Gotcha; fmdump -eV gives me the information I need. It appears to have
been a loose cable, I'm hitting the machine with some heavy I/O load,
and the pool resilvered itself, drive has not dropped out.
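
Once I'm happy it's stable I'll clear the old error counts, something like the
following (with my pool/device names):

# zpool clear tank c10t3d0
# zpool status -x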

SMART status was reported healthy as well (got smartctl kind of
working), but I cannot read the SMART data of my disks behind the
1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
the X4500 Thumper and family? I'd like to see if it works with the
1068E.

It's getting kind of tempting for me to investigate doing a run of
boards that run Marvell 88SX6081s behind a PLX PCIe-to-PCI-X bridge.
They should have beyond-excellent support, seeing as that is what the
X4500 uses for its SATA ports.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Richard Elling
On Feb 1, 2011, at 5:52 PM, Krunal Desai wrote:

 On Tue, Feb 1, 2011 at 6:11 PM, Cindy Swearingen
 cindy.swearin...@oracle.com wrote:
 I misspoke and should clarify:
 
 1. fmdump identifies fault reports that explain system issues
 
 2. fmdump -eV identifies errors or problem symptoms
 
 Gotcha; fmdump -eV gives me the information I need. It appears to have
 been a loose cable, I'm hitting the machine with some heavy I/O load,
 and the pool resilvered itself, drive has not dropped out.

The output of fmdump is explicit. I am interested to know if you saw 
aborts and timeouts or some other errors.

 
 SMART status was reported healthy as well (got smartctl kind of
 working), but I cannot read the SMART data of my disks behind the
 1068E due to limitations of smartmontools I guess. (e.g. 'smartctl -d
 scsi -a /dev/rdsk/c10t0d0' gives me serial #, model, and just a
 generic 'SMART Ok'). I assume that SUNWhd is licensed only for use on
 the X4500 Thumper and family? I'd like to see if it works with the
 1068E.

The open-source version of smartmontools seems to be slightly out
of date and somewhat finicky. Does anyone know of a better SMART
implementation?

 
 It's getting kind of tempting for me to investigate doing a run of
 boards that run Marvell 88SX6081s behind a PLX PCIe-to-PCI-X bridge.
 They should have beyond-excellent support, seeing as that is what the
 X4500 uses for its SATA ports.

Nice idea, except that the X4500 was EOL years ago and the replacement,
X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
chipsets because Oracle's Sun products use them from the top (M9000) all
the way down the product line.
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Krunal Desai
 The output of fmdump is explicit. I am interested to know if you saw 
 aborts and timeouts or some other errors.

I have the machine off at the moment while I install new disks (18x ST32000542AS), but
IIRC they appeared as transport errors (scsi.something.transport, I can paste
the exact errors in a little bit): a slew of transfer/soft errors followed by
the drive disappearing. I assume that my HBA took it offline, and the mpt driver
reported that to the OS as an admin disconnect, not as a failure per se.
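
When it happens again I'll also check what the HBA side thinks, probably something
along the lines of:

# cfgadm -al | grep -i disk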

 The open-source version of smartmontools seems to be slightly out
 of date and somewhat finicky. Does anyone know of a better SMART
 implementation?

That SUNWhd I mentioned seemed interesting, but I assume licensing means I can
only get it if I purchase Sun hardware.

 Nice idea, except that the X4500 was EOL years ago and the replacement,
 X4540, uses LSI HBAs. I think you will find better Solaris support for the LSI
 chipsets because Oracle's Sun products use them from the top (M9000) all
 the way down the product line.

Oops, forgot that the X4500s are actually kind of old. I'll have to look up 
what LSI controllers the newer models are using (the LSI 2xx8 something IIRC? 
Will have to Google).

--khd

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Richard Elling
On Feb 1, 2011, at 6:49 PM, Krunal Desai wrote:

 The output of fmdump is explicit. I am interested to know if you saw 
 aborts and timeouts or some other errors.
 
 I have the machine off at the moment while I install new disks (18x ST32000542AS), but
 IIRC they appeared as transport errors (scsi.something.transport, I can
 paste the exact errors in a little bit): a slew of transfer/soft errors
 followed by the drive disappearing. I assume that my HBA took it offline, and
 the mpt driver reported that to the OS as an admin disconnect, not as a
 failure per se.

There is a failure going on here.  It could be a cable or it could be a bad
disk or firmware. The actual fault might not be in the disk reporting the 
errors (!)
It is not a media error.

 
 The open-source version of smartmontools seems to be slightly out
 of date and somewhat finicky. Does anyone know of a better SMART
 implementation?
 
 That SUNWhd I mentioned seemed interesting, but I assume licensing means I
 can only get it if I purchase Sun hardware.
 
 Nice idea, except that the X4500 was EOL years ago and the replacement,
 X4540, uses LSI HBAs. I think you will find better Solaris support for the 
 LSI
 chipsets because Oracle's Sun products use them from the top (M9000) all
 the way down the product line.
 
 Oops, forgot that the X4500s are actually kind of old. I'll have to look up 
 what LSI controllers the newer models are using (the LSI 2xx8 something IIRC? 
 Will have to Google).

No, they aren't that new.  The LSI 2008 are 6 Gbps HBAs and the older 1064/1068 
series are 3 Gbps.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-01 Thread Krunal Desai
On Tue, Feb 1, 2011 at 11:34 PM, Richard Elling
richard.ell...@gmail.com wrote:
 There is a failure going on here.  It could be a cable or it could be a bad
 disk or firmware. The actual fault might not be in the disk reporting the 
 errors (!)
 It is not a media error.


Errors were as follows:
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered  0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered  0x269213b01d700401
Feb 01 19:33:01.3665 ereport.io.scsi.cmd.disk.recovered  0x269213b01d700401
Feb 01 19:33:04.9969 ereport.io.scsi.cmd.disk.tran       0x269f99ef0b300401
Feb 01 19:33:04.9970 ereport.io.scsi.cmd.disk.tran       0x269f9a165a400401

Verbose of a message:
Feb 01 2011 19:33:04.996932283 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
class = ereport.io.scsi.cmd.disk.tran
ena = 0x269f99ef0b300401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,2e21@1/pci15d9,a580@0/sd@3,0
(end detector)

devid = id1,sd@n5000c50010ed6a31
driver-assessment = fail
op-code = 0x0
cdb = 0x0 0x0 0x0 0x0 0x0 0x0
pkt-reason = 0x18
pkt-state = 0x1
pkt-stats = 0x0
__ttl = 0x1
__tod = 0x4d48a640 0x3b6bfabb

It was a cable error, but why didn't fault management tell me about
it? What do you mean by "The actual fault might not be in the disk
reporting the errors (!) It is not a media error."? Could the fault be
coming from my SATA controller or something like that?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss