Re: [zfs-discuss] Narrow escape with FAULTED disks

2010-08-23 Thread Mark Bennett
Well I do have a plan.

Thanks to the portability of ZFS boot disks, I'll build two new OS disks on 
another machine with the next Nexenta release, export the data pool, and swap 
the new boot disks in.

That way I can at least run a zfs scrub without killing performance, and get 
the Intel SSDs I have been testing working properly.

On the other hand, I could just use the spare 7210 Appliance boot disk I have 
lying about.
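
For anyone following along, the swap itself should just be the usual 
export/import dance; roughly this, untested, with the pool name taken from the 
fmdump output further down the thread:

  # on the current OS: stop any running scrub and cleanly export the data pool
  zpool scrub -s drgvault
  zpool export drgvault

  # boot from the freshly built OS disks, then bring the pool back
  zpool import              # check that the pool is visible
  zpool import drgvault
  zpool status -v drgvault  # confirm everything is ONLINE before resuming work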

Mark.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Narrow escape with FAULTED disks

2010-08-18 Thread Cindy Swearingen

It's hard to tell what caused the SMART predictive-failure message; it could
be something as simple as a temperature fluctuation. If ZFS had noticed that
a disk wasn't available yet, I would expect a message to that effect.

In any case, I would keep a replacement disk available.

The important thing is that you continue to monitor your hardware
for failures.

We recommend using ZFS redundancy and always having backups of your
data.
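
For routine monitoring, something along these lines is usually enough (just a
generic sketch, not specific to your configuration):

  # report only pools that have problems
  zpool status -x

  # any outstanding diagnosed faults
  fmadm faulty

  # per-device soft/hard/transport error counters
  iostat -En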

Thanks,

Cindy


On 08/18/10 02:38, Mark Bennett wrote:

Hi Cindy,

Not very enlightening.
No previous errors for the disks.
I did replace one disk about a month earlier when it showed a rise in I/O 
errors, before it reached the level at which fault management would have 
faulted it.

The disk mentioned is not one of those that went FAULTED, and there have been 
no more SMART error events since.
The ZFS pool failed on boot after a reboot command.

The scrub was eventually stopped at 75% due to the performance impact.
No errors were found up to that point.
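
(For the record, the earlier disk swap and stopping the scrub were just the 
standard commands, roughly as below; the device names here are placeholders, 
not the real ones:)

  # error counters that prompted the earlier swap
  iostat -En

  # replace the suspect disk with the new one (placeholder device names)
  zpool replace drgvault c4t5d0 c4t10d0

  # stop the running scrub once the performance impact got too high
  zpool scrub -s drgvault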

One thing I see from the (attached) messages log is that the ZFS error occurs 
before all of the disks have been logged as enumerated.
This was probably the first reboot since at least 8, and maybe 16, extra disks 
were hot-plugged and added to the pool.

The hardware is a Supermicro 3U plus two 4U SAS storage chassis.
The SAS controller has 16 disks on one SAS port and 32 on the other.




Aug 16 18:44:39.2154 02f57499-ae0a-c46c-b8f8-825205a8505d ZFS-8000-D3
  100%  fault.fs.zfs.device
Problem in: zfs://pool=drgvault/vdev=d79c5fc5b5c3b789
   Affects: zfs://pool=drgvault/vdev=d79c5fc5b5c3b789
   FRU: -
  Location: -
Aug 16 18:44:39.5569 25e0bdc2-0171-c4b5-b530-a268f8572bd1 ZFS-8000-D3
  100%  fault.fs.zfs.device
Problem in: zfs://pool=drgvault/vdev=e912d259d7829903
   Affects: zfs://pool=drgvault/vdev=e912d259d7829903
   FRU: -
  Location: -
Aug 16 18:44:39.8964 8e9cff35-8e9d-c0f1-cd5b-bd1d0276cda1 ZFS-8000-CS
  100%  fault.fs.zfs.pool
Problem in: zfs://pool=drgvault
   Affects: zfs://pool=drgvault
   FRU: -
  Location: -
Aug 16 18:45:47.2604 3848ba46-ee18-4aad-b632-9baf25b532ea DISK-8000-0X
  100%  fault.io.disk.predictive-failure
Problem in: 
hc://:product-id=LSILOGIC-SASX36-A.1:server-id=:chassis-id=50030480005a337f:serial=6XW15V2S:part=ST32000542AS-ST32000542AS:revision=CC34/ses-enclosure=1/bay=6/disk=0
   Affects: 
dev:///:devid=id1,s...@n5000c50021f4916f//p...@0,0/pci8086,4...@3/pci15d9,a...@0/s...@24,0
   FRU: 
hc://:product-id=LSILOGIC-SASX36-A.1:server-id=:chassis-id=50030480005a337f:serial=6XW15V2S:part=ST32000542AS-ST32000542AS:revision=CC34/ses-enclosure=1/bay=6/disk=0
  Location: 006



Mark.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Narrow escape with FAULTED disks

2010-08-17 Thread Cindy Swearingen

Hi Mark,

I would recheck with fmdump to see if you have any persistent errors
on the second disk.

The fmdump command will display faults, and fmdump -eV will display the 
underlying error reports (persistent errors that are turned into faults based 
on certain criteria).

If fmdump -eV doesn't show any activity for that second disk, then
review /var/adm/messages or iostat -En for driver-level resets and
so on.
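
In other words, roughly this, then filter the output for the disk in question:

  # fault log: what the diagnosis engine concluded, with timestamps and UUIDs
  fmdump

  # error log: the underlying error reports, in full detail
  fmdump -eV

  # driver-level view: resets, transport errors, soft/hard error counts
  iostat -En
  grep -i scsi /var/adm/messages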

Thanks,

Cindy

On 08/16/10 18:53, Mark Bennett wrote:

Nothing like a heart-in-mouth moment to shave years off your life.

I rebooted an snv_132 box in perfect health, and it came back up with two 
FAULTED disks in the same vdev.

Everything I found in an hour on Google basically said the data was gone.

All 45 TB of it.

A postmortem with fmadm showed a single disk flagged with a SMART predictive 
failure.
There was no indication why the second disk was faulted.

I don't give up easily, and it is now back up and scrubbing - no errors so far.

I checked that both drives were readable, so it didn't seem to be a hardware 
fault.
I moved one into a different server and ran zpool import to see what it made 
of it.

That disk showed as ONLINE, and its vdev companions were unavailable.
So I moved the disks into different bays and booted from the snv_134 CD-ROM.
Ran zpool import again, and the pool came back with everything online.

That was encouraging, so I exported it and booted from the original snv_132 
boot drive.

Well, it came back, and at 1:00 AM I was able to get back to the original 
issue I was chasing.
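
For anyone who hits the same thing, the recovery boiled down to roughly this 
sequence (reconstructed from memory, so treat it as a sketch rather than a 
transcript):

  # from the snv_134 CD / second server: list importable pools without importing
  zpool import

  # import the pool (-f if it still claims to be in use by the old host)
  zpool import -f drgvault
  zpool status -v drgvault

  # export cleanly before going back to the original boot disk
  zpool export drgvault

  # after booting the original snv_132 drive
  zpool import drgvault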

So, don't give up hope when all hope appears to be lost.

Mark.

Still an OpenSolaris fan, keen to help the community achieve a 2010 release on 
its own.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss