Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-11 Thread Ross
... which sounds very similar to issues I've raised many times.  ZFS should 
have the ability to double check what a drive is doing, and speculatively time 
out a device that appears to be failing in order to maintain pool performance.

If a single drive in a redundant pool can be seen to be responding 10-50x 
slower than others, or to have hundreds of outstanding IOs, ZFS should be able 
to flag it as 'possibly faulty' and return data from the rest of the pool 
without that one device blocking it.  It should not block an entire redundant 
pool when just one device is behaving badly.

And I don't care what the driver says.  If the performance figures indicate 
there's a problem, that's a driver bug, and it's possible for ZFS to spot that.
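
In practice this sort of outlier is already visible from userland; something like 
the plain iostat below (nothing ZFS-specific, the interval is just an example) shows 
per-device service times and queue depths, and a wedged disk stands out at a glance:

  # extended per-device statistics every 5 seconds
  # asvc_t = average service time in ms, actv = outstanding I/Os on the device
  iostat -xn 5

A device sitting at hundreds of outstanding I/Os with an asvc_t far above its 
peers is exactly the 'possibly faulty' case described above.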

I've no problem with Sun's position that this should be done at the driver 
level; I agree that in theory that is where it should be dealt with. I just 
feel that in the real world bugs occur, and this extra sanity check could be 
useful in ensuring that ZFS still performs well despite problems in the device 
drivers.

There have been reports to this forum now of single disk timeout errors having 
caused whole pool problems for devices connected via iSCSI, USB, SAS and SATA.  
I've had personal experience of it on a test whitebox server using an 
AOC-SAT2-MV8, and similar problems have been reported on a Sun x4540.


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-11 Thread Sanjeev
Hi Chris,

On Sun, Aug 09, 2009 at 05:53:12PM -0700, Chris Baker wrote:
> OK - had a chance to do more testing over the weekend. Firstly some extra 
> data:
> 
> With the mirror moved to both drives on ICH10R ports, a sudden disk power-off 
> faulted the mirror cleanly to the remaining drive, no problem.
> 
> A one-drive pool on the ICH10R under heavy write traffic, with the drive then 
> powered off, causes the zpool/zfs hangs described above.
> 
> ZPool being tested is called "Remove" and consists of:
> c7t2d0s0 - attached to the ICH10R
> c8t0d0s0 - second disk attached to the Si3132 card with the Si3124 driver
> 
> This leads me to the following suspicions:
> (1) We have an Si3124 driver issue where it does not always detect the drive 
> removal, or fails to pass that info back to ZFS, even though we know the kernel 
> noticed it
> (2) In the event that the only disk in a pool goes faulted, the zpool/zfs 
> subsystem will block indefinitely waiting to get rid of the pending writes.
> 
> I've just recabled back to one disk on ICH10R and one on Si3132 and tried the 
> sudden off with the Si drive:
> 
> *) First try - mirror faulted and IO continued - good news but confusing
> *) Second try - zfs/zpool hung, couldn't even get a zpool status, tried a 
> savecore but savecore hung moving the data to a separate zpool
> *) Third try - zfs/zpool hung, ran savecore -L to a UFS filesystem I created 
> for that purpose
> 
> After the first try, dmesg shows:
> Aug 10 00:34:41 TS1  SATA device detected at port 0
> Aug 10 00:34:41 TS1 sata: [ID 663010 kern.info] 
> /p...@0,0/pci8086,3...@1c,3/pci1095,7...@0 :
> Aug 10 00:34:41 TS1 sata: [ID 761595 kern.info] SATA disk device at 
> port 0
> Aug 10 00:34:41 TS1 sata: [ID 846691 kern.info] model WDC 
> WD5000AACS-00ZUB0
> Aug 10 00:34:41 TS1 sata: [ID 693010 kern.info] firmware 01.01B01
> Aug 10 00:34:41 TS1 sata: [ID 163988 kern.info] serial number  
> WD-xx
> Aug 10 00:34:41 TS1 sata: [ID 594940 kern.info] supported features:
> Aug 10 00:34:41 TS1 sata: [ID 981177 kern.info]  48-bit LBA, DMA, 
> Native Command Queueing, SMART, SMART self-test
> Aug 10 00:34:41 TS1 sata: [ID 643337 kern.info] SATA Gen2 signaling 
> speed (3.0Gbps)
> Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] Supported queue depth 
> 32, limited to 31
> Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] capacity = 976773168 
> sectors
> Aug 10 00:34:41 TS1 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, 
> TYPE: Fault, VER: 1, SEVERITY: Major
> Aug 10 00:34:41 TS1 EVENT-TIME: Mon Aug 10 00:34:41 BST 2009
> Aug 10 00:34:41 TS1 PLATFORM:  , CSN: 
>  , HOSTNAME: TS1
> Aug 10 00:34:41 TS1 SOURCE: zfs-diagnosis, REV: 1.0
> Aug 10 00:34:41 TS1 EVENT-ID: ab7df266-3380-4a35-e0bc-9056878fd182
> Aug 10 00:34:41 TS1 DESC: The number of I/O errors associated with a ZFS 
> device exceeded
> Aug 10 00:34:41 TS1  acceptable levels.  Refer to 
> http://sun.com/msg/ZFS-8000-FD for more information.
> Aug 10 00:34:41 TS1 AUTO-RESPONSE: The device has been offlined and marked as 
> faulted.  An attempt
> Aug 10 00:34:41 TS1  will be made to activate a hot spare if 
> available.
> Aug 10 00:34:41 TS1 IMPACT: Fault tolerance of the pool may be compromised.
> Aug 10 00:34:41 TS1 REC-ACTION: Run 'zpool status -x' and replace the bad 
> device.
> 
> and after the second and third test, just:
> SATA device detached at port 0
> 
> Core files were tar-ed together and bzip2-ed and can be found at:
> 
> http://dl.getdropbox.com/u/1709454/dump.bakerci.200908100106.tar.bz2
> 
> Please let me know if you need any further core/debug. Apologies to readers 
> having all this inflicted by email digest.

I spent some time analysing the dump, and I find that ZFS does not know that the
disk is dead. There are about 1900 WRITE requests pending on that disk
(c8t0d0s0).

Attached are the details. Let me know what you find from fmdump.
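
For reference, the FMA logs can be pulled with standard fmdump/fmadm invocations, 
along these lines:

  fmdump -v      # fault events diagnosed by fmd (e.g. the ZFS-8000-FD seen above)
  fmdump -eV     # raw error telemetry (ereports), verbose
  fmadm faulty   # resources currently marked faulted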

I suspect this has to do with driver support for the card.

Hope that helps.

Regards,
Sanjeev
-- 

Sanjeev Bagewadi
Solaris RPE 
Bangalore, India
The pool in question "remove"

-- snip --
ZFS spa @ 0xff01c77b9800
Pool name: remove
State: ACTIVE
   VDEV Address       State     Aux   Description
   0xff01c9faec80     HEALTHY   -     root

     VDEV Address     State     Aux   Description
     0xff01c9faf2c0   HEALTHY   -     mirror

       VDEV Address   State     Aux   Description
       0xff01d4099940 HEALTHY   -     /dev/dsk/c7t2d0s0

       VDEV Address   State     Aux   Description
       0xff01d4099300 HEALTHY   -     /dev/dsk/c8t0d0s0
-- snip --

Obviously, the status for c8t0d0s0 is wrong; it should have been marked dead.
 
Looking at the threads, we have spa_sync() waiting for an IO to complete:
-- snip --
> ff0008678c60::findstack -v
stac
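
The rest of that stack was cut off in the digest. For anyone who wants to poke at 
the saved dump themselves, the same sort of inspection can be reproduced with mdb, 
roughly like this (the crash-dump path is just an example):

  mdb /var/crash/TS1/unix.0 /var/crash/TS1/vmcore.0
  > ::spa -v                      # pool and vdev state as ZFS recorded it
  > ::threadlist -v               # kernel threads with stack traces
  > ff0008678c60::findstack -v    # stack of one specific thread, as above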

Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-09 Thread Sanjeev
Chris,

Thanks for providing the details and the dump.
I shall look into this and update with my findings.

Thanks and regards,
Sanjeev

On Sun, Aug 09, 2009 at 05:53:12PM -0700, Chris Baker wrote:
> Hi Sanjeev
> 
> OK - had a chance to do more testing over the weekend. Firstly some extra 
> data:
> 
> With the mirror moved to both drives on ICH10R ports, a sudden disk power-off 
> faulted the mirror cleanly to the remaining drive, no problem.
> 
> A one-drive pool on the ICH10R under heavy write traffic, with the drive then 
> powered off, causes the zpool/zfs hangs described above.
> 
> ZPool being tested is called "Remove" and consists of:
> c7t2d0s0 - attached to the ICH10R
> c8t0d0s0 - second disk attached to the Si3132 card with the Si3124 driver
> 
> This leads me to the following suspicions:
> (1) We have an Si3124 driver issue where it does not always detect the drive 
> removal, or fails to pass that info back to ZFS, even though we know the kernel 
> noticed it
> (2) In the event that the only disk in a pool goes faulted, the zpool/zfs 
> subsystem will block indefinitely waiting to get rid of the pending writes.
> 
> I've just recabled back to one disk on ICH10R and one on Si3132 and tried the 
> sudden off with the Si drive:
> 
> *) First try - mirror faulted and IO continued - good news but confusing
> *) Second try - zfs/zpool hung, couldn't even get a zpool status, tried a 
> savecore but savecore hung moving the data to a separate zpool
> *) Third try - zfs/zpool hung, ran savecore -L to a UFS filesystem I created 
> for that purpose
> 
> After the first try, dmesg shows:
> Aug 10 00:34:41 TS1  SATA device detected at port 0
> Aug 10 00:34:41 TS1 sata: [ID 663010 kern.info] 
> /p...@0,0/pci8086,3...@1c,3/pci1095,7...@0 :
> Aug 10 00:34:41 TS1 sata: [ID 761595 kern.info] SATA disk device at 
> port 0
> Aug 10 00:34:41 TS1 sata: [ID 846691 kern.info] model WDC 
> WD5000AACS-00ZUB0
> Aug 10 00:34:41 TS1 sata: [ID 693010 kern.info] firmware 01.01B01
> Aug 10 00:34:41 TS1 sata: [ID 163988 kern.info] serial number  
> WD-xx
> Aug 10 00:34:41 TS1 sata: [ID 594940 kern.info] supported features:
> Aug 10 00:34:41 TS1 sata: [ID 981177 kern.info]  48-bit LBA, DMA, 
> Native Command Queueing, SMART, SMART self-test
> Aug 10 00:34:41 TS1 sata: [ID 643337 kern.info] SATA Gen2 signaling 
> speed (3.0Gbps)
> Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] Supported queue depth 
> 32, limited to 31
> Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] capacity = 976773168 
> sectors
> Aug 10 00:34:41 TS1 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, 
> TYPE: Fault, VER: 1, SEVERITY: Major
> Aug 10 00:34:41 TS1 EVENT-TIME: Mon Aug 10 00:34:41 BST 2009
> Aug 10 00:34:41 TS1 PLATFORM:  , CSN: 
>  , HOSTNAME: TS1
> Aug 10 00:34:41 TS1 SOURCE: zfs-diagnosis, REV: 1.0
> Aug 10 00:34:41 TS1 EVENT-ID: ab7df266-3380-4a35-e0bc-9056878fd182
> Aug 10 00:34:41 TS1 DESC: The number of I/O errors associated with a ZFS 
> device exceeded
> Aug 10 00:34:41 TS1  acceptable levels.  Refer to 
> http://sun.com/msg/ZFS-8000-FD for more information.
> Aug 10 00:34:41 TS1 AUTO-RESPONSE: The device has been offlined and marked as 
> faulted.  An attempt
> Aug 10 00:34:41 TS1  will be made to activate a hot spare if 
> available.
> Aug 10 00:34:41 TS1 IMPACT: Fault tolerance of the pool may be compromised.
> Aug 10 00:34:41 TS1 REC-ACTION: Run 'zpool status -x' and replace the bad 
> device.
> 
> and after the second and third test, just:
> SATA device detached at port 0
> 
> Core files were tar-ed together and bzip2-ed and can be found at:
> 
> http://dl.getdropbox.com/u/1709454/dump.bakerci.200908100106.tar.bz2
> 
> Please let me know if you need any further core/debug. Apologies to readers 
> having all this inflicted by email digest.
> 
> Many thanks
> 
> Chris

-- 

Sanjeev Bagewadi
Solaris RPE 
Bangalore, India


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-09 Thread Chris Baker
Hi Sanjeev

OK - had a chance to do more testing over the weekend. Firstly some extra data:

With the mirror moved to both drives on ICH10R ports, a sudden disk power-off 
faulted the mirror cleanly to the remaining drive, no problem.

A one-drive pool on the ICH10R under heavy write traffic, with the drive then 
powered off, causes the zpool/zfs hangs described above.

ZPool being tested is called "Remove" and consists of:
c7t2d0s0 - attached to the ICH10R
c8t0d0s0 - second disk attached to the Si3132 card with the Si3124 driver

This leads me to the following suspicions:
(1) We have an Si3124 driver issue where it does not always detect the drive 
removal, or fails to pass that info back to ZFS, even though we know the kernel 
noticed it
(2) In the event that the only disk in a pool goes faulted, the zpool/zfs 
subsystem will block indefinitely waiting to get rid of the pending writes.

I've just recabled back to one disk on ICH10R and one on Si3132 and tried the 
sudden off with the Si drive:

*) First try - mirror faulted and IO continued - good news but confusing
*) Second try - zfs/zpool hung, couldn't even get a zpool status, tried a 
savecore but savecore hung moving the data to a separate zpool
*) Third try - zfs/zpool hung, ran savecore -L to a UFS filesystem I created 
for that purpose (see the dumpadm/savecore example below)
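
For the record, the live-dump step looks something like this (the target directory 
is just an example; any filesystem with enough free space will do):

  dumpadm                 # show the current dump device / savecore configuration
  savecore -L /ufsdump    # write a dump of the live system into /ufsdump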

After the first try, dmesg shows:
Aug 10 00:34:41 TS1  SATA device detected at port 0
Aug 10 00:34:41 TS1 sata: [ID 663010 kern.info] 
/p...@0,0/pci8086,3...@1c,3/pci1095,7...@0 :
Aug 10 00:34:41 TS1 sata: [ID 761595 kern.info] SATA disk device at 
port 0
Aug 10 00:34:41 TS1 sata: [ID 846691 kern.info] model WDC 
WD5000AACS-00ZUB0
Aug 10 00:34:41 TS1 sata: [ID 693010 kern.info] firmware 01.01B01
Aug 10 00:34:41 TS1 sata: [ID 163988 kern.info] serial number  
WD-xx
Aug 10 00:34:41 TS1 sata: [ID 594940 kern.info] supported features:
Aug 10 00:34:41 TS1 sata: [ID 981177 kern.info]  48-bit LBA, DMA, 
Native Command Queueing, SMART, SMART self-test
Aug 10 00:34:41 TS1 sata: [ID 643337 kern.info] SATA Gen2 signaling 
speed (3.0Gbps)
Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] Supported queue depth 
32, limited to 31
Aug 10 00:34:41 TS1 sata: [ID 349649 kern.info] capacity = 976773168 
sectors
Aug 10 00:34:41 TS1 fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, 
TYPE: Fault, VER: 1, SEVERITY: Major
Aug 10 00:34:41 TS1 EVENT-TIME: Mon Aug 10 00:34:41 BST 2009
Aug 10 00:34:41 TS1 PLATFORM:  , CSN:   
   , HOSTNAME: TS1
Aug 10 00:34:41 TS1 SOURCE: zfs-diagnosis, REV: 1.0
Aug 10 00:34:41 TS1 EVENT-ID: ab7df266-3380-4a35-e0bc-9056878fd182
Aug 10 00:34:41 TS1 DESC: The number of I/O errors associated with a ZFS device 
exceeded
Aug 10 00:34:41 TS1  acceptable levels.  Refer to 
http://sun.com/msg/ZFS-8000-FD for more information.
Aug 10 00:34:41 TS1 AUTO-RESPONSE: The device has been offlined and marked as 
faulted.  An attempt
Aug 10 00:34:41 TS1  will be made to activate a hot spare if available.
Aug 10 00:34:41 TS1 IMPACT: Fault tolerance of the pool may be compromised.
Aug 10 00:34:41 TS1 REC-ACTION: Run 'zpool status -x' and replace the bad 
device.

and after the second and third test, just:
SATA device detached at port 0

Core files were tar-ed together and bzip2-ed and can be found at:

http://dl.getdropbox.com/u/1709454/dump.bakerci.200908100106.tar.bz2

Please let me know if you need any further core/debug. Apologies to readers 
having all this inflicted by email digest.

Many thanks

Chris


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-05 Thread Sanjeev
Chris,

On Wed, Aug 05, 2009 at 05:33:24AM -0700, Chris Baker wrote:
> Sanjeev
> 
> Thanks for taking an interest. Unfortunately I did have failmode=continue, 
> but I have just destroyed/recreated and double confirmed and got exactly the 
> same results.
> 
> zpool status shows both drives mirror, ONLINE, no errors
> 
> dmesg shows:
> 
> SATA device detached at port 0
> 
> cfgadm shows:
> 
> sata-port        empty        unconfigured
> 
> The IO process has just hung. 
> 
> It seems to me that zfs thinks it has a drive with a really long response 
> time rather than a dead drive, so no failmode processing, no mirror resilience, 
> etc. Clearly something has been reported back to the kernel re the port going 
> dead, but whether that came from the driver or not I wouldn't know.

Would it be possible for you to take a crashdump of the machine and point me to
it. We could try looking at where things are stuck.

Thanks and regards,
Sanjeev

-- 

Sanjeev Bagewadi
Solaris RPE 
Bangalore, India


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-05 Thread roland
Doesn't Solaris have the great built-in dtrace for issues like these?

If we knew in which syscall or kernel thread the system is stuck, we might get a 
clue...

Unfortunately, I don't have any real knowledge of Solaris kernel internals or 
dtrace...
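
One way to answer the "which syscall is it stuck in" question without much dtrace 
knowledge is pstack plus mdb; roughly (the pid below is illustrative):

  pstack 1234     # user-level stack of the hung zpool/zfs command
  # kernel-side stacks of the same process's threads:
  echo "0t1234::pid2proc | ::walk thread | ::findstack -v" | mdb -k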


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-05 Thread Ross
Yeah, sounds just like the issues I've seen before.  I don't think you're 
likely to see a fix anytime soon, but the good news is that so far I've not 
seen anybody reporting problems with LSI 1068-based cards (and I've been 
watching for a while).

With the 1068 being used in the x4540 Thumper 2, I'd expect it to have pretty 
solid drivers :)


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-05 Thread Chris Baker
I've left it hanging about 2 hours. I've also just learned that whatever the 
issue is, it is also blocking an "init 5" shutdown. I was thinking about setting 
a watchdog with a forced reboot, but that will get me nowhere if I need a 
reset-button restart.

Thanks for the advice re the LSI 1068, not exactly what I was hoping to hear 
but very good info all the same.

Kind regards

Chris


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-05 Thread Ross
Just a thought, but how long have you left it?  I had problems with a failing 
drive a while back which did eventually get taken offline, but it took about 20 
minutes to do so.


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-05 Thread Chris Baker
Sanjeev

Thanks for taking an interest. Unfortunately I did have failmode=continue, but 
I have just destroyed/recreated the pool, double-confirmed, and got exactly the 
same results.

zpool status shows both drives mirror, ONLINE, no errors

dmesg shows:

SATA device detached at port 0

cfgadm shows:

sata-port        empty        unconfigured

The IO process has just hung. 

It seems to me that zfs thinks it has a drive with a really long response time 
rather than a dead drive, so no failmode processing, no mirror resilience, etc. 
Clearly something has been reported back to the kernel re the port going dead, 
but whether that came from the driver or not I wouldn't know.

Kind regards

Chris


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Sanjeev
Chris,

Can you please check the failmode property of the pool?

-- zpool get failmode 

If it is set to "wait", you could try setting it to "continue".
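
For example (the pool name "tank" is just a placeholder):

  zpool get failmode tank
  zpool set failmode=continue tank

The valid values are wait, continue and panic; the default is "wait", which blocks 
I/O until the device comes back.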

Regards,
Sanjeev
On Tue, Aug 04, 2009 at 08:56:03PM -0700, Chris Baker wrote:
> Ok - in an attempt to weasel my way past the issue I mirrored my problematic 
> si3124 drive to a second drive on the ICH10R, started writing to the file 
> system and then killed the power to the si3124 removable drive.
> 
> To my (unfortunate) surprise, the IO stream that was writing to the mirrored 
> filesystem just hung. I can still zpool status, zfs list, but the process 
> that was writing has hung and the zpool iostat that was running in another 
> window has also hung.
> 
> dmesg shows the kernel noticed the sata disconnect ok and cfgadm shows the 
> sata port as empty. zpool status shows both drives online and no errors.
> 
> Now I'm worried my mirror protection isn't quite as solid as I thought too.
> 
> Anyone any ideas?
> 
> Cheers
> 
> Chris

-- 

Sanjeev Bagewadi
Solaris RPE 
Bangalore, India


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Ross
Whether ZFS properly detects device removal depends to a large extent on the 
device drivers for the controller.  I personally have stuck to using 
controllers with chipsets I know Sun use on their own servers, but even then 
I've been bitten by similar problems to yours on the AOC-SAT2-MV8 cards.

The LSI 1068 based cards seem to be the most stable, but I haven't been 
fortunate enough to test them myself yet.

I've been saying for ages that ZFS needs its own timeouts to detect when a 
drive has gone in a redundant pool, but Sun don't seem to agree that it's 
needed.  They seem happy to have ZFS working on their own kit, and hanging for 
others.


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Chris Baker
Ok - in an attempt to weasel my way past the issue I mirrored my problematic 
si3124 drive to a second drive on the ICH10R, started writing to the file 
system and then killed the power to the si3124 removable drive.

To my (unfortunate) surprise, the IO stream that was writing to the mirrored 
filesystem just hung. I can still zpool status, zfs list, but the process that 
was writing has hung and the zpool iostat that was running in another window 
has also hung.

dmesg shows the kernel noticed the sata disconnect ok and cfgadm shows the sata 
port as empty. zpool status shows both drives online and no errors.

Now I'm worried my mirror protection isn't quite as solid as I thought too.

Anyone any ideas?

Cheers

Chris


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Chris Baker
It's a generic Sil3132-based PCIe x1 card using the si3124 driver.

Prior to this I had been using an Intel ICH10R with AHCI, but I have found the 
Sil3132 actually hot-plugs a little more smoothly than the Intel chipset. I have 
not gone back to recheck this specific problem on the ICH10R (though I can), as 
I had been quite happy with the Sil up to this point.

Kind regards

Chris


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread roland
what exact type of sata controller do you use?


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Chris Baker
Apologies - I'm daft for not saying originally: OpenSolaris 2009.06 on x86

Cheers

Chris


Re: [zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Ross
What version of Solaris / OpenSolaris are you running there?  I remember zfs 
commands locking up being a big problem a while ago, but I thought they'd 
managed to solve issues like this.


[zfs-discuss] Recovering from ZFS command lock up after yanking a non-redundant drive?

2009-08-04 Thread Chris Baker
Hi

I'm running an application which uses hot-plug SATA drives like removable USB 
keys, but bigger and with SATA performance.

I'm using “cfgadm connect” then “configure” then “zpool import” to bring a 
drive on-line and export / unconfigure / disconnect before unplugging. All 
works well.
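
Spelled out, that sequence looks roughly like the following (the attachment point 
and pool name are examples; cfgadm -l lists the real ap_ids):

  # bring a drive online
  cfgadm -c connect sata1/0
  cfgadm -c configure sata1/0
  zpool import mypool

  # take it offline before unplugging
  zpool export mypool
  cfgadm -c unconfigure sata1/0
  cfgadm -c disconnect sata1/0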

What I can't guarantee is that one of my users won't one day just yank the 
drive without running the offline sequence. 

In testing for that case I am finding that the system runs fine until a command 
or subsystem tries to write to the drive, and then that command and that 
subsystem lock up hard.

The big problem then becomes that if I try a zfs or zpool command to attempt 
recovery, I lose zfs/zpool access to all pools in the system, not just the 
damaged one. Specifically - in testing:

Just one single drive with s0 mounted and then yanked:

- zpool status – I have seen either the pool shown online with no errors, or 
zpool itself lock up.
- I can cd into and ls the missing directory, but if I try to write anything my 
shell locks up hard
- I try a zfs unmount -f and that locks hard, plus I can no longer run zfs 
anything
- I try a zpool export -f and that locks, plus I can no longer run zpool 
anything
- Even a simple zfs list can lock up zfs commands

The rest of the system continues ticking over, but I have now lost access to 
basic admin commands and I can't find a recovery plan short of a reboot.

I've tried "zpool set failmode=continue" with no luck. I tried adding a ZIL, no 
luck.

I can't kill the locked processes.

I'm guessing zfs is waiting for the drive to come back online to safely store 
the writes in flight - reconnecting the drive makes some of the locked processes 
killable, but not all, and running zpool/zfs anything locks up again.

To be clear - the rest of the system working with different data pools keeps 
running fine.

I don't mind data loss on the yanked disk - that would be the user's own stupid 
fault, but I can't accept the risk of locking up zpool/zfs control of the rest 
of the system.

Trying the same tests with a UFS removable disk, the processes are 
interruptible, so I could live with zfs internal / ufs removable, but UFS seems 
to be significantly slower, plus I was hoping for the integrity benefits of zfs.

Any thoughts on how to stabilise the OS without a reboot?

Thanks

Chris