I'm not sure you're actually seeing the same problem there Richard. It seems
that for you I/O is stopping on removal of the device, whereas for me I/O
continues for some considerable time. You are also able to obtain a result
from "zpool status" whereas that completely hangs for me.
To illustrate the difference, this is what I saw today in snv_94, with a pool
created from a single external USB hard drive.
1. As before I started a copy of a directory using Solaris' file manager.
About 1/3 of the way through I pulled the plug on the drive.
2. File manager continued to copy a further 30MB+ of files across. Checking
the properties of the copy shows it contains 71.1MB of data and 19,160 files,
despite me pulling the drive at around 8,000 files.
3. 8:24am I ran "zpool status":
# zpool status rc-usb pool: rc-usb state: ONLINEstatus: One or more devices
has experienced an error resulting in data corruption. Applications may be
affected.action: Restore the file in question if possible. Otherwise restore
the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub:
none requested
That is as far as it gets. It never gives me any further information. I left
it two hours, and it still had not displayed the status of the drive in the
pool. I also did a "zfs list", that also hangs now although I'm pretty sure
that if you run "zfs list" before "zpool status" it works fine.
As you can see from /var/adm/messages, I am getting nothing at all from FMA:
Jul 31 08:16:46 unknown usba: [ID 912658 kern.info] USB 2.0 device
(usbd49,7350) operating at hi speed (USB 2.x) on USB 2.0 root hub: [EMAIL
PROTECTED], scsa2usb0 at bus address 2Jul 31 08:16:46 unknown usba: [ID 349649
kern.info] Maxtor OneTouch 2HAP70DZ Jul 31 08:16:46 unknown
genunix: [ID 936769 kern.info] scsa2usb0 is /[EMAIL PROTECTED],0/pci15d9,[EMAIL
PROTECTED],1/[EMAIL PROTECTED] 31 08:16:46 unknown genunix: [ID 408114
kern.info] /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]
(scsa2usb0) onlineJul 31 08:16:46 unknown scsi: [ID 193665 kern.info] sd17 at
scsa2usb0: target 0 lun 0Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info]
sd17 is /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL
PROTECTED]/[EMAIL PROTECTED],0Jul 31 08:16:46 unknown genunix: [ID 340201
kern.warning] WARNING: Page83 data not standards compliant Maxtor OneTouch
0125Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info] /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17) onlineJul 31 08:16:49 unknown pcplusmp: [ID 444295 kern.info] pcplusmp:
ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3Jul 31
08:16:49 unknown scsi: [ID 193665 kern.info] sd14 at marvell88sx1: target 7 lun
0Jul 31 08:16:49 unknown genunix: [ID 936769 kern.info] sd14 is /[EMAIL
PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL
PROTECTED],0Jul 31 08:16:49 unknown genunix: [ID 408114 kern.info] /[EMAIL
PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL
PROTECTED],0 (sd14) onlineJul 31 08:21:35 unknown usba: [ID 691482
kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL
PROTECTED] (scsa2usb0): Disconnected device was busy, please reconnect.Jul 31
08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:21:38 unknown Command failed to complete...Device is goneJul
31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:24:26 unknown Command failed to complete...Device is goneJul
31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:24:26 unknown Command failed to complete...Device is goneJul
31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
(sd17):Jul 31 08:24:26 unknown drive offlineJul 31 08:27:43 unknown smbd[603]:
[ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 08:39:43
unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small
packetJul 31 08:44:50 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning]
accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul
31 08:44:58 unknown last message repeated 3 timesJul 31 08:45:06 unknown
/sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on
nge0 missing mandatory lease option, ignoredJul 31 08:45:06 unknown last
message repeated 1 timeJul 31 08:51:44 unknown smbd[603]: [ID 766186
daemon.error] NbtDatagramDecode[11]: too small packetJul 31 09:03:44 unknown
smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul
31 09:13:51 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning]
accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul
31 09:14:09 unknown last message repeated 5 timesJul 31 09:15:44 unknown
smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul
31 09:27:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]:
too small packetJul 31 09:27:55 unknown pcplusmp: [ID 444295 kern.info]
pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu
3
cfgadm reports that the port is empty but still configured:
# cfgadmAp_Id Type Receptacle Occupant
Conditionusb1/3 unknown empty configured
unusable
4. 9:32am I now tried writing more data to the pool, to see if I can trigger
the I/O error you are seeing. I tried making a second copy of the files on the
USB drive in the Solaris File manager, but that attempt simply hung the copy
dialog. I'm still seeing nothing else that appears relevant in
/var/adm/messages.
5. 10:08am While checking free space, I found that although df works, "df -kh"
hangs, apparently when it tries to query any zfs pool:
# df/ (/dev/dsk/c1t0d0s0 ): 2504586 blocks 656867
files/devices (/devices ): 0 blocks 0 files/dev
(/dev ): 0 blocks 0
files/system/contract (ctfs ): 0 blocks 2147483609
files/proc (proc ): 0 blocks 29902
files/etc/mnttab (mnttab ): 0 blocks 0
files/etc/svc/volatile (swap ): 9850928 blocks 1180374
files/system/object (objfs ): 0 blocks 2147483409
files/etc/dfs/sharetab (sharefs ): 0 blocks 2147483646
files/lib/libc.so.1 (/usr/lib/libc/libc_hwcap2.so.1): 2504586 blocks
656867 files/dev/fd (fd ): 0 blocks 0
files/tmp (swap ): 9850928 blocks 1180374
files/var/run (swap ): 9850928 blocks 1180374
files/export/home (/dev/dsk/c1t0d0s7 ):881398942 blocks 53621232
files/rc-pool (rc-pool ):4344346098 blocks 4344346098
files/rc-pool/admin (rc-pool/admin ):4344346098 blocks 4344346098
files/rc-pool/ross-home (rc-pool/ross-home ):4344346098 blocks 4344346098
files/rc-pool/vmware (rc-pool/vmware ):4344346098 blocks 4344346098
files/rc-usb (rc-usb ):153725153 blocks 153725153 files#
df -khFilesystem size used avail capacity Mounted
on/dev/dsk/c1t0d0s0 7.2G 6.0G 1.1G 85% //devices
0K 0K 0K 0% /devices/dev 0K 0K 0K
0% /devctfs 0K 0K 0K 0%
/system/contractproc 0K 0K 0K 0% /procmnttab
0K 0K 0K 0% /etc/mnttabswap
4.7G 1.1M 4.7G 1% /etc/svc/volatileobjfs 0K
0K 0K 0% /system/objectsharefs 0K 0K 0K
0% /etc/dfs/sharetab/usr/lib/libc/libc_hwcap2.so.1
7.2G 6.0G 1.1G 85% /lib/libc.so.1fd 0K 0K
0K 0% /dev/fdswap 4.7G 48K 4.7G 1%
/tmpswap 4.7G 76K 4.7G 1%
/var/run/dev/dsk/c1t0d0s7 425G 4.8G 416G 2% /export/home
6. 10:35am It's now been two hours, neither "zpool status" nor "zfs list" have
ever finished. The file copy attempt has also been hung for over an hour
(although that's not unexpected with 'wait' as the failmode).
Richard, you say ZFS is not silently failing, well for me it appears that it
is. I can't see any warnings from ZFS, I can't get any status information. I
see no way that I could find out what files are going to be lost on this server.
Yes, I'm now aware that the pool has hung since file operations are hanging,
however had that been my first indication of a problem I believe I am now left
in a position where I cannot find out either the cause, nor the files affected.
I don't believe I have any way to find out which operations had completed
without error, but are not currently committed to disk. I certainly don't get
the status message you do saying permanent errors have been found in files.
I plugged the USB drive back in now, Solaris detected it ok, but ZFS is still
hung. The rest of /var/adm/messages is:
Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error]
NbtDatagramDecode[11]: too small packetJul 31 09:45:22 unknown
/sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on
nge0 missing mandatory lease option, ignoredJul 31 09:45:38 unknown last
message repeated 5 timesJul 31 09:51:44 unknown smbd[603]: [ID 766186
daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:03:44 unknown
last message repeated 2 timesJul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID
732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory
lease option, ignoredJul 31 10:14:45 unknown last message repeated 5 timesJul
31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]:
too small packetJul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error]
NbtDatagramDecode[11]: too small packet
Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING: /[EMAIL
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED] (scsa2usb0):
Reinserted device is accessible again.Jul 31 10:39:45 unknown smbd[603]: [ID
766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:45:53
unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK
packet on nge0 missing mandatory lease option, ignoredJul 31 10:46:09 unknown
last message repeated 5 timesJul 31 10:51:45 unknown smbd[603]: [ID 766186
daemon.error] NbtDatagramDecode[11]: too small packet
7. 10:55am Gave up on ZFS ever recovering. A shutdown attempt hung as
expected. I hard-reset the computer.
Ross
> Date: Wed, 30 Jul 2008 11:17:08 -0700> From: [EMAIL PROTECTED]> Subject: Re:
> [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed> To: [EMAIL
> PROTECTED]> CC: zfs-discuss@opensolaris.org> > I was able to reproduce this
> in b93, but might have a different> interpretation of the conditions. More
> below...> > Ross Smith wrote:> > A little more information today. I had a
> feeling that ZFS would > > continue quite some time before giving an error,
> and today I've shown > > that you can carry on working with the filesystem
> for at least half an > > hour with the disk removed.> > > > I suspect on a
> system with little load you could carry on working for > > several hours
> without any indication that there is a problem. It > > looks to me like ZFS
> is caching reads & writes, and that provided > > requests can be fulfilled
> from the cache, it doesn't care whether the > > disk is present or not.> > In
> my USB-flash-disk-sudden-removal-while-writing-big-file-test,> 1. I/O to the
> missing device stopped (as I expected)> 2. FMA kicked in, as expected.> 3.
> /var/adm/messages recorded "Command failed to complete... device gone."> 4.
> After exactly 9 minutes, 17,951 e-reports had been processed and the>
> diagnosis was complete. FMA logged the following to /var/adm/messages> > Jul
> 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: > /[EMAIL
> PROTECTED],0/pci1458,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL
> PROTECTED],0 (sd1):> Jul 30 10:33:44 grond Command failed to
> complete...Device is gone> Jul 30 10:42:31 grond fmd: [ID 441519
> daemon.error] SUNW-MSG-ID: > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY:
> Major> Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008> Jul 30
> 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond> Jul 30 10:42:31 grond
> SOURCE: zfs-diagnosis, REV: 1.0> Jul 30 10:42:31 grond EVENT-ID:
> d99769aa-28e8-cf16-d181-945592130525> Jul 30 10:42:31 grond DESC: The number
> of I/O errors associated with a > ZFS device exceeded> Jul 30 10:42:31 grond
> acceptable levels. Refer to > http://sun.com/msg/ZFS-8000-FD for more
> information.> Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been
> offlined and > marked as faulted. An attempt> Jul 30 10:42:31 grond will be
> made to activate a hot spare if > available.> Jul 30 10:42:31 grond IMPACT:
> Fault tolerance of the pool may be > compromised.> Jul 30 10:42:31 grond
> REC-ACTION: Run 'zpool status -x' and replace > the bad device.> > The above
> URL shows what you expect, but more (and better) info> is available from
> zpool status -xv> > pool: rmtestpool> state: UNAVAIL> status: One or more
> devices are faultd in response to IO failures.> action: Make sure the
> affected devices are connected, then run 'zpool > clear'.> see:
> http://www.sun.com/msg/ZFS-8000-HC> scrub: none requested> config:> > NAME
> STATE READ WRITE CKSUM> rmtestpool UNAVAIL 0 15.7K 0 insufficient replicas>
> c2t0d0p0 FAULTED 0 15.7K 0 experienced I/O failures> > errors: Permanent
> errors have been detected in the following files:> > /rmtestpool/random.data>
> > > If you surf to http://www.sun.com/msg/ZFS-8000-HC you'll> see words to
> the effect that,> The pool has experienced I/O failures. Since the ZFS pool
> property> 'failmode' is set to 'wait', all I/Os (reads and writes) are>
> blocked. See the zpool(1M) manpage for more information on the> 'failmode'
> property. Manual intervention is required for I/Os to> be serviced.> > > > >
> I would guess that ZFS is attempting to write to the disk in the > >
> background, and that this is silently failing.> > It is clearly not silently
> failing.> > However, the default failmode property is set to "wait" which
> will patiently> wait forever. If you would rather have the I/O fail, then you
> should change> the failmode to "continue" I would not normally recommend a
> failmode of> "panic"> > Now to figure out how to recover gracefully... zpool
> clear isn't happy...> > [sidebar]> while performing this experiment, I
> noticed that fmd was checkpointing> the diagnosis engine to disk in the
> /var/fm/fmd/ckpt/zfs-diagnosis > directory.> If this had been the boot disk,
> with failmode=wait, I'm not convinced> that we'd get a complete diagnosis...
> I'll explore that later.> [/sidebar]> > -- richard>
_________________________________________________________________
The John Lewis Clearance - save up to 50% with FREE delivery
http://clk.atdmt.com/UKM/go/101719806/direct/01/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss