Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

Ross Smith Thu, 31 Jul 2008 05:30:15 -0700

I'm not sure you're actually seeing the same problem there Richard.  It seems 
that for you I/O is stopping on removal of the device, whereas for me I/O 
continues for some considerable time.  You are also able to obtain a result 
from "zpool status" whereas that completely hangs for me.
 
To illustrate the difference, this is what I saw today in snv_94, with a pool 
created from a single external USB hard drive.
 
1. As before I started a copy of a directory using Solaris' file manager.  
About 1/3 of the way through I pulled the plug on the drive.
2. File manager continued to copy a further 30MB+ of files across.  Checking 
the properties of the copy shows it contains 71.1MB of data and 19,160 files, 
despite me pulling the drive at around 8,000 files.
 
3.  8:24am  I ran "zpool status":
# zpool status rc-usb  pool: rc-usb state: ONLINEstatus: One or more devices 
has experienced an error resulting in data corruption.  Applications may be 
affected.action: Restore the file in question if possible.  Otherwise restore 
the entire pool from backup.   see: http://www.sun.com/msg/ZFS-8000-8A scrub: 
none requested
 
That is as far as it gets.  It never gives me any further information.  I left 
it two hours, and it still had not displayed the status of the drive in the 
pool.  I also did a "zfs list", that also hangs now although I'm pretty sure 
that if you run "zfs list" before "zpool status" it works fine.
 
As you can see from /var/adm/messages, I am getting nothing at all from FMA:
Jul 31 08:16:46 unknown usba: [ID 912658 kern.info] USB 2.0 device 
(usbd49,7350) operating at hi speed (USB 2.x) on USB 2.0 root hub: [EMAIL 
PROTECTED], scsa2usb0 at bus address 2Jul 31 08:16:46 unknown usba: [ID 349649 
kern.info]  Maxtor   OneTouch         2HAP70DZ    Jul 31 08:16:46 unknown 
genunix: [ID 936769 kern.info] scsa2usb0 is /[EMAIL PROTECTED],0/pci15d9,[EMAIL 
PROTECTED],1/[EMAIL PROTECTED] 31 08:16:46 unknown genunix: [ID 408114 
kern.info] /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED] 
(scsa2usb0) onlineJul 31 08:16:46 unknown scsi: [ID 193665 kern.info] sd17 at 
scsa2usb0: target 0 lun 0Jul 31 08:16:46 unknown genunix: [ID 936769 kern.info] 
sd17 is /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL 
PROTECTED]/[EMAIL PROTECTED],0Jul 31 08:16:46 unknown genunix: [ID 340201 
kern.warning] WARNING: Page83 data not standards compliant Maxtor   OneTouch    
     0125Jul 31 08:16:46 unknown genunix: [ID 408114 kern.info] /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17) onlineJul 31 08:16:49 unknown pcplusmp: [ID 444295 kern.info] pcplusmp: 
ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 3Jul 31 
08:16:49 unknown scsi: [ID 193665 kern.info] sd14 at marvell88sx1: target 7 lun 
0Jul 31 08:16:49 unknown genunix: [ID 936769 kern.info] sd14 is /[EMAIL 
PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL 
PROTECTED],0Jul 31 08:16:49 unknown genunix: [ID 408114 kern.info] /[EMAIL 
PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL 
PROTECTED],0 (sd14) onlineJul 31 08:21:35 unknown usba: [ID 691482 
kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL 
PROTECTED] (scsa2usb0): Disconnected device was busy, please reconnect.Jul 31 
08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:21:38 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:21:38 unknown  Command failed to complete...Device is goneJul 
31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:24:26 unknown  Command failed to complete...Device is goneJul 
31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:24:26 unknown  Command failed to complete...Device is goneJul 
31 08:24:26 unknown scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 
(sd17):Jul 31 08:24:26 unknown  drive offlineJul 31 08:27:43 unknown smbd[603]: 
[ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 08:39:43 
unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small 
packetJul 31 08:44:50 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] 
accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul 
31 08:44:58 unknown last message repeated 3 timesJul 31 08:45:06 unknown 
/sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on 
nge0 missing mandatory lease option, ignoredJul 31 08:45:06 unknown last 
message repeated 1 timeJul 31 08:51:44 unknown smbd[603]: [ID 766186 
daemon.error] NbtDatagramDecode[11]: too small packetJul 31 09:03:44 unknown 
smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 
31 09:13:51 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] 
accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignoredJul 
31 09:14:09 unknown last message repeated 5 timesJul 31 09:15:44 unknown 
smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 
31 09:27:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: 
too small packetJul 31 09:27:55 unknown pcplusmp: [ID 444295 kern.info] 
pcplusmp: ide (ata) instance #1 vector 0xf ioapic 0x4 intin 0xf is bound to cpu 
3
 
cfgadm reports that the port is empty but still configured:
# cfgadmAp_Id                          Type         Receptacle   Occupant     
Conditionusb1/3                         unknown      empty        configured   
unusable
4. 9:32am I now tried writing more data to the pool, to see if I can trigger 
the I/O error you are seeing.  I tried making a second copy of the files on the 
USB drive in the Solaris File manager, but that attempt simply hung the copy 
dialog.  I'm still seeing nothing else that appears relevant in 
/var/adm/messages.
 
5. 10:08am While checking free space, I found that although df works, "df -kh" 
hangs, apparently when it tries to query any zfs pool:
# df/                  (/dev/dsk/c1t0d0s0 ): 2504586 blocks   656867 
files/devices           (/devices          ):       0 blocks        0 files/dev 
              (/dev              ):       0 blocks        0 
files/system/contract   (ctfs              ):       0 blocks 2147483609 
files/proc              (proc              ):       0 blocks    29902 
files/etc/mnttab        (mnttab            ):       0 blocks        0 
files/etc/svc/volatile  (swap              ): 9850928 blocks  1180374 
files/system/object     (objfs             ):       0 blocks 2147483409 
files/etc/dfs/sharetab  (sharefs           ):       0 blocks 2147483646 
files/lib/libc.so.1     (/usr/lib/libc/libc_hwcap2.so.1): 2504586 blocks   
656867 files/dev/fd            (fd                ):       0 blocks        0 
files/tmp               (swap              ): 9850928 blocks  1180374 
files/var/run           (swap              ): 9850928 blocks  1180374 
files/export/home       (/dev/dsk/c1t0d0s7 ):881398942 blocks 53621232 
files/rc-pool           (rc-pool           ):4344346098 blocks 4344346098 
files/rc-pool/admin     (rc-pool/admin     ):4344346098 blocks 4344346098 
files/rc-pool/ross-home (rc-pool/ross-home ):4344346098 blocks 4344346098 
files/rc-pool/vmware    (rc-pool/vmware    ):4344346098 blocks 4344346098 
files/rc-usb            (rc-usb            ):153725153 blocks 153725153 files# 
df -khFilesystem             size   used  avail capacity  Mounted 
on/dev/dsk/c1t0d0s0      7.2G   6.0G   1.1G    85%    //devices                 
0K     0K     0K     0%    /devices/dev                     0K     0K     0K    
 0%    /devctfs                     0K     0K     0K     0%    
/system/contractproc                     0K     0K     0K     0%    /procmnttab 
                  0K     0K     0K     0%    /etc/mnttabswap                   
4.7G   1.1M   4.7G     1%    /etc/svc/volatileobjfs                    0K     
0K     0K     0%    /system/objectsharefs                  0K     0K     0K     
0%    /etc/dfs/sharetab/usr/lib/libc/libc_hwcap2.so.1                       
7.2G   6.0G   1.1G    85%    /lib/libc.so.1fd                       0K     0K   
  0K     0%    /dev/fdswap                   4.7G    48K   4.7G     1%    
/tmpswap                   4.7G    76K   4.7G     1%    
/var/run/dev/dsk/c1t0d0s7      425G   4.8G   416G     2%    /export/home
 
6. 10:35am  It's now been two hours, neither "zpool status" nor "zfs list" have 
ever finished.  The file copy attempt has also been hung for over an hour 
(although that's not unexpected with 'wait' as the failmode).
 
Richard, you say ZFS is not silently failing, well for me it appears that it 
is.  I can't see any warnings from ZFS, I can't get any status information.  I 
see no way that I could find out what files are going to be lost on this server.
 
Yes, I'm now aware that the pool has hung since file operations are hanging, 
however had that been my first indication of a problem I believe I am now left 
in a position where I cannot find out either the cause, nor the files affected. 
 I don't believe I have any way to find out which operations had completed 
without error, but are not currently committed to disk.  I certainly don't get 
the status message you do saying permanent errors have been found in files.
 
I plugged the USB drive back in now, Solaris detected it ok, but ZFS is still 
hung.  The rest of /var/adm/messages is:
Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error] 
NbtDatagramDecode[11]: too small packetJul 31 09:45:22 unknown 
/sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on 
nge0 missing mandatory lease option, ignoredJul 31 09:45:38 unknown last 
message repeated 5 timesJul 31 09:51:44 unknown smbd[603]: [ID 766186 
daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:03:44 unknown 
last message repeated 2 timesJul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID 
732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory 
lease option, ignoredJul 31 10:14:45 unknown last message repeated 5 timesJul 
31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: 
too small packetJul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error] 
NbtDatagramDecode[11]: too small packet
Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING: /[EMAIL 
PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED] (scsa2usb0): 
Reinserted device is accessible again.Jul 31 10:39:45 unknown smbd[603]: [ID 
766186 daemon.error] NbtDatagramDecode[11]: too small packetJul 31 10:45:53 
unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK 
packet on nge0 missing mandatory lease option, ignoredJul 31 10:46:09 unknown 
last message repeated 5 timesJul 31 10:51:45 unknown smbd[603]: [ID 766186 
daemon.error] NbtDatagramDecode[11]: too small packet
 
7. 10:55am  Gave up on ZFS ever recovering.  A shutdown attempt hung as 
expected.  I hard-reset the computer.
 
Ross
 
 
> Date: Wed, 30 Jul 2008 11:17:08 -0700> From: [EMAIL PROTECTED]> Subject: Re: 
> [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed> To: [EMAIL 
> PROTECTED]> CC: zfs-discuss@opensolaris.org> > I was able to reproduce this 
> in b93, but might have a different> interpretation of the conditions. More 
> below...> > Ross Smith wrote:> > A little more information today. I had a 
> feeling that ZFS would > > continue quite some time before giving an error, 
> and today I've shown > > that you can carry on working with the filesystem 
> for at least half an > > hour with the disk removed.> > > > I suspect on a 
> system with little load you could carry on working for > > several hours 
> without any indication that there is a problem. It > > looks to me like ZFS 
> is caching reads & writes, and that provided > > requests can be fulfilled 
> from the cache, it doesn't care whether the > > disk is present or not.> > In 
> my USB-flash-disk-sudden-removal-while-writing-big-file-test,> 1. I/O to the 
> missing device stopped (as I expected)> 2. FMA kicked in, as expected.> 3. 
> /var/adm/messages recorded "Command failed to complete... device gone."> 4. 
> After exactly 9 minutes, 17,951 e-reports had been processed and the> 
> diagnosis was complete. FMA logged the following to /var/adm/messages> > Jul 
> 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: > /[EMAIL 
> PROTECTED],0/pci1458,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL 
> PROTECTED],0 (sd1):> Jul 30 10:33:44 grond Command failed to 
> complete...Device is gone> Jul 30 10:42:31 grond fmd: [ID 441519 
> daemon.error] SUNW-MSG-ID: > ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: 
> Major> Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008> Jul 30 
> 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond> Jul 30 10:42:31 grond 
> SOURCE: zfs-diagnosis, REV: 1.0> Jul 30 10:42:31 grond EVENT-ID: 
> d99769aa-28e8-cf16-d181-945592130525> Jul 30 10:42:31 grond DESC: The number 
> of I/O errors associated with a > ZFS device exceeded> Jul 30 10:42:31 grond 
> acceptable levels. Refer to > http://sun.com/msg/ZFS-8000-FD for more 
> information.> Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been 
> offlined and > marked as faulted. An attempt> Jul 30 10:42:31 grond will be 
> made to activate a hot spare if > available.> Jul 30 10:42:31 grond IMPACT: 
> Fault tolerance of the pool may be > compromised.> Jul 30 10:42:31 grond 
> REC-ACTION: Run 'zpool status -x' and replace > the bad device.> > The above 
> URL shows what you expect, but more (and better) info> is available from 
> zpool status -xv> > pool: rmtestpool> state: UNAVAIL> status: One or more 
> devices are faultd in response to IO failures.> action: Make sure the 
> affected devices are connected, then run 'zpool > clear'.> see: 
> http://www.sun.com/msg/ZFS-8000-HC> scrub: none requested> config:> > NAME 
> STATE READ WRITE CKSUM> rmtestpool UNAVAIL 0 15.7K 0 insufficient replicas> 
> c2t0d0p0 FAULTED 0 15.7K 0 experienced I/O failures> > errors: Permanent 
> errors have been detected in the following files:> > /rmtestpool/random.data> 
> > > If you surf to http://www.sun.com/msg/ZFS-8000-HC you'll> see words to 
> the effect that,> The pool has experienced I/O failures. Since the ZFS pool 
> property> 'failmode' is set to 'wait', all I/Os (reads and writes) are> 
> blocked. See the zpool(1M) manpage for more information on the> 'failmode' 
> property. Manual intervention is required for I/Os to> be serviced.> > > > > 
> I would guess that ZFS is attempting to write to the disk in the > > 
> background, and that this is silently failing.> > It is clearly not silently 
> failing.> > However, the default failmode property is set to "wait" which 
> will patiently> wait forever. If you would rather have the I/O fail, then you 
> should change> the failmode to "continue" I would not normally recommend a 
> failmode of> "panic"> > Now to figure out how to recover gracefully... zpool 
> clear isn't happy...> > [sidebar]> while performing this experiment, I 
> noticed that fmd was checkpointing> the diagnosis engine to disk in the 
> /var/fm/fmd/ckpt/zfs-diagnosis > directory.> If this had been the boot disk, 
> with failmode=wait, I'm not convinced> that we'd get a complete diagnosis... 
> I'll explore that later.> [/sidebar]> > -- richard> 
_________________________________________________________________
The John Lewis Clearance - save up to 50% with FREE delivery
http://clk.atdmt.com/UKM/go/101719806/direct/01/

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

Reply via email to