Dear list,

Solaris 10 U3 on SPARC.

I had a 197GB raidz storage pool. Within that pool, I had allocated a 191GB zvol (filesystem A), and a 6.75GB zvol (filesystem B). These used all but a couple hundred K of the zpool. Both zvols contained UFS filesystems with logging enabled. The (A) filesystem was about 79% full. (B) was also nearly full, but unmounted and not being used.
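For reference, the setup looked something like the following. (This is a reconstruction, not the exact commands; the pool name, disk names, and mount point are assumptions on my part.)

```shell
# Hypothetical reconstruction of the original config -- names are assumed.
zpool create tank raidz c1t1d0 c1t2d0 c1t3d0   # ~197GB usable raidz pool
zfs create -V 191g tank/volA                   # zvol backing filesystem (A)
zfs create -V 6.75g tank/volB                  # zvol backing filesystem (B)
newfs /dev/zvol/rdsk/tank/volA                 # UFS on each zvol
newfs /dev/zvol/rdsk/tank/volB
# UFS logging is on by default in Solaris 10, but to be explicit:
mount -o logging /dev/zvol/dsk/tank/volA /export/home/engr
```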

This configuration worked happily for a bit over two months. Then the other day, a user decided to copy (cp) about 11GB worth of video files within (A). This caused UFS to choke as such:

Mar 9 17:34:43 maxwell ufs: [ID 702911 kern.warning] WARNING: Error writing master during ufs log roll
Mar 9 17:34:43 maxwell ufs: [ID 127457 kern.warning] WARNING: ufs log for /export/home/engr changed state to Error
Mar 9 17:34:43 maxwell ufs: [ID 616219 kern.warning] WARNING: Please umount(1M) /export/home/engr and run fsck(1M)

I do as the message says: unmount and attempt to fsck. I am then bombarded with thousands of errors, but fsck cannot fix them due to 'no space left on device'. That's right: the filesystem with about 30GB free didn't have enough free space to fsck. Strange.

After messing with the machine all weekend (rebooting, calling coworkers who are also sysadmins, calling Sun, scratching my head, etc.), the solution ended up being to _delete the (B) zvol_ (which contained only junk data). Once that was done, fsck ran all the way through without problems (besides wiping all my ACLs), and things were happy again.

So I surmised that ZFS ran out of space to do its thing, and for whatever reason that 'out of space' condition got pushed down into the zvol as well, causing fsck to choke. I _have_ been able to reproduce the situation on a test machine, but not reliably. It basically consists of setting up two zvols that take up almost all of the pool space, newfsing them, filling one up to about 90% full, then looping through copies of half of the remaining space until it dies.

(So for a 36GB pool, create a 34GB zvol and a 2.xxGB zvol. newfs them. Mount the larger one. Create a 30GB junk file. Create a directory of, say, 5 files worth about 2GB total. Then do 'while true; do cp -r dira dirb; done' until it fails. Sometimes it does, sometimes not.)
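Spelled out as a script, the reproduction attempt looks roughly like this. (Again a sketch: the pool name, disks, and mount point are placeholders, and it needs root on a scratch machine.)

```shell
# Hypothetical reproduction sketch -- pool/disk names and sizes are assumed.
zpool create test raidz c2t1d0 c2t2d0 c2t3d0   # ~36GB usable
zfs create -V 34g test/big                     # the two zvols together
zfs create -V 2g  test/small                   # nearly fill the pool
newfs /dev/zvol/rdsk/test/big
newfs /dev/zvol/rdsk/test/small
mount /dev/zvol/dsk/test/big /mnt
mkfile 30g /mnt/junkfile                       # get UFS to ~90% full
mkdir /mnt/dira
# ...populate /mnt/dira with ~2GB of files, then loop copies until
# UFS throws the 'Error writing master during ufs log roll' warning:
while true; do
    rm -rf /mnt/dirb
    cp -r /mnt/dira /mnt/dirb
done
```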

Why does this happen? Is it a bug? I know there is a recommendation of 20% free space for good performance, but that thought never occurred to me when this machine was set up (zvols only, no zfs proper).

I think it is a bug simply because it _allowed_ me to create a configuration that didn't leave enough room for overhead. There isn't a whole lot of info surrounding zvols. Does the 80%-full rule still apply to the underlying ZFS if only zvols are used? That would be really unfortunate. I think most people wanting to use a zvol would want to put 100% of a pool toward the zvol.

-Brian

--
---------------------------------------------------
Brian H. Nelson         Youngstown State University
System Administrator   Media and Academic Computing
             bnelson[at]cis.ysu.edu
---------------------------------------------------

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
