First, a disclaimer: I do not know how zfs dataset destruction is actually implemented, but I can guess at least a couple of legitimate reasons for a slow destruction.
2012-06-25 21:55, Philip Brown wrote:
> I ran into something odd today:
>
> zfs destroy -r random/filesystem
>
> is mindbogglingly slow. But it seems to me it shouldn't be.
> It's slow because the filesystem has two snapshots on it. Presumably, it's busy "rolling back" the snapshots. But I've already declared by my command line that I DON'T CARE about the contents of the filesystem!
> Why doesn't zfs simply do:
> 1. unmount filesystem, if possible (it was possible)
> (1.5 possibly note "intent to delete" somewhere in the pool records)
> 2. zero out/free the in-kernel-memory in one go
> 3. update the pool, "hey I deleted the filesystem, all these blocks are now clear"
Basically, your ideal fast destruction would be pruning the dataset tree (the node under which the blocks of the snapshots and of the live dataset are rooted and accounted for). In that case "everything not allocated is free", or at least it could be made to work that way. The slow part is likely a walk of the block-pointer tree (through all the random on-disk locations) and some sort of inspection of each block in order to release it.

So, what can be done at this step (speculation follows)?

* Blocks might have been written as deduped; in this case we have to decrease their reference counters in the DDT. But first we have to walk the dataset's branch of the block-pointer tree to see which block pointers have the "dedup" bit-flag set.

* A simpler case is the presence of cloned datasets based on snapshots of this dataset. Unless you're destroying the whole family of sibling datasets, the clones have to be promoted, and the blocks they reference have to be reassigned to them (including reassignment of the snapshot "ownership").

* Even for your "trivial" step (2), freeing the memory, we need to know which ARC-cached blocks to free. How can we know that without walking the BP tree first?

I listed just a few reasons off the top of my head why a walk of the whole BP-tree branch is required to free the blocks referenced by this tree. Any further operations, such as modifications to the DDT, add to the delay.

In particular, this may be why recent versions of zfs/zpool moved towards asynchronous destruction and a "deferred free" capability: the destroyed branch can be quickly marked as deleted, and the kernel then does the processing in the background. In my problematic cases (and not only mine), this processing could require prodigious amounts of RAM, especially with dedup processing in play, and cause the computer to freeze.
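To illustrate why the walk cannot be skipped, here is a toy Python model of a deferred destroy. It is not ZFS code: every class, field and method name (BlockPtr, ToyPool, destroy, run_background, shared_txg, budget) is hypothetical. Queueing the branch is instant, but actually releasing space means visiting every block pointer, decrementing DDT counters for deduped blocks, and dropping cached copies from the "ARC". For the clone case it assumes, as in a copy-on-write tree, that an indirect block is never older than its children, so a subtree born at or before a promoted clone's origin snapshot can be skipped without reading it.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPtr:
    """Toy block pointer (hypothetical; not the real on-disk layout)."""
    addr: int                 # stand-in for the on-disk address
    birth_txg: int            # transaction group in which this block was written
    dedup: bool = False       # the "dedup" bit-flag
    children: list = field(default_factory=list)  # non-empty only for indirect blocks


class ToyPool:
    """Hypothetical pool model: allocation set, DDT refcounts, ARC, deferred-free queue."""

    def __init__(self):
        self.allocated = set()  # addresses currently in use
        self.ddt = {}           # addr -> reference count, deduped blocks only
        self.arc = {}           # addr -> cached block contents
        self.pending = []       # stack of (block pointer, shared_txg) still to visit

    def alloc(self, bp):
        """Record bp and its whole subtree as allocated (helper for building examples)."""
        self.allocated.add(bp.addr)
        if bp.dedup:
            self.ddt[bp.addr] = self.ddt.get(bp.addr, 0) + 1
        for child in bp.children:
            self.alloc(child)

    def destroy(self, root, shared_txg=0):
        """The fast part: just queue the branch root; nothing is freed yet.

        Blocks born at or before shared_txg are assumed to belong to a
        promoted clone and must survive.
        """
        self.pending.append((root, shared_txg))

    def run_background(self, budget):
        """Visit at most `budget` block pointers; return how many were visited."""
        visited = 0
        while self.pending and visited < budget:
            bp, shared_txg = self.pending.pop()
            visited += 1
            if bp.birth_txg <= shared_txg:
                # A parent is never older than its children, so the whole
                # subtree is still shared and is skipped without reading it.
                continue
            for child in bp.children:  # "read" the indirect block to find its children
                self.pending.append((child, shared_txg))
            self._free(bp)
        return visited

    def _free(self, bp):
        if bp.dedup:
            self.ddt[bp.addr] -= 1
            if self.ddt[bp.addr] > 0:
                return  # another dataset still references this deduped block
            del self.ddt[bp.addr]
        self.arc.pop(bp.addr, None)  # drop any cached copy of the block
        self.allocated.discard(bp.addr)


# Build a small tree: birth txgs grow towards the root, as with copy-on-write.
pool = ToyPool()
shared = BlockPtr(addr=10, birth_txg=5, dedup=True)
old_ind = BlockPtr(addr=20, birth_txg=5, children=[shared, BlockPtr(addr=11, birth_txg=5)])
root = BlockPtr(addr=30, birth_txg=9, children=[old_ind, BlockPtr(addr=12, birth_txg=9)])
pool.alloc(root)
pool.alloc(BlockPtr(addr=40, birth_txg=7, children=[shared]))  # another dataset dedups to block 10
pool.arc[12] = b"cached data"

pool.destroy(root)                       # returns immediately
print(pool.run_background(budget=2))     # 2
print(pool.run_background(budget=100))   # 3
print(sorted(pool.allocated), pool.ddt)  # [10, 40] {10: 1}
```

The `budget` parameter is an assumed stand-in for the bounded amount of freeing a background thread might do per transaction group: a small budget keeps the pool responsive while the destroy runs, at the cost of finishing later. Note that even in this toy, the deduped block 10 survives with a decremented counter, which is only discoverable by walking down to it.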
However, sometime after ZFSv22, the deferred freeing in such cases just takes several hard resets to complete, instead of taking truly forever with no progress ;)

Basically, the steps you outlined should already be there in some form, at least in ZFSv28. So, the practical questions are:

* your version of zpool/zfs, and your OS version?
* is deduplication enabled on this dataset? (Also, does your OS version support dedup at all? Without it there may be fewer code paths to follow and check, so destruction may be faster just due to that. For example, Solaris 10 nominally has ZFSv29(?), but not all features are implemented as in Solaris 11 or OpenSolaris with similar ZFS versions.)
* did you use clones?
* fragmentation (or how busy is the pool, in terms of IOPS, while processing the deletion)?

HTH,
//Jim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss