On Thu, Mar 04, 2010 at 04:20:10PM -0600, Gary Mills wrote:
> We have an IMAP e-mail server running on a Solaris 10 10/09 system.
> It uses six ZFS filesystems built on a single zpool with 14 daily
> snapshots. Every day at 11:56, a cron command destroys the oldest
> snapshots and creates new ones, both recursively. For about four
> minutes thereafter, the load average drops and I/O to the disk devices
> drops to almost zero. Then, the load average shoots up to about ten
> times normal and then declines to normal over about four minutes, as
> disk activity resumes. The statistics return to their normal state
> about ten minutes after the cron command runs.
I'm pleased to report that I found the culprit, and the culprit was me!
Well, ZFS peculiarities may be involved as well. Let me explain:

We had a single second-level filesystem and five third-level
filesystems, all with 14 daily snapshots. The snapshots were maintained
by a cron command that did a `zfs list -rH -t snapshot -o name' to get
the names of all of the snapshots, extracted the part after the `@',
and then sorted them uniquely to get a list of suffixes that were older
than 14 days. The suffixes were Julian dates, so they sorted correctly.
It then did a `zfs destroy -r' to delete them. The recursion was always
done from the second-level filesystem. The top-level filesystem was
empty and had no snapshots. Here's a portion of the script:

    zfs list -rH -t snapshot -o name $FS | \
        cut -d@ -f2 | \
        sort -ur | \
        sed 1,${NR}d | \
        xargs -I '{}' zfs destroy -r $FS@'{}'

    zfs snapshot -r $FS@$jd

Just over two weeks ago, I rearranged the filesystems so that the
second-level filesystem was newly created and initially had no
snapshots. It did have a snapshot taken every day thereafter, so that
eventually it also had 14 of them. It was during that interval that the
complaints started. My statistics clearly showed the performance stall
and subsequent recovery. Once that filesystem reached 14 snapshots, the
complaints stopped and the statistics showed only a modest increase in
CPU activity, but no stall.

During that interval, the script was doing a recursive destroy for a
snapshot that didn't exist at the specified level, but only existed in
the descendant filesystems. I'm assuming that this unusual situation
was the cause of the stall, although I don't have good evidence. By the
time the complaints reached my ears, and I was able to refine my
statistics gathering sufficiently, the problem had gone away.

--
-Gary Mills-        -Unix Group-        -Computer and Network Services-
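A minimal, untested sketch of one way to avoid that mismatch, assuming
the hypothesis above is right: skip the recursive destroy whenever the
suffix is missing at the level the recursion starts from, and fall back
to destroying the descendants' snapshots individually. It reuses $FS
and $NR from the script above; the fallback branch is illustrative
only, not something the production script did.

    zfs list -rH -t snapshot -o name $FS | \
        cut -d@ -f2 | \
        sort -ur | \
        sed 1,${NR}d | \
    while read suffix; do
        if zfs list -H -o name $FS@$suffix > /dev/null 2>&1; then
            # The suffix exists at the top of the recursion,
            # so the recursive destroy behaves as before.
            zfs destroy -r $FS@$suffix
        else
            # The suffix exists only on descendant filesystems
            # (the situation described above); destroy each of
            # those snapshots individually instead.
            zfs list -rH -t snapshot -o name $FS | \
                grep "@$suffix\$" | \
                xargs -n1 zfs destroy
        fi
    done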