On Thu, Mar 04, 2010 at 04:20:10PM -0600, Gary Mills wrote:
> We have an IMAP e-mail server running on a Solaris 10 10/09 system.
> It uses six ZFS filesystems built on a single zpool with 14 daily
> snapshots.  Every day at 11:56, a cron command destroys the oldest
> snapshots and creates new ones, both recursively.  For about four
> minutes thereafter, the load average drops and I/O to the disk devices
> drops to almost zero.  Then, the load average shoots up to about ten
> times normal and then declines to normal over about four minutes, as
> disk activity resumes.  The statistics return to their normal state
> about ten minutes after the cron command runs.

I'm pleased to report that I found the culprit and the culprit was me!
Well, ZFS peculiarities may be involved as well.  Let me explain:

We had a single second-level filesystem and five third-level
filesystems, all with 14 daily snapshots.  The snapshots were
maintained by a cron command that ran `zfs list -rH -t snapshot -o
name' to get the names of all of the snapshots, extracted the part
after the `@', sorted the suffixes uniquely in reverse order, and
discarded the newest 14 to leave only the suffixes older than 14
days.  The suffixes were Julian dates, so they sorted correctly.  It
then ran `zfs destroy -r' on each remaining suffix.  The recursion
was always done from the second-level filesystem.  The top-level
filesystem was empty and had no snapshots.  Here's a portion of the
script:

    # List all snapshots under $FS, keep only the suffix after the `@',
    # sort the Julian-date suffixes newest-first, drop the newest $NR,
    # and recursively destroy each expired suffix that remains.
    zfs list -rH -t snapshot -o name $FS | \
            cut -d@ -f2 | \
            sort -ur | \
            sed 1,${NR}d | \
            xargs -I '{}' zfs destroy -r $FS@'{}'

    # Take today's recursive snapshot, suffixed with the Julian date.
    zfs snapshot -r $FS@$jd
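
In hindsight, a guard that skips any suffix missing at the recursion
root would have avoided the odd state entirely.  Here's a minimal,
untested sketch of what I have in mind, using the same $FS and $NR as
above (listing the snapshot by name is just one way to test for its
existence):

    zfs list -rH -t snapshot -o name $FS | \
            cut -d@ -f2 | \
            sort -ur | \
            sed 1,${NR}d | \
    while read suffix; do
            # Destroy recursively only when the snapshot exists at the
            # recursion root itself; a suffix that survives only in the
            # descendant filesystems is skipped.
            if zfs list -H -o name $FS@$suffix >/dev/null 2>&1; then
                    zfs destroy -r $FS@$suffix
            fi
    done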

Just over two weeks ago, I rearranged the filesystems, so that the
second-level filesystem was newly created and initially had no
snapshots.  A snapshot was taken every day thereafter, so eventually
it too had 14 of them.  It was during that interval that the
complaints started.  My statistics clearly showed the performance
stall and the subsequent recovery.  Once that filesystem reached 14
snapshots, the complaints stopped, and the statistics showed only a
modest increase in CPU activity with no stall.

During this interval, the script was doing a recursive destroy for a
snapshot that didn't exist at the specified level but existed only in
the descendant filesystems.  I assume that this unusual situation was
the cause of the stall, although I don't have good evidence.  By the
time the complaints reached my ears and I was able to refine my
statistics gathering sufficiently, the problem had gone away.
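
For anyone curious, the odd state itself is easy to reconstruct on a
scratch pool.  A minimal sketch (the pool and filesystem names here
are made up, and I haven't timed this on a test system):

    # Parent filesystem with one child; snapshot only the child.
    zfs create tank/parent
    zfs create tank/parent/child
    zfs snapshot tank/parent/child@060

    # Recursive destroy from the parent: tank/parent@060 itself does
    # not exist, but the suffix does exist in the descendant.
    zfs destroy -r tank/parent@060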

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-