Gathered knowledge,
    I have a moderately large version 22 zpool: `zpool list` reports 75
TB, all raidz2, made up of 22 vdevs of 5 x 750 GB drives each.

    We take snapshots hourly and keep them for 5 weeks for operational
backups (disaster recovery backups are via `zfs send | zfs recv` to
another physical system).

    `zpool list` reports 44 TB allocated.

    We had a drive fail; the hot spare stepped in, and as soon as we
had a replacement from Oracle we `zpool replace`d the failed drive.
(The layout ensures that each vdev has one drive in each of five
J4400s, so we can lose up to 2 of the J4400s and not lose data.)
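
    For reference, the replacement step looked roughly like this. The
pool and device names below are made up for illustration; the real
names come from `zpool status` on the affected system:

```shell
# Check which drive failed and which hot spare is in use
# ("tank" and the cXtYdZ names are hypothetical).
zpool status tank

# Swap the new drive in for the failed one.
zpool replace tank c3t2d0 c3t9d0

# Watch resilver progress; once it completes, the hot spare
# returns to the spare pool automatically.
zpool status -v tank
```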

    The resilver ran for days and hit 100% done, but it kept going, for
over two weeks now, and it is still going. Each 750 GB drive involved
reports resilvering over 4 TB!

    I had seen this before, but not to this extent.

    Now for my questions:

1) I assume the percent done covers the resilver of the base zpool and
datasets but does not include snapshots. This means that once we hit
100% the _current_ data has been resilvered and the resilver is now
working through the snapshots. Is that correct?

2) Is the resilver operation walking through all the data in all of
the snapshots? If so, then I should be able to estimate total
completion time by taking the time to get to 100% and multiplying by
the number of snapshots (assuming all snapshots are about the same
size).

    I know there are "fixes" for this in later zpool versions, but we
are stuck at 22 for right now.

NOTE: Since the snapshots are our backups, we really can't disable
them; if we do, we run into a different zpool version 22 issue where
the amount of RAM needed to destroy a large snapshot is more than we
have. This is also fixed in zpool version 26.

Paul Kraus
-> Senior Systems Architect, Garnet River
-> Assistant Technical Director, LoneStarCon 3
-> Sound Coordinator, Schenectady Light Opera Company
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
zfs-discuss mailing list