I have a moderately large version 22 zpool: `zpool list` reports 75
TB, all raidz2, made up of 22 vdevs of 5 x 750 GB drives each.
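As a sanity check on those numbers (a sketch, relying on the fact that for raidz vdevs `zpool list` reports raw capacity, parity included):

```shell
#!/bin/sh
# Raw capacity implied by the layout: 22 vdevs x 5 drives x 750 GB.
# For raidz, `zpool list` SIZE includes parity space, so this should
# land near the reported 75 TB.
DRIVES=$(( 22 * 5 ))                          # 110 drives total
BYTES=$(( DRIVES * 750 * 1000000000 ))        # 750 GB (decimal) per drive
TIB=$(( BYTES / (1024*1024*1024*1024) ))      # convert to binary TiB
echo "$DRIVES drives, ~$TIB TiB raw"          # -> 110 drives, ~75 TiB raw
```

So the "75 TB" is the binary-units raw size, not usable space after parity.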
We take snapshots hourly and keep them for 5 weeks for operational
backups (Disaster Recovery backups are via zfs send | zfs recv to
another physical system).
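For scale, the snapshot inventory that schedule implies (per dataset, assuming no extra snapshots beyond the hourly ones):

```shell
#!/bin/sh
# Hourly snapshots retained for 5 weeks.
PER_DAY=24
DAYS=$(( 5 * 7 ))
TOTAL=$(( PER_DAY * DAYS ))
echo "~$TOTAL snapshots per dataset"    # -> ~840 snapshots per dataset
```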
`zpool list` reports 44 TB allocated.
We had a drive fail, the hot spare stepped in, and as soon as we
had a replacement from Oracle we `zpool replace`d the failed drive
(the layout ensures that each vdev has one drive in each of five
J4400s, so we can lose up to two of the J4400s without losing data).
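The recovery sequence, as a dry-run sketch (the pool name "tank" and the device name are hypothetical placeholders, and `run` just echoes the commands rather than executing them):

```shell
#!/bin/sh
# Dry-run sketch of the steps after the replacement drive arrived.
# "tank" and c2t5d0 are placeholders, not our real pool/device names.
run() { echo "$@"; }            # change echo to exec to actually run

run zpool replace tank c2t5d0   # new disk into the failed drive's slot
run zpool status -v tank        # then watch the resilver progress
```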
The resilver ran for days, hit 100% done, but kept going; it has now
been running for over two weeks and is still going. Each 750 GB drive
involved reported resilvering over 4 TB!
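To put that in perspective, the resilver has already written several times more than a drive can even hold:

```shell
#!/bin/sh
# Each 750 GB drive reported over 4 TB resilvered -- how many times
# its own capacity is that? (fixed-point, one decimal place)
GB_RESILVERED=4000
GB_DRIVE=750
TENTHS=$(( GB_RESILVERED * 10 / GB_DRIVE ))
echo "~$(( TENTHS / 10 )).$(( TENTHS % 10 ))x the drive's capacity"
```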
I had seen this before, but not to this extent.
Now for my questions:
1) I assume the percent done is the resilver of the base zpool and
datasets but does not include snapshots. This means that once we hit
100% the _current_ data has been resilvered and it is now working on
the snapshots. Is that correct?
2) Is the resilver operation walking through all the data in all of
the snapshots? If so, then I should be able to estimate total
completion by taking the time to reach 100% and multiplying by the
number of snapshots (assuming all snapshots are about the same size).
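One way to write that estimate down. All figures here are hypothetical placeholders, not measurements from our pool, and I've scaled the per-snapshot cost by an assumed average delta size (snapshots are deltas of the live data, so charging each one the full time-to-100% would be a wild over-estimate):

```shell
#!/bin/sh
# Back-of-the-envelope for question (2). Every number is a
# hypothetical placeholder.
DAYS_TO_100=4       # days the resilver took to reach "100% done"
SNAPSHOTS=840       # hourly snapshots retained for 5 weeks
SNAP_PCT=1          # assumed avg snapshot delta, as % of live data
EXTRA=$(( DAYS_TO_100 * SNAPSHOTS * SNAP_PCT / 100 ))
echo "~$(( DAYS_TO_100 + EXTRA )) days total"
```

With those made-up inputs the snapshot walk dominates the estimate, which at least matches what we are seeing: weeks past 100% and still going.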
I know there are "fixes" for this in later zpool versions, but
we are stuck at 22 for right now.
NOTE: Since the snapshots are our backups, we really can't disable
them; if we did, we would run into a different zpool version 22 issue
where the amount of RAM needed to destroy a large snapshot would be
more than we have. That is also fixed in zpool version 26.
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Assistant Technical Director, LoneStarCon 3 (http://lonestarcon3.org/)
-> Sound Coordinator, Schenectady Light Opera Company (
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
zfs-discuss mailing list