Steve McKinty wrote:
> I have a couple of questions and concerns about using ZFS in an environment
> where the underlying LUNs are replicated at a block level using products
> like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I
> wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and
> there are technical and business pros and cons for the various options. I
> don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always
> maintain on-disk self-consistency, and ZFS can make certain assumptions
> about state (e.g. not needing fsck) based on that. This is the basis of my
> questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to
> be atomic, but if the replication block size is smaller than the überblock
> then we can't guarantee that the whole überblock is replicated as an
> entity. That could in theory result in a corrupt überblock at the
> secondary.
The uberblock contains a circular queue of updates. For all practical
purposes, this is COW. The updates I measure are usually 1 block (or, to put
it another way, I don't recall seeing more than 1 block being updated... I'd
have to recheck my data).

> Will this be caught and handled by the normal ZFS checksumming? If so,
> does ZFS just use an alternate überblock and rewrite the damaged one
> transparently?

The checksum should catch it. To be safe, there are 4 copies of the
uberblock.

> 2) Assuming that the replication maintains write-ordering, the secondary
> site will always have valid and self-consistent data, although it may be
> out-of-date compared to the primary if the replication is asynchronous,
> depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for
> one specific scenario. If the replication is interrupted, for example if
> the secondary site is down or unreachable due to a comms problem, the
> primary site will keep a list of changed blocks. When contact between the
> sites is re-established there will be a period of 'catch-up'
> resynchronization. In most, if not all, cases this is done on a simple
> block-order basis. Write-ordering is lost until the two sites are once
> again in sync and routine replication restarts.
>
> I can see this as having major ZFS impact. It would be possible for
> intermediate blocks to be replicated before the data blocks they point to,
> and in the worst case an updated überblock could be replicated before the
> block chains that it references have been copied. This breaks the
> assumption that the on-disk format is always self-consistent.
>
> If a disaster happened during the 'catch-up', and the
> partially-resynchronized LUNs were imported into a zpool at the secondary
> site, what would/could happen? Refusal to accept the whole zpool?
> Rejection of just the files affected? System panic? How could recovery
> from this situation be achieved?

I think all of these reactions to the double-failure mode are possible. The
version of ZFS used will also have an impact, as the later versions are more
resilient. I think that in most cases, only the affected files will be
impacted. zpool scrub will ensure that everything is consistent and mark
those files which fail to checksum properly.

> Obviously all filesystems can suffer in this scenario, but ones that
> expect less from their underlying storage (like UFS) can be fscked, and
> although data that was being updated is potentially corrupt, existing data
> should still be OK and usable. My concern is that ZFS will handle this
> scenario less well.

...databases too... It might be easier to analyze this from the perspective
of the transaction group than that of an individual file. Since ZFS is COW,
you may have a state where a transaction group is incomplete, but the
previous data state should be consistent.

> There are ways to mitigate this, of course, the most obvious being to take
> a snapshot of the (valid) secondary before starting resync, as a fallback.
> This isn't always easy to do, especially since the resync is usually
> automatic; there is no clear trigger to use for the snapshot. It may also
> be difficult to synchronize the snapshot of all LUNs in a pool. I'd like
> to better understand the risks/behaviour of ZFS before starting to work on
> mitigation strategies.

I don't see how snapshots would help. The inherent transaction group commits
should be sufficient.
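As a concrete illustration of the zpool scrub check mentioned above, a
minimal sketch of verifying a replicated copy after importing it at the
secondary site (the pool name 'tank' is hypothetical):

  # Force the import; the LUNs were last active on the primary host,
  # so the pool will not appear cleanly exported to the secondary.
  zpool import -f tank

  # Walk every allocated block in the pool and verify its checksum.
  zpool scrub tank

  # After the scrub completes, -v lists any files with unrecoverable
  # errors so they can be restored individually.
  zpool status -v tank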
Or, to look at this another way, a snapshot is really just a metadata
change. I am more worried about how the storage admin sets up the LUN
groups. The human factor can really ruin my day...
 -- richard
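To make the 'just a metadata change' point concrete, a minimal sketch of the
ZFS-side equivalent of the pool-wide snapshot Steve's fallback calls for
(the pool and snapshot names are hypothetical; note that Steve's actual
fallback would have to be an array-level snapshot of the secondary LUNs,
since a block-level resync would overwrite anything stored inside the pool
itself):

  # -r snapshots every dataset in the pool in a single transaction
  # group, so they all reflect the same consistent point in time.
  zfs snapshot -r tank@pre-resync

  # Confirm the snapshots; being metadata-only, they consume almost no
  # space until the live data diverges from them.
  zfs list -t snapshot -r tank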