Great questions.

> 1) First issue relates to the überblock.  Updates to
> it are assumed to be atomic, but if the replication
> block size is smaller than the überblock then we
> can't guarantee that the whole überblock is
> replicated as an entity.  That could in theory result
> in a corrupt überblock at the
> secondary. 
> 
> Will this be caught and handled by the normal ZFS
> checksumming? If so, does ZFS just use an alternate
> überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential partial uberblock writes if the uberblock 
spans multiple disk sectors (and it might be prudent to do so even if it doesn't, as 
Richard's response seems to suggest).  Common ways of dealing with this problem include 
dumping it into the log (in which case the log with its own internal recovery 
procedure becomes the real root of all evil) or cycling around at least two 
locations per mirror copy (Richard's response suggests that there are 
considerably more, and that perhaps each one is written in quadruplicate) such 
that the previous uberblock would still be available if the new write tanked.  
ZFS-style snapshots complicate both approaches unless special provisions are 
taken - e.g., copying the current uberblock on each snapshot and hanging a list 
of these snapshot uberblock addresses off the current uberblock, though even 
that might run into interesting complications under the scenario which you 
describe below.  Just using the 'queue' that Richard describes to accumulate 
snapshot uberblocks would limit the number of concurrent snapshots to less than 
the size of that queue.
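
To make the 'cycling' idea concrete, here's a minimal sketch (Python, with invented 
field names rather than the real ZFS on-disk layout) of how recovery from such a ring 
of copies might work: ignore any copy whose checksum doesn't verify, then activate 
the survivor with the highest transaction-group number.

    # Hypothetical sketch of ring-style uberblock selection; the names and
    # layout are illustrative, not the real ZFS on-disk format.
    from dataclasses import dataclass
    import zlib

    @dataclass
    class UberblockSlot:
        txg: int        # transaction group number (monotonically increasing)
        payload: bytes  # the rest of the uberblock
        checksum: int   # checksum covering txg + payload

    def valid(slot):
        # A torn (partially written) slot fails its checksum and is ignored.
        data = slot.txg.to_bytes(8, "little") + slot.payload
        return zlib.crc32(data) == slot.checksum

    def active_uberblock(ring):
        # Pick the newest copy that verifies; a failed write of the latest
        # copy simply falls back to the previous one in the ring.
        candidates = [s for s in ring if valid(s)]
        return max(candidates, key=lambda s: s.txg, default=None)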

In any event, as long as writes to the secondary copy don't continue after a 
write failure of the kind that you describe has occurred (save for the kind of 
catch-up procedure that you mention later), ZFS's internal facilities should 
not be confused by encountering a partial uberblock update at the secondary, 
any more than they'd be confused by encountering it on an unreplicated system 
after restart.

> 
> 2) Assuming that the replication maintains
> write-ordering, the secondary site will always have
> valid and self-consistent data, although it may be
> out-of-date compared to the primary if the
> replication is asynchronous, depending on link
> latency, buffering, etc. 
> 
> Normally most replication systems do maintain write
> ordering, *except* for one specific scenario.
> If the replication is interrupted, for example
> secondary site down or unreachable due to a comms
> problem, the primary site will keep a list of
> changed blocks.  When contact between the sites is
> re-established there will be a period of 'catch-up'
> resynchronization.  In most, if not all, cases this
> is done on a simple block-order basis.
> Write-ordering is lost until the two sites are once
>  again in sync and routine replication restarts. 
> 
> I can see this as having major ZFS impact.  It would
> be possible for intermediate blocks to be replicated
> before the data blocks they point to, and in the
> worst case an updated überblock could be replicated
> before the block chains that it references have been
> copied.  This breaks the assumption that the on-disk
> format is always self-consistent. 
> 
> If a disaster happened during the 'catch-up', and the
> partially-resynchronized LUNs were imported into a
> zpool at the secondary site, what would/could happen?
> Refusal to accept the whole zpool? Rejection just of
> the files affected? System panic? How could recovery
> from this situation be achieved?

My inclination is to say "By repopulating your environment from backups":  it 
is not reasonable to expect *any* file system to operate correctly, or to 
attempt any kind of comprehensive recovery (other than via something like fsck, 
with no guarantee of how much you'll get back), when the underlying hardware 
transparently reorders updates which the file system has explicitly ordered 
when it presented them.

But you may well be correct in suspecting that there's more potential for data loss 
should this occur in a ZFS environment than in update-in-place environments, where 
only the portions of the tree structure that were explicitly changed during the 
connection hiatus would likely be affected by such a recovery interruption (though 
even there, if a directory changed enough to alter its block structure on disk, you 
could be in more trouble).
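
To illustrate why I'd expect trouble: here's a toy model (plain Python dictionaries 
standing in for LUNs, invented block addresses, nothing resembling the real formats) 
of a copy-on-write update being resynchronized in block-address order and interrupted 
partway through.

    # Toy model (not real ZFS code) of why block-order catch-up breaks the
    # "always self-consistent on disk" guarantee.
    base = {1: "uberblock txg=1 -> 300", 300: "indirect v1 -> 500", 500: "data v1"}
    primary, secondary = dict(base), dict(base)

    # Copy-on-write transaction on the primary while the link is down:
    # new data block, new indirect block, then the uberblock, in that order.
    primary[510] = "data v2"
    primary[310] = "indirect v2 -> 510"
    primary[1]   = "uberblock txg=2 -> 310"   # the uberblock lives at a low address

    # Catch-up resync walks the changed-block list in ascending block order...
    dirty = sorted(addr for addr in primary if primary[addr] != base.get(addr))
    for addr in dirty:                        # copies 1 first, then 310, then 510
        secondary[addr] = primary[addr]
        if addr == 1:
            break                             # ...and a disaster interrupts it here

    # The secondary now holds a txg=2 uberblock pointing at block 310, which
    # was never copied: the replica's tree is no longer self-consistent.
    print(secondary[1], secondary.get(310))   # 'uberblock txg=2 -> 310' None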

> 
> Obviously all filesystems can suffer with this
> scenario, but ones that expect less from their
> underlying storage (like UFS) can be fscked, and
> although data that was being updated is potentially
> corrupt, existing data should still be OK and usable.
> My concern is that ZFS will handle this scenario
>  less well. 
> 
> There are ways to mitigate this, of course, the most
> obvious being to take a snapshot of the (valid)
> secondary before starting resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

> This isn't always easy to do, especially since the
> resync is usually automatic; there is no clear
> trigger to use for the snapshot. It may also be
> difficult to synchronize the snapshot of all LUNs in
> a pool. I'd like to better understand the
> risks/behaviour of ZFS before starting to work on
>  mitigation strategies. 

It strikes me as irresponsible for a high-end storage product such as you 
describe neither to order its recovery in the same manner that it orders its 
normal operation nor to protect that recovery such that it can be virtually 
guaranteed to complete successfully (e.g., by taking a destination snapshot as 
you suggest or by first copying and mirroring the entire set of update blocks 
to the destination).  Are you *sure* they don't?
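
In case it helps to make that second option concrete, a minimal sketch of the kind of 
staged catch-up I have in mind (invented names, dictionaries standing in for LUNs and 
stable storage at the destination, not any vendor's actual mechanism):

    # Hedged sketch of "protect the catch-up": stage the complete dirty set at
    # the destination before any of it touches the live replica, so an
    # interruption never exposes a partially reordered state.
    def protected_resync(primary, secondary, dirty_addrs, staging):
        # Phase 1: land every changed block in a staging area on stable
        # storage at the destination; ordering within this phase is
        # irrelevant because nothing here is visible to the replica yet.
        for addr in dirty_addrs:
            staging[addr] = primary[addr]
        staging["complete"] = True    # durable marker: dirty set fully staged

        # Phase 2: apply the staged set to the replica.  If a disaster
        # strikes now, recovery simply replays the complete staged set, so
        # the replica is never left with only part of a reordered update.
        for addr in dirty_addrs:
            secondary[addr] = staging[addr]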

- bill
 
 