Steve McKinty wrote:
> I have a couple of questions and concerns about using ZFS in an environment 
> where the underlying LUNs are replicated at a block level using products like 
> HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
> the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv and 
> there are technical and business pros and cons for the various options. I 
> don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain 
> on-disk self-consistency, and ZFS can make certain assumptions about state 
> (e.g. not needing fsck) based on that.  This is the basis of my questions. 
>
> 1) First issue relates to the überblock.  Updates to it are assumed to be 
> atomic, but if the replication block size is smaller than the überblock then 
> we can't guarantee that the whole überblock is replicated as an entity.  That 
> could in theory result in a corrupt überblock at the
> secondary. 
>   

The uberblocks are kept in a circular queue, and each update rotates to
the next slot rather than overwriting the active copy.  For all practical
purposes, the uberblock itself is COW.  The updates I measure are usually
one block (or, to put it another way, I don't recall seeing more than one
block being updated... I'd have to recheck my data).

> Will this be caught and handled by the normal ZFS checksumming? If so, does 
> ZFS just use an alternate überblock and rewrite the damaged one transparently?
>
>   

The checksum should catch it.  To be safe, there are four copies of the
uberblock, one in each of the four vdev labels.
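
To make the recovery side concrete, here is a minimal C sketch of the
selection idea, assuming (my simplification, not the on-disk format)
four redundant copies that each carry a txg number and a checksum: a
copy that fails its checksum simply drops out, and the newest valid one
wins.

#include <stdint.h>
#include <stdio.h>

#define NCOPIES 4                       /* one per label, in this sketch */

/* Hypothetical stand-in for one copy of the uberblock. */
struct ub_copy {
        uint64_t txg;
        uint64_t cksum;
};

/* Toy checksum for the sketch; the real code uses a proper block checksum. */
static uint64_t toy_cksum(uint64_t txg)
{
        return (txg * 2654435761ULL);
}

/*
 * Pick the newest copy whose checksum verifies.  A torn or partially
 * replicated copy just drops out of the race.
 */
static const struct ub_copy *pick_uberblock(const struct ub_copy c[])
{
        const struct ub_copy *best = NULL;

        for (int i = 0; i < NCOPIES; i++) {
                if (c[i].cksum != toy_cksum(c[i].txg))
                        continue;       /* corrupt copy: ignore it */
                if (best == NULL || c[i].txg > best->txg)
                        best = &c[i];
        }
        return (best);
}

int main(void)
{
        struct ub_copy c[NCOPIES] = {
                { 100, 0 }, { 100, 0 },
                { 101, 0xbad },         /* torn mid-replication */
                { 100, 0 },
        };

        /* Fill in valid checksums for the intact copies. */
        c[0].cksum = toy_cksum(100);
        c[1].cksum = toy_cksum(100);
        c[3].cksum = toy_cksum(100);

        const struct ub_copy *ub = pick_uberblock(c);
        if (ub != NULL)
                printf("import would use txg %llu\n",
                    (unsigned long long)ub->txg);
        return (0);
}

So a torn write of one copy at the secondary should just mean that copy
loses and an older, intact copy is used instead.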

> 2) Assuming that the replication maintains write-ordering, the secondary site 
> will always have valid and self-consistent data, although it may be 
> out-of-date compared to the primary if the replication is asynchronous, 
> depending on link latency, buffering, etc. 
>
> Normally most replication systems do maintain write ordering, *except* 
> for one specific scenario.  If the replication is interrupted, for example 
> secondary site down or unreachable due to a comms problem, the primary site 
> will keep a list of changed blocks.  When contact between the sites is 
> re-established there will be a period of 'catch-up' resynchronization.  In 
> most, if not all, cases this is done on a simple block-order basis.  
> Write-ordering is lost until the two sites are once again in sync and routine 
> replication restarts. 
>
> I can see this as having a major ZFS impact.  It would be possible for 
> intermediate blocks to be replicated before the data blocks they point to, 
> and in the worst case an updated überblock could be replicated before the 
> block chains that it references have been copied.  This breaks the assumption 
> that the on-disk format is always self-consistent. 
>
> If a disaster happened during the 'catch-up', and the 
> partially-resynchronized LUNs were imported into a zpool at the secondary 
> site, what would/could happen? Refusal to accept the whole zpool? Rejection 
> just of the files affected? System panic? How could recovery from this 
> situation be achieved?
>   

I think all of these reactions to the double-failure mode are possible.
The version of ZFS used will also have an impact, as the later versions
are more resilient.  I think that in most cases the damage will be
limited to the affected files.  A zpool scrub will verify every block
against its checksum and flag the files whose blocks fail.
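
A rough sketch of why the damage stays contained to individual files
(conceptual only; the tree, the file names, and the checksum here are
all made up and are not the real ZFS structures): every block pointer
records the checksum of the block it points to, so a traversal such as
a scrub can pin a mismatch to the file that owns the bad block and
leave the rest of the pool alone.

#include <stdint.h>
#include <stdio.h>

/*
 * Toy "block": a payload plus the checksums recorded for its children,
 * mimicking the way a block pointer carries the checksum of the block
 * it points to.
 */
struct blk {
        const char *file;               /* which file owns this block */
        const char *data;
        struct blk *child[2];
        uint64_t child_cksum[2];
};

static uint64_t toy_cksum(const char *s)
{
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a, toy only */

        while (*s != '\0')
                h = (h ^ (uint8_t)*s++) * 1099511628211ULL;
        return (h);
}

/*
 * Walk the tree, verifying each child against the checksum held by its
 * parent, and report only the files whose blocks fail.
 */
static void scrub(const struct blk *b)
{
        for (int i = 0; i < 2; i++) {
                const struct blk *c = b->child[i];

                if (c == NULL)
                        continue;
                if (toy_cksum(c->data) != b->child_cksum[i])
                        printf("checksum error in %s\n", c->file);
                else
                        scrub(c);
        }
}

int main(void)
{
        struct blk good = { "fileA", "fully replicated",
            { NULL, NULL }, { 0, 0 } };
        struct blk bad = { "fileB", "partially replicated",
            { NULL, NULL }, { 0, 0 } };
        struct blk root = { "<root>", "", { &good, &bad },
            { toy_cksum("fully replicated"), toy_cksum("original contents") } };

        scrub(&root);                   /* reports only fileB */
        return (0);
}

In practice that is what 'zpool status -v' shows after a scrub: the
list of files with unrecoverable errors.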

> Obviously all filesystems can suffer with this scenario, but ones that expect 
> less from their underlying storage (like UFS) can be fscked, and although 
> data that was being updated is potentially corrupt, existing data should 
> still be OK and usable.  My concern is that ZFS will handle this scenario 
> less well. 
>   

...databases too...
It might be easier to analyze this from the perspective of the
transaction group than that of an individual file.  Since ZFS is COW,
you may end up in a state where the latest transaction group is
incomplete, but the previously committed state should still be
consistent.
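
Here is a toy C illustration of that point (my own simplification, not
the real pool machinery): new data goes to fresh blocks and the "root
pointer" only moves once those writes have landed, so an interrupted
transaction group leaves the previous root, and everything reachable
from it, untouched.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 16

/*
 * Toy "disk": a handful of block slots plus a root pointer that says
 * which slot holds the current state (a stand-in for the uberblock).
 */
struct disk {
        char blocks[NBLOCKS][32];
        int root;
        int next_free;
};

/*
 * Copy-on-write update: write the new version into a fresh block and
 * only flip the root once the write has fully landed.
 */
static void txg_commit(struct disk *d, const char *data, int interrupted)
{
        int newblk = d->next_free++;

        snprintf(d->blocks[newblk], sizeof (d->blocks[newblk]), "%s", data);
        if (interrupted)
                return;                 /* cut off before the root moved */
        d->root = newblk;               /* the txg becomes visible here */
}

int main(void)
{
        struct disk d = { .root = 0, .next_free = 1 };

        (void) strcpy(d.blocks[0], "state at txg 100");

        /* A later txg is interrupted partway through ... */
        txg_commit(&d, "state at txg 101", 1);

        /* ... but the root still points at the last complete state. */
        printf("importable state: %s\n", d.blocks[d.root]);
        return (0);
}

Of course that guarantee leans on write ordering being preserved, which
is exactly what the block-order catch-up resync throws away, hence the
concern above.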

> There are ways to mitigate this, of course, the most obvious being to take a 
> snapshot of the (valid) secondary before starting resync, as a fallback.  
> This isn't always easy to do, especially since the resync is usually 
> automatic; there is no clear trigger to use for the snapshot. It may also be 
> difficult to synchronize the snapshot of all LUNs in a pool. I'd like to 
> better understand the risks/behaviour of ZFS before starting to work on 
> mitigation strategies. 
>
>   

I don't see how snapshots would help.  The inherent transaction group
commits should be sufficient.  Or, to look at this another way, a
snapshot is really just a metadata change.

I am more worried about how the storage admin sets up the LUN groups.
The human factor can really ruin my day...
 -- richard

