Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-14 Thread Jim Dunham
Steve,

 I have a couple of questions and concerns about using ZFS in an  
 environment where the underlying LUNs are replicated at a block  
 level using products like HDS TrueCopy or EMC SRDF.  Apologies in  
 advance for the length, but I wanted the explanation to be clear.

 (I do realise that there are other possibilities such as zfs send/ 
 recv and there are technical and business pros and cons for the  
 various options. I don't want to start a 'which is best' argument :) )

 The CoW design of ZFS means that it goes to great lengths to always  
 maintain on-disk self-consistency, and ZFS can make certain  
 assumptions about state (e.g. not needing fsck) based on that.  This
 is the basis of my questions.

 1) First issue relates to the überblock.  Updates to it are assumed  
 to be atomic, but if the replication block size is smaller than the  
 überblock then we can't guarantee that the whole überblock is  
 replicated as an entity.  That could in theory result in a corrupt  
 überblock at the
 secondary.

 Will this be caught and handled by the normal ZFS checksumming? If  
 so, does ZFS just use an alternate überblock and rewrite the  
 damaged one transparently?

 2) Assuming that the replication maintains write-ordering, the  
 secondary site will always have valid and self-consistent data,  
 although it may be out-of-date compared to the primary if the  
 replication is asynchronous, depending on link latency, buffering,  
 etc.

 Normally most replication systems do maintain write ordering, [i] 
 except[/i] for one specific scenario.  If the replication is  
 interrupted, for example secondary site down or unreachable due to  
 a comms problem, the primary site will keep a list of changed  
 blocks.  When contact between the sites is re-established there  
 will be a period of 'catch-up' resynchronization.  In most, if not  
 all, cases this is done on a simple block-order basis.  Write- 
 ordering is lost until the two sites are once again in sync and  
 routine replication restarts.

 I can see this as having a major ZFS impact.  It would be possible
 for intermediate blocks to be replicated before the data blocks  
 they point to, and in the worst case an updated überblock could be  
 replicated before the block chains that it references have been  
 copied.  This breaks the assumption that the on-disk format is  
 always self-consistent.

For most implementations of resynchronization, not only are changes
resilvered on a block-ordered basis, resynchronization is also done
in a single pass over the volume(s). To address the fact that
resynchronization happens while additional changes are also being
replicated, the concept of a resynchronization point is kept. As this
resynchronization point traverses the volume from beginning to end,
I/Os occurring at or before this point need to be replicated inline,
whereas I/Os occurring after this point need to be marked such that
they will be replicated later in block order. You are quite correct
that the data is not consistent.
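The single-pass, resync-point scheme described here can be sketched as a toy model. This is a hypothetical illustration only (the function names and the dirty-block set are invented; real SNDR/TrueCopy/SRDF implementations track changes in bitmaps at the volume layer):

```python
# Toy model of 'catch-up' resynchronization with a moving resync point.
# Blocks at or before the point have already been made consistent on the
# remote side, so a new write there must be replicated inline; a write
# beyond the point is only flagged and will be copied when the single
# forward pass reaches it. The pass runs in strict block order, so write
# ordering is lost until it completes.

def resync(volume_size, dirty, read_block, replicate):
    """dirty: set of block numbers changed while the link was down."""
    for resync_point in range(volume_size):
        if resync_point in dirty:
            replicate(resync_point, read_block(resync_point))
            dirty.discard(resync_point)

def on_new_write(block_no, data, resync_point, dirty, replicate):
    """Handle an application write that arrives mid-resync."""
    if block_no <= resync_point:
        replicate(block_no, data)   # already-synced region: send inline
    else:
        dirty.add(block_no)         # the forward pass will pick it up
```

Until the pass finishes, the remote volume mixes old consistent data, block-ordered resilvered data, and inline-replicated new writes, which is why it cannot be trusted mid-pass.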

 If a disaster happened during the 'catch-up', and the partially- 
 resynchronized LUNs were imported into a zpool at the secondary  
 site, what would/could happen? Refusal to accept the whole zpool?  
 Rejection just of the files affected? System panic? How could  
 recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than
you may realize. During active resynchronization, the remote volume
contains a mixture of prior write-order-consistent data, resilvered
block-order data, plus new replicated data. Essentially the
partially-resynchronized LUNs are totally inconsistent until such a
time as the single pass over all data is 100% complete.

For some, but not all, replication software, if the 'catch-up'
resynchronization failed, read access to the LUNs should be
prevented, or at least read access while the LUNs are configured as
remote mirrors. Availability Suite's Remote Mirror software (SNDR)
marks such volumes as 'needing synchronization' and fails all
application read and write I/Os.

 Obviously all filesystems can suffer with this scenario, but ones  
 that expect less from their underlying storage (like UFS) can be  
 fscked, and although data that was being updated is potentially  
 corrupt, existing data should still be OK and usable.  My concern  
 is that ZFS will handle this scenario less well.

 There are ways to mitigate this, of course, the most obvious being  
 to take a snapshot of the (valid) secondary before starting resync,  
 as a fallback.  This isn't always easy to do, especially since the  
 resync is usually automatic; there is no clear trigger to use for  
 the snapshot. It may also be difficult to synchronize the snapshot  
 of all LUNs in a pool. I'd like to better understand the risks/ 
 behaviour of ZFS before starting to work on mitigation strategies.

Since Availability Suite is both Remote Mirroring and 

[zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Steve McKinty
I have a couple of questions and concerns about using ZFS in an environment 
where the underlying LUNs are replicated at a block level using products like 
HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
the explanation to be clear.

(I do realise that there are other possibilities such as zfs send/recv and 
there are technical and business pros and cons for the various options. I don't 
want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain 
on-disk self-consistency, and ZFS can make certain assumptions about state (e.g.
not needing fsck) based on that.  This is the basis of my questions.

1) First issue relates to the überblock.  Updates to it are assumed to be 
atomic, but if the replication block size is smaller than the überblock then we 
can't guarantee that the whole überblock is replicated as an entity.  That 
could in theory result in a corrupt überblock at the
secondary. 

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS 
just use an alternate überblock and rewrite the damaged one transparently?

2) Assuming that the replication maintains write-ordering, the secondary site 
will always have valid and self-consistent data, although it may be out-of-date 
compared to the primary if the replication is asynchronous, depending on link 
latency, buffering, etc. 

Normally most replication systems do maintain write ordering, [i]except[/i] for 
one specific scenario.  If the replication is interrupted, for example 
secondary site down or unreachable due to a comms problem, the primary site 
will keep a list of changed blocks.  When contact between the sites is 
re-established there will be a period of 'catch-up' resynchronization.  In 
most, if not all, cases this is done on a simple block-order basis.  
Write-ordering is lost until the two sites are once again in sync and routine 
replication restarts. 

I can see this as having a major ZFS impact.  It would be possible for
intermediate blocks to be replicated before the data blocks they point to, and 
in the worst case an updated überblock could be replicated before the block 
chains that it references have been copied.  This breaks the assumption that 
the on-disk format is always self-consistent. 

If a disaster happened during the 'catch-up', and the partially-resynchronized 
LUNs were imported into a zpool at the secondary site, what would/could happen? 
Refusal to accept the whole zpool? Rejection just of the files affected? System 
panic? How could recovery from this situation be achieved?

Obviously all filesystems can suffer with this scenario, but ones that expect 
less from their underlying storage (like UFS) can be fscked, and although data 
that was being updated is potentially corrupt, existing data should still be OK 
and usable.  My concern is that ZFS will handle this scenario less well. 

There are ways to mitigate this, of course, the most obvious being to take a 
snapshot of the (valid) secondary before starting resync, as a fallback.  This 
isn't always easy to do, especially since the resync is usually automatic; 
there is no clear trigger to use for the snapshot. It may also be difficult to 
synchronize the snapshot of all LUNs in a pool. I'd like to better understand 
the risks/behaviour of ZFS before starting to work on mitigation strategies. 

Thanks

Steve
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Richard Elling
Steve McKinty wrote:
 I have a couple of questions and concerns about using ZFS in an environment 
 where the underlying LUNs are replicated at a block level using products like 
 HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
 the explanation to be clear.

 (I do realise that there are other possibilities such as zfs send/recv and 
 there are technical and business pros and cons for the various options. I 
 don't want to start a 'which is best' argument :) )

 The CoW design of ZFS means that it goes to great lengths to always maintain 
 on-disk self-consistency, and ZFS can make certain assumptions about state 
 (e.g. not needing fsck) based on that.  This is the basis of my questions.

 1) First issue relates to the überblock.  Updates to it are assumed to be 
 atomic, but if the replication block size is smaller than the überblock then 
 we can't guarantee that the whole überblock is replicated as an entity.  That 
 could in theory result in a corrupt überblock at the
 secondary. 
   

The uberblock contains a circular queue of updates.  For all practical
purposes, this is COW.  The updates I measure are usually 1 block
(or, to put it another way, I don't recall seeing more than 1 block being
updated... I'd have to recheck my data)

 Will this be caught and handled by the normal ZFS checksumming? If so, does 
 ZFS just use an alternate überblock and rewrite the damaged one transparently?

   

The checksum should catch it.  To be safe, there are 4 copies of the 
uberblock.
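As a rough illustration of why a torn uberblock write is survivable when each copy is self-checksummed: copies whose stored checksum fails to verify are discarded, and the surviving copy with the highest transaction group number wins. The field names here are illustrative, not the actual ZFS on-disk layout:

```python
# Sketch: choose the active uberblock among redundant copies. A partial
# (torn) write leaves a copy whose stored checksum no longer matches its
# contents; that copy is skipped and an older intact copy is used.
import hashlib
from dataclasses import dataclass

@dataclass
class UberblockCopy:
    txg: int          # transaction group number (newer = higher)
    payload: bytes    # the rest of the uberblock contents
    checksum: bytes   # stored self-checksum of the payload

def is_valid(ub: UberblockCopy) -> bool:
    return hashlib.sha256(ub.payload).digest() == ub.checksum

def pick_active_uberblock(copies):
    survivors = [ub for ub in copies if is_valid(ub)]
    if not survivors:
        raise RuntimeError("no valid uberblock: pool cannot be opened")
    return max(survivors, key=lambda ub: ub.txg)
```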

 2) Assuming that the replication maintains write-ordering, the secondary site 
 will always have valid and self-consistent data, although it may be 
 out-of-date compared to the primary if the replication is asynchronous, 
 depending on link latency, buffering, etc. 

 Normally most replication systems do maintain write ordering, [i]except[/i] 
 for one specific scenario.  If the replication is interrupted, for example 
 secondary site down or unreachable due to a comms problem, the primary site 
 will keep a list of changed blocks.  When contact between the sites is 
 re-established there will be a period of 'catch-up' resynchronization.  In 
 most, if not all, cases this is done on a simple block-order basis.  
 Write-ordering is lost until the two sites are once again in sync and routine 
 replication restarts. 

 I can see this as having a major ZFS impact.  It would be possible for
 intermediate blocks to be replicated before the data blocks they point to, 
 and in the worst case an updated überblock could be replicated before the 
 block chains that it references have been copied.  This breaks the assumption 
 that the on-disk format is always self-consistent. 

 If a disaster happened during the 'catch-up', and the 
 partially-resynchronized LUNs were imported into a zpool at the secondary 
 site, what would/could happen? Refusal to accept the whole zpool? Rejection 
 just of the files affected? System panic? How could recovery from this 
 situation be achieved?
   

I think all of these reactions to the double-failure mode are possible.
The version of ZFS used will also have an impact as the later versions
are more resilient.  I think that in most cases, only the affected files
will be impacted.  zpool scrub will ensure that everything is consistent
and mark those files which fail to checksum properly.
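Conceptually, a scrub is just that walk-and-verify pass: recompute each reachable block's checksum and flag mismatches. A minimal sketch (structure invented for illustration; the real zpool scrub traverses the pool's block tree and repairs from redundant copies where it can):

```python
# Sketch of a scrub-style verification pass: recompute the checksum of
# every block and report the ones whose stored checksum no longer
# matches, i.e. the blocks a scrub would mark as damaged.
import hashlib

def scrub(blocks):
    """blocks: iterable of (name, payload, stored_checksum) tuples."""
    return [name for name, payload, stored in blocks
            if hashlib.sha256(payload).digest() != stored]
```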

 Obviously all filesystems can suffer with this scenario, but ones that expect 
 less from their underlying storage (like UFS) can be fscked, and although 
 data that was being updated is potentially corrupt, existing data should 
 still be OK and usable.  My concern is that ZFS will handle this scenario 
 less well. 
   

...databases too...
It might be easier to analyze this from the perspective of the transaction
group than an individual file.  Since ZFS is COW, you may have a
state where a transaction group is incomplete, but the previous data
state should be consistent.
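Richard's point about transaction groups can be made concrete with a toy copy-on-write store (all names invented): new blocks always land at fresh addresses, and only the final root-pointer flip makes a transaction group live, so an interrupted group leaves the previous consistent state reachable.

```python
# Toy copy-on-write store: commit_txg() writes new blocks to fresh
# addresses and then flips a single root pointer. If the flip never
# happens (e.g. replication stops mid-txg), readers still see the old,
# consistent tree.

class CowStore:
    def __init__(self):
        self.blocks = {}      # address -> data; never overwritten in place
        self.root = None      # address of the current root block
        self.next_addr = 0

    def alloc(self, data):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr

    def commit_txg(self, new_blocks):
        addrs = [self.alloc(d) for d in new_blocks]
        self.root = addrs[-1]   # the single "atomic" step of the commit
```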

 There are ways to mitigate this, of course, the most obvious being to take a 
 snapshot of the (valid) secondary before starting resync, as a fallback.  
 This isn't always easy to do, especially since the resync is usually 
 automatic; there is no clear trigger to use for the snapshot. It may also be 
 difficult to synchronize the snapshot of all LUNs in a pool. I'd like to 
 better understand the risks/behaviour of ZFS before starting to work on 
 mitigation strategies. 

   

I don't see how snapshots would help.  The inherent transaction group
commits should be sufficient.  Or, to look at this another way, a
snapshot is really just a metadata change.

I am more worried about how the storage admin sets up the LUN groups.
The human factor can really ruin my day...
 -- richard



Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread can you guess?
Great questions.

 1) First issue relates to the überblock.  Updates to
 it are assumed to be atomic, but if the replication
 block size is smaller than the überblock then we
 can't guarantee that the whole überblock is
 replicated as an entity.  That could in theory result
 in a corrupt überblock at the
 secondary. 
 
 Will this be caught and handled by the normal ZFS
 checksumming? If so, does ZFS just use an alternate
 überblock and rewrite the damaged one transparently?

ZFS already has to deal with potential uberblock partial writes if the uberblock
spans multiple disk sectors (and it might be prudent even if it doesn't, as Richard's
response seems to suggest).  Common ways of dealing with this problem include
dumping it into the log (in which case the log, with its own internal recovery
procedure, becomes the real root of all evil) or cycling around at least two
locations per mirror copy (Richard's response suggests that there are
considerably more, and that perhaps each one is written in quadruplicate) such
that the previous uberblock would still be available if the new write tanked.
ZFS-style snapshots complicate both approaches unless special provisions are 
taken - e.g., copying the current uberblock on each snapshot and hanging a list 
of these snapshot uberblock addresses off the current uberblock, though even 
that might run into interesting complications under the scenario which you 
describe below.  Just using the 'queue' that Richard describes to accumulate 
snapshot uberblocks would limit the number of concurrent snapshots to less than 
the size of that queue.

In any event, as long as writes to the secondary copy don't continue after a 
write failure of the kind that you describe has occurred (save for the kind of 
catch-up procedure that you mention later), ZFS's internal facilities should 
not be confused by encountering a partial uberblock update at the secondary, 
any more than they'd be confused by encountering it on an unreplicated system 
after restart.

 
 2) Assuming that the replication maintains
 write-ordering, the secondary site will always have
 valid and self-consistent data, although it may be
 out-of-date compared to the primary if the
 replication is asynchronous, depending on link
 latency, buffering, etc. 
 
 Normally most replication systems do maintain write
 ordering, [i]except[/i] for one specific scenario.
 If the replication is interrupted, for example
 secondary site down or unreachable due to a comms
 problem, the primary site will keep a list of
 changed blocks.  When contact between the sites is
 re-established there will be a period of 'catch-up'
 resynchronization.  In most, if not all, cases this
 is done on a simple block-order basis.
 Write-ordering is lost until the two sites are once
  again in sync and routine replication restarts. 
 
 I can see this as having a major ZFS impact.  It would
 be possible for intermediate blocks to be replicated
 before the data blocks they point to, and in the
 worst case an updated überblock could be replicated
 before the block chains that it references have been
 copied.  This breaks the assumption that the on-disk
 format is always self-consistent. 
 
 If a disaster happened during the 'catch-up', and the
 partially-resynchronized LUNs were imported into a
 zpool at the secondary site, what would/could happen?
 Refusal to accept the whole zpool? Rejection just of
 the files affected? System panic? How could recovery
 from this situation be achieved?

My inclination is to say "By repopulating your environment from backups": it
is not reasonable to expect *any* file system to operate correctly, or to
attempt any kind of comprehensive recovery (other than via something like fsck,
with no guarantee of how much you'll get back), when the underlying hardware
transparently reorders updates which the file system has explicitly ordered
when it presented them.

But you may well be correct in suspecting that there's more potential for 
data loss should this occur in a ZFS environment than in update-in-place 
environments where only portions of the tree structure that were explicitly 
changed during the connection hiatus would likely be affected by such a 
recovery interruption (though even there if a directory changed enough to 
change its block structure on disk you could be in more trouble).

 
 Obviously all filesystems can suffer with this
 scenario, but ones that expect less from their
 underlying storage (like UFS) can be fscked, and
 although data that was being updated is potentially
 corrupt, existing data should still be OK and usable.
 My concern is that ZFS will handle this scenario
  less well. 
 
 There are ways to mitigate this, of course, the most
 obvious being to take a snapshot of the (valid)
 secondary before starting resync, as a fallback.

You're talking about an HDS- or EMC-level snapshot, right?

 This isn't always easy to do, especially since the
 resync is usually automatic; there is no clear
 

Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-13 Thread Ricardo M. Correia
Steve McKinty wrote:

  1) First issue relates to the überblock.  Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity.  That could in theory result in a corrupt überblock at the
secondary. 

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?
  


Yes, ZFS uberblocks are self-checksummed with SHA-256 and when opening
the pool it uses the latest valid uberblock that it can find. So that
is not a problem.


  2) Assuming that the replication maintains write-ordering, the secondary site will always have valid and self-consistent data, although it may be out-of-date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc. 

Normally most replication systems do maintain write ordering, [i]except[/i] for one specific scenario.  If the replication is interrupted, for example secondary site down or unreachable due to a comms problem, the primary site will keep a list of changed blocks.  When contact between the sites is re-established there will be a period of 'catch-up' resynchronization.  In most, if not all, cases this is done on a simple block-order basis.  Write-ordering is lost until the two sites are once again in sync and routine replication restarts. 

I can see this as having a major ZFS impact.  It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied.  This breaks the assumption that the on-disk format is always self-consistent.

If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?
  


I believe your understanding is correct. If you expect such a
double-failure, you cannot rely on being able to recover your pool at
the secondary site.

The newest uberblocks would be among the first blocks to be replicated
(2 of the uberblock arrays are situated at the start of the vdev) and
your whole block tree might be inaccessible if the latest Meta Object
Set blocks were not also replicated. You might be lucky and be able to
mount your filesystems, because ZFS keeps 3 separate copies of the most
important metadata and tries to keep each copy about 1/8th of the disk
apart from the others, but even then I wouldn't count on it.
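The 1/8th-of-the-disk spacing can be illustrated with a trivial placement sketch (the function and the modulo wrap are invented for illustration; real ZFS ditto-block placement also takes metaslabs and multiple vdevs into account):

```python
# Sketch: spread redundant ("ditto") metadata copies about an eighth of
# the device apart, so a block-order resync that has only covered part
# of the disk may still have carried over at least one intact copy.

def ditto_offsets(preferred, vdev_size, copies=3):
    spread = vdev_size // 8
    return [(preferred + i * spread) % vdev_size for i in range(copies)]
```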

If ZFS can't open the pool due to this kind of corruption, you would
get the following message:

status: The pool metadata is corrupted and the pool cannot be
opened.
action: Destroy and re-create the pool from a backup source.

At this point, you could try zeroing out the first 2 uberblock
arrays so that ZFS tries using an older uberblock from the last 2
arrays, but this might not work. As the message says, the only reliable
way to recover from this is restoring your pool from backups.


  There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback.  This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies. 
  


If the replication process was interrupted for a sufficiently long time
and disaster strikes at the primary site *during resync*, I don't think
snapshots would save you even if you had taken them at the right time.
Snapshots might increase your chances of recovery (by making ZFS not
free and reuse blocks), but AFAIK there wouldn't be any guarantee that
you'd be able to recover anything whatsoever, since the most important
pool metadata is not part of the snapshots.

Regards,
Ricardo

-- 
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal


