Ralf,

> Torrey McMahon wrote:
>> AVS?
>
> Jim Dunham will probably shoot me, or worse, but I recommend thinking
> twice about using AVS for ZFS replication.
That is why they call this a discussion group, as it encourages differing opinions.

> Basically, you only have a few options:
>
> 1) Using a battery-buffered hardware RAID controller, which leads to
>    bad ZFS performance in many cases,
> 2) Building up three-way mirrors to avoid complete data loss in
>    several disaster scenarios due to missing ZFS recovery mechanisms
>    like `fsck`, which makes AVS/ZFS based solutions quite expensive,
> 3) Additionally using another form of backup, e.g. tapes.
>
> For instance, one scenario which made me think: Imagine you have an
> X4500. 48 internal disks, 500 GB each. This would lead to a ZFS pool
> on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap
> volumes, otherwise your performance will be very bad, plus 2x HSP).
> Using 40 disks leads to a total of 40 separate replications. Now
> imagine the following two scenarios:

This is just one scenario for deploying the 48 disks of an X4500. The blog listed below offers another option: by mirroring the bitmaps across all available disks, the total disk count comes back up to 46 (or 44, with 2x HSP), leaving the other two for a mirrored root disk.

http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Yes, provisioning one slice for bitmaps and another slice for ZFS's vdevs on the same internal disk may introduce out-of-band head seeks between bitmap I/O and ZFS I/O, and giving ZFS only a slice of a disk turns off ZFS's ability to enable the disk's write cache. All things considered, this is the cost of host-based replication.

> a) A disk in the primary fails. What happens? A HSP jumps in and
> 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s
> crossover cable. This takes a bit of time and is 100% unnecessary

But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk.
So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix it.

For all that is good (or bad) about AVS, the fact that it works by simply interposing itself on the Solaris I/O data path is great, as it works with any Solaris block storage. Of course this also means that it has no filesystem, database, or hot-spare knowledge, which means that at times AVS will be inefficient at what it does.

> - and it will become much worse in the future, because the disk
> capacities rocket up into the sky, while the performance isn't
> improved as much.

Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only that 5% of the entire disk will have to be replicated. If, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when swapped into use, only the cost of replicating the actual in-use ZFS data.

> During this time, your service misses redundancy.

Absolutely not. If all of the in-use ZFS disks and ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.

> And we're not talking about some minutes during this time. Well, and
> now try to imagine what will happen if another disk fails during this
> rebuild, this time in the secondary ...

If I was truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc., fail, there is always a period of degraded system performance, but it is better than no system performance.

> b) A disk in the secondary fails.
> What happens now? No HSP will jump in on the secondary, because the
> zpool isn't imported and ZFS doesn't know about the failure. Instead,
> you'll end up with 39 active replications instead of 40. The one which
> replicates to the failed drive will become inactive. But ... oh damn,
> the zpool isn't mounted on the secondary host, so ZFS doesn't report
> the drive failure to our server monitoring.

But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS will detect the inconsistency, mark the drive as failed, and swap in the secondary HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred, and continue on right where the secondary node left off.

> That can be funny. The only way to get aware of the problem I found
> after a minute of thinking was asking sndradm about the health
> status - which would lead to a false alarm on Host A, because the
> failed disc is in Host B, and operators are usually not bright enough
> to change the disc in Host B after they get notified about a problem
> on Host B. But even if everything works, what will happen if the
> primary fails before an administrator fixed the problem, the missing
> replication is running again and the replacement disc has been
> completely synced? "Hello, kernel panic", and "Goodbye, 12 TB of
> data").

See above, but yes, there is a need for a system administrator to monitor SNDR replication.

> c) You *must* force every single `zpool import <zpool>` on the
> secondary host. Always.

Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.

> Because you usually need your secondary host after your primary
> crashed.
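The failover sequence being discussed can be sketched as a shell fragment. This is only an illustration, not a tested procedure: the pool name `tank` and the SNDR group name `tank-group` are hypothetical, and it assumes a Solaris secondary node with AVS/SNDR installed.

```shell
# On the secondary node, after the primary has been declared dead.
# (Hypothetical names: pool "tank", SNDR consistency group "tank-group".)

# Put the SNDR sets for the pool into logging mode *before* touching
# the pool -- importing while replication is still active is the
# mistake that leads to the data loss described in this thread.
sndradm -g tank-group -l

# The pool was never exported by the (now dead) primary, so ZFS
# refuses a plain import; -f overrides that protection.
zpool import -f tank
```

When the old primary returns, a reverse update synchronization (`sndradm -u -r ...`) would bring it back in step with the writes that happened on the secondary.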
> You won't have the chance to export your zpool on the primary first -
> and if you do, you don't need AVS at all. Bring some Kleenex to get
> rid of the sweat on your forehead when you have to switch to your
> secondary host, because a single mistake (like forgetting to put the
> secondary host into logging mode manually before you try to import
> the zpool) will lead to a complete data loss.

Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`. In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

> I bet you won't even trust your own failover scripts.
>
> Use AVS and ZFS together. I use it myself. But I made sure that I know
> what I'm doing. Most people probably don't.

You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant from each other as they are in the alphabet. :-O

> Btw: I have to admit that I haven't tried the newest Nevada builds
> during the tests. It's possible that AVS and ZFS work better together
> than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason
> I haven't tried. It's because Sun Cluster 3.2 instantly crashes on
> Thumpers, SATA-related kernel panics, and the OpenHA Cluster isn't
> available yet.

With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS together with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users started with iSCSI.
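For readers who haven't seen the shareiscsi option, it looks like this; the pool and volume names are made up for illustration.

```shell
# Create a 10 GB ZFS volume (zvol) and export it as an iSCSI target
# with a single property set -- no separate target configuration step.
zfs create -V 10g tank/vol0
zfs set shareiscsi=on tank/vol0

# The new target is now visible to the iSCSI target daemon.
iscsitadm list target
```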
I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to replicate to the same-named pool on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Jim

> --
>
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA
>
> Tel. +49-721-91374-3963
> [EMAIL PROTECTED] - http://web.de/
>
> 1&1 Internet AG
> Brauerstraße 48
> 76135 Karlsruhe
>
> Amtsgericht Montabaur HRB 6484
>
> Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
> Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang,
> Achim Weiss
> Aufsichtsratsvorsitzender: Michael Scheeren

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss