Ralf,

> Torrey McMahon wrote:
>> AVS?
>
> Jim Dunham will probably shoot me, or worse, but I recommend thinking
> twice about using AVS for ZFS replication.
That is why they call this a discussion group, as it encourages differing opinions.

> Basically, you only have a few options:
>
> 1) Using a battery-buffered hardware RAID controller, which leads to
>    bad ZFS performance in many cases,
> 2) Building up three-way mirrors to avoid complete data loss in
>    several disaster scenarios due to missing ZFS recovery mechanisms
>    like `fsck`, which makes AVS/ZFS based solutions quite expensive,
> 3) Additionally using another form of backup, e.g. tapes.
>
> For instance, one scenario which made me think: Imagine you have an
> X4500. 48 internal disks, 500 GB each. This would lead to a ZFS pool
> on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap
> volumes, otherwise your performance will be very bad, plus 2x HSP).
> Using 40 disks leads to a total of 40 separate replications. Now
> imagine the following two scenarios:

This is just one scenario for deploying the 48 disks of an X4500. The blog listed below offers another option: by mirroring the bitmaps across all available disks, the total disk count comes back up to 46 (or 44, with 2x HSP), leaving the other two for a mirrored root disk.

http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Yes, provisioning one slice for bitmaps and another slice for ZFS's vdevs on the same internal disk may introduce out-of-band head seeks between bitmap I/O and ZFS I/O, and giving ZFS only a slice of a disk turns off ZFS's ability to enable the disk's write cache. All things considered, this is the cost of host-based replication.

> a) A disk in the primary fails. What happens? A HSP jumps in and
> 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s
> crossover cable. This takes a bit of time and is 100% unnecessary

But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk.
So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix it.

For all that is good (or bad) about AVS, the fact that it works by simply interposing itself on the Solaris I/O data path is great, as it works with any Solaris block storage. Of course this also means that it has no filesystem, database, or hot-spare knowledge, which means that at times AVS will be inefficient at what it does.

> - and it will become much worse in the future, because the disk
> capacities rocket up into the sky, while the performance isn't
> improved as much.

Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only that 5% of the entire disk will have to be replicated. If, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when swapped into use, only the cost of replicating the actual in-use ZFS data.

> During this time, your service misses redundancy.

Absolutely not. If all of the in-use ZFS disks and ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.

> And we're not talking about some minutes during this time. Well, and
> now try to imagine what will happen if another disk fails during this
> rebuild, this time in the secondary ...

If I was truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc., fail, there is always a period of degraded system performance, but it is better than no system performance.

> b) A disk in the secondary fails.
> What happens now? No HSP will jump in on the secondary, because the
> zpool isn't imported and ZFS doesn't know about the failure. Instead,
> you'll end up with 39 active replications instead of 40. The one which
> replicates to the failed drive will become inactive. But ... oh damn,
> the zpool isn't mounted on the secondary host, so ZFS doesn't report
> the drive failure to our server monitoring.

But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS will detect the inconsistency, mark the drive as failed, and swap in the secondary HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred, and continue on right where the secondary node left off.

> That can be funny. The only way to get aware of the problem I found
> after a minute of thinking was asking sndradm about the health
> status - which would lead to a false alarm on Host A, because the
> failed disc is in Host B, and operators are usually not bright enough
> to change the disc in Host B after they get notified about a problem
> on Host B. But even if everything works, what will happen if the
> primary fails before an administrator fixed the problem, the missing
> replication is running again and the replacement disc has been
> completely synced? "Hello, kernel panic", and "Goodbye, 12 TB of
> data").

See above, but yes, there is a need for a system administrator to monitor SNDR replication.

> c) You *must* force every single `zpool import <zpool>` on the
> secondary host. Always.

Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.

> Because you usually need your secondary host after your primary
> crashed.
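The failover sequence being discussed can be sketched as a shell fragment. This is only an illustration, not a tested procedure: the pool name `tank` and the SNDR group name `tank-group` are hypothetical, and it assumes a Solaris secondary node with AVS/SNDR installed.

```shell
# On the secondary node, after the primary has been declared dead.
# (Hypothetical names: pool "tank", SNDR consistency group "tank-group".)

# Put the SNDR sets for the pool into logging mode *before* touching
# the pool -- importing while replication is still active is the
# mistake that leads to the data loss described in this thread.
sndradm -g tank-group -l

# The pool was never exported by the (now dead) primary, so ZFS
# refuses a plain import; -f overrides that protection.
zpool import -f tank
```

When the old primary returns, a reverse update synchronization (`sndradm -u -r ...`) would bring it back in step with the writes that happened on the secondary.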
> You won't have the chance to export your zpool on the primary first -
> and if you do, you don't need AVS at all. Bring some Kleenex to get
> rid of the sweat on your forehead when you have to switch to your
> secondary host, because a single mistake (like forgetting to put the
> secondary host into logging mode manually before you try to import
> the zpool) will lead to a complete data loss.

Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`. In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

> I bet you won't even trust your own failover scripts.
>
> Use AVS and ZFS together. I use it myself. But I made sure that I know
> what I'm doing. Most people probably don't.

You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant from each other as they are in the alphabet. :-O

> Btw: I have to admit that I haven't tried the newest Nevada builds
> during the tests. It's possible that AVS and ZFS work better together
> than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason
> I haven't tried. It's because Sun Cluster 3.2 instantly crashes on
> Thumpers, SATA-related kernel panics, and the OpenHA Cluster isn't
> available yet.

With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS together with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users started with iSCSI.
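For readers who haven't seen the shareiscsi option, it looks like this; the pool and volume names are made up for illustration.

```shell
# Create a 10 GB ZFS volume (zvol) and export it as an iSCSI target
# with a single property set -- no separate target configuration step.
zfs create -V 10g tank/vol0
zfs set shareiscsi=on tank/vol0

# The new target is now visible to the iSCSI target daemon.
iscsitadm list target
```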
I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to replicate to the same-named pool on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Jim

> --
>
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA
>
> Tel. +49-721-91374-3963
> [EMAIL PROTECTED] - http://web.de/
>
> 1&1 Internet AG
> Brauerstraße 48
> 76135 Karlsruhe
>
> Amtsgericht Montabaur HRB 6484
>
> Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
> Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang,
> Achim Weiss
> Aufsichtsratsvorsitzender: Michael Scheeren

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss