Jim Dunham wrote:
> It is the mixture of both resilvering writes, and new ZFS filesystem
> writes, that make it impossible for AVS to make replication 'smarter'.
Jim is right here. I just want to add that I don't see an obvious way to make AVS as "smart" as Brent may wish it to be.

Sometimes I describe AVS as a low-level service with some proxy functionality. That's not really correct, but good enough for a single PowerPoint slide. AVS receives the writes from the file system and replicates them. It does not care about the contents of the transactions, just as IP can't take over the responsibilities of higher-layer protocols like TCP, let alone layer-7 data (a flawed comparison, I know, but it may help to illustrate what I mean).

What AVS does is copy the contents of devices. A file system writes some data to a sector on a hard disk -> AVS is aware of this transaction -> AVS replicates the sector to the second host -> on the secondary host, AVS makes sure that *exactly* the same data is written to *exactly* the same position on the secondary host's storage device. Your secondary storage is a 100% copy. And if you write a bazillion 0-byte sectors to the disk with `dd`, AVS will make sure that the secondary does it, too. And it does this in near real time (if you ignore the network bottlenecks). The downside: it's easy to do something wrong, and you may run into network bottlenecks due to the higher amount of traffic.

What AVS can't offer: file-based replication. In many cases, you don't have to care about having an exact copy of a device. For example, if you want a standby solution for your NFS file server, you want to keep the contents of the files and directories in sync. You don't care whether a newly written file uses the same inode number. You only care that the file is copied to your backup host, and that this works while the file system of the backup host is *mounted*. The best-known service for this functionality is `rsync`. And if you know rsync, you know the downside of these services, too: don't even think about replicating your data in real time and/or to multiple servers.
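To make the contrast concrete, here are the two styles side by side as command lines. These are illustrative fragments only: the host names, devices and bitmap volumes are made up, and you should check the `sndradm` and `rsync` man pages for the exact syntax on your release before using anything like this.

```shell
# Block-level (AVS/SNDR): replicate a whole device, sector by sector.
# primary.example and secondary.example, plus all device paths, are
# placeholders -- adjust to your own setup.
sndradm -e primary.example   /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
        secondary.example    /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
        ip async

# File-based (rsync): replicate file contents into a *mounted* file
# system on the backup host, typically from a periodic cron job.
rsync -a --delete /export/home/ secondary.example:/export/home/
```

The first command keeps the secondary device a 100% copy of the primary device; the second only promises that the files and directories end up in sync the next time the job runs.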
The challenge is to find out which kind of replication suits your concept better. For instance, if you want to replicate HTML pages, graphics or other documents, perhaps even with a "copy button" on an intranet page, file-based replication is your friend. If you need real-time copying or device replication, for instance on a database server with its own file system, or for keeping configuration files in sync across a cluster, then AVS is your best bet.

But let's face it: everybody wants the best of both worlds, and so people ask whether AVS could not just get smarter. The answer: no, not really. It can't check whether the file system's write operations "make sense" or whether the data "really needs to be replicated". AVS is a truck which guarantees fast and accurate delivery of whatever you throw into it. Taking care of the content itself is the job of the person who prepares the freight. And, in our case, this person is called UFS. Or ZFS.

And ZFS could do a much better job here. Sun's marketing sells ZFS as offering data integrity at *all times* (http://www.sun.com/2004-0914/feature/). Well, that's true, at least as long as there is no problem on the lower layers. I have often wondered whether ZFS doesn't offer something fsck-like for faulted pools because it's technically impossible, or because the marketing guys forbade it. I have also wondered why people are enthusiastic about gimmicks like ditto blocks, but don't want data protection for the case where an X4540 suffers a power outage and lots of gigabytes of ZFS cache go down the drain.

Proposal: ZFS should offer some kind of "IsReplicated" flag in the zpool metadata. During a `zpool import`, this flag should be checked, and if it is set, a corresponding error message should be printed on stdout. Or the ability to set dummy zpool parameters, something like `zpool set storage:cluster:avs=true tank`. This would only be some kind of first aid, but that's better than nothing. And this has nothing to do with AVS alone.
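As a sketch of what such an import guard could look like today, here is a minimal shell function. Note that the property name `storage:cluster:avs` is my own convention, not something ZFS interprets, and the whole wrapper is a proposal, not an existing feature:

```shell
# Sketch of a "zpool import" guard. It takes the value of a user
# property (my own convention, e.g. storage:cluster:avs) and refuses
# the import if the pool is flagged as replicated.
check_replication() {
    # $1 = property value, as a live wrapper would obtain it via e.g.
    #   zfs get -H -o value storage:cluster:avs tank
    if [ "$1" = "true" ]; then
        echo "WARNING: pool is replicated; switch the replication set to logging mode before forcing the import"
        return 1
    fi
    return 0
}

# Intended use (not executed here, hosts and pool are hypothetical):
#   flag=$(zfs get -H -o value storage:cluster:avs tank)
#   check_replication "$flag" && zpool import tank
```

The point is only that a single, well-known flag would let every replication product (not just AVS) hook into the import path with a few lines of script.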
It also applies to other replication services. It would allow us to write simple wrapper scripts which switch the replication mechanism into logging mode, thus allowing us to safely force the import of the zpool in case of a disaster.

Of course, it would be even better to integrate AVS into ZFS itself. `zfs set replication=<hostname1>[,<hostname2>...<hostnameN>]` would be the coolest thing on earth, because it would combine the benefits of AVS and rsync-like replication into a perfect product. And it would allow the marketing people to use the "high availability" and "full data redundancy" buzzwords in their flyers.

But until then, I'll have to continue using cron jobs on the secondary node which try to log in to the primary with ssh, run a `zfs get storage:cluster:avs <filesystem>` on all mounted file systems, and save the result locally for my "zpool import wrapper" script. This is a cheap workaround, but honestly: you can use something like this in your own datacenter, but I bet nobody wants to sell it to a customer as a supported solution ;-)

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss