Jim Dunham wrote:

> It is the mixture of both resilvering writes, and new ZFS filesystem 
> writes, that make it impossible for AVS to make replication 'smarter'.

Jim is right here. I just want to add that I don't see an obvious way to
make AVS as "smart" as Brent may wish it to be.
Sometimes I describe AVS as a low-level service with some proxy
functionality. That's not really correct, but it's good enough for a single
PowerPoint slide. AVS receives the writes from the file system and
replicates them. It does not care about the contents of the
transactions, just as IP doesn't take care of the responsibilities of
higher-layer protocols like TCP, let alone layer 7 data (a flawed
comparison, I know, but it may help to illustrate what I mean).

What AVS does is copy the contents of devices. A file system writes
some data to a sector on a hard disk -> AVS is aware of this transaction
-> AVS replicates the sector to the second host -> on the secondary
host, AVS makes sure that *exactly* the same data is written to
*exactly* the same position on the secondary host's storage device. Your
secondary storage is a 100% copy. And if you write a bazillion zeroed
sectors to the disk with `dd`, AVS will make sure that the secondary
does it, too. And it does this in near real time (if you ignore the
network bottlenecks). The downside: it's easy to do something
wrong, and you may run into network bottlenecks due to the higher amount
of traffic.

What AVS can't offer: file-based replication. In many cases, you don't
have to care about having an exact copy of a device. For example, if you
want a standby solution for your NFS file server, you want to keep the
contents of the files and directories in sync. You don't care whether a
newly written file uses the same inode number. You only care that the
file is copied to your backup host while the backup host's file system
stays *mounted*. The best-known service for this functionality is
`rsync`. And if you know rsync, you know the downside of these services,
too: don't even think about replicating your data in real time and/or to
multiple servers.

The challenge is to find out which kind of replication suits your 
concept better.
For instance, if you want to replicate html pages, graphics or other 
documents, perhaps even with a "copy button" on an intranet page, 
file-based replication is your friend.
If you need real-time copying or device replication, for instance for a
database server with its own file system, then AVS is your best bet.

But let's face it: everybody wants the best of both worlds, and so
people ask whether AVS couldn't just get smarter. The answer: no, not
really. It can't check whether the file system's write operations "make
sense" or whether the data "really needs to be replicated". AVS is a
truck that guarantees fast and accurate delivery of whatever you throw
into it. Taking care of the content itself is the job of the person who
prepares the freight. And in our case, this person is called UFS. Or
ZFS. And ZFS could do a much better job here.

Sun's marketing sells ZFS as offering data integrity at *all times*
(http://www.sun.com/2004-0914/feature/). Well, that's true, at least as
long as there is no problem on the lower layers. And I often wondered
whether ZFS doesn't offer something fsck-like for faulted pools because
it's technically impossible, or because the marketing guys forbade it. I
also wondered why people are enthusiastic about gimmicks like ditto
blocks, but don't want data protection for the case where an X4540
suffers a power outage and many gigabytes of ZFS cache go down the
drain.

Proposal: ZFS should offer some kind of "IsReplicated" flag in the zpool
metadata. During a `zpool import`, this flag should be checked, and if
it is set, a corresponding error message should be printed on stdout. Or
the ability to set dummy zpool parameters, something like "zpool set
storage:cluster:avs=true tank". This would only be some kind of first
aid, but that's better than nothing.
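At the dataset level, by the way, half of this already works today: zfs(1M) treats any property name containing a colon as an opaque user property, so a flag like this can be stored right now (the name storage:cluster:avs is purely my own convention, nothing ZFS interprets):

```shell
# Mark a file system as AVS-replicated via a ZFS user property.
# "storage:cluster:avs" is just a naming convention of mine; ZFS
# stores it but attaches no meaning to it.
zfs set storage:cluster:avs=true tank

# Read it back later, e.g. from an import wrapper:
zfs get -H -o value storage:cluster:avs tank
```

What's missing is the pool-level equivalent that `zpool import` itself could check before the pool is imported.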

This is not limited to AVS; it also applies to other replication
services. It would allow us to write simple wrapper scripts that switch
the replication mechanism into logging mode, thus allowing us to safely
force the import of the zpool in case of a disaster.
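Such a wrapper could look roughly like this. A minimal sketch, assuming the flag has already been cached in a local file and the SNDR I/O group is named after the pool (both are my own conventions, not a supported interface); DRYRUN=1 only prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch of a "zpool import wrapper" for a possibly replicated pool.
# Assumptions (mine): a local flag file containing "avs=true" for the
# pool, and an SNDR I/O group named after the pool.

run() {
    # With DRYRUN=1, print the command instead of executing it.
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

maybe_import() {
    pool=$1
    flagfile=$2
    if [ -f "$flagfile" ] && grep -q 'avs=true' "$flagfile"; then
        # Drop the SNDR set into logging mode so AVS stops shipping
        # writes before we force the import on this node.
        run sndradm -g "$pool" -n -l
    fi
    run zpool import -f "$pool"
}

# Example: maybe_import tank /var/run/avs-flags/tank
```

The point is only the ordering: replication goes into logging mode first, the forced import happens second.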

Of course, it would be even better to integrate AVS into ZFS itself. 
"zfs set replication=<hostname1>[,<hostname2>...<hostnameN>]" would be 
the coolest thing on earth, because it would combine the benefits of AVS 
and rsync-like replication into a perfect product. And it would allow 
the marketing people to use the "high availability" and "full data 
redundancy" buzzwords in their flyers.

But until then, I'll have to continue using cron jobs on the secondary
node which log in to the primary with ssh, run "zfs get
storage:cluster:avs <filesystem>" on all mounted file systems, and save
the result locally for my "zpool import wrapper" script. This is a cheap
workaround, but honestly: you can use something like this in your own
datacenter, but I bet nobody wants to sell it to a customer as a
supported solution ;-)
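For completeness, the fetching half of that workaround can be sketched like this. The "remote runner" argument exists only so the function can be exercised without a real primary; the property name, host name and flag directory are my own conventions:

```shell
#!/bin/sh
# Sketch of the cron-driven half: ask the primary for the
# storage:cluster:avs user property of every file system and cache the
# answers locally for the import wrapper.

fetch_flags() {
    remote=$1      # command used to reach the primary, e.g. "ssh primary-node"
    flagdir=$2     # local cache directory, e.g. /var/run/avs-flags
    mkdir -p "$flagdir"
    for fs in $($remote zfs list -H -o name); do
        val=$($remote zfs get -H -o value storage:cluster:avs "$fs")
        # Flatten "tank/home" to "tank_home" for the cache file name.
        printf 'avs=%s\n' "$val" > "$flagdir/$(printf '%s' "$fs" | tr '/' '_')"
    done
}

# Example crontab entry (every five minutes):
# */5 * * * * /usr/local/bin/fetch_flags "ssh primary-node" /var/run/avs-flags
```

Cheap, as I said, but it keeps the flag available on the secondary even when the primary is already dead.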


-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas 
Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, 
Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss