On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov <jimkli...@cos.ru> wrote:
> So, one version of the solution would be to have a single host
> which imports the pool in read-write mode (i.e. the first one
> which boots), and other hosts would write thru it (like iSCSI
> or whatever; maybe using SAS or FC to connect between
> "reader" and "writer" hosts). However they would read directly
> from the ZFS pool using the full SAN bandwidth.
You need to do more than simply assign a node for writes. You need to
send write and lock requests to one node. And then you need to figure
out what to do about POSIX write visibility rules (i.e., when a write
should be visible to other readers). I think you'd basically end up
not meeting POSIX in this regard, just like NFS, though perhaps not
with close-to-open semantics.
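To make the visibility problem concrete, here is a minimal sketch in Python. All of the names (WriterNode, ReaderNode, revalidate) are made up for illustration and none of this is ZFS code. It assumes one node owns every write and lock request, and that the other nodes cache whatever they last read. For simplicity the reader fetches through the writer object, where the proposal above would have it read the shared disks directly.

```python
import threading
from collections import defaultdict


class WriterNode:
    """Hypothetical single node that serializes all writes and locks."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._data = {}                   # path -> bytes
        self._version = defaultdict(int)  # path -> number of writes so far
        self._lock_owner = {}             # path -> node holding the lock

    def lock(self, path, node):
        # Grant the lock only if it is free or already held by this node.
        with self._mutex:
            if self._lock_owner.get(path, node) != node:
                return False
            self._lock_owner[path] = node
            return True

    def unlock(self, path, node):
        with self._mutex:
            if self._lock_owner.get(path) == node:
                del self._lock_owner[path]

    def write(self, path, node, payload):
        # Writes are refused unless the caller holds the lock.
        with self._mutex:
            if self._lock_owner.get(path) != node:
                raise PermissionError(f"{node} does not hold the lock on {path}")
            self._data[path] = payload
            self._version[path] += 1
            return self._version[path]

    def read(self, path):
        with self._mutex:
            return self._version[path], self._data.get(path, b"")


class ReaderNode:
    """Hypothetical node that caches what it last read."""

    def __init__(self, name, writer):
        self.name = name
        self.writer = writer
        self._cache = {}  # path -> (version, bytes)

    def read(self, path, revalidate=False):
        # Without revalidation a cached copy is returned even if it is stale.
        if revalidate or path not in self._cache:
            self._cache[path] = self.writer.read(path)
        return self._cache[path][1]


writer = WriterNode()
b = ReaderNode("b", writer)

got_lock_a = writer.lock("/f", "a")
writer.write("/f", "a", b"v1")
first = b.read("/f")                   # populates b's cache
writer.write("/f", "a", b"v2")
stale = b.read("/f")                   # served from cache, misses v2
fresh = b.read("/f", revalidate=True)  # asks the writer again
got_lock_b = writer.lock("/f", "b")    # refused while "a" holds the lock
```

In this sketch node "b" keeps returning the old contents until it revalidates, which is the kind of behaviour POSIX does not allow on a local filesystem. Closing that gap means either revalidating on every read or having the writer call back into every caching node.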
I don't think ZFS is the beast you're looking for. You want something
more like Lustre, GPFS, and so on. I suppose someone might surprise
us one day with properly clustered ZFS, but I think it'd be more
likely that the filesystem would be ZFS-like, not ZFS proper.
> Second version of the solution is more or less the same, except
> that all nodes can write to the pool hardware directly using some
> dedicated block ranges "owned" by one node at a time. This
> would work much like a ZIL containing both data and metadata.
> Perhaps these ranges would be whole metaslabs or some other
> ranges as "agreed" between the master node and other nodes.
This is much hairier. You need consistency. If two processes on
different nodes are writing to the same file, then you need to
*internally* lock around all those writes so that the on-disk
structure ends up being sane. There are a number of things you could do
here, such as, for example, having a per-node log and one node
coalescing them (possibly one node per-file, but even then one node
has to be the master of every txg).
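A rough Python sketch of the per-node log idea, purely as an illustration: LogRecord, NodeLog and Master are hypothetical names, and the ordering rule (per-node sequence number, then node name) is an arbitrary assumption. A real design would have to derive the order from whatever locking protocol the nodes use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    node: str       # node that issued the write
    seq: int        # per-node sequence number
    path: str
    offset: int
    payload: bytes


class NodeLog:
    """Hypothetical per-node intent log."""

    def __init__(self, node):
        self.node = node
        self._seq = 0
        self.records = []

    def append(self, path, offset, payload):
        self._seq += 1
        rec = LogRecord(self.node, self._seq, path, offset, payload)
        self.records.append(rec)
        return rec

    def drain(self):
        # Hand all pending records to the master and start a fresh log.
        out, self.records = self.records, []
        return out


class Master:
    """Hypothetical node that coalesces all logs into one txg at a time."""

    def __init__(self):
        self.txg = 0
        self.files = {}  # path -> bytearray

    def commit_txg(self, logs):
        records = []
        for log in logs:
            records.extend(log.drain())
        # Assumed ordering rule: one deterministic order for every node.
        records.sort(key=lambda r: (r.seq, r.node))
        for r in records:
            buf = self.files.setdefault(r.path, bytearray())
            end = r.offset + len(r.payload)
            if len(buf) < end:
                buf.extend(b"\0" * (end - len(buf)))
            buf[r.offset:end] = r.payload
        self.txg += 1
        return self.txg, len(records)


log_a, log_b = NodeLog("a"), NodeLog("b")
log_a.append("/f", 0, b"aaaa")
log_b.append("/f", 2, b"bb")   # overlaps the write from node "a"

master = Master()
txg, applied = master.commit_txg([log_a, log_b])
contents = bytes(master.files["/f"])
```

The two writes overlap, and the result depends entirely on the order the master picks. That is the part the nodes would have to agree on before the on-disk state can be called sane.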
And still you need to be careful about POSIX semantics. That does not
come for free in any design -- you will need something like the Lustre
DLM (distributed lock manager). Or else you'll have to give up on POSIX semantics.
There's a hefty price to be paid for POSIX semantics in a clustered
environment. You'd do well to read up on Lustre's experience in
detail. And not just Lustre -- that would be just to start. I
caution you that this is not a simple project.
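To give a feel for where the price comes from, here is a toy lock manager in Python. It is not Lustre's DLM or its API. The class, the two modes ("S" shared, "X" exclusive) and the revocation counter are all assumptions made for the sketch, and revocation is modelled as instantaneous, where a real system would wait for the holder to flush and acknowledge.

```python
class LockManager:
    """Toy lock manager: shared ("S") or exclusive ("X") locks per resource."""

    def __init__(self):
        self.holders = {}     # resource -> {node: mode}
        self.revocations = 0  # messages needed to take locks back from holders

    def acquire(self, resource, node, mode):
        if mode not in ("S", "X"):
            raise ValueError("mode must be 'S' or 'X'")
        held = self.holders.setdefault(resource, {})
        for other, other_mode in list(held.items()):
            # Shared locks coexist; anything involving "X" conflicts.
            if other != node and (mode == "X" or other_mode == "X"):
                # The holder would have to flush or invalidate its cache here.
                del held[other]
                self.revocations += 1
        held[node] = mode


dlm = LockManager()
dlm.acquire("/f", "a", "S")
dlm.acquire("/f", "b", "S")
after_readers = dlm.revocations   # two readers share the lock

dlm.acquire("/f", "c", "X")
after_writer = dlm.revocations    # the writer revokes both readers

dlm.acquire("/f", "a", "S")
after_reread = dlm.revocations    # the reader revokes the writer
```

Every conflicting access costs at least one round trip per current holder, and that is before handling node failures, lock recovery, or deadlock.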
zfs-discuss mailing list