ZFS developers have long stated that ZFS is not intended,
at least not in the near term, for clustered environments (that is, having
a pool safely imported by several nodes simultaneously). However,
many people on forums have wished for ZFS features in clusters.
I have some ideas for at least a limited implementation of clustering
which may be useful in at least some areas. If this is not just my fantasy
and is realistic to build, it might be a good start for further
optimisation of ZFS clustering for other uses.
As one use-case example, consider VM farms with VM
migration. With shared storage, the physical hosts need only
migrate the VM RAM, without copying gigabytes of data between their
individual storage. Such copying makes little sense when the
hosts' storage is mounted off the same NAS/SAN box(es), because:
* it only wastes bandwidth moving bits around the same storage;
* IP networking speed (NFS/SMB copying) may be lower than that of a
dedicated storage network between the hosts and storage (SAS, FC, etc.);
* with a pre-configured disk layout carving one storage box into LUNs
for several hosts, more slack space is wasted than with a single
pool for several hosts, all using the same "free" pool space;
* such a layout is also less scalable (e.g. if we lay out the whole SAN
for 5 hosts, it would be problematic to add a 6th server), whereas this
is not a problem when a single pool consumes the whole SAN and is
available to all server nodes.
One feature of this use-case is that specific datasets within the
potentially common pool on the NAS/SAN are still dedicated to
certain physical hosts. This would be similar to serving iSCSI
volumes or NFS datasets with individual VMs from a NAS box -
just with a faster connection over SAS/FC. Hopefully this allows
for some shortcuts in a clustered ZFS implementation, while
such solutions would still be useful in practice.
So, one version of the solution would be to have a single host
which imports the pool in read-write mode (e.g. the first one
which boots), and other hosts would write through it (via iSCSI
or whatever; maybe using SAS or FC to connect between
"reader" and "writer" hosts). However, they would read directly
from the ZFS pool using the full SAN bandwidth.
WRITES would be consistent because only one node writes
data to the active ZFS block tree using more or less the same
code and algorithms as already exist.
In order for READS to be consistent, the "readers" need only
rely on whatever latest TXG they know of, and on the cached
results of their more recent writes (between the last TXG
these nodes know of and current state).
Here's where this use-case's bonus comes in: the node which
currently uses a certain dataset and issues writes for it, is the
only one expected to write there - so even if its knowledge of
the pool is some TXGs behind, it does not matter.
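
To make the read path above a bit more concrete, here is a minimal
Python sketch of how a "reader" node could combine its last known TXG
with a cache of its own recent writes. This is only a toy model: the
ReaderView class, its fields and its methods are names I made up for
illustration, not anything that exists in ZFS, and forwarding the write
to the "writer" host is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class ReaderView:
    """Toy model of one reader node's view of a dataset (hypothetical names)."""
    known_txg: int                                # last TXG this node has seen
    on_disk: dict = field(default_factory=dict)   # block id -> data as of known_txg
    pending: dict = field(default_factory=dict)   # own writes not yet seen in a TXG

    def write(self, block, data):
        # The write would be forwarded to the writer node (not modelled here);
        # a local copy is kept so that this node's own reads stay consistent.
        self.pending[block] = data

    def read(self, block):
        # Own recent writes win over the possibly stale on-disk snapshot.
        if block in self.pending:
            return self.pending[block]
        return self.on_disk.get(block)

    def advance(self, new_txg, committed):
        # A newer TXG became known: fold in its blocks and drop pending
        # entries that have reached the disk with the same content.
        self.on_disk.update(committed)
        for block, data in committed.items():
            if self.pending.get(block) == data:
                del self.pending[block]
        self.known_txg = new_txg
```

A node that is a few TXGs behind still reads back what it wrote itself,
which is all this use-case requires as long as no other node writes to
the same dataset.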
In order to stay up to date, and "know" the current TXG completely,
the "reader nodes" should regularly read in the ZIL data (which is
in any case available and accessible as part of the pool) and expire
changed entries from their local caches.
If for some reason a "reader node" has lost track of the pool for
too long, so that ZIL data is not sufficient to update from "known
in-RAM TXG" to "current on-disk TXG", the full read-only import
can be done again (keeping track of newer TXGs appearing
while the RO import is being done).
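
The catch-up logic from the last two paragraphs could be sketched
roughly as below. Again this is a toy under loud assumptions: `log` is
just a dict standing in for whatever log records a reader could actually
obtain (whether the real ZIL carries enough information for this is
exactly the open question), and `reimport` is a hypothetical callback
representing the full read-only import.

```python
def catch_up(known_txg, cache, log, current_txg, reimport):
    """Bring a reader node from known_txg up to current_txg.

    cache    -- dict: block id -> cached data, mutated in place
    log      -- dict: txg -> set of block ids changed in that TXG
                (hypothetical stand-in for readable log records)
    reimport -- callable doing a full read-only import and returning
                the TXG it ended up at (hypothetical)
    Returns the TXG the node knows after catching up.
    """
    needed = range(known_txg + 1, current_txg + 1)
    if any(txg not in log for txg in needed):
        # There is a gap in the log, so incremental catch-up is impossible:
        # drop everything cached and start over from a fresh import.
        cache.clear()
        return reimport()
    for txg in needed:
        for block in log[txg]:
            cache.pop(block, None)  # expire entries changed since known_txg
    return current_txg
```

Note that the re-import may return a TXG newer than `current_txg`, since
new TXGs can appear while the import is running.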
Thanks to ZFS COW, nodes can expect that on-disk data (as
pointed to by block addresses/numbers) does not change.
So in the worst case, nodes would read outdated data a few
TXGs old - but not completely invalid data.
The second version of the solution is more or less the same, except
that all nodes can write to the pool hardware directly using some
dedicated block ranges "owned" by one node at a time. This
would work much like a ZIL containing both data and metadata.
Perhaps these ranges would be whole metaslabs or some other
ranges as "agreed" between the master node and other nodes.
When a node's write is completed (or a TXG sync happens), the
master node would update the ZFS block tree and uberblocks,
and those per-node-ZIL blocks which are already on disk would
become part of the ZFS tree. At this time new block ranges would
be fanned out for writes by each non-master node.
A probable optimization would be to give out several TXGs' worth
of dedicated block ranges to each node, to reduce hiccups during
any lags or even master-node re-elections.
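
As an illustration of the range hand-out and commit cycle described
above, here is a small Python sketch. The Master class, the fixed range
size and the `written` structure are all hypothetical simplifications of
mine; a real design would work in terms of metaslabs or similar, and
would have to persist the leases on disk.

```python
class Master:
    """Toy master node: leases block ranges and commits them per TXG."""

    def __init__(self, pool_blocks, range_size):
        # Start offsets of ranges nobody owns yet.
        self.free = list(range(0, pool_blocks, range_size))
        self.range_size = range_size
        self.leases = {}  # range start -> owning node
        self.tree = {}    # block address -> data, the committed "block tree"
        self.txg = 0

    def grant(self, node, count=1):
        # count > 1 hands out several TXGs' worth of ranges at once, so the
        # node can keep writing while the master is slow or being re-elected.
        granted = [self.free.pop(0) for _ in range(min(count, len(self.free)))]
        for start in granted:
            self.leases[start] = node
        return granted

    def sync(self, written):
        # written: node -> {block address: data} already on disk.
        # First check that every block lies in a range leased to its writer.
        for node, blocks in written.items():
            for addr in blocks:
                start = addr - addr % self.range_size
                if self.leases.get(start) != node:
                    raise PermissionError(f"{node} does not own block {addr}")
        # Then attach the blocks to the tree and end the leases that were used.
        for blocks in written.values():
            for addr, data in blocks.items():
                self.tree[addr] = data
                self.leases.pop(addr - addr % self.range_size, None)
        self.txg += 1
        return self.txg
```

After each sync a node would ask for fresh ranges via `grant`; ranges
that were leased but not written to simply stay leased.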
The main difference from the first solution would be in performance -
here all nodes would be writing to the pool hardware at full SAN/NAS
networking speed, and less load would fall on the "writer node".
Actually, instead of a "writer node" (responsible for translation of
LAN writes to SAN writes in the first solution), there would be a
"master node" responsible just for consistent application of TXG
updates, and for distribution of new dedicated block ranges to
other nodes for new writes. Information about such block ranges
would be kept on-disk like some sort of a "cluster ZIL" - so that
writes won't be lost in case of hardware resets, software panics,
node reelections, etc. Applying these cluster ZILs and per-node
ZIL ranges would become part of normal ZFS read-write imports
(by a master node).
There should probably be a part of the pool holding information about
the current cluster members (who has imported the pool and what
role each node performs); it could be a "ring" of blocks like the ZFS
uberblocks are now.
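
A rough Python sketch of such a membership "ring" follows. The slot
count, the JSON record layout and the CRC32 check are arbitrary choices
of mine for illustration and have nothing to do with the real uberblock
format; the only idea borrowed is "write round-robin, trust the newest
record that verifies".

```python
import json
import zlib

SLOTS = 8  # ring size; an arbitrary choice for this sketch


def encode(seq, members):
    # Record = 4-byte CRC32 followed by a JSON body (hypothetical layout).
    body = json.dumps({"seq": seq, "members": members}, sort_keys=True).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body


def decode(raw):
    # Returns the record, or None if the slot is empty or torn.
    if len(raw) < 4:
        return None
    crc, body = raw[:4], raw[4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None
    return json.loads(body)


def publish(ring, seq, members):
    # Overwrite the oldest slot; the previous record stays intact in case
    # this write is interrupted half-way.
    ring[seq % SLOTS] = encode(seq, members)


def latest(ring):
    # The current membership is the valid record with the highest sequence.
    records = [r for r in map(decode, ring) if r is not None]
    return max(records, key=lambda r: r["seq"], default=None)
```

With this, a node importing the pool would call `latest` to learn who
the master is, and a torn write of the newest record just makes it fall
back to the previous one.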
So... above I presented a couple of possible solutions to the
problem of ZFS clustering. These are "off the top of my head"
ideas, and as I am not a great specialist in storage clustering,
they are probably silly ideas with many flaws "as is". At the very
least, I already see a lot of places for possible optimisation,
and the solutions (esp. #1) may be unreliable for uses other
than VM hosting.
Other than common clustering problems (quorum, stonith,
loss of connectivity from active nodes and assumption of wrong
roles - i.e. attempted pool-mastering by several nodes), which
may be solved by different popular methods, not excluding that
part of the pool with information about cluster members, there
is also a problem of ensuring that all nodes have and use the
most current state of the pool - receiving new TXG info/ZIL
updates, and ultimately updating uberblocks ASAP.
So besides an invitation to bash these ideas and explain why they
are wrong and impossible - if they are - there is also a hope to
stir up a constructive discussion finally leading up to a working
"clustered ZFS" solution, and one more reliable than my ideas
above ;) I think there is some demand for that in the market, as
well as among enthusiasts...
Hope to see some interesting reading,
zfs-discuss mailing list