On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
> Hello all,
> ZFS developers have for a long time stated that ZFS is not intended,
> at least not in near term, for clustered environments (that is, having
> a pool safely imported by several nodes simultaneously). However,
> many people on forums have wished for ZFS features in clusters.
...and UFS before ZFS… I'd wager that every file system has this RFE in its
wish list :-)
> I have some ideas at least for a limited implementation of clustering
> which may be useful at least for some areas. If it is not my fantasy
> and if it is realistic to make - this might be a good start for further
> optimisation of ZFS clustering for other uses.
> For one use-case example, I would talk about VM farms with VM
> migration. In case of shared storage, the physical hosts need only
> migrate the VM RAM without copying gigabytes of data between their
> individual storages. Such copying makes less sense when the
> hosts' storage is mounted off the same NAS/SAN box(es), because:
> * it only wastes bandwidth moving bits around the same storage, and
This is why the best solutions use snapshots… no moving of data and
you get the added benefit of shared ARC -- increasing the logical working
set size does not increase the physical working set size.
> * IP networking speed (NFS/SMB copying) may be less than that of
> dedicated storage net between the hosts and storage (SAS, FC, etc.)
Disk access is not bandwidth bound by the channel.
> * with pre-configured disk layout from one storage box into LUNs for
> several hosts, more slack space is wasted than with having a single
> pool for several hosts, all using the same "free" pool space;
...and you die by latency of metadata traffic.
> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
> it would be problematic to add a 6th server) - but it won't be a problem
> when the single pool consumes the whole SAN and is available to
> all server nodes.
Are you assuming disk access is faster than RAM access?
> One feature of this use-case is that specific datasets within the
> potentially common pool on the NAS/SAN are still dedicated to
> certain physical hosts. This would be similar to serving iSCSI
> volumes or NFS datasets with individual VMs from a NAS box -
> just with a faster connection over SAS/FC. Hopefully this allows
> for some shortcuts in clustering ZFS implementation, while
> such solutions would still be useful in practice.
I'm still missing the connection of the problem to the solution.
The problem, as I see it today: disks are slow and not getting
faster. SSDs are fast and getting faster and lower $/IOP. Almost
all VM environments and most general purpose environments are
overprovisioned for bandwidth and underprovisioned for latency.
The Achilles' heel of solutions that cluster for bandwidth (e.g. Lustre,
QFS, pNFS, Gluster, GFS, etc.) is that you have to trade off latency.
But latency is what we need, so perhaps not the best architectural
> So, one version of the solution would be to have a single host
> which imports the pool in read-write mode (i.e. the first one
> which boots), and other hosts would write thru it (like iSCSI
> or whatever; maybe using SAS or FC to connect between
> "reader" and "writer" hosts). However they would read directly
> from the ZFS pool using the full SAN bandwidth.
> WRITES would be consistent because only one node writes
> data to the active ZFS block tree using more or less the same
> code and algorithms as already exist.
> In order for READS to be consistent, the "readers" need only
> rely on whatever latest TXG they know of, and on the cached
> results of their more recent writes (between the last TXG
> these nodes know of and current state).
> Here's where this use-case's bonus comes in: the node which
> currently uses a certain dataset and issues writes for it, is the
> only one expected to write there - so even if its knowledge of
> the pool is some TXGs behind, it does not matter.
> In order to stay up to date, and "know" the current TXG completely,
> the "reader nodes" should regularly read-in the ZIL data (anyway
> available and accessible as part of the pool) and expire changed
> entries from their local caches.
> If for some reason a "reader node" has lost track of the pool for
> too long, so that ZIL data is not sufficient to update from "known
> in-RAM TXG" to "current on-disk TXG", the full read-only import
> can be done again (keeping track of newer TXGs appearing
> while the RO import is being done).
> Thanks to ZFS COW, nodes can expect that on-disk data (as
> pointed to by block addresses/numbers) does not change.
> So in the worst case, nodes would read outdated data a few
> TXGs old - but not completely invalid data.
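To make the "reader node" scheme quoted above concrete, here is a minimal Python sketch of the catch-up logic. It is only an illustration under stated assumptions, not ZFS code: `LogRecord`, `ReaderNode` and `catch_up` are hypothetical names, each TXG is assumed to have exactly one log record listing the blocks it changed, and a "full read-only import" is modelled as simply dropping the local cache.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical log entry: which cached blocks a given TXG invalidated."""
    txg: int
    changed: frozenset


@dataclass
class ReaderNode:
    """Hypothetical reader node tracking the last TXG it fully knows about."""
    known_txg: int = 0
    cache: dict = field(default_factory=dict)  # block id -> cached data
    reimports: int = 0                         # number of full re-imports done

    def catch_up(self, log):
        """Replay log records newer than known_txg, expiring stale cache entries.

        If the log no longer covers the TXG right after known_txg (the node
        fell too far behind), fall back to a full re-import.
        """
        newer = sorted((r for r in log if r.txg > self.known_txg),
                       key=lambda r: r.txg)
        for rec in newer:
            if rec.txg != self.known_txg + 1:
                # Gap in the log: incremental update is impossible.
                self.cache.clear()
                self.reimports += 1
                self.known_txg = newer[-1].txg
                return
            for blk in rec.changed:
                self.cache.pop(blk, None)  # expire the changed entry
            self.known_txg = rec.txg
```

Note that the sketch only shows the bookkeeping; it says nothing about the latency cost of reading that log from shared storage, which is the concern raised in this reply.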
> Second version of the solution is more or less the same, except
> that all nodes can write to the pool hardware directly using some
> dedicated block ranges "owned" by one node at a time. This
> would work much like a ZIL containing both data and metadata.
> Perhaps these ranges would be whole metaslabs or some other
> ranges as "agreed" between the master node and other nodes.
> When a node's write is completed (or a TXG sync happens), the
> master node would update the ZFS block tree and uberblocks,
> and those per-node-ZIL blocks which are already on disk would
> become part of the ZFS tree. At this time new block ranges would
> be fanned out for writes by each non-master node.
> A probable optimization would be to give out several TXG's worth
> of dedicated block ranges to each node, to reduce hiccups during
> any lags or even master-node reelections.
> Main difference from the first solution would be in performance -
> here all nodes would be writing to the pool hardware at full SAN/NAS
> networking speed, and less load would come on the "writer node".
> Actually, instead of a "writer node" (responsible for translation of
> LAN writes to SAN writes in the first solution), there would be a
> "master node" responsible just for consistent application of TXG
> updates, and for distribution of new dedicated block ranges to
> other nodes for new writes. Information about such block ranges
> would be kept on-disk like some sort of a "cluster ZIL" - so that
> writes won't be lost in case of hardware resets, software panics,
> node reelections, etc. Applying these cluster ZILs and per-node
> ZIL ranges would become part of normal ZFS read-write imports
> (by a master node).
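The second proposal quoted above can be sketched roughly as follows. This is a toy model under loud assumptions, not a description of any ZFS mechanism: `RangeMaster`, `grant` and `sync` are hypothetical names, the pool is a flat run of block numbers, every lease has the same size, and the unused tail of a lease is simply abandoned rather than returned to a free list.

```python
class RangeMaster:
    """Hypothetical master node handing out exclusive block ranges."""

    def __init__(self, total_blocks, lease_size):
        self.total_blocks = total_blocks
        self.lease_size = lease_size
        self.next_free = 0
        self.leases = {}     # node name -> (start, end), half-open range
        self.committed = []  # (node, start, end) ranges merged into the tree

    def grant(self, node):
        """Give a node its dedicated range (idempotent while a lease is held)."""
        if node in self.leases:
            return self.leases[node]
        end = self.next_free + self.lease_size
        if end > self.total_blocks:
            raise RuntimeError("no free block ranges left")
        self.leases[node] = (self.next_free, end)
        self.next_free = end
        return self.leases[node]

    def sync(self, node, used):
        """At TXG sync: fold the blocks a node wrote into the tree, re-lease."""
        start, end = self.leases.pop(node)
        if not 0 <= used <= end - start:
            raise ValueError("node reports more blocks than it was leased")
        if used:
            self.committed.append((node, start, start + used))
        return self.grant(node)
```

Even in this toy form, every `sync` is a round trip to the master, so the master remains on the latency path for metadata updates.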
> Probably there should be a part of the pool with information about
> most-current cluster members (who imported the pool and what
> role that node performs); it could be a "ring" of blocks like the ZFS
> uberblocks are now.
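The "ring of blocks" idea quoted above could look something like the Python sketch below. Everything here is an assumption made for illustration: the slot count, the JSON payload, the SHA-256 checksum and the function names are hypothetical and do not reflect the real uberblock format. The only property borrowed from the quoted text is that records rotate through a fixed ring and a reader picks the newest one that validates.

```python
import hashlib
import json

RING_SLOTS = 8  # hypothetical ring size


def encode(seq, members):
    """Serialize a membership record, prefixed with a checksum of the payload."""
    payload = json.dumps({"seq": seq, "members": members},
                         sort_keys=True).encode()
    return hashlib.sha256(payload).digest() + payload


def decode(raw):
    """Return the record, or None if the slot is empty or fails its checksum."""
    if raw is None or len(raw) < 32:
        return None
    digest, payload = raw[:32], raw[32:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    return json.loads(payload)


def write_record(ring, seq, members):
    """Write record number seq into slot seq modulo the ring size."""
    ring[seq % len(ring)] = encode(seq, members)


def latest(ring):
    """Pick the valid record with the highest sequence number, if any."""
    valid = [r for r in map(decode, ring) if r is not None]
    return max(valid, key=lambda r: r["seq"], default=None)
```

A torn or corrupted newest record fails its checksum, so `latest` falls back to the previous valid one. Quorum and fencing are not addressed by this sketch at all.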
> So... above I presented a couple of possible solutions to the
> problem of ZFS clustering. These are "off the top of my head"
> ideas, and as I am not a great specialist in storage clustering,
> they are probably silly ideas with many flaws "as is". At the very
> least, I see a lot of possible optimisation locations already,
> and the solutions (esp. #1) may be unreliable for uses other
> than VM hosting.
Everything in the ZIL is also in RAM. I can read from RAM with lower
latency than reading from a shared slog. So how are you improving
> Other than common clustering problems (quorum, stonith,
> loss of connectivity from active nodes and assumption of wrong
> roles - i.e. attempted pool-mastering by several nodes), which
> may be solved by different popular methods, not excluding that
> part of the pool with information about cluster members, there
> is also a problem of ensuring that all nodes have and use the
> most current state of the pool - receiving new TXG info/ZIL
> updates, and ultimately updating uberblocks ASAP.
> So beside an invitation to bash these ideas and explain why they
> are wrong and impossible - if they are - there is also a hope to
> stir up a constructive discussion finally leading up to a working
> "clustered ZFS" solution, and one more reliable than my ideas
> above ;) I think there is some demand for that in the market, as
> well as among enthusiasts…
Definitely not impossible, but please work on the business case.
Remember, it is easier to build hardware than software, so your
software solution must be sufficiently advanced to not be obsoleted
by the next few hardware generations.
ZFS and performance consulting
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
zfs-discuss mailing list