Definitely not impossible, but please work on the business case.
Remember, it is easier to build hardware than software, so your
software solution must be sufficiently advanced to not be obsoleted
by the next few hardware generations.
I guess Richard was correct about the usecase description -
I should detail what I'm thinking about, to give some illustration.
Coming from a software company though, I tend to think of
software being the more flexible part of the equation. This is
something we have a chance to change. We use whatever
hardware is given to us from above, for years...
When thinking about the problem and its applications to life,
I have in mind blade server farms like the Intel MFSYS25, which
include relatively large internal storage and you can possibly
add external SAS storage. We use such server farms as
self-contained units (a single chassis plugged into customer's
network) for a number of projects, and recently more and more
of these deployments become VMWare ESX farms with shared
VMFS. Due to my stronger love for things Solaris, I would love
to see ZFS and any of Solaris-based hypervisors (VBox, Xen
or KVM ports) running there instead. But for things to be as
efficient, ZFS would have to become shared - clustered...
I think I have to elaborate more on this hardware, as
it tends to be our major usecase, and thus a limitation which
influences my approach to clustered ZFS and my belief about
which shortcuts are appropriate.
These boxes have a shared chassis to accommodate 6 server
blades, each with 2 CPUs and 2 or 4 gigabit ethernet ports.
The chassis also has single or dual ethernet switches to interlink
the servers and to connect to external world (10 ext ports each),
as well as single or dual storage controllers and 14 internal HDD
bays. External SAS boxes can also be attached to the storage
controller modules, but I haven't yet seen real setups like that.
In the normal "Intel usecase", the controller(s) implement several
RAID LUNs which are accessible to the servers via SAS
(with MPIO in case of dual controllers). Usually these LUNs
are dedicated to servers - for example, boot/OS volumes.
With an additional license from Intel, Shared LUNs can be
implemented on the chassis - these are primarily aimed at
VMWare farms with clustered VMFS to use available disk
space (and multiple-spindle aggregated bandwidths) more
efficiently, as well as aid in VM migration.
To be clearer, I should say that modern VM hypervisors can
migrate running virtual machines between two VM hosts.
Usually (with dedicated storage for each server host) they
do this by copying over the IP network their HDD image
files from an "old host" to "new host", transferring virtual
RAM contents, replumbing virtual networks and beginning
execution "from the same point" - after just a second-long
hiccup for finalization of the running VM's migration.
With clustered VMFS on shared storage, VMWare can
migrate VMs faster - it knows not to copy the HDD image
file in vain - it will be equally available to the "new host"
at the correct point in migration, just as it was accessible
to the "old host".
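To put rough numbers on the difference, here is a back-of-the-envelope sketch. The VM sizes (40 GB disk image, 4 GB RAM) and the assumption that the LAN sustains its theoretical 125 MB/s are mine, purely for illustration; real hypervisors also re-send dirtied RAM pages, which is ignored here.

```python
# Rough estimate of live-migration transfer time, with and without shared storage.
# All sizes and link speeds below are illustrative assumptions, not measurements.

def transfer_seconds(size_gb: float, link_mb_per_s: float) -> float:
    """Time to push size_gb gigabytes over a link sustaining link_mb_per_s MB/s."""
    return size_gb * 1024 / link_mb_per_s

def migration_seconds(disk_gb: float, ram_gb: float,
                      link_mb_per_s: float, shared_storage: bool) -> float:
    # With shared storage the disk image stays where it is; only RAM moves.
    disk = 0.0 if shared_storage else transfer_seconds(disk_gb, link_mb_per_s)
    return disk + transfer_seconds(ram_gb, link_mb_per_s)

GIGABIT_LAN = 125.0  # MB/s, theoretical ceiling of 1 Gbit/s (assumed sustained)

dedicated = migration_seconds(40, 4, GIGABIT_LAN, shared_storage=False)
shared = migration_seconds(40, 4, GIGABIT_LAN, shared_storage=True)
print(f"dedicated storage: {dedicated:.0f} s, shared storage: {shared:.0f} s")
```

Under these assumptions the copy of the image file dominates: roughly six minutes with dedicated storage against about half a minute when only the RAM has to travel.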
This is what I kind of hoped to reimplement with VirtualBox
or Xen or KVM running on OpenSolaris derivatives (such as
OpenIndiana and others), and the proposed "ZFS clustering"
using each HDD wholly as an individual LUN, aggregated into
a ZFS pool by the servers themselves. For many cases this
would also be cheaper, with OpenIndiana and free hypervisors ;)
As was rightfully noted, with a common ZFS pool as underlying
storage (as happens in current Sun VDI solutions using a ZFS
NAS), VM image clones can be instantiated quickly and efficiently
on resources - cheaper and faster than copying a golden image.
Now, at the risk of being accused of pushing some "marketing"
through the discussion list, I have to state that these servers
are relatively cheap (if compared to 6 single-unit servers of
comparable configuration, dual managed ethernet switches,
a SAN with 14 disks + dual storage controllers). Price is an
important factor in many of our deployments, where these
boxes work stand-alone.
This usually starts with a POC, when a pre-configured
basic MFSYS with some VMs of our software arrives at
a customer, gets tailored and works like a "black box".
In a year or so an upgrade may come in the form of added
disks, server blades and RAM. I have never heard even
discussions of adding external storage - too pricey, and
often useless with relatively fixed VM sizes - hence my
desire to get a single ZFS pool available to all the blades
equally. While dedicated storage boxes might be good
and great, they would bump the solution price by orders
of magnitude (StorEdge 7000 series) and are generally
out of the question for our limited deployments.
Thanks to Nico for concerns about POSIX locking. However,
hopefully, in the usecase I described - serving images of
VMs in a manner where storage, access and migration are
efficient - whole datasets (be it volumes or FS datasets)
can be dedicated to one VM host server at a time, just like
whole pools are dedicated to one host nowadays. In this
case POSIX compliance can be disregarded - access
is locked by one host, not available to others, period.
Of course, there is the problem of reclaiming storage from
hosts which died, and of avoiding corruption - but hopefully
this has been solved over the past decades of clustering technology.
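As a toy illustration of "one dataset, one host at a time", the bookkeeping could look like the sketch below. The class DatasetLeases and its methods are hypothetical names of mine, not any existing ZFS interface, and the hard part (deciding reliably that a host is dead) is deliberately left out.

```python
# Toy model of "one dataset, one owning host at a time".
# DatasetLeases and its methods are hypothetical names, not a ZFS interface.

class DatasetLeases:
    def __init__(self):
        self._owner = {}  # dataset name -> host currently holding it

    def acquire(self, dataset: str, host: str) -> bool:
        """Grant the dataset to host unless another host already holds it."""
        current = self._owner.setdefault(dataset, host)
        return current == host

    def release(self, dataset: str, host: str) -> None:
        """Give the dataset up; only the current owner may do so."""
        if self._owner.get(dataset) == host:
            del self._owner[dataset]

    def fence(self, dead_host: str) -> list:
        """Drop every lease of a host declared dead, so others may take over.
        Deciding that a host really is dead (quorum, STONITH) is out of scope."""
        freed = [d for d, h in self._owner.items() if h == dead_host]
        for d in freed:
            del self._owner[d]
        return sorted(freed)

leases = DatasetLeases()
assert leases.acquire("pool/vm1", "blade1")
assert not leases.acquire("pool/vm1", "blade2")   # locked by blade1
assert leases.fence("blade1") == ["pool/vm1"]     # blade1 declared dead
assert leases.acquire("pool/vm1", "blade2")       # takeover after fencing
```

In a real cluster this table would itself have to live somewhere all nodes agree on, which is exactly the part I hope existing clustering technology covers.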
Nico also confirmed that "one node has to be a master of
all TXGs" - which is conveyed in both ideas of my original
More directed replies below...
2011-10-12 8:15, Richard Elling wrote:
On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
> ... individual storages. Such copying makes less sense when the
> hosts' storage is mounted off the same NAS/NAS box(es), because:
> * it only wastes bandwidth moving bits around the same storage, and
This is why the best solutions use snapshots… no moving of data and
you get the added benefit of shared ARC -- increasing the logical working
set size does not increase the physical working set size.
Snapshots would be good for cloning of VMs.
They can also help with VM migration between separate hosts,
IFF both machines have some common baseline snapshot so
as to send increments, but the VM supervisor would have to be
really intimate with the FS - like VMWare is with VMFS...
I believe that in a re-architected "clustered ZFS" nothing forbids
implementation of either ARCs or L2ARCs local to each host.
I believe it is also not a very big challenge to allow non-local
L2ARCs, i.e. so that shared ZFS pool caches can be local
to individual hosts, and that the common shared ZFS pool
would have no knowledge of remote L2ARCs - so their
absence would not cause the pool to be considered corrupt.
True, though, in the case of cloned VM images, common blocks
used by different datasets on different hosts would use up
their ARCs separately and lose some benefit of shared ARC
described above by Richard.
However each blade has as much RAM as any other which
might be dedicated as a storage host, so that common
cache memory in the VM farm overall would be increased.
Moreover, such ARCs and L2ARCs local to cluster nodes
would be "cluttered" only by data relevant to that particular host.
By default there can be no L2ARCs in MFSYS, though -
unless a box of SSDs were attached as external
SAS storage and portions of it were dedicated to each
blade as LUNs via the storage controller modules.
> * IP networking speed (NFS/SMB copying) may be less than that of
> dedicated storage net between the hosts and storage (SAS, FC, etc.)
Disk access is not bandwidth bound by the channel.
In case of larger NASes or SANs, where multiple-spindle
performance would exceed say 125MB/s, a gigabit LAN
performance (iSCSI/NFS/SMB) would be a bottleneck
indeed, compared to a faster SAS or FC link (e.g. 8Gbit/s).
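The arithmetic behind that comparison, as a quick sketch. The 60 MB/s per-disk figure is my assumption for sequential streaming only (random I/O would be far lower, which is Richard's point), and protocol overhead on the links is ignored.

```python
# Convert nominal link rates to MB/s and compare with aggregate spindle throughput.
# The per-disk streaming rate is an assumed figure, not a measurement.

def gbit_to_mb_per_s(gbit: float) -> float:
    return gbit * 1000 / 8  # 1 Gbit/s = 125 MB/s, ignoring protocol overhead

def bottleneck(disks: int, mb_per_disk: float, link_gbit: float) -> str:
    """Name the slower side for purely sequential streaming."""
    spindles = disks * mb_per_disk
    link = gbit_to_mb_per_s(link_gbit)
    return "link" if link < spindles else "disks"

print(gbit_to_mb_per_s(1))       # gigabit LAN
print(gbit_to_mb_per_s(8))       # an 8 Gbit/s SAS or FC link
print(bottleneck(14, 60.0, 1))   # 14 bays behind a gigabit LAN
print(bottleneck(14, 60.0, 8))   # 14 bays behind an 8 Gbit/s link
```

With these assumed numbers the 14 spindles could stream about 840 MB/s, so the gigabit LAN is the limiting side while the 8 Gbit/s link is not.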
Even in case of the MFSYS chassis above, LUNs are
accessible using a faster link than that which could be
provided by networking of a blade dedicated to NAS
tasks and serving access to ZFS to other blades over
LAN. To say the least, disk access is common and
equal to each blade - so having a ZFS server and
serving storage over LAN only adds another layer
to latency, and possibly limits bandwidth.
> * with pre-configured disk layout from one storage box into LUNs for
> several hosts, more slack space is wasted than with having a single
> pool for several hosts, all using the same "free" pool space;
...and you die by latency of metadata traffic.
That's possible. Hopefully it can be reduced by preallocating
adequately large (sets of) fragments from the shared pool for
each server's writes, so that actual blocking exchange of
metadata would be rare. Since each server knows in advance
what on-disk blocks it can safely write into, there should be
little danger of conflict, and little added real-time latency.
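A minimal sketch of what I mean by preallocation, assuming a single master that hands out disjoint block ranges. ExtentMaster, the chunk size and the flat block numbering are all hypothetical; the real ZFS allocator (metaslabs etc.) is not modelled here.

```python
# Sketch of a master host handing out disjoint block ranges ahead of time.
# ExtentMaster, its chunk size and block numbering are hypothetical.

class ExtentMaster:
    def __init__(self, total_blocks: int, chunk: int):
        self.total = total_blocks
        self.chunk = chunk
        self.next_free = 0
        self.grants = []  # (host, first_block, end_block_exclusive)

    def grant(self, host: str) -> range:
        """Reserve the next chunk for host; granted ranges never overlap."""
        if self.next_free >= self.total:
            raise RuntimeError("pool has no unassigned blocks left")
        start = self.next_free
        end = min(start + self.chunk, self.total)
        self.next_free = end
        self.grants.append((host, start, end))
        return range(start, end)

master = ExtentMaster(total_blocks=1000, chunk=256)
a = master.grant("blade1")
b = master.grant("blade2")
assert set(a).isdisjoint(b)  # each host may write its range without asking again
```

Each host then writes into its own range without talking to anyone, and only contacts the master when the range is used up.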
> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
> it would be problematic to add a 6th server) - but it won't be a problem
> when the single pool consumes the whole SAM and is available to
> all server nodes.
Are you assuming disk access is faster than RAM access?
I am not sure how this question is relevant to the paragraph above.
Of course I don't assume THAT ;)
I take it this is due to my typo "SAM" being misinterpreted as "RAM"
while it stood for "SAN".
To reiterate that idea, a SAN, such as the 14 shared disks in the
MFSYS chassis, aggregated by RAID and cut into individual
per-server LUNs, is less scalable than a Shared LUN on the same
chassis. For example, if we have 3 blades during a POC
and distribute the whole disk array into 3 individual LUNs, there
would be no more disk space to allocate when new blades
arrive. If we don't preallocate disk space, it is wasted.
Of course, in my example we know there can be no more than
6 servers, so we can preallocate 6 LUNs, and give some servers
2 or more storage areas for a while. In non-blade setups there is
no such luxury of certain-prediction ;)
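The same point in numbers, as a small sketch. The 6000 GB array size and the per-blade usage figures are made up for illustration only.

```python
# Compare carving the array into fixed per-server LUNs with one shared pool.
# Capacities and per-blade usage are made-up numbers for illustration.

def static_lun_size(total_gb: float, luns: int) -> float:
    """Size of each LUN when the array is split equally."""
    return total_gb / luns

def can_add_blade_static(total_gb: float, lun_sizes: list, needed_gb: float) -> bool:
    # Space already cut into LUNs is gone, whether or not it is actually used.
    return total_gb - sum(lun_sizes) >= needed_gb

def can_add_blade_shared(total_gb: float, used_per_blade: list, needed_gb: float) -> bool:
    # In a shared pool only blocks really written count against free space.
    return total_gb - sum(used_per_blade) >= needed_gb

TOTAL = 6000
poc_luns = [static_lun_size(TOTAL, 3)] * 3   # whole array given to 3 POC blades
used = [500, 800, 300]                        # what those blades really consume

print(can_add_blade_static(TOTAL, poc_luns, 400))  # no room for a 4th blade
print(can_add_blade_shared(TOTAL, used, 400))      # plenty of room in the pool
```

The static layout refuses the fourth blade even though most of the disks are empty, while the shared pool still has the unused space to give.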
One feature of this use-case is that specific datasets within the
potentially common pool on the NAS/SAN are still dedicated to
certain physical hosts. This would be similar to serving iSCSI
volumes or NFS datasets with individual VMs from a NAS box -
just with a faster connection over SAS/FC. Hopefully this allows
for some shortcuts in clustering ZFS implementation, while
such solutions would still be useful in practice.
I'm still missing the connection of the problem to the solution.
The problem, as I see it today: disks are slow and not getting
faster. SSDs are fast and getting faster and lower $/IOP. Almost
all VM environments and most general purpose environments are
overprovisioned for bandwidth and underprovisioned for latency.
The Achilles' heel of solutions that cluster for bandwidth (eg lustre,
QFS, pNFS, Gluster, GFS, etc) is that you have to trade-off latency.
But latency is what we need, so perhaps not the best architectural
Again back to my MFSYS example:
* Individual server blades have no local HDDs, nor SSDs for L2ARC.
They only have CPUs, RAM, SAS-initiator and Pro/1000 chips.
* All blades access LUNs from chassis storage, no matter what.
Technically one of the servers can be provisioned as a storage
node, but it should better be redundant - taking 2 blades out of
other jobs. And repackaging disk access from SAS to LAN is
bound to be slower and/or have more latency than accessing
these disks (LUNs) directly.
Everything in the ZIL is also in RAM.
True for the local host which wrote the ZIL.
False for other hosts which use the same shared ZFS pool concurrently.
However these other hosts can read in older (flushed) ZILs to update
their local caches and general knowledge of pool metadata.
I can read from RAM with lower latency than reading from a shared
slog. So how are you improving latency?
To be honest - I don't know. But I can make some excuses ;)
1) If datasets are dedicated to hosts (i.e. with locking) there is
not much stuff going on in other parts of the pool that would be
"interesting" to hosts. They are interested in two things:
* what they can READ - safely thanks to COW, and not changed
by others thanks to "dedication" of datasets
* where they can WRITE so as not to disturb/overwrite other hosts'
new writes - distributed in advance by master-host.
In this case, latency is only added when hosts run out of assigned
block ranges for writes, and are waiting for new assigned block
ranges from master-host.
2) Improvement of latency was, truly, not considered. I am not
ready to speculate how or why it might improve or worsen.
I was after best utilization of disk space and spindles (by using
a single pool), as well as bandwidth (by using direct disk access
instead of adding a storage server in the path).
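To illustrate point 1) above: how often a host would block on the master depends only on how large the pre-assigned ranges are. This is a crude count under my own assumptions (fixed-size grants, one blocking request per grant, arbitrary write volume); it says nothing about how long each request takes.

```python
# Estimate how often a host must block on the master for a new write range.
# Grant sizes and the write volume are arbitrary; no real ZFS allocator is modelled.

def master_round_trips(blocks_written: int, chunk_blocks: int) -> int:
    """Each grant covers chunk_blocks writes; a new grant is needed when it runs out."""
    return (blocks_written + chunk_blocks - 1) // chunk_blocks  # integer ceiling

for chunk in (1, 1024, 1_048_576):
    trips = master_round_trips(10_000_000, chunk)
    print(f"chunk={chunk:>9} blocks -> {trips} blocking requests")
```

Asking per block means one request per write, while grants of about a million blocks reduce ten million writes to ten requests, which is why I hope the added latency can be kept rare.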
Other than common clustering problems (quorum, stonith,
loss of connectivity from active nodes and assumption of wrong
roles - i.e. attempted pool-mastering by several nodes), which
may be solved by different popular methods, not excluding that
part of the pool with information about cluster members, there
is also a problem of ensuring that all nodes have and use the
most current state of the pool - receiving new TXG info/ZIL
updates, and ultimately updating uberblocks ASAP.
So beside an invitation to bash these ideas and explain why they
are wrong and impossible - if they are - there is also a hope to
stir up a constructive discussion finally leading up to a working
"clustered ZFS" solution, and one more reliable than my ideas
above ;) I think there is some demand for that in the market, as
well as among enthusiasts…
| Климов Евгений, Jim Klimov |
| технический директор CTO |
| ЗАО "ЦОС и ВТ" JSC COS&HT |
| +7-903-7705859 (cellular) mailto:jimkli...@cos.ru |
| CC:ad...@cos.ru,jimkli...@mail.ru |
zfs-discuss mailing list