Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Jim Klimov

Hello all,

  A couple of months ago I wrote up some ideas about clustered
ZFS with shared storage, but the idea was generally disregarded
as not something to be done in the near term due to technological
difficulties.

  Recently I stumbled upon a Nexenta+Supermicro report [1] about
cluster-in-a-box with shared storage, boasting an "active-active"
cluster with "transparent failover". Now, I am not certain how
these two phrases fit in the same sentence, and maybe it is some
marketing-people mixup, but I can see a couple of possibilities:

1) The shared storage (all 16 disks are accessible to both
   motherboards) is split into two ZFS pools, each mounted
   by one node normally. If a node fails, another imports
   the pool and continues serving it.

2) All disks are aggregated into one pool, and one node
   serves it while another is in hot standby.

   Ideas (1) and (2) may possibly contradict the claim that
   the failover is seamless and transparent to clients.
   A pool import usually takes some time, maybe long if
   fixups are needed; and TCP sessions are likely to get
   broken. Still, maybe the clusterware solves this...


3) Nexenta did implement a shared ZFS pool with both nodes
   accessing all of the data instantly and cleanly.
   Can this be true? ;)
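
As a minimal sketch of the import-on-failure path behind options (1)
and (2) - not Nexenta's actual mechanism - with a made-up pool name,
peer address and liveness check; real clusterware adds fencing, SCSI
reservations and resource-IP failover on top of this:

    import socket
    import subprocess
    import time

    PEER = "192.168.0.2"     # heartbeat address of the other head (assumed)
    PEER_POOL = "tank2"      # pool normally owned by the peer (assumed)

    def peer_alive():
        # Crude liveness probe; real HA software uses disk heartbeats,
        # SCSI reservations and fencing, not a single TCP connect.
        try:
            socket.create_connection((PEER, 22), timeout=2).close()
            return True
        except (socket.error, OSError):
            return False

    def take_over():
        # 'zpool import -f' overrides the "pool may be in use" safeguard,
        # which is exactly what a failover needs -- and why fencing matters.
        subprocess.check_call(["zpool", "import", "-f", PEER_POOL])

    while peer_alive():
        time.sleep(2)
    take_over()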


If this is not a deeply-kept trade secret, can the Nexenta
people elaborate in technical terms how this cluster works?

[1] http://www.nexenta.com/corp/sbb?gclid=CIzBg-aEqKwCFUK9zAodCSscsA

Thanks,
//Jim Klimov


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Daniel Carosone
On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
   Recently I stumbled upon a Nexenta+Supermicro report [1] about
 cluster-in-a-box with shared storage boasting an active-active
 cluster with transparent failover. Now, I am not certain how
 these two phrases fit in the same sentence, and maybe it is some
 marketing-people mixup,

One way they need not be in conflict is if each host normally owns 8
disks and is active for those, while standing by for the other 8.

Not sure if this is what the solution in question is doing, just
saying. 

--
Dan.




Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Matt Breitbach
This is accomplished with the Nexenta HA Cluster plugin.  The plugin is
written by RSF, and you can read more about it here:
http://www.high-availability.com/

You can do either option 1 or option 2 that you put forth.  There is some
failover time, but in the latest version of Nexenta (3.1.1) there are some
additional tweaks that bring the failover time down significantly.
Depending on pool configuration and load, failover can be done in under 10
seconds based on some of my internal testing.
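
One rough way to reproduce that kind of failover-time figure from an
NFS client is to poll a file on the shared mount and time the window
during which I/O fails. The path below is an assumption, and note that
a hard-mounted share will block in stat() rather than raise an error:

    import os
    import time

    PATH = "/mnt/cluster-share/heartbeat.txt"   # illustrative mount/file

    outage_start = None
    while True:
        try:
            os.stat(PATH)
            if outage_start is not None:
                print("service restored after %.1f s"
                      % (time.time() - outage_start))
                break
        except OSError:
            if outage_start is None:
                outage_start = time.time()
        time.sleep(0.5)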

-Matt Breitbach

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
Sent: Tuesday, November 08, 2011 5:53 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

Hello all,

   A couple of months ago I wrote up some ideas about clustered
ZFS with shared storage, but the idea was generally disregarded
as not something to be done in the near term due to technological
difficulties.

   Recently I stumbled upon a Nexenta+Supermicro report [1] about
cluster-in-a-box with shared storage boasting an active-active
cluster with transparent failover. Now, I am not certain how
these two phrases fit in the same sentence, and maybe it is some
marketing-people mixup, but I can see a couple of possibilities:

1) The shared storage (all 16 disks are accessible to both
motherboards) is split into two ZFS pools, each mounted
by one node normally. If a node fails, another imports
the pool and continues serving it.

2) All disks are aggregated into one pool, and one node
serves it while another is in hot standby.

Ideas (1) and (2) may possibly contradict the claim that
the failover is seamless and transparent to clients.
A pool import usually takes some time, maybe long if
fixups are needed; and TCP sessions are likely to get
broken. Still, maybe the clusterware solves this...


3) Nexenta did implement a shared ZFS pool with both nodes
accessing all of the data instantly and cleanly.
Can this be true? ;)


If this is not a deeply-kept trade secret, can the Nexenta
people elaborate in technical terms how this cluster works?

[1] http://www.nexenta.com/corp/sbb?gclid=CIzBg-aEqKwCFUK9zAodCSscsA

Thanks,
//Jim Klimov


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Daniel Carosone
On Wed, Nov 09, 2011 at 11:09:45AM +1100, Daniel Carosone wrote:
 On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
Recently I stumbled upon a Nexenta+Supermicro report [1] about
  cluster-in-a-box with shared storage boasting an active-active
  cluster with transparent failover. Now, I am not certain how
  these two phrases fit in the same sentence, and maybe it is some
  marketing-people mixup,
 
 One way they can not be in conflict, is if each host normally owns 8
 disks and is active with it, and standby for the other 8 disks. 

Which, now that I reread it more carefully, is your case 1. 

Sorry for the noise.

--
Dan.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Nico Williams
To some people "active-active" means all cluster members serve the
same filesystems.

To others "active-active" means all cluster members serve some
filesystems and can ultimately serve all filesystems by taking over
for failed cluster members.

Nico
--


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-17 Thread Richard Elling
On Oct 15, 2011, at 12:31 PM, Toby Thain wrote:
 On 15/10/11 2:43 PM, Richard Elling wrote:
 On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:
 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Tim Cook
 
 In my example - probably not a completely clustered FS.
 A clustered ZFS pool with datasets individually owned by
 specific nodes at any given time would suffice for such
 VM farms. This would give users the benefits of ZFS
 (resilience, snapshots and clones, shared free space)
 merged with the speed of direct disk access instead of
 lagging through a storage server accessing these disks.
 
 I think I see a couple of points of disconnect.
 
 #1 - You seem to be assuming storage is slower when it's on a remote storage
 server as opposed to a local disk.  While this is typically true over
 ethernet, it's not necessarily true over infiniband or fibre channel.
 
 Ethernet has *always* been faster than a HDD. Even back when we had 3/180s
 10Mbps Ethernet it was faster than the 30ms average access time for the
 disks of the day. I tested a simple server the other day and round-trip
 for 4KB of data on a busy 1GbE switch was 0.2ms. Can you show a HDD as
 fast? Indeed many SSDs have trouble reaching that rate under load.
 
 Hmm, of course the *latency* of Ethernet has always been much less, but I did 
 not see it reaching the *throughput* of a single direct attached disk until 
 gigabit.

In practice, there are very, very, very few disk workloads that do not
involve a seek. Just one seek kills your bandwidth. But we do not define
"fast" as bandwidth, do we?
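
A back-of-the-envelope illustration of that point, using assumed (not
measured) spindle numbers of the era:

    SEEK_MS = 8.0        # assumed average seek + rotational latency, ms
    MEDIA_MBPS = 120.0   # assumed sequential media rate, MB/s

    for io_kb in (4, 64, 1024):
        xfer_ms = io_kb / 1024.0 / MEDIA_MBPS * 1000.0
        effective = (io_kb / 1024.0) / ((SEEK_MS + xfer_ms) / 1000.0)
        print("%5d KB per I/O -> %6.1f MB/s effective" % (io_kb, effective))

    # 4 KB random I/O lands well under 1 MB/s even on a fast spindle;
    # the seek, not the channel, sets the ceiling.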

 I'm pretty sure direct attached disk throughput in the Sun 3 era was much 
 better than 10Mbit Ethernet could manage. Iirc, NFS on a Sun 3 running NetBSD 
 over 10B2 was only *just* capable of streaming MP3, with tweaking, from my 
 own experiments (I ran 10B2 at home until 2004; hey, it was good enough!)

The max memory you could put into a Sun-3/280 was 32MB. There is no
possible way for such a system to handle 100 Mbps Ethernet; you could
exhaust all of main memory in about 3 seconds :-)

 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 The idea is you would dedicate one of the servers in the chassis to be a
 Solaris system, which then presents NFS out to the rest of the hosts.  

Actually, I looked into a configuration like this, and found it's useful in
some cases - 

VMware boots from a dumb disk, and does PCI pass-thru, presenting the raw
HBA to Solaris.  Create your pools, and export them over the virtual switch,
so VMware can then use the storage to hold other VMs.  Since it's going
across only a CPU-limited virtual ethernet switch, it should be nearly as
fast as local access to the disk.  In theory.  But not in practice.

I found that the max throughput of the virtual switch is around 2-3 Gbit/sec.
Never mind ZFS or storage or anything.  The CPU-limited virtual switch by
itself is a bottleneck.
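
A ceiling like that can be sanity-checked without ZFS or storage in the
picture at all, e.g. with a plain TCP push between two guests. This is
only a sketch with an arbitrary port and duration, not the tool behind
the numbers above:

    import socket
    import sys
    import time

    PORT = 5001
    CHUNK = 1 << 20      # 1 MiB send buffer

    if sys.argv[1] == "server":
        srv = socket.socket()
        srv.bind(("", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        while conn.recv(CHUNK):      # discard everything received
            pass
    else:                            # usage: client <server-ip>
        conn = socket.create_connection((sys.argv[2], PORT))
        buf = b"\0" * CHUNK
        sent, start = 0, time.time()
        while time.time() - start < 10:
            conn.sendall(buf)
            sent += CHUNK
        print("%.2f Gbit/s" % (sent * 8 / (time.time() - start) / 1e9))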

I see they're developing virtual switches with Cisco and Intel.  Maybe it'll
improve.  But I suspect they're probably adding more functionality (QoS,
etc.) rather than focusing on performance.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Tim Cook
 
 In my example - probably not a completely clustered FS.
 A clustered ZFS pool with datasets individually owned by
 specific nodes at any given time would suffice for such
 VM farms. This would give users the benefits of ZFS
 (resilience, snapshots and clones, shared free space)
 merged with the speed of direct disk access instead of
 lagging through a storage server accessing these disks.

I think I see a couple of points of disconnect.

#1 - You seem to be assuming storage is slower when it's on a remote storage
server as opposed to a local disk.  While this is typically true over
ethernet, it's not necessarily true over infiniband or fibre channel.  That
being said, I don't want to assume everyone should be shoe-horned into
infiniband or fibre channel.  There are some significant downsides of IB and
FC.  Such as cost, and centralization of the storage.  Single point of
failure, and so on.  So there is some ground to be gained...  Saving cost
and/or increasing workload distribution and/or scalability.  One size
doesn't fit all.  I like the fact that you're thinking of something
different.

#2 - You're talking about a clustered FS, but the characteristics required
are more similar to a distributed filesystem.  In a clustered FS, you have
something like a LUN on a SAN, which is a raw device simultaneously mounted
by multiple OSes.  In a distributed FS, such as lustre, you have a
configurable level of redundancy (maybe zero) distributed across multiple
systems (maybe all) and meanwhile all hosts share the same namespace.  So
each system doing heavy IO is working at local disk speeds, but any system
trying to access data that was created by another system must access that
data remotely.

If the goal is ... to do something like VMotion, including the storage...
Doing something like VMotion would be largely pointless if the VM storage
still remains on the node that was previously the compute head.

So let's imagine for a moment that you have two systems, which are connected
directly to each other over infiniband or any bus whose remote performance
is the same as local performance.  You have a zpool mirror using the local
disk and the remote disk.  Then you should be able to (theoretically) do
something like VMotion from one system to the other, and kill the original
system.  Even if the original system dies ungracefully and the VM dies with
it, you can still boot up the VM on the second system, and the only loss
you've suffered was an ungraceful reboot.
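
A sketch of how such a two-node mirror could be put together on a
Solaris-derived system, assuming the peer's disk is exported as an iSCSI
LUN; the target name, address and device names are placeholders, not a
tested recipe:

    import subprocess

    def run(*cmd):
        print("# " + " ".join(cmd))
        subprocess.check_call(cmd)

    # Make the peer's LUN visible to the local initiator (static discovery).
    run("iscsiadm", "add", "static-config",
        "iqn.2011-10.example:node2-disk,192.168.1.2:3260")
    run("iscsiadm", "modify", "discovery", "--static", "enable")

    # Mirror a local disk with the iSCSI-backed device; either node that
    # can still reach both halves may import the pool after a failure.
    run("zpool", "create", "vmpool", "mirror",
        "c0t0d0", "c0t600144F0ABCD0001d0")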

If you do the same thing over ethernet, then the performance will be
degraded to ethernet speeds.  So take it for granted, no matter what you do,
you either need a bus that performs just as well remotely versus locally...
Or else performance will be degraded...  Or else it's kind of pointless
because the VM storage lives only on the system that you want to VMotion
away from.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Richard Elling
On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Tim Cook
 
 In my example - probably not a completely clustered FS.
 A clustered ZFS pool with datasets individually owned by
 specific nodes at any given time would suffice for such
 VM farms. This would give users the benefits of ZFS
 (resilience, snapshots and clones, shared free space)
 merged with the speed of direct disk access instead of
 lagging through a storage server accessing these disks.
 
 I think I see a couple of points of disconnect.
 
 #1 - You seem to be assuming storage is slower when it's on a remote storage
 server as opposed to a local disk.  While this is typically true over
 ethernet, it's not necessarily true over infiniband or fibre channel.  

Ethernet has *always* been faster than a HDD. Even back when we had 3/180s
10Mbps Ethernet it was faster than the 30ms average access time for the
disks of the day. I tested a simple server the other day and round-trip
for 4KB of data on a busy 1GbE switch was 0.2ms. Can you show a HDD as
fast? Indeed many SSDs have trouble reaching that rate under load.
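
The rough arithmetic behind that comparison, using round illustrative
figures rather than the measurements quoted above:

    PAYLOAD_BITS = 4096 * 8          # one 4 KB block
    GBE = 1e9                        # 1 Gbit/s link

    wire_us = PAYLOAD_BITS / GBE * 1e6
    print("4 KB serialization on 1 GbE : %5.1f us" % wire_us)   # ~33 us
    print("measured LAN round trip      : ~200 us")
    print("typical HDD average access   : ~5,000-10,000 us")
    # The round trip is dominated by switch/stack latency, yet still sits
    # more than an order of magnitude below a single disk seek.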

Many people today are deploying 10GbE and it is relatively easy to get wire
speed for bandwidth and < 0.1 ms average access for storage.

Today, HDDs aren't fast, and are not getting faster.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Toby Thain

On 15/10/11 2:43 PM, Richard Elling wrote:

On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Tim Cook

In my example - probably not a completely clustered FS.
A clustered ZFS pool with datasets individually owned by
specific nodes at any given time would suffice for such
VM farms. This would give users the benefits of ZFS
(resilience, snapshots and clones, shared free space)
merged with the speed of direct disk access instead of
lagging through a storage server accessing these disks.


I think I see a couple of points of disconnect.

#1 - You seem to be assuming storage is slower when it's on a remote storage
server as opposed to a local disk.  While this is typically true over
ethernet, it's not necessarily true over infiniband or fibre channel.


Ethernet has *always* been faster than a HDD. Even back when we had 3/180s
10Mbps Ethernet it was faster than the 30ms average access time for the
disks of the day. I tested a simple server the other day and round-trip
for 4KB of data on a busy 1GbE switch was 0.2ms. Can you show a HDD as
fast? Indeed many SSDs have trouble reaching that rate under load.


Hmm, of course the *latency* of Ethernet has always been much less, but 
I did not see it reaching the *throughput* of a single direct attached 
disk until gigabit.


I'm pretty sure direct attached disk throughput in the Sun 3 era was 
much better than 10Mbit Ethernet could manage. Iirc, NFS on a Sun 3 
running NetBSD over 10B2 was only *just* capable of streaming MP3, with 
tweaking, from my own experiments (I ran 10B2 at home until 2004; hey, 
it was good enough!)


--Toby



Many people today are deploying 10GbE and it is relatively easy to get wire
speed for bandwidth and < 0.1 ms average access for storage.

Today, HDDs aren't fast, and are not getting faster.
  -- richard





Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Jim Klimov

Thanks to all that replied. I hope we may continue the discussion,
but I'm afraid the overall verdict so far is disapproval of the idea.
It is my understanding that those active in discussion considered
it either too limited (in application - for VMs, or for hardware cfg),
or too difficult to implement, so that we should rather use some
alternative solutions. Or at least research them better (thanks Nico).

I guess I am happy to not have seen replies like "won't work
at all, period" or "useless, period". I get "Difficult" and "Limited",
and hope these can be worked around sometime, and hopefully
this discussion will spark some interest in other software
authors or customers to suggest more solutions and applications -
to make some shared ZFS a possibility ;)

Still, I would like to clear up some misunderstandings in replies -
because at times we seemed to have been speaking about
different architectures. Thanks to Richard, I stated what exact
hardware I had in mind (and wanted to use most efficiently)
while thinking about this problem, and how it is different from
general extensible computers or server+NAS networks.

Namely, with the shared storage architecture built into Intel
MFSYS25 blade chassis and lack of expansibility of servers
beyond that, some suggested solutions are not applicable
(10GbE, FC, Infiniband) but some networking problems
are already solved in hardware (full and equal connectivity
between all servers and all shared storage LUNs).

So some combined replies follow below:

2011-10-15, Richard Elling and Edward Ned Harvey and Nico Williams wrote:

  #1 - You seem to be assuming storage is slower when it's on a remote storage
  server as opposed to a local disk.  While this is typically true over
  ethernet, it's not necessarily true over infiniband or fibre channel.
Many people today are deploying 10GbE and it is relatively easy to get wire
speed for bandwidth and < 0.1 ms average access for storage.


Well, I am afraid I have to reiterate: for a number of reasons including
price, our customers are choosing some specific and relatively fixed
hardware solutions. So, time and again, I am afraid I'll have to point out
the sandbox I'm tucked into - I have to make do with these boxes, and I
want to do the best with them.

I understand that Richard comes from a background where HW is the
flexible part in equations and software is designed to be used for
years. But  for many people (especially those oriented at fast-evolving
free software) the hardware is something they have to BUY and it
works unchanged as long as possible. This does not only cover
enthusiasts like the proverbial red-eyed linuxoids, but also many
small businesses. I do still maintain several decade-old computers
running infrastructure tasks (luckily, floorspace and electricity are
near-free there) which were not yet virtualized because if it ain't
broken - don't touch it! ;)

In particular, the blade chassis in my example, which I hoped to
utilize to its best using shared ZFS pools, has no extension
slots. There is no 10GbE on either the external RJ45 or the internal
ports (technically there is a 10GbE interlink between the two switch
modules), so each server blade is limited to either 2 or 4 1Gbps ports.
There is no FC. No InfiniBand. There may be one extSAS link
on each storage controller module, and that's it.


I think the biggest problem lies in requiring full
connectivity from every server to every LUN.


This is exactly (and the only) sort of connectivity available to
server blades in this chassis.

I think this is as applicable to networked storage where there
is a mesh of reliable connections between disk controllers
and disks (or at least LUNs), be it switched FC or dual-link
SAS or whatnot.


Doing something like VMotion would be largely pointless if the VM storage
still remains on the node that was previously the compute head.


True. However, in these Intel MFSYS25 boxes no server blade
has any local disks (unlike most other blades I know). Any disk
space is fed to them - and is equally accessible over a HA link -
from the storage controller modules (which are in turn connected
to the built-in array of hard-disks) that are a part of the chassis
shared by all servers, like the networking switches are.


If you do the same thing over ethernet, then the performance will be
degraded to ethernet speeds.  So take it for granted, no matter what you do,
you either need a bus that performs just as well remotely versus locally...
Or else performance will be degraded...  Or else it's kind of pointless
because the VM storage lives only on the system that you want to VMotion
away from.


Well, while this is no Infiniband, in terms of disk access this
paragraph is applicable to MFSYS chassis: disk access
via storage controller modules can be considered a fast
common bus - if this comforts readers into understanding
my idea better. And yes, I do also think that channeling
disk over ethernet via one of the servers is a bad thing
bound to degrade 

Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Tim Cook
On Sat, Oct 15, 2011 at 6:57 PM, Jim Klimov jimkli...@cos.ru wrote:

 Thanks to all that replied. I hope we may continue the discussion,
 but I'm afraid the overall verdict so far is disapproval of the idea.
 It is my understanding that those active in discussion considered
 it either too limited (in application - for VMs, or for hardware cfg),
 or too difficult to implement, so that we should rather use some
 alternative solutions. Or at least research them better (thanks Nico).

 I guess I am happy to not have seen replies like won't work
 at all, period or useless, period. I get Difficult and Limited
 and hope these can be worked around sometime, and hopefully
 this discussion would spark some interest in other software
 authors or customers to suggest more solutions and applications -
 to make some shared ZFS a possibility ;)

 Still, I would like to clear up some misunderstandings in replies -
 because at times we seemed to have been speaking about
 different architectures. Thanks to Richard, I stated what exact
 hardware I had in mind (and wanted to use most efficiently)
 while thinking about this problem, and how it is different from
 general extensible computers or server+NAS networks.

 Namely, with the shared storage architecture built into Intel
 MFSYS25 blade chassis and lack of expansibility of servers
 beyond that, some suggested solutions are not applicable
 (10GbE, FC, Infiniband) but some networking problems
 are already solved in hardware (full and equal connectivity
 between all servers and all shared storage LUNs).

 So some combined replies follow below:

 2011-10-15, Richard Elling and Edward Ned Harvey and Nico Williams wrote:

   #1 - You seem to be assuming storage is slower when it's on a remote
 storage
   server as opposed to a local disk.  While this is typically true over
   ethernet, it's not necessarily true over infiniband or fibre channel.
 Many people today are deploying 10GbE and it is relatively easy to get
 wire speed
 for bandwidth and  0.1 ms average access for storage.


 Well, I am afraid I have to reiterate: for a number of reasons including
 price, our customers are choosing some specific and relatively fixed
 hardware solutions. So, time and again, I am afraid I'll have to remind
 of the sandbox I'm tucked into - I have to do with these boxes, and I
 want to do the best with them.

 I understand that Richard comes from a background where HW is the
 flexible part in equations and software is designed to be used for
 years. But  for many people (especially those oriented at fast-evolving
 free software) the hardware is something they have to BUY and it
 works unchanged as long as possible. This does not only cover
 enthusiasts like the proverbial red-eyed linuxoids, but also many
 small businesses. I do still maintain several decade-old computers
 running infrastructure tasks (luckily, floorspace and electricity are
 near-free there) which were not yet virtualized because if it ain't
 broken - don't touch it! ;)

 In particular, the blade chassis in my example, which I hoped to
 utilize to their best, using shared ZFS pools, have no extension
 slots. There is no 10GbE for neither external RJ45 nor internal
 ports (technically there is 10GbE interlink of two switch modules),
 so each server blade is limited to have either 2 or 4 1Gbps ports.
 There is no FC. No infiniband. There may be one extSAS link
 on each storage controller module, that's it.

  I think the biggest problem lies in requiring full
 connectivity from every server to every LUN.


 This is exactly (and the only) sort of connectivity available to
 server blades in this chassis.

 I think this is as applicable to networked storage where there
 is a mesh of reliable connections between disk controllers
 and disks (or at least LUNs), be it switched FC or dual-link
 SAS or whatnot.

  Doing something like VMotion would be largely pointless if the VM storage
 still remains on the node that was previously the compute head.


 True. However, in these Intel MFSYS25 boxes no server blade
 has any local disks (unlike most other blades I know). Any disk
 space is fed to them - and is equally accessible over a HA link -
 from the storage controller modules (which are in turn connected
 to the built-in array of hard-disks) that are a part of the chassis
 shared by all servers, like the networking switches are.

  If you do the same thing over ethernet, then the performance will be
 degraded to ethernet speeds.  So take it for granted, no matter what you
 do,
 you either need a bus that performs just as well remotely versus
 locally...
 Or else performance will be degraded...  Or else it's kind of pointless
 because the VM storage lives only on the system that you want to VMotion
 away from.


 Well, while this is no Infiniband, in terms of disk access this
 paragraph is applicable to MFSYS chassis: disk access
 via storage controller modules can be considered a fast
 common bus - if this comforts readers into 

Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Jim Klimov

2011-10-16 4:14, Tim Cook wrote:
Quite frankly your choice in blade chassis was a horrible design 
decision.  From your description of its limitations it should never be 
the building block for a vmware cluster in the first place.  I would 
start by rethinking that decision instead of trying to pound a round 
ZFS peg into a square hole.


--Tim


Point taken ;)

Alas, quite often, it is not us engineers that make designs but a mix of 
bookkeeping folks and vendor marketing.


The MFSYS boxes are pushed by Intel or its partners as a good "VMware
farm in a box" - and for that they work well. As long as the storage
capacity on board (4.2TB with basic 300GB drives, or more with larger
ones, or even expanded with extSAS) is sufficient, the chassis is not a
building block of the VMware cluster. It is the cluster, all of it. The
box has many HA features, including dual-link SAS, redundant storage and
networking controllers, and so on. It is just not very expandable. But it
is relatively cheap, which as I said is an important factor for many.


For our company, as software service vendors, it is also suitable - the
customer buys almost a preconfigured appliance, plugs in power and an
ethernet uplink, and things magically work. This requires little to no
skill from customers' IT people (I won't always call them admins) to
maintain, and there are no intricate external connections to break off...


For relatively small offices, 20 external gigabit ports of two managed 
switch modules can also become the networking core for the deployment site.


Thanks,
//Jim



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Toby Thain
 
 Hmm, of course the *latency* of Ethernet has always been much less, but
 I did not see it reaching the *throughput* of a single direct attached
 disk until gigabit.

Nobody runs a single disk except in laptops, which is of course not a
relevant datum for this conversation.  If you want to remotely attach
storage, you'll need at least 1Gb per disk, if not more.  This is assuming
the bus is dedicated to storage traffic and nothing else.

Yes, 10G ether is relevant, but for the same price, IB will get 4x the
bandwidth and 10x smaller latency.  So ...

Supposing you have a single local disk and you have a dedicated 1Gb ethernet
to use for mirroring that device to something like an iscsi remote device...
That's probably reasonable.  



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 I guess Richard was correct about the usecase description -
 I should detail what I'm thinking about, to give some illustration.

After reading all this, I'm still unclear on what you want to accomplish that
isn't already done today.  Yes, I understand what it means when we say "ZFS is
not a clustering filesystem", and yes, I understand what benefits there would
be to gain if it were a clustering FS.  But in all of what you're saying below,
I don't see that you need a clustering FS.


 of these deployments become VMWare ESX farms with shared
 VMFS. Due to my stronger love for things Solaris, I would love
 to see ZFS and any of Solaris-based hypervisors (VBox, Xen
 or KVM ports) running there instead. But for things to be as
 efficient, ZFS would have to become shared - clustered...

I think the solution people currently use in this area is either NFS or iscsi.  
(Or infiniband, and other flavors.)  You have a storage server presenting the 
storage to the various vmware (or whatever) hypervisors.  Everything works.  
What's missing?  And why does this need to be a clustering FS?


 To be clearer, I should say that modern VM hypervisors can
 migrate running virtual machines between two VM hosts.

This works on NFS/iscsi/IB as well.  Doesn't need a clustering FS.


 With clustered VMFS on shared storage, VMWare can
 migrate VMs faster - it knows not to copy the HDD image
 file in vain - it will be equally available to the new host
 at the correct point in migration, just as it was accessible
 to the old host.

Again.  NFS/iscsi/IB = ok.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Jim Klimov

2011-10-14 15:53, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

I guess Richard was correct about the usecase description -
I should detail what I'm thinking about, to give some illustration.

After reading all this, I'm still unclear on what you want to accomplish, that 
isn't already done today.  Yes I understand what it means when we say ZFS is 
not a clustering filesystem, and yes I understand what benefits there would be 
to gain if it were a clustering FS.  But in all of what you're saying below, I 
don't see that you need a clustering FS.


In my example - probably not a completely clustered FS.
A clustered ZFS pool with datasets individually owned by
specific nodes at any given time would suffice for such
VM farms. This would give users the benefits of ZFS
(resilience, snapshots and clones, shared free space)
merged with the speed of direct disk access instead of
lagging through a storage server accessing these disks.
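
No such mechanism exists in ZFS today; purely as a conceptual sketch,
the proposal reduces to a cluster-wide table of which node currently
owns which dataset, plus a fenced takeover path. Every name and the
in-memory table below are invented for illustration:

    OWNERSHIP = {                     # dataset -> node currently owning it
        "vmpool/vm-web01": "blade1",
        "vmpool/vm-db01":  "blade2",
    }

    def datasets_owned_by(node):
        return sorted(ds for ds, owner in OWNERSHIP.items() if owner == node)

    def take_over(dead_node, new_node):
        # A real design would have to fence the dead node and reconcile its
        # outstanding writes before flipping ownership of its datasets.
        for ds in datasets_owned_by(dead_node):
            OWNERSHIP[ds] = new_node

    take_over("blade2", "blade1")
    print(datasets_owned_by("blade1"))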

This is why I think such a solution may be more simple
than a fully-fledged POSIX-compliant shared FS, but it
would still have some benefits for specific - and popular -
usage cases. And it might pave way for a more complete
solution - or perhaps illustrate what should not be done
for those solutions ;)

After all, I think that if the problem of safe multiple-node
RW access to ZFS gets fundamentally solved, these
usages I described before might just become a couple
of new dataset types with specific predefined usage
and limitations - like POSIX-compliant FS datasets
and block-based volumes are now defined over ZFS.
There is no reason not to call them "clustered FS" and
"clustered volume" datasets, for example ;)

AFAIK, VMFS is not a generic filesystem, and cannot
quite be used directly by software applications, but it
has its target market for shared VM farming...

I do not know how they solve the problems of consistency
control - with master nodes or something else, and for
the sake of patent un-encroaching, I'm afraid I'd rather
not know - as to not copycat someone's solution and
get burnt for that ;)




of these deployments become VMWare ESX farms with shared
VMFS. Due to my stronger love for things Solaris, I would love
to see ZFS and any of Solaris-based hypervisors (VBox, Xen
or KVM ports) running there instead. But for things to be as
efficient, ZFS would have to become shared - clustered...

I think the solution people currently use in this area is either NFS or iscsi.  
(Or infiniband, and other flavors.)  You have a storage server presenting the 
storage to the various vmware (or whatever) hypervisors.


In fact, no. Based on the MFSYS model, there is no storage server.
There is a built-in storage controller which can do RAID over HDDs
and present SCSI LUNs to the blades over direct SAS access.
These LUNs can be accessed individually by certain servers, or
concurrently. In the latter case it is possible that servers take turns
mounting a LUN as an HDD with some single-server FS, or use
a clustered FS to share the LUN's disk space simultaneously.

If we were to use in this system an OpenSolaris-based OS and
VirtualBox/Xen/KVM as they are now, and hope for live migration
of VMs without copying of data, we would have to make a separate
LUN for each VM on the controller, and mount/import this LUN on
its current running host. I don't need to explain why that would be
a clumsy and inflexible solution for a near-infinite number of
reasons, do I? ;)


  Everything works.  What's missing?  And why does this need to be a clustering 
FS?



To be clearer, I should say that modern VM hypervisors can
migrate running virtual machines between two VM hosts.

This works on NFS/iscsi/IB as well.  Doesn't need a clustering FS.

Except that the storage controller doesn't do NFS/iSCSI/IB,
and doesn't do snapshots and clones. And if I were to
dedicate one or two out of six blades to storage tasks,
this might be considered an improper waste of resources.
And it would repackage SAS access (anyway available to
all blades at full bandwidth) into NFS/iSCSI access over a
Gbit link...





With clustered VMFS on shared storage, VMWare can
migrate VMs faster - it knows not to copy the HDD image
file in vain - it will be equally available to the new host
at the correct point in migration, just as it was accessible
to the old host.

Again.  NFS/iscsi/IB = ok.


True, except that this is not an optimal solution in the described
use case - a farm of server blades with relatively dumb, fast raw
storage (but NOT an intelligent storage server).

//Jim



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Tim Cook
On Fri, Oct 14, 2011 at 7:36 AM, Jim Klimov jimkli...@cos.ru wrote:

 2011-10-14 15:53, Edward Ned Harvey wrote:

  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov

 I guess Richard was correct about the usecase description -
 I should detail what I'm thinking about, to give some illustration.

 After reading all this, I'm still unclear on what you want to accomplish,
 that isn't already done today.  Yes I understand what it means when we say
 ZFS is not a clustering filesystem, and yes I understand what benefits there
 would be to gain if it were a clustering FS.  But in all of what you're
 saying below, I don't see that you need a clustering FS.


 In my example - probably not a completely clustered FS.
 A clustered ZFS pool with datasets individually owned by
 specific nodes at any given time would suffice for such
 VM farms. This would give users the benefits of ZFS
 (resilience, snapshots and clones, shared free space)
 merged with the speed of direct disk access instead of
 lagging through a storage server accessing these disks.

 This is why I think such a solution may be more simple
 than a fully-fledged POSIX-compliant shared FS, but it
 would still have some benefits for specific - and popular -
 usage cases. And it might pave way for a more complete
 solution - or perhaps illustrate what should not be done
 for those solutions ;)

 After all, I think that if the problem of safe multiple-node
 RW access to ZFS gets fundamentally solved, these
 usages I described before might just become a couple
 of new dataset types with specific predefined usage
 and limitations - like POSIX-compliant FS datasets
 and block-based volumes are now defined over ZFS.
 There is no reason not to call them clustered FS and
 clustered volume datasets, for example ;)

 AFAIK, VMFS is not a generic filesystem, and cannot
 quite be used directly by software applications, but it
 has its target market for shared VM farming...

 I do not know how they solve the problems of consistency
 control - with master nodes or something else, and for
 the sake of patent un-encroaching, I'm afraid I'd rather
 not know - as to not copycat someone's solution and
 get burnt for that ;)



  of these deployments become VMWare ESX farms with shared
 VMFS. Due to my stronger love for things Solaris, I would love
 to see ZFS and any of Solaris-based hypervisors (VBox, Xen
 or KVM ports) running there instead. But for things to be as
 efficient, ZFS would have to become shared - clustered...

 I think the solution people currently use in this area is either NFS or
 iscsi.  (Or infiniband, and other flavors.)  You have a storage server
 presenting the storage to the various vmware (or whatever) hypervisors.


 In fact, no. Based on the MFSYS model, there is no storage server.
 There is a built-in storage controller which can do RAID over HDDs
 and represent SCSI LUNs to the blades over direct SAS access.
 These LUNs can be accessed individually by certain servers, or
 concurrently. In the latter case it is possible that servers take turns
 mounting the LUN as a HDD with some single-server FS, or use
 a clustered FS to use the LUN's disk space simultaneously.

 If we were to use in this system an OpenSolaris-based OS and
 VirtualBox/Xen/KVM as they are now, and hope for live migration
 of VMs without copying of data, we would have to make a separate
 LUN for each VM on the controller, and mount/import this LUN to
 its current running host. I don't need to explain why that would be
 a clumsy and unflexible solution for a near-infinite number of
 reasons, do i? ;)


   Everything works.  What's missing?  And why does this need to be a
 clustering FS?


  To be clearer, I should say that modern VM hypervisors can
 migrate running virtual machines between two VM hosts.

 This works on NFS/iscsi/IB as well.  Doesn't need a clustering FS.

 Except that the storage controller doesn't do NFS/iscsi/IB,
 and doesn't do snapshots and clones. And if I were to
 dedicate one or two out of six blades to storage tasks,
 this might be considered an improper waste of resources.
 And would repackage SAS access (anyway available to
 all blades at full bandwidth) into NFS/iscsi access over a
 Gbit link...




  With clustered VMFS on shared storage, VMWare can
 migrate VMs faster - it knows not to copy the HDD image
 file in vain - it will be equally available to the new host
 at the correct point in migration, just as it was accessible
 to the old host.

 Again.  NFS/iscsi/IB = ok.


 True, except that this is not an optimal solution in this described
 usecase - a farm of server blades with a relatively dumb fast raw
 storage (but NOT an intellectual storage server).

 //Jim




The idea is you would dedicate one of the servers in the chassis to be a
Solaris system, which then presents NFS out to the rest of the hosts.  From
the chassis itself you would present 

Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Jim Klimov

2011-10-14 19:33, Tim Cook wrote:



With clustered VMFS on shared storage, VMWare can
migrate VMs faster - it knows not to copy the HDD image
file in vain - it will be equally available to the new host
at the correct point in migration, just as it was accessible
to the old host.

Again.  NFS/iscsi/IB = ok.


True, except that this is not an optimal solution in this described
usecase - a farm of server blades with a relatively dumb fast raw
storage (but NOT an intellectual storage server).

//Jim




The idea is you would dedicate one of the servers in the chassis to be 
a Solaris system, which then presents NFS out to the rest of the 
hosts.  From the chassis itself you would present every drive that 
isn't being used to boot an existing server to this solaris host as 
individual disks, and let that server take care of RAID and presenting 
out the storage to the rests of the vmware hosts.


--Tim

Yes, I wrote of that as an option - but a relatively poor one
(though for now we're limited to doing this). As I repeatedly
wrote, the major downsides are:
* probably increased latency due to another added hop
of processing delays, just as with extra switches and
routers in networks;
* probably reduced bandwidth of LAN as compared to
direct disk access; certainly it won't get increased ;)
Besides, the LAN may be (highly) utilized by servers
running in VMs or physical blades, so storage traffic
over LAN would compete with real networking and/or
add to latencies.
* in order for the whole chassis to provide HA services
and run highly-available VMs, the storage servers have
to be redundant - at least one other blade would have
to be provisioned for failover ZFS import and serving
for other nodes.
This is not exactly a showstopper - but the spare blade
would either have to not run VMs at all, or run not as many
VMs as others, and in case of a pool failover event it would
probably have to migrate its running VMs away in order to
increase ARC and reduce storage latency for other servers.
That's doable, and automatable, but a hassle nonetheless.

Also I'm not certain how well other hosts can benefit from
caching in their local RAM when using NFS or iSCSI
resources. I think they might benefit more from local
ARCs if the pool were directly imported on each of them...

Upsides are:
* this already works, and reliably, as any other ZFS NAS
solution. That's a certain plus :)

In this current case one or two out of six blades should be
dedicated  to storage, leaving only 4 or 5 to VMs.

In case of shared pools, there is a new problem of
TXG-master failover to some other node (which would
probably be not slower than a pool reimport is now), but
otherwise all six servers' loads are balanced. And they
only cache what they really need. And they have faster
disk access times. And they don't use LAN superfluously
for storage access.

//Jim

PS: Anyway, I wanted to say this earlier - thanks to everyone
who responded, even (or especially) with criticism and
requests for more detail. If nothing else, you helped me
describe my idea better and less ambiguously, so that
some other thinkers can decide whether and how to
implement it ;)

PPS: When I earlier asked about getting ZFS under the
hood of RAID controllers, I guess I kinda wished to
replace the black box of Intel's firmware with a ZFS-aware
OS (FreeBSD probably) - the storage controller modules
must be some sort of computers running in a failover link...
These SCMs would then export datasets as SAS LUNs
to specific servers, like is done now, and possibly would
not require clustered ZFS - but might benefit from it too.
So my MFSYS illustration is partially relevant for that
question as well...




Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Nico Williams
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov jimkli...@cos.ru wrote:
 Thanks to Nico for concerns about POSIX locking. However,
 hopefully, in the usecase I described - serving images of
 VMs in a manner where storage, access and migration are
 efficient - whole datasets (be it volumes or FS datasets)
 can be dedicated to one VM host server at a time, just like
 whole pools are dedicated to one host nowadays. In this
 case POSIX compliance can be disregarded - access
 is locked by one host, not available to others, period.
 Of course, there is a problem of capturing storage from
 hosts which died, and avoiding corruptions - but this is
 hopefully solved in the past decades of clustering tech's.

It sounds to me like you need horizontal scaling more than anything
else.  In that case, why not use pNFS or Lustre?  Even if you want
snapshots, a VM should be able to handle that on its own, and though
probably not as nicely as ZFS in some respects, having the application
be in control of the exact snapshot boundaries does mean that you
don't have to quiesce your VMs just to snapshot safely.

 Nico also confirmed that one node has to be a master of
 all TXGs - which is conveyed in both ideas of my original
 post.

Well, at any one time one node would have to be the master of the next
TXG, but it doesn't mean that you couldn't have some cooperation.
There are lots of other much more interesting questions.  I think the
biggest problem lies in requiring full connectivity from every server
to every LUN.  I'd much rather take the Lustre / pNFS model (which,
incidentally, don't preclude having snapshots).

Nico
--


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Nico Williams
Also, it's not worth doing a clustered ZFS thing that is too
application-specific.  You really want to nail down your choices of
semantics, explore what design options those yield (or approach from
the other direction, or both), and so on.

Nico
--


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-13 Thread Jim Klimov

Hello all,


Definitely not impossible, but please work on the business case.
Remember, it is easier to build hardware than software, so your
software solution must be sufficiently advanced to not be obsoleted
by the next few hardware generations.
  -- richard


I guess Richard was correct about the usecase description -
I should detail what I'm thinking about, to give some illustration.
Coming from a software company though, I tend to think of
software being the more flexible part of equation. This is
something we have a chance to change. We use whatever
hardware is given to us from above, for years...

When thinking about the problem and its applications to life,
I have in mind blade servers farms like Intel MFSYS25 which
include relatively large internal storage and you can possibly
add external SAS storage. We use such server farms as
self-contained units (a single chassis plugged into customer's
network) for a number of projects, and recently more and more
of these deployments become VMWare ESX farms with shared
VMFS. Due to my stronger love for things Solaris, I would love
to see ZFS and any of Solaris-based hypervisors (VBox, Xen
or KVM ports) running there instead. But for things to be as
efficient, ZFS would have to become shared - clustered...

I think I would have to elaborate more on this hardware, as
it tends to be our major use case, and thus a limitation which
influences my approach to clustered ZFS and my belief about
which shortcuts are appropriate.

These boxes have a shared chassis to accommodate 6 server
blades, each with 2 CPUs and 2 or 4 gigabit ethernet ports.
The chassis also has single or dual ethernet switches to interlink
the servers and to connect to external world (10 ext ports each),
as well as single or dual storage controllers and 14 internal HDD
bays. External SAS boxes can also be attached to the storage
controller modules, but I haven't yet seen real setups like that.

In normal Intel usecase, the controller(s) implement several
RAID LUNs which are accessible to the servers via SAS
(with MPIO in case of dual controllers). Usually these LUNs
are dedicated to servers - for example, boot/OS volumes.

With an additional license from Intel, Shared LUNs can be
implemented on the chassis - these are primarily aimed at
VMWare farms with clustered VMFS to use available disk
space (and multiple-spindle aggregated bandwidths) more
efficiently, as well as aid in VM migration.

To be clearer, I should say that modern VM hypervisors can
migrate running virtual machines between two VM hosts.

Usually (with dedicated storage for each server host) they
do this by copying over the IP network their HDD image
files from an old host to new host, transferring virtual
RAM contents, replumbing virtual networks and beginning
execution from the same point - after just a second-long
hiccup for finalization of the running VM's migration.

With clustered VMFS on shared storage, VMWare can
migrate VMs faster - it knows not to copy the HDD image
file in vain - it will be equally available to the new host
at the correct point in migration, just as it was accessible
to the old host.

This is what I kind of hoped to reimplement with VirtualBox
or Xen or KVM running on OpenSolaris derivatives (such as
OpenIndiana and others), and the proposed ZFS clustering
using each HDD wholly as an individual LUN, aggregated into
a ZFS pool by the servers themselves. For many cases this
would also be cheaper, with OpenIndiana and free hypervisors ;)

As was rightfully noted, with a common ZFS pool as underlying
storage (as happens in current Sun VDI solutions using a ZFS
NAS), VM image clones can be instantiated quickly and efficiently
on resources - cheaper and faster than copying a golden image.

Now, at the risk of being accused pushing some marketing
through the discussion list, I have to state that these servers
are relatively cheap (if compared to 6 single-unit servers of
comparable configuration, dual managed ethernet switches,
a SAN with 14 disks + dual storage controllers). Price is an
important factor in many of our deployments, where these
boxes work stand-alone.

This usually starts with a POC, when a pre-configured
basic MFSYS with some VMs of our software arrives at
a customer, gets tailored and works like a black box.
In a year or so an upgrade may come in the form of added
disks, server blades and RAM. I have never even heard
discussions of adding external storage - too pricey, and
often useless with relatively fixed VM sizes - hence my
desire to get a single ZFS pool available to all the blades
equally. While dedicated storage boxes might be good
and great, they would bump the solution price by orders
of magnitude (StorEdge 7000 series) and are generally
out of the question for our limited deployments.

Thanks to Nico for concerns about POSIX locking. However,
hopefully, in the usecase I described - serving images of
VMs in a manner where storage, access and migration are
efficient - whole datasets (be it volumes or 

Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Richard Elling
On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:

 Hello all,
 
 ZFS developers have for a long time stated that ZFS is not intended,
 at least not in near term, for clustered environments (that is, having
 a pool safely imported by several nodes simultaneously). However,
 many people on forums have wished having ZFS features in clusters.

...and UFS before ZFS… I'd wager that every file system has this RFE in its
wish list :-)

 I have some ideas at least for a limited implementation of clustering
 which may be useful at least for some areas. If it is not my fantasy
 and if it is realistic to make - this might be a good start for further
 optimisation of ZFS clustering for other uses.
 
 For one use-case example, I would talk about VM farms with VM
 migration. In case of shared storage, the physical hosts need only
 migrate the VM RAM without copying gigabytes of data between their
 individual storages. Such copying makes less sense when the
 hosts' storage is mounted off the same NAS/SAN box(es), because:
 * it only wastes bandwidth moving bits around the same storage, and

This is why the best solutions use snapshots… no moving of data and
you get the added benefit of shared ARC -- increasing the logical working
set size does not increase the physical working set size.
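
Concretely, on any single-head ZFS pool today that looks something like
the following (dataset names are examples): clones of a golden image
move no data, and because clones share blocks with their origin, they
share ARC as well.

    import subprocess

    def zfs(*args):
        subprocess.check_call(("zfs",) + args)

    # One golden VM image, snapshotted once...
    zfs("snapshot", "vmpool/golden-image@v1")

    # ...and cloned per VM: each clone is writable immediately and uses
    # essentially no extra space until it diverges from the snapshot.
    for i in range(1, 4):
        zfs("clone", "vmpool/golden-image@v1", "vmpool/vm%02d" % i)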

 * IP networking speed (NFS/SMB copying) may be less than that of
 dedicated storage net between the hosts  and storage (SAS, FC, etc.)

Disk access is not bandwidth bound by the channel.

 * with pre-configured disk layout from one storage box into LUNs for
 several hosts, more slack space is wasted than with having a single
 pool for several hosts, all using the same free pool space;

...and you die by latency of metadata traffic.

 * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
 it would be problematic to add a 6th server) - but it won't be a problem
 when the single pool consumes the whole SAN and is available to
 all server nodes.

Are you assuming disk access is faster than RAM access?

 One feature of this use-case is that specific datasets within the
 potentially common pool on the NAS/SAN are still dedicated to
 certain physical hosts. This would be similar to serving iSCSI
 volumes or NFS datasets with individual VMs from a NAS box -
 just with a faster connection over SAS/FC. Hopefully this allows
 for some shortcuts in clustering ZFS implementation, while
 such solutions would still be useful in practice.

I'm still missing the connection of the problem to the solution.
The problem, as I see it today: disks are slow and not getting 
faster. SSDs are fast and getting faster and lower $/IOP. Almost
all VM environments and most general purpose environments are
overprovisioned for bandwidth and underprovisioned for latency.
The Achilles' heel of solutions that cluster for bandwidth (e.g. Lustre,
QFS, pNFS, Gluster, GFS, etc.) is that you have to trade off latency.
But low latency is what we need, so perhaps not the best architectural
solution?

 So, one version of the solution would be to have a single host
 which imports the pool in read-write mode (i.e. the first one
 which boots), and other hosts would write thru it (like iSCSI
 or whatever; maybe using SAS or FC to connect between
 reader and writer hosts). However they would read directly
 from the ZFS pool using the full SAN bandwidth.
 
 WRITES would be consistent because only one node writes
 data to the active ZFS block tree using more or less the same
 code and algorithms as already exist.
 
 
 In order for READS to be consistent, the readers need only
 rely on whatever latest TXG they know of, and on the cached
 results of their more recent writes (between the last TXG
 these nodes know of and current state).
 
 Here's where this use-case's bonus comes in: the node which
 currently uses a certain dataset and issues writes for it, is the
 only one expected to write there - so even if its knowledge of
 the pool is some TXGs behind, it does not matter.
 
 In order to stay up to date, and know the current TXG completely,
 the reader nodes should regularly read-in the ZIL data (anyway
 available and accessible as part of the pool) and expire changed
 entries from their local caches.

:-)

 If for some reason a reader node has lost track of the pool for
 too long, so that ZIL data is not sufficient to update from known
 in-RAM TXG to current on-disk TXG, the full read-only import
 can be done again (keeping track of newer TXGs appearing
 while the RO import is being done).
 
 Thanks to ZFS COW, nodes can expect that on-disk data (as
 pointed to by block addresses/numbers) does not change.
 So in the worst case, nodes would read outdated data a few
 TXGs old - but not completely invalid data.
 
 
 Second version of the solution is more or less the same, except
 that all nodes can write to the pool hardware directly using some
 dedicated block ranges owned by one node at a time. This
 would work much like a ZIL containing both data and metadata.
 Perhaps 

Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
 ZFS developers have for a long time stated that ZFS is not intended,
 at least not in near term, for clustered environments (that is, having
 a pool safely imported by several nodes simultaneously). However,
 many people on forums have wished having ZFS features in clusters.

 ...and UFS before ZFS… I'd wager that every file system has this RFE in its
 wish list :-)

Except the ones that already have it!  :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov jimkli...@cos.ru wrote:
 So, one version of the solution would be to have a single host
 which imports the pool in read-write mode (i.e. the first one
 which boots), and other hosts would write thru it (like iSCSI
 or whatever; maybe using SAS or FC to connect between
 reader and writer hosts). However they would read directly
 from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes.  You need to
send write and lock requests to one node.  And then you need to figure
out what to do about POSIX write visibility rules (i.e., when a write
should be visible to other readers).  I think you'd basically end up
not meeting POSIX in this regard, just like NFS, though perhaps not
with close-to-open semantics.

I don't think ZFS is the beast you're looking for.  You want something
more like Lustre, GPFS, and so on.  I suppose someone might surprise
us one day with properly clustered ZFS, but I think it'd be more
likely that the filesystem would be ZFS-like, not ZFS proper.

 Second version of the solution is more or less the same, except
 that all nodes can write to the pool hardware directly using some
 dedicated block ranges owned by one node at a time. This
 would work much like a ZIL containing both data and metadata.
 Perhaps these ranges would be whole metaslabs or some other
 ranges as agreed between the master node and other nodes.

This is much hairier.  You need consistency.  If two processes on
different nodes are writing to the same file, then you need to
*internally* lock around all those writes so that the on-disk
structure ends up being sane.  There are a number of things you could do
here, for example having a per-node log and one node
coalescing them (possibly one node per file, but even then one node
has to be the master of every txg).
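
A toy sketch of that shape, just to make the idea concrete (plain
Python, not ZFS code; every name here is made up):

  from collections import defaultdict

  class NodeLog:
      """Per-node intent log: a node appends its own write records locally."""
      def __init__(self, node_id):
          self.node_id = node_id
          self.records = []            # (path, offset, data) tuples

      def append(self, path, offset, data):
          self.records.append((path, offset, data))

  class TxgMaster:
      """Single master that coalesces all per-node logs into one txg."""
      def __init__(self):
          self.txg = 0
          self.state = defaultdict(dict)     # path -> {offset: data}

      def commit(self, node_logs):
          self.txg += 1
          # The hard part -- ordering conflicting writes to the same range --
          # is exactly what the locking discussion below is about; this toy
          # just applies the logs in node order.
          for log in node_logs:
              for path, offset, data in log.records:
                  self.state[path][offset] = data
              log.records.clear()
          return self.txg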

And still you need to be careful about POSIX semantics.  That does not
come for free in any design -- you will need something like the Lustre
DLM (distributed lock manager).  Or else you'll have to give up on
POSIX.
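
To give a feel for what even a trivial lock service adds to the write
path, here is a deliberately over-simplified, single-node stand-in (a
real DLM distributes the lock state and handles overlapping ranges,
revocation and recovery; none of that is shown):

  import threading

  class ToyLockManager:
      """Centralized byte-range locks, exact-range matching only."""
      def __init__(self):
          self._mutex = threading.Lock()
          self._held = {}          # (path, start, length) -> node_id

      def acquire(self, node_id, path, start, length):
          key = (path, start, length)
          with self._mutex:
              # Grant the lock if free, or confirm we already hold it.
              owner = self._held.setdefault(key, node_id)
              return owner == node_id

      def release(self, node_id, path, start, length):
          with self._mutex:
              if self._held.get((path, start, length)) == node_id:
                  del self._held[(path, start, length)]

Every POSIX-visible write from every node would pay at least one such
round trip to whoever holds the lock state, which is part of the hefty
price mentioned below.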

There's a hefty price to be paid for POSIX semantics in a clustered
environment.  You'd do well to read up on Lustre's experience in
detail.  And not just Lustre -- that would be just to start.  I
caution you that this is not a simple project.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-10 Thread Jim Klimov

Hello all,

ZFS developers have for a long time stated that ZFS is not intended,
at least not in near term, for clustered environments (that is, having
a pool safely imported by several nodes simultaneously). However,
many people on forums have wished having ZFS features in clusters.

I have some ideas at least for a limited implementation of clustering
which may be useful at least for some areas. If it is not just my fantasy
and is realistic to implement - this might be a good start for further
optimisation of ZFS clustering for other uses.

For one use-case example, I would talk about VM farms with VM
migration. In case of shared storage, the physical hosts need only
migrate the VM RAM without copying gigabytes of data between their
individual storages. Such copying makes less sense when the
hosts' storage is mounted off the same NAS/SAN box(es), because:
* it only wastes bandwidth moving bits around the same storage, and
* IP networking speed (NFS/SMB copying) may be less than that of
dedicated storage net between the hosts  and storage (SAS, FC, etc.)
* with pre-configured disk layout from one storage box into LUNs for
several hosts, more slack space is wasted than with having a single
pool for several hosts, all using the same free pool space;
* it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
it would be problematic to add a 6th server) - but it won't be a problem
when the single pool consumes the whole SAN and is available to
all server nodes.

One feature of this use-case is that specific datasets within the
potentially common pool on the NAS/SAN are still dedicated to
certain physical hosts. This would be similar to serving iSCSI
volumes or NFS datasets with individual VMs from a NAS box -
just with a faster connection over SAS/FC. Hopefully this allows
for some shortcuts in clustering ZFS implementation, while
such solutions would still be useful in practice.



So, one version of the solution would be to have a single host
which imports the pool in read-write mode (i.e. the first one
which boots), and other hosts would write thru it (like iSCSI
or whatever; maybe using SAS or FC to connect between
reader and writer hosts). However they would read directly
from the ZFS pool using the full SAN bandwidth.
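
As a rough sketch of what a client-side shim for this first variant
might look like (the names, wire format and device path are all
invented; a real version would also have to walk the block tree rather
than read at a logical offset):

  import os, socket, struct

  class SplitPathClient:
      """Writes go to the single writer node over the LAN/IP side;
      reads go straight to the shared LUN over the SAN path."""
      def __init__(self, writer_addr, shared_dev="/dev/dsk/shared_lun"):
          self.sock = socket.create_connection(writer_addr)
          self.dev = os.open(shared_dev, os.O_RDONLY)

      def write(self, offset, data):
          # Only the writer node updates the ZFS block tree, so we just
          # ship it (offset, length, payload) and let it do the rest.
          self.sock.sendall(struct.pack("!QI", offset, len(data)) + data)

      def read(self, offset, length):
          # Full SAN bandwidth, no detour through the writer node.
          return os.pread(self.dev, length, offset)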

WRITES would be consistent because only one node writes
data to the active ZFS block tree using more or less the same
code and algorithms as already exist.


In order for READS to be consistent, the readers need only
rely on whatever latest TXG they know of, and on the cached
results of their more recent writes (between the last TXG
these nodes know of and current state).

Here's where this use-case's bonus comes in: the node which
currently uses a certain dataset and issues writes for it, is the
only one expected to write there - so even if its knowledge of
the pool is some TXGs behind, it does not matter.

In order to stay up to date, and know the current TXG completely,
the reader nodes should regularly read-in the ZIL data (anyway
available and accessible as part of the pool) and expire changed
entries from their local caches.

If for some reason a reader node has lost track of the pool for
too long, so that ZIL data is not sufficient to update from known
in-RAM TXG to current on-disk TXG, the full read-only import
can be done again (keeping track of newer TXGs appearing
while the RO import is being done).
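
In pseudo-Python, the reader-side loop I have in mind would be roughly
the following (the pool interface - current_txg(), log_entries(),
reimport_readonly() - is hypothetical, just naming the operations
described above):

  class ReaderNode:
      MAX_TXG_GAP = 32                  # arbitrary threshold

      def __init__(self, pool):
          self.pool = pool
          self.known_txg = pool.current_txg()
          self.cache = {}               # block id -> cached data

      def refresh(self):
          latest = self.pool.current_txg()
          if latest - self.known_txg > self.MAX_TXG_GAP:
              # Fell too far behind: drop everything, re-import read-only.
              self.cache.clear()
              self.known_txg = self.pool.reimport_readonly()
              return
          for entry in self.pool.log_entries(since=self.known_txg):
              # Expire the cached copy of any block somebody has rewritten.
              self.cache.pop(entry.block_id, None)
          self.known_txg = latest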

Thanks to ZFS COW, nodes can expect that on-disk data (as
pointed to by block addresses/numbers) does not change.
So in the worst case, nodes would read outdated data a few
TXGs old - but not completely invalid data.


Second version of the solution is more or less the same, except
that all nodes can write to the pool hardware directly using some
dedicated block ranges owned by one node at a time. This
would work much like a ZIL containing both data and metadata.
Perhaps these ranges would be whole metaslabs or some other
ranges as agreed between the master node and other nodes.

When a node's write is completed (or a TXG sync happens), the
master node would update the ZFS block tree and uberblocks,
and those per-node-ZIL blocks which are already on disk would
become part of the ZFS tree. At this time new block ranges would
be fanned out for writes by each non-master node.

A probable optimization would be to give out several TXG's worth
of dedicated block ranges to each node, to reduce hickups during
any lags or even master-node reelections.
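
Again in very rough pseudo-Python, the master's side of that range
handout and txg sync could look like this (all of it invented for
illustration; ranges are just (start, length) tuples, e.g. whole
metaslabs):

  from collections import deque

  class RangeMaster:
      LEASE_RANGES = 3                  # a few txgs' worth per node

      def __init__(self, free_ranges):
          self.free = deque(free_ranges)
          self.leases = {}              # node_id -> [ranges]

      def lease_ranges(self, node_id):
          grant = []
          while self.free and len(grant) < self.LEASE_RANGES:
              grant.append(self.free.popleft())
          self.leases.setdefault(node_id, []).extend(grant)
          return grant

      def txg_sync(self, completed):
          # 'completed' maps node_id -> ranges whose log blocks are on disk.
          # A real master would now link those blocks into the ZFS tree and
          # roll the uberblock; here we only retire the leases and hand out
          # fresh ranges for the next txgs.
          for node_id, ranges in completed.items():
              held = self.leases.get(node_id, [])
              for r in ranges:
                  if r in held:
                      held.remove(r)
          return {node_id: self.lease_ranges(node_id) for node_id in completed}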

Main difference from the first solution would be in performance -
here all nodes would be writing to the pool hardware at full SAN/NAS
networking speed, and less load would come on the writer node.
Actually, instead of a writer node (responsible for translation of
LAN writes to SAN writes in the first solution), there would be a
master node responsible just for consistent application of TXG
updates, and for distribution of new dedicated block ranges to
other nodes for new writes. Information about such block ranges
would be kept