Re: [zfs-discuss] cluster vs nfs

2012-05-01 Thread Maurice R Volaski
Instead we've switched to Linux and DRBD.  And if that doesn't get me
sympathy I don't know what will.

SvSAN does something similar and it does it rather well, I think.
http://www.stormagic.com/SvSAN.php


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
 Reboot requirement is a lame client implementation.

And lame protocol design.  You could possibly migrate read-write NFSv3
on the fly by preserving FHs and somehow updating the clients to go to
the new server (with a hiccup in between, no doubt), but only entire
shares at a time -- you could not migrate only part of a volume with
NFSv3.

Of course, having migration support in the protocol does not equate to
getting it in the implementation, but it's certainly a good step in
that direction.

 You are correct, a ZFS send/receive will result in different file handles on
 the receiver, just like
 rsync, tar, ufsdump+ufsrestore, etc.

That's understandable for NFSv2 and v3, but for v4 there's no reason
that an NFSv4 server stack and ZFS could not arrange to preserve FHs
(if, perhaps, at the price of making the v4 FHs rather large).
Although even for v3 it should be possible for servers in a cluster to
arrange to preserve devids...

Bottom line: live migration needs to be built right into the protocol.

For me one of the exciting things about Lustre was/is the idea that
you could just have a single volume where all new data (and metadata)
is distributed evenly as you go.  Need more storage?  Plug it in,
either to an existing head or via a new head, then flip a switch and
there it is.  No need to manage allocation.  Migration may still be
needed, both within a cluster and between clusters, but that's much
more manageable when you have a protocol where data locations can be
all over the place in a completely transparent manner.

Nico
--


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Fred Liu
I'll jump into this loop with a different alternative -- an IP-based block device.
I have seen a few successful cases with HAST + UCARP + ZFS + FreeBSD.
If zfsonlinux is robust enough, trying DRBD + Pacemaker + ZFS + Linux is
definitely encouraged.
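
For reference, the layering being suggested looks roughly like the sketch below (a minimal, hypothetical example: the DRBD resource name r0, pool and dataset names are made up, the Pacemaker resources that would automate the failover are left out, and on Linux the sharenfs property may need to be replaced with a plain /etc/exports entry):

(on the active node: bring up and promote the DRBD device, then build the pool on it)
# drbdadm up r0
# drbdadm primary r0
# zpool create tank /dev/drbd0
# zfs create -o sharenfs=on tank/export

(on failover, after the old primary has exported the pool and demoted itself)
# drbdadm primary r0
# zpool import -f tank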

Thanks.


Fred

 -----Original Message-----
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 Sent: Thursday, April 26, 2012 14:00
 To: Richard Elling
 Cc: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] cluster vs nfs
 
 On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
 richard.ell...@gmail.com wrote:
  On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
  Reboot requirement is a lame client implementation.
 
 And lame protocol design.  You could possibly migrate read-write NFSv3
 on the fly by preserving FHs and somehow updating the clients to go to
 the new server (with a hiccup in between, no doubt), but only entire
 shares at a time -- you could not migrate only part of a volume with
 NFSv3.
 
 Of course, having migration support in the protocol does not equate to
 getting it in the implementation, but it's certainly a good step in
 that direction.
 
  You are correct, a ZFS send/receive will result in different file
 handles on
  the receiver, just like
  rsync, tar, ufsdump+ufsrestore, etc.
 
 That's understandable for NFSv2 and v3, but for v4 there's no reason
 that an NFSv4 server stack and ZFS could not arrange to preserve FHs
 (if, perhaps, at the price of making the v4 FHs rather large).
 Although even for v3 it should be possible for servers in a cluster to
 arrange to preserve devids...
 
 Bottom line: live migration needs to be built right into the protocol.
 
 For me one of the exciting things about Lustre was/is the idea that
 you could just have a single volume where all new data (and metadata)
 is distributed evenly as you go.  Need more storage?  Plug it in,
 either to an existing head or via a new head, then flip a switch and
 there it is.  No need to manage allocation.  Migration may still be
 needed, both within a cluster and between clusters, but that's much
 more manageable when you have a protocol where data locations can be
 all over the place in a completely transparent manner.
 
 Nico
 --



Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Jim Klimov

On 2012-04-26 2:20, Ian Collins wrote:

On 04/26/12 09:54 AM, Bob Friesenhahn wrote:

On Wed, 25 Apr 2012, Rich Teer wrote:

Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server? Much simpler than
your example, and all data is available on all machines/nodes.

This solution would limit bandwidth to that available from that single
server. With the cluster approach, the objective is for each machine
in the cluster to primarily access files which are stored locally.
Whole files could be moved as necessary.


Distributed software building faces similar issues, but I've found once
the common files have been read (and cached) by each node, network
traffic becomes one way (to the file server). I guess that topology
works well when most access to shared data is read.


Which reminds me: older Solarises used to have a nifty-looking
(via descriptions) cachefs, apparently to speed up NFS clients
and reduce traffic, which we did not get to really use in real
life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
it is in illumos either.

Does caching in current Solaris/illumos NFS client replace those
benefits, or did the project have some merits of its own (like
caching into local storage of client, so that the cache was not
empty after reboot)?


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Tomas Forsman
On 26 April, 2012 - Jim Klimov sent me these 1,6K bytes:

 Which reminds me: older Solarises used to have a nifty-looking
 (via descriptions) cachefs, apparently to speed up NFS clients
 and reduce traffic, which we did not get to really use in real
 life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
 it is in illumos either.

 Does caching in current Solaris/illumos NFS client replace those
 benefits, or did the project have some merits of its own (like
 caching into local storage of client, so that the cache was not
 empty after reboot)?

It had its share of merits and bugs.

/Tomas
-- 
Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Ian Collins

On 04/26/12 10:12 PM, Jim Klimov wrote:

On 2012-04-26 2:20, Ian Collins wrote:

On 04/26/12 09:54 AM, Bob Friesenhahn wrote:

On Wed, 25 Apr 2012, Rich Teer wrote:

Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server? Much simpler than
your example, and all data is available on all machines/nodes.

This solution would limit bandwidth to that available from that single
server. With the cluster approach, the objective is for each machine
in the cluster to primarily access files which are stored locally.
Whole files could be moved as necessary.

Distributed software building faces similar issues, but I've found once
the common files have been read (and cached) by each node, network
traffic becomes one way (to the file server). I guess that topology
works well when most access to shared data is read.

Which reminds me: older Solarises used to have a nifty-looking
(via descriptions) cachefs, apparently to speed up NFS clients
and reduce traffic, which we did not get to really use in real
life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
it is in illumos either.


I don't think it even made it into Solaris 10. I used to use it with 
Solaris 8 back in the days when 100Mb switches were exotic!

Does caching in current Solaris/illumos NFS client replace those
benefits, or did the project have some merits of its own (like
caching into local storage of client, so that the cache was not
empty after reboot)?

It did have local backing store, but my current desktop has more RAM 
than that Solaris 8 box had disk and my network is 100 times faster, so 
it doesn't really matter any more.


--
Ian.



Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Jim Klimov

On 2012-04-26 14:47, Ian Collins wrote:
 I don't think it even made it into Solaris 10.

Actually, I see the kernel modules available in Solaris 10,
several builds of OpenSolaris SXCE, and current illumos.

$ find /kernel/ /platform/ /usr/platform/ /usr/kernel/ | grep -i cachefs
/kernel/fs/amd64/cachefs
/kernel/fs/cachefs
/platform/i86pc/amd64/archive_cache/kernel/fs/amd64/cachefs
/platform/i86pc/archive_cache/kernel/fs/cachefs

$ uname -a
SunOS summit-blade5 5.11 oi_151a2 i86pc i386 i86pc


It did have local backing store, but my current desktop has more RAM
than that Solaris 8 box had disk and my network is 100 times faster, so
it doesn't really matter any more.


Well, it depends on your working set size. A matter of scale.

If those researchers dig into their terabyte of data each
(each seems important here for conflict/sync resolution),
on a gigabit-connected workstation, it would still take
them a couple of hours to just download the dataset from
the server, let alone random-seek around it afterwards.

And you can easily have a local backing store for such
cachefs (or equivalent) today, even on an SSD or a few.

Just my 2c for a possible build of that cluster they wanted,
and perhaps some evolution/revival of cachefs with today's
realities and demands - if it's deemed appropriate for
their task.

MY THEORY based on marketing info:  I believe they could
make a central fileserver with enough data space for
everyone, and each worker would use cachefs+nfs to access
it. Their actual worksets would be stored locally in the
cachefs backing stores on each workstation, so they would not
hammer the network or the fileserver until there
are writes to be replicated back into central storage.
They would have approximately one common share to mount ;)
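
For anyone who wants to try it, the per-workstation setup would have looked roughly like this (a sketch of the old CacheFS tooling; server name, share and cache directory are made up):

# cfsadmin -c /var/cache/nfscache
# mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/nfscache \
      fileserver:/export/work /work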

//Jim


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Deepak Honnalli

On 04/26/12 04:17 PM, Ian Collins wrote:

On 04/26/12 10:12 PM, Jim Klimov wrote:

On 2012-04-26 2:20, Ian Collins wrote:

On 04/26/12 09:54 AM, Bob Friesenhahn wrote:

On Wed, 25 Apr 2012, Rich Teer wrote:

Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server? Much simpler than
your example, and all data is available on all machines/nodes.

This solution would limit bandwidth to that available from that single
server. With the cluster approach, the objective is for each machine
in the cluster to primarily access files which are stored locally.
Whole files could be moved as necessary.

Distributed software building faces similar issues, but I've found once
the common files have been read (and cached) by each node, network
traffic becomes one way (to the file server). I guess that topology
works well when most access to shared data is read.

Which reminds me: older Solarises used to have a nifty-looking
(via descriptions) cachefs, apparently to speed up NFS clients
and reduce traffic, which we did not get to really use in real
life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
it is in illumos either.


I don't think it even made it into Solaris 10. I used to use it with 
Solaris 8 back in the days when 100Mb switches were exotic!


cachefs is present in Solaris 10. It is EOL'd in S11.


Does caching in current Solaris/illumos NFS client replace those
benefits, or did the project have some merits of its own (like
caching into local storage of client, so that the cache was not
empty after reboot)?

It did have local backing store, but my current desktop has more RAM 
than that Solaris 8 box had disk and my network is 100 times faster, 
so it doesn't really matter any more.






Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Freddie Cash
On Thu, Apr 26, 2012 at 4:34 AM, Deepak Honnalli
deepak.honna...@oracle.com wrote:
    cachefs is present in Solaris 10. It is EOL'd in S11.

And for those who need/want to use Linux, the equivalent is FSCache.
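
Roughly (a hedged sketch; the export path is hypothetical): the cachefilesd daemon manages the on-disk cache, configured in /etc/cachefilesd.conf, and the fsc mount option opts an NFS mount into it.

# service cachefilesd start
# mount -t nfs -o fsc fileserver:/export/work /work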

-- 
Freddie Cash
fjwc...@gmail.com


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Carson Gaspar

On 4/25/12 10:10 PM, Richard Elling wrote:

On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:


And applications that don't pin the mount points, and can be idled
during the migration. If your migration is due to a dead server, and
you have pending writes, you have no choice but to reboot the
client(s) (and accept the data loss, of course).


Reboot requirement is a lame client implementation.


Then it's a lame client misfeature of every single NFS client I've ever 
seen, assuming the mount is hard (and if a RW mount isn't, you're crazy).



To bring this back to ZFS, sadly ZFS doesn't support NFS HA without
shared / replicated storage, as ZFS send / recv can't preserve the
data necessary to have the same NFS filehandle, so failing over to a
replica causes stale NFS filehandles on the clients. Which frustrates
me, because the technology to do NFS shadow copy (which is possible in
Solaris - not sure about the open source forks) is a superset of that
needed to do HA, but can't be used for HA.


You are correct, a ZFS send/receive will result in different file
handles on the receiver, just like
rsync, tar, ufsdump+ufsrestore, etc.


But unlike SnapMirror.


It is possible to preserve NFSv[23] file handles in a ZFS environment
using lower-level replication
like TrueCopy, SRDF, AVS, etc. But those have other architectural issues
(aka suckage). I am
open to looking at what it would take to make a ZFS-friendly replicator
that would do this, but
need to know the business case [1]


See below.


The beauty of AFS and others is that the file handle equivalent is not
a number. NFSv4 also has
this feature. So I have a little bit of heartburn when people say, NFS
sux because it has a feature
I won't use because I won't upgrade to NFSv4 even though it was released
10 years ago.


NFSv4 implementations are still iffy. We've tried it - it hasn't been 
stable (on Linux, at least). However we haven't tested RHEL6 yet. Are 
you saying that if we have a Solaris NFSv4 server serving Solaris and 
Linux NFSv4 clients with ZFS send/recv replication, that we can flip a 
VIP to point to the replica target and the clients won't get stale 
filehandles? Or that this is not the case today, but would be easier to 
make the case than for v[23] filehandles?



[1] FWIW, you can build a metropolitan area ZFS-based, shared storage
cluster today for about 1/4
the cost of the NetApp Stretch Metro software license. There is more
than one way to skin a cat :-)
So if the idea is to get even lower than 1/4 the NetApp cost, it feels
like a race to the bottom.


Shared storage is evil (in this context). Corrupt the storage, and you 
have no DR. That goes for all block-based replication products as well. 
This is not acceptable risk. I keep looking for a non-block-based 
replication system that allows seamless client failover, and can't find 
anything but NetApp SnapMirror. Please tell me I haven't been looking 
hard enough. Lustre et. al. don't support Solaris clients (which I find 
hilarious as Oracle owns it). I could build something on top of / under 
AFS for RW replication if I tried hard, but it would be fairly fragile.


--
Carson



Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread J.P. King


Shared storage is evil (in this context). Corrupt the storage, and you have 
no DR.


Now I am confused.  We're talking about storage which can be used for 
failover, aren't we? In which case we are talking about HA not DR.


That goes for all block-based replication products as well. This is 
not acceptable risk. I keep looking for a non-block-based replication system 
that allows seamless client failover, and can't find anything but NetApp 
SnapMirror.


I don't know SnapMirror, so I may be mistaken, but I don't see how you can 
have non-synchronous replication which can allow for seamless client 
failover (in the general case).  Technically this doesn't have to be block 
based, but I've not seen anything which wasn't.  Synchronous replication 
pretty much precludes DR (again, I can think of theoretical ways around 
this, but have never come across anything in practice).




Carson


Julian



Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Carson Gaspar

On 4/26/12 2:17 PM, J.P. King wrote:



Shared storage is evil (in this context). Corrupt the storage, and you
have no DR.


Now I am confused. We're talking about storage which can be used for
failover, aren't we? In which case we are talking about HA not DR.


Depends on how you define DR - we have shared storage HA in each 
datacenter (NetApp cluster), and replication between them in case we 
lose a datacenter (all clients on the MAN hit the same cluster unless we 
do a DR failover). The latter is what I'm calling DR.



That goes for all block-based replication products as well. This is
not acceptable risk. I keep looking for a non-block-based replication
system that allows seamless client failover, and can't find anything
but NetApp SnapMirror.


I don't know SnapMirror, so I may be mistaken, but I don't see how you
can have non-synchronous replication which can allow for seamless client
failover (in the general case). Technically this doesn't have to be
block based, but I've not seen anything which wasn't. Synchronous
replication pretty much precludes DR (again, I can think of theoretical
ways around this, but have never come across anything in practice).


seamless is an over-statement, I agree. NetApp has synchronous 
SnapMirror (which is only mostly synchronous...). Worst case, clients 
may see a filesystem go backwards in time, but to a point-in-time 
consistent state.


--
Carson



Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread J.P. King


Depends on how you define DR - we have shared storage HA in each datacenter 
(NetApp cluster), and replication between them in case we lose a datacenter 
(all clients on the MAN hit the same cluster unless we do a DR failover). The 
latter is what I'm calling DR.


It's what I call HA.  DR is what snapshots or backups can help you 
towards.  HA can be used to reduce the likelihood of needing to use DR 
measures, of course.


seamless is an over-statement, I agree. NetApp has synchronous SnapMirror 
(which is only mostly synchronous...). Worst case, clients may see a 
filesystem go backwards in time, but to a point-in-time consistent state.


Tell that to my swapfile!  Here we use synchronous mirroring for our VM 
systems storage.  Having that go back in time will cause unpredictable 
problems.  Worst case is pretty bad!


It may be that for your purposes you can treat your filesystems the way 
you do safely - although you'd better not have any in-memory caching of 
files, obviously - however lots and lots of people cannot.


I believe that we can do seamless replication and failover of NFS/ZFS, 
except that it is very painful to manage; iSCSI (the only way I know to do 
mirroring in this context) caused us a lot of pain the last time we used it; 
and the way Oracle treats Solaris and its support has made it largely 
untenable for us.


Instead we've switched to Linux and DRBD.  And if that doesn't get me 
sympathy I don't know what will.



Carson


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 5:45 PM, Carson Gaspar car...@taltos.org wrote:
 On 4/26/12 2:17 PM, J.P. King wrote:
 I don't know SnapMirror, so I may be mistaken, but I don't see how you
 can have non-synchronous replication which can allow for seamless client
 failover (in the general case). Technically this doesn't have to be
 block based, but I've not seen anything which wasn't. Synchronous
 replication pretty much precludes DR (again, I can think of theoretical
 ways around this, but have never come across anything in practice).

 seamless is an over-statement, I agree. NetApp has synchronous SnapMirror
 (which is only mostly synchronous...). Worst case, clients may see a
 filesystem go backwards in time, but to a point-in-time consistent state.

Sure, if we assume apps make proper use of O_EXCL, O_APPEND,
link(2)/unlink(2)/rename(2), sync(2), fsync(2), and fdatasync(3C), and
can roll their state back on their own.  Databases typically know
how to do that (e.g., SQLite3).  Most apps?  Doubtful.

Nico
--


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Richard Elling
On Apr 25, 2012, at 11:00 PM, Nico Williams wrote:

 On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
 Reboot requirement is a lame client implementation.
 
 And lame protocol design.  You could possibly migrate read-write NFSv3
 on the fly by preserving FHs and somehow updating the clients to go to
 the new server (with a hiccup in between, no doubt), but only entire
 shares at a time -- you could not migrate only part of a volume with
 NFSv3.

Requirements, requirements, requirements... boil the ocean while we're at it? 
:-)

 Of course, having migration support in the protocol does not equate to
 getting it in the implementation, but it's certainly a good step in
 that direction.

NFSv4 has support for migrating volumes and managing the movement
of file handles. The technique includes filehandle expiry, similar to methods
used in other distributed FSs.

 You are correct, a ZFS send/receive will result in different file handles on
 the receiver, just like
 rsync, tar, ufsdump+ufsrestore, etc.
 
 That's understandable for NFSv2 and v3, but for v4 there's no reason
 that an NFSv4 server stack and ZFS could not arrange to preserve FHs
 (if, perhaps, at the price of making the v4 FHs rather large).

This is already in the v4 spec.

 Although even for v3 it should be possible for servers in a cluster to
 arrange to preserve devids...

We've been doing that for many years.

 
 Bottom line: live migration needs to be built right into the protocol.

Agree, and volume migration support is already in the NFSv4 spec.

 For me one of the exciting things about Lustre was/is the idea that
 you could just have a single volume where all new data (and metadata)
 is distributed evenly as you go.  Need more storage?  Plug it in,
 either to an existing head or via a new head, then flip a switch and
 there it is.  No need to manage allocation.  Migration may still be
 needed, both within a cluster and between clusters, but that's much
 more manageable when you have a protocol where data locations can be
 all over the place in a completely transparent manner.

Many distributed file systems do this, at the cost of being not quite POSIX-ish.
In the brave new world of storage vmotion, nosql, and distributed object stores,
it is not clear to me that coding to a POSIX file system is a strong 
requirement.

Perhaps people are so tainted by experiences with v2 and v3 that we can explain
the non-migration to v4 as being due to poor marketing? As a leader of NFS, Sun
had unimpressive marketing.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 12:37 PM, Richard Elling
richard.ell...@gmail.com wrote:
 [...]

NFSv4 had migration in the protocol (excluding protocols between
servers) from the get-go, but it was missing a lot (FedFS) and was not
implemented until recently.  I've no idea what clients and servers
support it adequately besides Solaris 11, though that's just my fault
(not being informed).  It's taken over a decade to get to where we
have any implementations of NFSv4 migration.

 For me one of the exciting things about Lustre was/is the idea that
 you could just have a single volume where all new data (and metadata)
 is distributed evenly as you go.  Need more storage?  Plug it in,
 either to an existing head or via a new head, then flip a switch and
 there it is.  No need to manage allocation.  Migration may still be
 needed, both within a cluster and between clusters, but that's much
 more manageable when you have a protocol where data locations can be
 all over the place in a completely transparent manner.


 Many distributed file systems do this, at the cost of being not quite
 POSIX-ish.

Well, Lustre does POSIX semantics just fine, including cache coherency
(as opposed to NFS' close-to-open coherency, which is decidedly
non-POSIX).

 In the brave new world of storage vmotion, nosql, and distributed object
 stores,
 it is not clear to me that coding to a POSIX file system is a strong
 requirement.

Well, I don't quite agree.  I'm very suspicious of
eventually-consistent.  I'm not saying that the enormous DBs that eBay
and such run should sport SQL and ACID semantics -- I'm saying that I
think we can do much better than eventually-consistent (and
no-language) while not paying the steep price that ACID requires.  I'm
not alone in this either.

The trick is to find the right compromise.  Close-to-open semantics
works out fine for NFS, but O_APPEND is too wonderful not to have
(ditto O_EXCL, which NFSv2 did not have; v4 has O_EXCL, but not
O_APPEND).

Whoever first delivers the right compromise in distributed DB
semantics stands to make a fortune.

 Perhaps people are so tainted by experiences with v2 and v3 that we can
 explain
 the non-migration to v4 as being due to poor marketing? As a leader of NFS,
 Sun
 had unimpressive marketing.

Sun did not do too much to improve NFS in the 90s, not compared to the
v4 work that only really started paying off only too recently.  And
then since Sun had lost the client space by then it doesn't mean all
that much to have the best server if the clients aren't able to take
advantage of the server's best features for lack of client
implementation.  Basically, Sun's ZFS, DTrace, SMF, NFSv4, Zones, and
other amazing innovations came a few years too late to make up for the
awful management that Sun was saddled with.  But for all the decidedly
awful things Sun management did (or didn't do), the worst was
terminating Sun PS (yes, worse than all the non-marketing, poor
marketing, poor acquisitions, poor strategy, and all the rest
including truly epic mistakes like icing Solaris on x86 a decade ago).
 One of the worst outcomes of the Sun debacle is that now there's a
bevy of senior execs who think the worst thing Sun did was to open
source Solaris and Java -- which isn't to say that Sun should have
open sourced as much as it did, or that open source is an end in
itself, but that open sourcing these things was a legitimate business
tool with very specific goals in mind in each case, and which had
nothing to do with the sinking of the company.  Or maybe that's one of
the best outcomes, because the good news about it is that those who
learn the right lessons (in that case: that open source is a
legitimate business tool that is sometimes, often even, a great
mind-share building tool) will be in the minority, and thus will have
a huge advantage over their competition.  That's another thing Sun did
not learn until it was too late: mind-share matters enormously to a
software company.

Nico
--


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
I agree, you need something like AFS, Lustre, or pNFS.  And/or an NFS
proxy to those.

Nico
--


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Robert Milkowski

And he will still need an underlying filesystem like ZFS for them :)


 -----Original Message-----
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nico Williams
 Sent: 25 April 2012 20:32
 To: Paul Archer
 Cc: ZFS-Discuss mailing list
 Subject: Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs
FreeBSD)
 
 I agree, you need something like AFS, Lustre, or pNFS.  And/or an NFS
proxy
 to those.
 
 Nico
 --



Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Richard Elling
On Apr 25, 2012, at 12:04 PM, Paul Archer wrote:

 11:26am, Richard Elling wrote:
 
 On Apr 25, 2012, at 10:59 AM, Paul Archer wrote:
 
  The point of a clustered filesystem was to be able to spread our data 
 out among all nodes and still have access
  from any node without having to run NFS. Size of the data set (once you 
 get past the point where you can replicate
  it on each node) is irrelevant.
 Interesting, something more complex than NFS to avoid the complexities of 
 NFS? ;-)
 We have data coming in on multiple nodes (with local storage) that is needed 
 on other multiple nodes. The only way to do that with NFS would be with a 
 matrix of cross mounts that would be truly scary.


Ignoring lame NFS clients, how is that architecture different than what you 
would have 
with any other distributed file system? If all nodes share data to all other 
nodes, then...?
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Paul Archer

2:20pm, Richard Elling wrote:


On Apr 25, 2012, at 12:04 PM, Paul Archer wrote:

Interesting, something more complex than NFS to avoid the 
complexities of NFS? ;-)

  We have data coming in on multiple nodes (with local storage) that is 
needed on other multiple nodes. The only way
  to do that with NFS would be with a matrix of cross mounts that would be 
truly scary.


Ignoring lame NFS clients, how is that architecture different than what you 
would have 
with any other distributed file system? If all nodes share data to all other 
nodes, then...?
 -- richard



Simple. With a distributed FS, all nodes mount from a single DFS. With NFS, 
each node would have to mount from each other node. With 16 nodes, that's 
what, 240 mounts? Not to mention your data is in 16 different mounts/directory 
structures, instead of being in a unified filespace.


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Rich Teer

On Wed, 25 Apr 2012, Paul Archer wrote:


Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
each node would have to mount from each other node. With 16 nodes, that's
what, 240 mounts? Not to mention your data is in 16 different mounts/directory
structures, instead of being in a unified filespace.


Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server?  Much simpler than
your example, and all data is available on all machines/nodes.
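
A sketch of that layout with a ZFS-backed share (pool, paths and host names are hypothetical; Solaris mount syntax shown, a Linux client would use mount -t nfs). Keeping node0..node15 as plain directories inside one dataset keeps it to a single share and a single client mount:

(on the file server)
# zfs create -o mountpoint=/exports/nodes -o sharenfs=rw tank/nodes
# mkdir /exports/nodes/node0 /exports/nodes/node1 ... /exports/nodes/node15

(on each node)
# mount -F nfs server:/exports/nodes /exports/nodes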

--
Rich Teer, Publisher
Vinylphile Magazine

www.vinylphilemag.com


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 4:26 PM, Paul Archer p...@paularcher.org wrote:
 2:20pm, Richard Elling wrote:
 Ignoring lame NFS clients, how is that architecture different than what
 you would have
 with any other distributed file system? If all nodes share data to all
 other nodes, then...?

 Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
 each node would have to mount from each other node. With 16 nodes, that's
 what, 240 mounts? Not to mention your data is in 16 different
 mounts/directory structures, instead of being in a unified filespace.

To be fair NFSv4 now has a distributed namespace scheme so you could
still have a single mount on the client.  That said, some DFSes have
better properties, such as striping of data across sets of servers,
aggressive caching, and various choices of semantics (e.g., Lustre
tries hard to give you POSIX cache coherency semantics).

Nico
--


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Bob Friesenhahn

On Wed, 25 Apr 2012, Rich Teer wrote:


Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server?  Much simpler than
your example, and all data is available on all machines/nodes.


This solution would limit bandwidth to that available from that single 
server.  With the cluster approach, the objective is for each machine 
in the cluster to primarily access files which are stored locally. 
Whole files could be moved as necessary.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Ian Collins

On 04/26/12 09:54 AM, Bob Friesenhahn wrote:

On Wed, 25 Apr 2012, Rich Teer wrote:

Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server?  Much simpler than
your example, and all data is available on all machines/nodes.

This solution would limit bandwidth to that available from that single
server.  With the cluster approach, the objective is for each machine
in the cluster to primarily access files which are stored locally.
Whole files could be moved as necessary.


Distributed software building faces similar issues, but I've found once 
the common files have been read (and cached) by each node, network 
traffic becomes one way (to the file server).  I guess that topology 
works well when most access to shared data is read.


--
Ian.



Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Richard Elling
On Apr 25, 2012, at 2:26 PM, Paul Archer wrote:

 2:20pm, Richard Elling wrote:
 
 On Apr 25, 2012, at 12:04 PM, Paul Archer wrote:
 
Interesting, something more complex than NFS to avoid the 
 complexities of NFS? ;-)
 
  We have data coming in on multiple nodes (with local storage) that is 
 needed on other multiple nodes. The only way
  to do that with NFS would be with a matrix of cross mounts that would 
 be truly scary.
 Ignoring lame NFS clients, how is that architecture different than what you 
 would have 
 with any other distributed file system? If all nodes share data to all other 
 nodes, then...?
  -- richard
 
 
 Simple. With a distributed FS, all nodes mount from a single DFS. With NFS, 
 each node would have to mount from each other node. With 16 nodes, that's 
 what, 240 mounts? Not to mention your data is in 16 different 
 mounts/directory structures, instead of being in a unified filespace.

Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents). 
FWIW,
automounters were invented 20+ years ago to handle this in a nearly seamless 
manner.
Today, we have DFS from Microsoft and NFS referrals that almost eliminate the 
need
for automounter-like solutions.

Also, it is not unusual for an NFS environment to have 10,000+ mounts with 
thousands
of mounts on each server. No big deal, happens every day.

On Apr 25, 2012, at 2:53 PM, Nico Williams wrote:
 To be fair NFSv4 now has a distributed namespace scheme so you could
 still have a single mount on the client.  That said, some DFSes have
 better properties, such as striping of data across sets of servers,
 aggressive caching, and various choices of semantics (e.g., Lustre
 tries hard to give you POSIX cache coherency semantics).


I think this is where the real value is. NFS & CIFS are intentionally generic 
and have
caching policies that are favorably described as generic. For special-purpose 
workloads 
there can be advantages to having policies more explicitly applicable to the 
workload.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Paul Archer

2:34pm, Rich Teer wrote:


On Wed, 25 Apr 2012, Paul Archer wrote:


Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
each node would have to mount from each other node. With 16 nodes, that's
what, 240 mounts? Not to mention your data is in 16 different 
mounts/directory

structures, instead of being in a unified filespace.


Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server?  Much simpler than
your example, and all data is available on all machines/nodes.



That assumes the data set will fit on one machine, and that machine won't be a 
performance bottleneck.



Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling
richard.ell...@gmail.com wrote:
 Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents).
 FWIW,
 automounters were invented 20+ years ago to handle this in a nearly seamless
 manner.
 Today, we have DFS from Microsoft and NFS referrals that almost eliminate
 the need
 for automounter-like solutions.

I disagree vehemently.  automount is a disaster because you need to
synchronize changes with all those clients.  That's not realistic.
I've built a large automount-based namespace, replete with a
distributed configuration system for setting the environment variables
available to the automounter.  I can tell you this: the automounter
does not scale, and it certainly does not avoid the need for outages
when storage migrates.

With server-side, referral-based namespace construction that problem
goes away, and the whole thing can be transparent w.r.t. migrations.

For my money the key features a DFS must have are:

 - server-driven namespace construction
 - data migration without having to restart clients,
   reconfigure them, or do anything at all to them
 - aggressive caching

 - striping of file data for HPC and media environments

 - semantics that ultimately allow multiple processes
   on disparate clients to cooperate (i.e., byte range
   locking), but I don't think full POSIX semantics are
   needed

   (that said, I think O_EXCL is necessary, and it'd be
   very nice to have O_APPEND, though the latter is
   particularly difficult to implement and painful when
   there's contention if you stripe file data across
   multiple servers)

Nico
--


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Ian Collins

On 04/26/12 10:34 AM, Paul Archer wrote:

2:34pm, Rich Teer wrote:


On Wed, 25 Apr 2012, Paul Archer wrote:


Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
each node would have to mount from each other node. With 16 nodes, that's
what, 240 mounts? Not to mention your data is in 16 different
mounts/directory
structures, instead of being in a unified filespace.

Perhaps I'm being overly simplistic, but in this scenario, what would prevent
one from having, on a single file server, /exports/nodes/node[0-15], and then
having each node NFS-mount /exports/nodes from the server?  Much simpler than
your example, and all data is available on all machines/nodes.


That assumes the data set will fit on one machine, and that machine won't be a
performance bottleneck.


Aren't those general considerations when specifying a file server?

--
Ian.



Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Richard Elling
On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:

 On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling
 richard.ell...@gmail.com wrote:
 Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents).
 FWIW,
 automounters were invented 20+ years ago to handle this in a nearly seamless
 manner.
 Today, we have DFS from Microsoft and NFS referrals that almost eliminate
 the need
 for automounter-like solutions.
 
 I disagree vehemently.  automount is a disaster because you need to
 synchronize changes with all those clients.  That's not realistic.

Really?  I did it with NIS automount maps and 600+ clients back in 1991.
Other than the obvious problems with open files, has it gotten worse since then?

 I've built a large automount-based namespace, replete with a
 distributed configuration system for setting the environment variables
 available to the automounter.  I can tell you this: the automounter
 does not scale, and it certainly does not avoid the need for outages
 when storage migrates.

Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc. 

 With server-side, referral-based namespace construction that problem
 goes away, and the whole thing can be transparent w.r.t. migrations.

Agree, but we didn't have NFSv4 back in 1991 :-)  Today, of course, this
is how one would design it if you had to build a new DFS.

 
 For my money the key features a DFS must have are:
 
 - server-driven namespace construction
 - data migration without having to restart clients,
   reconfigure them, or do anything at all to them
 - aggressive caching
 
 - striping of file data for HPC and media environments
 
 - semantics that ultimately allow multiple processes
   on disparate clients to cooperate (i.e., byte range
   locking), but I don't think full POSIX semantics are
   needed

Almost any of the popular nosql databases offer this and more.
The movement away from POSIX-ish DFS and storing data in 
traditional files is inevitable. Even ZFS is an object store at its core.

   (that said, I think O_EXCL is necessary, and it'd be
   very nice to have O_APPEND, though the latter is
   particularly difficult to implement and painful when
   there's contention if you stripe file data across
   multiple servers)

+1
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 5:42 PM, Ian Collins i...@ianshome.com wrote:
 Aren't those general considerations when specifying a file server?

There are Lustre clusters with thousands of nodes, hundreds of them
being servers, and high utilization rates.  Whatever specs you might
have for one server head will not meet the demand that hundreds of the
same can.

Nico
--


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
  I disagree vehemently.  automount is a disaster because you need to
  synchronize changes with all those clients.  That's not realistic.

 Really?  I did it with NIS automount maps and 600+ clients back in 1991.
 Other than the obvious problems with open files, has it gotten worse since
 then?

Nothing's changed.  Automounter + data migration -> rebooting clients
(or close enough to rebooting).  I.e., outage.

 Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc.

But not with AFS.  And spec-wise not with NFSv4 (though I don't know
if/when all NFSv4 clients will properly support migration, just that
the protocol and some servers do).

 With server-side, referral-based namespace construction that problem
 goes away, and the whole thing can be transparent w.r.t. migrations.

Yes.

 Agree, but we didn't have NFSv4 back in 1991 :-)  Today, of course, this
 is how one would design it if you had to design a new DFS today.

Indeed, that's why I built an automounter solution in 1996 (that's
still in use, I'm told).  Although to be fair AFS existed back then,
already had a global namespace and data migration, and was mature.
 It's taken NFS that long to catch up...

 [...]

 Almost any of the popular nosql databases offer this and more.
 The movement away from POSIX-ish DFS and storing data in
 traditional files is inevitable. Even ZFS is an object store at its core.

I agree.  Except that there are applications where large octet streams
are needed.  HPC, media come to mind.

Nico
--


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Paul Archer

Tomorrow, Ian Collins wrote:


On 04/26/12 10:34 AM, Paul Archer wrote:
That assumes the data set will fit on one machine, and that machine won't 
be a

performance bottleneck.


Aren't those general considerations when specifying a file server?

I suppose. But I meant specifically that our data will not fit on one single 
machine, and we are relying on spreading it across more nodes to get it on 
more spindles as well.



Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Paul Kraus
On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams n...@cryptonector.com wrote:
 On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
 richard.ell...@gmail.com wrote:
 On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:

  I disagree vehemently.  automount is a disaster because you need to
  synchronize changes with all those clients.  That's not realistic.

 Really?  I did it with NIS automount maps and 600+ clients back in 1991.
 Other than the obvious problems with open files, has it gotten worse since
 then?

 Nothing's changed.  Automounter + data migration -> rebooting clients
 (or close enough to rebooting).  I.e., outage.

Uhhh, not if you design your automounter architecture correctly
and (as Richard said) have NFS clients that are not lame, to which I'll
add automounters that actually work as advertised. I was designing
automount architectures that permitted dynamic changes with minimal to
no outages in the late 1990's. I only had a little over 100 clients
(most of which were also servers) and NIS+ (NIS ver. 3) to distribute
the indirect automount maps.

I also had to _redesign_ a number of automount strategies that
were built by people who thought that using direct maps for everything
was a good idea. That _was_ a pain in the a** due to the changes
needed at the applications to point at a different hierarchy.

It all depends on _what_ the application is doing. Something that
opens and locks a file and never releases the lock or closes the file
until the application exits will require a restart of the application
with an automounter / NFS approach.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Carson Gaspar

On 4/25/12 6:57 PM, Paul Kraus wrote:

On Wed, Apr 25, 2012 at 9:07 PM, Nico Williamsn...@cryptonector.com  wrote:

On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
richard.ell...@gmail.com  wrote:




Nothing's changed.  Automounter + data migration -> rebooting clients
(or close enough to rebooting).  I.e., outage.


 Uhhh, not if you design your automounter architecture correctly
and (as Richard said) have NFS clients that are not lame, to which I'll
add automounters that actually work as advertised. I was designing


And applications that don't pin the mount points, and can be idled 
during the migration. If your migration is due to a dead server, and you 
have pending writes, you have no choice but to reboot the client(s) (and 
accept the data loss, of course).


Which is why we use AFS for RO replicated data, and NetApp clusters with 
SnapMirror and VIPs for RW data.


To bring this back to ZFS, sadly ZFS doesn't support NFS HA without 
shared / replicated storage, as ZFS send / recv can't preserve the data 
necessary to have the same NFS filehandle, so failing over to a replica 
causes stale NFS filehandles on the clients. Which frustrates me, 
because the technology to do NFS shadow copy (which is possible in 
Solaris - not sure about the open source forks) is a superset of that 
needed to do HA, but can't be used for HA.


--
Carson


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 8:57 PM, Paul Kraus pk1...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams n...@cryptonector.com wrote:
 Nothing's changed.  Automounter + data migration -> rebooting clients
 (or close enough to rebooting).  I.e., outage.

    Uhhh, not if you design your automounter architecture correctly
 and (as Richard said) have NFS clients that are not lame, to which I'll
 add automounters that actually work as advertised. I was designing
 automount architectures that permitted dynamic changes with minimal to
 no outages in the late 1990's. I only had a little over 100 clients
 (most of which were also servers) and NIS+ (NIS ver. 3) to distribute
 the indirect automount maps.

Further below you admit that you're talking about read-only data,
effectively.  But the world is not static.  Sure, *code* is by and
large static, and indeed, we segregated data by whether it was
read-only (code, historical data) or not (application data, home
directories).  We were able to migrate *read-only* data with no
outages.  But for the rest?  Yeah, there were always outages.  Of
course, we had a periodic maintenance window, with all systems
rebooting within a short period, and this meant that some data
migration outages were not noticeable, but they were real.

    I also had to _redesign_ a number of automount strategies that
 were built by people who thought that using direct maps for everything
 was a good idea. That _was_ a pain in the a** due to the changes
 needed at the applications to point at a different hierarchy.

We used indirect maps almost exclusively.  Moreover, we used
hierarchical automount entries, and even -autofs mounts.  We also used
environment variables to control various things, such as which servers
to mount what from (this was particularly useful for spreading the
load on read-only static data).  We used practically every feature of
the automounter except for executable maps (and direct maps, when we
eventually stopped using those).
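
A tiny indirect-map example of the sort described above (map names, server names, and the SRCHOST variable are all invented; such variables can come from the predefined set like $HOST/$OSREL or be supplied via automountd -D):

$ cat /etc/auto_master
/src        auto_src        -nosuid

$ cat /etc/auto_src
tools       $SRCHOST:/export/tools
*           buildsrv:/export/src/&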

    It all depends on _what_ the application is doing. Something that
 opens and locks a file and never releases the lock or closes the file
 until the application exits will require a restart of the application
 with an automounter / NFS approach.

No kidding!  In the real world such applications exist and get used.

Nico
--


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Richard Elling
On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:

 On 4/25/12 6:57 PM, Paul Kraus wrote:
 On Wed, Apr 25, 2012 at 9:07 PM, Nico Williamsn...@cryptonector.com  wrote:
 On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
 richard.ell...@gmail.com  wrote:
 
 
 Nothing's changed.  Automounter + data migration -> rebooting clients
 (or close enough to rebooting).  I.e., outage.
 
 Uhhh, not if you design your automounter architecture correctly
 and (as Richard said) have NFS clients that are not lame, to which I'll
 add automounters that actually work as advertised. I was designing
 
 And applications that don't pin the mount points, and can be idled during the 
 migration. If your migration is due to a dead server, and you have pending 
 writes, you have no choice but to reboot the client(s) (and accept the data 
 loss, of course).

Reboot requirement is a lame client implementation.

 Which is why we use AFS for RO replicated data, and NetApp clusters with 
 SnapMirror and VIPs for RW data.
 
 To bring this back to ZFS, sadly ZFS doesn't support NFS HA without shared / 
 replicated storage, as ZFS send / recv can't preserve the data necessary to 
 have the same NFS filehandle, so failing over to a replica causes stale NFS 
 filehandles on the clients. Which frustrates me, because the technology to do 
 NFS shadow copy (which is possible in Solaris - not sure about the open 
 source forks) is a superset of that needed to do HA, but can't be used for HA.

You are correct, a ZFS send/receive will result in different file handles on 
the receiver, just like
rsync, tar, ufsdump+ufsrestore, etc.
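
For reference, the kind of replication being discussed looks like this (host and dataset names are hypothetical); the received copy is a separate dataset on the standby's pool, which is why the file handles its NFS server hands out do not match the originals:

# zfs snapshot tank/export@rep1
# zfs send tank/export@rep1 | ssh standby zfs receive -F backup/export
(later, ship only the delta)
# zfs snapshot tank/export@rep2
# zfs send -i tank/export@rep1 tank/export@rep2 | ssh standby zfs receive backup/export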

Do you mean the Sun ZFS Storage 7000 Shadow Migration feature?  This is not an 
HA feature, it
is an interposition architecture.

It is possible to preserve NFSv[23] file handles in a ZFS environment using 
lower-level replication
like TrueCopy, SRDF, AVS, etc. But those have other architectural issues (aka 
suckage). I am 
open to looking at what it would take to make a ZFS-friendly replicator that 
would do this, but
need to know the business case [1].

The beauty of AFS and others is that the file handle equivalent is not a 
number. NFSv4 also has
this feature. So I have a little bit of heartburn when people say, NFS sux 
because it has a feature
I won't use because I won't upgrade to NFSv4 even though it was released 10 
years ago.

As Nico points out, there are cases where you really need a Lustre, Ceph, 
Gluster, or other 
parallel file system. That is not the design point for ZFS's ZPL or volume 
interfaces.

[1] FWIW, you can build a metropolitan area ZFS-based, shared storage cluster 
today for about 1/4 
the cost of the NetApp Stretch Metro software license. There is more than one 
way to skin a cat :-)
So if the idea is to get even lower than 1/4 the NetApp cost, it feels like a 
race to the bottom.

 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422