[zfs-discuss] how to remove disk from raid0

2011-10-11 Thread KES
Hi

I have the following configuration: 3 disks of 1Gb in raid0,
all disks in one zfs pool.

Free space on this raid is 1.5Gb and 1.5Gb is used.

So I have some questions:
1. If I don't plan to use 3 disks in the pool any more, how can I remove one of them?
2. Imagine one disk has failures. I want to replace it, but now I do not have 
a 1Gb disk and have only a 2Gb one.
 I replace the 1Gb disk with the 2Gb one, and after some time I want to put a 1Gb disk 
(as it was before) back.
With the replace command I get the error: device is too small.
  How do I return the pool to its original state?

Thank you.


Re: [zfs-discuss] how to remove disk from raid0

2011-10-11 Thread Edho Arief
On Tue, Oct 11, 2011 at 9:25 AM, KES kes-...@yandex.ua wrote:
 Hi

 I have the following configuration: 3 disks of 1Gb in raid0,
 all disks in one zfs pool.

 Free space on this raid is 1.5Gb and 1.5Gb is used.

 So I have some questions:
 1. If I don't plan to use 3 disks in the pool any more, how can I remove one of them?
 2. Imagine one disk has failures. I want to replace it, but now I do not have
 a 1Gb disk and have only a 2Gb one.
  I replace the 1Gb disk with the 2Gb one, and after some time I want to put a 1Gb disk
 (as it was before) back.
 With the replace command I get the error: device is too small.
  How do I return the pool to its original state?


Simply put, current ZFS can only be extended; shrinking is not possible.

Until the mythical block pointer rewrite is actually written, at least.

-- 
O ascii ribbon campaign - stop html mail - www.asciiribbon.org


Re: [zfs-discuss] commercial zfs-based storage replication software?

2011-10-11 Thread Darren J Moffat
Have you looked at the time-slider functionality that is already in 
Solaris?


There is a GUI for configuration of the snapshots, and time-slider can be 
configured to do a 'zfs send' or 'rsync'.  The GUI doesn't have the 
ability to set the 'zfs recv' command, but that is set once in the 
SMF service properties.
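
As a rough illustration only -- the plugin FMRI and property name below are
assumptions from memory, so check what svcs/svcprop actually report on your
release before copying anything:

  # list the time-slider plugin instances and their current properties
  svcs -a | grep time-slider
  svcprop svc:/application/time-slider/plugin:zfs-send

  # hypothetical example: point the send at a remote 'zfs receive'
  # (the property group/name are assumptions -- use what svcprop shows)
  svccfg -s svc:/application/time-slider/plugin:zfs-send \
      setprop zfs-send/backup-command = astring: \
      '"ssh backuphost /usr/sbin/zfs receive -d tank/replica"'
  svcadm refresh svc:/application/time-slider/plugin:zfs-send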


--
Darren J Moffat


Re: [zfs-discuss] Any info about System attributes

2011-10-11 Thread Darren J Moffat

On 09/26/11 20:03, Jesus Cea wrote:

# zpool upgrade -v
[...]
24  System attributes
[...]


This is really an on disk format issue rather than something that the 
end user or admin can use directly.


These are special on disk blocks for storing file system metadata 
attributes when there isn't enough space in the bonus buffer area of the 
on disk version of the dnode.


This can be necessary in some cases if a file has a very large and 
complex ACL and also has other attributes set such as the ones for CIFS 
compatibility.


They are also always used if the filesystem is encrypted, so that all 
metadata is in the system attribute (also known as spill) block rather 
than in the dnode - this is required because we need the dnode in the 
clear, since it contains block pointers and other information needed to 
navigate the pool.  However, we never want file system metadata to be in 
the clear.


--
Darren J Moffat


Re: [zfs-discuss] how to remove disk from raid0

2011-10-11 Thread Richard Elling
On Oct 11, 2011, at 2:25 AM, KES kes-...@yandex.ua wrote:

 Hi
 
 I have the following configuration: 3 disks of 1Gb in raid0,
 all disks in one zfs pool

we recommend protecting the data. Friends don't let friends use raid-0.

nit: We tend to refer to disk size in bytes (B), not bits (b)

 Free space on this raid is 1.5Gb and 1.5Gb is used.
 
 So I have some questions:
 1. If I don't plan to use 3 disks in the pool any more, how can I remove one of them?

copy out, copy in. Using sparse volumes or file systems can help you
manage this task cost effectively. The mythical block pointer rewrite is
a form of copy out, copy in.
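
A rough sketch of that path with send/receive -- pool, snapshot, and device
names here are placeholders, and this assumes you have somewhere to stage
the copy:

  # snapshot everything in the existing 3-disk pool
  zfs snapshot -r oldpool@migrate

  # build the smaller target pool and replicate into it
  zpool create newpool c1t1d0 c1t2d0
  zfs send -R oldpool@migrate | zfs receive -F -d newpool

  # after verifying the copy, retire the old pool and free the third disk
  zpool destroy oldpool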

 2. Imagine one disk has failures. I want to replace it, but now I do not have 
 a 1Gb disk and have only a 2Gb one.
 I replace the 1Gb disk with the 2Gb one, and after some time I want to put a 1Gb disk 
 (as it was before) back.
 With the replace command I get the error: device is too small.
  How do I return the pool to its original state?

Partition the 2GB disk so that the replacement partition is exactly the
same size (in blocks) as the 1GB disk.
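
Something like the following, with made-up pool/device names and slice
numbers (c1t1d0 = the 1GB disk, c1t2d0 = the 2GB replacement):

  # note the exact size in sectors of the 1GB disk (or its slice)
  prtvtoc /dev/rdsk/c1t1d0s2

  # in format(1M), label the 2GB disk and create a slice (say s0)
  # with exactly that sector count
  format c1t2d0

  # replace using the matching slice rather than the whole disk
  zpool replace mypool c1t1d0 c1t2d0s0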
 

 -- richard



[zfs-discuss] OpenStorage Summit email blast

2011-10-11 Thread Avneet Dhanota
Subject: FYI on Storage Event
Just an FYI on storage. I just learned that an OpenStorage Summit is
happening in San Jose during the last week of October. Some great speakers
are presenting and some really interesting topics will be addressed,
including Korea Telecom on public cloud storage, Intel on converged
storage, and a discussion of how VMware designed a public cloud to host
its Hands-on Lab during the VMworld 2011 event held in Las Vegas. Also,
there will be a presentation on the best practices concerning ZFS /
OpenSolaris. What a great opportunity!
Check it out and register here.


[zfs-discuss] ZFS issue on read performance

2011-10-11 Thread degger

Hi,

I'm not familiar with ZFS stuff, so I'll try to give you as much info as I 
can about our environment.
We are using a ZFS pool as a VLS for a backup server (Sun V445, Solaris 10), and 
we are faced with very low read performance (whilst write performance is much 
better, i.e. up to 40GB/h to migrate data onto LTO-3 tape from disk, and up to 
100GB/h to unstage data from LTO-3 tape to disk, either with Time Navigator 4.2 
software or directly using dd commands).
We have tuned ZFS parameters for ARC and disabled prefetch but performance is 
poor. If we dd from disk to RAM or tape, it's very slow, but if we dd from tape 
or RAM to disk, it's faster. I can't figure out why. I've read other posts 
related to this, but I'm not sure what kind of tuning can be made.
As for the disks, I have no idea how our System team created the ZFS 
volume.
Can you help?

Thank you

David


Re: [zfs-discuss] tuning zfs_arc_min

2011-10-11 Thread Richard Elling
On Oct 6, 2011, at 5:19 AM, Frank Van Damme frank.vanda...@gmail.com wrote:

 Hello,
 
 quick and stupid question: I'm breaking my head over how to tune
 zfs_arc_min on a running system. There must be some magic word to pipe
 into mdb -kw but I forgot it. I tried /etc/system but it's still at the
 old value after reboot:
 
 ZFS Tunables (/etc/system):
 set zfs:zfs_arc_min = 0x20
 set zfs:zfs_arc_meta_limit=0x1

It is not uncommon to tune arc meta limit. But I've not seen a case
where tuning arc min is justified, especially for a storage server. Can
you explain your reasoning?
 -- richard

 
 ARC Size:
 Current Size: 1314 MB (arcsize)
 Target Size (Adaptive):   5102 MB (c)
 Min Size (Hard Limit):2048 MB (zfs_arc_min)
 Max Size (Hard Limit):5102 MB (zfs_arc_max)
 
 
 I could use the memory now since I'm running out of it, trying to delete
 a large snapshot :-/
 
 -- 
 No part of this copyright message may be reproduced, read or seen,
 dead or alive or by any means, including but not limited to telepathy
 without the benevolence of the author.


Re: [zfs-discuss] ZFS issue on read performance

2011-10-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of deg...@free.fr
 
 I'm not familiar with ZFS stuff, so I'll try to give you as much info as I
 can about our environment.
 We are using a ZFS pool as a VLS for a backup server (Sun V445, Solaris 10),
 and we are faced with very low read performance (whilst write performance
 is much better, i.e. up to 40GB/h to migrate data onto LTO-3 tape from disk,
 and up to 100GB/h to unstage data from LTO-3 tape to disk, either with Time
 Navigator 4.2 software or directly using dd commands).
 We have tuned ZFS parameters for ARC and disabled prefetch but performance
 is poor. If we dd from disk to RAM or tape, it's very slow, but if we dd
 from tape or RAM to disk, it's faster. I can't figure out why. I've read
 other posts related to this, but I'm not sure what kind of tuning can be
 made.
 As for the disks, I have no idea how our System team created the ZFS
 volume.
 Can you help?

Normally, even a single cheap disk in the dumbest configuration should
vastly outperform an LTO3 tape device.  And 100 GB/h is nowhere near what
you should expect, unless you're using highly fragmented or scattered small
files.  In the optimal configuration, you'll read/write something like
1Gbit/sec per disk, until you saturate your controller, let's just pick
rough numbers and say 6Gbit/sec = 2.7 TB per hour.  So there's a ballpark to
think about.

Next things next.  I am highly skeptical of dd.  I constantly get weird
performance problems when using dd.  Especially if you're reading/writing
tapes.  Instead, this is a good benchmark for how fast your disks can
actually go in the present configuration:  zfs send somefilesystem@somesnap
| pv -i 30 > /dev/null   (You might have to install pv, for example using
opencsw or blastwave.  If you don't have pv and don't want to install it,
you might want to time zfs send | wc > /dev/null, so you can get the total
size and the total time.)  Expect the performance to go up and down...  So
watch it a while.  Or wait for it to complete and then you'll have the
average.
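
Spelled out with a placeholder snapshot name (substitute one of your own):

  # with pv installed (opencsw/blastwave):
  zfs send tank/backup@test | pv -i 30 > /dev/null

  # without pv -- a slight variant: time gives the duration and
  # wc -c prints the byte count, so you can work out MB/sec
  time zfs send tank/backup@test | wc -c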

Also...  In what way are you using dd?  dd is not really an appropriate tool
for backing up a ZFS filesystem.  Well, there are some corner cases where it
might be ok, but generally speaking, no.  So the *very* first question you
should be asking is probably not about the bad performance you're seeing,
but verifying the validity of your backup technique.
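
For comparison, a typical ZFS-native path looks something like the sketch
below -- names are placeholders, and whether a raw send stream on tape is
acceptable for your retention needs is a separate question:

  # replicate to another pool or host
  zfs snapshot tank/backup@weekly
  zfs send tank/backup@weekly | ssh otherhost /usr/sbin/zfs receive -F pool2/backup

  # or write the send stream to tape; restore later with
  #   dd if=/dev/rmt/0n bs=1048576 | zfs receive ...
  zfs send tank/backup@weekly | dd of=/dev/rmt/0n bs=1048576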



Re: [zfs-discuss] ZFS issue on read performance

2011-10-11 Thread Paul Kraus
On Tue, Oct 11, 2011 at 6:25 AM,  deg...@free.fr wrote:

 I'm not familiar with ZFS stuff, so I'll try to give you as much info as I 
 can about our environment.
 We are using a ZFS pool as a VLS for a backup server (Sun V445, Solaris 10), 
 and we are faced with very low read performance (whilst write performance is 
 much better, i.e. up to 40GB/h to migrate data onto LTO-3 tape from disk, 
 and up to 100GB/h to unstage data from LTO-3 tape to disk, either with Time 
 Navigator 4.2 software or directly using dd commands).
 We have tuned ZFS parameters for ARC and disabled prefetch but performance 
 is poor. If we dd from disk to RAM or tape, it's very slow, but if we dd from 
 tape or RAM to disk, it's faster. I can't figure out why. I've read other 
 posts related to this, but I'm not sure what kind of tuning can be made.
 As for the disks, I have no idea how our System team created the ZFS 
 volume.
 Can you help?

If you can, please post the output from `zpool status` so we know
what your configuration is. There are many ways to configure a zpool,
some of which have horrible read performance. We are using zfs as
backend storage for NetBackup and we do not see the disk storage as
the bottleneck except when copying from disk to tape (LTO-3) and that
depends on the backup images. We regularly see 75-100 MB/sec
throughput disk to tape for large backup images. I rarely see LTO-3
drives writing any faster than 100 MB/sec.

100 MB/sec. is about 350 GB/hr.
75 MB/sec. is about 260 GB/hr.
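
For anyone checking the arithmetic (taking 1 GB = 1024 MB):

  100 MB/s x 3600 s/hr = 360,000 MB/hr / 1024 ~= 350 GB/hr
   75 MB/s x 3600 s/hr = 270,000 MB/hr / 1024 ~= 264 GB/hr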

Our disk stage zpool is configured for capacity and reliability and
not performance.

  pool: nbu-ds0
 state: ONLINE
 scrub: scrub completed after 7h9m with 0 errors on Thu Sep 29 16:25:56 2011
config:

NAME   STATE READ WRITE CKSUM
nbu-ds0  ONLINE   0 0 0
  raidz2-0 ONLINE   0 0 0
c3t5000C5001A67AB63d0  ONLINE   0 0 0
c3t5000C5001A671685d0  ONLINE   0 0 0
c3t5000C5001A670DE6d0  ONLINE   0 0 0
c3t5000C5001A66CDA4d0  ONLINE   0 0 0
c3t5000C5001A66A43Bd0  ONLINE   0 0 0
c3t5000C5001A66994Dd0  ONLINE   0 0 0
c3t5000C5001A663062d0  ONLINE   0 0 0
c3t5000C5001A659F79d0  ONLINE   0 0 0
c3t5000C5001A6591B2d0  ONLINE   0 0 0
c3t5000C5001A658481d0  ONLINE   0 0 0
c3t5000C5001A4C47C8d0  ONLINE   0 0 0
  raidz2-1 ONLINE   0 0 0
c3t5000C5001A6548A2d0  ONLINE   0 0 0
c3t5000C5001A6546AAd0  ONLINE   0 0 0
c3t5000C5001A65400Ed0  ONLINE   0 0 0
c3t5000C5001A653B70d0  ONLINE   0 0 0
c3t5000C5001A6531F5d0  ONLINE   0 0 0
c3t5000C5001A64332Ed0  ONLINE   0 0 0
c3t5000C500112A5AF8d0  ONLINE   0 0 0
c3t5000C5001A5D61A8d0  ONLINE   0 0 0
c3t5000C5001A5C5EA9d0  ONLINE   0 0 0
c3t5000C5001A55F7A6d0  ONLINE   0 0 0  114K repaired
c3t5000C5001A5347FEd0  ONLINE   0 0 0
spares
  c3t5000C5001A485C88d0AVAIL
  c3t5000C50026A0EC78d0AVAIL

errors: No known data errors

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] tuning zfs_arc_min

2011-10-11 Thread Frank Van Damme
2011/10/11 Richard Elling richard.ell...@gmail.com:
 ZFS Tunables (/etc/system):
         set zfs:zfs_arc_min = 0x20
         set zfs:zfs_arc_meta_limit=0x1

 It is not uncommon to tune arc meta limit. But I've not seen a case
 where tuning arc min is justified, especially for a storage server. Can
 you explain your reasoning?


Honestly? I don't remember. Might be a leftover setting from a year
ago. By now, I figured out I need to update the boot archive in
order for the new setting to take effect at boot time, which apparently
involves booting in safe mode.

-- 
Frank Van Damme
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


[zfs-discuss] weird bug with Seagate 3TB USB3 drive

2011-10-11 Thread John D Groenveld
Banging my head against a Seagate 3TB USB3 drive.
Its marketing name is:
Seagate Expansion 3 TB USB 3.0 Desktop External Hard Drive STAY3000102
format(1M) shows it identifying itself as:
Seagate-External-SG11-2.73TB

Under both Solaris 10 and Solaris 11x, I receive the evil message:
| I/O request is not aligned with 4096 disk sector size.
| It is handled through Read Modify Write but the performance is very low.

However, that's not my big issue as I will use the zpool-12 hack.

My big issue is that once I zpool(1M) export the pool from
my W2100z running S10 or my Ultra 40 running S11x, I can't 
import it.

I thought it was a weird USB connectivity issue, but I can run
format -> analyze -> read merrily.

Anyone seen this bug?

John
groenv...@acm.org



Re: [zfs-discuss] tuning zfs_arc_min

2011-10-11 Thread Richard Elling
On Oct 11, 2011, at 2:03 PM, Frank Van Damme wrote:
 2011/10/11 Richard Elling richard.ell...@gmail.com:
 ZFS Tunables (/etc/system):
 set zfs:zfs_arc_min = 0x20
 set zfs:zfs_arc_meta_limit=0x1
 
 It is not uncommon to tune arc meta limit. But I've not seen a case
 where tuning arc min is justified, especially for a storage server. Can
 you explain your reasoning?
 
 
 Honestly? I don't remember. Might be a leftover setting from a year
 ago. By now, I figured out I need to update the boot archive in
 order for the new setting to take effect at boot time, which apparently
 involves booting in safe mode.

The archive should be updated when you reboot. Or you can run
bootadm update-archive
anytime.

At boot, the zfs_arc_min is copied into arc_c_min overriding the default
setting. You can see the current value via kstat:
kstat -p zfs:0:arcstats:c_min
zfs:0:arcstats:c_min    389202432

This is the smallest size that the ARC will shrink to, when asked to shrink
because other applications need memory.
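
If you really do want to change it on a live system (the original question),
the usual poke is via mdb -kw; the value below is an example only
(0x20000000 = 512MB), not a recommendation:

  # read the current floor
  kstat -p zfs:0:arcstats:c_min

  # write a new one on the running kernel
  echo "arc_c_min/Z 0x20000000" | mdb -kw

  # matching /etc/system line for the next boot
  # (then run: bootadm update-archive)
  set zfs:zfs_arc_min = 0x20000000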
 -- richard


-- 

ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Richard Elling
On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:

 Hello all,
 
 ZFS developers have for a long time stated that ZFS is not intended,
 at least not in near term, for clustered environments (that is, having
 a pool safely imported by several nodes simultaneously). However,
 many people on forums have wished having ZFS features in clusters.

...and UFS before ZFS… I'd wager that every file system has this RFE in its
wish list :-)

 I have some ideas at least for a limited implementation of clustering
 which may be useful at least for some areas. If it is not my fantasy
 and if it is realistic to make - this might be a good start for further
 optimisation of ZFS clustering for other uses.
 
 For one use-case example, I would talk about VM farms with VM
 migration. In case of shared storage, the physical hosts need only
 migrate the VM RAM without copying gigabytes of data between their
 individual storages. Such copying makes less sense when the
 hosts' storage is mounted off the same NAS/SAN box(es), because:
 * it only wastes bandwidth moving bits around the same storage, and

This is why the best solutions use snapshots… no moving of data and
you get the added benefit of shared ARC -- increasing the logical working
set size does not increase the physical working set size.

 * IP networking speed (NFS/SMB copying) may be less than that of
 dedicated storage net between the hosts  and storage (SAS, FC, etc.)

Disk access is not bandwidth bound by the channel.

 * with pre-configured disk layout from one storage box into LUNs for
 several hosts, more slack space is wasted than with having a single
 pool for several hosts, all using the same free pool space;

...and you die by latency of metadata traffic.

 * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
 it would be problematic to add a 6th server) - but it won't be a problem
 when the single pool consumes the whole SAN and is available to
 all server nodes.

Are you assuming disk access is faster than RAM access?

 One feature of this use-case is that specific datasets within the
 potentially common pool on the NAS/SAN are still dedicated to
 certain physical hosts. This would be similar to serving iSCSI
 volumes or NFS datasets with individual VMs from a NAS box -
 just with a faster connection over SAS/FC. Hopefully this allows
 for some shortcuts in clustering ZFS implementation, while
 such solutions would still be useful in practice.

I'm still missing the connection of the problem to the solution.
The problem, as I see it today: disks are slow and not getting 
faster. SSDs are fast and getting faster and lower $/IOP. Almost
all VM environments and most general purpose environments are
overprovisioned for bandwidth and underprovisioned for latency.
The Achilles' heel of solutions that cluster for bandwidth (e.g. Lustre,
QFS, pNFS, Gluster, GFS, etc) is that you have to trade-off latency.
But latency is what we need, so perhaps not the best architectural
solution?

 So, one version of the solution would be to have a single host
 which imports the pool in read-write mode (i.e. the first one
 which boots), and other hosts would write thru it (like iSCSI
 or whatever; maybe using SAS or FC to connect between
 reader and writer hosts). However they would read directly
 from the ZFS pool using the full SAN bandwidth.
 
 WRITES would be consistent because only one node writes
 data to the active ZFS block tree using more or less the same
 code and algorithms as already exist.
 
 
 In order for READS to be consistent, the readers need only
 rely on whatever latest TXG they know of, and on the cached
 results of their more recent writes (between the last TXG
 these nodes know of and current state).
 
 Here's where this use-case's bonus comes in: the node which
 currently uses a certain dataset and issues writes for it, is the
 only one expected to write there - so even if its knowledge of
 the pool is some TXGs behind, it does not matter.
 
 In order to stay up to date, and know the current TXG completely,
 the reader nodes should regularly read-in the ZIL data (anyway
 available and accessible as part of the pool) and expire changed
 entries from their local caches.

:-)

 If for some reason a reader node has lost track of the pool for
 too long, so that ZIL data is not sufficient to update from known
 in-RAM TXG to current on-disk TXG, the full read-only import
 can be done again (keeping track of newer TXGs appearing
 while the RO import is being done).
 
 Thanks to ZFS COW, nodes can expect that on-disk data (as
 pointed to by block addresses/numbers) does not change.
 So in the worst case, nodes would read outdated data a few
 TXGs old - but not completely invalid data.
 
 
 Second version of the solution is more or less the same, except
 that all nodes can write to the pool hardware directly using some
 dedicated block ranges owned by one node at a time. This
 would work much like a ZIL containing both data and metadata.
 Perhaps these ranges would be whole metaslabs or some other
 ranges as agreed between the master node and other nodes.

Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
 ZFS developers have for a long time stated that ZFS is not intended,
 at least not in near term, for clustered environments (that is, having
 a pool safely imported by several nodes simultaneously). However,
 many people on forums have wished having ZFS features in clusters.

 ...and UFS before ZFS… I'd wager that every file system has this RFE in its
 wish list :-)

Except the ones that already have it!  :)

Nico
--


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov jimkli...@cos.ru wrote:
 So, one version of the solution would be to have a single host
 which imports the pool in read-write mode (i.e. the first one
 which boots), and other hosts would write thru it (like iSCSI
 or whatever; maybe using SAS or FC to connect between
 reader and writer hosts). However they would read directly
 from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes.  You need to
send write and lock requests to one node.  And then you need to figure
out what to do about POSIX write visibility rules (i.e., when a write
should be visible to other readers).  I think you'd basically end up
not meeting POSIX in this regard, just like NFS, though perhaps not
with close-to-open semantics.

I don't think ZFS is the beast you're looking for.  You want something
more like Lustre, GPFS, and so on.  I suppose someone might surprise
us one day with properly clustered ZFS, but I think it'd be more
likely that the filesystem would be ZFS-like, not ZFS proper.

 Second version of the solution is more or less the same, except
 that all nodes can write to the pool hardware directly using some
 dedicated block ranges owned by one node at a time. This
 would work much like a ZIL containing both data and metadata.
 Perhaps these ranges would be whole metaslabs or some other
 ranges as agreed between the master node and other nodes.

This is much hairier.  You need consistency.  If two processes on
different nodes are writing to the same file, then you need to
*internally* lock around all those writes so that the on-disk
structure ends up being sane.  There's a number of things you could do
here, such as, for example, having a per-node log and one node
coalescing them (possibly one node per-file, but even then one node
has to be the master of every txg).

And still you need to be careful about POSIX semantics.  That does not
come for free in any design -- you will need something like the Lustre
DLM (distributed lock manager).  Or else you'll have to give up on
POSIX.

There's a hefty price to be paid for POSIX semantics in a clustered
environment.  You'd do well to read up on Lustre's experience in
detail.  And not just Lustre -- that would be just to start.  I
caution you that this is not a simple project.

Nico
--