Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Thomas Maier-Komor
 
 the ZIL is always there in host memory, even when no synchronous writes
 are being done, since the POSIX fsync() call could be made on an open
 write channel at any time, requiring all to-date writes on that channel
 to be committed to persistent store before it returns to the application
 ... it's cheaper to write the ZIL at this point than to force the entire
 5 sec buffer out prematurely
 

I have a question that is related to this topic: Why is there only a (tunable) 
5 second threshold and not also an additional threshold for the buffer size 
(e.g. 50MB)?

Sometimes I see my system writing huge amounts of data to a ZFS filesystem, yet 
the disks stay idle for 5 seconds even though memory consumption is already quite 
high, and (from my uneducated point of view as an observer) it really would make 
sense to start writing the data out to the disks. I think this leads to the 
pumping effect that has been mentioned previously in one of the forums here.
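
One quick way to see this burst pattern from the command line is to sample pool 
I/O while something is streaming writes; the commands below are only a minimal 
sketch, assuming a pool named tank with a filesystem mounted at /tank/fs (both 
names are placeholders for your own setup):

# Terminal 1: sample pool I/O once per second; with steady buffered writes
# you should see several near-idle samples followed by a large write burst
# roughly every 5 seconds.
zpool iostat tank 1

# Terminal 2: generate a steady stream of asynchronous (buffered) writes.
dd if=/dev/zero of=/tank/fs/bigfile bs=128k count=8192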

Can anybody comment on this?

TIA,
Thomas
 
 


Re: [zfs-discuss] Does block allocation for small writes work over iSCSI?

2008-01-08 Thread Andre Wenas

Although it looks possible, it would be a very complex architecture.

If you can wait, please explore pNFS: 
http://opensolaris.org/os/project/nfsv41/


What is pNFS?

   * The pNFS protocol allows us to separate an NFS file system's data
     and metadata paths. With a separate data path, we are free to lay
     file data out in interesting ways, like striping it across multiple
     different file servers. For more information, see the NFSv4.1
     specification.



Gilberto Mautner wrote:

Hello list,
 
 
I'm thinking about this topology:
 
NFS Client ---NFS--- ZFS Host ---iSCSI--- ZFS Node 1, 2, 3, etc.
 
The idea here is to create a scalable NFS server by plugging in more 
nodes as more space is needed, striping data across them.
 
A question is: we know from the docs that ZFS optimizes random write 
speed by consolidating what would be many random writes into a single 
sequential operation.
 
I imagine that for ZFS to be able to do that, it has to have some 
knowledge of the hard disk geometry. Now, if this geometry is being 
abstracted by iSCSI, is that optimization still valid?
 
 
Thanks
 
Gilberto
 





Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
 I have a question that is related to this topic: Why
 is there only a (tunable) 5 second threshold and not
 also an additional threshold for the buffer size
 (e.g. 50MB)?
 
 Sometimes I see my system writing huge amounts of
 data to a zfs, but the disks staying idle for 5
 seconds, although the memory consumption is already
 quite big and it really would make sense (from my
 uneducated point of view as an observer) to start
 writing all the data to disks. I think this leads to
 the pumping effect that has been previously mentioned
 in one of the forums here.
 
 Can anybody comment on this?
 
 TIA,
 Thomas

because ZFS always writes to a new location on the disk, premature writing
can often result in redundant work ... a single host write to a ZFS object
results in the need to rewrite all of the changed data and meta-data leading
to that object

if a subsequent follow-up write to the same object occurs quickly,
this entire path, once again, has to be recreated, even though only a small 
portion of it is actually different from the previous version

if both versions were written to disk, the result would be to physically write 
potentially large amounts of nearly duplicate information over and over
again, resulting in logically vacant bandwidth

consolidating these writes in host cache eliminates some redundant disk
writing, resulting in more productive bandwidth ... providing some ability to
tune the consolidation time window and/or the accumulated cache size may
seem like a reasonable thing to do, but I think that it's typically a moving
target, and depending on an adaptive, built-in algorithm to dynamically set
these marks (as ZFS claims it does) seems like a better choice
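
As a rough illustration of this consolidation effect, the sketch below rewrites a 
small file 20 times, first as fast as possible (so most versions are absorbed in 
host memory and only the latest copy in each transaction group ever reaches the 
disks) and then with a pause longer than the flush interval between rewrites (so 
each version forces its own copy-on-write pass, data plus metadata path, out to 
disk). Watching "zpool iostat tank 1" in another terminal while each loop runs 
should show far fewer write bursts, and far less total IO, in the first case; the 
pool and path names are placeholders.

# Case 1: 20 rapid rewrites of the same file; most versions never hit disk.
i=0
while [ $i -lt 20 ]; do
    date > /tank/fs/smallfile
    i=$(($i+1))
done

# Case 2: the same 20 rewrites, spaced out beyond the ~5 second flush
# interval, so every version is written out through the full copy-on-write
# path.
i=0
while [ $i -lt 20 ]; do
    date > /tank/fs/smallfile
    i=$(($i+1))
    sleep 6
done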

...Bill
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Casper . Dik


consolidating these writes in host cache eliminates some redundant disk
writing, resulting in more productive bandwidth ... providing some ability to
tune the consolidation time window and/or the accumulated cache size may
seem like a reasonable thing to do, but I think that it's typically a moving
target, and depending on an adaptive, built-in algorithm to dynamically set
these marks (as ZFS claims it does) seems like a better choice


But it seems that when we're talking about full block writes (such as 
sequential file writes) ZFS could do a bit better.

And as long as there is bandwidth left to the disk and the controllers, it 
is difficult to argue that the work is redundant.  If it's free in that
sense, it doesn't matter whether it is redundant.  But if it turns out NOT
to have been redundant, you save a lot.

Casper



[zfs-discuss] clones bound too tightly to its origin

2008-01-08 Thread Andreas Koppenhoefer
[I apologise for reposting this... but no one replied to my post from Dec 4th.]

Hello all,

While experimenting with zfs send and zfs receive mixed with cloning on the 
receiver side, I found the following...

On server A there is a zpool with snapshots created on a regular basis via cron.
Server B gets updated by a zfs-send-ssh-zfs-receive command pipe. Both servers 
are running Solaris 10 update 4 (08/07).

Sometimes I want to do some testing on server B without corrupting data on 
server A. To do so, I create a clone of the filesystem. Up to here 
everything is ok. As long as the mounted clone filesystem is NOT busy, any 
further zfs-send-ssh-zfs-receive will work properly, updating my pool on B.

But there are some long-running test jobs on server B which keep the clone's 
filesystem busy, or sometimes just a single login shell with its cwd within the 
clone's filesystem, which makes the filesystem busy from umount's point of view.
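
As a side note, you can confirm what is keeping the clone busy before the next 
receive runs; a minimal check, with the clone mounted at /mnt/copy as in the 
script below:

# List the PIDs (and users) that have files open or a cwd on the mounted
# clone; any process reported here is enough to make umount fail with
# "Device busy".
fuser -cu /mnt/copy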

Meanwhile another zfs-send-ssh-zfs-receive command gets launched to copy a new 
snapshot from A to B. If the receiving pool of the zfs receive command has busy 
clones, the receive command will fail: for some unknown reason the receive 
command tries to umount my cloned filesystem and fails with "Device busy".

The question is: why?

Since the clone is (or should be) independent of its origin, zfs receive 
should not umount cloned data of older snapshots.

If you want to reproduce this, below (and attached) you will find a simple test 
script. The script will bail out at the last zfs receive command. If you comment 
out the line "cd /mnt/copy", the script will run as expected.

Disclaimer: Before running my script, make sure you do not have zpools named 
"copy" or "origin". Use this script only on a test machine! Use it at your own 
risk.
Here is the script:
#!/usr/bin/bash
# cleanup before test
cd /
set -ex
for pool in origin copy; do
zpool destroy $pool || :
rm -f /var/tmp/zpool.$pool
[ -d /mnt/$pool ] && rmdir /mnt/$pool

mkfile -nv 64m /var/tmp/zpool.$pool
zpool create -m none $pool /var/tmp/zpool.$pool

done

zfs create -o mountpoint=/mnt/origin origin/test

update () {
# create/update a log file
date > /mnt/origin/log
}

snapnum=0
make_snap () {
snapnum=$(($snapnum+1))
zfs snapshot origin/[EMAIL PROTECTED]
}

update
make_snap
update
make_snap
update
make_snap
update

zfs send origin/[EMAIL PROTECTED] | zfs receive -v -d copy
zfs clone copy/[EMAIL PROTECTED] copy/clone
zfs send -i origin/[EMAIL PROTECTED] origin/[EMAIL PROTECTED] | zfs receive -v -d copy
zfs set mountpoint=/mnt/copy copy/clone
zfs list -r origin copy
cd /mnt/copy   # make filesystem busy
zfs send -i origin/[EMAIL PROTECTED] origin/[EMAIL PROTECTED] | zfs receive -v -d copy
ls -l /mnt/{origin,copy}/log
exit
-
Cleanup with
zpool destroy copy; zpool destroy origin; rm /var/tmp/zpool.*
after running tests.

- Andreas
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
 But it seems that when we're talking about full block writes (such as
 sequential file writes) ZFS could do a bit better.
 
 And as long as there is bandwidth left to the disk and the controllers,
 it is difficult to argue that the work is redundant.  If it's free in
 that sense, it doesn't matter whether it is redundant.  But if it turns
 out NOT to have been redundant, you save a lot.
 

I think this is why an adaptive algorithm makes sense ... in situations where
frequent, progressive small writes are engaged by an application, the amount
of redundant disk access can be significant, and longer consolidation times
may make sense ... larger writes (>= the FS block size) would benefit less 
from longer consolidation times, and shorter thresholds could provide more
usable bandwidth

to get a sense of the issue here, I've done some write testing to previously
written files in a ZFS file system, and the choice of write element size
shows some big swings in actual vs data-driven bandwidth

when I launch a set of threads, each of which writes 4KB buffers 
sequentially to its own file, I observe that for 60GB of application 
writes the disks see 230+GB of IO (reads and writes): 
data-driven BW = ~41 MB/sec (my 60GB in ~1500 sec)
actual BW = ~157 MB/sec (the 230+GB in ~1500 sec)

if I do the same writes with 128KB buffers (the block size of my pool),
the same 60GB of writes only generates 95GB of disk IO (reads and writes):
data-driven BW = ~85 MB/sec (my 60GB in ~700 sec)
actual BW = ~134.6 MB/sec (the 95+GB in ~700 sec)

in the first case, longer consolidation times would have led to less total IO
and better data-driven BW, while in the second case shorter consolidation
times would have worked better
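
A much-simplified shell approximation of this kind of test is sketched below: it 
overwrites a set of existing files with either 4KB or 128KB writes while "zpool 
iostat tank 1" runs in another terminal, so the disk IO generated per GB of 
application writes can be compared. The pool name, paths, file count and sizes 
are placeholders, and backgrounded dd processes are only a stand-in for the 
multi-threaded test program used above.

# Overwrite four existing 1GB files with 4KB writes; repeat the run with
# bs=128k (and count=8192) to compare against the pool's 128KB block size.
for n in 1 2 3 4; do
    # conv=notrunc overwrites each file in place instead of truncating it,
    # matching the "previously written files" case described above.
    dd if=/dev/zero of=/tank/fs/file$n bs=4k count=262144 conv=notrunc &
done
wait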

as far as redundant writes possibly occupying free bandwidth (and thus
costing nothing), I think you also have to consider the related costs of
additional block scavenging, and less available free space at any specific 
instant, possibly limiting the sequentiality of the next write ... of
course there's also the additional device stress

in any case, I agree with you that ZFS could do a better job in this area,
but it's not as simple as just looking for large or small IOs ...
sequential vs random access patterns also play a big role (as you point out)

I expect  (hope) the adaptive algorithms will mature over time, eventually
providing better behavior over a broader set of operating conditions
... Bill
 
 


[zfs-discuss] ZFS on OS X port now on macosforge

2008-01-08 Thread Noël Dellofano
Hey everyone,

This is just a quick announcement to say that the ZFS on OS X port is  
now posted for your viewing fun at:

http://zfs.macosforge.org/

The page is also linked from the ZFS OpenSolaris page under ZFS
Ports:
http://opensolaris.org/os/community/zfs/porting/

This page holds the status of the ZFS on OS X port and includes a
small FAQ, some known bugs, announcements, and will include more as
time goes on.  It also holds the latest source code and binaries that
you can download to your heart's content.  So if you have a Mac, are
running Leopard, and are feeling bleeding edge, please try it out.
Comments, questions, suggestions and feedback are all very welcome.
I also want to point out this is BETA.  We're still working on getting  
some features going, as well as fleshing out issues with Finder, Disk  
Util, iTunes, and other parts of the system.  So when I say bleeding,  
I'm not kidding :)  However I'm excited to say that I'm happily  
running ZFS as my home directory on my MacBook Pro which is what I  
work off of every day, and am running weekly snapshots which I 'zfs  
send' to my external drive.  Oh happy day.

thanks!
Noel Dellofano


Re: [zfs-discuss] clones bound too tightly to its origin

2008-01-08 Thread eric kustarz
This should work just fine with the latest bits (Nevada 77 and later) via:
http://bugs.opensolaris.org/view_bug.do?bug_id=6425096

Its backport is currently targeted for an early build of s10u6.

eric

On Jan 8, 2008, at 7:13 AM, Andreas Koppenhoefer wrote:

 [I apologise for reposting this... but no one replied to my post  
 from Dec, 4th.]

 Hallo all,

 while experimenting with zfs send and zfs receive mixed with  
 cloning on receiver side I found the following...

 On server A there is a zpool with snapshots created on regular  
 basis via cron.
 Server B gets updated by a zfs-send-ssh-zfs-receive command pipe.  
 Both servers are running Solaris 10 update 4 (08/07).

 Sometimes I want to do some testing on server B without corrupting  
 data on server A. For doing so I create a clone of the filesystem.  
 Up to here everything is ok. As long as the mounted clone filesystem  
 is NOT busy, any further zfs-send-ssh-zfs-receive will work  
 properly, updating my pool on B.

 But there are some long running test jobs on server B which keeps  
 clone's filesystem busy or just a single login shell with its cwd  
 within clone's filesystem, which makes the filesystem busy from  
 umount's point of view.

 Meanwhile another zfs-send-ssh-zfs-receive command gets launched to  
 copy new snapshot from A to B. If the receiving pool of a zfs- 
 receive-command has busy clones, the receive command will fail.
 For some unknown reason the receive command tries to umount my  
 cloned filesystem and fails with Device busy.

 The question is: why?

 Since the clone is (or should be) independent of its origin, zfs  
 receive should not umount cloned data of older snapshots.

 If you want to reproduce this - below (and attached) you find a  
 simple test script. The script will bail out at the last zfs receive  
 command. If you comment out the line cd /mnt/copy, script will  
 run as expected.

 Disclaimer: Before running my script, make sure you do not have  
 zpools named copy or origin. Use this script only on a test  
 machine! Use it at your own risk.
 Here is the script:
 #!/usr/bin/bash
 # cleanup before test
 cd /
 set -ex
 for pool in origin copy; do
 zpool destroy $pool || :
 rm -f /var/tmp/zpool.$pool
 [ -d /mnt/$pool ] && rmdir /mnt/$pool

 mkfile -nv 64m /var/tmp/zpool.$pool
 zpool create -m none $pool /var/tmp/zpool.$pool

 done

 zfs create -o mountpoint=/mnt/origin origin/test

 update () {
 # create/update a log file
 date > /mnt/origin/log
 }

 snapnum=0
 make_snap () {
 snapnum=$(($snapnum+1))
 zfs snapshot origin/[EMAIL PROTECTED]
 }

 update
 make_snap
 update
 make_snap
 update
 make_snap
 update

 zfs send origin/[EMAIL PROTECTED] | zfs receive -v -d copy
 zfs clone copy/[EMAIL PROTECTED] copy/clone
 zfs send -i origin/[EMAIL PROTECTED] origin/[EMAIL PROTECTED] | zfs receive -v -d copy
 zfs set mountpoint=/mnt/copy copy/clone
 zfs list -r origin copy
 cd /mnt/copy  # make filesystem busy
 zfs send -i origin/[EMAIL PROTECTED] origin/[EMAIL PROTECTED] | zfs receive -v -d copy
 ls -l /mnt/{origin,copy}/log
 exit
 -
 Cleanup with
 zpool destroy copy; zpool destroy origin; rm /var/tmp/zpool.*
 after running tests.

 - Andreas


