> On my own system, when a new file is written, the write block size does not make a significant difference to the write speed
Yes, I've observed the same result. When a new file is being written sequentially, the file data and newly constructed metadata can be built in cache and written out in large sequential chunks periodically, without any need to read in existing metadata and/or data. Data and metadata newly constructed in cache for sequential operations seem to persist in cache effectively, and the application I/O size becomes a much less sensitive parameter. Monitoring the disks with iostat in these cases shows the disk I/O to be only marginally greater than the application I/O.

This is why I specified that the write tests described in my previous post were to existing files. The overhead of doing small sequential writes to an existing object is so much greater than writing to a new object that it begs for some reasonable explanation. (A minimal sketch of the kind of test I mean is appended at the end of this message.) The only explanation I've been able to assemble through various experimentation is that data/metadata for existing objects is not retained effectively in cache once ZFS detects that such an object is being sequentially written. This forces constant re-reading of the data/metadata associated with the object, causing a huge increase in device I/O traffic that does not seem to accompany the writing of a brand new object. The size of RAM seems to make little difference in this case.

As small sequential writes accumulate in the 5-second cache, the chain of metadata leading to the newly constructed data block may see only one pointer (of the 128 in the final set) changing to point to that block, but all the metadata from the uberblock down to the target must be rewritten on the 5-second flush. Of course, this is not much different from what happens in the newly created object scenario, so it must be the behavior that follows the flush that differs. It seems to me that after this flush, some or all of the data/metadata that will be affected next is re-read, even though much of what's needed for the subsequent operations should already be in cache.

My experience with large-RAM systems and with the use of SSDs as ZFS cache devices has convinced me that data/metadata associated with sequential write operations to existing objects (and ZFS seems very good at detecting this association) does not get retained in cache very effectively. You can see this very clearly if you look at the I/O to a cache device (ZFS allows you to easily attach a device to a pool as a cache device, which acts as a sort of L2 cache behind RAM). When I do random I/O operations to existing objects, I see a large amount of I/O to my cache device as RAM fills and ZFS pushes cached information (that would otherwise be evicted) to the SSD cache device. If I repeat the random I/O test over the same total file space, I see improved performance as I get occasional hits from the RAM cache and the SSD cache. As this extended cache hierarchy warms up with each test run, my results continue to improve. If I run sequential write operations to existing objects, however, I see very little activity to my SSD cache, and virtually no change in performance when I immediately run the same test again.

It seems that ZFS is still in need of some fine-tuning for small sequential write operations to existing objects.

regards,
Bill
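For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of test described above. The path, block size, and file size below are placeholders rather than the values from my actual runs: it times the same small sequential writes first against a brand-new file and then against the now-existing file. Run it on a ZFS dataset and watch the disks with iostat in another window to see the difference in device traffic.

#!/usr/bin/env python3
# Sketch only: compares small sequential writes to a new file vs. an
# existing file. Path and sizes are hypothetical; adjust for your pool.
import os
import time

PATH = "/tank/test/seqwrite.dat"   # hypothetical dataset path
BLOCK = 8 * 1024                   # small application write size (8 KB)
TOTAL = 256 * 1024 * 1024          # 256 MB per run

def sequential_write(path, flags):
    """Write TOTAL bytes in BLOCK-sized chunks; return elapsed seconds."""
    buf = b"\0" * BLOCK
    fd = os.open(path, flags, 0o644)
    start = time.time()
    written = 0
    while written < TOTAL:
        os.write(fd, buf)
        written += BLOCK
    os.fsync(fd)
    os.close(fd)
    return time.time() - start

# Pass 1: brand-new object -- data and metadata are built fresh in cache.
if os.path.exists(PATH):
    os.remove(PATH)
t_new = sequential_write(PATH, os.O_WRONLY | os.O_CREAT)

# Pass 2: the same small sequential writes over the now-existing object.
t_existing = sequential_write(PATH, os.O_WRONLY)

print(f"new file:      {TOTAL / t_new / 1e6:.1f} MB/s")
print(f"existing file: {TOTAL / t_existing / 1e6:.1f} MB/s")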