> On my own system, when a new file is written, the write block size does not make a significant difference to the write speed
Yes, I've observed the same result. When a new file is being written sequentially, the file data and newly constructed metadata can be built in cache and written out in large sequential chunks periodically, without any need to read in existing metadata and/or data. Data and metadata newly constructed in cache for sequential operations seem to persist in cache effectively, and the application I/O size becomes a much less sensitive parameter. Monitoring the disks with iostat in these cases shows the disk I/O to be only marginally greater than the application I/O.

This is why I specified that the write tests described in my previous post were to existing files. The overhead of doing small sequential writes to an existing object is so much greater than writing to a new object that it begs for some reasonable explanation. (A minimal sketch of the kind of test I mean is appended at the end of this message.) The only explanation I've been able to assemble through various experimentation is that data/metadata for existing objects is not retained effectively in cache once ZFS detects that such an object is being sequentially written. This forces constant re-reading of the data/metadata associated with the object, causing a huge increase in device I/O traffic that does not seem to accompany the writing of a brand new object. The size of RAM seems to make little difference in this case.

As small sequential writes accumulate in the 5-second cache, the chain of metadata leading to the newly constructed data block may see only one pointer (of the 128 in the final set) changing to point to that block, but all the metadata from the uberblock down to the target must be rewritten on the 5-second flush. Of course, this is not much different from what happens in the newly created object scenario, so it must be the behavior that follows the flush that differs. It seems to me that after this flush, some or all of the data/metadata that will be affected next is re-read, even though much of what's needed for the subsequent operations should already be in cache.

My experience with large-RAM systems and with the use of SSDs as ZFS cache devices has convinced me that data/metadata associated with sequential write operations to existing objects (and ZFS seems very good at detecting this association) does not get retained in cache very effectively. You can see this very clearly if you look at the I/O to a cache device (ZFS allows you to easily attach a device to a pool as a cache device, which acts as a sort of L2 cache behind RAM). When I do random I/O operations to existing objects, I see a large amount of I/O to my cache device as RAM fills and ZFS pushes cached information (that would otherwise be evicted) to the SSD cache device. If I repeat the random I/O test over the same total file space, I see improved performance as I get occasional hits from the RAM cache and the SSD cache. As this extended cache hierarchy warms up with each test run, my results continue to improve. If I run sequential write operations to existing objects, however, I see very little activity to my SSD cache, and virtually no change in performance when I immediately run the same test again.

It seems that ZFS is still in need of some fine-tuning for small sequential write operations to existing objects.

regards,
Bill
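For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of test described above. The path, block size, and file size below are placeholders rather than the values from my actual runs: it times the same small sequential writes first against a brand-new file and then against the now-existing file. Run it on a ZFS dataset and watch the disks with iostat in another window to see the difference in device traffic.

#!/usr/bin/env python3
# Sketch only: compares small sequential writes to a new file vs. an
# existing file. Path and sizes are hypothetical; adjust for your pool.
import os
import time

PATH = "/tank/test/seqwrite.dat"   # hypothetical dataset path
BLOCK = 8 * 1024                   # small application write size (8 KB)
TOTAL = 256 * 1024 * 1024          # 256 MB per run

def sequential_write(path, flags):
    """Write TOTAL bytes in BLOCK-sized chunks; return elapsed seconds."""
    buf = b"\0" * BLOCK
    fd = os.open(path, flags, 0o644)
    start = time.time()
    written = 0
    while written < TOTAL:
        os.write(fd, buf)
        written += BLOCK
    os.fsync(fd)
    os.close(fd)
    return time.time() - start

# Pass 1: brand-new object -- data and metadata are built fresh in cache.
if os.path.exists(PATH):
    os.remove(PATH)
t_new = sequential_write(PATH, os.O_WRONLY | os.O_CREAT)

# Pass 2: the same small sequential writes over the now-existing object.
t_existing = sequential_write(PATH, os.O_WRONLY)

print(f"new file:      {TOTAL / t_new / 1e6:.1f} MB/s")
print(f"existing file: {TOTAL / t_existing / 1e6:.1f} MB/s")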