I will touch briefly on the recent Ext4 issue, where files written by the write-to-temporary/rename-as-target method would frequently show up as zero length after a crash instead of having the original or updated data.
The problem is, Ext4 holds recently written file data in cache even across an atomic (journalled) update of directory metadata. While a strict reading of Posix permits this, application writers do not expect it, and I think we want to define stronger semantics for Tux3. That is, we should guarantee that a rename will never be committed to disk before the source file of the rename is flushed.

Our initial implementation of atomic commit will always flush every dirty inode to disk at each delta transition, which provides the above guarantee by default. That is, a rename will always be committed in or after the delta that flushes its source inode.

Later, we will move to a model where only some inodes are flushed at each delta, allowing more dirty file data to be retained in memory under heavy write loads, with a corresponding improvement in transfer efficiency. At that point we need to do something special to satisfy the proposed flush-before-rename rule. Supposing we keep a list of inodes scheduled to be flushed in the current delta, each rename just needs to move the source inode to that list if the source inode has dirty pages in page cache.

This additional requirement for rename is unlikely to reduce write performance perceptibly, because write/rename loads are relatively rare. The write/rename strategy is typically used for updating application config files, or for mail delivery, where the application writer relies on rename as the only Posix means of atomically transitioning from one file state to another. Mass file writes such as cp -a or untar do not typically use the write/rename method.

It has also been proposed that application writers should not expect a write/rename operation to guarantee any ordering between data flush and rename, and that applications should be recoded to place an fsync before the rename where such ordering is important.
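
To make the flush-before-rename rule concrete, here is a toy model in Python of a delta that keeps a list of inodes scheduled for flushing. It is only a sketch: the Inode and Delta classes and all of their methods are hypothetical names invented for illustration and are not Tux3 code. The only point it demonstrates is that a rename pulls a dirty source inode into the current delta's flush list, while unrelated dirty inodes may stay cached.

```python
# Toy model only: every class and method name here is hypothetical.

class Inode:
    def __init__(self, name):
        self.name = name
        self.dirty_pages = 0   # pages of file data not yet on disk


class Delta:
    """One atomic commit: inodes to flush plus namespace changes."""

    def __init__(self):
        self.flush_list = []   # inodes scheduled to be flushed in this delta
        self.renames = []      # (old_name, new_name) pairs committed with it

    def schedule_flush(self, inode):
        if inode not in self.flush_list:
            self.flush_list.append(inode)

    def rename(self, inode, new_name):
        # Flush-before-rename: a source inode with dirty data joins this
        # delta's flush list, so the rename is never committed ahead of it.
        if inode.dirty_pages:
            self.schedule_flush(inode)
        self.renames.append((inode.name, new_name))
        inode.name = new_name

    def commit(self):
        # File data goes out first, then the metadata exposing the rename.
        log = []
        for inode in self.flush_list:
            log.append(("flush", inode.name, inode.dirty_pages))
            inode.dirty_pages = 0
        for old, new in self.renames:
            log.append(("rename", old, new))
        self.flush_list.clear()
        self.renames.clear()
        return log


tmp = Inode("config.tmp")
tmp.dirty_pages = 3
big = Inode("big.dat")
big.dirty_pages = 1000          # not renamed, so it may stay in cache

delta = Delta()
delta.rename(tmp, "config")
log = delta.commit()
```

In this model the large unrelated file keeps its dirty pages across the commit, which is the behaviour the partial-flush scheme is meant to preserve.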
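
For reference, the application-side pattern under discussion looks roughly like the following Python sketch, shown here with the proposed fsync included. The function name atomic_update and the ".tmp" suffix are assumptions made for illustration. Dropping the os.fsync line gives the plain write/rename idiom that applications have traditionally relied on.

```python
import os

def atomic_update(target, data):
    """Write data to a temporary file, then rename it over target."""
    tmp = target + ".tmp"        # hypothetical temp-name convention
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                # push Python's buffer to the kernel
        os.fsync(f.fileno())     # the proposed explicit flush before rename
    os.rename(tmp, target)       # on POSIX, atomically replaces target
```

The fsync forces the application to wait for the data to reach disk on every update, which is the cost being weighed against an ordering barrier inside the filesystem.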
I side with the application writers on this: expectations of write/rename semantics are well established, and it is now incumbent on filesystem developers to fulfill these expectations, even though a strict reading of Posix does not require it. Imposing a new requirement for fsync before rename would slow down file operations significantly under many common loads, much more so than introducing a barrier between data flush and rename commit. Most importantly, large numbers of applications expect this implicit ordering, because in the past, filesystems have always seemed to provide it.

Regards,
Daniel