In an earlier design note I described the Tux3 cache layering model, and a pipelining problem related to updating inode table blocks:
http://kerneltrap.org/mailarchive/tux3/2008/11/10/4051294

I proposed a method of deferring namespace filesystem updates until delta transition, which promises not only to solve the update pipeline problem, but also to improve latency of buffered file operations and reduce contention on the very busy i_mutex locks under namespace-intensive loads. But this technique is new and untried, and requires changes to the core kernel. Even though the dentry cache change is quite small, I would prefer to be able to build as a module without core kernel patches at this point. And a significant amount of work could be required to implement this idea, which would distract us from the immediate task of preparing Tux3 for review.

So we need a workaround, which in my earlier note I suggested would be to make frontend operations wait for delta staging to complete at places where the frontend wants to violate the update pipeline order. But now I have noticed a more efficient and easier workaround that avoids the waits, and thus avoids frontend stalls on delta staging.

The only reason that a frontend inode create wants to update the inode table block immediately is to avoid choosing the same inode number for another create. Another way to do this is to remove the store_attrs call from make_inode and instead put a newly created inode onto a defer list, then consult that list in make_inode to avoid assigning the same inode number twice:

- Frontend create just puts the inode on a list of inodes for which the itable update is deferred. make_inode checks that list when choosing a new inode number.
- Frontend delete puts the inode on the list. open_inode checks this list to verify that the inode being loaded actually exists. This is just a cross check, because the dirent should be gone at that point. Leaving the inode attributes in the inode table until after the delta transition just means that the inode number will not be reused in the same delta.
- Delta staging applies the deferred inode updates for creates and deletes.

Observations:

- A linear defer list has N^2 behavior: allocating N inodes requires about N^2 / 2 list node compares. This is easily solved in various ways. Even easier is to ignore it until we observe the CPU cost, then fix it.
- Delta staging becomes the only updater of inode table blocks. This eliminates the update pipeline ordering violation. As a fringe benefit, the exclusive write lock in make_inode becomes a shared read lock, improving parallelism of frontend namespace operations.
- Delta staging is able to fork inode table blocks freely, because these blocks are read-only after staging, and thus read-only in earlier deltas (more on block forking in an upcoming design note).
- Compared to my earlier proposal to defer inode number assignment, we assign the inode number immediately on create, but defer recording it in the inode table. This means that complications like making NFS handle generation and fstat wait on inode number assignment are not needed.

In short, I am happy that this little piece fell into place in a way that supports clean cache layer separation, without having to pioneer the unexplored territory of deferred namespace operations.

Regards,

Daniel
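
P.S. For anyone who prefers code to prose, the defer list scheme above can be sketched roughly as follows. This is only a user-space toy under loud assumptions, not Tux3 code: the inode table is modeled as a small array of flags, and struct model, struct deferred, defer_update, delete_inode, inode_exists and stage_delta are all hypothetical names invented for the illustration. Only make_inode borrows its name from the note, and even its signature here is made up. Locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy model; none of this is the real Tux3 interface. */
enum { MAX_INUM = 64 };
enum defer_op { DEFER_CREATE, DEFER_DELETE };

struct deferred {
    unsigned inum;
    enum defer_op op;
    struct deferred *next;
};

struct model {
    bool itable[MAX_INUM];  /* stands in for the inode table blocks */
    struct deferred *defer; /* itable updates deferred until delta staging */
};

/* Linear search: this is the N^2 / 2 compare cost noted above. */
static struct deferred *find_deferred(struct model *m, unsigned inum)
{
    for (struct deferred *d = m->defer; d; d = d->next)
        if (d->inum == inum)
            return d;
    return NULL;
}

static int defer_update(struct model *m, unsigned inum, enum defer_op op)
{
    struct deferred *d = malloc(sizeof *d);
    if (!d)
        return -1;
    *d = (struct deferred){ .inum = inum, .op = op, .next = m->defer };
    m->defer = d;
    return 0;
}

/* Frontend create: choose a number that is free in the table and not
 * on the defer list. The table itself is only read, never written. */
static int make_inode(struct model *m, unsigned goal)
{
    for (unsigned inum = goal; inum < MAX_INUM; inum++) {
        if (m->itable[inum] || find_deferred(m, inum))
            continue;
        return defer_update(m, inum, DEFER_CREATE) ? -1 : (int)inum;
    }
    return -1; /* no free inode number at or above goal */
}

/* Frontend delete: the table entry stays until staging, so the number
 * cannot be reused in the same delta. A create still pending in this
 * delta is simply turned into a delete. */
static int delete_inode(struct model *m, unsigned inum)
{
    struct deferred *d = find_deferred(m, inum);
    if (inum >= MAX_INUM)
        return -1;
    if (d) {
        d->op = DEFER_DELETE;
        return 0;
    }
    return defer_update(m, inum, DEFER_DELETE);
}

/* Cross check an open would do: the defer list overrides the table. */
static bool inode_exists(struct model *m, unsigned inum)
{
    struct deferred *d = find_deferred(m, inum);
    if (d)
        return d->op == DEFER_CREATE;
    return inum < MAX_INUM && m->itable[inum];
}

/* Delta staging: the only writer of the table in this model. */
static void stage_delta(struct model *m)
{
    while (m->defer) {
        struct deferred *d = m->defer;
        assert(d->inum < MAX_INUM);
        m->itable[d->inum] = (d->op == DEFER_CREATE);
        m->defer = d->next;
        free(d);
    }
}
```

In this toy, two creates with the same goal get consecutive numbers even though the table is untouched, a number deleted in the current delta is skipped by later creates until stage_delta runs, and the table changes only inside stage_delta. A real implementation would need some lock on the defer list itself and probably something better than a linear list.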
