On 2018-04-03 12:30 AM, Raymond Jennings wrote:
Are you guys close to getting merged into mainline?

I think it's high time that btrfs got a healthy dose of competition

Hi Raymond,

For the time being we will continue to develop out-of-tree, while continuing to track Linus's latest mainline kernel.

Currently, I am busy fixing Tux3's lack of directory indexing, which becomes a performance bottleneck at more than a few hundred files per directory. We need to fix this this before seriously putting Tux3 up against other general purpose file systems.

We could have gone with a hash-keyed B-tree indexing scheme like everybody else, but I felt we would be better off with a completely new approach based on scalable hash tables. I actually prototyped Shardmap back in 2012, to the point where I convinced myself that the technology was capable of meeting or beating B-tree performance at all scales, while not needing a huge hack to work around the basically impossible problem of doing readdir in hash order.

Evolving that prototype into usable code has kept me busy for a few months now. Problem number one was, a hash table does not scale naturally like a B-tree, instead the entire table needs to be expanded as directory size increases. A simple-minded implementation would cause huge latency for the create that happens to trigger the expand. Instead, Shardmap expands the hash table one shard at a time, where the latency of expanding a single shard is just a couple of milliseconds, appearing completely smooth to the user. The state of this incremental reshard, as I call it, needs to be recorded in the directory file so that the incremental re-shard will continue exactly where it left off if the directory is re-opened. After some effort, that settled down to a simple design where the index is represented as one or two "tiers" of hash tables, depending on whether whether a reshard is in progress or not. The lower tier merges incrementally into the upper tier until it disappears, so that the entire hash index moves higher up in the directory file over time, making room for a nice linear array of directory entry blocks below it.

This linear array of directory entry blocks is one of the main points of Shardmap. It means that readdir can use a simple logical directory address for readdir position, which is really the only way to comply accurately with Posix readdir semantics that were originally defined with simple linear directory layout in mind. Linear directory layout also gives the fastest and most cache-efficient readdir, so you can walk through an arbitrarily large Shardmap directory at essentially media transfer speed. Finally, we avoid an issue that Htree has, where walking the directory in hash order means that the inode table is accessed in random order, causing increased hash pressure and (in the case of delete) increased write multiplication.

Our nice linear array of directory entry blocks brings up hard problem number two: how to keep track of free space in directory entry blocks due to deleted entries? HTree does not have that problem because it always creates a new entry in the B-tree leaf that corresponds to the entry's hash, and splits that block to create space if necessary. So Shardmap needs something like a malloc, but because Shardmap competes with Htree in performance, the cost of this has to be nearly zero. My solution is a new algorithm called Bigmap, that records the largest free entry per block with overhead of just one byte per block. Searching and updating adds so little extra overhead that it is hard to measure.

Putting this all together, we got our reward: a directory index that scales efficiently to the billion file range while also handling smaller directories at least as efficiently as current B-tree schemes. Because a file system directory is really just a kind of specialized key-value store, we decided to compare Shardmap performance to standalone databases, and we found Shardmap outperforming them at create, delete and lookup for small data sets and large. This is by way of gaining confidence that we did not overlook some even better way to do things.

Please excuse me for going into this perhaps a little more deeply than I originally intended, but this should give you some idea where we are right now, and why we prioritized the current development work ahead of putting Tux3 up for LKML review once again. There is still more work to do on the Shardmap front: this code must now be ported from userspace to kernel,  work currently in progress. After that, there are some outstanding issues to take care of with seek optimization on spinning disk. That will bring us to the point where we are ready to make our case for mainline merge, without needing to explain away cases where we do not currently come out on top in file system benchmarks.



Tux3 mailing list

Reply via email to