On 2018-04-03 12:30 AM, Raymond Jennings wrote:
Are you guys close to getting merged into mainline?

I think it's high time that btrfs got a healthy dose of competition


Hi Raymond,

For the time being we will continue to develop out-of-tree, while tracking Linus's latest mainline kernel.

Currently, I am busy fixing Tux3's lack of directory indexing, which becomes a performance bottleneck at more than a few hundred files per directory. We need to fix this before seriously putting Tux3 up against other general purpose file systems.

We could have gone with a hash-keyed B-tree indexing scheme like everybody else, but I felt we would be better off with a completely new approach based on scalable hash tables. I actually prototyped Shardmap back in 2012, to the point where I convinced myself that the technology was capable of meeting or beating B-tree performance at all scales, while not needing a huge hack to work around the basically impossible problem of doing readdir in hash order.

Evolving that prototype into usable code has kept me busy for a few months now. Problem number one: a hash table does not scale naturally the way a B-tree does; instead, the entire table needs to be expanded as the directory grows. A simple-minded implementation would cause huge latency for the create that happens to trigger the expand. Instead, Shardmap expands the hash table one shard at a time, where the latency of expanding a single shard is just a couple of milliseconds, so the expand appears completely smooth to the user. The state of this incremental reshard, as I call it, needs to be recorded in the directory file so that the reshard continues exactly where it left off if the directory is re-opened. After some effort, that settled down to a simple design where the index is represented as one or two "tiers" of hash tables, depending on whether a reshard is in progress. The lower tier merges incrementally into the upper tier until it disappears, so the entire hash index moves higher up in the directory file over time, making room for a nice linear array of directory entry blocks below it.
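
In rough outline, the two-tier idea looks something like this. This is just a simplified userspace sketch, not the actual Shardmap code: the struct layout, the shard-selection rule and the names (tier_insert, reshard_step) are illustrative assumptions, with each shard reduced to a growable array of (hash, block) pairs.

#include <stdint.h>
#include <stdlib.h>

struct entry { uint32_t hash; uint32_t block; };

struct shard {
	struct entry *at;       /* growable array of (hash, block) pairs */
	unsigned used, alloc;
};

struct tier {
	struct shard *shards;
	unsigned count;         /* power-of-two shard count */
};

struct shardmap {
	struct tier upper;      /* new, larger tier being filled */
	struct tier lower;      /* old tier, drained one shard at a time */
	unsigned next;          /* next lower shard to migrate (persisted) */
};

static void tier_insert(struct tier *tier, uint32_t hash, uint32_t block)
{
	struct shard *shard = &tier->shards[hash & (tier->count - 1)];
	if (shard->used == shard->alloc) {
		shard->alloc = shard->alloc ? 2 * shard->alloc : 8;
		shard->at = realloc(shard->at, shard->alloc * sizeof *shard->at);
	}
	shard->at[shard->used++] = (struct entry){ hash, block };
}

/*
 * Migrate exactly one shard from the lower tier into the upper tier.
 * Bounding the work per step is what keeps the latency of the create
 * that triggered the expand down to a couple of milliseconds.
 */
static void reshard_step(struct shardmap *map)
{
	if (!map->lower.count)
		return;                            /* no reshard in progress */

	struct shard *shard = &map->lower.shards[map->next];
	for (unsigned i = 0; i < shard->used; i++)
		tier_insert(&map->upper, shard->at[i].hash, shard->at[i].block);
	free(shard->at);

	if (++map->next == map->lower.count) {
		free(map->lower.shards);           /* lower tier fully drained */
		map->lower.count = 0;
		map->next = 0;
	}
	/* The real code records the reshard position in the directory file,
	   so a re-opened directory resumes exactly where it left off. */
}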

This linear array of directory entry blocks is one of the main points of Shardmap. It means that readdir can use a simple logical directory address as the readdir position, which is really the only way to comply accurately with Posix readdir semantics, originally defined with a simple linear directory layout in mind. Linear layout also gives the fastest and most cache-efficient readdir, so you can walk through an arbitrarily large Shardmap directory at essentially media transfer speed. Finally, we avoid an issue that HTree has, where walking the directory in hash order means that the inode table is accessed in random order, causing increased cache pressure and (in the case of delete) increased write multiplication.
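
To show why linear layout keeps readdir simple, here is roughly what the walk looks like, again as a simplified sketch rather than the actual code: the record format, the filldir callback and the in-memory block array are assumptions for illustration. The point is only that the readdir position is a plain logical offset into the directory, so resuming is trivial and blocks are visited in media order.

#include <stdint.h>
#include <stdbool.h>

enum { BLOCKSIZE = 4096 };

struct direntry {                /* hypothetical entry block record */
	uint64_t inum;               /* 0 marks a deleted entry (assumption) */
	uint16_t reclen;             /* record length; 0 terminates the block */
	uint8_t namelen;
	char name[];
};

typedef bool (*filldir_t)(const char *name, unsigned namelen, uint64_t inum);

/*
 * Walk the entry blocks starting at logical position *pos, calling
 * filldir for each live entry until it returns false (buffer full),
 * then record the position so the next call continues from there.
 */
void readdir_linear(char (*blocks)[BLOCKSIZE], uint64_t nblocks,
		    uint64_t *pos, filldir_t filldir)
{
	for (uint64_t bnum = *pos / BLOCKSIZE; bnum < nblocks; bnum++) {
		char *block = blocks[bnum];
		unsigned offset = *pos % BLOCKSIZE;
		while (offset < BLOCKSIZE) {
			struct direntry *entry = (void *)(block + offset);
			if (!entry->reclen)
				break;                       /* end of this block */
			if (entry->inum &&
			    !filldir(entry->name, entry->namelen, entry->inum)) {
				*pos = bnum * BLOCKSIZE + offset;
				return;                      /* resume here next time */
			}
			offset += entry->reclen;
		}
		*pos = (bnum + 1) * BLOCKSIZE;       /* start of next block */
	}
}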

Our nice linear array of directory entry blocks brings up hard problem number two: how to keep track of free space in directory entry blocks left behind by deleted entries? HTree does not have that problem because it always creates a new entry in the B-tree leaf that corresponds to the entry's hash, splitting that block to create space if necessary. So Shardmap needs something like a malloc, but because Shardmap competes with HTree on performance, the cost of this has to be nearly zero. My solution is a new algorithm called Bigmap, which records the largest free entry in each block, with an overhead of just one byte per block. Searching and updating add so little extra overhead that it is hard to measure.
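
The flavor of it is something like the following simplified sketch, not the actual Tux3 code: the one-byte granularity is from the design above, while the cursor and the helper names are just assumptions made for illustration.

#include <stdint.h>

struct bigmap {
	uint8_t *biggest;      /* biggest[i] = largest free record in block i */
	uint64_t blocks;
	uint64_t cursor;       /* where the previous search left off */
};

/*
 * Find an entry block with room for a record of `size` bytes, starting
 * from where the last search stopped. Returns -1 if no block has room,
 * meaning the caller should append a fresh entry block.
 */
static int64_t bigmap_find(struct bigmap *map, unsigned size)
{
	for (uint64_t n = 0; n < map->blocks; n++) {
		uint64_t i = (map->cursor + n) % map->blocks;
		if (map->biggest[i] >= size)
			return (int64_t)(map->cursor = i);
	}
	return -1;
}

/*
 * After a create or delete touches block i, the caller rescans just
 * that block for its largest remaining free record and stores it, so
 * the one-byte-per-block map never goes stale.
 */
static void bigmap_update(struct bigmap *map, uint64_t i, unsigned largest)
{
	map->biggest[i] = largest > 255 ? 255 : largest;
}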

Putting this all together, we got our reward: a directory index that scales efficiently to the billion file range while also handling smaller directories at least as efficiently as current B-tree schemes. Because a file system directory is really just a kind of specialized key-value store, we decided to compare Shardmap performance to standalone databases, and we found Shardmap outperforming them at create, delete and lookup for small data sets and large. This is by way of gaining confidence that we did not overlook some even better way to do things.

Please excuse me for going into this perhaps a little more deeply than I originally intended, but it should give you some idea of where we are right now, and why we prioritized the current development work ahead of putting Tux3 up for LKML review once again. There is still more work to do on the Shardmap front: the code must now be ported from userspace to kernel, work that is currently in progress. After that, there are some outstanding issues to take care of with seek optimization on spinning disk. That will bring us to the point where we are ready to make our case for a mainline merge, without needing to explain away cases where we do not currently come out on top in file system benchmarks.

Regards,

Daniel

