This post contains two others: one that I sent to Zooko a while ago but was foiled in posting to the list afterwards, because Gmane was still using the old allmydata list address, and one that I posted as a reply to a G+ post Zooko made a while ago. Posting both to the list as he requested.

First message: CRUSH and Tahoe-LAFS

---cut---
Hey,

I was reading through the Tahoe-LAFS FAQ, and while looking into the one about controlling replication based on topology (the wiki page, tickets, etc.), I noticed that there didn't seem to be any mention of CRUSH, which the Ceph cluster filesystem (or rather, its distributed object store RADOS) uses for this.

Figured it might be worthwhile to toss you a link in case you hadn't seen it:
http://ceph.com/papers/weil-crush-sc06.pdf

Ceph/RADOS is a solution to a different problem than Tahoe-LAFS, but CRUSH is interesting for the cases listed at the top of the wiki page because as long as the client has a copy of the crushmap, computing where something goes (or comes from) is a purely local operation. Since the crushmap is user-specified, and the placement is then generated from it, it lets users describe their topology and policies and then just lays the data out accordingly.
---cut---

Second message: Overall design similarities between Ceph and Tahoe-LAFS
In reply to https://plus.google.com/108313527900507320366/posts/ZrgdgLhV3NG
May ramble a bit.

---cut---
Apologies for commenting on such an ancient post, but I figured I'd drop some info about Ceph here. Yes, I'm the same guy who sent the email about CRUSH - I happened to come across this via the post to freedombox-devel on not being a filesystem.

Anyway, Ceph does have something roughly analogous to introducers, called Monitors or MON nodes. They also manage the Paxos consistency stuff, IIRC. Ceph manages clients connecting to the cluster by letting the client pick a monitor, any monitor, at which point it bootstraps to more knowledge of the cluster. One thing it does is let the client know about the other monitors, so even if the one the client used to connect dies, nothing bad happens (unless there aren't enough left to keep Paxos happy, that is).

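To make the monitor bootstrap idea concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Ceph's actual protocol or API: the `Monitor` class, `bootstrap`, `has_quorum`, and the `mon-a`/`mon-b`/`mon-c` names are all made up for illustration, and "keeping Paxos happy" is reduced to "a strict majority of the known monitors is alive".

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical stand-in for a monitor node (not a Ceph class)."""
    name: str
    alive: bool = True
    # Names of every monitor in the cluster, this one included.
    peers: list = field(default_factory=list)


def bootstrap(seed_names, monitors):
    """Ask the first reachable seed for the full monitor list."""
    for name in seed_names:
        mon = monitors.get(name)
        if mon is not None and mon.alive:
            return list(mon.peers)
    raise ConnectionError("no seed monitor reachable")


def has_quorum(known, monitors):
    """True if a strict majority of the known monitors is alive."""
    alive = sum(1 for name in known if monitors[name].alive)
    return alive > len(known) // 2


names = ["mon-a", "mon-b", "mon-c"]
cluster = {n: Monitor(n, peers=list(names)) for n in names}

# The client only needs to know one monitor up front.
known = bootstrap(["mon-b"], cluster)

# The monitor it bootstrapped from dies; the other two still form a majority.
cluster["mon-b"].alive = False
still_ok = has_quorum(known, cluster)

# A second failure leaves one of three, which is no longer a majority.
cluster["mon-c"].alive = False
after_second_failure = has_quorum(known, cluster)
```

The point of the sketch is that the seed monitor matters only for the first contact: once the client holds the full list, losing that particular monitor changes nothing as long as a majority survives.
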
Monitors are actually pretty close to what Ticket 68 seems to be hoping for, aside from being a separate node type instead of being on every node.

I think you might find a lot of Ceph's design interesting - especially from the perspective of scaling Tahoe. For one, thinking of it only as a filesystem actually misses a lot of its capabilities. The part of Ceph that's really fascinating is the underlying object store, RADOS. It's surprisingly close to Tahoe, as a matter of fact - placement of objects can be computed on any node via a function, so the client can know where stuff is going without talking to some sort of central server or DHT (they have an optimization where the client puts an object on one OSD and lets that OSD do the distribution, but that's an implementation choice and not a core design element).

The Ceph MDS nodes aren't part of RADOS - their big role is putting POSIX on top of the object store, and doing some fancy caching/load balancing of (posixy) metadata for performance.

RADOS itself is an object storage cluster that replicates data among a configurable arrangement of nodes, is often accessed through a gateway that makes the protocol look like S3 or Swift, and has clustered introducers. Aside from encryption, that means some of the things on Tahoe's wiki and proposed enhancements list look kinda familiar at times...

I'd recommend checking out the 2013 Linux.conf.au talks on Ceph - the one on OpenStack goes over some of the other non-POSIX-fs ways they're using the underlying object store, like thin-provisioned network block devices.
---cut---

_______________________________________________
tahoe-dev mailing list
[email protected]
https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
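
A postscript on the "placement is a purely local computation" point made in both messages: the Python sketch below shows the general idea. It is not the CRUSH algorithm. It swaps in plain rendezvous (highest-random-weight) hashing with a "one replica per rack" rule, and the `score` and `place` functions, the `osdN`/`rackN` names, and the flat node-to-rack map are all assumptions for illustration. A real crushmap is hierarchical and supports weights and richer rules.

```python
import hashlib


def score(object_id: str, node: str) -> int:
    """Deterministic pseudo-random score for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_id}/{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id, cluster_map, replicas):
    """Pick up to `replicas` nodes for an object, at most one per rack.

    cluster_map maps node name -> rack name. Nodes are ranked by their
    score for this object, and lower-ranked nodes in an already-used
    rack are skipped.
    """
    ranked = sorted(cluster_map, key=lambda n: score(object_id, n), reverse=True)
    chosen, used_racks = [], set()
    for node in ranked:
        rack = cluster_map[node]
        if rack in used_racks:
            continue
        chosen.append(node)
        used_racks.add(rack)
        if len(chosen) == replicas:
            break
    return chosen


# Hypothetical topology: five storage nodes spread over three racks.
cluster_map = {
    "osd0": "rack1",
    "osd1": "rack1",
    "osd2": "rack2",
    "osd3": "rack2",
    "osd4": "rack3",
}

targets = place("some-object", cluster_map, replicas=3)

# A second client holding its own copy of the map gets the same answer
# without talking to anyone.
targets_elsewhere = place("some-object", dict(cluster_map), replicas=3)
```

Any client with the same map computes the same targets, so reads and writes need no central lookup service or DHT. Changing the map (adding a node, moving one to another rack) changes the result of the function, which is how a user-described topology ends up driving the layout.
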
