At lunch yesterday, Nathan mentioned that he is interested in seeing how Tahoe's ideas and techniques could trickle outwards and influence the design of other security systems. And I was complaining about how the Firefox upgrade process doesn't provide the integrity checks that I want (it turns out they rely upon the CA infrastructure and SSL alone, no end-to-end checking; the updates and releases are GPG-signed, but Firefox doesn't check that, only humans might). And PyPI has this nice habit of appending "#md5=XYZ.." to the URLs of the release tarballs that they publish, which is (I think) automatically used by tools like easy_install to guard against corrupted downloads (and which I always use, as a human, to do the same). And Nathan mentioned a class of web attacks in which a page, loaded over SSL, imports something (JS, CSS, JPG) via a regular http: URL, and becomes vulnerable to third parties who can take over the page by controlling what arrives over unauthenticated HTTP.

So, setting aside the reliability-via-distributedness properties for a moment, what could we bring from Tahoe into regular HTTP and regular webservers that could improve the state of security on the web?

== Integrity ==

To start with integrity-checking, we could imagine a Firefox plugin that validated a PyPI-style #md5= annotation on everything it loads. The rule would be that no action would be taken on the downloaded content until the hash was verified, and that a hash failure would be treated like a 404. Or maybe a slightly different error code, to indicate that the correct resource is unavailable because of a server-side problem: you got the wrong version of the document, rather than the document being missing altogether.

This would work just fine for a flat hash: the original file remains untouched, only the referencing URLs change to get the new hash annotation. Non-enhanced browsers are unaffected: the #-prefixed fragment identifier is never sent to the server, and the <a name=> tag is fairly rare these days (and would still mostly work). Container files (the HTML which references the hashed documents) could be updated to benefit at leisure. Automation (see below) could be used to update the URLs in the containers whenever the referenced objects were modified.

To improve alacrity on larger files, Tahoe uses a Merkle tree over segments of the file. This tree has to be stored somewhere (Tahoe stores it along with the shares, but it would be more convenient for a web site to not modify the source files). We could use an annotation like "#hashtree=ROOTXYZ;http://otherplace" to reference an external hash tree (with root hash XYZ). The plugin would start pulling from the source file and the hash tree at the same time, and not deliver any source data until it had been validated. The hashtree object would need to start with the segment size and filesize, so the tree could be computed properly.
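
As a rough illustration of the segment-wise validation described here, the following Python sketch builds a Merkle tree over fixed-size segments and checks one segment against the root. Everything about it is an assumption made for illustration: the SHA-256 hash, the 0x00/0x01 domain-separation prefixes, the duplicate-last-node padding, and the header layout in serialize_hashtree are hypothetical choices, not Tahoe's actual hash tree format.

```python
import hashlib
import struct


def leaf_hash(segment):
    # Hypothetical tagging: prefix leaves with 0x00 so they cannot be
    # confused with interior nodes.
    return hashlib.sha256(b"\x00" + segment).digest()


def node_hash(left, right):
    # Hypothetical tagging: prefix interior nodes with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_tree(data, segment_size):
    """Return the tree as a list of levels: leaves first, root last."""
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)] or [b""]
    level = [leaf_hash(s) for s in segments]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Pad odd levels by duplicating the last node (an assumption).
            level = level + [level[-1]]
            levels[-1] = level
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def sibling_path(levels, index):
    """Collect the sibling hashes needed to validate segment `index`."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_segment(segment, index, path, root):
    """Recompute the root from one segment and its sibling path."""
    h = leaf_hash(segment)
    for sibling in path:
        h = node_hash(h, sibling) if index % 2 == 0 else node_hash(sibling, h)
        index //= 2
    return h == root


def serialize_hashtree(levels, segment_size, filesize):
    # Hypothetical file layout: segment size and filesize first (two
    # big-endian 64-bit integers), then every hash, leaves first.
    header = struct.pack(">QQ", segment_size, filesize)
    return header + b"".join(h for level in levels for h in level)
```

A plugin following this sketch would compare the last level's single hash against ROOTXYZ from the URL, then call verify_segment on each segment as it arrives, delivering data only after the check passes.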
For very large files, you could read those parameters and then pull down (via a Range: header) just the parts of the Merkle tree that were necessary. In this case, the automation would need to create the hash tree file and put it in a known place each time the source file changes, and then update the references.

(Note that "ROOTXYZ" provides the "identification" properties of this annotation, and "http://otherplace" provides the "location" properties, where identification means the ability to recognize the correct document if someone gives it to you, and location means the ability to retrieve a possibly-correct document. URIs provide identification, URLs are supposed to provide both.)

We could compress this by establishing an (overridable) convention that http://example.com/foo.mp3 always has a hashtree at http://example.com/foo.mp3.hashtree, resulting in a URL that looked like "http://example.com/foo.mp3#hashtree=ROOTXYZ". If you needed to store it elsewhere, you could use "#hashtree=ROOTXYZ;WHERE", and define WHERE to be a relative URL (with a default value of NAME.hashtree).

== Mutable Integrity ==

Zooko and I have both run HTML presentations out of a Tahoe grid (which makes for a great demo), and the first thing you learn there is that immutability, while a great property in some cases, is a hassle for authoring. You need mutability somewhere, and the more places you have it, the fewer URLs you have to update every time you change something.

In technical terms, you frequently want to cut down the diameter of the immutable domains of the object DAG, by splitting those domains with mutable boundary nodes. In practical terms, it means you might want to publish *everything* via a mutable file. At the very least, if your web site has any internal cycles in it, you'll need a mutable node to break the cycle.

Again, this requires data beyond the contents of the source file.
We could use a "#sigkey=XYZ" annotation with a base62'ed ECDSA pubkey (this would provide the "identification" property of the constant pubkey), but we'd still need to know where to get the actual signature (the "location" property of the variable signature). We could do "#sigkey=XYZ;sigurl=http://otherplace". Or we could establish a convention of keeping the signature files next to the source files with "#sigkey=XYZ;sigsuffix=.sig" (and then http://example.com/main.css would have its signature stored in http://example.com/main.css.sig). Or, compress the convention further and have "sigkey=" imply "sigsuffix=.sig" unless overridden.

This would involve two GETs, but they'd be done in parallel, and the original files would remain untouched (thus unaware browsers would be unaffected, obliviously content in their insecurity). The immutable "#hashtree=" would also involve two parallel GETs, but presumably it'd only be used for large files, in which the overhead would be less noticeable, whereas the mutable "#sigkey=" would be used for even small files, so you might notice the overhead more.

The .sig file would probably contain a copy of the pubkey too, for local verification purposes. If we used a signature scheme that didn't give us short-enough pubkeys, the .sig file would contain the whole pubkey, and the #sigkey=XYZ suffix would contain its hash.

== Encryption ==

Now, how could we provide fine-grained confidentiality? We all know how broken the SSL+CA model is. Tahoe uses per-object encryption keys that are tightly bound to the object identifiers, providing obj-cap properties (like fine-grained delegation) and also honoring the end-to-end argument.

Obviously, this step requires abandoning the unmodified browser. Goodbye unmodified browser! Now, the plugin-enhanced browsers that are left can recognize a new URL scheme. Let's call it "x-yzzy:" for now (I don't want to use "tahoe:" for this purpose, since I still want that for *distributed* secure files).
These URLs will look like "x-yzzy://example.com/READKEY.UEBHASH", and behave just like Tahoe immutable readcaps for 1-of-1 encoded files except they reference the single host where you can get the sole share (instead of permuting an out-of-band serverlist to find a set of likely places for k shares). The READKEY would be hashed to form a storage-index, then the plugin would fetch http://example.com/STORAGEINDEX (base64-encoded), which would contain an encrypted+hashed version of the plaintext. The hash information would include both a flat hash and a Merkle tree, covered by a UEB just like in Tahoe (except we could drop the block hash tree since k=1).

For mutable files, the URL would be "x-yzzy://example.com/MUTREADKEY", which would be even shorter (2*kappa instead of (1+2)*kappa, if I'm remembering the necessary length of the hash correctly). Again, MUTREADKEY is hashed to form a storage-index, the corresponding ciphertext+hashes+signature file is fetched, the hashes checked, the signature checked, the data decrypted, and delivered to the caller.

Web servers would be completely unaffected: they'd just have directories full of base64-encoded (or base62, or a modified base64 without "/", or whatever) filenames, which they serve to anyone who cares. All GETs would use unencrypted http, since this protocol would provide both integrity and confidentiality.

Oh, and the rule would be that the storage-index would be treated as a URL relative to the http equivalent of the original x-yzzy URL. So "x-yzzy://example.com/subdir/READKEY.UEBHASH" would get an encrypted blob from "http://example.com/subdir/STORAGEINDEX".

== Tools ==

You'd start with a hashing tool: given a file, emit the "#hash=XYZ" suffix that should be tacked on to the URL. Or, given a URL prefix and a webroot-relative filename, emit the whole URL.

Then you'd move on to the Merkle tree generation tool.
Given FILENAME, it writes the hash tree data to FILENAME.hashtree, and emits the "#hashtree=XYZ" suffix that you need to attach to the URL.

The mutable-file tool would maintain an out-of-webroot file mapping pubkey to privkey. It would create a new keypair when run on a file that did not already have a .sig file, or would extract the old pubkey from an existing .sig file and look up the corresponding signing key. It would emit the #sigkey=XYZ suffix, and update or create the .sig file (next to the original data file) with the new signature.

The encryption+immutable tool would take a file (from your source directory, which of course would *not* be under the webroot), produce the encrypted+hashed Tahoe-like single-share output data, store it in the webroot under the storage-index name, and emit the URL.

The encryption+mutable tool would do the same, taking the existing key from an adjoining .key file (or creating a new one), putting the signed+hashed+encrypted data in the webroot, and emitting the URL.

== Automation ==

Now, what's a good way to update all the container files? I.e., when you change your CSS and it gets a new hash, how should you update the .html file that references it? I've been using Git a lot recently, and it gave me an idea:

* store your website in Git or Mercurial (you *do* manage your website in a revision control system, right? and the system you picked *does* give you cryptographically-strong file-version identifiers, right?)
* use regular relative URLs in the .html files that you check in; web authors remain unaware of the integrity-checking suffixes that get added later
* now build a tool that rewrites the HTML (and other containers, JS and perhaps CSS) to replace the relative URLs with URL#hash=XYZ . The tool runs at checkout time, when you deploy a new revision to the webserver, or takes a git checkout (with all repository metadata) as input and produces the webroot directories as output.
* The tool will build a table that says "bar.css has hash=XYZ" for everything that gets checked out.
* take advantage of git's hash-of-data content-tracking properties to cache the table that maps object to #hash=XYZ values: instead of "the current version of bar.css has hash=XYZ", remember "version ABC of bar.css will always have hash=XYZ".
* build a table that says "version ABC of foo.html references bar.css and baz.js", to capture the object graph. Invert the table ("bar.css is referenced by version ABC of foo.html, among others"). Now you can quickly tell what files need rewriting when bar.css is modified. New versions of foo.html get rescanned, added to the who-references-whom table, then processed (hashed) and added to the whats-your-hash table, then anyone who references it gets updated.
* keep careful track of containers (objects which reference other objects). If bar.css imports booze.css, then while the original contents of bar.css might not change, the annotated version (which includes "booze.css#hash=XYZ") will change whenever booze.css changes. The tables must reflect this, so that the updating scheme will catch everything
* the last step should be a sanity check, walking through all the output files, and comparing the #hash=XYZ values therein with the actual hashes of the other output files.
* the generated tables can be used to alert you to immutable-reference cycles, which are a no-no, and require mutability somewhere to break the circle and turn the graph back into a strict DAG.

Then, when you introduce mutability, you somehow mark the filenames that you want to be delivered as mutable (breaking cycles and reducing reference-updating effort, in exchange for possibly slowing down client fetch times). Then this rewriting tool will treat those files differently at checkout, creating (or updating) mutable objects for them. Other files which reference the mutable ones don't need to be updated when they change.
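
To make the rewriting steps above a bit more concrete, here is a minimal in-memory Python sketch of the immutable case. It is only a sketch under loud assumptions: the function names are hypothetical, files live in one flat namespace (a dict of name to text), references are found with a simple regex over double-quoted href= and src= attributes, and the hash is hex SHA-256. It handles no mutable files and does not talk to Git at all.

```python
import hashlib
import re

# Assumption: only double-quoted href/src values with no scheme (no ":")
# and no existing fragment (no "#") count as rewritable relative URLs.
REF = re.compile(r'(href|src)="([^"#:]+)"')
ANNOTATED = re.compile(r'(?:href|src)="([^"#:]+)#hash=([0-9a-f]+)"')


def referenced_by(files):
    """Inverted reference table: target name -> set of referencing files."""
    table = {}
    for name, text in files.items():
        for match in REF.finditer(text):
            if match.group(2) in files:
                table.setdefault(match.group(2), set()).add(name)
    return table


def annotate_site(files):
    """Rewrite relative URLs to URL#hash=XYZ, referenced files first.

    Returns (output, hashes). A container's hash is taken over its
    *annotated* text, so it changes whenever anything it references does.
    """
    output, hashes, in_progress = {}, {}, set()

    def process(name):
        if name in hashes:
            return
        if name in in_progress:
            # Immutable-reference cycle: needs a mutable node to break it.
            raise ValueError("reference cycle through %s" % name)
        in_progress.add(name)

        def rewrite(match):
            target = match.group(2)
            if target not in files:
                return match.group(0)  # leave unknown URLs untouched
            process(target)
            return '%s="%s#hash=%s"' % (match.group(1), target, hashes[target])

        text = REF.sub(rewrite, files[name])
        in_progress.discard(name)
        output[name] = text
        hashes[name] = hashlib.sha256(text.encode("utf-8")).hexdigest()

    for name in sorted(files):
        process(name)
    return output, hashes


def sanity_check(output):
    """Compare every #hash= value against the actual output file hash."""
    for text in output.values():
        for target, claimed in ANNOTATED.findall(text):
            actual = hashlib.sha256(output[target].encode("utf-8")).hexdigest()
            if actual != claimed:
                return False
    return True
```

A real tool would presumably key its cache on Git's own blob identifiers rather than rehashing everything, and would use referenced_by to decide which containers to regenerate after a change; this sketch simply recomputes the whole site each time.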

When you introduce encryption, the same tool is used, except it dumps encrypted+hashed+(sometimes-)signed storage-index-named files into the output directory, instead of preserving the original filenames. The sanity-check would need to be given the readcaps (instead of working on the ciphertext, obviously), but would proceed the same way.

The entire process could be automated to run each time you pushed a change to the production branch. Authors would be unaware of the process (except they'd get fewer complaints about http-used-in-https vulnerabilities). Web servers would be unaware of the process (they're just serving up weirdly-named files). End users (well, at least those who'd installed the plugin) would be mostly unaware of the process (they'd just see weird URLs in their status bar, but they're starting to get used to that anyways). If you stick with integrity (and not encryption), then end users with normal browsers are mostly unaware (they see the #hash=XYZ suffixes, if their status bar is wide enough).

I've no idea how hard it would be to write this sort of plugin. But I'm pretty sure it's feasible, as would be the site-building tools. If Firefox had this built-in, and web authors used it, what sorts of vulnerabilities would go away? What sorts of new applications could we build that would take advantage of this kind of security?

thoughts?
 -Brian
_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
