Re: [tahoe-dev] Bringing Tahoe ideas to HTTP

2009-09-08 Thread Brian Warner
James A. Donald wrote:
 Nicolas Williams wrote:
One possible problem: streaming [real-time] content.
 Brian Warner wrote:
   Yeah, that's a very different problem space. You need
   the low-alacrity stuff from Tahoe, but also you don't
   generally know the full contents in advance. So you're
   talking about a mutable stream rather than an
   immutable file.
 Not mutable, just incomplete.
 Immutable streaming content needs a tiger hash or a
 patricia hash, which can handle the fact that some of
 the stream will be lost in transmission, and that one
 needs to validate the small part of the stream that one
 has already received rather than waiting for the end.

I was assuming a real-time stream, so the goal would be to provide a
filecap before the source had finished generating all the bits. This
would necessarily be a mutable filecap, unless you've got some way of
predicting the future :-).

If instead, you just have a large file that a client wants to fetch one
piece at a time, well, Tahoe's immutable-file merkle trees already
handle the goal of quickly validating a small part of a large byte
sequence. You could use this in a non-real-time stream, in which you
process the entire input stream, produce and publish the filecap, then a
client fetches pieces of that stream at their own pace.

   upgrade bundles are produced by a very strict process,
   and are rigidly immutable [...] For software upgrades,
   it would reduce the attack surface significantly.
 But how does one know which immutable file is the one
 that has been blessed by proper authority?

You're right, I was assuming a pre-existing secure what to upgrade to
channel which could send down an integrity-enhanced download URL. To
actually implement such a channel would require further integrity
guarantees over mutable data.

As I understand it, Firefox actually has a fairly complex upgrade path,
because only certain combinations of from-version/to-version are fully
tested by QA. Sometimes moving from e.g. to requires
going through a sort of
path. The upgrade channel basically provides instructions to each older
version, telling them which new version they should move to.

The best sort of integrity guarantees I could think of would be a rule
that says the new version must be signed by my baked-in DSA key and
have a version number that is greater than mine, or maybe just the
upgrade instructions must be signed by that DSA key. It's probably not
a robust idea to make the rules too strict.


The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to

Bringing Tahoe ideas to HTTP

2009-08-31 Thread Brian Warner

[sent once to tahoe-dev, now copying to cryptography too, sorry for the

At lunch yesterday, Nathan mentioned that he is interested in seeing how
Tahoe's ideas and techniques could trickle outwards and influence the
design of other security systems. And I was complaining about how the
Firefox upgrade process doesn't provide the integrity checks that I want
(it turns out they rely upon the CA infrastructure and SSL alone, no
end-to-end checking; the updates and releases are GPG-signed, but
firefox doesn't check that, only humans might). And PyPI has this nice
habit of appending #md5=XYZ.. to the URLs of the release tarballs that
they publish, which is (I think) automatically used by tools like
easy_install to guard against corrupted downloads (and which I always
use, as a human, to do the same). And Nathan mentioned a class of web
attacks in which a page, loaded over SSL, imports something (JS, CSS,
JPG) via a regular http: URL, and becomes vulnerable to third-parties
who can take over the page by controlling what arrives over
unauthenticated HTTP.

So, setting aside the reliability-via-distributedness properties for a
moment, what could we bring from Tahoe into regular HTTP and regular
webservers that could improve the state of security on the web?

== Integrity ==

To start with integrity-checking, we could imagine a firefox plugin that
validated a PyPI-style #md5= annotation on everything it loads. The rule
would be that no action would be taken on the downloaded content until
the hash was verified, and that a hash failure would be treated like a
404. Or maybe a slightly different error code, to indicate that the
correct resource is unavailable and that it's a server-side problem, but
it's because you got the wrong version of the document, rather than the
document being missing altogether.

This would work just fine for a flat hash: the original file remains
untouched, only the referencing URLs change to get the new hash
annotation. Non-enhanced browsers are unaffected: the #-prefixed
fragment identifier is never sent to the server, and the a name= tag
is fairly rare these days (and would still mostly work). Container files
(the HTML which references the hashed documents) could be updated to
benefit at leisure. Automation (see below) could be used to update the
URLs in the containers whenever the referenced objects were modified.

To improve alacrity on larger files, Tahoe uses a Merkle tree over
segments of the file. This tree has to be stored somewhere (Tahoe stores
it along with the shares, but it would be more convenient for a web site
to not modify the source files). We could use an annotation like
#hashtree=ROOTXYZ;http://otherplace; to reference an external hash tree
(with root hash XYZ). The plugin would start pulling from the source
file and the hash tree at the same time, and not deliver any source data
until it had been validated. The hashtree object would need to start
with the segment size and filesize, so the tree could be computed
properly. For very large files, you could read those parameters and then
pull down (via a Range: header) just the parts of the Merkle tree that
were necessary. In this case, the automation would need to create the
hash tree file and put it in a known place each time the source file
changes, and then updated the references.

(note that ROOTXYZ provides the identification properties of this
annotation, and http://otherplace; provides the location properties,
where identification means the ability to recognize the correct document
if someone gives it to you, and location means the ability to retrieve a
possibly-correct document. URIs provide identification, URLs are
supposed to provide both.)

We could compress this by establishing an (overriable) convention that always has a hashtree at, resulting in a URL that looked like;. If you needed to store it
elsewhere, you could use #hashtree=ROOTXYZ;WHERE, and define WHERE to
be a relative URL (with a default value of NAME.hashtree).

== Mutable Integrity ==

Zooko and I have both run HTML presentations out of a Tahoe grid (which
makes for a great demo), and the first thing you learn there is that
immutability, while a great property in some cases, is a hassle for
authoring. You need mutability somewhere, and the more places you have
it, the fewer URLs you have to update every time you change something.
In technical terms, you frequently want to cut down the diameter of the
immutable domains of the object DAG, by splitting those domains with
mutable boundary nodes. In practical terms, it means you might want to
publish *everything* via a mutable file. At the very least, if your web
site has any internal cycles in it, you'll need a mutable node to break
the cycle.

Again, this requires data beyond the contents of the source file. We
could use a #sigkey=XYZ annotation with a base62'ed ECDSA pubkey (this

Re: [tahoe-dev] Bringing Tahoe ideas to HTTP

2009-08-31 Thread Brian Warner
Michael Walsh wrote:

  - Adding a HTTP header with this data but requires something like a
 server module or output script. It also doesn't ugly up the URL (but
 then again, we have url shortner services for manual typing).

Ah, but see, that loses the security. If the URL doesn't contain the
root hash, then you're depending upon somebody else for your
authentication, and then it's not end-to-end anymore. URL shortener
services are great for the location properties, but lousy for the
identification properties. Not only are you relying upon your DNS, and
your network, and every other network between you and the server, and
the server who's providing you with the data.. now you're also dependent
upon the operator of the shortener service, and their database, and
anyone who's managed to break into their database, etc.

I guess there are three new things to add:

 * secure identifier in the URL
 * an upstream request, to say what additional integrity information you
 * a downstream response, to provide that additional integrity

The stuff I proposed used extra HTTP requests for those last two.

But, if you use the HTTP request headers to ask for the extra integrity
metadata, you could use the HTTP response headers to convey it. The only
place where this would make sense would be to fetch the merkle tree, or
the signature. (if the file was small and you only check the flat hash,
then there's nothing else to fetch; if the file is encrypted, then you
can define its data layout to be whatever you like, and just include the
integrity information in it directly).

Oh, and that would make the partial-range request a lot simpler: the
client does a GET with a Range: 123-456 header, and the server looks
at the associated merkle tree, figures out which chain they'll need to
validate those bytes, and returns a header that includes all of those
hash tree nodes (and the segment size, filesize). And returns enough
file data to cover those segments (i.e. the Content-Range: response
would be for a larger range than the Range: request, basically rounded
up to a segment size). The client would hash the segments it receives,
build and verify the merkle chain, then compare the root of the chain
against the roothash in the URL and make sure they match.

The response headers might get a bit large:
log2(filesize/segsize)*hashsize . And you have to fetch at least a full
segsize. But that's predictable and fairly well bounded, so maybe it
isn't that big of a deal.

Doing this with HTTP headers instead of a separate GET would avoid a
roundtrip, since the server (which now takes an active role in the
process) can decide these things (like which hash tree nodes are needed)
on behalf of the client. Instead of the client pulling one file for the
data, then pulling part of another file to get the segment size (so it
can figure out which segments it wants), then pulling a different part
of that file to get the hash nodes... the client just does a GET Range:,
and the server figures out the rest.

As you said, it requires a server module. But compared to a client-side
plugin, that's positively easy :). It'd probably be a good idea to think
about a scheme that would take advantage of a server which had
additional capabilities like this. Maybe the client could try the header
thing first, and if the response didn't indicate that the server is able
to provide that data, go and fetch the .hashtree file.

 My thoughts purely turn to verifying files and all webpage resources
 integrity in a transparent and backward compatible way. Who has not
 encountered unstable connections where images get corrupted and css
 files don't fully load? Solving that problem would make me very happy!

Yeah. We've been having an interesting thread on tahoe-dev recently
about the backwards-compatible question, what sorts of failure modes are
best to aim for when you're faced with an old server or whatever.

You can't make this stuff totally transparent and backwards compatible
(in the sense that existing webpages and users should start benefiting
from it without doing any work).. I think that's part of the brokenness
of the SSL+CA model, where they wanted the only change to be starting
from https instead of http. But you can certainly create a framework
that lets people get better control over what they're loading. Now, if
only distributed programs could be written in languages with those sorts
of properties...


The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to