On Sun, Jul 26, 2009 at 02:19:23PM -0700, Robert Ames wrote:
> In addition, I see here an explicit recommendation is to maintain
> revision history outside of CouchDB, and it seems as though the
> replication model is pretty similar to Git's model...
The replication model is quite different to Git's - although this would be
a useful comparison to put on the wiki somewhere.

With Git, you have a copy of each peer's tree (remotes/PEER/BRANCH). Then
you perform a merge into a single document in your working tree. If the
merge fails, then it fails; you still have a single working copy, but with
the conflicts explicitly marked within that document. It's up to you to
resolve those conflicts and commit the final version.

With Git, merging is always done when you *pull* from a peer; if you
*push* to a peer which can't be fast-forwarded, the push fails. (Or you
can force the push, but that simply overwrites the changes at the peer.)

With CouchDB, you don't keep track of peers in the database. When you
replicate from a peer, or a peer replicates to you, and the documents
conflict(*), then you get multiple copies of the document within the
database. When you request a document by ID, you get an arbitrary one of
this set, unless you explicitly ask for the other versions. However, the
multiple copies are all there and are all effectively equal copies (except
for the property that one is arbitrarily chosen as the "winner"); no
precedence is given to the version which originated locally, for instance.

(*) In this case, 'conflict' effectively means 'derived from a different
_rev'. There is no attempt in CouchDB to perform any merging.

A CouchDB-like system running on top of git would be extremely
interesting. I can see four main parts:

- a high-performance git backend which appends objects to a single file.
  The compact operation creates a new file and rotates it into place, and
  ideally retains git's ability to create packs and compact using diffs.

- a new 'btree' git object class, for mapping an unlimited number of keys
  to objects. refs would be stored using this. The existing flat 'tree'
  object won't scale well to millions of keys.
- an HTTP interface for storing and retrieving objects by key, using the
  'btree' class again (*)

- the map/reduce engine ported to run on top of this, following the
  commit tree to determine which documents had changed.

There would need to be some way to handle merging and conflicts. Perhaps
the HTTP interface returns all peer versions as a multipart response, with
their commit IDs as the rev.

It also needs to be decided how to handle 'attachments'. Personally I'd
like to store the MIME type against every document, which means you could
store non-JSON objects as first-class objects in their own right. Then you
could do things like using map/reduce to scale JPEGs.

(*) Note: you might think of having one branch per document, instead of a
top-level btree object containing all documents. That would give each
object an independent history. The trouble with that approach is that git
replication doesn't work well with thousands of branches - I've tried
it :-) - because it has to iterate linearly over each branch. So I think
that you really need one btree object for the database, and then store
the history of that via commits.

Regards,

Brian.
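P.S. To make the conflict behaviour described above concrete, here is a
rough sketch of it in Python. This is purely illustrative - the class and
method names (Replica, put, replicate_from, get) are invented for this
sketch and are not CouchDB's API - but it captures the key point: every
conflicting revision survives replication, nothing is merged, and one
leaf is picked as the winner by a rule that is deterministic but
semantically arbitrary.

```python
# Toy model of CouchDB-style replication (illustrative only; all names
# are invented). Each document keeps ALL conflicting revision leaves;
# replication simply unions the leaves from the peer, and one leaf is
# chosen deterministically as the "winner" -- no merge is attempted.

class Replica:
    def __init__(self):
        self.docs = {}  # doc_id -> {rev: body}, one entry per leaf

    def put(self, doc_id, rev, body):
        self.docs.setdefault(doc_id, {})[rev] = body

    def replicate_from(self, peer):
        # No merge step: conflicting revs coexist after replication.
        for doc_id, leaves in peer.docs.items():
            self.docs.setdefault(doc_id, {}).update(leaves)

    def get(self, doc_id, conflicts=False):
        leaves = self.docs[doc_id]
        winner = max(leaves)  # arbitrary but deterministic choice
        if conflicts:
            return dict(leaves)            # every version is available
        return {winner: leaves[winner]}    # the default "winning" view

a, b = Replica(), Replica()
a.put("doc1", "2-aaa", '{"x": 1}')  # edit made on replica a
b.put("doc1", "2-bbb", '{"x": 2}')  # conflicting edit on replica b

a.replicate_from(b)
print(a.get("doc1"))                  # only the deterministic winner
print(a.get("doc1", conflicts=True))  # both leaves are still present
```

Note that neither replica's version takes precedence just because it
originated locally: after b.replicate_from(a), both replicas report the
same winner, which is exactly the "eventual consistency" property.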
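P.P.S. The incremental map/reduce idea - using the commit tree to find
which documents changed, and re-mapping only those - can be sketched
like this (again illustrative; update_view and its parameters are made
up for the example, and a real engine would also reduce incrementally):

```python
# Illustrative incremental view update: only documents reported as
# changed are re-mapped; the reduce then runs over the stored map
# output. All names here are invented for the sketch.

def update_view(docs, changed_ids, map_fn, reduce_fn, view):
    # view caches each document's map output from the previous run
    for doc_id in changed_ids:
        view[doc_id] = list(map_fn(docs[doc_id]))  # re-map changed docs only
    all_rows = [row for rows in view.values() for row in rows]
    return reduce_fn(all_rows)

# Example: count words across all documents in the database
word_map = lambda doc: [(w, 1) for w in doc.split()]
count = lambda rows: sum(v for _, v in rows)

docs = {"a": "hello world", "b": "hello git"}
view = {}
print(update_view(docs, {"a", "b"}, word_map, count, view))  # 4

docs["b"] = "hello"  # only doc "b" changed since the last commit
print(update_view(docs, {"b"}, word_map, count, view))       # 3
```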
