I will respond to this, section by section, in a bit. My general comment, though, is that there is a lot to digest here and I think we need to start putting this stuff into wiki pages - it will get buried in this thread. We might need an individual page for each protocol idea, and then a pro/con page where we compare them based on discussions on the list.
~Michael

On Jun 24, 2013, at 1:06 PM, Joseph Gentle <jose...@gmail.com> wrote:

> I want to start the discussion around what OT algorithms to use. This is going to get technical. This is not a bug. Please ask simple but tangential questions (like how does wave's current crypto system work?) in a new thread instead of this one.
>
> Requirements:
> - Arbitrary peers can synchronize data
> - We can collaboratively edit tree structures, key-value pairs and rich text. (I'd like arbitrary JSON editing.)
> - Fast enough to work on mobile devices. The protocol needs to be low bandwidth and require a low number of network round-trips.
> - Ideally, a low storage overhead
>
> Nice-to-have:
> - End-to-end encryption of content. (In the wake of PRISM, the less information my wave server needs to know about me, the better.)
> - Some wave version of OTR[1]
> - Presence & cursors
> - Simple. At least, as simple as we can manage. Let's at least try :/
>
> Is there anything else I'm missing? Some of this stuff is only tangentially related to OT (like cursors and encryption), but it's still important that we bring it up.
>
> Given that we want to be able to connect arbitrary peers, I think it makes sense to use a message encryption & signing system. Currently, wave servers sign their users' messages. To support users collaborating over a LAN without servers, we should allow users to sign (& encrypt) their own messages. Unfortunately, a signature becomes invalid if the operation is transformed.
>
> Consider peers A and B (communicating over a LAN), and disconnected peer C. A and B each generate simultaneous operations and sign them. A sends its operation to B, who transforms it. B then connects to C and sends both operations. For C to verify A's operation, C must see the original (untransformed) version of the operation. I think the only way we can make this work securely is by only ever sending original operations, never operations after they've been transformed.
>
> As far as I know, there are three broad directions we could take algorithmically:
> 1. CRDTs (like Logoot[2])
> 2. 'Classical' OT using vector clocks for versioning.
> 3. The OT system that Torben Weis was working on. (Some might remember him from the wave summit a few years ago.) I have been chatting on and off with him about this since the summit. He hasn't published anything on it yet, though. It's similar to classical OT, except it uses git-style version hashes.
>
> CRDTs are probably the most elegant solution. They avoid the entire complicated mess of transformation. However:
> - They add a significant overhead to the document size (we must store a unique name between every two characters in the wave)
> - As far as I know, nobody has implemented arbitrary JSON editing using CRDTs. Implementing a set is very difficult.
> - Even without transform, we'll still have to store operations anyway so peers can verify signatures.
>
> I'd like to know what we should expect the actual document size overhead to be. I expect at least a 5x size increase, but I don't know for sure. The papers I've read cleverly minimize the overhead by only working on a line-by-line basis, but that's unacceptable for wave.
>
> As I understand it, vector clocks are the size of the number of unique peers which have ever connected to the network. (They contain an integer version number per peer on the wave.)
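> To make that concrete, here's a minimal hypothetical sketch of a vector clock as a plain map from peer id to counter. The names (tick, merge, dominates) are just illustrative, not from any existing wave codebase:
>
> # Hypothetical sketch: a vector clock as {peerId: count}.
> tick = (clock, peerId) ->
>   next = {}
>   next[p] = v for own p, v of clock
>   next[peerId] = (next[peerId] ? 0) + 1
>   next
>
> # Element-wise maximum, for when we see a remote op.
> merge = (a, b) ->
>   out = {}
>   out[p] = v for own p, v of a
>   out[p] = Math.max(out[p] ? 0, v) for own p, v of b
>   out
>
> # a causally dominates b iff a has seen everything b has seen.
> dominates = (a, b) ->
>   for own p, v of b
>     return false if (a[p] ? 0) < v
>   true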
> As I understand them, a vector clock must be stored in each operation - which always seemed unpalatable enough to me that I didn't explore them further. Note that this is per *peer*, not per user. If we want full p2p, this is one entry per device which has ever edited a wave.
>
> There are probably some clever ways we could minimize that size. Off the top of my head:
> - Most operations are generated on the same site as their single parent. These operations only increment the local version number. We could special-case this. I don't know how much better this would make the system - I suspect a lot.
> - If we want, we could build a hybrid system where most clients use a client-server protocol like the current system. Servers could then use vector clocks or whatever to confer amongst themselves. This way we only need one element in the clock per server. However, we'd lose most of the benefits of having a p2p OT system to begin with.
>
> Torben's system is the one I know the most about. I'm biased to think it's the right answer because I understand it better than the others. I just spent the last couple of hours writing up how it works, but I'm going to run it by Torben before posting it to this list.
>
> It works like this:
>
> First, we need a transform function that satisfies TP2. The simplest way to do this is to add tombstones to the data type. (We'll need to do this for the vector clock based system as well, by the way.) I have a plaintext implementation which does this in javascript[3], although it's quite slow (~150k transforms per second). I expect that an optimized version would run at around 10-20 million transforms per second in C[4] for simple ops, and probably a few million per second in java.
>
> Second, we need an inverse transform function (prune): P(T(a, b), b) = a.
>
> In its current incarnation, the algorithm works as follows: consider a DAG of all the operations. The closest metaphor is the network of commits in git. Every operation has a unique hash identifier and one or more parents (except the root, which has no parents). It'll look something like this:
> https://dl.dropboxusercontent.com/u/2494815/ops.png
> ... except it'll usually be waaaay less complicated than that.
>
> The cool thing about this structure is that if you apply all of the operations in that graph in *any order*, you'll end up in the same document state. (So long as each node comes after all its parents and you use the transform function correctly.)
>
> Each peer stores a history list of operations. Each operation is stored twice. One copy is its original form (the exact bytes which were used for hashing and signing); when sending the operation to a remote peer, this is the copy we send. The second copy of the op is transformed. At all times, the transformed ops in the history list are written such that you could directly apply every operation in the history list (in order) to a new document and you would reach the current document state. The history lists on different nodes don't have to be in the same order.
>
> We also store a copy of the current document snapshot, because it's really useful for UI.
>
> Consider the DAG again. Every operation in the history list is a node in that DAG, with its parents as other nodes. The DAG will have at least one node with no children (= the bottom nodes in the diagram I linked above). These childless nodes form the *frontier*.
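> In hypothetical code (field names are mine, nothing here is from a real implementation), computing the frontier and creating a new op might look like:
>
> # Sketch: `ops` maps hash -> {parents: [hash, ...], ...}. The frontier
> # is every op that no other op lists as a parent.
> frontier = (ops) ->
>   hasChild = {}
>   for own hash, op of ops
>     hasChild[p] = true for p in op.parents
>   (hash for own hash of ops when not hasChild[hash])
>
> # A new op takes the whole frontier as its parents and records its
> # ancestor count (= current history list length). Its unique id would
> # be a hash of the signed original bytes - elided here.
> createOp = (ops, body) ->
>   parents: frontier(ops)
>   ancestors: Object.keys(ops).length
>   body: body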
> When a new operation is created, all ops in the frontier become the new node's parents. The new node also stores the number of ancestors it has (= the history list length). I'll get to why that's useful in a moment.
>
> The history list has a special power because of the prune function: operations which don't depend on one another can be swapped in place.
>
> swap = (a, b) ->
>   b' = prune(b, a)
>   a' = transform(a, b')
>
> When peers connect, they send each other missing ops. Figuring out which ops are missing can be surprisingly tricky - but we'll figure that out later. New ops must be ingested in order, so we always ingest an operation after ingesting all of its parents.
>
> So you're given an op from another peer to ingest. We need to transform that operation by all operations we have locally that aren't in the op's ancestry. You can't just arbitrarily transform two ops by one another - they both have to be applicable to the same document. So what we do is rewrite our local history so that all of the operation's ancestors are at the start of our history list, and all of the other operations are at the end. Once we've done that, the operation can be transformed by all the ops in the second half of our history list, applied, and appended to the end of the history list.
>
> Remember, we know how many ancestors the new operation has - we saved that when the op was created. We also have all of that operation's parents in our history list. So we find the position of the op's last parent in the history list. If that position matches the new op's ancestor count, we're golden.
>
> Imagine your history list contains:
> A B X Y
>
> and the peer you're connecting to has:
> A B C
>
> (C has B as a parent.) C has 2 ancestors (A and B). When you ingest C, you find B (its last parent), which is at position 2 in the history list. So you know that everything after that in the history list isn't one of C's ancestors, and you can transform C by X, Y and then apply it.
>
> If your history list instead looks like this:
>
> X A B Y
>
> ... then you'll find B is too late in your history list. You need to rewrite your history list to move the X after B. You use the swap function to do that, swapping X and A, then X and B. Your history then looks like A B X Y as we wanted, so you can proceed as above. (There's a sketch of this reordering step after the numbers below.)
>
> This algorithm requires a lot of transforms, but I think most of them are unavoidable because of the signing / trust issue. It also requires that every operation (ie, every keypress) has a unique ID. That's probably going to be a 16 byte hash or something, which seems like a lot of extra bytes for typing one character.
>
> I implemented it here:
> https://github.com/josephg/tp2stuff/blob/master/node2.coffee
>
> ... though so far only with the randomizer setup I have (a handful of peers, all randomly generating operations and periodically (randomly) connecting to each other to sync up).
>
> 5 clients generating operations, with two random clients syncing about every 5 operations, results in over 1000x as many transforms as operations (10k ops -> 11M transforms). Doing the same thing with 3 clients 'only' needs 2M transforms.
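> Here's the reordering sketch I promised above. It's hypothetical - it assumes a swapAt(history, i) helper that exchanges history[i] and history[i+1] via prune/transform as in the swap function, and that we keep each op's id and parent ids around:
>
> # Collect the full ancestor set of `op` by walking parent links.
> # `byId` maps op id -> stored (original) op, with .parents.
> ancestorSet = (op, byId) ->
>   seen = {}
>   stack = op.parents.slice()
>   while stack.length
>     id = stack.pop()
>     continue if seen[id]
>     seen[id] = true
>     stack.push p for p in byId[id].parents
>   seen
>
> # Rewrite `history` in place so op's ancestors fill the first
> # op.ancestors slots (bubbling each ancestor left past non-ancestors
> # with swapAt), then transform op by the remainder and append it.
> # Returns the transformed op, ready to apply to the local snapshot.
> ingest = (history, op, byId) ->
>   anc = ancestorSet(op, byId)
>   done = 0
>   for i in [0...history.length]
>     if anc[history[i].id]
>       swapAt(history, j - 1) for j in [i...done] by -1
>       done++
>       break if done is op.ancestors
>   transformed = op
>   transformed = transform(transformed, other) for other in history[op.ancestors..]
>   history.push transformed
>   byId[op.id] = op   # keep the original signed bytes too
>   transformed
>
> I suspect most of the transform count above comes from exactly this bubbling.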
> Performance stays steady over time (it doesn't get worse as the document grows, like it will with CRDTs) - this scatterplot shows the number of seconds per 30 iterations on one of my tests:
> https://dl.dropboxusercontent.com/u/2494815/Screen%20Shot%202013-06-24%20at%201.01.42%20PM.png
>
> I think this huge number of transforms is one of the costs of doing encryption & signing of the ops. We might be able to lower the number of transforms required if we trust our servers with our keys - but I'm not sure I want to do that. Real-world performance should be much better than these randomized tests anyway, but yeah - it makes me nervous. I'm probably going to run more tests to see what kind of performance we can expect in practice.
>
> Blah. That was a lot of words. Thoughts?
> -J
>
> [1] http://www.cypherpunks.ca/otr/
> [2] http://hal.archives-ouvertes.fr/docs/00/34/59/11/PDF/main.pdf
> [3] https://github.com/share/ottypes/blob/master/lib/text-tp2.js
> [4] Based on my extremely fast non-tombstone equivalent data structure here: https://github.com/share/libot/blob/master/text.h