Re: superlifter design notes and rZync feedback
One more link, about variable-length vs fixed-length encodings:

  http://ntrg.cs.tcd.ie/undergrad/4ba2/presentation/xdrandber.html

(The HTML is a bit broken, view the source.)

Basically they make the somewhat obvious point that variable-length
encodings are much slower to handle than fixed-length. I don't know if
the difference is so great that lzo encoding could produce a smaller
result with less work. I wouldn't be surprised either way, actually.

One way to look at it is this: in the case where you're CPU-bound, not
network-bound, then you'll definitely want to use something like XDR.
In the case where you're completely network-bound, then you probably
want to use gzip -9 or even bzip2, and whether the underlying protocol
is fixed or variable-length probably doesn't matter.

So perhaps XDR plus compression is a good tradeoff across a wider
domain. (Or perhaps not.)

--
Martin

--
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
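A minimal sketch of the comparison above, for anyone who wants to try it on their own data. It is only an illustration under stated assumptions: the fixed-length form is an 8-byte big-endian integer (the layout XDR uses for an unsigned hyper), the variable-length form is a simple 7-bits-per-byte scheme in the BER/UTF-8 spirit rather than real BER, zlib at level 9 stands in for gzip -9, and the sample values and all function names are made up for the example.

```python
import struct
import zlib


def encode_fixed(values):
    """Pack each value as an 8-byte big-endian unsigned integer."""
    return b"".join(struct.pack(">Q", v) for v in values)


def decode_fixed(data):
    """Inverse of encode_fixed."""
    return [v for (v,) in struct.iter_unpack(">Q", data)]


def encode_varint(value):
    """Encode one non-negative integer, 7 bits per byte, low bits first.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_variable(values):
    return b"".join(encode_varint(v) for v in values)


def decode_variable(data):
    """Inverse of encode_variable."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


# Hypothetical file sizes, from tiny to larger than 4 GB.
sizes = [0, 1, 127, 128, 4096, 70000, 2**32, 2**40]

fixed = encode_fixed(sizes)        # always 8 bytes per value
variable = encode_variable(sizes)  # 1 to 6 bytes per value here

# Compress both forms to see how much of the gap compression closes.
fixed_z = zlib.compress(fixed, 9)
variable_z = zlib.compress(variable, 9)

print(len(fixed), len(variable), len(fixed_z), len(variable_z))
```

On a sample this small the compressed sizes say very little; the interesting numbers come from feeding in a real directory listing and timing both paths.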
Re: superlifter design notes and rZync feedback
On 18 Jul 2002, Wayne Davison <[EMAIL PROTECTED]> wrote:

> (definitely NOT rzync).

Great. (Excuse my overreaction :-)

> Re: rzync's variable-length fields: Note that my code allows more
> variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
> value to only as many bytes as needed to actually store the length. I
> agree that we should question whether this complexity is needed, but I
> don't agree that it is wrong on principle. There are two areas where
> field-sizing is used: in the directory-info compression (which is very
> similar to what rsync does, but with some extra field-sizing thrown in
> for good measure), and in the transmission protocol itself:

OK. If the protocol said that all integers are encoded in a UTF-8-ish
or BER-ish variable length scheme that would sound perfectly reasonable
to me. I had misunderstood the document as suggesting that some fields
should be defined to be different lengths to others; that would worry
me.

There is still a question on the relative merits of having known-length
headers (easier to manage buffers, know how much to read, etc), vs
making them as small as possible.

I think I mentioned this -- I'd like to have a reasonable means to
choose a compression scheme at connection time. bzip2 would be good for
modems; lzo for 100Mbps. (I think of bzip2 as simmering on the stove
all day, and lzo as lightly blanching :-)

> I still have questions about how best to handle the transfer of
> directory info. I'm thinking that it might be better to remove the
> rsync-like downsizing of the data and to use a library like zlib to
> remove the huge redundancies in the dir data during its
> transmission.

Ben Escoto suggested a stack like this:

> > 1. The specification for an abstract protocol designed to allow a
> > single threaded application to get good performance using a single,
> > possibly low bandwidth/high latency pipe. No specific file commands
> > would enter in at this stage, but error reporting and recovery, some
> > kind of security policy, and some other stuff I'm omitting would be
> > included.
> >
> > 2. A library to make it easy for applications to work with protocols
> > that have the form in 1. A well-written interface to a scripting
> > language (probably python) would be considered a core part of this.
> >
> > 3. Specification for a more specific, rsync-like protocol, and maybe
> > another library (again with at least a scripting wrapper) to make it
> > easy for applications to implement the protocol.
> >
> > 4. The model application rsync3 which shows off what the protocol can
> > do. Ideally this part should be really short and sweet.

I think that's a good way to play it, because there is enough work in
each section that they're non-trivial layers, but they're also
sufficiently separate to allow a lot of good experimentation or
adaption.

I'd hope that by getting a good foundation in #1 and #2, we would be
able to experiment with doing binary deltas on directories, or not, or
something else again.

I would hope that working only at layer 4, you'd be able to implement a
client that could detect remote renames (by scanning for files with the
same size, looking at their checksums, etc.)

I wonder if this layering is excessive, but I think that all the layers
are necessary, and a first implementation could be simple in many
cases. For example, 2 could initially be trivially implemented in a way
that only supports non-pipelined operation.

> In the protocol itself, there are only two variable-size elements that
> go into each message header. While this increases complexity quite a
> bit over a fixed-length message header, it shouldn't be too hard to
> automate a test that ensures that the various header combinations
> (particularly boundary conditions) encode and decode properly. I don't
> know if this level of message header complexity is actually needed (this
> is one of the things that we can use the test app to check out), but if
> we decide we want it, I believe we can adequately test it to ensure that
> it will not be a sinkhole of latent bugs.

OK, good.

> Re: rzync's name cache. I've revamped it to be a very dependable design
> that no longer depends on lock-step synchronization in the expiration of
> old items (just in the creation of new items, which is easy to achieve).
>
> Some comments on your registers:
>
> You mention having something like 16 registers to hold names. I think
> you'll find this to be inadequate, but it does depend on exactly how
> much you plan to cache names outside of the registers, how much
> retransmission of names you consider to be acceptable, and whether you
> plan to have a "move mode" where the source file is deleted.

Yes, I agree that 16 is probably too small; the next round number would
be 256. If we use something like BER it could be unboundedly big.
However, since using a name causes server-side resources to be
allocated, that's probably no good. We don't want somebody abusing a
Re: superlifter design notes and rZync feedback
> "WD" == Wayne Davison <[EMAIL PROTECTED]>
> wrote the following on Thu, 18 Jul 2002 10:19:40 -0700 (PDT)

  WD> Re: rzync's name cache. I've revamped it to be a very
  WD> dependable design that no longer depends on lock-step
  WD> synchronization in the expiration of old items (just in the
  WD> creation of new items, which is easy to achieve).

Could you possibly explain this a little more? I'm not sure I follow
you here with the "expiration of old items" talk. Or tell me if there
is some basic document I should read that explains all this. The
rdiff-backup protocol is not sophisticated and certainly has a lot to
gain from these design considerations (not to say that I'll be
motivated enough to do anything about it).

  WD> If we just register the active items that are currently being
  WD> sent over the wire, the name will need to live through the
  WD> entire sig, delta, patch, and (optionally) source-side-delete
  WD> steps. When the files are nearly up-to-date, having only 16 of
  WD> them will, I believe, be overly restrictive. Part of the
  WD> problem is that the buffered data on the sig-generating side
  WD> delays the source-side-delete messages quite a bit. If we had a
  WD> high-priority delete channel, that would help to alleviate
  WD> things, but I think you'll find that having several hundred
  WD> active names will be a better lower limit in your design
  WD> thinking.

For what it's worth, if I understand what you mean by "active names"
correctly, I believe rdiff-backup's protocol can sometimes have
hundreds of active names.

--
Ben Escoto
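To make the "active names" point concrete, here is a toy model rather than anything taken from rzync, superlifter or rdiff-backup. It assumes the sender tries to register one new name per tick and that a name's register is only freed `lag` ticks later, where `lag` is a made-up stand-in for the time a name must stay live through the sig, delta, patch and source-side-delete steps. The function name, the tick model and the numbers are all hypothetical.

```python
from collections import deque


def simulate(n_files, capacity, lag):
    """Return how many ticks the sender stalls waiting for a free register.

    Each tick the sender tries to register one new file name. A register
    is released `lag` ticks after it was taken. `capacity` is the number
    of name registers available.
    """
    in_flight = deque()  # release ticks of the active names, oldest first
    stalls = 0
    sent = 0
    tick = 0
    while sent < n_files:
        # Free every register whose last step has completed by now.
        while in_flight and in_flight[0] <= tick:
            in_flight.popleft()
        if len(in_flight) < capacity:
            in_flight.append(tick + lag)
            sent += 1
        else:
            stalls += 1  # no free register: the pipeline has to wait
        tick += 1
    return stalls


# With a long lag, 16 registers keep the sender waiting most of the time,
# while a few hundred registers never block in this model.
print("16 registers: ", simulate(1000, 16, 200), "stalled ticks")
print("256 registers:", simulate(1000, 256, 200), "stalled ticks")
```

In this model the number of registers needed to avoid stalls is simply the lag measured in files, which is why anything that shortens the lag, such as a high-priority delete channel, lowers the requirement.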
Re: superlifter design notes and rZync feedback
On Thu, Jul 18, 2002 at 10:19:40AM -0700, Wayne Davison wrote:
> Martin Pool <[EMAIL PROTECTED]> wrote:
> > I've put a cleaned-up version of my design notes up here
> > http://samba.org/~mbp/superlifter/design-notes.html
>
> I'll start with some feedback on your rzync comments:
>
> Re: rzync's name: I currently consider the rZync to be a test app to
> allow me (and anyone else who wants to fiddle with it) to try out some
> ideas in protocol design. Integrating the ideas from this back into
> rsync or into superlifter would be ideal. If I ever decide to release
> my own file transfer utility, I'll name it something useful at that
> time (definitely NOT rzync).
>
> Re: rzync's variable-length fields: Note that my code allows more
> variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
> value to only as many bytes as needed to actually store the length. I
> agree that we should question whether this complexity is needed, but I
> don't agree that it is wrong on principle. There are two areas where
> field-sizing is used: in the directory-info compression (which is very
> similar to what rsync does, but with some extra field-sizing thrown in
> for good measure), and in the transmission protocol itself:
>
> I still have questions about how best to handle the transfer of
> directory info. I'm thinking that it might be better to remove the
> rsync-like downsizing of the data and to use a library like zlib to
> remove the huge redundancies in the dir data during its transmission.
>
> In the protocol itself, there are only two variable-size elements that
> go into each message header. While this increases complexity quite a
> bit over a fixed-length message header, it shouldn't be too hard to
> automate a test that ensures that the various header combinations
> (particularly boundary conditions) encode and decode properly. I don't
> know if this level of message header complexity is actually needed (this
> is one of the things that we can use the test app to check out), but if
> we decide we want it, I believe we can adequately test it to ensure that
> it will not be a sinkhole of latent bugs.
>
> Re: rzync's name cache. I've revamped it to be a very dependable design
> that no longer depends on lock-step synchronization in the expiration of
> old items (just in the creation of new items, which is easy to achieve).
>
> Some comments on your registers:
>
> You mention having something like 16 registers to hold names. I think
> you'll find this to be inadequate, but it does depend on exactly how
> much you plan to cache names outside of the registers, how much
> retransmission of names you consider to be acceptable, and whether you
> plan to have a "move mode" where the source file is deleted.
>
> My first test app had no name-cache whatsoever. It relied on external
> commands to drive it, and it sent the source/destination/basis trio of
> names from side to side before every step of the file's progress. While
> this was simple, the increased bandwidth necessary to retransmit the
> names was not acceptable to me.

I think the better approach is to reduce the bandwidth needed rather
than make multiple stages require side-channel communication.

>
> If we just register the active items that are currently being sent over
> the wire, the name will need to live through the entire sig, delta,
> patch, and (optionally) source-side-delete steps. When the files are
> nearly up-to-date, having only 16 of them will, I believe, be overly
> restrictive. Part of the problem is that the buffered data on the
> sig-generating side delays the source-side-delete messages quite a bit.
> If we had a high-priority delete channel, that would help to alleviate
> things, but I think you'll find that having several hundred active names
> will be a better lower limit in your design thinking.
>
> Another question is whether names are sent fully-qualified or relative
> to some directory. My protocol caches directory names in the name cache
> and allows you to send filenames relative to a cached directory. Just
> having a way to "chdir" each side (even if the chdir is just virtual)
> and send names relative to the current directory should help a lot.

I see no reason (so far) why the concept of a current tree-relative
directory wouldn't be perfectly viable. The stream would contain CD
commands. As such, the only time we might need to pass a complete
pathname would be for link destinations, and a build-as-you-go
directory table could eliminate that.

>
> An additional source of cached names is in the directory scanning when
> doing a recursive transfer. My protocol has specific commands that
> refer to a name index within a specified directory so that the receiving
> side can request changed files using a small binary value instead of a
> full pathname.
>
> One more area of complexity that you don't mention (and I don't either
> in my new-protocol doc): there are some operations where 2 names ne
Re: superlifter design notes and rZync feedback
Martin Pool <[EMAIL PROTECTED]> wrote:
> I've put a cleaned-up version of my design notes up here
> http://samba.org/~mbp/superlifter/design-notes.html

I'll start with some feedback on your rzync comments:

Re: rzync's name: I currently consider the rZync to be a test app to
allow me (and anyone else who wants to fiddle with it) to try out some
ideas in protocol design. Integrating the ideas from this back into
rsync or into superlifter would be ideal. If I ever decide to release
my own file transfer utility, I'll name it something useful at that
time (definitely NOT rzync).

Re: rzync's variable-length fields: Note that my code allows more
variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
value to only as many bytes as needed to actually store the length. I
agree that we should question whether this complexity is needed, but I
don't agree that it is wrong on principle. There are two areas where
field-sizing is used: in the directory-info compression (which is very
similar to what rsync does, but with some extra field-sizing thrown in
for good measure), and in the transmission protocol itself:

I still have questions about how best to handle the transfer of
directory info. I'm thinking that it might be better to remove the
rsync-like downsizing of the data and to use a library like zlib to
remove the huge redundancies in the dir data during its transmission.

In the protocol itself, there are only two variable-size elements that
go into each message header. While this increases complexity quite a
bit over a fixed-length message header, it shouldn't be too hard to
automate a test that ensures that the various header combinations
(particularly boundary conditions) encode and decode properly. I don't
know if this level of message header complexity is actually needed (this
is one of the things that we can use the test app to check out), but if
we decide we want it, I believe we can adequately test it to ensure that
it will not be a sinkhole of latent bugs.

Re: rzync's name cache. I've revamped it to be a very dependable design
that no longer depends on lock-step synchronization in the expiration of
old items (just in the creation of new items, which is easy to achieve).

Some comments on your registers:

You mention having something like 16 registers to hold names. I think
you'll find this to be inadequate, but it does depend on exactly how
much you plan to cache names outside of the registers, how much
retransmission of names you consider to be acceptable, and whether you
plan to have a "move mode" where the source file is deleted.

My first test app had no name-cache whatsoever. It relied on external
commands to drive it, and it sent the source/destination/basis trio of
names from side to side before every step of the file's progress. While
this was simple, the increased bandwidth necessary to retransmit the
names was not acceptable to me.

If we just register the active items that are currently being sent over
the wire, the name will need to live through the entire sig, delta,
patch, and (optionally) source-side-delete steps. When the files are
nearly up-to-date, having only 16 of them will, I believe, be overly
restrictive. Part of the problem is that the buffered data on the
sig-generating side delays the source-side-delete messages quite a bit.
If we had a high-priority delete channel, that would help to alleviate
things, but I think you'll find that having several hundred active names
will be a better lower limit in your design thinking.

Another question is whether names are sent fully-qualified or relative
to some directory. My protocol caches directory names in the name cache
and allows you to send filenames relative to a cached directory. Just
having a way to "chdir" each side (even if the chdir is just virtual)
and send names relative to the current directory should help a lot.

An additional source of cached names is in the directory scanning when
doing a recursive transfer. My protocol has specific commands that
refer to a name index within a specified directory so that the receiving
side can request changed files using a small binary value instead of a
full pathname.

One more area of complexity that you don't mention (and I don't either
in my new-protocol doc): there are some operations where 2 names need to
be associated with one operation. This happens when we have both a
destination file and a basis file. My current cache implementation
allows both of these names to be associated with a single cache element
(though I need to improve this a bit in rzync) and lets the sig/patch
stage snag them both.

..wayne..
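A rough sketch of the last few points, to show the shape of the idea: a name cache in which directories are cached once, file names are sent relative to a cached directory, and a single cache element can carry both a destination name and a basis name. This is not rzync's actual implementation; `NameCache`, `CacheEntry`, `add_dir`, `add_file` and the sample paths are all hypothetical names invented for the example.

```python
import posixpath
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """One cache element: a destination name plus an optional basis name."""
    dest: str
    basis: Optional[str] = None


class NameCache:
    def __init__(self):
        self.dirs = {}      # directory id -> directory path
        self.entries = {}   # entry id -> CacheEntry
        self.next_id = 0

    def _new_id(self):
        ident = self.next_id
        self.next_id += 1
        return ident

    def add_dir(self, path):
        """Cache a directory name once; later names refer to it by id."""
        ident = self._new_id()
        self.dirs[ident] = path
        return ident

    def add_file(self, dir_id, name, basis_name=None):
        """Cache a file given relative to a cached directory.

        Only the small directory id and the short relative names would
        need to cross the wire; both sides expand them locally.
        """
        base = self.dirs[dir_id]
        entry = CacheEntry(
            dest=posixpath.join(base, name),
            basis=posixpath.join(base, basis_name) if basis_name else None,
        )
        ident = self._new_id()
        self.entries[ident] = entry
        return ident

    def lookup(self, entry_id):
        """Return both names for a later stage (e.g. sig or patch)."""
        return self.entries[entry_id]


cache = NameCache()
d = cache.add_dir("src/lib")
f = cache.add_file(d, "util.c", basis_name="util.c.orig")
entry = cache.lookup(f)
print(entry.dest, entry.basis)
```

Expiry is left out on purpose: how and when entries are dropped on each side is exactly the part that needs the care described above.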