Re: superlifter design notes and rZync feedback

2002-07-19 Thread Martin Pool

One more link, about variable-length vs fixed-length encodings:

  http://ntrg.cs.tcd.ie/undergrad/4ba2/presentation/xdrandber.html

(The HTML is a bit broken, view the source.)

Basically they make the somewhat obvious point that variable-length
encodings are much slower to handle than fixed-length.  I don't know
if the difference is so great that lzo encoding could produce a
smaller result with less work.  I wouldn't be surprised either way,
actually.

One way to look at it is this: in the case where you're CPU-bound, not
network-bound, then you'll definitely want to use something like XDR.
In the case where you're completely network-bound, then you probably
want to use gzip -9 or even bzip2, and whether the underlying protocol
is fixed or variable-length probably doesn't matter.  

So perhaps XDR plus compression is a good tradeoff across a wider
domain.  (Or perhaps not.)
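
As a rough illustration of that combination (a sketch, not a measurement), the Python below packs integers as fixed 4-byte big-endian words, which is how XDR represents 32-bit integers, and then runs the result through zlib.  The names pack_fixed and unpack_fixed and the sample data are made up for this example.

```python
import struct
import zlib

def pack_fixed(values):
    """Pack unsigned integers as 4-byte big-endian words (XDR-style)."""
    return b"".join(struct.pack(">I", v) for v in values)

def unpack_fixed(data):
    """Inverse of pack_fixed."""
    return [v for (v,) in struct.iter_unpack(">I", data)]

# Small values leave most of each word as zero bytes, which a
# general-purpose compressor can squeeze out again.
values = list(range(1000))
raw = pack_fixed(values)
packed = zlib.compress(raw, 6)
assert unpack_fixed(zlib.decompress(packed)) == values
print(len(raw), len(packed))
```

The decoder stays trivial (fixed offsets, no per-field length parsing), and the compressor is left to deal with the padding.  Whether that beats a hand-rolled variable-length format would have to be measured.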

-- 
Martin 

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: superlifter design notes and rZync feedback

2002-07-18 Thread Martin Pool

On 18 Jul 2002, Wayne Davison <[EMAIL PROTECTED]> wrote:

> (definitely NOT rzync).

Great.  (Excuse my overreaction :-)

> Re: rzync's variable-length fields:  Note that my code allows more
> variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
> value to only as many bytes as needed to actually store the length.  I
> agree that we should question whether this complexity is needed, but I
> don't agree that it is wrong on principle.  There are two areas where
> field-sizing is used:  in the directory-info compression (which is very
> similar to what rsync does, but with some extra field-sizing thrown in
> for good measure), and in the transmission protocol itself:

OK.  If the protocol said that all integers are encoded in a UTF-8-ish
or BER-ish variable length scheme that would sound perfectly
reasonable to me.  I had misunderstood the document as suggesting that
some fields should be defined with different lengths from others; that
would worry me.
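
For concreteness, here is a minimal sketch of one such uniform scheme.  It is a base-128 encoding (7 data bits per byte, high bit meaning "more bytes follow"), which is neither real BER nor UTF-8 but is in the same spirit.  The function names are hypothetical.

```python
def encode_uint(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first.
    The high bit of each byte is set when more bytes follow."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uint(data, pos=0):
    """Return (value, next_pos) for the integer starting at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

assert decode_uint(encode_uint(300)) == (300, 2)
```

Every integer in the protocol would go through the same pair of functions, so no field needs its own width rule.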

There is still a question on the relative merits of having
known-length headers (easier to manage buffers, know how much to read,
etc), vs making them as small as possible.

I think I mentioned this -- I'd like to have a reasonable means to
choose a compression scheme at connection time.  bzip2 would be good
for modems; lzo for 100Mbps.  (I think of bzip2 as simmering on the
stove all day, and lzo as lightly blanching :-)
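
A minimal sketch of what choosing at connection time could look like, assuming each side advertises codec names.  The names, the CODECS table and choose_codec are all hypothetical, and zlib level 1 stands in for lzo, which is not in the Python standard library.

```python
import bz2
import zlib

# Codec name -> (compress, decompress).  "fast" uses zlib level 1 as a
# stand-in for lzo, which is not in the Python standard library.
CODECS = {
    "none": (lambda d: d, lambda d: d),
    "fast": (lambda d: zlib.compress(d, 1), zlib.decompress),
    "bzip2": (lambda d: bz2.compress(d, 9), bz2.decompress),
}

def choose_codec(client_prefs, server_supported):
    """Pick the first codec in the client's preference list that the
    server also supports; fall back to no compression."""
    for name in client_prefs:
        if name in server_supported and name in CODECS:
            return name
    return "none"

# A client on a modem prefers bzip2, but this server only offers "fast".
name = choose_codec(["bzip2", "fast"], {"fast", "none"})
compress, decompress = CODECS[name]
payload = b"dir/file.c\n" * 100
assert decompress(compress(payload)) == payload
```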

> I still have questions about how best to handle the transfer of
> directory info.  I'm thinking that it might be better to remove the
> rsync-like downsizing of the data and to use a library like zlib to
> remove the huge redundancies in the dir data during its
> transmission.

Ben Escoto suggested a stack like this:

> > 1.  The specification for an abstract protocol designed to allow a
> > single-threaded application to get good performance using a single,
> > possibly low bandwidth/high latency pipe.  No specific file commands
> > would enter in at this stage, but error reporting and recovery, some
> > kind of security policy, and some other stuff I'm omitting would be
> > included.

> > 2.  A library to make it easy for applications to work with protocols
> > that have the form in 1.  A well-written interface to a scripting
> > language (probably python) would be considered a core part of this.

> > 3.  Specification for a more specific, rsync-like protocol, and maybe
> > another library (again with at least a scripting wrapper) to make it
> > easy for applications to implement the protocol.

> > 4.  The model application rsync3 which shows off what the protocol can
> > do.  Ideally this part should be really short and sweet.

I think that's a good way to play it, because there is enough work in
each section that they're non-trivial layers, but they're also
sufficiently separate to allow a lot of good experimentation or
adaptation.

I'd hope that by getting a good foundation in #1 and #2, we would be
able to experiment with doing binary deltas on directories, or not, or
something else again.  I would hope that working only at layer 4,
you'd be able to implement a client that could detect remote renames
(by scanning for files with the same size, looking at their checksums,
etc.)
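
As a sketch of that rename scan: pair files that exist only on the source with files that exist only on the destination when they share a size and checksum.  The function names are hypothetical, and the contents are held in memory only to keep the example self-contained; a real client would exchange sizes and checksums, not data.

```python
import hashlib
from collections import defaultdict

def fingerprint(data):
    """(size, checksum) pair used to match candidate renames."""
    return (len(data), hashlib.sha256(data).hexdigest())

def detect_renames(source_only, dest_only):
    """source_only / dest_only map path -> contents for files present on
    just one side.  Returns (old_dest_path, new_source_path) pairs."""
    by_print = defaultdict(list)
    for path, data in sorted(dest_only.items()):
        by_print[fingerprint(data)].append(path)
    renames = []
    for path, data in sorted(source_only.items()):
        candidates = by_print.get(fingerprint(data))
        if candidates:
            # Each destination file is used for at most one rename.
            renames.append((candidates.pop(0), path))
    return renames

source_only = {"new/a.txt": b"hello", "b.txt": b"zzz"}
dest_only = {"old/a.txt": b"hello", "c.txt": b"other"}
print(detect_renames(source_only, dest_only))
```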

I wonder if this layering is excessive, but I think that all the
layers are necessary, and a first implementation could be simple in
many cases.  For example, #2 could initially be trivially implemented
in a way that only supports non-pipelined operation.

> In the protocol itself, there are only two variable-size elements that
> go into each message header.  While this increases complexity quite a
> bit over a fixed-length message header, it shouldn't be too hard to
> automate a test that ensures that the various header combinations
> (particularly boundary conditions) encode and decode properly.  I don't
> know if this level of message header complexity is actually needed (this
> is one of the things that we can use the test app to check out), but if
> we decide we want it, I believe we can adequately test it to ensure that
> it will not be a sinkhole of latent bugs.

OK, good.

> Re: rzync's name cache.  I've revamped it to be a very dependable design
> that no longer depends on lock-step synchronization in the expiration of
> old items (just in the creation of new items, which is easy to achieve).
> 
> Some comments on your registers:
> 
> You mention having something like 16 registers to hold names.  I think
> you'll find this to be inadequate, but it does depend on exactly how
> much you plan to cache names outside of the registers, how much
> retransmission of names you consider to be acceptable, and whether you
> plan to have a "move mode" where the source file is deleted.

Yes, I agree that 16 is probably too small; the next round number
would be 256.  If we use something like BER it could be unboundedly
big.  However, since using a name causes server-side resources to be
allocated, that's probably no good.  We don't want somebody abusing a

Re: superlifter design notes and rZync feedback

2002-07-18 Thread Ben Escoto

> "WD" == Wayne Davison <[EMAIL PROTECTED]>
> wrote the following on Thu, 18 Jul 2002 10:19:40 -0700 (PDT)

  WD> Re: rzync's name cache.  I've revamped it to be a very
  WD> dependable design that no longer depends on lock-step
  WD> synchronization in the expiration of old items (just in the
  WD> creation of new items, which is easy to achieve).

Could you possibly explain this a little more?  I'm not sure I follow
you here with the "expiration of old items" talk.  Or tell me if there
is some basic document I should read that explains all this.  The
rdiff-backup protocol is not sophisticated and certainly has a lot to
gain from these design considerations (not to say that I'll be
motivated enough to do anything about it).

  WD> If we just register the active items that are currently being
  WD> sent over the wire, the name will need to live through the
  WD> entire sig, delta, patch, and (optionally) source-side-delete
  WD> steps.  When the files are nearly up-to-date, having only 16 of
  WD> them will, I believe, be overly restrictive.  Part of the
  WD> problem is that the buffered data on the sig-generating side
  WD> delays the source-side-delete messages quite a bit.  If we had a
  WD> high-priority delete channel, that would help to alleviate
  WD> things, but I think you'll find that having several hundred
  WD> active names will be a better lower limit in your design
  WD> thinking.

For what it's worth, if I understand what you mean by "active names"
correctly, I believe rdiff-backup's protocol can sometimes have
hundreds of active names.


-- 
Ben Escoto




Re: superlifter design notes and rZync feedback

2002-07-18 Thread jw schultz

On Thu, Jul 18, 2002 at 10:19:40AM -0700, Wayne Davison wrote:
> Martin Pool <[EMAIL PROTECTED]> wrote:
> > I've put a cleaned-up version of my design notes up here
> > http://samba.org/~mbp/superlifter/design-notes.html
> 
> I'll start with some feedback on your rzync comments:
> 
> Re: rzync's name:  I currently consider the rZync to be a test app to
> allow me (and anyone else who wants to fiddle with it) to try out some
> ideas in protocol design.  Integrating the ideas from this back into
> rsync or into superlifter would be ideal.  If I ever decide to release
> my own file transfer utility, I'll name it something useful at that
> time (definitely NOT rzync).
> 
> Re: rzync's variable-length fields:  Note that my code allows more
> variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
> value to only as many bytes as needed to actually store the length.  I
> agree that we should question whether this complexity is needed, but I
> don't agree that it is wrong on principle.  There are two areas where
> field-sizing is used:  in the directory-info compression (which is very
> similar to what rsync does, but with some extra field-sizing thrown in
> for good measure), and in the transmission protocol itself:
> 
> I still have questions about how best to handle the transfer of
> directory info.  I'm thinking that it might be better to remove the
> rsync-like downsizing of the data and to use a library like zlib to
> remove the huge redundancies in the dir data during its transmission.
> 
> In the protocol itself, there are only two variable-size elements that
> go into each message header.  While this increases complexity quite a
> bit over a fixed-length message header, it shouldn't be too hard to
> automate a test that ensures that the various header combinations
> (particularly boundary conditions) encode and decode properly.  I don't
> know if this level of message header complexity is actually needed (this
> is one of the things that we can use the test app to check out), but if
> we decide we want it, I believe we can adequately test it to ensure that
> it will not be a sinkhole of latent bugs.
> 
> Re: rzync's name cache.  I've revamped it to be a very dependable design
> that no longer depends on lock-step synchronization in the expiration of
> old items (just in the creation of new items, which is easy to achieve).
> 
> Some comments on your registers:
> 
> You mention having something like 16 registers to hold names.  I think
> you'll find this to be inadequate, but it does depend on exactly how
> much you plan to cache names outside of the registers, how much
> retransmission of names you consider to be acceptable, and whether you
> plan to have a "move mode" where the source file is deleted.
> 
> My first test app had no name-cache whatsoever.  It relied on external
> commands to drive it, and it sent the source/destination/basis trio of
> names from side to side before every step of the file's progress.  While
> this was simple, the increased bandwidth necessary to retransmit the
> names was not acceptable to me.
I think the better approach is to reduce the bandwidth
needed rather than make multiple stages require side-channel
communication.

> 
> If we just register the active items that are currently being sent over
> the wire, the name will need to live through the entire sig, delta,
> patch, and (optionally) source-side-delete steps.  When the files are
> nearly up-to-date, having only 16 of them will, I believe, be overly
> restrictive.  Part of the problem is that the buffered data on the
> sig-generating side delays the source-side-delete messages quite a bit.
> If we had a high-priority delete channel, that would help to alleviate
> things, but I think you'll find that having several hundred active names
> will be a better lower limit in your design thinking.
> 
> Another question is whether names are sent fully-qualified or relative
> to some directory.  My protocol caches directory names in the name cache
> and allows you to send filenames relative to a cached directory.  Just
> having a way to "chdir" each side (even if the chdir is just virtual)
> and send names relative to the current directory should help a lot.

I see no reason (so far) why the concept of a current
tree-relative directory wouldn't be perfectly viable.
The stream would contain CD commands.
As such, the only time we might need to pass a complete
pathname would be for link destinations, and a build-as-you-go
directory table could eliminate that.
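
A minimal sketch of such a stream, assuming two made-up commands, "CD" and "FILE", and directory names given relative to the tree root:

```python
import posixpath

def encode_names(paths):
    """Turn tree-relative paths into ("CD", dir) and ("FILE", name)
    commands, emitting a CD only when the directory changes."""
    commands = []
    current = None
    for path in paths:
        directory, name = posixpath.split(path)
        if directory != current:
            commands.append(("CD", directory))
            current = directory
        commands.append(("FILE", name))
    return commands

def decode_names(commands):
    """Rebuild the full tree-relative paths on the receiving side."""
    paths = []
    current = ""
    for op, arg in commands:
        if op == "CD":
            current = arg
        else:
            paths.append(posixpath.join(current, arg))
    return paths

paths = ["src/a.c", "src/b.c", "src/lib/x.c", "README"]
assert decode_names(encode_names(paths)) == paths
```

With names sorted by directory, each directory name crosses the wire once instead of once per file.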

> 
> An additional source of cached names is in the directory scanning when
> doing a recursive transfer.  My protocol has specific commands that
> refer to a name index within a specified directory so that the receiving
> side can request changed files using a small binary value instead of a
> full pathname.
> 
> One more area of complexity that you don't mention (and I don't either
> in my new-protocol doc):  there are some operations where 2 names ne

Re: superlifter design notes and rZync feedback

2002-07-18 Thread Wayne Davison

Martin Pool <[EMAIL PROTECTED]> wrote:
> I've put a cleaned-up version of my design notes up here
> http://samba.org/~mbp/superlifter/design-notes.html

I'll start with some feedback on your rzync comments:

Re: rzync's name:  I currently consider the rZync to be a test app to
allow me (and anyone else who wants to fiddle with it) to try out some
ideas in protocol design.  Integrating the ideas from this back into
rsync or into superlifter would be ideal.  If I ever decide to release
my own file transfer utility, I'll name it something useful at that
time (definitely NOT rzync).

Re: rzync's variable-length fields:  Note that my code allows more
variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
value to only as many bytes as needed to actually store the length.  I
agree that we should question whether this complexity is needed, but I
don't agree that it is wrong on principle.  There are two areas where
field-sizing is used:  in the directory-info compression (which is very
similar to what rsync does, but with some extra field-sizing thrown in
for good measure), and in the transmission protocol itself:

I still have questions about how best to handle the transfer of
directory info.  I'm thinking that it might be better to remove the
rsync-like downsizing of the data and to use a library like zlib to
remove the huge redundancies in the dir data during its transmission.

In the protocol itself, there are only two variable-size elements that
goes into each message header.  While this increases complexity quite a
bit over a fixed-length message header, it shouldn't be too hard to
automate a test that ensures that the various header combinations
(particularly boundary conditions) encode and decode properly.  I don't
know if this level of message header complexity is actually needed (this
is one of the things that we can use the test app to check out), but if
we decide we want it, I believe we can adequately test it to ensure that
it will not be a sinkhole of latent bugs.
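
To show the kind of automated test meant here, the sketch below uses a made-up header layout, not rzync's actual one: a tag byte holding two 4-bit byte counts, followed by two integers stored in only as many bytes as each needs.  The field names and functions are hypothetical.

```python
import itertools

def min_bytes(n):
    """Number of bytes needed to store n (0 needs none)."""
    return (n.bit_length() + 7) // 8

def encode_header(length, name_index):
    """Tag byte (two 4-bit byte counts) followed by both fields."""
    a, b = min_bytes(length), min_bytes(name_index)
    if a > 8 or b > 8:
        raise ValueError("field does not fit in 8 bytes")
    return (bytes([(a << 4) | b])
            + length.to_bytes(a, "big")
            + name_index.to_bytes(b, "big"))

def decode_header(data):
    """Return (length, name_index, header_size)."""
    a, b = data[0] >> 4, data[0] & 0x0F
    length = int.from_bytes(data[1:1 + a], "big")
    name_index = int.from_bytes(data[1 + a:1 + a + b], "big")
    return length, name_index, 1 + a + b

def boundary_values():
    """Values on either side of each byte-count boundary, up to 8 bytes."""
    yield 0
    for nbytes in range(1, 9):
        top = 1 << (8 * nbytes)
        yield top - 1
        if nbytes < 8:
            yield top

# Round-trip every combination of boundary values for the two fields.
for length, idx in itertools.product(list(boundary_values()), repeat=2):
    encoded = encode_header(length, idx)
    assert decode_header(encoded) == (length, idx, len(encoded))
```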

Re: rzync's name cache.  I've revamped it to be a very dependable design
that no longer depends on lock-step synchronization in the expiration of
old items (just in the creation of new items, which is easy to achieve).

Some comments on your registers:

You mention having something like 16 registers to hold names.  I think
you'll find this to be inadequate, but it does depend on exactly how
much you plan to cache names outside of the registers, how much
retransmission of names you consider to be acceptable, and whether you
plan to have a "move mode" where the source file is deleted.

My first test app had no name-cache whatsoever.  It relied on external
commands to drive it, and it sent the source/destination/basis trio of
names from side to side before every step of the file's progress.  While
this was simple, the increased bandwidth necessary to retransmit the
names was not acceptable to me.

If we just register the active items that are currently being sent over
the wire, the name will need to live through the entire sig, delta,
patch, and (optionally) source-side-delete steps.  When the files are
nearly up-to-date, having only 16 of them will, I believe, be overly
restrictive.  Part of the problem is that the buffered data on the
sig-generating side delays the source-side-delete messages quite a bit.
If we had a high-priority delete channel, that would help to alleviate
things, but I think you'll find that having several hundred active names
will be a better lower limit in your design thinking.

Another question is whether names are sent fully-qualified or relative
to some directory.  My protocol caches directory names in the name cache
and allows you to send filenames relative to a cached directory.  Just
having a way to "chdir" each side (even if the chdir is just virtual)
and send names relative to the current directory should help a lot.

An additional source of cached names is in the directory scanning when
doing a recursive transfer.  My protocol has specific commands that
refer to a name index within a specified directory so that the receiving
side can request changed files using a small binary value instead of a
full pathname.
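
A rough sketch of the idea, not the actual rzync commands: if both sides hold the same listing for a scanned directory, a (directory id, index) pair is enough to name a file.  The DirCache class, and the assumption that both sides sort the listing the same way, are mine.

```python
class DirCache:
    """Per-directory name lists, built identically on both sides."""

    def __init__(self):
        self._dirs = []  # position in this list is the directory id

    def add_directory(self, path, names):
        """Record a scanned directory and return its id."""
        self._dirs.append((path, sorted(names)))
        return len(self._dirs) - 1

    def index_of(self, dir_id, name):
        """Small integer that stands in for a name within a directory."""
        return self._dirs[dir_id][1].index(name)

    def resolve(self, dir_id, index):
        """Turn (dir_id, index) back into a tree-relative path."""
        path, names = self._dirs[dir_id]
        return path + "/" + names[index]

sender, receiver = DirCache(), DirCache()
listing = ["main.c", "Makefile", "util.c"]
d = sender.add_directory("src", listing)
receiver.add_directory("src", listing)

# The receiver asks for a changed file by index; the sender resolves it.
request = (d, receiver.index_of(d, "util.c"))
assert sender.resolve(*request) == "src/util.c"
```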

One more area of complexity that you don't mention (and I don't either
in my new-protocol doc):  there are some operations where 2 names need
to be associated with one operation.  This happens when we have both a
destination file and a basis file.  My current cache implementation
allows both of these names to be associated with a single cache element
(though I need to improve this a bit in rzync) and lets the sig/patch
stage snag them both.

..wayne..

