Re: superlifter design notes (OpenVMS perspective)

2002-07-30 Thread jw schultz

On Tue, Jul 30, 2002 at 12:00:21AM -0400, John E. Malmberg wrote:
 To help explain why the backup and file distribution have such different 
 implementation issues, let me give some background.
 
 
 This is a dump of an OpenVMS native text file.  This is the format that 
 virtually all text editors on OpenVMS produce.
 
 Dump of file PROJECT_ROOT:[rsync_vms]CHECKSUM.C_VMS;1 on 29-JUL-2002 
 22:02:21.32
 File ID (118449,3,0)   End of file block 8 / Allocated 8
 
 Virtual block number 1 (0001), 512 (0200) bytes
 
  67697279 706F4320 20200025 2A2F0002 ../*%.   Copyrig 00
  72542077 6572646E 41202943 28207468 ht (C) Andrew Tr 10
  20200024 00363939 31206C6C 65676469 idgell 1996.$.   20
  50202943 28207468 67697279 706F4320  Copyright (C) P 30
  39312073 61727265 6B63614D 206C7561 aul Mackerras 19 40
  72702073 69685420 20200047 3639 96..G.   This pr 50
 
 Each record is preceded by a 16 bit count of how long the record is. 
 While any byte value can be present in a record, usually only printable 
 ASCII is present.

 The file must be opened in binary mode.  On an fopen() call, the "b" 
 mode qualifier causes the file to be opened in binary mode, so no 
 translation is done.  This flag is documented as part of the ISO C 
 standard; it has no effect on a UNIX platform, but it is important on 
 other platforms.
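The record layout John describes (a 16-bit length word before each record, stored little-endian as the dump above shows) can be read in C roughly as follows once the file is opened in binary mode. This is a sketch: the pad byte after odd-length records is an assumption about RMS alignment, not something stated in the dump.

```c
/* Sketch: reading VMS variable-length records from a file opened with
 * fopen(..., "rb").  The little-endian 16-bit byte count matches the
 * dump above; the pad byte after odd-length records is an assumption. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Read one record into buf (capacity buflen); returns the record
 * length, or -1 on EOF, error, or a record too large for buf. */
static int read_vms_record(FILE *fp, unsigned char *buf, size_t buflen)
{
    unsigned char hdr[2];
    if (fread(hdr, 1, 2, fp) != 2)
        return -1;
    uint16_t len = (uint16_t)(hdr[0] | (hdr[1] << 8)); /* little-endian */
    if (len > buflen || fread(buf, 1, len, fp) != len)
        return -1;
    if (len & 1)                /* assumed: odd-length records padded */
        (void)fgetc(fp);
    return (int)len;
}
```

A caller would loop on read_vms_record() until it returns -1, treating each returned buffer as one logical line of the text file.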

While VMS and a few other OSs make the distinction between
text and binary files, VMS is fairly unusual here.  UNIX is our
primary focus and I don't intend to get bogged down with OS
specifics on all platforms.  POSIX has no mechanism for
determining the content of files.  All files are binary.

To meet your record-oriented text file needs I would say
that the VMS port would need some options and extra
logic.  For backups all files could be opened with the "b"
mode qualifier.

For sending to non-VMS systems, text files would want
conversion to another format, and for receiving, some
heuristics would identify text files for conversion
(updating text files could take advantage of the local
file's attributes).  Such file conversions would require
in-core translation for checksums, file length and change
merges.  This puts them into the same category as unix2dos
text-file conversions and backup compression.  Such file
conversions are outside the scope of current consideration,
but where possible we should keep them in mind for future
enhancement.
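A minimal sketch of the kind of receive-side heuristic mentioned above: sample a buffer and call it text if it has no NUL bytes and is mostly printable. The NUL test and the 90% threshold are illustrative assumptions, not a settled rule.

```c
/* Sketch of a text-detection heuristic: treat a buffer as text when it
 * contains no NUL bytes and at least ~90% printable ASCII plus common
 * whitespace.  Both criteria are arbitrary assumptions for this sketch. */
#include <stddef.h>
#include <ctype.h>

static int is_probably_text(const unsigned char *buf, size_t len)
{
    size_t printable = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            return 0;                      /* NUL byte: call it binary */
        if (isprint(buf[i]) || buf[i] == '\n'
            || buf[i] == '\r' || buf[i] == '\t')
            printable++;
    }
    return len == 0 || printable * 10 >= len * 9;  /* ~90% threshold */
}
```

In practice such a check would run over only the first block of a file, and local file attributes (where they exist) would override it.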

 Then there are the file attributes:
 
 CHECKSUM.C_VMS;1  File ID:  (118449,3,0)
 Size:8/8  Owner:[SYSOP,MALMBERG]
 Created:   29-JUL-2002 22:01:37.95
 Revised:   29-JUL-2002 22:01:38.01 (1)
 Expires:   None specified
 Backup:No backup recorded
 Effective: None specified
 Recording: None specified
 File organization:  Sequential
 Shelved state:  Online
 Caching attribute:  Writethrough
 File attributes:Allocation: 8, Extend: 0, Global buffer count: 0
 No version limit
 Record format:  Variable length, maximum 0 bytes, longest 71 bytes
 Record attributes:  Carriage return carriage control
 RMS attributes: None
 Journaling enabled: None
 File protection:System:RWED, Owner:RWED, Group:RWED, World:RE
 Access Cntrl List:  None
 Client attributes:  None
 
 And this is for a simple file format.  Files can be indexed or have 
 multiple keys.
 
 And there is no cross platform API for retrieving all of these 
 attributes, so how do you determine how to transmit them through?

We can't rely on a pre-existing cross-platform API.  What
I'm inclined toward is to use native I/O routines.  The
protocol would be focused on UNIX file semantics.  We might
add a few reasonable additional bits for those platforms
that will be VERY common interoperators.  These other
attributes I would treat as special extended attributes.

 Security is another issue:
 
 In some cases the binary values for the access control entries need to 
 be preserved, and in other cases the text values need to be preserved.
 It also may need a translation from one set of text or binary values to 
 another set.
 
 And again, there are no cross platform API's for returning this information.

See above.  We need to support binary IDs and text IDs and
ID squashing.  I'm not sure yet but mode bits will probably
be binary.  There is no reason to transmit them as text.

 So a backup-type application is going to need a lot of platform-specific 
 tweaks, and some way to pass all this varied information between the 
 client and server.  As each platform is added, an extension may need to 
 be developed.

Platform-specific tweaks will only be built into the
binaries for that platform.  The protocol will have certain
UNIX centricities but the flexibility to transmit platform
specifics.

 A server definitely needs to know if it is in backup mode as opposed to 
 file distribution mode.
 
 In file distribution mode, only a few file attributes need to be 
 preserved, and a loss of 

Re: superlifter design notes (OpenVMS perspective)

2002-07-28 Thread Ben Escoto

 JS == jw schultz [EMAIL PROTECTED]
 wrote the following on Sat, 27 Jul 2002 23:05:50 -0700

  JS As a poor example let us suppose that a filename contained a
  JS /.  A UNIX system using translation might turn this into _.
  JS Escapement might turn it into =2F and = into =3D.

rdiff-backup has this feature.  I'm not sure anyone uses it, and it
was a pain to add and to test adequately, especially when the
additional quoting characters push the length of the filename over the
limit.  If I had to do it over I probably would have skipped this
feature (or at least waited until lots of people bothered me about it).
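The escapement scheme jw sketched ('/' becoming =2F, '=' becoming =3D) might look like this in C. The set of characters needing escape is illustrative; a real port would build it from the target filesystem's rules.

```c
/* Sketch of reversible =XX filename escapement: each illegal character
 * becomes '=' plus two hex digits, and '=' itself becomes =3D so the
 * mapping can be undone.  The illegal set here is only an example. */
#include <stdio.h>
#include <string.h>

static int needs_escape(char c)
{
    return c == '/' || c == '=';   /* illustrative illegal set */
}

/* Escape src into dst (capacity dstlen); returns 0 on success, -1 if
 * the escaped name would not fit. */
static int escape_name(const char *src, char *dst, size_t dstlen)
{
    size_t j = 0;
    for (const char *p = src; *p; p++) {
        if (needs_escape(*p)) {
            if (j + 3 >= dstlen)
                return -1;
            sprintf(dst + j, "=%02X", (unsigned char)*p);
            j += 3;
        } else {
            if (j + 1 >= dstlen)
                return -1;
            dst[j++] = *p;
        }
    }
    dst[j] = '\0';
    return 0;
}
```

The length problem Ben describes is visible here: each escaped character costs two extra bytes, so a name near the platform limit can overflow it after quoting.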


-- 
Ben Escoto





Re: superlifter design notes (OpenVMS perspective)

2002-07-28 Thread Martin Pool

On 27 Jul 2002, jw schultz [EMAIL PROTECTED] wrote:
 The server has no need to deal with client limitations.  I
 am saying that the protocol would make the bare minimum of
 limitations (null termination, no nulls in names).

It probably also makes sense to follow NFS4 in representing
paths as a vector of components, rather than as a single string
with '/'s in it or whatever.  ['home', 'mbp', 'work', 'rsync'] avoids
any worries about / vs \ vs :, and just lets the client do
whatever makes sense.

I don't know a lot about i18n support, but it does seem that
programs will need to know what encoding to use for the filesystem
on platforms that are not natively Unicode.  On Unix it probably
makes sense to default to UTF-8, but latin-1 or others are
equally likely.  This is independent of the choice of message
locale.  I think the W32 APIs are defined in Unicode so we don't 
need to worry.

Quoting, translating, or rejecting illegal characters could all
make sense depending on context.

I guess I see John's backup vs distribution question as 
hopefully being different profiles or wrappers around a single
codebase, rather than different programs.  Perhaps the distinction
he's getting at is whether the audience for the client who
uploaded the data is the same client, or somebody else?

-- 
Martin

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: superlifter design notes (OpenVMS perspective)

2002-07-28 Thread jw schultz

On Sun, Jul 28, 2002 at 05:39:22PM +1000, Martin Pool wrote:
 On 27 Jul 2002, jw schultz [EMAIL PROTECTED] wrote:
  The server has no need to deal with client limitations.  I
  am saying that the protocol would make the bare minimum of
  limitations (null termination, no nulls in names).
 
 It probably also makes sense to follow NFS4 in representing
 paths as a vector of components, rather than as a single string
 with '/'s in it or whatever.  ['home', 'mbp', 'work', 'rsync'] avoids
 any worries about / vs \ vs :, and just lets the client do
 whatever makes sense.

That is _one_ of the reasons that I said filenames should be
CWD relative (no path components).  That way the protocol
never needs to know about / vs \ with the possible exception
of links.  The vector component list would address the issue
of link destinations nicely with a null-terminated list of
null-terminated strings 'home\0mbp\0work\0rsync\0\0'.
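A sketch of walking that doubly null-terminated component list and joining it with a local separator (the function name and separator handling are illustrative):

```c
/* Sketch: walk a component list of the form "home\0mbp\0work\0rsync\0\0"
 * and join it with whatever separator the local platform uses. */
#include <stddef.h>
#include <string.h>

/* Join components into path (capacity pathlen) using sep; returns the
 * number of components, or -1 if the result would not fit. */
static int join_components(const char *list, char sep,
                           char *path, size_t pathlen)
{
    size_t j = 0;
    int n = 0;
    /* an empty string (second NUL in a row) terminates the list */
    for (const char *p = list; *p; p += strlen(p) + 1, n++) {
        size_t len = strlen(p);
        if (j + len + (n ? 1 : 0) + 1 > pathlen)
            return -1;
        if (n)
            path[j++] = sep;
        memcpy(path + j, p, len);
        j += len;
    }
    path[j] = '\0';
    return n;
}
```

A VMS port would join with its own syntax instead of '/', which is exactly the point of sending components rather than a single string.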

 I don't know a lot about i18n support, but it does seem that
 programs will need to know what encoding to use for the filesystem
 on platforms that are not natively Unicode.  On Unix it probably
 makes sense to default to UTF-8, but latin-1 or others are
 equally likely.  This is independent of the choice of message
 locale.  I think the W32 APIs are defined in Unicode so we don't 
 need to worry.
 
 Quoting, translating, or rejecting illegal characters could all
 make sense depending on context.

I avoided the idea of rejection but there may be cases where
we need it.  Rejection would mean the file would not be
transferred.  For interactive use the default would be to
translate, and translation would seldom be needed because
most transfers would be of filenames supported on both ends.
Any time a translation occurs a warning would be generated
unless silenced.

 I guess I see John's backup vs distribution question as 
 hopefully being different profiles or wrappers around a single
 codebase, rather than different programs.  Perhaps the distinction
 he's getting at is whether the audience for the client who
 uploaded the data is the same client, or somebody else?

The backup vs. distribution question seems to hang on what
we do when the storage semantics of the two nodes have a
mismatch.  For backups we want to retain all data, either
through lossless conversion or in some kind of meta-data
store.  I'm inclined to take advantage of extended
attributes (NAME=rsync_perms etc.?)  But for distribution
we can afford some meta-data loss as long as future
runs will compare correctly (ignore the loss).

I agree with you: this is either a different wrapper or
perhaps a mode that sets multiple options.  The biggest
difference seems to be on the server, so perhaps the same
codebase might generate a server that has additional
capabilities, but the client for both would be the same
regardless.

-- 

J.W. Schultz    Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt




Re: superlifter design notes (OpenVMS perspective)

2002-07-27 Thread John E. Malmberg

Lenny Foner [EMAIL PROTECTED] wrote:
 jw schultz wrote:
   I find the use of funny chars (including space) in filenames
   offensive but we need to deal with internationalizations and
   sheer stupidity.
 
 Regardless of what you think about them, MacOS comes with pathnames
 containing spaces right out of the box (think System Folder).  Yes,
 rsync needs to not make assumptions about what's legal in a filename.
 Some OS's think slashes are path separators; some put them inside
 individual filenames.  Some think [] are separators.  We shouldn't
 try to make any assumptions.

Agreed.  For a file distribution program, for each file to be 
transferred, ideally the server will have a list of how the file should 
be represented on platforms that the server knows about.

The client would be able to tell the server about new platforms, but the 
server would not be required to remember the information if it did not 
trust the client.

As I work through my backlog of e-mail messages, I will give some 
possible implementation details as answers to other posts.

-John
[EMAIL PROTECTED]
Personal Opinion Only





Re: superlifter design notes (OpenVMS perspective)

2002-07-27 Thread John E. Malmberg

Martin Pool wrote:
 
 On 22 Jul 2002, John E. Malmberg [EMAIL PROTECTED] wrote:
 
 
A clean design allows optimization to be done by the compiler, and tight 
optimization should be driven by profiling tools.
 
 
 Right.  So, for example, glib has a very smart assembly ntohl() and
 LZO is tight code.  I would much rather use them than try to reduce
 the byte count by a complicated protocol.

Many compilers will inline ntohl() giving the call very low overhead.


5. Similarly, no silly tricks with forking, threads, or nonblocking
IO: one process, one IO.

Forking or multiple processes can be high cost on some platforms.  I am 
not experienced enough with POSIX threads to judge their portability.

But as long as it is done right, non-blocking I/O is not a problem for me.

If you structure the protocol processing so that no subroutine ever posts 
a write and then waits for a read, you can set up a library that can be 
used either blocking or non-blocking.
 
 
 Yes, that's how librsync is structured.
 
 Is it reasonable to assume that some kind of poll/select arrangement
 is available everywhere?  In other words, can I check to see if input
 is available from a socket without needing to block trying to read
 from it?

I can poll, but I prefer to cause the I/O completion to trigger a 
completion routine.  But that is not portable. :-)

 I would hope that only a relatively small layer needs to know about
 how and when IO is scheduled.  It will make callbacks (or whatever) to
 processes that produce and consume data.  That layer can be adapted,
 or if necessary, rewritten, to use whatever async IO features are
 available on the relevant platform.
  
Test programs that internally fork() are very troublesome for me. 
Starting a few hundred individually by a script is not.
 
 If we always use fork/exec (aka spawn()) is that OK?  Is it only
 processes that fork and that then continue executing the same program
 that cause trouble?

Mainly.  I can deal with spawn() much more easily than fork().

 
I can only read UNIX shell scripts of minor complexity.
  
 Apparently Python runs on VMS.  I'm in favour of using it for the test
 suite; it's much more effective than sh.

Unfortunately the Python maintainer for VMS retired, and I have not been 
able to figure out how to get his source to compile.  I have got the 
official Python to compile and link, having to fix only one severe 
programming error.  However, it still is not running.  I am isolating 
where the problem is in my free time.

12. Try to keep the TCP pipe full in both directions at all times.
Pursuing this intently has worked well in rsync, but has also led to
a complicated design prone to deadlocks.

Deadlocks can be avoided.
 
 Do you mean that in the technical sense of deadlock avoidance?
 i.e. checking for a cycle of dependencies and failing?  That sounds
 undesirably complex.

No, by not using a complex protocol, so that there are no deadlocks.
 
9  Model files as composed of a stream of bytes, plus an optional
table of key-value attributes. Some of these can be distinguished to
model ownership, ACLs, resource forks, etc.

Not portable.  This will effectively either exclude all non-UNIX or make 
it very difficult to port to them.
  
 Non-UNIX is not completely fair; as far as I know MacOS, Amiga,
 OS/2, Windows, BeOS, and QNX are {byte stream + attributes + forks}
 too.

 I realize there are platforms which are record-oriented, but I don't
 have much experience on them.  How would the rsync algorithm even
 operate on such things?

Record files need to be transmitted on record boundaries, not arbitrary 
boundaries.  Also, random access can not be used.  The file segments need 
to be transmitted in order.

For a UNIX text file, a record is a line of text delimited by the 
line-feed character.

[This turned out to be a big problem in porting SAMBA.  An NT client 
transfers a large file by sending 64K, skipping 32K, sending some more, 
and then sending the 32K later.  Samba itself does not do this, so the 
resulting corruption of a record-structured file did not show up in the 
initial testing.  I still have not found the ideal fix for SAMBA, but 
implemented a workaround.]

 Is it sufficient to model them as ascii+linefeeds internally, and then
 do any necessary translation away from that model on IO?

Yes, as long as no partial records are transmitted.  Partial records can 
be a problem.  If I know the rest of the record is coming, then I can 
wait for it, but if the rest of the record is going to be skipped, then 
it takes more work.

 
BINARY files are no real problem.  The binary is either meaningful on 
the client or server or it is not.  However, file attributes may need to 
be maintained.  If the file attributes are maintained, it would be 
possible for me to have an OpenVMS indexed file moved up to a UNIX 
server, and then back to another OpenVMS system and be usable.
  
 Possibly it would be nice to have a way to stash attributes that
 cannot be represented on the 

Re: superlifter design notes (OpenVMS perspective)

2002-07-21 Thread John E. Malmberg

  Qualities
 
  1. Be reasonably portable: at least in principle, it should be
  possible to port to Windows, OS X, and various Unixes without major
  changes.

In general, I would like to see OpenVMS in that list.

  Principles
 
  1. Clean design rather than micro-optimization.

A clean design allows optimization to be done by the compiler, and tight 
optimization should be driven by profiling tools.

  4. Keep the socket open until the client gets bored. (Avoids startup
  time; good for on-line mirroring; good for interactive clients.)

I am afraid I do not quite understand this one.  Are you referring to a 
server waiting for a reconnect for a while instead of disconnecting?

If so, that seems to be a standard behavior for network daemons.

  5. Similarly, no silly tricks with forking, threads, or nonblocking
  IO: one process, one IO.

Forking or multiple processes can be high cost on some platforms.  I am 
not experienced enough with POSIX threads to judge their portability.

But as long as it is done right, non-blocking I/O is not a problem for me.

If you structure the protocol processing so that no subroutine ever posts 
a write and then waits for a read, you can set up a library that can be 
used either blocking or non-blocking.

The same for file access.

On OpenVMS, I can do all I/O in a non-blocking manner.  The problem is 
that I must use native I/O calls to do so.

If the structure is that after any I/O, control returns to a common 
point for the next step in the protocol, then it is easy to move from a 
blocking implementation to a non-blocking one.  MACROs can probably be 
used to allow common code to be used for blocking or non-blocking 
implementations.
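One way to sketch that structure in C: route all reads and writes through function pointers, so the same protocol step runs over blocking or non-blocking native I/O. All names here are illustrative, and the memory-backed callbacks merely stand in for a real transport (sockets, RMS/QIO, etc.).

```c
/* Sketch: one protocol step reused under blocking or non-blocking I/O
 * by routing all reads/writes through function pointers. */
#include <stddef.h>
#include <string.h>

struct io_ops {
    /* return bytes moved, 0 for "would block", -1 on error */
    int (*read)(void *ctx, void *buf, size_t len);
    int (*write)(void *ctx, const void *buf, size_t len);
};

/* Advance the protocol by at most one I/O step; never waits inline. */
static int protocol_step(const struct io_ops *io, void *ctx)
{
    char buf[128];
    int n = io->read(ctx, buf, sizeof buf);
    if (n <= 0)
        return n;                       /* would-block or error: return */
    return io->write(ctx, buf, (size_t)n);  /* echo as a placeholder step */
}

/* Memory-backed callbacks standing in for a real transport. */
struct membuf {
    const char *in;
    size_t inlen, inpos;
    char out[128];
    size_t outlen;
};

static int mem_read(void *ctx, void *buf, size_t len)
{
    struct membuf *m = ctx;
    size_t left = m->inlen - m->inpos;
    size_t n = left < len ? left : len;
    if (n == 0)
        return 0;                       /* nothing available: would block */
    memcpy(buf, m->in + m->inpos, n);
    m->inpos += n;
    return (int)n;
}

static int mem_write(void *ctx, const void *buf, size_t len)
{
    struct membuf *m = ctx;             /* no bounds check: sketch only */
    memcpy(m->out + m->outlen, buf, len);
    m->outlen += len;
    return (int)len;
}
```

Because protocol_step() returns on a would-block read instead of waiting, the same code works whether the underlying callbacks block or not.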

Two systems that use non-blocking mode can push a higher data rate 
through in the same time period.

This is an area where I can offer help to produce a clean implementation.

One of the obstacles to me cleanly implementing RSYNC as a single 
process is when a subroutine is waiting for a response to a command that 
it sent.  If that subroutine is called as an asynchronous event, it 
blocks all other execution in that process.  That same practice hurts in 
SAMBA.


  8. Design for testability. For example: don't rely on global
  resources that may not be available when testing; do make behaviours
  deterministic to ease testing.

Test programs that internally fork() are very troublesome for me. 
Starting a few hundred individually by a script is not.

I can only read UNIX shell scripts of minor complexity.

  10. Have a design that is as simple as possible.

  11. Smart clients, dumb servers. This is claimed to be a good
  design pattern for internet software. rsync at the moment does not
  really adhere to it. Part of the point of rsync is that having a
  smarter server can make things much more efficient. A strength of
  this approach is that to add features, you (often) only need to add
  them to the client.

It should be a case of who can do the job more easily.

  12. Try to keep the TCP pipe full in both directions at all times.
  Pursuing this intently has worked well in rsync, but has also led to
  a complicated design prone to deadlocks.

Deadlocks can be avoided.  Make sure that if an I/O is initiated, the 
next step is to return to the protocol dispatching routine.
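That rule can be sketched as a small state machine: a handler posts its I/O, records where it is, and returns to the dispatcher rather than waiting inline for the reply. The states and events here are illustrative, not the real protocol.

```c
/* Sketch of a dispatch loop where no handler waits inline after posting
 * a write: each call consumes one event, advances the state, returns. */
enum state { ST_IDLE, ST_SENT_REQ, ST_DONE };

struct conn {
    enum state st;
    int replies;
};

/* One protocol step per event; any blocking lives in the I/O layer. */
static void dispatch(struct conn *c, int event)
{
    (void)event;                 /* event payload unused in this sketch */
    switch (c->st) {
    case ST_IDLE:
        /* post a request here, then return; the reply arrives as a
         * later event instead of being awaited in place */
        c->st = ST_SENT_REQ;
        break;
    case ST_SENT_REQ:
        c->replies++;            /* reply event consumed */
        c->st = ST_DONE;
        break;
    case ST_DONE:
        break;
    }
}
```

Since no path through dispatch() can wait on the peer, neither side can end up waiting on the other: that is the deadlock-freedom argument in miniature.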

  General design ideas
 
  9  Model files as composed of a stream of bytes, plus an optional
  table of key-value attributes. Some of these can be distinguished to
  model ownership, ACLs, resource forks, etc.

Not portable.  This will effectively either exclude all non-UNIX or make 
it very difficult to port to them.

Binary files are a stream of bytes.

Text files are a stream of records.  Many systems do not store text 
files as a stream of bytes.  They may or may not even be ASCII.

If you are going to maintain meta files for ACLs and Resource Forks, 
then there should be some provision to supply attributes for an entire 
directory or individual files.

BINARY files are no real problem.  The binary is either meaningful on 
the client or server or it is not.  However, file attributes may need to 
be maintained.  If the file attributes are maintained, it would be 
possible for me to have an OpenVMS indexed file moved up to a UNIX 
server, and then back to another OpenVMS system and be usable.

Currently in order to do so, I must encapsulate them in a .ZIP archive.
That is .ZIP, not GZIP or BZIP.  On OpenVMS those are only useful to 
transfer source and a limited subset of binaries.

TEXT files are much different from binary files, except on UNIX.

A text file needs to be processed by records, and on many systems the 
records can not be updated randomly, or if they can, it is not very 
efficient.

If a target use for this program is to be for assisting in cross 
platform open source synchronization, then it really needs to properly 
address the text files.

A server should know how to represent a TEXT file in a portable format 
to the client.  Stream records in ASCII, delimited 

Re: superlifter design notes (OpenVMS perspective)

2002-07-21 Thread Martin Pool

On 22 Jul 2002, John E. Malmberg [EMAIL PROTECTED] wrote:
  Qualities
 
  1. Be reasonably portable: at least in principle, it should be
  possible to port to Windows, OS X, and various Unixes without major
  changes.
 
 In general, I would like to see OpenVMS in that list.

Yes, OpenVMS, perhaps also QNX and some other TCP/IP-capable RTOSs.

Having a portable protocol is a bit more important than a portable
implementation.  I would hope that with a new system, even if the
implementation was unix-bound, you would at least be able to write a
new client, reusing some of the code, that worked well on ITS.

 A clean design allows optimization to be done by the compiler, and tight 
 optimization should be driven by profiling tools.

Right.  So, for example, glib has a very smart assembly ntohl() and
LZO is tight code.  I would much rather use them than try to reduce
the byte count by a complicated protocol.

  4. Keep the socket open until the client gets bored. (Avoids startup
  time; good for on-line mirroring; good for interactive clients.)
 
 I am afraid I do not quite understand this one.  Are you referring to a 
 server waiting for a reconnect for a while instead of disconnecting?

What I meant is that I would like to be able to open a connection to a
server, download a file, leave the connection open, decide I need
another file, and then get that one too.  You can do this with FTP,
and (kindof) HTTP, but not rsync, which needs to know the command up
front.

Of course the server can drop you too by a timeout or whatever.

 If so, that seems to be a standard behavior for network daemons.
 
  5. Similarly, no silly tricks with forking, threads, or nonblocking
  IO: one process, one IO.
 
 Forking or multiple processes can be high cost on some platforms.  I am 
 not experienced enough with POSIX threads to judge their portability.
 
 But as long as it is done right, non-blocking I/O is not a problem for me.
 
 If you structure the protocol processing where no subroutine ever posts 
 a write and then waits for a read, you can set up a library that can be 
 used either blocking or non-blocking.

Yes, that's how librsync is structured.

Is it reasonable to assume that some kind of poll/select arrangement
is available everywhere?  In other words, can I check to see if input
is available from a socket without needing to block trying to read
from it?

I would hope that only a relatively small layer needs to know about
how and when IO is scheduled.  It will make callbacks (or whatever) to
processes that produce and consume data.  That layer can be adapted,
or if necessary, rewritten, to use whatever async IO features are
available on the relevant platform.

 Test programs that internally fork() are very troublesome for me. 
 Starting a few hundred individually by a script is not.

If we always use fork/exec (aka spawn()) is that OK?  Is it only
processes that fork and that then continue executing the same program
that cause trouble?

 I can only read UNIX shell scripts of minor complexity.

Apparently Python runs on VMS.  I'm in favour of using it for the test
suite; it's much more effective than sh.

  12. Try to keep the TCP pipe full in both directions at all times.
  Pursuing this intently has worked well in rsync, but has also led to
  a complicated design prone to deadlocks.
 
 Deadlocks can be avoided.

Do you mean that in the technical sense of deadlock avoidance?
i.e. checking for a cycle of dependencies and failing?  That sounds
undesirably complex.

 Make sure if an I/O is initiated, that the 
 next step is to return to the protocol dispatching routine.

  9  Model files as composed of a stream of bytes, plus an optional
  table of key-value attributes. Some of these can be distinguished to
  model ownership, ACLs, resource forks, etc.
 
 Not portable.  This will effectively either exclude all non-UNIX or make 
 it very difficult to port to them.

Non-UNIX is not completely fair; as far as I know MacOS, Amiga,
OS/2, Windows, BeOS, and QNX are {byte stream + attributes + forks}
too.

I realize there are platforms which are record-oriented, but I don't
have much experience on them.  How would the rsync algorithm even
operate on such things?

Is it sufficient to model them as ascii+linefeeds internally, and then
do any necessary translation away from that model on IO?

 BINARY files are no real problem.  The binary is either meaningful on 
 the client or server or it is not.  However, file attributes may need to 
 be maintained.  If the file attributes are maintained, it would be 
 possible for me to have an OpenVMS indexed file moved up to a UNIX 
 server, and then back to another OpenVMS system and be usable.

Possibly it would be nice to have a way to stash attributes that
cannot be represented on the destination filesystem, but perhaps that
is out of scope.

 I recall seeing a comment somewhere in this thread about timestamps 
 being left to 16 bits.

No, 32 bits.  16 bits is obviously silly.

 File timestamps 

Re: superlifter design notes (OpenVMS perspective)

2002-07-21 Thread Martin Pool

 User-Agent: Mozilla/5.0 (X11; U; OpenVMS COMPAQ_AlphaServer_DS10_466_MHz; en-US; rv:1.1a) Gecko/20020614

If something as complex as Mozilla can run on OpenVMS then I guess we
really have no excuse :-)

-- 
Martin 




Re: superlifter design notes (OpenVMS perspective)

2002-07-21 Thread jw schultz

On Mon, Jul 22, 2002 at 03:34:37PM +1000, Martin Pool wrote:
 On 22 Jul 2002, John E. Malmberg [EMAIL PROTECTED] wrote:
  
  If you structure the protocol processing where no subroutine ever posts 
  a write and then waits for a read, you can set up a library that can be 
  used either blocking or non-blocking.
 
 Yes, that's how librsync is structured.
 
 Is it reasonable to assume that some kind of poll/select arrangement
 is available everywhere?  In other words, can I check to see if input
 is available from a socket without needing to block trying to read
 from it?

I think we can assume that any platform supporting POSIX I/O
semantics will be sufficient.

 I would hope that only a relatively small layer needs to know about
 how and when IO is scheduled.  It will make callbacks (or whatever) to
 processes that produce and consume data.  That layer can be adapted,
 or if necessary, rewritten, to use whatever async IO features are
 available on the relevant platform.

That is the better approach.  Use I/O routines so most
processing can be while (get_input()) { process(); send_output(); }.
Then the I/O routines can be defined according to platform.

[snip]

   9  Model files as composed of a stream of bytes, plus an optional
   table of key-value attributes. Some of these can be distinguished to
   model ownership, ACLs, resource forks, etc.
  
  Not portable.  This will effectively either exclude all non-UNIX or make 
  it very difficult to port to them.
 
 Non-UNIX is not completely fair; as far as I know MacOS, Amiga,
 OS/2, Windows, BeOS, and QNX are {byte stream + attributes + forks}
 too.
 
 I realize there are platforms which are record-oriented, but I don't
 have much experience on them.  How would the rsync algorithm even
 operate on such things?
 
 Is it sufficient to model them as ascii+linefeeds internally, and then
 do any necessary translation away from that model on IO?
 
  BINARY files are no real problem.  The binary is either meaningful on 
  the client or server or it is not.  However, file attributes may need to 
  be maintained.  If the file attributes are maintained, it would be 
  possible for me to have an OpenVMS indexed file moved up to a UNIX 
  server, and then back to another OpenVMS system and be usable.

If a platform has some special type of file it would be
responsible for converting to/from a multi-segment
bytestream.

By multi-segment bytestream I mean a sequence of
binary_data blocks having an offset and length.  In this way
we have the potential to deal with sparse files and to
packetize the transfers of large files.  Obviously offset
and size are 64-bit.
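A sketch of what such a segment header, and a helper over a list of segments, might look like (field and function names are illustrative):

```c
/* Sketch: a file sent as a sequence of (offset, length) blocks, so
 * sparse regions are simply never transmitted and large files can be
 * packetized.  Field names are illustrative, not a wire format. */
#include <stdint.h>

struct segment_hdr {
    uint64_t offset;   /* 64-bit byte offset within the file */
    uint64_t length;   /* 64-bit byte count of data that follows */
};

/* Logical file size implied by a segment list: the highest byte
 * position any segment covers.  Gaps between segments are holes. */
static uint64_t segments_span(const struct segment_hdr *seg, int n)
{
    uint64_t end = 0;
    for (int i = 0; i < n; i++)
        if (seg[i].offset + seg[i].length > end)
            end = seg[i].offset + seg[i].length;
    return end;
}
```

A sparse file with data only at its start and past the 4GB mark would be two small segments, while the receiver reconstructs the hole between them from the offsets.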


 
 Possibly it would be nice to have a way to stash attributes that
 cannot be represented on the destination filesystem, but perhaps that
 is out of scope.

In general what we have to expect is that we can only
transfer the lowest common denominator of file attributes.

It would be possible to build a server that didn't depend on
local filesystem semantics and so could support an attribute
superset.  But that is out of scope for now.

  File timestamps for OpenVMS and for Windows NT are in 64 bits, but use 
  different base dates.
 
 I think we should use something like 64-bit microseconds-since-1970,
 with a precision indicator.
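Martin's suggestion could be sketched as follows. The epoch offsets (1601-01-01 for NT, 1858-11-17 for VMS, both counting 100ns ticks) are standard values; the function names are illustrative.

```c
/* Sketch: normalizing NT and VMS native timestamps (both 100ns ticks,
 * different base dates) to 64-bit microseconds since 1970-01-01.
 * Offsets: the NT base 1601-01-01 is 11644473600 s before the Unix
 * epoch; the VMS base 1858-11-17 (MJD 0) is 3506716800 s before it. */
#include <stdint.h>

#define TICKS_PER_USEC   10              /* 100ns ticks per microsecond */
#define NT_EPOCH_OFFSET  11644473600ULL  /* seconds, 1601 -> 1970 */
#define VMS_EPOCH_OFFSET 3506716800ULL   /* seconds, 1858 -> 1970 */

static int64_t nt_to_usec1970(uint64_t ticks)
{
    return (int64_t)(ticks / TICKS_PER_USEC)
         - (int64_t)(NT_EPOCH_OFFSET * 1000000);
}

static int64_t vms_to_usec1970(uint64_t ticks)
{
    return (int64_t)(ticks / TICKS_PER_USEC)
         - (int64_t)(VMS_EPOCH_OFFSET * 1000000);
}
```

The precision indicator Martin mentions would travel alongside the value, so a receiver knows whether the low digits are meaningful or padding.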
 
  File attributes need to be stored somewhere, so a reserved directory or 
  filename convention will need to be used.
  
  I assume that there will be provisions for a server to be marked as a 
  master reference.
 
 What do you mean master reference?

See my super/subset comment above.

 
  For flexibility, a client may need to provide filename translation, so 
  the original filename (that will be used on the wire) should be stored 
  as a file attribute.  It also follows that it probably is a good idea to 
  store the translated filename as an attribute also.
 
 Can you give us an example?  Are you talking about things like
 managing case-insensitive systems?

Filenames should be null-terminated UTF-8.  If a given
platform cannot support that, the port to that platform will be
responsible for conversion.  We probably should designate an
inline subroutine for filename conversion.  The only
alternative would be to restrict filenames to ASCII
[-_.A-Za-z0-9] or something similarly restrictive.  I find
the use of funny chars (including space) in filenames
offensive but we need to deal with internationalization and
sheer stupidity.



-- 

J.W. Schultz    Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt
