Re: Why does one of there work and the other doesn't

2001-12-03 Thread Phil Howard

On Sun, Dec 02, 2001 at 09:31:25PM -0500, Mark Eichin wrote:

|  Perhaps a trailing / instead of a trailing /. is supposed to work.  I do
|  not remember why I didn't start using it, but I am sure I would have tried
| 
| Quite possibly because you've been bitten by classic cp/rcp; cp is not
| idempotent, in that if you cp -r foo bar where foo is a dir and bar
| doesn't exist, you end up with a bar that has the contents of foo
| (ie. foo/zot -> bar/zot) and if you do it twice, cp sees that bar is a
| dir and inserts it instead (so foo -> bar/foo, foo/zot -> bar/foo/zot.)
| To make it worse, on BSD-ish systems, traditionally adding a trailing
| slash makes it treat bar as a directory (bar == bar/ == bar/.), but
| under sysv-ish systems it doesn't change the interpretation (bar/ ==
| bar, even if bar doesn't exist.)
| 
| Partially *because* of this horror, rsync is (and is documented to be)
| consistent, and to have an explicit interpretation of trailing slash
| (that is consistent with bar/ == bar/. as far as destinations are
| concerned)  and is independent of the existence of the destination, so
| you can expect it to do the same thing when run twice.  This is one
| reason I'll often run rsync -a on local files rather than cp -r...

I have certainly been bitten by that, and it is not limited to cp and rcp,
either.  Another example I know that has bitten with disastrous effects
is the ln -s command.  If the destination does not exist, it puts a symlink
there.  If the destination exists and is a directory (even if that means a
symlink that points to one), it puts the new symlink in the directory named.
So doing ln -s twice with a directory target can produce two different
symlinks.  Even hard links have a problem with target directories, although
the twice issue is not relevant since you can't hard link a directory
itself (if your system is not broken, unlike pre-ptx Dynix way back in time).
On some systems the -n option gets around this for ln -s.
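The two-different-symlinks pitfall is easy to reproduce; a minimal sketch, using
only scratch paths in a temporary directory:

```shell
# Reproduce the ln -s pitfall: the same command puts its symlink in two
# different places depending on whether the destination already exists
# as a directory.
tmp=$(mktemp -d) && cd "$tmp"
ln -s /etc/hosts link    # "link" does not exist: symlink created at ./link
mkdir dir
ln -s /etc/hosts dir     # "dir" exists and is a directory: symlink
                         # created INSIDE it, at ./dir/hosts
readlink link
readlink dir/hosts
```

Run it twice with the same arguments and the first invocation creates the name,
while the second drops a new symlink inside it.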

I'll do some tests with a trailing / instead of /. to see if that works
for me now with 2.5.0.  It may have been a bug in an older version.  If I
get any unexpected results with 2.5.0 I'll report back with those.

Consistency is a great value.

-- 
-
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| [EMAIL PROTECTED] | Texas, USA | http://phil.ipal.org/ |
-




Re: Why does one of there work and the other doesn't

2001-12-03 Thread tim . conway

rsync already has a memory-hogging issue.  Imagine having it search your 
entire directory tree, checksumming all files, storing and sending them 
all, comparing both lists looking for matching date/time/checksums to 
guess where you've moved files to.  You'd be better off to use a wrapper 
around the tools you move files with, keeping a replayable log, and have 
your mirrors retrieve and replay that log, before doing the rsync.
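The wrapper-plus-replayable-log idea could be sketched as a tiny shell function
(the function name and log location are hypothetical, not an existing tool):

```shell
# Hypothetical sketch of a replayable-move wrapper: every move is
# appended to a log that mirrors can fetch and replay before their
# rsync pass, so moved files are moved rather than retransferred.
logged_mv() {
    printf 'mv %s %s\n' "$1" "$2" >> "${MOVE_LOG:-moves.log}"
    mv -- "$1" "$2"
}

# demo in a scratch directory
tmp=$(mktemp -d) && cd "$tmp"
MOVE_LOG=$tmp/moves.log
touch a
logged_mv a b            # moves the file and records "mv a b" in the log
```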

Tim Conway
[EMAIL PROTECTED]
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack("n*", 
19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), 
"\n"'
There are some who call me Tim?




Phil Howard [EMAIL PROTECTED] wrote on 12/03/2001 09:04 AM:

On Mon, Dec 03, 2001 at 12:09:16AM +1100, Martin Pool wrote:

| On 30 Nov 2001, Randy Kramer [EMAIL PROTECTED] wrote:
| 
|  I am not sure which end the 100 bytes per file applies to, and I guess
|  that is the RAM memory footprint.  Does rsync need 100 bytes for each
|  file that might be transferred during a session (all files in the
|  specified directory(ies)), or does it need only 100 bytes as it does
|  one file at a time?
| 
| At the moment that is 100B for all files to be transferred in the
| whole session.  This is a big limit to scalability at the moment, and
| a goal of mine is to reduce it to at most holding file information
| from a single directory in memory.

It would still be nice to have an option to gather all files at once,
but this will be of value if it also gathers all the checksums and
synchronizes file moves that have happened on the source end by
doing the synchronization of the moved file to the new location using
the old (checksum-matched) file on the destination end.  Right now
if a file gets moved from one location to another (especially into a
different directory, which is often the case with a re-organization)
things get retransferred even though almost every file already exists
somewhere on the destination.
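What such move detection would amount to can be sketched by hand: index the
destination tree by whole-file checksum, then hardlink a match into the new
location instead of retransferring it (all paths here are hypothetical scratch
paths; rsync itself has no such option):

```shell
# Sketch of checksum-based move detection: if a "new" file's checksum
# already exists somewhere on the destination, hardlink the existing
# copy into place instead of sending the data again.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p dest/old dest/new
echo payload > dest/old/file                  # file before re-organization
want=$(md5sum dest/old/file | cut -d' ' -f1)  # checksum the sender reports
find dest -type f -exec md5sum {} + > index   # checksum -> path index
hit=$(awk -v w="$want" '$1 == w { print $2; exit }' index)
[ -n "$hit" ] && ln "$hit" dest/new/file      # match found: link, no transfer
```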

-- 
-
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| [EMAIL PROTECTED] | Texas, USA | http://phil.ipal.org/ |
-








Re: Why does one of there work and the other doesn't

2001-12-03 Thread Phil Howard

On Mon, Dec 03, 2001 at 09:55:53AM -0700, [EMAIL PROTECTED] wrote:

| rsync already has a memory-hogging issue.  Imagine having it search your 
| entire directory tree, checksumming all files, storing and sending them 
| all, comparing both lists looking for matching date/time/checksums to 
| guess where you've moved files to.  You'd be better off to use a wrapper 
| around the tools you move files with, keeping a replayable log, and have 
| your mirrors retrieve and replay that log, before doing the rsync.

I don't think so.  I would like to have that kind of smart capability be
fully integrated into a useful tool.  And rsync already has most of the
pieces such a thing would need in place.  I am NOT suggesting that it be
the default.  As you say, it would be memory hogging.  But it is already
memory hogging now, and adding a checksum for every file in the tree would
be 32 bytes more per file.

In some cases I definitely want LESS memory hogging, such as replicating
trees of millions of files.  In other cases I do want the checksumming to
get FEWER files redundantly transferred.

What I have done in the past to accomplish it is to build a tar file of the
entire tree on both sides, then sync the tar files making sure the rsync
blocksize matches correctly.  That still takes a lot of time because rsync
is sending a LOT of checksums for small blocks.  If I could get tar to build
the tar file with the files on very large block boundaries, then I could
specify a larger blocksize to rsync and do the transfer much faster.  But
it would make just as much sense to just send a checksum per file, and, in
cases where a whole file checksum matches (though at a different name on
the destination) to copy, hardlink, or move (as appropriate) the file to
the new location.
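The tar-on-both-sides trick described above could look like this (host and
paths are hypothetical; the rsync invocation is shown commented since it needs
a real mirror to talk to):

```shell
# Sketch of the tar-on-both-sides trick: each side builds a tar of its
# tree, then the two tar files are rsynced with an explicit blocksize.
# Small blocks mean many checksums on the wire; a larger -B value cuts
# that down, at the cost of coarser matching.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p tree && echo data > tree/file
tar -cf local.tar tree
# against the mirror's copy of the same tree, one would then run e.g.:
# rsync -B 8192 local.tar remotehost:/backup/local.tar
```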

Inventing a whole new tool to do this when rsync has most of the logic of
it in place is absurd.  I just don't understand the actual rsync internals
or protocol enough to accomplish such a patch myself, so my only option is
to offer the suggestion and hope someone likes it.  Again, I am not
suggesting that it be the default option, so it would not impact anyone
unless they wanted it to.

-- 
-
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| [EMAIL PROTECTED] | Texas, USA | http://phil.ipal.org/ |
-




Re: Why does one of there work and the other doesn't

2001-12-02 Thread Martin Pool

On 30 Nov 2001, Randy Kramer [EMAIL PROTECTED] wrote:

 I am not sure which end the 100 bytes per file applies to, and I guess
 that is the RAM memory footprint.  Does rsync need 100 bytes for each
 file that might be transferred during a session (all files in the
 specified directory(ies)), or does it need only 100 bytes as it does one
 file at a time?

At the moment that is 100B for all files to be transferred in the
whole session.  This is a big limit to scalability at the moment, and
a goal of mine is to reduce it to at most holding file information
from a single directory in memory.

--
Martin




Re: Why does one of there work and the other doesn't

2001-12-01 Thread Phil Howard

On Fri, Nov 30, 2001 at 07:42:17AM -0700, [EMAIL PROTECTED] wrote:

| from man rsync:
|  a trailing slash on the  source  changes  this  behavior  to
|  transfer all files from the directory src/bar on the machine
|  foo into the /data/tmp/.  A trailing  /  on  a  source  name
|  means  copy  the  contents  of  this directory.  Without a
|  trailing slash it means copy the directory.  This  differ-
|  ence  becomes particularly important when using the --delete
|  option.
| Wonderful things, those manuals.  Warning:  in my experience, this gives 
| unpredictable results.  It does NOT, in fact, always detect all the 
| content of the directory, and as a result, a --delete can have 
| catastrophic consequences.  I have not had time to try to figure out why 
| this happens, but my few tests aren't even repeatable... if there are more 
| than maybe 10 entries in the directory, something is always left out, but 
| rarely the same thing twice.  Needless to say, I never use that syntax.

If the source is a file and the destination is a file, or non-existent,
then you get a straight replication.  However, if the destination is a
directory, it puts the file _into_ the directory.  And this happens even
if the source is a directory (i.e. the source directory goes _into_ the
destination directory).  This is classic UNIX behaviour, and from that I
presume correct for rsync.  However, this behaviour (be it in rsync or
anywhere else, such as cp) is a big pitfall.  On a local machine, it's
easy enough to test the target before executing the command.  On a remote
it's somewhat more cumbersome.

I have found that for rsync, when I want to replicate a directory from one
machine to another and want to be certain that I am not putting one into
the other, but instead making one become the other (i.e. treated as peers),
the syntax of putting /. at the end of both source and destination
does the trick.  Whatever is _in_ the source goes _into_ the destination.
It does require that the destination exist in advance, else the reference to
/. in the destination will fail (and thus so will the transfer).  But at
least I can do ssh to make the directory first (which might fail if the
destination is already a file, but I don't have to worry about getting a
reliable status from ssh concerning that because rsync will subsequently
fail in that case, too).  The end result is I get expected results or I
get a failure, but I don't get unexpected results (like filling up a disk
because files went in the wrong place, or deleting unintended files).
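The /. idiom rests on path resolution, not on any one tool, so it can be
demonstrated locally with cp standing in for rsync (scratch paths only):

```shell
# The "/." idiom: whatever is IN the source goes INTO the destination,
# no matter how many times the command is run.  Shown here with GNU cp
# as a stand-in; the same shape works for rsync source/destination args.
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p src dst                 # destination must already exist
echo hello > src/file
cp -a src/. dst/.                # dst gets "file", never "dst/src/file"
cp -a src/. dst/.                # second run: still just dst/file
```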

Perhaps a trailing / instead of a trailing /. is supposed to work.  I do
not remember why I didn't start using it, but I am sure I would have tried
it, so maybe I encountered that problem.  But /. on the end works for me
and is what I have been using in all my backup scripts.

-- 
-
| Phil Howard - KA9WGN |   Dallas   | http://linuxhomepage.com/ |
| [EMAIL PROTECTED] | Texas, USA | http://phil.ipal.org/ |
-




Re: Why does one of there work and the other doesn't

2001-11-30 Thread Randy Kramer

Martin Pool wrote:
 Ian Kettleborough [EMAIL PROTECTED] wrote:
  1. How much memory does each file to be copied need. Obviously I have too many
  files.
 
 Hard to say exactly.  On the order of a hundred bytes per file.

I may have misunderstood the question, but maybe we should point out
that, on the receiving end, each file needs at least an amount of *disk
space* equal in size to the file as a new file is constructed before the
old file is deleted.  

I am not sure which end the 100 bytes per file applies to, and I guess
that is the RAM memory footprint.  Does rsync need 100 bytes for each
file that might be transferred during a session (all files in the
specified directory(ies)), or does it need only 100 bytes as it does one
file at a time?

Trying to learn, also,
Randy Kramer




RE: Why does one of there work and the other doesn't

2001-11-30 Thread David Bolen

From: Randy Kramer [mailto:[EMAIL PROTECTED]]

 I am not sure which end the 100 bytes per file applies to, and I guess
 that is the RAM memory footprint.  Does rsync need 100 bytes for each
 file that might be transferred during a session (all files in the
 specified directory(ies)), or does it need only 100 bytes as it does one
 file at a time?

Yes, the ~100 bytes is in RAM - I think a key point though is that the
storage to hold the file list grows exponentially (doubling each
time), so if you have a lot of files in the worst case you can use
almost twice as much memory as needed.

Here's an analysis I posted to the list a while back that I think is
still probably valid for the current versions of rsync - a later followup
noted that it didn't include an ~28 byte structure for each entry in
the include/exclude list:

  - - - - - - - - - - - - - - - - - - - - - - - - -

 (a) How much memory, in bytes/file, does rsync allocate?

This is only based on my informal code peeks in the past, so take it
with a grain of salt - I don't know if anyone has done a more formal
memory analysis.

I believe that the major driving factors in memory usage that I can
see is:

1. The per-file overhead in the filelist for each file in the system.
   The memory is kept for all files for the life of the rsync process.

   I believe this is 56 bytes per file (it's a file_list structure),
   but a critical point is that it is allocated initially for 1000
   files, but then grows exponentially (doubling).  So the space will
   grow as 1000, 2000, 4000, 8000 etc.. until it has enough room for
   the files necessary.  This means you might, worst case, have just
   about twice as much memory as necessary, but it reduces the
   reallocation calls quite a bit.  At ~56K per 1000 files, if you've
   got a file system with 10,000 files in it, you'll allocate room for
   16000 and use up 896K.

   This growth pattern seems to occur on both sender and receiver of
   any given file list (e.g., I don't see a transfer of the total
   count over the wire used to optimize the allocation on the receiver).

2. The per-block overhead for the checksums for each file as it is 
   processed.  This memory exists only for the duration of one file.
   
   This is 32 bytes per block (a sum_buf), allocated as one memory chunk.
   This exists on the receiver as it is computed and transmitted, and
   on the sender as it receives it and uses it to match against the
   new file.

3. The match tables built to determine the delta between the original
   file and the new file.
  
   I haven't looked closely at this section of code, but I believe
   we're basically talking about the hash table, which is going to be
   a one time (during rsync execution) 256K for the tag table and then
   8 (or maybe 6 if your compiler doesn't pad the target struct) bytes
   per block of the file being worked on, which only exists for the
   duration of the file.
   
   This only occurs on the sender.

There is also some fixed space for various things - I think the
largest of which is up to 256K for the buffer used to map files.
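Under the doubling-from-1000 growth described in point 1, the allocated
file-list size for a given file count can be estimated with a few lines of
shell (a sketch using the ~56-byte-per-entry figure from this analysis; the
function name is hypothetical):

```shell
# Estimate the file-list allocation for N files: capacity starts at
# 1000 slots and doubles until it can hold N, at ~56 bytes per slot.
flist_bytes() {
    n=$1
    slots=1000
    while [ "$slots" -lt "$n" ]; do
        slots=$((slots * 2))
    done
    echo $((slots * 56))
}
flist_bytes 10000        # 16000 slots * 56 bytes = 896000
```

This is the worst-case "almost twice as much memory as needed" effect: 10,000
files end up with room for 16,000.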

 (b) Is this the same for the rsyncs on both ends, or is there
 some asymmetry there?

There's asymmetry.  Both sides need the memory to handle the lists of
files involved.  But while the receiver just constructs the checksums
and sends them, and then waits for instructions on how to build the
new file (either new data or pulling from the old file), the sender
also constructs the hash of those checksums to use while walking
through the new file.

So in general on any given transfer, I think the sender will end up
using a bit more memory.

 (c) Does it matter whether pushing or pulling?

Yes, inasmuch as the asymmetry is based on who is sending and who is
receiving a given file.  It doesn't matter who initiates the contact,
but the direction that the files are flowing.  This is due to the
algorithm (the sender is the component that has to construct the
mapping from the new file using portions of the old file as
transmitted by the receiver).

  - - - - - - - - - - - - - - - - - - - - - - - - -


-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




Re: Why does one of there work and the other doesn't

2001-11-29 Thread Martin Pool

On 29 Nov 2001, Ian Kettleborough [EMAIL PROTECTED] wrote:

 1. How much memory does each file to be copied need. Obviously I have too many
 files.

Hard to say exactly.  On the order of a hundred bytes per file.

 2. Why does this command work:
 
   rsync -ax /usr/xx /backup/usr/
 
 
   when:
 
   rsync -ax /usr/xx/ /backup/usr/ 
 
   refuses to create the directory xx in /backup/usr and copies
   the contents of the directory to /backup

Actually that's a feature not a bug:

  /usr/xx means the directory xx so it creates /backup/usr/xx

  /usr/xx/ means the contents of xx so it copies the contents
  directly into /backup/usr/ without creating an xx destination
  directory. 

Just use whichever one is appropriate.

-- 
Martin