RE: Patch to avoid 'Connection reset by peer' error for rsync on cygwin

2002-04-26 Thread David Bolen

Max Bowsher [[EMAIL PROTECTED]] writes:

 I thought that shutdown acts as below:
(no data loss)
 SUSv2 is annoyingly somewhat vague on the specifics.

So are the FreeBSD/Linux man pages.  They don't specifically indicate
truncation or flushing of data, although I don't recall ever thinking
of shutdown() as truncating.

To me, what Microsoft got wrong in WinSock is _requiring_ shutdown()
to reliably close a socket while ensuring all final data gets
delivered.  It certainly caught me by surprise (and annoyance) at one
point.  I rarely ever used it in a Unix environment, but I do think it
should be safe.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: memory requirements was RE: out of memory in build_hash_table

2002-04-24 Thread David Bolen

Granzow, Doug (NCI) [[EMAIL PROTECTED]] writes:

 From what I've observed by running top while rsync is running, its memory
 usage appears to grow gradually, not exponentially.

The exponential portion of the growth is up front when rsync gathers
the file listing (it starts with room for 1000 files, then doubles
that to 2000, 4000, etc...).  So if your rsync has started
transferring at least the first file, it's already done whatever
exponential growth it's going to do.  After that, yes, it's far more
gradual and should, I think, settle down, since most of the rest of the
memory allocation is on a per-file basis and not saved once the
individual file is done.
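
To illustrate the shape of it (a sketch only, not the actual rsync
code - the names here are made up):

    #include <stdlib.h>

    struct file_entry;                  /* stands in for rsync's ~56-byte
                                           per-file meta-data structure */

    static struct file_entry **flist = NULL;
    static size_t allocated = 0, used = 0;

    /* Growth pattern: start with room for 1000 entries, then double
     * (2000, 4000, 8000, ...) whenever the array fills up.  All of that
     * "exponential" reallocation happens while the file list is being
     * built, before the first file is transferred. */
    static void flist_add(struct file_entry *f)
    {
        if (used == allocated) {
            allocated = allocated ? allocated * 2 : 1000;
            flist = realloc(flist, allocated * sizeof *flist);
            if (!flist)
                abort();                /* out of memory */
        }
        flist[used++] = f;
    }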

 I saw someone on this list recently mentioned changing the block size.
The
 rsync man page indicates this defaults to 700 (bytes?).  Would a larger
 block size reduce memory usage, since there will be fewer (but larger)
 blocks of data, and therefore fewer checksums to store?

Yep, although I think that is a reasonably small amount of memory
(something like 8 bytes per block to hold the checksums) and only
holds the checksums for a single file at a time.  But in addition to
saving memory on larger files, this can also improve performance because
there's less computation to be done as well as less block matching.
The downside is potentially more data transferred.
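
Just to put rough numbers on that (back-of-the-envelope only - the exact
per-block sizes vary by rsync version and protocol):

    /* Checksum meta-data generated for one file at a given block size. */
    static long checksum_bytes(long file_size, long block_size,
                               long bytes_per_block)
    {
        long blocks = (file_size + block_size - 1) / block_size;
        return blocks * bytes_per_block;
    }

    /* e.g. at ~8 bytes/block, a 1 GB file costs roughly 12 MB of checksum
     * data with 700-byte blocks, but only about 512 KB with 16 KB blocks. */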

 You suggested setting ARENA_SIZE to 0... I guess this would be done like
 this?
 
 % ARENA_SIZE=0 ./configure

I don't know if the configure script looks in the environment or not,
but my guess would be no.  (Took a quick peek and it doesn't look like
that's something munged by configure at all).

If you wanted to try it, I'd just edit rsync.h and comment out its
current definition in favor of one defining it to 0 - e.g.:

/* #define ARENA_SIZE  (32 * 1024) */
#define ARENA_SIZE 0

The arena handling seems to be reasonably tight, so it's probably a
long shot in any event.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: On Windows OS, is there any advantage for completing rsync using MSVC instead of gcc/cygwin ?

2002-04-22 Thread David Bolen

Diburim [[EMAIL PROTECTED]] writes:

(Quoted from the subject line - Diburim, it's best to keep the subject
line short and put your question in the body of the e-mail.  Subject 
lines are often truncated for display purposes and it can make it more
difficult to see your question)

 On Windows OS, is there any advantage for completing rsync using
 MSVC instead of gcc/cygwin ?

Not only is there no advantage, but it won't work - I'm guessing you
haven't actually tried, right?  :-)

rsync is designed to run in a Unix environment, and makes extensive
use of Unix system calls and facilities (being able to fork() a child
process for example).  The native Windows environment isn't compatible
with the Unix system API.  That's what cygwin brings to the table, the
entire Unix emulation layer, and it's crucial to the ability of rsync
to work under Windows at all.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: memory requirements was RE: out of memory in build_hash_table

2002-04-22 Thread David Bolen

Granzow, Doug (NCI) [[EMAIL PROTECTED]] writes:

 Hmm... I have a filesystem that contains 3,098,119 files.  That's
 3,098,119 * 56 bytes or 173,494,664 bytes (about 165 MB).  Allowing
 for the exponential resizing we end up with space for 4,096,000
 files * 56 bytes = 218 MB.  But 'top' tells me the rsync running on
 this filesystem is taking up 646 MB, about 3 times what it should.

 Are there other factors that affect how much memory rsync takes up?
 I only ask because I would certainly prefer it used 218 MB instead
 of 646. :)

Hmm, yes - I only mentioned the per-file meta-data overhead since
that's the only memory user in the original note's case, which was
failing before it actually got the file list transferred, and it
hadn't yet started computing any checksums.  But there are definitely
some other dynamic memory chunks.  However, in general the per-file
meta-data ought to be the major contributor to memory usage.

I've attached an old e-mail of mine from when I did some examination of
memory usage for an older version of rsync (2.4.3), which I think is
still fairly valid.  I don't think it'll explain your significantly
larger-than-expected usage, though.  (A followup note corrected the first
paragraph, as rsync doesn't create any tree structures.)

Two possibilities I can think of have to do with the fact that the
per-file overhead is handled by 'realloc'ing the space as it grows.
It's possible that the sequence of events is such that some other
allocation is being done in the midst of that growth which forces the
next realloc to actually move the memory to gain more space, thus
leaving a hole of unused memory that just takes up process space.

Or, it's also possible that the underlying allocation library (e.g.,
the system malloc()) is itself performing some exponential rounding up
in order to help prevent just such movement.  I know that AIX used to
do that, and even provided an environment variable way to revert to
older behavior.

What you might try doing is observing the process growth during the
directory scanning phase and see how much memory actually gets used to
that point in time - gauged either by observing client/server traffic
for when the file list starts getting transmitted, or by
enabling/adding some debugging output to rsync.

I just peeked at the latest sources in CVS, and it looks like around
version 2.4.6 the file list processing added some of its own
micro-management of memory for small strings, so there's something
else going on there too - in theory to help avoid the platform-specific
growth mentioned in my previous paragraph.  So if you're using 2.4.6, you might
try a later version to see if it improves things.  Or if you're using
a later version you might try rebuilding with ARENA_SIZE set to 0 to
disable this code to see if your native platform handles it better
somehow.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

  - - - - - - - - - - - - - - - - - - - - - - - - -

From: David Bolen [EMAIL PROTECTED]
To: 'Lenny Foner' [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: RE: The out of memory problem with large numbers of files
Date: Thu, 25 Jan 2001 13:25:43 -0500

Lenny Foner [[EMAIL PROTECTED]] writes:

 While we're discussing memory issues, could someone provide a simple
 answer to the following three questions?

Well, as with any dynamic system, I'm not sure there's a totally
simple answer to the overall allocation, as the tree structure created
on the sender side can depend on the files involved and thus the total
memory demands are themselves dynamic.

 (a) How much memory, in bytes/file, does rsync allocate?

This is only based on my informal code peeks in the past, so take it
with a grain of salt - I don't know if anyone has done a more formal
memory analysis.

I believe that the major driving factors in memory usage that I can
see are:

1. The per-file overhead in the filelist for each file in the system.
   The memory is kept for all files for the life of the rsync process.

   I believe this is 56 bytes per file (it's a file_list structure),
   but a critical point is that it is allocated initially for 1000
   files, but then grows exponentially (doubling).  So the space will
   grow as 1000, 2000, 4000, 8000 etc.. until it has enough room for
   the files necessary.  This means you might, worst case, have just
   about twice as much memory as necessary, but it reduces the
   reallocation calls quite a bit.  At ~56K per 1000 files, if you've
   got a file system with 10,000 files in it, you'll allocate room for
   16000 and use up 896K.

   This growth pattern seems to occur on both sender and receiver of
   any given file list (e.g., I don't see a transfer of the total
   count over the wire used to optimize the allocation on the receiver).

RE: Future RSYNC enhancement/improvement suggestions

2002-04-22 Thread David Bolen

(I wrote about large files taking 20-30 min to checksum without network
traffic)

Jason Haar [[EMAIL PROTECTED]] writes:

 ...But then you should have a dialup timeout of 1 hour set?

Oh of course - I was more responding to Martin's comment about there
being enough traffic present in general during an rsync session, since
there are cases when you can have lengthy periods without traffic at
all.

I could also see some NAT boxes holding a particular stream for far
less than an hour by default, but I don't have a particular data point
for that, so perhaps I'm just being too conservative.

 I think the problem is that you're morally upset that rsync spends so
 much time sending no network traffic. Quite understandable ;-)

Not sure about morally, but definitely financially :-)

 What about separating the tree into subtrees and rsyncing them? That
 means you go from:

 1 dialup connection started [quick]
 2 rsync generates checksums (no network traffic) [slow]
 3 rsync transmits files 

Perhaps you misunderstood - the checksum generation time that was
taking so long was on a *single* file level.  Rsync had already
exchanged file lists and chosen the files to transfer - it was working
on a single file and generating the block checksums on the receiver
side to send over to the sender side.

(As it turns out the transfers in question were for a single directory
normally comprised of two files - a database file and its transaction
log)

The real rub was that after spending 20+ minutes with an idle line
computing the checksum, it would then take another 30+ minutes to
transmit the checksum information over.  So it was (and likely still
is) a case where sending the data as computed would have been a major
win.  At least for slow connections, the checksum computation is
unlikely to be the bottleneck versus network transmission, so leaving
the network idle is totally wasted time that could be fully reclaimed.

I may still look into that sort of change but just haven't had the
cycles yet with the decrease in our checksum time - although this
particular discussion has sort of started me thinking about it again.
I may review our current logs to see how much time is being wasted.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: Future RSYNC enhancement/improvement suggestions

2002-04-22 Thread David Bolen

Martin Pool [[EMAIL PROTECTED]] writes:

 I guess alternatively you could set the rsync timeout high, the
 line-drop timeout low, and make it dial on demand.  That would let the
 line drop when rsync was really thinking hard, and it would come back
 up as necessary.  Losing the ppp channel does not by itself interrupt
 any tcp sessions running across it, provided that you can recover the
 same ip address next time you connect.

That assumes an environment where dial-on-demand is feasible.
Unfortunately, our particular setup is a direct PC to PC dial, and
there's no IP involved (it's Windows-Windows with NETBIOS/NETBEUI)
so disconnecting would shut down the remote rsync.

But it's an interesting thought for cases where it could get used.  In
general I'd expect it to be fairly fragile though unless you had
complete control of the dial infrastructure or could otherwise ensure,
as you note, identical IP address assignment.

I don't suppose anyone knows any legacy reason why all the checksums
are computed and stored in memory before transmission, do they?  I
don't think at the time I could find any real requirement in the code
that it be done that way - the sequence was pretty much
generate/send/free.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: Future RSYNC enhancement/improvement suggestions

2002-04-22 Thread David Bolen

Martin Pool [[EMAIL PROTECTED]] writes:

 No, I think you could avoid it, and also avoid the up-front traversal
 of the tree, and possibly even do this while retaining some degree of
 wire compatibility.  It will be a fair bit of work.

Yeah, I was sort of thinking bang for the buck - munging with the file
list handling reaches into far more code and would likely be far more
effort to change within the current rsync source than the checksum
transmission.  I think the checksum would just be moving the
equivalent of send_sums right into generate_sums and only touching the
single generate.c module, with no noticeable difference on the wire or
to other modules.
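
In pseudo-code terms (purely a sketch - the helper names below are
stand-ins for rsync's own checksum and I/O routines, not its real API),
the change amounts to sending each block's sums as they're computed
instead of accumulating them all first:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Stand-ins for the real weak/strong checksum and socket-write calls. */
    extern uint32_t weak_checksum(const char *buf, int len);
    extern void strong_checksum(const char *buf, int len, char sum[16]);
    extern void send_int(int sock, uint32_t v);
    extern void send_buf(int sock, const char *buf, int len);

    void generate_and_send_sums(FILE *in, int block_len, int sock)
    {
        char *buf = malloc(block_len);
        char sum2[16];
        size_t n;

        /* One pass: checksum a block, put it on the wire, move on.  The
         * line stays busy while the disk/CPU work proceeds. */
        while ((n = fread(buf, 1, block_len, in)) > 0) {
            send_int(sock, weak_checksum(buf, (int)n));
            strong_checksum(buf, (int)n, sum2);
            send_buf(sock, sum2, sizeof sum2);
        }
        free(buf);
    }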

I did go back and take a current look at our transfers for the
one task for which this could make the most difference.  For the
~110GB of data we synchronize each month (over V.34 dialup lines :-)),
the wasted time with our current network/filesystem looks to be in
aggregate only about 7.5 hours of phone time, which in turn is only
about 1.6% of the ~480 hours used each month.  So it's hard to worry
extensively about that 1.6%.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: mixed case file systems.

2002-04-19 Thread David Bolen

Martin Pool [[EMAIL PROTECTED]] writes:

 On 18 Apr 2002, David Bolen [EMAIL PROTECTED] wrote:
  A few caveats - both ends have to support the option - I couldn't make
  it backwards compatible because both ends exchange information about a
  sorted file list that has to sort the same way on either side (which
  very subtly bit me when I first did this).
 
 I was just going to say that :-)

Heh .. and wow, is it confusing if you mess that up.  Randomly
transferring files that it shouldn't be, but even better, putting the
contents of one file into another silently.  It seems to me that it
would have been better to have the side generating the list control
the sequence and the receiving side simply obey it as transmitted, but
that's neither here nor there at this point.

The issue with the new command line option was a general issue of
versioning command line options - since they get transmitted,
obviously, on the command line, it's prior to any option negotiation.
So I couldn't figure out any clean way to negotiate away from the
ignore case if the remote side didn't support it.  Originally I wanted
it to default to case-insensitive under Windows, but that was guaranteed
to break older versions, so I went back to an explicit option in all
cases.  But that seems to be a general issue with evolving options.

Actually, it was this issue that also led me to add a small bit of
code to io.c so that on an unexpected tag, it would dump any pending
data (as ASCII if printable, hex otherwise), since without that you
never got any of the remote command line parsing errors shown.  But
there are problems with that too since sometimes you may have a bunch
of data in the stream on a real protocol failure.

 I'll put this into the patches/ repository.  I'd like to study the 
 problem a bit more and see if there isn't a better solution before
 we merge it.  Perhaps something like the --fuzzy patch will make it
 detect them as renames.

No problem - aside from the options processing (which is also the bulk
of the patch), the patch does have the property that it's very simple;
one comparison routine change and one new flag supplied to the
existing fnmatch library module which already supported
case-insensitivity as an option.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: Future RSYNC enhancement/improvement suggestions

2002-04-19 Thread David Bolen

Jan Rafaj [[EMAIL PROTECTED]] writes:

   How about adding a feature to keep the checksums in a berkeley-style
   database somewhere on the HDD separately, and with subsequent
   mirroring attempts, look to it just for the checksums, so that
   the rsync does not need to do checksumming of whole target
   (already mirrored) file tree ?

There's a chicken and egg issue with this - how do you know that the
separately stored checksum accurately reflects the file which it
represents?  Once they are stored separately they can get out of sync.
The natural way to verify the checksum would be to recompute it, but
then you're sort of back to square one.  I know there have been
discussions about this sort of thing on the list in the past.

For multiple similar distributions, the rsync+ work (recently
incorporated into the mainline rsync in experimental mode - the
write-batch and read-batch options) helps remove repeated computations
of the checksums and deltas, but it's not a generalized system for any
random transfer.

I've wanted similar benefits because we use dialup to remote locations
and for databases with hundreds of MB or 1-2 GB, we end up wasting a
bit of phone time when both sides are just computing checksums.  But
I'm not sure of a good generalized solution.  There may be platform
specific hacks (e.g., under NT, storing the computed checksum in a
separate stream in the file, so it's guaranteed to be associated with
the file), but I don't know of a portable way to link meta information
with filesystem files.

Note that if you aren't already, be sure that you up the default
blocksize for large files - that can cut down significantly on both
checksum computation time as well as meta data transferred over the
session, since there are fewer blocks that need two checksums (weak +
MD4) apiece.

 - make output of error & status messages from rsync uniform,
   so that it could be easily parsed by scripts (it is not right
   now - rsync 2.5.5)

I know Martin has expressed some interest to the list in having something
like this in the future as an option.

 - perhaps if the network connection between rsync client and server
   stalls for some reason, implement something like 'tcp keepalive'
   feature ?

I think rsync is pretty complicated at the network level already - it
seems reasonable to me that rsync ought to be able to assume that the
lowest level network protocol stack will get the data to the other end
and/or give an error if something goes wrong without needing a lot of
babysitting.

In all but the rsync server cases, rsync doesn't control the network
stream itself anyway (it just has a child process using ssh, rsh or
anything else), so it becomes a question for that particular utility
and not something rsync can do anything about.

In the rsync server case, it already sets the TCP KEEPALIVE option at
the socket level when it receives a connection.
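
(For reference, the call involved is the standard one - something along
these lines, though see rsync's own socket code for the real details:)

    #include <sys/socket.h>

    /* Enable TCP keepalive probes on an accepted connection. */
    int enable_keepalive(int sockfd)
    {
        int one = 1;
        return setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE,
                          &one, sizeof one);
    }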

If your network transport between systems is problematic, there's a
limited amount of stuff rsync can do about it.  Oh and no, just being
idle on a session shouldn't terminate it, no matter how long rsync
takes to compute checksums.  So if that's happening to you, you might
want to investigate your network connectivity.  Or perhaps you're
going through a NAT or some sort of proxy box that places a timeout on
TCP sessions that you can increase?

Upon failures, if you use --partial and a separate destination
directory you can keep re-trying and slowly get the whole file across
(that's how we do our backups) but you do still need to recompute
checksums each time.  It might be nice to see if rsync itself could
have a retry mechanism that would re-use the existing checksum
information it had computed previously.  I have a feeling with the
structure of the code at this point though that doing so would be
reasonably complicated.

The caveat to --partial is that once you have a partial file, even
with --compare-dest, that partial file is all rsync considers for the
remaining portion of the transfer.  So originally for our database
backups, I was removing any partial copy manually if it was less than
some fraction of the previous copy I already had, since I'd lose less
time rebuilding that fraction than losing access to the entire prior
file.

In response to that, there was another internal-use patch I made to
rsync to --partial-pad any partial file with data from the original
file on the destination system during an error.  No guarantees it
would work as well, since I just took data from the original file past
the size point of the partial copy, but in many cases (growing files)
it's a big win.  If anyone is interested, I could extract it and post
it.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150

RE: out of memory in build_hash_table

2002-04-19 Thread David Bolen

Eric Echter [[EMAIL PROTECTED]] writes:

 I recently installed rsync 2.5.5 on both my rsync server and client being
 used.  I installed the latest version because I was having problems with
 rsync stalling with version 2.4.6 (I read that 2.5.5 was supposed to clear
 this up or at least give more appropriate errors).  I am still having
 problems with rsync stalling even after upgrading to 2.5.5.  It only
stalls
 in the /home tree and there are approximately 385,000 (38.5 MB of memory
 give or take if the 100 bytes/file still pertains) files in that tree.

The key growth factor for the file construction is going to be
per-file information that's about 56 bytes (last I looked) per file.
However, the kicker is that the file storage block gets resized
exponentially as you have more files.  So for 385,000 files it'll
actually be a 512,000 file block of about 30MB.  (So yeah, I suppose
an ~50 byte file chunk in memory growing as a power of 2 might average
out close to 100 bytes/file as an estimate :-))

 ERROR: out of memory in build_hash_table
 rsync error: error allocating core memory buffers (code 22) at util.c(232)

Seems like that's just a real out of memory error.  You'll only get that
error if a malloc() call returned NULL.

I presume there's still enough virtual memory available on the server
at the point when this fails?  Could you be running into a process limit 
on virtual memory?  What's a ulimit -a show for a server process?  
I think under Linux the default settings are in /etc/security/limits.conf,
maybe by default processes on the server are limited to 32MB of memory 
or something?
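
If you want to check from code rather than the shell, a few lines with
getrlimit() will show the ceilings a process actually runs under (this
is just a quick standalone probe, nothing rsync-specific):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit r;

        /* RLIM_INFINITY means "unlimited". */
        if (getrlimit(RLIMIT_DATA, &r) == 0)
            printf("data seg limit: cur=%lld max=%lld\n",
                   (long long)r.rlim_cur, (long long)r.rlim_max);
        if (getrlimit(RLIMIT_AS, &r) == 0)
            printf("address space:  cur=%lld max=%lld\n",
                   (long long)r.rlim_cur, (long long)r.rlim_max);
        return 0;
    }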

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: mixed case file systems.

2002-04-19 Thread David Bolen

Peter Tattam [[EMAIL PROTECTED]] writes:

 Given the interoperability problems between versions and the risk of
 data loss, I think I will have to wait till this option is in the
 mainstream.  My alternative workaround is to write a utility to
 rename all files on the errant file system to be all lower case.

Of course it's entirely up to you to try it or not.  But just so you
don't misunderstand my amusing developer anecdote ... any data loss
occurred during the development of the patch.  The patch as it stands
now definitely works properly, and we've used it plenty of times
successfully.

(The interoperability between versions is only true if you use the new
option, and that's the same as any of the other options that have been
added over time - it's a generic problem for rsync command line option
evolution)

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: out of memory in build_hash_table

2002-04-19 Thread David Bolen

Eric Echter [[EMAIL PROTECTED]] writes:

 I also checked the /etc/security/limits.conf file and everything in
 the file is commented out.  Are there default limits if there are no
 actual settings in this file that may be causing problems?  Your
 assumption about the memory limit on processes sounds correct, but I
 can't find any reasoning for this from the system settings.  Thanks
 a bunch for the response.

I'm not that familiar with Linux defaults, but your ulimit -a should
reflect what the process actually has and it certainly looks good.  
That assumes that you are running rsync on the server under root 
(either via the rsh/ssh path, or as a daemon).  If you're running it 
as some other user, you should ensure you check the ulimit -a under 
a process running as that user.  Perhaps /etc/profile makes adjustments?

If not, you might watch virtual memory stats while running the failing
operation (e.g., have a window using vmstat or top running on the server
when you try the copy) to see if there's anything amiss looking at the
overall server level - or if perhaps rsync is somehow burning up much
more memory than we're estimating.

Beyond that though, I suppose perhaps a linux-oriented group would offer 
further suggestions, under the assumption that something must be leading 
to the malloc() failing.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: mixed case file systems.

2002-04-18 Thread David Bolen

Peter Tattam [[EMAIL PROTECTED]] writes:

 I believe a suitable workaround would be to ignore case for file names
 when the rsync process is undertaken.  Is this facility available or
 planned in the near future?

I've attached a context diff for some changes I made to our local copy
a while back to add an --ignore-case option just for this purpose.
In our case it came up in the context of disting between NTFS and FAT
remote systems.  I think we ended up not needing it, but it does make
rsync match filenames in a case insensitive manner, so it might at
least be worth trying to see if it resolves your issue.

A few caveats - both ends have to support the option - I couldn't make
it backwards compatible because both ends exchange information about a
sorted file list that has to sort the same way on either side (which
very subtly bit me when I first did this).  I also didn't bump the
protocol in this patch (wasn't quite sure it was appropriate just for an
incompatible command line option) since it was for local use.

The patch is based on a 2.4.x series rsync, but if it doesn't apply
cleanly to 2.5.x, it should be simple enough to apply manually.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

  - - - - - - - - - - - - - - - - - - - - - - - - -

Index: options.c
===
RCS file: e:/binaries/cvs/ni/bin/rsync/options.c,v
retrieving revision 1.5
retrieving revision 1.7
diff -c -r1.5 -r1.7
*** options.c   2000/12/28 00:30:18 1.5
--- options.c   2001/06/20 19:25:24 1.7
***
*** 72,77 
--- 72,78 
  #else
  int modify_window=0;
  #endif /* _WIN32 */
+ int ignore_case=0;
  int modify_window_set=0;
  int delete_sent=0;
  
***
*** 162,167 
--- 164,170 
  	rprintf(F,"     --exclude-from=FILE     exclude patterns listed in FILE\n");
  	rprintf(F,"     --include=PATTERN       don't exclude files matching PATTERN\n");
  	rprintf(F,"     --include-from=FILE     don't exclude patterns listed in FILE\n");
+ 	rprintf(F,"     --ignore-case           ignore case when comparing filenames\n");
  	rprintf(F,"     --version               print version number\n");
  	rprintf(F,"     --daemon                run as a rsync daemon\n");
  	rprintf(F,"     --address               bind to the specified address\n");
***
*** 186,194 
  	OPT_PROGRESS, OPT_COPY_UNSAFE_LINKS, OPT_SAFE_LINKS, OPT_COMPARE_DEST,
  	OPT_LOG_FORMAT, OPT_PASSWORD_FILE, OPT_SIZE_ONLY, OPT_ADDRESS,
  	OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR,
! 	OPT_IGNORE_ERRORS, OPT_MODIFY_WINDOW, OPT_DELETE_SENT};
  
! static char *short_options = "oblLWHpguDCtcahvqrRIxnSe:B:T:zP";
  
  static struct option long_options[] = {
  	{"version",        0,  0,  OPT_VERSION},
--- 189,198 ----
  	OPT_PROGRESS, OPT_COPY_UNSAFE_LINKS, OPT_SAFE_LINKS, OPT_COMPARE_DEST,
  	OPT_LOG_FORMAT, OPT_PASSWORD_FILE, OPT_SIZE_ONLY, OPT_ADDRESS,
  	OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR,
! 	OPT_IGNORE_ERRORS, OPT_MODIFY_WINDOW, OPT_DELETE_SENT,
! 	OPT_IGNORE_CASE};
  
! static char *short_options = "oblLWHpguDCtcahvqrRIxnSe:B:T:zP";
  
  static struct option long_options[] = {
  	{"version",        0,  0,  OPT_VERSION},
***
*** 204,209 
--- 208,214 
  	{"exclude-from",    1,  0,  OPT_EXCLUDE_FROM},
  	{"include",         1,  0,  OPT_INCLUDE},
  	{"include-from",    1,  0,  OPT_INCLUDE_FROM},
+ 	{"ignore-case",     0,  0,  OPT_IGNORE_CASE},
  	{"rsync-path",      1,  0,  OPT_RSYNC_PATH},
  	{"password-file",   1,  0,  OPT_PASSWORD_FILE},
  	{"one-file-system", 0,  0,  'x'},
***
*** 401,406 
--- 406,415 
  		add_exclude_file(optarg, 1, 1);
  		break;
  
+ 	case OPT_IGNORE_CASE:
+ 		ignore_case=1;
+ 		break;
+ 
  	case OPT_COPY_UNSAFE_LINKS:
  		copy_unsafe_links=1;
  		break;
***
*** 712,717 
--- 727,736 
  		slprintf(mwindow, sizeof(mwindow), "--modify-window=%d",
  			 modify_window);
  		args[ac++] = mwindow;
+ 	}
+ 
+ 	if (ignore_case) {
+ 		args[ac++] = "--ignore-case";
  	}
  
if (keep_partial)
Index: exclude.c
===
RCS file: e:/binaries/cvs/ni/bin/rsync/exclude.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -c -r1.1.1.1 -r1.2
*** exclude.c   2000/05/30 18:08:19 1.1.1.1

RE: Non-determinism

2002-04-17 Thread David Bolen

Berend Tober [[EMAIL PROTECTED]] writes:

 That was my point about comparing rsync to sending the entire file 
 using say, ftp or cp. That is, one might think that sending the 
 entire file via ftp or cp will produce an exact file copy, however the
 actual transmission of the data takes the form of electrical signals 
 on a wire that must be detected at the receiving end. The detection 
 process must have some probability of false alarm/missed detection
 characteristic and so there must be some estimate of the probability 
 of ftp and cp failing to produce a reliable copy. So while the 
 software algorithms of ftp and cp are deterministic, there must be 
 some quantifiable probability of failure nonetheless. The difference 
 with rsync is that not only are the same effects of data corruption 
 at work as with ftp and cp, but the algorithm itself introduces non-
 determinism.

Except of course that rsync uses its own final checksum to balance out
its risk of incorrectly deciding a block is the same.  If the final
full-file checksum doesn't match, then rsync automatically restarts
the transfer (using a slightly different seed, I believe).

Thus, it's fairly accurate to compare rsync to performing an ftp or cp
and then doing a full checksum on the file, so one could argue it's
actually more reliable than a straight ftp/cp without the checksum.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: Non-determinism

2002-04-17 Thread David Bolen

Martin Pool [[EMAIL PROTECTED]] writes:

 To put it in simple language, the probability of an file transmission
 error being undetected by MD4 message digest is believed to be
 approximately one in one thousand million million million million
 million million.  

I think that's one duodecillion :-)

As cryptographic message-digest hashes, MD4 (and MD5) are designed so
that cracking a specific digest (finding a source that produces it)
takes on the order of 2^128 operations, and finding two messages that
share the same digest takes on the order of 2^64.  But even that isn't a
direct translation to the probability that two random input files
might hash to the same value.

There's an interesting thread from sci.crypt from late last year that
addressed this question somewhat:

http://groups.google.com/groups?threadm=u21i5llf2bpt03%40corp.supernews.com

For one of the examples there where the computation was followed
through (the odds of a collision when keeping all 128 bits of the hash
and running it against about 67 million files), the probability of a
collision came out to about 2^-77.  So I suppose you'd have to decide
what you want to declare your universe of files to be, since more
files increase the odds and fewer files decrease them.
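
(Back-of-the-envelope sanity check, assuming the digest behaves like a
random 128-bit function: the birthday approximation gives p ~ n^2 / 2^129
for n files, and n ~ 67 million is about 2^26, so p ~ 2^52 / 2^129 =
2^-77 - which lines up with the figure quoted above.)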

It's about at this point that I sit back and just say, that's one tiny
probability!

It is interesting that MD4 has been a cracked algorithm for a while
now, so if someone was explicitly trying to forge a file that would
fool it, it's very doable.  But I doubt that changes the odds on two
random files colliding.  MD5 has not yet had any duplication found
(and plenty of protocols currently assume there aren't any), but it's
far more computationally intensive to compute, so I think MD4 is more
than sufficient for rsync.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: how to take least risk on rsync dir

2002-04-16 Thread David Bolen

Patrick Hsieh [[EMAIL PROTECTED]] writes:

 When rsyncing dir_A to dir_B, I hope I won't make any change to the original
 dir_B unless the rsync procedure ends without errors; therefore, I hope
 there's something like

 rsync -av dir_A dir_B_tmp && \
 mv dir_B dir_B.bkup && \
 mv dir_B_tmp dir_B
 
 This small script can ensure the minimal change time between 2 versions
 of archive. Is this built in the native rsync function? Do I have to
 write scripts myself?

rsync's default behavior ensures this sort of minimal change time, but
only at a per file level.  That is, each file is actually built as a
temporary copy and then only renamed on top of the original file as a
final step.  Of course, that's largely a requirement so rsync can use
the original file as a source for the new file, but it also serves to
preclude interruption of the original file as long as possible.
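
The underlying pattern is the usual write-to-a-temporary-then-rename
trick; a bare-bones sketch (illustrative only - rsync's receiver code is
considerably more involved):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Build the new contents in a temp file, then rename() it over the
     * original as the final step.  rename() is atomic on POSIX
     * filesystems, so readers see either the old file or the complete
     * new one, never a half-written file. */
    int replace_atomically(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmpXXXXXX", path);

        int fd = mkstemp(tmp);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || close(fd) != 0) {
            unlink(tmp);
            return -1;
        }
        return rename(tmp, path);
    }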

But if you want the same sort of assurances at something larger than a
file level (e.g., a directory as above), then yes, you need to impose
that on your own.  For example, when backing up databases (where I
need to keep the database backup and transaction logs in sync), I copy
them into a temporary directory and only overlay the primary
destination files when fully done.

The simplest way to do it is close to what you have, but there are a
few things you need to be aware of.

First, you'll want to use the rsync --compare-dest directory so that
it can still find the original files on the destination system for its
algorithm - otherwise it'll send over the entire contents of the
source files and not use what it can from the original files.

Second, you need to realize that by default rsync will only copy files
that have changed (by default based on size/timestamp unless you add
the -c (checksum) option).  So if you do what you have above you'll
end up losing files that hadn't changed since they won't exist in
dir_B_tmp.  You can override this with the -I option at the expense of
a small amount of extra data transferred for the unchanged files.

So you could do something like:

rsync -av -I --compare-dest=B_tmp_to_B_path dir_A dir_B_tmp

Note that the --compare-dest argument is a relative path to get from
the destination directory (dir_B_tmp in this case) back to the
original source directory.  rsync won't touch the source directory,
but it will use the files within it as masters for the new copy.

This will result in dir_B_tmp being a complete copy of dir_A, using
the original dir_B as a master whenever possible.

This all assumes that you're doing remote copies where the rsync
protocol makes sense (you don't show a remote system in your example).
If you're just making local copies, it would be better to use -W
instead, but you'd still need -I if you wanted files matching those in
the current dir_B to be transferred.  Then again, for a local setup
where you want to update the whole directory, a simple copy may be as
effective as rsync, since you're not benefitting from the selection of
a subset of files.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: is it a bug or a feature? re:time zone differences, laptops, and suggestion for a new option

2002-04-04 Thread David Bolen

Martin Pool [[EMAIL PROTECTED]] writes:

 Linux stores file times in UTC, and rsync transfers them in UTC.  I
 thought that NT and XP did too, but perhaps not, or perhaps there is a
 problem with Cygwin.  (...)

It depends on the filesystem under Windows.  NTFS uses UTC for
timestamps, but the FAT* variants use local time.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: File over 2GB using Cygwin

2002-02-22 Thread David Bolen

Martin Pool [[EMAIL PROTECTED]] writes:

 It could be an interesting project to try to build rsync under MSVC++.
 Presumably it can handle large files.  I don't think there's anything
 impossible in principle about it.

Not in principle, but unless you're also going to handle the same fork
emulation and Unix semantics that Cygwin is doing, it's unlikely to be
a weekend project :-)

In theory a native port might use threads and overlapping I/O very
effectively, but I think it would be fairly tough to do without some
significant changes to the existing code.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



RE: rsync dir in _both_ directions?

2002-02-06 Thread David Bolen

Jack McKinney [[EMAIL PROTECTED]] writes:

 If I add 512 bytes at the begining of the file, then I would expect
 it.  If I only add 14 bytes, then I don't think rsync will detect this,
 as it would require it to compute checksums starting at EVERY byte instead
 of 512 byte checksums at offsets 0, 512, 1024, 1536, et al.

Yep, and that's precisely what rsync does.  It actually uses two types
of checksums.  One is a fast rolling checksum that can be efficiently
computed with a block starting at _every_ byte in the file.  The nature
of the checksum is that you can compute its new value starting at byte
X+1, based on its old value from a block starting at X by only performing
a single computation based on the new byte at the end of the block
starting at X+1.  But the penalty you pay for the speed is that it's a
weaker checksum - different blocks can occasionally produce the same
value (false matches).  So there's a second, much stronger (but much
slower) checksum that is used to validate a match once the first
checksum thinks it has found one.

When you transmit a file, the receiver computes both checksums for each
block of the file it already has and sends them over.  The sender then
walks its copy of the file, taking block-size chunks _at every byte_ and
computing the weak/fast checksum.  If the weak one matches, it then does
the stronger checksum, and if that matches too, it knows it need not
send that block of data over the wire - just a reference to it.

This will match common blocks located anywhere within the file at any
offset (including re-using a source block multiple times to reproduce
the target).
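
For a feel of how the rolling part works, here's a small self-contained
sketch of the kind of weak checksum the rsync paper describes (not the
production code, and without the modular tricks it uses):

    #include <stdint.h>
    #include <stddef.h>

    /* s1 is the plain sum of the bytes in the window; s2 weights each
     * byte by its distance from the end of the window. */
    typedef struct { uint32_t s1, s2; } rolling_sum;

    static rolling_sum sum_block(const unsigned char *p, size_t len)
    {
        rolling_sum r = {0, 0};
        for (size_t i = 0; i < len; i++) {
            r.s1 += p[i];
            r.s2 += (uint32_t)(len - i) * p[i];
        }
        return r;
    }

    /* Slide the window one byte: drop 'out' (leaving the front), add
     * 'in' (entering at the back).  O(1) per position, which is what
     * makes checking for a match at every byte offset affordable. */
    static void roll(rolling_sum *r, unsigned char out, unsigned char in,
                     size_t len)
    {
        r->s1 = r->s1 - out + in;
        r->s2 = r->s2 - (uint32_t)len * out + r->s1;
    }

    static uint32_t weak_digest(rolling_sum r)
    {
        return (r.s1 & 0xffff) | (r.s2 << 16);
    }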

You might want to read the tech paper on rsync and its protocol, since
it goes into this in much more detail.  If all rsync did was match on
finite block boundaries, it would be _way_ less useful than it really
is.

It is an easy experiment.   (...)
 (...) I suspect that your xfer time will be comparable to
 the first one, not to the second.

Since it's an easy experiment - why suspect - did you try this?  It
should take virtually no time for the second (sans the initial
checksum computation and transmission, which to be fair for large
files and small block sizes can be quite significant).

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: efficient file appends

2001-12-12 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

 It seems to me that this situation is common enough that the rsync
 protocol should look for it as a special case. Once the protocol has
 determined from differing timestamps and/or lengths that a file needs
 to be synchronized, the receiver should return a hash (and length) of
 its copy of the entire file to the sender.  The sender then computes
 the hash for the corresponding leading segment of its copy. If they
 match, the sender simply sends the newly appended data and instructs
 the receiver to append it to its copy.

While potentially a useful option, you wouldn't want the protocol to
automatically always check for it, since it would preclude rsync on
the sending side from being able to use part of the original file when
transmitting the newly added data to the receiver.  While perhaps not
helpful for log files, it can be a big win for other files, even if
the current copy on the receiver matches the sender's initial portion.
So at best, you'd only want to enable this option if the entire set of
files in a given run consisted of files known to only grow by appending
this way.

Alternatively, even with rsync the way it is today, what I do is
manually bump up the blocksize to something large (say 16 or 32K).
This results in far fewer blocks for the checksum algorithm (from
perhaps 10-45x depending on original file size based on the default
dynamic blocksize selection) and thus minimizes the meta data
transmitted for the common portion of the file.  It works pretty well
for me with database transaction log files which get pretty big.  You
can probably find some past e-mail on the subject in the list by
looking for threads about rsync blocksize.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: efficient file appends

2001-12-12 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

 While potentially a useful option, you wouldn't want the protocol to
 automatically always check for it, since it would preclude rsync on

 This extension need not break any existing mechanism; if the hash of
 the receiver's copy of the file doesn't match the start of the
 sender's file, the protocol would continue as before.

Well, my point was that even if it does match, you might still want
the protocol to continue as before.  For example, if you have a file
that grows, but tends to contain similar information.  In that case,
you still want the per-block checksum information from the destination
because that way the source can use that information to minimize the
amount of new information to transmit.  Without having the per-block
information, it can't tell how to extract data from the current copy
at the destination to re-use for the new data rather than sending the
new data directly.  Not a big deal for appending log files (as long as
they have changing date strings), but not necessarily something to
have enabled by default.

 Alternatively, even with rsync the way it is today, what I do is
 manually bump up the blocksize to something large (say 16 or 32K).
 
 This sounds like an excellent idea, and I'll give it a try. As the
 blocksize reaches the receiver's file size, the scheme essentially
 approaches my idea.

Hmm, I've never tried _really_ large block sizes (I thought I had
problems if I got close to 64K, but I may be mis-remembering).  The
one drawback to the larger block sizes is that if you do encounter any
differences, you'll retransmit more information than necessary, but if
you know beforehand that it's definitely just appended data, that won't
be the case.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: definite data corruption in 2.5.0 with -z option

2001-12-12 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

 After I sent my note, I ran some more experiments and found the
 problem goes away if I use the default checksum blocksize. So the
 problem occurs *only* if I use a large blocksize (65536) *and* enable
 compression.

Should have read ahead - this is probably the problem I was recalling
in my last reply.  There was some reason I tended to keep the variable
block sizes that my scripts were picking to <= 32K or so.

If that's it, then I doubt it's the bit length overflow issue since
I was running into this back with a modified 2.4.3.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: move rsync development tree to BitKeeper?

2001-12-07 Thread David Bolen

 You can find a lot more information about the differences here:
 
   http://bitkeeper.com/4.1.1.html
 
 BitKeeper is not strictly Open Source, but arguably good enough.

I guess "arguably" holds if you don't mind having all your metadata
logged to an open logging server?

 The proposed plan is to convert the existing repository, retaining all
 history, some time in December.  At this point CVS will become
 read-only and retain historical versions.

I'm curious about the driving force here.  You talk about switching, but
don't really mention much about why - other than to get your feet wet
before using it for other projects.  So is it really the other
projects that have specific needs?

Is there specific functionality lacking in CVS that is trying to be
fixed?  At least for me, CVS is more convenient since it works with
all the open projects I use (and yeah, is easier in terms of
licensing).

I don't have strong objections to a change, but as one user who does
tend to track the source tree and not just releases, I definitely
would prefer to continue to see (as you did suggest) alternative
access to the current source tree (even if only daily snapshots),
since at least for me rsync would be the only BK project I'd care
about - it's not clear I'd want to bother with the client.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: Why does one of there work and the other doesn't

2001-11-30 Thread David Bolen

From: Randy Kramer [mailto:[EMAIL PROTECTED]]

 I am not sure which end the 100 bytes per file applies to, and I guess
 that is the RAM memory footprint?  Does rsync need 100 bytes for each
 file that might be transferred during a session (all files in the
 specified directory(ies)), or does it need only 100 bytes as it does one
 file at a time?

Yes, the ~100 bytes is in RAM - I think a key point though is that the
storage to hold the file list grows exponentially (doubling each
time), so if you have a lot of files in the worst case you can use
almost twice as much memory as needed.

Here's an analysis I posted to the list a while back that I think is
still probably valid for the current versions of rsync - a later followup
noted that it didn't include an ~28 byte structure for each entry in
the include/exclude list:

  - - - - - - - - - - - - - - - - - - - - - - - - -

 (a) How much memory, in bytes/file, does rsync allocate?

This is only based on my informal code peeks in the past, so take it
with a grain of salt - I don't know if anyone has done a more formal
memory analysis.

I believe that the major driving factors in memory usage that I can
see are:

1. The per-file overhead in the filelist for each file in the system.
   The memory is kept for all files for the life of the rsync process.

   I believe this is 56 bytes per file (it's a file_list structure),
   but a critical point is that it is allocated initially for 1000
   files, but then grows exponentially (doubling).  So the space will
   grow as 1000, 2000, 4000, 8000 etc.. until it has enough room for
   the files necessary.  This means you might, worst case, have just
   about twice as much memory as necessary, but it reduces the
   reallocation calls quite a bit.  At ~56K per 1000 files, if you've
   got a file system with 10,000 files in it, you'll allocate room for
   16000 and use up 896K.

   This growth pattern seems to occur on both sender and receiver of
   any given file list (e.g., I don't see a transfer of the total
   count over the wire used to optimize the allocation on the receiver).

2. The per-block overhead for the checksums for each file as it is 
   processed.  This memory exists only for the duration of one file.
   
   This is 32 bytes per block (a sum_buf) allocated as one memory chunk.
   This exists on the receiver as it is computed and transmitted, and
   on the sender as it receives it and uses it to match against the
   new file.

3. The match tables built to determine the delta between the original
   file and the new file.
  
   I haven't looked at closely at this section of code, but I believe
   we're basically talking about the hash table, which is going to be
   a one time (during rsync execution) 256K for the tag table and then
   8 (or maybe 6 if your compiler doesn't pad the target struct) bytes
   per block of the file being worked on, which only exists for the
   duration of the file.
   
   This only occurs on the sender.

There is also some fixed space for various things - I think the
largest of which is up to 256K for the buffer used to map files.

 (b) Is this the same for the rsyncs on both ends, or is there
 some asymmetry there?

There's asymmetry.  Both sides need the memory to handle the lists of
files involved.  But while the receiver just constructs the checksums
and sends them, and then waits for instructions on how to build the
new file (either new data or pulling from the old file), the sender
also constructs the hash of those checksums to use while walking
through the new file.

So in general on any given transfer, I think the sender will end up
using a bit more memory.

 (c) Does it matter whether pushing or pulling?

Yes, inasmuch as the asymmetry is based on who is sending and who is
receiving a given file.  It doesn't matter who initiates the contact,
but the direction that the files are flowing.  This is due to the
algorithm (the sender is the component that has to construct the
mapping from the new file using portions of the old file as
transmitted by the receiver).

  - - - - - - - - - - - - - - - - - - - - - - - - -


-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: Rsync: Re: patch to enable faster mirroring of large filesyst ems

2001-11-29 Thread David Bolen

Keating, Tim [[EMAIL PROTECTED]] writes:

  - If there's a mismatch, the client sends over the entire .checksum
 file.  The server does the compare and sends back a list of files to
 delete and a list of files to update. (And now I think of it, it
 would probably be better if the server just sent the client back the
 list of files and let the client figure out what it needed, since
 this would distribute the work better.)

Whenever caching checksums comes up I'm always curious - how do you
figure out if your checksum cache is still valid (e.g., properly
associated with its file) without re-checksumming the files?

Are you just trusting size/timestamp?  I know in my case I've got
database files that don't change timestamp/size and yet have different
contents.  Thus I'd always have to do full checksums so I'm not sure
what a cache would buy.

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: Rsync: Re: patch to enable faster mirroring of large filesyst ems

2001-11-29 Thread David Bolen

Keating, Tim [[EMAIL PROTECTED]] writes:

 Is there a way you could query your database to tell you which
 extents have data that has been modified within a certain timeframe?

Not in any practical way that I know of.  It's not normally a major
hassle for us since rsync is used for a central backup that occurs on
a large enough time scale that the timestamp does normally change from
the prior time.  So our controlling script just does its own timestamp
comparison and only activates the -c rsync option (which definitely
increases overhead) if they happen to match.

Although I will say that the whole behavior (the transaction log
always has an appropriate timestamp, it's just the raw database file
itself that doesn't) sure caught me by surprise in the beginning after
finding what I thought was a valid backup wouldn't load :-)

-- David

/---\
 \   David Bolen\   E-mail: [EMAIL PROTECTED]  /
  | FitLinxx, Inc.\  Phone: (203) 708-5192|
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150 \
\---/




RE: Block Size

2001-11-15 Thread David Bolen

Thomas Lambert [[EMAIL PROTECTED]] writes:

 What is the default block size?  I have a few files 30+mb and data
 is just added to the end of them.  It seems like it takes longer to
 sync them than it was to send it initially.  Should I change the
 block size or something else?

The default is an adaptive block size.  It's based on the file size
divided by 10000, truncated to a multiple of 16, with a minimum of 700
and a maximum of 16K (16384).

So your 30MB file ought to be using 16K blocks.  And yes, depending on
your machines (memory and CPU), it can take a while to synchronize
such files because rsync has to compute two checksums per block,
keeping that in memory, before making the transfer.  During the first
transfer rsync knows there is no target file, so it doesn't bother
with any of that but just sends the bytes.

If you know something about the construction of your file, manually
selecting a block size can be very helpful, since it helps optimize
how many changes rsync finds.  For example, when transferring database
files that I know have a 1K page size, I always keep block sizes a
multiple of 1K, since otherwise a single page change in the database
might affect two rsync blocks.  I then scale the block size by
database size to help keep the total number of blocks down, since that
burns memory and computation time.
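
For instance, a hedged example (file and host names made up) of what
that looks like on the command line - 16384 is a multiple of the 1K
page size:

    rsync -av --block-size=16384 /data/mydb.db remotehost:/backup/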

My database transaction log files are very similar to your file - they
constantly grow so I'm really always only catching up the tail end of
the file.  For those, I use as large a block size as feasible.
However, my files aren't as large (we truncate a lot) so I use 16K
myself.  I believe I've had it work up closer to 32K but then had some
problems, so there may be some signed number issues (e.g., stick just
below 32K).

Not sure how much that would help, although it'll reduce your block
count by about a factor of 2.

-- David





RE: Block Size

2001-11-15 Thread David Bolen

(I previously wrote)

 So your 30MB file ought to be using 16K blocks

Whoops - my fault for assuming 30MB was large enough and skipping the
calculations.  Turns out that really only yields about a 3K block size
with the adaptive algorithm.  So you can get a significant reduction in
blocks by using a larger value (16K, for example).
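
If it helps to sanity-check, here's a small sketch (bash, not rsync's
actual code) of the adaptive rule as described in my earlier message -
size divided by 10000, truncated to a multiple of 16, clamped between
700 and 16K:

    blocksize() {
        bs=$(( $1 / 10000 ))
        bs=$(( bs / 16 * 16 ))          # truncate to a multiple of 16
        [ $bs -lt 700 ]   && bs=700
        [ $bs -gt 16384 ] && bs=16384
        echo $bs
    }

    blocksize 31457280    # a 30MB file -> 3136, i.e. about 3K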

-- David





RE: times difference causes write

2001-11-13 Thread David Bolen

Don Mahurin [[EMAIL PROTECTED]] writes:

 My second problem is that the flash is of limited size, so I need
 some sort of patch rsync that does not keep the old file before
 writing the new one.  My patch now just unlinks the file ahead, and
 implies -W.

Sounds reasonable as long as you force the -W.

 So my wish was that a time discrepancy would lead to a checksum, where the
 files would match.
 This is not the case, however, as you say.

At least not with -W.

In most cases, the time discrepancy would then cause rsync to try to
synchronize the file, and during its protocol processing it would
determine that it didn't need to send anything, thus the only end
result would be adjusting the remote timestamp to match the source.
But this requires access to the original source file, so your prior
patch (and forcing -W) defeats this as a side effect.

 So for now, I must use -c.  It's slow, but I know that I get the minimum
 number of writes.

It definitely sounds like the best match for you.  Although -c tends
to be used more for cases where files may differ although they appear
the same (timestamp/size) than vice versa, it will serve that purpose
as well at the expense of some additional I/O and computation.

Presumably you could modify your patch so that -c (or some new option)
only invoked the checksum if the timestamp differed, since I don't
think there's any suitable equivalent currently in rsync.

-- David





RE: rsync recursion question

2001-10-24 Thread David Bolen

Justin Banks [[EMAIL PROTECTED]] writes:

 If your suggestion worked, that would be just fine with
 me. Actually, I guess it's fine anyway, I'll just have to maintain
 my patch ;)

This is probably obvious, but just in case it isn't, CVS makes this
fairly trivial to maintain over time (importing the main rsync releases
and then developing your own changes on the mainline).  Tracking
local changes to third party sources is one of its strengths.  That's
how I maintain our internal version of rsync which has a variety of
local changes.  Most I eventually submit back for possible inclusion
in the main release (after some burn-in time in local use), but there
are some that aren't general purpose enough, so they just stay in our
repository.

-- David





RE: Does RSYNC work over NFS?

2001-09-25 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

 consider, however, a slow pipe between systems, one or more mounting
 filesystems via nfs over a fast connection.  the lan connection to
 the nfs is negligible versus the rsync connection from server to
 server.

Oh, I'd agree with that.  But then to me you aren't running rsync 
over the NFS connection, but over the slow LAN connection.

I took the original question to mean using rsync over an NFS
connection serving as the link between source and destination (in which
case only -W makes sense), but in re-reading the subject, it's a tad 
ambiguous and could certainly include the above scenario.

-- David





RE: Problem with transfering large files.

2001-09-20 Thread David Bolen

Dave Dykstra [[EMAIL PROTECTED]] writes:

 On September 9 Tridge submitted a fix to CVS for that problem.  See
 revision 1.25 at
 http://pserver.samba.org/cgi-bin/cvsweb/rsync/generator.c

I'm not sure that fixes the use of the timeout for the overall
process.  See a recent answer by me to this list in the "Feedback on
2.4.7pre1" thread, which included an older patch from last year
that I've been using since then.

The overall timeout problem is due to the parent process doing a
read_int on the child process to wait for final completion, which is 
subject to the same timeout setting as the child processes is using 
on individual I/O.  But the parent won't hear from the child until 
it's fully done.

-- David





RE: lock files

2001-09-11 Thread David Bolen

Dietrich Baluyot [[EMAIL PROTECTED]] writes:

 Does rsync lock the source files while its copying?

No (there's really no guaranteed portable way to do it anyway).

But rsync does perform a final checksum on transferred files and if
they differ, it will re-execute the transfer (with an adjustment to
the checksum algorithm).  This is intended to catch the rare case
where the checksums can be fooled into thinking a portion of the file
is unchanged, but it can also catch changes under the covers while
rsync is moving the file.

However, it's not an absolute guarantee since rsync uses stored
directory information (such as overall file size) I believe, so it's
possible for a growing file to not include everything added during the
execution of rsync.

If you need absolute guarantees on a changing file, you need to apply
your own locking around rsync.  For example, in copying back database
backups I use a script that creates a lock file before running rsync,
and that lock file is also checked by the backup script.
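
As a rough sketch (file names made up - not the actual script), the
wrapper amounts to something like this, with the backup job honoring
the same lock file:

    lockfile=/var/run/db-backup.lock
    if ( set -C; echo $$ > "$lockfile" ) 2>/dev/null; then
        # We hold the lock; the backup script will wait or skip.
        trap 'rm -f "$lockfile"' EXIT
        rsync -av /backups/db/ remotehost:/backups/db/
    else
        echo "database backup in progress - skipping rsync" >&2
    fi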

-- David





RE: Can rsync synchronize design changes to tables and data between two Microsoft ACCESS replicas, mdb files?

2001-07-24 Thread David Bolen

R. Weisz [[EMAIL PROTECTED]] writes:

 Has anyone using rsync ever tried using it to manage the replication and
 synchronization process for Microsoft ACCESS replicas?  If so,

Not for Microsoft ACCESS, but we synchronize copies of SQL/Anywhere
databases constantly.

As long as you're not trying to synchronize files that are actively in
use (which may prevent rsync from reading portions of the files) there's
no reason why it won't work for any database file.  Rsync itself is just
treating the files as arbitrary binary data - it doesn't care about any
structure to the file data, whether it be a database storage, word
processing document, or just flat text.

One small suggestion for efficiency - for our database transfers, we
keep the blocksize at some multiple of the underlying database page
size (1K for our SQL/Anywhere databases) since it's the nature of the
beast that all changes to the database will occur within those
boundaries.  It's not guaranteed to be more efficient, but we've found
it to be so (it prevents multiple rsync blocks from being involved in
a single database change).  I believe for the Jet engine that Access
uses it was 2K prior to Jet 4.0 and 4K afterwards.

-- David





RE: another data point re: asymmetric route problem

2001-07-20 Thread David Bolen

Adam McKenna [[EMAIL PROTECTED]] writes:

 Well, the route to my other secondary dns server recently became asymmetric,
 and, as expected, the rsyncs between the primary and that box are hanging
 now, too.

Have you tried running with Wayne's no-hang patches applied?  The
asymmetry might be affecting timing of information flow, and it would
be interesting if that exacerbated any of the characteristics that his
buffering changes were focused on addressing.

-- David





RE: Problem with --compare-dest=/

2001-07-06 Thread David Bolen

Dave Dykstra [[EMAIL PROTECTED]] writes:

 Perhaps it should be using clean_fname().  Please try making a fix using
 clean_fname() or some other way if it looks better, test it out, and submit
 a patch.  I wrote that option so I'll make sure the patch gets in if I
 think it looks good.

Ok.  clean_fname() has its own problem though because it eliminates
double // even at the beginning of the path, which while it would fix
this specific case, would break if I was actually trying to use a UNC
compare-dest.

I think I tried fixing clean_fname() to avoid this case in the past but
ran into problems with other portions of the code that depended on it
cleaning this up.

I'll poke around as I get a chance - right now we're prepping for a big
deployment so I think I'll go the /. route for the near term :-)

-- David





Problem with --compare-dest=/

2001-07-05 Thread David Bolen

We do a bunch of distribution to remote machines (under Windows NT) using
compare-dest to make use of existing files on those machines when possible.
Up until now, the comparison directory has always been at least one level
down in the directory hierarchy, but we've just started doing full recursive
distributions of files that have a mirror structure to our C: drive itself,
and I think I've run into a buglet with --compare-dest.

The code that uses the compare-dest value blindly builds up the comparison
using %s/%s (with compare-dest and filename), so if you set
--compare-dest=/, you end up with filenames like //dir/name - the
leading double slash may not matter under most Unixes, but it does under
Windows (Cygwin still uses it as a UNC path) and I think POSIX permits such
a path to be system-specific (// inside a path must be ignored, but at the
start can do special stuff).

Anyway, a quick workaround appears to be to use /. rather than just /
for compare-dest (or actually, since it's Windows, using C: works too),
but I'm guessing it's something that should probably be handled inside of
rsync better?
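
A hedged illustration of that workaround (the host, module and
destination here are made up) - the trailing dot keeps the generated
comparison paths from starting with //:

    rsync -av --compare-dest=/. buildhost::dist/ /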

-- David





RE: RSync on NT

2001-07-03 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

 Has anybody had any luck getting RSync to work with WinNT 4.0?

Yep.  At least compiled for use with Cygwin, it works fine.  I do use
a local tool to make a named pipe connection to a target machine
rather than rsh, but any old path should work.

 I am interested in using RSync in a non-daemon mode.  How do I specify
 drive/directory paths along with host names?  If I issue this command:

You can't really use native drive specifications because as you note
rsync uses the Unix convention of separating the system from the path
with a colon.

But since Rsync is built on top of Cygwin (I'm presuming that's how
you built it) you can use the standard /cygdrive/? notation (or until
they formally remove it, the deprecated //? notation) to select a
drive.  You should also be able to use //system/share to access
network shares.
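
Hedged examples (the drive letter, host and share names are made up) of
the path forms mentioned above:

    # A local drive via the Cygwin mount point:
    rsync -av /cygdrive/c/data/ backuphost:/data/

    # A network share via the UNC-style notation:
    rsync -av /cygdrive/c/data/ //fileserver/share/data/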

-- David





RE: problems encountered in 2.4.6

2001-05-25 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

Dave Dykstra wrote:

 That's two different kinds of checksums.  The -c option runs a whole-file
 checksum on both sides, but if you don't use -W the rsync rolling checksum
 will be applied.

So the chunk-by-chunk checksum always is used w/o -W?  I guess the docs are
more confusing than I originally thought.

It might help if you think of it as two phases - discovery of what
files need to be transferred, and then the transfer itself.

The discovery phase will by default just check timestamps and sizes.
You can adjust that with command line options, including the use of -c
to include a full file checksum as part of the comparison, if for
example, files might change without affecting timestamp or size.

Once rsync knows what it needs to transfer, then it works its way
through the file list, and for each file it performs a transfer.  By
default, that transfer is the rsync protocol - which involves the full
process of dividing the file into chunks with both a strong and
rolling checksum, and doing the computations to figure out what parts
to send and so on.

Now, normally this process is divided so that the copy of rsync that
does the I/O is local to the file - e.g., for discovery both client
and server rsync identify file timestamp/sizes independently (and
optionally compute the checksums locally) and then exchange that
information.  For transfer both rsyncs build up the rolling and chunk
checksums and exchange them and then decide what file data to send.

But when you are copying with a single rsync (and in particular when
one of the files is on the network), then that rsync has to do all the
work.  That means that during discovery it either 'stat's all files or
optionally computes checksums.  To do the checksum it has to read the
file, so both source and destination get read fully - if either are
on the network you will have already spent the network traffic to pull
the complete files back to the local machine.

Likewise for the transfer - under the rsync protocol, rsync has to
compute the checksums for both source and destination files.  Now,
it'll only do this for those that it wants to transfer, but in those
cases it effectively pulls back complete files from the network just
to compute the checksums, only to then start transferring them.  Even
if the rsync protocol yields a very small amount of difference,
anything beyond that point is already more than the full file with
respect to the network activity that takes place.

That's why the -W option is really the only logical thing to use with
a single rsync and local (on-system or network share/mount) copies.
Under such circumstances, the rsync protocol isn't going to help at
all, and will probably slow things down and take more memory instead.
With -W rsync becomes an intelligent copier (in terms of figuring out
what changed), but that's about it.
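
So for a copy where both ends are visible as local paths (one of them
an NFS mount or network share, say), something like this hedged
example (paths made up) is the sensible form:

    rsync -avW /local/src/ /mnt/nfs/dest/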

-- David





RE: problems encountered in 2.4.6

2001-05-25 Thread David Bolen

[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes:

 Actually, the lack of -W isn't helping me at all.  The reason is that
 even for the stuff I do over the network, 99% of it is compressed with
 gzip or bzip2.  If the files change, the originals were changed and a
 new compression is made, and usually most of the file is different.

Just to clarify, when you say over the network you mean in true
client/server rsync (or across an rsh/ssh stream) and not just using
one rsync with references using network mount points, right?  In the
latter case, not having -W is hurting you, never helping.

But yes, any format (e.g., encryption, compression) that effectively
distributes changes randomly over a file is going to be a killer for
rsync.

For the case of gzip'd files when a client and server rsync are in
use, you may want to look back through the archives of this list -
there was a reference to a patch for the gzip sources that created
rsync-friendly gzip's.  Not as great as the non-gzip'd version, but
far better than normal gzip.

Ah yes - here was the URL:

http://antarctica.penguincomputing.com/~netfilter/diary/gzip.rsync.patch2
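
With that patch applied, the flag is used like any other gzip option
(the file name here is made up):

    gzip --rsyncable -9 nightly-dump.sql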

At the time when I tried it (1/2001), here were some test results:

For comparison, here's a database file (delta between one day and the
next), both uncompressed and gzip'd (normal and -9).  For the
uncompressed I also transferred with a fixed 1K blocksize since I know
that's the page size for the database - the others are default
computations (I tried the 1K with the gzip'd version but it was
worse, as expected).

                  Normal   Normal+1K        gzip      gzip-9
Size            54206464    54206464    21867539    21845091
Wrote            2902182     1011490     3169864     3214740
Read               60176      317648       60350       60290
Total            2962358     1329138     3230214     3275030

Speedup            18.30       40.78        6.77        6.67
Compression         1.00        1.00       2.479       2.481
Normalized         18.30       40.78       16.78       16.54

And in terms of size:
   
As Rusty's page comments, they are slightly larger, but not
tremendously so.  In my one case:

Normal gzip:            21627629
gzip --rsyncable:       21867539
gzip -9 --rsyncable:    21845091

So about a 1-1.1% hit in compressed size.


Personally, here we end up just leaving the major stuff we transfer
uncompressed - as we're using slow analog lines, the cost recovery was
easily worth the cost in disk space, particularly in cases like our
databases where knowledge of the page size and method of change goes a
long way.

 It definitely helped for transferring ISO images where the whole image
 would be changed if some files changed.  I set the chunk size to 2048
 for that.  Why it defaults to 700 seems odd to me.

Not sure - perhaps some early empirical work.  When I'm moving files
that I know something about I definitely control the block size
myself, so for example, when moving databases with a 1K page size, I
always use a multiple of that (since I know a priori that's how the
database dirties the file), and then I scale that up a bit based on
database size, to get a reasonable tradeoff between block overhead and
extra transfer upon a change detection.

-- David





RE: --compare-dest usage (was Re: [expert] 8.0 Final)

2001-04-23 Thread David Bolen

Randy Kramer [[EMAIL PROTECTED]] writes:

 I'm still uncertain about what happens if a single file rsync is
 interrupted partway through (and I've specified the --partial
 option) -- will rsync take advantage of the old copy of the file in
 --compare-dest and the partially rsync'd file (now truncated) in the
 destination directory?

No, there can only be one compare file, so the next time you try the
transfer it'll use the previous partially transferred copy for that
purpose.  Depending on the size of the original compare file and how 
much was partially transferred the last time, that can lose a 
significant amount of information (if you were using --partial) in 
terms of rsync's ability to be as efficient as possible.

Locally here, we have a similar setup where we're transferring large
database files.  To work around that, I've been trying a local
--partial-pad option.  If enabled, it works like --partial, but upon
an interruption, it appends data from the original source to the
partial copy to fill out the partial copy at least as large as the
original.

I believe in our specific case it's an improvement - and we really
only use it on the commands that perform the database backups - but am
not convinced that it's necessarily useful as a general purpose option
(and haven't yet submitted a patch), since there's nothing to say that
the partially transferred information will be of any use in the next
transfer, since it assumes sort of a linear change pattern to the
file.  The best performing behavior would be to work with both the
previous partial file and the original in the --compare-dest directory,
but that would be significant changes to rsync internals, not to
mention potentially twice the work if the partial copy was any
significant fraction of the --compare-dest copy.

But I'd be happy to supply a diff if you think it might help in your
setup.  It'd be against 2.4.3, but should be very close if not the
same against later releases.

 Aside: I think, based on your previous response, that if I did a
 multifile rsync (say 60 files), and rsync was interrupted after 20
 of the files were rsync'd, the --compare-dest option would work to
 avoid rsync'ing the first 20 files and then rsync would rsync the
 last 40 files in the normal manner (i.e., breaking them into blocks
 of 3000 to 8000 bytes and then comparing them, and transferring only
 the blocks that were different).

I don't think the --compare-dest would be the reason rsync would skip
the first 20 - it would just see them as existing in the target
directory at the right date and size.  Where --compare-dest could come
into play was if they already existed in the separate comparison
directory, in which case they wouldn't be transferred at all (unless
you were using the -I option).

-- David





RE: Backing up *alot* of files

2001-02-23 Thread David Bolen

Nemholt, Jesper Frank [[EMAIL PROTECTED]] writes:

 Now the big question: How long will the next run take (most likely, only a
 few files have changed)?

You'll need the same basic startup time (and memory) to identify the file
list, but at that point it should be quite fast at skipping to only the
files that need to be transferred (providing you let it identify such files
by size and timestamp - the default operation).

However, I'm not sure I follow what you are currently running - are you
using rsync to sort of "bootstrap" your backup repository?  If that's the
case, then it can be more efficient to just transfer the files via a
standard copy mechanism (you don't have any of the overhead of rsync at all)
or use rsync with the -W (whole file, no incremental computations) option
that very first time.

-- David





RE: Backing up *alot* of files

2001-02-23 Thread David Bolen

Nemholt, Jesper Frank [[EMAIL PROTECTED]] writes:

 That was also what I was hoping, but what if I add the -c for
 checksum I suppose it then needs to read & checksum both source
 and destination for all files, or ? (this will as far as I can see
 take at least the 10 hours, maybe more).

Yes, with -c it'll be back to processing each and every file, and it
may even be worse than your current timing since it has to checksum
both the local and remote copy, whereas if you're copying into an
empty filesystem now (I'm not sure if you are or aren't) it just knows
it has to transfer.

 I don't think checksum is a necessity here, but when dealing with
 files including production database files from Oracle, it _is_ nice
 to play safe...  We plan to let the DBAs fire up the databases on
 the backup and check everything. If they say OK, the most important
 files are OK.

We had precisely this problem with SQLAnywhere database files under
Windows.  The timestamp and size of the main database file would
remain the same but the data would have changed.  The only way to let
rsync detect this was with -c.

However, the performance implications eventually made me add some
extra support to the script wrapping rsync so that it checked the
timestamp and sizes and only added the "-c" to rsync if they were the
same.

So in your case, it might be easier to pre-identify any files that
might need to be transferred even if their timestamp and size remain
unchanged, and handle them with a separate run of rsync with
appropriate options.
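
As a very rough sketch of that sort of wrapper (paths made up, GNU
stat assumed, and both copies assumed reachable locally - this is not
our actual script):

    src=/data/oracle/system.dbf
    dst=/backup/oracle/system.dbf
    if [ "$(stat -c '%s %Y' "$src")" = "$(stat -c '%s %Y' "$dst")" ]; then
        opts="-av -c"       # same size/mtime - force the checksum pass
    else
        opts="-av"          # size or mtime differ - quick check is enough
    fi
    rsync $opts "$src" "$dst"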

 Yes, I would probably save the first hour, but it was done with
 rsync & the normal options nevertheless, just to see if anything
 went wrong when using rsync on 2 million files.

True - that's an amazingly large single-directory structure.

-- David





RE: exit status

2001-02-05 Thread David Bolen

Toni Pisjak [[EMAIL PROTECTED]] writes:

 My question: Is the exit status reliable in the current version ?

It's not 100% reliable, but it does somewhat depend on what you would
consider a failure, since there are some slightly ambiguous cases.

For my part, the cases where I've seen fit to make some local changes
to cover scenarios included the following:

* recv_files in receiver.c: There are 4-5 points where it can have a local
  failure (fstat, getting a tmp name, etc...) that the stock code prints
  a warning, but continues to the next file in the list.  Depending on the
  failure it may still exit out with a non-zero code, but in some cases
  it'll just skip the problem file and continue on.  I changed this to
  abort with a non-zero code.  (This does change control flow slightly in
  that function but I haven't seen a problem yet, and receive_data that
  is called from this function can also directly exit on an I/O failure).

* finish_transfer in rsync.c: Failures renaming the temporary file could
  be ignored (I find this happens sometimes under NT) and you could lose
  both the temporary ".filename" version and think it was successful.  I
  switched this to triggering an I/O error exit.

In my transfers, I got caught once in the first case when I ran out of
disk space, so the mktemp call was failing on all the files but
looking like the overall transfer was ok.  The second case hit me
under NT (as noted above) sporadically - but certainly not frequently.

I haven't had the opportunity yet to suggest these changes back for
inclusion in the main source.

The only other issue that I've seen is the potential for the server
side to run into local problems, that get reflected as error messages
to the receiver, but don't stop the transfer or result in a non-zero
exit code.  One case is a missing file on the server - you'll get a
link_stat warning message in the stream but if it was just one of
several files, the rest of the transfer will complete and it can look
successful.  I was going to fix this until I realized that it actually
helped us in some transfers where a file might or might not be
present.  I do think it should probably error out in the long run, but
haven't made any changes along those lines.

With all this said however, it took me about 6 months of using rsync
heavily to get to the point of making the changes I did, so these
aren't frequent occurrences, nor did they really impede my use of
rsync with scripts that strictly watched the exit code.  Also, in all
of these cases, there is a warning or error message that gets
displayed, but you'd have to parse the rsync output to see it rather
than just trusting the exit code.
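
A hedged sketch (paths made up) of the sort of belt-and-suspenders
check a calling script could add:

    out=$(rsync -av /src/dir/ remotehost:/dest/dir/ 2>&1)
    rc=$?
    printf '%s\n' "$out"
    # Treat warnings/errors in the output as failures even when rc is 0.
    if [ $rc -ne 0 ] || printf '%s\n' "$out" | grep -qiE 'error|failed'; then
        echo "rsync transfer problem detected" >&2
        exit 1
    fi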

-- David





RE: The out of memory problem with large numbers of files

2001-01-25 Thread David Bolen

Dave Dykstra [[EMAIL PROTECTED]] writes:

 No, that behavior should be identical with the --include-from/exclude '*'
 approach; I don't believe rsync uses any memory for excluded files.

Actually, I think there's an exclude_struct allocated somewhere per
file (looks like 28 bytes or so), but the growth algorithm is not
exponential (it just reallocates two entries at a time it looks like).

But I expect compared to other stuff it's in the noise.

-- David





RE: The out of memory problem with large numbers of files

2001-01-25 Thread David Bolen

I previously wrote:

 Well, as with any dynamic system, I'm not sure there's a totally
 simple answer to the overall allocation, as the tree structure created

Oops, this slipped through editing - as I wrote up the rest of the note
I didn't actually find a tree structure (I earlier thought there was
one) - so please ignore that :-)

-- David





RE: rsync problem

2001-01-24 Thread David Bolen

Kevin Saenz [[EMAIL PROTECTED]] writes:

 I guess that might be the case but there is one question left to ask:
 the total number of files that we rsync has not changed.  Why would this
 task cause problems all of a sudden?

If it's not the per-file overhead adding up, have you suddenly picked
up a huge file in the bunch?  generate_sums() is called for each file
to construct the per-block checksums to be transmitted to the sender
to compute the delta.  Perhaps you've now got a file in your
filesystem that is so large that the aggregate space required for the
checksums for that file exceeds your available working space.

-- David





RE: rsync memory usage ...

2001-01-24 Thread David Bolen

Cameron Simpson [[EMAIL PROTECTED]] writes:

| Cameron The other day I was moving a lot of data from one spot to
| Cameron another.  About 12G in several 2G files. [...]
| Cameron so I used rsync so that its checksumming could speed past
| Cameron the partially copied file. It spent a long time
| Cameron transferring nothing and ran out of memory. From the
| Cameron error I'm inferring that it checksums the entire source
| Cameron file before sending anything across the link.
| I know I'm not (directly) addressing the problem, and I don't know the
| code, but will specifying a larger block size allow you to work-around
| the problem?

Perhaps - the transfer is done now but I'll try it next time I have
such an issue. I was more concerned with the appearance that rsync
stashes all the checksums before sending any. This seemed memory hungry
and nonstreaming, which is odd in an app so devoted to efficiency.

Your question about behavior is accurate though - while rsync spends a
lot of energy trying to make an efficient transfer of the file itself,
the actual meta-process to determine the transfer is fairly synchronous.

After exchanging information to determine the set of files involved,
the receiver proceeds through each file in turn, computes the
checksums for the file and then transmits them.  The sender receives
all the checksums, then uses that in conjunction with its copy of the
file to compute the delta information, and then transmits that back.
As the receiver receives the delta information it recreates the new
file.

So there is definitely start-up overhead that must occur before any of
the file data is transferred at all, and for a very large file, the
checksum computation and the transmission of the checksum information
can be lengthy.

Some of this is unavoidable - until the sender has all of the receiver
checksum information it can't necessarily start sending - some of the
very end of the current file on the receiver may be used at the very
beginning of the sender's new version, which it can't detect until it
knows about the entire receiver's file.

Adjusting the blocksize manually can have an impact on this.  The
larger the blocksize, the smaller the checksum meta-information, since
you have linear growth with the number of blocks the file represents.
If a block size is not set on the command line, rsync will do some
dynamic adjustment of the blocksize (roughly size/10000) maxing out at
16K.  During transmission it's 6 bytes per block, but I believe it's
32 bytes in memory.  So for the 2GB file, you'll have about 122,000
blocks, so ~700K transmitted and ~4MB in memory.  That doesn't
really sound like enough to exhaust memory on typical machines
nowadays though.  There's some per-file growth too, but the per-block
checksums are freed as it works through each file.

Now, in terms of increased efficiency - while you do have to transmit
all of the checksum information before the sender can compute the
delta, one thing I've been interested in trying is to have the receiver
send the checksums as it computes them - I'm not entirely sure why it
has to be saved in memory, since it'll be freed right after
transmission.  About the only risk I see is that it couples the
checksum process to the line speed which could raise the risk of
inconsistency if the file on the receiver is changing, but that risk
is already there, just a smaller window.  I haven't had a chance to
try the change though yet.

-- David





RE: rsync hangs

2001-01-18 Thread David Bolen

Marquis Johnson [[EMAIL PROTECTED]] writes:

I found yesterday that this was my fault.  I
 probably should not have used rsync -avvv.  When I used rsync -av it
 worked fine.  Thanks for your help.

Yes, using more than two "-v" options can cause a problem, although I
normally get a failure early on, depending on how many more.  The
problem is that the extremely verbose modes sometimes generate output
that doesn't go through the standard I/O handling for streams on the
server, and thus doesn't get packaged up properly for the client to
decode.  So the client perceives the server debug output as a protocol
failure.

I had at one point played with both removing the verbose option from
transmission to the server and/or explicitly limiting the level sent to
no more than 2, no matter what debugging level was enabled locally.
Without that you can never really get the higher debugging levels locally.

-- David





RE: Source and destination don't match

2001-01-18 Thread David Bolen

Jeff Kennedy [[EMAIL PROTECTED]] writes:

 I have a source directory that is not being touched by anyone, no
 updates or even reads except by the rsync host.  I am using just a
 straight binary, no rsyncd.conf file.  I am using the follwing command:
 
 rsync -avz /source/path/dir /dest/path/dir
 
 Using version 2.4.6 on Solaris 7, source and destination are both on a
 NetApp filer.  Seems to run without incident but du's on both
 directories show a 40MB difference.
 
 Is this normal?  Thanks.

It depends.  You might try comparing the output of a "find -ls"
(perhaps excluding directories) on both trees to see if it's easy to
tell where the difference lies.

One thing that might account for a difference would be if the source
filesystem is a very active one (over time, not at this instant), in
which case its directory files could be much larger due to normal
usage growth, whereas your destination copy is fresh and only as large
as necessary for the actual current file information.  This would
require that the larger side be the source, and I'd expect in that
case the 40MB would have to be a relatively small fraction of the
overall size, which I can't tell from the info provided.

Another possibility is that your source tree has sparse files, which
are being expanded during the copy.  In that case, the --sparse option
of rsync may help, although I have not had need to use it myself in
the past.  Oh, this would also imply that the source filesystem would
be the larger of the two.
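
If sparse files do turn out to be the cause, the option just drops into
the command you quoted above (a hedged example):

    rsync -avz --sparse /source/path/dir /dest/path/dir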

Of course, I suppose it's also possible that there's something in your
source that rsync isn't syncing up properly - the find comparison
should be able to highlight that.

-- David





RE: Interrupted transfer of a file

2000-12-22 Thread David Bolen

Whoops, I wrote in my previous message:

 So you lose the copy, but rsync still exits with a non-zero code,
 which makes it look like there was nothing to transfer (e.g., no
 change to the file).

That should have said "exits with a zero exit code" - e.g., it exits
looking like it was successful.

-- David





RE: Changed file not copied

2000-12-01 Thread David Bolen

John Horne [[EMAIL PROTECTED]] writes:

 Okay I think I've solved this :-) The rsync with the '--size-only'
 option updates the modification time for the file but doesn't copy
 it.

Actually, I believe it's less related to "--size-only" rather than
your use of the "-t/--times" option (implicitly since you used "-a").

Thus, "--size-only" says not to bother transferring the file contents
even if the date is different but the size is the same, but the "-t"
still transfers over the appropriate timestamp information.  This can
be a good way to initially sync up two systems that weren't previously
mirrored with maintenance of timestamp information, but otherwise have
the same content.
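
A hedged example (paths made up) of that initial-sync use:

    # Content already matches by size; just bring the timestamps in line.
    rsync -av --size-only /mirror/src/ remotehost:/mirror/dest/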

 Hence when I run the command without that option the size and time
 are equal - hence the file is not copied despite being different. I
 see that there are options to ignore the time as well, so that would
 get round it on the second rsync command. Not a problem, I just
 found it a bit confusing :-)

In addition to ignoring the timestamp ("-I/--ignore-times"), and
depending on your environment and size of the files involved, you can
also use the "-c/--checksum" option to have rsync compute a file
checksum to determine if a file should be transferred.  While this can
be slow, if the actual rsync processing of the file (e.g., block
checksums) and transmission of that information (particularly on a
slow link) would be lengthy, it's a more efficient way to guarantee
you only bother sending a file if it is different.

-- David
