RE: Patch to avoid 'Connection reset by peer' error for rsync on cygwin
Max Bowsher [[EMAIL PROTECTED]] writes: I thought that shutdown acts as below: (no data loss) SUSv2 is annoyingly somewhat vague on the specifics. So are the FreeBSD/Linux man pages. They don't specifically indicate truncation or flushing of data, although I don't recall ever thinking of shutdown() as truncating. To me, what Microsoft got wrong in WinSock is _requiring_ shutdown() to reliably close a socket while ensuring all final data gets delivered. It certainly caught me by surprise (and annoyance) at one point. I rarely ever used it in a Unix environment, but I do think it should be safe. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
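(For anyone following along: the usual portable way to get the "deliver everything, then close" behavior is to half-close the write side and then drain the socket until EOF before closing. A minimal sketch of that idiom - illustrative only, not rsync's code:)

    #include <sys/socket.h>
    #include <unistd.h>

    /* Half-close our sending side, then wait for the peer to finish. */
    static void graceful_close(int sock)
    {
        char buf[4096];

        shutdown(sock, SHUT_WR);          /* done sending; FIN goes out now */
        while (read(sock, buf, sizeof buf) > 0)
            ;                             /* drain until the peer closes (EOF) */
        close(sock);
    }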
RE: memory requirements was RE: out of memory in build_hash_table
Granzow, Doug (NCI) [[EMAIL PROTECTED]] writes: From what I've observed by running top while rsync is running, its memory usage appears to grow gradually, not exponentially. The exponential portion of the growth is up front when rsync gathers the file listing (it starts with room for 1000 files, then doubles that to 2000, 4000, etc...). So if your rsync has started transferring at least the first file, it's already done whatever exponential growth it's going to do. After that yes, it's far more gradual and should I think settle down since most of the rest of the memory allocation is on a per-file basis and not saved once the individual file is done. I saw someone on this list recently mention changing the block size. The rsync man page indicates this defaults to 700 (bytes?). Would a larger block size reduce memory usage, since there will be fewer (but larger) blocks of data, and therefore fewer checksums to store? Yep, although I think that is a reasonably small amount of memory (something like 8 bytes per block to hold the checksums) and only holds the checksums for a single file at a time. But in addition to saving memory for larger files, this can also improve performance because there's less computation to be done as well as less block matching. The downside is potentially more data transferred. You suggested setting ARENA_SIZE to 0... I guess this would be done like this? % ARENA_SIZE=0 ./configure I don't know if the configure script looks in the environment or not, but my guess would be no. (Took a quick peek and it doesn't look like that's something munged by configure at all). If you wanted to try it, I'd just edit rsync.h and comment out its current definition in favor of one defining it to 0 - e.g.: /* #define ARENA_SIZE (32 * 1024) */ #define ARENA_SIZE 0 The arena handling seems to be reasonably tight, so it's probably a long shot in any event. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
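(To make the block-size tradeoff above concrete, here's a rough back-of-the-envelope sketch. The 8 bytes/block figure is the one mentioned in the note above, and the 2GB example file size is just an assumption for illustration:)

    #include <stdio.h>

    /* Rough feel for how block size affects per-file checksum overhead. */
    #define BYTES_PER_BLOCK 8      /* approximate per-block checksum storage */

    int main(void)
    {
        long long file_size = 2LL * 1024 * 1024 * 1024;   /* e.g. a 2GB file */
        int sizes[] = { 700, 4096, 16384, 32768 };
        int i;

        for (i = 0; i < 4; i++) {
            long long blocks = (file_size + sizes[i] - 1) / sizes[i];
            printf("block size %6d -> %9lld blocks, ~%lld KB of checksum data\n",
                   sizes[i], blocks, blocks * BYTES_PER_BLOCK / 1024);
        }
        return 0;
    }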
RE: On Windows OS, is there any advantage for compiling rsync using MSVC instead of gcc/cygwin ?
Diburim [[EMAIL PROTECTED]] writes: (Quoted from the subject line - Diburim, it's best to keep the subject line short and put your question in the body of the e-mail. Subject lines are often truncated for display purposes and it can make it more difficult to see your question) On Windows OS, is there any advantage for compiling rsync using MSVC instead of gcc/cygwin ? Not only is there no advantage, but it won't work - I'm guessing you haven't actually tried, right? :-) rsync is designed to run in a Unix environment, and makes extensive use of Unix system calls and facilities (being able to fork() a child process for example). The native Windows environment isn't compatible with the Unix system API. That's what cygwin brings to the table, the entire Unix emulation layer, and it's crucial to the ability of rsync to work under Windows at all. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: memory requirements was RE: out of memory in build_hash_table
Granzow, Doug (NCI) [[EMAIL PROTECTED]] writes: Hmm... I have a filesystem that contains 3,098,119 files. That's 3,098,119 * 56 bytes or 173,494,664 bytes (about 165 MB). Allowing for the exponential resizing we end up with space for 4,096,000 files * 56 bytes = 218 MB. But 'top' tells me the rsync running on this filesystem is taking up 646 MB, about 3 times what it should. Are there other factors that affect how much memory rsync takes up? I only ask because I would certainly prefer it used 218 MB instead of 646. :) Hmm, yes - I only mentioned the per-file meta-data overhead since that's the only memory user in the original note case, which was failing before it actually got the file list transferred, and it hadn't yet started computing any checksums. But there are definitely some other dynamic memory chunks. However, in general the per-file meta-data ought to be the major contributor to memory usage. I've attached an old e-mail of mine when I did some examining of memory usage for an older version of rsync (2.4.3) which I think is still fairly valid. I don't think it'll explain your significantly larger usage than expected. (A followup note corrected the first paragraph as rsync doesn't create any tree structures) Two possibilities I can think of have to do with the fact that the per-file overhead is handled by 'realloc'ing the space as it grows. It's possible that the sequence of events is such that some other allocation is being done in the midst of that growth which forces the next realloc to actually move the memory to gain more space, thus leaving a hole of unused memory that just takes up process space. Or, it's also possible that the underlying allocation library (e.g., the system malloc()) is itself performing some exponential rounding up in order to help prevent just such movement. I know that AIX used to do that, and even provided an environment variable way to revert to older behavior. What you might try doing is observing the process growth during the directory scanning phase and see how much memory actually gets used to that point in time - gauged either by observing client/server traffic for when the file list starts getting transmitted, or by enabling/adding some debugging output to rsync. I just peeked at the latest sources in CVS, and it looks like around version 2.4.6 the file list processing added some of its own micro-management of memory for small strings, so there's something else going on there too, in theory to help avoid platform growth like that mentioned in my last paragraph. So if you're using 2.4.6, you might try a later version to see if it improves things. Or if you're using a later version you might try rebuilding with ARENA_SIZE set to 0 to disable this code to see if your native platform handles it better somehow. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ - - - - - - - - - - - - - - - - - - - - - - - - - From: David Bolen [EMAIL PROTECTED] To: 'Lenny Foner' [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: RE: The out of memory problem with large numbers of files Date: Thu, 25 Jan 2001 13:25:43 -0500 Lenny Foner [[EMAIL PROTECTED]] writes: While we're discussing memory issues, could someone provide a simple answer to the following three questions?
Well, as with any dynamic system, I'm not sure there's a totally simple answer to the overall allocation, as the tree structure created on the sender side can depend on the files involved and thus the total memory demands are themselves dynamic. (a) How much memory, in bytes/file, does rsync allocate? This is only based on my informal code peeks in the past, so take it with a grain of salt - I don't know if anyone has done a more formal memory analysis. I believe that the major driving factors in memory usage that I can see are: 1. The per-file overhead in the filelist for each file in the system. The memory is kept for all files for the life of the rsync process. I believe this is 56 bytes per file (it's a file_list structure), but a critical point is that it is allocated initially for 1000 files, but then grows exponentially (doubling). So the space will grow as 1000, 2000, 4000, 8000 etc.. until it has enough room for the files necessary. This means you might, worst case, have just about twice as much memory as necessary, but it reduces the reallocation calls quite a bit. At ~56K per 1000 files, if you've got a file system with 10,000 files in it, you'll allocate room for 16000 and use up 896K. This growth pattern seems to occur on both sender and receiver of any given file list (e.g., I don't see a transfer of the total count over the wire used to optimize the allocation on the receiver).
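(The growth pattern described above is just the classic "double the array when it fills" idiom; a simplified sketch of its shape - not the actual rsync code, and 56 bytes is just the approximate per-entry size mentioned above:)

    #include <stdlib.h>

    struct file_entry { char pad[56]; };   /* stand-in for the ~56-byte per-file record */

    struct file_list {
        struct file_entry *files;
        int count;
        int alloced;
    };

    /* Make room for one more entry, doubling from an initial 1000 slots,
     * so allocated space can end up nearly twice what is actually used. */
    static void flist_expand(struct file_list *fl)
    {
        if (fl->count < fl->alloced)
            return;
        fl->alloced = fl->alloced ? fl->alloced * 2 : 1000;
        fl->files = realloc(fl->files, fl->alloced * sizeof fl->files[0]);
        if (!fl->files)
            abort();   /* the real code reports "out of memory" and exits */
    }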
RE: Future RSYNC enhancement/improvement suggestions
(I wrote about long files using 20-30 min to checksum without network traffic) Jason Haar [[EMAIL PROTECTED]] writes: ...But then you should have a dialup timeout of 1 hour set? Oh of course - I was more responding to Martin's comment about there being enough traffic present in general during an rsync session, since there are cases when you can have lengthy periods without traffic at all. I could also see some NAT boxes holding a particular stream for far less than an hour by default, but I don't have a particular data point for that so perhaps it's just being too conservative. I think the problem is that you're morally upset that rsync spends so much time sending no network traffic. Quite understandable ;-) Not sure about morally, but definitely financially :-) What about separating the tree into subtrees and rsyncing them? That means you go from: 1 dialup connection started [quick] 2 rsync generates checksums (no network traffic) [slow] 3 rsync transmits files Perhaps you misunderstood - the checksum generation time that was taking so long was on a *single* file level. Rsync had already exchanged file lists and chosen the files to transfer - it was working on a single file and generating the block checksums on the receiver side to send over to the sender side. (As it turns out the transfers in question were for a single directory normally comprised of two files - a database file and its transaction log) The real rub was that after spending 20+ minutes with an idle line computing the checksum, it would then take another 30+ minutes to transmit the checksum information over. So it was (and likely still is) a case where sending the data as computed would have been a major win. At least for slow connections, the checksum computation is unlikely to be the bottleneck versus network transmission, so leaving the network idle is totally wasted time that could be fully reclaimed. I may still look into that sort of change but just haven't had the cycles yet with the decrease in our checksum time - although this particular discussion has sort of started me thinking about it again. I may review our current logs to see how much time is being wasted. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: Future RSYNC enhancement/improvement suggestions
Martin Pool [[EMAIL PROTECTED]] writes: I guess alternatively you could set the rsync timeout high, the line-drop timeout low, and make it dial on demand. That would let the line drop when rsync was really thinking hard, and it would come back up as necessary. Losing the ppp channel does not by itself interrupt any tcp sessions running across it, provided that you can recover the same ip address next time you connect. That assumes an environment where dial-on-demand is feasible. Unfortunately, our particular setup is a direct PC to PC dial, and there's no IP involved (it's Windows-Windows with NETBIOS/NETBEUI) so disconnecting would shut down the remote rsync. But it's an interesting thought for cases where it could get used. In general I'd expect it to be fairly fragile though unless you had complete control of the dial infrastructure or could otherwise ensure, as you note, identical IP address assignment. I don't suppose anyone knows any legacy reason why all the checksums are computed and stored in memory before transmission do they? I don't think at the time I could find any real requirement in the code that it be done that way - the sequence was pretty much generate/send/free. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: Future RSYNC enhancement/improvement suggestions
Martin Pool [[EMAIL PROTECTED]] writes: No, I think you could avoid it, and also avoid the up-front traversal of the tree, and possibly even do this while retaining some degree of wire compatibility. It will be a fair bit of work. Yeah, I was sort of thinking bang for the buck - munging with the file list handling reaches into far more code and would likely be far more effort to change within the current rsync source than the checksum transmission. I think the checksum would just be moving the equivalent of send_sums right into generate_sums and only touching the single generate.c module, with no noticeable difference on the wire or to other modules. I did go back and take a look at our current transfers for the one task for which this could make the most difference. For the ~110GB of data we synchronize each month (over V.34 dialup lines :-)), the wasted time with our current network/filesystem looks to be in aggregate only about 7.5 hours of phone time, which in turn is only about 1.6% of the ~480 hours used each month. So it's hard to worry extensively about that 1.6%. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
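(To illustrate what folding the send into the generate loop might look like, here is the rough shape of it. This is a sketch only - the struct and callbacks are made up, and the checksum math is a placeholder rather than rsync's actual weak/strong sums:)

    /* Rough shape of the idea only -- not the real generate.c code.  The
     * callbacks stand in for rsync's actual I/O and checksum routines. */
    struct sums_io {
        int  (*read_block)(void *ctx, char *buf, int maxlen); /* bytes read, 0 at EOF */
        void (*send_block_sums)(void *ctx, unsigned int weak,
                                const char *strong, int strong_len);
        void *ctx;
    };

    static void generate_and_send_sums(struct sums_io *io, int blocksize)
    {
        char block[32768];
        int n;

        if (blocksize > (int)sizeof block)
            blocksize = sizeof block;

        while ((n = io->read_block(io->ctx, block, blocksize)) > 0) {
            unsigned int weak = 0;
            char strong[16] = { 0 };     /* stand-in: the real code uses MD4 here */
            int i;

            for (i = 0; i < n; i++)      /* stand-in: the real weak sum is the
                                            rolling checksum, not a plain sum */
                weak += (unsigned char)block[i];

            /* the point: each block's sums hit the wire immediately instead
               of being accumulated in memory and sent after the whole pass */
            io->send_block_sums(io->ctx, weak, strong, (int)sizeof strong);
        }
    }

On the wire this could look exactly the same as today; only the buffering on the generating side changes.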
RE: mixed case file systems.
Martin Pool [[EMAIL PROTECTED]] writes: On 18 Apr 2002, David Bolen [EMAIL PROTECTED] wrote: A few caveats - both ends have to support the option - I couldn't make it backwards compatible because both ends exchange information about a sorted file list that has to sort the same way on either side (which very subtly bit me when I first did this). I was just going to say that :-) Heh .. and wow, is it confusing if you mess that up. Randomly transferring files that it shouldn't be, but even better, putting the contents of one file into another silently. It seems to me that it would have been better to have the side generating the list control the sequence and the receiving side simply obey it as transmitted, but that's neither here nor there at this point. The issue with the new command line option was a general issue of versioning command line options - since they get transmitted, obviously, on the command line, it's prior to any option negotiation. So I couldn't figure out any clean way to negotiate away from the ignore case if the remote side didn't support it. Originally I wanted it to default to case-insensitive under Windows, but that was guaranteed to break older versions, so I went back to an explicit option in all cases. But that seems to be a general issue with evolving options. Actually, it was this issue that also led me to add a small bit of code to io.c so that on an unexpected tag, it would dump any pending data (as ASCII if printable, hex otherwise), since without that you never got any of the remote command line parsing errors shown. But there are problems with that too since sometimes you may have a bunch of data in the stream on a real protocol failure. I'll put this into the patches/ repository. I'd like to study the problem a bit more and see if there isn't a better solution before we merge it. Perhaps something like the --fuzzy patch will make it detect them as renames. No problem - aside from the options processing (which is also the bulk of the patch), the patch does have the property that it's very simple; one comparison routine change and one new flag supplied to the existing fnmatch library module which already supported case-insensitivity as an option. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: Future RSYNC enhancement/improvement suggestions
Jan Rafaj [[EMAIL PROTECTED]] writes: How about adding a feature to keep the checksums in a berkeley-style database somewhere on the HDD separately, and with subsequent mirroring attempts, look to it just for the checksums, so that the rsync does not need to do checksumming of whole target (already mirrored) file tree ? There's a chicken and egg issue with this - how do you know that the separately stored checksum accurately reflects the file which it represents? Once they are stored separately they can get out of sync. The natural way to verify the checksum would be to recompute it, but then you're sort of back to square one. I know there have been discussions about this sort of thing on the list in the past. For multiple similar distributions, the rsync+ work (recently incorporated into the mainline rsync in experimental mode - the write-batch and read-batch options) helps remove repeated computations of the checksums and deltas, but it's not a generalized system for any random transfer. I've wanted similar benefits because we use dialup to remote locations and for databases with hundreds of MB or 1-2 GB, we end up wasting a bit of phone time when both sides are just computing checksums. But I'm not sure of a good generalized solution. There may be platform specific hacks (e.g., under NT, storing the computed checksum in a separate stream in the file, so it's guaranteed to be associated with the file), but I don't know of a portable way to link meta information with filesystem files. Note that if you aren't already, be sure that you up the default blocksize for large files - that can cut down significantly on both checksum computation time as well as meta data transferred over the session, since there are fewer blocks that need two checksums (weak + MD4) apiece. - make output of error status messages from rsync uniform, so that it could be easily parsed by scripts (it is not right now - rsync 2.5.5) I know Martin has expressed some interest to the list in having something like this in the future as an option. - perhaps if the network connection between rsync client and server stalls for some reason, implement something like 'tcp keepalive' feature ? I think rsync is pretty complicated at the network level already - it seems reasonable to me that rsync ought to be able to assume that the lowest level network protocol stack will get the data to the other end and/or give an error if something goes wrong without needing a lot of babysitting. In all but the rsync server cases, rsync doesn't control the network stream itself anyway (it just has a child process using ssh, rsh or anything else), so it becomes a question for that particular utility and not something rsync can do anything about. In the rsync server case, it already sets the TCP KEEPALIVE option at the socket level when it receives a connection. If your network transport between systems is problematic, there's a limited amount of stuff rsync can do about it. Oh and no, just being idle on a session shouldn't terminate it, no matter how long rsync takes to compute checksums. So if that's happening to you, you might want to investigate your network connectivity. Or perhaps you're going through a NAT or some sort of proxy box that places a timeout on TCP sessions that you can increase? Upon failures, if you use --partial and a separate destination directory you can keep re-trying and slowly get the whole file across (that's how we do our backups) but you do still need to recompute checksums each time.
It might be nice to see if rsync itself could have a retry mechanism that would re-use the existing checksum information it had computed previously. I have a feeling with the structure of the code at this point though that doing so would be reasonably complicated. The caveat to --partial is that once you have a partial file, even with --compare-dest, that partial file is all rsync considers for the remaining portion of the transfer. So originally for our database backups, I was removing any partial copy manually if it was less than some fraction of the previous copy I already had, since I'd lose less time rebuilding that fraction than losing access to the entire prior file. In response to that, there was another internal-use patch I made to rsync to --partial-pad any partial file with data from the original file on the destination system during an error. No guarantees it would work as well, since I just took data from the original file past the size point of the partial copy, but in many cases (growing files) it's a big win. If anyone is interested, I could extract it and post it. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150
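(Regarding the keepalive point above: enabling it at the socket level - which the rsync daemon already does on connections it accepts - is a one-line setsockopt. A minimal sketch:)

    #include <sys/socket.h>

    /* Turn on TCP keepalive probes for a connected socket.
     * Returns 0 on success, -1 on error (errno set by setsockopt). */
    static int enable_keepalive(int sock)
    {
        int on = 1;
        return setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
    }

Keep in mind that the probe interval is typically a system-wide setting (often a couple of hours by default), so this is more about eventually detecting a dead peer than about keeping aggressive NAT/proxy timeouts happy.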
RE: out of memory in build_hash_table
Eric Echter [[EMAIL PROTECTED]] writes: I recently installed rsync 2.5.5 on both my rsync server and client being used. I installed the latest version because I was having problems with rsync stalling with version 2.4.6 (I read that 2.5.5 was supposed to clear this up or at least give more appropriate errors). I am still having problems with rsync stalling even after upgrading to 2.5.5. It only stalls in the /home tree and there are approximately 385,000 (38.5 MB of memory give or take if the 100 bytes/file still pertains) files in that tree. The key growth factor for the file construction is going to be per-file information that's about 56 bytes (last I looked) per file. However, the kicker is that the file storage block gets resized exponentially as you have more files. So for 385,000 files it'll actually be a 512,000 file block of about 30MB. (So yeah, I suppose an ~50 byte file chunk in memory growing as a power of 2 might average out close to 100 bytes/file as an estimate :-)) ERROR: out of memory in build_hash_table rsync error: error allocating core memory buffers (code 22) at util.c(232) Seems like that's just a real out of memory error. You'll only get that error if a malloc() call returned NULL. I presume there's still enough virtual memory available on the server at the point when this fails? Could you be running into a process limit on virtual memory? What's a ulimit -a show for a server process? I think under Linux the default settings are in /etc/security/limits.conf, maybe by default processes on the server are limited to 32MB of memory or something? -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: mixed case file systems.
Peter Tattam [[EMAIL PROTECTED]] writes: Given the interoperability problems between versions and the risk of data loss, I think I will have to wait till this option is in the mainstream. My alternative workaround is to write a utility to rename all files on the errant file system to be all lower case. Of course it's entirely up to you to try it or not. But just so you don't misunderstand my amusing developer anecdote ... any data loss occurred during the development of the patch. The patch as it stands now definitely works properly, and we've used it plenty of times successfully. (The interoperability between versions is only true if you use the new option, and that's the same as any of the other options that have been added over time - it's a generic problem for rsync command line option evolution) -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: out of memory in build_hash_table
Eric Echter [[EMAIL PROTECTED]] writes: I also checked the /etc/security/limits.conf file and everything in the file is commented out. Are there default limits if there are no actual settings in this file that may be causing problems? Your assumption about the memory limit on processes sounds correct, but I can't find any reasoning for this from the system settings. Thanks a bunch for the response. I'm not that familiar with Linux defaults, but your ulimit -a should reflect what the process actually has and it certainly looks good. That assumes that you are running rsync on the server under root (either via the rsh/ssh path, or as a daemon). If you're running it as some other user, you should ensure you check the ulimit -a under a process running as that user. Perhaps /etc/profile makes adjustments? If not, you might watch virtual memory stats while running the failing operation (e.g., have a window using vmstat or top running on the server when you try the copy) to see if there's anything amiss looking at the overall server level - or if perhaps rsync is somehow burning up much more memory than we're estimating. Beyond that though, I suppose perhaps a linux-oriented group would offer further suggestions, under the assumption that something must be leading to the malloc() failing. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: mixed case file systems.
Peter Tattam [[EMAIL PROTECTED]] writes: I believe a suitable workaround would be to ignore case for file names when the rsync process is undertaken. Is this facility available or planned in the near future? I've attached a context diff for some changes I made to our local copy a while back to add an --ignore-case option just for this purpose. In our case it came up in the context of disting between NTFS and FAT remote systems. I think we ended up not needing it, but it does make rsync match filenames in a case insensitive manner, so it might at least be worth trying to see if it resolves your issue. A few caveats - both ends have to support the option - I couldn't make it backwards compatible because both ends exchange information about a sorted file list that has to sort the same way on either side (which very subtly bit me when I first did this). I also didn't bump the protocol in this patch (wasn't quite sure it was appropriate just for an incompatible command line option) since it was for local use. The patch is based on a 2.4.x series rsync, but if it doesn't apply cleanly to 2.5.x, it should be simple enough to just apply manually. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ - - - - - - - - - - - - - - - - - - - - - - - - - Index: options.c === RCS file: e:/binaries/cvs/ni/bin/rsync/options.c,v retrieving revision 1.5 retrieving revision 1.7 diff -c -r1.5 -r1.7 *** options.c 2000/12/28 00:30:18 1.5 --- options.c 2001/06/20 19:25:24 1.7 *** *** 72,77 --- 72,78 #else int modify_window=0; #endif /* _WIN32 */ + int ignore_case=0; int modify_window_set=0; int delete_sent=0; *** *** 162,167 --- 164,170 rprintf(F, --exclude-from=FILE exclude patterns listed in FILE\n); rprintf(F, --include=PATTERN don't exclude files matching PATTERN\n); rprintf(F, --include-from=FILE don't exclude patterns listed in FILE\n); + rprintf(F, --ignore-case ignore case when comparing filenames\n); rprintf(F, --version print version number\n); rprintf(F, --daemonrun as a rsync daemon\n); rprintf(F, --address bind to the specified address\n); *** *** 186,194 OPT_PROGRESS, OPT_COPY_UNSAFE_LINKS, OPT_SAFE_LINKS, OPT_COMPARE_DEST, OPT_LOG_FORMAT, OPT_PASSWORD_FILE, OPT_SIZE_ONLY, OPT_ADDRESS, OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR, ! OPT_IGNORE_ERRORS, OPT_MODIFY_WINDOW, OPT_DELETE_SENT}; ! static char *short_options = oblLWHpguDCtcahvqrRIxnSe:B:T:zP; static struct option long_options[] = { {version, 0, 0,OPT_VERSION}, --- 189,198 OPT_PROGRESS, OPT_COPY_UNSAFE_LINKS, OPT_SAFE_LINKS, OPT_COMPARE_DEST, OPT_LOG_FORMAT, OPT_PASSWORD_FILE, OPT_SIZE_ONLY, OPT_ADDRESS, OPT_DELETE_AFTER, OPT_EXISTING, OPT_MAX_DELETE, OPT_BACKUP_DIR, ! OPT_IGNORE_ERRORS, OPT_MODIFY_WINDOW, OPT_DELETE_SENT, ! OPT_IGNORE_CASE}; !
static char *short_options = oblLWHpguDCtcahvqrRIxnSe:B:T:zP; static struct option long_options[] = { {version, 0, 0,OPT_VERSION}, *** *** 204,209 --- 208,214 {exclude-from,1, 0,OPT_EXCLUDE_FROM}, {include, 1, 0,OPT_INCLUDE}, {include-from,1, 0,OPT_INCLUDE_FROM}, + {ignore-case, 0, 0,OPT_IGNORE_CASE}, {rsync-path, 1, 0,OPT_RSYNC_PATH}, {password-file, 1,0, OPT_PASSWORD_FILE}, {one-file-system,0, 0,'x'}, *** *** 401,406 --- 406,415 add_exclude_file(optarg,1, 1); break; + case OPT_IGNORE_CASE: + ignore_case=1; + break; + case OPT_COPY_UNSAFE_LINKS: copy_unsafe_links=1; break; *** *** 712,717 --- 727,736 slprintf(mwindow,sizeof(mwindow),--modify-window=%d, modify_window); args[ac++] = mwindow; + } + + if (ignore_case) { + args[ac++] = --ignore-case; } if (keep_partial) Index: exclude.c === RCS file: e:/binaries/cvs/ni/bin/rsync/exclude.c,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -c -r1.1.1.1 -r1.2 *** exclude.c 2000/05/30 18:08:19 1.1.1.1
RE: Non-determinism
Berend Tober [[EMAIL PROTECTED]] writes: That was my point about comparing rsync to sending the entire file using say, ftp or cp. That is, one might think that sending the entire file via ftp or cp will produce an exact file copy, however the actual transmission of the data takes the form of electrical signals on a wire that must be detected at the receiving end. The detection process must have some probability of false alarm/missed detection characteristic and so there must be some estimate of the probability of ftp and cp failing to produce a reliable copy. So while the software algorithms of ftp and cp are deterministic, there must be some quantifiable probability of failure nonetheless. The difference with rsync is that not only are the same effects of data corruption at work as with ftp and cp, but the algorithm itself introduces non-determinism. Except of course that rsync uses its own final checksum to balance out its risk of incorrectly deciding a block is the same. If the final full-file checksum doesn't match, then rsync automatically restarts the transfer (using a slightly different seed, I believe). Thus, it's fairly accurate to compare rsync to performing an ftp or cp and then doing a full checksum on the file, so one could argue it's actually more reliable than a straight ftp/cp without the checksum. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
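(In outline, that safety net looks something like the sketch below. The helper names are hypothetical and the details are simplified - it is only meant to show why a false block match can't silently survive the final whole-file check:)

    /* Hypothetical outline only -- the helpers below are made up and this
     * is not rsync's actual code or APIs. */
    extern int  base_seed;                                              /* assumed */
    extern void delta_transfer(const char *src, const char *dst, int seed);
    extern int  whole_file_digests_match(const char *src, const char *dst);

    static int transfer_with_verify(const char *src, const char *dst)
    {
        int attempt;

        for (attempt = 0; attempt < 2; attempt++) {
            /* a different seed changes all the block checksums, so a prior
               false match is overwhelmingly unlikely to recur */
            delta_transfer(src, dst, base_seed + attempt);
            if (whole_file_digests_match(src, dst))
                return 0;       /* strong whole-file checksum agrees: done */
        }
        return -1;              /* give up and report an error */
    }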
RE: Non-determinism
Martin Pool [[EMAIL PROTECTED]] writes: To put it in simple language, the probability of a file transmission error being undetected by MD4 message digest is believed to be approximately one in one thousand million million million million million million. I think that's one duodecillion :-) As a cryptographic message-digest hash, MD4 (and MD5) is designed to require about 2^128 operations to crack a specific digest (find the original source), but probably only on the order of 2^64 operations to find two messages that have the same digest. But even that isn't a direct translation to the probability that two random input strings might hash to the same value. There's an interesting thread from sci.crypt from late last year that had some addressing of this question: http://groups.google.com/groups?threadm=u21i5llf2bpt03%40corp.supernews.com in which, for one of the examples where the computation was followed through (the odds of a collision when keeping all 128 bits of the hash and running it against about 67 million files), the probability of a collision was about 2^-77. So I suppose you'd sort of have to figure out what you wanted to declare your universe of files to be since more files would increase the odds and fewer files decrease them. It's about at this point that I sit back and just say, that's one tiny probability! It is interesting that MD4 has been a cracked algorithm for a while now, so if someone was explicitly trying to forge a file that would fool it, it's very doable. But I doubt that changes the odds on two random files colliding. MD5 has not yet had any duplication found (and plenty of protocols currently assume there aren't any), but it's far more computationally intensive to compute, so I think MD4 is more than sufficient for rsync. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
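(If you want to sanity-check that 2^-77 figure, the usual birthday approximation P ~= n^2 / 2^(b+1) gets you there with n ~= 67 million files and b = 128 bits. A quick sketch of the arithmetic - the file count is just the one used in that thread:)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 67e6;        /* number of files considered */
        double bits = 128.0;    /* digest width */

        /* birthday approximation: P(collision) ~= n^2 / 2^(bits+1) */
        double log2p = 2.0 * log2(n) - (bits + 1.0);
        printf("P(collision) ~= 2^%.1f\n", log2p);   /* roughly 2^-77 */
        return 0;
    }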
RE: how to take least risk on rsync dir
Patrick Hsieh [[EMAIL PROTECTED]] writes: When rsync dir_A to dir_B, I hope I won't make any change to the original dir_B unless the rsync procedure ends without errors, therefore, I hope there's something like rsync -av dir_A dir_B_tmp \ mv dir_B dir_B.bkup mv dir_B_tmp dir_B This small script can ensure the minimal change time between 2 versions of archive. Is this built into the native rsync function? Do I have to write scripts myself? rsync's default behavior ensures this sort of minimal change time, but only at a per file level. That is, each file is actually built as a temporary copy and then only renamed on top of the original file as a final step. Of course, that's largely a requirement so rsync can use the original file as a source for the new file, but it also serves to preclude interruption of the original file as long as possible. But if you want the same sort of assurances at something larger than a file level (e.g., a directory as above), then yes, you need to impose that on your own. For example, when backing up databases (where I need to keep the database backup and transaction logs in sync), I copy them into a temporary directory and only overlay the primary destination files when fully done. The simplest way to do it is close to what you have, but there are a few things you need to be aware of. First, you'll want to use the rsync --compare-dest option so that it can still find the original files on the destination system for its algorithm - otherwise it'll send over the entire contents of the source files and not use what it can from the original files. Second, you need to realize that by default rsync will only copy files that have changed (by default based on size/timestamp unless you add the -c (checksum) option). So if you do what you have above you'll end up losing files that hadn't changed since they won't exist in dir_B_tmp. You can override this with the -I option at the expense of a small amount of extra data transferred for the unchanged files. So you could do something like: rsync -av -I --compare-dest=B_tmp_to_B_path dir_A dir_B_tmp Note that the --compare-dest argument is a relative path to get from the destination directory (dir_B_tmp in this case) back to the original directory (dir_B). rsync won't touch that directory, but it will use the files within it as masters for the new copy. This will result in dir_B_tmp being a complete copy of dir_A, using the original dir_B as a master whenever possible. This all assumes that you're doing remote copies where the rsync protocol makes sense (you don't show a remote system in your example). If you're just making local copies, it would be better to use -W instead, but you'd still need -I if you wanted files matching those in the current dir_B to be transferred. Then again, for a local setup where you want to update the whole directory, a simple copy may be as effective as rsync, since you're not benefitting from the selection of a subset of files. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: is it a bug or a feature? re: time zone differences, laptops, and suggestion for a new option
Martin Pool [[EMAIL PROTECTED]] writes: Linux stores file times in UTC, and rsync transfers them in UTC. I thought that NT and XP did too, but perhaps not, or perhaps there is a problem with Cygwin. (...) It depends on the filesystem under Windows. NTFS uses UTC for timestamps, but the FAT* variants use local time. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: File over 2GB using Cygwin
Martin Pool [[EMAIL PROTECTED]] writes: It could be an interesting project to try to build rsync under MSVC++. Presumably it can handle large files. I don't think there's anything impossible in principle about it. Not in principle, but unless you're also going to handle the same fork emulation and Unix semantics that Cygwin is doing, it's unlikely to be a weekend project :-) In theory a native port might use threads and overlapping I/O very effectively, but I think it would be fairly tough to do without some significant changes to the existing code. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/ -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: rsync dir in _both_ directions?
Jack McKinney [[EMAIL PROTECTED]] writes: If I add 512 bytes at the begining of the file, then I would expect it. If I only add 14 bytes, then I don't think rsync will detect this, as it would require it to compute checksums start at EVERY byte instead of 512 byte checksums at offsets 0, 512, 1024, 1536, et al. Yep, and that's precisely what rsync does. It actually uses two types of checksums. One is a fast rolling checksum that can be efficiently computed with a block starting at _every_ byte in the file. The nature of the checksum is that you can compute its new value starting at byte X+1, based on its old value from a block starting at X by only performing a single computation based on the new byte at the end of the block starting at X+1. But the penalty you pay for the speed is that it's a weaker checksum - you can have inaccurately identified matches (e.g., overlaps in the checksum). So there's a second, much stronger checksum, but much slower, that is used to validate a match once the first checksum thinks it found a match. When you transmit a file, the sender computes both checksums for each block in the file it has and sends them over. The receiver then walks its current file, taking block size chunks _at every byte_ and computing the weak/fast checksum. If the weak matches, it then does the stronger checksum, and if that matches, it knows it need not request that block of data from the sender. This will match common blocks located anywhere within the file at any offset (including re-using a source block multiple times to reproduce the target). You might want to read the tech paper on rsync and its protocol, since it goes into this in much more detail. If all rsync did was match on finite block boundaries, it would be _way_ less useful than it really is. It is an easy experiment. (...) (...) I suspect that your xfer time will be comparable to the first one, not to the second. Since it's an easy experiment - why suspect - did you try this? It should take virtually no time for the second (sans the initial checksum computation and transmission, which to be fair for large files and small block sizes can be quite significant). -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
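(To make the "rolling" property concrete, here's a simplified version of that style of weak checksum - modeled on the idea in the rsync technical report rather than copied from the source, so the real one differs in details such as bit packing and a character offset:)

    #include <stddef.h>
    #include <stdint.h>

    /* Weak checksum of buf[0..len-1]: s1 is the plain byte sum, s2 weights
     * earlier bytes more heavily.  Both kept mod 2^16 here for simplicity. */
    static uint32_t weak_sum(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            s1 += buf[i];
            s2 += s1;
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }

    /* Slide the window one byte: drop 'out' (the byte leaving the front),
     * add 'in' (the byte entering at the back) -- O(1) instead of O(len). */
    static uint32_t weak_sum_roll(uint32_t sum, size_t len,
                                  unsigned char out, unsigned char in)
    {
        uint32_t s1 = sum & 0xffff, s2 = sum >> 16;

        s1 = (s1 - out + in) & 0xffff;
        s2 = (s2 - (uint32_t)len * out + s1) & 0xffff;
        return s1 | (s2 << 16);
    }

A hit on this weak sum only nominates a candidate block; the match is then confirmed with the strong (MD4) checksum as described above.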
RE: efficient file appends
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: It seems to me that this situation is common enough that the rsync protocol should look for it as a special case. Once the protocol has determined from differing timestamps and/or lengths that a file needs to be synchronized, the receiver should return a hash (and length) of its copy of the entire file to the sender. The sender then computes the hash for the corresponding leading segment of its copy. If they match, the sender simply sends the newly appended data and instructs the receiver to append it to its copy. While potentially a useful option, you wouldn't want the protocol to automatically always check for it, since it would preclude rsync on the sending side from being able to use part of the original file when transmitting the newly added data to the receiver. While perhaps not helpful for log files, it can be a big win for other files, even if the current copy on the receiver matches the sender's initial portion. So at best, you'd only want to enable this option if the only thing for the entire set of files in a given run were files known to expand this way. Alternatively, even with rsync the way it is today, what I do is manually bump up the blocksize to something large (say 16 or 32K). This results in far fewer blocks for the checksum algorithm (from perhaps 10-45x depending on original file size based on the default dynamic blocksize selection) and thus minimizes the meta data transmitted for the common portion of the file. It works pretty well for me with database transaction log files which get pretty big. You can probably find some past e-mail on the subject in the list by looking for threads about rsync blocksize. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: efficient file appends
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: While potentially a useful option, you wouldn't want the protocol to automatically always check for it, since it would preclude rsync on This extension need not break any existing mechanism; if the hash of the receiver's copy of the file doesn't match the start of the sender's file, the protocol would continue as before. Well, my point was that even if it does match, you might still want the protocol to continue as before. For example, if you have a file that grows, but tends to contain similar information. In that case, you still want the per-block checksum information from the destination because that way the source can use that information to minimize the amount of new information to transmit. Without having the per-block information, it can't tell how to extract data from the current copy at the destination to re-use for the new data rather than sending the new data directly. Not a big deal for appending log files (as long as they have changing date strings), but not necessarily something to have enabled by default. Alternatively, even with rsync the way it is today, what I do is manually bump up the blocksize to something large (say 16 or 32K). This sounds like an excellent idea, and I'll give it a try. As the blocksize reaches the receiver's file size, the scheme essentially approaches my idea. Hmm, I've never tried _really_ large block sizes (I thought I had problems if I got close to 64K, but I may be mis-remembering). The one drawback to the larger block sizes is that if you do encounter any differences, you'll retransmit more information than necessary, but if you know beforehand it's definitely just appended data, that won't be the case. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: definite data corruption in 2.5.0 with -z option
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: After I sent my note, I ran some more experiments and found the problem goes away if I use the default checksum blocksize. So the problem occurs *only* if I use a large blocksize (65536) *and* enable compression. Should have read ahead - this is probably the problem I was recalling in my last reply. There was some reason I tended to keep my variable block sizes that my scripts were picking <= 32K or so. If that's it, then I doubt it's the bit length overflow issue since I was running into this back with a modified 2.4.3. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: move rsync development tree to BitKeeper?
You can find a lot more information about the differences here: http://bitkeeper.com/4.1.1.html BitKeeper is not strictly Open Source, but arguably good enough. I guess "arguably" is if you don't mind having all your metadata logged to an open logging server? The proposed plan is to convert the existing repository, retaining all history, some time in December. At this point CVS will become read-only and retain historical versions. I'm curious about the driving force here. You talk about switching, but don't really mention much about why - other than to get feet wet before using it for other projects. So is it really the other projects that have specific needs? Is there specific functionality lacking in CVS that is trying to be fixed? At least for me, CVS is more convenient since it works with all the open projects I use (and yeah, is easier in terms of licensing). I don't have strong objections to a change, but as one user who does tend to track the source tree and not just releases, I definitely would prefer to continue to see (as you did suggest) alternative access to the current source tree (even if only daily snapshots), since at least for me rsync would be the only BK project I'd care about - it's not clear I'd want to bother with the client. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Why does one of these work and the other doesn't
From: Randy Kramer [mailto:[EMAIL PROTECTED]] I am not sure which end the 100 bytes per file applies to, and I guess that is the RAM memory footprint? Does rsync need 100 bytes for each file that might be transferred during a session (all files in the specified directory(ies)), or does it need only 100 bytes as it does one file at a time? Yes, the ~100 bytes is in RAM - I think a key point though is that the storage to hold the file list grows exponentially (doubling each time), so if you have a lot of files in the worst case you can use almost twice as much memory as needed. Here's an analysis I posted to the list a while back that I think is still probably valid for the current versions of rsync - a later followup noted that it didn't include an ~28 byte structure for each entry in the include/exclude list: - - - - - - - - - - - - - - - - - - - - - - - - - (a) How much memory, in bytes/file, does rsync allocate? This is only based on my informal code peeks in the past, so take it with a grain of salt - I don't know if anyone has done a more formal memory analysis. I believe that the major driving factors in memory usage that I can see are: 1. The per-file overhead in the filelist for each file in the system. The memory is kept for all files for the life of the rsync process. I believe this is 56 bytes per file (it's a file_list structure), but a critical point is that it is allocated initially for 1000 files, but then grows exponentially (doubling). So the space will grow as 1000, 2000, 4000, 8000 etc.. until it has enough room for the files necessary. This means you might, worst case, have just about twice as much memory as necessary, but it reduces the reallocation calls quite a bit. At ~56K per 1000 files, if you've got a file system with 10,000 files in it, you'll allocate room for 16000 and use up 896K. This growth pattern seems to occur on both sender and receiver of any given file list (e.g., I don't see a transfer of the total count over the wire used to optimize the allocation on the receiver). 2. The per-block overhead for the checksums for each file as it is processed. This memory exists only for the duration of one file. This is 32 bytes per block (a sum_buf) allocated as one memory chunk. This exists on the receiver as it is computed and transmitted, and on the sender as it receives it and uses it to match against the new file. 3. The match tables built to determine the delta between the original file and the new file. I haven't looked closely at this section of code, but I believe we're basically talking about the hash table, which is going to be a one time (during rsync execution) 256K for the tag table and then 8 (or maybe 6 if your compiler doesn't pad the target struct) bytes per block of the file being worked on, which only exists for the duration of the file. This only occurs on the sender. There is also some fixed space for various things - I think the largest of which is up to 256K for the buffer used to map files. (b) Is this the same for the rsyncs on both ends, or is there some asymmetry there? There's asymmetry. Both sides need the memory to handle the lists of files involved. But while the receiver just constructs the checksums and sends them, and then waits for instructions on how to build the new file (either new data or pulling from the old file), the sender also constructs the hash of those checksums to use while walking through the new file. So in general on any given transfer, I think the sender will end up using a bit more memory.
(c) Does it matter whether pushing or pulling? Yes, inasmuch as the asymmetry is based on who is sending and who is receiving a given file. It doesn't matter who initiates the contact, but the direction that the files are flowing. This is due to the algorithm (the sender is the component that has to construct the mapping from the new file using portions of the old file as transmitted by the receiver). - - - - - - - - - - - - - - - - - - - - - - - - - -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
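(As a rough picture of the sender-side matching structures mentioned in item 3 above - the layout below is from memory of the 2.4.x-era code, so field widths and padding may differ in any given version:)

    #include <stdint.h>

    #define TABLESIZE (1 << 16)     /* one slot per 16-bit tag => 256KB of int32 */

    struct target {                 /* one per block of the old file (~6-8 bytes) */
        uint16_t tag;               /* 16-bit tag derived from the weak checksum */
        uint32_t index;             /* which block of the old file this is */
    };

    /* tag_table[t] points at the first entry in a tag-sorted array of
     * 'struct target' whose tag is t (or -1 if none), so a weak-checksum hit
     * narrows the search to a few candidate blocks before the strong
     * checksum is ever compared. */
    static int32_t tag_table[TABLESIZE];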
RE: Rsync: Re: patch to enable faster mirroring of large filesystems
Keating, Tim [[EMAIL PROTECTED]] writes: - If there's a mismatch, the client sends over the entire .checksum file. The server does the compare and sends back a list of files to delete and a list of files to update. (And now I think of it, it would probably be better if the server just sent the client back the list of files and let the client figure out what it needed, since this would distribute the work better.) Whenever caching checksums comes up I'm always curious - how do you figure out if your checksum cache is still valid (e.g., properly associated with its file) without re-checksumming the files? Are you just trusting size/timestamp? I know in my case I've got database files that don't change timestamp/size and yet have different contents. Thus I'd always have to do full checksums so I'm not sure what a cache would buy. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Rsync: Re: patch to enable faster mirroring of large filesystems
Keating, Tim [[EMAIL PROTECTED]] writes: Is there a way you could query your database to tell you which extents have data that has been modified within a certain timeframe? Not in any practical way that I know of. It's not normally a major hassle for us since rsync is used for a central backup that occurs on a large enough time scale that the timestamp does normally change from the prior time. So our controlling script just does its own timestamp comparison and only activates the -c rsync option (which definitely increases overhead) if they happen to match. Although I will say that the whole behavior (the transaction log always has an appropriate timestamp, it's just the raw database file itself that doesn't) sure caught me by surprise in the beginning after finding what I thought was a valid backup wouldn't load :-) -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Block Size
Thomas Lambert [[EMAIL PROTECTED]] writes: What is the default block size? I have a few files 30+mb and data is just added to the end of them. It seems like it takes longer to sync them than it was to send it initially. Should I change the block size or something else? The default is an adaptive block size. It's based on the file size divided by 10,000, truncated to a multiple of 16, with a minimum of 700 and a maximum of 16K (16384). So your 30MB file ought to be using 16K blocks. And yes, depending on your machines (memory and CPU), it can take a while to synchronize such files because rsync has to compute two checksums per block, keeping that in memory, before making the transfer. During the first transfer rsync knows there is no target file, so it doesn't bother with any of that but just sends the bytes. If you know something about the construction of your file, manually selecting a block size can be very helpful, since it helps optimize how many changes rsync finds. For example, when transferring database files that I know have a 1K page size, I always keep block sizes a multiple of 1K, since otherwise a single page change in the database might affect two rsync blocks. I then scale the block size by database size to help keep the total number of blocks down, since that burns memory and computation time. My database transaction log files are very similar to your file - they constantly grow so I'm really always only catching up the tail end of the file. For those, I use as large a block size as feasible. However, my files aren't as large (we truncate a lot) so I use 16K myself. I believe I've had it work up closer to 32K but then had some problems, so there may be some signed number issues (e.g., stick just below 32K). Not sure how much that would help, although it'll reduce your block count by about a factor of 2. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
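(From memory, the adaptive sizing is essentially the following - an approximate reconstruction of the old adapt_block_size() heuristic, so double-check the constants against your version before relying on them:)

    #define BLOCK_SIZE     700          /* default/minimum block size */
    #define MAX_BLOCK_SIZE (16 * 1024)  /* cap used by the adaptive logic */

    static int adapt_block_size(long file_len, int requested)
    {
        int ret;

        if (requested != BLOCK_SIZE)    /* user gave --block-size: honor it */
            return requested;
        ret = (int)(file_len / 10000);  /* aim for roughly 10,000 blocks per file */
        ret &= ~15;                     /* truncate to a multiple of 16 */
        if (ret < BLOCK_SIZE) ret = BLOCK_SIZE;
        if (ret > MAX_BLOCK_SIZE) ret = MAX_BLOCK_SIZE;
        return ret;
    }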
RE: Block Size
(I previously wrote) So your 30MB file ought to be using 16K blocks Whoops - my fault for assuming 30MB was large enough and skipping the calculations. Turns out that really only yields about a 3K block size with the adaptive algorithm. So you can get significant reduction in blocks by using a larger value (16K for example). -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: times difference causes write
Don Mahurin [[EMAIL PROTECTED]] writes: My second problem is that the flash is of limited size, so I need some sort of patched rsync that does not keep the old file before writing the new one. My patch now just unlinks the file ahead of time, and implies -W. Sounds reasonable as long as you force the -W. So my wish was that a time discrepancy would lead to a checksum, where the files would match. This is not the case, however, as you say. At least not with -W. In most cases, the time discrepancy would then cause rsync to try to synchronize the file, and during its protocol processing it would determine that it didn't need to send anything, thus the only end result would be adjusting the remote timestamp to match the source. But this requires access to the original copy of the file on the destination, so your prior patch (and forcing -W) defeats this as a side effect. So for now, I must use -c. It's slow, but I know that I get the minimum number of writes. It definitely sounds like the best match for you. Although -c tends to be used more for cases where files may differ although they appear the same (timestamp/size) than vice versa, it will serve that purpose as well at the expense of some additional I/O and computation. Presumably you could modify your patch so that -c (or some new option) only invoked the checksum if the timestamp differed, since I don't think there's any suitable equivalent currently in rsync. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
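To illustrate the combination (filenames invented; -c and -W are both stock options, while the unlink-ahead behavior only comes from your patch):

  % rsync -avcW /images/rootfs.img target:/flash/rootfs.img

i.e., -c decides purely by checksum whether to send the file at all, and -W means that when it does send, it just rewrites the whole thing rather than trying the delta algorithm against a copy your patch has already unlinked.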
RE: rsync recursion question
Justin Banks [[EMAIL PROTECTED]] writes: If your suggestion worked, that would be just fine with me. Actually, I guess it's fine anyway, I'll just have to maintain my patch ;) This is probably obvious, but just in case it isn't, CVS makes this fairly trivial (importing the main rsync releases and then developing your own changes on the mainline) to maintain over time. Tracking local changes to third party sources is one of its strengths. That's how I maintain our internal version of rsync which has a variety of local changes. Most I eventually submit back for possible inclusion in the main release (after some burn-in time in local use), but there are some that aren't general purpose enough, so they just stay in our repository. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Does RSYNC work over NFS?
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: consider, however, a slow pipe between systems, one or more mounting filesystems via nfs over a fast connection. the lan connection to the nfs is negligible versus the rsync connection from server to server. Oh, I'd agree with that. But then to me you aren't running rsync over the NFS connection, but over the slow LAN connection. I took the original question to mean using rsync over an NFS connection serving as the link between source and destination (in which case only -W makes sense), but in re-reading the subject, it's a tad ambiguous and could certainly include the above scenario. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Problem with transfering large files.
Dave Dykstra [[EMAIL PROTECTED]] writes: On September 9 Tridge submitted a fix to CVS for that problem. See revision 1.25 at http://pserver.samba.org/cgi-bin/cvsweb/rsync/generator.c I'm not sure that fixes the use of the timeout for the overall process. See a recent answer by me to this list in the Feedback on 2.4.7pre1 thread, which included an older patch from last year that I've been using since then. The overall timeout problem is due to the parent process doing a read_int on the child process to wait for final completion, which is subject to the same timeout setting as the child process is using on individual I/O. But the parent won't hear from the child until it's fully done. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: lock files
Dietrich Baluyot [[EMAIL PROTECTED]] writes: Does rsync lock the source files while its copying? No (there's really no guaranteed portable way to do it anyway). But rsync does perform a final checksum on transferred files and if they differ, it will re-execute the transfer (with an adjustment to the checksum algorithm). This is intended to catch the rare case where the checksums can be fooled into thinking a portion of the file is unchanged, but it can also catch changes under the covers while rsync is moving the file. However, it's not an absolute guarantee since rsync uses stored directory information (such as overall file size) I believe, so it's possible for a growing file to not include everything added during the execution of rsync. If you need absolute guarantees on a changing file, you need to apply your own locking around rsync. For example, in copying back database backups I use a script that creates a lock file before running rsync, and that lock file is also checked by the backup script. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
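As a rough sketch of the kind of wrapper I mean (illustrative only - the paths and names are made up, and a touch-based lock like this isn't bulletproof against races):

  #!/bin/sh
  LOCK=/var/run/dbcopy.lock
  # don't start the copy if the backup script is mid-write
  if [ -e "$LOCK" ]; then
      echo "backup in progress, skipping" >&2
      exit 1
  fi
  touch "$LOCK"
  rsync -av /backups/db/ desthost:/backups/db/
  rc=$?
  rm -f "$LOCK"
  exit $rc

The backup script checks for (and creates) the same lock file before writing its dump, so the two never run over each other.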
RE: Can rsync synchronize design changes to tables and data between two Microsoft ACCESS replicas, mdb files?
R. Weisz [[EMAIL PROTECTED]] writes: Has anyone using rsync ever tried using it to manage the replication and synchronization process for Microsoft ACCESS replicas? If so, Not for Microsoft ACCESS, but we synchronize copies of SQL/Anywhere databases constantly. As long as you're not trying to synchronize files that are actively in use (which may prevent rsync from reading portions of the files) there's no reason why it won't work for any database file. Rsync itself is just treating the files as arbitrary binary data - it doesn't care about any structure to the file data, whether it be database storage, a word processing document, or just flat text. One small suggestion for efficiency - for our database transfers, we keep the blocksize at some multiple of the underlying database page size (1K for our SQL/Anywhere databases) since it's the nature of the beast that all changes to the database will occur within those boundaries. It's not guaranteed to be more efficient, but we've found it to be so (it prevents multiple rsync blocks from being involved in a single database change). I believe for the Jet engine that Access uses, the page size was 2K prior to Jet 4.0 and 4K afterwards. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
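So if you did try it with Access files, the starting point I'd suggest looks something like this (paths invented, and assuming a Jet 4.0 .mdb, hence the 4K block size):

  % rsync -avz --block-size=4096 /data/replica.mdb remotehost:/replicas/replica.mdb

with 2048 instead for a pre-4.0 database.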
RE: another data point re: asymmetric route problem
Adam McKenna [[EMAIL PROTECTED]] writes: Well, the route to my other secondary dns server recently became asymmetric, and, as expected, the rsyncs between the primrary and that box are hanging now, too. Have you tried running with Wayne's no-hang patches applied? The asymmetry might be affecting timing of information flow, and it would be interesting if that exacerbated any of the characteristics that his buffering changes were focused on addressing. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Problem with --compare-dest=/
Dave Dykstra [[EMAIL PROTECTED]] writes: Perhaps it should be using clean_fname(). Please try making a fix using clean_fname() or some other way if it looks better, test it out, and submit a patch. I wrote that option so I'll make sure the patch gets in if I think it looks good. Ok. clean_fname() has its own problem though because it eliminates double // even at the beginning of the path, which while it would fix this specific case, would break if I was actually trying to use a UNC compare-dest. I think I tried fixing clean_fname() to avoid this case in the past but ran into problems with other portions of the code that depended on it cleaning this up. I'll poke around as I get a chance - right now we're prepping for a big deployment so I think I'll go the /. route for the near term :-) -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
Problem with --compare-dest=/
We do a bunch of distribution to remote machines (under Windows NT) using compare-dest to make use of existing files on those machines when possible. Up until now, the comparison directory has always been at least one level down in the directory hierarchy, but we've just started doing full recursive distributions of files that mirror the structure of our C: drive itself, and I think I've run into a buglet with --compare-dest. The code that uses the compare-dest value blindly builds up the comparison using %s/%s (with compare-dest and filename), so if you set --compare-dest=/, you end up with filenames like //dir/name - the leading double slash may not matter under most Unixes, but it does under Windows (Cygwin still uses it as a UNC path), and I think POSIX permits such a path to be system-specific (repeated slashes within a path must be treated as a single slash, but a path starting with exactly two slashes may be interpreted in an implementation-defined way). Anyway, a quick workaround appears to be to use /. rather than just / for compare-dest (or actually, since it's Windows, using C: works too), but I'm guessing it's something that should probably be handled inside of rsync better? -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
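In other words (the hostname and paths here are invented - only the --compare-dest value is the point), instead of:

  % rsync -av --compare-dest=/ buildhost:dist/ /target/

the workaround is:

  % rsync -av --compare-dest=/. buildhost:dist/ /target/

so the comparison filenames come out as /./dir/name rather than //dir/name.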
RE: RSync on NT
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: Has anybody had any luck getting RSync to work with WinNT 4.0? Yep. At least compiled for use with Cygwin, it works fine. I do use a local tool to make a named pipe connection to a target machine rather than rsh, but any old path should work. I am interested in using RSync in a non-daemon mode. How do I specify drive/directory paths along with host names? If I issue this command: You can't really use native drive specifications because as you note rsync uses the Unix convention of separating the system from the path with a colon. But since Rsync is built on top of Cygwin (I'm presuming that's how you built it) you can use the standard /cygdrive/? notation (or until they formally remove it, the deprecated //? notation) to select a drive. You should also be able to use //system/share to access network shares. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
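For example (the host and share names are made up), from a Cygwin-built rsync these forms should all be usable:

  % rsync -av /cygdrive/c/data/ otherhost:/cygdrive/d/backup/
  % rsync -av //fileserver/share/data/ /cygdrive/c/data/

the first using the /cygdrive/? notation on both ends, the second reading directly from a network share.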
RE: problems encountered in 2.4.6
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: Dave Dykstra wrote: That's two different kinds of checksums. The -c option runs a whole-file checksum on both sides, but if you don't use -W the rsync rolling checksum will be applied. So the chunk-by-chunk checksum always is used w/o -W? I guess the docs are more confusing than I originally thought. It might help if you think of it as two phases - discovery of what files need to be transferred, and then the transfer itself. The discovery phase will by default just check timestamps and sizes. You can adjust that with command line options, including the use of -c to include a full file checksum as part of the comparison, if for example, files might change without affecting timestamp or size. Once rsync knows what it needs to transfer, then it works its way through the file list, and for each file it performs a transfer. By default, that transfer is the rsync protocol - which involves the full process of dividing the file into chunks with both a strong and rolling checksum, and doing the computations to figure out what parts to send and so on. Now, normally this process is divided so that the copy of rsync that does the I/O is local to the file - e.g., for discovery both client and server rsync identify file timestamp/sizes independently (and optionally compute the checksums locally) and then exchange that information. For transfer both rsyncs build up the rolling and chunk checksums and exchange them and then decide what file data to send. But when you are copying with a single rsync (and in particular when one of the files is on the network), then that rsync has to do all the work. That means that during discovery it either 'stat's all files or optionally computes checksums. To do the checksum it has to read the file, so both source and destination get read fully - if either are on the network you will have already spent the network traffic to pull the complete files back to the local machine. Likewise for the transfer - under the rsync protocol, rsync has to compute the checksums for both source and destination files. Now, it'll only do this for those that it wants to transfer, but in those cases it effectively pulls back complete files from the network just to compute the checksums, only to then start transferring them. Even if the rsync protocol yields a very small amount of difference, anything beyond that point is already more than the full file with respect to the network activity that takes place. That's why the -W option is really the only logical thing to use with a single rsync and local (on-system or network share/mount) copies. Under such circumstances, the rsync protocol isn't going to help at all, and will probably slow things down and take more memory instead. With -W rsync becomes an intelligent copier (in terms of figuring out what changed), but that's about it. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
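So for the single-rsync case (local disk or network mount on one or both sides), the only invocation that really makes sense is something like this (paths invented):

  % rsync -avW /local/tree/ /mnt/share/tree/

-W skips all the block-checksum machinery, and rsync's remaining value is in deciding which files to copy at all - by timestamp/size by default, or by full checksum if you add -c.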
RE: problems encountered in 2.4.6
[EMAIL PROTECTED] [[EMAIL PROTECTED]] writes: Actually, the lack of -W isn't helping me at all. The reason is that even for the stuff I do over the network, 99% of it is compressed with gzip or bzip2. If the files change, the originals were changed and a new compression is made, and usually most of the file is different. Just to clarify, when you say over the network you mean in true client/server rsync (or across an rsh/ssh stream) and not just using one rsync with references using network mount points, right? In the latter case, not having -W is hurting you, never helping. But yes, any format (e.g., encryption, compression) that effectively distributes changes randomly over a file is going to be a killer for rsync. For the case of gzip'd files when a client and server rsync are in use, you may want to look back through the archives of this list - there was a reference to a patch for the gzip sources that created rsync-friendly gzip's. Not as great as the non-gzip'd version, but far better than normal gzip. Ah yes - here was the URL: http://antarctica.penguincomputing.com/~netfilter/diary/gzip.rsync.patch2 At the time when I tried it (1/2001), here were some test results: For comparison, here's a database file (delta between one day and the next), both uncompressed and gzip'd (normal and -9). For the uncompressed I also transferred with a fixed 1K blocksize since I know that's the page size for the database - the others are default computations (I tried the 1K with the gzip'd version but it was worse, as expected).

                Normal      Normal+1K   gzip        gzip -9
  Size          54206464    54206464    21867539    21845091
  Wrote         2902182     1011490     3169864     3214740
  Read          60176       317648      60350       60290
  Total         2962358     1329138     3230214     3275030
  Speedup       18.30       40.78       6.77        6.67
  Compression   1.00        1.00        2.479       2.481
  Normalized    18.30       40.78       16.78       16.54

And in terms of size: As Rusty's page comments, they are slightly larger, but not tremendously so. In my one case:

  Normal gzip:          21627629
  gzip --rsyncable:     21867539
  gzip -9 --rsyncable:  21845091

So about a 1-1.1% hit in compressed size. Personally, here we end up just leaving the major stuff we transfer uncompressed - as we're using slow analog lines, the cost recovery was easily worth the cost in disk space, particularly in cases like our databases where knowledge of the page size and method of change goes a long way. It definitely helped for transferring ISO images where the whole image would be changed if some files changed. I set the chunk size to 2048 for that. Why it defaults to 700 seems odd to me. Not sure - perhaps some early empirical work. When I'm moving files that I know something about I definitely control the block size myself, so for example, when moving databases with a 1K page size, I always use a multiple of that (since I know a priori that's how the database dirties the file), and then I scale that up a bit based on database size, to get a reasonable tradeoff between block overhead and extra transfer upon a change detection. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
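If you do want to try the patched gzip, usage is just (assuming you've built a gzip with that patch applied - the --rsyncable flag comes from the patch, not from stock gzip):

  % gzip --rsyncable somefile
  % gzip -9 --rsyncable somefile

and then you rsync the resulting .gz files exactly as before.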
RE: --compare-dest usage (was Re: [expert] 8.0 Final)
Randy Kramer [[EMAIL PROTECTED]] writes: I'm still uncertain about what happens if a single file rsync is interrupted partway through (and I've specified the --partial option) -- will rsync take advantage of the old copy of the file in --compare-dest and the partially rsync'd file (now truncated) in the destination directory? No, there can only be one compare file, so the next time you try the transfer it'll use the previous partially transferred copy for that purpose. Depending on the size of the original compare file and how much was partially transferred the last time, that can lose a significant amount of information (if you were using --partial) in terms of rsync's ability to be as efficient as possible. Locally here, we have a similar setup where we're transferring large database files. To work around that, I've been trying a local --partial-pad option. If enabled, it works like --partial, but upon an interruption, it appends data from the original source to the partial copy, to fill it out to at least the size of the original. I believe in our specific case it's an improvement - and we really only use it on the commands that perform the database backups - but I'm not convinced that it's necessarily useful as a general purpose option (and haven't yet submitted a patch), since there's nothing to say that the partially transferred information will be of any use in the next transfer, since it assumes sort of a linear change pattern to the file. The best performing behavior would be to work with both the previous partial file and the original in the --compare-dest directory, but that would require significant changes to rsync internals, not to mention potentially twice the work if the partial copy was any significant fraction of the --compare-dest copy. But I'd be happy to supply a diff if you think it might help in your setup. It'd be against 2.4.3, but should be very close if not the same against later releases. Aside: I think, based on your previous response, that if I did a multifile rsync (say 60 files), and rsync was interrupted after 20 of the files were rsync'd, the --compare-dest option would work to avoid rsync'ing the first 20 files and then rsync would rsync the last 40 files in the normal manner (i.e., breaking them into blocks of 3000 to 8000 bytes and then comparing them, and transferring only the blocks that were different). I don't think the --compare-dest would be the reason rsync would skip the first 20 - it would just see them as existing in the target directory at the right date and size. Where --compare-dest could come into play would be if they already existed in the separate comparison directory, in which case they wouldn't be transferred at all (unless you were using the -I option). -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
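For reference, the shape of command I'm describing is (paths invented; --partial and --compare-dest are stock options, while --partial-pad is only my local hack):

  % rsync -av --partial --compare-dest=/backup/previous dbhost:/db/ /backup/current/

where /backup/previous holds the last complete copy and /backup/current receives the new transfer.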
RE: Backing up *alot* of files
Nemholt, Jesper Frank [[EMAIL PROTECTED]] writes: Now the big question : How long will next run take (most likely, only a few files has changed) ? You'll need the same basic startup time (and memory) to identify the file list, but at that point it should be quite fast at skipping to only the files that need to be transferred (providing you let it identify such files by size and timestamp - the default operation). However, I'm not sure I follow what you are currently running - are you using rsync to sort of "bootstrap" your backup repository? If that's the case, then it can be more efficient to just transfer the files via a standard copy mechanism (you don't have any of the overhead of rsync at all) or use rsync with the -W (whole file, no incremental computations) option that very first time. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Backing up *alot* of files
Nemholt, Jesper Frank [[EMAIL PROTECTED]] writes: That was also what I was hoping, but what if I add the -c for checksum I suppose it then needs to read checksum both source and destination for all files, or ? (this will as far as I can see take at least the 10 hours, maybe more). Yes, with -c it'll be back to processing each and every file, and it may even be worse than your current timing since it has to checksum both the local and remote copy, whereas if you're copying into an empty filesystem now (I'm not sure if you are or aren't) it just knows it has to transfer. I don't think checksum is a necessity here, but when dealing with files including production database files from Oracle, it _is_ nice to play safe... We plan to let the DBAs fire up the databases on the backup and check everything. If they say OK, the most important files are OK. We had precisely this problem with SQLAnywhere database files under Windows. The timestamp and size of the main database file would remain the same but the data would have changed. The only way to let rsync detect this was with -c. However, the performance implications eventually made me add some extra support to the script wrapping rsync so that it checked the timestamp and sizes and only added the "-c" to rsync if they were the same. So in your case, it might be easier to pre-identify any files that might need to be transferred even if their timestamp and size remain unchanged, and handle them with a separate run of rsync with appropriate options. Yes, I would probably save the first hour, but it was done with rsync the normal options nevertheless, just to see if anything went wrong when using rsync on 2 million files. True - that's an amazingly large single-directory structure. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
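As a very rough sketch of that wrapper idea (illustrative only - the filenames are invented, and the -nt test needs a shell that supports it):

  #!/bin/sh
  SRC=/oracle/data/users01.dbf
  DST=/backupfs/oracle/data/users01.dbf
  EXTRA=
  # if the source doesn't look newer than the existing copy,
  # fall back to a full checksum comparison
  if [ ! "$SRC" -nt "$DST" ]; then
      EXTRA=-c
  fi
  rsync -av $EXTRA "$SRC" "$DST"

i.e., only pay the -c price for the files where the cheap timestamp/size comparison can't be trusted.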
RE: exit status
Toni Pisjak [[EMAIL PROTECTED]] writes: My question: Is the exit status reliable in the current version ? It's not 100% reliable, but it does somewhat depend on what you would consider a failure, since there are some slightly ambiguous cases. For my part, the cases where I've seen fit to make some local changes to cover scenarios included the following:

* recv_files in receiver.c: There are 4-5 points where it can have a local failure (fstat, getting a tmp name, etc...) that the stock code prints a warning for, but continues to the next file in the list. Depending on the failure it may still exit out with a non-zero code, but in some cases it'll just skip the problem file and continue on. I changed this to abort with a non-zero code. (This does change control flow slightly in that function but I haven't seen a problem yet, and receive_data that is called from this function can also directly exit on an I/O failure).

* finish_transfer in rsync.c: Failures renaming the temporary file could be ignored (I find this happens sometimes under NT) and you could lose both the temporary ".filename" version and think it was successful. I switched this to triggering an I/O error exit.

In my transfers, I got caught once in the first case when I ran out of disk space, so the mktemp call was failing on all the files but looking like the overall transfer was ok. The second case hit me under NT (as noted above) sporadically - but certainly not frequently. I haven't had the opportunity yet to suggest these changes back for inclusion in the main source. The only other issue that I've seen is the potential for the server side to run into local problems that get reflected as error messages to the receiver, but don't stop the transfer or result in a non-zero exit code. One case is a missing file on the server - you'll get a link_stat warning message in the stream but if it was just one of several files, the rest of the transfer will complete and it can look successful. I was going to fix this until I realized that it actually helped us in some transfers where a file might or might not be present. I do think it should probably error out in the long run, but haven't made any changes along those lines. With all this said however, it took me about 6 months of using rsync heavily to get to the point of making the changes I did, so these aren't frequent occurrences, nor did they really impede my use of rsync with scripts that strictly watched the exit code. Also, in all of these cases, there is a warning or error message that gets displayed, but you'd have to parse the rsync output to see it rather than just trusting the exit code. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: The out of memory problem with large numbers of files
Dave Dykstra [[EMAIL PROTECTED]] writes: No, that behavior should be identical with the --include-from/exclude '*' approach; I don't believe rsync uses any memory for excluded files. Actually, I think there's an exclude_struct allocated somewhere per file (looks like 28 bytes or so), but the growth algorithm is not exponential (it just reallocates two entries at a time it looks like). But I expect compared to other stuff it's in the noise. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: The out of memory problem with large numbers of files
I previously wrote: Well, as with any dynamic system, I'm not sure there's a totally simple answer to the overall allocation, as the tree structure created Oops, this slipped through editing - as I wrote up the rest of the note I didn't actually find a tree structure (I earlier thought there was one) - so please ignore that :-) -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: rsync problem
Kevin Saenz [[EMAIL PROTECTED]] writes: I guess that might be the case but there is one question left to ask the total files that we rsync has not changed. why would this task cause problems all of a sudden? If it's not the per-file overhead adding up, have you suddenly picked up a huge file in the bunch? generate_sums() is called for each file to construct the per-block checksums to be transmitted to the sender to compute the delta. Perhaps you've now got a file in your filesystem that is so large that the aggregate space required for the checksums for that file exceeds your available working space. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: rsync memory usage ...
Cameron Simpson [[EMAIL PROTECTED]] writes: | Cameron The other day I was moving a lot of data from one spot to | Cameron another. About 12G in several 2G files. [...] | Cameron so I used rsync so that its checksumming could speed past | Cameron the partially copied file. It spent a long time | Cameron transferring nothing and ran out of memory. From the | Cameron error I'm inferring that it checksums the entire source | Cameron file before sending anything across the link. | I know I'm not (directly) addressing the problem, and I don't know the | code, but will specifying a larger block size allow you to work-around | the problem? Perhaps - the transfer is done now but I'll try it next time I have such an issue. I was more concerned with the appearance that rsync stashes all the checksums before sending any. This seemed memory hungry and nonstreaming, which is odd in an app so devoted to efficiency. Your question about behavior is accurate though - while rsync spends a lot of energy trying to make an efficient transfer of the file itself, the actual meta-process to determine the transfer is fairly synchronous. After exchanging information to determine the set of files involved, the receiver proceeds through each file in turn, computes the checksums for the file and then transmits them. The sender receives all the checksums, then uses that in conjunction with its copy of the file to compute the delta information, and then transmits that back. As the receiver receives the delta information it recreates the new file. So there is definitely start-up overhead that must occur before any of the file data is transferred at all, and for a very large file, the checksum computation and the transmission of the checksum information can be lengthy. Some of this is unavoidable - until the sender has all of the receiver checksum information it can't necessarily start sending - some of the very end of the current file on the receiver may be used at the very beginning of the sender's new version, which it can't detect until it knows about the entire receiver's file. Adjusting the blocksize manually can have an impact on this. The larger the blocksize, the smaller the checksum meta-information, since you have linear growth with the number of blocks the file represents. If a block size is not set on the command line, rsync will do some dynamic adjustment of the blocksize (roughly size/10,000) maxing out at 16K. During transmission it's 6 bytes per block, but I believe it's 32 bytes in memory. So for the 2GB file, you'll have about 122,000 blocks, so ~700K transmitted and ~4MB in memory. That doesn't really sound like enough to exhaust memory on typical machines nowadays though. There's some per-file growth too, but the per-block checksums are freed as it works through each file. Now, in terms of increased efficiency - while you do have to transmit all of the checksum information before the sender can compute the delta, one thing I've been interested in trying is to have the receiver send the checksums as it computes them - I'm not entirely sure why it has to be saved in memory, since it'll be freed right after transmission. About the only risk I see is that it couples the checksum process to the line speed, which could raise the risk of inconsistency if the file on the receiver is changing, but that risk is already there, just with a smaller window. I haven't had a chance to try the change yet, though. 
-- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: rsync hangs
Marquis Johnson [[EMAIL PROTECTED]] writes: I found yesterday that this was my fault. I probably should not have used rsync -avvv. When I used rsync -av it worked fine. Thanks for your help. Yes, using more than two "-v" options can cause a problem, although I normally get a failure early on, depending on how many more -v's are added. The problem is that the extremely verbose modes sometimes generate output that doesn't go through the standard I/O handling for streams on the server, and thus doesn't get packaged up properly for the client to decode. So the client perceives the server debug output as a protocol failure. I had at one point played with both removing the verbose option from transmission to the server and/or explicitly limiting the server's level to no more than 2 regardless of what debugging level was enabled locally. Without that you can never really get the higher debugging levels locally. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Source and destination don't match
Jeff Kennedy [[EMAIL PROTECTED]] writes: I have a source directory that is not being touched by anyone, no updates or even reads except by the rsync host. I am using just a straight binary, no rsyncd.conf file. I am using the following command: rsync -avz /source/path/dir /dest/path/dir Using version 2.4.6 on Solaris 7, source and destination are both on a NetApp filer. Seems to run without incident but du's on both directories show a 40MB difference. Is this normal? Thanks. It depends. You might try comparing the output of a "find -ls" (perhaps excluding directories) on both trees to see if it's easy to tell where the difference lies. One thing that might account for a difference would be if the source filesystem is a very active one (over time, not at this instant), in which case its directory files could be much larger due to normal usage growth, whereas your destination copy is fresh and only as large as necessary for the actual current file information. This would require that the larger side be the source, and I'd expect in that case the 40MB would have to be a relatively small fraction of the overall size, which I can't tell from the info provided. Another possibility is that your source tree has sparse files, which are being expanded during the copy. In that case, the --sparse option of rsync may help, although I have not had need to use it myself in the past. Oh, this would also imply that the source filesystem would be the larger of the two. Of course, I suppose it's also possible that there's something in your source that rsync isn't syncing up properly - the find comparison should be able to highlight that. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
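The comparison I have in mind is something like this (GNU find syntax; it ignores directories and just lines up size and name, and it will get confused by filenames containing spaces):

  % (cd /source/path/dir && find . ! -type d -ls | awk '{print $7, $11}' | sort) > /tmp/src.lst
  % (cd /dest/path/dir && find . ! -type d -ls | awk '{print $7, $11}' | sort) > /tmp/dst.lst
  % diff /tmp/src.lst /tmp/dst.lst

Any entries that differ in size, or appear on only one side, are the first place to look for that 40MB.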
RE: Interrupted transfer of a file
Whoops, I wrote in my previous message: So you lose the copy, but rsync still exits with a non-zero code, which makes it look like there was nothing to transfer (e.g., no change to the file). That should have said "exits with a zero exit code" - e.g., it exits looking like it was successful. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
RE: Changed file not copied
John Horne [[EMAIL PROTECTED]] writes: Okay I think I've solved this :-) The rsync with the '--size-only' option updates the modification time for the file but doesn't copy it. Actually, I believe it's less related to "--size-only" rather than your use of the "-t/--times" option (implicitly since you used "-a"). Thus, "--size-only" says not to bother transferring the file contents even if the date is different but the size is the same, but the "-t" still transfers over the appropriate timestamp information. This can be a good way to initially sync up two systems that weren't previously mirrored with maintenance of timestamp information, but otherwise have the same content. Hence when I run the command without that option the size and time are equal - hence the file is not copied despite being different. I see that there are options to ignore the time as well, so that would get round it on the second rsync command. Not a problem, I just found it a bit confusing :-) In addition to ignoring the timestamp ("-I/--ignore-times"), and depending on your environment and size of the files involved, you can also use the "-c/--checksum" option to have rsync compute a file checksum to determine if a file should be transferred. While this can be slow, if the actual rsync processing of the file (e.g., block checksums) and transmission of that information (particularly on a slow link) would be lengthy, it's a more efficient way to guarantee you only bother sending a file if it is different. -- David /---\ \ David Bolen\ E-mail: [EMAIL PROTECTED] / | FitLinxx, Inc.\ Phone: (203) 708-5192| / 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \ \---/
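By way of illustration (paths invented), the three behaviors being discussed look like:

  % rsync -av --size-only /src/ host:/dst/   # skip content if size matches, but still copy times
  % rsync -avI /src/ host:/dst/              # ignore time/size and run the delta on every file
  % rsync -avc /src/ host:/dst/              # decide what to send by whole-file checksum

Both of the last two end up reading every file in full on both sides, so they're only worth it when timestamp plus size genuinely can't be trusted.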