Re: Reliability and robustness problems
On Tue, 15 Jun 2004, John [EMAIL PROTECTED] wrote:

It may be that there will be files from different systems that are identical - think system binaries, fonts etc. If these are in /var/local/backups/{host1,host2} etc, and I've run a script to identify these dupes and eliminate them using hard links, can rsync preserve these hard links even though it can't see them all?

Yes and no. Here are some examples. Assume files A, B and Z are hardlinked on the source and on the target, and the source paths being synced include files A and B, but not Z.

1. If there was no change to A and B, then there is no issue - rsync leaves them alone and all 3 files remain hardlinked on the target.

2. If A changes and becomes independent of an unchanged B, then the new A content is transferred as an individual file and B is left hardlinked to Z.

3. If A (and B and Z) change and remain hardlinked, the new content will be transferred, A and B will be hardlinked, and the hardlink with Z is now lost because rsync doesn't know anything about it. Z on the target now contains old content and becomes independent.

--
John Van Essen Univ of MN Alumnus [EMAIL PROTECTED]

--
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
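Case 3 is the surprising one, so here is a small sketch of it in Python. It does not run rsync at all; it only imitates by hand, in a temporary directory, what the receiving side ends up with when new content replaces A, B is re-linked to it, and Z is never mentioned. The file names A, B and Z come from the examples above; the temporary file name is made up for the illustration.

```python
import os
import tempfile

def same_inode(p, q):
    """True when both paths are hard links to the same underlying file."""
    return os.path.samefile(p, q)

with tempfile.TemporaryDirectory() as target:
    a, b, z = (os.path.join(target, name) for name in ("A", "B", "Z"))

    # Starting point on the target: A, B and Z are hard links to one file.
    with open(a, "w") as f:
        f.write("old content")
    os.link(a, b)
    os.link(a, z)

    # Case 3, imitated by hand: new content for A is written to a
    # temporary file and renamed into place, then B is re-linked to A.
    # Z is outside the synced paths, so nothing touches it.
    tmp = os.path.join(target, ".A.tmp")  # hypothetical temp name
    with open(tmp, "w") as f:
        f.write("new content")
    os.replace(tmp, a)   # A now names a brand-new inode
    os.remove(b)
    os.link(a, b)        # B follows A

    a_b_linked = same_inode(a, b)
    a_z_linked = same_inode(a, z)
    with open(z) as f:
        z_content = f.read()

print(a_b_linked, a_z_linked, z_content)
```

A and B end up sharing an inode again, while Z keeps the old inode and the old content, which is the "independent" state described in case 3.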
Re: Reliability and robustness problems
Wayne Davison wrote:

On Thu, Jun 10, 2004 at 07:21:41AM +0800, John wrote:

flist.c: In function `send_file_entry':
flist.c:349: `lastdir_len' undeclared (first use in this function)

It patched the wrong function, which is really hard to understand because the line numbers in the patch are right for the 2.6.2 version of flist.c. If you read the @@ line before each hunk, you'll see the function name it should have patched. The first hunk makes its change in receive_file_entry(), and the second makes its change in make_file(). Both changes are simple enough that you can patch them by hand, if needed.

Wayne suggested off-list that I check whether the patch is already in place. Grumble grumble. I've installed 2.6.2 at both sites.

We've also discovered that Telstra has improved some configuration item in its DSLAMs or somewhere, and this leads to a lack of reliability of the connexion. I've now made the necessary adjustment to the DSL-300 (don't believe the dlink website, they all talk telnet on 192.168.1.1) and we live in hope that the DSL link will stay up for weeks instead of hours.

I've implemented some rudimentary performance monitoring: each hour I run these commands:

ifconfig tun0 | mail -s "Traffic report" [EMAIL PROTECTED]
ps ww u -C rsync | mail -s "rsync report" [EMAIL PROTECTED]

This shows me that we're not transferring enormous amounts of data (so I guess the hard-link problem's gone), but we're still using lots of memory:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 15191 4.0 68.4 323900 131352 ? S 01:23 6:22 rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids --timeout=3600 /var/local/backups 192.168.0.1:/var/local/backups/

It doesn't seem to be doing a lot of paging. (Note I'm not at all sure that my understanding of paging is the same as is meant in Linux - I've seen systems reporting paging where there was no swap file, and my understanding of paging prohibits this.)

The memory usage is a concern, not because I can't reduce it for this run (I've not yet made the refinements suggested, or implemented deleting old backups), but because there are other systems that need to be backed up too.

It may be that there will be files from different systems that are identical - think system binaries, fonts etc. If these are in /var/local/backups/{host1,host2} etc, and I've run a script to identify these dupes and eliminate them using hard links, can rsync preserve these hard links even though it can't see them all? If not, I'll simply run the script in all locations whenever I feel the need. This uncertainty on my part is the reason I'm exposing the whole backup directory hierarchy to rsync rather than parts of it.
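For anyone reading the ps figures in the report, here is a rough sketch in Python of what they work out to. It assumes, as the procps ps documentation describes, that VSZ and RSS are reported in KiB and that %MEM is RSS as a share of physical memory; the implied RAM figure is therefore only an estimate derived from the one line above, not something stated in the report.

```python
# Numeric columns copied from the ps line in the report above.
ps_fields = {"pmem": 68.4, "vsz_kib": 323900, "rss_kib": 131352}

def kib_to_mib(kib):
    """Convert a KiB count (assumed ps unit) to MiB."""
    return kib / 1024

rss_mib = kib_to_mib(ps_fields["rss_kib"])
vsz_mib = kib_to_mib(ps_fields["vsz_kib"])

# Assumption: %MEM = RSS / physical RAM, so RAM is roughly RSS / (%MEM / 100).
implied_ram_mib = rss_mib / (ps_fields["pmem"] / 100)

print(f"resident: {rss_mib:.1f} MiB, virtual: {vsz_mib:.1f} MiB, "
      f"implied RAM: {implied_ram_mib:.0f} MiB")
```

On those assumptions rsync is holding roughly 128 MiB resident out of about 316 MiB of address space, on a box with something under 200 MiB of RAM.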
Re: Reliability and robustness problems
Wayne Davison wrote:

On Wed, Jun 09, 2004 at 09:42:08PM +0800, John wrote:

I will install 2.6.2 when the backup run has completed, but I want the current run to complete first.

Since you're using multiple sources with --delete, make sure the 2.6.2 code you compiled has been patched with the simple fix attached to this bug report: https://bugzilla.samba.org/show_bug.cgi?id=1413

..wayne..

I don't yet have that patch in place, so my next run won't be on the new version. This one has completed, with a timeout:

+ rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids --timeout=3600 /var/local/backups 192.168.0.1:/var/local/backups/
io timeout after 3600 seconds - exiting
rsync error: timeout in data send/receive (code 30) at io.c(85)

real 3001m51.871s
user 14m32.230s
sys 4m26.440s

As you can see, I don't have any stats on its performance. I'm about to restart it; based on this, I expect the next run to finish catching up.

Regarding the patch, I intend to put together a procedure that works, or reliably fails. I'll report back on this thread when I have an outcome.
Re: Reliability and robustness problems
I got the axe out and sharpened it, and leaned it against the garden shed, and the plant is growing. Canberrans here may recall the advice of David Young who had a gardening talkback program on ABC radio in the early 80s.

2.6.2 has fixes for unnecessary transfers.

I added --timeout=3600, and set it running. I then did a little work. First, I built 2.6.2 from Sid for Woody. Second, I located Fedora 2 where I discovered the source rpm for rsync 2.6.2. It built as easily as one could wish on RHL 7.3.

The backup has been chugging away for over 24 hours now. When it falters, I have 2.6.2 ready to install in both locations.

Seriously, I think the significant difference has been running it on the VPN. I'm running OpenVPN because, when I was researching my problems with VTUND, I discovered it doesn't cope with firewalls: the recommendation is to use TCP instead of UDP, and my reading at the CIPE home page suggests that's not a good idea. I figured I might as well use PPP over SSH. However, someone reported on the VTUND that he'd got OpenVPN going with a minimum of bother, and that's my experience now too.

OpenVPN tunnels using UDP, and can survive outages that cause rsync/ssh to hang. It also does adaptive compression - it turns compression on/off from time to time based on current traffic. It can also do bandwidth limiting.

I will install 2.6.2 when the backup run has completed, but I want the current run to complete first.

Thanks for your help. I'll review the email with the object of filing some bug reports for anything I think outstanding so you folk can deal appropriately with them.
Re: Reliability and robustness problems
On Wed, Jun 09, 2004 at 09:42:08PM +0800, John wrote:

I will install 2.6.2 when the backup run has completed, but I want the current run to complete first.

Since you're using multiple sources with --delete, make sure the 2.6.2 code you compiled has been patched with the simple fix attached to this bug report: https://bugzilla.samba.org/show_bug.cgi?id=1413

..wayne..
Re: Reliability and robustness problems
On Wed, Jun 09, 2004 at 09:42:08PM +0800, John wrote:

OpenVPN tunnels using UDP, and can survive outages that cause rsync/ssh to hang. It also does adaptive compression - it turns compression on/off from time to time based on current traffic.

If you are strictly using the VPN for rsync, you will be better off using rsync's compression (-z option) than using the VPN's compression. I realize that it is adaptive, but if you turn it completely off, the VPN won't even have to consider if it should compress the stream or not.

OTOH, if you are rsyncing a lot of compressed data, turning off rsync's compression and using the VPN's adaptive compression might work out better.

-John
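Which way that trade-off goes depends on how compressible the data is, and that is easy to check on a sample. The following is only a sketch in Python using zlib as a convenient stand-in; it makes no claim about what rsync or OpenVPN do internally, and the sample data (a repeated log-style line and a blob of already-compressed random bytes) is invented for the illustration.

```python
import os
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (lower = more compressible)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)

# Invented sample data: text-like content versus something already compressed.
text_like = b"Jun  8 05:01:02 mail sshd[123]: session opened for user root\n" * 2000
already_compressed = zlib.compress(os.urandom(64 * 1024))  # stand-in for .gz, .jpg etc.

text_ratio = compression_ratio(text_like)
packed_ratio = compression_ratio(already_compressed)

print(f"text-like data shrinks to {text_ratio:.1%} of its size")
print(f"already-compressed data shrinks to {packed_ratio:.1%} of its size")
```

If most of what changes between runs looks like the second case, compressing it again in either rsync or the VPN mostly costs CPU for little gain.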
Re: Reliability and robustness problems
Wayne Davison wrote:

On Wed, Jun 09, 2004 at 09:42:08PM +0800, John wrote:

I will install 2.6.2 when the backup run has completed, but I want the current run to complete first.

Since you're using multiple sources with --delete, make sure the 2.6.2 code you compiled has been patched with the simple fix attached to this bug report: https://bugzilla.samba.org/show_bug.cgi?id=1413

sob

Here's why I don't launch into wholesale changes to rsync.

First I cut and pasted the patch. It half fitted, so I reversed it.

vim flist.patch
patch <flist.patch
patch -R <flist.patch

I then saved it and copied it with scp. Patched again:

patch <attachment.cgi

Build:

[EMAIL PROTECTED]:~/packages/rsync-2.6.2$ dpkg-buildpackage -rfakeroot -uc
gcc -I. -I. -Wall -O2 -c flist.c -o flist.o
flist.c: In function `send_file_entry':
flist.c:349: `lastdir_len' undeclared (first use in this function)
flist.c:349: (Each undeclared identifier is reported only once
flist.c:349: for each function it appears in.)
make[1]: *** [flist.o] Error 1
make[1]: Leaving directory `/home/summer/packages/rsync-2.6.2'

At least I seem to have solved the most pressing problem: it's still chugging away despite the ADSL link at home going down for upwards of 30 seconds that I've noticed.
Re: Reliability and robustness problems
On Thu, Jun 10, 2004 at 07:21:41AM +0800, John wrote:

flist.c: In function `send_file_entry':
flist.c:349: `lastdir_len' undeclared (first use in this function)

It patched the wrong function, which is really hard to understand because the line numbers in the patch are right for the 2.6.2 version of flist.c. If you read the @@ line before each hunk, you'll see the function name it should have patched. The first hunk makes its change in receive_file_entry(), and the second makes its change in make_file(). Both changes are simple enough that you can patch them by hand, if needed.

..wayne..
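For anyone unsure what "read the @@ line" means in practice, here is a small Python sketch that pulls the starting line number and the trailing section heading out of each hunk header of a unified diff. The sample patch text is hypothetical: the two function names are the ones mentioned above, but the line numbers, hunk lengths and context lines are invented and are not those of the real bugzilla attachment.

```python
import re

# Unified-diff hunk header: @@ -old_start[,old_len] +new_start[,new_len] @@ heading
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$")

def hunk_targets(patch_text):
    """Return (old_start_line, section_heading) for every hunk in the patch."""
    targets = []
    for line in patch_text.splitlines():
        match = HUNK_RE.match(line)
        if match:
            targets.append((int(match.group(1)), match.group(5).strip()))
    return targets

# Hypothetical patch skeleton, for illustration only.
sample_patch = """\
--- flist.c
+++ flist.c
@@ -100,7 +100,7 @@ receive_file_entry(...)
 (context lines would follow here)
@@ -200,6 +200,7 @@ make_file(...)
 (context lines would follow here)
"""

targets = hunk_targets(sample_patch)
for start, heading in targets:
    print(f"hunk at old line {start}: {heading}")
```

The heading after the second @@ is only there if the diff was made with function names enabled, so an empty heading does not mean the patch is broken.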
Reliability and robustness problems
I am trying to use rsync to back up from a site we will call office to another we will call home. Both sites have DSL accounts provided by Arachnet.

At present the files being backed up don't all need to be backed up, but OTOH we wish to back up lots more files that aren't being backed up now.

First, we create a local backup on our office machine which happens to be called mail. We have this directory structure:

drwxr-xr-x 20 root 4096 May 17 23:06 20040517-1500-mon
drwxr-xr-x 20 root 4096 May 18 23:06 20040518-1500-tue
drwxr-xr-x 20 root 4096 May 19 23:09 20040519-1500-wed
drwxr-xr-x 20 root 4096 May 20 23:09 20040520-1500-thu
drwxr-xr-x 20 root 4096 May 21 23:09 20040521-1500-fri
drwxr-xr-x 20 root 4096 May 22 23:10 20040522-1500-sat
drwxr-xr-x 20 root 4096 May 23 23:09 20040523-1500-sun
drwxr-xr-x 20 root 4096 May 24 23:10 20040524-1500-mon
drwxr-xr-x 20 root 4096 May 25 23:10 20040525-1500-tue
drwxr-xr-x 20 root 4096 May 26 23:10 20040526-1500-wed
drwxr-xr-x 20 root 4096 May 27 23:10 20040527-1500-thu
drwxr-xr-x 20 root 4096 May 28 23:11 20040528-1500-fri
drwxr-xr-x 20 root 4096 May 29 23:11 20040529-1500-sat
drwxr-xr-x 20 root 4096 May 30 23:10 20040530-1500-sun
drwxr-xr-x 20 root 4096 May 31 23:11 20040531-1500-mon
drwxr-xr-x  3 root 4096 Jun  1 14:10 20040601-0603-tue
drwxr-xr-x  3 root 4096 Jun  1 23:07 20040601-1500-tue
drwxr-xr-x  3 root 4096 Jun  2 07:42 20040601-2323-tue
drwxr-xr-x  3 root 4096 Jun  2 23:07 20040602-1500-wed
drwxr-xr-x  3 root 4096 Jun  3 14:04 20040603-0555-thu
drwxr-xr-x  3 root 4096 Jun  3 23:06 20040603-1500-thu
drwxr-xr-x  3 root 4096 Jun  4 23:07 20040604-1500-fri
drwxr-xr-x  3 root 4096 Jun  5 23:08 20040605-1500-sat
drwxr-xr-x  3 root 4096 Jun  7 14:19 20040607-0610-mon
drwxr-xr-x  3 root 4096 Jun  8 05:01 20040607-2054-mon
drwxr-xr-x  3 root 4096 Jun  8 05:35 20040607-2128-mon
drwxr-xr-x 20 root 4096 Jun  1 14:06 latest

The timestamps in the directory names are UTC times.
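As an aside on the naming scheme, a directory name of this shape can be produced from a UTC timestamp as in the Python sketch below. This is not the script used here (that is not shown in this post); the function name snapshot_name and the hard-coded day abbreviations are assumptions made for the example, chosen so the result does not depend on the machine's locale.

```python
from datetime import datetime, timedelta, timezone

# Lower-case day abbreviations, indexed by datetime.weekday() (Monday = 0).
DAYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")

def snapshot_name(when: datetime) -> str:
    """Build a YYYYMMDD-HHMM-ddd name from an aware datetime, in UTC."""
    utc = when.astimezone(timezone.utc)
    return f"{utc:%Y%m%d-%H%M}-{DAYS[utc.weekday()]}"

# 21:28 UTC on 7 June 2004 ...
from_utc = snapshot_name(datetime(2004, 6, 7, 21, 28, tzinfo=timezone.utc))

# ... is 05:28 on 8 June in a +0800 zone, and gives the same name.
plus8 = timezone(timedelta(hours=8))
from_local = snapshot_name(datetime(2004, 6, 8, 5, 28, tzinfo=plus8))

print(from_utc, from_local)
```

That also explains why the last few entries in the listing carry a "mon" name but a Jun 8 modification time: the listing shows local time, the name is UTC.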
We maintain the contents of latest thus:

+ rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids --exclude-from=/etc/local/backup/system-backup.excludes /boot/ / /home/ /var/ /var/local/backups/office//latest

and create the backup-du-jour:

+ cp -rl /var/local/backups/office//latest /var/local/backups/office//20040607-2128-mon

That part works well, and the rsync part generally takes about seven minutes.

To copy office to home we try this:

+ rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids /var/local/backups 192.168.0.1:/var/local/backups/

Prior to this run that is in progress, we used home's external host name. I've created a VPN between the two sites (for other reasons) using OpenVPN: all the problems we've had so far occurred with the external hostname, which we'll say is home.arach.net.au as that's the default way Arachnet assign hostnames.

I'm hoping that OpenVPN will provide a more robust recovery from network problems.

Problems we've had include

1. ADSL connexion at one end or the other dropping for a while. rsync doesn't notice and mostly hangs. I have seen rsync at home still running but with no relevant files open.

2. rsync uses an enormous amount of virtual memory with the result the Linux kernel lashes out at lots of processes, mostly innocent, until it lucks on rsync. This can cause rsync to terminate without a useful message.

2a. Sometimes the rsync that does this is at home. I've alleviated this at office by allocating an unreasonable amount of swap: unreasonable because if it gets used, performance will be truly dreadful.

3. rsync does not detect when its partner has vanished. I don't understand why this should be so: it seems to me that, at office, it should be able to detect by the fact {r,s}sh has terminated or by timeout, and at home by timeout.

3a. I'd like to see rsync have the ability to retry in the case it's initiated the transfer. It can take some time to collect together the information as to what needs to be done: if I retry in its wrapper script, then this has to be redone whereas, I surmise, rsync doing the retry would not need to.

4. I've already mentioned this, but as I've had no feedback I'll try again. As you can see from the above, the source directories for the transfer from office to home are chock-full of hard links. As best I can tell, rsync is transferring each copy fresh instead of recognising the hard link before the transfer and getting
Re: Reliability and robustness problems
On Tue, Jun 08, 2004 at 07:37:32AM +0800, John wrote:

1. ADSL connexion at one end or the other dropping for a while. rsync doesn't notice and mostly hangs. I have seen rsync at home still running but with no relevant files open.

There are two aspects of this: (1) Your remote shell should be set up to time out appropriately (which is why rsync doesn't time out by default) -- see your remote-shell's docs for how to do this; (2) you can tell rsync to time out after a certain amount of inactivity (see --timeout).

2. rsync uses an enormous amount of virtual memory

Yes, it uses something like 80-100 bytes or so per file in the transferred hierarchy (depending on options) plus a certain base amount of memory. Your options are to (1) copy smaller sections of the hierarchy at a time, (2) add more memory, or (3) help code something better.

This is one of the big areas that I've wanted to solve by completely replacing the current rsync protocol with something better (as I did in my rZync testbed protocol project a while back -- it transfers the hierarchy incrementally, so it never has more than a handful of directories in action at any one time). At some point I will get back to working on an rsync-replacement project.

3. rsync does not detect when its partner has vanished.

That seems unlikely unless the remote shell is still around. If the shell has terminated, the socket would return an EOF and rsync would exit. So, I'll assume (until shown otherwise) that this is a case of the remote shell still hanging around.

3a. I'd like to see rsync have the ability to retry in the case it's initiated the transfer.

There has been some talk of this recently. It doesn't seem like it would be too hard to do, but it's not trivial either. If someone wanted to code something up, I'd certainly appreciate the assistance. Or feel free to put an enhancement request into bugzilla.

(BTW: has anyone heard from J.W. Schultz anytime recently? He seems to have dropped off the net without any explanation about 3 months ago -- I hope he's OK.)

4. [...] As best I can tell, rsync is transferring each copy fresh instead of recognising the hard link before the transfer and getting the destination rsync to make a new hard link.

This should not be the case if you use the -H option. (It also helps to use 2.6.2 on both ends, as the memory-consumption was reduced considerably from older releases.) If you're seeing a problem with this, you should provide full details on what command you're running, what versions you're using, and as small a test case as you can that shows the problem.

..wayne..
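To put the 80-100 bytes per file figure into numbers, here is a back-of-envelope sketch in Python. The per-file range is the one quoted above; the count of 50,000 entries per snapshot is purely hypothetical (no file counts have been posted in this thread), and the "certain base amount" is left at zero because no figure was given for it.

```python
def estimated_file_list_bytes(file_count, bytes_per_file=100, base_bytes=0):
    """Rough file-list memory: base amount plus a fixed cost per entry."""
    return base_bytes + file_count * bytes_per_file

# Hypothetical workload: 27 snapshot directories of 50,000 entries each,
# all presented to rsync in one run.
entries = 27 * 50_000

low = estimated_file_list_bytes(entries, bytes_per_file=80)
high = estimated_file_list_bytes(entries, bytes_per_file=100)

print(f"{entries:,} entries: {low / 2**20:.0f} to {high / 2**20:.0f} MiB")
```

The point of the arithmetic is only that memory grows with the number of entries presented in one run, so copying smaller sections of the hierarchy at a time (option 1) reduces it proportionally.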
Re: Reliability and robustness problems
On Mon, Jun 07, 2004 at 07:40:22PM -0700, Wayne Davison wrote:

So, I'll assume (until shown otherwise) that this is a case of the remote shell still hanging around.

There's one other possibility I thought of. You mentioned that your kernel has gone around killing processes when memory is low. If one rsync process is just sitting around waiting to be killed by its sibling rsync process, but that sibling process got killed before it had a chance to generate the "all done" signal, a do-nothing rsync process could be left hanging around indefinitely. This is pretty rare, though, as most of the time rsync is actively interacting with the open socket and it notices when something goes wrong.

..wayne..
Re: Reliability and robustness problems
Hmm, I subscribed to the list before I sent this. I've not seen either the email confirmation request or my mail to the list.

Wayne Davison wrote:

On Tue, Jun 08, 2004 at 07:37:32AM +0800, John wrote:

1. ADSL connexion at one end or the other dropping for a while. rsync doesn't notice and mostly hangs. I have seen rsync at home still running but with no relevant files open.

There are two aspects of this: (1) Your remote shell should be set up to time out appropriately (which is why rsync doesn't time out by default) -- see your remote-shell's docs for how to do this; (2) you can tell rsync to time out after a certain amount of inactivity (see --timeout).

I'm pretty sure that ssh times out properly: certainly it disconnects some time after my dialup line goes down. I'd managed to overlook the --timeout option on rsync. I'll look more closely for ssh hanging round next time it happens, but I'm skeptical.

2. rsync uses an enormous amount of virtual memory

Yes, it uses something like 80-100 bytes or so per file in the transferred hierarchy (depending on options) plus a certain base amount of memory. Your options are to (1) copy smaller sections of the hierarchy at a time, (2) add more memory, or (3) help code something better.

This is one of the big areas that I've wanted to solve by completely replacing the current rsync protocol with something better (as I did in my rZync testbed protocol project a while back -- it transfers the hierarchy incrementally, so it never has more than a handful of directories in action at any one time). At some point I will get back to working on an rsync-replacement project.

1. No chance, I'd think, of it handling hard links properly if it can't see at least one of the other copies. However, I could easily be wrong.

2. May require new boxes all round. Money may be an issue.

3. Yeah. My C skills are pretty rudimentary; I can barely read the stuff. Time for me to learn it was 30 or even more years ago, but I don't think it was invented then.

3. rsync does not detect when its partner has vanished.

That seems unlikely unless the remote shell is still around. If the shell has terminated, the socket would return an EOF and rsync would exit. So, I'll assume (until shown otherwise) that this is a case of the remote shell still hanging around.

I've been known to have unlikely failures before :-) It could be the bloody Billion in the way. I know that if a TCP session is quiet too long the Billion forgets all about it. The fact I'm now trying a VPN using UDP should overcome that issue.

3a. I'd like to see rsync have the ability to retry in the case it's initiated the transfer.

There has been some talk of this recently. It doesn't seem like it would be too hard to do, but it's not trivial either. If someone wanted to code something up, I'd certainly appreciate the assistance. Or feel free to put an enhancement request into bugzilla.

(BTW: has anyone heard from J.W. Schultz anytime recently? He seems to have dropped off the net without any explanation about 3 months ago -- I hope he's OK.)

4. [...] As best I can tell, rsync is transferring each copy fresh instead of recognising the hard link before the transfer and getting the destination rsync to make a new hard link.

This should not be the case if you use the -H option. (It also helps to use 2.6.2 on both ends, as the memory-consumption was reduced considerably from older releases.) If you're seeing a problem with this, you should provide full details on what command you're running, what versions you're using, and as small a test case as you can that shows the problem.

Well, I cut and pasted the command line (with only a minor edit to disguise relevant sites). Here 'tis again:

rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids

At the moment I'm running the standard latest versions for Woody (office) and Red Hat Linux 9 (home). Actually, Woody may be one fix behind latest; there was a new rsync out in the past few days that went in this morning. Nah, looks like the update went in before the last run started: the rsync binary it's got open isn't deleted.

Office: rsync version 2.5.6cvs protocol version 26
Home: rsync version 2.5.7 protocol version 26

Is there a source of pre-built binaries? I didn't see it on the Rsync site.

..wayne..
Re: Reliability and robustness problems
Wayne Davison wrote:

On Mon, Jun 07, 2004 at 07:40:22PM -0700, Wayne Davison wrote:

So, I'll assume (until shown otherwise) that this is a case of the remote shell still hanging around.

There's one other possibility I thought of. You mentioned that your kernel has gone around killing processes when memory is low. If one rsync process is just sitting around waiting to be killed by its sibling rsync process, but that sibling process got killed before it had a chance to generate the "all done" signal, a do-nothing rsync process could be left hanging around indefinitely. This is pretty rare, though, as most of the time rsync is actively interacting with the open socket and it notices when something goes wrong.

..wayne..

I don't know. Kernels at both office and home have done this: most recently it was home, but by now I've destroyed the information needed to know.

On the subject of signals, when rsync dies for any signal-related reason, it does not produce the stats. Most recently this occurred this morning when I very carefully chose to kill -HUP it. It also misreported the signal as USR1 or INT. Whichever, it could have reported the stats.

A stat I don't see is how much memory was used. This would be very helpful in estimating what our memory requirements might be, especially as I don't see any guidelines elsewhere.

I might also add here that the stats I see seem targeted at hackers. I find them next to incomprehensible and so mostly useless. Numbers I do understand include megabytes transferred (accuracy to the last byte is meaningless on my runs) and transfer speed.

Some of these numbers are beyond easy comprehension:

Total file size: 1850665035 bytes
Total transferred file size: 3064385 bytes
Literal data: 3065439 bytes

I prefer megabytes, and punctuation.
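Until rsync prints something friendlier, the three figures can be post-processed. The sketch below, in Python, shows the kind of output meant by "megabytes, and punctuation"; it is a formatting illustration only, the helper name human is made up, and it uses binary megabytes (MiB, 2**20 bytes), which is an arbitrary choice.

```python
def human(byte_count: int) -> str:
    """Format a byte count with thousands separators and a MiB figure."""
    return f"{byte_count:,} bytes ({byte_count / 2**20:,.1f} MiB)"

# The three figures quoted from the --stats output above.
stats = {
    "Total file size": 1850665035,
    "Total transferred file size": 3064385,
    "Literal data": 3065439,
}

report = [f"{label}: {human(value)}" for label, value in stats.items()]
print("\n".join(report))
```

Read that way, the run moved about 3 MiB of literal data against roughly 1.7 GiB of files.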
Re: Reliability and robustness problems
(I see there's already been an exchange between you and Wayne, but I'll still send this reply that I composed to your original email.)

On Tue, 08 Jun 2004, John [EMAIL PROTECTED] wrote:

We maintain the contents of latest thus:

+ rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids --exclude-from=/etc/local/backup/system-backup.excludes /boot/ / /home/ /var/ /var/local/backups/office//latest

Why the double slash before latest?

and create the backup-du-jour:

+ cp -rl /var/local/backups/office//latest /var/local/backups/office//20040607-2128-mon

That part works well, and the rsync part generally takes about seven minutes.

To copy office to home we try this:

+ rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids /var/local/backups 192.168.0.1:/var/local/backups/

I can see where you will have a dreadful number of files to process if you are also processing all the previous backups.

Problems we've had include

1. ADSL connexion at one end or the other dropping for a while. rsync doesn't notice and mostly hangs. I have seen rsync at home still running but with no relevant files open.

2. rsync uses an enormous amount of virtual memory with the result the Linux kernel lashes out at lots of processes, mostly innocent, until it lucks on rsync. This can cause rsync to terminate without a useful message.

2a. Sometimes the rsync that does this is at home. I've alleviated this at office by allocating an unreasonable amount of swap: unreasonable because if it gets used, performance will be truly dreadful.

In neither this nor your previous post have you mentioned the version of rsync or the OSes involved. Versions of rsync prior to 2.6.2 (skipping 2.6.1) have non-optimized hard link processing that used twice as much memory (!) and sometimes copied hard-linked files when there was already a match on the receiver. If you are not using 2.6.2, install that on both ends and try it again.

3. rsync does not detect when its partner has vanished. I don't understand why this should be so: it seems to me that, at office, it should be able to detect by the fact {r,s}sh has terminated or by timeout, and at home by timeout.

There are two timeouts - a relatively short internal socket I/O timeout and a user-controlled client-server communications timeout. If you are not using --timeout and the link goes down at the wrong time, rsync can sit there forever waiting for the next item from the other end. Use --timeout set to some number of seconds that seems long enough to get the job done. If it times out, then either bump it or try to solve the cause of the timeout.

3a. I'd like to see rsync have the ability to retry in the case it's initiated the transfer. It can take some time to collect together the information as to what needs to be done: if I retry in its wrapper script, then this has to be redone whereas, I surmise, rsync doing the retry would not need to.

You need to avoid the kinds of rsync where this becomes a major factor.

4. I've already mentioned this, but as I've had no feedback I'll try again. As you can see from the above, the source directories for the transfer from office to home are chock-full of hard links. As best I can tell, rsync is transferring each copy fresh instead of recognising the hard link before the transfer and getting the destination rsync to make a new hard link. It is so that it _can_ do this that I present the backup directory as a whole and not the individual day's backup. That, and I have hopes that today's unfinished work will be done tomorrow.

2.6.2 has fixes for unnecessary transfers.

btw the latest directory contains 1.5 Gbytes of data. The system is still calculating that today's backup contains 1.5 Gbytes, so it seems the startup costs are considerable.

It's not the size of the data that hurts, it's the number of files and directories involved.

Here's what I suggest. Since you have wisely made a static snapshot of the content that you wish to back up, do the office -> home rsync in two steps.

First, only rsync the latest directory, using your original rsync arguments with the source and destination as:

/var/local/backups/latest 192.168.0.1:/var/local/backups/latest/

Unchanged content won't be disturbed. Changed or new content will get transferred.

When that completes successfully, then do the second rsync, but do *not* use --delete-excluded. The second rsync should include latest and the new YYYYMMDD-HHMM-ddd directory, and exclude all others. That should be nothing but hardlinks and should go very quickly once the filesystem scan for the two hierarchies is done.

--
John Van Essen Univ of MN
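Since rsync itself has no retry, the wrapper-script approach mentioned under 3a combined with --timeout looks roughly like the Python sketch below. It is a sketch under assumptions, not a tested backup script: the function name run_with_retries, the attempt count and the pause are invented, and the demonstration at the bottom uses a fake runner that returns exit codes 30, 30, 0 instead of starting a real rsync (30 being the "timeout in data send/receive" code seen earlier in this thread). As noted in 3a, every retry makes rsync rebuild its file list from scratch.

```python
import subprocess
import time

def run_with_retries(argv, attempts=3, pause_seconds=60,
                     run=subprocess.call, sleep=time.sleep):
    """Run argv until it exits 0 or attempts run out; return the last status."""
    status = None
    for attempt in range(1, attempts + 1):
        status = run(argv)
        if status == 0:
            break
        if attempt < attempts:
            sleep(pause_seconds)  # give the link a chance to come back
    return status

# Abbreviated form of the command line from this thread, with --timeout added.
rsync_argv = [
    "rsync", "--recursive", "--links", "--hard-links", "--timeout=3600",
    "/var/local/backups", "192.168.0.1:/var/local/backups/",
]

# Demonstration with a fake runner, so nothing is actually transferred.
fake_results = iter([30, 30, 0])
calls = []

def fake_run(argv):
    calls.append(argv)
    return next(fake_results)

final_status = run_with_retries(rsync_argv, attempts=3,
                                run=fake_run, sleep=lambda seconds: None)
print(f"final status {final_status} after {len(calls)} attempts")
```

To use it for real, drop the run and sleep arguments so that subprocess.call and time.sleep are used, and pass the full option list.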
Re: Reliability and robustness problems
John Van Essen wrote: (I see there's already been an exchange between you and Wayne, but I'll still send this reply that I composed to your original email.) I'm glad you did! On Tue, 08 Jun 2004, John [EMAIL PROTECTED] wrote: We maintain the contents of latest thus: + rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids --exclude-from=/etc/local/backup/system-backup.excludes /boot/ / /home/ /var/ /var/local/backups/office//latest Why the double slash before latest? Just an accident of the way creation and substitution of variables worked. It doesn't matter to Linux (or *ix in general). and create the backup-du-jour: + cp -rl /var/local/backups/office//latest /var/local/backups/office//20040607-2128-mon That part works well, and the rsync part generally takes about seven minutes. To copy office to home we try this: + rsync --recursive --links --hard-links --perms --owner --group --devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete --delete-excluded --delete-after --max-delete=80 --relative --stats --numeric-ids /var/local/backups 192.168.0.1:/var/local/backups/ I can see where you will have a dreadful number of files to process if you are also processing all the previous backups. I will, at some time implement some pruning of the backups. Presenting the full list ensures any previous backup that didn't complete gets fixed. However well rsync works, I can't rule out power failures. Problems we've had include 1. ADSL connexion at one end ot the other dropping for a while. rsync doesn't notice and mostly hangs. I have seen rsync at home still running but with no relevant files open. 2. rsync uses an enormous amount of virtual memory with the result the Linux kernel lashes out at lots of processes, mostly innocent, until it lucks on rsync. 
This can cause rsync to terminate without a useful message. 2a. Sometimes the rsync that does this is at home. I've alleviated this at office by allocating an unreasonable amount of swap: unreasonable because if it gets used, performance will be truly dreadful. In neither this nor your previous post have you mentioned the version of rsync or the OSes involved. Versions of rsync prior to 2.6.2 (skipping 2.6.1) have non-optimized hard link processing that used twice as much memory (!) and sometimes copied hard-linked files when there was already a match on the receiver. If you are not using 2.6.2, install that on both ends and try it again. I have now. I will be upgrading: I've built 2.6.2 from Sarge, am mulling over what to do for RHL 7.3. I ask myself, Will Woody binaries work? Do I need a RHL 7.3 development machine? 3. rsync does not detect when its partner has vanished. I don't understand why this should be so: it seems to me that, at office, it should be able to detect by the fact {r,s}sh has terminated or by timeout, and at home by timeout. There are two timeouts - a relatively short internal socket I/O timeout and a user-controlled client-server communications timeout. If you are not using --timeout and the link goes down at the wrong time, rsync can sit there forever waiting for the next item from the other end. Use --timeout set to some number of seconds that seems long enough to get the job done. If it times out, then either bump it or try to solve the cause of the timeout. This is consistent with what I see. --timeout=500 will be in the next run. 3a. I'd like to see rsync have the ability to retry in the case it's initiated the transfer. It can take some time to collect together the information as to what needs to be done: if I try in its wrapper script, then this has to be redone whereas, I surmise, rsync doing the retry would not need to. You need to avoid the kinds of rsync where this becomes a major factor. Well, yes. 
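For the wrapper-script retry mentioned in item 3a, combined with the --timeout advice, a minimal sketch might look like the following. The names retry, MAX_TRIES and RETRY_DELAY are illustrative assumptions, not rsync features. As the poster notes, each new attempt repeats the file-list scan; the wrapper does nothing to avoid that cost.

```shell
#!/bin/sh
# Retry wrapper (sketch). retry, MAX_TRIES and RETRY_DELAY are
# illustrative names, not part of rsync.
MAX_TRIES=${MAX_TRIES:-5}
RETRY_DELAY=${RETRY_DELAY:-300}   # seconds to wait between attempts

# retry CMD [ARGS...]
# Run the command until it exits 0 or MAX_TRIES attempts have been made.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
    try=1
    while :; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -eq 0 ]; then
            return 0
        fi
        if [ "$try" -ge "$MAX_TRIES" ]; then
            echo "giving up after $try attempts (last exit status $rc)" >&2
            return "$rc"
        fi
        echo "attempt $try failed with status $rc; retrying in ${RETRY_DELAY}s" >&2
        sleep "$RETRY_DELAY"
        try=$((try + 1))
    done
}
```

It would be used by putting `retry` in front of the existing command, for example `retry rsync --timeout=500` followed by the remaining options, source and destination. With --timeout set, a dropped link makes rsync exit with an error instead of hanging, which is what gives the wrapper a chance to try again.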
I'm using the latest in the latest stable version of Debian. 4. I've already mentioned this, but as I've had no feedback I'll try again. As you can see from the above, the source directories for the transfer from office to home are chock-full of hard links. As best I can tell, rsync is transferring each copy fresh instead of recognising the hard link before the transfer and getting the destination rsync to make a new hard link. It is so that it _can_ do this that I present the backup directory as a whole and not the individual day's backup. That, and I have hopes that today's unfinished work will be done tomorrow. 2.6.2 has fixes for unnecessary transfers. Good. btw the latest directory contains 1.5 Gbytes of data. The system is still calculating that today's backup contains 1.5 Gbytes, so it seems the startup costs are considerable. It's not the size of the data that hurts, it's the number of files and directories involved. Here's what I suggest. Since you have wisely made a static snapshot of