Re: Anti hang comments - unexpected EOF in read_timeout
Some feedback. I'd been having trouble with rsync hanging or early EOF timeouts. ... versions over many days and believe that both the kernel upgrade and the latest rsync cvs were necessary.

Nothing hangs in the case below, the job is done, but what produces the message "Aborted by user! unexpected EOF in read_timeout"?

Example of a cron job email:

***
Date: Wed, 25 Jul 2001 01:00:01 +0200
From: [EMAIL PROTECTED] (Cron Daemon)
To: [EMAIL PROTECTED]
Subject: Cron root@it97 test -x /usr/lib/cron/run-crons /usr/lib/cron/run-crons
X-Cron-Env: SHELL=/bin/sh
X-Cron-Env: PATH=/usr/bin:/usr/sbin:/sbin:/bin:/usr/lib/news/bin
X-Cron-Env: MAILTO=root
X-Cron-Env: HOME=/root
X-Cron-Env: LOGNAME=root

Aborted by user!
unexpected EOF in read_timeout
Willkommen auf dem rsync-Server %SERVER
receiving file list ... done
ithum/yakumo/.kde/share/apps/kmail/
ithum/yakumo/.kde/share/config/
ithum/yakumo/.kde/share/apps/kmail/ithum:@192.168.111.113:110
ithum/yakumo/.kde/share/config/kdesktoprc
ithum/yakumo/.netscape/cache/index.db
ithum/yakumo/.netscape/history.dat
ithum/yakumo/.kde/share/apps/kmail/
ithum/yakumo/.kde/share/config/
ithum/yakumo/.netscape/
ithum/yakumo/.netscape/cache/
wrote 4564 bytes read 172792 bytes 6692.68 bytes/sec
total size is 178910233 speedup is 1008.76
***

--
Irmund Thum +491796998564
Anti hang comments
Some feedback. I'd been having trouble with rsync hanging or early EOF timeouts. The problem occurred on 2 identical receiving machines.

I upgraded from Linux kernel 2.4.2 (RH 7.1) to 2.4.7 on the receiving machines. The sender machines run the standard RH 6.1 and RH 6.2 kernels. The receiving machines are Duron 900 MHz, 256 MB, software RAID1, ext2, 2x40 GB. This improved the situation but still gave problems.

Then I downloaded the latest rsync cvs and compiled it on both machines, and it's now working perfectly. There is around 10 GB of data to sync.

I tried many permutations of raid/non-raid, kernel 2.4.2/2.4.7 and rsync versions over many days and believe that both the kernel upgrade and the latest rsync cvs were necessary.

John Leach
http://osware.net
Melbourne
Re: Anti-hang comments?
On Thu, Jul 05, 2001 at 10:58:22AM -0700, you [Jos Backus] claimed:
On Thu, Jul 05, 2001 at 12:48:06PM -0500, Dave Dykstra wrote:
If you really want it to stay in the foreground, edit become_daemon in socket.c.

It would be nice to have this available as an option so rsyncd can be run under djb's daemontools.

I also needed that option to run it as a service under cygwin. I think I have the patch somewhere, although it is of course trivial to reimplement.

-- v --
[EMAIL PROTECTED]
Re: Anti-hang comments?
On Thu, Jul 05, 2001 at 12:38:00AM -0500, Phil Howard wrote:
Wayne Davison wrote:
We certainly do need to be careful here, since the interaction between the various read and write functions can be pretty complex. However, I think that the data flow of my move-files patch stress-tests this code fairly well, so once we've done some more testing I feel that we will not leave rsync worse off than it was before the patch.

Along those lines, I've been testing the new code plus I ported a version of my move-files patch on top of it. The result has a couple fewer bugs and seems to be working well so far. The latest non-expanding-buffer-nohang patch is in the same place:

http://www.clari.net/~wayne/rsync-nohang2.patch

and the new move-files patch that works with nohang2 is here:

http://www.clari.net/~wayne/rsync-move-files2.patch

I'll keep banging on it. Let me know what you think.

So far it is working for me. Now I can kill my client side and know that my daemon side will properly close down and exit and not leave a dangling lock. But the problem I still have (not quite as bad as before because of no more hangs) is that the locks to control the number of daemons are still working wrong. It's still locking the whole lock file instead of the first lockable 4 byte record. I still don't know if it is rsync or Linux causing the problem. The code in both looks right to me. But lslk shows:

SRC     PID    DEV  INUM  SZ  TY  M  ST  WH  END  LEN  NAME
rsyncd  24401  3,5  44    0   w   0  0   0   0    0    /tmp/rsyncd.lock

(note, I've been moving the lock file around to see if it might be sensitive to filesystem mounting options I'm using, etc).

I'd like to find a way to start rsync in daemon mode AND leave it in the foreground so I can run it via strace and maybe see if the syscall is being done right.

You shouldn't have to have it be in the foreground in order for strace -f to work.
I just wrote a test program that verified it:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (fork() == 0) {
            /* child: move into its own session, as a daemon would */
            printf("child\n");
            setsid();
            sleep(10);
            printf("bye bye\n");
        }
        return 0;
    }

strace on that waits until the child process has exited.

If you really want it to stay in the foreground, edit become_daemon in socket.c.

- Dave Dykstra
Re: Anti-hang comments?
On Thu, Jul 05, 2001 at 12:48:06PM -0500, Dave Dykstra wrote:
If you really want it to stay in the foreground, edit become_daemon in socket.c.

It would be nice to have this available as an option so rsyncd can be run under djb's daemontools.

--
Jos Backus
Santa Clara, CA
[EMAIL PROTECTED]
use Std::Disclaimer;
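For illustration, here is roughly what such an option could look like. This is a minimal sketch and not the actual become_daemon from socket.c: the function name become_daemon_opt and the no_detach parameter are made up for this example, and the real routine may do more (or less) than what is shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical daemonizing routine with a "stay in the foreground" switch.
 * Returns 0 on success, -1 on error. Not rsync's real code. */
static int become_daemon_opt(int no_detach)
{
    assert(no_detach == 0 || no_detach == 1);  /* treated as a boolean flag */

    if (no_detach)
        return 0;               /* supervisor (or strace) keeps this pid */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);               /* parent leaves, child carries on */

    if (setsid() < 0)           /* drop the controlling terminal */
        return -1;

    /* point stdin, stdout and stderr at /dev/null */
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;
    dup2(fd, 0);
    dup2(fd, 1);
    dup2(fd, 2);
    if (fd > 2)
        close(fd);
    return 0;
}
```

With no_detach set, the process never forks, so a supervisor such as daemontools would see the same pid for the lifetime of the server.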
Re: Anti-hang comments?
Dave Dykstra wrote:
You shouldn't have to have it be in the foreground in order for strace -f

You're right, I was not aware of that option. And I thought I knew my way around strace. Here's what strace shows me:

[pid 14576] open("/tmp/rsyncd.lock", O_RDWR|O_CREAT|0x8000, 0600) = 4
[pid 14576] fcntl(4, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0

But the source looks just right:

connection.c[39,42]:

    /* find a free spot */
    for (i=0;i<max_connections;i++) {
        if (lock_range(fd, i*4, 4)) return 1;
    }

util.c[494,506]:

    /* lock a byte range in a open file */
    int lock_range(int fd, int offset, int len)
    {
        struct flock lock;

        lock.l_type = F_WRLCK;
        lock.l_whence = SEEK_SET;
        lock.l_start = offset;
        lock.l_len = len;
        lock.l_pid = 0;

        return fcntl(fd,F_SETLK,&lock) == 0;
    }

I guess maybe there's a library issue involved. But why it would stomp on a structure element is unclear. I'm putting together a couple new systems with Slackware 8.0 which has glibc 2.2.3 so I'll probably just first try it on there and see if the problem persists or not.

--
| Phil Howard - KA9WGN | Dallas     | http://linuxhomepage.com/ |
| [EMAIL PROTECTED]    | Texas, USA | http://phil.ipal.org/     |
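One way to check which bytes the kernel really considers locked, independent of strace or lslk, is to probe from a second process with F_GETLK. The following is a minimal sketch and not rsync code: lock_range() mirrors the function quoted above (with off_t arguments instead of int), and byte_is_locked() is a hypothetical helper introduced only for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to write-lock bytes [offset, offset+len) without blocking.
 * Returns 1 on success, 0 on failure. */
static int lock_range(int fd, off_t offset, off_t len)
{
    struct flock lock;

    assert(len > 0);            /* len == 0 would mean "to end of file" */
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = offset;
    lock.l_len = len;
    lock.l_pid = 0;
    return fcntl(fd, F_SETLK, &lock) == 0;
}

/* Hypothetical helper: fork a child and let it ask the kernel whether a
 * write lock on the single byte at `offset` would conflict with a lock
 * held by the parent. fcntl locks belong to a process, so the child sees
 * the parent's locks as foreign.
 * Returns 1 if locked, 0 if free, -1 on error. */
static int byte_is_locked(int fd, off_t offset)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe;

        probe.l_type = F_WRLCK;
        probe.l_whence = SEEK_SET;
        probe.l_start = offset;
        probe.l_len = 1;
        probe.l_pid = 0;
        if (fcntl(fd, F_GETLK, &probe) != 0)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

If only the second 4-byte record is locked via lock_range(fd, 4, 4), then byte_is_locked(fd, 0) should report 0. A whole-file lock like the start=0, len=0 one in the strace output would make every offset report 1.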
Re: Anti-hang comments?
On Wed, 27 Jun 2001, Martin Pool wrote:
This is getting disturbingly complex. I realize the problem is complex too, so this is no slur on Wayne's coding. My gut reaction is that if we start adding this then the program's behaviour will become even more baroque.

We certainly do need to be careful here, since the interaction between the various read and write functions can be pretty complex. However, I think that the data flow of my move-files patch stress-tests this code fairly well, so once we've done some more testing I feel that we will not leave rsync worse off than it was before the patch.

Along those lines, I've been testing the new code plus I ported a version of my move-files patch on top of it. The result has a couple fewer bugs and seems to be working well so far. The latest non-expanding-buffer-nohang patch is in the same place:

http://www.clari.net/~wayne/rsync-nohang2.patch

and the new move-files patch that works with nohang2 is here:

http://www.clari.net/~wayne/rsync-move-files2.patch

I'll keep banging on it. Let me know what you think.

..wayne..
Re: Anti-hang comments?
Wayne Davison wrote:
We certainly do need to be careful here, since the interaction between the various read and write functions can be pretty complex. However, I think that the data flow of my move-files patch stress-tests this code fairly well, so once we've done some more testing I feel that we will not leave rsync worse off than it was before the patch.

Along those lines, I've been testing the new code plus I ported a version of my move-files patch on top of it. The result has a couple fewer bugs and seems to be working well so far. The latest non-expanding-buffer-nohang patch is in the same place:

http://www.clari.net/~wayne/rsync-nohang2.patch

and the new move-files patch that works with nohang2 is here:

http://www.clari.net/~wayne/rsync-move-files2.patch

I'll keep banging on it. Let me know what you think.

So far it is working for me. Now I can kill my client side and know that my daemon side will properly close down and exit and not leave a dangling lock. But the problem I still have (not quite as bad as before because of no more hangs) is that the locks to control the number of daemons are still working wrong. It's still locking the whole lock file instead of the first lockable 4 byte record. I still don't know if it is rsync or Linux causing the problem. The code in both looks right to me. But lslk shows:

SRC     PID    DEV  INUM  SZ  TY  M  ST  WH  END  LEN  NAME
rsyncd  24401  3,5  44    0   w   0  0   0   0    0    /tmp/rsyncd.lock

(note, I've been moving the lock file around to see if it might be sensitive to filesystem mounting options I'm using, etc).

I'd like to find a way to start rsync in daemon mode AND leave it in the foreground so I can run it via strace and maybe see if the syscall is being done right.

--
| Phil Howard - KA9WGN | Dallas     | http://linuxhomepage.com/ |
| [EMAIL PROTECTED]    | Texas, USA | http://phil.ipal.org/     |
Re: Anti-hang comments?
On 26 Jun 2001, Wayne Davison [EMAIL PROTECTED] wrote: Here's a solution with a non-growing buffer. This is getting disturbingly complex. I realize the problem is complex too, so this is no slur on Wayne's coding. My gut reaction is that if we start adding this then the program's behaviour will become even more baroque. I'll read it and see how it goes. -- Martin
Re: Anti-hang comments?
On Tue, 26 Jun 2001, Wayne Davison wrote: Since read_int() is a fairly high-level call, I had to manually ensure that a flush doesn't happen and to ensure that reading the redo_fd doesn't try to read the io_error_fd (both to avoid nested read attempts on the redo_fd). In case you're wondering where this extra read_int() call is in my patch, I changed it so that it avoids higher-level calls when handling the lower-level read functionality. The patch was updated yesterday before the first person grabbed a copy, so there's no need for anyone to re-grab the patch. ..wayne..
Re: Anti-hang comments?
On Mon, 25 Jun 2001, Andrew Tridgell wrote:
I've applied your simple nohang patch.

Cool. That's the one that affects the most people.

Instead we need a way of reproducing the bug and see if we can find a solution without a buffer.

You can minimize the buffer usage by applying my move-files patch. It constantly reads the redo pipe during the generator's main loop and marks the redo items with a flag in the existing files struct (and also forwards the delete indicators on to the sender). This ensures that this buffer doesn't expand much at all. (With both patches applied I haven't seen it reallocate except when I tested the buffer code with a 16-byte realloc size.)

Alternately, it might not be too hard to remove the buffer and have the low-level code take a more direct role in interpreting the data, but I'd have to look at this more closely to see for sure.

One way to reproduce this hang is to modify the receiver code to redo every file that is processed in the first phase. Also, my move-files patch puts enough extra data down the sender-to-generator pipe that it should hang up without difficulty if you disable the buffer and use the --move-files option.

..wayne..
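As background for why extra data down a pipe can provoke a hang: a pipe only holds a finite amount, and once it is full a blocking writer stalls until the other end reads. The sketch below is my own illustration, not rsync code, and fill_pipe is a made-up name. It uses a non-blocking pipe so the "full" condition shows up as EAGAIN rather than as a stall.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write zero bytes into a non-blocking pipe until the kernel refuses more.
 * Returns the number of bytes accepted, or -1 on an unexpected error. */
static long fill_pipe(int wfd)
{
    char chunk[1024];
    long total = 0;

    assert(wfd >= 0);
    memset(chunk, 0, sizeof chunk);
    for (;;) {
        ssize_t n = write(wfd, chunk, sizeof chunk);

        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;       /* pipe is full: a blocking write would stall here */
        return -1;
    }
}
```

With a blocking descriptor the final write would simply not return, which is the situation two processes end up in when each is waiting for the other to read.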
Re: Anti-hang comments?
On Mon, 25 Jun 2001, Andrew Tridgell wrote:
see if we can find a solution without a buffer.

Here's a solution with a non-growing buffer. This code keeps the receiver-generator pipe clear by reading the ints and setting redo flags in a character array (of flist-count elements). I'm avoiding setting flags in the actual flist structure since it is shared memory between 2 forked processes, and this might cause a lot of memory to become unshared (if the OS supports copy on write for fork).

Since read_int() is a fairly high-level call, I had to manually ensure that a flush doesn't happen and to ensure that reading the redo_fd doesn't try to read the io_error_fd (both to avoid nested read attempts on the redo_fd).

I have done some simple testing of this with my usual redo all files testing tweak and it is working fine, but the code is still pretty young. If you want to test this, be sure to unapply my previous no-hang patch or start fresh from the CVS version. The new patch is here:

http://www.clari.net/~wayne/rsync-nohang2.patch

I think it will also work to start from 2.4.6, but you should also apply the other no-hang fix I made (that was recently committed to CVS):

http://www.clari.net/~wayne/rsync-nohang1.patch

You'll need to use patch -p1 to apply the new patches (unlike the previous one, which used -p0) since I had a request for the top-level directory to be included in the file names.

[FYI, I have not yet ported my move-files patch to use this code.]

..wayne..
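To make the non-growing idea concrete, here is a rough sketch of draining a pipe of file indexes into a fixed-size flag array. It is not taken from the nohang2 patch: struct redo_state and drain_redo_fd are hypothetical names, and the 4-byte little-endian index format is an assumption made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical per-pipe state: one flag byte per file-list entry plus up to
 * 3 leftover bytes of a partially received index. Nothing here ever grows. */
struct redo_state {
    char *flags;            /* flags[i] == 1 means "redo file i" */
    int count;              /* number of file-list entries */
    unsigned char part[4];  /* bytes of an index received so far */
    int part_len;
};

/* Read whatever is currently available on a non-blocking fd and mark each
 * complete index. Returns the number of indexes marked by this call, or -1
 * on a read error other than "no data right now". */
static int drain_redo_fd(int fd, struct redo_state *st)
{
    unsigned char buf[256];
    int marked = 0;

    assert(st != NULL && st->flags != NULL);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n == 0)
            return marked;                  /* writer closed the pipe */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return marked;              /* pipe is empty for now */
            return -1;
        }
        for (ssize_t i = 0; i < n; i++) {
            st->part[st->part_len++] = buf[i];
            if (st->part_len < 4)
                continue;
            /* assumed wire format: 4-byte little-endian index */
            uint32_t v = (uint32_t)st->part[0]
                       | ((uint32_t)st->part[1] << 8)
                       | ((uint32_t)st->part[2] << 16)
                       | ((uint32_t)st->part[3] << 24);
            st->part_len = 0;
            if (v < (uint32_t)st->count) {  /* ignore out-of-range values */
                st->flags[v] = 1;
                marked++;
            }
        }
    }
}
```

Because the flags live in a private array rather than in the shared file list, marking an entry touches only that array's pages, which is the copy-on-write concern mentioned above.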
Re: Anti-hang comments?
Wayne,

I've applied your simple nohang patch. The longer nohang patch I'm not nearly as confident of. It goes back to a method used in early versions of rsync where it uses a buffer that can grow indefinitely.

Just some history on this. The earliest versions of rsync had no buffer, then when I first saw hangs I added a growing buffer very similar to what your patch adds. Several people found that it grew to enormous sizes and brought the machine to its knees. I then added an arbitrary limit on its size (about 4M I think) and then some people found they got hangs when that filled up. Then I got rid of the buffer, and we did the pipe/socketpair thing which reduced the hangs a lot. Next it was discovered that the error pipe could still cause hangs, and I have fixed that in the current CVS tree, but without an infinitely growing buffer.

Now perhaps you have discovered another (much less common) way for it to hang, but I don't think the solution is a buffer. Instead we need a way of reproducing the bug and see if we can find a solution without a buffer.

The horrors of a badly designed protocol :(
Re: Anti-hang comments?
On 22 Jun 2001, [EMAIL PROTECTED] wrote:
I have been testing this patch in a duplicate of our production environment, for a week now. With the patch, the runs complete. I'm handling 86756263K in 1816688 files (at last count), an average of 47K per file (ranging up to about 0.5G). It seems to solve the problems. I think it constitutes rsync 2.4.6 (if you add in the other fixes - errors on module listing, etc.).

Do you mean 2.4.7?

I'm looking at the patch now. I think I will try it out here locally, and then make a tarball of 2.4.7pre1 for people to try out more broadly.

--
Martin
VA Linux Systems
GnuPG encrypted email preferred