RE: PATCH/RFC: Another stab at the Cygwin hang problem
Ah, I just found the patch that jw sent (the email system blocked it as a
potential virus). I will try to compile and test it this week. My own
environment uses only SSH push.

jpt

> -Original Message-
> From: jw schultz [mailto:[EMAIL PROTECTED]
> Sent: Saturday, July 12, 2003 6:53 AM
> To: [EMAIL PROTECTED]
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
> On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> > > -Original Message-
> > > From: jw schultz [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, July 09, 2003 5:59 AM
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> > >
> > > > I can't quite place why but my instincts inform me that you
> > > > have latched onto something. Some sort of one character
> > > > buffering error in the io libraries under cygwin. Most
> > > > likely in the windos libs.
> > > >
> > > > Well, we have two reports of this fixing the rsync hang
> > > > problem when signals failed. I'd like a little more testing
> > > > before mainlining it.
> > >
> > > Nope! This is a no-go. It intermittently produces
> > >
> > > error (10) -- error in socket IO
> > >
> > > on both network and local transfers.
> >
> > I guess I'd better double check my processes to make sure that I'm
> > getting a satisfactory success rate on my own servers. If I see any
> > clues, I'll report them here. Any hope for a fix, or does this look
> > like an inherent problem in the method being used?
>
> It looks like the method is fairly sound. The problem seems
> to primarily be in dealing with the child termination.
>
>  	io_set_error_fd(-1);
> -	kill(pid, SIGUSR2);
> -	wait_process(pid, &status);
> +	write(cleanup_pipe[1], ".", 1);
> +	if (waitpid(pid, &status, 0) != pid) {
> +		rprintf(FERROR,"cleanup in do_recv failed\n");
> +		exit_cleanup(RERR_SOCKETIO);
> +	}
>  	return status;
>
> There is a huge window between the write() and the return of
> waitpid() that, depending on scheduling and signal delivery,
> allows the child pid to be reaped by the SIGCHLD handler. That
> results in this waitpid() returning -1 with errno of ECHILD.
> EINTR would also be possible. The timing dependencies
> account for the intermittency of the error.
>
> I've attached an altered patch. I've only dealt with this
> one location which produced errors doing a ssh pull. I
> haven't addressed the local transfer errors but i suspect
> that they derived from this waitpid error. Further testing will
> still be needed to ensure that ssh push and rsyncd usage are
> unbroken. This really needs testing in cygwin which i don't
> have. If it takes care of the cygwin hang then we can
> polish it. There remains the issue of an error status
> when the only failure is termination.
>
> --
> J.W. Schultz            Pegasystems Technologies
> email address:          [EMAIL PROTECTED]
>
> Remember Cernan and Schmitt

--
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
RE: PATCH/RFC: Another stab at the Cygwin hang problem
> -Original Message-
> From: jw schultz [mailto:[EMAIL PROTECTED]
> Sent: Saturday, July 12, 2003 11:25 AM
> To: [EMAIL PROTECTED]
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
> [...]
> > Anyhow, just to let you know. If you're happy tidying
> > up and refining the patch yourself, please go ahead. If
> > you want me to do anything, or have any comments on
> > what I've done, I'd appreciate an email. However I
> > will try to follow the rsync list for the next few
> > weeks at least.
>
> As i said earlier, i intuit you are on to something with
> this patch. If you care to clean it up that would be good.
> I would rather someone experiencing the hangs do the fix.
> That tends to reduce the cycle times.

I'm willing to help test if someone sends improvements on Anthony's
original patch to the list. The original has been working well for my own
purposes so far. I realized when I started using it that I was being a
little hasty, but my own situation required quicker action than is usually
recommended. The risks were worth it, apparently.

What I'm most interested in seeing is a real fix for this hang problem
(Anthony's or someone else's) incorporated into an rsync release sometime
in the near future, so that I don't have to retain the patch code and
special instructions for reinstalling my own running system.

jpt
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Sat, Jul 12, 2003 at 11:42:52PM +0900, Anthony Heading wrote:
> On Sat, Jul 12, 2003 at 03:52:59AM -0700, jw schultz wrote:
> > There is a huge window between the write() and the return of
> > waitpid() that, depending on scheduling and signal delivery,
> > allows the child pid to be reaped by the SIGCHLD handler. That
> > results in this waitpid() returning -1 with errno of ECHILD.
> > EINTR would also be possible. The timing dependencies
> > account for the intermittency of the error.
>
> Hi JW -
>
> Afraid I've not really been following the rsync mailing list,
> and it seems you've been addressing your comments about
> my patch to James Tillman?

Not in the least. I've addressed them to the list.

> As I said originally, it was an illustrative patch - I didn't
> flesh out the error handling since that made the concept
> more difficult to follow.
>
> Catching up now, I think your observation here is right.
> In fact I'd made a similar change already myself locally.
>
> Only one difference - I was consciously avoiding calling
> wait_process(), since that function calls msleep() - which
> was implicated in the original hanging problem! Since
> there is no signal being sent any more, hopefully it's not
> a problem (except for the SIGUSR2 cases?) - however I
> was wanting to ensure that the hangs were _completely_
> eliminated, and thus didn't want to take any chances.
>
> So my own patch here is checking the errno and gives
> the OK for ECHILD. I would worry that the whole
> msleep NOHANG io_flush stuff is a very complex loop
> to run simply to collect an exit status, particularly
> when we believe that the root of the hang lies with
> the underlying Cygwin OS.

I don't recall msleep being a hang problem. I don't see how
it could be. Myself i wonder why the WNOHANG and msleep
loop is used instead of a normal waitpid. I initially had
waitpid with checking of the pid_stat_table on ECHILD but
disliked having the duplicate code. Besides, if wait_process
has a hang problem let's fix that instead of orphaning it.

> But I think as long as the hangs don't reappear, your
> updated patch is obviously more concise. Otherwise, I'll be
> further tempted to take the axe to the SIGCHLD handling,
> which looks somewhat jammed with voodoo cruft.

Layer on layer. I don't care for it myself but changes in
this tend to cause problems on less popular platforms.

> Anyhow, just to let you know. If you're happy tidying
> up and refining the patch yourself, please go ahead. If
> you want me to do anything, or have any comments on
> what I've done, I'd appreciate an email. However I
> will try to follow the rsync list for the next few
> weeks at least.

As i said earlier, i intuit you are on to something with
this patch. If you care to clean it up that would be good.
I would rather someone experiencing the hangs do the fix.
That tends to reduce the cycle times.

--
J.W. Schultz            Pegasystems Technologies
email address:          [EMAIL PROTECTED]

Remember Cernan and Schmitt
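The ECHILD and EINTR cases discussed in this message can be sketched as below. This is only an illustration, not code from rsync or from either patch: `reap_child` is a hypothetical helper name, and the sketch assumes that when the SIGCHLD handler has already collected the child, the caller would look up the saved exit status elsewhere (the thread mentions a pid_stat_table for that purpose).

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of rsync): wait for `pid`, retrying
 * when the call is interrupted by a signal.
 * Returns  0 if we reaped the child ourselves (*status is filled in),
 *          1 if the child was already reaped elsewhere (ECHILD),
 *         -1 on any other failure (errno is left set by waitpid). */
static int reap_child(pid_t pid, int *status)
{
	pid_t r;

	assert(status != NULL);
	do {
		r = waitpid(pid, status, 0);
	} while (r == -1 && errno == EINTR);

	if (r == pid)
		return 0;
	if (r == -1 && errno == ECHILD)
		return 1;	/* e.g. a SIGCHLD handler got there first */
	return -1;
}
```

A caller would treat a return of 1 as "no error, but fetch the status from wherever the handler stored it" rather than as a socket IO failure.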
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Sat, Jul 12, 2003 at 03:52:59AM -0700, jw schultz wrote:
> There is a huge window between the write() and the return of
> waitpid() that, depending on scheduling and signal delivery,
> allows the child pid to be reaped by the SIGCHLD handler. That
> results in this waitpid() returning -1 with errno of ECHILD.
> EINTR would also be possible. The timing dependencies
> account for the intermittency of the error.

Hi JW -

Afraid I've not really been following the rsync mailing list,
and it seems you've been addressing your comments about
my patch to James Tillman?

As I said originally, it was an illustrative patch - I didn't
flesh out the error handling since that made the concept
more difficult to follow.

Catching up now, I think your observation here is right.
In fact I'd made a similar change already myself locally.

Only one difference - I was consciously avoiding calling
wait_process(), since that function calls msleep() - which
was implicated in the original hanging problem! Since
there is no signal being sent any more, hopefully it's not
a problem (except for the SIGUSR2 cases?) - however I
was wanting to ensure that the hangs were _completely_
eliminated, and thus didn't want to take any chances.

So my own patch here is checking the errno and gives
the OK for ECHILD. I would worry that the whole
msleep NOHANG io_flush stuff is a very complex loop
to run simply to collect an exit status, particularly
when we believe that the root of the hang lies with
the underlying Cygwin OS.

But I think as long as the hangs don't reappear, your
updated patch is obviously more concise. Otherwise, I'll be
further tempted to take the axe to the SIGCHLD handling,
which looks somewhat jammed with voodoo cruft.

Anyhow, just to let you know. If you're happy tidying
up and refining the patch yourself, please go ahead. If
you want me to do anything, or have any comments on
what I've done, I'd appreciate an email. However I
will try to follow the rsync list for the next few
weeks at least.

Rgds

Anthony
Re: PATCH/RFC: Another stab at the Cygwin hang problem
jw schultz wrote:

>I've attached an altered patch. I've only dealt with this
>one location which produced errors doing a ssh pull.

OK, I created a test package with your patch included, so that anyone
willing to test but not willing to compile can use it.
Please note that it is built against the experimental cygwin DLL release
1.5.0, with support for 64-bit files. If used on a system with an older DLL
it can do "bad things"... and maybe can do bad things anyway; I take no
responsibility as this is a *TEST* package (not that I am responsible
anyway =P).

http://www.lapo.it/tmp/rsync-2.5.6-3.tar.bz2
http://www.lapo.it/tmp/rsync-2.5.6-3-src.tar.bz2

Please note moreover that, overall, in seven "make check" runs I got the
following failures (each one once). I'd say this patch gives problems ^_^

- unsafe-links log follows Testing for symlinks using 'test -h' + echo rsync with relative path and just -a rsync with relative path and just -a + /tmp/rsync-2.5.6/rsync.exe -avv from/safe/ to building file list ...
expand file_list to 4000 bytes, did move done created directory to delta-transmission disabled for local transfer or --whole-file files/file1 files/file2 links/file1 -> ../files/file1 links/file2 -> ../files/file2 links/unsafefile -> ../../unsafe/unsafefile total: matches=0 tag_hits=0 false_alarms=0 data=0 wrote 297 bytes read 52 bytes 232.67 bytes/sec total size is 342 speedup is 0.98 + test_symlink to/links/file1 + is_a_link to/links/file1 + test -h to/links/file1 + test_symlink to/links/file2 + is_a_link to/links/file2 + test -h to/links/file2 + test_symlink to/links/unsafefile + is_a_link to/links/unsafefile + test -h to/links/unsafefile + echo rsync with relative path and -a --copy-links rsync with relative path and -a --copy-links + /tmp/rsync-2.5.6/rsync.exe -avv --copy-links from/safe/ to building file list ... expand file_list to 4000 bytes, did move done delta-transmission disabled for local transfer or --whole-file files/file1 is uptodate files/file2 is uptodate links/file1 is uptodate links/file2 is uptodate links/unsafefile total: matches=0 tag_hits=0 false_alarms=0 data=0 wrote 198 bytes read 36 bytes 468.00 bytes/sec total size is 0 speedup is 0.00 + test_regular to/links/file1 + [ ! -f to/links/file1 ] + test_regular to/links/file2 + [ ! -f to/links/file2 ] + test_regular to/links/unsafefile + [ ! -f to/links/unsafefile ] + echo rsync with relative path and --copy-unsafe-links rsync with relative path and --copy-unsafe-links + /tmp/rsync-2.5.6/rsync.exe -avv --copy-unsafe-links from/safe/ to pipe: Address already in use rsync error: error in IPC code (code 14) at pipe.c(107) - unsafe-links log ends FAILunsafe-links - hands log follows Testing for symlinks using 'test -h' Test basic operation: Running: "/tmp/rsync-2.5.6/rsync.exe -av /tmp/rsync-2.5.6/testtmp.hands/from/ /tmp/rsync-2.5.6/testtmp.hands/to" building file list ... 
done ./ dir/ dir/subdir/ dir/subdir/subsubdir/ dir/subdir/subsubdir/etc-ltr-list dir/subdir/subsubdir2/ dir/subdir/subsubdir2/bin-lt-list dir/text empty emptydir/ filelist nolf nolf-symlink -> nolf text wrote 829890 bytes read 132 bytes 1660044.00 bytes/sec total size is 829321 speedup is 1.00 - check how the files compare with diff: - check how the directory listings compare with diff: done. Test hard links: Running: "/tmp/rsync-2.5.6/rsync.exe -avH /tmp/rsync-2.5.6/testtmp.hands/from/ /tmp/rsync-2.5.6/testtmp.hands/to" building file list ... done dir/ dir/filelist filelist => dir/filelist wrote 21870 bytes read 36 bytes 43812.00 bytes/sec total size is 850647 speedup is 38.83 - check how the files compare with diff: - check how the directory listings compare with diff: done. Test one file: Running: "/tmp/rsync-2.5.6/rsync.exe -avH /tmp/rsync-2.5.6/testtmp.hands/from/ /tmp/rsync-2.5.6/testtmp.hands/to" building file list ... done ./ text wrote 374971 bytes read 36 bytes 250004.67 bytes/sec total size is 850647 speedup is 2.27 - check how the files compare with diff: - check how the directory listings compare with diff: done. Test extra data: Running: "/tmp/rsync-2.5.6/rsync.exe -avH /tmp/rsync-2.5.6/testtmp.hands/from/ /tmp/rsync-2.5.6/testtmp.hands/to" building file list ... done pipe failed in do_recv rsync error: error in socket IO (code 10) at main.c(412) rsync: connection unexpectedly closed (8 bytes read so far) rsync error: error in rsync protocol data stream (code 12) at io.c(165) - hands log ends FAILhands - hands log follows Testing for symlinks using 'test -h' Test basic operation: Running: "/tmp/rsync-2.5.6/rsync.exe -av /tmp/rsync-2.5.6/testtmp.hands/fr
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> > -Original Message-
> > From: jw schultz [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 09, 2003 5:59 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> >
> > > I can't quite place why but my instincts inform me that you
> > > have latched onto something. Some sort of one character
> > > buffering error in the io libraries under cygwin. Most
> > > likely in the windos libs.
> > >
> > > Well, we have two reports of this fixing the rsync hang
> > > problem when signals failed. I'd like a little more testing
> > > before mainlining it.
> >
> > Nope! This is a no-go. It intermittently produces
> >
> > error (10) -- error in socket IO
> >
> > on both network and local transfers.
>
> I guess I'd better double check my processes to make sure that I'm getting a
> satisfactory success rate on my own servers. If I see any clues, I'll
> report them here. Any hope for a fix, or does this look like an inherent
> problem in the method being used?

It looks like the method is fairly sound. The problem seems
to primarily be in dealing with the child termination.

 	io_set_error_fd(-1);
-	kill(pid, SIGUSR2);
-	wait_process(pid, &status);
+	write(cleanup_pipe[1], ".", 1);
+	if (waitpid(pid, &status, 0) != pid) {
+		rprintf(FERROR,"cleanup in do_recv failed\n");
+		exit_cleanup(RERR_SOCKETIO);
+	}
 	return status;

There is a huge window between the write() and the return of
waitpid() that, depending on scheduling and signal delivery,
allows the child pid to be reaped by the SIGCHLD handler. That
results in this waitpid() returning -1 with errno of ECHILD.
EINTR would also be possible. The timing dependencies
account for the intermittency of the error.

I've attached an altered patch. I've only dealt with this
one location which produced errors doing a ssh pull. I
haven't addressed the local transfer errors but i suspect
that they derived from this waitpid error. Further testing will
still be needed to ensure that ssh push and rsyncd usage are
unbroken. This really needs testing in cygwin which i don't
have. If it takes care of the cygwin hang then we can
polish it. There remains the issue of an error status
when the only failure is termination.

--
J.W. Schultz            Pegasystems Technologies
email address:          [EMAIL PROTECTED]

Remember Cernan and Schmitt

? main.2.5.5
Index: cleanup.c
===
RCS file: /data/cvs/rsync/cleanup.c,v
retrieving revision 1.18
diff -u -r1.18 cleanup.c
--- cleanup.c	21 Mar 2003 23:43:50 -	1.18
+++ cleanup.c	12 Jul 2003 10:31:04 -
@@ -96,7 +96,6 @@
 	inside_cleanup++;
 
 	signal(SIGUSR1, SIG_IGN);
-	signal(SIGUSR2, SIG_IGN);
 
 	if (verbose > 3)
 		rprintf(FINFO,"_exit_cleanup(code=%d, file=%s, line=%d): entered\n",
Index: main.c
===
RCS file: /data/cvs/rsync/main.c,v
retrieving revision 1.169
diff -u -r1.169 main.c
--- main.c	4 Jul 2003 15:11:46 -	1.169
+++ main.c	12 Jul 2003 10:31:04 -
@@ -391,6 +391,7 @@
 	int status=0;
 	int recv_pipe[2];
 	int error_pipe[2];
+	int cleanup_pipe[2];
 	extern int preserve_hard_links;
 	extern int delete_after;
 	extern int recurse;
@@ -417,11 +418,19 @@
 		exit_cleanup(RERR_SOCKETIO);
 	}
 
+	if (pipe(cleanup_pipe) < 0) {
+		rprintf(FERROR,"cleanup pipe failed in do_recv\n");
+		exit_cleanup(RERR_SOCKETIO);
+	}
+
 	io_flush();
 
 	if ((pid=do_fork()) == 0) {
+		char tmp;
+
 		close(recv_pipe[0]);
 		close(error_pipe[0]);
+		close(cleanup_pipe[1]);
 		if (f_in != f_out) close(f_out);
 
 		/* we can't let two processes write to the socket at one time */
@@ -437,15 +446,21 @@
 		write_int(recv_pipe[1],1);
 		close(recv_pipe[1]);
 		io_flush();
-		/* finally we go to sleep until our parent kills us
-		   with a USR2 signal. We sleep for a short time as on
-		   some OSes a signal won't interrupt a sleep! */
-		while (msleep(20))
-			;
+		do {
+			status = read(cleanup_pipe[0], &tmp, 1);
+		} while (sta
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> > -Original Message-
> > From: jw schultz [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 09, 2003 5:59 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> >
> > > I can't quite place why but my instincts inform me that you
> > > have latched onto something. Some sort of one character
> > > buffering error in the io libraries under cygwin. Most
> > > likely in the windos libs.
> > >
> > > Well, we have two reports of this fixing the rsync hang
> > > problem when signals failed. I'd like a little more testing
> > > before mainlining it.
> >
> > Nope! This is a no-go. It intermittently produces
> >
> > error (10) -- error in socket IO
> >
> > on both network and local transfers.
>
> I guess I'd better double check my processes to make sure that I'm getting a
> satisfactory success rate on my own servers. If I see any clues, I'll
> report them here. Any hope for a fix, or does this look like an inherent
> problem in the method being used?

Better diags might help. Pull over ssh hits this.

+	write(cleanup_pipe[1], ".", 1);
+	if (waitpid(pid, &status, 0) != pid) {
+		rprintf(FERROR,"cleanup in do_recv failed\n");
+		exit_cleanup(RERR_SOCKETIO);
+	}

I have two problems here. Firstly, you are ignoring errno.
The waitpid call fails but you don't identify why.

Secondly, as long as the processes exit (no hangs, zombies
or runaways) and the actual transfer is successful i don't
mind too much if the termination is less than perfect.
Let's not use RERR_SOCKETIO. Let's use a different warning
status that only applies to the termination.

--
J.W. Schultz            Pegasystems Technologies
email address:          [EMAIL PROTECTED]

Remember Cernan and Schmitt
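As a rough illustration of the first point, a diagnostic that identifies why waitpid() failed could be built along these lines. This is a sketch only: `format_wait_error` is a hypothetical name, the message wording is invented for the example, and the real patch would presumably pass the result to rprintf(FERROR, ...) rather than to a caller-supplied buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper (not part of rsync): describe a failed waitpid()
 * call, including the errno value and its text, so that ECHILD and
 * EINTR can be told apart from other failures in the log.
 * Returns the value of snprintf(), i.e. the length the full message
 * needs, which exceeds len - 1 if the message was truncated. */
static int format_wait_error(char *buf, size_t len, pid_t pid, int err)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
			"waitpid(%ld) failed in do_recv: errno %d (%s)",
			(long)pid, err, strerror(err));
}
```

The value of errno has to be captured immediately after the failing call and passed in as `err`, since later library calls may overwrite it.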
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> > -Original Message-
> > From: jw schultz [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 09, 2003 5:59 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> >
> > > I can't quite place why but my instincts inform me that you
> > > have latched onto something. Some sort of one character
> > > buffering error in the io libraries under cygwin. Most
> > > likely in the windos libs.
> > >
> > > Well, we have two reports of this fixing the rsync hang
> > > problem when signals failed. I'd like a little more testing
> > > before mainlining it.
> >
> > Nope! This is a no-go. It intermittently produces
> >
> > error (10) -- error in socket IO
> >
> > on both network and local transfers.
>
> I guess I'd better double check my processes to make sure that I'm getting a
> satisfactory success rate on my own servers. If I see any clues, I'll
> report them here. Any hope for a fix, or does this look like an inherent
> problem in the method being used?

I haven't dug into it yet. As the patch author you might
be a bit more familiar with it. I'm not running cygwin so i
never had the hangs. I only applied it to test for
regression, which is what i found.

--
J.W. Schultz            Pegasystems Technologies
email address:          [EMAIL PROTECTED]

Remember Cernan and Schmitt
RE: PATCH/RFC: Another stab at the Cygwin hang problem
My sincerest apologies for the duplicate msgs from me that were sent to the
list this morning. My email administrator must have done something quite
stupid to have all the msgs I've sent in the last week go out again!

jpt

> -Original Message-
> From: Tillman, James
> Sent: Wednesday, July 09, 2003 6:48 AM
> To: [EMAIL PROTECTED]
> Subject: RE: PATCH/RFC: Another stab at the Cygwin hang problem
>
> > -Original Message-
> > From: jw schultz [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 09, 2003 5:59 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> >
> > > I can't quite place why but my instincts inform me that you
> > > have latched onto something. Some sort of one character
> > > buffering error in the io libraries under cygwin. Most
> > > likely in the windos libs.
> > >
> > > Well, we have two reports of this fixing the rsync hang
> > > problem when signals failed. I'd like a little more testing
> > > before mainlining it.
> >
> > Nope! This is a no-go. It intermittently produces
> >
> > error (10) -- error in socket IO
> >
> > on both network and local transfers.
>
> I guess I'd better double check my processes to make sure that I'm
> getting a satisfactory success rate on my own servers. If I see any
> clues, I'll report them here. Any hope for a fix, or does this look
> like an inherent problem in the method being used?
>
> jpt
RE: PATCH/RFC: Another stab at the Cygwin hang problem
> -Original Message-
> From: jw schultz [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 09, 2003 5:59 AM
> To: [EMAIL PROTECTED]
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
> > I can't quite place why but my instincts inform me that you
> > have latched onto something. Some sort of one character
> > buffering error in the io libraries under cygwin. Most
> > likely in the windos libs.
> >
> > Well, we have two reports of this fixing the rsync hang
> > problem when signals failed. I'd like a little more testing
> > before mainlining it.
>
> Nope! This is a no-go. It intermittently produces
>
> error (10) -- error in socket IO
>
> on both network and local transfers.

I guess I'd better double check my processes to make sure that I'm getting a
satisfactory success rate on my own servers. If I see any clues, I'll
report them here. Any hope for a fix, or does this look like an inherent
problem in the method being used?

jpt
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Mon, Jun 30, 2003 at 05:49:45PM -0700, jw schultz wrote:
> On Mon, Jun 30, 2003 at 11:12:29PM +0900, Anthony Heading wrote:
> > On Mon, Jun 30, 2003 at 04:54:22AM -0700, jw schultz wrote:
> > > Could you regenerate the patch with diff -u please?
> >
> > Okay, sure. This one against current CVS.
>
> Thanks that helps in examining it.
>
> I can't quite place why but my instincts inform me that you
> have latched onto something. Some sort of one character
> buffering error in the io libraries under cygwin. Most
> likely in the windos libs.
>
> Well, we have two reports of this fixing the rsync hang
> problem when signals failed. I'd like a little more testing
> before mainlining it.

Nope! This is a no-go. It intermittently produces

error (10) -- error in socket IO

on both network and local transfers.

--
J.W. Schultz            Pegasystems Technologies
email address:          [EMAIL PROTECTED]

Remember Cernan and Schmitt
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Mon, Jun 30, 2003 at 11:12:29PM +0900, Anthony Heading wrote:
> On Mon, Jun 30, 2003 at 04:54:22AM -0700, jw schultz wrote:
> > Could you regenerate the patch with diff -u please?
>
> Okay, sure. This one against current CVS.

Thanks that helps in examining it.

I can't quite place why but my instincts inform me that you
have latched onto something. Some sort of one character
buffering error in the io libraries under cygwin. Most
likely in the windos libs.

Well, we have two reports of this fixing the rsync hang
problem when signals failed. I'd like a little more testing
before mainlining it.
Re: PATCH/RFC: Another stab at the Cygwin hang problem
On Mon, Jun 30, 2003 at 04:54:22AM -0700, jw schultz wrote:
> Could you regenerate the patch with diff -u please?

Okay, sure. This one against current CVS.

Anthony

--- cleanup.c.Orig	2003-06-30 22:42:16.0 +0900
+++ cleanup.c	2003-06-30 22:42:47.0 +0900
@@ -96,7 +96,6 @@
 	inside_cleanup++;
 
 	signal(SIGUSR1, SIG_IGN);
-	signal(SIGUSR2, SIG_IGN);
 
 	if (verbose > 3)
 		rprintf(FINFO,"_exit_cleanup(code=%d, file=%s, line=%d): entered\n",
--- main.c.Orig	2003-04-25 01:26:09.0 +0900
+++ main.c	2003-06-30 22:41:35.0 +0900
@@ -391,6 +391,7 @@
 	int status=0;
 	int recv_pipe[2];
 	int error_pipe[2];
+	int cleanup_pipe[2];
 	extern int preserve_hard_links;
 	extern int delete_after;
 	extern int recurse;
@@ -417,11 +418,19 @@
 		exit_cleanup(RERR_SOCKETIO);
 	}
 
+	if (pipe(cleanup_pipe) < 0) {
+		rprintf(FERROR,"cleanup pipe failed in do_recv\n");
+		exit_cleanup(RERR_SOCKETIO);
+	}
+
 	io_flush();
 
 	if ((pid=do_fork()) == 0) {
+		char tmp;
+
 		close(recv_pipe[0]);
 		close(error_pipe[0]);
+		close(cleanup_pipe[1]);
 		if (f_in != f_out) close(f_out);
 
 		/* we can't let two processes write to the socket at one time */
@@ -437,15 +446,21 @@
 		write_int(recv_pipe[1],1);
 		close(recv_pipe[1]);
 		io_flush();
-		/* finally we go to sleep until our parent kills us
-		   with a USR2 signal. We sleep for a short time as on
-		   some OSes a signal won't interrupt a sleep! */
-		while (msleep(20))
-			;
+		do {
+			status = read(cleanup_pipe[0], &tmp, 1);
+		} while (status == -1 && errno == EINTR);
+		if (status != 1) {
+			rprintf(FERROR,"cleanup read returned %d in do_recv\n", status);
+			if (status == -1)
+				rprintf(FERROR,"with errno %d (%s)\n", errno, strerror(errno));
+			_exit(RERR_PARTIAL);
+		}
+		_exit(0);
 	}
 
 	close(recv_pipe[1]);
 	close(error_pipe[1]);
+	close(cleanup_pipe[0]);
 	if (f_in != f_out) close(f_in);
 
 	io_start_buffering(f_out);
@@ -463,8 +478,11 @@
 	io_flush();
 
 	io_set_error_fd(-1);
-	kill(pid, SIGUSR2);
-	wait_process(pid, &status);
+	write(cleanup_pipe[1], ".", 1);
+	if (waitpid(pid, &status, 0) != pid) {
+		rprintf(FERROR,"cleanup in do_recv failed\n");
+		exit_cleanup(RERR_SOCKETIO);
+	}
 	return status;
 }
 
@@ -881,12 +899,6 @@
 	exit_cleanup(RERR_SIGNAL);
 }
 
-static RETSIGTYPE sigusr2_handler(int UNUSED(val)) {
-	extern int log_got_error;
-	if (log_got_error) _exit(RERR_PARTIAL);
-	_exit(0);
-}
-
 static RETSIGTYPE sigchld_handler(int UNUSED(val)) {
 #ifdef WNOHANG
 	int cnt, status;
@@ -976,7 +988,6 @@
 	orig_argv = argv;
 
 	signal(SIGUSR1, sigusr1_handler);
-	signal(SIGUSR2, sigusr2_handler);
 	signal(SIGCHLD, sigchld_handler);
 #ifdef MAINTAINER_MODE
 	signal(SIGSEGV, rsync_panic_handler);
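Stripped of the rsync specifics, the synchronization in this patch (release the child by queuing one byte into a pipe instead of sending SIGUSR2) might be sketched as follows. This is a standalone illustration under stated assumptions, not the patch itself: `wait_for_release` and `handshake_demo` are hypothetical names, the child's failure exit code of 1 is arbitrary, and the `send_byte` flag exists only so the EOF path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child side: block until the parent queues one byte into the pipe.
 * Returns 0 when the byte arrives, -1 on EOF or a read error. */
static int wait_for_release(int fd)
{
	char tmp;
	ssize_t n;

	assert(fd >= 0);
	do {
		n = read(fd, &tmp, 1);
	} while (n == -1 && errno == EINTR);
	return n == 1 ? 0 : -1;
}

/* Parent side: fork a child that waits on the pipe, optionally release
 * it with one byte, then collect its exit code.
 * Returns the child's exit code, or -1 if setup or waiting failed. */
static int handshake_demo(int send_byte)
{
	int p[2], status;
	pid_t pid, r;

	if (pipe(p) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	if (pid == 0) {
		close(p[1]);	/* child keeps only the read end */
		_exit(wait_for_release(p[0]) == 0 ? 0 : 1);
	}
	close(p[0]);		/* parent keeps only the write end */
	if (send_byte && write(p[1], ".", 1) != 1) {
		close(p[1]);
		return -1;
	}
	close(p[1]);		/* with no byte sent, the child sees EOF */

	do {
		r = waitpid(pid, &status, 0);
	} while (r == -1 && errno == EINTR);
	if (r != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Calling `handshake_demo(1)` exercises the normal release path, while `handshake_demo(0)` shows the child noticing that the parent closed the pipe without writing. The sketch installs no SIGCHLD handler, so it does not reproduce the reaping race discussed elsewhere in the thread.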
Re: PATCH/RFC: Another stab at the Cygwin hang problem
Apparently this fixed the problem for Tillman, James.

Could you regenerate the patch with diff -u please?

On Fri, Jun 27, 2003 at 04:16:12PM +0900, Anthony Heading wrote:
> Hi,
>
> In http://sources.redhat.com/ml/cygwin/2002-09/msg01155.html, I noted that
> the often-observed hangs of rsync under Cygwin were assuaged by a call to
> msleep().
>
> After upgrading my Cygwin environment to rsync 2.5.6, I'm seeing these
> hangs again, not surprisingly given a CVS entry for main.c notes that
> this kludge was not harmless:
>
>   Revision 1.162 / (download) - annotate - [select for diffs] ,
>   Tue Jan 28 05:05:53 2003 UTC (4 months, 4 weeks ago) by dwd
>
>   Remove the Cygwin msleep(100) before the generator kills the receiver,
>   because it caused the testsuite/unsafe-links test to hang.
>
> So it seems sensible to attempt something a bit more elegant.
>
> And the first question is why kill/signals are being used
> here at all.
>
> The illustrative patch below I think effects an equivalent synchronization,
> but does so by queuing a byte into a pipe rather than sending a signal.
>
> Of course, since it's not currently done this way, I may be overlooking
> something obvious. I can't quite see what though, since in the event
> that an error occurs then exit_cleanup is available to send SIGUSR1
> with extreme prejudice; but if the protocol in fact concludes cleanly
> then there really should be no need for an asynchronous notification?
>
> Comments sought, meanwhile I'll test the patch a bit...
> > Regards > > Anthony > > > *** main.c.Orig Fri Jun 27 15:21:22 2003 > --- main.cFri Jun 27 15:30:09 2003 > *** > *** 390,395 > --- 390,396 > int status=0; > int recv_pipe[2]; > int error_pipe[2]; > + int cleanup_pipe[2]; > extern int preserve_hard_links; > extern int delete_after; > extern int recurse; > *** > *** 416,426 > --- 417,435 > exit_cleanup(RERR_SOCKETIO); > } > > + if (pipe(cleanup_pipe) < 0) { > + rprintf(FERROR,"cleanup pipe failed in do_recv\n"); > + exit_cleanup(RERR_SOCKETIO); > + } > + > io_flush(); > > if ((pid=do_fork()) == 0) { > + char tmp; > + > close(recv_pipe[0]); > close(error_pipe[0]); > + close(cleanup_pipe[1]); > if (f_in != f_out) close(f_out); > > /* we can't let two processes write to the socket at one time */ > *** > *** 436,450 > write_int(recv_pipe[1],1); > close(recv_pipe[1]); > io_flush(); > ! /* finally we go to sleep until our parent kills us > !with a USR2 signal. We sleep for a short time as on > !some OSes a signal won't interrupt a sleep! */ > ! while (msleep(20)) > ! ; > } > > close(recv_pipe[1]); > close(error_pipe[1]); > if (f_in != f_out) close(f_in); > > io_start_buffering(f_out); > --- 445,465 > write_int(recv_pipe[1],1); > close(recv_pipe[1]); > io_flush(); > ! do { > ! status = read(cleanup_pipe[0], &tmp, 1); > ! } while (status == -1 && errno == EINTR); > ! if (status != 1) { > ! rprintf(FERROR,"cleanup read returned %d in do_recv\n", > status); > ! if (status == -1) > ! rprintf(FERROR,"with errno %d (%s)\n", errno, > strerror(errno)); > ! _exit(RERR_PARTIAL); > ! } > ! _exit(0); > } > > close(recv_pipe[1]); > close(error_pipe[1]); > + close(cleanup_pipe[0]); > if (f_in != f_out) close(f_in); > > io_start_buffering(f_out); > *** > *** 462,469 > io_flush(); > > io_set_error_fd(-1); > ! kill(pid, SIGUSR2); > ! wait_process(pid, &status); > return status; > } > > --- 477,487 > io_flush(); > > io_set_error_fd(-1); > ! write(cleanup_pipe[1], ".", 1); > ! if (waitpid(pid, &status, 0) != pid) { > ! 
rprintf(FERROR,"cleanup in do_recv failed\n"); > ! exit_cleanup(RERR_SOCKETIO); > ! } > return status; > } > > *** > *** 867,878 > exit_cleanup(RERR_SIGNAL); > } > > - static RETSIGTYPE sigusr2_handler(int UNUSED(val)) { > - extern int log_got_error; > - if (log_got_error) _exit(RERR_PARTIAL); > - _exit(0); > - } > - > static RETSIGTYPE sigchld_handler(int UNUSED(val)) { > #ifdef WNOHANG > int cnt, status; > --- 885,890 > *** > *** 964,970 > orig_argv = argv; > > signal(SIGUSR1, sigusr1_handler); > - signal(SIGUSR2, sigusr2_handler); > signal(SIGCHLD, sigchld_handler)