Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system
On Wed, Jan 24, 2001 at 11:59:18AM -0500, John Stoffel wrote:
> Hi all,
>
> This is a followup to bug report 2779 on the rsync bug tracking web
> site. I'm also seeing the hang in waitid() on the master process when
> trying to do an rsync on a single host.
>
> Basically, I've got a server with two network interfaces, connected to
> two different NetApps, and I'm using rsync to bring them into sync for
> a migration. Each NetApp is on its own dedicated subnet link, so
> there's no network contention.
>
> Here's how I'm running it:
>
>   # rsync-2.4.6/rsync --archive --delete --exclude ".snapshot/" --exclude ".snapshot" --links --recursive --stats --verbose /sqatoast/acme /newtoast192/acme
>
> I've also tried 2.4.5 and it too hangs, but with a different set of
> traces: each process (there are three) is just in a poll() loop. I'm
> now trying 2.4.4 to see if that will work, but the --exclude option
> seems to have changed how it works as I go back in versions.
>
> Does anyone have a patch for 2.4.6 that will make it work properly
> with a Solaris 2.6 server talking to itself?
>
> Thanks,
> John
>
> John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
> [EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548

Other people have reported similar experiences but nobody has pointed to a problem in rsync; the problem is more likely to be in NFS on the NetApp or Solaris machines. I believe most NFS traffic goes over UDP, but do you happen to know if it is using TCP? We have seen many problems with TCP connections when rsync is communicating between two different machines.

Try using "-W" to disable the rsync rolling checksum algorithm when copying between two NFS mounts, because the algorithm causes extra NFS traffic. Rsync's algorithm is optimized for minimizing network traffic between its two halves at the expense of extra local access, and in your case the "network" is between processes on the same machine and the "local" is over a network.

- Dave Dykstra
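For anyone scripting this, here is a minimal sketch of the "-W" suggestion applied to the command from the report. It only builds the argument list and does not run rsync. The helper name build_rsync_argv is hypothetical, and the paths are the ones John quoted; --whole-file is the long form of -W.

```python
def build_rsync_argv(src, dest, whole_file=True):
    """Return an rsync argument list mirroring the reported command.

    Hypothetical helper for illustration only; it does not execute rsync.
    """
    argv = [
        "rsync",
        "--archive",
        "--delete",
        "--exclude", ".snapshot/",
        "--exclude", ".snapshot",
        "--links",
        "--recursive",
        "--stats",
        "--verbose",
    ]
    if whole_file:
        # Long form of -W: copy whole files and skip the rolling
        # checksum (delta) algorithm.
        argv.append("--whole-file")
    argv += [src, dest]
    return argv


argv = build_rsync_argv("/sqatoast/acme", "/newtoast192/acme")
print(" ".join(argv))
```

The list form can be handed to subprocess.run() as-is, which avoids shell quoting problems with the --exclude patterns.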
Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system
> On Wed, Jan 24, 2001 at 11:59:18AM -0500, John Stoffel wrote:
> > Hi all,
> >
> > This is a followup to bug report 2779 on the rsync bug tracking web
> > site. I'm also seeing the hang in waitid() on the master process
> > when trying to do an rsync on a single host.
> >
> > [snip]
> >
> > Thanks,
> > John
> >
> > John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
> > [EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548
>
> Other people have reported similar experiences but nobody has pointed
> to a problem in rsync; the problem is more likely to be in NFS on the
> NetApp or Solaris machines. I believe most NFS traffic goes over UDP,
> but do you happen to know if it is using TCP? We have seen many
> problems with TCP connections when rsync is communicating between two
> different machines.

Here is some more data. Different setup, yet similar end result -- but no answers.

Once again I saw a very similar hang last weekend. I put a new HD in my home PC and was rsyncing from one IBM IDE disk to a second IBM IDE hard disk. Linux 2.2.18. Rsync 2.4.6. After moving about 6G of files rsync would stop. I checked the /proc/pid(s)/fd dir and didn't see any open 'real' files (like I do when rsync is actually moving data). I Ctrl-C'd the rsync, ran the same cmd again and it would finish up the job. I love that about rsync.

I killed the whole dest directory tree and ran it again (for testing). It still hung. Same file it hung on and same resolution. I'll run it again and provide better details.

I realize that rsync has sometimes taken unfair blame for TCP bugs, NFS bugs, ssh bugs, rsh bugs, OS bugs, etc -- but I still think there might be something that can be improved -- either a hard-to-find problem or a different way to handle an infrequent exception. Is there an int64 problem here? A compiler mess-up? A cast from int64 to int that confuses something?

I was not using -W, just rsync -av path1 path2 (no :'s in any path).

eric

> Try using "-W" to disable the rsync rolling checksum algorithm when
> copying between two NFS mounts, because that causes extra NFS
> traffic. Rsync's algorithm is optimized for minimizing network
> traffic between its two halves at the expense of extra local access
> and in your case the "network" is between processes on the same
> machine and the "local" is over a network.
>
> - Dave Dykstra

--
Eric T. Whiting, AMI Semiconductors
2300 Buckskin Road, Pocatello, ID 83201
(208) 234-6717, (208) 234-6659 (fax)
[EMAIL PROTECTED]
Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system
Dave> Other people have reported similar experiences but nobody has
Dave> pointed to a problem in rsync; the problem is more likely to be
Dave> in NFS on the NetApp or Solaris machines. I believe most NFS
Dave> traffic goes over UDP but do you happen to know if it is using TCP?
Dave> We have seen many problems with TCP connections when rsync is
Dave> communicating between two different machines.

We're using plain UDP NFS as far as I know; I certainly haven't explicitly enabled TCP.

Dave> Try using "-W" to disable the rsync rolling checksum algorithm
Dave> when copying between two NFS mounts, because that causes extra
Dave> NFS traffic. Rsync's algorithm is optimized for minimizing
Dave> network traffic between its two halves at the expense of extra
Dave> local access and in your case the "network" is between processes
Dave> on the same machine and the "local" is over a network.

So I should go back and use 2.4.6 then?

Thanks,
John
x27548
Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system
Quick follow-up to my previous posting today. I'm now using the -W flag on 2.4.6 again and it just hung. Luckily a 'kill -HUP' to the process it's waiting on will do the trick in terms of killing rsync cleanly. But it also means I have to sit and watch the damn thing do its work and wait until the CPU usage drops to 1% or so.

Here's a process listing:

    atlas:/newtoast192/users# ps -ef | grep rsync
    root  8809  8808  0 14:42:07 pts/2    0:12 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh
    root  8808  8674  0 14:42:07 pts/2    5:01 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh
    root  8819  8809  0 14:43:30 pts/2   14:57 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh

So I do a truss on one pid and I see:

    atlas:/newtoast192/users# truss -p 8809
    poll(0xEFFFB6E0, 1, 6) (sleeping...)
    ^C

Oops, wrong pid. I then check the previous one and I see:

    atlas:/newtoast192/users# truss -p 8808
    poll(0xEFFFD3B0, 0, 20) = 0
    waitid(P_PID, 8809, 0xE3B8, WEXITED|WTRAPPED|WNOHANG) = 0
    poll(0xEFFFD3B0, 0, 20) = 0
    waitid(P_PID, 8809, 0xE3B8, WEXITED|WTRAPPED|WNOHANG) = 0
    poll(0xEFFFD3B0, 0, 20) = 0
    poll(0xEFFFD3B0, 0, 7) = 0
    waitid(P_PID, 8809, 0xE3B8, WEXITED|WTRAPPED|WNOHANG) = 0
    poll(0xEFFFD3B0, 0, 20) = 0
    poll(0xEFFFD3B0, 0, 1) = 0

And it just sits there in waitid() and poll(). So it really looks like the real problem is in the child process, pid=8809, because it's just sitting there doing a poll() on just one file descriptor and waiting for info. I don't think it gets anything, so maybe it's a subtle protocol error, where it's not getting a shutdown message from the other client, or it doesn't have a good heuristic that says:

    if I don't get *any* info after X seconds, just die

where X would be something like 900 or 1200 seconds, which seems like a reasonable number. Now this of course would only kick in after data has been transferring for a while, since the initial work on a large directory could take quite a while.

This is really starting to bug me, so I'd like to try and get this solved if we could. Unfortunately, I'm not a great programmer, I'm more of a tweaker/hacker type.

Thanks,
John

John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548
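The "no info after X seconds, just die" heuristic John describes can be sketched roughly as below. This is an illustration of the idea only, not how rsync implements anything: the names read_with_idle_timeout and InactivityTimeout are hypothetical, it assumes a Unix system where select() works on pipe descriptors, and the demo uses a 1 second idle limit rather than the 900 to 1200 seconds suggested above.

```python
import os
import select


class InactivityTimeout(Exception):
    """Raised when no data arrives within the allowed idle period."""


def read_with_idle_timeout(fd, idle_seconds, chunk_size=4096):
    """Read from fd until EOF, giving up if nothing arrives for idle_seconds.

    The timer restarts after every chunk, so a slow but steady transfer
    is never killed; only a completely silent peer triggers the timeout.
    """
    chunks = []
    while True:
        ready, _, _ = select.select([fd], [], [], idle_seconds)
        if not ready:
            raise InactivityTimeout(f"no data for {idle_seconds} seconds")
        data = os.read(fd, chunk_size)
        if not data:
            # EOF: the writer closed its end cleanly.
            return b"".join(chunks)
        chunks.append(data)


# Demo: a writer that sends data and closes cleanly.
r, w = os.pipe()
os.write(w, b"hello")
os.close(w)
print(read_with_idle_timeout(r, idle_seconds=1.0))
os.close(r)
```

If the writer in the demo never wrote and never closed its end, the call would raise InactivityTimeout instead of waiting in poll() indefinitely, which is the behaviour the child process in the truss output appears to lack.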
Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system
On Wed, Jan 24, 2001 at 03:48:06PM -0500, John Stoffel wrote:
> ... or it doesn't have a good heuristic that says:
>
>     if I don't get *any* info after X seconds, just die
>
> where X would be something like 900 or 1200 seconds, which seems like
> a reasonable number.

Have you tried --timeout?

- Dave Dykstra
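As a rough sketch of what trying --timeout could look like from a script: rsync's --timeout option takes a number of seconds of I/O inactivity, and 0 means no timeout. The helper name add_timeout is hypothetical, and the 900 second value is simply the lower figure John suggested, not a recommendation from the thread.

```python
def add_timeout(argv, seconds):
    """Return a copy of an rsync argument list with --timeout=SECONDS added.

    Hypothetical helper for illustration; it does not execute rsync.
    """
    if seconds <= 0:
        # rsync treats 0 as "no timeout", which would defeat the purpose here.
        raise ValueError("timeout must be a positive number of seconds")
    return [argv[0], f"--timeout={seconds}", *argv[1:]]


cmd = add_timeout(["rsync", "-av", "path1", "path2"], 900)
print(" ".join(cmd))
```

Because the timeout counts idle I/O time rather than total run time, a long transfer that keeps moving data is not cut off.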