Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system

2001-01-24 Thread Dave Dykstra

On Wed, Jan 24, 2001 at 11:59:18AM -0500, John Stoffel wrote:
 
 
 Hi all,
 
 This is a follow-up to bug report 2779 on the rsync bug tracking web
 site. I'm also seeing the hang in waitid() on the master process when
 trying to run rsync on a single host.
 
 Basically, I've got a server with two network interfaces, connected to
 two different NetApps and I'm using rsync to bring them into sync for
 a migration.
 
 Each NetApp is on its own dedicated subnet link, so there's no
 network contention.  Here's  how I'm running it:
 
 # rsync-2.4.6/rsync --archive --delete --exclude ".snapshot/" --exclude ".snapshot" --links --recursive --stats --verbose /sqatoast/acme /newtoast192/acme
 
 
 I've also tried 2.4.5 and it too hangs, but with a different set of
 traces: each process (there are three) is just in a poll() loop.
 
 I'm now trying 2.4.4 to see if that will work, but the --exclude
 option seems to have changed how it works as I go back in versions.
 
 Does anyone have a patch for 2.4.6 that will make it work properly
 with a Solaris 2.6 server talking to itself?
 
 Thanks,
 John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548


Other people have reported similar experiences but nobody has pointed to a
problem in rsync; the problem is more likely to be in NFS on the NetApp or
Solaris machines.  I believe most NFS traffic goes over UDP but do you happen
to know if it is using TCP?  We have seen many problems with TCP connections
when rsync is communicating between two different machines.

Try using "-W" to disable the rsync rolling checksum algorithm when copying
between two NFS mounts, because that causes extra NFS traffic.  Rsync's
algorithm is optimized for minimizing network traffic between its two
halves at the expense of extra local access and in your case the "network"
is between processes on the same machine and the "local" is over a
network.

- Dave Dykstra
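
To illustrate the point about the rolling checksum, here is a minimal Python sketch of the weak rolling checksum as described in the rsync technical report. It is not rsync's actual C code: the function names, the modulus constant M, and the byte-for-byte comparison standing in for the strong checksum are assumptions made for illustration. The thing to notice is that find_matches() needs the full contents of both the old and the new file, so when both live on NFS mounts every byte crosses the network just to decide what to copy, which is the extra traffic -W avoids.

```python
from typing import Dict, List, Tuple

M = 1 << 16  # modulus from the published description (assumption: not rsync's exact constants)


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """Compute the (a, b) pair for a block from scratch."""
    n = len(block)
    a = sum(block) % M
    # Each byte is weighted by its distance from the end of the block.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int) -> Tuple[int, int]:
    """Slide the window one byte to the right in constant time."""
    a_new = (a - out_byte + in_byte) % M
    b_new = (b - block_len * out_byte + a_new) % M
    return a_new, b_new


def find_matches(old: bytes, new: bytes, block_len: int) -> List[Tuple[int, int]]:
    """Return (offset_in_new, offset_in_old) for blocks of `old` found in `new`.

    Simplification: real rsync confirms a weak match with a strong checksum
    and skips ahead after a match; this sketch compares bytes directly and
    keeps sliding one byte at a time.
    """
    # Pass over the OLD file: checksum every aligned block.
    table: Dict[Tuple[int, int], int] = {}
    for idx in range(0, len(old) - block_len + 1, block_len):
        table.setdefault(weak_checksum(old[idx:idx + block_len]), idx)

    matches: List[Tuple[int, int]] = []
    if len(new) < block_len:
        return matches

    # Pass over the NEW file: roll the window across every byte offset.
    a, b = weak_checksum(new[:block_len])
    pos = 0
    while True:
        idx = table.get((a, b))
        if idx is not None and old[idx:idx + block_len] == new[pos:pos + block_len]:
            matches.append((pos, idx))
        if pos + block_len >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + block_len], block_len)
        pos += 1
    return matches
```

With -W none of this work is done: the source is read once and the destination is written once.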




Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system

2001-01-24 Thread Eric Whiting

On Wed, Jan 24, 2001 at 11:59:18AM -0500, John Stoffel wrote:
 
 
 Hi all,
 
 This is a follow-up to bug report 2779 on the rsync bug tracking web
 site. I'm also seeing the hang in waitid() on the master process when
 trying to run rsync on a single host.
 
snip

 Thanks,
 John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548

 Other people have reported similar experiences but nobody has pointed to a
 problem in rsync; the problem is more likely to be in NFS on the NetApp or
 Solaris machines.  I believe most NFS traffic goes over UDP but do you happen
 to know if it is using TCP?  We have seen many problems with TCP connections
 when rsync is communicating between two different machines.

Here is some more data. Different setup, yet a similar end result, but
no answers. 

Once again I saw a very similar hang last weekend. I put a new HD in
my home PC and was rsyncing from one IBM IDE disk to a second IBM IDE
hard disk. Linux 2.2.18. Rsync 2.4.6. 

After moving about 6G of files rsync would stop. I checked the
/proc/pid(s)/fd dir and didn't see any open 'real' files (like I do
when rsync is actually moving data). I Ctrl-C'd the rsync, ran the same
cmd again and it would finish up the job. I love that about rsync. 

I killed the whole dest directory tree and ran it again (for testing).
It still hung. Same file it hung on and same resolution.

I'll run it again and provide better details. I realize that rsync has
sometimes taken unfair blame for TCP bugs, NFS bugs, ssh bugs, rsh
bugs, OS bugs, etc -- but I still think there might be something that
can be improved -- either a hard-to-find problem or a different way to
handle an infrequent exception. 

Is there an int64 problem here? A compiler mess-up? An int64-to-int
cast that confuses something?

I was not using -W,  just rsync -av path1 path2  (no :'s in any path).

eric


 
 Try using "-W" to disable the rsync rolling checksum algorithm when copying
 between two NFS mounts, because that causes extra NFS traffic.  Rsync's
 algorithm is optimized for minimizing network traffic between its two
 halves at the expense of extra local access and in your case the "network"
 is between processes on the same machine and the "local" is over a
 network.
 
 - Dave Dykstra

-- 
__
Eric T. Whiting         AMI Semiconductors
(208) 234-6717          2300 Buckskin Road
(208) 234-6659 (fax)    Pocatello, ID 83201
[EMAIL PROTECTED]




Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system

2001-01-24 Thread John Stoffel


Dave Other people have reported similar experiences but nobody has
Dave pointed to a problem in rsync; the problem is more likely to be
Dave in NFS on the NetApp or Solaris machines.  I believe most NFS
Dave traffic goes over UDP but do you happen to know if it is using TCP?
Dave We have seen many problems with TCP connections when rsync is
Dave communicating between two different machines.

We're using plain UDP NFS as far as I know; I certainly haven't
explicitly enabled TCP. 

Dave Try using "-W" to disable the rsync rolling checksum algorithm
Dave when copying between two NFS mounts, because that causes extra
Dave NFS traffic.  Rsync's algorithm is optimized for minimizing
Dave network traffic between its two halves at the expense of extra
Dave local access and in your case the "network" is between processes
Dave on the same machine and the "local" is over a network.

So I should go back and use 2.4.6 then?

Thanks,
John
x27548




Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system

2001-01-24 Thread John Stoffel


Quick follow-up to my previous posting today.

I'm now using the -W flag on 2.4.6 again and it just hung.  Luckily a
'kill -HUP' to the process it's waiting on will do the trick in terms
of killing rsync cleanly.  But it also means I have to sit and watch
the damn thing do its work and wait until the CPU usage drops to 1%
or so.

Here's a process listing:

atlas:/newtoast192/users# ps -ef | grep rsync
root  8809  8808  0 14:42:07 pts/2    0:12 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh
root  8808  8674  0 14:42:07 pts/2    5:01 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh
root  8819  8809  0 14:43:30 pts/2   14:57 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh


So I do a truss on one pid and I see

atlas:/newtoast192/users# truss -p 8809
poll(0xEFFFB6E0, 1, 6)  (sleeping...)
^C

Oops, wrong pid.  I then check the previous one and I see:

atlas:/newtoast192/users# truss -p 8808
poll(0xEFFFD3B0, 0, 20) = 0
waitid(P_PID, 8809, 0xE3B8, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xEFFFD3B0, 0, 20) = 0
waitid(P_PID, 8809, 0xE3B8, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xEFFFD3B0, 0, 20) = 0
poll(0xEFFFD3B0, 0, 7)  = 0
waitid(P_PID, 8809, 0xE3B8, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xEFFFD3B0, 0, 20) = 0
poll(0xEFFFD3B0, 0, 1)  = 0


And it just sits there in waitid() and poll().  So it really looks
like the real problem is in the child process, pid=8809, because it's
just sitting there doing a poll() on just one file descriptor and
waiting for info.  I don't think it gets anything, so maybe it's a
subtle protocol error, where it's not getting a shutdown message from
the other client, or it doesn't have a good heuristic that says:

if I don't get *any* info after X seconds, just die

where X would be something like 900 or 1200 seconds, which seems like
a reasonable number.  Now this of course would only kick in after data
has been transferring for a while, since the initial work on a large
directory could take quite a while.  
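
The heuristic described above could look roughly like the following Python sketch. It is only an illustration of the idea, not how rsync implements anything; read_with_timeout and InactivityTimeout are hypothetical names, and the sketch assumes a Unix system where select() works on the descriptor.

```python
import os
import select


class InactivityTimeout(Exception):
    """Raised when the peer sends nothing for too long (hypothetical name)."""


def read_with_timeout(fd: int, timeout_seconds: float, chunk_size: int = 65536) -> bytes:
    """Read from fd until EOF, giving up if no data arrives for timeout_seconds."""
    chunks = []
    while True:
        # Wait until fd is readable, or until the inactivity limit expires.
        readable, _, _ = select.select([fd], [], [], timeout_seconds)
        if not readable:
            raise InactivityTimeout(
                "no data received for %s seconds" % timeout_seconds
            )
        chunk = os.read(fd, chunk_size)
        if not chunk:
            # An empty read means the other end closed the connection cleanly.
            return b"".join(chunks)
        chunks.append(chunk)
```

With timeout_seconds set to 900 or 1200 this gives the "no data for X seconds, just die" behaviour. Each successful read starts the wait afresh, so a slow but steady transfer is never cut off.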

This is really starting to bug me, so I'd like to try and get this
solved if we could.  Unfortunately, I'm not a great programmer, I'm
more of a tweaker/hacker type.

Thanks,
John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
 [EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548




Re: rsync 2.4.6 hangs in waitid() on Solaris 2.6 system

2001-01-24 Thread Dave Dykstra

On Wed, Jan 24, 2001 at 03:48:06PM -0500, John Stoffel wrote:
...
 or it doesn't have a good heuristic that says:
 
 if I don't get *any* info after X seconds, just die
 
 where X would be something like 900 or 1200 seconds, which seems like
 a reasonable number.

Have you tried --timeout?

- Dave Dykstra
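
As a sketch of how that suggestion might be applied, the following Python helper assembles a command line similar to John's, adding --timeout and the long form of -W (--whole-file). build_rsync_command is a hypothetical helper, the 900-second default is simply the figure floated earlier in the thread, and the paths in the usage note are the ones from the original report.

```python
from typing import List


def build_rsync_command(source: str, dest: str,
                        timeout_seconds: int = 900,
                        whole_file: bool = True) -> List[str]:
    """Assemble an rsync argument list with an I/O timeout (illustrative only)."""
    cmd = [
        "rsync",
        "--archive",
        "--delete",
        "--exclude", ".snapshot/",
        "--stats",
        "--verbose",
        # Have rsync give up if no data moves for this many seconds.
        "--timeout=%d" % timeout_seconds,
    ]
    if whole_file:
        # Long form of -W: skip the rolling checksum and copy whole files.
        cmd.append("--whole-file")
    cmd += [source, dest]
    return cmd
```

The resulting list can be handed to subprocess.run(), for example subprocess.run(build_rsync_command("/sqatoast/acme", "/newtoast192/acme")).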