Any (known) scaling issues?

2004-01-16 Thread Bret Foreman
I'm considering using rsync in our data center but I'm worried about whether
it will scale to the numbers and sizes we deal with. We would be moving up
to a terabyte in a typical sync, consisting of about a million files. Our
data mover machines run Red Hat Linux Advanced Server 2.1, and all the sources
and destinations are NFS mounts. The data is stored on big NFS file servers.
The destination will typically be empty and rsync will have to copy
everything. However, the copy operation takes many hours and often gets
interrupted by an outage. In that case, the operator should be able to
restart the process and have it resume where it left off.
The current, less-than-desirable method uses tar; in the event of an
outage, everything has to be copied again from the start. I'm hoping rsync
can avoid this and pick up where it left off.
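A re-run of rsync should already give that restart behavior: with archive
mode, files that finished copying before the outage match on size and
modification time at the destination and are skipped on the second pass.
A minimal sketch of such an invocation, driven from Python; the two mount
points below are placeholders, not paths from this post:

    import subprocess

    SRC = "/mnt/source/data/"   # placeholder NFS mount of the source
    DST = "/mnt/dest/data/"     # placeholder NFS mount of the destination

    # -a (archive) preserves times and permissions, which is what lets a
    # re-run skip files that were already copied in full; --partial keeps
    # any half-transferred file around instead of deleting it on interrupt.
    subprocess.run(["rsync", "-a", "--partial", SRC, DST], check=True)

Re-running the same command after an outage is the whole recovery step.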
There are really two scaling problems here:
1) Number and size of files - What are the theoretical limits in rsync? What
are the demonstrated maxima?
2) Performance - The current tar-based method breaks the mount points down
into (a few dozen) subdirectories and runs multiple tar processes. This does
a much better job of keeping the GigE pipes full than a single process and
allows the load to be spread over the 4 CPUs in the Linux box. Is there a
better way to do this with rsync, or would we do the same thing and generate
one rsync call for each subdirectory? A major drawback of the subdirectory
approach is that tuning to find the optimum number of copy processes is
almost impossible. Is anyone looking at multithreading rsync to copy many
files at once and get more CPU utilization from a multi-CPU machine? We're
moving about 10 terabytes a week (and rising) so whatever we use has to keep
those GigE pipes full.
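One workable shape for that is sketched below: one rsync per top-level
subdirectory with a single tunable worker count. The mount points and the
worker count of 4 are assumptions for illustration, not measured values:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    SRC_ROOT = Path("/mnt/source/data")   # placeholder source mount
    DST_ROOT = Path("/mnt/dest/data")     # placeholder destination mount
    WORKERS = 4                           # tune against CPUs and GigE capacity

    def copy_subtree(subdir: Path) -> int:
        # Trailing slash on the source copies the directory's contents into
        # the matching destination directory instead of nesting it one deeper.
        dest = DST_ROOT / subdir.name
        cmd = ["rsync", "-a", "--partial", f"{subdir}/", f"{dest}/"]
        return subprocess.run(cmd).returncode

    def main() -> None:
        subdirs = [p for p in SRC_ROOT.iterdir() if p.is_dir()]
        # Threads suffice here: the real work happens in the child rsyncs.
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            codes = list(pool.map(copy_subtree, subdirs))
        failed = [str(d) for d, rc in zip(subdirs, codes) if rc != 0]
        if failed:
            print("rsync reported errors for:", ", ".join(failed))

    if __name__ == "__main__":
        main()

Adjusting WORKERS between runs is then the entire tuning exercise, and any
subdirectory that fails can be re-run on its own, since each rsync is
independently restartable.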

Thanks,
Bret


Re: Any (known) scaling issues?

2004-01-16 Thread jw schultz
On Fri, Jan 16, 2004 at 03:55:42PM -0800, Bret Foreman wrote:
 I'm considering using rsync in our data center but I'm worried about whether
 it will scale to the numbers and sizes we deal with. We would be moving up
 to a terabyte in a typical sync, consisting of about a million files. Our
 data mover machines run Red Hat Linux Advanced Server 2.1, and all the sources
 and destinations are NFS mounts. The data is stored on big NFS file servers.
 [...]

The numbers you cite should be no problem for rsync.
However, the scenario is one where rsync has no real
advantage and several disadvantages.  You are copying, not
syncing, so rsync will be slower.  Your network is faster
than the disks, and rsync is designed for disks several times
faster than the network.  Rsync is even worse over NFS, and
you are doing NFS-to-NFS copies.  All in all, I wouldn't use
rsync.  My inclination would be to use cpio -p without -u.
The one thing rsync gets you is checksumming, and NFS over
UDP has a measurable data corruption rate, but caches are
likely to defeat rsync's checksums, so a separate checksum
cycle would still be wanted.
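A separate checksum cycle along those lines could be a standalone second
pass that re-reads both trees and compares digests. A minimal sketch with
placeholder mount points; it only catches corruption if the reads actually
come from the servers rather than from a stale client cache:

    import hashlib
    from pathlib import Path

    SRC_ROOT = Path("/mnt/source/data")   # placeholder source mount
    DST_ROOT = Path("/mnt/dest/data")     # placeholder destination mount

    def digest(path: Path) -> str:
        # Stream in 1 MiB chunks so large files are not pulled into memory.
        h = hashlib.md5()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify() -> None:
        for src in SRC_ROOT.rglob("*"):
            if not src.is_file():
                continue
            dst = DST_ROOT / src.relative_to(SRC_ROOT)
            if not dst.is_file():
                print("missing:", dst)
            elif digest(src) != digest(dst):
                print("mismatch:", dst)

    if __name__ == "__main__":
        verify()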

-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt