Re: Question about rsync and BIG mirror

2006-03-06 Thread jp
100GB of 4-40MB files sounds like my home PC full of digital photos I've 
taken. It backs up to a Linux PC right beside it with rsync. I don't 
really call that a big project for rsync. Big things for rsync are 
millions of files. At 100Mbps, it takes a few seconds to build the list. 
I use the -W option to copy whole files instead of messing with 
incremental transfers on a fast network. The Linux PC running rsync only 
has 768MB of RAM and an Athlon 1700, and runs X, so it's nothing special.

It will take a while over the VPN to build the file list and start 
transferring, but it shouldn't be hard on the computer's resources.
I used to back the whole system up over a 512kbps link to our 
datacenter, but don't anymore. That was reliable but obviously slow.

On Fri, Mar 03, 2006 at 08:02:55AM +0100, [EMAIL PROTECTED] wrote:
 // I wonder if this message has been posted, so I sent it again //
 
 Hello,
 
   I'm quite a n00b on rsync stuff but I went to the website, read
 FAQ/how-to, Google and more, I setup my own rsync server and clients:
 everything works fine :-D
 
   I'm preparing a plan for a production mode in my company: we need to
 mirror around 100GB of data through a special VPN internet line, 2MB
 symmetric.
   The first time, the data will be transferred on a medium such as a HD.
 Next, each night, we will try to update clients from the master server.
 It should be around 500MB to 3GB, not so much in comparison with the
 original size of the data. 
   I discovered rsync uses a lot of CPU and RAM to run checksums on
 files that have to be synchronised. I need an opinion about my situation:
 
 
   So: each night, from 0:00am to 7:00am at the latest, the server will
 have to check the 100GB of files and see which files have been modified,
 then upload them to the clients. Each file is around 4MB to 40MB on average. 
 
 I would like to know your opinion about this situation:  
  - Should I set up a strong dual-CPU computer dedicated to calculating
 all of this? 
  - How much memory should I install? 
  - Is any bandwidth used during the checksum computation? Mine is
 quite limited.
  - I know the client computer will have to check files too; disk I/O
 will be the most used. I think this computer will have an NFS mount from a
 datacenter computer with a Gigabit LAN card; I wonder if it will be enough...
 
   I'm quite worried about the amount of data to check before synchronising
 clients, and how long it will take. In short, what do YOU
 think? Any advice?
 
 
 Thanks,
 
 Johan
 --
 To unsubscribe or change options: 
 https://lists.samba.org/mailman/listinfo/rsync
 Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

-- 
/*
Jason Philbrook   |   Midcoast Internet Solutions - Internet Access,
KB1IOJ|  Hosting, and TCP-IP Networks for Midcoast Maine
 http://f64.nu/   | http://www.midcoast.com/
*/


Re: Question about rsync and BIG mirror

2006-03-06 Thread Jamie Lokier
jp wrote:
 100gb of 4-40MB files sounds like my home PC full of digital photos I've 
 taken. It backs up to a linux PC right beside it with rsync. I don't 
 really call it that big a project for rsync. Big things for rsync are 
 millions of files. At 100mbps, it takes a few seconds to build the list. 
 I use the -W option to copy the whole file instead of mess with 
 incremental on a fast network. The linux PC running rsync only has 
 768MB, athlon 1700, and runs X, so it's nothing special.

Hmm.  My home directory, on my laptop (a mere 60GB disk), does contain
millions of files, and it takes about 20 minutes to build the list on
a good day.  100Mbps network, but it's I/O bound not network bound.

It looks a lot like the number of files is more significant than the
amount of data at this scale.

-- Jamie


Re: Question about rsync and BIG mirror

2006-03-06 Thread Shachar Shemesh
Jamie Lokier wrote:

Hmm.  My home directory, on my laptop (a mere 60GB disk), does contain
millions of files, and it takes about 20 minutes to build the list on
a good day.  100Mbps network, but it's I/O bound not network bound.

It looks a lot like the number of files is more significant than the
amount of data at this scale.
  

In fact, I know of at least one place where they don't use rsync because
they don't have enough RAM+SWAP to hold the list of files in memory.

As far as future directions for rsync, I think this is the major place
where rsync needs to become better.

  Shachar


Re: Question about rsync and BIG mirror

2006-03-06 Thread Wayne Davison
On Mon, Mar 06, 2006 at 07:18:45PM +0200, Shachar Shemesh wrote:
 In fact, I know of at least one place where they don't use rsync because
 they don't have enough RAM+SWAP to hold the list of files in memory.
 
 As far as future directions for rsync, I think this is the major place
 where rsync needs to become better.

I agree, and I am planning to add protocol improvements to rsync that
will allow it to have a much smaller memory footprint, and to begin
transferring files without pre-scanning the whole hierarchy first.

..wayne..


Re: Question about rsync and BIG mirror

2006-03-06 Thread Jamie Lokier
Shachar Shemesh wrote:
 Hmm.  My home directory, on my laptop (a mere 60GB disk), does contain
 millions of files, and it takes about 20 minutes to build the list on
 a good day.  100Mbps network, but it's I/O bound not network bound.
 
 It looks a lot like the number of files is more significant than the
 amount of data at this scale.
   
 
 In fact, I know of at least one place where they don't use rsync because
 they don't have enough RAM+SWAP to hold the list of files in memory.

That's true when I'm syncing my laptop's home directory (192MB RAM +
256MB swap) to the desktop machine (plenty of RAM).

So I have to split my home directory rsync into two commands, which
copy different parts of it separately.

I suspect the excessive memory usage is due to -H, preserve hard
links, of which I have many due to mostly-hard-linked Linux kernel
source trees.

That place you mention wouldn't happen to be using the -H option,
would they?

-- Jamie


Re: Question about rsync and BIG mirror

2006-03-06 Thread Jamie Lokier
Wayne Davison wrote:
 On Mon, Mar 06, 2006 at 07:18:45PM +0200, Shachar Shemesh wrote:
  In fact, I know of at least one place where they don't use rsync because
  they don't have enough RAM+SWAP to hold the list of files in memory.
  
  As far as future directions for rsync, I think this is the major place
  where rsync needs to become better.
 
 I agree, and I am planning to add protocol improvements to rsync that
 will allow it to have a much smaller memory footprint, and to begin
 transferring files without pre-scanning the whole hierarchy first.

While you're there, one little trick I've found that speeds up
scanning large directory hierarchies is to stat() or open() entries in
inode-number order.  For some filesystems it makes no difference, but
for others it reduces the average disk seek time as on many common
filesystems, inode number is related to position on the disk.  In
unusual cases I've seen a factor of 10 improvement, but usually it's
just 1-2.

-- Jamie


Re: Question about rsync and BIG mirror

2006-03-06 Thread Shachar Shemesh
Jamie Lokier wrote:

While you're there, one little trick I've found that speeds up
scanning large directory hierarchies is to stat() or open() entries in
inode-number order.  For some filesystems it makes no difference, but
for others it reduces the average disk seek time as on many common
filesystems, inode number is related to position on the disk.  In
unusual cases I've seen a factor of 10 improvement, but usually it's
just 1-2.

  

The way I see it, if you got that far, then you don't have any problem
with the size of the file list.

-- Jamie
  




Re: Question about rsync and BIG mirror

2006-03-06 Thread Jamie Lokier
Shachar Shemesh wrote:
 While you're there, one little trick I've found that speeds up
 scanning large directory hierarchies is to stat() or open() entries in
 inode-number order.  For some filesystems it makes no difference, but
 for others it reduces the average disk seek time as on many common
 filesystems, inode number is related to position on the disk.  In
 unusual cases I've seen a factor of 10 improvement, but usually it's
 just 1-2.
 
 The way I see it, if you got that far, then you don't have any problem
 with the size of the file list.

I don't mean to stat() after reading the whole hierarchy!

That doesn't make sense anyway, because you have to stat() to decide
if something's a directory in order to recurse into it.

I mean one directory at a time, after calling readdir(), just sort the
list of directory entries by d_ino values before using them.  That's
negligible in time and memory.
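
As a rough illustration of the per-directory sort described above, here is a minimal Python sketch. It is not how rsync itself is implemented; the function name scan_in_inode_order is made up for this example, and it assumes a platform where os.scandir() exposes the directory entry's inode number (the d_ino value) without an extra system call.

```python
import os


def scan_in_inode_order(root):
    """Yield (path, stat_result) for every entry under root.

    Hypothetical helper for illustration only. Each directory is read
    completely, its entries are sorted by inode number, and only then
    is each entry stat()ed. The sort is done one directory at a time,
    so it never holds more than one directory's entries.
    """
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            with os.scandir(directory) as it:
                # DirEntry.inode() returns the inode number of the entry
                entries = sorted(it, key=lambda e: e.inode())
        except OSError:
            continue  # unreadable directory: skipped in this sketch
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except OSError:
                continue  # entry vanished or is unreadable
            yield entry.path, st
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
```

Whether this helps at all depends on the filesystem, as noted above; on some it makes no measurable difference.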

-- Jamie


RE: Question about rsync and BIG mirror

2006-03-06 Thread johan.boye

 Object: Re: Question about rsync and BIG mirror

Thanks for all your answers and advice. My problem seems to be on the side
of the 2MB line once the whole 190GB of data is synchronised. I will keep
in touch and give some feedback.

Thanks to all


Re: Question about rsync and BIG mirror

2006-03-03 Thread Shachar Shemesh
[EMAIL PROTECTED] wrote:

Hello,

  So: each night, from 0:00am to 7:00am at the latest, the server will
have to check the 100GB of files and see which files have been modified,
then upload them to the clients. Each file is around 4MB to 40MB on average. 
  

Are the clients what you call the mirror? Are there several of them?

I would like to know your opinion about this situation:  
 - Should I set up a strong dual-CPU computer dedicated to calculating
all of this? 
  

That depends.

 - How much memory should I install? 
 - Is any bandwidth used during the checksum computation? Mine is
quite limited.
  

Is that 2 mega BYTE per second or 2 mega BIT per second?

 - I know the client computer will have to check files too; disk I/O
will be the most used. I think this computer will have an NFS mount from a
datacenter computer with a Gigabit LAN card; I wonder if it will be enough...
  

Scanning 100GB of data in 7 hours doesn't require much disk bandwidth
(roughly 4 MB/s sustained).

  I'm quite worried about the amount of data to check before synchronising
clients, and how long it will take. In short, what do YOU
think? Any advice?
  

Here are a few performance characteristics of rsync I think you should
be aware of:
- By default, rsync only checks files that are different between
receiver and sender in timestamp or size. If most files in your archive
did not change at all, you can discard them altogether from your
bandwidth calculations.
- The receiver only does a linear scan of the file, followed by
generating a second file (which MAY require random access to the first
file, if blocks in the file changed order). Its CPU performance
requirements are negligible. This is bad for the case where you have one
mirror source sending out info to many mirrors, as all the CPU load
falls on the single server.
- If your bandwidth is 2 mega BIT per second, you are a bit marginal as
far as transferring 5GB of data in 7 hours. This has nothing to do with
rsync, though. A simple calculation can show you the same result.
Getting full bandwidth for the entire 7 hours will allow you to transfer
6 GB of data.
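
The "simple calculation" mentioned in the last point can be written out as a short Python sketch. The function name max_transfer_bytes is only for this example, and decimal units (1 Mbit = 1,000,000 bits, 1 GB = 1,000,000,000 bytes) are an assumption; it also ignores protocol overhead, so real throughput would be somewhat lower.

```python
def max_transfer_bytes(link_bits_per_second, hours):
    """Upper bound on the bytes a link can move when saturated for `hours`."""
    seconds = hours * 3600
    return link_bits_per_second * seconds / 8


window_hours = 7  # the nightly window from the original question

# The two possible readings of "2MB": 2 megabits or 2 megabytes per second
for label, bits_per_second in (("2 Mbit/s", 2_000_000),
                               ("2 MByte/s", 16_000_000)):
    gigabytes = max_transfer_bytes(bits_per_second, window_hours) / 1e9
    print(f"{label}: at most {gigabytes:.1f} GB in {window_hours} hours")
```

Under these assumptions the 2 Mbit/s reading gives about 6.3 GB per night and the 2 MByte/s reading about 50.4 GB, which is why the bit/byte question matters here.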

Thanks,

Johan
  



Re: Question about rsync and BIG mirror

2006-03-03 Thread Jan-Benedict Glaw
On Fri, 2006-03-03 08:02:55 +0100, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 // I wonder if this message has been posted, so I sent it again //

It was, but nobody answered yet.

   I'm preparing a plan for a production mode in my company: we need to
 mirror around 100GB of data through a special VPN internet line, 2MB
 symmetric.
   The first time, the data will be transferred on a medium such as a HD.
 Next, each night, we will try to update clients from the master server.

Does every client have 2MBit? ...or only the server's machine?

 It should be around 500MB to 3GB, not so much in comparison with the
 original size of the data. 
   I discovered rsync uses a lot of CPU and RAM to run checksums on
 files that have to be synchronised. I need an opinion about my situation:

Right. Rsync trades (especially) CPU cycles and some RAM for network
bandwidth.

   So: each night, from 0:00am to 7:00am at the latest, the server will
 have to check the 100GB of files and see which files have been modified,
 then upload them to the clients. Each file is around 4MB to 40MB on average. 

Are these new files, or do the old ones change? Are those minimal
changes within the files, or do they change throughout?

 I would like to know your opinion about this situation:  
  - Should I set up a strong dual-CPU computer dedicated to calculating
 all of this? 

A lot of CPU power cannot hurt.

  - How much memory should I install? 

Depends on the number of clients.

Rule of thumb: RAM can only be substituted by even more RAM.

  - Is any bandwidth used during the checksum computation? Mine is
 quite limited.

Checksum calculation basically happens on the server side as well as
on the client side; this part doesn't really use bandwidth.

  - I know the client computer will have to check files too; disk I/O
 will be the most used. I think this computer will have an NFS mount from a
 datacenter computer with a Gigabit LAN card; I wonder if it will be enough...

Is it a two-computer sync or one master machine with a huge number of
clients? However, both sides may need to touch all the file data...

   I'm quite worried about the amount of data to check before synchronising
 clients, and how long it will take. In short, what do YOU
 think? Any advice?

That all depends on the usage pattern. So you've got one central rsync
server and a number (how many?) of clients that need to synchronize.
All these do have 2Mbit connectivity, right?

You'd also have to define the way your files change. Do they change by
name? By content? If by content, how much does change within the
files?

See, it's all about the details :-)

MfG, JBG

-- 
Jan-Benedict Glaw   [EMAIL PROTECTED]. +49-172-7608481 _ O _
Eine Freie Meinung in  einem Freien Kopf| Gegen Zensur | Gegen Krieg  _ _ O
 für einen Freien Staat voll Freier Bürger  | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));



RE: Question about rsync and BIG mirror

2006-03-03 Thread Tony
Flames invited if I'm wrong on any of this, but:

Some (long overdue) backups indicate that network speed
should be much more important than cpu speed.
Your results will depend heavily on your exact mix
and I cannot think of any reasonable way to quantify it.
That said, this may help give you a clue.

This is about two and a half hours to transfer two months of changes
to 15GB over a 20 kbyte/second connection.
This is approx. 2.38% of the files, 1.19% of the volume changed.
Most if not all the changed files are actually new files.

Be aware that big files that are logically almost the same
but are physically a bit different
can be rather time consuming to transfer
--- watch timeouts; older versions of rsync
sometimes needed very large timeout values.

You probably want the exact same version of rsync on both sides
although I've had almost no problems using whatever version
happened to be available at the time.
Seems like only problems I've seen were slightly spooky error messages.

If the computers are fast and the network is slow,
compression is probably a good idea.


Files which have the same size and last modified times are assumed to be
identical and are not further checked. (There are switches to check anyway)
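
The size-and-time shortcut just described can be pictured with a small Python sketch. This is only an illustration of the idea, not rsync's code: the function name needs_transfer is made up, and comparing modification times at whole-second granularity is an assumption of this example.

```python
import os


def needs_transfer(src_path, dst_path):
    """Return True if dst is missing or differs from src in size or mtime.

    Hypothetical helper: files that match on both size and modification
    time are assumed identical and their contents are never read.
    """
    src = os.stat(src_path)
    try:
        dst = os.stat(dst_path)
    except FileNotFoundError:
        return True  # nothing on the receiving side yet
    if src.st_size != dst.st_size:
        return True
    # Assumption: compare modification times to the whole second
    return int(src.st_mtime) != int(dst.st_mtime)
```

With a check like this, only metadata is read for unchanged files, which is why an archive where most files did not change costs little beyond the directory scan.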

What will matter is how much changed and where.
You are probably better off setting things up so you can cope sensibly
when a lot of stuff gets rearranged. It happens.

From a rather overdue (Dec 30) backup of almost 15GB drawings
(secondary, off-off-site)  (T1 to backwater cable modem)
Receiver (Server) looks 99+ % idle (with top taking most of the cpu)
Sender (Client) looks 94 % idle (top taking about 4%) 700MHz AMD Duron
58183 files

time rsync -avPz --bwlimit=20 --timeout=750 \
  --password-file=/etc/rsync.secrets/xx \
/home/xxx/*  [EMAIL PROTECTED]::x/

  146587 100%   44.89kB/s    0:00:03 (xfer#1385, to-check=643/58183)

sent 178,475,844 bytes  received 198,042 bytes  20,348.94 bytes/sec
total size is 15,034,880,070  speedup is 84.15

real    146m43.557s
user    2m4.680s
sys     0m16.950s
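
For readers wondering where the "speedup" figure comes from: it is the total size of the file set divided by the bytes actually sent plus received. A short Python sketch reproduces it from the numbers in the two runs quoted in this message; the function name rsync_speedup is just for this example.

```python
def rsync_speedup(total_size, sent, received):
    """Total size of the file set divided by the bytes that crossed the wire."""
    return total_size / (sent + received)


# Figures copied from the two rsync summaries in this message
first_run = rsync_speedup(15_034_880_070, 178_475_844, 198_042)
second_run = rsync_speedup(15_186_475_211, 315_061, 1_239_438_988)

print(f"first run:  {first_run:.2f}")
print(f"second run: {second_run:.2f}")
```

The results round to 84.15 and 12.25, matching the summaries: the higher the number, the smaller the share of the archive that had to be transferred.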




time rsync -av [EMAIL PROTECTED]::rsync-xxx/ \
  /home/rsync-/
very stale  -- Sep 8 2005   about 15% idle 300MHz P2

sent 315061 bytes  received 1,239,438,988 bytes  5,571,928.31 bytes/sec
total size is 15,186,475,211  speedup is 12.25

real    3m42.096s
user    0m54.560s
sys     1m44.730s


