Re: Question about rsync and BIG mirror
100GB of 4-40MB files sounds like my home PC full of digital photos I've taken. It backs up to a Linux PC right beside it with rsync. I wouldn't really call that a big project for rsync. Big things for rsync are millions of files. At 100Mbps, it takes a few seconds to build the list. I use the -W option to copy whole files instead of messing with incremental transfers on a fast network. The Linux PC running rsync only has 768MB of RAM and an Athlon 1700, and runs X, so it's nothing special.

It will take a while over the VPN to build the file list and start transferring, but it shouldn't be hard on the computer's resources.

I used to back the whole system up over a 512kbps link to our datacenter, but don't anymore. That was reliable but obviously slow.

On Fri, Mar 03, 2006 at 08:02:55AM +0100, [EMAIL PROTECTED] wrote:
> // I wonder if this message has been posted, so I sent it again //
>
> Hello,
>
> I'm quite a n00b on rsync stuff, but I went to the website, read the FAQ/how-to, Googled and more, and I set up my own rsync server and clients: everything works fine :-D
>
> I'm preparing a plan for production use in my company: we need to mirror around 100GB of data through a special VPN internet line, 2MB symmetric. The first time, the data will be transferred on physical media such as a hard disk. After that, each night, we will try to update the clients from the master server. It should be around 500MB to 3GB, not so much in comparison with the original size of the data.
>
> I discovered that rsync uses a lot of CPU and RAM to run checksums on files that have to be synchronised. I need an opinion about my situation:
>
> So: each night, from 0:00am to 7:00am at the latest, the server will have to check the 100GB of files and see which files have been modified, then upload them to the clients. Each file is around 4MB to 40MB on average.
>
> I would like to know your opinion about this situation:
> - Should I set up a strong dual-CPU computer dedicated to calculating all this?
> - What about the memory I should install?
> - Is there any bandwidth used during the checksum computation? Mine is quite limited.
> - I know the client computer will have to check files too; disk I/O will be the most used. I think this computer will have an NFS mount from a datacenter computer with a GB LAN card; I wonder if it will be enough...
>
> I'm quite scared of the amount of data to check before synchronising the clients, and how long it will take.
>
> To finish shortly, what do YOU think? Any advice?
>
> Thanks,
> Johan

--
/*
Jason Philbrook | Midcoast Internet Solutions - Internet Access,
    KB1IOJ      | Hosting, and TCP-IP Networks for Midcoast Maine
 http://f64.nu/ | http://www.midcoast.com/
*/

--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
Re: Question about rsync and BIG mirror
jp wrote:
> 100gb of 4-40MB files sounds like my home PC full of digital photos I've taken. It backs up to a linux PC right beside it with rsync. I don't really call it that big a project for rsync. Big things for rsync are millions of files. At 100mbps, it takes a few seconds to build the list. I use the -W option to copy the whole file instead of mess with incremental on a fast network. The linux PC running rsync only has 768MB, athlon 1700, and runs X, so it's nothing special.

Hmm. My home directory, on my laptop (a mere 60GB disk), does contain millions of files, and it takes about 20 minutes to build the list on a good day. 100Mbps network, but it's I/O bound, not network bound.

It looks a lot like the number of files is more significant than the amount of data at this scale.

-- Jamie
Re: Question about rsync and BIG mirror
Jamie Lokier wrote:
> Hmm. My home directory, on my laptop (a mere 60GB disk), does contain millions of files, and it takes about 20 minutes to build the list on a good day. 100Mbps network, but it's I/O bound, not network bound.
>
> It looks a lot like the number of files is more significant than the amount of data at this scale.

In fact, I know of at least one place where they don't use rsync because they don't have enough RAM+swap to hold the list of files in memory. As far as future directions for rsync go, I think this is the major place where rsync needs to become better.

Shachar
Re: Question about rsync and BIG mirror
On Mon, Mar 06, 2006 at 07:18:45PM +0200, Shachar Shemesh wrote:
> In fact, I know of at least one place where they don't use rsync because they don't have enough RAM+SWAP to hold the list of files in memory. As far as future directions for rsync, I think this is the major place where rsync needs to become better.

I agree, and I am planning to add protocol improvements to rsync that will allow it to have a much smaller memory footprint, and to begin transferring files without pre-scanning the whole hierarchy first.

..wayne..
Re: Question about rsync and BIG mirror
Shachar Shemesh wrote:
>> Hmm. My home directory, on my laptop (a mere 60GB disk), does contain millions of files, and it takes about 20 minutes to build the list on a good day. 100Mbps network, but it's I/O bound, not network bound.
>>
>> It looks a lot like the number of files is more significant than the amount of data at this scale.
>
> In fact, I know of at least one place where they don't use rsync because they don't have enough RAM+SWAP to hold the list of files in memory.

That's true when I'm syncing my laptop's home directory (192MB RAM + 256MB swap) to the desktop machine (plenty of RAM). So I have to split my home directory rsync into two commands, which copy different parts of it separately.

I suspect the excessive memory usage is due to -H (preserve hard links), of which I have many due to mostly-hard-linked Linux kernel source trees.

That place you mention wouldn't happen to be using the -H option, would they?

-- Jamie
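[Editor's sketch] The workaround described above, splitting one large rsync run into several smaller ones so that each run builds a shorter file list, could look roughly like the following. This is only an illustration, not the commands Jamie actually used: the function name, the round-robin grouping, and the paths in the usage note are all hypothetical. The function only builds argument lists; nothing is executed.

```python
from typing import List, Sequence


def split_rsync_commands(src_root: str, entries: Sequence[str],
                         dest: str, parts: int = 2) -> List[List[str]]:
    """Build one rsync argv per group of top-level entries.

    Hypothetical helper: each group is synced by a separate rsync run,
    so each run only has to hold its own share of the file list.
    """
    root = src_root.rstrip("/")
    commands = []
    for i in range(parts):
        # Round-robin assignment of top-level entries to groups.
        group = list(entries[i::parts])
        if not group:
            continue
        sources = [f"{root}/{name}" for name in group]
        # -a: archive mode, -H: preserve hard links (as mentioned in the thread).
        commands.append(["rsync", "-a", "-H", *sources, dest])
    return commands
```

For example, `split_rsync_commands("/home/jamie", ["src", "mail", "photos"], "desktop:/backup/jamie/")` yields two argument lists that could be passed to `subprocess.run`. One caveat to keep in mind: hard links between files that end up in different groups would probably not be preserved as links, since each run only sees its own file list.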
Re: Question about rsync and BIG mirror
Wayne Davison wrote:
> On Mon, Mar 06, 2006 at 07:18:45PM +0200, Shachar Shemesh wrote:
>> In fact, I know of at least one place where they don't use rsync because they don't have enough RAM+SWAP to hold the list of files in memory. As far as future directions for rsync, I think this is the major place where rsync needs to become better.
>
> I agree, and I am planning to add protocol improvements to rsync that will allow it to have a much smaller memory footprint, and to begin transferring files without pre-scanning the whole hierarchy first.

While you're there, one little trick I've found that speeds up scanning large directory hierarchies is to stat() or open() entries in inode-number order.

For some filesystems it makes no difference, but for others it reduces the average disk seek time, as on many common filesystems the inode number is related to position on the disk. In unusual cases I've seen a factor of 10 improvement, but usually it's just 1-2.

-- Jamie
Re: Question about rsync and BIG mirror
Jamie Lokier wrote:
> While you're there, one little trick I've found that speeds up scanning large directory hierarchies is to stat() or open() entries in inode-number order.
>
> For some filesystems it makes no difference, but for others it reduces the average disk seek time as on many common filesystems, inode number is related to position on the disk. In unusual cases I've seen a factor of 10 improvement, but usually it's just 1-2.

The way I see it, if you got that far, then you don't have any problem with the size of the file list.

Shachar
Re: Question about rsync and BIG mirror
Shachar Shemesh wrote:
>> While you're there, one little trick I've found that speeds up scanning large directory hierarchies is to stat() or open() entries in inode-number order.
>>
>> For some filesystems it makes no difference, but for others it reduces the average disk seek time as on many common filesystems, inode number is related to position on the disk. In unusual cases I've seen a factor of 10 improvement, but usually it's just 1-2.
>
> The way I see it, if you got that far, then you don't have any problem with the size of the file list.

I don't mean to stat() after reading the whole hierarchy! That doesn't make sense anyway, because you have to stat() to decide if something's a directory in order to recurse into it.

I mean one directory at a time: after calling readdir(), just sort the list of directory entries by d_ino values before using them. That's negligible in time and memory.

-- Jamie
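[Editor's sketch] The per-directory trick described above (read one directory, sort its entries by inode number, then stat them in that order) can be illustrated in Python as follows. This is a minimal sketch of the idea, not rsync's code; the function name is hypothetical. It relies on `os.scandir()`, whose entries expose the inode number via `DirEntry.inode()`, playing the role of readdir() and d_ino.

```python
import os
import stat
from typing import Iterator, Tuple


def walk_in_inode_order(top: str) -> Iterator[Tuple[str, os.stat_result]]:
    """Yield (path, lstat result) for everything below `top`.

    Within each directory, entries are stat()ed in ascending inode order,
    which on some filesystems reduces disk seeks.
    """
    # Read the whole directory first, then sort only this directory's
    # entries by inode number: cheap in both time and memory.
    with os.scandir(top) as it:
        entries = sorted(it, key=lambda entry: entry.inode())

    for entry in entries:
        st = os.lstat(entry.path)  # lstat: do not follow symlinks
        yield entry.path, st
        if stat.S_ISDIR(st.st_mode):
            # Recurse one directory at a time, as described in the message.
            yield from walk_in_inode_order(entry.path)
```

Whether this helps at all depends on the filesystem; as the message says, on some it makes no difference.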
RE: Question about rsync and BIG mirror
Thanks for all your answers and advice.

My problem seems to be on the side of the 2MB line, once the whole 190GB of data is synchronised.

I will keep in touch and give some feedback.

Thanks to all
Re: Question about rsync and BIG mirror
[EMAIL PROTECTED] wrote:
> Hello,
>
> So: each night, from 0:00am to maximum 7:00am, the server will have to check the 100GB of files and see what files have been modified, then upload them to the clients. Each file is around 4MB to 40MB on average.

Are the clients what you call the mirror? Are there several of them?

> I would like to know your opinion about this situation:
> - Should I setup a strong dual CPU computer dedicated to calculate this whole stuff?

That depends.

> - What about the memory I should install?
> - Is there any bandwidth used during the checksums computation? Mine is quite limited.

Is that 2 mega BYTE per second or 2 mega BIT per second?

> - I know the client computer will have to check files too; Disk I/O will be the most used. I think this computer will have NFS mount from a datacenter computer with a GB LAN card, I wonder it will be enough...

Scanning 100GB of data in 7 hours doesn't require that much disk bandwidth.

> I'm quite scared of the amount of data to check before synchronise clients, and how long it will take.
>
> To finish shortly, what do YOU think? Any advices?

Here are a few performance characteristics of rsync I think you should be aware of:

- By default, rsync only checks files that differ between receiver and sender in timestamp or size. If most files in your archive did not change at all, you can discard them altogether from your bandwidth calculations.
- The receiver only does a linear scan of the file, followed by generating a second file (which MAY require random access to the first file, if blocks in the file changed order). Its CPU performance requirements are negligible. This is bad for the case where you have one mirror source sending out info to many mirrors, as all the CPU load falls on the single server.
- If your bandwidth is 2 mega BIT per second, you are a bit marginal as far as transferring 5GB of data in 7 hours. This has nothing to do with rsync, though. A simple calculation can show you the same result. Getting full bandwidth for the entire 7 hours will allow you to transfer about 6 GB of data.

> Thanks,
> Johan
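[Editor's sketch] The "simple calculation" mentioned above can be written out as follows. It assumes the line runs at its full nominal rate for the whole window, with decimal units (1 megabit = 1,000,000 bits) and no protocol overhead; the function name is only for illustration.

```python
def max_transfer_bytes(link_bits_per_second: float, window_hours: float) -> float:
    """Upper bound on bytes movable over a link in the given time window."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second * window_hours * 3600


MEGABIT = 1_000_000  # decimal megabit, an assumption about how the line is rated

# 2 Mbit/s for the nightly 7-hour window (0:00 to 7:00).
limit = max_transfer_bytes(2 * MEGABIT, 7)
print(f"{limit / 1e9:.1f} GB")  # prints "6.3 GB"
```

If the line were 2 megabytes per second instead, the same window would allow eight times as much, about 50 GB, which is why the bit-versus-byte question matters here.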
Re: Question about rsync and BIG mirror
On Fri, 2006-03-03 08:02:55 +0100, [EMAIL PROTECTED] wrote:
> // I wonder if this message has been posted, so I sent it again //

It was, but nobody answered yet.

> I'm preparing a plan for a production mode in my company: we need to mirror around 100GB of data through a special VPN internet line 2MB symmetric. The first time, the data will be transferred by a media such as a HD. Next, each night, we will try to update clients from the master server.

Does every client have 2MBit? ...or only the server's machine?

> It should be around 500MB to 3GB, not so much in comparison of the original size of data. I discovered rsync use a lot of CPU and RAM to run checksums on file that have to be synchronised. I need an opinion about my situation:

Right. Rsync trades (especially) CPU cycles and some RAM for network bandwidth.

> So: each night, from 0:00am to maximum 7:00am, the server will have to check the 100GB of files and see what files have been modified, then upload them to the clients. Each file is around 4MB to 40MB on average.

Are these new files, or do the old ones change? Are those minimal changes within the files, or do they change throughout?

> I would like to know your opinion about this situation:
> - Should I setup a strong dual CPU computer dedicated to calculate this whole stuff?

A lot of CPU power cannot hurt.

> - What about the memory I should install?

Depends on the number of clients. Rule of thumb: RAM can only be substituted by even more RAM.

> - Is there any bandwidth used during the checksums computation? Mine is quite limited.

Checksum calculation basically happens on the server side as well as on the client side; this part doesn't really use bandwidth.

> - I know the client computer will have to check files too; Disk I/O will be the most used. I think this computer will have NFS mount from a datacenter computer with a GB LAN card, I wonder it will be enough...

Is it a two-computer sync or one master machine with a huge number of clients? However, both sides may need to touch all the file data...

> I'm quite scared of the amount of data to check before synchronise clients, and how long it will take.
>
> To finish shortly, what do YOU think? Any advices?

That all depends on the usage pattern. So you've got one central rsync server and a number (how many?) of clients that need to synchronize. All of these have 2Mbit connectivity, right?

You'd also have to define the way your files change. Do they change by name? By content? If by content, how much changes within the files?

See, it's all about the details :-)

Regards, JBG

--
Jan-Benedict Glaw
RE: Question about rsync and BIG mirror
Flames invited if I'm wrong on any of this, but:

Some (long overdue) backups indicate that network speed should be much more important than CPU speed.

Your results will depend heavily on your exact mix and I cannot think of any reasonable way to quantify it. That said, this may help give you a clue.

The first run below took about two and a half hours to transfer two months of changes to 15GB over a connection limited to about 20 kB/s (--bwlimit=20). Approx. 2.38% of the files and 1.19% of the volume changed. Most if not all of the changed files are actually new files.

Be aware that big files that are logically almost the same but physically a bit different can be rather time-consuming to transfer. Watch the timeouts; older versions of rsync sometimes needed very large timeout values.

You probably want the exact same version of rsync on both sides, although I've had almost no problems using whatever version happened to be available at the time. The only problems I've seen were slightly spooky error messages.

If the computers are fast and the network is slow, compression is probably a good idea.

Files which have the same size and last-modified time are assumed to be identical and are not checked further. (There are switches to check anyway.)

What will matter is how much changed and where. You are probably better off setting things up so you can cope sensibly when a lot of stuff gets rearranged. It happens.
From a rather overdue (Dec 30) backup of almost 15GB of drawings (secondary, off-off-site; T1 to backwater cable modem):

Receiver (server) looks 99+% idle (with top taking most of the CPU).
Sender (client) looks 94% idle (top taking about 4%); 700MHz AMD Duron, 58183 files.

time rsync -avPz --bwlimit=20 --timeout=750 \
    --password-file=/etc/rsync.secrets/xx \
    /home/xxx/* [EMAIL PROTECTED]::x/

      146587 100%   44.89kB/s    0:00:03 (xfer#1385, to-check=643/58183)

sent 178,475,844 bytes  received 198,042 bytes  20,348.94 bytes/sec
total size is 15,034,880,070  speedup is 84.15

real    146m43.557s
user    2m4.680s
sys     0m16.950s

time rsync -av [EMAIL PROTECTED]::rsync-xxx/ /home/rsync-/

Very stale (Sep 8 2005), about 15% idle, 300MHz P2:

sent 315061 bytes  received 1,239,438,988 bytes  5,571,928.31 bytes/sec
total size is 15,186,475,211  speedup is 12.25

real    3m42.096s
user    0m54.560s
sys     1m44.730s
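[Editor's sketch] The figures quoted in the message above can be cross-checked from the rsync summary lines. The sketch below assumes that the "speedup" rsync prints is the total size divided by the bytes sent plus received, and that the "2.38% of the files" figure comes from the xfer#1385 counter over 58183 files; both are the editor's reading of the output, and the helper names are hypothetical.

```python
def rsync_speedup(total_size: int, sent: int, received: int) -> float:
    """Total data size divided by the bytes that actually crossed the wire."""
    return total_size / (sent + received)


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage."""
    return 100.0 * part / whole


# First run: rate-limited upload of two months of changes.
slow = rsync_speedup(15_034_880_070, sent=178_475_844, received=198_042)
# Second run: very stale mirror pulled over a fast link.
fast = rsync_speedup(15_186_475_211, sent=315_061, received=1_239_438_988)

files_changed = percent(1_385, 58_183)                   # xfer counter / file count
volume_changed = percent(178_475_844, 15_034_880_070)    # bytes sent / total size

print(f"speedup {slow:.2f} and {fast:.2f}")
print(f"{files_changed:.2f}% of files, {volume_changed:.2f}% of volume")
```

Under those assumptions the computed values agree with the numbers in the message (speedups of 84.15 and 12.25, with 2.38% of files and 1.19% of volume changed).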