Re: proposal to speed rsync with lots of files
Using inotify with rsync is a great idea. If one has a job that runs daily to pick up differences on a very large filesystem with very small files, then one can do this (assuming the initial copy is already complete):

1. Set an inotify watch on the source filesystem (or tree).
2. Record all the notifications in a text file, as absolute paths.
3. Run rsync with the paths from the text file and place them in the destination repository.
4. Run a full rsync again to be 100% sure.

I like this idea.

On Fri, Mar 6, 2009 at 11:58 AM, Wayne Davison way...@samba.org wrote:
> On Thu, Mar 05, 2009 at 03:27:50PM -0800, Peter Salameh wrote:
> > My proposal is to first send a checksum of the file list for each directory. If it is found to be identical to the same checksum on the remote side, then the list need not be sent for that directory!
>
> My rZync source does something like that for directories: it treats a directory-list transfer like a file transfer. That means that the receiving side sends a set of checksums to the sending side telling it what its version of the directory looks like, and then the sender sends a normal set of delta data that lets the receiver reconstruct the sender's version of the directory (which it compares to its own). One potential drawback is having to deal with false checksum matches (which should be rare, but would require the dir data to be resent).
>
> I hadn't optimized it for block size or (possibly) data order to make it more efficient, but it is an interesting idea for speeding up a slow connection. I'm not sure if it would really help out that much for a more modern, faster connection, because rsync sends the file-list data at the same time as it is being scanned, and sometimes the scan is the bottleneck.
>
> The best way to optimize sending of really large numbers of files that are mostly the same is to start to leverage a file-change notification system, such as inotify. Using that, it is possible to distill a list of what files/directories need to be copied, and to just copy what is needed.
>
> ..wayne..
--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
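A minimal sketch of steps 2 and 3 of the workflow above, assuming the change notifications have already been collected as absolute paths. The function name `build_incremental_rsync` and the source and destination arguments are hypothetical; the only rsync features relied on are the `-a` and `--files-from` options. The code only builds the command line and the list file, it does not run rsync.

```python
import os
import tempfile


def build_incremental_rsync(changed_paths, source_root, dest):
    """Return (argv, list_file) for an rsync run limited to the changed paths.

    changed_paths: absolute paths taken from the notification log.
    source_root, dest: hypothetical source tree and rsync destination.
    """
    root = os.path.abspath(source_root)
    relative = set()  # a set drops paths that were logged more than once
    for path in changed_paths:
        path = os.path.abspath(path)
        # Skip anything that was logged outside the watched tree.
        if os.path.commonpath([root, path]) != root:
            continue
        relative.add(os.path.relpath(path, root))

    # --files-from reads one path per line, relative to the source argument.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        for rel in sorted(relative):
            handle.write(rel + "\n")
        list_file = handle.name

    argv = ["rsync", "-a", "--files-from=" + list_file, root + "/", dest]
    return argv, list_file
```

The returned `argv` could be handed to `subprocess.run(argv, check=True)`. Paths that were deleted on the source after being logged will not be found by this pass, which is one reason the final full rsync in step 4 is still worth running.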
Re: proposal to speed rsync with lots of files
At 07:58 06.03.2009 -0800, Wayne Davison wrote:
> On Thu, Mar 05, 2009 at 03:27:50PM -0800, Peter Salameh wrote:
> > My proposal is to first send a checksum of the file list for each directory. If it is found to be identical to the same checksum on the remote side, then the list need not be sent for that directory!
>
> My rZync source does something like that for directories: it treats a directory-list transfer like a file transfer. That means that the receiving side sends a set of checksums to the sending side telling it what its version of the directory looks like, and then the sender sends a normal set of delta data that lets the receiver reconstruct the sender's version of the directory (which it compares to its own). One potential drawback is having to deal with false checksum matches (which should be rare, but would require the dir data to be resent).
>
> I hadn't optimized it for block size or (possibly) data order to make it more efficient, but it is an interesting idea for speeding up a slow connection. I'm not sure if it would really help out that much for a more modern, faster connection, because rsync sends the file-list data at the same time as it is being scanned, and sometimes the scan is the bottleneck.

To find out whether the scanning or the transferring is the bottleneck, would it be possible for the statistics to include a hint, such as which threads had to wait longer or which action took more time? Something that would indicate that, for example, enabling or disabling compression might give a faster overall transfer. I don't know if this internal data can be collected or if trial and error is the only way to do it.

Thanks
bye
Fabi
Re: proposal to speed rsync with lots of files
On Thu, Mar 05, 2009 at 03:27:50PM -0800, Peter Salameh wrote:
> My proposal is to first send a checksum of the file list for each directory. If it is found to be identical to the same checksum on the remote side, then the list need not be sent for that directory!

My rZync source does something like that for directories: it treats a directory-list transfer like a file transfer. That means that the receiving side sends a set of checksums to the sending side telling it what its version of the directory looks like, and then the sender sends a normal set of delta data that lets the receiver reconstruct the sender's version of the directory (which it compares to its own). One potential drawback is having to deal with false checksum matches (which should be rare, but would require the dir data to be resent).

I hadn't optimized it for block size or (possibly) data order to make it more efficient, but it is an interesting idea for speeding up a slow connection. I'm not sure if it would really help out that much for a more modern, faster connection, because rsync sends the file-list data at the same time as it is being scanned, and sometimes the scan is the bottleneck.

The best way to optimize sending of really large numbers of files that are mostly the same is to start to leverage a file-change notification system, such as inotify. Using that, it is possible to distill a list of what files/directories need to be copied, and to just copy what is needed.

..wayne..
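The exchange described in this message (receiver sends block checksums of its directory list, sender answers with delta data) can be sketched roughly as follows. This is not rZync's code. It is a simplified illustration that assumes the directory list has been serialized to bytes, uses a tiny fixed block size, recomputes the weak checksum for every window instead of rolling it, and uses SHA-256 as a stand-in strong checksum. All function names are hypothetical.

```python
import hashlib


def weak(block):
    """rsync-style weak checksum of a block (recomputed here, not rolled)."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def signatures(old, block_size):
    """Receiver side: map weak checksum -> [(strong checksum, block index)]."""
    table = {}
    for start in range(0, len(old), block_size):
        block = old[start:start + block_size]
        entry = (hashlib.sha256(block).digest(), start // block_size)
        table.setdefault(weak(block), []).append(entry)
    return table


def delta(new, table, block_size):
    """Sender side: describe `new` as copies of receiver blocks plus literal data."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block_size]
        match = None
        for strong, index in table.get(weak(window), ()):
            # The strong checksum guards against false weak-checksum matches.
            if hashlib.sha256(window).digest() == strong:
                match = index
                break
        if match is None:
            literal.append(new[pos])
            pos += 1
            continue
        if literal:
            ops.append(("data", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", match))
        pos += len(window)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old, ops, block_size):
    """Receiver side: rebuild the sender's list from its own copy and the delta."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)
```

With a receiver list of `b"a.txt\nb.txt\nc.txt\nd.txt\n"` and a sender list that has one extra name inserted, only the bytes around the insertion travel as literal data and the rest is expressed as block copies. A real implementation would roll the weak checksum from one window to the next rather than recomputing it.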
proposal to speed rsync with lots of files
Hello,

I have followed the discussion of speeding up rsync when there are lots of files, and I have a proposal which I think would greatly speed rsync when doing routine mirroring of large filesystems.

One of the speed-limiting issues with rsync is having to send huge file lists when mirroring large file systems, even for incremental updates where only a small part of the file system might have changed.

My proposal is to first send a checksum of the file list for each directory. If it is found to be identical to the same checksum on the remote side, then the list need not be sent for that directory! That would reduce the size of the file list greatly when there are directories containing many files which do not change from one rsync to the next.

Here's an example:

remote   local
dir1     dir1 - file list checksum same as on remote - don't send file list for dir1
dir2     dir2 - file list checksum same as on remote - don't send file list for dir2
dir3     dir3 - file list checksum different from remote - send file list for dir3

It might even be possible to use the rsync checksum algorithm on the directory lists themselves to determine which portion of the directory lists to send, in the case of directories which are nearly identical.

I would appreciate hearing from rsync developers if this is feasible with the current implementation and if they think it would help.

Thanks,
Peter Salameh
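A rough sketch of the proposal above, under stated assumptions: each side hashes a sorted listing of one directory (name, plus size and mtime for regular files), and only directories whose digests differ have their file list sent. The function names, the choice of SHA-1, and the attributes included in the listing are illustrative assumptions, not anything rsync does today.

```python
import hashlib
import os


def directory_digest(path):
    """Hash the sorted entries of one directory (non-recursive)."""
    digest = hashlib.sha1()
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_file(follow_symlinks=False):
                info = entry.stat(follow_symlinks=False)
                record = f"f\0{entry.name}\0{info.st_size}\0{info.st_mtime_ns}\n"
            else:
                # Directories and other entry types: record the name only.
                record = f"o\0{entry.name}\n"
            digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()


def directories_to_send(local_digests, remote_digests):
    """Return the directories whose file list still has to be sent."""
    return sorted(
        name
        for name, value in local_digests.items()
        if remote_digests.get(name) != value
    )
```

For the example in the message, `directories_to_send({"dir1": "a", "dir2": "b", "dir3": "c"}, {"dir1": "a", "dir2": "b", "dir3": "z"})` returns `["dir3"]`. Note that each side still has to scan and stat its own directory to compute the digest, so this saves bandwidth rather than scan time.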
Re: proposal to speed rsync with lots of files
Peter Salameh wrote:
> One of the speed-limiting issues with rsync is having to send huge file lists when mirroring large file systems, even for incremental updates where only a small part of the file system might have changed.

Personally, I find that the sending of the file list, whether incremental or otherwise, takes orders of magnitude less time than the construction of the file list in the first place. The act of stat'ing millions of files takes an enormous amount of time in comparison to just about anything else, assuming that you are not on a low-bandwidth link.

What would be ideal, I think, is for rsync to scan the filesystem while a transfer is in progress. I am completely ignorant of rsync's innards in this regard, but it seems like a basic producer/consumer algorithm, with a configurable quantity of file transfer threads combined with a configurable quantity of filesystem spider threads, would give the best interleaving of disk latency and the time required to transfer files.

--Kyle
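The producer/consumer arrangement suggested above might look something like this sketch. It assumes a caller-supplied `transfer` callable (hypothetical) that copies one file, uses a single scanner rather than several spider threads, and bounds the queue so the scan cannot run arbitrarily far ahead of the transfers. It illustrates the pattern only and says nothing about how rsync is structured internally.

```python
import os
import queue
import threading


def scan(root, work):
    """Producer: walk the tree and queue every regular file path found."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            work.put(os.path.join(dirpath, name))


def mirror(root, transfer, transfer_threads=2, max_pending=1000):
    """Scan `root` while worker threads call `transfer(path)` concurrently.

    Returns a list of (path, exception) pairs for transfers that failed.
    """
    work = queue.Queue(maxsize=max_pending)
    errors = []

    def worker():
        while True:
            path = work.get()
            if path is None:  # sentinel: the scan has finished
                return
            try:
                transfer(path)
            except Exception as exc:
                errors.append((path, exc))

    workers = [threading.Thread(target=worker) for _ in range(transfer_threads)]
    for thread in workers:
        thread.start()
    scan(root, work)
    for _ in workers:
        work.put(None)  # one sentinel per worker
    for thread in workers:
        thread.join()
    return errors
```

The number of transfer threads and the queue depth are the two knobs here. Whether more than one scanner thread helps is likely to depend on the filesystem and the disk, so it is left out of this sketch.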
Re: proposal to speed rsync with lots of files
Kyle Lanclos wrote:
> Peter Salameh wrote:
> > One of the speed-limiting issues with rsync is having to send huge file lists when mirroring large file systems, even for incremental updates where only a small part of the file system might have changed.
>
> Personally, I find that the sending of the file list, whether incremental or otherwise, takes orders of magnitude less time than the construction of the file list in the first place. The act of stat'ing millions of files takes an enormous amount of time in comparison to just about anything else, assuming that you are not on a low-bandwidth link.

Many of the locations I mirror do have low-bandwidth links, and I imagine that a sizable fraction of rsync users have issues with bandwidth at least some of the time.

I have not studied rsync internals either, but I imagine using a checksum to avoid sending directory file lists unnecessarily would not hurt anything and would help folks with lower bandwidth.

Peter
Re: proposal to speed rsync with lots of files
Kyle Lanclos wrote:
> Peter Salameh wrote:
> > One of the speed-limiting issues with rsync is having to send huge file lists when mirroring large file systems, even for incremental updates where only a small part of the file system might have changed.
>
> Personally, I find that the sending of the file list, whether incremental or otherwise, takes orders of magnitude less time than the construction of the file list in the first place. The act of stat'ing millions of files takes an enormous amount of time in comparison to just about anything else, assuming that you are not on a low-bandwidth link.

I find both take time, and the dominant one depends on the link and on whether the stat information is already in RAM. Usually I only transfer very large file lists over a LAN, though, so it's more like Kyle's situation, where the file stat'ing takes longest.

The only realistic way to eliminate stat time is some kind of filesystem monitoring and attribute index, similar to the methods used by the dynamic indexes of local search engine style programs. (On Linux that means using inotify and a daemon which runs all the time. On Windows there are (perhaps) better methods which can survive reboots.)

Without that, you can reduce the stat time by scanning the filesystem in a different way. I wrote a program many years ago called treescan which did a recursive directory traversal while sorting the stat calls by inode number, taken from each directory entry's d_ino. On many filesystems, the inode number is approximately related to position on the disk. On those where it was, the heuristic sped up whole-filesystem scans by a factor of about 2, and on some directory structures by a factor of about 100.

It's possible some parallel stat calls would improve this further on some OSes and kernel versions, by allowing better head seek optimisation at the kernel level. But other OSes or kernel versions would be slowed by it.

> What would be ideal, I think, is for rsync to scan the filesystem while a transfer is in place;

I think rsync 3 does this; it's called incremental recursion.

> with a configurable quantity of file transfer threads, combined with a configurable quantity of filesystem spider threads, would result in the most optimal interleaving of disk latency and time required to transfer files.

Be careful when using multiple unsynchronised threads to access a filesystem. It sometimes thrashes the disk, seeking back and forth between different files, resulting in much worse latency than just doing one file at a time.

That said, it can work out better. You just have to be careful how it's done.

-- Jamie
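The inode-ordering heuristic described in this message can be sketched as below. This is not the original treescan program, only an illustration of the idea: read each directory first, sort its entries by the inode number reported in the directory entry, and only then issue the stat calls. It relies on `os.DirEntry.inode()`, which on Unix returns the directory entry's inode number without a separate stat call. The function name reuses "treescan" purely for readability.

```python
import os
import stat


def treescan(root):
    """Yield (path, stat_result) for everything below root.

    Within each directory the stat calls are issued in inode order,
    which on some filesystems roughly follows on-disk position.
    """
    with os.scandir(root) as entries:
        # inode() comes from the directory entry itself on Unix (d_ino).
        listing = sorted((entry.inode(), entry.path) for entry in entries)

    subdirs = []
    for _inode, path in listing:
        info = os.lstat(path)  # lstat: do not follow symlinks
        yield path, info
        if stat.S_ISDIR(info.st_mode):
            subdirs.append(path)

    # Descend only after the whole directory has been stat'ed.
    for path in subdirs:
        yield from treescan(path)
```

Whether this ordering helps at all depends on the filesystem and on whether the inode data is already cached, so any speedup would need to be measured on the system in question.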
Re: proposal to speed rsync with lots of files
Peter Salameh wrote:
> My proposal is to first send a checksum of the file list for each directory. If it is found to be identical to the same checksum on the remote side, then the list need not be sent for that directory!
> ...
> It might even be possible to use the rsync checksum algorithm on the directory lists themselves to determine which portion of the directory lists to send, in the case of directories which are nearly identical.

Yes, these are both sensible improvements to the algorithm.

Using the rsync algorithm on the _whole_ file list (not just per directory) would likely improve it in many cases. However, unlike with files, you don't know the size in advance, so you'd have to pick a block size. You might have to change the algorithm a little, because you'd want to buffer the checksummed list for transmitting blocks which don't match, and you'd want to limit the buffer size.

When transmitting the list, it needs to be limited to just the attributes being compared, and just the files which pass the filter, of course.

If only a few files have changed in a very large list, then you get into the interesting problem of a few small changes in a very large stream. There are tweaks to the rsync algorithm which handle that better than rsync does. One of them, recursive rsync, is mentioned in a paper on the rsync web page.

I've been working on a variation of the rsync algorithm to delta-transfer arbitrary tree-like data structures at all scales. That might be a better fit to this problem, but it might be a radical addition to graft into rsync, and unnecessary for the problem space.

> I would appreciate hearing from rsync developers if this is feasible with the current implementation and if they think it would help.

I have no idea if it's feasible in the current implementation without a lot of work. It's definitely feasible if you're willing to put the work in :-)

-- Jamie
Re: proposal to speed rsync with lots of files
On 5-Mar-2009, at 16:27, Peter Salameh wrote:
> One of the speed-limiting issues with rsync is having to send huge file lists when mirroring large file systems, even for incremental updates where only a small part of the file system might have changed.
>
> My proposal is to first send a checksum of the file list for each directory. If it is found to be identical to the same checksum on the remote side, then the list need not be sent for that directory! That would reduce the size of the file list greatly when there are directories containing many files which do not change from one rsync to the next.

How long does it take to calculate the checksum on the remote directory? Longer than sending the file list? Or are you suggesting the checksums be stored on the source and target somehow?

--
Bishops move diagonally. That's why they often turn up where the kings don't expect them to be.