Re: rsync very slow with large include/exclude file list
I investigated the rsync code and found the reason. For every file in the source, rsync searches the entire filter list to see whether that filename is on the exclude/include list. Most aren't, so it compares (350K - 72K) * 72K names (the non-listed files) plus (72K * 72K/2) names (the ones that are listed), for a total of about 22,608,000,000 strcmp's. That's 22 BILLION comparisons. (I may have left off a zero there; it might be 220 B.)

I'm working on a fix. The first phase was just to improve the existing code without changing the methodology. My test set is local-to-local on one machine, dry-run, with 216K files in the source directory and 25,000 files in the exclude-from list. The original rsync takes 488 seconds. The improved code takes 300 seconds.

The next phase was to improve the algorithm for handling large filter_lists: change the unsorted linear search to a sorted binary search (skiplist). This improved code takes 2 seconds.

The original code does 4,492,304,682 strcmp's. The fully improved code does 6,472,564, which is about 99.9% fewer.

I am cleaning up the code and will submit a patchfile soon.

--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
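For anyone who wants to check the comparison counts discussed in this message, here is a minimal Python sketch. It is not rsync's actual C code. It assumes the filter list holds only literal filenames (wildcard patterns cannot be found by sorting), all function names are hypothetical, and a sorted array with binary search stands in for the skiplist mentioned in the message.

```python
def estimate_linear_comparisons(total_files, listed_files):
    """Estimate string comparisons for an unsorted linear scan of the list."""
    unlisted = total_files - listed_files
    misses = unlisted * listed_files         # a miss scans the whole list
    hits = listed_files * listed_files // 2  # a hit stops halfway on average
    return misses + hits


def linear_lookup(name, entries):
    """Return (found, comparisons) for an unsorted linear scan."""
    comparisons = 0
    for entry in entries:
        comparisons += 1
        if entry == name:
            return True, comparisons
    return False, comparisons


def binary_lookup(name, sorted_entries):
    """Return (found, comparisons) for a binary search over a sorted list."""
    lo, hi = 0, len(sorted_entries)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_entries[mid] < name:
            lo = mid + 1
        else:
            hi = mid
    found = False
    if lo < len(sorted_entries):
        comparisons += 1  # final equality check at the insertion point
        found = sorted_entries[lo] == name
    return found, comparisons


def count_totals(source_names, listed_names):
    """Total comparisons when every source name is checked against the list."""
    unsorted_list = list(listed_names)
    sorted_list = sorted(listed_names)
    linear_total = sum(linear_lookup(n, unsorted_list)[1] for n in source_names)
    binary_total = sum(binary_lookup(n, sorted_list)[1] for n in source_names)
    return linear_total, binary_total


if __name__ == "__main__":
    # Figures from the original report: 350K files, 72K listed names.
    print(f"{estimate_linear_comparisons(350_000, 72_000):,}")  # 22,608,000,000

    # Small synthetic data set (hypothetical names), one in five files listed.
    source = [f"sensor{i:05d}.dat" for i in range(5000)]
    listed = source[::5]
    linear_total, binary_total = count_totals(source, listed)
    print(f"linear: {linear_total:,}  binary: {binary_total:,}")
```

The linear total grows with (files * list size), while the sorted lookup grows with (files * log2(list size)), which is the kind of gap the strcmp counts in this message show.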
Re: rsync very slow with large include/exclude file list
This is similar to using fuzzy / -y in a large directory: O(n^2) behaviour occurs and can be incredibly slow. It would seem that no caching of md5's for the directory occurs (or even if it does, there are still O(n^2) comparisons).

/kc

On Mon, Jun 15, 2015 at 06:02:14PM -0500, ray vantassle said:
> I investigated the rsync code and found the reason why. For every file in the source, it searches the entire filter-list looking to see if that filename is on the exclude/include list.
> [...]

--
Ken Chase - k...@heavycomputing.ca skype:kenchase23 +1 416 897 6284 Toronto Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.
rsync very slow with large include/exclude file list
I have a sensor collector system (a very low-powered, slow ARM CPU) and another system which pulls the data files from it daily for processing. There are about 1000 new files each day. As part of the processing, the second system decides that certain files are of no interest and adds them to an exclude file, which is used in future rsyncs. No files are ever deleted from the source system, only from the receiver system. This is all done in cron jobs at night.

After about 18 months, there are about 350,000 files and the exclude list has about 72,000 filenames. I recently ran the pulling script manually and thought the system must have died. Rsync took almost 3 hours.

Trying to narrow down the problem, I removed the --exclude-from= option from the rsync command, and it took only 16 minutes, including the time to transfer the 72,000 files that are of no interest.

(cont'd)