Re: rsync very slow with large include/exclude file list

2015-06-15 Thread ray vantassle
I investigated the rsync code and found the reason why.
For every file in the source, it searches the entire filter-list looking to
see if that filename is on the exclude/include list.  Most aren't, so it
compares (350K - 72K) * 72K names (the non-listed files) plus (72K * 72K/2)
names (the ones that are listed), for a total of about  22,608,000,000
strcmp's.  That's 22 BILLION comparisons. (I may have left off a zero
there, it might be 220 B).

I'm working on a fix to improve this.  The first phase was to just improve
the existing code without changing the methodology.
The set I've been testing with is local-local machine, dry-run, 216K files
in the source directory, 25,000 files in the exclude-from list.
The original rsync takes 488 seconds.
The improved code takes 300 seconds.

The next phase was to improve the algorithm of handling large
filter_lists.  Change the unsorted linear search to a sorted binary search
(skiplist).
This improved code takes 2 seconds.

The original code does 4,492,304,682 strcmp's.
The fully improved code does 6,472,564.  98.5% fewer.

I am cleaning up the code and will submit a patchfile soon.
-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: rsync very slow with large include/exclude file list

2015-06-15 Thread Ken Chase
This is similar to using fuzzy / -y in a large directory. O(n^2) behaviour
occurs and can be incredibly slow. No caching of md5's for the directory
occurs, it would seem (or even so, there are O(N^2) comparisons).

/kc


On Mon, Jun 15, 2015 at 06:02:14PM -0500, ray vantassle said:
 I investigated the rsync code and found the reason why.
 For every file in the source, it searches the entire filter-list looking
 to see if that filename is on the exclude/include list.** Most aren't, so
 it compares (350K - 72K) * 72K names (the non-listed files) plus (72K *
 72K/2) names (the ones that are listed), for a total of about**
 22,608,000,000 strcmp's.** That's 22 BILLION comparisons. (I may have left
 off a zero there, it might be 220 B).
  
 I'm working on a fix to improve this.** The first phase was to just
 improve the existing code without changing the methodology.
 The set I've been testing with is local-local machine, dry-run, 216K files
 in the source directory, 25,000 files in the exclude-from list.
 The original rsync takes 488 seconds.
 The improved code takes 300 seconds.
  
 The next phase was to improve the algorithm of handling large
 filter_lists.** Change the unsorted linear search to a sorted binary
 search (skiplist).
 This improved code takes 2 seconds.
  
 The original code does 4,492,304,682 strcmp's.
 The fully improved code does 6,472,564.** 98.5% fewer.
  
 I am cleaning up the code and will submit a patchfile soon.

  -- 
  Please use reply-all for most replies to avoid omitting the mailing list.
  To unsubscribe or change options: 
https://lists.samba.org/mailman/listinfo/rsync
  Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


-- 
Ken Chase - k...@heavycomputing.ca skype:kenchase23 +1 416 897 6284 Toronto 
Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front 
St. W.
-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html