Re: rsync very slow with large include/exclude file list

2015-06-15 Thread ray vantassle
I investigated the rsync code and found the reason why.
For every file in the source, it searches the entire filter-list looking to
see if that filename is on the exclude/include list.  Most aren't, so it
compares (350K - 72K) * 72K names (the non-listed files) plus (72K * 72K/2)
names (the ones that are listed), for a total of about 22,608,000,000
strcmp's.  That's more than 22 BILLION comparisons.
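
As a rough check of that estimate, the arithmetic can be written out. This is only a back-of-envelope sketch: it assumes a miss scans the whole list and a hit is found, on average, halfway through. The variable names are made up for illustration.

```python
# Estimated strcmp calls for an unsorted, linearly searched filter list.
source_files = 350_000   # files on the source system
listed = 72_000          # names in the exclude list

# A file that is NOT on the list is compared against every entry.
unlisted_cost = (source_files - listed) * listed

# A file that IS on the list is assumed to be found halfway through.
listed_cost = listed * listed // 2

total = unlisted_cost + listed_cost
print(f"{total:,}")
```

The total comes out in the tens of billions, not hundreds of billions.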

I'm working on a fix to improve this.  The first phase was to just improve
the existing code without changing the methodology.
The set I've been testing with is local-local machine, dry-run, 216K files
in the source directory, 25,000 files in the exclude-from list.
The original rsync takes 488 seconds.
The improved code takes 300 seconds.

The next phase was to improve the algorithm for handling large
filter_lists: change the unsorted linear search to a search over a sorted
structure (a skiplist), which behaves like a binary search.
This improved code takes 2 seconds.
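
To illustrate why the sorted lookup is so much cheaper, here is a minimal sketch that counts comparisons for both approaches. It is not the rsync patch: it uses a sorted Python list with a hand-written binary search as a stand-in for the skiplist, the file names are hypothetical, and it assumes every list entry is a literal filename. Real rsync filter rules can contain wildcards and are applied first-match-wins, so a real fix would need more care than this.

```python
def linear_lookup(names, target):
    """Scan an unsorted list; return (found, comparisons)."""
    comparisons = 0
    for name in names:
        comparisons += 1
        if name == target:
            return True, comparisons
    return False, comparisons


def binary_lookup(sorted_names, target):
    """Binary-search a sorted list; return (found, comparisons)."""
    comparisons = 0
    lo, hi = 0, len(sorted_names)
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_names[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sorted_names):
        comparisons += 1  # final equality check at the insertion point
        return sorted_names[lo] == target, comparisons
    return False, comparisons


# Hypothetical exclude list of 25,000 names (even-numbered files only).
excluded = [f"sensor-{n:06d}.dat" for n in range(0, 50_000, 2)]
sorted_excluded = sorted(excluded)

# A file that is not on the list is the worst case for the linear scan.
miss = "sensor-000001.dat"
print(linear_lookup(excluded, miss))
print(binary_lookup(sorted_excluded, miss))
```

With 25,000 entries a miss costs 25,000 comparisons in the linear scan and no more than 16 in the binary search, which is the kind of gap the timings above reflect.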

The original code does 4,492,304,682 strcmp's.
The fully improved code does 6,472,564.  About 99.86% fewer.
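
The size of the reduction can be worked out directly from the two counts and the two timings quoted above. A small sketch, using only those figures (the variable names are made up):

```python
original = 4_492_304_682   # strcmp calls, original code
improved = 6_472_564       # strcmp calls, fully improved code

reduction = 1 - improved / original   # fraction of comparisons eliminated
ratio = original / improved           # how many times fewer comparisons

speedup = 488 / 2                     # wall-clock seconds, original vs. improved

print(f"{reduction:.2%} fewer comparisons, about {ratio:.0f}x")
print(f"about {speedup:.0f}x faster")
```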

I am cleaning up the code and will submit a patchfile soon.
-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: rsync very slow with large include/exclude file list

2015-06-15 Thread Ken Chase
This is similar to using --fuzzy (-y) in a large directory: O(n^2) behaviour
occurs and can be incredibly slow. It would seem that no caching of md5's
for the directory occurs (or even if it does, there are still O(n^2)
comparisons).

/kc


On Mon, Jun 15, 2015 at 06:02:14PM -0500, ray vantassle said:
 I investigated the rsync code and found the reason why.
 For every file in the source, it searches the entire filter-list looking
 to see if that filename is on the exclude/include list.  Most aren't, so
 it compares (350K - 72K) * 72K names (the non-listed files) plus (72K *
 72K/2) names (the ones that are listed), for a total of about
 22,608,000,000 strcmp's.  That's 22 BILLION comparisons. (I may have left
 off a zero there, it might be 220 B).
  
 I'm working on a fix to improve this.  The first phase was to just
 improve the existing code without changing the methodology.
 The set I've been testing with is local-local machine, dry-run, 216K files
 in the source directory, 25,000 files in the exclude-from list.
 The original rsync takes 488 seconds.
 The improved code takes 300 seconds.
  
 The next phase was to improve the algorithm of handling large
 filter_lists.  Change the unsorted linear search to a sorted binary
 search (skiplist).
 This improved code takes 2 seconds.
  
 The original code does 4,492,304,682 strcmp's.
 The fully improved code does 6,472,564.  98.5% fewer.
  
 I am cleaning up the code and will submit a patchfile soon.


-- 
Ken Chase - k...@heavycomputing.ca skype:kenchase23 +1 416 897 6284 Toronto Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.


rsync very slow with large include/exclude file list

2015-06-15 Thread ray vantassle
I have a sensor collector system (very low-powered slow ARM cpu), and
another system which daily pulls the data files from it for processing.
There are about 1000 new files each day.  As part of the processing it
decides that certain of the files are of no interest, and adds them to an
exclude file, which is used in future rsyncs.  No files are ever deleted
from the source system, just the receiver system.  All done in cron jobs at
night.

After about 18 months, there are about 350,000 files and the exclude list
has about 72,000 filenames.  I recently ran the pulling script manually and
thought the system must have died.
Rsync took almost 3 hours.  Trying to narrow down the problem, I removed
the --exclude-from= option from the rsync command -- and it took only 16
minutes -- including the time to transfer the 72,000 files that are of no
interest.
 (cont'd)