Re: Nice little performance improvement

2009-10-20 Thread Matt McCutchen
On Sat, 2009-10-17 at 12:13 -0700, Mike Connell wrote:
> > Interesting.  If you're not using incremental recursion (the default in
> > rsync >= 3.0.0), I can see that the du would help by forcing the
> > destination I/O to overlap the file-list building in time.  But with
> > incremental recursion, the du shouldn't be necessary because rsync
> > actually overlaps the checking of destination files with the file-list
> > building on the source.
>
> Ignoring incremental recursion for a moment.

Don't ignore it, it makes a difference.

> It seems to me that anything
> that can warm up the file cache before it is needed would be beneficial?

I didn't reason it out carefully enough; let's try again...

Warming up the destination file cache decreases the amount of time the
generator spends blocked on I/O.  So the answer is yes, provided that
the generator is the bottleneck.

If incremental recursion is not used, that's almost certainly the case
during the main phase of the rsync run, since the generator is checking
all the destination files but the sender is only processing the small
number of source files that need a transfer.  But with incremental
recursion, the sender and generator are checking files in parallel, so
the sender may be the bottleneck depending on the relative speeds or
disk configurations of the machines.  (I take it that your rsync run is
local.  For remote runs, the network could be the bottleneck.)

-- 
Matt

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: Nice little performance improvement

2009-10-17 Thread Mike Connell


Hi,


Interesting.  If you're not using incremental recursion (the default in
rsync >= 3.0.0), I can see that the du would help by forcing the
destination I/O to overlap the file-list building in time.  But with
incremental recursion, the du shouldn't be necessary because rsync
actually overlaps the checking of destination files with the file-list
building on the source.


Ignoring incremental recursion for a moment. It seems to me that anything
that can warm up the file cache before it is needed would be beneficial?


Re: Nice little performance improvement

2009-10-17 Thread Darryl Dixon - Winterhouse Consulting
> Hi,
>
>> In order to expeditiously move these new files offsite, we use a modified
>> version of pyinotify to log all added/altered files across the entire
>> filesystem(s) and then every five minutes feed the list to rsync with the
>> --files-from option. This works very effectively and quickly.
>
> Interesting...
>
> How do you tell rsync to delete files that were deleted from the source,
> or is that not part of your use case?

For us, that is not a necessary part of our use-case. It would certainly
however be possible to capture the delete events and remove the files with
some other helper script, rather than use rsync directly (rsync doesn't
give any advantage in that scenario except to be able to re-use the
existing network transport mechanism).
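
A helper of the kind described above might look roughly like the following. This is only a sketch under assumptions: it runs on the destination host, it is handed paths relative to the destination root, and the name remove_deleted is made up for illustration. It is not the script used at our site.

```python
import os


def remove_deleted(dest_root, deleted_paths):
    """Remove files under dest_root whose relative paths were logged as deleted.

    Returns the list of relative paths that were actually removed.
    (Hypothetical helper; the function name and arguments are illustrative.)
    """
    root = os.path.realpath(dest_root)
    removed = []
    for rel in deleted_paths:
        # Normalise without resolving symlinks, so a logged symlink is
        # removed itself rather than the file it points to.
        target = os.path.normpath(os.path.join(root, rel))
        # Refuse anything that would escape the destination tree.
        if target == root or os.path.commonpath([root, target]) != root:
            continue
        # Only plain files and symlinks are handled; directories are left alone.
        if os.path.islink(target) or os.path.isfile(target):
            os.remove(target)
            removed.append(rel)
    return removed
```

Paths that are already gone on the destination are skipped silently, so the same delete list can be replayed safely.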

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com


Re: Nice little performance improvement

2009-10-17 Thread Mike Connell

No, not if the file cache isn't large enough for the number of files.
E.g. if you have 20 million files and only 256MB RAM, it's likely a bad 
idea.



Splitting down to the subsub (2-levels down) directory level allows a single
subsub rsync to fit for me. Warming the cache is beneficial here; I didn't
say it was in every situation.




Re: Nice little performance improvement

2009-10-17 Thread Jamie Lokier
Mike Connell wrote:
 
> Hi,
>
> Interesting.  If you're not using incremental recursion (the default in
> rsync >= 3.0.0), I can see that the du would help by forcing the
> destination I/O to overlap the file-list building in time.  But with
> incremental recursion, the du shouldn't be necessary because rsync
> actually overlaps the checking of destination files with the file-list
> building on the source.
>
> Ignoring incremental recursion for a moment. It seems to me that anything
> that can warm up the file cache before it is needed would be beneficial?

No, not if the file cache isn't large enough for the number of files.
E.g. if you have 20 million files and only 256MB RAM, it's likely a bad idea.

Personally I use a program that I wrote about 11 years ago, called
treescan, which pulls in the inodes to cache about twice as fast as
du by using inode number sorting.
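
For readers who want to try the idea, inode-number sorting can be sketched as below. This is not treescan; it is a guess at the general technique, assuming a POSIX system where the directory entry already carries the inode number, and the name warm_tree is invented for the example. It sorts only within each directory and makes no claim about reproducing the speedup.

```python
import os


def warm_tree(root):
    """Stat every entry under root, visiting each directory's entries in
    inode-number order, and return how many entries were stat'ed.

    (Illustrative sketch only; not the treescan program.)
    """
    count = 0
    stack = [root]
    while stack:
        directory = stack.pop()
        # Read the whole directory first, then sort by inode number so the
        # following lstat calls touch the inode table in ascending order.
        with os.scandir(directory) as it:
            entries = sorted(it, key=lambda e: e.inode())
        for entry in entries:
            os.lstat(entry.path)  # pulls the inode into the cache
            count += 1
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
    return count
```

Calling warm_tree on a destination directory before the rsync run plays the same role as the du -s discussed in this thread.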

-- Jamie


Nice little performance improvement

2009-10-15 Thread Mike Connell
Hi,

In my situation I'm using rsync to back up a server with (currently) about
570,000 files. These are all little files, and maybe 0.1% of them change or
are newly added in any 15-minute period.

I've split the main tree up so rsync can run on subsub directories of the
main tree. It does each of these subsub directories sequentially. I would
have liked to run some of these in parallel, but that seems to increase I/O
on the main server too much.


Today I tried the following:

For all subsub directories
a) Fork a du -s subsubdirectory on the destination subsubdirectory
b) Run rsync on the subsubdirectory
c) repeat until done

Seems to have improved the time it takes by about 25-30%. It looks like the
du can run ahead of the rsync, so that while rsync is building its file list,
the du is warming up the file cache on the destination. Then when rsync looks
to see what it needs to do on the destination, it can do this more
efficiently.
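
In case it helps anyone reproduce this, the loop above can be sketched roughly as follows. It assumes a local run (source and destination both reachable as paths on one machine); the function name sync_with_warmup and the rsync option -a are placeholders for illustration, not the exact command line used here. The run and spawn parameters exist only so the process calls can be swapped out.

```python
import subprocess
from pathlib import Path


def sync_with_warmup(src_root, dest_root, subdirs,
                     run=subprocess.run, spawn=subprocess.Popen):
    """For each subsub directory: fork du -s on the destination copy, then
    run rsync for that directory while du warms the file cache.

    (Sketch under assumptions: local paths, rsync -a as a stand-in option.)
    """
    for sub in subdirs:
        src = Path(src_root) / sub
        dest = Path(dest_root) / sub
        # (a) Fork du on the destination; its output is discarded, only the
        # side effect of reading the inodes matters.
        warm = spawn(["du", "-s", str(dest)],
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        try:
            # (b) Run rsync on the same directory in the foreground.
            run(["rsync", "-a", f"{src}/", f"{dest}/"], check=True)
        finally:
            # Reap the du process before (c) moving on to the next directory.
            warm.wait()
```

The directories are still handled one at a time, so only a single du competes with rsync for destination I/O at any moment.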

Looks like a keeper so far. Any other suggestions? (was thinking of a previous
suggestion of setting /proc/sys/vm/vfs_cache_pressure to a low value).

Thanks,

Mike

Re: Nice little performance improvement

2009-10-15 Thread Darryl Dixon - Winterhouse Consulting
> Hi,
>
> In my situation I'm using rsync to backup a server with (currently) about
> 570,000 files.
> These are all little files and maybe .1% of them change or new ones are
> added in any 15 minute period.


Hi Mike,

We have three filesystems that between them have approx 22 million files,
and around 10-20,000 new or changed files every business day.

In order to expeditiously move these new files offsite, we use a modified
version of pyinotify to log all added/altered files across the entire
filesystem(s) and then every five minutes feed the list to rsync with the
--files-from option. This works very effectively and quickly.
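
The rsync half of that arrangement can be sketched as below. The inotify logging is left out; the sketch assumes some other component has already collected the changed paths relative to the source root, and the names sync_changed and build_rsync_command, plus the -a option, are illustrative choices rather than our production script.

```python
import os
import subprocess
import tempfile


def build_rsync_command(list_path, src_root, dest):
    """Build an rsync command that transfers only the paths listed in list_path."""
    return ["rsync", "-a", f"--files-from={list_path}", src_root, dest]


def sync_changed(paths, src_root, dest, run=subprocess.run):
    """Write the changed paths to a temporary list file and hand it to rsync.

    paths must be relative to src_root. Returns the command that was run,
    or None if there was nothing to transfer. (Hypothetical helper.)
    """
    unique = sorted(set(paths))  # a file changed twice only needs one entry
    if not unique:
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".list", delete=False) as fh:
        fh.write("\n".join(unique) + "\n")
        list_path = fh.name
    try:
        cmd = build_rsync_command(list_path, src_root, dest)
        run(cmd, check=True)
        return cmd
    finally:
        os.unlink(list_path)
```

Because rsync is told exactly which files to look at, it does not have to walk the whole tree on either side.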

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com


Re: Nice little performance improvement

2009-10-15 Thread Matt McCutchen
On Thu, 2009-10-15 at 19:07 -0700, Mike Connell wrote:
> Today I tried the following:
>
> For all subsub directories
> a) Fork a du -s subsubdirectory on the destination subsubdirectory
> b) Run rsync on the subsubdirectory
> c) repeat until done
>
> Seems to have improved the time it takes by about 25-30%. It looks
> like the du can run ahead of the rsync...so that while rsync is
> building its file list, the du is warming up the file cache on the
> destination. Then when rsync looks to see what it needs to do on the
> destination, it can do this more efficiently.

Interesting.  If you're not using incremental recursion (the default in
rsync >= 3.0.0), I can see that the du would help by forcing the
destination I/O to overlap the file-list building in time.  But with
incremental recursion, the du shouldn't be necessary because rsync
actually overlaps the checking of destination files with the file-list
building on the source.

-- 
Matt



Re: Nice little performance improvement

2009-10-15 Thread Mike Connell

Hi,


In order to expeditiously move these new files offsite, we use a modified
version of pyinotify to log all added/altered files across the entire
filesystem(s) and then every five minutes feed the list to rsync with the
--files-from option. This works very effectively and quickly.


Interesting...

How do you tell rsync to delete files that were deleted from the source, 
or is that not part of your use case?


Thanks,

Mike