Re: proposal to speed rsync with lots of files

2009-03-11 Thread Mag Gam
Using inotify with rsync is a great idea.

If one has a job that runs daily to pick up differences on a very large
filesystem with very small files, then one can do this (assuming the
initial copy is already completed):
watch the source filesystem (or tree) with inotify
write all the notifications to a text file (absolute paths)
run rsync with the paths from the text file and place them in the
  destination repository
run a full rsync again to be 100% sure.
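
The steps above could be scripted along these lines. This is only a
sketch under stated assumptions: the inotify watcher is assumed to have
already written the absolute paths somewhere, and the helper name
build_rsync_command is hypothetical. It relies on rsync's --files-from
option, which expects paths relative to the source directory.

```python
import os


def build_rsync_command(changed_paths, source_root, dest, list_file):
    """Write changed paths (relative to source_root) to list_file and
    return an rsync argument list that transfers only those paths."""
    source_root = os.path.abspath(source_root)
    relative = set()
    for path in changed_paths:
        path = path.rstrip("\n")
        if not path:
            continue  # skip blank lines in the notice file
        path = os.path.abspath(path)
        # Ignore notices for paths outside the watched tree.
        if os.path.commonpath([source_root, path]) != source_root:
            continue
        rel = os.path.relpath(path, source_root)
        if rel != ".":
            relative.add(rel)  # the set removes duplicate notices
    with open(list_file, "w") as handle:
        for rel in sorted(relative):
            handle.write(rel + "\n")
    # --files-from reads paths relative to the source directory argument.
    return ["rsync", "-a", "--files-from=" + list_file,
            os.path.join(source_root, ""), dest]
```

The returned list could be handed to subprocess.run. Paths that were
deleted after the notice was recorded are not handled here; the final
full rsync run is what would catch those.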

I like this idea.




On Fri, Mar 6, 2009 at 11:58 AM, Wayne Davison way...@samba.org wrote:
 On Thu, Mar 05, 2009 at 03:27:50PM -0800, Peter Salameh wrote:
 My proposal is to first send a checksum of the file list for each
 directory.  If it is found to be identical to the same checksum on the
 remote side then the list need not be sent for that directory!

 My rZync source does something like that for directories:  it treats a
 directory-list transfer like a file transfer.  That means that the
 receiving side sends a set of checksums to the sending side telling it
 what its version of the directory looks like, and then the sender sends
 a normal set of delta data that lets the receiver reconstruct the
 sender's version of the directory (which it compares to its own).  One
 potential drawback is having to deal with false checksum-matches (which
 should be rare, but would require the dir data to be resent).  I hadn't
 optimized it for block size or (possibly) data order to make it more
 efficient, but it is an interesting idea for speeding up a slow
 connection.  I'm not sure if it would really help out that much for a
 more modern, faster connection, because rsync sends the file-list data
 at the same time as it is being scanned, and sometimes the scan is the
 bottleneck.

 The best way to optimize sending of really large numbers of files that
 are mostly the same is to start to leverage a file-change notification
 system, such as inotify.  Using that, it is possible to distill a list
 of what files/directories need to be copied, and to just copy what is
 needed.

 ..wayne..

-- 
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: proposal to speed rsync with lots of files

2009-03-09 Thread Fabian Cenedese
At 07:58 06.03.2009 -0800, Wayne Davison wrote:
On Thu, Mar 05, 2009 at 03:27:50PM -0800, Peter Salameh wrote:
 My proposal is to first send a checksum of the file list for each
 directory.  If it is found to be identical to the same checksum on the
 remote side then the list need not be sent for that directory!

My rZync source does something like that for directories:  it treats a
directory-list transfer like a file transfer.  That means that the
receiving side sends a set of checksums to the sending side telling it
what its version of the directory looks like, and then the sender sends
a normal set of delta data that lets the receiver reconstruct the
sender's version of the directory (which it compares to its own).  One
potential drawback is having to deal with false checksum-matches (which
should be rare, but would require the dir data to be resent).  I hadn't
optimized it for block size or (possibly) data order to make it more
efficient, but it is an interesting idea for speeding up a slow
connection.  I'm not sure if it would really help out that much for a
more modern, faster connection, because rsync sends the file-list data
at the same time as it is being scanned, and sometimes the scan is the
bottleneck.

To find out whether the scanning or the transferring is the bottleneck,
would it be possible for the statistics to include a hint, such as which
threads had to wait longer or which action took more time? Something that
would indicate that e.g. enabling/disabling compression might give
a faster overall transfer. I don't know if this internal data can be collected
or if trial and error is the only way to do it.

Thanks

bye  Fabi



Re: proposal to speed rsync with lots of files

2009-03-06 Thread Wayne Davison
On Thu, Mar 05, 2009 at 03:27:50PM -0800, Peter Salameh wrote:
 My proposal is to first send a checksum of the file list for each
 directory.  If it is found to be identical to the same checksum on the
 remote side then the list need not be sent for that directory!

My rZync source does something like that for directories:  it treats a
directory-list transfer like a file transfer.  That means that the
receiving side sends a set of checksums to the sending side telling it
what its version of the directory looks like, and then the sender sends
a normal set of delta data that lets the receiver reconstruct the
sender's version of the directory (which it compares to its own).  One
potential drawback is having to deal with false checksum-matches (which
should be rare, but would require the dir data to be resent).  I hadn't
optimized it for block size or (possibly) data order to make it more
efficient, but it is an interesting idea for speeding up a slow
connection.  I'm not sure if it would really help out that much for a
more modern, faster connection, because rsync sends the file-list data
at the same time as it is being scanned, and sometimes the scan is the
bottleneck.

The best way to optimize sending of really large numbers of files that
are mostly the same is to start to leverage a file-change notification
system, such as inotify.  Using that, it is possible to distill a list
of what files/directories need to be copied, and to just copy what is
needed.
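
One way to picture "distilling" a list from change notifications is
sketched below. It is not taken from rsync or inotify; the function
name distill_changes and the (path, is_dir) event shape are
assumptions made for illustration. An event with is_dir set is assumed
to mean that the whole directory should be copied recursively, so
anything beneath it is redundant.

```python
import os


def distill_changes(events):
    """Reduce (path, is_dir) change events to a minimal list of paths.

    Duplicate events are merged, and any path that lies inside a changed
    directory is dropped, since copying that directory recursively
    already covers it.
    """
    changed = set()
    changed_dirs = set()
    for path, is_dir in events:
        path = os.path.normpath(path)
        changed.add(path)
        if is_dir:
            changed_dirs.add(path)

    def covered(path):
        # Walk up the ancestors until the root (or an empty name) is hit.
        parent = os.path.dirname(path)
        while parent and parent != path:
            if parent in changed_dirs:
                return True
            path, parent = parent, os.path.dirname(parent)
        return False

    return sorted(p for p in changed if not covered(p))
```

The result is the kind of list that could then be handed to rsync, so
that only the listed files and directories are examined.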

..wayne..


proposal to speed rsync with lots of files

2009-03-05 Thread Peter Salameh

Hello,

I have followed the discussion of speeding up rsync when there are lots 
of files, and I have a proposal which I think would greatly speed rsync 
when doing routine mirroring of large filesystems.


One of the speed-limiting issues with rsync is having to send huge file 
lists when mirroring large file systems, even for incremental updates 
where only a small part of the file system might have changed.  My 
proposal is to first send a checksum of the file list for each 
directory.  If it is found to be identical to the same checksum on the 
remote side then the list need not be sent for that directory!  That 
would reduce the size of the file list greatly when there are 
directories containing many files which do not change from one rsync run 
to the next.


Here's an example:

 remote   local
 dir1     dir1   -  file list checksum same as on remote       -  don't send file list for dir1
 dir2     dir2   -  file list checksum same as on remote       -  don't send file list for dir2
 dir3     dir3   -  file list checksum different from remote   -  send file list for dir3
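
The comparison in the example above might look roughly like this. It is
a sketch only, not how rsync works today: the function names are
hypothetical, SHA-256 stands in for whatever checksum would really be
used, and name, size and modification time are assumed to be the
attributes that make up a directory's file list.

```python
import hashlib
import os


def dir_list_checksum(directory):
    """Checksum the file list of one directory (not the file contents)."""
    records = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Subdirectories get their own checksum; record the name only.
                records.append(entry.name + "/\n")
            else:
                info = entry.stat(follow_symlinks=False)
                records.append("%s\0%d\0%d\n" % (
                    entry.name, info.st_size, info.st_mtime_ns))
    digest = hashlib.sha256()
    for record in sorted(records):  # sorted so listing order does not matter
        digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()


def dirs_needing_file_list(local_root, remote_checksums):
    """Return the relative directories whose checksum differs from the
    one reported by the remote side (a dict of relative dir -> checksum)."""
    to_send = []
    for dirpath, dirnames, _ in os.walk(local_root):
        dirnames.sort()
        rel = os.path.relpath(dirpath, local_root)
        if remote_checksums.get(rel) != dir_list_checksum(dirpath):
            to_send.append(rel)
    return to_send
```

Note that both sides still have to stat every file to compute the
checksums, so a scheme like this would save bandwidth rather than
scanning time.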


It might even be possible to use the rsync checksum algorithm on the 
directory lists themselves to determine which portion of the directory 
lists to send, in the case of directories which are nearly identical.


I would appreciate hearing from rsync developers if this is feasible with 
the current implementation and if they think it would help.


Thanks,

Peter Salameh




Re: proposal to speed rsync with lots of files

2009-03-05 Thread Kyle Lanclos
Peter Salameh wrote:
 One of the speed-limiting issues with rsync is having to send huge file 
 lists when mirroring large file systems, even for incremental updates 
 where only a small part of the file system might have changed.

Personally, I find that the sending of the file list, whether incremental
or otherwise, takes orders of magnitude less time than the construction of
the file list in the first place. The act of stat'ing millions of files
takes an enormous amount of time in comparison to just about anything else,
assuming that you are not on a low-bandwidth link.
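
A rough way to measure that stat cost on a given tree is sketched below.
The helper name time_scan is hypothetical, and the number it reports
depends heavily on whether the inode data is already cached in memory,
so a second run right after the first is usually not comparable.

```python
import os
import time


def time_scan(root):
    """lstat every entry under root; return (entry_count, seconds)."""
    start = time.perf_counter()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            count += 1
    return count, time.perf_counter() - start
```

Comparing this figure with the wall-clock time of a full rsync run over
the same tree gives a first hint of how much of the run is scanning.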

What would be ideal, I think, is for rsync to scan the filesystem while
a transfer is in place; I am completely ignorant of rsync's innards in
this regard, but it seems like a basic producer/consumer algorithm, with
a configurable quantity of file transfer threads, combined with a
configurable quantity of filesystem spider threads, would result in the
most optimal interleaving of disk latency and time required to transfer files.
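
The producer/consumer arrangement described here could be sketched as
follows. This says nothing about how rsync is actually structured; the
function scan_and_transfer and its transfer callback are hypothetical,
and for simplicity there is a single scanning thread rather than a
configurable number of them.

```python
import os
import queue
import threading


def scan_and_transfer(root, transfer, workers=2, max_pending=1000):
    """Scan root in one thread while worker threads call transfer(path).

    Returns a list of (path, exception) pairs for failed transfers.
    """
    # Bounded queue: the scanner blocks if it gets too far ahead.
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel telling a worker to stop
    errors = []

    def scanner():
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                pending.put(os.path.join(dirpath, name))
        for _ in range(workers):
            pending.put(done)

    def worker():
        while True:
            item = pending.get()
            if item is done:
                return
            try:
                transfer(item)
            except Exception as exc:
                errors.append((item, exc))

    threads = [threading.Thread(target=scanner)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

The workers and max_pending parameters are the knobs that would need
tuning; whether more workers help depends on the disk and the link.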

--Kyle


Re: proposal to speed rsync with lots of files

2009-03-05 Thread Peter Salameh

Kyle Lanclos wrote:

Peter Salameh wrote:
  
One of the speed-limiting issues with rsync is having to send huge file 
lists when mirroring large file systems, even for incremental updates 
where only a small part of the file system might have changed.



Personally, I find that the sending of the file list, whether incremental
or otherwise, takes orders of magnitude less time than the construction of
the file list in the first place. The act of stat'ing millions of files
takes an enormous amount of time in comparison to just about anything else,
assuming that you are not on a low-bandwidth link.
  
Many of the locations I mirror do have low-bandwidth links, and I 
imagine that a sizable fraction of rsync users have issues with 
bandwidth at least some of the time.  I have not studied rsync internals 
either, but I imagine using a checksum to avoid sending directory file 
lists unnecessarily would not hurt anything and would help folks with 
lower bandwidth.


Peter


Re: proposal to speed rsync with lots of files

2009-03-05 Thread Jamie Lokier
Kyle Lanclos wrote:
 Peter Salameh wrote:
  One of the speed-limiting issues with rsync is having to send huge file 
  lists when mirroring large file systems, even for incremental updates 
  where only a small part of the file system might have changed.
 
 Personally, I find that the sending of the file list, whether incremental
 or otherwise, takes orders of magnitude less time than the construction of
 the file list in the first place. The act of stat'ing millions of files
 takes an enormous amount of time in comparison to just about anything else,
 assuming that you are not on a low-bandwidth link.

I find both take time, and the dominant one depends on the link, and
whether that stat information is already in RAM.

Usually I only transfer very large file lists over a LAN, though, so
it's more like Kyle's situation where the file stat'ing takes longest.

The only realistic way to eliminate stat time is some kind of
filesystem monitoring and attribute index - similar to the methods
used by the dynamic indexes of local search engine style programs.
(On Linux that means using inotify, and a daemon which runs all the
time.  On Windows there are (perhaps) better methods which can survive
reboots.)

Without that, you can reduce the stat time by scanning the filesystem
in a different way.  I wrote a program many years ago called
treescan which did a recursive directory traversal while sorting
stat calls by the inode number from each directory entry's d_ino.  On many
filesystems, the inode number is approximately related to position on
the disk.  On those where it was, the heuristic sped up whole
filesystem scans by a factor of about 2, and on some directory
structures by a factor of about 100.  It's possible some parallel stat
calls would improve this further on some OSes and kernel versions, by
allowing better head seek optimisation at the kernel level.  But other
OSes or kernel versions would be slowed by it.
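
The inode-ordering heuristic can be illustrated with a short sketch.
This is not the treescan program itself, which is not shown in this
thread; the function name is hypothetical, and sorting within each
directory (rather than across the whole tree) is an assumption.

```python
import os


def scan_sorted_by_inode(root):
    """Recursively stat a tree, issuing the stat calls for each
    directory in inode order. Returns {path: os.stat_result}."""
    results = {}
    stack = [root]
    while stack:
        directory = stack.pop()
        with os.scandir(directory) as entries:
            # On POSIX systems entry.inode() comes from the directory
            # entry (d_ino), so sorting needs no extra stat calls.
            ordered = sorted(entries, key=lambda entry: entry.inode())
        for entry in ordered:
            results[entry.path] = entry.stat(follow_symlinks=False)
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
    return results
```

Whether this ordering helps at all depends on the filesystem, as noted
above; on some it may make no difference.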

 What would be ideal, I think, is for rsync to scan the filesystem while
 a transfer is in place;

I think rsync 3 does this; it's called incremental recursion.

 with a configurable quantity of file transfer threads, combined with
 a configurable quantity of filesystem spider threads, would result
 in the most optimal interleaving of disk latency and time required
 to transfer files.

Be careful when using multiple unsynchronised threads to access a
filesystem.  It sometimes thrashes the disk - seeking back and forth
between different files - resulting in much worse latency than just
doing one file at a time.  That said, it can work out better.  You just
have to be careful how it's done.

-- Jamie


Re: proposal to speed rsync with lots of files

2009-03-05 Thread Jamie Lokier
Peter Salameh wrote:
 My 
 proposal is to first send a checksum of the file list for each 
 directory.  If it is found to be identical to the same checksum on the 
 remote side then the list need not be sent for that directory!
...
 It might even be possible to use the rsync checksum algorithm on the 
 directory lists themselves to determine which portion of the directory 
 lists to send, in the case of directories which are nearly identical.

Yes, these are both sensible improvements to the algorithm.  Using the
rsync algorithm on the _whole_ file list (not just per directory)
would likely improve it in many cases.  However unlike files, you
don't know the size in advance, so you'd have to pick a block size.
You might have to change the algorithm a little because you'd want to
buffer the checksummed list for transmitting blocks which don't match,
and you'd want to limit the buffer size.
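
A much simplified picture of block matching over a file list, with a
block size picked up front, is sketched below. It is not the real rsync
algorithm: there is no rolling weak checksum and no buffer limit, the
blocks are runs of list entries rather than bytes, and all function
names are hypothetical.

```python
import hashlib


def block_hash(entries):
    """Strong hash of a run of file-list entries."""
    return hashlib.sha256("\0".join(entries).encode("utf-8")).hexdigest()


def receiver_signatures(old_list, block_size):
    """Receiver side: hash each fixed-size block of its own list."""
    return {block_hash(old_list[i:i + block_size]): i
            for i in range(0, len(old_list), block_size)}


def sender_delta(new_list, signatures, block_size):
    """Sender side: emit ("copy", offset) for blocks the receiver
    already has and ("literal", entry) for everything else."""
    delta = []
    i = 0
    while i < len(new_list):
        window = new_list[i:i + block_size]
        offset = signatures.get(block_hash(window))
        if offset is not None:
            delta.append(("copy", offset))
            i += len(window)
        else:
            delta.append(("literal", new_list[i]))
            i += 1  # slide the window forward by one entry
    return delta


def receiver_rebuild(old_list, delta, block_size):
    """Receiver side: reconstruct the sender's list from the delta."""
    rebuilt = []
    for op, value in delta:
        if op == "copy":
            rebuilt.extend(old_list[value:value + block_size])
        else:
            rebuilt.append(value)
    return rebuilt
```

With a 100-entry list, a block size of 10 and one entry inserted, the
delta is ten copy operations plus a single literal. Rehashing every
window as done here is slow; the rolling checksum in the real algorithm
exists to avoid exactly that cost.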

When transmitting the list, it needs to be limited to just the
attributes being compared, and just the files which pass the filter,
of course.

If only a few files have changed in a very large list, then you get
into the interesting problem of a few small changes in a very large
stream.  There are tweaks to the rsync algorithm which handle that
better than rsync does.  One of them, recursive rsync, is mentioned in
a paper on the rsync web page.

I've been working on a variation of the rsync algorithm to
delta-transfer arbitrary tree-like data structures at all scales.
That might be a better fit to this problem, but it might be a radical
addition to graft it into rsync and unnecessary for the problem space.

 I would appreciate hearing from rsync developers if this is feasible with 
 the current implementation and if they think it would help.

I have no idea if it's feasible in the current implementation without
a lot of work.  It's definitely feasible if you're willing to put the
work in :-)

-- Jamie


Re: proposal to speed rsync with lots of files

2009-03-05 Thread lewis butler

On 5-Mar-2009, at 16:27, Peter Salameh wrote:
One of the speed-limiting issues with rsync is having to send huge  
file lists when mirroring large file systems, even for incremental  
updates where only a small part of the file system might have  
changed.  My proposal is to first send a checksum of the file list  
for each directory.  If it is found to be identical to the same  
checksum on the remote side then the list need not be sent for that  
directory!  That would reduce the size of the file list greatly when  
there are directories containing many files which do not change from  
one rsync run to the next.


How long does it take to calculate the checksum on the remote  
directory?  Longer than sending the file list?  Or are you suggesting  
that the checksums be stored on the source and target somehow?



--
Bishops move diagonally. That's why they often turn up where the
kings don't expect them to be.
