Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-02 Thread Matt McCutchen

On 7/1/07, Wayne Davison [EMAIL PROTECTED] wrote:

> [...]  It is still useful for allowing a server to cache the
> checksum values without requiring any extra files.  As long as it is
> used on files that aren't being actively updated, it works great.


OK, that's reasonable.


> > Second, it is impossible to make xattr-based checksum caching
> > foolproof against same-second modification.
>
> Not really.


What do you mean?  There's no way to fix the example I gave with
xattrs, whereas...


> The git algorithm only works if nothing modifies the files
> while the checksum operation is running.  So, the algorithm protects
> against bad things for sequential operations, but not parallel
> operations.


...I proposed a small change to the git algorithm that makes it
protect against parallel operations too:

http://marc.info/?l=git&m=118323680215966&w=2


> A paranoid checksummer could notice if the mtime of a file
> was "now"(*) and delay checksumming that file until later in the run.


That would be especially smart.  Git doesn't attempt to save reusable
checksums for files whose mtimes are "now".


> It could also compare the mtime of a file from before and after it was
> read to ensure that it wasn't modified during the read phase (assuming
> that it never starts to read a file with an mtime of "now").


Or it could just use the before mtime in the cache so that, if the
file is modified during reading, the cached checksum would already be
invalid.  I think git does this.
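
A minimal Python sketch of the "store the before mtime" idea, for illustration only.  The function name `checksum_with_stamp`, the choice of SHA-256, and the returned tuple layout are assumptions made here; none of them come from rsync, git, or the patch under discussion.

```python
import hashlib
import os


def checksum_with_stamp(path, chunk_size=1 << 16):
    """Hash a file and return (digest, size, mtime_ns) as seen BEFORE reading.

    Because the stamp is taken before the data is read, a write that lands
    during the read leaves the file with a newer mtime than the stored one,
    so the cached entry fails to match on the next lookup.
    """
    before = os.stat(path)      # stat exactly once, before reading any data
    digest = hashlib.sha256()   # hash choice is an assumption of this sketch
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), before.st_size, before.st_mtime_ns
```

This only helps when the later write produces a distinguishable mtime, which is why it would still need to be combined with the rule of not caching files whose mtime is "now".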


> *Note that "now" for a particular disk may not be the same as time() if
> the disk is remote, so network filesystems can be rather complicated.


That's easy to fix: get your "now" by touching a file on the
filesystem and reading the resulting mtime.
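
A rough Python sketch of that touch-and-stat trick.  The helper name `filesystem_now` and the probe-file prefix are hypothetical, and the sketch assumes the caller has write permission in the chosen directory and that the filesystem reports a fresh mtime for a newly created file.

```python
import os
import tempfile


def filesystem_now(directory):
    """Return the mtime (seconds) the filesystem assigns to a file created right now."""
    # Creating the probe makes the filesystem stamp it with its own clock,
    # which may differ from the local time() on a network filesystem.
    fd, probe = tempfile.mkstemp(prefix=".now-probe-", dir=directory)
    try:
        return os.fstat(fd).st_mtime
    finally:
        os.close(fd)
        os.unlink(probe)   # leave nothing behind
```

A caller would compare each data file's mtime against this value rather than against the local clock.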


> Also, being off by a second might still be "now" if the value of the
> seconds field rolled over during the check.


I don't think this is a problem if you stat the file just once before
reading it.


> The perl script in my patch
> that creates/updates these xattr checksums doesn't try to deal with any
> of these complications.


And that's probably fine for rsync's purposes.  However, I still think
it might be cool if I made a foolproof checksum-caching library and
rsync used it...

Matt
--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-02 Thread Wayne Davison
On Mon, Jul 02, 2007 at 08:43:39AM -0400, Matt McCutchen wrote:
> What do you mean?  There's no way to fix the example I gave with
> xattrs

Not so.  I went on to explain how that is possible in my prior email
(i.e. avoiding caching a checksum on a "now" mtime file is all that is
needed).

> That's easy to fix: get your "now" by touching a file on the
> filesystem and reading the resulting mtime.

Yes, that's one solution that I had already thought of.  You'd also need
to do that for every filesystem in the transfer, so you need to add
filesystem checking and hope that you always have write permissions to
the dirs holding the files (or have a work-around algorithm if you
don't).  As I said, it's complicated (and quite a bit of hassle).

> > Also, being off by a second might still be "now" if the value of the
> > seconds field rolled over during the check.
>
> I don't think this is a problem if you stat the file just once before
> reading it.

It is if you're doing one check to see if a file is being updated (e.g.
stat() followed by time() to compute "now").  If time rolls over between
the two calls, you may have just missed that the mtime would now match
if you did another stat().  Because of this, you can't be sure if you
read the file prior to the last change, or after the last change.

> And that's probably fine for rsync's purposes.  However, I still think
> it might be cool if I made a foolproof checksum-caching library and
> rsync used it...

I don't see any need for that for the xattr version, since rsync isn't
going to update the checksums (just optionally create them on its temp
files).  For the non-xattr version it would be nice to have a better
cache mechanism than the simple per-dir .rsyncsums files I implemented
in my patch: having a library that implemented a checksum lookup/update
by dev+inode using a global checksum cache would be cool, and avoid the
file droppings.  Making it so that different programs could request a
checksum of a particular type concurrently (which the server/library
would return from cache, if possible, or compute and store in the cache,
if safe) would make it generally useful for a variety of programs.  That
would be quite easy for rsync to support, if it existed.

..wayne..


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-02 Thread Wayne Davison
On Mon, Jul 02, 2007 at 10:28:25AM -0700, Wayne Davison wrote:
> It is if you're doing one check to see if a file is being updated (e.g.
> stat() followed by time() to compute "now").  If time rolls over between
> the two calls, you may have just missed that the mtime would now match
> if you did another stat().

I didn't explain that well.  The glitch is not if the mtime would now be
1 second later (in which case the checksum would be ignored) but if it
was possible that the read happened, then a write, then "now" rolled
over to the next second.

Changing my assumed function-calling order to be first time() then
stat() would take care of this, since any roll-over in the stat() mtime
would appear to be in the future, and would be seen as being too recent.
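
The time()-then-stat() ordering could be sketched in Python as follows.  This is only an illustration: the function name `old_enough_to_cache` is made up here, and the sketch assumes the local clock and the filesystem clock agree, which (as noted earlier in the thread) is not guaranteed on network filesystems.

```python
import os
import time


def old_enough_to_cache(path):
    """Return True if the file's mtime is strictly before the current second."""
    now = int(time.time())                # sample the clock first...
    mtime = int(os.stat(path).st_mtime)   # ...then look at the file
    # If the second rolls over between the two calls, a fresh write shows up
    # with an mtime at or after `now` and is rejected as too recent.
    return mtime < now
```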

..wayne..


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-02 Thread Matt McCutchen

On 7/2/07, Wayne Davison [EMAIL PROTECTED] wrote:

> On Mon, Jul 02, 2007 at 08:43:39AM -0400, Matt McCutchen wrote:
> > What do you mean?  There's no way to fix the example I gave with
> > xattrs
>
> Not so.  I went on to explain how that is possible in my prior email
> (i.e. avoiding caching a checksum on a "now" mtime file is all that is
> needed).


I do not see how avoiding caching a checksum on a "now" mtime file
affects my example.  In my example, a file is written at one time (say
5) and gets mtime 5.  At some *later* time (say 8), rsync runs and
caches the checksum, which sets the ctime to 8.  The trouble is that,
right after rsync caches the checksum, someone could quickly modify
the file and then set the mtime back to 5.  If he manages to do this
before time ticks to 9, the ctime will still be 8, and rsync will have
no way of knowing the file was modified.

It still appears to me that this is one case that xattr-based checksum
caching cannot be made to handle correctly.  That doesn't mean anyone
should care; the case is unlikely and setting a file's mtime back
after modifying it is a silly thing to do.


> > That's easy to fix: get your "now" by touching a file on the
> > filesystem and reading the resulting mtime.
>
> Yes, that's one solution that I had already thought of.  You'd also need
> to do that for every filesystem in the transfer, so you need to add
> filesystem checking and hope that you always have write permissions to
> the dirs holding the files (or have a work-around algorithm if you
> don't).  As I said, it's complicated (and quite a bit of hassle).


It is a hassle.  To make matters worse, the mapping between st_dev
numbers and filesystems might change over time.  The same filesystem
could get a different st_dev if the hardware is connected to a
different port.  More troublingly, I could prepare two external disks
that are identical except for the actual contents of the files, mount
one, let rsync cache its checksums, and then switch to the other; the
second disk would get the same st_dev and rsync would be fooled.

The approach I'm considering is to have a separate cache file for each
filesystem and store it on the filesystem itself.  That avoids the
issues with st_dev and has the added benefit that, if a disk is moved
to a different computer, the checksums come with it for that computer
to use.  Conveniently, rsync could touch the cache file itself to get
"now" and possibly run other tests on the cache file to make sure the
behavior of the filesystem's mtimes and ctimes is sane enough to make
correct caching possible.

I propose that sysadmins give each user a directory on each filesystem
at a standard path from the filesystem root to hold the cache
files.(*)  When rsync stats a file and sees a st_dev that doesn't
match any of the caches it already has open, it could obtain the mount
point of that st_dev (I'm not sure how but I'm sure it's possible) and
follow the appropriate path relative to the mount point to open the
cache or create a new one.
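
One possible way to find the mount point, sketched in Python: walk up the directory tree until the parent reports a different st_dev.  The names `mount_point` and `cache_path`, the default `staging` directory, and the `checksum-cache` file name are all assumptions for illustration.  The sketch would not detect a bind mount that shares its parent's device.

```python
import os


def mount_point(path):
    """Return the topmost ancestor of `path` that shares its st_dev."""
    path = os.path.realpath(path)
    dev = os.stat(path).st_dev
    while True:
        parent = os.path.dirname(path)
        # Stop at the filesystem root or where the device number changes.
        if parent == path or os.stat(parent).st_dev != dev:
            return path
        path = parent


def cache_path(path, staging="staging", name="checksum-cache"):
    """Build a per-user cache location under the mount point (hypothetical layout)."""
    return os.path.join(mount_point(path), staging, str(os.getuid()), name)
```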

* Storage of checksum caches is just one of several uses for a
per-user, per-filesystem staging directory.  Such a directory could
also be used as a secure rsync --temp-dir and could contain a trash
area emptied by a setuid-root program (so users can delete others'
nonempty directories from their dropboxes).  At one point, my computer
provided staging directories at {/,/home/}staging/$UID.


> Changing my assumed function-calling order to be first time() then
> stat() would take care of this, since any roll-over in the stat() mtime
> would appear to be in the future, and would be seen as being too recent.


Yes, I had been assuming time() (actually, stat() on the touched file)
before stat() on the data file.

To economize on system calls, I propose a single time() at the
beginning.  If a stat() then reveals a "now" mtime, I propose calling
time() again to update "now".  If it changed and another stat()
confirms that the file is still older than the new "now", the file's
checksum can be cached.
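
The lazy refresh described above might look like this in Python.  The class name `NowTracker` and the injectable `clock` parameter are hypothetical; a real implementation would presumably take "now" from the filesystem rather than from time().

```python
import os
import time


class NowTracker:
    """Keep a whole-second "now", refreshing it only when a file looks too recent."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.now = int(clock())            # the single clock read up front

    def can_cache(self, path):
        if int(os.stat(path).st_mtime) < self.now:
            return True                    # clearly older than "now"
        refreshed = int(self._clock())     # mtime looks like "now": re-read the clock
        if refreshed == self.now:
            return False                   # the second has not ticked over yet
        self.now = refreshed
        # Stat again: the file must still be older than the new "now".
        return int(os.stat(path).st_mtime) < self.now
```

Files with old mtimes cost one stat() and no extra clock reads; only files that look current pay for the second clock read and stat().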


> I don't see any need for that for the xattr version, since rsync isn't
> going to update the checksums (just optionally create them on its temp
> files).


OK.


> For the non-xattr version it would be nice to have a better
> cache mechanism than the simple per-dir .rsyncsums files I implemented
> in my patch: having a library that implemented a checksum lookup/update
> by dev+inode using a global checksum cache would be cool, and avoid the
> file droppings.  Making it so that different programs could request a
> checksum of a particular type concurrently (which the server/library
> would return from cache, if possible, or compute and store in the cache,
> if safe) would make it generally useful for a variety of programs.  That
> would be quite easy for rsync to support, if it existed.


OK, I think I will start working on such a library.  I don't know
whether it will be ready in time for the release of rsync 3.0.0; until
it is ready, rsync could keep the existing .rsyncsums support or drop
it and provide only 

Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-02 Thread Matt McCutchen

On 7/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

> Unreliable.  If you sync up at the beginning of a run, and then the
> remote system executes a large clock step (e.g., because it's not
> running NTP or it's misconfigured, or it is but NTP has bailed due to
> excessive drift from hardware issues or a bogus driftfile (both of
> which I've seen*)), then "now" might glitch by a second (or more),
> which is enough to break your idea of what "now" means---even a
> smaller glitch can lead to races based on whose clock ticks first.
> Sure, it's a low-probability event, but then, with low probability,
> you have some file that isn't getting updated, which can lead to all
> kinds of mysterious bugs, etc...


The technique Wayne and I are discussing assumes only that the clock
on *each side* never steps backwards.  It compares the current mtime
and ctime on each side to the previous mtime and ctime on that side as
recorded in the cache.  Clock synchronization between the two sides is
irrelevant.

It is true that if either side's clock steps backwards, that side
could be fooled into thinking a file hasn't changed from the cache
when it really has.  There's very little we can do about that except
tell the sysadmin to delete all the caches when he/she sets the clock
backwards.
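
A small Python sketch of the comparison against recorded mtime and ctime.  The `CacheEntry` layout, the `remember` and `lookup` names, and keying by inode number (assuming one cache per filesystem) are illustrative assumptions, not an existing rsync interface.

```python
import os
from typing import Dict, NamedTuple, Optional


class CacheEntry(NamedTuple):
    size: int
    mtime_ns: int
    ctime_ns: int
    checksum: str


def remember(cache: Dict[int, CacheEntry], st: os.stat_result, checksum: str) -> None:
    """Record a checksum with the stat data captured before the file was read."""
    cache[st.st_ino] = CacheEntry(st.st_size, st.st_mtime_ns, st.st_ctime_ns, checksum)


def lookup(cache: Dict[int, CacheEntry], path: str) -> Optional[str]:
    """Return the cached checksum only if size, mtime and ctime are all unchanged."""
    st = os.stat(path)
    entry = cache.get(st.st_ino)
    if entry is None:
        return None
    current = (st.st_size, st.st_mtime_ns, st.st_ctime_ns)
    if (entry.size, entry.mtime_ns, entry.ctime_ns) != current:
        return None   # something touched the file: recompute rather than trust the cache
    return entry.checksum
```

Nothing here compares clocks across the two sides; each side only compares a file's current stamps with the ones it recorded itself.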


> Seems to me the only way around this would be to do the touch before
> -every- file you handle, which doubles the amount of statting going
> on, etc.  And there are probably still timing windows there.


I don't understand this concern.  If you'd like a more formal proof
that the technique never misses a modification assuming each side's
clock runs forward (actually, just each filesystem's clock), I would
be happy to provide one.

Matt


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-02 Thread Matt McCutchen

On 7/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

> I understand that it's a fairly low probability, and
> depends on some questionable configurations, but rsync is well-known
> to be both reliable and deterministic.  I'd hate for something like
> this to start chipping away at that reputation, even if we -are-
> talking about a corner case in a performance optimization that might
> not get invoked all that much.


You needn't worry: there's no question in my mind that --checksum
should continue to mean read every file every time and calculate the
checksum anew for those who need to be 100% certain (modulo checksum
collisions) that the file data is the same on both sides or suspect
that something is wrong with the clock.  However, for general use, a
checksum cache is still considerably more robust than the current
default size-and-mtime quick check, so I think rsync should offer it
via a separate option (perhaps --cached-checksum).

I believe that the administrator of an rsync daemon who doesn't want
clients bogging it down unnecessarily should be able to configure it
so that --checksum means --cached-checksum.  If you don't like this,
consider that clients have always gotten whatever data the
administrator chooses to give them; by enabling this setting, the
administrator is merely choosing to give them a slightly cached
version of the data in the filesystem rather than the actual data.
Compulsively honest daemon administrators could refuse --checksum
instead.

Matt


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-07-01 Thread Wayne Davison
On Sat, Jun 30, 2007 at 04:17:29PM -0400, Matt McCutchen wrote:
> First, setting the xattr hits the file's ctime.

Yeah, I realize that, and that's why none of the xattr values cache the
ctime.  This does mean that this method isn't good for updating checksum
values on existing files (since a general-purpose trusting/updating of
checksums based on size and mtime would be no better than a non-checksum
quick check).  It is still useful for allowing a server to cache the
checksum values without requiring any extra files.  As long as it is
used on files that aren't being actively updated, it works great.  I
might make this patch capable of creating the cached checksum values
when rsync creates a file, but I don't plan to make rsync ever update
an xattr checksum on an existing file.

> Second, it is impossible to make xattr-based checksum caching
> foolproof against same-second modification.

Not really.  The git algorithm only works if nothing modifies the files
while the checksum operation is running.  So, the algorithm protects
against bad things for sequential operations, but not parallel
operations.  A paranoid checksummer could notice if the mtime of a file
was "now"(*) and delay checksumming that file until later in the run.
It could also compare the mtime of a file from before and after it was
read to ensure that it wasn't modified during the read phase (assuming
that it never starts to read a file with an mtime of "now").

*Note that "now" for a particular disk may not be the same as time() if
the disk is remote, so network filesystems can be rather complicated.
Also, being off by a second might still be "now" if the value of the
seconds field rolled over during the check.  The perl script in my patch
that creates/updates these xattr checksums doesn't try to deal with any
of these complications.
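
The before-and-after comparison described above could be sketched in Python like this.  The function name `checksum_if_stable` and the use of SHA-256 are assumptions of the sketch, which also leaves out the "never start on a file whose mtime is now" precondition.

```python
import hashlib
import os


def checksum_if_stable(path, chunk_size=1 << 16):
    """Return the file's hex digest, or None if it changed while being read."""
    before = os.stat(path)
    digest = hashlib.sha256()   # hash choice is an assumption of this sketch
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    after = os.stat(path)
    # A differing mtime or size means a write overlapped the read.
    if (before.st_mtime_ns, before.st_size) != (after.st_mtime_ns, after.st_size):
        return None
    return digest.hexdigest()
```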

..wayne..


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-06-30 Thread Matt McCutchen

On 6/30/07, Wayne Davison [EMAIL PROTECTED] committed:

> Added Files:
> checksum-xattr.diff
> Log Message:
> A simple patch that lets rsync use cached checksum values stored in
> each file's extended attributes.  A perl script is provided to create
> and update the values.


Wayne,

You should be aware of two drawbacks of caching checksums in xattrs:

First, setting the xattr hits the file's ctime.  Thus, in exchange for
rsync being able to skip the file, other tools that use ctime (such as
GNU tar incremental backups) unnecessarily reprocess it.  Beagle also
caches checksums in xattrs, and one of its users complained about the
effect on the ctime:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg03251.html

Second, it is impossible to make xattr-based checksum caching
foolproof against same-second modification.  Suppose a file is written
during second 5 and then rsync caches its checksum during second 8;
now the file has mtime 5 and ctime 8.  Sometime later, rsync notices
that the file still has mtime 5 and ctime 8.  Does rsync trust the
cached checksum?  It must; otherwise the benefit of caching checksums
would be lost.  However, rsync will be fooled if the file was modified
and then touched back to mtime 5 during second 8, right after the
checksum was cached.  This concern may not be relevant when the
content is slowly changing.

Matt


Re: checksum-xattr.diff [CVS update: rsync/patches]

2007-06-30 Thread Jamie Lokier
Matt McCutchen wrote:
> Second, it is impossible to make xattr-based checksum caching
> foolproof against same-second modification.  Suppose a file is written
> during second 5 and then rsync caches its checksum during second 8;
> now the file has mtime 5 and ctime 8.  Sometime later, rsync notices
> that the file still has mtime 5 and ctime 8.  Does rsync trust the
> cached checksum?  It must; otherwise the benefit of caching checksums
> would be lost.  However, rsync will be fooled if the file was modified
> and then touched back to mtime 5 during second 8, right after the
> checksum was cached.  This concern may not be relevant when the
> content is slowly changing.

There really ought to be a special kind of xattr which automatically
disappears when the file is modified, for this sort of thing.  Or a
modification serial number, perhaps only incremented when somebody
actually has read it.  Alas, I think attempts to get one into Linux
didn't get very far; nobody thought it was that important.

-- Jamie