Re: checksum-xattr.diff [CVS update: rsync/patches]
On 7/1/07, Wayne Davison [EMAIL PROTECTED] wrote:
> [...]
> It is still useful for allowing a server to cache the checksum values without requiring any extra files. As long as it is used on files that aren't being actively updated, it works great.

OK, that's reasonable.

> > Second, it is impossible to make xattr-based checksum caching foolproof against same-second modification.
>
> Not really.

What do you mean? There's no way to fix the example I gave with xattrs, whereas...

> The git algorithm only works if nothing modifies the files while the checksum operation is running. So, the algorithm protects against bad things for sequential operations, but not parallel operations.

...I proposed a small change to the git algorithm that makes it protect against parallel operations too:

http://marc.info/?l=git&m=118323680215966&w=2

> A paranoid checksummer could notice if the mtime of a file was "now"(*) and delay checksumming that file until later in the run.

That would be especially smart. Git doesn't attempt to save reusable checksums for files whose mtimes are "now".

> It could also compare the mtime of a file from before and after it was read to ensure that it wasn't modified during the read phase (assuming that it never starts to read a file with an mtime of "now").

Or it could just use the "before" mtime in the cache so that, if the file is modified during reading, the cached checksum would already be invalid. I think git does this.

> *Note that "now" for a particular disk may not be the same as time() if the disk is remote, so network filesystems can be rather complicated.

That's easy to fix: get your "now" by touching a file on the filesystem and reading the resulting mtime.

> Also, being off by a second might still be "now" if the value of the seconds field rolled over during the check.

I don't think this is a problem if you stat the file just once before reading it.

> The perl script in my patch that creates/updates these xattr checksums doesn't try to deal with any of these complications.

And that's probably fine for rsync's purposes. However, I still think it might be cool if I made a foolproof checksum-caching library and rsync used it...

Matt
--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
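The suggestion above of getting "now" by touching a file on the filesystem and reading back its mtime could look roughly like this. It is only a sketch, not code from the patch or from git; the function name filesystem_now and the probe-file prefix are invented for illustration, and it assumes the caller can write to the directory.

```python
import os
import tempfile


def filesystem_now(directory):
    """Return the whole-second mtime the filesystem gives a freshly written file.

    Hypothetical helper: the timestamp comes from whatever clock the
    filesystem (possibly a remote server) uses, not from the local time().
    """
    fd, probe = tempfile.mkstemp(prefix=".now-probe-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"x")  # the write makes the filesystem stamp the mtime
        return os.stat(probe).st_mtime_ns // 1_000_000_000
    finally:
        os.unlink(probe)  # leave no droppings behind
```

A caller would take this value once per filesystem, before stat()ing any data files on it, and treat any file whose mtime is not strictly older as too recent to cache.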
Re: checksum-xattr.diff [CVS update: rsync/patches]
On Mon, Jul 02, 2007 at 08:43:39AM -0400, Matt McCutchen wrote:
> What do you mean? There's no way to fix the example I gave with xattrs

Not so. I went on to explain how that is possible in my prior email (i.e. avoiding caching a checksum on a "now" mtime file is all that is needed).

> That's easy to fix: get your "now" by touching a file on the filesystem and reading the resulting mtime.

Yes, that's one solution that I had already thought of. You'd also need to do that for every filesystem in the transfer, so you need to add filesystem checking and hope that you always have write permissions to the dirs holding the files (or have a work-around algorithm if you don't). As I said, it's complicated (and quite a bit of hassle).

> > Also, being off by a second might still be "now" if the value of the seconds field rolled over during the check.
>
> I don't think this is a problem if you stat the file just once before reading it.

It is if you're doing one check to see if a file is being updated (e.g. stat() followed by time() to compute "now"). If time rolls over between the two calls, you may have just missed that the mtime would now match if you did another stat(). Because of this, you can't be sure if you read the file prior to the last change, or after the last change.

> And that's probably fine for rsync's purposes. However, I still think it might be cool if I made a foolproof checksum-caching library and rsync used it...

I don't see any need for that for the xattr version, since rsync isn't going to update the checksums (just optionally create them on its temp files). For the non-xattr version it would be nice to have a better cache mechanism than the simple per-dir .rsyncsums files I implemented in my patch: having a library that implemented a checksum lookup/update by dev+inode using a global checksum cache would be cool, and avoid the file droppings. Making it so that different programs could request a checksum of a particular type concurrently (which the server/library would return from cache, if possible, or compute and store in the cache, if safe) would make it generally useful for a variety of programs. That would be quite easy for rsync to support, if it existed.

..wayne..
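A checksum lookup keyed by dev+inode, returning from cache when possible and storing only when safe, might be sketched as below. Everything here is an assumption made for illustration: the class name ChecksumCache, the in-memory dict standing in for a shared global cache, the use of SHA-256 as a stand-in digest, and the rule of recording size, mtime and ctime. Elsewhere in the thread it is pointed out that st_dev is not necessarily stable over time, which this sketch does not address.

```python
import hashlib
import os
import time


class ChecksumCache:
    """Hypothetical in-memory cache keyed by (st_dev, st_ino)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # (dev, ino) -> ((size, mtime_ns, ctime_ns), digest)
        self.hits = 0

    def checksum(self, path):
        now = int(self._clock())  # sample the clock before stat()
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        stamp = (st.st_size, st.st_mtime_ns, st.st_ctime_ns)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == stamp:
            self.hits += 1  # unchanged since it was cached
            return entry[1]
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        newest = max(st.st_mtime_ns, st.st_ctime_ns) // 1_000_000_000
        if newest < now:
            self._entries[key] = (stamp, digest)  # old enough: safe to store
        else:
            self._entries.pop(key, None)  # too recent: compute but do not cache
        return digest
```

The clock is injectable only so the "too recent" rule can be exercised without waiting for a second to pass; a real library would also need persistence and locking for concurrent callers.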
Re: checksum-xattr.diff [CVS update: rsync/patches]
On Mon, Jul 02, 2007 at 10:28:25AM -0700, Wayne Davison wrote:
> It is if you're doing one check to see if a file is being updated (e.g. stat() followed by time() to compute "now"). If time rolls over between the two calls, you may have just missed that the mtime would now match if you did another stat().

I didn't explain that well. The glitch is not if the mtime would now be 1 second later (in which case the checksum would be ignored) but if it was possible that the read happened, then a write, then "now" rolled over to the next second. Changing my assumed function-calling order to be first time() then stat() would take care of this, since any roll-over in the stat() mtime would appear to be in the future, and would be seen as being too recent.

..wayne..
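The time()-then-stat() ordering described above can be sketched in a few lines. This is an illustration only, assuming whole-second timestamps and a local clock that matches the filesystem's; the name safe_to_cache and the injectable clock parameter are made up for the example.

```python
import os
import time


def safe_to_cache(path, clock=time.time):
    """Return True only if the file's mtime is strictly older than "now".

    The clock is read first and the file is stat()ed second, so a write
    that lands after the clock was read shows an mtime equal to or later
    than "now" and is rejected as too recent.
    """
    now = int(clock())   # 1. sample the clock
    st = os.stat(path)   # 2. then look at the file
    mtime = st.st_mtime_ns // 1_000_000_000
    return mtime < now
```

With the calls in the opposite order, a write followed by a seconds roll-over between stat() and time() would make a just-modified file look one second old.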
Re: checksum-xattr.diff [CVS update: rsync/patches]
On 7/2/07, Wayne Davison [EMAIL PROTECTED] wrote:
> On Mon, Jul 02, 2007 at 08:43:39AM -0400, Matt McCutchen wrote:
> > What do you mean? There's no way to fix the example I gave with xattrs
>
> Not so. I went on to explain how that is possible in my prior email (i.e. avoiding caching a checksum on a "now" mtime file is all that is needed).

I do not see how avoiding caching a checksum on a "now" mtime file affects my example. In my example, a file is written at one time (say 5) and gets mtime 5. At some *later* time (say 8), rsync runs and caches the checksum, which sets the ctime to 8. The trouble is that, right after rsync caches the checksum, someone could quickly modify the file and then set the mtime back to 5. If he manages to do this before time ticks to 9, the ctime will still be 8, and rsync will have no way of knowing the file was modified. It still appears to me that this is one case that xattr-based checksum caching cannot be made to handle correctly. That doesn't mean anyone should care; the case is unlikely and setting a file's mtime back after modifying it is a silly thing to do.

> > That's easy to fix: get your "now" by touching a file on the filesystem and reading the resulting mtime.
>
> Yes, that's one solution that I had already thought of. You'd also need to do that for every filesystem in the transfer, so you need to add filesystem checking and hope that you always have write permissions to the dirs holding the files (or have a work-around algorithm if you don't). As I said, it's complicated (and quite a bit of hassle).

It is a hassle. To make matters worse, the mapping between st_dev numbers and filesystems might change over time. The same filesystem could get a different st_dev if the hardware is connected to a different port. More troublingly, I could prepare two external disks that are identical except for the actual contents of the files, mount one, let rsync cache its checksums, and then switch to the other; the second disk would get the same st_dev and rsync would be fooled.

The approach I'm considering is to have a separate cache file for each filesystem and store it on the filesystem itself. That avoids the issues with st_dev and has the added benefit that, if a disk is moved to a different computer, the checksums come with it for that computer to use. Conveniently, rsync could touch the cache file itself to get "now" and possibly run other tests on the cache file to make sure the behavior of the filesystem's mtimes and ctimes is sane enough to make correct caching possible.

I propose that sysadmins give each user a directory on each filesystem at a standard path from the filesystem root to hold the cache files.(*) When rsync stats a file and sees a st_dev that doesn't match any of the caches it already has open, it could obtain the mount point of that st_dev (I'm not sure how but I'm sure it's possible) and follow the appropriate path relative to the mount point to open the cache or create a new one.

* Storage of checksum caches is just one of several uses for a per-user, per-filesystem staging directory. Such a directory could also be used as a secure rsync --temp-dir and could contain a trash area emptied by a setuid-root program (so users can delete others' nonempty directories from their dropboxes). At one point, my computer provided staging directories at {/,/home/}staging/$UID.

> Changing my assumed function-calling order to be first time() then stat() would take care of this, since any roll-over in the stat() mtime would appear to be in the future, and would be seen as being too recent.

Yes, I had been assuming time() (actually, stat() on the touched file) before stat() on the data file. To economize on system calls, I propose a single time() at the beginning. If a stat() then reveals a "now" mtime, I propose calling time() again to update "now". If it changed and another stat() confirms that the file is still older than the new "now", the file's checksum can be cached.

> I don't see any need for that for the xattr version, since rsync isn't going to update the checksums (just optionally create them on its temp files).

OK.

> For the non-xattr version it would be nice to have a better cache mechanism than the simple per-dir .rsyncsums files I implemented in my patch: having a library that implemented a checksum lookup/update by dev+inode using a global checksum cache would be cool, and avoid the file droppings. Making it so that different programs could request a checksum of a particular type concurrently (which the server/library would return from cache, if possible, or compute and store in the cache, if safe) would make it generally useful for a variety of programs. That would be quite easy for rsync to support, if it existed.

OK, I think I will start working on such a library. I don't know whether it will be ready in time for the release of rsync 3.0.0; until it is ready, rsync could keep the existing .rsyncsums support or drop it and provide only
Re: checksum-xattr.diff [CVS update: rsync/patches]
On 7/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Unreliable. If you sync up at the beginning of a run, and then the remote system executes a large clock step (e.g., because it's not running NTP or it's misconfigured, or it is but NTP has bailed due to excessive drift from hardware issues or a bogus driftfile (both of which I've seen*)), then "now" might glitch by a second (or more), which is enough to break your idea of what "now" means---even a smaller glitch can lead to races based on whose clock ticks first. Sure, it's a low-probability event, but then, with low probability, you have some file that isn't getting updated, which can lead to all kinds of mysterious bugs, etc...

The technique Wayne and I are discussing assumes only that the clock on *each side* never steps backwards. It compares the current mtime and ctime on each side to the previous mtime and ctime on that side as recorded in the cache. Clock synchronization between the two sides is irrelevant.

It is true that if either side's clock steps backwards, that side could be fooled into thinking a file hasn't changed from the cache when it really has. There's very little we can do about that except tell the sysadmin to delete all the caches when he/she sets the clock backwards.

> Seems to me the only way around this would be to do the touch before -every- file you handle, which doubles the amount of statting going on, etc. And there are probably still timing windows there.

I don't understand this concern. If you'd like a more formal proof that the technique never misses a modification assuming each side's clock runs forward (actually, just each filesystem's clock), I would be happy to provide one.

Matt
Re: checksum-xattr.diff [CVS update: rsync/patches]
On 7/2/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> I understand that it's a fairly low probability, and depends on some questionable configurations, but rsync is well-known to be both reliable and deterministic. I'd hate for something like this to start chipping away at that reputation, even if we -are- talking about a corner case in a performance optimization that might not get invoked all that much.

You needn't worry: there's no question in my mind that --checksum should continue to mean "read every file every time and calculate the checksum anew" for those who need to be 100% certain (modulo checksum collisions) that the file data is the same on both sides or suspect that something is wrong with the clock. However, for general use, a checksum cache is still considerably more robust than the current default size-and-mtime quick check, so I think rsync should offer it via a separate option (perhaps --cached-checksum).

I believe that the administrator of an rsync daemon who doesn't want clients bogging it down unnecessarily should be able to configure it so that --checksum means --cached-checksum. If you don't like this, consider that clients have always gotten whatever data the administrator chooses to give them; by enabling this setting, the administrator is merely choosing to give them a slightly cached version of the data in the filesystem rather than the actual data. Compulsively honest daemon administrators could refuse --checksum instead.

Matt
Re: checksum-xattr.diff [CVS update: rsync/patches]
On Sat, Jun 30, 2007 at 04:17:29PM -0400, Matt McCutchen wrote:
> First, setting the xattr hits the file's ctime.

Yeah, I realize that, and that's why none of the xattr values cache the ctime. This does mean that this method isn't good for updating checksum values on existing files (since a general-purpose trusting/updating of checksums based on size and mtime would be no better than a non-checksum quick check). It is still useful for allowing a server to cache the checksum values without requiring any extra files. As long as it is used on files that aren't being actively updated, it works great. I might make this patch capable of creating the cached checksum values when rsync creates a file, but I don't plan to make rsync ever update an xattr checksum on an existing file.

> Second, it is impossible to make xattr-based checksum caching foolproof against same-second modification.

Not really. The git algorithm only works if nothing modifies the files while the checksum operation is running. So, the algorithm protects against bad things for sequential operations, but not parallel operations. A paranoid checksummer could notice if the mtime of a file was "now"(*) and delay checksumming that file until later in the run. It could also compare the mtime of a file from before and after it was read to ensure that it wasn't modified during the read phase (assuming that it never starts to read a file with an mtime of "now").

*Note that "now" for a particular disk may not be the same as time() if the disk is remote, so network filesystems can be rather complicated. Also, being off by a second might still be "now" if the value of the seconds field rolled over during the check. The perl script in my patch that creates/updates these xattr checksums doesn't try to deal with any of these complications.

..wayne..
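The "paranoid checksummer" idea above (skip files whose mtime is "now", and compare the mtime from before and after the read) might be sketched like this. It is not the perl script from the patch; the function name checksum_if_stable is hypothetical, SHA-256 is used only as a stand-in digest, and the local clock is assumed to agree with the filesystem's.

```python
import hashlib
import os
import time


def checksum_if_stable(path, clock=time.time):
    """Return (mtime_seconds, hexdigest), or None if the result is not cacheable."""
    now = int(clock())
    before = os.stat(path)
    if before.st_mtime_ns // 1_000_000_000 >= now:
        return None  # mtime is "now": defer this file until later in the run
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    after = os.stat(path)
    if (after.st_mtime_ns, after.st_size) != (before.st_mtime_ns, before.st_size):
        return None  # the file changed while it was being read
    return before.st_mtime_ns // 1_000_000_000, digest.hexdigest()
```

A caller could collect the paths that returned None and retry them at the end of the run, once their mtimes are no longer "now".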
Re: checksum-xattr.diff [CVS update: rsync/patches]
On 6/30/07, Wayne Davison [EMAIL PROTECTED] committed:
> Added Files: checksum-xattr.diff
> Log Message:
> A simple patch that lets rsync use cached checksum values stored in each file's extended attributes. A perl script is provided to create and update the values.

Wayne,

You should be aware of two drawbacks of caching checksums in xattrs:

First, setting the xattr hits the file's ctime. Thus, in exchange for rsync being able to skip the file, other tools that use ctime (such as GNU tar incremental backups) unnecessarily reprocess it. Beagle also caches checksums in xattrs, and one of its users complained about the effect on the ctime:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg03251.html

Second, it is impossible to make xattr-based checksum caching foolproof against same-second modification. Suppose a file is written during second 5 and then rsync caches its checksum during second 8; now the file has mtime 5 and ctime 8. Sometime later, rsync notices that the file still has mtime 5 and ctime 8. Does rsync trust the cached checksum? It must; otherwise the benefit of caching checksums would be lost. However, rsync will be fooled if the file was modified and then touched back to mtime 5 during second 8, right after the checksum was cached. This concern may not be relevant when the content is slowly changing.

Matt
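The second-5/second-8 example above can be replayed with a toy model of a file and a one-second clock. This is purely a simulation written for illustration: SimFile and the three helper functions are invented, and they only mimic the usual rule that content writes update mtime and ctime while metadata changes update ctime alone.

```python
from dataclasses import dataclass


@dataclass
class SimFile:
    data: bytes
    mtime: int
    ctime: int


def write(f, data, clock):
    f.data, f.mtime, f.ctime = data, clock, clock  # content change: both stamps


def set_xattr(f, clock):
    f.ctime = clock  # metadata change: ctime only


def set_mtime(f, mtime, clock):
    f.mtime, f.ctime = mtime, clock  # touching mtime back still bumps ctime


f = SimFile(b"", 0, 0)
write(f, b"old contents", clock=5)
set_xattr(f, clock=8)                    # checksum cached during second 8
stamp_when_cached = (f.mtime, f.ctime)   # (5, 8)

write(f, b"new contents", clock=8)       # modified within the same second
set_mtime(f, 5, clock=8)                 # mtime put back to 5

# The timestamps look untouched although the data differs.
assert (f.mtime, f.ctime) == stamp_when_cached
assert f.data == b"new contents"
```

In this model the tampering is undetectable only because it happens before the clock ticks to 9; one second later the ctime would read 9 and the cached value would be distrusted.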
Re: checksum-xattr.diff [CVS update: rsync/patches]
Matt McCutchen wrote:
> Second, it is impossible to make xattr-based checksum caching foolproof against same-second modification. Suppose a file is written during second 5 and then rsync caches its checksum during second 8; now the file has mtime 5 and ctime 8. Sometime later, rsync notices that the file still has mtime 5 and ctime 8. Does rsync trust the cached checksum? It must; otherwise the benefit of caching checksums would be lost. However, rsync will be fooled if the file was modified and then touched back to mtime 5 during second 8, right after the checksum was cached. This concern may not be relevant when the content is slowly changing.

There really ought to be a special kind of xattr which automatically disappears when the file is modified, for this sort of thing. Or a modification serial number, perhaps only incremented when somebody actually has read it. Alas, I think attempts to get one into Linux didn't get very far; nobody thought it was that important.

-- Jamie