On Fri, 6 August 2004 12:53:43 +1200, Sam Vilain wrote:
> 
> The chances of bits on your hard drive platter randomly losing their
> magnetism or capacitors in your RAM losing charge and changing are 
> probably higher than two different files having an SHA1 collision :-). 

I used to have the same opinion.  Then I read this:
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/hash.html

> Hashing only the first block of the file as an optimisation is a
> sensible idea.

Yes.
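
A minimal sketch of that prefilter, in Python rather than the language of
the unify-dirs script. The function name and the 4096-byte block size are
assumptions for illustration, not taken from the script. Files whose first
block hashes differ cannot be identical; equal hashes only mean the pair is
still a candidate.

```python
import hashlib


def first_block_sha1(path, block_size=4096):
    """Return the SHA-1 hex digest of the first block_size bytes of path.

    block_size is an assumed value; a real tool might use the
    filesystem block size instead.
    """
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(block_size)).hexdigest()
```

Two files that agree in the first block but differ later get the same
digest here, so a match still needs a full check afterwards.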

> The script could be easily modified to do this as a separate step,
> however bear in mind that it will only even consider checking the file's
> contents if the files already have the same owner/group/permissions,
> relative path and file size.  My assumption was that if these all match,
> the files are probably going to be the same anyway.

In that case, you can ignore the hashes anyway.  Do a direct
comparison, nothing lost.
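
Roughly what I mean, as a Python sketch. The function name is made up, and
the metadata check only mirrors the owner/group/permissions/size conditions
quoted above (the relative-path check is left to the caller). The content
check is a plain byte-for-byte comparison, no hashes involved.

```python
import filecmp
import os


def same_for_unification(a, b):
    """Return True if a and b match in metadata and in content."""
    sa, sb = os.stat(a), os.stat(b)
    meta_a = (sa.st_uid, sa.st_gid, sa.st_mode, sa.st_size)
    meta_b = (sb.st_uid, sb.st_gid, sb.st_mode, sb.st_size)
    if meta_a != meta_b:
        return False
    # shallow=False forces filecmp to read and compare the file contents
    # instead of trusting the stat() signature alone.
    return filecmp.cmp(a, b, shallow=False)
```

A direct comparison stops at the first differing block, so for files that
differ it usually reads less data than hashing both files would.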

> Nice idea, but I think on UNIX that's pretty much a can of worms with no 
> easy answer.  You'd need something in the kernel that notifies userland 
> when any inode on a filesystem changes.  Have a look at the intermezzo 
> module if you want to go down that path.  If you can provide the kernel 
> half, I'll be more than happy to extend unify-dirs to work with it :).

Yes, I know.  Quite a few people tried it already, Al Viro didn't like
any of it.

> Failing active monitoring, as a simple compromise there's no reason that 
> unify-dirs couldn't optionally store its internal inode/stat/SHA1 hash 
> cache in a Berkeley database, and run the script every hour or so via 
> cron.  It would certainly prevent the copious stat()'ing that the script 
> does, at the expense of not noticing unlikely unification situations 
> until the DB cache entries expire.
> 
> Of course, it would still absolutely hammer the VFS every time it runs 
> with readdir() calls and find all those glorious reiserfs corner case 
> bugs, but in my experience with a "handful" (say, 30) of vservers that 
> are already mostly unified the script completes in under a minute when 
> unifying just the OS (eg, /usr, /lib, /sbin and /bin).
> 
> Who knows, maybe there are other optimizations possible - like only 
> stat()'ing the leaf directories in the hierarchy, to see if any files 
> have been added or removed before actually using readdir() to read them. 
>   Again this will not catch some unlikely unification situations until 
> full stat()'ing happens.
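
For illustration, the directory-stat idea could look something like the
Python sketch below. Both function names and the in-memory dict are
assumptions; the quoted proposal would keep this state in a database
between runs. It relies on a directory's mtime changing when entries are
added or removed, so a file rewritten in place is not noticed.

```python
import os


def snapshot(dirs):
    """Record the current mtime (in nanoseconds) of each directory."""
    return {path: os.stat(path).st_mtime_ns for path in dirs}


def dirs_needing_rescan(known_mtimes):
    """Return directories whose mtime differs from the recorded one.

    Only stat() is used here; the caller would readdir() just the
    directories this function returns.
    """
    stale = []
    for path, old_mtime in known_mtimes.items():
        try:
            new_mtime = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            new_mtime = None  # directory vanished, so it needs attention
        if new_mtime != old_mtime:
            stale.append(path)
    return stale
```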

Your problem is simpler, compared to the one I want to solve.  Also,
with final cowlinks, it's perfectly sane to combine two files with
different owners, permissions, [amc]times, etc.  Both will have
separate inodes, just the data is identical.

Jörn

-- 
Invincibility is in oneself, vulnerability is in the opponent.
-- Sun Tzu