> second, do not let yourself be stopped by anything, or anybody!
> (the vserver list seems good at stopping people from
> doing useful things ...)
this is one of the friendliest lists I am on.

> and finally I've come to the conclusion, we are both
> (or at least I am) bad in explaining things, so I will
> try to shed light from different angles ...

so am I :)

> How I see the process:
>
> a) selecting files on an appropriate basis
>    (including find syntax, file lists, patterns, etc)
> b) comparing/sorting the files per size
>    (this might be done with your bucket algorithm)
> c) comparing files of equal size against each other
>    (you can't assume that files of equal size are
>    the same, so you'll have to compare them)
> d) unifying all files found to be equal

clarification:

a) recursing through the filesystem (while filtering) and storing
   the file data in an ordered set with the size as key, so after
   this step we have a size-sorted set of candidate files
   (see the first sketch at the end of this mail)
b) not necessary, already done in a)
c) comparing files of equal size against each other ... that is
   what the bucket algorithm does, in a smarter way than 'each
   file against each other' (second sketch)
d) each bucket with more than one file can be unified; this will
   rather be integrated into c) to reduce the memory requirements
   (third sketch)

> Some ideas I had regarding this process:
>
> - why a brand new selection/pattern syntax if
>   find probably already does what you want?

find cannot select on file attributes. someone probably does not
want to unify files which are marked 'immutable_file' instead of
'immutable_link'.

> - what about external knowledge, in the form of
>   include/exclude lists?

maybe later. I am thinking about external config files which
could contain include/exclude patterns and other options, but at
the current point I would prefer that it simply works first.

> - why not generate some hash value for the files
>   (in step c), so they could be compared instead
>   of the files ...

generating a hash: iterate through a file and do a moderately
expensive computation. my bucket algorithm: iterate through a
file and do a cheap comparison. conclusion: we have to iterate
through the entire file either way, and that is the most
expensive part. hashes have a (microscopically) small chance of
failure while being more expensive to compute than the bucket
thing. result: I do not intend to use hashes.

> - maybe one can store the hashes of once unified
>   files (together with the file name, location,
>   creation time, etc) and reuse this information

I mentioned earlier that I might use db3 as temporary storage for
the file data; this could be extended into a persistent store,
but I do not see much use in that: I do not keep hashes, you have
to scan the filesystem anyway to find modified files, and you
would then have to recalculate the hashes of those modified
files.

> Some (might be) useful information:
>
> - be careful about filesystem change (-xdev)
> - avoid/block recursive/broken links

links (and other special files) will be ignored completely; only
directories are used for the recursion and plain files for the
unification.

> - do not modify/touch the files (timestamps)

maybe I will (optionally). example: I have 2 debian/woody
installations which are currently not unified and are updated
independently; when this new vunify is finished I want to unify
them!

> - do not assume (virtual) memory is unlimited

mmap only consumes address space (which is limited too). the only
big amount of memory I need is for the file-metadata storage,
which might grow to some tens or hundreds of megabytes; that is
where I am thinking about db3, should it turn out to be a
problem. but I do not think anyone runs a server big enough, with
so little memory, for this to become a problem ...
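to make step a) concrete, here is a minimal sketch, assuming
nftw(3) and plain C; the names and the flat array are my
illustration only, not the final vunify code (which would rather
use the ordered set / db3 storage mentioned above):

/* sketch of step a): collect (size, path) pairs of plain files
 * from ONE filesystem and sort them by size, so that equal-sized
 * files become adjacent candidate buckets */

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct candidate { off_t size; char *path; };

static struct candidate *cand;
static size_t ncand, acand;

static int visit(const char *path, const struct stat *st,
                 int type, struct FTW *ftw)
{
    (void)ftw;
    /* only plain files are candidates; symlinks, devices,
     * fifos and friends are ignored completely */
    if (type != FTW_F || !S_ISREG(st->st_mode))
        return 0;
    if (ncand == acand) {
        acand = acand ? acand * 2 : 1024;
        cand  = realloc(cand, acand * sizeof *cand);
        if (!cand) { perror("realloc"); exit(1); }
    }
    cand[ncand].size = st->st_size;
    cand[ncand].path = strdup(path);
    ncand++;
    return 0;
}

static int by_size(const void *a, const void *b)
{
    off_t x = ((const struct candidate *)a)->size;
    off_t y = ((const struct candidate *)b)->size;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <dir>\n", argv[0]);
        return 1;
    }
    /* FTW_PHYS: do not follow symlinks (no link loops),
     * FTW_MOUNT: stay on one filesystem, like find's -xdev */
    if (nftw(argv[1], visit, 64, FTW_PHYS | FTW_MOUNT) != 0) {
        perror("nftw");
        return 1;
    }
    qsort(cand, ncand, sizeof *cand, by_size);

    for (size_t i = 0; i < ncand; i++)
        printf("%10lld %s\n", (long long)cand[i].size, cand[i].path);
    return 0;
}

a single linear scan over the sorted array then yields the
buckets: every run of equal sizes with more than one file goes on
to the comparison step.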
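and the bucket comparison from c), again just my sketch of the
idea, assuming chunked stdio reads: all files of one size bucket
are read chunk by chunk in parallel, a group splits as soon as
the bytes differ, and whatever is still grouped at EOF is
byte-identical, all in one sequential pass and without hashing:

/* sketch of step c): compare one bucket of same-size files */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 65536

struct member {
    const char *path;
    FILE       *fp;
    char       *buf;     /* bytes read in the current round */
    size_t      got;
    int         group;   /* current sub-bucket id */
};

static void compare_bucket(const char **paths, int n)
{
    if (n < 2) return;   /* nothing to unify */

    struct member *m = calloc(n, sizeof *m);
    int *old = malloc(n * sizeof *old);
    int i, j, g, groups = 0, done = 0;

    for (i = 0; i < n; i++) {
        m[i].path = paths[i];
        m[i].buf  = malloc(CHUNK);
        m[i].fp   = fopen(paths[i], "rb");
        if (!m[i].fp) { perror(paths[i]); exit(1); }
    }

    while (!done) {
        /* one cheap sequential read per file and round */
        for (i = 0; i < n; i++)
            m[i].got = fread(m[i].buf, 1, CHUNK, m[i].fp);
        done = (m[0].got < CHUNK);  /* equal size: EOF together */

        /* refine: new group = (old group, bytes just read) */
        for (i = 0; i < n; i++) old[i] = m[i].group;
        groups = 0;
        for (i = 0; i < n; i++) {
            m[i].group = -1;
            for (j = 0; j < i; j++)
                if (old[j] == old[i] && m[j].got == m[i].got &&
                    memcmp(m[j].buf, m[i].buf, m[i].got) == 0) {
                    m[i].group = m[j].group;  /* join j's group */
                    break;
                }
            if (m[i].group < 0)
                m[i].group = groups++;        /* open a new one */
        }
    }

    /* every group with more than one member can be unified */
    for (g = 0; g < groups; g++) {
        int count = 0;
        for (i = 0; i < n; i++)
            if (m[i].group == g) count++;
        if (count < 2) continue;
        printf("unifiable:");
        for (i = 0; i < n; i++)
            if (m[i].group == g) printf(" %s", m[i].path);
        printf("\n");
    }

    for (i = 0; i < n; i++) { fclose(m[i].fp); free(m[i].buf); }
    free(old); free(m);
}

per round there is exactly one read per file, and the memcmp()
calls are cheap compared to the disk I/O; that is exactly the
argument against hashes made above. (for simplicity every file of
the bucket stays open and singleton groups are still read to EOF;
a real implementation would cap the number of open files and drop
groups that shrink to a single member.)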
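finally a hypothetical helper for the unification in d); the real
thing would first have to verify that both names live on the same
filesystem and agree in ownership and attributes (cf. the
immutable_file / immutable_link remark above):

/* sketch of step d): replace the duplicate `dup` with a hard
 * link to the file we keep, going through a temporary name so
 * that a failure never loses the original */

#include <stdio.h>
#include <unistd.h>

static int unify(const char *keep, const char *dup)
{
    char tmp[4096];

    if (snprintf(tmp, sizeof tmp, "%s.vunify-tmp", dup)
            >= (int)sizeof tmp)
        return -1;                   /* name too long */
    if (link(keep, tmp) != 0)        /* new name for keep's inode */
        return -1;
    if (rename(tmp, dup) != 0) {     /* atomically replace dup */
        unlink(tmp);
        return -1;
    }
    return 0;
}

rename(2) replaces the duplicate atomically, so there is no
window in which the file is missing.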
> - if you want it fast, code in C

this thing is so dominated by disk I/O that it will spend most of
its time waiting for the disks; I guess only 0.5-1% of the time
is spent in the program itself. so even a really bad language
that is 10 times slower than C would only make the program about
5-10% slower.

cya

Christian
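P.S.: a quick back-of-the-envelope check of that estimate, using
the 0.5-1% figure from above: if a fraction p of the wall-clock
time is CPU and the rest is disk wait, a language k times slower
than C changes the runtime from 1 to (1 - p) + k*p. with p = 0.01
and k = 10 that is 0.99 + 0.10 = 1.09, i.e. about 9% slower; with
p = 0.005 it is about 4.5%, roughly the 5-10% claimed above.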
