On Thu, 14 Nov 2002 10:32:06 +0100 Herbert Poetzl <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 14, 2002 at 01:19:45AM +0000, [EMAIL PROTECTED] wrote: > > Hi, > > > > > > because you might run in an O(n^2) issue ... > > > > > > I don't really care this is not a performance important task you > > > might run > > > it once a month and it can take many hours, no problem and the above > > > algorithm might be somewhere in O(n) maybe little worse. > > > > I do not see a performance problem. > > somebody will care, if a memory mapping process > goes wild on thousands of files, making millions > of comparisons for several hours *G* huh? there are not many files which become a unify-candidate by match all filters and exact same size also i saied 'mmap() reasonably many' files, i think about "map as much files which are comfortably fit in the free address space" wich might not much more than 1 GB on normal x86's ... see below > how to compare 5 files? > - compare 1st with number 2-5 > - compare 2nd with number 3-5 > - compare 3rd with number 4-5 > - and compare the last two > makes a total of (4+3+2+1)=10 comparisons Heuristics!! algorithm (just from scratch, hopefully noone patented it): we put all same-size candidate files together in one bucket, start a iterator from 1 to size_of_files, compare the iterator position of the first file in the bucket with all other files same position, each time we find a different value than the one of the current position of the first file in the bucket we check if there is an other bucket we 'just created' with the same value at that position, then we move the file there, else we create a new bucket and move the file there and remember this bucket as 'just created'. The 'just created' list is reset each time the iterator is incremented. So we need to scann each file only once, i would call that O(N) and comparing cost is almost nothing compared to disc access and is even faster than calculating any checksum. There are a very few cases when not all files map at once, and a few optimizations (buckets with only 1 file dont need to be compared with anything else..). mmap was my intention for easy file-handling which is more a implementation detail, reading char or blockwise or mmap blocks instead whole files will do just fine. (well i must agree that accessing many files in parallel will be more worse than accessing many files in serial order, if the files are not fragmented) > cksum (for example) would map each file only once, > and produce a hash value, which could be compared > in the same way as planned, but much much faster, > equally fast like the size sort/comparison ... > > please correct me if I'm wrong ... please correct me if I'm wrong ... make it better. I want a) a solutions for unifying vservers and b) a good one. That doent imply that my one will be the best .. but someone has to start ... well there is the Perl Version i check it latter but it looks much more limited than my proprosal since it has no file-selections et all (and can run into symlink-race conditions). I try to start coding on it within the next days lets see then. cya Christian
