> > > Could be avoided by having an MD5 checksum AND a SHA1 checksum.
> > > Hitting collisions of two algorithms at the same time should be
> > > virtually impossible. If not really impossible.
> >
> > Yes, or we take a SHA-256/SHA-512 or whatever. Doesn't matter.
>
> My idea was that 2 different hashing algorithms would produce different
> results for a change - so if one algorithm produces the same checksum
> after a change, the other would produce something completely different
> to its previous value. To hack such a construction, you need to find a
> change that creates a collision for both algorithms at the same time.

Security-wise, that would surely be better.

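As a minimal sketch of the two-hash idea (an illustration only, not something
fsvs or the repository does today): the Python below computes MD5 and SHA-1 in
a single pass over the file and concatenates them into one key. The function
name `combined_digest` and the chunk size are assumptions made up for this
example.

```python
import hashlib

def combined_digest(path, chunk_size=1 << 16):
    """Return MD5 and SHA-1 of a file as one concatenated hex string."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Feed every chunk to both hashes, so the file is read only once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    # 32 hex digits of MD5 followed by 40 hex digits of SHA-1.
    return md5.hexdigest() + sha1.hexdigest()
```

Two files would then count as duplicate candidates only if the whole
72-character key matches, which requires a collision in both algorithms at
once.
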
> Of course, this is quite expensive, but the repository could do these
> checks while being idle.
>
> I just thought if it would be enough to have the size of the file as
> security add-on. Would be much cheaper. But also a lot weaker than a 2nd
> hashing algorithm, as there's no guarantee that 2 files of the same length
> but different content could never lead to the same hash value...

The known MD5 collisions have identical file lengths, with only a few bits
toggled.

> It's merely balancing between the probability of having a hash collision
> and how bad a hash collision could turn out. (Extremely bad, IMHO.)

That's only an issue if we rely on the client doing the duplicate searching.
The repository has to build the full-text on commit, then it could verify the
identity of two files byte-per-byte ... no chance for errors (apart from
cosmic rays and similar :-)

> > > What could "PUtting in unscheduled." possibly mean?!? This was a
> > > comment here: http://subversion.tigris.org/issues/show_bug.cgi?id=2286
> >
> > They're waiting for patches?
>
> Aha :-) Volunteers :-?

> > > What if a whole directory is renamed?
> >
> > Currently: The directory is checked in as new, with all files as new.
> > With shared space in the repository we'd just have duplicated space for
> > the directory tree, the files should not need anything (as the storage
> > space is shared).
>
> Sounds good enough, spacewise. Directories should not take much space.

I think so, too. Although, for deeply nested structures (think /usr), it might
be an issue, too:

  $ find /usr/ | wc -l
  200869

That's an awfully big directory structure.

> > Later: fsvs could detect that eg. the inode is the same and the contents
> > are mostly the same, deduce a rename, and send a rename (or a
> > copy/delete) to the repository ... But that's not even on my TODO
> > currently.
>
> Can fsvs see that the inode is the same? Then it should be easy.

A true rename retains the inode number, and fsvs could see that.

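A rough Python sketch of the inode check, assuming the client keeps a table
from (device, inode) to the path seen at the last commit. The names `snapshot`
and `detect_renames` are hypothetical, and this is not how fsvs currently
works.

```python
import os

def snapshot(root):
    """Map (st_dev, st_ino) -> relative path for every non-directory entry."""
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow symlinks
            # Hard links share an inode; this sketch keeps only one path.
            seen[(st.st_dev, st.st_ino)] = os.path.relpath(full, root)
    return seen

def detect_renames(old, new):
    """Return sorted (old_path, new_path) pairs that look like renames."""
    current_paths = set(new.values())
    renames = []
    for key, new_path in new.items():
        old_path = old.get(key)
        # Same inode, different path, and the old path is gone.
        if (old_path is not None and old_path != new_path
                and old_path not in current_paths):
            renames.append((old_path, new_path))
    return sorted(renames)
```

Inode numbers can be reused after a delete, so a match is only a hint; the
"contents are mostly the same" test from the quoted text would still be needed
before sending a rename to the repository.
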
> If fsvs needs to deduce a rename, this would cause a lot of internal work.

Not necessarily. On commit the file has to be MD5ed, then the duplicate hash
value could be seen. Although that would mean reading the file twice (once
for hashing, once for streaming). But that's only a problem for files which do
not fit in the file cache ...

> IMHO,
> the cheapest thing for later would be: have a special renaming command in
> the client.

That's always an option.

> > So please let me repeat myself :-)
> > Any other ideas?
>
> "Eg. for a file with MD5 of 8a04f87ad04f4a1d3c7e6ca12e07290d
>
>   repository/
>     ...
>     db/
>       ...
>       md5index/
>         8a/
>           04/
>             f87a.index
> "
> Hmm, how about:
>   repository/
>     ...
>     db/
>       ...
>       md5index/
>         8a/
>           04/
>             8a04f87ad04f4a1d3c7e6ca12e07290d
> ?
> I mean a file where the name is the full hash value.
> Advantages:
> - No need to open and read a file when you look for a specific hash value.
>   Just try a stat() on the value you know, from a calculation you just did.
> - No need to add checksums to an index file (read-modify-write)

Depending on the hash chosen, you might have a collision ... Eg. for MD5 the
chance *not* to have an accidental collision among 10^6 (better 2^20) files is

  (2^128)! / ((2^128 - 2^20)! * (2^128)^(2^20))

which is about 1 - 2^-89, so the chance of an accidental collision might as
well be 0. Constructed collisions are a different matter, though.
Take MD5 and SHA-1, concatenate, and the problem should vanish :-)

> The "Con: FSFS cannot be append-only; the indices have to be written
> and re-written." wouldn't be a problem anymore :-)

Per chance there'll be three glibc versions with identical MD5 and SHA1, and
we're botched :-)

> Disadvantage:
> - One hash-file for every stored real file. That might become huge.

That's why I thought about the lists ... You can *never* say "there won't ever
be a collision", so you might have to take lists after all ...

> > If not, I'll let that sink in a bit (about a week), and then maybe try to
> > implement issue 2286.
>
> In any way, you have much more insight than me.

I'd hope but don't think so.
I've never looked into the repository backend.

Thank you for your opinion!

Regards,

Phil
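
For illustration only, a minimal Python sketch that combines the
hash-named-file layout with the "lists" fallback discussed above. Everything
in it is an assumption for discussion: the `md5index/8a/04/<full hash>` layout
follows the quoted proposal, but the functions `index_path`, `find_duplicate`
and `register`, and the choice of storing one candidate path per line, are
invented here and are not FSFS code.

```python
import filecmp
import hashlib
import os

def file_md5(path, chunk_size=1 << 16):
    """Hex MD5 of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def index_path(index_root, digest):
    # e.g. md5index/8a/04/8a04f87ad04f4a1d3c7e6ca12e07290d
    return os.path.join(index_root, digest[:2], digest[2:4], digest)

def find_duplicate(index_root, path):
    """Return a registered file with identical content, or None."""
    entry = index_path(index_root, file_md5(path))
    if not os.path.exists(entry):  # one existence check, no big index to read
        return None
    with open(entry) as f:
        candidates = f.read().splitlines()
    for candidate in candidates:
        # An equal hash is only a hint: compare byte by byte before sharing.
        if filecmp.cmp(candidate, path, shallow=False):
            return candidate
    return None

def register(index_root, path):
    """Append path to the list kept in the file named after its hash."""
    entry = index_path(index_root, file_md5(path))
    os.makedirs(os.path.dirname(entry), exist_ok=True)
    with open(entry, "a") as f:
        # Assumes paths contain no newline characters.
        f.write(os.path.abspath(path) + "\n")
```

In the common case each list holds a single entry, so a lookup costs one
existence check plus one byte-wise comparison; a real collision only makes
the list longer instead of silently merging two different files.
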
