On Thursday, 11. May 2006 07:23, you wrote:

> >>   Disadv: possible collisions
> >
> > Could be avoided by having a MD5 checksum AND a SHA1 checksum.
> > Hitting collisions of two algorithms at the same time should be virtually
> > impossible. If not really impossible.
> Yes, or we take a SHA-256/SHA-512 or whatever. Doesn't matter.

My idea was that 2 different hashing algorithms would produce different
results for a change - so if one algorithm produces the same checksum
after a change, the other would produce something completely different
to its previous value. To hack such a construction, you need to find a
change that creates a collision for both algorithms at the same time.
Of course, computing two checksums is quite expensive, but the repository
could do these checks while being idle.
I also wondered whether it would be enough to use the size of the file as an
additional safeguard. That would be much cheaper, but also a lot weaker than
a 2nd hashing algorithm, as there's no guarantee that 2 files of the same
length but different content never lead to the same hash value...
It's merely balancing between the probability of having a hash collision
and how bad a hash collision could turn out. (Extremely bad, IMHO.)
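
To make the combined check concrete, here is a minimal Python sketch. It only
uses the standard hashlib module; the names fingerprint() and same_content()
are made up for illustration and do not exist in Subversion or fsvs.

```python
import hashlib

def fingerprint(data):
    """Return (size, MD5 hex digest, SHA-1 hex digest) for some file content."""
    return (len(data),
            hashlib.md5(data).hexdigest(),
            hashlib.sha1(data).hexdigest())

def same_content(a, b):
    # Hypothetical rule: two blobs count as identical only if the size
    # and both digests agree, so a collision in one algorithm is not enough.
    return fingerprint(a) == fingerprint(b)
```

The size comes almost for free, while the second digest is what would make a
deliberate collision hard.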

> ... - the history and properties are not
> connected, only the storage space should be shared.
> 
> 
> >> Ad 1: as far as MD5 or SHA1 say they are identical.
> >> Maybe we should take SHA-256, which seems to be better.
> >> (There's a way to generate files with identical MD5).
> >>
> >> Ad 2: Would we copy the MD5/.../file to a new name, change it there,
> >> and copy it to the existing location? Then the previous and current data
> >> don't share history, as they're not directly connected.
> >> Or otherwise, if we'd change the "local" location, and copy to MD5/...,
> >> then every identical file copied from there has a bad history ...
> >
> > It would be great if both would be possible. The first method covers
> > files that become identical "by chance" and don't share the history (I
> > think, it should be the default behaviour) and the second case covers
> > the renaming. But telling the two cases apart requires IMHO different
> > commands for these operations. Hmm... it must be strictly avoided
> > that the user can have access to two "different" files with the same
> > content and the same history, I mean where the paths/names are
> > different but the contents are equal. This can happen and those
> > files must not share the history. So I think, the command that does
> > "rename with history" must either do a true rename within the database
> > or imply that the old file is removed.
> Did my previous sentences clear that up a bit? If not, I'll try to express
> that with a picture.

Not necessary :-) I think it's clear, thanks!

> >> Ad 3: There's a thread "Stage 1 of true rename support"
> >> http://marc.theaimsgroup.com/?t=111661193300004&r=1&w=2 starting
> >> 2005-05-20, where I mentioned issue 2286, and where it seemed to fit
> >> (at least IMO). I got no answer to that, though.
> >> I don't know the current status of that.
> >
> > What if a whole directory is renamed?
> >
> > "------- Additional comments from Peter ... Wed May ... -------
> >
> > PUtting in unscheduled."
> >
> > What could "PUtting in unscheduled." possibly mean?!? This was a comment
> > here: http://subversion.tigris.org/issues/show_bug.cgi?id=2286
> They're waiting for patches?

Aha :-)

> > What if a whole directory is renamed?
> Currently: The directory is checked in as new, with all files as new.
> With shared space in the repository we'd just have duplicated space for
> the directory tree, the files should not need anything (as the storage
> space is shared).

Sounds good enough, spacewise. Directories should not take much space.

> Later: fsvs could detect that eg. the inode is the same and the contents
> are mostly the same, deduce a rename, and send a rename (or a copy/delete)
> to the repository ... But that's not even on my TODO currently.

Can fsvs see that the inode is the same? Then it should be easy. If fsvs needs
to deduce a rename, this would cause a lot of internal work. IMHO, the cheapest
thing for later would be to have a special rename command in the client.
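
Just to illustrate what "seeing the same inode" could look like, here is a
rough Python sketch. It is not how fsvs works (I don't know its internals);
snapshot() and rename_candidates() are hypothetical helpers, and the sketch
assumes a filesystem with stable inode numbers and ignores hard links and
inode reuse after a delete.

```python
import os

def snapshot(root):
    """Map (st_dev, st_ino) -> relative path for every file below root."""
    seen = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            # With hard links, the last path visited wins (a simplification).
            seen[(st.st_dev, st.st_ino)] = os.path.relpath(full, root)
    return seen

def rename_candidates(old, new):
    # Same inode in both snapshots but under a different path:
    # possibly a rename. The contents would still have to be compared.
    return sorted((old[k], new[k])
                  for k in old.keys() & new.keys()
                  if old[k] != new[k])
```

Even with this, the client has to keep the old inode numbers around, so an
explicit rename command still looks cheaper to me.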

> So please let me repeat myself :-)
> Any other ideas?

"Eg. for a file with MD5 of 8a04f87ad04f4a1d3c7e6ca12e07290d 

repository/
  ...
  db/
    ...
    md5index/
      8a/
        04/
          f87a.index
"
Hmm, how about:
repository/
  ...
  db/
    ...
    md5index/
      8a/
        04/
          8a04f87ad04f4a1d3c7e6ca12e07290d
?
I mean a file where the name is the full hash value.
Advantages: 
- No need to open and read a file when you look for a specific hash value.
Just try a stat() on the name you already know from the calculation you just did.
- No need to add checksums to an index file (read-modify-write)
The "Con: FSFS cannot be append-only; the indizes have to be written 
and re-written." wouldn't be a problem anymore :-)

Disadvantage: 
- One hash-file for every stored real file. That might become a huge number of files.
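
A minimal Python sketch of this lookup, assuming the layout above (two
directory levels taken from the first four hex digits, file name equal to the
full hash). The function names are invented for illustration, and what the
hash-file would actually contain is left open; here it is just an empty marker.

```python
import hashlib
import os

def index_path(index_root, digest):
    # e.g. md5index/8a/04/8a04f87ad04f4a1d3c7e6ca12e07290d
    return os.path.join(index_root, digest[0:2], digest[2:4], digest)

def is_known(index_root, data):
    """Check for existing content with a single stat(), no file is read."""
    digest = hashlib.md5(data).hexdigest()
    try:
        os.stat(index_path(index_root, digest))
        return True
    except FileNotFoundError:
        return False

def register(index_root, data):
    """Create the hash-named file; no existing index file is rewritten."""
    path = index_path(index_root, hashlib.md5(data).hexdigest())
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a"):
        pass  # empty marker file (an assumption of this sketch)
    return path
```

Adding a new hash only creates a new file, so nothing that already exists has
to be modified.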

> If not, I'll let that sink in a bit (about a week), and then maybe try to
> implement issue 2286.

In any case, you have much more insight than I do.

Cheers
  Dirk
