Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 2:59 PM Jeff King  wrote:
> On Thu, Aug 09, 2018 at 02:53:42PM -0700, Elijah Newren wrote:
>
> > On Thu, Aug 9, 2018 at 2:44 PM Jeff King  wrote:
> > > > The error message isn't quite as good, but does the user really need
> > > > all the names of the file?  If so, we gave them enough information to
> > > > figure it out, and this is a really unusual case anyway, right?
> > > > Besides, now we're back to linear performance.
> > >
> > > Well, it's still quadratic when they run O(n) iterations of "git
> > > ls-files -s | grep $colliding_oid". You've just pushed the second linear
> > > search onto the user. ;)
> >
> > Wouldn't that be their own fault for not running
> >   git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2
> > ?   ;-)
>
> Man, this thread is the gift that keeps on giving. :)
>
> That's still quadratic, isn't it? You've just hidden the second
> dimension in the single grep call.

It may depend on the implementation within grep.  If I had remembered
to pass -F, and if wikipedia's claims[1] are right that the
Aho-Corasick algorithm was the basis of the original Unix command
fgrep, and that this algorithm is "linear in the length of the
strings plus the length of the searched text plus the number of
output matches" (and I didn't just misunderstand something there),
then it looks like it could be linear.

[1] https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

Of course, the grep implementation could also do something stupid and
provide dramatically worse behavior than running the command N times
specifying one pattern each time.[2]

[2] http://savannah.gnu.org/bugs/?16305

> Now since these are all going to be constant strings, in theory an
> intelligent grep could stick them all in a search trie, and match each
> line with complexity k, the length of the matched strings. And since
> k=40, that's technically still linear overall.

Looks like Aho-Corasick uses "a finite-state machine that resembles a
trie", so definitely along those lines.


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Junio C Hamano
Elijah Newren  writes:

> A possibly crazy idea: Don't bother reporting the other filename; just
> report the OID instead.
>
> "Error: Foo.txt cannot be checked out because another file with hash
>  is in the way."  Maybe even add a hint for the user: "Run
> `git ls-files -s` to see all files and their hashes".

Once we start using OID to talk to humans, we are already lost.  At
that point, it would be 1000% better to

 - not check out Foo.txt to unlink and overwrite; instead leave the
   content that is already in the working tree as-is; and

 - report that Foo.txt was not checked out because something else
   that also claims to deserve that path was checked out already.

Then the user can inspect Foo.txt and realize it is actually the
contents that should be in foo.txt or whatever.




Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:53:42PM -0700, Elijah Newren wrote:

> On Thu, Aug 9, 2018 at 2:44 PM Jeff King  wrote:
> > > The error message isn't quite as good, but does the user really need
> > > all the names of the file?  If so, we gave them enough information to
> > > figure it out, and this is a really unusual case anyway, right?
> > > Besides, now we're back to linear performance.
> >
> > Well, it's still quadratic when they run O(n) iterations of "git
> > ls-files -s | grep $colliding_oid". You've just pushed the second linear
> > search onto the user. ;)
> 
> Wouldn't that be their own fault for not running
>   git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2
> ?   ;-)

Man, this thread is the gift that keeps on giving. :)

That's still quadratic, isn't it? You've just hidden the second
dimension in the single grep call.

Now since these are all going to be constant strings, in theory an
intelligent grep could stick them all in a search trie, and match each
line with complexity k, the length of the matched strings. And since
k=40, that's technically still linear overall.

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 2:44 PM Jeff King  wrote:
> > The error message isn't quite as good, but does the user really need
> > all the names of the file?  If so, we gave them enough information to
> > figure it out, and this is a really unusual case anyway, right?
> > Besides, now we're back to linear performance.
>
> Well, it's still quadratic when they run O(n) iterations of "git
> ls-files -s | grep $colliding_oid". You've just pushed the second linear
> search onto the user. ;)

Wouldn't that be their own fault for not running
  git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2
?   ;-)


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:40:58PM -0700, Elijah Newren wrote:

> > I worry that the false positives make this a non-starter.  I mean, if
> > clone creates files 'A' and 'B' (both equal) and then tries to create
> > 'b', would the collision code report that 'b' collided with 'A' because
> > that was the first OID match?  Ideally with this scheme we'd have to
> > search the entire index prior to 'b' and then report that 'b' collided
> > with either 'A' or 'B'.  Neither message instills confidence.  And
> > there's no way to prefer answer 'B' over 'A' without using knowledge
> > of the FS name mangling/aliasing rules -- unless we want to just assume
> > ignore-case for this iteration.
> 
> A possibly crazy idea: Don't bother reporting the other filename; just
> report the OID instead.
> 
> "Error: Foo.txt cannot be checked out because another file with hash
>  is in the way."  Maybe even add a hint for the user: "Run
> `git ls-files -s` to see all files and their hashes".
> 
> Whatever the exact wording for the error message, just create a nice
> post on stackoverflow.com explaining the various weird filesystems out
> there (VFAT, NTFS, HFS, APFS, etc) and how they cause differing
> filenames to be written to the same location.  Have a bunch of folks
> vote it up so it has some nice search-engine juice.

Actually, I kind of like the simplicity of that. It puts the human brain
in the loop.

> The error message isn't quite as good, but does the user really need
> all the names of the file?  If so, we gave them enough information to
> figure it out, and this is a really unusual case anyway, right?
> Besides, now we're back to linear performance.

Well, it's still quadratic when they run O(n) iterations of "git
ls-files -s | grep $colliding_oid". You've just pushed the second linear
search onto the user. ;)

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 2:14 PM Jeff Hostetler  wrote:
> On 8/9/2018 10:23 AM, Jeff King wrote:
> > On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
> >> If we found that there is something when we tried to write out
> >> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> >> it, we should find the matching blob somewhere in the index _before_
> >> "Foo.txt".  On a case-insensitive filesytem, it may well be
> >> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> >> only differ in case.
> >
> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files).  But the fact that you only do the hashing on known duplicates
> > helps with that.
>
> I worry that the false positives make this a non-starter.  I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code report that 'b' collided with 'A' because
> that was the first OID match?  Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'.  Neither message instills confidence.  And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

A possibly crazy idea: Don't bother reporting the other filename; just
report the OID instead.

"Error: Foo.txt cannot be checked out because another file with hash
 is in the way."  Maybe even add a hint for the user: "Run
`git ls-files -s` to see all files and their hashes".

Whatever the exact wording for the error message, just create a nice
post on stackoverflow.com explaining the various weird filesystems out
there (VFAT, NTFS, HFS, APFS, etc) and how they cause differing
filenames to be written to the same location.  Have a bunch of folks
vote it up so it has some nice search-engine juice.


The error message isn't quite as good, but does the user really need
all the names of the file?  If so, we gave them enough information to
figure it out, and this is a really unusual case anyway, right?
Besides, now we're back to linear performance.


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 05:14:16PM -0400, Jeff Hostetler wrote:

> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files).  But the fact that you only do the hashing on known duplicates
> > helps with that.
> > 
> 
> I worry that the false positives make this a non-starter.  I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code report that 'b' collided with 'A' because
> that was the first OID match?  Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'.  Neither message instills confidence.  And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

Yeah. If we can get usable unique ids (inode or otherwise) of some form
on each system, I think I prefer that. It's much easier to reason about.

> Sorry to be paranoid, but I have an index with 3.5M entries; the word
> "quadratic" rings all kinds of alarm bells for me.  :-)
> 
> Granted, we expect the number of collisions to be small, but searching
> back for each collision over the already-populated portion of the index
> could be expensive.

Heh. Yeah, it's really O(n*m), and we expect a small "m". But of course
the two are equal in the worst case.

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff Hostetler
On 8/9/2018 10:23 AM, Jeff King wrote:
> On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
>
>>> If we have an equivalence-class hashmap and feed it inodes (or again,
>>> some system equivalent) as the keys, we should get buckets of
>>> collisions.
>>
>> I guess one way to get "some system equivalent" that can be used as
>> the last resort, when there absolutely is no inum equivalent, is to
>> rehash the working tree file that shouldn't be there when we detect
>> a collision.
>>
>> If we found that there is something when we tried to write out
>> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
>> it, we should find the matching blob somewhere in the index _before_
>> "Foo.txt".  On a case-insensitive filesystem, it may well be
>> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
>> only differ in case.
>
> Clever. You might still run into false positives when there is
> duplicated content in the repository (especially, say, zero-length
> files).  But the fact that you only do the hashing on known duplicates
> helps with that.

I worry that the false positives make this a non-starter.  I mean, if
clone creates files 'A' and 'B' (both equal) and then tries to create
'b', would the collision code report that 'b' collided with 'A' because
that was the first OID match?  Ideally with this scheme we'd have to
search the entire index prior to 'b' and then report that 'b' collided
with either 'A' or 'B'.  Neither message instills confidence.  And
there's no way to prefer answer 'B' over 'A' without using knowledge
of the FS name mangling/aliasing rules -- unless we want to just assume
ignore-case for this iteration.

> One of the things I did like about the equivalence-class approach is
> that it can be done in a single linear pass in the worst case. Whereas
> anything that searches when we see a collision is quite likely to be
> quadratic. But as I said before, it may not be worth worrying too much
> about that for an error code path where we expect the number of
> collisions to be small.
>
> -Peff

Sorry to be paranoid, but I have an index with 3.5M entries; the word
"quadratic" rings all kinds of alarm bells for me.  :-)

Granted, we expect the number of collisions to be small, but searching
back for each collision over the already-populated portion of the index
could be expensive.

Jeff



Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:

> > If we have an equivalence-class hashmap and feed it inodes (or again,
> > some system equivalent) as the keys, we should get buckets of
> > collisions.
> 
> I guess one way to get "some system equivalent" that can be used as
> the last resort, when there absolutely is no inum equivalent, is to
> rehash the working tree file that shouldn't be there when we detect
> a collision.
> 
> If we found that there is something when we tried to write out
> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> it, we should find the matching blob somewhere in the index _before_
> "Foo.txt".  On a case-insensitive filesytem, it may well be
> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> only differ in case.

Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files).  But the fact that you only do the hashing on known duplicates
helps with that.

One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass in the worst case. Whereas
anything that searches when we see a collision is quite likely to be
quadratic. But as I said before, it may not be worth worrying too much
about that for an error code path where we expect the number of
collisions to be small.

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-08 Thread Junio C Hamano
Jeff King  writes:

> I think we really want to avoid doing that normalization ourselves if we
> can. There are just too many filesystem-specific rules.

Exactly; not having to learn these rules is the major (if not whole)
point of the "let checkout notice the collision and then deal with
it" approach.  Let's not forget that.

> If we have an equivalence-class hashmap and feed it inodes (or again,
> some system equivalent) as the keys, we should get buckets of
> collisions.

I guess one way to get "some system equivalent" that can be used as
the last resort, when there absolutely is no inum equivalent, is to
rehash the working tree file that shouldn't be there when we detect
a collision.

If we found that there is something when we tried to write out
"Foo.txt", if we open "Foo.txt" on the working tree and hash-object
it, we should find the matching blob somewhere in the index _before_
"Foo.txt".  On a case-insensitive filesytem, it may well be
"foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
only differ in case.

Of course, that's really the last resort, as it would be costly, but
this is something that only needs to happen in the "unusual" case in
the error codepath, so...
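
(A rough sketch of that last-resort probe, in perl rather than the
eventual C, assuming a `git hash-object` of the leftover file is an
acceptable cost in this error path:)

-- >8 --
#!/usr/bin/perl
# checkout found something already sitting at the path it wanted to
# write; hash the on-disk content and look for an index entry
# _before_ that path with the same blob: on a case-smashing
# filesystem, that entry is the likely alias
use strict;
use warnings;

my $path = shift or die "usage: $0 <colliding-path>\n";
chomp(my $oid = qx(git hash-object -- "$path"));

for my $line (qx(git ls-files -s)) {
  my ($blob, $entry) = $line =~ /^\S+ (\S+) \S+\t(.*)$/ or next;
  last if $entry eq $path;  # only look at entries before the collision
  print "likely alias: $entry\n" if $blob eq $oid;
}
-- >8 --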


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-08 Thread Jeff King
On Wed, Aug 08, 2018 at 03:48:04PM -0400, Jeff Hostetler wrote:

> > ce_match_stat() may not be a very good measure to see if two paths
> > refer to the same file, though.  After a fresh checkout, I would not
> > be surprised if two completely unrelated paths have the same size
> > and have same mtime/ctime.  In its original use case, i.e. "I have
> > one specific path in mind and took a snapshot of its metadata
> > earlier.  Is it still the same, or has it been touched?", that may
> > be sufficient to detect that the path has been _modified_, but
> > without reliable inum, it may be a poor measure to say two paths
> > refer to the same file.
> 
> I agree with Junio on this one.  The mtime values are sloppy at best.
> On FAT file systems, they have 2 second resolution.  Even NTFS IIRC
> has only 100ns resolution, so we might get a lot of false matches
> using this technique, right?

Yeah, I think anything less than inode (or some system equivalent) is
going to be too flaky.

> It might be better to build an equivalence-class hash-map for the
> colliding entries.  Compute a "normalized" version of the pathname
> (such as convert to lowercase, strip final-dots/spaces, strip the
> digits following the tilde of a shortname, etc. for the Mac's UTF-isms).
> Then when you rescan the index entries to find the matches, apply the
> equivalence operator on the pathname and do the hashmap lookup.
> When you find a match, you have a "potential" collider pair (I say
> potential only because of the ambiguity of shortnames).  Then we
> can use inum/file-index/whatever to see if they actually collide.

I think we really want to avoid doing that normalization ourselves if we
can. There are just too many filesystem-specific rules.

If we have an equivalence-class hashmap and feed it inodes (or again,
some system equivalent) as the keys, we should get buckets of
collisions. I started to write a "something like this..." earlier, but
got bogged down in boilerplate around the C hashmap.

But here it is in perl. ;)

-- >8 --
# pretend we have these paths in our index
paths='foo FOO and some other paths'

# create them; this will make a single path on a case-insensitive system
for i in $paths; do
  echo $i >$i
done

# now find the duplicates
perl -le '
  for my $path (@ARGV) {
    # this would be an ntfs unique-id on Windows
    my $inode = (lstat($path))[1];
    push @{$h{$inode}}, $path;
  }

  for my $group (grep { @$_ > 1 } values(%h)) {
    print "group:";
    print "  ", $_ for (@$group);
  }
' $paths
-- >8 --

which should show the obvious pair (it does for me on vfat-on-linux,
though where it gets those inodes from, I have no idea ;) ).

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-08 Thread Jeff Hostetler
On 8/7/2018 3:31 PM, Junio C Hamano wrote:
> Nguyễn Thái Ngọc Duy  writes:
>
>>  One nice thing about this is we don't need platform specific code for
>>  detecting the duplicate entries. I think ce_match_stat() works even
>>  on Windows. And it's now equally expensive on all platforms :D
>
> ce_match_stat() may not be a very good measure to see if two paths
> refer to the same file, though.  After a fresh checkout, I would not
> be surprised if two completely unrelated paths have the same size
> and have the same mtime/ctime.  In its original use case, i.e. "I have
> one specific path in mind and took a snapshot of its metadata
> earlier.  Is it still the same, or has it been touched?", that may
> be sufficient to detect that the path has been _modified_, but
> without reliable inum, it may be a poor measure to say two paths
> refer to the same file.


I agree with Junio on this one.  The mtime values are sloppy at best.
On FAT file systems, they have 2 second resolution.  Even NTFS IIRC
has only 100ns resolution, so we might get a lot of false matches
using this technique, right?

It might be better to build an equivalence-class hash-map for the
colliding entries.  Compute a "normalized" version of the pathname
(such as convert to lowercase, strip final-dots/spaces, strip the
digits following the tilde of a shortname, etc. for the Mac's UTF-isms).
Then when you rescan the index entries to find the matches, apply the
equivalence operator on the pathname and do the hashmap lookup.
When you find a match, you have a "potential" collider pair (I say
potential only because of the ambiguity of shortnames).  Then we
can use inum/file-index/whatever to see if they actually collide.
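
(A rough sketch of such an equivalence operator in perl; every rule
below only approximates the real VFAT/NTFS/HFS+ behavior, which is
exactly what makes this approach hard to get fully right:)

-- >8 --
use strict;
use warnings;

# crude pathname normalization; each rule approximates, rather than
# reimplements, a filesystem's actual aliasing behavior
sub normalize {
  my ($path) = @_;
  $path = lc $path;           # case-insensitivity
  $path =~ s/[. ]+$//;        # Windows drops trailing dots and spaces
  $path =~ s/~\d+(?=\.|$)//;  # crude 8.3 shortname "~N" stripping
  # HFS+/APFS Unicode normalization (NFC vs NFD) would go here too
  return $path;
}

# bucket by normalized name; any bucket with more than one entry is a
# potential collider group
my %bucket;
push @{$bucket{normalize($_)}}, $_ for @ARGV;
print join(", ", @$_), "\n" for grep { @$_ > 1 } values %bucket;
-- >8 --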


Jeff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-07 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy  writes:

>  One nice thing about this is we don't need platform specific code for
>  detecting the duplicate entries. I think ce_match_stat() works even
>  on Windows. And it's now equally expensive on all platforms :D

ce_match_stat() may not be a very good measure to see if two paths
refer to the same file, though.  After a fresh checkout, I would not
be surprised if two completely unrelated paths have the same size
and have the same mtime/ctime.  In its original use case, i.e. "I have
one specific path in mind and took a snapshot of its metadata
earlier.  Is it still the same, or has it been touched?", that may
be sufficient to detect that the path has been _modified_, but
without reliable inum, it may be a poor measure to say two paths
refer to the same file.
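
(That is easy to demonstrate with a throwaway perl check: two unrelated
same-size files written back-to-back will typically agree on size and
mtime, and only the inode tells them apart:)

-- >8 --
use strict;
use warnings;

# two unrelated files with the same size, created back-to-back
for my $f ('a.txt', 'b.txt') {
  open my $fh, '>', $f or die "$f: $!";
  print $fh "$f\n";  # same length, different content
  close $fh;
}

my @a = stat 'a.txt';
my @b = stat 'b.txt';
# stat fields: [1] inode, [7] size, [9] mtime (whole seconds)
printf "size equal: %d  mtime equal: %d  inode equal: %d\n",
  $a[7] == $b[7], $a[9] == $b[9], $a[1] == $b[1];
-- >8 --

which typically prints "size equal: 1  mtime equal: 1  inode equal: 0".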

>  builtin/clone.c |  1 +
>  cache.h         |  2 ++
>  entry.c         | 44 ++++++++++++++++++++++++++++++++++++++++++++++++
>  unpack-trees.c  | 23 +++++++++++++++++++++++
>  unpack-trees.h  |  1 +
>  5 files changed, 71 insertions(+)

Having said that, it is pleasing to see that this can be achieved
with so little additional code.

> +static void mark_duplicate_entries(const struct checkout *state,
> +				   struct cache_entry *ce, struct stat *st)
> +{
> +	int i;
> +	int *count = state->nr_duplicates;
> +
> +	if (!count)
> +		BUG("state->nr_duplicates must not be NULL");
> +
> +	ce->ce_flags |= CE_MATCHED;
> +	(*count)++;
> +
> +	if (!state->refresh_cache)
> +		BUG("We need this to narrow down the set of updated entries");
> +
> +	for (i = 0; i < state->istate->cache_nr; i++) {
> +		struct cache_entry *dup = state->istate->cache[i];
> +
> +		/*
> +		 * This entry has not been checked out yet, otherwise
> +		 * its stat info must have been updated. And since we
> +		 * check out from top to bottom, the rest is guaranteed
> +		 * not checked out. Stop now.
> +		 */
> +		if (!ce_uptodate(dup))
> +			break;
> +
> +		if (dup->ce_flags & CE_MATCHED)
> +			continue;
> +
> +		if (ce_match_stat(dup, st,
> +				  CE_MATCH_IGNORE_VALID |
> +				  CE_MATCH_IGNORE_SKIP_WORKTREE))
> +			continue;
> +
> +		dup->ce_flags |= CE_MATCHED;
> +		(*count)++;
> +		break;
> +	}
> +}
> +

Hmph.  If there is only one true collision, then all its aliases
will be marked with the CE_MATCHED bit every time the second and the
subsequent alias is checked out (as the caller calls this function
when it noticed that something already is at the path ce wants to
go).  But if there are two sets of colliding paths, because there is
only one bit used, we do not group the paths into these two sets and
report, e.g. "blue.txt, BLUE.txt and BLUE.TXT collide.  red.txt and
RED.txt also collide."  I am not sure if computing that is too much
work for too little gain, but because this is in an error codepath,
it may be worth doing.  I dunno.
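
(Grouping is essentially the equivalence-class bucketing discussed
earlier in the thread; a minimal perl sketch of the desired report,
assuming inodes or some per-system file id are available for the
colliding paths:)

-- >8 --
use strict;
use warnings;

# bucket colliding paths by inode (or an NTFS file id on Windows)
# so each set can be reported as one group
my %groups;
for my $path (@ARGV) {
  my $inode = (lstat $path)[1];
  push @{$groups{$inode}}, $path;
}
for my $set (grep { @$_ > 1 } values %groups) {
  print join(", ", @$set), " collide\n";
}
-- >8 --

Fed blue.txt, BLUE.txt, BLUE.TXT, red.txt and RED.txt on a
case-insensitive filesystem, that prints the two groups separately.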

> +
> +	if (o->clone && state.nr_duplicates) {
> +		warning(_("the following paths in this repository only differ in case\n"
> +			  "from another path and will cause problems because you have cloned\n"
> +			  "it on a case-insensitive filesystem:\n"));

With the new approach, we no longer preemptively detect that the
project will be harmed by a case-smashing filesystem before it
happens.  This instead reports that the project has already been
harmed on _this_ system by such a filesystem after the fact.

So from the end-user's point of view, "will cause problems" may be a
message that came a bit too late.  "have collided and only one from
the same colliding group is in the working tree; others failed to be
checked out" is probably closer to the truth.