Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 9, 2018 at 2:59 PM Jeff King wrote:
> On Thu, Aug 09, 2018 at 02:53:42PM -0700, Elijah Newren wrote:
>
> > On Thu, Aug 9, 2018 at 2:44 PM Jeff King wrote:
> > > > The error message isn't quite as good, but does the user really need
> > > > all the names of the file? If so, we gave them enough information to
> > > > figure it out, and this is a really unusual case anyway, right?
> > > > Besides, now we're back to linear performance
> > >
> > > Well, it's still quadratic when they run O(n) iterations of "git
> > > ls-files -s | grep $colliding_oid". You've just pushed the second linear
> > > search onto the user. ;)
> >
> > Wouldn't that be their own fault for not running
> >   git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2
> > ? ;-)
>
> Man, this thread is the gift that keeps on giving. :)
>
> That's still quadratic, isn't it? You've just hidden the second
> dimension in the single grep call.

It may depend on the implementation within grep. If I had remembered to
pass -F, then it looks like it could be linear: wikipedia claims[1] that
the Aho-Corasick algorithm was the basis of the original Unix fgrep
command, and that its running time is "linear in the length of the
strings plus the length of the searched text plus the number of output
matches" -- assuming I didn't just misunderstand something there.

[1] https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

Of course, the grep implementation could also do something stupid and
provide dramatically worse behavior than running the command N times
specifying one pattern each time.[2]

[2] http://savannah.gnu.org/bugs/?16305

> Now since these are all going to be constant strings, in theory an
> intelligent grep could stick them all in a search trie, and match each
> line with complexity k, the length of the matched strings. And since
> k=40, that's technically still linear overall.
Looks like Aho-Corasick uses "a finite-state machine that resembles a trie", so definitely along those lines.
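The trie idea has an even simpler analogue for this particular case: every pattern is a fixed-width hex OID sitting in a known field of `git ls-files -s` output, so one pass with a hash-set lookup is linear no matter how many colliding OIDs there are. A minimal sketch of that point (the sample data and function name are illustrative, not from the thread):

```python
# Hypothetical colliding OIDs; the first happens to be the empty blob.
colliding_oids = {
    "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
    "257cc5642cb1a054f08cc83f2d943e56fd3ebe99",
}

# Each `git ls-files -s` line has the form "<mode> <oid> <stage>\t<path>".
sample_index = [
    "100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tFoo.txt",
    "100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tfoo.txt",
    "100644 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 0\tREADME",
    "100644 1234567890abcdef1234567890abcdef12345678 0\tMakefile",
]

def find_collisions(lines, oids):
    """One O(1) set lookup per line: linear in the index size,
    regardless of how many colliding OIDs we search for."""
    hits = []
    for line in lines:
        meta, path = line.split("\t", 1)
        oid = meta.split()[1]  # second whitespace-separated field
        if oid in oids:
            hits.append((path, oid))
    return hits

for path, oid in find_collisions(sample_index, colliding_oids):
    print(path, oid)
```

No trie is needed because the patterns never overlap partial fields; the trie only matters for a general-purpose tool like grep that cannot assume the input layout.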
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
Elijah Newren writes:

> A possibly crazy idea: Don't bother reporting the other filename; just
> report the OID instead.
>
> "Error: Foo.txt cannot be checked out because another file with hash
> is in the way." Maybe even add a hint for the user: "Run
> `git ls-files -s` to see all files and their hash".

Once we start using OID to talk to humans, we are already lost. At that
point, it would be 1000% better to

 - not check out Foo.txt to unlink and overwrite; instead leave the
   content that is already in the working tree as-is; and

 - report that Foo.txt was not checked out, as something else that also
   claims to deserve that path was checked out already.

Then the user can inspect Foo.txt and realize it is actually the
contents that should be in foo.txt or whatever.
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 09, 2018 at 02:53:42PM -0700, Elijah Newren wrote:
> On Thu, Aug 9, 2018 at 2:44 PM Jeff King wrote:
> > > The error message isn't quite as good, but does the user really need
> > > all the names of the file? If so, we gave them enough information to
> > > figure it out, and this is a really unusual case anyway, right?
> > > Besides, now we're back to linear performance
> >
> > Well, it's still quadratic when they run O(n) iterations of "git
> > ls-files -s | grep $colliding_oid". You've just pushed the second linear
> > search onto the user. ;)
>
> Wouldn't that be their own fault for not running
>   git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2
> ? ;-)

Man, this thread is the gift that keeps on giving. :)

That's still quadratic, isn't it? You've just hidden the second
dimension in the single grep call.

Now since these are all going to be constant strings, in theory an
intelligent grep could stick them all in a search trie, and match each
line with complexity k, the length of the matched strings. And since
k=40, that's technically still linear overall.

-Peff
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 9, 2018 at 2:44 PM Jeff King wrote:
> > The error message isn't quite as good, but does the user really need
> > all the names of the file? If so, we gave them enough information to
> > figure it out, and this is a really unusual case anyway, right?
> > Besides, now we're back to linear performance
>
> Well, it's still quadratic when they run O(n) iterations of "git
> ls-files -s | grep $colliding_oid". You've just pushed the second linear
> search onto the user. ;)

Wouldn't that be their own fault for not running

  git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2

? ;-)
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 09, 2018 at 02:40:58PM -0700, Elijah Newren wrote:
> > I worry that the false positives make this a non-starter. I mean, if
> > clone creates files 'A' and 'B' (both equal) and then tries to create
> > 'b', would the collision code report that 'b' collided with 'A' because
> > that was the first OID match? Ideally with this scheme we'd have to
> > search the entire index prior to 'b' and then report that 'b' collided
> > with either 'A' or 'B'. Neither message instills confidence. And
> > there's no way to prefer answer 'B' over 'A' without using knowledge
> > of the FS name mangling/aliasing rules -- unless we want to just assume
> > ignore-case for this iteration.
>
> A possibly crazy idea: Don't bother reporting the other filename; just
> report the OID instead.
>
> "Error: Foo.txt cannot be checked out because another file with hash
> is in the way." Maybe even add a hint for the user: "Run
> `git ls-files -s` to see all files and their hash".
>
> Whatever the exact wording for the error message, just create a nice
> post on stackoverflow.com explaining the various weird filesystems out
> there (VFAT, NTFS, HFS, APFS, etc) and how they cause differing
> filenames to be written to the same location. Have a bunch of folks
> vote it up so it has some nice search-engine juice.

Actually, I kind of like the simplicity of that. It puts the human
brain in the loop.

> The error message isn't quite as good, but does the user really need
> all the names of the file? If so, we gave them enough information to
> figure it out, and this is a really unusual case anyway, right?
> Besides, now we're back to linear performance

Well, it's still quadratic when they run O(n) iterations of "git
ls-files -s | grep $colliding_oid". You've just pushed the second linear
search onto the user. ;)

-Peff
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 9, 2018 at 2:14 PM Jeff Hostetler wrote:
> On 8/9/2018 10:23 AM, Jeff King wrote:
> > On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
> > > If we found that there is something when we tried to write out
> > > "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> > > it, we should find the matching blob somewhere in the index _before_
> > > "Foo.txt". On a case-insensitive filesystem, it may well be
> > > "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> > > only differ in case.
> >
> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files). But the fact that you only do the hashing on known duplicates
> > helps with that.
>
> I worry that the false positives make this a non-starter. I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code report that 'b' collided with 'A' because
> that was the first OID match? Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'. Neither message instills confidence. And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

A possibly crazy idea: Don't bother reporting the other filename; just
report the OID instead.

"Error: Foo.txt cannot be checked out because another file with hash
is in the way." Maybe even add a hint for the user: "Run
`git ls-files -s` to see all files and their hash".

Whatever the exact wording for the error message, just create a nice
post on stackoverflow.com explaining the various weird filesystems out
there (VFAT, NTFS, HFS, APFS, etc) and how they cause differing
filenames to be written to the same location. Have a bunch of folks
vote it up so it has some nice search-engine juice.

The error message isn't quite as good, but does the user really need
all the names of the file? If so, we gave them enough information to
figure it out, and this is a really unusual case anyway, right?
Besides, now we're back to linear performance
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Thu, Aug 09, 2018 at 05:14:16PM -0400, Jeff Hostetler wrote:
> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files). But the fact that you only do the hashing on known duplicates
> > helps with that.
>
> I worry that the false positives make this a non-starter. I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code report that 'b' collided with 'A' because
> that was the first OID match? Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'. Neither message instills confidence. And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

Yeah. If we can get usable unique ids (inode or otherwise) of some form
on each system, I think I prefer that. It's much easier to reason about.

> Sorry to be paranoid, but I have an index with 3.5M entries, the word
> "quadratic" rings all kinds of alarm bells for me. :-)
>
> Granted, we expect the number of collisions to be small, but searching
> back for each collision over the already-populated portion of the index
> could be expensive.

Heh. Yeah, it's really O(n*m), and we expect a small "m". But of course
the two are equal in the worst case.

-Peff
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On 8/9/2018 10:23 AM, Jeff King wrote:
> On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
> > > If we have an equivalence-class hashmap and feed it inodes (or again,
> > > some system equivalent) as the keys, we should get buckets of
> > > collisions.
> >
> > I guess one way to get "some system equivalent" that can be used as
> > the last resort, when there absolutely is no inum equivalent, is to
> > rehash the working tree file that shouldn't be there when we detect
> > a collision.
> >
> > If we found that there is something when we tried to write out
> > "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> > it, we should find the matching blob somewhere in the index _before_
> > "Foo.txt". On a case-insensitive filesystem, it may well be
> > "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> > only differ in case.
>
> Clever. You might still run into false positives when there is
> duplicated content in the repository (especially, say, zero-length
> files). But the fact that you only do the hashing on known duplicates
> helps with that.

I worry that the false positives make this a non-starter. I mean, if
clone creates files 'A' and 'B' (both equal) and then tries to create
'b', would the collision code report that 'b' collided with 'A' because
that was the first OID match? Ideally with this scheme we'd have to
search the entire index prior to 'b' and then report that 'b' collided
with either 'A' or 'B'. Neither message instills confidence. And
there's no way to prefer answer 'B' over 'A' without using knowledge
of the FS name mangling/aliasing rules -- unless we want to just assume
ignore-case for this iteration.

> One of the things I did like about the equivalence-class approach is
> that it can be done in a single linear pass in the worst case. Whereas
> anything that searches when we see a collision is quite likely to be
> quadratic. But as I said before, it may not be worth worrying too much
> about that for an error code path where we expect the number of
> collisions to be small.
>
> -Peff

Sorry to be paranoid, but I have an index with 3.5M entries; the word
"quadratic" rings all kinds of alarm bells for me. :-)

Granted, we expect the number of collisions to be small, but searching
back for each collision over the already-populated portion of the index
could be expensive.

Jeff
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
> > If we have an equivalence-class hashmap and feed it inodes (or again,
> > some system equivalent) as the keys, we should get buckets of
> > collisions.
>
> I guess one way to get "some system equivalent" that can be used as
> the last resort, when there absolutely is no inum equivalent, is to
> rehash the working tree file that shouldn't be there when we detect
> a collision.
>
> If we found that there is something when we tried to write out
> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> it, we should find the matching blob somewhere in the index _before_
> "Foo.txt". On a case-insensitive filesystem, it may well be
> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> only differ in case.

Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files). But the fact that you only do the hashing on known duplicates
helps with that.

One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass in the worst case. Whereas
anything that searches when we see a collision is quite likely to be
quadratic. But as I said before, it may not be worth worrying too much
about that for an error code path where we expect the number of
collisions to be small.

-Peff
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
Jeff King writes:

> I think we really want to avoid doing that normalization ourselves if we
> can. There are just too many filesystem-specific rules.

Exactly; not having to learn these rules is the major (if not whole)
point of the "let checkout notice the collision and then deal with it"
approach. Let's not forget that.

> If we have an equivalence-class hashmap and feed it inodes (or again,
> some system equivalent) as the keys, we should get buckets of
> collisions.

I guess one way to get "some system equivalent" that can be used as
the last resort, when there absolutely is no inum equivalent, is to
rehash the working tree file that shouldn't be there when we detect
a collision.

If we found that there is something when we tried to write out
"Foo.txt", if we open "Foo.txt" on the working tree and hash-object
it, we should find the matching blob somewhere in the index _before_
"Foo.txt". On a case-insensitive filesystem, it may well be
"foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
only differ in case.

Of course, that's really the last resort, as it would be costly, but
this is something that only needs to happen in the "unusual" case in
the error codepath, so...
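The "rehash the file that is in the way" idea can be sketched in a few lines. This is an illustration only: `git_blob_oid` reproduces git's well-known blob hashing rule (sha1 over `"blob <size>\0"` plus the contents), while `find_colliding_entry` is a hypothetical stand-in for the index walk, not git's actual code:

```python
import hashlib

def git_blob_oid(data: bytes) -> str:
    """Hash file contents the way `git hash-object` does for a blob:
    sha1("blob <size>\\0" + contents)."""
    h = hashlib.sha1()
    h.update(b"blob %d\x00" % len(data))
    h.update(data)
    return h.hexdigest()

def find_colliding_entry(index_entries, path_being_written, on_disk_contents):
    """When writing path_being_written finds something already on disk,
    hash what is there and look for a matching blob among entries that
    come *before* it in the index (checkout proceeds in index order)."""
    oid = git_blob_oid(on_disk_contents)
    for entry_path, entry_oid in index_entries:
        if entry_path == path_being_written:
            break  # later entries have not been checked out yet
        if entry_oid == oid:
            return entry_path
    return None

# Two aliases of the empty blob: writing "Foo.txt" finds "foo.txt"'s file.
empty = git_blob_oid(b"")
index = [("foo.txt", empty), ("Foo.txt", empty)]
print(find_colliding_entry(index, "Foo.txt", b""))
```

Note how the zero-length-file false positive Peff mentions falls out directly: any earlier entry with the same blob, not just a case alias, would be reported.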
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On Wed, Aug 08, 2018 at 03:48:04PM -0400, Jeff Hostetler wrote:
> > ce_match_stat() may not be a very good measure to see if two paths
> > refer to the same file, though. After a fresh checkout, I would not
> > be surprised if two completely unrelated paths have the same size
> > and have same mtime/ctime. In its original use case, i.e. "I have
> > one specific path in mind and took a snapshot of its metadata
> > earlier. Is it still the same, or has it been touched?", that may
> > be sufficient to detect that the path has been _modified_, but
> > without reliable inum, it may be a poor measure to say two paths
> > refer to the same.
>
> I agree with Junio on this one. The mtime values are sloppy at best.
> On FAT file systems, they have 2-second resolution. Even NTFS IIRC
> has only 100ns resolution, so we might get a lot of false matches
> using this technique, right?

Yeah, I think anything less than inode (or some system equivalent) is
going to be too flaky.

> It might be better to build an equivalence-class hash-map for the
> colliding entries. Compute a "normalized" version of the pathname
> (such as convert to lowercase, strip final dots/spaces, strip the
> digits following the tilde of a shortname, etc., and likewise for
> the Mac's UTF-isms). Then when you rescan the index entries to find
> the matches, apply the equivalence operator on the pathname and do
> the hashmap lookup. When you find a match, you have a "potential"
> collider pair (I say potential only because of the ambiguity of
> shortnames). Then we can use inum/file-index/whatever to see if they
> actually collide.

I think we really want to avoid doing that normalization ourselves if we
can. There are just too many filesystem-specific rules.

If we have an equivalence-class hashmap and feed it inodes (or again,
some system equivalent) as the keys, we should get buckets of
collisions.

I started to write a "something like this..." earlier, but got bogged
down in boilerplate around the C hashmap.
But here it is in perl. ;)

-- >8 --
# pretend we have these paths in our index
paths='foo FOO and some other paths'

# create them; this will make a single path on a case-insensitive system
for i in $paths; do
	echo $i >$i
done

# now find the duplicates
perl -le '
	for my $path (@ARGV) {
		# this would be an ntfs unique-id on Windows
		my $inode = (lstat($path))[1];
		push @{$h{$inode}}, $path;
	}
	for my $group (grep { @$_ > 1 } values(%h)) {
		print "group:";
		print "  ", $_ for (@$group);
	}
' $paths
-- >8 --

which should show the obvious pair (it does for me on vfat-on-linux,
though where it gets those inodes from, I have no idea ;) ).

-Peff
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
On 8/7/2018 3:31 PM, Junio C Hamano wrote:
> Nguyễn Thái Ngọc Duy writes:
> > One nice thing about this is we don't need platform specific code for
> > detecting the duplicate entries. I think ce_match_stat() works even
> > on Windows. And it's now equally expensive on all platforms :D
>
> ce_match_stat() may not be a very good measure to see if two paths
> refer to the same file, though. After a fresh checkout, I would not
> be surprised if two completely unrelated paths have the same size
> and have same mtime/ctime. In its original use case, i.e. "I have
> one specific path in mind and took a snapshot of its metadata
> earlier. Is it still the same, or has it been touched?", that may
> be sufficient to detect that the path has been _modified_, but
> without reliable inum, it may be a poor measure to say two paths
> refer to the same.

I agree with Junio on this one. The mtime values are sloppy at best.
On FAT file systems, they have 2-second resolution. Even NTFS IIRC
has only 100ns resolution, so we might get a lot of false matches
using this technique, right?

It might be better to build an equivalence-class hash-map for the
colliding entries. Compute a "normalized" version of the pathname
(such as convert to lowercase, strip final dots/spaces, strip the
digits following the tilde of a shortname, etc., and likewise for the
Mac's UTF-isms). Then when you rescan the index entries to find the
matches, apply the equivalence operator on the pathname and do the
hashmap lookup. When you find a match, you have a "potential" collider
pair (I say potential only because of the ambiguity of shortnames).
Then we can use inum/file-index/whatever to see if they actually
collide.

Jeff
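For illustration, the normalization idea above might look like the sketch below. The `normalize` rules here are deliberately incomplete stand-ins (real filesystems have many more aliasing rules, e.g. HFS+ Unicode normalization), which is exactly why a match only yields a "potential" collider:

```python
import re

def normalize(path: str) -> str:
    """Deliberately incomplete equivalence key: case-fold, strip trailing
    dots/spaces, collapse the digits of an 8.3-style shortname."""
    p = path.lower().rstrip(". ")
    # "longna~1.txt" and "longna~2.txt" *may* alias a long name (illustrative)
    p = re.sub(r"~\d+", "~", p)
    return p

def bucket_colliders(paths):
    """One linear pass: bucket paths by normalized form and keep only
    buckets holding more than one path (the potential collider sets)."""
    buckets = {}
    for path in paths:
        buckets.setdefault(normalize(path), []).append(path)
    return {key: group for key, group in buckets.items() if len(group) > 1}

print(bucket_colliders(["Foo.txt", "foo.txt", "README", "Foo.txt. "]))
```

Each bucket would then be confirmed or rejected with inum/file-index checks, since normalization alone over-approximates (two different shortnames land in the same bucket even when they do not collide).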
Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems
Nguyễn Thái Ngọc Duy writes:

> One nice thing about this is we don't need platform specific code for
> detecting the duplicate entries. I think ce_match_stat() works even
> on Windows. And it's now equally expensive on all platforms :D

ce_match_stat() may not be a very good measure to see if two paths
refer to the same file, though. After a fresh checkout, I would not
be surprised if two completely unrelated paths have the same size
and have same mtime/ctime. In its original use case, i.e. "I have
one specific path in mind and took a snapshot of its metadata
earlier. Is it still the same, or has it been touched?", that may
be sufficient to detect that the path has been _modified_, but
without reliable inum, it may be a poor measure to say two paths
refer to the same.

>  builtin/clone.c | 1 +
>  cache.h         | 2 ++
>  entry.c         | 44
>  unpack-trees.c  | 23 +++
>  unpack-trees.h  | 1 +
>  5 files changed, 71 insertions(+)

Having said that, it is pleasing to see that this can be achieved with
so little additional code.

> +static void mark_duplicate_entries(const struct checkout *state,
> +				   struct cache_entry *ce, struct stat *st)
> +{
> +	int i;
> +	int *count = state->nr_duplicates;
> +
> +	if (!count)
> +		BUG("state->nr_duplicates must not be NULL");
> +
> +	ce->ce_flags |= CE_MATCHED;
> +	(*count)++;
> +
> +	if (!state->refresh_cache)
> +		BUG("We need this to narrow down the set of updated entries");
> +
> +	for (i = 0; i < state->istate->cache_nr; i++) {
> +		struct cache_entry *dup = state->istate->cache[i];
> +
> +		/*
> +		 * This entry has not been checked out yet, otherwise
> +		 * its stat info must have been updated. And since we
> +		 * check out from top to bottom, the rest is guaranteed
> +		 * not checked out. Stop now.
> +		 */
> +		if (!ce_uptodate(dup))
> +			break;
> +
> +		if (dup->ce_flags & CE_MATCHED)
> +			continue;
> +
> +		if (ce_match_stat(dup, st,
> +				  CE_MATCH_IGNORE_VALID |
> +				  CE_MATCH_IGNORE_SKIP_WORKTREE))
> +			continue;
> +
> +		dup->ce_flags |= CE_MATCHED;
> +		(*count)++;
> +		break;
> +	}
> +}

Hmph. If there is only one true collision, then all its aliases will
be marked with the CE_MATCHED bit every time the second and subsequent
aliases are checked out (as the caller calls this function when it
notices that something already is at the path ce wants to go to). But
if there are two sets of colliding paths, because there is only one
bit used, we do not group the paths into these two sets and report,
e.g. "blue.txt, BLUE.txt and BLUE.TXT collide. red.txt and RED.txt
also collide."

I am not sure if computing that is too much work for too little gain,
but because this is in an error codepath, it may be worth doing. I
dunno.

> +
> +	if (o->clone && state.nr_duplicates) {
> +		warning(_("the following paths in this repository only differ in case\n"
> +			  "from another path and will cause problems because you have cloned\n"
> +			  "it on a case-insensitive filesystem:\n"));

With the new approach, we no longer preemptively detect that the
project will be harmed by a case-smashing filesystem before it
happens. This instead reports that the project has already been harmed
on _this_ system by such a filesystem after the fact. So from the
end-user's point of view, "will cause problems" may be a message that
came a bit too late. "have collided and only one from the same
colliding group is in the working tree; others failed to be checked
out" is probably closer to the truth.
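The grouping Junio asks about could be done by recording, for each entry that failed to check out, which already-written path it collided with, and unioning those pairs into groups for the final report. A hypothetical sketch of that bookkeeping (the data shapes are invented for illustration; the patch itself only keeps a single CE_MATCHED bit):

```python
def group_colliders(checkout_order, collides_with):
    """collides_with maps each path that failed to check out to the
    already-checked-out path it collided with (as an inode or rehash
    probe would report). Returns one list per collision group, so the
    error message can say "blue.txt, BLUE.txt and BLUE.TXT collide;
    red.txt and RED.txt also collide."."""
    group_of = {}  # path -> index into groups
    groups = []
    for path in checkout_order:
        winner = collides_with.get(path)
        if winner is None:
            continue  # checked out cleanly
        if winner not in group_of:
            group_of[winner] = len(groups)
            groups.append([winner])
        group_of[path] = group_of[winner]
        groups[group_of[winner]].append(path)
    return groups

print(group_colliders(
    ["blue.txt", "red.txt", "BLUE.txt", "RED.txt", "BLUE.TXT"],
    {"BLUE.txt": "blue.txt", "RED.txt": "red.txt", "BLUE.TXT": "blue.txt"},
))
```

Because the lookup also works when a later alias reports an earlier *alias* (rather than the original) as its winner, the cost stays linear in the number of entries, avoiding the quadratic back-search the thread worries about.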