Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too
Jeff King writes: > To be honest, I don't know. Most of dir.c predates me, and I've tried to > avoid looking at it too hard. But I had a vague recollection of it being > "best effort", and this bit from cf424f5fd89b reinforces that: > > However, read_directory does not actually check against our pathspec. > It uses a simplified version that may turn up false positives. As a > result, we need to check that any hits match our pathspec. At least the original design of the traversal was "try making use of pathspec during the traversal when we can cheaply filter out obvious non-hits and avoid recursing into an entire hierarchy---false negative is an absolute no-no, but forcing the consumer to post filter is OK". > ... But I think that anybody who looks at the output of > fill_directory() does need to be aware that they may get more entries > than they expected, and has to apply the pathspecs themselves. That matches with my understanding of how "fill" thing worked from early days.
Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too
On Thu, Apr 05, 2018 at 12:15:59PM -0700, Elijah Newren wrote: > On Thu, Apr 5, 2018 at 11:58 AM, Jeff King wrote: > > > It sounds like correct_untracked_entries() is doing the wrong thing, and > > it should be aware of the pathspec-matching when culling entries. In > > other words, my understanding was that read_directory() does not > > necessarily promise to cull fully (which is what led to cf424f5fd in the > > first place), and callers are forced to apply their own pathspecs. > > > > The distinction is academic for this particular bug, but it makes me > > wonder if there are other cases where "clean" needs to be more careful > > with what comes out of dir.c. > > Interesting, I read things very differently. Looking back at commit > 6b1db43109ab ("clean: teach clean -d to preserve ignored paths", > 2017-05-23), which introduced correct_untracked_entries(), I thought > that correct_untracked_entries() wasn't there to correct pathspec > issues with fill_directory(), but instead to special case the handling > of files which are both untracked and ignored. Did I mis-read or were > there other commits that changed the semantics? > > Also, it would just seem odd to me that fill_directory() requires > pathspecs, and it uses those pathspecs, but it doesn't guarantee that > the files it returns matches them. That seems like an API ripe for > mis-use, especially since I don't see any comment in the code about > such an assumption. Is that really the expectation? To be honest, I don't know. Most of dir.c predates me, and I've tried to avoid looking at it too hard. But I had a vague recollection of it being "best effort", and this bit from cf424f5fd89b reinforces that: However, read_directory does not actually check against our pathspec. It uses a simplified version that may turn up false positives. As a result, we need to check that any hits match our pathspec. So I don't know that correct_untracked_entries() is there to fix the pathspec handling. But I think that anybody who looks at the output of fill_directory() does need to be aware that they may get more entries than they expected, and has to apply the pathspecs themselves. And that's what that extra dir_path_match() call in cmd_clean() is there for (it used to be match_pathspec before some renaming). I agree it's an error-prone interface. I don't know all the conditions under which dir.c might return extra entries, but it seems like it might be sane for it to do a final pathspec-matching pass so that callers don't have to. That would mean that correct_untracked_entries() sees the correctly culled list, and the extra check in cmd_clean() could be dropped. Ideally, of course, we'd fix those individual cases, since that would be more efficient. And your patch may be the right first step in that direction. But since we don't know what all of them are, it seems ripe for regressions. -Peff
Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too
On Thu, Apr 5, 2018 at 11:58 AM, Jeff King wrote: > It sounds like correct_untracked_entries() is doing the wrong thing, and > it should be aware of the pathspec-matching when culling entries. In > other words, my understanding was that read_directory() does not > necessarily promise to cull fully (which is what led to cf424f5fd in the > first place), and callers are forced to apply their own pathspecs. > > The distinction is academic for this particular bug, but it makes me > wonder if there are other cases where "clean" needs to be more careful > with what comes out of dir.c. Interesting, I read things very differently. Looking back at commit 6b1db43109ab ("clean: teach clean -d to preserve ignored paths", 2017-05-23), which introduced correct_untracked_entries(), I thought that correct_untracked_entries() wasn't there to correct pathspec issues with fill_directory(), but instead to special case the handling of files which are both untracked and ignored. Did I mis-read or were there other commits that changed the semantics? Also, it would just seem odd to me that fill_directory() requires pathspecs, and it uses those pathspecs, but it doesn't guarantee that the files it returns matches them. That seems like an API ripe for mis-use, especially since I don't see any comment in the code about such an assumption. Is that really the expectation?
Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too
On Thu, Apr 05, 2018 at 10:34:43AM -0700, Elijah Newren wrote: > Even if a directory doesn't match a pathspec, it is possible, depending > on the precise pathspecs, that some file underneath it might. So we > special case and recurse into the directory for such situations. However, > we previously always added any untracked directory that we recursed into > to the list of untracked paths, regardless of whether the directory > itself matched the pathspec. > > For the case of git-clean and a set of pathspecs of "dir/file" and "more", > this caused a problem because we'd end up with dir entries for both of > "dir" > "dir/file" > Then correct_untracked_entries() would try to helpfully prune duplicates > for us by removing "dir/file" since it's under "dir", leaving us with > "dir" > Since the original pathspec only had "dir/file", the only entry left > doesn't match and leaves nothing to be removed. (Note that if only one > pathspec was specified, e.g. only "dir/file", then the common_prefix_len > optimizations in fill_directory would cause us to bypass this problem, > making it appear in simple tests that we could correctly remove manually > specified pathspecs.) It sounds like correct_untracked_entries() is doing the wrong thing, and it should be aware of the pathspec-matching when culling entries. In other words, my understanding was that read_directory() does not necessarily promise to cull fully (which is what led to cf424f5fd in the first place), and callers are forced to apply their own pathspecs. The distinction is academic for this particular bug, but it makes me wonder if there are other cases where "clean" needs to be more careful with what comes out of dir.c. -Peff