Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too

2018-04-08 Thread Junio C Hamano
Jeff King  writes:

> To be honest, I don't know. Most of dir.c predates me, and I've tried to
> avoid looking at it too hard. But I had a vague recollection of it being
> "best effort", and this bit from cf424f5fd89b reinforces that:
>
>   However, read_directory does not actually check against our pathspec.
>   It uses a simplified version that may turn up false positives. As a
>   result, we need to check that any hits match our pathspec.

At least the original design of the traversal was "try making use of
pathspec during the traversal when we can cheaply filter out obvious
non-hits and avoid recursing into an entire hierarchy---false negative
is an absolute no-no, but forcing the consumer to post filter is OK".

> ... But I think that anybody who looks at the output of
> fill_directory() does need to be aware that they may get more entries
> than they expected, and has to apply the pathspecs themselves.

That matches with my understanding of how "fill" thing worked from
early days.


Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too

2018-04-05 Thread Jeff King
On Thu, Apr 05, 2018 at 12:15:59PM -0700, Elijah Newren wrote:

> On Thu, Apr 5, 2018 at 11:58 AM, Jeff King  wrote:
> 
> > It sounds like correct_untracked_entries() is doing the wrong thing, and
> > it should be aware of the pathspec-matching when culling entries. In
> > other words, my understanding was that read_directory() does not
> > necessarily promise to cull fully (which is what led to cf424f5fd in the
> > first place), and callers are forced to apply their own pathspecs.
> >
> > The distinction is academic for this particular bug, but it makes me
> > wonder if there are other cases where "clean" needs to be more careful
> > with what comes out of dir.c.
> 
> Interesting, I read things very differently.  Looking back at commit
> 6b1db43109ab ("clean: teach clean -d to preserve ignored paths",
> 2017-05-23), which introduced correct_untracked_entries(), I thought
> that correct_untracked_entries() wasn't there to correct pathspec
> issues with fill_directory(), but instead to special case the handling
> of files which are both untracked and ignored.  Did I mis-read or were
> there other commits that changed the semantics?
> 
> Also, it would just seem odd to me that fill_directory() requires
> pathspecs, and it uses those pathspecs, but it doesn't guarantee that
> the files it returns matches them.  That seems like an API ripe for
> mis-use, especially since I don't see any comment in the code about
> such an assumption.  Is that really the expectation?

To be honest, I don't know. Most of dir.c predates me, and I've tried to
avoid looking at it too hard. But I had a vague recollection of it being
"best effort", and this bit from cf424f5fd89b reinforces that:

  However, read_directory does not actually check against our pathspec.
  It uses a simplified version that may turn up false positives. As a
  result, we need to check that any hits match our pathspec.

So I don't know that correct_untracked_entries() is there to fix the
pathspec handling. But I think that anybody who looks at the output of
fill_directory() does need to be aware that they may get more entries
than they expected, and has to apply the pathspecs themselves. And
that's what that extra dir_path_match() call in cmd_clean() is
there for (it used to be match_pathspec before some renaming).

I agree it's an error-prone interface. I don't know all the conditions
under which dir.c might return extra entries, but it seems like it might
be sane for it to do a final pathspec-matching pass so that callers
don't have to. That would mean that correct_untracked_entries() sees the
correctly culled list, and the extra check in cmd_clean() could be
dropped.

Ideally, of course, we'd fix those individual cases, since that would be
more efficient. And your patch may be the right first step in that
direction. But since we don't know what all of them are, it seems ripe
for regressions.

-Peff


Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too

2018-04-05 Thread Elijah Newren
On Thu, Apr 5, 2018 at 11:58 AM, Jeff King  wrote:

> It sounds like correct_untracked_entries() is doing the wrong thing, and
> it should be aware of the pathspec-matching when culling entries. In
> other words, my understanding was that read_directory() does not
> necessarily promise to cull fully (which is what led to cf424f5fd in the
> first place), and callers are forced to apply their own pathspecs.
>
> The distinction is academic for this particular bug, but it makes me
> wonder if there are other cases where "clean" needs to be more careful
> with what comes out of dir.c.

Interesting, I read things very differently.  Looking back at commit
6b1db43109ab ("clean: teach clean -d to preserve ignored paths",
2017-05-23), which introduced correct_untracked_entries(), I thought
that correct_untracked_entries() wasn't there to correct pathspec
issues with fill_directory(), but instead to special case the handling
of files which are both untracked and ignored.  Did I mis-read or were
there other commits that changed the semantics?

Also, it would just seem odd to me that fill_directory() requires
pathspecs, and it uses those pathspecs, but it doesn't guarantee that
the files it returns matches them.  That seems like an API ripe for
mis-use, especially since I don't see any comment in the code about
such an assumption.  Is that really the expectation?


Re: [RFC PATCH 4/7] dir: Directories should be checked for matching pathspecs too

2018-04-05 Thread Jeff King
On Thu, Apr 05, 2018 at 10:34:43AM -0700, Elijah Newren wrote:

> Even if a directory doesn't match a pathspec, it is possible, depending
> on the precise pathspecs, that some file underneath it might.  So we
> special case and recurse into the directory for such situations.  However,
> we previously always added any untracked directory that we recursed into
> to the list of untracked paths, regardless of whether the directory
> itself matched the pathspec.
> 
> For the case of git-clean and a set of pathspecs of "dir/file" and "more",
> this caused a problem because we'd end up with dir entries for both of
>   "dir"
>   "dir/file"
> Then correct_untracked_entries() would try to helpfully prune duplicates
> for us by removing "dir/file" since it's under "dir", leaving us with
>   "dir"
> Since the original pathspec only had "dir/file", the only entry left
> doesn't match and leaves nothing to be removed.  (Note that if only one
> pathspec was specified, e.g. only "dir/file", then the common_prefix_len
> optimizations in fill_directory would cause us to bypass this problem,
> making it appear in simple tests that we could correctly remove manually
> specified pathspecs.)

It sounds like correct_untracked_entries() is doing the wrong thing, and
it should be aware of the pathspec-matching when culling entries. In
other words, my understanding was that read_directory() does not
necessarily promise to cull fully (which is what led to cf424f5fd in the
first place), and callers are forced to apply their own pathspecs.

The distinction is academic for this particular bug, but it makes me
wonder if there are other cases where "clean" needs to be more careful
with what comes out of dir.c.

-Peff