Re: [PATCH 2/2 v2] ls-files: add pathspec matching for submodules

2016-09-23 Thread Brandon Williams
On Fri, Sep 23, 2016 at 12:20 PM, Junio C Hamano  wrote:
> There is an interesting observation around this code.  Note that it
> is just something to keep in mind, even though I think we are in no
> position to solve this within the scope of this series, or in fact I
> am not sure if there is anything to "fix".
>
> The expectation here is that the leading part of pathspec elements
> contain path components above and outside the current working tree,
> e.g. in a superproject with a submodule at "sub/", the end-user may
> have said from the top of the superproject
>
> A saving grace is that "s*b/file" in this case is what the end-user
> is giving us, not something we internally generated.  So we can
> simply blame the end user, saying "what --recurse-submodules does is
> to (conceptually) flatten the indices of submodules into the index
> of the superproject and show the entries that match your pathspec.
> Because you gave us 's*b/file', which does match 's*b/oob/file',
> that is what you get."
>
> ;-)

Yeah I've been thinking a bit about that as well.  To me, it is
incredibly silly to
have a wildcard character in a filename (its unfortunate that its
allowed).  We can
easily do as you suggest and simply blame the user and if they do have wildcard
characters in their filenames they would just need to force the
pathspec code to
do checks literally (using the appropriate pathspec magic).  This
would just limit their
ability to use actual wildcards in their pathspecs, ie they have to
pick wildcards in their
filenames or the ability to do wildmatching.

-Brandon


Re: [PATCH 2/2 v2] ls-files: add pathspec matching for submodules

2016-09-23 Thread Junio C Hamano
Brandon Williams  writes:

>  static void show_ce_entry(const char *tag, const struct cache_entry *ce)
>  {
> + struct strbuf name = STRBUF_INIT;
>   int len = max_prefix_len;
> + if (submodule_prefix)
> + strbuf_addstr(, submodule_prefix);
> + strbuf_addstr(, ce->name);
> ...  
> + } else if (match_pathspec(, name.buf, name.len,
> +   len, ps_matched,
> +   S_ISDIR(ce->ce_mode) ||
> +   S_ISGITLINK(ce->ce_mode))) {

There is an interesting observation around this code.  Note that it
is just something to keep in mind, even though I think we are in no
position to solve this within the scope of this series, or in fact I
am not sure if there is anything to "fix".

The expectation here is that the leading part of pathspec elements
contain path components above and outside the current working tree,
e.g. in a superproject with a submodule at "sub/", the end-user may
have said from the top of the superproject

git ls-files --recurse-submodules -- sub/file

and the recursing "ls-files" is spawned as

git -C sub ls-files -- sub/file

relaying the pathspec literally.

This does not correctly work if the path to the submodule has
wildcard in it.  Imagine that the submodule were at "s*b/".  The
recursing invocation would look like:

git -C "s*b" ls-files -- "s*b/file"

Further imagine that the index in the submodule at "s*b" has two
paths in it, i.e.

file
oob/file

The prefix is prepended to them, to turn them into

s*b/file
s*b/oob/file

and I suspect that the pathspec element "s*b/file" would match both
of them.

The pathspec machinery has a provision to prevent a similar gotcha
happening for the "prefix" we internally use.  In a sample
repository created like so:

$ git init
$ mkdir -p 's*b/oob' sib
$ >sib/file
$ cd 's*b'
$ >file
$ >oob/file
$ git add .
$ git ls-files -- file

the "ls-files" in the last step gets 's*b/' as the "prefix", and the
pathspec is formed by concatenating "file" to it, but in a special
way.  The part that come from the "prefix" is marked not to honor
any wildcard in it, so 's*b/' even though it has an asterisk, it is
forced to match literally, giving only 's*b/file'.

A saving grace is that "s*b/file" in this case is what the end-user
is giving us, not something we internally generated.  So we can
simply blame the end user, saying "what --recurse-submodules does is
to (conceptually) flatten the indices of submodules into the index
of the superproject and show the entries that match your pathspec.
Because you gave us 's*b/file', which does match 's*b/oob/file',
that is what you get."

;-)


Re: [PATCH 2/2 v2] ls-files: add pathspec matching for submodules

2016-09-23 Thread Junio C Hamano
Brandon Williams  writes:

> - /* Find common prefix for all pathspec's */
> - max_prefix = common_prefix();
> + /*
> +  * Find common prefix for all pathspec's
> +  * This is used as a performance optimization which unfortunately cannot
> +  * be done when recursing into submodules
> +  */
> + if (recurse_submodules)
> + max_prefix = NULL;
> + else
> + max_prefix = common_prefix();
>   max_prefix_len = max_prefix ? strlen(max_prefix) : 0;

This is OK for now, but for a future enhancement, I think we could
do better than this.  In a superproject with a submodule at "sub/",
the current implementation of the common_prefix() helper would yield
"sub/a/" when given "sub/a/x" and "sub/a/y" (a pathspec with two
elements), which we want to avoid.

But somebody should be able to notice, before "sub/a/" is given to
max_prefix here, that "sub/" is the leaf level in our repository and
reduce the max_prefix to it.  dir.c::common_prefix_len() might be 
a place we could do so, but I didn't think about the ramifications
of doing so for other callers of common_prefix() or when we are not
recursing into submodules.  Doing it in the caller here, i.e.

max_prefix = common_prefix();
if (recurse_submodules)
max_prefix = chomp_at_submodule_boundary(max_prefix);

is certainly safer.

If the superproject has submodules at "a/b/{sub1,sub2,...}", this
matters more.  We do want to notice that we won't have to scan
outside "a/b/" of the index given "a/b/sub1" and "a/b/sub2" as a
pathspec.

The common_prefix_len() function also looks beyond symbolic links,
which is another thing that we may want to think about.  In a
repository with a symbolic link "link" pointing somewhere else, when
you give "link/a/x" and "link/a/y" (a pathspec with two elements),
we would get "link/a/" as a common prefix, but we won't find
anything underneath "link" in our index.  In such a case, leaving
the common prefix to "link/a/" _might_ allow us to notice that no
pathspec elements can ever match, so not noticing that the common
prefix points beyond a symbolic link might be a feature.  I dunno.


[PATCH 2/2 v2] ls-files: add pathspec matching for submodules

2016-09-21 Thread Brandon Williams
Pathspecs can be a bit tricky when trying to apply them to submodules.
The main challenge is that the pathspecs will be with respect to the
superproject and not with respect to paths in the submodule.  The
approach this patch takes is to pass in the identical pathspec from the
superproject to the submodule in addition to the submodule-prefix, which
is the path from the root of the superproject to the submodule, and then
we can compare an entry in the submodule prepended with the
submodule-prefix to the pathspec in order to determine if there is a
match.

This patch also permits the pathspec logic to perform a prefix match against
submodules since a pathspec could refer to a file inside of a submodule.
Due to limitations in the wildmatch logic, a prefix match is only done
literally.  If any wildcard character is encountered we'll simply punt
and produce a false positive match.  More accurate matching will be done
once inside the submodule.  This is due to the superproject not knowing
what files could exist in the submodule.

Signed-off-by: Brandon Williams 
---
 builtin/ls-files.c | 132 -
 dir.c  |  46 +++-
 dir.h  |   4 +
 t/t3007-ls-files-recurse-submodules.sh | 114 ++--
 4 files changed, 234 insertions(+), 62 deletions(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index ffd9ea6..fa4029e 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -177,12 +177,34 @@ static void show_gitlink(const struct cache_entry *ce)
 {
struct child_process cp = CHILD_PROCESS_INIT;
int status;
+   int i;
 
argv_array_push(, "ls-files");
argv_array_push(, "--recurse-submodules");
argv_array_pushf(, "--submodule-prefix=%s%s/",
 submodule_prefix ? submodule_prefix : "",
 ce->name);
+   /* add options */
+   if (show_eol)
+   argv_array_push(, "--eol");
+   if (show_valid_bit)
+   argv_array_push(, "-v");
+   if (show_stage)
+   argv_array_push(, "--stage");
+   if (show_cached)
+   argv_array_push(, "--cached");
+   if (debug_mode)
+   argv_array_push(, "--debug");
+
+   /*
+* Pass in the original pathspec args.  The submodule will be
+* responsible for prepending the 'submodule_prefix' prior to comparing
+* against the pathspec for matches.
+*/
+   argv_array_push(, "--");
+   for (i = 0; i < pathspec.nr; i++)
+   argv_array_push(, pathspec.items[i].original);
+
cp.git_cmd = 1;
cp.dir = ce->name;
status = run_command();
@@ -192,57 +214,62 @@ static void show_gitlink(const struct cache_entry *ce)
 
 static void show_ce_entry(const char *tag, const struct cache_entry *ce)
 {
+   struct strbuf name = STRBUF_INIT;
int len = max_prefix_len;
+   if (submodule_prefix)
+   strbuf_addstr(, submodule_prefix);
+   strbuf_addstr(, ce->name);
 
if (len >= ce_namelen(ce))
die("git ls-files: internal error - cache entry not superset of 
prefix");
 
-   if (!match_pathspec(, ce->name, ce_namelen(ce),
-   len, ps_matched,
-   S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode)))
-   return;
-   if (recurse_submodules && S_ISGITLINK(ce->ce_mode)) {
+   if (recurse_submodules && S_ISGITLINK(ce->ce_mode) &&
+   submodule_path_match(, name.buf, ps_matched)) {
show_gitlink(ce);
-   return;
-   }
+   } else if (match_pathspec(, name.buf, name.len,
+ len, ps_matched,
+ S_ISDIR(ce->ce_mode) ||
+ S_ISGITLINK(ce->ce_mode))) {
+   if (tag && *tag && show_valid_bit &&
+   (ce->ce_flags & CE_VALID)) {
+   static char alttag[4];
+   memcpy(alttag, tag, 3);
+   if (isalpha(tag[0]))
+   alttag[0] = tolower(tag[0]);
+   else if (tag[0] == '?')
+   alttag[0] = '!';
+   else {
+   alttag[0] = 'v';
+   alttag[1] = tag[0];
+   alttag[2] = ' ';
+   alttag[3] = 0;
+   }
+   tag = alttag;
+   }
 
-   if (tag && *tag && show_valid_bit &&
-   (ce->ce_flags & CE_VALID)) {
-   static char alttag[4];
-   memcpy(alttag, tag, 3);
-   if (isalpha(tag[0]))
-   alttag[0] = tolower(tag[0]);
-   else if (tag[0] == '?')
-   alttag[0] = '!';
-