Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-12 Thread Brandon Williams
Thanks for all the comments.  What it sounds like is that using ls-files as
a means to power a recursive git-grep may not be the best approach (I
assumed that would be the case but thought it a decent place to start).

I agree that not all operating modes would be useful for a recursive
ls-files, which is why I didn't include support for them initially.  I
guess the question is which modes would be worth supporting in the
recursive case?


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-12 Thread Junio C Hamano
Jeff King  writes:

>> I do not use submodules myself, but I could imagine that you may have
>> scripts outside of git that do not care about the submodule divisions at
>> all, and would be happy with the flat block. ...
>> git-grep does not (I don't use "ack", but perhaps "git ls-files
>> --recurse-submodules -z | xargs --null ack ..." is something people
>> would want to do).
>
> None of that negates your point, btw, which is that this does not seem
> like a great building block for "git grep --recurse-submodules". Just
> that it seems plausible to me that people could find recursive
> "ls-files" useful on its own.

I do think it is a good argument for why an ls-files (with or
without -z) that recurses into submodules would help a "git grep"
that does not look at the object store but searches the tracked
files in the working tree.  In normal Git world view, if you want a
list of files that are tracked, you are the one who is supposed to
go to the repository and look at the index there, but if you can
somehow magically get that list out-of-band, the "in the working
tree files" mode of "git grep" can work just like the "xargs -z ack"
pipeline in your example.

So I tend to agree that the enhancement in question is a useful
thing on its own, at least the part that gives a list of paths in
the index.

I do not know if all other operating modes are useful, though.  For
example, the mode that lists untracked paths might be useful to do a
recursive "git clean".  On the other hand, "ls-files -s" with the
patch may produce the "correct" result (i.e. the object name given
for each path would be the same one that is found in the index of
the submodule the path was taken from), but the correctness may not
necessarily translate to usefulness, exactly because you need to
know which submodule's repository has the named object, and that is
not given by the flattened list.

Having to say that "this command produces a correct result, but not
all correct things are necessarily useful" makes me wonder if that
is the direction in which we would want to go.  For example, what
would be our answer to an end-user question: I now know that the
recursive ls-files can give me all the untracked files in the
top-level and submodules.  How can I feed that to "git add" to
update their respective index files with them?

I am not convinced that we would always want "git add lib/Makefile"
to automagically run "git -C lib/ add Makefile" when
lib/ is a submodule (for that matter, even if we wanted to, I do not
know if that is something we can efficiently do with pathspecs that
have wildcards).

So...





Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-11 Thread Junio C Hamano
Junio C Hamano  writes:

> So a "ls-files" that is done internally in the end-user facing "git
> grep --recurse-submodules" needs to be run _without_ recursing
> itself at least once to learn "lib/" is a submodule.  A flat "here
> are everything we have" does not sound like a good building block.

... unless you are _only_ interested in grepping (or in general
working outside the Git repository) in the files in the working tree,
i.e. "git grep" with neither a tree argument nor "--cached".

A lot of the time you are interested in the current state of files,
not even in "the state recorded in the tip of the history" but in
"the state as I have in my working tree, the state as my compiler
sees it".

I am a bit torn.  Clearly this is an important special case, but it
would push the object database case, i.e. "git grep [--cached | <tree>]",
and the "find in the working tree" codepath even further apart, which
does not sound friendly to the codebase.



Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-11 Thread Jeff King
On Sun, Sep 11, 2016 at 08:51:06PM -0400, Jeff King wrote:

> On Sun, Sep 11, 2016 at 03:10:13PM -0700, Junio C Hamano wrote:
> 
> > So a "ls-files" that is done internally in the end-user facing "git
> > grep --recurse-submodules" needs to be run _without_ recursing
> > itself at least once to learn "lib/" is a submodule.  A flat "here
> > are everything we have" does not sound like a good building block.
> 
> I do not use submodules myself, but I could imagine that you may have
> scripts outside of git that do not care about the submodule divisions at
> all, and would be happy with the flat block. E.g., our "make tags"
> target uses "git ls-files" to find all of the source files. I could
> imagine projects with submodules that would want to do so recursively (I
> could also imagine projects that do _not_ want to do so; it would depend
> on your workflow and how tightly bound the submodules are). Another
> plausible example would be something grep-like that has features
> git-grep does not (I don't use "ack", but perhaps "git ls-files
> --recurse-submodules -z | xargs --null ack ..." is something people
> would want to do).

None of that negates your point, btw, which is that this does not seem
like a great building block for "git grep --recurse-submodules". Just
that it seems plausible to me that people could find recursive
"ls-files" useful on its own.

-Peff


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-11 Thread Jeff King
On Sun, Sep 11, 2016 at 03:10:13PM -0700, Junio C Hamano wrote:

> So a "ls-files" that is done internally in the end-user facing "git
> grep --recurse-submodules" needs to be run _without_ recursing
> itself at least once to learn "lib/" is a submodule.  A flat "here
> are everything we have" does not sound like a good building block.

I do not use submodules myself, but I could imagine that you may have
scripts outside of git that do not care about the submodule divisions at
all, and would be happy with the flat block. E.g., our "make tags"
target uses "git ls-files" so find all of the source files. I could
imagine projects with submodules that would want to do so recursively (I
could also imagine projects that do _not_ want to do so; it would depend
on your workflow and how tightly bound the submodules are). Another
plausible example would be something grep-like that has features
git-grep does not (I don't use "ack", but perhaps "git ls-files
--recurse-submodules -z | xargs --null ack ..." is something people
would want to do).

-Peff


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-11 Thread Junio C Hamano
Stefan Beller  writes:

> The plan is to hook the ls-files machinery into
> git-grep as the way of obtaining files to grep for a pattern.

That does not make much sense to me for exactly the same reason why
the "grab the list of paths and run 'git add' on them" example in
the message you are responding to does not make sense.  The use of
the thread-pool would still need to honor the submodule boundary so
that one thread may be assigned files in the top-level superproject
while another may be assigned files in lib/ submodule repository,
and the latter would be doing a rough equivalent of "git -C lib
grep" perhaps with a new option "--output-path-prefix=lib/" that
makes any and all paths that are reported from the command prefixed
with the specified string, so the result of its grepping in Makefile
may be reported as findings in lib/Makefile.

For that, it is not sufficient for the enumeration of paths done in
the top-level to just list lib/Makefile and lib/hello.py along with
Makefile and main.py, is it?  You would somehow need to have a way
to tell that 'lib/' and everything in there is inside a separate
repository.  Without knowing that "lib/" is its own repository, you
would not even know which files under "lib/" hierarchy in the
filesystem are actually tracked files, which you would learn only by
reading lib/.git/index, or what textconv filtering needs to be done
on them, which you would learn only by reading lib/.gitattributes
and/or lib/.git/config.

So a "ls-files" that is done internally in the end-user facing "git
grep --recurse-submodules" needs to be run _without_ recursing
itself at least once to learn "lib/" is a submodule.  A flat "here
are everything we have" does not sound like a good building block.


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-09 Thread Stefan Beller
On Fri, Sep 9, 2016 at 4:01 PM, Junio C Hamano  wrote:
> Brandon Williams  writes:
>
>> Allow ls-files to recognize submodules in order to retrieve a list of
>> files from a repository's submodules.  This is done by forking off a
>> process to recursively call ls-files on all submodules.
>
> While I see why "ls-files --recurse-submodules" sounds nice ("hey, I
> can get list of _all_ the files here"), and I am quite happy with
> the quality of implementation (not just the code but its
> documentation and test)

The plan is to hook the ls-files machinery into
git-grep as the way of obtaining files to grep for a pattern.

As git-grep already has a thread pool, getting the list of files
single-threaded is preferable to e.g. calling git-grep recursively
for each submodule, as the latter could turn into a fork bomb.

>
>> diff --git a/t/t3007-ls-files-recurse-submodules.sh 
>> b/t/t3007-ls-files-recurse-submodules.sh

> We used to do that when we didn't know better.
>

Oops, a lot of stuff slipped through the first internal review.


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-09 Thread Junio C Hamano
Brandon Williams  writes:

> Allow ls-files to recognize submodules in order to retrieve a list of
> files from a repository's submodules.  This is done by forking off a
> process to recursively call ls-files on all submodules.

While I see why "ls-files --recurse-submodules" sounds nice ("hey, I
can get list of _all_ the files here"), and I am quite happy with
the quality of implementation (not just the code but its
documentation and test) especially from a first-time contributor, I
am not quite sure what the utility of this new feature would be,
especially given that the command is a plumbing, i.e. meant to be a
useful building block for scripts.

If I get

$ git ls-files --recurse-submodules
Makefile
lib/Makefile
lib/hello.py
main.py
goodbye.py

out of the command, what can I do with it without knowing where the
submodule boundaries are?  It's not like I can just do

git ls-files --recurse-submodules |
while read path
do
	git update-index --add "$path"
done

when "lib/" is a submodule.  Instead, I'd need to go to "lib/" and
then add "Makefile" and "hello.py" from there.

> diff --git a/t/t3007-ls-files-recurse-submodules.sh 
> b/t/t3007-ls-files-recurse-submodules.sh
> new file mode 100644
> index 000..78deded
> --- /dev/null
> +++ b/t/t3007-ls-files-recurse-submodules.sh
> @@ -0,0 +1,103 @@
> +...
> +test_expect_success 'setup directory structure and submodules' '
> +...
> +'
> +
> +cat >expect <<EOF
> +.gitmodules
> +a
> +b/b
> +submodule/c
> +EOF

We used to do that when we didn't know better.

Please don't do things outside test_expect_* block, especially in a
new script.


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-09 Thread Junio C Hamano
Jeff King  writes:

> So I suppose another option would be to teach ls-files a "prefix" option
> to add to each filename, and just pass in the submodule path. Then you
> can let the sub-processes write directly to the common stdout, and I
> think it would be safe to blindly pass the parent argv into the child
> processes.

I think that is a sensible way to do this, instead of reading the -z
output from the child and re-showing it depending on the various
output modes the command was told to use.


Re: [RFC/PATCH] ls-files: adding support for submodules

2016-09-09 Thread Jeff King
On Fri, Sep 09, 2016 at 02:53:24PM -0700, Brandon Williams wrote:

> Allow ls-files to recognize submodules in order to retrieve a list of
> files from a repository's submodules.  This is done by forking off a
> process to recursively call ls-files on all submodules.
> 
> Signed-off-by: Brandon Williams 
> ---
> Hey git developers!
> 
> I'm new to the community and this is the first patch for an open source 
> project
> that I have worked on.
> 
> I'm looking forward to working on the project!

Welcome. :)

Submodules are not really my area of expertise, so I don't have any
commentary on the goal of the patch, except that it sounds reasonable to
my layman's ears.

The implementation looks fairly clean. A few comments:

> +static void show_gitlink(const struct cache_entry *ce)
> +{
> + struct child_process cp = CHILD_PROCESS_INIT;
> + struct strbuf buf = STRBUF_INIT;
> + struct strbuf name = STRBUF_INIT;
> + int submodule_name_len;
> + FILE *fp;
> +
> + argv_array_push(&cp.args, "ls-files");
> + argv_array_push(&cp.args, "--recurse-submodules");
> + cp.git_cmd = 1;
> + cp.dir = ce->name;
> + cp.out = -1;
> + start_command(&cp);
> + fp = fdopen(cp.out, "r");

You should error-check the result of start_command(). I guess the
reasonable outcome would be to die(), as it is a sign that we could not
fork, find git, etc.

Ditto for fdopen (you can use xfdopen for that).
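
Something along these lines (untested, just reusing the names from your
patch) is what I have in mind:

    /* die if we could not fork or exec git */
    if (start_command(&cp))
        die("unable to run 'git ls-files' in submodule '%s'", ce->name);
    fp = xfdopen(cp.out, "r"); /* xfdopen() already dies on failure */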

> + /*
> +  * The ls-files child process produces filenames relative to
> +  * the submodule. Prefix each line with the submodule path
> +  * to make it relative to the current repository.
> +  */
> + strbuf_addstr(&name, ce->name);
> + strbuf_addch(&name, '/');
> + submodule_name_len = name.len;
> + while (strbuf_getline(&buf, fp) != EOF) {
> + strbuf_addbuf(&name, &buf);
> + write_name(name.buf);
> + strbuf_setlen(&name, submodule_name_len);
> + }

What happens if the filename in the submodule needs quoting? You'll get
the quoted value in your buffer, and then re-quote it again in
write_name().

The simplest thing would probably be to use "ls-files -z" for the
recursive invocation, and then split on NUL bytes (we have
strbuf_getline_nul for that).
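
E.g., a rough sketch on top of your patch (not tested):

    argv_array_push(&cp.args, "-z"); /* next to the other argv_array_push()es */

    while (strbuf_getline_nul(&buf, fp) != EOF) {
        strbuf_addbuf(&name, &buf);
        write_name(name.buf);
        strbuf_setlen(&name, submodule_name_len);
    }

That way the child hands us raw pathnames and write_name() quotes each
path exactly once, on our side.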

> + finish_command(&cp);

What should happen if finish_command() tells us that the ls-files
sub-process reported an error? It may not be worth aborting the rest of
the listing, but we might want to propagate that in our own return code.

> + strbuf_release(&buf);
> + strbuf_release(&name);
> + fclose(fp);
> +}

A minor style nit, but I would generally fclose(fp) before running
finish_command() (i.e., resource clean up in the reverse order of
allocation). It doesn't matter in this case because "fp" is output from
the process, and we know we've already read to EOF. For other cases, it
could cause a deadlock (e.g., we end up in wait() for the child process
to finish, but it is blocked in write() waiting for us to read). So I
think it's a good habit to get into.
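
Putting those last two points together, the tail of show_gitlink() could
look something like this (untested; "errors_seen" is a made-up flag that
cmd_ls_files() would turn into a non-zero exit code):

    fclose(fp);     /* we have read the child's output to EOF */

    if (finish_command(&cp))
        errors_seen = 1;    /* note the failure instead of aborting */

    strbuf_release(&buf);
    strbuf_release(&name);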

> @@ -519,6 +566,17 @@ int cmd_ls_files(int argc, const char **argv, const char 
> *cmd_prefix)
>   if (require_work_tree && !is_inside_work_tree())
>   setup_work_tree();
>  
> + if (recurse_submodules &&
> + (show_stage || show_deleted || show_others || show_unmerged ||
> +  show_killed || show_modified || show_resolve_undo ||
> +  show_valid_bit || show_tag || show_eol))
> + die("ls-files --recurse-submodules can only be used in "
> + "--cached mode");

Woah, that list of variables is getting rather long. This is not a
problem introduced by your patch, so it's not a blocker. But I wonder if
some of them are mutually exclusive and could be collapsed to a single
variable.
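
Untested musing, but one way of collapsing them into a single variable
would be a flag word, something like:

    /* hypothetical: fold the many show_* ints into one bitset */
    #define SHOW_CACHED   (1 << 0)
    #define SHOW_DELETED  (1 << 1)
    #define SHOW_OTHERS   (1 << 2)
    #define SHOW_STAGE    (1 << 3)
    /* ...and so on for the rest... */
    static unsigned show_flags;

    if (recurse_submodules && (show_flags & ~SHOW_CACHED))
        die("ls-files --recurse-submodules can only be used in "
            "--cached mode");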

I guess the reason for this "only with --cached" is that you do not
propagate the options down to the recursive process. If we were to do
that, then this big list of restrictions would go away. I'd be OK with
starting with more limited functionality like your patch, though. I
think doing the recursive thing correctly would also involve parsing the
output of each to append the filename prefix.

So I suppose another option would be to teach ls-files a "prefix" option
to add to each filename, and just pass in the submodule path. Then you
can let the sub-processes write directly to the common stdout, and I
think it would be safe to blindly pass the parent argv into the child
processes.
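
I.e., something vaguely like this (completely untested; the
"--output-path-prefix" option does not exist, it is just to illustrate
the idea):

    argv_array_push(&cp.args, "ls-files");
    argv_array_push(&cp.args, "--recurse-submodules");
    argv_array_pushf(&cp.args, "--output-path-prefix=%s/", ce->name);
    cp.git_cmd = 1;
    cp.dir = ce->name;
    /* cp.out stays 0, so the child writes straight to our stdout */
    if (run_command(&cp))
        errors_seen = 1;    /* same made-up flag as before */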

-Peff