Re: [RFC/PATCH] ls-files: adding support for submodules
Thanks for all the comments. It sounds like using ls-files as a means to power a recursive git-grep may not be the best approach (I suspected that would be the case, but thought it a decent place to start). I agree that not all operating modes would be useful for a recursive ls-files, which is why I initially didn't add support for them. I guess the question is: which modes would be worth supporting in the recursive case?
Re: [RFC/PATCH] ls-files: adding support for submodules
Jeff King writes:

>> I do not use submodules myself, but I could imagine that you may have
>> scripts outside of git that do not care about the submodule divisions at
>> all, and would be happy with the flat block. ...
>> git-grep does not (I don't use "ack", but perhaps "git ls-files
>> --recurse-submodules -z | xargs --null ack ..." is something people
>> would want to do).
>
> None of that negates your point, btw, which is that this does not seem
> like a great building block for "git grep --recurse-submodules". Just
> that it seems plausible to me that people could find recursive
> "ls-files" useful on its own.

I do think it is a good argument for why an ls-files (with or without -z) that recurses into submodules would help a "git grep" that does not look at the object store and does its search in the tracked working tree files. In the normal Git world view, if you want a list of files that are tracked, you are the one who is supposed to go to the repository and look at the index there; but if you can somehow magically get that list out-of-band, the "in the working tree files" mode of "git grep" can work just like the "xargs -0 ack" pipeline in your example.

So I tend to agree that the enhancement in question is a useful thing on its own, at least the part that gives a list of paths in the index. I do not know if all the other operating modes are useful, though. For example, the mode that lists untracked paths might be useful for a recursive "git clean". On the other hand, "ls-files -s" with the patch may produce the "correct" result (i.e. the object name given for each path would be the same one found in the index of the submodule the path was taken from), but that correctness may not necessarily translate to usefulness, exactly because you need to know which submodule's repository has the named object, and that is not given by the flattened list.

Having to say "this command produces a correct result, but not all correct things are necessarily useful" makes me wonder if that is the direction in which we want to go. For example, what would be our answer to an end-user question: "I now know that the recursive ls-files can give me all the untracked files in the top-level and submodules. How can I feed that to 'git add' to update their respective index files?" I am not convinced that we would always want "git add lib/Makefile" to automagically run "git -C lib/ add Makefile" when lib/ is a submodule (for that matter, even if we wanted to, I do not know if that is something we can do efficiently with pathspecs that have wildcards). So...
Re: [RFC/PATCH] ls-files: adding support for submodules
Junio C Hamano writes:

> So a "ls-files" that is done internally in the end-user facing "git
> grep --recurse-submodules" needs to be run _without_ recursing
> itself at least once to learn "lib/" is a submodule. A flat "here
> is everything we have" does not sound like a good building block.

... unless you are _only_ interested in grepping (or, in general, working outside the Git repository) the files in the working tree, i.e. "git grep" without "--cached" or a tree argument. A lot of the time you are interested in the current state of the files: not even "the state recorded at the tip of the history", but "the state as I have it in my working tree, the state as my compiler sees it".

I am a bit torn. Clearly this is an important special case, but it would push the codepaths for the object-database case, "git grep [--cached | <tree>]", and the "find in the working tree" case even further apart, which does not sound friendly to the codebase.
Re: [RFC/PATCH] ls-files: adding support for submodules
On Sun, Sep 11, 2016 at 08:51:06PM -0400, Jeff King wrote:

> On Sun, Sep 11, 2016 at 03:10:13PM -0700, Junio C Hamano wrote:
>
> > So a "ls-files" that is done internally in the end-user facing "git
> > grep --recurse-submodules" needs to be run _without_ recursing
> > itself at least once to learn "lib/" is a submodule. A flat "here
> > is everything we have" does not sound like a good building block.
>
> I do not use submodules myself, but I could imagine that you may have
> scripts outside of git that do not care about the submodule divisions at
> all, and would be happy with the flat block. E.g., our "make tags"
> target uses "git ls-files" to find all of the source files. I could
> imagine projects with submodules that would want to do so recursively (I
> could also imagine projects that do _not_ want to do so; it would depend
> on your workflow and how tightly bound the submodules are). Another
> plausible example would be something grep-like that has features
> git-grep does not (I don't use "ack", but perhaps "git ls-files
> --recurse-submodules -z | xargs --null ack ..." is something people
> would want to do).

None of that negates your point, btw, which is that this does not seem like a great building block for "git grep --recurse-submodules". Just that it seems plausible to me that people could find a recursive "ls-files" useful on its own.

-Peff
Re: [RFC/PATCH] ls-files: adding support for submodules
On Sun, Sep 11, 2016 at 03:10:13PM -0700, Junio C Hamano wrote:

> So a "ls-files" that is done internally in the end-user facing "git
> grep --recurse-submodules" needs to be run _without_ recursing
> itself at least once to learn "lib/" is a submodule. A flat "here
> is everything we have" does not sound like a good building block.

I do not use submodules myself, but I could imagine that you may have scripts outside of git that do not care about the submodule divisions at all, and would be happy with the flat block. E.g., our "make tags" target uses "git ls-files" to find all of the source files. I could imagine projects with submodules that would want to do so recursively (I could also imagine projects that do _not_ want to do so; it would depend on your workflow and how tightly bound the submodules are). Another plausible example would be something grep-like that has features git-grep does not (I don't use "ack", but perhaps "git ls-files --recurse-submodules -z | xargs --null ack ..." is something people would want to do).

-Peff
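The NUL-delimited pipeline sketched above can be illustrated without a repository at all. In this hedged sketch, the printf stands in for hypothetical "git ls-files --recurse-submodules -z" output, including a filename with an embedded newline that a line-based pipeline would mangle but a NUL-based one survives:

```shell
# Stand-in for "git ls-files --recurse-submodules -z" output: three
# NUL-terminated paths, one containing an embedded newline.
# xargs -0 splits only on NULs, so the odd filename stays intact.
printf 'Makefile\0lib/Makefile\0odd\nname.txt\0' |
    xargs -0 -n1 printf '<%s>\n'
```

Each path arrives as exactly one argument, brackets intact around the newline-containing name; this is why "-z | xargs --null" is the robust way to feed such a list to an external tool like ack or etags.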
Re: [RFC/PATCH] ls-files: adding support for submodules
Stefan Beller writes:

> The plan is to hook the ls-files machinery into
> git-grep as the way of obtaining files to grep for a pattern.

That does not make much sense to me, for exactly the same reason that the "grab the list of paths and run 'git add' on them" example in the message you are responding to does not make sense.

The use of the thread pool would still need to honor submodule boundaries, so that one thread may be assigned files in the top-level superproject while another is assigned files in the lib/ submodule repository. The latter would be doing a rough equivalent of "git -C lib grep", perhaps with a new option "--output-path-prefix=lib/" that prefixes any and all paths reported by the command with the specified string, so that the result of its grepping in Makefile is reported as findings in lib/Makefile.

For that, it is not sufficient for the enumeration of paths done at the top level to just list lib/Makefile and lib/hello.py along with Makefile and main.py, is it? You would somehow need a way to tell that 'lib/' and everything in there is inside a separate repository. Without knowing that "lib/" is its own repository, you would not even know which files under the "lib/" hierarchy in the filesystem are actually tracked (which you learn only by reading lib/.git/index), or what textconv filtering needs to be done on them (which you learn only by reading lib/.gitattributes and/or lib/.git/config).

So a "ls-files" that is done internally in the end-user facing "git grep --recurse-submodules" needs to be run _without_ recursing itself at least once to learn that "lib/" is a submodule. A flat "here is everything we have" does not sound like a good building block.
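The "--output-path-prefix=lib/" behavior described above is hypothetical (no such option exists), but its effect can be approximated today by rewriting a per-submodule grep's output. This sketch assumes the submodule path contains no characters special to sed; the printf stands in for "git -C lib grep -n <pattern>" output:

```shell
# Emulate a per-submodule grep whose findings are reported relative to
# the superproject: run grep inside the submodule, then prepend the
# submodule path to each "path:line:match" record.
printf 'Makefile:3:hello\nhello.py:1:hello\n' | sed 's,^,lib/,'
```

This is what a thread assigned to lib/ would conceptually do: grep with lib/'s own index and attributes, but report findings as lib/Makefile and lib/hello.py.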
Re: [RFC/PATCH] ls-files: adding support for submodules
On Fri, Sep 9, 2016 at 4:01 PM, Junio C Hamano wrote:

> Brandon Williams writes:
>
>> Allow ls-files to recognize submodules in order to retrieve a list of
>> files from a repository's submodules. This is done by forking off a
>> process to recursively call ls-files on all submodules.
>
> While I see why "ls-files --recurse-submodules" sounds nice ("hey, I
> can get a list of _all_ the files here"), and I am quite happy with
> the quality of the implementation (not just the code but its
> documentation and tests)

The plan is to hook the ls-files machinery into git-grep as the way of obtaining files to grep for a pattern. As git-grep already has a thread pool, getting the list of files single-threaded is more beneficial than, e.g., calling git-grep recursively for submodules, as then you may end up with a fork bomb.

>> diff --git a/t/t3007-ls-files-recurse-submodules.sh
>> b/t/t3007-ls-files-recurse-submodules.sh

> We used to do that when we didn't know better.

Oops, a lot of stuff slipped through the first internal review.
Re: [RFC/PATCH] ls-files: adding support for submodules
Brandon Williams writes:

> Allow ls-files to recognize submodules in order to retrieve a list of
> files from a repository's submodules. This is done by forking off a
> process to recursively call ls-files on all submodules.

While I see why "ls-files --recurse-submodules" sounds nice ("hey, I can get a list of _all_ the files here"), and I am quite happy with the quality of the implementation (not just the code but its documentation and tests), especially from a first-time contributor, I am not quite sure what the utility of this new feature would be, especially given that the command is plumbing, i.e. meant to be a useful building block for scripts. If I get

    $ git ls-files --recurse-submodules
    Makefile
    lib/Makefile
    lib/hello.py
    main.py
    goodbye.py

out of the command, what can I do with it without knowing where the submodule boundaries are? It's not like I can just do

    git ls-files --recurse-submodules | while read path
    do
        git update-index --add "$path"
    done

when "lib/" is a submodule. Instead, I'd need to go to "lib/" and then add "Makefile" and "hello.py" from there.

> diff --git a/t/t3007-ls-files-recurse-submodules.sh
> b/t/t3007-ls-files-recurse-submodules.sh
> new file mode 100644
> index 000..78deded
> --- /dev/null
> +++ b/t/t3007-ls-files-recurse-submodules.sh
> @@ -0,0 +1,103 @@
> +...
> +test_expect_success 'setup directory structure and submodules' '
> +...
> +'
> +
> +cat >expect <<-EOF
> +.gitmodules
> +a
> +b/b
> +submodule/c
> +EOF

We used to do that when we didn't know better. Please don't do things outside a test_expect_* block, especially in a new script.
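The update-index problem above is exactly a boundary-dispatch problem: a consumer that did know where the submodules start could route each flattened path to the right repository's index. A minimal sketch, assuming (hypothetically) that "lib/" is the only submodule; the echoed git commands are illustrative, only the path-splitting logic runs here:

```shell
# For each flattened path, decide which repository's index it belongs
# to: paths under lib/ go to the submodule (with the prefix stripped),
# everything else to the superproject. Commands are echoed, not run.
for path in Makefile lib/Makefile lib/hello.py main.py; do
    case $path in
    lib/*) echo "git -C lib update-index --add ${path#lib/}" ;;
    *)     echo "git update-index --add $path" ;;
    esac
done
```

The point of the objection stands: nothing in the flat ls-files output tells the consumer that "lib/*" needs this special treatment, so the boundary information has to come from somewhere else.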
Re: [RFC/PATCH] ls-files: adding support for submodules
Jeff King writes:

> So I suppose another option would be to teach ls-files a "prefix" option
> to add to each filename, and just pass in the submodule path. Then you
> can let the sub-processes write directly to the common stdout, and I
> think it would be safe to blindly pass the parent argv into the child
> processes.

I think that is a sensible way to do this, instead of reading the child's "-z" output and re-showing it according to whatever output modes you were told to use.
Re: [RFC/PATCH] ls-files: adding support for submodules
On Fri, Sep 09, 2016 at 02:53:24PM -0700, Brandon Williams wrote:

> Allow ls-files to recognize submodules in order to retrieve a list of
> files from a repository's submodules. This is done by forking off a
> process to recursively call ls-files on all submodules.
>
> Signed-off-by: Brandon Williams
> ---
> Hey git developers!
>
> I'm new to the community and this is the first patch for an open source
> project that I have worked on.
>
> I'm looking forward to working on the project!

Welcome. :)

Submodules are not really my area of expertise, so I don't have any commentary on the goal of the patch, except that it sounds reasonable to my layman's ears. The implementation looks fairly clean. A few comments:

> +static void show_gitlink(const struct cache_entry *ce)
> +{
> +	struct child_process cp = CHILD_PROCESS_INIT;
> +	struct strbuf buf = STRBUF_INIT;
> +	struct strbuf name = STRBUF_INIT;
> +	int submodule_name_len;
> +	FILE *fp;
> +
> +	argv_array_push(&cp.args, "ls-files");
> +	argv_array_push(&cp.args, "--recurse-submodules");
> +	cp.git_cmd = 1;
> +	cp.dir = ce->name;
> +	cp.out = -1;
> +	start_command(&cp);
> +	fp = fdopen(cp.out, "r");

You should error-check the result of start_command(). I guess the reasonable outcome would be to die(), as a failure is a sign that we could not fork, find git, etc. Ditto for fdopen (you can use xfdopen for that).

> +	/*
> +	 * The ls-files child process produces filenames relative to
> +	 * the submodule. Prefix each line with the submodule path
> +	 * to make it relative to the current repository.
> +	 */
> +	strbuf_addstr(&name, ce->name);
> +	strbuf_addch(&name, '/');
> +	submodule_name_len = name.len;
> +	while (strbuf_getline(&buf, fp) != EOF) {
> +		strbuf_addbuf(&name, &buf);
> +		write_name(name.buf);
> +		strbuf_setlen(&name, submodule_name_len);
> +	}

What happens if the filename in the submodule needs quoting? You'll get the quoted value in your buffer, and then re-quote it again in write_name(). The simplest thing would probably be to use "ls-files -z" for the recursive invocation, and then split on NUL bytes (we have strbuf_getline_nul for that).

> +	finish_command(&cp);

What should happen if finish_command() tells us that the ls-files sub-process reported an error? It may not be worth aborting the rest of the listing, but we might want to propagate that in our own return code.

> +	strbuf_release(&buf);
> +	strbuf_release(&name);
> +	fclose(fp);
> +}

A minor style nit, but I would generally fclose(fp) before running finish_command() (i.e., resource cleanup in the reverse order of allocation). It doesn't matter in this case, because "fp" is output from the process and we know we've already read to EOF. In other cases it could cause a deadlock (e.g., we end up in wait() for the child process to finish, but it is blocked in write() waiting for us to read). So I think it's a good habit to get into.

> @@ -519,6 +566,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
> 	if (require_work_tree && !is_inside_work_tree())
> 		setup_work_tree();
>
> +	if (recurse_submodules &&
> +	    (show_stage || show_deleted || show_others || show_unmerged ||
> +	     show_killed || show_modified || show_resolve_undo ||
> +	     show_valid_bit || show_tag || show_eol))
> +		die("ls-files --recurse-submodules can only be used in "
> +		    "--cached mode");

Woah, that list of variables is getting rather long. This is not a problem introduced by your patch, so it's not a blocker. But I wonder if some of them are mutually exclusive and could be collapsed into a single variable.

I guess the reason for this "only with --cached" restriction is that you do not propagate the options down to the recursive process. If we were to do that, then this big list of restrictions would go away. I'd be OK with starting with more limited functionality like your patch, though. I think doing the recursive thing correctly would also involve parsing the output of each child to append the filename prefix. So I suppose another option would be to teach ls-files a "prefix" option to add to each filename, and just pass in the submodule path. Then you can let the sub-processes write directly to the common stdout, and I think it would be safe to blindly pass the parent argv into the child processes.

-Peff
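The two suggestions in this review (NUL-delimited child output, plus a prefix applied by whoever prints the names) compose naturally. A rough shell analogue of what the parent's loop over a submodule at "sub/" would do; the printf stands in for the child's "git ls-files -z" output, and a path containing a space demonstrates why NUL-splitting matters:

```shell
# Emulate the parent consuming a child's NUL-delimited listing:
# split strictly on NULs (so spaces/newlines in names survive) and
# prepend the submodule path to each entry before printing.
printf 'Makefile\0docs/a b.txt\0' | xargs -0 -n1 printf 'sub/%s\n'
```

With a real "--prefix" option on ls-files, the child would apply "sub/" itself and write straight to the shared stdout, making this parent-side rewriting unnecessary.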