Re: git submodules implementation question

2016-09-01 Thread Uma Srinivasan
>>> The final version needs to be accompanied with tests to show the
>>> effect of this change for callers.  A test would set up a top-level
>>> and submodule, deliberately break submodule/.git/ repository and
>>> show what breaks and how without this change.

Agreed!

The repo where the original problem surfaced is huge and in fact "git
status" appears to go into an infinite loop forking a zillion
processes on the system. If the dev system is small, it brings it to a
standstill. However, the good news is that I could build myself a
smaller reproducer by doing the following:

1) mkdir sm_test
2) cd sm_test
3) git clone git://git.mysociety.org/commonlib commonlib
4) git init
5) git submodule add ./commonlib/
6) cd commonlib/.git
7) rm -f all files

After this "git status" will fork several thousand processes but it
will ultimately fail as it runs into the path length max limit. I
still don't know why this doesn't happen with my original problem
repo. Perhaps I have to let it run overnight or perhaps it runs into
other system limitations before then

Anyway, with the fix "git status" fails quickly both in my reproducer
(and original repo) with the following message.
fatal: Not a git repository: '.git'
fatal: 'git status --porcelain' failed in submodule commonlib

Hope this helps.

Uma


On Thu, Sep 1, 2016 at 12:19 PM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>>> The final version needs to be accompanied with tests to show the
>>> effect of this change for callers.  A test would set up a top-level
>>> and submodule, deliberately break submodule/.git/ repository and
>>> show what breaks and how without this change.
>>
>> Tests are really good at providing this context as well, or to communicate
>> the actual underlying problem, which is not quite clear to me.
>> That is why I refrained from jumping into the discussion as I think the
>> first few emails were dropped from the mailing list and I am missing context.
>
> I do not know where you started reading, but the gist of it is that
> submodule.c spawns subprocess to run in the submodule's context by
> assuming that chdir'ing into the  of the submodule and running
> it (i.e. cp.dir set to  to drive start_command()) is
> sufficient.  When /.git (either it is a directory itself or it
> points at a directory in .git/module/ in the superproject) is
> a corrupt repository, running "git -C  command" would try to
> auto-detect the repository, because it thinks /.git is not a
> repository and it thinks it is not at the top-level of the working
> tree, and instead finds the repository of the top-level, which is
> almost never what we want.
>


Re: git submodules implementation question

2016-09-01 Thread Uma Srinivasan
Yes, this one line fix addresses my problem.

So, what is the next step? Will someone submit a patch or should I?
Please note that I've never submitted a patch before, but I don't mind
learning how to.

Thanks,
Uma

> --- a/submodule.c
> +++ b/submodule.c
> @@ -1160,4 +1160,5 @@ void prepare_submodule_repo_env(struct argv_array *out)
> if (strcmp(*var, CONFIG_DATA_ENVIRONMENT))
> argv_array_push(out, *var);
> }
> +   argv_array_push(out, "GIT_DIR=.git");
>  }
>


Re: git submodules implementation question

2016-08-31 Thread Uma Srinivasan
h, NULL);
-   prepare_submodule_repo_env(>env_array);
+   prepare_submodule_repo_env(>env_array, git_dir);
cp->git_cmd = 1;
if (!spf->quiet)
strbuf_addf(err, "Fetching submodule %s%s\n",
@@ -958,15 +993,14 @@ unsigned is_submodule_modified(const char *path,
int ignore_untracked)
strbuf_release();
/* The submodule is not checked out, so it is not modified */
return 0;
-
}
-   strbuf_reset();

if (ignore_untracked)
argv[2] = "-uno";

cp.argv = argv;
-   prepare_submodule_repo_env(_array);
+   prepare_submodule_repo_env(_array, git_dir);
+   strbuf_reset();
cp.git_cmd = 1;
cp.no_stdin = 1;
cp.out = -1;
@@ -1023,11 +1057,11 @@ int submodule_uses_gitfile(const char *path)
strbuf_release();
return 0;
}
-   strbuf_release();

/* Now test that all nested submodules use a gitfile too */
cp.argv = argv;
-   prepare_submodule_repo_env(_array);
+   prepare_submodule_repo_env(_array, git_dir);
+   strbuf_release();
cp.git_cmd = 1;
cp.no_stdin = 1;
cp.no_stderr = 1;
@@ -1052,6 +1086,7 @@ int ok_to_remove_submodule(const char *path)
};
struct strbuf buf = STRBUF_INIT;
int ok_to_remove = 1;
+   const char *git_dir;

if (!file_exists(path) || is_empty_dir(path))
return 1;
@@ -1060,7 +1095,10 @@ int ok_to_remove_submodule(const char *path)
return 0;

cp.argv = argv;
-   prepare_submodule_repo_env(_array);
+   strbuf_addf(, "%s/.git", path);
+   git_dir = read_gitfile(buf.buf);
+   prepare_submodule_repo_env(_array, git_dir);
+   strbuf_release();
cp.git_cmd = 1;
cp.no_stdin = 1;
cp.out = -1;
@@ -1272,12 +1310,16 @@ int parallel_submodules(void)
return parallel_jobs;
 }

-void prepare_submodule_repo_env(struct argv_array *out)
+void prepare_submodule_repo_env(struct argv_array *out, const char* git_dir)
 {
const char * const *var;

for (var = local_repo_env; *var; var++) {
if (strcmp(*var, CONFIG_DATA_ENVIRONMENT))
argv_array_push(out, *var);
+   if (strcmp(*var, GIT_DIR_ENVIRONMENT))
+   argv_array_pushf(out, "%s=%s", GIT_DIR_ENVIRONMENT,
+real_path(git_dir));
}
+
 }
diff --git a/submodule.h b/submodule.h
index d9e197a..4f8b0c7 100644
--- a/submodule.h
+++ b/submodule.h
@@ -73,6 +73,6 @@ int parallel_submodules(void);
  * a submodule by clearing any repo-specific envirionment variables, but
  * retaining any config in the environment.
  */
-void prepare_submodule_repo_env(struct argv_array *out);
+void prepare_submodule_repo_env(struct argv_array *out, const char *git_dir);

On Wed, Aug 31, 2016 at 11:58 AM, Uma Srinivasan
<usriniva...@twitter.com> wrote:
>> We want to affect only the process we are going to spawn to work
>> inside the submodule, not ourselves, which is what this call does;
>> this does not sound like a good idea.
>
> Okay, in that case I would have to pass the "git_dir" as a new
> argument to prepare_submodule_repo_env(). I know what to pass from the
> is_submodule_modified() caller. I don't think it's all that obvious
> for the other callers.
>
> Thanks,
> Uma
>
> On Wed, Aug 31, 2016 at 11:44 AM, Junio C Hamano <gits...@pobox.com> wrote:
>> Uma Srinivasan <usriniva...@twitter.com> writes:
>>
>>> diff --git a/submodule.c b/submodule.c
>>> index 5a62aa2..23443a7 100644
>>> --- a/submodule.c
>>> +++ b/submodule.c
>>> @@ -960,6 +960,9 @@ unsigned is_submodule_modified(const char *path,
>>> int ignore_untracked)
>>> return 0;
>>>
>>> }
>>> +   /* stuff submodule git dir into env var */
>>> +   set_git_dir(git_dir);
>>
>> We want to affect only the process we are going to spawn to work
>> inside the submodule, not ourselves, which is what this call does;
>> this does not sound like a good idea.
>>


Re: git submodules implementation question

2016-08-31 Thread Uma Srinivasan
> We want to affect only the process we are going to spawn to work
> inside the submodule, not ourselves, which is what this call does;
> this does not sound like a good idea.

Okay, in that case I would have to pass the "git_dir" as a new
argument to prepare_submodule_repo_env(). I know what to pass from the
is_submodule_modified() caller. I don't think it's all that obvious
for the other callers.

Thanks,
Uma

On Wed, Aug 31, 2016 at 11:44 AM, Junio C Hamano <gits...@pobox.com> wrote:
> Uma Srinivasan <usriniva...@twitter.com> writes:
>
>> diff --git a/submodule.c b/submodule.c
>> index 5a62aa2..23443a7 100644
>> --- a/submodule.c
>> +++ b/submodule.c
>> @@ -960,6 +960,9 @@ unsigned is_submodule_modified(const char *path,
>> int ignore_untracked)
>> return 0;
>>
>> }
>> +   /* stuff submodule git dir into env var */
>> +   set_git_dir(git_dir);
>
> We want to affect only the process we are going to spawn to work
> inside the submodule, not ourselves, which is what this call does;
> this does not sound like a good idea.
>


Re: git submodules implementation question

2016-08-31 Thread Uma Srinivasan
FWIW, I tried the following quick change based on the info. provided
by Junio and it seems to "fix" my original problem. I'll let you
experts figure out if we need a more complete fix or not.

diff --git a/submodule.c b/submodule.c
index 5a62aa2..23443a7 100644
--- a/submodule.c
+++ b/submodule.c
@@ -960,6 +960,9 @@ unsigned is_submodule_modified(const char *path,
int ignore_untracked)
return 0;

}
+   /* stuff submodule git dir into env var */
+   set_git_dir(git_dir);
+
strbuf_reset();

if (ignore_untracked)
@@ -1279,5 +1282,9 @@ void prepare_submodule_repo_env(struct argv_array *out)
for (var = local_repo_env; *var; var++) {
if (strcmp(*var, CONFIG_DATA_ENVIRONMENT))
argv_array_push(out, *var);
+   if (strcmp(*var, GIT_DIR_ENVIRONMENT))
+   argv_array_pushf(out, "%s=%s", GIT_DIR_ENVIRONMENT,
+get_git_dir());
}
+
 }

Thanks,
Uma

On Wed, Aug 31, 2016 at 9:42 AM, Junio C Hamano <gits...@pobox.com> wrote:
> Uma Srinivasan <usriniva...@twitter.com> writes:
>
>>> I might suggest to update prepare_submodule_repo_env() so that the
>>> spawned process will *NOT* have to guess where the working tree and
>>> repository by exporting GIT_DIR (set to "git_dir" we discover above)
>>> and GIT_WORK_TREE (set to "." as cp.dir is set to the path to the
>>> working tree of the submodule).  That would stop the "git status" to
>>> guess (and fooled by a corrupted dir/.git that is not a git
>>> repository).
>>
>> Here's where I am struggling with my lack of knowledge of git
>> internals and the implementation particularly in the context of how
>> environment variables are passed from the parent to the child process.
>
> Ah, I was primarily addressing Jacob in the latter part of my
> message, as he's looked at similar things in his recent topic.
>
>> Are you suggesting that we set up the child process environment array
>> (the "out" argument) in prepare_submodule_repo_env() to include
>> GIT_DIR_ENVIRONMENT and GIT_WORK_TREE_ENVIRONMENT in addition to
>> CONFIG_DATA_ENVIRONMENT that is there now?
>
> I was wondering if we should unconditionally stuff GIT_DIR= repository location for the submodule> in the cp.env_array passed to
> the function prepare_submodule_repo_env().  As cp.dir will be set to
> the submodule's working tree, there is no need to set GIT_WORK_TREE
> and export it, I think, although it would not hurt.
>
> After all, we _are_ going into a separate and different project's
> repository (that is what a submodule is), and we _know_ where its
> repository data (i.e. GIT_DIR) and the location of its working tree
> (i.e. GIT_WORK_TREE).  There is no reason for the process that will
> work in the submodule to go through the usual "do we have .git in
> our $cwd that is a repository?  if not how about the parent directory
> of $cwd?  go up until we find one and that directory is the top of
> the working tree" discovery.
>
> More importantly, this matters when your GIT_DIR for the submodule
> is somehow corrupt.  The discovery process would say "there is .git
> in $cwd but that is not a repository" and continue upwards, which
> likely would find the repository for the top-level superproject,
> which definitely is _not_ want you want to happen.  And one way to
> avoid it is to tell the setup.c code that it should not do the
> discovery by being explicit.
>


Re: git submodules implementation question

2016-08-30 Thread Uma Srinivasan
On Tue, Aug 30, 2016 at 10:53 AM, Junio C Hamano <gits...@pobox.com> wrote:
> Uma Srinivasan <usriniva...@twitter.com> writes:
>
>> I think the following fix is still needed to is_submodule_modified():
>>
>> strbuf_addf(, "%s/.git", path);
>> git_dir = read_gitfile(buf.buf);
>> if (!git_dir) {
>> git_dir = buf.buf;
>>  ==>   if (!is_git_directory(git_dir)) {
>>  ==> die("Corrupted .git dir in submodule %s", path);
>>  ==>   }
>> }
>
> If it is so important that git_dir is a valid Git
> repository after
>
> git_dir = read_gitfile(buf.buf);
> if (!git_dir)
> git_dir = buf.buf;
>
> is done to determine what "git_dir" to use, it seems to me that it
> does not make much sense to check ONLY dir/.git that is a directory
> and leave .git/modules/$name that dir/.git file points at unchecked.
>

Okay.

> But there is much bigger problem with the above addition, I think.
> There also can be a case where dir/ does not even have ".git" in it.
> A submodule the user is not interested in will just have an empty
> directory there, and immediately after the above three lines I
> reproduced above, we have this
>
> if (!is_directory(git_dir)) {
> strbuf_release();
> return 0;
> }
>

Good to know about this use case.

> The added check will break the use case.  If anything, that check,
> if this code needs to verify that "git_dir" points at a valid Git
> repository, should come _after_ that.

Okay.

>
> Shouldn't "git-status --porcelain" run in the submodule notice that
> it is not a valid repository and quickly die anyway?  Why should we
> even check before spawning that process in the first place?

I don't see any reason to disagree with this.

>
> I might suggest to update prepare_submodule_repo_env() so that the
> spawned process will *NOT* have to guess where the working tree and
> repository by exporting GIT_DIR (set to "git_dir" we discover above)
> and GIT_WORK_TREE (set to "." as cp.dir is set to the path to the
> working tree of the submodule).  That would stop the "git status" to
> guess (and fooled by a corrupted dir/.git that is not a git
> repository).
>

Here's where I am struggling with my lack of knowledge of git
internals and the implementation particularly in the context of how
environment variables are passed from the parent to the child process.

Are you suggesting that we set up the child process environment array
(the "out" argument) in prepare_submodule_repo_env() to include
GIT_DIR_ENVIRONMENT and GIT_WORK_TREE_ENVIRONMENT in addition to
CONFIG_DATA_ENVIRONMENT that is there now?  Can I use the strcmp and
argv_array_push() calls to do this? There are several callers for this
routine prepare_submodule_repo_env(). Would the caller pass the git
dir and work tree as arguments to this routine or would the caller set
up the environment variables as needed? Is there any documentation or
other example code where similar passing of env. vars to a child
process is being done?

Sorry for all these questions but I'm still a novice as far as the
code is concerned.

Thanks for your help.

Uma


Re: git submodules implementation question

2016-08-29 Thread Uma Srinivasan
This is great! Thanks Jake. If you happen to have the patch ID it
would be helpful.

Uma

On Mon, Aug 29, 2016 at 5:02 PM, Jacob Keller <jacob.kel...@gmail.com> wrote:
> On Mon, Aug 29, 2016 at 4:15 PM, Junio C Hamano <gits...@pobox.com> wrote:
>> Uma Srinivasan <usriniva...@twitter.com> writes:
>>> This fixes my issue but what do you think? Is this the right way to
>>> fix it? Is there a better way?
>>
>> I think we already have a helper function that does a lot better
>> than "does it have a file called HEAD" to ask "is this a git
>> directory?" and its name I think is "is_git_directory" (I know, we
>> are not imaginative when naming our functions).
>>
>> As to the check makes sense in the context of this function, I am
>> not an expert to judge.  I'd expect Jens, Heiko and/or Stefan to
>> know better than I do.
>
> One of my patches adds a "is_git_directory()" call to this, and if we
> fail falls back to checking the .gitmodules and git-config for
> information regarding the submodule should it no longer be checked
> out. I suspect this patch will address your concern.
>
> Thanks,
> Jake


Re: git submodules implementation question

2016-08-29 Thread Uma Srinivasan
Yes, is_git_directory() is much better. Thanks for the pointer.

I will submit a patch unless I hear more suggestions from others.

Uma



On Mon, Aug 29, 2016 at 4:15 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Uma Srinivasan <usriniva...@twitter.com> writes:
>
>> On Mon, Aug 29, 2016 at 2:13 PM, Uma Srinivasan <usriniva...@twitter.com> 
>> wrote:
>>> Ok that makes sense. Thanks much.
>>>
>>> Uma
>>>
>> With respect to my original problem with a corrupted .git directory
>> under the submodule directory, I am thinking of adding the following 4
>> lines marked with ### to is_submodule_modified() to detect the
>> corrupted dir and die quickly instead of forking several child
>> processes:
>>
>>strbuf_addf(, "%s/.git", path);
>>git_dir = read_gitfile(buf.buf);
>>if (!git_dir) {
>>  ### strbuf_addf(_ref, "%s/HEAD",buf.buf);
>>  ### if (strbuf_read_file(_ref, head_ref.buf,0) < 0) {
>>   ### die("Corrupted .git dir in submodule %s", 
>> path);
>>  ###}
>>  git_dir = buf.buf;
>>   }
>>
>> This fixes my issue but what do you think? Is this the right way to
>> fix it? Is there a better way?
>
> I think we already have a helper function that does a lot better
> than "does it have a file called HEAD" to ask "is this a git
> directory?" and its name I think is "is_git_directory" (I know, we
> are not imaginative when naming our functions).
>
> As to the check makes sense in the context of this function, I am
> not an expert to judge.  I'd expect Jens, Heiko and/or Stefan to
> know better than I do.


Re: git submodules implementation question

2016-08-29 Thread Uma Srinivasan
With respect to my original problem with a corrupted .git directory
under the submodule directory, I am thinking of adding the following 4
lines marked with ### to is_submodule_modified() to detect the
corrupted dir and die quickly instead of forking several child
processes:

   strbuf_addf(, "%s/.git", path);
   git_dir = read_gitfile(buf.buf);
   if (!git_dir) {
 ### strbuf_addf(_ref, "%s/HEAD",buf.buf);
 ### if (strbuf_read_file(_ref, head_ref.buf,0) < 0) {
  ### die("Corrupted .git dir in submodule %s", path);
 ###}
 git_dir = buf.buf;
  }

This fixes my issue but what do you think? Is this the right way to
fix it? Is there a better way?

Thanks,
Uma

On Mon, Aug 29, 2016 at 2:13 PM, Uma Srinivasan <usriniva...@twitter.com> wrote:
> Ok that makes sense. Thanks much.
>
> Uma
>
> On Mon, Aug 29, 2016 at 2:09 PM, Junio C Hamano <gits...@pobox.com> wrote:
>> On Mon, Aug 29, 2016 at 2:03 PM, Uma Srinivasan <usriniva...@twitter.com> 
>> wrote:
>>> On Mon, Aug 29, 2016 at 1:03 PM, Junio C Hamano <gits...@pobox.com> wrote:
>>>>
>>>> A top-level superproject can have a submodule bound at its "dir/"
>>>> directory, and "dir/.git" can either be a gitfile which you can read
>>>> with read_gitfile() and point into somewhere in ".git/modules/" of
>>>> the top-level superproject.  "dir/.git" can _ALSO_ be a fully valid
>>>> Git directory.  So at the top of a superproject, you could do
>>>>
>>>> git clone $URL ./dir2
>>>> git add dir2
>>>>
>>>> to clone an independent project into dir2 directory, and add it as a
>>>> new submodule.  The fallback is to support such a layout.
>>>>
>>> Thanks for the reply. However, in this case
>>>
>>>   git clone $URL ./dir2
>>>   git add dir2
>>>
>>> how will "dir2" get ever get registered as a submodule?
>>
>> With a separate invocation of "git config -f .gitmodules", of course.
>> The layout to use gitfile to point into .git/modules/ is a more recent
>> invention than the submodule support itself that "git add" knows about.
>> The code needs to support both layout as well as it can, and that
>> is what the "can we read it as gitfile?  If not, that directory itself may
>> be a git repository" codepath you asked about is doing.


Re: git submodules implementation question

2016-08-29 Thread Uma Srinivasan
Ok that makes sense. Thanks much.

Uma

On Mon, Aug 29, 2016 at 2:09 PM, Junio C Hamano <gits...@pobox.com> wrote:
> On Mon, Aug 29, 2016 at 2:03 PM, Uma Srinivasan <usriniva...@twitter.com> 
> wrote:
>> On Mon, Aug 29, 2016 at 1:03 PM, Junio C Hamano <gits...@pobox.com> wrote:
>>>
>>> A top-level superproject can have a submodule bound at its "dir/"
>>> directory, and "dir/.git" can either be a gitfile which you can read
>>> with read_gitfile() and point into somewhere in ".git/modules/" of
>>> the top-level superproject.  "dir/.git" can _ALSO_ be a fully valid
>>> Git directory.  So at the top of a superproject, you could do
>>>
>>> git clone $URL ./dir2
>>> git add dir2
>>>
>>> to clone an independent project into dir2 directory, and add it as a
>>> new submodule.  The fallback is to support such a layout.
>>>
>> Thanks for the reply. However, in this case
>>
>>   git clone $URL ./dir2
>>   git add dir2
>>
>> how will "dir2" get ever get registered as a submodule?
>
> With a separate invocation of "git config -f .gitmodules", of course.
> The layout to use gitfile to point into .git/modules/ is a more recent
> invention than the submodule support itself that "git add" knows about.
> The code needs to support both layout as well as it can, and that
> is what the "can we read it as gitfile?  If not, that directory itself may
> be a git repository" codepath you asked about is doing.


Re: git submodules implementation question

2016-08-29 Thread Uma Srinivasan
Thanks for the reply. However, in this case

  git clone $URL ./dir2
  git add dir2

how will "dir2" get ever get registered as a submodule? I don't see
how one can reach the "is_submodule_modified" routine for the scenario
above.

My understanding is that a sub-directory can be registered as a
submodule only with the "git submodule add " command. In this case
only a gitlink file is created within the sub-directory and not a .git
subdirectory. Please correct me if I am wrong.

Thanks again,
Uma

On Mon, Aug 29, 2016 at 1:03 PM, Junio C Hamano <gits...@pobox.com> wrote:
> Uma Srinivasan <usriniva...@twitter.com> writes:
>
>> git_dir = read_gitfile(buf.buf);
>> if (!git_dir)
>>
>> git_dir = buf.buf;
>>
>> Can anyone explain to me why we are replacing a failed reading of a
>> git file with the original sub directory name?
>
> A top-level superproject can have a submodule bound at its "dir/"
> directory, and "dir/.git" can either be a gitfile which you can read
> with read_gitfile() and point into somewhere in ".git/modules/" of
> the top-level superproject.  "dir/.git" can _ALSO_ be a fully valid
> Git directory.  So at the top of a superproject, you could do
>
> git clone $URL ./dir2
> git add dir2
>
> to clone an independent project into dir2 directory, and add it as a
> new submodule.  The fallback is to support such a layout.
>


git submodules implementation question

2016-08-28 Thread Uma Srinivasan
Hi,

I am new to git source and internal development. Recently I've been
looking at a problem where issuing "git status" in a corrupted
workspace handed over to me by a user, forks several thousand child
processes recursively.

The symptoms of the corrupted workspace to reproduce this problem are
as follows:

there needs to be an entry in the index file that looks something like this
0 groc 16 0 0.0

Also, there's a subdirectory under the source dir with the name in the
index file above (i.e 'groc') that has a partially populated .git sub
directory within it, not a .git file with the contents providing a
link to the module subdirectory.

The "git submodule status" command returns "No submodule mapping found
in .gitmodules for path 'groc'"

Doing a "git read-tree HEAD" will regenerate the index file and make
the problem go away. Basically the line entry above relating to 'groc'
goes away as it is not a submodule in the source repo.

This problem can be reproduced with all current versions of GIT
including 2.4, 2.8 and the latest version v2.10.0-rc1.

In looking into the git source code for is_submodule_modified() which
forks off the new child process, I see the following lines in
submodule.c:

git_dir = read_gitfile(buf.buf);
if (!git_dir)

git_dir = buf.buf;

Can anyone explain to me why we are replacing a failed reading of a
git file with the original sub directory name? When can we expect a
.git directory under a submodule?

Thanks,
Uma
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html