Re: [PATCH] receive-pack: plug minor memory leak in unpack()

2014-10-20 Thread Jeff King
On Sun, Oct 19, 2014 at 01:13:30PM +0200, René Scharfe wrote:

 We could flip it to give the managed version the short name (and calling
 the unmanaged version env_ptr or something). That would require
 munging the existing callers, but the tweak would be simple.
 
 Perhaps, but I'm a but skeptical of big renames.  Let's start small and add
 env_array, and see how far we get with that.

Yeah, having basically implemented patches similar to yours, I think
that is a good first step. Both of your patches looked good to me.

 Trickiness makes me nervous, especially in daemon.c.  And 5% CPU usage just
 for waiting sounds awful.  Using waitpid(0, ...) is not supported by the
 current implementation in compat/mingw.c, however.

I guess you could use wait() and a counter that you increment whenever
you get SIGCLD, but that feels a bit hacky. I wonder how bad a real
waitpid would be for mingw.

 By the way, does getaddrinfo(3) show up in your profiles much?  Recently I
 looked into calling it only on demand instead of for all incoming
 connections because doing that unconditional with a user-supplied
 (tainted) hostname just felt wrong.  The resulting patch series turned out
 to be not very pretty and I didn't see any performance improvements in my
 very limited tests, however; not sure if it's worth it.

It shows up in the child, not the parent, so it wasn't part of the
profiling I did recently. I did look at it just now, and it does
introduce some latency into each request (though not a lot of CPU;
mainly it's the DNS request). Like you, I'm nervous about convincing
git-daemon to do lookups on random hosts. By itself it's not horrible
(except for tying up git-daemon with absurdly long chains of glueless
references), but I worry that it could exacerbate other problems
(overflows or other bugs in DNS resolvers, as part of a cache-poisoning
scheme, orbeing used for DoS amplification).

I think doing it on demand would be a lot more sensible. We do not need
to do a lookup at all unless the %H, %CH, or %IP interpolated path
features are used. And we do not need to do hostname canonicalization
unless %CH is used.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] receive-pack: plug minor memory leak in unpack()

2014-10-19 Thread René Scharfe

Am 14.10.2014 um 11:16 schrieb Jeff King:

On Mon, Oct 13, 2014 at 12:08:09PM -0700, Junio C Hamano wrote:


I wonder if run-command should provide a managed env array similar
to the args array.


That's a good idea.



I took a look at a few of them:


I took a brief look, too.

I had hoped we could just all it env and everybody would be happy
using it instead of bare pointers. But quite a few callers assign
local_repo_env to to child_process.env. We could copy it into an
argv_array, of course, but that really feels like working around the
interface. So I think we would prefer to keep both forms available.


We could add a flag (clear_local_repo_env?) and reference local_repo_env 
only in run-command.c for these cases.  But some other cases remain that 
are better off providing their own array, like in daemon.c.



That raises the question: what should it be called? The argv_array
form of argv is called args. The more I see it, the more I hate that
name, as the two are easily confused. We could have:

   const char **argv;
   struct argv_array argv_array;
   const char **env;
   struct argv_array env_array;

though argv_array is a lot to type when you have a string of
argv_array_pushf() calls (which are already by themselves kind of
verbose). Maybe that's not too big a deal, though.


I actually like args and argv. :)  Mixing them up is noticed by the 
compiler, so any confusion is cleared up rather quickly.



We could flip it to give the managed version the short name (and calling
the unmanaged version env_ptr or something). That would require
munging the existing callers, but the tweak would be simple.


Perhaps, but I'm a but skeptical of big renames.  Let's start small and 
add env_array, and see how far we get with that.



  - daemon.c::handle() uses a static set of environment variables
that are not built with argv_array().  Same for argv.


Most of the callers you mentioned are good candidates. This one is
tricky.

The argv array gets malloc'd and set up by the parent git-daemon
process. Then each time we get a client, we create a new struct
child_process that references it. So using the managed argv-array would
actually be a bit more complicated (and not save any memory; we just
always point to the single copy for each child).

For the environment, we build it in a function-local buffer, point the
child_process's env field at it, start the child, and then copy the
child_process into our global list of children. When we notice a child
is dead (by linearly going through the list with waitpid), we free the
list entry. So there are a few potentially bad things here:

   1. We memcpy the child_process to put it on the list. Which does work,
  though it feels a little like we are violating the abstraction
  barrier.

   2. The child_process in the list points to the local env buffer that
  is no longer valid. There's no bug because we don't ever look at
  it. Moving to a managed env would fix that. But I have to wonder if
  we even want to be keeping the struct child_process around in the
  first place (all we really care about is the pid).

   3. If we do move to a managed env, then we expect it to get cleaned up
  in finish_command. But we never call that; we just free the list
  memory containing the child_process. We would want to call
  finish_command. Except that we will have reaped the process already
  with our call to waitpid() from check_dead_children. So we'd need a
  new call to do just the cleanup without the wait, I guess.

   4. For every loop on the listen socket, we call waitpid() for each
  living child, which is a bit wasteful. We'd probably do better to
  call waitpid(0, status, WNOHANG), and then look up the resulting
  pid in a hashmap or sorted list when we actually see something that
  died. I don't know that this is a huge problem in practice. We use
  git-daemon pretty heavily on our backend servers at GitHub, and it
  seems to use about 5% of a CPU constantly on each machine. Which is
  kind of lame, given that it isn't doing anything at all, but is
  hardly earth-shattering.

So I'm not sure if it is worth converting to a managed env. There's a
bit of ugliness, but none of it is causing any problems, and doing so
opens a can of worms. The most interesting thing to fix (to me, anyway)
is number 4, but that has nothing to do with the env in the first place.
:)


Trickiness makes me nervous, especially in daemon.c.  And 5% CPU usage 
just for waiting sounds awful.  Using waitpid(0, ...) is not supported 
by the current implementation in compat/mingw.c, however.


I agree that env handling should only be changed after the wait loop has 
been improved.


By the way, does getaddrinfo(3) show up in your profiles much?  Recently 
I looked into calling it only on demand instead of for all incoming 
connections because doing that unconditional with a user-supplied 
(tainted) hostname just felt wrong.  

Re: [PATCH] receive-pack: plug minor memory leak in unpack()

2014-10-14 Thread Jeff King
On Mon, Oct 13, 2014 at 12:08:09PM -0700, Junio C Hamano wrote:

  I wonder if run-command should provide a managed env array similar
  to the args array.
 
 I took a look at a few of them:

I took a brief look, too.

I had hoped we could just all it env and everybody would be happy
using it instead of bare pointers. But quite a few callers assign
local_repo_env to to child_process.env. We could copy it into an
argv_array, of course, but that really feels like working around the
interface. So I think we would prefer to keep both forms available.

That raises the question: what should it be called? The argv_array
form of argv is called args. The more I see it, the more I hate that
name, as the two are easily confused. We could have:

  const char **argv;
  struct argv_array argv_array;
  const char **env;
  struct argv_array env_array;

though argv_array is a lot to type when you have a string of
argv_array_pushf() calls (which are already by themselves kind of
verbose). Maybe that's not too big a deal, though.

We could flip it to give the managed version the short name (and calling
the unmanaged version env_ptr or something). That would require
munging the existing callers, but the tweak would be simple.

  - daemon.c::handle() uses a static set of environment variables
that are not built with argv_array().  Same for argv.

Most of the callers you mentioned are good candidates. This one is
tricky.

The argv array gets malloc'd and set up by the parent git-daemon
process. Then each time we get a client, we create a new struct
child_process that references it. So using the managed argv-array would
actually be a bit more complicated (and not save any memory; we just
always point to the single copy for each child).

For the environment, we build it in a function-local buffer, point the
child_process's env field at it, start the child, and then copy the
child_process into our global list of children. When we notice a child
is dead (by linearly going through the list with waitpid), we free the
list entry. So there are a few potentially bad things here:

  1. We memcpy the child_process to put it on the list. Which does work,
 though it feels a little like we are violating the abstraction
 barrier.

  2. The child_process in the list points to the local env buffer that
 is no longer valid. There's no bug because we don't ever look at
 it. Moving to a managed env would fix that. But I have to wonder if
 we even want to be keeping the struct child_process around in the
 first place (all we really care about is the pid).

  3. If we do move to a managed env, then we expect it to get cleaned up
 in finish_command. But we never call that; we just free the list
 memory containing the child_process. We would want to call
 finish_command. Except that we will have reaped the process already
 with our call to waitpid() from check_dead_children. So we'd need a
 new call to do just the cleanup without the wait, I guess.

  4. For every loop on the listen socket, we call waitpid() for each
 living child, which is a bit wasteful. We'd probably do better to
 call waitpid(0, status, WNOHANG), and then look up the resulting
 pid in a hashmap or sorted list when we actually see something that
 died. I don't know that this is a huge problem in practice. We use
 git-daemon pretty heavily on our backend servers at GitHub, and it
 seems to use about 5% of a CPU constantly on each machine. Which is
 kind of lame, given that it isn't doing anything at all, but is
 hardly earth-shattering.

So I'm not sure if it is worth converting to a managed env. There's a
bit of ugliness, but none of it is causing any problems, and doing so
opens a can of worms. The most interesting thing to fix (to me, anyway)
is number 4, but that has nothing to do with the env in the first place.
:)

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] receive-pack: plug minor memory leak in unpack()

2014-10-13 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Sat, Oct 11, 2014 at 01:00:16PM +0200, René Scharfe wrote:

 The argv_array used in unpack() is never freed.  Instead of adding
 explicit calls to argv_array_clear() use the args member of struct
 child_process and let run_command() and friends clean up for us.

 Looks good. I notice that the recently added prepare_push_cert_sha1 uses
 an argv_array to create the child_process.env, and we leak the result.

You're right.  finish_command() runs argv_array_clear() on
cmd-args, but does not care about cmd-env so anybody that use the
(optional) use these environment variable settings while running
the command would leak unless the caller takes care of it.

 I wonder if run-command should provide a managed env array similar
 to the args array.

I took a look at a few of them:

 - setup_pager() uses a static set of environment variables that are
   not built with argv_array(); should be easy to convert, though.

 - wt_status_print_submodule_summary() does use two argv_arrays, for
   env and argv.  It can use the managed args today, and with such
   a change it can also use the managed envs.

 - receive-pack.c::prepare_push_cert_sha1() would benefit from
   managed envs.

 - http-backend.c::run_service() would benefit from managed envs.

 - daemon.c::handle() uses a static set of environment variables
   that are not built with argv_array().  Same for argv.

It shouldn't be too hard to catch and convert all of them.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] receive-pack: plug minor memory leak in unpack()

2014-10-11 Thread René Scharfe
The argv_array used in unpack() is never freed.  Instead of adding
explicit calls to argv_array_clear() use the args member of struct
child_process and let run_command() and friends clean up for us.

Signed-off-by: Rene Scharfe l@web.de
---
 builtin/receive-pack.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index a51846c..443dd37 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1220,7 +1220,6 @@ static const char *pack_lockfile;
 static const char *unpack(int err_fd, struct shallow_info *si)
 {
struct pack_header hdr;
-   struct argv_array av = ARGV_ARRAY_INIT;
const char *hdr_err;
int status;
char hdr_arg[38];
@@ -1243,16 +1242,16 @@ static const char *unpack(int err_fd, struct 
shallow_info *si)
 
if (si-nr_ours || si-nr_theirs) {
alt_shallow_file = setup_temporary_shallow(si-shallow);
-   argv_array_pushl(av, --shallow-file, alt_shallow_file, NULL);
+   argv_array_push(child.args, --shallow-file);
+   argv_array_push(child.args, alt_shallow_file);
}
 
if (ntohl(hdr.hdr_entries)  unpack_limit) {
-   argv_array_pushl(av, unpack-objects, hdr_arg, NULL);
+   argv_array_pushl(child.args, unpack-objects, hdr_arg, NULL);
if (quiet)
-   argv_array_push(av, -q);
+   argv_array_push(child.args, -q);
if (fsck_objects)
-   argv_array_push(av, --strict);
-   child.argv = av.argv;
+   argv_array_push(child.args, --strict);
child.no_stdout = 1;
child.err = err_fd;
child.git_cmd = 1;
@@ -1267,13 +1266,12 @@ static const char *unpack(int err_fd, struct 
shallow_info *si)
if (gethostname(keep_arg + s, sizeof(keep_arg) - s))
strcpy(keep_arg + s, localhost);
 
-   argv_array_pushl(av, index-pack,
+   argv_array_pushl(child.args, index-pack,
 --stdin, hdr_arg, keep_arg, NULL);
if (fsck_objects)
-   argv_array_push(av, --strict);
+   argv_array_push(child.args, --strict);
if (fix_thin)
-   argv_array_push(av, --fix-thin);
-   child.argv = av.argv;
+   argv_array_push(child.args, --fix-thin);
child.out = -1;
child.err = err_fd;
child.git_cmd = 1;
-- 
2.1.2

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] receive-pack: plug minor memory leak in unpack()

2014-10-11 Thread Jeff King
On Sat, Oct 11, 2014 at 01:00:16PM +0200, René Scharfe wrote:

 The argv_array used in unpack() is never freed.  Instead of adding
 explicit calls to argv_array_clear() use the args member of struct
 child_process and let run_command() and friends clean up for us.

Looks good. I notice that the recently added prepare_push_cert_sha1 uses
an argv_array to create the child_process.env, and we leak the result. I
wonder if run-command should provide a managed env array similar to the
args array.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html