Re: [PATCH] receive-pack: plug minor memory leak in unpack()
On Sun, Oct 19, 2014 at 01:13:30PM +0200, René Scharfe wrote: We could flip it to give the managed version the short name (and calling the unmanaged version env_ptr or something). That would require munging the existing callers, but the tweak would be simple. Perhaps, but I'm a but skeptical of big renames. Let's start small and add env_array, and see how far we get with that. Yeah, having basically implemented patches similar to yours, I think that is a good first step. Both of your patches looked good to me. Trickiness makes me nervous, especially in daemon.c. And 5% CPU usage just for waiting sounds awful. Using waitpid(0, ...) is not supported by the current implementation in compat/mingw.c, however. I guess you could use wait() and a counter that you increment whenever you get SIGCLD, but that feels a bit hacky. I wonder how bad a real waitpid would be for mingw. By the way, does getaddrinfo(3) show up in your profiles much? Recently I looked into calling it only on demand instead of for all incoming connections because doing that unconditional with a user-supplied (tainted) hostname just felt wrong. The resulting patch series turned out to be not very pretty and I didn't see any performance improvements in my very limited tests, however; not sure if it's worth it. It shows up in the child, not the parent, so it wasn't part of the profiling I did recently. I did look at it just now, and it does introduce some latency into each request (though not a lot of CPU; mainly it's the DNS request). Like you, I'm nervous about convincing git-daemon to do lookups on random hosts. By itself it's not horrible (except for tying up git-daemon with absurdly long chains of glueless references), but I worry that it could exacerbate other problems (overflows or other bugs in DNS resolvers, as part of a cache-poisoning scheme, orbeing used for DoS amplification). I think doing it on demand would be a lot more sensible. We do not need to do a lookup at all unless the %H, %CH, or %IP interpolated path features are used. And we do not need to do hostname canonicalization unless %CH is used. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] receive-pack: plug minor memory leak in unpack()
Am 14.10.2014 um 11:16 schrieb Jeff King: On Mon, Oct 13, 2014 at 12:08:09PM -0700, Junio C Hamano wrote: I wonder if run-command should provide a managed env array similar to the args array. That's a good idea. I took a look at a few of them: I took a brief look, too. I had hoped we could just all it env and everybody would be happy using it instead of bare pointers. But quite a few callers assign local_repo_env to to child_process.env. We could copy it into an argv_array, of course, but that really feels like working around the interface. So I think we would prefer to keep both forms available. We could add a flag (clear_local_repo_env?) and reference local_repo_env only in run-command.c for these cases. But some other cases remain that are better off providing their own array, like in daemon.c. That raises the question: what should it be called? The argv_array form of argv is called args. The more I see it, the more I hate that name, as the two are easily confused. We could have: const char **argv; struct argv_array argv_array; const char **env; struct argv_array env_array; though argv_array is a lot to type when you have a string of argv_array_pushf() calls (which are already by themselves kind of verbose). Maybe that's not too big a deal, though. I actually like args and argv. :) Mixing them up is noticed by the compiler, so any confusion is cleared up rather quickly. We could flip it to give the managed version the short name (and calling the unmanaged version env_ptr or something). That would require munging the existing callers, but the tweak would be simple. Perhaps, but I'm a but skeptical of big renames. Let's start small and add env_array, and see how far we get with that. - daemon.c::handle() uses a static set of environment variables that are not built with argv_array(). Same for argv. Most of the callers you mentioned are good candidates. This one is tricky. The argv array gets malloc'd and set up by the parent git-daemon process. Then each time we get a client, we create a new struct child_process that references it. So using the managed argv-array would actually be a bit more complicated (and not save any memory; we just always point to the single copy for each child). For the environment, we build it in a function-local buffer, point the child_process's env field at it, start the child, and then copy the child_process into our global list of children. When we notice a child is dead (by linearly going through the list with waitpid), we free the list entry. So there are a few potentially bad things here: 1. We memcpy the child_process to put it on the list. Which does work, though it feels a little like we are violating the abstraction barrier. 2. The child_process in the list points to the local env buffer that is no longer valid. There's no bug because we don't ever look at it. Moving to a managed env would fix that. But I have to wonder if we even want to be keeping the struct child_process around in the first place (all we really care about is the pid). 3. If we do move to a managed env, then we expect it to get cleaned up in finish_command. But we never call that; we just free the list memory containing the child_process. We would want to call finish_command. Except that we will have reaped the process already with our call to waitpid() from check_dead_children. So we'd need a new call to do just the cleanup without the wait, I guess. 4. For every loop on the listen socket, we call waitpid() for each living child, which is a bit wasteful. We'd probably do better to call waitpid(0, status, WNOHANG), and then look up the resulting pid in a hashmap or sorted list when we actually see something that died. I don't know that this is a huge problem in practice. We use git-daemon pretty heavily on our backend servers at GitHub, and it seems to use about 5% of a CPU constantly on each machine. Which is kind of lame, given that it isn't doing anything at all, but is hardly earth-shattering. So I'm not sure if it is worth converting to a managed env. There's a bit of ugliness, but none of it is causing any problems, and doing so opens a can of worms. The most interesting thing to fix (to me, anyway) is number 4, but that has nothing to do with the env in the first place. :) Trickiness makes me nervous, especially in daemon.c. And 5% CPU usage just for waiting sounds awful. Using waitpid(0, ...) is not supported by the current implementation in compat/mingw.c, however. I agree that env handling should only be changed after the wait loop has been improved. By the way, does getaddrinfo(3) show up in your profiles much? Recently I looked into calling it only on demand instead of for all incoming connections because doing that unconditional with a user-supplied (tainted) hostname just felt wrong.
Re: [PATCH] receive-pack: plug minor memory leak in unpack()
On Mon, Oct 13, 2014 at 12:08:09PM -0700, Junio C Hamano wrote: I wonder if run-command should provide a managed env array similar to the args array. I took a look at a few of them: I took a brief look, too. I had hoped we could just all it env and everybody would be happy using it instead of bare pointers. But quite a few callers assign local_repo_env to to child_process.env. We could copy it into an argv_array, of course, but that really feels like working around the interface. So I think we would prefer to keep both forms available. That raises the question: what should it be called? The argv_array form of argv is called args. The more I see it, the more I hate that name, as the two are easily confused. We could have: const char **argv; struct argv_array argv_array; const char **env; struct argv_array env_array; though argv_array is a lot to type when you have a string of argv_array_pushf() calls (which are already by themselves kind of verbose). Maybe that's not too big a deal, though. We could flip it to give the managed version the short name (and calling the unmanaged version env_ptr or something). That would require munging the existing callers, but the tweak would be simple. - daemon.c::handle() uses a static set of environment variables that are not built with argv_array(). Same for argv. Most of the callers you mentioned are good candidates. This one is tricky. The argv array gets malloc'd and set up by the parent git-daemon process. Then each time we get a client, we create a new struct child_process that references it. So using the managed argv-array would actually be a bit more complicated (and not save any memory; we just always point to the single copy for each child). For the environment, we build it in a function-local buffer, point the child_process's env field at it, start the child, and then copy the child_process into our global list of children. When we notice a child is dead (by linearly going through the list with waitpid), we free the list entry. So there are a few potentially bad things here: 1. We memcpy the child_process to put it on the list. Which does work, though it feels a little like we are violating the abstraction barrier. 2. The child_process in the list points to the local env buffer that is no longer valid. There's no bug because we don't ever look at it. Moving to a managed env would fix that. But I have to wonder if we even want to be keeping the struct child_process around in the first place (all we really care about is the pid). 3. If we do move to a managed env, then we expect it to get cleaned up in finish_command. But we never call that; we just free the list memory containing the child_process. We would want to call finish_command. Except that we will have reaped the process already with our call to waitpid() from check_dead_children. So we'd need a new call to do just the cleanup without the wait, I guess. 4. For every loop on the listen socket, we call waitpid() for each living child, which is a bit wasteful. We'd probably do better to call waitpid(0, status, WNOHANG), and then look up the resulting pid in a hashmap or sorted list when we actually see something that died. I don't know that this is a huge problem in practice. We use git-daemon pretty heavily on our backend servers at GitHub, and it seems to use about 5% of a CPU constantly on each machine. Which is kind of lame, given that it isn't doing anything at all, but is hardly earth-shattering. So I'm not sure if it is worth converting to a managed env. There's a bit of ugliness, but none of it is causing any problems, and doing so opens a can of worms. The most interesting thing to fix (to me, anyway) is number 4, but that has nothing to do with the env in the first place. :) -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] receive-pack: plug minor memory leak in unpack()
Jeff King p...@peff.net writes: On Sat, Oct 11, 2014 at 01:00:16PM +0200, René Scharfe wrote: The argv_array used in unpack() is never freed. Instead of adding explicit calls to argv_array_clear() use the args member of struct child_process and let run_command() and friends clean up for us. Looks good. I notice that the recently added prepare_push_cert_sha1 uses an argv_array to create the child_process.env, and we leak the result. You're right. finish_command() runs argv_array_clear() on cmd-args, but does not care about cmd-env so anybody that use the (optional) use these environment variable settings while running the command would leak unless the caller takes care of it. I wonder if run-command should provide a managed env array similar to the args array. I took a look at a few of them: - setup_pager() uses a static set of environment variables that are not built with argv_array(); should be easy to convert, though. - wt_status_print_submodule_summary() does use two argv_arrays, for env and argv. It can use the managed args today, and with such a change it can also use the managed envs. - receive-pack.c::prepare_push_cert_sha1() would benefit from managed envs. - http-backend.c::run_service() would benefit from managed envs. - daemon.c::handle() uses a static set of environment variables that are not built with argv_array(). Same for argv. It shouldn't be too hard to catch and convert all of them. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] receive-pack: plug minor memory leak in unpack()
The argv_array used in unpack() is never freed. Instead of adding explicit calls to argv_array_clear() use the args member of struct child_process and let run_command() and friends clean up for us. Signed-off-by: Rene Scharfe l@web.de --- builtin/receive-pack.c | 18 -- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c index a51846c..443dd37 100644 --- a/builtin/receive-pack.c +++ b/builtin/receive-pack.c @@ -1220,7 +1220,6 @@ static const char *pack_lockfile; static const char *unpack(int err_fd, struct shallow_info *si) { struct pack_header hdr; - struct argv_array av = ARGV_ARRAY_INIT; const char *hdr_err; int status; char hdr_arg[38]; @@ -1243,16 +1242,16 @@ static const char *unpack(int err_fd, struct shallow_info *si) if (si-nr_ours || si-nr_theirs) { alt_shallow_file = setup_temporary_shallow(si-shallow); - argv_array_pushl(av, --shallow-file, alt_shallow_file, NULL); + argv_array_push(child.args, --shallow-file); + argv_array_push(child.args, alt_shallow_file); } if (ntohl(hdr.hdr_entries) unpack_limit) { - argv_array_pushl(av, unpack-objects, hdr_arg, NULL); + argv_array_pushl(child.args, unpack-objects, hdr_arg, NULL); if (quiet) - argv_array_push(av, -q); + argv_array_push(child.args, -q); if (fsck_objects) - argv_array_push(av, --strict); - child.argv = av.argv; + argv_array_push(child.args, --strict); child.no_stdout = 1; child.err = err_fd; child.git_cmd = 1; @@ -1267,13 +1266,12 @@ static const char *unpack(int err_fd, struct shallow_info *si) if (gethostname(keep_arg + s, sizeof(keep_arg) - s)) strcpy(keep_arg + s, localhost); - argv_array_pushl(av, index-pack, + argv_array_pushl(child.args, index-pack, --stdin, hdr_arg, keep_arg, NULL); if (fsck_objects) - argv_array_push(av, --strict); + argv_array_push(child.args, --strict); if (fix_thin) - argv_array_push(av, --fix-thin); - child.argv = av.argv; + argv_array_push(child.args, --fix-thin); child.out = -1; child.err = err_fd; child.git_cmd = 1; -- 2.1.2 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] receive-pack: plug minor memory leak in unpack()
On Sat, Oct 11, 2014 at 01:00:16PM +0200, René Scharfe wrote: The argv_array used in unpack() is never freed. Instead of adding explicit calls to argv_array_clear() use the args member of struct child_process and let run_command() and friends clean up for us. Looks good. I notice that the recently added prepare_push_cert_sha1 uses an argv_array to create the child_process.env, and we leak the result. I wonder if run-command should provide a managed env array similar to the args array. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html