Re: [PATCH] Enable parallelism in git submodule update.

2012-11-03 Thread Jens Lehmann
Am 30.10.2012 19:11, schrieb Stefan Zager:
 This is a refresh of a conversation from a couple of months ago.
 
 I didn't try to implement all the desired features (e.g., smart logic
 for passing a -j parameter to recursive submodule invocations), but I
 did address the one issue that Junio insisted on: the code makes a
 best effort to detect whether xargs supports parallel execution on the
 host platform, and if it doesn't, then it prints a warning and falls
 back to serial execution.

I suspect not passing on --jobs recursively like you do here is the
right thing to do, as that would give exponential growth of jobs with
recursion depth, which makes no sense to me.

A still unsolved issue is the unstructured output from the different
update jobs. It'll be hard (if not impossible) to see in what submodule
which update took place (or failed). I think we should have a solution
for that too (maybe one of those Heiko mentioned or something as simple
as implying -q?).

 Stefan
 
 On Tue, Oct 30, 2012 at 11:03 AM,  sza...@google.com wrote:
 The --jobs parameter may be used to set the degree of per-submodule
 parallel execution.

 Signed-off-by: Stefan Zager sza...@google.com
 ---
  Documentation/git-submodule.txt |8 ++-
  git-submodule.sh|   40 
 ++-
  2 files changed, 46 insertions(+), 2 deletions(-)

 diff --git a/Documentation/git-submodule.txt 
 b/Documentation/git-submodule.txt
 index b4683bb..cb23ba7 100644
 --- a/Documentation/git-submodule.txt
 +++ b/Documentation/git-submodule.txt
 @@ -14,7 +14,8 @@ SYNOPSIS
  'git submodule' [--quiet] status [--cached] [--recursive] [--] [path...]
  'git submodule' [--quiet] init [--] [path...]
  'git submodule' [--quiet] update [--init] [-N|--no-fetch] [--rebase]
 - [--reference repository] [--merge] [--recursive] [--] 
 [path...]
 + [--reference repository] [--merge] [--recursive]
 + [-j|--jobs [jobs]] [--] [path...]
  'git submodule' [--quiet] summary [--cached|--files] [(-n|--summary-limit) 
 n]
   [commit] [--] [path...]
  'git submodule' [--quiet] foreach [--recursive] command
 @@ -146,6 +147,11 @@ If the submodule is not yet initialized, and you just 
 want to use the
  setting as stored in .gitmodules, you can automatically initialize the
  submodule with the `--init` option.
  +
 +By default, each submodule is treated serially.  You may specify a degree of
 +parallel execution with the --jobs flag.  If a parameter is provided, it is
 +the maximum number of jobs to run in parallel; without a parameter, all 
 jobs are
 +run in parallel.
 ++

The new --jobs option should be documented under OPTIONS, (and maybe
include that --jobs 0 does the same as --jobs alone and that this is
not supported on all platforms).

  If `--recursive` is specified, this command will recurse into the
  registered submodules, and update any nested submodules within.
  +
 diff --git a/git-submodule.sh b/git-submodule.sh
 index ab6b110..60a5f96 100755
 --- a/git-submodule.sh
 +++ b/git-submodule.sh
 @@ -8,7 +8,7 @@ dashless=$(basename $0 | sed -e 's/-/ /')
  USAGE=[--quiet] add [-b branch] [-f|--force] [--reference repository] 
 [--] repository [path]
 or: $dashless [--quiet] status [--cached] [--recursive] [--] [path...]
 or: $dashless [--quiet] init [--] [path...]
 -   or: $dashless [--quiet] update [--init] [-N|--no-fetch] [-f|--force] 
 [--rebase] [--reference repository] [--merge] [--recursive] [--] 
 [path...]
 +   or: $dashless [--quiet] update [--init] [-N|--no-fetch] [-f|--force] 
 [--rebase] [--reference repository] [--merge] [--recursive] [-j|--jobs 
 [jobs]] [--] [path...]
 or: $dashless [--quiet] summary [--cached|--files] [--summary-limit n] 
 [commit] [--] [path...]
 or: $dashless [--quiet] foreach [--recursive] command
 or: $dashless [--quiet] sync [--] [path...]
 @@ -500,6 +500,7 @@ cmd_update()
  {
 # parse $args after submodule ... update.
 orig_flags=
 +   jobs=1
 while test $# -ne 0
 do
 case $1 in
 @@ -518,6 +519,20 @@ cmd_update()
 -r|--rebase)
 update=rebase
 ;;
 +   -j|--jobs)
 +   case $2 in
 +   ''|-*)
 +   jobs=0
 +   ;;
 +   *)
 +   jobs=$2
 +   shift
 +   ;;
 +   esac
 +   # Don't preserve this arg.
 +   shift
 +   continue
 +   ;;
 --reference)
 case $2 in '') usage ;; esac
 reference=--reference=$2
 @@ -551,11 +566,34 @@ cmd_update()
 shift
 done

 +   # Correctly handle the case where '-q' came before 'update' on the 
 

Re: [PATCH] Enable parallelism in git submodule update.

2012-11-03 Thread Phil Hord
On Sat, Nov 3, 2012 at 11:42 AM, Jens Lehmann jens.lehm...@web.de wrote:
 Am 30.10.2012 19:11, schrieb Stefan Zager:
 This is a refresh of a conversation from a couple of months ago.

 I didn't try to implement all the desired features (e.g., smart logic
 for passing a -j parameter to recursive submodule invocations), but I
 did address the one issue that Junio insisted on: the code makes a
 best effort to detect whether xargs supports parallel execution on the
 host platform, and if it doesn't, then it prints a warning and falls
 back to serial execution.

 I suspect not passing on --jobs recursively like you do here is the
 right thing to do, as that would give exponential growth of jobs with
 recursion depth, which makes no sense to me.

On the other hand, since $jobs is still defined when the recursive
call to is made to 'eval cmd_update $orig_flags', I suspect the
value *is* passed down recursively.  Maybe $jobs should be manually
reset before recursing -- unless it is 0 -- though I expect someone
would feel differently if she had one submodule on level 1 and 10
submodules on level 2.  She would be surprised, then, when  --jobs=10
seemed to have little affect on performance.  So maybe it is best to
leave it as it is, excepting that the apparent attempt not to pass the
switch down is probably misleading.

Phil
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-11-03 Thread Jens Lehmann
Am 29.07.2012 17:37, schrieb Jens Lehmann:
 Am 27.07.2012 20:37, schrieb Stefan Zager:
 The --jobs parameter may be used to set the degree of per-submodule
 parallel execution.
 
 I think this is a sound idea, but it would be good to see some
 actual measurements. What are the performance numbers with and
 without this change? Which cases do benefit and are there some
 which run slower when run in parallel?

ping?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-11-03 Thread Jens Lehmann
Am 03.11.2012 19:44, schrieb Phil Hord:
 On Sat, Nov 3, 2012 at 11:42 AM, Jens Lehmann jens.lehm...@web.de wrote:
 Am 30.10.2012 19:11, schrieb Stefan Zager:
 This is a refresh of a conversation from a couple of months ago.

 I didn't try to implement all the desired features (e.g., smart logic
 for passing a -j parameter to recursive submodule invocations), but I
 did address the one issue that Junio insisted on: the code makes a
 best effort to detect whether xargs supports parallel execution on the
 host platform, and if it doesn't, then it prints a warning and falls
 back to serial execution.

 I suspect not passing on --jobs recursively like you do here is the
 right thing to do, as that would give exponential growth of jobs with
 recursion depth, which makes no sense to me.
 
 On the other hand, since $jobs is still defined when the recursive
 call to is made to 'eval cmd_update $orig_flags', I suspect the
 value *is* passed down recursively.

But for $jobs != 1 Stefan's code doesn't use eval cmd_update but
starts the submodule script again:

+   xargs $max_lines -P $jobs git submodule update 
$orig_flags

That should get rid of the $jobs setting, or am I missing something?

  Maybe $jobs should be manually
 reset before recursing -- unless it is 0 -- though I expect someone
 would feel differently if she had one submodule on level 1 and 10
 submodules on level 2.  She would be surprised, then, when  --jobs=10
 seemed to have little affect on performance.

Hmm, good point. However we implement that, it should at least be
properly documented in the man page (and in the use case you describe
a git submodule foreach 'git submodule update -j 10' could be the
solution if we choose to not propagate the jobs option).

  So maybe it is best to
 leave it as it is, excepting that the apparent attempt not to pass the
 switch down is probably misleading.

I didn't test it, but I think it should work (famous last words ;-).
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-11-02 Thread Stefan Zager
ping?

On Tue, Oct 30, 2012 at 11:11 AM, Stefan Zager sza...@google.com wrote:
 This is a refresh of a conversation from a couple of months ago.

 I didn't try to implement all the desired features (e.g., smart logic
 for passing a -j parameter to recursive submodule invocations), but I
 did address the one issue that Junio insisted on: the code makes a
 best effort to detect whether xargs supports parallel execution on the
 host platform, and if it doesn't, then it prints a warning and falls
 back to serial execution.

 Stefan

 On Tue, Oct 30, 2012 at 11:03 AM,  sza...@google.com wrote:
 The --jobs parameter may be used to set the degree of per-submodule
 parallel execution.

 Signed-off-by: Stefan Zager sza...@google.com
 ---
  Documentation/git-submodule.txt |8 ++-
  git-submodule.sh|   40 
 ++-
  2 files changed, 46 insertions(+), 2 deletions(-)

 diff --git a/Documentation/git-submodule.txt 
 b/Documentation/git-submodule.txt
 index b4683bb..cb23ba7 100644
 --- a/Documentation/git-submodule.txt
 +++ b/Documentation/git-submodule.txt
 @@ -14,7 +14,8 @@ SYNOPSIS
  'git submodule' [--quiet] status [--cached] [--recursive] [--] [path...]
  'git submodule' [--quiet] init [--] [path...]
  'git submodule' [--quiet] update [--init] [-N|--no-fetch] [--rebase]
 - [--reference repository] [--merge] [--recursive] [--] 
 [path...]
 + [--reference repository] [--merge] [--recursive]
 + [-j|--jobs [jobs]] [--] [path...]
  'git submodule' [--quiet] summary [--cached|--files] [(-n|--summary-limit) 
 n]
   [commit] [--] [path...]
  'git submodule' [--quiet] foreach [--recursive] command
 @@ -146,6 +147,11 @@ If the submodule is not yet initialized, and you just 
 want to use the
  setting as stored in .gitmodules, you can automatically initialize the
  submodule with the `--init` option.
  +
 +By default, each submodule is treated serially.  You may specify a degree of
 +parallel execution with the --jobs flag.  If a parameter is provided, it is
 +the maximum number of jobs to run in parallel; without a parameter, all 
 jobs are
 +run in parallel.
 ++
  If `--recursive` is specified, this command will recurse into the
  registered submodules, and update any nested submodules within.
  +
 diff --git a/git-submodule.sh b/git-submodule.sh
 index ab6b110..60a5f96 100755
 --- a/git-submodule.sh
 +++ b/git-submodule.sh
 @@ -8,7 +8,7 @@ dashless=$(basename $0 | sed -e 's/-/ /')
  USAGE=[--quiet] add [-b branch] [-f|--force] [--reference repository] 
 [--] repository [path]
 or: $dashless [--quiet] status [--cached] [--recursive] [--] [path...]
 or: $dashless [--quiet] init [--] [path...]
 -   or: $dashless [--quiet] update [--init] [-N|--no-fetch] [-f|--force] 
 [--rebase] [--reference repository] [--merge] [--recursive] [--] 
 [path...]
 +   or: $dashless [--quiet] update [--init] [-N|--no-fetch] [-f|--force] 
 [--rebase] [--reference repository] [--merge] [--recursive] [-j|--jobs 
 [jobs]] [--] [path...]
 or: $dashless [--quiet] summary [--cached|--files] [--summary-limit n] 
 [commit] [--] [path...]
 or: $dashless [--quiet] foreach [--recursive] command
 or: $dashless [--quiet] sync [--] [path...]
 @@ -500,6 +500,7 @@ cmd_update()
  {
 # parse $args after submodule ... update.
 orig_flags=
 +   jobs=1
 while test $# -ne 0
 do
 case $1 in
 @@ -518,6 +519,20 @@ cmd_update()
 -r|--rebase)
 update=rebase
 ;;
 +   -j|--jobs)
 +   case $2 in
 +   ''|-*)
 +   jobs=0
 +   ;;
 +   *)
 +   jobs=$2
 +   shift
 +   ;;
 +   esac
 +   # Don't preserve this arg.
 +   shift
 +   continue
 +   ;;
 --reference)
 case $2 in '') usage ;; esac
 reference=--reference=$2
 @@ -551,11 +566,34 @@ cmd_update()
 shift
 done

 +   # Correctly handle the case where '-q' came before 'update' on the 
 command line.
 +   if test -n $GIT_QUIET
 +   then
 +   orig_flags=$orig_flags -q
 +   fi
 +
 if test -n $init
 then
 cmd_init -- $@ || return
 fi

 +   if test $jobs != 1
 +   then
 +   if ( echo test | xargs -P $jobs true 2/dev/null )
 +   then
 +   if ( echo test | xargs --max-lines=1 true 
 2/dev/null ); then
 +   max_lines=--max-lines=1
 +   else
 +   max_lines=-L 1
 +   fi
 +   

Re: [PATCH] Enable parallelism in git submodule update.

2012-07-29 Thread Jens Lehmann
Am 27.07.2012 20:37, schrieb Stefan Zager:
 The --jobs parameter may be used to set the degree of per-submodule
 parallel execution.

I think this is a sound idea, but it would be good to see some
actual measurements. What are the performance numbers with and
without this change? Which cases do benefit and are there some
which run slower when run in parallel?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-07-29 Thread Junio C Hamano
Heiko Voigt hvo...@hvoigt.net writes:

 On Fri, Jul 27, 2012 at 04:25:58PM -0700, Junio C Hamano wrote:
 ...
 Of course, any set of rules have exceptions ;-) There are a few
 things to which we say Even though it is not in POSIX, everybody
 who matters supports it, and without taking advantage of it, what we
 want to achieve will become too cumbersome to express.

 I was about to write that since this is limited to a given --jobs
 options the majority platforms should be enough as a start and others
 could add a parallelism mechanism later. Its only a matter of efficiency
 and not features.

As long as git submodule --jobs 9 on a platform without GNU
enhanced xargs does not error out and gracefully degrade to non
parallel execution, I do not have any problem with it.  As posted,
the patch has not yet achieved that doneness yet.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-07-28 Thread Heiko Voigt
Hi Stefan,

neat patch. See below for a few notes.

On Fri, Jul 27, 2012 at 11:37:34AM -0700, Stefan Zager wrote:
 diff --git a/git-submodule.sh b/git-submodule.sh
 index dba4d39..761420a 100755
 --- a/git-submodule.sh
 +++ b/git-submodule.sh
 @@ -491,6 +492,20 @@ cmd_update()
   -r|--rebase)
   update=rebase
   ;;
 + -j|--jobs)
 + case $2 in
 + ''|-*)
 + jobs=0
 + ;;
 + *)
 + jobs=$2
 + shift
 + ;;
 + esac
 + # Don't preserve this arg.
 + shift
 + continue
 + ;;
   --reference)
   case $2 in '') usage ;; esac
   reference=--reference=$2
 @@ -529,6 +544,12 @@ cmd_update()
   cmd_init -- $@ || return
   fi
  
 + if test $jobs != 1
 + then
 + module_list $@ | awk '{print $4}' | xargs -L 1 -P $jobs git 
 submodule update $orig_args

I do not see orig_args set anywhere in submodule.sh. It seems the
existing usage of it in cmd_status() is a leftover from commit
98dbe63 when this variable got renamed to orig_flags.

I will follow up with a patch to that location.

Another problem here is the passing of arguments. Have a look at
a7eff1a8 to see how this was solved for other locations.

The next thing I noticed is that the parallelism is not recursive. You
drop the option and only execute the first depth in parallel. How about
using the amount of modules defined by arguments left in $@ as an
indicator whether you need to fork parallel execution or not. If there
is exactly one you do the update if there are more you do the parallel
thing. That way you can just keep passing the --jobs flag to the
subprocesses.

The next question to solve is UI: Since the output lines of the parallel
update jobs will be mixed we need some way to distinguish them. Imagine
one of the update fails somewhere how do we find out which it was?

Two possible solutions come to my mind:

 1. Prefix each line with a job number. This way you can distinguish
which process outputted what and still have immediate feedback.

 2. Cache the output (to stderr and stdout) of each job and output it
once one job is done. I imagine this needs some infrastructure which
we need to implement. We already have some ideas how to collect such
output in C here[1].

I would prefer solution 2 since the output of 1 will be hard to read but
I guess we could start with 1 and then move over to 2 later on.

Cheers Heiko

[1] http://article.gmane.org/gmane.comp.version-control.git/197747
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-07-28 Thread Heiko Voigt
Hi,

On Fri, Jul 27, 2012 at 04:25:58PM -0700, Junio C Hamano wrote:
 Stefan Zager sza...@google.com writes:
 
  On Fri, Jul 27, 2012 at 2:38 PM, Junio C Hamano gits...@pobox.com wrote:
 
  Stefan Zager sza...@google.com writes:
 
   + module_list $@ | awk '{print $4}' | xargs -L 1 -P
  $jobs git submodule update $orig_args
 
  Capital-P option to xargs is not even in POSIX, no?
 
  I wasn't aware of that, but you appear to be correct.  Don't know if you
  have a policy about that, but anecdotally, -P is supported on my linux,
  mac, and win/msys systems.
 
 About policy, we use POSIX as a rough yardstick to warn us that we
 might be breaking people on minority platforms.  We do _not_ say It
 is in POSIX, so it is safe to use it, but we say It is not even in
 POSIX, so we need to think twice.  We do not usually say Linux,
 Mac and Windows are the only things that matter, and they all
 support it.
 
 Of course, any set of rules have exceptions ;-) There are a few
 things to which we say Even though it is not in POSIX, everybody
 who matters supports it, and without taking advantage of it, what we
 want to achieve will become too cumbersome to express.

I was about to write that since this is limited to a given --jobs
options the majority platforms should be enough as a start and others
could add a parallelism mechanism later. Its only a matter of efficiency
and not features.

But if you look at my other post to this thread I described that we need
some UI output extension so the user can still make sense of it.
In short: The user should be able distinguish which job said what.

I was already thinking about how an output caching could be implemented in
core git. How about exposing it as a git command like this?

git run [-jnumber] ...

It works like the xargs call above except that it caches each jobs
output to stderr and stdout until its done and then replays the output
to stderr/out in the correct order.

We could design the code so that it can be reused later on to do the
caching in parallel fetch/push/... .

What do you think? If we decide to go this route I would have a look
into whipping something up.

Cheers Heiko
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-07-27 Thread Junio C Hamano
Stefan Zager sza...@google.com writes:

 + module_list $@ | awk '{print $4}' | xargs -L 1 -P $jobs git 
 submodule update $orig_args

Capital-P option to xargs is not even in POSIX, no?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Enable parallelism in git submodule update.

2012-07-27 Thread Junio C Hamano
Stefan Zager sza...@google.com writes:

 On Fri, Jul 27, 2012 at 2:38 PM, Junio C Hamano gits...@pobox.com wrote:

 Stefan Zager sza...@google.com writes:

  + module_list $@ | awk '{print $4}' | xargs -L 1 -P
 $jobs git submodule update $orig_args

 Capital-P option to xargs is not even in POSIX, no?

 I wasn't aware of that, but you appear to be correct.  Don't know if you
 have a policy about that, but anecdotally, -P is supported on my linux,
 mac, and win/msys systems.

About policy, we use POSIX as a rough yardstick to warn us that we
might be breaking people on minority platforms.  We do _not_ say It
is in POSIX, so it is safe to use it, but we say It is not even in
POSIX, so we need to think twice.  We do not usually say Linux,
Mac and Windows are the only things that matter, and they all
support it.

Of course, any set of rules have exceptions ;-) There are a few
things to which we say Even though it is not in POSIX, everybody
who matters supports it, and without taking advantage of it, what we
want to achieve will become too cumbersome to express.

In the core parts of the system, we try to be very conservative. In
the fringe where nobody cares about, we tend to be looser.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html