Re: [PATCH v3 2/3] Remove now useless email-address parsing code

2018-01-08 Thread Alex Bennée

Matthieu Moy <g...@matthieu-moy.fr> writes:

> We now use Mail::Address unconditionaly, hence parse_mailboxes is now
> dead code. Remove it and its tests.
>
> Signed-off-by: Matthieu Moy <g...@matthieu-moy.fr>

Reviewed-by: Alex Bennée <alex.ben...@linaro.org>

> ---
> No change since v2.
>
>  perl/Git.pm  | 71 
> 
>  t/t9000-addresses.sh | 27 
>  t/t9000/test.pl  | 67 -
>  3 files changed, 165 deletions(-)
>  delete mode 100755 t/t9000-addresses.sh
>  delete mode 100755 t/t9000/test.pl
>
> diff --git a/perl/Git.pm b/perl/Git.pm
> index ffa09ac..65e6b32 100644
> --- a/perl/Git.pm
> +++ b/perl/Git.pm
> @@ -880,77 +880,6 @@ sub ident_person {
>   return "$ident[0] <$ident[1]>";
>  }
>
> -=item parse_mailboxes
> -
> -Return an array of mailboxes extracted from a string.
> -
> -=cut
> -
> -# Very close to Mail::Address's parser, but we still have minor
> -# differences in some cases (see t9000 for examples).
> -sub parse_mailboxes {
> - my $re_comment = qr/\((?:[^)]*)\)/;
> - my $re_quote = qr/"(?:[^\"\\]|\\.)*"/;
> - my $re_word = qr/(?:[^]["\s()<>:;@\\,.]|\\.)+/;
> -
> - # divide the string in tokens of the above form
> - my $re_token = qr/(?:$re_quote|$re_word|$re_comment|\S)/;
> - my @tokens = map { $_ =~ /\s*($re_token)\s*/g } @_;
> - my $end_of_addr_seen = 0;
> -
> - # add a delimiter to simplify treatment for the last mailbox
> - push @tokens, ",";
> -
> - my (@addr_list, @phrase, @address, @comment, @buffer) = ();
> - foreach my $token (@tokens) {
> - if ($token =~ /^[,;]$/) {
> - # if buffer still contains undeterminated strings
> - # append it at the end of @address or @phrase
> - if ($end_of_addr_seen) {
> - push @phrase, @buffer;
> - } else {
> - push @address, @buffer;
> - }
> -
> - my $str_phrase = join ' ', @phrase;
> - my $str_address = join '', @address;
> - my $str_comment = join ' ', @comment;
> -
> - # quote are necessary if phrase contains
> - # special characters
> - if ($str_phrase =~ /[][()<>:;@\\,.\000-\037\177]/) {
> - $str_phrase =~ s/(^|[^\\])"/$1/g;
> - $str_phrase = qq["$str_phrase"];
> - }
> -
> - # add "<>" around the address if necessary
> - if ($str_address ne "" && $str_phrase ne "") {
> - $str_address = qq[<$str_address>];
> - }
> -
> - my $str_mailbox = "$str_phrase $str_address 
> $str_comment";
> - $str_mailbox =~ s/^\s*|\s*$//g;
> - push @addr_list, $str_mailbox if ($str_mailbox);
> -
> - @phrase = @address = @comment = @buffer = ();
> - $end_of_addr_seen = 0;
> - } elsif ($token =~ /^\(/) {
> - push @comment, $token;
> - } elsif ($token eq "<") {
> - push @phrase, (splice @address), (splice @buffer);
> - } elsif ($token eq ">") {
> - $end_of_addr_seen = 1;
> - push @address, (splice @buffer);
> - } elsif ($token eq "@" && !$end_of_addr_seen) {
> - push @address, (splice @buffer), "@";
> - } else {
> - push @buffer, $token;
> - }
> - }
> -
> - return @addr_list;
> -}
> -
>  =item hash_object ( TYPE, FILENAME )
>
>  Compute the SHA1 object id of the given C considering it is
> diff --git a/t/t9000-addresses.sh b/t/t9000-addresses.sh
> deleted file mode 100755
> index a1ebef6..000
> --- a/t/t9000-addresses.sh
> +++ /dev/null
> @@ -1,27 +0,0 @@
> -#!/bin/sh
> -
> -test_description='compare address parsing with and without Mail::Address'
> -. ./test-lib.sh
> -
> -if ! test_have_prereq PERL; then
> - skip_all='skipping perl interface tests, perl not available'
> - test_done
> -fi
> -
> -perl -MTest::More -e 0 2>/dev/null || {
> - skip_all="Perl Test::More unavailable, skipping test"
> -

Re: [PATCH v3 1/3] send-email: add and use a local copy of Mail::Address

2018-01-08 Thread Alex Bennée

Matthieu Moy <g...@matthieu-moy.fr> writes:

> We used to have two versions of the email parsing code. Our
> parse_mailboxes (in Git.pm), and Mail::Address which we used if
> installed. Unfortunately, both versions have different sets of bugs, and
> changing the behavior of git depending on whether Mail::Address is
> installed was a bad idea.
>
> A first attempt to solve this was cc90750 (send-email: don't use
> Mail::Address, even if available, 2017-08-23), but it turns out our
> parse_mailboxes is too buggy for some uses. For example the lack of
> nested comments support breaks get_maintainer.pl in the Linux kernel
> tree:
>
>   https://public-inbox.org/git/20171116154814.23785-1-alex.ben...@linaro.org/
>
> This patch goes the other way: use Mail::Address anyway, but have a
> local copy from CPAN as a fallback, when the system one is not
> available.
>
> The duplicated script is small (276 lines of code) and stable in time.
> Maintaining the local copy should not be an issue, and will certainly be
> less burden than maintaining our own parse_mailboxes.
>
> Another option would be to consider Mail::Address as a hard dependency,
> but it's easy enough to save the trouble of extra-dependency to the end
> user or packager.
>
> Signed-off-by: Matthieu Moy <g...@matthieu-moy.fr>

Reviewed-by: Alex Bennée <alex.ben...@linaro.org>


> ---
> No change since v2.
>
>  git-send-email.perl   |   3 +-
>  perl/Git/FromCPAN/Mail/Address.pm | 276 
> ++
>  perl/Git/Mail/Address.pm  |  24 
>  3 files changed, 302 insertions(+), 1 deletion(-)
>  create mode 100644 perl/Git/FromCPAN/Mail/Address.pm
>  create mode 100755 perl/Git/Mail/Address.pm
>
> diff --git a/git-send-email.perl b/git-send-email.perl
> index edcc6d3..340b5c8 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -30,6 +30,7 @@ use Error qw(:try);
>  use Cwd qw(abs_path cwd);
>  use Git;
>  use Git::I18N;
> +use Git::Mail::Address;
>
>  Getopt::Long::Configure qw/ pass_through /;
>
> @@ -489,7 +490,7 @@ my ($repoauthor, $repocommitter);
>  ($repocommitter) = Git::ident_person(@repo, 'committer');
>
>  sub parse_address_line {
> - return Git::parse_mailboxes($_[0]);
> + return map { $_->format } Mail::Address->parse($_[0]);
>  }
>
>  sub split_addrs {
> diff --git a/perl/Git/FromCPAN/Mail/Address.pm 
> b/perl/Git/FromCPAN/Mail/Address.pm
> new file mode 100644
> index 000..13b2ff7
> --- /dev/null
> +++ b/perl/Git/FromCPAN/Mail/Address.pm
> @@ -0,0 +1,276 @@
> +# Copyrights 1995-2017 by [Mark Overmeer <p...@overmeer.net>].
> +#  For other contributors see ChangeLog.
> +# See the manual pages for details on the licensing terms.
> +# Pod stripped from pm file by OODoc 2.02.
> +package Mail::Address;
> +use vars '$VERSION';
> +$VERSION = '2.19';
> +
> +use strict;
> +
> +use Carp;
> +
> +# use locale;   removed in version 1.78, because it causes taint problems
> +
> +sub Version { our $VERSION }
> +
> +
> +
> +# given a comment, attempt to extract a person's name
> +sub _extract_name
> +{   # This function can be called as method as well
> +my $self = @_ && ref $_[0] ? shift : undef;
> +
> +local $_ = shift
> +or return '';
> +
> +# Using encodings, too hard. See Mail::Message::Field::Full.
> +return '' if m/\=\?.*?\?\=/;
> +
> +# trim whitespace
> +s/^\s+//;
> +s/\s+$//;
> +s/\s+/ /;
> +
> +# Disregard numeric names (e.g. 123456.1...@compuserve.com)
> +return "" if /^[\d ]+$/;
> +
> +s/^\((.*)\)$/$1/; # remove outermost parenthesis
> +s/^"(.*)"$/$1/;   # remove outer quotation marks
> +s/\(.*?\)//g; # remove minimal embedded comments
> +s/\\//g;  # remove all escapes
> +s/^"(.*)"$/$1/;   # remove internal quotation marks
> +s/^([^\s]+) ?, ?(.*)$/$2 $1/; # reverse "Last, First M." if applicable
> +s/,.*//;
> +
> +# Change casing only when the name contains only upper or only
> +# lower cased characters.
> +unless( m/[A-Z]/ && m/[a-z]/ )
> +{   # Set the case of the name to first char upper rest lower
> +s/\b(\w+)/\L\u$1/igo;  # Upcase first letter on name
> +s/\bMc(\w)/Mc\u$1/igo; # Scottish names such as 'McLeod'
> +s/\bo'(\w)/O'\u$1/igo; # Irish names such as 'O'Malley, O'Reilly'
> +s/\b(x*(ix)?v*(iv)?i*)\b/\U$1/igo; # Roman numerals, eg 'Level III 
> Support'
> +}
> +
> +# some cleanup
> +s/\[[^\]]*\]//g;
> +s/(^[\s'"]+|[\s'"]+$)//g;
> +s/\s{2,}/ /g;
>

Re: [PATCH] send-email: add test for Linux's get_maintainer.pl

2018-01-05 Thread Alex Bennée

Matthieu Moy <g...@matthieu-moy.fr> writes:

> From: Alex Bennée <alex.ben...@linaro.org>
>
> We had a regression that broke Linux's get_maintainer.pl. Using
> Mail::Address to parse email addresses fixed it, but let's protect
> against future regressions.
>
> Patch-edited-by: Matthieu Moy <g...@matthieu-moy.fr>
> Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
> Signed-off-by: Matthieu Moy <g...@matthieu-moy.fr>
> ---
>  t/t9001-send-email.sh | 22 ++
>  1 file changed, 22 insertions(+)
>
> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
> index 4d261c2..f126177 100755
> --- a/t/t9001-send-email.sh
> +++ b/t/t9001-send-email.sh
> @@ -172,6 +172,28 @@ test_expect_success $PREREQ 'cc trailer with various 
> syntax' '
>   test_cmp expected-cc commandline1
>  '
>
> +test_expect_success $PREREQ 'setup fake get_maintainer.pl script for cc 
> trailer' "
> + cat >expected-cc-script.sh <<-EOF &&
> + #!/bin/sh
> + echo 'One Person <o...@example.com> (supporter:THIS (FOO/bar))'
> + echo 'Two Person <t...@example.com> (maintainer:THIS THING)'
> + echo 'Third List <th...@example.com> (moderated list:THIS THING 
> (FOO/bar))'
> + echo '<f...@example.com> (moderated list:FOR THING)'
> + echo 'f...@example.com (open list:FOR THING (FOO/bar))'
> + echo 's...@example.com (open list)'
> + EOF
> + chmod +x expected-cc-script.sh
> +"
> +
> +test_expect_success $PREREQ 'cc trailer with get_maintainer.pl output' '
> + test_commit cc-trailer-getmaint &&
> + clean_fake_sendmail &&
> + git send-email -1 --to=recipi...@example.com \
> + --cc-cmd="./expected-cc-script.sh" \
> + --smtp-server="$(pwd)/fake.sendmail" &&
> + test_cmp expected-cc commandline1
> +'
> +
>  test_expect_success $PREREQ 'setup expect' "
>  cat >expected-show-all-headers <<\EOF
>  0001-Second.patch

I think you need to apply Eric's suggestions from:

  From: Eric Sunshine <sunsh...@sunshineco.com>
  Date: Sat, 18 Nov 2017 21:54:46 -0500
  Message-ID: 
<capig+csh0tvvkh0xf9fwcfm4gngawmsn_fxd2zhzhcy2try...@mail.gmail.com>

--
Alex Bennée


Re: [RFC PATCH 2/2] Remove now useless email-address parsing code

2018-01-04 Thread Alex Bennée
est_done
> -}
> -
> -perl -MMail::Address -e 0 2>/dev/null || {
> - skip_all="Perl Mail::Address unavailable, skipping test"
> - test_done
> -}
> -
> -test_external_has_tap=1
> -
> -test_external_without_stderr \
> - 'Perl address parsing function' \
> - perl "$TEST_DIRECTORY"/t9000/test.pl
> -
> -test_done
> diff --git a/t/t9000/test.pl b/t/t9000/test.pl
> deleted file mode 100755
> index dfeaa9c..000
> --- a/t/t9000/test.pl
> +++ /dev/null
> @@ -1,67 +0,0 @@
> -#!/usr/bin/perl
> -use lib (split(/:/, $ENV{GITPERLLIB}));
> -
> -use 5.008;
> -use warnings;
> -use strict;
> -
> -use Test::More qw(no_plan);
> -use Mail::Address;
> -
> -BEGIN { use_ok('Git') }
> -
> -my @success_list = (q[Jane],
> - q[j...@example.com],
> - q[<j...@example.com>],
> - q[Jane <j...@example.com>],
> - q[Jane Doe <j...@example.com>],
> - q["Jane" <j...@example.com>],
> - q["Doe, Jane" <j...@example.com>],
> - q["Jane@:;\>.,()<Doe" <j...@example.com>],
> - q[Jane!#$%&'*+-/=?^_{|}~Doe' <j...@example.com>],
> - q["<j...@example.com>"],
> - q["Jane j...@example.com"],
> - q[Jane Doe ],
> - q[Jane   Doe <  j...@example.com  >],
> - q[Jane @ Doe @ Jane @ Doe],
> - q["Jane, 'Doe'" <j...@example.com>],
> - q['Doe, "Jane' <j...@example.com>],
> - q["Jane" "Do"e <j...@example.com>],
> - q["Jane' Doe" <j...@example.com>],
> - q["Jane Doe <j...@example.com>" <j...@example.com>],
> - q["Jane\" Doe" <j...@example.com>],
> - q[Doe, jane <j...@example.com>],
> - q["Jane Doe <j...@example.com>],
> - q['Jane 'Doe' <j...@example.com>],
> - q[Jane@:;\.,()<>Doe <j...@example.com>],
> - q[Jane <j...@example.com> Doe],
> - q[<j...@example.com> Jane Doe]);
> -
> -my @known_failure_list = (q[Jane\ Doe <j...@example.com>],
> - q["Doe, Ja"ne <j...@example.com>],
> - q["Doe, Katarina" Jane <j...@example.com>],
> - q[Jane j...@example.com],
> - q["Jane "Kat"a" ri"na" ",Doe" <j...@example.com>],
> - q[Jane Doe],
> - q[Jane "Doe <j...@example.com>"],
> - q[\"Jane Doe <j...@example.com>],
> - q[Jane\"\" Doe <j...@example.com>],
> - q['Jane "Katarina\" \' Doe' <j...@example.com>]);
> -
> -foreach my $str (@success_list) {
> - my @expected = map { $_->format } Mail::Address->parse("$str");
> - my @actual = Git::parse_mailboxes("$str");
> - is_deeply(\@expected, \@actual, qq[same output : $str]);
> -}
> -
> -TODO: {
> - local $TODO = "known breakage";
> - foreach my $str (@known_failure_list) {
> - my @expected = map { $_->format } Mail::Address->parse("$str");
> - my @actual = Git::parse_mailboxes("$str");
> - is_deeply(\@expected, \@actual, qq[same output : $str]);
> - }
> -}
> -
> -my $is_passing = eval { Test::More->is_passing };
> -exit($is_passing ? 0 : 1) unless $@ =~ /Can't locate object method/;


--
Alex Bennée


Re: [PATCH] git-send-email: fix get_maintainer.pl regression

2017-12-12 Thread Alex Bennée

Thomas Adam <tho...@xteddy.org> writes:

> Hi,
>
> On Mon, Dec 11, 2017 at 08:46:46PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> I.e. we'd just ship a copy of Email::Valid and Mail::Address in
>> perl/Git/FromCPAN/, use a wrapper to load them, and then we wouldn't
>> need to if/else this at the code level, just always use the module,
>> and it would work even on core perl.
>
> I disagree with the premise of this, Ævar.  As soon as you go down this route,
> it increases maintenance to ensure we keep up to date with what's on CPAN for
> a tiny edge-case which I don't believe exists.
>
> You may as well just use App::FatPacker.
>
> We're talking about package maintenance here -- and as I said before, there's
> plenty of it around.  For those distributions which ship Git (and hence also
> package git-send-email), the dependencies are already there, too.  I just
> cannot see this being a problem in relying on non-core perl modules.  Every
> perl program does this, and they don't go down this route of having copies of
> various CPAN modules just in case.  So why should we?  We're not a special
> snowflake.

I less bothered my the potentially shipping a git specific copy than
ensuring the packagers pick up the dependency when they do their builds.
Do we already have a mechanism for testing for non-core perl modules
during the "configure" phase of git?

--
Alex Bennée


Re: [PATCH] git-send-email: fix get_maintainer.pl regression

2017-12-11 Thread Alex Bennée

Junio C Hamano <gits...@pobox.com> writes:

> Thomas Adam <tho...@xteddy.org> writes:
>
>> Trying to come up with a reinvention of regexps for email addresses is asking
>> for trouble, not to mention a crappy rod for your own back.  Don't do that.
>> This is why people use Mail::Address.
>>
>> https://metacpan.org/pod/distribution/MailTools/lib/Mail/Address.pod
>
> Now we are coming back to cc907506 ("send-email: don't use
> Mail::Address, even if available", 2017-08-23).  It argues
>
> * Having this optional Mail::Address makes it tempting to anwser "please
>   install Mail::Address" to users instead of fixing our own code. We've
>   reached the stage where bugs in our parser should be fixed, not worked
>   around.
>
> but if it costs us maintaining our substitute that much, it seems to
> me that depending on Mail::Address is not just tempting but may be a
> sensible way forward.
>
> Was there any reason why Mail::Address was _inadequate_?  I know we
> had trouble with random garbage that are *not* addresses people put
> on the in-body CC: trailer in the past, but I do not recall if they
> are something Mail::Address would give worse result and we need our
> workaround (hence our own substitute), or Mail::Address would handle
> them just fine.

Ping?

So have we come to a consensus about the best solution here?

I'm perfectly happy to send a reversion patch because to be honest
hacking on a bunch of perl to handle special mail cases is not my idea
of fun spare time hacking ;-)

I guess the full solution is to make Mail::Address a hard dependency?

--
Alex Bennée


Re: [PATCH] git-send-email: fix get_maintainer.pl regression

2017-11-22 Thread Alex Bennée

Matthieu Moy <matthieu@univ-lyon1.fr> writes:

> Junio C Hamano <gits...@pobox.com> writes:
>
>> Was there any reason why Mail::Address was _inadequate_?
>
> I think the main reason was that Mail::Address is not a standard perl
> module, and not relying on it avoided one external dependency. AFAIK, we
> don't really have a good way to deal with Perl dependencies, so having a
> strong requirement on Mail::Address will probably end up in a runtime
> error for users who compile Git from source (people using a packaged
> version have their package manager to deal with this).
>
>> I know we had trouble with random garbage that are *not* addresses
>> people put on the in-body CC: trailer in the past, but I do not recall
>> if they are something Mail::Address would give worse result and we
>> need our workaround (hence our own substitute), or Mail::Address would
>> handle them just fine.
>
> For a long time, we used Mail::Address if it was available and I don't
> think we had issues with it.
>
> My point in cc907506 ("send-email: don't use Mail::Address, even if
> available", 2017-08-23) was not that Mail::Address was bad, but that
> changing our behavior depending on whether it was there or not was
> really bad. For example, the issue dealt with in this thread probably
> always existed, but it was present only for *some* users.

So I just did a little digging on my systems to illustrate the point. My
work machine is Ubuntu, so has a packaged git via PPA. It depends on
libmailtools-perl which includes the perl module and:

  apt-cache rdepends  libmailtools-perl | wc -l
  45

So for binary packaged systems it's not a big thing - libmailtools-perl
seems to be quite widely relied on.

On my system at home, running Gentoo, while requiring "perl" doesn't
explicitly pull in the Mail::Address dependency. As a result the git
send-email stanza I was running at work manages to corrupt the
addresses. If I manually install dev-perl/MailTools of course it
silently starts working again.

Good job I never usually send patches from my home machine ;-)

My hacky guess about GIT's perl use calls is:

   find . -iname "*.perl" -or -iname "*.pm" -or -iname "*.pl" | xargs grep -h  
"use .*::" | sort | uniq | wc -l
   88

So that is about 88 perl modules used in the code base. How many of them
are not part of the core perl distribution?

Should the solution be to just make Mail::Address a hard dependency and
not have the fallback?

--
Alex Bennée


Re: [PATCH] git-send-email: fix get_maintainer.pl regression

2017-11-21 Thread Alex Bennée

Eric Sunshine <sunsh...@sunshineco.com> writes:

> A few more comments/observations...
>
> On Thu, Nov 16, 2017 at 10:48 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>> diff --git a/perl/Git.pm b/perl/Git.pm
>> @@ -936,6 +936,9 @@ sub parse_mailboxes {
>> $end_of_addr_seen = 0;
>> } elsif ($token =~ /^\(/) {
>> push @comment, $token;
>> +   } elsif ($token =~ /^\)/) {
>> +   my $nested_comment = pop @comment;
>> +   push @comment, "$nested_comment$token";
>
> Due to the way tokenization works, it looks like you will only ever
> see a ")" as a single character. That suggests that you should be
> using ($token eq ")"), as is done for "<" and ">", rather than ($token
> =~ /^\)/).
>
> What happens if there is text before the final closing ')'? For
> instance, "foo@bar (bibble (bobble) smoo)" or "...)smoo)". The result
> is that "smoo" ends up tacked onto the end of the email address
> ("foo@barsmoo") rather than incorporated into the comment, as
> intended.
>
> What happens if you encounter a ")" but haven't yet encountered an
> opening "(" (that is, @comment is empty)? For example, "foo@bar )". In
> that case, it unconditionally pops from the empty array, which seems
> iffy at best. It might be nice to see this case taken into
> consideration explicitly.

Yeah I was only really aiming for the current regression but I'm sure it
could be more solid. I do note that my @known_failure_list in test.pl
has a bunch of other cases that need fixing up.

> I also was wondering if it would make more sense to take advantage of
> Perl's ability to match nested expressions (??{$nested}), however,
> that feature apparently was added in 5.10, and Git.pm only requires
> 5.8, so perhaps not (unless we want to bump the requirement higher).

Hmm that might be a case of abusing regex to do something better suited
to a proper tokenizer.

>
> Aside from those observations, it looks like the tokenizer in this
> function is broken. For any input with the address enclosed in "<" and
> ">", the comment is lost entirely; it doesn't even end up in the
> @tokens array. Since you're already fixing bugs/regressions in this
> code, perhaps that's something you'd like to tackle as well in a
> separate patch? ("No" is an acceptable answer, of course.)
>
>> } elsif ($token eq "<") {
>> push @phrase, (splice @address), (splice @buffer);
>> } elsif ($token eq ">") {

I can have a go but my perl-fu has weakened somewhat since I stopped
having to maintain perl code for a living. It's almost as though my
brain was glad to dump the knowledge ;-)

I guess we could maintain a nesting count and a current token type and
use that to more intelligently direct the nested portions to the
appropriate bits. Maybe Matthieu or Remi (CC'ed) might want to chime in
on other options?

--
Alex Bennée


[PATCH v2] git-send-email: fix --cc-cmd get_maintainer.pl regression

2017-11-20 Thread Alex Bennée
Since the removal of Mail::Address from git-send-email certain address
patterns returned by common get_maintainer.pl scripts now fail to get
correctly parsed by the built-in Git::parse_mailboxes. Specifically
the patterns with embedded parenthesis fail. For example from the
Linux kernel MAINTAINERS:

  KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)
  L:linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
  L:kvm...@lists.cs.columbia.edu

Which is returned by get_maintainers.pl as:

  linux-arm-ker...@lists.infradead.org (moderated list:KERNEL VIRTUAL MACHINE 
FOR ARM (KVM/arm))
  kvm...@lists.cs.columbia.edu (open list:KERNEL VIRTUAL MACHINE FOR ARM 
(KVM/arm))

However Git::parse_mailboxes code mangles the address, appending the
trailing parenthesis to the email address to the address part causing
it to fail validation:

   error: unable to extract a valid address from: 
linux-arm-ker...@lists.infradead.org) (moderated list:KERNEL VIRTUAL MACHINE 
FOR ARM (KVM/arm)
   error: unable to extract a valid address from: kvm...@lists.cs.columbia.edu) 
(open list:KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)

As this is a common pattern which was handled by Mail::Address I've
fixed the regression by explicitly capturing a trailing bracket and
appending it to the comment token.

NB: the t9001.sh test doesn't explicitly wrap the call to the --cc-cmd
in a "$(pwd)/expected-cc-script.sh" which fails due to the space to
the full-path of the test. It is currently ambiguous as to if --cc-cmd
needs to handle this. I suspect it is not an edge case that has come
up in real-world usage as git-send-email is usually run directly from
a git directory with scripts generally in a ./script/get_maintainer.pl
path.

Fixes: cc9075067776ebd34cc08f31bf78bb05f12fd879
Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
Cc: Eric Sunshine <sunsh...@sunshineco.com>
---
 perl/Git.pm   |  3 +++
 t/t9000/test.pl   |  3 ++-
 t/t9001-send-email.sh | 16 
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/perl/Git.pm b/perl/Git.pm
index ffa09ace9..9b17de1cc 100644
--- a/perl/Git.pm
+++ b/perl/Git.pm
@@ -936,6 +936,9 @@ sub parse_mailboxes {
$end_of_addr_seen = 0;
} elsif ($token =~ /^\(/) {
push @comment, $token;
+   } elsif ($token =~ /^\)/) {
+   my $nested_comment = pop @comment;
+   push @comment, "$nested_comment$token";
} elsif ($token eq "<") {
push @phrase, (splice @address), (splice @buffer);
} elsif ($token eq ">") {
diff --git a/t/t9000/test.pl b/t/t9000/test.pl
index dfeaa9c65..b01642a0d 100755
--- a/t/t9000/test.pl
+++ b/t/t9000/test.pl
@@ -35,7 +35,8 @@ my @success_list = (q[Jane],
q['Jane 'Doe' <j...@example.com>],
q[Jane@:;\.,()<>Doe <j...@example.com>],
q[Jane <j...@example.com> Doe],
-   q[<j...@example.com> Jane Doe]);
+   q[<j...@example.com> Jane Doe],
+   q[j...@example.com (open list:for thing (foo/bar))]);
 
 my @known_failure_list = (q[Jane\ Doe <j...@example.com>],
q["Doe, Ja"ne <j...@example.com>],
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 4d261c2a9..fa783eb87 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -172,6 +172,22 @@ test_expect_success $PREREQ 'cc trailer with various 
syntax' '
test_cmp expected-cc commandline1
 '
 
+test_expect_success $PREREQ 'cc trailer with get_maintainer output' '
+write_script expected-cc-script.sh <<-EOF &&
+echo "One Person <o...@example.com> (supporter:THIS (FOO/bar))"
+echo "Two Person <t...@example.com> (maintainer:THIS THING)"
+echo "Third List <th...@example.com> (moderated list:THIS THING (FOO/bar))"
+echo "<f...@example.com> (moderated list:FOR THING)"
+echo "f...@example.com (open list:FOR THING (FOO/bar))"
+echo "s...@example.com (open list)"
+EOF
+   clean_fake_sendmail &&
+   git send-email -1 --to=recipi...@example.com \
+   --cc-cmd=./expected-cc-script.sh \
+   --smtp-server="$(pwd)/fake.sendmail" &&
+   test_cmp expected-cc commandline1
+'
+
 test_expect_success $PREREQ 'setup expect' "
 cat >expected-show-all-headers <<\EOF
 0001-Second.patch
-- 
2.15.0



[PATCH] git-send-email: fix --cc-cmd get_maintainer.pl regression

2017-11-20 Thread Alex Bennée
Since the removal of Mail::Address from git-send-email certain address
patterns returned by common get_maintainer.pl scripts now fail to get
correctly parsed by the built-in Git::parse_mailboxes. Specifically
the patterns with embedded parenthesis fail. For example from the
Linux kernel MAINTAINERS:

  KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)
  L:linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
  L:kvm...@lists.cs.columbia.edu

Which is returned by get_maintainers.pl as:

  linux-arm-ker...@lists.infradead.org (moderated list:KERNEL VIRTUAL MACHINE 
FOR ARM (KVM/arm))
  kvm...@lists.cs.columbia.edu (open list:KERNEL VIRTUAL MACHINE FOR ARM 
(KVM/arm))

However Git::parse_mailboxes code mangles the address, appending the
trailing parenthesis to the email address to the address part causing
it to fail validation:

   error: unable to extract a valid address from: 
linux-arm-ker...@lists.infradead.org) (moderated list:KERNEL VIRTUAL MACHINE 
FOR ARM (KVM/arm)
   error: unable to extract a valid address from: kvm...@lists.cs.columbia.edu) 
(open list:KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)

As this is a common pattern which was handled by Mail::Address I've
fixed the regression by explicitly capturing a trailing bracket and
appending it to the comment token.

NB: the t9001.sh test doesn't explicitly wrap the call to the --cc-cmd
in a "$(pwd)/expected-cc-script.sh" which fails due to the space to
the full-path of the test. It is currently ambiguous as to if --cc-cmd
needs to handle this. I suspect it is not an edge case that has come
up in real-world usage as git-send-email is usually run directly from
a git directory with scripts generally in a ./script/get_maintainer.pl
path.

Fixes: cc9075067776ebd34cc08f31bf78bb05f12fd879
Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
Cc: Eric Sunshine <sunsh...@sunshineco.com>
---
 perl/Git.pm   |  3 +++
 t/t9000/test.pl   |  3 ++-
 t/t9001-send-email.sh | 16 
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/perl/Git.pm b/perl/Git.pm
index ffa09ace9..9b17de1cc 100644
--- a/perl/Git.pm
+++ b/perl/Git.pm
@@ -936,6 +936,9 @@ sub parse_mailboxes {
$end_of_addr_seen = 0;
} elsif ($token =~ /^\(/) {
push @comment, $token;
+   } elsif ($token =~ /^\)/) {
+   my $nested_comment = pop @comment;
+   push @comment, "$nested_comment$token";
} elsif ($token eq "<") {
push @phrase, (splice @address), (splice @buffer);
} elsif ($token eq ">") {
diff --git a/t/t9000/test.pl b/t/t9000/test.pl
index dfeaa9c65..b01642a0d 100755
--- a/t/t9000/test.pl
+++ b/t/t9000/test.pl
@@ -35,7 +35,8 @@ my @success_list = (q[Jane],
q['Jane 'Doe' <j...@example.com>],
q[Jane@:;\.,()<>Doe <j...@example.com>],
q[Jane <j...@example.com> Doe],
-   q[<j...@example.com> Jane Doe]);
+   q[<j...@example.com> Jane Doe],
+   q[j...@example.com (open list:for thing (foo/bar))]);
 
 my @known_failure_list = (q[Jane\ Doe <j...@example.com>],
q["Doe, Ja"ne <j...@example.com>],
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 4d261c2a9..fa783eb87 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -172,6 +172,22 @@ test_expect_success $PREREQ 'cc trailer with various 
syntax' '
test_cmp expected-cc commandline1
 '
 
+test_expect_success $PREREQ 'cc trailer with get_maintainer output' '
+write_script expected-cc-script.sh <<-EOF &&
+echo "One Person <o...@example.com> (supporter:THIS (FOO/bar))"
+echo "Two Person <t...@example.com> (maintainer:THIS THING)"
+echo "Third List <th...@example.com> (moderated list:THIS THING (FOO/bar))"
+echo "<f...@example.com> (moderated list:FOR THING)"
+echo "f...@example.com (open list:FOR THING (FOO/bar))"
+echo "s...@example.com (open list)"
+EOF
+   clean_fake_sendmail &&
+   git send-email -1 --to=recipi...@example.com \
+   --cc-cmd=./expected-cc-script.sh \
+   --smtp-server="$(pwd)/fake.sendmail" &&
+   test_cmp expected-cc commandline1
+'
+
 test_expect_success $PREREQ 'setup expect' "
 cat >expected-show-all-headers <<\EOF
 0001-Second.patch
-- 
2.15.0



Re: [PATCH] git-send-email: fix get_maintainer.pl regression

2017-11-20 Thread Alex Bennée

Eric Sunshine <sunsh...@sunshineco.com> writes:

> On Thu, Nov 16, 2017 at 10:48 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>> Getting rid of Mail::Address regressed behaviour with common
>> get_maintainer scripts such as the Linux kernel. Fix the missed corner
>> case and add a test for it.
>
> It is not at all clear, based upon this text, what this is fixing.
> When you re-roll, please provide a description of the regression in
> sufficient detail for readers to easily understand the problem and the
> solution.

How about:

Since the removal of Mail::Address from git-send-email certain addresses
common in MAINTAINERS now fail to get correctly parsed by
Git::parse_mailboxes. Specifically the patterns with embedded
parenthesis fail, for example for MAINTAINERS:

  KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)
  L:linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
  L:kvm...@lists.cs.columbia.edu

Is returned by get_maintainers.pl as:

  linux-arm-ker...@lists.infradead.org (moderated list:KERNEL VIRTUAL MACHINE 
FOR ARM (KVM/arm))
  kvm...@lists.cs.columbia.edu (open list:KERNEL VIRTUAL MACHINE FOR ARM 
(KVM/arm))

But the parse_mailboxes code mangles the address, appending the trailing
parenthesis to the email address to the address part causing it to fail 
validation:

   error: unable to extract a valid address from: 
linux-arm-ker...@lists.infradead.org) (moderated list:KERNEL VIRTUAL MACHINE 
FOR ARM (KVM/arm)
   error: unable to extract a valid address from: kvm...@lists.cs.columbia.edu) 
(open list:KERNEL VIRTUAL MACHINE FOR ARM (KVM/arm)

As this is a common pattern which was handled by Mail::Address I've
fixed the regression by explicitly capturing a trailing bracket and
appending it to the comment token.

>
> More below...
>
>> Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
>> ---
>> diff --git a/t/t9000/test.pl b/t/t9000/test.pl
>> @@ -35,7 +35,9 @@ my @success_list = (q[Jane],
>> q['Jane 'Doe' <j...@example.com>],
>> q[Jane@:;\.,()<>Doe <j...@example.com>],
>> q[Jane <j...@example.com> Doe],
>> -   q[<j...@example.com> Jane Doe]);
>> +   q[<j...@example.com> Jane Doe],
>> +   q[j...@example.com (open list:for thing (foo/bar))],
>> +);
>
> Style nit: The dangling ");" introduced by this change differs from
> the other list(s) in this file which have the closing ");" on the same
> line as the last item in the list.
>
>> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
>> +test_expect_success $PREREQ 'setup get_mainter script for cc trailer' "
>> +cat >expected-cc-script.sh <<-EOF && chmod +x expected-cc-script.sh
>> +#!/bin/sh
>> +echo 'One Person <o...@example.com> (supporter:THIS (FOO/bar))'
>> +echo 'Two Person <t...@example.com> (maintainer:THIS THING)'
>> +echo 'Third List <th...@example.com> (moderated list:THIS THING (FOO/bar))'
>> +echo '<f...@example.com> (moderated list:FOR THING)'
>> +echo 'f...@example.com (open list:FOR THING (FOO/bar))'
>> +echo 's...@example.com (open list)'
>> +EOF
>> +"
>
> Use write_script() to create the script:

Thanks for the pointers, I'll fix it up.

I missed the existence of write_script.sh while I scanned through
t/README, perhaps a stanza in in "Test harness library":

 - write_script  <<-\EOF && 
   echo '...'
   ...
   EOF

   The write_script helper takes care of ensuring the created helper
   script has the right shebang and is executable.

?

>
> --- 8< ---
> write_script expected-cc-script.sh <<-\EOF &&
> echo '...'
> ...
> EOF
> --- 8< ---
>
> No need for #!/bin/sh or chmod, both of which are handled by
> write_script(). In fact, you could fold this into the following test
> (since there doesn't seem a good reason for it to be separate).
>
>> +test_expect_success $PREREQ 'cc trailer with get_maintainer output' '
>> +   test_commit cc-trailer &&
>> +   clean_fake_sendmail &&
>> +   git send-email -1 --to=recipi...@example.com \
>> +   --cc-cmd="$(pwd)/expected-cc-script.sh" \
>> +   --smtp-server="$(pwd)/fake.sendmail" &&
>> +   test_cmp expected-cc commandline1
>> +'
>> OK I'm afraid I don't fully understand the test harness as this breaks a
>> bunch of other tests. If anyone can offer some pointers on how to fix
>> I'd be grateful.
>
> There are several problems:
>
> * test_commit bombs because there is already a tag named "cc-trailer"
> created by an earlier test. Fix this by choosing a d

Re: [PATCH] git-send-email: fix get_maintainer.pl regression

2017-11-16 Thread Alex Bennée

Alex Bennée <alex.ben...@linaro.org> writes:

> Getting rid of Mail::Address regressed behaviour with common
> get_maintainer scripts such as the Linux kernel. Fix the missed corner
> case and add a test for it.
>

> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
> index 4d261c2a9..0bcd7ab96 100755
> --- a/t/t9001-send-email.sh
> +++ b/t/t9001-send-email.sh
> @@ -172,6 +172,27 @@ test_expect_success $PREREQ 'cc trailer with various 
> syntax' '
>   test_cmp expected-cc commandline1
>  '
>
> +test_expect_success $PREREQ 'setup get_mainter script for cc trailer' "
> +cat >expected-cc-script.sh <<-EOF && chmod +x expected-cc-script.sh
> +#!/bin/sh
> +echo 'One Person <o...@example.com> (supporter:THIS (FOO/bar))'
> +echo 'Two Person <t...@example.com> (maintainer:THIS THING)'
> +echo 'Third List <th...@example.com> (moderated list:THIS THING (FOO/bar))'
> +echo '<f...@example.com> (moderated list:FOR THING)'
> +echo 'f...@example.com (open list:FOR THING (FOO/bar))'
> +echo 's...@example.com (open list)'
> +EOF
> +"
> +
> +test_expect_success $PREREQ 'cc trailer with get_maintainer output' '
> + test_commit cc-trailer &&
> + clean_fake_sendmail &&
> + git send-email -1 --to=recipi...@example.com \
> + --cc-cmd="$(pwd)/expected-cc-script.sh" \
> + --smtp-server="$(pwd)/fake.sendmail" &&
> + test_cmp expected-cc commandline1
> +'
> +

OK I'm afraid I don't fully understand the test harness as this breaks a
bunch of other tests. If anyone can offer some pointers on how to fix
I'd be grateful.

In the meantime I know the core change works because I tested with:

#+name: send-patches-dry-run
#+begin_src sh :results output
# temp workaround
export PERL5LIB=/home/alex/src/git.git/perl/
git send-email --confirm=never --dry-run --quiet ${mailto} ${series}.patches/*
#+end_src

When I sent my last set of kernel patches to the list (the workflow that
was broken before by the cc9075067776ebd34cc08f31bf78bb05f12fd879 change
landing via my git stable PPA).

--
Alex Bennée


[PATCH] git-send-email: fix get_maintainer.pl regression

2017-11-16 Thread Alex Bennée
Getting rid of Mail::Address regressed behaviour with common
get_maintainer scripts such as the Linux kernel. Fix the missed corner
case and add a test for it.

Fixes: cc9075067776ebd34cc08f31bf78bb05f12fd879

Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
---
 perl/Git.pm   |  3 +++
 t/t9000/test.pl   |  4 +++-
 t/t9001-send-email.sh | 21 +
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/perl/Git.pm b/perl/Git.pm
index ffa09ace9..9b17de1cc 100644
--- a/perl/Git.pm
+++ b/perl/Git.pm
@@ -936,6 +936,9 @@ sub parse_mailboxes {
$end_of_addr_seen = 0;
} elsif ($token =~ /^\(/) {
push @comment, $token;
+   } elsif ($token =~ /^\)/) {
+   my $nested_comment = pop @comment;
+   push @comment, "$nested_comment$token";
} elsif ($token eq "<") {
push @phrase, (splice @address), (splice @buffer);
} elsif ($token eq ">") {
diff --git a/t/t9000/test.pl b/t/t9000/test.pl
index dfeaa9c65..f10be50cd 100755
--- a/t/t9000/test.pl
+++ b/t/t9000/test.pl
@@ -35,7 +35,9 @@ my @success_list = (q[Jane],
q['Jane 'Doe' <j...@example.com>],
q[Jane@:;\.,()<>Doe <j...@example.com>],
q[Jane <j...@example.com> Doe],
-   q[<j...@example.com> Jane Doe]);
+   q[<j...@example.com> Jane Doe],
+   q[j...@example.com (open list:for thing (foo/bar))],
+);
 
 my @known_failure_list = (q[Jane\ Doe <j...@example.com>],
q["Doe, Ja"ne <j...@example.com>],
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 4d261c2a9..0bcd7ab96 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -172,6 +172,27 @@ test_expect_success $PREREQ 'cc trailer with various 
syntax' '
test_cmp expected-cc commandline1
 '
 
+test_expect_success $PREREQ 'setup get_mainter script for cc trailer' "
+cat >expected-cc-script.sh <<-EOF && chmod +x expected-cc-script.sh
+#!/bin/sh
+echo 'One Person <o...@example.com> (supporter:THIS (FOO/bar))'
+echo 'Two Person <t...@example.com> (maintainer:THIS THING)'
+echo 'Third List <th...@example.com> (moderated list:THIS THING (FOO/bar))'
+echo '<f...@example.com> (moderated list:FOR THING)'
+echo 'f...@example.com (open list:FOR THING (FOO/bar))'
+echo 's...@example.com (open list)'
+EOF
+"
+
+test_expect_success $PREREQ 'cc trailer with get_maintainer output' '
+   test_commit cc-trailer &&
+   clean_fake_sendmail &&
+   git send-email -1 --to=recipi...@example.com \
+   --cc-cmd="$(pwd)/expected-cc-script.sh" \
+   --smtp-server="$(pwd)/fake.sendmail" &&
+   test_cmp expected-cc commandline1
+'
+
 test_expect_success $PREREQ 'setup expect' "
 cat >expected-show-all-headers <<\EOF
 0001-Second.patch
-- 
2.15.0



Re: Poor performance of git describe in big repos

2013-06-03 Thread Alex Bennée
On 31 May 2013 10:57, Alex Bennée kernel-hac...@bennee.com wrote:
 On 31 May 2013 09:46, Thomas Rast tr...@inf.ethz.ch wrote:

 So that deleted all unannotated tags pointing at commits, and then it
 was fast.  Curious.

 However, if that turns out to be the culprit, it's not fixable
 currently[1].  Having commits with insanely long messages is just, well,
 insane.


 [1]  unless we do a major rework of the loading infrastructure, so that
 we can teach it to load only the beginning of a commit as long as we are
 only interested in parents and such

 I'll do a bit of scripting to dig into the nature of these
 uber-commits and try and work out how they cam about. I suspect they
 are simply start of branch states in our broken and disparate history.

 I'll get back to you once I've dug a little deeper.

So I wrote a little script [1] which I ran to remove all tags that did
not exist on any branches:

git-tag-cleaner.py -d no-branch

After a lot of churning:

17:26 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --long --tags
ajb-build-test-5225-2-gdc0b771

real0m0.799s
user0m0.024s
sys 0m0.052s

So at least I can fix up my repo. All the big ones look at least as
though they were weird cvs2svn creations that exist to represent the
detached state of a strange CVS tag from the converted repository.
However it does raise one question.

Why is git attempting to parse a commit not on the DAG for the branch
I'm attempting to describe?

Anyway as I have a work around I'm going to do a slightly more
conservative clean of the repo with my script and move on.

[1] https://github.com/stsquad/git-tag-cleaner

-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

  On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 snip
  Will it be loading the blob for every commit it traverses or just ones 
  that hit
  a tag? Why does it need to load the blob at all? Surely the commit
  tree state doesn't
  need to be walked down?

 No, my theory is that you tagged *the blobs*.  Git supports this.

Wait is this the difference between annotated and non-annotated tags?
I thought a non-annotated just acted like references to a particular
tree state?


 You can see if that is the case by doing something like this:

 eval $(git for-each-ref --shell --format '
 test $(git cat-file -t %(objectname)^{}) = commit ||
 echo %(refname);')

 That will print out the name of any ref that doesn't point at a
 commit.

Hmm that didn't seem to work. But looking at the output by hand I
certainly have a mix of tags that are commits vs tags:


09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
| grep commit | wc -l
1345
09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
| grep -v commit | wc -l
66

Unfortunately I can't just delete those tags as they do refer to known
releases which we obviously care about. If I delete the tags on my
local repo and test for a speed increase can I re-create them as
annotated tag objects?

-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 31 May 2013 09:24, Thomas Rast tr...@inf.ethz.ch wrote:
 Alex Bennée kernel-hac...@bennee.com writes:
 On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
 Alex Bennée kernel-hac...@bennee.com writes:
  On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
 snip
 No, my theory is that you tagged *the blobs*.  Git supports this.

 Wait is this the difference between annotated and non-annotated tags?
 I thought a non-annotated just acted like references to a particular
 tree state?

 A tag is just a ref.  It can point at anything, in particular also a
 blob (= some file *contents*).

 An annotated tag is just a tag pointing at a tag object.  A tag object
 contains tagger name/email/date, a reference to an object, and a tag
 message.

 The slowness I found relates to having tags that point at blobs directly
 (unannotated).

I think you are right. I was brave (well I assumed the tags would come
back from the upstream repo) and ran:

git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3
| xargs git tag -d

And boom:

09:19 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --long --tags
ajb-build-test-5225-2-gdc0b771

real0m0.009s
user0m0.008s
sys 0m0.000s

Which is much better performance. So it does look like unannotated
tags pointing at binary blobs is the failure case.

snip

 I would be more interested in this:

   git for-each-ref | grep ' blob'

Hmmm that gives nothing. All the refs are either tag or commit

 and

   (git for-each-ref | grep ' blob' | cut -d\  -f1 | xargs -n1 git
cat-file blob) | wc -c

However I have some big commits it seems:

09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' |
cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
1147231984


 The first tells you if you have any refs pointing at blobs.  The second
 computes their total unpacked size.  My theory is that the second yields
 some large number (hundreds of megabytes at least).

 It would be nice if you checked, because if there turn out to be big
 blobs, we have all the pieces and just need to assemble the best
 solution.  Otherwise, there's something else going on and the problem
 remains open.

If you want any other numbers I'm only too happy to help. Sorry I
can't share the repo though...

-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 31 May 2013 09:32, John Keeping j...@keeping.me.uk wrote:
 On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
 On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
  On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 
   On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
   Alex Bennée kernel-hac...@bennee.com writes:
  snip
   Will it be loading the blob for every commit it traverses or just ones 
   that hit
   a tag? Why does it need to load the blob at all? Surely the commit
   tree state doesn't
   need to be walked down?
 
  No, my theory is that you tagged *the blobs*.  Git supports this.

 Wait is this the difference between annotated and non-annotated tags?
 I thought a non-annotated just acted like references to a particular
 tree state?

 No, this is something slightly different.  In Git there are four types
 of object: tag, commit, tree and blob.  When you have a heavyweight tag,
 the tag reference points at a tag object (which in turn points at
 another object).  With a lightweight tag, the tag reference typically
 points at a commit object.

I think this is the case in my repo.

 However, there is no restriction that says that a tag object must point
 to a commit or that a lightweight tag must point at a commit - it is
 equally possible to point directly at a tree or a blob (although a lot
 less common).

 Thomas is suggesting that you might have a tag that does not point at a
 commit but instead points to a blob object.

It's looking like I just have some very heavy commits. One data point
I probably should have mentioned at the beginning is this was a
converted CVS repo and I'm wondering if some of the artifacts that
introduced has contributed to this.

  You can see if that is the case by doing something like this:
 
  eval $(git for-each-ref --shell --format '
  test $(git cat-file -t %(objectname)^{}) = commit ||
  echo %(refname);')
 
  That will print out the name of any ref that doesn't point at a
  commit.

 Hmm that didn't seem to work.

 You mean there was no output?  In that case it's likely that all your
 references do indeed point at commits.

Correct.


   But looking at the output by hand I
 certainly have a mix of tags that are commits vs tags:


 09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep commit | wc -l
 1345
 09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep -v commit | wc -l
 66

 This means that you have 1345 lightweight tags and 66 heavyweight tags,
 assuming that all of the lines that don't say commit do say tag.

Yep all commits and tags, nothing else

 By the way, I don't remember if you said which version of Git you're
 using.  If it's an older version then it's possible that something has
 changed.

I'm running the GIT stable PPA:

09:38 ajb@sloy/x86_64 [work.git] git --version
git version 1.8.3

Although I have also tested with the latest git.git maint. I'm happy
to try master if it's likely to have changed.

-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 31 May 2013 09:46, Thomas Rast tr...@inf.ethz.ch wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

 I think you are right. I was brave (well I assumed the tags would come
 back from the upstream repo) and ran:

 git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3
 | xargs git tag -d

 So that deleted all unannotated tags pointing at commits, and then it
 was fast.  Curious.

 However I have some big commits it seems:

 09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' |
 cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
 1147231984

 How many unique entries are there in that list, i.e., what does

   git for-each-ref | grep ' commit' | cut -d\  -f1 | sort -u | wc -l

09:49 ajb@sloy/x86_64 [work.git] git for-each-ref | grep ' commit' |
cut -d\  -f1 | sort -u | wc -l
1508

 say?  Perhaps you can also find the biggest commit, e.g. like so:

   git for-each-ref | grep ' commit' | cut -d\  -f1 |
   while read sha; do git cat-file commit $sha | wc -c; done |
   sort -n

Yeah there is a range from a few hundred bytes to a large number of 3M
commits. I guess I need to identify which commits they are and remove
the tags or convert them to annotated reference tags.

 However, if that turns out to be the culprit, it's not fixable
 currently[1].  Having commits with insanely long messages is just, well,
 insane.



 [1]  unless we do a major rework of the loading infrastructure, so that
 we can teach it to load only the beginning of a commit as long as we are
 only interested in parents and such

I'll do a bit of scripting to dig into the nature of these
uber-commits and try and work out how they cam about. I suspect they
are simply start of branch states in our broken and disparate history.

I'll get back to you once I've dug a little deeper.


 --
 Thomas Rast
 trast@{inf,student}.ethz.ch



-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
Hi,

I'm a fairly heavy user of the magit Emacs extension for interacting
with my git repos. However I've noticed there are some cases where lag
is very high. By analysing strace output of emacs calling git I found
two commands that where particularly problematic when interrogating
the repo:

11:00 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --long --tags
ajb-build-test-5224-10-gfa296e6

real0m5.016s
user0m4.364s
sys 0m0.444s

11:34 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --contains HEAD
fatal: cannot describe 'fa296e61f549a1252a65a13b2f734d7afbc7e88e'

real0m4.805s
user0m4.388s
sys 0m0.400s

Running with first command with the --debug flag on gives:

11:34 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --long --tags --debug
searching to describe HEAD
 lightweight   10 ajb-build-test-5224
 lightweight   41 ajb-build-test-5222
 annotated146 vnms-2-1-36-32
 annotated155 vnms-2-1-36-31
 annotated174 vnms-2-1-36-30
 annotated183 vnms-2-1-36-29
 lightweight  188 vnms-2-1-36-28
 annotated193 vnms-2-1-36-27
 annotated206 vnms-2-1-36-26
 annotated215 vectastar-4-2-83-5
traversed 223 commits
more than 10 tags found; listed 10 most recent
gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
ajb-build-test-5224-10-gfa296e6

real0m4.817s
user0m4.320s
sys 0m0.464s

Which has only traversed 223 before coming to a decision. This seems
like a very low number of commits given the time it's spent doing
this.

One factor might be the size of my repo (.git is around 2.4G). Could
this just be due to computational cost of searching through large
packs to walk the commit chain? Is there any way to make this easier
for git to do?


-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
The repo is a fairly hairy one as it includes two historically
un-related but content related repos which I'm the process of
cherry-picking stuff across.

11:58 ajb@sloy/x86_64 [work.git] git count-objects -v
count: 493
size: 4572
in-pack: 399307
packs: 1
size-pack: 1930755
prune-packable: 0
garbage: 0
size-garbage: 0

This was after a repack which did have slight negative effect on
performance. The pack file is:

13:27 ajb@sloy/x86_64 [work.git] ls -lh ./.git/objects/pack/*
-r--r--r-- 1 ajb cvs  11M May 30 11:56
./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.idx
-r--r--r-- 1 ajb cvs 1.9G May 30 11:56
./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack

I ran perf on it and the top items in the report where:

 41.58%   git  libcrypto.so.1.0.0  [.] 0x6ae73
 33.96%   git  libz.so.1.2.3.4 [.] 0xe0ec
 10.39%   git  libz.so.1.2.3.4 [.] adler32
  2.03%   git  [kernel.kallsyms]   [k] clear_page_c

So I'm guessing it's spending a lot of non-cache efficient time
un-packing and processing the deltas?

--
Alex.

On 30 May 2013 12:48, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
 One factor might be the size of my repo (.git is around 2.4G). Could
 this just be due to computational cost of searching through large
 packs to walk the commit chain? Is there any way to make this easier
 for git to do?

 What does git count-objects -v say for your repository?

 You may find that performance improves if you repack with git gc
 --aggressive.



-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
It looks like it's a file caching effect combined with my repo being
more pathalogical in size and contents. Note run 1 (cold) vs run 2 on
the linux file tree:

13:52 ajb@sloy/x86_64 [linux.git] time git describe --debug --long
--tags HEAD~1
searching to describe HEAD~1
 annotated 57 v2.6.34-rc2
 annotated   1688 v2.6.34-rc1
 annotated   7932 v2.6.33
 annotated   8157 v2.6.33-rc8
 annotated   8381 v2.6.33-rc7
 annotated   8637 v2.6.33-rc6
 annotated   8964 v2.6.33-rc5
 annotated   9493 v2.6.33-rc4
 annotated   9912 v2.6.33-rc3
 annotated  10202 v2.6.33-rc2
traversed 10547 commits
more than 10 tags found; listed 10 most recent
gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
v2.6.34-rc2-57-gef5da59

real0m7.332s
user0m0.308s
sys 0m0.244s
14:03 ajb@sloy/x86_64 [linux.git] time git describe --debug --long
--tags HEAD~1
searching to describe HEAD~1
 annotated 57 v2.6.34-rc2
 annotated   1688 v2.6.34-rc1
 annotated   7932 v2.6.33
 annotated   8157 v2.6.33-rc8
 annotated   8381 v2.6.33-rc7
 annotated   8637 v2.6.33-rc6
 annotated   8964 v2.6.33-rc5
 annotated   9493 v2.6.33-rc4
 annotated   9912 v2.6.33-rc3
 annotated  10202 v2.6.33-rc2
traversed 10547 commits
more than 10 tags found; listed 10 most recent
gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
v2.6.34-rc2-57-gef5da59

real0m0.298s
user0m0.244s
sys 0m0.036s

Although the perf profile looks subtly different.

First through the linux tree:

 22.35%   git  libz.so.1.2.3.4[.] inflate
 18.56%   git  libz.so.1.2.3.4[.] inflate_fast
 17.48%   git  libz.so.1.2.3.4[.] inflate_table
  7.84%   git  git[.] hashcmp
  3.93%   git  git[.] get_sha1_hex
  3.46%   git  libz.so.1.2.3.4[.] adler32

And through my special repo:

 41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
 33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
 10.39%   git  libz.so.1.2.3.4 [.] adler32
  2.03%   git  [kernel.kallsyms]   [k] clear_page_c

 I'm not sure why libcrypto features so highly in the results


 --
 Alex.

On 30 May 2013 12:33, Ramkumar Ramachandra artag...@gmail.com wrote:
 Alex Bennée wrote:
time /usr/bin/git --no-pager
 traversed 223 commits

 real0m4.817s
 user0m4.320s
 sys 0m0.464s

 I'm quite clueless about why it is taking this long: I think it's IO
 because there's nothing to compute?  I really can't trace anything
 unless you can reproduce it on a public repository.  On linux.git with
 my rotating hard disk:

 $ time git describe --debug --long --tags HEAD~1
 searching to describe HEAD~1
  annotated   5445 v2.6.33
  annotated   5660 v2.6.33-rc8
  annotated   5884 v2.6.33-rc7
  annotated   6140 v2.6.33-rc6
  annotated   6467 v2.6.33-rc5
  annotated   6999 v2.6.33-rc4
  annotated   7430 v2.6.33-rc3
  annotated   7746 v2.6.33-rc2
  annotated   8212 v2.6.33-rc1
  annotated  13854 v2.6.32
 traversed 18895 commits
 more than 10 tags found; listed 10 most recent
 gave up search at 648f4e3e50c4793d9dbf9a09afa193631f76fa26
 v2.6.33-5445-ge7c84ee

 real0m0.509s
 user0m0.470s
 sys 0m0.037s

 18k+ commits traversed in half a second here, so I really don't know
 what is going on.



-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
 You may find that performance improves if you repack with git gc
--aggressive.

It seems that increases the time to get to where it wants to:

14:12 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --long --tags --debug
searching to describe HEAD
 lightweight   10 ajb-build-test-5224
 lightweight   41 ajb-build-test-5222
 annotated146 vnms-2-1-36-32
 annotated155 vnms-2-1-36-31
 annotated174 vnms-2-1-36-30
 annotated183 vnms-2-1-36-29
 lightweight  188 vnms-2-1-36-28
 annotated193 vnms-2-1-36-27
 annotated206 vnms-2-1-36-26
 annotated215 vectastar-4-2-83-5
traversed 223 commits
more than 10 tags found; listed 10 most recent
gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
ajb-build-test-5224-10-gfa296e6

real0m14.658s
user0m12.845s
sys 0m1.776s

On 30 May 2013 12:48, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
 One factor might be the size of my repo (.git is around 2.4G). Could
 this just be due to computational cost of searching through large
 packs to walk the commit chain? Is there any way to make this easier
 for git to do?

 What does git count-objects -v say for your repository?

 You may find that performance improves if you repack with git gc
 --aggressive.



-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
On 30 May 2013 14:45, Duy Nguyen pclo...@gmail.com wrote:
 On Thu, May 30, 2013 at 8:34 PM, Alex Bennée kernel-hac...@bennee.com wrote:
 snip
 Thanks. Can you share verify-pack -v output of
 pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need
 to put it somewhere on Internet temporarily as it's likely to exceed
 git@vger limits.
 --
 Duy

http://www.bennee.com/~alex/stuff/git-pack-access.tar.bz2

--
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
On 30 May 2013 15:32, Ramkumar Ramachandra artag...@gmail.com wrote:
 Alex Bennée wrote:
 And through my special repo:

  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
  33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

  I'm not sure why libcrypto features so highly in the results

 While Duy churns on the delta chain, let me try to make a (rather
 crude) observation:

 What does it mean for libcrypto to be so high in your perf report?
 sha1_block_data_order is ultimately by object.c:parse_object.  While
 it indicates that deltas are taking a long time to apply (or are
 somehow not optimally organized for IO), I think it indicates either:

 1. Your history is very deep and there are an unusually high number of
 deltas for each blob.  What are the total number of commits?

Well the history does en-compose about 10 years of product development
and has a high number of files in the repo (including about 3 copies of
the kernel - sans upstream history).

15:50 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline | wc -l
24648

real0m0.434s
user0m0.388s
sys 0m0.112s

Although it doesn't take too long to walk the whole mainline history
(obviously ignoring all the other branches).

15:52 ajb@sloy/x86_64 [work.git] git count-objects -v -H
count: 581
size: 5.09 MiB
in-pack: 399307
packs: 1
size-pack: 1.49 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

It is a pick repo. The gc --aggressive nearly took out my machine keeping
around 4gb resident for most of the half hour and using nearly 8gb of VM.

Of course most of the history is not needed for day to day stuff. Maybe
if I split the pack files up it wouldn't be quite such a strain to work
through them?

 2. You have have huge (binary) files checked into your repository.  Do
 you?  If so, why isn't the code in streaming.c kicking in?

We do have some binary blobs in the repository (mainly DSP and FPGA images)
although not a huge number:

15:58 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline -- xxx
xxx/xx/*.out ./xxx/xxx/*.out ./xxx/xxx/*.out | wc -l
234

real0m0.590s
user0m0.552s
sys 0m0.040s

How can I tell if streaming is kicking in or now?


-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html