Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages

2016-09-21 Thread Ævar Arnfjörð Bjarmason
On Wed, Sep 21, 2016 at 8:28 PM, Jakub Narębski  wrote:
> On 21.09.2016 at 20:04, Ævar Arnfjörð Bjarmason wrote:
>> It would make some code like git_print_log() a bit more complex /
>> fragile, since it would have to work on multi-line strings, but
>> anything that needed to do a regex match / replacement would be much
>> faster.
>
> Would it?  Did you perform any synthetic micro-benchmark?

No, just experience, with the caveat that this may not matter at all
in this context. C-like code in Perl is slow; if you can offload
things to one big regex operation, it's usually faster.
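
A minimal sketch of the kind of synthetic micro-benchmark asked about,
using the core Benchmark module. The log data and the link_for() helper
are made up for illustration and are not gitweb code; no timings are
claimed here, and the result will depend on the Perl version and input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic log lines: invented data, not taken from any real repository.
my @lines = map { "Fix frobnication, see commit deadbeef$_ for details" } 1000 .. 1999;
my $text  = join "\n", @lines;

# Hypothetical stand-in for the link markup gitweb would generate.
sub link_for { return "<a>$_[0]</a>" }

# Line by line: one substitution per split-up line.
sub per_line {
    my @out = @lines;
    s{\b([0-9a-fA-F]{7,40})\b}{link_for($1)}eg for @out;
    return join "\n", @out;
}

# One substitution over the whole message as a single string.
sub one_string {
    my $out = $text;
    $out =~ s{\b([0-9a-fA-F]{7,40})\b}{link_for($1)}eg;
    return $out;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { per_line => \&per_line, one_string => \&one_string });
```

Both subs should produce the same output, so any difference in the
table comes from how the work is split up, not from what is matched.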

>>
>> But OTOH I think perhaps we're worrying about nothing when it comes to
>> the performance. I haven't been able to make gitweb display more than
>> a 100 or so commits at a time (haven't found where exactly in the code
>> these limits are), any munging we do on the log messages would have to
>> be pretty damn slow to matter.
>
> sub git_log_generic {
>
> 	# [...]
>
> 	my @commitlist =
> 		parse_commits($commit_hash, 101, (100 * $page),
> 		              defined $file_name ? ($file_name, "--full-history") : ());
>
>
> Here you have it (it probably should be a constant; this number can be
> found in a few other places).

Thanks!


Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages

2016-09-21 Thread Jakub Narębski
On 21.09.2016 at 20:04, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Sep 21, 2016 at 6:26 PM, Jakub Narębski  wrote:
> 
>> P.S. I have reworking of commit message parsing and enhancement in my
>> long, long and dated gitweb TODO list :-(
> 
> Anything specific you could share?

Some of the TODO list I would have to recover from backups, as the
computer on which I did the majority of gitweb development has since
died (of old age).

The list includes:
- implement caching of gitweb output
- revamp handling of encoding (UTF-8 with fallback encoding)
- split gitweb into modules, while maintaining ease of install
- refactor handling of diffs
- better handling of config files
- document URI structure, perhaps revamp URI parsing and generation
- make commit message transformation generic
  (see below)

> 
> One thing that would be a lot faster in Perl is if we didn't have to
> pass the log around as split-up lines and could just operate on it as
> one big string.

Well, there are a few transformations that a commit message undergoes
in gitweb, including linking SHA-1s, optional linking of bug numbers
to a bug tracker, and syntax highlighting of signoff lines (trailer
lines).

I would like to have this cleaned up and refactored.  With all
those transformations we would need to keep track of which parts
are already HTML and which are not and still need escaping (note:
URI escape != HTML escape).
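
To illustrate the difference between the two kinds of escaping, here is
a small sketch. The helpers html_escape() and uri_escape_param() and
the 'gitweb.cgi?h=' URL are hypothetical and written for this example
only; gitweb's own esc_html() does more than this, and the sketch
assumes ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper: escape text for use in HTML content or attributes.
sub html_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Hypothetical helper: percent-encode everything outside the RFC 3986
# "unreserved" set, for use as a query parameter value (ASCII only).
sub uri_escape_param {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/eg;
    return $s;
}

my $branch = 'fix/a&b <wip>';

# The value is URI-escaped when the URL is built, and the finished URL
# is HTML-escaped again when it is placed inside the attribute.
my $href = 'gitweb.cgi?h=' . uri_escape_param($branch);
print '<a href="', html_escape($href), '">', html_escape($branch), "</a>\n";
```

The same string ends up in two different forms within one tag, which
is why a generic transformation pipeline has to remember which escaping
each fragment has already received.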

> 
> It would make some code like git_print_log() a bit more complex /
> fragile, since it would have to work on multi-line strings, but
> anything that needed to do a regex match / replacement would be much
> faster.

Would it?  Did you perform any synthetic micro-benchmark?

> 
> But OTOH I think perhaps we're worrying about nothing when it comes to
> the performance. I haven't been able to make gitweb display more than
> a 100 or so commits at a time (haven't found where exactly in the code
> these limits are), any munging we do on the log messages would have to
> be pretty damn slow to matter.

sub git_log_generic {

	# [...]

	my @commitlist =
		parse_commits($commit_hash, 101, (100 * $page),
		              defined $file_name ? ($file_name, "--full-history") : ());


Here you have it (it probably should be a constant; this number can be
found in a few other places).
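
A sketch of what replacing the magic numbers with a constant might look
like. The names COMMITS_PER_PAGE and page_window() are hypothetical and
do not exist in gitweb; the reading of 101 as "one page plus one extra
commit", presumably fetched to detect whether a next page exists, is my
interpretation of the call above.

```perl
use strict;
use warnings;

# Hypothetical name; the current code hard-codes 100 and 101.
use constant COMMITS_PER_PAGE => 100;

# Returns (count, skip) for a zero-based page number, mirroring the
# second and third arguments passed to parse_commits() above.
sub page_window {
    my ($page) = @_;
    return (COMMITS_PER_PAGE + 1, COMMITS_PER_PAGE * $page);
}

my ($count, $skip) = page_window(2);
print "count=$count skip=$skip\n";
```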

Best,
-- 
Jakub Narębski



Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages

2016-09-21 Thread Ævar Arnfjörð Bjarmason
On Wed, Sep 21, 2016 at 6:26 PM, Jakub Narębski  wrote:

> P.S. I have reworking of commit message parsing and enhancement in my
> long, long and dated gitweb TODO list :-(

Anything specific you could share?

One thing that would be a lot faster in Perl is if we didn't have to
pass the log around as split-up lines and could just operate on it as
one big string.

It would make some code like git_print_log() a bit more complex /
fragile, since it would have to work on multi-line strings, but
anything that needed to do a regex match / replacement would be much
faster.

But OTOH I think perhaps we're worrying about nothing when it comes to
the performance. I haven't been able to make gitweb display more than
a 100 or so commits at a time (haven't found where exactly in the code
these limits are), any munging we do on the log messages would have to
be pretty damn slow to matter.

> P.P.S. Kay Sievers no longer works on gitweb, and I think no longer
> works at SuSE but at RedHat.

Yup, been getting bounces from his address.


Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages

2016-09-21 Thread Jakub Narębski
On 21.09.2016 at 13:44, Ævar Arnfjörð Bjarmason wrote:

> Subject: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages

This sounds like a modification of an existing feature, not a new
feature.  I think the following title / subject would be better:

  Subject: [PATCH 2/3] gitweb: Link to 7-char+ SHA1s, not only 8-char+

>
> Change the minimum length of a commit we'll link to from 8 to 7.

I think it would read better as:

  Change the minimum length of an abbreviated object identifier in the
  commit message that gitweb tries to turn into a link from 8 hexchars
  to 7.

> 
> This arbitrary minimum length of 8 was introduced in
> v1.4.4.2-151-gbfe2191, but as seen in e.g. v1.7.4-1-gdce9648 the
> default abbreviation length is 7.

Right. I wonder why it was 8 in gitweb...

> 
> It's still possible to reference SHA1s down to 4 characters in length,
> see v1.7.4-1-gdce9648's MINIMUM_ABBREV, but I can't see how to make
> git actually produce that, so I doubt anyone is putting that into log
> messages in practice, but people definitely do put 7 character SHA1s
> into log messages.

There is an additional problem: the shorter the SHA-1 abbreviation we
try to match, the greater the chance of false positives, i.e. words
that only look like a shortened SHA-1.

For 7 characters there is at least one word that can be mistaken for
a SHA-1 abbreviation, namely 'deedeed' (hopefully rare in commit
messages).  For 6 characters we have 'accede', 'beaded', 'decade' (!),
'deface', 'facade' (!!), and possibly more (and of course all
7-character hexdigit words).
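
A quick way to check such candidates against different minimum lengths.
The hex_like() helper is hypothetical, and the word list is just the
examples above plus the two abbreviated commit IDs mentioned in the
patch description.

```perl
use strict;
use warnings;

# Return the words consisting solely of $min to 40 hex digits, i.e.
# the ones a pattern like [0-9a-fA-F]{$min,40} would turn into links.
sub hex_like {
    my ($min, @words) = @_;
    return grep { /\A[0-9a-fA-F]{$min,40}\z/ } @words;
}

my @words = qw(deedeed accede beaded decade deface facade bfe2191 dce9648);

for my $min (6, 7, 8) {
    my @hits = hex_like($min, @words);
    print "min $min: @hits\n";
}
```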

Also, the number of digits provided as an optional parameter to the
--abbrev or --abbrev-commit options is only a minimum number of
hexdigits: Git uses as many as are needed for the abbreviated SHA-1
to be unambiguous at the time it is generated.


I think allowing 7-character shortened SHA-1s, which is what Git
produces by default for smaller repositories, is (might be?) a good
idea.  Thanks for the patch.

> 
> I think it's fairly dubious to link to things matching [0-9a-fA-F]
> here as opposed to just [0-9a-f], that dates back to the initial
> version of gitweb from 161332a. Git will accept all-caps SHA1s, but
> didn't ever produce them as far as I can tell.

All right, thanks for the reminder.

Signoff?

> ---
>  gitweb/gitweb.perl | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 9473daf..101dbc0 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -2036,7 +2036,7 @@ sub format_log_line_html {
>   my $line = shift;
>  
>   $line = esc_html($line, -nbsp=>1);
> - $line =~ s{\b([0-9a-fA-F]{8,40})\b}{
> + $line =~ s{\b([0-9a-fA-F]{7,40})\b}{
>   $cgi->a({-href => href(action=>"object", hash=>$1),
>   -class => "text"}, $1);
>   }eg;
> 

Nice and simple.

P.S. I have reworking of commit message parsing and enhancement in my
long, long and dated gitweb TODO list :-(

P.P.S. Kay Sievers no longer works on gitweb, and I think no longer
works at SuSE but at RedHat.

Best,
-- 
Jakub Narębski