Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-04 Thread Jeff King
On Fri, Apr 03, 2015 at 03:24:09PM -0700, Kyle J. McKay wrote:

 I thought that meant we could also optimize out the map call entirely,
 and just use the first split (with *) to end up with a list of $COLOR
 chunks and single characters, but it does not seem to work. So maybe I
 am misreading something about what is going on.
 
 I think our emails crossed in flight...
 
 Using just the first split (with *) produces useless empty elements which
 I think ends up causing problems.  I suppose you could surround it with a
 grep /./ to remove them but that would defeat the point of the optimization.

Yeah, the problem is the use of (). We want to keep the $COLOR
delimiters, but not the empty ones. Perhaps you could do:

  split /($COLOR+)|/

but I didn't try it. I think what you posted is good and a lot less
subtle.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-03 Thread Jeff King
On Thu, Apr 02, 2015 at 06:59:50PM -0700, Kyle J. McKay wrote:

 It should work as well as the original did for any 1-byte encoding.  That
 is, if it's not valid UTF-8 it should pass through unchanged and any single
 byte encoding should just work.  But, as you point out, multibyte encodings
 other than UTF-8 won't work, but they should behave the same as they did
 before.

Yeah, sorry, I should have been more clear that I meant multibyte
encodings. UTF-8 is the only common multibyte encoding I run across, but
that's because Latin1 served most of my pre-UTF-8 needs.  I suspect
things are very different for people in Asia. I don't know how
badly they would want support for other encodings. I'm happy to go with
a UTF-8 solution for now, and see if anybody wants to expand it further
later.

 I timed this one versus the existing diff-highlight. It's about 7%
 slower.
 
 I'd expect that, we're doing extra work we weren't doing before.

I was worried would be 200% or something. :)

 Makes sense. I'm happy enough listing perl 5.8 as a dependency.
 
 Maybe that should be added.  The rest of Git's perl code seems to have a
 'use 5.008;' already, so I figured that was a reasonable dependency.  :)

I shouldn't have said listing. I just meant have as a dependency. I
am also happy with adding use 5.008, but I agree it's probably not
necessary at this point. It was released in 2002 (wow, has it really
been that long?).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-03 Thread Kyle J. McKay

On Apr 3, 2015, at 15:08, Jeff King wrote:

Doing:

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff- 
highlight/diff-highlight

index 08c88bb..1c4b599 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -165,7 +165,7 @@ sub highlight_pair {
sub split_line {
local $_ = shift;
return map { /$COLOR/ ? $_ : (split //) }
-  split /($COLOR*)/;
+  split /($COLOR+)/;
}

sub highlight_line {

gives me a 25% speed improvement, and the same output processing
git.git's entire git log -p output.

I thought that meant we could also optimize out the map call  
entirely,
and just use the first split (with *) to end up with a list of  
$COLOR

chunks and single characters, but it does not seem to work. So maybe I
am misreading something about what is going on.


I think our emails crossed in flight...

Using just the first split (with *) produces useless empty elements  
which I think ends up causing problems.  I suppose you could surround  
it with a grep /./ to remove them but that would defeat the point of  
the optimization.


-Kyle
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-03 Thread Jeff King
On Fri, Apr 03, 2015 at 11:19:24AM +0900, Yi, EungJun wrote:

  I timed this one versus the existing diff-highlight. It's about 7%
  slower. That's not great, but is acceptable to me. The String::Multibyte
  version was a lot faster, which was nice (but I'm still unclear on
  _why_).
 
 I think the reason is here:
 
  sub split_line {
 local $_ = shift;
 return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs-strsplit('', $_) : split 
  //) }
split /($COLOR)/;
  }
 
 I removed * from split /($COLOR*)/. Actually I don't know why *
 was required but I need to remove it to make my patch works correctly.

Ah, OK, that makes more sense. The * was meant to handle the case of
multiple groups of ANSI colors in a row. But I think it should have been
+ in that case, as we would otherwise split on the empty field, which
would mean character-by-character. And the second split in the map
would then be superfluous, which would break your patch (we've already
split the multi-byte characters before we even hit $mbcs-strsplit).

Kyle's patch does not care, because it tweaks the string so that normal
split works. Which means there is an easy speedup here. :)

Doing:

diff --git a/contrib/diff-highlight/diff-highlight 
b/contrib/diff-highlight/diff-highlight
index 08c88bb..1c4b599 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -165,7 +165,7 @@ sub highlight_pair {
 sub split_line {
local $_ = shift;
return map { /$COLOR/ ? $_ : (split //) }
-  split /($COLOR*)/;
+  split /($COLOR+)/;
 }
 
 sub highlight_line {

gives me a 25% speed improvement, and the same output processing
git.git's entire git log -p output.

I thought that meant we could also optimize out the map call entirely,
and just use the first split (with *) to end up with a list of $COLOR
chunks and single characters, but it does not seem to work. So maybe I
am misreading something about what is going on.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-02 Thread Jeff King
On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:

 Subject: [PATCH v2] diff-highlight: do not split multibyte characters
 
 When the input is UTF-8 and Perl is operating on bytes instead
 of characters, a diff that changes one multibyte character to
 another that shares an initial byte sequence will result in a
 broken diff display as the common byte sequence prefix will be
 separated from the rest of the bytes in the multibyte character.

Thanks, I had a feeling we should be able to do something with perl's
builtin utf8 support.  This doesn't help people with other encodings,
but I'm not sure the original was all that helpful either (in that we
don't actually _know_ the file encodings in the first place).

I briefly confirmed that this seems to do the right thing on po/bg.po,
which has a couple of sheared characters when viewed with the existing
code.

I timed this one versus the existing diff-highlight. It's about 7%
slower. That's not great, but is acceptable to me. The String::Multibyte
version was a lot faster, which was nice (but I'm still unclear on
_why_).

 Fix this by putting Perl into character mode when splitting the
 line and then back into byte mode after the split is finished.

I also wondered if we could simply put stdin into utf8 mode. But it
looks like it will barf whenever it gets invalid utf8. Checking for
valid utf8 and only doing the multi-byte split in that case (as you do
here) is a lot more robust.

 While the utf8::xxx functions are built-in and do not require
 any 'use' statement, the utf8::is_utf8 function did not appear
 until Perl 5.8.1, but is identical to the Encode::is_utf8
 function which is available in 5.8 so we use that instead of
 utf8::is_utf8.

Makes sense. I'm happy enough listing perl 5.8 as a dependency.

EungJun, does this version meet your needs?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-02 Thread Kyle J. McKay

On Apr 2, 2015, at 18:24, Jeff King wrote:


On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:


Subject: [PATCH v2] diff-highlight: do not split multibyte characters

When the input is UTF-8 and Perl is operating on bytes instead
of characters, a diff that changes one multibyte character to
another that shares an initial byte sequence will result in a
broken diff display as the common byte sequence prefix will be
separated from the rest of the bytes in the multibyte character.


Thanks, I had a feeling we should be able to do something with perl's
builtin utf8 support.  This doesn't help people with other encodings,


It should work as well as the original did for any 1-byte encoding.   
That is, if it's not valid UTF-8 it should pass through unchanged and  
any single byte encoding should just work.  But, as you point out,  
multibyte encodings other than UTF-8 won't work, but they should  
behave the same as they did before.



but I'm not sure the original was all that helpful either (in that we
don't actually _know_ the file encodings in the first place).


I think it should work fine on any single byte encoding (i.e. ISO-8859- 
x, WINDOWS-1252, etc.).



I timed this one versus the existing diff-highlight. It's about 7%
slower.


I'd expect that, we're doing extra work we weren't doing before.


That's not great, but is acceptable to me. The String::Multibyte
version was a lot faster, which was nice (but I'm still unclear on
_why_).


Must be the mbcs-strsplit routine has special case code for splitting  
on '' to just split on character boundaries.



Fix this by putting Perl into character mode when splitting the
line and then back into byte mode after the split is finished.


I also wondered if we could simply put stdin into utf8 mode. But it
looks like it will barf whenever it gets invalid utf8. Checking for
valid utf8 and only doing the multi-byte split in that case (as you do
here) is a lot more robust.


While the utf8::xxx functions are built-in and do not require
any 'use' statement, the utf8::is_utf8 function did not appear
until Perl 5.8.1, but is identical to the Encode::is_utf8
function which is available in 5.8 so we use that instead of
utf8::is_utf8.


Makes sense. I'm happy enough listing perl 5.8 as a dependency.


Maybe that should be added.  The rest of Git's perl code seems to have  
a 'use 5.008;' already, so I figured that was a reasonable  
dependency.  :)


-Kyle
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-02 Thread Kyle J. McKay
On Mar 30, 2015, at 15:16, Jeff King wrote:

 Yeah, I agree the current output is not ideal, and this should address
 the problem. I was worried that multi-byte splitting would make things
 slower, but in my tests, it actually speeds things up!

[...]

 Unfortunately, String::Multibyte is not a standard module, and is not
 even packed for Debian systems (I got mine from CPAN). Can we make
 this
 a conditional include (e.g., 'eval require String::Multibyte' in
 get_mbcs, and return undef if that fails?). Then people without it can
 still use the script.

[...]

 Yuck. This is a lot more intimate with String::Multibyte's
 implementation than I'd like to be.

So I was curious about this and played with it and was able to
reproduce the problem as described.

Here's an alternate fix that should work for everyone with Perl 5.8
or later.

-Kyle

-- 8 --
Subject: [PATCH v2] diff-highlight: do not split multibyte characters

When the input is UTF-8 and Perl is operating on bytes instead
of characters, a diff that changes one multibyte character to
another that shares an initial byte sequence will result in a
broken diff display as the common byte sequence prefix will be
separated from the rest of the bytes in the multibyte character.

For example, if a single line contains only the unicode
character U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that
line is then changed to the unicode character U+C9C0 (encoded as
UTF-8 0xEC, 0xA7, 0x80), when operating on bytes diff-highlight
will show only the single byte change from 0x84 to 0x80 thus
creating invalid UTF-8 and a broken diff display.

Fix this by putting Perl into character mode when splitting the
line and then back into byte mode after the split is finished.

While the utf8::xxx functions are built-in and do not require
any 'use' statement, the utf8::is_utf8 function did not appear
until Perl 5.8.1, but is identical to the Encode::is_utf8
function which is available in 5.8 so we use that instead of
utf8::is_utf8.

Reported-by: Yi EungJun semtlen...@gmail.com
Signed-off-by: Kyle J. McKay mack...@gmail.com
---
 contrib/diff-highlight/diff-highlight | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight 
b/contrib/diff-highlight/diff-highlight
index 08c88bbc..8e9b5ada 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,7 @@
 
 use warnings FATAL = 'all';
 use strict;
+use Encode ();
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -164,8 +165,10 @@ sub highlight_pair {
 
 sub split_line {
local $_ = shift;
-   return map { /$COLOR/ ? $_ : (split //) }
-  split /($COLOR*)/;
+   utf8::decode($_);
+   return map { utf8::encode($_) if Encode::is_utf8($_); $_ }
+   map { /$COLOR/ ? $_ : (split //) }
+   split /($COLOR*)/;
 }
 
 sub highlight_line {
---
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-04-02 Thread Yi, EungJun
 I timed this one versus the existing diff-highlight. It's about 7%
 slower. That's not great, but is acceptable to me. The String::Multibyte
 version was a lot faster, which was nice (but I'm still unclear on
 _why_).

I think the reason is here:

 sub split_line {
local $_ = shift;
return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs-strsplit('', $_) : split //) }
   split /($COLOR)/;
 }

I removed * from split /($COLOR*)/. Actually I don't know why *
was required but I need to remove it to make my patch works correctly.


On Fri, Apr 3, 2015 at 10:24 AM, Jeff King p...@peff.net wrote:
 On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:

 Subject: [PATCH v2] diff-highlight: do not split multibyte characters

 When the input is UTF-8 and Perl is operating on bytes instead
 of characters, a diff that changes one multibyte character to
 another that shares an initial byte sequence will result in a
 broken diff display as the common byte sequence prefix will be
 separated from the rest of the bytes in the multibyte character.

 Thanks, I had a feeling we should be able to do something with perl's
 builtin utf8 support.  This doesn't help people with other encodings,
 but I'm not sure the original was all that helpful either (in that we
 don't actually _know_ the file encodings in the first place).

 I briefly confirmed that this seems to do the right thing on po/bg.po,
 which has a couple of sheared characters when viewed with the existing
 code.

 I timed this one versus the existing diff-highlight. It's about 7%
 slower. That's not great, but is acceptable to me. The String::Multibyte
 version was a lot faster, which was nice (but I'm still unclear on
 _why_).

 Fix this by putting Perl into character mode when splitting the
 line and then back into byte mode after the split is finished.

 I also wondered if we could simply put stdin into utf8 mode. But it
 looks like it will barf whenever it gets invalid utf8. Checking for
 valid utf8 and only doing the multi-byte split in that case (as you do
 here) is a lot more robust.

 While the utf8::xxx functions are built-in and do not require
 any 'use' statement, the utf8::is_utf8 function did not appear
 until Perl 5.8.1, but is identical to the Encode::is_utf8
 function which is available in 5.8 so we use that instead of
 utf8::is_utf8.

 Makes sense. I'm happy enough listing perl 5.8 as a dependency.

 EungJun, does this version meet your needs?

 -Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] diff-highlight: Fix broken multibyte string

2015-03-30 Thread Jeff King
On Tue, Mar 31, 2015 at 12:55:33AM +0900, Yi EungJun wrote:

 From: Yi EungJun eungjun...@navercorp.com
 
 Highlighted string might be broken if the common subsequence is a proper 
 subset
 of a multibyte character. For example, if the old string is 진 and the new
 string is 지, then we expect the diff is rendered as follows:
 
   -진
   +지
 
 but actually it was rendered as follows:
 
 -ECA784
 +ECA780
 
 This fixes the bug by splitting the string by multibyte characters.

Yeah, I agree the current output is not ideal, and this should address
the problem. I was worried that multi-byte splitting would make things
slower, but in my tests, it actually speeds things up!

That surprised me. The big difference is calling $mbcs-strsplit instead
of regular split. Could it be that it's much faster than regular split?
Or is it that the resulting strings are faster for the rest of the
processing (e.g., perl hits a slow path on the sheared characters)? It
doesn't really matter, I guess, but certainly I was curious.

 +use File::Basename;
 +use File::Spec::Functions qw( catdir );
 +use String::Multibyte;

Unfortunately, String::Multibyte is not a standard module, and is not
even packed for Debian systems (I got mine from CPAN). Can we make this
a conditional include (e.g., 'eval require String::Multibyte' in
get_mbcs, and return undef if that fails?). Then people without it can
still use the script.

 +# Returns an instance of String::Multibyte based on the charset defined by
 +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support
 +# the charset.

Hrm. The characters we are processing are not in the commit message, but
in the files themselves. In fact, there may be many different charsets
(i.e., a different one for each file), and we really don't have a good
way of knowing which is in play. I'd say that using the commit
encoding is our best guess, though. What happens with $mbcs-split when
the input is not a valid character in the charset (i.e., when we guess
wrong)?

If we are going to use the commit encoding, wouldn't
i18n.logOutputEncoding be a better choice?

 +sub get_mbcs {
 + my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte');
 + opendir my $dh, $dir or return;
 + my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh;

Yuck. This is a lot more intimate with String::Multibyte's
implementation than I'd like to be. Could we perhaps just run the
constructor on any candidates charsets, and then return the first hit
that gives us something besides undef?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] diff-highlight: Fix broken multibyte string

2015-03-30 Thread Yi EungJun
From: Yi EungJun eungjun...@navercorp.com

Highlighted string might be broken if the common subsequence is a proper subset
of a multibyte character. For example, if the old string is 진 and the new
string is 지, then we expect the diff is rendered as follows:

-진
+지

but actually it was rendered as follows:

-ECA784
+ECA780

This fixes the bug by splitting the string by multibyte characters.
---
 contrib/diff-highlight/diff-highlight | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight 
b/contrib/diff-highlight/diff-highlight
index 08c88bb..2662c1a 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -2,6 +2,9 @@
 
 use warnings FATAL = 'all';
 use strict;
+use File::Basename;
+use File::Spec::Functions qw( catdir );
+use String::Multibyte;
 
 # Highlight by reversing foreground and background. You could do
 # other things like bold or underline if you prefer.
@@ -24,6 +27,8 @@ my @removed;
 my @added;
 my $in_hunk;
 
+my $mbcs = get_mbcs();
+
 # Some scripts may not realize that SIGPIPE is being ignored when launching the
 # pager--for instance scripts written in Python.
 $SIG{PIPE} = 'DEFAULT';
@@ -164,8 +169,8 @@ sub highlight_pair {
 
 sub split_line {
local $_ = shift;
-   return map { /$COLOR/ ? $_ : (split //) }
-  split /($COLOR*)/;
+   return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs-strsplit('', $_) : split 
//) }
+  split /($COLOR)/;
 }
 
 sub highlight_line {
@@ -211,3 +216,19 @@ sub is_pair_interesting {
   $suffix_a !~ /^$BORING*$/ ||
   $suffix_b !~ /^$BORING*$/;
 }
+
+# Returns an instance of String::Multibyte based on the charset defined by
+# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support
+# the charset.
+sub get_mbcs {
+   my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte');
+   opendir my $dh, $dir or return;
+   my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh;
+   close $dh;
+   my $expected_charset = `git config i18n.commitencoding` || UTF-8;
+   $expected_charset =~ s/-//g;
+   my @matches = grep {/^$expected_charset$/i} @mbcs_charsets;
+   my $charset = shift @matches;
+
+   return eval 'String::Multibyte-new($charset)';
+}
-- 
2.3.2.209.gd67f9d5.dirty

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html