Re: [PATCH] diff-highlight: Fix broken multibyte string
On Fri, Apr 03, 2015 at 03:24:09PM -0700, Kyle J. McKay wrote: I thought that meant we could also optimize out the map call entirely, and just use the first split (with *) to end up with a list of $COLOR chunks and single characters, but it does not seem to work. So maybe I am misreading something about what is going on. I think our emails crossed in flight... Using just the first split (with *) produces useless empty elements which I think ends up causing problems. I suppose you could surround it with a grep /./ to remove them but that would defeat the point of the optimization. Yeah, the problem is the use of (). We want to keep the $COLOR delimiters, but not the empty ones. Perhaps you could do: split /($COLOR+)|/ but I didn't try it. I think what you posted is good and a lot less subtle. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Thu, Apr 02, 2015 at 06:59:50PM -0700, Kyle J. McKay wrote: It should work as well as the original did for any 1-byte encoding. That is, if it's not valid UTF-8 it should pass through unchanged and any single byte encoding should just work. But, as you point out, multibyte encodings other than UTF-8 won't work, but they should behave the same as they did before. Yeah, sorry, I should have been more clear that I meant multibyte encodings. UTF-8 is the only common multibyte encoding I run across, but that's because Latin1 served most of my pre-UTF-8 needs. I suspect things are very different for people in Asia. I don't know how badly they would want support for other encodings. I'm happy to go with a UTF-8 solution for now, and see if anybody wants to expand it further later. I timed this one versus the existing diff-highlight. It's about 7% slower. I'd expect that, we're doing extra work we weren't doing before. I was worried would be 200% or something. :) Makes sense. I'm happy enough listing perl 5.8 as a dependency. Maybe that should be added. The rest of Git's perl code seems to have a 'use 5.008;' already, so I figured that was a reasonable dependency. :) I shouldn't have said listing. I just meant have as a dependency. I am also happy with adding use 5.008, but I agree it's probably not necessary at this point. It was released in 2002 (wow, has it really been that long?). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Apr 3, 2015, at 15:08, Jeff King wrote: Doing: diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff- highlight/diff-highlight index 08c88bb..1c4b599 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -165,7 +165,7 @@ sub highlight_pair { sub split_line { local $_ = shift; return map { /$COLOR/ ? $_ : (split //) } - split /($COLOR*)/; + split /($COLOR+)/; } sub highlight_line { gives me a 25% speed improvement, and the same output processing git.git's entire git log -p output. I thought that meant we could also optimize out the map call entirely, and just use the first split (with *) to end up with a list of $COLOR chunks and single characters, but it does not seem to work. So maybe I am misreading something about what is going on. I think our emails crossed in flight... Using just the first split (with *) produces useless empty elements which I think ends up causing problems. I suppose you could surround it with a grep /./ to remove them but that would defeat the point of the optimization. -Kyle -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Fri, Apr 03, 2015 at 11:19:24AM +0900, Yi, EungJun wrote: I timed this one versus the existing diff-highlight. It's about 7% slower. That's not great, but is acceptable to me. The String::Multibyte version was a lot faster, which was nice (but I'm still unclear on _why_). I think the reason is here: sub split_line { local $_ = shift; return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs-strsplit('', $_) : split //) } split /($COLOR)/; } I removed * from split /($COLOR*)/. Actually I don't know why * was required but I need to remove it to make my patch works correctly. Ah, OK, that makes more sense. The * was meant to handle the case of multiple groups of ANSI colors in a row. But I think it should have been + in that case, as we would otherwise split on the empty field, which would mean character-by-character. And the second split in the map would then be superfluous, which would break your patch (we've already split the multi-byte characters before we even hit $mbcs-strsplit). Kyle's patch does not care, because it tweaks the string so that normal split works. Which means there is an easy speedup here. :) Doing: diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index 08c88bb..1c4b599 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -165,7 +165,7 @@ sub highlight_pair { sub split_line { local $_ = shift; return map { /$COLOR/ ? $_ : (split //) } - split /($COLOR*)/; + split /($COLOR+)/; } sub highlight_line { gives me a 25% speed improvement, and the same output processing git.git's entire git log -p output. I thought that meant we could also optimize out the map call entirely, and just use the first split (with *) to end up with a list of $COLOR chunks and single characters, but it does not seem to work. So maybe I am misreading something about what is going on. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote: Subject: [PATCH v2] diff-highlight: do not split multibyte characters When the input is UTF-8 and Perl is operating on bytes instead of characters, a diff that changes one multibyte character to another that shares an initial byte sequence will result in a broken diff display as the common byte sequence prefix will be separated from the rest of the bytes in the multibyte character. Thanks, I had a feeling we should be able to do something with perl's builtin utf8 support. This doesn't help people with other encodings, but I'm not sure the original was all that helpful either (in that we don't actually _know_ the file encodings in the first place). I briefly confirmed that this seems to do the right thing on po/bg.po, which has a couple of sheared characters when viewed with the existing code. I timed this one versus the existing diff-highlight. It's about 7% slower. That's not great, but is acceptable to me. The String::Multibyte version was a lot faster, which was nice (but I'm still unclear on _why_). Fix this by putting Perl into character mode when splitting the line and then back into byte mode after the split is finished. I also wondered if we could simply put stdin into utf8 mode. But it looks like it will barf whenever it gets invalid utf8. Checking for valid utf8 and only doing the multi-byte split in that case (as you do here) is a lot more robust. While the utf8::xxx functions are built-in and do not require any 'use' statement, the utf8::is_utf8 function did not appear until Perl 5.8.1, but is identical to the Encode::is_utf8 function which is available in 5.8 so we use that instead of utf8::is_utf8. Makes sense. I'm happy enough listing perl 5.8 as a dependency. EungJun, does this version meet your needs? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Apr 2, 2015, at 18:24, Jeff King wrote: On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote: Subject: [PATCH v2] diff-highlight: do not split multibyte characters When the input is UTF-8 and Perl is operating on bytes instead of characters, a diff that changes one multibyte character to another that shares an initial byte sequence will result in a broken diff display as the common byte sequence prefix will be separated from the rest of the bytes in the multibyte character. Thanks, I had a feeling we should be able to do something with perl's builtin utf8 support. This doesn't help people with other encodings, It should work as well as the original did for any 1-byte encoding. That is, if it's not valid UTF-8 it should pass through unchanged and any single byte encoding should just work. But, as you point out, multibyte encodings other than UTF-8 won't work, but they should behave the same as they did before. but I'm not sure the original was all that helpful either (in that we don't actually _know_ the file encodings in the first place). I think it should work fine on any single byte encoding (i.e. ISO-8859- x, WINDOWS-1252, etc.). I timed this one versus the existing diff-highlight. It's about 7% slower. I'd expect that, we're doing extra work we weren't doing before. That's not great, but is acceptable to me. The String::Multibyte version was a lot faster, which was nice (but I'm still unclear on _why_). Must be the mbcs-strsplit routine has special case code for splitting on '' to just split on character boundaries. Fix this by putting Perl into character mode when splitting the line and then back into byte mode after the split is finished. I also wondered if we could simply put stdin into utf8 mode. But it looks like it will barf whenever it gets invalid utf8. Checking for valid utf8 and only doing the multi-byte split in that case (as you do here) is a lot more robust. While the utf8::xxx functions are built-in and do not require any 'use' statement, the utf8::is_utf8 function did not appear until Perl 5.8.1, but is identical to the Encode::is_utf8 function which is available in 5.8 so we use that instead of utf8::is_utf8. Makes sense. I'm happy enough listing perl 5.8 as a dependency. Maybe that should be added. The rest of Git's perl code seems to have a 'use 5.008;' already, so I figured that was a reasonable dependency. :) -Kyle -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Mar 30, 2015, at 15:16, Jeff King wrote: Yeah, I agree the current output is not ideal, and this should address the problem. I was worried that multi-byte splitting would make things slower, but in my tests, it actually speeds things up! [...] Unfortunately, String::Multibyte is not a standard module, and is not even packed for Debian systems (I got mine from CPAN). Can we make this a conditional include (e.g., 'eval require String::Multibyte' in get_mbcs, and return undef if that fails?). Then people without it can still use the script. [...] Yuck. This is a lot more intimate with String::Multibyte's implementation than I'd like to be. So I was curious about this and played with it and was able to reproduce the problem as described. Here's an alternate fix that should work for everyone with Perl 5.8 or later. -Kyle -- 8 -- Subject: [PATCH v2] diff-highlight: do not split multibyte characters When the input is UTF-8 and Perl is operating on bytes instead of characters, a diff that changes one multibyte character to another that shares an initial byte sequence will result in a broken diff display as the common byte sequence prefix will be separated from the rest of the bytes in the multibyte character. For example, if a single line contains only the unicode character U+C9C4 (encoded as UTF-8 0xEC, 0xA7, 0x84) and that line is then changed to the unicode character U+C9C0 (encoded as UTF-8 0xEC, 0xA7, 0x80), when operating on bytes diff-highlight will show only the single byte change from 0x84 to 0x80 thus creating invalid UTF-8 and a broken diff display. Fix this by putting Perl into character mode when splitting the line and then back into byte mode after the split is finished. While the utf8::xxx functions are built-in and do not require any 'use' statement, the utf8::is_utf8 function did not appear until Perl 5.8.1, but is identical to the Encode::is_utf8 function which is available in 5.8 so we use that instead of utf8::is_utf8. Reported-by: Yi EungJun semtlen...@gmail.com Signed-off-by: Kyle J. McKay mack...@gmail.com --- contrib/diff-highlight/diff-highlight | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index 08c88bbc..8e9b5ada 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -2,6 +2,7 @@ use warnings FATAL = 'all'; use strict; +use Encode (); # Highlight by reversing foreground and background. You could do # other things like bold or underline if you prefer. @@ -164,8 +165,10 @@ sub highlight_pair { sub split_line { local $_ = shift; - return map { /$COLOR/ ? $_ : (split //) } - split /($COLOR*)/; + utf8::decode($_); + return map { utf8::encode($_) if Encode::is_utf8($_); $_ } + map { /$COLOR/ ? $_ : (split //) } + split /($COLOR*)/; } sub highlight_line { --- -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
I timed this one versus the existing diff-highlight. It's about 7% slower. That's not great, but is acceptable to me. The String::Multibyte version was a lot faster, which was nice (but I'm still unclear on _why_). I think the reason is here: sub split_line { local $_ = shift; return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs-strsplit('', $_) : split //) } split /($COLOR)/; } I removed * from split /($COLOR*)/. Actually I don't know why * was required but I need to remove it to make my patch works correctly. On Fri, Apr 3, 2015 at 10:24 AM, Jeff King p...@peff.net wrote: On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote: Subject: [PATCH v2] diff-highlight: do not split multibyte characters When the input is UTF-8 and Perl is operating on bytes instead of characters, a diff that changes one multibyte character to another that shares an initial byte sequence will result in a broken diff display as the common byte sequence prefix will be separated from the rest of the bytes in the multibyte character. Thanks, I had a feeling we should be able to do something with perl's builtin utf8 support. This doesn't help people with other encodings, but I'm not sure the original was all that helpful either (in that we don't actually _know_ the file encodings in the first place). I briefly confirmed that this seems to do the right thing on po/bg.po, which has a couple of sheared characters when viewed with the existing code. I timed this one versus the existing diff-highlight. It's about 7% slower. That's not great, but is acceptable to me. The String::Multibyte version was a lot faster, which was nice (but I'm still unclear on _why_). Fix this by putting Perl into character mode when splitting the line and then back into byte mode after the split is finished. I also wondered if we could simply put stdin into utf8 mode. But it looks like it will barf whenever it gets invalid utf8. Checking for valid utf8 and only doing the multi-byte split in that case (as you do here) is a lot more robust. While the utf8::xxx functions are built-in and do not require any 'use' statement, the utf8::is_utf8 function did not appear until Perl 5.8.1, but is identical to the Encode::is_utf8 function which is available in 5.8 so we use that instead of utf8::is_utf8. Makes sense. I'm happy enough listing perl 5.8 as a dependency. EungJun, does this version meet your needs? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] diff-highlight: Fix broken multibyte string
On Tue, Mar 31, 2015 at 12:55:33AM +0900, Yi EungJun wrote: From: Yi EungJun eungjun...@navercorp.com Highlighted string might be broken if the common subsequence is a proper subset of a multibyte character. For example, if the old string is 진 and the new string is 지, then we expect the diff is rendered as follows: -진 +지 but actually it was rendered as follows: -ECA784 +ECA780 This fixes the bug by splitting the string by multibyte characters. Yeah, I agree the current output is not ideal, and this should address the problem. I was worried that multi-byte splitting would make things slower, but in my tests, it actually speeds things up! That surprised me. The big difference is calling $mbcs-strsplit instead of regular split. Could it be that it's much faster than regular split? Or is it that the resulting strings are faster for the rest of the processing (e.g., perl hits a slow path on the sheared characters)? It doesn't really matter, I guess, but certainly I was curious. +use File::Basename; +use File::Spec::Functions qw( catdir ); +use String::Multibyte; Unfortunately, String::Multibyte is not a standard module, and is not even packed for Debian systems (I got mine from CPAN). Can we make this a conditional include (e.g., 'eval require String::Multibyte' in get_mbcs, and return undef if that fails?). Then people without it can still use the script. +# Returns an instance of String::Multibyte based on the charset defined by +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support +# the charset. Hrm. The characters we are processing are not in the commit message, but in the files themselves. In fact, there may be many different charsets (i.e., a different one for each file), and we really don't have a good way of knowing which is in play. I'd say that using the commit encoding is our best guess, though. What happens with $mbcs-split when the input is not a valid character in the charset (i.e., when we guess wrong)? If we are going to use the commit encoding, wouldn't i18n.logOutputEncoding be a better choice? +sub get_mbcs { + my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte'); + opendir my $dh, $dir or return; + my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh; Yuck. This is a lot more intimate with String::Multibyte's implementation than I'd like to be. Could we perhaps just run the constructor on any candidates charsets, and then return the first hit that gives us something besides undef? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] diff-highlight: Fix broken multibyte string
From: Yi EungJun eungjun...@navercorp.com Highlighted string might be broken if the common subsequence is a proper subset of a multibyte character. For example, if the old string is 진 and the new string is 지, then we expect the diff is rendered as follows: -진 +지 but actually it was rendered as follows: -ECA784 +ECA780 This fixes the bug by splitting the string by multibyte characters. --- contrib/diff-highlight/diff-highlight | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight index 08c88bb..2662c1a 100755 --- a/contrib/diff-highlight/diff-highlight +++ b/contrib/diff-highlight/diff-highlight @@ -2,6 +2,9 @@ use warnings FATAL = 'all'; use strict; +use File::Basename; +use File::Spec::Functions qw( catdir ); +use String::Multibyte; # Highlight by reversing foreground and background. You could do # other things like bold or underline if you prefer. @@ -24,6 +27,8 @@ my @removed; my @added; my $in_hunk; +my $mbcs = get_mbcs(); + # Some scripts may not realize that SIGPIPE is being ignored when launching the # pager--for instance scripts written in Python. $SIG{PIPE} = 'DEFAULT'; @@ -164,8 +169,8 @@ sub highlight_pair { sub split_line { local $_ = shift; - return map { /$COLOR/ ? $_ : (split //) } - split /($COLOR*)/; + return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs-strsplit('', $_) : split //) } + split /($COLOR)/; } sub highlight_line { @@ -211,3 +216,19 @@ sub is_pair_interesting { $suffix_a !~ /^$BORING*$/ || $suffix_b !~ /^$BORING*$/; } + +# Returns an instance of String::Multibyte based on the charset defined by +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support +# the charset. +sub get_mbcs { + my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte'); + opendir my $dh, $dir or return; + my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh; + close $dh; + my $expected_charset = `git config i18n.commitencoding` || UTF-8; + $expected_charset =~ s/-//g; + my @matches = grep {/^$expected_charset$/i} @mbcs_charsets; + my $charset = shift @matches; + + return eval 'String::Multibyte-new($charset)'; +} -- 2.3.2.209.gd67f9d5.dirty -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html