[PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). Signed-off-by: Роман Донченко d...@corrigendum.ru Acked-by: Jeff King p...@peff.net --- git-send-email.perl | 26 -- t/t9001-send-email.sh | 7 +++ 2 files changed, 23 insertions(+), 10 deletions(-) diff --git a/git-send-email.perl b/git-send-email.perl index d461ffb..7d5cc8a 100755 --- a/git-send-email.perl +++ b/git-send-email.perl @@ -919,17 +919,23 @@ $time = time - scalar $#files; sub unquote_rfc2047 { local ($_) = @_; my $charset; - s{$re_encoded_word}{ - $charset = $1; - my $encoding = $2; - my $text = $3; - if ($encoding eq 'q' || $encoding eq 'Q') { - $text =~ s/_/ /g; - $text =~ s/=([0-9A-F]{2})/chr(hex($1))/egi; - $text; - } else { - $; # other encodings not supported yet + my $sep = qr/[ \t]+/; + s{$re_encoded_word(?:$sep$re_encoded_word)*}{ + my @words = split $sep, $; + foreach (@words) { + m/$re_encoded_word/; + $charset = $1; + my $encoding = $2; + my $text = $3; + if ($encoding eq 'q' || $encoding eq 'Q') { + $_ = $text; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/egi; + } else { + # other encodings not supported yet + } } + join '', @words; }eg; return wantarray ? ($_, $charset) : $_; } diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh index 19a3ced..fa965ff 100755 --- a/t/t9001-send-email.sh +++ b/t/t9001-send-email.sh @@ -240,6 +240,13 @@ test_expect_success $PREREQ 'non-ascii self name is suppressed' 'non_ascii_self_suppressed' +# This name is long enough to force format-patch to split it into multiple +# encoded-words, assuming it uses UTF-8 with the Q encoding. +test_expect_success $PREREQ 'long non-ascii self name is suppressed' + test_suppress_self_quoted 'Ƒüñníęř €. Nâṁé' 'odd_?=m...@example.com' \ + 'long_non_ascii_self_suppressed' + + test_expect_success $PREREQ 'sanitized self name is suppressed' test_suppress_self_unquoted '\A U. Thor\' 'aut...@example.com' \ 'self_name_sanitized_suppressed' -- 2.1.1 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
From: Роман Донченко d...@corrigendum.ru Sent: Sunday, December 07, 2014 6:17 PM Would that be worth a terse comment in the documentation change part of the patch? Multiple (RFC2047) encodings are not supported., or would that be bike shed noise. I didn't change any documentation... and in either case, they weren't supported in the first place, so I don't think it's anything I need to mention. I'd confused this with the crossing thread by Paolo Bonzini bonz...@gnu.org [PATCH 2/2] git-send-email: add --transfer-encoding option; 25 November 2014 14:00. $gmane/260217. Sorry for the noise. -- Philip -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). Thanks. Both patches look good to me, and I'd be happy to have them applied as-is. I wrote a few comments below, but in all cases I think I convinced myself that what you wrote is best. + my $sep = qr/[ \t]+/; + s{$re_encoded_word(?:$sep$re_encoded_word)*}{ + my @words = split $sep, $; + foreach (@words) { + m/$re_encoded_word/; + $charset = $1; + my $encoding = $2; + my $text = $3; It feels a little weird to have to split and rematch $re_encoded_word in the loop, but I don't think there is a way around it. ($1, $2, $3) will have our first word, and ($4, $5, $6) will have the final (if any), but I don't think we can get access to what is in between. So I think what you have here is the best we can do. + if ($encoding eq 'q' || $encoding eq 'Q') { + $_ = $text; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/egi; It took me a minute to figure out why this works. $_ is a reference to the iterator for @words, so it is re-assigning that element of the array first to the encoded text, and then modifying it in place. I wonder if it would be more obvious like this: join '', map { m/$re_encoded_word/; $charset = $1; my $encoding = $2; my $text = $3; if ($encoding eq 'q' || $encoding eq 'Q') { $text =~ s/_/ /g; $text =~ s=([0-9A-F]{2}/chr(hex($1))/egi; } else { # other encoding not supported yet } } split($sep, $); I dunno. I kind of like your version better now that I understand it, but it did take me a minute. One final note on this bit of code: if there are multiple encoded words, we grab the $charset from the final encoded word, and never report the earlier charsets. Technically they do not all have to be the same (rfc2047 even has an example where they are not). I think we can dismiss this, though, as: 1. It was like this before your patches (we might have seen multiple non-adjacent encoded words; you're just handling adjacent ones), and nobody has complained. 2. Using two separate encodings in the same header is sufficiently ridiculous that I can live with us not handling it properly. +# This name is long enough to force format-patch to split it into multiple +# encoded-words, assuming it uses UTF-8 with the Q encoding. +test_expect_success $PREREQ 'long non-ascii self name is suppressed' + test_suppress_self_quoted 'Ƒüñníęř €. Nâṁé' 'odd_?=m...@example.com' \ + 'long_non_ascii_self_suppressed' + Cute. :) -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
Jeff King p...@peff.net писал в своём письме Sun, 07 Dec 2014 12:18:59 +0300: On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). Thanks. Both patches look good to me, and I'd be happy to have them applied as-is. I wrote a few comments below, but in all cases I think I convinced myself that what you wrote is best. I had the same concerns myself, and eventually convinced myself of the same. :-) One final note on this bit of code: if there are multiple encoded words, we grab the $charset from the final encoded word, and never report the earlier charsets. Technically they do not all have to be the same (rfc2047 even has an example where they are not). I think we can dismiss this, though, as: 1. It was like this before your patches (we might have seen multiple non-adjacent encoded words; you're just handling adjacent ones), and nobody has complained. 2. Using two separate encodings in the same header is sufficiently ridiculous that I can live with us not handling it properly. Yeah, that bugs me as well. But I think handling multiple encodings would require substantial reworking of the code, so I chickened out (with the same excuses :-)). Roman. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
From: Роман Донченко d...@corrigendum.ru Jeff King p...@peff.net писал в своём письме Sun, 07 Dec 2014 12:18:59 +0300: On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). Thanks. Both patches look good to me, and I'd be happy to have them applied as-is. I wrote a few comments below, but in all cases I think I convinced myself that what you wrote is best. I had the same concerns myself, and eventually convinced myself of the same. :-) One final note on this bit of code: if there are multiple encoded words, we grab the $charset from the final encoded word, and never report the earlier charsets. Technically they do not all have to be the same (rfc2047 even has an example where they are not). I think we can dismiss this, though, as: 1. It was like this before your patches (we might have seen multiple non-adjacent encoded words; you're just handling adjacent ones), and nobody has complained. 2. Using two separate encodings in the same header is sufficiently ridiculous that I can live with us not handling it properly. Yeah, that bugs me as well. But I think handling multiple encodings would require substantial reworking of the code, so I chickened out (with the same excuses :-)). Would that be worth a terse comment in the documentation change part of the patch? Multiple (RFC2047) encodings are not supported., or would that be bike shed noise. Philip -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
Philip Oakley philipoak...@iee.org писал в своём письме Sun, 07 Dec 2014 20:48:05 +0300: From: Роман Донченко d...@corrigendum.ru Jeff King p...@peff.net писал в своём письме Sun, 07 Dec 2014 12:18:59 +0300: On Sat, Dec 06, 2014 at 10:36:23PM +0300, Роман Донченко wrote: One final note on this bit of code: if there are multiple encoded words, we grab the $charset from the final encoded word, and never report the earlier charsets. Technically they do not all have to be the same (rfc2047 even has an example where they are not). I think we can dismiss this, though, as: 1. It was like this before your patches (we might have seen multiple non-adjacent encoded words; you're just handling adjacent ones), and nobody has complained. 2. Using two separate encodings in the same header is sufficiently ridiculous that I can live with us not handling it properly. Yeah, that bugs me as well. But I think handling multiple encodings would require substantial reworking of the code, so I chickened out (with the same excuses :-)). Would that be worth a terse comment in the documentation change part of the patch? Multiple (RFC2047) encodings are not supported., or would that be bike shed noise. I didn't change any documentation... and in either case, they weren't supported in the first place, so I don't think it's anything I need to mention. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] send-email: handle adjacent RFC 2047-encoded words properly
The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). --- git-send-email.perl | 26 -- t/t9001-send-email.sh | 7 +++ 2 files changed, 23 insertions(+), 10 deletions(-) diff --git a/git-send-email.perl b/git-send-email.perl index d461ffb..7d5cc8a 100755 --- a/git-send-email.perl +++ b/git-send-email.perl @@ -919,17 +919,23 @@ $time = time - scalar $#files; sub unquote_rfc2047 { local ($_) = @_; my $charset; - s{$re_encoded_word}{ - $charset = $1; - my $encoding = $2; - my $text = $3; - if ($encoding eq 'q' || $encoding eq 'Q') { - $text =~ s/_/ /g; - $text =~ s/=([0-9A-F]{2})/chr(hex($1))/egi; - $text; - } else { - $; # other encodings not supported yet + my $sep = qr/[ \t]+/; + s{$re_encoded_word(?:$sep$re_encoded_word)*}{ + my @words = split $sep, $; + foreach (@words) { + m/$re_encoded_word/; + $charset = $1; + my $encoding = $2; + my $text = $3; + if ($encoding eq 'q' || $encoding eq 'Q') { + $_ = $text; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/egi; + } else { + # other encodings not supported yet + } } + join '', @words; }eg; return wantarray ? ($_, $charset) : $_; } diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh index 19a3ced..fa965ff 100755 --- a/t/t9001-send-email.sh +++ b/t/t9001-send-email.sh @@ -240,6 +240,13 @@ test_expect_success $PREREQ 'non-ascii self name is suppressed' 'non_ascii_self_suppressed' +# This name is long enough to force format-patch to split it into multiple +# encoded-words, assuming it uses UTF-8 with the Q encoding. +test_expect_success $PREREQ 'long non-ascii self name is suppressed' + test_suppress_self_quoted 'Ƒüñníęř €. Nâṁé' 'odd_?=m...@example.com' \ + 'long_non_ascii_self_suppressed' + + test_expect_success $PREREQ 'sanitized self name is suppressed' test_suppress_self_unquoted '\A U. Thor\' 'aut...@example.com' \ 'self_name_sanitized_suppressed' -- 2.1.1 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html