Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). I change the sender's name to an all-Cyrillic string in the tests so that its encoded form goes over the 76 characters in a line limit, forcing format-patch to split it into multiple encoded words. Since I have to modify the regular expression for an encoded word anyway, I take the opportunity to bring it closer to the spec, most notably disallowing embedded spaces and making it case-insensitive (thus allowing the encoding to be specified as both q and Q). The overall goal makes sense to me. Thanks for working on this. I have a few questions/comments, though. sub unquote_rfc2047 { local ($_) = @_; + + my $et = qr/[!-@-~]+/; # encoded-text from RFC 2047 + my $sep = qr/[ \t]+/; + my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i; The first $et in $encoded_word is actually the charset, which is defined by RFC 2047 as: charset = token; see section 3 token = 1*Any CHAR except SPACE, CTLs, and especials especials = ( / ) / / / @ / , / ; / : / / / / [ / ] / ? / . / = Your regex is a little more liberal. I doubt that it is a big deal in practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine). But if we are tightening things up in general, it may make sense to do so here (and I notice that is_rfc2047_quoted does a more thorough $token definition, and it probably makes sense for the two functions to be consistent). For your definition of encoded-text, RFC 2047 says: encoded-text = 1*Any printable ASCII character other than ? or SPACE It looks like you pulled the definition of $et from is_rfc2047_quoted, but I am not clear on where that original came from (it is from a3a8262, but that commit message does not explain the regex). Also, I note that we handle 'q'-style encodings here, but not 'b'. I wonder if it is worth adding that in while we are in the area (it is not a big deal if you always send-email git-generated patches, as we never generate it). + s{$encoded_word(?:$sep$encoded_word)+}{ If I am reading this right, it requires at least two $encoded_words. Should this + be a *? + my @words = split $sep, $; + foreach (@words) { + m/$encoded_word/; + $encoding = $1; + $_ = $2; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/eg; In the spirit of your earlier change, should this final regex be case-insensitive? RFC 2047 says only Upper case should be used for hexadecimal digits A through F. but that does not seem like a MUST to me. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
On Sun, Nov 23, 2014 at 11:27:51PM -0800, Junio C Hamano wrote: Was the change to the test to use Cyrillic really necessary, or did it suffice if you simply extended the existsing Funny Name spelled with strange accents, but you substituted the whole string anyway? Until I found out what the new string says by running web-based translation on it, I felt somewhat uneasy. As I do not read Cyrillic/Russian, we may have been adding some profanity without knowing. It turns out that the string just says Cyrillic Name, so I am not against using the new string, but it simply looked odd to replace the string whole-sale when you merely need a longer string. It made it look as if a bug was specific to Cyrillic when it wasn't. I do not mind hidden Cyrillic profanity[1], but I found the new text much harder to verify, because the shapes are very unfamiliar to my eyes. I'd prefer if we can stick to accented Roman letters. I realize this is me being totally Anglo-centric. But for Cyrillic readers, consider how much more difficult it would be to manually verify the test if it were written in an unfamiliar script (e.g., Hangul). The surrounding code is already written in Roman characters (and English), so it probably makes sense as a common denominator. -Peff [1] As long as it is only crude and not mean. :) -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
Junio C Hamano gits...@pobox.com писал в своём письме Mon, 24 Nov 2014 10:27:51 +0300: On Sun, Nov 23, 2014 at 3:50 PM, Роман Донченко d...@corrigendum.ru wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). I change the sender's name to an all-Cyrillic string in the tests so that its encoded form goes over the 76 characters in a line limit, forcing format-patch to split it into multiple encoded words. Since I have to modify the regular expression for an encoded word anyway, I take the opportunity to bring it closer to the spec, most notably disallowing embedded spaces and making it case-insensitive (thus allowing the encoding to be specified as both q and Q). Signed-off-by: Роман Донченко d...@corrigendum.ru This sounds like a worthy thing to do in general. I wonder if the C implementation we have for mailinfo needs similar update, though. I vaguely recall that we have case-insensitive start for q/b segments, but do not remember the details offhand. That's what git am uses, right? I think that already works correctly (or at least doesn't have the bug this patch fixes). I didn't do extensive testing or look at the code, though. Was the change to the test to use Cyrillic really necessary, or did it suffice if you simply extended the existsing Funny Name spelled with strange accents, but you substituted the whole string anyway? Until I found out what the new string says by running web-based translation on it, I felt somewhat uneasy. As I do not read Cyrillic/Russian, we may have been adding some profanity without knowing. It turns out that the string just says Cyrillic Name, so I am not against using the new string, but it simply looked odd to replace the string whole-sale when you merely need a longer string. It made it look as if a bug was specific to Cyrillic when it wasn't. Ah, if only I had thought of including profanity beforehand. ;-) Seriously though, I just needed to hit the 76 character limit, and switching the keyboard layout is a lot easier than copypasting Latin letters with diacritics (plus I had trouble coming up with a long enough extension of Funny Name...). I can see how that's problematic, though; I'll change it. As you may notice by reading git log --no-merges from recent history, we tend not to say I did X, I did Y. If the tone of the above message were more similar to them, it may have been easier to read. Technically, I said I do, not I did... but sure, point taken. Roman. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
Jeff King p...@peff.net писал в своём письме Mon, 24 Nov 2014 18:36:09 +0300: On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). I change the sender's name to an all-Cyrillic string in the tests so that its encoded form goes over the 76 characters in a line limit, forcing format-patch to split it into multiple encoded words. Since I have to modify the regular expression for an encoded word anyway, I take the opportunity to bring it closer to the spec, most notably disallowing embedded spaces and making it case-insensitive (thus allowing the encoding to be specified as both q and Q). The overall goal makes sense to me. Thanks for working on this. I have a few questions/comments, though. sub unquote_rfc2047 { local ($_) = @_; + + my $et = qr/[!-@-~]+/; # encoded-text from RFC 2047 + my $sep = qr/[ \t]+/; + my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i; The first $et in $encoded_word is actually the charset, which is defined by RFC 2047 as: charset = token; see section 3 token = 1*Any CHAR except SPACE, CTLs, and especials especials = ( / ) / / / @ / , / ; / : / / / / [ / ] / ? / . / = Your regex is a little more liberal. I doubt that it is a big deal in practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine). But if we are tightening things up in general, it may make sense to do so here (and I notice that is_rfc2047_quoted does a more thorough $token definition, and it probably makes sense for the two functions to be consistent). Yeah, I did realize that token is more restrictive than encoded-text, but I didn't want to stray too far from the subject line of the patch. What I'll probably do is split the patch into two, one for regex tweaking and one for multiple-word handling. And yeah, I'll try to make the two functions use the same regexes. For your definition of encoded-text, RFC 2047 says: encoded-text = 1*Any printable ASCII character other than ? or SPACE It looks like you pulled the definition of $et from is_rfc2047_quoted, but I am not clear on where that original came from (it is from a3a8262, but that commit message does not explain the regex). No, it's actually an independent discovery. :-) I don't think it needs explanation, though - it's just a character class with two ranges covering every printable character but the question mark. Also, I note that we handle 'q'-style encodings here, but not 'b'. I wonder if it is worth adding that in while we are in the area (it is not a big deal if you always send-email git-generated patches, as we never generate it). I could add b decoding, but since format-patch never generates b encodings, testing would be a problem. And I'd rather not do it without any tests. + s{$encoded_word(?:$sep$encoded_word)+}{ If I am reading this right, it requires at least two $encoded_words. Should this + be a *? I hang my head in shame. Looks like I'll have to add more tests... + my @words = split $sep, $; + foreach (@words) { + m/$encoded_word/; + $encoding = $1; + $_ = $2; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/eg; In the spirit of your earlier change, should this final regex be case-insensitive? RFC 2047 says only Upper case should be used for hexadecimal digits A through F. but that does not seem like a MUST to me. Sounds reasonable. Roman. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
On Mon, Nov 24, 2014 at 09:26:22PM +0300, Роман Донченко wrote: Yeah, I did realize that token is more restrictive than encoded-text, but I didn't want to stray too far from the subject line of the patch. What I'll probably do is split the patch into two, one for regex tweaking and one for multiple-word handling. And yeah, I'll try to make the two functions use the same regexes. Thanks, I think that sounds like a good plan. For your definition of encoded-text, RFC 2047 says: encoded-text = 1*Any printable ASCII character other than ? or SPACE It looks like you pulled the definition of $et from is_rfc2047_quoted, but I am not clear on where that original came from (it is from a3a8262, but that commit message does not explain the regex). No, it's actually an independent discovery. :-) I don't think it needs explanation, though - it's just a character class with two ranges covering every printable character but the question mark. And now it is my turn to hang my head in shame. I viewed that as a set of characters, rather than ranges. The - just blended into the mass of punctuation, and I mistook the ! for negation. I wonder if it would be more readable as: [\x21-\x3e\x40-\x7e] or something. I guess perl even has classes pre-made for printable ascii. I dunno. It may be OK as-is, too, and I just need to read more carefully. :) Also, I note that we handle 'q'-style encodings here, but not 'b'. I wonder if it is worth adding that in while we are in the area (it is not a big deal if you always send-email git-generated patches, as we never generate it). I could add b decoding, but since format-patch never generates b encodings, testing would be a problem. And I'd rather not do it without any tests. I think you could include some literal fixtures in the test suite (t5100 already does this for mailinfo). But I don't think handling 'b' is a requirement here. It's really orthogonal to your patches, and nobody has actually asked for it, so I don't mind leaving it for another day. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] send-email: handle adjacent RFC 2047-encoded words properly
The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). I change the sender's name to an all-Cyrillic string in the tests so that its encoded form goes over the 76 characters in a line limit, forcing format-patch to split it into multiple encoded words. Since I have to modify the regular expression for an encoded word anyway, I take the opportunity to bring it closer to the spec, most notably disallowing embedded spaces and making it case-insensitive (thus allowing the encoding to be specified as both q and Q). Signed-off-by: Роман Донченко d...@corrigendum.ru --- git-send-email.perl | 21 +++-- t/t9001-send-email.sh | 18 +- 2 files changed, 24 insertions(+), 15 deletions(-) diff --git a/git-send-email.perl b/git-send-email.perl index 9949db0..4bb9f6f 100755 --- a/git-send-email.perl +++ b/git-send-email.perl @@ -913,13 +913,22 @@ $time = time - scalar $#files; sub unquote_rfc2047 { local ($_) = @_; + + my $et = qr/[!-@-~]+/; # encoded-text from RFC 2047 + my $sep = qr/[ \t]+/; + my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i; + my $encoding; - s{=\?([^?]+)\?q\?(.*?)\?=}{ - $encoding = $1; - my $e = $2; - $e =~ s/_/ /g; - $e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg; - $e; + s{$encoded_word(?:$sep$encoded_word)+}{ + my @words = split $sep, $; + foreach (@words) { + m/$encoded_word/; + $encoding = $1; + $_ = $2; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/eg; + } + join '', @words; }eg; return wantarray ? ($_, $encoding) : $_; } diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh index 19a3ced..318b870 100755 --- a/t/t9001-send-email.sh +++ b/t/t9001-send-email.sh @@ -236,7 +236,7 @@ test_expect_success $PREREQ 'self name with dot is suppressed' test_expect_success $PREREQ 'non-ascii self name is suppressed' - test_suppress_self_quoted 'Füñný Nâmé' 'odd_?=m...@example.com' \ + test_suppress_self_quoted 'Кириллическое Имя' 'odd_?=m...@example.com' \ 'non_ascii_self_suppressed' @@ -946,25 +946,25 @@ test_expect_success $PREREQ 'utf8 author is correctly passed on' ' clean_fake_sendmail test_commit weird_author test_when_finished git reset --hard HEAD^ - git commit --amend --author Füñný Nâmé odd_?=m...@example.com - git format-patch --stdout -1 funny_name.patch + git commit --amend --author Кириллическое Имя odd_?=m...@example.com + git format-patch --stdout -1 nonascii_name.patch git send-email --from=Example nob...@example.com \ --to=nob...@example.com \ --smtp-server=$(pwd)/fake.sendmail \ - funny_name.patch - grep ^From: Füñný Nâmé odd_?=m...@example.com msgtxt1 + nonascii_name.patch + grep ^From: Кириллическое Имя odd_?=m...@example.com msgtxt1 ' test_expect_success $PREREQ 'utf8 sender is not duplicated' ' clean_fake_sendmail test_commit weird_sender test_when_finished git reset --hard HEAD^ - git commit --amend --author Füñný Nâmé odd_?=m...@example.com - git format-patch --stdout -1 funny_name.patch - git send-email --from=Füñný Nâmé odd_?=m...@example.com \ + git commit --amend --author Кириллическое Имя odd_?=m...@example.com + git format-patch --stdout -1 nonascii_name.patch + git send-email --from=Кириллическое Имя odd_?=m...@example.com \ --to=nob...@example.com \ --smtp-server=$(pwd)/fake.sendmail \ - funny_name.patch + nonascii_name.patch grep ^From: msgtxt1 msgfrom test_line_count = 1 msgfrom ' -- 2.1.1 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly
On Sun, Nov 23, 2014 at 3:50 PM, Роман Донченко d...@corrigendum.ru wrote: The RFC says that they are to be concatenated after decoding (i.e. the intervening whitespace is ignored). I change the sender's name to an all-Cyrillic string in the tests so that its encoded form goes over the 76 characters in a line limit, forcing format-patch to split it into multiple encoded words. Since I have to modify the regular expression for an encoded word anyway, I take the opportunity to bring it closer to the spec, most notably disallowing embedded spaces and making it case-insensitive (thus allowing the encoding to be specified as both q and Q). Signed-off-by: Роман Донченко d...@corrigendum.ru This sounds like a worthy thing to do in general. I wonder if the C implementation we have for mailinfo needs similar update, though. I vaguely recall that we have case-insensitive start for q/b segments, but do not remember the details offhand. Was the change to the test to use Cyrillic really necessary, or did it suffice if you simply extended the existsing Funny Name spelled with strange accents, but you substituted the whole string anyway? Until I found out what the new string says by running web-based translation on it, I felt somewhat uneasy. As I do not read Cyrillic/Russian, we may have been adding some profanity without knowing. It turns out that the string just says Cyrillic Name, so I am not against using the new string, but it simply looked odd to replace the string whole-sale when you merely need a longer string. It made it look as if a bug was specific to Cyrillic when it wasn't. As you may notice by reading git log --no-merges from recent history, we tend not to say I did X, I did Y. If the tone of the above message were more similar to them, it may have been easier to read. But other than these minor nits, the change looks good from a cursory read. Thanks. --- git-send-email.perl | 21 +++-- t/t9001-send-email.sh | 18 +- 2 files changed, 24 insertions(+), 15 deletions(-) diff --git a/git-send-email.perl b/git-send-email.perl index 9949db0..4bb9f6f 100755 --- a/git-send-email.perl +++ b/git-send-email.perl @@ -913,13 +913,22 @@ $time = time - scalar $#files; sub unquote_rfc2047 { local ($_) = @_; + + my $et = qr/[!-@-~]+/; # encoded-text from RFC 2047 + my $sep = qr/[ \t]+/; + my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i; + my $encoding; - s{=\?([^?]+)\?q\?(.*?)\?=}{ - $encoding = $1; - my $e = $2; - $e =~ s/_/ /g; - $e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg; - $e; + s{$encoded_word(?:$sep$encoded_word)+}{ + my @words = split $sep, $; + foreach (@words) { + m/$encoded_word/; + $encoding = $1; + $_ = $2; + s/_/ /g; + s/=([0-9A-F]{2})/chr(hex($1))/eg; + } + join '', @words; }eg; return wantarray ? ($_, $encoding) : $_; } diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh index 19a3ced..318b870 100755 --- a/t/t9001-send-email.sh +++ b/t/t9001-send-email.sh @@ -236,7 +236,7 @@ test_expect_success $PREREQ 'self name with dot is suppressed' test_expect_success $PREREQ 'non-ascii self name is suppressed' - test_suppress_self_quoted 'Füñný Nâmé' 'odd_?=m...@example.com' \ + test_suppress_self_quoted 'Кириллическое Имя' 'odd_?=m...@example.com' \ 'non_ascii_self_suppressed' @@ -946,25 +946,25 @@ test_expect_success $PREREQ 'utf8 author is correctly passed on' ' clean_fake_sendmail test_commit weird_author test_when_finished git reset --hard HEAD^ - git commit --amend --author Füñný Nâmé odd_?=m...@example.com - git format-patch --stdout -1 funny_name.patch + git commit --amend --author Кириллическое Имя odd_?=m...@example.com + git format-patch --stdout -1 nonascii_name.patch git send-email --from=Example nob...@example.com \ --to=nob...@example.com \ --smtp-server=$(pwd)/fake.sendmail \ - funny_name.patch - grep ^From: Füñný Nâmé odd_?=m...@example.com msgtxt1 + nonascii_name.patch + grep ^From: Кириллическое Имя odd_?=m...@example.com msgtxt1 ' test_expect_success $PREREQ 'utf8 sender is not duplicated' ' clean_fake_sendmail test_commit weird_sender test_when_finished git reset --hard HEAD^ - git commit --amend --author Füñný Nâmé odd_?=m...@example.com - git format-patch --stdout -1 funny_name.patch - git send-email --from=Füñný Nâmé odd_?=m...@example.com \ + git commit --amend --author Кириллическое Имя odd_?=m...@example.com + git format-patch --stdout -1