Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-24 Thread Jeff King
On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote:

> The RFC says that they are to be concatenated after decoding (i.e. the
> intervening whitespace is ignored).
>
> I change the sender's name to an all-Cyrillic string in the tests so that
> its encoded form goes over the 76 characters in a line limit, forcing
> format-patch to split it into multiple encoded words.
>
> Since I have to modify the regular expression for an encoded word anyway,
> I take the opportunity to bring it closer to the spec, most notably
> disallowing embedded spaces and making it case-insensitive (thus allowing
> the encoding to be specified as both q and Q).

The overall goal makes sense to me. Thanks for working on this. I have a
few questions/comments, though.

>  sub unquote_rfc2047 {
>  	local ($_) = @_;
> +
> +	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
> +	my $sep = qr/[ \t]+/;
> +	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;

The first $et in $encoded_word is actually the charset, which is defined
by RFC 2047 as:

 charset = token    ; see section 3

 token = 1*<Any CHAR except SPACE, CTLs, and especials>

 especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
             <"> / "/" / "[" / "]" / "?" / "." / "="

Your regex is a little more liberal. I doubt that it is a big deal in
practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine).
But if we are tightening things up in general, it may make sense to do
so here (and I notice that is_rfc2047_quoted does a more thorough $token
definition, and it probably makes sense for the two functions to be
consistent).

For your definition of encoded-text, RFC 2047 says:

 encoded-text = 1*<Any printable ASCII character other than "?"
                   or SPACE>

It looks like you pulled the definition of $et from is_rfc2047_quoted,
but I am not clear on where that original came from (it is from a3a8262,
but that commit message does not explain the regex).

Also, I note that we handle 'q'-style encodings here, but not 'b'. I
wonder if it is worth adding that in while we are in the area (it is not
a big deal if you always send-email git-generated patches, as we never
generate it).

> +	s{$encoded_word(?:$sep$encoded_word)+}{

If I am reading this right, it requires at least two $encoded_words.
Should this + be a *?
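
To make the point concrete, here is a minimal sketch in Python rather than the patch's Perl. The names ET, SEP, WORD, run_plus and run_star are hypothetical stand-ins for $et, $sep and $encoded_word, and the two sample headers are made up for illustration. It only shows how the quantifier changes what the outer pattern accepts.

```python
import re

# Hypothetical Python equivalents of the patch's $et, $sep and $encoded_word.
ET = r"[!->@-~]+"          # printable ASCII except "?" (and SPACE)
SEP = r"[ \t]+"
WORD = rf"=\?{ET}\?q\?{ET}\?="

# As posted: one encoded word followed by ONE OR MORE further words.
run_plus = re.compile(rf"{WORD}(?:{SEP}{WORD})+", re.IGNORECASE)
# With the suggested change: followed by ZERO or more further words.
run_star = re.compile(rf"{WORD}(?:{SEP}{WORD})*", re.IGNORECASE)

# Made-up sample values: a lone encoded word and two adjacent ones.
single = "=?UTF-8?q?F=C3=BC?="
double = "=?UTF-8?q?F=C3=BC?= =?UTF-8?q?n?="

for label, text in (("single", single), ("double", double)):
    print(label, "plus:", bool(run_plus.search(text)),
          "star:", bool(run_star.search(text)))
```

With the + quantifier a header holding a single encoded word is never matched, so it would be passed through undecoded.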

> +		my @words = split $sep, $&;
> +		foreach (@words) {
> +			m/$encoded_word/;
> +			$encoding = $1;
> +			$_ = $2;
> +			s/_/ /g;
> +			s/=([0-9A-F]{2})/chr(hex($1))/eg;

In the spirit of your earlier change, should this final regex be
case-insensitive? RFC 2047 says only "Upper case should be used for
hexadecimal digits "A" through "F"." but that does not seem like a MUST
to me.
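
As an illustration of what that would change, here is a small Python sketch (not the Perl from the patch). The helper unquote_q and its ignore_case flag are hypothetical names; it decodes the payload of a single "q" word and shows that lower-case hex digits are left untouched unless the match is case-insensitive.

```python
import re

def unquote_q(payload, ignore_case=True):
    """Decode the payload of one 'q' encoded word into raw bytes.

    Hypothetical helper: ignore_case=False mimics the posted regex,
    which only accepts upper-case hex digits.
    """
    text = payload.replace("_", " ")      # "_" stands for SPACE in 'q'
    digits = "0-9A-Fa-f" if ignore_case else "0-9A-F"
    pattern = re.compile("=([" + digits + "]{2})")
    decoded = pattern.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Every code point is now below 256, so latin-1 maps it 1:1 to a byte.
    return decoded.encode("latin-1")

print(unquote_q("F=c3=bc"))                     # lower-case hex decoded
print(unquote_q("F=c3=bc", ignore_case=False))  # left as literal text
```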

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-24 Thread Jeff King
On Sun, Nov 23, 2014 at 11:27:51PM -0800, Junio C Hamano wrote:

> Was the change to the test to use Cyrillic really necessary, or did it
> suffice if you simply extended the existing "Funny Name" spelled with
> strange accents, but you substituted the whole string anyway?
>
> Until I found out what the new string says by running web-based
> translation on it, I felt somewhat uneasy. As I do not read
> Cyrillic/Russian, we may have been adding some profanity without
> knowing. It turns out that the string just says "Cyrillic Name", so I am
> not against using the new string, but it simply looked odd to replace the
> string whole-sale when you merely need a longer string. It made it look
> as if a bug was specific to Cyrillic when it wasn't.

I do not mind hidden Cyrillic profanity[1], but I found the new text
much harder to verify, because the shapes are very unfamiliar to my
eyes. I'd prefer if we can stick to accented Roman letters.  I realize
this is me being totally Anglo-centric. But for Cyrillic readers,
consider how much more difficult it would be to manually verify the test
if it were written in an unfamiliar script (e.g., Hangul).  The
surrounding code is already written in Roman characters (and English),
so it probably makes sense as a common denominator.

-Peff

[1] As long as it is only crude and not mean. :)


Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-24 Thread Роман Донченко
Junio C Hamano gits...@pobox.com wrote in his message of Mon, 24 Nov 2014
10:27:51 +0300:


> On Sun, Nov 23, 2014 at 3:50 PM, Роман Донченко d...@corrigendum.ru wrote:
>> The RFC says that they are to be concatenated after decoding (i.e. the
>> intervening whitespace is ignored).
>>
>> I change the sender's name to an all-Cyrillic string in the tests so that
>> its encoded form goes over the 76 characters in a line limit, forcing
>> format-patch to split it into multiple encoded words.
>>
>> Since I have to modify the regular expression for an encoded word anyway,
>> I take the opportunity to bring it closer to the spec, most notably
>> disallowing embedded spaces and making it case-insensitive (thus allowing
>> the encoding to be specified as both q and Q).
>>
>> Signed-off-by: Роман Донченко d...@corrigendum.ru
>
> This sounds like a worthy thing to do in general.
>
> I wonder if the C implementation we have for mailinfo needs similar
> update, though. I vaguely recall that we have case-insensitive start for
> q/b segments, but do not remember the details offhand.


That's what git am uses, right? I think that already works correctly (or  
at least doesn't have the bug this patch fixes). I didn't do extensive  
testing or look at the code, though.




> Was the change to the test to use Cyrillic really necessary, or did it
> suffice if you simply extended the existing "Funny Name" spelled with
> strange accents, but you substituted the whole string anyway?
>
> Until I found out what the new string says by running web-based
> translation on it, I felt somewhat uneasy. As I do not read
> Cyrillic/Russian, we may have been adding some profanity without
> knowing. It turns out that the string just says "Cyrillic Name", so I am
> not against using the new string, but it simply looked odd to replace the
> string whole-sale when you merely need a longer string. It made it look
> as if a bug was specific to Cyrillic when it wasn't.


Ah, if only I had thought of including profanity beforehand. ;-)

Seriously though, I just needed to hit the 76 character limit, and  
switching the keyboard layout is a lot easier than copypasting Latin  
letters with diacritics (plus I had trouble coming up with a long enough  
extension of Funny Name...). I can see how that's problematic, though;  
I'll change it.



> As you may notice by reading "git log --no-merges" from recent history,
> we tend not to say "I did X, I did Y". If the tone of the above message
> were more similar to them, it may have been easier to read.


Technically, I said "I do", not "I did"... but sure, point taken.

Roman.


Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-24 Thread Роман Донченко
Jeff King p...@peff.net wrote in his message of Mon, 24 Nov 2014 18:36:09
+0300:



> On Mon, Nov 24, 2014 at 02:50:04AM +0300, Роман Донченко wrote:
>
>> The RFC says that they are to be concatenated after decoding (i.e. the
>> intervening whitespace is ignored).
>>
>> I change the sender's name to an all-Cyrillic string in the tests so that
>> its encoded form goes over the 76 characters in a line limit, forcing
>> format-patch to split it into multiple encoded words.
>>
>> Since I have to modify the regular expression for an encoded word anyway,
>> I take the opportunity to bring it closer to the spec, most notably
>> disallowing embedded spaces and making it case-insensitive (thus allowing
>> the encoding to be specified as both q and Q).
>
> The overall goal makes sense to me. Thanks for working on this. I have a
> few questions/comments, though.


>>  sub unquote_rfc2047 {
>>  	local ($_) = @_;
>> +
>> +	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
>> +	my $sep = qr/[ \t]+/;
>> +	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;
>
> The first $et in $encoded_word is actually the charset, which is defined
> by RFC 2047 as:
>
>  charset = token    ; see section 3
>
>  token = 1*<Any CHAR except SPACE, CTLs, and especials>
>
>  especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
>              <"> / "/" / "[" / "]" / "?" / "." / "="
>
> Your regex is a little more liberal. I doubt that it is a big deal in
> practice (actually, in practice, I suspect [a-zA-Z0-9-] would be fine).
> But if we are tightening things up in general, it may make sense to do
> so here (and I notice that is_rfc2047_quoted does a more thorough $token
> definition, and it probably makes sense for the two functions to be
> consistent).


Yeah, I did realize that token is more restrictive than encoded-text, but  
I didn't want to stray too far from the subject line of the patch. What  
I'll probably do is split the patch into two, one for regex tweaking and  
one for multiple-word handling. And yeah, I'll try to make the two  
functions use the same regexes.




> For your definition of encoded-text, RFC 2047 says:
>
>  encoded-text = 1*<Any printable ASCII character other than "?"
>                    or SPACE>
>
> It looks like you pulled the definition of $et from is_rfc2047_quoted,
> but I am not clear on where that original came from (it is from a3a8262,
> but that commit message does not explain the regex).


No, it's actually an independent discovery. :-) I don't think it needs  
explanation, though - it's just a character class with two ranges covering  
every printable character but the question mark.



> Also, I note that we handle 'q'-style encodings here, but not 'b'. I
> wonder if it is worth adding that in while we are in the area (it is not
> a big deal if you always send-email git-generated patches, as we never
> generate it).


I could add b decoding, but since format-patch never generates b  
encodings, testing would be a problem. And I'd rather not do it without  
any tests.





>> +	s{$encoded_word(?:$sep$encoded_word)+}{
>
> If I am reading this right, it requires at least two $encoded_words.
> Should this + be a *?


I hang my head in shame. Looks like I'll have to add more tests...




>> +		my @words = split $sep, $&;
>> +		foreach (@words) {
>> +			m/$encoded_word/;
>> +			$encoding = $1;
>> +			$_ = $2;
>> +			s/_/ /g;
>> +			s/=([0-9A-F]{2})/chr(hex($1))/eg;
>
> In the spirit of your earlier change, should this final regex be
> case-insensitive? RFC 2047 says only "Upper case should be used for
> hexadecimal digits "A" through "F"." but that does not seem like a MUST
> to me.


Sounds reasonable.

Roman.


Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-24 Thread Jeff King
On Mon, Nov 24, 2014 at 09:26:22PM +0300, Роман Донченко wrote:

> Yeah, I did realize that token is more restrictive than encoded-text, but I
> didn't want to stray too far from the subject line of the patch. What I'll
> probably do is split the patch into two, one for regex tweaking and one for
> multiple-word handling. And yeah, I'll try to make the two functions use the
> same regexes.

Thanks, I think that sounds like a good plan.

>> For your definition of encoded-text, RFC 2047 says:
>>
>>  encoded-text = 1*<Any printable ASCII character other than "?"
>>                    or SPACE>
>>
>> It looks like you pulled the definition of $et from is_rfc2047_quoted,
>> but I am not clear on where that original came from (it is from a3a8262,
>> but that commit message does not explain the regex).
>
> No, it's actually an independent discovery. :-) I don't think it needs
> explanation, though - it's just a character class with two ranges covering
> every printable character but the question mark.

And now it is my turn to hang my head in shame. I viewed that as a set
of characters, rather than ranges. The - just blended into the mass of
punctuation, and I mistook the ! for negation.

I wonder if it would be more readable as:

  [\x21-\x3e\x40-\x7e]

or something. I guess perl even has classes pre-made for printable
ascii. I dunno. It may be OK as-is, too, and I just need to read more
carefully. :)
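
The equivalence of the two spellings can be checked mechanically. The following is a quick Python sketch (the thread's code is Perl; the variable names here are hypothetical) that runs both character classes over all of 7-bit ASCII and compares the accepted sets with "printable ASCII other than ? or SPACE".

```python
import re

# The class as written in the patch, and the hex-escaped spelling.
punct_class = re.compile(r"[!->@-~]")
hex_class = re.compile(r"[\x21-\x3e\x40-\x7e]")

ascii_chars = [chr(code) for code in range(128)]
accepted_punct = {c for c in ascii_chars if punct_class.fullmatch(c)}
accepted_hex = {c for c in ascii_chars if hex_class.fullmatch(c)}

# RFC 2047 encoded-text: printable ASCII (0x21..0x7e) without "?".
expected = {chr(code) for code in range(0x21, 0x7F)} - {"?"}

print(accepted_punct == accepted_hex == expected)
```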

>> Also, I note that we handle 'q'-style encodings here, but not 'b'. I
>> wonder if it is worth adding that in while we are in the area (it is not
>> a big deal if you always send-email git-generated patches, as we never
>> generate it).
>
> I could add b decoding, but since format-patch never generates b
> encodings, testing would be a problem. And I'd rather not do it without any
> tests.

I think you could include some literal fixtures in the test suite (t5100
already does this for mailinfo). But I don't think handling 'b' is a
requirement here. It's really orthogonal to your patches, and nobody has
actually asked for it, so I don't mind leaving it for another day.
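
For reference, 'b' is plain base64 over the same charset-tagged payload, so a decoder that handles both encodings is small. The sketch below is Python, not a proposal for git-send-email.perl; ENCODED_WORD and decode_word are hypothetical names and the sample words are made up.

```python
import base64
import re

# Simplified encoded-word pattern (assumption: looser than the RFC grammar).
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bq])\?([^?\s]+)\?=", re.IGNORECASE)

def decode_word(word):
    """Decode one RFC 2047 encoded word; return other input unchanged."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        return word
    charset, encoding, payload = match.groups()
    if encoding.lower() == "b":
        raw = base64.b64decode(payload)
    else:
        # 'q': "_" means SPACE, "=XX" is a hex-encoded byte.
        raw = re.sub(rb"=([0-9A-Fa-f]{2})",
                     lambda h: bytes([int(h.group(1), 16)]),
                     payload.replace("_", " ").encode("ascii"))
    return raw.decode(charset)

print(decode_word("=?UTF-8?b?RsO8?="))     # base64 form
print(decode_word("=?UTF-8?q?F=C3=BC?="))  # quoted-printable-like form
```

Literal fixtures like the two sample words above are the kind of input a test could feed in without relying on format-patch to generate 'b' encodings.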

-Peff


[PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-23 Thread Роман Донченко
The RFC says that they are to be concatenated after decoding (i.e. the
intervening whitespace is ignored).

I change the sender's name to an all-Cyrillic string in the tests so that
its encoded form goes over the 76 characters in a line limit, forcing
format-patch to split it into multiple encoded words.

Since I have to modify the regular expression for an encoded word anyway,
I take the opportunity to bring it closer to the spec, most notably
disallowing embedded spaces and making it case-insensitive (thus allowing
the encoding to be specified as both q and Q).

Signed-off-by: Роман Донченко d...@corrigendum.ru
---
 git-send-email.perl   | 21 +++--
 t/t9001-send-email.sh | 18 +-
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/git-send-email.perl b/git-send-email.perl
index 9949db0..4bb9f6f 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -913,13 +913,22 @@ $time = time - scalar $#files;
 
 sub unquote_rfc2047 {
 	local ($_) = @_;
+
+	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
+	my $sep = qr/[ \t]+/;
+	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;
+
 	my $encoding;
-	s{=\?([^?]+)\?q\?(.*?)\?=}{
-		$encoding = $1;
-		my $e = $2;
-		$e =~ s/_/ /g;
-		$e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
-		$e;
+	s{$encoded_word(?:$sep$encoded_word)+}{
+		my @words = split $sep, $&;
+		foreach (@words) {
+			m/$encoded_word/;
+			$encoding = $1;
+			$_ = $2;
+			s/_/ /g;
+			s/=([0-9A-F]{2})/chr(hex($1))/eg;
+		}
+		join '', @words;
 	}eg;
 	return wantarray ? ($_, $encoding) : $_;
 }
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 19a3ced..318b870 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -236,7 +236,7 @@ test_expect_success $PREREQ 'self name with dot is suppressed' "
 "
 
 test_expect_success $PREREQ 'non-ascii self name is suppressed' "
-	test_suppress_self_quoted 'Füñný Nâmé' 'odd_?=m...@example.com' \
+	test_suppress_self_quoted 'Кириллическое Имя' 'odd_?=m...@example.com' \
 		'non_ascii_self_suppressed'
 "
 
@@ -946,25 +946,25 @@ test_expect_success $PREREQ 'utf8 author is correctly passed on' '
 	clean_fake_sendmail &&
 	test_commit weird_author &&
 	test_when_finished "git reset --hard HEAD^" &&
-	git commit --amend --author "Füñný Nâmé <odd_?=m...@example.com>" &&
-	git format-patch --stdout -1 >funny_name.patch &&
+	git commit --amend --author "Кириллическое Имя <odd_?=m...@example.com>" &&
+	git format-patch --stdout -1 >nonascii_name.patch &&
 	git send-email --from="Example <nob...@example.com>" \
 	  --to=nob...@example.com \
 	  --smtp-server="$(pwd)/fake.sendmail" \
-	  funny_name.patch &&
-	grep "^From: Füñný Nâmé <odd_?=m...@example.com>" msgtxt1
+	  nonascii_name.patch &&
+	grep "^From: Кириллическое Имя <odd_?=m...@example.com>" msgtxt1
 '
 
 test_expect_success $PREREQ 'utf8 sender is not duplicated' '
 	clean_fake_sendmail &&
 	test_commit weird_sender &&
 	test_when_finished "git reset --hard HEAD^" &&
-	git commit --amend --author "Füñný Nâmé <odd_?=m...@example.com>" &&
-	git format-patch --stdout -1 >funny_name.patch &&
-	git send-email --from="Füñný Nâmé <odd_?=m...@example.com>" \
+	git commit --amend --author "Кириллическое Имя <odd_?=m...@example.com>" &&
+	git format-patch --stdout -1 >nonascii_name.patch &&
+	git send-email --from="Кириллическое Имя <odd_?=m...@example.com>" \
 	  --to=nob...@example.com \
 	  --smtp-server="$(pwd)/fake.sendmail" \
-	  funny_name.patch &&
+	  nonascii_name.patch &&
 	grep "^From: " msgtxt1 >msgfrom &&
 	test_line_count = 1 msgfrom
 '
-- 
2.1.1
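
The behaviour described in the commit message (adjacent encoded words are joined after decoding, and the whitespace between them is dropped) can be sketched as follows. This is Python, not the Perl of the patch. The function name mirrors unquote_rfc2047 for readability but the implementation is hypothetical: it uses the * quantifier so that a lone word is decoded too, accepts lower-case hex, and decodes each word with its own charset. The sample header and address are made up.

```python
import re

ET = r"[!->@-~]+"     # printable ASCII except "?" and SPACE
SEP = r"[ \t]+"
WORD = re.compile(rf"=\?({ET})\?q\?({ET})\?=", re.IGNORECASE)
# One encoded word followed by zero or more whitespace-separated words.
RUN = re.compile(rf"{WORD.pattern}(?:{SEP}{WORD.pattern})*", re.IGNORECASE)

def unquote_rfc2047(header):
    """Return (decoded header, last charset seen or None)."""
    charset = None

    def decode_run(run):
        nonlocal charset
        pieces = []
        for word in WORD.finditer(run.group(0)):
            charset = word.group(1)
            payload = word.group(2).replace("_", " ").encode("ascii")
            raw = re.sub(rb"=([0-9A-Fa-f]{2})",
                         lambda h: bytes([int(h.group(1), 16)]), payload)
            pieces.append(raw.decode(charset))
        # Whitespace between adjacent encoded words is not kept.
        return "".join(pieces)

    return RUN.sub(decode_run, header), charset

header = ("From: =?UTF-8?q?F=C3=BC=C3=B1?= "
          "=?UTF-8?q?n=C3=BD_N=C3=A2m=C3=A9?= <a@example.com>")
decoded, seen_charset = unquote_rfc2047(header)
print(decoded)
```

Whitespace between an encoded word and ordinary text, such as the space before the address here, is left alone; only the gap between two encoded words disappears.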



Re: [PATCH] send-email: handle adjacent RFC 2047-encoded words properly

2014-11-23 Thread Junio C Hamano
On Sun, Nov 23, 2014 at 3:50 PM, Роман Донченко d...@corrigendum.ru wrote:
> The RFC says that they are to be concatenated after decoding (i.e. the
> intervening whitespace is ignored).
>
> I change the sender's name to an all-Cyrillic string in the tests so that
> its encoded form goes over the 76 characters in a line limit, forcing
> format-patch to split it into multiple encoded words.
>
> Since I have to modify the regular expression for an encoded word anyway,
> I take the opportunity to bring it closer to the spec, most notably
> disallowing embedded spaces and making it case-insensitive (thus allowing
> the encoding to be specified as both q and Q).
>
> Signed-off-by: Роман Донченко d...@corrigendum.ru

This sounds like a worthy thing to do in general.

I wonder if the C implementation we have for mailinfo needs similar
update, though. I vaguely recall that we have case-insensitive start for
q/b segments, but do not remember the details offhand.

Was the change to the test to use Cyrillic really necessary, or did it
suffice if you simply extended the existing "Funny Name" spelled with
strange accents, but you substituted the whole string anyway?

Until I found out what the new string says by running web-based
translation on it, I felt somewhat uneasy. As I do not read
Cyrillic/Russian, we may have been adding some profanity without
knowing. It turns out that the string just says Cyrillic Name, so I am
not against using the new string, but it simply looked odd to replace the
string whole-sale when you merely need a longer string. It made it look
as if a bug was specific to Cyrillic when it wasn't.

As you may notice by reading "git log --no-merges" from recent history,
we tend not to say "I did X, I did Y". If the tone of the above message
were more similar to them, it may have been easier to read.

But other than these minor nits, the change looks good from
a cursory read.

Thanks.

> ---
>  git-send-email.perl   | 21 +++--
>  t/t9001-send-email.sh | 18 +-
>  2 files changed, 24 insertions(+), 15 deletions(-)
>
> diff --git a/git-send-email.perl b/git-send-email.perl
> index 9949db0..4bb9f6f 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -913,13 +913,22 @@ $time = time - scalar $#files;
>
>  sub unquote_rfc2047 {
>  	local ($_) = @_;
> +
> +	my $et = qr/[!->@-~]+/; # encoded-text from RFC 2047
> +	my $sep = qr/[ \t]+/;
> +	my $encoded_word = qr/=\?($et)\?q\?($et)\?=/i;
> +
>  	my $encoding;
> -	s{=\?([^?]+)\?q\?(.*?)\?=}{
> -		$encoding = $1;
> -		my $e = $2;
> -		$e =~ s/_/ /g;
> -		$e =~ s/=([0-9A-F]{2})/chr(hex($1))/eg;
> -		$e;
> +	s{$encoded_word(?:$sep$encoded_word)+}{
> +		my @words = split $sep, $&;
> +		foreach (@words) {
> +			m/$encoded_word/;
> +			$encoding = $1;
> +			$_ = $2;
> +			s/_/ /g;
> +			s/=([0-9A-F]{2})/chr(hex($1))/eg;
> +		}
> +		join '', @words;
>  	}eg;
>  	return wantarray ? ($_, $encoding) : $_;
>  }
> diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
> index 19a3ced..318b870 100755
> --- a/t/t9001-send-email.sh
> +++ b/t/t9001-send-email.sh
> @@ -236,7 +236,7 @@ test_expect_success $PREREQ 'self name with dot is suppressed' "
>  "
>
>  test_expect_success $PREREQ 'non-ascii self name is suppressed' "
> -	test_suppress_self_quoted 'Füñný Nâmé' 'odd_?=m...@example.com' \
> +	test_suppress_self_quoted 'Кириллическое Имя' 'odd_?=m...@example.com' \
>  		'non_ascii_self_suppressed'
>  "
>
> @@ -946,25 +946,25 @@ test_expect_success $PREREQ 'utf8 author is correctly passed on' '
>  	clean_fake_sendmail &&
>  	test_commit weird_author &&
>  	test_when_finished "git reset --hard HEAD^" &&
> -	git commit --amend --author "Füñný Nâmé <odd_?=m...@example.com>" &&
> -	git format-patch --stdout -1 >funny_name.patch &&
> +	git commit --amend --author "Кириллическое Имя <odd_?=m...@example.com>" &&
> +	git format-patch --stdout -1 >nonascii_name.patch &&
>  	git send-email --from="Example <nob...@example.com>" \
>  	  --to=nob...@example.com \
>  	  --smtp-server="$(pwd)/fake.sendmail" \
> -	  funny_name.patch &&
> -	grep "^From: Füñný Nâmé <odd_?=m...@example.com>" msgtxt1
> +	  nonascii_name.patch &&
> +	grep "^From: Кириллическое Имя <odd_?=m...@example.com>" msgtxt1
>  '
>
>  test_expect_success $PREREQ 'utf8 sender is not duplicated' '
>  	clean_fake_sendmail &&
>  	test_commit weird_sender &&
>  	test_when_finished "git reset --hard HEAD^" &&
> -	git commit --amend --author "Füñný Nâmé <odd_?=m...@example.com>" &&
> -	git format-patch --stdout -1 >funny_name.patch &&
> -	git send-email --from="Füñný Nâmé <odd_?=m...@example.com>" \
> +	git commit --amend --author "Кириллическое Имя <odd_?=m...@example.com>" &&
> +	git format-patch --stdout -1