Re: [PATCH 2/5] format-patch: do not wrap rfc2047 encoded headers too late

2012-10-10 Thread Jan H. Schönherr
Am 09.10.2012 21:30, schrieb Junio C Hamano:
 Jan H. Schönherr schn...@cs.tu-berlin.de writes:
...
  static int is_rfc2047_special(char ch)
  {
 +/*
 + * We encode ' ' using '=20' even though rfc2047
 + * allows using '_' for readability.  Unfortunately,
 + * many programs do not understand this and just
 + * leave the underscore in place.
 + */
 
 The sentence break made me read the above three times to understand
 what it is trying to say.  Unfortunately refers to what happens if
 we were to use '_', but it initially appeared to be describing some
 bug due to our encoding ' ' as '=20'.  Perhaps like this?
 
   /*
* rfc2047 allows '_' to encode ' ' for readability, but
* many programs do not understand ...; encode ' ' using
* '=20' instead to avoid the problem.
*/

I was just moving that comment (and the following check) around,
but I'll update the comment in the next version.

 +if (ch == ' ' || ch == '\n')
 +return 1;
 
 The comment justifies why this if (ch == ' '), which could be part
 of the return below, separately is done, but nothing explains why
 you add '\n' (and not other controls, e.g. '\t') to the mix.

The check for '\n' was introduced in commit c22e7de3
(format-patch: rfc2047-encode newlines in headers).

The commit log was:

These should generally never happen, as we already
concatenate multiples in subjects into a single line. But
let's be defensive, since not encoding them means we will
output malformed headers.

Having again a look at RFC 2047, I see that we should be
even more strict and not allow any non-printable character to
be passed through unencoded. I guess that adds another patch to
the series. Hmm... Maybe I can split patch 4 into two patches,
one that mostly fixes is_rfc2047_special() and one that
avoids 822 quoting when doing 2047 encoding.

 
  return (non_ascii(ch) || (ch == '=') || (ch == '?') || (ch == '_'));
  }
  
  static void add_rfc2047(struct strbuf *sb, const char *line, int len,
 const char *encoding)
  {
 -static const int max_length = 78; /* per rfc2822 */
 +static const int max_length = 76; /* per rfc2047 */
  int i;
  int line_len;
  
 @@ -286,7 +295,7 @@ static void add_rfc2047(struct strbuf *sb, const char 
 *line, int len,
  if ((i + 1  len)  (ch == '='  line[i+1] == '?'))
  goto needquote;
  }
 -strbuf_add_wrapped_bytes(sb, line, len, -line_len, 1, max_length+1);
 +strbuf_add_wrapped_bytes(sb, line, len, -line_len, 1, 78+1);
  return;
 
 Yuck.  If you do want to retain 78 for non-quoted output for
 backward compatibility, that is OK, but if that is the case, please
 introduce a new constant max_quoted_length or something to stand
 for 76 and use it in the needquote: part below.

Will do.

Regards
Jan

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] format-patch: do not wrap rfc2047 encoded headers too late

2012-10-08 Thread Jan H . Schönherr
From: Jan H. Schönherr schn...@cs.tu-berlin.de

Encoded characters add more than one character at once to an encoded
header. Include all characters that are about to be added in the length
calculation for wrapping.

Additionally, RFC 2047 imposes a maximum line length of 76 characters
if that line contains an rfc2047 encoded word.

Signed-off-by: Jan H. Schönherr schn...@cs.tu-berlin.de
---
 pretty.c| 24 +++-
 t/t4014-format-patch.sh | 58 +
 2 Dateien geändert, 49 Zeilen hinzugefügt(+), 33 Zeilen entfernt(-)

diff --git a/pretty.c b/pretty.c
index f5caecb..daf8581 100644
--- a/pretty.c
+++ b/pretty.c
@@ -263,13 +263,22 @@ static void add_rfc822_quoted(struct strbuf *out, const 
char *s, int len)
 
 static int is_rfc2047_special(char ch)
 {
+   /*
+* We encode ' ' using '=20' even though rfc2047
+* allows using '_' for readability.  Unfortunately,
+* many programs do not understand this and just
+* leave the underscore in place.
+*/
+   if (ch == ' ' || ch == '\n')
+   return 1;
+
return (non_ascii(ch) || (ch == '=') || (ch == '?') || (ch == '_'));
 }
 
 static void add_rfc2047(struct strbuf *sb, const char *line, int len,
   const char *encoding)
 {
-   static const int max_length = 78; /* per rfc2822 */
+   static const int max_length = 76; /* per rfc2047 */
int i;
int line_len;
 
@@ -286,7 +295,7 @@ static void add_rfc2047(struct strbuf *sb, const char 
*line, int len,
if ((i + 1  len)  (ch == '='  line[i+1] == '?'))
goto needquote;
}
-   strbuf_add_wrapped_bytes(sb, line, len, -line_len, 1, max_length+1);
+   strbuf_add_wrapped_bytes(sb, line, len, -line_len, 1, 78+1);
return;
 
 needquote:
@@ -295,19 +304,14 @@ needquote:
line_len += strlen(encoding) + 5; /* 5 for =??q? */
for (i = 0; i  len; i++) {
unsigned ch = line[i]  0xFF;
+   int is_special = is_rfc2047_special(ch);
 
-   if (line_len = max_length - 2) {
+   if (line_len + 2 + (is_special ? 3 : 1)  max_length) {
strbuf_addf(sb, ?=\n =?%s?q?, encoding);
line_len = strlen(encoding) + 5 + 1; /* =??q? plus SP */
}
 
-   /*
-* We encode ' ' using '=20' even though rfc2047
-* allows using '_' for readability.  Unfortunately,
-* many programs do not understand this and just
-* leave the underscore in place.
-*/
-   if (is_rfc2047_special(ch) || ch == ' ' || ch == '\n') {
+   if (is_special) {
strbuf_addf(sb, =%02X, ch);
line_len += 3;
}
diff --git a/t/t4014-format-patch.sh b/t/t4014-format-patch.sh
index d66e358..1d5636d 100755
--- a/t/t4014-format-patch.sh
+++ b/t/t4014-format-patch.sh
@@ -772,30 +772,31 @@ M8=föö bar 
 M64=$M8$M8$M8$M8$M8$M8$M8$M8
 M512=$M64$M64$M64$M64$M64$M64$M64$M64
 cat expect 'EOF'
-Subject: [PATCH] 
=?UTF-8?q?f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6=C3=B6=20bar=20f=C3=B6?=
- =?UTF-8?q?=C3=B6=20bar=20f=C3=B6=C3=B6=20bar?=
+Subject: [PATCH]