[Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJK header

2007-12-23 Thread jacky
Hi, all.

The rfc2047 decoder in libcamel cannot decode some
CJK headers correctly. Some of these headers do not
conform to the RFC, but I still need to decode them
correctly, and I think that if Evolution displays these
emails correctly, more people will like it.

So I wrote a new rfc2047 decoder; it is in the
patch. With the patch, libcamel decodes CJK headers
correctly and Evolution displays them correctly.
I have tested it on my mailbox, which has 2000 emails
sent by Evolution, Thunderbird, Outlook, Outlook
Express, Foxmail, Open WebMail, Yahoo, Gmail, Lotus
Notes, etc. Without this patch, almost 20% of these
emails can't be decoded and displayed correctly; with
this patch, 99% of them can.

I also found that attachments with CJK names can't be
recognised and displayed by Outlook / Outlook Express
/ Foxmail. This is because these email clients do not
support RFC2184. Evolution always uses the RFC2184
encoding method for attachment names, so a CJK-named
attachment sent from Evolution is not displayed in
Outlook / Outlook Express / Foxmail. In Thunderbird,
you can set the option mail.strictly_mime.parm_folding
to 0 or 1 to encode attachment names with the RFC2047
method instead. Can we add a similar option?
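
To make the difference between the two attachment-name encodings concrete, here is a minimal sketch in plain C. The function names rfc2231_encode_param() and rfc2047_encode_b() are hypothetical and are not libcamel API. RFC2184 was later obsoleted by RFC 2231, which keeps the same charset'language'%XX syntax, so the first helper follows that form. The second helper produces an RFC2047 B-encoded word of the kind that Outlook-style clients accept in a filename parameter, even though RFC 2047 itself does not allow encoded-words inside quoted parameter values.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write raw bytes as an RFC 2231 (formerly RFC 2184)
 * extended parameter value, i.e. charset''%XX%XX... with an empty language.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int
rfc2231_encode_param (const char *charset, const unsigned char *raw, size_t len,
                      char *out, size_t outsize)
{
	static const char hex[] = "0123456789ABCDEF";
	size_t n, i;
	int w;

	w = snprintf (out, outsize, "%s''", charset);
	if (w < 0 || (size_t) w >= outsize)
		return -1;
	n = (size_t) w;

	for (i = 0; i < len; i++) {
		unsigned char c = raw[i];
		int safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
		           (c >= '0' && c <= '9') || c == '.' || c == '-' || c == '_';

		if (safe) {
			if (n + 1 >= outsize)
				return -1;
			out[n++] = (char) c;
		} else {
			if (n + 3 >= outsize)
				return -1;
			out[n++] = '%';
			out[n++] = hex[c >> 4];
			out[n++] = hex[c & 0x0f];
		}
	}
	out[n] = '\0';
	return 0;
}

/* Hypothetical helper: write raw bytes as one RFC 2047 "B" encoded-word,
 * =?charset?B?base64?=. Returns 0 on success, -1 if out is too small. */
static int
rfc2047_encode_b (const char *charset, const unsigned char *raw, size_t len,
                  char *out, size_t outsize)
{
	static const char b64[] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	size_t n, i;
	int w;

	w = snprintf (out, outsize, "=?%s?B?", charset);
	if (w < 0 || (size_t) w >= outsize)
		return -1;
	n = (size_t) w;

	/* base64 text, then "?=", then the terminating NUL */
	if (n + 4 * ((len + 2) / 3) + 3 > outsize)
		return -1;

	for (i = 0; i < len; i += 3) {
		unsigned int v = (unsigned int) raw[i] << 16;

		if (i + 1 < len)
			v |= (unsigned int) raw[i + 1] << 8;
		if (i + 2 < len)
			v |= (unsigned int) raw[i + 2];

		out[n++] = b64[(v >> 18) & 0x3f];
		out[n++] = b64[(v >> 12) & 0x3f];
		out[n++] = (i + 1 < len) ? b64[(v >> 6) & 0x3f] : '=';
		out[n++] = (i + 2 < len) ? b64[v & 0x3f] : '=';
	}
	out[n++] = '?';
	out[n++] = '=';
	out[n] = '\0';
	return 0;
}
```

For a UTF-8 file name made of the bytes E4 B8 AD followed by ".txt", the first helper yields UTF-8''%E4%B8%AD.txt (used as filename*=...) and the second yields =?UTF-8?B?5LitLnR4dA==?= (used as filename="..."). An option like the Thunderbird one would simply choose between the two forms.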

Best regards.


  ___ 
Yahoo Mail delivers New Year greetings; send personalized greeting cards to family and friends!
http://cn.mail.yahoo.com/gc/index.html?entry=5souce=mail_mailletter_tagline
diff -ru evolution-data-server-2.21.4/camel/camel-mime-utils.c evolution-data-server-liuzhy/camel/camel-mime-utils.c
--- evolution-data-server-2.21.4/camel/camel-mime-utils.c	2007-12-22 16:50:44.0 +0800
+++ evolution-data-server-liuzhy/camel/camel-mime-utils.c	2007-12-23 14:55:07.0 +0800
@@ -821,116 +821,207 @@
 	*in = inptr;
 }
 
+static void
+print_hex (unsigned char *data, size_t len)
+{
+	size_t i, x;
+	unsigned char *p = data;
+	char high, low;
+
+	x = 0;
+	printf ("%04u", x);
+	for (i = 0; i < len; i++) {
+		high = *p >> 4;
+		high = (high<10) ? high + '0' : high + 'a' - 10;
+
+		low = *p & 0x0f;
+		low = (low<10) ? low + '0' : low + 'a' - 10;
+
+		printf ("0x%c%c  ", high, low);
+
+		p++;
+		x++;
+		if (i % 8 == 7) {
+			printf ("\n%04u", x);
+		}
+	}
+	printf ("\n");
+}
+
+static size_t
+conv_to_utf8 (const char *encname, char *in, size_t inlen, char *out, size_t outlen)
+{
+	char *charset, *inbuf, *outbuf;
+	iconv_t ic;
+	size_t inbuf_len, outbuf_len, ret;
+
+	charset = e_iconv_charset_name (encname);
+
+	ic = e_iconv_open ("UTF-8", charset);
+	if (ic == (iconv_t) -1) {
+		printf ("e_iconv_open() error\n");
+		return (size_t)-1;
+	}
+
+	inbuf = in;
+	inbuf_len = inlen;
+
+	outbuf = out;
+	outbuf_len = outlen;
+
+	ret = e_iconv (ic, &inbuf, &inbuf_len, &outbuf, &outbuf_len);
+	if (ret == (size_t)-1) {
+		printf ("e_iconv() error! source charset is %s, target charset is %s\n", charset, "UTF-8");
+		printf ("converted %u bytes, but last %u bytes can't convert!!\n", inlen - inbuf_len, inbuf_len);
+		printf ("source data:\n");
+		print_hex (in, inlen);
+
+		*outbuf = '\0';
+		printf ("target string is \"%s\"\n", out);
+
+		return (size_t)-1;
+	}
+
+	ret = outlen - outbuf_len;
+	out[ret] = '\0';
+
+	e_iconv_close (ic);
+
+	return ret;
+}
+
 /* decode rfc 2047 encoded string segment */
+#define DECWORD_LEN 1024
+#define UTF8_DECWORD_LEN 2048
+
 static char *
 rfc2047_decode_word(const char *in, size_t len)
 {
-	const char *inptr = in+2;
-	const char *inend = in+len-2;
-	const char *inbuf;
-	const char *charset;
-	char *encname, *p;
-	int tmplen;
-	size_t ret;
-	char *decword = NULL;
-	char *decoded = NULL;
-	char *outbase = NULL;
-	char *outbuf;
-	size_t inlen, outlen;
-	gboolean retried = FALSE;
-	iconv_t ic;
+	char prev_charset[32], curr_charset[32];
+	char encode;
+	char *start, *inptr, *inend;
+	char decword[DECWORD_LEN], utf8_decword[UTF8_DECWORD_LEN];
+	char *decword_ptr, *utf8_decword_ptr;
+	size_t inlen, outlen, ret;
 
 	d(printf("rfc2047: decoding '%.*s'\n", len, in));
 
+	prev_charset[0] = curr_charset[0] = '\0';
+
+	decword_ptr = decword;
+	utf8_decword_ptr = utf8_decword;
+
 	/* quick check to see if this could possibly be a real encoded word */
-	if (len < 8 || !(in[0] == '=' && in[1] == '?' && in[len-1] == '=' && in[len-2] == '?')) {
+	if (len < 8
+	|| !(in[0] == '=' && in[1] == '?'
+		&& in[len-1] == '=' && in[len-2] == '?')) {
 		d(printf("invalid\n"));
 		return NULL;
 	}
 
-	/* skip past the charset to the encoding type */
-	inptr = memchr (inptr, '?', inend-inptr);
-	if (inptr != NULL && inptr < inend + 2 && inptr[2] == '?') {
-		d(printf("found ?, encoding is '%c'\n", inptr[0]));
+	inptr = in;
+	inend = in + len;
+	outlen = sizeof(utf8_decword);
+
+	while (inptr < inend) {
+		/* begin */
+		inptr = memchr (inptr, '?', inend-inptr);
+		if (!inptr || *(inptr-1) != '=') {
+			return NULL;
+		}
+
+		inptr++;
+
+		/* charset */
+		start = inptr;
+		inptr = memchr (inptr, '?', inend-inptr);
+		if (!inptr) {
+			return NULL;
+		}
+		strncpy (curr_charset, start, inptr-start); /* maybe overflow 

Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJK header

2007-12-23 Thread jacky

--- Philip Van Hoof [EMAIL PROTECTED] wrote:

 Hey Jacky,
 
 This is a port of your patch to Tinymail's camel-lite

Thank you.


  ___
 Evolution-hackers mailing list
 Evolution-hackers@gnome.org

http://mail.gnome.org/mailman/listinfo/evolution-hackers
 -- 
 Philip Van Hoof, freelance software developer
 home: me at pvanhoof dot be 
 gnome: pvanhoof at gnome dot org 
 http://pvanhoof.be/blog
 http://codeminded.be
 
 
 
 Index: libtinymail-camel/camel-lite/camel/camel-mime-utils.c
 ===
 --- libtinymail-camel/camel-lite/camel/camel-mime-utils.c	(revision 3190)
 +++ libtinymail-camel/camel-lite/camel/camel-mime-utils.c	(working copy)
 @@ -821,125 +821,207 @@
  	*in = inptr;
  }
  
 +static void
 +print_hex (unsigned char *data, size_t len)
 +{
 +	size_t i, x;
 +	unsigned char *p = data;
 +	char high, low;
 +
 +	x = 0;
 +	printf ("%04u", x);
 +	for (i = 0; i < len; i++) {
 +		high = *p >> 4;
 +		high = (high<10) ? high + '0' : high + 'a' - 10;
 +
 +		low = *p & 0x0f;
 +		low = (low<10) ? low + '0' : low + 'a' - 10;
 +
 +		printf ("0x%c%c  ", high, low);
 +
 +		p++;
 +		x++;
 +		if (i % 8 == 7) {
 +			printf ("\n%04u", x);
 +		}
 +	}
 +	printf ("\n");
 +}
 +
 +static size_t
 +conv_to_utf8 (const char *encname, char *in, size_t inlen, char *out, size_t outlen)
 +{
 +	char *charset, *inbuf, *outbuf;
 +	iconv_t ic;
 +	size_t inbuf_len, outbuf_len, ret;
 +
 +	charset = (char *) e_iconv_charset_name (encname);
 +
 +	ic = e_iconv_open ("UTF-8", charset);
 +	if (ic == (iconv_t) -1) {
 +		printf ("e_iconv_open() error\n");
 +		return (size_t)-1;
 +	}
 +
 +	inbuf = in;
 +	inbuf_len = inlen;
 +
 +	outbuf = out;
 +	outbuf_len = outlen;
 +
 +	ret = e_iconv (ic, (const char **) &inbuf, &inbuf_len, &outbuf, &outbuf_len);
 +	if (ret == (size_t)-1) {
 +		printf ("e_iconv() error! source charset is %s, target charset is %s\n", charset, "UTF-8");
 +		printf ("converted %u bytes, but last %u bytes can't convert!!\n", inlen - inbuf_len, inbuf_len);
 +		printf ("source data:\n");
 +		print_hex (in, inlen);
 +
 +		*outbuf = '\0';
 +		printf ("target string is \"%s\"\n", out);
 +
 +		return (size_t)-1;
 +	}
 +
 +	ret = outlen - outbuf_len;
 +	out[ret] = '\0';
 +
 +	e_iconv_close (ic);
 +
 +	return ret;
 +}
 +
  /* decode rfc 2047 encoded string segment */
 +#define DECWORD_LEN 1024
 +#define UTF8_DECWORD_LEN 2048
 +
  static char *
  rfc2047_decode_word(const char *in, size_t len)
  {
 -	const char *inptr = in+2;
 -	const char *inend = in+len-2;
 -	const char *inbuf;
 -	const char *charset;
 -	char *encname, *p;
 -	int tmplen;
 -	size_t ret;
 -	char *decword = NULL;
 -	char *decoded = NULL;
 -	char *outbase = NULL;
 -	char *outbuf;
 -	size_t inlen, outlen;
 -	gboolean retried = FALSE;
 -	iconv_t ic;
 -	int idx = 0;
 +	char prev_charset[32], curr_charset[32];
 +	char encode;
 +	char *start, *inptr, *inend;
 +	char decword[DECWORD_LEN], utf8_decword[UTF8_DECWORD_LEN];
 +	char

Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJKheader

2007-12-24 Thread jacky

--- Peter Volkov [EMAIL PROTECTED] wrote:

 
 On Mon, 24/12/2007 at 13:21 +0800, jacky wrote:
  --- Jeff Stedfast [EMAIL PROTECTED] wrote:
  There are two kinds of email that need to be supported:
  1) An encoded-word divided across two lines. This was
  sent by dotProject v2.0.1.
 
 And there are even more users affected by this. I've already reported
 a similar problem in bug 315513. Thus this affects not only CJK people:
 
 http://bugzilla.gnome.org/show_bug.cgi?id=315513
 

In fact, the parser and decoder in my patch already
support these encoded-words. As I mentioned in my email:
 2) The encoded bytes of a single CJK character must stay
 in one encoded-word, but some email clients split them
 across two encoded-words.

But the problem described below has not been solved:
 1) An encoded-word divided across two lines. This was
 sent by dotProject v2.0.1.

As far as I have seen, this kind of email uses only the
quoted (Q) encoding, and header_decode_text() picks out
encoded-words by splitting on SPACE, so a simple solution
is to replace each SPACE with '_'. In fact, OpenWebmail
does this.
But the problem is that I must change the prototype of
header_decode_text() to
char *header_decode_text (char *in, size_t inlen, int ctext, const char *default_charset)
Originally, it is
char *header_decode_text (const char *in, size_t inlen, int ctext, const char *default_charset)
Functions that call header_decode_text() must be
changed too.
Does anyone have a better proposal?
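
One possible way to keep the const prototype, sketched here under assumptions, is to apply the SPACE to '_' replacement on a private copy of the header instead of on the caller's buffer. The helper name copy_with_q_spaces_fixed() is hypothetical and not part of Camel; it uses plain malloc() rather than GLib, assumes the header has already been unfolded, and assumes that any "=?charset?Q?" opener really does start an encoded-word that ends at the next "?=".

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd, NUL-terminated copy of a header
 * value in which spaces inside Q-encoded words (=?charset?Q?text?=) are
 * replaced by '_'. The caller's buffer is never modified, so the caller
 * can keep taking a const char *. Returns NULL if malloc fails. */
static char *
copy_with_q_spaces_fixed (const char *in, size_t inlen)
{
	char *copy, *p, *q, *text, *stop;

	copy = malloc (inlen + 1);
	if (copy == NULL)
		return NULL;
	memcpy (copy, in, inlen);
	copy[inlen] = '\0';

	p = copy;
	while ((p = strstr (p, "=?")) != NULL) {
		/* q points at the '?' that ends the charset name */
		q = strchr (p + 2, '?');
		if (q == NULL)
			break;

		/* expect one encoding letter followed by '?' */
		if (q[1] == '\0' || q[2] != '?') {
			p += 2;
			continue;
		}

		text = q + 3;
		stop = strstr (text, "?=");
		if (stop == NULL)
			break;

		/* only Q-encoded text uses '_' to represent a space */
		if (q[1] == 'Q' || q[1] == 'q') {
			for (; text < stop; text++) {
				if (*text == ' ')
					*text = '_';
			}
		}

		/* continue after this encoded-word, B or Q */
		p = stop + 2;
	}

	return copy;
}
```

The decoder could then run on the copy and free it afterwards, at the cost of one extra allocation per header, and the callers of header_decode_text() would not need to change.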

 -- 
 Peter.
 





Re: [Evolution-hackers] improved rfc2047 decode patch

2007-12-26 Thread jacky
It seems that your patch doesn't support this kind of
encoded string:
=?gb2312?b?any-encoded-text?==?gb2312?b?any-encoded-text?=
The two encoded-words are not separated by any character.

--- Jeffrey Stedfast [EMAIL PROTECTED] wrote:

 This patch is a port of my GMime rfc2047 decoder, which is even more
 liberal in what it accepts than Thunderbird and is what I will be
 committing to svn.
 
 closing bugs:
 
 #302991
 #315513
 #502178
 
 Jeff
 
 Index: camel-mime-utils.c
 ===
 --- camel-mime-utils.c	(revision 8315)
 +++ camel-mime-utils.c	(working copy)
 @@ -821,116 +821,321 @@
  	*in = inptr;
  }
  
 -/* decode rfc 2047 encoded string segment */
  static char *
 -rfc2047_decode_word(const char *in, size_t len)
 +camel_iconv_strndup (iconv_t cd, const char *string, size_t n)
  {
 -	const char *inptr = in+2;
 -	const char *inend = in+len-2;
 +	size_t inleft, outleft, converted = 0;
 +	char *out, *outbuf;
  	const char *inbuf;
 -	const char *charset;
 -	char *encname, *p;
 -	int tmplen;
 -	size_t ret;
 -	char *decword = NULL;
 -	char *decoded = NULL;
 -	char *outbase = NULL;
 -	char *outbuf;
 -	size_t inlen, outlen;
 -	gboolean retried = FALSE;
 -	iconv_t ic;
 -
 -	d(printf("rfc2047: decoding '%.*s'\n", len, in));
 -
 -	/* quick check to see if this could possibly be a real encoded word */
 -	if (len < 8 || !(in[0] == '=' && in[1] == '?' && in[len-1] == '=' && in[len-2] == '?')) {
 -		d(printf("invalid\n"));
 -		return NULL;
 -	}
 -
 -	/* skip past the charset to the encoding type */
 -	inptr = memchr (inptr, '?', inend-inptr);
 -	if (inptr != NULL && inptr < inend + 2 && inptr[2] == '?') {
 -		d(printf("found ?, encoding is '%c'\n", inptr[0]));
 -		inptr++;
 -		tmplen = inend-inptr-2;
 -		decword = g_alloca (tmplen); /* this will always be more-than-enough room */
 -		switch(toupper(inptr[0])) {
 -		case 'Q':
 -			inlen = quoted_decode((const unsigned char *) inptr+2, tmplen, (unsigned char *) decword);
 -			break;
 -		case 'B': {
 -			int state = 0;
 -			unsigned int save = 0;
 -
 -			inlen = camel_base64_decode_step((unsigned char *) inptr+2, tmplen, (unsigned char *) decword, &state, &save);
 -			/* if state != 0 then error? */
 -			break;
 +	size_t outlen;
 +	int errnosav;
 +	
 +	if (cd == (iconv_t) -1)
 +		return g_strndup (string, n);
 +	
 +	outlen = n * 2 + 16;
 +	out = g_malloc (outlen + 4);
 +	
 +	inbuf = string;
 +	inleft = n;
 +	
 +	do {
 +		errno = 0;
 +		outbuf = out + converted;
 +		outleft = outlen - converted;
 +		
 +		converted = iconv (cd, (char **) &inbuf, &inleft, &outbuf, &outleft);
 +		if (converted == (size_t) -1) {
 +			if (errno != E2BIG && errno != EINVAL)
 +				goto fail;
  		}
 -		default:
 -			/* uhhh, unknown encoding type - probably an invalid encoded word string */
 -			return NULL;
 +		
 +		/*
 +		 * E2BIG   There is not sufficient room at *outbuf.
 +		 *
 +		 * We just need to grow our outbuffer and try again.
 +		 */
 +		
 +		converted = outbuf - out;
 +		if (errno == E2BIG) {
 +			outlen += inleft * 2 + 16;
 +			out = g_realloc (out, outlen + 4);
 +			outbuf = out + converted;
  		}
 -		d(printf("The encoded length = %d\n", inlen));
 -		if (inlen > 0) {
 -			/* yuck, all this snot is to setup iconv! */
 -			tmplen = inptr - in - 3;
 -			encname = g_alloca (tmplen + 1);
 -			memcpy (encname, in + 2, tmplen);
 -			encname[tmplen] = '\0';
 +	} while (errno == E2BIG && inleft > 0);
 +	
 +	/*
 +	 * EINVAL  An  incomplete  multibyte sequence has been encoun-
 +	 * tered in the input.
 +	 *
 +	 * We'll just have to ignore it...
 +	 */
 +	
 +	/* flush the iconv conversion */
 +	iconv (cd, NULL, NULL, &outbuf, &outleft);
 +	
 +	/* Note: not all charsets can be nul-terminated with a single
 +	   nul byte. UCS2, for example, needs 2 nul bytes and UCS4
 +	   needs 4. I hope that 4 nul bytes is enough to terminate all
 +	   multibyte charsets? */
 +	
 +	/* nul-terminate the string */
 +	memset (outbuf, 0, 4);
 +	
 +	/* reset the cd */
 +	iconv (cd, NULL, NULL, NULL, NULL);
 +	
 +	return out;
 +	
 + fail:
 +

Re: [Evolution-hackers] improved rfc2047 decode patch

2007-12-26 Thread jacky

--- Jeffrey Stedfast [EMAIL PROTECTED] wrote:

 
 On Thu, 2007-12-27 at 00:20 +0800, jacky wrote:
  It seem that your patch don't support this kind of
  encoded string:
  =?gb2312?b?any-encoded-text?==?gb2312?b?any-encoded-text?=
  Two encoded-words are not separated by any character.
 
 Are you sure? I wrote the code to be able to handle this case and I just
 tested it again (noticed that I didn't have a test case like this in my
 test suite so added one) and it works fine.
 
 Do you have an example subject/whatever header for me to test against?
 

I drew my conclusion too hastily. Yes, your patch
supports this kind of email, but it does not support
email that breaks a single multi-byte character across
multiple encoded-word tokens. Also, when it decodes a
header that breaks an encoded-word token across two
lines, nothing is displayed in Evolution; for
example, the Subject is empty.
I'll use Camel with your patch to check all the email in
my mbox, and use GMime to decode all the email headers, to
find out what it can handle.
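
To illustrate the first case, here is a minimal sketch of why the payloads of adjacent encoded-words have to be joined before charset conversion. The helpers b64_value(), b64_append() and utf8_ends_complete() are hypothetical and belong to neither patch. UTF-8 is used only to keep the example free of iconv; the same reasoning applies to gb2312 and other multi-byte charsets. The sample payloads 5Lg= and rQ== carry the bytes E4 B8 and AD, which together form one 3-byte UTF-8 character.

```c
#include <stddef.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not part of
 * the base64 alphabet (this includes the '=' padding character). */
static int
b64_value (unsigned char c)
{
	if (c >= 'A' && c <= 'Z') return c - 'A';
	if (c >= 'a' && c <= 'z') return c - 'a' + 26;
	if (c >= '0' && c <= '9') return c - '0' + 52;
	if (c == '+') return 62;
	if (c == '/') return 63;
	return -1;
}

/* Decode the base64 text of ONE encoded-word and append the raw bytes to
 * out at offset *outlen. Decoding stops at padding or at any character
 * outside the alphabet; bytes that do not fit in out are dropped. */
static void
b64_append (const char *text, size_t len,
            unsigned char *out, size_t outsize, size_t *outlen)
{
	unsigned int acc = 0;
	int bits = 0, v;
	size_t i;

	for (i = 0; i < len; i++) {
		v = b64_value ((unsigned char) text[i]);
		if (v < 0)
			break;
		acc = (acc << 6) | (unsigned int) v;
		bits += 6;
		if (bits >= 8) {
			bits -= 8;
			if (*outlen < outsize)
				out[(*outlen)++] = (unsigned char) ((acc >> bits) & 0xff);
		}
	}
}

/* Return 1 if buf consists of whole UTF-8 sequences judged by lead bytes
 * only (continuation bytes are not validated), 0 if it starts with a
 * stray byte or ends in the middle of a character. */
static int
utf8_ends_complete (const unsigned char *buf, size_t len)
{
	size_t i = 0, need;

	while (i < len) {
		unsigned char c = buf[i];

		if (c < 0x80)
			need = 1;
		else if ((c >> 5) == 0x06)
			need = 2;
		else if ((c >> 4) == 0x0e)
			need = 3;
		else if ((c >> 3) == 0x1e)
			need = 4;
		else
			return 0;

		if (i + need > len)
			return 0;
		i += need;
	}

	return 1;
}
```

Decoding 5Lg= alone gives an incomplete sequence and rQ== alone gives a stray continuation byte, so converting each word separately must fail or lose the character. Appending both payloads into one buffer gives E4 B8 AD, which is complete. A decoder that wants to accept such mail would therefore collect the raw bytes of consecutive encoded-words that share a charset and run the charset conversion once on the joined buffer.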

 Jeff
 
  