[Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJK header
Hi, all.

The rfc2047 decoder in libcamel cannot decode some CJK headers correctly. Some of these headers do not conform to the RFC, but I need to decode them correctly, and I think that if Evolution can display these emails correctly, more people will like it. So I wrote a new rfc2047 decoder, and it is in the patch. With the patch, libcamel can decode CJK headers correctly, and Evolution can now display them correctly.

I have tested it on my mailbox, which has 2000 emails sent by Evolution, Thunderbird, Outlook, Outlook Express, Foxmail, Open WebMail, Yahoo, Gmail, Lotus Notes, etc. Without this patch, almost 20% of these emails can't be decoded and displayed correctly; with this patch, 99% of them can.

I also found that an attachment with a CJK name can't be recognised and displayed by Outlook / Outlook Express / Foxmail. This is because these email clients do not support RFC 2184. Evolution always uses the RFC 2184 encoding method for attachment names, so an email with a CJK-named attachment can't be displayed in Outlook / Outlook Express / Foxmail. In Thunderbird, you can set the option mail.strictly_mime.parm_folding to 0 or 1 to use the RFC 2047 encoding method for attachment names. Can we add a similar option?

Best regards.

diff -ru evolution-data-server-2.21.4/camel/camel-mime-utils.c evolution-data-server-liuzhy/camel/camel-mime-utils.c
--- evolution-data-server-2.21.4/camel/camel-mime-utils.c 2007-12-22 16:50:44.0 +0800
+++ evolution-data-server-liuzhy/camel/camel-mime-utils.c 2007-12-23 14:55:07.0 +0800
@@ -821,116 +821,207 @@
 	*in = inptr;
 }
 
+static void
+print_hex (unsigned char *data, size_t len)
+{
+	size_t i, x;
+	unsigned char *p = data;
+	char high, low;
+
+	x = 0;
+	printf ("%04u", x);
+	for (i = 0; i < len; i++) {
+		high = *p >> 4;
+		high = (high<10) ? high + '0' : high + 'a' - 10;
+
+		low = *p & 0x0f;
+		low = (low<10) ? low + '0' : low + 'a' - 10;
+
+		printf ("0x%c%c ", high, low);
+
+		p++;
+		x++;
+		if (i % 8 == 7) {
+			printf ("\n%04u", x);
+		}
+	}
+	printf ("\n");
+}
+
+static size_t
+conv_to_utf8 (const char *encname, char *in, size_t inlen, char *out, size_t outlen)
+{
+	char *charset, *inbuf, *outbuf;
+	iconv_t ic;
+	size_t inbuf_len, outbuf_len, ret;
+
+	charset = e_iconv_charset_name (encname);
+
+	ic = e_iconv_open ("UTF-8", charset);
+	if (ic == (iconv_t) -1) {
+		printf ("e_iconv_open() error\n");
+		return (size_t)-1;
+	}
+
+	inbuf = in;
+	inbuf_len = inlen;
+
+	outbuf = out;
+	outbuf_len = outlen;
+
+	ret = e_iconv (ic, &inbuf, &inbuf_len, &outbuf, &outbuf_len);
+	if (ret == (size_t)-1) {
+		printf ("e_iconv() error! source charset is %s, target charset is %s\n", charset, "UTF-8");
+		printf ("converted %u bytes, but last %u bytes can't convert!!\n", inlen - inbuf_len, inbuf_len);
+		printf ("source data:\n");
+		print_hex (in, inlen);
+
+		*outbuf = '\0';
+		printf ("target string is \"%s\"\n", out);
+
+		return (size_t)-1;
+	}
+
+	ret = outlen - outbuf_len;
+	out[ret] = '\0';
+
+	e_iconv_close (ic);
+
+	return ret;
+}
+
 /* decode rfc 2047 encoded string segment */
+#define DECWORD_LEN 1024
+#define UTF8_DECWORD_LEN 2048
+
 static char *
 rfc2047_decode_word(const char *in, size_t len)
 {
-	const char *inptr = in+2;
-	const char *inend = in+len-2;
-	const char *inbuf;
-	const char *charset;
-	char *encname, *p;
-	int tmplen;
-	size_t ret;
-	char *decword = NULL;
-	char *decoded = NULL;
-	char *outbase = NULL;
-	char *outbuf;
-	size_t inlen, outlen;
-	gboolean retried = FALSE;
-	iconv_t ic;
+	char prev_charset[32], curr_charset[32];
+	char encode;
+	char *start, *inptr, *inend;
+	char decword[DECWORD_LEN], utf8_decword[UTF8_DECWORD_LEN];
+	char *decword_ptr, *utf8_decword_ptr;
+	size_t inlen, outlen, ret;
 
 	d(printf("rfc2047: decoding '%.*s'\n", len, in));
 
+	prev_charset[0] = curr_charset[0] = '\0';
+
+	decword_ptr = decword;
+	utf8_decword_ptr = utf8_decword;
+
 	/* quick check to see if this could possibly be a real encoded word */
-	if (len < 8 || !(in[0] == '=' && in[1] == '?' && in[len-1] == '=' && in[len-2] == '?')) {
+	if (len < 8
+	    || !(in[0] == '=' && in[1] == '?'
+	    && in[len-1] == '=' && in[len-2] == '?')) {
 		d(printf("invalid\n"));
 		return NULL;
 	}
 
-	/* skip past the charset to the encoding type */
-	inptr = memchr (inptr, '?', inend-inptr);
-	if (inptr != NULL && inptr < inend + 2 && inptr[2] == '?') {
-		d(printf("found ?, encoding is '%c'\n", inptr[0]));
+	inptr = in;
+	inend = in + len;
+	outlen = sizeof(utf8_decword);
+
+	while (inptr < inend) {
+		/* begin */
+		inptr = memchr (inptr, '?', inend-inptr);
+		if (!inptr || *(inptr-1) != '=') {
+			return NULL;
+		}
+
+		inptr++;
+
+		/* charset */
+		start = inptr;
+		inptr = memchr (inptr, '?', inend-inptr);
+		if (!inptr) {
+			return NULL;
+		}
+		strncpy (curr_charset, start, inptr-start); /* maybe overflow
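
The "maybe overflow" remark on the strncpy() call concerns copying a charset name of arbitrary length into the 32-byte curr_charset buffer. A minimal sketch of one way to bound that copy is shown here; copy_charset() is a hypothetical helper that is not part of the patch, and rejecting over-long names (rather than truncating them) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: copy the charset name found
 * between start and end into dst, always nul-terminated.
 * Returns 0 on success, -1 if the name does not fit in dstsize bytes. */
int
copy_charset (char *dst, size_t dstsize, const char *start, const char *end)
{
	size_t len;

	assert (dst != NULL && dstsize > 0 && start <= end);

	len = (size_t) (end - start);
	if (len >= dstsize) {
		/* assumption: treat an over-long name as an invalid encoded-word */
		dst[0] = '\0';
		return -1;
	}

	memcpy (dst, start, len);
	dst[len] = '\0';

	return 0;
}
```

A caller in the style of the patch would pass curr_charset, sizeof (curr_charset), start and inptr, and return NULL when the helper reports -1.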
Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJK header
--- Philip Van Hoof [EMAIL PROTECTED] wrote:

Hey Jacky,

This is a port of your patch to Tinymail's camel-lite.

Thank you.

On Sun, 2007-12-23 at 23:09 +0800, jacky wrote:

Hi, all.

The rfc2047 decoder in libcamel cannot decode some CJK headers correctly. Some of these headers do not conform to the RFC, but I need to decode them correctly, and I think that if Evolution can display these emails correctly, more people will like it. So I wrote a new rfc2047 decoder, and it is in the patch. With the patch, libcamel can decode CJK headers correctly, and Evolution can now display them correctly.

I have tested it on my mailbox, which has 2000 emails sent by Evolution, Thunderbird, Outlook, Outlook Express, Foxmail, Open WebMail, Yahoo, Gmail, Lotus Notes, etc. Without this patch, almost 20% of these emails can't be decoded and displayed correctly; with this patch, 99% of them can.

I also found that an attachment with a CJK name can't be recognised and displayed by Outlook / Outlook Express / Foxmail. This is because these email clients do not support RFC 2184. Evolution always uses the RFC 2184 encoding method for attachment names, so an email with a CJK-named attachment can't be displayed in Outlook / Outlook Express / Foxmail. In Thunderbird, you can set the option mail.strictly_mime.parm_folding to 0 or 1 to use the RFC 2047 encoding method for attachment names. Can we add a similar option?

Best regards.

___
Evolution-hackers mailing list
Evolution-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/evolution-hackers

--
Philip Van Hoof, freelance software developer
home: me at pvanhoof dot be
gnome: pvanhoof at gnome dot org
http://pvanhoof.be/blog
http://codeminded.be

Index: libtinymail-camel/camel-lite/camel/camel-mime-utils.c
===================================================================
--- libtinymail-camel/camel-lite/camel/camel-mime-utils.c (revision 3190)
+++ libtinymail-camel/camel-lite/camel/camel-mime-utils.c (working copy)
@@ -821,125 +821,207 @@
 	*in = inptr;
 }
 
+static void
+print_hex (unsigned char *data, size_t len)
+{
+	size_t i, x;
+	unsigned char *p = data;
+	char high, low;
+
+	x = 0;
+	printf ("%04u", x);
+	for (i = 0; i < len; i++) {
+		high = *p >> 4;
+		high = (high<10) ? high + '0' : high + 'a' - 10;
+
+		low = *p & 0x0f;
+		low = (low<10) ? low + '0' : low + 'a' - 10;
+
+		printf ("0x%c%c ", high, low);
+
+		p++;
+		x++;
+		if (i % 8 == 7) {
+			printf ("\n%04u", x);
+		}
+	}
+	printf ("\n");
+}
+
+static size_t
+conv_to_utf8 (const char *encname, char *in, size_t inlen, char *out, size_t outlen)
+{
+	char *charset, *inbuf, *outbuf;
+	iconv_t ic;
+	size_t inbuf_len, outbuf_len, ret;
+
+	charset = (char *) e_iconv_charset_name (encname);
+
+	ic = e_iconv_open ("UTF-8", charset);
+	if (ic == (iconv_t) -1) {
+		printf ("e_iconv_open() error\n");
+		return (size_t)-1;
+	}
+
+	inbuf = in;
+	inbuf_len = inlen;
+
+	outbuf = out;
+	outbuf_len = outlen;
+
+	ret = e_iconv (ic, (const char **) &inbuf, &inbuf_len, &outbuf, &outbuf_len);
+	if (ret == (size_t)-1) {
+		printf ("e_iconv() error! source charset is %s, target charset is %s\n", charset, "UTF-8");
+		printf ("converted %u bytes, but last %u bytes can't convert!!\n", inlen - inbuf_len, inbuf_len);
+		printf ("source data:\n");
+		print_hex (in, inlen);
+
+		*outbuf = '\0';
+		printf ("target string is \"%s\"\n", out);
+
+		return (size_t)-1;
+	}
+
+	ret = outlen - outbuf_len;
+	out[ret] = '\0';
+
+	e_iconv_close (ic);
+
+	return ret;
+}
+
 /* decode rfc 2047 encoded string segment */
+#define DECWORD_LEN 1024
+#define UTF8_DECWORD_LEN 2048
+
 static char *
 rfc2047_decode_word(const char *in, size_t len)
 {
-	const char *inptr = in+2;
-	const char *inend = in+len-2;
-	const char *inbuf;
-	const char *charset;
-	char *encname, *p;
-	int tmplen;
-	size_t ret;
-	char *decword = NULL;
-	char *decoded = NULL;
-	char *outbase = NULL;
-	char *outbuf;
-	size_t inlen, outlen;
-	gboolean retried = FALSE;
-	iconv_t ic;
-	int idx = 0;
+	char prev_charset[32], curr_charset[32];
+	char encode;
+	char *start, *inptr, *inend;
+	char decword[DECWORD_LEN], utf8_decword[UTF8_DECWORD_LEN];
+	char
Re: [Evolution-hackers] [patch] fixed incorrect rfc2047 decode for CJK header
--- Peter Volkov [EMAIL PROTECTED] wrote:

On Mon, 24/12/2007 at 13:21 +0800, jacky wrote:

--- Jeff Stedfast [EMAIL PROTECTED] wrote:

There are two kinds of email that need to be supported:

1) An encoded-word is divided across two lines. This was sent by dotProject v2.0.1.

And there are even more users affected by this. I've already reported a similar problem in bug 315513. Thus this affects not only CJK people:

http://bugzilla.gnome.org/show_bug.cgi?id=315513

In fact, the parser and decoder in my patch support these encoded-words. I already mentioned this in my email:

2) The encoded string of a CJK character must stay within one encoded-word, but some email clients divide it into two encoded-words.

But the problem described below has not been solved:

1) An encoded-word is divided across two lines. This was sent by dotProject v2.0.1.

As far as I have seen, this kind of email uses only the quoted (Q) encoding, and header_decode_text() gets all the encoded-words, which are separated by SPACE, so a simple solution is to replace each SPACE with '_'. In fact, Open WebMail does this. But the problem is that I must change the prototype of header_decode_text() to

char *header_decode_text (char *in, size_t inlen, int ctext, const char *default_charset)

Originally, it is

char *header_decode_text (const char *in, size_t inlen, int ctext, const char *default_charset)

Functions that call header_decode_text() must be changed too. Does anyone have a better proposal?

-- Peter.
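
The SPACE-to-'_' idea can be tried without touching the const prototype by rewriting a private copy of the encoded-word span instead of the caller's buffer. The following is only a sketch under that assumption: q_word_fix_spaces() is a hypothetical helper that does not exist in Camel, it uses plain malloc() rather than GLib, and it expects the span to start at "=?". It relies on the RFC 2047 rule that '_' in Q-encoded text always stands for a space.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Camel: return a malloc'd copy of the
 * span in[0..len) in which every SPACE inside the encoded-text of a
 * Q-encoded word is replaced by '_'.  Anything that is not a Q-encoded
 * word is copied unchanged.  The caller frees the result. */
char *
q_word_fix_spaces (const char *in, size_t len)
{
	char *copy, *p;

	assert (in != NULL);

	copy = malloc (len + 1);
	if (copy == NULL)
		return NULL;
	memcpy (copy, in, len);
	copy[len] = '\0';

	/* expect "=?charset?q?text?=" */
	if (len < 8 || copy[0] != '=' || copy[1] != '?')
		return copy;

	p = strchr (copy + 2, '?');	/* end of the charset name */
	if (p == NULL || toupper ((unsigned char) p[1]) != 'Q' || p[2] != '?')
		return copy;

	/* walk the encoded-text up to the closing "?=" */
	for (p += 3; *p != '\0'; p++) {
		if (p[0] == '?' && p[1] == '=')
			break;
		if (*p == ' ')
			*p = '_';
	}

	return copy;
}
```

The cost is one extra allocation per suspicious span, which keeps header_decode_text() and its callers unchanged; whether that trade-off suits Camel is left open here.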
Re: [Evolution-hackers] improved rfc2047 decode patch
It seems that your patch doesn't support this kind of encoded string:

=?gb2312?b?any-encoded-text?==?gb2312?b?any-encoded-text?=

The two encoded-words are not separated by any character.

--- Jeffrey Stedfast [EMAIL PROTECTED] wrote:

This patch is a port of my GMime rfc2047 decoder which is even more liberal in what it accepts than Thunderbird and is what I will be committing to svn.

closing bugs: #302991 #315513 #502178

Jeff

Index: camel-mime-utils.c
===================================================================
--- camel-mime-utils.c (revision 8315)
+++ camel-mime-utils.c (working copy)
@@ -821,116 +821,321 @@
 	*in = inptr;
 }
 
-/* decode rfc 2047 encoded string segment */
 static char *
-rfc2047_decode_word(const char *in, size_t len)
+camel_iconv_strndup (iconv_t cd, const char *string, size_t n)
 {
-	const char *inptr = in+2;
-	const char *inend = in+len-2;
+	size_t inleft, outleft, converted = 0;
+	char *out, *outbuf;
 	const char *inbuf;
-	const char *charset;
-	char *encname, *p;
-	int tmplen;
-	size_t ret;
-	char *decword = NULL;
-	char *decoded = NULL;
-	char *outbase = NULL;
-	char *outbuf;
-	size_t inlen, outlen;
-	gboolean retried = FALSE;
-	iconv_t ic;
-
-	d(printf("rfc2047: decoding '%.*s'\n", len, in));
-
-	/* quick check to see if this could possibly be a real encoded word */
-	if (len < 8 || !(in[0] == '=' && in[1] == '?' && in[len-1] == '=' && in[len-2] == '?')) {
-		d(printf("invalid\n"));
-		return NULL;
-	}
-
-	/* skip past the charset to the encoding type */
-	inptr = memchr (inptr, '?', inend-inptr);
-	if (inptr != NULL && inptr < inend + 2 && inptr[2] == '?') {
-		d(printf("found ?, encoding is '%c'\n", inptr[0]));
-		inptr++;
-		tmplen = inend-inptr-2;
-		decword = g_alloca (tmplen); /* this will always be more-than-enough room */
-		switch(toupper(inptr[0])) {
-		case 'Q':
-			inlen = quoted_decode((const unsigned char *) inptr+2, tmplen, (unsigned char *) decword);
-			break;
-		case 'B': {
-			int state = 0;
-			unsigned int save = 0;
-
-			inlen = camel_base64_decode_step((unsigned char *) inptr+2, tmplen, (unsigned char *) decword, &state, &save);
-			/* if state != 0 then error? */
-			break;
+	size_t outlen;
+	int errnosav;
+
+	if (cd == (iconv_t) -1)
+		return g_strndup (string, n);
+
+	outlen = n * 2 + 16;
+	out = g_malloc (outlen + 4);
+
+	inbuf = string;
+	inleft = n;
+
+	do {
+		errno = 0;
+		outbuf = out + converted;
+		outleft = outlen - converted;
+
+		converted = iconv (cd, (char **) &inbuf, &inleft, &outbuf, &outleft);
+		if (converted == (size_t) -1) {
+			if (errno != E2BIG && errno != EINVAL)
+				goto fail;
 		}
-		default:
-			/* uhhh, unknown encoding type - probably an invalid encoded word string */
-			return NULL;
+
+		/*
+		 * E2BIG   There is not sufficient room at *outbuf.
+		 *
+		 * We just need to grow our outbuffer and try again.
+		 */
+
+		converted = outbuf - out;
+		if (errno == E2BIG) {
+			outlen += inleft * 2 + 16;
+			out = g_realloc (out, outlen + 4);
+			outbuf = out + converted;
 		}
-		d(printf("The encoded length = %d\n", inlen));
-		if (inlen > 0) {
-			/* yuck, all this snot is to setup iconv! */
-			tmplen = inptr - in - 3;
-			encname = g_alloca (tmplen + 1);
-			memcpy (encname, in + 2, tmplen);
-			encname[tmplen] = '\0';
+	} while (errno == E2BIG && inleft > 0);
+
+	/*
+	 * EINVAL  An incomplete multibyte sequence has been encountered
+	 *         in the input.
+	 *
+	 * We'll just have to ignore it...
+	 */
+
+	/* flush the iconv conversion */
+	iconv (cd, NULL, NULL, &outbuf, &outleft);
+
+	/* Note: not all charsets can be nul-terminated with a single
+	   nul byte. UCS2, for example, needs 2 nul bytes and UCS4
+	   needs 4. I hope that 4 nul bytes is enough to terminate all
+	   multibyte charsets? */
+
+	/* nul-terminate the string */
+	memset (outbuf, 0, 4);
+
+	/* reset the cd */
+	iconv (cd, NULL, NULL, NULL, NULL);
+
+	return out;
+
+ fail:
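
The grow-and-retry handling of E2BIG in camel_iconv_strndup() can be exercised on its own with plain iconv(3). The sketch below is not Jeff's code: convert_grow() is a hypothetical name, it uses malloc()/realloc() instead of GLib, it assumes the glibc-style iconv() prototype that takes char **inbuf, and it leaves out the final flush that stateful charsets need.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert n bytes of in from charset `from` to charset
 * `to`, growing the output buffer whenever iconv() reports E2BIG.
 * Returns a malloc'd, nul-terminated string, or NULL on setup/alloc failure.
 * Assumes a target charset that a single nul byte can terminate (e.g. UTF-8). */
char *
convert_grow (const char *from, const char *to, const char *in, size_t n)
{
	iconv_t cd;
	char *out, *outbuf, *tmp;
	char *inbuf = (char *) in;	/* iconv() only reads through this pointer */
	size_t inleft = n, outlen = n + 16, outleft, used = 0, rc;

	cd = iconv_open (to, from);
	if (cd == (iconv_t) -1)
		return NULL;

	out = malloc (outlen + 1);
	if (out == NULL) {
		iconv_close (cd);
		return NULL;
	}

	for (;;) {
		outbuf = out + used;
		outleft = outlen - used;

		rc = iconv (cd, &inbuf, &inleft, &outbuf, &outleft);
		used = (size_t) (outbuf - out);

		if (rc != (size_t) -1)
			break;		/* all input was converted */
		if (errno != E2BIG)
			break;		/* EINVAL/EILSEQ: keep what was converted so far */

		/* E2BIG: not enough room at *outbuf, so grow and try again */
		outlen = outlen * 2 + 16;
		tmp = realloc (out, outlen + 1);
		if (tmp == NULL) {
			free (out);
			iconv_close (cd);
			return NULL;
		}
		out = tmp;
	}

	out[used] = '\0';
	iconv_close (cd);

	return out;
}
```

Recomputing outbuf from the byte count after every realloc() matters because the buffer may move; the patch does the same with its converted offset.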
Re: [Evolution-hackers] improved rfc2047 decode patch
--- Jeffrey Stedfast [EMAIL PROTECTED] wrote:

On Thu, 2007-12-27 at 00:20 +0800, jacky wrote:

It seems that your patch doesn't support this kind of encoded string:

=?gb2312?b?any-encoded-text?==?gb2312?b?any-encoded-text?=

The two encoded-words are not separated by any character.

Are you sure? I wrote the code to be able to handle this case and I just tested it again (noticed that I didn't have a test case like this in my test suite so added one) and it works fine. Do you have an example subject/whatever header for me to test against?

I drew my conclusion too hastily. Yes, your patch supports this kind of email, but it doesn't support email that breaks a single multi-byte character across multiple encoded-word tokens. Also, when it decodes a header that breaks an encoded-word token across two lines, nothing is displayed in Evolution; for example, the Subject is empty. I'll use Camel with your patch to check all the email in my mbox, and use GMime to decode all the email headers, to find out what it can handle.

Jeff

--- Jeffrey Stedfast [EMAIL PROTECTED] wrote:

This patch is a port of my GMime rfc2047 decoder which is even more liberal in what it accepts than Thunderbird and is what I will be committing to svn.

closing bugs: #302991 #315513 #502178

Jeff
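
The case of a single multi-byte character broken across encoded-word tokens comes down to ordering: if each token is charset-converted on its own, the half character at the end of one token is invalid input, whereas joining the raw decoded bytes of adjacent same-charset tokens first makes the character whole again. The sketch below shows only that joining step for B-encoded words. join_b_words() and its helpers are hypothetical names, not Camel or GMime API; the charset comparison is case-sensitive for brevity, and UTF-8 is used in the usage note only because its byte values are easy to check by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not one. */
static int
b64_value (char c)
{
	if (c >= 'A' && c <= 'Z') return c - 'A';
	if (c >= 'a' && c <= 'z') return c - 'a' + 26;
	if (c >= '0' && c <= '9') return c - '0' + 52;
	if (c == '+') return 62;
	if (c == '/') return 63;
	return -1;
}

/* Decode one base64 payload; padding and foreign characters are skipped.
 * Returns the number of bytes written to out. */
static size_t
b64_decode (const char *in, size_t len, unsigned char *out)
{
	unsigned int acc = 0;
	int bits = 0;
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		int v = b64_value (in[i]);
		if (v < 0)
			continue;
		acc = (acc << 6) | (unsigned int) v;
		bits += 6;
		if (bits >= 8) {
			bits -= 8;
			out[n++] = (unsigned char) ((acc >> bits) & 0xff);
		}
	}
	return n;
}

/* Hypothetical helper: decode a run of directly adjacent B-encoded words
 * that share one charset into a single raw byte buffer.  The charset name
 * is stored in charset; conversion to UTF-8 is left to the caller and
 * happens once, on the joined bytes.  Returns the number of raw bytes. */
size_t
join_b_words (const char *in, char *charset, size_t csmax,
	      unsigned char *out, size_t outmax)
{
	size_t n = 0;

	assert (csmax > 0);
	charset[0] = '\0';

	while (in[0] == '=' && in[1] == '?') {
		const char *cs = in + 2;
		const char *cs_end = strchr (cs, '?');
		const char *text, *text_end;
		size_t cslen, textlen;

		if (cs_end == NULL || (cs_end[1] != 'b' && cs_end[1] != 'B') || cs_end[2] != '?')
			break;
		text = cs_end + 3;
		text_end = strstr (text, "?=");
		if (text_end == NULL)
			break;

		cslen = (size_t) (cs_end - cs);
		textlen = (size_t) (text_end - text);
		if (cslen >= csmax)
			break;
		if (charset[0] == '\0') {
			memcpy (charset, cs, cslen);
			charset[cslen] = '\0';
		} else if (strlen (charset) != cslen || strncmp (charset, cs, cslen) != 0) {
			break;	/* a different charset starts a new run */
		}

		/* 4 base64 characters never yield more than 3 bytes */
		if (n + (textlen / 4 + 1) * 3 > outmax)
			break;
		n += b64_decode (text, textlen, out + n);
		in = text_end + 2;
	}

	return n;
}
```

For example, the UTF-8 bytes E4 B8 AD of one CJK character can be split as "=?utf-8?b?5A==?==?utf-8?b?uK0=?=". Neither token decodes to a valid character alone, but join_b_words() returns the three bytes together, ready for a single charset conversion.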