On Wed, 22 Dec 2004 15:01:54 +1300
"Tony Meyer" <[EMAIL PROTECTED]> wrote:
> I had hoped that Hatuka Nezumi would have responded to the earlier message,
> but I haven't heard anything from him for a while (busy, perhaps). He is
> leading the i18n process for SpamBayes (I'm helping and doing the checking
> in).
Sorry for the lack of response. I'm on New Year ('shogatsu' in Japanese)
vacation until 6 January. I'll be back next week.
Problems for Japanese/CJK:
1. The recommended charset for Japanese e-mail messages is ISO-2022-JP
(cf. RFC 1468). This charset isn't suitable for XML/XHTML parsers
and isn't compatible with the Windows ANSI codepage (CP932 for
Japanese).
2. The ISO-2022-* charsets aren't suitable for the spambayes tokenizer either.
3. More than one charset may be used for messages in one language
(e.g. ISO-8859-*, UTF-8 and UTF-7 for Western European languages;
ISO-2022-JP, Shift_JIS, EUC-JP, UTF-8 and UTF-7 for Japanese).
4. In some East Asian languages (Japanese or Chinese), words are
not space-separated, so they won't be tokenized effectively.
Patch #824651 tries to solve these problems.
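As a rough illustration of problems 1, 2 and 4, here is a small Python
sketch. It is not code from the patch: the sample text is made up, and
char_bigrams is a hypothetical helper that only shows one common way to
get tokens out of text that has no spaces.

```python
# Hypothetical sample text; not taken from any real message.
text = "これは日本語のメールです"

# Problems 1 and 2: ISO-2022-JP is a 7-bit charset that switches character
# sets with escape sequences, so the raw bytes are awkward to parse directly.
raw = text.encode("iso-2022-jp")
starts_with_escape = raw.startswith(b"\x1b$B")  # ESC $ B selects JIS X 0208
is_seven_bit = all(byte < 0x80 for byte in raw)

# Decoding to Unicode first gives a single representation to work on.
decoded = raw.decode("iso-2022-jp")

# Problem 4: there are no spaces between words, so whitespace splitting
# returns the whole sentence as one token.
whitespace_tokens = decoded.split()


def char_bigrams(s):
    """Hypothetical fallback: overlapping two-character tokens."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


bigram_tokens = char_bigrams("日本語")
```

Character bigrams are only an assumption here, chosen because they need
no dictionary; a real solution for problem 4 may need a word segmenter.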
For the current i18n work, at least problem 1 should be solved.
I am planning to provide sub-patches for each problem
(except problem 4) that convert message headers/bodies to a suitable
charset for the tokenizer (Unicode), the web interface (e.g. UTF-8) and
the Outlook plug-in (mbcs). This solution will also provide truly
internationalized message handling.
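The conversion idea could look roughly like the following Python sketch.
This is only an assumption about the shape of such code, not the
sub-patches themselves: header_to_unicode and body_to_unicode are
hypothetical names, the sample message is made up, and "cp932" stands in
for "mbcs" because the mbcs codec exists only on Windows.

```python
from email import message_from_bytes
from email.header import Header, decode_header


def header_to_unicode(value):
    """Decode an RFC 2047 encoded header value to a Unicode string."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def body_to_unicode(msg):
    """Decode a single-part text body using its declared charset."""
    payload = msg.get_payload(decode=True)
    charset = msg.get_content_charset() or "ascii"
    return payload.decode(charset, errors="replace")


# Hypothetical sample message in ISO-2022-JP.
raw = b"".join([
    b"Subject: ",
    Header("テスト", "iso-2022-jp").encode().encode("ascii"),
    b"\n",
    b"Content-Type: text/plain; charset=iso-2022-jp\n",
    b"Content-Transfer-Encoding: 7bit\n",
    b"\n",
    "これは本文です".encode("iso-2022-jp"),
    b"\n",
])
msg = message_from_bytes(raw)

subject = header_to_unicode(msg["Subject"])  # Unicode, for the tokenizer
body = body_to_unicode(msg).strip()          # Unicode, for the tokenizer
web_bytes = body.encode("utf-8")             # for the web interface
outlook_bytes = body.encode("cp932")         # stand-in for "mbcs" on Japanese Windows
```

Everything passes through Unicode once, so each front end only has to
encode to its own charset on output.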
Note that this solution may require the bind_textdomain_codeset
function, since the gettext catalogs of the web interface and the
Outlook plug-in overlap. But I'm not familiar with this function...
--- nezumi