On Wed, 22 Dec 2004 15:01:54 +1300
"Tony Meyer" <[EMAIL PROTECTED]> wrote:

> I had hoped that Hatuka Nezumi would have responded to the earlier message,
> but I haven't heard anything from him for a while (busy, perhaps).  He is
> leading the i18n process for SpamBayes (I'm helping and doing the checking
> in).

Sorry for the lack of response.  I'm on New Year ('shogatsu' in
Japanese) vacation until 6 January.  I'll be back next week.

Problems for Japanese/CJK:
1. The recommended charset for Japanese e-mail messages is
  ISO-2022-JP (cf. RFC 1468).  This charset isn't suitable for
  XML/XHTML parsers and isn't compatible with the Windows ANSI
  codepage (CP932 for Japanese).
2. The ISO-2022-* charsets aren't suitable for the spambayes
  tokenizer either.
3. More than one charset may be used for messages in a single
  language (e.g. ISO-8859-*, UTF-8 and UTF-7 for Western European
  languages; ISO-2022-JP, Shift_JIS, EUC-JP, UTF-8 and UTF-7 for
  Japanese).
4. In some East Asian languages (e.g. Japanese and Chinese), words
  are not separated by spaces, so they won't be tokenized
  effectively.
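
To illustrate problems 1 to 3, here is a small sketch using only
Python's standard codecs.  The sample word and the helper name
looks_like_ascii are my own choices for illustration, not taken from
spambayes.  It shows that the same text yields different bytes in each
charset, that the ISO-2022-JP bytes are all 7-bit (so a byte-oriented
tokenizer sees escape sequences and ASCII-looking junk), and that
decoding to Unicode gives one common form.

```python
# Illustration only: the sample word is my own choice, not from a real message.
SAMPLE = "\u7121\u6599"  # a two-kanji Japanese word ("free of charge")

CHARSETS = ["iso-2022-jp", "shift_jis", "euc-jp", "utf-8"]

# The same text produces a different byte sequence in every charset.
encoded = {name: SAMPLE.encode(name) for name in CHARSETS}


def looks_like_ascii(data):
    """Return True if every byte is below 0x80 (hypothetical helper)."""
    return all(byte < 0x80 for byte in data)


for name, data in encoded.items():
    print(f"{name:12} {data!r}  7-bit only: {looks_like_ascii(data)}")

# Decoding each byte string with its own charset recovers identical text,
# which is why a Unicode form is a convenient common ground.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```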

Patch #824651 tries to solve these problems.
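
For problem 4, one commonly used technique is to split runs of CJK
characters into overlapping character bigrams.  The sketch below is
only an illustration of that idea and should not be read as a
description of what the patch does.  The names is_cjk and tokenize and
the (rough) code point ranges are my assumptions.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough check (assumption): kana and CJK unified ideographs only."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def tokenize(text):
    """Yield whitespace-separated words, splitting CJK runs into bigrams."""
    for word in text.split():
        # Group consecutive characters by whether they are CJK or not.
        for cjk, group in groupby(word, key=is_cjk):
            run = "".join(group)
            if not cjk or len(run) == 1:
                yield run
            else:
                # Overlapping two-character windows over the CJK run.
                for i in range(len(run) - 1):
                    yield run[i:i + 2]


tokens = list(tokenize("free \u7121\u6599\u3067\u3059 now"))
print(tokens)
```

Note that this works on Unicode text only, so it relies on the
messages having been decoded first.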

For the current i18n work, at least problem 1 should be solved.

I am planning to provide sub-patches for each of these problems
(except problem 4).  They will convert message headers and bodies to
a charset suitable for each consumer: Unicode for the tokenizer,
e.g. UTF-8 for the web interface, and mbcs for the Outlook plug-in.
This solution will also provide properly internationalized message
handling.
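
A minimal sketch of the conversion I have in mind, using only the
standard email package.  The function names (to_unicode,
header_to_unicode, body_to_unicode, for_web, for_outlook) and the
sample message are hypothetical, and this is not the actual sub-patch.
Because the mbcs codec exists only on Windows, cp932 is used here as a
stand-in on other platforms.

```python
import base64
import email
import sys
from email.header import decode_header


def to_unicode(data, charset):
    """Decode bytes with the declared charset; fall back to latin-1."""
    try:
        return data.decode(charset or "ascii", errors="replace")
    except LookupError:  # unknown charset name
        return data.decode("latin-1")


def header_to_unicode(value):
    """Decode RFC 2047 encoded words in a header value to one str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(to_unicode(chunk, charset))
        else:
            parts.append(chunk)
    return "".join(parts)


def body_to_unicode(msg):
    """Decode a non-multipart body using its Content-Type charset."""
    payload = msg.get_payload(decode=True)
    return to_unicode(payload, msg.get_content_charset())


def for_web(text):
    return text.encode("utf-8")


def for_outlook(text):
    # Assumption: cp932 stands in for mbcs outside Windows.
    codec = "mbcs" if sys.platform == "win32" else "cp932"
    return text.encode(codec, errors="replace")


# Hand-built sample message in ISO-2022-JP (illustration only).
word = "\u7121\u6599"
raw = b"".join([
    b"Subject: =?ISO-2022-JP?B?",
    base64.b64encode(word.encode("iso-2022-jp")),
    b"?=\r\n",
    b"Content-Type: text/plain; charset=ISO-2022-JP\r\n",
    b"Content-Transfer-Encoding: 7bit\r\n\r\n",
    word.encode("iso-2022-jp") + b"\n",
])
msg = email.message_from_bytes(raw)
subject = header_to_unicode(msg["Subject"])
body = body_to_unicode(msg)
```

The tokenizer would then receive subject and body as Unicode, while
for_web and for_outlook re-encode the same text for display.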

Note that this solution may require the bind_textdomain_codeset
function where the gettext catalogs of the web interface and the
Outlook plug-in overlap.  But I'm not familiar with this function...

  --- nezumi
_______________________________________________
spambayes-dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-dev
