[spambayes-bugs] [ spambayes-Patches-824651 ] Multibyte (CJK etc.) message support

SourceForge.net Fri, 09 Jun 2006 22:21:31 -0700

Patches item #824651, was opened at 2003-10-16 21:23
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Multibyte (CJK etc.) message support

Initial Comment:
This may also be applicable to other East Asian languages.

o Unicode'ify text:
  For example, for Japanese messages, RFC 1468 recommends
  the ISO/IEC 2022 encoding scheme, with ASCII and a
  multibyte character set both designated to GL.  The
  original tokenizer generates only bogus, meaningless
  text fragments for Japanese messages.

o Concatenate C/J lines.
  In Japanese (and maybe Chinese) messages, line folding
  often breaks 'words'.

o Bigram of C/J characters.
  In Japanese (and often Chinese) messages, 'words' are
  not separated by characters such as whitespace.
  Tokenization into grammatical 'words' would require
  heuristic algorithms using a large corpus.
  Instead of an expensive human-language parser, generate
  bigrams from each run of kanji (ideographs for C/J/K) or
  run of hiragana & katakana (syllabic letters for J).

  N.B.:
  - I believe the number of database items is roughly O(n^2)
    for bigrams, O(n^3) for trigrams, ... and O(n^i) for
    i-grams, where n is the size of the character set in use.
    For katakana & hiragana, n is approximately 100.  For kanji
    it is approx. 5000 (KS X 1001), 7000 (JIS X 0208), or more
    (Chinese standards).  For C/J messages, 3-or-more-grams
    would generate a very sparse and large database.

  - Words of a single kanji should not be discarded by the
    tokenizer, since most basic kanji words are 1 or 2
    characters long.
    Words of a single hiragana/katakana may be discarded.

  - As far as I know, in Korean messages, phrases (not 'words'
    but similar) are often separated by whitespace, so runs of
    hangul (syllabic characters for K) need not be split into
    n-grams.

o Punctuation --- what is 'punctuation'?  Many punctuation
  marks, spaces, signs and symbols registered in the
  Unicode Standard have been added to punctuation_run_re (for
  compatibility, some of them overlap with
  subject_words_re), since many of them are also
  registered as punctuation or symbols in the C/J/K
  character set standards.
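
A minimal sketch of the line-concatenation and bigram steps described
above, written for current Python 3 rather than taken from the patch.
The character ranges and the names join_folded_lines, bigrams and
cj_tokens are assumptions made for illustration; like the patch, the
sketch only covers the BMP.

```python
import re

# Assumed, illustrative character classes (BMP only); the patch may use others.
KANJI = "\u4e00-\u9fff"                     # CJK Unified Ideographs
KANA = "\u3041-\u3096\u30a1-\u30fa\u30fc"   # hiragana, katakana, long-vowel mark

# A line break directly between two C/J characters is treated as folding.
FOLD_RE = re.compile(r"(?<=[%s%s])\r?\n(?=[%s%s])" % (KANJI, KANA, KANJI, KANA))
# A run is either all kanji or all kana, never a mix.
RUN_RE = re.compile("[%s]+|[%s]+" % (KANJI, KANA))


def join_folded_lines(text):
    """Remove line breaks that fall between two C/J characters."""
    return FOLD_RE.sub("", text)


def bigrams(run, keep_single):
    """Return the overlapping 2-character tokens of one run."""
    if len(run) == 1:
        return [run] if keep_single else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cj_tokens(text):
    """Yield bigram tokens from runs of kanji and runs of kana."""
    for match in RUN_RE.finditer(join_folded_lines(text)):
        run = match.group()
        is_kanji = "\u4e00" <= run[0] <= "\u9fff"
        # Single kanji are kept as words; single kana are discarded.
        yield from bigrams(run, keep_single=is_kanji)
```

For scale, the n^2 bound mentioned in the N.B. is 10,000 for kana
(n = 100) and 49,000,000 for n = 7000 kanji, while n^3 for the same
kanji set is 343,000,000,000.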

Problems:

o sb_dbexpimp.py becomes incompatible.

o Only the BMP range is supported.  Surrogates are not
  recognized.

o Tested with Japanese messages only, not with other
  East Asian messages.

o No batch tests.  This only aims at Japanese support.

Configuration:

o To support Unicode, the following must be set in .spambayesrc:
    [Tokenizer]
    replace_nonascii_chars: False



----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2006-06-10 17:21

Message:
Logged In: YES 
user_id=552329

The simple parts of this have been checked in.  At the
moment, that doesn't include the tokenizer changes (or the
unicode module) or a few of the "server" changes.  The
non-tokenizer changes will probably be checked in soon; it's
not clear what we'll do about the tokenizer ones (but at
least this should make things simpler since there are fewer
differences).

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-11-25 17:24

Message:
Logged In: YES 
user_id=529503

Auto-detect the charset of a message.
Some messages lack charset information (or fake it, in some spam).
Code was added to detect a suitable charset and to convert to
Unicode.

Unicodedata compatibility module for Python < 2.3.
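
A rough sketch of one way such detection could work: try the declared
charset first, then a fixed list of candidates, and accept the first
codec that decodes without error.  This is an assumption about the
approach, not the patch's code; the function name guess_decode and the
candidate order are hypothetical.  ISO-2022-JP is tried before the
8-bit encodings because it is pure 7-bit and also accepts plain ASCII.

```python
def guess_decode(raw, declared=None,
                 candidates=("iso-2022-jp", "utf-8", "euc-jp", "shift_jis")):
    """Return (text, charset) for the first codec that decodes raw cleanly.

    Caveat: a declared 7-bit charset such as us-ascii also accepts
    ISO-2022-JP bytes, so a real detector needs more care than this.
    """
    order = ([declared] if declared else []) + list(candidates)
    for charset in order:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset name: move on to the next guess.
            continue
    # Last resort: latin-1 maps every byte, so this cannot fail.
    return raw.decode("latin-1"), "latin-1"
```

A message whose header names an unknown or wrong charset then falls
through to the candidates instead of raising.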


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-11-10 16:12

Message:
Logged In: YES 
user_id=529503

Estimation of the Effect of the per_language_corpus Option

I prepared 4 test sets from 7987 ham and 2364 spam, broken down as follows:

                ham spam
arabic:           1    1
cyrillic:        26   63
greek:            1    0
hebrew:           5   10
ja:            6438   85
ko:               9   29
thai:             1    2
zh:               4  207
other/unknown: 1502 1967

TOTAL:         7987 2364

* Languages/scripts are determined by the main charset of each
message.
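
To make the skew between languages in the table above concrete, the
short sketch below computes the spam fraction per language from the
same counts.  The dictionary layout and the name spam_fraction are only
for illustration.

```python
# (ham, spam) counts copied from the table above.
COUNTS = {
    "arabic": (1, 1),
    "cyrillic": (26, 63),
    "greek": (1, 0),
    "hebrew": (5, 10),
    "ja": (6438, 85),
    "ko": (9, 29),
    "thai": (1, 2),
    "zh": (4, 207),
    "other/unknown": (1502, 1967),
}


def spam_fraction(language):
    """Return spam / (ham + spam) for one language row."""
    ham, spam = COUNTS[language]
    return spam / (ham + spam)


if __name__ == "__main__":
    for language in COUNTS:
        print("%-14s %.3f" % (language, spam_fraction(language)))
```

In this corpus roughly 1.3% of the Japanese messages are spam, against
roughly 98% of the Chinese ones.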


Then I ran the tests with:

$ python timtest.py --ham-keep 500 --spam-keep 500 -n 4

with ham/spam cutoffs 0.5 / 0.95.


Below is the average of 20 tests.

x-per_language_corpus: True

ham:spam:     6000:6000
fp total:            10
fp %:              0.17
fn total:           253
fn %:              4.23
unsure t:           947
unsure %:          7.90
real cost:      $543.15
best cost:      $695.66
h mean:            0.88
h sdev:            7.38
s mean:           81.68
s sdev:           31.74
mean diff:        80.80
k:                 2.07

x-per_language_corpus: False

ham:spam:     6000:6000
fp total:            24
fp %:              0.40
fn total:            81
fn %:              1.36
unsure t:           551
unsure %:          4.60
real cost:      $434.45
best cost:      $584.04
h mean:            3.00
h sdev:           13.03
s mean:           94.28
s sdev:           19.51
mean diff:        91.27
k:                 2.82

With x-per_language_corpus enabled, fp drops slightly (24 to 10),
but fn (81 to 253) and unsure (551 to 947) rise considerably.

So the x-per_language_corpus feature will be dropped
(the database will be compatible with the original again).


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-10-06 02:35

Message:
Logged In: YES 
user_id=529503

Update for 1.0-final.

- Normalize Unicode'ified texts by Normalization Form KC (NFKC).
- HTTP charset is fixed to UTF-8.  Option [html_ui] http_charset
  was removed.
- Some bug fixes.
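
As an illustration of what NFKC normalization does to typical Japanese
text, here is a small sketch using the standard unicodedata module.
The wrapper name normalize_text is hypothetical; the patch's own call
site is not shown in this thread.

```python
import unicodedata


def normalize_text(text):
    """Fold compatibility variants (full-width ASCII, half-width kana)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Full-width Latin letters and digits become plain ASCII.
    print(normalize_text("\uff33\uff30\uff21\uff2d\uff11\uff12\uff13"))
    # Half-width KA plus half-width voiced mark compose to full-width GA.
    print(normalize_text("\uff76\uff9e") == "\u30ac")
```

After normalization, full-width and half-width spellings of the same
word produce the same tokens.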


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-12-08 16:34

Message:
Logged In: YES 
user_id=529503

patch 1.0.

Per-language corpus.

The ham/spam ratio differs by the language of the message. This
affects performance.

NOTE: The format of the corpus has been changed. It now contains
per-language nham/nspam info and wordinfo.
PICKLE_VERSION is 6.

New configuration option: [Tokenizer] per_language_corpus


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-29 21:17

Message:
Logged In: YES 
user_id=529503

o hammie.py / sb_filter.py / sb_xmlrpcserver.py:
  - clues in X-Spambayes-Evidence: header will be 
    MIME header encoded.
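
A minimal sketch of MIME (RFC 2047) header encoding for non-ASCII
clues, using the standard email.header module from current Python 3.
The helper names and the "; " separator are assumptions for
illustration and do not reproduce the exact X-Spambayes-Evidence
format.

```python
from email.header import Header, decode_header, make_header


def encode_clues(clues, charset="utf-8"):
    """Join clue strings and return an RFC 2047 encoded header value."""
    value = "; ".join(clues)
    return Header(value, charset).encode()


def decode_clues(encoded):
    """Decode an RFC 2047 encoded header value back to text."""
    return str(make_header(decode_header(encoded)))


if __name__ == "__main__":
    encoded = encode_clues(["\u65e5\u672c", "\u7121\u6599"])
    print(encoded)                 # ASCII-only encoded-word form
    print(decode_clues(encoded))   # original text restored
```

The encoded value contains only ASCII, so it can be written into a
message header without breaking mail software that expects 7-bit
headers.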


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-26 21:31

Message:
Logged In: YES 
user_id=529503

server patch 1.0a7-0.6

o Dibbler performs HTTP charset conversion
  (to/from internal UTF-8).
o New configuration option: [html_ui] http_charset


----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-11-26 13:13

Message:
Logged In: YES 
user_id=552329

Added the sb_dbexpimp.py patch (v1.3).  Will look at the 
rest, shortly - thanks for your patience!

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-12 01:01

Message:
Logged In: YES 
user_id=529503

o sb_dbexpimp.py is incompatible again. It exports / imports data
  as UTF-8.

o Unicode'ified sb_server.py.
  - HTTP charset is UTF-8.
  - clues in X-Spambayes-Evidence will be MIME header
    encoded.


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-29 23:29

Message:
Logged In: YES 
user_id=529503

OK. I'll keep testing the code until it is added.

Minor fix: the 'replace_nonascii_chars' option now works correctly, etc.


----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-10-21 16:52

Message:
Logged In: YES 
user_id=552329

Just a wee note to say thanks for this, and that someone will 
get to looking at adding this in, but everyone's pretty busy 
with other stuff at the moment!

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-19 23:02

Message:
Logged In: YES 
user_id=529503

Fix for Korean messages.
Hangul phrases/words can be 1 or 2 chars long.


----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 22:19

Message:
Logged In: YES 
user_id=529503

minor fix.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 01:52

Message:
Logged In: YES 
user_id=529503

> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,

Not 'designate'.  'Invoke' is correct.


----------------------------------------------------------------------

_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs
