Patches item #824651, was opened at 2003-10-16 21:23
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Multibyte (CJK etc.) message support
Initial Comment:
This may also be applicable to other East Asian languages.
o Unicode'ify text:
For example, for Japanese messages, RFC 1468 recommends
that the ISO/IEC 2022 encoding scheme, with ASCII and a
multibyte character set both designated to GL, should be
used. The original tokenizer generates only bogus,
meaningless text fragments for Japanese messages.
o Concatenate C/J lines.
In Japanese (and maybe Chinese) messages, line folding
often breaks 'words'.
o Bigrams of C/J characters.
In Japanese (and often Chinese) messages, 'words' are
not separated by characters such as whitespace.
Tokenizing into grammatical 'words' would require heuristic
algorithms using a large corpus.
Instead of an expensive human-language parser, generate
bigrams from each run of kanji (ideographs for C/J/K) or run of
hiragana & katakana (syllabic letters for J).
N.B.:
- I believe the number of database items is roughly O(n^2)
for bigrams, O(n^3) for trigrams, ... and O(n^i) for i-grams,
where n is the size of the character set in use. For katakana &
hiragana, n is approximately 100. For kanji it is approx.
5000 (KS X 1001), 7000 (JIS X 0208), or more (Chinese
standards). For C/J messages, 3-grams or longer would
generate a very sparse and large database.
- Words of a single kanji should not be discarded by the
tokenizer, since most basic kanji words are 1 or 2
characters long.
Words of a single hiragana/katakana may be discarded.
- As far as I know, in Korean messages, phrases (not 'words'
but similar) are often separated by whitespace, so a run of
hangul (syllabic characters for K) should not be split into
n-grams.
o Punctuation --- what is 'punctuation'? Many
punctuation marks, spaces, signs and symbols registered in
the Unicode Standard are added to punctuation_run_re (for
compatibility, some of them overlap with
subject_words_re), since many of them are also
registered as punctuation or symbols in the C/J/K
character set standards.
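The line-joining and bigram ideas above can be sketched roughly as follows. This is only an illustration, not the code in the patch: the character ranges are approximate BMP ranges chosen for the sketch, and the names join_folded_lines, bigrams and cjk_tokens are hypothetical. It assumes that only an isolated single kanji is kept as a token, that a single kana is dropped, and that hangul runs are kept whole.

```python
import re

# Approximate BMP ranges; assumptions for this sketch, not the patch's tables.
KANJI = "\u4e00-\u9fff"                      # CJK Unified Ideographs
KANA = "\u3041-\u3096\u30a1-\u30fa\u30fc"    # hiragana, katakana, long-vowel mark
HANGUL = "\uac00-\ud7a3"                     # precomposed hangul syllables
CJ = KANJI + KANA

# A line break between two C/J characters is treated as a folding artifact.
_fold_re = re.compile("(?<=[%s])\r?\n(?=[%s])" % (CJ, CJ))
# One alternative per script, so each match is a run of a single kind.
_run_re = re.compile("([%s]+)|([%s]+)|([%s]+)" % (KANJI, KANA, HANGUL))


def join_folded_lines(text):
    """Remove line breaks that fall between two C/J characters."""
    return _fold_re.sub("", text)


def bigrams(run):
    """Return the overlapping 2-character substrings of a run."""
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cjk_tokens(text):
    """Yield bigram tokens for kanji/kana runs and whole hangul runs."""
    tokens = []
    for match in _run_re.finditer(join_folded_lines(text)):
        kanji, kana, hangul = match.groups()
        if kanji:
            # An isolated single kanji is kept as a token of its own.
            tokens.extend(bigrams(kanji) if len(kanji) > 1 else [kanji])
        elif kana:
            # A single kana produces no bigram and is therefore dropped.
            tokens.extend(bigrams(kana))
        else:
            # Hangul runs are delimited by whitespace and are not split.
            tokens.append(hangul)
    return tokens
```

For instance, cjk_tokens("東京都") gives ['東京', '京都'], and a text folded as "日本\n語" is tokenized as if it were "日本語".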
Problems:
o sb_dbexpimp.py becomes incompatible.
o Only the BMP range is supported. Surrogates are not
recognized.
o Tested with Japanese messages only, not with other
East Asian messages.
o No batch tests. This aims only at Japanese support.
Configuration:
o To support Unicode, the following must be set in .spambayesrc:
[Tokenizer]
replace_nonascii_chars: False
----------------------------------------------------------------------
>Comment By: Tony Meyer (anadelonbrin)
Date: 2006-06-10 17:21
Message:
Logged In: YES
user_id=552329
The simple parts of this have been checked in. At the
moment, that doesn't include the tokenizer changes (or the
unicode module) or a few of the "server" changes. The
non-tokenizer changes will probably be checked in soon; it's
not clear what we'll do about the tokenizer ones (but at
least this should make things simpler since there are fewer
differences).
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-11-25 17:24
Message:
Logged In: YES
user_id=529503
Auto-detect the charset of a message.
Some messages lack charset information (or, in some spam, fake it).
Code was added to detect a suitable charset and to convert the text to
Unicode.
Also added a unicodedata compatibility module for Python < 2.3.
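As a rough illustration of the charset fallback idea (not the detection code in the patch), one could try a list of candidate codecs in order. The function name to_unicode, the candidate list and its ordering are all assumptions made for this sketch; note that ISO-2022-JP is 7-bit, so a message falsely labelled us-ascii would decode "successfully" unless the escape sequences are checked first.

```python
def to_unicode(raw, declared=None):
    """Decode raw bytes, returning (text, charset_used).

    Hypothetical helper: tries ISO-2022-JP first when escape sequences
    are present, then the declared charset, then a few Japanese-oriented
    fallbacks chosen only for this sketch.
    """
    candidates = []
    if b"\x1b$" in raw or b"\x1b(" in raw:
        # ISO 2022 escape sequences are plain 7-bit bytes, so an ASCII
        # decode would not fail on them; prefer the stateful codec.
        candidates.append("iso2022_jp")
    if declared:
        candidates.append(declared)
    candidates += ["utf-8", "euc_jp", "shift_jis"]
    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or undecodable bytes: try the next one.
            continue
    # latin-1 maps every byte, so this last resort never raises.
    return raw.decode("latin-1"), "latin-1"
```

The order matters: some byte sequences are valid in more than one of these encodings, so this kind of guess can be wrong and is best treated as a heuristic.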
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-11-10 16:12
Message:
Logged In: YES
user_id=529503
Estimate of the Effect of the per_language_corpus Option
I prepared 4 test sets from 7987 ham and 2364 spam messages, including:
                 ham   spam
arabic:            1      1
cyrillic:         26     63
greek:             1      0
hebrew:            5     10
ja:             6438     85
ko:                9     29
thai:              1      2
zh:                4    207
other/unknown:  1502   1967
TOTAL:          7987   2364
* Languages/scripts are determined by the main charset of each
message.
Then I ran the tests with:
$ python timtest.py --ham-keep 500 --spam-keep 500 -n 4
with ham/spam cutoffs of 0.5 / 0.95.
Below are the averages of 20 tests.
x-per_language_corpus: True
ham:spam: 6000:6000
fp total: 10
fp %: 0.17
fn total: 253
fn %: 4.23
unsure t: 947
unsure %: 7.90
real cost: $543.15
best cost: $695.66
h mean: 0.88
h sdev: 7.38
s mean: 81.68
s sdev: 31.74
mean diff: 80.80
k: 2.07
x-per_language_corpus: False
ham:spam: 6000:6000
fp total: 24
fp %: 0.40
fn total: 81
fn %: 1.36
unsure t: 551
unsure %: 4.60
real cost: $434.45
best cost: $584.04
h mean: 3.00
h sdev: 13.03
s mean: 94.28
s sdev: 19.51
mean diff: 91.27
k: 2.82
x-per_language_corpus decreases fp a little (10 vs. 24) but increases fn
(253 vs. 81) and unsure (947 vs. 551) by much more.
So the x-per_language_corpus feature will be dropped
(the database will be compatible with the original again).
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-10-06 02:35
Message:
Logged In: YES
user_id=529503
Update for 1.0-final.
- Normalize Unicode'ified text to Normalization Form KC (NFKC).
- The HTTP charset is fixed to UTF-8. The [html_ui] http_charset
option was removed.
- Some bug fixes.
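To illustrate what NFKC normalization does to typical Japanese text, here is a minimal sketch using the standard unicodedata module. The wrapper name normalize_text is hypothetical; where the patch applies normalization in the tokenizer is not shown here.

```python
import unicodedata


def normalize_text(text):
    """Fold compatibility variants (full-width, half-width) via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits fold to their ASCII forms.
print(normalize_text("ＳＰＡＭ１２３"))
# Half-width katakana folds to ordinary katakana.
print(normalize_text("ｶﾀｶﾅ"))
```

With this folding, full-width and half-width spellings of the same word end up as the same token instead of being counted separately.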
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-12-08 16:34
Message:
Logged In: YES
user_id=529503
patch 1.0.
Per-language corpus.
The ham/spam ratio differs by the language of the message. This
affects performance.
NOTE: The format of the corpus has changed. It now contains
per-language nham/nspam info and wordinfo.
PICKLE_VERSION is 6.
New configuration option: [Tokenizer] per_language_corpus
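To make the point about differing ham/spam ratios concrete, the small sketch below computes the spam fraction per language from the message counts reported in the 2004-11-10 comment of this thread. It is plain arithmetic on those counts, not part of the patch; the names counts and spam_fraction are hypothetical.

```python
# (ham, spam) message counts per language, from the 2004-11-10 comment.
counts = {
    "ja": (6438, 85),
    "zh": (4, 207),
    "ko": (9, 29),
    "cyrillic": (26, 63),
    "other/unknown": (1502, 1967),
}


def spam_fraction(ham, spam):
    """Fraction of messages in a language group that are spam."""
    return spam / (ham + spam)


for lang, (ham, spam) in counts.items():
    print("%-14s %.3f" % (lang, spam_fraction(ham, spam)))
```

In that collection, about 1.3% of the Japanese messages are spam, against about 98% of the Chinese ones.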
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-29 21:17
Message:
Logged In: YES
user_id=529503
o hammie.py / sb_filter.py / sb_xmlrpcserver.py:
- clues in X-Spambayes-Evidence: header will be
MIME header encoded.
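A minimal sketch of MIME header encoding for non-ASCII clues, using the standard email.header module. The "word: probability" layout and the names encode_evidence and decode_evidence are assumptions for illustration only, not the format the patch writes.

```python
from email.header import Header, decode_header, make_header


def encode_evidence(clues):
    """Build an RFC 2047 encoded value from (word, probability) pairs."""
    # Illustrative layout only; the real header format may differ.
    text = "; ".join("%s: %.2f" % (word, prob) for word, prob in clues)
    header = Header(text, "utf-8", header_name="X-Spambayes-Evidence")
    return header.encode()


def decode_evidence(value):
    """Turn an encoded header value back into readable text."""
    return str(make_header(decode_header(value)))


encoded = encode_evidence([("無料", 0.99), ("free", 0.91)])
print(encoded)                   # an "=?utf-8?...?=" encoded word
print(decode_evidence(encoded))  # the original readable text
```

Encoding the value keeps the header 7-bit clean, so mail software that does not expect raw non-ASCII bytes in headers is not confused by Japanese clues.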
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-26 21:31
Message:
Logged In: YES
user_id=529503
server patch 1.0a7-0.6
o Dibbler performs HTTP charset conversion
(to/from internal UTF-8).
o New configuration option: [html_ui] http_charset
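The conversion between the browser-facing charset and the internal UTF-8 could look roughly like this. This is a sketch only: from_http and to_http are hypothetical names, and how Dibbler hooks in such a conversion is not shown.

```python
def from_http(raw, http_charset="utf-8"):
    """Convert bytes received over HTTP into internal UTF-8 bytes."""
    # Undecodable bytes become U+FFFD rather than raising an error.
    return raw.decode(http_charset, "replace").encode("utf-8")


def to_http(internal, http_charset="utf-8"):
    """Convert internal UTF-8 bytes into the configured HTTP charset."""
    # Characters missing from the target charset are replaced with "?".
    return internal.decode("utf-8").encode(http_charset, "replace")
```

With http_charset set to shift_jis, for example, form input would be converted to UTF-8 on the way in and pages converted back to Shift_JIS on the way out.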
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2003-11-26 13:13
Message:
Logged In: YES
user_id=552329
Added the sb_dbexpimp.py patch (v1.3). Will look at the
rest, shortly - thanks for your patience!
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-12 01:01
Message:
Logged In: YES
user_id=529503
o db_expimp.py is incompatible again. It exports / imports data
as UTF-8.
o Unicode'ified sb_server.py.
- HTTP charset is UTF-8.
- clues in X-Spambayes-Evidence will be MIME header
encoded.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-29 23:29
Message:
Logged In: YES
user_id=529503
OK. I'll keep testing the code until it is added.
Minor fix: the 'replace_nonascii_chars' option now works correctly, etc.
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2003-10-21 16:52
Message:
Logged In: YES
user_id=552329
Just a wee note to say thanks for this, and that someone will
get to looking at adding this in, but everyone's pretty busy
with other stuff at the moment!
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-19 23:02
Message:
Logged In: YES
user_id=529503
Fix for Korean messages.
Hangul phrases/words can be 1 or 2 chars long.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 22:19
Message:
Logged In: YES
user_id=529503
minor fix.
----------------------------------------------------------------------
Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 01:52
Message:
Logged In: YES
user_id=529503
> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,
Not 'designated'. 'Invoked' is correct.
----------------------------------------------------------------------
_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs