Bayesian machine for Bayesian filter

Alexey N. Vinogradov Thu, 27 Mar 2003 13:05:30 -0800

Hello, Kjartan. 
You wrote in <mid:[EMAIL PROTECTED]>


KÁ> That is the same method as used in the new Mozilla, right?

I don't know which method Mozilla uses, sorry...

KÁ> You say you are about to finish this project. When can we expect the
KÁ> first version to be born?

The first version will be born very soon. I have finished the
base-generating machine, and the remaining simple task is just to build
the filter itself from already written modules.

For the moment I have finished the utility, which can open and "parse"
mailbases, build and save frequency dictionaries, and build a "regarding
base" - exactly the one that will be used in the filter itself.

This utility can be downloaded at
http://klirik.narod.ru/arc/baesyan.exe (file size 294912 bytes)

This is not an installer; it is the executable itself.

Let me describe some features of this machine.

The method I chose is Paul Graham's approach ("A Plan for Spam"),
partially mixed with his "Better Bayesian Filtering".

I take the whole raw letter, including the RFC headers, as Paul Graham
does, and build a frequency dictionary from it.

Tokenization depends on whether a part is HTML or plain text.
If it is plain text, I use a simple definition of a token. If it is
HTML, I also scan encoded URLs and bogus HTML comments.

I distinguish between header and body tokens, like Paul Graham in
"Better Bayesian Filtering", but I do not classify a token as belonging
to "to", "from", "subject" or other parts, only as "body" or
"headers".

All tokens from headers go to the frequency dictionary with the prefix
"_h " - for example, "_h Enlarge".
All bogus HTML comments go to the frequency dictionary as the token
"_s spam".
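
As a rough illustration of the header/body split (a sketch only, not the
code of the machine), the counting could look like this in Python. The
token pattern is an assumption borrowed from "A Plan for Spam"; the
names TOKEN_RE and tokenize are hypothetical.

```python
import re
from collections import Counter

# Assumed token definition: runs of letters, digits, apostrophes,
# dashes and dollar signs. The machine's own definition may differ.
TOKEN_RE = re.compile(r"[\w'$-]+")

def tokenize(raw_letter: str) -> Counter:
    """Count tokens in a raw letter, marking header tokens with "_h "."""
    # Headers end at the first empty line (line ends assumed to be "\n").
    headers, _, body = raw_letter.partition("\n\n")
    freq = Counter()
    for tok in TOKEN_RE.findall(headers):
        freq["_h " + tok] += 1
    for tok in TOKEN_RE.findall(body):
        freq[tok] += 1
    return freq

letter = "Subject: Enlarge now\nFrom: x@example.com\n\nEnlarge your profits"
freq = tokenize(letter)
```

Here "Enlarge" in the subject and "Enlarge" in the body end up as two
different dictionary entries.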

Each attachment is treated as a single token, for example
"_F jpeg<SZ00004f50:CC00138de6>", where the prefix "_F " means "file",
"jpeg" is the content subtype, and the angle brackets hold the size and
the CRC, prefixed by SZ and CC.
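
A token of this shape could be produced as in the sketch below. Two
things are assumptions here: that both numbers are written as 8 hex
digits (this matches the example above), and that the CRC is a CRC-32;
the post does not say which CRC is used. The function name
attachment_token is hypothetical.

```python
import zlib

def attachment_token(subtype: str, data: bytes) -> str:
    """Build a "_F " token from the content subtype, size and CRC."""
    size = len(data)
    # Assumption: CRC-32. The real machine may use another checksum.
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return f"_F {subtype}<SZ{size:08x}:CC{crc:08x}>"

# 20 bytes of dummy data standing in for a real attachment.
token = attachment_token("jpeg", b"\xff\xd8\xff\xe0" + b"\x00" * 16)
```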

Base64 and Quoted-Printable parts are decoded before processing.
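
For readers who want to try this step themselves, here is a minimal
sketch with the Python standard library. The function decode_part is a
hypothetical helper; it says nothing about how the machine itself does
the decoding.

```python
import base64
import quopri

def decode_part(payload: bytes, transfer_encoding: str) -> bytes:
    """Undo the Content-Transfer-Encoding of one message part."""
    enc = transfer_encoding.strip().lower()
    if enc == "base64":
        return base64.b64decode(payload)
    if enc == "quoted-printable":
        return quopri.decodestring(payload)
    # 7bit, 8bit and binary parts need no decoding.
    return payload

plain_b64 = decode_part(b"RnJlZSBtb25leQ==", "base64")
plain_qp = decode_part(b"Free=20money", "Quoted-Printable")
```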

I also implemented some locale features, because I am Russian and this
matters for me and for the mail I receive:

I read the XLT tables from The Bat! registry and apply them to the
decoded text when necessary (for Russian this happens often, because
there are at least two popular encodings, win-1251 and koi8-r). So these
cases are handled correctly.

Also, spammers sometimes replace some national letters in words with
look-alike English letters (such as p, e, a, c and so on). I handle this
case too, and it is very simple to handle it for other languages as
well.
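
To show why the charset matters, here is a small sketch in which
Python's built-in codecs stand in for the XLT tables (the machine itself
uses the tables of The Bat!, not Python). The names CODECS and to_text,
and the latin-1 fallback, are assumptions made for the example.

```python
# Python codec names for the two encodings mentioned above.
CODECS = {"windows-1251": "cp1251", "koi8-r": "koi8_r"}

def to_text(data: bytes, charset: str) -> str:
    """Decode bytes by the declared charset so tokens compare equal."""
    codec = CODECS.get(charset.strip().lower(), "latin-1")
    return data.decode(codec, errors="replace")

word = "\u0441\u043f\u0430\u043c"  # the Russian word for "spam"
as_1251 = word.encode("cp1251")
as_koi8 = word.encode("koi8_r")
```

The same word is a different byte sequence in each encoding, so without
this step it would land in the dictionary as two unrelated tokens.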
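
The idea can be sketched as follows. The mapping is only an illustrative
subset, and the rule "touch only words that already contain Cyrillic
letters" is an assumption of the sketch; in the machine the mapping
comes from an XLT table. LOOKALIKES and normalize_word are hypothetical
names.

```python
import re

# Latin letter -> look-alike Cyrillic letter (illustrative subset).
LOOKALIKES = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})
CYRILLIC = re.compile("[\u0400-\u04ff]")

def normalize_word(word: str) -> str:
    """Map Latin look-alikes back to Cyrillic in mixed-script words."""
    # Pure Latin words are left alone; only mixed words are suspicious.
    if CYRILLIC.search(word):
        return word.translate(LOOKALIKES)
    return word

# Cyrillic "s", "p", then a LATIN "a", then Cyrillic "m".
disguised = "\u0441\u043f" + "a" + "\u043c"
```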

More specific features:

This machine works directly with The Bat! mailbases (*.tbb files). I
chose this format because it is better than plain *.eml when you
work with a big corpus of letters.

If you keep your attachments outside the base, that is no problem. The
machine will find the necessary files on your disk and index them.

In this machine, scanning for bogus HTML comments is limited to
comments that consist entirely of digits. This is not a limitation, it
is just an option. Scanning for any comment without spaces, and for any
comment at all, is also implemented but not switched on for the moment
(scanning for any comment may not be good for embedded scripts).
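
The three scanning modes could be sketched like this. The mode names
"digits", "nospace" and "any" are made up for the example, and the
regular expression is a simplification that does not parse HTML
properly.

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def count_bogus_comments(html: str, mode: str = "digits") -> int:
    """Count comments that would be recorded as the "_s spam" token."""
    count = 0
    for body in COMMENT_RE.findall(html):
        if mode == "digits":      # the option that is switched on now
            bogus = re.fullmatch(r"[0-9]+", body) is not None
        elif mode == "nospace":   # any non-empty comment without spaces
            bogus = body != "" and not any(ch.isspace() for ch in body)
        else:                     # "any": every comment counts
            bogus = True
        count += bogus
    return count

html = "Vi<!--8371-->agra <!-- real note --> <!--x9z-->"
```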

So, what is this machine for, and how do you use it?

With this utility you can (and in the near future will) build the
regard base that the filter itself needs in order to work. I'll release
the filter very soon, maybe even tomorrow, so you can already build a
base for it.

First, create two folders in The Bat! and fill one of them with spam
mail and the other with your non-spam mail. Remove all encrypted
messages (or decrypt them first). Then compress both folders.

Open a folder in baesyan.exe by pressing "Open a base". If the file is
OK, the button "Parse mailbase" will be enabled. Press it. The process
of building the frequency dictionary for the whole mailbase will begin.

When the whole base has been parsed, you'll see the number of parsed
letters and information about the current dictionary. Now, if you want,
you can open another base and parse it too - all results are
accumulated in the current dictionary.
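
In other words, the current dictionary behaves like a running counter
over every base parsed so far. A minimal sketch of that behaviour (the
class Dictionary and its fields are hypothetical, not taken from the
program):

```python
from collections import Counter

class Dictionary:
    """Frequency dictionary accumulated over one or more mailbases."""

    def __init__(self) -> None:
        self.letters = 0        # number of letters parsed so far
        self.freq = Counter()   # token -> number of occurrences

    def add_base(self, letters) -> None:
        # Each letter is given as an iterable of its tokens.
        for tokens in letters:
            self.freq.update(tokens)
            self.letters += 1

current = Dictionary()
current.add_base([["buy", "now"], ["buy", "cheap"]])   # first base
current.add_base([["cheap", "cheap", "offer"]])        # second base
```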

If you want, you can view the current dictionary by pressing "Show
dict", but be aware that this can take a long time, especially if many
letters have been parsed.

Finally, you can (and must) save the current dictionary by pressing
"Save dictionary".

You can also open a previously saved dictionary with "Read dictionary".

When the current dictionary is not empty, you can assign it as the
"Spam" or "Non-spam" dictionary for building the regarding base. When
both of these dictionaries are assigned, you can generate the regarding
base by pressing "Yes!".

WARNING! Once you assign the current dictionary as Spam or Non-Spam, it
is no longer possible to save it or to continue using it as the current
dictionary! So save any unfinished dictionaries before experimenting.

After the regarding base has been built, you will see its size in words
on the screen. In this version of the machine you must save the regard
base immediately by pressing "Save regards". Because of a small
error (I apologize for it), only in this case will the output file
contain the correct size of the base (if, for example, you press "Show"
or save it once before, the regard base will be written to the file
with the wrong size).

Then, if you want, you can generate the regard base again (by pressing
"Yes!" again) and display it by pressing "Show". You will see the words
and their "probabilities". 0.01 is clean mail, 0.99 is clean spam.

The file with the saved regarding base will be used by the filter.
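
For reference, this is how such probabilities come out of the formula
published in "A Plan for Spam": non-spam counts are doubled, rare words
are skipped, and the result is clamped to 0.01 and 0.99. The sketch
assumes the machine follows that formula unchanged, which is not stated
anywhere above; build_regards and its parameters are hypothetical names.

```python
def build_regards(spam, ham, nbad, ngood, min_count=5):
    """Per-token spam probability from two frequency dictionaries.

    spam, ham: token -> count; nbad, ngood: number of letters (> 0).
    """
    regards = {}
    for token in set(spam) | set(ham):
        g = 2 * ham.get(token, 0)   # non-spam counts are doubled
        b = spam.get(token, 0)
        if g + b < min_count:       # too rare to say anything
            continue
        bad_ratio = min(1.0, b / nbad)
        good_ratio = min(1.0, g / ngood)
        p = bad_ratio / (good_ratio + bad_ratio)
        regards[token] = max(0.01, min(0.99, p))
    return regards

spam = {"viagra": 10, "hello": 2, "rare": 1}
ham = {"hello": 8, "meeting": 5}
regards = build_regards(spam, ham, nbad=10, ngood=10)
```

A word seen only in spam gets 0.99, a word seen only in non-spam gets
0.01, and everything else falls in between.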

Paul Graham wrote that his corpora contain about 4000 letters each.
I don't have that much spam at the moment - so far I have collected only
570. Perhaps you don't either, because spam is usually deleted... So
this is a reason to keep some spam for training.

One last word. I said that the machine handles Russian
"partially transliterated" words. To switch this feature on, you just
need to create a new XLT table called "translit" in The Bat!. In this
table, simply map some LATIN letters to the corresponding Russian ones
(not the other way round!). The machine reads the XLT tables directly
from The Bat! section of the Windows registry, and if a table with the
mime-tag "translit" is found, it will be used for decoding partially
transliterated words.

P.S. I also have no ideas for a name for the future filter.
Any suggestions?


-- 
Sincerely,
 Alexey.
Using TB 1.63b7 on WinXP SP1 Corp + MUI RU, spelling by ORFO2002
   mailto:[EMAIL PROTECTED]


________________________________________________
Current version is 1.62 | "Using TBDEV" information:
http://www.silverstones.com/thebat/TBUDLInfo.html
