Antispam filtering conception...

Alexey N. Vinogradov Sun, 23 Feb 2003 13:00:18 -0800

Hello, tbdev.

Practically all antispam developers (with a few exceptions)
discuss and develop language-oriented software, which can be
used effectively only on letters written in one particular language.
Let me try to define some principles which, I think, can be used for
spam filtering...


First, about the method of filtering: 'simple-occurrence' filtering,
where you just test your input for any signal line from the spam list,
does work, but poorly. It can, of course, be effective for a 'simple'
language like English, but it is much weaker for a 'complicated' one
like Russian. English words usually have only two forms, namely the
'direct' and 'indirect' case, but for Russian you have to test 6 cases
times 2 numbers: approx. 12 variants per word root! Even with regexps
it is not an easy task to recognize all the grammar forms of a human
language and express them in a short, human-understandable form, even
for programmers, let alone for 'endusers'. So I think that the
method (particularly for the Russian language) has to be 'smarter' than
'simple-occurrence', i.e. statistical, integral or heuristic.
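
To make the weakness concrete, here is a minimal sketch of a
'simple-occurrence' test in Python. The spam list and the function name
are assumptions made only for this illustration; they are not taken
from any existing filter.

```python
# Hypothetical spam list: each entry is one "signal line".
SIGNAL_LINES = ["buy a dog"]

def is_spam_simple(text, signal_lines=SIGNAL_LINES):
    """Flag the text if any signal line occurs in it verbatim (case-insensitive)."""
    lowered = text.lower()
    return any(line in lowered for line in signal_lines)
```

With this list, "Buy a dog today" is flagged, but "He buys dogs" is
not: every extra grammar form needs its own signal line, and that is
exactly what grows to a dozen or more entries per word in Russian.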

Thinking about other methods, I came up with five (for the moment)
'steps' or 'levels', which can be used either independently or one
after another. They apply only to filtering the text of a letter! The
'levels' are:

1. "Frequency database" - it is necessary to have two dictionaries:
one for 'clean' letters and one for spam. I think this is the very
method which Stefan Tanurkov mentioned as "Bayesian filtering", so
this is a 'brute-force' statistical method.

2. "Sequence frequency database" - analyses a letter not simply by
word occurrence, but by word sequences. With this kind of rule it can
be defined, for example, that any 3-word phrase containing 'buy' and
'dog' is treated as spam (as in 'buy a dog'), but the same words
further apart are not spam (as in 'buy an umbrella for me, and take
some food for my dog').

3. "Grammar forms database" - analyses like the 2nd level but uses the
'canonical' form of every tested word. It may not be necessary for
English (where you have only 'buy' and 'buys'), but it can be useful
for Russian and other complicated languages, where you can meet
'prodavat', 'prodat', 'prodal', 'prodayu', 'prodayot', 'prodaval',
'prodayotsya', and you can't just filter with 'proda.+?' because of the
word 'prodazha', which is different (a noun, not a verb). So this is
the same as the 2nd level but uses 'grammar compression' to reduce the
total number of rules. Don't forget that the rules can be edited by
'endusers', and for them it is not comfortable to test for about 50
phrases consisting of the same two words in their different grammar
forms; it is easier to set one rule for a 'canonical' phrase.

4. "Heuristic grammar" - also makes sense mainly for complicated
languages. This is the same as the previous level, but together with
the canonical form it also analyses the accompanying grammar form. For
example, 'prodayu sobaku' is not the same as 'proday sobaku' in
Russian. The first is 'I am selling a dog' and the second is 'sell me a
dog' (imperative). A heuristic function can assign an appropriate
weight to a word based on its grammar descriptor: is it an
'imperative', 'indicative', or 'question' form?

5. "Category test" - together with a grammar category, also define a
'sense' category for a word using a database. For example, 'rubl',
'dollar', 'pound', 'euro' are category 'money'; 'sell', 'buy', 'price',
'freely' are category 'selling'; 'porn', 'dogs', 'sex' are category
'products'. Rules can be like: 'If in a sentence somebody tries to
'sell' you a 'product' for 'money', then the letter is very possibly
spam'.
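
Levels 1 and 2 could be sketched in Python roughly as follows. This is
only one possible reading: the smoothed log-ratio score is a naive
Bayes-style choice assumed here, not necessarily the method Stefan
Tanurkov had in mind, and all function names are invented for the
example. The tokenizer handles ASCII letters only; Russian text would
need a Unicode-aware pattern.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (ASCII letters only)."""
    return re.findall(r"[a-z']+", text.lower())

def build_dictionary(letters):
    """Level 1: count how often each word occurs in a set of letters."""
    counts = Counter()
    for letter in letters:
        counts.update(tokenize(letter))
    return counts

def spam_score(text, spam_counts, clean_counts):
    """Sum of add-one-smoothed log-ratios; a positive score means 'more like spam'."""
    vocab = len(set(spam_counts) | set(clean_counts)) or 1
    spam_total = sum(spam_counts.values()) + vocab
    clean_total = sum(clean_counts.values()) + vocab
    score = 0.0
    for word in tokenize(text):
        p_spam = (spam_counts[word] + 1) / spam_total
        p_clean = (clean_counts[word] + 1) / clean_total
        score += math.log(p_spam / p_clean)
    return score

def words_within_window(text, required, window=3):
    """Level 2: True if all required words fall inside some run of `window` words."""
    tokens = tokenize(text)
    required = set(required)
    for start in range(max(1, len(tokens) - window + 1)):
        if required <= set(tokens[start:start + window]):
            return True
    return False
```

With window=3, 'I want to buy a dog' matches the pair 'buy' + 'dog',
while 'buy an umbrella for me, and take a food for my dog' does not,
because the two words are too far apart.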

All levels can be used together or in any combination.
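
As an example of such a combination, here is a rough Python sketch that
joins level 3 (canonical forms) with level 5 (sense categories). Both
lookup tables are hypothetical stand-ins for a real database; the
Russian entries simply follow the grouping given in level 3, and the
choice of 'prodavat' as the canonical form is an assumption.

```python
import re

# Hypothetical canonical-form table (level 3). Note that 'prodazha' is
# deliberately absent: it is a noun and must not collapse into the verb.
CANONICAL = {
    "buys": "buy", "sells": "sell", "dogs": "dog",
    "dollars": "dollar", "pounds": "pound", "euros": "euro",
    "prodat": "prodavat", "prodal": "prodavat", "prodayu": "prodavat",
    "prodayot": "prodavat", "prodaval": "prodavat",
    "prodayotsya": "prodavat",
}

# Hypothetical sense categories (level 5), keyed by canonical form.
CATEGORIES = {
    "money": {"rubl", "dollar", "pound", "euro"},
    "selling": {"sell", "buy", "price", "freely"},
    "products": {"porn", "dog", "sex"},
}

def canonical(word):
    """Map a word form to its canonical form; unknown words stay as they are."""
    return CANONICAL.get(word, word)

def sentence_categories(sentence):
    """Return the names of all categories that occur in one sentence."""
    words = {canonical(w) for w in re.findall(r"[a-z]+", sentence.lower())}
    return {name for name, members in CATEGORIES.items() if words & members}

def looks_like_offer(text):
    """True if some single sentence mentions selling, a product and money."""
    for sentence in re.split(r"[.!?]+", text):
        if sentence_categories(sentence) == set(CATEGORIES):
            return True
    return False
```

Here 'We sell dogs for 100 dollars. Reply today!' is flagged, while
'My dog is ill. I must buy medicine.' is not, because no single
sentence there combines all three categories. One rule written in
canonical forms covers every grammar form listed in the table.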

Second, it is impossible for an enduser to define rules for such a
complicated spam filter! So it is necessary to make a second part of
the filtering machine: something like an analyzer which will 'eat'
letters known to be spam (by parsing a mail folder from The Bat!, for
example) and try to recognize the rules. Then this analyzer asks the
enduser something like 'By investigating 100 spam letters it was
found that 30 of them include "the white dog". Do you want to
save this result as a rule for spam filtering?'. Because the analysis
is a second complicated task, it does not have to be a 'hard' part of
the antispam filter. It can be a logically separate part: a second
application, independent of the filter and The Bat!, which works, for
example, directly with The Bat! mailboxes.
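
The core of such an analyzer might look like the Python sketch below.
It assumes the spam letters are already available as plain strings;
reading The Bat! mailboxes is left out, and the function names and the
30% threshold are chosen only for illustration.

```python
import re
from collections import Counter

def phrases(text, length=3):
    """Return the set of all word sequences of the given length in a letter."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {" ".join(tokens[i:i + length])
            for i in range(len(tokens) - length + 1)}

def suggest_rules(spam_letters, length=3, min_share=0.3):
    """List (phrase, letter count) pairs found in at least min_share of the letters."""
    letter_counts = Counter()
    for letter in spam_letters:
        # phrases() returns a set, so each letter counts a phrase once.
        letter_counts.update(phrases(letter, length))
    total = len(spam_letters)
    found = [(phrase, n) for phrase, n in letter_counts.items()
             if total and n / total >= min_share]
    # Most frequent first, ties in alphabetical order.
    return sorted(found, key=lambda item: (-item[1], item[0]))

def ask_user(phrase, n, total):
    """Build the question that would be shown to the enduser."""
    return (f'By investigating {total} spam letters it was found that '
            f'{n} of them include "{phrase}". Save this as a rule?')
```

Each suggestion is only a candidate: the enduser still decides whether
it is saved as a rule for the filter.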



-- 
Sincerely,
 Alexey.
Using TB 1.63b7 on WinXP SP1 Corp + MUI RU, spelling by ORFO2002
  mailto:[EMAIL PROTECTED]


________________________________________________
Current version is 1.62 | "Using TBDEV" information:
http://www.silverstones.com/thebat/TBUDLInfo.html
