Hello, tbdev.

Practically all anti-spam developers (with few exceptions) discuss and develop language-specific software, which can be used effectively only on letters written in the language it was built for. Let me try to define some principles which, I think, can be used for spam filtering...
First, about the method of filtering. 'Simple-occurrence' filtering, where you just test your input for any signal line from the spam list, does work, but poorly. It can, of course, be effective for a 'simple' language like English, but it is much weaker for a 'complicated' one like Russian. English usually has only two forms of a word, namely the 'direct' and the 'indirect' case, while for Russian you have to test 6 cases times 2 numbers - approx. 12 variants per word root! Even with regexps it is not an easy task to recognize all grammar forms of a human language and express them in a short, human-understandable form - not for programmers, and certainly not for end users. So I think the method (particularly for Russian) has to be 'smarter' than 'simple-occurrence', i.e. statistical, integral or heuristic.

Thinking about other methods, I came up with five (for the moment) 'steps' or 'levels', which can be used either independently or one after another. This is only for filtering the text of a letter! The levels are:

1. "Frequency database" - requires two dictionaries, one for 'clean' letters and one for spam. I think this is the very method that Stefan Tanurkov mentioned as "Bayesian filtering". So this is the 'brute-force' statistical method.

2. "Sequence frequency database" - analyses a letter not simply by word occurrence but by word sequence. With this kind of rule you can define, for example, that any 3-word phrase containing both 'buy' and 'dog' is treated as spam (as in 'buy a dog'), while the same words further apart are not spam (as in 'buy an umbrella for me, and take a food for my dog').

3. "Grammar forms database" - analyses like the 2nd level, but uses a 'canonical' form of every tested word. This may not be necessary for English (where you have only 'buy' and 'buys'), but it can be useful for Russian and other complicated languages, where you can meet 'prodavat', 'prodat', 'prodal', 'prodayu', 'prodayot', 'prodaval', 'prodayotsya' - and you can't just filter with 'proda.+?' because of the word 'prodazha', which is a different word (a noun, not a verb). So this is the same as the 2nd level, but it uses 'grammar compression' to reduce the total number of rules. Don't forget that the rules can be edited by end users: for them it is not comfortable to test for about 50 phrases consisting of the same two words in different grammar forms, and much easier to set one rule for a 'canonical' phrase.

4. "Heuristic grammar" - also meaningful mainly for complicated languages. This is the same as the previous level, but along with the canonical form it also analyses the accompanying grammar form. For example, in Russian 'prodayu sobaku' is not the same as 'proday sobaku': the first is 'I am selling a dog' (indicative) and the second is 'sell [me] a dog' (imperative). A heuristic function can assign an appropriate weight to a word based on its grammar descriptor - is it an 'imperative', 'indicative' or 'question' form?

5. "Category test" - together with a grammar category, also defines a 'sense' category for a word using a database. For example, 'rubl', 'dollar', 'pound', 'euro' are category 'money'; 'sell', 'buy', 'price', 'freely' are category 'selling'; 'porn', 'dogs', 'sex' are category 'products'. Rules can be like: if in a sentence somebody tries to 'sell' you a 'product' for 'money', then the letter is very possibly spam.

All levels can be used together or in any combination.

Second, it is impossible for an end user to define the rules for such a complicated spam filter! So it is necessary to build a second part of the filtering machine: something like an analyzer which will 'eat' letters known to be spam (by parsing a mail folder from The Bat!, for example) and try to recognize the rules.
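
To make level 2 and the analyzer idea a little more concrete, here is a minimal Python sketch. It is only an illustration under simple assumptions, not a finished design: the function names (tokenize, matches_window_rule, propose_rules), the window of 3 words and the 30% threshold are hypothetical choices made for the example, and it works on raw word forms, so it ignores the grammar handling of levels 3 and 4 completely.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (Latin or Cyrillic)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def matches_window_rule(text, words, window=3):
    """Level 2 sketch: True if all `words` occur together inside
    some run of `window` consecutive tokens of the text."""
    tokens = tokenize(text)
    wanted = set(words)
    for i in range(max(1, len(tokens) - window + 1)):
        if wanted <= set(tokens[i:i + window]):
            return True
    return False


def propose_rules(spam_letters, phrase_len=3, min_share=0.3):
    """Analyzer sketch: count in how many spam letters each phrase of
    `phrase_len` consecutive words occurs, and return (phrase, count)
    pairs for phrases found in at least `min_share` of the letters."""
    counts = Counter()
    for letter in spam_letters:
        tokens = tokenize(letter)
        phrases = {
            " ".join(tokens[i:i + phrase_len])
            for i in range(len(tokens) - phrase_len + 1)
        }
        counts.update(phrases)  # a set, so each letter counts a phrase once
    threshold = min_share * len(spam_letters)
    return [(p, n) for p, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # The two example sentences from level 2.
    print(matches_window_rule("I want to buy a dog", ["buy", "dog"]))
    print(matches_window_rule(
        "buy an umbrella for me, and take a food for my dog", ["buy", "dog"]))

    # A tiny made-up "spam folder" for the analyzer.
    letters = [
        "Buy the white dog today",
        "the white dog is cheap",
        "hello from your friend Ivan",
        "meeting moved to Monday morning",
    ]
    for phrase, count in propose_rules(letters, min_share=0.5):
        print(f"{count} of {len(letters)} letters include '{phrase}'")
```

In this sketch the proposals are only returned, not saved; showing each one to the user for confirmation would be a separate step.
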
The analyzer then asks the end user something like: 'After investigating 100 spam letters, it was found that 30 of them include "the white dog". Do you want to save this result as a rule for spam filtering?'

Because analysis is a second complicated task in its own right, it does not have to be a 'hard' part of the anti-spam filter. It can be a separate logical part: a second application, independent of both the filter and The Bat!, which works, for example, directly with The Bat! mailboxes.

-- 
Sincerely,
 Alexey.
Using TB 1.63b7 on WinXP SP1 Corp + MUI RU, spelling by ORFO2002
mailto:[EMAIL PROTECTED]

________________________________________________
Current version is 1.62 | "Using TBDEV" information:
http://www.silverstones.com/thebat/TBUDLInfo.html

