On Wed, 25 May 2016 15:07:37 +0100 Paul Stead <paul.st...@zeninternet.co.uk> wrote:
> Consider the following 2 basic emails:
>
> Mail 1:
> Viagra
>
> Mail 2:
> V1agra

Yes, except here's the problem. A drug company might legitimately talk
about Viagra, so that wouldn't be a spam token. V1agra almost certainly
would be a spam token. Bayes can distinguish between the two; "concepts"
cannot.

"Concepts" is a lossy process. You are throwing away information. It
probably helps a bit in small installations where there's not much Bayes
data to go on, but if you have a very large Bayes corpus, I bet it's no
better than Bayes and possibly even worse.

Furthermore, "concepts" is playing a game of whack-a-mole as spammers come
up with more creative misspellings and other variations on evading the
concept detector. Do you really want to spend your days writing rules to
detect:

vіÅᏀʀâ, Ꮩɩɑɢᚱà, etc.

and all the exponentially-numerous possible combinations?

(Credits to http://www.irongeek.com/homoglyph-attack-generator.php)

Regards,

Dianne.
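To make the "lossy" point concrete, here is a rough sketch in Python. It is not SpamAssassin's Bayes implementation; the corpus counts, the CONCEPTS mapping and the spam_probabilities helper are all made up for illustration, and the estimate is a plain add-one-smoothed ratio. It assumes a site that sees "viagra" mostly in legitimate mail and "v1agra" only in spam.

```python
from collections import Counter

# Hypothetical toy corpus of (token, is_spam) observations.
# Assumption: a pharma site gets plenty of legitimate mail saying "viagra".
OBSERVATIONS = (
    [("viagra", False)] * 40
    + [("viagra", True)] * 10
    + [("v1agra", True)] * 50
)

# Hypothetical "concept" mapping that folds both spellings into one token.
CONCEPTS = {"viagra": "CONCEPT_VIAGRA", "v1agra": "CONCEPT_VIAGRA"}


def spam_probabilities(observations, normalize=None):
    """Return {token: estimated P(spam | token)} using add-one smoothing."""
    spam, ham = Counter(), Counter()
    for token, is_spam in observations:
        key = normalize(token) if normalize else token
        if is_spam:
            spam[key] += 1
        else:
            ham[key] += 1
    return {
        key: (spam[key] + 1) / (spam[key] + ham[key] + 2)
        for key in spam.keys() | ham.keys()
    }


# Raw tokens: the two spellings keep separate statistics.
raw = spam_probabilities(OBSERVATIONS)

# Concept-folded tokens: both spellings share one set of statistics.
folded = spam_probabilities(OBSERVATIONS, normalize=lambda t: CONCEPTS.get(t, t))

for name, table in (("raw", raw), ("folded", folded)):
    for token, p in sorted(table.items()):
        print(f"{name:7s} {token:15s} {p:.2f}")
```

With these made-up counts, the raw tokens come out at roughly 0.21 for "viagra" and 0.98 for "v1agra", while the folded token lands near 0.60: the strong spam signal from the misspelling and the ham signal from the correct spelling are averaged into one weak score.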