On Wed, 25 May 2016 15:07:37 +0100
Paul Stead <paul.st...@zeninternet.co.uk> wrote:

> Consider the following 2 basic emails:

> Mail 1:
> Viagra

> Mail 2:
> V1agra

Yes, except here's the problem.  A drug company might legitimately
talk about Viagra, so that wouldn't be a spam token.  V1agra almost
certainly would be a spam token.  Bayes can distinguish between the
two; "concepts" cannot.

Mapping tokens to "concepts" is a lossy process: you are throwing away
information.  It probably helps a bit in small installations where
there's not much Bayes data to go on, but if you have a very large
Bayes corpus, I bet it's no better than plain token-based Bayes and
possibly even worse.
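
To make the information loss concrete, here is a minimal sketch.  The
counts are made up for illustration, the CONCEPTS mapping and the
helper names are hypothetical, and the probability is a simple
spam fraction rather than anything SpamAssassin's Bayes actually
computes.  It also assumes no message contains both spellings.

```python
from collections import Counter

# Hypothetical training counts (invented for illustration): number of spam
# and ham messages that contained each token.
spam_counts = Counter({"viagra": 40, "v1agra": 60})
ham_counts = Counter({"viagra": 40, "v1agra": 0})

# Hypothetical "concept" mapping that folds both spellings into one tag.
CONCEPTS = {"viagra": "CONCEPT_VIAGRA", "v1agra": "CONCEPT_VIAGRA"}


def spam_probability(spam_hits, ham_hits):
    """Fraction of the messages containing a token that were spam."""
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.5


def per_token(spam, ham):
    """Spam probability for every token seen in either corpus."""
    return {t: spam_probability(spam[t], ham[t]) for t in set(spam) | set(ham)}


def fold(counts):
    """Replace each token by its concept tag, summing the counts."""
    folded = Counter()
    for token, n in counts.items():
        folded[CONCEPTS.get(token, token)] += n
    return folded


raw = per_token(spam_counts, ham_counts)
merged = per_token(fold(spam_counts), fold(ham_counts))

print(raw["viagra"], raw["v1agra"])   # neutral token vs. strong spam token
print(merged["CONCEPT_VIAGRA"])       # one blended score for both spellings
```

With these invented counts the raw tokens score 0.5 and 1.0, while the
folded concept gets a single value in between, so the strong signal
carried by "V1agra" is diluted.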

Furthermore, "concepts" is playing a game of whack-a-mole as spammers
come up with ever more creative misspellings and other ways to evade
the concept detector.  Do you really want to spend your days writing
rules to detect:

vіÅᏀʀâ, Ꮩɩɑɢᚱà, etc., and all the exponentially numerous possible combinations?

(Credits to http://www.irongeek.com/homoglyph-attack-generator.php)
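
A rough sketch of how fast the combinations grow.  The HOMOGLYPHS
table below is a tiny hand-picked sample chosen for illustration (it
is not the list used by the generator linked above), and the function
names are hypothetical.  Each letter can stay as-is or be swapped for
any of its lookalikes, so the number of spellings is the product of
the per-letter choices.

```python
from itertools import product
from math import prod

# Small illustrative lookalike table (an assumption, far from complete).
HOMOGLYPHS = {
    "v": ["\u03bd", "\u0475"],                      # Greek nu, Cyrillic izhitsa
    "i": ["1", "\u0456", "\u0269"],                 # digit one, Cyrillic i, Latin iota
    "a": ["\u0430", "\u0251", "\u00e0", "\u00e2"],  # Cyrillic a, Latin alpha, a-grave, a-circumflex
    "g": ["\u0262"],                                # small capital G
    "r": ["\u0280"],                                # small capital R
}


def choices(word):
    """For each letter, the letter itself plus its lookalikes."""
    return [[ch] + HOMOGLYPHS.get(ch, []) for ch in word]


def count_variants(word):
    """Number of distinct spellings, including the original word."""
    return prod(len(c) for c in choices(word))


def variants(word):
    """Lazily generate every spelling of the word."""
    return ("".join(combo) for combo in product(*choices(word)))


print(count_variants("viagra"))  # 3 * 4 * 5 * 2 * 2 * 5 = 1200
```

Even this toy table yields 1200 spellings of a six-letter word, and
every extra lookalike per letter multiplies the total.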

Regards,

Dianne.
