On 25/05/16 15:21, Dianne Skoll wrote:
On Wed, 25 May 2016 15:07:37 +0100
Paul Stead <paul.st...@zeninternet.co.uk> wrote:

Consider the following 2 basic emails:
Mail 1:
Viagra
Mail 2:
V1agra

Yes, except here's the problem.  A drug company might legitimately
talk about Viagra, so that wouldn't be a spam token.  V1agra almost
certainly would be a spam token.  Bayes can distinguish between the
two; "concepts" cannot.

Bayes cannot make a relationship between V1agra and Viagra. Without
Concepts the two emails have no relationship, so nothing can be inferred
about Mail 2 based on Mail 1.
Of course real email has other tokens we can base this relationship on
- either hammy or spammy - so a drug company's email will have other
positive traits in Bayes that the spam mail doesn't have.
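
A minimal sketch of what I mean by relating the two mails, in Python rather than as a real SpamAssassin plug-in. The pattern table, the "concept:" prefix and the function name are all made up for illustration. The classic word tokens are kept, so Bayes can still tell Viagra from V1agra, and a shared concept tag is added on top:

```python
import re

# Hypothetical concept table: each regex maps matching words to a concept tag.
CONCEPT_PATTERNS = {
    "meds": re.compile(r"\bv[i1l!|]agra\b", re.IGNORECASE),
}

def tokens_with_concepts(text):
    """Return the classic word tokens plus one tag per matched concept."""
    # Classic tokens: lowercase runs of word characters.
    tokens = set(re.findall(r"\w+", text.lower()))
    # Concept tags are added alongside the classic tokens, not instead of them.
    for concept, pattern in CONCEPT_PATTERNS.items():
        if pattern.search(text):
            tokens.add("concept:" + concept)
    return tokens

mail1 = tokens_with_concepts("Viagra")
mail2 = tokens_with_concepts("V1agra")
shared = mail1 & mail2  # only the concept tag is common to both mails
```

With this, Mail 1 and Mail 2 differ in their classic tokens but share the concept tag, which is the only link between them.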

"Concepts" is a lossy process.  You are throwing away information.

That is by design, similar to fingerprinting emails with iXhash or
Razor. I guess the danger is making the digest too generic? A token that
comes out at 50/50 between ham and spam isn't much use.

Furthermore, "concepts" is playing a game of whack-a-mole as spammers
come up with more creative misspellings and other variations on
evading the concept detector.

That I cannot argue with, though the lack of a concept could be helpful?

e.g. a spam email has the concepts "meds", "pharmacy" and "dearstranger" -
the three appearing together is suspect (not forgetting other Bayes tokens).
A legit email has just the concepts "meds" and "pharmacy" - these two are
not suspicious on their own (not forgetting other Bayes tokens).

This can be applied in reverse as well. I'm thinking it's not about
the single concepts that are hit, but about which other Bayes tokens
(classic ones and other Concepts) the email also hits.
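
To make the meds/pharmacy/dearstranger example concrete, here is a toy sketch. It is a plain naive Bayes with add-one smoothing, not the way SpamAssassin's Bayes actually combines probabilities, and the four-message corpus and function names are invented for illustration:

```python
import math
from collections import Counter

def train(messages):
    """messages: iterable of (set_of_tokens, is_spam). Returns per-class counts."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for tokens, is_spam in messages:
        counts[is_spam].update(tokens)
        totals[is_spam] += 1
    return counts, totals

def spam_log_odds(tokens, counts, totals):
    """Sum of per-token log likelihood ratios, with add-one smoothing."""
    score = math.log((totals[True] + 1) / (totals[False] + 1))
    for tok in tokens:
        p_spam = (counts[True][tok] + 1) / (totals[True] + 2)
        p_ham = (counts[False][tok] + 1) / (totals[False] + 2)
        score += math.log(p_spam / p_ham)
    return score

# Invented toy corpus: two spam mails and two ham mails, concept tokens only.
corpus = [
    ({"concept:meds", "concept:pharmacy", "concept:dearstranger"}, True),
    ({"concept:meds", "concept:pharmacy", "concept:dearstranger"}, True),
    ({"concept:meds", "concept:pharmacy"}, False),
    ({"concept:meds", "concept:pharmacy"}, False),
]
counts, totals = train(corpus)

suspect = spam_log_odds(
    {"concept:meds", "concept:pharmacy", "concept:dearstranger"}, counts, totals)
plain = spam_log_odds({"concept:meds", "concept:pharmacy"}, counts, totals)
```

In this toy corpus "meds" and "pharmacy" appear equally often in ham and spam, so they contribute nothing on their own; only the extra "dearstranger" concept pushes the score towards spam. Real mail would of course bring its classic tokens into the same sum.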

Do you really want to spend your days writing rules to detect:

vіÅᏀʀâ, Ꮩɩɑɢᚱà, etc. and all the exponentially-numerous possible combinations?

I agree this is becoming more of a problem - homoglyphs are another
plug-in I'm also investigating, beyond ReplaceTags.
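
As a rough illustration of the homoglyph idea (a Python sketch, not the plug-in itself): fold lookalike characters to a plain letter before tokenising, instead of writing one rule per variant. The lookalike table below is hand-picked by me and only covers the two samples quoted above; a real table would need far more coverage, for example something derived from the Unicode confusables data.

```python
import unicodedata

# Hypothetical hand-picked lookalike table covering only the two samples below.
LOOKALIKES = str.maketrans({
    "і": "i", "ɩ": "i",
    "Ꮐ": "g", "ɢ": "g",
    "ʀ": "r", "ᚱ": "r",
    "Ꮩ": "v",
    "ɑ": "a",
})

def fold_homoglyphs(word):
    """Fold accents and known lookalikes to plain lowercase letters."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map lookalikes before lowercasing, so case folding cannot change the table keys.
    return stripped.translate(LOOKALIKES).lower()

samples = ["vіÅᏀʀâ", "Ꮩɩɑɢᚱà"]
folded = [fold_homoglyphs(w) for w in samples]
```

Characters missing from the table pass through unchanged, so the whack-a-mole problem does not go away; it just moves from per-word rules to a per-character table.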

Paul
--
Paul Stead
Systems Engineer
Zen Internet
