On Wed, 25 May 2016 18:10:57 +0100
Paul Stead <paul.st...@zeninternet.co.uk> wrote:

> > Yes, except here's the problem.  A drug company might legitimately
> > talk about Viagra, so that wouldn't be a spam token.  V1agra almost
> > certainly would be a spam token.  Bayes can distinguish between the
> > two; "concepts" cannot.

> Bayes cannot make a relationship between V1agra and Viagra.

That's correct.  But does that have a detrimental effect on accuracy?
I bet it doesn't if you have a large enough corpus.
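
To make the point concrete, here is a minimal sketch of how a token-based
Bayes estimate treats the two spellings as unrelated tokens, and what happens
when a "concept" mapping folds them together.  Everything in it is an
assumption for illustration: the message counts are invented, the
CONCEPT_VIAGRA tag and the helper names are hypothetical, and the smoothing is
a simple add-one estimate rather than the formula any particular filter uses.
It also assumes no message contains both spellings, so folded counts can
simply be added.

```python
from collections import Counter

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Add-one smoothed estimate of P(spam | token), assuming equal priors."""
    p_tok_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_tok_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_tok_spam / (p_tok_spam + p_tok_ham)

# Invented toy corpus: number of messages of each class containing the token.
n_spam, n_ham = 100, 100
spam_counts = Counter({"viagra": 40, "v1agra": 30})
ham_counts = Counter({"viagra": 35, "v1agra": 0})

# Hypothetical concept mapping that folds both spellings into one tag.
concept = {"viagra": "CONCEPT_VIAGRA", "v1agra": "CONCEPT_VIAGRA"}

def fold(counts):
    """Re-count tokens after replacing each one with its concept tag."""
    folded = Counter()
    for tok, n in counts.items():
        folded[concept.get(tok, tok)] += n
    return folded

p_plain = token_spam_prob("viagra", spam_counts, ham_counts, n_spam, n_ham)
p_obfuscated = token_spam_prob("v1agra", spam_counts, ham_counts, n_spam, n_ham)
p_concept = token_spam_prob(
    "CONCEPT_VIAGRA", fold(spam_counts), fold(ham_counts), n_spam, n_ham
)

print(f"viagra:         {p_plain:.3f}")
print(f"v1agra:         {p_obfuscated:.3f}")
print(f"CONCEPT_VIAGRA: {p_concept:.3f}")
```

With these made-up counts the plain spelling stays close to neutral, the
obfuscated spelling is a strong spam indicator, and the folded concept lands
in between, which is the information loss in question.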

> > "Concepts" is a lossy process.  You are throwing away information.

> That is by design, similar to fingerprinting emails with iXhash or
> Razor.

iXhash and Razor are designed to detect mass-mailings of identical or
very similar messages; they measure "bulk-ness" and not hammy-ness or
spammy-ness directly.

[...]

> I agree this is becoming more of a problem - homoglyphs are another
> plug-in I'm also investigating...

Yes, now that could be really useful.
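
As a rough sketch of what such folding might look like (not the plug-in under
discussion, whose design isn't described here): normalise compatibility forms
with NFKC, then map a few look-alike letters to Latin.  The CONFUSABLES table
and the fold_homoglyphs name are hypothetical, and the table is a tiny
hand-picked sample rather than a complete confusables list.

```python
import unicodedata

# Hypothetical, deliberately tiny table of look-alike characters.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0456": "i",  # Cyrillic small letter Byelorussian-Ukrainian i
    "\u0391": "A",  # Greek capital letter alpha
}

def fold_homoglyphs(text):
    """Fold compatibility forms (NFKC), then known look-alikes, to Latin."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

# "Viagra" written with Cyrillic "a" in both positions.
print(fold_homoglyphs("Vi\u0430gr\u0430"))
```

Note that this only handles characters that look identical to Latin letters;
a digit substitution such as V1agra is left untouched.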

Regards,

Dianne.
