On Wed, 25 May 2016 18:10:57 +0100 Paul Stead <paul.st...@zeninternet.co.uk> wrote:

> > Yes, except here's the problem.  A drug company might legitimately
> > talk about Viagra, so that wouldn't be a spam token.  V1agra almost
> > certainly would be a spam token.  Bayes can distinguish between the
> > two; "concepts" cannot.

> Bayes cannot make a relationship between V1agra and Viagra.

That's correct.  But does that have a detrimental effect on accuracy?
I bet it doesn't if you have a large enough corpus.

> > "Concepts" is a lossy process.  You are throwing away information.

> That is by design, similar to fingerprinting emails with iXhash or
> Razor.

iXhash and Razor are designed to detect mass-mailings of identical or
very similar messages; they measure "bulk-ness" and not hammy-ness or
spammy-ness directly.

[...]

> I agree this is becoming more of a problem - homoglyphs are another
> plug-in I'm also investigating...

Yes, now that could be really useful.

Regards,

Dianne.
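
P.S. To make the Viagra/V1agra point concrete, here is a minimal sketch
in Python.  Everything in it is an assumption for illustration: the toy
corpus, the CONCEPTS map, and the helper names are hypothetical, and the
add-one-smoothed estimate is a stand-in, not the formula SpamAssassin's
Bayes actually uses.  It only shows that per-token statistics keep the
two spellings apart, while folding them into one concept blends them.

```python
# Hypothetical toy corpus (made-up messages, already tokenized).
HAM = [["pharma", "report", "viagra", "sales"],
       ["viagra", "patent", "filing"]]
SPAM = [["buy", "v1agra", "now"],
        ["cheap", "v1agra", "pills"],
        ["v1agra", "discount"]]

# Hypothetical concept map: both spellings collapse into one concept.
CONCEPTS = {"viagra": "CONCEPT_VIAGRA", "v1agra": "CONCEPT_VIAGRA"}


def to_concepts(tokens):
    """Replace each token with its concept, if it has one (lossy)."""
    return [CONCEPTS.get(t, t) for t in tokens]


def spam_probability(token, ham, spam):
    """Share of messages containing `token` that are spam, add-one smoothed."""
    ham_hits = sum(token in msg for msg in ham)
    spam_hits = sum(token in msg for msg in spam)
    return (spam_hits + 1) / (ham_hits + spam_hits + 2)


# Token level: the two spellings get separate statistics.
p_viagra = spam_probability("viagra", HAM, SPAM)
p_v1agra = spam_probability("v1agra", HAM, SPAM)

# Concept level: both spellings share one blended statistic.
ham_c = [to_concepts(m) for m in HAM]
spam_c = [to_concepts(m) for m in SPAM]
p_concept = spam_probability("CONCEPT_VIAGRA", ham_c, spam_c)

print(f"viagra  {p_viagra:.3f}")
print(f"v1agra  {p_v1agra:.3f}")
print(f"concept {p_concept:.3f}")
```

On this toy data the plain spelling scores hammy, the obfuscated one
scores spammy, and the concept lands in between, which is the
information I was saying gets thrown away.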