On Sat, 28 May 2016 14:53:15 -0700 (PDT)
John Hardin <jhar...@impsec.org> wrote:

> Based on that, do you have an opinion on the proposal to add two-word
> (or configurable-length) combinations to Bayes?

I have an opinion. :)

Extending Bayes to look at multiple tokens is a *very* good idea.
That's because naive single-word Bayes assumes that the probability of
a token is independent of the presence of other tokens.  But this is
very rarely the case.  For example, the word "mussel" is substantially
more likely to follow the word "zebra" than it is to follow the word
"xenophobic".  So "zebra mussel" might be a couple of ecologists
talking, while "xenophobic mussel" could well be random text designed
to confuse Bayes.

Also, two-word phrases can be stronger indicators than the individual
words: "hot" and "sex" in isolation may not be strong spam indicators,
but "hot sex" probably is a much stronger one.

Going from one-word tokens to one+two-word tokens will have a pretty
big payoff, I think.  I'm not so sure about two to three.
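
For concreteness, here's a toy sketch of the one+two-word tokenization
I mean (illustrative Python, not SpamAssassin's actual tokenizer; the
function name is made up):

```python
def tokens(text):
    """Yield one-word tokens, then adjacent two-word tokens.

    A Bayes engine would hash/store each yielded token and track
    per-token spam/ham counts exactly as it does for single words.
    """
    words = text.lower().split()
    # Single-word tokens, as classic naive Bayes uses today.
    for w in words:
        yield w
    # Adjacent two-word tokens, which capture context like "hot sex".
    for a, b in zip(words, words[1:]):
        yield a + " " + b

print(list(tokens("hot sex now")))
# -> ['hot', 'sex', 'now', 'hot sex', 'sex now']
```

The database roughly doubles in token volume per message, but each
bigram token carries contextual evidence that no unigram can.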

Regards,

Dianne.
