On Sat, 28 May 2016 14:53:15 -0700 (PDT) John Hardin <jhar...@impsec.org> wrote:
> Based on that, do you have an opinion on the proposal to add two-word
> (or configurable-length) combinations to Bayes?

I have an opinion. :) Extending Bayes to look at multiple tokens is a *very* good idea.

That's because naive single-word Bayes assumes that the probability of a token is independent of the presence of other tokens, but that is very rarely the case. For example, the word "mussel" is substantially more likely to follow the word "zebra" than it is to follow the word "xenophobic". So "zebra mussel" might be a couple of ecologists talking, while "xenophobic mussel" could well be random text designed to confuse Bayes.

Also, two-word phrases can be stronger indicators than the individual words: "hot" and "sex" in isolation may not be strong spam indicators, but "hot sex" probably is.

Going from one-word tokens to one-plus-two-word tokens will have a pretty big payoff, I think. I'm not so sure about going from two to three.

Regards,

Dianne.
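To make the idea concrete, here is a minimal sketch (in Python, purely for illustration; it is not SpamAssassin's actual tokenizer or scoring code, and the token names, function names, and probability table are all made up) of emitting one-word plus two-word tokens and combining their spam probabilities naive-Bayes style:

```python
import math
import re

def tokens(text):
    """Split text into lowercase word tokens plus adjacent two-word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    # one-word tokens followed by two-word (bigram) tokens
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def spam_probability(text, spamminess, prior=0.5):
    """Combine per-token spam probabilities in log-odds space.

    `spamminess` maps a token to P(spam | token); unknown tokens are
    treated as neutral (0.5) and so contribute nothing to the score.
    """
    log_odds = math.log(prior / (1 - prior))
    for tok in tokens(text):
        p = spamminess.get(tok, 0.5)
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical token probabilities: "hot" and "sex" alone are weak
# evidence, but the bigram "hot sex" is a strong spam indicator.
table = {"hot": 0.6, "sex": 0.6, "hot sex": 0.98}
print(tokens("zebra mussel"))
print(round(spam_probability("hot sex", table), 3))
```

With a table like this, the bigram token pushes the combined score far higher than either single word could on its own, which is exactly the extra signal the proposal is after.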