On 29 May 2016, at 11:07, RW wrote:

On Sat, 28 May 2016 15:37:21 -0400
Bill Cole wrote:


More importantly (IMHO) they aren't designed to collide with existing
common tokens and be added back into messages that may contain those
tokens already in order to influence Bayesian classification.

There is sound statistical theory consistent with empirical evidence
underpinning the Bayes classifier implementation in SA. While there
can be legitimate critiques of the SA implementation specifically and
in general how well email word frequency fits Bayes' Theorem,
injecting a pile of new derivative meta-tokens based on pre-conceived
notions of "concepts" into the Bayesian analysis invalidates the
assumption of what the input for Naive Bayes analysis is:
*independent* features. The "concepts" approach adds words that are
*dependent* on the presence of other words in the document and to
make it worse, those dependent words may already exist in some
pristine messages. It unmoors the SA Bayes implementation from any
theoretical grounding, converting its complex math from statistical
analysis into arbitrary numerology.

Statistical filters are based on some statistical theory combined with
pragmatic kludges and assumptions. Practical filters have been
developed based on what's been found to work, not on what's more
statistically correct.

I'm not aware of any hard evidence that the SA Bayes pragmatic kludges and assumptions perform better or worse than an implementation that used fewer or different ones. I confess that I have not actually *LOOKED* for such evidence in the past 6 years, so maybe you are aware of something I could never find simply because it didn't yet exist.

Bayes already creates multiple tokens from the same information, most
notably case-sensitive and lower-case words in the body.

I don't see a huge difference between

  "Bill Cole" tokenizing as {Bill, bill, Cole, cole}

and

  "v1agra, ciali5"  tokenizing as {v1agra, ciali5, eddrug}

De-capitalizing and full case-squashing have their own issues (particularly when one's lower-cased first name has 5 noun and 2 verb definitions...) but they are invariant, deterministic processes for the most popular (so far) spamming languages. De-capitalizing tokenization of English words will yield the same tokens today, tomorrow, and 3 years from now.

There is a strong argument for de-capitalization in English because of our capitalization rules: 'Bill' at the start of a sentence could properly be a name or any of 7 noun & verb meanings, while 'bill' anywhere else can properly be only one of those 7 meanings, never the name. Adding a de-capitalized token captures all of the occurrences of the most common (for most people) uses of the word in one token tally, making that tally more complete. There is no "throwing away information" in that process, and also no invention of new meta-information in a way that might get updated in later analyses.
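A minimal sketch of the sort of deterministic process I mean (my own toy code, not the actual SA tokenizer, which does quite a bit more):

import re

def tokenize(text):
    """Split text into word tokens; for any word that isn't already
    lower-case, also emit its de-capitalized form."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9']+", text):
        tokens.append(word)
        if word != word.lower():
            tokens.append(word.lower())
    return tokens

print(tokenize("Bill Cole"))   # ['Bill', 'bill', 'Cole', 'cole']

Run it on the same text in 3 years and it emits the same tokens.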

A Naive Bayes purist might insist that expanding capitalized forms into 2 tokens means that you're double-counting one token and so overweighting capitalized words: words often used to start sentences and words that are sometimes proper nouns. I think that argument would be MUCH stronger in German, where variant capitalization would be a strong style signal.
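To make the purist's objection concrete, here is a toy naive-Bayes combination (my own numbers and formula; SA's actual combining is the chi-squared method, but the independence assumption is the same) showing how counting the same evidence as two tokens weights it twice:

def posterior(token_probs, prior_spam=0.5):
    """Naive-Bayes combination of per-token P(spam|token) values,
    treating every token as independent evidence."""
    s = prior_spam
    h = 1.0 - prior_spam
    for p in token_probs:
        s *= p
        h *= (1.0 - p)
    return s / (s + h)

print(posterior([0.7]))        # 0.7    -- one mildly spammy token
print(posterior([0.7, 0.7]))   # ~0.845 -- the same evidence emitted as two tokens

Double-counting a capitalized word is a bounded version of that effect; meta-tokens that fire on open-ended sets of other tokens are not.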

My guess is that de-capitalization for English likely yields better results than if only the pristine words were used. I also think it could be useful to have a visual normalization algorithm that would turn "V1agra" into {V1agra, viagra} and "Unwi11!ng" into {Unwi11!ng, unwi11ing}. These are *guesses* on my part, but I think they can be rationalized this way: capitalization and intentional obfuscation that maintains the visual appearance of a word are effectively noise interfering with the words the author intended the reader to see. So while it is important to retain the capitalized or obfuscated forms for the information wrapped up in that formal difference, it is also correct to count them as the words they are intended to be.
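Extending the earlier sketch, something like this is what I have in mind; the homoglyph map is purely my own guess at a starting point and obviously incomplete:

# Hypothetical map of look-alike characters back to the letters the
# reader is meant to see; a real table would need care and tuning.
HOMOGLYPHS = str.maketrans({'1': 'i', '!': 'i', '0': 'o',
                            '5': 's', '3': 'e', '@': 'a', '$': 's'})

def visual_tokens(word):
    """Emit the pristine form plus a de-capitalized, de-obfuscated
    form, so nothing is thrown away but the intended word is counted."""
    tokens = [word]
    normalized = word.lower().translate(HOMOGLYPHS)
    if normalized != word:
        tokens.append(normalized)
    return tokens

print(visual_tokens("V1agra"))   # ['V1agra', 'viagra']
print(visual_tokens("ciali5"))   # ['ciali5', 'cialis']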

Concepts are fundamentally different because there is no finite set of all concepts, no generally-accepted (or even suggested, as far as I know...) finite set of all commonly used concepts, no formal definition of how to divide broad topical areas into discrete concepts: nothing but the vague, fuzzy concept of concepts. The currently-offered implementation has 250 concept files, each consisting of arbitrary subjective pattern sets, ranging from a single pattern matching a line 76 characters or longer followed by a 76-character line (how is that a "concept"???) to a woefully incomplete list of Apple brands.

It seems unavoidable that the number of concepts will grow and the definitions of existing "concepts" will change at an ongoing rapid pace, such that while hitting the '76char' concept (ugh) may be forever reliable, hitting the 'asian' concept is surely going to need to become MUCH easier in the future than it currently is. It is no shock that while this implementation has Paul Stead's name on it, it is apparently mostly the product of the anti-spam community's most spectacular case of Dunning-Kruger Syndrome, someone who has apparently figured out that his personal 'brand' has negative value.

The only way to find out whether it works is to try it.

Sure, but with innovations for the SA Bayes filter that seem to me to be in profound conflict with the theory of why the SA Bayes filter DOES work, I'm going to let others generate that evidence one way or the other.

The craziest part of this is that we already HAVE this functionality outside of the SA Bayes filter. It's called SpamAssassin. Perkel's concept files in Stead's plugin could be robotically translated into sub-rules and meta-rules, run through the normal Rules QA mechanism, and dynamically scored. There is no reason to hide this stuff behind Bayes where it would be mixing a jumble of derivative meta-tokens into a database of case-normalized primitive tokens, amplifying an arbitrary subset of the information already present in the Bayes DB.
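As a sketch of what I mean by "robotically translated" (rule names, input format, and threshold invented for illustration; a real converter would need to handle whatever the actual concept-file syntax is):

def concept_to_rules(name, patterns, min_hits=1):
    """Turn one named list of regex patterns (a 'concept') into SA body
    sub-rules plus a meta-rule, so it goes through normal rules QA and
    scoring instead of being fed into the Bayes DB."""
    lines = []
    subrules = []
    for i, pat in enumerate(patterns, 1):
        sub = "__CONCEPT_%s_%d" % (name.upper(), i)
        subrules.append(sub)
        lines.append("body %s /%s/i" % (sub, pat))
    meta = "CONCEPT_%s" % name.upper()
    lines.append("meta %s (%s) >= %d" % (meta, " + ".join(subrules), min_hits))
    lines.append("describe %s Message matches the '%s' concept" % (meta, name))
    return "\n".join(lines)

print(concept_to_rules("eddrug", ["v1agra", "ciali5"]))

The output is ordinary rule configuration that the existing QA and scoring machinery already knows how to handle.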

I think the OP is probably underselling it, in that it could be used to
extract information that normal tokenization can't get, for example:

/%.off/i

/Symbol:/i,   /Date:/i,    /Price:/i ...

/^Barrister/i



The main problem is that you'd need a lot of rules to make a substantial
difference.

So: re-invent SpamAssassin v1 but without rule scores, using Bayes to do half-assed dynamic score adjustment per site, with rules that will either evolve constantly or grow stale?

Let me know how that goes...
