On 29 May 2016, at 11:07, RW wrote:
> On Sat, 28 May 2016 15:37:21 -0400
> Bill Cole wrote:
>> More importantly (IMHO) they aren't designed to collide with existing
>> common tokens and be added back into messages that may contain those
>> tokens already in order to influence Bayesian classification.
>> There is sound statistical theory consistent with empirical evidence
>> underpinning the Bayes classifier implementation in SA. While there
>> can be legitimate critiques of the SA implementation specifically and
>> in general how well email word frequency fits Bayes' Theorem,
>> injecting a pile of new derivative meta-tokens based on pre-conceived
>> notions of "concepts" into the Bayesian analysis invalidates the
>> assumption of what the input for Naive Bayes analysis is:
>> *independent* features. The "concepts" approach adds words that are
>> *dependent* on the presence of other words in the document and to
>> make it worse, those dependent words may already exist in some
>> pristine messages. It unmoors the SA Bayes implementation from any
>> theoretical grounding, converting its complex math from statistical
>> analysis into arbitrary numerology.
> Statistical filters are based on some statistical theory combined with
> pragmatic kludges and assumptions. Practical filters have been
> developed based on what's been found to work, not on what's more
> statistically correct.
I'm not aware of any hard evidence that the SA Bayes pragmatic kludges
and assumptions perform better or worse than an implementation that used
fewer or different ones. I confess that I have not actually *LOOKED* for
such evidence in the past 6 years, so maybe you are aware of something I
never could find simply because it didn't yet exist.
> Bayes already creates multiple tokens from the same information, most
> notably case-sensitive and lower-case words in the body.
> I don't see a huge difference between
> "Bill Cole" tokenizing as {Bill, bill, Cole, cole}
> and
> "v1agra, ciali5" tokenizing as {v1agra, ciali5, eddrug}
De-capitalizing and full case-squashing have their own issues
(particularly when one's lower-cased first name has 5 noun and 2 verb
definitions...) but it is an invariant deterministic process for the
most popular (so far) spamming languages. Today's de-capitalizing
tokenization of English words is going to yield the same tokens today,
tomorrow, and 3 years from now. There is a strong argument for
de-capitalization in English because of our capitalization rules: 'Bill'
could properly be a name or any of 7 noun & verb meanings as the first
word of a sentence, while 'bill' is properly only one of those 7
meanings NOT as the first word. Adding a de-capitalized token captures
all of the occurrences of the most common (for most people) uses of the
word in one token tally, making it more complete. There is no "throwing
away information" in that process and also no invention of new
meta-information in a way that might get updated in later analyses.
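To be concrete about what I mean by an invariant deterministic process, here's a rough sketch in Python of that de-capitalization step (this is my own illustration, not SA's actual tokenizer, which does rather more):

```python
import re

def tokenize(text):
    """Split on non-word characters; for every token that is not
    already lower-case, emit both the pristine form and the
    lower-cased form. Deterministic: the same input always yields
    the same token set, today or 3 years from now."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9']+", text):
        tokens.append(word)
        if word != word.lower():
            tokens.append(word.lower())
    return tokens

# "Bill Cole" -> ['Bill', 'bill', 'Cole', 'cole']
```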
A Naive Bayes purist might insist that expanding capitalized forms into
2 tokens means that you're double-counting one token and so
overweighting capitalized words: words often used to start sentences and
words that are sometimes proper nouns. I think that argument would be
MUCH stronger in German, where variant capitalization would be a strong
style signal.
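To make the purist's double-counting objection concrete, here's a toy demonstration using Graham-style naive combining with a flat 0.5 prior (my assumption for illustration; SA's Bayes actually uses chi-squared combining, so the exact numbers would differ, though the direction of the effect would not):

```python
def spamminess(probs):
    """Combine per-token spam probabilities with the naive
    two-class chain rule: P(spam) = prod(p) / (prod(p) + prod(1-p)).
    Counting the same evidence twice pushes the result toward the
    extremes, which is exactly the overweighting complaint."""
    num = 1.0
    den = 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)

once = spamminess([0.8])        # 0.8 -- one token, one vote
twice = spamminess([0.8, 0.8])  # ~0.94 -- same evidence, double weight
```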
My guess is that de-capitalization for English likely yields better
results than if only the pristine words were used. I also think it could
be useful to have a visual normalization algorithm that would turn
"V1agra" into {V1agra, viagra} and "Unwi11!ng" into {Unwi11!ng,
unwi11ing}. These are *guesses* on my part, but they can be
rationalized this way: capitalization and intentional obfuscation that
preserves the visual appearance of a word are effectively noise
interfering with the words the author intended the reader to see. It
is important to retain the capitalized or obfuscated forms for the
information wrapped up in that formal difference, but it is also
correct to count them as the words they are intended to be.
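A crude sketch of what such a visual normalizer might look like (entirely hypothetical on my part; note that "1" is genuinely ambiguous between "i" and "l", so a serious implementation would need candidate sets or a dictionary check rather than the single-choice map I use here):

```python
# Single-choice homoglyph map -- a toy subset; the real confusable
# set is much larger and ambiguous ("1" could be "i" or "l").
HOMOGLYPHS = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "5": "s", "@": "a", "$": "s", "!": "i"}
)

def visual_tokens(word):
    """Emit the pristine form, plus a visually-normalized and
    lower-cased form when the two differ, so obfuscated variants
    tally against the word they are meant to be read as."""
    normalized = "".join(
        c for c in word.translate(HOMOGLYPHS) if c.isalnum()
    ).lower()
    return [word] if normalized == word else [word, normalized]

# "Unwi11!ng" -> ['Unwi11!ng', 'unwilling']
```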
Concepts are fundamentally different because there is no finite set of
all concepts, no generally-accepted (or even suggested, as far as I
know...) finite set of all commonly used concepts, no formal definition
of how to divide broad topical areas into discrete concepts, no nothing
but the vague fuzzy concept of concepts. The currently-offered
implementation has 250 concept files, each consisting of arbitrary
subjective pattern sets ranging from a single pattern matching a line 76
characters or longer followed by a 76-character line (how is that a
"concept???") to a woefully incomplete list of Apple brands. It seems
unavoidable that the number of concepts will grow and the definitions
of existing "concepts" will change at a rapid, ongoing pace, such that
while hitting the '76char' concept (ugh) may be forever reliable,
hitting the 'asian' concept is surely going to need to be MUCH easier
in the future than it currently is. It is no shock that while this
implementation has Paul Stead's name on it, it is apparently mostly
the product of the anti-spam community's most spectacular case of
Dunning-Kruger Syndrome, one who has figured out that his personal
'brand' has negative value.
> The only way to find out whether it works is to try it.
Sure, but with innovations for the SA Bayes filter that seem to me to be
in profound conflict with the theory of why the SA Bayes filter DOES
work, I'm going to let others generate that evidence one way or the
other.
The craziest part of this is that we already HAVE this functionality
outside of the SA Bayes filter. It's called SpamAssassin. Perkel's
concept files in Stead's plugin could be robotically translated into
sub-rules and meta-rules, run through the normal Rules QA mechanism, and
dynamically scored. There is no reason to hide this stuff behind Bayes
where it would be mixing a jumble of derivative meta-tokens into a
database of case-normalized primitive tokens, amplifying an arbitrary
subset of the information already present in the Bayes DB.
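The robotic translation I have in mind is nearly trivial. A sketch (my own, hypothetical; real concept files also carry header patterns and other baggage a full converter would have to handle):

```python
def concept_to_rules(name, patterns):
    """Translate one 'concept' (a list of body regexes) into
    SpamAssassin sub-rules plus a meta-rule that fires when any
    sub-rule hits -- letting the normal Rules QA machinery score
    it instead of burying it in the Bayes DB."""
    lines = []
    subs = []
    for i, pat in enumerate(patterns):
        sub = f"__CONCEPT_{name.upper()}_{i}"
        subs.append(sub)
        lines.append(f"body {sub} {pat}")
    lines.append(f"meta CONCEPT_{name.upper()} ({' || '.join(subs)})")
    lines.append(
        f"describe CONCEPT_{name.upper()} Message matches the '{name}' concept"
    )
    return "\n".join(lines)

print(concept_to_rules("eddrug", ["/v[i1]agra/i", "/cial[i1]5/i"]))
```

Feed the output through the masscheck/QA pipeline like any other candidate rule and it gets a real, dynamically derived score.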
> I think the OP is probably underselling it, in that it could be used to
> extract information that normal tokenization can't get, for example:
> /%.off/i
> /Symbol:/i, /Date:/i, /Price:/i ...
> /^Barrister/i
> The main problem is that you'd need a lot of rules to make a
> substantial difference.
So: re-invent SpamAssassin v1 but without rule scores, using Bayes to
do half-assed dynamic score adjustment per site, with rules that will
either evolve constantly or grow stale?
Let me know how that goes...