On 29 May 2016, at 11:07, RW wrote:
> On Sat, 28 May 2016 15:37:21 -0400
> Bill Cole wrote:
>> More importantly (IMHO) they aren't designed to collide with existing
>> common tokens and be added back into messages that may contain those
>> tokens already in order to influence Bayesian classification.
>> There is sound statistical theory consistent with empirical evidence
>> underpinning the Bayes classifier implementation in SA. While there
>> can be legitimate critiques of the SA implementation specifically and
>> in general how well email word frequency fits Bayes' Theorem,
>> injecting a pile of new derivative meta-tokens based on pre-conceived
>> notions of "concepts" into the Bayesian analysis invalidates the
>> assumption of what the input for Naive Bayes analysis is:
>> *independent* features. The "concepts" approach adds words that are
>> *dependent* on the presence of other words in the document and to
>> make it worse, those dependent words may already exist in some
>> pristine messages. It unmoors the SA Bayes implementation from any
>> theoretical grounding, converting its complex math from statistical
>> analysis into arbitrary numerology.
> Statistical filters are based on some statistical theory combined with
> pragmatic kludges and assumptions. Practical filters have been
> developed based on what's been found to work, not on what's more
> statistically correct.
I'm not aware of any hard evidence that the SA Bayes pragmatic kludges
and assumptions perform better or worse than an implementation that used
fewer or different ones. I confess that I have not actually *LOOKED* for
such evidence in the past 6 years, so maybe you are aware of something I
never could find simply because it didn't yet exist.
> Bayes already creates multiple tokens from the same information, most
> notably case-sensitive and lower-case words in the body.
> I don't see a huge difference between
> "Bill Cole" tokenizing as {Bill, bill, Cole, cole}
> and
> "v1agra, ciali5" tokenizing as {v1agra, ciali5, eddrug}
De-capitalizing and full case-squashing have their own issues
(particularly when one's lower-cased first name has 5 noun and 2 verb
definitions...) but it is an invariant deterministic process for the
most popular (so far) spamming languages. Today's de-capitalizing
tokenization of English words is going to yield the same tokens today,
tomorrow, and 3 years from now. There is a strong argument for
de-capitalization in English because of our capitalization rules: 'Bill'
could properly be a name or any of 7 noun & verb meanings as the first
word of a sentence, while 'bill' is properly only one of those 7
meanings NOT as the first word. Adding a de-capitalized token captures
all of the occurrences of the most common (for most people) uses of the
word in one token tally, making it more complete. There is no "throwing
away information" in that process and also no invention of new
meta-information in a way that might get updated in later analyses.
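To be concrete about what I mean by an invariant deterministic process, here's a rough sketch in Python of that de-capitalization step (this is my own illustration, not SA's actual tokenizer, which does rather more):

```python
import re

def tokenize(text):
    """Split on non-word characters; for every token that is not
    already lower-case, emit both the pristine form and the
    lower-cased form. Deterministic: the same input always yields
    the same token set, today or 3 years from now."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9']+", text):
        tokens.append(word)
        if word != word.lower():
            tokens.append(word.lower())
    return tokens

# "Bill Cole" -> ['Bill', 'bill', 'Cole', 'cole']
```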
A Naive Bayes purist might insist that expanding capitalized forms into
2 tokens means that you're double-counting one token and so
overweighting capitalized words: words often used to start sentences and
words that are sometimes proper nouns. I think that argument would be
MUCH stronger in German, where variant capitalization would be a strong
style signal.
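To make the purist's double-counting objection concrete, here's a toy demonstration using Graham-style naive combining with a flat 0.5 prior (my assumption for illustration; SA's Bayes actually uses chi-squared combining, so the exact numbers would differ, though the direction of the effect would not):

```python
def spamminess(probs):
    """Combine per-token spam probabilities with the naive
    two-class chain rule: P(spam) = prod(p) / (prod(p) + prod(1-p)).
    Counting the same evidence twice pushes the result toward the
    extremes, which is exactly the overweighting complaint."""
    num = 1.0
    den = 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)

once = spamminess([0.8])        # 0.8 -- one token, one vote
twice = spamminess([0.8, 0.8])  # ~0.94 -- same evidence, double weight
```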
My guess is that de-capitalization for English likely yields better
results than if only the pristine words were used. I also think it could
be useful to have a visual normalization algorithm that would turn
"V1agra" into {V1agra, viagra} and "Unwi11!ng" into {Unwi11!ng,
unwi11ing}. These are *guesses* on my part, but they can be
rationalized this way: capitalization and intentional obfuscation that
preserves the visual appearance of a word are effectively noise
interfering with the words the author intended the reader to see. It
is important to retain the capitalized or obfuscated forms for the
information wrapped up in that formal difference, but it is also
correct to count them as the words they are intended to be.
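A crude sketch of what such a visual normalizer might look like (entirely hypothetical on my part; note that "1" is genuinely ambiguous between "i" and "l", so a serious implementation would need candidate sets or a dictionary check rather than the single-choice map I use here):

```python
# Single-choice homoglyph map -- a toy subset; the real confusable
# set is much larger and ambiguous ("1" could be "i" or "l").
HOMOGLYPHS = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "5": "s", "@": "a", "$": "s", "!": "i"}
)

def visual_tokens(word):
    """Emit the pristine form, plus a visually-normalized and
    lower-cased form when the two differ, so obfuscated variants
    tally against the word they are meant to be read as."""
    normalized = "".join(
        c for c in word.translate(HOMOGLYPHS) if c.isalnum()
    ).lower()
    return [word] if normalized == word else [word, normalized]

# "Unwi11!ng" -> ['Unwi11!ng', 'unwilling']
```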
Concepts are fundamentally different because there is no finite set of
all concepts, no generally-accepted (or even suggested, as far as I
know...) finite set of all commonly used concepts, no formal definition
of how to divide broad topical areas into discrete concepts, no nothing
but the vague fuzzy concept of concepts. The currently-offered
implementation has 250 concept files, each consisting of arbitrary
subjective pattern sets ranging from a single pattern matching a line 76
characters or longer followed by a 76-character line (how is that a
"concept???") to a woefully incomplete list of Apple brands. It seems
unavoidable that the number of concepts will grow and the definitions
of existing "concepts" will change at a rapid, ongoing pace, such that
while hitting the '76char' concept (ugh) may be forever reliable,
hitting the 'asian' concept is surely going to need to be MUCH easier
in the future than it currently is. It is no shock that while this
implementation has Paul Stead's name on it, it is apparently mostly
the product of the anti-spam community's most spectacular case of
Dunning-Kruger Syndrome, one who has figured out that his personal
'brand' has negative value.
> The only way to find out whether it works is to try it.
Sure, but with innovations for the SA Bayes filter that seem to me to be
in profound conflict with the theory of why the SA Bayes filter DOES
work, I'm going to let others generate that evidence one way or the
other.
The craziest part of this is that we already HAVE this functionality
outside of the SA Bayes filter. It's called SpamAssassin. Perkel's
concept files in Stead's plugin could be robotically translated into
sub-rules and meta-rules, run through the normal Rules QA mechanism, and
dynamically scored. There is no reason to hide this stuff behind Bayes
where it would be mixing a jumble of derivative meta-tokens into a
database of case-normalized primitive tokens, amplifying an arbitrary
subset of the information already present in the Bayes DB.
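The robotic translation I have in mind is nearly trivial. A sketch (my own, hypothetical; real concept files also carry header patterns and other baggage a full converter would have to handle):

```python
def concept_to_rules(name, patterns):
    """Translate one 'concept' (a list of body regexes) into
    SpamAssassin sub-rules plus a meta-rule that fires when any
    sub-rule hits -- letting the normal Rules QA machinery score
    it instead of burying it in the Bayes DB."""
    lines = []
    subs = []
    for i, pat in enumerate(patterns):
        sub = f"__CONCEPT_{name.upper()}_{i}"
        subs.append(sub)
        lines.append(f"body {sub} {pat}")
    lines.append(f"meta CONCEPT_{name.upper()} ({' || '.join(subs)})")
    lines.append(
        f"describe CONCEPT_{name.upper()} Message matches the '{name}' concept"
    )
    return "\n".join(lines)

print(concept_to_rules("eddrug", ["/v[i1]agra/i", "/cial[i1]5/i"]))
```

Feed the output through the masscheck/QA pipeline like any other candidate rule and it gets a real, dynamically derived score.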
> I think the OP is probably underselling it, in that it could be used to
> extract information that normal tokenization can't get, for example:
> /%.off/i
> /Symbol:/i, /Date:/i, /Price:/i ...
> /^Barrister/i
> The main problem is that you'd need a lot of rules to make a
> substantial difference.
So: re-invent SpamAssassin v1 but without rule scores, using Bayes to
do half-assed dynamic score adjustment per site, with rules that will
either evolve constantly or grow stale?
Let me know how that goes...