On 28 May 2016, at 17:53, John Hardin wrote:

Based on that, do you have an opinion on the proposal to add two-word (or configurable-length) combinations to Bayes?

CAVEAT: it has literally been decades since I've worked deep in statistics on a routine basis rather than just using blindly trusted black-box tools every now and then, so some of the below could be influenced by senile dementia...

Tallying word pairs *instead* of single words, or as a second discrete Bayes analysis, wouldn't be a problem and would surely be useful, possibly more useful than single-word analysis.
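
Concretely, the "word pairs as a second discrete analysis" idea is just a different tokenizer feeding its own DB. A rough sketch (the function name and the whitespace split are my own invention, not anything SpamAssassin actually does):

    def tokenize_pairs(text):
        """Adjacent word pairs only, for a separate pair-wise Bayes DB."""
        words = text.lower().split()
        return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

    # tokenize_pairs("cheap meds shipped overnight")
    # -> ['cheap meds', 'meds shipped', 'shipped overnight']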

Doing one unified analysis, where single words and multi-word phrases are both tallied in one Bayes DB to determine one Bayes score, is less clearly valid because there is absolute dependence in one direction: the presence of any phrase requires its component words also to be present. OTOH, whether words that are commonly used in particular sequences also occur outside those sequences is pretty clearly a distinct feature of a text not captured by 1-word tokenization, so it wouldn't be blatantly wrong to capture it indirectly by having a unified word-and-phrase Bayes DB. So I guess I'm undecided, leaning in favor because it captures information otherwise invisible to the Bayes DB.
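
For the unified variant, the sketch just dumps both token types into one counter, and the dependence I'm talking about is visible in the result:

    from collections import Counter

    def tokenize_unified(text):
        """One shared token stream: single words plus adjacent word pairs."""
        words = text.lower().split()
        pairs = [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]
        return words + pairs

    db = Counter(tokenize_unified("cheap meds shipped overnight"))
    # Every pair token guarantees that its two component word tokens were also
    # counted, so the per-token tallies are correlated by construction.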

The "Naive Bayes" classification approach is theoretically moored to Bayes' Theorem by the concept that even if there's SOME dependent correlation across the features being measured to feed the classification database, incomplete dependency makes a large set of similar measurable features (like the presence of words in a message) usable as a proxy for a hypothetical set of truly independent features which are unknown and may not be readily quantified. For textual analysis, this ironically might be "concepts" but to be accurate that set would have to include a properly distributed sample of all possible concepts and a concrete way to detect each one accurately. Using words or n-word phrases instead of concepts means that Bayesian spam classification does not require a full-resolution simulation of Brahman on every mail server. Those are very resource-heavy...

The canonical empirical example of Naive Bayes classification is the use of simple physical body measurements to classify humans by biological sex. That classification improves as one adds more direct physical measurements, even though they all relate to each other via abstract ideas like "size," "muscularity," and "shape". However, if one includes such subjective abstractions as features, accuracy usually suffers (unless you cheat with features like 'femininity'). Less intuitively, if one adds arbitrary derived features like BMI, which can be calculated from the simpler measured features already in the input set, classification accuracy also usually gets worse. Perversely, classifiers using purely subjective abstractions or purely derived values, such as various ratios of direct physical metrics, work better on average than classifiers of mixed types, but can work better or worse than classifiers using the simple measurements on which the derived features are based. This is where the serious arguments about various Naive Bayes implementations arise: What constitutes features of compatible classes? How strong can a correlation between features be before they are effectively measurements of the same thing twice? Is the empirical support for the idea of semi-independent features as proxies for truly independent features strong enough? Are the distributions of the predictive features and of the classifications compatible with each other, or with Bayes *AT ALL*?
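
The BMI point is easy to poke at empirically if anyone is curious. A sketch of the sort of comparison I mean, using synthetic data and scikit-learn's GaussianNB (the numbers you get will depend entirely on the fake data, so I'm not claiming any particular result, just showing the shape of the experiment):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    n = 2000
    sex = rng.integers(0, 2, n)                    # class label
    height = rng.normal(165 + 12 * sex, 7, n)      # cm, shifted by class
    weight = rng.normal(62 + 16 * sex, 9, n)       # kg, shifted by class
    bmi = weight / (height / 100.0) ** 2           # deterministic function of the above

    for name, X in [("direct only", np.column_stack([height, weight])),
                    ("direct + derived BMI", np.column_stack([height, weight, bmi]))]:
        print(name, cross_val_score(GaussianNB(), X, sex, cv=5).mean())

The point being that BMI contributes no information the classifier doesn't already have from height and weight, yet Naive Bayes counts it as a third independent piece of evidence.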

The approach of mixing "concepts" into the existing Bayes DB is qualitatively broken because concept tokens would be deterministically derived from the actual word tokens in messages based on some subjective scheme, and then added as words which are likely to also be naturally occurring in some but not all of the messages to which they are added. So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum natural (word) and synthetic (concept) occurrences, not just as incompatible types of input feature but as a conflation of incompatible features.
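
To make the conflation concrete (the concept map here is invented purely for illustration):

    CONCEPTS = {"meds": {"viagra", "cialis", "pills", "meds"},
                "watches": {"rolex", "replica", "watches"}}

    def tokenize_with_concepts(text):
        words = text.lower().split()
        tokens = list(words)
        for concept, triggers in CONCEPTS.items():
            if triggers & set(words):
                tokens.append(concept)   # synthetic token in the same namespace as real words
        return tokens

    # "cheap meds and a rolex" yields the literal word 'meds' AND the synthetic
    # concept 'meds'; the DB's count for 'meds' can no longer distinguish them.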


FWIW, I have roughly no free time for anything between work and family demands, but if I did, what I would most like to build is a blind fixed-length tokenization Bayes classifier: just slice a message up into all of its n-byte sequences (so that a message of byte length x would have x-(n-1) overlapping tokens) and use those as inputs instead of words. An advantage of this over word-wise Bayes would be attenuation of semantic entanglement and better detection of intentional obfuscation, at the cost of needing a huge training volume to get a usable classifier.
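
In other words, something like this (n is whatever fixed length you pick; 5 here is arbitrary):

    def byte_ngrams(message: bytes, n: int = 5):
        """All overlapping n-byte slices: len(message) - (n - 1) tokens."""
        return [message[i:i + n] for i in range(len(message) - n + 1)]

    # byte_ngrams(b"cheap meds", 5)
    # -> [b'cheap', b'heap ', b'eap m', b'ap me', b'p med', b' meds']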
