On 28 May 2016, at 17:53, John Hardin wrote:

Based on that, do you have an opinion on the proposal to add two-word (or configurable-length) combinations to Bayes?

CAVEAT: it has literally been decades since I've worked deep in statistics on a routine basis rather than just using blindly trusted black-box tools every now and then, so some of the below could be influenced by senile dementia...

Tallying word pairs *instead* of single words, or as a second discrete Bayes analysis, wouldn't be a problem and would surely be useful, possibly more useful than single-word analysis.
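
Concretely, the "word pairs as a second discrete analysis" idea is just a different tokenizer feeding its own DB. A rough sketch (the function name and the whitespace split are my own invention, not anything SpamAssassin actually does):

    def tokenize_pairs(text):
        """Adjacent word pairs only, for a separate pair-wise Bayes DB."""
        words = text.lower().split()
        return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

    # tokenize_pairs("cheap meds shipped overnight")
    # -> ['cheap meds', 'meds shipped', 'shipped overnight']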

Doing one unified analysis, where single words and multi-word phrases are both tallied in one Bayes DB to determine one Bayes score, is less clearly valid because there is absolute dependence in one direction: the presence of any phrase requires its component words also to be present. OTOH, whether words that are commonly used in particular sequences also occur outside those sequences is pretty clearly a distinct feature of a text not captured by 1-word tokenization, so it wouldn't be blatantly wrong to capture it indirectly by having a unified word-and-phrase Bayes DB. So I guess I'm undecided, leaning in favor because it captures information otherwise invisible to the Bayes DB.
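
For the unified variant, the sketch just dumps both token types into one counter, and the dependence I'm talking about is visible in the result:

    from collections import Counter

    def tokenize_unified(text):
        """One shared token stream: single words plus adjacent word pairs."""
        words = text.lower().split()
        pairs = [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]
        return words + pairs

    db = Counter(tokenize_unified("cheap meds shipped overnight"))
    # Every pair token guarantees that its two component word tokens were also
    # counted, so the per-token tallies are correlated by construction.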

The "Naive Bayes" classification approach is theoretically moored to Bayes' Theorem by the concept that even if there's SOME dependent correlation across the features being measured to feed the classification database, incomplete dependency makes a large set of similar measurable features (like the presence of words in a message) usable as a proxy for a hypothetical set of truly independent features which are unknown and may not be readily quantified. For textual analysis, this ironically might be "concepts" but to be accurate that set would have to include a properly distributed sample of all possible concepts and a concrete way to detect each one accurately. Using words or n-word phrases instead of concepts means that Bayesian spam classification does not require a full-resolution simulation of Brahman on every mail server. Those are very resource-heavy...

The canonical empirical example of Naive Bayes classification is the use of simple physical body measurements to classify humans by biological sex. That classification improves as one adds more direct physical measurements, even though they all relate to each other via abstract ideas like "size," "muscularity," and "shape". However, if one includes such subjective abstractions as features, accuracy usually suffers (unless you cheat with features like 'femininity'). Less intuitively, if one adds arbitrary derived features like BMI, which can be calculated from the simpler measured features already in the input set, classification accuracy also usually gets worse. Perversely, classifiers using purely subjective abstractions or purely derived values, such as various ratios of direct physical metrics, work better on average than classifiers of mixed types, but can work better or worse than classifiers using the simple measurements on which the derived features are based. This is where the serious arguments about various Naive Bayes implementations arise: What constitutes features of compatible classes? How strong can a correlation between features be before they are effectively measurements of the same thing twice? Is the empirical support for the idea of semi-independent features as proxies for truly independent features strong enough? Are the distributions of the predictive features and of the classifications compatible with each other, or with Bayes *AT ALL*?
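
The BMI point is easy to poke at empirically if anyone is curious. A sketch of the sort of comparison I mean, using synthetic data and scikit-learn's GaussianNB (the numbers you get will depend entirely on the fake data, so I'm not claiming any particular result, just showing the shape of the experiment):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    n = 2000
    sex = rng.integers(0, 2, n)                    # class label
    height = rng.normal(165 + 12 * sex, 7, n)      # cm, shifted by class
    weight = rng.normal(62 + 16 * sex, 9, n)       # kg, shifted by class
    bmi = weight / (height / 100.0) ** 2           # deterministic function of the above

    for name, X in [("direct only", np.column_stack([height, weight])),
                    ("direct + derived BMI", np.column_stack([height, weight, bmi]))]:
        print(name, cross_val_score(GaussianNB(), X, sex, cv=5).mean())

The point being that BMI contributes no information the classifier doesn't already have from height and weight, yet Naive Bayes counts it as a third independent piece of evidence.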

The approach of mixing "concepts" into the existing Bayes DB is qualitatively broken because concept tokens would be deterministically derived from the actual word tokens in messages based on some subjective scheme, and then added as words which are likely to also be naturally occurring in some but not all of the messages to which they are added. So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum natural (word) and synthetic (concept) occurrences, not just as incompatible types of input feature but as a conflation of incompatible features.
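
To make the conflation concrete (the concept map here is invented purely for illustration):

    CONCEPTS = {"meds": {"viagra", "cialis", "pills", "meds"},
                "watches": {"rolex", "replica", "watches"}}

    def tokenize_with_concepts(text):
        words = text.lower().split()
        tokens = list(words)
        for concept, triggers in CONCEPTS.items():
            if triggers & set(words):
                tokens.append(concept)   # synthetic token in the same namespace as real words
        return tokens

    # "cheap meds and a rolex" yields the literal word 'meds' AND the synthetic
    # concept 'meds'; the DB's count for 'meds' can no longer distinguish them.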


FWIW, I have roughly no free time for anything between work and family demands, but if I did, what I would most like to build is a blind fixed-length tokenization Bayes classifier: just slice a message up into all of its n-byte sequences (so that a message of byte length x would have x-(n-1) overlapping tokens) and use those as inputs instead of words. An advantage of this over word-wise Bayes would be attenuation of semantic entanglement and better detection of intentional obfuscation, at the cost of needing a huge training volume to get a usable classifier.
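
In other words, something like this (n is whatever fixed length you pick; 5 here is arbitrary):

    def byte_ngrams(message: bytes, n: int = 5):
        """All overlapping n-byte slices: len(message) - (n - 1) tokens."""
        return [message[i:i + n] for i in range(len(message) - n + 1)]

    # byte_ngrams(b"cheap meds", 5)
    # -> [b'cheap', b'heap ', b'eap m', b'ap me', b'p med', b' meds']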
