I had a chance last week to read Yiming Yang's paper on feature set
reduction:
http://www.cs.cmu.edu/~yiming/papers.yy/icml97.ps.gz
It contains the startling conclusion that the single most effective thing
you can do when reducing a feature set is simply to keep the
frequently-used features (after removing a stopword set) and throw away
the rare ones. The paper calls this method "Document Frequency", because
each term's "frequency" is defined as the number of corpus documents in
which the term appears.
The paper compared five different reduction algorithms:
* Document Frequency (DF) - see above
* Information Gain (IG) - an entropy-based method
* Chi-squared Measure (CHI) - statistical correlations
* Term Strength (TS) - uses similar-document clustering
* Mutual Information (MI) - a term-category correlation formula
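For concreteness, here is how DF and the MI measure are usually written
out (my own rendering of the standard definitions, not something quoted
from the paper):

  \[
    \mathrm{DF}(t) \;=\; \bigl|\{\, d \in D : t \in d \,\}\bigr|
    \qquad\qquad
    I(t,c) \;=\; \log \frac{P(t \wedge c)}{P(t)\,P(c)}
  \]

Here D is the training document collection, and P(t ∧ c) is the
probability that a document contains term t and belongs to category c.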
The paper's findings were that DF, IG, and CHI were roughly equivalent
when eliminating up to 95% of the features, that TS's performance dropped
sharply once more than 50% of the features were eliminated, and that MI
did quite poorly overall.
Based on these findings, I decided to implement a simple DF scheme in
AI::Categorize rather than work on an entropy-based solution. I've hacked
out the DF code, and now I'm documenting it and making sure it works.
I've also decided to put some time into improving the
AI::Categorize::Evaluate package, so that I can tell how the changes
affect accuracy and speed. Theoretically, both should go up.
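In case a concrete picture helps, here's a tiny standalone Perl sketch of
the DF idea (the data, the $min_df cutoff, and the variable names are all
made up for illustration; this is not the actual AI::Categorize code):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Each "document" is just a list of tokens, with stopwords assumed
  # to be removed already.
  my @documents = (
      [qw(perl text categorization feature)],
      [qw(perl module release)],
      [qw(entropy feature selection)],
  );

  # Hypothetical cutoff: keep terms appearing in at least 2 documents.
  my $min_df = 2;

  # Document Frequency: for each term, count the number of documents
  # that contain it, counting each term at most once per document.
  my %df;
  for my $doc (@documents) {
      my %seen = map { $_ => 1 } @$doc;
      $df{$_}++ for keys %seen;
  }

  # Keep the frequent terms, discard the rare ones.
  my @kept = sort grep { $df{$_} >= $min_df } keys %df;
  print "Kept features: @kept\n";   # prints "Kept features: feature perl"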
I hope to release an updated version of the modules soon.
BTW, I'm still hoping someone wants to implement other AI::Categorize::
modules!
-------------------                      -------------------
Ken Williams                             Last Bastion of Euclidity
[EMAIL PROTECTED]                        The Math Forum