Thanks, everyone. I hadn't considered the stem/synonym problem. I have code for regularizing a doc/term matrix, with tf, binary, log, and augmented norm weighting for the cells, and idf, gfidf, entropy, normal (term vector), and probabilistic inverse weighting for the terms. Running any of these, and then SVD, on a Reuters article may take 10-20 ms. This uses a sentence/term matrix for document summarization. After doing all of this, I realized that maybe the regularized matrix alone was good enough.
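A minimal sketch of this kind of weight-then-SVD pipeline, assuming Python with NumPy. It is not the code referred to above: the function names (weight_matrix, rank_sentences), the subset of weightings shown, and the scoring rule (length of each sentence vector in the top-k singular subspace) are all assumptions made for illustration.

```python
import numpy as np


def weight_matrix(counts, local="log", global_="idf"):
    """Weight a sentence/term count matrix (rows = sentences, cols = terms).

    Hypothetical helper: only a few of the weightings named above are shown.
    """
    counts = np.asarray(counts, dtype=float)

    # Local (per-cell) weight.
    if local == "tf":
        cell = counts
    elif local == "binary":
        cell = (counts > 0).astype(float)
    elif local == "log":
        cell = np.log1p(counts)
    else:
        raise ValueError(f"unknown local weight: {local}")

    # Global (per-term) weight.
    n_rows = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)  # sentences containing each term
    if global_ == "idf":
        # Plain log(N / df); a term present in every sentence gets weight 0.
        term = np.log(n_rows / np.maximum(df, 1))
    elif global_ == "none":
        term = np.ones(counts.shape[1])
    else:
        raise ValueError(f"unknown global weight: {global_}")

    return cell * term


def rank_sentences(weighted, k=1):
    """Return sentence indices, best first, scored in the top-k SVD subspace."""
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(s))
    # Assumed scoring rule: norm of each sentence's coordinates scaled by
    # the singular values.
    scores = np.sqrt(((u[:, :k] * s[:k]) ** 2).sum(axis=1))
    return np.argsort(-scores)


if __name__ == "__main__":
    counts = [[2, 1, 0], [1, 1, 1], [0, 0, 0]]
    weighted = weight_matrix(counts, local="log", global_="idf")
    print(rank_sentences(weighted, k=1))
```

To compare against the regularized matrix alone, one option is to score sentences directly by the row norms of the weighted matrix and skip the SVD step.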
One thing came through: parts-of-speech selection for nouns and verbs helped 5-10% in every combination of regularizers, across the board. If you want good tags, select your parts of speech!

On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss <[email protected]> wrote:
> I know, I know. :) Just wanted to mention that it could lead to funny
> results, that's all. There are lots of ways of doing proper form
> disambiguation, including shallow tagging, which then allows you to
> retrieve correct base forms for lemmas, not stems. Stemming is
> typically good enough (and fast), so your advice was 100% fine.
>
> Dawid
>
> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <[email protected]> wrote:
>> This is definitely just the first step. Similar goofs happen with
>> inappropriate stemming. For instance, AIDS should not stem to aid.
>>
>> A reasonable way to find and classify exceptional cases is to look at
>> cooccurrence statistics. The contexts of original forms can be examined to
>> find cases where there is a clear semantic mismatch between the original
>> and the set of all forms that stem to the same form.
>>
>> But just picking the most common form that is present in the document is a
>> pretty good step, for all that it produces some oddities. The results are
>> much better than showing a user the stemmed forms.
>>
>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss
>> <[email protected]> wrote:
>>
>>> > Unstemming is pretty simple. Just build an unstemming dictionary based on
>>> > seeing what word forms have led to a stemmed form. Include frequencies.
>>>
>>> This can lead to very funny (or not, depending on how you look at it)
>>> mistakes when different lemmas stem to the same token. How frequent
>>> and important this phenomenon is varies from language to language (and
>>> can be calculated a priori).
>>>
>>> Dawid
>>>

--
Lance Norskog
[email protected]
