I used the OpenNLP part-of-speech tagger to label each word as 'noun', 'verb', etc., then removed all words that were not nouns or verbs. In my use case this is a total win. In other cases, maybe not: Twitter has quite a varied non-grammar.
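Roughly, the filter looks like this (a minimal sketch using OpenNLP's tokenizer and POS tagger; the model file names and the sample sentence are placeholders, not my actual pipeline):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class NounVerbFilter {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained OpenNLP models (file names are the stock
        // English models; substitute whatever models you actually use).
        try (InputStream tokStream = new FileInputStream("en-token.bin");
             InputStream posStream = new FileInputStream("en-pos-maxent.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokStream));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posStream));

            String text = "The AIDS epidemic can aid researchers studying immunity.";
            String[] tokens = tokenizer.tokenize(text);
            String[] tags = tagger.tag(tokens);

            // Keep only nouns (NN*) and verbs (VB*); drop everything else.
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                if (tags[i].startsWith("NN") || tags[i].startsWith("VB")) {
                    kept.add(tokens[i]);
                }
            }
            System.out.println(kept);
        }
    }
}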
On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel <[email protected]> wrote:
> The way back from stem to tag is interesting from the standpoint of
> making tags human readable. I had assumed a lookup but this seems much
> more satisfying and flexible. In order to keep frequencies it will take
> something like a dictionary creation step in the analyzer. This in turn
> seems to imply a join, so a whole new map-reduce job--maybe not
> completely trivial?
>
> It seems that NLP can be used in two very different ways here. First as
> a filter (keep only nouns and verbs?), second to differentiate semantics
> (can:verb, can:noun). One method is a dimensional reduction technique;
> the other increases dimensions but can lead to orthogonal dimensions
> from the same term. I suppose both could be used together, as the above
> example indicates.
>
> It sounds like you are using it to filter (only?). Can you explain what
> you mean by:
> "One thing came through- parts-of-speech selection for nouns and verbs
> helped 5-10% in every combination of regularizers."
>
> On Aug 3, 2012, at 6:31 PM, Lance Norskog <[email protected]> wrote:
>
> Thanks everyone- I hadn't considered the stem/synonym problem. I have
> code for regularizing a doc/term matrix with tf, binary, log and
> augmented norm for the cells, and idf, gfidf, entropy, normal (term
> vector) and probabilistic inverse. Running any of these, and then SVD,
> on a Reuters article may take 10-20 ms. This uses a sentence/term
> matrix for document summarization. After doing all of this, I realized
> that maybe just the regularized matrix was good enough.
>
> One thing came through- parts-of-speech selection for nouns and verbs
> helped 5-10% in every combination of regularizers. All across the
> board. If you want good tags, select your parts of speech!
>
> On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
> <[email protected]> wrote:
>> I know, I know. :) Just wanted to mention that it could lead to funny
>> results, that's all. There are lots of ways of doing proper form
>> disambiguation, including shallow tagging, which then allows retrieving
>> correct base forms for lemmas, not stems. Stemming is typically good
>> enough (and fast), so your advice was 100% fine.
>>
>> Dawid
>>
>> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <[email protected]> wrote:
>>> This is definitely just the first step. Similar goofs happen with
>>> inappropriate stemming. For instance, AIDS should not stem to aid.
>>>
>>> A reasonable way to find and classify exceptional cases is to look at
>>> cooccurrence statistics. The contexts of original forms can be
>>> examined to find cases where there is a clear semantic mismatch
>>> between the original and the set of all forms that stem to the same
>>> form.
>>>
>>> But just picking the most common form that is present in the document
>>> is a pretty good step, for all that it produces some oddities. The
>>> results are much better than showing a user the stemmed forms.
>>>
>>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss
>>> <[email protected]> wrote:
>>>
>>>>> Unstemming is pretty simple. Just build an unstemming dictionary
>>>>> based on seeing what word forms have led to a stemmed form. Include
>>>>> frequencies.
>>>>
>>>> This can lead to very funny (or not, depends how you look at it)
>>>> mistakes when different lemmas stem to the same token. How frequent
>>>> and important this phenomenon is varies from language to language
>>>> (and can be calculated a priori).
>>>>
>>>> Dawid
>
> --
> Lance Norskog
> [email protected]

--
Lance Norskog
[email protected]
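PS: a toy sketch of the unstemming dictionary Ted describes, in case it helps. The stem() method here is only a stand-in for a real Porter/Snowball stemmer, and the "most frequent surface form per stem" choice is the "most common form" step from the thread, not anyone's actual code:

import java.util.HashMap;
import java.util.Map;

public class Unstemmer {
    // stem -> (surface form -> frequency), built while analyzing the corpus.
    private final Map<String, Map<String, Integer>> forms = new HashMap<>();

    // Stand-in stemmer: lowercase and strip a trailing 's'.
    // A real pipeline would call a Porter/Snowball stemmer here.
    static String stem(String word) {
        String w = word.toLowerCase();
        return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
    }

    // Record one occurrence of a surface form under its stem.
    public void observe(String surfaceForm) {
        forms.computeIfAbsent(stem(surfaceForm), k -> new HashMap<>())
             .merge(surfaceForm, 1, Integer::sum);
    }

    // Map a stem back to the most frequently seen original form.
    public String unstem(String stem) {
        Map<String, Integer> counts = forms.get(stem);
        if (counts == null) return stem;
        return counts.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .get().getKey();
    }

    public static void main(String[] args) {
        Unstemmer u = new Unstemmer();
        for (String w : new String[] {"tags", "tag", "tags", "tagging"}) {
            u.observe(w);
        }
        // Prints "tags", the most common surface form for the stem "tag".
        System.out.println(u.unstem(stem("tags")));
    }
}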
