The way back from stem to tag is interesting from the standpoint of making
tags human-readable. I had assumed a lookup table, but this seems much more
satisfying and flexible. In order to keep frequencies it will take something
like a dictionary-creation step in the analyzer. That in turn seems to imply
a join, and so a whole new MapReduce job; maybe not completely trivial?
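
Roughly what I have in mind for the dictionary step, as a sketch; stem()
here is just a placeholder for whatever stemmer the analyzer actually runs:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: per stem, keep every surface form that produced it, with
    // frequencies, and display the most frequent form instead of the stem.
    class UnstemDictionary {
      private final Map<String, Map<String, Integer>> forms = new HashMap<>();

      // Called once per token during analysis.
      void observe(String surface) {
        forms.computeIfAbsent(stem(surface), k -> new HashMap<>())
             .merge(surface, 1, Integer::sum);
      }

      // Human-readable tag: the most frequent surface form for this stem.
      String displayForm(String stem) {
        Map<String, Integer> f = forms.get(stem);
        if (f == null) return stem;              // unseen stem: fall back
        return f.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .get();
      }

      // Placeholder; the real analyzer's stemmer goes here.
      private String stem(String s) { return s.toLowerCase(); }
    }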

It seems that NLP can be used in two very different ways here: first as a
filter (keep only nouns and verbs?), second to differentiate semantics
(can:verb vs. can:noun). One method is a dimensionality-reduction technique;
the other increases dimensions, but can lead to orthogonal dimensions from
the same term. I suppose both could be used together, as the above example
indicates.
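
To make the two uses concrete, a minimal sketch; posTag() is a stand-in for
a real tagger (OpenNLP, say), which would tag whole sentences rather than
single tokens:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    class PosUses {
      // Penn Treebank noun/verb tags (abbreviated; the real set is longer).
      static final Set<String> KEEP = Set.of("NN", "NNS", "VB", "VBD", "VBG");

      // Use 1: dimensional reduction -- drop everything but nouns and verbs.
      static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
          if (KEEP.contains(posTag(t))) out.add(t);
        }
        return out;
      }

      // Use 2: dimension splitting -- "can" becomes "can:VB" or "can:NN",
      // two orthogonal dimensions from the same term.
      static List<String> split(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t + ":" + posTag(t));
        return out;
      }

      // Placeholder; a real tagger needs sentence context, not lone tokens.
      static String posTag(String token) { return "NN"; }
    }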

It sounds like you are using it to filter (only?). Can you explain what you
mean by:
"One thing came through: parts-of-speech selection for nouns and verbs
helped 5-10% in every combination of regularizers."


On Aug 3, 2012, at 6:31 PM, Lance Norskog <[email protected]> wrote:

Thanks, everyone. I hadn't considered the stem/synonym problem. I have
code for regularizing a doc/term matrix with tf, binary, log and
augmented norm for the cells, and idf, gfidf, entropy, normal (term
vector) and probabilistic inverse for the global weights. Running any of
these, and then SVD, on a Reuters article may take 10-20 ms. This uses a
sentence/term matrix for document summarization. After doing all of this,
I realized that maybe just the regularized matrix was good enough.
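
For concreteness, a rough sketch of one pairing from that list (log for
the cells times entropy for the global weight); the other combinations
slot into the same shape. Plain Java, nothing library-specific, and it
assumes more than one document:

    // counts[i][j] = raw frequency of term i in document j.
    // Local weight: log(1 + tf).  Global weight: entropy,
    // g_i = 1 + sum_j (p_ij * log p_ij) / log(numDocs), p_ij = tf_ij / gf_i.
    class LogEntropy {
      static double[][] weight(double[][] counts) {
        int terms = counts.length, docs = counts[0].length;
        double[][] w = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
          double gf = 0;                      // global frequency of term i
          for (int j = 0; j < docs; j++) gf += counts[i][j];
          if (gf == 0) continue;              // term never occurs; leave zeros
          double g = 1.0;                     // entropy global weight, in [0,1]
          for (int j = 0; j < docs; j++) {
            double p = counts[i][j] / gf;
            if (p > 0) g += p * Math.log(p) / Math.log(docs);
          }
          for (int j = 0; j < docs; j++) {
            w[i][j] = g * Math.log(1 + counts[i][j]);
          }
        }
        return w;
      }
    }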

One thing came through: parts-of-speech selection for nouns and verbs
helped 5-10% in every combination of regularizers. All across the
board. If you want good tags, select your parts of speech!

On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
<[email protected]> wrote:
> I know, I know. :) Just wanted to mention that it could lead to funny
> results, that's all. There are lots of ways of doing proper form
> disambiguation, including shallow tagging, which then allows one to
> retrieve correct base forms (lemmas, not stems). Stemming is
> typically good enough (and fast), so your advice was 100% fine.
> 
> Dawid
> 
> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <[email protected]> wrote:
>> This is definitely just the first step.  Similar goofs happen with
>> inappropriate stemming.  For instance, AIDS should not stem to aid.
>> 
>> A reasonable way to find and classify exceptional cases is to look at
>> cooccurrence statistics.  The contexts of original forms can be examined to
>> find cases where there is a clear semantic mismatch between the original
>> and the set of all forms that stem to the same form.
>> 
>> But just picking the most common form that is present in the document
>> is a pretty good step, for all that it produces some oddities.  The
>> results are much better than showing a user the stemmed forms.
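>> 
>> As a sketch of that check (the names and the 0.2 cutoff below are
>> illustrative, not from any library): compare a form's context counts
>> against the pooled contexts of everything sharing its stem, and flag
>> low cosine similarity.
>> 
>>     import java.util.Map;
>> 
>>     class StemMismatch {
>>       // Cosine between two sparse context-count vectors.
>>       static double cosine(Map<String, Double> a, Map<String, Double> b) {
>>         double dot = 0, na = 0, nb = 0;
>>         for (Map.Entry<String, Double> e : a.entrySet()) {
>>           dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
>>           na  += e.getValue() * e.getValue();
>>         }
>>         for (double v : b.values()) nb += v * v;
>>         return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
>>       }
>> 
>>       // Flag forms like "AIDS" whose contexts diverge from the pooled
>>       // contexts of their stem class ("aid", "aids", "aided", ...).
>>       static boolean mismatch(Map<String, Double> formContexts,
>>                               Map<String, Double> stemClassContexts) {
>>         return cosine(formContexts, stemClassContexts) < 0.2;
>>       }
>>     }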
>> 
>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss 
>> <[email protected]>wrote:
>> 
>>>> Unstemming is pretty simple.  Just build an unstemming dictionary
>>>> based on seeing what word forms have led to a stemmed form.  Include
>>>> frequencies.
>>> 
>>> This can lead to very funny (or not, depending on how you look at it)
>>> mistakes when different lemmas stem to the same token. How frequent
>>> and important this phenomenon is varies from language to language (and
>>> can be calculated a priori).
>>> 
>>> Dawid
>>> 



-- 
Lance Norskog
[email protected]
