Nice stuff. And glad that Mahout was able to help!

On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY <[email protected]> wrote:
> Hi All,
>
> We have developed an auto tagging system for our micro-blogging platform.
> Here is what we have done:
>
> The purpose of the system was to look for tags in an article automatically
> when someone posts a link in our micro-blogging site. The goal was to allow
> us to follow a tag instead of (or in addition to) a person. So we used some
> custom code on top of Mahout, UIMA, Open-NLP etc.
>
> If you are interested to see how it works take a look at:
> http://www.scoopspot.com/
>
> One more thing, we also created a robot which goes to some well-known
> web sites (Read Write Web, Hackers News, Tech Crunch etc.), gets the
> articles from the web and publishes them to our micro-blog. As we
> already have the tag following, we get the information without any problem.
> That's very cool (to us at least). You can see the output of the robot at
> this location:
>
> http://news.scoopspot.com/
>
> I thought this might be an example of what Mahout can do and related to
> this thread, so felt like sharing with you guys.
>
> Sorry if it looks off-topic.
>
> Regards,
> Samik
>
> On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog <[email protected]> wrote:
>
> > I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun',
> > 'verb', etc. I removed all words that were not nouns or verbs. In my
> > use case, this is a total win. In other cases, maybe not: Twitter has
> > a quite varied non-grammar.
> >
> > On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel <[email protected]> wrote:
> > > The way back from stem to tag is interesting from the standpoint of
> > > making tags human readable. I had assumed a lookup but this seems
> > > much more satisfying and flexible. In order to keep frequencies it
> > > will take something like a dictionary creation step in the analyzer.
> > > This in turn seems to imply a join so a whole new map reduce
> > > job--maybe not completely trivial?
> > >
> > > It seems that NLP can be used in two very different ways here. First
> > > as a filter (keep only nouns and verbs?), second to differentiate
> > > semantics (can:verb, can:noun). One method is a dimensionality
> > > reduction technique; the other increases dimensions but can lead to
> > > orthogonal dimensions from the same term. I suppose both could be
> > > used together as the above example indicates.
> > >
> > > It sounds like you are using it to filter (only?). Can you explain
> > > what you mean by:
> > > "One thing came through- parts-of-speech selection for nouns and verbs
> > > helped 5-10% in every combination of regularizers."
> > >
> > >
> > > On Aug 3, 2012, at 6:31 PM, Lance Norskog <[email protected]> wrote:
> > >
> > > Thanks everyone- I hadn't considered the stem/synonym problem. I have
> > > code for regularizing a doc/term matrix with tf, binary, log and
> > > augmented norm for the cells and idf, gfidf, entropy, normal (term
> > > vector) and probabilistic inverse. Running any of these, and then SVD,
> > > on a Reuters article may take 10-20 ms. This uses a sentence/term
> > > matrix for document summarization. After doing all of this, I realized
> > > that maybe just the regularized matrix was good enough.
> > >
> > > One thing came through- parts-of-speech selection for nouns and verbs
> > > helped 5-10% in every combination of regularizers. All across the
> > > board. If you want good tags, select your parts of speech!
> > >
> > > On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
> > > <[email protected]> wrote:
> > >> I know, I know. :) Just wanted to mention that it could lead to funny
> > >> results, that's all. There are lots of ways of doing proper form
> > >> disambiguation, including shallow tagging, which then allows one to
> > >> retrieve correct base forms for lemmas, not stems. Stemming is
> > >> typically good enough (and fast) so your advice was 100% fine.
> > >>
> > >> Dawid
> > >>
> > >> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <[email protected]> wrote:
> > >>> This is definitely just the first step. Similar goofs happen with
> > >>> inappropriate stemming. For instance, AIDS should not stem to aid.
> > >>>
> > >>> A reasonable way to find and classify exceptional cases is to look at
> > >>> cooccurrence statistics. The contexts of original forms can be
> > >>> examined to find cases where there is a clear semantic mismatch
> > >>> between the original and the set of all forms that stem to the same
> > >>> form.
> > >>>
> > >>> But just picking the most common that is present in the document is a
> > >>> pretty good step for all that it produces some oddities. The results
> > >>> are much better than showing a user the stemmed forms.
> > >>>
> > >>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss <[email protected]> wrote:
> > >>>
> > >>>>> Unstemming is pretty simple. Just build an unstemming dictionary
> > >>>>> based on seeing what word forms have led to a stemmed form.
> > >>>>> Include frequencies.
> > >>>>
> > >>>> This can lead to very funny (or not, depends how you look at it)
> > >>>> mistakes when different lemmas stem to the same token. How frequent
> > >>>> and important this phenomenon is varies from language to language
> > >>>> (and can be calculated a priori).
> > >>>>
> > >>>> Dawid
> > >>>>
> > >
> > >
> > >
> > > --
> > > Lance Norskog
> > > [email protected]
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > [email protected]
> >
>
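
For readers of the archive, the unstemming dictionary described in the quoted thread (record which word forms led to each stem, keep frequencies, then prefer the most common form that is present in the document) might be sketched roughly as below. This is only an illustration under stated assumptions, not the code anyone in the thread actually ran: `toy_stem`, `build_unstem_dict` and `unstem` are hypothetical names, and `toy_stem` is a crude suffix stripper standing in for a real stemmer.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer (assumption, not from the thread):
    lowercase the word and strip one of a few common English suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_unstem_dict(tokens, stem=toy_stem):
    """Map each stem to a Counter of the surface forms that produced it."""
    forms = defaultdict(Counter)
    for tok in tokens:
        forms[stem(tok)][tok] += 1
    return forms


def unstem(stem_token, forms, doc_tokens=None):
    """Pick a human-readable form for a stem.

    Without doc_tokens: the most frequent surface form seen in the corpus.
    With doc_tokens: the most frequent form that also occurs in that document,
    falling back to the corpus-wide choice if none does.
    """
    counts = forms.get(stem_token)
    if not counts:
        return stem_token  # unknown stem: nothing better to show
    if doc_tokens is not None:
        present = set(doc_tokens)
        in_doc = Counter({f: c for f, c in counts.items() if f in present})
        if in_doc:
            counts = in_doc
    return counts.most_common(1)[0][0]


# Small usage example on made-up tokens.
corpus = ["links", "links", "links", "linking", "linked", "aid", "aid", "AIDS"]
forms = build_unstem_dict(corpus)
print(unstem("link", forms))                        # links
print(unstem("aid", forms))                         # aid
print(unstem("aid", forms, doc_tokens=["AIDS"]))    # AIDS
```

The last two calls show the collision raised in the thread: "AIDS" and "aid" share a stem under this toy stemmer, so the corpus-wide choice can be wrong for a given document, and restricting to forms present in the document helps only when the document does not contain both.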
