I have not tried the POS trick for dimension reduction, or using the Lucene
OpenNLP integration in Mahout. OpenNLP also includes "chunking", which means
finding noun and verb phrases. This gives you more tools for filtering words.
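As a rough sketch of the POS-filtering idea (pure Python; the tagged tokens
below are hand-written stand-ins for real OpenNLP tagger output, and the tag
names follow the Penn Treebank convention):

```python
# Keep only nouns and verbs from POS-tagged tokens, as a stand-in for what
# an OpenNLP-backed Lucene analyzer stack would do during analysis.
# NN* = noun tags, VB* = verb tags (Penn Treebank convention).

def filter_nouns_verbs(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs."""
    keep_prefixes = ("NN", "VB")
    return [tok for tok, tag in tagged_tokens
            if tag.startswith(keep_prefixes)]

tagged = [("the", "DT"), ("crawler", "NN"), ("fetches", "VBZ"),
          ("pages", "NNS"), ("quickly", "RB")]
print(filter_nouns_verbs(tagged))  # ['crawler', 'fetches', 'pages']
```

In a real pipeline this filter would sit inside a custom analyzer, so the
same reduction applies at both index and query time.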
Solr includes the Carrot2 text clustering toolkit. Given a search, Carrot2
clusters all of the results.

On Thu, Jul 19, 2012 at 10:54 AM, Pat Ferrel <[email protected]> wrote:
> Not sure who the question is for so I'll butt in.
>
> I calculate a term cloud for each document in batch using (so far):
> boilerpipe parsing of pages (did I say they are from a crawl?), filtering
> out high-frequency terms and stop words with a custom Lucene analyzer,
> then TF-IDF from Mahout.
>
> Once I have a term cloud I display it with a snippet of the document. The
> user can click a term they think is extra important. This reweights it
> higher than it was, and the resulting reweighted vector is used to do
> something like a "MoreLikeThis" query. So the original cloud is from the
> weighted term vector. It needs to be fairly human readable. I'm using
> stemming, so that is already stretching the human-readable part. We've
> used a form of this in a prototype and it allows the user to navigate the
> information space in a new way. Hopefully at a larger scale it will be
> useful: sort of a MoreLikeThis but with more emphasis on some specific
> term.
>
> We've built an NER system for other reasons, but we've questioned whether
> applying part-of-speech tagging or even NER to choosing terms was a good
> way to do dimensional reduction. I assume you take all nouns and verbs
> and throw out the rest? Were the results noticeably better for creating
> vectors? It sounds like you integrated this into Lucene; how about
> Mahout's vectorizing? Seems like it would be simple to put a custom
> Lucene analyzer into the pipeline. I'd be interested in your opinion of
> either result.
>
> As to LSA and SVD, that remains one of our next steps. For another part
> of the project I'm building a sort of hierarchical clustering model that
> will create clusters at different scales (different magnitudes of k, for
> instance), then connect them by centroid distance into a graph.
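The reweight-then-requery step Pat describes can be sketched in a few lines
(pure Python and toy documents, illustrative only; a simple log IDF rather
than Mahout's exact weighting):

```python
import math
from collections import Counter

docs = {
    "d1": "baseball pitcher throws baseball",
    "d2": "soccer player kicks ball",
    "d3": "pitcher injury report",
}

# Document frequency and a simple smoothed IDF over the toy corpus.
df = Counter(t for text in docs.values() for t in set(text.split()))
idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in df}

def tfidf(text):
    tf = Counter(text.split())
    return {t: tf[t] * idf[t] for t in tf}

vectors = {d: tfidf(text) for d, text in docs.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The user clicks "pitcher" in d1's term cloud: boost it, then run a
# MoreLikeThis-style query with the reweighted vector.
query = dict(vectors["d1"])
query["pitcher"] *= 3.0
ranked = sorted((d for d in docs if d != "d1"),
                key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranked[0])  # d3, the only other doc mentioning "pitcher"
```

The boost factor and corpus here are made up; the point is only that the
user gesture becomes a per-term weight change before the similarity query.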
> We will use cluster evaluators to prune out crappy (a technical term)
> clusters, and this we hope will give a nice categorization that takes
> into account different levels of generalization. I expect that different
> levels of dimensional reduction will be useful in clustering at different
> scales, but we haven't tried it so I'm not sure. Doing LSA in conjunction
> with DR once may be all we need. Our experience with clustering using
> only the vectorizing DR is that it produces good clusters, but highly
> specific ones. Specificity is a hard thing to account for. It would be
> nice to end up with a sports cluster, then more specific baseball,
> soccer, and golf clusters, and even more specific players or teams. My
> intuition says teasing this out of the data will require DR in some form,
> and possibly at varying scales.
>
> On 7/19/12 12:25 AM, Lance Norskog wrote:
>>
>> Are you creating word clouds essentially, then finding the most
>> relevant documents for each word? And, is this a batch job, or for
>> interactive searches?
>>
>> Two techniques:
>> 1) You can use the OpenNLP part-of-speech tagging to isolate nouns
>> and verbs. My OpenNLP Lucene patch includes a sample Lucene analyzer
>> stack for this task. See LUCENE-2899.
>>
>> 2) Latent Semantic Analysis will create a much better term list, but
>> it requires a batch computation. In LSA you do a singular value
>> decomposition on the document/term matrix. This gives sorted ratings
>> for both documents and terms, by how relevant they are to the "themes"
>> of the corpus. The documents and terms are both sorted by how
>> "thematic" they are. Document summarization illustrates this concept.
>>
>> You can summarize documents using SVD on a sentence/term matrix. SVD
>> will find the most "thematic" sentence and term. A newspaper article
>> is pre-summarized: the main theme is the first sentence ("lede") and
>> the second sentence reinforces and elaborates on the first.
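The LSA term-ranking idea can be sketched with numpy (a toy count matrix,
not a real corpus; term "thematic-ness" is read off the first right
singular vector):

```python
import numpy as np

terms = ["game", "score", "team", "tax", "budget"]
# Rows = documents, columns = terms: three sports docs, one finance doc.
A = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 1, 2, 0, 0],
    [0, 0, 0, 2, 1],
], dtype=float)

# SVD: A = U @ diag(s) @ Vt. The first row of Vt holds term loadings on
# the dominant "theme"; terms with the largest absolute loading are the
# most thematic.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
loadings = np.abs(Vt[0])
ranked_terms = [terms[i] for i in np.argsort(-loadings)]
print(ranked_terms[:3])  # the three sports terms dominate the top theme
```

The same decomposition ranks documents via the columns of U, which is what
makes it useful for the summarization trick described next.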
>> So, newspaper articles are pre-tagged test data for document
>> summarization, and a good demonstrator for LSA. I got interesting
>> results with the Reuters corpus.
>>
>> SVD sorts vectors by length and orthogonality. The first sentence will
>> have the most "theme words" and the second the next largest number.
>> The crazy intuition here is that the reinforcing sentence rarely
>> shares theme words with the primary sentence. So, the lede and
>> reinforcing sentences have the two longest, most orthogonal term
>> vectors.
>>
>> Starting with this, it is possible to mangle the term-vector matrix
>> into a much, much smaller matrix at the cost of losing the identity of
>> documents: you get thematic words, but do not get thematic documents.
>> This technique is called Random Indexing. It requires a much hairier
>> explanation than the above.
>>
>> On Wed, Jul 18, 2012 at 9:53 AM, Ken Krugler
>> <[email protected]> wrote:
>>>
>>> On Jul 18, 2012, at 9:07am, Pat Ferrel wrote:
>>>
>>>> Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature
>>>> looks like a good fit and seems to do about what you describe below.
>>>> It seems a good idea to reorder the returned docs by some distance
>>>> or similarity measure.
>>>>
>>>> The major problem you mention is extracting good terms; are you
>>>> talking about creating the query or creating the Solr index?
>>>
>>> Both, since they (must) use the same approach for the query to do a
>>> good job of matching against docs in the index.
>>>
>>> Often two-word phrases are great terms, but just as often they wind
>>> up being junk - "otherwise resolving", where the TF*IDF (or LLR)
>>> score is high enough to make it one of the top terms for a document,
>>> but it doesn't really capture anything about the meaning of the
>>> document, and thus is just noise as far as similarity is concerned.
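The lede/reinforcement intuition can be sketched at the sentence level
(numpy; toy sentences and plain bag-of-words counts, so this is only an
illustration of the mechanics, not a working summarizer):

```python
import numpy as np

sentences = [
    "the mayor announced a new city budget on monday",
    "schools and transit gain funding under the plan",
    "rain is expected tuesday",
]
vocab = sorted({w for s in sentences for w in s.split()})
col = {w: i for i, w in enumerate(vocab)}

# Sentence/term count matrix.
M = np.zeros((len(sentences), len(vocab)))
for r, s in enumerate(sentences):
    for w in s.split():
        M[r, col[w]] += 1

U, s_vals, Vt = np.linalg.svd(M, full_matrices=False)
# The sentence loading strongest on the first component is the most
# "thematic" (the lede); the strongest on the second component is the
# long, nearly-orthogonal reinforcing sentence.
lede = int(np.argmax(np.abs(U[:, 0])))
reinforce = int(np.argmax(np.abs(U[:, 1])))
print(sentences[lede])
print(sentences[reinforce])
```

On this toy data the lede and reinforcing sentences share only the word
"the", which is exactly the near-orthogonality the intuition predicts.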
>>>
>>> -- Ken
>>>
>>>> BTW RowSimilarity works so well for doc similarity that I'm
>>>> resisting taking it out, and will concentrate on reducing the size
>>>> of the matrix it deals with to mitigate the scaling problems. For
>>>> the realtime queries I think I'll look deeper into MoreLikeThis. In
>>>> our use case we'll be taking the TF-IDF term weights from a doc and
>>>> reweighting some terms based on a user gesture.
>>>>
>>>> On 7/17/12 8:22 PM, Ken Krugler wrote:
>>>>>
>>>>> Hi Pat,
>>>>>
>>>>> On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
>>>>>
>>>>>> Interesting.
>>>>>>
>>>>>> I have another requirement, which is to do something like
>>>>>> real-time vector-based queries. Imagine taking a doc vector,
>>>>>> reweighting some terms, then doing a query with it, perhaps in a
>>>>>> truncated form. There are several ways to do this, but only Solr
>>>>>> would offer real-time results AFAIK. It looks like I could use
>>>>>> your approach below to do this. A quick look at eDisMax, however,
>>>>>> suggests some problems. The use of pf2 and pf3 would jam the query
>>>>>> vector into synthesized bi- and trigrams, for instance.
>>>>>
>>>>> The simplistic approach I used was to extract the top 50 terms
>>>>> (with TF*IDF weights) from the target document, then use those
>>>>> terms (with weights as boosts) to do a regular Lucene OR query and
>>>>> request the top 20 hits.
>>>>>
>>>>> The index I'm searching against has Solr documents with a
>>>>> multi-value field that contains the top 50 terms, generated using
>>>>> the same approach as with the target document. It also contains
>>>>> stored weights for each of those terms.
>>>>>
>>>>> I didn't use payload boosting, but could have, to improve the
>>>>> quality of this search - it seemed to be working well enough, and
>>>>> speed was pretty important.
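Ken's query construction reduces to something like this (illustrative
sketch; the field name `top_terms` is made up, and the output uses standard
Lucene `term^boost` syntax):

```python
# Build a boosted OR query from the top TF*IDF terms of a target document,
# in Lucene query syntax: field:(term1^w1 OR term2^w2 OR ...).

def build_boosted_query(term_weights, field="top_terms", max_terms=50):
    top = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    clauses = ["{}^{:.2f}".format(t, w) for t, w in top[:max_terms]]
    return "{}:({})".format(field, " OR ".join(clauses))

weights = {"pitcher": 4.2, "baseball": 2.1, "throws": 1.4}
print(build_boosted_query(weights))
# top_terms:(pitcher^4.20 OR baseball^2.10 OR throws^1.40)
```

With 50 weighted terms per document this stays a single disjunction query,
which is what keeps the lookup fast enough for interactive use.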
>>>>>
>>>>> Solr returns a sorted list of hits, and then I do a regular vector
>>>>> similarity calculation between the target and each of these top 20
>>>>> hits, and select the best one (assuming it passes a similarity
>>>>> threshold).
>>>>>
>>>>>> I'd be interested in hearing more about how you use it. Is there
>>>>>> a better venue than the Mahout list?
>>>>>
>>>>> If you'd like more details, that's probably better for an off-list
>>>>> discussion…doesn't feel very Mahout-ish in nature :)
>>>>>
>>>>> Though a discussion of the major problem (how to extract "good"
>>>>> terms from the text) would be very interesting, as I wound up
>>>>> crafting what felt like a kludgy pseudo-NLP solution.
>>>>>
>>>>> -- Ken
>>>>>
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> http://www.scaleunlimited.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Mahout & Solr

--
Lance Norskog
[email protected]
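Ken's final selection step (rerank Solr's hits by true vector similarity
and keep the best one above a threshold) can be sketched as follows (pure
Python; the hit vectors and threshold value are made up for illustration):

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(target, hits, threshold=0.2):
    """hits: list of (doc_id, term_weight_dict) pairs, e.g. the top 20
    documents returned by the search, with their stored term weights."""
    scored = [(cosine(target, vec), doc_id) for doc_id, vec in hits]
    score, doc_id = max(scored)
    return doc_id if score >= threshold else None

target = {"pitcher": 4.2, "baseball": 2.1}
hits = [("d7", {"pitcher": 1.0, "injury": 2.0}),
        ("d9", {"baseball": 3.0, "pitcher": 2.0}),
        ("d4", {"soccer": 2.0})]
print(best_match(target, hits))  # d9
```

Returning None when nothing clears the threshold is the "assuming it
passes" condition: a best hit that is merely the least-bad of twenty is
not a match.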
