Not sure who the question is for so I'll butt in.

I calculate a term cloud for each document in batch using (so far): Boilerpipe parsing of the pages (did I say they come from a crawl?), a custom Lucene analyzer to filter out high-frequency terms and stop words, and TF-IDF weighting from Mahout.
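The analyzer part of that pipeline looks roughly like the sketch below. This is only a sketch, assuming Lucene 4.x's Analyzer API (signatures differ between versions); the high-document-frequency filtering would be an extra custom filter driven by a precomputed term list, which I've left out.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer along the lines described above: tokenize, lowercase,
// drop stop words, stem. The high-frequency-term filter (custom, list-driven)
// is omitted here.
public class TermCloudAnalyzer extends Analyzer {
  private static final Version MATCH_VERSION = Version.LUCENE_40;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(MATCH_VERSION, reader);
    TokenStream stream = new LowerCaseFilter(MATCH_VERSION, source);
    stream = new StopFilter(MATCH_VERSION, stream, StandardAnalyzer.STOP_WORDS_SET);
    stream = new PorterStemFilter(stream);
    return new TokenStreamComponents(source, stream);
  }
}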

Once I have a term cloud I display it with a snippet of the document. The user can click a term they think is extra important. That term gets reweighted higher than it was, and the resulting reweighted vector is used for something like a "MoreLikeThis" query. So the original cloud comes from the weighted term vector, and it needs to be fairly human readable. I'm using stemming, so that already stretches the human-readable part. We've used a form of this in a prototype and it lets the user navigate the information space in a new way. Hopefully at a larger scale it will be useful: sort of a MoreLikeThis but with more emphasis on some specific term.
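The reweighting step itself is trivial; here is a sketch assuming Mahout's sparse Vector plus the term dictionary the vectorizer produces. The class name, method name, and BOOST factor are all made up for illustration.

import java.util.Map;

import org.apache.mahout.math.Vector;

// Hypothetical reweighting step: the user clicked "termText", so multiply its
// TF-IDF weight by a boost before using the vector for a MoreLikeThis-style query.
// "dictionary" maps term text -> vector index, as produced by the vectorizer.
public class TermReweighter {
  private static final double BOOST = 3.0;  // arbitrary; tune for the UI

  public static void boostTerm(Vector docVector, Map<String, Integer> dictionary, String termText) {
    Integer index = dictionary.get(termText);
    if (index != null) {
      docVector.set(index, docVector.get(index) * BOOST);
    }
  }
}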

We've built an NER system for other reasons, but we've questioned whether applying part-of-speech tagging or even NER to choosing terms is a good way to do dimensional reduction. I assume you take all nouns and verbs and throw out the rest? Were the results noticeably better for creating vectors? It sounds like you integrated this into Lucene; what about Mahout's vectorizing? It seems like it would be simple to put a custom Lucene analyzer into that pipeline. I'd be interested in your opinion of either result.

As to LSA and SVD, that remains one of our next steps. For another part of the project I'm building a sort of hierarchical clustering model that will create clusters at different scales (different magnitudes of k, for instance) and then connect them by centroid distance into a graph. We will use cluster evaluators to prune out crappy (a technical term) clusters, and we hope this will give a nice categorization that takes different levels of generalization into account. I expect that different levels of dimensional reduction will be useful for clustering at different scales, but we haven't tried it so I'm not sure; doing LSA in conjunction with DR once may be all we need. In our experience, clustering using only the vectorizing DR produces good clusters, but highly specific ones. Specificity is a hard thing to account for. It would be nice to end up with a sports cluster, then more specific baseball, soccer, and golf clusters, and even more specific players or teams. My intuition says teasing this out of the data will require DR in some form, and possibly at varying scales.
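The cross-scale linking step I have in mind is roughly the following. This is only a sketch under my own assumptions (class and method names invented); it assumes k-means has already been run at a coarse and a fine k, and uses Mahout's CosineDistanceMeasure on the centroid Vectors to attach each fine cluster to its nearest coarse parent.

import java.util.List;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Attach each fine-grained centroid to its nearest coarse centroid to form the
// edges of the hierarchy graph. Cluster evaluation / pruning happens elsewhere.
public class ClusterLinker {
  private final DistanceMeasure measure = new CosineDistanceMeasure();

  /** Returns, for each fine centroid, the index of the nearest coarse centroid. */
  public int[] link(List<Vector> coarseCentroids, List<Vector> fineCentroids) {
    int[] parents = new int[fineCentroids.size()];
    for (int i = 0; i < fineCentroids.size(); i++) {
      double best = Double.MAX_VALUE;
      for (int j = 0; j < coarseCentroids.size(); j++) {
        double d = measure.distance(coarseCentroids.get(j), fineCentroids.get(i));
        if (d < best) {
          best = d;
          parents[i] = j;
        }
      }
    }
    return parents;
  }
}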

On 7/19/12 12:25 AM, Lance Norskog wrote:
Are you creating word clouds essentially, then finding the most
relevant documents for each word? And, is this a batch job, or for
interactive searches?

2 techniques:
1) You can use OpenNLP part-of-speech tagging to isolate nouns
and verbs. My OpenNLP Lucene patch (LUCENE-2899) includes a sample
Lucene analyzer stack for this task.
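A minimal sketch of the noun/verb filtering using OpenNLP's tagger directly (this is not the LUCENE-2899 analyzer stack itself; the model path, class, and method names are assumptions):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Tag each token and keep only Penn Treebank noun (NN*) and verb (VB*) tags,
// discarding everything else before vectorizing.
public class NounVerbFilter {
  private final POSTaggerME tagger;

  public NounVerbFilter(String modelPath) throws Exception {
    InputStream in = new FileInputStream(modelPath);  // e.g. the standard en-pos-maxent model
    try {
      tagger = new POSTaggerME(new POSModel(in));
    } finally {
      in.close();
    }
  }

  public List<String> keepNounsAndVerbs(String[] tokens) {
    String[] tags = tagger.tag(tokens);
    List<String> kept = new ArrayList<String>();
    for (int i = 0; i < tokens.length; i++) {
      if (tags[i].startsWith("NN") || tags[i].startsWith("VB")) {
        kept.add(tokens[i]);
      }
    }
    return kept;
  }
}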

2) Latent Semantic Analysis will create a much better term list, but
it requires a batch computation. In LSA you do a singular value
decomposition on the document/term matrix. This gives sorted
ratings for both documents and terms, by how relevant they are to the
"themes" of the corpus: both documents and terms come out sorted by
how "thematic" they are. Document summarization illustrates this concept.

You can summarize documents by running SVD on a sentence/term matrix;
SVD will find the most "thematic" sentence and terms. A newspaper article
is pre-summarized: the main theme is the first sentence (the "lede"), and
the second sentence reinforces and elaborates on the first. So
newspaper articles are pre-tagged test data for document
summarization, and a good demonstration of LSA. I got interesting
results with the Reuters corpus.

SVD sorts vectors by length and orthogonality. The first sentence will
have the most "theme words" and the second the next largest number.
The crazy intuition here is that the reinforcing sentence rarely
shares theme words with the primary sentence. So, the lede and
reinforcing sentences have the two longest, most orthogonal term
vectors.
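A small in-memory illustration of that idea, assuming Commons Math's SVD rather than Mahout's distributed SSVD (the class and method names here are invented for the example):

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// Rows are sentences, columns are terms, cells are term weights. The first
// column of U (largest singular value) scores each sentence by how strongly it
// loads on the dominant "theme"; the top-scoring row is the lede-like sentence.
public class SvdSummarizer {

  /** Returns the index of the most "thematic" sentence (row). */
  public static int mostThematicSentence(double[][] sentenceTermWeights) {
    RealMatrix a = new Array2DRowRealMatrix(sentenceTermWeights);
    SingularValueDecomposition svd = new SingularValueDecomposition(a);
    RealMatrix u = svd.getU();  // sentence loadings on each theme, strongest theme first
    int best = 0;
    double bestScore = 0.0;
    for (int row = 0; row < u.getRowDimension(); row++) {
      double score = Math.abs(u.getEntry(row, 0));  // loading on the first theme
      if (score > bestScore) {
        bestScore = score;
        best = row;
      }
    }
    return best;
  }
}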

Starting with this, it is possible to mangle the term-vector matrix
into a much much smaller matrix at the cost of losing the identity of
documents: you get thematic words, but do not get thematic documents.
This technique is called Random Indexing. This requires a much hairier
explanation than the above.
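Without attempting that hairier explanation, a bare-bones sketch of the basic random indexing idea (all names invented, and this is a generic illustration rather than any particular implementation): each document gets a fixed low-dimensional random "index vector", and each term accumulates the index vectors of the documents it appears in, weighted by the term's weight there. The result is a small term-by-d matrix; the documents themselves are no longer recoverable.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomIndexer {
  private final int dimensions;
  private final Random random = new Random(42);
  private final Map<String, double[]> termVectors = new HashMap<String, double[]>();

  public RandomIndexer(int dimensions) {
    this.dimensions = dimensions;
  }

  /** A sparse-ish random index vector: a handful of +1/-1 entries. */
  private double[] newIndexVector() {
    double[] v = new double[dimensions];
    for (int i = 0; i < 8; i++) {
      v[random.nextInt(dimensions)] = random.nextBoolean() ? 1.0 : -1.0;
    }
    return v;
  }

  /** Fold one document (term -> weight) into the accumulated term vectors. */
  public void addDocument(Map<String, Double> termWeights) {
    double[] docIndex = newIndexVector();
    for (Map.Entry<String, Double> e : termWeights.entrySet()) {
      double[] tv = termVectors.get(e.getKey());
      if (tv == null) {
        tv = new double[dimensions];
        termVectors.put(e.getKey(), tv);
      }
      for (int i = 0; i < dimensions; i++) {
        tv[i] += e.getValue() * docIndex[i];
      }
    }
  }

  public Map<String, double[]> getTermVectors() {
    return termVectors;
  }
}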

On Wed, Jul 18, 2012 at 9:53 AM, Ken Krugler
<[email protected]> wrote:
On Jul 18, 2012, at 9:07am, Pat Ferrel wrote:

Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks like 
a good fit and seems to do about what you describe below. It seems a good idea 
to reorder the returned docs by some distance or similarity measure.

The major problem you mention is in extracting good terms; are you talking about 
creating the query or creating the Solr index?
Both, since they (must) use the same approach for the query to do a good job of 
matching against docs in the index.

Often two-word phrases are great terms, but just as often they wind up being junk, 
e.g. "otherwise resolving", where the TF*IDF (or LLR) score is high enough to make 
it one of the top terms for a document, but it doesn't really capture anything about the 
meaning of the document, and thus is just noise as far as similarity is concerned.

-- Ken

BTW RowSimilarity works so well for doc similarity that I'm resisting taking it out, 
and will concentrate on reducing the size of the matrix it deals with to 
mitigate the scaling problems. For the realtime queries I think I'll look 
deeper into MoreLikeThis. In our use case we'll be taking the TF-IDF term 
weights from a doc and reweighting some terms based on a user gesture.


On 7/17/12 8:22 PM, Ken Krugler wrote:
Hi Pat,

On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:

Interesting.

I have another requirement, which is to do something like real-time vector-based 
queries. Imagine taking a doc vector, reweighting some terms, then doing a 
query with it, perhaps in a truncated form. There are several ways to do this, 
but only Solr would offer real-time results AFAIK. It looks like I 
could use your approach below to do this. A quick look at eDisMax, however, 
suggests some problems. The use of pf2 and pf3 would jam the query vector into 
synthesized bi- and tri-grams, for instance.
The simplistic approach I used was to extract the top 50 terms (with TF*IDF 
weights) from the target document, then use those terms (with weights as boosts) to 
do a regular Lucene OR query & request the top 20 hits.

The index I'm searching against has Solr documents with a multi-value field 
that contains the top 50 terms, generated using the same approach as with the 
target document. It also contains stored weights for each of those terms.

I didn't use payload boosting, but could have, to improve the quality of this 
search - it seemed to be working well enough, and speed was pretty important.

Solr returns back a sorted list of hits, and then I do a regular vector similarity 
calculation between the target & each of these top 20 hits, and select the best 
one (assuming it passes a similarity threshold).
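A rough sketch of that flow, assuming Lucene's BooleanQuery/TermQuery API; the "topTerms" field name and the cosine helper are assumptions for illustration, not the actual code:

import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// OR together the target doc's top terms, boosted by their TF*IDF weights,
// take the top 20 hits, then re-rank those hits with a plain cosine similarity
// against each hit's stored term weights.
public class TopTermsSearcher {

  public static TopDocs search(IndexSearcher searcher, Map<String, Float> topTerms) throws Exception {
    BooleanQuery query = new BooleanQuery();
    for (Map.Entry<String, Float> e : topTerms.entrySet()) {
      TermQuery tq = new TermQuery(new Term("topTerms", e.getKey()));
      tq.setBoost(e.getValue());                  // TF*IDF weight as boost
      query.add(tq, BooleanClause.Occur.SHOULD);  // regular OR query
    }
    return searcher.search(query, 20);            // then re-rank these 20 by cosine
  }

  /** Cosine similarity between the target's weights and a hit's stored weights. */
  public static double cosine(Map<String, Float> a, Map<String, Float> b) {
    double dot = 0, na = 0, nb = 0;
    for (Map.Entry<String, Float> e : a.entrySet()) {
      Float w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;
      na += e.getValue() * e.getValue();
    }
    for (float w : b.values()) nb += w * w;
    return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}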

I'd be interested in hearing more about how you use it. Is there a better venue 
than the mahout list?
If you'd like more details, that's probably better for an off-list 
discussion…doesn't feel very Mahout-ish in nature :)

Though a discussion of the major problem (how to extract "good" terms from the 
text) would be very interesting, as I wound up crafting what felt like a kludgy 
pseudo-NLP solution.

-- Ken


--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
