Not sure who the question is for so I'll butt in.
I calculate a term cloud for each document in batch using (so far):
- Boilerpipe parsing of pages (did I say they are from a crawl?)
- filtering out high-frequency terms and stop words with a custom Lucene analyzer
- TF-IDF from Mahout
Once I have a term cloud I display it with a snippet of the document.
The user can click a term they think is extra important. This will
reweight it higher than it was and take the resulting reweighted vector
to do something like a "MoreLikeThis" query. So the original cloud is
from the weighted term vector. It needs to be fairly human readable. I'm
using stemming so that is already stretching the human readable part.
We've used a form of this in a prototype and it allows the user to
navigate the information space in a new way. Hopefully at a larger scale
it will be useful, sort of a MoreLikeThis but with more emphasis on some
specific term.
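A minimal sketch of that click-to-reweight step, assuming each document is a term -> TF-IDF-weight dict (the names `reweight`, `build_query`, and the boost factor are illustrative, not from Mahout or Lucene):

```python
# Hypothetical sketch: boost a clicked term in a TF-IDF vector,
# re-normalize, and turn the top terms into a boosted OR query that a
# MoreLikeThis-style search could consume.

import math

BOOST = 3.0  # assumed boost factor for a clicked term

def reweight(tfidf, clicked_term, boost=BOOST):
    """Return a new term->weight dict with the clicked term boosted
    and the vector re-normalized to unit length."""
    v = dict(tfidf)
    v[clicked_term] = v.get(clicked_term, 0.0) * boost
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()}

def build_query(vector, top_n=10):
    """Turn the top-N weighted terms into a boosted OR query string."""
    top = sorted(vector.items(), key=lambda tw: -tw[1])[:top_n]
    return " OR ".join("%s^%.2f" % (t, w) for t, w in top)

doc = {"pitch": 0.4, "inning": 0.3, "stadium": 0.2}
boosted = reweight(doc, "inning")
print(build_query(boosted))
```

The re-normalization keeps the boosted vector comparable to the original when it is later used in a similarity calculation.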
We've built an NER system for other reasons, but we've questioned whether
applying part-of-speech tagging or even NER to choosing terms is a good way
to do dimensional reduction. I assume you take all nouns and verbs and throw
out the rest? Were the results noticeably better for creating vectors?
It sounds like you integrated this into Lucene; how about Mahout's
vectorizing? It seems like it would be simple to put a custom Lucene
analyzer into the pipeline. I'd be interested in your opinion of either
result.
As to LSA and SVD that remains one of our next steps. For another part
of the project I'm building a sort of hierarchical clustering model that
will create clusters at different scales (different magnitudes of k for
instance) then connect them by centroid distance into a graph. We will
use cluster evaluators to prune out crappy (a technical term) clusters
and this we hope will be a nice categorization that takes into account
different levels of generalization. I expect that different levels of
dimensional reduction will be useful in clustering at different scales
but we haven't tried it, so I'm not sure. Doing LSA in conjunction with DR
once may be all we need. In our experience, clustering using only the
vectorizing DR produces good clusters, but highly specific ones.
Specificity is a hard thing to account for. It would be nice to end up
with a sports cluster then more specific baseball, soccer, golf, and
even more specific players or teams. My intuition says teasing this out
of the data will require DR in some form and possibly at varying scales.
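A toy sketch of that multi-scale idea, assuming centroids have already been computed at several values of k (e.g. by Mahout's k-means); each finer-scale centroid is linked to its nearest coarser-scale centroid, yielding the graph across levels of generalization. The function names here are illustrative:

```python
# Hypothetical sketch: connect cluster centroids across scales by
# centroid distance, building a tree-like graph (sports -> baseball,
# soccer, ... -> teams) out of independently computed clusterings.

import math

def dist(a, b):
    """Euclidean distance between two centroid tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def link_scales(scales):
    """scales: list of centroid lists, ordered coarse to fine.
    Returns edges ((level, i), (level+1, j)) connecting each fine
    centroid j to its nearest coarse centroid i."""
    edges = []
    for level in range(len(scales) - 1):
        coarse, fine = scales[level], scales[level + 1]
        for j, c in enumerate(fine):
            i = min(range(len(coarse)), key=lambda ci: dist(coarse[ci], c))
            edges.append(((level, i), (level + 1, j)))
    return edges
```

Pruning "crappy" clusters would then amount to dropping nodes (and their edges) that fail a cluster-evaluator threshold before walking the graph.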
On 7/19/12 12:25 AM, Lance Norskog wrote:
Are you creating word clouds essentially, then finding the most
relevant documents for each word? And, is this a batch job, or for
interactive searches?
2 techniques:
1) You can use the OpenNLP parts of speech tagging to isolate nouns
and verbs. My OpenNLP Lucene patch includes a sample Lucene analyzer
stack for this task. LUCENE-2899.
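The filtering step in (1) reduces to something like the following sketch, assuming tokens have already been POS-tagged (OpenNLP emits Penn Treebank tags such as NN, NNS, VB, VBD); this only illustrates the selection logic, not what the LUCENE-2899 analyzer stack actually does internally:

```python
# Keep only nouns and verbs from a tagged token stream, as a stand-in
# for a POS-filtering analyzer. Penn Treebank noun tags start with "NN",
# verb tags with "VB".

KEEP_PREFIXES = ("NN", "VB")

def keep_nouns_and_verbs(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs."""
    return [tok for tok, tag in tagged_tokens
            if tag.startswith(KEEP_PREFIXES)]

tagged = [("dogs", "NNS"), ("bark", "VBP"), ("loudly", "RB"), ("the", "DT")]
print(keep_nouns_and_verbs(tagged))  # ['dogs', 'bark']
```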
2) Latent Semantic Analysis will create a much better term list, but
it requires a batch computation. In LSA you do a singular value
decomposition on the document/term vector matrix. This gives sorted
ratings for both documents and terms, by how relevant they are to the
"themes" of the corpus. Document summarization illustrates this concept:
you can summarize documents by running SVD on a sentence/term matrix. SVD
will find the most "thematic" sentence and term. A newspaper article
is pre-summarized: the main theme is the first sentence ("lede") and
the second sentence reinforces and elaborates on the first. So,
newspaper articles are pre-tagged test data for document
summarization, and a good demonstrator for LSA. I got interesting
results with the Reuters corpus.
SVD sorts vectors by length and orthogonality. The first sentence will
have the most "theme words" and the second the next largest number.
The crazy intuition here is that the reinforcing sentence rarely
shares theme words with the primary sentence. So, the lede and
reinforcing sentences have the two longest, most orthogonal term
vectors.
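A toy version of that SVD-summarization idea can be sketched in pure Python: build a sentence/term count matrix, approximate the first left singular vector with power iteration (the top eigenvector of A·Aᵀ), and take the sentence with the largest component as the most "thematic". This is a minimal illustration, not a production LSA pipeline:

```python
# Sketch: pick the most "thematic" sentence via the first left singular
# vector of the sentence/term matrix, approximated by power iteration.

import math

def top_left_singular_vector(A, iters=100):
    """Power iteration on A·Aᵀ via alternating v = Aᵀu, u = Av."""
    n, m = len(A), len(A[0])
    u = [1.0] * n
    for _ in range(iters):
        v = [sum(A[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
    return u

def most_thematic_sentence(sentences):
    """Build a sentence/term count matrix and return the sentence with
    the largest component in the first left singular vector."""
    terms = sorted({t for s in sentences for t in s.lower().split()})
    idx = {t: j for j, t in enumerate(terms)}
    A = [[0.0] * len(terms) for _ in sentences]
    for i, s in enumerate(sentences):
        for t in s.lower().split():
            A[i][idx[t]] += 1.0
    u = top_left_singular_vector(A)
    best = max(range(len(sentences)), key=lambda i: abs(u[i]))
    return sentences[best]
```

On a newspaper article this should tend to surface the lede, per the intuition above.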
Starting with this, it is possible to mangle the term-vector matrix
into a much, much smaller matrix at the cost of losing the identity of
documents: you get thematic words, but do not get thematic documents.
This technique is called Random Indexing. This requires a much hairier
explanation than the above.
On Wed, Jul 18, 2012 at 9:53 AM, Ken Krugler
<[email protected]> wrote:
On Jul 18, 2012, at 9:07am, Pat Ferrel wrote:
Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks like
a good fit and seems to do about what you describe below. It seems a good idea
to reorder the returned docs by some distance or similarity measure.
The major problem you mention is extracting good terms; are you talking about
creating the query or creating the Solr index?
Both, since they (must) use the same approach for the query to do a good job of
matching against docs in the index.
Often two-word phrases are great terms, but just as often they wind up being junk -
e.g. "otherwise resolving", where the TF*IDF (or LLR) score is high enough to make
it one of the top terms for a document, but it doesn't really capture anything about the
meaning of the document, and thus is just noise as far as similarity is concerned.
-- Ken
BTW RowSimilarity works so well for doc similarity I'm resisting taking it out
and will concentrate on reducing the size of the matrix it deals with to
mitigate the scaling problems. For the real-time queries I think I'll look
deeper into MoreLikeThis. In our use case we'll be taking the TF-IDF term
weights from a doc and reweighting some terms based on a user gesture.
On 7/17/12 8:22 PM, Ken Krugler wrote:
Hi Pat,
On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
Interesting.
I have another requirement, which is to do something like real time vector
based queries. Imagine taking a doc vector, reweighting some terms then doing a
query with it, perhaps in a truncated form. There are several ways to do this
but only Solr would offer real-time results AFAIK. It looks like I
could use your approach below to do this. A quick look at eDisMax, however,
suggests some problems: the use of pf2 and pf3 would jam the query vector into
synthesized bi- and tri-grams, for instance.
The simplistic approach I used was to extract the top 50 terms (with TF*IDF
weights) from the target document, then use those terms (with weights as boosts) to
do a regular Lucene OR query & request the top 20 hits.
The index I'm searching against has Solr documents with a multi-value field
that contains the top 50 terms, generated using the same approach as with the
target document. It also contains stored weights for each of those terms.
I didn't use payload boosting, but could have to improve the quality of this
search - seemed to be working well enough, and speed was pretty important.
Solr returns a sorted list of hits, and then I do a regular vector-similarity
calculation between the target & each of these top 20 hits, and select the best
one (assuming it passes a similarity threshold).
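That rerank step can be sketched as follows, assuming the target doc's term weights and each hit's stored top-50 term weights are available as term -> weight dicts (the function names and threshold are illustrative):

```python
# Sketch: re-score Solr's top candidate docs against the target doc's
# term vector by cosine similarity, keeping the best one over a
# threshold. Field names and the threshold value are assumptions.

import math

def cosine(a, b):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_hit(target, hits, threshold=0.2):
    """hits: list of (doc_id, term_weights) from the Solr result set.
    Returns the most similar doc id, or None if nothing passes."""
    scored = [(cosine(target, vec), doc_id) for doc_id, vec in hits]
    if not scored:
        return None
    score, doc_id = max(scored)
    return doc_id if score >= threshold else None
```

Because only the stored top-50 terms per hit are compared, this stays cheap enough to run inline after each query.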
I'd be interested in hearing more about how you use it. Is there a better venue
than the mahout list?
If you'd like more details, that's probably better for an off-list
discussion…doesn't feel very Mahout-ish in nature :)
Though a discussion of the major problem (how to extract "good" terms from the
text) would be very interesting, as I wound up crafting what felt like a kludgy
pseudo-NLP solution.
-- Ken
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr