Are you essentially creating word clouds, then finding the most
relevant documents for each word? And is this a batch job, or for
interactive searches?
Two techniques:
1) You can use OpenNLP part-of-speech tagging to isolate nouns and
verbs. My OpenNLP Lucene patch (LUCENE-2899) includes a sample Lucene
analyzer stack for this task.
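A rough standalone sketch of the filtering step (plain OpenNLP, not the
LUCENE-2899 analyzer stack itself; the model file name is just whatever
English maxent POS model you have on disk):

// Keep only nouns and verbs from a whitespace-tokenized sentence,
// using OpenNLP's POS tagger. Assumes "en-pos-maxent.bin" is available
// in the working directory.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class NounVerbFilter {
  public static void main(String[] args) throws Exception {
    InputStream in = new FileInputStream("en-pos-maxent.bin");
    POSModel model = new POSModel(in);
    POSTaggerME tagger = new POSTaggerME(model);

    String[] tokens = "The quick brown fox jumps over the lazy dog".split(" ");
    String[] tags = tagger.tag(tokens);

    List<String> kept = new ArrayList<String>();
    for (int i = 0; i < tokens.length; i++) {
      // Penn Treebank tags: NN* = nouns, VB* = verbs
      if (tags[i].startsWith("NN") || tags[i].startsWith("VB")) {
        kept.add(tokens[i]);
      }
    }
    System.out.println(kept);  // e.g. [fox, jumps, dog]
    in.close();
  }
}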
2) Latent Semantic Analysis will create a much better term list, but
it requires a batch computation. In LSA you do a singular value
decomposition on the document/term matrix. This gives you both
documents and terms rated and sorted by how relevant they are to the
"themes" of the corpus. Document summarization illustrates the
concept: to summarize a document, run SVD on its sentence/term matrix,
and it will find the most "thematic" sentences and terms. A newspaper
article is pre-summarized: the main theme is the first sentence (the
"lede") and the second sentence reinforces and elaborates on the
first. So newspaper articles are pre-tagged test data for document
summarization, and a good demonstrator for LSA. I got interesting
results with the Reuters corpus.
SVD sorts vectors by length and orthogonality. The first sentence will
have the most "theme words" and the second the next-largest number.
The crazy intuition here is that the reinforcing sentence rarely
shares theme words with the primary sentence, so the lede and
reinforcing sentences have the two longest, most orthogonal term
vectors.
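A toy illustration of the sentence-ranking idea, using Apache Commons
Math for the SVD instead of Mahout's distributed SSVD (the matrix and
numbers here are made up; a real corpus is obviously far too large for
an in-memory SVD like this):

// Build a tiny sentence x term count matrix, take the SVD, and rank
// sentences by the size of their loading on the first left singular
// vector (the dominant "theme").
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class SvdSummarizer {
  public static void main(String[] args) {
    // Rows = sentences, columns = terms (raw counts; TF-IDF would be better).
    double[][] counts = {
      {2, 1, 0, 0, 1},   // sentence 0
      {0, 0, 2, 1, 1},   // sentence 1
      {1, 0, 0, 1, 0},   // sentence 2
    };
    RealMatrix a = new Array2DRowRealMatrix(counts);
    SingularValueDecomposition svd = new SingularValueDecomposition(a);
    RealMatrix u = svd.getU();   // sentence loadings on each "theme"

    // The most "thematic" sentence has the largest |loading| on theme 0.
    int best = 0;
    for (int row = 1; row < u.getRowDimension(); row++) {
      if (Math.abs(u.getEntry(row, 0)) > Math.abs(u.getEntry(best, 0))) {
        best = row;
      }
    }
    System.out.println("Most thematic sentence: " + best);
    // The same trick on svd.getV() ranks terms instead of sentences.
  }
}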
Starting with this, it is possible to mangle the term-vector matrix
into a much, much smaller matrix, at the cost of losing the identity
of documents: you get thematic words, but you do not get thematic
documents. This technique is called Random Indexing, and it requires a
much hairier explanation than the above.
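That said, a bare-bones sketch of the core trick looks something like
this (the dimension, seed count, and class name are arbitrary choices
of mine, not any existing Mahout or Lucene API):

// Each document gets a sparse random +1/-1 "signature" in a fixed low
// dimension, and each term's context vector is the sum of the
// signatures of the documents it occurs in. Terms that co-occur in the
// same documents end up with similar vectors, without ever
// materializing the full document/term matrix.
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomIndexing {
  static final int DIM = 200;    // reduced dimension (vs. vocabulary size)
  static final int SEEDS = 4;    // nonzero entries per document signature

  public static void main(String[] args) {
    String[] docs = {
      "lucene query parser boost",
      "solr query boost dismax",
      "mahout matrix vector svd",
    };
    Random rnd = new Random(42);
    Map<String, float[]> termVectors = new HashMap<String, float[]>();

    for (String doc : docs) {
      // Sparse random signature for this document.
      float[] sig = new float[DIM];
      for (int i = 0; i < SEEDS; i++) {
        sig[rnd.nextInt(DIM)] = rnd.nextBoolean() ? 1f : -1f;
      }
      // Accumulate the signature into every term the document contains.
      for (String term : doc.split(" ")) {
        float[] vec = termVectors.get(term);
        if (vec == null) {
          vec = new float[DIM];
          termVectors.put(term, vec);
        }
        for (int i = 0; i < DIM; i++) {
          vec[i] += sig[i];
        }
      }
    }
    // "query" and "boost" now have similar vectors; "svd" does not.
    System.out.println(termVectors.size() + " term vectors of length " + DIM);
  }
}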
On Wed, Jul 18, 2012 at 9:53 AM, Ken Krugler
<[email protected]> wrote:
>
> On Jul 18, 2012, at 9:07am, Pat Ferrel wrote:
>
>> Lance Norskog's suggestion to look at Lucene's MoreLikeThis feature looks
>> like a good fit and seems to do about what you describe below. It seems a
>> good idea to reorder the returned docs by some distance or similarity
>> measure.
>>
>> The major problem you mention is in extracting good terms; are you talking
>> about creating the query or creating the Solr index?
>
> Both, since they (must) use the same approach for the query to do a good job
> of matching against docs in the index.
>
> Often two-word phrases are great terms, but just as often they wind up being
> junk - "otherwise resolving", where the TF*IDF (or LLR) score is high enough
> to make it one of top terms for a document, but it doesn't really capture
> anything about the meaning of the document, and thus is just noise as far as
> similarity is concerned.
>
> -- Ken
>
>>
>> BTW RowSimilarity works so well for doc similarity I'm resisting taking it
>> out and will concentrate on reducing the size of the matrix it deals with to
>> mitigate the scaling problems. For the realtime queries I think I'll look
>> deeper into MoreLikeThis. In our use case we'll be taking the TF-IDF term
>> weights from a doc and reweighting some terms based on a user gesture.
>>
>>
>> On 7/17/12 8:22 PM, Ken Krugler wrote:
>>> Hi Pat,
>>>
>>> On Jul 14, 2012, at 8:17am, Pat Ferrel wrote:
>>>
>>>> Interesting.
>>>>
>>>> I have another requirement, which is to do something like real-time
>>>> vector-based queries. Imagine taking a doc vector, reweighting some terms,
>>>> then doing a query with it, perhaps in a truncated form. There are several
>>>> ways to do this, but only Solr would offer real-time results AFAIK. It
>>>> looks like I could use your approach below to do this. A quick look at
>>>> eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam
>>>> the query vector into synthesized bi- and trigrams, for instance.
>>> The simplistic approach I used was to extract the top 50 terms (with TF*IDF
>>> weights) from the target document, then use those terms (with weights as
>>> boosts) to do a regular Lucene OR query & request the top 20 hits.
>>>
>>> The index I'm searching against has Solr documents with a multi-value field
>>> that contains the top 50 terms, generated using the same approach as with
>>> the target document. It also contains stored weights for each of those
>>> terms.
>>>
>>> I didn't use payload boosting, but could have to improve the quality of
>>> this search - seemed to be working well enough, and speed was pretty
>>> important.
>>>
>>> Solr returns back a sorted list of hits, and then I do a regular vector
>>> similarity calculation between the target & each of these top 20 hits, and
>>> select the best one (assuming it passes a similarity threshold).
>>>
>>>> I'd be interested in hearing more about how you use it. Is there a better
>>>> venue than the mahout list?
>>> If you'd like more details, that's probably better for an off-list
>>> discussion…doesn't feel very Mahout-ish in nature :)
>>>
>>> Though a discussion of the major problem (how to extract "good" terms from
>>> the text) would be very interesting, as I wound up crafting what felt like
>>> a kludgy pseudo-NLP solution.
>>>
>>> -- Ken
>>>
>>> --------------------------
>>> Ken Krugler
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Mahout & Solr
>>
>>
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>
--
Lance Norskog
[email protected]