Very nice post. Thanks.

I wonder if another problem that could benefit from the same approach is 
finding cluster names. Imagine finding the most important sentence of a cluster 
instead of a single doc, using the same methods (break docs into sentences, 
etc.). Then use part-of-speech tagging to condense it to noun+verb or a noun 
phrase as a candidate cluster name. Or just use the most important sentence as is.

Also, the few most important sentences might make a reasonable cluster summary.
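
A rough sketch of what I mean, in Python. To keep it short it does not use
LSA/SVD at all: it scores each sentence by cosine similarity to the summed
term-count vector of the whole cluster, which is only a stand-in for the
ranking in the post. The function names, the regex sentence splitter and the
tokenizer are all made up for illustration (a real pipeline would use a proper
sentence detector such as OpenNLP).

```python
import math
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def term_counts(sentence):
    # Lowercased alphabetic tokens only; no stemming or stopword removal.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_cluster_sentences(docs):
    """Return (score, doc_index, sentence) tuples, best first.

    Assumption: similarity to the cluster's summed term vector stands in
    for the LSA-based sentence importance discussed in the post.
    """
    sentences = [(i, s) for i, text in enumerate(docs)
                 for s in split_sentences(text)]
    cluster_vector = Counter()
    for _, s in sentences:
        cluster_vector.update(term_counts(s))
    scored = [(cosine(term_counts(s), cluster_vector), i, s)
              for i, s in sentences]
    return sorted(scored, key=lambda item: -item[0])
```

The first entry of the result would be the candidate cluster name (before any
part-of-speech condensing), and the first few entries the candidate summary.
The doc index is kept so you can see how many different docs the summary
sentences come from.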

Using the top terms from the centroid doesn't produce very satisfactory names, 
and though the term cloud can be a somewhat useful cluster summary, it's much 
harder to comprehend than a few sentences. One open question is whether choosing 
sentences from different docs for the cluster summary would produce gibberish.
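
For comparison, the "top terms from the centroid" baseline is roughly the
following. This is a minimal sketch, assuming plain length-normalized term
frequencies rather than whatever weighting (e.g. TF-IDF) your clustering job
actually used; the stopword list and function name are illustrative only.

```python
import re
from collections import Counter

# Tiny illustrative stopword list, not a real one.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def centroid_top_terms(docs, k=5):
    """Return the k heaviest terms of the cluster centroid.

    Each doc contributes its term frequencies divided by its length; the
    sum over docs is proportional to the centroid, so the ranking is the same.
    """
    centroid = Counter()
    for text in docs:
        terms = [t for t in re.findall(r"[a-z]+", text.lower())
                 if t not in STOPWORDS]
        if not terms:
            continue
        for term, count in Counter(terms).items():
            centroid[term] += count / len(terms)
    # Sort by weight (descending), then alphabetically to break ties.
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

The output is a bag of words with no syntax holding it together, which is
why it reads worse as a name than even a single extracted sentence.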

On Sep 6, 2012, at 1:47 PM, Lance Norskog <[email protected]> wrote:

I stole the SVD code from Mahout, ported OpenNLP to Solr, wrote a
document summarizer, and benchmarked it all:

Document Summarization with LSA: Threat? Or Menace?
http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html

Please critique: what did I completely miss, in the posts or the research?

-- 
Lance Norskog
[email protected]
