Very nice post. Thanks. I wonder if another problem that could benefit from the same approach is finding cluster names. Imagine finding the most important sentence of a whole cluster, rather than of a single doc, using the same methods (break docs into sentences, etc.). Then use part-of-speech tagging to condense it to a noun+verb or noun phrase as a candidate cluster name. Or just use the most important sentence as is.
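A rough sketch of the idea, in Python with NumPy rather than the Mahout/OpenNLP/Solr stack from the post. Everything here is an assumption for illustration: the function names, the naive regex sentence splitter, raw term counts as weights, and the scoring rule (length of each sentence's vector in a rank-truncated latent space), which is one of several LSA scoring schemes and not necessarily the one the post uses. The part-of-speech condensing step is left out.

```python
import re
import numpy as np

def split_sentences(text):
    # Naive splitter for illustration; a real pipeline would use a
    # proper sentence detector such as the one in OpenNLP.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    return re.findall(r"[a-z]+", sentence.lower())

def top_cluster_sentences(docs, k=1, rank=2):
    """Return the k highest-scoring sentences pooled from all docs in a cluster."""
    # Pool sentences from every doc in the cluster.
    sentences = [s for d in docs for s in split_sentences(d)]
    if not sentences:
        return []
    vocab = sorted({t for s in sentences for t in tokenize(s)})
    if not vocab:
        return []
    index = {t: i for i, t in enumerate(vocab)}

    # Term-by-sentence matrix of raw counts (hypothetical weighting choice).
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for t in tokenize(s):
            A[index[t], j] += 1.0

    # SVD, keeping only the top `rank` latent dimensions.
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(rank, len(S))
    weighted = S[:r, None] * Vt[:r]

    # Score = length of each sentence vector in the truncated latent space.
    scores = np.sqrt((weighted ** 2).sum(axis=0))
    order = np.argsort(-scores)
    return [sentences[i] for i in order[:k]]
```

With `k=1` the result is a candidate cluster name; a larger `k` gives a few sentences, possibly drawn from different docs.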
Also, the most important few sentences might make a reasonable cluster summary. Using the top terms from the centroid doesn't produce very satisfactory names, and though the term cloud can be a somewhat useful cluster summary, it's much harder to comprehend than a few sentences. One question is whether choosing sentences from different docs for the cluster summary might produce gibberish.

On Sep 6, 2012, at 1:47 PM, Lance Norskog <[email protected]> wrote:

> I stole the SVD code from Mahout, ported OpenNLP to Solr, wrote a document summarizer, and benchmarked it all:
> Document Summarization with LSA: Threat? Or Menace?
> http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html
>
> Please critique: what did I completely miss, in the posts or the research?
>
> --
> Lance Norskog
> [email protected]
