> Thanks, I'm new to the clustering libraries. I finally made this
> connection when I started browsing through the carrot2 source. I had
> pulled down a smaller MM document collection from our test environment.
> It was not ideal as it was mostly structured, but small. I foolishly
> thought I could cluster on the text copy field before realizing that it
> was index only. Doh!
That is correct -- for the time being, clustering can only be applied to
stored Solr fields.

> Our documents are indexed in SolrCloud, but stored in HBase. I want to
> allow users to page through Solr hits, but would like to cluster on all
> (or at least several thousand) of the top search hits. Now I'm puzzling
> over how to efficiently cluster over possibly several thousand Solr hits
> when the documents are in HBase. I thought of an HBase coprocessor, but
> carrot2 isn't designed for distributed computation. Mahout, in the
> Hadoop M/R context, seems slow and heavy-handed for this scale; maybe I
> just need to dig deeper into their library. Or I could just be missing
> something fundamental? :)

Carrot2 algorithms were not designed to be distributed, but you can still
use them in a single-threaded scenario. To do this, you'd probably need to
write a bit of code that fetches the text of your documents from HBase and
runs Carrot2 clustering on it. If you use the STC clustering algorithm, you
should be able to process several thousand documents in a reasonable time
(on the order of seconds). The clustering side should be a matter of a few
lines of code
(http://download.carrot2.org/stable/javadoc/overview-summary.html#clustering-documents).
The tricky bit of the setup may be efficiently getting the text for
clustering -- it can happen that fetching takes longer than the actual
clustering.

S.
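
To make that concrete, here is a minimal sketch of what the glue code could
look like, assuming the Carrot2 3.x Java API and the older HTable-based
HBase client. The table name ("documents"), the column family and
qualifiers ("content", "title", "text"), and the topHitIdsFromSolr() helper
are all hypothetical placeholders for your own schema and Solr query:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.carrot2.clustering.stc.STCClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class ClusterHBaseHits {

  public static void main(String[] args) throws Exception {
    // Hypothetical schema: table "documents", family "content",
    // qualifiers "title" and "text". Adjust to your own layout.
    byte[] family = Bytes.toBytes("content");
    byte[] titleCol = Bytes.toBytes("title");
    byte[] textCol = Bytes.toBytes("text");

    // Row keys would come from your Solr result list (e.g. the id
    // field of the top few thousand hits). Placeholder here.
    List<String> rowKeys = topHitIdsFromSolr();

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "documents");
    try {
      // Batch all Gets into a single multi-get: one round trip to
      // HBase instead of one RPC per document.
      List<Get> gets = new ArrayList<Get>(rowKeys.size());
      for (String key : rowKeys) {
        Get get = new Get(Bytes.toBytes(key));
        get.addColumn(family, titleCol);
        get.addColumn(family, textCol);
        gets.add(get);
      }
      Result[] rows = table.get(gets);

      // Wrap the fetched text in Carrot2 Document objects.
      List<Document> documents = new ArrayList<Document>(rows.length);
      for (Result row : rows) {
        if (row.isEmpty()) {
          continue;
        }
        String title = Bytes.toString(row.getValue(family, titleCol));
        String text = Bytes.toString(row.getValue(family, textCol));
        documents.add(new Document(title, text));
      }

      // Cluster in-process with STC. The query hint is optional;
      // pass the user's query string here if you have it.
      String queryHint = null;
      Controller controller = ControllerFactory.createSimple();
      ProcessingResult result =
          controller.process(documents, queryHint, STCClusteringAlgorithm.class);

      for (Cluster cluster : result.getClusters()) {
        System.out.println(cluster.getLabel() + " ("
            + cluster.getAllDocuments().size() + " docs)");
      }
    } finally {
      table.close();
    }
  }

  // Placeholder -- in a real setup these ids come from your Solr query.
  private static List<String> topHitIdsFromSolr() {
    return new ArrayList<String>();
  }
}

The batched multi-get is the part that addresses the fetching cost: with a
few thousand hits, one RPC per row would likely take longer than the STC
run itself, so no coprocessor should be needed at this scale.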