Support Document and Search Result clustering
---------------------------------------------

                 Key: SOLR-769
                 URL: https://issues.apache.org/jira/browse/SOLR-769
             Project: Solr
          Issue Type: New Feature
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
            Priority: Minor


Clustering is a useful tool for working with documents and search results,
similar to the notion of dynamic faceting.  Carrot2
(http://project.carrot2.org/) is a nice, BSD-licensed library for search
results clustering.  Mahout (http://lucene.apache.org/mahout) is well suited
for whole-corpus clustering.

The patch lays out a contrib module that starts off with a SearchComponent
for clustering and an implementation using Carrot2.  In search results mode,
it uses the DocList as the input for clustering.  While Carrot2 comes with a
Solr input component, it is not the same as my SearchComponent: the Carrot2
example actually submits a query to Solr, whereas my SearchComponent is just
chained into the component list and uses the ResponseBuilder to add in the
cluster results.
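
To make the chaining idea concrete, here is a minimal, self-contained sketch.
It does not use the Solr or Carrot2 APIs: SketchResponseBuilder,
SketchComponent, QuerySketch and ClusteringSketch are hypothetical stand-ins
invented for illustration, and grouping by first word is only a placeholder
for a real clustering engine.  The point is that the clustering step reads
the documents an earlier component already found and adds its output to the
same response, rather than submitting a query of its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusteringChainSketch {

    /** Hypothetical stand-in for shared per-request state (not Solr's ResponseBuilder). */
    static class SketchResponseBuilder {
        final List<String> docList = new ArrayList<>();              // titles of matched docs
        final Map<String, Object> response = new LinkedHashMap<>();  // what the client gets back
    }

    /** Hypothetical stand-in for a component in the chain. */
    interface SketchComponent {
        void process(SketchResponseBuilder rb);
    }

    /** Pretends to run the query: fills the doc list with fixed hits. */
    static class QuerySketch implements SketchComponent {
        private final List<String> hits;
        QuerySketch(List<String> hits) { this.hits = hits; }
        public void process(SketchResponseBuilder rb) {
            rb.docList.addAll(hits);
            rb.response.put("numFound", rb.docList.size());
        }
    }

    /** Clusters the docs already in rb; it issues no query of its own. */
    static class ClusteringSketch implements SketchComponent {
        public void process(SketchResponseBuilder rb) {
            Map<String, List<String>> clusters = new LinkedHashMap<>();
            for (String title : rb.docList) {
                // Placeholder grouping by first word; a real engine would go here.
                String label = title.split("\\s+")[0].toLowerCase();
                clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(title);
            }
            rb.response.put("clusters", clusters);
        }
    }

    /** Runs each component in order against one shared builder. */
    static Map<String, Object> handle(List<SketchComponent> chain) {
        SketchResponseBuilder rb = new SketchResponseBuilder();
        for (SketchComponent c : chain) {
            c.process(rb);
        }
        return rb.response;
    }

    public static void main(String[] args) {
        List<SketchComponent> chain = List.of(
            new QuerySketch(List.of("Solr faceting", "Solr clustering", "Mahout k-means")),
            new ClusteringSketch());
        System.out.println(handle(chain));
    }
}
```

Because the clustering step depends only on what is already in the shared
builder, it can be appended to an existing chain without changing how the
query itself is executed.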

While not fully fleshed out yet, the collection-based mode will take in a list
of ids, or just use the whole collection, and will produce clusters.  Since
this is a longer, typically offline task, there will need to be some type of
storage mechanism (and possibly replication?) for the clusters.  I _may_ push
this off to a separate JIRA issue, but I at least want to present the use case
as part of the design of this component/contrib.  It may even make sense to
split this out, such that the building piece is something like an
UpdateProcessor and the SearchComponent just acts as a lookup mechanism.
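
As a rough illustration of that possible split, the sketch below separates an
offline build step from a query-time lookup step.  Everything in it is an
assumption made for illustration: ClusterStore is a hypothetical in-memory
map (the storage mechanism is still an open question here), build() stands in
for whatever would run on the update side, and the first-word labelling is
only a placeholder for a whole-corpus clusterer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterStoreSketch {

    /** Hypothetical in-memory store; the real storage mechanism is undecided. */
    static class ClusterStore {
        private final Map<String, String> labelByDocId = new LinkedHashMap<>();
        void put(String docId, String label) { labelByDocId.put(docId, label); }
        String lookup(String docId) { return labelByDocId.get(docId); }
    }

    /** Build side: run offline over a list of ids (or the whole collection). */
    static void build(Map<String, String> textById, ClusterStore store) {
        for (Map.Entry<String, String> e : textById.entrySet()) {
            // Placeholder assignment by first word; a real clusterer would go here.
            String label = e.getValue().split("\\s+")[0].toLowerCase();
            store.put(e.getKey(), label);
        }
    }

    /** Lookup side: at query time, only read precomputed labels for matched ids. */
    static Map<String, List<String>> lookup(List<String> matchedIds, ClusterStore store) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String id : matchedIds) {
            String label = store.lookup(id);
            if (label != null) {
                clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(id);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, String> corpus = new LinkedHashMap<>();
        corpus.put("1", "Solr faceting");
        corpus.put("2", "Mahout k-means");
        corpus.put("3", "Solr clustering");

        ClusterStore store = new ClusterStore();
        build(corpus, store);                                // offline step
        System.out.println(lookup(List.of("1", "3"), store)); // query-time step
    }
}
```

With this shape, the expensive work happens once on the build side, and the
query-time piece does nothing but read stored assignments for the ids in the
current result.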

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
