[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641823#action_12641823
 ] 

Grant Ingersoll commented on SOLR-769:
--------------------------------------

{quote}So what would be the procedure to add some clustering code beyond carrot 
or other available libraries.
{quote}

Essentially, you need to implement either a SearchClusteringEngine or a 
DocumentClusteringEngine and then declare it in the SearchComponent 
configuration, as is done with the Carrot2 example here:
{code}
<lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">default</str>
      <!-- Carrot2 specific parameters.  See the Carrot2 site for details on setting. -->
      <!-- carrot.algorithm:   Optional.  Currently only
      lingo is supported pending the release of Carrot2 3.0.  
       -->
      <str name="carrot.algorithm">lingo</str>
      <!-- Lingo specific -->
      <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
      <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
    </lst>
{code}
or, in the mock setup:
{code}
<lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">docEngine</str>
      <str name="classname">org.apache.solr.handler.clustering.MockDocumentClusteringEngine</str>
    </lst>
{code}

If you don't declare the classname value, then it assumes the Carrot 
implementation.
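
For your own engine, the declaration would presumably follow the same pattern as 
the mock one. A minimal sketch, where the engine name and the com.example class 
are made-up placeholders rather than anything in the patch:
{code}
<lst name="engine">
      <!-- Hypothetical name; only one engine can be named "default" -->
      <str name="name">myEngine</str>
      <!-- Hypothetical class: substitute your own SearchClusteringEngine or
           DocumentClusteringEngine implementation -->
      <str name="classname">com.example.clustering.MyClusteringEngine</str>
    </lst>
{code}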

Naturally, you also need to make sure that any libraries your engine depends on 
are available to Solr, just as you would for any other plugin.

Since you are interested in clustering, Vaijanath, it would be good to get your 
feedback on the APIs.  Are you doing full document clustering or just search 
snippet clustering?   Also, if you are using an open source clustering library 
that has acceptable licensing terms (i.e. not GPL or similar), perhaps consider 
contributing an implementation of the engine and then we can make it available 
to everyone.

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: clustering-libs.tar, clustering-libs.tar, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch
>
>
> Clustering is a useful tool for working with documents and search results, 
> similar to the notion of dynamic faceting.  Carrot2 
> (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
> search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
> suited for whole-corpus clustering.  
> The patch lays out a contrib module that starts off w/ an integration of a 
> SearchComponent for doing clustering and an implementation using Carrot.  In 
> search results mode, it will use the DocList as the input for the cluster.   
> While Carrot2 comes w/ a Solr input component, it is not the same as the 
> SearchComponent that I have in that the Carrot example actually submits a 
> query to Solr, whereas my SearchComponent is just chained into the Component 
> list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection based mode will take in a 
> list of ids or just use the whole collection and will produce clusters.  
> Since this is a longer, typically offline task, there will need to be some 
> type of storage mechanism (and replication??????) for the clusters.  I _may_ 
> push this off to a separate JIRA issue, but I at least want to present the 
> use case as part of the design of this component/contrib.  It may even make 
> sense that we split this out, such that the building piece is something like 
> an UpdateProcessor and then the SearchComponent just acts as a lookup 
> mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.