[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712165#action_12712165 ]
Grant Ingersoll commented on SOLR-769:
--------------------------------------

bq. A second option would have been to move the body of the process method to finishStage. This would have the benefit of only needing to do the clustering on the final set of responses. After the QueryComponent does its job of creating the final result set. This would also not make finishStage be so dependent on what is happening in the engines when they create their cluster response

I would say that this is actually the correct way to do this, as opposed to just stitching the results together. For example, it may very well make sense that results from shard 1 belong in cluster A when clustered on the main node, whereas they belong to cluster B when only clustered on the shard.

If you can make that change and then add some tests, I can commit.

bq. I'm still trying to wrap my head around TestDistributedSearch so see how I can provide test methods.

Please add any insight you have to http://wiki.apache.org/solr/WritingDistributedSearchComponents.

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: clustering-componet-shard.patch, clustering-libs.tar,
> clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
>
>
> Clustering is a useful tool for working with documents and search results,
> similar to the notion of dynamic faceting. Carrot2
> (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing
> search results clustering. Mahout (http://lucene.apache.org/mahout) is well
> suited for whole-corpus clustering.
>
> The patch lays out a contrib module that starts off w/ an integration of a
> SearchComponent for doing clustering and an implementation using Carrot. In
> search results mode, it will use the DocList as the input for the cluster.
> While Carrot2 comes w/ a Solr input component, it is not the same as the
> SearchComponent that I have in that the Carrot example actually submits a
> query to Solr, whereas my SearchComponent is just chained into the Component
> list and uses the ResponseBuilder to add in the cluster results.
>
> While not fully fleshed out yet, the collection-based mode will take in a
> list of ids or just use the whole collection and will produce clusters.
> Since this is a longer, typically offline task, there will need to be some
> type of storage mechanism (and replication??????) for the clusters. I _may_
> push this off to a separate JIRA issue, but I at least want to present the
> use case as part of the design of this component/contrib. It may even make
> sense that we split this out, such that the building piece is something like
> an UpdateProcessor and then the SearchComponent just acts as a lookup
> mechanism.
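
To illustrate the point in the comment above, that clustering on each shard and stitching the results together can disagree with clustering the merged result set on the main node, here is a small self-contained Java sketch. It does not use any Solr or Carrot2 API. The class `ClusterMergeSketch`, the `cluster` method, the toy "most frequent term" labeling rule, and the sample documents are all assumptions made up for illustration; a real engine would behave differently in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy, not part of the SOLR-769 patch.
class ClusterMergeSketch {

    // Toy stand-in for a clustering engine: each document is labeled with
    // whichever of its own terms occurs in the most documents of the input
    // set. Ties go to the alphabetically first term.
    static Map<String, String> cluster(Map<String, List<String>> docs) {
        Map<String, Integer> docFreq = new TreeMap<>();
        for (List<String> terms : docs.values()) {
            for (String term : new HashSet<>(terms)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, String> labels = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : docs.entrySet()) {
            String best = null;
            for (String term : new TreeSet<>(e.getValue())) {
                if (best == null || docFreq.get(term) > docFreq.get(best)) {
                    best = term;
                }
            }
            labels.put(e.getKey(), best);
        }
        return labels;
    }

    // Made-up documents held by shard 1.
    static Map<String, List<String>> shard1() {
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("d1", new ArrayList<>(Arrays.asList("apple", "jaguar")));
        docs.put("d2", new ArrayList<>(Arrays.asList("apple", "fruit")));
        return docs;
    }

    // Made-up documents held by shard 2.
    static Map<String, List<String>> shard2() {
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("d3", new ArrayList<>(Arrays.asList("jaguar", "car")));
        docs.put("d4", new ArrayList<>(Arrays.asList("jaguar", "cat")));
        docs.put("d5", new ArrayList<>(Arrays.asList("jaguar", "speed")));
        return docs;
    }

    public static void main(String[] args) {
        // Clustering each shard on its own, as a per-shard process step would.
        System.out.println("shard 1 only: " + cluster(shard1()));
        System.out.println("shard 2 only: " + cluster(shard2()));

        // Clustering once over the merged set, as a finishStage step would.
        Map<String, List<String>> merged = new TreeMap<>(shard1());
        merged.putAll(shard2());
        System.out.println("merged:       " + cluster(merged));
    }
}
```

With this data, document `d1` gets a different label when clustered with only its shard-mates than when clustered with the full merged set, which is the situation described for shard 1 above.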