[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688171#action_12688171 ]

Stanislaw Osinski commented on SOLR-769:
----------------------------------------

bq. Also, you say C2 can handle full docs; is it feasible, then, to implement 
it for the "offline" mode I have in mind, whereby you cluster the whole 
collection offline and then store the clusters for retrieval? I haven't 
implemented this yet, but was thinking some people will be interested in full 
corpus clustering. The nice thing, then, is that as new documents come in, 
they can be added to existing clusters (and maybe periodically, we 
re-cluster). Just thinking out loud.

We have two variables here: the length of docs and the number of docs. Carrot2 
is suitable for small numbers of docs (up to, say, 1000). If the docs are short 
(a paragraph or so), the clustering should be pretty fast and suitable for 
on-line processing (see: http://project.carrot2.org/algorithms.html). If the 
documents get longer, Carrot2 will still handle them, but it will need more 
time for processing; I'll try to do some measurements. But C2 is not useful for 
the "whole collection" case -- it performs all processing in-memory, and here 
we'd need a totally different class of algorithm, something along the lines of 
the Mahout developments.
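
Just to give a feel for the on-line case, here's a rough sketch of clustering a 
few short snippets, assuming the Carrot2 3.x core API (Controller / Document / 
ProcessingResult); exact factory and method names may differ slightly between 
3.x releases:

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class SnippetClusteringSketch {
  public static void main(String[] args) {
    // A handful of short, snippet-sized documents: title + summary.
    List<Document> docs = new ArrayList<Document>();
    docs.add(new Document("Solr clustering",
        "Search results clustering with Carrot2."));
    docs.add(new Document("Mahout clustering",
        "Whole-corpus, offline clustering at scale."));

    // All processing happens in memory, which is why this works for up to
    // roughly a thousand snippets per request, but not for a whole collection.
    Controller controller = ControllerFactory.createSimple();
    ProcessingResult result =
        controller.process(docs, "clustering", LingoClusteringAlgorithm.class);

    for (Cluster cluster : result.getClusters()) {
      System.out.println(cluster.getLabel() + " ("
          + cluster.getAllDocuments().size() + " docs)");
    }
  }
}
{code}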

bq. Hmm, that's an interesting thought. We could check to see if highlighting 
is done first.

To quickly summarise the pros and cons of relying on highlighting being done 
outside of the clustering component (a rough sketch of the fallback logic 
follows the list):

Pros:

* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:

* if someone doesn't want highlighting in the search results, the clustering is 
likely to take more time (because it operates on full documents, and it's 
controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which 
may affect clustering (I'd need to check how Carrot2 handles that)
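
For illustration, here is what that fallback logic could look like. This is a 
hypothetical helper, not the patch code; the shape of the "highlighting" 
section and the String[] snippet values are my assumptions:

{code:java}
import org.apache.lucene.document.Document;
import org.apache.solr.common.params.HighlightParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;

/** Hypothetical helper, not the patch code. */
public class ClusteringInputSketch {
  /** Choose the text to cluster for a single document. */
  static String textForClustering(ResponseBuilder rb, Document doc,
                                  String field, String uniqueKeyValue) {
    boolean hlRequested =
        rb.req.getParams().getBool(HighlightParams.HIGHLIGHT, false);
    NamedList<?> highlighting =
        (NamedList<?>) rb.rsp.getValues().get("highlighting");
    if (hlRequested && highlighting != null) {
      // Reuse the snippet the highlighter already produced. Depending on the
      // formatter it may contain <em>...</em> markup, which would have to be
      // stripped before handing the text to Carrot2.
      NamedList<?> perDoc = (NamedList<?>) highlighting.get(uniqueKeyValue);
      if (perDoc != null && perDoc.get(field) != null) {
        return ((String[]) perDoc.get(field))[0];
      }
    }
    // No snippet available: fall back to the full stored field, which is
    // slower to cluster.
    return doc.get(field);
  }
}
{code}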

bq. Should the MockClusteringAlgorithm be under the test source tree and not 
the main one? I moved it in the patch to follow 

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be 
obvious from the list size. I dropped it in the patch to follow

Makes sense; I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional, 
although it means having to go do a lookup in the real doc list, which could be 
less than fun.

By "lookup" you mean the lookup in the XML response? Here again we have a trade 
off between the length of the response and ease of processing: if we repeat 
document titles / snippets in the clusters structure, we at least double the 
response size (at least because the same document may belong to many clusters), 
but can potentially save some lookups. But if we want to get some other fields 
of a document (other than we repeat in the clusters list), we'd still need a 
lookup. 

To sum up, my intuition would be to avoid duplication and stick with document 
ids in the cluster list (this is what we do in Carrot2 XMLs as well). 
Optionally, the clustering component could have a configurable list of fields 
to be repeated in the cluster list if that's really helpful in real-world use 
cases.
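
To make the id-only variant concrete, the component could emit something along 
these lines (a sketch only; the "labels" / "docs" field names are illustrative, 
not a final schema):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;

/** Sketch of an id-only clusters section; field names are illustrative. */
public class ClusterResponseSketch {
  static List<NamedList<Object>> idOnlyClusters() {
    List<NamedList<Object>> clusters = new ArrayList<NamedList<Object>>();

    NamedList<Object> cluster = new SimpleOrderedMap<Object>();
    cluster.add("labels", Arrays.asList("Solr", "Clustering"));
    // Members are referenced by uniqueKey only; a client that needs titles,
    // snippets or other fields looks them up in the regular result list.
    cluster.add("docs", Arrays.asList("SOLR1000", "UTF8TEST"));
    clusters.add(cluster);

    // The search component would then do: rb.rsp.add("clusters", clusters);
    return clusters;
  }
}
{code}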

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: clustering-libs.tar, clustering-libs.tar, 
> SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
> SOLR-769.patch, SOLR-769.patch, SOLR-769.zip
>
>
> Clustering is a useful tool for working with documents and search results, 
> similar to the notion of dynamic faceting.  Carrot2 
> (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
> search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
> suited for whole-corpus clustering.  
> The patch lays out a contrib module that starts off w/ an integration of a 
> SearchComponent for doing clustering and an implementation using Carrot.  In 
> search results mode, it will use the DocList as the input for the cluster.   
> While Carrot2 comes w/ a Solr input component, it is not the same as the 
> SearchComponent that I have in that the Carrot example actually submits a 
> query to Solr, whereas my SearchComponent is just chained into the Component 
> list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection based mode will take in a 
> list of ids or just use the whole collection and will produce clusters.  
> Since this is a longer, typically offline task, there will need to be some 
> type of storage mechanism (and replication??????) for the clusters.  I _may_ 
> push this off to a separate JIRA issue, but I at least want to present the 
> use case as part of the design of this component/contrib.  It may even make 
> sense that we split this out, such that the building piece is something like 
> an UpdateProcessor and then the SearchComponent just acts as a lookup 
> mechanism.

