[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725573#action_12725573
]
Grant Ingersoll commented on SOLR-769:
--------------------------------------
bq. Is "labels" needed because there could be multiple labels per cluster in
the future? (I assume yes)
Not sure, but likely so.
bq. Do we need more per-doc information than just the id? (I assume no)
I think for other algorithms like k-Means, Canopy and others (Mahout) you could
reasonably expect to return:
1. The centroid that the given document belongs to - This can be captured as
the label, but it is often represented as a vector and could thus be quite
long. For instance, in Mahout, we could return this as a JSON string (we're
using GSON over there)
2. The distance from the centroid used in clustering.
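To make the two items above concrete, here is a minimal sketch of what a per-document record could carry beyond the id. The class name, field names, and the choice of Euclidean distance are assumptions for illustration only, not anything in the patch or in Mahout; the JSON is built by hand here rather than with GSON to keep the sketch free of dependencies.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical per-document clustering output: id, owning centroid, distance.
final class DocAssignment {
    final String docId;
    final double[] centroid;   // can be long for high-dimensional vectors
    final double distance;     // distance from the document to its centroid

    DocAssignment(String docId, double[] docVector, double[] centroid) {
        if (docVector.length != centroid.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        this.docId = docId;
        this.centroid = centroid.clone();
        this.distance = euclidean(docVector, centroid);
    }

    // Euclidean distance is an assumption; an algorithm may use another measure.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Renders the centroid as a JSON array string, e.g. for use as a label.
    String centroidAsJson() {
        return Arrays.stream(centroid)
                .mapToObj(Double::toString)
                .collect(Collectors.joining(",", "[", "]"));
    }
}
```

Because the centroid string grows with the vector dimension, a response format would probably want it to be optional per document.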
bq. Could we want other per-cluster information in the future? (I assume yes)
See #1 in the previous answer.
bq. What other possible information could be added in the future?
Hard to say, but the nature of this implementation is such that people can
plug in their own clustering algorithms which may have different outputs.
Until we have at least one other implementation, it will be difficult to
"harden" the interfaces. For now, though, your proposed alterations to the
format are fine with me.
bq. Seems like it would be nice if we could handle unknown field types
gracefully?
Yes, that would be good.
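One possible reading of "gracefully", sketched under assumptions: a client-side reader that pulls out the keys it knows and keeps everything else instead of failing. The class name and the "docs" key are hypothetical; only "labels" and per-document ids come from the discussion above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for one cluster entry of a clustering response.
final class ClusterEntry {
    final List<String> labels;
    final List<String> docIds;
    final Map<String, Object> extras;  // anything this reader does not recognize

    private ClusterEntry(List<String> labels, List<String> docIds,
                         Map<String, Object> extras) {
        this.labels = labels;
        this.docIds = docIds;
        this.extras = extras;
    }

    static ClusterEntry fromMap(Map<String, ?> raw) {
        List<String> labels = new ArrayList<>();
        List<String> docIds = new ArrayList<>();
        Map<String, Object> extras = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : raw.entrySet()) {
            Object value = e.getValue();
            if ("labels".equals(e.getKey()) && value instanceof List) {
                for (Object o : (List<?>) value) labels.add(String.valueOf(o));
            } else if ("docs".equals(e.getKey()) && value instanceof List) {
                for (Object o : (List<?>) value) docIds.add(String.valueOf(o));
            } else {
                // Unknown key or unexpected type: preserve it rather than throw.
                extras.put(e.getKey(), value);
            }
        }
        return new ClusterEntry(labels, docIds, extras);
    }
}
```

With this shape, a new per-cluster field added by another algorithm shows up in `extras` and older clients keep working.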
> Support Document and Search Result clustering
> ---------------------------------------------
>
> Key: SOLR-769
> URL: https://issues.apache.org/jira/browse/SOLR-769
> Project: Solr
> Issue Type: New Feature
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 1.4
>
> Attachments: clustering-componet-shard.patch, clustering-libs.tar,
> clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip,
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
>
>
> Clustering is a useful tool for working with documents and search results,
> similar to the notion of dynamic faceting. Carrot2
> (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing
> search results clustering. Mahout (http://lucene.apache.org/mahout) is well
> suited for whole-corpus clustering.
> The patch lays out a contrib module that starts off w/ an integration of a
> SearchComponent for doing clustering and an implementation using Carrot. In
> search results mode, it will use the DocList as the input for the cluster.
> While Carrot2 comes w/ a Solr input component, it is not the same as the
> SearchComponent that I have in that the Carrot example actually submits a
> query to Solr, whereas my SearchComponent is just chained into the Component
> list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection-based mode will take in a
> list of ids or just use the whole collection and will produce clusters.
> Since this is a longer, typically offline task, there will need to be some
> type of storage mechanism (and replication?) for the clusters. I _may_
> push this off to a separate JIRA issue, but I at least want to present the
> use case as part of the design of this component/contrib. It may even make
> sense that we split this out, such that the building piece is something like
> an UpdateProcessor and then the SearchComponent just acts as a lookup
> mechanism.