[
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942
]
Stanislaw Osinski edited comment on SOLR-769 at 3/11/09 10:46 AM:
------------------------------------------------------------------
Hi All,
I've just uploaded a patch that passes the unit tests and has a working
example, but this is by no means a final version. A few outstanding questions /
issues:
1. Response structure.
I was wondering -- do we need to repeat the document contents in the 'clusters'
response section? Assuming that each document in the index has a unique ID, we
could reduce the size of the response by just referencing documents by their
IDs, like this:
\\
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch.
Also, in the case of hierarchical clusters I've introduced a grouping entity
called "clusters" so that the top and sub-levels of the response are consistent
(see the unit tests, and the sketch below). Please let me know if this makes
sense.
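To make that concrete, a nested cluster might be rendered roughly as below; the
sub-cluster label ("SATA") is made up purely for illustration, and the exact
element names are defined by the unit tests in the patch:
{code}
<lst name="cluster">
  <lst name="labels">
    <str name="label">Hard Drive</str>
  </lst>
  <lst name="docs">
    <str name="doc">6H500F0</str>
    <str name="doc">SP2514N</str>
  </lst>
  <!-- sub-clusters sit under the same "clusters" grouping entity
       as the top level (illustrative sketch only) -->
  <lst name="clusters">
    <lst name="cluster">
      <lst name="labels">
        <str name="label">SATA</str>
      </lst>
      <lst name="docs">
        <str name="doc">SP2514N</str>
      </lst>
    </lst>
  </lst>
</lst>
{code}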
\\
\\
\\
2. Build: compile warnings about missing SimpleXML
SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not
needed at runtime, but it generates warnings about missing dependencies at
compile time. So the options are either to live with the warnings or to add
SimpleXML (version 1.7.2) to get rid of them.
\\
\\
\\
3. Build: copying of protowords.txt etc
The patch includes the lexical files both in
contrib/clustering/src/java/test/resources/.... and in the examples dir. I'm
not sure how this is usually handled, though -- do you keep copies in the
repository, or are they copied somehow during the build (see the sketch below)?
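If it's the latter, a build-time copy step could look roughly like this; the
target name and paths below are assumptions for illustration, not taken from
the actual build files:
{code}
<!-- Illustrative Ant sketch only: copy the lexical resources
     (protowords.txt etc.) into the example configuration at build
     time. Target name and paths are assumptions. -->
<target name="copy-clustering-resources">
  <copy todir="example/solr/conf">
    <fileset dir="contrib/clustering/src/test/resources"
             includes="protowords.txt"/>
  </copy>
</target>
{code}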
\\
\\
\\
4. Highlighting
This is the bit I've not yet fully analyzed. In general, Carrot2 should handle
full documents fairly well (up to, say, a few hundred kB each); it's just the
number of documents that must be on the order of hundreds. Therefore,
highlighting is not mandatory, but it may sometimes improve the quality of the
clusters.
I was wondering: if highlighting is performed earlier in the Solr pipeline,
could it be reused during clustering? One possible approach would be for
clustering to use whatever is fed from the pipeline: if highlighting is
enabled, clustering would be performed on the highlighted content; if there is
no highlighting, we'd cluster the full documents. I'm not sure whether that's
reasonable / possible to implement, though -- a rough configuration sketch is
below.
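Just to make the idea concrete, the clustering SearchComponent could be chained
in after the standard components like this; the component name "clustering" and
the parameters are assumptions for illustration, not the final names from the
patch:
{code}
<!-- Illustrative sketch only: component and parameter names are
     assumptions, not the final ones from the patch. -->
<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- highlighting runs as part of the default component chain -->
    <bool name="hl">true</bool>
    <bool name="clustering">true</bool>
  </lst>
  <arr name="last-components">
    <!-- the clustering component is appended after the defaults and
         could pick up either the highlighted snippets or the full
         stored fields -->
    <str>clustering</str>
  </arr>
</requestHandler>
{code}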
\\
\\
\\
5. Documentation (wiki) updates
Once we stabilise the ideas, I'm happy to update the wiki with regard to the
algorithms used (Lingo/STC) and how additional parameters are passed.
> Support Document and Search Result clustering
> ---------------------------------------------
>
> Key: SOLR-769
> URL: https://issues.apache.org/jira/browse/SOLR-769
> Project: Solr
> Issue Type: New Feature
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: clustering-libs.tar, clustering-libs.tar,
> SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
> SOLR-769.patch, SOLR-769.patch
>
>
> Clustering is a useful tool for working with documents and search results,
> similar to the notion of dynamic faceting. Carrot2
> (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing
> search results clustering. Mahout (http://lucene.apache.org/mahout) is well
> suited for whole-corpus clustering.
> The patch lays out a contrib module that starts off w/ an integration of a
> SearchComponent for doing clustering and an implementation using Carrot. In
> search results mode, it will use the DocList as the input for the cluster.
> While Carrot2 comes w/ a Solr input component, it is not the same as the
> SearchComponent that I have in that the Carrot example actually submits a
> query to Solr, whereas my SearchComponent is just chained into the Component
> list and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection based mode will take in a
> list of ids or just use the whole collection and will produce clusters.
> Since this is a longer, typically offline task, there will need to be some
> type of storage mechanism (and replication??????) for the clusters. I _may_
> push this off to a separate JIRA issue, but I at least want to present the
> use case as part of the design of this component/contrib. It may even make
> sense that we split this out, such that the building piece is something like
> an UpdateProcessor and then the SearchComponent just acts as a lookup
> mechanism.