[Solr Wiki] Update of "ClusteringComponent" by GrantIngersoll

Apache Wiki Sat, 02 Jan 2010 13:47:37 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The "ClusteringComponent" page has been changed by GrantIngersoll.
http://wiki.apache.org/solr/ClusteringComponent?action=diff&rev1=33&rev2=34

--------------------------------------------------

  NOTE: This code is marked as experimental, and the APIs and responses are 
subject to change in future releases. See 
https://issues.apache.org/jira/browse/SOLR-769 for discussions around the 
development of this feature.
  
  = Introduction =
- 
  This component can cluster both search results and documents.  In case you're 
wondering what clustering is good for, think of it as a quick way to summarize 
a whole bunch of results/documents, or as a way to group together like 
results/documents.
  
  See http://en.wikipedia.org/wiki/Data_clustering for more background, as well 
as links to further reading.
  
  = Clustering Component =
- 
  The clustering component uses a pluggable design that allows any clustering 
engine to be implemented.
  
  The !ClusteringComponent is responsible for taking in the request, identifying 
the clustering engine to be used (a !SolrClusteringEngine implementation) and 
then delegating the work to that engine.  Once the engine is done, the results 
are then added to the response.
@@ -19, +17 @@

  The !ClusteringComponent currently does not support distributed processing.
  
  == Installation ==
- 
  The !ClusteringComponent is in the contrib area of Solr.  Due to some 
dependencies on LGPL libraries for the Carrot2 implementation, we cannot 
package a complete binary solution (with all the dependencies).  To get the 
Carrot2 solution, you will need to download these libraries.  To do this, on 
the command line in the contrib/clustering directory, run {{{ant 
get-libraries}}}.  This will create a downloads directory under the lib 
directory for the downloaded jars.
  
  == Quick Start ==
+ Once you have downloaded the library dependencies, you can run the example 
using the following commands:
  
- Once you have downloaded the library dependencies, you can run the example 
using the following commands:
  {{{
  $ cd example
  $ java -Dsolr.clustering.enabled=true -jar start.jar
  }}}
- 
  This is the same as the main Solr example, using the same index, but with the 
clustering component enabled, along with a SearchHandler configured to use it.
  
  In a different window, add some docs using the post tool in the exampledocs 
directory (if you haven't already).
+ 
  {{{
  $ cd example/exampledocs
  $ ./post.sh *.xml
  }}}
  Now try a query using the handler configured for clustering (it is configured 
with clustering=true as a default param):
+ 
  {{{
  http://localhost:8983/solr/clustering?q=*:*&rows=10
  }}}
  This should yield results that include cluster information at the bottom of 
the response, like:
+ 
  {{{
  <arr name="clusters">
   <lst>
    <arr name="labels">
-       <str>DDR</str>
+         <str>DDR</str>
    </arr>
    <arr name="docs">
-       <str>TWINX2048-3200PRO</str>
+         <str>TWINX2048-3200PRO</str>
-       <str>VS1GB400C3</str>
+         <str>VS1GB400C3</str>
-       <str>VDBDB1A16</str>
+         <str>VDBDB1A16</str>
    </arr>
   </lst>
   <lst>
    <arr name="labels">
-       <str>Car Power Adapter</str>
+         <str>Car Power Adapter</str>
    </arr>
    <arr name="docs">
-       <str>F8V7067-APL-KIT</str>
+         <str>F8V7067-APL-KIT</str>
-       <str>IW-02</str>
+         <str>IW-02</str>
    </arr>
   </lst>
   <lst>
    <arr name="labels">
-       <str>Hard Drive</str>
+         <str>Hard Drive</str>
    </arr>
    <arr name="docs">
-       <str>SP2514N</str>
+         <str>SP2514N</str>
-       <str>6H500F0</str>
+         <str>6H500F0</str>
    </arr>
   </lst>
   <lst>
  [...]
  }}}
- 
  Clusters produced by Carrot2 group the results into different product 
categories: DDR (memory), Car Power Adapter, Display, Hard Drive. Notice that, 
depending on the quality of input documents, some clusters may not make much 
sense.
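
The cluster section of the response can be read with any XML parser. The following is a minimal Python sketch, not part of Solr: the helper name `parse_clusters` is hypothetical, the sample is an abbreviated copy of the output shown above, and the enclosing `<response>` element is an assumption about the surrounding document.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample output above. The <response> wrapper is an
# assumption; only the "clusters" array is taken from the example.
SAMPLE = """<response>
<arr name="clusters">
 <lst>
  <arr name="labels"><str>DDR</str></arr>
  <arr name="docs">
   <str>TWINX2048-3200PRO</str>
   <str>VS1GB400C3</str>
   <str>VDBDB1A16</str>
  </arr>
 </lst>
 <lst>
  <arr name="labels"><str>Hard Drive</str></arr>
  <arr name="docs">
   <str>SP2514N</str>
   <str>6H500F0</str>
  </arr>
 </lst>
</arr>
</response>"""


def parse_clusters(xml_text):
    """Return one (labels, doc_ids) tuple per cluster in the response."""
    root = ET.fromstring(xml_text)
    clusters = []
    # Each <lst> under the "clusters" array is a single cluster.
    for lst in root.findall(".//arr[@name='clusters']/lst"):
        labels = [s.text for s in lst.findall("arr[@name='labels']/str")]
        docs = [s.text for s in lst.findall("arr[@name='docs']/str")]
        clusters.append((labels, docs))
    return clusters


if __name__ == "__main__":
    for labels, docs in parse_clusters(SAMPLE):
        print(", ".join(labels), "->", " ".join(docs))
```

Each tuple pairs the cluster labels with the ids of the documents assigned to that cluster, so a client can look the documents up in the main result list.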
  
+ == Configuration ==
+ The !ClusteringComponent gets added just like any other !SearchComponent.  
Just declare it in the solrconfig.xml, as in:
  
- == Configuration ==
- 
- The !ClusteringComponent gets added just like any other !SearchComponent.  
Just declare it in the solrconfig.xml, as in:
  {{{
  <searchComponent 
class="org.apache.solr.handler.clustering.ClusteringComponent" 
name="clustering">
    <lst name="engine">
@@ -90, +87 @@

    </lst>
  </searchComponent>
  }}}
- 
  = Search Results Clustering =
- 
  == Carrot2 Clustering ==
- 
  Carrot2 is a scalable, BSD licensed search results clustering engine.  It can 
cluster many different types of search results, including Y!, Google, etc.  Our 
implementation, naturally, clusters Solr results.
  
  Carrot2 is best suited for clustering small-to-medium collections of short 
documents. While Carrot2 may work for longer documents, processing times may be 
too long to meet on-line clustering requirements.
  
  See http://project.carrot2.org
  
+ === Parameters ===
+ 
+  * carrot.algorithm - The engine to use as configured in the !SearchComponent.
+  * carrot.title - The title field name to use.
+  * carrot.url - The url field name. 
+  * carrot.snippet - The snippet field name.
+  * carrot.produceSummary - If true, then the snippet field (if no snippet 
field, then the title field) will be highlighted and the highlighted text will 
be used for the snippet.
+  * carrot.numDescriptions - The maximum number of labels to produce.
+  * carrot.outputSubClusters - If true, generate subclusters.
+  * carrot.fragSize - <!>Solr1.5<!> The frag size to use when produceSummary 
is true, for highlighting.  If not specified, the default highlighting fragsize 
(hl.fragsize) will be used.  If that isn't specified, then 100.
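
A request that sets some of these parameters might be built as sketched below. This is an illustration in Python, not part of Solr: the helper name `clustering_url` is hypothetical, the field names `name` and `features` are assumed to exist in the schema, and it assumes the parameters can be passed on the request URL of the `/clustering` handler used in the Quick Start.

```python
from urllib.parse import urlencode

# Handler URL taken from the Quick Start example.
BASE_URL = "http://localhost:8983/solr/clustering"


def clustering_url(query, rows=10, **carrot):
    """Build a query URL; keyword arguments become carrot.* parameters."""
    params = {"q": query, "rows": rows}
    for key, value in carrot.items():
        # Solr expects lowercase true/false for boolean parameters.
        if isinstance(value, bool):
            value = "true" if value else "false"
        params["carrot." + key] = value
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # "name" and "features" are assumed field names, used for illustration.
    print(clustering_url("*:*", title="name", snippet="features",
                         produceSummary=True, numDescriptions=3))
```

Parameters that are not given are simply left out of the URL, so the defaults configured for the handler still apply.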
+ 
+ === Config ===
+ 
  The configuration (solrconfig.xml) looks like:
+ 
  {{{
  <searchComponent 
class="org.apache.solr.handler.clustering.ClusteringComponent" 
name="clustering">
    <!-- Declare an engine -->
    <lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">default</str>
-     <!-- 
+     <!--
           Class name of Carrot2 clustering algorithm. Currently available 
algorithms are:
-          
+ 
           * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
           * org.carrot2.clustering.stc.STCClusteringAlgorithm
-          
+ 
           See http://project.carrot2.org/algorithms.html for the algorithm's 
characteristics.
        -->
      <str 
name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
-     <!-- 
+     <!--
           Overriding values for Carrot2 default algorithm attributes. For a 
description
           of all available attributes, see: 
http://download.carrot2.org/stable/manual/#chapter.components.
           Use attribute key as name attribute of str elements below. These can 
be further
@@ -128, +136 @@

    </lst>
  </searchComponent>
  }}}
+ And the Standard !ReqHandler looks like:
  
- And the Standard !ReqHandler looks like:
  {{{
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <!-- default values for query parameters -->
       <lst name="defaults">
         <str name="echoParams">explicit</str>
-        <!-- 
+        <!--
         <int name="rows">10</int>
         <str name="fl">*</str>
         <str name="version">2.1</str>
@@ -161, +169 @@

      </arr>
    </requestHandler>
  }}}
- 
  The thing to note here is the mapping of Solr fields (name, id, etc.) to the 
title, snippet and url inputs that Carrot2 expects. Clustering will take into 
account the text of title and snippet.
  
- 
  == Tuning Carrot2 clustering ==
- 
  The easiest way to tune Carrot2 clustering for your specific data is to use a 
dedicated Carrot2 tool called Document Clustering Workbench.
  
   1. [[http://project.carrot2.org/download.html|Download Carrot2 Document 
Clustering Workbench]] for your platform.
-  2. 
[[http://download.carrot2.org/head/manual/#section.getting-started.solr|Attach]]
 your Solr instance as a document source in the Workbench.
+  1. 
[[http://download.carrot2.org/head/manual/#section.getting-started.solr|Attach]]
 your Solr instance as a document source in the Workbench.
-  3. 
[[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words|Fine
 tune stop words]], 
[[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps|stop
 labels]] and possibly 
[[http://download.carrot2.org/head/manual/#section.component.lingo|other 
attributes]] of the clustering algorithms to suit your needs.
+  1. 
[[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words|Fine
 tune stop words]], 
[[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps|stop
 labels]] and possibly 
[[http://download.carrot2.org/head/manual/#section.component.lingo|other 
attributes]] of the clustering algorithms to suit your needs.
-  4. To transfer the modified `stopwords.*` and `stoplabels.*` files to your 
Solr instance, simply make the modified files accessible in the classpath. If 
you're using the Solr example scripts, try putting the files in the 
`example/resources` folder (Jetty starter from `start.jar` adds all files from 
that folder to the classpath). Alternatively, you can overwrite the 
corresponding `stopwords.*` and `stoplabels.*` files directly in 
`carrot2-mini-*.jar`.
+  1. To transfer the modified `stopwords.*` and `stoplabels.*` files to your 
Solr instance, simply make the modified files accessible in the classpath. If 
you're using the Solr example scripts, try putting the files in the 
`example/resources` folder (Jetty starter from `start.jar` adds all files from 
that folder to the classpath). Alternatively, you can overwrite the 
corresponding `stopwords.*` and `stoplabels.*` files directly in 
`carrot2-mini-*.jar`.
- 
  
  = Document Clustering =
- 
  <!> THIS IS NOT FULLY IMPLEMENTED YET.
  
  The Document Clustering implementation is designed to cluster whole documents 
across a collection.  This can be done as an offline task.  Once the clustering 
is done, the clusters can be retrieved.
