[Solr Wiki] Update of "SolrFacetingOverview" by OtisGospodnetic

Apache Wiki Tue, 15 Apr 2008 08:21:18 -0700


The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/solr/SolrFacetingOverview

------------------------------------------------------------------------------
- Solr provides a 
[http://lucene.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html
 Simple Faceting toolkit] which can be reused by various Request Handlers to 
include "Facet counts" of based on some simple criteria. Both the 
StandardRequestHandler and the DisMaxRequestHandler currently use these 
utilities.  Detailed descriptions of the parameters used to control faceting 
can be found (along with several examples) at [SimpleFacetParameters].
+ Solr provides a 
[http://lucene.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html
 Simple Faceting toolkit] which can be reused by various Request Handlers to 
include "Facet counts" based on some simple criteria. Both the 
StandardRequestHandler and the DisMaxRequestHandler currently use these 
utilities.  Detailed descriptions of the parameters used to control faceting 
can be found (along with several examples) at [SimpleFacetParameters].
  
  This page briefly provides some general background information: 
  
  = Facet Indexing =
  
- Faceting is done on __indexed__ rather than __stored__ values.  This is 
because the primary use for faceting is drilldown into a subset of hits 
resulting from a query, and so the chosen facet value is used to construct a 
filter query which literally matches that value in the index.  For the stock 
Solr request handlers this is done by adding an `fq=<facet-field>:<quoted 
facet-value>` parameter and resubmitting the query.
+ Faceting is done on __indexed__ rather than __stored__ values.  This is 
because the primary use for faceting is drill-down into a subset of hits 
resulting from a query, and so the chosen facet value is used to construct a 
filter query which literally matches that value in the index.  For the stock 
Solr request handlers this is done by adding an `fq=<facet-field>:<quoted 
facet-value>` parameter and resubmitting the query.
  
  Because faceting fields are often specified to serve two purposes, 
human-readable text and drill-down query value, they are frequently indexed 
differently from fields used for searching and sorting:
    * They are not tokenized into separate words
    * They are not mapped into lower case
    * Human-readable punctuation is not removed (other than double-quotes)
    * There is often no need to store them, since stored values would look much 
like indexed values and the faceting mechanism is used for value retrieval.
-   * Depending on how the field is defined the SimpleFacets mechanism may only 
allow for a single value per field per document (see below)
+   * Depending on how the field is defined, the SimpleFacets mechanism may 
only allow for a single value per field per document (see below)
  
- As an example, if I had a field with a list of authors, such as:
+ As an example, if I had an "author" field with a list of authors, such as:
  
    ''Schildt, Herbert; Wolpert, Lewis; Davies, P.''
    
@@ -27, +27 @@

    * For faceting: Primary author only, using a `solr.StrField`:
        Schildt, Herbert
  
- Then when the user drills down on the "Schildt, Herbert" string I would 
reissue the query with an added fq="Schild, Herbert" parameter.  If you wanted 
to "drill down" or query by multiple authors you would add more 'fq' parameters 
as needed, i.e. 'fq=Schield, Herbet&fq=Wolpert, Lewis".  
+ Then when the user drills down on the "Schildt, Herbert" string I would 
reissue the query with an added fq=author:"Schildt, Herbert" parameter.  If you 
wanted to drill-down or query by multiple authors you would add more 'fq' 
parameters as needed, e.g. fq=author:"Schildt, Herbert"&fq=author:"Wolpert, 
Lewis".  
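
A minimal sketch of how a client might assemble such a drill-down request, assuming the field name "author" from the example above. The helper name `drill_down_params` is hypothetical and not part of Solr; it only shows one `fq` parameter being added per selected facet value, with the value quoted and URL-encoded.

```python
from urllib.parse import urlencode


def drill_down_params(query, selections):
    """Build a query string with one fq parameter per selected facet value.

    `selections` is a list of (field, value) pairs chosen by the user.
    This helper is illustrative only; it is not a Solr API.
    """
    params = [("q", query)]
    for field, value in selections:
        # Escape backslashes and double quotes so the value stays a single
        # quoted phrase inside the filter query.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    # urlencode handles the commas, spaces, colons and quotes in each value.
    return urlencode(params)


query_string = drill_down_params(
    "programming",
    [("author", "Schildt, Herbert"), ("author", "Wolpert, Lewis")],
)
print(query_string)
```

Each selected value becomes its own `fq` parameter, so the filters are applied together and the result set narrows with every drill-down step.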
  
  = Facet Operation =
  
@@ -35, +35 @@

  
  == FacetQueries ==
  
- Any number of [:SimpleFacetParameters#facet.query:facet.query] parameters can 
be passed to the request handler.  Each distinct facet.query will first be 
executed against the entire index, with the results cached as a hashed set (if 
fewer than hashDocSet) or a bit set (if greater) of document IDs (see 
[:SolrCaching#The hashDocSet Max Size:hashDocSet]).  Then every time that 
facet.query is used for faceting a query, the cached set will be intersected 
against the set of document ids returned by the query to count the number of 
documents for which the facet.query condition is true.
+ Any number of [:SimpleFacetParameters#facet.query:facet.query] parameters can 
be passed to the request handler.  Each distinct facet.query will first be 
executed against the entire index, with the results cached as a hashed set (if 
fewer than hashDocSet) or a bit set (if greater) of document IDs (see 
[:SolrCaching#The hashDocSet Max Size:hashDocSet]).  Then, every time that 
facet.query is used for faceting a query, the cached set will be intersected 
against the set of document IDs returned by the query to count the number of 
documents for which the facet.query condition is true.
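
The caching and intersection described above can be sketched roughly as follows. This is a simplified model rather than Solr's implementation: the `FacetQueryCache` class and the `run_query` callable are hypothetical, and a plain Python `frozenset` stands in for both the hashed set and the bit set.

```python
class FacetQueryCache:
    """Toy model of per-facet.query document set caching (not Solr code)."""

    def __init__(self, run_query):
        # run_query is a hypothetical callable that executes a query against
        # the entire index and returns the matching document IDs.
        self._run_query = run_query
        self._cache = {}

    def doc_set(self, facet_query):
        # Execute each distinct facet.query once and remember its result.
        if facet_query not in self._cache:
            self._cache[facet_query] = frozenset(self._run_query(facet_query))
        return self._cache[facet_query]

    def count(self, facet_query, result_doc_ids):
        # Intersect the cached set with the main query's document IDs.
        return len(self.doc_set(facet_query) & result_doc_ids)


# Tiny stand-in index: facet query string -> matching document IDs.
toy_index = {"price:[0 TO 10]": [1, 2, 5, 8], "inStock:true": [2, 3, 5]}
executions = []


def run_query(q):
    executions.append(q)
    return toy_index[q]


cache = FacetQueryCache(run_query)
main_results = {2, 5, 9}
print(cache.count("price:[0 TO 10]", main_results))
print(cache.count("price:[0 TO 10]", {1, 9}))
```

The second call reuses the cached set, so the facet query itself is only executed once no matter how many user queries are faceted with it.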
    
  == FacetFields ==
  
- Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can 
be passed to the request handler.  For each facet.field, one of two approaches 
will be used based on the Field definion in schema.xml:
+ Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can 
be passed to the request handler.  For each facet.field, one of two approaches 
will be used based on the Field definition in schema.xml:
    
-     * '''Field Queries''':  If the facet field is defined in the schema as 
multi-valued, boolean, or tokenized, then every indexed value for the field 
will be iterated and a facet query will be executed and cached (as described 
above).  This is excellent for fields where there is a small set of distinct 
values.  For example, faceting on a field with U.S. States eg. `Alabama, 
Alaska, ... Wyoming` would lead to fifty cached queries which would be used 
over and over again.  It also works in the case when the facet field can have 
multiple values for each document.  However, it requires excessive amounts of 
memory and time when the number of field values is large and especially when it 
exceeds the filter cache size defined in [:SolrCaching#filterCache:filterCache]
+     * '''Field Queries''':  If the facet field is defined in the schema as 
multi-valued, boolean, or tokenized, then every indexed value for the field 
will be iterated and a facet query will be executed and cached (as described 
above).  This is excellent for fields where there is a small set of distinct 
values.  For example, faceting on a field with U.S. States e.g. `Alabama, 
Alaska, ... Wyoming` would lead to fifty cached queries which would be used 
over and over again.  It also works in the case when the facet field can have 
multiple values for each document.  However, it requires excessive amounts of 
memory and time when the number of field values is large, and especially when 
it exceeds the filter cache size defined in 
[:SolrCaching#filterCache:filterCache]
      
      * '''Field Cache''': If the facet field is not tokenized, not 
multi-valued, and not boolean, then a field-cache approach will be used.  This 
is currently implemented with the Lucene 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html
 FieldCache] mechanism used for results sorting.  An array of integers (one for 
every document in the index) is allocated, pre-filled with the first indexed 
value for that field in each document (offset into a table of strings for 
fields indexed as strings), and cached.  Every time that facet.field is used 
for faceting a query, all the document IDs resulting from the query are looked 
up in the field cache and any value found has its tally incremented.  This is 
excellent for situations where the number of indexed values for the field is 
too large to be practical using the field queries mechanism, such as faceting 
against authors or titles.  However it is currently much slower and more 
memory-intensive than the field query mechanism for fields with a small number 
of values. 
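
As a rough illustration of the field-cache approach, the sketch below builds one integer per document (an offset into a table of distinct strings) and then tallies values for the documents matched by a query. It is a simplified model under stated assumptions, not the Lucene FieldCache API: the function names `build_field_cache` and `facet_counts` are hypothetical, and documents without a value are marked with -1.

```python
def build_field_cache(doc_values):
    """Build a term table and a per-document array of offsets into it.

    `doc_values[doc_id]` is the single indexed value for that document,
    or None when the document has no value for the field.
    """
    terms = sorted({v for v in doc_values if v is not None})
    ordinal = {term: i for i, term in enumerate(terms)}
    # One integer per document in the index; -1 means "no value".
    ords = [ordinal[v] if v is not None else -1 for v in doc_values]
    return terms, ords


def facet_counts(terms, ords, result_doc_ids):
    """Tally field values for the documents returned by a query."""
    tally = [0] * len(terms)
    for doc_id in result_doc_ids:
        o = ords[doc_id]
        if o >= 0:
            tally[o] += 1
    # Report only values that actually occur in the result set.
    return {terms[i]: n for i, n in enumerate(tally) if n > 0}


doc_values = ["Schildt, Herbert", "Wolpert, Lewis", None, "Schildt, Herbert"]
terms, ords = build_field_cache(doc_values)
print(facet_counts(terms, ords, [0, 2, 3]))
```

The cost of a request here is proportional to the number of matching documents rather than the number of distinct values, which is why this approach suits fields such as authors or titles.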
  
- Note at this time there is no way to manually control whether facet.field is 
handled via field queries or field cache other than defining in the schema 
whether the field is single- or multi-valued and the analyzer used: 
`solr.TextField` is always tokenized while `solr.StrField` is never.  Control 
may be improved in the future, along with a means to handle multi-valued fields 
with a variant of the Field Cache mechanism.
+ Note that at this time there is no way to manually control whether 
facet.field is handled via field queries or field cache, other than defining in 
the schema whether the field is single- or multi-valued and the analyzer used: 
`solr.TextField` is always tokenized while `solr.StrField` is never tokenized.  
Control may be improved in the future, along with a means to handle 
multi-valued fields with a variant of the Field Cache mechanism.
