[Solr Wiki] Update of "SolrFacetingOverview" by YonikSeeley

Apache Wiki Thu, 18 Jun 2009 08:26:44 -0700

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/SolrFacetingOverview

The comment on the change is:
first pass at cleaning up faceting overview

------------------------------------------------------------------------------
- Solr provides a 
[http://lucene.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html
 Simple Faceting toolkit] which can be reused by various Request Handlers to 
include "Facet counts" based on some simple criteria. Both the 
StandardRequestHandler and the DisMaxRequestHandler currently use these 
utilities.  Detailed descriptions of the parameters used to control faceting 
can be found (along with several examples) at SimpleFacetParameters.
+ Solr provides a [http://wiki.apache.org/solr/SimpleFacetParameters faceting 
component] which is part of the standard request handler and can be used by 
various other request handlers to include "Facet counts" based on some simple 
criteria.
  
  This page briefly provides some general background information: 
  
@@ -9, +9 @@

  Faceting is done on __indexed__ rather than __stored__ values.  This is 
because the primary use for faceting is drill-down into a subset of hits 
resulting from a query, and so the chosen facet value is used to construct a 
filter query which literally matches that value in the index.  For the stock 
Solr request handlers this is done by adding an `fq=<facet-field>:<quoted 
facet-value>` parameter and resubmitting the query.
  
  Because faceting fields are often specified to serve two purposes, 
human-readable text and drill-down query value, they are frequently indexed 
differently from fields used for searching and sorting:
-   * They are not tokenized into separate words
+   * They are often not tokenized into separate words
-   * They are not mapped into lower case
+   * They are often not mapped into lower case
-   * Human-readable punctuation is not removed (other than double-quotes)
+   * Human-readable punctuation is often not removed (other than double-quotes)
    * There is often no need to store them, since stored values would look much 
like indexed values and the faceting mechanism is used for value retrieval.
-   * Depending on how the field is defined, the SimpleFacets mechanism may 
only allow for a single value per field per document (see below)
  
  As an example, if I had an "author" field with a list of authors, such as:
  
@@ -27, +26 @@

    * For faceting: Primary author only, using a `solr.StringField`:
        Schildt, Herbert
  
- Then when the user drills down on the "Schildt, Herbert" string I would 
reissue the query with an added fq=author:"Schild, Herbert" parameter.  If you 
wanted to drill-down or query by multiple authors you would add more 'fq' 
parameters as needed, e.g. fq=author:"Schield, Herbet"&fq=author:"Wolpert, 
Lewis".  
+ Then when the user drills down on the "Schildt, Herbert" string I would 
reissue the query with an added fq=author:"Schildt, Herbert" parameter.
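
As a rough illustration of the drill-down step described above, the sketch below appends an `fq` parameter for a chosen facet value and URL-encodes the result. The helper name `add_drill_down` and the base parameters are assumptions made for this example; they are not part of Solr or of the original page.

```python
from urllib.parse import urlencode


def add_drill_down(params, field, value):
    """Return a new parameter list with an fq filter for one facet value.

    Hypothetical helper: the value is wrapped in double quotes so that it
    literally matches the indexed (untokenized) facet value.
    """
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return params + [("fq", f'{field}:"{escaped}"')]


# Assumed original request: a keyword query with faceting on "author".
base = [("q", "java"), ("facet", "true"), ("facet.field", "author")]

# The user clicks the "Schildt, Herbert" facet value.
drilled = add_drill_down(base, "author", "Schildt, Herbert")
query_string = urlencode(drilled)
print(query_string)
```

Calling `add_drill_down` again on the result adds a further `fq` parameter, which narrows the result set by one more facet value.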
  
  = Facet Operation =
  
- Currently SimpleFacets has 3 modes of operation, selected by a combination of 
SimpleFacetParameters, Response Handler parameters and [:SchemaXml: schema.xml] 
Field definitions:
+ Currently SimpleFacets has 3 modes of operation, selected by a combination of 
SimpleFacetParameters and [:SchemaXml: schema.xml] Field definitions:
  
  == FacetQueries ==
  
- Any number of [:SimpleFacetParameters#facet.query:facet.query] parameters can 
be passed to the request handler.  Each distinct facet.query will first be 
executed against the entire index, with the results cached as a hashed set (if 
fewer than hashDocSet) or a bit set (if greater) of document IDs (see 
[:SolrCaching#The hashDocSet Max Size:hashDocSet]).  Then, every time that 
facet.query is used for faceting a query, the cached set will be intersected 
against the set of document IDs returned by the query to count the number of 
documents for which the facet.query condition is true.
+ Any number of [:SimpleFacetParameters#facet.query:facet.query] parameters can 
be passed to the request handler.  The filter for each distinct facet.query is 
retrieved from the filterCache (or generated if not cached yet) and intersected 
with the filter for the base query to obtain the count.
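
The counting step described above can be modelled with plain sets. This is a toy sketch, not Solr's implementation: the document data, the predicate table `facet_queries` and the dictionary standing in for the filterCache are all assumptions made for illustration.

```python
# Toy corpus: document id -> fields (assumed data).
docs = {1: {"price": 5}, 2: {"price": 50}, 3: {"price": 500}, 4: {"price": 20}}

# Each facet.query string is paired with a predicate that stands in for
# real query parsing and execution (hypothetical).
facet_queries = {
    "price:[0 TO 10]": lambda d: 0 <= d["price"] <= 10,
    "price:[11 TO 100]": lambda d: 11 <= d["price"] <= 100,
}

# Stand-in for the filterCache: query string -> set of matching doc ids.
filter_cache = {}


def get_filter(query):
    """Return the cached filter for a query, generating it on first use."""
    if query not in filter_cache:
        matches = facet_queries[query]
        filter_cache[query] = frozenset(
            doc_id for doc_id, fields in docs.items() if matches(fields)
        )
    return filter_cache[query]


def facet_query_counts(base_filter, queries):
    """Count, per facet query, the documents shared with the base query."""
    return {q: len(get_filter(q) & base_filter) for q in queries}


# Assume the base query matched documents 1, 2 and 4.
base = {1, 2, 4}
counts = facet_query_counts(base, list(facet_queries))
print(counts)
```

The filter for a facet query is built against the whole index once and then reused for every base query, so only the intersection is repeated per request.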
    
  == FacetFields ==
  
  Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can 
be passed to the request handler.  For each facet.field, one of two approaches 
will be used based on the [:SimpleFacetParameters#facet.method:facet.method] or 
the field type:
  
+     * '''Enum Based Field Queries''':  If {{{facet.method=enum}}} or the 
field is defined in the schema as boolean, then Solr will iterate over all of 
the indexed terms for the field, and for each term it will get a filter from 
the filterCache and calculate the intersection with the filter for the base 
query.  This is excellent for fields where there is a small set of distinct 
values.  The average number of values per document does not matter.  For 
example, faceting on a field with U.S. States e.g. `Alabama, Alaska, ... 
Wyoming` would lead to fifty cached filters which would be used over and over 
again. The [:SolrCaching#filterCache:filterCache] should be large enough to 
hold all of the cached filters.
-     * '''Enum Based Field Queries''':  If {{{facet.method=enum}}} or the 
field is defined in the schema as boolean, then every indexed value for the 
field will be iterated and a facet query will be executed and cached (as 
described above).  This is excellent for fields where there is a small set of 
distinct values.  For example, faceting on a field with U.S. States e.g. 
`Alabama, Alaska, ... Wyoming` would lead to fifty cached queries which would 
be used over and over again. However, it requires excessive amounts of memory 
and time when the number of field values is large, and especially when it 
exceeds the filter cache size defined in [:SolrCaching#filterCache:filterCache]
-     
-     * '''Field Cache''': If {{{facet.method=fc}}} then a field-cache approach 
will be used.  This is currently implemented using either the the Lucene 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html
 FieldCache] or (starting in Solr 1.4) an !UnInvertedField if the field is 
multivalued or tokenized.  Every time that {{{facet.field}}} is used for 
faceting a query, all the document IDs resulting from the query are looked up 
in the cache and any value found has its tally incremented.  This is excellent 
for situations where the number of indexed values for the field is too large to 
be practical using the field queries mechanism, such as faceting against 
authors or titles.  However it is currently much slower and more 
memory-intensive than the field query mechanism for fields with a small number 
of values. 
  
+     * '''Field Cache''': If {{{facet.method=fc}}} then a field-cache approach 
will be used.  This is currently implemented using either the Lucene 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html
 FieldCache] or (starting in Solr 1.4) an !UnInvertedField if the field is 
multivalued or tokenized.  Each document is looked up in the cache to see what 
terms/values it contains, and a tally is incremented for each value.  This is 
excellent for situations where the number of indexed values for the field is 
high, but the number of values per document is low.  For multi-valued fields, a 
hybrid approach is used that uses term filters from the filterCache for terms 
that match many documents.
+
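
To contrast the two facet.field approaches described above, here is a simplified model in which both are run over the same toy data. The function names and data are assumptions for illustration only; Solr's real code works on Lucene index structures, and the hybrid handling of multi-valued fields is not modelled.

```python
from collections import Counter

# Toy field data: document id -> indexed terms for one field (assumed data).
docs = {
    1: ["Alabama"],
    2: ["Alaska"],
    3: ["Alabama", "Wyoming"],
    4: ["Wyoming"],
}


def facet_enum(docs, base):
    """Enum style: walk every term and intersect its filter with the base set."""
    # Build term -> set of doc ids; this stands in for per-term cached filters.
    term_filters = {}
    for doc_id, terms in docs.items():
        for term in terms:
            term_filters.setdefault(term, set()).add(doc_id)
    counts = {}
    for term, term_filter in term_filters.items():
        n = len(term_filter & base)
        if n:
            counts[term] = n
    return counts


def facet_fc(docs, base):
    """Field-cache style: walk every matching document and tally its terms."""
    counts = Counter()
    for doc_id in base:
        # set() guards against counting a repeated term twice for one document.
        counts.update(set(docs.get(doc_id, [])))
    return dict(counts)


# Assume the base query matched documents 1, 3 and 4.
base = {1, 3, 4}
print(facet_enum(docs, base))
print(facet_fc(docs, base))
```

The work in `facet_enum` grows with the number of distinct terms, while the work in `facet_fc` grows with the number of matching documents and the values per document, which mirrors the trade-off stated in the text.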
