Hi there,

I have a Solr index with 14+ million records.  We facet on quite a few 
high-cardinality fields such as author, person, organization, brand and 
document type.  Some of the records contain thousands of persons and 
organizations, so the person and organization fields can be very large.

First I built these fields as:

<field name="au_facet" type="text_semicolon_tokenized" indexed="true"
       stored="true" multiValued="false"/> <!-- author, facet -->

<!-- tokenized by semicolon ";" and only apply lowercase; this is useful
     for structured fields with clean data -->
<fieldType name="text_semicolon_tokenized" class="solr.TextField"
           positionIncrementGap="100" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The performance was atrocious when faceting was turned on: any query took 
10+ minutes to run.

Then I decided to break the values up myself and just build them into the field 
as a multi-valued field like this:
<field name="au_facet" type="text_untokenized" indexed="true" stored="true"
       multiValued="true"/> <!-- author, facet -->

<fieldType name="text_untokenized" class="solr.TextField"
           sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
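
In other words, the splitting now happens in our indexing code rather than 
in the analyzer, and each author arrives as its own value.  A corresponding 
update document (again with made-up id and author names) would look like:

<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- one <field> element per author; KeywordTokenizerFactory keeps
         each value as a single lowercased term -->
    <field name="au_facet">Smith, John</field>
    <field name="au_facet">Doe, Jane</field>
  </doc>
</add>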

After this change, the performance improved drastically.  But I can't 
understand why building these fields as a multi-valued field vs. a 
single-valued field with a semicolon tokenizer makes such a dramatic 
performance difference.  Doesn't Solr tokenize the field at index time and 
save the values as tokens anyway?  Why does manually breaking the values 
into tokens improve faceting performance so much?

Thanks!
Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.t...@ucsf.edu
