Hi there, I have a Solr index with 14+ million records. We facet on quite a few very high-cardinality fields such as author, person, organization, brand, and document type. Some records contain thousands of persons and organizations, so the person and organization fields can be very large.
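For reference, a typical facet request from our application looks something like this (simplified; au_facet is the real field name from our schema, while the core name and the other facet field names are just placeholders here):

  http://localhost:8983/solr/mycore/select?q=*:*
      &facet=true
      &facet.field=au_facet
      &facet.field=person_facet
      &facet.field=organization_facet
      &facet.limit=20
      &facet.mincount=1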
First I built these fields with a semicolon tokenizer as single-valued fields:

  <field name="au_facet" type="text_semicolon_tokenized" indexed="true" stored="true" multiValued="false"/> <!-- author, facet -->

  <!-- tokenized by semicolon (;), then lowercased only; useful for structured fields with clean data -->
  <fieldType name="text_semicolon_tokenized" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The performance was atrocious with faceting turned on: any query took 10+ minutes to run.

Then I decided to break the values up myself and index them into a multi-valued field like this:

  <field name="au_facet" type="text_untokenized" indexed="true" stored="true" multiValued="true"/> <!-- author, facet -->

  <fieldType name="text_untokenized" class="solr.TextField" sortMissingLast="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

After this change, performance improved drastically. But I can't understand why building these fields as a multi-valued field rather than as a single-valued field with a semicolon tokenizer makes such a dramatic performance difference. Doesn't Solr tokenize the field at index time and store the values as tokens anyway? Why does manually breaking the values into tokens improve faceting performance so much?

Thanks!

Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.t...@ucsf.edu