UnInvertedField vs FieldCache for facets for single-token text fields

2011-11-03 Thread Michael Ryan
I have some fields I facet on that are TextFields but have just a single token.
The fieldType looks like this:

<fieldType name="myStringFieldType" class="solr.TextField" indexed="true"
           stored="false" omitNorms="true" sortMissingLast="true"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

SimpleFacets uses an UnInvertedField for these fields because
multiValuedFieldCache() returns true for TextField. I tried changing the type
for these fields to the plain string type (StrField). The facets *seem* to be
generated much faster. Is it expected that FieldCache would be faster than
UnInvertedField for single-token strings like this?

My goal is to make the facet re-generation after a commit as fast as possible. I
would like to continue using TextField for these fields since I have a need for
filters like LowerCaseFilterFactory, which still produces a single token. Is it
safe to extend TextField and have multiValuedFieldCache() return false for these
fields, so that UnInvertedField is not used? Or is there a better way to
accomplish what I'm trying to do?

-Michael


Re: UnInvertedField vs FieldCache for facets for single-token text fields

2011-11-03 Thread Martijn v Groningen
Hi Michael,

The FieldCache is a simpler data structure and easier to create, so I
would also expect it to be faster. Unfortunately, for TextField an
UnInvertedField is always used, even if you have only one token per
document. I think overriding the multiValuedFieldCache method to return
false would work.
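
A minimal sketch of that override, for illustration only. In a real
deployment the class would extend org.apache.solr.schema.TextField; the
TextFieldStandIn class below is a hypothetical stand-in that models just
the one method discussed here, so that the sketch compiles without Solr
on the classpath. The name SingleTokenTextField is also made up.

```java
// Stand-in for org.apache.solr.schema.TextField (assumption: only the
// method discussed in this thread is modelled). As described above, the
// real method returns true for TextField.
class TextFieldStandIn {
    public boolean multiValuedFieldCache() {
        return true;
    }
}

// Hypothetical field type for fields whose analyzer always emits
// exactly one token per document (e.g. KeywordTokenizer + lowercase).
class SingleTokenTextField extends TextFieldStandIn {
    @Override
    public boolean multiValuedFieldCache() {
        // Report the field as single-valued so that the FieldCache path
        // is chosen instead of UnInvertedField.
        return false;
    }
}
```

The fieldType in schema.xml would then name the custom class in its
class attribute and keep the same analyzer. This presumably only holds
up as long as the analyzer really never produces more than one token per
document.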

If you're using 4.0-dev (trunk), I'd use facet.method=fcs (this
parameter is only usable if the multiValuedFieldCache method returns
false). This is per-segment faceting, and the cache only has to be
extended for new segments, so this approach is better for indexes with
frequent changes. I think this will be even faster in your case than
just using the FieldCache method, which operates on a top-level reader:
after each commit the complete cache is invalid and has to be recreated.
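
To make the difference concrete, here is a toy model of the two caching
strategies. It is not Solr or Lucene code: segments are plain strings,
"building" a cache entry just increments a counter, and every name in it
is hypothetical. It only counts how many segments have to be scanned
across a series of commits.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these names are Solr or Lucene APIs.
class FacetCacheModel {
    // Top-level cache: one entry for the whole index, thrown away on
    // every commit, so each commit rescans all current segments.
    static int topLevelBuilds(List<List<String>> segmentsPerCommit) {
        int segmentsScanned = 0;
        for (List<String> segments : segmentsPerCommit) {
            segmentsScanned += segments.size();
        }
        return segmentsScanned;
    }

    // Per-segment cache: entries are keyed by segment and survive
    // commits, so only segments not seen before are scanned.
    static int perSegmentBuilds(List<List<String>> segmentsPerCommit) {
        Map<String, Boolean> cache = new HashMap<>();
        int segmentsScanned = 0;
        for (List<String> segments : segmentsPerCommit) {
            for (String segment : segments) {
                if (cache.putIfAbsent(segment, Boolean.TRUE) == null) {
                    segmentsScanned++;
                }
            }
            // Drop entries for segments that no longer exist (merged away).
            cache.keySet().retainAll(segments);
        }
        return segmentsScanned;
    }
}
```

With three commits that each add one segment to an index that started
with two, the top-level model scans 2 + 3 + 4 = 9 segments, while the
per-segment model scans 2 + 1 + 1 = 4. Segment merges would change the
numbers, since a merged segment counts as new.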

Otherwise I'd try facet.method=enum, which is fast if you have few
distinct facet values (the number of docs doesn't influence the
performance that much). The facet.method=enum option is also valid for
normal TextFields, so there is no need for custom code.
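
Switching between the methods is only a matter of request parameters. A
small sketch of building the query string for such a request follows;
the helper class and the field name "source" used in the note below are
assumptions for illustration, and the host and handler path are left out
on purpose.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
class FacetQuery {
    // Builds the query string for a facet-only request on one field.
    // method is one of the facet.method values from this thread,
    // "enum" or "fcs".
    static String build(String field, String method) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        params.put("rows", "0");          // only facet counts are wanted
        params.put("facet", "true");
        params.put("facet.field", field);
        params.put("facet.method", method);

        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
        return sb.toString();
    }
}
```

Calling FacetQuery.build("source", "enum") and FacetQuery.build("source",
"fcs") against the same index after a commit would be one way to compare
how long the first facet request takes with each method.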

Martijn

On 3 November 2011 21:16, Michael Ryan mr...@moreover.com wrote:
> I have some fields I facet on that are TextFields but have just a single
> token. The fieldType looks like this:
>
> <fieldType name="myStringFieldType" class="solr.TextField" indexed="true"
>            stored="false" omitNorms="true" sortMissingLast="true"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> SimpleFacets uses an UnInvertedField for these fields because
> multiValuedFieldCache() returns true for TextField. I tried changing the
> type for these fields to the plain string type (StrField). The facets
> *seem* to be generated much faster. Is it expected that FieldCache would
> be faster than UnInvertedField for single-token strings like this?
>
> My goal is to make the facet re-generation after a commit as fast as
> possible. I would like to continue using TextField for these fields since
> I have a need for filters like LowerCaseFilterFactory, which still
> produces a single token. Is it safe to extend TextField and have
> multiValuedFieldCache() return false for these fields, so that
> UnInvertedField is not used? Or is there a better way to accomplish what
> I'm trying to do?
>
> -Michael




-- 
Kind regards,

Martijn van Groningen