[Solr Wiki] Update of "SchemaDesign" by YonikSeeley

Apache Wiki Thu, 05 Feb 2009 14:49:24 -0800

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/SchemaDesign

The comment on the change is:
sorting is not slower on a non-integer field, clarify stemming

------------------------------------------------------------------------------
  = Sorting =
  There are two ways of sorting available in Solr 1.4: Lucene's sorting feature 
and function queries.
  == Lucene Sorting ==
- The Solr sort parameter uses the Lucene sorting tool. This creates an array 
containing an entry for every document in the index using the !FieldCache. 
Sorting is then done against this array. This array is cached across requests 
using the !IndexReader and so repeated sorts are fast.  If the field type is 
'integer' the array contains only that value and thus is 4 bytes * the number 
of documents. If the field type is anything else, this integer array is created 
and then a separate array is also created with that field's data per entry. 
Sorting is also slower if the type is not an 'integer'.
+ The Solr sort parameter uses the Lucene sorting tool. This creates an array 
containing an entry for every document in the index using the !FieldCache. 
Sorting is then done against this array. This array is cached across requests 
using the !IndexReader and so repeated sorts are fast.  If the field type is 
'integer' the array contains only that value and thus is 4 bytes * the number 
of documents. If the field type is anything else, this integer array is created 
and then a separate array is also created with that field's data per entry.
- However, range checks do not work on an 'integer' field. If you want range 
checks and fast sorting, you can create a pair of fields, one of each type, 
with a copyField directive:
- {{{
-  <field name="popularity" type="sint" indexed="true" stored="true" 
multiValued="false"/>
-  <field name="popularitySort" type="integer" indexed="true" stored="false" />
-  ...
-  <copyField source="popularity" dest="popularitySort"/>
- }}}
- Note that since multiValued=false is the default for these types, attempting 
to store a value to 'popularitySort' will cause an indexing error, since it 
also always receives a value from 'popularity'. Also there is no reason to 
store both fields, and so 'popularitySort' is index-only.
  
  === A Note on "sortable" FieldTypes ===
  Sortable !FieldTypes like sint and sdouble are a bit of a misnomer.  They are 
not needed for sorting in the sense described above, but are needed for 
!RangeQuery queries.  "Sortable" here refers to encoding the number so that it 
sorts correctly lexicographically as a String.  Without this encoding, the 
numbers 1..10 sort lexicographically as 1, 10, 2, 3...  Using an sint remedies 
this.  If you don't need !RangeQuery queries and only need to sort on the 
field, just use an int or double or the equivalent appropriate class.  You 
will save yourself time and memory.
@@ -28, +20 @@

  }}}
  There may be performance differences with this technique vs. the Lucene 
sorting algorithm.
  = Multiple Text Search Field types =
- The "text" field type in the example schema.xml provides basic text search 
for English text. But, it has a surprise: the actual text given to this field 
is not indexed as-is, and therefore searching for the raw text may not work. If 
you store "To Be Or Not To Be" in a "text" field, none of these words will 
find this document, nor will the phrase in quotes. The above words are all 
''stopwords'' and are stripped from the input text. Another transform is 
''stemming'', which stores both 'change' and 'changing' as the word 'chang'.
+ The "text" field type in the example schema.xml provides basic text search 
for English text. But, it has a surprise: the actual text given to this field 
is not indexed as-is, and therefore searching for the raw text may not work. If 
you store "To Be Or Not To Be" in a "text" field, none of these words will 
find this document, nor will the phrase in quotes. The above words are all 
''stopwords'' and are stripped from the input text. Another transform is 
''stemming'', which stores both 'change' and 'changing' as the word 'chang'.  
Stemming is done at both index and query time, so a query of 'changing' will 
match a document containing 'change'.
  == Phrase search ==
  If you want to have any phrase search work as well as individual words, you 
need to have two fields. Both should be processed similarly, but the phrase 
search field should not use stemming or stopwords. 
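  The two-field setup described above might be sketched in schema.xml roughly 
as follows. This is a minimal sketch, not the example schema itself: the type 
name 'textExact' and the field names 'body' and 'bodyExact' are hypothetical, 
and it assumes 'body' uses the stock "text" type (with stopwords and stemming). 
The 'textExact' type keeps the same tokenizer and lowercasing but omits the 
stopword and stemming filters, so a quoted query such as "To Be Or Not To Be" 
can match against 'bodyExact'.
  {{{
   <!-- hypothetical type: same tokenizing as "text", but no stopword or stemming filters -->
   <fieldType name="textExact" class="solr.TextField" positionIncrementGap="100">
     <!-- a single analyzer with no type attribute is used at both index and query time -->
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>
   ...
   <!-- hypothetical field names; 'body' is stemmed, 'bodyExact' is for phrase search -->
   <field name="body" type="text" indexed="true" stored="true"/>
   <field name="bodyExact" type="textExact" indexed="true" stored="false"/>
   ...
   <copyField source="body" dest="bodyExact"/>
  }}}
  With a layout like this, individual-word queries would go to 'body' and 
quoted phrase queries to 'bodyExact'. Only 'body' is stored, since both fields 
receive the same input text.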
  == Phonemes ==
