[Solr Wiki] Update of "SchemaDesign" by Lance Norskog

Apache Wiki Mon, 02 Feb 2009 22:21:23 -0800

The following page has been changed by Lance Norskog:
http://wiki.apache.org/solr/SchemaDesign

------------------------------------------------------------------------------
  General tips & tricks in designing schemas.
- 
- '''Mapping databases to Solr'''[[BR]]
+ = Mapping databases to Solr =
  Solr provides one table. Storing a set of database tables in an index generally 
requires denormalizing some of the tables. Attempts to avoid denormalizing 
usually fail.
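As a minimal sketch of denormalizing, assume a hypothetical database with a 'book' table and an 'author' table in a many-to-many relation. One flattened Solr document per book could carry the author names in a multi-valued field. All field names below are illustrative, and the 'string' and 'text' types are assumed to be the ones from the example schema.xml:
{{{
<!-- Hypothetical flattened schema: one Solr document per book row -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<!-- Author rows are folded into the book document rather than joined at query time -->
<field name="author_name" type="string" indexed="true" stored="true" multiValued="true"/>
}}}
The trade-off in this sketch is that an author's name is repeated in every book document, so a change to the author table means reindexing those books.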
- 
- '''Sorting'''[[BR]]
+ = Field contents =
+ The more heterogeneous the data in one field or in one index, the less useful 
it is. For example, if you have text in different languages, it is more useful 
to store them in different fields (text_en, text_fr, etc.) than all in one 
field. When you search against that one field, English and French words and 
phrases will be searched with equal interest.
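A per-language layout might look like the sketch below. The field names text_en and text_fr come from the paragraph above; the type name 'text_fr_type' and its filter stack are assumptions made for illustration, and 'text' is assumed to be the English type from the example schema.xml:
{{{
<!-- Hypothetical French type: French Snowball stemming instead of the English stack -->
<fieldType name="text_fr_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
<!-- One field per language, each analyzed with its own stack -->
<field name="text_en" type="text" indexed="true" stored="true"/>
<field name="text_fr" type="text_fr_type" indexed="true" stored="true"/>
}}}
A query can then target text_en or text_fr explicitly, so each language is searched with its own analysis.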
+ = Sorting =
  There are two ways of sorting available in Solr 1.4: Lucene's sorting feature 
and function queries.
+ == Lucene Sorting ==
- 
- '''Lucene Sorting'''[[BR]]
  The Solr sort parameter uses the Lucene sorting tool. This creates an array 
containing an entry for every document in the index. Sorting is then done 
against this array. This array is cached across requests and so repeated sorts 
are fast.  If the field type is 'integer' the array contains only that value 
and thus is 4 bytes * the number of documents. If the field type is anything 
else, this integer array is created and then a separate array is also created 
with that field's data per entry. Sorting is also slower if the type is not an 
'integer'.
- 
  However, range checks do not work on an 'integer' field. If you want range 
checks and fast sorting, you can create a pair of fields, one of each type, 
with a copyField directive:
  {{{
   <field name="popularity" type="sint" indexed="true" stored="true" multiValued="false"/>
@@ -18, +16 @@

   ...
   <copyField source="popularity" dest="popularitySort"/>
  }}}
- 
  Note that since multiValued=false is the default for these types, attempting 
to store a value to 'popularitySort' will cause an indexing error, since it 
also always receives a value from 'popularity'. Also there is no reason to 
store both fields, and so 'popularitySort' is index-only.
+ == Function Query Sorting ==
- 
- '''Function Query Sorting'''[[BR]]
  Add this clause to your query string to sort the results using 
'myIndexedField'. Do not use the 'sort=field+asc' parameter. See 
[FunctionQuery] for more.
  {{{
  _val_:"ord(myIndexedField)"
  }}}
  There may be performance differences between this technique and the Lucene 
sorting algorithm.
+ = Multiple Text Search Field types =
- 
- '''Alternative Text Search Field types'''[[BR]]
- The "text" field type in the example schema.xml provides basic text search 
for English text. But, it has a surprise: the actual text given to this field 
is not indexed as-is, and therefore searching for the raw text may not work. If 
you store "To Be Or Not To Be" in a "text" field, none of these words will find 
this document, nor will the phrase in quotes.
+ The "text" field type in the example schema.xml provides basic text search 
for English text. But it has a surprise: the actual text given to this field 
is not indexed as-is, and therefore searching for the raw text may not work. If 
you store "To Be Or Not To Be" in a "text" field, none of these words will 
find this document, nor will the phrase in quotes. The above words are all 
''stopwords'' and are stripped from the input text. Another transform is 
''stemming'', which stores both 'change' and 'changing' as 'chang'.
+ == Phrase search ==
- 
- '''Phrase search'''[[BR]]
- If you want to have any phrase search work as well as individual words, you 
need to have two fields. Both should be processed similarly, but the phrase 
search field should not use "stemming" or "stopwords". Usually you can populate 
this field using the <copyField> directive.
+ If you want to have any phrase search work as well as individual words, you 
need to have two fields. Both should be processed similarly, but the phrase 
search field should not use stemming or stopwords.
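A minimal sketch of the two-field setup follows. The names 'body', 'body_phrase' and 'text_exact' are hypothetical, and populating the second field with copyField is one option, not a requirement:
{{{
<!-- Hypothetical type: tokenized and lowercased, with no stopword or stemming filters -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- Stemmed, stopword-filtered field for individual-word search -->
<field name="body" type="text" indexed="true" stored="true"/>
<!-- Unstemmed copy for phrase search; index-only, since 'body' already stores the text -->
<field name="body_phrase" type="text_exact" indexed="true" stored="false"/>
<copyField source="body" dest="body_phrase"/>
}}}
With this layout a quoted phrase would be searched against body_phrase and single words against body.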
+ == Phonemes ==
- 
- '''Phonemes'''[[BR]]
- Programmers are perfect spellers and expect the same of their users. A 
phoneme represents (roughly) the sound of one syllable. Phoneme-based searching 
can give users a better search experience. To support misspelled search words 
Phoneme filters cause the index to store phoneme-base representations of the 
text instead of the input. 
+ Programmers are perfect spellers and expect the same of their users. A 
''phoneme'' represents (roughly) the sound of one syllable. Phoneme-based 
searching can give users a better search experience. To support misspelled 
search words, phoneme filters cause the index to store phoneme-based 
representations of the text instead of the input. This only finds misspellings 
which sound like the original word.
  
  To create a phoneme-based field, you need a text filter stack that does not 
include stemming or stopwords, and add the solr.PhoneticFilterFactory (see 
[AnalyzersTokenizersTokenFilters]) with one of the available encoders. This 
must be in both the indexing and query stack. Of the several available 
encoders, "Double Metaphone" is the most popular and does well with 
non-English text. There are as yet no language-specific phoneme encoders.
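Such a field type could be sketched as follows. The type name 'text_phonetic' and the choice of tokenizer are assumptions; the filter class and encoder are the ones named above:
{{{
<!-- Hypothetical type name; a single <analyzer> applies to both indexing and querying -->
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="false" replaces each token with its phonetic code instead of adding the code next to it -->
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>
}}}
Because the same analyzer runs at query time, a misspelled query word is reduced to its phonetic code before it is matched against the index.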
  
  For another take on assisting spelling, see [SpellCheckComponent].
+ == Unicode processing ==
+ Searching text in different languages is very difficult. The Latin1Accent 
filters map European accented "special characters" to their US-ASCII 
equivalents: the French spelling ''protégé'' becomes the English spelling 
''protege''. 
+ In Solr-1.3, use this in the filter stack of your "text" field type:
+ {{{
+ <filter class="solr.ISOLatin1AccentFilterFactory" />
+ }}}
+ In Solr-1.4, use this:
+ {{{
+ <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
+ }}}
