Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The following page has been changed by Lance Norskog:
http://wiki.apache.org/solr/LargeIndexes

------------------------------------------------------------------------------
- An index can be large in several dimensions: number of entries, number of 
fields in most records, number of total unique fields, size of fields, total 
number of terms in a term across all records.
+ An index can be large in several dimensions: number of entries, sizes of 
fields, number of fields in most records, number of total unique fields, total 
number of terms in a field across all records.
  
  '''Number of entries'''[[BR]]
- An index can have hundreds of millions of small records. For example, Flickr 
has billions of records, but there are not many data fields per entry and the 
caption and description fields tend to be very short.
+  An index can have hundreds of millions of small records. For example, Flickr 
has billions of records, but there are not many data fields per entry and the 
caption and description fields tend to be very short. 
- The "electronics store" example has thousands of unique field names for the 
merchandise metadata, but each record only has a few metadata fields. (A 
digital camera has megapixels etc. but does not have printer color packs.)
- Some legal text corpuses have book-length contracts mixed with 1-paragraph 
memos.
  
  '''Large individual fields.'''[[BR]]
- It is possible to store megabytes of text in one record. These fields are 
clumsy to work with. By default the number of characters stored is clipped. 
There are some strategies available.
+  It is possible to store megabytes of text in one record. These fields are 
clumsy to work with, and by default the number of characters indexed per field 
is clipped (the maxFieldLength setting in solrconfig.xml). Several strategies 
are available:
   * Index-only: store the text in a file or database and only index the field 
in Solr
    * Highlighting is not available
   * Break the documents into pages and index each page. 
    * The pages will be ranked individually. There is no feature to group the 
rankings of pages found against a document and create a score per document.
+  For example, some legal text corpora have book-length contracts mixed with 
one-paragraph memos.
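A minimal sketch of the page-splitting strategy above, as a plain-Python pre-processing step (the field names, the `id` suffix scheme, and the page size are illustrative assumptions, not anything Solr prescribes):

```python
def split_into_pages(doc_id, text, page_chars=5000):
    """Break one large document into page-sized Solr documents.

    Each page gets a derived unique key (doc_id plus page number) and
    keeps a back-reference to the parent document, so hits can be
    regrouped client-side.  All field names here are illustrative.
    """
    pages = []
    for n, start in enumerate(range(0, len(text), page_chars)):
        pages.append({
            "id": "%s-page-%d" % (doc_id, n),
            "parent_id": doc_id,       # used to regroup hits per document
            "page_number": n,
            "body": text[start:start + page_chars],
        })
    return pages

pages = split_into_pages("contract-42", "x" * 12000, page_chars=5000)
```

Each page is then posted to Solr as an ordinary document; as noted above, Solr scores the pages independently, so any per-document score (e.g. max page score per parent_id) has to be computed by the client.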
  
  '''Large number of fields per record.'''[[BR]]
- Some indexes can have hundreds of fields in every record. Finance 
transactions can be very complex, and an index could contain hundreds of facts 
per transaction.
+  Some indexes can have hundreds of fields in every record. Financial 
transactions can be very complex, and an index could contain hundreds of 
fields per transaction.
  
  '''Large number of unique field names.'''[[BR]]
- In the "electronics store" example used in the Solr configuration file 
examples, each product type can have 5-10 unique metadata items. But since the 
entire store can have hundreds of different production, the total metadata name 
space can be in the thousands. The wildcard field feature is the right way to 
handle this. ''(Editor: is there a way to intake these records with the 
DataInputHandler?)''
+  In the "electronics store" example used in the Solr configuration file 
examples, each product type may have 5-10 unique metadata items (megapixels or 
printer paper). But since the entire store can carry hundreds of different 
products, the total metadata name space can run into the thousands. (The 
dynamic-field "wildcard" feature is the right way to handle this.)
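One hedged sketch of how a sparse metadata name space maps onto dynamic ("wildcard") fields: given a schema.xml declaration matching a pattern such as attr_*, the indexer can derive field names at load time instead of declaring thousands of fields. The attr_ prefix and the normalization rule below are assumptions for illustration:

```python
def to_dynamic_fields(product_id, metadata):
    """Map arbitrary per-product metadata names onto a single assumed
    dynamic-field pattern (attr_*), so the schema does not need an
    explicit declaration for every metadata name in the store."""
    doc = {"id": product_id}
    for name, value in metadata.items():
        # Normalize the free-form metadata name into a legal field name.
        key = "attr_" + name.strip().lower().replace(" ", "_")
        doc[key] = value
    return doc

doc = to_dynamic_fields("cam-1", {"Megapixels": "10", "Optical Zoom": "3x"})
```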
  
  '''Total number of unique terms across all records'''[[BR]]
- Memory use for a facet query uses a counter for every unique term in the 
index, for every field used. A facet query on a boolean field (or strings 
"true" and "false") will use almost no RAM, while a facet query on a field with 
billions of total terms (or a set of wildcard fields) '''will''' cause the 
dreaded OutOfMemory exception.
+  A facet query allocates a counter for every unique term in the index, for 
every field faceted on. A facet query on a boolean field (or the strings 
"true" and "false") will use almost no RAM, while a facet query on a field 
with millions of unique terms may run out of memory.
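Back-of-the-envelope arithmetic for that counter cost (a sketch only: it assumes one 4-byte integer per unique term and ignores per-field and data-structure overhead):

```python
def facet_counter_bytes(unique_terms_per_field, counter_bytes=4):
    """Rough lower bound on RAM for one facet query: one integer
    counter per unique term, summed over every faceted field."""
    return sum(n * counter_bytes for n in unique_terms_per_field)

# A boolean field: 2 counters, negligible.
small = facet_counter_bytes([2])
# A field with 100 million unique terms: ~400 MB for counters alone.
large = facet_counter_bytes([100_000_000])
```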
  
  '''Tips and tricks'''
  
- '''Batch jobs you should avoid'''
+ '''Hash Key Identity'''
+  Use a cryptographic hash value as the schema's primary key. See [Identity] 
about wildcard searches.
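A minimal sketch of the hash-key idea using Python's standard hashlib (the choice of SHA-1 and of hashing the record's source URL are assumptions for illustration):

```python
import hashlib

def primary_key(source_url):
    """Derive a fixed-width, collision-resistant unique key from the
    record's natural identifier (assumed here: its source URL)."""
    return hashlib.sha1(source_url.encode("utf-8")).hexdigest()

key = primary_key("http://example.com/doc/1")
```

A fixed-width hexadecimal key is deterministic, so re-indexing the same source record overwrites rather than duplicates it.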
+ 
+ '''Solr-stoppers'''
-  SpellCheckComponent and LukeRequestHandler walk the entire term database 
instead of vectoring terms through a search. This can make a large index 
unavailable for hours.
+  SpellCheckComponent and LukeRequestHandler walk the entire term dictionary 
instead of restricting themselves to the terms matched by a query. This can 
make a large index unavailable for hours.
   
  In solrconfig.xml, change the dismax parameter "q.alt" to something besides 
'*:*'. Hitting this query by accident can also make a large index unavailable 
for minutes.
  
  '''Sorting'''
  
- With many small records, sorting on a field is problematic. Memory use varies 
for different tasks: searching words in the text fields needs X amount of RAM. 
Sorting on a field takes Y > X RAM, and faceting on a field with many values 
takes Z > Y > X RAM. In an index with many short records, the user could search 
but not sort, and sort but not facet.
+  With many records, sorting on a field is problematic. Memory use varies for 
different tasks: searching words in the text fields needs X amount of RAM. 
Sorting on a field takes Y > X RAM, and faceting on a field with many values 
takes Z > Y > X RAM. In an index with many short records, the user could search 
but not sort, and sort but not facet. Using _val_:ord(''field'') as a search 
term will sort the results without incurring the memory cost.
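The _val_:ord(field) trick above amounts to ordinary query construction; a sketch, where the field name price, the user query, and the request-handler path are placeholders:

```python
from urllib.parse import urlencode

def rank_by_field_query(user_query, sort_field):
    """Append a _val_:ord(field) clause so the field's ordinal
    contributes to ranking, in place of an explicit sort parameter."""
    q = "%s _val_:ord(%s)" % (user_query, sort_field)
    return "select?" + urlencode({"q": q})

url = rank_by_field_query("ipod", "price")
```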
  
- '''DistributedSearch: Faceting'''
+ '''DistributedSearch'''
+  '''Faceting'''[[BR]]
- DistributedSearch merges facet results (see the page for limitations). 
(Please comment on implications for huge facet result sets. It seems like 
memory usage in merging by count and merging by name have an equal upper bound, 
but the average case will require very little memory for merging by name.)
+  DistributedSearch merges facet results (see the page for limitations). 
(Please comment on implications for huge facet result sets. It seems like 
memory usage in merging by count and merging by name have an equal upper bound, 
but the average case will require very little memory for merging by name.)
  
- '''DistributedSearch: Horizontal v.s. Vertical Partition'''[[BR]]
+  '''Horizontal vs. Vertical Partition'''[[BR]]
- In database jargon a "horizontal partition" splits record sets into multiple 
stores, while a ''vertical partition'' splits each row into multiple pieces and 
stores each piece in a separate database, cross-connected by the primary key 
for the record. DistributedSearch is a ''horizontal partition''. There is no 
implementation of a vertical partition across indexes.
+  In database jargon a "horizontal partition" splits record sets into multiple 
stores, while a ''vertical partition'' splits each row into multiple pieces and 
stores each piece in a separate database, cross-connected by the primary key 
for the record. DistributedSearch provides a ''horizontal partition''. There is 
no implementation of a vertical partition across indexes.
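A hedged sketch of what the horizontal partition means in practice: each record is routed whole to one of several shards, for example by hashing its primary key. The modulo routing rule below is an assumption for illustration; DistributedSearch itself does not mandate how documents are assigned to shards:

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Horizontal partition: every field of a record lands on the same
    shard, chosen deterministically from the record's primary key."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route 100 hypothetical documents across 4 shards.
assignments = [shard_for("doc-%d" % i, 4) for i in range(100)]
```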
  
