
The following page has been changed by Lance Norskog:
http://wiki.apache.org/solr/LargeIndexes

New page:
Index input data can be large along four dimensions: the number of records, the
number of fields per record, the total number of unique field names, and the
size of individual fields.

'''Number of entries'''[[BR]]
An index can have hundreds of millions of small records. For example, Flickr
has billions of records, but there are not many data fields per entry, and the
caption and description fields tend to be very short.
The "electronics store" example has thousands of unique field names for the
merchandise metadata, but each record only has a few of them. (A digital camera
has megapixels etc. but does not have printer color packs.)
Some legal text corpora mix book-length contracts with one-paragraph memos.

With many small records, sorting on a field is problematic. Memory use varies
by task: searching for words in the text fields needs some amount of RAM, call
it X. Sorting on a field takes Y > X RAM, and faceting on a field with many
values takes Z > Y > X RAM. In an index with many short records, a user might
be able to search but not sort, or to sort but not facet.
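For concreteness, here is a sketch of the three query shapes discussed above. The field names (`price`, `manufacturer`) are hypothetical, and the memory notes reflect how Lucene's FieldCache behaves for sorting and simple faceting:

```
q=camera                                      search only: term lookup, lowest RAM
q=camera&sort=price asc                       sort: all values of price are cached in RAM
q=camera&facet=true&facet.field=manufacturer  facet: all values of manufacturer are counted
```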

'''Large individual fields'''[[BR]]
It is possible to store megabytes of text in one record. These fields are
clumsy to work with: by default only the first part of a very long field is
indexed (the maxFieldLength limit), so the rest is silently dropped at index
time. There are some strategies available.
 * Index-only: keep the full text in a file or database, and index (but do not 
store) the field in Solr
  * Highlighting is not available without the stored text
 * Break the document into pages and index each page as its own record.
  * The pages are ranked individually. There is no feature to group the 
rankings of the pages of one document and produce a single score per document.
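The truncation limit mentioned above is set in solrconfig.xml. A minimal sketch, assuming a Solr 1.x configuration where the indexDefaults section carries this setting:

```xml
<!-- solrconfig.xml: raise the per-field token limit for very large fields -->
<indexDefaults>
  <!-- default is 10000 tokens; longer fields are silently truncated -->
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>
```

Raising the limit trades index size and indexing time for completeness, so it is usually paired with one of the strategies above rather than used alone.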

'''Large number of unique field names'''[[BR]]
Some indexes have hundreds of fields in every record. Financial transactions 
can be very complex, and an index could contain hundreds of facts per 
transaction.

In the "electronics store" example used in the Solr configuration file 
examples, each product type can have 5-10 unique metadata items. But since the 
entire store can carry hundreds of different products, the total metadata name 
space can run into the thousands. The dynamic (wildcard) field feature is the 
right way to handle this. ''(Editor: is there a way to ingest these records 
with the DataImportHandler?)''
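A minimal sketch of the dynamic field approach in schema.xml; the `*_s` suffix convention is an assumption for illustration, not something Solr requires:

```xml
<!-- schema.xml: any field whose name ends in _s is accepted without being
     declared individually -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<!-- e.g. a camera document can carry megapixels_s, a printer toner_type_s -->
```

This keeps the schema small while letting each product type contribute its own metadata names.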

'''Distributed Search feature: horizontal vs. vertical partition'''[[BR]]
Distributed Search in database terms is a "horizontal partition": the records 
are split (uniquely) into multiple sets, and a Distributed Search query does a 
sort-merge across the sets. Distributed Search at present has some 
limitations. It does not compute a "global IDF", so scores are not strictly 
comparable across indexes; this has not been a serious problem in most 
deployments. It merges plain facet counts correctly but does not handle facet 
ranges correctly. Successive queries receive a consistent data set only if 
none of the indexes changes.
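A sketch of a sharded query using the standard `shards` request parameter; the host names are hypothetical:

```
http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=camera&rows=10
```

The node receiving the request queries every shard listed, then sort-merges the per-shard results into one response.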

'''No "vertical partition"'''[[BR]]
In database lingo a "vertical partition" splits each row into multiple pieces 
and stores each piece in a separate database, cross-connected by the primary 
key of the record. Solr does not support reassembling records this way, except 
for the ''javadoc float thing''.
