Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by Lance Norskog: http://wiki.apache.org/solr/LargeIndexes

An index can be large in several dimensions: number of entries, number of fields in most records, number of total unique fields, size of fields, and total number of unique terms across all records.

'''Number of entries'''[[BR]]
An index can have hundreds of millions of small records. For example, Flickr has billions of records, but there are not many data fields per entry, and the caption and description fields tend to be very short. The "electronics store" example has thousands of unique field names for the merchandise metadata, but each record has only a few metadata fields. (A digital camera has megapixels etc. but does not have printer color packs.) Some legal text corpora have book-length contracts mixed with one-paragraph memos.

'''Large individual fields'''[[BR]]
It is possible to store megabytes of text in one record. These fields are clumsy to work with, and by default the number of characters stored is clipped. Some strategies are available:
 * Break the document into pages and index each page.
 * The pages will be ranked individually; there is no feature to group the rankings of pages found against a document and create a score per document.
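The page-splitting strategy above can be sketched in a few lines of Python. This is an illustration only, not Solr API code; the field names (`id`, `doc_id`, `page`, `text`) and the 5000-character page size are assumptions, not a required schema.

```python
def paginate(doc_id, text, page_chars=5000):
    """Split one long document into page-sized records for indexing.

    Each page becomes its own Solr document, so no single stored field
    grows past the configured character limit. The field names and the
    page size here are illustrative, not a required schema.
    """
    pages = [text[i:i + page_chars] for i in range(0, len(text), page_chars)]
    return [
        {"id": f"{doc_id}-p{n}", "doc_id": doc_id, "page": n, "text": chunk}
        for n, chunk in enumerate(pages, start=1)
    ]

# A 12,000-character contract becomes three page-documents.
docs = paginate("contract-42", "x" * 12000)
```

Note that because each page is its own record, relevance scores come back per page, not per source document, as described above.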
'''Large number of fields per record'''[[BR]]
Some indexes can have hundreds of fields in every record. Finance transactions can be very complex, and an index could contain hundreds of facts per transaction.

'''Large number of unique field names'''[[BR]]
In the "electronics store" example used in the Solr configuration file examples, each product type can have 5-10 unique metadata items. But since the entire store can carry hundreds of different products, the total metadata name space can run into the thousands. The wildcard (dynamic) field feature is the right way to handle this. ''(Editor: is there a way to intake these records with the DataImportHandler?)''

'''Total number of unique terms across all records'''[[BR]]
A facet query keeps a counter for every unique term in the index, for every field used. A facet query on a boolean field (or the strings "true" and "false") will use almost no RAM, while a facet query on a field with billions of total terms (or a set of wildcard fields) '''will''' cause the dreaded OutOfMemory exception.

'''Tips and tricks'''
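The wildcard-field approach described above can be sketched as a small naming helper. This is a hedged illustration: the `_s` suffix follows the dynamic-field convention from the example schema, but the helper function and the metadata keys are invented for this sketch.

```python
def to_dynamic_fields(metadata, suffix="_s"):
    """Map free-form metadata names onto a Solr dynamic-field pattern.

    With a rule such as <dynamicField name="*_s" type="string" ... />
    in the schema, any field whose name ends in _s is accepted without
    being declared individually, so a store-wide metadata namespace of
    thousands of names needs no schema changes. The suffix and the
    metadata keys here are illustrative.
    """
    return {name.lower().replace(" ", "_") + suffix: value
            for name, value in metadata.items()}

# A camera's metadata becomes schema-free dynamic fields.
camera = to_dynamic_fields({"Megapixels": "12", "Optical Zoom": "5x"})
```

Each product type contributes only its own handful of fields per record, while the dynamic-field rule absorbs the thousands of distinct names across the whole store.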
'''Batch jobs you should avoid'''[[BR]]
SpellCheckComponent and LukeRequestHandler walk the entire term database instead of vectoring terms through a search. This can make a large index unavailable for hours.

'''Sorting'''[[BR]]
With many small records, sorting on a field is problematic. Memory use varies for different tasks: searching words in the text fields needs X amount of RAM, sorting on a field takes Y > X RAM, and faceting on a field with many values takes Z > Y > X RAM. In an index with many short records, the user might be able to search but not sort, or sort but not facet.

'''DistributedSearch: Faceting'''[[BR]]
DistributedSearch merges facet results (see the DistributedSearch page for its limitations). ''(Editor: please comment on the implications for huge facet result sets. It seems that memory use for merging by count and merging by name has an equal upper bound, but the average case will require very little memory for merging by name.)''

'''DistributedSearch: Horizontal vs. Vertical Partition'''[[BR]]
In database jargon, a "horizontal partition" splits a set of records across multiple stores, while a ''vertical partition'' splits each row into multiple pieces and stores each piece in a separate database, cross-connected by the primary key of the record. DistributedSearch is a ''horizontal partition''; there is no implementation of a vertical partition across indexes. Solr does not support reassembling records split this way, except for the ''javadoc float thing''.
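The by-name facet merge discussed above can be modeled in a few lines. This is a toy model of what a distributed-search coordinator must do, not Solr's actual implementation: the coordinator only ever holds the terms each shard actually returned, which is why the average memory cost of merging by name stays far below the worst-case bound.

```python
from collections import Counter

def merge_facets(shard_counts):
    """Merge per-shard facet counts by term name.

    Each element of shard_counts is one shard's facet response,
    a mapping of term -> count. Counter.update sums counts for
    terms that appear on more than one shard. Toy model only.
    """
    merged = Counter()
    for counts in shard_counts:
        merged.update(counts)
    return dict(merged)

# Two shards report overlapping facet terms; counts are summed by name.
totals = merge_facets([{"canon": 3, "nikon": 1}, {"canon": 2, "sony": 5}])
```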
