I am looking for advice on how to vertically partition an index (split each
document's fields across more than one core/instance).

Some background:
 - Our system stores all document metadata in database tables
 - The contents of each document are stored on a filesystem
 - Metadata changes frequently, and the index must be updated to match (e.g.
a delay of minutes, not hours)
 - Contents change infrequently, and are expensive to reindex (large
files, complex analyzers)

Having the contents stored in the same index as the metadata means that it
will be frequently & needlessly reanalyzed. This causes a lot of wasted
cycles as there may be a large number of documents that have a single field
changed, but the system ends up re-analyzing the gigabytes of text contents
for these documents.

One suggested solution was to store the contents field, and copy the field
(rather than re-analyze) each time a document is reindexed. However, this
would cause a lot of wasted storage, as we have terabytes of documents.

We are currently looking at a vertical partitioning scheme that uses multiple
Solr cores. One core contains the schema for all the metadata, the other
core has the schema for the contents. We have successfully made a custom
request handler that pushes documents to both cores, effectively producing
the split indexes.
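
To make the split concrete, here is a rough Python sketch of the idea (our
actual handler is a custom Solr request handler, so this is only an
illustration). The function name split_document, the field name "contents"
and the key field "id" are hypothetical placeholders, not our real schema.
The important point is that both halves carry the same unique key, so hits
from the two cores can be matched up later.

```python
def split_document(doc, content_fields=("contents",), key_field="id"):
    """Split one logical document into (metadata_doc, contents_doc).

    doc is a plain dict of field name -> value. The field names used as
    defaults here are assumptions made for the sake of the example.
    """
    if key_field not in doc:
        raise KeyError("document has no unique key field: %r" % key_field)

    content_names = set(content_fields)

    # Everything that is not a contents field goes to the metadata core.
    metadata_doc = {k: v for k, v in doc.items() if k not in content_names}

    # Only the expensive-to-analyze fields go to the contents core.
    contents_doc = {k: v for k, v in doc.items() if k in content_names}

    # The shared key is what lets the two result sets be joined afterwards.
    contents_doc[key_field] = doc[key_field]

    return metadata_doc, contents_doc
```

With a split like this, a metadata-only change means resending just
metadata_doc to the metadata core, leaving the contents core untouched.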

The problem now is how to split the queries across both cores. Given that
there could be AND/OR/NOT clauses containing both metadata & contents
fields, we'll need to find some way to divide a query into two
parts, one to run on each core, and have the hits joined back together
afterwards. This is similar to the sharding feature, but requires
intersection as well as union of result hits.

Does anyone have any advice on how to go about dividing up the different
query clauses, and how we could merge results? Or can anyone suggest a
different approach to vertical partitioning?

thanks
-Mark



-- 
View this message in context: 
http://www.nabble.com/Vertical-Partitioning-advice-tp21906668p21906668.html
Sent from the Solr - User mailing list archive at Nabble.com.
