This may seem like a strange question, but here goes anyway.

I'm considering the possibility of constructing indices at a low level for
about 20,000 indexed fields (type sInt), if at all possible. (By "indices" in
this context I mean the inverted indices from term to document id, just to be
100% clear.)
These indices have to be recreated each night, along with the normal
reindex. 

Globally, it should go something like this each night:

- Documents (consisting of about 20 stored fields and about 10 stored &
indexed fields) are indexed through the normal code path (SolrJ in my case).
- After all docs are persisted (max 200,000) I want to extract the mapping
from Lucene docid --> stored/indexed product key. I believe this should work
because, once all docs are persisted, the internal docids aren't altered, so
the relationship between Lucene docid and product key is invariant from that
point forward (please correct me if I'm wrong). A sketch of this extraction
step follows this list.
- Construct the 20,000 inverted indices at a low enough level that I do not
have to go through IndexWriter, if possible; i.e. I do not need to construct
Documents, I only need to construct the native format of the indices
themselves. Ideally this should work on multiple servers, so that the indices
can be created in parallel and the resulting index files simply copied to the
index directory of the master.
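
Roughly what I have in mind for that extraction step (a minimal sketch
against the Lucene 2.9-era API; the field name "product_key" is made up, and
I'd only extract the mapping after the final optimize, since docids can still
shift while segments are being merged):

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Walk the finished index once and record the mapping from internal
// Lucene docid to the stored product key.
public class DocidMapper {
    public static Map<Integer, Integer> buildMapping(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexReader reader = IndexReader.open(dir, true); // read-only
        Map<Integer, Integer> docidToKey = new HashMap<Integer, Integer>();
        try {
            for (int docid = 0; docid < reader.maxDoc(); docid++) {
                if (reader.isDeleted(docid)) continue; // skip deleted slots
                Document doc = reader.document(docid);
                docidToKey.put(docid, Integer.valueOf(doc.get("product_key")));
            }
        } finally {
            reader.close();
        }
        return docidToKey;
    }
}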

Basically, what it boils down to is that indexing time (a reindex has to be
done each night) is a big show-stopper at the moment. We've tried and tested
all the more standard optimization tricks and techniques, and we've built a
home-grown shard-like indexing strategy that uses 20 pretty big servers in
parallel, but the 20,000 indexed fields are still simply killing us.

At the same time, the app has a lot of knowledge of the 20,000 indices:
- All indices consist of prices (ints) between 0 and 10,000.
- Most importantly: as part of the document-construction process, the
ordering of each of the 20,000 indices is known for all documents processed
by the document-construction server in question. (This part is needed, and is
already performing at light speed.)

For the sake of argument, say we have 5 document-construction servers. Each
server processes 40,000 documents. Each server has the 20,000 ordered indices
readily available, in its own format, for the 40,000 documents it's
processing. Something like:
LinkedHashMap<Integer,Set<Integer>> --> <price, {productIds}>
(a small sketch of such a sub-index follows)
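
To make the per-server structure concrete, one of the 20,000 sub-indices
could look like this (a sketch; the class and method names are made up, and I
used a TreeMap to keep the prices sorted, which the merge step below relies
on):

import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// One sub-index for one field on one construction server:
// price (ascending) -> set of product ids with that price.
public class SubIndex {
    private final SortedMap<Integer, Set<Integer>> priceToProducts =
            new TreeMap<Integer, Set<Integer>>();

    public void add(int price, int productId) {
        Set<Integer> ids = priceToProducts.get(price);
        if (ids == null) {
            ids = new TreeSet<Integer>();
            priceToProducts.put(price, ids);
        }
        ids.add(productId);
    }

    public SortedMap<Integer, Set<Integer>> entries() {
        return priceToProducts;
    }
}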

Say we have 20 indexing servers. Each server has to calculate 1,000 indices
(totalling the 20,000).
We have the 5 doc-construction servers distribute the ordered sub-indices to
the correct servers.
Each indexing server then constructs a full index from the 5 ordered
sub-indices coming from the 5 construction servers. This can be done
efficiently using a mergesort, since the sub-indices are already sorted (see
the merge sketch below).
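
The merge step could look roughly like this (a sketch, assuming each
sub-index arrives as a price-sorted SortedMap as above; since the 5 inputs
are already sorted, a priority queue of iterators does the classic mergesort
merge in one pass, unioning the product-id sets when the same price occurs in
more than one sub-index):

import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class SubIndexMerger {

    // The current head entry of one sub-index's iterator.
    private static class Head {
        final Iterator<Map.Entry<Integer, Set<Integer>>> it;
        Map.Entry<Integer, Set<Integer>> entry;
        Head(Iterator<Map.Entry<Integer, Set<Integer>>> it) {
            this.it = it;
            this.entry = it.next();
        }
    }

    public static SortedMap<Integer, Set<Integer>> merge(
            List<SortedMap<Integer, Set<Integer>>> subIndices) {
        PriorityQueue<Head> pq = new PriorityQueue<Head>(
                Math.max(1, subIndices.size()),
                new Comparator<Head>() {
                    public int compare(Head a, Head b) {
                        return a.entry.getKey().compareTo(b.entry.getKey());
                    }
                });
        for (SortedMap<Integer, Set<Integer>> sub : subIndices) {
            if (!sub.isEmpty()) pq.add(new Head(sub.entrySet().iterator()));
        }
        SortedMap<Integer, Set<Integer>> merged = new TreeMap<Integer, Set<Integer>>();
        while (!pq.isEmpty()) {
            Head h = pq.poll(); // smallest remaining price across all inputs
            Set<Integer> ids = merged.get(h.entry.getKey());
            if (ids == null) {
                merged.put(h.entry.getKey(), new TreeSet<Integer>(h.entry.getValue()));
            } else {
                ids.addAll(h.entry.getValue()); // same price in several sub-indices
            }
            if (h.it.hasNext()) {
                h.entry = h.it.next();
                pq.add(h);
            }
        }
        return merged;
    }
}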

All that is missing (oversimplifying here) is going from the ordered indices
in application format to the index format of Lucene, substituting the product
ids with the Lucene docids along the way, and streaming the result to disk
(sketch below).
I believe this could quite possibly give a really big indexing improvement.
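
The substitution step would then be something like this (a sketch, assuming
the docid mapping from the first step has been inverted to product key ->
docid; one detail I'm aware of is that Lucene stores postings in increasing
docid order, so the docids have to be sorted after substitution):

import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

public class PostingsBuilder {

    // Rewrite a merged price -> {productId} map into price -> sorted
    // array of Lucene docids, ready to be written out as postings.
    public static SortedMap<Integer, int[]> toPostings(
            SortedMap<Integer, Set<Integer>> priceToProducts,
            Map<Integer, Integer> productKeyToDocid) {
        SortedMap<Integer, int[]> postings = new TreeMap<Integer, int[]>();
        for (Map.Entry<Integer, Set<Integer>> e : priceToProducts.entrySet()) {
            int[] docids = new int[e.getValue().size()];
            int i = 0;
            for (Integer productId : e.getValue()) {
                docids[i++] = productKeyToDocid.get(productId);
            }
            Arrays.sort(docids); // postings must be in increasing docid order
            postings.put(e.getKey(), docids);
        }
        return postings;
    }
}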

Is my thinking correct about the steps involved?
Do you believe that this would indeed give a big speedup for this specific
situation?
Where would I hook into the Solr/Lucene code to construct the native format?


Thanks in advance (and thanks for making it this far).

Geert-Jan

