Indexing to many collections

Matt Pearce Thu, 23 Oct 2025 02:33:50 -0700

Hi,

We are the main search engine for EMBL-EBI, and provide search over 6.5bn items, spread over ~175 separate datasets. We currently use Lucene Core, and are working on a move to Solr (primarily for speed and flexibility - some of the datasets are over 2bn items, and while we have worked around this in the past, we are likely to get increasingly large data as time goes on).

We currently run an overnight indexing pipeline that checks for updates to the data, builds new indexes, and uploads them to production. Indexes are built in parallel, using separate indexer processes. This frequently builds over 100 indexes in the space of 4-6 hours. Most of these indexes are fairly small - under 100k items - while some are up to 50m.
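
As an illustration only, the parallel build step described above can be sketched as a bounded fan-out over datasets. This is not the actual pipeline code: `build_index` and `build_all` are hypothetical names, the "build" just counts items, and threads stand in for the separate indexer processes so that the sketch stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_index(dataset):
    """Hypothetical stand-in for one indexer run.

    Takes a (name, items) pair and returns (name, number of items "indexed").
    """
    name, items = dataset
    return name, len(items)


def build_all(datasets, max_workers=8):
    """Build every dataset in parallel, with at most max_workers builds in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(build_index, d) for d in datasets]
        # Collect results as each build finishes, in completion order.
        for future in as_completed(futures):
            name, count = future.result()
            results[name] = count
    return results
```

The `max_workers` cap is the only knob here: it limits how many builds run at the same moment, regardless of how many datasets have updates on a given night.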

We are trying to replicate this pipeline with Solr, but are running into problems, particularly creating and writing to multiple collections near-simultaneously. Destination collections are multi-sharded, depending on size, with no replication enabled at this stage. We are not using dynamic fields (we found synchronisation errors from Zookeeper when we did).

We are using a Kubernetes cluster, with 10 Solr pods (Solr 9.7), 3 Zookeeper pods. Each Solr pod can have 12 CPUs, 32GB RAM at present, though they start smaller. The autocommit time is set to 15s, with openSearcher=false. Incoming data is sent in bulk, with batch sizes varying depending on the dataset - mostly between 500 and 25000 items per request. The destination collections are multi-sharded, depending on size, with no replication enabled, and we do not use dynamic fields to avoid sync issues with Zookeeper across the nodes. However, we often find that the Solr nodes are going down with out of memory issues during the indexing process.
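
For illustration, the bulk requests described above look roughly like the following minimal sketch. It is not the actual indexer: the helper names `batches` and `build_update_request`, the base URL, and the collection name are all assumptions. It splits documents into fixed-size batches and builds a JSON POST to the collection's `/update` handler, with no explicit commit, relying on autocommit as configured above.

```python
import json
import urllib.request
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_update_request(base_url, collection, chunk):
    """Build (but do not send) a JSON update request for one batch.

    base_url and collection are placeholders, e.g. "http://localhost:8983"
    and "demo". No commit parameter is added; autocommit is assumed.
    """
    url = f"{base_url}/solr/{collection}/update"
    body = json.dumps(chunk).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Each request would then be sent with `urllib.request.urlopen(request)`; the batch size passed to `batches` corresponds to the 500 to 25000 items per request mentioned above.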

Has anyone been in a similar situation and can offer advice? Would smaller nodes in a larger cluster help? Or lowering the autocommit time?

While we would like to improve the indexing speed, writing to multiple collections near-simultaneously is the goal. We are not yet in production, so search traffic is minimal at the moment.


Thanks in advance!

Matt

--
Matt Pearce
Technical Project Lead - EBI Search
European Bioinformatics Institute (EMBL-EBI)
