Hi,

We are the main search engine for EMBL-EBI, and provide search over 6.5bn items, spread over ~175 separate datasets. We currently use Lucene Core, and are working on a move to Solr, primarily for speed and flexibility. Some of the datasets are over 2bn items; we have worked around this in the past, but we expect the data to keep growing.

We currently run an overnight indexing pipeline that checks for updates to the data, builds new indexes, and uploads them to production. Indexes are built in parallel, using separate indexer processes. The pipeline frequently builds over 100 indexes in 4-6 hours. Most of these indexes are fairly small (under 100k items), while some contain up to 50m.

We are trying to replicate this pipeline with Solr, but are running into problems, particularly when creating and writing to multiple collections near-simultaneously. Destination collections are multi-sharded, depending on size, with no replication enabled at this stage. We are not using dynamic fields (we saw synchronisation errors from ZooKeeper when we did).

We are using a Kubernetes cluster, with 10 Solr pods (Solr 9.7) and 3 ZooKeeper pods. Each Solr pod can currently scale up to 12 CPUs and 32GB RAM, though they start smaller. The autocommit time is set to 15s, with openSearcher=false. Incoming data is sent in bulk, with batch sizes varying depending on the dataset - mostly between 500 and 25000 items per request. However, we often find that the Solr nodes go down with out-of-memory errors during the indexing process.
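To make the write pattern concrete, here is a minimal sketch of what a single indexer process does in outline. It is a simplified, hypothetical illustration rather than our actual indexer code: the function names, the base URL, and the collection name are made up for the example, and it assumes documents are posted as a JSON array to the collection's /update handler, with commits left to the autocommit settings described above.

```python
import json
import urllib.request
from itertools import islice
from typing import Iterable, Iterator


def batched(docs: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch


def build_update_request(
    base_url: str, collection: str, batch: list[dict]
) -> urllib.request.Request:
    """Build a POST of one batch to the collection's JSON update handler.

    No commit parameter is sent: commits are left to the server-side
    autocommit (15s, openSearcher=false in our setup).
    """
    return urllib.request.Request(
        url=f"{base_url}/solr/{collection}/update",
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_collection(
    base_url: str, collection: str, docs: Iterable[dict], batch_size: int = 500
) -> int:
    """Send all documents for one collection in batches; return the count sent."""
    sent = 0
    for batch in batched(docs, batch_size):
        request = build_update_request(base_url, collection, batch)
        # urlopen raises urllib.error.HTTPError on a non-2xx response.
        with urllib.request.urlopen(request, timeout=60) as response:
            response.read()
        sent += len(batch)
    return sent
```

In the overnight run, many processes like this are active at once, each calling something like index_collection("http://localhost:8983", "example_dataset", docs, batch_size=5000) against a different collection.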

Has anyone been in a similar situation and can offer advice? Would smaller nodes in a larger cluster help? Or lowering the autocommit time?

While we would like to improve the indexing speed, our main goal is to be able to write to multiple collections near-simultaneously. We are not yet in production, so search traffic is minimal at the moment.

Thanks in advance!

Matt

--
Matt Pearce
Technical Project Lead - EBI Search
European Bioinformatics Institute (EMBL-EBI)
