Re: Advice on ways forward with or without Data Import Handler

Andy Lester Thu, 29 May 2025 14:35:40 -0700


> We’ve been using Solr with DIH for about 8 years or so but now we’re hitting 
> an impasse with DIH being deprecated in Solr 9. Additionally, I’m looking to 
> move our Solr deploy to Kubernetes and I’ve been struggling to figure out 
> what to do with the DIH component in a cloud setting.


I suggest abandoning the DIH. I've done it and I'm glad we did. It makes things 
faster, more flexible and easier to maintain.

Here's what we did.

We were using the DIH to go and do SQL queries against our Oracle DB and then 
import them into our main searching core (40M book titles). We had a homemade 
scheduling mechanism set up to make the DIH run fairly often throughout the day 
to get updates out of Oracle, but we also had semaphores set up to disallow 
multiple runs of the DIH at once because that was Very Bad to do.

We threw that all away in favor of a tool (imaginatively called index-titles) 
that does the same basic query against the Oracle DB. index-titles massages the 
query results into appropriate JSON format, 5,000 at a time, and then POSTs 
them to /core/update, where they get imported. When it's all done, the final 
POST is to /core/update?commit=true to make the commit happen. Typically about 
100,000 titles need updating at a time, a few times throughout the day.
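For anyone who wants to see the shape of that loop, here is a rough Python sketch. It is not the actual index-titles code. The function names, the example URL and the injectable `send` parameter are all made up for illustration, and it assumes the rows have already been turned into dicts that match your schema. It also assumes Solr is happy to take an empty JSON array on the final commit POST, so check that against your version.

```python
import json
import urllib.request
from itertools import islice

BATCH_SIZE = 5000  # the batch size mentioned above


def batches(rows, size=BATCH_SIZE):
    """Yield lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_update_request(update_url, docs, commit=False):
    """Build a POST request carrying a JSON array of documents."""
    url = update_url + ("?commit=true" if commit else "")
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_rows(update_url, rows, send=urllib.request.urlopen):
    """POST rows in batches, then send one final commit request.

    `send` is a hypothetical hook so the HTTP call can be swapped out.
    Returns the number of rows posted.
    """
    total = 0
    for chunk in batches(rows):
        with send(build_update_request(update_url, chunk)) as resp:
            resp.read()
        total += len(chunk)
    # Final POST with commit=true and no documents (assumed to be accepted).
    with send(build_update_request(update_url, [], commit=True)) as resp:
        resp.read()
    return total
```

Usage would be something like `index_rows("http://localhost:8983/solr/titles/update", rows)`, where `rows` is whatever iterator your database cursor gives you after massaging. Committing once at the end rather than per batch keeps Solr from opening a new searcher 20 times for a 100,000-title run.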

There are many advantages to this.

1) Having the indexing program push data to Solr gives much more flexibility. I 
can run that indexer on any box that can make Oracle queries and POST to the 
Solr box.

2) It stops Solr from having to talk to Oracle itself. This was actually what 
triggered us making this happen, because we were moving from local hardware to 
Azure, and we wanted to containerize Solr without requiring Solr to talk to 
Oracle through an Oracle client. Now the indexer program does the 
connecting to Oracle, which many of our other programs do already. Solr doesn't 
know anything about where its records are coming from, nor does it need to.

2a) Not having to build a custom Solr that can talk to Oracle means we can now 
run a stock Docker container that doesn't need to have an Oracle client 
installed.

3) We can run multiple instances of index-titles, which is a huge speedup if we 
have to do a full reindex. I can start up 10 different index-titles runs (on 
different machines if I wanted) and tell each index-titles instance to take a 
1/10th slice of the queue of records to import. Reindexing the full 40M 
titles into a new core used to take 8+ hours. With 10 index-titles instances 
running, it's just over an hour.

4) All this speed and flexibility lets different developers have their own 
Solr core if they want. Now it's easy to start up a Docker container with an 
empty core in it and reindex your own copy of the core in an hour. It used to 
be a nightmare to work on core schema changes. Now that work can happen in 
isolation.
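
On the slicing in point 3: one simple way to hand each instance its share is to split on the record ID modulo the number of instances. This is only a sketch of that idea, not necessarily how index-titles does it. It assumes a numeric key, and the column name `title_id` is hypothetical.

```python
def slice_predicate(slice_index, slice_count, id_column="title_id"):
    """Return a SQL fragment selecting one slice of the records.

    Uses MOD() on a numeric ID column. `title_id` is a made-up column name.
    """
    if not isinstance(slice_index, int) or not isinstance(slice_count, int):
        raise TypeError("slice_index and slice_count must be integers")
    if slice_count < 1 or not 0 <= slice_index < slice_count:
        raise ValueError("need 0 <= slice_index < slice_count")
    return f"MOD({id_column}, {slice_count}) = {slice_index}"


def belongs_to_slice(record_id, slice_index, slice_count):
    """Same rule in Python, handy for checking that slices don't overlap."""
    return record_id % slice_count == slice_index
```

Instance 3 of 10 would add `AND MOD(title_id, 10) = 3` to its query. Every record lands in exactly one slice, so the instances never step on each other and nothing gets skipped.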

Abandon the DIH. It will take some work but you'll be so glad you did down the 
line.

Andy
