> We’ve been using Solr with DIH for about 8 years or so but now we’re hitting
> an impasse with DIH being deprecated in Solr 9. Additionally, I’m looking to
> move our Solr deploy to Kubernetes and I’ve been struggling to figure out
> what to do with the DIH component in a cloud setting.

I suggest abandoning the DIH. I've done it and I'm glad we did. It makes things faster, more flexible and easier to maintain.

Here's what we did. We were using the DIH to run SQL queries against our Oracle DB and import the results into our main searching core (40M book titles). We had a homemade scheduling mechanism set up to make the DIH run fairly often throughout the day to get updates out of Oracle, but we also had semaphores set up to disallow multiple runs of the DIH at once because that was Very Bad to do.

We threw that all away in favor of a tool (imaginatively called index-titles) that does the same basic query against the Oracle DB. index-titles massages the query results into appropriate JSON format, 5,000 at a time, and then POSTs them to /core/update and they get imported. When it's all done, the final POST is to /core/update?commit=true to make the commit happen. Typically we have 100,000 titles that need updating at a time, a few times throughout the day.

There are many advantages to this.

1) Having the indexing program push data to Solr gives much more flexibility. I can run that indexer on any box that can make Oracle queries and POST to the Solr box.

2) It stops Solr from having to talk to Oracle itself. This was actually what triggered the change, because we were moving from local hardware to Azure, and we wanted to be able to containerize Solr without Solr needing an Oracle client. Now the indexer program does the connecting to Oracle, which many of our other programs do already. Solr doesn't know anything about where its records are coming from, nor does it need to.

2a) Not having to build a custom Solr that can talk to Oracle means we can now run a stock Docker container that doesn't need to have an Oracle client installed.

3) We can run multiple instances of index-titles, which is a huge speedup if we have to do a full reindex. I can start up 10 different index-titles runs (on different machines if I wanted) and tell each index-titles instance to take a 1/10th slice of the queue of records to import. Reindexing the full 40M titles into a new core used to take 8+ hours. With 10 index-titles instances running, it's just over an hour.

4) All this speed and flexibility has made it easy for different developers to have their own Solr core if they want. Now, it's easy to start up a Docker container with an empty core in it and reindex your own copy of the core in an hour. It used to be a nightmare to work on core schema changes. Now that work can happen in isolation.

Abandon the DIH. It will take some work but you'll be so glad you did down the line.

Andy
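
P.S. For anyone who wants a starting point, here is a minimal sketch of the push-style loop described above: batch the documents, POST each batch as JSON to the core's update handler, then make one last POST with commit=true. It is not the actual index-titles code. The function names, the integer "id" field, the modulo rule for splitting work between workers, and the example URL are all assumptions made for illustration, and the Oracle query is left out entirely (docs is just any iterable of dicts). It also assumes Solr accepts an empty JSON array on the commit-only request.

```python
import json
import urllib.request
from typing import Callable, Iterable, Iterator, Optional

BATCH_SIZE = 5000  # batch size mentioned in the description above


def in_slice(record_id: int, slice_index: int, slice_count: int) -> bool:
    """Assumed partitioning rule: worker k of N takes ids where id % N == k."""
    return record_id % slice_count == slice_index


def batched(docs: Iterable[dict], size: int) -> Iterator[list]:
    """Yield lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_update_request(core_url: str, docs: list,
                         commit: bool = False) -> urllib.request.Request:
    """Build a JSON POST to <core_url>/update, optionally with commit=true."""
    url = core_url.rstrip("/") + "/update" + ("?commit=true" if commit else "")
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def _send_over_http(req: urllib.request.Request) -> None:
    """Default sender: perform the request and discard the response body."""
    with urllib.request.urlopen(req) as resp:
        resp.read()


def index_docs(core_url: str, docs: Iterable[dict],
               slice_index: int = 0, slice_count: int = 1,
               batch_size: int = BATCH_SIZE,
               send: Optional[Callable] = None) -> int:
    """Push this worker's slice in batches, then commit. Returns docs sent."""
    send = send or _send_over_http
    mine = (d for d in docs if in_slice(d["id"], slice_index, slice_count))
    sent = 0
    for batch in batched(mine, batch_size):
        send(build_update_request(core_url, batch))
        sent += len(batch)
    # Final POST carries no documents and only triggers the commit
    # (assumes an empty JSON array is accepted by the update handler).
    send(build_update_request(core_url, [], commit=True))
    return sent
```

To run ten workers over the same source query, each one would call something like index_docs("http://localhost:8983/solr/titles", docs, slice_index=k, slice_count=10) with its own k from 0 to 9. In this sketch every worker issues its own commit at the end; with several workers running at once you may prefer to commit once after they have all finished.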