On 12/11/2016 8:00 PM, Brian Narsi wrote:
> We are using Solr 5.1.0 and DIH to build index.
>
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
>
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?
>
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?

There are no hard limits other than the Lucene limit of a little over
two billion docs per individual index. With sharding, Solr is able to
easily overcome this limit on an entire index.

I have one index where each shard was over 50 million docs. Each shard
has fewer docs now, because I changed it so there are more shards and
more machines. For some reason the rebuild time (using DIH) got really
really long -- nearly 48 hours -- while building every shard in
parallel. Still haven't figured out why the build time increased
dramatically.

One problem you might run into with DIH from a database has to do with
merging. With default merge scheduler settings, eventually (typically
when there are millions of rows being imported) you'll run into a pause
in indexing that will take so long that the database connection will
close, causing the import to fail after the pause finishes.

I even opened a Lucene issue to get the default value for maxMergeCount
changed. This issue went nowhere:

https://issues.apache.org/jira/browse/LUCENE-5705

Here's a thread from this mailing list discussing the problem and the
configuration solution:

http://lucene.472066.n3.nabble.com/What-does-quot-too-many-merges-stalling-quot-in-indexwriter-log-mean-td4077380.html

Thanks,
Shawn
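
For reference, the merge scheduler change mentioned above goes in the
indexConfig section of solrconfig.xml. This is only a rough sketch of
what such a setting might look like on Solr 5.x. The values 1 and 6 are
illustrative assumptions, not numbers taken from this message, so check
the linked thread and test against your own hardware before relying on
them:

```xml
<!-- Sketch only: raise maxMergeCount so indexing is less likely to stall
     while merges catch up. The values below are assumed for illustration. -->
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- number of merges allowed to run at the same time -->
    <int name="maxThreadCount">1</int>
    <!-- number of merges that may be queued before incoming indexing is paused -->
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
</indexConfig>
```

The idea is that a larger maxMergeCount lets more merges queue up before
the indexing threads are paused, so a long import should be less likely
to sit idle long enough for the database connection to close.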