On 12/11/2016 8:00 PM, Brian Narsi wrote:
> We are using Solr 5.1.0 and DIH to build index.
>
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
>
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?
>
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?

There are no hard limits other than the Lucene limit of a little over
two billion docs per individual index.  With sharding, Solr is able to
easily overcome this limit on an entire index.

I have one index where each shard was over 50 million docs.  Each shard
has fewer docs now, because I changed it so there are more shards and
more machines.  For some reason the rebuild time (using DIH) got really
really long -- nearly 48 hours -- while building every shard in
parallel.  Still haven't figured out why the build time increased
dramatically.

One problem you might run into with DIH from a database has to do with
merging.  With default merge scheduler settings, eventually (typically
when there are millions of rows being imported) you'll run into a pause
in indexing that will take so long that the database connection will
close, causing the import to fail after the pause finishes.

I even opened a Lucene issue to get the default value for maxMergeCount
changed.  This issue went nowhere:

https://issues.apache.org/jira/browse/LUCENE-5705

Here's a thread from this mailing list discussing the problem and the
configuration solution:

http://lucene.472066.n3.nabble.com/What-does-quot-too-many-merges-stalling-quot-in-indexwriter-log-mean-td4077380.html

Thanks,
Shawn

Reply via email to