40 million docs isn't really very many by modern standards, although if they're huge documents then that might be an issue.
So is this a single shard or multiple shards? If you're really facing performance issues, simply making a new collection with more than one shard (independent of how many replicas each has) is probably simplest.

The number of deleted documents really shouldn't be a problem. Typically the deleted documents are purged during the segment merging that happens automatically as you add documents. I often see 10-15% of the corpus consist of deleted documents.

You can force them to be purged by doing a force merge (aka optimize), but that is usually not recommended unless you have a strange situation where lots and lots of docs have been deleted, as measured by the "deleted docs" entry relative to the maxDoc number (both on the Admin UI page).

So show us what you're seeing that's concerning. Typically, especially on an index that's continually getting updates, it's adequate to just let the background segment merging take care of things.

Best,
Erick

On Sat, Aug 1, 2015 at 8:49 PM, Jay Potharaju <jspothar...@gmail.com> wrote:

> Hi
>
> I currently have a single collection with 40 million documents and index
> size of 25 GB. The collections gets updated every n minutes and as a result
> the number of deleted documents is constantly growing. The data in the
> collection is an amalgamation of more than 1000+ customer records. The
> number of documents per each customer is around 100,000 records on average.
>
> Now that being said, I 'm trying to get an handle on the growing deleted
> document size. Because of the growing index size both the disk space and
> memory is being used up. And would like to reduce it to a manageable size.
>
> I have been thinking of splitting the data into multiple core, 1 for each
> customer. This would allow me manage the smaller collection easily and can
> create/update the collection also fast. My concern is that number of
> collections might become an issue. Any suggestions on how to address this
> problem. What are my other alternatives to moving to a multicore
> collections.?
>
> Solr: 4.9
> Index size:25 GB
> Max doc: 40 million
> Doc count:29 million
>
> Replication:4
>
> 4 servers in solrcloud.
>
> Thanks
> Jay
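
For reference, the check described in the reply (the "deleted docs" entry relative to maxDoc) is just a ratio. Here is a minimal sketch in Python, assuming the rounded figures from the quoted message (Max doc: 40 million, Doc count: 29 million) and that deleted docs equal maxDoc minus the live doc count. The function name `deleted_ratio` is made up for illustration and is not part of Solr.

```python
def deleted_ratio(max_doc: int, num_docs: int) -> float:
    """Fraction of index entries that are deleted documents.

    Assumes deleted docs = max_doc - num_docs (live documents).
    """
    if max_doc <= 0:
        return 0.0  # empty index: nothing deleted
    return (max_doc - num_docs) / max_doc


# Rounded figures taken from the quoted message (assumed approximate).
ratio = deleted_ratio(max_doc=40_000_000, num_docs=29_000_000)
print(f"deleted docs: {ratio:.1%} of maxDoc")
```

With those figures the ratio comes out to roughly 27.5%, which can be compared against the 10-15% mentioned in the reply.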