On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
> Running into an issue and wanted to see if anyone had some suggestions.
> We are seeing this with both solr 4.6 and 4.10.3 code.
> We are running an extremely update-heavy application, with millions of writes
> and deletes happening to our indexes constantly.  An issue we are seeing is
> that Solr Cloud is not reclaiming the disk space held by deleted documents,
> space that could be used for new inserts.
>
> We used to run an optimize periodically with our old multi-core setup; not
> sure if that works for SolrCloud.
>
> Num Docs:28762340
> Max Doc:48079586
> Deleted Docs:19317246
>
> Version 1429299216227
> Gen 16525463
> Size 109.92 GB
>
> In our solrconfig.xml we use the following configs.
>
>     <indexConfig>
>     <!-- Values here affect all index writers and act as a default unless 
> overridden. -->
>         <useCompoundFile>false</useCompoundFile>
>         <maxBufferedDocs>1000</maxBufferedDocs>
>         <maxMergeDocs>2147483647</maxMergeDocs>
>         <maxFieldLength>10000</maxFieldLength>
>
>         <mergeFactor>10</mergeFactor>
>         <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
>         <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>             <int name="maxThreadCount">3</int>
>             <int name="maxMergeCount">15</int>
>         </mergeScheduler>
>         <ramBufferSizeMB>64</ramBufferSizeMB>
>         
>     </indexConfig>

This part of my response won't help the issue you wrote about, but it
can affect performance, so I'm going to mention it.  If your indexes are
stored on regular spinning disks, reduce mergeScheduler/maxThreadCount
to 1.  If they are stored on SSD, then a value of 3 is OK.  Spinning
disks cannot do seeks (read/write head moves) fast enough to handle
multiple merging threads properly.  All the seek activity required will
really slow down merging, which is a very bad thing when your indexing
load is high.  SSDs do not have to seek, so multiple merge threads are fine
there.
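
In your posted indexConfig, that would mean changing only the maxThreadCount
value in the mergeScheduler section, something like this (keeping your
existing maxMergeCount of 15; the value of 1 is for spinning disks only):

        <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
            <int name="maxThreadCount">1</int>
            <int name="maxMergeCount">15</int>
        </mergeScheduler>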

An optimize is the only way to reclaim all of the disk space held by
deleted documents.  Over time, as segments are merged automatically, the
space held by deleted docs will be recovered, but that recovery won't be
complete, especially once segments have been merged multiple times into very
large segments.
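
If you do decide to run one, an optimize can be sent through the update
handler.  A rough example with curl, substituting your own host and
collection name and assuming the default /update handler:

    curl 'http://localhost:8983/solr/yourcollection/update?optimize=true'

or, equivalently, by posting an XML optimize message:

    curl 'http://localhost:8983/solr/yourcollection/update' \
      -H 'Content-Type: text/xml' --data-binary '<optimize/>'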

If you send an optimize command to a core/collection in SolrCloud, the
entire collection will be optimized ... the cloud will do one shard
replica (core) at a time until the entire collection has been
optimized.  There is currently no way to ask it to optimize only a single
core, or to optimize multiple cores simultaneously, even if they are on
different servers.

Thanks,
Shawn
