On 10/13/2016 9:20 AM, Zheng Lin Edwin Yeo wrote:
> Would like to find out, will the indexing speed in a collection with a
> very large index size be much slower than one which is still empty or
> a very small index size? This is assuming that the configurations,
> indexing code and the files to be indexed are the same. Currently, I
> have a setup in which the collection is still empty, and I managed to
> achieve an indexing speed of more than 7GB/hr. I also have another
> setup in which the collection has an index size of 1.6TB, and when I
> tried to index new documents to it, the indexing speed is less than
> 0.7GB/hr.
I have noticed this phenomenon myself: as the amount of index data already present increases, indexing slows down.

My best guess as to the cause: more frequent and longer-lasting garbage collections. Indexing involves a LOT of memory allocation. Most of the memory chunks that get allocated are quickly discarded because they do not need to be retained. If you understand how the Java memory model works, then you know this means there will be a lot of garbage collection. Each GC will tend to take longer when there are a large number of allocated objects that are NOT garbage.

When the index is large, Lucene/Solr must allocate and retain a larger amount of memory just to ensure that everything works properly. This leaves less free memory, so indexing will cause more frequent garbage collections ... and because the amount of retained memory is correspondingly larger, each garbage collection will take longer than it would with a smaller index.

A ten-to-one difference in speed does seem extreme, though. You might want to increase the heap allocated to each Solr instance, so GC is less frequent. This can take memory away from the OS disk cache, though. If the amount of OS disk cache drops too low, general performance may suffer.

Thanks,
Shawn
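P.S. If you want to try raising the heap, the usual place is Solr's include script (solr.in.sh, or /etc/default/solr.in.sh on a service install). A sketch of what I mean is below -- the 8g figure is just an illustration, not a recommendation, and the exact variable names are from a Solr 5.x/6.x solr.in.sh, so check your own copy before pasting:

```shell
# solr.in.sh -- raise the heap from the 512m default so GC runs less often.
# Pick a size that still leaves most of your RAM free for the OS disk cache;
# "8g" here is only an example value.
SOLR_HEAP="8g"

# Optional: GC logging, so you can check whether long pauses actually line
# up with the slow indexing before changing anything else.
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps"
```

Watching the GC log (or a tool like GCViewer) while indexing into the 1.6TB collection should tell you fairly quickly whether GC is really where the time is going.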