On 10/13/2016 9:20 AM, Zheng Lin Edwin Yeo wrote:
> I would like to find out: will indexing into a collection that
> already has a very large index be much slower than into one that is
> still empty or has a very small index?  This is assuming that the
> configuration, indexing code, and the files to be indexed are the
> same.  Currently, I have a setup in which the collection is still
> empty, and I can achieve an indexing speed of more than 7GB/hr.  I
> also have another setup in which the collection has an index size of
> 1.6TB, and when I try to index new documents into it, the indexing
> speed is less than 0.7GB/hr.

I have noticed this phenomenon myself.  As the amount of index data
already present increases, indexing slows down.  Best guess as to the
cause: more frequent and longer-lasting garbage collections.
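
One way to test that guess is to compare GC logs from the two setups
while they are indexing.  Recent Solr versions write a GC log by
default (look for solr_gc.log under the logs directory); if yours does
not, the standard HotSpot options will produce one, e.g. on Java 8:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime

If the 1.6TB instance shows much more frequent and much longer pauses,
GC is the likely culprit.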

Indexing involves a LOT of memory allocation.  Most of the memory chunks
that get allocated are quickly discarded because they do not need to be
retained.

If you understand how Java memory management works, then you know that
this means there will be a lot of garbage collection.  Each GC cycle
will tend to take longer when a large number of the objects on the
heap are NOT garbage, because the collector must trace (and, with some
collectors, copy) every live object.
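
As a toy illustration (generic Java, not Solr code, and the exact
numbers depend heavily on which collector you use), the sketch below
runs the same short-lived allocation churn next to a configurable
amount of permanently retained memory and reports total GC time:

  import java.lang.management.GarbageCollectorMXBean;
  import java.lang.management.ManagementFactory;

  // Toy demo: short-lived allocation churn next to a configurable
  // amount of retained memory.  Example runs:
  //   java -Xmx2g LiveSetDemo 0      (near-empty live set)
  //   java -Xmx2g LiveSetDemo 1000   (~1GB live set)
  public class LiveSetDemo {
      public static void main(String[] args) {
          int liveMb = args.length > 0 ? Integer.parseInt(args[0]) : 0;
          byte[][] liveSet = new byte[liveMb][];
          for (int i = 0; i < liveMb; i++) {
              liveSet[i] = new byte[1_000_000];  // stays reachable
          }
          // Short-lived garbage, like the temporary buffers and token
          // objects created while indexing documents.
          byte[][] churn = new byte[1024][];
          long start = System.nanoTime();
          for (int i = 0; i < 50_000_000; i++) {
              churn[i & 1023] = new byte[128];   // old slot becomes garbage
          }
          long wallMs = (System.nanoTime() - start) / 1_000_000;
          long gcMs = 0;
          for (GarbageCollectorMXBean gc
                  : ManagementFactory.getGarbageCollectorMXBeans()) {
              gcMs += gc.getCollectionTime();
          }
          System.out.println("live set " + liveMb + "MB: churn "
                  + wallMs + " ms, GC " + gcMs + " ms");
      }
  }

On most collectors, the reported GC time (and with it the wall time)
climbs as the live set grows -- the same pressure that a large Lucene
index puts on Solr's heap.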

When the index is large, Lucene/Solr must allocate and retain a larger
amount of heap memory just to keep the index functioning -- per-segment
structures such as term indexes, norms, and caches grow with the
index.  This leaves less free memory, so indexing will cause more
frequent garbage collections ... and because the amount of retained
memory is correspondingly larger, each garbage collection will take
longer than it would with a smaller index.  A ten to one difference in
speed does seem extreme, though.

You might want to increase the heap allocated to each Solr instance, so
GC is less frequent.  This can take memory away from the OS disk cache,
though.  If the amount of OS disk cache drops too low, general
performance may suffer.
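
For example, with the standard bin/solr script the heap can be set at
startup (8g here is only a placeholder -- choose a value based on your
own GC logs):

  bin/solr start -m 8g

or permanently in solr.in.sh (solr.in.cmd on Windows):

  SOLR_HEAP="8g"

Whatever value you pick, leave most of the machine's remaining RAM
unallocated so the OS can use it for the disk cache.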

Thanks,
Shawn
