On 6/1/2019 12:27 AM, John Davis wrote:
> I've read a bunch of the wikis on Solr heap usage and wanted to confirm my
> understanding of what Solr uses the heap for:
This is something that's not straightforward to answer. It would not be
wrong to say that Solr uses the Java heap for everything it does ... but
saying that doesn't help you.
It's extremely difficult to predict in advance exactly how much heap you
need to give to Solr.
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
We can (and sometimes do) make specific recommendations to users who
provide us with a wealth of information about their setup ... but you
should know that those recommendations always come with caveats.
There's a good chance that things will actually work with less heap than
we mention -- we aim for larger values simply because the performance
implications of a heap that's too small are orders of magnitude worse
than those of one that's too large.
In practice, the way I deal with heap sizing is to start with a value
that seems comfortably large enough to work, and then analyze GC logs to
determine whether it needs to be changed. The initial value is mostly
arbitrary, informed by experience.
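As a rough illustration only (the number below is a placeholder, not a
recommendation), the heap for a standard install can be set on startup,
and recent Solr releases enable GC logging out of the box:

    # start Solr with an 8GB heap -- example value only
    bin/solr start -m 8g

    # after the node has been under real load for a while, look for
    # solr_gc.log* next to solr.log (the usual location in a default
    # install) and feed it to a GC log analyzer

For a permanent setting, the SOLR_HEAP variable in solr.in.sh does the
same thing as the -m option.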
Most of Solr's functionality is provided by Lucene, which is a
programming API for search. To me, Lucene, and Solr's usage of it, is
mostly a black box; precisely how it functions internally is unknown to
me. The source code is available, but it would take a very in-depth
study to actually understand it.
> 1. Indexing new documents - until committed? If not, how long are the
> new documents kept in heap?
Lucene sets aside a buffer to hold data that will be flushed to a new
segment. Solr's default for this buffer size is 100MB. That buffer is
flushed when it fills up, not just on commit. The segments produced by
default are smaller than 100MB, so Lucene clearly does not hold the data
in memory in the exact format that ends up on disk. Beyond that 100MB
buffer, indexing needs additional memory for all the manipulations that
Lucene and Solr must perform.
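If you want to see or change that buffer size, it lives in
solrconfig.xml. A minimal sketch (100 is the default if nothing is
configured):

    <indexConfig>
      <!-- flush the in-memory indexing buffer to a new segment once it
           reaches this many megabytes, even if no commit has happened -->
      <ramBufferSizeMB>100</ramBufferSizeMB>
    </indexConfig>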
> 2. Merging segments - does Solr load the entire segment in memory or
> chunks of it? If the latter, how large are these chunks?
Again, this is Lucene, so I don't know the details. I can optimize an
index that is much larger than all the memory in the system, so it
cannot be loading all of the data into memory at once. I don't think
merging is enormously RAM-hungry, but it does hit the CPU pretty hard.
The fastest I have ever seen segment merging proceed is about 30
megabytes per second, with 20 megabytes per second being more common.
Virtually all modern disks can sustain transfer rates well above
30MB/s, especially RAID10 volumes and SSDs, so the disk is not the
bottleneck.
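For what it's worth, merge behavior is controlled by the merge policy
in solrconfig.xml rather than by any memory setting. The values below
are the documented defaults for the tiered policy, shown only as a
sketch of where this is configured:

    <indexConfig>
      <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <!-- how many segments are merged together in one merge -->
        <int name="maxMergeAtOnce">10</int>
        <!-- how many similarly sized segments may accumulate per tier
             before a merge is triggered -->
        <int name="segmentsPerTier">10</int>
      </mergePolicyFactory>
    </indexConfig>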
> 3. Queries, facets, caches - anything else major?
Facets, grouping, and sorting are all RAM-hungry operations whose heap
usage is greatly reduced by enabling docValues in the field definition,
because the docValues data is already in exactly the form those
features need, so it does not have to be rebuilt on the heap.
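In the schema that just means adding docValues to the fields you facet,
group, or sort on. The field and type names below are only an example:

    <!-- managed-schema / schema.xml -->
    <field name="category" type="string" indexed="true" stored="true"
           docValues="true"/>

Keep in mind that turning docValues on for an existing field requires a
full reindex before it takes effect.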
Was this wiki page one of the things you read? I wrote it:
https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
Thanks,
Shawn