Hi Everyone,
This is a follow-up to the discussion from September 2017. Since then
I've spent a lot of time gaining a better understanding of docValues
compared to UIF (the UnInvertedField faceting method) and other things
related to Solr performance. Here's a summary of the results based on
my real-world experience:
1. Making sure Solr needs as little Java heap as possible is crucial.
2. UIF requires a lot of Java heap. With a larger index it becomes
impractical, since Java GC can't easily keep up with the heaps required.
3. UIF is really fast, but only after serious warmup. DocValues work
better if the index is updated regularly, since the same level of
warmup is not needed.
4. DocValues, which take advantage of memory-mapped files, don't have
the heap problem described in item 2. After moving to all-docValues we
have been able to reduce the Java heap from 31G to 6G. This is pretty
significant, since it means we don't have to deal with long GC pauses.
5. Make sure docValues are also enabled for all fields used for sorting.
This helps avoid spending memory on the field cache. Without docValues
we could easily have 2 GB of field cache entries.
6. It seems that having docValues for the id field is useful too. For
now, stored needs to remain true for it as well (see
https://issues.apache.org/jira/browse/SOLR-10816).
7. Sharding the index lets faceting with docValues do more work in
parallel and results in much better performance. Sharding doesn't seem
to hurt the overall performance (at least not noticeably), and
splitting our index into three shards seems to have given a speedup
that's better than threefold. There is a caveat [1], though.
8. In many cases fields that have docValues enabled can be switched from
stored="true" to stored="false", since Solr can fetch the contents from
docValues. This means that enabling docValues doesn't add to the index
size significantly. A notable exception is multivalued fields where the
order of the values is important, since docValues don't preserve the
original order.
9. The different replica types available in Solr 7 are really useful in
reducing the CPU time spent indexing records. I'd still like to have a
way to combine PULL replicas with NRT replicas so that only the PULL
replicas handle search queries.
10. Lastly, a lot can be done at the application level. For instance, in
our case many users don't care about facets or only use a couple of
them, so we fetch them asynchronously as needed and collapse most by
default without fetching them at all. This lowers the server load
significantly (I'll work on contributing the option to upstream VuFind).
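
To make the schema side of the list more concrete, here is a minimal
Python sketch of how the docValues settings for the id field, a sort
field and a multivalued facet field could be written as Solr Schema API
replace-field commands. The field names title_sort and author_facet and
the "string" field type are made-up placeholders for illustration only;
adjust them to your own schema.

```python
import json
from typing import Any, Dict, List


def docvalues_field(name: str, field_type: str, stored: bool = False,
                    multi_valued: bool = False) -> Dict[str, Any]:
    """Build one Schema API replace-field command with docValues enabled."""
    return {
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": stored,
            "docValues": True,
            "multiValued": multi_valued,
        }
    }


def build_commands() -> List[Dict[str, Any]]:
    """Example commands; every field name except id is hypothetical."""
    return [
        # The id field keeps stored=true for now (see SOLR-10816).
        docvalues_field("id", "string", stored=True),
        # Hypothetical sort field: docValues only, no stored copy.
        docvalues_field("title_sort", "string"),
        # Hypothetical facet field where value order does not matter.
        docvalues_field("author_facet", "string", multi_valued=True),
    ]


if __name__ == "__main__":
    for command in build_commands():
        print(json.dumps(command))
```

Each printed JSON object would be sent as the body of a POST to the
collection's /schema endpoint. Changing docValues on an existing field
requires a full reindex before the new setting takes effect.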
I hope this helps others make informed choices.
--Ere
[1] Care must be taken to avoid requests that cause Solr to fetch a lot
of rows at once from each shard, since that blows up the memory usage
and wreaks havoc in Solr. One particular case that doesn't look too
dangerous at first sight is deep paging without a cursor (Yonik has a
good explanation of this at http://yonik.com/solr/paging-and-deep-paging/).
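
As a rough illustration of the footnote, the Python sketch below
compares how many entries the coordinating node has to collect from the
shards for a start/rows page, and shows paging with Solr's cursorMark
parameter instead. The fetch callable is a hypothetical stand-in for
whatever HTTP client sends the parameters to /select and returns the
parsed JSON response, and the sketch assumes the uniqueKey field is
called id.

```python
from typing import Any, Callable, Dict, Iterator

Params = Dict[str, Any]
Response = Dict[str, Any]


def shard_entries_collected(start: int, rows: int, shards: int) -> int:
    """Entries gathered for one start/rows page in a sharded index.

    Every shard has to return its top (start + rows) candidates so that
    the coordinating node can merge them and pick the requested page.
    """
    return shards * (start + rows)


def iter_with_cursor(fetch: Callable[[Params], Response], query: str,
                     sort: str = "id asc",
                     rows: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield all matching documents using cursorMark instead of start.

    The sort must include the uniqueKey field (assumed to be id here).
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        response = fetch({
            "q": query,
            "sort": sort,
            "rows": rows,
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # Solr returns the same mark once there are no more results.
            return
        cursor = next_cursor
```

With three shards, start=100000 and rows=20 mean collecting 300060
entries for a single page, while a cursor request with rows=20 only
needs 60, no matter how deep into the result set it is.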