On 1/3/24 13:33, rajani m wrote:
A Solr query with LTR as a re-ranker is using the full heap all of a sudden
and triggering an STW pause. Could you please take a look and let me know
your thoughts? What is causing this? The STW pause is putting nodes in an
unhealthy state, causing nodes to restart and bringing the entire cluster down.
As per the logs, the issue seems to be related to LTR generating features at
query time. The model has 12 features; most are Solr queries and a few are
field values. The error from the logs is copied below[2]. I'd say this
is a major bug as G1GC is supposed to avoid STW. What are your thoughts?
G1 does not completely eliminate stop-the-world pauses.
One of the little details of G1GC operation concerns something called
humongous objects.
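Any object larger than half the G1 region size is classified as
humongous. These objects are allocated directly in the old generation.
As I understand it, the surest way they get collected is during a full
garbage collection, although recent Java versions can also reclaim them
at the end of a concurrent marking cycle and, for some objects, during
young collections.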
The secret to stellar performance with G1 is to eliminate, as much as
possible, full GC cycles ... because there will always be a long STW
with a full G1GC, but G1's region-specific collectors operate almost
entirely concurrently with the application.
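One way to check whether full collections are actually happening is to count the relevant lines in the GC log. This is only a rough sketch, assuming a Java 9+ unified GC log at info level. The name `count_gc_events` is invented here, and the two search strings are the pause and cause labels G1 normally prints, so adjust them if your log looks different.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solr or the JDK): counts log lines that
# mention a full GC pause or a collection triggered by a humongous allocation.
# Assumes a unified-logging (-Xlog:gc*) style GC log passed as the first argument.
count_gc_events() {
    log_file=$1
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function usable under "set -e".
    full=$(grep -c 'Pause Full' "$log_file" || true)
    humongous=$(grep -c 'G1 Humongous Allocation' "$log_file" || true)
    echo "full_gc=${full} humongous_triggered=${humongous}"
}
```

For example, `count_gc_events /var/solr/logs/solr_gc.log` (that path is an assumption; use wherever your install writes its GC log). The numbers are matching log lines, not necessarily distinct collections, so treat them as a rough signal.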
You can set the G1 region size with the `-XX:G1HeapRegionSize` parameter
in your GC tuning ... but be aware that the max region size is 32m, at
least through Java 17. That means that no matter how you tune G1, an
object larger than 16 megabytes will always be humongous. It is my
understanding that LTR models can be many megabytes in size, but I have
never used the feature myself.
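As a rough illustration of the half-region rule, the sketch below prints the humongous threshold for each power-of-two region size from 1m to 32m. The function name `humongous_threshold_kb` is made up for this example and is not part of Solr or the JDK.

```shell
#!/bin/sh
# Illustration only: humongous_threshold_kb is a hypothetical helper.
# G1 treats an object as humongous when it is larger than half a region,
# so the threshold is region size / 2 (reported here in kilobytes).
humongous_threshold_kb() {
    region_mb=$1
    echo $(( region_mb * 1024 / 2 ))
}

# G1 region sizes are powers of two; 1m through 32m is the range discussed here.
for size in 1 2 4 8 16 32; do
    echo "G1HeapRegionSize=${size}m -> humongous above $(humongous_threshold_kb "$size")k"
done
```

With the region size pinned at the maximum (for example by adding `-XX:G1HeapRegionSize=32m` to GC_TUNE), the last line shows the 16384k cutoff, which is the 16 megabyte limit.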
If you are running on Java 11 or later, I recommend giving ZGC a try.
This is the tuning I use in /etc/default/solr.in.sh. I use OpenJDK 17:
GC_TUNE=" \
-XX:+UnlockExperimentalVMOptions \
-XX:+UseZGC \
-XX:+ParallelRefProcEnabled \
-XX:+ExplicitGCInvokesConcurrent \
-XX:+AlwaysPreTouch \
-XX:+UseNUMA \
"
ZGC promises extremely short GC pauses with ANY size heap, even
terabytes. I haven't tested it with a large heap myself, but in my
limited testing, its individual pauses were MUCH shorter than what I saw
with G1. Throughput is lower than G1, but latency is AWESOME.
One bit of warning ... ZGC always uses 64-bit pointers (it does not
support compressed object pointers), so the advice you'll commonly see
recommending a heap size below 32GB does not apply to ZGC. There is no
advantage to a 31GB heap compared to 32GB when using ZGC.
Thanks,
Shawn