Hello folks,

We’re running our SolrCloud cluster in Kubernetes. Recently we’ve upgraded from 
8.11 to 9.0 (and eventually to 9.1). 

We fully reindexed the collections after the upgrade. Everything looks good: 
no errors, and we noticed response time improvements.

We have the following specs:
collection size:
22M docs, 1.3 KB doc size; ~28 GB total collection size at this point;
shards: 6 shards, each ~4.7 GB; 1 core per node;
nodes: 
30Gi of RAM, 
16 cores
96 nodes
Heap: 23 GB
JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr
gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300 
-XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages 
-XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2 
-XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
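
As a rough sanity check of the figures above, here is a minimal sketch. It 
ignores the difference between Gi and GB, and it assumes that every node hosts 
exactly one replica of this collection, which may not hold since the cluster 
also serves other collections:

```python
# Figures copied from the specs above.
docs = 22_000_000
doc_size_kb = 1.3
shards = 6
nodes = 96
ram_gb = 30    # treated as GB here for simplicity (the spec says 30Gi)
heap_gb = 23

total_gb = docs * doc_size_kb / 1_000_000   # KB -> GB (decimal units)
per_shard_gb = total_gb / shards            # index size held by one core
off_heap_gb = ram_gb - heap_gb              # left for the OS, page cache, etc.

# Assumption: one replica of this collection per node, spread evenly.
replicas_per_shard = nodes // shards

print(f"total index size:   ~{total_gb:.1f} GB")
print(f"index per shard:    ~{per_shard_gb:.1f} GB")
print(f"RAM outside heap:   ~{off_heap_gb} GB per node")
print(f"replicas per shard: {replicas_per_shard} (assumed)")
```

Under these assumptions the per-shard index fits in the memory left outside the 
heap, and a reload has to reopen every replica core across the cluster.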


Problem

The problem occurs when we try to reload the collection: in sync mode the 
request times out, and in async mode the task runs forever:

curl “reload” output: https://justpaste.it/ap4d2
ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs of 
some nodes): https://justpaste.it/aq3dw

There are no issues on a newly created cluster if there is no incoming traffic 
to it. Once we start sending requests to the cluster, collection reload becomes 
impossible. Other collections (smaller) within the same cluster are reloading 
just fine.

In some cases, old-generation GC kicks in on some nodes and makes the entire 
cluster unstable; however, that doesn’t happen every time the collection 
reload times out.

We tried rolling back to 8.11, and everything works normally, as it used to: 
no errors with reload, no other errors in the logs during reload, etc.

We tried the following:
run 9.0, 9.1 on Java 11 and Java 17: same result;
lower cache warming, disable firstSearcher queries: same result;
increase heap size, tune gc: same result;
use apiv1 and apiv2 to issue reload commands: no difference;
sync vs async reload: either forever running task or timing out after 180 
seconds;
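
For reference, the async reload attempt can be sketched roughly as follows. 
This is a minimal illustration, not our exact tooling: it uses the v1 
Collections API actions RELOAD and REQUESTSTATUS, and the base URL, collection 
name and request id used with it are hypothetical placeholders.

```python
import json
import time
import urllib.parse
import urllib.request


def collections_url(base_url, params):
    """Build a v1 Collections API URL from a dict of query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def http_get_json(url, timeout=30):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def reload_async(base_url, collection, request_id, fetch=http_get_json,
                 poll_seconds=5, max_wait_seconds=600, sleep=time.sleep):
    """Submit an async RELOAD, then poll REQUESTSTATUS until done or timed out."""
    fetch(collections_url(base_url, {
        "action": "RELOAD", "name": collection,
        "async": request_id, "wt": "json",
    }))
    waited = 0
    while waited < max_wait_seconds:
        status = fetch(collections_url(base_url, {
            "action": "REQUESTSTATUS", "requestid": request_id, "wt": "json",
        }))
        state = status.get("status", {}).get("state", "unknown")
        if state in ("completed", "failed", "notfound"):
            return state
        sleep(poll_seconds)
        waited += poll_seconds
    # In our case the task stays in a non-terminal state, so we end up here.
    return "timeout"
```

A call such as reload_async("http://localhost:8983/solr", "products", 
"reload-1") returns the final task state, or "timeout" if the task never 
leaves the running state within max_wait_seconds.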

Has anyone faced similar issues after upgrading to Solr 9? Could you please 
advise where we should focus our attention while debugging this behavior? 
Any other advice or suggestions? 

Thank you


Best regards,
Nick Vladiceanu
